Posted by u/hellectronic 6 years ago
Ask HN: What tools do you use for data munging and merging?
Hello,

What tools do you use or know about? For example, something like Talend Data Preparation (https://www.talend.com/products/data-preparation)?

I found OpenRefine (http://openrefine.org), which is browser-based like Talend Data Preparation.

It would be nice to have a "true" desktop application :)

hn12345 · 6 years ago
Looks like you're looking for ETL solutions. It's funny reading some of the replies here; you can tell who's coming from more of a BI background versus a software engineering background.

Anyway, I think Alteryx does this job really well. It's great for ETL that the average Joe can pick up and learn very quickly. Sure, the UX could be improved, but it's miles better than AWS's Data Pipeline and other tools. It is a bit costly, as another user mentioned, but well worth it IMO. Tableau has introduced some new products in the past year or two to compete in this space as well, so if you use Tableau for reporting, look into their new offerings. Dell also picked up an ETL tool a few years ago called Boomi; it showed some promise and might be something to consider, though I have no idea what it costs. Another option is SSIS if the data is eventually going to live in SQL Server or some other Microsoft database.

Ultimately, I would consider options based on your target database/environment.

mipmap04 · 6 years ago
I totally agree with your first sentence. As a former enterprise data warehouse architect, it's interesting seeing some of the responses. I'm surprised I'm not seeing more Informatica and SSIS. I'll also say SSIS is a good solution for nearly any data storage platform, and since it's included in a SQL Server license, it might be a good, cheap enterprise option if you already have SQL Server.
kmerrol · 6 years ago
In my experience, SSIS suffers mightily from the write-once, read-only problem, as well as from huge manual click-fests when package metadata changes. That, plus the lack of development from MSFT and their obvious push towards ADF (Azure Data Factory), means SSIS is not really great for a modern data architecture.

If you're an MSFT shop then sure, and even more so if you're willing to drop the box-and-line GUI and move to .NET assembly-based packages, but I've spent enough of my life keeping SSIS running and I'm not going back.

derision · 6 years ago
+1 for SSIS. Used it extensively in a previous position at a large Fortune 500 insurance company.
hellectronic · 6 years ago
I'm not looking for an ETL solution, but thanks for your reply; I use Pentaho Data Integration for ETL, and reading about other tools helps :)
cube2222 · 6 years ago
A project I'm using, which I'm also one of the authors of, is OctoSQL[1].

It allows you to transform and join data from different data sources (including databases like MySQL, PostgreSQL, and Redis, as well as CSV or JSON files, with more to come) using a familiar interface: plain SQL.

Other than that, I like Exploratory[2].

And for nested JSON data I use another project of mine, jql[3], an easier-to-use (in my opinion) lispy alternative to jq. (Originally inspired by a Hacker News comment, actually.)

[1]: https://github.com/cube2222/octosql

[2]: https://exploratory.io

[3]: https://github.com/cube2222/jql

hatmatrix · 6 years ago
Interesting project, and I like the name. It seems the underlying data model for each source has to be relational, though (even for JSON files). Is that a fundamental limitation of using SQL as the glue language, or are there plans to integrate hierarchical data models in some form as well?
cube2222 · 6 years ago
If you mean hierarchical data querying (like nested JSON, or JSON in SQL rows), then that's very high on our radar.
hellectronic · 6 years ago
OctoSQL sounds interesting, thanks.
westonsankey · 6 years ago
I try to avoid GUI-based tools as they generally don't lend themselves well to version control, code reuse, etc. There are several orchestration tools that allow you to create arbitrarily complex ETL pipelines using SQL, Python, Spark, etc. Here are links to a few, with a minimal Airflow sketch after the list:

http://airflow.apache.org/

https://www.prefect.io/

https://dagster.readthedocs.io/en/0.6.7/

https://www.getdbt.com/
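
To give a flavor of the code-first approach, here's a minimal sketch of an Airflow DAG (the DAG id and toy callables are made up, and import paths vary between Airflow versions):

    # Minimal sketch of a code-first ETL pipeline in Airflow (1.x-era imports).
    # The DAG id, task names, and callables are hypothetical stand-ins.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def extract(**context):
        # Stub: pull rows from some source system.
        return [{"id": 1, "amount": 42.0}]

    def transform(**context):
        # Pull the extract step's output from XCom and derive a new column.
        rows = context["ti"].xcom_pull(task_ids="extract")
        return [{**r, "amount_cents": int(r["amount"] * 100)} for r in rows]

    with DAG("toy_etl", start_date=datetime(2019, 1, 1),
             schedule_interval="@daily") as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract,
                                   provide_context=True)
        t_transform = PythonOperator(task_id="transform", python_callable=transform,
                                     provide_context=True)
        t_extract >> t_transform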

veritas3241 · 6 years ago
dbt is a fantastic tool. We rely on it for basically all of our transformations at GitLab.
pjot · 6 years ago
+1 for dbt.

Great tool with a really active community around it.

eyeball · 6 years ago
dbt looks awesome. Too bad there's no Oracle support.
drewbanin · 6 years ago
Hi there - I run product for dbt - happy to chat more about Oracle support if you're interested. There are a handful of community-supported dbt plugins out in the wild. Oracle could very well be one of the next ones. Check some of these out to see what's involved in building a database plugin:

- https://github.com/fishtown-analytics/dbt-spark

- https://github.com/fishtown-analytics/dbt-presto

- https://github.com/mikaelene/dbt-sqlserver

- https://github.com/jacobm001/dbt-mssql

TheTank · 6 years ago
QuestDB (https://www.questdb.io) might help. Import your file(s) by drag and drop directly into the web console. Once data is loaded, you can run SQL queries in the console to order, format, concatenate, join (including time-based joins), etc. Once finished, you can export your formatted results back to CSV. Also, it's pretty quick!
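
If you'd rather script it than use the web console, QuestDB also exposes an HTTP API. A rough sketch with Python's requests (endpoint names are from my reading of the docs, and the table is made up):

    # Rough sketch: querying QuestDB over its HTTP API.
    # Assumes a local instance on the default port 9000; 'trades' is a
    # made-up table, and /exec and /exp are the documented endpoints.
    import requests

    BASE = "http://localhost:9000"
    QUERY = "SELECT * FROM trades LIMIT 10"

    # Run a SQL query and get the result back as JSON.
    resp = requests.get(BASE + "/exec", params={"query": QUERY})
    resp.raise_for_status()
    print(resp.json())

    # Export the same result set as CSV.
    csv_resp = requests.get(BASE + "/exp", params={"query": QUERY})
    with open("trades.csv", "wb") as f:
        f.write(csv_resp.content)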
xn · 6 years ago
VisiData (http://visidata.org/) is my tool of choice these days.
ysr23 · 6 years ago
Yup, lots of love for VisiData. It's a nice, quick way to eyeball data and then munge (https://visidata.org/docs/join/) before proper carpentry with pandas.
hellectronic · 6 years ago
That looks interesting, thanks.
wodenokoto · 6 years ago
I really like R / the Tidyverse, but it requires more memory than your data size (especially if you use long, narrow data frames, as the Tidyverse recommends), and it also requires you to code instead of using a UI.
stuxnet79 · 6 years ago
Is there really no way of processing dataframes in chunks in the R ecosystem? I'm firmly in the Python camp, but I wonder how much I'm missing out on in the R world.
deadcaribou · 6 years ago
There are many ways to process larger than RAM objects in R.

The dbplyr package makes it possible to use a local database table as if it were an in-memory data frame (https://dbplyr.tidyverse.org/).

https://diskframe.com/ is also a pretty great solution; it processes data in chunks.

livingmargot · 6 years ago
Sorry to butt in, but could you point me to a resource on processing data in chunks using Python?
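
For context, the only pattern I know of is read_csv's chunksize argument, something like this sketch (file and column names made up):

    # Minimal sketch: processing a large CSV in chunks with pandas.
    # 'big.csv' and the 'amount' column are hypothetical stand-ins.
    import pandas as pd

    total = 0.0
    for chunk in pd.read_csv("big.csv", chunksize=100_000):
        # Each chunk is an ordinary DataFrame; aggregate incrementally
        # so the whole file never has to fit in memory.
        total += chunk["amount"].sum()

    print(total)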
2data222 · 6 years ago
We use Talend. It does have a gigantic Java-based Windows desktop application. It's pretty powerful and I don't hate it. We looked at MuleSoft as well, but it's not ready for enterprise prime time the way we need it to be.

The data preparation Wikipedia page mentions Talend by name along with Paxata, Trifacta, Alteryx, and Ataccama.

JPKab · 6 years ago
Alteryx is very good, especially if you want to hand off maintenance and operation of data processing flows to non-coders.
hprotagonist · 6 years ago
The Python ecosystem is really good here.

I do a lot of exploratory coding in ipython, though the threshold for "switch to a real editor and run git init and poetry new" is pretty low.

Want to munge CSV? stdlib, or pandas (https://pandas.pydata.org/pandas-docs/stable/)

Want to munge JSON? stdlib.

Want to munge a database? pandas, stdlib, anything that speaks SQLAlchemy.

Want to validate your JSON/SQL/CSV/whatever and have it come in as a bunch of structured classes, not std types? jsonschema (https://pypi.org/project/jsonschema/), attrs+marshmallow, or attrs+cattrs: http://www.attrs.org/en/stable/, https://desert.readthedocs.io/en/latest/, https://marshmallow.readthedocs.io/en/stable/, https://github.com/Tinche/cattrs

Want to transform CSV into SQLite? https://github.com/simonw/csvs-to-sqlite

Want to serve SQLite as a REST API? https://datasette.readthedocs.io/en/stable/

Want to stuff simple things into a database really fast? https://dataset.readthedocs.io/en/latest/

Want to flip numerical data around? NumPy (https://docs.scipy.org/doc/numpy/reference/)

Want to model it? SciPy (https://www.scipy.org/docs.html), pandas

Want to plot it? seaborn (https://seaborn.pydata.org/), plotnine (https://plotnine.readthedocs.io/en/stable/)

Want to futz around with structured data in a sensible way? glom (https://glom.readthedocs.io/en/latest/), python-lenses (https://github.com/ingolemo/python-lenses)

Want to spit out tabular data? tabulate (https://github.com/astanin/python-tabulate)

Want to figure out where not to eat in Chicago? Built-ins: Counter, defaultdict, comprehensions: https://www.youtube.com/watch?v=lyDLAutA88s

There's a LOT you can do pretty fast, and I can more or less hammer out a basic ETL CLI script in my sleep at this point.
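
A basic one is little more than this sketch (file, column, and table names made up):

    # Minimal sketch of a CSV -> clean -> SQLite ETL script.
    # File, column, and table names are hypothetical stand-ins.
    import sqlite3

    import pandas as pd

    def main() -> None:
        # Extract: read the raw CSV.
        df = pd.read_csv("raw_orders.csv")

        # Transform: normalize column names and drop rows missing a key.
        df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
        df = df.dropna(subset=["order_id"])

        # Load: write the cleaned frame into SQLite.
        conn = sqlite3.connect("orders.db")
        df.to_sql("orders", conn, if_exists="replace", index=False)
        conn.close()

    if __name__ == "__main__":
        main()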

intrepidhero · 6 years ago
Sweet. Thanks for the links to seaborn and plotnine. I hadn't seen those before.

I do a lot of automation with csv, yaml, openpyxl, and jinja2, wrapped in a simple GUI (Qt, Gtk, or Tk, pick your poison) so my non-programmer colleagues can quickly run it and get a PDF report generated with reportlab.

billfruit · 6 years ago
I think it's not that rich when dealing with binary data.
hprotagonist · 6 years ago
FSVO 'binary', I think I agree. I've written my fair share of struct.unpack-heavy code, and it's just tedious.

I also think that __str__ and __repr__ behavior on bytes is misleading at best. I do not like getting ASCII values for some bytes and not others... and iterating to get base-10 ints!?

Pillow[simd], soundfile, and imageio can do nice things for image and sound data.
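
For reference, the struct.unpack-heavy style I mean looks roughly like this (the record layout is made up for illustration):

    # Reading fixed-size binary records the struct.unpack way.
    # The layout (little-endian uint32 id + float64 value) is made up.
    import struct

    RECORD = struct.Struct("<Id")  # 4-byte id + 8-byte double

    def read_records(path):
        with open(path, "rb") as f:
            while True:
                raw = f.read(RECORD.size)
                if len(raw) < RECORD.size:
                    break
                yield RECORD.unpack(raw)  # (rec_id, value)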

hellectronic · 6 years ago
Thank you for the list.