Posted by u/karakanb 2 years ago
Show HN: I built an open-source data copy tool called ingestr (github.com/bruin-data/ing...)
Hi there, Burak here. I built an open-source data copy tool called ingestr (https://github.com/bruin-data/ingestr)

I have built quite a few data warehouses, both for the companies I worked at and for consultancy projects. One of the more common pain points I observed was that everyone had to rebuild the same data ingestion bits over and over again, each in a different way:

- some wrote ingestion code from scratch, to varying degrees

- some used off-the-shelf data ingestion tools like Fivetran / Airbyte

I have always disliked both of these approaches, for different reasons, but never got around to working on what I'd imagine to be the better way forward.

The solutions that required writing code for copying the data had quite a bit of overhead such as how to generalize them, what language/library to use, where to deploy, how to monitor, how to schedule, etc. I ended up figuring out solutions for each of these matters, but the process always felt suboptimal. I like coding but for more novel stuff rather than trying to copy a table from Postgres to BigQuery. There are libraries like dlt (awesome lib btw, and awesome folks!) but that still required me to write, deploy, and maintain the code.

Then there are solutions like Fivetran or Airbyte, where there's a UI and everything is managed through there. While it is nice that I didn't have to write code for copying the data, I still had to either pay some unknown/hard-to-predict amount of money to these vendors or host Airbyte myself, which is roughly back to square one (for me, since I want to maintain the least amount of tech myself). Nothing was versioned, people were changing things in the UI and breaking the connectors, and what worked yesterday didn't work today.

I had a bit of spare time a couple of weeks ago and I wanted to take a stab at the problem. I have been thinking of standardizing the process for quite some time already, and dlt had some abstractions that allowed me to quickly prototype a CLI that copies data from one place to another. I made a few decisions (that I hope I won't regret in the future):

- everything is a URI: every source and every destination is represented as a URI

- there can be only one thing copied at a time: it'll copy only a single table within a single command, not a full database with an unknown number of tables

- incremental loading is a must, but doesn't have to be super flexible: I decided to support full-refresh, append-only, merge, and delete+insert incremental strategies, because I believe this covers 90% of the use-cases out there.

- it is CLI-only, and can be configured with flags & env variables so that it can be automated quickly, e.g. drop it into GitHub Actions and run it daily.
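The "everything is a URI" decision means a copy job is fully described by two connection strings plus table names. A minimal sketch of dispatching on the URI scheme — the handler names and mapping here are purely illustrative, not ingestr's actual internals:

```python
from urllib.parse import urlparse

# Illustrative scheme-to-handler map; these names are hypothetical,
# not taken from ingestr's codebase.
HANDLERS = {
    "postgresql": "postgres reader",
    "bigquery": "bigquery writer",
    "csv": "csv writer",
}

def pick_handler(uri: str) -> str:
    """Choose a handler based solely on the URI scheme."""
    scheme = urlparse(uri).scheme
    if scheme not in HANDLERS:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    return HANDLERS[scheme]

print(pick_handler("postgresql://user:pass@localhost:5432/mydb"))  # prints "postgres reader"
```

The nice property of this approach is that adding a new source or destination is just a new scheme in the map, and the CLI surface doesn't change.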

The result ended up being `ingestr` (https://github.com/bruin-data/ingestr).
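The four load strategies mentioned above can be modeled on plain rows. This is a toy sketch of the semantics only — not how ingestr or dlt implement them — with "key" standing in for the incremental/primary key column:

```python
# Toy model of the four incremental strategies on lists of dict rows.

def full_refresh(existing, incoming):
    # drop everything, keep only the new batch
    return list(incoming)

def append_only(existing, incoming):
    # keep history, add the new batch at the end
    return existing + list(incoming)

def merge(existing, incoming):
    # upsert: incoming rows win on key collisions
    by_key = {r["key"]: r for r in existing}
    by_key.update({r["key"]: r for r in incoming})
    return list(by_key.values())

def delete_insert(existing, incoming):
    # delete the keys present in the batch, then insert the batch
    incoming_keys = {r["key"] for r in incoming}
    kept = [r for r in existing if r["key"] not in incoming_keys]
    return kept + list(incoming)

old = [{"key": 1, "v": "a"}, {"key": 2, "v": "b"}]
new = [{"key": 2, "v": "B"}, {"key": 3, "v": "c"}]
print(merge(old, new))
```

Note that merge and delete+insert coincide on this tiny example; they diverge once the batch carries multiple versions of the same key or the destination tracks soft deletes.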

I am pretty happy with how the first version turned out, and I plan to add support for more sources & destinations. ingestr is built to be flexible with various source and destination combinations, and I plan to introduce more non-DB sources such as Notion, GSheets, and custom APIs that return JSON (which I'm not sure exactly how I'll do yet, but I'm open to suggestions!).

To be perfectly clear: I don't think ingestr covers 100% of data ingestion/copying needs out there, and it doesn't aim to. My goal with it is to cover the most common scenarios with a decent set of trade-offs, so that they can be solved easily without having to write code or manage infra. There will be more complex needs that require dedicated engineering effort, and that's fine.

I'd love to hear your feedback on how ingestr can serve data-copying needs better. Looking forward to hearing your thoughts!

Best, Burak

simonw · 2 years ago
I was surprised to see SQLite listed as a source but not as a destination. Any big reasons for that or is it just something you haven't got around to implementing yet?

I've been getting a huge amount of useful work done over the past few years sucking data from other systems into SQLite files on my own computer - I even have my own small db-to-sqlite tool for this (built on top of SQLAlchemy) - https://github.com/simonw/db-to-sqlite

karakanb · 2 years ago
I use the dlt library to support as many sources & destinations as possible, and it does not support SQLite as of today. I am interested in supporting SQLite simply because I love it as well, so that's definitely on the roadmap.

db-to-sqlite looks lovely, I'll see if I can learn a thing or two from it!

mritchie712 · 2 years ago
looks like dlt (which this is a wrapper around) doesn't support it as a destination

https://dlthub.com/docs/dlt-ecosystem/destinations/

MatthausK · 2 years ago
one of the dltHub founders here - we aim to address this in the coming weeks
xk3 · 2 years ago
I used sqlite-utils to create a tool that can merge SQLITE files and split them:

https://github.com/chapmanjacobd/library?tab=readme-ov-file#...

yevpats · 2 years ago
Firstly, congrats :) (Generalized) ingestion is a very hard problem because any abstraction you come up with will always have some limitations, where you might need to fall back to writing code with full access to the 3rd-party APIs. But in some cases generalized ingestion is definitely much better than rewriting the same ingestion piece, especially for complex connectors. Take a look at CloudQuery (https://github.com/cloudquery/cloudquery), an open-source, high-performance ELT framework powered by Apache Arrow (so you can write plugins in any language). (Maintainer here)
karakanb · 2 years ago
couldn't agree more! I see ingestr more as a common-scenario solution rather than a general solution that solves all cases, kinda like how I reach for shell one-liners instead of writing an application in another language. I guess there's space for both approaches.

I'll definitely take a look at CloudQuery, thanks a lot for sharing!

sascjm · 2 years ago
Hi Burak. I have been testing ingestr with a source and destination Postgres database. What I'm trying to do is copy data from my Prod database to my Test database. I find that when using replace I get additional dlt columns added to the tables as hints. It also does not work with a defined primary key, only natural keys, and composite keys do not work. Can you tell me the basic minimum it supports? I would love to use it to keep our Prod and Test databases in sync, but it appears that the functionality I need is not there. Thanks very much.
karakanb · 2 years ago
Hi there, thanks a lot for your comment and for trying it out. Do you mind joining our Slack community via the link in the readme or creating a GitHub issue so that we can dive into this? I'd love to understand what doesn't work and provide fixes.
matijash · 2 years ago
This looks pretty cool! What was the hardest part about building this?
karakanb · 2 years ago
hey, thanks!

I guess there were a couple of things that I found as tricky:

- deciding on the right way to represent sources and destinations was hard; before landing on URIs I thought of using config files, but that'd add extra complexity

- the platforms had different quirks concerning different data types

- dlt stores state on its own, which means that re-runs don't start from scratch after changing the incremental strategy and require a full refresh; it took me quite some time to figure out how exactly to work with it

I think among these the hardest part was to get myself to build and release it, because I had it in my mind for a long time and it took me a _long while_ to build and share it :)

kipukun · 2 years ago
Do you think you'll add local file support in the future? Also, do you have any plans on making the reading of a source parallel? For example, connectorx uses an optional partition column to read chunks of a table concurrently. Cool how it's abstracted.
karakanb · 2 years ago
I have just released v0.1.2 which supports CSV destinations with the URI format `csv://path/to/file.csv`, hope that's helpful!
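One small gotcha with path-style URIs like `csv://path/to/file.csv` — this is an observation about how standard URL parsers behave generally, not a claim about how ingestr handles it internally: the first path segment gets parsed as a host, so recovering a relative filesystem path means gluing two pieces back together.

```python
from urllib.parse import urlparse

u = urlparse("csv://path/to/file.csv")
# urlparse reads "path" as the netloc (host) and "/to/file.csv" as the
# path, so a relative filesystem path needs both pieces rejoined.
relative_path = u.netloc + u.path
print(relative_path)  # path/to/file.csv
```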
karakanb · 2 years ago
I am working on file support right now as a destination to begin with. I believe I should get local files as well as S3-compatible sources going by tonight.

Reading the sources in parallel is an interesting idea, I'll definitely take a look at it. ingestr supports incremental loads by a partitioned column, but there's no parallelized partition reading at the moment.

Thanks a lot for your comment!

adawg4 · 2 years ago
I second this!
e12e · 2 years ago
Looks interesting. Clickhouse seems to be conspicuously missing as source and destination. Although I suppose clickhouse can masquerade as postgres: https://clickhouse.com/docs/en/interfaces/postgresql

Edit: there's an issue already: https://github.com/bruin-data/ingestr/issues/1

hermitcrab · 2 years ago
I am very interested in data ingestion. I develop a desktop data wrangling tool in C++ ( Easy Data Transform ). So far it can import files in various formats (CSV, Excel, JSON, XML etc). But I am interested in being able to import from databases, APIs and other sources. Would I be able to ship your CLI as part of my product on Windows and Mac? Or can someone suggest some other approach to importing from lots of data sources without coding them all individually?
karakanb · 2 years ago
hmm, that's an interesting question, I don't know the answer to be honest. are you able to run external scripts on the device? if so, you might be able to install & run ingestr with a CSV destination (which I released literally 2 mins ago), but that seems like a lot of work as well, and will probably be way slower than your C++ application.

Maybe someone else has another idea?

hermitcrab · 2 years ago
I can start a CLI as a separate process. But ingesting to CSV and then reading the CSV would be slow. Maybe it would be better to ingest into DuckDB or in memory in Arrow memory format. If anyone has any other suggestions, I am all ears.
jrhizor · 2 years ago
I like the idea of encoding complex connector configs into URIs!
javajosh · 2 years ago
Perhaps OP re-invented it, but it's been around for a long time in the Java world via JDBC URLs. See, for example, this writeup: https://www.baeldung.com/java-jdbc-url-format
karakanb · 2 years ago
I don't think I invented anything tbh, I just relied on SQLAlchemy's URI formats, and I decided to abuse it slightly for even more config.
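Stuffing extra config into the query string is a natural extension of SQLAlchemy-style URIs, and it parses cleanly with the stdlib. A sketch — `sslmode` here is just a familiar example of a connection option carried in the query string, not a documented ingestr flag:

```python
from urllib.parse import urlparse, parse_qs

uri = "postgresql://user:pass@localhost:5432/mydb?sslmode=require"
parts = urlparse(uri)
# Extra config rides along in the query string without breaking
# the scheme/host/database parts of the URI.
options = parse_qs(parts.query)

print(parts.scheme, parts.hostname, parts.path.lstrip("/"))  # postgresql localhost mydb
print(options["sslmode"][0])  # require
```

The upside of this trick is that one string still fully describes a connection, so it stays easy to pass as a CLI flag or env variable.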
karakanb · 2 years ago
Glad to hear that! I am not 100% sure if it’ll look pretty for all platforms but I hope it’ll be an okay base to get started!