Readit News
thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
johhns4 · a year ago
Do please say when this will work, as this could make my workflows a lot easier to visualize and work with.
thibautdr · a year ago
I sure will!
thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
gregw2 · a year ago
Isn’t pandas-centric ETL much more memory-intensive and less compute-efficient than using SQL?
thibautdr · a year ago
I wrote an article questioning the use of Pandas for ETL. I invite you to read it: https://medium.com/@thibaut_gourdel/should-you-use-pandas-fo...
thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
Kalanos · a year ago
Reminds me of Elyra
thibautdr · a year ago
Yes, there are similarities, but Elyra allows you to develop orchestration pipelines for Python scripts and notebooks, so you still have to write your own code. With Amphi, you design your data pipelines using a graphical interface, and it generates the Python code to execute. Hope that helps.
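For illustration, the code generated from a simple canvas pipeline might look something like the pandas sketch below. This is a hypothetical example (the file setup here just stands in for real source data), not Amphi's actual output:

```python
import os
import tempfile

import pandas as pd

# Hypothetical generated code for a three-component pipeline:
# "CSV file input" -> "filter rows" -> "CSV file output".
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "input.csv")
    dst = os.path.join(d, "output.csv")
    # Stand-in for real source data.
    pd.DataFrame({"name": ["a", "b", "c"], "amount": [50, 150, 300]}).to_csv(src, index=False)

    df = pd.read_csv(src)            # input component
    df = df[df["amount"] > 100]      # filter component
    df.to_csv(dst, index=False)      # output component

    out = pd.read_csv(dst)
    print(list(out["amount"]))       # rows surviving the filter
```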
thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
Kalanos · a year ago
what about dask?
thibautdr · a year ago
Using Modin, deploying the pandas code on Dask should be possible: https://modin.readthedocs.io/en/stable/development/using_pan...
thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
Joeboy · a year ago
Since there are "ETL" people here, I have a couple of naive questions, in case anybody can answer:

1) Are there any "standard"-ish (or popular-ish) file formats for node-based / low-code pipelines?

2) Is there any such format that's also reasonably human readable / writable?

3) Are there low-code ETL apps that (can) run in the browser, probably using WASM?

Thanks and sorry if these are dumb questions.

thibautdr · a year ago
Thanks for the great questions:

1. As far as I know, there isn't a "standard" file format for low-code pipelines.

2. Some formats are more readable than others. YAML, for example, is quite readable. However, it's often a tradeoff: the more abstracted it is, the less control you have.

3. Funny you ask: I actually tried to make Amphi run in the browser with WASM. I think it's still too early in terms of both performance and browser limitations. Performance will likely improve soon, but browser limitations currently prevent the use of sockets, which are indispensable for database connections, for example.
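To illustrate point 2, a declarative pipeline serialized as YAML might look like the fragment below. This is a hypothetical format invented for illustration, not a standard and not Amphi's own:

```yaml
# Hypothetical low-code pipeline definition
pipeline:
  name: orders_cleanup
  steps:
    - id: read_orders
      type: csv_input
      path: orders.csv
    - id: keep_large
      type: filter
      condition: "amount > 100"
      input: read_orders
    - id: write_out
      type: csv_output
      path: filtered.csv
      input: keep_large
```

A format like this is readable and diffable, but as noted above, the more the steps are abstracted, the less fine-grained control you have over the generated code.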

thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
paulvnickerson · a year ago
Very cool, thanks for sharing. Does it support the pandas-like rapidsai dask_cudf framework? (https://docs.rapids.ai/api/dask-cudf/stable/)
thibautdr · a year ago
Great, thanks for sharing. I was familiar with Dask and cuDF separately but not this one. I was planning to implement Dask support through Modin, but I'll definitely take a look at dask_cudf.
thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
johhns4 · a year ago
Wow, amazing work! How do the inputs work? Are they created for you, or does it support custom ones as well?
thibautdr · a year ago
Thank you! Input components are pre-built for now, but the ability to add custom inputs is coming soon!
thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
mritchie712 · a year ago
If you're looking for "open source Python ETL", two things that are better options:

https://dlthub.com/

https://hub.meltano.com/

we[0] use meltano in production and I'm happy with it. I've played around with dlt and it's great, just not a ton of sources yet.

0 - https://www.definite.app/

thibautdr · a year ago
Hey, Amphi's developer here. Those two tools are great, big fan of dlt myself :)

However, Amphi is a low-code solution while those two are code-based. Also, those two focus on the ingestion part (EL), while Amphi focuses on broader ETL use cases (file integration, data preparation, AI pipelines).

thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
C4stor · a year ago
It's a good idea, but from the docs it looks like the high level abstractions are wrong.

If my data pipeline is "take this table, filter it, output it", I really don't want to use a "CSV file input" or an "Excel file output".

I want to say "for anything in the pipeline that behaves like a table, apply this transformation to it", so that I can swap my storage later without touching the pipeline.

Same thing for output. Personally, I want to say "this goes to a file" at the pipeline level, and the details of the serialization should be changeable instantly.

That being said, I can't complain about a free tool. Kudos on making it available!

thibautdr · a year ago
Hey, I'm not sure I get your point here. I believe the abstraction provides what you're describing: you can swap a file input for a table input without touching the rest of the components (provided there are no major structural changes). Let me know what you meant :)
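The pattern C4stor describes can be sketched generically in plain pandas. The helper names below are illustrative, not Amphi's API: the transformation only sees "something table-like", while the storage-specific details live in swappable reader functions.

```python
import io

import pandas as pd

# Hypothetical registry mapping a storage kind to a reader function.
# Swapping storage means changing the config, not the transformation.
READERS = {
    "csv": pd.read_csv,
    "json": pd.read_json,
}

def run_pipeline(source_kind, source, transform):
    df = READERS[source_kind](source)   # storage-specific part
    return transform(df)                # storage-agnostic part

# The transformation never mentions CSV, Excel, or any other format.
keep_large = lambda df: df[df["amount"] > 100]

csv_data = io.StringIO("amount\n50\n150\n300\n")
out = run_pipeline("csv", csv_data, keep_large)
print(list(out["amount"]))
```

Switching the pipeline from CSV to JSON would then mean changing only the `source_kind` argument and the source itself, leaving `keep_large` untouched.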
