Readit News
thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
johhns4 · a year ago
Do please say when this will work, as this could make my workflows a lot easier to visualize and work with.
thibautdr · a year ago
I sure will!
thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
gregw2 · a year ago
Isn’t pandas-centric ETL much more memory-intensive and less compute-efficient than using SQL?
thibautdr · a year ago
I wrote an article questioning the use of Pandas for ETL. I invite you to read it: https://medium.com/@thibaut_gourdel/should-you-use-pandas-fo...
thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
Kalanos · a year ago
Reminds me of Elyra
thibautdr · a year ago
Yes, there are similarities, but Elyra allows you to develop orchestration pipelines for Python scripts and notebooks, so you still have to write your own code. With Amphi, you design your data pipelines using a graphical interface, and it generates the Python code to execute. Hope that helps.
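For illustration, the code generated from a simple canvas pipeline might look something like the pandas sketch below. This is a hypothetical example (the file setup here just stands in for real source data), not Amphi's actual output:

```python
import os
import tempfile

import pandas as pd

# Hypothetical generated code for a three-component pipeline:
# "CSV file input" -> "filter rows" -> "CSV file output".
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "input.csv")
    dst = os.path.join(d, "output.csv")
    # Stand-in for real source data.
    pd.DataFrame({"name": ["a", "b", "c"], "amount": [50, 150, 300]}).to_csv(src, index=False)

    df = pd.read_csv(src)            # input component
    df = df[df["amount"] > 100]      # filter component
    df.to_csv(dst, index=False)      # output component

    out = pd.read_csv(dst)
    print(list(out["amount"]))       # rows surviving the filter
```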
thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
Kalanos · a year ago
what about dask?
thibautdr · a year ago
Using Modin, deploying the pandas code on Dask should be possible: https://modin.readthedocs.io/en/stable/development/using_pan...
thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
Joeboy · a year ago
Since there are "ETL" people here, I have a couple of naive questions, in case anybody can answer:

1) Are there any "standard"-ish (or popular-ish) file formats for node-based / low-code pipelines?

2) Is there any such format that's also reasonably human readable / writable?

3) Are there low-code ETL apps that (can) run in the browser, probably using WASM?

Thanks and sorry if these are dumb questions.

thibautdr · a year ago
Thanks for the great questions:

1. As far as I know, there isn't a "standard" file format for low-code pipelines.

2. Some formats are more readable than others. YAML, for example, is quite readable. However, it's often a tradeoff: the more abstracted it is, the less control you have.

3. Funny you ask: I actually tried to make Amphi run in the browser with WASM. I think it's still too early in terms of both performance and browser limitations. Performance will likely improve soon, but browser limitations currently prevent the use of sockets, which are indispensable for database connections, for example.
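To illustrate point 2, a declarative pipeline serialized as YAML might look like the fragment below. This is a hypothetical format invented for illustration, not a standard and not Amphi's own:

```yaml
# Hypothetical low-code pipeline definition
pipeline:
  name: orders_cleanup
  steps:
    - id: read_orders
      type: csv_input
      path: orders.csv
    - id: keep_large
      type: filter
      condition: "amount > 100"
      input: read_orders
    - id: write_out
      type: csv_output
      path: filtered.csv
      input: keep_large
```

A format like this is readable and diffable, but as noted above, the more the steps are abstracted, the less fine-grained control you have over the generated code.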

thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
paulvnickerson · a year ago
Very cool, thanks for sharing. Does it support the pandas-like rapidsai dask_cudf framework? (https://docs.rapids.ai/api/dask-cudf/stable/)
thibautdr · a year ago
Great, thanks for sharing. I was familiar with Dask and cuDF separately but not this one. I was planning to implement Dask support through Modin, but I'll definitely take a look at dask_cudf.
thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
johhns4 · a year ago
Wow, amazing work! How do the inputs work? Are they created for you, or does it support custom ones as well?
thibautdr · a year ago
Thank you! Input components are pre-built for now, but the ability to add custom inputs is coming soon!
thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
mritchie712 · a year ago
If you're looking for "open source Python ETL", two things that are better options:

https://dlthub.com/

https://hub.meltano.com/

we[0] use meltano in production and I'm happy with it. I've played around with dlt and it's great, just not a ton of sources yet.

0 - https://www.definite.app/

thibautdr · a year ago
Hey, Amphi's developer here. Those two tools are great, big fan of dlt myself :)

However, Amphi is a low-code solution while those two are code-based. Also, those two focus on the ingestion part (EL), while Amphi focuses on broader ETL use cases (file integration, data preparation, AI pipelines).

thibautdr commented on Open Source Python ETL   amphi.ai/... · Posted by u/justjico
C4stor · a year ago
It's a good idea, but from the docs it looks like the high level abstractions are wrong.

If my data pipeline is "take this table, filter it, output it", I really don't want to use a "CSV file input" or an "Excel file output".

I want to say "for anything in the pipeline that behaves like a table, apply this transformation to it", so that I can swap my storage later without touching the pipeline.

Same thing for output. Personally, I want to say "this goes to a file" at the pipeline level, and the details of the serialization should be changeable instantly.

That being said, I can't complain about a free tool. Kudos on making it available!

thibautdr · a year ago
Hey, I'm not sure I get your point here. I believe the abstraction provides what you're describing: you can swap a file input for a table input without touching the rest of the components (provided there are no major structural changes). Let me know what you meant :)
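The pattern C4stor describes can be sketched generically in plain pandas. The helper names below are illustrative, not Amphi's API: the transformation only sees "something table-like", while the storage-specific details live in swappable reader functions.

```python
import io

import pandas as pd

# Hypothetical registry mapping a storage kind to a reader function.
# Swapping storage means changing the config, not the transformation.
READERS = {
    "csv": pd.read_csv,
    "json": pd.read_json,
}

def run_pipeline(source_kind, source, transform):
    df = READERS[source_kind](source)   # storage-specific part
    return transform(df)                # storage-agnostic part

# The transformation never mentions CSV, Excel, or any other format.
keep_large = lambda df: df[df["amount"] > 100]

csv_data = io.StringIO("amount\n50\n150\n300\n")
out = run_pipeline("csv", csv_data, keep_large)
print(list(out["amount"]))
```

Switching the pipeline from CSV to JSON would then mean changing only the `source_kind` argument and the source itself, leaving `keep_large` untouched.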
