Open Source Python ETL

Hi everyone, thanks for posting Amphi :)

To give some context, Amphi is a low-code ETL tool for both structured and unstructured data. The key use cases include file integration, data preparation, data migration, and creating data pipelines for AI tasks like data extraction and RAG. What sets it apart from traditional ETL tools is that it generates Python code that you own and can deploy anywhere. Amphi is available as a standalone web app or as a JupyterLab extension.

Visit the GitHub: https://github.com/amphi-ai/amphi-etl

Give it a try and let me know what you think.

OutOfHere · a year ago

You know what does not set it apart? AI-washing. Also, lying about being open source when it isn't.

isjamesalive · a year ago

To be fair, the only place the words ‘open’ and ‘source’ appear in the readme is once, in a sub-heading, where it’s phrased ‘open-source’. It’s clearly labelled ELv2.

Possibly more of a subtle miscommunication or misunderstanding than a deliberate lie.

slt2021 · a year ago

i liked the idea of leveraging jupyterlab as the server. data engineers/scientists already use jupyter, so this is a neat idea.

a custom extension for jupyterlab is a great way to leverage the existing jupyterlab install base: not everyone will be willing to jump through hoops to install software X, but installing an extension is one pip install away and there is no need to run a separate process, since you are running inside the jupyterlab server.

this reminds me of Alteryx (another drag and drop ETL tool)

thibautdr · a year ago

Thanks! Being based on JupyterLab also allows Amphi to benefit from the vast ecosystem of extensions already available, such as the Git extension or using different file systems (S3).

Some users pointed out they were Alteryx users but liked the Python code generation from Amphi :)

johhns4 · a year ago

Wow, amazing work! How do the inputs work? Are they created for you, or does it support custom ones as well?

thibautdr · a year ago

Thank you! Input components are pre-built for now, but the ability to add custom inputs is coming soon!

With all the data issues around quality and normalisation, I often get the impression that enabling more people with non-CS backgrounds to do this work is not necessarily a good thing.

In other words, if writing python and sql is the skill requirement that stops you from making an etl pipeline, maybe do something else.

otter-in-a-suit · a year ago

While I’m not usually on board with gatekeeping, this field already struggles with a huge amount of very non-technical folks and their respective managers, producing overall mediocre results and giving the profession a bad rep, to the point where I now avoid the Data Engineer title and just call it “SWE specialized in large data processing” or something equally as fluffy.

For me it’s more accurate, too. At $work, there’s no difference to how an SWE vs a “DE” works. Same interview process too, DSA, distributed systems etc.

However, having done this for more than a decade, that is relatively rare. It’s usually a mix of GUI tools with zero reproducibility / infra-as-code, untyped python, copy pasted shell scripts, zero tests, zero ci/cd, no linting/static analysis/code reviews etc., paired with generally zero understanding of the underlying tech. It’s all very formulaic with little to no actual understanding.

I will spare you my usual rant on why a language without a solid type system like python is a horrible idea for this field, too.

Which is why I much appreciate dbt. While some people scoff at the idea of “SQL with jinja templating”, their approach has certainly helped to move DE closer to SWE work, purely by virtue of their value prop mostly being exactly that. And it works out great.

hermitcrab · a year ago

So if Bob from accounts needs a new report generating, he has to wait 6 months for an IT guy to do it? Who probably won't do a very good job, because he doesn't understand what Bob needs as well as Bob does? Bob is going to hack something horrific together in Excel instead. Better surely to let Bob have a GUI point and click tool more appropriate to the job?

morkalork · a year ago

On the other hand, Bob keeps asking for a self-serve reporting tool but in my experience, he doesn't actually want to use it. We went the route of putting all the data into a lake and hooking up GUI reporting tools to it and what did we get? Bob doesn't understand this or that column, Bob made a report that is fundamentally flawed, Bob sent a request for the engineers to make him a report using the tool and so on. Bob wanted something, or someone else, to do the work for him. When it became apparent that the tool isn't magic and can neither read minds nor divine the true meaning of data in the DB, it became the engineer's problem again. So why not let engineers use the tools they prefer?

pelasaco · a year ago

Bob doesn't write a specification, because Bob doesn't know what he wants either. He will have to explore and try things out until he reaches something that he can work with. Nobody is willing to spend time planning and documenting stuff. Everyone feels one YouTube video away from being an expert in software development.

pm90 · a year ago

With this argument, Computer Science wouldn’t have progressed beyond assemblers.

chasd00 · a year ago

I sort of see your point. if you’re not willing to at least try to learn something new then, yeah, probably better off doing something else.

anakaine · a year ago

This is elitist and frankly, unhelpful. The answer to a skills shortage is not a practitioner lockdown, but policy, training, guidance and mentoring. If you're stuck in start up land and you have this issue, you have hired the wrong skills. If you're encountering this in enterprise land, your organisation, and potentially you depending on your position of influence, should be angling to improve compliance and literacy not through obstruction but through policy and upskilling. Failing to do so will kill your ability to innovate.

necovek · a year ago

FWIW, while I disagree with the parent comment, I don't see you arguing against it.

They actually implied that you should try upskilling first — but if that fails, you shouldn't be doing ETL yourself.

I mostly disagree with the parent comment because there's so many things one can easily do up to a level, and then when the going gets tough, you need to call in an expert. Eg. most people can operate a screwdriver or impact driver to fix things, but to fix some problems, you really need a trained technician (or well, an experienced DIY person, but that's not everybody).

The fact that you are not strong enough to screw in an M14 bolt does not mean you should be forbidden from using an impact driver: tools are there to help you. The logic of the parent comment was seemingly that if you are not strong enough to tighten an M14 bolt, you probably don't know what you are doing regardless of the type of the bolt you are tightening, so you should simply not do it.

The point I agree with in a parent comment is that not everybody can achieve a similar level of proficiency: while upskilling and improving/simplifying tools can get you most of the way there, there's always going to be that extra bit that requires a sudden, sharp jump in knowledge, smartness or experience to be able to deal with it.

itsoktocry · a year ago

>This is elitist and frankly, unhelpful. The answer to a skills shortage is not a practitioner lockdown, but policy, training, guidance and mentoring.

I think the point is that these tools have their own learning curve, and non-tech business people are not doing it well, either; how much different is it from learning SQL? Which one is more broadly valuable and transferable as a skill?

If this is the career you want (data or data-adjacent), why not just learn SQL? There are far more learning resources and the value of the knowledge will assuredly outlast any low-code tool.

pydry · a year ago

Skills shortage?

ic_fly2 · a year ago

jamesblonde · a year ago

#dang The title needs changing - it's not open-source, it is license ELv2 - Elastic License v2.

maleldil · a year ago

While you're right (it's indeed not open-source), the project advertises itself as such, so the title is "right", even if it's a lie.

lma21 · a year ago

isn't the code available here? https://github.com/amphi-ai/amphi-etl

what makes it not OSS?

uneekname · a year ago

That is source available, not open source. The term "open-source" is widely used to describe software that is licensed using a specific set of software licenses that grant certain freedoms to users. You can read more here[0]

[0] https://opensource.org/licenses

mrtranscendence · a year ago

It's open source if you're using language like a normal human being. If you're a bit of a pedant and wish everyone to adhere to definitions imposed from on high regardless of real-world usage, it's absolutely not open source.

mritchie712 · a year ago

If you're looking for "open source Python ETL", two things that are better options:

https://dlthub.com/

https://hub.meltano.com/

we[0] use meltano in production and I'm happy with it. I've played around with dlt and it's great, just not a ton of sources yet.

0 - https://www.definite.app/

Hey, Amphi's developer here. Those two tools are great, big fan of dlt myself :)

However, Amphi is a low-code solution while those two are code-based. Also, those two focus on the ingestion part (EL) while Amphi is focusing on different ETL use-cases (file integration, data preparation, AI pipelines).

I understand that. I'd change the title / H1 though, "Open Source Python ETL" doesn't describe what you're building very well.

Good luck! Looks cool.

jdnier · a year ago

Are you able to describe what makes them better? (Honest question, I'm not familiar with either or with Amphi.)

It seems Definite's use case is focused on connecting to lots of data sources. For much smaller scale, how does Amphi compare?

most data engineers would think of something like Fivetran when you say "ETL" (look at the ETL section here[0]).

It looks like Amphi could handle some low code transformations (the "T" in ETL), but calling it ETL feels like a stretch.

So to rephrase a bit, if you're looking for an open source, Python-based Fivetran alternative, dlt and meltano would be my picks.

0 - https://mattturck.com/landscape/mad2024.pdf

awesomebytes · a year ago

I was not familiar with the acronym ETL and it is not explained anywhere on the website! My feedback would be to at least spell it out once, at the first instance, so others like me will know what they are reading :)

fuzztester · a year ago

Others already replied about what ETL is.

Wikipedia:

https://en.m.wikipedia.org/wiki/Extract,_transform,_load

I'll just add:

It is a common term and practice among enterprise software users, i.e. generally medium or large companies that use packaged plus custom software for their business needs.

ETL is not common among startups, because they have a different focus, infrastructure and scale.

m463 · a year ago

In computing, extract, transform, load (ETL) is a three-phase process where data is extracted from an input source, transformed (including cleaning), and loaded into an output data container. The data can be collated from one or more sources and it can also be output to one or more destinations.

https://en.wikipedia.org/wiki/Extract,_transform,_load
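
For anyone who finds a concrete case easier than the definition, here is a minimal sketch of the three phases using only the Python standard library. It is not what Amphi generates; the sample CSV, the `payments` table, and the function names are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical input: a small in-memory CSV so the sketch is self-contained.
RAW_CSV = """name,amount
 alice ,10.5
bob,
Carol,7
"""


def extract(text):
    """Extract: read rows from the CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names, drop rows with no amount, cast amounts to float."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # cleaning step: skip incomplete records
        cleaned.append((row["name"].strip().title(), float(amount)))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into the output container (a SQLite table)."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()


def run_pipeline(text, conn):
    """Chain the three phases and return what ended up in the destination."""
    load(transform(extract(text)), conn)
    return conn.execute(
        "SELECT name, amount FROM payments ORDER BY name"
    ).fetchall()


conn = sqlite3.connect(":memory:")
result = run_pipeline(RAW_CSV, conn)
print(result)  # prints [('Alice', 10.5), ('Carol', 7.0)]
```

The row for "bob" is dropped in the transform step because it has no amount, which is the "including cleaning" part of the definition. Real pipelines mostly differ in the sources and destinations (files, APIs, warehouses), not in the overall shape.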

ljouhet · a year ago

Didn't understand either:

Extract, transform and load...

ETL: Extract, transform and load data...

Extract, transform and load (ETL) data...

Extract, Transform and Load data...

Thanks for pointing that out, it's actually mentioned (Extract, transform and load ...) in the very first sentence below the tagline, but if you didn't get it then it's not clear.

ghoshbishakh · a year ago

It is written on the website: Extract, transform and load. Yes, an illustrative example description would help I agree.

mrwyz · a year ago

Not open source. Misleading title.

paulvnickerson · a year ago

Very cool, thanks for sharing. Does it support the pandas-like rapidsai dask_cudf framework? (https://docs.rapids.ai/api/dask-cudf/stable/)

Great, thanks for sharing. I was familiar with Dask and cudf separately but not this one. I was planning to implement dask support through Modin but I'll definitely take a look at dask_cudf.

Cool. We use it a lot at work for working with large data sets on a GPU cluster.

cvalka · a year ago

THIS IS NOT OPEN SOURCE!

maphew · a year ago

It's published on GitHub under license ELv2 - Elastic License v2. This does not meet the open source definition, so indeed it's not Open Source. ELv2 is an open source sibling though, closer than many other openish licenses: https://www.elastic.co/pricing/faq/licensing

Still, Amphi should not claim to be 'Open Source'.

runningmike · a year ago

IMHO this title is primarily chosen for promotion. Not needed. But unfortunately many people, young and old, have never heard of OSI.