Posted by u/Alanhlwang 2 years ago
Launch HN: Serra (YC S23) – Open-core, Python-based dbt alternative
Hey HN! Alan and Albert here, cofounders of Serra. Serra is end-to-end dbt—we make building reliable, scalable ELT/ETL easy by replacing brittle SQL scripts with object-oriented Python. It’s open core: https://github.com/Serra-Technologies/serra, and our docs are here: https://docs.serra.io/documentation/.

I stumbled into this idea as a data engineer on Disney+'s subscriptions team. We were "firefighters for data," ready to debug huge pipelines that always crashed and burned. The worst part of my job at Disney+ was the graveyard on-call rotations, where pages between 12am and 5am were guaranteed, and you'd have to dig through thousands of lines of someone else's SQL. SQL is long-winded: 1000 lines of SQL can often be summarized with 10 key transforms. We take this SQL and summarize those transforms with reusable, testable, scalable Spark objects.

Serra is written in PySpark and modularizes every component of ETL through Spark objects. Like dbt, we apply software engineering best practices to data, but we aim to do it not just for transformations, but for data connectors as well. We accomplish this with a YAML configuration file: if we have a pipeline where that 1000-line SQL script uses third-party connectors, we can summarize all of it in a 12-block config file (10 blocks for the transforms and 2 for the in-house connectors) that gives an easy high-level overview and debugging capabilities. Then we can add tests and custom alerts to each of these objects/blocks so that we know exactly where the pipeline breaks and why.
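
To make the idea concrete, here is a hypothetical sketch of such a config (block names and fields are invented for illustration, not Serra's actual schema): two connector blocks bracketing named transform blocks, each of which can carry its own tests and alerts:

```yaml
# Illustrative only -- not Serra's real config schema.
reader:
  SnowflakeReader:
    table: raw_subscriptions

transforms:
  - MapTransformer:        # e.g. expand state abbreviations
      column: state
      mapping_path: states.csv
  - JoinTransformer:
      join_on: account_id
      join_type: left

writer:
  DatabricksWriter:
    table: subscriptions_clean
```

Each block corresponds to one reusable object, so a failure can be reported at the block level rather than somewhere inside a monolithic SQL script.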

We are open-source to make it easy to customize Serra to whatever flavor you like with custom transformers/connectors. The connectors we support out of the box are Snowflake, AWS, BigQuery, and Databricks, and we are adding more based on feedback. The transforms we support include mapping, pivoting, joining, truncating, imputing, and more. We're doing our best to make Serra as easy to use as possible. If you have docker installed, you can run the docker command below to instantly get set up with a Serra environment for creating modular pipelines.
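
A plain-Python sketch of the transform-object pattern described above (class and method names are illustrative, not Serra's actual API, and plain dicts stand in for Spark DataFrames). Each transform is a small object you can unit-test on its own, and a pipeline is just an ordered list of them:

```python
# Illustrative only: a config-driven transform-object pattern,
# shown with lists of dicts instead of Spark DataFrames.

class MapTransformer:
    """Rewrite a column's values according to a mapping."""
    def __init__(self, column, mapping):
        self.column = column
        self.mapping = mapping

    def transform(self, rows):
        return [
            {**row, self.column: self.mapping.get(row[self.column], row[self.column])}
            for row in rows
        ]

class FilterTransformer:
    """Keep only rows where `column` equals `value`."""
    def __init__(self, column, value):
        self.column = column
        self.value = value

    def transform(self, rows):
        return [row for row in rows if row[self.column] == self.value]

def run_pipeline(rows, transformers):
    """Apply each transform in order; each step can be tested in isolation."""
    for t in transformers:
        rows = t.transform(rows)
    return rows

rows = [
    {"state": "CA", "plan": "premium"},
    {"state": "NY", "plan": "basic"},
]
pipeline = [
    MapTransformer("state", {"CA": "California", "NY": "New York"}),
    FilterTransformer("plan", "premium"),
]
print(run_pipeline(rows, pipeline))
# → [{'state': 'California', 'plan': 'premium'}]
```

The YAML config then only has to name each block and its parameters; the framework instantiates the objects and chains them.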

We wrap up our functionality with a command line tool that lets you create your ETL pipelines, test them locally with a subset of your data, and deploy them to the cloud (currently we only support Databricks, but will soon support others and plan to host our own clusters too). It also has an experimental "translate" feature which is still a bit finicky, but the idea is to take your existing SQL script and get suggestions on how you can chunk up and modularize your job with our config. It's still just a super early suggestion feature that is definitely not fleshed out, but we think it's a cool approach.

Here’s a quick demo going through retooling a long-winded SQL script to an easily maintainable, scalable ETL job: https://www.loom.com/share/acc633c0ec03455e9e8837f5c3db3165?.... (docker command: docker run --mount type=bind,source="$(pwd)",target=/app -it serraio/serra /bin/bash)

We don’t see or store any of your data—we’re a transit layer that helps you write ETL jobs that you can send to your warehouse of choice with your actual data. Right now we are helping customers retool their messy data pipelines and plan to monetize by hosting Serra on the cloud, charging if you run the job on our own clusters, and per API call on our translate feature (once it’s mature).

We’re super excited to launch this to Hacker News. We’d love to hear what you think. Thanks in advance!

addisonj · 2 years ago
Congrats on the launch!

Interesting project in a space that I am pretty certain is going to change a lot in the coming years. Here is a bit of random feedback and questions.

* Some of your messaging related to python vs yaml is a bit confusing, which results in me not being immediately clear on the value prop. After digging through docs and code I now understand that the yaml is a declarative pipeline calling the underlying python code that can include user defined transformations. Nifty! As someone who has led data platform teams, I understand that this would be a big win for any data platform team to better support data eng/scientists. But you don't tell me any of that. I would look at trying to give more context to what this is and adding more of these use cases and values in your marketing (even if they are pretty nascent at this stage)

* From the loom, the play you are doing is clear and makes a lot of sense to build a cloud service to easily run these jobs... but that makes me wonder if your licensing choice is maybe a bit too restrictive? IMHO, the most important thing to do when building dev tools is to be very deliberate in your end-to-end user -> customer journey and designing your open source and commercial strategies to nicely dovetail. For a product like this, I would think the faster and bigger I can build a community, the better, and that may mean "giving away" a lot of the initial core innovation, but with a clear plan on the innovation I can drive through integrated services, which would imply as open a license as possible. As is, I think you might find it much harder to get people to take it seriously, as, unlike other source-available companies (Elastic, Cockroach, etc) you aren't yet proven to be worth the effort to get this approved vs a full open source alternative

* On a similar note, what is in the repo right now seems to be a relatively thin wrapper around spark. That isn't a criticism. Many technologies and communities have started based on a "remix" of a lower level tool that offers simplified UX/DX or big workflow improvements. What sets those apart though, imho, is drastically lowering the barrier to entry to the underlying technology and being seen as leaders and experts in the space you operate in. I am guessing you probably have lots of features planned, but I would give a soft suggestion to invest in learnability as a feature (via features, interactive docs, etc) as much as almost anything else, as that is really where a lot of the value of a higher level interface like this comes in

* My past experience with really large and complex ETL jobs that essentially required dropping into spark to represent them makes me wonder how much actual complexity can be represented by the transformers. I would be curious to know what your most complex pipeline is? There doesn't seem to be an API limitation preventing these pipelines from getting quite a bit larger and representing many sql statements, other than big long spark pipelines getting kind of ugly, and in some cases they could even remove the need for quite a few airflow jobs. I am curious to know if and how you see Serra addressing those types of ETL jobs.

Once again, congrats on launching! Happy to give more context/thoughts in a thread or reach out to me via in profile

Alanhlwang · 2 years ago
This is super insightful, thanks a ton for this gold mine.

On the python vs yaml part—definitely could've made that way clearer in the demo. Right now our framework lets you call these python objects in your yaml file, but we are working on a python-centric implementation as well for those who do not want to interact with yamls.

On the loom and licensing choice—that's a great point. One of the main issues we ran into is getting adoption as we originally just tried licensing out the framework (mega fail ofc)—found out the hard way that no dev wants to buy something to try it out. We're definitely flexible on our license and will take all this feedback into account.

On the barrier to entry—also super insightful. We're working on a local UI offering that will be a 'config' block builder, free for all installs. We're implementing a DAG view similar to Airflow at the transform level. We also want to make it super easy to see your code and preview how it changes with this local UI (and have a list of all the params you need for your spark objects without having to go through the docs). We also want to flesh out more features, especially on the translate side, as well as host on the cloud.

The complexity issue is something I ran into at Disney as well! As the product grows we definitely want to flesh out our transformers based on the scripts we see. For now, the developer can make one-off transformers—we actually have a catch-all "SQL transformer" for cases where you want to just pass in your sql (similar to a dbt model) and run it that way. It's a fail-safe: if you have one specific transform that you feel is super hard to break down, you can fall back on dbt's approach of modularizing the SQL into a transform, and reference it as an input block however many times you want later on.
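
A hypothetical illustration of that catch-all fallback (the block and field names are invented, not Serra's actual schema): one opaque SQL block sitting alongside ordinary transform blocks in the same config:

```yaml
# Illustrative only -- not Serra's real config schema.
transforms:
  - SQLTransformer:         # escape hatch for hard-to-decompose logic
      sql: |
        SELECT account_id, SUM(amount) AS total_spend
        FROM {input}
        GROUP BY account_id
  - MapTransformer:         # back to modular, testable blocks
      column: total_spend
      mapping_path: spend_tiers.csv
```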

Thanks so much for the congrats, will definitely reach out and would love to have further discussions in the thread as well.

khaledh · 2 years ago
The pattern of reading from data sources to a Pandas DataFrame first defeats the whole point of using Spark[1]. Maybe it's ok for small tables, but you'll probably run out of memory on large tables.

[1] https://github.com/Serra-Technologies/serra/blob/a7a80c77af5...

kermatt · 2 years ago
Moving between Spark and Pandas can cause type-casting issues as well. For example, the range of allowable dates in Pandas is much smaller than in Spark. We completely abandoned Pandas in favor of PySpark for this reason.

It seems unnecessary to use multiple dataframe implementations when Spark is already in play.

smcin · 2 years ago
Are you referring to pandas.Timestamp.max being 2262-04-11 23:47:16.854775807 ?

https://pandas.pydata.org/docs/reference/api/pandas.Timestam...

(pandas design choice was to support nanosecond times, for financial data.)

albertstanley · 2 years ago
This is a completely valid point, we'll be changing the readers to read directly into Spark. Thanks for the comment!
vladsanchez · 2 years ago
That's a smell. I thought they basically packaged the ETL portion of dbt as open source, not the data connector implementation. I'd like it to be connector-agnostic so that you can choose the one most suitable for your needs.

Good intentions, but perhaps wrong execution. We'll see!

boxed · 2 years ago
If the selling point is "replacing brittle SQL scripts with object-oriented Python" you should have at least one example of this code in the README!
loveparade · 2 years ago
I don't think SQL can be more brittle than untyped object-oriented Python. The latter is as brittle as it gets, as we all know from ML code. SQL is common for ETL exactly because it's not brittle, (mostly) declarative and easy to test and modularize.

What Python gives you is more flexibility, at the expense of exponentially more brittleness.

dragonwriter · 2 years ago
> SQL is common for ETL exactly because it’s not brittle,

SQL is common for ETL because typically at least one, sometimes both, ends of an ETL operation is an RDBMS for which SQL is the standard language. It has nothing to do with lack of brittleness.

Alanhlwang · 2 years ago
Definitely a great point, adding one today
nouri · 2 years ago
We’re considering adopting DBT or a similar tool for the orchestration of our data pipelines on Snowflake. But we also explored Snowflake Dynamic Tables, and they make it easy to build a complex DAG without having to describe it.

I’m curious if data warehouse features like materialized views or dynamic tables will end up making DBT or the like obsolete?

https://docs.snowflake.com/en/user-guide/dynamic-tables-abou...

noel_gomez · 2 years ago
It is likely that these will make it into dbt since there are many dbt users who also use snowflake. The main things to consider when deciding to use something other than dbt are:

* can you apply software dev best practices, CI/CD etc
* is it proprietary or can you use it with other dbs
* is there a large community behind it e.g. dbt packages and dbt python libraries
* will you also get docs, dq, lineage or will you need additional tooling
* will you need to orchestrate other aspects of your data flow e.g. EL, then T, then activation, etc

Databricks also has delta live tables and for the reasons above I usually suggest people consider all of these and not just go all-in with one vendor

Alanhlwang · 2 years ago
This is a great point—with dynamic tables/materialized views, the software engineering best practices (i.e. modularization, testing, version control) are not as intuitive or straightforward to apply as they are in dbt and Serra. We can also add a direct SnowflakeDynamicTableWriter to our framework to get the best of both worlds!
riwsky · 2 years ago
Were your Disney+ fires using dbt? The comparison in your demo doesn’t resemble normal dbt usage: it forces the SQL to inline the state abbreviation instead of using a dbt seed file, while the initial serra version uses one; it initially shows the serra code-folded, to make the SQL seem more verbose; and the SQL makes no use of CTEs or dbt models, either of which would make the transform steps clear.
Alanhlwang · 2 years ago
These are all great points, and no, Disney didn't use dbt. We wanted the demo to show how you can modularize SQL into reusable objects that you can fully customize error logs for, while also adding the value of handling all steps of ETL in-house. You could definitely write this script modularly using dbt, but our main differentiator is connectors that integrate easily with your transforms (end-to-end), plus taking the software engineering best practices dbt applies a step further by turning each transform and connector into objects you can test, modularize, and customize error logs for.
morkalork · 2 years ago
The data engineering space feels like it's earning the same reputation front end had/has with the endless stream of new and shiny frameworks.
yuppiepuppie · 2 years ago
I couldn’t agree more. Except that the cost of a data tool is far higher than any FE tool.

I manage both a FE team and a data team. The former spends around 1k/month on infra and hosting. While the data team easily spends 20k/month.

The gold rush is data. Build shovels.

geonnave · 2 years ago
For those also wondering what ETL and dbt are:

> ETL (Extract, Transform, Load) is a process that involves extracting data from various sources, transforming it to fit operational needs, and loading it into a database for analysis. dbt (data build tool) is an open-source software tool that enables data analysts and engineers to transform and model data in the data warehouse, streamlining the transformation part of the ETL process.

sails · 2 years ago
I like that you improve on the underlying database error messages, as they are really unhelpful, and I think this is a great place to add value.

I've been keeping track of a few dbt alternatives. Dbt have opened up the market to this use case, while only partially solving the business model and maturity side. Here are the more interesting ones:

sdf.com (ex Meta team)

sqlmesh.com (relatively new)

paradime.io (more an IDE)

cloud.google.com/dataform (GCP only)

Alanhlwang · 2 years ago
These are great links, we'll take a look