zenlikethat commented on AI Angst   tbray.org/ongoing/When/20... · Posted by u/AndrewDucker
nikolayasdf123 · 6 months ago
> Go programming language is especially well-suited to LLM-driven automation. It’s small, has a large standard library, and a culture that has strong shared idioms for doing almost anything

+1 to this. Thank you, `go fmt`, for uniform code (and even a culture of uniform test style!). Thank you, culture of minimal dependencies. And of course the Go standard library and static/runtime tooling. Thank you, simple code that is easy for humans to write...

...and, as it turns out, for AIs too.

zenlikethat · 6 months ago
I found that bit slightly ironic, because it always seems to produce somewhat cringy Go code for me: it might get the job done, but it skips over some of the usual design philosophies like use of interfaces, channels, and context. For the most part, though, yeah, I’ve been very satisfied with Go code gen.
zenlikethat commented on Bauplan – Git-for-data pipelines on object storage   docs.bauplanlabs.com/en/l... · Posted by u/barabbababoon
tech_ken · 8 months ago
The Git-like approach to data versioning seems really promising to me, but I'm wondering what those merge operations are expected to look like in practice. In a coding environment, I'd review the PR basically line-by-line to check for code quality, engineering soundness, etc. But in the data case it's not clear to me that a line-by-line review would be possible, or even useful; and I'm also curious about what (if any) tooling is provided to support it?

For example: I saw the YouTube video demo someone linked here where they had an example of a quarterly report pipeline. Say that I'm one of two analysts tasked with producing that report, and my coworker would like to land a bunch of changes. Say in their data branch, the topline report numbers are different from `main` by X%. Clearly it's due to some change in the pipeline, but it seems like I will still have to fire up a notebook and copy+paste chunks of the pipeline to see step-by-step where things are different. Is there another recommended workflow (or even better: provided tooling) for determining which deltas in the pipeline contributed to the X% difference?

zenlikethat · 8 months ago
That’s a great question. Diffing is one area we’ve thought a bit about but still need to dedicate more cycles to. One thing I’d be curious about: what are you doing in those notebooks to check? For what it’s worth, you could have an intermediate Python model that computes the differences and materializes the results to a table, which you could then query directly for further insight (rough sketch below).

One thing we do have is support for “expectations”: model-like Python steps that check data quality and can flag the pipeline if it violates them.
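To make the “intermediate model” idea a bit more concrete, here’s a minimal sketch of the kind of diff step I have in mind: run the same aggregation against `main` and a data branch via the Python SDK and compare the results. The client/query interface is roughly what the docs describe, and the table, branch, and column names are made up for illustration, so treat it as a sketch rather than the exact API:

```python
import bauplan

# Hypothetical names, purely for illustration.
MAIN_REF = "main"
BRANCH_REF = "analyst.quarterly_refactor"
SQL = "SELECT region, SUM(revenue) AS revenue FROM quarterly_report GROUP BY region"

client = bauplan.Client()

# Run the same aggregation on both refs; results come back as Arrow tables (assumption).
main_df = client.query(query=SQL, ref=MAIN_REF).to_pandas()
branch_df = client.query(query=SQL, ref=BRANCH_REF).to_pandas()

# Join on the grouping key and compute the per-region delta that explains
# the topline X% difference.
diff = main_df.merge(branch_df, on="region", suffixes=("_main", "_branch"))
diff["delta_pct"] = 100 * (diff["revenue_branch"] - diff["revenue_main"]) / diff["revenue_main"]
print(diff.sort_values("delta_pct", ascending=False))
```

The same logic could live inside a pipeline step that materializes `diff` to a table on the branch, so the comparison becomes queryable instead of copy-pasted into a notebook.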

zenlikethat commented on Bauplan – Git-for-data pipelines on object storage   docs.bauplanlabs.com/en/l... · Posted by u/barabbababoon
dijksterhuis · 8 months ago
the big question i have is: where is the code executed? “the cloud”? whose cloud? my cloud? your environment on AWS?

the paper briefly mentions “bring your own cloud” in 4.5 but the docs page doesn’t seem to have any information on doing that (or at least none that i can find).

zenlikethat · 8 months ago
The code you execute on your data currently runs in a per-customer AWS account managed by us. We leave the door open for BYOC based on the architecture we’ve designed, but due to lean startup life, that’s not an option yet. We’d definitely be down to chat about it.
zenlikethat commented on Bauplan – Git-for-data pipelines on object storage   docs.bauplanlabs.com/en/l... · Posted by u/barabbababoon
korijn · 8 months ago
How does this compare to dbt? Seems like it can do the same?
zenlikethat · 8 months ago
Some similarities, but Bauplan offers:

1. Great Python support. Piping something from a structured data catalog into Python is trivial, and so is persisting results. With materialization, you never need to recompute something in Python twice if you don’t want to — you can store it in your data catalog forever.

Also, you can request any Python package you want, and even use different Python versions and packages in different workflow steps (see the sketch after this list).

2. Catalog integration. Safely make changes and run experiments in branches.

3. Efficient caching and data re-use. We do a ton of tricks behind the scenes to avoid recomputing or rescanning things that have already been done, and we pass data between steps as zero-copy Arrow tables. This means your DAGs run a lot faster, because the time spent shuffling bytes around is minimal.
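As a rough illustration of points 1 and 2, here’s what two steps of a pipeline could look like, based on the decorator pattern in the docs. The table names, package versions, and exact decorator arguments are assumptions, not a verbatim example:

```python
import bauplan

# Step 1: clean raw data with pandas on Python 3.10.
@bauplan.model()
@bauplan.python("3.10", pip={"pandas": "2.1.4"})
def clean_orders(orders=bauplan.Model("raw_orders")):
    # Inputs arrive as Arrow tables (assumption); convert to pandas to work on them.
    df = orders.to_pandas()
    return df.dropna(subset=["order_id"])

# Step 2: aggregate with Polars on Python 3.11, in the same DAG.
@bauplan.model()
@bauplan.python("3.11", pip={"polars": "1.9.0"})
def order_totals(clean=bauplan.Model("clean_orders")):
    import polars as pl
    return (
        pl.from_arrow(clean)
        .group_by("customer_id")
        .agg(pl.col("amount").sum().alias("total"))
        .to_arrow()
    )
```

Each step declares its own interpreter version and dependencies, and downstream steps just reference upstream model names.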

zenlikethat commented on Bauplan – Git-for-data pipelines on object storage   docs.bauplanlabs.com/en/l... · Posted by u/barabbababoon
esafak · 8 months ago
It is a service, not an open source tool, as far as I can tell. Do you intend to stay that way? What is the business model and pricing?

I am a bit concerned that you want users to swap out both their storage and workflow orchestrator. It's hard enough to convince users to drop one.

How does it compare to DuckDB or Polars for medium data?

zenlikethat · 8 months ago
Yep, staying a service.

RE: workflow orchestrators. You can use the Bauplan SDK to query, launch jobs, and get results from within your existing platform; we don’t want to replace it entirely if that doesn’t fit for you, just to augment it (rough sketch below).
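For example, a task in an existing scheduler could look roughly like this. It’s a hedged sketch, not the exact SDK surface: the method names, arguments, project path, and query are assumptions for illustration:

```python
import bauplan

def refresh_quarterly_report():
    """Drop-in task body for whatever you already run (Airflow, Dagster, cron, ...)."""
    client = bauplan.Client()

    # Launch a Bauplan pipeline run from your existing platform.
    # (Hypothetical project directory and branch name.)
    client.run(project_dir="./quarterly_report", ref="main")

    # Pull results back into the same platform for downstream steps.
    table = client.query(
        query="SELECT region, SUM(revenue) AS revenue FROM quarterly_report GROUP BY region",
        ref="main",
    )
    return table.to_pandas()
```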

RE: DuckDB and Polars. Bauplan literally uses DuckDB under the hood, but with two big upgrades: one, we plug into your data catalog for efficient scanning even on massive data lakehouses, before anything hits the DuckDB step. Two, we do efficient data caching: query results and intermediate scans can be reused across runs.

More details here: https://www.bauplanlabs.com/blog/blending-duckdb-and-iceberg...

As for Polars, you can easily use Polars itself within your Python models by specifying it in a pip decorator. We install all requested packages for each Python step.

zenlikethat commented on Bauplan – Git-for-data pipelines on object storage   docs.bauplanlabs.com/en/l... · Posted by u/barabbababoon
anentropic · 8 months ago
I am very interested in this but have some questions after a quick look

It mentions "Serverless pipelines. Run fast, stateless Python functions in the cloud." on the home page... but it took me a while of clicking around looking for exactly what the deployment model is

e.g. is it the cloud provider's own "serverless functions"? or is this a platform that maybe runs on k8s and provides its own serverless compute resources?

Under examples I found https://docs.bauplanlabs.com/en/latest/examples/data_product... which shows running a cli command `serverless deploy` to deploy an AWS Lambda

for me deploying to regular Lambda func is a plus, but this example raises more questions...

https://docs.bauplanlabs.com/en/latest/commands_cheatsheet.h... doesn't show any 'serverless' or 'deploy' command... presumably the example is using an external tool i.e. the Serverless framework?

which is fine, great even - I can presumably use my existing code deployment methodology like CDK or Terraform instead

Just suggesting that the underlying details could be spelled out a bit more up front.

In the end I kind of understand it as similar to sqlmesh, but with a "BYO compute" approach? So where sqlmesh wants to run on a Data Warehouse platform that provides compute, and only really supports Iceberg via Trino, bauplan is focused solely on Iceberg and defining/providing your own compute resources?

I like it

Last question is re here https://docs.bauplanlabs.com/en/latest/tutorial/index.html

> "Need credentials? Fill out this form to get started"

Should I understand therefore that this is only usable with an account from bauplanlabs.com ?

What does that provide? There's no pricing mentioned so far - what is the model?

zenlikethat · 8 months ago
> or is this a platform that maybe runs on k8s and provides its own serverless compute resources?

This one, although it’s a custom orchestration system, not Kubernetes (there are some similarities, but our system is really optimized for data workloads).

We manage Iceberg for easy data versioning, take care of data caching and Python packages, etc., and you just write some Python and SQL and run it over your data catalog without having to worry about Docker and all the infra stuff.

I wrote a bit on what the efficient SQL half takes care of for you here: https://www.bauplanlabs.com/blog/blending-duckdb-and-iceberg...

> In the end I kind of understand it as similar to sqlmesh, but with a "BYO compute" approach? So where sqlmesh wants to run on a Data Warehouse platform that provides compute, and only really supports Iceberg via Trino, bauplan is focused solely on Iceberg and defining/providing your own compute resources?

Philosophically, yes. In practice so far we manage the machines in separate AWS accounts _for_ the customers, in a sort of hybrid approach, but the idea is not dissimilar.

> Should I understand therefore that this is only usable with an account from bauplanlabs.com ?

Yep. We’d help you get started and set you up on our demo team. Send jacopo.tagliabue@bauplanlabs.com an email.

RE: pricing. Good question. It’s bespoke at this early startup stage. Contact your friendly neighborhood Bauplan founder to learn more :)
