jgraettinger1 commented on AI agent benchmarks are broken   ddkang.substack.com/p/ai-... · Posted by u/neehao
majormajor · a month ago
> Discriminating good answers is easier than generating them.

I don't think this is true for many fields - especially outside of math/programming. Let's say the task is "find the ten most promising energy startups in Europe." (This is essentially the sort of work I see people frequently talk about using research modes of models for here or on LinkedIn.)

In ye olden days pre-LLM you'd be able to easily filter out a bunch of bad answers from lazy humans since they'd be short, contain no detail, have a bunch of typos and formatting inconsistencies from copy-paste, etc. You can't do that for LLM output.

So unless you're a domain expert on European energy startups you can't check for a good answer without doing a LOT of homework. And if you're using a model that usually only looks at, say, the top two pages of Google results to try to figure this out, how is the validator going to do better than the original generator?

And what about when the top two pages of Google results start turning into model-generated blogspam?

If your benchmark can't evaluate prospective real-world tasks like this, it's of limited use.

A larger issue is that once your benchmark, which used this task as a criterion based on an expert's knowledge, is published, anyone making an AI agent is incredibly incentivized (intentionally or not!) to train specifically on this answer without necessarily getting better at the fundamental steps in the task.

IMO you can never use an AI agent benchmark that is published on the internet more than once.

jgraettinger1 · a month ago
> You can't do that for LLM output.

That's true if you're just evaluating the final answer. However, wouldn't you evaluate the context -- including internal tokens -- built by the LLM under test?

In essence, the evaluator's job isn't to do separate fact-finding, but to evaluate whether the under-test LLM made good decisions given the facts at hand.

jgraettinger1 commented on What is Realtalk’s relationship to AI? (2024)   dynamicland.org/2024/FAQ/... · Posted by u/prathyvsh
thousand_nights · 2 months ago
Because the preachers preach how amazing it is on their greenfield 'I built a todo list app in 5 minutes from scratch' demos, and then you use it on an established codebase with a bigger context than the LLM could ever possibly consume, spend 5x more time debugging the slop than it would've taken you to do the task yourself, and become jaded.

Stop underestimating the amount of internalized knowledge people can have about projects in the real world; it's so annoying.

An LLM can't ever possibly get close to it. There's some guy on a team in another building who knows why a certain weird piece of critical business logic was put there 6 years ago. The LLM will never know this, and won't understand it even if it consumed the whole repository, because it would have to work there for years to understand how the business works.

jgraettinger1 · 2 months ago
But that’s not good. You don’t want Bob to be the gatekeeper for why a process is the way it is.

In my experience, working with agents helps eliminate that crap, because you have to bring the agent along as it reads your code (or process or whatever) for it to be effective. Just like human co-workers need to be brought along, so it’s not all on poor Bob.

jgraettinger1 commented on The drawbridges come up: the dream of an interconnected context ecosystem is over   dbreunig.com/2025/06/16/d... · Posted by u/dbreunig
_jholland · 2 months ago
I have made it my mission to conquer SAP and gain control of our own critical financial data.

As a business, they uniquely leverage inefficient and clunky design to drive profit. Simply because they haven’t documented their systems sufficiently, it is “industry standard practice” to go straight to a £100/hr+ consultant to build what should be straightforward integrations and perform basic IT Admin procedures.

Through many painful late nights I have waded through their meticulously constructed labyrinth of undocumented parameters and gotchas built on foot-guns to eventually get to both build and configure an SAP instance from scratch and expose a complete API in Python.

It is for me a David and Goliath moment, carrying more value than the consultancy fees and software licences I've spared my company.

jgraettinger1 · 2 months ago
Hi, I’m a cofounder / CTO of estuary.dev. Our whole mission is democratizing and enabling use of data within orgs.

Open to a conversation about your work here? Reach me at johnny at estuary dot dev.

jgraettinger1 commented on Why Use Structured Errors in Rust Applications?   home.expurple.me/posts/wh... · Posted by u/todsacerdoti
arccy · 3 months ago
Having had to work with various applications written in Rust... I find they have some of the most terrible errors: "<some low level operation> failed" with absolutely no context on why the operation was invoked in the first place, or with what arguments.

This is arguably worse than crashing with a stack trace (at least I can see a call path) or Go's typical chain of human-annotated errors.

jgraettinger1 · 3 months ago
I would recommend the `anyhow` crate and use of anyhow::Context to annotate errors on the return path within applications, like:

  fallible_func().context("failed to frob the peanut")?
Combine that with the `thiserror` crate for implementing errors within a library context. `thiserror` makes it easy to implement structured errors which embed other errors, and plays well with `anyhow`.
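
Roughly, the two fit together like this (a minimal sketch; the error type and function names are invented for illustration):

  use anyhow::{Context, Result};
  use thiserror::Error;

  // Library-side: a structured error built with `thiserror`,
  // which can embed other errors via #[from].
  #[derive(Debug, Error)]
  enum FrobError {
      #[error("peanut {0} is too crunchy")]
      TooCrunchy(u32),
      #[error("I/O failure while frobbing")]
      Io(#[from] std::io::Error),
  }

  fn frob_peanut(id: u32) -> Result<(), FrobError> {
      if id % 2 == 0 {
          return Err(FrobError::TooCrunchy(id));
      }
      Ok(())
  }

  // Application-side: annotate errors with context on the return path.
  fn process_batch() -> Result<()> {
      for id in [1, 2, 3] {
          frob_peanut(id).with_context(|| format!("failed to frob peanut {id}"))?;
      }
      Ok(())
  }

  fn main() {
      // The alternate Display form prints the whole chain, e.g.
      // "failed to frob peanut 2: peanut 2 is too crunchy".
      if let Err(err) = process_batch() {
          eprintln!("error: {err:#}");
      }
  }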

jgraettinger1 commented on What's working for YC companies since the AI boom   jamesin.substack.com/p/wh... · Posted by u/jseidel
ck_one · 3 months ago
In many fields there is no moat. It’s an execution battle, and it comes down to the question: can the startup innovate faster and get to the customers, or can the incumbent defend its existing distribution well enough?

Microsoft owns GitHub and VSCode, yet Cursor was able to out-execute them. Legora is moving very quickly in the legal space. It’s not clear yet who will win.

jgraettinger1 · 3 months ago
> Microsoft owns GitHub and VSCode, yet Cursor was able to out-execute them

Really? My startup is under 30 people. We develop in the open (source available) and are extremely willing to try new processes or tooling if it'll gain us an edge -- but we're also subject to SOC2.

Our own evaluation was that Cursor et al. aren't worth the headache of the compliance paperwork. Copilot + VSCode is playing rapid catch-up and is a far easier "yes".

How large is the intersection of companies who a) believe Cursor has a substantive edge in capability, and b) are willing to send Cursor their code (and go through the headaches of various vendor reviews and declarations)?

jgraettinger1 commented on HTAP is Dead   mooncake.dev/blog/htap-is... · Posted by u/moonikakiss
pradn · 3 months ago
On the data warehousing side, I think the story looks like this:

1) Cloud data warehouses like Redshift, Snowflake, and BigQuery proved to be quite good at handling very large datasets (petabytes) with very fast querying.

2) Customers of these proprietary solutions didn't want to be locked in. So many are drifting toward Iceberg tables on top of Parquet (columnar) data files.

Another "hidden" motive here is that cloud object stores give you regional (multi-zonal) redundancy without having to pay extra inter-zonal fees. An OLTP database would likely have to pay this cost, as it likely won't be based purely on object stores - it'll need a fast durable medium (disk), at least for the WAL or the hot pages. So here we see the topology of cloud object stores being another reason forcing the split between OLTP and OLAP.

But what does this new world of open OLTP/OLAP technologies look like? Pretty complicated.

1) You'd probably run Postgres as your OLTP DB, as it's the default these days and scales quite well.

2) You'd set up an Iceberg/Parquet system for OLAP, probably on Cloud object stores.

3) Now you need to stream the changes from Postgres to Iceberg/Parquet. The canonical OSS way to do this is to set up a Kafka cluster with Kafka Connect. You use the Debezium CDC connector for Postgres to pull deltas, then write to Iceberg/Parquet using the Iceberg sink connector. This incurs extra compute, memory, network, and disk.

There are so many moving parts here. The ideal is likely a direct Postgres->Iceberg write flow built into Postgres. The pg_mooncake extension this company is offering also adds DuckDB-based querying, but that's likely not necessary if you plan to use Iceberg-compatible query engines anyway.

Ideally, you have one plugin for purely streaming Postgres writes to Iceberg with some defined lag. That would cut out the third bullet above.

jgraettinger1 · 3 months ago
> There are so many moving parts here.

Yep. At the scope of a single table, append-only history is nice but you're often after a clone of your source table within Iceberg, materialized from insert/update/delete events with bounded latency.
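
As a rough sketch of that reduction step (the `Change` type below is invented, standing in for real CDC records; the tombstones would become Iceberg deletes at commit time):

  use std::collections::BTreeMap;

  // A made-up CDC event for a single table, keyed by primary key.
  enum Change {
      Insert { key: u64, row: String },
      Update { key: u64, row: String },
      Delete { key: u64 },
  }

  // Fold an ordered window of events into last-write-wins row state. A
  // connector commits that state as one Iceberg transaction when the window
  // closes (by size or elapsed time), which is what bounds the latency.
  fn reduce_window(events: &[Change]) -> BTreeMap<u64, Option<String>> {
      let mut state = BTreeMap::new();
      for event in events {
          match event {
              Change::Insert { key, row } | Change::Update { key, row } => {
                  state.insert(*key, Some(row.clone()));
              }
              // `None` is a tombstone: emit an Iceberg delete for this key.
              Change::Delete { key } => {
                  state.insert(*key, None);
              }
          }
      }
      state
  }

  fn main() {
      let window = [
          Change::Insert { key: 1, row: "alice".into() },
          Change::Update { key: 1, row: "alice v2".into() },
          Change::Delete { key: 2 },
      ];
      for (key, row) in reduce_window(&window) {
          println!("{key}: {row:?}");
      }
  }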

There are also nuances like Postgres REPLICA IDENTITY and TOAST columns. Enabling REPLICA IDENTITY FULL amplifies your source DB WAL volume, but not having it means your CDC updates will clobber your unchanged TOAST values.
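
Here's a toy illustration of that clobbering, with an invented `UnchangedToast` placeholder standing in for whatever marker your CDC decoder emits for values the event didn't carry:

  use std::collections::BTreeMap;

  // Made-up representation of a column value as decoded from an UPDATE
  // event. Without REPLICA IDENTITY FULL, Postgres omits unchanged TOAST
  // values from the WAL record, so the connector only sees a placeholder.
  #[derive(Clone, Debug, PartialEq)]
  enum Col {
      Value(String),
      Null,
      UnchangedToast,
  }

  type Row = BTreeMap<String, Col>;

  // Merge an update into prior row state, keeping any TOAST column the
  // event didn't carry. Naively replacing the whole row clobbers them.
  fn merge_update(prior: &Row, update: &Row) -> Row {
      let mut merged = prior.clone();
      for (name, value) in update {
          if *value != Col::UnchangedToast {
              merged.insert(name.clone(), value.clone());
          }
      }
      merged
  }

  fn main() {
      let prior = Row::from([
          ("title".into(), Col::Value("v1".into())),
          ("big_doc".into(), Col::Value("a large TOASTed value".into())),
      ]);
      let update = Row::from([
          ("title".into(), Col::Value("v2".into())),
          ("big_doc".into(), Col::UnchangedToast),
      ]);
      // Keeps big_doc's prior value rather than clobbering it.
      println!("{:?}", merge_update(&prior, &update));
  }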

If you're moving multiple tables, ideally your multi-table source transactions map into corresponding Iceberg transactions.

Zooming out, there's the orchestration concern of propagating changes to table schema over time, or handling tables that come and go at the source DB, or adding new data sources, or handling sources without trivially mapped schema (legacy lakes / NoSQL / SaaS).

As an on-topic plug, my company tackles this problem [0]. Postgres => Iceberg is a common use case.

[0] https://docs.estuary.dev/reference/Connectors/materializatio...

jgraettinger1 commented on The copilot delusion   deplet.ing/the-copilot-de... · Posted by u/isaiahwp
anon7000 · 3 months ago
Yeah, and the article talks about those ways in which AI is useful. Overall, the author doesn’t have a problem with experts using AI to help them. The main argument is that we’re calling AI a copilot, and many newbies may be trusting it or leaning on it too much, when in reality, it’s still a shitty coworker half the time. Real copilots are actually your peers and experts at what they do.

> Now? We’re building a world where that curiosity gets lobotomized at the door. Some poor bastard—born to be great—is going to get told to "review this AI-generated patchset" for eight hours a day, until all that wonder calcifies into apathy. The terminal will become a spreadsheet. The debugger a coffin.

On the other hand, one could argue that AI is just another abstraction. After all, some folks may complain that over-reliance on garbage collectors means that newbies never learn how to properly manage memory. While memory management is useful knowledge for most programmers, it rarely practically comes up for many modern professional tasks. That said, at least knowing about it means you have a deeper level of understanding and mastery of programming. Over time, all those small, rare details add up, and you may become an expert.

I think AI is in a different class because it’s an extremely leaky abstraction.

We use many abstractions every day. A web developer really doesn’t need to know how deeper levels of the stack work — the abstractions are very strong. Sure, you’ll want to know about networking and how browsers work to operate at a very high level, but you can absolutely write very nice, scalable websites and products with more limited knowledge. The key thing is that you know what you’re building on, and you know where to go learn about things if you need to. (Kind of like how a web developer should know the fundamental basics of HTML/CSS/JS before really using a web framework. And that doesn’t take much effort.)

AI is different — you can potentially get away with not knowing the fundamental basics of programming… to a point. You can get away with not knowing where to look for answers and how to learn. After all, AIs would be fucking great at completing basic programming assignments at the college level.

But at some point, the abstraction gets very leaky. Your code will break in unexpected ways. And the core worry for many is that fewer and fewer new developers will be learning the debugging, thinking, and self-learning skills which are honestly CRITICAL to becoming an expert in this field.

You get skills like that by doing things yourself and banging your head against the wall and trying again until it works, and by being exposed to a wide variety of projects and challenges. Honestly, that’s just how learning works — repetition and practice!

But if we’re abstracting away the very act of learning, it is fair to wonder how much that will hurt the long-term skills of many developers.

Of course, I’m not saying AI causes everyone to become clueless. There are still smart, driven people who will pick up core skills along the way. But it seems pretty plausible that the % of people who do that will decrease. You don’t get those skills unless you’re challenged, and with AI, those beginner level “learn how to program” challenges become trivial. Which means people will have to challenge themselves.

And ultimately, the abstraction is just leaky. AI might look like it solves your problems for you to a novice, but once you see through the mirage, you realize that you cannot abstract away your core programming & debugging skills. You actually have to rely on those skills to fix the issues AI creates for you — so you better be learning them along the way!!

Btw, I say this as someone who does use AI coding assistants. I don’t think it’s all bad or all good. But we can’t just wave away the downsides just because it’s useful.

jgraettinger1 · 3 months ago
> On the other hand, one could argue that AI is just another abstraction

I, as a user of a library abstraction, get a well-defined boundary and interface contract, plus assurance it’s been put through its paces by others. I can be pretty confident it will honor that contract, freeing me up to not have to know the details myself or second-guess the author.

jgraettinger1 commented on Introducing S2   s2.dev/blog/intro... · Posted by u/brancz
shikhar · 8 months ago
(S2 Founder) Congrats on the success with Estuary! You are not the first person to tell me there is no/tiny market for this. Clearly _you_ thought there was something to it, when you looked to HN for validation. We may do a lot more on top of S2, like offering Kafka compatibility, but the core primitive matters. I have wanted it. It gets reinvented in all kinds of contexts and reused sub-optimally in the form of systems that have lost their soul, and that was enough for me to have this conviction and become a founder.

Edit: I appreciate where you are coming from, and understand the challenges ahead. Thank you for the advice.

jgraettinger1 · 8 months ago
The market is gobsmackingly huge; it's just the go-to-market entry points that are narrow.

In my opinion, the key is to find a value prop and positioning which lets prospects try your service while spending a minimum of their own risk capital / reputation points within their own org.

That makes it hard to go after core storage, because it's such a widely used, fundamental, and reliable part of most every company's infrastructure. You and I may agree that conventions of incremental files in S3 are a less-than-ideal primitive for representing streams, but plenty of companies are doing it this way just fine and don't feel that it's broken.

WarpStream, on the other hand, leaned into the perceived complexity of running Kafka and the share of users who wanted a Kafka solution with the operational profile of using S3. Internal champions can sell trying their service because the prospect's existing thing is already understood to be a pain in the butt.

For what it's worth, if I were entering the space anew today I'd be thinking carefully about the Iceberg standard and what I might be able to do with it.

jgraettinger1 commented on Introducing S2   s2.dev/blog/intro... · Posted by u/brancz
CodesInChaos · 8 months ago
1. Do you support compression for data stored in segments?

2. Does the choice of storage class only affect chunks or also segments?

To me the best solution seems to be combining storing writes on EBS (or even NVMe) initially to minimize the time until writes can be acknowledged, and creating a chunk on S3 standard every second or so. But I assume that would require significant engineering effort for applications that require data to be replicated to several AZs before acknowledging them. Though some applications might be willing to sacrifice 1s of writes on node failure, in exchange for cheap and fast writes.

3. You could be clearer about what "latency" means. I see at least three different latencies that could be important to different applications:

a) time until a write is durably stored and acknowledged

b) time until a tailing reader sees a write

c) time to first byte after a read request for old data

4. How do you handle streams which are rarely written to? Will newly appended records to those streams remain in chunks indefinitely? Or do you create tiny segments? Or replace an existing segment with the concatenated data?

jgraettinger1 · 8 months ago
> To me the best solution seems to be combining storing writes on EBS (or even NVMe) initially to minimize the time until writes can be acknowledged, and creating a chunk on S3 standard every second or so.

Yep, this is approximately Gazette's architecture (https://github.com/gazette/core). It buys the latency profile of flash storage, with the unbounded storage and durability of S3.

An addendum is that there's no need to flush to S3 quite that frequently, if readers instead tail ACK'd content from local disk. Another neat thing you can do is hand bulk historical readers pre-signed URLs to files in cloud storage, so those bytes don't need to proxy through brokers.
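
As a sketch of that last trick with the AWS Rust SDK (`aws-sdk-s3`): the bucket and fragment key below are made up, and exact call paths vary a bit by SDK version.

  use aws_sdk_s3::presigning::PresigningConfig;
  use std::time::Duration;

  #[tokio::main]
  async fn main() -> Result<(), Box<dyn std::error::Error>> {
      // Credentials and region come from the environment.
      let config = aws_config::load_from_env().await;
      let client = aws_sdk_s3::Client::new(&config);

      // A broker would pick the fragment file that covers the reader's
      // requested offset range; these bucket/key names are hypothetical.
      let presigned = client
          .get_object()
          .bucket("example-fragment-store")
          .key("my/journal/0000000000000000.gz")
          .presigned(PresigningConfig::expires_in(Duration::from_secs(600))?)
          .await?;

      // The reader fetches the historical bytes directly from S3 with this
      // URL, so they never proxy through the broker.
      println!("{}", presigned.uri());
      Ok(())
  }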

jgraettinger1 commented on Introducing S2   s2.dev/blog/intro... · Posted by u/brancz
jgraettinger1 · 8 months ago
Roughly ten years ago, I started Gazette [0]. Gazette is in an architectural middle-ground between Kafka and WarpStream (and S2). It offers unbounded byte-oriented log streams which are backed by S3, but brokers use local scratch disks for initial replication / durability guarantees and to lower latency for appends and reads (p99 <5ms as opposed to >500ms), while guaranteeing all files make it to S3 with niceties like configurable target sizes / compression / latency bounds. Clients doing historical reads pull content directly from S3, and then switch to live tailing of very recent appends.

Gazette started as an internal tool in my previous startup (AdTech related). When forming our current business, we very briefly considered offering it as a raw service [1] before moving on to a holistic data movement platform that uses Gazette as an internal detail [2].

My feedback is: the market positioning for a service like this is extremely narrow. You basically have to make it API compatible with a thing that your target customer is already using so that trying it is zero friction (WarpStream nailed this), or you have to move further up the application stack and more directly address the problems your target customers are trying to solve (as we have). Good luck!

[0]: https://gazette.readthedocs.io/en/latest/
[1]: https://news.ycombinator.com/item?id=21464300
[2]: https://estuary.dev
