Posted by u/mdeichmann 8 months ago
Launch HN: Langfuse (YC W23) – OSS Tracing and Workflows to Improve LLM Apps (github.com/langfuse/langf...)
Hey HN, we are Marc, Clemens, and Max – the founders of Langfuse. Langfuse leverages traces, evaluations, prompt management, and metrics to help developers debug and improve LLM applications. Here is a full walkthrough: https://www.youtube.com/watch?v=2E8iTvGo9Hs

With Langfuse, you can instrument your app and start ingesting traces, tracking LLM calls and other relevant logic such as retrieval, embedding, or agent actions. Langfuse then helps you analyze those traces and use features such as evaluations and prompt management to improve your app.
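
To give a feel for this, here is a minimal sketch of instrumenting a small RAG-style pipeline with the Python SDK's `@observe` decorator (the retrieval helper and model name are placeholders, and the SDK reads its API keys from the environment):

    from langfuse.decorators import observe, langfuse_context

    # Placeholder retrieval step; traced as a nested span of the surrounding trace.
    @observe()
    def retrieve_context(question: str) -> list[str]:
        return ["Langfuse is an open-source LLM engineering platform."]

    # Placeholder LLM call; as_type="generation" records it as an LLM generation.
    @observe(as_type="generation")
    def answer(question: str, context: list[str]) -> str:
        langfuse_context.update_current_observation(model="gpt-4o-mini")
        return f"Answer based on {len(context)} retrieved documents."

    # The top-level call becomes the trace; nested calls appear as child observations.
    @observe()
    def rag_pipeline(question: str) -> str:
        return answer(question, retrieve_context(question))

    # Assumes LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST are set.
    print(rag_pipeline("What is Langfuse?"))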

You can sign up to try Langfuse Cloud (https://cloud.langfuse.com/ – we have a generous free tier) or self-host Langfuse (https://langfuse.com/self-hosting) within a couple of minutes.

In the 15 months since our “Show HN” (https://news.ycombinator.com/item?id=37310070), thousands of teams have adopted the project (including teams like KhanAcademy, Twilio, and Samsara) and we hit all of the scaling limits we anticipated in the original Show HN thread. On our v1/v2 setup, we frequently exhausted IOPS on Postgres and had our Node.js container grind to a halt during tokenizations. Since then, we migrated our Cloud infrastructure from Vercel/Supabase to Porter and then to AWS & ClickHouse. Last week, we put the finishing touches on the Langfuse v3.0.0 release (https://github.com/langfuse/langfuse/releases/tag/v3.0.0), which unlocks the major scalability improvements we have made over the past half year; we are happy to share them with the OSS ecosystem today.

Langfuse v3 addresses three challenges we encountered as an LLM observability platform: a) handling high ingestion throughput with large events (long strings, multimodal images/audio/video), b) providing fast analytical, table, and single-item reads across the product, and c) serving prompts quickly and reliably in the critical path of users' applications. Langfuse is used by thousands of active self-hosting deployments, so at every point we needed to prioritize stability, fully automated migrations/upgrades, and infrastructure components that self-hosters can deploy freely on any cloud vendor.

The v3 release adds new infrastructure: a ClickHouse database next to Postgres, blob storage for raw events, and a worker container plus queues and caches (Redis) for data ingestion.

The Langfuse SDKs were originally written to send incremental updates for a single trace to our backend, which then upserted the tracing data in Postgres. Handling these updates while guaranteeing backwards compatibility with older SDK versions was a challenge. Our new ingestion pipeline writes all events to S3 and sends a reference to the file via Redis to our worker container. From there, we read all events with the same ID (including all previously ingested ones) and merge them into a final event. We insert the new row into ClickHouse, which automatically replaces the existing data for the same ID. Re-merging all event updates lets us keep a high-throughput pipeline by converting updates into insert-only records.
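
Conceptually, the merge step looks like this (simplified Python, not our actual worker code):

    from typing import Any

    def merge_events(events: list[dict[str, Any]]) -> dict[str, Any]:
        """Fold all updates for one observation ID into a single record.

        Later events win field by field, so a partial update never erases
        previously ingested fields. The merged record is inserted as a new
        row; ClickHouse keeps the latest row per ID.
        """
        merged: dict[str, Any] = {}
        for event in sorted(events, key=lambda e: e["timestamp"]):
            merged.update({k: v for k, v in event.items() if v is not None})
        return merged

    events = [
        {"id": "obs-1", "timestamp": 1, "name": "generation", "output": None},
        {"id": "obs-1", "timestamp": 2, "name": None, "output": "final answer"},
    ]
    print(merge_events(events))
    # {'id': 'obs-1', 'timestamp': 2, 'name': 'generation', 'output': 'final answer'}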

We ran many iterations to optimize our sorting keys in ClickHouse and use skip indexes efficiently, and we rewrote almost all of our queries and API endpoints to make optimal use of the schema. Using a specialized analytical database required a more database-centric application design than we were used to with a swiss-army-knife database like Postgres.
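
To make this concrete, here is a simplified table in the spirit of this design (illustrative only, not our production DDL), created via the clickhouse-connect client: the engine keeps the latest row per ID so updates become inserts, the sorting key front-loads the columns most queries filter on, and a bloom-filter skip index speeds up point lookups by ID:

    import clickhouse_connect  # pip install clickhouse-connect

    client = clickhouse_connect.get_client(host="localhost")

    # Illustrative schema only: ReplacingMergeTree deduplicates rows per sorting
    # key, the key starts with project and day for fast time-bounded per-project
    # scans, and the bloom filter skip index accelerates lookups by observation ID.
    client.command("""
    CREATE TABLE IF NOT EXISTS observations_example
    (
        project_id  String,
        id          String,
        trace_id    String,
        start_time  DateTime64(3),
        name        String,
        input       String,
        output      String,
        event_ts    DateTime64(3),
        INDEX idx_id id TYPE bloom_filter(0.001) GRANULARITY 1
    )
    ENGINE = ReplacingMergeTree(event_ts)
    PARTITION BY toYYYYMM(start_time)
    ORDER BY (project_id, toDate(start_time), id)
    """)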

The new infrastructure delivers dramatic performance gains: dashboards now respond within 400ms (95th percentile) instead of timing out on large projects and long lookback windows, and tables load up to 90% faster, displaying data within 800ms even for the largest projects.

Finally, to serve prompts from prompt management with low latency and high availability, we cache heavily and decoupled our infrastructure. For sensitive paths, we use dedicated deployments to avoid “noisy neighbors” on the same server. We also improved client-side caching in our SDKs: they prefetch prompts and revalidate them in the background, resulting in zero added latency when retrieving a prompt at runtime.
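
In the Python SDK, prompt retrieval with client-side caching looks roughly like this (the prompt name and variable are placeholders; the SDK reads its API keys from the environment):

    from langfuse import Langfuse

    # Assumes LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST are set.
    langfuse = Langfuse()

    # The first call fetches the prompt from the API; calls within the TTL are
    # served from the in-memory cache, and stale entries are revalidated in the
    # background instead of blocking the request path.
    prompt = langfuse.get_prompt("movie-critic", cache_ttl_seconds=300)

    # Fill the prompt template with runtime variables (placeholder variable).
    print(prompt.compile(movie="Dune 2"))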

If you have any questions or feedback, please join us in this HN thread, or in the future on our Discord and GitHub Discussions. While Langfuse v3 is built to scale, we tried hard to keep it easy to get started with and to self-host in your own infrastructure (https://langfuse.com/self-hosting).

PS: Here (https://langfuse.com/blog/2024-12-langfuse-v3-infrastructure...) is a more in-depth blog post on how we built Langfuse v3.

PPS: if you find these problems exciting, we are hiring (https://langfuse.com/join-us) in Berlin!

swyx · 8 months ago
(unsolicited review) we've been happy adopters of LangFuse at AINews (https://smol.ai/news). ive been tracking the llm ops landscape (https://www.latent.space/p/braintrust) for a while and its very nice to have an open source solution that is so comprehensive and intuitive!

reflections/thoughts on where this field goes next:

1. i wonder if there are new ops solutions for the realtime apis popping up

2. retries for instructor like structured outputs mess up the traces, i wonder if they can be tracked and collapsible

3. chatgpt canvas like "drafting" workflows are on the rise (https://www.latent.space/p/inference-fast-and-slow) and again its noisy to see in a chat flow

4. how often do people actually use the feedback tagging and then subsequently finetuning? i always feel guilty that i dont do it yet and wonder when and where i should.

marcklingen · 8 months ago
appreciate your constructive feedback!

> i wonder if there are new ops solutions for the realtime apis popping up

This is something we have spent quite some time on already, both on designs internally and talking to teams using Langfuse with realtime applications. IMO the usage patterns are still developing and the data capturing/visualization needs across teams are not aligned yet. What matters: (1) capture streams, (2) for non-text, provide timestamped transcripts/labels, (3) capture the difference between user-time and api-level-time (e.g. when catching up on a stream after having categorized the input first).
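
To sketch what (2) and (3) could look like with today's SDK (purely illustrative; the chunk structure and metadata field names are made up):

    from langfuse import Langfuse

    langfuse = Langfuse()  # assumes API keys in the environment

    trace = langfuse.trace(name="realtime-session")

    # Hypothetical transcript chunks from a realtime API, with both the time the
    # user perceived them and the time the API emitted them (in seconds).
    chunks = [
        {"text": "Hello,", "user_time": 0.0, "api_time": 0.4},
        {"text": "how can I help?", "user_time": 0.6, "api_time": 0.7},
    ]

    generation = trace.generation(name="assistant-turn", model="realtime-model")
    for chunk in chunks:
        # Record each timestamped transcript segment as an event on the trace,
        # keeping user-perceived time and API-level time side by side.
        trace.event(
            name="transcript-chunk",
            input=chunk["text"],
            metadata={"user_time_s": chunk["user_time"], "api_time_s": chunk["api_time"]},
        )
    generation.end(output=" ".join(c["text"] for c in chunks))
    langfuse.flush()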

We are excited to build support for this, if you or others have ideas or a wishlist, please add them to this thread: https://github.com/orgs/langfuse/discussions/4757

> retries for instructor like structured outputs mess up the traces, i wonder if they can be tracked and collapsible

Great feedback. Being able to retroactively downrank LLM calls to `debug` level in order to collapse/hide them by default would be interesting. Added a thread for this here: https://github.com/orgs/langfuse/discussions/4758
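
Today, a rough workaround is to mark failed attempts at a lower level at ingestion time (sketch below; the retry wrapper and validation check are made up, and it assumes the observation-level field in the Python decorator SDK):

    from langfuse.decorators import observe, langfuse_context

    # Hypothetical retrying wrapper around a structured-output call: failed
    # attempts are recorded at DEBUG level so they can be filtered out or
    # de-emphasized, while the successful attempt keeps the default level.
    @observe(as_type="generation")
    def attempt_structured_output(prompt: str, attempt: int) -> dict | None:
        parsed = {"answer": 42} if attempt >= 2 else None  # stand-in for validation
        if parsed is None:
            langfuse_context.update_current_observation(
                level="DEBUG", status_message=f"attempt {attempt}: validation failed"
            )
        return parsed

    @observe()
    def call_with_retries(prompt: str, max_attempts: int = 3) -> dict:
        for attempt in range(1, max_attempts + 1):
            parsed = attempt_structured_output(prompt, attempt)
            if parsed is not None:
                return parsed
        raise RuntimeError("all attempts failed")

    print(call_with_retries("Return JSON with the answer to everything."))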

> chatgpt canvas like "drafting" workflows are on the rise (https://www.latent.space/p/inference-fast-and-slow) and again its noisy to see in a chat flow

Can you share an example trace for this or open a thread on github? Would love to understand this in more detail as I have seen different trace-representations of it -- the best yet was a _git diff_ on a wrapper span for every iteration.

> how often do people actually use the feedback tagging and then subsequently finetuning? i always feel guilty that i dont do it yet and wonder when and where i should.

Have not seen finetuning based on user feedback a lot, as the feedback can be noisy and low in frequency (unless there is a very clear feedback loop built into the product). A more common workflow that I have seen: identify new problems via user feedback -> review them manually -> create llm-as-a-judge or other automated evals for this problem -> select "good" examples for fine-tuning based on a mix of different evals that currently run on production data -> sanitize the dataset (e.g. remove PII).

Finetuning has been more popular for structured output and SQL generation (clear feedback loop / retries at run-time if the output does not work). More teams fine-tune on all the output that passed this initial run-time gate for model distillation, without further quality controls on the training dataset. They usually then run evals on a test dataset to verify whether the fine-tuned model hits their quality bar.
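
To make the first step of the workflow above concrete, here is a minimal sketch of attaching user feedback to a trace so it can be filtered and reviewed later (the trace ID and score name are placeholders):

    from langfuse import Langfuse

    langfuse = Langfuse()  # assumes API keys in the environment

    # Attach an explicit user-feedback score to an existing trace; production
    # traces can then be filtered by this score to find candidates for manual
    # review, llm-as-a-judge evals, or a fine-tuning dataset.
    langfuse.score(
        trace_id="trace-id-from-your-app",  # placeholder ID
        name="user_feedback",
        value=0,  # e.g. 0 = thumbs down, 1 = thumbs up
        comment="Answer did not cite the right document.",
    )
    langfuse.flush()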

swyx · 8 months ago
im too lazy to send traces for now haha. maybe in future when it REALLY bothers me.

good luck keep going.

dcreater · 8 months ago
Thread is filled with positive reviews... Little odd
priompteng · 8 months ago
Felt the same way. While Langfuse could be great, it oddly looks like solicited “review”-ish comments from existing Langfuse users. Just gotta be careful.
Maxious · 8 months ago
> Make sure your friends don't post booster comments. That's not allowed on HN. Our readers have a nose for this, and will sniff them out and flame you. That will damage your reputation—and ours—and we may have to bury your thread.

https://news.ycombinator.com/yli.html


mfdupuis · 8 months ago
This is actually one of the more interesting LLM observability platforms I've seen. Beyond addressing scaling issues, where do you see yourself going next?
marcklingen · 8 months ago
Positioning/roadmap differs between the different projects in the space.

We summarized what we strongly believe in here: https://langfuse.com/why

TL;DR: open APIs, self-hostable, LLM/cloud/model/framework-agnostic, API-first, unopinionated building blocks for sophisticated teams, and simple yet scalable instrumentation that is incrementally adoptable.

Regarding roadmap, this is the near-term view: https://langfuse.com/roadmap

We work closely with the community, and the roadmap can change frequently based on feedback. GitHub Discussions is very active, so feel free to join the conversation if you want to suggest or contribute a feature: https://langfuse.com/ideas

mathiasn · 8 months ago
What are other potential platforms?
marcklingen · 8 months ago
This is a good long list of projects, although it is not narrowly scoped to tracing/evals/prompt management: https://github.com/tensorchord/Awesome-LLMOps?tab=readme-ov-...
resiros · 8 months ago
One missing in the list below is Agenta (https://github.com/agenta-ai/agenta).

We're OSS and OTel-compliant, with a stronger focus on evals and on enabling collaboration between subject matter experts and devs.

suninsight · 8 months ago
Bunch of them: Langsmith, Lunary, Phoenix Arize, Portkey, Datadog and Helicone.

We also picked Langfuse - more details here: https://www.nonbios.ai/post/the-nonbios-llm-observability-pi...

calebkaiser · 8 months ago
I'm a maintainer of Opik, an open source LLM evaluation and observability platform. We only launched a few months ago, but we're growing rapidly: https://github.com/comet-ml/opik
kappamax · 8 months ago
Congrats Marc! We've been using Langfuse for about 6 months for our LLMOps tooling. While its SDKs are limited to Python and TypeScript, its OpenAPI specification is pretty easy to implement in any language.

The team behind it is amazing, and their product being OSS is one of the reasons we chose it. But it just keeps getting better.

Incidentally, we're only using part of the product because we've already implemented most of these new features (prompt caching, execution, etc.) in our app. But with the API you can decide which parts are core to your business logic and outsource the parts you don't want to deal with to Langfuse.

I appreciate that it's not an opinionated product.

marcklingen · 8 months ago
Thanks for the feedback.

Being unopinionated and API-first has been a core design decision. We want to build the building blocks that everyone needs while acknowledging that most Langfuse users are very sophisticated teams that have a clear idea of what they want to achieve. Over time we will build more abstractions for common workflows to make it easier to get started but new features will always start API-first.

More on this here: https://langfuse.com/why

lunarcave · 8 months ago
A happy Langfuse customer here!

We've been building an agent platform, and some of our customers wanted some way to exfil OTEL traces to their own setup. Initially we tried building our own, but then realised Langfuse does exactly what we needed. So we offered it as a first-class integration (and started using it internally).

Great product, and hope you guys continue to improve it!

marcklingen · 8 months ago
Thanks! Really enjoyed working with you and other project maintainers to help them offer more native LLM observability and evaluation to their users/communities. There is a lot that goes into making the observability/eval part scalable and useful, and requirements change on a weekly basis with new advancements. The same applies to other projects, so it makes a lot of sense to integrate.

Overview of community integrations: https://langfuse.com/docs/integrations/overview

Packages that depend on Langfuse: https://langfuse.com/faq/all/packages-depending-on-langfuse

extr · 8 months ago
Very timely post/update, was just checking out your product. IMO it is one of the best solutions I've looked at. Appreciate your dedication to self-hosting; for us it's not really practical to have traces with potentially sensitive customer data sitting around on some external company's server somewhere (no offense).
marcklingen · 8 months ago
Thank you for the kind words! Let us know if you have any questions or feedback regarding the self-hosting documentation and experience. We collaborate with many teams that have diverse security needs, including HIPAA, PCI, and on-premises deployments on bare metal without internet access.
ddtaylor · 8 months ago
You guys just saved me a lot of trouble. Amazing work everyone wow.
tmshapland · 8 months ago
Seems like Langfuse is becoming the standard. Whenever I talk to other builders, they're using Langfuse.
mdeichmann · 8 months ago
Thank you! If these builders have some feedback to share, ask them to reach out to us :)