ClickHouse raises $350M Series C

Does anyone use clickhouse in production? I was initially pretty impressed but when I really put it through its paces I could OoM it as soon as I actually started querying non-trivial amounts of data:

https://github.com/ClickHouse/ClickHouse/issues/79064

fishtoaster · 3 months ago

Yep. Clickhouse is absolutely great for tons of production use cases.

Unless you try to join tables in it, in which case it will immediately explode.

More seriously, it's a columnar data store, not a relational database. It'll definitely pretend to be "postgres but faster", but that's a very thin and very leaky facade. You want to do a massively complex set of selects and conditional sums over one table with 3b rows and TBs of data? You'll get a result in tens of seconds without optimization. You want to join two tables that postgres could handle easily? You'll OOM a machine with TBs of memory.

So: good for very specific use cases. If you have those use cases, it's great! If you don't, use something else. Many large companies have those use cases.

Boxxed · 3 months ago

Yeah I think that's a good summary. For instance, clickbench is comprised of >40 queries and there's not a single join in them: https://github.com/ClickHouse/ClickBench/blob/main/clickhous...

adrian17 · 3 months ago

The majority of our queries have joins (plus our core logic often depends on fact table expansion with `arrayJoin()`s) before aggregations and we're doing fine. AFAIK whenever we hit memory issues, they are mostly due to high-cardinality aggregations (especially with uniqExact), not joins. But I'm sure it can depend on the specifics.
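To illustrate why exact distinct counts get expensive at high cardinality, here is a rough Python sketch. It is not how `uniqExact` or any ClickHouse function is implemented; `exact_distinct`, `kmv_distinct`, and `hash01` are hypothetical names, and the approximate side uses a generic k-minimum-values estimator picked purely for illustration. The exact version has to remember every distinct value, while the estimator keeps at most `k` hashes no matter how many distinct values pass through.

```python
import hashlib
import heapq

def hash01(value):
    """Map a value to a deterministic pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def exact_distinct(values):
    """Exact count: the set grows with the number of distinct values."""
    return len(set(values))

def kmv_distinct(values, k=1024):
    """K-minimum-values estimate: memory is bounded by k hashes."""
    heap = []        # max-heap (via negation) of the k smallest hashes seen
    members = set()  # hashes currently stored in the heap
    for v in values:
        h = hash01(v)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the current largest stored hash with the smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: count is exact
    return int((k - 1) / -heap[0])

# 100_000 distinct values, each appearing three times.
values = [i % 100_000 for i in range(300_000)]
exact = exact_distinct(values)
approx = kmv_distinct(values)
print(exact, approx)
```

The estimate carries a relative error of roughly 1/sqrt(k), so whether that trade is acceptable depends on the query.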

hodgesrm · 3 months ago

> More seriously, it's a columnar data store, not a relational database.

Could you explain why you don't think ClickHouse is relational? The storage is an implementation detail. It affects how fast queries run but not the query model. Joins have already improved substantially and will continue to do so in future.

hodgesrm · 3 months ago

It's used in production by many thousands of companies at this point. The ClickHouse Inc numbers are just a fraction of the total users.

p.s., It's also possible to break ClickHouse as you demonstrated. It used to be a lot easier.

Boxxed · 3 months ago

I guess I'm curious how; I breathe on it wrong and it OoMs.

mplanchard · 3 months ago

Yes (via Clickhouse Cloud, which is pretty reasonably priced).

It’s important to structure your tables and queries in a way that aligns with the ordering keys, in order to optimize how much data needs to be loaded into RAM. You absolutely CANNOT just replicate your existing postgres DB and its primary keys or whatever over to CH. There are tricks like projections and incremental materialized views that can help to get the appropriate “lenses” for your queries. We use incremental MVs to, for example, continuously aggregate all-time stats about tens of billions of records. In general, for CH, space is cheap and RAM is expensive, so it’s better to duplicate a table’s data with a different ordering key than to make an inefficient query.
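As a rough illustration of the incremental materialized view idea (a Python sketch of the concept only, not ClickHouse syntax or internals; `IncrementalStats` and `on_insert` are made-up names): instead of rescanning all rows at query time, each inserted batch updates a small running aggregate state.

```python
from collections import defaultdict

class IncrementalStats:
    """Running per-key count and sum, updated once per inserted batch."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def on_insert(self, batch):
        """Fold a newly inserted batch of (key, value) rows into the state."""
        for key, value in batch:
            self.count[key] += 1
            self.total[key] += value

    def mean(self, key):
        """Answer from the aggregate state without touching the raw rows."""
        return self.total[key] / self.count[key]

stats = IncrementalStats()
stats.on_insert([("a", 1.0), ("b", 4.0)])  # first batch
stats.on_insert([("a", 3.0)])              # later batch, only new rows processed
print(stats.mean("a"), stats.count["b"])
```

The state is tiny compared with the raw data, which is the "space is cheap, RAM is expensive" trade described above.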

As long as the queries align with the ordering keys, it is insanely fast and able to enable analytics queries for truly massive amounts of data. We’ve been very impressed.

Boxxed · 3 months ago

Well that's exactly my complaint. The bug I filed above was pretty much the optimal case (one huge table, one very small table, both ordered by the join key) and it still OoMs.

AlexClickHouse · 3 months ago

Thanks for creating this issue, it is worth investigating!

I see you also created similar issues in Polars: https://github.com/pola-rs/polars/issues/17932 and DuckDB: https://github.com/duckdb/duckdb/issues/17066

ClickHouse has a built-in memory tracker, so even if there is not enough memory, it will stop the query and send an exception to the client, instead of crashing. It also allows fair sharing of memory between different workloads.

You need to provide more info on the issue for reproduction, e.g., how to fill the tables. 16 GB of memory should be enough even for a CROSS JOIN between a 10 billion-row and a 100-row table, because it is processed in a streaming fashion without accumulating a large amount of data in memory. The same should be true for a merge join.
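A minimal Python sketch of what "processed in a streaming fashion" means here, under the assumption that only the small side is held in memory. This is an illustration of the idea, not ClickHouse's join implementation; `big_table` and `streaming_cross_join` are hypothetical names.

```python
def big_table(n):
    """Yield rows one at a time; stands in for a huge table read from disk."""
    for i in range(n):
        yield i

def streaming_cross_join(big_rows, small_rows):
    """Keep only the small side resident and stream the big side past it."""
    small = list(small_rows)      # 100 rows: cheap to hold in memory
    for left in big_rows:         # one big-side row in memory at a time
        for right in small:
            yield (left, right)

# Aggregate over the join without materializing the 10_000 * 100 result rows.
count = 0
total = 0
for left, right in streaming_cross_join(big_table(10_000), range(100)):
    count += 1
    total += right
print(count, total)
```

Memory use depends on the small side and the consumer, not on the number of rows on the big side.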

However, there are places where a large buffer might be needed. For example, if you insert data into a table backed by S3 storage, it requires a buffer that can be on the order of 500 MB.

There is a possibility that your machine has 16 GB of memory, but most of it is consumed by Chrome, Slack, or Safari, and not much is left for ClickHouse server.

Boxxed · 3 months ago

Yeah I feel like I'm on crazy pills, I'm OoM'ing all these big data tools that everyone loves very trivially -- duckdb OoM'd just loading a CSV file, and Polars OoM'd just reading the first couple rows of a parquet file?

I do want to get a better reproduction on CH because it seems like it's an interplay between the INSERT INTO...SELECT. It's just a bit of work to generate synthetic data with the same profile as my production data (for what it's worth I did put quite a bit of effort into following the doc guidelines for dealing with low-memory machines).

owenthejumper · 3 months ago

I find Clickhouse fascinating, really good, and also really tough to run. It's a non-linear memory hog. It probably needs 32GB RAM for the basics to run, otherwise it will OOM on a minimal amount of data. That said, it won't "OOM" as in crash. It will just report that the query would use too much memory and abort it.

david38 · 3 months ago

It’s fantastic but it’s a columnar store. It’s not a Postgres replacement.

hackitup7 · 3 months ago

Yes for relatively large workloads

_gmax0 · 3 months ago

Heard from the grapevine that CloudFlare uses it for their analytics.

tveita · 3 months ago

They don't make a secret of it: https://blog.cloudflare.com/log-analytics-using-clickhouse/

Clickhouse is great, but like any database if you run it at scale someone must tend to it.

lossolo · 3 months ago

7 years, 24/7 high volume, self hosted, no issues really.

Dead Comment

Is there an ELI5 for this company? I'm having a difficult time understanding it from their website. Is it an alternative to Postgres etc? Something that runs on top of it? And analyzes your DB automatically?

jameslk · 3 months ago

When Postgres takes a while to answer analytical questions like "what's the 75th percentile of response time for these 900-some billion request rows, grouped by device, network, and date for the past 30 days", that's when you might want to try out ClickHouse

cluckindan · 3 months ago

Or literally any other OLAP database.

Is it a surprise that OLTP is not efficient at aggregation and analytics?

NunoSempere · 3 months ago

That seems like the kind of problem that would be easily done through monte-carlo approximation? How hard is it to get 1M random rows in a postgres database?
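For what the sampling idea looks like in practice, here is a small Python sketch on synthetic data. The exponential "response times", the sample size, and the `percentile` helper are all assumptions made up for illustration, not anything from a real workload.

```python
import math
import random

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(math.ceil(q / 100 * len(ordered)) - 1, 0)
    return ordered[rank]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Synthetic response times with a mean of about 100 (arbitrary units).
population = [rng.expovariate(1 / 100) for _ in range(200_000)]
sample = rng.sample(population, 10_000)

exact = percentile(population, 75)
approx = percentile(sample, 75)
print(exact, approx)
```

One caveat: with a grouping like device x network x date, rare groups may end up with very few sampled rows, so the per-group estimates can be much noisier than the overall one.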

jgalt212 · 3 months ago

I'm not sure storing 900B or 900MM records for analytics benefits anyone other than AWS. Why not sample?

NewJazz · 3 months ago

I'm struggling with TimescaleDB performance right now and wondering if the grass is greener.

bandoti · 3 months ago

Or if you have to use it because you’re self-hosting PostHog :)

arecurrence · 3 months ago

Clickhouse has a wide range of really interesting technologies that are not in Postgres; fundamentally, it's not an OLTP database like Postgres but is aimed more at OLAP workloads. I really appreciate Clickhouse's focus on performance, and quite a bit of work goes into optimizing memory allocation and operations across different data types.

The heart of Clickhouse are these table engines (they don't exist in Postgres) https://clickhouse.com/docs/engines/table-engines . The primary column (or columns) is ordered in some way and adjacent values in memory are from the same column in the table. Index entries span wide areas (EG: By default there's only one key record in the primary index for every 8192 rows) because most operations in Clickhouse are aggregate in nature. Inserts are also expected to be in bulk (They are initially a new physical part that is later merged into the main table structure). A single DELETE is an ALTER TABLE operation in the MergeTree engine. :)
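A toy Python sketch of the sparse primary index idea, assuming a single sorted integer key and the 8192-row granule size mentioned above. It is a simplification for illustration, not ClickHouse's actual on-disk format; `build_sparse_index`, `granules_for_range`, and `scan` are hypothetical names.

```python
import bisect

GRANULE = 8192  # rows per index entry, matching the default mentioned above

def build_sparse_index(sorted_keys, granule=GRANULE):
    """Keep only the first key of every granule."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule)]

def granules_for_range(index, lo, hi):
    """Half-open range of granule numbers that may hold keys in [lo, hi].

    bisect_left - 1 is deliberately conservative: with duplicate keys the
    granule before a matching index entry can also contain `lo`.
    """
    first = max(bisect.bisect_left(index, lo) - 1, 0)
    last = bisect.bisect_right(index, hi)
    return first, last

def scan(sorted_keys, index, lo, hi, granule=GRANULE):
    """Read only the candidate granules, then filter row by row."""
    first, last = granules_for_range(index, lo, hi)
    candidates = sorted_keys[first * granule : last * granule]
    return [k for k in candidates if lo <= k <= hi]

keys = list(range(100_000))       # already sorted by the "primary key"
index = build_sparse_index(keys)  # 13 entries instead of 100_000
matches = scan(keys, index, 20_000, 20_100)
print(len(index), granules_for_range(index, 20_000, 20_100), len(matches))
```

The index stays small enough to keep in memory, and a range query on the key touches one granule here instead of the whole column.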

This structure allows it to literally crunch billions of values per second (brutally, not with pre-processing, erm, "tricks" although there is a lot of support for that in Clickhouse as well). I've had tables with hundreds of columns and 100+ billion rows that are nearly as performant as a million row table if I can structure the query to work with the table's physical ordering.

Clickhouse recommends not using nullable fields because of the performance implications (it requires storing a bit somewhere for each value). That's how much they care about perf and how close to the raw data type it is that their memory allocation uses. :)

porridgeraisin · 3 months ago

> Inserts are also expected to be in bulk (They are initially a new physical part that is later merged into the main table structure). A single DELETE is an ALTER TABLE operation in the MergeTree engine.

> They are initially a new physical part that is later merged into the main table structure

> A single DELETE is an ALTER TABLE operation

Can you explain these two further?

lbhdc · 3 months ago

It's a db company that offers an open source database and cloud managed services.

The database is OLAP, whereas Postgres is an OLTP database. Essentially it is very fast at complex queries, and is targeted at analytics workloads.

datavirtue · 3 months ago

Postgres has been used as the basis for several OLAP systems. These guys are probably using a modified Greenplum.

Silasdev · 3 months ago

SQL, OLAP, Primary use case is fast aggregations on append only data, like usage analytics.

It's fast, it's........ really fast!!

But you need to get comfortable with their extended SQL dialect that forces you to think a little differently than with usual SQL if you want to keep perf high.

simantel · 3 months ago

It's an alternative to Postgres in the sense that they're both databases. Read up on OLAP vs. OLTP to see the difference.

Dead Comment

doix · 3 months ago

I guess you could say it's an alternative to postgres. It's a different database, that's column oriented which makes different tradeoffs. I'd say DuckDB is a better comparison, if you're familiar with it.

pythonaut_16 · 3 months ago

Expanding for the original question:

Roughly speaking, Postgres is to SQLite what Clickhouse is to DuckDB.

OLTP -> Online Transaction Processing. Postgres and traditional RDBMS. Mainly focused on transactions and addressing specific rows. Queries like "show me all orders for customer X".

OLAP -> Online Analytical Processing. Clickhouse and other column-oriented databases. For analytical and calculation queries, like "show me the total value of all orders in March 2024". OLAP databases typically store data by column rather than by row, and usually have optimizations for storage space and query speed based on that. As a tradeoff they're typically slower for OLTP-type queries. Often you'd bring in an OLAP db like Clickhouse when you have a huge volume of data and your OLTP database is struggling to keep up.
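A tiny Python sketch of the row-versus-column layout difference, using made-up order data. It is only an illustration of the two access patterns, not how either Postgres or Clickhouse stores data internally.

```python
from array import array

orders = [
    {"customer": "X", "month": "2024-03", "value": 120.0},
    {"customer": "Y", "month": "2024-03", "value": 80.0},
    {"customer": "X", "month": "2024-04", "value": 50.0},
]

# Row layout (OLTP style): each record is stored together.
rows = orders

# Column layout (OLAP style): each field is stored as its own contiguous array.
columns = {
    "customer": [o["customer"] for o in orders],
    "month": [o["month"] for o in orders],
    "value": array("d", (o["value"] for o in orders)),
}

def orders_for_customer(rows, customer):
    """OLTP-style lookup: fetch whole records for one customer."""
    return [r for r in rows if r["customer"] == customer]

def total_for_month(columns, month):
    """OLAP-style aggregate: touch only the two columns the query needs."""
    return sum(v for m, v in zip(columns["month"], columns["value"]) if m == month)

print(len(orders_for_customer(rows, "X")), total_for_month(columns, "2024-03"))
```

With hundreds of columns, the aggregate reads two arrays and skips the rest, which is where much of the speedup on analytical queries comes from.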

joshstrange · 3 months ago

If you go into it with MySQL/Postgres knowledge you will probably hate it.

Source: me

I almost wish it didn’t use SQL so that it was clear how different it is. Nothing works like you are used to, footguns galore, and I hate zookeeper.

I’d replace it with Postgres in a heartbeat if I thought I could get away with it; I don’t think our data size really needs CH. Unfortunately, my options are “spin up a cluster on company resources to prove my point” or “spin it up on my own infra” (which is not possible since that would require pulling company data to my servers, which I would never do). So instead I’m stuck dealing with CH.

whobre · 3 months ago

It's not like Postgres at all, except on the very superficial level. It is an analytical engine like BigQuery, Snowflake, Teradata, etc...