
Highlights

  • I’ve been operating large ClickHouse clusters for years. Here’s what I’ve learned about architecture, storage, upgrades, config, testing, costs, and ingestion. (View Highlight)
  • In that time, I’ve dealt with ClickHouse daily, helped start a company that uses it, sent critical changes to the database project, and managed many petabyte-scale clusters. Setting up a cluster is easy, the hard part is keeping it running. Let me go through the good and the bad parts, focusing on the problems you may find (so you can avoid them). (View Highlight)
  • The architecture ClickHouse proposes for managing large clusters uses replicas and shards. Basically, you split your data into different buckets (e.g., per hash(user_id)), and every bucket goes to a shard. Then you have different replicas (copies of the data) for each shard. (View Highlight)
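    A minimal sketch of what that looks like in ClickHouse DDL, assuming a cluster named `my_cluster` and hypothetical table/column names: one replicated table per shard, plus a Distributed table that routes rows by a hash of user_id.

    ```sql
    -- Local, replicated table: one copy per replica within each shard
    -- ({shard} and {replica} macros come from each node's config)
    CREATE TABLE events_local ON CLUSTER my_cluster
    (
        user_id    UInt64,
        event_time DateTime,
        payload    String
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
    PARTITION BY toYYYYMM(event_time)
    ORDER BY (user_id, event_time);

    -- Distributed table: sends each row to a shard based on a hash of user_id
    CREATE TABLE events ON CLUSTER my_cluster AS events_local
    ENGINE = Distributed(my_cluster, currentDatabase(), events_local, cityHash64(user_id));
    ```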
  • This is a pretty standard and straightforward architecture, but it’s starting to feel like the wrong approach for these kinds of systems. Many people now believe that you have to use cloud storage and separate compute from storage. I don’t completely agree, but this approach does offer advantages for cluster management and reduced costs. I’ll discuss this in more detail later. (View Highlight)
  • A few years ago, we started with a pretty basic system with no shards, just replicas. We’d scale the replicas vertically to handle larger queries, and we’d add more replicas to handle more traffic. Cloud storage was not an option, so we used local SSD disks (to reduce latency). Adding shards would have been an option if we had to process more data in each query, but we ended up not doing it because re-sharding was super hard and none of our customers had that need (well, they did, but if you are smart enough in your data schema design, you can put it off). (View Highlight)
  • That configuration is pretty easy to handle; you add a load balancer in front of all the replicas and route the traffic coming from your app to a replica depending on the request type (I’ll talk more about this later). We have an HTTP load balancer with some logic to handle instructions from the backend. So the backend (apps in the diagram) makes decisions about where to send a particular request based on load, replica type, consistent hashing (to leverage caches, more on this in Storage), and many more things. I don’t think you need all that for a basic install, but you’ll need some of it. (View Highlight)
  • A quick note about HTTP: ClickHouse offers a TCP connector with a native protocol, but we don’t use it. It does not offer many advantages for the type of application we build, and HTTP allows us to use a lot of battle-tested tooling. (View Highlight)
  • This architecture can become really expensive, mainly because you need a replica of all the data in all the machines (depending on how you load balance and the availability you want to have). Say you have a 300TB table and you need to handle 1000 QPS. If each replica can manage 100 QPS, you’d need 10 replicas, so 300 * 10 = 3000TB. If you are using SSDs (more on storage later), you have a problem. (View Highlight)
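    The sizing math generalizes to a simple back-of-the-envelope formula (numbers from the example above):

    ```latex
    \text{replicas} = \left\lceil \tfrac{\text{target QPS}}{\text{QPS per replica}} \right\rceil
                    = \left\lceil \tfrac{1000}{100} \right\rceil = 10,
    \qquad
    \text{raw storage} = 300\,\text{TB} \times 10 = 3\,\text{PB}
    ```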
  • I always recommend having a replica just for writes, people call this compute-compute separation. We do this by default and handle the failover and everything in case of error or overload. (View Highlight)
  • You might also decide to split the cluster into more replicas depending on the different workloads. If you have queries that need a particular p99 objective, for example, you may want to isolate those queries to a replica and keep it under 40% load so high percentiles stay stable. Real-time sub-second queries are expensive; forget about those cheap batch jobs running on spot instances. (View Highlight)
  • ClickHouse open source faces a significant challenge: limited support for cloud storage. Modern OLAP databases and data systems should leverage cloud storage for cost efficiency and independent scaling of compute and storage resources. Snowflake established this standard over a decade ago, and ClickHouse (open source) lags behind, while other systems like StarRocks are way ahead. (View Highlight)
  • Let me give you a quick tutorial on how storage works in ClickHouse (and other databases). As I said, you can split your data into “buckets” and put each bucket in a shard. Each shard can have 1 or more replicas (copies) of the data. (View Highlight)
  • The data can be stored locally, in which case every replica needs to copy all the data from other replicas. You insert data in one shard, and the data is replicated to others. In ClickHouse, that’s managed by a central system called ZooKeeper. (View Highlight)
  • The data can also be stored in a central system, like S3, in which case replicas can store the data there instead of on a local disk. Each replica could store a copy or simply point to other replica data (that’s zero-copy replication). (View Highlight)
  • There are a few cases where using cloud storage does not make sense (more on this, yes, later) but, in most cases, having compute-storage separation is the way to go. What I describe here, unless otherwise stated, uses cloud storage. (View Highlight)
  • In ClickHouse, you have all those options: local disk, cloud storage, and the zero-copy replication mechanism that relies on cloud storage. (View Highlight)
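    As a rough sketch, zero-copy replication over S3 can be enabled per table, assuming an S3-backed storage policy (here hypothetically named `s3_main`) is already defined in the server's storage configuration:

    ```sql
    CREATE TABLE events_s3
    (
        user_id    UInt64,
        event_time DateTime,
        payload    String
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_s3', '{replica}')
    ORDER BY (user_id, event_time)
    SETTINGS storage_policy = 's3_main',
             -- replicas share the same S3 objects instead of each keeping its own copy
             allow_remote_fs_zero_copy_replication = 1;
    ```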
  • To achieve cost-effective performance, we use a modified zero-copy feature (available in our private fork, something I’ll talk more about in the future, especially if you want to have your own), but you can use it as-is in the ClickHouse open source repo. Ours includes some changes, operation limitations, and optimizations. We also employ local SSDs for caching, adjusting the cache size based on usage to optimize performance. For customers prioritizing low latency, we use a hot/cold storage architecture with local SSDs and S3. This approach provides the required low latency that S3 alone cannot deliver. (View Highlight)
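    A hot/cold layout like the one described can be sketched with a TTL move rule, assuming a hypothetical `hot_cold` storage policy with a local-SSD `hot` volume and an S3-backed `cold` volume; the 7-day cutoff is just an example:

    ```sql
    CREATE TABLE events_tiered
    (
        user_id    UInt64,
        event_time DateTime,
        payload    String
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_tiered', '{replica}')
    ORDER BY (user_id, event_time)
    -- recent data stays on local SSD; older parts move to the S3-backed volume
    TTL event_time + INTERVAL 7 DAY TO VOLUME 'cold'
    SETTINGS storage_policy = 'hot_cold';
    ```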
  • About costs: The way ClickHouse stores data in S3 is inefficient; it uses a lot of write ops by default (tricks on how to improve this later, in the ingestion section). This has to do with the fact that the ClickHouse storage system was designed to be used with local storage, where write operations are “free” (you are limited by IOPS, which are not free, of course). It uses A LOT of files to store data parts and uses the OS page cache as the cache, so it was not designed with tiered storage in mind. Don’t worry, there are ways to make it better. (View Highlight)
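    One way to keep an eye on this is to watch how many active parts each table has, since every part is a bundle of files and, on S3, a bundle of requests. A query sketch over system.parts:

    ```sql
    SELECT
        database,
        table,
        count()                                 AS active_parts,
        sum(rows)                               AS total_rows,
        formatReadableSize(sum(bytes_on_disk))  AS on_disk
    FROM system.parts
    WHERE active
    GROUP BY database, table
    ORDER BY active_parts DESC
    LIMIT 20;
    ```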
  • My general advice is: if you want to use storage-compute separation, use zero-copy replication, but keep an eye on it, because you could lose data if you perform certain part operations. And of course, ClickHouse, Inc. could remove the feature any time (they had actually planned to do it, but they decided not to). If you are not brave enough, go with a hot/cold setup or just with SSDs, and keep a close eye on your costs. (View Highlight)
  • About compression: use ZSTD(1) or maybe ZSTD(2) if you want a good compromise between speed and compression. It’s better than LZ4 except for a few particular cases, and you can pick the compression codec on a per-column basis. From time to time, test other compression formats (write speed, read speed, and compression ratio) just in case your data plays well with something that’s not ZSTD. (View Highlight)
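    For example, per-column codecs are set in the table definition; the column names here are hypothetical, and the Delta/Gorilla pre-codecs are just common pairings worth benchmarking, not a universal recommendation:

    ```sql
    CREATE TABLE metrics
    (
        ts      DateTime                CODEC(Delta, ZSTD(1)),   -- delta-encode timestamps, then compress
        user_id UInt64                  CODEC(ZSTD(1)),
        path    LowCardinality(String)  CODEC(ZSTD(1)),
        value   Float64                 CODEC(Gorilla, ZSTD(1))  -- Gorilla often helps slowly-changing floats
    )
    ENGINE = MergeTree
    ORDER BY (user_id, ts);
    ```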
  • Upgrades are “easy” now. Well, as easy as updating a database cluster could be, which is never easy. The first update we did took us 3 hours with 2 weeks of preparation. With time, we learned how to do them without experiencing downtime, killing any app sending queries, or losing data. Then, we managed to do it without any performance degradation and finally as part of the CI/CD pipeline. It only took us 4 years to figure out. AFAIK, there is no other company doing this. (View Highlight)
  • My general workflow advice for updating a cluster:
    1. Add a new replica on the new version
    2. Monitor the replicas’ logs and check if something is wrong (see the query sketch after this list). You’ll have several heart attacks the first time because ClickHouse loves to log messages that look critical but aren’t.
    3. If you can avoid DDL changes like adding columns, data-skipping indices, or anything that modifies table structure, that’s better.
    4. It’s obvious, but do not use any new feature in the new replica until your cluster is up to date with the new version.
    5. Send some read traffic to check if everything is fine (timing, memory, resource usage…)
    6. Test writes/inserts
    7. Now it’s time to move real traffic. If you are unsure, keep the replica for a few hours.
    8. After that, start updating other replicas. (View Highlight)
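    A couple of monitoring queries that can help during a rolling upgrade (the cluster name `my_cluster` and the thresholds are assumptions; adjust to taste):

    ```sql
    -- Which version is each replica actually running?
    SELECT hostName() AS host, version() AS version
    FROM clusterAllReplicas('my_cluster', system.one);

    -- Any replica in trouble: read-only tables, growing queues, replication lag
    SELECT hostName() AS host, database, table, is_readonly, queue_size, absolute_delay
    FROM clusterAllReplicas('my_cluster', system.replicas)
    WHERE is_readonly OR queue_size > 100 OR absolute_delay > 60;
    ```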
  • Rarely: incompatible changes to the data storage format. It has happened to us 2-3 times in the last 2 years; we fixed at least 2 and are not aware of others. We noticed because we have all the possible combinations of data types in our customer base. It’s unlikely that you’ll hit this issue in a regular (not multi-tenant) deployment. In these cases, good luck: prepare yourself to dig deep into the filesystem or recover a backup. Do not upgrade anything right away after a release; wait at least a month. Don’t use any experimental feature unless you know what you are doing. (View Highlight)
  • SQL behavior changes. These are usually caused by bug fixes, but they can break your queries. Make sure you test that your queries still return the expected data. You have system.query_log with all the queries your cluster runs. Use it all the time. (View Highlight)
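    For instance, a query along these lines surfaces queries that started failing after an upgrade (the 1-day window is arbitrary):

    ```sql
    SELECT
        event_time,
        query_id,
        exception_code,
        exception,
        query
    FROM system.query_log
    WHERE type = 'ExceptionWhileProcessing'
      AND event_time > now() - INTERVAL 1 DAY
    ORDER BY event_time DESC
    LIMIT 50;
    ```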
  • Performance changes. Sometimes for the better, sometimes for the worse. They are not usually bad, but test your queries for performance; otherwise, you’ll need to roll back. (View Highlight)
  • Settings changes: this is something to review carefully, as you need to check the new settings and any changes to existing ones. Pay attention to features that are disabled by default or flags whose defaults change. The changelog is usually a good way to understand what to look for, but you may also want to have your CI pointing at the latest master build. (View Highlight)
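    A practical way to review this is to dump the non-default settings on both the old and new versions and diff the results; a sketch:

    ```sql
    -- Settings that differ from this build's defaults
    SELECT name, value, description
    FROM system.settings
    WHERE changed
    ORDER BY name;

    -- MergeTree-level settings can change between versions too
    SELECT name, value
    FROM system.merge_tree_settings
    WHERE changed
    ORDER BY name;
    ```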
  • Costs of the replicas/shards. The cost depends on the price of your infra, of course. (View Highlight)
  • At least 3 ZooKeeper replicas. It should be on a different machine/pod/whatever than the database. The reason is you want your ZK server to be as isolated as possible. If you have ZK in the same pod/machine and the machine gets overloaded, everything will be slow and eventually fail, leaving your tables in read-only mode (best scenario). A ZK problem means you are fucked. (View Highlight)
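    A quick health check for the “tables in read-only mode” failure described above:

    ```sql
    -- Replicated tables stuck in read-only mode usually point at ZooKeeper trouble
    SELECT database, table, is_readonly, is_session_expired, zookeeper_exception
    FROM system.replicas
    WHERE is_readonly OR is_session_expired;
    ```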
  • Storage: you need to decide where you are storing the data: locally, S3, or both, and crunch the numbers. There are some recommendations in the “storage” section, but you should take into account that disks can’t be resized down, and in S3 you need to account for operation costs. (View Highlight)
  • Likely a load balancer. A small one is fine. You need HA, so a couple of them. (View Highlight)
  • Backup storage: Depending on the retention policy and the amount of data, this could be expensive. (View Highlight)
  • How to decide how many cores you need: it depends a lot on your load and use case. There is no easy way, but as a rule of thumb, a 32-core machine can process around 5GB/s max (of uncompressed data) healthily. (View Highlight)
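    To sanity-check that rule of thumb against your own workload, you can estimate roughly how many bytes per second your heavier queries actually read, using system.query_log (the 1-second floor and 7-day window are arbitrary):

    ```sql
    SELECT
        quantile(0.95)(read_bytes / (query_duration_ms / 1000)) AS p95_bytes_per_sec,
        max(read_bytes / (query_duration_ms / 1000))            AS max_bytes_per_sec
    FROM system.query_log
    WHERE type = 'QueryFinish'
      AND event_date >= today() - 7
      AND query_duration_ms > 1000;
    ```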
  • About the people needed to handle the system: For a small cluster, a single part-time person is good enough. For loads with over 20k rows/s and people pushing changes, you may need a full-time person to handle the cluster and take a look at the crazy queries people are going to write. There is a difference of 3-4x in the hardware you need if you follow some basic rules. So, if you are not paying US salaries, having someone looking after the cluster pays off. (View Highlight)
  • Regardless, I don’t recommend handling a cluster by yourself. You want to solve your analytics product requirements, not spend your time handling ClickHouse. That’s why we created Tinybird, but if you have read this far, it means you want to DIY, and I feel you. (View Highlight)
  • Let me clarify something: we created Tinybird because we wanted to build an analytics application without all the pain I’m describing. We do not offer ClickHouse hosting; we solve the analytics problem. It’s a narrower problem than a data warehouse solves, and we solve it much better. (View Highlight)