ClickHouse: A Blazingly Fast DBMS With Full SQL Join Support - Under the Hood - Part 2
- ClickHouse is designed to be fast. Queries in ClickHouse are processed in a highly parallel fashion, taking all the necessary resources available on the current server and in many cases, utilizing hardware up to its theoretical limits. The more CPU cores and main memory a server has the more performance gains from parallel execution a query will have. (View Highlight)
- We will explore in this post how the Hash join algorithm is fast and most generic. The Parallel hash join algorithm can be faster with large right-hand side tables but requires more memory. Both the hash join and parallel hash join are memory-bound. Whereas the grace hash join is a non-memory bound version that spills data temporarily to disk. Grace hash join doesn’t require any sorting of the data and therefore overcomes some of the performance challenges of other join algorithms that spool data to disk like the (partial) merge join algorithm (we will cover this in the second part). (View Highlight)
- An in-memory hash table can serve 250 million totally random requests per second (and more than a billion if it fits in the CPU cache). This very fast lookup capability makes the in-memory hash table a natural general choice in ClickHouse for implementing joins when it is not possible or feasible to take advantage of table sorting. (View Highlight)
- ① All data from the right-hand side table is streamed (in parallel by 2 threads because max_threads = 2) into the memory, and then ClickHouse fills an in-memory hash table with this data.
② Data from the left-hand side table is streamed (in parallel by 2 threads because max_threads = 2) and ③ joined by doing lookups into the hash table. (View Highlight)
- Note that because ClickHouse takes the right-hand side table and creates a hash table for it in RAM, it is more memory efficient to place the smaller table on the right-hand side of the JOIN. (View Highlight)
- ClickHouse reads the same amount of total rows (and data): 100 million rows from the roles table + 1 million rows from the actors table. However, the join query with the larger roles table on the right-hand side is five times slower. This is because the default hash join is not thread-safe for inserting the right table’s rows into the hash table. (View Highlight)
- The parallel hash join algorithm is a variation of a hash join that splits the input data to build several hash tables concurrently in order to speed up the join at the expense of higher memory overhead. We sketch this algorithm below: (View Highlight)
- ① All data from the right table is streamed (in parallel by 2 threads because
max_threads = 2) into the memory. The data is streamed block-wise. The rows from each streamed block are split into 2 buckets (
max_threads = 2) by applying a hash function to the join keys of every row. We sketch this with the orange and blue colors in the diagram above. In parallel, one in-memory hash table is filled per bucket using a single thread. Note that the hash function for splitting the rows into buckets is different from the one that is used in the hash tables internally.
② Data from the left table is streamed (in parallel by 2 threads because
max_threads = 2), and the same ‘bucket hash function’ from step ① is applied to the join keys of each row for determining the corresponding hash table and the rows are ③ joined by doing lookups into the corresponding hash table (View Highlight)
- The parallel hash join was roughly 100% faster than the standard hash join, but had more than twice the peak memory consumption, although the amount of rows and data read, as well as the size of the right-hand side table is the same for both queries.
The reason for this much higher memory consumption is that the query was run on a node with 30 CPU cores and, therefore, with a
max_threads setting of 30. This means that, as we will demonstrate below, 30 concurrent hash tables were used. (View Highlight)
- Both the hash and parallel hash join algorithms described above are fast but memory-bound. If the right-hand side table doesn’t fit into the main memory, ClickHouse will raise an OOM exception. In this situation, ClickHouse users can sacrifice performance and use a (partial) merge algorithm (described in the next post) that (partially) sorts the table’s data into external storage before merging it. (View Highlight)
- ① All data from the right table is streamed block-wise (in parallel by 2 threads because
max_threads = 2) into the memory. The rows from each streamed block are split into 3 buckets (because
grace_hash_join_initial_buckets = 3) by applying a hash function to the join keys of every row. We sketch this with the orange, blue, and green colors in the diagram above. An in-memory hash table is filled with rows from the first (orange) bucket. The joining of the other two (green and blue) buckets from the right_table is delayed by saving them to temporary storage.
Note that if the in-memory hash table grows beyond the memory limit (as set by
max_bytes_in_join), ClickHouse dynamically increases the number of buckets and recomputes the assigned bucket for each row. Any rows which don’t belong to the current bucket are flushed and reassigned.
Also note that ClickHouse rounds the set value for
grace_hash_join_initial_buckets up to the closest power of two. Therefore as 3 is rounded up to 4, and 4 initial buckets are used. We use 3 buckets in our diagrams for readability, and there is no crucial difference to the inner workings with 4.
② Data from the left table is streamed in parallel by 2 threads (
max_threads = 2), and the same ‘bucket hash function’ from step ① is applied to the join keys of each row for determining the corresponding bucket. Rows corresponding to the first bucket are ③ joined (as the corresponding hash table is in memory). The joining of the other buckets is delayed by saving them to temporary storage.
The key in steps ① and ② is that the ‘bucket hash function’ will consistently assign values to the same bucket, thereby effectively partitioning the data and solving the problem by decomposition. (View Highlight)
- As expected, the hash join was faster. However, the grace hash join consumed only half of the peak main memory.
The main memory consumption of the grace hash join can be reduced further by increasing the
grace_hash_join_initial_buckets setting. We demonstrate this by re-running the query with a value of 8 for the
grace_hash_join_initial_buckets (View Highlight)
- The run of grace hash join with 8 initial buckets consumed roughly 70% less main memory than the run with 3 initial buckets. For the sacrifice of higher execution time, the memory consumption can be reduced quite linearly by increasing the number of buckets. (View Highlight)