Data Pipelines with ClickHouse, dbt, and Superset

I recently wrapped up a contract for an initial version of an analytics pipeline for an online learning product. My responsibility was to create datasets and dashboards to answer questions related to course engagement (like "how many people watched the videos in this course?"). The system was already set up to store events in ClickHouse, so I'd take it from there, defining data models in dbt and visualizing them using Superset. In this post, I'll summarize some of the high-level objectives and talk a bit about my experience with each major component.

Objectives

This being an analytics system, there were some goals I could carry over from past experience: make the system easy to operate, make it easy to trace metric definitions, document dead-ends, and so on. There were also goals specific to this project and organization. What made this project unique is that, unlike any of my previous software work, I wouldn't be deploying it anywhere. My contract was to contribute to a large open source project, so I found myself unable to make assumptions about the environment in which everything would run. For this reason, minimizing operational burden became a higher priority than it might have been in my previous roles.

ClickHouse

I ended up spending a good amount of time building up my intuition around ClickHouse's performance and behavior. Looking back, I'd say three topics shaped my understanding of ClickHouse: materialized views; data storage configuration via primary keys; and query optimization. I'll talk about each topic and then end with how we decided to structure our data given what we learned.

Materialized views

The first significant departure from my mental model of databases was ClickHouse's implementation of materialized views. In ClickHouse, these are more like streaming transformations than typical materialized views. You provide a transformation query and a target table; when data is inserted into the base table, your query is run on that new data and the result is written to the target table. If that query fails, my understanding is that the entire write is rejected, not just the insert to the specific materialized view that failed. Timescale's continuous aggregates come to mind as a similar database feature.
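To make that concrete, here's a rough sketch of the pattern. The table and column names are made up for illustration, not taken from the actual project:

```sql
-- Source table that events are inserted into (hypothetical names).
CREATE TABLE raw_events
(
    event_time DateTime,
    event_type LowCardinality(String),
    course_id  String,
    user_id    String
)
ENGINE = MergeTree
ORDER BY (event_type, event_time);

-- Target table that the materialized view writes into.
CREATE TABLE video_watch_counts
(
    course_id   String,
    watch_count UInt64
)
ENGINE = SummingMergeTree
ORDER BY course_id;

-- Less a "view" than an insert trigger: each block of rows inserted into
-- raw_events is run through this query and the result is appended to
-- video_watch_counts.
CREATE MATERIALIZED VIEW video_watch_counts_mv
TO video_watch_counts
AS
SELECT course_id, count() AS watch_count
FROM raw_events
WHERE event_type = 'video_watched'
GROUP BY course_id;
```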

Primary key and order by

The other main adjustment to my mental model was around table indexes and how they affect storage and query performance. Something that ClickHouse's online training emphasized is that query times are massively influenced by the amount of data that needs to be read from disk. This is determined by the primary key and order by clauses in a table definition.

Primary keys determine how data is sorted and indexed on disk, and queries only read the portions of the table that can match the key(s) used in the query. For example, if you're tracking different kinds of events and often limit queries to one event type at a time, you could include the event type in your primary key. That way, you're only reading in the data that's relevant to your query. If your data includes the timestamp of the event, you probably wouldn't want to lead your key with that field, since nearly every value would be unique. This reminds me a lot of Hive, where indexes (or "partitions", as they're called in Hive) are stored as key/value pairs in the file system (e.g. /data/event_type=click/part-00001.parquet); in ClickHouse, that directory-per-value layout comes from the separate partitioning key rather than the primary key.

In addition to the primary key, the order by clause specifies how data stored on disk should be sorted (the primary key has to be a prefix of it). To reuse the click event example, adding the event timestamp after the event type in the table's order by clause would be a good fit.
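Here's a sketch of how the click-event example might look as a table definition; the names are hypothetical:

```sql
CREATE TABLE click_events
(
    event_time DateTime,
    event_type LowCardinality(String),
    user_id    String,
    url        String
)
ENGINE = MergeTree
-- One set of part directories per event type, like Hive's key=value layout.
PARTITION BY event_type
-- Sparse index used to skip data when queries filter on event_type.
PRIMARY KEY event_type
-- On-disk sort order; must start with the primary key columns.
ORDER BY (event_type, event_time);
```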

Query optimization

My exploration of ClickHouse's query optimization mostly focused on how (or whether) ClickHouse leverages table indexes. Querying tables directly worked great, and the explain output confirmed that data was being skipped in the right places. However, index usage seemed to be pretty unpredictable (or predictably unused) once queries were wrapped in a view or when joins were involved. For example, I found that if a view contains window functions, primary keys are not used: the entire table is read and processed before the final result is filtered. In the case of joins, indexes are only applied to the left-hand side table. If you want to use an index for any other table, you have to add a CTE or subquery that uses the index and then join against that. I started noticing this when my queries began failing with out-of-memory errors (ClickHouse will stop a query if it detects that a dataset can't fit into memory), since the default join algorithm loads the entire right-hand side table into memory.
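As a sketch of that workaround (with made-up table names), pre-filtering the right-hand side in a CTE lets its index prune data before the join has to pull it into memory:

```sql
-- Hypothetical tables: video_events (facts) and course_dim (dimension).
WITH filtered_courses AS
(
    -- The filter is applied here, where course_dim's index can be used,
    -- rather than after the join.
    SELECT course_id, course_name
    FROM course_dim
    WHERE course_id IN ('course-a', 'course-b')
)
SELECT
    e.course_id,
    c.course_name,
    count() AS watch_count
FROM video_events AS e
INNER JOIN filtered_courses AS c ON e.course_id = c.course_id
WHERE e.event_type = 'video_watched'
GROUP BY e.course_id, c.course_name;
```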

These limitations present a significant hurdle. Window functions are an essential part of my SQL toolkit, and any attempt I made to replace them usually involved joining a table to itself. Additionally, any final output of our query pipeline would join our processed data with a dimension table to bring in columns like human-readable entity names. Those dimension tables could easily grow to a size where they can't fit into memory.

Approach for using ClickHouse

With those topics in mind, the data materialization strategy for this project was: rely heavily on views, and fall back to incremental table materialization wherever a view doesn't perform well enough.

Views were preferred for two reasons. One is that views allow the system to report on data as soon as it arrives in the source tables. I don't have a clear benchmark, but while testing locally the time from a browser interaction to an updated report was under a minute. The other reason is that views minimize the operational burden: every model that needs a scheduled refresh is another opportunity for a job to fail, creating more work for the teams tasked with running these services.

In a kind of reversal of those priorities, incremental table updates were preferred over materialized views, despite materialized views sharing the freshness benefits of plain views. This decision came down to two operational drawbacks. For starters, materialized views are not supported by the ClickHouse dbt plugin. And even if they were, we were unclear on how to update a materialized view's definition without dropping data. Adding a periodic job to write new data to a table, while adding operational burden, minimized the risk of dropping data, so we opted for that. It could be that materialized views can be updated atomically, but exploring that unknown was not a priority for this initial version.
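For illustration, the incremental fallback looks roughly like this as a dbt model. This is a sketch assuming the dbt-clickhouse adapter; the model, source, and column names are all hypothetical:

```sql
-- models/video_watches.sql (hypothetical model name)
{{ config(
    materialized='incremental',
    order_by='(course_id, event_time)'
) }}

SELECT
    course_id,
    user_id,
    event_time
FROM {{ source('tracking', 'raw_events') }}
WHERE event_type = 'video_watched'
{% if is_incremental() %}
  -- On scheduled runs, only pull rows newer than what the target already has.
  AND event_time > (SELECT max(event_time) FROM {{ this }})
{% endif %}
```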

Ultimately, despite the limitations of ClickHouse's query optimization, we opted to continue with the strategy of using views. This would heavily influence our usage of dbt and Superset.

dbt

Ideally, we would use dbt to define, document, and test all of our models. However, given the limits of ClickHouse's query optimization, dbt ended up holding a rather thin layer of transformation, with the client-facing datasets defined in Superset. dbt is still useful, even with a limited number of models under its purview, as we could still leverage its documentation and testing features. This was mostly for the source data, since our most interesting client-facing models would be defined elsewhere, but that's still better than no documentation or tests.
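As a small example of the testing side, a dbt singular test is just a SQL file that fails if it returns any rows. Something along these lines (with a hypothetical source and columns) is the kind of check we could keep in dbt:

```sql
-- tests/assert_no_future_events.sql (hypothetical singular test)
-- dbt runs this query and treats any returned rows as test failures.
SELECT *
FROM {{ source('tracking', 'raw_events') }}
WHERE event_time > now()
```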

Superset

Finally, the visualization piece of the puzzle. The goal of this whole project was to visualize data in Superset, and to do that one must make a dashboard. To make a dashboard, you need a few charts. Every chart relies on one (and exactly one, I think) dataset. A dataset is either a physical dataset (i.e. a table or view in a database) or a virtual dataset (a SQL query).

The gist of this is: a lot of what I'd hoped to define in dbt was defined as a virtual dataset in Superset. By the time I arrived at the Superset side of things, my colleagues had code in place to take yaml and SQL files, render them as jinja templates, and import them into Superset. I could write jinja-templated SQL for Superset datasets, and the resulting queries ultimately looked pretty similar to what they would have been in dbt. I am deeply grateful for my colleagues who handled the task of figuring out how importing assets works in Superset. I've tried learning how to import assets into Superset twice — first at a previous job and again at the start of this project — and I still don't really know how they figured it out.

Superset's SQL templating enabled a lot of the performance wins. Every dataset could be templated to take advantage of the dashboard's filters, which addressed the predicate pushdown issues that kept us from defining these models in dbt.
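As a sketch (again with hypothetical table and column names), a virtual dataset can use Superset's filter_values() template macro to inject whatever a matching dashboard filter currently holds straight into the WHERE clause:

```sql
-- Hypothetical virtual dataset. filter_values('course_id') returns the values
-- currently selected in a dashboard filter on that column, so the predicate
-- ends up in the query ClickHouse sees instead of being applied afterwards.
SELECT
    course_id,
    uniqExact(user_id) AS viewers
FROM video_events
WHERE event_type = 'video_watched'
{% if filter_values('course_id') %}
  AND course_id IN (
    {{ "'" + "','".join(filter_values('course_id')) + "'" }}
  )
{% endif %}
GROUP BY course_id
```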

Conclusion, next steps, and some ideas for the future

In terms of the overall project, there should be enough for stakeholders to interact with and comment on. I really enjoyed this project and the people I got to work with, though I'm pretty bummed that so much logic lives in Superset. Something I've been thinking about over the past few days is writing a dbt plugin that generates Superset asset definitions instead of creating views and tables in ClickHouse. After all, Preset boasts a dbt integration. Could we get some of the same benefits without going through Preset? Could we continue to define data transformations in dbt, but have more control over where the final definitions live?