Simplifying Your Data Stack With ClickHouse Cloud

Benjamin Wootton


Enterprise data and analytics platforms can be complex and expensive to operate. Businesses often find themselves running multiple databases to support different data use cases, together with a plethora of supporting tools for purposes such as ETL and analytics. Often, all of these technologies run in a cloud environment such as AWS or Azure, which brings additional complexity and overhead. All of this requires a large team to manage and carries significant manpower and license costs just to keep the lights on.

ClickHouse and ClickHouse Cloud can help to strip out this complexity and cost. Its unique features and properties allow you to consolidate onto a single analytical database technology which can serve multiple roles. In turn, this allows you to consolidate down to fewer copies of the data, or even a single master copy representing a real-time source of truth. The result is a leaner technology stack, better data quality and increased collaboration between the data experts in your business, as all data is stored in one place.

ClickHouse gets a lot of attention for its performance, but we think that in larger companies this architectural simplification is one of its main benefits. We wanted to take the time to provide more depth on this argument and really make the case.

One Database Serving Multiple Use Cases

Many businesses find themselves with multiple centralised data stores. There may be a data warehouse or two to support business intelligence workloads (e.g. Teradata and Snowflake), data lakes to store unstructured data (e.g. AWS S3), services for time series or machine generated log data (e.g. Splunk), observability systems (e.g. Datadog and New Relic) and real-time in-memory caches to serve user facing analytics (e.g. Redis).

ClickHouse is relatively unique in the database space in that it can viably serve many of these use cases. It is becoming a strong vanilla data warehouse which can support SQL based business intelligence (particularly as its JOIN performance improves), but one which can also scale to the time series and machine generated data that might otherwise have gone into a system such as Splunk. It is a very natural fit for observability and monitoring use cases, potentially avoiding the high cost of SaaS observability tools such as Datadog. Finally, it can also serve as an application backend powering the analytical functions of your user facing applications. This is one database viably replacing 5 or 6 different siloed data stores.
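
To make this concrete, here is a minimal sketch of the idea. The table and column names are hypothetical, but a single MergeTree table of machine generated events like this can back business intelligence aggregations, time series queries and observability dashboards alike.

-- One table of raw machine generated events (hypothetical schema)
CREATE TABLE app_events
(
    timestamp   DateTime64(3),
    service     LowCardinality(String),
    level       LowCardinality(String),
    user_id     UInt64,
    message     String,
    duration_ms Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (service, timestamp);

-- The same table answers both BI-style and observability-style questions
SELECT service, count() AS errors
FROM app_events
WHERE level = 'ERROR' AND timestamp >= now() - INTERVAL 1 DAY
GROUP BY service
ORDER BY errors DESC;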

Each time we remove a database, we simplify our estate. It's one less database to run and one less set of infrastructure to pay for and maintain. It's one less copy of the data and one less set of ETL code to keep running. It's one less set of skills required and one less license to pay. Data quality rises as we need to replicate data to fewer places. In addition, data analysts and data scientists can more easily join up disparate data and collaborate around the same single source of truth. Consolidating databases should be an absolutely key aim for data leaders and ClickHouse gives us this potential.

Simplify Your ETL

Data teams today spend a lot of effort on ETL - extracting data from source systems, transforming it between different formats and loading it into various systems, data warehouses and data lakes. This type of work is bread and butter for data engineers, but it comes with a huge manpower overhead. In spite of this, we still don't really get it right. On the ground, data teams are constantly faced with broken pipelines and data quality issues, and their users are left dissatisfied.

The fact is that the best ETL is no ETL. When we have consolidated onto fewer databases as described above, we simply don't need to do it. All of the drudgery associated with ETL can be avoided, and all of the maintenance overheads and data quality issues go away.

ClickHouse can also help us reduce ETL through its table engine abstraction. We can point ClickHouse at data that is stored in MySQL or PostgreSQL and query it directly without any ETL. Likewise, an increasing amount of data is stored in data lakes in Parquet and Iceberg formats, which we can also query directly.
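
As a rough illustration, the queries below use ClickHouse's PostgreSQL and S3 table functions to read external data in place. The connection details, bucket and table names are hypothetical, and the S3 example assumes the files are readable with the credentials configured on the server.

-- Query a PostgreSQL table in place, with no ETL step (hypothetical connection details)
SELECT customer_id, sum(amount) AS total_spend
FROM postgresql('pg-host:5432', 'shop', 'orders', 'reader', 'secret')
GROUP BY customer_id;

-- Query Parquet files sitting in a data lake directly (hypothetical bucket)
SELECT count()
FROM s3('https://my-bucket.s3.amazonaws.com/events/*.parquet', 'Parquet');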

Finally, if there is no choice but to use ETL, the ClickPipes product incorporated into ClickHouse Cloud is excellent. Within the same product we can set up a CDC stream to automatically ingest updates from PostgreSQL or MySQL into ClickHouse, with no third party tools to run.

Simplify Your Data Modelling And Orchestration

A significant amount of work has historically gone into making data reporting ready by preprocessing it and breaking it down into different denormalised tables. This used to happen as part of the ETL process and was referred to as data modelling, whilst in the more modern data stack it became known as analytics engineering and was popularised by dbt, which allowed us to define transformations in plain SQL.

Complementing this is the workflow orchestration space. Here we use tools such as Airflow and Dagster to run data pipelines and transformation jobs, handling things such as schedules and dependencies, logging and retries.

These tools all have their value, but I believe that ClickHouse can minimise the scope for them. This is because the performance of ClickHouse means that we can work with relatively raw and unprocessed data. We do not need to denormalise data for performance. We can simply build a thin layer of views over raw data to coerce it into the format we need to power our reports, dashboards and applications.
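
A minimal sketch of that thin view layer, with hypothetical table and column names: instead of a pre-built denormalised table, a view shapes the raw data at query time and the dashboards read from the view.

-- A view over raw data in place of a precomputed reporting table (hypothetical names)
CREATE VIEW daily_revenue AS
SELECT
    toDate(order_time) AS day,
    country,
    sum(amount)        AS revenue,
    uniqExact(user_id) AS buyers
FROM raw_orders
GROUP BY day, country;

-- Reports and dashboards query the view directly
SELECT *
FROM daily_revenue
WHERE day >= today() - 30
ORDER BY day;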

Simplify Your Application Architecture

Writing a real time, dynamic application with a lot of interactive analytics over fresh data can present some challenges for developers.

Firstly, most of the OLTP databases that they typically develop against are not optimised for analytics. They find that their applications slow down, and load on the database increases, as they include more metrics and analytics in their applications.

At this point, they may be tempted to pre-calculate numbers and perhaps store them in a cache to improve application performance and keep load off their database. This has worked for a long time, but it's undoubtedly more complexity at the application layer.

Nowadays, users expect real time numbers and websites and applications that update without a page refresh. This means that developers are then often pushed towards implementing a subscription system based on web sockets whereby updates can be pushed to the frontend. This leads to more complexity in our application architecture and infrastructure.

Because of the performance of ClickHouse, we can simply hit the database more frequently and get the latest data. Our web applications can then stay simpler, based on a traditional HTTP request/response cycle.
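
As a sketch of what that looks like, the application can run a query like the one below on each request, binding a server-side query parameter per user. The table and parameter names are illustrative.

-- Serve a user-facing metric straight from ClickHouse on each request (hypothetical names)
SELECT
    toStartOfMinute(timestamp) AS minute,
    count()                    AS views
FROM page_views
WHERE user_id = {user_id:UInt64}
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute;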

Reducing Requirement For Stream Processing

If you need to calculate and process data in real time, you may be tempted to move into the stream processing space and make use of technologies such as Kafka Streams and Flink.

Whilst these technologies are very impressive and powerful, they can also be relatively difficult to use and reason about. Running a Flink cluster, for example, is a significant operational undertaking in its own right.

Businesses that go down the stream processing route often end up with two parallel architectures (known as the Lambda architecture) - one for batch and one for streaming - which is yet another set of effort.

With ClickHouse, we can potentially avoid going down this rabbit hole and use it to solve both batch and streaming related problems. We can stream directly into ClickHouse tables using the Kafka table engine, and have analytics updated downstream using materialised views. This can delay the day that we actually need to build a stream processing architecture.
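
Here is a minimal sketch of that pattern, with a hypothetical topic, broker and schema: a Kafka engine table consumes the stream, and a materialised view pushes each batch of rows into an ordinary MergeTree table that queries and dashboards read from.

-- Kafka engine table that consumes the stream (hypothetical broker, topic and schema)
CREATE TABLE events_stream
(
    timestamp DateTime,
    user_id   UInt64,
    action    String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list  = 'events',
         kafka_group_name  = 'clickhouse_consumer',
         kafka_format      = 'JSONEachRow';

-- Destination table that analytics queries read from
CREATE TABLE events
(
    timestamp DateTime,
    user_id   UInt64,
    action    String
)
ENGINE = MergeTree
ORDER BY (action, timestamp);

-- Materialised view moving rows from the stream into the destination table
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT timestamp, user_id, action
FROM events_stream;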

Simplify The Data Science Workflow

We previously demonstrated how to use ClickHouse for data science work, including forecasting, anomaly detection and linear regression, without writing a line of Python.

We have found that a lot of work in the data science lifecycle can be carried out entirely within ClickHouse. For instance, instead of a separate feature engineering pipeline and repository, we can simply build views over raw data to coerce it into the format we need to train our models.

We can also use the analytical functions of ClickHouse to perform numerical processing inline inside the database, reducing the amount of code that we need to write outside of the database.
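
As a small illustration of this inline numerical processing, ClickHouse's aggregate and window functions can fit a simple regression or flag outliers directly in SQL. The table and column names below are hypothetical.

-- Fit a simple linear regression (returns the slope and intercept) in plain SQL
SELECT simpleLinearRegression(ad_spend, revenue) AS fit
FROM daily_marketing;

-- A basic z-score style anomaly check over the same metric
SELECT day, revenue
FROM
(
    SELECT
        day,
        revenue,
        avg(revenue)       OVER () AS mean_revenue,
        stddevPop(revenue) OVER () AS sd_revenue
    FROM daily_marketing
)
WHERE abs(revenue - mean_revenue) > 3 * sd_revenue;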

We have also shown how we use ClickHouse to record data from our model training runs and evaluations. This allows us to track the performance of our models and query it directly using SQL, without needing an external model tracking tool.
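
A sketch of what such a tracking table might look like, with a hypothetical schema: each training run is appended as a row and compared later with ordinary SQL.

-- Hypothetical model tracking table; one row per training run
CREATE TABLE model_runs
(
    run_id     UUID DEFAULT generateUUIDv4(),
    model_name String,
    trained_at DateTime DEFAULT now(),
    params     Map(String, String),
    rmse       Float64,
    r2         Float64
)
ENGINE = MergeTree
ORDER BY (model_name, trained_at);

-- Compare the most recent runs of a model
SELECT trained_at, rmse, r2
FROM model_runs
WHERE model_name = 'demand_forecast'
ORDER BY trained_at DESC
LIMIT 5;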

As we wrote here, we think that a simple data science notebook provided by a tool such as HEX running against ClickHouse is an incredibly powerful yet simple stack for data science and analytics work.

Reducing Operational Overhead With ClickHouse Cloud

The ClickHouse Cloud product is excellent. As it is fully managed, it allows you to avoid a significant amount of work around scaling, backups, monitoring and security.

Though the same can be said of the likes of Snowflake, Redshift and BigQuery, these systems do not have an on-premise, self-hosted option in the way that open source ClickHouse does. This means that ClickHouse gives us the ability to move between the two if requirements change, which I think is a key differentiator.

Simplification

Hopefully we have made the case above that ClickHouse can be a powerful tool for simplifying your data stack, serving multiple roles and taking the strain off other parts of your technology estate.

I do not claim that all of the technologies above become redundant with ClickHouse. At a certain level of scale, orchestration tools, specialised databases, stream processing and observability tools may all become relevant, and perhaps there are certain parts of your business where they still add value today. However, with ClickHouse at the heart of your data strategy, their use can be delayed or shrunk to simplify things and take out cost.
