The Cloudflare Blog

Eliminating Cold Starts 2: shard and conquer

Harris Hancock — Fri, 26 Sep 2025 13:00:00 GMT

Five years ago, we announced that we were Eliminating Cold Starts with Cloudflare Workers. In that episode, we introduced a technique to pre-warm Workers during the TLS handshake of their first request. That technique takes advantage of the fact that the TLS Server Name Indication (SNI) is sent in the very first message of the TLS handshake. Armed with that SNI, we often have enough information to pre-warm the request’s target Worker.

Eliminating cold starts by pre-warming Workers during TLS handshakes was a huge step forward for us, but “eliminate” is a strong word. Back then, Workers were still relatively small, and had cold starts constrained by limits explained later in this post. We’ve relaxed those limits, and users routinely deploy complex applications on Workers, often replacing origin servers. Simultaneously, TLS handshakes haven’t gotten any slower. In fact, TLS 1.3 only requires a single round trip for a handshake – compared to two round trips for TLS 1.2 – and is more widely used than it was in 2021.

Earlier this month, we finished deploying a new technique intended to keep pushing the boundary on cold start reduction. The new technique (or old, depending on your perspective) uses a consistent hash ring to take advantage of our global network. We call this mechanism “Worker sharding”.

What’s in a cold start?

A Worker is the basic unit of compute in our serverless computing platform. It has a simple lifecycle. We instantiate it from source code (typically JavaScript), make it serve a bunch of requests (often HTTP, but not always), and eventually shut it down some time after it stops receiving traffic, to re-use its resources for other Workers. We call that shutdown process “eviction”.

The most expensive part of the Worker’s lifecycle is the initial instantiation and first request invocation. We call this part a “cold start”. Cold starts have several phases: fetching the script source code, compiling the source code, performing a top-level execution of the resulting JavaScript module, and finally, performing the initial invocation to serve the incoming HTTP request that triggered the whole sequence of events in the first place.

Cold starts have become longer than TLS handshakes

Fundamentally, our TLS handshake technique depends on the handshake lasting longer than the cold start. This is because the duration of the TLS handshake is time that the visitor must spend waiting, regardless, so it’s beneficial to everyone if we do as much work during that time as possible. If we can run the Worker’s cold start in the background while the handshake is still taking place, and if that cold start finishes before the handshake, then the request will ultimately see zero cold start delay. If, on the other hand, the cold start takes longer than the TLS handshake, then the request will see some part of the cold start delay – though the technique still helps reduce that visible delay.
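The race described above boils down to a single subtraction. As a minimal sketch (the function name and the timings are hypothetical, chosen only for illustration), the delay a first request still sees is whatever part of the cold start outlasts the handshake:

```python
def visible_cold_start_delay_ms(handshake_ms: float, cold_start_ms: float) -> float:
    """Delay the first request still sees when the cold start runs
    in the background, in parallel with the TLS handshake."""
    return max(0.0, float(cold_start_ms) - float(handshake_ms))


# Hypothetical timings, not measurements.
small_worker = visible_cold_start_delay_ms(handshake_ms=30, cold_start_ms=5)
large_worker = visible_cold_start_delay_ms(handshake_ms=30, cold_start_ms=400)

print(small_worker)  # cold start finishes first: nothing left to wait for
print(large_worker)  # handshake finishes first: the remainder is visible
```

The handshake time is spent regardless, so it can hide a cold start up to its own length but no further.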

In the early days, TLS handshakes lasting longer than Worker cold starts was a safe bet, and cold starts typically won the race. One of our early blog posts explaining how our platform works mentions 5 millisecond cold start times – and that was correct, at the time!

For every limit we have, our users have challenged us to relax them. Cold start times are no different.

There are two crucial limits which affect cold start time: Worker script size and the startup CPU time limit. While we didn’t make big announcements at the time, we have quietly raised both of those limits since our last Eliminating Cold Starts blog post:

Worker script size (compressed) increased from 1 MB to 5 MB, then again from 5 MB to 10 MB, for paying users.
Worker script size (compressed) increased from 1 MB to 3 MB for free users.
Startup CPU time increased from 200ms to 400ms.

We relaxed these limits because our users wanted to deploy increasingly complex applications to our platform. And deploy they did! But the increases have a cost:

Increasing script size increases the amount of data we must transfer from script storage to the Workers runtime.
Increasing script size also increases the time spent in the script compilation phase.
Increasing the startup CPU time limit increases the maximum top-level execution time.

Taken together, cold starts for complex applications began to lose the TLS handshake race.

Routing requests to an existing Worker

With relaxed script size and startup time limits, optimizing cold start time directly was a losing battle. Instead, we needed to figure out how to reduce the absolute number of cold starts, so that requests are simply less likely to incur one.

One option is to route requests to existing Worker instances, where before we might have chosen to start a new instance.

Previously, we weren’t particularly good at routing requests to existing Worker instances. We could trivially coalesce requests to a single Worker instance if they happened to land on a machine which already hosted a Worker, because in that case it’s not a distributed systems problem. But what if a Worker already existed in our data center on a different server, and some other server received a request for the Worker? We would always choose to cold start a new Worker on the machine which received the request, rather than forward the request to the machine with the already-existing Worker, even though forwarding the request would avoid the cold start.

To drive the point home: Imagine a visitor sends one request per minute to a data center with 300 servers, and that the traffic is load balanced evenly across all servers. On average, each server will receive one request every five hours. In particularly busy data centers, this span of time could be long enough that we need to evict the Worker to re-use its resources, resulting in a 100% cold start rate. That’s a terrible experience for the visitor.

Consequently, we found ourselves explaining to users, who saw high latency while prototyping their applications, that their latency would counterintuitively decrease once they put sufficient traffic on our network. This highlighted the inefficiency in our original, simple design.

If, instead, those requests were all coalesced onto one single server, we would notice multiple benefits. The Worker would receive one request per minute, which is short enough to virtually guarantee that it won’t be evicted. This would mean the visitor may experience a single cold start, and then have a 100% “warm request rate.” We would also use 99.7% (299 / 300) less memory serving this traffic. This makes room for other Workers, decreasing their eviction rate, and increasing their warm request rates, too – a virtuous cycle!
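The arithmetic behind this example is easy to check. The snippet below is only a back-of-the-envelope sketch using the numbers from the text (300 servers, one request per minute); the variable names are ours:

```python
servers = 300
requests_per_minute = 1  # one visitor sending one request per minute

# Evenly load balanced, each server sees 1/300th of the traffic.
minutes_between_requests_per_server = servers / requests_per_minute
hours_between_requests_per_server = minutes_between_requests_per_server / 60

# Coalescing onto one server leaves 1 Worker instance instead of 300.
memory_saved_fraction = (servers - 1) / servers

print(f"{hours_between_requests_per_server:.1f} hours between requests per server")
print(f"{memory_saved_fraction:.1%} less memory when coalesced")
```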

There’s a cost to coalescing requests to a single instance, though, right? After all, we’re adding latency to requests if we have to proxy them around the data center to a different server.

In practice, the added time-to-first-byte is less than one millisecond, and is the subject of continual optimization by our IPC and performance teams. One millisecond is far less than a typical cold start, meaning it’s always better, in every measurable way, to proxy a request to a warm Worker than it is to cold start a new one.

The consistent hash ring

A solution to this very problem lies at the heart of many of our products, including one of our oldest: the HTTP cache in our Content Delivery Network.

When a visitor requests a cacheable web asset through Cloudflare, the request gets routed through a pipeline of proxies. One of those proxies is a caching proxy, which stores the asset for later, so we can serve it to future requests without having to request it from the origin again.

A Worker cold start is analogous to an HTTP cache miss, just as a request to a warm Worker is analogous to an HTTP cache hit.

When our standard HTTP proxy pipeline routes requests to the caching layer, it chooses a cache server based on the request's cache key to optimize the HTTP cache hit rate. The cache key is the request’s URL, plus some other details. This technique is often called “sharding”. The servers are considered to be individual shards of a larger, logical system – in this case a data center’s HTTP cache. So, we can say things like, “Each data center contains one logical HTTP cache, and that cache is sharded across every server in the data center.”

Until recently, we could not make the same claim about the set of Workers in a data center. Instead, each server contained its own standalone set of Workers, and they could easily duplicate effort.

We borrow the cache’s trick to solve that. In fact, we even use the same type of data structure used by our HTTP cache to choose servers: a consistent hash ring. A naive sharding implementation might use a classic hash table mapping Worker script IDs to server addresses. That would work fine for a set of servers which never changes. But servers are actually ephemeral and have their own lifecycle. They can crash, get rebooted, taken out for maintenance, or decommissioned. New ones can come online. When these events occur, the size of the hash table would change, necessitating a re-hashing of the whole table. Every Worker’s home server would change, and all sharded Workers would be cold started again!

A consistent hash ring improves this scenario significantly. Instead of establishing a direct correspondence between script IDs and server addresses, we map them both to a number line whose end wraps around to its beginning, also known as a ring. To look up the home server of a Worker, first we hash its script ID, and then we find where it lies on the ring. Next, we take the server address which comes directly on or after that position on the ring, and consider that the Worker’s home.

If a new server appears for some reason, the Workers that lie between it and the preceding server on the ring get re-homed to it, but none of the other Workers are disturbed. Similarly, if a server disappears, only the Workers which lay between it and the preceding server get re-homed, this time to the next server on the ring.
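To make the ring lookup concrete, here is a minimal sketch in Python. It is not our production implementation: the `HashRing` class, the `ring_position` helper, the choice of SHA-256, and the use of a single ring point per server are all assumptions made for illustration. A modulo-based lookup stands in for the naive approach so the two can be compared when one server is added.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a point on a 64-bit ring (hash choice is an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Hypothetical consistent hash ring with one point per server."""

    def __init__(self, servers):
        self._points = sorted((ring_position(s), s) for s in servers)
        self._positions = [p for p, _ in self._points]

    def home_server(self, script_id: str) -> str:
        # First server directly on or after the script's position...
        i = bisect.bisect_left(self._positions, ring_position(script_id))
        # ...wrapping around to the start of the ring if we ran off the end.
        return self._points[i % len(self._points)][1]


def modulo_home(script_id: str, servers) -> str:
    """Naive stand-in: index into the server list by hash modulo its size."""
    return servers[ring_position(script_id) % len(servers)]


servers = [f"server-{i}" for i in range(20)]
grown = servers + ["server-new"]
scripts = [f"script-{i}" for i in range(2000)]

before, after = HashRing(servers), HashRing(grown)
moved_ring = [s for s in scripts if before.home_server(s) != after.home_server(s)]
moved_modulo = [s for s in scripts if modulo_home(s, servers) != modulo_home(s, grown)]

print(f"ring:   {len(moved_ring)} of {len(scripts)} Workers re-homed")
print(f"modulo: {len(moved_modulo)} of {len(scripts)} Workers re-homed")
```

On the ring, every Worker that moves lands on the new server, and everything else stays put. With the modulo scheme, changing the server count shifts the home of most Workers at once.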

We refer to the Worker’s home server as the “shard server”. In request flows involving sharding, there is also a “shard client”. It’s also a server! The shard client initially receives a request, and, using its consistent hash ring, looks up which shard server it should send the request to. I’ll be using these two terms – shard client and shard server – in the rest of this post.

Handling overload

The nature of HTTP assets lends itself well to sharding. If they are cacheable, they are static, at least for their cache Time to Live (TTL) duration. So, serving them requires time and space which scale linearly with their size.

But Workers aren’t JPEGs. They are live units of compute which can use up to five minutes of CPU time per request. Their time and space complexity do not necessarily scale with their input size, and can vastly outstrip the amount of computing power we must dedicate to serving even a huge file from cache.

This means that individual Workers can easily get overloaded when given sufficient traffic. So, no matter what we do, we need to keep in mind that we must be able to scale back up to infinity. We will never be able to guarantee that a data center has only one instance of a Worker, and we must always be able to horizontally scale at the drop of a hat to support burst traffic. Ideally this is all done without producing any errors.

This means that a shard server must have the ability to refuse requests to invoke Workers on it, and shard clients must always gracefully handle this scenario.

Two load shedding options

I am aware of two general solutions to shedding load gracefully, without serving errors.

In the first solution, the client asks politely if it may issue the request. It then sends the request if it receives a positive response. If it instead receives a “go away” response, it handles the request differently, like serving it locally. In HTTP, this pattern can be found in Expect: 100-continue semantics. The main downside is that this introduces one round-trip of latency to set the expectation of success before the request can be sent. (Note that a common naive solution is to just retry requests. This works for some kinds of requests, but is not a general solution, as requests may carry arbitrarily large bodies.)

The second general solution is to send the request without confirming that it can be handled by the server, then count on the server to forward the request elsewhere if it needs to. This could even be back to the client. This avoids the round-trip of latency that the first solution incurs, but there is a tradeoff: It puts the shard server in the request path, pumping bytes back to the client. Fortunately, we have a trick to minimize the number of bytes we actually have to send back in this fashion, which I’ll describe in the next section.

Optimistically sending sharded requests

There are a couple of reasons why we chose to optimistically send sharded requests without waiting for permission.

The first reason of note is that we expect to see very few of these refused requests in practice. The reason is simple: If a shard client receives a refusal for a Worker, then it must cold start the Worker locally. As a consequence, it can serve all future requests locally without incurring another cold start. So, after a single refusal, the shard client won’t shard that Worker any more (until traffic for the Worker tapers off enough for an eviction, at least).

Generally, this means we expect that if a request gets sharded to a different server, the shard server will most likely accept the request for invocation. Since we expect success, it makes a lot more sense to optimistically send the entire request to the shard server than it does to incur a round-trip penalty to establish permission first.

The second reason is that we have a trick to avoid paying too high a cost for proxying the request back to the client, as I mentioned above.

We implement our cross-instance communication in the Workers runtime using Cap’n Proto RPC, whose distributed object model enables some incredible features, like JavaScript-native RPC. It is also the elder, spiritual sibling to the just-released Cap’n Web.

In the case of sharding, Cap’n Proto makes it very easy to implement an optimal request refusal mechanism. When the shard client assembles the sharded request, it includes a handle (called a capability in Cap’n Proto) to a lazily-loaded local instance of the Worker. This lazily-loaded instance has the same exact interface as any other Worker exposed over RPC. The difference is just that it’s lazy – it doesn’t get cold started until invoked. In the event the shard server decides it must refuse the request, it does not return a “go away” response, but instead returns the shard client’s own lazy capability!

The shard client’s application code only sees that it received a capability from the shard server. It doesn’t know where that capability is actually implemented. But the shard client’s RPC system does know where the capability lives! Specifically, it recognizes that the returned capability is actually a local capability – the same one that it passed to the shard server. Once it realizes this, it also realizes that any request bytes it continues to send to the shard server will just come looping back. So, it stops sending more request bytes, waits to receive back from the shard server all the bytes it already sent, and shortens the request path as soon as possible. This takes the shard server entirely out of the loop, preventing a “trombone effect.”
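The refusal flow can be modelled in a few lines. The sketch below is an in-process toy, not Cap’n Proto code: all class and function names are hypothetical, and a plain Python object identity check stands in for the RPC system recognizing that a returned capability is its own.

```python
class LazyLocalWorker:
    """Stand-in for the shard client's lazily-loaded local Worker."""

    def __init__(self, name: str):
        self.name = name
        self.cold_started = False

    def handle(self, request: str) -> str:
        if not self.cold_started:
            self.cold_started = True  # the cold start happens only on first use
        return f"{self.name} served {request!r} on the shard client"


class WarmRemoteWorker:
    """Stand-in for an already-running Worker on the shard server."""

    def __init__(self, name: str):
        self.name = name

    def handle(self, request: str) -> str:
        return f"{self.name} served {request!r} on the shard server"


class ShardServer:
    def __init__(self, overloaded: bool):
        self.overloaded = overloaded

    def accept(self, name: str, fallback):
        # Refusal is not an error: hand the caller's own capability back.
        return fallback if self.overloaded else WarmRemoteWorker(name)


def send_sharded(server: ShardServer, name: str, request: str):
    lazy = LazyLocalWorker(name)  # created up front, but not cold started yet
    target = server.accept(name, fallback=lazy)
    came_back = target is lazy  # models the RPC layer spotting its own capability
    return target.handle(request), came_back, lazy.cold_started


print(send_sharded(ShardServer(overloaded=False), "my-worker", "GET /"))
print(send_sharded(ShardServer(overloaded=True), "my-worker", "GET /"))
```

In the accepted case the lazy instance is never started, so creating it costs almost nothing. In the refused case the caller invokes the returned object exactly as it would any other Worker, and the local cold start happens only then.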

Workers invoking Workers

With load shedding behavior figured out, we thought the hard part was over.

But, of course, Workers may invoke other Workers. There are many ways this could occur, most obviously via Service Bindings. Less obviously, many of our favorite features, such as Workers KV, are actually cross-Worker invocations. But there is one product, in particular, that stands out for its powerful ability to invoke other Workers: Workers for Platforms.

Workers for Platforms allows you to run your own functions-as-a-service on Cloudflare infrastructure. To use the product, you deploy three special types of Workers:

a dynamic dispatch Worker
any number of user Workers
an optional, parameterized outbound Worker

A typical request flow for Workers for Platforms goes like so: First, we invoke the dynamic dispatch Worker. The dynamic dispatch Worker chooses and invokes a user Worker. Then, the user Worker invokes the outbound Worker to intercept its subrequests. The dynamic dispatch Worker chose the outbound Worker's arguments prior to invoking the user Worker.

To really amp up the fun, the dynamic dispatch Worker could have a tail Worker attached to it. This tail Worker would need to be invoked with traces related to all the preceding invocations. Importantly, it should be invoked one single time with all events related to the request flow, not invoked multiple times for different fragments of the request flow.

You might further ask, can you nest Workers for Platforms? I don’t know the official answer, but I can tell you that the code paths do exist, and they do get exercised.

To support this nesting doll of Workers, we keep a context stack during invocations. This context includes things like ownership overrides, resource limit overrides, trust levels, tail Worker configurations, outbound Worker configurations, feature flags, and so on. This context stack was manageable-ish when everything was executed on a single thread. For sharding to be truly useful, though, we needed to be able to move this context stack around to other machines.

Our choice of Cap’n Proto RPC as our primary communications medium helped us make sense of it all. To shard Workers deep within a stack of invocations, we serialize the context stack into a Cap’n Proto data structure and send it to the shard server. The shard server deserializes it into native objects, and continues the execution where things left off.

As with load shedding, Cap’n Proto’s distributed object model provides us simple answers to otherwise difficult questions. Take the tail Worker question – how do we coalesce tracing data from invocations which got fanned out across any number of other servers back to one single place? Easy: create a capability (a live Cap’n Proto object) for a reportTraces() callback on the dynamic dispatch Worker’s home server, and put that in the serialized context stack. Now, that context stack can be passed around at will. That context stack will end up in multiple places: At a minimum, it will end up on the user Worker’s shard server and the outbound Worker’s shard server. It may also find its way to other shard servers if any of those Workers invoked service bindings! Each of those shard servers can call the reportTraces() callback, and be confident that the data will make its way back to the right place: the dynamic dispatch Worker’s home server. None of those shard servers need to actually know where that home server is. Phew!
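As a rough illustration of the callback idea, the following toy keeps everything in one process and uses a bound method in place of a Cap’n Proto capability. The names `TraceCollector`, `run_on_shard_server`, and the context keys are hypothetical; only `reportTraces()` comes from the description above.

```python
class TraceCollector:
    """Lives on the dynamic dispatch Worker's home server (hypothetical)."""

    def __init__(self):
        self.events = []

    def report_traces(self, source: str, events: list) -> None:
        self.events.extend((source, event) for event in events)


def run_on_shard_server(server_name: str, context: dict, events: list) -> None:
    # The shard server only holds the callback from the context stack.
    # It never needs to know which machine the collector lives on.
    context["report_traces"](server_name, events)


collector = TraceCollector()
context = {"feature_flags": {}, "report_traces": collector.report_traces}

# The same context is handed to several shard servers.
run_on_shard_server("user-worker-shard", context, ["fetch started", "fetch done"])
run_on_shard_server("outbound-worker-shard", context, ["subrequest intercepted"])

print(len(collector.events), "events collected in one place")
```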

Eviction rates down, warm request rates up

Features like this are always satisfying to roll out, because they produce graphs showing huge efficiency gains.

Once fully rolled out, only about 4% of total requests from enterprise traffic ended up being sharded. To put that another way, 96% of all enterprise requests are to Workers which are sufficiently loaded that we must run multiple instances of them in a data center.

Despite that low total rate of sharding, we reduced our global Worker eviction rate by 10x.

Our eviction rate is a measure of memory pressure within our system. You can think of it like garbage collection at a macro level, and it has the same implications. Fewer evictions means our system uses memory more efficiently. This has the happy consequence of using less CPU to clean up our memory. More relevant to Workers users, the increased efficiency means we can keep Workers in memory for an order of magnitude longer, improving their warm request rate and reducing their latency.

The high leverage shown – sharding just 4% of our traffic to improve memory efficiency by 10x – is a consequence of the power-law distribution of Internet traffic.

A power law distribution is a phenomenon which occurs across many fields of science, including linguistics, sociology, physics, and, of course, computer science. Events which follow power law distributions typically see a huge amount clustered in some small number of “buckets”, and the rest spread out across a large number of those “buckets”. Word frequency is a classic example: A small handful of words like “the”, “and”, and “it” occur in texts with extremely high frequency, while other words like “eviction” or “trombone” might occur only once or twice in a text.

In our case, the majority of Workers requests goes to a small handful of high-traffic Workers, while a very long tail goes to a huge number of low-traffic Workers. The 4% of requests which were sharded are all to low-traffic Workers, which are the ones that benefit the most from sharding.
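To get a feel for how lopsided a power law can be, here is a small sketch that assumes a Zipf distribution with exponent 1 over 10,000 Workers. Both numbers are assumptions picked for illustration and are not measurements of our traffic.

```python
def zipf_shares(num_workers: int, exponent: float = 1.0):
    """Request share per Worker, ranked from busiest to quietest."""
    weights = [1.0 / rank**exponent for rank in range(1, num_workers + 1)]
    total = sum(weights)
    return [w / total for w in weights]


shares = zipf_shares(10_000)
top_share = sum(shares[:100])   # busiest 1% of Workers
tail_share = sum(shares[100:])  # remaining 99% of Workers

print(f"busiest 1% of Workers: {top_share:.0%} of requests")
print(f"remaining 99% of Workers: {tail_share:.0%} of requests")
```

Under these assumptions, a tiny group of Workers accounts for more than half of all requests, while the vast majority of Workers share the rest. Those long-tail Workers are the ones that would otherwise sit idle on many servers at once.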

So did we eliminate cold starts? Or will there be an Eliminating Cold Starts 3 in our future?

For enterprise traffic, our warm request rate increased from 99.9% to 99.99% – that’s three 9’s to four 9’s. Conversely, this means that the cold start rate went from 0.1% to 0.01% of requests, a 10x decrease. A moment’s thought, and you’ll realize that this is consistent with the eviction rate graph I shared above: A 10x decrease in the number of Workers we destroy over time must imply we’re creating 10x fewer to begin with.

Simultaneously, our warm request rate became less volatile throughout the course of the day.

Hmm.

I hate to admit this to you, but I still notice a little bit of space at the top of the graph. 😟

Can you help us get to five 9’s?

HTTP Analytics for 6M requests per second using ClickHouse

Alex Bocharov — Tue, 06 Mar 2018 13:00:00 GMT

One of our large scale data infrastructure challenges here at Cloudflare is around providing HTTP traffic analytics to our customers. HTTP Analytics is available to all our customers via two options:

Analytics tab in Cloudflare dashboard
Zone Analytics API with 2 endpoints
- Dashboard endpoint
- Co-locations endpoint (Enterprise plan only)

In this blog post I'm going to talk about the exciting evolution of the Cloudflare analytics pipeline over the last year. I'll start with a description of the old pipeline and the challenges that we experienced with it. Then, I'll describe how we leveraged ClickHouse to form the basis of a new and improved pipeline. In the process, I'll share details about how we went about schema design and performance tuning for ClickHouse. Finally, I'll look forward to what the Data team is thinking of providing in the future.

Let's start with the old data pipeline.

Old data pipeline

The previous pipeline was built in 2014. It has been mentioned previously in the "Scaling out PostgreSQL for CloudFlare Analytics using CitusDB" and "More data, more data" blog posts from the Data team.

It had the following components:

Log forwarder - collected Cap'n Proto formatted logs from the edge, notably DNS and Nginx logs, and shipped them to Kafka in Cloudflare's central datacenter.
Kafka cluster - consisted of 106 brokers with a x3 replication factor and 106 partitions, and ingested Cap'n Proto formatted logs at an average rate of 6M logs per second.
Kafka consumers - each of the 106 partitions had a dedicated Go consumer (a.k.a. Zoneagg consumer), which read logs, produced aggregates per partition per zone per minute, and then wrote them into Postgres.
Postgres database - a single-instance PostgreSQL database (a.k.a. RollupDB), which accepted aggregates from Zoneagg consumers and wrote them into temporary tables per partition per minute. It then rolled up the aggregates into further aggregates with an aggregation cron. More specifically:
- Aggregates per partition, minute, zone → aggregates data per minute, zone
- Aggregates per minute, zone → aggregates data per hour, zone
- Aggregates per hour, zone → aggregates data per day, zone
- Aggregates per day, zone → aggregates data per month, zone
Citus Cluster - consisted of a Citus main and 11 Citus workers with a x2 replication factor (a.k.a. Zoneagg Citus cluster), the storage behind the Zone Analytics API and our internal BI tools. It had a replication cron, which did a remote copy of tables from the Postgres instance into Citus worker shards.
Zone Analytics API - served queries from the internal PHP API. It consisted of 5 API instances written in Go, queried the Citus cluster, and was not visible to external users.
PHP API - 3 instances of proxying API, which forwarded public API queries to internal Zone Analytics API, and had some business logic on zone plans, error messages, etc.
Load Balancer - nginx proxy, forwarded queries to PHP API/Zone Analytics API.
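The roll-up chain performed by the aggregation cron can be sketched as repeated re-keying and summing. This is only an illustration in Python with made-up data; the real pipeline did this in bash and SQL against Postgres, and the `rollup` helper and the tuple layout are assumptions.

```python
from collections import Counter


def rollup(aggregates, key_fn):
    """Sum counts after mapping each key to a coarser key."""
    out = Counter()
    for key, count in aggregates.items():
        out[key_fn(key)] += count
    return out


# (partition, minute, zone) -> request count; minutes count up from an arbitrary start.
per_partition = {
    (0, 60, "example.com"): 10,
    (1, 60, "example.com"): 5,
    (0, 61, "example.com"): 7,
    (1, 125, "example.com"): 3,
}

# Per partition, minute, zone -> per minute, zone
per_minute = rollup(per_partition, lambda k: (k[1], k[2]))
# Per minute, zone -> per hour, zone
per_hour = rollup(per_minute, lambda k: (k[0] // 60, k[1]))

print(dict(per_minute))
print(dict(per_hour))
```

The daily and monthly steps follow the same pattern with a coarser key function.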

Cloudflare has grown tremendously since this pipeline was originally designed in 2014. It started off processing under 1M requests per second and grew to current levels of 6M requests per second. The pipeline had served us and our customers well over the years, but began to split at the seams. Any system should be re-engineered after some time, when requirements change.

Some specific disadvantages of the original pipeline were:

Postgres SPOF - the single PostgreSQL instance was a SPOF (Single Point of Failure), as it didn't have replicas or backups. If we were to lose this node, the whole analytics pipeline could be paralyzed and produce no new aggregates for the Zone Analytics API.
Citus main SPOF - Citus main was the entrypoint to all Zone Analytics API queries and if it went down, all our customers' Analytics API queries would return errors.
Complex codebase - thousands of lines of bash and SQL for aggregations, and thousands of lines of Go for API and Kafka consumers made the pipeline difficult to maintain and debug.
Many dependencies - the pipeline consisted of many components, and failure in any individual component could result in halting the entire pipeline.
High maintenance cost - due to its complex architecture and codebase, there were frequent incidents, which sometimes took engineers from the Data team and other teams many hours to mitigate.

Over time, as our request volume grew, the challenges of operating this pipeline became more apparent, and we realized that this system was being pushed to its limits. This realization inspired us to think about which components would be ideal candidates for replacement, and led us to build a new data pipeline.

Our first design of an improved analytics pipeline centred around the use of the Apache Flink stream processing system. We had previously used Flink for other data pipelines, so it was a natural choice for us. However, these pipelines had been at a much lower rate than the 6M requests per second we needed to process for HTTP Analytics, and we struggled to get Flink to scale to this volume - it just couldn't keep up with ingestion rate per partition on all 6M HTTP requests per second.

Our colleagues on our DNS team had already built and productionized a DNS analytics pipeline atop ClickHouse. They wrote about it in the "How Cloudflare analyzes 1M DNS queries per second" blog post. So, we decided to take a deeper look at ClickHouse.

ClickHouse

"ClickHouse не тормозит." Translation from Russian: ClickHouse doesn't have brakes (or isn't slow) © ClickHouse core developers

When exploring additional candidates for replacing some of the key infrastructure of our old pipeline, we realized that using a column oriented database might be well suited to our analytics workloads. We wanted to identify a column oriented database that was horizontally scalable and fault tolerant to help us deliver good uptime guarantees, and extremely performant and space efficient such that it could handle our scale. We quickly realized that ClickHouse could satisfy these criteria, and then some.

ClickHouse is an open source column-oriented database management system capable of real time generation of analytical data reports using SQL queries. It is blazing fast, linearly scalable, hardware efficient, fault tolerant, feature rich, highly reliable, simple and handy. ClickHouse core developers provide great help on solving issues, merging and maintaining our PRs into ClickHouse. For example, engineers from Cloudflare have contributed a whole bunch of code back upstream:

Aggregate function topK by Marek Vavruša
IP prefix dictionary by Marek Vavruša
SummingMergeTree engine optimizations by Marek Vavruša
Kafka table Engine by Marek Vavruša. We're thinking of replacing the Kafka Go consumers with this engine once it is stable enough, and ingesting from Kafka into ClickHouse directly.
Aggregate function sumMap by Alex Bocharov. Without this function it would be impossible to build our new Zone Analytics API.
Mark cache fix by Alex Bocharov
uniqHLL12 function fix for big cardinalities by Alex Bocharov. The description of the issue and its fix should make for interesting reading.

Along with filing many bug reports, we also report every issue we face in our cluster, which we hope will help to improve ClickHouse in the future.

Even though DNS analytics on ClickHouse had been a great success, we were still skeptical that we would be able to scale ClickHouse to the needs of the HTTP pipeline:

Kafka DNS topic has on average 1.5M messages per second vs 6M messages per second for HTTP requests topic.
Kafka DNS topic average uncompressed message size is 130B vs 1630B for HTTP requests topic.
DNS query ClickHouse record consists of 40 columns vs 104 columns for HTTP request ClickHouse record.

After unsuccessful attempts with Flink, we were skeptical of ClickHouse being able to keep up with the high ingestion rate. Luckily, an early prototype showed promising performance and we decided to proceed with replacing the old pipeline. The first step in replacing the old pipeline was to design a schema for the new ClickHouse tables.

ClickHouse schema design

Once we identified ClickHouse as a potential candidate, we began exploring how we could port our existing Postgres/Citus schemas to make them compatible with ClickHouse.

For our Zone Analytics API we need to produce many different aggregations for each zone (domain) and time period (minutely / hourly / daily / monthly). For a deeper dive into the specifics of the aggregates, please see the Zone Analytics API documentation or this handy spreadsheet.

These aggregations should be available for any time range for the last 365 days. While ClickHouse is a really great tool for working with non-aggregated data, with our volume of 6M requests per second we cannot yet afford to store non-aggregated data for that long.

To give you an idea of how much data that is, here is some "napkin-math" capacity planning. I'm going to use an average insertion rate of 6M requests per second and $100 as a cost estimate for 1 TiB to calculate the storage cost for 1 year in different message formats:

Metric	Cap'n Proto	Cap'n Proto (zstd)	ClickHouse
Avg message/record size	1630 B	360 B	36.74 B
Storage requirements for 1 year	273.93 PiB	60.5 PiB	18.52 PiB (RF x3)
Storage cost for 1 year	$28M	$6.2M	$1.9M

And that is if we assume that requests per second will stay the same, but in fact it's growing fast all the time.
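The napkin math in the table can be reproduced with a few lines of Python. This is a sketch using only the figures stated above (6M requests per second, $100 per TiB, the three record sizes, and a x3 replication factor for ClickHouse); the function and variable names are ours.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
REQUESTS_PER_SECOND = 6_000_000
USD_PER_TIB = 100
TIB, PIB = 2**40, 2**50


def yearly_storage(record_bytes: float, replication_factor: int = 1):
    """Return (PiB stored, USD cost) for one year of records."""
    total_bytes = (
        REQUESTS_PER_SECOND * record_bytes * SECONDS_PER_YEAR * replication_factor
    )
    return total_bytes / PIB, total_bytes / TIB * USD_PER_TIB


formats = {
    "Cap'n Proto": (1630, 1),
    "Cap'n Proto (zstd)": (360, 1),
    "ClickHouse (RF x3)": (36.74, 3),
}
results = {name: yearly_storage(size, rf) for name, (size, rf) in formats.items()}

for name, (pib, usd) in results.items():
    print(f"{name:<20} {pib:8.2f} PiB  ${usd / 1e6:.1f}M")
```

Changing `REQUESTS_PER_SECOND` shows how quickly these totals grow with traffic.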

Even though the storage requirements are quite scary, we're still considering storing raw (non-aggregated) request logs in ClickHouse for 1 month+. See "Future of Data APIs" section below.

Non-aggregated requests table

We store over 100 columns, collecting lots of different kinds of metrics about each request passed through Cloudflare. Some of these columns are also available in our Enterprise Log Share product; however, the ClickHouse non-aggregated requests table has more fields.

With so many columns to store and huge storage requirements, we've decided to proceed with the aggregated-data approach, which worked well for us before in the old pipeline and which will provide us with backward compatibility.

Aggregations schema design #1

According to the API documentation, we need to provide lots of different requests breakdowns and to satisfy these requirements we decided to test the following approach:

Create ClickHouse materialized views with the ReplicatedAggregatingMergeTree engine, pointing to the non-aggregated requests table and containing minutely aggregate data for each of the breakdowns:
- Requests totals - containing numbers like total requests, bytes, threats, uniques, etc.
- Requests by colo - containing requests, bytes, etc. breakdown by edgeColoId - each of 120+ Cloudflare datacenters
- Requests by http status - containing breakdown by HTTP status code, e.g. 200, 404, 500, etc.
- Requests by content type - containing breakdown by response content type, e.g. HTML, JS, CSS, etc.
- Requests by country - containing breakdown by client country (based on IP)
- Requests by threat type - containing breakdown by threat type
- Requests by browser - containing breakdown by browser family, extracted from user agent
- Requests by ip class - containing breakdown by client IP class
Write the code gathering data from all 8 materialized views, using two approaches:
- Querying all 8 materialized views at once using JOIN
- Querying each one of 8 materialized views separately in parallel
Run performance testing benchmark against common Zone Analytics API queries

Schema design #1 didn't work out well. The ClickHouse JOIN syntax forced us to write a monstrous query of over 300 lines of SQL, repeating the selected columns many times, because ClickHouse only supports pairwise joins.

As for querying each of the materialized views separately in parallel, the benchmark showed noticeable but moderate results: query throughput would be only a little better than with our old Citus-based pipeline.

Aggregations schema design #2

In our second iteration of the schema design, we strove to keep a similar structure to our existing Citus tables. To do this, we experimented with the SummingMergeTree engine, which is described in detail by the excellent ClickHouse documentation:

In addition, a table can have nested data structures that are processed in a special way. If the name of a nested table ends in 'Map' and it contains at least two columns that meet the following criteria... then this nested table is interpreted as a mapping of key => (values...), and when merging its rows, the elements of two data sets are merged by 'key' with a summation of the corresponding (values...).

We were pleased to find this feature, because the SummingMergeTree engine allowed us to significantly reduce the number of tables required as compared to our initial approach. At the same time, it allowed us to match the structure of our existing Citus tables. The reason was that the ClickHouse Nested structure ending in 'Map' was similar to the Postgres hstore data type, which we used extensively in the old pipeline.

However, there were two existing issues with ClickHouse maps:

SummingMergeTree does aggregation for all records with the same primary key, but the final aggregation across all shards should be done using some aggregate function, which didn't exist in ClickHouse.
For storing uniques (unique visitors based on IP), we need to use the AggregateFunction data type, and although SummingMergeTree allows you to create a column with such a data type, it will not perform aggregation on it for records with the same primary key.

To resolve problem #1, we had to create a new aggregation function, sumMap. Luckily, the ClickHouse source code is of excellent quality and its core developers are very helpful with reviewing and merging requested changes.
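To make the intended semantics concrete, here is a toy Python model of what a map-summing aggregate does: rows carry parallel arrays of keys and values, and merging sums the values for equal keys. This is a sketch of the behaviour only, not ClickHouse's implementation, and the shard data is made up.

```python
from collections import defaultdict


def sum_map(rows):
    """Merge (keys, values) pairs by key, summing values for equal keys."""
    totals = defaultdict(int)
    for keys, values in rows:
        for key, value in zip(keys, values):
            totals[key] += value
    merged_keys = sorted(totals)
    return merged_keys, [totals[k] for k in merged_keys]


# Hypothetical partial aggregates (HTTP status -> request count) for the
# same zone and minute, as two shards might return them.
shard_a = ([200, 404], [120, 3])
shard_b = ([200, 500], [80, 1])

print(sum_map([shard_a, shard_b]))  # ([200, 404, 500], [200, 3, 1])
```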

As for problem #2, we had to put uniques into a separate materialized view, which uses the ReplicatedAggregatingMergeTree engine and supports merging AggregateFunction states for records with the same primary keys. We're considering adding the same functionality to SummingMergeTree, which would simplify our schema even more.

We also created a separate materialized view for the Colo endpoint because it has much lower usage (5% for Colo endpoint queries, 95% for Zone dashboard queries), so its more dispersed primary key will not affect performance of Zone dashboard queries.

Once schema design was acceptable, we proceeded to performance testing.

ClickHouse performance tuning

We explored a number of avenues for performance improvement in ClickHouse. These included tuning index granularity, and improving the merge performance of the SummingMergeTree engine.

By default, ClickHouse recommends an index granularity of 8192. There is a nice article explaining ClickHouse primary keys and index granularity in depth.

While the default index granularity might be an excellent choice for most use cases, in our case we decided to choose the following index granularities:

For the main non-aggregated requests table we chose an index granularity of 16384. For this table, the number of rows read in a query is typically on the order of millions to billions. In this case, a large index granularity does not make a huge difference on query performance.
For the aggregated requests_* tables, we chose an index granularity of 32. A low index granularity makes sense when we only need to scan and return a few rows. It made a huge difference in API performance: query latency decreased by 50% and throughput increased by ~3 times when we changed the index granularity from 8192 to 32.
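A simplified way to see why granularity matters for small reads: if the engine reads whole granules, the number of rows scanned for a short contiguous range is bounded below by the granule size. The function below is an illustrative model under that assumption, not a description of ClickHouse internals, and the row offsets are hypothetical.

```python
def rows_read(first_row, row_count, index_granularity):
    """Rows scanned to return `row_count` contiguous rows starting at `first_row`.

    Assumes the engine always reads whole granules of `index_granularity` rows.
    """
    first_granule = first_row // index_granularity
    last_granule = (first_row + row_count - 1) // index_granularity
    return (last_granule - first_granule + 1) * index_granularity


# Fetching 60 rows (say, one hour of minutely aggregates for one zone).
for granularity in (8192, 32):
    print(granularity, rows_read(first_row=100_000, row_count=60,
                                 index_granularity=granularity))
```

With these inputs the model scans 8,192 rows at the default granularity and 64 rows at a granularity of 32.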

Not relevant to performance, but we also disabled the min_execution_speed setting, so queries scanning just a few rows won't fail with an exception because of the "slow speed" of scanning rows per second.

On the aggregation/merge side, we've made some ClickHouse optimizations as well, like increasing SummingMergeTree maps merge speed by 7x, which we contributed back to ClickHouse for everyone's benefit.

Once we had completed the performance tuning for ClickHouse, we could bring it all together into a new data pipeline. Next, we describe the architecture for our new, ClickHouse-based data pipeline.

New data pipeline

The new pipeline architecture re-uses some of the components from the old pipeline, but replaces its weakest components.

New components include:

Kafka consumers - 106 Go consumers per each partition consume Cap'n Proto raw logs and extract/prepare needed 100+ ClickHouse fields. Consumers no longer do any aggregation logic.
ClickHouse cluster - 36 nodes with x3 replication factor. It handles non-aggregate requests logs ingestion and then produces aggregates using materialized views.
Zone Analytics API - rewritten and optimized version of API in Go, with many meaningful metrics, healthchecks, failover scenarios.

As you can see, the architecture of the new pipeline is much simpler and more fault-tolerant. It provides Analytics for all our 7M+ customers' domains, totalling more than 2.5 billion monthly unique visitors and over 1.5 trillion monthly page views.

On average we process 6M HTTP requests per second, with peaks of up to 8M requests per second.

The average log message size in Cap’n Proto format used to be ~1630B, but thanks to the amazing job on Kafka compression by our Platform Operations Team, it has decreased significantly. Please see the "Squeezing the firehose: getting the most from Kafka compression" blog post for a deeper dive into those optimisations.

Benefits of new pipeline

No SPOF - removed all SPOFs and bottlenecks; everything has at least a x3 replication factor.
Fault-tolerant - it's more fault-tolerant: even if a Kafka consumer, a ClickHouse node, or a Zone Analytics API instance fails, it doesn't impact the service.
Scalable - we can add more Kafka brokers or ClickHouse nodes and scale ingestion as we grow. We are not so confident about query performance once the cluster grows to hundreds of nodes. However, the Yandex team managed to scale their cluster to 500+ nodes, distributed geographically between several data centers, using two-level sharding.
Reduced complexity - by removing the messy crons and consumers that were doing aggregations, and by refactoring the API code, we were able to:
- Shut down the Postgres RollupDB instance and free it up for reuse.
- Shut down the 12-node Citus cluster and free it up for reuse. As we won't use Citus for serious workloads anymore, we can reduce our operational and support costs.
- Delete tens of thousands of lines of old Go, SQL, Bash, and PHP code.
- Remove WWW PHP API dependency and extra latency.
Improved API throughput and latency - with the previous pipeline, the Zone Analytics API was struggling to serve more than 15 queries per second, so we had to introduce temporary hard rate limits for the largest users. With the new pipeline we were able to remove the hard rate limits, and we are now serving ~40 queries per second. We went further and did intensive load testing of the new API: with the current setup and hardware we are able to serve up to ~150 queries per second, and this is scalable with additional nodes.
Easier to operate - with the shutdown of many unreliable components, we are finally at the point where it's relatively easy to operate this pipeline. ClickHouse quality helps us a lot in this matter.
Decreased number of incidents - with the new, more reliable pipeline, we now have fewer incidents than before, which ultimately has reduced the on-call burden. Finally, we can sleep peacefully at night :-).

Recently, we've improved the throughput and latency of the new pipeline even further with better hardware. I'll provide details about this cluster below.

Our ClickHouse cluster

In total we have 36 ClickHouse nodes. The new hardware is a big upgrade for us:

Chassis - Quanta D51PH-1ULH chassis instead of Quanta D51B-2U chassis (2x less physical space)
CPU - 40 logical cores E5-2630 v4 @ 2.20 GHz instead of 32 cores E5-2630 v3 @ 2.40 GHz
RAM - 256 GB RAM instead of 128 GB RAM
Disks - 12 x 10 TB Seagate ST10000NM0016-1TT101 disks instead of 12 x 6 TB Toshiba MG04ACA600E
Network - 2 x 25G Mellanox ConnectX-4 in MC-LAG instead of 2 x 10G Intel 82599ES

Our Platform Operations team noticed that ClickHouse is not great at running heterogeneous clusters yet, so we need to gradually replace all 36 nodes in the existing cluster with new hardware. The process is fairly straightforward and no different from replacing a failed node. The problem is that ClickHouse doesn't throttle recovery.

Here is more information about our cluster:

Avg insertion rate - all our pipelines bring together 11M rows per second.
Avg insertion bandwidth - 47 Gbps.
Avg queries per second - on average the cluster serves ~40 queries per second, with frequent peaks of up to ~80 queries per second.
CPU time - after the recent hardware upgrade and all optimizations, our cluster CPU time is quite low.
Max disk IO (device time) - it's low as well.

In order to make the switch to the new pipeline as seamless as possible, we performed a transfer of historical data from the old pipeline. Next, I discuss the process of this data transfer.

Historical data transfer

As we have a 1-year storage requirement, we had to do a one-time ETL (Extract, Transform, Load) from the old Citus cluster into ClickHouse.

At Cloudflare we love Go and its goroutines, so it was quite straightforward to write a simple ETL job, which:

For each minute/hour/day/month extracts data from Citus cluster
Transforms Citus data into ClickHouse format and applies needed business logic
Loads data into ClickHouse
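The shape of such a job can be sketched as follows. The original was written in Go with goroutines; this Python version uses a thread pool instead, and everything in it (the in-memory source and sink, the row layout, the function names) is a hypothetical stand-in for the real Citus and ClickHouse clients.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: the source maps a time bucket to its rows,
# and the sink is a plain list guarded by a lock.
SOURCE = {
    "2017-01-01T00:00": [{"zone": 1, "requests": 10}, {"zone": 2, "requests": 5}],
    "2017-01-01T00:01": [{"zone": 1, "requests": 7}],
}
SINK = []
SINK_LOCK = threading.Lock()


def extract(bucket):
    return SOURCE[bucket]


def transform(bucket, rows):
    # Reshape each row into the (assumed) target column order.
    return [(bucket, row["zone"], row["requests"]) for row in rows]


def load(rows):
    with SINK_LOCK:
        SINK.extend(rows)
    return len(rows)


def etl(bucket):
    return load(transform(bucket, extract(bucket)))


# Process time buckets concurrently, one task per bucket.
with ThreadPoolExecutor(max_workers=4) as pool:
    loaded = sum(pool.map(etl, SOURCE))

# A simple consistency check: every extracted row must have been loaded.
assert loaded == sum(len(rows) for rows in SOURCE.values())
```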

The whole process took a couple of days, and 60+ billion rows of data were transferred successfully with consistency checks. The completion of this process finally led to the shutdown of the old pipeline. However, our work does not end there, and we are constantly looking to the future. In the next section, I'll share some details about what we are planning.

Future of Data APIs

Log Push

We're currently working on something called "Log Push". Log Push allows you to specify a desired data endpoint and have your HTTP request logs sent there automatically at regular intervals. At the moment, it's in private beta and is going to support sending logs to:

Amazon S3 bucket
Google Cloud Storage bucket
Other storage services and platforms

It's expected to be generally available soon, but if you are interested in this new product and you want to try it out please contact our Customer Support team.

Logs SQL API

We're also evaluating the possibility of building a new product called Logs SQL API. The idea is to provide customers with access to their logs via a flexible API that supports standard SQL syntax and JSON/CSV/TSV/XML response formats.

Queries can extract:

Raw requests logs fields (e.g. SELECT field1, field2, ... FROM requests WHERE ...)
Aggregated data from requests logs (e.g. SELECT clientIPv4, count() FROM requests GROUP BY clientIPv4 ORDER BY count() DESC LIMIT 10)

Google BigQuery provides a similar SQL API, and Amazon has a product called Kinesis Data Analytics with SQL API support as well.

Another option we're exploring is to provide a syntax similar to the DNS Analytics API, with filters and dimensions.

We're excited to hear your feedback and know more about your analytics use case. It can help us a lot to build new products!

Conclusion

None of this would have been possible without hard work across multiple teams! First of all, thanks to the other Data team engineers for their tremendous efforts to make this all happen. The Platform Operations Team made significant contributions to this project, especially Ivan Babrou and Daniel Dao. Contributions from Marek Vavruša in the DNS Team were also very helpful.

Finally, the Data team at Cloudflare is a small team, so if you're interested in building and operating distributed services, you stand to have some great problems to work on. Check out the Distributed Systems Engineer - Data and Data Infrastructure Engineer roles in London, UK and San Francisco, US, and let us know what you think.

How Cloudflare analyzes 1M DNS queries per second

Marek Vavruša — Wed, 10 May 2017 21:50:20 GMT

On Friday, we announced DNS analytics for all Cloudflare customers. Because of our scale –– by the time you’ve finished reading this, Cloudflare DNS will have handled millions of DNS queries –– we had to be creative in our implementation. In this post, we’ll describe the systems that make up DNS Analytics which help us comb through trillions of these logs each month.

How logs come in from the edge

Cloudflare already has a data pipeline for HTTP logs. We wanted to utilize what we could of that system for the new DNS analytics. Every time one of our edge services gets an HTTP request, it generates a structured log message in the Cap’n Proto format and sends it to a local multiplexer service. Given the volume of the data, we chose not to record the full DNS message payload, only telemetry data we are interested in such as response code, size, or query name, which has allowed us to keep only ~150 bytes on average per message. It is then fused with processing metadata such as timing information and exceptions triggered during query processing. The benefit of fusing data and metadata at the edge is that we can spread the computational cost across our thousands of edge servers, and log only what we absolutely need.

The multiplexer service (known as “log forwarder”) is running on each edge node, assembling log messages from multiple services and transporting them to our warehouse for processing over a TLS secured channel. A counterpart service running in the warehouse receives and demultiplexes the logs into several Apache Kafka clusters. Over the years, Apache Kafka has proven to be an invaluable service for buffering between producers and downstream consumers, preventing data loss when consumers fail over or require maintenance. Since version 0.10, Kafka allows rack-aware allocation of replicas, which improves resilience against rack or site failure, giving us fault tolerant storage of unprocessed messages.

Having a queue with structured logs has allowed us to investigate issues retrospectively without requiring access to production nodes, but it has proven to be quite laborious at times. In the early days of the project we would skim the queue to find offsets for the rough timespan we needed, and then extract the data into HDFS in Parquet format for offline analysis.

About aggregations

The HTTP analytics service was built around stream processors generating aggregations, so we planned to leverage Apache Spark to stream the logs to HDFS automatically. As Parquet doesn’t natively support indexes or arranging the data in a way that avoids full table scans, it’s impractical for on-line analysis or serving reports over an API. There are extensions like parquet-index that create indexes over the data, but not on-the-fly. Given this, the initial plan was to only show aggregated reports to customers, and keep the raw data for internal troubleshooting.

The problem with aggregated summaries is that they only work on columns with low cardinality (the number of unique values). With aggregation, each column in a given time frame explodes to a number of rows equal to the number of unique entries, so it’s viable to aggregate on something like the response code, which only has 12 possible values, but not on a query name, for example. Domain names are subject to popularity, so if, for example, a popular domain name gets asked 1,000 times a minute, one could expect to achieve a 1000x row reduction for per-minute aggregation. In practice, however, that is not the case.

Due to how DNS caching works, resolvers will answer identical queries from cache without going to the authoritative server for the duration of the TTL. The TTL tends to be longer than a minute. So, while authoritative servers see the same request many times, our data is skewed towards non-cacheable queries such as typos or random prefix subdomain attacks. In practice, we see anywhere between 0 - 60x row reduction when aggregating by query names, so storing aggregations in multiple resolutions almost negates the row reduction. Aggregations are also done with multiple resolution and key combinations, so aggregating on a high cardinality column can even result in more rows than original.
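The effect of cardinality on row reduction is easy to demonstrate on a made-up sample. The sketch below groups a hypothetical minute of query logs first by response code and then by query name; the event data and helper name are invented for illustration.

```python
def row_reduction(events, key):
    """Ratio of raw rows to aggregated rows when grouping by `key`."""
    groups = {key(event) for event in events}
    return len(events) / len(groups)


# Hypothetical one-minute sample of (response code, query name) pairs:
# a few cache misses for a popular name plus a random-prefix attack.
events = [("NOERROR", "example.com")] * 3
events += [("NXDOMAIN", f"rand{i}.example.com") for i in range(97)]

by_code = row_reduction(events, key=lambda e: e[0])
by_name = row_reduction(events, key=lambda e: e[1])
print(f"by response code: {by_code:.1f}x, by query name: {by_name:.2f}x")
```

Grouping by response code collapses 100 rows into 2, while grouping by query name still leaves 98 rows.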

For these reasons we started by only aggregating logs at the zone level, which was useful enough for trends, but too coarse for root cause analysis. For example, in one case we were investigating short bursts of unavailability in one of our data centers. Having unaggregated data allowed us to narrow the issue down to the specific DNS queries experiencing latency spikes, and then correlate the queries with a misconfigured firewall rule. Cases like these would be much harder to investigate with only aggregated data, as the issue only affected a tiny percentage of requests that would be lost in the aggregations.

So we started looking into several OLAP (Online Analytical Processing) systems. The first system we looked into was Druid. We were really impressed with the capabilities and how the front-end (Pivot and formerly Caravel) is able to slice and dice the data, allowing us to generate reports with arbitrary dimensions. Druid has already been deployed in similar environments with over 100B events/day, so we were confident it could work, but after testing on sampled data we couldn’t justify the hardware costs of hundreds of nodes. Around the same time Yandex open-sourced their OLAP system, ClickHouse.

And then it Clicked

ClickHouse has a much simpler system design - all the nodes in a cluster have equal functionality and use only ZooKeeper for coordination. We built a small cluster of several nodes to start kicking the tires, and found the performance to be quite impressive and true to the results advertised in the performance comparisons of analytical DBMS, so we proceeded with building a proof of concept. The first obstacle was a lack of tooling and the small size of the community, so we delved into the ClickHouse design to understand how it works.

ClickHouse doesn’t support ingestion from Kafka directly, as it’s only a database, so we wrote an adapter service in Go. It read Cap’n Proto encoded messages from Kafka, converted them into TSV, and inserted into ClickHouse over the HTTP interface in batches. Later, we rewrote the service to use a Go library using the native ClickHouse interface to boost performance. Since then, we’ve contributed some performance improvements back to the project. One thing we learned during the ingestion performance evaluation is that ClickHouse ingestion performance highly depends on batch size - the number of rows you insert at once. To understand why, we looked further into how ClickHouse stores data.

The most common table engine ClickHouse uses for storage is the MergeTree family. It is conceptually similar to LSM algorithm used in Google’s BigTable or Apache Cassandra, however it avoids an intermediate memory table, and writes directly to disk. This gives it excellent write throughput, as each inserted batch is only sorted by “primary key”, compressed, and written to disk to form a segment. The absence of a memory table or any notion of “freshness” of the data also means that it is append-only and data modification or deletion isn’t supported. The only way to delete data currently is to remove data by calendar months, as segments never overlap a month boundary. The ClickHouse team is actively working on making this configurable. On the other hand, this makes writing and segment merging conflict-free, so the ingestion throughput scales linearly with the number of parallel inserts until the I/O or cores are saturated. This, however, also means it is not fit for tiny batches, which is why we rely on Kafka and inserter services for buffering. ClickHouse then keeps constantly merging segments in the background, so many small parts will be merged and written more times (thus increasing write amplification) and too many unmerged parts will trigger aggressive throttling of insertions until the merging progresses. We have found that several insertions per table per second work best as a tradeoff between real-time ingestion and ingestion performance.
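The relationship between batch size and write amplification can be illustrated with a toy model in which every insert becomes a sorted, immutable part and a merge rewrites all rows it touches. The class, the merge threshold of 10 parts, and the "merge everything" policy are assumptions made for this sketch; ClickHouse's real merge scheduling is more sophisticated.

```python
import heapq


class ToyMergeTree:
    """Each insert becomes one sorted, immutable part; merges combine parts."""

    def __init__(self):
        self.parts = []
        self.rows_written = 0

    def insert(self, batch):
        part = sorted(batch)  # sort by "primary key" (here: the whole tuple)
        self.parts.append(part)
        self.rows_written += len(part)

    def merge_all(self):
        merged = list(heapq.merge(*self.parts))  # parts are already sorted
        self.parts = [merged]
        self.rows_written += len(merged)  # merged rows are written again


def write_amplification(batch_size, total_rows=1000):
    table = ToyMergeTree()
    for start in range(0, total_rows, batch_size):
        table.insert([(row % 7, row) for row in range(start, start + batch_size)])
        if len(table.parts) > 10:  # hypothetical threshold for a merge
            table.merge_all()
    return table.rows_written / total_rows


print(write_amplification(batch_size=10), write_amplification(batch_size=500))
```

In this model, inserting the same 1,000 rows in batches of 10 writes each row several times over, while two batches of 500 never trigger a merge.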

The key to table read performance is indexing and the arrangement of the data on disk. No matter how fast processing is, when the engine needs to scan terabytes of data from disk and use only a fraction of it, it’s going to take time. ClickHouse is a columnar store, so each segment contains a file for each column, with sorted values for each row. This way whole columns not present in the query can be skipped, and then multiple cells can be processed in parallel with vectorized execution. In order to avoid full scans, each segment also has a sparse index file. Given that all columns are sorted by the “primary key”, the index file only contains marks (captured rows) of every Nth row in order to be able to keep it in memory even for very large tables. For example, the default setting is to make a mark of every 8,192nd row. This way only 122,070 marks are required to sparsely index a table with 1 billion rows, which easily fits in memory. See primary keys in ClickHouse for a deeper dive into how it works.
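A sparse index of this kind can be modelled with a sorted list and binary search. The sketch below keeps the key of every Nth row as a mark and uses the marks to find the range of rows that might contain a given key. The function names, the granularity of 128, and the data are chosen for illustration and do not mirror ClickHouse's on-disk format.

```python
from bisect import bisect_left, bisect_right


def build_marks(sorted_keys, granularity):
    """Keep the key of every `granularity`-th row, starting with row 0."""
    return sorted_keys[::granularity]


def candidate_rows(marks, granularity, key, total_rows):
    """Return the [start, stop) row range that may contain `key`."""
    # Matching rows can begin in the granule before the first mark >= key...
    first = max(bisect_left(marks, key) - 1, 0)
    # ...and cannot appear in any granule whose mark is already > key.
    last = bisect_right(marks, key)
    return first * granularity, min(last * granularity, total_rows)


# 5,000 rows sorted by zone, 50 rows per zone.
keys = sorted(zone for zone in range(100) for _ in range(50))
marks = build_marks(keys, granularity=128)
start, stop = candidate_rows(marks, 128, key=42, total_rows=len(keys))

# Nothing outside the candidate range matches, so only it needs scanning.
assert all(k != 42 for k in keys[:start] + keys[stop:])
print(len(marks), start, stop)
```

Here 40 marks are enough to narrow a lookup in 5,000 rows down to a single 128-row granule.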

When querying the table using primary key columns, the index returns approximate ranges of rows considered. Ideally the ranges should be wide and contiguous. For example, when the typical usage is to generate reports for individual zones, placing the zone on the first position of the primary key will result in rows sorted by zone in each column, making the disk reads for individual zones contiguous, whereas sorting primarily by timestamp would not. The rows can be sorted in only one way, so the primary key must be chosen carefully with the typical query load in mind. In our case, we optimized read queries for individual zones and have a separate table with sampled data for exploratory queries. The lesson learned is that instead of trying to optimize the index for all purposes and splitting the difference, we have made several tables instead.

One such specialisation is tables with aggregations over zones. Queries across all rows are significantly more expensive, as there is no opportunity to exclude data from scanning. This makes it less practical for analysts to compute basic aggregations on long periods of time, so we decided to use materialized views to incrementally compute predefined aggregations, such as counters, uniques, and quantiles. The materialized views leverage the sort phase on batch insertion to do productive work - computing aggregations. So after the newly inserted segment is sorted, it also produces a table with rows representing dimensions, and columns representing aggregation function state. The difference between aggregation state and final result is that we can generate reports using an arbitrary time resolution without actually storing the precomputed data in multiple resolutions. In some cases the state and result can be the same - for example basic counters, where hourly counts can be produced by summing per-minute counts, however it doesn’t make sense to sum unique visitors or latency quantiles. This is when aggregation state is much more useful, as it allows meaningful merging of more complicated states, such as HyperLogLog (HLL) bitmap to produce hourly unique visitors estimate from minutely aggregations. The downside is that storing state can be much more expensive than final values - the aforementioned HLL state tends to be 20-100 bytes / row when compressed, while a counter is only 8 bytes (1 byte compressed on average). These tables are then used to quickly visualise general trends across zones or sites, and also by our API service that uses them opportunistically for simple queries. Having both incrementally aggregated and unaggregated data in the same place allowed us to simplify the architecture by deprecating stream processing altogether.
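The difference between an aggregation state and a final result can be shown with a small example. In the sketch below an exact Python set stands in for the HyperLogLog state (an assumption made to keep the example short), and the class name and sample IPs are invented.

```python
from dataclasses import dataclass, field


@dataclass
class MinuteState:
    requests: int = 0
    visitors: set = field(default_factory=set)  # exact set standing in for an HLL state

    def add(self, client_ip):
        self.requests += 1
        self.visitors.add(client_ip)

    def merge(self, other):
        # Counters merge by addition; unique-visitor states merge by union.
        return MinuteState(self.requests + other.requests,
                           self.visitors | other.visitors)


minute_1, minute_2 = MinuteState(), MinuteState()
for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.1"):
    minute_1.add(ip)
for ip in ("10.0.0.2", "10.0.0.3"):
    minute_2.add(ip)

merged = minute_1.merge(minute_2)
print(merged.requests, len(merged.visitors))  # 5 3

# Summing finalized per-minute uniques would count the shared visitor twice.
print(len(minute_1.visitors) + len(minute_2.visitors))  # 4
```

Merging states gives the correct 3 unique visitors, whereas adding the two finalized per-minute values gives 4.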

Infrastructure and data integrity

We started with RAID-10 composed of 12 6TB spinning disks on each node, but reevaluated it after the first inevitable disk failures. In the second iteration we migrated to RAID-0, for two reasons. First, it wasn’t possible to hot-swap just the faulty disks, and second the array rebuild took tens of hours which degraded I/O performance. It was significantly faster to replace a faulty node and use internal replication to populate it with data over the network (2x10GbE), than to wait for an array to finish rebuilding. To compensate for higher probability of node failure, we switched to 3-way replication and allocated replicas of each shard to different racks, and started planning for replication to a separate data warehouse.

Another disk failure highlighted a problem with the filesystem we used. Initially we used XFS, but it started to lock up during replication from 2 peers at the same time, thus breaking replication of segments before it completed. This issue has manifested itself with a lot of I/O activity with little disk usage increase as broken parts were deleted, so we gradually migrated to ext4 that didn’t have the same issue.

Visualizing Data

At the time we relied solely on Pandas and ClickHouse’s HTTP interface for ad-hoc analyses, but we wanted to make it more accessible for both analysis and monitoring. Since we knew Caravel - now renamed to Superset - from the experiments with Druid, we started working on an integration with ClickHouse.

Superset is a data visualisation platform designed to be intuitive, and allows analysts to interactively slice and dice the data without writing a single line of SQL. It was initially built and open sourced by Airbnb for Druid, but over time it has gained support for SQL-based backends using SQLAlchemy, an abstraction and ORM for tens of different database dialects. So we wrote and open-sourced a ClickHouse dialect and finally a native Superset integration that was merged a few days ago.

Superset has served us well for ad-hoc visualisations, but it is still not polished enough for our monitoring use case. At Cloudflare we’re heavy users of Grafana for visualisation of all our metrics, so we wrote and open-sourced a Grafana integration as well.

It has allowed us to seamlessly extend our existing monitoring dashboards with the new analytical data. We liked it so much that we wanted to give the same ability to look at the analytics data to you, our users. So we built a Grafana app to visualise data from Cloudflare DNS Analytics. Finally, we made it available in your Cloudflare dashboard analytics. Over time we’re going to add new data sources, dimensions, and other useful ways to visualise your data from Cloudflare. Let us know what you’d like to see next.

Does solving these kinds of technical and operational challenges excite you? Cloudflare is always hiring for talented specialists and generalists within our Engineering, Technical Operations and other teams.

Introducing lua-capnproto: better serialization in Lua

Jiale Zhi — Fri, 28 Feb 2014 09:00:00 GMT

When we need to transfer data from one program to another, either within a machine or from one data center to another, some form of serialization is needed. Serialization converts data stored in memory into a form that can be sent across a network or between processes and then converted back into data a program can use directly.

At CloudFlare, we have data centers all over the world. When transferring data from one data center to another, we need a very efficient way of serializing data, saving us both time and network bandwidth.

We've looked at a few serialization projects. For example, one popular serialization format is JSON, for some of our Go programs we use gob, and we've made use of Protocol Buffers in the past. But lately we've been using a new serialization protocol called Cap'n Proto.

Cap'n Proto attracted us because of its very high performance compared to other serialization projects. It looks a little like a better version of Protocol Buffers, and the author of Cap'n Proto, Kenton, was the primary author of Protocol Buffers version 2.

At CloudFlare, we use NGINX in conjunction with Lua for front-line web serving, proxying and traffic filtering. We need to serialize our data in Lua and transport it across the Internet. But unfortunately, there was no Lua module for Cap'n Proto. So, we decided to write lua-capnproto and release it as yet another CloudFlare open source contribution.

lua-capnproto provides very fast data serialization and a really easy to use API. Here I'll show you how to use lua-capnproto to do serialization and deserialization.

Install lua-capnproto

To install lua-capnproto, you need to install Cap'n Proto, LuaJIT 2.1 and luarocks first.

Then you can install lua-capnproto using the following commands:

git clone https://github.com/cloudflare/lua-capnproto.git
cd lua-capnproto
sudo luarocks make

To test whether lua-capnproto was installed successfully, you can use the capnp compiler to generate a Lua version of one of the Cap'n Proto examples as follows:

capnp compile -olua proto/example.capnp

If everything goes well, you should see no errors and a file named example_capnp.lua generated under the proto directory.

Write a Cap’n Proto definition

Here's a sample Cap’n Proto definition that would be stored in a file called AddressBook.capnp:

    @0xdbb9ad1f14bf0b36;  # unique file ID, generated by capnp id

    struct Person {
      id @0 :UInt32;
      name @1 :Text;
      email @2 :Text;
      phones @3 :List(PhoneNumber);

      struct PhoneNumber {
        number @0 :Text;
        type @1 :Type;

        enum Type {
          mobile @0;
          home @1;
          work @2;
        }
      }

      employment :union {
        unemployed @4 :Void;
        employer @5 :Text;
        school @6 :Text;
        selfEmployed @7 :Void;
        # We assume that a person is only one of these.
      }
    }

    struct AddressBook {
      people @0 :List(Person);
    }

We have a root structure named AddressBook containing a list named people whose members are also structures. What we are going to do is serialize an AddressBook structure and then read the structure back from the serialized data. For more details about the Cap'n Proto definition, you can check out its documentation here.

Prepare your data

Preparing data is pretty simple. All you need to do is to generate a Lua table corresponding to the root structure. The following list gives rules to help you write this table. On the left is a Cap'n Proto data type, on the right is its corresponding Lua data type.

struct -> Lua hash table
list -> Lua array table
bool -> Lua boolean
int8/16/32 or uint8/16/32 or float32/64 -> Lua number
int64/uint64 -> LuaJIT 64-bit number
enum -> Lua string
void -> Lua string "Void"
group -> Lua hash table (the same as struct)
union -> Lua hash table which has only one value set

A few notes:

Lua's number type represents real (double-precision floating-point) numbers, and Lua 5.1 (the version LuaJIT implements) has no integer type, so by default you can't store a 64-bit integer in a number without losing precision. LuaJIT has an extension that supports 64-bit integers: you add ULL or LL to the end of the number literal (ULL for unsigned integers, LL for signed). So, if you need to serialize a 64-bit integer, remember to append ULL or LL to your number.

For example:

id @0 :Int64; -> id = 12345678901234LL
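To see why the suffix matters, here is a minimal sketch of the precision loss with plain Lua numbers. It uses only standard Lua arithmetic, so the LL literal appears in a comment rather than in executable code (it is LuaJIT-only syntax); the variable names are purely illustrative.

```lua
-- A double has a 53-bit significand, so not every integer above 2^53
-- can be represented exactly.
local limit = 2^53                     -- 9007199254740992, still exact
local lost = (limit + 1) == limit      -- adding 1 rounds back to 2^53

print(string.format("%.0f", limit))    -- prints 9007199254740992
print(lost)                            -- prints true

-- Under LuaJIT, a 64-bit integer literal keeps the exact value instead:
--   local big = 9007199254740993LL
```

Any Int64 or UInt64 field whose value might exceed 2^53 is therefore safest written with the suffix from the start.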

Enum values are automatically converted from strings to their numeric values, so you don’t need to do it yourself. By default, enum names are converted to uppercase with underscores. You can change this behavior using annotations. The lua-capnproto documentation has more details.

Here is an example:

type @0 :Type; enum Type { binaryAddr @0; textAddr @1; } -> type = "TEXT_ADDR"
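The default naming rule can be approximated with a few lines of Lua. This is only a sketch of the convention described above: to_enum_name is a hypothetical helper, not a function exported by lua-capnproto, and the library's handling of digits or consecutive capitals may differ.

```lua
-- Hypothetical helper (not part of lua-capnproto):
-- convert a camelCase enumerant name to UPPER_UNDERSCORE form.
local function to_enum_name(name)
    -- Insert "_" at every lowercase-to-uppercase boundary,
    -- then uppercase the whole string.
    return name:gsub("(%l)(%u)", "%1_%2"):upper()
end

print(to_enum_name("textAddr"))      -- prints TEXT_ADDR
print(to_enum_name("binaryAddr"))    -- prints BINARY_ADDR
print(to_enum_name("mobile"))        -- prints MOBILE
```

This is handy when working out by hand which string to put in your data table for a given enumerant.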

void is a special type in Cap’n Proto. For simplicity, we just use a string "Void" to represent void (actually, when serializing, any value other than nil will work, but we use "Void" for consistency).

A sample data table looks like this:

    local data = {
        people = {
            {
                id = 123,
                name = "Alice",
                email = "alice@example.com",
                phones = {
                    {
                        number = "555-1212",
                        ["type"] = "MOBILE",
                    },
                },
                employment = {
                    school = "MIT",
                },
            },
            {
                id = 456,
                name = "Bob",
                email = "bob@example.com",
                phones = {
                    {
                        number = "555-4567",
                        ["type"] = "HOME",
                    },
                    {
                        number = "555-7654",
                        ["type"] = "WORK",
                    },
                },
                employment = {
                    unemployed = "Void",
                },
            },
        }
    }
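Because a union table should have exactly one member set, a small sanity check before serializing can catch mistakes early. The sketch below is an illustration only: is_valid_union and the EMPLOYMENT_FIELDS list are hypothetical names written for this example, not part of lua-capnproto.

```lua
-- Member names of the employment union from AddressBook.capnp.
local EMPLOYMENT_FIELDS = { "unemployed", "employer", "school", "selfEmployed" }

-- Count how many of the given union members are set (non-nil) in tbl.
local function count_set(tbl, fields)
    local n = 0
    for _, name in ipairs(fields) do
        if tbl[name] ~= nil then
            n = n + 1
        end
    end
    return n
end

-- A union table is well formed when exactly one member is set.
local function is_valid_union(tbl, fields)
    return count_set(tbl, fields) == 1
end

print(is_valid_union({ school = "MIT" }, EMPLOYMENT_FIELDS))          -- prints true
print(is_valid_union({ unemployed = "Void" }, EMPLOYMENT_FIELDS))     -- prints true
print(is_valid_union({ school = "MIT", employer = "X" }, EMPLOYMENT_FIELDS))  -- prints false
```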

Compile and run

Now let's compile the Cap'n Proto file:

capnp compile -olua AddressBook.capnp

You shouldn't see any errors, and a file named AddressBook_capnp.lua should be generated in the current directory.

To use this file, we only need to create a file named main.lua (or any name you like) and load the required modules.

    local addressBook = require "AddressBook_capnp"
    local capnp = require "capnp"

Now we can start to serialize using our already prepared data.

    local bin = addressBook.AddressBook.serialize(data)

That’s it. Just one line of code. All the serialization work is done. The variable bin (a Lua string) now holds the serialized binary data. You can write this string to a file or send it over the network.
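Since the serialized data is an ordinary Lua string, saving it needs nothing beyond the standard io library. Here is a minimal sketch, assuming a stand-in byte string in place of real serializer output; write_binary and read_binary are illustrative helper names, not lua-capnproto functions.

```lua
-- Write a Lua string to disk byte for byte ("wb" avoids newline translation).
local function write_binary(path, bytes)
    local f = assert(io.open(path, "wb"))
    f:write(bytes)
    f:close()
end

-- Read a whole file back into a Lua string.
local function read_binary(path)
    local f = assert(io.open(path, "rb"))
    local bytes = f:read("*a")
    f:close()
    return bytes
end

-- Stand-in for the string returned by serialize(); Lua strings may
-- contain embedded zero bytes, so binary data round-trips safely.
local bin = "\0\0\0\0\2\0\0\0payload"

local path = os.tmpname()
write_binary(path, bin)
local restored = read_binary(path)
os.remove(path)

print(restored == bin)    -- prints true
```

Opening the file in binary mode matters on platforms that translate line endings in text mode.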

What about deserialization? It's as easy as serialization.

    local decoded = addressBook.AddressBook.parse(bin)

The variable decoded now holds a table that is identical to data. You can find the complete code here. Note that you need LuaJIT to run the code.
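One way to check a round trip in your own code is a recursive table comparison. The deep_equal helper below is a generic sketch written for this post, not something lua-capnproto provides. It is strict: if a parsed table carries fields that the input table lacks, or a field comes back as a different type, it reports a mismatch.

```lua
-- Recursively compare two values; tables are compared key by key.
local function deep_equal(a, b)
    if type(a) ~= type(b) then
        return false
    end
    if type(a) ~= "table" then
        return a == b
    end
    -- Every key in a must match the same key in b...
    for k, v in pairs(a) do
        if not deep_equal(v, b[k]) then
            return false
        end
    end
    -- ...and b must not contain keys that a lacks.
    for k in pairs(b) do
        if a[k] == nil then
            return false
        end
    end
    return true
end

local original = { people = { { id = 123, name = "Alice" } } }
local same     = { people = { { id = 123, name = "Alice" } } }
local changed  = { people = { { id = 123, name = "Bob" } } }

print(deep_equal(original, same))       -- prints true
print(deep_equal(original, changed))    -- prints false
```

In main.lua you would call it as deep_equal(data, decoded).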

Performance

If you are happy with the API, here is even better news. We chose Cap'n Proto because of its impressively high performance. So when writing lua-capnproto, I also made every effort to make it fast.

In one project we switched to lua-capnproto from lua-cjson (a quite fast JSON library for Lua) for serialization. So let's see how fast lua-capnproto is compared to lua-cjson.

You can also run benchmark.lua yourself (included in the source code) to find out how fast lua-capnproto is compared to lua-cjson on your machine.

The future

We are already using lua-capnproto in production at CloudFlare and it has been running very well for the past month. But lua-capnproto is still a very young project. Some features are missing and there's a lot of work to do in the future. We will continue to make lua-capnproto more stable and more reliable, and would be happy to receive contributions from the open source community.