
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how Cloudflare products are built and the technologies we use, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 08:40:06 GMT</lastBuildDate>
        <item>
            <title><![CDATA[A QUICker SASE client: re-building Proxy Mode]]></title>
            <link>https://blog.cloudflare.com/faster-sase-proxy-mode-quic/</link>
            <pubDate>Thu, 05 Mar 2026 06:00:00 GMT</pubDate>
            <description><![CDATA[ By transitioning the Cloudflare One Client to use QUIC streams for Proxy Mode, we eliminated the overhead of user-space TCP stacks, resulting in a 2x increase in throughput and significant latency reduction for end users.  ]]></description>
            <content:encoded><![CDATA[ <p>When you need to use a <a href="https://blog.cloudflare.com/a-primer-on-proxies/"><u>proxy</u></a> to keep your zero trust environment secure, it often comes with a cost: poor performance for your users. Soon after deploying a client proxy, security teams are generally slammed with support tickets from users frustrated with sluggish browser speed, slow file transfers, and video calls glitching at just the wrong moment. After a while, you start to chalk it up to the proxy — potentially blinding yourself to other issues affecting performance. </p><p>We knew it didn’t have to be this way. We knew users could go faster, without sacrificing security, if we completely re-built our approach to <a href="https://developers.cloudflare.com/cloudflare-one/team-and-resources/devices/warp/configure-warp/warp-modes/#local-proxy-mode"><u>proxy mode</u></a>. So we did.</p><p>In the early days of developing the device client for our <a href="https://www.cloudflare.com/learning/access-management/what-is-sase/"><u>SASE</u></a> platform, <a href="https://www.cloudflare.com/sase/"><u>Cloudflare One</u></a>, we prioritized universal compatibility. When an admin enabled proxy mode, the Client acted as a local SOCKS5 or HTTP proxy. However, because our underlying tunnel architecture was built on WireGuard, a Layer 3 (L3) protocol, we faced a technical hurdle: how to get transport-layer (L4) TCP traffic into an L3 tunnel. Moving from L4 to L3 was especially difficult because our desktop Client works across multiple platforms (Windows, macOS, Linux), so we couldn’t <a href="https://blog.cloudflare.com/from-ip-packets-to-http-the-many-faces-of-our-oxy-framework/#from-an-ip-flow-to-a-tcp-stream"><u>use the kernel</u></a> to achieve this.</p><p>To get over this hurdle, we used smoltcp, a Rust-based user-space TCP implementation. 
When a packet hit the local proxy, the Client had to perform a conversion, using smoltcp to convert the L4 stream into L3 packets for the WireGuard tunnel.</p><p>While this worked, it wasn't efficient. Smoltcp is optimized for embedded systems and does not support modern TCP features. In addition, at the Cloudflare edge, we had to convert the L3 packets back into an L4 stream. For users, this manifested as a performance ceiling. On media-heavy sites, where a browser might open dozens of concurrent connections for images and video, the lack of a high-performing TCP stack led to high latency and sluggish load times. Even on high-speed fiber connections, proxy mode felt significantly slower than all the other device client modes.</p>
    <div>
      <h3>Introducing direct L4 proxying with QUIC</h3>
      <a href="#introducing-direct-l4-proxying-with-quic">
        
      </a>
    </div>
    <p>To solve this, we’ve re-built the Cloudflare One Client’s proxy mode from the ground up and deprecated the use of WireGuard for proxy mode, so we could capitalize on the capabilities of QUIC. We were already leveraging <a href="https://blog.cloudflare.com/zero-trust-warp-with-a-masque/"><u>MASQUE</u></a> (built on QUIC) for proxying IP packets, and added support for QUIC streams for direct L4 proxying.</p><p>By leveraging HTTP/3 (<a href="https://datatracker.ietf.org/doc/rfc9114"><u>RFC 9114</u></a>) with the CONNECT method, we can now keep traffic at Layer 4, where it belongs. When your browser sends a SOCKS5 or HTTP request to the Client, it is no longer broken down into L3 packets.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/w9mIuKa8usLgxDxVqaHax/9861604fc84508b7fc6666bf8b82a874/image1.png" />
          </figure><p>Instead, it is encapsulated directly into a QUIC stream.</p><p>This architectural shift provides three immediate technical advantages:</p><ul><li><p>Bypassing smoltcp: By removing the L3 translation layer, we eliminate IP packet handling and the limitations of smoltcp’s TCP implementation.</p></li><li><p>Native QUIC Benefits: We benefit from modern congestion control and flow control, which are handled natively by the transport layer.</p></li><li><p>Tuneability: The Client and Cloudflare’s edge can tune QUIC’s parameters to optimize performance.</p></li></ul><p>In our internal testing, the results were clear: <b>download and upload speeds doubled, and latency decreased significantly</b>.</p>
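<p>As a simplified illustration of the mechanism, the sketch below (in Python, purely for exposition) builds the classic CONNECT handshake that a forward proxy accepts. Over HTTP/3 the same CONNECT semantics are expressed as header fields on a QUIC stream rather than the text framing shown here; the host and port are illustrative.</p>

```python
def build_connect_request(host: str, port: int) -> bytes:
    """Build an HTTP/1.1 CONNECT request for a forward proxy.

    HTTP/3 carries the same method as header fields on a QUIC stream,
    but the semantics are identical: one CONNECT per proxied connection.
    """
    target = f"{host}:{port}"
    return (
        f"CONNECT {target} HTTP/1.1\r\n"
        f"Host: {target}\r\n"
        "\r\n"
    ).encode("ascii")


def tunnel_established(response: bytes) -> bool:
    """A 2xx status line means the proxy opened the upstream tunnel."""
    status_line = response.split(b"\r\n", 1)[0]
    parts = status_line.split(b" ", 2)
    return len(parts) >= 2 and parts[1].startswith(b"2")
```

<p>Once the tunnel is up, every byte the browser writes maps directly onto the stream, with congestion and flow control handled by the transport rather than a user-space TCP stack.</p>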
    <div>
      <h3>Who benefits the most</h3>
      <a href="#who-benefits-the-most">
        
      </a>
    </div>
    <p>While faster is always better, this update specifically unblocks three common use cases.</p><p>First, in <b>coexistence with third-party VPNs</b>, where a legacy VPN is still required for specific on-prem resources or where a dual SASE setup is needed for redundancy/compliance, local proxy mode is the go-to solution for adding zero trust security to web traffic. This update ensures that "layering" security doesn't mean sacrificing the user experience.</p><p>Second, for <b>high-bandwidth application partitioning</b>, proxy mode is often used to steer specific browser traffic through Cloudflare Gateway while leaving the rest of the OS on the local network. Users can now stream high-definition content or handle large datasets without sacrificing performance.</p><p>Finally, <b>developers and power users</b> who rely on the SOCKS5 secondary listener for CLI tools or scripts will see immediate improvements. Remote API calls and data transfers through the proxy now benefit from the same low-latency connection as the rest of the Cloudflare global network.</p>
    <div>
      <h3>How to get started</h3>
      <a href="#how-to-get-started">
        
      </a>
    </div>
    <p>The proxy mode improvements are available with minimum client version 2025.8.779.0 for Windows, macOS, and Linux devices. To take advantage of these performance gains, ensure you are running the <a href="https://developers.cloudflare.com/cloudflare-one/team-and-resources/devices/warp/download-warp/"><u>latest version of the Cloudflare One Client</u></a>.</p><ol><li><p>Log in to the <b>Cloudflare One dashboard</b>.</p></li><li><p>Navigate to <b>Teams &amp; Resources &gt; Devices &gt; Device profiles &gt; General profiles</b>.</p></li><li><p>Select a profile to edit or create a new one and ensure the <b>Service mode</b> is set to <b>Local proxy mode</b> and the <b>Device tunnel protocol</b> is set to <b>MASQUE</b>.</p></li></ol><p>You can verify your active protocol on a client machine by running the following command in your terminal: </p>
            <pre><code>warp-cli settings | grep protocol</code></pre>
            <p>Visit our <a href="https://developers.cloudflare.com/cloudflare-one/team-and-resources/devices/warp/configure-warp/warp-modes/#set-up-local-proxy-mode"><u>documentation</u></a> for detailed guidance on enabling proxy mode for your devices.</p><p>If you haven't started your SASE journey yet, you can sign up for a <a href="https://dash.cloudflare.com/sign-up/zero-trust"><u>free Cloudflare One account</u></a> for up to 50 users today. Simply <a href="https://dash.cloudflare.com/sign-up/zero-trust"><u>create an account</u></a>, download the <a href="https://1.1.1.1/"><u>Cloudflare One Client</u></a>, and follow our <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-devices/warp/"><u>onboarding guide</u></a> to experience a faster, more stable connection for your entire team.</p> ]]></content:encoded>
            <category><![CDATA[SASE]]></category>
            <category><![CDATA[Proxying]]></category>
            <category><![CDATA[Cloudflare Zero Trust]]></category>
            <category><![CDATA[Zero Trust]]></category>
            <category><![CDATA[Cloudflare One]]></category>
            <category><![CDATA[Cloudflare One Client]]></category>
            <category><![CDATA[Connectivity]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">11I7Snst3LH2T0tJC5HLbN</guid>
            <dc:creator>Koko Uko</dc:creator>
            <dc:creator>Logan Praneis</dc:creator>
            <dc:creator>Gregor Maier</dc:creator>
        </item>
        <item>
            <title><![CDATA[Measuring characteristics of TCP connections at Internet scale]]></title>
            <link>https://blog.cloudflare.com/measuring-network-connections-at-scale/</link>
            <pubDate>Wed, 29 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Researchers and practitioners have been studying connections almost as long as the Internet that supports them. Today, Cloudflare’s global network receives millions of connections per second. We explore various characteristics of TCP connections, including lifetimes, sizes, and more. ]]></description>
            <content:encoded><![CDATA[ <p>Every interaction on the Internet—including loading a web page, streaming a video, or making an API call—starts with a connection. These fundamental logical connections consist of a stream of packets flowing back and forth between devices.</p><p>Various aspects of these network connections have captured the attention of researchers and practitioners for as long as the Internet has existed. The interest in connections even predates the label, as can be seen in the seminal 1991 paper, “<a href="https://dl.acm.org/doi/10.1145/115994.116003"><u>Characteristics of wide-area TCP/IP conversations</u></a>.” By any name, the Internet measurement community has been steeped in characterizations of Internet communication for <i>decades</i>, asking everything from “how long?” and “how big?” to “how often?” – and those are just to start.</p><p>Surprisingly, connection characteristics on the wider Internet are largely unavailable. While anyone can use tools (e.g., <a href="https://www.wireshark.org/"><u>Wireshark</u></a>) to capture data locally, it’s virtually impossible to measure connections globally because of access and scale. Moreover, network operators generally do not share the characteristics they observe — assuming they invest the non-trivial time and energy to observe them at all.</p><p>In this blog post, we move in another direction by sharing aggregate insights about connections established through our global CDN. We present characteristics of <a href="https://developers.cloudflare.com/fundamentals/reference/tcp-connections/"><u>TCP</u></a> connections—which account for about <a href="https://radar.cloudflare.com/adoption-and-usage"><u>70% of HTTP requests</u></a> to Cloudflare—providing empirical insights that are difficult to obtain from client-side measurements alone.</p>
    <div>
      <h2>Why connection characteristics matter</h2>
      <a href="#why-connection-characteristics-matter">
        
      </a>
    </div>
    <p>Characterizing system behavior helps us predict the impact of changes. In the context of networks, consider a new routing algorithm or transport protocol: how can you measure its effects? One option is to deploy the change directly on live networks, but this is risky. Unexpected consequences could disrupt users or other parts of the network, making a “deploy-first” approach potentially unsafe or ethically questionable.</p><p>A safer alternative to live deployment as a first step is simulation. Using simulation, a designer can get important insights about their scheme without having to build a full version. But simulating the whole Internet is challenging, as described by another seminal work, “<a href="https://dl.acm.org/doi/10.1145/268437.268737"><u>Why we don't know how to simulate the Internet</u></a>”.</p><p>To run a useful simulation, we need it to behave like the real system we’re studying. That means generating synthetic data that mimics real-world behavior. Often, we do this by using statistical distributions — mathematical descriptions of how the real data behaves. But before we can create those distributions, we first need to characterize the data — to measure and understand its key properties. Only then can our simulation produce realistic results.</p>
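<p>For example, a simulation might draw per-connection sizes from a heavy-tailed distribution. The sketch below (Python) uses a Pareto distribution with illustrative parameters; the shape and scale values are assumptions for exposition, not values fitted to Cloudflare's data.</p>

```python
import random


def synthetic_flow_sizes(n: int, shape: float = 1.2, scale: float = 12.0,
                         seed: int = 7) -> list:
    """Draw n synthetic per-connection response sizes, in packets.

    Pareto variates are >= 1, so `scale` sets the minimum flow size,
    and a shape parameter near 1 produces a long tail of elephant
    flows alongside a majority of mice.
    """
    rng = random.Random(seed)
    return [scale * rng.paretovariate(shape) for _ in range(n)]
```

<p>Because the distribution is heavy-tailed, the sample median sits far below the sample mean, mirroring the gap between median and mean seen throughout the measurements in this post.</p>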
    <div>
      <h2>Unpacking the dataset</h2>
      <a href="#unpacking-the-dataset">
        
      </a>
    </div>
    <p>The value of any data depends on its collection mechanism. Every dataset has blind spots, biases, and limitations, and ignoring these can lead to misleading conclusions. By examining the finer details — how the data was gathered, what it represents, and what it excludes — we can better understand its reliability and make informed decisions about how to use it. Let’s take a closer look at our collected telemetry.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ksUQ7xlzXPWp2hH7eX4dG/124456d20c6fd5e7e185d68865aee6fa/image5.png" />
          </figure><p><b>Dataset overview.</b> The data describes TCP connections, labeled <i>Visitor to Cloudflare</i> in the above diagram, which serve requests via HTTP 1.0, 1.1, and 2.0, which make up <a href="https://radar.cloudflare.com/adoption-and-usage">about 70%</a> of all 84 million HTTP requests per second, on average, received at our global CDN servers.</p><p><b>Sampling.</b> The passively collected snapshot of data is drawn from a uniformly sampled 1% of all TCP connections to Cloudflare between October 7 and October 15, 2025. Sampling takes place at each individual client-facing server to mitigate biases that could arise from sampling at the datacenter level.</p><p><b>Diversity.</b> Unlike many large operators, whose traffic is primarily their own and dominated by a few services such as search, social media, or streaming video, the vast majority of Cloudflare’s workload comes from our customers, who choose to put Cloudflare in front of their websites to help protect, improve performance, and reduce costs. This diversity of customers brings a wide variety of web applications, services, and users from around the world. As a result, the connections we observe are shaped by a broad range of client devices and application-specific behaviors that are constantly evolving.</p><p><b>What we log.</b> Each entry in the log consists of socket-level metadata captured via the Linux kernel’s <a href="https://man7.org/linux/man-pages/man7/tcp.7.html"><u>TCP_INFO</u></a> struct, alongside the SNI and the number of requests made during the connection. The logs exclude the contents and details of individual HTTP requests and transactions. 
We restrict our use of the logs to connection-level metadata, such as duration and the number of packets transmitted, as well as the number of HTTP requests processed.</p><p><b>Data capture.</b> We chose to include only ‘useful’, fully processed connections in our dataset: those that close gracefully with <a href="https://blog.cloudflare.com/tcp-resets-timeouts/#tcp-connections-from-establishment-to-close"><u>a FIN packet</u></a>. This excludes connections intercepted by attack mitigations, connections that time out, and connections aborted by a RST packet.</p><p>Since a graceful close does not in itself indicate a ‘useful’ connection, <b>we additionally require at least one successful HTTP request</b> during the connection to filter out idle or non-HTTP connections from this analysis — interestingly, these idle and non-HTTP connections make up 11% of all TCP connections to Cloudflare that close with a FIN packet.</p><p>If you’re curious, we’ve also previously blogged about the details of Cloudflare’s <a href="https://blog.cloudflare.com/how-we-make-sense-of-too-much-data/"><u>overall logging mechanism</u></a> and <a href="https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/"><u>post-processing pipeline</u></a>.</p>
    <div>
      <h2>Visualizing connection characteristics</h2>
      <a href="#visualizing-connection-characteristics">
        
      </a>
    </div>
    <p>Although networks are inherently dynamic and trends can change over time, the large-scale patterns we observe across our global infrastructure remain remarkably consistent. While our data offers a global view of connection characteristics, distributions can still vary according to regional traffic patterns.</p><p>In our visualizations we represent characteristics with <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function"><u>cumulative distribution function (CDF)</u></a> graphs, specifically their <a href="https://en.wikipedia.org/wiki/Empirical_distribution_function"><u>empirical equivalents</u></a>. CDFs are particularly useful for gaining a macroscopic view of a distribution, giving a clear picture of both common and extreme cases in a single view. We use them in the illustrations below to make sense of large-scale patterns, and we employ log-scaled axes to account for the extreme values common in networking data.</p><p>A long-standing question about Internet connections relates to “<a href="https://en.wikipedia.org/wiki/Elephant_flow"><u>Elephants and Mice</u></a>”; practitioners and researchers are well aware that most flows are small and some are huge, yet little data exists to inform the lines that divide them. This is where our presentation begins.</p>
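<p>Constructing an empirical CDF is simple: sort the samples and assign each the cumulative fraction of observations at or below it. A minimal sketch (Python; the nearest-rank percentile helper is our own illustrative convention, not the exact estimator used for the figures):</p>

```python
def ecdf(samples):
    """Return sorted sample values and their cumulative fractions."""
    xs = sorted(samples)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def percentile(samples, p):
    """Nearest-rank percentile; adequate for a macroscopic view."""
    xs = sorted(samples)
    k = max(0, min(len(xs) - 1, round(p / 100 * len(xs)) - 1))
    return xs[k]
```

<p>Plotting the resulting (x, y) pairs on a log-scaled x-axis yields CDF figures like those shown below.</p>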
    <div>
      <h3>Packet Counts</h3>
      <a href="#packet-counts">
        
      </a>
    </div>
    <p>Let’s start by taking a look at the distribution of the number of <i>response</i> packets sent in connections by Cloudflare servers back to the clients.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5qaPCul0l7bdOQfaxL1Wbn/d0ef9cc108ba35d49593029baed7cb86/image12.png" />
          </figure><p>On the graph, the x-axis represents the number of response packets sent in log-scale, while the y-axis shows the cumulative fraction of connections below each packet count. The average response consists of roughly 240 packets, but the distribution is highly skewed. The median is 12 packets, which indicates that 50% of Internet connections consist of <i>very few packets</i>. Even at the 90th percentile, connections carry only 107 packets.</p><p>This stark contrast highlights the heavy-tailed nature of Internet traffic: while a few connections transport massive amounts of data—like video streams or large file transfers—most interactions are tiny, delivering small web objects, microservice traffic, or API responses.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Mf6VwD2Xq8aBwQP1V9aX5/1a20d6fa2caab5c719591db8b232f6a1/image11.png" />
          </figure><p>The above plot breaks down the packet count distribution by HTTP protocol version. For HTTP/1.X (both HTTP 1.0 and 1.1 combined) connections, the median response consists of just 10 packets, and 90% of connections carry fewer than 63 response packets. In contrast, HTTP/2 connections show larger responses, with a median of 16 packets and a 90th percentile of 170 packets. This difference likely reflects how HTTP/2 multiplexes multiple streams over a single connection, often consolidating more requests and responses into fewer connections, which increases the total number of packets exchanged per connection. HTTP/2 connections also have additional control-plane frames and flow-control messages that increase response packet counts.</p><p>Despite these differences, the combined view displays the same heavy-tailed pattern: a small fraction of connections carry enormous volumes of data (<a href="https://en.wikipedia.org/wiki/Elephant_flow"><u>elephant flows</u></a>), extending to millions of packets, while most remain lightweight (<a href="https://en.wikipedia.org/wiki/Mouse_flow"><u>mice flows</u></a>).</p><p>So far, we’ve focused on the total number of packets sent from our servers to clients, but another important dimension of connection behavior is the balance between packets sent and received, illustrated below.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5VZeU0d2EYLxPl3SaTPJBb/6b46a793d6eea178838c4f5b2572caf1/image2.png" />
          </figure><p>The x-axis shows the ratio of packets sent by our servers to packets received from clients, visualized as a CDF. Across all connections, the median ratio is 0.91, meaning that in half of connections, clients send slightly more packets than the server responds with. This excess of client-side packets primarily reflects <a href="https://www.cloudflare.com/learning/ssl/transport-layer-security-tls/"><u>TLS</u></a> handshake initiation (ClientHello), HTTP request headers, and data acknowledgements (ACKs), causing the client to typically transmit more packets than the server returns with the content payload — particularly for the low-volume connections that dominate the distribution.</p><p>The mean ratio is higher, at 1.28, due to a long tail of server-heavy connections, such as the large downloads typical of CDN workloads. Most connections fall within a relatively narrow range: 10% of connections have a ratio below 0.67, and 90% are below 1.85. However, the long-tailed behavior highlights the diversity of Internet traffic: extreme values arise from both upload-heavy and download-heavy connections. The variance of 3.71 reflects these asymmetric flows, while the bulk of connections maintain a roughly balanced upload-to-download exchange.</p>
    <div>
      <h3>Bytes sent</h3>
      <a href="#bytes-sent">
        
      </a>
    </div>
    <p>Another way to look at the data is through the bytes sent by our servers to clients, which captures the actual volume of data delivered over each connection. This metric is derived from tcpi_bytes_sent, which also covers (re)transmitted segment payloads while excluding the TCP header, as defined in <a href="https://github.com/torvalds/linux/blob/v6.14/include/uapi/linux/tcp.h#L222-L312"><u>linux/tcp.h</u></a> and aligned with <a href="https://www.rfc-editor.org/rfc/rfc4898.html"><u>RFC 4898</u></a> (TCP Extended Statistics MIB).</p>
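<p>On Linux, the same struct is readable from user space for any TCP socket via getsockopt. The sketch below (Linux-only) unpacks just the leading fields, whose layout has been stable for a long time; tcpi_bytes_sent and tcpi_bytes_received sit deeper in the struct at offsets that vary with kernel version, so production code should mirror linux/tcp.h exactly.</p>

```python
import socket
import struct


def tcp_info_head(sock: socket.socket) -> dict:
    """Read the leading fields of the kernel's TCP_INFO for a TCP socket."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104)
    # Layout per linux/tcp.h: six u8 fields, two bytes of bitfields,
    # then u32 fields (tcpi_rto, tcpi_ato, tcpi_snd_mss, tcpi_rcv_mss, ...).
    state, ca_state, retransmits, probes, backoff, options = \
        struct.unpack_from("6B", raw)
    rto, ato, snd_mss, rcv_mss = struct.unpack_from("4I", raw, 8)
    return {"state": state, "retransmits": retransmits,
            "rto_us": rto, "snd_mss": snd_mss, "rcv_mss": rcv_mss}
```

<p>Running this against a connected socket returns a state of 1 (TCP_ESTABLISHED) and the negotiated MSS values, which is the same raw material our telemetry pipeline aggregates at scale.</p>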
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1VZs6F65RQjyyEUUxZSP2L/b0edd986738e9128c16dcbecb7d83761/image3.png" />
          </figure><p>The plots above break down bytes sent by HTTP protocol version. The x-axis represents the total bytes sent by our servers over each connection. The patterns are generally consistent with what we observed in the packet count distributions.</p><p>For HTTP/1.X, the median response delivers 4.8 KB, and 90% of connections send fewer than 51 KB. In contrast, HTTP/2 connections show slightly larger responses, with a median of 6 KB and a 90th percentile of 146 KB. The mean is much higher—224 KB for HTTP/1.X and 390 KB for HTTP/2—reflecting a small number of very large transfers. These long-tailed extreme flows can reach tens of gigabytes per connection, while some very lightweight connections carry minimal payloads: the minimum for HTTP/1.X is 115 bytes and for HTTP/2 it is 202 bytes.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2xRYaXYQbte6MszIT92uky/837ebdc842c9784a9c413ad886f7a5d6/image6.png" />
          </figure><p>By making use of the tcpi_bytes_received metric, we can now look at the ratio of bytes sent to bytes received per connection to better understand the balance of data exchange. This ratio captures how asymmetric each connection is — essentially, how much data our servers send compared to what they receive from clients. Across all connections, the median ratio is 3.78, meaning that in half of all cases, servers send nearly four times more data than they receive. The average is far higher at 81.06, showing a strong long tail driven by download-heavy flows. Again we see a heavy, long-tailed distribution: a small fraction of extreme cases pushes the ratio into the millions, reflecting very large data transfers toward clients.</p>
    <div>
      <h3>Connection duration</h3>
      <a href="#connection-duration">
        
      </a>
    </div>
    <p>While packet and byte counts capture how much data is exchanged, connection duration provides insight into how that exchange unfolds over time.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5noP7Acqu2Ky4hCGtETH1F/92c7bd220d57232fb40440624d227a78/image8.png" />
          </figure><p>The CDF above shows the distribution of connection durations (lifetimes) in seconds. A reminder that the x-axis is log-scale. Across all connections, the median duration is just 4.7 seconds, meaning half of connections complete in under five seconds. The mean is much higher at 96 seconds, reflecting a small number of long-lived connections that skew the average. Most connections fall within a window of 0.1 seconds (10th percentile) to 300 seconds (90th percentile). We also observe some extremely long-lived connections lasting multiple days, possibly maintained via <a href="https://developers.cloudflare.com/fundamentals/reference/tcp-connections/#tcp-connections-and-keep-alives"><u>keep-alives</u></a> for connection reuse without hitting <a href="https://developers.cloudflare.com/fundamentals/reference/connection-limits/"><u>our default idle timeout limits</u></a>. These long-lived connections typically represent persistent sessions or multimedia traffic, while the majority of web traffic remains short, bursty, and transient.</p>
    <div>
      <h3>Request counts</h3>
      <a href="#request-counts">
        
      </a>
    </div>
    <p>A single connection can carry multiple HTTP requests. The number of requests per connection reveals patterns of connection reuse and multiplexing.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4hsoigL4rFtIyRJpdSUXwh/5ef82b3c0cf5b25b8dc13ed38761f895/image7.png" />
          </figure><p>The above shows the number of HTTP requests (in log-scale) that we see on a single connection, broken down by HTTP protocol version. Right away, we can see that for both HTTP/1.X (mean 3 requests) and HTTP/2 (mean 8 requests) connections, the median number of requests is just 1, reinforcing the prevalence of limited connection reuse. However, because HTTP/2 supports multiplexing multiple streams over a single connection, the 90th percentile rises to 10 requests, with occasional extreme cases carrying thousands of requests, which can be amplified due to <a href="https://blog.cloudflare.com/connection-coalescing-experiments/"><u>connection coalescing</u></a>. In contrast, HTTP/1.X connections have much lower request counts. This aligns with protocol design: HTTP/1.0 followed a “one request per connection” philosophy, while HTTP/1.1 introduced persistent connections — even combining both versions, it’s rare to see HTTP/1.X connections carrying more than two requests at the 90th percentile.</p><p>The prevalence of short-lived connections can be partly explained by automated clients or scripts that tend to open new connections rather than maintaining long-lived sessions. To explore this intuition, we split the data between traffic originating from data centers (likely automated) and typical user traffic (user-driven), using client ASNs as a proxy.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1DhUbNv8cjQVGOqKUai7KU/fecc8eaa488ec216bfb14084a518501b/image9.png" />
          </figure><p>The plot above shows that non-DC (user-driven) traffic has slightly higher request counts per connection, consistent with browsers or apps fetching multiple resources over a single persistent connection, with a mean of 5 requests and a 90th percentile of 5 requests per connection. In contrast, DC-originated traffic has a mean of roughly 3 requests and a 90th percentile of 2, validating our expectation. Despite these differences, the median number of requests remains 1 for both groups, highlighting that, regardless of a connection's origin, most are genuinely brief.</p>
    <div>
      <h2>Inferring path characteristics from connection-level data</h2>
      <a href="#inferring-path-characteristics-from-connection-level-data">
        
      </a>
    </div>
    <p>Connection-level measurements can also provide insights into underlying path characteristics. Let’s examine this in more detail.</p>
    <div>
      <h3>Path MTU</h3>
      <a href="#path-mtu">
        
      </a>
    </div>
    <p>The maximum transmission unit (<a href="https://www.cloudflare.com/learning/network-layer/what-is-mtu/"><u>MTU</u></a>) along the network path is often referred to as the Path MTU (PMTU). PMTU determines the largest packet size that can traverse a connection without fragmentation or packet drops, affecting throughput, efficiency, and latency. The Linux TCP stack on our servers tracks the largest segment size that can be sent without fragmentation along the path for a connection, as part of <a href="https://blog.cloudflare.com/path-mtu-discovery-in-practice/"><u>Path MTU discovery</u></a>.</p><p>From that data we saw that the median (and the 90th percentile!) PMTU was 1,500 bytes, which aligns with the typical Ethernet MTU and is <a href="https://en.wikipedia.org/wiki/Maximum_transmission_unit"><u>considered standard</u></a> for most Internet paths. Interestingly, the 10th percentile sits at 1,420 bytes, reflecting cases where paths include network links with slightly smaller MTUs—common in some <a href="https://blog.cloudflare.com/migrating-from-vpn-to-access/"><u>VPNs</u></a>, <a href="https://blog.cloudflare.com/increasing-ipv6-mtu/"><u>IPv6-to-IPv4 tunnels</u></a>, or older networking equipment that imposes stricter limits to avoid fragmentation. At the extreme, we have seen a PMTU as small as 552 bytes for IPv4 connections, which corresponds to the minimum PMTU value allowed <a href="https://www.kernel.org/doc/html/v6.5/networking/ip-sysctl.html#:~:text=Default%3A%20FALSE-,min_pmtu,-%2D%20INTEGER"><u>by the Linux kernel</u></a>.</p>
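<p>The kernel's cached PMTU estimate for a destination is also visible from user space. A small sketch (Linux-only; connecting a UDP socket performs a route lookup without sending any packets, and 14 is the Linux value of the IP_MTU option for builds of Python that do not expose the constant):</p>

```python
import socket

# socket.IP_MTU is not exposed by every Python build; 14 is its Linux value.
IP_MTU = getattr(socket, "IP_MTU", 14)


def path_mtu(host: str, port: int = 9) -> int:
    """Return the kernel's current path-MTU estimate toward host."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((host, port))  # route lookup only; no packets leave
        return s.getsockopt(socket.IPPROTO_IP, IP_MTU)
    finally:
        s.close()
```

<p>On loopback this reports the loopback MTU (often 65,536) rather than the 1,500 bytes typical of Internet paths.</p>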
    <div>
      <h3>Initial congestion window</h3>
      <a href="#initial-congestion-window">
        
      </a>
    </div>
    <p>A key parameter in transport protocols is the congestion window (CWND), the amount of data that can be transmitted without waiting for an acknowledgement from the receiver. We call these packets or bytes “in-flight.” The congestion window evolves dynamically throughout a connection.</p><p>However, the initial congestion window (ICWND) at the start of a data transfer can have an outsized impact, especially for the short-lived connections that dominate Internet traffic, as we’ve seen above. If the ICWND is set too low, small and medium transfers take additional round-trip times to reach bottleneck bandwidth, slowing delivery. Conversely, if it’s too high, the sender risks overwhelming the network, causing unnecessary packet loss and retransmissions — potentially for all connections that share the bottleneck link.</p><p>A reasonable estimate of the ICWND can be taken as the congestion window size at the instant the TCP sender transitions out of <a href="https://www.rfc-editor.org/rfc/rfc5681#section-3.1"><u>slow start</u></a>. This transition marks the point at which the sender shifts from exponential growth to congestion avoidance, having inferred that further growth may risk congestion. The figure below shows the distribution of congestion window sizes at the moment slow start exits — as calculated by <a href="https://blog.cloudflare.com/http-2-prioritization-with-nginx/#bbr-congestion-control"><u>BBR</u></a>. The median is roughly 464 KB, which corresponds to about 310 packets per connection with a typical 1,500-byte MTU, while extreme flows carry tens of megabytes in flight. This variance reflects the diversity of TCP connections and the dynamically evolving nature of the networks carrying traffic.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7BzqE6HSQgkriWisqS3Yx3/de4dc12a453d162884e9a015ccb40348/image4.png" />
          </figure><p>It’s important to emphasize that these values reflect a mix of network paths, including not only paths between Cloudflare and end users, but also paths between Cloudflare and neighboring datacenters, which are typically well provisioned and offer higher bandwidth.</p><p>Our initial inspection of the above distribution left us doubtful, because the values seemed very high. We then realized the numbers are an artifact of BBR-specific behaviour: BBR deliberately sets the congestion window higher than its estimate of the path’s available capacity, the <a href="https://en.wikipedia.org/wiki/Bandwidth-delay_product"><u>bandwidth-delay product (BDP)</u></a>. The inflated value is <a href="https://www.ietf.org/archive/id/draft-cardwell-iccrg-bbr-congestion-control-01.html#name-state-machine-operation"><u>by design</u></a>. To test this hypothesis, we re-plot the distribution from above in the figure below alongside BBR’s estimate of the BDP. The difference between BBR’s congestion window of unacknowledged packets and its BDP estimate is clear.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/34YFSv4Zdp82qszNM79XsH/3c147dfd5c5006fe55abb53dab47bef1/image10.png" />
          </figure><p>The above plot adds the computed BDP values in context with connection telemetry. The median BDP comes out to roughly 77 KB, or roughly 50 packets. Compared to the congestion window distribution above, the BDP estimates from recently closed connections are much more stable.</p><p>We are using these insights to help identify reasonable initial congestion window sizes and the circumstances in which to use them. Our internal experiments make clear that ICWND sizes can affect performance by as much as 30-40% for smaller connections. Such insights will potentially help revisit efforts to find better initial congestion window values, whose default has been <a href="https://datatracker.ietf.org/doc/html/rfc6928"><u>10 packets</u></a> for more than a decade.</p>
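          <p>To see why the ICWND matters so much for short transfers, consider a simplified slow-start model (our own sketch: no loss, and the window doubles every round trip) that counts the round trips needed to deliver a transfer of a given size:</p>

```rust
/// Round trips needed to send `total_pkts` packets in idealized slow start,
/// where the congestion window starts at `icwnd` and doubles every RTT.
fn slow_start_rounds(total_pkts: u64, icwnd: u64) -> u32 {
    let mut sent = 0;
    let mut cwnd = icwnd;
    let mut rounds = 0;
    while sent < total_pkts {
        sent += cwnd; // one RTT: everything the window allows goes out
        cwnd *= 2;    // exponential growth in slow start
        rounds += 1;
    }
    rounds
}

fn main() {
    // ~464 KB at a 1,500-byte MTU is about 310 packets.
    assert_eq!(slow_start_rounds(310, 10), 5); // RFC 6928 default ICWND
    assert_eq!(slow_start_rounds(310, 40), 4); // a larger, hypothetical ICWND
    println!("ok");
}
```

          <p>Under this idealized model, growing the initial window from 10 to 40 packets saves a full round trip on a ~310-packet transfer, consistent with the kind of effect we measured for smaller connections.</p>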
    <div>
      <h3>Deeper understanding, better performance</h3>
      <a href="#deeper-understanding-better-performance">
        
      </a>
    </div>
    <p>We observed that Internet connections are highly heterogeneous, confirming decades of observations of strong heavy-tail characteristics consistent with the “<a href="https://en.wikipedia.org/wiki/Elephant_flow"><u>elephants and mice</u></a>” phenomenon. Ratios of upload to download bytes are unsurprising for larger flows, but surprisingly small for short flows, highlighting the asymmetric nature of Internet traffic. Understanding these connection characteristics continues to inform ways to improve connection performance, reliability, and user experience.</p><p>We will continue to build on this work, and plan to publish connection-level statistics on <a href="https://radar.cloudflare.com/"><u>Cloudflare Radar</u></a> so that others can similarly benefit.</p><p>Our work on improving our network is ongoing, and we welcome researchers, academics, <a href="https://blog.cloudflare.com/cloudflare-1111-intern-program/"><u>interns</u></a>, and anyone interested in this space to reach out at <a><u>ask-research@cloudflare.com</u></a>. By sharing knowledge and working together, we all can continue to make the Internet faster, safer, and more reliable for everyone.</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Better Internet]]></category>
            <category><![CDATA[Insights]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">5jyi6dhHiLQu3BVMVGKrVG</guid>
            <dc:creator>Suleman Ahmad</dc:creator>
            <dc:creator>Peter Wu</dc:creator>
        </item>
        <item>
            <title><![CDATA[Reducing double spend latency from 40 ms to < 1 ms on privacy proxy]]></title>
            <link>https://blog.cloudflare.com/reducing-double-spend-latency-from-40-ms-to-less-than-1-ms-on-privacy-proxy/</link>
            <pubDate>Tue, 05 Aug 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ We significantly sped up our privacy proxy service by fixing a 40ms delay in "double-spend" checks. ]]></description>
            <content:encoded><![CDATA[ <p>One of Cloudflare’s big focus areas is making the Internet faster for end users. Part of the way we do that is by looking at the "big rocks" or bottlenecks that might be slowing things down — particularly processes on the critical path. When we recently turned our attention to our privacy proxy product, we found a big opportunity for improvement.</p><p>What is our privacy proxy product? These proxies let users browse the web without exposing their personal information to the websites they’re visiting. Cloudflare runs infrastructure for privacy proxies like <a href="https://blog.cloudflare.com/icloud-private-relay/"><u>Apple’s Private Relay</u></a> and <a href="https://blog.cloudflare.com/cloudflare-now-powering-microsoft-edge-secure-network/"><u>Microsoft’s Edge Secure Network</u></a>.</p><p>Like any secure infrastructure, we make sure that users authenticate to these privacy proxies before we open up a connection to the website they’re visiting. In order to do this in a privacy-preserving way (so that Cloudflare collects the least possible information about end-users) we use an open Internet standard – <a href="https://www.rfc-editor.org/rfc/rfc9578.html">Privacy Pass </a>– to issue tokens that authenticate to our proxy service.</p><p>Every time a user visits a website via our Privacy Proxy, we check the validity of the Privacy Pass token which is included in the Proxy-Authorization header in their request. Before we cryptographically validate a user's token, we check if this token has already been spent. If the token is unspent, we let the user request through. Otherwise, it’s a "double-spend". From an access control perspective, double-spends are indicative of a problem. From a privacy perspective, double-spends can reduce the anonymity set and privacy characteristics. 
From a performance perspective, our privacy proxies see millions of requests per second – and any time spent authenticating delays people from accessing sites – so the check needs to be fast. Let’s see how we reduced the latency of these double-spend checks from ~40 ms to &lt;1 ms.</p>
    <div>
      <h2>How did we discover the issue?</h2>
      <a href="#how-did-we-discover-the-issue">
        
      </a>
    </div>
    <p>We use <a href="https://www.jaegertracing.io/"><u>Jaeger</u></a>, a distributed tracing platform. It lets us see which paths our code took and how long functions took to run. When we looked into these traces, we saw latencies of ~40 ms. It was a good lead, but on its own it was not enough to conclude there was an issue, because we only sample a small percentage of our traces, so what we saw was not the whole picture. We needed to look at more data. We could have increased how many traces we sampled, but traces are large and heavy for our systems to process. Metrics are a lighter-weight solution, so we added metrics to get data on all double-spend checks.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/67v4incoE8gXu22EBSLnN0/3c5fbd6b44ccc25398c905889b61c05e/image4.png" />
          </figure><p>The lines in this graph are median latencies we saw for the slowest privacy proxies around the world. The metrics data gave us confidence that it was a problem affecting a large portion of requests… assuming that ~45 ms was longer than expected. But was it? What numbers should we expect?</p>
    <div>
      <h2>The expected latency</h2>
      <a href="#the-expected-latency">
        
      </a>
    </div>
    <p>To understand what times are reasonable to expect, let’s go into detail on what makes up a “double-spend check”. When we do a double-spend check, we ask a backing data store if a Privacy Pass token exists. The data store we use is <a href="https://memcached.org/"><code><u>memcached</u></code></a>. We have many <code>memcached</code> instances running on servers around the world, so which server do we ask? For this, we use <a href="https://github.com/facebook/mcrouter"><code><u>mcrouter</u></code></a>. Instead of figuring out which <code>memcached</code> server to ask, we give our request to <code>mcrouter</code>, and it will handle choosing a good <code>memcached</code> server to use. We looked at the median time it took for <code>mcrouter</code> to process our request. This graph shows the average latencies per server over time. There are spikes, but most of the time the latency is &lt; 1 ms. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7LHxvtd813oeu1DyFh7MOA/0126ceb6212b50e8deeeffabba57e3e5/image1.png" />
          </figure><p>By this point, we were confident that double-spend check latencies were longer than expected everywhere, and we started looking for the root cause.</p>
    <div>
      <h2>How did we investigate the issue?</h2>
      <a href="#how-did-we-investigate-the-issue">
        
      </a>
    </div>
    <p>We took inspiration from the scientific method. We analyzed our code, created theories for why sections of code caused latency, and used data to reject those theories. For any remaining theories, we implemented fixes and tested if they worked.</p><p>Let’s look at the code. At a high level, the double-spend checking logic is:</p><ol><li><p>Get a connection, which can be broken down into:</p><ol><li><p>Send a <code>memcached version</code> command. This serves as a health check for whether the connection is still good to send data on.</p></li><li><p>If the connection is still good, acquire it. Otherwise, establish a new connection.</p></li></ol></li><li><p>Send a <code>memcached get</code> command on the connection.</p></li></ol><p>Let’s go through the theories we had for each step listed above.</p>
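    <p>For reference, the check in step 2 can be sketched in Rust with an in-memory set standing in for the <code>memcached</code> store (a deliberate simplification; the real service goes through <code>mcrouter</code> and adds cryptographic validation):</p>

```rust
use std::collections::HashSet;

/// Step 2 in miniature: report whether a token was already spent,
/// marking it as spent as a side effect. A HashSet stands in for
/// the real memcached-backed store.
fn is_double_spend(spent: &mut HashSet<String>, token: &str) -> bool {
    // `insert` returns false if the value was already present.
    !spent.insert(token.to_string())
}

fn main() {
    let mut spent = HashSet::new();
    assert!(!is_double_spend(&mut spent, "token-abc")); // first use: allowed
    assert!(is_double_spend(&mut spent, "token-abc")); // second use: rejected
    println!("ok");
}
```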
    <div>
      <h2>Theory 1: health check takes long</h2>
      <a href="#theory-1-health-check-takes-long">
        
      </a>
    </div>
    <p>We measured the health check primarily as a sanity check. The version command is simple and fast to process, so it should not take long. And we remained sane. The median latency was &lt; 1 ms.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6amdAWUKl3IvmGlvgwJhMP/57b6895aacf960b08ffc7d36d4569d25/image5.png" />
          </figure>
    <div>
      <h2>Theory 2: waiting to get a connection</h2>
      <a href="#theory-2-waiting-to-get-a-connection">
        
      </a>
    </div>
    <p>To understand why we may need to wait to get a connection, let’s go into more detail on how we get one. In our code, we use a connection pool: a set of ready-to-go connections to <code>mcrouter</code>. The benefit of having a pool is that we do not have to pay the overhead of establishing a connection every time we want to make a request. Pools have a size limit, though. Our limit was 20 per server, and this is where a potential problem lies. Imagine we have a server that processes 5,000 requests every second, and each request stays in the system for 45 ms. We can use <a href="https://en.wikipedia.org/wiki/Little%27s_law"><u>Little’s Law</u></a> to estimate the average number of requests in our system: <code>5000 x 0.045 = 225</code>. Due to our pool size limit, we can only have 20 connections at a time, so we can only process 20 requests at any point in time. That means 205 requests are just waiting! When we do a double-spend check, maybe we’re waiting ~40 ms to get a connection?</p><p>We looked at the metrics of many different servers. No matter what the request rate was, the latency was consistently ~40 ms, disproving the theory. For example, this graph shows data from a server that saw a maximum of 20 requests per second. It shows a histogram over time, and the large majority of requests fall in the 40 - 50 ms bucket.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1EJ7SlTzqMVLTIOTvqH1HL/7d64c441e606ecbe1823585f4ff19086/image7.png" />
          </figure>
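          <p>The Little’s Law estimate from above is simple enough to capture in a couple of lines (a sketch for illustration):</p>

```rust
/// Little's Law: the average number of requests in a system is
/// L = lambda * W, where lambda is the arrival rate (requests/second)
/// and W is the average time each request spends in the system (seconds).
fn littles_law(arrival_rate: f64, residence_secs: f64) -> f64 {
    arrival_rate * residence_secs
}

fn main() {
    // 5,000 req/s, each lingering 45 ms: ~225 requests in the system,
    // far above a 20-connection pool.
    let l = littles_law(5000.0, 0.045);
    assert!((l - 225.0).abs() < 1e-9);
    println!("ok");
}
```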
    <div>
      <h2>Theory 3: delays in Nagle’s algorithm and delayed acks</h2>
      <a href="#theory-3-delays-in-nagles-algorithm-and-delayed-acks">
        
      </a>
    </div>
    <p>We decided to chat with Gemini, giving it the observations we had so far. It suggested many things, but the most interesting was to check if <code>TCP_NODELAY</code> was set. If we had set this option in our code, it would’ve disabled something called <a href="https://en.wikipedia.org/wiki/Nagle%27s_algorithm"><u>Nagle’s algorithm</u></a>. Nagle’s algorithm itself was not a problem, but when enabled alongside another feature, <a href="https://en.wikipedia.org/wiki/TCP_delayed_acknowledgment"><u>delayed ACKs</u></a>, latencies could creep in. To explain why, let’s go through an analogy.</p><p>Suppose we run a group chat app. Normally, people type a full thought and send it in one message. But we have a friend who sends one word at a time: "Hi". Send. "how". Send. "are". Send. "you". Send. That’s a lot of notifications. Nagle’s algorithm aims to prevent this. Nagle says that if the friend wants to send one short message, that’s fine, but it only lets them do it once per turn. When they try to send more single words right after, Nagle saves the words in a draft message. Once the draft message hits a certain length, Nagle sends it. But what if the draft message never hits that length? To handle this, delayed ACKs start a 40 ms timer whenever the friend sends a message. If the app gets no further input before the timer ends, the draft message is sent to the group.</p><p>I took a closer look at the code, both Cloudflare-authored code and code from dependencies we rely on. We depended on the <a href="https://crates.io/crates/memcache-async"><code>memcache-async</code></a> crate for implementing the code that lets us send <code>memcache</code> commands. Here is the code for sending a <code>memcached version</code> command:</p>
            <pre><code>self.io.write_all(b"version\r\n").await?;
self.io.flush().await?;</code></pre>
            <p>Nothing out of the ordinary. Then, we looked inside the get function.</p>
            <pre><code>let writer = self.io.get_mut();
writer.write_all(b"get ").await?;
writer.write_all(key.as_ref()).await?;
writer.write_all(b"\r\n").await?;
writer.flush().await?;</code></pre>
            <p>In our code, we set <code>io</code> as a <code>TcpStream</code>, meaning that each <code>write_all</code> call resulted in sending a message. With Nagle’s algorithm enabled, the data flow looked like this:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Vj6xfkbnIg2gmPeLy9g9I/2b003d8a4d81782148697fc83e793c6f/Screenshot_2025-07-24_at_13.16.05.png" />
          </figure><p>Oof. We tried to send all three small messages, but after we sent the “get “, our kernel put the token and <code>\r\n</code> in a buffer and started waiting for an ACK. When <code>mcrouter</code> got the “get “, it could not do anything because it did not have the full command, and its TCP stack delayed the ACK for 40 ms. Once we got the ACK, we sent the rest of the buffered command. <code>mcrouter</code> received the rest of the command, processed it, and returned a response telling us whether the token exists. What would the data flow look like with Nagle’s algorithm disabled?</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4O0s8hb64olT2PDDc5wTFL/3cfe500a1f235276502db9e608cef966/Screenshot_2025-07-24_at_13.17.11.png" />
          </figure><p>We would send all three small messages. <code>mcrouter</code> would have the full command, and return a response immediately. No waiting, whatsoever.</p>
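          <p>For completeness: Rust’s standard library exposes <code>TCP_NODELAY</code> directly on <code>TcpStream</code>, so disabling Nagle’s algorithm is a one-line change. A minimal sketch over a throwaway loopback connection:</p>

```rust
use std::net::{TcpListener, TcpStream};

fn main() -> std::io::Result<()> {
    // A throwaway loopback listener, just to have something to connect to.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let stream = TcpStream::connect(listener.local_addr()?)?;

    // Setting TCP_NODELAY disables Nagle's algorithm: small writes go out
    // immediately instead of being held back while an ACK is outstanding.
    stream.set_nodelay(true)?;
    assert!(stream.nodelay()?);
    println!("ok");
    Ok(())
}
```

          <p>The fix we describe below takes a different approach, coalescing the writes so there are no small segments to delay in the first place.</p>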
    <div>
      <h2>Why 40 ms?</h2>
      <a href="#why-40-ms">
        
      </a>
    </div>
    <p>Our Linux servers have minimum bounds for the delay. Here is a snippet of Linux source code that defines those bounds.</p>
            <pre><code>#if HZ &gt;= 100
#define TCP_DELACK_MIN	((unsigned)(HZ/25))	/* minimal time to delay before sending an ACK */
#define TCP_ATO_MIN	((unsigned)(HZ/25))
#else
#define TCP_DELACK_MIN	4U
#define TCP_ATO_MIN	4U
#endif</code></pre>
            <p>The comment tells us that <code>TCP_DELACK_MIN</code> is the minimum time delayed ACKs will wait before sending an ACK. We spent some time digging through Cloudflare’s custom kernel settings and found this:</p>
            <pre><code>CONFIG_HZ=1000</code></pre>
            <p><code>CONFIG_HZ</code> eventually propagates to <code>HZ</code>, so <code>TCP_DELACK_MIN</code> is <code>HZ/25 = 40</code> jiffies; at 1,000 timer ticks per second, each jiffy lasts 1 ms, which works out to a 40 ms delay. That's where the number comes from!</p>
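            <p>As a quick sanity check of the arithmetic (our own sketch, not kernel code): the minimum delay is <code>HZ/25</code> jiffies, and each jiffy lasts <code>1000/HZ</code> milliseconds, so for any <code>HZ</code> of at least 100 that divides evenly by 25 the two factors cancel out to 40 ms.</p>

```rust
/// Delayed-ACK minimum in milliseconds, mirroring the kernel macro above:
/// HZ/25 jiffies when HZ >= 100 (4 jiffies otherwise), with each jiffy
/// lasting 1000/HZ milliseconds.
fn delack_min_ms(hz: u32) -> u32 {
    let jiffies = if hz >= 100 { hz / 25 } else { 4 };
    jiffies * 1000 / hz
}

fn main() {
    assert_eq!(delack_min_ms(1000), 40); // our CONFIG_HZ=1000 kernels
    assert_eq!(delack_min_ms(250), 40);  // another common tick rate
    assert_eq!(delack_min_ms(100), 40);  // HZ/25 scales with the tick length
    println!("ok");
}
```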
    <div>
      <h2>The fix</h2>
      <a href="#the-fix">
        
      </a>
    </div>
    <p>We were sending three separate messages for a single command when we only needed to send one. We captured a <code>get</code> command in Wireshark to verify that we were indeed sending three separate messages. (We captured this locally on macOS. Interestingly, we got an ACK for every message.)</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4B2qC70Dpeu25dTOP4V2hj/3720d8012f7d452696ca6cbe265d366e/image9.png" />
          </figure><p>The fix was to use <code>BufWriter&lt;TcpStream&gt;</code> so that <code>write_all</code> would buffer the small messages in a user-space memory buffer, and <code>flush</code> would send the entire <code>memcached</code> command in one message. The Wireshark capture looked much cleaner.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5S6R7qAIad9pjKIfQYTWbA/c7bfe663b707ba4653977319a02e5e07/image3.png" />
          </figure>
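          <p>The same pattern in miniature, using blocking std I/O rather than the async stack from the crate (a sketch; the names are illustrative): wrapping the stream in a <code>BufWriter</code> makes the three <code>write_all</code> calls fill a user-space buffer, and <code>flush</code> hands the kernel the complete command at once.</p>

```rust
use std::io::{BufWriter, Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

/// Send a memcached-style `get` over a loopback connection, buffering the
/// three small writes so they leave as a single flush.
fn send_get(key: &str) -> Vec<u8> {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    let server = thread::spawn(move || {
        let (mut conn, _) = listener.accept().unwrap();
        let mut buf = Vec::new();
        conn.read_to_end(&mut buf).unwrap();
        buf
    });

    let stream = TcpStream::connect(addr).unwrap();
    let mut writer = BufWriter::new(stream);
    writer.write_all(b"get ").unwrap(); // buffered, not yet sent
    writer.write_all(key.as_bytes()).unwrap();
    writer.write_all(b"\r\n").unwrap();
    writer.flush().unwrap(); // one write to the kernel
    drop(writer); // close the stream so read_to_end returns

    server.join().unwrap()
}

fn main() {
    assert_eq!(send_get("token-123"), b"get token-123\r\n".to_vec());
    println!("ok");
}
```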
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>After deploying the fix to production, we saw the median double-spend check latency drop to expected values everywhere.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4kKCQFTw5wp0jEwdPcALb4/8425bcfe526c2eeb9570c7a98fc62c62/image8.png" />
          </figure><p>Our investigation followed a systematic, data-driven approach. We began by using observability tools to confirm the problem's scale. From there, we formed testable hypotheses and used data to systematically disprove them. This process ultimately led us to a subtle interaction between Nagle’s algorithm and delayed ACKs, caused by how we made use of a third-party dependency.</p><p>Ultimately, our mission is to help build a better Internet. Every millisecond saved contributes to a faster, more seamless, and more private browsing experience for end users. We're excited to have this rolled out, and excited to continue to chase further performance improvements!</p> ]]></content:encoded>
            <category><![CDATA[Privacy]]></category>
            <category><![CDATA[Privacy Pass]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">29xmM9UQ1WEQlV0SiAuM2l</guid>
            <dc:creator>Ben Yang</dc:creator>
        </item>
        <item>
            <title><![CDATA[Multi-Path TCP: revolutionizing connectivity, one path at a time]]></title>
            <link>https://blog.cloudflare.com/multi-path-tcp-revolutionizing-connectivity-one-path-at-a-time/</link>
            <pubDate>Fri, 03 Jan 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Multi-Path TCP (MPTCP) leverages multiple network interfaces, like Wi-Fi and cellular, to provide seamless mobility for more reliable connectivity. While promising, MPTCP is still in its early stages, ]]></description>
            <content:encoded><![CDATA[ <p>The Internet is designed to provide multiple paths between two endpoints. Attempts to exploit multi-path opportunities are almost as old as the Internet, culminating in <a href="https://datatracker.ietf.org/doc/html/rfc2991"><u>RFCs</u></a> documenting some of the challenges. Still, today, virtually all end-to-end communication uses only one available path at a time. Why? It turns out that in multi-path setups, even the smallest differences between paths can harm connection quality due to <a href="https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing#History"><u>packet reordering</u></a> and other issues. As a result, Internet devices usually use a single path and let the routers handle the path selection.</p><p>There is another way. Enter Multi-Path TCP (MPTCP), which exploits the presence of multiple interfaces on a device, such as a mobile phone that has both Wi-Fi and cellular antennas, to achieve multi-path connectivity.</p><p>MPTCP has had a long history — see the <a href="https://en.wikipedia.org/wiki/Multipath_TCP"><u>Wikipedia article</u></a> and the <a href="https://datatracker.ietf.org/doc/html/rfc8684"><u>spec (RFC 8684)</u></a> for details. It's a major extension to the TCP protocol, and historically most TCP changes have failed to gain traction. However, MPTCP is intended to be mostly an operating system feature, making it easy to enable. Applications should only need minor code changes to support it.</p><p>There is a caveat, however: MPTCP is still fairly immature, and while it can use multiple paths, giving it superpowers over regular TCP, it's not always strictly better. Whether MPTCP should be used instead of TCP is really a case-by-case decision.</p><p>In this blog post we show how to set up MPTCP to find out.</p>
    <div>
      <h2>Subflows</h2>
      <a href="#subflows">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3r8AP5BHbvtYEtXmYSXFwO/36e95cbac93cdecf2f5ee65945abf0b3/Screenshot_2024-12-23_at_3.07.37_PM.png" />
          </figure><p>Internally, MPTCP extends TCP by introducing "subflows". When everything is working, a single TCP connection can be backed by multiple MPTCP subflows, each using a different path. This is a big deal: a single TCP byte stream is no longer identified by a single 5-tuple. On Linux you can see the subflows with <code>ss -M</code>, like:</p>
            <pre><code>marek$ ss -tMn dport = :443 | cat
tcp   ESTAB 0  	0 192.168.2.143%enx2800af081bee:57756 104.28.152.1:443
tcp   ESTAB 0  	0       192.168.1.149%wlp0s20f3:44719 104.28.152.1:443
mptcp ESTAB 0  	0                 192.168.2.143:57756 104.28.152.1:443</code></pre>
            <p>Here you can see a single MPTCP connection, composed of two underlying TCP flows.</p>
    <div>
      <h2>MPTCP aspirations</h2>
      <a href="#mptcp-aspirations">
        
      </a>
    </div>
    <p>Being able to separate the lifetime of a connection from the lifetime of a flow allows MPTCP to address two problems present in classical TCP: aggregation and mobility.</p><ul><li><p><b>Aggregation</b>: MPTCP can aggregate the bandwidth of many network interfaces. For example, in a data center scenario, it's common to use interface bonding. A single flow can make use of just one physical interface. MPTCP, by being able to launch many subflows, can expose greater overall bandwidth. I'm personally not convinced that this is a real problem. As we'll learn below, modern Linux has a <a href="https://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf"><u>BLESS-like MPTCP scheduler</u></a> and the macOS stack has an "aggregation" mode, so aggregation should work, but I'm not sure how practical it is. However, there are <a href="https://www.openmptcprouter.com/"><u>certainly projects that are trying to do link aggregation</u></a> using MPTCP.</p></li><li><p><b>Mobility</b>: On a customer device, a TCP stream is typically broken if the underlying network interface goes away. This is not an uncommon occurrence — consider a smartphone dropping from Wi-Fi to cellular. MPTCP can fix this — it can create and destroy many subflows over the lifetime of a single connection and survive multiple network changes.</p></li></ul><p>Improving reliability for mobile clients is a big deal. While some software can use QUIC, which also has <a href="https://www.ietf.org/archive/id/draft-ietf-quic-multipath-11.html"><u>Multipath Extensions</u></a>, a large number of classical services still use TCP. A great example is SSH: it would be very nice if you could walk around with a laptop, keep an SSH session open, and switch Wi-Fi networks seamlessly, without breaking the connection.</p><p>MPTCP work was initially driven by <a href="https://uclouvain.be/fr/index.html"><u>UCLouvain in Belgium</u></a>. The first serious adoption was on the iPhone. Apparently, users have a tendency to use Siri while walking out of their homes, and it's very common to lose Wi-Fi connectivity at exactly that moment. (<a href="https://youtu.be/BucQ1lfbtd4?t=533"><u>source</u></a>)</p>
    <div>
      <h2>Implementations</h2>
      <a href="#implementations">
        
      </a>
    </div>
    <p>Currently, there are only two major MPTCP implementations: the Linux kernel supports it from v5.6, but realistically you need at least kernel v6.1 (<a href="https://oracle.github.io/kconfigs/?config=UTS_RELEASE&amp;config=MPTCP"><u>MPTCP is not supported on Android</u></a> yet), and Apple supports it from iOS 7 / Mac OS X 10.10.</p><p>Typically, Linux is used on the server side, and iOS/macOS as the client. It's possible to get Linux to work as a client, but it's not straightforward, as we'll learn soon. Beware — there is plenty of outdated Linux MPTCP documentation. The code has had a bumpy history, and at least two different APIs were proposed. See the Linux kernel source for <a href="https://docs.kernel.org/networking/mptcp.html"><u>the mainline API</u></a> and the <a href="https://www.mptcp.dev/"><u>mptcp.dev</u></a> website.</p>
    <div>
      <h2>Linux as a server</h2>
      <a href="#linux-as-a-server">
        
      </a>
    </div>
    <p>Conceptually, the MPTCP design is pretty sensible. After the initial TCP handshake, each peer may announce additional addresses (and ports) on which it can be reached. There are two ways of doing this. First, in the handshake TCP packet each peer specifies the "<i>Do not attempt to establish new subflows to this address and port</i>" bit, also known as bit [C], in the MPTCP TCP extensions header.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/bT8oz3wxpw7alftvdYg5n/b7614a4d10b6c81e18027f6785391ede/BLOG-2637_3.png" />
          </figure><p><sup><i>Wireshark dissecting MPTCP flags from a SYN packet. </i></sup><a href="https://github.com/multipath-tcp/mptcp_net-next/issues/535"><sup><i><u>Tcpdump does not report</u></i></sup></a><sup><i> this flag yet.</i></sup></p><p>With this bit cleared, the other peer is free to assume the two-tuple is fine to reconnect to. Typically, the <b>server allows</b> the client to reuse the server IP/port address. Usually, the <b>client is not listening</b> and disallows the server from connecting back to it. There are caveats, though. For example, in the context of Cloudflare, where our servers use Anycast addressing, reconnecting to the server IP/port won't work: connecting twice to the same IP/port pair is unlikely to reach the same server. For us it makes sense to set this flag, disallowing clients from reconnecting to our server addresses. This can be done on Linux with:</p>
            <pre><code># Linux server sysctl - useful for ECMP or Anycast servers
$ sysctl -w net.mptcp.allow_join_initial_addr_port=0
</code></pre>
            <p>There is also a second way to advertise a listening IP/port. During the lifetime of a connection, a peer can send an ADD-ADDR MPTCP signal which advertises a listening IP/port. This can be managed on Linux by <code>ip mptcp endpoint ... signal</code>, like:</p>
            <pre><code># Linux server - extra listening address
$ ip mptcp endpoint add 192.51.100.1 dev eth0 port 4321 signal
</code></pre>
            <p>With such a config, a Linux peer (typically server) will report the additional IP/port with ADD-ADDR MPTCP signal in an ACK packet, like this:</p>
            <pre><code>host &gt; host: Flags [.], ack 1, win 8, options [mptcp 30 add-addr v1 id 1 192.51.100.1:4321 hmac 0x...,nop,nop], length 0
</code></pre>
            <p>It's important to realize that either peer can send ADD-ADDR messages. Unusual as it might sound, it's totally fine for the client to advertise extra listening addresses. In the most common scenarios, though, either nobody or just the server sends ADD-ADDR.</p><p>Technically, to launch an MPTCP socket on Linux, you just need to replace IPPROTO_TCP with IPPROTO_MPTCP in the application code:</p>
            <pre><code>IPPROTO_MPTCP = 262
sd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP)
</code></pre>
            <p>In practice, though, this introduces some changes to the sockets API. Not all setsockopt options work yet — <code>TCP_USER_TIMEOUT</code>, for example. Additionally, at this stage, MPTCP is incompatible with kTLS.</p>
    <div>
      <h2>Path manager / scheduler</h2>
      <a href="#path-manager-scheduler">
        
      </a>
    </div>
    <p>Once the peers have exchanged the address information, MPTCP is ready to kick in and perform the magic. There are two independent pieces of logic that MPTCP handles. First, given the address information, MPTCP must figure out whether it should establish additional subflows. The component that decides this is called the "path manager". Then, another component, called the "scheduler", is responsible for choosing a specific subflow to transmit the data over.</p><p>Both peers have a path manager, but typically only the client uses it. A path manager has a hard task: launch enough subflows to get the benefits, but not so many that resources are wasted. This is where MPTCP stacks get complicated. </p>
    <div>
      <h2>Linux as client</h2>
      <a href="#linux-as-client">
        
      </a>
    </div>
    <p>On Linux, the path manager is an operating system feature, not an application feature. The in-kernel path manager requires some configuration — it must know which IP addresses and interfaces are okay to use for starting new subflows. This is configured with <code>ip mptcp endpoint ... subflow</code>, like:</p>
            <pre><code>$ ip mptcp endpoint add dev wlp1s0 192.0.2.3 subflow  # Linux client
</code></pre>
            <p>This informs the path manager that we (typically a client) own a 192.0.2.3 IP address on interface wlp1s0, and that it's fine to use it as source of a new subflow. There are two additional flags that can be passed here: "backup" and "fullmesh". Maintaining these <code>ip mptcp endpoints</code> on a client is annoying. They need to be added and removed every time networks change. Fortunately, <a href="https://ubuntu.com/core/docs/networkmanager"><u>NetworkManager</u></a> from 1.40 supports managing these by default. If you want to customize the "backup" or "fullmesh" flags, you can do this here (see <a href="https://networkmanager.dev/docs/api/1.44.4/settings-connection.html#:~:text=mptcp-flags"><u>the documentation</u></a>):</p>
            <pre><code>ubuntu$ cat /etc/NetworkManager/conf.d/95-mptcp.conf
# set "subflow" on all managed "ip mptcp endpoints". 0x22 is the default.
[connection]
connection.mptcp-flags=0x22
</code></pre>
            <p>Path manager also takes a "limit" setting, to set a cap of additional subflows per MPTCP connection, and limit the received ADD-ADDR messages, like: </p>
            <pre><code>$ ip mptcp limits set subflow 4 add_addr_accepted 2  # Linux client
</code></pre>
            <p>I experimented with the "mobility" use case on my Ubuntu 22 Linux laptop, repeatedly enabling and disabling Wi-Fi and Ethernet. On new kernels (v6.12) it works, and I was able to hold a reliable MPTCP connection over many interface changes. I was <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/534"><u>less lucky with the Ubuntu v6.8</u></a> kernel. Unfortunately, the <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/536"><u>default path manager on a Linux</u></a> client only works when the flag "<i>Do not attempt to establish new subflows to this address and port</i>" is cleared on the server. Server-announced ADD-ADDRs don't result in new subflows being created unless the <code>ip mptcp endpoint</code> has a <code>fullmesh</code> flag.</p><p>It feels like the underlying MPTCP transport code works, but the path manager requires a bit more intelligence. With a new kernel, it's possible to get the "interactive" case working out of the box, but not the ADD-ADDR case. </p>
    <div>
      <h2>Custom path manager</h2>
      <a href="#custom-path-manager">
        
      </a>
    </div>
    <p>Linux allows for two implementations of the path manager component: the built-in kernel implementation (the default), or a userspace netlink daemon.</p>
            <pre><code>$ sysctl -w net.mptcp.pm_type=1 # use userspace path manager
</code></pre>
            <p>However, from what I found, there is no serious implementation of a configurable userspace path manager. The existing <a href="https://github.com/multipath-tcp/mptcpd/blob/main/plugins/path_managers/sspi.c"><u>implementations don't do much</u></a>, and the API still <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/533"><u>seems</u></a> <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/532"><u>immature</u></a>.</p>
    <div>
      <h2>Scheduler and BPF extensions</h2>
      <a href="#scheduler-and-bpf-extensions">
        
      </a>
    </div>
    <p>Thus far we've covered the path manager, but what about the scheduler that chooses which link to actually use? It seems that on Linux there is only one built-in "default" scheduler, which can do basic failover on packet loss. The developers want to allow writing <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/75"><u>MPTCP schedulers in BPF</u></a>, and this work is in progress.</p>
    <div>
      <h2>macOS</h2>
      <a href="#macos">
        
      </a>
    </div>
    <p>Unlike Linux, macOS and iOS expose a raw MPTCP API. On those operating systems, the path manager is not handled by the kernel; it can instead be an application responsibility. The exposed low-level API is based on <code>connectx()</code>. For example, <a href="https://github.com/apple-oss-distributions/network_cmds/blob/97bfa5b71464f1286b51104ba3e60db78cd832c9/mptcp_client/mptcp_client.c#L461"><u>here's an example of obscure code</u></a> that establishes one connection with two subflows:</p>
            <pre><code>int sock = socket(AF_MULTIPATH, SOCK_STREAM, 0);
connectx(sock, ..., &amp;cid1);
connectx(sock, ..., &amp;cid2);
</code></pre>
            <p>This powerful API is hard to use though, as it would require every application to listen for network changes. Fortunately, macOS and iOS also expose higher-level APIs. One <a href="https://github.com/mptcp-apps/mptcp-hello/blob/main/c/macOS/main.c"><u>example is nw_connection</u></a> in C, which uses nw_parameters_set_multipath_service.</p><p>Another, more common example is using <code>Network.framework</code>, and would <a href="https://gist.github.com/majek/cb54b537c74506164d2a7fa2d6601491"><u>look like this</u></a>:</p>
            <pre><code>let parameters = NWParameters.tcp
parameters.multipathServiceType = .interactive
let connection = NWConnection(host: host, port: port, using: parameters) 
</code></pre>
            <p>The API supports three MPTCP service type modes:</p><ul><li><p><i>Handover Mode</i>: Tries to minimize cellular usage. It sticks to Wi-Fi, using cellular only when <a href="https://support.apple.com/en-us/102228"><u>Wi-Fi Assist</u></a> is enabled and decides to switch.</p></li><li><p><i>Interactive Mode</i>: Used for Siri. Reduces latency, but is intended only for low-bandwidth flows.</p></li><li><p><i>Aggregation Mode</i>: Enables resource pooling, but it's only available for developer accounts and is not deployable.</p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/47MukOs6bhCMOkO1JL15sP/7dd75417b855b681bde504122d5af01e/Screenshot_2024-12-23_at_2.59.51_PM.png" />
          </figure><p>The MPTCP API is nicely integrated with the <a href="https://support.apple.com/en-us/102228"><u>iPhone "Wi-Fi Assist" feature</u></a>. While the official documentation is lacking, it's possible to find <a href="https://youtu.be/BucQ1lfbtd4?t=533"><u>sources explaining</u></a> how it actually works. I was able to successfully test both the cleared "<i>Do not attempt to establish new subflows"</i> bit and ADD-ADDR scenarios. Hurray!</p>
    <div>
      <h2>IPv6 caveat</h2>
      <a href="#ipv6-caveat">
        
      </a>
    </div>
    <p>Sadly, MPTCP IPv6 has a caveat. Since IPv6 addresses are long, and MPTCP uses the space-constrained TCP Extensions field, there is <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/448"><u>not enough room for ADD-ADDR messages</u></a> if TCP timestamps are enabled. If you want to use MPTCP and IPv6, it's something to consider.</p>
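<p>Rough arithmetic shows why the option space runs out (sizes per RFC 8684 and RFC 7323; the timestamps figure below includes the usual 2 bytes of NOP padding, and the ADD-ADDR size assumes both the optional port and the truncated HMAC are carried):</p>

```python
# TCP's 4-bit data-offset field caps the header at 60 bytes;
# 20 bytes are fixed, leaving 40 bytes for all options.
OPTION_SPACE = 60 - 20

# Timestamps option: 10 bytes, usually padded to 12 with NOPs.
TIMESTAMPS = 12

def add_addr_size(addr_bytes, port=True, hmac=True):
    # kind + length + subtype/flags + address ID = 4 bytes,
    # plus the address, an optional 2-byte port, and an 8-byte truncated HMAC.
    return 4 + addr_bytes + (2 if port else 0) + (8 if hmac else 0)

ipv4 = add_addr_size(4)   # 18 bytes: fits next to timestamps (30 <= 40)
ipv6 = add_addr_size(16)  # 30 bytes: does not fit (42 > 40)
print(ipv4 + TIMESTAMPS <= OPTION_SPACE)  # True
print(ipv6 + TIMESTAMPS <= OPTION_SPACE)  # False
```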
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>I find MPTCP very exciting: it's one of the few serious TCP extensions that is actually deployable. However, current implementations are limited. My experimentation showed that the only practical scenario where MPTCP is currently useful is:</p><ul><li><p>Linux as a server</p></li><li><p>macOS/iOS as a client</p></li><li><p>the "interactive" use case</p></li></ul><p>With a bit of effort, Linux can be made to work as a client.</p><p>Don't get me wrong, <a href="https://netdevconf.info/0x14/pub/slides/59/mptcp-netdev0x14-final.pdf"><u>Linux developers did tremendous work</u></a> to get where we are, but, in my opinion, for any serious out-of-the-box use case we're not there yet. I'm optimistic that Linux can develop a good MPTCP client story relatively soon, and the possibility of implementing the path manager and scheduler in BPF is really enticing. </p><p>Time will tell if MPTCP succeeds; it's been 15 years in the making. In the meantime, <a href="https://datatracker.ietf.org/meeting/121/materials/slides-121-quic-multipath-quic-00"><u>Multi-Path QUIC</u></a> is under active development, but it's even further from being usable at this stage.</p><p>We're not quite sure if it makes sense for Cloudflare to support MPTCP. <a href="https://community.cloudflare.com/c/feedback/feature-request/30"><u>Reach out</u></a> if you have a use case in mind!</p><p><i>Shoutout to </i><a href="https://fosstodon.org/@matttbe"><i><u>Matthieu Baerts</u></i></a><i> for tremendous help with this blog post.</i></p> ]]></content:encoded>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">6ZxrGIedGqREgTs02vpt0t</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[A Socket API that works across JavaScript runtimes — announcing a WinterCG spec and Node.js implementation of connect()]]></title>
            <link>https://blog.cloudflare.com/socket-api-works-javascript-runtimes-wintercg-polyfill-connect/</link>
            <pubDate>Thu, 28 Sep 2023 13:00:37 GMT</pubDate>
            <description><![CDATA[ Engineers from Cloudflare and Vercel have published a specification of the connect() sockets API for review by the community, along with a Node.js compatible implementation of connect() that developers can start using today ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Earlier this year, we <a href="/workers-tcp-socket-api-connect-databases/">announced a new API for creating outbound TCP sockets</a> — <a href="https://developers.cloudflare.com/workers/runtime-apis/tcp-sockets?cf_target_id=6F3FD2F2360D5526EEE56A7398DB7D9D">connect()</a>. From day one, we’ve been working with the <a href="https://wintercg.org/">Web-interoperable Runtimes Community Group (WinterCG) community</a> to chart a course toward making this API a standard, available across all runtimes and platforms — including Node.js.</p><p>Today, we’re sharing that we’ve reached a new milestone in the path to making this API available across runtimes — engineers from Cloudflare and Vercel have published <a href="https://sockets-api.proposal.wintercg.org/">a draft specification of the connect() sockets API</a> for review by the community, along with a Node.js compatible <a href="https://github.com/Ethan-Arrowood/socket">implementation of the connect() API</a> that developers can start using today.</p><p>This implementation helps both application developers and maintainers of libraries and frameworks:</p><ol><li><p>Maintainers of existing libraries that use the <a href="https://nodejs.org/api/net.html">node:net</a> and <a href="https://nodejs.org/api/tls.html">node:tls</a> APIs can use it to more easily add support for runtimes where node:net and node:tls are not available.</p></li><li><p>JavaScript frameworks can use it to make connect() available in local development, making it easier for application developers to target runtimes that provide connect().</p></li></ol>
    <div>
      <h3>Why create a new standard? Why connect()?</h3>
      <a href="#why-create-a-new-standard-why-connect">
        
      </a>
    </div>
    <p>As we <a href="/workers-tcp-socket-api-connect-databases/">described when we first announced connect()</a>, to date there has not been a standard API across JavaScript runtimes for creating and working with TCP or UDP sockets. This makes it harder for maintainers of open-source libraries to ensure compatibility across runtimes, and ultimately creates friction for application developers who have to navigate which libraries work on which platforms.</p><p>While Node.js provides the <a href="https://nodejs.org/api/net.html">node:net</a> and <a href="https://nodejs.org/api/tls.html">node:tls</a> APIs, these APIs were designed over 10 years ago in the very early days of the Node.js project and remain callback-based. As a result, they can be hard to work with, and expose configuration in ways that don’t fit serverless platforms or web browsers.</p><p>The connect() API fills this gap by incorporating the best parts of existing socket APIs and <a href="https://github.com/WICG/direct-sockets/blob/main/docs/explainer.md">prior proposed standards</a>, based on feedback from the JavaScript community — including contributors to Node.js. Libraries like <a href="https://www.npmjs.com/package/pg">pg</a> (<a href="https://github.com/brianc/node-postgres">node-postgres</a> on Github) are already using the connect() API.</p>
    <div>
      <h3>The connect() specification</h3>
      <a href="#the-connect-specification">
        
      </a>
    </div>
    <p>At time of writing, the <a href="https://sockets-api.proposal.wintercg.org/">draft specification of the Sockets API</a> defines the following API:</p>
            <pre><code>dictionary SocketAddress {
  DOMString hostname;
  unsigned short port;
};

typedef (DOMString or SocketAddress) AnySocketAddress;

enum SecureTransportKind { "off", "on", "starttls" };

[Exposed=*]
dictionary SocketOptions {
  SecureTransportKind secureTransport = "off";
  boolean allowHalfOpen = false;
};

[Exposed=*]
interface Connect {
  Socket connect(AnySocketAddress address, optional SocketOptions opts);
};

interface Socket {
  readonly attribute ReadableStream readable;
  readonly attribute WritableStream writable;

  readonly attribute Promise&lt;undefined&gt; closed;
  Promise&lt;undefined&gt; close();

  Socket startTls();
};</code></pre>
            <p>The proposed API is Promise-based and reuses existing standards whenever possible. For example, <a href="https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream">ReadableStream</a> and <a href="https://developer.mozilla.org/en-US/docs/Web/API/WritableStream">WritableStream</a> are used for the read and write ends of the socket. This makes it easy to pipe data from a TCP socket to any other library or existing code that accepts a ReadableStream as input, or to write to a TCP socket via a WritableStream.</p><p>The entrypoint of the API is the connect() function, which takes a string containing both the hostname and port separated by a colon, or an object with discrete hostname and port fields. It returns a Socket object which represents a socket connection. An instance of this object exposes attributes and methods for working with the connection.</p><p>A connection can be established in plain-text or TLS mode, as well as a special “starttls” mode which allows the socket to be easily upgraded to TLS after some period of plain-text data transfer, by calling the startTls() method on the Socket object. No need to create a new socket or switch to using a separate set of APIs once the socket is upgraded to use TLS.</p><p>For example, to upgrade a socket using the startTLS pattern, you might do something like this:</p>
            <pre><code>import { connect } from "@arrowood.dev/socket"

const options = { secureTransport: "starttls" };
const socket = connect("address:port", options);
const secureSocket = socket.startTls();
// The socket is immediately writable
// Relies on web standard WritableStream
const writer = secureSocket.writable.getWriter();
const encoder = new TextEncoder();
const encoded = encoder.encode("hello");
await writer.write(encoded);</code></pre>
            <p>Equivalent code using the node:net and node:tls APIs:</p>
            <pre><code>import net from 'node:net'
import tls from 'node:tls'

const socket = net.connect(PORT, HOST);
socket.once('connect', () =&gt; {
  const options = { socket };
  const secureSocket = tls.connect(options, () =&gt; {
    // The socket can only be written to once the
    // connection is established.
    // Polymorphic API, uses Node.js streams
    secureSocket.write('hello');
  });
});</code></pre>
            
    <div>
      <h3>Use the Node.js implementation of connect() in your library</h3>
      <a href="#use-the-node-js-implementation-of-connect-in-your-library">
        
      </a>
    </div>
    <p>To make it easier for open-source library maintainers to adopt the connect() API, we’ve published an <a href="https://github.com/Ethan-Arrowood/socket">implementation of connect() in Node.js</a> that allows you to publish your library such that it works across JavaScript runtimes, without having to maintain any runtime-specific code.</p><p>To get started, install it as a dependency:</p>
            <pre><code>npm install --save @arrowood.dev/socket</code></pre>
            <p>And import it in your library or application:</p>
            <pre><code>import { connect } from "@arrowood.dev/socket"</code></pre>
            
    <div>
      <h3>What’s next for connect()?</h3>
      <a href="#whats-next-for-connect">
        
      </a>
    </div>
    <p>The <a href="https://github.com/wintercg/proposal-sockets-api/">wintercg/proposal-sockets-api</a> is published as a draft, and the next step is to solicit and incorporate feedback. We’d love your feedback, particularly if you maintain an open-source library or make direct use of the node:net or node:tls APIs.</p><p>Once feedback has been incorporated, engineers from Cloudflare, Vercel and beyond will be continuing to work towards contributing an implementation of the API directly to Node.js as a built-in API.</p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Node.js]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">6LC7InDwR6gLWapyPtL3u5</guid>
            <dc:creator>Dominik Picheta</dc:creator>
            <dc:creator>James M Snell</dc:creator>
            <dc:creator>Ethan Arrowood (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Unbounded memory usage by TCP for receive buffers, and how we fixed it]]></title>
            <link>https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/</link>
            <pubDate>Thu, 25 May 2023 15:31:46 GMT</pubDate>
            <description><![CDATA[ We are constantly monitoring and optimizing the performance and resource utilization of our systems. Recently, we noticed that some of our TCP sessions were allocating more memory than expected. This blog post describes in detail the root cause of the problem and shows the test results of a solution ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5cIE9yVmHZqySrtOvL5RZf/5562d6abf0227dacb18cd13509fbe3f3/image14-1.png" />
            
            </figure><p>At Cloudflare, we are constantly monitoring and optimizing the performance and resource utilization of our systems. Recently, we noticed that some of our TCP sessions were allocating more memory than expected.</p><p>The Linux kernel allows <a href="https://www.cloudflare.com/learning/ddos/glossary/tcp-ip/">TCP sessions</a> that match certain characteristics to ignore memory allocation limits set by autotuning and allocate excessive amounts of memory, all the way up to net.ipv4.tcp_rmem max (the per-session limit). On Cloudflare’s production network, there are often many such TCP sessions on a server, causing the total amount of allocated TCP memory to reach net.ipv4.tcp_mem thresholds (the server-wide limit). When that happens, the kernel imposes memory use constraints on all TCP sessions, not just the ones causing the problem. Those constraints have a negative impact on throughput and latency for the user. Internally within the kernel, the problematic sessions trigger TCP collapse processing, “OFO” pruning (dropping of packets already received and sitting in the out-of-order queue), and the dropping of newly arriving packets.</p><p>This blog post describes in detail the root cause of the problem and shows the test results of a solution.</p>
    <div>
      <h2>TCP receive buffers are excessively big for some sessions</h2>
      <a href="#tcp-receive-buffers-are-excessively-big-for-some-sessions">
        
      </a>
    </div>
    <p>Our journey began when we started noticing a lot of TCP sessions on some servers with large amounts of memory allocated for receive buffers.  Receive buffers are used by Linux to hold packets that have arrived from the network but have not yet been read by the local process.</p><p>Digging into the details, we observed that most of those TCP sessions had a <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-latency/">latency</a> (RTT) of roughly 20ms. <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">RTT is the round trip time</a> between the endpoints, measured in milliseconds. At that latency, standard <a href="https://en.wikipedia.org/wiki/Bandwidth-delay_product">BDP</a> calculations tell us that a window size of 2.5 MB can accommodate up to 1 Gbps of throughput. We then counted the number of TCP sessions with an upper memory limit set by autotuning (skmem_rb) greater than 5 MB, which is double our calculated window size. The relationship between the window size and skmem_rb is described in more detail <a href="/optimizing-tcp-for-high-throughput-and-low-latency/">here</a>.  There were 558 such TCP sessions on one of our servers. Most of those sessions looked similar to this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5E38uVpENXg60453V33B8z/036d2e40ab0ea1734ea6f19ef158a378/Screenshot-2023-05-25-at-15.28.57.png" />
            
            </figure><p>The key fields to focus on above are:</p><ul><li><p>recvq – the user payload bytes in the receive queue (waiting to be read by the local userspace process)</p></li><li><p>skmem “r” field – the actual amount of kernel memory allocated for the receive buffer (this is the same as the kernel variable sk_rmem_alloc)</p></li><li><p>skmem “rb” field – the limit for “r” (this is the same as the kernel variable sk_rcvbuf)</p></li><li><p>l7read – the user payload bytes read by the local userspace process</p></li></ul><p>Note the value of 256MiB for skmem_r and skmem_rb. That is the red flag that something is very wrong, because those values match the system-wide maximum value set by sysctl net.ipv4.tcp_rmem.  Linux autotuning should not permit the buffers to grow that large for these sessions.</p>
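<p>Both the 2.5 MB window from the BDP calculation and the window scale factor implied by a 256 MiB tcp_rmem max can be sanity-checked with a few lines of Python (a sketch; the scale-factor loop mirrors the kernel's <code>tcp_select_initial_window</code>, which caps the shift at 14):</p>

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the pipe."""
    return bandwidth_bps / 8 * rtt_s

def rcv_wscale(rmem_max, max_wscale=14):
    """Mimic how the kernel picks the receive window scale factor:
    shift the buffer size right until it fits the 16-bit window field."""
    space, wscale = rmem_max, 0
    while space > 65535 and wscale < max_wscale:
        space >>= 1
        wscale += 1
    return wscale

print(bdp_bytes(1e9, 0.020))          # 2500000.0: 2.5 MB window for 1 Gbps at 20 ms
print(rcv_wscale(256 * 1024 * 1024))  # 13: each window unit is 2^13 = 8192 bytes
```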
    <div>
      <h2>Memory limits are not being honored for some TCP sessions</h2>
      <a href="#memory-limits-are-not-being-honored-for-some-tcp-sessions">
        
      </a>
    </div>
    <p>TCP autotuning sets the maximum amount of memory that a session can use. More information about Linux autotuning can be found at <a href="/optimizing-tcp-for-high-throughput-and-low-latency/">Optimizing TCP for high WAN throughput while preserving low latency</a>.</p><p>Here is a graph of one of the problematic sessions, showing skmem_r (allocated memory) and skmem_rb (the limit for “r”) over time:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6frTdIWTmJblioLZM26wYy/783af9ffa5a5e12b8178583ec95baa51/download-17.png" />
            
            </figure><p>This graph is showing us that the limit being set by autotuning is being ignored, because every time skmem_r exceeds skmem_rb, skmem_rb is simply being raised to match it. So something is wrong with how skmem_rb is being handled. This explains the high memory usage. The question now is why.</p>
    <div>
      <h2>The reproducer</h2>
      <a href="#the-reproducer">
        
      </a>
    </div>
    <p>At this point, we had only observed this problem in our production environment. Because we couldn’t predict which TCP sessions would fall into this dysfunctional state, and because we wanted to see the session information for these dysfunctional sessions from the beginning of those sessions, we needed to collect a lot of TCP session data for all TCP sessions. This is challenging in a production environment running at the scale of <a href="https://www.cloudflare.com/network/">Cloudflare’s network</a>. We needed to be able to reproduce this in a controlled lab environment. To that end, we gathered more details about what distinguishes these problematic TCP sessions from others, and ran a large number of experiments in our lab environment to reproduce the problem.</p><p>After <i>a lot</i> of attempts, we finally got it.</p><p>We were left with some pretty dirty lab machines by the time we got to this point, meaning that a lot of settings had been changed. We didn’t believe that all of them were related to the problem, but we didn’t know which ones were and which were not. So we went through a further series of tests to get us to a minimal set up to reproduce the problem. It turned out that a number of factors that we originally thought were important (such as latency) were not important.</p><p>The minimal set up turned out to be surprisingly simple:</p><ol><li><p>At the sending host, run a TCP program with an infinite loop, sending 1500B packets, with a 1 ms delay between each send.</p></li><li><p>At the receiving host, run a TCP program with an infinite loop, reading 1B at a time, with a 1 ms delay between each read.</p></li></ol><p>That’s it. Run these programs and watch your receive queue grow unbounded until it hits net.ipv4.tcp_rmem max.</p><p><b>tcp_server_sender.py</b></p>
            <pre><code>import time
import socket
import errno

daemon_port = 2425
payload = b'a' * 1448

listen_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listen_sock.bind(('0.0.0.0', daemon_port))

# listen backlog
listen_sock.listen(32)
listen_sock.setblocking(True)

while True:
    mysock, _ = listen_sock.accept()
    mysock.setblocking(True)
    
    # do forever (until client disconnects)
    while True:
        try:
            mysock.send(payload)
            time.sleep(0.001)
        except Exception as e:
            print(e)
            mysock.close()
            break</code></pre>
            <p><b>tcp_client_receiver.py</b></p>
            <pre><code>import socket
import time

def do_read(bytes_to_read):
    total_bytes_read = 0
    while True:
        bytes_read = client_sock.recv(bytes_to_read)
        total_bytes_read += len(bytes_read)
        if total_bytes_read &gt;= bytes_to_read:
            break

server_ip = "192.168.2.139"
server_port = 2425

client_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client_sock.connect((server_ip, server_port))
client_sock.setblocking(True)

while True:
    do_read(1)
    time.sleep(0.001)</code></pre>
            
    <div>
      <h2>Reproducing the problem</h2>
      <a href="#reproducing-the-problem">
        
      </a>
    </div>
    <p>First, we ran the above programs with these settings:</p><ul><li><p>Kernel 6.1.14 vanilla</p></li><li><p>net.ipv4.tcp_rmem max = 256 MiB (window scale factor 13, or 8192 bytes)</p></li><li><p>net.ipv4.tcp_adv_win_scale = -2</p></li></ul><p>Here is what this TCP session is doing:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4a1rsgh4WSUwmMi6lBj6P3/967eb34bdceab5dd6752d308befdef18/download--1--11.png" />
            
            </figure><p>At second 189 of the run, we see these packets being exchanged:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/kyCivGZSURqg8rQc1Rl32/e447480e0679472ef93d86068fc4697b/download--2--9.png" />
            
            </figure><p>This is a significant failure because the memory limits are being ignored, and memory usage is unbounded until net.ipv4.tcp_rmem max is reached.</p><p>When net.ipv4.tcp_rmem max is reached:</p><ul><li><p>The kernel drops incoming packets.</p></li><li><p>A ZeroWindow is never sent.  A ZeroWindow is a packet sent by the receiver to the sender telling the sender to stop sending packets.  This is normal and expected behavior when the receiver buffers are full.</p></li><li><p>The sender retransmits, with exponential backoff.</p></li><li><p>Eventually (~15 minutes, depending on system settings) the session times out and the connection is broken (“Errno 110 Connection timed out”).</p></li></ul><p>Note that there is a range of packet sizes that can be sent, and a range of intervals which can be used for the delays, to cause this abnormal condition. This first reproduction is intentionally defined to grow the receive buffer quickly. These rates and delays do not reflect exactly what we see in production.</p>
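<p>The "~15 minutes" figure falls out of the retransmission timer. With the default net.ipv4.tcp_retries2 = 15, the RTO doubles on every retransmission up to a 120-second cap; a rough sketch (assuming a 200 ms initial RTO; the real value depends on the measured RTT):</p>

```python
def time_to_timeout(initial_rto=0.2, retries=15, rto_max=120.0):
    """Seconds until the kernel gives up on an unacknowledged segment:
    the retransmission timeout doubles on each attempt (exponential
    backoff), capped at rto_max."""
    total, rto = 0.0, initial_rto
    for _ in range(retries + 1):  # the initial send plus `retries` retransmissions
        total += rto
        rto = min(rto * 2, rto_max)
    return total

print(time_to_timeout())  # about 924.6 seconds, i.e. roughly 15 minutes
```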
    <div>
      <h2>A closer look at real traffic in production</h2>
      <a href="#a-closer-look-at-real-traffic-in-production">
        
      </a>
    </div>
    <p>The prior section describes what is happening in our lab systems. Is that consistent with what we see in our production streams? Let’s take a look, now that we know more about what we are looking for.</p><p>We did find similar TCP sessions on our production network, which provided confirmation. But we also found this one, which, although it looks a little different, is actually the same root cause:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/77M3L6t408cvgtJm9H4bFw/419901429ff9e1298659c3635d96b59b/download--3--6.png" />
            
            </figure><p>During this TCP session, the rate at which the userspace process is reading from the socket (the L7read rate line) after second 411 is zero. That is, L7 stops reading entirely at that point.</p><p>Notice that the bottom two graphs have a log scale on their y-axis to show that throughput and window size are never zero, even after L7 stops reading.</p><p>Here is the pattern of packet exchange that repeats itself during the erroneous “growth phase” after L7 stopped reading at the 411 second mark:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/680HeUb2d1KD6usPg0lC2/ac04168db28094115cca11807834f3c1/download--4--6.png" />
            
            </figure><p>This variation of the problem is addressed below in the section called “Reader never reads”.</p>
    <div>
      <h2>Getting to the root cause</h2>
      <a href="#getting-to-the-root-cause">
        
      </a>
    </div>
    <p>sk_rcvbuf is being increased inappropriately. Somewhere. Let’s review the code to narrow down the possibilities.</p><p>sk_rcvbuf only gets updated in three places (that are relevant to this issue):</p><ul><li><p><a href="https://elixir.bootlin.com/linux/v6.1.14/source/net/ipv4/tcp_input.c#L572">tcp_clamp_window</a></p></li><li><p><a href="https://elixir.bootlin.com/linux/v6.1.14/source/net/ipv4/tcp_input.c#L701">tcp_rcv_space_adjust</a></p></li><li><p><a href="https://elixir.bootlin.com/linux/v6.1.14/source/net/ipv4/tcp.c#L1845">tcp_set_rcvlowat</a></p></li></ul><p>Actually, we are not calling tcp_set_rcvlowat, which eliminates that one. Next we used bpftrace scripts to figure out whether it’s tcp_clamp_window or tcp_rcv_space_adjust. After bpftracing, the answer is: it’s tcp_clamp_window.</p>
    <div>
      <h2>Summarizing what we know so far, part I</h2>
      <a href="#summarizing-what-we-know-so-far-part-i">
        
      </a>
    </div>
    <p><a href="https://elixir.bootlin.com/linux/v6.1.14/source/net/ipv4/tcp_input.c#L4770">tcp_try_rmem_schedule</a> is being called as usual.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/aohoByJQqypdNrDLfoiDB/1ec5444d6bada18fbb955e9cf6ac40d0/download--5--6.png" />
            
            </figure><p>Sometimes rmem_alloc &gt; sk_rcvbuf. When that happens, prune is called, which calls tcp_clamp_window. tcp_clamp_window <i>increases</i> sk_rcvbuf to match rmem_alloc. That is unexpected.</p><p>The key question is: Why is rmem_alloc &gt; sk_rcvbuf?</p>
    <div>
      <h2>Why is rmem_alloc &gt; sk_rcvbuf?</h2>
      <a href="#why-is-rmem_alloc-sk_rcvbuf">
        
      </a>
    </div>
    <p>More kernel code review ensued, reviewing all the places where rmem_alloc is increased, and looking to see where rmem_alloc could be exceeding sk_rcvbuf. After more bpftracing, watching netstats, etc., the answer is: TCP coalescing.</p>
    <div>
      <h2>TCP coalescing</h2>
      <a href="#tcp-coalescing">
        
      </a>
    </div>
    <p>Coalescing is where the kernel will combine packets as they are being received.</p><p>Note that this is not <a href="https://lwn.net/Articles/358910/">Generic Receive Offload</a> (GRO). This is specific to TCP for packets on the INPUT path. Coalescing is an L4 feature that appends user payload from an incoming packet to an already existing packet, if possible. This saves memory (header space).</p><p>tcp_rcv_established calls tcp_queue_rcv, which calls tcp_try_coalesce. If the incoming packet can be coalesced, then it will be, and rmem_alloc is raised to reflect that. Here’s the important part: rmem_alloc can and does go above sk_rcvbuf because of the logic in that routine.</p>
    <div>
      <h2>Summarizing what we know so far, part II</h2>
      <a href="#summarizing-what-we-know-so-far-part-ii">
        
      </a>
    </div>
    <ol><li><p>Data packets are being received</p></li><li><p>tcp_rcv_established will coalesce, raising rmem_alloc above sk_rcvbuf</p></li><li><p>tcp_try_rmem_schedule -&gt; tcp_prune_queue -&gt; tcp_clamp_window will raise sk_rcvbuf to match rmem_alloc</p></li><li><p>The kernel then increases the window size based upon the new sk_rcvbuf value</p></li></ol><p>In step 2, in order for rmem_alloc to exceed sk_rcvbuf, it has to be near sk_rcvbuf in the first place. We use tcp_adv_win_scale of -2, which means the window size will be 25% of the available buffer size, so we would not expect rmem_alloc to even be close to sk_rcvbuf. In our tests, the truesize ratio is not close to 4, so something unexpected is happening.</p><p>Why is rmem_alloc even close to sk_rcvbuf?</p>
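<p>The 25% figure comes from tcp_win_from_space(), which converts available buffer space into an advertisable window. A sketch of its logic (a negative tcp_adv_win_scale means "advertise this fraction of the buffer"; a positive value subtracts an overhead share instead):</p>

```python
def tcp_win_from_space(space, tcp_adv_win_scale=-2):
    """Mirror the kernel's tcp_win_from_space(): with a negative
    scale the window is space >> -scale; with a positive scale it
    is space minus space >> scale reserved for overhead."""
    if tcp_adv_win_scale <= 0:
        return space >> -tcp_adv_win_scale
    return space - (space >> tcp_adv_win_scale)

# With tcp_adv_win_scale = -2, only a quarter of the buffer is advertised:
print(tcp_win_from_space(1 << 20))  # 262144 bytes, 25% of a 1 MiB buffer
```

<p>With only 25% of the buffer advertised, rmem_alloc should normally sit far below sk_rcvbuf, which is exactly why finding them nearly equal was surprising.</p>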
    <div>
      <h2>Why is rmem_alloc close to sk_rcvbuf?</h2>
      <a href="#why-is-rmem_alloc-close-to-sk_rcvbuf">
        
      </a>
    </div>
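    <p>First, recall where the 25% expectation comes from. The kernel derives the advertised window from available buffer space in tcp_win_from_space(); a TypeScript transcription of that logic (from net/ipv4/tcp.h, simplified) makes the tcp_adv_win_scale arithmetic concrete:</p>

```typescript
// Transcription of the kernel's tcp_win_from_space() logic: derive the
// advertised window from available buffer space using tcp_adv_win_scale.
function winFromSpace(space: number, tcpAdvWinScale: number): number {
  return tcpAdvWinScale <= 0
    ? space >> -tcpAdvWinScale            // scale -2: window = space / 4
    : space - (space >> tcpAdvWinScale);  // scale  1: window = space / 2
}

console.log(winFromSpace(1 << 20, -2)); // 262144 -- 25% of a 1 MiB buffer
console.log(winFromSpace(1 << 20, 1));  // 524288 -- 50% of a 1 MiB buffer
```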
    <p>Sending a ZeroWindow (a packet advertising a window size of zero) is how a TCP receiver tells a TCP sender to stop sending when the receive window is full. This is the mechanism that should keep rmem_alloc well below sk_rcvbuf.</p><p>During our tests, we happened to notice that the SNMP metric <a href="https://elixir.bootlin.com/linux/v6.1.14/source/net/ipv4/proc.c#L270">TCPWantZeroWindowAdv</a> was increasing. The receiver was not sending ZeroWindows when it should have been.  So our attention fell on the window calculation logic, and we arrived at the root cause of all of our problems.</p>
    <div>
      <h2>The root cause</h2>
      <a href="#the-root-cause">
        
      </a>
    </div>
    <p>The problem has to do with how the receive window size is calculated. This is the value in the TCP header that the receiver sends to the sender. Together with the ACK value, it communicates to the sender what the right edge of the window is.</p><p>The way TCP’s sliding window works is described in Stevens, “TCP/IP Illustrated, Volume 1”, section 20.3.  Visually, the receive window looks like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3T4BEKeOU8Vaa7rzuec8OI/641a64303988fd717fbcc1bbc61dadd9/download--6--5.png" />
            
            </figure><p>In the early days of the Internet, wide-area communications links offered low bandwidths (relative to today), so the 16-bit window field in the TCP header was more than enough to express the size of the receive window needed to achieve optimal throughput. Then the future happened, and now those 16-bit window values are scaled based upon a multiplier set during the TCP 3-way handshake.</p><p>The window scaling factor allows us to reach high throughputs on modern networks, but it also introduced an issue that we must now discuss.</p><p>The granularity of the receive window size that can be set in the TCP header is larger than the granularity of the actual changes we sometimes want to make to the size of the receive window.</p><p>When window scaling is in effect, every time the receiver ACKs some data, the receiver has to move the right edge of the window either left or right. The only exception would be if the amount of ACKed data is exactly a multiple of the window scale factor, and the receive window size specified in the ACK packet was reduced by the same multiple. This is rare.</p><p>So the right edge has to move. Most of the time, the receive window size does not change and the right edge moves to the right in lockstep with the ACK (the left edge), which always moves to the right.</p><p>The receiver can decide to increase the size of the receive window, based on its normal criteria, and that’s fine. It just means the right edge moves farther to the right. No problems.</p><p>But what happens when we approach a window full condition? Keeping the right edge unchanged is not an option.  We are forced to make a decision. Our choices are:</p><ul><li><p>Move the right edge to the right</p></li><li><p>Move the right edge to the left</p></li></ul><p>But if we have arrived at the upper limit, then moving the right edge to the right requires us to ignore the upper limit. This is equivalent to not having a limit. 
This is what Linux does today, and is the source of the problems described in this post.</p><p>This occurs for any window scaling factor greater than one. This means everyone.</p>
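    <p>A small example makes the granularity problem concrete (illustrative TypeScript; the names are ours, not the kernel's). With a window scale factor of 13, the header window can only change in 8192-byte steps, so ACKing any smaller amount forces the right edge to advance:</p>

```typescript
// The header carries win >> wscale, so the advertised window can only
// change in multiples of 2^wscale bytes. Illustrative names, not kernel code.
const wscale = 13;                // shift count negotiated in the handshake
const granularity = 1 << wscale;  // 8192 bytes per header-window unit

// Right edge of the window as the sender computes it from an ACK packet.
const rightEdge = (ack: number, hdrWin: number) => ack + (hdrWin << wscale);

let ack = 1_000_000;
const hdrWin = 32768 / granularity; // advertise 32768 bytes => 4 header units
const before = rightEdge(ack, hdrWin);

// The receiver ACKs 100 bytes -- less than the 8192-byte granularity. To
// hold the right edge fixed it would have to subtract 100/8192 of a unit
// from hdrWin, which is impossible; leaving hdrWin unchanged moves the
// right edge to the right.
ack += 100;
const after = rightEdge(ack, hdrWin);
console.log(after - before); // 100 -- the right edge advanced
```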
    <div>
      <h2>A sidebar on terminology</h2>
      <a href="#a-sidebar-on-terminology">
        
      </a>
    </div>
    <p>The window size specified in the TCP header is the receive window size. It is sent from the receiver to the sender. The ACK number plus the window size defines the range of sequence numbers that the sender may send. It is also called the advertised window, or the offered window.</p><p>There are three terms related to TCP window management that are important to understand:</p><ul><li><p>Closing the window. This is when the left edge of the window moves to the right. This occurs every time an ACK of a data packet arrives at the sender.</p></li><li><p>Opening the window. This is when the right edge of the window moves to the right.</p></li><li><p>Shrinking the window. This is when the right edge of the window moves to the left.</p></li></ul><p>Opening and shrinking are not the same thing as the receive window size in the TCP header getting larger or smaller. The right edge is defined as the ACK number plus the receive window size. Shrinking only occurs when that right edge moves to the left (i.e. gets reduced).</p><p><a href="https://www.rfc-editor.org/rfc/rfc7323">RFC 7323</a> describes <a href="https://www.rfc-editor.org/rfc/rfc7323#section-2.4">window retraction</a>. Retracting the window is the same as shrinking the window.</p>
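    <p>These three terms can be pinned down as predicates on edge movement (an illustrative helper with our own names; edges are absolute sequence numbers):</p>

```typescript
// The three window-management terms as predicates on edge movement.
// Illustrative helper, not kernel code; right edge = ack + win.
type TcpWindow = { ack: number; win: number };

function classifyWindow(prev: TcpWindow, next: TcpWindow): string[] {
  const events: string[] = [];
  if (next.ack > prev.ack) events.push("closing");     // left edge moved right
  const prevRight = prev.ack + prev.win;
  const nextRight = next.ack + next.win;
  if (nextRight > prevRight) events.push("opening");   // right edge moved right
  if (nextRight < prevRight) events.push("shrinking"); // right edge moved left
  return events;
}

// ACK 100 bytes while the advertised size drops by 300: the window both
// closes (left edge moves right) and shrinks (right edge moves left).
console.log(classifyWindow({ ack: 0, win: 1000 }, { ack: 100, win: 700 }));
// -> [ "closing", "shrinking" ]
```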
    <div>
      <h2>Discussion regarding solutions</h2>
      <a href="#discussion-regarding-solutions">
        
      </a>
    </div>
    <p>There are only three options to consider:</p><ol><li><p>Let the window grow</p></li><li><p>Drop incoming packets</p></li><li><p>Shrink the window</p></li></ol>
    <div>
      <h3>Let the window grow</h3>
      <a href="#let-the-window-grow">
        
      </a>
    </div>
    <p>Letting the window grow is the same as ignoring the memory limits set by autotuning. It results in allocating excessive amounts of memory for no reason. This is really just kicking the can down the road until allocated memory reaches net.ipv4.tcp_rmem max, at which point we are forced to choose one of the other two options.</p>
    <div>
      <h3>Drop incoming packets</h3>
      <a href="#drop-incoming-packets">
        
      </a>
    </div>
    <p>Dropping incoming packets will cause the sender to retransmit the dropped packets, with exponential backoff, until an eventual timeout (depending on the client read rate), which breaks the connection.  ZeroWindows are never sent.  This wastes bandwidth and processing resources by retransmitting packets we know will not be successfully delivered to L7 at the receiver.  This is functionally incorrect for a window full situation.</p>
    <div>
      <h3>Shrink the window</h3>
      <a href="#shrink-the-window">
        
      </a>
    </div>
    <p>Shrinking the window involves moving the right edge of the window to the left when approaching a window full condition.  A ZeroWindow is sent when the window is full.  There is no wasted memory, no wasted bandwidth, and no broken connections.</p><p>The current situation is that we are letting the window grow (option #1), and when net.ipv4.tcp_rmem max is reached, we are dropping packets (option #2).</p><p>We need to stop doing option #1. One alternative is to drop packets (option #2) as soon as sk_rcvbuf is reached; this avoids excessive memory usage, but is still functionally incorrect for a window full situation. The other is to shrink the window (option #3).</p>
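    <p>Option #3's behavior can be sketched as follows (our own simplification, not the actual patch): the advertised window is derived from the room remaining below the memory limit, so it shrinks toward zero instead of letting the right edge grow past the limit, and a ZeroWindow falls out naturally once the room is exhausted:</p>

```typescript
// Our own simplification of option #3, not the actual kernel patch: derive
// the advertised window from the room left below the memory limit, shrinking
// toward zero rather than letting the right edge pass the limit.
function advertise(ack: number, limitRight: number, wscale: number): number {
  const room = Math.max(0, limitRight - ack); // bytes we may still accept
  return room >> wscale; // header-window units; 0 means send a ZeroWindow
}

const wscale = 13;
const limitRight = 1_000_000 + 16_384; // right edge permitted by sk_rcvbuf

console.log(advertise(1_000_000, limitRight, wscale)); // 2 units = 16384 bytes
console.log(advertise(1_016_000, limitRight, wscale)); // 0 -- ZeroWindow sent
```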
    <div>
      <h2>Shrinking the window</h2>
      <a href="#shrinking-the-window">
        
      </a>
    </div>
    <p>It turns out that this issue has already been addressed in the RFCs.</p><p><a href="https://www.rfc-editor.org/rfc/rfc7323#section-2.4">RFC 7323</a> says:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4hjOIKUJlsa2mQ3NTD0oNS/6fe810be0066a57ca07a8fdfbb5cd4dd/download--7--1.png" />
            
            </figure><p>There are two elements here that are important.</p><ul><li><p>“there are instances when a retracted window can be offered”</p></li><li><p>“Implementations MUST ensure that they handle a shrinking window”</p></li></ul><p><a href="https://www.rfc-editor.org/rfc/rfc7323#appendix-F">Appendix F</a> of that RFC describes our situation, adding:</p><ul><li><p>“<i>This is a general problem and can happen any time the sender does a write, which is smaller than the window scale factor.</i>”</p></li></ul>
    <div>
      <h2>Kernel patch</h2>
      <a href="#kernel-patch">
        
      </a>
    </div>
    <p>The Linux kernel patch we wrote to enable TCP window shrinking has been merged upstream and will be in kernel version 6.5 and later.  The commit can be found <a href="https://github.com/torvalds/linux/commit/b650d953cd391595e536153ce30b4aab385643ac">here</a>.</p>
    <div>
      <h2>Rerunning the test above with kernel patch</h2>
      <a href="#rerunning-the-test-above-with-kernel-patch">
        
      </a>
    </div>
    <p>Here is the test we showed above, but this time using the kernel patch:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6q08PzhHmDALsyvrqEanvp/a0de8edfc8961bc66ee4f888c2dda58b/download--8--1.png" />
            
            </figure><p>Here is the pattern of packet exchanges that repeat when using the kernel patch:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2jENxEgOngxoMsWA2CtkGb/e56d683213ce56c55b3626bedce92a97/download--10--1.png" />
            
            </figure><p>We see that the memory limit is being honored, ZeroWindows are being sent, there are no retransmissions, and no disconnects after 15 minutes. This is the desired result.</p>
    <div>
      <h2>Test results using a TCP window scaling factor of 8</h2>
      <a href="#test-results-using-a-tcp-window-scaling-factor-of-8">
        
      </a>
    </div>
    <p>A window scaling factor of 8 with tcp_adv_win_scale of 1 is commonly seen on the public Internet, so let’s test that.</p><ul><li><p>kernel 6.1.14 vanilla</p></li><li><p>tcp_rmem max = 8 MiB (window scale factor 8, or 256 bytes)</p></li><li><p>tcp_adv_win_scale = 1</p></li></ul>
    <div>
      <h3>Without the kernel patch</h3>
      <a href="#without-the-kernel-patch">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/bymqRbxF3Ty7SokgQACbf/42b90b67aca9e9f539aa634c93b474a3/download--11--1.png" />
            
            </figure><p>At the ~2100 second mark, we see the same problems we saw earlier when using wscale 13.</p>
    <div>
      <h3>With the kernel patch</h3>
      <a href="#with-the-kernel-patch">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6GEvtqjf7mRG9z4iouZdxU/6a89e8283c392ace7b8b9475a5cba303/download--12--1.png" />
            
            </figure><p>The kernel patch is working as expected.</p>
    <div>
      <h2>Test results using an oscillating reader</h2>
      <a href="#test-results-using-an-oscillating-reader">
        
      </a>
    </div>
    <p>This is a test run where the reader alternates every 240 seconds between reading slow and reading fast.  Slow is 1B every 1 ms and fast is 3300B every 1 ms.</p><ul><li><p>kernel 6.1.14 vanilla</p></li><li><p>net.ipv4.tcp_rmem max = 256 MiB (window scale factor 13, or 8192 bytes)</p></li><li><p>tcp_adv_win_scale = -2</p></li></ul>
    <div>
      <h3>Without the kernel patch</h3>
      <a href="#without-the-kernel-patch">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/wVZgjyGcD2JT5lO8okZWO/7c1b6f087b6686024d2cfd29cdbb138b/download--13-.png" />
            
            </figure>
    <div>
      <h3>With the kernel patch</h3>
      <a href="#with-the-kernel-patch">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6euKMB55WcEZzPsHR59OGJ/7064210c444249644026b33e4559c1cf/download--14-.png" />
            
            </figure><p>The kernel patch is working as expected.</p><p>NB. We do see the increase of skmem_rb at the 720 second mark, but it only goes to ~20MB and does not grow unbounded. Whether or not 20MB is the most ideal value for this TCP session is an interesting question, but that is a topic for a different blog post.</p>
    <div>
      <h2>Reader never reads</h2>
      <a href="#reader-never-reads">
        
      </a>
    </div>
    <p>Here’s a good one. Say a reader never reads from the socket. How much TCP receive buffer memory would we expect that connection to consume? One might assume that the kernel would accept a few packets, store the payload in the receive queue, then pause the flow of packets until the userspace program starts reading.  The actual answer is that the kernel keeps accepting packets until the receive queue grows to the size of net.ipv4.tcp_rmem max.  This is incorrect behavior, to say the very least.</p><p>For this test, the sender sends 4 bytes every 1 ms.  The reader, literally, never reads from the socket. Not once.</p><ul><li><p>kernel 6.1.14 vanilla</p></li><li><p>net.ipv4.tcp_rmem max = 8 MiB (window scale factor 8, or 256 bytes)</p></li><li><p>net.ipv4.tcp_adv_win_scale = -2</p></li></ul>
    <div>
      <h3>Without the kernel patch:</h3>
      <a href="#without-the-kernel-patch">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7wmz2uhmftqBWNdyAjJREM/3d780a8cf115aa4ed60dd4f37be68ea4/download--15-.png" />
            
            </figure>
    <div>
      <h3>With the kernel patch:</h3>
      <a href="#with-the-kernel-patch">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1D4GXbXA6H7kdqXtBM3VPM/9103c7f2821f3fd3b4d4cfed668b7cb2/download--16-.png" />
            
            </figure><p>Using the kernel patch produces the expected behavior.</p>
    <div>
      <h2>Results from the Cloudflare production network</h2>
      <a href="#results-from-the-cloudflare-production-network">
        
      </a>
    </div>
    <p>We deployed this patch to the Cloudflare production network, and can see the effects in aggregate when running at scale.</p>
    <div>
      <h3>Packet drop rates</h3>
      <a href="#packet-drop-rates">
        
      </a>
    </div>
    <p>This first graph shows RcvPruned, the number of incoming packets per second that were dropped due to memory constraints.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7K8K11hUJCsHQfzVNg0ljy/8abe4f6e7973cc8f86d6ddd97c95f2ed/download--17-.png" />
            
            </figure><p>The patch was enabled on most servers on 05/01 at 22:00, eliminating those drops.</p>
    <div>
      <h3>TCPRcvCollapsed</h3>
      <a href="#tcprcvcollapsed">
        
      </a>
    </div>
    <p>Recall that TCPRcvCollapsed is the number of packets per second that are merged together in the queue in order to reduce memory usage (by eliminating header metadata).  This occurs when memory limits are reached.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2hvXUoVFy84ss30qG9D9d4/354ba2447d99e4e85366396cfd31b4d0/download--18-.png" />
            
            </figure><p>The patch was enabled on most servers on 05/01 at 22:00. These graphs show the results from one of our data centers. The upper graph shows that the patch has eliminated all collapse processing. The lower graph shows the amount of time spent in collapse processing (each line in the lower graph is a single server). This is important because it can impact Cloudflare’s responsiveness in processing <a href="https://www.cloudflare.com/learning/ddos/glossary/hypertext-transfer-protocol-http/">HTTP requests</a>.  The result of the patch is that all latency due to TCP collapse processing has been eliminated.</p>
    <div>
      <h3>Memory</h3>
      <a href="#memory">
        
      </a>
    </div>
    <p>Because the memory limits set by autotuning are now being enforced, the total amount of memory allocated is reduced.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5F0FZs5UBS7QLMo6KMAIWE/8d7597782fb2bde9895591379b4ae994/download--19-.png" />
            
            </figure><p>In this graph, the green line shows the total amount of memory allocated for TCP buffers in one of our data centers.  This is with the patch enabled.  The purple line is the same total, but from exactly 7 days prior to the time indicated on the x axis, before the patch was enabled.  This week-over-week comparison makes the memory saved with the patch easy to see.</p>
    <div>
      <h3>ZeroWindows</h3>
      <a href="#zerowindows">
        
      </a>
    </div>
    <p>TCPWantZeroWindowAdv is the number of times per second that the window calculation based on available buffer memory produced a value that should have resulted in a ZeroWindow being sent to the sender, but was not.  In other words, this is how often the receive buffer was increased beyond the limit set by autotuning.</p><p>After a receiver has sent a ZeroWindow to the sender, the receiver is not expecting to get any additional data from the sender. Should additional data packets arrive at the receiver during the period when the window size is zero, those packets are dropped and the metric TCPZeroWindowDrop is incremented.  These dropped packets are usually just due to the timing of these events, i.e. the ZeroWindow packet in one direction and some data packets flowing in the other direction passed each other on the network.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6lXFnIJVZKY9W6hW9IWIoA/1f0d0320f00ff92cde5b9b17f131ac4b/download--20-.png" />
            
            </figure><p>The patch was enabled on most servers on 05/01 at 22:00, although it was enabled for a subset of servers on 04/26 and 04/28.</p><p>The upper graph tells us that ZeroWindows are indeed being sent when they need to be, based on the available memory at the receiver.  This is what the lack of “Wants” starting on 05/01 is telling us.</p><p>The lower graph reports the packets that are dropped because the session is in a ZeroWindow state. These drops are expected and harmless: the receiver has already told the sender to stop sending, so any data that arrives while the window is zero can safely be discarded.</p><p>All of these results are as expected.</p><p>Importantly, we have also not found any peer TCP stacks that are non-RFC compliant (i.e. that are not able to accept a shrinking window).</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>In this blog post, we described when and why TCP memory limits are not being honored in the Linux kernel, and introduced a patch that fixes it. All in a day’s work at Cloudflare, where we are helping build a better Internet.</p> ]]></content:encoded>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">4Q5kourPRueUXv2i7d1GUO</guid>
            <dc:creator>Mike Freemon</dc:creator>
        </item>
        <item>
            <title><![CDATA[Announcing connect() — a new API for creating TCP sockets from Cloudflare Workers]]></title>
            <link>https://blog.cloudflare.com/workers-tcp-socket-api-connect-databases/</link>
            <pubDate>Tue, 16 May 2023 13:00:13 GMT</pubDate>
            <description><![CDATA[ Today, we are excited to announce a new API in Cloudflare Workers for creating outbound TCP sockets, making it possible to connect directly to databases and any TCP-based service from Workers ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1CjlPkdLJUXlfgIKgq2Jvy/d2e17e3027c02f82e191007561640f79/image2-12.png" />
            
            </figure><p>Today, we are excited to announce a new API in Cloudflare Workers for creating outbound TCP sockets, making it possible to connect directly to any TCP-based service from Workers.</p><p>Standard protocols including <a href="https://www.cloudflare.com/learning/access-management/what-is-ssh/">SSH</a>, MQTT, SMTP, FTP, and IRC are all built on top of TCP. Most importantly, nearly all applications need to connect to databases, and most databases speak TCP. And while <a href="https://developers.cloudflare.com/d1/">Cloudflare D1</a> works seamlessly on Workers, and some <a href="https://developers.cloudflare.com/workers/learning/integrations/databases/">hosted database providers</a> allow connections over HTTP or WebSockets, the vast majority of databases, both relational (SQL) and document-oriented (NoSQL), require clients to connect by opening a direct TCP “socket”, an ongoing two-way connection that is used to send queries and receive data. Now, Workers provides an API for this, the first of many steps to come in allowing you to use any database or infrastructure you choose when building full-stack applications on Workers.</p><p>Database drivers, the client code used to connect to databases and execute queries, are already using this new API. <a href="https://github.com/brianc/node-postgres">pg</a>, the most widely used JavaScript database driver for PostgreSQL, works on Cloudflare Workers today, with more database drivers to come.</p><p>The TCP Socket API is available today to everyone. Get started by reading the <a href="https://developers.cloudflare.com/workers/runtime-apis/tcp-sockets">TCP Socket API docs</a>, or connect directly to any PostgreSQL database from your Worker by following <a href="https://developers.cloudflare.com/workers/databases/connect-to-postgres/">this guide</a>.</p>
    <div>
      <h2>First — what is a TCP Socket?</h2>
      <a href="#first-what-is-a-tcp-socket">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/ddos/glossary/tcp-ip/">TCP (Transmission Control Protocol)</a> is a foundational networking protocol of the Internet. It is the underlying protocol that is used to make HTTP requests (prior to <a href="https://www.cloudflare.com/learning/performance/what-is-http3/">HTTP/3</a>, which uses <a href="https://cloudflare-quic.com/">QUIC</a>), to send email over <a href="https://www.cloudflare.com/learning/email-security/what-is-smtp/">SMTP</a>, to query databases using database-specific protocols like MySQL, and many other application-layer protocols.</p><p>A TCP socket is a programming interface that represents a two-way communication connection between two applications that have both agreed to “speak” over TCP. One application (ex: a Cloudflare Worker) initiates an outbound TCP connection to another (ex: a database server) that is listening for inbound TCP connections. Connections are established by negotiating a three-way handshake, and after the handshake is complete, data can be sent bi-directionally.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6xxArl43DbexJUoRmw8JrG/0ad545bb25f002a4598d387aca491997/image1-30.png" />
            
            </figure><p>A socket is the programming interface for a single TCP connection — it has both a readable and writable “stream” of data, allowing applications to read and write data on an ongoing basis, as long as the connection remains open.</p>
    <div>
      <h2>connect() — A simpler socket API</h2>
      <a href="#connect-a-simpler-socket-api">
        
      </a>
    </div>
    <p>With Workers, we aim to support standard APIs that are supported across browsers and non-browser environments wherever possible, so that as many NPM packages as possible work on Workers without changes, and package authors don’t have to write runtime-specific code. But for TCP sockets, we faced a challenge — there was no clear shared standard across runtimes. Node.js provides the <a href="https://nodejs.org/api/net.html">net</a> and <a href="https://nodejs.org/api/tls.html">tls</a> APIs, but Deno implements a different API — <a href="https://deno.land/api@v1.33.1?s=Deno.connect">Deno.connect</a>. And web browsers do not provide a raw TCP socket API, though a <a href="https://github.com/WICG/direct-sockets/blob/main/docs/explainer.md">WICG proposal</a> does exist, and it is different from both Node.js and Deno.</p><p>We also considered how a TCP socket API could be designed to maximize performance and ergonomics in a serverless environment. Most networking APIs were designed well before serverless emerged, with the assumption that the developer’s application is also the server, responsible for directly handling configuring TLS options and credentials.</p><p>With this backdrop, we reached out to the community, with a focus on maintainers of database drivers, ORMs and other libraries that create outbound TCP connections. Using this feedback, we’ve tried to incorporate the best elements of existing APIs and proposals, and intend to contribute back to future standards, as part of the <a href="/introducing-the-wintercg/">Web-interoperable Runtimes Community Group (WinterCG)</a>.</p><p>The API we landed on is a simple function, connect(), imported from the new cloudflare:sockets module, that returns an instance of a Socket. Here’s a simple example showing it used to connect to a <a href="https://www.w3.org/People/Bos/PROSA/rep-protocols.html#gopher">Gopher</a> server. Gopher was one of the Internet’s early protocols that relied on TCP/IP, and still works today:</p>
            <pre><code>import { connect } from 'cloudflare:sockets';

export default {
  async fetch(req: Request) {
    const gopherAddr = "gopher.floodgap.com:70";
    const url = new URL(req.url);

    try {
      const socket = connect(gopherAddr);

      const writer = socket.writable.getWriter()
      const encoder = new TextEncoder();
      const encoded = encoder.encode(url.pathname + "\r\n");
      await writer.write(encoded);

      return new Response(socket.readable, { headers: { "Content-Type": "text/plain" } });
    } catch (error) {
      return new Response("Socket connection failed: " + error, { status: 500 });
    }
  }
};</code></pre>
            <p>We think this API design has many benefits that can be realized not just on Cloudflare, but in any serverless environment that adopts this design:</p>
            <pre><code>connect(address: SocketAddress | string, options?: SocketOptions): Socket

declare interface Socket {
  get readable(): ReadableStream;
  get writable(): WritableStream;
  get closed(): Promise&lt;void&gt;;
  close(): Promise&lt;void&gt;;
  startTls(): Socket;
}

declare interface SocketOptions {
  secureTransport?: string;
  allowHalfOpen: boolean;
}

declare interface SocketAddress {
  hostname: string;
  port: number;
}</code></pre>
            
    <div>
      <h3>Opportunistic TLS (StartTLS), without separate APIs</h3>
      <a href="#opportunistic-tls-starttls-without-separate-apis">
        
      </a>
    </div>
    <p>Opportunistic TLS, a pattern of creating an initial insecure connection, and then upgrading it to a secure one that uses TLS, remains common, particularly with database drivers. In Node.js, you must use the <a href="https://nodejs.org/api/net.html#class-netsocket">net</a> API to create the initial connection, and then use the <a href="https://nodejs.org/api/tls.html">tls</a> API to create a new, upgraded connection. In Deno, you pass the original socket to <a href="https://deno.land/api@v1.33.1?s=Deno.startTls">Deno.startTls()</a>, which creates a new, upgraded connection.</p><p>Drawing on a <a href="https://www.w3.org/TR/tcp-udp-sockets/#idl-def-TCPOptions">previous W3C proposal</a> for a TCP Socket API, we’ve simplified this by providing one API that allows TLS to be disabled, enabled from the start, or upgraded later, and exposes a simple method, startTls(), for upgrading a socket to use TLS.</p>
            <pre><code>// Create a new socket without TLS. secureTransport defaults to "off" if not specified.
const socket = connect("address:port", { secureTransport: "off" })

// Create a new socket, then upgrade it to use TLS.
// Once startTls() is called, only the newly created socket can be used.
const socket = connect("address:port", { secureTransport: "starttls" })
const secureSocket = socket.startTls();

// Create a new socket with TLS
const socket = connect("address:port", { secureTransport: "use" })</code></pre>
            
    <div>
      <h3>TLS configuration — a concern of host infrastructure, not application code</h3>
      <a href="#tls-configuration-a-concern-of-host-infrastructure-not-application-code">
        
      </a>
    </div>
    <p>Existing APIs for creating TCP sockets treat TLS as a library that you interact with in your application code. The <a href="https://nodejs.org/api/tls.html#tlscreatesecurecontextoptions">tls.createSecureContext()</a> API from Node.js has a plethora of advanced configuration options that are mostly environment specific. If you use custom certificates when connecting to a particular service, you likely use a different set of credentials and options in production, staging and development. Managing direct file paths to credentials across environments and swapping out .env files in production build steps are common pain points.</p><p>Host infrastructure is best positioned to manage this on your behalf, and similar to Workers support for <a href="/mtls-workers/">making subrequests using mTLS</a>, TLS configuration and credentials for the socket API will be managed via Wrangler, and a connect() function provided via a <a href="https://developers.cloudflare.com/workers/platform/bindings/">capability binding</a>. Currently, custom TLS credentials and configuration are not supported, but are coming soon.</p>
    <div>
      <h3>Start writing data immediately, before the TLS handshake finishes</h3>
      <a href="#start-writing-data-immediately-before-the-tls-handshake-finishes">
        
      </a>
    </div>
    <p>Because the connect() API synchronously returns a new socket, one can start writing to the socket immediately, without waiting for the TCP handshake to first complete. This means that once the handshake completes, data is already available to send immediately, and host platforms can make use of pipelining to optimize performance.</p>
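    <p>The pattern looks like the following sketch (a simplified stand-in, not the Workers implementation): writes issued before the handshake completes are queued, then flushed in order once the connection is ready, which is what enables pipelining the first request into the handshake:</p>

```typescript
// Simplified stand-in, not the Workers implementation: a socket-like object
// whose write() queues data until the (simulated) handshake completes, then
// flushes it in order -- the behavior that makes early writes safe.
class EarlyWriteSocket {
  private queued: string[] = [];
  private connected = false;
  readonly sentData: string[] = [];

  write(chunk: string): void {
    if (this.connected) this.sentData.push(chunk);
    else this.queued.push(chunk); // handshake not done yet: buffer it
  }

  // Invoked once the TCP (and optionally TLS) handshake finishes.
  onConnected(): void {
    this.connected = true;
    for (const chunk of this.queued) this.sentData.push(chunk);
    this.queued = [];
  }
}

const sock = new EarlyWriteSocket();
sock.write("startup packet"); // issued immediately after connect() returns
sock.onConnected();           // handshake completes later
console.log(sock.sentData);   // [ "startup packet" ]
```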
    <div>
      <h2>connect() API + DB drivers = Connect directly to databases</h2>
      <a href="#connect-api-db-drivers-connect-directly-to-databases">
        
      </a>
    </div>
    <p>Many <a href="https://www.cloudflare.com/developer-platform/products/d1/">serverless databases</a> already work on Workers, allowing clients to connect over HTTP or over <a href="/neon-postgres-database-from-workers/">WebSockets</a>. But most databases don’t “speak” HTTP, including databases hosted on most cloud providers.</p><p>Databases each have their own “wire protocol”, and open-source database “drivers” that speak this protocol, sending and receiving data over a TCP socket. Developers rely on these drivers in their own code, as do database ORMs. Our goal is to make sure that you can use the same drivers and ORMs you might use in other runtimes and on other platforms on Workers.</p>
    <div>
      <h2>Try it now — connect to PostgreSQL from Workers</h2>
      <a href="#try-it-now-connect-to-postgresql-from-workers">
        
      </a>
    </div>
    <p>We’ve worked with the maintainers of <a href="https://www.npmjs.com/package/pg">pg</a>, one of the most popular database drivers in the JavaScript ecosystem, used by ORMs including <a href="https://sequelize.org/docs/v6/getting-started/">Sequelize</a> and <a href="https://knexjs.org/">knex.js</a>, to add support for connect().</p><p>You can try this right now. First, create a new Worker and install pg:</p>
            <pre><code>wrangler init
npm install --save pg</code></pre>
            <p>As of this writing, you’ll need to <a href="https://developers.cloudflare.com/workers/wrangler/configuration/#add-polyfills-using-wrangler">enable the node_compat</a> option in wrangler.toml:</p><p><b>wrangler.toml</b></p>
            <pre><code>name = "my-worker"
main = "src/index.ts"
compatibility_date = "2023-05-15"
node_compat = true</code></pre>
            <p>In just 20 lines of TypeScript, you can create a connection to a Postgres database, execute a query, return results in the response, and close the connection:</p><p><b>index.ts</b></p>
            <pre><code>import { Client } from "pg";

export interface Env {
  DB: string;
}

export default {
  async fetch(
    request: Request,
    env: Env,
    ctx: ExecutionContext
  ): Promise&lt;Response&gt; {
    const client = new Client(env.DB);
    await client.connect();
    const result = await client.query({
      text: "SELECT * from customers",
    });
    console.log(JSON.stringify(result.rows));
    const resp = Response.json(result.rows);
    // Close the database connection, but don't block returning the response
    ctx.waitUntil(client.end());
    return resp;
  },
};</code></pre>
            <p>To test this in local development, use the <code>--experimental-local</code> flag (instead of <code>--local</code>), which <a href="/miniflare-and-workerd/">uses the open-source Workers runtime</a>, ensuring that what you see locally mirrors behavior in production:</p>
            <pre><code>wrangler dev --experimental-local</code></pre>
            
    <div>
      <h2>What’s next for connecting to databases from Workers?</h2>
      <a href="#whats-next-for-connecting-to-databases-from-workers">
        
      </a>
    </div>
    <p>This is only the beginning. We’re aiming for the two popular MySQL drivers, <a href="https://github.com/mysqljs/mysql">mysql</a> and <a href="https://github.com/sidorares/node-mysql2">mysql2</a>, to work on Workers soon, with more to follow. If you work on a database driver or ORM, we’d love to help make your library work on Workers.</p><p>If you’ve worked closely with database scaling and performance, you might have noticed that in the example above, a new connection is created for every request. This is one of the biggest current challenges of connecting to databases from serverless functions, across all platforms. With typical client connection pooling, you maintain a local pool of database connections that remain open between requests. On Workers, storing a reference to a connection or connection pool in global scope will not work, and the pattern is a poor fit for serverless in general. Managing individual pools of client connections on a per-isolate basis creates other headaches — when and how should connections be terminated? How can you limit the total number of concurrent connections across many isolates and locations?</p><p>Instead, we’re already working on simpler approaches to connection pooling for the most popular databases. We see a path to a future where you don’t have to think about or manage client connection pooling on your own. We’re also working on a brand new approach to making your database reads lightning fast.</p>
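<p>For readers unfamiliar with the pattern, here is a minimal sketch of what "client connection pooling" means (a generic illustration in Python, not a Workers API). The pool below is exactly the kind of long-lived global state that per-isolate execution makes problematic:</p>

```python
import queue

class ConnectionPool:
    """Minimal client-side connection pool: open N connections up front,
    hand them out per request, and take them back for reuse.
    `factory` is any callable producing a connection-like object."""

    def __init__(self, factory, size: int):
        self._idle: queue.Queue = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())

    def acquire(self):
        # Blocks if every connection is currently checked out
        return self._idle.get()

    def release(self, conn) -> None:
        self._idle.put(conn)
```

In a long-running server this amortizes the TCP and TLS handshake cost across many requests; in a serverless runtime there is no single long-lived process to own the pool, which is why pooling is better solved on the platform side.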
    <div>
      <h2>What’s next for sockets on Workers?</h2>
      <a href="#whats-next-for-sockets-on-workers">
        
      </a>
    </div>
    <p>Supporting outbound TCP connections is only one half of the story — we plan to support inbound TCP and UDP connections, as well as new emerging application protocols based on QUIC, so that you can build applications beyond HTTP with <a href="/introducing-socket-workers/">Socket Workers</a>.</p><p>Earlier today we also announced <a href="/announcing-workers-smart-placement">Smart Placement</a>, which improves performance by running any Worker that makes multiple HTTP requests to an origin as close to that origin as possible, reducing round-trip time. We’re working on making this work with Workers that open TCP connections, so that if your Worker connects to a database in Virginia and makes many queries over a TCP connection, each query is lightning fast and comes from the nearest location on <a href="https://www.cloudflare.com/network/">Cloudflare’s global network</a>.</p><p>We also plan to support custom certificates and other TLS configuration options in the coming months — tell us which options are must-haves for connecting to the services you rely on from Workers.</p>
    <div>
      <h2>Get started, and share your feedback</h2>
      <a href="#get-started-and-share-your-feedback">
        
      </a>
    </div>
    <p>The TCP Socket API is available today to everyone. Get started by reading the <a href="https://developers.cloudflare.com/workers/runtime-apis/tcp-sockets">TCP Socket API docs</a>, or connect directly to any PostgreSQL database from your Worker by following <a href="https://developers.cloudflare.com/workers/databases/connect-to-postgres/">this guide</a>.</p><p>We want to hear your feedback, what you’d like to see next, and more about what you’re building. Join the <a href="https://discord.cloudflare.com/">Cloudflare Developers Discord</a>.</p>
    <div>
      <h3>Watch on Cloudflare TV</h3>
      <a href="#watch-on-cloudflare-tv">
        
      </a>
    </div>
    <div></div><p></p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Database]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">14RexUSLCzOVnpWl5DkGIq</guid>
            <dc:creator>Brendan Irvine-Broque</dc:creator>
            <dc:creator>Matt Silverlock</dc:creator>
        </item>
        <item>
            <title><![CDATA[When the window is not fully open, your TCP stack is doing more than you think]]></title>
            <link>https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/</link>
            <pubDate>Tue, 26 Jul 2022 13:00:00 GMT</pubDate>
            <description><![CDATA[ In this blog post I'll share my journey deep into the Linux networking stack, trying to understand the memory and window management of the receiving side of a TCP connection ]]></description>
            <content:encoded><![CDATA[ <p>Over the years I've been lurking around the Linux kernel and have investigated the TCP code many times. But when recently we were working on <a href="/optimizing-tcp-for-high-throughput-and-low-latency/">Optimizing TCP for high WAN throughput while preserving low latency</a>, I realized I have gaps in my knowledge about how Linux manages TCP receive buffers and windows. As I dug deeper I found the subject complex and certainly non-obvious.</p><p>In this blog post I'll share my journey deep into the Linux networking stack, trying to understand the memory and window management of the receiving side of a TCP connection. Specifically, looking for answers to seemingly trivial questions:</p><ul><li><p>How much data can be stored in the TCP receive buffer? (it's not what you think)</p></li><li><p>How fast can it be filled? (it's not what you think either!)</p></li></ul><p>Our exploration focuses on the receiving side of the TCP connection. We'll try to understand how to tune it for the best speed, without wasting precious memory.</p>
    <div>
      <h3>A case of a rapid upload</h3>
      <a href="#a-case-of-a-rapid-upload">
        
      </a>
    </div>
    <p>To best illustrate the receive side buffer management we need pretty charts! But to grasp all the numbers, we need a bit of theory.</p><p>We'll draw charts from a receive side of a TCP flow, running a pretty straightforward scenario:</p><ul><li><p>The client opens a TCP connection.</p></li><li><p>The client does <code>send()</code>, and pushes as much data as possible.</p></li><li><p>The server doesn't <code>recv()</code> any data. We expect all the data to stay and wait in the receive queue.</p></li><li><p>We fix the SO_RCVBUF for better illustration.</p></li></ul><p>Simplified pseudocode might look like (<a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-07-rmem-a/window.py">full code if you dare</a>):</p>
            <pre><code>sd = socket.socket(AF_INET, SOCK_STREAM, 0)
sd.bind(('127.0.0.3', 1234))
sd.listen(32)

cd = socket.socket(AF_INET, SOCK_STREAM, 0)
cd.setsockopt(SOL_SOCKET, SO_RCVBUF, 32*1024)
cd.connect(('127.0.0.3', 1234))

ssd, _ = sd.accept()

while True:
    cd.send(b'a'*128*1024)</code></pre>
            <p>We're interested in basic questions:</p><ul><li><p>How much data can fit in the server’s receive buffer? It turns out it's not exactly the same as the default read buffer size on Linux; we'll get there.</p></li><li><p>Assuming infinite bandwidth, what is the minimal time - measured in <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">RTT</a> - for the client to fill the receive buffer?</p></li></ul>
    <div>
      <h3>A bit of theory</h3>
      <a href="#a-bit-of-theory">
        
      </a>
    </div>
    <p>Let's start by establishing some common nomenclature. I'll follow the wording used by the <a href="https://man7.org/linux/man-pages/man8/ss.8.html"><code>ss</code> Linux tool from the <code>iproute2</code> package</a>.</p><p>First, there is the buffer budget limit. <a href="https://man7.org/linux/man-pages/man8/ss.8.html"><code>ss</code> manpage</a> calls it <b>skmem_rb</b>, in the kernel it's named <b>sk_rcvbuf</b>. This value is most often controlled by the Linux autotune mechanism using the <code>net.ipv4.tcp_rmem</code> setting:</p>
            <pre><code>$ sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096 131072 6291456</code></pre>
            <p>Alternatively it can be manually set with <code>setsockopt(SO_RCVBUF)</code> on a socket. Note that the kernel doubles the value given to this setsockopt. For example SO_RCVBUF=16384 will result in skmem_rb=32768. The max value allowed for this setsockopt is limited to a meager 208KiB by default:</p>
            <pre><code>$ sysctl net.core.rmem_max net.core.wmem_max
net.core.rmem_max = 212992
net.core.wmem_max = 212992</code></pre>
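<p>The doubling is easy to observe from user space. A minimal sketch, assuming a Linux host (other platforms don't double the value):</p>

```python
import socket

sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 16384)
# On Linux the kernel stores double the requested value in sk_rcvbuf,
# so reading the option back returns 32768, not 16384
rb = sd.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(rb)
sd.close()
```

The extra factor of two is the kernel reserving headroom for the metadata overhead discussed below.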
            <p><a href="/optimizing-tcp-for-high-throughput-and-low-latency/">The aforementioned blog post</a> discusses why manual buffer size management is problematic - relying on autotuning is generally preferable.</p><p>Here’s a diagram showing how <b>skmem_rb</b> budget is being divided:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3EEuOnbl8CKYCv4oWj5Ejw/4a0bf778f484bbddebfac4099d8e21f4/image2-17.png" />
            
            </figure><p>At any given moment, we can think of the budget as being divided into four parts:</p><ul><li><p><b>Recv-q</b>: part of the buffer budget occupied by actual application bytes awaiting <code>read()</code>.</p></li><li><p>Another part is consumed by metadata handling - the cost of <b>struct sk_buff</b> and such.</p></li><li><p>Those two parts together are reported by <code>ss</code> as <b>skmem_r</b> - the kernel name is <b>sk_rmem_alloc</b>.</p></li><li><p>What remains is "free", that is: it's not actively used yet.</p></li><li><p>However, a portion of this "free" region is the advertised window - it may become occupied with application data soon.</p></li><li><p>The remainder will be used for future metadata handling, or might be divided into the advertised window further in the future.</p></li></ul><p>The upper limit for the window is configured by the <code>tcp_adv_win_scale</code> setting. By default, the window is set to at most 50% of the "free" space. The value can be clamped further by the TCP_WINDOW_CLAMP option or an internal <code>rcv_ssthresh</code> variable.</p>
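<p>The kernel turns "free" space into a window in <code>tcp_win_from_space()</code>. A hedged Python transliteration of that logic (our simplified reading of the net/ipv4 code):</p>

```python
def tcp_win_from_space(space: int, tcp_adv_win_scale: int) -> int:
    """Mirror of the kernel's tcp_win_from_space(): how much of the
    "free" buffer space may be advertised as the receive window."""
    if tcp_adv_win_scale <= 0:
        return space >> -tcp_adv_win_scale
    return space - (space >> tcp_adv_win_scale)

print(tcp_win_from_space(65536, 1))  # scale=1 (default): 50% of free space
print(tcp_win_from_space(65536, 2))  # scale=2: 75%
print(tcp_win_from_space(65536, 0))  # scale=0: 100%, as in the experiment below
```

Note how the setting is a shift, not a percentage: positive values reserve <code>space &gt;&gt; scale</code> for metadata, while zero or negative values hand out the whole (or a power-of-two fraction of the) budget as window.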
    <div>
      <h3>How much data can a server receive?</h3>
      <a href="#how-much-data-can-a-server-receive">
        
      </a>
    </div>
    <p>Our first question was "How much data can a server receive?". A naive reader might think it's simple: if the server has a receive buffer set to say 64KiB, then the client will surely be able to deliver 64KiB of data!</p><p>But this is totally not how it works. To illustrate this, allow me to temporarily set sysctl <code>tcp_adv_win_scale=0</code>. This is not a default and, as we'll learn, it's the wrong thing to do. With this setting the server will indeed set 100% of the receive buffer as an advertised window.</p><p>Here's our setup:</p><ul><li><p>The client tries to send as fast as possible.</p></li><li><p>Since we are interested in the receiving side, we can cheat a bit and speed up the sender arbitrarily. The client has transmission congestion control disabled: we set initcwnd=10000 as the route option.</p></li><li><p>The server has a fixed <b>skmem_rb</b> set at 64KiB.</p></li><li><p>The server has <code><b>tcp_adv_win_scale=0</b></code>.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/44j6HUJ496dIXVMkltUe4O/1765b3f25ef767dfcb23d3c079f7e8cb/image6-10.png" />
            
            </figure><p>There are so many things here! Let's try to digest it. First, the X axis is the ingress packet number (we saw about 65). The Y axis shows the buffer sizes as seen on the receive path for every packet.</p><ul><li><p>First, the purple line is the buffer size limit in bytes - <b>skmem_rb</b>. In our experiment we called <code>setsockopt(SO_RCVBUF)=32K</code> and skmem_rb is double that value. Notice that by calling SO_RCVBUF we disabled the Linux autotune mechanism.</p></li><li><p>The green <b>recv-q</b> line shows how many application bytes are available in the receive socket. It grows linearly with each received packet.</p></li><li><p>Then there is the blue <b>skmem_r</b>, the used data + metadata cost in the receive socket. It grows just like <b>recv-q</b> but a bit faster, since it accounts for the cost of the metadata the kernel needs to deal with.</p></li><li><p>The orange <b>rcv_win</b> is the advertised window. We start with 64KiB (100% of skmem_rb) and go down as the data arrives.</p></li><li><p>Finally, the dotted line shows <b>rcv_ssthresh</b>, which is not important yet; we'll get there.</p></li></ul>
    <div>
      <h3>Running over the budget is bad</h3>
      <a href="#running-over-the-budget-is-bad">
        
      </a>
    </div>
    <p>It's super important to notice that we finished with <b>skmem_r</b> higher than <b>skmem_rb</b>! This is rather unexpected, and undesired. The whole point of the <b>skmem_rb</b> memory budget is, well, not to exceed it. Here's how <code>ss</code> shows it:</p>
            <pre><code>$ ss -m
Netid  State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port   
tcp    ESTAB  62464   0       127.0.0.3:1234      127.0.0.2:1235
     skmem:(r73984,rb65536,...)</code></pre>
            <p>As you can see, skmem_rb is 65536 and skmem_r is 73984, which is 8448 bytes over! When this happens we have an even bigger issue on our hands. At around the 62nd packet we have an advertised window of 3072 bytes, but while packets are being sent, the receiver is unable to process them! This is easily verifiable by inspecting an nstat TcpExtTCPRcvQDrop counter:</p>
            <pre><code>$ nstat -az TcpExtTCPRcvQDrop
TcpExtTCPRcvQDrop    13    0.0</code></pre>
            <p>In our run 13 packets were dropped. This variable counts the number of packets dropped due to either system-wide or per-socket memory pressure - we know we hit the latter. In our case, soon after the socket memory limit was crossed, new packets were prevented from being enqueued to the socket. This happened even though the TCP advertised window was still open.</p><p>This results in an interesting situation. The receiver's window is open, which might indicate it has resources to handle the data. But that's not always the case, like in our example when it runs out of the memory budget.</p><p>The sender will think it hit network congestion packet loss and will run the usual retry mechanisms, including exponential backoff. This behavior can be seen as desired or undesired, depending on how you look at it. On one hand no data will be lost: the sender can eventually deliver all the bytes reliably. On the other hand the exponential backoff logic might stall the sender for a long time, causing a noticeable delay.</p><p>The root of the problem is straightforward - the Linux kernel <b>skmem_rb</b> sets a memory budget for both the <b>data</b> and <b>metadata</b> which reside on the socket. In a pessimistic case each packet might incur the cost of a <b>struct sk_buff</b> + <b>struct skb_shared_info</b>, which on my system is 576 bytes on top of the actual payload size, plus memory waste due to network card buffer alignment:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7nJyE7p1rtHK9SvSTDnZoj/c02019aeed1e3b17f24b506b4eeaef36/image7-10.png" />
            
            </figure><p>We now understand that Linux can't just advertise 100% of the memory budget as an advertised window. Some budget must be reserved for metadata and such. The upper limit of window size is expressed as a fraction of the "free" socket budget. It is controlled by <code>tcp_adv_win_scale</code>, with the following values:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ZfsgDbgLiLQ0HXUV5mVeK/31e596946e101fef2443896f9db8fcdb/image9-5.png" />
            
            </figure><p>By default, Linux sets the advertised window to at most 50% of the remaining buffer space.</p><p>Even with 50% of space "reserved" for metadata, the kernel is very smart and tries hard to reduce the metadata memory footprint. It has two mechanisms for this:</p><ul><li><p><b>TCP Coalesce</b> - on the happy path, Linux is able to throw away <b>struct sk_buff</b>. It can do so by just linking the data to the previously enqueued packet. You can think about it as if it was <a href="https://www.spinics.net/lists/netdev/msg755359.html">extending the last packet on the socket</a>.</p></li><li><p><b>TCP Collapse</b> - when the memory budget is hit, Linux runs "collapse" code. Collapse rewrites and defragments the receive buffer from many small skb's into a few very long segments - therefore reducing the metadata cost.</p></li></ul><p>Here's an extension to our previous chart showing these mechanisms in action:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KhHEeAUvJ6rinNLoRBwd1/36b733fddcbb885d8db5b076602ca168/image3-10.png" />
            
            </figure><p><b>TCP Coalesce</b> is a very effective measure and works behind the scenes at all times. In the bottom chart, the packets where the coalesce was engaged are shown with a pink line. You can see - the <b>skmem_r</b> bumps (blue line) are clearly correlated with a <b>lack</b> of coalesce (pink line)! The nstat TcpExtTCPRcvCoalesce counter might be helpful in debugging coalesce issues.</p><p>The <b>TCP Collapse</b> is a bigger gun. <a href="/optimizing-tcp-for-high-throughput-and-low-latency/">Mike wrote about it extensively</a>, and <a href="/the-story-of-one-latency-spike/">I wrote a blog post years ago, when the latency of TCP collapse hit us hard</a>. In the chart above, the collapse is shown as a red circle. We clearly see it being engaged after the socket memory budget is reached - from packet number 63. The nstat TcpExtTCPRcvCollapsed counter is relevant here. This value growing is a bad sign and might indicate bad latency spikes - especially when dealing with larger buffers. Normally collapse is supposed to be run very sporadically. A <a href="https://lore.kernel.org/lkml/20120510173135.615265392@linuxfoundation.org/">prominent kernel developer describes</a> this pessimistic situation:</p><blockquote><p>This also means tcp advertises a too optimistic window for a given allocated rcvspace: When receiving frames, <code>sk_rmem_alloc</code> can hit <code>sk_rcvbuf</code> limit and we call <code>tcp_collapse()</code> too often, especially when application is slow to drain its receive queue [...] This is a major latency source.</p></blockquote><p>If the memory budget remains exhausted after the collapse, Linux will drop ingress packets. In our chart it's marked as a red "X". The nstat TcpExtTCPRcvQDrop counter shows the count of dropped packets.</p>
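<p>Some back-of-the-envelope arithmetic makes both effects concrete. Assuming, for illustration, a 1448-byte payload per full-sized packet and the 576-byte per-packet overhead mentioned above, with no coalescing:</p>

```python
SKMEM_RB = 65536   # socket memory budget from the experiment
OVERHEAD = 576     # per-packet sk_buff + skb_shared_info cost, per the text

# Pessimistic case: every full-sized packet pays the metadata tax.
payload = 1448     # illustrative TCP payload per packet
packets = SKMEM_RB // (payload + OVERHEAD)
print(packets, packets * payload)  # only ~46KiB of the 64KiB budget is data

# Collapse amortizes the tax: repack 45 tiny 100-byte payloads (45 skbs,
# each charged OVERHEAD) into a single big segment paying one overhead.
tiny = [100] * 45
before = sum(p + OVERHEAD for p in tiny)  # memory charged pre-collapse
after = sum(tiny) + OVERHEAD              # memory charged post-collapse
print(before, after)
```

The numbers are illustrative, not measured, but they show the shape of the problem: with tiny packets the metadata can dwarf the payload, which is exactly when collapse earns its keep.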
    <div>
      <h3>rcv_ssthresh predicts the metadata cost</h3>
      <a href="#rcv_ssthresh-predicts-the-metadata-cost">
        
      </a>
    </div>
    <p>Perhaps counter-intuitively, the memory cost of a packet can be much larger than the amount of actual application data contained in it. It depends on a number of things:</p><ul><li><p><b>Network card</b>: some network cards always allocate a full page (4096, or even 16KiB) per packet, no matter how small or large the payload.</p></li><li><p><b>Payload size</b>: shorter packets will have a worse metadata-to-content ratio since <b>struct skb</b> will be comparably larger.</p></li><li><p>Whether XDP is being used.</p></li><li><p>L2 header size: things like ethernet, vlan tags, and tunneling can add up.</p></li><li><p>Cache line size: many kernel structs are cache line aligned. On systems with larger cache lines, they will use more memory (see P4 or S390X architectures).</p></li></ul><p>The first two factors are the most important. Here's a run when the sender was specially configured to make the metadata cost bad and the coalesce ineffective (the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-07-rmem-a/window.py#L90">details of the setup are messy</a>):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Oo38G0pRDoxfqkcIE7D9Y/c372a9cba2402cee11c14fc815875ea3/image1-10.png" />
            
            </figure><p>You can see the kernel hitting TCP collapse multiple times, which is totally undesired. Each time collapse runs, the kernel is likely to rewrite the full receive buffer. This whole kernel machinery, from reserving some space for metadata with tcp_adv_win_scale, via using coalesce to reduce the memory cost of each packet, up to the rcv_ssthresh limit, exists to avoid this very case of hitting collapse too often.</p><p>The kernel machinery most often works fine, and TCP collapse is rare in practice. However, we noticed that's not the case for certain types of traffic. One example is <a href="https://lore.kernel.org/lkml/CA+wXwBSGsBjovTqvoPQEe012yEF2eYbnC5_0W==EAvWH1zbOAg@mail.gmail.com/">websocket traffic with loads of tiny packets</a> and a slow reader. One <a href="https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_input.c#L452">kernel comment talks about</a> such a case:</p>
            <pre><code>* The scheme does not work when sender sends good segments opening
* window and then starts to feed us spaghetti. But it should work
* in common situations. Otherwise, we have to rely on queue collapsing.</code></pre>
            <p>Notice that the <b>rcv_ssthresh</b> line dropped down on the TCP collapse. This variable is an internal limit to the advertised window. By dropping it the kernel effectively says: hold on, I mispredicted the packet cost, next time I'm given an opportunity I'm going to open a smaller window. The kernel will advertise a smaller window and be more careful - all of this dance is done to avoid the collapse.</p>
    <div>
      <h3>Normal run - continuously updated window</h3>
      <a href="#normal-run-continuously-updated-window">
        
      </a>
    </div>
    <p>Finally, here's a chart from a normal run of a connection. Here, we use the default <code>tcp_adv_win_scale=1</code> (50%):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ZlSO1vQxnHim1D8dav4Aa/5ce538b22b546d194df83130d9f39bc9/image5-13.png" />
            
            </figure><p>Early in the connection you can see <b>rcv_win</b> being continuously updated with each received packet. This makes sense: while the <b>rcv_ssthresh</b> and <b>tcp_adv_win_scale</b> restrict the advertised window to never exceed 32KiB, the window is sliding nicely as long as there is enough space. At packet 18 the receiver stops updating the window and waits a bit. At packet 32 the receiver decides there still is some space and updates the window again, and so on. At the end of the flow the socket has 56KiB of data. This 56KiB of data was received over a sliding window reaching at most 32KiB.</p><p>The saw blade pattern of rcv_win is a side effect of delayed ACK. You can see the "<b>acked</b>" bytes in the red dashed line. Since ACKs might be delayed, the receiver waits a bit before updating the window. If you want a smooth line, you can disable delayed ACKs with the <code>quickack 1</code> per-route parameter, but this is not recommended since it will result in many small ACK packets flying over the wire.</p><p>In a normal connection we expect the majority of packets to be coalesced and the collapse/drop code paths never to be hit.</p>
    <div>
      <h3>Large receive windows - rcv_ssthresh</h3>
      <a href="#large-receive-windows-rcv_ssthresh">
        
      </a>
    </div>
    <p>For large bandwidth transfers over big latency links - big BDP case - it's beneficial to have a very wide advertised window. However, Linux takes a while to fully open large receive windows:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6UL4j5NH62FE1350yWnTcP/e257d2160e41a3f71aa3b727debc44fc/image8-4.png" />
            
            </figure><p>In this run, the <b>skmem_rb</b> is set to 2MiB. As opposed to previous runs, the buffer budget is large and the receive window doesn't start with 50% of the skmem_rb! Instead it starts from 64KiB and grows linearly. It takes a while for Linux to ramp up the receive window to full size - ~800KiB in this case. The window is clamped by <b>rcv_ssthresh</b>. This variable starts at 64KiB and then grows at a rate of two full-MSS packets for each packet which has a "good" ratio of total size (truesize) to payload size.</p><p><a href="https://lore.kernel.org/lkml/CANn89i+mhqGaM2tuhgEmEPbbNu_59GGMhBMha4jnnzFE=UBNYg@mail.gmail.com/">Eric Dumazet writes</a> about this behavior:</p><blockquote><p>Stack is conservative about RWIN increase, it wants to receive packets to have an idea of the skb-&gt;len/skb-&gt;truesize ratio to convert a memory budget to RWIN. Some drivers have to allocate 16K buffers (or even 32K buffers) just to hold one segment (of less than 1500 bytes of payload), while others are able to pack memory more efficiently.</p></blockquote><p>This slow window opening behavior is hard-coded, and not configurable in the vanilla kernel. <a href="https://lore.kernel.org/netdev/20220721151041.1215017-1-marek@cloudflare.com/#r">We prepared a kernel patch that allows starting with a higher rcv_ssthresh</a> based on the per-route option <code>initrwnd</code>:</p>
            <pre><code>$ ip route change local 127.0.0.0/8 dev lo initrwnd 1000</code></pre>
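<p>As a toy model of the ramp-up (our simplified reading of the description above, not the kernel's exact bookkeeping): starting at 64KiB and growing by two full MSS per "good" packet, reaching an ~800KiB window takes a few hundred well-packed packets:</p>

```python
MSS = 1460               # illustrative MSS for a standard Ethernet path
start = 64 * 1024        # initial rcv_ssthresh
target = 800 * 1024      # the fully-open window observed in this run

packets = 0
ssthresh = start
while ssthresh < target:
    ssthresh += 2 * MSS  # growth per packet with a "good" truesize ratio
    packets += 1
print(packets)
```

That delay, measured in packets and therefore in RTTs, is what the <code>initrwnd</code> patch shortcuts by starting the ramp from a higher value.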
            <p>With the patch and the route change deployed, this is how the buffers look:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4YE7Oolhn4ZQ9ihNi11HEL/af8c492bc03243e12b54541e954a3061/image4-12.png" />
            
            </figure><p>The advertised window is limited to 64KiB during the TCP handshake, but with our kernel patch enabled it's quickly bumped up to 1MiB in the first ACK packet afterwards. In both runs it took ~1800 packets to fill the receive buffer; however, they took a different amount of time. In the first run the sender could push only 64KiB onto the wire in the second RTT. In the second run it could immediately push the full 1MiB of data.</p><p>This trick of aggressive window opening is not really necessary for most users. It's only helpful when:</p><ul><li><p>You have high-bandwidth TCP transfers over big-latency links.</p></li><li><p>The metadata + buffer alignment cost of your NIC is sensible and predictable.</p></li><li><p>Immediately after the flow starts your application is ready to send a lot of data.</p></li><li><p>The sender has configured a large <code>initcwnd</code>.</p></li><li><p>You care about shaving off every possible RTT.</p></li></ul>
    <p>On our systems we do have such flows, but arguably it might not be a common scenario. In the real world most of your TCP connections go to the nearest CDN point of presence, which is very close.</p>
    <div>
      <h3>Getting it all together</h3>
      <a href="#getting-it-all-together">
        
      </a>
    </div>
    <p>In this blog post, we discussed a seemingly simple case of a TCP sender filling up the receive socket. We tried to address two questions: with our isolated setup, how much data can be sent, and how quickly?</p><p>With the default settings of net.ipv4.tcp_rmem, Linux initially sets a memory budget of 128KiB for the receive data and metadata. On my system, given full-sized packets, it's able to eventually accept around 113KiB of application data.</p><p>Then, we showed that the receive window is not fully opened immediately. Linux keeps the receive window small, as it tries to predict the metadata cost and avoid overshooting the memory budget, therefore hitting TCP collapse. By default, with net.ipv4.tcp_adv_win_scale=1, the upper limit for the advertised window is 50% of "free" memory. rcv_ssthresh starts at 64KiB and grows linearly up to that limit.</p><p>On my system it took five window updates - six RTTs in total - to fill the 128KiB receive buffer. In the first batch the sender sent ~64KiB of data (remember we hacked the <code>initcwnd</code> limit), and then the sender topped it up with smaller and smaller batches until the receive window fully closed.</p><p>I hope this blog post is helpful and clearly explains the relationship between the buffer size and advertised window on Linux. It also describes the often misunderstood rcv_ssthresh, which limits the advertised window in order to manage the memory budget and predict the unpredictable cost of metadata.</p><p>In case you wonder, similar mechanisms are in play in QUIC. The QUIC/H3 libraries though are still pretty young and don't have so many complex and mysterious toggles... yet.</p><p>As always, <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2022-07-rmem-a">the code and instructions on how to reproduce the charts are available at our GitHub</a>.</p>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">ROvfvY7ClXiGsjf1moUld</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[A July 4 technical reading list]]></title>
            <link>https://blog.cloudflare.com/july-4-2022-reading-list/</link>
            <pubDate>Mon, 04 Jul 2022 12:55:08 GMT</pubDate>
            <description><![CDATA[ Here’s a short list of recent technical blog posts to give you something to read today ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2S9gHqjdCaiGCiCTBkGt0P/3a2a26f413cb9a908a9112a858495a7e/image1-61.png" />
            
            </figure><p>Here’s a short list of recent technical blog posts to give you something to read today.</p>
    <div>
      <h3>Internet Explorer, we hardly knew ye</h3>
      <a href="#internet-explorer-we-hardly-knew-ye">
        
      </a>
    </div>
    <p>Microsoft has announced the end-of-life for the venerable Internet Explorer browser. Here <a href="/internet-explorer-retired/">we take a look</a> at the demise of IE and the rise of the Edge browser. And we investigate how many bots on the Internet continue to impersonate Internet Explorer versions that have long since been replaced.</p>
    <div>
      <h3>Live-patching security vulnerabilities inside the Linux kernel with eBPF Linux Security Module</h3>
      <a href="#live-patching-security-vulnerabilities-inside-the-linux-kernel-with-ebpf-linux-security-module">
        
      </a>
    </div>
    <p>Looking for something with a lot of technical detail? Look no further than <a href="/live-patch-security-vulnerabilities-with-ebpf-lsm/">this blog about live-patching</a> the Linux kernel using eBPF. Code, Makefiles and more within!</p>
    <div>
      <h3>Hertzbleed explained</h3>
      <a href="#hertzbleed-explained">
        
      </a>
    </div>
    <p>Feeling mathematical? Or just need a dose of CPU-level antics? Look no further than this <a href="/hertzbleed-explained/">deep explainer</a> about how CPU frequency scaling leads to a nasty side channel affecting cryptographic algorithms.</p>
    <div>
      <h3>Early Hints update: How Cloudflare, Google, and Shopify are working together to build a faster Internet for everyone</h3>
      <a href="#early-hints-update-how-cloudflare-google-and-shopify-are-working-together-to-build-a-faster-internet-for-everyone">
        
      </a>
    </div>
    <p>The HTTP standard for Early Hints shows a lot of promise. How much? In this blog post, we <a href="/early-hints-performance/">dig into data</a> about Early Hints in the real world and show how much faster the web is with it.</p>
    <div>
      <h3>Private Access Tokens: eliminating CAPTCHAs on iPhones and Macs with open standards</h3>
      <a href="#private-access-tokens-eliminating-captchas-on-iphones-and-macs-with-open-standards">
        
      </a>
    </div>
    <p>Dislike CAPTCHAs? Yes, us too. As part of our program to eliminate CAPTCHAs, there’s a new standard: Private Access Tokens. This blog shows <a href="/eliminating-captchas-on-iphones-and-macs-using-new-standard/">how they work</a> and how they can be used to prove you’re human without saying who you are.</p>
    <div>
      <h3>Optimizing TCP for high WAN throughput while preserving low latency</h3>
      <a href="#optimizing-tcp-for-high-wan-throughput-while-preserving-low-latency">
        
      </a>
    </div>
    <p>Network nerd? Yeah, me too. Here’s a very <a href="/optimizing-tcp-for-high-throughput-and-low-latency/">in-depth look</a> at how we tune TCP parameters for low latency and high throughput.</p><p>...<i>We protect </i><a href="https://www.cloudflare.com/network-services/"><i>entire corporate networks</i></a><i>, help customers build </i><a href="https://workers.cloudflare.com/"><i>Internet-scale applications efficiently</i></a><i>, accelerate any </i><a href="https://www.cloudflare.com/performance/accelerate-internet-applications/"><i>website or Internet application</i></a><i>, ward off </i><a href="https://www.cloudflare.com/ddos/"><i>DDoS attacks</i></a><i>, keep </i><a href="https://www.cloudflare.com/application-security/"><i>hackers at bay</i></a><i>, and can help you on </i><a href="https://www.cloudflare.com/products/zero-trust/"><i>your journey to Zero Trust</i></a><i>.</i></p><p><i>Visit </i><a href="https://1.1.1.1/"><i>1.1.1.1</i></a><i> from any device to get started with our free app that makes your Internet faster and safer. To learn more about our mission to help build a better Internet, start </i><a href="https://www.cloudflare.com/learning/what-is-cloudflare/"><i>here</i></a><i>. If you’re looking for a new career direction, check out </i><a href="http://cloudflare.com/careers"><i>our open positions</i></a><i>.</i></p>
            <category><![CDATA[Reading List]]></category>
            <category><![CDATA[Radar]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Hertzbleed]]></category>
            <category><![CDATA[eBPF]]></category>
            <guid isPermaLink="false">4ffQabh80U3V99Grzwc88g</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Optimizing TCP for high WAN throughput while preserving low latency]]></title>
            <link>https://blog.cloudflare.com/optimizing-tcp-for-high-throughput-and-low-latency/</link>
            <pubDate>Fri, 01 Jul 2022 13:00:01 GMT</pubDate>
            <description><![CDATA[ In this post, we describe how we modified the Linux kernel to optimize for both low latency and high throughput concurrently ]]></description>
            <content:encoded><![CDATA[ <p>Here at Cloudflare we're constantly working on improving our service. Our engineers are looking at hundreds of parameters of our traffic, making sure that we get better all the time.</p><p>One of the core numbers we keep a close eye on is HTTP request latency, which is important for many of our products. We regard latency spikes as bugs to be fixed. One example is the 2017 story of <a href="/the-sad-state-of-linux-socket-balancing/">"Why does one NGINX worker take all the load?"</a>, where we optimized our TCP Accept queues to improve the overall latency of TCP sockets waiting for accept().</p><p>Performance tuning is a holistic endeavor, and we monitor and continuously improve a range of other performance metrics as well, including throughput. Sometimes, tradeoffs have to be made. Such a case occurred in 2015, when a latency spike was discovered in our processing of HTTP requests. The solution at the time was to set tcp_rmem to 4 MiB, which minimized the amount of time the kernel spends on TCP collapse processing. It was this collapse processing that was causing the latency spikes. Later in this post we discuss TCP collapse processing in more detail.</p><p>The tradeoff is that using a low value for tcp_rmem limits TCP throughput over high latency links. The following graph shows the maximum throughput as a function of network latency for a window size of 2 MiB. Note that the 2 MiB corresponds to a tcp_rmem value of 4 MiB due to the tcp_adv_win_scale setting in effect at the time.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1mRCX3pD7U0Yzn1QeHDDhL/cef49d60e7e32c23d364cbbc65dd50e7/image10-5.png" />
            
            </figure><p>For the Cloudflare products then in existence, this was not a major problem, as connections terminate and content is served from nearby servers due to our BGP anycast routing.</p><p>Since then, we have added new products, such as Magic WAN, WARP, Spectrum, Gateway, and others. These represent new types of use cases and traffic flows.</p><p>For example, imagine you're a typical Magic WAN customer. You have connected all of your worldwide offices together using the Cloudflare global network. While Time to First Byte still matters, Magic WAN office-to-office traffic also needs good throughput. For example, a lot of traffic over these corporate connections will be file sharing using protocols such as SMB. These are <a href="https://en.wikipedia.org/wiki/Elephant_flow">elephant flows</a> over <a href="https://datatracker.ietf.org/doc/html/rfc1072">long fat networks</a>. Throughput is the metric every eyeball watches as they are downloading files.</p><p>We need to continue to provide world-class low latency while simultaneously providing high throughput over high-latency connections.</p><p>Before we begin, let’s introduce the players in our game.</p><p><b>TCP receive window</b> is the maximum number of unacknowledged user payload bytes the sender should transmit (bytes-in-flight) at any point in time. The size of the receive window can and does go up and down during the course of a TCP session. It is a mechanism whereby the receiver can tell the sender to stop sending if the sent packets cannot be successfully received because the receive buffers are full. It is this receive window that often limits throughput over high-latency networks.</p><p><b>net.ipv4.tcp_adv_win_scale</b> is a (non-intuitive) number used to account for the overhead needed by Linux to process packets. The receive window is specified in terms of user payload bytes. 
Linux needs additional memory beyond that to track other data associated with packets it is processing.</p><p>The value of the receive window changes during the lifetime of a TCP session, depending on a number of factors. The maximum value that the receive window can be is limited by the amount of free memory available in the receive buffer, according to this table:</p><table><tr><td><p>tcp_adv_win_scale</p></td><td><p>TCP window size</p></td></tr><tr><td><p>4</p></td><td><p>15/16 * available memory in receive buffer</p></td></tr><tr><td><p>3</p></td><td><p>⅞ * available memory in receive buffer</p></td></tr><tr><td><p>2</p></td><td><p>¾ * available memory in receive buffer</p></td></tr><tr><td><p>1</p></td><td><p>½ * available memory in receive buffer</p></td></tr><tr><td><p>0</p></td><td><p>available memory in receive buffer</p></td></tr><tr><td><p>-1</p></td><td><p>½ * available memory in receive buffer</p></td></tr><tr><td><p>-2</p></td><td><p>¼ * available memory in receive buffer</p></td></tr><tr><td><p>-3</p></td><td><p>⅛ * available memory in receive buffer</p></td></tr></table><p>We can intuitively (and correctly) understand that the amount of available memory in the receive buffer is the difference between the used memory and the maximum limit. But what is the maximum size a receive buffer can be? The answer is sk_rcvbuf.</p><p><b>sk_rcvbuf</b> is a per-socket field that specifies the maximum amount of memory that a receive buffer can allocate. This can be set programmatically with the socket option SO_RCVBUF. This can sometimes be useful to do, for localhost TCP sessions, for example, but in general the use of SO_RCVBUF is not recommended.</p><p>So how is sk_rcvbuf set? The most appropriate value for that depends on the latency of the TCP session and other factors. This makes it difficult for L7 applications to know how to set these values correctly, as they will be different for every TCP session. The solution to this problem is Linux autotuning.</p>
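<p>The tcp_adv_win_scale table above can be reproduced with a few lines of Python. This is our own illustrative sketch (the function name is ours, not from the kernel): a positive scale reserves 1/2<sup>scale</sup> of the buffer for metadata overhead, while a negative scale advertises only 1/2<sup>-scale</sup> of the buffer.</p>

```python
def max_advertised_window(rcvbuf_bytes: int, tcp_adv_win_scale: int) -> int:
    """Share of receive-buffer memory advertised as the TCP receive
    window, per the tcp_adv_win_scale table above."""
    if tcp_adv_win_scale > 0:
        # e.g. scale 2 -> 3/4 of the buffer: overhead gets 1/4
        return rcvbuf_bytes - (rcvbuf_bytes >> tcp_adv_win_scale)
    # e.g. scale -2 -> only 1/4 of the buffer is advertised
    return rcvbuf_bytes >> -tcp_adv_win_scale
```

<p>For example, <code>max_advertised_window(4 * 2**20, 1)</code> gives 2 MiB, matching the tcp_rmem value of 4 MiB and the tcp_adv_win_scale setting described earlier.</p>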
    <div>
      <h2>Linux autotuning</h2>
      <a href="#linux-autotuning">
        
      </a>
    </div>
    <p>Linux autotuning is logic in the Linux kernel that adjusts the buffer size limits and the receive window based on actual packet processing. It takes into consideration a number of things including TCP session <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">RTT</a>, L7 read rates, and the amount of available host memory.</p><p>Autotuning can sometimes seem mysterious, but it is actually fairly straightforward.</p><p>The central idea is that Linux can track the rate at which the local application is reading data off of the receive queue. It also knows the session RTT. Because Linux knows these things, it can automatically increase the buffers and receive window until it reaches the point at which the application layer or network bottleneck links are the constraint on throughput (and not host buffer settings). At the same time, autotuning prevents slow local readers from having excessively large receive queues. The way autotuning does that is by limiting the receive window and its corresponding receive buffer to an appropriate size for each socket.</p><p>The values set by autotuning can be seen via the Linux “<code>ss</code>” command from the <code>iproute</code> package (e.g. “<code>ss -tmi</code>”).  The relevant output fields from that command are:</p><p><b>Recv-Q</b> is the number of user payload bytes not yet read by the local application.</p><p><b>rcv_ssthresh</b> is the window clamp, a.k.a. the maximum receive window size. This value is not known to the sender. The sender receives only the current window size, via the TCP header field. A closely-related field in the kernel, tp-&gt;window_clamp, is the maximum window size allowable based on the amount of available memory. 
rcv_ssthresh is the receiver-side slow-start threshold value.</p><p><b>skmem_r</b> is the actual amount of memory that is allocated, which includes not only user payload (Recv-Q) but also additional memory needed by Linux to process the packet (packet metadata). This is known within the kernel as sk_rmem_alloc.</p><p>Note that there are other buffers associated with a socket, so skmem_r does not represent the total memory that a socket might have allocated. Those other buffers are not involved in the issues presented in this post.</p><p><b>skmem_rb</b> is the maximum amount of memory that could be allocated by the socket for the receive buffer. This is higher than rcv_ssthresh to account for memory needed for packet processing that is not packet data. Autotuning can increase this value (up to tcp_rmem max) based on how fast the L7 application is able to read data from the socket and the RTT of the session. This is known within the kernel as sk_rcvbuf.</p><p><b>rcv_space</b> is the high water mark of the rate of the local application reading from the receive buffer during any RTT. This is used internally within the kernel to adjust sk_rcvbuf.</p><p>Earlier we mentioned a setting called tcp_rmem. <b>net.ipv4.tcp_rmem</b> consists of three values, but in this document we are always referring to the third value (except where noted). It is a global setting that specifies the maximum amount of memory that any TCP receive buffer can allocate, i.e. the maximum permissible value that autotuning can use for sk_rcvbuf. This is essentially just a failsafe for autotuning, and under normal circumstances should play only a minor role in TCP memory management.</p><p>It’s worth mentioning that receive buffer memory is not preallocated. Memory is allocated based on actual packets arriving and sitting in the receive queue. It’s also important to realize that filling up a receive queue is not one of the criteria that autotuning uses to increase sk_rcvbuf. 
Indeed, preventing this type of excessive buffering (<a href="https://en.wikipedia.org/wiki/Bufferbloat">bufferbloat</a>) is one of the benefits of autotuning.</p>
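<p>To look at these values on a live system, run <code>ss -tmi</code> and inspect the skmem blob it prints per socket. As a rough illustration (the regex and the sample line below are ours, with made-up values), the receive-side fields can be pulled out like this:</p>

```python
import re

def parse_skmem(line: str) -> dict:
    """Extract the receive-side fields from an `ss -tmi` skmem blob:
    r  -> sk_rmem_alloc (payload plus packet metadata, skmem_r)
    rb -> sk_rcvbuf     (max the socket may allocate, skmem_rb)
    """
    m = re.search(r"skmem:\(r(\d+),rb(\d+)", line)
    if m is None:
        raise ValueError("no skmem field found")
    return {"sk_rmem_alloc": int(m.group(1)), "sk_rcvbuf": int(m.group(2))}

# A captured example line (values are illustrative):
sample = "skmem:(r3456,rb262144,t0,tb87040,f640,w0,o0,bl0,d0)"
```

<p>Watching sk_rcvbuf grow across successive <code>ss -tmi</code> samples on a long-lived session is an easy way to see autotuning at work.</p>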
    <div>
      <h2>What’s the problem?</h2>
      <a href="#whats-the-problem">
        
      </a>
    </div>
    <p>The problem is that we must have a large TCP receive window for high <a href="https://en.wikipedia.org/wiki/Bandwidth-delay_product">BDP</a> sessions. This is directly at odds with the latency spike problem mentioned above.</p><p>Something has to give. The laws of physics (speed of light in glass, etc.) dictate that we must use large window sizes. There is no way to get around that. So we are forced to solve the latency spikes differently.</p>
    <div>
      <h2>A brief recap of the latency spike problem</h2>
      <a href="#a-brief-recap-of-the-latency-spike-problem">
        
      </a>
    </div>
    <p>Sometimes a TCP session will fill up its receive buffers. When that happens, the Linux kernel will attempt to reduce the amount of memory the receive queue is using by performing what amounts to a “defragmentation” of memory. This is called collapsing the queue. Collapsing the queue takes time, which is what drives up HTTP request latency.</p><p>We do not want to spend time collapsing TCP queues.</p><p>Why do receive queues fill up to the point where they hit the maximum memory limit? The usual situation is when the local application starts out reading data from the receive queue at one rate (triggering autotuning to raise the max receive window), followed by the local application slowing down its reading from the receive queue. This is valid behavior, and we need to handle it correctly.</p>
    <div>
      <h2>Selecting sysctl values</h2>
      <a href="#selecting-sysctl-values">
        
      </a>
    </div>
    <p>Before exploring solutions, let’s first decide what we need as the maximum TCP window size.</p><p>As we have seen above in the discussion about BDP, the window size is determined based upon the RTT and desired throughput of the connection.</p><p>Because Linux autotuning will adjust correctly for sessions with lower RTTs and bottleneck links with lower throughput, all we need to be concerned about are the maximums.</p><p>For latency, we have chosen 300 ms as the maximum expected latency, as that is the measured latency between our Zurich and Sydney facilities. It seems reasonable enough as a worst-case latency under normal circumstances.</p><p>For throughput, although we have very fast and modern hardware on the Cloudflare global network, we don’t expect a single TCP session to saturate the hardware. We have arbitrarily chosen 3500 mbps as the highest supported throughput for our highest latency TCP sessions.</p><p>The calculation for those numbers results in a BDP of 131MB, which we round to the more aesthetic value of 128 MiB.</p><p>Recall that allocation of TCP memory includes metadata overhead in addition to packet data. The ratio of actual amount of memory allocated to user payload size varies, depending on NIC driver settings, packet size, and other factors. For full-sized packets on some of our hardware, we have measured average allocations up to 3 times the packet data size. In order to reduce the frequency of TCP collapse on our servers, we set tcp_adv_win_scale to -2. From the table above, we know that the max window size will be ¼ of the max buffer space.</p><p>We end up with the following sysctl values:</p>
            <pre><code>net.ipv4.tcp_rmem = 8192 262144 536870912
net.ipv4.tcp_wmem = 4096 16384 536870912
net.ipv4.tcp_adv_win_scale = -2</code></pre>
            <p>With a tcp_rmem of 512 MiB and a tcp_adv_win_scale of -2, the maximum window size that autotuning can set is 128 MiB, our desired value.</p>
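<p>The arithmetic behind these sysctl values can be checked in a few lines; this sketch just restates the numbers from the text above:</p>

```python
# Worst-case session from the text: 300 ms RTT at 3500 mbps.
rtt_s = 0.300
throughput_bps = 3500 * 10**6

# BDP = bandwidth * delay, in bytes: ~131 MB, rounded to 128 MiB.
bdp_bytes = int(throughput_bps / 8 * rtt_s)   # 131_250_000

# A tcp_rmem max of 512 MiB with tcp_adv_win_scale = -2 advertises
# 1/4 of the buffer, giving the desired 128 MiB maximum window.
tcp_rmem_max = 536_870_912                    # 512 MiB
max_window = tcp_rmem_max >> 2                # 134_217_728 = 128 MiB
```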
    <div>
      <h2>Disabling TCP collapse</h2>
      <a href="#disabling-tcp-collapse">
        
      </a>
    </div>
    <p>Patient: Doctor, it hurts when we collapse the TCP receive queue.</p><p>Doctor: Then don’t do that!</p><p>Generally speaking, when a packet arrives at a full buffer, the packet gets dropped. In the case of these receive buffers, however, Linux tries to “save the packet” by collapsing the receive queue. This is frequently successful, but it is not guaranteed to be, and it takes time.</p><p>Simply dropping the packet instead of trying to save it creates no problems. The receive queue is full anyway, so the local receiving application still has data to read. The sender’s congestion control will notice the drop and/or ZeroWindow and will respond appropriately. Everything continues working as designed.</p><p>At present, Linux provides no setting to disable TCP collapse, so we developed an in-house kernel patch to disable the TCP collapse logic.</p>
    <div>
      <h2>Kernel patch – Attempt #1</h2>
      <a href="#kernel-patch-attempt-1">
        
      </a>
    </div>
    <p>The kernel patch for our first attempt was straightforward. At the top of tcp_try_rmem_schedule(), if the memory allocation fails, we simply return (after pred_flag = 0 and tcp_sack_reset()), thus completely skipping the tcp_collapse and related logic.</p><p>It didn’t work.</p><p>Although we eliminated the latency spikes while using large buffer limits, we did not observe the throughput we expected.</p><p>One realization we made as we investigated was that standard network benchmarking tools such as iperf3 do not expose the problem we are trying to solve: iperf3 does not fill the receive queue, so Linux autotuning never opens the TCP window large enough to expose it. Autotuning works perfectly for our well-behaved benchmarking program.</p><p>We needed application-layer software that is slightly less well-behaved, one that truly exercises the autotuning logic. So we wrote one.</p>
    <div>
      <h2>A new benchmarking tool</h2>
      <a href="#a-new-benchmarking-tool">
        
      </a>
    </div>
    <p>During “Attempt #1” we saw anomalies that negatively impacted throughput, but only under certain specific conditions, so we realized we needed a better benchmarking tool to detect and measure their performance impact.</p><p>This tool has turned into an invaluable resource during the development of this patch and raised confidence in our solution.</p><p>It consists of two Python programs. The reader opens a TCP session to the daemon, at which point the daemon starts sending user payload as fast as it can, and never stops sending.</p><p>The reader, on the other hand, starts and stops reading in a way that first opens the TCP receive window wide and then repeatedly causes the buffers to fill up completely. More specifically, the reader implements this logic:</p><ol><li><p>reads as fast as it can, for five seconds</p><ul><li><p>this is called fast mode</p></li><li><p>opens up the window</p></li></ul></li><li><p>calculates 5% of the high watermark of the bytes read during any previous one second</p></li><li><p>for each second of the next 15 seconds:</p><ul><li><p>this is called slow mode</p></li><li><p>reads that 5% number of bytes, then stops reading</p></li><li><p>sleeps for the remainder of that particular second</p></li><li><p>most of the second consists of no reading at all</p></li></ul></li><li><p>steps 1-3 are repeated in a loop three times, so the entire run is 60 seconds</p></li></ol><p>This has the effect of highlighting any issues in the handling of packets when the buffers repeatedly hit the limit.</p>
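<p>The reader’s loop can be sketched in Python along these lines. This is our own simplification of the logic listed above, not the actual tool; the function names and buffer sizes are illustrative.</p>

```python
import socket
import time

SLOW_FRACTION = 0.05  # read 5% of the fast-mode high-water mark

def slow_mode_budget(high_water_bytes: int) -> int:
    """Bytes to read per second during slow mode."""
    return int(high_water_bytes * SLOW_FRACTION)

def oscillating_reader(sock: socket.socket) -> None:
    for _ in range(3):  # three 20-second cycles, 60 seconds total
        # Fast mode: read flat out for five one-second windows,
        # tracking the most bytes seen in any single second.
        high_water = 0
        for _ in range(5):
            deadline = time.monotonic() + 1.0
            n = 0
            while time.monotonic() < deadline:
                n += len(sock.recv(65536))
            high_water = max(high_water, n)

        # Slow mode: each second, read only the small budget, then
        # sleep out the remainder so the receive queue fills up.
        budget = slow_mode_budget(high_water)
        for _ in range(15):
            start = time.monotonic()
            got = 0
            while got < budget:
                got += len(sock.recv(min(65536, budget - got)))
            time.sleep(max(0.0, 1.0 - (time.monotonic() - start)))
```

<p>Because most of each slow-mode second involves no reading at all, the sender keeps transmitting into a queue that is barely being drained, which is exactly what forces the buffer to its limit.</p>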
    <div>
      <h2>Revisiting default Linux behavior</h2>
      <a href="#revisiting-default-linux-behavior">
        
      </a>
    </div>
    <p>Taking a step back, let’s look at the default Linux behavior. The following is kernel v5.15.16.</p><table><tr><td><p>NIC speed (mbps)</p></td><td><p>RTT (ms)</p></td><td><p>tcp_rmem (MiB)</p></td><td><p>tcp_adv_win_scale</p></td><td><p>tcp_disable_collapse</p></td><td><p>TCP window (MiB)</p></td><td><p>buffer metadata to user payload ratio</p></td><td><p>Prune Called</p></td><td><p>RcvCollapsed</p></td><td><p>RcvQDrop</p></td><td><p>OFODrop</p></td><td><p>Test Result</p></td></tr><tr><td><p>1000</p></td><td><p>300</p></td><td><p>512</p></td><td><p>-2</p></td><td><p>0</p></td><td><p>128</p></td><td><p>4</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>GOOD</p></td></tr><tr><td><p>1000</p></td><td><p>300</p></td><td><p>256</p></td><td><p>1</p></td><td><p>0</p></td><td><p>128</p></td><td><p>2</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>GOOD</p></td></tr><tr><td><p>1000</p></td><td><p>300</p></td><td><p>170</p></td><td><p>2</p></td><td><p>0</p></td><td><p>128</p></td><td><p>1.33</p></td><td><p>24</p></td><td><p>490K</p></td><td><p>0</p></td><td><p>0</p></td><td><p>GOOD</p></td></tr><tr><td><p>1000</p></td><td><p>300</p></td><td><p>146</p></td><td><p>3</p></td><td><p>0</p></td><td><p>128</p></td><td><p>1.14</p></td><td><p>57</p></td><td><p>616K</p></td><td><p>0</p></td><td><p>0</p></td><td><p>GOOD</p></td></tr><tr><td><p>1000</p></td><td><p>300</p></td><td><p>137</p></td><td><p>4</p></td><td><p>0</p></td><td><p>128</p></td><td><p>1.07</p></td><td><p>74</p></td><td><p>803K</p></td><td><p>0</p></td><td><p>0</p></td><td><p>GOOD</p></td></tr></table><p>The Linux kernel is effective at freeing up space in order to make room for incoming packets when the receive buffer memory limit is hit. As documented previously, the cost for saving these packets (i.e. 
not dropping them) is latency.</p><p>However, the latency spikes, in <i>milliseconds</i>, for tcp_try_rmem_schedule(), are:</p><p>tcp_rmem 170 MiB, tcp_adv_win_scale +2 (170p2):</p>
            <pre><code>@ms:
[0]       27093 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[1]           0 |
[2, 4)        0 |
[4, 8)        0 |
[8, 16)       0 |
[16, 32)      0 |
[32, 64)     16 |</code></pre>
            <p>tcp_rmem 146 MiB, tcp_adv_win_scale +3 (146p3):</p>
            <pre><code>@ms:
(..., 16)  25984 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[16, 20)       0 |
[20, 24)       0 |
[24, 28)       0 |
[28, 32)       0 |
[32, 36)       0 |
[36, 40)       0 |
[40, 44)       1 |
[44, 48)       6 |
[48, 52)       6 |
[52, 56)       3 |</code></pre>
            <p>tcp_rmem 137 MiB, tcp_adv_win_scale +4 (137p4):</p>
            <pre><code>@ms:
(..., 16)  37222 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[16, 20)       0 |
[20, 24)       0 |
[24, 28)       0 |
[28, 32)       0 |
[32, 36)       0 |
[36, 40)       1 |
[40, 44)       8 |
[44, 48)       2 |</code></pre>
            <p>These are the latency spikes we cannot have on the Cloudflare global network.</p>
    <div>
      <h2>Kernel patch – Attempt #2</h2>
      <a href="#kernel-patch-attempt-2">
        
      </a>
    </div>
    <p>So the “something” that was not working in Attempt #1 was that the receive queue memory limit was hit early on as the flow was just ramping up (when the values for sk_rmem_alloc and sk_rcvbuf were small, ~800KB). This occurred at about the two second mark for 137p4 test (about 2.25 seconds for 170p2).</p><p>In hindsight, we should have noticed that tcp_prune_queue() actually raises sk_rcvbuf when it can. So we modified the patch in response to that, added a guard to allow the collapse to execute when sk_rmem_alloc is less than the threshold value.</p><p><code>net.ipv4.tcp_collapse_max_bytes = 6291456</code></p><p>The next section discusses how we arrived at this value for tcp_collapse_max_bytes.</p><p>The patch is available <a href="https://github.com/cloudflare/linux/blob/master/patches/0014-add-a-sysctl-to-enable-disable-tcp_collapse-logic.patch">here</a>.</p><p>The results with the new patch are as follows:</p><p>oscil – 300ms tests</p><table><tr><td><p>Test</p></td><td><p>RTT (ms)</p></td><td><p>tcp_rmem (MiB)</p></td><td><p>tcp_adv_win_scale</p></td><td><p>tcp_disable_collapse (MiB)</p></td><td><p>NIC speed (mbps)</p></td><td><p>TCP window (MiB)</p></td><td><p>real buffer metadata to user payload ratio</p></td><td><p>RcvCollapsed</p></td><td><p>RcvQDrop</p></td><td><p>OFODrop</p></td><td><p>max latency (us)</p><p>
</p></td><td><p>Test Result</p></td></tr><tr><td><p>oscil reader</p></td><td><p>300</p></td><td><p>512</p></td><td><p>-2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>4</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>12</p></td><td><p>1-941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>300</p></td><td><p>256</p></td><td><p>1</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>2</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>11</p></td><td><p>1-941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>300</p></td><td><p>170</p></td><td><p>2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.33</p></td><td><p>0</p></td><td><p>9</p></td><td><p>86</p></td><td><p>11</p></td><td><p>1-941</p><p>36-605</p><p>1-298</p></td></tr><tr><td><p>oscil reader</p></td><td><p>300</p></td><td><p>146</p></td><td><p>3</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.14</p></td><td><p>0</p></td><td><p>7</p></td><td><p>1550</p></td><td><p>16</p></td><td><p>1-940</p><p>2-82</p><p>292-395</p></td></tr><tr><td><p>oscil reader</p></td><td><p>300</p></td><td><p>137</p></td><td><p>4</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.07</p></td><td><p>0</p></td><td><p>10</p></td><td><p>3020</p></td><td><p>9</p></td><td><p>1-940</p><p>2-13</p><p>13-33</p></td></tr></table><p>oscil – 20ms tests</p><table><tr><td><p>Test</p></td><td><p>RTT (ms)</p></td><td><p>tcp_rmem (MiB)</p></td><td><p>tcp_adv_win_scale</p></td><td><p>tcp_disable_collapse (MiB)</p></td><td><p>NIC speed (mbps)</p></td><td><p>TCP window (MiB)</p></td><td><p>real buffer metadata to user payload ratio</p></td><td><p>RcvCollapsed</p></td><td><p>RcvQDrop</p></td><td><p>OFODrop</p></td><td><p>max latency (us)</p><p>
</p></td><td><p>Test Result</p></td></tr><tr><td><p>oscil reader</p></td><td><p>20</p></td><td><p>512</p></td><td><p>-2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>4</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>13</p></td><td><p>795-941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>20</p></td><td><p>256</p></td><td><p>1</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>2</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>13</p></td><td><p>795-941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>20</p></td><td><p>170</p></td><td><p>2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.33</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>8</p></td><td><p>795-941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>20</p></td><td><p>146</p></td><td><p>3</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.14</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>7</p></td><td><p>795-941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>20</p></td><td><p>137</p></td><td><p>4</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.07</p></td><td><p>0</p></td><td><p>4</p></td><td><p>196</p></td><td><p>12</p></td><td><p>795-941</p><p>13-941</p><p>941</p></td></tr></table><p>oscil – 0ms tests</p><table><tr><td><p>Test</p></td><td><p>RTT (ms)</p></td><td><p>tcp_rmem (MiB)</p></td><td><p>tcp_adv_win_scale</p></td><td><p>tcp_disable_collapse (MiB)</p></td><td><p>NIC speed (mbps)</p></td><td><p>TCP window (MiB)</p></td><td><p>real buffer metadata to user payload ratio</p></td><td><p>RcvCollapsed</p></td><td><p>RcvQDrop</p></td><td><p>OFODrop</p></td><td><p>max latency (us)</p><p>
</p></td><td><p>Test Result</p></td></tr><tr><td><p>oscil reader</p></td><td><p>0.3</p></td><td><p>512</p></td><td><p>-2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>4</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>9</p></td><td><p>941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>0.3</p></td><td><p>256</p></td><td><p>1</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>2</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>22</p></td><td><p>941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>0.3</p></td><td><p>170</p></td><td><p>2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.33</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>8</p></td><td><p>941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>0.3</p></td><td><p>146</p></td><td><p>3</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.14</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>10</p></td><td><p>941</p><p>941</p><p>941</p></td></tr><tr><td><p>oscil reader</p></td><td><p>0.3</p></td><td><p>137</p></td><td><p>4</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.07</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>10</p></td><td><p>941</p><p>941</p><p>941</p></td></tr></table><p>iperf3 – 300 ms tests</p><table><tr><td><p>Test</p></td><td><p>RTT (ms)</p></td><td><p>tcp_rmem (MiB)</p></td><td><p>tcp_adv_win_scale</p></td><td><p>tcp_disable_collapse (MiB)</p></td><td><p>NIC speed (mbps)</p></td><td><p>TCP window (MiB)</p></td><td><p>real buffer metadata to user payload ratio</p></td><td><p>RcvCollapsed</p></td><td><p>RcvQDrop</p></td><td><p>OFODrop</p></td><td><p>max latency (us)</p><p>
</p></td><td><p>Test Result</p></td></tr><tr><td><p>iperf3</p></td><td><p>300</p></td><td><p>512</p></td><td><p>-2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>4</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>7</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>300</p></td><td><p>256</p></td><td><p>1</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>2</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>6</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>300</p></td><td><p>170</p></td><td><p>2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.33</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>9</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>300</p></td><td><p>146</p></td><td><p>3</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.14</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>11</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>300</p></td><td><p>137</p></td><td><p>4</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.07</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>7</p></td><td><p>941</p></td></tr></table><p>iperf3 – 20 ms tests</p><table><tr><td><p>Test</p></td><td><p>RTT (ms)</p></td><td><p>tcp_rmem (MiB)</p></td><td><p>tcp_adv_win_scale</p></td><td><p>tcp_disable_collapse (MiB)</p></td><td><p>NIC speed (mbps)</p></td><td><p>TCP window (MiB)</p></td><td><p>real buffer metadata to user payload ratio</p></td><td><p>RcvCollapsed</p></td><td><p>RcvQDrop</p></td><td><p>OFODrop</p></td><td><p>max latency (us)</p><p>
</p></td><td><p>Test Result</p></td></tr><tr><td><p>iperf3</p></td><td><p>20</p></td><td><p>512</p></td><td><p>-2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>4</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>7</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>20</p></td><td><p>256</p></td><td><p>1</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>2</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>15</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>20</p></td><td><p>170</p></td><td><p>2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.33</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>7</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>20</p></td><td><p>146</p></td><td><p>3</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.14</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>7</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>20</p></td><td><p>137</p></td><td><p>4</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.07</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>6</p></td><td><p>941</p></td></tr></table><p>iperf3 – 0ms tests</p><table><tr><td><p>Test</p></td><td><p>RTT (ms)</p></td><td><p>tcp_rmem (MiB)</p></td><td><p>tcp_adv_win_scale</p></td><td><p>tcp_disable_collapse (MiB)</p></td><td><p>NIC speed (mbps)</p></td><td><p>TCP window (MiB)</p></td><td><p>real buffer metadata to user payload ratio</p></td><td><p>RcvCollapsed</p></td><td><p>RcvQDrop</p></td><td><p>OFODrop</p></td><td><p>max latency (us)</p><p>
</p></td><td><p>Test Result</p></td></tr><tr><td><p>iperf3</p></td><td><p>0.3</p></td><td><p>512</p></td><td><p>-2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>4</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>6</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>0.3</p></td><td><p>256</p></td><td><p>1</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>2</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>14</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>0.3</p></td><td><p>170</p></td><td><p>2</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.33</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>6</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>0.3</p></td><td><p>146</p></td><td><p>3</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.14</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>7</p></td><td><p>941</p></td></tr><tr><td><p>iperf3</p></td><td><p>0.3</p></td><td><p>137</p></td><td><p>4</p></td><td><p>6</p></td><td><p>1000</p></td><td><p>128</p></td><td><p>1.07</p></td><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td><td><p>6</p></td><td><p>941</p></td></tr></table><p>All tests are successful.</p>
    <div>
      <h2>Setting tcp_collapse_max_bytes</h2>
      <a href="#setting-tcp_collapse_max_bytes">
        
      </a>
    </div>
    <p>In order to determine this setting, we need to understand the biggest queue we <i>can</i> collapse without incurring unacceptable latency.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6PfP9U02g39dedYXSqz0H2/8b61121e3ffe7f102fea0d682d902c56/image8-12.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/14e9NHQT0aciu7bJvpLyIM/5df55a86e2a24c7c7501c2c80a38e847/image7-13.png" />
            
            </figure><p>Using 6 MiB should result in a maximum latency of no more than 2 ms.</p>
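<p>As a sanity check on that figure: if we assume collapse processing scans the queued buffers at roughly 3 GiB/s (an illustrative rate consistent with the measurements above, not a kernel constant), the budget arithmetic looks like this:</p>

```python
# Hypothetical sanity check: how many bytes can be collapsed within a
# latency budget, given an assumed collapse processing rate?
# The 3 GiB/s rate is an illustrative assumption, not a measured constant.

ASSUMED_RATE_BYTES_PER_SEC = 3 * 1024**3  # ~3 GiB/s (assumption)

def max_collapse_bytes(latency_budget_ms: float) -> int:
    """Largest queue we can collapse within the latency budget."""
    return int(ASSUMED_RATE_BYTES_PER_SEC * latency_budget_ms / 1000)

def collapse_latency_ms(queue_bytes: int) -> float:
    """Worst-case collapse time for a queue of the given size."""
    return queue_bytes / ASSUMED_RATE_BYTES_PER_SEC * 1000

# 6 MiB, the value chosen for tcp_collapse_max_bytes:
print(f"{collapse_latency_ms(6 * 1024**2):.1f} ms")  # 2.0 ms
```

<p>Working backwards, a 2 ms budget at that assumed rate allows about 6 MiB, matching the value we chose.</p>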
    <div>
      <h2>Cloudflare production network results</h2>
      <a href="#cloudflare-production-network-results">
        
      </a>
    </div>
    
    <div>
      <h3>Current production settings (“Old”)</h3>
      <a href="#current-production-settings-old">
        
      </a>
    </div>
    
            <pre><code>net.ipv4.tcp_rmem = 8192 2097152 16777216
net.ipv4.tcp_wmem = 4096 16384 33554432
net.ipv4.tcp_adv_win_scale = -2
net.ipv4.tcp_collapse_max_bytes = 0
net.ipv4.tcp_notsent_lowat = 4294967295</code></pre>
            <p>tcp_collapse_max_bytes of 0 means that the custom feature is disabled and that the vanilla kernel logic is used for TCP collapse processing.</p>
    <div>
      <h3>New settings under test (“New”)</h3>
      <a href="#new-settings-under-test-new">
        
      </a>
    </div>
    
            <pre><code>net.ipv4.tcp_rmem = 8192 262144 536870912
net.ipv4.tcp_wmem = 4096 16384 536870912
net.ipv4.tcp_adv_win_scale = -2
net.ipv4.tcp_collapse_max_bytes = 6291456
net.ipv4.tcp_notsent_lowat = 131072</code></pre>
            <p>The tcp_notsent_lowat setting is discussed in the last section of this post.</p><p>The middle value of tcp_rmem was changed as a result of separate work that found that Linux autotuning was setting receive buffers too high for localhost sessions. This updated setting reduces TCP memory usage for those sessions, but does not change anything about the type of TCP sessions that are the focus of this post.</p><p>For the following benchmarks, we used non-Cloudflare host machines in Iowa, US, and Melbourne, Australia performing data transfers to the Cloudflare data center in Marseille, France. In Marseille, we have some hosts configured with the existing production settings, and others with the system settings described in this post. Software used is iperf3 version 3.9 on kernel 5.15.32.</p>
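<p>The buffer ceilings above are driven by the bandwidth-delay product (BDP) of the test paths: the amount of data that must be in flight to keep the pipe full. A quick sketch, where the 10 Gbps figure is an illustrative bottleneck rate rather than a measured one:</p>

```python
# Bandwidth-delay product: bytes that must be in flight to fill a path.
# The receive window (and therefore tcp_rmem max) must be at least this
# large for the sender to run at full speed.

def bdp_bytes(bandwidth_bps: float, rtt_ms: float) -> float:
    return bandwidth_bps / 8 * (rtt_ms / 1000)

MIB = 1024**2

# Melbourne -> Marseille: 282 ms RTT at an assumed 10 Gbps bottleneck
print(f"{bdp_bytes(10e9, 282) / MIB:.0f} MiB")  # 336 MiB

# Iowa -> Marseille: 121 ms RTT at the same assumed 10 Gbps
print(f"{bdp_bytes(10e9, 121) / MIB:.0f} MiB")  # 144 MiB
```

<p>Both values dwarf the old 16 MiB tcp_rmem ceiling. Worse, tcp_adv_win_scale = -2 advertises only a quarter of the buffer as window, so roughly 4 MiB per 282 ms round trip, which works out to about 120 Mbps.</p>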
    <div>
      <h3>Throughput results</h3>
      <a href="#throughput-results">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3GkgLi4X0vuXMl7rOEWvZK/e64146a2a197174ab5e563b34b8ba9e8/image3-36.png" />
            
            </figure><table><tr><td><p>
</p></td><td><p>RTT</p><p>(ms)</p></td><td><p>Throughput with Current Settings</p><p>(mbps)</p></td><td><p>Throughput with</p><p>New Settings</p><p>(mbps)</p></td><td><p>Increase</p><p>Factor</p></td></tr><tr><td><p>Iowa to</p><p>Marseille</p></td><td><p>121 </p></td><td><p>276</p></td><td><p>6600</p></td><td><p>24x</p></td></tr><tr><td><p>Melbourne to Marseille</p></td><td><p>282</p></td><td><p>120</p></td><td><p>3800</p></td><td><p>32x</p></td></tr></table><p><b>Iowa-Marseille throughput</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2WvQmKxKD8buJU17GU15Kw/798330bea25ad3dedfe5d1beb8578363/image6-16.png" />
            
            </figure><p><b>Iowa-Marseille receive window and bytes-in-flight</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2V2iFmp0s1h8vH4SyrzFQS/5b1118af5ff1ee2116aacfe7c9d60927/image2-51.png" />
            
            </figure><p><b>Melbourne-Marseille throughput</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/PnqeHPMJbmCuyGvt9UjtA/037328ab56686611fce568409e0c8336/image9-10.png" />
            
            </figure><p><b>Melbourne-Marseille receive window and bytes-in-flight</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4H5cZs3oyG7KiTKL9m4b7R/ff9563e9d1a69da0adb653587dd1f336/image5-21.png" />
            
            </figure><p>Even with the new settings in place, the Melbourne to Marseille performance is limited by the receive window on the Cloudflare host. This means that further adjustments to these settings could yield even higher throughput.</p>
    <div>
      <h3>Latency results</h3>
      <a href="#latency-results">
        
      </a>
    </div>
    <p>The Y-axis on these charts is the 99th percentile time for TCP collapse, in seconds.</p><p><b>Cloudflare hosts in Marseille running the current production settings</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1jxkFikBYH12eNxfHcHwXR/c8c3e183a28b84a4f93bc725309d5eff/image11-4.png" />
            
            </figure><p><b>Cloudflare hosts in Marseille running the new settings</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6dqk8mtvEpghINGK0bfKUz/cec115db36d4a927ca4764bfe022cad8/image1-59.png" />
            
            </figure><p>The takeaway from these graphs is that the maximum TCP collapse time with the new settings is no worse than with the current production settings. This is the desired result.</p>
    <div>
      <h3>Send Buffers</h3>
      <a href="#send-buffers">
        
      </a>
    </div>
    <p>What we have shown so far is that the receiver side seems to be working well, but what about the sender side?</p><p>As part of this work, we are setting tcp_wmem max to 512 MiB. For oscillating reader flows, this can cause the send buffer to become quite large. This represents bufferbloat and wasted kernel memory, both things that nobody likes or wants.</p><p>Fortunately, there is already a solution: <b>tcp_notsent_lowat</b>. This setting limits the amount of unsent data held in the write queue. More details can be found at <a href="https://lwn.net/Articles/560082/">https://lwn.net/Articles/560082</a>.</p><p>The results are significant:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3yXWQ4mWrhKWV8Tzv0bUCb/75339710d08b943a8cb884a68377d4ec/image4-29.png" />
            
            </figure><p>The RTT for these tests was 466 ms. Throughput is not negatively affected: it remains at full wire speed (1 Gbps) in all cases. Memory usage is as reported by /proc/net/sockstat, TCP mem.</p><p>Our web servers already set tcp_notsent_lowat to 131072 for their sockets. All other senders are using 4 GiB, the default value. We are changing the sysctl so that 131072 is in effect for all senders running on the server.</p>
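<p>The sysctl sets the system-wide default, but the same limit can also be applied per socket via the TCP_NOTSENT_LOWAT socket option, which is how our web servers were already applying it. A minimal sketch (the fallback value 25 is the option number from the Linux tcp.h header, used where the Python constant is missing):</p>

```python
import socket

# Per-socket equivalent of net.ipv4.tcp_notsent_lowat: cap the bytes
# sitting unsent in the kernel write queue. Linux defines
# TCP_NOTSENT_LOWAT as 25 in <linux/tcp.h>.
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)

def limit_unsent(sock: socket.socket, lowat: int = 131072) -> None:
    """Keep at most `lowat` unsent bytes queued; poll/epoll will not
    report the socket writable until the queue drains below it."""
    sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, lowat)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
limit_unsent(s)
print(s.getsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT))  # 131072
s.close()
```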
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The goal of this work is to open the throughput floodgates for high BDP connections while simultaneously ensuring very low HTTP request latency.</p><p>We have accomplished that goal.</p><p>...<i>We protect </i><a href="https://www.cloudflare.com/network-services/"><i>entire corporate networks</i></a><i>, help customers build </i><a href="https://workers.cloudflare.com/"><i>Internet-scale applications efficiently</i></a><i>, accelerate any </i><a href="https://www.cloudflare.com/performance/accelerate-internet-applications/"><i>website or Internet application</i></a><i>, ward off </i><a href="https://www.cloudflare.com/ddos/"><i>DDoS attacks</i></a><i>, keep </i><a href="https://www.cloudflare.com/application-security/"><i>hackers at bay</i></a><i>, and can help you on </i><a href="https://www.cloudflare.com/products/zero-trust/"><i>your journey to Zero Trust</i></a><i>.</i></p><p><i>Visit </i><a href="https://1.1.1.1/"><i>1.1.1.1</i></a><i> from any device to get started with our free app that makes your Internet faster and safer. To learn more about our mission to help build a better Internet, start </i><a href="https://www.cloudflare.com/learning/what-is-cloudflare/"><i>here</i></a><i>. If you’re looking for a new career direction, check out </i><a href="http://cloudflare.com/careers"><i>our open positions</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Latency]]></category>
            <category><![CDATA[Optimization]]></category>
            <guid isPermaLink="false">BcOgofewzZGwenQrFsVMq</guid>
            <dc:creator>Mike Freemon</dc:creator>
        </item>
        <item>
            <title><![CDATA[A Primer on Proxies]]></title>
            <link>https://blog.cloudflare.com/a-primer-on-proxies/</link>
            <pubDate>Sat, 19 Mar 2022 17:01:15 GMT</pubDate>
            <description><![CDATA[ A technical dive into traditional TCP proxying over HTTP ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4icLjzJ8inC97t9zh3LiWw/e0e75625752de444e0fd2a32f627112e/image2-73.png" />
            
            </figure><p>Traffic proxying, the act of encapsulating one flow of data inside another, is a valuable privacy tool for establishing boundaries on the Internet. Encapsulation has an overhead, but Cloudflare and our Internet peers strive to avoid turning it into a performance cost. MASQUE is the latest collaboration effort to design efficient proxy protocols based on IETF standards. We're already running these at scale in production; see our recent blog post about Cloudflare's role in <a href="/icloud-private-relay/">iCloud Private Relay</a> for an example.</p><p>In this blog post series, we’ll dive into proxy protocols.</p><p>To begin, let’s start with a simple question: what is proxying? In this case, we are focused on <b>forward</b> proxying — a client establishes an end-to-end tunnel to a target server via a proxy server. This contrasts with the Cloudflare CDN, which operates as a <a href="https://www.cloudflare.com/learning/cdn/glossary/reverse-proxy/"><b>reverse</b></a> <a href="https://www.cloudflare.com/learning/cdn/glossary/reverse-proxy/">proxy</a> that terminates client connections and then takes responsibility for actions such as caching, security including <a href="https://www.cloudflare.com/learning/ddos/glossary/web-application-firewall-waf/">WAF</a>, load balancing, etc. With forward proxying, the details about the tunnel, such as how it is established and used, whether it provides confidentiality via authenticated encryption, and so on, vary by proxy protocol. Before going into specifics, let’s start with one of the most common tunnels used on the Internet: TCP.</p>
    <div>
      <h3>Transport basics: TCP provides a reliable byte stream</h3>
      <a href="#transport-basics-tcp-provides-a-reliable-byte-stream">
        
      </a>
    </div>
    <p>The TCP transport protocol is a rich topic. For the purposes of this post, we will focus on one aspect: TCP provides a readable and writable, reliable, and ordered byte stream. Some protocols like HTTP and TLS require reliable transport underneath them and TCP's single byte stream is an ideal fit. The application layer reads or writes to this byte stream, but the details about how TCP sends this data "on the wire" are typically abstracted away.</p><p>Large application objects are written into a stream, then they are split into many small packets, and they are sent in order to the network. At the receiver, packets are read from the network and combined back into an identical stream. Networks are not perfect and packets can be lost or reordered. TCP is clever at dealing with this and not worrying the application with details. It just works. A way to visualize this is to imagine a magic paper shredder that can both shred documents and convert shredded papers back to whole documents. Then imagine you and your friend bought a pair of these and decided that it would be fun to send each other shreds.</p><p>The one problem with TCP is that when a lost packet is detected at a receiver, the sender needs to retransmit it. This takes time to happen and can mean that the byte stream reconstruction gets delayed. This is known as TCP head-of-line blocking. Applications regularly use TCP via a socket API that abstracts away protocol details; they often can't tell if there are delays because the other end is slow at sending or if the network is slowing things down via packet loss.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/xnUbBQnb4droA0xmJMJ69/c07b46a941cdc3a7cbb75dc696050f38/image1-84.png" />
            
            </figure>
    <div>
      <h3>Proxy Protocols</h3>
      <a href="#proxy-protocols">
        
      </a>
    </div>
    <p>Proxying TCP is immensely useful for many applications, including, though certainly not limited to HTTPS, <a href="https://www.cloudflare.com/learning/access-management/what-is-ssh/">SSH</a>, and RDP. In fact, <a href="/oblivious-dns/">Oblivious DoH</a>, which is a proxy protocol for DNS messages, could very well be implemented using a TCP proxy, though there are reasons <a href="https://datatracker.ietf.org/doc/html/draft-pauly-dprive-oblivious-doh-11#appendix-A">why this may not be desirable</a>. Today, there are a number of different options for proxying TCP end-to-end, including:</p><ul><li><p>SOCKS, which runs in cleartext and requires an expensive connection establishment step.</p></li><li><p>Transparent TCP proxies, commonly referred to as performance enhancing proxies (PEPs), which must be on path and offer no additional transport security, and, definitionally, are limited to TCP protocols.</p></li><li><p>Layer 4 proxies such as Cloudflare <a href="https://developers.cloudflare.com/spectrum/">Spectrum</a>, which might rely on side carriage metadata via something like the <a href="https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt">PROXY protocol</a>.</p></li><li><p>HTTP CONNECT, which transforms HTTPS connections into opaque byte streams.</p></li></ul><p>While SOCKS and PEPs are viable options for some use cases, when choosing which proxy protocol to build future systems upon, it made most sense to choose a reusable and general-purpose protocol that provides well-defined and standard abstractions. As such, the IETF chose to focus on using HTTP as a substrate via the CONNECT method.</p><p>The concept of using HTTP as a substrate for proxying is not new. Indeed, HTTP/1.1 and HTTP/2 have supported proxying TCP-based protocols for a long time. 
In the following sections of this post, we’ll explain in detail how CONNECT works across different versions of HTTP, including HTTP/1.1, HTTP/2, and the <a href="https://www.cloudflare.com/learning/performance/what-is-http3/">recently standardized HTTP/3</a>.</p>
    <div>
      <h3>HTTP/1.1 and CONNECT</h3>
      <a href="#http-1-1-and-connect">
        
      </a>
    </div>
    <p>In HTTP/1.1, the <a href="https://www.rfc-editor.org/rfc/rfc7231#section-4.3.6">CONNECT method</a> can be used to establish an end-to-end TCP tunnel to a target server via a proxy server. This is commonly applied to use cases where there is a benefit of protecting the traffic between the client and the proxy, or where the proxy can provide access control at network boundaries. For example, a Web browser can be configured to issue all of its HTTP requests via an HTTP proxy.</p><p>A client sends a CONNECT request to the proxy server, which requests that it opens a TCP connection to the target server and desired port. It looks something like this:</p>
            <pre><code>CONNECT target.example.com:80 HTTP/1.1
Host: target.example.com</code></pre>
            <p>If the proxy succeeds in opening a TCP connection to the target, it responds with a 2xx range status code. If there is some kind of problem, an error status in the 5xx range can be returned. Once a tunnel is established there are two independent TCP connections: one on either side of the proxy. If a flow needs to stop, you can simply terminate the connections.</p><p>HTTP CONNECT proxies forward data between the client and the target server. The TCP packets themselves are not tunneled, only the data on the logical byte stream. Although the proxy is supposed to forward data and not process it, if the data is plaintext there would be nothing to stop it from doing so. In practice, CONNECT is often used to create an end-to-end TLS connection where only the client and target server have access to the protected content; the proxy sees only TLS records and can't read their content because it doesn't have access to the keys.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ParPDOxCyJFT2m3UYsLtR/d76fbce62a99c53fa68bc86773c231bd/image8-1.png" />
            
            </figure><p>Finally, it's worth noting that after a successful CONNECT request, the HTTP connection (and the TCP connection underpinning it) has been converted into a tunnel. There is no more possibility of issuing other HTTP messages, to the proxy itself, on the connection.</p>
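<p>To make the exchange above concrete, here is a sketch of the client side of a CONNECT handshake, reduced to building the request and checking the proxy's status line (target.example.com is a placeholder; a real client would typically also send Proxy-Authorization and then negotiate TLS over the established tunnel):</p>

```python
def build_connect_request(host: str, port: int) -> bytes:
    """Serialize an HTTP/1.1 CONNECT request for the given target."""
    return (
        f"CONNECT {host}:{port} HTTP/1.1\r\n"
        f"Host: {host}:{port}\r\n"
        "\r\n"
    ).encode("ascii")

def tunnel_established(status_line: bytes) -> bool:
    """A 2xx status from the proxy means the TCP tunnel is open."""
    version, code, *_ = status_line.decode("ascii").split(" ", 2)
    return version.startswith("HTTP/1.") and code.startswith("2")

req = build_connect_request("target.example.com", 443)
print(req.decode().splitlines()[0])  # CONNECT target.example.com:443 HTTP/1.1
print(tunnel_established(b"HTTP/1.1 200 Connection Established\r\n"))  # True
```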
    <div>
      <h3>HTTP/2 and CONNECT</h3>
      <a href="#http-2-and-connect">
        
      </a>
    </div>
    <p><a href="https://www.rfc-editor.org/rfc/rfc7540.html">HTTP/2</a> adds logical streams above the TCP layer in order to support concurrent requests and responses on a single connection. Streams are also reliable and ordered byte streams, operating on top of TCP. Returning to our magic shredder analogy: imagine you wanted to send a book. Shredding each page one after another and rebuilding the book one page at a time is slow, but handling multiple pages at the same time might be faster. HTTP/2 streams allow us to do that. But, as we all know, trying to put too much into a shredder can sometimes cause it to jam.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/37F1LF53K2Plk0vYbcdmgg/d7d57e19b13ebce0aabb910568f4d9f6/image3-33.png" />
            
            </figure><p>In HTTP/2, each request and response is sent on a different stream. To support this, HTTP/2 defines frames that contain the stream identifier that they are associated with. Requests and responses are composed of HEADERS and DATA frames which contain HTTP header fields and HTTP content, respectively. Frames can be large. When they are sent on the wire they might span multiple TLS records or TCP segments. Side note: the HTTP WG has been working on a new revision of the document that defines HTTP semantics that are common to all HTTP versions. The terms message, header fields, and content all come from <a href="https://www.ietf.org/archive/id/draft-ietf-httpbis-semantics-19.html#name-message-abstraction">this description</a>.</p><p>HTTP/2 concurrency allows applications to read and write multiple objects at different rates, which can improve HTTP application performance, such as web browsing. HTTP/1.1 traditionally dealt with this concurrency by opening multiple TCP connections in parallel and striping requests across these connections. In contrast, HTTP/2 multiplexes frames belonging to different streams onto the single byte stream provided by one TCP connection. Reusing a single connection has benefits, but it still leaves HTTP/2 at risk of TCP head-of-line blocking. For more details, refer to <a href="https://calendar.perfplanet.com/2020/head-of-line-blocking-in-quic-and-http-3-the-details/">Perf Planet blog</a>.</p><p><a href="https://datatracker.ietf.org/doc/html/rfc7540#section-8.3">HTTP/2 also supports the CONNECT method</a>. In contrast to HTTP/1.1, CONNECT requests do not take over an entire HTTP/2 connection. Instead, they convert a single stream into an end-to-end tunnel. It looks something like this:</p>
            <pre><code>:method = CONNECT
:authority = target.example.com:443</code></pre>
            <p>If the proxy succeeds in opening a TCP connection, it responds with a 2xx (Successful) status code. After this, the client sends DATA frames to the proxy, and the content of these frames are put into TCP packets sent to the target. In the return direction, the proxy reads from the TCP byte stream and populates DATA frames. If a tunnel needs to stop, you can simply terminate the stream; there is no need to terminate the HTTP/2 connection.</p><p>By using HTTP/2, a client can create multiple CONNECT tunnels in a single connection. This can help reduce resource usage (saving the global count of TCP connections) and allows related tunnels to be logically grouped together, ensuring that they "share fate" when either client or proxy need to gracefully close. On the proxy-to-server side there are still multiple independent TCP connections.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2zNstK2cYrqJKofOIiOeII/2c4bda54b181ce6c171b264589511421/image7.png" />
            
            </figure><p>One challenge of multiplexing tunnels on concurrent streams is how to effectively prioritize them. We've talked in the past about <a href="/better-http-2-prioritization-for-a-faster-web/">prioritization for web pages</a>, but the story is a bit different for CONNECT. We've been thinking about this and captured <a href="https://httpwg.org/http-extensions/draft-ietf-httpbis-priority.html#name-scheduling-and-the-connect-">considerations</a> in the new <a href="/adopting-a-new-approach-to-http-prioritization/">Extensible Priorities</a> draft.</p>
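<p>The stream bookkeeping this implies on the client side can be pictured as a small table: each CONNECT tunnel occupies one client-initiated stream (odd stream IDs, per the HTTP/2 spec), all within a single connection. A toy model of that mapping, not a real HTTP/2 implementation:</p>

```python
# Toy model of CONNECT tunnels multiplexed on one HTTP/2 connection.
# Client-initiated streams use odd IDs: 1, 3, 5, ...

class Http2ProxyConnection:
    def __init__(self) -> None:
        self.next_stream_id = 1
        self.tunnels: dict[int, str] = {}  # stream id -> target authority

    def open_tunnel(self, authority: str) -> int:
        """Send a CONNECT on a fresh stream; return its stream ID."""
        sid = self.next_stream_id
        self.next_stream_id += 2        # stream IDs are never reused
        self.tunnels[sid] = authority   # HEADERS: :method=CONNECT, :authority=...
        return sid

    def close_tunnel(self, sid: int) -> None:
        """Terminating a stream tears down one tunnel, not the connection."""
        del self.tunnels[sid]

conn = Http2ProxyConnection()
a = conn.open_tunnel("target.example.com:443")
b = conn.open_tunnel("other.example.com:22")
print(a, b)          # 1 3
conn.close_tunnel(a)
print(conn.tunnels)  # {3: 'other.example.com:22'}
```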
    <div>
      <h3>QUIC, HTTP/3 and CONNECT</h3>
      <a href="#quic-http-3-and-connect">
        
      </a>
    </div>
    <p>QUIC is a new secure and multiplexed transport protocol from the IETF. QUIC version 1 was published as <a href="https://www.rfc-editor.org/rfc/rfc9000.html">RFC 9000</a> in May 2021 and, <a href="/quic-version-1-is-live-on-cloudflare/">the next day</a>, we enabled it for all Cloudflare customers.</p><p>QUIC is composed of several foundational features. You can think of these like individual puzzle pieces that interlink to form a transport service. This service needs one more piece, an application mapping, to bring it all together.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7gGK0LOkLSU64zmxkd9gVd/d02d3a3f1f7e13d7c2819342f61f6a4a/image4-15.png" />
            
            </figure><p>Similar to HTTP/2, QUIC version 1 provides reliable and ordered streams. But QUIC streams live at the transport layer and they are the only type of QUIC primitive that can carry application data. QUIC has no opinion on how streams get used. Applications that wish to use QUIC must define that themselves.</p><p>QUIC streams can be long (up to 2^62 - 1 bytes). Stream data is sent on the wire in the form of <a href="https://www.rfc-editor.org/rfc/rfc9000.html#name-stream-frames">STREAM frames</a>. All QUIC frames must fit completely inside a QUIC packet. QUIC packets must fit entirely in a UDP datagram; fragmentation is prohibited. These requirements mean that a long stream is serialized to a series of QUIC packets sized roughly to the path <a href="https://en.wikipedia.org/wiki/Maximum_transmission_unit">MTU</a> (Maximum Transmission Unit). STREAM frames provide reliability via QUIC loss detection and recovery. Frames are acknowledged by the receiver and if the sender detects a loss (via missing acknowledgments), QUIC will retransmit the lost data. In contrast, TCP retransmits packets. This difference is an important feature of QUIC, letting implementations decide how to repacketize and reschedule lost data.</p><p>When multiplexing streams, different packets can contain <a href="https://www.rfc-editor.org/rfc/rfc9000.html#name-stream-frames">STREAM frames</a> belonging to different stream identifiers. This creates independence between streams and helps avoid the head-of-line blocking caused by packet loss that we see in TCP. If a UDP packet containing data for one stream is lost, other streams can continue to make progress without being blocked by retransmission of the lost stream.</p><p>To use our magic shredder analogy one more time: we're sending a book again, but this time we parallelise our task by using independent shredders. 
We need to logically associate them together so that the receiver knows the pages and shreds are all for the same book, but otherwise they can progress with less chance of jamming.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KPS4I6E6zvfPnrY4zNlxB/912d09cf787fa46d6dc77140919ba607/image6-5.png" />
            
            </figure><p><a href="https://datatracker.ietf.org/doc/draft-ietf-quic-http/">HTTP/3</a> is an example of an application mapping that describes how streams are used to exchange: HTTP settings, <a href="https://datatracker.ietf.org/doc/html/draft-ietf-quic-qpack-21">QPACK</a> state, and request and response messages. HTTP/3 still defines its own frames like HEADERS and DATA, but it is overall simpler than HTTP/2 because QUIC deals with the hard stuff. Since HTTP/3 just sees a logical byte stream, its frames can be arbitrarily sized. The QUIC layer handles segmenting HTTP/3 frames over STREAM frames for sending in packets. HTTP/3 <a href="https://datatracker.ietf.org/doc/html/draft-ietf-quic-http-34#section-4.2">also supports the CONNECT method</a>. It functions identically to CONNECT in HTTP/2, each request stream converting to an end-to-end tunnel.</p>
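<p>The stream independence described above can be illustrated with a toy packetizer: each STREAM frame carries a stream ID, an offset, and a chunk of data sized to an MTU budget, and the receiver reassembles each stream from whatever frames arrive, in any order (a heavy simplification of RFC 9000, for illustration only):</p>

```python
# Toy model of QUIC STREAM frames: (stream_id, offset, data) tuples
# sized to an MTU budget. Reordering or delaying one stream's frames
# does not block reassembly of the others.

def packetize(stream_id: int, data: bytes, budget: int):
    """Split one stream's data into MTU-sized STREAM frames."""
    return [
        (stream_id, off, data[off:off + budget])
        for off in range(0, len(data), budget)
    ]

def reassemble(frames):
    """Rebuild each stream from frames received in any order."""
    streams: dict[int, dict[int, bytes]] = {}
    for sid, off, chunk in frames:
        streams.setdefault(sid, {})[off] = chunk
    return {
        sid: b"".join(chunks[off] for off in sorted(chunks))
        for sid, chunks in streams.items()
    }

frames = packetize(0, b"page one of the book", 8) + packetize(4, b"page two", 8)
frames.reverse()  # simulate heavy reordering across both streams
print(reassemble(frames))  # {4: b'page two', 0: b'page one of the book'}
```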
    <div>
      <h3>HTTP packetization comparison</h3>
      <a href="#http-packetization-comparison">
        
      </a>
    </div>
    <p>We've talked about HTTP/1.1, HTTP/2, and HTTP/3. The diagram below is a convenient way to summarize how HTTP requests and responses get serialized for transmission over a secure transport. The main difference is that with TLS, protected records are split across several TCP segments, while with QUIC there is no record layer: each packet has its own protection.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Ixri8ytJ113sldUVNplLm/47ffbc388a4a45ae481a78e39b2986a0/image5-18.png" />
            
            </figure>
    <div>
      <h3>Limitations and looking ahead</h3>
      <a href="#limitations-and-looking-ahead">
        
      </a>
    </div>
    <p>HTTP CONNECT is a simple and elegant protocol that has a tremendous number of application use cases, especially for privacy-enhancing technology. In particular, applications can use it to proxy <a href="https://www.cloudflare.com/learning/dns/dns-over-tls/">DNS-over-HTTPS</a> similar to what’s been done for Oblivious DoH, or more generic HTTPS traffic (based on HTTP/1.1 or HTTP/2), and many more.</p><p>However, what about non-TCP traffic? Recall that HTTP/3 is an application mapping for QUIC, and therefore runs over UDP as well. What if we wanted to proxy QUIC? What if we wanted to proxy entire IP datagrams, similar to VPN technologies like IPsec or WireGuard? This is where <a href="/unlocking-quic-proxying-potential/">MASQUE</a> comes in. In the next post, we’ll discuss how the <a href="https://datatracker.ietf.org/wg/masque/about/">MASQUE Working Group</a> is standardizing technologies to enable proxying for datagram-based protocols like UDP and IP.</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Proxying]]></category>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[iCloud Private Relay]]></category>
            <guid isPermaLink="false">2YU980GMLipuAmzDuDgrTc</guid>
            <dc:creator>Lucas Pardue</dc:creator>
            <dc:creator>Christopher Wood</dc:creator>
        </item>
        <item>
            <title><![CDATA[Zero Trust client sessions]]></title>
            <link>https://blog.cloudflare.com/zero-trust-client-sessions/</link>
            <pubDate>Fri, 18 Mar 2022 13:00:48 GMT</pubDate>
            <description><![CDATA[ Starting today, you can build Zero Trust rules that require periodic authentication to control network access ]]></description>
            <content:encoded><![CDATA[ <p>Starting today, you can build <a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/">Zero Trust</a> rules that require periodic authentication to control network access. We’ve made this feature available for years for web-based applications, but we’re excited to bring this level of granular enforcement to TCP connections and UDP flows.</p><p>Zero Trust client-based sessions are now generally available. During CIO Week in 2021, we announced the beta program for this feature, and we incorporated feedback from early users into the generally available version. In this post, I will revisit why Zero Trust client-based sessions are important, how the feature works, and what we learned during the beta.</p>
    <div>
      <h3>Securing traffic with Sessions</h3>
      <a href="#securing-traffic-with-sessions">
        
      </a>
    </div>
    <p>We built Zero Trust client-based sessions to enhance the security of Cloudflare’s Zero Trust Network Access (ZTNA). The Zero Trust client is software that runs on a user machine and forwards all traffic from the machine to Cloudflare before it is sent over the Internet. This includes traffic bound for internal IPs and hostnames that typically house sensitive business applications. These sensitive applications were traditionally accessed using a VPN. Unlike VPNs, Cloudflare’s ZTNA allows administrators to set granular policies about who can access a specific resource. The only piece missing was that once a user enrolled their machine with the Zero Trust client, their session persisted indefinitely. This made lost or stolen laptops, shared workstations, and personal devices more of a risk than they should be. We built Zero Trust client-based sessions to solve this.</p><p>Zero Trust client-based sessions require a user to reauthenticate with their identity provider before accessing specific resources. The authentication pop-up is triggered only when a user attempts to access a protected resource. This avoids showing unnecessary pop-ups to users who may never need a session. Administrators can specify how often they would like their users to reauthenticate, depending on the resource. This is possible because the user’s last successful authentication is saved and evaluated against any ZTNA policy with a session configured.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7wEzf3RR9myNQR8FxML1iK/1b32ade28d71a11a537bf70bd0481ff3/image1-76.png" />
            
            </figure>
    <div>
      <h3>What we learned during the beta period</h3>
      <a href="#what-we-learned-during-the-beta-period">
        
      </a>
    </div>
    <p>During the beta period of Zero Trust client-based sessions, we worked closely with our customers and Cloudflare’s own security team to identify areas for immediate improvement. We identified two major areas of improvement before releasing to General Availability: pop-ups, which can be intrusive, and browser-based authentication, which is not always possible. We developed new strategies for serving an authentication pop-up to a user without being overly intrusive. In the future, users will have control over when they receive notifications to authenticate. The other area for improvement was that on certain machines and operating systems, browser-based authentication is not always possible. We are planning to add an option to authenticate directly from the Zero Trust client itself.</p>
    <div>
      <h3>What’s next</h3>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>This is only the beginning for Zero Trust client-based authentication. In the future, we plan to add options for step-up multifactor authentication and automated enrollment options via certificates and Service Tokens. Getting started is easy! Follow <a href="https://developers.cloudflare.com/cloudflare-one/policies/filtering/enforce-sessions/">this guide</a> for setting up Zero Trust client-based sessions in your Cloudflare Zero Trust dashboard.</p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[Zero Trust]]></category>
            <category><![CDATA[Cloudflare Zero Trust]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">42l3UWEBITlWpYkQsLLjAw</guid>
            <dc:creator>Kenny Johnson</dc:creator>
        </item>
        <item>
            <title><![CDATA[How to stop running out of ephemeral ports and start to love long-lived connections]]></title>
            <link>https://blog.cloudflare.com/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/</link>
            <pubDate>Wed, 02 Feb 2022 09:53:28 GMT</pubDate>
            <description><![CDATA[ Often programmers have assumptions that turn out, to their surprise, to be invalid. From my experience this happens a lot. Every API, technology or system can be abused beyond its limits and break in a miserable way ]]></description>
            <content:encoded><![CDATA[ <p>Often programmers have assumptions that turn out, to their surprise, to be invalid. From my experience this happens a lot. Every API, technology or system can be abused beyond its limits and break in a miserable way.</p><p>It's particularly interesting when basic things used everywhere fail. Recently we've reached such a breaking point in a ubiquitous part of Linux networking: establishing a network connection using the <code>connect()</code> system call.</p><p>Since we are not doing anything special, just establishing TCP and UDP connections, how could anything go wrong? Here's one example: we noticed alerts from a misbehaving server, logged in to check it out and saw:</p>
            <pre><code>marek@:~# ssh 127.0.0.1
ssh: connect to host 127.0.0.1 port 22: Cannot assign requested address</code></pre>
            <p>You can imagine the face of my colleague who saw that. SSH to localhost refuses to work, while she was already using SSH to connect to that server! On another occasion:</p>
            <pre><code>marek@:~# dig cloudflare.com @1.1.1.1
dig: isc_socket_bind: address in use</code></pre>
            <p>This time a basic DNS query failed with a weird networking error. Failing DNS is a bad sign!</p><p>In both cases the problem was Linux running out of ephemeral ports. When this happens, the system is unable to establish any outgoing connections. This is a pretty serious failure. It's usually transient, and if you don't know what to look for it can be hard to debug.</p><p>The root cause lies deeper though. We can often ignore limits on the number of outgoing connections. But we encountered cases where we hit limits on the number of concurrent outgoing connections during normal operation.</p><p>In this blog post I'll explain why we had these issues, how we worked around them, and present userspace code implementing an improved variant of the <code>connect()</code> syscall.</p>
    <div>
      <h3>Outgoing connections on Linux part 1 - TCP</h3>
      <a href="#outgoing-connections-on-linux-part-1-tcp">
        
      </a>
    </div>
    <p>Let's start with a bit of historical background.</p>
    <div>
      <h3>Long-lived connections</h3>
      <a href="#long-lived-connections">
        
      </a>
    </div>
    <p>Back in 2014 Cloudflare announced support for WebSockets. We wrote two articles about it:</p><ul><li><p><a href="/cloudflare-now-supports-websockets/">Cloudflare Now Supports WebSockets</a></p></li><li><p><a href="https://idea.popcount.org/2014-04-03-bind-before-connect/">Bind before connect</a></p></li></ul><p>If you skim these blogs, you'll notice we were totally fine with the WebSocket protocol, framing and operation. What worried us was our capacity to handle large numbers of concurrent outgoing connections towards the origin servers. Since WebSockets are long-lived, allowing them through our servers might greatly increase the concurrent connection count. And this did turn out to be a problem. It was possible to hit a ceiling for a total number of outgoing connections imposed by the Linux networking stack.</p><p>In a pessimistic case, each Linux connection consumes a local port (ephemeral port), and therefore the total connection count is limited by the size of the ephemeral port range.</p>
    <div>
      <h3>Basics - how port allocation works</h3>
      <a href="#basics-how-port-allocation-works">
        
      </a>
    </div>
    <p>When establishing an outbound connection a typical user needs the destination address and port. For example, DNS might resolve <code>cloudflare.com</code> to the '104.1.1.229' IPv4 address. A simple Python program can establish a connection to it with the following code:</p>
            <pre><code>cd = socket.socket(AF_INET, SOCK_STREAM)
cd.connect(('104.1.1.229', 80))</code></pre>
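            <p>Once <code>connect()</code> returns, we can ask the kernel what it picked with <code>getsockname()</code>. A minimal, self-contained sketch (a throwaway local listener stands in for the remote server here):</p>
            <pre><code>import socket

# A local listener stands in for the remote server so the example is
# self-contained. Port 0 asks the kernel to pick a free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
dst_ip, dst_port = listener.getsockname()

cd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cd.connect((dst_ip, dst_port))

# The kernel has now filled in the source half of the 4-tuple.
src_ip, src_port = cd.getsockname()
print(src_ip, src_port)

cd.close()
listener.close()</code></pre>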
            <p>The operating system’s job is to figure out how to reach that destination, selecting an appropriate source address and source port to form the full 4-tuple for the connection:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1zDTSzTRPl4JRrdWfjbzkP/63e0de7a453377f267b41ee0fa394a33/image4-1.png" />
            
            </figure><p>The operating system chooses the source IP based on the routing configuration. On Linux we can see which source IP will be chosen with <code>ip route get</code>:</p>
            <pre><code>$ ip route get 104.1.1.229
104.1.1.229 via 192.168.1.1 dev eth0 src 192.168.1.8 uid 1000
	cache</code></pre>
            <p>The <code>src</code> parameter in the result shows the discovered source IP address that should be used when going towards that specific target.</p><p>The source port, on the other hand, is chosen from the local port range configured for outgoing connections, also known as the ephemeral port range. On Linux this is controlled by the following sysctls:</p>
            <pre><code>$ sysctl net.ipv4.ip_local_port_range net.ipv4.ip_local_reserved_ports
net.ipv4.ip_local_port_range = 32768    60999
net.ipv4.ip_local_reserved_ports =</code></pre>
            <p>The <code>ip_local_port_range</code> sets the low and high (inclusive) port range to be used for outgoing connections. The <code>ip_local_reserved_ports</code> is used to skip specific ports if the operator needs to reserve them for services.</p>
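            <p>As a sanity check, the size of the usable range follows directly from those two sysctls. A hedged sketch using the default values shown above (on a real host you would read <code>/proc/sys/net/ipv4/ip_local_port_range</code> and <code>/proc/sys/net/ipv4/ip_local_reserved_ports</code> instead):</p>
            <pre><code># Hedged sketch: compute the number of usable ephemeral ports from the
# two sysctls above. The sample strings are the Linux defaults.
port_range = "32768\t60999"   # net.ipv4.ip_local_port_range
reserved = ""                 # net.ipv4.ip_local_reserved_ports

lo, hi = map(int, port_range.split())
# Count singly-reserved ports; the real file may also contain "a-b" ranges.
reserved_count = len([p for p in reserved.split(",") if p.strip()])
usable = (hi - lo + 1) - reserved_count
print(usable)   # 28232 with the defaults</code></pre>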
    <div>
      <h3>Vanilla TCP is a happy case</h3>
      <a href="#vanilla-tcp-is-a-happy-case">
        
      </a>
    </div>
    <p>The default ephemeral port range contains more than 28,000 ports (60999+1-32768=28232). Does that mean we can have at most 28,000 outgoing connections? That’s the core question of this blog post!</p><p>In TCP the connection is identified by a full 4-tuple, for example:</p>
<table>
<thead>
  <tr>
    <td><span>full 4-tuple</span></td>
    <td><span>192.168.1.8</span></td>
    <td><span>32768</span></td>
    <td><span>104.1.1.229</span></td>
    <td><span>80</span></td>
  </tr>
</thead>
</table><p>In principle, it is possible to reuse the source IP and port, and share them against another destination. For example, there could be two simultaneous outgoing connections with these 4-tuples:</p>
<table>
<thead>
  <tr>
    <th><span>full 4-tuple #A</span></th>
    <th><span>192.168.1.8</span></th>
    <th><span>32768</span></th>
    <th><span>104.1.1.229</span></th>
    <th><span>80</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>full 4-tuple #B</span></td>
    <td><span>192.168.1.8</span></td>
    <td><span>32768</span></td>
    <td><span>151.101.1.57</span></td>
    <td><span>80</span></td>
  </tr>
</tbody>
</table><p>This "source two-tuple" sharing can happen in practice when establishing connections using the vanilla TCP code:</p>
            <pre><code>sd = socket.socket(SOCK_STREAM)
sd.connect( (remote_ip, remote_port) )</code></pre>
            <p>But slightly different code can prevent this sharing, as we’ll discuss.</p><p>In the rest of this blog post, we’ll summarise the behaviour of code fragments that make outgoing connections, showing:</p><ul><li><p>The technique’s description</p></li><li><p>The typical <code>errno</code> value in the case of port exhaustion</p></li><li><p>And whether the kernel is able to reuse the {source IP, source port}-tuple against another destination</p></li></ul><p>The last column is the most important since it shows if there is a low limit on the total number of concurrent connections. As we're going to see later, the limit is present more often than we'd expect.</p>
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EADDRNOTAVAIL</span></td>
    <td><span>yes (good!)</span></td>
  </tr>
</tbody>
</table><p>In the case of generic TCP, things work as intended. Towards a single destination it's possible to have as many connections as the ephemeral range allows. When the range is exhausted (against a single destination), we'll see an EADDRNOTAVAIL error. The system is also able to correctly reuse the local two-tuple {source IP, source port} for ESTABLISHED sockets against other destinations. This is expected and desired.</p>
    <div>
      <h3>Manually selecting source IP address</h3>
      <a href="#manually-selecting-source-ip-address">
        
      </a>
    </div>
    <p>Let's go back to the Cloudflare server setup. Cloudflare operates many services, to name just two: CDN (caching HTTP reverse proxy) and <a href="/1111-warp-better-vpn">WARP</a>.</p><p>For Cloudflare, it’s important that we don’t mix traffic types among our outgoing IPs. Origin servers on the Internet might want to differentiate traffic based on our product. The simplest example is <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDN</a>: it's appropriate for an origin server to firewall off non-CDN inbound connections. Allowing Cloudflare cache pulls is totally fine, but allowing WARP connections which contain untrusted user traffic might lead to problems.</p><p>To achieve such outgoing IP separation, each of our applications must be explicit about which source IPs to use. They can’t leave it up to the operating system; the automatically-chosen source could be wrong. While it's technically possible to configure routing policy rules in Linux to express such requirements, we decided not to do that and keep Linux routing configuration as simple as possible.</p><p>Instead, before calling <code>connect()</code>, our applications select the source IP with the <code>bind()</code> syscall. A trick we call "bind-before-connect":</p>
            <pre><code>sd = socket.socket(SOCK_STREAM)
sd.bind( (src_IP, 0) )
sd.connect( (dst_IP, dst_port) )</code></pre>
            
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>bind(src_IP, 0)</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EADDRINUSE</span></td>
    <td><span>no </span><span>(bad!)</span></td>
  </tr>
</tbody>
</table><p>This code looks rather innocent, but it hides a considerable drawback. When calling <code>bind()</code>, the kernel attempts to find an unused local two-tuple. Due to BSD API shortcomings, the operating system can't know what we plan to do with the socket. It's totally possible we want to <code>listen()</code> on it, in which case sharing the source IP/port with a connected socket will be a disaster! That's why the source two-tuple selected when calling <code>bind()</code> must be unique.</p><p>Due to this API limitation, in this technique the source two-tuple can't be reused. Each connection effectively "locks" a source port, so the number of connections is constrained by the size of the ephemeral port range. Notice: one source port is used up for each connection, no matter how many destinations we have. This is bad, and is exactly the problem we were dealing with back in 2014 in the WebSockets articles mentioned above.</p><p>Fortunately, it's fixable.</p>
    <div>
      <h3>IP_BIND_ADDRESS_NO_PORT</h3>
      <a href="#ip_bind_address_no_port">
        
      </a>
    </div>
    <p>Back in 2014 we fixed the problem by setting the SO_REUSEADDR socket option and manually retrying <code>bind()</code>+ <code>connect()</code> a couple of times on error. This worked ok, but later in 2015 <a href="https://kernelnewbies.org/Linux_4.2#Networking">Linux introduced a proper fix: the IP_BIND_ADDRESS_NO_PORT socket option</a>. This option tells the kernel to delay reserving the source port:</p>
            <pre><code>sd = socket.socket(SOCK_STREAM)
sd.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
sd.bind( (src_IP, 0) )
sd.connect( (dst_IP, dst_port) )</code></pre>
            
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>IP_BIND_ADDRESS_NO_PORT<br />bind(src_IP, 0)</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EADDRNOTAVAIL</span></td>
    <td><span>yes (good!)</span></td>
  </tr>
</tbody>
</table><p>This gets us back to the desired behavior. On modern Linux, when doing bind-before-connect for TCP, you should set IP_BIND_ADDRESS_NO_PORT.</p>
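            <p>A hedged helper showing the full dance; <code>IP_BIND_ADDRESS_NO_PORT</code> is not exposed by every Python version, so we fall back to its Linux value (24):</p>
            <pre><code>import socket

# Fall back to the Linux constant (24) where the Python socket module
# does not expose IP_BIND_ADDRESS_NO_PORT.
IP_BIND_ADDRESS_NO_PORT = getattr(socket, "IP_BIND_ADDRESS_NO_PORT", 24)

def connect_from(src_ip, dst_ip, dst_port):
    sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sd.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
    sd.bind((src_ip, 0))            # pin the source IP, defer the port
    sd.connect((dst_ip, dst_port))  # the port is reserved only here
    return sd</code></pre>
            <p>Since the port is only reserved during <code>connect()</code>, calling <code>getsockname()</code> between <code>bind()</code> and <code>connect()</code> reports port 0.</p>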
    <div>
      <h3>Explicitly selecting a source port</h3>
      <a href="#explicitly-selecting-a-source-port">
        
      </a>
    </div>
    <p>Sometimes an application needs to select a specific source port. For example: the operator wants to control the full 4-tuple in order to debug ECMP routing issues.</p><p>Recently a colleague wanted to run a cURL command for debugging, and he needed the source port to be fixed. cURL provides the <code>--local-port</code> option to do this¹:</p>
            <pre><code>$ curl --local-port 9999 -4svo /dev/null https://cloudflare.com/cdn-cgi/trace
*   Trying 104.1.1.229:443...</code></pre>
            <p>In other situations source port numbers should be controlled, as they can be used as an input to a routing mechanism.</p><p>But setting the source port manually is not easy. We're back to square one in our hackery since IP_BIND_ADDRESS_NO_PORT is not an appropriate tool when calling <code>bind()</code> with a specific source port value. To get the scheme working again and be able to share source 2-tuple, we need to turn to SO_REUSEADDR:</p>
            <pre><code>sd = socket.socket(SOCK_STREAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.bind( (src_IP, src_port) )
sd.connect( (dst_IP, dst_port) )</code></pre>
            <p>Our summary table:</p>
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>SO_REUSEADDR<br />bind(src_IP, src_port)</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EADDRNOTAVAIL</span></td>
    <td><span>yes (good!)</span></td>
  </tr>
</tbody>
</table><p>Here, the user takes responsibility for handling conflicts, when an ESTABLISHED socket sharing the 4-tuple already exists. In such a case <code>connect</code> will fail with EADDRNOTAVAIL and the application should retry with another acceptable source port number.</p>
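            <p>That retry loop can be sketched as follows (hedged; <code>candidate_ports</code> is a hypothetical iterable of ports the operator is willing to use):</p>
            <pre><code>import errno
import socket

def connect_with_src_port(src_ip, candidate_ports, dst_ip, dst_port):
    # Try each acceptable source port until connect() succeeds; a 4-tuple
    # conflict surfaces as EADDRNOTAVAIL (or EADDRINUSE from bind).
    for src_port in candidate_ports:
        sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sd.bind((src_ip, src_port))
            sd.connect((dst_ip, dst_port))
            return sd
        except OSError as e:
            sd.close()
            if e.errno not in (errno.EADDRNOTAVAIL, errno.EADDRINUSE):
                raise
    raise OSError(errno.EADDRNOTAVAIL, "no acceptable source port free")</code></pre>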
    <div>
      <h3>Userspace connectx implementation</h3>
      <a href="#userspace-connectx-implementation">
        
      </a>
    </div>
    <p>With these tricks, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/connectx.py#L93-L110">we can implement a common function and call it <code>connectx</code></a>. It will do what <code>bind()</code>+<code>connect()</code> should, but won't have the unfortunate ephemeral port range limitation. In other words, created sockets are able to share local two-tuples as long as they are going to distinct destinations:</p>
            <pre><code>def connectx((source_IP, source_port), (destination_IP, destination_port)):</code></pre>
            <p>We have three use cases this API should support:</p>
<table>
<thead>
  <tr>
    <th><span>user specified</span></th>
    <th><span>technique</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>{_, _, dst_IP, dst_port}</span></td>
    <td><span>vanilla connect()</span></td>
  </tr>
  <tr>
    <td><span>{src_IP, _, dst_IP, dst_port}</span></td>
    <td><span>IP_BIND_ADDRESS_NO_PORT</span></td>
  </tr>
  <tr>
    <td><span>{src_IP, src_port, dst_IP, dst_port}</span></td>
    <td><span>SO_REUSEADDR</span></td>
  </tr>
</tbody>
</table><p>The name we chose isn't an accident. MacOS (specifically the underlying Darwin OS) has exactly that function implemented <a href="https://www.manpagez.com/man/2/connectx">as a <code>connectx()</code> system call</a> (<a href="https://github.com/apple/darwin-xnu/blob/a1babec6b135d1f35b2590a1990af3c5c5393479/bsd/netinet/tcp_usrreq.c#L517">implementation</a>):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ixMd6STGDjs1IO4DhAaFQ/3cbca6a9ec28010fb15f4004e2450587/image2.png" />
            
            </figure><p>It's more powerful than our <code>connectx</code> code, since it supports TCP Fast Open.</p><p>Should we, Linux users, be envious? For TCP, it's possible to get the right kernel behaviour with the appropriate setsockopt/bind/connect dance, so a kernel syscall is not quite needed.</p><p>But for UDP things turn out to be much more complicated and a dedicated syscall might be a good idea.</p>
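            <p>Before moving on to UDP, the TCP-side dispatch is worth spelling out. A hedged sketch of a <code>connectx</code> for TCP, choosing a technique based on which parts of the 4-tuple the caller specified (IPv4 only; the constant fallback assumes Linux):</p>
            <pre><code>import socket

IP_BIND_ADDRESS_NO_PORT = getattr(socket, "IP_BIND_ADDRESS_NO_PORT", 24)

def connectx_tcp(src, dst):
    # src = (src_IP or None, src_port or None); dst = (dst_IP, dst_port)
    src_ip, src_port = src
    sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if src_ip is not None and src_port is not None:
        # Full 4-tuple: the caller handles EADDRNOTAVAIL conflicts.
        sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sd.bind((src_ip, src_port))
    elif src_ip is not None:
        # Source IP only: defer port reservation to connect().
        sd.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
        sd.bind((src_ip, 0))
    # else: vanilla connect(), the kernel picks both.
    sd.connect(dst)
    return sd</code></pre>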
    <div>
      <h3>Outgoing connections on Linux - part 2 - UDP</h3>
      <a href="#outgoing-connections-on-linux-part-2-udp">
        
      </a>
    </div>
    <p>In the previous section we listed three use cases for outgoing connections that should be supported by the operating system:</p><ul><li><p>Vanilla egress: operating system chooses the outgoing IP and port</p></li><li><p>Source IP selection: user selects outgoing IP but the OS chooses port</p></li><li><p>Full 4-tuple: user selects full 4-tuple for the connection</p></li></ul><p>We demonstrated how to implement all three cases on Linux for TCP, without hitting connection count limits due to source port exhaustion.</p><p>It's time to extend our implementation to UDP. This is going to be harder.</p><p>For UDP, Linux maintains one hash table that is keyed on local IP and port, which can hold duplicate entries. Multiple UDP connected sockets can not only share a 2-tuple but also a 4-tuple! It's totally possible to have two distinct, connected sockets having exactly the same 4-tuple. This feature was created for multicast sockets. The implementation was then carried over to unicast connections, but it is confusing. With conflicting sockets on unicast addresses, only one of them will receive any traffic. A newer connected socket will "overshadow" the older one. It's surprisingly hard to detect such a situation. To get UDP <code>connectx()</code> right, we will need to work around this "overshadowing" problem.</p>
    <div>
      <h3>Vanilla UDP is limited</h3>
      <a href="#vanilla-udp-is-limited">
        
      </a>
    </div>
    <p>It might come as a surprise to many, but by default, the total count for outbound UDP connections is limited by the ephemeral port range size. Usually, with Linux you can't have more than ~28,000 connected UDP sockets, even if they point to multiple destinations.</p><p>Ok, let's start with the simplest and most common way of establishing outgoing UDP connections:</p>
            <pre><code>sd = socket.socket(SOCK_DGRAM)
sd.connect( (dst_IP, dst_port) )</code></pre>
            
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
    <th><span>risk of overshadowing</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EAGAIN</span></td>
    <td><span>no </span><span>(bad!)</span></td>
    <td><span>no</span></td>
  </tr>
</tbody>
</table><p>The simplest case is not a happy one. The total number of concurrent outgoing UDP connections on Linux is limited by the ephemeral port range size. On our multi-tenant servers, with potentially long-lived gaming and H3/QUIC flows containing WebSockets, this is too limiting.</p><p>On TCP we were able to slap on a <code>setsockopt</code> and move on. No such easy workaround is available for UDP.</p><p>For UDP, without REUSEADDR, Linux avoids sharing local 2-tuples among UDP sockets. During <code>connect()</code> it tries to find a 2-tuple that is not used yet. As a side note: there is no fundamental reason that it looks for a unique 2-tuple as opposed to a unique 4-tuple during <code>connect()</code>. This suboptimal behavior might be fixable.</p>
    <div>
      <h3>SO_REUSEADDR is hard</h3>
      <a href="#so_reuseaddr-is-hard">
        
      </a>
    </div>
    <p>To allow local two-tuple reuse we need the SO_REUSEADDR socket option. Sadly, this would also allow established sockets to share a 4-tuple, with the newer socket overshadowing the older one.</p>
            <pre><code>sd = socket.socket(SOCK_DGRAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.connect( (dst_IP, dst_port) )</code></pre>
            
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
    <th><span>risk of overshadowing</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>SO_REUSEADDR</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EAGAIN</span></td>
    <td><span>yes</span></td>
    <td><span>yes </span><span>(bad!)</span></td>
  </tr>
</tbody>
</table><p>In other words, we can't just set SO_REUSEADDR and move on, since we might hit a local 2-tuple that is already used in a connection against the same destination. We might already have an identical 4-tuple connected socket underneath. Most importantly, during such a conflict we won't be notified by any error. This is unacceptably bad.</p>
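            <p>The conflict really is silent. A hedged demonstration on Linux - both sockets come up connected with the identical 4-tuple and no error is raised (the port numbers are arbitrary; nothing needs to listen on the destination for a UDP <code>connect()</code> to succeed):</p>
            <pre><code>import socket

def reuseaddr_udp(src, dst):
    sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sd.bind(src)
    sd.connect(dst)
    return sd

a = reuseaddr_udp(("127.0.0.1", 41000), ("127.0.0.1", 9999))
b = reuseaddr_udp(("127.0.0.1", 41000), ("127.0.0.1", 9999))  # no error!

# Identical 4-tuples; only 'b' would receive traffic, 'a' is overshadowed.
assert a.getsockname() == b.getsockname()
assert a.getpeername() == b.getpeername()</code></pre>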
    <div>
      <h3>Detecting socket conflicts with eBPF</h3>
      <a href="#detecting-socket-conflicts-with-ebpf">
        
      </a>
    </div>
    <p>We thought a good solution might be to write an eBPF program to detect such conflicts. The idea was to hook code onto the <code>connect()</code> syscall. Linux cgroups allow the BPF_CGROUP_INET4_CONNECT hook. The eBPF program is called every time a process under a given cgroup runs the <code>connect()</code> syscall. This is pretty cool, and we thought it would allow us to verify if there is a 4-tuple conflict before moving the socket from the UNCONNECTED to the CONNECTED state.</p><p><a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2022-02-connectx/ebpf_connect4">Here is how to load and attach our eBPF program</a>:</p>
            <pre><code>bpftool prog load ebpf.o /sys/fs/bpf/prog_connect4  type cgroup/connect4
bpftool cgroup attach /sys/fs/cgroup/unified/user.slice connect4 pinned /sys/fs/bpf/prog_connect4</code></pre>
            <p>With this code in place, we greatly reduce the probability of overshadowing:</p>
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
    <th><span>risk of overshadowing</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>INET4_CONNECT hook</span><br /><span>SO_REUSEADDR</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>manual port discovery, EPERM on conflict</span></td>
    <td><span>yes</span></td>
    <td><span>yes, but small</span></td>
  </tr>
</tbody>
</table><p>However, this solution is limited. First, it doesn't work for sockets with an automatically assigned source IP or source port; it only works when a user manually creates a 4-tuple connection from userspace. Then there is a second issue: a typical race condition. We don't grab any lock, so it's technically possible a conflicting socket will be created on another CPU in the time between our eBPF conflict check and the finish of the real <code>connect()</code> syscall machinery. In short, this lockless eBPF approach is better than nothing, but fundamentally racy.</p>
    <div>
      <h3>Socket traversal - SOCK_DIAG ss way</h3>
      <a href="#socket-traversal-sock_diag-ss-way">
        
      </a>
    </div>
    <p>There is another way to verify if a conflicting socket already exists: we can check for connected sockets in userspace. It's possible to do this quite effectively, without any privileges, using the SOCK_DIAG_BY_FAMILY feature of the <code>netlink</code> interface. This is the same technique the <code>ss</code> tool uses to print out sockets available on the system.</p><p>The netlink code is not even all that complicated. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/connectx.py#L23">Take a look at the code</a>. Inside the kernel, it goes <a href="https://elixir.bootlin.com/linux/latest/source/net/ipv4/udp_diag.c#L28">quickly into a fast <code>__udp_lookup()</code> routine</a>. This is great - we can avoid iterating over all sockets on the system.</p><p>With that function handy, we can draft our UDP code:</p>
            <pre><code>sd = socket.socket(SOCK_DGRAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
cookie = sd.getsockopt(socket.SOL_SOCKET, SO_COOKIE, 8)
sd.bind( src_addr )
c, _ = _netlink_udp_lookup(family, src_addr, dst_addr)
if c != cookie:
    raise OSError(...)
sd.connect( dst_addr )</code></pre>
            <p>This code has the same race condition issue as the eBPF connect hook described above. But it's a good starting point. We need some locking to avoid the race condition. Perhaps it's possible to do that in userspace.</p>
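            <p>One building block of that draft deserves a note: the socket cookie. <code>SO_COOKIE</code> (57 on Linux, not exposed by Python's <code>socket</code> module) returns a 64-bit identifier unique to each socket, which is what lets us check that the socket owning the 4-tuple in the netlink answer is our own. A hedged sketch:</p>
            <pre><code>import socket
import struct

SO_COOKIE = 57  # Linux value; not defined in Python's socket module

sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# The kernel assigns the cookie lazily, on first read.
raw = sd.getsockopt(socket.SOL_SOCKET, SO_COOKIE, 8)
(cookie,) = struct.unpack("=Q", raw)
print(hex(cookie))
sd.close()</code></pre>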
    <div>
      <h3>SO_REUSEADDR as a lock</h3>
      <a href="#so_reuseaddr-as-a-lock">
        
      </a>
    </div>
    <p>Here comes a breakthrough: we can use SO_REUSEADDR as a locking mechanism. Consider this:</p>
            <pre><code>sd = socket.socket(SOCK_DGRAM)
cookie = sd.getsockopt(socket.SOL_SOCKET, SO_COOKIE, 8)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.bind( src_addr )
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 0)
c, _ = _netlink_udp_lookup(family, src_addr, dst_addr)
if c != cookie:
    raise OSError()
sd.connect( dst_addr )
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)</code></pre>
            <p>The idea here is:</p><ul><li><p>We need REUSEADDR around bind, otherwise it wouldn't be possible to reuse a local port. It's possible to clear REUSEADDR after bind; doing so technically leaves the kernel socket state inconsistent, but it doesn't hurt anything in practice.</p></li><li><p>By clearing REUSEADDR, we prevent new sockets from using that source port. At this stage we can check if we have ownership of the 4-tuple we want. Even if multiple sockets enter this critical section, only one, the newest, can win this verification. This is a cooperative algorithm, so we assume all tenants try to behave.</p></li><li><p>At this point, if the verification succeeds, we can perform <code>connect()</code> and have a guarantee that the 4-tuple won't be reused by another socket at any point in the process.</p></li></ul><p>This is rather convoluted and hacky, but it satisfies our requirements:</p>
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
    <th><span>risk of overshadowing</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>REUSEADDR as a lock</span></td>
    <td><span>EAGAIN</span></td>
    <td><span>yes</span></td>
    <td><span>no</span></td>
  </tr>
</tbody>
</table><p>Sadly, this scheme only works when we know the full 4-tuple, so we can't rely on the kernel's automatic source IP or port assignment.</p>
    <div>
      <h3>Faking source IP and port discovery</h3>
      <a href="#faking-source-ip-and-port-discovery">
        
      </a>
    </div>
    <p>When the user calls <code>connect()</code> and specifies only the target 2-tuple (destination IP and port), the kernel needs to fill in the missing bits: the source IP and source port. Unfortunately, the algorithm described above expects the full 4-tuple to be known in advance.</p><p>One solution is to implement source IP and port discovery in userspace. This turns out to be not that hard. For example, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/connectx.py#L204">here's a snippet of our code</a>:</p>
            <pre><code>def _get_udp_port(family, src_addr, dst_addr):
    if ephemeral_lo == None:
        _read_ephemeral()
    lo, hi = ephemeral_lo, ephemeral_hi
    start = random.randint(lo, hi)
    ...</code></pre>
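<p>The elided part of the snippet walks the ephemeral range looking for a usable port. As a rough, hypothetical rendition of the idea (the real code linked above differs in its details), reading the kernel's range and randomizing the starting point looks like this:</p>

```python
import random

def candidate_ports():
    # Read the kernel's ephemeral port range, e.g. "32768 60999".
    with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
        lo, hi = map(int, f.read().split())
    span = hi - lo + 1
    # Start at a random offset and walk the whole range, wrapping
    # around, so concurrent callers don't contend on the same ports.
    start = random.randint(lo, hi)
    return [lo + (start - lo + i) % span for i in range(span)]
```

<p>Each candidate port is then claimed with the REUSEADDR locking dance described earlier; the first port that survives the ownership check wins.</p>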
            
    <div>
      <h3>Putting it all together</h3>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p>Combining the manual source IP, port discovery and the REUSEADDR locking dance, we get a decent userspace implementation of <code>connectx()</code> for UDP.</p><p>We have covered all three use cases this API should support:</p>
<table>
<thead>
  <tr>
    <th><span>user specified</span></th>
    <th><span>comments</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>{_, _, dst_IP, dst_port}</span></td>
    <td><span>manual source IP and source port discovery</span></td>
  </tr>
  <tr>
    <td><span>{src_IP, _, dst_IP, dst_port}</span></td>
    <td><span>manual source port discovery</span></td>
  </tr>
  <tr>
    <td><span>{src_IP, src_port, dst_IP, dst_port}</span></td>
    <td><span>just our "REUSEADDR as lock" technique</span></td>
  </tr>
</tbody>
</table><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/connectx.py#L116-L166">Take a look at the full code</a>.</p>
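<p>As a compact illustration of the combined flow, here is a simplified, hypothetical <code>udp_connectx()</code> in Python. It is not the linked implementation: for brevity it samples only a handful of candidate ports and omits the cooperative 4-tuple ownership verification, keeping just the discovery-plus-REUSEADDR-locking skeleton:</p>

```python
import errno
import random
import socket

def udp_connectx(src_ip, src_port, dst):
    if src_port:
        # Full 4-tuple supplied by the caller: single lock attempt.
        candidates = [src_port]
    else:
        # Manual source port discovery from the kernel's ephemeral range.
        with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
            lo, hi = map(int, f.read().split())
        candidates = random.sample(range(lo, hi + 1), min(64, hi - lo + 1))
    for port in candidates:
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # REUSEADDR around bind, so a busy local port doesn't stop us...
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((src_ip, port))
        except OSError:
            s.close()
            continue
        # ...then cleared, locking the source port before connect().
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 0)
        s.connect(dst)
        return s
    raise OSError(errno.EAGAIN, "no source port available")
```

<p>Compared to letting the kernel pick the port, this keeps the 4-tuple conflict check in our own hands, which is the whole point of the exercise.</p>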
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>This post described a problem we hit in production: running out of ephemeral ports. This was partially caused by our servers running numerous concurrent connections, but also because we used the Linux sockets API in a way that prevented source port reuse. It meant that we were limited to ~28,000 concurrent connections per protocol, which is not enough for us.</p><p>We explained how to allow source port reuse and avoid the ephemeral-port-range limit. We showed a userspace <code>connectx()</code> function, which is a better way of creating outgoing TCP and UDP connections on Linux.</p><p>Our UDP code is more complex: it relies on little-known low-level features, assumes cooperation between tenants, and depends on undocumented behaviour of the Linux operating system. Using REUSEADDR as a locking mechanism is rather unheard of.</p><p>The <code>connectx()</code> functionality is valuable, and should be added to Linux one way or another. It's not trivial to get all its use cases right. Hopefully, this blog post shows how to achieve this in the best way given the operating system API constraints.</p><p>___</p><p>¹ On a side note, the second cURL run fails due to TIME-WAIT sockets: "bind failed with errno 98: Address already in use".</p><p>One option is to wait for the TIME_WAIT socket to die, or to work around this with the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/killtw.py">time-wait sockets kill script</a>. Killing time-wait sockets is generally a bad idea: it violates the protocol, is usually unneeded, and sometimes doesn't work. But hey, in some extreme cases it's good to know what's possible. Just saying.</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Linux]]></category>
            <guid isPermaLink="false">319tj39kXPyzuiPbC755uC</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Announcing Argo for Spectrum]]></title>
            <link>https://blog.cloudflare.com/argo-spectrum/</link>
            <pubDate>Tue, 23 Nov 2021 13:58:39 GMT</pubDate>
            <description><![CDATA[ Announcing general availability of Argo for Spectrum, a way to turbo-charge any TCP-based application. ]]></description>
            <content:encoded><![CDATA[ <p>Today we're excited to announce the general availability of Argo for Spectrum, a way to turbo-charge any TCP-based application. With Argo for Spectrum, you can reduce latency and packet loss, and improve connectivity for any TCP application, including common protocols like Minecraft, Remote Desktop Protocol and SFTP.</p>
    <div>
      <h3>The Internet — more than just a browser</h3>
      <a href="#the-internet-more-than-just-a-browser">
        
      </a>
    </div>
    <p>When people think of the Internet, many of us think about using a browser to view websites. Of course, it’s so much more! We often use other ways to connect to each other and to the resources we need for work. For example, you may interact with servers for work using SSH File Transfer Protocol (SFTP), git or Remote Desktop software. At home, you might play a video game on the Internet with friends.</p><p>To help protect these services against DDoS attacks, we launched Spectrum in 2018, extending Cloudflare’s <a href="https://www.cloudflare.com/ddos/">DDoS protection</a> to any TCP- or UDP-based protocol. Customers use it for a wide variety of use cases, including to protect video streaming (RTMP), gaming and internal IT systems. Spectrum also supports common VoIP protocols such as SIP and RTP, which have recently seen an <a href="/attacks-on-voip-providers/">increase in DDoS ransomware attacks</a>. Many of these applications are also highly sensitive to performance issues. No one likes waiting for a file to upload or dealing with a lagging video game.</p><p>Latency and throughput are the two metrics people generally discuss when talking about network performance. Latency refers to the time a piece of data (a packet) takes to travel between two systems. Throughput refers to the number of bits you can actually send per second. This post discusses how the two interact and how we improve both with Argo for Spectrum.</p>
    <div>
      <h3>Argo to the rescue</h3>
      <a href="#argo-to-the-rescue">
        
      </a>
    </div>
    <p>There are a number of factors that cause poor performance between two points on the Internet, including network congestion, the distance between the two points, and packet loss. This is a problem many of our customers have, even on web applications. To help, we launched <a href="/argo/">Argo Smart Routing</a> in 2017, a way to reduce latency (or <i>time to first byte</i>, to be precise) for any HTTP request that goes to an origin.</p><p>That’s great for folks who run websites, but what if you’re working on an application that doesn’t speak HTTP? Until now, people had limited options for improving performance for these applications. That changes today with the general availability of Argo for Spectrum, which offers the same benefits as Argo Smart Routing for any TCP-based protocol.</p><p>Argo for Spectrum takes the same smarts from our network traffic and applies them to Spectrum. At the time of writing, Cloudflare sits in front of approximately 20% of the Alexa top 10 million websites. That means that we see, in near real-time, which networks are congested, which are slow and which are dropping packets. We use that data to provision faster routes, which carry packets across the Internet faster than normal routing would. Argo for Spectrum works exactly the same way, using the same intelligence and routing plane but extending it to any TCP-based application.</p>
    <div>
      <h3>Performance</h3>
      <a href="#performance">
        
      </a>
    </div>
    <p>But what does this mean for real application performance? To find out, we ran a set of benchmarks on Catchpoint. Catchpoint is a service that allows you to set up <a href="https://www.cloudflare.com/application-services/solutions/app-performance-monitoring/">performance monitoring</a> from all over the world. Tests are repeated at intervals and aggregate results are reported. We wanted to use a third party such as Catchpoint to get objective results (as opposed to running the tests ourselves).</p><p>For our test case, we used a file server in the Netherlands as our origin. We provisioned various tests on Catchpoint to measure file transfer performance from various places in the world: Rabat, Tokyo, Los Angeles and Lima.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Dmiv8f30ef7K9FQ6O1Nyi/131f81007fa1c71ecebb4237f1ad759e/image2-28.png" />
            
            </figure><p>Throughput of a 10MB file. Higher is better.</p><p>Depending on location, transfers saw increases of up to 108% (for locations such as Tokyo) and <b>85% on average</b>. Why is it <b>so</b> much faster? The answer is <a href="https://en.wikipedia.org/wiki/Bandwidth-delay_product"><i>bandwidth delay product</i></a>. In layman's terms, bandwidth delay product means that the higher the latency, the lower the throughput. This is because with transmission protocols such as TCP, we need to wait for the other party to acknowledge that they received data before we can send more.</p><p>As an analogy, let’s assume we’re operating a water cleaning facility. We send unprocessed water through a pipe to a cleaning facility, but we’re not sure how much capacity the facility has! To test, we send an amount of water through the pipe. Once the water has arrived, the facility will call us up and say, “we can easily handle this amount of water at a time, please send more.” If the pipe is short, the feedback loop is quick: the water will arrive, and we’ll immediately be able to send more without having to wait. If we have a very, very long pipe, we have to stop sending water for a while before we get confirmation that the water has arrived and there’s enough capacity.</p><p>The same happens with TCP: we send an amount of data onto the wire and wait for confirmation that it arrived. If the <i>latency</i> is high, throughput suffers because we spend much of our time waiting for acknowledgements. If latency is low, throughput can stay high. With Spectrum and Argo, we help in two ways: first, Spectrum terminates the TCP connection close to the user, so latency on that link is low. Second, Argo reduces the latency between our edge and the origin. In concert, they create a set of low-latency connections, resulting in a low overall bandwidth delay product between users and origin. The result is a much higher throughput than you would otherwise get.</p><p>Argo for Spectrum supports any TCP-based protocol. This includes commonly used protocols like SFTP, git (over SSH), RDP and SMTP, but also media streaming and gaming protocols such as RTMP and Minecraft. Setting up Argo for Spectrum is easy. When creating a Spectrum application, just hit the “Argo Smart Routing” toggle. Any traffic will automatically be smart-routed.</p>
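<p>The bandwidth delay product argument is easy to put into numbers. A back-of-the-envelope sketch (the window size and RTTs here are illustrative figures, not measured values):</p>

```python
def max_throughput_bps(window_bytes, rtt_seconds):
    # At most one window of unacknowledged data can be in flight,
    # so throughput is capped at window size divided by round-trip time.
    return window_bytes * 8 / rtt_seconds

# A 64 KiB window across a 200 ms path, versus the same window once
# low-latency hops keep each round trip to 20 ms:
slow = max_throughput_bps(64 * 1024, 0.200)  # ~2.6 Mbit/s
fast = max_throughput_bps(64 * 1024, 0.020)  # ~26 Mbit/s
```

<p>Cutting the round-trip time by 10x raises the throughput ceiling by the same factor, which is why shortening each leg of the connection pays off so directly.</p>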
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3m5hR3BEdy6PqTp7jo7XyT/1ce3ff692d52b0fa677e27c79311dcf1/image3-35.png" />
            
            </figure><p>Argo for Spectrum covers much more than just these applications: we support any TCP-based protocol. If you're interested, reach out to your account team today to see what we can do for you.</p> ]]></content:encoded>
            <category><![CDATA[Argo Smart Routing]]></category>
            <category><![CDATA[Spectrum]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Performance]]></category>
            <guid isPermaLink="false">7YylXseoJGsIrnn3GLNzq</guid>
            <dc:creator>Achiel van der Mandele</dc:creator>
        </item>
        <item>
            <title><![CDATA[flowtrackd: DDoS Protection with Unidirectional TCP Flow Tracking]]></title>
            <link>https://blog.cloudflare.com/announcing-flowtrackd/</link>
            <pubDate>Tue, 14 Jul 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ flowtrackd is a software-defined DDoS protection system that significantly improves our ability to automatically detect and mitigate even the most complex TCP-based DDoS attacks. If you are a Magic Transit customer, this feature will be enabled by default at no additional cost on July 30, 2020. ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://www.cloudflare.com/magic-transit/">Magic Transit</a> is Cloudflare’s L3 DDoS Scrubbing service for protecting network infrastructure. As part of our ongoing investment in Magic Transit and our <a href="https://www.cloudflare.com/ddos/">DDoS protection</a> capabilities, we’re excited to talk about a new piece of software helping to protect Magic Transit customers: <b><i>flowtrackd</i></b><i>.</i> flowtrackd is a software-defined DDoS protection system that significantly improves our ability to automatically detect and mitigate even the most complex TCP-based DDoS attacks. If you are a Magic Transit customer, this feature will be enabled by default at no additional cost on July 30, 2020.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3cB4AX4qGrbT7hbc2AH2fP/f4927a9879bade2ea0e21824dc724d89/Social---Case-Study-Quote-template-2.png" />
            
            </figure>
    <div>
      <h3>TCP-Based DDoS Attacks</h3>
      <a href="#tcp-based-ddos-attacks">
        
      </a>
    </div>
    <p>In the first quarter of 2020, one out of every two L3/4 <a href="https://www.cloudflare.com/learning/ddos/what-is-a-ddos-attack/">DDoS attacks</a> Cloudflare mitigated was an <a href="https://www.cloudflare.com/learning/ddos/what-is-an-ack-flood/">ACK Flood</a>, and over 66% of all L3/4 attacks were <a href="https://www.cloudflare.com/learning/ddos/glossary/tcp-ip/">TCP</a> based. Most types of DDoS attacks can be mitigated by finding unique characteristics that are present in all attack packets and using them to distinguish ‘good’ packets from the ‘bad’ ones. This is called "stateless" mitigation, because any packet with these unique characteristics can simply be dropped without remembering any information (or "state") about the packets that came before it. However, when attack packets have no unique characteristics, "stateful" mitigation is required, because whether a certain packet is good or bad depends on the packets that have come before it.</p><p>The most sophisticated types of TCP flood require stateful mitigation, where every TCP connection must be tracked in order to know whether any particular TCP packet is part of an active connection. That kind of mitigation is called "flow tracking", and it is typically implemented in Linux by the <a href="https://en.wikipedia.org/wiki/Iptables">iptables</a> conntrack module. However, DDoS protection with conntrack is not as simple as flipping the iptables switch, especially at the scale and complexity that Cloudflare operates at. If you're interested in learning more, <a href="/conntrack-tales-one-thousand-and-one-flows/">this blog post</a> covers the technical challenges of implementing iptables conntrack.</p><p>Complex TCP DDoS attacks pose a threat because they can be harder to detect and mitigate. They therefore have the potential to cause service degradation, outages and increased false positives from inaccurate mitigation rules. 
So how does Cloudflare block patternless DDoS attacks without affecting legitimate traffic?</p>
    <div>
      <h3>Bidirectional TCP Flow Tracking</h3>
      <a href="#bidirectional-tcp-flow-tracking">
        
      </a>
    </div>
    <p>Using Cloudflare's traditional products, HTTP applications can be protected by the <a href="https://www.cloudflare.com/waf/">WAF</a> service, and TCP/UDP applications can be protected by <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum</a>. These services are "<a href="https://www.cloudflare.com/learning/cdn/glossary/reverse-proxy/">reverse proxies</a>", meaning that traffic passes through Cloudflare in both directions. In this bidirectional topology, we see the entire TCP flow (i.e., segments sent by both the client and the server) and can therefore track the state of the underlying TCP connection. This way, we know if a TCP packet belongs to an existing flow or if it is an “out of state” TCP packet. Out of state TCP packets look just like regular TCP packets, but they don’t belong to any real connection between a client and a server. These packets are most likely part of an attack and are therefore dropped.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/190bHsEl8drp7r686sUlCi/04efd3f411afac52d87b9072c7f43f73/image2-4.png" />
            
            </figure><p><b>Reverse Proxy: What Cloudflare Sees</b></p><p>While not trivial, <a href="/conntrack-tales-one-thousand-and-one-flows/">tracking TCP</a> flows can be done when we serve as a proxy between the client and server, allowing us to absorb and mitigate out of state TCP floods. However, it becomes <i>much</i> more challenging when we only see half of the connection: the ingress flow. This visibility into ingress but not egress flows is the default deployment method for Cloudflare’s Magic Transit service, so we had our work cut out for us in identifying out of state packets.</p>
    <div>
      <h3>The Challenge With Unidirectional TCP Flows</h3>
      <a href="#the-challenge-with-unidirectional-tcp-flows">
        
      </a>
    </div>
    <p>With Magic Transit, Cloudflare receives inbound internet traffic on behalf of the customer, scrubs DDoS attacks, and routes the clean traffic to the customer’s data center over a tunnel. The data center then responds directly to the eyeball client using a technique known as Direct Server Return (DSR).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2sYZnV8m6FNZoMw0TQBxxZ/18566279daa78e35189982a018809526/image6.png" />
            
            </figure><p>Magic Transit: Asymmetric L3 Routing</p><p>Using DSR, when a TCP handshake is initiated by an eyeball client, it sends a SYN packet that gets routed via Cloudflare to the origin data center. The origin then responds with a SYN-ACK directly to the client, bypassing Cloudflare. Finally, the client responds with an ACK that once again routes to the origin via Cloudflare and the connection is then considered established.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4dPx4Zrjj0SAppPOL9Nx7v/ee17d72aceb74493eafd9197c4918dd9/image3-6.png" />
            
            </figure><p><b>L3 Routing: What Cloudflare Sees</b></p><p>In a unidirectional flow we don’t see the SYN-ACK sent from the origin to the eyeball client, and therefore can't use our existing flow tracking capabilities to identify out of state packets.</p>
    <div>
      <h3>Unidirectional TCP Flow Tracking</h3>
      <a href="#unidirectional-tcp-flow-tracking">
        
      </a>
    </div>
    <p>To overcome the challenges of unidirectional flows, we recently completed the development and rollout of a new system, codenamed flowtrackd (“flow tracking daemon”). flowtrackd is a state machine that hooks into the network interface. Using only the ingress traffic that routes through Cloudflare, flowtrackd determines whether to forward or drop each received TCP packet based on the state of its related connection.</p><p>The state machine that determines the state of the flows was developed in-house and complements Gatebot and dosd. Together, Gatebot, dosd, and flowtrackd provide comprehensive, multi-layered DDoS protection.</p>
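<p>To make the idea concrete, here is a toy Python sketch of ingress-only tracking. This is our illustration, not flowtrackd's actual logic (which is far richer: timeouts, FIN/RST teardown, sequence number checks, and more); flags are abbreviated "S" for SYN and "A" for ACK:</p>

```python
from enum import Enum, auto

class FlowState(Enum):
    SYN_SEEN = auto()
    ESTABLISHED = auto()

class UnidirectionalTracker:
    # Toy model only: classifies each ingress TCP packet using
    # nothing but the state built from previous ingress packets.
    def __init__(self):
        self.flows = {}  # 4-tuple -> FlowState

    def handle(self, flow, tcp_flags):
        state = self.flows.get(flow)
        if state is None:
            if "S" in tcp_flags:
                # Client opening a connection: start tracking it.
                self.flows[flow] = FlowState.SYN_SEEN
                return "forward"
            return "drop"  # out of state, e.g. an ACK-flood packet
        if state is FlowState.SYN_SEEN and "A" in tcp_flags:
            # The ACK completing the handshake (the SYN-ACK from the
            # origin went directly to the client and was never seen).
            self.flows[flow] = FlowState.ESTABLISHED
        return "forward"
```

<p>Even this toy version captures the key property: packets that don't fit any tracked connection, like an ACK flood, are dropped without ever reaching the origin.</p>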
    <div>
      <h3>Releasing flowtrackd to the Wild</h3>
      <a href="#releasing-flowtrackd-to-the-wild">
        
      </a>
    </div>
    <p>And it works! Less than a day after releasing flowtrackd to an early access customer, flowtrackd automatically detected and mitigated an ACK flood that peaked at 6 million packets per second. No downtime, service disruption, or false positives were reported.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Ns2NsGj3FiUSTaIB55aSv/4e6ad11d086cd6d605e96897dcaecbe8/image4-2.png" />
            
            </figure><p>flowtrackd Mitigates 6M pps Flood</p>
    <div>
      <h3>Cloudflare’s DDoS Protection - Delivered From Every Data Center</h3>
      <a href="#cloudflares-ddos-protection-delivered-from-every-data-center">
        
      </a>
    </div>
    <p>As opposed to legacy scrubbing center providers with limited network infrastructures, Cloudflare provides DDoS Protection from every one of our data centers in over 200 locations around the world. We write our own software-defined DDoS protection systems. Notice I say system<b>s</b>: unlike vendors that rely on a dedicated third-party appliance, we’re able to write and spin up whatever software we need, deploy it at the optimal location in our tech stack, and are therefore neither dependent on other vendors nor limited to the capabilities of a single appliance.</p><p>flowtrackd joins the Cloudflare DDoS protection family which includes our veteran <a href="/meet-gatebot-a-bot-that-allows-us-to-sleep/">Gatebot</a> and the younger and energetic <a href="/rolling-with-the-punches-shifting-attack-tactics-dropping-packets-faster-cheaper-at-the-edge/">dosd</a>. flowtrackd will be available from every one of our data centers, with a total mitigation capacity of over 37 Tbps, protecting our Magic Transit customers against the most complex TCP DDoS attacks.</p><p><b>New to Magic Transit?</b> Replace your legacy provider with Magic Transit and pay nothing until your current contract expires. Offer expires September 1, 2020. Click <a href="https://www.cloudflare.com/lp/better/">here</a> for details.</p> ]]></content:encoded>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Magic Transit]]></category>
            <category><![CDATA[Gatebot]]></category>
            <guid isPermaLink="false">36KNxycC03DZF9kgnwpwVb</guid>
            <dc:creator>Omer Yoachimik</dc:creator>
        </item>
        <item>
            <title><![CDATA[Conntrack tales - one thousand and one flows]]></title>
            <link>https://blog.cloudflare.com/conntrack-tales-one-thousand-and-one-flows/</link>
            <pubDate>Mon, 06 Apr 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ We were wondering - can we just enable Linux "conntrack"? How does it actually work? I volunteered to help the team understand the dark corners of the Linux's "conntrack" stateful firewall subsystem. ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare we develop new products at a great pace. Their needs often challenge the architectural assumptions we made in the past. For example, years ago we decided to avoid using Linux's "conntrack" stateful firewall facility. This brought great benefits - it simplified our iptables firewall setup, sped up the system a bit and made the inbound packet path easier to understand.</p><p>But eventually our needs changed. One of our new products had a reasonable need for it. But we weren't confident - can we just enable conntrack and move on? How does it actually work? I volunteered to help the team understand the dark corners of the "conntrack" subsystem.</p>
    <div>
      <h2>What is conntrack?</h2>
      <a href="#what-is-conntrack">
        
      </a>
    </div>
    <p>"Conntrack" is part of the Linux network stack, specifically of the firewall subsystem. To put that into perspective: early firewalls were entirely stateless. They could express only basic logic, like: allow SYN packets to ports 80 and 443, and block everything else.</p><p>The stateless design gave some basic <a href="https://www.cloudflare.com/learning/network-layer/network-security/">network security</a>, but was quickly deemed insufficient. You see, there are certain things that can't be expressed in a stateless way. The canonical example is the assessment of ACK packets - it's impossible to say whether an ACK packet is legitimate or part of a port scanning attempt without tracking the connection state.</p><p>To fill such gaps, all the operating systems implemented connection tracking inside their firewalls. This tracking is usually implemented as a big table with at least 6 columns: protocol (usually TCP or UDP), source IP, source port, destination IP, destination port and connection state. On Linux this subsystem is called "conntrack" and is often enabled by default. Here's how the table looks on my laptop, inspected with the "conntrack -L" command:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57PxFTDUBFZqODrE5OrOul/a9d34a54bf247021ba000b15119461bc/image5-2.png" />
            
            </figure><p>The obvious question is how large this state tracking table can be. This setting is under "/proc/sys/net/nf_conntrack_max":</p>
            <pre><code>$ cat /proc/sys/net/nf_conntrack_max
262144</code></pre>
            <p>This is a global setting, but the limit applies per container: on my system each container, or "network namespace", can have up to 256K conntrack entries.</p><p>What exactly happens when the number of concurrent connections exceeds the conntrack limit?</p>
    <div>
      <h2>Testing conntrack is hard</h2>
      <a href="#testing-conntrack-is-hard">
        
      </a>
    </div>
    <p>In the past, testing conntrack was hard - it required a complex hardware or VM setup. Fortunately, these days we can use modern "user namespace" facilities which do permission magic, allowing an unprivileged user to feel like root. Using the tool "unshare" it's possible to create an isolated environment where we can precisely control the packets going through and experiment with iptables and conntrack without threatening the health of our host system. With appropriate parameters it's possible to create and manage a networking namespace, including access to namespaced iptables and conntrack, from an unprivileged user.</p><p>This script is the heart of our test:</p>
            <pre><code># Enable tun interface
ip tuntap add name tun0 mode tun
ip link set tun0 up
ip addr add 192.0.2.1 peer 192.0.2.2 dev tun0
ip route add 0.0.0.0/0 via 192.0.2.2 dev tun0

# Refer to conntrack at least once to ensure it's enabled
iptables -t raw -A PREROUTING -j CT
# Create a counter in mangle table
iptables -t mangle -A PREROUTING
# Make sure reverse traffic doesn't affect conntrack state
iptables -t raw -A OUTPUT -p tcp --sport 80 -j DROP

tcpdump -ni any -B 16384 -ttt &amp;
...
./venv/bin/python3 send_syn.py

conntrack -L
# Show iptables counters
iptables -nvx -t raw -L PREROUTING
iptables -nvx -t mangle -L PREROUTING</code></pre>
            <p>This bash script is shortened for readability. See the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-04-conntrack-syn/test-1.bash">full version here</a>. The accompanying "send_syn.py" just sends 10 SYN packets over the "tun0" interface. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-04-conntrack-syn/send_syn.py">Here is the source</a>, but allow me to paste it here - showing off "scapy" is always fun:</p>
            <pre><code>tun = TunTapInterface("tun0", mode_tun=True)
tun.open()

for i in range(10000,10000+10):
    ip=IP(src="198.18.0.2", dst="192.0.2.1")
    tcp=TCP(sport=i, dport=80, flags="S")
    send(ip/tcp, verbose=False, inter=0.01, socket=tun)</code></pre>
            <p>The bash script above contains a couple of gems. Let's walk through them.</p><p>First, please note that we can't just inject packets into the loopback interface using <a href="http://man7.org/linux/man-pages/man7/raw.7.html">SOCK_RAW sockets</a>. The Linux networking stack is a complex beast. The semantics of sending packets over a SOCK_RAW socket are different than delivering a packet over a real interface. We'll discuss this later, but for now, to avoid triggering unexpected behaviour, we will deliver packets over a tun/tap device which better emulates a real interface.</p><p>Then we need to make sure conntrack is active in the network namespace we wish to use for testing. Traditionally, just loading the kernel module would have done that, but in the brave new world of containers and network namespaces, a method had to be found to allow conntrack to be active in some containers and inactive in others. Hence this is tied to usage - rules referencing conntrack must exist in the namespace's iptables for conntrack to be active inside the container.</p><p>As a side note, <a href="https://lwn.net/Articles/740455/">containers triggering the host to load kernel modules</a> is an <a href="https://github.com/weaveworks/go-odp/blob/6b0aa22550d9325eb8f43418185859e13dc0de1d/odp/dpif.go#L67-L90">interesting subject</a>.</p><p>After the "-t raw -A PREROUTING" rule, we added a "-t mangle -A PREROUTING" rule - but notice, it doesn't have any action! This syntax is allowed by iptables and it is pretty useful for getting iptables to report rule counters. We'll need these counters soon. A careful reader might suggest looking at "policy" counters in iptables to achieve our goal. Sadly, "policy" counters (increased for each packet entering a chain) work only if there is at least one rule inside the chain.</p><p>The rest of the steps are self-explanatory. We set up "tcpdump" in the background and send 10 SYN packets to 192.0.2.1:80 using the "scapy" Python library. Then we print the conntrack table and iptables counters.</p><p>Let's see this script in action. Remember to run it under a networking namespace as fake root with "unshare -Ur -n":</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4xUiZXoGQrhYU3Lugqbx97/9db644c8c56f944445fc7ebb559d6d95/image6.png" />
            
            </figure><p>This is all nice. First we see a "tcpdump" listing showing 10 SYN packets. Then we see the conntrack table state, showing 10 created flows. Finally, we see iptables counters in two rules we created, each showing 10 packets processed.</p>
    <div>
      <h2>Can conntrack table fill up?</h2>
      <a href="#can-conntrack-table-fill-up">
        
      </a>
    </div>
    <p>Given that the conntrack table is size constrained, what exactly happens when it fills up? Let's check it out. First, we need to drop the conntrack size. As mentioned it's controlled by a global toggle - it's necessary to tune it on the host side. Let's reduce the table size to 7 entries, and repeat our test:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2iH78254UDI6cAFqqiAWcy/92aafe709b997fcc234279d2966077c1/image4-3.png" />
            
            </figure><p>This is getting interesting. We still see the 10 inbound SYN packets. We still see that the "-t raw PREROUTING" table received 10 packets, but this is where similarities end. The "-t mangle PREROUTING" table saw only 7 packets. Where did the three missing SYN packets go?</p><p>It turns out they went where all the dead packets go. They were hard dropped. Conntrack on overfill does exactly that. It even complains in the "dmesg":</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Gtrs2Na2a4NKQbNllKw8w/ddf3755fff4e9b89dc455260b8a14107/image1-1.png" />
            
            </figure><p>This is confirmed by our iptables counters. Let's review the <a href="https://upload.wikimedia.org/wikipedia/commons/3/37/Netfilter-packet-flow.svg">famous iptables</a> diagram:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6wBlH7JteM8K0KVBp6P6CG/c1d17141b74544ca9809714354bf05ca/image7.png" />
            
            </figure><p><a href="https://commons.wikimedia.org/wiki/File:Netfilter-packet-flow.svg">image</a> by <a href="https://commons.wikimedia.org/wiki/User_talk:Jengelh">Jan Engelhardt</a> CC BY-SA 3.0</p><p>As we can see, the "-t raw PREROUTING" happens before conntrack, while "-t mangle PREROUTING" is just after it. This is why we see 10 and 7 packets reported by our iptables counters.</p><p>Let me emphasize the gravity of our discovery. We showed three completely valid SYN packets being implicitly dropped by "conntrack". There is no explicit "-j DROP" iptables rule. There is no configuration to be toggled. Just the fact of using "conntrack" means that, when it's full, packets creating new flows will be dropped. No questions asked.</p><p>This is the dark side of using conntrack. If you use it, you absolutely must make sure it doesn't get filled.</p><p>We could end our investigation here, but there are a couple of interesting caveats.</p>
    <div>
      <h2>Strict vs loose</h2>
      <a href="#strict-vs-loose">
        
      </a>
    </div>
    <p>Conntrack supports a "strict" and a "loose" mode, as configured by the "nf_conntrack_tcp_loose" toggle.</p>
            <pre><code>$ cat /proc/sys/net/netfilter/nf_conntrack_tcp_loose
1</code></pre>
            <p>By default, it's set to "loose", which means that stray ACK packets for unseen TCP flows will create new flow entries in the table. We can generalize: when the table is full, "conntrack" will implicitly drop every packet that creates a new flow, whether that's a SYN or just a stray ACK.</p><p>What happens when we set "nf_conntrack_tcp_loose=0"? This is a subject for another blog post, but suffice to say - it's a mess. First, this setting is not settable in the network namespace scope - although it should be. To test it you need to be in the root network namespace. Then, due to some twisted logic, the ACK will still be dropped on a full conntrack table, even though in this case it doesn't create a flow. If the table is not full, the ACK packet will pass through, marked as INVALID ("--ctstate INVALID") from the "mangle" table onward.</p>
    <div>
      <h2>When doesn't a conntrack entry get created?</h2>
      <a href="#when-doesnt-a-conntrack-entry-get-created">
        
      </a>
    </div>
    <p>There are important situations when a conntrack entry is not created. For example, we could replace these lines in our script:</p>
            <pre><code># Make sure reverse traffic doesn't affect conntrack state
iptables -t raw -A OUTPUT -p tcp --sport 80 -j DROP</code></pre>
            <p>With these:</p>
            <pre><code># Make sure inbound SYN packets don't go to networking stack
iptables -A INPUT -j DROP</code></pre>
            <p>Naively, we might expect that dropping SYN packets after the conntrack layer would still leave flow entries behind. This is not the case: even though these SYN packets have been seen by conntrack, no flow state is created for them. Packets hitting "-j DROP" never create new conntrack flows. Pretty magical, isn't it?</p>
    <div>
      <h2>A full conntrack table causes EPERM</h2>
      <a href="#full-conntrack-causes-with-eperm">
        
      </a>
    </div>
    <p>Recently we hit a case where a "sendto()" syscall on a UDP socket from one of our applications was failing with EPERM. This is pretty weird, and not documented in the man page. My colleague had no doubts:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4qoszkkHAOaErPNjv6cwPZ/c22d83083c706014dff1c47dd3beb517/image9-1.png" />
            
            </figure><p>I'll save you the gruesome details, but indeed, a full conntrack table will do that to your new UDP flows - you will get EPERM. Beware. Funnily enough, it's also possible to get EPERM when an outbound packet is dropped by the OUTPUT firewall in other ways. For example:</p>
            <pre><code>marek:~$ sudo iptables -I OUTPUT -p udp --dport 53 --dst 192.0.2.8 -j DROP
marek:~$ strace -e trace=write nc -vu 192.0.2.8 53
write(3, "X", 1)                        = -1 EPERM (Operation not permitted)
+++ exited with 1 +++</code></pre>
            <p>If you ever receive EPERM from "sendto()", you might want to treat it as a transient error if you suspect a full conntrack table, or as a permanent error if you blame the iptables configuration.</p><p>This is also why we can't send our SYN packets directly using SOCK_RAW sockets in our test. Let's see what happens on conntrack overfill with the standard "hping3" tool:</p>
            <pre><code>$ hping3 -S -i u10000 -c 10 --spoof 192.18.0.2 192.0.2.1 -p 80 -I lo
HPING 192.0.2.1 (lo 192.0.2.1): S set, 40 headers + 0 data bytes
[send_ip] sendto: Operation not permitted</code></pre>
            <p>"send()" even on a SOCK_RAW socket fails with EPERM when conntrack table is full.</p>
    <div>
      <h2>Full conntrack can happen on a SYN flood</h2>
      <a href="#full-conntrack-can-happen-on-a-syn-flood">
        
      </a>
    </div>
    <p>There is one more caveat. During a SYN flood, conntrack entries will still be created for the spoofed flows. Take a look at the second test case we prepared, this time correctly listening on port 80 and sending SYN+ACK:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2vNW5JQuRFZyumzKOrIQVW/ea983dcbf2e8a876a8f3264b3104d4d4/image8.png" />
            
            </figure><p>We can see 7 SYN+ACKs flying out of the port 80 listening socket. The final three SYNs go nowhere as they are dropped by conntrack.</p><p>This has important implications. If you use conntrack on publicly accessible ports, then during a SYN flood <a href="/syn-packet-handling-in-the-wild/">mitigation technologies like SYN Cookies</a> won't help. You are still at risk of running out of conntrack space and therefore affecting legitimate connections.</p><p>For this reason, as a general rule, consider avoiding conntrack on inbound connections (-j NOTRACK). Alternatively, apply some reasonable rate limits at the iptables layer with "-j DROP". This works well and won't create new flows, as we discussed above. The best method, though, would be to trigger SYN Cookies from a layer before conntrack, like XDP. But that is a subject for another time.</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>Over the years Linux conntrack has gone through many changes and has improved a lot. While performance used to be a major concern, these days it's considered to be very fast. Dark corners remain, though, and correctly applying conntrack is still tricky.</p><p>In this blog post we showed how it's possible to test parts of conntrack with "unshare" and a series of scripts. We showed the behaviour when the conntrack table fills up - packets that create new flows may be implicitly dropped. Finally, we mentioned the curious case of SYN floods, where incorrectly applied conntrack may cause harm.</p><p>Stay tuned for more horror stories as we dig deeper and deeper into the Linux networking stack guts.</p>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">13hrLdB4ySqi2j6KpGDkBy</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[A cost-effective and extensible testbed for transport protocol development]]></title>
            <link>https://blog.cloudflare.com/a-cost-effective-and-extensible-testbed-for-transport-protocol-development/</link>
            <pubDate>Tue, 14 Jan 2020 16:07:15 GMT</pubDate>
            <description><![CDATA[ At Cloudflare, we develop protocols at multiple layers of the network stack. In the past, we focused on HTTP/1.1, HTTP/2, and TLS 1.3. Now, we are working on QUIC and HTTP/3, which are still in IETF draft, but gaining a lot of interest. ]]></description>
            <content:encoded><![CDATA[ <p><i>This was originally published on </i><a href="https://calendar.perfplanet.com/2019/how-to-develop-a-practical-transport-protocol/"><i>Perf Planet's 2019 Web Performance Calendar</i></a><i>.</i></p><p>At Cloudflare, we develop protocols at multiple layers of the network stack. In the past, we focused on HTTP/1.1, HTTP/2, and TLS 1.3. Now, we are working on <a href="/http3-the-past-present-and-future/">QUIC and HTTP/3</a>, which are still in IETF draft, but gaining a lot of interest.</p><p>QUIC is a secure and multiplexed transport protocol that aims to perform better than TCP under some network conditions. It is specified in a family of documents: a transport layer which specifies packet format and basic state machine, recovery and congestion control, security based on TLS 1.3, and an HTTP application layer mapping, which is now called <a href="https://www.cloudflare.com/learning/performance/what-is-http3/">HTTP/3</a>.</p><p>Let’s focus on the transport and recovery layer first. This layer provides a basis for what is sent on the wire (the packet binary format) and how we send it reliably. It includes how to open the connection, how to handshake a new secure session with the help of TLS, how to send data reliably and how to react when there is packet loss or reordering of packets. It also includes flow control and congestion control to interact well with other transport protocols in the same network. With confidence in the basic transport and recovery layer, we can take a look at higher application layers such as HTTP/3.</p><p>To develop such a transport protocol, we need multiple stages of the development environment. Since this is a network protocol, it’s best to test in an actual physical network to see how it works on the wire. We may start the development using localhost, but after some time we may want to send and receive packets with other hosts. 
We can build a lab with a couple of virtual machines, using VirtualBox, VMware or even Docker. We also have a local testing environment with a Linux VM. But sometimes these have a limited network (localhost only), or are noisy due to other processes on the same host or in other virtual machines.</p><p>The next step is to have a test lab, typically an isolated network focused on protocol analysis only, consisting of dedicated x86 hosts. Lab configuration is particularly important for testing various cases - there is no one-size-fits-all scenario for protocol testing. For example, EDGE is still running in production mobile networks, but LTE is dominant and 5G deployment is in its early stages. WiFi is very common these days. We want to test our protocol in all those environments. Of course, we can't buy every type of machine or have a very expensive network simulator for every type of environment, so using cheap hardware and an open source OS where we can configure similar environments is ideal.</p>
    <div>
      <h2>The QUIC Protocol Testing lab</h2>
      <a href="#the-quic-protocol-testing-lab">
        
      </a>
    </div>
    <p>The goal of the QUIC testing lab is to aid transport layer protocol development. To develop a transport protocol we need to have a way to control our network environment and a way to get as many different types of debugging data as possible. Also we need to get metrics for comparison with other protocols in production.</p><p>The QUIC Testing Lab has the following goals:</p><ul><li><p><b><i>Help with multiple transport protocol development</i></b>: Developing a new transport layer requires many iterations, from building and validating packets as per protocol spec, to making sure everything works fine under moderate load, to very harsh conditions such as low bandwidth and high packet loss. We need a way to run tests with various network conditions reproducibly in order to catch unexpected issues.</p></li><li><p><b><i>Debugging multiple transport protocol development</i></b>: Recording as much debugging info as we can is important for fixing bugs. Looking into packet captures definitely helps but we also need a detailed debugging log of the server and client to understand the what and why for each packet. For example, when a packet is sent, we want to know why. Is this because there is an application which wants to send some data? Or is this a retransmit of data previously known as lost? Or is this a loss probe which is not an actual packet loss but sent to see if the network is lossy?</p></li><li><p><b><i>Performance comparison between each protocol</i></b>: We want to understand the performance of a new protocol by comparison with existing protocols such as TCP, or with a previous version of the protocol under development. 
Also we want to test with varying parameters such as changing the congestion control mechanism, changing various timeouts, or changing the buffer sizes at various levels of the stack.</p></li><li><p><b><i>Finding a bottleneck or errors easily</i></b>: Running tests, we may see an unexpected error - a transfer that times out, ends with an error, or is corrupted on the client side. We need to make sure every test runs correctly, by comparing a checksum of the original file with what was actually downloaded, or by checking various error codes at the protocol or API level.</p></li></ul><p>A test lab built on separate hardware gives us the following benefits:</p><ul><li><p>We can configure the testing lab without public Internet access - safe and quiet.</p></li><li><p>Handy access to hardware and its console for maintenance purposes, or for adding or updating hardware.</p></li><li><p>We can try other CPU architectures. For clients we use the Raspberry Pi for regular testing because it uses the ARM architecture (32-bit or 64-bit), similar to modern smartphones. Testing on ARM helps with compatibility before moving to a smartphone OS.</p></li><li><p>We can add a real smartphone for testing, such as an Android phone or iPhone. We can test over WiFi, but these devices also support Ethernet, so we can test them on a wired network for better consistency.</p></li></ul>
    <div>
      <h2>Lab Configuration</h2>
      <a href="#lab-configuration">
        
      </a>
    </div>
    <p>Here is a diagram of our QUIC Protocol Testing Lab:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1O1CdD682XoQE9Q68bPVkh/a63bd6b8a1bafa516719cfcf0a82c033/Screenshot-2019-07-01-00.35.06.png" />
            
            </figure><p>This is a conceptual diagram and we need to configure a switch for connecting each machine. Currently, we have Raspberry Pis (2 and 3) as an Origin and a Client, and small Intel x86 boxes for the Traffic Shaper and Edge server, plus Ethernet switches for interconnectivity.</p><ul><li><p>Origin simply serves HTTP and HTTPS test objects using a web server. Client may download a file from Origin directly, to simulate a download straight from a customer's origin server.</p></li><li><p>Client downloads a test object from Origin or Edge, using different protocols. In a typical configuration, Client connects to Edge instead of Origin, simulating an edge server in the real world. For TCP/HTTP we are using the curl command line client and for QUIC, <a href="https://github.com/cloudflare/quiche">quiche’s</a> http3_client with some modification.</p></li><li><p>Edge is running Cloudflare's web server to serve HTTP/HTTPS via TCP and also the QUIC protocol using quiche. The Edge server is installed with the same Linux kernel used on Cloudflare's production machines, in order to have the same low-level network stack.</p></li><li><p>Traffic Shaper sits between Client and Edge (and Origin), controlling network conditions. Currently we are using FreeBSD and ipfw + dummynet. Traffic shaping can also be done using Linux's netem, which provides additional network simulation features.</p></li></ul><p>The goal is to run tests with various network conditions, such as bandwidth, latency and packet loss, upstream and downstream. The lab is able to run a plaintext HTTP test, but currently our focus is HTTPS over TCP and HTTP/3 over QUIC. Since QUIC runs over UDP, both TCP and UDP traffic need to be controlled.</p>
    <div>
      <h2>Test Automation and Visualization</h2>
      <a href="#test-automation-and-visualization">
        
      </a>
    </div>
    <p>In the lab, we have a script installed on Client which can run a batch of tests with various configuration parameters - for each test combination, we can define a test configuration, including:</p><ul><li><p>Network Condition - Bandwidth, Latency, Packet Loss (upstream and downstream)</p></li></ul><p>For example, using the netem traffic shaper we can simulate an LTE network as below (<a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">RTT</a>=50ms, BW=22Mbps upstream and downstream, with a BDP-sized queue):</p>
            <pre><code>$ tc qdisc add dev eth0 root handle 1:0 netem delay 25ms
$ tc qdisc add dev eth0 parent 1:1 handle 10: tbf rate 22mbit buffer 68750 limit 70000</code></pre>
            <ul><li><p>Test Object sizes - 1KB, 8KB, … 32MB</p></li><li><p>Test Protocols: HTTPS (TCP) and QUIC (UDP)</p></li><li><p>Number of runs and number of requests in a single connection</p></li></ul><p>The test script outputs a CSV file of results for importing into other tools for data processing and visualization - such as Google Sheets, Excel or even a Jupyter notebook. It can also post the results to a database (ClickHouse in our case), so we can query and visualize them.</p><p>Sometimes a whole test combination takes a long time - the current standard test set, with simulated 2G, 3G, LTE and WiFi conditions and various object sizes, repeated 10 times for each request, may take several hours to run. Large object testing on a slow network takes most of the time, so sometimes we also need to run a limited test (e.g. LTE-like conditions only, as a smoke test) for quick debugging.</p>
    <div>
      <h3>Chart using Google Sheets:</h3>
      <a href="#chart-using-google-sheets">
        
      </a>
    </div>
    <p>The following comparison chart shows the total transfer time in msec for TCP vs QUIC under different network conditions. The QUIC protocol used here is a development version.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2umJkG3YBHNnxyD2tJRHQA/c96ef4601d8d20c9760b45ad321e6135/Screen-Shot-2020-01-13-at-3.09.41-PM.png" />
            
            </figure>
    <div>
      <h2>Debugging and performance analysis using a smartphone</h2>
      <a href="#debugging-and-performance-analysis-using-of-a-smartphone">
        
      </a>
    </div>
    <p>Mobile devices have become a crucial part of our day-to-day life, so testing the new transport protocol on mobile devices is critically important for mobile app performance. To facilitate that, we need a mobile test app which will proxy data over the new transport protocol under development. With this we have the ability to analyze protocol functionality and performance on mobile devices under different network conditions.</p><p>Adding a smartphone to the testbed mentioned above gives an advantage in terms of understanding real performance issues. The major smartphone operating systems, iOS and Android, have quite different networking stacks. Adding a smartphone to the testbed gives us the ability to understand these operating system network stacks in depth, which aids new protocol designs.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/RLA6wO7vjRol9o6nlzj34/7c1a7904a379b1e8079853c35597173c/Screen-Shot-2020-01-13-at-3.52.03-PM.png" />
            
            </figure><p>The above figure shows the network block diagram of another, similar lab testbed used for protocol testing, where a smartphone is connected both wired and wirelessly. A Linux netem based traffic shaper sits in between the client and server, shaping the traffic. Various networking profiles are fed to the traffic shaper to mimic real-world scenarios. The client can be either an Android or iOS based smartphone; the server is a vanilla web server serving static files. Client, server and traffic shaper are all connected to the Internet, along with the private lab network for management purposes.</p><p>The above lab has mobile devices for both Android and iOS, installed with a test app built with proprietary client proxy software for proxying data over the new transport protocol under development. The test app also has the ability to make HTTP requests over TCP for comparison purposes.</p><p>The Android or iOS test app can be used to issue multiple HTTPS requests of different object sizes, sequentially and concurrently, using TCP and QUIC as the underlying transport protocol. Later, the TTOTAL (total transfer time) of each HTTPS request is used to compare TCP and QUIC performance over different network conditions. One such comparison is shown below:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Eh2Kl4C9RI40oKf8Z3CJY/a318afd23e177895137ff481bba2dfe1/Screen-Shot-2020-01-13-at-4.08.23-PM.png" />
            
            </figure><p>The table above shows the total transfer time taken for TCP and QUIC requests over an LTE network profile, fetching different objects with different concurrency levels using the test app. Here TCP goes over the native OS network stack and QUIC goes over the Cloudflare QUIC stack.</p><p>Debugging network performance issues is hard when it comes to mobile devices. By adding an actual smartphone into the testbed itself we have the ability to take packet captures at different layers. These are critical for analyzing and understanding protocol performance.</p><p>It's easy and straightforward to capture packets and analyze them using the tcpdump tool on x86 boxes, but it's a challenge to capture packets on iOS and Android devices. On iOS devices, ‘rvictl’ lets us capture packets on an external interface, but ‘rvictl’ has some drawbacks, such as inaccurate timestamps. Since we are dealing with millisecond-level events, timestamps need to be accurate to analyze the root cause of a problem.</p><p>We can capture packets on internal loopback interfaces on jailbroken iPhones and rooted Android devices. Jailbreaking a recent iOS device is nontrivial. We also need to make sure that auto-update of any sort is disabled on such a phone, otherwise it would undo the jailbreak and we would have to start the whole process again. With a jailbroken phone we have root access to the device, which lets us take packet captures as needed using tcpdump.</p><p>Packet captures taken using jailbroken iOS devices or rooted Android devices connected to the lab testbed help us analyze performance bottlenecks and improve protocol performance.</p><p>iOS and Android devices have different network stacks in their core operating systems. These packet captures also help us understand those network stacks; for example, on iOS devices, packets punted through the loopback interface had a mysterious delay of 5 to 7 ms.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Cloudflare is actively involved in helping to drive forward the QUIC and HTTP/3 standards by testing and optimizing these new protocols in simulated real-world environments. By simulating a wide variety of networks we are working on our mission of Helping Build a Better Internet. For everyone, everywhere.</p><p><i>We would like to thank SangJo Lee, Hiren Panchasara, Lucas Pardue and Sreeni Tellakula for their contributions.</i></p>
            <category><![CDATA[HTTP3]]></category>
            <category><![CDATA[QUIC]]></category>
            <category><![CDATA[TLS 1.3]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">58abpUfUPAE7n3X9TDOpyt</guid>
            <dc:creator>Lohith Bellad</dc:creator>
            <dc:creator>Junho Choi</dc:creator>
        </item>
        <item>
            <title><![CDATA[Accelerating UDP packet transmission for QUIC]]></title>
            <link>https://blog.cloudflare.com/accelerating-udp-packet-transmission-for-quic/</link>
            <pubDate>Wed, 08 Jan 2020 17:08:00 GMT</pubDate>
            <description><![CDATA[ Significant work has gone into optimizing TCP, UDP hasn't received as much attention, putting QUIC at a disadvantage. Let's explore a few tricks that help mitigate this. ]]></description>
            <content:encoded><![CDATA[ <p><i>This was originally published on </i><a href="https://calendar.perfplanet.com/2019/accelerating-udp-packet-transmission-for-quic/"><i>Perf Planet's 2019 Web Performance Calendar</i></a><i>.</i></p><p><a href="/the-road-to-quic/">QUIC</a>, the new Internet transport protocol designed to accelerate HTTP traffic, is delivered on top of <a href="https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/">UDP datagrams</a>, to ease deployment and avoid interference from network appliances that drop packets from unknown protocols. This also allows QUIC implementations to live in user-space, so that, for example, browsers will be able to implement new protocol features and ship them to their users without having to wait for operating system updates.</p><p>But while a lot of work has gone into optimizing TCP implementations as much as possible over the years, including building offloading capabilities in both software (like in operating systems) and hardware (like in network interfaces), UDP hasn't received quite as much attention as TCP, which puts QUIC at a disadvantage. In this post we'll look at a few tricks that help mitigate this disadvantage for UDP, and by association QUIC.</p><p>For the purpose of this blog post we will only be concentrating on measuring the throughput of QUIC connections, which, while necessary, is not enough to paint an accurate overall picture of the performance of the QUIC protocol (or its implementations) as a whole.</p>
    <div>
      <h3>Test Environment</h3>
      <a href="#test-environment">
        
      </a>
    </div>
    <p>The client used in the measurements is h2load, <a href="https://github.com/nghttp2/nghttp2/tree/quic">built with QUIC and HTTP/3 support</a>, while the server is NGINX, built with <a href="/experiment-with-http-3-using-nginx-and-quiche/">the open-source QUIC and HTTP/3 module provided by Cloudflare</a> which is based on quiche (<a href="https://github.com/cloudflare/quiche">github.com/cloudflare/quiche</a>), Cloudflare's own <a href="/enjoy-a-slice-of-quic-and-rust/">open-source implementation of QUIC and HTTP/3</a>.</p><p>The client and server are run on the same host (my laptop) running Linux 5.3, so the numbers don’t necessarily reflect what one would see in a production environment over a real network, but it should still be interesting to see how much of an impact each of the techniques have.</p>
    <div>
      <h3>Baseline</h3>
      <a href="#baseline">
        
      </a>
    </div>
    <p>Currently the code that implements QUIC in NGINX uses the <code>sendmsg()</code> system call to send a single UDP packet at a time.</p>
            <pre><code>ssize_t sendmsg(int sockfd, const struct msghdr *msg,
    int flags);</code></pre>
            <p>The <code>struct msghdr</code> carries a <code>struct iovec</code> which can in turn carry multiple buffers. However, all of the buffers within a single iovec will be merged together into a single UDP datagram during transmission. The kernel will then take care of encapsulating the buffer in a UDP packet and sending it over the wire.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/774r0FpU47qMQ5bbxIOPL5/3560c09b55949e3c406ad958498e7fd4/sendmsg.png" />
            
            </figure><p>The throughput of this particular implementation tops out at around 80-90 MB/s, as measured by h2load when performing 10 sequential requests for a 100 MB resource.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4V8JtNxw1ogFryQ7xBVcgk/74bf3551df4837143a4cb2dbc53e3f84/sendmsg-chart.png" />
            
            </figure>
    <div>
      <h3>sendmmsg()</h3>
      <a href="#sendmmsg">
        
      </a>
    </div>
    <p>Due to the fact that <code>sendmsg()</code> only sends a single UDP packet at a time, it needs to be invoked quite a lot in order to transmit all of the QUIC packets required to deliver the requested resources, as illustrated by the following bpftrace command:</p>
            <pre><code>% sudo bpftrace -p $(pgrep nginx) -e 'tracepoint:syscalls:sys_enter_sendm* { @[probe] = count(); }'
Attaching 2 probes...
 
 
@[tracepoint:syscalls:sys_enter_sendmsg]: 904539</code></pre>
            <p>Each of those system calls causes an expensive context switch between the application and the kernel, thus impacting throughput.</p><p>But while <code>sendmsg()</code> only transmits a single UDP packet at a time for each invocation, its close cousin <code>sendmmsg()</code> (note the additional “m” in the name) is able to batch multiple packets per system call:</p>
            <pre><code>int sendmmsg(int sockfd, struct mmsghdr *msgvec,
    unsigned int vlen, int flags);</code></pre>
            <p>Multiple <code>struct mmsghdr</code> structures can be passed to the kernel as an array, each in turn carrying a single <code>struct msghdr</code> with its own <code>struct iovec</code> , with each element in the <code>msgvec</code> array representing a single UDP datagram.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5QO4EdUwmU9MJhSCY2wwC1/ac6bc505e3b1f3b4d571910f13905131/sendmmsg.png" />
            
            </figure><p>Let's see what happens when NGINX is updated to use <code>sendmmsg()</code> to send QUIC packets:</p>
            <pre><code>% sudo bpftrace -p $(pgrep nginx) -e 'tracepoint:syscalls:sys_enter_sendm* { @[probe] = count(); }'
Attaching 2 probes...
 
 
@[tracepoint:syscalls:sys_enter_sendmsg]: 2437
@[tracepoint:syscalls:sys_enter_sendmmsg]: 15676</code></pre>
            <p>The number of system calls went down dramatically, which translates into an increase in throughput, though not quite as big as the decrease in syscalls:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6fv3ZaxYLZmteczFchjfiO/e0056c40dacea0422ed53edaf8158869/sendmmsg-chart.png" />
            
            </figure>
    <div>
      <h3>UDP segmentation offload</h3>
      <a href="#udp-segmentation-offload">
        
      </a>
    </div>
    <p>With <code>sendmsg()</code> as well as <code>sendmmsg()</code>, the application is responsible for separating each QUIC packet into its own buffer in order for the kernel to be able to transmit it. The implementation in NGINX uses static buffers, so there is no overhead in allocating them, but all of these buffers still need to be traversed by the kernel during transmission, which can add significant overhead.</p><p>Linux supports a feature, Generic Segmentation Offload (GSO), which allows the application to pass a single "super buffer" to the kernel, which will then take care of segmenting it into smaller packets. The kernel will try to postpone the segmentation as much as possible to reduce the overhead of traversing outgoing buffers (some NICs even support hardware segmentation, but it was not tested in this experiment due to lack of capable hardware). Originally GSO was only supported for TCP, but support for UDP GSO was recently added as well, in Linux 4.18.</p><p>This feature can be controlled using the <code>UDP_SEGMENT</code> socket option:</p>
            <pre><code>setsockopt(fd, SOL_UDP, UDP_SEGMENT, &amp;gso_size, sizeof(gso_size)))</code></pre>
            <p>As well as via ancillary data, to control segmentation for each <code>sendmsg()</code> call:</p>
            <pre><code>cm = CMSG_FIRSTHDR(&amp;msg);
cm-&gt;cmsg_level = SOL_UDP;
cm-&gt;cmsg_type = UDP_SEGMENT;
cm-&gt;cmsg_len = CMSG_LEN(sizeof(uint16_t));
*((uint16_t *) CMSG_DATA(cm)) = gso_size;</code></pre>
            <p>Where <code>gso_size</code> is the size of each segment forming the "super buffer" passed to the kernel from the application. Once configured, the application can provide one contiguous large buffer containing a number of packets of <code>gso_size</code> length (as well as a final smaller packet), which will then be segmented by the kernel (or the NIC if hardware segmentation offloading is supported and enabled).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3vQo11I0RupCQ4msUqj0Ve/dcec6ac6c0bea7c737aa9fa822e69d0a/sendmsg-gso.png" />
            
            </figure><p><a href="https://github.com/torvalds/linux/blob/80a0c2e511a97e11d82e0ec11564e2c3fe624b0d/include/linux/udp.h#L94">Up to 64 segments</a> can be batched with the <code>UDP_SEGMENT</code> option.</p><p>GSO with plain <code>sendmsg()</code> already delivers a significant improvement:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4q2OxEsgZcsw2JXc8JcAfk/a64c9b48cad41378122e7d7c5a88e67a/gso-chart.png" />
            
            </figure><p>And indeed the number of syscalls also went down significantly compared to plain <code>sendmsg()</code>:</p>
            <pre><code>% sudo bpftrace -p $(pgrep nginx) -e 'tracepoint:syscalls:sys_enter_sendm* { @[probe] = count(); }'
Attaching 2 probes...
 
 
@[tracepoint:syscalls:sys_enter_sendmsg]: 18824</code></pre>
            <p>GSO can also be combined with <code>sendmmsg()</code> to deliver an even bigger improvement. The idea is that each <code>struct msghdr</code> can be segmented in the kernel by setting the <code>UDP_SEGMENT</code> option using ancillary data, allowing an application to pass multiple “super buffers”, each carrying up to 64 segments, to the kernel in a single system call.</p><p>The improvement is again fairly significant:</p>
    <div>
      <h3>Evolving from AFAP</h3>
      <a href="#evolving-from-afap">
        
      </a>
    </div>
    <p>Transmitting packets as fast as possible is easy to reason about, and there's much fun to be had in optimizing applications for that, but in practice this is not always the best strategy when optimizing protocols for the Internet.</p><p>Bursty traffic is more likely to cause or be affected by congestion on any given network path, which will inevitably defeat any optimization implemented to increase transmission rates.</p><p>Packet pacing is an effective technique to squeeze more performance out of a network flow. The idea is that adding a short delay between outgoing packets smooths out bursty traffic and reduces the chance of congestion and packet loss. For TCP this was originally implemented in Linux via the fq packet scheduler, and later by the BBR congestion control algorithm implementation, which implements its own pacer.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1PkaZcKDkzzjUDhFLRT1jw/a4247010827f1763bd7560894e30938e/afap.png" />
            
            </figure><p>Due to the nature of current QUIC implementations, which reside entirely in user-space, pacing of QUIC packets conflicts with any of the techniques explored in this post, because pacing each packet separately during transmission will prevent any batching on the application side, and in turn batching will prevent pacing, as batched packets will be transmitted as fast as possible once received by the kernel.</p><p>However, Linux provides some facilities to offload the pacing to the kernel and give back some control to the application:</p><ul><li><p><b>SO_MAX_PACING_RATE</b>: an application can define this socket option to instruct the fq packet scheduler to pace outgoing packets up to the given rate. This works for UDP sockets as well, but it is yet to be seen how this can be integrated with QUIC, as a single UDP socket can be used for multiple QUIC connections (unlike TCP, where each connection has its own socket). In addition, this is not very flexible, and might not be ideal when implementing the BBR pacer.</p></li><li><p><b>SO_TXTIME / SCM_TXTIME</b>: an application can use these options to schedule transmission of specific packets at specific times, essentially instructing fq to delay packets until the provided timestamp is reached. This gives the application a lot more control, and can be easily integrated into <code>sendmsg()</code> as well as <code>sendmmsg()</code>. But it does not yet support specifying different times for each packet when GSO is used, as there is no way to define multiple timestamps for packets that need to be segmented (each segmented packet essentially ends up being sent at the same time anyway).</p></li></ul><p>While the performance gains achieved by using the techniques illustrated here are fairly significant, there are still open questions around how any of this will work with pacing, so more experimentation is required.</p> ]]></content:encoded>
            <category><![CDATA[QUIC]]></category>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[HTTP3]]></category>
            <guid isPermaLink="false">3pwKBhG2s8cT4COiXLHTyT</guid>
            <dc:creator>Alessandro Ghedini</dc:creator>
        </item>
        <item>
            <title><![CDATA[When TCP sockets refuse to die]]></title>
            <link>https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/</link>
            <pubDate>Fri, 20 Sep 2019 15:53:33 GMT</pubDate>
            <description><![CDATA[ We noticed something weird - the TCP sockets which we thought should have been closed - were lingering around. We realized we don't really understand when TCP sockets are supposed to time out!

We naively thought enabling TCP keepalives would be enough... but it isn't! ]]></description>
            <content:encoded><![CDATA[ <p>While working on our <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum server</a>, we noticed something weird: the TCP sockets which we thought should have been closed were lingering around. We realized we don't really understand when TCP sockets are supposed to time out!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4J7NyByY5rLMwGCjildxuX/a80e344de39529860fe89230fff4259c/Tcp_state_diagram_fixed_new.svga.png" />
            
            </figure><p><a href="https://commons.wikimedia.org/wiki/File:Tcp_state_diagram_fixed_new.svg">Image</a> by Sergiodc2 CC BY SA 3.0</p><p>In our code, we wanted to make sure we don't hold connections to dead hosts. In our early code we naively thought enabling TCP keepalives would be enough... but it isn't. It turns out a fairly modern <a href="https://tools.ietf.org/html/rfc5482">TCP_USER_TIMEOUT</a> socket option is equally important. Furthermore, it interacts with TCP keepalives in subtle ways. <a href="http://codearcana.com/posts/2015/08/28/tcp-keepalive-is-a-lie.html">Many people</a> are confused by this.</p><p>In this blog post, we'll try to show how these options work. We'll show how a TCP socket can time out during various stages of its lifetime, and how TCP keepalives and user timeout influence that. To better illustrate the internals of TCP connections, we'll mix the outputs of the <code>tcpdump</code> and the <code>ss -o</code> commands. This nicely shows the transmitted packets and the changing parameters of the TCP connections.</p>
    <div>
      <h2>SYN-SENT</h2>
      <a href="#syn-sent">
        
      </a>
    </div>
    <p>Let's start from the simplest case - what happens when one attempts to establish a connection to a server which discards inbound SYN packets?</p><p>The scripts used here <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2019-09-tcp-keepalives">are available on our GitHub</a>.</p><p><code>$ sudo ./test-syn-sent.py
# all packets dropped
00:00.000 IP host.2 &gt; host.1: Flags [S] # initial SYN

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-SENT 0      1      host:2     host:1    timer:(on,940ms,0)

00:01.028 IP host.2 &gt; host.1: Flags [S] # first retry
00:03.044 IP host.2 &gt; host.1: Flags [S] # second retry
00:07.236 IP host.2 &gt; host.1: Flags [S] # third retry
00:15.427 IP host.2 &gt; host.1: Flags [S] # fourth retry
00:31.560 IP host.2 &gt; host.1: Flags [S] # fifth retry
01:04.324 IP host.2 &gt; host.1: Flags [S] # sixth retry
02:10.000 connect ETIMEDOUT</code></p><p>Ok, this was easy. After the <code>connect()</code> syscall, the operating system sends a SYN packet. Since it didn't get any response the OS will by default retry sending it 6 times. This can be tweaked by the sysctl:</p><p><code>$ sysctl net.ipv4.tcp_syn_retries
net.ipv4.tcp_syn_retries = 6</code></p><p>It's possible to override this setting per-socket with the TCP_SYNCNT setsockopt:</p><p><code>int syncnt = 6;
setsockopt(sd, IPPROTO_TCP, TCP_SYNCNT, &amp;syncnt, sizeof(syncnt));</code></p><p>The retries are staggered at 1s, 3s, 7s, 15s, 31s, 63s marks (the inter-retry time starts at 2s and then doubles each time). By default, the whole process takes 130 seconds, until the kernel gives up with the ETIMEDOUT errno. At this moment in the lifetime of a connection, SO_KEEPALIVE settings are ignored, but TCP_USER_TIMEOUT is not. For example, setting it to 5000ms will cause the following interaction:</p><p><code>$ sudo ./test-syn-sent.py 5000
# all packets dropped
00:00.000 IP host.2 &gt; host.1: Flags [S] # initial SYN

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-SENT 0      1      host:2     host:1    timer:(on,996ms,0)

00:01.016 IP host.2 &gt; host.1: Flags [S] # first retry
00:03.032 IP host.2 &gt; host.1: Flags [S] # second retry
00:05.016 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.024 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.036 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.044 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.050 connect ETIMEDOUT</code></p><p>Even though we set the user-timeout to 5s, we still saw six SYN retries on the wire. This behaviour is probably a bug (as tested on a 5.2 kernel): we would expect only two retries to be sent - at the 1s and 3s marks - and the socket to expire at the 5s mark. Instead, we saw those two retries, plus four further retransmitted SYN packets aligned to the 5s mark - which makes no sense. Anyhow, we learned a thing - TCP_USER_TIMEOUT does affect the behaviour of <code>connect()</code>.</p>
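For illustration, both knobs can be set from an application before calling <code>connect()</code>. A minimal Python sketch (these constants are Linux-only; the post's snippets use C, but the calls map one-to-one):

```python
import socket

sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Give up after 2 SYN retries instead of the default 6 (Linux-only)...
sd.setsockopt(socket.IPPROTO_TCP, socket.TCP_SYNCNT, 2)

# ...and/or cap the whole connection attempt via TCP_USER_TIMEOUT,
# which is specified in milliseconds.
sd.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 5000)

print(sd.getsockopt(socket.IPPROTO_TCP, socket.TCP_SYNCNT))        # 2
print(sd.getsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT))  # 5000
```

With these settings a `connect()` to a black-holed address fails with ETIMEDOUT after a few seconds instead of the default 130.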
    <div>
      <h2>SYN-RECV</h2>
      <a href="#syn-recv">
        
      </a>
    </div>
    <p>SYN-RECV sockets are usually hidden from the application. They live as mini-sockets on the SYN queue. We wrote about <a href="/syn-packet-handling-in-the-wild/">the SYN and Accept queues in the past</a>. Sometimes, when SYN cookies are enabled, the sockets may skip the SYN-RECV state altogether.</p><p>In SYN-RECV state, the socket will retry sending SYN+ACK 5 times as controlled by:</p><p><code>$ sysctl net.ipv4.tcp_synack_retries
net.ipv4.tcp_synack_retries = 5</code></p><p>Here is how it looks on the wire:</p><p><code>$ sudo ./test-syn-recv.py
00:00.000 IP host.2 &gt; host.1: Flags [S]
# all subsequent packets dropped
00:00.000 IP host.1 &gt; host.2: Flags [S.] # initial SYN+ACK

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,996ms,0)

00:01.033 IP host.1 &gt; host.2: Flags [S.] # first retry
00:03.045 IP host.1 &gt; host.2: Flags [S.] # second retry
00:07.301 IP host.1 &gt; host.2: Flags [S.] # third retry
00:15.493 IP host.1 &gt; host.2: Flags [S.] # fourth retry
00:31.621 IP host.1 &gt; host.2: Flags [S.] # fifth retry
01:04.610 SYN-RECV disappears</code></p><p>With default settings, the SYN+ACK is re-transmitted at the 1s, 3s, 7s, 15s, 31s marks, and the SYN-RECV socket disappears at the 64s mark.</p><p>Neither SO_KEEPALIVE nor TCP_USER_TIMEOUT affect the lifetime of SYN-RECV sockets.</p>
    <div>
      <h2>Final handshake ACK</h2>
      <a href="#final-handshake-ack">
        
      </a>
    </div>
    <p>After receiving the second packet in the TCP handshake - the SYN+ACK - the client socket moves to an ESTABLISHED state. The server socket remains in SYN-RECV until it receives the final ACK packet.</p><p>Losing this ACK doesn't change anything - the server socket will just take a bit longer to move from SYN-RECV to ESTAB. Here is how it looks:</p><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.] # initial ACK, dropped

State    Recv-Q Send-Q Local:Port  Peer:Port
SYN-RECV 0      0      host:1      host:2 timer:(on,1sec,0)
ESTAB    0      0      host:2      host:1

00:01.014 IP host.1 &gt; host.2: Flags [S.]
00:01.014 IP host.2 &gt; host.1: Flags [.]  # retried ACK, dropped

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,1.012ms,1)
ESTAB    0      0      host:2     host:1</code></p><p>As you can see, SYN-RECV has the "on" timer, the same as in the example before. We might argue this final ACK doesn't really carry much weight. This thinking led to the development of the TCP_DEFER_ACCEPT feature - it basically causes the third ACK to be silently dropped. With this flag set, the socket remains in SYN-RECV state until it receives the first packet with actual data:</p><p><code>$ sudo ./test-syn-ack.py
00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.] # delivered, but the socket stays as SYN-RECV

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,7.192ms,0)
ESTAB    0      0      host:2     host:1

00:08.020 IP host.2 &gt; host.1: Flags [P.], length 11  # payload moves the socket to ESTAB

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 11     0      host:1     host:2
ESTAB 0      0      host:2     host:1</code></p><p>The server socket remained in the SYN-RECV state even after receiving the final TCP-handshake ACK. It has a funny "on" timer, with the counter stuck at 0 retries. It is converted to ESTAB - and moved from the SYN to the accept queue - after the client sends a data packet or after the TCP_DEFER_ACCEPT timer expires. Basically, with DEFER ACCEPT the SYN-RECV mini-socket <a href="https://marc.info/?l=linux-netdev&amp;m=118793048828251&amp;w=2">discards the data-less inbound ACK</a>.</p>
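For reference, here is a hedged Python sketch of enabling deferred accept on a listening socket (<code>socket.TCP_DEFER_ACCEPT</code> is available on Linux); the option value is the maximum time, in seconds, to hold the mini-socket in SYN-RECV waiting for data:

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))

# Keep the connection in SYN-RECV until data arrives, for up to ~5s.
srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_DEFER_ACCEPT, 5)
srv.listen(16)

# The kernel rounds the timeout up to SYN+ACK retransmission boundaries,
# so the value read back may be larger than the one we set.
deferred = srv.getsockopt(socket.IPPROTO_TCP, socket.TCP_DEFER_ACCEPT)
print(deferred)
```

Note the round-trip through `getsockopt()`: because the kernel internally stores the timeout as a number of SYN+ACK retransmissions, asking for 5 seconds may read back as 7.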
    <div>
      <h2>Idle ESTAB is forever</h2>
      <a href="#idle-estab-is-forever">
        
      </a>
    </div>
    <p>Let's move on and discuss a fully-established socket connected to an unhealthy (dead) peer. After completion of the handshake, the sockets on both sides move to the ESTABLISHED state, like:</p><p><code>State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:2     host:1
ESTAB 0      0      host:1     host:2</code></p><p>These sockets have no running timer by default - they will remain in that state forever, even if the communication is broken. The TCP stack will notice problems only when one side attempts to send something. This raises a question - what to do if you don't plan on sending any data over a connection? How do you make sure an idle connection is healthy, without sending any data over it?</p><p>This is where TCP keepalives come in. Let's see it in action - in this example we used the following toggles:</p><ul><li><p>SO_KEEPALIVE = 1 - Let's enable keepalives.</p></li><li><p>TCP_KEEPIDLE = 5 - Send first keepalive probe after 5 seconds of idleness.</p></li><li><p>TCP_KEEPINTVL = 3 - Send subsequent keepalive probes after 3 seconds.</p></li><li><p>TCP_KEEPCNT = 3 - Time out after three failed probes.</p></li></ul><p><code>$ sudo ./test-idle.py
00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.]

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:1     host:2
ESTAB 0      0      host:2     host:1  timer:(keepalive,2.992ms,0)

# all subsequent packets dropped
00:05.083 IP host.2 &gt; host.1: Flags [.], ack 1 # first keepalive probe
00:08.155 IP host.2 &gt; host.1: Flags [.], ack 1 # second keepalive probe
00:11.231 IP host.2 &gt; host.1: Flags [.], ack 1 # third keepalive probe
00:14.299 IP host.2 &gt; host.1: Flags [R.], seq 1, ack 1</code></p><p>Indeed! We can clearly see the first probe sent at the 5s mark and the two remaining probes 3s apart - exactly as we specified. After a total of three sent probes, and a further three seconds of delay, the connection dies with ETIMEDOUT, and the final RST is transmitted.</p><p>For keepalives to work, the send buffer must be empty. You can notice the keepalive timer active in the "timer:(keepalive)" line.</p>
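The four toggles above map one-to-one onto <code>setsockopt()</code> calls. A minimal Python sketch (the <code>TCP_KEEP*</code> constants are Linux-specific):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)    # enable keepalives
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)   # first probe after 5s idle
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 3)  # then probe every 3s
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)    # give up after 3 failed probes

# Worst case, a dead idle peer is detected after 5 + 3 * 3 = 14 seconds.
```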
    <div>
      <h2>Keepalives with TCP_USER_TIMEOUT are confusing</h2>
      <a href="#keepalives-with-tcp_user_timeout-are-confusing">
        
      </a>
    </div>
    <p>We mentioned the TCP_USER_TIMEOUT option before. It sets the maximum amount of time that transmitted data may remain unacknowledged before the kernel forcefully closes the connection. On its own, it doesn't do much in the case of idle connections. The sockets will remain ESTABLISHED even if the connectivity is dropped. However, this socket option does change the semantics of TCP keepalives. <a href="https://linux.die.net/man/7/tcp">The tcp(7) manpage</a> is somewhat confusing:</p><p><i>Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will override keepalive to determine when to close a connection due to keepalive failure.</i></p><p>The original commit message has slightly more detail:</p><ul><li><p><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=dca43c75e7e545694a9dd6288553f55c53e2a3a3">tcp: Add TCP_USER_TIMEOUT socket option</a></p></li></ul><p>To understand the semantics, we need to look at the <a href="https://github.com/torvalds/linux/blob/b41dae061bbd722b9d7fa828f35d22035b218e18/net/ipv4/tcp_timer.c#L693-L697">kernel code in linux/net/ipv4/tcp_timer.c:693</a>:</p><p><code>if ((icsk-&gt;icsk_user_timeout != 0 &amp;&amp;
elapsed &gt;= msecs_to_jiffies(icsk-&gt;icsk_user_timeout) &amp;&amp;
icsk-&gt;icsk_probes_out &gt; 0) ||</code></p><p>For the user timeout to have any effect, the <code>icsk_probes_out</code> must not be zero. The check for user timeout is done only <i>after</i> the first probe went out. Let's check it out. Our connection settings:</p><ul><li><p>TCP_USER_TIMEOUT = 5*1000 - 5 seconds</p></li><li><p>SO_KEEPALIVE = 1 - enable keepalives</p></li><li><p>TCP_KEEPIDLE = 1 - send first probe quickly - 1 second idle</p></li><li><p>TCP_KEEPINTVL = 11 - subsequent probes every 11 seconds</p></li><li><p>TCP_KEEPCNT = 3 - send three probes before timing out</p></li></ul><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.]

# all subsequent packets dropped
00:01.001 IP host.2 &gt; host.1: Flags [.], ack 1 # first probe
00:12.233 IP host.2 &gt; host.1: Flags [R.] # timer for second probe fired, socket aborted due to TCP_USER_TIMEOUT</code></p><p>So what happened? The connection sent the first keepalive probe at the 1s mark. Seeing no response the TCP stack then woke up 11 seconds later to send a second probe. This time though, it executed the USER_TIMEOUT code path, which decided to terminate the connection immediately.</p><p>What if we bump TCP_USER_TIMEOUT to larger values, say between the second and third probe? Then, the connection will be closed on the third probe timer. With TCP_USER_TIMEOUT set to 12.5s:</p><p><code>00:01.022 IP host.2 &gt; host.1: Flags [.] # first probe
00:12.094 IP host.2 &gt; host.1: Flags [.] # second probe
00:23.102 IP host.2 &gt; host.1: Flags [R.] # timer for third probe fired, socket aborted due to TCP_USER_TIMEOUT</code></p><p>We’ve shown how TCP_USER_TIMEOUT interacts with keepalives for small and medium values. The last case is when TCP_USER_TIMEOUT is extraordinarily large. Say we set it to 30s:</p><p><code>00:01.027 IP host.2 &gt; host.1: Flags [.], ack 1 # first probe
00:12.195 IP host.2 &gt; host.1: Flags [.], ack 1 # second probe
00:23.207 IP host.2 &gt; host.1: Flags [.], ack 1 # third probe
00:34.211 IP host.2 &gt; host.1: Flags [.], ack 1 # fourth probe! But TCP_KEEPCNT was only 3!
00:45.219 IP host.2 &gt; host.1: Flags [.], ack 1 # fifth probe!
00:56.227 IP host.2 &gt; host.1: Flags [.], ack 1 # sixth probe!
01:07.235 IP host.2 &gt; host.1: Flags [R.], seq 1 # TCP_USER_TIMEOUT aborts conn on 7th probe timer</code></p><p>We saw six keepalive probes on the wire! With TCP_USER_TIMEOUT set, the TCP_KEEPCNT is totally ignored. If you want TCP_KEEPCNT to make sense, the only sensible USER_TIMEOUT value is slightly smaller than:</p>
            <pre><code>TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT</code></pre>
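Putting the two together, a hedged Python sketch that derives the user timeout from the keepalive settings, so that TCP_KEEPCNT remains meaningful:

```python
import socket

KEEPIDLE, KEEPINTVL, KEEPCNT = 5, 3, 3  # seconds, seconds, probes

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, KEEPIDLE)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, KEEPINTVL)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, KEEPCNT)

# Slightly below KEEPIDLE + KEEPINTVL * KEEPCNT (here 14s), converted
# to milliseconds, so the keepalive probe count still decides when the
# connection is aborted.
user_timeout_ms = (KEEPIDLE + KEEPINTVL * KEEPCNT) * 1000 - 500
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, user_timeout_ms)
print(user_timeout_ms)  # 13500
```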
            
    <div>
      <h2>Busy ESTAB socket is not forever</h2>
      <a href="#busy-estab-socket-is-not-forever">
        
      </a>
    </div>
    <p>Thus far we have discussed the case where the connection is idle. Different rules apply when the connection has unacknowledged data in a send buffer.</p><p>Let's prepare another experiment - after the three-way handshake, let's set up a firewall to drop all packets. Then, let's do a <code>send</code> on one end to have some dropped packets in-flight. An experiment shows the sending socket dies after ~16 minutes:</p><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.]

# All subsequent packets dropped
00:00.206 IP host.2 &gt; host.1: Flags [P.], length 11 # first data packet
00:00.412 IP host.2 &gt; host.1: Flags [P.], length 11 # early retransmit, doesn't count
00:00.620 IP host.2 &gt; host.1: Flags [P.], length 11 # 1st retry
00:01.048 IP host.2 &gt; host.1: Flags [P.], length 11 # 2nd retry
00:01.880 IP host.2 &gt; host.1: Flags [P.], length 11 # 3rd retry

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:1     host:2
ESTAB 0      11     host:2     host:1    timer:(on,1.304ms,3)

00:03.543 IP host.2 &gt; host.1: Flags [P.], length 11 # 4th
00:07.000 IP host.2 &gt; host.1: Flags [P.], length 11 # 5th
00:13.656 IP host.2 &gt; host.1: Flags [P.], length 11 # 6th
00:26.968 IP host.2 &gt; host.1: Flags [P.], length 11 # 7th
00:54.616 IP host.2 &gt; host.1: Flags [P.], length 11 # 8th
01:47.868 IP host.2 &gt; host.1: Flags [P.], length 11 # 9th
03:34.360 IP host.2 &gt; host.1: Flags [P.], length 11 # 10th
05:35.192 IP host.2 &gt; host.1: Flags [P.], length 11 # 11th
07:36.024 IP host.2 &gt; host.1: Flags [P.], length 11 # 12th
09:36.855 IP host.2 &gt; host.1: Flags [P.], length 11 # 13th
11:37.692 IP host.2 &gt; host.1: Flags [P.], length 11 # 14th
13:38.524 IP host.2 &gt; host.1: Flags [P.], length 11 # 15th
15:39.500 connection ETIMEDOUT</code></p><p>The data packet is retransmitted 15 times, as controlled by:</p><p><code>$ sysctl net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15</code></p><p>From the <a href="https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt"><code>ip-sysctl.txt</code></a> documentation:</p><p><i>The default value of 15 yields a hypothetical timeout of 924.6 seconds and is a lower bound for the effective timeout. TCP will effectively time out at the first RTO which exceeds the hypothetical timeout.</i></p><p>The connection indeed died at ~940 seconds. Notice the socket has the "on" timer running. It doesn't matter at all if we set SO_KEEPALIVE - when the "on" timer is running, keepalives are not engaged.</p><p>TCP_USER_TIMEOUT keeps on working though. The connection will be aborted <i>exactly</i> when the user-timeout specified time has elapsed since the last received packet. With the user timeout set, the <code>tcp_retries2</code> value is ignored.</p>
    <div>
      <h2>Zero window ESTAB is... forever?</h2>
      <a href="#zero-window-estab-is-forever">
        
      </a>
    </div>
    <p>There is one final case worth mentioning. If the sender has plenty of data, and the receiver is slow, then TCP flow control kicks in. At some point the receiver will ask the sender to stop transmitting new data. This is a slightly different condition than the one described above.</p><p>In this case, with flow control engaged, there is no in-flight or unacknowledged data. Instead the receiver throttles the sender with a "zero window" notification. Then the sender periodically checks if the condition is still valid with "window probes". In this experiment we reduced the receive buffer size for simplicity. Here's how it looks on the wire:</p><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.], win 1152
00:00.000 IP host.2 &gt; host.1: Flags [.]</code></p><p><code>00:00.202 IP host.2 &gt; host.1: Flags [.], length 576 # first data packet
00:00.202 IP host.1 &gt; host.2: Flags [.], ack 577, win 576
00:00.202 IP host.2 &gt; host.1: Flags [P.], length 576 # second data packet
00:00.244 IP host.1 &gt; host.2: Flags [.], ack 1153, win 0 # throttle it! zero-window</code></p><p><code>00:00.456 IP host.2 &gt; host.1: Flags [.], ack 1 # zero-window probe
00:00.456 IP host.1 &gt; host.2: Flags [.], ack 1153, win 0 # nope, still zero-window</code></p><p><code>State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 1152   0      host:1     host:2
ESTAB 0      129920 host:2     host:1  timer:(persist,048ms,0)</code></p><p>The packet capture shows a couple of things. First, we can see two packets with data, each 576 bytes long. Both were immediately acknowledged. The second ACK carried a "win 0" notification: the sender was told to stop sending data.</p><p>But the sender is eager to send more! The last two packets show a first "window probe": the sender will periodically send payload-less "ack" packets to check if the window size has changed. As long as the receiver keeps on answering, the sender will keep on sending such probes forever.</p><p>The socket information shows three important things:</p><ul><li><p>The read buffer of the reader is filled - thus the "zero window" throttling is expected.</p></li><li><p>The write buffer of the sender is filled - we have more data to send.</p></li><li><p>The sender has a "persist" timer running, counting the time until the next "window probe".</p></li></ul><p>In this blog post we are interested in timeouts - what will happen if the window probes are lost? Will the sender notice?</p><p>By default, the window probe is retried 15 times - adhering to the usual <code>tcp_retries2</code> setting.</p><p>The TCP timer is in <code>persist</code> state, so the TCP keepalives will <i>not</i> be running. The SO_KEEPALIVE settings don't make any difference when window probing is engaged.</p><p>As expected, the TCP_USER_TIMEOUT toggle keeps on working. A slight difference is that, similarly to user-timeout on keepalives, it's engaged only when the retransmission timer fires. During such an event, if more than user-timeout seconds have passed since the last good packet, the connection will be aborted.</p>
    <div>
      <h2>Note about using application timeouts</h2>
      <a href="#note-about-using-application-timeouts">
        
      </a>
    </div>
    <p>In the past we have shared an interesting war story:</p><ul><li><p><a href="/the-curious-case-of-slow-downloads/">The curious case of slow downloads</a></p></li></ul><p>Our HTTP server gave up on the connection after an application-managed timeout fired. This was a bug - a slow connection might have correctly slowly drained the send buffer, but the application server didn't notice that.</p><p>We abruptly dropped slow downloads, even though this wasn't our intention. We just wanted to make sure the client connection was still healthy. It would be better to use TCP_USER_TIMEOUT than rely on application-managed timeouts.</p><p>But this is not sufficient. We also wanted to guard against a situation where a client stream is valid, but is stuck and doesn't drain the connection. The only way to achieve this is to periodically check the amount of unsent data in the send buffer, and see if it shrinks at a desired pace.</p><p>For typical applications sending data to the Internet, I would recommend:</p><ol><li><p>Enable TCP keepalives. This is needed to keep some data flowing in the idle-connection case.</p></li><li><p>Set TCP_USER_TIMEOUT to <code>TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT</code>.</p></li><li><p>Be careful when using application-managed timeouts. To detect TCP failures use TCP keepalives and user-timeout. If you want to spare resources and make sure sockets don't stay alive for too long, consider periodically checking if the socket is draining at the desired pace. You can use <code>ioctl(TIOCOUTQ)</code> for that, but it counts both data buffered (notsent) on the socket and in-flight (unacknowledged) bytes. A better way is to use TCP_INFO tcpi_notsent_bytes parameter, which reports only the former counter.</p></li></ol><p>An example of checking the draining pace:</p><p><code>while True:
    notsent1 = get_tcp_info(c).tcpi_notsent_bytes
    notsent1_ts = time.time()
    ...
    poll.poll(POLL_PERIOD)
    ...
    notsent2 = get_tcp_info(c).tcpi_notsent_bytes
    notsent2_ts = time.time()
    pace_in_bytes_per_second = (notsent1 - notsent2) / (notsent2_ts - notsent1_ts)
    if pace_in_bytes_per_second &gt; 12000:
        pass  # pace is above effective rate of 96Kbps, ok!
    else:
        pass  # socket is too slow...</code></p><p>There are ways to further improve this logic. We could use <a href="https://lwn.net/Articles/560082/"><code>TCP_NOTSENT_LOWAT</code></a>, although it's generally only useful for situations where the send buffer is relatively empty. Then we could use the <a href="https://www.kernel.org/doc/Documentation/networking/timestamping.txt"><code>SO_TIMESTAMPING</code></a> interface for notifications about when data gets delivered. Finally, if we are done sending the data to the socket, it's possible to just call <code>close()</code> and defer handling of the socket to the operating system. Such a socket will be stuck in FIN-WAIT-1 or LAST-ACK state until it correctly drains.</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>In this post we discussed five cases where the TCP connection may notice the other party going away:</p><ul><li><p>SYN-SENT: The duration of this state can be controlled by <code>TCP_SYNCNT</code> or <code>tcp_syn_retries</code>.</p></li><li><p>SYN-RECV: It's usually hidden from the application. It is tuned by <code>tcp_synack_retries</code>.</p></li><li><p>An idle ESTABLISHED connection will never notice any issues. A solution is to use TCP keepalives.</p></li><li><p>A busy ESTABLISHED connection adheres to the <code>tcp_retries2</code> setting, and ignores TCP keepalives.</p></li><li><p>A zero-window ESTABLISHED connection adheres to the <code>tcp_retries2</code> setting, and ignores TCP keepalives.</p></li></ul><p>Especially the last two ESTABLISHED cases can be customized with TCP_USER_TIMEOUT, but this setting also affects other situations. Generally speaking, it can be thought of as a hint to the kernel to abort the connection after so-many seconds since the last good packet. This is a dangerous setting though, and if used in conjunction with TCP keepalives it should be set to a value slightly lower than <code>TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT</code>. Otherwise it will affect, and potentially cancel out, the TCP_KEEPCNT value.</p><p>In this post we presented scripts showing the effects of timeout-related socket options under various network conditions. Interleaving the <code>tcpdump</code> packet capture with the output of <code>ss -o</code> is a great way of understanding the networking stack. We were able to create reproducible test cases showing the "on", "keepalive" and "persist" timers in action. This is a very useful framework for further experimentation.</p><p>Finally, it's surprisingly hard to tune a TCP connection to be confident that the remote host is actually up. During our debugging we found that looking at the send buffer size and currently active TCP timer can be very helpful in understanding whether the socket is actually healthy. 
The bug in our Spectrum application turned out to be a wrong TCP_USER_TIMEOUT setting - without it sockets with large send buffers were lingering around for way longer than we intended.</p><p>The scripts used in this article <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2019-09-tcp-keepalives">can be found on our GitHub</a>.</p><p>Figuring this out has been a collaboration across three Cloudflare offices. Thanks to <a href="https://twitter.com/Hirenpanchasara">Hiren Panchasara</a> from San Jose, <a href="https://twitter.com/warrncn">Warren Nelson</a> from Austin and <a href="https://twitter.com/jkbs0">Jakub Sitnicki</a> from Warsaw. Fancy joining the team? <a href="https://www.cloudflare.com/careers/departments/?utm_referrer=blog">Apply here!</a></p> ]]></content:encoded>
            <category><![CDATA[SYN]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Spectrum]]></category>
            <category><![CDATA[Tech Talks]]></category>
            <guid isPermaLink="false">PTYUwpDIf4wDZ50CejAvL</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Magic Transit: Network functions at Cloudflare scale]]></title>
            <link>https://blog.cloudflare.com/magic-transit-network-functions/</link>
            <pubDate>Tue, 13 Aug 2019 13:00:00 GMT</pubDate>
            <description><![CDATA[ Today we announced Cloudflare Magic Transit, which makes Cloudflare’s network available to any IP traffic on the Internet. Up until now, Cloudflare has primarily operated proxy services: our servers terminate HTTP, TCP, and UDP sessions ]]></description>
            <content:encoded><![CDATA[ <p>Today we announced <a href="https://www.cloudflare.com/magic-transit">Cloudflare Magic Transit</a>, which makes Cloudflare’s network available to any IP traffic on the Internet. Up until now, Cloudflare has primarily operated proxy services: our servers terminate HTTP, TCP, and UDP sessions with Internet users and pass that data through new sessions they create with origin servers. With Magic Transit, we are now also operating at the IP layer: in addition to terminating sessions, our servers are applying a suite of network functions (DoS mitigation, firewalling, routing, and so on) on a packet-by-packet basis.</p><p>Over the past nine years, we’ve built a robust, scalable global network that currently spans 193 cities in over 90 countries and is ever growing. All Cloudflare customers benefit from this scale thanks to two important techniques. The first is anycast networking. Cloudflare was an <a href="/a-brief-anycast-primer/">early adopter</a> of anycast, using this routing technique to distribute Internet traffic across our data centers. It means that any data center can handle any customer’s traffic, and we can spin up new data centers without needing to acquire and provision new IP addresses. The second technique is homogeneous server architecture. Every server in each of our edge data centers is <a href="/cloudflare-architecture-and-how-bpf-eats-the-world/">capable of running every task</a>. We build our servers on commodity hardware, making it easy to quickly increase our processing capacity by adding new servers to existing data centers. Having no specialty hardware to depend on has also led us to develop an expertise in pushing the limits of what’s possible in networking using modern Linux kernel techniques.</p><p>Magic Transit is built on the same network using the same techniques, meaning our customers can now run their network functions at Cloudflare scale. 
Our fast, secure, reliable global edge becomes our customers’ edge. To explore how this works, let’s follow the journey of a packet from a user on the Internet to a Magic Transit customer’s network.</p>
    <div>
      <h2>Putting our DoS mitigation to work… for you!</h2>
      <a href="#putting-our-dos-mitigation-to-work-for-you">
        
      </a>
    </div>
    <p>In the <a href="/magic-transit/">announcement blog post</a> we describe an example deployment for Acme Corp. Let’s continue with this example here. When Acme brings their IP prefix 203.0.113.0/24 to Cloudflare, we start announcing that prefix to our transit providers, peers, and to Internet exchanges in each of our data centers around the globe. Additionally, Acme stops announcing the prefix to their own ISPs. This means that any IP packet on the Internet with a destination address within Acme’s prefix is delivered to a nearby Cloudflare data center, not to Acme’s router.</p><p>Let’s say I want to access Acme’s FTP server on 203.0.113.100 from my computer in Cloudflare’s office in Champaign, IL. My computer generates a TCP SYN packet with destination address 203.0.113.100 and sends it out to the Internet. Thanks to <a href="https://www.cloudflare.com/learning/cdn/glossary/anycast-network/">anycast</a>, that packet ends up at Cloudflare’s data center in Chicago, which is the closest data center (in terms of Internet routing distance) to Champaign. The packet arrives on the data center’s router, which uses <a href="https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing">ECMP (Equal Cost Multi-Path) routing</a> to select which server should handle the packet and dispatches the packet to the selected server.</p><p>Once at the server, the packet flows through our <a href="/cloudflare-architecture-and-how-bpf-eats-the-world/">XDP- and iptables-based</a> DoS detection and mitigation functions. If this TCP SYN packet were determined to be part of an attack, it would be dropped and that would be the end of it. Fortunately for me, the packet is permitted to pass.</p><p>So far, this looks exactly like any other traffic on Cloudflare’s network. 
Because of our expertise in running a global anycast network we’re able to attract Magic Transit customer traffic to every data center and apply the <a href="/tag/ddos/">same DoS mitigation solution</a> that has been protecting Cloudflare for years. Our DoS solution has handled some of the largest attacks ever recorded, including a 942Gbps SYN flood in 2018. Below is a screenshot of a recent SYN flood of 300M packets per second. <a href="/how-cloudflares-architecture-allows-us-to-scale-to-stop-the-largest-attacks/">Our architecture lets us scale</a> to stop the largest attacks.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/BeJqqpkQ9uUUFLWFPjYe9/1c241da7ad7018a32c277d3bb3dfaee7/large-syn-flood.png" />
            
            </figure>
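            <p>If you want to experiment with ECMP dispatch like the data center router performs above, similar behavior can be sketched on a Linux box with a multipath route. The next-hop addresses below are purely illustrative (they are not our production configuration), and the <code>fib_multipath_hash_policy</code> sysctl enables L4 (5-tuple) hashing so that packets belonging to the same flow consistently pick the same next hop:</p>
            <pre><code># Hash on the 5-tuple so each flow consistently lands on one next hop
$ sudo sysctl -w net.ipv4.fib_multipath_hash_policy=1

# Spread traffic for a prefix across two equal-cost next hops (illustrative addresses)
$ sudo ip route add 203.0.113.0/24 \
    nexthop via 192.0.2.10 dev eth0 weight 1 \
    nexthop via 192.0.2.11 dev eth0 weight 1</code></pre>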
    <div>
      <h2>Network namespaces for isolation and control</h2>
      <a href="#network-namespaces-for-isolation-and-control">
        
      </a>
    </div>
    <p>The above looked identical to how all other Cloudflare traffic is processed, but this is where the similarities end. For our other services, the TCP SYN packet would now be dispatched to a local proxy process (e.g. our nginx-based HTTP/S stack). For Magic Transit, we instead want to dynamically provision and apply customer-defined network functions like firewalls and routing. We needed a way to quickly spin up and configure these network functions while also providing inter-network isolation. For that, we turned to network namespaces.</p><p>Namespaces are a collection of Linux kernel features for creating lightweight virtual instances of system resources that can be shared among a group of processes. Namespaces are a fundamental building block for containerization in Linux. Notably, Docker is built on Linux namespaces. A <i>network namespace</i> is an isolated instance of the Linux network stack, including its own network interfaces (with their own eBPF hooks), routing tables, netfilter configuration, and so on. Network namespaces give us a low-cost mechanism to rapidly apply customer-defined network configurations in isolation, all with built-in Linux kernel features so there’s no performance hit from userspace packet forwarding or proxying.</p><p>When a new customer starts using Magic Transit, we create a brand new network namespace for that customer on every server across our edge network (did I mention that every server can run every task?). We built a daemon that runs on our servers and is responsible for managing these network namespaces and their configurations. This daemon is constantly reading configuration updates from <a href="/helping-to-build-cloudflare-part-4/">Quicksilver</a>, our globally distributed key-value store, and applying customer-defined configurations for firewalls, routing, etc, <i>inside the customer’s namespace</i>. 
For example, if Acme wants to provision a firewall rule to allow FTP traffic (TCP ports 20 and 21) to 203.0.113.100, that configuration is propagated globally through Quicksilver and the Magic Transit daemon applies the firewall rule by adding an nftables rule to the Acme customer namespace:</p>
            <pre><code># Apply nftables rule inside Acme’s namespace
$ sudo ip netns exec acme_namespace nft add rule inet filter prerouting ip daddr 203.0.113.100 tcp dport 20-21 accept</code></pre>
            <p>Getting the customer’s traffic to their network namespace requires a little routing configuration in the default network namespace. When a network namespace is created, a pair of virtual ethernet (veth) interfaces is also created: one in the default namespace and one in the newly created namespace. This interface pair creates a “virtual wire” for delivering network traffic into and out of the new network namespace. In the default network namespace, we maintain a routing table that forwards Magic Transit customer IP prefixes to the veths corresponding to those customers’ namespaces. We use iptables to mark the packets that are destined for Magic Transit customer prefixes, and we have a routing rule that specifies that these specially marked packets should use the Magic Transit routing table.</p><p>(Why go to the trouble of marking packets in iptables and maintaining a separate routing table? Isolation. By keeping Magic Transit routing configurations separate we reduce the risk of accidentally modifying the default routing table in a way that affects how non-Magic Transit traffic flows through our edge.)</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7qBuqejDrTQ3QFlEZ5IRzk/e7cdbc6f6d3c14a6a8dfe8d9d76abbb3/067C8C9E-3C73-4DA0-AD77-35DAC9A5F182.png" />
            
            </figure><p>Network namespaces provide a lightweight environment where a Magic Transit customer can run and manage network functions in isolation, letting us put full control in the customer’s hands.</p>
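            <p>As a rough sketch of the plumbing described above (the interface names, mark value, and table number here are illustrative, not our production configuration), the namespace, veth pair, packet marking, and dedicated routing table could be reproduced with standard iproute2 and iptables commands:</p>
            <pre><code># Create the customer namespace with a veth pair bridging into it
$ sudo ip netns add acme_namespace
$ sudo ip link add veth-acme type veth peer name veth0 netns acme_namespace

# Mark packets destined for Acme's prefix...
$ sudo iptables -t mangle -A PREROUTING -d 203.0.113.0/24 -j MARK --set-mark 100

# ...and send marked packets through a separate Magic Transit routing table
# that forwards them over the "virtual wire" into the namespace
$ sudo ip rule add fwmark 100 lookup 100
$ sudo ip route add 203.0.113.0/24 dev veth-acme table 100</code></pre>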
    <div>
      <h2>GRE + anycast = magic</h2>
      <a href="#gre-anycast-magic">
        
      </a>
    </div>
    <p>After passing through the edge network functions, the TCP SYN packet is finally ready to be delivered back to the customer’s network infrastructure. Because Acme Corp. does not have a network footprint in a colocation facility with Cloudflare, we need to deliver their network traffic over the public Internet.</p><p>This poses a problem. The destination address of the TCP SYN packet is 203.0.113.100, but the only network announcing the IP prefix 203.0.113.0/24 on the Internet is Cloudflare. This means that we can’t simply forward this packet out to the Internet—it will boomerang right back to us! In order to deliver this packet to Acme we need to use a technique called tunneling.</p><p>Tunneling is a method of carrying traffic from one network over another network. In our case, it involves encapsulating Acme’s IP packets inside of IP packets that can be delivered to Acme’s router over the Internet. There are a number of <a href="https://en.wikipedia.org/wiki/Tunneling_protocol#Common_tunneling_protocols">common tunneling protocols</a>, but <a href="https://en.wikipedia.org/wiki/Generic_Routing_Encapsulation">Generic Routing Encapsulation</a> (GRE) is often used for its simplicity and widespread vendor support.</p><p>GRE tunnel endpoints are configured both on Cloudflare’s servers (inside of Acme’s network namespace) and on Acme’s router. Cloudflare servers then encapsulate IP packets destined for 203.0.113.0/24 inside of IP packets destined for a publicly-routable IP address for Acme’s router, which decapsulates the packets and emits them into Acme’s internal network.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6I9uQQIPjqBE6hGYsIyPsF/61e810c12480ad0ee7267f5354e1f510/97ED9A08-71AA-490B-BD35-1E6D288EE1A3.png" />
            
            </figure><p>Now, I’ve omitted an important detail in the diagram above: the IP address of Cloudflare’s side of the GRE tunnel. Configuring a GRE tunnel requires specifying an IP address for each side, and the outer IP header for packets sent over the tunnel must use these specific addresses. But Cloudflare has thousands of servers, each of which may need to deliver packets to the customer through a tunnel. So how many Cloudflare IP addresses (and GRE tunnels) does the customer need to talk to? The answer: just one, thanks to the magic of anycast.</p><p>Cloudflare uses anycast IP addresses for our GRE tunnel endpoints, meaning that any server in any data center is capable of encapsulating and decapsulating packets for the same GRE tunnel. <i>How is this possible? Isn’t a tunnel a point-to-point link?</i> The GRE protocol itself is stateless—each packet is processed independently and without requiring any negotiation or coordination between tunnel endpoints. While the tunnel is technically bound to an <i>IP address</i> it need not be bound to a <i>specific device</i>. Any device that can strip off the outer headers and then route the inner packet can handle any GRE packet sent over the tunnel. Actually, in the context of anycast the term “tunnel” is misleading since it implies a link between two fixed points. With Cloudflare’s Anycast GRE, a single “tunnel” gives you a conduit to every server in every data center on Cloudflare’s global edge.</p>
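            <p>On the customer's side, that single endpoint is just an ordinary GRE tunnel. On a Linux router, the configuration might look like the following sketch, where 192.0.2.1 stands in for Acme's publicly-routable router address and 198.51.100.4 for the Cloudflare anycast tunnel address (both addresses are illustrative):</p>
            <pre><code># Bring up a GRE tunnel to the Cloudflare anycast endpoint (illustrative addresses)
$ sudo ip tunnel add cf-gre mode gre local 192.0.2.1 remote 198.51.100.4 ttl 255
$ sudo ip link set cf-gre up

# Decapsulated packets arrive on cf-gre; outbound Internet traffic is
# routed back through the tunnel to be delivered via Cloudflare
$ sudo ip route add default dev cf-gre</code></pre>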
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1422XZlV6HM7wgqKUYrBVx/4c7ddd86b938510b08dcdc8d43f9953f/228C64E9-DBE0-45AE-ACA9-8F1931998D4A.png" />
            
            </figure><p>One very powerful consequence of Anycast GRE is that it eliminates single points of failure. Traditionally, GRE-over-Internet can be problematic because an Internet outage between the two GRE endpoints fully breaks the “tunnel”. This means reliable data delivery requires going through the headache of setting up and maintaining redundant GRE tunnels terminating at different physical sites and rerouting traffic when one of the tunnels breaks. But because Cloudflare is encapsulating and delivering customer traffic from every server in every data center, there is no single “tunnel” to break. This means Magic Transit customers can enjoy the redundancy and reliability of terminating tunnels at multiple physical sites while only setting up and maintaining a single GRE endpoint, making their jobs simpler.</p>
    <div>
      <h2>Our scale is now your scale</h2>
      <a href="#our-scale-is-now-your-scale">
        
      </a>
    </div>
    <p>Magic Transit is a powerful new way to deploy network functions at scale. We’re not just giving you a virtual instance, we’re giving you a <i>global virtual edge</i>. Magic Transit takes the hardware appliances you would typically rack in your on-prem network and distributes them across every server in every data center in Cloudflare’s network. This gives you access to our global anycast network, our fleet of servers capable of running <i>your</i> tasks, and our engineering expertise building fast, reliable, secure networks. Our scale is now your scale.</p> ]]></content:encoded>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Magic Transit]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Anycast]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">7vElLpZoHQk7fk7D087Jvq</guid>
            <dc:creator>Nick Wondra</dc:creator>
        </item>
    </channel>
</rss>