
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Thu, 02 Apr 2026 06:40:05 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Lost in transit: debugging dropped packets from negative header lengths]]></title>
            <link>https://blog.cloudflare.com/lost-in-transit-debugging-dropped-packets-from-negative-header-lengths/</link>
            <pubDate>Mon, 26 Jun 2023 13:00:56 GMT</pubDate>
            <description><![CDATA[ In this post, we'll provide some insight into the process of investigating networking issues and how to begin debugging issues in the kernel using pwru and kprobe tracepoints ]]></description>
            <content:encoded><![CDATA[ <p>Previously, I wrote about building <a href="/high-availability-load-balancers-with-maglev/">network load balancers with the maglev scheduler</a>, which we use for ingress into our Kubernetes clusters. At the time of that post we were using Foo-over-UDP encapsulation with virtual interfaces, one for each Internet Protocol version for each worker node.</p><p>To reduce operational toil managing the traffic director nodes, we've recently switched to using IP Virtual Server's (IPVS) native support for encapsulation. Much to our surprise, instead of a smooth transition, we observed significant drops in bandwidth and failing API requests. In this post, I'll discuss the impact we observed, the multi-week search for the root cause, and the ultimate fix.</p>
    <div>
      <h3>Recap and the change</h3>
      <a href="#recap-and-the-change">
        
      </a>
    </div>
    <p>To support our requirements, we had been creating virtual interfaces on our traffic directors, configured to encapsulate traffic with Foo-over-UDP (FOU). In this encapsulation, new UDP and IP headers are added to the original packet. When the worker node receives this packet, the kernel removes the outer headers and injects the inner packet back into the network stack. Each virtual interface was assigned a private IP, and IPVS was configured to send traffic to these private IPs in "direct" mode.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/166RG9mgcpaP80xGDieN46/a25fe61233073af4b3132591bb379353/image4-24.png" />
            
            </figure><p>This configuration presents several problems for our operations teams.</p><p>For example, each Kubernetes worker node needs a separate virtual interface on the traffic director, and each interface requires its own private IP. The pairs of virtual interfaces and private IPs were only used by this system, but still needed to be tracked in our configuration management system. To ensure the interfaces were created and configured properly on each director, we had to run complex health checks, which added to the lag between provisioning a new worker node and the director being ready to send it traffic. Finally, the header for FOU also lacks a way to signal the "next protocol" of the inner packet, requiring a separate virtual interface for IPv4 and IPv6. Each of these problems individually contributed a small amount of toil, but taken together, they gave us the impetus to find a better alternative.</p><p>In the time since we had originally implemented this system, IPVS has added native support for encapsulation. This would allow us to stop provisioning virtual interfaces (and their corresponding private IPs) and to use newly provisioned workers without delay.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4CLprcnKtcCYzpCCnDkjl2/97ba996da2c9ee81c95ec2e97ed08ff9/image1-40.png" />
            
            </figure><p>IPVS doesn't support Foo-over-UDP as an encapsulation type, however. As part of this project, we switched to a similar option, <a href="https://datatracker.ietf.org/doc/html/draft-ietf-intarea-gue-09">Generic UDP Encapsulation</a> (GUE). This encapsulation does include a "next protocol" field, allowing one listener on the worker nodes to support both IPv4 and IPv6.</p>
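<p>As a back-of-the-envelope sketch of what this encapsulation costs per packet, assuming IPv4 outer headers with no options and the minimal 4-byte GUE base header:</p>

```python
# Sketch of the per-packet overhead GUE-over-IPv4 adds, using the
# standard minimum header sizes (no IP options, base GUE header).
IPV4_HEADER = 20  # outer IPv4 header
UDP_HEADER = 8    # outer UDP header
GUE_HEADER = 4    # base GUE header, which carries the next-protocol field

def gue_encapsulated_len(inner_len):
    """Length of an inner packet after GUE-over-IPv4 encapsulation."""
    return inner_len + IPV4_HEADER + UDP_HEADER + GUE_HEADER

# A 1512-byte inner packet grows to 1544 bytes, matching the growth
# seen in the pwru traces later in this post.
print(gue_encapsulated_len(1512))  # 1544
```

<p>Those 32 bytes of overhead are why a packet can be under the MTU on ingress and over it after encapsulation.</p>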
    <div>
      <h3>What went wrong?</h3>
      <a href="#what-went-wrong">
        
      </a>
    </div>
    <p>When we make changes to our Kubernetes clusters, we go through several layers of testing. This change had been deployed to the traffic directors in front of our staging environments, where it had been running for several weeks. However, the type of traffic in our staging environment never triggered the underlying bug.</p><p>We began a slow rollout of this change to one production cluster, and after a few hours we began observing issues reaching services behind Kubernetes Load Balancers. The observed behavior was very interesting: the vast majority of requests had no issues, but a small percentage of requests with large HTTP request payloads or gRPC messages saw significant latency. However, large responses had no corresponding latency increase. There was no corresponding latency increase for requests to our ingress controllers, though we did observe a small drop in overall requests per second.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/LhkHSdh6ronJadcny3OGI/6a5d6ab9108296646e70932763e423af/image6-9.png" />
            
            </figure><p>Through debugging after the incident, we discovered that the traffic directors were dropping packets that exceeded the Internet maximum transmission unit (MTU) of 1500 bytes, despite the packets being smaller than the actual MTU configured in our internal networks. Once dropped, the original host would fragment and resend packets until they were small enough to pass through the traffic directors. Dropping one packet is inconvenient, but unlikely to be noticed. However, when making a request with large payloads, it is very likely that all packets would be dropped and need to be individually fragmented and resent.</p><p>When every packet is dropped and has to be resent, the network latency can add up to several seconds, exceeding the request timeouts configured by applications. This would cause the request to fail, necessitating retries by applications. As more traffic directors were reconfigured, these retries were less likely to succeed, leading to slower processing and causing the backlog of queued work to increase.</p><p>As you can see, this small but consistent number of dropped packets cascaded into much larger problems. Once it became clear there was a problem, we reverted the traffic directors to their previous configuration, and this quickly restored traffic to previous levels. From this we knew that something about the change had caused the problem.</p>
    <div>
      <h3>Finding the culprit</h3>
      <a href="#finding-the-culprit">
        
      </a>
    </div>
    <p>With the symptoms identified, we set out to understand the root cause. Once we understood it, we could come up with a satisfactory solution.</p><p>Knowing the packets were larger than the Internet MTU, our first thought was that this was a misconfiguration of the machine in our configuration management tool. However, we found the interface MTUs were all set as expected, and there were no overriding MTUs in the routing table. We also found that sending packets from the director that exceeded the Internet MTU worked fine with no drops.</p><p>Cilium has developed a debugging tool, <a href="https://github.com/cilium/pwru">pwru</a> (short for "packet, where are you?"), that uses eBPF to aid in debugging the kernel's networking state. We used pwru in our staging environment and found where the packets were being dropped.</p>
            <pre><code>sudo pwru --output-skb --output-tuple --per-cpu-buffer 2097152 --output-file pwru.log</code></pre>
            <p>This captures tracing data for all packets that reach the traffic director, and saves the trace data to "pwru.log". There are filters built into <code>pwru</code> to select matching packets, but unfortunately they could not be used, as the packets were being modified by the encapsulation. Instead, we used grep afterwards to find a matching packet, and then filtered further based on the pointer in the first column.</p>
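<p>The grep-and-filter step can be sketched roughly as follows (an illustrative helper with abbreviated sample lines, not our actual tooling or full trace output):</p>

```python
# Sketch of the post-capture filtering step: group pwru trace lines by
# the skb pointer in the first column.
from collections import defaultdict

def group_by_skb(log_lines):
    """Group pwru trace lines by the skb pointer in the first column."""
    traces = defaultdict(list)
    for line in log_lines:
        fields = line.split()
        if fields and fields[0].startswith("0x"):
            traces[fields[0]].append(line)
    return traces

log = [
    "0xffff947712f34400 9 packet_rcv netns=4026531840 mtu=1600 len=1512",
    "0xffff947712f34400 9 skb_push netns=4026531840 mtu=1600 len=1512",
    "0xffffaaaabbbb0000 3 consume_skb netns=4026531840 mtu=1600 len=60",
]
traces = group_by_skb(log)
print(len(traces["0xffff947712f34400"]))  # 2
```
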
            <pre><code>0xffff947712f34400      9        [&lt;empty&gt;]               packet_rcv netns=4026531840 mark=0x0 ifindex=4 proto=8 mtu=1600 len=1512 172.70.2.6:49756-&gt;198.51.100.150:8000(tcp)
0xffff947712f34400      9        [&lt;empty&gt;]                 skb_push netns=4026531840 mark=0x0 ifindex=4 proto=8 mtu=1600 len=1512 172.70.2.6:49756-&gt;198.51.100.150:8000(tcp)
0xffff947712f34400      9        [&lt;empty&gt;]              consume_skb netns=4026531840 mark=0x0 ifindex=4 proto=8 mtu=1600 len=1512 172.70.2.6:49756-&gt;198.51.100.150:8000(tcp)
[ ... snip ... ]
0xffff947712f34400      9        [&lt;empty&gt;]         inet_gso_segment netns=4026531840 mark=0x0 ifindex=7 proto=8 mtu=1600 len=1544 172.70.4.34:33259-&gt;172.70.64.149:5501(udp)
0xffff947712f34400      9        [&lt;empty&gt;]        udp4_ufo_fragment netns=4026531840 mark=0x0 ifindex=7 proto=8 mtu=1600 len=1524 172.70.4.34:33259-&gt;172.70.64.149:5501(udp)
0xffff947712f34400      9        [&lt;empty&gt;]   skb_udp_tunnel_segment netns=4026531840 mark=0x0 ifindex=7 proto=8 mtu=1600 len=1524 172.70.4.34:33259-&gt;172.70.64.149:5501(udp)
0xffff947712f34400      9        [&lt;empty&gt;] kfree_skb_reason(SKB_DROP_REASON_NOT_SPECIFIED) netns=4026531840 mark=0x0 ifindex=7 proto=8 mtu=1600 len=1558 172.70.4.34:33259-&gt;172.70.64.149:5501(udp)</code></pre>
            <p>Looking at the lines logged for this packet, we can piece together what happened. We originally received a TCP packet for the load balancer IP. However, when the packet is dropped, it is a UDP packet destined for the worker's IP on the port we use for GUE. We can surmise that the packet was being processed and encapsulated by IPVS, and form a theory that it was being dropped on the egress path from the director node. We could also see that, when dropped, the packet was still smaller than the calculated MTU.</p><p>We can visualize this change by applying this information to our GUE encapsulation diagram shown earlier. The byte total of the encapsulated packet is 1544, matching the length <code>pwru</code> logged entering <code>inet_gso_segment</code> above.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4oBm7taQSl2U4UH3reulvJ/c0a00891b2e1e7c3228e3c13ab7546ac/image2-38.png" />
            
            </figure><p>The trace above tells us which kernel functions are entered, but not if or how control flow left them. This means we don't know from which function <code>kfree_skb_reason</code> was called. Fortunately, <code>pwru</code> can print a stack trace when functions are entered.</p>
            <pre><code>0xffff9868b7232e00 	19       	[pwru] kfree_skb_reason(SKB_DROP_REASON_NOT_SPECIFIED) netns=4026531840 mark=0x0 ifindex=7 proto=8 mtu=1600 len=1558 172.70.4.34:63336-&gt;172.70.72.206:5501(udp)
kfree_skb_reason
validate_xmit_skb
__dev_queue_xmit
ip_finish_output2
ip_vs_tunnel_xmit   	[ip_vs]
ip_vs_in_hook   [ip_vs]
nf_hook_slow
ip_local_deliver
ip_sublist_rcv_finish
ip_sublist_rcv
ip_list_rcv
__netif_receive_skb_list_core
netif_receive_skb_list_internal
napi_complete_done
mlx5e_napi_poll [mlx5_core]
__napi_poll
net_rx_action
__softirqentry_text_start
__irq_exit_rcu
common_interrupt
asm_common_interrupt
audit_filter_syscall
__audit_syscall_exit
syscall_exit_work
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe</code></pre>
            <p>This stack trace shows that <code>kfree_skb_reason</code> was called from within the <code>validate_xmit_skb</code> function, which in turn was called from <code>ip_vs_tunnel_xmit</code>. However, when looking at the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/core/dev.c?h=v6.1.32&amp;id=76ba310227d2490018c271f1ecabb6c0a3212eb0#n3660">implementation of validate_xmit_skb</a>, we find there are three paths to <code>kfree_skb</code>. How can we determine which path is taken?</p><p>In addition to the eBPF support used by pwru, Linux supports attaching dynamic tracepoints using <code>perf-probe</code>. After installing the kernel source code and debugging symbols, we can ask the <code>kprobe</code> mechanism which lines of <code>validate_xmit_skb</code> we can attach a dynamic tracepoint to, and it prints those line numbers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ddSe06sv8S2w4lAXfZGFi/729a4c2bfb990122bc8287005117c97a/image5-16.png" />
            
            </figure><p>Unfortunately, we can't create a tracepoint on the goto lines themselves, but we can attach tracepoints around them, using the context to determine how control flowed through the function. In addition to a line number, additional arguments can be specified to include local variables. The <code>skb</code> variable is a pointer to the structure that represents this packet; logging it lets us separate packets in case more than one is being processed at a time.</p>
            <pre><code>sudo perf probe --add 'validate_xmit_skb:17 skb'
sudo perf probe --add 'validate_xmit_skb:20 skb'
sudo perf probe --add 'validate_xmit_skb:24 skb'
sudo perf probe --add 'validate_xmit_skb:32 skb'</code></pre>
            <p>Access to these tracepoints could be recorded by using <code>perf-record</code> and providing the tracepoint names given when they were added.</p>
            <pre><code>sudo perf record -a -e 'probe:validate_xmit_skb_L17,probe:validate_xmit_skb_L20,probe:validate_xmit_skb_L24,probe:validate_xmit_skb_L32'</code></pre>
            <p>We then reran the tests so some packets would be logged, and used <code>perf-script</code> to read the generated "perf.data" file. When inspecting the output file, we found that all the dropped packets came from the control flow where <code>netif_needs_gso</code> succeeded (that is, from the goto on line 18). We continued to create and record tracepoints, following the failing control flow, until execution reached <a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/ipv4/udp_offload.c?h=v6.1.32&amp;id=76ba310227d2490018c271f1ecabb6c0a3212eb0#n15"><code>__skb_udp_tunnel_segment</code></a>.</p><p>When <code>netif_needs_gso</code> returns false, we do not see packet drops and no problems are reported. It returns true when the packet is larger than the maximum segment size (MSS) advertised by our peer in the connection handshake. For IPv4, the MSS is usually 40 bytes smaller than the MTU (though this can be adjusted by the application or system configuration). For most packets the traffic directors see, the MSS is 40 bytes less than the Internet MTU of 1500, or 1460.</p><p>The tracepoints in this function showed that control flow was leaving through the error case on line 33: the kernel was unable to allocate memory for the tunnel header. GUE was designed to have a minimal tunnel header, so this seemed suspicious. We updated the probe to also record the calculated <code>tnl_hlen</code> and reran the tests. To our surprise, the value recorded by the probes was "-2". Huh, somehow the encapsulated packet got smaller?</p>
            <pre><code>static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
    netdev_features_t features,
    struct sk_buff *(*gso_inner_segment)(struct sk_buff *skb,
                                         netdev_features_t features),
    __be16 new_protocol, bool is_ipv6)
{
    int tnl_hlen = skb_inner_mac_header(skb) - skb_transport_header(skb); // tunnel header length computed here
    bool remcsum, need_csum, offload_csum, gso_partial;
    struct sk_buff *segs = ERR_PTR(-EINVAL);
    struct udphdr *uh = udp_hdr(skb);
    u16 mac_offset = skb-&gt;mac_header;
    __be16 protocol = skb-&gt;protocol;
    u16 mac_len = skb-&gt;mac_len;
    int udp_offset, outer_hlen;
    __wsum partial;
    bool need_ipsec;

    if (unlikely(!pskb_may_pull(skb, tnl_hlen))) // the pull fails because tnl_hlen is negative
        goto out;</code></pre>
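<p>For reference, the MSS figures quoted above work out as follows (a sketch assuming option-less IPv4 and TCP headers):</p>

```python
# Why segmentation kicks in: the advertised MSS leaves room for the
# (option-less) IPv4 and TCP headers within the path MTU.
INTERNET_MTU = 1500
IPV4_HEADER = 20  # no IP options
TCP_HEADER = 20   # no TCP options

mss = INTERNET_MTU - IPV4_HEADER - TCP_HEADER
print(mss)  # 1460: larger payloads must be segmented before transmit
```
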
            
    <div>
      <h3>Ultimate fix</h3>
      <a href="#ultimate-fix">
        
      </a>
    </div>
    <p>At this point the kernel's behavior was a bit baffling: why would the tunnel header length be computed as a negative number? To answer this question, we added two more probes. The first was added to <code>ip_vs_in_hook</code>, a hook function that is called as packets enter and leave IPVS code. The second probe was added to <code>__dev_queue_xmit</code>, which is called when preparing to ask the hardware device to transmit the packet. In both of these probes we also logged some of the fields of the sk_buff struct by using the <code>"pointer-&gt;field"</code> syntax. These fields are offsets into the packet's data for the packet's headers, as well as corresponding offsets for the encapsulated headers.</p><ul><li><p>The <code>mac_header</code> and <code>inner_mac_header</code> fields are offsets to the packet's layer two header. For Ethernet this includes the MAC addresses for the frame, but also other metadata such as the EtherType field giving the type of the next protocol.</p></li><li><p>The <code>network_header</code> and <code>inner_network_header</code> fields are offsets to the packet's layer three header. For our purposes, this would be the header for IPv4 or IPv6.</p></li><li><p>The <code>transport_header</code> and <code>inner_transport_header</code> fields are offsets to the packet's layer four header, such as TCP, UDP, or ICMP.</p></li></ul>
            <pre><code>sudo perf probe -m ip_vs --add 'ip_vs_in_hook dev=skb-&gt;dev-&gt;name:string skb skb-&gt;inner_mac_header skb-&gt;inner_network_header skb-&gt;inner_transport_header skb-&gt;mac_header skb-&gt;network_header skb-&gt;transport_header skb-&gt;ipvs_property skb-&gt;len:u skb-&gt;data_len:u'
sudo perf probe --add '__dev_queue_xmit dev=skb-&gt;dev-&gt;name:string skb skb-&gt;inner_mac_header skb-&gt;inner_network_header skb-&gt;inner_transport_header skb-&gt;mac_header skb-&gt;network_header skb-&gt;transport_header skb-&gt;len:u skb-&gt;data_len:u'</code></pre>
            <p>Reviewing the tracepoints using <code>perf-script</code>, we notice something interesting about these values.</p>
            <pre><code>swapper     0 [030] 79090.376151:    probe:ip_vs_in_hook: (ffffffffc0feebe0) dev="vlan100" skb=0xffff9ca661e90200 inner_mac_header=0x0 inner_network_header=0x0 inner_transport_header=0x0 mac_header=0x44 network_header=0x52 transport_header=0x66 len=1512 data_len=12
swapper     0 [030] 79090.376153:    probe:ip_vs_in_hook: (ffffffffc0feebe0) dev="vlan100" skb=0xffff9ca661e90200 inner_mac_header=0x44 inner_network_header=0x52 inner_transport_header=0x66 mac_header=0x44 network_header=0x32 transport_header=0x46 len=1544 data_len=12
swapper     0 [030] 79090.376155: probe:__dev_queue_xmit: (ffffffff85070e60) dev="vlan100" skb=0xffff9ca661e90200 inner_mac_header=0x44 inner_network_header=0x52 inner_transport_header=0x66 mac_header=0x44 network_header=0x32 transport_header=0x46 len=1558 data_len=12</code></pre>
            <p>When the packet reaches <code>ip_vs_in_hook</code> on the way into IPVS, it only has outer packet headers. This makes sense, as the packet hasn't been encapsulated by IPVS yet. When the same hook is called again as the packet leaves IPVS, we see the encapsulation has already been completed. We also find that the outer MAC header and the inner MAC header are at the same offset. Knowing how the tunnel header length is calculated in <code>__skb_udp_tunnel_segment</code>, we can also see where "-2" comes from: the <code>inner_mac_header</code> is at offset 0x44, while the <code>transport_header</code> is at offset 0x46.</p><p>When packets pass through network interfaces, the code for the interface reserves some space for the MAC header. For example, on an Ethernet interface some space is reserved for the Ethernet headers. FOU and GUE do not use link-layer addressing like Ethernet, so no space needs to be reserved. When we were using virtual interfaces with FOU, this case was handled correctly: the inner MAC header offset was set to the same value as the inner network offset, effectively making the inner MAC header zero bytes long.</p><p>When using the encapsulation inside IPVS, this was not being accounted for, leaving the inner MAC header offset invalid. When packets did not need to be segmented, this was a harmless bug. When segmenting, however, the tunnel header size must be known so the header can be duplicated onto every segmented packet.</p><p>I created a patch to correct the MAC header offset in IPVS's encapsulation code. The correction to the inner header offsets needed to be duplicated for IPv4 and IPv6.</p>
            <pre><code>diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index c7652da78c88..9193e109e6b3 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -1207,6 +1207,7 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 	skb-&gt;transport_header = skb-&gt;network_header;
 
 	skb_set_inner_ipproto(skb, next_protocol);
+	skb_set_inner_mac_header(skb, skb_inner_network_offset(skb));
 
 	if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) {
 		bool check = false;
@@ -1349,6 +1350,7 @@ ip_vs_tunnel_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 	skb-&gt;transport_header = skb-&gt;network_header;
 
 	skb_set_inner_ipproto(skb, next_protocol);
+	skb_set_inner_mac_header(skb, skb_inner_network_offset(skb));
 
 	if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) {
 		bool check = false;</code></pre>
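<p>Plugging the offsets recorded by the probes into the kernel's tnl_hlen calculation shows both the bug and the effect of the patch; the values below are the hex offsets from the traces above:</p>

```python
# The hex offsets logged by the probes as the packet left IPVS.
transport_header = 0x46        # outer UDP header
inner_network_header = 0x52    # inner IPv4 header
inner_mac_header_buggy = 0x44  # left pointing at the outer MAC header

# tnl_hlen = skb_inner_mac_header(skb) - skb_transport_header(skb)
print(inner_mac_header_buggy - transport_header)  # -2, the bogus length

# The patch sets the inner MAC header to the inner network offset,
# making the inner MAC header zero bytes long.
inner_mac_header_fixed = inner_network_header
print(inner_mac_header_fixed - transport_header)  # 12 = UDP (8) + GUE (4)
```
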
            <p>When the patch was included in our kernel and deployed, the difference in end-to-end request latency was immediately noticeable. We also no longer observed packet drops for requests with large payloads.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4SVrX3EiwZFhiRRkN1BMar/0f450c536cc0e2f3f6535dc23eb0b37e/image7-8.png" />
            
            </figure><p>I've <a href="https://lore.kernel.org/all/20230609205842.2333727-1-terin@cloudflare.com/T/#u">submitted the Linux kernel patch upstream</a>, which is currently queued for inclusion in future releases of the kernel.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Through this post, we hope to have provided some insight into the process of investigating networking issues and how to begin debugging issues in the kernel using <code>pwru</code> and <code>kprobe</code> tracepoints. The investigation also led to an upstream Linux kernel patch, and allowed us to safely roll out IPVS's native encapsulation.</p> ]]></content:encoded>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">4XFiBPDirEDvUVaOc0IHH8</guid>
            <dc:creator>Terin Stock</dc:creator>
        </item>
        <item>
            <title><![CDATA[Kubectl with Cloudflare Zero Trust]]></title>
            <link>https://blog.cloudflare.com/kubectl-with-zero-trust/</link>
            <pubDate>Fri, 24 Jun 2022 14:08:51 GMT</pubDate>
            <description><![CDATA[ Using Cloudflare Zero Trust with Kubernetes to enable kubectl without SOCKS proxies ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6qFUylvIuAuqfStdkrEDvb/c289f67a78b01fc25c7ec2d910aa2f02/Proxyless-KubeCTL-Support.png" />
            
            </figure><p>Cloudflare is a heavy user of Kubernetes for engineering workloads: it's used to power the backend of our APIs, to handle batch-processing such as analytics aggregation and bot detection, and engineering tools such as our <a href="https://www.cloudflare.com/learning/serverless/glossary/what-is-ci-cd/">CI/CD pipelines</a>. But between load balancers, API servers, etcd, ingresses, and pods, the surface area exposed by Kubernetes can be rather large.</p><p>In this post, we share a little bit about how our engineering team dogfoods Cloudflare Zero Trust to secure Kubernetes — and enables kubectl without proxies.</p>
    <div>
      <h2>Our General Approach to Kubernetes Security</h2>
      <a href="#our-general-approach-to-kubernetes-security">
        
      </a>
    </div>
    <p>As part of our security measures, we heavily limit what can access our clusters over the network. Where a network service is exposed, we add additional protections, such as requiring Cloudflare Access authentication or Mutual TLS (or both) to access ingress resources.</p><p>These network restrictions include access to the cluster's API server. Without access to this, engineers at Cloudflare would not be able to use tools like kubectl to introspect their team's resources. While we believe Continuous Deployments and GitOps are best practices, allowing developers to use the Kubernetes API aids in troubleshooting and increasing developer velocity. Not having access would have been a deal breaker.</p><p>To satisfy our security requirements, we're using Cloudflare Zero Trust, and we wanted to share how we're using it, and the process that brought us here.</p>
    <div>
      <h3>Before Zero Trust</h3>
      <a href="#before-zero-trust">
        
      </a>
    </div>
    <p>In the world before <a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/">Zero Trust</a>, engineers could access the Kubernetes API by connecting to a VPN appliance. While this is common across the industry, and it does allow access to the API, it also dropped engineers as clients into the internal network: they had much more network access than necessary.</p><p>We weren't happy with this situation, but it was the status quo for several years. At the beginning of 2020, we retired our VPN and thus the Kubernetes team needed to find another solution.</p>
    <div>
      <h3>Kubernetes with Cloudflare Tunnels</h3>
      <a href="#kubernetes-with-cloudflare-tunnels">
        
      </a>
    </div>
    <p>At the time we worked closely with the team developing Cloudflare Tunnels to add support for handling <a href="/releasing-kubectl-support-in-access/">kubectl connections using Access</a> and cloudflared tunnels.</p><p>While this worked for our engineering users, it was a significant hurdle to on-boarding new employees. Each Kubernetes cluster required its own tunnel connection from the engineer's device, which made shuffling between clusters annoying. While kubectl supported connecting through SOCKS proxies, this support was not universal to all tools in the Kubernetes ecosystem.</p><p>We continued using this solution internally while we worked towards a better solution.</p>
    <div>
      <h2>Kubernetes with Zero Trust</h2>
      <a href="#kubernetes-with-zero-trust">
        
      </a>
    </div>
    <p>Since the launch of Cloudflare One, we've been dogfooding the Zero Trust agent in various configurations. At first, we used it to implement secure <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS</a> with 1.1.1.1. As time went on, we began to use it to dogfood additional Zero Trust features.</p><p>We're now leveraging the private network routing in Cloudflare Zero Trust to allow engineers to access the Kubernetes APIs without needing to set up cloudflared tunnels or configure kubectl and other Kubernetes ecosystem tools to use tunnels. This isn't something specific to Cloudflare; you can do this for your team today!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3BYeJk98ZxUpcycaPOjfOi/be71df17cceed62dd8df98c6f39d5e9d/Kubectl-WIth-Warp-Diagram--2-.png" />
            
            </figure>
    <div>
      <h3>Configuring Zero Trust</h3>
      <a href="#configuring-zero-trust">
        
      </a>
    </div>
    <p>We use a <a href="https://github.com/cloudflare/terraform-provider-cloudflare">configuration management tool</a> for our Zero Trust configuration to enable infrastructure-as-code, which we've adapted below. However, the same configuration can be achieved using the Cloudflare Zero Trust dashboard.</p><p>The first thing we need to do is create a new tunnel. This tunnel will be used to connect the Cloudflare edge network to the Kubernetes API. We run the tunnel endpoints within Kubernetes, using configuration shown later in this post.</p>
            <pre><code>resource "cloudflare_argo_tunnel" "k8s_zero_trust_tunnel" {
  account_id = var.account_id
  name       = "k8s_zero_trust_tunnel"
  secret     = var.tunnel_secret
}</code></pre>
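<p>The tunnel secret is 32 random bytes, typically supplied base64-encoded; a sketch of generating one (how you feed it into <code>var.tunnel_secret</code>, whether a tfvars file, an environment variable, or a secrets manager, is up to your setup):</p>

```python
# Sketch: generate a 32-byte random tunnel secret, base64-encoded.
import base64
import secrets

tunnel_secret = base64.b64encode(secrets.token_bytes(32)).decode()
print(tunnel_secret)  # 44 base64 characters encoding 32 random bytes
print(len(base64.b64decode(tunnel_secret)))  # 32
```
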
            <p>The "tunnel_secret" secret should be a 32-byte random number, which you should temporarily save, as we'll reuse it for the Kubernetes setup later.</p><p>After we've created the tunnel, we need to create the routes so the Cloudflare network knows what traffic to send through the tunnel.</p>
            <pre><code>resource "cloudflare_tunnel_route" "k8s_zero_trust_tunnel_ipv4" {
  account_id = var.account_id
  tunnel_id  = cloudflare_argo_tunnel.k8s_zero_trust_tunnel.id
  network    = "198.51.100.101/32"
  comment    = "Kubernetes API Server (IPv4)"
}
 
resource "cloudflare_tunnel_route" "k8s_zero_trust_tunnel_ipv6" {
  account_id = var.account_id
  tunnel_id  = cloudflare_argo_tunnel.k8s_zero_trust_tunnel.id
  network    = "2001:DB8::101/128"
  comment    = "Kubernetes API Server (IPv6)"
}</code></pre>
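<p>These routes should cover every address the API server's hostname resolves to; a quick sanity check along these lines can confirm that (the hostname below is a placeholder):</p>

```python
# Sketch: check that the tunnel routes cover every address the API
# server's hostname resolves to. "k8s-api.example.com" is a placeholder.
import socket

def resolved_addresses(hostname):
    """Return the set of IP addresses (v4 and v6) a hostname resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, 443)}

# For the routes above, you would expect something like:
#   resolved_addresses("k8s-api.example.com") is a subset of
#   {"198.51.100.101", "2001:db8::101"}
print(resolved_addresses("localhost"))
```
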
            <p>We support accessing the Kubernetes API via both IPv4 and IPv6 addresses, so we configure routes for both. If you're connecting to your API server via a hostname, these IP addresses should match what is returned via a DNS lookup.</p><p>Next we'll configure settings for Cloudflare Gateway so that it's compatible with the API servers and clients.</p>
            <pre><code>resource "cloudflare_teams_list" "k8s_apiserver_ips" {
  account_id = var.account_id
  name       = "Kubernetes API IPs"
  type       = "IP"
  items      = ["198.51.100.101/32", "2001:DB8::101/128"]
}
 
resource "cloudflare_teams_rule" "k8s_apiserver_zero_trust_http" {
  account_id  = var.account_id
  name        = "Don't inspect Kubernetes API"
  description = "Allow connections from kubectl to API"
  precedence  = 10000
  action      = "off"
  enabled     = true
  filters     = ["http"]
  traffic     = format("any(http.conn.dst_ip[*] in $%s)", replace(cloudflare_teams_list.k8s_apiserver_ips.id, "-", ""))
}</code></pre>
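<p>The <code>traffic</code> expression above references the Teams list by its ID with the dashes stripped and a <code>$</code> prefix; the string manipulation amounts to the following (the UUID below is illustrative, not a real list ID):</p>

```python
# Sketch of the interpolation in the "traffic" expression: the Teams
# list ID with dashes stripped, prefixed by "$".
list_id = "0f1e2d3c-4b5a-6978-8695-a4b3c2d1e0f9"
traffic = "any(http.conn.dst_ip[*] in ${})".format(list_id.replace("-", ""))
print(traffic)  # any(http.conn.dst_ip[*] in $0f1e2d3c4b5a69788695a4b3c2d1e0f9)
```
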
            <p>As we use mutual TLS between clients and the API server, and not all the traffic between kubectl and the API servers is HTTP, we've disabled HTTP inspection for these connections.</p><p>You can pair these rules with additional Zero Trust rules, such as <a href="https://developers.cloudflare.com/cloudflare-one/identity/devices/">device attestation</a>, <a href="https://developers.cloudflare.com/cloudflare-one/policies/filtering/enforce-sessions/">session lifetimes</a>, and <a href="https://developers.cloudflare.com/cloudflare-one/policies/filtering/identity-selectors/">user and group</a> access policies to further customize your security.</p>
    <div>
      <h3>Deploying Tunnels</h3>
      <a href="#deploying-tunnels">
        
      </a>
    </div>
    <p>Once you have your tunnels created and configured, you can deploy their endpoints into your network. We've chosen to deploy the tunnels as pods, as this allows us to use Kubernetes's deployment strategies for rolling out upgrades and handling node failures.</p>
            <pre><code>apiVersion: v1
kind: ConfigMap
metadata:
  name: tunnel-zt
  namespace: example
  labels:
    tunnel: api-zt
data:
  config.yaml: |
    tunnel: 8e343b13-a087-48ea-825f-9783931ff2a5
    credentials-file: /opt/zt/creds/creds.json
    metrics: 0.0.0.0:8081
    warp-routing:
        enabled: true</code></pre>
            <p>This creates a Kubernetes ConfigMap with a basic configuration that enables WARP routing for the tunnel ID specified. You can get this tunnel ID from your configuration management system, the Cloudflare Zero Trust dashboard, or by running the following command from another device logged into the same account.</p><p><code>cloudflared tunnel list</code></p><p>Next, we'll need to create a secret for our tunnel credentials. While you should use a secret management system, for simplicity we'll create one directly here.</p>
            <pre><code>jq -cn --arg accountTag $CF_ACCOUNT_TAG \
       --arg tunnelID $CF_TUNNEL_ID \
       --arg tunnelName $CF_TUNNEL_NAME \
       --arg tunnelSecret $CF_TUNNEL_SECRET \
   '{AccountTag: $accountTag, TunnelID: $tunnelID, TunnelName: $tunnelName, TunnelSecret: $tunnelSecret}' | \
kubectl create secret generic -n example tunnel-creds --from-file=creds.json=/dev/stdin</code></pre>
            <p>This creates a new secret "tunnel-creds" in the "example" namespace with the credentials file the tunnel expects.</p><p>Now we can deploy the tunnels themselves. We deploy multiple replicas to ensure some are always available, even while nodes are being drained.</p>
            <pre><code>apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    tunnel: api-zt
  name: tunnel-api-zt
  namespace: example
spec:
  replicas: 3
  selector:
    matchLabels:
      tunnel: api-zt
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
  template:
    metadata:
      labels:
        tunnel: api-zt
    spec:
      containers:
        - args:
            - tunnel
            - --config
            - /opt/zt/config/config.yaml
            - run
          env:
            - name: GOMAXPROCS
              value: "2"
            - name: TZ
              value: UTC
          image: cloudflare/cloudflared:2022.5.3
          livenessProbe:
            failureThreshold: 1
            httpGet:
              path: /ready
              port: 8081
            initialDelaySeconds: 10
            periodSeconds: 10
          name: tunnel
          ports:
            - containerPort: 8081
              name: http-metrics
          resources:
            limits:
              cpu: "1"
              memory: 100Mi
          volumeMounts:
            - mountPath: /opt/zt/config
              name: config
              readOnly: true
            - mountPath: /opt/zt/creds
              name: creds
              readOnly: true
      volumes:
        - secret:
            name: tunnel-creds
          name: creds
        - configMap:
            name: tunnel-zt
          name: config</code></pre>
            
    <div>
      <h2>Using Kubectl with Cloudflare Zero Trust</h2>
      <a href="#using-kubectl-with-cloudflare-zero-trust">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2BGV6YJLH0Hnt1vh19F2lC/07c8806cd7c65674ddac42f8d2923b93/Screen-Shot-2022-06-23-at-3.43.37-PM.png" />
            
            </figure><p>After deploying the Cloudflare Zero Trust agent, members of your team can now access the Kubernetes API without needing to set up any special SOCKS tunnels!</p>
            <pre><code>kubectl version --short
Client Version: v1.24.1
Server Version: v1.24.1</code></pre>
            
    <div>
      <h2>What's next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>If you try this out, send us your feedback! We're continuing to improve Zero Trust for non-HTTP workflows.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare One Week]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Zero Trust]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <guid isPermaLink="false">4VWDm5LM1jLa6cjopWrCRu</guid>
            <dc:creator>Terin Stock</dc:creator>
        </item>
        <item>
            <title><![CDATA[Automated Origin CA for Kubernetes]]></title>
            <link>https://blog.cloudflare.com/automated-origin-ca-for-kubernetes/</link>
            <pubDate>Fri, 13 Nov 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ Today we're releasing origin-ca-issuer, an extension to cert-manager integrating with Cloudflare Origin CA to easily create and renew certificates for your account's domains. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>In 2016, we launched the <a href="/cloudflare-ca-encryption-origin/">Cloudflare Origin CA</a>, a certificate authority optimized for making it easy to secure the connection between Cloudflare and an origin server. Running our own CA has allowed us to support fast issuance and renewal, simple and effective revocation, and wildcard certificates for our users.</p><p>Out of the box, managing <a href="https://www.cloudflare.com/application-services/products/ssl/">TLS certificates</a> and keys within Kubernetes can be challenging and error prone. The secret resources have to be constructed correctly, as components expect secrets with specific fields. Some forms of domain verification require manually rotating secrets to pass. Once you're successful, don't forget to renew before the certificate expires!</p><p><a href="https://cert-manager.io/">cert-manager</a> is a project to fill this operational gap, providing Kubernetes resources that <a href="https://www.cloudflare.com/application-services/solutions/certificate-lifecycle-management/">manage the lifecycle of a certificate.</a> Today we're releasing <a href="https://github.com/cloudflare/origin-ca-issuer">origin-ca-issuer</a>, an extension to cert-manager integrating with Cloudflare Origin CA to easily create and renew certificates for your account's domains.</p>
    <div>
      <h2>Origin CA Integration</h2>
      <a href="#origin-ca-integration">
        
      </a>
    </div>
    
    <div>
      <h3>Creating an Issuer</h3>
      <a href="#creating-an-issuer">
        
      </a>
    </div>
    <p>After installing cert-manager and origin-ca-issuer, you can create an OriginIssuer resource. This resource creates a binding between cert-manager and the Cloudflare API for an account. Different issuers may be connected to different Cloudflare accounts in the same Kubernetes cluster.</p>
            <pre><code>apiVersion: cert-manager.k8s.cloudflare.com/v1
kind: OriginIssuer
metadata:
  name: prod-issuer
  namespace: default
spec:
  signatureType: OriginECC
  auth:
    serviceKeyRef:
      name: service-key
      key: key
</code></pre>
            <p>This creates a new OriginIssuer named "prod-issuer" that issues certificates using ECDSA signatures, and the secret "service-key" in the same namespace is used to authenticate to the Cloudflare API.</p>
    <div>
      <h3>Signing an Origin CA Certificate</h3>
      <a href="#signing-an-origin-ca-certificate">
        
      </a>
    </div>
    <p>After creating an OriginIssuer, we can now create a Certificate with cert-manager. This defines the domains, including wildcards, that the certificate should be issued for, how long the certificate should be valid, and when cert-manager should renew the certificate.</p>
            <pre><code>apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com
  namespace: default
spec:
  # The secret name where cert-manager
  # should store the signed certificate.
  secretName: example-com-tls
  dnsNames:
    - example.com
  # Duration of the certificate.
  duration: 168h
  # Renew a day before the certificate expiration.
  renewBefore: 24h
  # Reference the Origin CA Issuer you created above,
  # which must be in the same namespace.
  issuerRef:
    group: cert-manager.k8s.cloudflare.com
    kind: OriginIssuer
    name: prod-issuer
</code></pre>
            <p>Once created, cert-manager begins managing the lifecycle of this certificate, including creating the key material, crafting a certificate signature request (CSR), and constructing a certificate request that will be processed by the origin-ca-issuer.</p><p>When signed by the Cloudflare API, the certificate will be made available, along with the private key, in the Kubernetes secret specified within the secretName field. You'll be able to use this certificate on servers proxied behind Cloudflare.</p>
    <div>
      <h3>Extra: Ingress Support</h3>
      <a href="#extra-ingress-support">
        
      </a>
    </div>
    <p>If you're using an Ingress controller, you can use cert-manager's <a href="https://cert-manager.io/docs/usage/ingress/">Ingress support</a> to automatically manage Certificate resources based on your Ingress resource.</p>
            <pre><code>apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/issuer: prod-issuer
    cert-manager.io/issuer-kind: OriginIssuer
    cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
  name: example
  namespace: default
spec:
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: examplesvc
                port:
                  number: 80
  tls:
    # specifying a host in the TLS section will tell cert-manager 
    # what DNS SANs should be on the created certificate.
    - hosts:
        - example.com
      # cert-manager will create this secret
      secretName: example-tls
</code></pre>
            
    <div>
      <h2>Building an External cert-manager Issuer</h2>
      <a href="#building-an-external-cert-manager-issuer">
        
      </a>
    </div>
    <p>An external cert-manager issuer is a specialized Kubernetes controller. There's no direct communication between cert-manager and external issuers at all; this means that you can use any existing tools and best practices for developing controllers to develop an external issuer.</p><p>We've decided to use the excellent <a href="https://github.com/kubernetes-sigs/controller-runtime">controller-runtime</a> project to build origin-ca-issuer, running two reconciliation controllers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/MASdtZ7CMaW8JO3SCgb5t/4956cb2ac00a920a9901605adf1e9f68/image1-4.png" />
            
            </figure>
    <div>
      <h3>OriginIssuer Controller</h3>
      <a href="#originissuer-controller">
        
      </a>
    </div>
    <p>The OriginIssuer controller watches for the creation and modification of OriginIssuer custom resources. The controller creates a Cloudflare API client using the referenced details and credentials; this client instance is later used to sign certificates through the API. The controller periodically retries creating the API client, and once it succeeds, it updates the OriginIssuer's status to ready.</p>
    <div>
      <h3>CertificateRequest Controller</h3>
      <a href="#certificaterequest-controller">
        
      </a>
    </div>
    <p>The CertificateRequest controller watches for the creation and modification of cert-manager's CertificateRequest resources. These resources are created automatically by cert-manager as needed during a certificate's lifecycle.</p><p>The controller looks for CertificateRequests that reference a known OriginIssuer (this reference is copied by cert-manager from the originating Certificate resource) and ignores all resources that do not match. The controller then verifies that the OriginIssuer is in the ready state before transforming the certificate request into an API request using the previously created client.</p><p>On a successful response, the signed certificate is added to the certificate request, which cert-manager then uses to create or update the secret resource. On an unsuccessful request, the controller will periodically retry.</p>
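    <p>The decision flow of that controller can be sketched in a few lines (a hypothetical model with illustrative names; the real controller is written in Go against cert-manager's types):</p>

```python
# Hypothetical sketch of the CertificateRequest reconcile decision.
# origin-ca-issuer is a Go controller built on controller-runtime;
# the dict shapes here are illustrative only.
ISSUER_GROUP = "cert-manager.k8s.cloudflare.com"

def reconcile(request, known_issuers):
    ref = request["issuerRef"]
    # Ignore requests that don't reference an OriginIssuer in our group.
    if (ref.get("group"), ref.get("kind")) != (ISSUER_GROUP, "OriginIssuer"):
        return "ignore"
    issuer = known_issuers.get(ref["name"])
    # Requeue until the referenced issuer exists and reports ready.
    if issuer is None or not issuer.get("ready"):
        return "requeue"
    # Otherwise hand the CSR to the Cloudflare API for signing.
    return "sign"
```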
    <div>
      <h2>Learn More</h2>
      <a href="#learn-more">
        
      </a>
    </div>
    <p>Up-to-date documentation and complete installation instructions can be found in our <a href="https://github.com/cloudflare/origin-ca-issuer">GitHub repository</a>. Feedback and contributions are greatly appreciated. If you're interested in Kubernetes at Cloudflare, including building controllers like these, <a href="https://www.cloudflare.com/careers/jobs/">we're hiring</a>.</p> ]]></content:encoded>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[API]]></category>
            <guid isPermaLink="false">7akG4xBepli4ZP133CXBCf</guid>
            <dc:creator>Terin Stock</dc:creator>
        </item>
        <item>
            <title><![CDATA[High Availability Load Balancers with Maglev]]></title>
            <link>https://blog.cloudflare.com/high-availability-load-balancers-with-maglev/</link>
            <pubDate>Wed, 10 Jun 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ We own and operate physical infrastructure for our backend services. We need an effective way to route arbitrary TCP and UDP traffic between services and also from outside these data centers. ]]></description>
            <content:encoded><![CDATA[ <p></p>
    <div>
      <h2>Background</h2>
      <a href="#background">
        
      </a>
    </div>
    <p>We run many backend services that power our customer dashboard, APIs, and features available at our edge. We own and operate physical infrastructure for our backend services. We need an effective way to route arbitrary TCP and UDP traffic between services and also from outside these data centers.</p><p>Previously, all traffic for these backend services would pass through several layers of stateful TCP proxies and NATs before reaching an available instance. This solution worked for several years, but as we grew it caused our service and operations teams many issues. Our service teams had to deal with drops in availability, and our operations teams faced significant toil whenever they needed to perform maintenance on the load balancer servers.</p>
    <div>
      <h2>Goals</h2>
      <a href="#goals">
        
      </a>
    </div>
    <p>With our experience operating the stateful TCP proxy and NAT solution in mind, we had several goals for a replacement load balancing service, while remaining on our own infrastructure:</p><ol><li><p>Preserve source IPs through routing decisions to destination servers. This allows us to support servers that require client IP addresses as part of their operation, without workarounds such as X-Forwarded-For headers or the PROXY TCP extension.</p></li><li><p>Support an architecture where backends are located across many racks and subnets. This prevents solutions that cannot be routed by existing network equipment.</p></li><li><p>Allow operations teams to perform maintenance with zero downtime. We should be able to remove load balancers at any time without causing any connection resets or downtime for services.</p></li><li><p>Use Linux tools and features that are commonplace and well-tested. There are a lot of very cool networking features in Linux we could experiment with, but we wanted to optimize for the least surprising behavior for operators who do not primarily work with these load balancers.</p></li><li><p>No explicit connection synchronization between load balancers. We found that communication between load balancers significantly increased the system complexity, allowing for more opportunities for things to go wrong.</p></li><li><p>Allow for a staged rollout from the previous load balancer implementation. We should be able to migrate the traffic of specific services between the two implementations to find issues and gain confidence in the system.</p></li></ol>
    <div>
      <h2>Reaching Zero Downtime</h2>
      <a href="#reaching-zero-downtime">
        
      </a>
    </div>
    
    <div>
      <h3>Problems</h3>
      <a href="#problems">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2CMbFBlBDHc4rzZ458SE03/d1a7f034913bb4d188048d4f14620867/image6-1.png" />
            
            </figure><p>Previously, when traffic arrived at our backend data centers, our routers would pick and forward packets to one of the L4 load balancer servers they knew about. These L4 load balancers would determine which service the traffic was for, then forward the traffic to one of the service's L7 servers.</p><p>This architecture worked fine during normal operations. However, issues would quickly surface whenever the set of load balancers changed. Our routers would forward traffic to the new set, and it was very likely traffic would arrive at a different load balancer than before. As each load balancer maintained its own connection state, it would be unable to forward traffic for these new in-progress connections. These connections would then be reset, potentially causing errors for our customers.</p>
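    <p>The disruption is easy to reproduce with a toy model: if connections are assigned by hashing modulo the number of load balancers (a simplification for illustration, not our routers' actual hash), shrinking the set reassigns most of them:</p>

```python
import hashlib

def assign(flow, n):
    # Deterministic stand-in for a router's hash: map a flow to one of
    # n load balancers by reducing a digest modulo n.
    digest = int(hashlib.md5(flow.encode()).hexdigest(), 16)
    return digest % n

flows = [f"198.51.100.{i % 250}:4{i:04d}" for i in range(1000)]
before = [assign(f, 5) for f in flows]
after = [assign(f, 4) for f in flows]  # one load balancer removed
moved = sum(b != a for b, a in zip(before, after))
# With naive modular hashing, most flows land on a different balancer,
# and any in-progress connections among them are reset.
```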
    <div>
      <h3>Consistent Hashing</h3>
      <a href="#consistent-hashing">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2sQ0ZsemGl1ZjnxB8SaBZ2/dee6b3b33c0d088b8f2e5378a714ae5a/image5-1.png" />
            
            </figure><p>During normal operations, our new architecture has similar behavior to the previous design. An L4 load balancer is selected by our routers, which then forwards traffic to a service's L7 server.</p><p>There's a significant change when the set of load balancers changes. As our load balancers are now stateless, it doesn't matter which load balancer our router selects to forward traffic to: packets will reach the same backend server either way.</p>
    <div>
      <h2>Implementation</h2>
      <a href="#implementation">
        
      </a>
    </div>
    
    <div>
      <h3>BGP</h3>
      <a href="#bgp">
        
      </a>
    </div>
    <p>Our load balancer servers announce service IP addresses to our data centers’ routers using BGP, unchanged from the previous solution. Our routers choose which load balancers will receive packets based on a routing strategy called equal-cost multi-path routing (ECMP).</p><p>ECMP hashes information from packets to pick a path for that packet. The hash function used by routers is often fixed in firmware. Routers that choose a poor hashing function, or choose bad inputs, can create unbalanced network and server load, or break assumptions made by the protocol layer.</p><p>We worked with our networking team to ensure ECMP is configured on our routers to hash only based on the packet's 5-tuple—the protocol, source address and port, and destination address and port.</p><p>For maintenance, our operators can withdraw the BGP session and traffic will transparently shift to other load balancers. However, if a load balancer suddenly becomes unavailable, such as with a kernel panic or power failure, there is a short delay before the BGP keepalive mechanism fails and routers terminate the session.</p><p>It's possible for routers to terminate BGP sessions after a much shorter delay using the Bidirectional Forwarding Detection (BFD) protocol between the router and load balancers. Different routers have different limitations and restrictions on BFD that make it difficult to use in an environment heavily using L2 link aggregation and VXLANs.</p><p>We're continuing to work with our networking team to find solutions to reduce the time to terminate BGP sessions, using tools and configurations they're most comfortable with.</p>
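    <p>Stepping back to the hashing itself: because the ECMP decision is a pure function of the 5-tuple, every packet of a connection keeps taking the same path regardless of what else is flowing. An illustrative sketch (not router firmware):</p>

```python
import hashlib

def ecmp_path(proto, src_ip, src_port, dst_ip, dst_port, n_paths):
    # Hash only the 5-tuple, so path selection is stable per connection
    # and independent of packet contents or arrival order.
    key = f"{proto}|{src_ip}|{src_port}|{dst_ip}|{dst_port}".encode()
    return int(hashlib.sha256(key).hexdigest(), 16) % n_paths
```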
    <div>
      <h3>Selecting Backends with Maglev</h3>
      <a href="#selecting-backends-with-maglev">
        
      </a>
    </div>
    <p>To ensure all load balancers are sending traffic to the same backends, we decided to use the <a href="https://research.google/pubs/pub44824/">Maglev connection scheduler</a>. Maglev is a consistent hash scheduler hashing a 5-tuple of information from each packet—the protocol, source address and port, and destination address and port—to determine a backend server.</p><p>By being a consistent hash, the same backend server is chosen by every load balancer for a packet without needing to persist any connection state. This allows us to transparently move traffic between load balancers without requiring explicit connection synchronization between them.</p>
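    <p>The table-population algorithm from the Maglev paper is short enough to sketch (illustrative hash functions; the kernel's <code>ip_vs_mh</code> implementation differs in details such as table size and hashing):</p>

```python
import hashlib

def _h(value, salt):
    # Illustrative hash; real implementations use two independent
    # hash functions and a prime lookup-table size.
    return int(hashlib.md5(f"{salt}:{value}".encode()).hexdigest(), 16)

def maglev_table(backends, size=251):
    # Each backend derives a permutation of table slots from its name.
    perms = []
    for b in backends:
        offset = _h(b, "offset") % size
        skip = _h(b, "skip") % (size - 1) + 1
        perms.append([(offset + j * skip) % size for j in range(size)])
    table = [None] * size
    nxt = [0] * len(backends)
    filled = 0
    # Backends take turns claiming their next preferred empty slot until
    # every slot is owned; this keeps the table balanced.
    while filled < size:
        for i in range(len(backends)):
            while True:
                slot = perms[i][nxt[i]]
                nxt[i] += 1
                if table[slot] is None:
                    table[slot] = backends[i]
                    filled += 1
                    break
            if filled == size:
                break
    return table

def pick(table, flow):
    # Every load balancer computes the same slot for the same 5-tuple.
    return table[_h(flow, "flow") % len(table)]
```

<p>Every load balancer derives the same table from the same backend list, so any of them maps a given 5-tuple to the same backend, and removing one backend disturbs only a small fraction of the remaining slots.</p>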
    <div>
      <h3>IPVS and Foo-Over-UDP</h3>
      <a href="#ipvs-and-foo-over-udp">
        
      </a>
    </div>
    <p>Where possible, we wanted to use commonplace and reliable Linux features. Linux has implemented a powerful layer 4 load balancer, the IP Virtual Server (IPVS), since the early 2000s. IPVS has supported the Maglev scheduler <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=039f32e8cdea29b4d0680df7a83817b5ec4166e1">since Linux 4.18</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2FK6ZVvqjjOVdFwdvTZM3L/5617f15dd5cf447bff3a47e7dff5702f/image2-3.png" />
            
            </figure><p>Our load balancer and application servers are spread across multiple racks and subnets. To route traffic from the load balancer we opted to use Foo-Over-UDP encapsulation.</p><p>In Foo-Over-UDP encapsulation a new IP and UDP header are added around the original packet. When these packets arrive on the destination server, the Linux kernel removes the outer IP and UDP headers and inserts the inner payload back into the networking stack for processing as if the packet had originally been received on that server.</p><p>Compared to other encapsulation methods—such as IPIP, GUE, and GENEVE—we felt Foo-Over-UDP struck a nice balance between features and flexibility. Direct Server Return, where application servers reply directly to clients and bypass the load balancers, was implemented as a byproduct of the encapsulation. There was no state associated with the encapsulation, and each server required only one encapsulation interface to receive traffic from all load balancers.</p><p><b>Example Load Balancer Configuration</b></p>
            <pre><code># Load in the kernel modules required for IPVS and FOU.
$ modprobe ip_vs &amp;&amp; modprobe ip_vs_mh &amp;&amp; modprobe fou

# Create one tunnel between the load balancer and
# an application server. The IPs are the machines'
# real IPs on the network.
$ ip link add name lbtun1 type ipip \
remote 192.0.2.1 local 192.0.2.2 ttl 2 \
encap fou encap-sport auto encap-dport 5555

# Inform the kernel about the VIPs that might be announced here.
$ ip route add table local local 198.51.100.0/24 \
dev lo proto kernel

# Give the tunnel an IP address local to this machine.
# Traffic on this machine destined for this IP address will
# be sent down the tunnel.
$ ip route add 203.0.113.1 dev lbtun1 scope link

# Tell IPVS about the service, and that it should use the
# Maglev scheduler.
$ ipvsadm -A -t 198.51.100.1:80 -s mh

# Tell IPVS about a backend for this service.
$ ipvsadm -a -t 198.51.100.1:80 -r 203.0.113.1:80</code></pre>
            <p><b>Example Application Server Configuration</b></p>
            <pre><code># The kernel module may need to be loaded.
$ modprobe fou

# Setup an IPIP receiver.
# ipproto 4 = IPIP (not IPv4)
$ ip fou add port 5555 ipproto 4

# Bring up the tunnel.
$ ip link set dev tunl0 up

# Disable reverse path filtering on tunnel interface.
$ sysctl -w net.ipv4.conf.tunl0.rp_filter=0
$ sysctl -w net.ipv4.conf.all.rp_filter=0</code></pre>
            <p>IPVS does not support Foo-Over-UDP as a packet forwarding method. To work around this limitation, we've created virtual interfaces that implement Foo-Over-UDP encapsulation. We can then use IPVS's direct packet forwarding method along with the kernel routing table to choose a specific interface.</p><p>Linux is often configured to ignore packets that arrive on an interface that is different from the interface used for replies. As packets will now be arriving on the virtual "tunl0" interface, we need to disable reverse path filtering on this interface. The kernel uses the higher value of the named and "all" interfaces, so you may need to decrease "all" and adjust other interfaces.</p>
    <div>
      <h3>MTUs and Encapsulation</h3>
      <a href="#mtus-and-encapsulation">
        
      </a>
    </div>
    <p>The maximum IPv4 packet size, or maximum transmission unit (MTU), we accept from the Internet is 1500 bytes. To ensure we did not fragment these packets during encapsulation, we increased our internal MTUs from the default to accommodate the extra IP and UDP headers.</p><p>The team had underestimated the complexity of changing the MTU across all our racks of equipment. We had to adjust the MTU on all our routers and switches, on our bonded and VXLAN interfaces, and finally on our Foo-Over-UDP encapsulation. Even with a carefully orchestrated rollout, we still uncovered MTU-related bugs with our switches and server stack, many of which manifested first as issues on other parts of the network.</p>
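    <p>The arithmetic behind the MTU increase is small but worth writing down: internal links must carry a full-sized external packet plus the outer headers Foo-Over-UDP adds (a sketch; an IPv6 outer header would need 40 + 8 bytes instead):</p>

```python
OUTER_IPV4_HEADER = 20  # outer IPv4 header added by FOU encapsulation
OUTER_UDP_HEADER = 8    # outer UDP header carrying the FOU payload

def min_internal_mtu(external_mtu=1500,
                     overhead=OUTER_IPV4_HEADER + OUTER_UDP_HEADER):
    # Internal links must fit the original packet plus encapsulation
    # overhead without fragmenting it.
    return external_mtu + overhead

# A 1500-byte packet from the Internet needs 1528 bytes internally.
```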
    <div>
      <h3>Node Agent</h3>
      <a href="#node-agent">
        
      </a>
    </div>
    <p>We've written a Go agent running on each load balancer that synchronizes with a control plane layer that's tracking the location of services. The agent then configures the system based on active services and available backend servers.</p><p>To configure IPVS and the routing table we're using packages built upon the <a href="https://github.com/mdlayher/netlink">netlink</a> Go package. We're <a href="https://github.com/cloudflare/ipvs">open sourcing</a> the IPVS netlink package we built today, which supports querying, creating and updating IPVS virtual servers, destinations, and statistics.</p><p>Unfortunately, there is no official programming interface for iptables, so we must instead execute the iptables binary. The agent computes an ideal set of iptables chains and rules, which is then reconciled with the live rules.</p><p><b>Subset of iptables for a service</b></p>
            <pre><code>*filter
-A INPUT -d 198.51.100.0/24 -m comment --comment \
"leif:nhAi5v93jwQYcJuK" -j LEIFUR-LB
-A LEIFUR-LB -d 198.51.100.1/32 -p tcp -m comment --comment \
"leif:G4qtNUVFCkLCu4yt" -m multiport --dports 80 -j LEIFUR-GQ4OKHRLCJYOWIN9
-A LEIFUR-GQ4OKHRLCJYOWIN9 -s 10.0.0.0/8 -m comment --comment \
"leif:G4qtNUVFCkLCu4yt" -j ACCEPT
-A LEIFUR-GQ4OKHRLCJYOWIN9 -s 172.16.0.0/12 -m comment --comment \
"leif:0XxZ2OwlQWzIYFTD" -j ACCEPT</code></pre>
    <p>The iptables output of a rule may differ significantly from the input given by our ideal rule. To avoid needing to parse the entire iptables rule in our comparisons, we store a hash of the rule, including its position in the chain, as an iptables comment. We can then compare the comment to our ideal rule to determine whether we need to take any action. On chains that are shared (such as INPUT) the agent ignores unmanaged rules.</p>
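    <p>A miniature model of that reconciliation check (the comment format here is invented for illustration; only the idea of hashing rule-plus-position into the comment reflects the agent):</p>

```python
import hashlib

def rule_comment(chain, position, rule):
    # Hash the rule text together with its position in the chain; storing
    # the digest in the rule's comment lets live rules be compared cheaply.
    digest = hashlib.sha256(f"{chain}:{position}:{rule}".encode()).hexdigest()
    return f"leif:{digest[:16]}"

def needs_update(live_comment, chain, position, rule):
    # If the live comment doesn't match the ideal rule's comment, the rule
    # moved or changed and must be replaced.
    return live_comment != rule_comment(chain, position, rule)
```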
    <div>
      <h2>Kubernetes Integration</h2>
      <a href="#kubernetes-integration">
        
      </a>
    </div>
    <p>We use the network load balancer described here as a <a href="https://www.cloudflare.com/learning/performance/cloud-load-balancing-lbaas/">cloud load balancer</a> for Kubernetes. A controller assigns virtual IP addresses to Kubernetes services requesting a load balancer IP. These IPs get configured by the agent in IPVS. Traffic is directed to a subset of cluster nodes for handling by kube-proxy, unless the External Traffic Policy is set to "Local", in which case the traffic is sent to the specific backends the workloads are running on.</p><p>This allows us to have internal Kubernetes clusters that better replicate the load balancer behavior of managed clusters on cloud providers. Services running on Kubernetes, such as ingress controllers, <a href="https://www.cloudflare.com/learning/security/api/what-is-an-api-gateway/">API gateways</a>, and databases, have access to the correct client IP addresses of load balanced traffic.</p>
    <div>
      <h2>Future Work</h2>
      <a href="#future-work">
        
      </a>
    </div>
    <ul><li><p>Keep a close eye on future developments of IPVS and alternatives, including nftlb, Express Data Path (XDP), and eBPF.</p></li><li><p>Migrate to nftables. The "flat priorities" and lack of a programmable interface for iptables make it ill-suited for including automated rules alongside rules added by operators. We hope that as more projects and operations move to nftables we'll be able to switch without creating a "blind spot" for operators.</p></li><li><p>Failures of a load balancer can result in temporary outages due to BGP hold timers. We'd like to improve how we handle failures of BGP sessions.</p></li><li><p>Investigate using Lightweight Tunnels to reduce the number of Foo-Over-UDP interfaces needed on the load balancer nodes.</p></li></ul>
    <div>
      <h2>Additional Reading</h2>
      <a href="#additional-reading">
        
      </a>
    </div>
    <ul><li><p><a href="https://vincent.bernat.im/en/blog/2018-multi-tier-loadbalancer"><i>Multi-tier load-balancing with Linux</i></a>. Vincent Bernat (2018). Describes a network load balancer using IPVS and IPIP.</p></li><li><p><a href="https://githubengineering.com/introducing-glb/"><i>Introducing the GitHub Load Balancer</i></a>. Joe Williams, Theo Julienne (2017). Describes a similar split layer 4 and layer 7 architecture, using consistent hashing and Foo-Over-UDP. They seemed to have limitations with IPVS that looked to have been resolved.</p></li><li><p><a href="https://lwn.net/Articles/614348/"><i>Foo over UDP</i></a>. Jonathan Corbet (2014). Describes the basics of IPIP and Foo-Over-UDP, which was just introduced at the time.</p></li><li><p><a href="https://www.netdevconf.org/0.1/sessions/11.html"><i>UDP encapsulation, FOU, GUE, &amp; RCO</i></a>. Tom Herbert (2015). Describes the different UDP encapsulation options.</p></li><li><p><a href="https://archive.nanog.org/sites/default/files/1_Saino_Hashing_On_Broken_Assumptions.pdf"><i>Hashing on broken assumptions</i></a>. Lorenzo Saino (2017). Farther discussion on hashing difficulties with ECMP.</p></li><li><p><a href="https://archive.nanog.org/meetings/nanog45/presentations/Monday/Scholl_BFD_N45.pdf"><i>BFD (Bidirectional Forwarding Detection): Does it work and is it worth it?</i></a>. Tom School (2009). Discussion on BFD with common protocols and where it can become a problem.</p></li></ul> ]]></content:encoded>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">6l38CAxcJrVhPKDMQvLXWx</guid>
            <dc:creator>Terin Stock</dc:creator>
        </item>
        <item>
            <title><![CDATA[Accelerating Node.js applications with HTTP/2 Server Push]]></title>
            <link>https://blog.cloudflare.com/accelerating-node-js-applications-with-http-2-server-push/</link>
            <pubDate>Tue, 16 Aug 2016 12:17:19 GMT</pubDate>
            <description><![CDATA[ In April, we announced support for HTTP/2 Server Push via the HTTP Link header. My coworker John has demonstrated how easy it is to add Server Push to an example PHP application.  We wanted to make it easy to improve the performance of contemporary websites built with Node.js.  ]]></description>
            <content:encoded><![CDATA[ <p>In April, we announced support for <a href="https://www.cloudflare.com/http2/server-push/">HTTP/2 Server Push</a> via the HTTP <code>Link</code> header. My coworker John has demonstrated how easy it is to <a href="/using-http-2-server-push-with-php/">add Server Push to an example PHP</a> application.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2cTvkkBKYrTnHNyFrfyrYI/05d92efb1903cd4700d8a9020e430b40/489477622_594bf9e3d9_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/nickyfern/489477622/in/photolist-KfGDm-4WuA7B-4WySRS-a8ijnt-4WuByk-4WuB9M-bE3e6V-665C6K-eeQRx1-b97viM-qzYJ8z-9n2aTS-8EsaQK-aqxV42-jWDkD2-jKyWKv-jzs7yy-9TKcYn-4iAQTa-ECjYJ-96MvXy-bP43v2-rEmKWR-5p3k5r-pVZhDo-os1Njy-5CEEnU-8RhgUX-7JV4jr-9bC7me-sSKqA-72Mppz-maZfyL-6KfUge-dDvw4S-ngyFgY-aPxqsP-7b24QV-5opdXV-sEA7Xj-mZohS-8GFkMr-twfh93-7ZXqtJ-dvRbXj-a8mRiU-4NCzeD-qFXKpj-4n7K7j-34D66Q">image</a> by <a href="https://www.flickr.com/photos/nickyfern/">Nicky Fernandes</a></p><p>We wanted to make it easy to improve the performance of contemporary websites built with <a href="https://nodejs.org/">Node.js</a>. we developed the <a href="https://github.com/cloudflare/netjet">netjet</a> middleware to parse the generated HTML and automatically add the <code>Link</code> headers. When used with an example Express application you can see the headers being added:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4yzfZDbCTajZyJDNnbqrUm/ab28553b5d5e878e10b93f115a0f602e/2016-08-11_13-32-45.png" />
            
            </figure><p>We use <a href="https://ghost.org/">Ghost</a> to power this blog, so if your browser supports HTTP/2 you have already benefited from Server Push without realizing it! More on that below.</p><p>In netjet, we use the <a href="https://github.com/posthtml/posthtml">PostHTML</a> project to parse the HTML with a custom plugin. Right now it is looking for images, scripts and external stylesheets. You can implement this same technique in other environments too.</p><p>Putting an HTML parser in the response stack has a downside: it will increase the page load latency (or "time to first byte"). In most cases, the added latency will be overshadowed by other parts of your application, such as database access. However, netjet includes an adjustable LRU cache keyed by <code>ETag</code> headers, allowing netjet to insert <code>Link</code> headers quickly on pages already parsed.</p><p>If you are designing a brand new application, however, you should consider storing metadata on embedded resources alongside your content, eliminating the HTML parse, and possible latency increase, entirely.</p><p>Netjet is compatible with any Node.js HTML framework that supports Express-like middleware. Getting started is as simple as adding netjet to the beginning of your middleware chain.</p>
            <pre><code>var express = require('express');
var netjet = require('netjet');
var root = '/path/to/static/folder';

express()
  .use(netjet({
    cache: {
      max: 100
    }
  }))
  .use(express.static(root))
  .listen(1337);</code></pre>
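<p>The <code>cache</code> option above bounds how many parsed pages netjet remembers. The general shape of an <code>ETag</code>-keyed LRU cache can be sketched in a few lines (illustrative only; netjet's actual cache implementation and internals differ):</p>

```javascript
// Minimal sketch of an ETag-keyed LRU cache, along the lines of the one
// netjet uses to skip re-parsing pages it has already seen.
// (Hypothetical helper for illustration, not netjet's real code.)
function LruCache(max) {
  this.max = max;
  this.map = new Map(); // Map preserves insertion order
}

LruCache.prototype.get = function (etag) {
  if (!this.map.has(etag)) return undefined;
  var value = this.map.get(etag);
  // Re-insert to mark this entry as most recently used.
  this.map.delete(etag);
  this.map.set(etag, value);
  return value;
};

LruCache.prototype.set = function (etag, linkHeader) {
  this.map.delete(etag);
  this.map.set(etag, linkHeader);
  if (this.map.size > this.max) {
    // Evict the least recently used entry (the first key in the Map).
    this.map.delete(this.map.keys().next().value);
  }
};

var cache = new LruCache(2);
cache.set('"abc"', '</app.js>; rel=preload; as=script');
cache.set('"def"', '</main.css>; rel=preload; as=style');
cache.get('"abc"');           // touch "abc" so "def" is now the oldest
cache.set('"ghi"', '</logo.png>; rel=preload; as=image');
console.log(cache.map.has('"def"')); // → false (evicted)
```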
            <p>With a little more work, you can even use netjet without frameworks.</p>
            <pre><code>var http = require('http');
var netjet = require('netjet');

var port = 1337;
var hostname = 'localhost';
var preload = netjet({
  cache: {
    max: 100
  }
});

var server = http.createServer(function (req, res) {
  preload(req, res, function () {
      res.statusCode = 200;
      res.setHeader('Content-Type', 'text/html');
      res.end('&lt;!doctype html&gt;&lt;h1&gt;Hello World&lt;/h1&gt;');
  });
});

server.listen(port, hostname, function () {
  console.log('Server running at http://' + hostname + ':' + port + '/');
});</code></pre>
            <p>See the <a href="https://www.npmjs.com/package/netjet">netjet documentation</a> for more information on the supported options.</p>
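<p>To make the parsing technique described above concrete, here is a stripped-down sketch of the same idea: scan the HTML for scripts, stylesheets, and images, and build the corresponding <code>Link</code> preload header value. (netjet itself uses a PostHTML plugin rather than regular expressions; this version is illustrative only and not robust against arbitrary markup.)</p>

```javascript
// Simplified sketch of the technique netjet applies: find asset tags in
// the HTML and emit a Link header value for each. Not netjet's real
// parser -- a regex stand-in for the PostHTML plugin, for illustration.
function buildLinkHeader(html) {
  var entries = [];
  var re = /<(script|link|img)\b[^>]*\b(?:src|href)="([^"]+)"[^>]*>/g;
  var match;
  while ((match = re.exec(html)) !== null) {
    var tag = match[1], url = match[2];
    // Only preload <link> tags that are stylesheets.
    if (tag === 'link' && !/rel="stylesheet"/.test(match[0])) continue;
    var as = tag === 'script' ? 'script' : tag === 'img' ? 'image' : 'style';
    entries.push('<' + url + '>; rel=preload; as=' + as);
  }
  return entries.join(', ');
}

var html = '<link rel="stylesheet" href="/main.css">' +
           '<script src="/app.js"></script>';
console.log(buildLinkHeader(html));
// → </main.css>; rel=preload; as=style, </app.js>; rel=preload; as=script
```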
    <div>
      <h2>Seeing what’s pushed</h2>
      <a href="#seeing-whats-pushed">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2qCaQEt8gl74hBdoZVEzBS/d29a6028893b87b888cad5ba438b3a78/2016-08-02_10-49-33.png" />
            
            </figure><p>Chrome's Developer Tools makes it easy to verify that your site is using Server Push. The Network tab shows pushed assets with "Push" included as part of the initiator.</p><p>Unfortunately, Firefox's Developer Tools don't yet directly expose whether a resource was pushed. You can, however, check for the <code>cf-h2-pushed</code> header in the page's response headers, which contains a list of resources that CloudFlare offered browsers over Server Push.</p><p>Contributions to improve netjet or the documentation are greatly appreciated. I'm excited to hear where people are using netjet.</p>
    <div>
      <h2>Ghost and Server Push</h2>
      <a href="#ghost-and-server-push">
        
      </a>
    </div>
    <p>Ghost is one such exciting integration. With the aid of the Ghost team, I've integrated netjet, and it has been available as an opt-in beta since version 0.8.0.</p><p>If you are running a Ghost instance, you can enable Server Push by modifying the server's <code>config.js</code> file and adding the <code>preloadHeaders</code> option to the <code>production</code> configuration block.</p>
            <pre><code>production: { 
  url: 'https://my-ghost-blog.com', 
  preloadHeaders: 100, 
  // ... 
}</code></pre>
            <p>Ghost has put together <a href="http://support.ghost.org/preload-headers/">a support article</a> for Ghost(Pro) customers.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>With netjet, your Node.js applications can start to use browser preloading and, when used with CloudFlare, HTTP/2 Server Push today.</p><p>At CloudFlare, we're excited to make tools to help increase the performance of websites. If you find this interesting, we are <a href="https://www.cloudflare.com/join-our-team/">hiring</a> in Austin, Texas; Champaign, Illinois; London; San Francisco; and Singapore.</p> ]]></content:encoded>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[php]]></category>
            <category><![CDATA[Server Push]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[JavaScript]]></category>
            <guid isPermaLink="false">257O1UR4TetOqSjjwVfDCV</guid>
            <dc:creator>Terin Stock</dc:creator>
        </item>
    </channel>
</rss>