The Cloudflare Blog

Introducing Programmable Flow Protection: custom DDoS mitigation logic for Magic Transit customers

Anita Tenjarla — Tue, 31 Mar 2026 13:00:00 GMT

We're proud to introduce Programmable Flow Protection: a system designed to let Magic Transit customers implement their own custom DDoS mitigation logic and deploy it across Cloudflare’s global network. This enables precise, stateful mitigation for custom and proprietary protocols built on UDP. It is engineered to provide the highest possible level of customization and flexibility to mitigate DDoS attacks of any scale.

Programmable Flow Protection is currently in beta and available to all Magic Transit Enterprise customers for an additional cost. Contact your account team to join the beta or sign up at this page.

Programmable Flow Protection is customizable

Our existing DDoS mitigation systems have been designed to understand and protect popular, well-known protocols from DDoS attacks. For example, our Advanced TCP Protection system uses specific known characteristics about the TCP protocol to issue challenges and establish a client’s legitimacy. Similarly, our Advanced DNS Protection builds a per-customer profile of DNS queries to mitigate DNS attacks. Our generic DDoS mitigation platform also understands common patterns across a variety of other well known protocols, including NTP, RDP, SIP, and many others.

However, custom or proprietary UDP protocols have always been a challenge for Cloudflare’s DDoS mitigation systems because our systems do not have the relevant protocol knowledge to make intelligent decisions to pass or drop traffic.

Programmable Flow Protection addresses this gap. Now, customers can write their own eBPF program that defines what “good” and “bad” packets are and how to deal with them. Cloudflare then runs the program across our entire global network. The program can choose to either drop or challenge “bad” packets, preventing them from reaching the customer’s origin.

The problem of UDP-based attacks

UDP is a connectionless transport layer protocol. Unlike TCP, UDP has no handshake or stateful connections. It does not promise that packets will arrive in order or exactly once. UDP instead prioritizes speed and simplicity, and is therefore well-suited for online gaming, VoIP, video streaming, and any other use case where the application requires real-time communication between clients and servers.

Our DDoS mitigation systems have always been able to detect and mitigate attacks against well-known protocols built on top of UDP. For example, the standard DNS protocol is built on UDP, and each DNS packet has a well-known structure. If we see a DNS packet, we know how to interpret it. That makes it easier for us to detect and drop DNS-based attacks.

Unfortunately, if we don’t understand the protocol inside a UDP packet’s payload, our DDoS mitigation systems have limited options available at mitigation time. If an attacker sends a large flood of UDP traffic that does not match any known patterns or protocols, Cloudflare can either entirely block or apply a rate limit to the destination IP and port combination. This is a crude “last line of defense” that is only intended to keep the rest of the customer’s network online, and it can be painful in a couple ways.

First, a block or a generic rate limit does not distinguish good traffic from bad, which means these mitigations will likely cause legitimate clients to experience lag or connection loss — doing the attacker’s job for them! Second, a generic rate limit can be too strict or too lax depending on the customer. For example, a customer who expects to receive 1Gbps of legitimate traffic probably needs more aggressive rate limiting compared to a customer who expects to receive 25Gbps of legitimate traffic.

^{An illustration of UDP packet contents. A user can define a valid payload and reject traffic that doesn’t match the defined pattern.}

The Programmable Flow Protection platform was built to address this problem by allowing our customers to dictate what “good” versus “bad” traffic actually looks like. Many of our customers use custom or proprietary UDP protocols that we do not understand — and now we don’t have to.

How Programmable Flow Protection works

In previous blog posts, we’ve described how “flowtrackd”, our stateful network layer DDoS mitigation system, protects Magic Transit users from complex TCP and DNS attacks. We’ve also described how we use Linux technologies like XDP and eBPF to efficiently mitigate common types of large scale DDoS attacks.

Programmable Flow Protection combines these technologies in a novel way. With Programmable Flow Protection, a customer can write their own eBPF program that decides whether to pass, drop, or challenge individual packets based on arbitrary logic. A customer can upload the program to Cloudflare, and Cloudflare will execute it on every packet destined to their network. Programs are executed in userspace, not kernel space, which allows Cloudflare the flexibility to support a variety of customers and use cases on the platform without compromising security. Programmable Flow Protection programs run after all of Cloudflare’s existing DDoS mitigations, so users still benefit from our standard security protections.

There are many similarities between an XDP eBPF program loaded into the Linux kernel and an eBPF program running on the Programmable Flow Protection platform. Both types of programs are compiled down to BPF bytecode. They are both run through a “verifier” to ensure memory safety and verify program termination. They are also executed in a fast, lightweight VM to provide isolation and stability.

However, eBPF programs loaded into the Linux kernel make use of many Linux-specific “helper functions” to integrate with the network stack, maintain state between program executions, and emit packets to network devices. Programmable Flow Protection offers the same functionality whenever a customer chooses, but with a different API tailored specifically to implement DDoS mitigations. For example, we’ve built helper functions to store state about clients between program executions, perform cryptographic validation, and emit challenge packets to clients. With these helper functions, a developer can use the power of the Cloudflare platform to protect their own network.

Combining customer knowledge with Cloudflare’s network

Let’s step through an example to illustrate how a customer’s protocol-specific knowledge can be combined with Cloudflare’s network to create powerful mitigations.

Say a customer hosts an online gaming server on UDP port 207. The game engine uses a proprietary application header that is specific to the game. Cloudflare has no knowledge of the structure or contents of the application header. The customer gets hit by DDoS attacks that overwhelm the game server and players report lag in gameplay. The attack traffic comes from highly randomized source IPs and ports, and the payload data appears to be random as well.

To mitigate the attack, the customer can use their knowledge of the application header and deploy a Programmable Flow Protection program to check a packet’s validity. In this example, the application header contains a token that is unique to the gaming protocol. The customer can therefore write a program to extract the last byte of the token. The program passes all packets with the correct value present and drops all other traffic:

#include 
#include 
#include 

#include "cf_ebpf_defs.h"
#include "cf_ebpf_helper.h"

// Custom application header
struct apphdr {
    uint8_t  version;
    uint16_t length;   // Length of the variable-length token
    uint8_t  token[0]; // Variable-length token
} __attribute__((packed));

uint64_t
cf_ebpf_main(void *state)
{
    struct cf_ebpf_generic_ctx *ctx = state;
    struct cf_ebpf_parsed_headers headers;
    struct cf_ebpf_packet_data *p;

    // Parse the packet headers with provided helper function
    if (parse_packet_data(ctx, &p, &headers) != 0) {
        return CF_EBPF_DROP;
    }

    // Drop packets not destined to port 207
    struct udphdr *udp_hdr = (struct udphdr *)headers.udp;
    if (ntohs(udp_hdr->dest) != 207) {
        return CF_EBPF_DROP;
    }

    // Get application header from UDP payload
    struct apphdr *app = (struct apphdr *)(udp_hdr + 1);
    if ((uint8_t *)(app + 1) > headers.data_end) {
        return CF_EBPF_DROP;
    }

    // Perform memory checks to satisfy the verifier
    // and access the token safely
    if ((uint8_t *)(app->token + token_len) > headers.data_end) {
        return CF_EBPF_DROP;
    }

    // Check the last byte of the token against expected value
    uint8_t *last_byte = app->token + token_len - 1;
    if (*last_byte != 0xCF) {
        return CF_EBPF_DROP;
    }

    return CF_EBPF_PASS;
}

^{An eBPF program to filter packets according to a value in the application header.}

This program leverages application-specific information to create a more targeted mitigation than Cloudflare is capable of crafting on its own. Customers can now combine their proprietary knowledge with the capacity of Cloudflare’s global network to absorb and mitigate massive attacks better than ever before.

Going beyond firewalls: stateful tracking and challenges

Many pattern checks, like the one performed in the example above, can be accomplished with traditional firewalls. However, programs provide useful primitives that are not available in firewalls, including variables, conditional execution, loops, and procedure calls. But what really sets Programmable Flow Protection apart from other solutions is its ability to statefully track flows and challenge clients to prove they are real. A common type of attack that showcases these abilities is a replay attack.

In a replay attack, an attacker repeatedly sends packets that were valid at some point, and therefore conform to expected patterns of the traffic, but are no longer valid in the application’s current context. For example, the attacker could record some of their valid gameplay traffic and use a script to duplicate and transmit the same traffic at a very high rate.

With Programmable Flow Protection, a user can deploy a program that challenges suspicious clients and drops scripted traffic. We can extend our original example as follows:


#include 
#include 
#include 

#include "cf_ebpf_defs.h"
#include "cf_ebpf_helper.h"

uint64_t
cf_ebpf_main(void *state)
{
    // ...
 
    // Get the status of this source IP (statefully tracked)
    uint8_t status;
    if (cf_ebpf_get_source_ip_status(&status) != 0) {
        return CF_EBPF_DROP;
    }

    switch (status) {
        case NONE:
		// Issue a custom challenge to this source IP
             issue_challenge();
             cf_ebpf_set_source_ip_status(CHALLENGED);
             return CF_EBPF_DROP;


        case CHALLENGED:
		// Check if this packet passes the challenge
		// with custom logic
             if (verify_challenge()) {
                 cf_ebpf_set_source_ip_status(VERIFIED);
                 return CF_EBPF_PASS;
             } else {
                 cf_ebpf_set_source_ip_status(BLOCKED);
                 return CF_EBPF_DROP;
             }


        case VERIFIED:
		// This source IP has passed the challenge
		return CF_EBPF_PASS;

	 case BLOCKED:
		// This source IP has been blocked
		return CF_EBPF_DROP;

        default:
            return CF_EBPF_PASS;
    }


    return CF_EBPF_PASS;
}

^{An eBPF program to challenge UDP connections and statefully manage connections. This example has been simplified for illustration purposes.}

The program statefully tracks the source IP addresses it has seen and emits a packet with a cryptographic challenge back to unknown clients. A legitimate client running a valid gaming client is able to correctly solve the challenge and respond with proof, but the attacker’s script is not. Traffic from the attacker is marked as “blocked” and subsequent packets are dropped.

With these new abilities, customers can statefully track flows and make sure only real, verified clients can send traffic to their origin servers. Although we have focused the example on gaming, the potential use cases for this technology extend to any UDP-based protocol.

Get started today

We’re excited to offer the Programmable Flow Protection feature to Magic Transit Enterprise customers. Talk to your account manager to learn more about how you can enable Programmable Flow Protection to help keep your infrastructure safe.

We’re still in active development of the platform, and we’re excited to see what our users build next. If you are not yet a Cloudflare customer, let us know if you’d like to protect your network with Cloudflare and Programmable Flow Protection by signing up at this page: https://www.cloudflare.com/lp/programmable-flow-protection/.

MadeYouReset: An HTTP/2 vulnerability thwarted by Rapid Reset mitigations

Alex Forster — Thu, 14 Aug 2025 22:03:00 GMT

_{(Correction on August 19, 2025: This post was updated to correct and clarify details about the vulnerability and the HTTP/2 protocol.)}

On August 13, security researchers at Tel Aviv University disclosed a new HTTP/2 denial-of-service (DoS) vulnerability that they are calling MadeYouReset (CVE-2025-8671). This vulnerability exists in a limited number of unpatched HTTP/2 server implementations that do not accurately track use of server-sent stream resets, which can lead to resource consumption. If you’re using Cloudflare for HTTP DDoS mitigation, you’re already protected from MadeYouReset.

Cloudflare was informed of this vulnerability in May through a coordinated disclosure process, and we were able to confirm that our systems were not susceptible. We foresaw this sort of attack while mitigating the "Netflix vulnerabilities" in 2019, and added even stronger defenses in response to Rapid Reset (CVE-2023-44487) in 2023. MadeYouReset and Rapid Reset are two conceptually similar attacks that exploit a fundamental feature within the HTTP/2 specification (RFC 9113): stream resets. In the HTTP/2 protocol, a client initiates a bidirectional stream that carries an HTTP request/response exchange, represented as frames sent between the client and server. Typically, HEADERS and DATA frames are used for a complete exchange. Endpoints can use the RST_STREAM frame to prematurely terminate a stream, essentially cancelling operations and signalling that it won’t process any more request or response data. Furthermore, HTTP/2 requires that RST_STREAM is sent when there are protocol errors related to the stream. For example, section 6.1 of RFC 9113 requires that when a DATA frame is received under the wrong circumstances, "...the recipient MUST respond with a stream error (Section 5.4.2) of type STREAM_CLOSED".

The vulnerability exploited by both MadeYouReset and Rapid Reset lies in the potential for malicious actors to abuse this stream reset mechanism. By repeatedly causing stream resets, attackers can overwhelm a server's resources. While the server is attempting to process and respond to a multitude of requests, the rapid succession of resets forces it to expend computational effort on starting and then immediately discarding these operations. This can lead to resource exhaustion and impact the availability of the targeted server for legitimate users; as described previously, the main difference between the two attacks is that Rapid Reset exploits client-sent resets, while MadeYouReset exploits server-sent resets. It works by using a client to persuade a server into resetting streams via intentionally sending frames that trigger protocol violations, which in turn trigger stream errors.

RFC 9113 details a number of denial-of-service considerations. Fundamentally, the protocol provides many features with legitimate uses that can be exploited by attackers with nefarious intent. Implementations are advised to harden themselves: "An endpoint that doesn't monitor use of these features exposes itself to a risk of denial of service. Implementations SHOULD track the use of these features and set limits on their use."

Fortunately, the MadeYouReset vulnerability only impacts a relatively small number of HTTP/2 implementations. In most major HTTP/2 implementations already in widespread use today, the proactive measures taken to implement RFC 9113 guidance and counter Rapid Reset in 2023 have also provided substantial protection against MadeYouReset, limiting its potential impact and preventing a similarly disruptive event.

A note about Cloudflare’s Pingora and its users: Our open-sourced Pingora framework uses the popular Rust-language h2 library for its HTTP/2 support. Versions of h2 prior to 0.4.11 were potentially susceptible to MadeYouReset. Users of Pingora can patch their applications by updating their h2 crate version using the cargo update command. Pingora does not itself terminate inbound HTTP connections to Cloudflare’s network, meaning this vulnerability could not be exploited against Cloudflare’s infrastructure.

We would like to credit researchers Gal Bar Nahum, Anat Bremler-Barr, and Yaniv Harel of Tel Aviv University for discovering this vulnerability and thank them for their leadership in the coordinated disclosure process. Cloudflare always encourages security researchers to submit vulnerabilities like this to our HackerOne Bug Bounty program.

How Cloudflare auto-mitigated world record 3.8 Tbps DDoS attack

Manish Arora — Wed, 02 Oct 2024 13:00:00 GMT

Since early September, Cloudflare's DDoS protection systems have been combating a month-long campaign of hyper-volumetric L3/4 DDoS attacks. Cloudflare’s defenses mitigated over one hundred hyper-volumetric L3/4 DDoS attacks throughout the month, with many exceeding 2 billion packets per second (Bpps) and 3 terabits per second (Tbps). The largest attack peaked 3.8 Tbps — the largest ever disclosed publicly by any organization. Detection and mitigation was fully autonomous. The graphs below represent two separate attack events that targeted the same Cloudflare customer and were mitigated autonomously.

^{A mitigated 3.8 Terabits per second DDoS attack that lasted 65 seconds}

^{A mitigated 2.14 billion packet per second DDoS attack that lasted 60 seconds}

Cloudflare customers are protected

Cloudflare customers using Cloudflare’s HTTP reverse proxy services (e.g. Cloudflare WAF and Cloudflare CDN) are automatically protected.

Cloudflare customers using Spectrum and Magic Transit are also automatically protected. Magic Transit customers can further optimize their protection by deploying Magic Firewall rules to enforce a strict positive and negative security model at the packet layer.

Other Internet properties may not be safe

The scale and frequency of these attacks are unprecedented. Due to their sheer size and bits/packets per second rates, these attacks have the ability to take down unprotected Internet properties, as well as Internet properties that are protected by on-premise equipment or by cloud providers that just don’t have sufficient network capacity or global coverage to be able to handle these volumes alongside legitimate traffic without impacting performance.

Cloudflare, however, does have the network capacity, global coverage, and intelligent systems needed to absorb and automatically mitigate these monstrous attacks.

In this blog post, we will review the attack campaign and why its attacks are so severe. We will describe the anatomy of a Layer 3/4 DDoS attack, its objectives, and how attacks are generated. We will then proceed to detail how Cloudflare’s systems were able to autonomously detect and mitigate these monstrous attacks without impacting performance for our customers. We will describe the key aspects of our defenses, starting with how our systems generate real-time (dynamic) signatures to match the attack traffic all the way to how we leverage kernel features to drop packets at wire-speed.

Campaign analysis

We have observed this attack campaign targeting multiple customers in the financial services, Internet, and telecommunication industries, among others. This attack campaign targets bandwidth saturation as well as resource exhaustion of in-line applications and devices.

The attacks predominantly leverage UDP on a fixed port, and originated from across the globe with larger shares coming from Vietnam, Russia, Brazil, Spain, and the US.

The high packet rate attacks appear to originate from multiple types of compromised devices, including MikroTik devices, DVRs, and Web servers, orchestrated to work in tandem and flood the target with exceptionally large volumes of traffic. The high bitrate attacks appear to originate from a large number of compromised ASUS home routers, likely exploited using a CVE 9.8 (Critical) vulnerability that was recently discovered by Censys.

Russia	12.1%
Vietnam	11.6%
United States	9.3%
Spain	6.5%
Brazil	4.7%
France	4.7%
Romania	4.4%
Taiwan	3.4%
United Kingdom	3.3%
Italy	2.8%

^{Share of packets observed by source location}

Anatomy of DDoS attacks

Before we discuss how Cloudflare automatically detected and mitigated the largest DDoS attacks ever seen, it‘s important to understand the basics of DDoS attacks.

^{Simplified diagram of a DDoS attack}

The goal of a Distributed Denial of Service (DDoS) attack is to deny legitimate users access to a service. This is usually done by exhausting resources needed to provide the service. In the context of these recent Layer 3/4 DDoS attacks, that resource is CPU cycles and network bandwidth.

Exhausting CPU cycles

Processing a packet consumes CPU cycles. In the case of regular (non-attack) traffic, a legitimate packet received by a service will cause that service to perform some action, and different actions require different amounts of CPU processing. However, before a packet is even delivered to a service there is per-packet work that needs to be done. Layer 3 packet headers need to be parsed and processed to deliver the packet to the correct machine and interface. Layer 4 packet headers need to be processed and routed to the correct socket (if any). There may be multiple additional processing steps that inspect every packet. Therefore, if an attacker sends at a high enough packet rate, then they can potentially saturate the available CPU resources, denying service to legitimate users.

To defend against high packet rate attacks, you need to be able to inspect and discard the bad packets using as few CPU cycles as possible, leaving enough CPU to process the good packets. You can additionally acquire more, or faster, CPUs to perform the processing — but that can be a very lengthy process that bears high costs.

Exhausting network bandwidth

Network bandwidth is the total amount of data per time that can be delivered to a server. You can think of bandwidth like a pipe to transport water. The amount of water we can deliver through a drinking straw is less than what we could deliver through a garden hose. If an attacker is able to push more garbage data into the pipe than it can deliver, then both the bad data and the good data will be discarded upstream, at the entrance to the pipe, and the DDoS is therefore successful.

Defending against attacks that can saturate network bandwidth can be difficult because there is very little that can be done if you are on the downstream side of the saturated pipe. There are really only a few choices: you can get a bigger pipe, you can potentially find a way to move the good traffic to a new pipe that isn’t saturated, or you can hopefully ask the upstream side of the pipe to stop sending some or all of the data into the pipe.

Generating DDoS attacks

If we think about what this means from the attackers point of view you realize there are similar constraints. Just as it takes CPU cycles to receive a packet, it also takes CPU cycles to create a packet. If, for example, the cost to send and receive a packet were equal, then the attacker would need an equal amount of CPU power to generate the attack as we would need to defend against it. In most cases this is not true — there is a cost asymmetry, as the attacker is able to generate packets using fewer CPU cycles than it takes to receive those packets. However, it is worth noting that generating attacks is not free and can require a large amount of CPU power.

Saturating network bandwidth can be even more difficult for an attacker. Here the attacker needs to be able to output more bandwidth than the target service has allocated. They actually need to be able to exceed the capacity of the receiving service. This is so difficult that the most common way to achieve a network bandwidth attack is to use a reflection/amplification attack method, for example a DNS Amplification attack. These attacks allow the attacker to send a small packet to an intermediate service, and the intermediate service will send a large packet to the victim.

In both scenarios, the attacker needs to acquire or gain access to many devices to generate the attack. These devices can be acquired in a number of different ways. They may be server class machines from cloud providers or hosting services, or they can be compromised devices like DVRs, routers, and webcams that have been infected with the attacker’s malware. These machines together form the botnet.

How Cloudflare defends against large attacks

Now that we understand the fundamentals of how DDoS attacks work, we can explain how Cloudflare can defend against these attacks.

Spreading the attack surface using global anycast

The first not-so-secret ingredient is that Cloudflare’s network is built on anycast. In brief, anycast allows a single IP address to be advertised by multiple machines around the world. A packet sent to that IP address will be served by the closest machine. This means when an attacker uses their distributed botnet to launch an attack, the attack will be received in a distributed manner across the Cloudflare network. An infected DVR in Dallas, Texas will send packets to a Cloudflare server in Dallas. An infected webcam in London will send packets to a Cloudflare server in London.

^{Anycast vs. Unicast networks}

Our anycast network additionally allows Cloudflare to allocate compute and bandwidth resources closest to the regions that need them the most. Densely populated regions will generate larger amounts of legitimate traffic, and the data centers placed in those regions will have more bandwidth and CPU resources to meet those needs. Sparsely populated regions will naturally generate less legitimate traffic, so Cloudflare data centers in those regions can be sized appropriately. Since attack traffic is mainly coming from compromised devices, those devices will tend to be distributed in a manner that matches normal traffic flows sending the attack traffic proportionally to datacenters that can handle it. And similarly, within the datacenter, traffic is distributed across multiple machines.

Additionally, for high bandwidth attacks, Cloudflare’s network has another advantage. A large proportion of traffic on the Cloudflare network does not consume bandwidth in a symmetrical manner. For example, an HTTP request to get a webpage from a site behind Cloudflare will be a relatively small incoming packet, but produce a larger amount of outgoing traffic back to the client. This means that the Cloudflare network tends to egress far more legitimate traffic than we receive. However, the network links and bandwidth allocated are symmetrical, meaning there is an abundance of ingress bandwidth available to receive volumetric attack traffic.

Generating real-time signatures

By the time you’ve reached an individual server inside a datacenter, the bandwidth of the attack has been distributed enough that none of the upstream links are saturated. That doesn’t mean the attack has been fully stopped yet, since we haven’t dropped the bad packets. To do that, we need to sample traffic, qualify an attack, and create rules to block the bad packets.

Sampling traffic and dropping bad packets is the job of our l4drop component, which uses XDP (eXpress Data Path) and leverages an extended version of the Berkeley Packet Filter known as eBPF (extended BPF). This enables us to execute custom code in kernel space and process (drop, forward, or modify) each packet directly at the network interface card (NIC) level. This component helps the system drop packets efficiently without consuming excessive CPU resources on the machine.

^{Cloudflare DDoS protection system overview}

We use XDP to sample packets to look for suspicious attributes that indicate an attack. The samples include fields such as the source IP, source port, destination IP, destination port, protocol, TCP flags, sequence number, options, packet rate and more. This analysis is conducted by the denial of service daemon (dosd). Dosd holds our secret sauce. It has many filters which instruct it, based on our curated heuristics, when to initiate mitigation. To our customers, these filters are logically grouped by attack vectors and exposed as the DDoS Managed Rules. Our customers can customize their behavior to some extent, as needed.

As it receives samples from XDP, dosd will generate multiple permutations of fingerprints for suspicious traffic patterns. Then, using a data streaming algorithm, dosd will identify the most optimal fingerprints to mitigate the attack. Once an attack is qualified, dosd will push a mitigation rule inline as an eBPF program to surgically drop the attack traffic.

The detection and mitigation of attacks by dosd is done at the server level, at the data center level and at the global level — and it’s all software defined. This makes our network extremely resilient and leads to almost instant mitigation. There are no out-of-path “scrubbing centers” or “scrubbing devices”. Instead, each server runs the full stack of the Cloudflare product suite including the DDoS detection and mitigation component. And it is all done autonomously. Each server also gossips (multicasts) mitigation instructions within a data center between servers, and globally between data centers. This ensures that whether an attack is localized or globally distributed, dosd will have already installed mitigation rules inline to ensure a robust mitigation.

Strong defenses against strong attacks

Our software-defined, autonomous DDoS detection and mitigation systems run across our entire network. In this post we focused mainly on our dynamic fingerprinting capabilities, but our arsenal of defense systems is much larger. The Advanced TCP Protection system and Advanced DNS Protection system work alongside our dynamic fingerprinting to identify sophisticated and highly randomized TCP-based DDoS attacks and also leverages statistical analysis to thwart complex DNS-based DDoS attacks. Our defenses also incorporate real-time threat intelligence, traffic profiling, and machine learning classification as part of our Adaptive DDoS Protection to mitigate traffic anomalies.

Together, these systems, alongside the full breadth of the Cloudflare Security portfolio, are built atop of the Cloudflare network — one of the largest networks in the world — to ensure our customers are protected from the largest attacks in the world.

How we built Network Analytics v2

Alex Forster — Tue, 02 May 2023 13:00:52 GMT

Network Analytics v2 is a fundamental redesign of the backend systems that provide real-time visibility into network layer traffic patterns for Magic Transit and Spectrum customers. In this blog post, we'll dive into the technical details behind this redesign and discuss some of the more interesting aspects of the new system.

To protect Cloudflare and our customers against Distributed Denial of Service (DDoS) attacks, we operate a sophisticated in-house DDoS detection and mitigation system called dosd. It takes samples of incoming packets, analyzes them for attacks, and then deploys mitigation rules to our global network which drop any packets matching specific attack fingerprints. For example, a simple network layer mitigation rule might say “drop UDP/53 packets containing responses to DNS ANY queries”.

In order to give our Magic Transit and Spectrum customers insight into the mitigation rules that we apply to their traffic, we introduced a new reporting system called "Network Analytics" back in 2020. Network Analytics is a data pipeline that analyzes raw packet samples from the Cloudflare global network. At a high level, the analysis process involves trying to match each packet sample against the list of mitigation rules that dosd has deployed, so that it can infer whether any particular packet sample was dropped due to a mitigation rule. Aggregated time-series data about these packet samples is then rolled up into one-minute buckets and inserted into a ClickHouse database for long-term storage. The Cloudflare dashboard queries this data using our public GraphQL APIs, and displays the data to customers using interactive visualizations.

What was wrong with v1?

This original implementation of Network Analytics delivered a ton of value to customers and has served us well. However, in the years since it was launched, we have continued to significantly improve our mitigation capabilities by adding entirely new mitigation systems like Advanced TCP Protection (otherwise known as flowtrackd) and Magic Firewall. The original version of Network Analytics only reports on mitigations created by dosd, which meant we had a reporting system that was showing incomplete information.

Adapting the original version of Network Analytics to work with Magic Firewall would have been relatively straightforward. Since firewall rules are “stateless”, we can tell whether a firewall rule matches a packet sample just by looking at the packet itself. That’s the same thing we were already doing to figure out whether packets match dosd mitigation rules.

However, despite our efforts, adapting Network Analytics to work with flowtrackd turned out to be an insurmountable problem. flowtrackd is “stateful”, meaning it determines whether a packet is part of a legitimate connection by tracking information about the other packets it has seen previously. The original Network Analytics design is incompatible with stateful systems like this, since that design made an assumption that the fate of a packet can be determined simply by looking at the bytes inside it.

Rethinking our approach

Rewriting a working system is not usually a good idea, but in this case it was necessary since the fundamental assumptions made by the old design were no longer true. When starting over with Network Analytics v2, it was clear to us that the new design not only needed to fix the deficiencies of the old design, it also had to be flexible enough to grow to support future products that we haven’t even thought of yet. To meet this high bar, we needed to really understand the core principles of network observability.

In the world of on-premise networks, packets typically chain through a series of appliances that each serve their own special purposes. For example, a packet may first pass through a firewall, then through a router, and then through a load balancer, before finally reaching the intended destination. The links in this chain can be thought of as independent “network functions”, each with some well-defined inputs and outputs.

A key insight for us was that, if you squint a little, Cloudflare’s software architecture looks very similar to this. Each server receives packets and chains them through a series of independent and specialized software components that handle things like DDoS mitigation, firewalling, reverse proxying, etc.

After noticing this similarity, we decided to explore how people with traditional networks monitor them. Universally, the answer is either Netflow or sFlow.

Nearly all on-premise hardware appliances can be configured to send a stream of Netflow or sFlow samples to a centralized flow collector. Traditional network operators tend to take these samples at many different points in the network, in order to monitor each device independently. This was different from our approach, which was to take packet samples only once, as soon as they entered the network and before performing any processing on them.

Another interesting thing we noticed was that Netflow and sFlow samples contain more than just information about packet contents. They also contain lots of metadata, such as the interface that packets entered and exited on, whether they were passed or dropped, which firewall or ACL rule they hit, and more. The metadata format is also extensible, so that devices can include information in their samples which might not make sense for other samples to contain. This flexibility allows flow collectors to offer rich reporting without necessarily having to understand the functions that each device performs on a network.

The more we thought about what kind of features and flexibility we wanted in an analytics system, the more we began to appreciate the elegance of traditional network monitoring. We realized that we could take advantage of the similarities between Cloudflare’s software architecture and “network functions” by having each software component emit its own packet samples with its own context-specific metadata attached.

Even though it seemed counterintuitive for our software to emit multiple streams of packet samples this way, we realized through taking inspiration from traditional network monitoring that doing so was exactly how we could build the extensible and future-proof observability that we needed.

Design & implementation

The implementation of Network Analytics v2 could be broken down into two separate pieces of work. First, we needed to build a new data pipeline that could receive packet samples from different sources, then normalize those samples and write them to long-term storage. We called this data pipeline samplerd – the “sampler daemon”.

The samplerd pipeline is relatively small and straightforward. It implements a few different interfaces that other software can use to send it metadata-rich packet samples. It then normalizes these samples and forwards them for postprocessing and insertion into a ClickHouse database.

The other, larger piece of work was to modify existing Cloudflare systems and make them send packet samples to samplerd. The rest of this post will cover a few interesting technical challenges that we had to overcome to adapt these systems to work with samplerd.

l4drop

The first system that incoming packets enter is our xdp daemon, called xdpd. In a few words, xdpd manages the installation of multiple XDP programs: a packet sampler, l4drop and L4LB. l4drop is where many types of attacks are mitigated. Mitigations done at this level are very cheap, because they happen so early in the network stack.

Before introducing samplerd, these XDP programs were organized like this:

An incoming packet goes through a sampler that will emit a packet sample for some packets. It then enters l4drop, a set of programs that will decide the fate of a particular packet. Finally, L4LB is in charge of layer 4 load balancing.

It’s critical that the samples are emitted even for packets that get dropped further down in the pipeline, because that provides visibility into what’s dropped. That’s useful both from a customer perspective to have a more comprehensive view in dashboards but also to continuously adapt our mitigations as attacks change.

In l4drop’s original configuration, a packet sample is emitted prior to the mitigation decision. Thus, that sample can’t record the mitigation action that’s taken on that particular packet.

samplerd wants packet samples to include the mitigation outcome and other metadata that indicates why a particular mitigation decision was taken. For instance, a packet may be dropped because it matched an attack mitigation signature. Or it may pass because it matched a rate limiting rule and it was under the threshold for that rule. All of this is valuable information that needs to be shown to customers.

Given this requirement, the first idea we had was to simply move the sampler after l4drop and have l4drop just mark the packet as “to be dropped”, along with metadata for the reason why. The sampler component would then have all the necessary details to emit a sample with the final fate of the packet and its associated metadata. After emitting the sample, the sampler would drop or pass the packet.

However, this requires copying all the metadata associated with the dropping decision for every single packet, whether it will be sampled or not. The cost of this copying proved prohibitive considering that every packet entering Cloudflare goes through the xdpd programs.

So we went back to the drawing board. What we actually need to know when making a sampling decision is whether we need to copy the metadata. We only need to copy the metadata if a particular packet will be sampled. That’s why it made sense to effectively split the sampler into two parts by sandwiching the programs that make the mitigation decision together. First, we make the mitigation decision, then we go through the mitigation decision programs. These programs can then decide to copy metadata only when a packet will be sampled. They will however always mark a packet with a DROP or PASS mark. Then the sampler will check the mark for sampling and the DROP/PASS mark. Based on those marks, they’ll build a sample if necessary and drop or pass the packet.

Given how tightly the sampler is now coupled with the rest of l4drop, it’s not a standalone part of xdpd anymore and the final result looks like this:

iptables

Another of our mitigation layers is iptables. We use it for some types of mitigations that we can’t perform in l4drop, like stateful connection tracking. iptables mitigations are organized as a list of rules that an incoming packet will be evaluated against. It’s also possible for a rule to jump to another rule when some conditions are met. Some of these rules will perform rate limiting, which will only drop packets beyond a certain threshold. For instance, we might drop all packets beyond a 10 packet-per-second threshold.

Prior to the introduction of samplerd, our typical rules would match on some characteristics of the packet – say, the IP and port – and make a decision whether to immediately drop or pass the packet.

To adapt our iptables rules to samplerd, we need to make them emit annotated samples, so that we can know why a decision was taken. To this end, one idea would be to just make the rules which drop packets also emit a nflog sample with a certain probability. One of the issues with that approach has to do with rate limiting rules. A packet may match such a rule, but the packet may be under the threshold and so that packet gets passed further down the line. That doesn’t work because we also want to sample those passed packets too, since it’s important for a customer to know what was passed and dropped by the rate limiter. But since a packet that passes the rate limiter may be dropped by further rules down the line, it’ll have multiple chances to be sampled, causing oversampling of some parts of the traffic. That would introduce statistical distortions in the sampled data.

To solve this, we can once again separate these steps like we did in l4drop, and make several sets of rules. First, the sampling decision is made by the first set of rules. Then, the pass-or-drop decision is made by the second set of rules. Finally, the sample can be emitted (if necessary), and then the packet can be passed or dropped by the third set of rules.

To communicate between rules we use Linux packet markings. For instance, a mark will be placed on a packet to signal that the packet will be sampled, and another mark will signify that the packet matched a particular rule and that it needs to be dropped.

For incoming packets, the rule in charge of the random sampling decision is evaluated first. Then the mitigation rules are evaluated next, in a specific order. When one of those rules decides to drop a packet, it jumps straight to the last set of rules, which will emit a sample if necessary before dropping. If no mitigation rule matches, eventually packets fall through to the last set of rules, where they will match a generic pass rule. That rule will emit a sample if necessary and pass the packet down the stack for further processing. By organizing rules in stages this way, we won’t ever double-sample packets.

ClickHouse & GraphQL

Once the samplerd daemon has the samples from the various mitigation systems, it does some light processing and ships those samples to be stored in ClickHouse. This inserter further enriches the metadata present in the sample, for instance by identifying the account associated with a particular destination IP. It also identifies ongoing attacks and adds a unique attack ID to each sample that is part of an attack.

We designed the inserters so that we’ll never need to change the data once it has been written, so that we can sustain high levels of insertion. Part of how we achieved this was by using ClickHouse’s MergeTree table engine. However, for improved performance, we have also used a less common ClickHouse table engine, called AggregatingMergeTree. Let’s dive into this using a simplified example.

Each packet sample is stored in a table that looks like the below:

Attack ID	Dest IP	Dest Port	…	Sample Interval (SI)
abcd	1.1.1.1	53	…	1000
abcd	1.0.0.1	53	…	1000

The sample interval is the number of packets that went through between two samples, as we are using ABR.

These tables are then queried through the GraphQL API, either directly or by the dashboard. This required us to build a view of all the samples for a particular attack, to identify (for example) a fixed destination IP. These attacks may span days or even weeks and so these queries could potentially be costly and slow. For instance, a naive query to know whether the attack “abcd” has a fixed destination port or IP may look like this:

SELECT if(uniq(dest_ip) == 1, any(dest_ip), NULL), if(uniq(dest_port) == 1, any(dest_port), NULL)
FROM samples
WHERE attack_id = ‘abcd’

In the above query, we ask ClickHouse for a lot more data than we should need. We only really want to know whether there is one value or multiple values, yet we ask for an estimation of the number of unique values. One way to know if all values are the same (for values that can be ordered) is to check whether the maximum value is equal to the minimum. So we could rewrite the above query as:

SELECT if(min(dest_ip) == max(dest_ip), any(dest_ip), NULL), if(min(dest_port) == max(dest_port), any(dest_port), NULL)
FROM samples
WHERE attack_id = ‘abcd’

And the good news is that storing the minimum or the maximum takes very little space, typically the size of the column itself, as opposed to keeping the state that uniq() might require. It’s also very easy to store and update as we insert. So to speed up that query, we have added a precomputed table with running minimum and maximum using the AggregatingMergeTree engine. This is the special ClickHouse table engine that can compute and store the result of an aggregate function on a particular key. In our case, we will use the attackID as the key to group on, like this:

Attack ID	min(Dest IP)	max(Dest IP)	min(Dest Port)	max(Dest Port)	…	sum(SI)
abcd	1.0.0.1	1.1.1.1	53	53	…	2000

Note: this can be generalized to many aggregating functions like sum(). The constraint on the function is that it gives the same result whether it’s given the whole set all at once or whether we apply the function to the value it returned on a subset and another value from the set.

Then the query that we run can be much quicker and simpler by querying our small aggregating table. In our experience, that table is roughly 0.002% of the original data size, although admittedly all columns of the original table are not present.

And we can use that to build a SQL view that would look like this for our example:

SELECT if(min_dest_ip == max_dest_ip, min_dest_ip, NULL), if(min_dest_port == max_dest_port, min_dest_port, NULL)
FROM aggregated_samples
WHERE attack_id = ‘abcd’

Attack ID	Dest IP	Dest Port	…	Σ
abcd		53	…	2000

Implementation detail: in practice, it is possible that a row in the aggregated table gets split on multiple partitions. In that case, we will have two rows for a particular attack ID. So in production we have to take the min or max of all the rows in the aggregating table. That’s usually only three to four rows, so it’s still much faster than going over potentially thousands of samples spanning multiple days. In practice, the query we use in production is thus closer to:

SELECT if(min(min_dest_ip) == max(max_dest_ip), min(min_dest_ip), NULL), if(min(min_dest_port) == max(max_dest_port), min(min_dest_port), NULL)
FROM aggregated_samples
WHERE attack_id = ‘abcd’

Takeaways

Rewriting Network Analytics was a bet that has paid off. Customers now have a more accurate and higher fidelity view of their network traffic. Internally, we can also now troubleshoot and fine tune our mitigation systems much more effectively. And as we develop and deploy new mitigation systems in the future, we are confident that we can adapt our reporting in order to support them.

SLP: a new DDoS amplification vector in the wild

Alex Forster — Tue, 25 Apr 2023 13:07:56 GMT

Earlier today, April 25, 2023, researchers Pedro Umbelino at Bitsight and Marco Lux at Curesec published their discovery of CVE-2023-29552, a new DDoS reflection/amplification attack vector leveraging the SLP protocol. If you are a Cloudflare customer, your services are already protected from this new attack vector.

Service Location Protocol (SLP) is a “service discovery” protocol invented by Sun Microsystems in 1997. Like other service discovery protocols, it was designed to allow devices in a local area network to interact without prior knowledge of each other. SLP is a relatively obsolete protocol and has mostly been supplanted by more modern alternatives like UPnP, mDNS/Zeroconf, and WS-Discovery. Nevertheless, many commercial products still offer support for SLP.

Since SLP has no method for authentication, it should never be exposed to the public Internet. However, Umbelino and Lux have discovered that upwards of 35,000 Internet endpoints have their devices’ SLP service exposed and accessible to anyone. Additionally, they have discovered that the UDP version of this protocol has an amplification factor of up to 2,200x, which is the third largest discovered to-date.

Cloudflare expects the prevalence of SLP-based DDoS attacks to rise significantly in the coming weeks as malicious actors learn how to exploit this newly discovered attack vector.

Cloudflare customers are protected

If you are a Cloudflare customer, our automated DDoS protection system already protects your services from these SLP amplification attacks.To avoid being exploited to launch the attacks, if you are a network operator, you should ensure that you are not exposing the SLP protocol directly to the public Internet. You should consider blocking UDP port 427 via access control lists or other means. This port is rarely used on the public Internet, meaning it is relatively safe to block without impacting legitimate traffic. Cloudflare Magic Transit customers can use the Magic Firewall to craft and deploy such rules.

Cloudflare mitigates record-breaking 71 million request-per-second DDoS attack

Omer Yoachimik — Mon, 13 Feb 2023 18:37:51 GMT

This was a weekend of record-breaking DDoS attacks. Over the weekend, Cloudflare detected and mitigated dozens of hyper-volumetric DDoS attacks. The majority of attacks peaked in the ballpark of 50-70 million requests per second (rps) with the largest exceeding 71 million rps. This is the largest reported HTTP DDoS attack on record, more than 54% higher than the previous reported record of 46M rps in June 2022.

The attacks were HTTP/2-based and targeted websites protected by Cloudflare. They originated from over 30,000 IP addresses. Some of the attacked websites included a popular gaming provider, cryptocurrency companies, hosting providers, and cloud computing platforms. The attacks originated from numerous cloud providers, and we have been working with them to crack down on the botnet.

Record breaking attack: DDoS attack exceeding 71 million requests per second

Over the past year, we’ve seen more attacks originate from cloud computing providers. For this reason, we will be providing service providers that own their own autonomous system a free Botnet threat feed. The feed will provide service providers threat intelligence about their own IP space; attacks originating from within their autonomous system. Service providers that operate their own IP space can now sign up to the early access waiting list.

Is this related to the Super Bowl or Killnet?

No. This campaign of attacks arrives less than two weeks after the Killnet DDoS campaign that targeted healthcare websites. Based on the methods and targets, we do not believe that these recent attacks are related to the healthcare campaign. Furthermore, yesterday was the US Super Bowl, and we also do not believe that this attack campaign is related to the game event.

What are DDoS attacks?

Distributed Denial of Service attacks are cyber attacks that aim to take down Internet properties and make them unavailable for users. These types of cyberattacks can be very efficient against unprotected websites and they can be very inexpensive for the attackers to execute.

An HTTP DDoS attack usually involves a flood of HTTP requests towards the target website. The attacker’s objective is to bombard the website with more requests than it can handle. Given a sufficiently high amount of requests, the website’s server will not be able to process all of the attack requests along with the legitimate user requests. Users will experience this as website-load delays, timeouts, and eventually not being able to connect to their desired websites at all.

Illustration of a DDoS attack

To make attacks larger and more complicated, attackers usually leverage a network of bots — a botnet. The attacker will orchestrate the botnet to bombard the victim’s websites with HTTP requests. A sufficiently large and powerful botnet can generate very large attacks as we’ve seen in this case.

However, building and operating botnets requires a lot of investment and expertise. What is the average Joe to do? Well, an average Joe that wants to launch a DDoS attack against a website doesn’t need to start from scratch. They can hire one of numerous DDoS-as-a-Service platforms for as little as $30 per month. The more you pay, the larger and longer of an attack you’re going to get.

Why DDoS attacks?

Over the years, it has become easier, cheaper, and more accessible for attackers and attackers-for-hire to launch DDoS attacks. But as easy as it has become for the attackers, we want to make sure that it is even easier - and free - for defenders of organizations of all sizes to protect themselves against DDoS attacks of all types.

Unlike Ransomware attacks, Ransom DDoS attacks don’t require an actual system intrusion or a foothold within the targeted network. Usually Ransomware attacks start once an employee naively clicks an email link that installs and propagates the malware. There’s no need for that with DDoS attacks. They are more like a hit-and-run attack. All a DDoS attacker needs to know is the website’s address and/or IP address.

Is there an increase in DDoS attacks?

Yes. The size, sophistication, and frequency of attacks has been increasing over the past months. In our latest DDoS threat report, we saw that the amount of HTTP DDoS attacks increased by 79% year-over-year. Furthermore, the amount of volumetric attacks exceeding 100 Gbps grew by 67% quarter-over-quarter (QoQ), and the number of attacks lasting more than three hours increased by 87% QoQ.

But it doesn’t end there. The audacity of attackers has been increasing as well. In our latest DDoS threat report, we saw that Ransom DDoS attacks steadily increased throughout the year. They peaked in November 2022 where one out of every four surveyed customers reported being subject to Ransom DDoS attacks or threats.

Distribution of Ransom DDoS attacks by month

Should I be worried about DDoS attacks?

Yes. If your website, server, or networks are not protected against volumetric DDoS attacks using a cloud service that provides automatic detection and mitigation, we really recommend that you consider it.

Cloudflare customers shouldn’t be worried, but should be aware and prepared. Below is a list of recommended steps to ensure your security posture is optimized.

What steps should I take to defend against DDoS attacks?

Cloudflare’s systems have been automatically detecting and mitigating these DDoS attacks.

Cloudflare offers many features and capabilities that you may already have access to but may not be using. So as extra precaution, we recommend taking advantage of these capabilities to improve and optimize your security posture:

Ensure all DDoS Managed Rules are set to default settings (High sensitivity level and mitigation actions) for optimal DDoS activation.
Cloudflare Enterprise customers that are subscribed to the Advanced DDoS Protection service should consider enabling Adaptive DDoS Protection, which mitigates attacks more intelligently based on your unique traffic patterns.
Deploy firewall rules and rate limiting rules to enforce a combined positive and negative security model. Reduce the traffic allowed to your website based on your known usage.
Ensure your origin is not exposed to the public Internet (i.e., only enable access to Cloudflare IP addresses). As an extra security precaution, we recommend contacting your hosting provider and requesting new origin server IPs if they have been targeted directly in the past.
Customers with access to Managed IP Lists should consider leveraging those lists in firewall rules. Customers with Bot Management should consider leveraging the bot scores within the firewall rules.
Enable caching as much as possible to reduce the strain on your origin servers, and when using Workers, avoid overwhelming your origin server with more subrequests than necessary.
Enable DDoS alerting to improve your response time.

Preparing for the next DDoS wave

Defending against DDoS attacks is critical for organizations of all sizes. While attacks may be initiated by humans, they are executed by bots — and to play to win, you must fight bots with bots. Detection and mitigation must be automated as much as possible, because relying solely on humans to mitigate in real time puts defenders at a disadvantage. Cloudflare’s automated systems constantly detect and mitigate DDoS attacks for our customers, so they don’t have to. This automated approach, combined with our wide breadth of security capabilities, lets customers tailor the protection to their needs.

We've been providing unmetered and unlimited DDoS protection for free to all of our customers since 2017, when we pioneered the concept. Cloudflare's mission is to help build a better Internet. A better Internet is one that is more secure, faster, and reliable for everyone - even in the face of DDoS attacks.

PIPEFAIL: How a missing shell option slowed Cloudflare down

Alex Forster — Tue, 05 Apr 2022 12:57:28 GMT

At Cloudflare, we’re used to being the fastest in the world. However, for approximately 30 minutes last December, Cloudflare was slow. Between 20:10 and 20:40 UTC on December 16, 2021, web requests served by Cloudflare were artificially delayed by up to five seconds before being processed. This post tells the story of how a missing shell option called “pipefail” slowed Cloudflare down.

Background

Before we can tell this story, we need to introduce you to some of its characters.

Cloudflare’s Front Line protects millions of users from some of the largest attacks ever recorded. This protection is orchestrated by a sidecar service called dosd, which analyzes traffic and looks for attacks. When dosd detects an attack, it provides Front Line with a list of attack fingerprints that describe how Front Line can match and block the attack traffic.

Instances of dosd run on every Cloudflare server, and they communicate with each other using a peer-to-peer mesh to identify malicious traffic patterns. This decentralized design allows dosd to perform analysis with much higher fidelity than is possible with a centralized system, but its scale also imposes some strict performance requirements. To meet these requirements, we need to provide dosd with very fast access to large amounts of configuration data, which naturally means that dosd depends on Quicksilver. Cloudflare developed Quicksilver to manage configuration data and replicate it around the world in milliseconds, allowing it to be accessed by services like dosd in microseconds.

One piece of configuration data that dosd needs comes from the Addressing API, which is our authoritative IP address management service. The addressing data it provides is important because dosd uses it to understand what kind of traffic is expected on particular IPs. Since addressing data doesn’t change very frequently, we use a simple Kubernetes cron job to query it at 10 minutes past each hour and write it into Quicksilver, allowing it to be efficiently accessed by dosd.

With this context, let’s walk through the change we made on December 16 that ultimately led to the slowdown.

The Change

Approximately once a week, all of our Bug Fixes and Performance Improvements to the Front Line codebase are released to the network. On December 16, the Front Line team released a fix for a subtle bug in how the code handled compression in the presence of a Cache-Control: no-transform header. Unfortunately, the team realized pretty quickly that this fix actually broke some customers who had started depending on that buggy behavior, so the team decided to roll back the release and work with those customers to correct the issue.

Here’s a graph showing the progression of the rollback. While most releases and rollbacks are fully automated, this particular rollback needed to be performed manually due to its urgency. Since this was a manual rollback, SREs decided to perform it in two batches as a safety measure. The first batch went to our smaller tier 2 and 3 data centers, and the second batch went to our larger tier 1 data centers.

SREs started the first batch at 19:25 UTC, and it completed in about 30 minutes. Then, after verifying that there were no issues, they started the second batch at 20:10. That’s when the slowdown started.

The Slowdown

Within minutes of starting the second batch of rollbacks, alerts started firing. “Traffic levels are dropping.” “CPU utilization is dropping.” “A P0 incident has been automatically declared.” The timing could not be a coincidence. Somehow, a deployment of known-good code, which had been limited to a subset of the network and which had just been successfully performed 40 minutes earlier, appeared to be causing a global problem.

A P0 incident is an “all hands on deck” emergency, so dozens of Cloudflare engineers quickly began to assess impact to their services and test their theories about the root cause. The rollback was paused, but that did not fix the problem. Then, approximately 10 minutes after the start of the incident, my team – the DOS team – received a concerning alert: “dosd is not running on numerous servers.” Before that alert fired we had been investigating whether the slowdown was caused by an unmitigated attack, but this required our immediate attention.

Based on service logs, we were able to see that dosd was panicking because the customer addressing data in Quicksilver was corrupted in some way. Remember: the data in this Quicksilver key is important. Without it, dosd could not make correct choices anymore, so it refused to continue.

Once we realized that the addressing data was corrupted, we had to figure out how it was corrupted so that we could fix it. The answer turned out to be pretty obvious: the Quicksilver key was completely empty.

Following the old adage – “did you try restarting it?” – we decided to manually re-run the Kubernetes cron job that populates this key and see what happened. At 20:40 UTC, the cron job was manually triggered. Seconds after it completed, dosd started running again, and traffic levels began returning to normal. We confirmed that the Quicksilver key was no longer empty, and the incident was over.

The Aftermath

Despite fixing the problem, we still didn’t really understand what had just happened.

Why was the Quicksilver key empty?

It was urgent that we quickly figure out how an empty value was written into that Quicksilver key, because for all we knew, it could happen again at any moment.

We started by looking at the Kubernetes cron job, which turned out to have a bug:

This cron job is implemented using a small Bash script. If you’re unfamiliar with Bash (particularly shell pipelining), here’s what it does:

First, the dos-make-addr-conf executable runs. Its job is to query the Addressing API for various bits of JSON data and serialize it into a Toml document, which gets written Into config.toml. Afterward, that Toml is passed as input into the dosctl executable, whose job is to simply write it into a Quicksilver key called template_vars.

Can you spot the bug? Here’s a hint: what happens if dos-make-addr-conf fails for some reason and exits with a non-zero error code? It turns out that, by default, the shell pipeline ignores the error code and continues executing! This means that the output of dos-make-addr-conf (which could be empty) gets unconditionally written into dosctl and used as the value of the template_vars key, regardless of whether dos-make-addr-conf succeeded or failed.

30 years ago, when the first users of Bourne shell were burned by this problem, a shell option called “pipefail” was introduced. Enabling this option changes the shell’s behavior so that, when any command in a pipeline series fails, the entire pipeline fails. However, this option is not enabled by default, so it’s widely recommended as best practice that all scripts should start by enabling this (and a few other) options.

Here’s the fixed version of that cron job:

This bug was particularly insidious because dosd actually did attempt to gracefully handle the case where this Quicksilver key contained invalid Toml. However, an empty string is a perfectly valid Toml document. If an error message had been accidentally written into this Quicksilver key instead of an empty string, then dosd would have rejected the update and continued to use the previous value.

Why did that cause the Front Line to slow down?

We had figured out how an empty key could be written into Quicksilver, and we were confident that it wouldn’t happen again. However, we still needed to untangle how that empty key caused such a severe incident.

As I mentioned earlier, the Front Line relies on dosd to tell it how to mitigate attacks, but it doesn’t depend on dosd directly to serve requests. Instead, once every few seconds, the Front Line asynchronously asks dosd for new attack fingerprints and stores them in an in-memory cache. This cache is consulted while serving each request, and if dosd ever fails to provide fresh attack fingerprints, then the stale fingerprints will continue to be used instead. So how could this have caused the impact that we saw?

As part of the rollback process, the Front Line’s code needed to be reloaded. Reloading this code implicitly flushed the in-memory caches, including the attack fingerprint data from dosd. The next time that a request tried to consult with the cache, the caching layer realized that it had no attack fingerprints to return and a “cache miss” happened.

To handle a cache miss, the caching layer tried to reach out to dosd, and this is when the slowdown happened. While the caching layer was waiting for dosd to reply, it blocked all pending requests from progressing. Since dosd wasn’t running, the attempt eventually timed out after five seconds when the caching layer gave up. But in the meantime, each pending request was stuck waiting for the timeout to happen. Once it did, all the pending requests that were queued up over the five-second timeout period became unblocked and were finally allowed to progress. This cycle repeated over and over again every five seconds on every server until the dosd failure was resolved.

To trigger this slowdown, not only did dosd have to fail, but the Front Line’s in-memory cache had to also be flushed at the same time. If dosd had failed, but the Front Line’s cache had not been flushed, then the stale attack fingerprints would have remained in the cache and request processing would not have been impacted.

Why didn’t the first rollback cause this problem?

These two batches of rollbacks were performed by forcing servers to run a Salt highstate. When each batch was executed, thousands of servers began running highstates at the same time. The highstate process involves, among other things, contacting the Addressing API in order to retrieve various bits of customer addressing information.

The first rollback started at 19:25 UTC, and the second rollback started 45 minutes later at 20:10. Remember how I mentioned that our Kubernetes cron job only runs on the 10th minute of every hour? At 21:10 – exactly the time that our cron job started executing – thousands of servers also began to highstate, flooding the Addressing API with requests. All of these requests were queued up and eventually served, but it took the Addressing API a few minutes to work through the backlog. This delay was long enough to cause our cron job to time out, and, due to the “pipefail” bug, inadvertently clobber the Quicksilver key that it was responsible for updating.

To trigger the “pipefail” bug, not only did we have to flood the Addressing API with requests, we also had to do it at exactly 10 minutes after the hour. If SREs had started the second batch of rollbacks a few minutes earlier or later, this bug would have continued to lay dormant.

Lessons Learned

This was a unique incident where a chain of small or unlikely failures cascaded into a severe and painful outage that we deeply regret. In response, we have hardened each link in the chain:

A manual rollback inadvertently triggered the thundering herd problem, which overwhelmed the Addressing API. We have since significantly scaled out the Addressing API, so that it can handle high request rates if it ever again has to.
An error in a Kubernetes cron job caused invalid data to be written to Quicksilver. We have since made sure that, when this cron job fails, it is no longer possible for that failure to clobber the Quicksilver key.
dosd did not correctly handle all possible error conditions when loading configuration data from Quicksilver, causing it to fail. We have since taken these additional conditions into account where necessary, so that dosd will gracefully degrade in the face of corrupt Quicksilver data.
The Front Line had an unexpected dependency on dosd, which caused it to fail when dosd failed. We have since removed all such dependencies, and the Front Line will now gracefully survive dosd failures.

More broadly, this incident has served as an example to us of why code and systems must always be resilient to failure, no matter how unlikely that failure may seem.

CVE-2022-26143: A Zero-Day vulnerability for launching UDP amplification DDoS attacks

Omer Yoachimik — Tue, 08 Mar 2022 15:22:13 GMT

A zero-day vulnerability in the Mitel MiCollab business phone system has recently been discovered (CVE-2022-26143). This vulnerability, called TP240PhoneHome, which Cloudflare customers are already protected against, can be used to launch UDP amplification attacks. This type of attack reflects traffic off vulnerable servers to victims, amplifying the amount of traffic sent in the process by an amplification factor of 220 billion percent in this specific case.

Cloudflare has been actively involved in investigating the TP240PhoneHome exploit, along with other members of the InfoSec community. Read our joint disclosure here for more details. As far as we can tell, the vulnerability has been exploited as early as February 18, 2022. We have deployed emergency mitigation rules to protect Cloudflare customers against the amplification DDoS attacks.

Mitel has been informed of the vulnerability. As of February 22, they have issued a high severity security advisory advising their customers to block exploitation attempts using a firewall, until a software patch is made available. Cloudflare Magic Transit customers can use the Magic Firewall to block external traffic to the exposed Mitel UDP port 10074 by following the example in the screenshot below, or by pasting the following expression into their Magic Firewall rule editor and selecting the Block action:

(udp.dstport eq 10074).

Creating a Magic Firewall rule to block traffic to port 10074

To learn more, register for our webinar on March 23rd, 2022.

Exploiting the vulnerability to launch DDoS attacks

Mitel Networks is based in Canada and provides business communications and collaboration products to over 70 million business users around the world. Amongst their enterprise collaboration products is the aforementioned Mitel MiCollab platform, known to be used in critical infrastructure such as municipal governments, schools, and emergency services. The vulnerability was discovered in the Mitel MiCollab platform.

The vulnerability manifests as an unauthenticated UDP port that is incorrectly exposed to the public Internet. The call control protocol running on this port can be used to, amongst other things, issue the debugging command startblast. This command does not place real telephone calls; rather, it simulates a “blast” of calls in order to test the system. For each test call that is made, two UDP packets are emitted in response to the issuer of the command.

According to the security advisory, the exploit can “allow a malicious actor to gain unauthorized access to sensitive information and services, cause performance degradations or a denial of service condition on the affected system. If exploited with a denial of service attack, the impacted system may cause significant outbound traffic impacting availability of other services.”

Since this is an unauthenticated and connectionless UDP-based protocol, you can use spoofing to direct the response traffic toward any IP and port number — and by doing so, reflect and amplify a DDoS attack to the victim.

We’ve mainly focused on the amplification vector because it can be used to hurt the whole Internet, but the phone systems themselves can likely be hurt in other ways with this vulnerability. This UDP call control port offers many other commands. With some work, it’s likely that you could use this UDP port to commit toll fraud, or to simply render the phone system inoperable. We haven’t assessed these other possibilities, because we do not have access to a device that we can safely test with.

The good news

Fortunately, only a few thousand of these devices are improperly exposed to the public Internet, meaning that this vector can “only” achieve several hundred million packets per second total. This volume of traffic can cause major outages if you’re not protected by an always-on automated DDoS protection service, but it’s nothing to be concerned with if you are.

Furthermore, an attacker can't run multiple commands at the same time. Instead, the server queues up commands and executes them serially. The fact that you can only launch one attack at a time from these devices, mixed with the fact that you can make that attack for many hours, has fascinating implications. If an attacker chooses to start an attack by specifying a very large number of packets, then that box is “burned” – it can’t be used to attack anyone else until the attack completes.

How Cloudflare detects and mitigates DDoS attacks

To defend organizations against DDoS attacks, we built and operate software-defined systems that run autonomously. They automatically detect and mitigate DDoS attacks across our entire network.

Initially, traffic is routed through the Internet via BGP Anycast to the nearest Cloudflare edge data center. Once the traffic reaches our data center, our DDoS systems sample it asynchronously allowing for out-of-path analysis of traffic without introducing latency penalties.

The analysis is done using data streaming algorithms. Packet samples are compared to the fingerprints and multiple real-time signatures are created based on the dynamic masking of various fingerprint attributes. Each time another packet matches one of the signatures, a counter is increased. When the system qualifies an attack, i.e., the activation threshold is reached for a given signature, a mitigation rule is compiled and pushed inline. The mitigation rule includes the real-time signature and the mitigation action, e.g., drop.

You can read more about our autonomous DDoS protection systems and how they work in our joint-disclosure technical blog post.

Helping build a better Internet

Cloudflare’s mission is to help build a better Internet. A better Internet is one that is more secure, faster, and reliable for everyone — even in the face of DDoS attacks and emerging zero-day threats. As part of our mission, since 2017, we’ve been providing unmetered and unlimited DDoS protection for free to all of our customers. Over the years, it has become increasingly easier for attackers to launch DDoS attacks. To counter the attacker’s advantage, we want to make sure that it is also easy and free for organizations of all sizes to protect themselves against DDoS attacks of all types.

Not using Cloudflare yet? Start now.

CVE-2022-26143: TP240PhoneHome reflection/amplification DDoS attack vector

Alex Forster — Tue, 08 Mar 2022 14:59:58 GMT

Beginning in mid-February 2022, security researchers, network operators, and security vendors observed a spike in DDoS attacks sourced from UDP port 10074 targeting broadband access ISPs, financial institutions, logistics companies, and organizations in other vertical markets.

Upon further investigation, it was determined that the devices abused to launch these attacks are MiCollab and MiVoice Business Express collaboration systems produced by Mitel, which incorporate TP-240 VoIP- processing interface cards and supporting software; their primary function is to provide Internet-based site-to-site voice connectivity for PBX systems.

Approximately 2600 of these systems have been incorrectly provisioned so that an unauthenticated system test facility has been inadvertently exposed to the public Internet, allowing attackers to leverage these PBX VoIP gateways as DDoS reflectors/amplifiers.

Mitel is aware that these systems are being abused to facilitate high-pps (packets-per-second) DDoS attacks, and have been actively working with customers to remediate abusable devices with patched software that disables public access to the system test facility.

In this blog, we will further explore the observed activity, explain how the driver has been abused, and share recommended mitigation steps. This research was created cooperatively among a team of researchers from Akamai SIRT, Cloudflare, Lumen Black Lotus Labs, NETSCOUT ASERT, TELUS, Team Cymru, and The Shadowserver Foundation.

DDoS attacks in the wild

While spikes of network traffic associated with the vulnerable service were observed on January 8th and February 7,th 2022, we believe the first actual attacks leveraging the exploit began on February 18th.

Observed attacks were primarily predicated on packets-per-second, or throughput, and appeared to be UDP reflection/amplification attacks sourced from UDP/10074 that were mainly directed towards destination ports UDP/80 and UDP/443. The single largest observed attack of this type preceding this one was approximately 53 Mpps and 23 Gbps. The average packet size for that attack was approximately 60 bytes, with an attack duration of approximately 5 minutes. The amplified attack packets are not fragmented.

This particular attack vector differs from most UDP reflection/amplification attack methodologies in that the exposed system test facility can be abused to launch a sustained DDoS attack of up to 14 hours in duration by means of a single spoofed attack initiation packet, resulting in a record-setting packet amplification ratio of 4,294,967,296:1. A controlled test of this DDoS attack vector yielded more than 400 Mmpps of sustained DDoS attack traffic.

It should be noted that this single-packet attack initiation capability has the effect of precluding network operator traceback of the spoofed attack initiator traffic. This helps mask the attack traffic generation infrastructure, making it less likely that the attack origin can be traced compared with other UDP reflection/amplification DDoS attack vectors.

Abusing the tp240dvr driver

The abused service on affected Mitel systems is called tp240dvr (“TP-240 driver”) and appears to run as a software bridge to facilitate interactions with TDM/VoIP PCI interface cards. The service listens for commands on UDP/10074 and is not meant to be exposed to the Internet, as confirmed by the manufacturer of these devices. It is this exposure to the Internet that ultimately allows it to be abused.

The tp240dvr service exposes an unusual command that is designed to stress-test its clients in order to facilitate debugging and performance testing. This command can be abused to cause the tp240dvr service to send this stress-test to attack victims. The traffic consists of a high rate of short informative status update packets that can potentially overwhelm victims and cause the DDoS scenario.

This command can also be abused by attackers to launch very high-throughput attacks. Attackers can use specially-crafted commands to cause the tp240dvr service to send larger informative status update packets, significantly increasing the amplification ratio.

By extensively testing isolated virtual TP-240-based systems in a lab setting, researchers were able to cause these devices to generate massive amounts of traffic in response to comparatively small request payloads. We will cover this attack scenario in greater technical depth in the following sections.

Calculating the potential attack impact

As previously mentioned, amplification via this abusable test facility differs substantially from how it is accomplished with most other UDP reflection/amplification DDoS vectors. Typically, reflection/amplification attacks require the attacker to continuously transmit malicious payloads to abusable nodes for as long as they wish to attack the victim. In the case of TP-240 reflection/amplification, this continuous transmission is not necessary to launch high-impact DDoS attacks.

Instead, an attacker leveraging TP-240 reflection/amplification can launch a high-impact DDoS attack using a single packet. Examination of the tp240dvr binary reveals that, due to its design, an attacker can theoretically cause the service to emit 2,147,483,647 responses to a single malicious command. Each response generates two packets on the wire, leading to approximately 4,294,967,294 amplified attack packets being directed toward the attack victim.

For each response to a command, the first packet contains a counter that increments with each sent response. As the counter value increments, the size of this first packet will grow from 36 bytes to 45 bytes. The second packet contains diagnostic output from the function, which can be influenced by the attacker. By optimizing each initiator packet to maximize the size of the second packet, every command will result in amplified packets that are up to 1,184 bytes in length.

In theory, a single abusable node generating the upper limit of 4,294,967,294 packets at a rate of 80kpps would result in an attack duration of roughly 14 hours. Over the course of the attack, the “counter” packets alone would generate roughly 95.5GB of amplified attack traffic destined for the targeted network. The maximally-padded “diagnostic output” packets would account for an additional 2.5TB of attack traffic directed towards the target.

This would yield a sustained flood of just under 393mb/sec of attack traffic from a single reflector/amplifier, all resulting from a single spoofed attack initiator packet of only 1,119 bytes in length. This results in a nearly unimaginable amplification ratio of 2,200,288,816:1 — a multiplier of 220 billion percent, triggered by a single packet.

Upper boundaries of attack volume and simultaneity

The tp240dvr service processes commands using a single thread. This means they can only process a single command at a time, and thus can only be used to launch one attack at a time. In the example scenario presented above, during the 14 hours that the abused device would be attacking the target, it cannot be leveraged to attack any other target. This is somewhat unique in the context of DDoS reflection/amplification vectors.

Although this characteristic also causes the tp240dvr service to be unavailable to legitimate users, it is much preferable to having these devices be leveraged in parallel by multiple attackers — and leaving legitimate operators of these systems to wonder why their outbound Internet data capacity is being consumed at much higher rates.

Additionally, it appears these devices are on relatively low-powered hardware, in terms of their traffic-generation capabilities. On an Internet where 100/Gbps links, dozens of CPU cores, and multi-threading capabilities have become commonplace, we can all be thankful this abusable service is not found on top-of-the-line hardware platforms capable of individually generating millions of packets per second, and running with thousands of parallelized threads.

Lastly, it is also good news that of the tens of thousands of these devices, which have been purchased and deployed historically by governments, commercial enterprises, and other organizations worldwide, a relatively small number of them have been configured in a manner that leaves them in this abusable state, and of those, many have been properly secured and taken offline from an attacker’s perspective.

Collateral impact

The collateral impact of TP-240 reflection/amplification attacks is potentially significant for organizations with Internet-exposed Mitel MiCollab and MiVoice Business Express collaboration systems that are abused as DDoS reflectors/amplifiers.

This may include partial or full interruption of voice communications through these systems, as well as additional service disruption due to transit capacity consumption, state-table exhaustion of NATs, and stateful firewalls, etc.

Wholesale filtering of all UDP/10074-sourced traffic by network operators may potentially overblock legitimate Internet traffic, and is therefore contraindicated.

Recommended actions

TP-240 reflection/amplification DDoS attacks are sourced from UDP/10074 and destined for the UDP port of the attacker’s choice. This amplified attack traffic can be detected, classified, traced back, and safely mitigated using standard DDoS defense tools and techniques.

Flow telemetry and packet capture via open-source and commercial analysis systems can alert network operators and end customers of TP-240 reflection/amplification attacks.

Network access control lists (ACLs), flowspec, destination-based remotely triggered blackhole (D/RTBH), source-based remotely triggered blackhole (S/RTBH), and intelligent DDoS mitigation systems can be used to mitigate these attacks.

Network operators should perform reconnaissance to identify and facilitate remediation of abusable TP-240 reflectors/amplifiers on their networks and/or the networks of their customers. Operators of Mitel MiCollab and MiVoice Business Express collaboration systems should proactively contact Mitel in order to receive specific remediation instructions from the vendor.

Organizations with business-critical public-facing Internet properties should ensure that all relevant network infrastructure, architectural, and operational Best Current Practices (BCPs) have been implemented, including situationally specific network access policies that only permit Internet traffic via required IP protocols and ports. Internet access network traffic to/from internal organizational personnel should be isolated from Internet traffic to/from public-facing Internet properties, and served via separate upstream Internet transit links.

DDoS defenses for all public-facing Internet properties and supporting infrastructure should be implemented in a situationally appropriate manner, including periodic testing to ensure that any changes to the organization’s servers/services/applications are incorporated into its DDoS defense plan.

It is imperative that organizations operating mission-critical public-facing Internet properties and/or infrastructure ensure that all servers/services/application/datastores/infrastructure elements are protected against DDoS attack, and are included in periodic, realistic tests of the organization’s DDoS mitigation plan. Critical ancillary supporting services such as authoritative and recursive DNS servers must be included in this plan.

Network operators should implement ingress and egress source address validation in order to prevent attackers from initiating reflection/amplification DDoS attacks.

All potential DDoS attack mitigation measures described in this document MUST be tested and customized in a situationally appropriate manner prior to deployment on production networks.

Mitigating factors

Operators of Internet-exposed TP-240-based Mitel MiCollab and MiVoice Business Express collaboration systems can prevent abuse of their systems to launch DDoS attacks by blocking incoming Internet traffic destined for UDP/10074 via access control lists (ACLs), firewall rules, and other standard network access control policy enforcement mechanisms.

Mitel have provided patched software versions that prevent TP-240-equipped MiCollab and MiVoice Business Express collaboration systems from being abused as DDoS reflectors/amplifiers by preventing exposure of the service to the Internet. Mitel customers should contact the vendor for remediation instructions.

Collateral impact to abusable TP-240 reflectors/amplifiers can alert network operators and/or end-customers to remove affected systems from “demilitarized zone” (DMZ) networks or Internet Data Centers (IDCs), or to disable relevant UDP port-forwarding rules that allow specific UDP/10074 traffic sourced from the public Internet to reach these devices, thereby preventing them from being abused to launch reflection/amplification DDoS attacks.

The amplified attack traffic is not fragmented, so there is no additional attack component consisting of non-initial fragments, as is the case with many other UDP reflection/amplification DDoS vectors.

Implementation of ingress and egress source-address validation (SAV; also known as anti-spoofing) can prevent attackers from launching reflection/amplification DDoS attacks.

Conclusion

Unfortunately, many abusable services that should not be exposed to the public Internet are nevertheless left open for attackers to exploit. This scenario is yet another example of real-world deployments not adhering to vendor guidance. Vendors can prevent this situation by adopting “safe by default” postures on devices before shipping.

Reflection/amplification DDoS attacks would be impossible to launch if all network operators implemented ingress and egress source-address validation (SAV, also known as anti-spoofing). The ability to spoof the IP address(es) of the intended attack target(s) is required to launch such attacks. Service providers must continue to implement SAV in their own networks, and require that their downstream customers do so.

As is routinely the case with newer DDoS attack vectors, it appears that after an initial period of employment by advanced attackers with access to bespoke DDoS attack infrastructure, TP-240 reflection/amplification has been weaponized and added to the arsenals of so-called “booter/stresser” DDoS-for-hire services, placing it within the reach of the general attacker population.

Collaboration across the operational, research, and vendor communities is central to the continued viability of the Internet. The quick response to and ongoing remediation of this high-impact DDoS attack vector has only been possible as a result of such collaboration. Organizations with a vested interest in the stability and resiliency of the Internet should embrace and support cross-industry cooperative efforts as a core principle.

The combined efforts of the research and mitigation task force demonstrates that successful collaboration across industry peers to quickly remediate threats to availability and resiliency is not only possible, but is also increasingly critical for the continued viability of the global Internet.

Sources

https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-26143/https://www.mitel.com/en-ca/support/security-advisories/mitel-product-security-advisory-22-0001 https://www.cisa.gov/uscert/ncas/alerts/TA14-017A https://www.senki.org/ddos-attack-preparation-workbook/https://www.manrs.org/resources/https://www.rfc-editor.org/info/bcp38 https://www.rfc-editor.org/info/bcp84 https://datatracker.ietf.org/doc/html/rfc7039

Research and mitigation task force contributors

Researchers from the following organizations have contributed to the findings and recommendations described in this document:

In particular, the Mitigation Task Force would like to cite Mitel for their exemplary cooperation, rapid response, and ongoing participation in remediation efforts. Mitel quickly created and disseminated patched software, worked with their customers and partners to update affected systems, and supplied valuable expertise as the Task Force worked to formulate this document.

Update on recent VoIP attacks: What should I do if I’m attacked?

Omer Yoachimik — Thu, 07 Oct 2021 02:20:59 GMT

Attackers continue targeting VoIP infrastructure around the world. In our blog from last week, May I ask who’s calling, please? A recent rise in VoIP DDoS attacks, we reviewed how the SIP protocol works, ways it can be abused, and how Cloudflare can help protect against attacks on VoIP infrastructure without impacting performance.

Cloudflare’s network stands in front of some of the largest, most performance-sensitive voice and video providers in the world, and is uniquely well suited to mitigating attacks on VoIP providers.

Because of the sustained attacks we are observing, we are sharing details on recent attack patterns, what steps they should take before an attack, and what to do after an attack has taken place.

Below are three of the most common questions we’ve received from companies concerned about attacks on their VoIP systems, and Cloudflare’s answers.

Question #1: How is VoIP infrastructure being attacked?

The attackers primarily use off-the-shelf booter services to launch attacks against VoIP infrastructure. The attack methods being used are not novel, but the persistence of the attacker and their attempts to understand the target’s infrastructure are.

Attackers have used various attack vectors to probe the existing defenses of targets and try to infiltrate any existing defenses to disrupt VoIP services offered by certain providers. In some cases, they have been successful. HTTP attacks against API gateways and the corporate websites of the providers have been combined with network-layer and transport-layer attack against VoIP infrastructures. Examples:

TCP floods targeting stateful firewallsThese are being used in “trial-and-error” type attacks. They are not very effective against telephony infrastructure specifically (because it’s mostly UDP) but very effective at overwhelming stateful firewalls.
UDP floods targeting SIP infrastructureFloods of UDP traffic that have no well-known fingerprint, aimed at critical VoIP services. Generic floods like this may look like legitimate traffic to unsophisticated filtering systems.
UDP reflection targeting SIP infrastructureThese methods, when targeted at SIP or RTP services, can easily overwhelm Session Border Controllers (SBCs) and other telephony infrastructure. The attacker seems to learn enough about the target’s infrastructure to target such services with high precision.
SIP protocol-specific attacksAttacks at the application layer are of particular concern because of the higher resource cost of generating application errors vs filtering on network devices.

Question #2: How should I prepare my organization in case our VoIP infrastructure is targeted?

Deploy an always-on DDoS mitigation serviceCloudflare recommends the deployment of always-on network level protection, like Cloudflare Magic Transit, prior to your organization being attacked.
Do not rely on reactive on-demand SOC-based DDoS Protection services that require humans to analyze attack traffic — they take too long to respond. Instead, onboard to a cloud service that has sufficient network capacity and automated DDoS mitigation systems.
Cloudflare has effective mitigations in place for the attacks seen against VoIP infrastructure, including for sophisticated TCP floods and SIP specific attacks.
Enforce a positive security modelBlock TCP on IP/port ranges that are not expected to receive TCP, instead of relying on on-premise firewalls that can be overwhelmed. Block network probing attempts (e.g. ICMP) and other packets that you don't normally expect to see.
Build custom mitigation strategiesWork together with your DDoS protection vendor to tailor mitigation strategies to your workload. Every network is different, and each poses unique challenges when integrating with DDoS mitigation systems.
Educate your employeesTrain all of your employees to be on the lookout for ransom demands. Check email, support tickets, form submissions, and even server access logs. Ensure employees know to immediately report ransom demands to your Security Incident Response team.

Question #3: What should I do if I receive a ransom/threat?

Do not pay the ransomPaying the ransom only encourages bad actors—and there’s no guarantee that they won’t attack your network now or later.
Notify CloudflareWe can help ensure your website and network infrastructure are safeguarded against these attacks.
Notify local law enforcementThey will also likely request a copy of the ransom letter that you received.

Cloudflare is here to help

With over 100 Tbps of network capacity, a network architecture that efficiently filters traffic close to the source, and a physical presence in over 250 cities, Cloudflare can help protect critical VoIP infrastructure without impacting latency, jitter, and call quality. Test results demonstrate a performance improvement of 36% on average across the globe for a real customer network using Cloudflare Magic Transit.

Some of the largest voice and video providers in the world rely on Cloudflare to protect their networks and ensure their services remain online and fast. We stand ready to help.

Talk to a Cloudflare specialist to learn more.Under attack? Contact our hotline to speak with someone immediately.

May I ask who’s calling, please? A recent rise in VoIP DDoS attacks

Omer Yoachimik — Fri, 01 Oct 2021 00:05:42 GMT

Over the past month, multiple Voice over Internet Protocol (VoIP) providers have been targeted by Distributed Denial of Service (DDoS) attacks from entities claiming to be REvil. The multi-vector attacks combined both L7 attacks targeting critical HTTP websites and API endpoints, as well as L3/4 attacks targeting VoIP server infrastructure. In some cases, these attacks resulted in significant impact to the targets’ VoIP services and website/API availability.

Cloudflare’s network is able to effectively protect and accelerate voice and video infrastructure because of our global reach, sophisticated traffic filtering suite, and unique perspective on attack patterns and threat intelligence.

If you or your organization have been targeted by DDoS attacks, ransom attacks and/or extortion attempts, seek immediate help to protect your Internet properties. We recommend not paying the ransom, and to report it to your local law enforcement agencies.

Voice (and video, emojis, conferences, cat memes and remote classrooms) over IP

Voice over IP (VoIP) is a term that's used to describe a group of technologies that allow for communication of multimedia over the Internet. This technology enables your FaceTime call with your friends, your virtual classroom lessons over Zoom and even some “normal” calls you make from your cell phone.

The principles behind VoIP are similar to traditional digital calls over circuit-switched networks. The main difference is that the encoded media, e.g., voice or video, is partitioned into small units of bits that are transferred over the Internet as the payloads of IP packets according to specially defined media protocols.

This “packet switching” of voice data, as compared to traditional “circuit switching”, results in much more efficient use of network resources. As a result, calling over VoIP can be much more cost-effective than calls made over the POTS (“plain old telephone service”). Switching to VoIP can cut down telecom costs for businesses by more than 50%, so it's no surprise that one in every three businesses has already adopted VoIP technologies. VoIP is flexible, scalable, and has been especially useful in bringing people together remotely during the pandemic.

A key protocol behind most VoIP calls is the heavily adopted Session Initiation Protocol (SIP). SIP was originally defined in RFC-2543 (1999) and designed to serve as a flexible and modular protocol for initiating calls (“sessions”), whether voice or video, or two-party or multiparty.

Speed is key for VoIP

Real-time communication between people needs to feel natural, immediate and responsive. Therefore, one of the most important features of a good VoIP service is speed. The user experiences this as natural sounding audio and high definition video, without lag or stutter. Users’ perceptions of call quality are typically closely measured and tracked using metrics like Perceptual Evaluation of Speech Quality and Mean Opinion Scores. While SIP and other VoIP protocols can be implemented using TCP or UDP as the underlying protocols, UDP is typically chosen because it’s faster for routers and servers to process them.

UDP is a protocol that is unreliable, stateless and comes with no Quality of Service (QoS) guarantees. What this means is that the routers and servers typically use less memory and computational power to process UDP packets and therefore can process more packets per second. Processing packets faster results in quicker assembly of the packets’ payloads (the encoded media), and therefore a better call quality.

Under the guidelines of faster is better, VoIP servers will attempt to process the packets as fast as possible on a first-come-first-served basis. Because UDP is stateless, it doesn’t know which packets belong to existing calls and which attempt to initiate a new call. Those details are in the SIP headers in the form of requests and responses which are not processed until further up the network stack.

When the rate of packets per second increases beyond the router’s or server’s capacity, the faster is better guideline actually turns into a disadvantage. While a traditional circuit-switched system will refuse new connections when its capacity is reached and attempt to maintain the existing connections without impairment, a VoIP server, in its race to process as many packets as possible, will not be able to handle all packets or all calls when its capacity is exceeded. This results in latency and disruptions for ongoing calls, and failed attempts of making or receiving new calls.

Without proper protection in place, the race for a superb call experience comes at a security cost which attackers learned to take advantage of.

DDoSing VoIP servers

Attackers can take advantage of UDP and the SIP protocol to overwhelm unprotected VoIP servers with floods of specially-crafted UDP packets. One way attackers overwhelm VoIP servers is by pretending to initiate calls. Each time a malicious call initiation request is sent to the victim, their server uses computational power and memory to authenticate the request. If the attacker can generate enough call initiations, they can overwhelm the victim’s server and prevent it from processing legitimate calls. This is a classic DDoS technique applied to SIP.

A variation on this technique is a SIP reflection attack. As with the previous technique, malicious call initiation requests are used. However, in this variation, the attacker doesn’t send the malicious traffic to the victim directly. Instead, the attacker sends them to many thousands of random unwitting SIP servers all across the Internet, and they spoof the source of the malicious traffic to be the source of the intended victim. That causes thousands of SIP servers to start sending unsolicited replies to the victim, who must then use computational resources to discern whether they are legitimate. This too can starve the victim server of resources needed to process legitimate calls, resulting in a widespread denial of service event for users. Without the proper protection in place, VoIP services can be extremely susceptible to DDoS attacks.

The graph below shows a recent multi-vector UDP DDoS attack that targeted VoIP infrastructure protected by Cloudflare’s Magic Transit service. The attack peaked just above 70 Gbps and 16M packets per second. While it's not the largest attack we’ve ever seen, attacks of this size can have large impact on unprotected infrastructure. This specific attack lasted a bit over 10 hours and was automatically detected and mitigated.

Below are two additional graphs of similar attacks seen last week against SIP infrastructure. In the first chart we see multiple protocols being used to launch the attack, with the bulk of traffic coming from (spoofed) DNS reflection and other common amplification and reflection vectors. These attacks peaked at over 130 Gbps and 17.4M pps.

Protecting VoIP services without sacrificing performance

One of the most important factors for delivering a quality VoIP service is speed. The lower the latency, the better. Cloudflare’s Magic Transit service can help protect critical VoIP infrastructure without impacting latency and call quality.

Cloudflare’s Anycast architecture, coupled with the size and scale of our network, minimizes and can even improve latency for traffic routed through Cloudflare versus the public Internet. Check out our recent post from Cloudflare’s Speed Week for more details on how this works, including test results demonstrating a performance improvement of 36% on average across the globe for a real customer network using Magic Transit.

Furthermore, every packet that is ingested in a Cloudflare data center is analyzed for DDoS attacks using multiple layers of out-of-path detection to avoid latency. Once an attack is detected, the edge generates a real-time fingerprint that matches the characteristics of the attack packets. The fingerprint is then matched in the Linux kernel eXpress Data Path (XDP) to quickly drop attack packets at wirespeed without inflicting collateral damage on legitimate packets. We have also recently deployed additional specific mitigation rules to inspect UDP traffic to determine whether it is valid SIP traffic.

The detection and mitigation is done autonomously within every single Cloudflare edge server — there is no “scrubbing center” with limited capacity and limited deployment scope in the equation. Additionally, threat intelligence is automatically shared across our network in real-time to ‘teach’ other edge servers about the attack.

Edge detections are also completely configurable. Cloudflare Magic Transit customers can use the L3/4 DDoS Managed Ruleset to tune and optimize their DDoS protection settings, and also craft custom packet-level (including deep packet inspection) firewall rules using the Magic Firewall to enforce a positive security model.

Bringing people together, remotely

Cloudflare’s mission is to help build a better Internet. A big part of that mission is making sure that people around the world can communicate with their friends, family and colleagues uninterrupted — especially during these times of COVID. Our network is uniquely positioned to help keep the world connected, whether that is by helping developers build real-time communications systems or by keeping VoIP providers online.

Our network’s speed and our always-on, autonomous DDoS protection technology helps VoIP providers to continue serving their customers without sacrificing performance or having to give in to ransom DDoS extortionists.

Talk to a Cloudflare specialist to learn more.

Under attack? Contact our hotline to speak with someone immediately.