
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 23:24:14 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Deploying firmware at Cloudflare-scale: updating thousands of servers in more than 285 cities]]></title>
            <link>https://blog.cloudflare.com/deploying-firmware-at-cloudflare-scale-how-we-update-thousands-of-servers-in-more-than-285-cities/</link>
            <pubDate>Fri, 10 Mar 2023 14:00:00 GMT</pubDate>
            <description><![CDATA[ We have a huge number of servers of varying kinds, from varying vendors, spread over 285 cities worldwide. We need to be able to rapidly deploy various types of firmware updates to all of them, reliably, and automatically, without any kind of manual intervention. ]]></description>
<content:encoded><![CDATA[ <p>As a security company, it’s critical that we have good processes for dealing with security issues. We regularly release software to our servers - on a daily basis even - which includes new features, bug fixes, and as required, security patches. But just as critical is the software which is <i>embedded</i> into the server hardware, known as firmware. Primarily of interest are the BIOS and the <a href="/bmc-vuln/">Baseboard Management Controller</a> (BMC), but many other components, such as Network Interface Cards (NICs), also have firmware.</p><p>As the world becomes more digital, software which needs updating is appearing in more and more devices. As well as my computer, over the last year, I have waited patiently while firmware has updated in my TV, vacuum cleaner, lawn mower and light bulbs. It can be a cumbersome process: obtaining the firmware, deploying it to the device which needs updating, navigating menus and other commands to initiate the update, and then waiting several minutes for the update to complete.</p><p>Firmware updates can be annoying even if you only have a couple of devices. We have more than a few devices at Cloudflare. We have a huge number of servers of varying kinds, from varying vendors, spread over 285 cities worldwide. We need to be able to rapidly deploy various types of firmware updates to all of them, reliably, and automatically, without any kind of manual intervention.</p><p>In this blog post I will outline the methods that we use to automate firmware deployment to our entire fleet. We have been using these methods for several years now, and have deployed firmware entirely automatically, without interrupting our SRE team.</p>
    <div>
      <h3>Background</h3>
      <a href="#background">
        
      </a>
    </div>
    <p>A key component of our ability to deploy firmware at scale is iPXE, an open source boot loader. iPXE is the glue which operates between the server and the operating system, and is responsible for loading the operating system after the server has completed the Power On Self Test (POST). It is very flexible and contains a scripting language. With iPXE, we can write boot scripts which query the firmware version, continue booting if the correct firmware version is deployed, or, if not, boot into a flashing environment to flash the correct firmware.</p><p>We only deploy new firmware when our systems are out of production, so we need a method to coordinate deployment only on out-of-production systems. The simplest way to do this is when they are rebooting, because by definition they are out of production then. We reboot our entire fleet every month, and have the ability to schedule reboots more urgently if required to deal with a security issue. Regularly rebooting our fleet has many advantages. We can <a href="https://www.youtube.com/watch?v=8mlJu8hPpQQ">deploy the latest Linux kernel</a> and base operating system, and catch any changes in our operating system and configuration management environment that break on a fresh boot.</p><p>Our entire fleet operates in <a href="https://en.wikipedia.org/wiki/UEFI">UEFI mode</a>. UEFI is a modern replacement for the BIOS and offers more features and more security, such as Secure Boot. A full description of all of these changes is outside the scope of this article, but essentially UEFI provides a minimal environment and shell capable of executing binaries. <a href="/anchoring-trust-a-hardware-secure-boot-story/">Secure Boot ensures</a> that the binaries are signed with keys embedded in the system, to prevent a bad actor from tampering with our software.</p>
    <div>
      <h3>How we update the BIOS</h3>
      <a href="#how-we-update-the-bios">
        
      </a>
    </div>
    <p>We are able to update the BIOS without booting any operating system, purely by taking advantage of features offered by iPXE and the UEFI shell. This requires a <a href="https://wiki.osdev.org/UEFI#UEFI_applications_in_detail">flashing binary written for the UEFI environment</a>.</p><p>Upon boot, iPXE is started. Through iPXE’s built-in variable <code>${smbios/0.5.0}</code> it is possible to <a href="https://forum.ipxe.org/showthread.php?tid=7749">query the current BIOS version</a>, compare it to the latest version, and trigger a flash only if there is a mismatch. iPXE then downloads the files required for the firmware update to a ramdisk.</p><p>The following is an example of a very basic iPXE script which performs such an action:</p>
            <pre><code># Check whether the BIOS version is 2.03
iseq ${smbios/0.5.0} 2.03 || goto biosupdate
echo Nothing to do for {{ model }}
exit 0

:biosupdate
echo Trying to update BIOS/UEFI...
echo Current: ${smbios/0.5.0}
echo New: 2.03

imgfetch ${boot_prefix}/tools/x64/shell.efi || goto unexpected_error
imgfetch startup.nsh || goto unexpected_error

imgfetch AfuEfix64.efi || goto unexpected_error
imgfetch bios-2.03.bin || goto unexpected_error

imgexec shell.efi || goto unexpected_error</code></pre>
            <p>Meanwhile, <code>startup.nsh</code> names the flashing binary to run and the command line arguments which effect the flash:</p><p><code>startup.nsh</code>:</p>
            <pre><code>%homefilesystem%\AfuEfix64.efi %homefilesystem%\bios-2.03.bin /P /B /K /N /X /RLC:E /REBOOT</code></pre>
            <p>After rebooting, the machine will boot using its new BIOS firmware, version 2.03. Since <code>${smbios/0.5.0}</code> now contains 2.03, the machine continues to boot and enter production.</p>
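            <p>The same version gate is easy to sanity-check from a running Linux system, since the kernel exposes the SMBIOS BIOS version via sysfs. A minimal sketch (the <code>needs_flash</code> helper and the target version here are ours, for illustration):</p>

```shell
#!/bin/bash
# Mirror the iPXE iseq gate in shell: flash only when versions differ.
needs_flash() {
    local current="$1" target="$2"
    [ "$current" != "$target" ]
}

# The kernel exposes the same SMBIOS field that iPXE reads via ${smbios/0.5.0}
current="$(cat /sys/class/dmi/id/bios_version 2>/dev/null || echo unknown)"
target="2.03"

if needs_flash "$current" "$target"; then
    echo "BIOS ${current}: flash to ${target} required"
else
    echo "BIOS already at ${target}, nothing to do"
fi
```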
    <div>
      <h3>Other firmware updates such as BMC, network cards and more</h3>
      <a href="#other-firmware-updates-such-as-bmc-network-cards-and-more">
        
      </a>
    </div>
    <p>Unfortunately, the number of vendors that support firmware updates with UEFI flashing binaries is limited. There are a large number of other updates that we need to perform such as BMC and NIC.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Bnbihh4sYojHTFzb0VX5k/d4e818e8959aaacdcff156a64b30bd89/image1-6.png" />
            
            </figure><p>Consequently, we need another way to flash these binaries. Thankfully, these vendors invariably support flashing from Linux, so we can perform the flashing from a minimal Linux environment. Since vendor firmware updates are typically closed source utilities, and vendors are often highly secretive about firmware flashing, we ensure that the flashing environment does not provide an attackable surface by leaving the network unconfigured. If it’s not on the network, it can’t be attacked and exploited.</p><p>Not being on the network means that we need to inject files into the boot process when the machine boots. We can accomplish this with an <a href="https://docs.kernel.org/admin-guide/initrd.html">initial ramdisk</a> (<code>initrd</code>), and iPXE makes it easy to add additional initrds to the boot.</p><p>Creating an <code>initrd</code> is as simple as archiving the files with cpio, using the newc archive format.</p><p>Let’s imagine we are going to flash Broadcom NIC firmware. We’ll use the bnxtnvm firmware update utility, the firmware image <code>firmware.pkg</code>, and a shell script called <code>flash</code> to automate the task.</p><p>The files are laid out in the file system like this:</p>
            <pre><code>cd broadcom
find .
./opt/preflight
./opt/preflight/scripts
./opt/preflight/scripts/flash
./opt/broadcom
./opt/broadcom/firmware.pkg
./opt/broadcom/bnxtnvm</code></pre>
            <p>Now we compress all of these files into an image called <code>broadcom.img</code>.</p>
            <pre><code>find . | cpio --quiet -H newc -o | gzip -9 -n &gt; ../broadcom.img</code></pre>
            <p>This is the first step completed; we have the firmware packaged up into an <code>initrd</code>.</p><p>Since it’s challenging to read, say, the firmware version of the NIC from the EFI shell, we store firmware versions as UEFI variables. These can be written from Linux via <a href="https://www.kernel.org/doc/html/next/filesystems/efivarfs.html"><code>efivars</code></a>, the UEFI variable file system, and then read by iPXE on boot.</p><p>An example of writing an EFI variable from Linux looks like this:</p>
            <pre><code>declare -r fw_path='/sys/firmware/efi/efivars/broadcom-fw-9ca25c23-368a-4c21-943f-7d91f2b76008'
declare -r efi_header='\x07\x00\x00\x00'
declare -r version='1.05'

/bin/mount -o remount,rw,nosuid,nodev,noexec,noatime none /sys/firmware/efi/efivars

# Files on efivarfs are immutable by default, so remove the immutable flag so that we can write to it: https://docs.kernel.org/filesystems/efivarfs.html
if [ -f "${fw_path}" ] ; then
    /usr/bin/chattr -i "${fw_path}"
fi

echo -n -e "${efi_header}${version}" &gt;| "$fw_path"</code></pre>
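            <p>The <code>\x07\x00\x00\x00</code> prefix is not part of the version string: efivarfs expects every write to begin with a four-byte, little-endian attributes word, and <code>0x00000007</code> is <code>EFI_VARIABLE_NON_VOLATILE | EFI_VARIABLE_BOOTSERVICE_ACCESS | EFI_VARIABLE_RUNTIME_ACCESS</code>. Reading the version back means skipping those four bytes. A sketch against a scratch file (a stand-in for the real efivarfs path, so the example runs without live UEFI variables):</p>

```shell
#!/bin/bash
# efivarfs payload format: 4-byte attribute word, then the variable data.
declare -r efi_header='\x07\x00\x00\x00'
declare -r version='1.05'
fw_path="$(mktemp)"   # stand-in for the real efivarfs path

echo -n -e "${efi_header}${version}" > "${fw_path}"

# Recover the payload by dropping the first four bytes
stored="$(tail -c +5 "${fw_path}")"
echo "stored version: ${stored}"   # prints "stored version: 1.05"
```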
            <p>Then we can write an iPXE configuration file to load the flashing kernel, userland and flashing utilities.</p>
            <pre><code>set cf/guid 9ca25c23-368a-4c21-943f-7d91f2b76008

iseq ${efivar/broadcom-fw-${cf/guid}} 1.05 &amp;&amp; echo Not flashing broadcom firmware, version already at 1.05 || goto update
exit

:update
echo Starting broadcom firmware update
kernel ${boot_prefix}/vmlinuz initrd=baseimg.img initrd=linux-initramfs-modules.img initrd=broadcom.img
initrd ${boot_prefix}/baseimg.img
initrd ${boot_prefix}/linux-initramfs-modules.img
initrd ${boot_prefix}/firmware/broadcom.img</code></pre>
            <p>Flashing scripts are deposited into <code>/opt/preflight/scripts</code> and we use <code>systemd</code> to execute them with <a href="https://manpages.debian.org/stretch/debianutils/run-parts.8.en.html">run-parts</a> on boot:</p><p><code>/etc/systemd/system/preflight.service</code>:</p>
            <pre><code>[Unit]
Description=Pre-salt checks and simple configurations on boot
Before=salt-highstate.service
After=network.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/run-parts --verbose /opt/preflight/scripts

[Install]
WantedBy=multi-user.target
RequiredBy=salt-highstate.service</code></pre>
            <p>An example flashing script in <code>/opt/preflight/scripts</code> might look like:</p>
            <pre><code>#!/bin/bash

trap 'catch $? $LINENO' ERR
catch(){
    #error handling goes here
    echo "Error $1 occurred on line $2"
}

declare -r fw_path='/sys/firmware/efi/efivars/broadcom-fw-9ca25c23-368a-4c21-943f-7d91f2b76008'
declare -r efi_header='\x07\x00\x00\x00'
declare -r version='1.05'

if lspci | grep -q Broadcom; then
    echo "Broadcom firmware flashing starting"
    if [ ! -f "$fw_path" ] ; then
        chmod +x /opt/broadcom/bnxtnvm
        declare -r interface=$(/opt/broadcom/bnxtnvm listdev | grep "Device Interface Name" | awk -F ": " '{print $2}')
        /opt/broadcom/bnxtnvm -dev=${interface} -force -y install /opt/broadcom/firmware.pkg
        declare -r status=$?
        declare -r currentversion=$(/opt/broadcom/bnxtnvm -dev=${interface} device_info | grep "Package version on NVM" | awk -F ": " '{print $2}')
        declare -r expectedversion="$version"
        if [ $status -eq 0 -a "$currentversion" = "$expectedversion" ]; then
            echo "Broadcom firmware $version flashed successfully"
            /bin/mount -o remount,rw,nosuid,nodev,noexec,noatime none /sys/firmware/efi/efivars
            echo -n -e "${efi_header}${version}" &gt;| "$fw_path"
            echo "Created $fw_path"
        else
            echo "Failed to flash Broadcom firmware $version"
            /opt/broadcom/bnxtnvm -dev=${interface} device_info
        fi
    else
        echo "Broadcom firmware up-to-date"
    fi
else
    echo "No Broadcom NIC installed"
    /bin/mount -o remount,rw,nosuid,nodev,noexec,noatime none /sys/firmware/efi/efivars
    if [ -f "${fw_path}" ] ; then
        /usr/bin/chattr -i "${fw_path}"
    fi
    echo -n -e "${efi_header}${version}" &gt;| "$fw_path"
    echo "Created $fw_path"
fi

if [ -f "${fw_path}" ]; then
    echo "rebooting in 60 seconds"
    sleep 60
    /sbin/reboot
fi</code></pre>
            
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Whether you manage just your laptop or desktop computer, or a fleet of servers, it’s important to keep the firmware updated to ensure that the availability, performance and security of the devices are maintained.</p><p>If you have a few devices and would benefit from automating the deployment process, we hope that we have inspired you to have a go by making use of some basic open source tools such as the iPXE boot loader and some scripting.</p><p>Finally, thanks to my colleague <a href="/author/ignat/">Ignat Korchagin</a> who did a large amount of the original work on the UEFI BIOS firmware flashing infrastructure.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">1jDz7Y0bfL3rEOW1hs3unN</guid>
            <dc:creator>Chris Howells</dc:creator>
        </item>
        <item>
            <title><![CDATA[Bringing the best live video experience to Cloudflare Stream with AV1]]></title>
            <link>https://blog.cloudflare.com/av1-cloudflare-stream-beta/</link>
            <pubDate>Wed, 05 Oct 2022 17:08:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare Stream now supports the AV1 codec for live video in open beta, unlocking live-streaming at higher resolution, with lower bandwidth ]]></description>
            <content:encoded><![CDATA[ <p>Consumer hardware is pushing the limits of consumers’ bandwidth.</p><p>VR headsets support 5760 x 3840 resolution — 22.1 million pixels <i>per frame</i> of video. Nearly all new TVs and smartphones sold today now support 4K — 8.3 million pixels per frame. It’s now normal for most people on a subway to be casually streaming video on their phone, even as they pass through a tunnel. People expect all of this to just work, and get frustrated when it doesn’t.</p><p>Consumer Internet bandwidth hasn’t kept up. Even advanced mobile carriers still limit streaming video resolution to prevent network congestion. Many mobile users still have to monitor and limit their mobile data usage. Higher Internet speeds require expensive infrastructure upgrades, and 30% of Americans still say they often have problems <a href="https://www.pewresearch.org/internet/2021/06/03/mobile-technology-and-home-broadband-2021/">simply connecting to the Internet at home</a>.</p><p>We talk to developers every day who are pushing up against these limits, trying to deliver the highest quality streaming video without buffering or jitter, challenged by viewers’ expectations and bandwidth. Developers building live video experiences hit these limits the hardest — buffering doesn’t just delay video playback, it can cause the viewer to get out of sync with the live event. Buffering can cause a sports fan to miss a key moment as playback suddenly skips ahead, or find out in a text message about the outcome of the final play, before they’ve had a chance to watch.</p><p>Today we’re announcing a big step towards breaking the ceiling of these limits — support in <a href="https://www.cloudflare.com/products/cloudflare-stream/">Cloudflare Stream</a> for the <a href="https://aomedia.org/av1-features/">AV1</a> codec for live videos and their recordings, available today to all Cloudflare Stream customers in open beta. 
Read <a href="https://developers.cloudflare.com/stream/viewing-videos/av1-playback/">the docs</a> to get started, or <a href="https://cool-sf-videos.pages.dev/">watch an AV1 video</a> from Cloudflare Stream in your web browser. AV1 is an open and royalty-free video codec that uses <a href="https://engineering.fb.com/2018/04/10/video-engineering/av1-beats-x264-and-libvpx-vp9-in-practical-use-case/">46% less bandwidth than H.264</a>, the most commonly used video codec on the web today.</p>
    <div>
      <h3>What is AV1, and how does it improve live video streaming?</h3>
      <a href="#what-is-av1-and-how-does-it-improve-live-video-streaming">
        
      </a>
    </div>
    <p>Every piece of information that travels across the Internet, from web pages to photos, requires data to be transmitted between two computers. A single character usually takes one byte, so a two-page letter would be 3600 bytes or 3.6 kilobytes of data transferred.</p><p>One pixel in a photo takes 3 bytes, one each for red, green and blue in the pixel. A 4K photo has 3840 x 2160 = 8,294,400 pixels, so it would take 24,883,200 bytes, or about 24.9 Megabytes. A video is like a photo that changes 30 times a second, which would make almost 45 Gigabytes per minute. That’s a lot!</p><p>To reduce the amount of bandwidth needed to stream video, before video is sent to your device, it is compressed using a codec. When your device receives video, it decodes this into the pixels displayed on your screen. These codecs are essential to both streaming and storing video.</p><p>Video compression codecs combine multiple advanced techniques, and are able to compress video to one percent of the original size, with your eyes barely noticing a difference. This also makes video codecs computationally intensive and hard to run. Smartphones, laptops and TVs have specific media decoding hardware, separate from the main CPU, optimized to decode specific protocols quickly, using the minimum amount of battery life and power.</p><p>Every few years, as researchers invent more efficient compression techniques, standards bodies release new codecs that take advantage of these improvements. Each generation of improvements in compression technology increases the requirements for computers that run them. With higher requirements, new chips are made available with increased compute capacity. These new chips allow your device to display higher quality video while using less bandwidth.</p><p>AV1 takes advantage of recent advances in compute to deliver video with dramatically fewer bytes, even compared to other relatively recent video protocols like VP9 and HEVC.</p>
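    <p>The uncompressed numbers are a quick back-of-the-envelope calculation (assuming a 3840 x 2160 frame, 3 bytes per pixel, and 30 frames per second):</p>

```shell
#!/bin/bash
# Back-of-the-envelope bandwidth of uncompressed 4K video
pixels=$((3840 * 2160))                  # 8,294,400 pixels per frame
frame_bytes=$((pixels * 3))              # 24,883,200 bytes (~24.9 MB) per frame
minute_bytes=$((frame_bytes * 30 * 60))  # 30 fps for 60 seconds (~44.8 GB)

echo "${pixels} pixels, ${frame_bytes} bytes/frame, ${minute_bytes} bytes/minute"
```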
    <div>
      <h3>AV1 leverages the power of new smartphone chips</h3>
      <a href="#av1-leverages-the-power-of-new-smartphone-chips">
        
      </a>
    </div>
    <p>One of the biggest developments of the past few years has been the rise of custom chip designs for smartphones. Much of what’s driven the development of these chips is the need for advanced on-device image and video processing, as companies compete on the basis of which smartphone has the best camera.</p><p>This means the phones we carry around have an incredible amount of compute power. One way to think about AV1 is that it shifts work from the network to the viewer’s device. AV1 is fewer bytes over the wire, but computationally harder to decode than prior formats. When AV1 was first announced in 2018, it was dismissed by some as too slow to encode and decode, but smartphone chips have become radically faster in the past four years, more quickly than many saw coming.</p><p>AV1 hardware decoding is already built into the latest Google Pixel smartphones as part of the Tensor chip design. The <a href="https://semiconductor.samsung.com/processor/mobile-processor/exynos-2200/">Samsung Exynos 2200</a> and <a href="https://www.mediatek.com/products/smartphones-2/dimensity-1000-series">MediaTek Dimensity 1000 SoC</a> mobile chipsets both support hardware accelerated AV1 decoding. It appears that Google will <a href="https://android.googlesource.com/platform/cts/+/9203e0379bbb8991cdfee39e2a894d236bfaca8e?cf_target_id=EB3A10F16F1C7B0D8AE3D87D702DDC4A">require</a> that all devices that support Android 14 support decoding AV1. And AVPlayer, the media playback API built into iOS and tvOS, now <a href="https://developer.apple.com/documentation/coremedia/1564239-video_codec_constants/kcmvideocodectype_av1">includes an option for AV1</a>, which hints at future support. It’s clear that the industry is heading towards hardware-accelerated AV1 decoding in the most popular consumer devices.</p><p>With hardware decoding comes battery life savings — essential for both today’s smartphones and tomorrow’s VR headsets. 
For example, a Google Pixel 6 with AV1 hardware decoding uses only minimal battery and CPU to decode and play our <a href="https://cool-sf-videos.pages.dev/">test video</a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1vbKm1xvDV6JE2Z9Jkn0rO/b0809152c1eb4f589c098530f254f223/image1-5.png" />
            
            </figure>
    <div>
      <h3>AV1 encoding requires even more compute power</h3>
      <a href="#av1-encoding-requires-even-more-compute-power">
        
      </a>
    </div>
    <p>Just as decoding is significantly harder for end-user devices, it is also significantly harder to encode video using AV1. When AV1 was announced in 2018, many doubted whether hardware would be able to encode it efficiently enough for the protocol to be adopted quickly.</p><p>To demonstrate this, we encoded the 4K rendering of <a href="https://peach.blender.org/">Big Buck Bunny</a> (a classic among video engineers!) into AV1, using an AMD EPYC 7642 48-Core Processor with 256 GB RAM. This CPU continues to be a workhorse of our compute fleet, as we have <a href="/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/">written about previously</a>. We used the following command to re-encode the video, <a href="https://trac.ffmpeg.org/wiki/Encode/AV1">based on the example in the ffmpeg AV1 documentation</a>:</p><p><code>ffmpeg -i bbb_sunflower_2160p_30fps_normal.mp4 -c:v libaom-av1 -crf 30 -b:v 0 -strict -2 av1_test.mkv</code></p><p>Using a single core, encoding just two seconds of video at 30fps took over 30 minutes. Even if all 48 cores were used to encode, it would take at minimum over 43 seconds to encode just two seconds of video. Live encoding using only CPUs would require over 20 servers running at full capacity.</p><p>Special-purpose AV1 software encoders like <a href="https://github.com/xiph/rav1e">rav1e</a> and <a href="https://gitlab.com/AOMediaCodec/SVT-AV1">SVT-AV1</a> that run on general purpose CPUs can encode somewhat faster than <a href="https://trac.ffmpeg.org/wiki/Encode/AV1">libaom-av1</a> with ffmpeg, but still consume a huge amount of compute power to encode AV1 in real-time, requiring multiple servers running at full capacity in many scenarios.</p>
    <div>
      <h3>Cloudflare Stream encodes your video to AV1 in real-time</h3>
      <a href="#cloudflare-stream-encodes-your-video-to-av1-in-real-time">
        
      </a>
    </div>
    <p>At Cloudflare, we control both the hardware and software on our network. So to solve the CPU constraint, we’ve installed dedicated AV1 hardware encoders, designed specifically to encode AV1 at blazing fast speeds. This end to end control is what lets us encode your video to AV1 in real-time. This is entirely out of reach to most public cloud customers, including the video infrastructure providers who depend on them for compute power.</p><p>Encoding in real-time means you can use AV1 for live video streaming, where saving bandwidth matters most. With a pre-recorded video, the client video player can fetch future segments of video well in advance, relying on a buffer that can be many tens of seconds long. With live video, buffering is constrained by latency — it’s not possible to build up a large buffer when viewing a live stream. There is less margin for error with live streaming, and every byte saved means that if a viewer’s connection is interrupted, it takes less time to recover before the buffer is empty.</p>
    <div>
      <h3>Stream lets you support AV1 with no additional work</h3>
      <a href="#stream-lets-you-support-av1-with-no-additional-work">
        
      </a>
    </div>
    <p>AV1 has a chicken-and-egg dilemma. And we’re helping solve it.</p><p>Companies with large video libraries often re-encode their entire content library to a new codec before using it. But AV1 is so computationally intensive that re-encoding whole libraries has been cost prohibitive. Companies have to choose specific videos to re-encode, and guess which content will be most viewed ahead of time. This is particularly challenging for apps with user generated content, where content can suddenly go viral, and viewer patterns are hard to anticipate.</p><p>This has slowed down the adoption of AV1 — content providers wait for more devices to support AV1, and device manufacturers wait for more content to use AV1. Which will come first?</p><p>With Cloudflare Stream there is no need to manually trigger re-encoding, re-upload video, or manage the bulk encoding of a large video library. This is a unique approach that is made possible by integrating encoding and delivery into a single product — it is not possible to encode on-demand using the old way of encoding first, and then pointing a CDN at a bucket of pre-encoded files.</p><p>We think this approach can accelerate the adoption of AV1. Consider a video app with millions of minutes of user-generated video. Most videos will never be watched again. In the old model, developers would have to spend huge sums of money to encode upfront, or pick and choose which videos to re-encode. With Stream, we can help anyone incrementally adopt AV1, without re-encoding upfront. As we work towards making AV1 Generally Available, we’ll be working to make supporting AV1 simple and painless, even for videos already uploaded to Stream, with no special configuration necessary.</p>
    <div>
      <h3>Open, royalty-free, and widely supported</h3>
      <a href="#open-royalty-free-and-widely-supported">
        
      </a>
    </div>
    <p>At Cloudflare, we are committed to open standards and <a href="/tag/patent-troll/">fighting patent trolls</a>. While there are multiple competing options for new video codecs, we chose to support AV1 first in part because it is open source and has royalty-free licensing.</p><p>Other encoding codecs force device manufacturers to pay royalty fees in order to adopt their standard in consumer hardware, and have been quick to file lawsuits against competing video codecs. The group behind the open and royalty-free VP8 and VP9 codecs have been pushing back against this model for more than a decade, and AV1 is the successor to these codecs, with support from all the <a href="https://aomedia.org/membership/members/">biggest technology companies</a>, both software and hardware. Beyond its technical accomplishments, AV1 is a clear message from the industry that the future of video encoding should be open, royalty-free, and free from patent litigation.</p>
    <div>
      <h3>Try AV1 right now with <b><i>your</i></b> live stream or live recording</h3>
      <a href="#try-av1-right-now-with-your-live-stream-or-live-recording">
        
      </a>
    </div>
    <p>Support for AV1 is currently in open beta. You can try using AV1 on your own live video with Cloudflare Stream right now — just add the <code>?betaCodecSuggestion=av1</code> query parameter to the HLS or DASH manifest URL for any live stream or live recording created after October 1st in Cloudflare Stream. <a href="https://developers.cloudflare.com/stream/viewing-videos/av1-playback/">Read the docs</a> to get started. If you don’t yet have a Cloudflare account, you can sign up <a href="https://dash.cloudflare.com/sign-up/stream">here</a> and start using Cloudflare Stream in just a few minutes.</p><p>We also have a recording of a live video, encoded using AV1, that you can watch <a href="https://cool-sf-videos.pages.dev/">here</a>. Note that Safari does not yet support AV1.</p><p>We encourage you to try AV1 with your test streams, and we’d love your feedback. Join our <a href="https://discord.com/invite/cloudflaredev/">Discord channel</a> and tell us what you’re building, and what kinds of video you’re interested in using AV1 with. We’d love to hear from you!</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Stream]]></category>
            <category><![CDATA[Video]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">1VBAndDQEb6dfu5nhqoTvb</guid>
            <dc:creator>Renan Dincer</dc:creator>
            <dc:creator>Brendan Irvine-Broque</dc:creator>
            <dc:creator>Chris Howells</dc:creator>
            <dc:creator>Ryan Schachte</dc:creator>
        </item>
        <item>
            <title><![CDATA[The EPYC journey continues to Milan in Cloudflare’s 11th generation Edge Server]]></title>
            <link>https://blog.cloudflare.com/the-epyc-journey-continues-to-milan-in-cloudflares-11th-generation-edge-server/</link>
            <pubDate>Tue, 31 Aug 2021 13:00:45 GMT</pubDate>
            <description><![CDATA[ At Cloudflare we aim to introduce a new server platform to our edge network every 12 to 18 months or so, to ensure that we keep up with the latest industry technologies and developments.  ]]></description>
            <content:encoded><![CDATA[ <p>When I was interviewing to join Cloudflare in 2014 as a member of the SRE team, we had just introduced our <a href="/a-tour-inside-cloudflares-latest-generation-servers/">generation 4 server</a>, and I was excited about the prospects. Since then, Cloudflare, the industry and I have all changed dramatically. The best thing about working for a rapidly growing company like Cloudflare is that as the company grows, new roles open up to enable career development. And so, having left the SRE team last year, I joined the recently formed hardware engineering team, a team that simply didn’t exist in 2014.</p><p>We aim to introduce a new server platform to our edge network every 12 to 18 months or so, to ensure that we keep up with the latest industry technologies and developments. We announced the <a href="/a-tour-inside-cloudflares-g9-servers/">generation 9 server</a> in October 2018 and the <a href="/a-tour-inside-cloudflares-g9-servers/">generation 10 server</a> in February 2020. We consider this length of cycle optimal: short enough to stay nimble and take advantage of the latest technologies, but long enough to offset the time taken by our hardware engineers to test and validate the entire platform. When we are <a href="https://www.cloudflare.com/network/">shipping servers to over 200 cities</a> around the world with a variety of regulatory standards, it’s essential to get things right the first time.</p><p>We continually work with our silicon vendors to receive product roadmaps and stay on top of the latest technologies. Since mid-2020, the hardware engineering team at Cloudflare has been working on our generation 11 server.</p><p>Requests per Watt is one of the defining metrics we use when testing new hardware: it tells us how much more efficient a new hardware generation is than the previous one. 
We continually strive to reduce our operational costs and <a href="/the-climate-and-cloudflare/">power consumption reduction</a> is one of the most important parts of this. It’s good for the planet and we can fit more servers into a rack, reducing our physical footprint.</p><p>The design of these Generation 11 x86 servers has proceeded in parallel with our efforts to design next-generation edge servers using the <a href="/arms-race-ampere-altra-takes-on-aws-graviton2/">Ampere Altra</a> Arm architecture. You can read more about our tests in a blog post by my colleague Sung and we will document our work on Arm at the edge in a subsequent blog post.</p><p>We evaluated Intel’s latest generation of “Ice Lake” Xeon processors. Although Intel’s chips were able to compete with AMD in terms of raw performance, the power consumption was several hundred watts higher per server - that’s enormous. This meant that Intel’s Performance per Watt was unattractive.</p><p>We previously described how we had deployed AMD EPYC 7642 processors in our generation 10 server. This processor has 48 cores and is based on AMD’s 2nd generation EPYC architecture, code named Rome. For our generation 11 server, we evaluated 48, 56 and 64 core samples based on AMD’s 3rd generation EPYC architecture, code named Milan. Comparing the two 48 core processors directly, we were interested to find a performance boost of several percent from the <a href="https://www.amd.com/en/events/epyc">3rd generation EPYC</a> architecture. We therefore had high hopes for the 56 core and 64 core chips.</p><p>So, based on the samples we received from our vendors and our subsequent testing, hardware from AMD and Ampere made the shortlist for our generation 11 server. On this occasion, we decided that Intel did not meet our requirements. However, it’s healthy that Intel and AMD compete and innovate in the x86 space and we look forward to seeing how Intel’s next generation shapes up.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ajXt7adRxT5V2FVBVw7IJ/56872f59f7e681093f6591fa6ef42840/IMG_4118.jpeg.jpeg" />
            
            </figure>
    <div>
      <h3>Testing and validation process</h3>
      <a href="#testing-and-validation-process">
        
      </a>
    </div>
    <p>Before we go on to talk about the hardware, I’d like to say a few words about the testing process we went through to test our generation 11 servers.</p><p>As we elected to proceed with AMD chips, we were able to use our generation 10 servers as our Engineering Validation Test platform, with the only changes being the new silicon and updated firmware. We were able to perform these upgrades ourselves in our hardware validation lab.</p><p>Cloudflare’s network is built with commodity hardware and we source the hardware from multiple vendors, known as ODMs (Original Design Manufacturers), who build the servers to our specifications.</p><p>When you are working with bleeding edge silicon and experimental firmware, not everything is plain sailing. We worked with one of our ODMs to eliminate an issue which was causing the Linux kernel to panic on boot. Once resolved, we used a variety of synthetic benchmarking tools to verify the performance including <a href="https://github.com/cloudflare/cf_benchmark">cf_benchmark</a>, as well as an internal tool which applies a synthetic load to our entire software stack.</p><p>Once we were satisfied, we ordered Design Validation Test samples, which were manufactured by our ODMs with the new silicon. We continued to test these and iron out the inevitable issues that arise when you are developing custom hardware. To ensure that performance matched our expectations, we used synthetic benchmarking to test the new silicon. We also began testing the samples in our production environment, gradually introducing customer traffic to them as confidence grew.</p><p>Once the issues were resolved, we ordered the Product Validation Test samples, which were again manufactured by our ODMs, taking into account the feedback obtained in the DVT phase. As these are intended to be production grade, we work with the broader Cloudflare teams to deploy these units like a mass production order.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5qZWpcqMs5De6LfHMCudjn/b7ca612abce657d53e51f2199ebbc7a9/20210722_113627.jpeg.jpeg" />
            
            </figure>
    <div>
      <h3>CPU</h3>
      <a href="#cpu">
        
      </a>
    </div>
    <p>Previously: AMD EPYC 7642 48-Core Processor<br/>Now: AMD EPYC 7713 64-Core Processor</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6aIqXZIeXOANPykcbW8tJJ/bc57b2023460a731e92e8c2e9b924afd/IMG_4203.jpeg.jpeg" />
            
            </figure>
<div><table><thead>
  <tr>
    <th></th>
    <th><a href="https://www.amd.com/en/products/cpu/amd-epyc-7642"><span>AMD EPYC 7642</span></a></th>
    <th><a href="https://www.amd.com/en/products/cpu/amd-epyc-7643"><span>AMD EPYC 7643</span></a></th>
    <th><a href="https://www.amd.com/en/products/cpu/amd-epyc-7663"><span>AMD EPYC 7663 </span></a></th>
    <th><a href="https://www.amd.com/en/products/cpu/amd-epyc-7713"><span>AMD EPYC 7713</span></a></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Status</span></td>
    <td><span>Incumbent</span></td>
    <td><span>Candidate</span></td>
    <td><span>Candidate</span></td>
    <td><span>Candidate</span></td>
  </tr>
  <tr>
    <td><span>Core Count</span></td>
    <td><span>48</span></td>
    <td><span>48</span></td>
    <td><span>56</span></td>
    <td><span>64</span></td>
  </tr>
  <tr>
    <td><span>Thread Count</span></td>
    <td><span>96</span></td>
    <td><span>96</span></td>
    <td><span>112</span></td>
    <td><span>128</span></td>
  </tr>
  <tr>
    <td><span>Base Clock</span></td>
    <td><span>2.3GHz</span></td>
    <td><span>2.3GHz</span></td>
    <td><span>2.0GHz</span></td>
    <td><span>2.0GHz</span></td>
  </tr>
  <tr>
    <td><span>Max Boost Clock</span></td>
    <td><span>3.3GHz</span></td>
    <td><span>3.6GHz</span></td>
    <td><span>3.5GHz</span></td>
    <td><span>3.675GHz</span></td>
  </tr>
  <tr>
    <td><span>Total L3 Cache</span></td>
    <td><span>256MB</span></td>
    <td><span>256MB</span></td>
    <td><span>256MB</span></td>
    <td><span>256MB</span></td>
  </tr>
  <tr>
    <td><span>Default TDP</span></td>
    <td><span>225W</span></td>
    <td><span>225W</span></td>
    <td><span>240W</span></td>
    <td><span>225W </span></td>
  </tr>
  <tr>
    <td><span>Configurable TDP</span></td>
    <td><span>240W</span></td>
    <td><span>240W</span></td>
    <td><span>240W</span></td>
    <td><span>240W</span></td>
  </tr>
</tbody></table></div><p>In the above table, TDP refers to Thermal Design Power, a measure of the heat dissipated. All of the above processors have a configurable TDP - assuming the cooling solution is capable - giving more performance at the expense of increased power consumption. We tested all processors configured at their highest supported TDP.</p><p>The 64 core processors have 33% more cores than the 48 core processors so you might hypothesize that we would see a corresponding 33% increase in performance, although our benchmarks saw slightly more modest gains. This is because the 64 core processors have lower base clock frequencies to fit within the same 225W power envelope.</p><p>In production testing, we found that the 64 core EPYC 7713 gave us around a 29% performance boost over the incumbent, whilst having similar power consumption and thermal properties.</p>
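<p>As a back-of-envelope check of that reasoning (using only the core counts and base clocks from the table above; real throughput also depends on boost behaviour, cache and workload parallelism):</p>

```python
# Illustrative only: a naive model where throughput scales with
# cores x base clock. Figures are taken from the CPU table above.
incumbent = {"cores": 48, "base_ghz": 2.3}   # EPYC 7642
candidate = {"cores": 64, "base_ghz": 2.0}   # EPYC 7713

core_only = candidate["cores"] / incumbent["cores"] - 1
clock_adjusted = (candidate["cores"] * candidate["base_ghz"]) / (
    incumbent["cores"] * incumbent["base_ghz"]) - 1

print(f"core count alone:   +{core_only:.0%}")       # +33%
print(f"cores x base clock: +{clock_adjusted:.0%}")  # +16%
```

<p>The measured ~29% gain falls between these two naive estimates, which is consistent with production workloads also benefiting from the higher boost clocks of the newer silicon.</p>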
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3rYE9p57gOFlQ5xwL6D29t/aaf3624f32e1bf15ecf91db8c25dd96a/IMG_4196.jpeg.jpeg" />
            
            </figure>
    <div>
      <h3>Memory</h3>
      <a href="#memory">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Z86awdPPL6Yt3G5GRZP41/c76caa9c5cbf76f718e5b3da99a989b2/_MG_4100.jpeg.jpeg" />
            
            </figure><p>Previously: 256GB DDR4-2933<br/>Now: 384GB DDR4-3200</p><p>Having made a decision about the processor, the next step was to determine the optimal amount of memory for our workload. We ran a series of experiments with our chosen EPYC 7713 processor and 256GB, 384GB and 512GB memory configurations. We started off by running synthetic benchmarks with tools such as <a href="https://www.cs.virginia.edu/stream/">STREAM</a> to ensure that none of the configurations performed unexpectedly poorly and to generate a baseline understanding of the performance.</p><p>After the synthetic benchmarks, we proceeded to test the various configurations with production workloads to empirically determine the optimal quantity. We use Prometheus and Grafana to gather and display a rich set of metrics from all of our servers so that we can monitor and spot trends, and we re-used the same infrastructure for our performance analysis.</p><p>As well as measuring available memory, previous experience has shown us that one of the best ways to ensure that we have enough memory is to observe request latency and disk IO performance. If there is insufficient memory, we expect request latency and disk IO volume and latency to increase. The reason for this is that our core HTTP server uses memory to cache web assets and if there is insufficient memory the assets will be ejected from memory prematurely and more assets will be fetched from disk instead of memory, degrading performance.</p><p>Like most things in life, it’s a balancing act. We want enough memory to take advantage of the fact that serving web assets directly from memory is much faster than even the best NVMe disks. We also want to future proof our platform to enable new features such as those that we recently announced in <a href="/tag/security-week/">security week</a> and <a href="/tag/developer-week/">developer week</a>. 
However, we don’t want to spend unnecessarily on excess memory that will never be used. We found that the 512GB configuration did not provide a performance boost to justify the extra cost and settled on the 384GB configuration.</p><p>We also tested the performance impact of switching from DDR4-2933 to DDR4-3200 memory. We found that it provided a performance boost of several percent and the pricing has improved to the point where it is cost beneficial to make the change.</p>
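<p>As a toy illustration of why cache hit rate dominates this sizing decision (the latency figures below are hypothetical, chosen only to show the shape of the trade-off, and are not our production numbers):</p>

```python
# Toy model of the memory-sizing trade-off: requests served from the
# in-memory cache are fast, the rest fall through to an NVMe read.
# Both latency constants are invented for illustration.
MEM_LATENCY_US = 50    # assumed: asset served from memory
DISK_LATENCY_US = 500  # assumed: asset fetched from NVMe disk

def expected_latency_us(cache_hit_rate: float) -> float:
    """Blended per-request latency for a given cache hit rate."""
    return (cache_hit_rate * MEM_LATENCY_US
            + (1 - cache_hit_rate) * DISK_LATENCY_US)

for hit_rate in (0.99, 0.95, 0.90):
    print(f"hit rate {hit_rate:.0%}: ~{expected_latency_us(hit_rate):.0f} us")
```

<p>Even a few percentage points of lost hit rate push a meaningful slice of requests onto disk, which is why we watch request latency and disk IO rather than free memory alone.</p>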
    <div>
      <h3>Disk</h3>
      <a href="#disk">
        
      </a>
    </div>
    <p>Previously: 3x Samsung PM983 x 960GB<br/>Now: 2x Samsung PM9A3 x 1.92TB</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1bzxh2LTLqrPAHlIWynt9t/10c8fe856bf9e3a6feed7a8624451c66/20210722_113829.jpeg.jpeg" />
            
            </figure><p>We validated samples by studying the manufacturer’s data sheets and <a href="https://fio.readthedocs.io/en/latest/fio_doc.html">testing using fio</a> to ensure that the results being obtained in our test environment were in line with the published specifications. We also developed an automation framework to help compare different drive models using fio. The framework helps us to restore the drives close to factory settings, precondition the drives, perform the sequential and random tests in our environment, and analyze the results to evaluate bandwidth and latency. Since our SSD samples arrived in our test center in different months, the automated framework sped up evaluations by reducing the time we spent testing and analyzing data.</p><p>For Gen 11 we decided to move to a 2x 2TB configuration from the original 3x 1TB configuration, giving us an extra 1TB of storage. This also meant we could use the higher performance of a 2TB drive and save around 6W of power since there is one less SSD.</p><p>After analyzing the performance, latency and endurance of various 2TB drives, we chose Samsung’s PM9A3 SSDs as our Gen 11 drives. The results we obtained below were consistent with the manufacturer's claims.</p><p>Sequential performance:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5yOK70l1XM5JJij5aeUHTQ/5653a3e28925e06407b88193780acd77/pasted-image-0-5.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5nifB3rxB1gHWX6JBxBhfY/707192c1a778c46a15aff4b419d42c13/pasted-image-0--1--2.png" />
            
            </figure><p>Random Performance:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/A8p9s1IjM4UnmNmaMPx0T/358e01b8832c5e1ed7c1479a05c6cd5a/pasted-image-0--2-.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/61rhIo4w8sSIYbUiGfHRn9/d07ed49c5b957e3b1dde1db58aa1e484/pasted-image-0--3-.png" />
            
            </figure><p>Compared to our previous generation drives, we saw a 1.5x - 2x improvement in read and write bandwidths. The higher values for the PM9A3 can be attributed to it being a PCIe 4.0 drive with a more intelligent SSD controller and an upgraded NAND architecture.</p>
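<p>Our internal framework isn’t public, but the comparison loop it automates can be sketched roughly as follows (the device path, job names and job parameters here are illustrative, not our production settings):</p>

```python
# Rough sketch of an fio-based drive comparison pass: precondition the
# drive towards steady state, then run sequential and random jobs with
# JSON output for later analysis. All parameters are illustrative.
DEVICE = "/dev/nvme0n1"  # example device path, not a production value

def fio_cmd(name: str, rw: str, bs: str, runtime_s: int = 60) -> list:
    """Build an fio command line that emits JSON for later analysis."""
    return [
        "fio", "--name=" + name, "--filename=" + DEVICE,
        "--rw=" + rw, "--bs=" + bs, "--direct=1",
        "--iodepth=32", "--numjobs=4",
        "--time_based", "--runtime=" + str(runtime_s),
        "--group_reporting", "--output-format=json",
    ]

JOBS = [
    ("precondition", "write",     "128k"),  # restore steady-state behaviour
    ("seq-read",     "read",      "128k"),  # sequential bandwidth
    ("seq-write",    "write",     "128k"),
    ("rand-read",    "randread",  "4k"),    # random IOPS and latency
    ("rand-write",   "randwrite", "4k"),
]

for name, rw, bs in JOBS:
    print(" ".join(fio_cmd(name, rw, bs)))
    # A real harness would execute each command (e.g. via subprocess.run),
    # json.loads() the output, and compare fields such as the per-job
    # bandwidth and completion latency across drive models.
```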
    <div>
      <h3>Network</h3>
      <a href="#network">
        
      </a>
    </div>
    <p>Previously: Mellanox ConnectX-4 dual-port 25G<br/>Now: Mellanox ConnectX-4 dual-port 25G</p><p>There is no change on the network front; the Mellanox ConnectX-4 is a solid performer which continues to meet our needs. We investigated higher speed Ethernet, but we do not currently see this as beneficial. Cloudflare’s network is built on cheap commodity hardware and the highly distributed nature of Cloudflare’s network means we don’t have discrete DDoS scrubbing centres. All points of presence operate as scrubbing centres. This means that we distribute the load across our entire network and do not need to employ higher speed and more expensive Ethernet devices.</p>
    <div>
      <h3>Open source firmware</h3>
      <a href="#open-source-firmware">
        
      </a>
    </div>
    <p>Transparency, security and integrity are absolutely critical to us at Cloudflare. Last year, we described how we had <a href="/anchoring-trust-a-hardware-secure-boot-story/">deployed Platform Secure Boot</a> to create trust that we were running the software that we thought we were.</p><p>Now, we are pleased to announce that we are deploying open source firmware to our servers using OpenBMC. With access to the source code, we have been able to configure BMC features such as the fan PID controller, recording and exposing BIOS POST codes, and managing networking ports and devices. Prior to OpenBMC, requesting these features from our vendors led to varying results and misunderstandings of the scope and capabilities of the BMC. After working with the BMC source code much more directly, we have the flexibility to work on features ourselves to our liking, or understand why the BMC is incapable of running our desired software.</p><p>Whilst our current BMC is an industry standard, we feel that OpenBMC better suits our needs and gives us advantages such as allowing us to deal with upstream security issues without a dependency on our vendors. Some opportunities with security include integration of desired authentication modules, usage of specific software packages, staying up to date with the latest Linux kernel, and controlling a variety of attack vectors. Because we have a kernel lockdown implemented, flashing tooling is difficult to use in our environment. With access to the source code of the flashing tools, we understand what the tools need access to and can assess whether or not this meets our security standards.</p>
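<p>For a flavour of what tuning a fan PID controller involves, here is a minimal sketch of such a control loop. The gains, setpoint and temperatures are invented for illustration; OpenBMC’s actual fan control is configured per platform and runs as a dedicated daemon, not as a script like this.</p>

```python
# Minimal PID fan-control sketch. All coefficients and temperatures
# are hypothetical values, not Cloudflare's production tuning.
class FanPid:
    def __init__(self, setpoint_c: float, kp: float, ki: float, kd: float):
        self.setpoint = setpoint_c
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, temp_c: float, dt_s: float = 1.0) -> float:
        """Return a fan duty cycle (0-100%) for the current temperature."""
        error = temp_c - self.setpoint          # positive when too hot
        self.integral += error * dt_s
        derivative = (error - self.prev_error) / dt_s
        self.prev_error = error
        duty = (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)
        return max(0.0, min(100.0, duty))       # clamp to a valid PWM range

pid = FanPid(setpoint_c=70.0, kp=4.0, ki=0.5, kd=1.0)
for temp in (65.0, 72.0, 80.0):
    print(f"{temp:.0f}C -> fan duty {pid.step(temp):.1f}%")
```

<p>Having the source means the gains, setpoints and sensor mappings of the real controller are ours to adjust, rather than a fixed behaviour baked into vendor firmware.</p>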
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/01wz0PNamn0NdOlkgxbgUs/104ce7600fe455376a0e39538c929882/IMG_4195.jpeg.jpeg" />
            
            </figure><p>The jump between our generation 9 and generation 10 servers was enormous. To summarise, we changed from a dual-socket Intel platform to a single-socket AMD platform. We upgraded the SATA SSDs to NVMe storage devices, and physically the multi-node chassis changed to a 1U form factor.</p><p>At the start of the generation 11 project we weren’t sure if we would be making such radical changes again. However, after thorough testing of the latest chips and a review of how well the generation 10 server has performed in production for over a year, our generation 11 server built upon the solid foundations of generation 10 and ended up as a refinement rather than a total revamp. Despite this, and bearing in mind that performance varies by time of day and geography, we are pleased that generation 11 is capable of serving approximately 29% more requests than generation 10 without an increase in power consumption.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1hWA14j4FrNG6KYDaHVFI4/30172744bf9b0d94a08c8a8abdb34b3f/Screenshot-2021-06-22-at-15.27.27.png" />
            
            </figure><p>Thanks to Denny Mathew’s and Ryan Chow’s work on benchmarking and OpenBMC, respectively.</p><p>If you are interested in working with bleeding edge hardware, open source server firmware, solving interesting problems, improving our performance, and helping us work on our generation 12 server platform (amongst many other things!), <a href="https://www.cloudflare.com/careers/jobs/?department=Infrastructure&amp;location=default">we’re hiring</a>.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[Hardware]]></category>
            <guid isPermaLink="false">1c9Tnv6ewSXOVymWrMDUzF</guid>
            <dc:creator>Chris Howells</dc:creator>
        </item>
    </channel>
</rss>