
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 06:31:14 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Removing uncertainty through "what-if" capacity planning]]></title>
            <link>https://blog.cloudflare.com/scenario-planner/</link>
            <pubDate>Fri, 20 Sep 2024 14:01:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare’s Capacity Planning team discusses planning for “what-if” type scenarios, and how they’ve introduced a new “Scenario Planner” system to quickly model hypothetical future changes ]]></description>
<content:encoded><![CDATA[ <p>Infrastructure planning for a network serving more than 81 million requests per second at peak, globally distributed across <a href="https://www.cloudflare.com/network/"><u>more than 330 cities in 120+ countries</u></a>, is complex. The capacity planning team at Cloudflare ensures there is enough capacity in place all over the world so that our customers have one less thing to worry about - our infrastructure, which should just work. Through our processes, the team puts careful consideration into “what-ifs”. What if something unexpected happens and <a href="https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage/"><u>one of our data centers fails</u></a>? What if one of our largest customers triples or quadruples its request count? Across a gamut of scenarios like these, the team works to understand where traffic will be served from and how the Cloudflare customer experience may change.</p><p>This blog post gives a look behind the curtain of how these scenarios are modeled at Cloudflare, and why it's so critical for our customers.</p>
    <div>
      <h2>Scenario planning and our customers</h2>
      <a href="#scenario-planning-and-our-customers">
        
      </a>
    </div>
    <p>Cloudflare customers rely on the data centers that Cloudflare has deployed all over the world, placing us within 50 ms of approximately 95% of the Internet-connected population globally. But round-trip time to our end users means little if those data centers don’t have the capacity to serve requests. Cloudflare has invested deeply into systems that work around the clock to optimize the requests flowing through our network because we know that failures happen all the time: the Internet can be a volatile place. See <a href="https://blog.cloudflare.com/backbone2024"><u>our blog post from August 2024</u></a> on how we handle this volatility in real time on our backbone, and our <a href="https://blog.cloudflare.com/meet-traffic-manager"><u>blog post from late 2023</u></a> about how another system, Traffic Manager, actively works in and between data centers, moving traffic to optimize the customer experience around constraints in our data centers. Both of these systems do a fantastic job in real time, but there is still a gap — what about over the long term?</p><p>Most of the volatility that the above systems are built to manage is resolved within shorter time scales than those we build plans for. (There are, of course, some failures that are <a href="https://itweb.africa/content/O2rQGMAEyPGMd1ea"><u>exceptions</u></a>.) Most scenarios we model still need to take into account the state of our data centers in the future, as well as what actions systems like Traffic Manager will take during those periods. But before getting into those constraints, it’s important to note how capacity planning measures things: in units of CPU Time, defined as the time that each request takes in the CPU.
This is done for the same reasons that <a href="https://blog.cloudflare.com/meet-traffic-manager"><u>Traffic Manager</u></a> uses CPU Time, in that it enables the team to 1) use a common unit across different types of customer workloads and 2) speak a common language with other teams and systems (like Traffic Manager). The same reasoning the Traffic Manager team cited <a href="https://blog.cloudflare.com/meet-traffic-manager"><u>in their own blog post</u></a> is equally applicable for capacity planning: </p><blockquote><p><i>…using requests per second as a metric isn’t accurate enough when actually moving traffic. The reason for this is that different customers have different resource costs to our service; a website served mainly from cache with the WAF deactivated is much cheaper CPU wise than a site with all WAF rules enabled and caching disabled. So we record the time that each request takes in the CPU. We can then aggregate the CPU time across each plan to find the CPU time usage per plan. We record the CPU time in ms, and take a per second value, resulting in a unit of milliseconds per second.</i></p></blockquote><p>This is important for customers for the same reason that the Traffic Manager team cited in their blog post as well: we can correlate CPU time to performance, specifically latency.</p><p>Now that we know our unit of measurement is CPU time, we need to set up our models with the new constraints associated with the change that we’re trying to model. Specifically, there is a subset of constraints that we are particularly interested in, because we know they can impact our customers by affecting the availability of CPU in a data center. These are split into two main inputs in our models: Supply and Demand. We can think of these as “what-if” questions, such as the following examples:</p>
    <div>
      <h3>Demand what-ifs</h3>
      <a href="#demand-what-ifs">
        
      </a>
    </div>
    <ul><li><p>What if a new customer onboards to Cloudflare with a significant volume of requests and/or bytes?</p></li><li><p>What if an existing customer increased its volume of requests and/or bytes by some multiplier (e.g., 2x, 3x, nx), at peak, for the next three months?</p></li><li><p>What if the growth rate, in number of requests and bytes, of all of our data centers worldwide increased from X to Y two months from now, indefinitely?</p></li><li><p>What if the growth rate, in number of requests and bytes, of data center facility A increased from X to Y one month from now?</p></li><li><p>What if traffic egressing from Cloudflare to a last-mile network shifted from one location (such as Boston) to another (such as New York City) next week?</p></li></ul>
    <div>
      <h3>Supply what-ifs</h3>
      <a href="#supply-what-ifs">
        
      </a>
    </div>
    <ul><li><p>What if data center facility A lost some or all of its available servers two months from now?</p></li><li><p>What if we added X servers to data center facility A today?</p></li><li><p>What if some or all of our connectivity to other ASNs (<a href="https://www.cloudflare.com/network/"><u>12,500 Networks/nearly 300 Tbps</u></a>) failed now?</p></li></ul>
    <div>
      <h3>Output</h3>
      <a href="#output">
        
      </a>
    </div>
    <p>For any one of these scenarios, or a combination of them, our model’s output aims to answer the following: </p><ul><li><p>What will the overall capacity picture look like over time? </p></li><li><p>Where will the traffic go? </p></li><li><p>How will this impact our costs?</p></li><li><p>Will we need to deploy additional servers to handle the increased load?</p></li></ul><p>Given these sets of questions and outputs, manually creating a model to answer each question, or a combination of them, quickly becomes an operational burden for any team. This is what led us to launch “Scenario Planner”.</p>
    <div>
      <h2>Scenario Planner</h2>
      <a href="#scenario-planner">
        
      </a>
    </div>
    <p>In August 2024, the infrastructure team finished building “Scenario Planner”, a system that enables anyone at Cloudflare to simulate “what-ifs”. This gives our team the ability to quickly model hypothetical changes to our demand and supply metrics across time and in any of Cloudflare’s data centers. The core functionality of the system addresses the same questions we answered in the manual models discussed above. After we enter the changes we want to model, Scenario Planner converts from the units commonly associated with each question to our common unit of measurement: CPU Time. These inputs are then used to model the updated capacity across all of our data centers, including how demand may be distributed in cases where capacity constraints may start impacting performance in a particular location. If that happens, Traffic Manager serves some portion of those requests from a nearby location to minimize the impact on customers and user experience.</p>
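<p>As a rough sketch of the unit conversion described above (all numbers and helper names below are hypothetical illustrations, not Cloudflare's actual figures or code), a workload expressed in requests per second can be converted to CPU time and compared against a data center's supply like this:</p>

```python
# Hypothetical sketch: converting a workload in requests per second into
# CPU time (milliseconds of CPU per second of wall time), the common unit
# used for capacity planning. All numbers are invented for illustration.

def demand_cpu_time_ms_per_s(requests_per_s: float, cpu_ms_per_request: float) -> float:
    """CPU-time demand, in milliseconds of CPU per second of wall time."""
    return requests_per_s * cpu_ms_per_request

def supply_cpu_time_ms_per_s(cpu_cores: int) -> float:
    """Each core can supply at most 1000 ms of CPU time per second."""
    return cpu_cores * 1000.0

# A cache-heavy workload costs little CPU per request; a WAF-heavy one costs more.
cheap = demand_cpu_time_ms_per_s(50_000, 0.2)   # 10,000 ms/s
heavy = demand_cpu_time_ms_per_s(50_000, 1.5)   # 75,000 ms/s
supply = supply_cpu_time_ms_per_s(128)          # 128,000 ms/s

utilization = (cheap + heavy) / supply
print(f"Utilization: {utilization:.1%}")        # Utilization: 66.4%
```

<p>The value of the common unit is visible here: two workloads with identical request rates can have very different CPU-time footprints.</p>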
    <div>
      <h3>Updated demand questions with inputs</h3>
      <a href="#updated-demand-questions-with-inputs">
        
      </a>
    </div>
    <ul><li><p><b>Question:</b> What if a new customer onboards to Cloudflare with a significant volume of requests?</p></li><li><p><b>Input:</b> The new customer’s expected volume, geographic distribution, and timeframe of requests, converted to a count of virtual CPUs</p></li><li><p><b>Calculation(s)</b>: Scenario Planner converts from the virtual CPU count to CPU Time, and distributes the new demand across the selected regions according to the aggregate distribution of all customer usage.</p></li></ul><p>
<br />
</p><ul><li><p><b>Question:</b> What if an existing customer increased its volume of requests and/or bytes by some multiplier (e.g., 2x, 3x, nx), at peak, for the next three months?</p></li><li><p><b>Input</b>: Select the customer name, the multiplier, and the timeframe</p></li><li><p><b>Calculation(s)</b>: Scenario Planner already knows how the selected customer’s traffic is distributed across all data centers globally, so this simply involves multiplying that value by the user-selected multiplier</p></li></ul><p>
<br />
</p><ul><li><p><b>Question:</b> What if the growth rate, in number of requests and bytes, of all of our data centers worldwide increased from X to Y two months from now, indefinitely?</p></li><li><p><b>Input:</b> Enter a new global growth rate and timeframe</p></li><li><p><b>Calculation(s)</b>: Scenario Planner distributes this growth across all data centers globally according to their current growth rate.  In other words, the global growth is an aggregation of all individual data center’s growth rates, and to apply a new “Global” growth rate, the system scales up each of the individual data center’s growth rates commensurate with the current distribution of growth.</p></li></ul><p>
<br />
</p><ul><li><p><b>Question:</b> What if the growth rate, in number of requests and bytes, of data center facility A increased from X to Y one month from now?</p></li><li><p><b>Input:</b> Select a data center facility, enter a new growth rate for that data center and the timeframe to apply that change across.</p></li><li><p><b>Calculation(s)</b>: Scenario Planner passes the new growth rate for the data center to the backend simulator, across the timeline specified by the user</p></li></ul><p>
<br />
</p>
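<p>The global growth-rate calculation described above can be sketched as follows; the data center names and rates are invented for illustration, and the equal weighting is an assumption (the real system presumably weights by demand):</p>

```python
# Hypothetical sketch of applying a new "global" growth rate: the global rate
# aggregates per-data-center growth, so a new target is met by scaling each
# data center's rate proportionally. Equal weighting is assumed here for
# simplicity; names and rates are made up.

def rescale_growth(per_dc_rates: dict[str, float], new_global_rate: float) -> dict[str, float]:
    current_global = sum(per_dc_rates.values()) / len(per_dc_rates)  # assumed equal weighting
    scale = new_global_rate / current_global
    return {dc: rate * scale for dc, rate in per_dc_rates.items()}

rates = {"IAD": 0.04, "ORD": 0.02, "EWR": 0.03}  # hypothetical monthly growth per DC
print(rescale_growth(rates, 0.06))
```

<p>Each data center keeps its share of the overall growth, commensurate with the current distribution, while the aggregate hits the new target.</p>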
    <div>
      <h3>Updated supply questions with inputs</h3>
      <a href="#updated-supply-questions-with-inputs">
        
      </a>
    </div>
    <ul><li><p><b>Question:</b> What if data center facility A lost some or all of its available servers two months from now?</p></li><li><p><b>Input</b>: Select a data center, and enter the number of servers to remove, or select to remove all servers in that location, as well as the timeframe for when those servers will not be available</p></li><li><p><b>Calculation(s)</b>: Scenario Planner converts the server count entered (including all servers in a given location) to CPU Time before passing to the backend</p></li></ul><p>
<br />
</p><ul><li><p><b>Question:</b> What if we added X servers to data center facility A today?</p></li><li><p><b>Input</b>: Select a data center, and enter the number of servers to add, as well as the timeline for when those servers will first go live</p></li><li><p><b>Calculation(s)</b>: Scenario Planner converts the server count entered (including all servers in a given location) to CPU Time before passing to the backend</p></li></ul><p>
<br />
</p><p>We made it simple for internal users to understand the impact of those changes because Scenario Planner outputs the same views that everyone who has seen our heatmaps and visual representations of our capacity status is familiar with. There are two main outputs the system provides: a heatmap and an “Expected Failovers” view. Below, we explore what these are, with some examples.</p>
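<p>The supply-side conversion above can be sketched in the same spirit; the cores-per-server figure and helper below are hypothetical, not Cloudflare's actual hardware specs:</p>

```python
# Rough sketch of a supply change ("remove N servers from facility A starting
# at week W") applied to a capacity timeline, expressed in CPU ms/s.
# Server specs and all numbers are hypothetical.

CORES_PER_SERVER = 64
MS_PER_CORE_PER_S = 1000.0

def apply_server_change(weekly_supply_ms: list[float], week: int, server_delta: int) -> list[float]:
    """Shift supply from `week` onward by the CPU time of `server_delta` servers."""
    delta = server_delta * CORES_PER_SERVER * MS_PER_CORE_PER_S
    return [s + delta if w >= week else s for w, s in enumerate(weekly_supply_ms)]

supply = [1_000_000.0] * 6                 # six weeks of flat supply
print(apply_server_change(supply, 3, -5))  # lose 5 servers starting in week 3
```

<p>Additions work the same way with a positive delta starting at the week the servers go live.</p>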
    <div>
      <h3>Heatmap</h3>
      <a href="#heatmap">
        
      </a>
    </div>
    <p>Capacity planning evaluates its success by its ability to predict demand: we generally produce a weekly, monthly, and quarterly forecast of 12 months’ to three years’ worth of demand, and nearly all of our infrastructure decisions are based on the output of this forecast. Scenario Planner presents the results of those forecasts via a heatmap: it shows our current state, as well as planned future server additions scheduled based on the forecast.</p><p>Here is an example of our heatmap, showing some of our largest data centers in Eastern North America (ENAM). Ashburn briefly shows yellow because our capacity planning threshold for adding more server capacity to our data centers is 65% utilization (based on CPU time supply and demand): this gives the Cloudflare teams time to procure additional servers, ship them, install them, and bring them live <i>before</i> customers are impacted and systems like Traffic Manager begin triggering. The little cloud icons indicate planned upgrades of varying sizes, scheduled well ahead of forecasted demand to avoid customer performance degradation.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/PXIr1mpsrWPLmWpP9RCPg/5062269f1432a41c8177a7850243ab8c/BLOG-2554_2.png" />
          </figure><p>The question Scenario Planner then answers is how this view changes under a hypothetical scenario: what if our Ashburn, Miami, and Atlanta facilities shut down completely? This is unlikely to happen, but we would expect an enormous impact on the remaining largest facilities in ENAM. We’ll simulate all three of these failing at the same time, taking them offline indefinitely:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3apNpQAHPs9uY1QTAxOQ6d/6230e0950d3f70b50e17bfb10bca999c/BLOG-2554_3.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6vfWrww4uT8VWtLa0vcDAM/dacf2bf102c8c38672ff08c48bc42542/BLOG-2554_4.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7amWSND6PZ95oVJA5oO0Qc/588c7d7efcccfedca7cec240f830fa01/BLOG-2554_5.png" />
          </figure><p>This results in a view of our capacity through the rest of the year in the remaining large data centers in ENAM — capacity is clearly constrained: Traffic Manager would be working hard to mitigate any impact to customer performance if this were to happen. Our capacity view in the heatmap is capped at 75%: this is because Traffic Manager typically engages around this level of CPU utilization. Beyond 75%, Cloudflare customers may begin to experience increased latency, though this depends on the product and workload, and is in reality much more dynamic. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Tse9FexuemDHzExPYnY0i/0cdba516c0b33888b7079d435ec02fd6/BLOG-2554_6.png" />
          </figure><p>This outcome in the heatmap is not unexpected. But now we typically get a follow-up question: clearly this traffic won’t fit in just Newark, Chicago, and Toronto, so where do all these requests get served from? Enter the failover simulator: Capacity Planning has long been simulating how Traffic Manager may behave over the long term, and for Scenario Planner it was simple to extend this functionality to answer exactly this question.</p><p>There is currently no traffic being moved by Traffic Manager from these data centers, but our simulation shows a significant portion of the Atlanta CPU time being served from our DFW/Dallas data center, as well as from Newark (bottom pink) and Chicago (orange), through the rest of the year during this hypothetical failure. With Scenario Planner, Capacity Planning can take this information and simulate multiple failures all over the world to understand the impact on customers, taking action to ensure that customers trusting Cloudflare with their web properties can expect high performance even in instances of major data center failures. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1iv8nxYdIXOJ626UHpUadI/8581d88d740b6369ba5fc8500d9c7d97/Screenshot_2024-09-18_at_10.27.50_PM.png" />
          </figure>
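<p>A toy version of the failover behavior described above might look like the following; the threshold mirrors the ~75% figure cited earlier, while the data center names, capacities, and neighbor ordering are invented for illustration:</p>

```python
# Toy failover simulation: when a data center's CPU-time utilization would
# exceed ~75% (the level at which Traffic Manager typically engages), excess
# demand spills to the nearest data center with headroom. All data is invented.

THRESHOLD = 0.75

def simulate_failover(demand: dict[str, float], capacity: dict[str, float],
                      neighbors: dict[str, list[str]]) -> dict[str, float]:
    served = dict(demand)
    for dc in demand:
        limit = capacity[dc] * THRESHOLD
        excess = served[dc] - limit
        if excess <= 0:
            continue
        served[dc] = limit
        for nb in neighbors[dc]:  # try neighbors in order of proximity
            room = capacity[nb] * THRESHOLD - served[nb]
            moved = min(excess, max(room, 0.0))
            served[nb] += moved
            excess -= moved
            if excess <= 0:
                break
    return served

demand = {"ATL": 90.0, "DFW": 40.0, "EWR": 50.0}      # CPU time, arbitrary units
capacity = {"ATL": 100.0, "DFW": 100.0, "EWR": 100.0}
neighbors = {"ATL": ["DFW", "EWR"], "DFW": ["ATL"], "EWR": ["ATL"]}
print(simulate_failover(demand, capacity, neighbors))
# {'ATL': 75.0, 'DFW': 55.0, 'EWR': 50.0}
```

<p>The real simulator is far richer (it models time, network paths, and per-product cost), but the core idea is the same: demand beyond the engagement threshold is served from nearby locations with headroom.</p>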
    <div>
      <h2>Planning with uncertainty</h2>
      <a href="#planning-with-uncertainty">
        
      </a>
    </div>
    <p>Capacity planning for a large global network comes with plenty of uncertainties. Scenario Planner is one example of the work the Capacity Planning team is doing to ensure that the millions of web properties our customers entrust to Cloudflare can expect consistent, top-tier performance all over the world.</p><p>The Capacity Planning team is hiring — check out the <a href="https://www.cloudflare.com/careers/"><u>Cloudflare careers page</u></a> and <a href="https://www.cloudflare.com/careers/jobs/?title=capacity"><u>search for open roles on the Capacity Planning team</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Edge]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Network Services]]></category>
            <guid isPermaLink="false">13ueCTf2YFQfu3c1KeOlWj</guid>
            <dc:creator>Curt Robords</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare deployment in Guam]]></title>
            <link>https://blog.cloudflare.com/cloudflare-deployment-in-guam/</link>
            <pubDate>Mon, 25 Jul 2022 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare Deployment in Guam - Delivering a Better Internet for Faraway Pacific Ocean Archipelagos Residents ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6oXc7IRzLqEUfS9I7xGdrG/dbf3056ab6a7b9d38013629de9f593ff/image7-9.png" />
            
            </figure><p>Having fast Internet properties means being as few milliseconds as possible away from our customers and their users, no matter where they are on Earth. And because of the design of Cloudflare's network, we don't just make Internet properties faster by being closer, we bring our <a href="https://www.cloudflare.com/products/zero-trust/">Zero Trust</a> services closer too. So whether you're connecting to a public API, a website, a SaaS application, or your company's internal applications, we're close by.</p><p>We make this possible by adding new cities, partners, capacity, and cables. And we have seen over and over again that making the Internet faster in a region can have a clear impact on traffic: if the experience is quicker, people usually do more online.</p><p>Cloudflare’s network keeps growing, and its global footprint expands accordingly. In April 2022 we announced that <a href="/new-cities-april-2022-edition/">the Cloudflare network now spans 275 cities</a> and the number keeps growing.</p><p>In this blog post we highlight the deployment of our data center in <a href="https://en.wikipedia.org/wiki/Hag%C3%A5t%C3%B1a,_Guam">Hagatna, Guam</a>.</p>
    <div>
      <h3>Why a blog about Guam?</h3>
      <a href="#why-a-blog-about-guam">
        
      </a>
    </div>
    <p>Guam is about 2,400 km from both Tokyo in the north and Manila in the west, and about 6,100 km from Honolulu in the east. Honolulu itself is the most remote major city in the US and one of the most remote in the world, the closest major city being San Francisco, California, at 3,700 km. From this, one can get a sense of just how far Guam is from the US to its east and from Asia to its west.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3WhLFmnmj0yt4ZUnvhU1tb/e66197c0fd0356046d065026a8c65a07/image3-9.png" />
            
            </figure><p>Figure 1: Guam Geographical Location.</p><p>Why is this relevant? As explained <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-latency/">here</a>, latency is the time it takes for data to pass from one point on a network to another. And one of the main reasons behind network latency is the distance between client devices — like a browser on a mobile phone — making requests and the servers responding to those requests. So, if we consider where Guam is geographically, we get a good picture of how Guam’s residents are affected by the long distances their Internet requests, and responses, have to travel.</p><p>This is why every time Cloudflare adds a new location, we make the Internet a bit faster: every new location brings Cloudflare’s services closer to its users. As part of Cloudflare’s mission, the Guam deployment is a perfect example of how we are going from being the most global network on Earth to the most local one as well.</p>
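<p>To make the distance-latency relationship concrete, here is a back-of-the-envelope calculation, assuming light travels through optical fiber at roughly 200,000 km/s (about two-thirds of its speed in a vacuum):</p>

```python
# Best-case round-trip time over fiber for the distances cited above.
# Assumes ~200,000 km/s signal propagation in fiber and a perfectly straight
# path; real cable routes are longer and add routing/queuing delays.

FIBER_KM_PER_MS = 200.0  # ~200,000 km/s expressed as km per millisecond

def min_rtt_ms(distance_km: float) -> float:
    """Theoretical lower bound on round-trip time over fiber."""
    return 2 * distance_km / FIBER_KM_PER_MS

print(f"Guam -> Tokyo:    {min_rtt_ms(2400):.0f} ms minimum")  # 24 ms
print(f"Guam -> Honolulu: {min_rtt_ms(6100):.0f} ms minimum")  # 61 ms
```

<p>Even under these ideal assumptions, a request that must leave the island pays tens of milliseconds before any server-side work begins, which is why a local deployment matters so much here.</p>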
    <div>
      <h3>Submarine cables</h3>
      <a href="#submarine-cables">
        
      </a>
    </div>
    <p><a href="https://submarine-cable-map-2022.telegeography.com/">There are 486 submarine cables and 1,306 landings currently active or under construction</a>, spanning an estimated 1.3 million km around the globe.</p><p><a href="https://www.submarinecablemap.com/country/guam">A closer look at the specific submarine cables landing in Guam</a> shows us that the region is actually well served in terms of submarine cables, with several connections to locations such as Japan, Taiwan, the Philippines, Australia, Indonesia, and Hawaii, making Guam more resilient to events such as the one that affected Tonga in January 2022, when a volcanic eruption damaged submarine cables - we wrote about it <a href="/tonga-internet-outage/">here</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/42ARhRWKJc7Q01c28WM846/1239c2930cd6096bb37b18e26d1a7d01/image2-16.png" />
            
            </figure><p>Figure 2: Submarine Cables Landing in Guam (source: <a href="https://www.submarinecablemap.com/country/guam">submarinecablemap.com</a>)</p><p>The picture above also shows the relevance of Guam for other even more remote locations, such as the <a href="https://en.wikipedia.org/wiki/Federated_States_of_Micronesia">Federated States of Micronesia (FSM)</a> or the <a href="https://en.wikipedia.org/wiki/Marshall_Islands">Marshall Islands</a>, which have an ‘extra-hop’ to cover when trying to reach the rest of the Internet. <a href="https://en.wikipedia.org/wiki/Palau">Palau</a> also relies on Guam but, from a resilience point of view, has alternatives to locations such as the Philippines or to Australia.</p>
    <div>
      <h3>Presence at Mariana Islands Internet Exchange</h3>
      <a href="#presence-at-mariana-islands-internet-exchange">
        
      </a>
    </div>
    <p>Cloudflare’s presence in Guam is through Mariana Islands <a href="https://www.cloudflare.com/learning/cdn/glossary/internet-exchange-point-ixp/">Internet Exchange</a>, or <a href="https://mariix.net/">MARIIX</a>, allowing Cloudflare to <a href="https://mariix.net/peering">peer with participants</a> such as:</p><ul><li><p>AS 395400 - University of Guam</p></li><li><p>AS 9246 - GTA Teleguam</p></li><li><p>AS 3605 - DoCoMo Pacific</p></li><li><p>AS 7131 - IT&amp;E</p></li><li><p>AS 17456 - PDS</p></li></ul><p>As there are multiple participants, peering sessions are being added gradually. The first was AS 7131, served since April 2022; the latest addition is AS 9246, from July 2022.</p><p>As some of these ASNs or ASs (<a href="https://www.cloudflare.com/learning/network-layer/what-is-an-autonomous-system/">autonomous systems</a> — large networks or groups of networks) have their own downstream customers, further ASs can leverage Cloudflare’s deployment at Guam, examples being AS 17893 - Palau National Communications Corp - or AS 21996 - Guam Cell.</p><p>Therefore, the Cloudflare deployment brings not only a better (and generally faster) Internet to Guam’s residents, but also to residents in nearby archipelagos that are anchored on Guam. <a href="https://www.worldometers.info/world-population/">In May 2022, according to UN’s forecasts</a>, the covered resident population in the main areas in the region stands at around 171k in Guam, 105k in FSM, and 60k in the Marshall Islands.</p><p>For this deployment, Cloudflare worked with the skilled MARIIX personnel for the physical installations, provisioning, and services turn-up. Despite the geographical distance and time zone differences (Hagatna is 10 hours ahead of GMT but only two hours ahead of the Cloudflare office in Singapore, so the time difference wasn’t a big challenge), all the logistics involved and communications went smoothly. 
A <a href="https://blog.apnic.net/2022/07/06/mariix-improves-local-internet-for-growing-pacific-hub/">recent blog post from APNIC</a>, featuring some of the personnel Cloudflare worked with, reiterates the valuable work being done locally and the increasing importance of Guam in the region.</p>
    <div>
      <h3>Performance impact for local/regional habitants</h3>
      <a href="#performance-impact-for-local-regional-habitants">
        
      </a>
    </div>
    <p>Before Cloudflare’s deployment in Guam, customers of local ASs trying to reach Internet properties via Cloudflare’s network were redirected mainly to Cloudflare’s deployments in <a href="https://www.cloudflare.com/network/">Tokyo and Seattle</a>. This is due to the anycast routing used by Cloudflare — as described <a href="https://www.cloudflare.com/learning/cdn/glossary/anycast-network/">here</a>, anycast typically routes incoming traffic to the nearest data center. In the case of Guam, as previously described, the nearest locations are thousands of kilometers away, which means high latency and a degraded user experience.</p><p>With Cloudflare’s deployment in Guam, Guam’s and nearby archipelagos’ residents are no longer redirected to those faraway locations; instead, they are served locally by the new Cloudflare servers. Although a decrease measured in milliseconds may not seem like a lot, it represents a significant boost in user experience, as latency is dramatically reduced. As the total distance between users and servers shrinks, load time shrinks as well. And just as users often give up on a site when load times are high, the opposite holds too: users are more willing to stay on a site that loads quickly. This improvement means both a better user experience and higher use of the Internet.</p><p>In the case of Guam, we use AS 9246 as an example: it was previously served from Seattle but, since around 23:00 UTC on 14 July 2022, has been served from Guam, as illustrated below:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5fWbSHuuVlhYnjuxpumovQ/e14ee81f5b64a52861a2f403048ab79e/1-1.png" />
            
            </figure><p>Figure 3: Requests per Colo for AS 9246 Before vs After Cloudflare Deployment at Guam.</p><p>The following chart displays the median and the 90th percentile of the eyeball TCP RTT for AS 9246 immediately before and after AS 9246 users started to use the Guam deployment:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/30b6AqRcEICiRdXdVqXZny/686d02c528ef5eeb37d4cab7079ecb6f/2-1.png" />
            
            </figure><p>Figure 4: Eyeball TCP RTT for AS 9246 Before vs After Cloudflare Deployment at Guam.</p><p>From the chart above we can derive that the overall reduction in eyeball TCP <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">RTT</a> immediately before and after Guam’s deployment was:</p><ul><li><p>Median decreased from 136.3 ms to 9.3 ms, a 93.2% reduction;</p></li><li><p>P90 decreased from 188.7 ms to 97.0 ms, a 48.6% reduction.</p></li></ul><p>When comparing the 12:00 to 13:00 UTC period on 14 July 2022 (AS 9246 still served from Seattle) with the same hour on 15 July 2022 (AS 9246 already served from Guam), the differences are also clear. We picked this window because it is a busy period locally, since local time is UTC+10:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3bGuMIPCrGjL7JXzbPpesO/b82ece1afba89ef9900e73dca85bc5ed/3-1.png" />
            
            </figure><p>Figure 5: Median Eyeball TCP RTT for AS 9246 from Seattle vs Guam.</p><p>The median eyeball TCP RTT decreased from 146 ms to 12 ms, a massive 91.8% reduction; given the geographical specifics already mentioned, this is perhaps the Cloudflare deployment that represents the largest latency reduction for local end users.</p>
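<p>For reference, the latency improvements quoted in this post are plain percentage reductions, which can be reproduced from the before/after figures as:</p>

```python
# Reproducing the percentage-reduction arithmetic from the RTT figures above.

def pct_reduction(before_ms: float, after_ms: float) -> float:
    """Percentage reduction from `before_ms` to `after_ms`."""
    return (before_ms - after_ms) / before_ms * 100

print(f"{pct_reduction(136.3, 9.3):.1f}%")  # overall median RTT: 93.2%
print(f"{pct_reduction(146, 12):.1f}%")     # busy-hour median RTT: 91.8%
```
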
    <div>
      <h3>Impact on Internet traffic</h3>
      <a href="#impact-on-internet-traffic">
        
      </a>
    </div>
    <p>We can actually see an increase in HTTP requests in Guam since early April, right when we were setting up our Guam data center. The impact of the deployment became clearer after mid-April, with a further step up in mid-June. Comparing March 8 with July 17, there was an 11.5% increase in requests, as illustrated below:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1JxzkZSSoKkogNbs4YwTMY/b4c748335e83bfcacb4c2d142e6cf0e5/4-1.png" />
            
            </figure><p>Figure 6: Trend in HTTP Requests per Second in Guam.</p>
    <div>
      <h3>Edge Partnership Program</h3>
      <a href="#edge-partnership-program">
        
      </a>
    </div>
    <p>If you’re an ISP that is interested in hosting a Cloudflare cache to improve performance and reduce backhaul, get in touch on our <a href="https://www.cloudflare.com/partners/peering-portal/?cf_target_id=B87954540A24583D38E89307A8ADC63D">Edge Partnership Program</a> page. And if you’re a software, data, or network engineer – or just the type of person who is curious and wants to help make the Internet better – consider <a href="https://www.cloudflare.com/careers/jobs/">joining our team</a>.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[APJC]]></category>
            <guid isPermaLink="false">0I6nSi8jrvACD0QKGgI4F</guid>
            <dc:creator>David Antunes</dc:creator>
        </item>
        <item>
            <title><![CDATA[ARMs Race: Ampere Altra takes on the AWS Graviton2]]></title>
            <link>https://blog.cloudflare.com/arms-race-ampere-altra-takes-on-aws-graviton2/</link>
            <pubDate>Thu, 11 Mar 2021 12:00:00 GMT</pubDate>
            <description><![CDATA[ A comparison between the Ampere Altra and the AWS Graviton2, the two ARM Neoverse N1-based processors. ]]></description>
            <content:encoded><![CDATA[ <p>Over three years ago, we embraced the ARM ecosystem after <a href="/arm-takes-wing/">evaluating the Qualcomm Centriq</a>. The Centriq and its Falkor cores delivered a significant reduction in power consumption while maintaining comparable performance against the processor that was powering our server fleet at the time. By the time we completed <a href="/porting-our-software-to-arm64/">porting our software stack to be compatible with ARM</a>, Qualcomm had decided to <a href="https://www.theregister.com/2018/12/10/qualcomm_layoffs/">exit the server business</a>. Since then, we have been waiting for another server-grade ARM processor, hoping to improve power efficiency across our <a href="https://www.cloudflare.com/network/">global network</a>, which now spans more than 200 cities in over 100 countries.</p><p>ARM has introduced the Neoverse N1 platform, a blueprint for power-efficient processors that licensees can customize to meet their specific requirements. Ampere licensed the Neoverse N1 platform to create the Ampere Altra, a processor that allows companies that own and manage their own fleet of servers, like ourselves, to take advantage of the expanding ARM ecosystem. We have been working with Ampere to determine whether the Altra is the right processor to power our first generation of ARM edge servers.</p><p>The AWS Graviton2 is the only other publicly accessible Neoverse N1-based processor, though it is available only through Amazon’s cloud product portfolio. We wanted to understand the differences between the two, so we compared Ampere’s single-socket server, named Mt. Snow, equipped with the Ampere Altra Q80-30, against an EC2 instance of the AWS Graviton2.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4gIm083qtHDE0lsyNQ9trI/871b344d6f1a3abc480c43ebec66973b/image22.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4p4UD5vZMIdLhnicWpFoNn/50f867f2a2841f22a9f0468594417926/image13-1.png" />
            
            </figure><p>The Mt. Snow 1P server equipped with the Ampere Altra Q80-30</p><p>Both the Ampere Altra and the AWS Graviton2 are based on ARM’s Neoverse N1 platform and manufactured on the TSMC 7nm process. The N1 reference core features an 11-stage out-of-order execution pipeline with the <a href="https://www.arm.com/solutions/infrastructure">following specifications</a>, in ARM nomenclature:</p>
<table>
<thead>
  <tr>
    <th>Pipeline Stage</th>
    <th>Width</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Fetch</td>
    <td>4 instructions/cycle</td>
  </tr>
  <tr>
    <td>Decode</td>
    <td>4 instructions/cycle</td>
  </tr>
  <tr>
    <td>Rename</td>
    <td>4 Mops/cycle</td>
  </tr>
  <tr>
    <td>Dispatch</td>
    <td>8 µops/cycle</td>
  </tr>
  <tr>
    <td>Issue</td>
    <td>8 µops/cycle</td>
  </tr>
  <tr>
    <td>Commit</td>
    <td>8 µops/cycle</td>
  </tr>
</tbody>
</table>
<p>An <a href="https://developer.arm.com/ip-products/processors/neoverse/neoverse-n1">N1 core</a> contains 64KiB of L1 instruction cache, 64KiB of L1 data cache, and a dedicated L2 cache that can be implemented with up to 1MiB per core. The L3 cache, or System Level Cache, can be implemented with up to 256MiB (up to 4MiB per slice, up to 64 slices) shared by all cores.</p>

<table>
<thead>
  <tr>
    <th>Cache Hierarchy</th>
    <th>Capacity</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>L1i Cache</td>
    <td>64 KiB</td>
  </tr>
  <tr>
    <td>L1d Cache</td>
    <td>64 KiB</td>
  </tr>
  <tr>
    <td>L2 Cache</td>
    <td>Up to 1 MiB</td>
  </tr>
  <tr>
    <td>L3 Cache</td>
    <td>Up to 256 MiB</td>
  </tr>
</tbody>
</table>    
<p>A pair of N1 cores can optionally be combined into a dual-core configuration through the Component Aggregation Layer, then placed on the mesh interconnect, the Coherent Mesh Network or <a href="https://developer.arm.com/ip-products/system-ip/corelink-interconnect/corelink-coherent-mesh-network-family/corelink-cmn-600">CMN-600</a>, which supports up to 64 pairs of cores, for a total of 128 cores.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/13WYF0ug8Fll9whw3hAYQc/b2120a5ab9238d7fb88a45a9216bc955/image14-2.png" />
            
            </figure><p>Component Aggregation Layer (CAL) supports up to two N1 cores</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/38oEPfzBX5YmBmZPvDfE4q/732ba4ed273fabe97a548dd6604d0ed9/image7-3.png" />
            
            </figure><p>Coherent Hub Interface (CHI) is used to connect components to the Mesh Cross Point (XP)</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2xr86zntLVgCyTRqPOyZt6/a1662c7cb04adddf99c1c754a318783f/image17-1.png" />
            
            </figure><p>Coherent Mesh Network (CMN-600) can be scaled up to 8x8 XPs supporting up to 128 cores</p>
    <div>
      <h2>System Specifications</h2>
      <a href="#system-specifications">
        
      </a>
    </div>
    
<table>
<thead>
  <tr>
    <th></th>
    <th>AWS (Annapurna Labs)</th>
    <th>Ampere</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Processor Model</td>
    <td>AWS Graviton2</td>
    <td>Ampere Altra Q80-30</td>
  </tr>
  <tr>
    <td>ISA</td>
    <td>ARMv8.2</td>
    <td>ARMv8.2+</td>
  </tr>
  <tr>
    <td>Core / Thread Count</td>
    <td>64C / 64T</td>
    <td>80C / 80T</td>
  </tr>
  <tr>
    <td>Operating Frequency</td>
    <td>2.5 GHz</td>
    <td>3.0 GHz</td>
  </tr>
  <tr>
    <td>L1 Cache Size</td>
    <td>8 MiB (64 x 128 KiB)</td>
    <td>10 MiB (80 x 128 KiB)</td>
  </tr>
  <tr>
    <td>L2 Cache Size</td>
    <td>64 MiB (64 x 1 MiB)</td>
    <td>80 MiB (80 x 1 MiB)</td>
  </tr>
  <tr>
    <td>L3 Cache Size</td>
    <td>32 MiB</td>
    <td>32 MiB</td>
  </tr>
  <tr>
    <td>System Memory</td>
    <td>256 GB (8 x 32 GB DDR4-3200)</td>
    <td>256 GB (8 x 32 GB DDR4-3200)</td>
  </tr>
</tbody>
</table>
<p>We can verify that the two processors are based on the Neoverse N1 platform by checking <code>CPU part</code> inside <code>cpuinfo</code>, which returns <code>0xd0c</code>, the <a href="https://developer.arm.com/documentation/100616/0301/register-descriptions/aarch64-system-registers/midr-el1--main-id-register--el1">designated part number for the Neoverse N1</a>.</p>
            <pre><code>sung@ampere-altra:~$ cat /proc/cpuinfo | grep 'CPU part' | head -1
CPU part	: 0xd0c</code></pre>
            
            <pre><code>sung@aws-graviton2:~$ cat /proc/cpuinfo | grep 'CPU part' | head -1
CPU part	: 0xd0c</code></pre>
            <p>Ampere backported various features into the Ampere Altra, notably <a href="https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability">speculative side-channel attack mitigations</a> against Meltdown and Spectre (variants 1 and 2) from the ARMv8.5 instruction set architecture, giving Altra the “+” designation in its ISA (ARMv8.2+).</p><p>The Ampere Altra has 25% more physical cores and a 20% higher operating frequency than the AWS Graviton2. Because of this, we should expect the Ampere Altra to perform up to 50% better than the AWS Graviton2 in compute-bound workloads.</p><p>Both Ampere and AWS decided to implement 1MiB of L2 cache per core, but only 32MiB of L3 or System Level Cache, although the CMN-600 specification allows up to 256MiB. This leaves the Ampere Altra with a lower L3 cache-per-core ratio than the AWS Graviton2 and places Altra at a potential disadvantage when the <a href="/impact-of-cache-locality/">application’s working set does not fit within the L3 cache</a>.</p>
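            <p>The “up to 50%” expectation is simply the product of the core-count and frequency ratios. A quick back-of-the-envelope check (an upper bound only; it ignores memory bandwidth, cache capacity, and mesh-scaling overheads):</p>

```python
# Theoretical multi-core advantage of the Altra Q80-30 over the Graviton2,
# assuming perfect scaling with core count and operating frequency.
altra_cores, altra_ghz = 80, 3.0
graviton2_cores, graviton2_ghz = 64, 2.5

core_ratio = altra_cores / graviton2_cores  # 1.25, i.e. +25% cores
freq_ratio = altra_ghz / graviton2_ghz      # 1.20, i.e. +20% frequency
upper_bound = core_ratio * freq_ratio       # 1.50, i.e. up to +50%

print(f"theoretical multi-core advantage: {upper_bound - 1:.0%}")
```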
    <div>
      <h2>Methodology</h2>
      <a href="#methodology">
        
      </a>
    </div>
    <p>We imaged both systems with Debian Buster and downloaded our open-source benchmark suite, <a href="https://github.com/cloudflare/cf_benchmark">cf_benchmark</a>. The suite executes 49 different workloads over approximately 15 to 30 minutes. Each workload exercises a library that has either been used in our stack or been considered at one point or another. Due to the benchmark’s relatively short duration and the libraries its workloads call, we run cf_benchmark as our smoke test on any new system brought in for evaluation. We also recommend cf_benchmark to our technology partners when we discuss new hardware, primarily processors, to ensure these workloads can compile and run successfully.</p><p>We ran cf_benchmark multiple times and observed no significant run-to-run variation. As with industry-standard benchmarks, we calculated the overall score as the geometric mean of all 49 workloads, which yields a more conservative result than the arithmetic mean, and the category scores as the geometric mean of their respective subsets of workloads.</p><p>In single-core scenarios, cf_benchmark spawns a single application thread, occupying a single hardware thread. In multi-core scenarios, cf_benchmark spawns as many application threads as needed until all hardware threads on the processor are occupied. Neither processor implements <a href="https://en.wikipedia.org/wiki/Simultaneous_multithreading">simultaneous multithreading</a>, so each physical core can execute only one hardware thread at a time.</p>
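    <p>Why geometric rather than arithmetic mean? A single outlier workload can drag an arithmetic average upward, while the geometric mean damps it. A small illustration (the scores below are invented, not cf_benchmark results):</p>

```python
from math import prod

# Hypothetical per-workload relative scores (1.0 = parity with baseline).
scores = [1.2, 1.1, 1.3, 4.0]  # one outlier workload

arithmetic = sum(scores) / len(scores)
geometric = prod(scores) ** (1 / len(scores))

# The geometric mean is the more conservative summary statistic.
print(f"arithmetic={arithmetic:.2f} geometric={geometric:.2f}")
assert geometric < arithmetic
```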
    <div>
      <h2>Overall Performance Results</h2>
      <a href="#overall-performance-results">
        
      </a>
    </div>
    <p>Given Altra’s advantage in both operating frequency and sheer core count, the Ampere Altra took the lead in both single- and multi-core performance.</p><p>In single-core performance, the Ampere Altra outperformed the AWS Graviton2 by 16%. The difference in operating frequency between the two processors gave the Ampere Altra a proportional advantage of up to 20% in more than half of our single-core workloads.</p><p>In multi-core performance, the Ampere Altra maintained its edge, beating the AWS Graviton2 by 31%. We expected the Ampere Altra to attain a theoretical advantage of between 20% and 50%, given Altra’s higher operating frequency and core count and the inherent overheads of increasing core count, primarily scaling out the mesh network. The majority of our multi-core workloads indeed fell within this 20% to 50% range.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7hR6z2vf42BIvNqGDe7Fzo/d5334d860decb6ccc04213ba88016302/image27.png" />
            
            </figure>
    <div>
      <h2>Categorized Results</h2>
      <a href="#categorized-results">
        
      </a>
    </div>
    <p>In OpenSSL and LuaJIT, the Ampere Altra scaled proportionally with its operating frequency in single-core performance and continued to scale out with its core count in multi-core performance. The Ampere Altra maintained its lead in the vast majority of compression and Golang workloads with the exception of Brotli level 9 compression and some Golang multi-core workloads that were not able to keep the two processors busy at 100% CPU utilization.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7rgDArSL0EoMlSPBYXOPX0/5c85608e43ddc417565b18b12ead1fb7/image9-2.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Xk9KE86pFwyLKTumNkXG6/4713fc65624711bea6a0d21fb0ccc8a8/image29.png" />
            
            </figure>
    <div>
      <h2>OpenSSL</h2>
      <a href="#openssl">
        
      </a>
    </div>
    <p>As described <a href="/arm-takes-wing/">previously</a>, we use a fork of OpenSSL named BoringSSL in places such as <a href="https://www.cloudflare.com/learning/ssl/what-happens-in-a-tls-handshake/">TLS handshake</a> handling. The original OpenSSL comes with a built-in benchmark called <code>speed</code> that measures single and multi-core performance. The asymmetric and <a href="/it-takes-two-to-chacha-poly/">symmetric</a> crypto workloads found here saturated the processors at 100% CPU utilization. Altra’s performance aligned with our expectations throughout this entire category: it maintained a consistent 20% advantage over the AWS Graviton2 in single-core performance and continued to scale with its core count, performing on average 47% better in multi-core performance.</p>
    <div>
      <h3>Public Key Cryptography</h3>
      <a href="#public-key-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6WTE3GQC2F1PJGKCOJlIBn/872241c8d731b3a23ba4fa1a6be58c42/image4-9.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/WyA8y24VhaTVc6KSGMwgA/94b04a984a96434c5e27f580eae4781b/image6-4.png" />
            
            </figure>
    <div>
      <h3>Symmetric Key Cryptography</h3>
      <a href="#symmetric-key-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/IAWDbmAmeDB398TEqlFz7/ab9e3db71dd253ebb31f86bdc294bd07/image3-10.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Vjjz0U0no96OWEUoPOyrL/43cc9a669378a0acc7b811d05def6848/image28.png" />
            
            </figure>
    <div>
      <h2>LuaJIT</h2>
      <a href="#luajit">
        
      </a>
    </div>
    <p>We commonly describe LuaJIT as the <a href="/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/">glue that holds Cloudflare together</a>. LuaJIT has a <a href="/pushing-nginx-to-its-limit-with-lua/">long history</a> here at Cloudflare for the same reason it is widely used in the videogame industry, where performance and responsiveness are non-negotiable requirements.</p><p>Although we have been transitioning from LuaJIT <a href="/building-fast-interpreters-in-rust/">to Rust</a>, we welcomed the <a href="https://developer.arm.com/documentation/100616/0400/functional-description/level-1-memory-system/cache-behavior/instruction-cache-coherency?lang=en">instruction cache coherency</a> implemented in the Ampere Altra. Instruction cache coherency addresses the <a href="https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/caches-and-self-modifying-code">drawbacks associated with LuaJIT’s</a> self-modifying code, where the next instruction to be fetched is not in the instruction cache but in the data cache. With this feature enabled, the instruction cache is kept coherent with, or made aware of, changes in the data cache. The L2 cache becomes inclusive of the L1 instruction cache, meaning any cache line found in the L1 instruction cache is also present in the L2 cache. Furthermore, the L2 cache intercepts any invalidations or stores made to its cache lines and propagates those changes up to the L1 instruction cache.</p><p>The Ampere Altra performed well in single-core workloads except for fasta, where both processors stalled at the front-end of the pipeline. In the multi-core version of fasta, Altra did not spend as many cycles stalled in the front-end, likely due to Ampere's implementation of instruction cache coherency. All other workloads continued to scale in multi-core runs, with binary-trees coming in at 73%. Spectral was omitted since its multi-core variant failed to run on both servers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1PTr1VFKBOtT3OJa2p1OIL/6c89fe575583d30dc7758ac071853be9/image25.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1v8ZAjMln1cgNxC74M6rFH/f64281063db5deafcdb2c0df3d2f12cb/image18-1.png" />
            
            </figure>
    <div>
      <h2>Compression</h2>
      <a href="#compression">
        
      </a>
    </div>
    <p><a href="/results-experimenting-brotli/">Brotli and Gzip</a> are the two primary types of compression we use at Cloudflare. Highly efficient compression algorithms help us balance spending more resources on a faster processor against provisioning larger storage capacity, and they let us transfer content faster. The Ampere Altra performed well on both compression types, with the exception of Brotli level 9 and Gzip level 4 in multi-core performance.</p>
    <div>
      <h3>Brotli</h3>
      <a href="#brotli">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3XaMt4Zk04Sci4xEwYbEfZ/452e4a8fdd91eb329f72d5931ebf6fd4/image2-6.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/8HNEM6fNB1ZHdJBi9WXwq/0ac8c02525138234f57138f63785ccf7/image30.png" />
            
            </figure><p>Out of the entire benchmark suite, Brotli level 9 caused the largest delta between the Ampere Altra and the AWS Graviton2. Although we do not use <a href="/arm-takes-wing/">Brotli level 7 and above</a> when performing dynamic compression, we decided to investigate further.</p><p>From historical observations, we have noticed that most processors tend to experience significant performance degradation at level 9. Compression workloads generally have a higher mix of branch instructions, and we found that approximately 15% to 20% of the dynamic instructions were branches. Intuitively, we first looked at the misprediction rate, since a high miss rate quickly accumulates penalty cycles. To our surprise, however, we found the misprediction rate at level 9 to be very low.</p>
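            <p>The intuition can be made concrete with a first-order cost model. In the sketch below, only the 15–20% branch fraction comes from our profiling; the miss rate and per-miss penalty are illustrative assumptions:</p>

```python
# First-order estimate of cycles lost to branch mispredictions.
instructions = 1_000_000_000  # dynamic instructions executed
branch_fraction = 0.18        # ~15-20% of the instruction mix are branches
miss_rate = 0.01              # hypothetical misprediction rate (1%)
penalty_cycles = 15           # assumed pipeline-flush cost per miss

lost_cycles = instructions * branch_fraction * miss_rate * penalty_cycles
# Even a 1% miss rate on a billion instructions burns ~27M cycles,
# which is why a high miss rate was our first suspect.
print(f"{lost_cycles:.2e} cycles lost")
```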
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/386axQVNM8iSnFqudd6Ogm/05018c18024e803c3cd0b9119dec2ecc/image15-3.png" />
            
            </figure><p>By analyzing our dataset further, we found the common underlying cause appeared to be the high number of page faults incurred at level 9. Ampere has demonstrated that increasing the page size from 4K to 64K bytes alleviates the bottleneck and brings the Ampere Altra to parity with the AWS Graviton2. We plan to experiment with larger page sizes as we continue to evaluate Altra.</p>
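            <p>The effect of larger pages is easy to reason about: with 64K pages, the same working set spans 16× fewer pages, so a fault-heavy workload incurs far fewer page faults. One way to inspect the kernel’s page size and a process’s fault counts on a Linux host (a sketch; the resource module is Unix-only):</p>

```python
import os
import resource

# Base page size the kernel reports: 4096 on a stock 4K-page kernel,
# 65536 on a 64K-page ARM kernel.
page_size = os.sysconf("SC_PAGE_SIZE")
print(f"page size: {page_size} bytes")

# Minor/major page faults taken by this process so far; at 64K pages
# the same workload would register far fewer of them.
usage = resource.getrusage(resource.RUSAGE_SELF)
print(f"minor faults: {usage.ru_minflt}, major faults: {usage.ru_majflt}")
```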
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3DKo7g8rI17OWj0EYTaajg/0fd048aa115e975a5596ea6d4484e7c3/image16-1.png" />
            
            </figure>
    <div>
      <h3>Gzip</h3>
      <a href="#gzip">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Ysrj5aCW6xRAhfvaAU18x/6a4097d080255082d82f4e500b115850/image1-12.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Kq3MvWAn8SU4x6aMtXB70/d8ed96d3a91d806b2169533e3c3d320f/image19-1.png" />
            
            </figure>
    <div>
      <h2>Golang</h2>
      <a href="#golang">
        
      </a>
    </div>
    <p>LuaJIT, Rust, C++, and Go are some of our <a href="/go-at-cloudflare/">primary languages</a>; the Golang workloads found here include <a href="/go-crypto-bridging-the-performance-gap/">cryptography</a>, compression, regular expressions, and string manipulation. The Ampere Altra performed proportionally to its operating frequency advantage over the AWS Graviton2 in the vast majority of single-core workloads. In multi-core performance, the workloads that could not saturate the processors at 100% CPU utilization did not scale proportionally to the two processors’ respective operating frequencies and core counts.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57IXPa7WCSeZKgL1uxFnwZ/1464a91c07a9d37f2f54a5c2a40cf4e3/image11-1.png" />
            
            </figure>
    <div>
      <h3>Go Cryptography</h3>
      <a href="#go-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2sFiAVlYH3Yf7HRMHcs4lD/5d5767056e6bbdfcb1806a9a64837896/image8-3.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4MtZ4a8jzSBvuILTK63K5N/bee30dc62c8c19906b025be68093a6c0/image23.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5hUMrhpmi50JSnqTFSHCEI/f414af6ea4350c6001d3215414c476d9/image26.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Q4pFSFuq2vIdFqXRZTDuc/a150996a34937b5dc6a6abb03f8e6b81/image21-6.png" />
            
            </figure>
    <div>
      <h3>Go Compression &amp; String</h3>
      <a href="#go-compression-string">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6eO7U4KLBUdakYPkZAMDEm/a9a49d5d059be39825c2492876e612c8/image24.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2jfNL4hgXTgikqd1xxKGS/c9bd212d89e2a34f64a71bc076323815/image10-3.png" />
            
            </figure>
    <div>
      <h3>Go Regex</h3>
      <a href="#go-regex">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6hmpFGmyyhgx9cnCkTtbeJ/237481ec63e1f82fbdf42e986b9ed3a1/image12-1.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5NNwnydydlXjw9iTomiML6/bc6e5bd6a72555f2db48c56b54b76a74/image20.png" />
            
            </figure>
    <div>
      <h2>Frequency, Power &amp; Thermals</h2>
      <a href="#frequency-power-thermals">
        
      </a>
    </div>
    <p>We were unable to collect telemetry on the AWS Graviton2: the EC2 instance exposed neither frequency nor power sensors, although power in particular would have been an interesting data point. The Ampere model we tested was the Altra Q80-30, rated at an operating frequency of 3.0GHz and a thermal design power (TDP) of 210W.</p><p>The Ampere Altra maintained a sustained operating frequency throughout cf_benchmark, while Altra’s dynamic voltage and frequency scaling (DVFS) mechanism occasionally lowered the frequency between workloads to reduce power consumption. Package power varied by workload, but Altra never came close to its rated TDP, and the default fan curve kept the temperature below 70°C. At the system level, by simply querying the baseboard management controller over <a href="https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface">IPMI</a>, we observed that the Ampere server consumed no more than 300W. More power-efficient servers translate directly into more <a href="/an-introduction-to-three-phase-power-and-pdus/">flexibility over how we can populate our racks</a>, and we hope this trend will continue in production.</p>
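    <p>For reference, system-level figures like the 300W reading come from polling the BMC. A hedged sketch of how such a poll might be wrapped, assuming the host has the ipmitool CLI and a DCMI-capable BMC (the function returns None when either is missing):</p>

```python
import shutil
import subprocess
from typing import Optional

def bmc_power_watts() -> Optional[float]:
    """Read instantaneous power draw from the BMC via ipmitool.

    Returns None when ipmitool is absent or the query fails, so the
    caller can degrade gracefully on hosts without a reachable BMC.
    """
    if shutil.which("ipmitool") is None:
        return None
    try:
        out = subprocess.run(
            ["ipmitool", "dcmi", "power", "reading"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return None
    for line in out.splitlines():
        # Typical line: "    Instantaneous power reading:   296 Watts"
        if "Instantaneous power reading" in line:
            return float(line.split(":", 1)[1].split()[0])
    return None

print(bmc_power_watts())
```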
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/p2T05N3TKr8JDSNhHYs89/192ec5af3153877a60a39acceea1f7f4/image31.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/12tiKqNAFC8IK9o76MiJ6F/62186a14aed6c406d84553a7ec14fc55/image5-10.png" />
            
            </figure>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>In our top-down evaluation, the Ampere Altra largely performed as we had expected throughout our benchmarking suite. We regularly observed the two Neoverse N1 processors perform in line with their respective operating frequencies and core counts, with the Ampere Altra coming out on top in these key areas over the AWS Graviton2. The Ampere Altra yielded better single-core performance due to its higher operating frequency and multiplied that advantage by scaling out to a higher core count in multi-core performance against the AWS Graviton2. We also found our Ampere evaluation server consuming less power than we had anticipated and hope this trend will continue in production.</p><p>Throughout this initial round of testing and learning more about the underlying architecture, we found many aspects interesting when compared against contemporary x86 counterparts, such as the absence of simultaneous multithreading and of dynamic frequency boost. We look forward to seeing which design philosophy makes more sense for the cloud. We are currently assessing Altra’s performance against our fleet of servers and collaborating with Ampere to define an ARM edge server based on Altra that could be deployed at scale.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Partners]]></category>
            <guid isPermaLink="false">2687iQec2rCYsiH8d1qEzz</guid>
            <dc:creator>Sung Park</dc:creator>
        </item>
        <item>
            <title><![CDATA[Automating data center expansions with Airflow]]></title>
            <link>https://blog.cloudflare.com/automating-data-center-expansions-with-airflow/</link>
            <pubDate>Wed, 27 Jan 2021 16:46:02 GMT</pubDate>
            <description><![CDATA[ How we automated data center expansions and cut by 90% the amount of time our team spent on tedious operational work. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare’s network keeps growing, and that growth doesn’t just come from building new data centers in new cities. We’re also upgrading the capacity of existing data centers by adding newer generations of servers — a process that makes our network safer, faster, and more reliable for our users.</p><p>Connecting new Cloudflare servers to our network has always been complex, in large part because of the amount of manual effort that used to be required. Members of our Data Center and Infrastructure Operations, Network Operations, and Site Reliability Engineering teams had to carefully follow steps in an extremely detailed standard operating procedure (SOP) document, often copying command-line snippets directly from the document and pasting them into terminal windows.</p><p>But such a manual process can only scale so far, and we knew there must be a way to automate the installation of new servers.</p><p>Here’s how we tackled that challenge by building our own Provisioning-as-a-Service (PraaS) platform and cut the time our team spends on mundane operational tasks by 90%.</p>
    <div>
      <h3>Choosing and using an automation framework</h3>
      <a href="#choosing-and-using-an-automation-framework">
        
      </a>
    </div>
    <p>When we began our automation efforts, we quickly realized it made sense to replace each of these manual SOP steps with an API-call equivalent and to present them in a self-service web-based portal.</p><p>To organize these new automatic steps, we chose <a href="https://airflow.apache.org/">Apache Airflow</a>, an open-source workflow management platform. Airflow is built around directed acyclic graphs, or DAGs, which are collections of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.</p><p>In this new system, each SOP step is implemented as a task in the DAG. The majority of these tasks are API calls to Salt — software which automates the management and configuration of any infrastructure or application, and which we use to manage our servers, switches, and routers. Other DAG tasks are calls to query Prometheus (systems monitoring and alerting toolkit), Thanos (a highly available Prometheus setup with long-term storage capabilities), Google Chat webhooks, JIRA, and other internal systems.</p><p>Here is an example of one of these tasks. In the original SOP, SREs were given the following instructions to enable anycast:</p><ol><li><p><b><i>Login to a remote system.</i></b></p></li><li><p><b><i>Copy and Paste the command in the terminal.</i></b></p></li><li><p><b><i>Replace the router placeholder in the command snippet with the actual value.</i></b></p></li><li><p><b><i>Execute the command.</i></b></p></li></ol><p>In our new workflow, this step becomes a single task in the DAG named “enable_anycast”:</p>
            <pre><code>enable_anycast = builder.wrap_class(AsyncSaltAPIOperator)(
             task_id='enable_anycast',
             target='{{ params.netops }}',
             function='cmd.run',
             fun_kwargs={'cmd': 'salt {{ get_router(params.colo_name) }} '
                         'anycast.enable --out=json --out-indent=-1'},
             salt_conn_id='salt_api',
             trigger_rule='one_success')</code></pre>
            <p>As you can see, automation eliminates the need for a human operator to log in to a remote system and to figure out which router should replace the placeholder in the command to be executed.</p><p>In Airflow, a task is an implementation of an Operator. The Operator in the automated step above is “AsyncSaltAPIOperator”, a custom operator built in-house. This extensibility is one of the many reasons we decided to use Apache Airflow: it allows us to extend its functionality by writing custom operators that suit our needs.</p><p>SREs from various teams have written quite a few custom Airflow Operators that integrate with Salt, Prometheus, Bitbucket, Google Chat, JIRA, and PagerDuty, among others.</p>
    <div>
      <h3>Manual SOP steps transformed into a feature-packed automation</h3>
      <a href="#manual-sop-steps-transformed-into-a-feature-packed-automation">
        
      </a>
    </div>
<p>The tasks that replaced steps in the SOP are marvelously feature-packed. Here are some highlights of what they are capable of, on top of just executing a command:</p><p><b>Failure Handling:</b> When a task fails for whatever reason, it automatically retries until it exhausts the maximum retry limit we set for it. We employ various retry strategies, including instructing certain tasks not to retry at all, either because retrying is impractical or because the failure, such as an exception or a condition that is unlikely to improve, would make any remaining retry attempts pointless.</p><p><b>Logging:</b> Each task provides a comprehensive log during execution. We’ve written our tasks to log as much information as possible to help us audit and troubleshoot issues.</p><p><b>Notifications:</b> We’ve written our tasks to send a notification with information such as the name of the DAG, the name of the task, its task state, the number of attempts it took to reach that state, and a link to view the task logs.</p><p>When a task fails, we definitely want to be notified, so we also set failing tasks to provide additional information, such as the number of retry attempts and links to relevant wiki pages or Grafana dashboards.</p><p>Depending on the criticality of the failure, we can also instruct it to page the on-call person on the provisioning shift, should it require immediate attention.</p><p><b>Jinja Templating:</b> Jinja templating allows injecting dynamic content, using code, into otherwise static objects such as strings. We use this in combination with macros to provide parameters that can change during execution, since macros are evaluated while the task runs.</p><p><b>Macros:</b> Macros pass dynamic information into task instances at runtime by exposing objects to templates. In other words, macros are functions that take input, modify that input, and return the modified output.</p>
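<p>The retry behavior described above can be sketched in plain Python. This is an illustration of the semantics only, not Airflow’s actual implementation; the helper and its parameters are hypothetical:</p>

```python
import time

def run_with_retries(task, max_retries=3, retry_delay=0.0, retryable=lambda exc: True):
    """Run `task` (a zero-argument callable), retrying on failure.

    Mirrors the behavior described above: the task is retried until the
    retry limit is exhausted, unless the failure is one we never retry
    (e.g. an exception that won't improve on another attempt).
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return attempt, task()
        except Exception as exc:
            if not retryable(exc) or attempt > max_retries:
                raise
            time.sleep(retry_delay)

# A flaky task that succeeds on its third attempt.
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

attempts, result = run_with_retries(flaky, max_retries=5)
```

<p>The <code>retryable</code> predicate plays the role of the deliberate “do not retry” strategy: a failure it rejects is re-raised immediately, no matter how many attempts remain.</p>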
    <div>
      <h3>Adapting tasks for preconditions and human intervention</h3>
      <a href="#adapting-tasks-for-preconditions-and-human-intervention">
        
      </a>
    </div>
    <p>There are a few steps in the SOP that require certain preconditions to be met. We use sensors to set dependencies between these tasks, and even between different DAGs, so that one does not run until the dependency has been met.</p><p>Below is an example of a sensor that waits until all nodes resolve to their assigned DNS records:</p>
            <pre><code>verify_node_dns = builder.wrap_class(DNSSensor)(
    task_id='verify_node_dns',
    zone=domain,
    nodes_from='{{ to_json(run_ctx.globals.import_nodes_via_mpl) }}',
    timeout=60 * 30,
    poke_interval=60 * 10,
    mode='reschedule')</code></pre>
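<p>Conceptually, a sensor is just a poke loop: re-check a condition every <code>poke_interval</code> seconds until it holds, or fail once <code>timeout</code> elapses. A simplified sketch of that idea (not the actual DNSSensor implementation):</p>

```python
import time

def poke_until(condition, timeout=60 * 30, poke_interval=60 * 10):
    """Re-evaluate `condition` every `poke_interval` seconds until it
    returns True, or raise once `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poke_interval)
    raise TimeoutError("sensor timed out waiting for condition")

# Example: wait until a (simulated) DNS check passes on the third poke.
probe = {"pokes": 0}
def dns_ready():
    probe["pokes"] += 1
    return probe["pokes"] >= 3

ok = poke_until(dns_ready, timeout=5, poke_interval=0.01)
```

<p>Airflow’s <code>mode='reschedule'</code> improves on this naive loop by freeing the worker slot between pokes instead of sleeping in it, which matters for long waits like the 30-minute DNS timeout above.</p>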
            <p>In addition, some of our tasks still require input from a human operator. In these circumstances, we use sensors as blocking tasks that prevent downstream work from starting until the required input has arrived.</p><p>The code below is a simple example of a task that sends notifications to get the attention of a human operator, then waits until a Change Request ticket has been provided and verified:</p>
            <pre><code>verify_jira_input = builder.wrap_class(InputSensor)(
    task_id='verify_jira_input',
    var_key='jira',
    prompt='Please provide the Change Request ticket.',
    notify=True,
    require_human=True)</code></pre>
            <p>Another sensor task example is waiting until a zone has been deployed by a Cloudflare engineer, as described in <a href="/improving-the-resiliency-of-our-infrastructure-dns-zone/">https://blog.cloudflare.com/improving-the-resiliency-of-our-infrastructure-dns-zone/</a>.</p><p>In order for PraaS to be able to accept human inputs, we’ve written a separate DAG we call our DAG Manager. Whenever we need to submit input back to a running expansion DAG, we simply trigger the DAG Manager and pass in our input as a JSON configuration, which is then processed and submitted back to the expansion DAG.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1y5JUysuUMGAzvWpiPEb4G/162771c50b9be51d4a496cc0aa019c4c/image2-20.png" />
            
            </figure>
    <div>
      <h3>Managing Dependencies Between Tasks</h3>
      <a href="#managing-dependencies-between-tasks">
        
      </a>
    </div>
    <p>Replacing SOP steps with DAG tasks was only the first part of our journey towards greater automation. We also had to define the dependencies between these tasks and construct the workflow accordingly.</p><p>Here’s an example of what this looks like in code:</p>
            <pre><code>verify_cr &gt;&gt; parse_cr &gt;&gt; [execute_offline, execute_online]
execute_online &gt;&gt; silence_highstate_runner &gt;&gt; silence_metals &gt;&gt; \
    disable_highstate_runner</code></pre>
            <p>The code simply uses bit shift operators to chain the operations. A list of tasks can also be set as dependencies:</p>
            <pre><code>change_metal_status &gt;&gt; [wait_for_change_metal_status, verify_zone_update] &gt;&gt; \
    evaluate_ecmp_management</code></pre>
            <p>With the bit shift operator, chaining multiple dependencies becomes concise.</p><p>By default, a downstream task will only run if its upstream has succeeded. For a more complex dependency setup, we set a trigger_rule which defines the rule by which the generated task gets triggered.</p><p>All operators have a trigger_rule argument. The Airflow scheduler decides whether to run the task or not depending on what rule was specified in the task. An example rule that we use a lot in PraaS is “one_success” — it fires as soon as at least one parent succeeds, and it does not wait for all parents to be done.</p>
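<p>Under the hood, this chaining works because the task class overloads Python’s bit shift operator. A minimal sketch of the mechanism (a toy class, not Airflow’s code):</p>

```python
class Task:
    """A toy task that records its downstream edges, mimicking how
    Airflow overloads >> to declare dependencies."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # `a >> b` or `a >> [b, c]`: everything on the right becomes
        # downstream of `a`; return the right side so chains keep going.
        targets = other if isinstance(other, list) else [other]
        self.downstream.extend(targets)
        return other

verify_cr, parse_cr, execute_offline, execute_online = (
    Task(t) for t in ("verify_cr", "parse_cr", "execute_offline", "execute_online"))

verify_cr >> parse_cr >> [execute_offline, execute_online]
```

<p>Returning the right-hand side from <code>__rshift__</code> is what lets a single expression describe a whole chain of edges.</p>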
    <div>
      <h3>Solving Complex Workflows with Branching and Multi-DAGs</h3>
      <a href="#solving-complex-workflows-with-branching-and-multi-dags">
        
      </a>
    </div>
    <p>Having complex workflows means that we need a workflow to branch, that is, to go down only a certain path based on an arbitrary condition, typically the outcome of an upstream task. We use the BranchPythonOperator to perform this conditional logic.</p><p>At some point in the workflow, our data center expansion DAGs trigger various external DAGs to accomplish complex tasks. This is why we have written our DAGs to be fully reusable. Rather than incorporating all the logic into a single DAG, we created separable DAGs that are fully reusable and can be triggered on demand, either manually or programmatically — our DAG Manager and the “helper” DAG are examples of this.</p><p>The helper DAG comprises logic that allows us to mimic a “for loop” by having the DAG respawn itself if needed, technically performing cycles. A DAG is, by definition, acyclic, but some tasks in our workflow require complex loops, which we solve by using the helper DAG.</p>
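<p>With the BranchPythonOperator, the branch is chosen by a Python callable that returns the task_id of the path to follow. A hypothetical example of such a callable (the task names and input are illustrative, not from our DAGs):</p>

```python
def choose_expansion_path(upstream_result: dict) -> str:
    """Return the task_id of the branch to follow, in the style of a
    BranchPythonOperator callable. `upstream_result` stands in for
    data pulled from an upstream task."""
    if upstream_result.get("metals_online"):
        return "execute_online"
    return "execute_offline"

path = choose_expansion_path({"metals_online": True})
```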
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4JEhMEgEZswPJWxkgSAtck/11453cefd2e4c6543181bdf241c9d2ea/image3-31.png" />
            
            </figure><p>We designed reusable DAGs early on, which allowed us to build complex automation workflows from separable DAGs, each of which handles distinct and well-defined tasks. Each data center DAG could easily reuse other DAGs by triggering them programmatically.</p><p>Having separate DAGs that run independently, that are triggered by other DAGs, and that keep inter-dependencies between them, is a pattern we use a lot. It has allowed us to execute very complex workflows.</p>
    <div>
      <h3>Creating DAGs that Scale and Executing Tasks at Scale</h3>
      <a href="#creating-dags-that-scale-and-executing-tasks-at-scale">
        
      </a>
    </div>
    <p>Data center expansions are done in two phases:</p><p><b>Phase 1</b> - the phase in which servers are powered on, boot our custom Linux kernel, and begin the provisioning process.</p><p><b>Phase 2</b> - the phase in which newly provisioned servers are enabled in the cluster to receive production traffic.</p><p>To reflect these phases in the automation workflow, we wrote two separate DAGs, one for each phase. However, we have over 200 data centers, so if we were to write a pair of DAGs for each, we would end up writing and maintaining 400 files!</p><p>A viable option could be to parameterize our DAGs. At first glance, this approach sounds reasonable. However, it poses one major challenge: tracking the progress of DAG runs would be too difficult and confusing for the human operator using PraaS.</p><p>Following the software design principle DRY (Don’t Repeat Yourself), and inspired by the Factory Method design pattern, we instead wrote both phase 1 and phase 2 DAGs in a way that allows them to dynamically create multiple different DAGs with exactly the same tasks, all reusing the same code. As a result, we maintain only one code base, and as we add new data centers, we can generate a DAG for each new data center instantly, without writing a single line of code.</p>
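<p>A minimal sketch of the factory idea, with made-up data center codes and task names (the real DAGs are far richer than this):</p>

```python
COLOS = ["dc1", "dc2", "dc3"]  # hypothetical data center codes

def make_expansion_dag(colo: str, phase: int) -> dict:
    """Generate one expansion 'DAG' description. Every DAG produced
    here has exactly the same tasks; only its identifier differs."""
    return {
        "dag_id": f"provision_{colo}_phase{phase}",
        "tasks": ["verify_cr", "parse_cr", "execute", "notify"],
    }

# One pair of DAGs (phase 1 and phase 2) per data center, from one code base.
dags = {
    dag["dag_id"]: dag
    for colo in COLOS
    for dag in (make_expansion_dag(colo, 1), make_expansion_dag(colo, 2))
}
```

<p>Adding a new data center is then just adding one entry to the list: the factory stamps out its phase 1 and phase 2 DAGs with no new code.</p>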
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/32tu5HSvTbXRo6yrRAqlq9/dff87e0fd133a1b8ce83b49ef616e0c3/image1-37.png" />
            
            </figure><p>Airflow even made it easy to put a simple, customized web UI on top of the process, making it usable by more employees without them having to understand all the details.</p>
    <div>
      <h3>The death of SOPs?</h3>
      <a href="#the-death-of-sops">
        
      </a>
    </div>
    <p>We would like to think that all of this automation removes the need for our original SOP document, but this is not really the case. Automation can fail: its components can fail, and a particular task in the DAG may fail. When this happens, our SOPs are used again to prevent provisioning and expansion activities from stopping completely.</p><p>Introducing automation paved the way for what we call an SOP-as-Code practice. We made sure that every task in the DAG has an equivalent manual step in the SOP that SREs can execute by hand, should the need arise, and that every change in the SOP has a corresponding pull request (PR) in the code.</p>
    <div>
      <h3>What’s next for PraaS</h3>
      <a href="#whats-next-for-praas">
        
      </a>
    </div>
    <p>Onboarding other provisioning activities, such as decommissioning, into PraaS is already ongoing.</p><p>For expansions, our ultimate goal is a fully autonomous system that monitors whether new servers have been racked in our edge data centers — and automatically triggers expansions — with no human intervention.</p>
    <div>
      <h3>Watch it on Cloudflare TV</h3>
      <a href="#watch-it-on-cloudflare-tv">
        
      </a>
    </div>
    <div></div>
<p></p> ]]></content:encoded>
            <category><![CDATA[Edge]]></category>
            <category><![CDATA[Data Center]]></category>
            <guid isPermaLink="false">7m9h6gYslgZrVO36QyRv8a</guid>
            <dc:creator>Jet Mariscal</dc:creator>
        </item>
        <item>
            <title><![CDATA[An introduction to three-phase power and PDUs]]></title>
            <link>https://blog.cloudflare.com/an-introduction-to-three-phase-power-and-pdus/</link>
            <pubDate>Fri, 04 Dec 2020 14:05:37 GMT</pubDate>
            <description><![CDATA[ This blog is a quick Electrical Engineering 101 session going over specifically how 3-phase PDUs work, along with some good practices on how we use them ]]></description>
            <content:encoded><![CDATA[ <p>Our fleet of over 200 locations comprises various generations of servers and routers. And with the ever changing landscape of services and computing demands, it’s imperative that we manage power in our data centers right. This blog is a brief Electrical Engineering 101 session going over specifically how power distribution units (PDUs) work, along with some good practices on how we use them. It appears to me that we could all use a bit more knowledge on this topic, and more love and appreciation of something that’s critical but usually taken for granted, like hot showers and opposable thumbs.</p><p>A PDU is a device used in data centers to distribute power to multiple rack-mounted machines. It’s an industrial grade power strip typically designed to power an average consumption of about <a href="https://www.eia.gov/tools/faqs/faq.php?id=97&amp;t=3">seven US households</a>. Advanced models have monitoring features and can be accessed via <a href="https://www.cloudflare.com/learning/access-management/what-is-ssh/">SSH</a> or webGUI to turn power outlets on and off. How we choose a PDU depends on what country the data center is in and what it provides in terms of <a href="https://www.worldstandards.eu/electricity/three-phase-electric-power/">voltage, phase, and plug type</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5VQToUAxoKYdAqSWgqOyDS/15afd55e2d243d5ae444bf39929cc88c/Artboard-1.png" />
            
            </figure><p>For each of our racks, every dual power-supply (PSU) server is cabled to both of the two vertically mounted PDUs, one PSU to each. As shown in the picture above, one PDU feeds a server’s PSU via a red cable, and the other PDU feeds that server’s other PSU via a blue cable. This ensures power redundancy, maximizing our service uptime: should one of the PDUs or server PSUs fail, the redundant power feed keeps the server alive.</p>
    <div>
      <h3>Faraday’s Law and Ohm’s Law</h3>
      <a href="#faradays-law-and-ohms-law">
        
      </a>
    </div>
    <p>Like most high-voltage applications, PDUs and servers are designed to use AC power. This means voltage and current aren’t constant — they’re sine waves whose magnitudes alternate between positive and negative at a certain frequency. For example, a voltage feed of 100V is not constantly at 100V; it bounces between 100V and -100V like a sine wave. One complete sine wave cycle is one phase of 360 degrees, and running at 50Hz means there are 50 cycles per second.</p><p>The sine wave can be explained by Faraday’s Law and by looking at how an AC power generator works. Faraday’s Law tells us that a current is induced to flow by a changing magnetic field. The figure below illustrates a simple generator with a permanent magnet rotating at constant speed and a coil coming in contact with the magnet’s magnetic field. Magnetic force is strongest at the North and South ends of the magnet, so as it rotates near the coil, the current flow in the coil fluctuates. One complete rotation of the magnet represents one phase. As the North end approaches the coil, current increases from zero. Once the North end leaves, current decreases to zero. The South end in turn approaches, and the current now “increases” in the opposite direction. Finally, the South end leaves, returning the current back to zero. Current <i>alternates</i> its direction at every half cycle, hence the name Alternating Current.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7CcXLX2BlNWmNeVBk9YOJz/68fd03d71ec5a29a4c13454e4ce5fb72/Artboard-2-1.png" />
            
            </figure><p>Current and voltage in AC power fluctuate in-phase, or “in tandem”, with each other. So by the power formula Power = Voltage x Current (which follows from Ohm’s Law), power will always be positive. Notice on the graph below that AC power (Watts) has two peaks per cycle. But for practical purposes, we'd like to use a constant power value. We do that by interpreting AC voltage and current as their constant "DC" equivalents using root-mean-square (RMS) averaging, which takes the peak value and divides it by √2. For example, in the US, our conditions are 208V 24A at 60Hz. When we look at spec sheets, all of these values can be assumed to be RMS'd into their constant DC equivalent values. When we say we're fully utilizing a PDU's max capacity of 5kW, that 5kW is the RMS average: the instantaneous power consumption of our machines actually bounces between 0 and twice that, 10kW, within every cycle.</p>
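<p>The √2 relationship is easy to verify numerically. A quick sketch in plain Python, using the US example’s 208V:</p>

```python
import math

V_RMS = 208.0
v_peak = V_RMS * math.sqrt(2)  # spec-sheet values are RMS; the actual peak is √2 higher (~294V)

# Sample one full cycle of the sine wave and compute the root-mean-square numerically.
n = 100_000
samples = [v_peak * math.sin(2 * math.pi * k / n) for k in range(n)]
v_rms_numeric = math.sqrt(sum(v * v for v in samples) / n)
```

<p>Averaging the squared samples over a whole cycle recovers the 208V spec-sheet figure from the ~294V peak.</p>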
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3fM63pmekFqVlqHI6opB4V/c26220c91da05301a7573407d513eafe/image13.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6n22LgiBvxJ4hrF6UdL0em/058bcddee79584ca845603f0ae6ca706/image6.png" />
            
            </figure><p>It’s also critical to figure out the sum of power our servers will need in a rack so that it falls under the PDU’s design max power capacity. For our US example, a PDU is typically 5kW (208 volts x 24 amps); therefore, we budget 5kW and fit as many machines as we can under that. If we need more machines and the total power goes above 5kW, we’d need to provision another power source. That could lead to another set of PDUs and racks that we may not fully use depending on demand; i.e. more underutilized costs. All we can do is abide by P = V x I.</p><p>However, there is a way to increase the max power capacity economically — the 3-phase PDU. Compared to single phase, its max capacity is √3, or about 1.7, times higher. A 3-phase PDU of the same US specs above has a capacity of 8.6kW (5kW x √3), allowing us to power more machines from the same source. A 3-phase setup means thicker cables and a bigger plug, and it costs more than a 1-phase PDU; but its value is higher compared to two 1-phase rack setups for these reasons:</p><ul><li><p>It’s more cost-effective, because there are fewer hardware resources to buy. Say the computing demand adds up to 215kW of hardware: we would need 25 3-phase racks compared to 43 1-phase racks.</p></li><li><p>Each rack needs two PDUs for power redundancy. Using the example above, we would need 50 3-phase PDUs compared to 86 1-phase PDUs to power 215kW worth of hardware.</p></li><li><p>That also means a smaller rack footprint and fewer power sources provided and charged by the data center, saving us up to √3, or 1.7, times in opex.</p></li><li><p>It’s more resilient, because there are more circuit breakers in a 3-phase PDU — one more than in a 1-phase. For example, a 48-outlet 1-phase PDU would be split into two circuits of 24 outlets, while a 3-phase one would be split into three circuits of 16 outlets. If a breaker tripped, we’d lose 16 outlets with a 3-phase PDU instead of 24.</p></li></ul>
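<p>The rack counts in the list above are straightforward ceiling division. A sketch of the arithmetic, using the 5kW and 8.6kW capacities from the US example:</p>

```python
demand_w = 215_000       # 215kW of computing demand, in watts
cap_1phase_w = 5_000     # 208V x 24A
cap_3phase_w = 8_600     # 5kW x √3

def racks_needed(demand: int, per_rack: int) -> int:
    """Ceiling division: whole racks needed to cover the demand."""
    return -(-demand // per_rack)

racks_1phase = racks_needed(demand_w, cap_1phase_w)  # 43
racks_3phase = racks_needed(demand_w, cap_3phase_w)  # 25

# Two PDUs per rack for power redundancy.
pdus_1phase, pdus_3phase = 2 * racks_1phase, 2 * racks_3phase  # 86 and 50
```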
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6j4fgzSlb8GU9fsGwywC11/a761ff04c713ff3cbc5a5b04f5e067b0/image5-1.png" />
            
            </figure><p>The PDU shown above is a 3-phase model of 48 outlets. We can see three pairs of circuit breakers for the three branches that are intertwined with each other <b>—</b> white, grey, and black. Industry demands today pressure engineers to maximize compute performance and minimize physical footprint, making the 3-phase PDU a widely-used part of operating a data center.</p>
    <div>
      <h3>What is 3-phase?</h3>
      <a href="#what-is-3-phase">
        
      </a>
    </div>
    <p>A 3-phase AC generator has three coils instead of one, with the coils spaced 120 degrees apart inside the cylindrical core, as shown in the figure below. Just like in the 1-phase generator, current flow is induced by the rotation of the magnet, creating power from each coil sequentially at every one-third of the magnet’s rotation cycle. In other words, we’re generating three feeds of 1-phase power, offset by 120 degrees.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7JNipgmV5CwJESOZLyeiy2/8a8cd916f8af4b820b8de6d93ed237ce/Artboard-3-1.png" />
            
            </figure><p>A 3-phase feed is set up by joining any of its three coils into line pairs. L1, L2, and L3 coils are live wires with each on their own <i>phase</i> carrying their own <i>phase</i> voltage and <i>phase</i> current. Two phases joining together form one <i>line</i> carrying a common <i>line</i> voltage and <i>line</i> current. L1 and L2 phase voltages create the L1/L2 line voltage. L2 and L3 phase voltages create the L2/L3 line voltage. L1 and L3 phase voltages create the L1/L3 line voltage.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/aEGubJ4ZOZYPBdqTyT2If/6d9ec3e4af13c9a9b8edfc3ddf4235ab/image9-1.jpg" />
            
            </figure><p>Let’s take a moment to clarify the terminology. Some other sources may refer to line voltage (or current) as line-to-line or phase-to-phase voltage (or current). It can get confusing, because line voltage is the same as phase voltage in 1-phase circuits, as there’s only one phase. Also, the <i>magnitude</i> of the line voltage is equal to the <i>magnitude</i> of the phase voltage in 3-phase Delta circuits, while the <i>magnitude</i> of the line current is equal to the <i>magnitude</i> of the phase current in 3-phase Wye circuits.</p><p>Conversely, the line current equals the phase current times √3 in Delta circuits, and in Wye circuits the line voltage equals the phase voltage times √3.</p><p>In Delta circuits:</p><p>V<sub>line</sub> = V<sub>phase</sub></p><p>I<sub>line</sub> = √3 x I<sub>phase</sub></p><p>In Wye circuits:</p><p>V<sub>line</sub> = √3 x V<sub>phase</sub></p><p>I<sub>line</sub> = I<sub>phase</sub></p><p>Delta and Wye circuits are the two ways that three wires can be joined together. This happens both at the power source with its three coils and at the PDU end with its three branches of outlets. Note that the generator and the PDU don’t need to match each other’s circuit types.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/yBiiAFcHTedr9s3s3QSPL/52c6822b94b0816b139570c334c62998/image8.png" />
            
            </figure><p>On PDUs, these phases join when we plug servers into the outlets. So we conceptually take the coil wirings above and replace the coils with resistors to represent servers. Below is a simplified wiring diagram of a 3-phase Delta PDU showing the three line pairs as three modular branches. Each branch carries two phase currents and one common voltage drop.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4xV9Yds4Dzc7VOVFEmrg7X/f398859d93ca13341b641ca3ec533753/image15.png" />
            
            </figure><p>And this one below is of a 3-phase Wye PDU. Note that Wye circuits have an additional line known as the neutral line where all three phases meet at one point. Here each branch carries one phase and a neutral line, therefore one common current. The neutral line isn’t considered as one of the phases.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6P2JKqoppmgeEjomv9jmmY/8fe420607f4a0f0abba13cdf82aa117a/image20.png" />
            
            </figure><p>Thanks to a neutral line, a Wye PDU can offer a second voltage source that is √3 times lower for smaller devices, like laptops or monitors. Common voltages for Wye PDUs are 230V/400V or 120V/208V, particularly in North America.</p>
    <div>
      <h3>Where does the √3 come from?</h3>
      <a href="#where-does-the-3-come-from">
        
      </a>
    </div>
    <p>Why are we multiplying by √3? As the name implies, we are adding phasors. Phasors are complex numbers representing sine wave functions, and adding phasors is like adding vectors. Say your GPS tells you to walk 1 mile East (vector a), then walk 1 mile North (vector b). You walked 2 miles, but you only moved 1.4 miles NE from the original location (vector a+b). That 1.4 miles of “work” is what we want.</p><p>Let’s apply this to L1 and L2 in a Delta circuit. When we combine phases L1 and L2, we get the L1/L2 line. We assume the two coils are identical, and let α represent the voltage magnitude of each phase. The two phases are offset by 120 degrees, as designed into the 3-phase power generator:</p>
            <pre><code>|L1| = |L2| = α
L1 = |L1|∠0° = α∠0°
L2 = |L2|∠-120° = α∠-120°</code></pre>
            <p>The line voltage is the potential difference between the two phase voltages, so we solve for L1/L2 using vector subtraction:</p>
            <pre><code>L1/L2 = L1 - L2</code></pre>
            
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2NtOvL0EuHAqcE84gSFJsf/64042e5e1175757f216824d49222b611/image10.png" />
            
            </figure><p>$$\text{L1/L2} = α∠\text{0°} - α∠\text{-120°}$$</p><p>$$\text{L1/L2} = (α\cos{\text{0°}}+jα\sin{\text{0°}})-(α\cos{\text{-120°}}+jα\sin{\text{-120°}})$$</p><p>$$\text{L1/L2} = (α\cos{\text{0°}}-α\cos{\text{-120°}})+j(α\sin{\text{0°}}-α\sin{\text{-120°}})$$</p><p>$$\text{L1/L2} = \frac{3}{2}α+j\frac{\sqrt3}{2}α$$</p><p>Convert L1/L2 into polar form:</p><p>$$\text{L1/L2} = \sqrt{(\frac{3}{2}α)^2 + (\frac{\sqrt3}{2}α)^2}∠\tan^{-1}(\frac{\frac{\sqrt3}{2}α}{\frac{3}{2}α})$$</p><p>$$\text{L1/L2} = \sqrt 3α∠\text{30°}$$</p><p>Since voltage is a scalar, we’re only interested in the “work”:</p>
            <pre><code>|L1/L2| = √3α</code></pre>
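<p>We can sanity-check this result with Python’s complex numbers standing in for phasors:</p>

```python
import cmath, math

alpha = 1.0  # phase voltage magnitude (normalized)
L1 = cmath.rect(alpha, math.radians(0))
L2 = cmath.rect(alpha, math.radians(-120))

line = L1 - L2  # line voltage: the difference between the two phase voltages
magnitude = abs(line)                    # √3 x alpha
angle = math.degrees(cmath.phase(line))  # 30 degrees
```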
            <p>The same α also applies for L3. This means that for any of the three line pairs, we multiply the phase voltage by √3 to calculate the line voltage.</p><p>V<sub>line</sub> = √3 x V<sub>phase</sub></p><p>Now, with the three phases delivering equal power, we can add them all to get the overall effective power. The derivation below works for both Delta and Wye circuits.</p><p>P<sub>overall</sub> = 3 x P<sub>phase</sub></p><p>P<sub>overall</sub> = 3 x (V<sub>phase</sub> x I<sub>phase</sub>)</p><p>P<sub>overall</sub> = (3/√3) x (V<sub>line</sub> x I<sub>line</sub>)</p><p>P<sub>overall</sub> = √3 x V<sub>line</sub> x I<sub>line</sub></p><p>Using the US example, V<sub>line</sub> is 208V and I<sub>line</sub> is 24A. This leads to an overall 3-phase power of 8646W (√3 x 208V x 24A), or 8.6kW. Therein lies the biggest advantage of using 3-phase systems: by adding two sets of coils and wires (ignoring the neutral wire), we’ve built a generator that can produce √3, or 1.7, times more power!</p>
    <div>
      <h3>Dealing with 3-phase</h3>
      <a href="#dealing-with-3-phase">
        
      </a>
    </div>
    <p>The derivation in the section above assumes that the magnitude at all three phases is equal, but in practice that's rarely the case. We hardly ever have servers and switches evenly distributed across all three branches on a PDU. Each machine may have different loads and different specs, so power draw can differ wildly, potentially causing a dramatic phase imbalance. A heavily imbalanced setup can hinder the PDU's available capacity.</p><p>A perfectly balanced and fully utilized PDU at 8.6kW means that each of its three branches has 2.88kW of power consumed by machines. Laid out simply, it's spread 2.88 + 2.88 + 2.88. This is the best case scenario. If we were to take 1kW worth of machines out of one branch, the spread becomes 2.88 + 1.88 + 2.88: imbalance is introduced and the PDU is underutilized, but we're fine. However, if we were to put that 1kW back into another branch — 3.88 + 1.88 + 2.88 — the PDU is over capacity, even though the sum is still 8.6kW. In fact, it would be over capacity even if we added just 500W instead of 1kW on the wrong branch, reaching 3.38 + 1.88 + 2.88 (8.1kW).</p><p>That's because an 8.6kW PDU is spec'd to have a maximum of 24A for each phase current. Overloading one of the branches can force phase currents over 24A. Theoretically, we could hit that limit by loading one branch until its current reaches 24A and leaving the other two branches unused. That would effectively render it a 1-phase PDU, losing the benefit of the √3 multiplier. In reality, each branch has fuses rated lower than 24A (usually 20A) to ensure we don't go that high and cause overcurrent issues. Therefore, the same 8.6kW PDU would have one of its branches trip at 4.2kW (208V x 20A).</p><p>Loading up one branch is the easiest way to overload the PDU. Being heavily imbalanced significantly lowers PDU capacity and increases the risk of failure. 
To help minimize that, we must:</p><ul><li><p>Ensure that the total power consumption of all machines is under the PDU's max power capacity</p></li><li><p>Try to be as phase-balanced as possible by spreading cabling evenly across the three branches</p></li><li><p>Ensure that the sum of phase currents from powered machines at each branch is under the fuse rating at the circuit breaker.</p></li></ul><p><a href="https://www.raritan.com/blog/how-to-calculate-current-on-a-3-phase-208v-rack-pdu-power-strip">This spreadsheet</a> from Raritan is very useful when designing racks.</p><p>For the sake of simplicity, let's ignore other machines like switches. Our latest 2U4N servers are rated at 1800W, which means we can only fit a maximum of four of these 2U4N chassis (8600W / 1800W = 4.7 chassis). Rounding up to five would bring the total rack-level power consumption to 9kW, so that's a no-no.</p><p>Splitting four chassis evenly across three branches is impossible, and forces one of the branches to carry two chassis. That leads to non-ideal phase balancing:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1vAVIr6zDrLo4vIWjmtQYL/50f9ae2af265da7f953e05e96003f47f/image17.png" />
            
            </figure><p>Keeping phase currents under 24A, there's only 1.1A (24A - 22.9A) of headroom on L1 or L2 before the PDU gets overloaded. Say we want to add as many machines as we can under the PDU’s power capacity. One solution is to add up to 242W on the L1/L2 branch, at which point both the L1 and L2 currents reach their 24A limit.</p>
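<p>The 22.9A figure can be reproduced with a little arithmetic: in a Delta PDU, a phase conductor carries the combination of two branch currents whose phasors are offset from each other, so the magnitudes combine as √(a² + b² + ab). A sketch in Python (the branch wattages match the four-chassis example above):</p>

```python
import math

V = 208.0  # line voltage
branch_watts = {"L1/L2": 3600, "L2/L3": 1800, "L3/L1": 1800}  # two 1800W chassis on L1/L2
branch_amps = {pair: watts / V for pair, watts in branch_watts.items()}

def phase_current(i_a: float, i_b: float) -> float:
    """Magnitude of the current in a phase conductor shared by two
    Delta branches, combining the two branch-current phasors."""
    return math.sqrt(i_a**2 + i_b**2 + i_a * i_b)

i_L1 = phase_current(branch_amps["L1/L2"], branch_amps["L3/L1"])  # ≈ 22.9A
i_L3 = phase_current(branch_amps["L2/L3"], branch_amps["L3/L1"])  # ≈ 15.0A
```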
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3NP9EOPb0h2lEXuvlSdg3B/9a6f53163c45fd3b8f2ee01964c0bd86/image18.png" />
            
            </figure><p>Alternatively, we can add up to 298W on the L2/L3 branch until L2 current reaches 24A. Note we can also add another 298W on the L3/L1 branch until L1 current reaches 24A.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/64ECQoHBzqhPuq38sNizat/6070a8f2acad6cb76fae4e32f1c566df/image16.png" />
            
            </figure><p>In the examples above, we can see that various solutions are possible. Adding two 298W machines, one each at L2/L3 and L3/L1, is the most phase-balanced solution given the parameters. Nonetheless, PDU capacity isn't optimized, at 7.8kW.</p><p>Dealing with an 1800W server is not ideal, because whichever branch we choose to power one from would significantly swing the phase balance unfavorably. Thankfully, our <a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">Gen X servers</a> take up less space and are more power efficient. A smaller footprint gives us more flexibility and fine-grained control over our racks in many of our diverse data centers. Assuming each 1U server is 450W, as if we physically split the 1800W 2U4N into four 1U nodes, each with its own power supply, we're now able to fit 18 nodes. That's two more than the 16 in the four-chassis 2U4N setup:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eBaV5AwmB0D31FSgEw65b/bfd1bd88c6e6078edf84523c2852081a/image21.png" />
            
            </figure><p>Adding two more servers here means we've increased our value by 12.5%. While there are more factors not considered here to calculate the Total Cost of Ownership, this is still a great way to show we can be smarter with asset costs.</p><p>Cloudflare provides the back-end services so that our customers can enjoy the performance, reliability, security, and global scale of our edge network. Meanwhile, we manage all of our hardware in over 100 countries with various power standards and compliances, and ensure that our physical infrastructure is as reliable as it can be.</p><p>There’s no Cloudflare without hardware, and there’s no Cloudflare without power. Want to know more? Watch this Cloudflare TV segment about power: <a href="https://cloudflare.tv/event/7E359EDpCZ6mHahMYjEgQl">https://cloudflare.tv/event/7E359EDpCZ6mHahMYjEgQl</a>.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Edge]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">4Qx2ZCOM4gEYCEIljEKvAi</guid>
            <dc:creator>Rob Dinh</dc:creator>
        </item>
        <item>
            <title><![CDATA[Anchoring Trust: A Hardware Secure Boot Story]]></title>
            <link>https://blog.cloudflare.com/anchoring-trust-a-hardware-secure-boot-story/</link>
            <pubDate>Tue, 17 Nov 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ As a security company, we pride ourselves on finding innovative ways to protect our platform to, in turn, protect the data of our customers. Part of this approach is implementing progressive methods in protecting our hardware at scale. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15nAX8kXKg4gz7jDCecogD/f2340b560358c5d4c25f5ff7fbb77325/anchor2-2-1.png" />
            
            </figure><p>As a security company, we pride ourselves on finding innovative ways to protect our platform to, in turn, protect the data of our customers. Part of this approach is implementing progressive methods in protecting our hardware at scale. While we have blogged about how we address security threats from <a href="/mitigating-spectre-and-other-security-threats-the-cloudflare-workers-security-model/">application</a> to <a href="/securing-memory-at-epyc-scale/">memory</a>, the attacks on hardware, as well as firmware, have increased substantially. The data cataloged in the <a href="https://nvd.nist.gov/">National Vulnerability Database (NVD)</a> has shown the frequency of hardware and firmware-level vulnerabilities rising year after year.</p><p>Technologies like <a href="https://docs.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-secure-boot">secure boot</a>, common in desktops and laptops, have been ported over to the server industry as a method to combat firmware-level attacks and protect a device’s boot integrity. These technologies require that you create a trust ‘anchor’, an authoritative entity for which trust is assumed and not derived. A common trust anchor is the system <a href="https://en.wikipedia.org/wiki/BIOS">Basic Input/Output System (BIOS)</a> or the <a href="https://www.uefi.org/">Unified Extensible Firmware Interface (UEFI</a>) firmware.</p><p>While this ensures that the device boots only signed firmware and operating system bootloaders, does it protect the entire boot process? What protects the BIOS/UEFI firmware from attacks?</p>
    <div>
      <h2>The Boot Process</h2>
      <a href="#the-boot-process">
        
      </a>
    </div>
    <p>Before we discuss how we secure our boot process, we will first go over how we boot our machines.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ubWFPurTNwYnbXL8EnVpl/bd8c93e0e7acedf04b2787816be51e6d/image1-7.png" />
            
            </figure><p>The above image shows the following sequence of events:</p><ul><li><p>After powering on the system (through a <a href="https://www.gigabyte.com/Glossary/bmc">baseboard management controller (BMC)</a> or physically pushing a button on the system), the system unconditionally executes the UEFI firmware residing on a flash chip on the motherboard.</p></li><li><p>UEFI performs some hardware and peripheral initialization and executes the <a href="https://wiki.debian.org/PXEBootInstall">Preboot Execution Environment (PXE)</a> code, which is a small program that boots an image over the network and usually resides on a flash chip on the network card.</p></li><li><p>PXE sets up the network card, and downloads and executes a small program bootloader through an open source boot firmware called <a href="https://ipxe.org/">iPXE</a>.</p></li><li><p>iPXE loads a script that automates a sequence of commands for the bootloader to know how to boot a specific operating system (sometimes several of them). In our case, it loads our Linux kernel, <code>initrd</code> (this contains device drivers which are not directly compiled into the kernel), and a standard Linux root filesystem. After loading these components, the bootloader executes and hands off the control to the kernel.</p></li><li><p>Finally, the Linux kernel loads any additional drivers it needs and starts applications and services.</p></li></ul>
    <div>
      <h2>UEFI Secure Boot</h2>
      <a href="#uefi-secure-boot">
        
      </a>
    </div>
    <p>Our UEFI secure boot process is fairly straightforward, albeit customized for our environments. After loading the UEFI firmware from the bootloader, an initialization script defines the following variables:</p><p><b>Platform Key (PK):</b> It serves as the cryptographic root of trust for secure boot, giving capabilities to manipulate and/or validate the other components of the <a href="https://blog.hansenpartnership.com/the-meaning-of-all-the-uefi-keys/">secure boot framework</a>.</p><p><b>Trusted Database (DB):</b> Contains a signed (by platform key) list of hashes of all PCI option ROMs, as well as a public key, which is used to verify the signature of the bootloader and the kernel on boot.</p><p>These variables are respectively the master platform public key, which is used to sign all other resources, and an allow list database, containing other certificates, binary file hashes, etc. In default secure boot scenarios, Microsoft keys are used by default. At Cloudflare we use our own, which makes us the root of trust for UEFI:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/12zsC1QVrT8uIGMG0K6ss8/92c1d1cdaf9348b03794be40943c81f8/image7.png" />
            
            </figure><p>But, by setting our trust anchor in the UEFI firmware, what attack vectors still exist?</p>
    <div>
      <h2>UEFI Attacks</h2>
      <a href="#uefi-attacks">
        
      </a>
    </div>
    <p>As stated previously, firmware and hardware attacks are on the rise. It is clear from the figure below that firmware-related vulnerabilities have increased significantly over the last 10 years, especially since 2017, when the hacker community started attacking the firmware on different platforms:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7gyyxjhnApSBYfZEdqglcS/d863b610573c1ca0b497b1b3e1ebf2cc/image14.png" />
            
            </figure><p>This upward trend, coupled with <a href="https://arstechnica.com/information-technology/2020/10/custom-made-uefi-bootkit-found-lurking-in-the-wild/">recent malware findings in UEFI</a>, shows that trusting firmware is becoming increasingly problematic.</p><p>By tainting the UEFI firmware image, you poison the entire boot trust chain. The ability to trust firmware integrity is important beyond secure boot. For example, if you can't trust the firmware not to be compromised, you can't trust things like <a href="https://docs.microsoft.com/en-us/windows/security/information-protection/tpm/trusted-platform-module-overview">trusted platform module (TPM) measurements</a> to be accurate, because the firmware is itself responsible for taking these measurements (e.g. a TPM is not an on-path security mechanism; it relies on the firmware to interact and cooperate with it). Firmware may be crafted to extend measurements that are accepted by a remote attestor but that don't represent what's actually being loaded locally. This undermines both the measured boot and remote attestation procedures.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3hTmUFGH3CZwcPD6CXmASh/4a26d01153fb10cd9b11d461be513005/image11.png" />
            
            </figure><p>If we can’t trust firmware, then hardware becomes our last line of defense.</p>
    <div>
      <h2>Hardware Root of Trust</h2>
      <a href="#hardware-root-of-trust">
        
      </a>
    </div>
    <p>Early this year, we made a series of blog posts on <a href="/technical-details-of-why-cloudflare-chose-amd-epyc-for-gen-x-servers/">why we chose AMD EPYC processors</a> for our Gen X servers. With security in mind, we started <a href="/securing-memory-at-epyc-scale/">turning on</a> features that were available to us and set forth the plan of using AMD silicon as a Hardware Root of Trust (HRoT).</p><p><a href="https://www.amd.com/system/files/2017-06/Trusting-in-the-CPU.pdf">Platform Secure Boot</a> (PSB) is AMD’s implementation of hardware-rooted boot integrity. Why is it better than UEFI firmware-based root of trust? Because it is intended to assert, by a root of trust anchored in the hardware, the integrity and authenticity of the System ROM image before it can execute. It does so by performing the following actions:</p><ul><li><p>Authenticates the first block of BIOS/UEFI prior to releasing x86 CPUs from reset.</p></li><li><p>Authenticates the System Read-Only Memory (ROM) contents on each boot, not just during updates.</p></li><li><p>Moves the UEFI Secure Boot trust chain to immutable hardware.</p></li></ul><p>This is accomplished by the AMD Platform Security Processor (PSP), an ARM Cortex-A5 microcontroller that is an immutable part of the system on chip (SoC). The PSB consists of two components:</p><p><b><b><b>On-chip Boot ROM</b></b></b></p><ul><li><p>Embeds a SHA384 hash of an AMD root signing key</p></li><li><p>Verifies and then loads the off-chip PSP bootloader located in the boot flash</p></li></ul><p><b><b><b>Off-chip Boot</b></b></b><b><b>**l</b></b>oader******</p><ul><li><p>Locates the PSP directory table that allows the PSP to find and load various images</p></li><li><p>Authenticates first block of BIOS/UEFI code</p></li><li><p>Releases CPUs after successful authentication</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ABw5l5aHps4VcM0wHACDq/e0eaaf60f0ffee8317d88d049f3c594c/image4.gif" />
            
            </figure><ol><li><p>The PSP secures the On-chip Boot ROM code, loads the off-chip PSP firmware into PSP static random access memory (SRAM) after authenticating the firmware, and passes control to it.</p></li><li><p>The Off-chip Bootloader (BL) loads and specifies applications in a specific order (whether or not the system goes into a debug state and then a secure EFI application binary interface to the BL)</p></li><li><p>The system continues initialization through each bootloader stage.</p></li><li><p>If each stage passes, then the UEFI image is loaded and the x86 cores are released.</p></li></ol><p>Now that we know the booting steps, let’s build an image.</p>
    <div>
      <h2>Build Process</h2>
      <a href="#build-process">
        
      </a>
    </div>
    
    <div>
      <h3>Public Key Infrastructure</h3>
      <a href="#public-key-infrastructure">
        
      </a>
    </div>
    <p>Before the image gets built, a public key infrastructure (PKI) is created to generate the key pairs involved for signing and validating signatures:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7CGqAcYMPxBE9kv0ZcjGwL/0ffb73c3d6b1353b0291b1736ff0f599/image10.png" />
            
            </figure><p>Our original device manufacturer (ODM), as a trust extension, creates a key pair (public and private) that is used to sign the first segment of the BIOS (private key) and validates that segment on boot (public key).</p><p>On AMD’s side, they have a key pair that is used to sign (the AMD root signing private key) and certify the public key created by the ODM. This is validated by AMD’s root signing public key, which is stored as a hash value (<a href="https://tools.ietf.org/html/rfc5756">RSASSA-PSS</a>: SHA-384 with 4096-bit key is used as the hashing algorithm for both message and mask generation) in <a href="https://en.wikipedia.org/wiki/Serial_Peripheral_Interface">SPI-ROM</a>.</p><p>Private keys (both AMD and ODM) are stored in <a href="https://en.wikipedia.org/wiki/Hardware_security_module">hardware security modules</a>.</p><p>Because of the way the PKI mechanisms are built, the system cannot be compromised if only one of the keys is leaked. This is an important piece of the trust hierarchy that is used for image signing.</p>
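<p>The hash-pinning half of this hierarchy can be sketched in a few lines of Python. This is purely illustrative, not Cloudflare's or AMD's actual code: the key bytes are made up, and only the step where the AMD root public key is checked against the SHA-384 hash stored in SPI-ROM is modeled (the real chain uses RSASSA-PSS signatures over RSA-4096 keys):</p>

```python
import hashlib

# Hypothetical key material; the real keys are RSA-4096 and the private
# halves live in hardware security modules.
AMD_ROOT_PUBKEY = b"amd-root-signing-public-key"

# At manufacturing time only the SHA-384 hash of the AMD root public key
# is stored in SPI-ROM; the full key ships inside the firmware image.
SPI_ROM_PINNED_HASH = hashlib.sha384(AMD_ROOT_PUBKEY).digest()

def verify_root_key(presented_key: bytes) -> bool:
    """Check the root key found in the image against the pinned hash."""
    return hashlib.sha384(presented_key).digest() == SPI_ROM_PINNED_HASH

assert verify_root_key(AMD_ROOT_PUBKEY)      # genuine key accepted
assert not verify_root_key(b"attacker-key")  # swapped key rejected
```

<p>Because only a hash is pinned in ROM, an attacker who replaces the root key inside the firmware image fails this check before any signature is even examined.</p>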
    <div>
      <h3>Certificate Signing Request</h3>
      <a href="#certificate-signing-request">
        
      </a>
    </div>
    <p>Once the PKI is established, a BIOS signing key pair is created, together with a certificate signing request (CSR). Creating the CSR uses known common name (CN) fields that many are familiar with:</p><ul><li><p><code>countryName</code></p></li><li><p><code>stateOrProvinceName</code></p></li><li><p><code>localityName</code></p></li><li><p><code>organizationName</code></p></li></ul><p>In addition to the fields above, the CSR will contain a <code>serialNumber</code> field, a 32-bit integer value represented in ASCII HEX format that encodes the following values:</p><ul><li><p><code>PLATFORM_VENDOR_ID</code>: An 8-bit integer value assigned by AMD for each ODM.</p></li><li><p><code>PLATFORM_MODEL_ID</code>: A 4-bit integer value assigned to a platform by the ODM.</p></li><li><p><code>BIOS_KEY_REVISION_ID</code>: A 4-bit key revision, set by the ODM and encoded as a unary counter value.</p></li><li><p><code>DISABLE_SECURE_DEBUG</code>: Fuse bit that controls whether the secure debug unlock feature is permanently disabled.</p></li><li><p><code>DISABLE_AMD_BIOS_KEY_USE</code>: Fuse bit that controls whether a BIOS signed by an AMD key (with <code>vendor ID == 0</code>) is permitted to boot on a CPU with a non-zero vendor ID.</p></li><li><p><code>DISABLE_BIOS_KEY_ANTI_ROLLBACK</code>: Fuse bit that controls whether the BIOS key anti-rollback feature is enabled.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7jqoVoEKDiRSoVt9MbdY0W/0f4d2d505e12ecef65a6b31648f223d1/image3-3.png" />
            
            </figure><p>Remember these values, as we’ll show how we use them in a bit. Any of the <code>DISABLE</code> values are optional, but recommended based on your security posture/comfort level.</p><p>AMD, upon processing the CSR, provides the public part of the BIOS signing key signed and certified by the AMD signing root key as a RSA Public Key Token file (<code>.stkn</code>) format.</p>
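<p>To make the <code>serialNumber</code> encoding concrete: the post doesn't specify the exact bit layout, so this Python sketch assumes one (vendor ID in the low byte, model ID and the unary key-revision counter in the next nibbles, fuse bits above those) purely for illustration:</p>

```python
def encode_serial_number(vendor_id: int, model_id: int, key_revision: int,
                         disable_secure_debug: bool,
                         disable_amd_bios_key_use: bool,
                         disable_anti_rollback: bool) -> str:
    """Pack the CSR fields into a 32-bit value rendered as ASCII hex.

    Bit positions here are illustrative, not AMD's actual layout."""
    assert 0 <= vendor_id < 256 and 0 <= model_id < 16 and 0 <= key_revision <= 4
    # BIOS_KEY_REVISION_ID is a unary counter: revision n sets the n lowest bits.
    revision_bits = (1 << key_revision) - 1
    value = (vendor_id
             | (model_id << 8)
             | (revision_bits << 12)
             | (int(disable_secure_debug) << 16)
             | (int(disable_amd_bios_key_use) << 17)
             | (int(disable_anti_rollback) << 18))
    return f"{value:08X}"

# e.g. vendor 0x2A, model 3, key revision 1, all three fuse bits set:
print(encode_serial_number(0x2A, 3, 1, True, True, True))  # → 0007132A
```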
    <div>
      <h2>Putting It All Together</h2>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p>The following is a step-by-step illustration of how signed UEFI firmware is built:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/48teEpwD6WwiJEVvSql73e/203673a7e6265f4d799c2b31ec9f492a/image8.gif" />
            
            </figure><ol><li><p>The ODM submits their public key used for signing Cloudflare images to AMD.</p></li><li><p>AMD signs this key using their RSA private key and passes it back to ODM.</p></li><li><p>The AMD public key and the signed ODM public key are part of the final BIOS SPI image.</p></li><li><p>The BIOS source code is compiled and various BIOS components (PEI Volume, Driver eXecution Environment (DXE) volume, NVRAM storage, etc.) are <a href="https://edk2-docs.gitbook.io/edk-ii-build-specification/2_design_discussion/23_boot_sequence">built as usual.</a></p></li><li><p>The PSP directory and BIOS directory are built next. PSP directory and BIOS directory table points to the location of various firmware entities.</p></li><li><p>The ODM builds the signed BIOS Root of Trust Measurement (RTM) signature based on the blob of BIOS PEI volume concatenated with BIOS Directory header, and generates the digital signature of this using the private portion of ODM signing key. The SPI location for signed BIOS RTM code is finally updated with this signature blob.</p></li><li><p>Finally, the BIOS binaries, PSP directory, BIOS directory and various firmware binaries are combined to build the SPI BIOS image.</p></li></ol>
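<p>Steps 6 and 7 can be sketched as follows. This is a toy model: HMAC-SHA384 with a made-up key stands in for the real RSASSA-PSS signature produced with the ODM private key (which never leaves the HSM), and the byte strings are placeholders for the real firmware volumes:</p>

```python
import hashlib
import hmac

# Stand-in key material: the real ODM private key is RSA-4096 and is
# kept in a hardware security module.
ODM_PRIVATE_KEY = b"odm-private-key"

def sign_rtm(pei_volume: bytes, bios_dir_header: bytes) -> bytes:
    """Sign the blob of PEI volume || BIOS directory header (step 6).

    HMAC-SHA384 stands in here for the real RSASSA-PSS signature."""
    return hmac.new(ODM_PRIVATE_KEY, pei_volume + bios_dir_header,
                    hashlib.sha384).digest()

def build_spi_image(pei_volume: bytes, bios_dir_header: bytes,
                    other_binaries: bytes) -> bytes:
    """Step 7: combine the firmware pieces with the RTM signature blob."""
    rtm_signature = sign_rtm(pei_volume, bios_dir_header)
    return pei_volume + bios_dir_header + other_binaries + rtm_signature
```

<p>Any change to the PEI volume or the BIOS directory header produces a different signature, which is what lets the PSP reject a tampered image before the x86 cores ever run it.</p>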
    <div>
      <h2>Enabling Platform Secure Boot</h2>
      <a href="#enabling-platform-secure-boot">
        
      </a>
    </div>
    <p>Platform Secure Boot is enabled at boot time with a PSB-ready firmware image. PSB is configured using a region of one-time programmable (OTP) fuses, specified for the customer. OTP fuses are on-chip non-volatile memory (NVM) that permits data to be written only once. There is <b>NO</b> way to roll a fused CPU back to an unfused one.</p><p>Enabling PSB in the field goes through two steps: fusing and validating.</p><ul><li><p>Fusing: Fuse the values assigned in the <code>serialNumber</code> field that was generated in the CSR.</p></li><li><p>Validating: Validate the fused values and the status code registers.</p></li></ul><p>If validation is successful, the BIOS RTM signature is validated using the ODM BIOS signing key, PSB-specific registers (<code>MP0_C2P_MSG_37</code> and <code>MP0_C2P_MSG_38</code>) are updated with the PSB status and fuse values, and the x86 cores are released.</p><p>If validation fails, the registers above are updated with the PSB error status and fuse values, and the x86 cores stay in a locked state.</p>
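<p>The validate step amounts to comparing the fused OTP values against the expected <code>serialNumber</code> encoding and publishing the outcome. A hypothetical sketch (the dict stands in for the <code>MP0_C2P_MSG_37</code>/<code>MP0_C2P_MSG_38</code> registers; the status codes are made up for illustration):</p>

```python
def validate_psb(fused_value: int, expected_value: int,
                 registers: dict) -> bool:
    """Compare the OTP-fused configuration against the expected
    serialNumber encoding and publish the outcome.

    The dict stands in for the MP0_C2P_MSG_37/MP0_C2P_MSG_38 registers;
    the status codes here are invented for illustration."""
    ok = fused_value == expected_value
    registers["psb_status"] = 0 if ok else 1   # 0 = success, nonzero = error
    registers["fuse_config"] = fused_value
    return ok  # True: release the x86 cores; False: cores stay locked
```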
    <div>
      <h2>Let’s Boot!</h2>
      <a href="#lets-boot">
        
      </a>
    </div>
    <p>With a signed image in hand, we are ready to enable PSB on a machine. We chose to deploy this on a few machines that had an updated, unsigned <a href="https://ami.com/en/support/bios-uefi-firmware-support/">AMI UEFI</a> firmware image, in this case version <code>2.16</code>. We use a couple of different firmware <a href="https://github.com/Zibri/afulnx/releases">update</a> <a href="https://downloadcenter.intel.com/download/29977?v=t">tools</a>, so, after a quick script, we ran an update to change the firmware version from <code>2.16</code> to <code>2.18C</code> (the signed image):</p>
            <pre><code>. $sudo ./UpdateAll.sh
Bin file name is ****.218C

BEGIN

+---------------------------------------------------------------------------+
|                 AMI Firmware Update Utility v5.11.03.1778                 |      
|                 Copyright (C)2018 American Megatrends Inc.                |                       
|                         All Rights Reserved.                              |
+---------------------------------------------------------------------------+
Reading flash ............... done
FFS checksums ......... ok
Check RomLayout ........ ok.
Erasing Boot Block .......... done
Updating Boot Block ......... done
Verifying Boot Block ........ done
Erasing Main Block .......... done
Updating Main Block ......... done
Verifying Main Block ........ done
Erasing NVRAM Block ......... done
Updating NVRAM Block ........ done
Verifying NVRAM Block ....... done
Erasing NCB Block ........... done
Updating NCB Block .......... done
Verifying NCB Block ......... done

Process completed.</code></pre>
            <p>After the update completed, we rebooted:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ZW4E36XqbrlpgvXyjR34t/21625b6e4691951b840d4d0d7bb91c5b/image2-6.png" />
            
            </figure><p>After a successful install, we validated that the image was correct via the <a href="https://man7.org/linux/man-pages/man5/sysfs.5.html">sysfs</a> information provided in the <a href="https://linux.die.net/man/8/dmidecode">dmidecode</a> output:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1gIi4xRxwq4PfpxAjc2nnA/6262739c2730a1c85865a59dd168ada7/image12.gif" />
            
            </figure>
    <div>
      <h3>Testing</h3>
      <a href="#testing">
        
      </a>
    </div>
    <p>With a signed image installed, we wanted to test that it worked, meaning: what if an unauthorized user installed their own firmware image? We did this by downgrading the image back to an unsigned image, <code>2.16</code>. In theory, the machine shouldn’t boot as the x86 cores should stay in a locked state. After downgrading, we rebooted and got the following:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7HQqHnKOTSHg1YQOyZq35A/2209455a149759a90c474970bb9bf6ad/image13-1.jpg" />
            
            </figure><p>This isn’t a powered down machine, but the result of booting with an unsigned image.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/y9RGO5UGzOEG1lxRyPJa2/818ecd62bc892f61097cf369d1b599a2/image9.jpg" />
            
            </figure><p>Flashing back to a signed image is done by running the same flashing utility through the BMC, so we weren’t bricked. Nonetheless, the results were successful.</p>
    <div>
      <h2>Naming Convention</h2>
      <a href="#naming-convention">
        
      </a>
    </div>
    <p>Our standard UEFI firmware image names are alphanumeric, making it difficult to distinguish by name between a signed and an unsigned image (<code>v2.16A</code> vs <code>v2.18C</code>, for example). There isn’t a remote attestation capability (yet) to probe the PSB status registers or to store these values by means of a signature (e.g. <a href="https://linux.die.net/man/8/tpm_quote_tools">TPM quote</a>). As we transitioned to PSB, we wanted to make this easier to determine by adding a specific suffix, <code>-sig</code>, that we could query in userspace. This would allow us to collect this information via <a href="https://prometheus.io/">Prometheus</a>. Changing the file name alone wouldn’t do it, so we had to make the following changes to reflect a new naming convention for signed images:</p><ul><li><p>Update filename</p></li><li><p>Update BIOS version for setup menu</p></li><li><p>Update post message</p></li><li><p>Update <a href="https://www.dmtf.org/sites/default/files/standards/documents/DSP0134_2.7.1.pdf">SMBIOS type 0</a> (BIOS version string identifier)</p></li></ul><p>Signed images now have a <code>-sig</code> suffix:</p>
            <pre><code>~$ sudo dmidecode -t0
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.
# SMBIOS implementations newer than version 3.2.0 are not
# fully supported by this version of dmidecode.

Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
	Vendor: American Megatrends Inc.
	Version: V2.20-sig
	Release Date: 09/29/2020
	Address: 0xF0000
	Runtime Size: 64 kB
	ROM Size: 16 MB</code></pre>
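            <p>Because the same version string is exposed via sysfs, a userspace check for the suffix (for example, to feed a Prometheus metric) doesn't need <code>dmidecode</code> at all. A minimal sketch; the metric name and error handling are illustrative, not our actual exporter:</p>

```python
from pathlib import Path

def bios_is_signed(dmi_path: str = "/sys/class/dmi/id/bios_version") -> bool:
    """True if the running BIOS version string carries the -sig suffix."""
    try:
        version = Path(dmi_path).read_text().strip()
    except OSError:
        return False  # sysfs attribute missing or unreadable
    return version.endswith("-sig")

def prometheus_line(version: str) -> str:
    """Render the check as a Prometheus gauge line (metric name invented)."""
    return f'bios_signed{{version="{version}"}} {int(version.endswith("-sig"))}'

print(prometheus_line("V2.20-sig"))  # → bios_signed{version="V2.20-sig"} 1
```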
            
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Finding weaknesses in firmware is a challenge that many attackers have taken on. Attacks that physically manipulate the firmware used for performing hardware initialization during the booting process can invalidate many of the common secure boot features that are considered industry standard. By implementing a hardware root of trust that is used for code signing critical boot entities, your hardware becomes a 'first line of defense' in ensuring that your server hardware and software integrity can derive trust through cryptographic means.</p>
    <div>
      <h2>What’s Next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>While this post discussed our current, AMD-based hardware platform, how will this affect our future hardware generations? One of the benefits of working with diverse vendors like AMD and <a href="https://amperecomputing.com/">Ampere</a> (ARM) is that we can ensure they are baking in our desired platform security by default (which we’ll speak about in a future post), making our hardware security outlook that much brighter.</p> ]]></content:encoded>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Encryption]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <guid isPermaLink="false">m0jUp84VV1cK5dqLgVh32</guid>
            <dc:creator>Derek Chamorro</dc:creator>
            <dc:creator>Ryan Chow</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Network expands to more than 100 Countries]]></title>
            <link>https://blog.cloudflare.com/cloudflare-network-expands-to-more-than-100-countries/</link>
            <pubDate>Wed, 15 Jul 2020 15:43:17 GMT</pubDate>
            <description><![CDATA[ We have expanded our global network to 206 cities across more than 100 countries. This is in addition to completing 40+ datacenter expansion projects and adding over 1Tbps of bandwidth capacity to our global backbone so-far in 2020.  ]]></description>
            <content:encoded><![CDATA[ <p></p><p>2020 has been a historic year that will forever be associated with the COVID-19 pandemic. Over the past six months, we have seen societies, businesses, and entire industries unsettled. The situation at Cloudflare has been no different. And while this pandemic has affected each and every one of us, we here at Cloudflare have not forgotten what our mission is: to help build a better Internet.</p><p>We have expanded our global network to 206 cities across more than 100 countries. This is in addition to completing 40+ datacenter expansion projects and adding over 1Tbps in dedicated “backbone” (transport) capacity connecting our major data centers so far this year.</p>
    <div>
      <h3>Pandemic times means new processes</h3>
      <a href="#pandemic-times-means-new-processes">
        
      </a>
    </div>
    <p>There was zero chance that 2020 would mean business as usual within the Infrastructure department. We were thrown a curve-ball as the pandemic began affecting our supply chains and operations. By April, the vast majority of the world’s passenger flights were grounded. The majority of bulk air freight ships within the lower deck (“belly”) of these flights, so their grounding created an imbalance between supply and demand: passenger belly cargo capacity suddenly dropped 74% relative to the same period last year.</p><p>We were fortunate to have existing logistics partners involved in medical equipment distribution who offered to help us maintain our important global infrastructure cargo flows. In one example, our logistics partner Expeditors, already operating with limited staff to limit the risk of exposure, went above and beyond to secure us space on “freight only” converted passenger aircraft flights from Taipei.</p>
    <div>
      <h3>Six new cities</h3>
      <a href="#six-new-cities">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2xkwAZci8njwBe2tgobq2n/340bac0f051ba37638c2b9ce96eba077/image1-7.png" />
            
            </figure><p>New cities represent more than a new dot on the map for Cloudflare. These cities represent unique partnerships with ISPs all over the world which allow Cloudflare to bring the Internet closer to an ISP’s end-users and increase our edge compute capabilities in the region.</p><p>When we enter into a partnership with an ISP it is also a commitment to that ISP, and to our customers, to increase the security, performance, and reliability of the more than 27 million Internet properties on our network. These accomplishments would not be possible if not for our partners and the individuals who make up Cloudflare’s Infrastructure and Network Operations teams.</p><p>Of the six new cities we’ve turned up this year five represent new countries for Cloudflare, including one very close to my heart.</p><ul><li><p><b>Vientiane</b>, Laos*. Our first data center in Laos, a country where accessible high-speed Internet has only been available for a few decades. In the last two decades, there has been an exponential growth of Laos Internet users <a href="https://www.internetlivestats.com/internet-users/laos/">increasing from just under 6,000 residents in 2000 to over 1,000,000 residents</a> when 4G LTE service launched in 2015.</p></li><li><p><b>Tegucigalpa</b>, Honduras*. The nation's capital, Tegucigalpa is the most populated city in Honduras. This also marks our first data center in-country. Today <a href="https://www.internetworldstats.com/central.htm">41% of the population are active Internet users</a> which is the lowest of all countries in Central America and we are excited to help bring faster and more reliable Internet to the people of Honduras.</p></li><li><p><b>Johor Bahru</b>, Malaysia. Our second data center in Malaysia is located in the <a href="https://www.cia.gov/library/publications/the-world-factbook/geos/my.html">second-largest city</a> in the country. 
Johor Bahru serves as a gateway between Singapore and Malaysia and we are proud to be a participant on the Johor Bahru Internet Exchange (JBIX), the <a href="https://www.malaysianwireless.com/2019/11/de-cix-malaysia-jbix/">second IX to launch in Malaysia</a>. Turning up this site was a challenge for our deployment partners, who went above and beyond to install and provision during the beginning stages of nationwide COVID-19 lockdowns.</p></li><li><p><b>Monrovia</b>, Liberia*. Our 15th data center on the continent of Africa shares a history with the United States. Liberia as we know it today first began as a settlement for freed slaves from the United States almost 200 years ago. Monrovia, the nation’s capital, was the largest settlement and today makes up <a href="https://www.cia.gov/library/publications/the-world-factbook/geos/print_li.html">roughly one-third of the population of Liberia</a>.</p></li><li><p><b>Bandar Seri Begawan</b>, Brunei*. Although Brunei is the second smallest country in Southeast Asia by land area (only behind Singapore), and the smallest country by population, Brunei is <a href="https://datareportal.com/reports/digital-2020-global-digital-overview">currently ranked 4th in the world</a> for the highest share of social media users coming in at 94% of the population.</p></li><li><p><b>Paramaribo</b>, Suriname*. While Suriname is one of the smallest countries in South America it is one of the most diverse countries in all of the Americas. 80% of the country is covered by tropical rainforest and <a href="https://www.paho.org/salud-en-las-americas-2017/?p=4307">70% of the nation's population resides in just two urban districts</a> which only account for 0.5% of the country's total land. This also happens to be my home country where my entire family is from, and where I was born.</p></li></ul><p>* = New Country for Cloudflare’s global network</p>
    <div>
      <h3>Cloudflare network expansion hits home for our team</h3>
      <a href="#cloudflare-network-expansion-hits-home-for-our-team">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5UIIdngk0ggbEEGwU6dueK/5430cf7529a5cc90712f98a1b0e56ebb/image2-5.png" />
            
            </figure><p>50th-percentile performance improvement versus other edge networks when Surinamese traffic started being served in-country. // Source: Cedexis</p><p>When I shared the news with my Surinamese family that Cloudflare had turned up a data center in Suriname they were extremely proud. More importantly, they are excited at the opportunity to see increased performance and reliability on the Internet. When my family immigrated to the United States more than 30 years ago, one of the hardships of establishing this new life was communicating with loved ones back home.</p><p>Even in the advent of global communication via social media, challenges still persist. Having access to reliable and secure Internet is still a luxury in many parts of the world. Suriname is no different. For many years my family relied on wireless carriers for reliable Internet connectivity. That has changed significantly in the last decade with improvements to the Internet’s infrastructure in Suriname. One big challenge is the physical distance between the Internet users in Suriname and where the content (servers) are located.</p><p>When we partner with a local ISP in-country it allows us to shorten that distance and bring the Internet closer. I am proud to work for a company that is helping deliver better access to  Internet users like my family in Suriname and millions of others around the world.</p>
    <div>
      <h3>Cloudflare is still growing, even when everyone is working from home</h3>
      <a href="#cloudflare-is-still-growing-even-when-everyone-is-working-from-home">
        
      </a>
    </div>
    <p>During these times communicating over the Internet has become essential for everyone. As more businesses and people are shifting to Internet-based operations our work is more critical than ever. We’re only halfway through the year and have a lot more work to do. This means we are continuing to grow our company by hiring and offering fully remote onboarding. <a href="https://www.cloudflare.com/careers/jobs/">Check out our careers page</a> for more information on full-time positions and internship roles across the globe.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Data Center]]></category>
            <guid isPermaLink="false">7xVp0GcsYKjTdBfbbznGQl</guid>
            <dc:creator>Kevin Dompig</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing Regional Services]]></title>
            <link>https://blog.cloudflare.com/introducing-regional-services/</link>
            <pubDate>Fri, 26 Jun 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare launches Regional Services, giving customers control over where their data is processed. ]]></description>
            <content:encoded><![CDATA[ <p>In a world where, increasingly, workloads shift to the cloud, it is often uncertain and unclear how data travels the Internet and in which countries data is processed. Today, Cloudflare is pleased to announce that we're giving our customers control. With Regional Services, we’re providing customers full control over exactly where their traffic is handled.</p><p>We operate a global network spanning more than 200 cities. Each data center runs servers with the exact same software stack. This has enabled Cloudflare to quickly and efficiently add capacity where needed. It also allows our engineers to ship features with ease: deploy once, and it's available globally.</p><p>The same benefit applies to our customers: configure once and that change is applied everywhere in seconds, regardless of whether they’re changing security features, adding a DNS record or deploying a Cloudflare Worker containing code.</p><p>Having a homogenous network is great from a routing point of view: whenever a user performs an HTTP request, the closest datacenter is found due to Cloudflare's Anycast network. BGP looks at the hops that would need to be traversed to find the closest data center. This means that someone near the Canadian border (let's say North Dakota) could easily find themselves routed to Winnipeg (inside Canada) instead of a data center in the United States. This is generally what our customers want and expect: find the fastest way to serve traffic, regardless of geographic location.</p><a href="https://cloudflare.tv/">
      </a><p>Some organizations, however, have expressed preferences for maintaining regional control over their data for a variety of reasons. For example, they may be bound by agreements with their own customers that include geographic restrictions on data flows or data processing. As a result, some customers have requested control over where their web traffic is serviced.</p><p>Regional Services gives our customers the ability to accommodate regional restrictions while still using Cloudflare’s global edge network. As of today, Enterprise customers can add Regional Services to their contracts. With Regional Services, customers can choose which subset of data centers are able to service traffic on the HTTP level. But we're not reducing network capacity to do this: that would not be the Cloudflare Way. Instead, we're allowing customers to use our entire network for <a href="https://www.cloudflare.com/ddos/">DDoS protection</a> but limiting the data centers that apply higher-level layer 7 security and performance features such as WAF, Workers, and Bot Management.</p><p>Traffic is ingested on our global Anycast network at the location closest to the client, as usual, and then passed to data centers inside the geographic region of the customer’s choice. TLS keys are only <a href="/geo-key-manager-how-it-works">stored</a> and used to actually handle traffic inside that region. This gives our customers the benefit of our huge, low-latency, high-throughput network, capable of withstanding even the <a href="/the-daily-ddos-ten-days-of-massive-attacks/">largest DDoS attacks</a>, while also giving them local control: only data centers inside a customer’s preferred geographic region will have the access necessary to apply security policies.</p><p>The diagram below shows how this process works. When users connect to Cloudflare, they hit the closest data center to them, by nature of our Anycast network. That data center detects and mitigates DDoS attacks. 
Legitimate traffic is passed through to a data center within the geographic region of the customer's choosing. Inside that data center, traffic is inspected at OSI layer 7 and HTTP products can work their magic:</p><ul><li><p>Content can be returned from and stored in cache</p></li><li><p>The WAF looks inside the HTTP payloads</p></li><li><p>Bot Management detects and blocks suspicious activity</p></li><li><p>Workers scripts run</p></li><li><p>Access policies are applied</p></li><li><p>Load Balancers look for the best origin to service traffic</p></li></ul>
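<p>One way to observe where your traffic is ingested is Cloudflare's diagnostic trace endpoint, which reports the IATA code of the serving data center in its <code>colo</code> field. The sketch below parses a captured sample of that output (the sample values, including the Winnipeg <code>YWG</code> code, are illustrative, not real measurements):</p>

```shell
# A captured example of https://www.cloudflare.com/cdn-cgi/trace output.
# In practice you would fetch it live: curl -s https://www.cloudflare.com/cdn-cgi/trace
trace='fl=123abc
h=www.cloudflare.com
ip=203.0.113.7
colo=YWG
http=http/2'

# Pull out the IATA code of the data center that served the request.
colo=$(printf '%s\n' "$trace" | awk -F= '$1 == "colo" { print $2 }')
echo "served from: $colo"   # served from: YWG
```

<p>Running the live query from near the Canadian border could well report <code>YWG</code> (Winnipeg), matching the Anycast routing example above.</p>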
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7aaFSqiVx77rXsS2N3RT1f/d574a8616e54dd8246b68ee94a09837e/image2-9.png" />
            
            </figure><p>Today's launch includes preconfigured geographic regions; we'll look to add more depending on customer demand. The US and EU regions are available immediately, meaning layer 7 (HTTP) products can be configured to be applied only within those regions and not outside of them.</p><p>The US and EU maps are depicted below. Purple dots represent data centers that apply DDoS protection and network acceleration. Orange dots represent data centers that process traffic.</p>
    <div>
      <h3>US</h3>
      <a href="#us">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/27QO1l8SD4U7w27OSYYPOp/33c4577ab859445c0f3fab1f515fbf72/image1-10.png" />
            
            </figure>
    <div>
      <h3>EU</h3>
      <a href="#eu">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/10lHcRerwTtYDamjx1u0HA/7f714e18362e0ad7a09caa8ea4447406/BDES-655-_-Slides-with-Cloudflare-PoPs-for-product-launch--1-.jpg" />
            
            </figure><p>We're very excited to provide new tools to our customers, allowing them to dictate which of our data centers employ HTTP features and which do not. If you're interested in learning more, contact <a href="mailto:sales@cloudflare.com">sales@cloudflare.com</a>.</p> ]]></content:encoded>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Europe]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[BGP]]></category>
            <category><![CDATA[Regional Services]]></category>
            <guid isPermaLink="false">6odmOeCIIEK47sVIlmcGt6</guid>
            <dc:creator>Achiel van der Mandele</dc:creator>
        </item>
        <item>
            <title><![CDATA[Securing Memory at EPYC Scale]]></title>
            <link>https://blog.cloudflare.com/securing-memory-at-epyc-scale/</link>
            <pubDate>Fri, 28 Feb 2020 14:00:00 GMT</pubDate>
            <description><![CDATA[ At Cloudflare, we encrypt data both in transit (on the network) and at rest (on the disk). Both practices address some of the most common vectors used to exfiltrate information, and these measures serve to protect sensitive data from attackers, but what about data currently in use? ]]></description>
            <content:encoded><![CDATA[ <p>Security is a serious business, one that we do not take lightly at Cloudflare. We have <a href="https://www.cloudflare.com/compliance/">invested a lot of effort</a> into ensuring that our services, both external and internal, are protected by meeting or exceeding industry best practices. Encryption is a huge part of our strategy as it is embedded in nearly every process we have. At Cloudflare, we encrypt data both in <a href="/rfc-8446-aka-tls-1-3/">transit</a> (on the network) and at rest (on the disk). Both practices address some of the most common vectors used to exfiltrate information, and these measures serve to <a href="https://www.cloudflare.com/learning/security/what-is-information-security/">protect sensitive data</a> from attackers, but what about data currently in use?</p><p>Can encryption or any technology eliminate all threats? No, but as the Infrastructure Security team, it’s our job to consider worst-case scenarios. For example, what if someone were to steal a server from one of our <a href="https://www.cloudflare.com/network/">data centers</a>? How can we leverage the most reliable, cutting-edge, innovative technology to secure all data on that host if it were in the wrong hands? Would it be protected? And, in particular, what about the server’s RAM?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/F40SGb897EzCafJNfUMxo/9346da5cc95f377fd15c8ebe6d3a7077/mission-impossible-1.png" />
            
            </figure><p>Data in random access memory (RAM) is usually stored in the clear. This can leave data vulnerable to software or hardware probing by an attacker on the system. Extracting data from memory isn’t an easy task, but with the rise of <a href="https://www.snia.org/sites/default/files/SSSI/NVDIMM%20-%20Changes%20are%20Here%20So%20What's%20Next%20-%20final.pdf">persistent memory technologies</a>, additional attack vectors are possible:</p><ul><li><p>Dynamic random-access memory (<a href="https://user.eng.umd.edu/~blj/talks/DRAM-Tutorial-isca2002.pdf">DRAM</a>) interface snooping</p></li><li><p>Installation of hardware devices that access host memory</p></li><li><p><a href="https://en.wikipedia.org/wiki/Cold_boot_attack">Freezing and stealing</a> dual in-line memory modules (<a href="https://en.wikipedia.org/wiki/DIMM">DIMMs</a>)</p></li><li><p>Stealing non-volatile dual in-line memory modules (<a href="https://en.wikipedia.org/wiki/NVDIMM">NVDIMMs</a>)</p></li></ul><p>So, what about enclaves? Hardware manufacturers have introduced <a href="https://en.wikipedia.org/wiki/Trusted_execution_environment">Trusted Execution Environments</a> (also known as enclaves) to help create security boundaries by isolating software execution at runtime so that sensitive data can be processed in a trusted environment, such as a secure area inside an existing processor or a <a href="https://en.wikipedia.org/wiki/Trusted_Platform_Module">Trusted Platform Module</a>.</p><p>While this allows developers to shield applications in untrusted environments, it doesn’t effectively address all of the physical system attacks mentioned previously. Enclaves were also meant to run small pieces of code. 
You <i>could</i> run an <a href="https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-baumann.pdf">entire OS in an enclave</a>, but there are <a href="https://blog.quarkslab.com/overview-of-intel-sgx-part-1-sgx-internals.html">limitations</a> and performance issues in doing so.</p><p>This isn’t meant to bash enclave usage; we just wanted a better solution for encrypting all memory at scale. We expected performance to be compromised, and conducted tests to see just how much.</p>
    <div>
      <h2>Time to get EPYC</h2>
      <a href="#time-to-get-epyc">
        
      </a>
    </div>
    <p>Since we are using <a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">AMD for our tenth generation "Gen X servers"</a>, we found an interesting security feature within the System on a Chip architecture of the AMD EPYC line. Secure Memory Encryption (SME) is an <a href="https://en.wikichip.org/wiki/x86">x86</a> instruction set <a href="https://en.wikichip.org/wiki/x86/extension">extension</a> introduced by AMD and available in the <a href="https://developer.amd.com/sev/">EPYC processor line.</a> SME provides the ability to mark individual pages of memory as encrypted using standard x86 page tables. A page that is marked encrypted will be automatically decrypted when read from DRAM and encrypted when written to DRAM. SME can therefore be used to protect the contents of DRAM from physical attacks on the system.</p><p>Sounds complicated, right? Here’s the secret: It isn’t.</p>
    <div>
      <h2>Components</h2>
      <a href="#components">
        
      </a>
    </div>
    <p>SME consists of two components:</p><ul><li><p><b>AES-128 encryption engine</b>: Embedded in the memory controller. It is responsible for encrypting and decrypting data in main memory when an appropriate key is provided via the Secure Processor.</p></li><li><p><b>AMD Secure Processor (AMD-SP)</b>: An on-die 32-bit <a href="https://developer.arm.com/ip-products/processors/cortex-a/cortex-a5">ARM Cortex A5 CPU</a> that provides cryptographic functionality for secure key generation and key management. Think of this like a mini hardware security module that uses a hardware random number generator to generate the 128-bit key(s) used by the encryption engine.</p></li></ul>
    <div>
      <h2>How It Works</h2>
      <a href="#how-it-works">
        
      </a>
    </div>
    <p>We had two options available to us when it came to enabling SME. The first option, regular SME, requires enabling a model specific register <code>MSR 0xC001_0010[SMEE]</code>. This enables the ability to set a page table entry encryption bit:</p><ul><li><p>0 = memory encryption features are disabled</p></li><li><p>1 = memory encryption features are enabled</p></li></ul><p>After memory encryption is enabled, a physical address bit (C-Bit) is utilized to mark if a memory page is protected. The operating system sets the bit of a physical address to 1 in the page table entry (PTE) to indicate the page should be encrypted. This causes any data assigned to that memory space to be automatically encrypted and decrypted by the AES engine in the memory controller:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Y55gr55Ypmf0kCFfSBrNN/d52fa52def521555431c16c76d9a1086/enclavng-diagrams_3x.png" />
            
            </figure>
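<p>As a toy illustration of the C-bit mechanism, the snippet below sets an assumed C-bit in an example physical address. Both values are hypothetical: the real C-bit position is CPU-specific (reported via CPUID function 0x8000001F), and 47 is used here only for the demo:</p>

```shell
# Hypothetical values for illustration only; the OS does this when it
# builds page table entries for pages it wants encrypted.
cbit=47                       # assumed C-bit position reported by the CPU
addr=$(( 0x1234567000 ))      # example physical address
encrypted=$(( addr | (1 << cbit) ))
printf 'address with C-bit set: 0x%x\n' "$encrypted"   # 0x801234567000
```

<p>Clearing the bit again would yield the original physical address; the memory controller, not software, performs the actual encryption.</p>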
    <div>
      <h3>Becoming More Transparent</h3>
      <a href="#becoming-more-transparent">
        
      </a>
    </div>
    <p>While arbitrarily flagging which page table entries we want encrypted is nice, our objective is to ensure that we are incorporating the full physical protection of SME. This is where the second mode of SME came in, Transparent SME (TSME). In TSME, all memory is encrypted regardless of the value of the encrypt bit on any particular page. This includes both instruction and data pages, as well as the pages corresponding to the page tables themselves.</p><p>Enabling TSME is as simple as:</p><ol><li><p>Setting a BIOS flag:</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6AqdL0sTjEI4YKjHdwas11/d1f28bfac2bd91a7751fe4fe2aa25109/bios-tsme2.png" />
            
            </figure><p>2. Enabling kernel support with the following flag:</p><p><code>CONFIG_AMD_MEM_ENCRYPT=y</code></p><p>After a reboot you should see the following in <code>dmesg</code>:</p>
            <pre><code>$ sudo dmesg | grep SME
[    2.537160] AMD Secure Memory Encryption (SME) active</code></pre>
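<p>If that line is absent even with kernel support compiled in, the kernel's documented <code>mem_encrypt</code> command-line parameter can activate SME explicitly (whether it is on by default depends on the kernel configuration). This is a boot-loader config fragment, e.g. appended to <code>GRUB_CMDLINE_LINUX</code>, not a shell command:</p>

```shell
# Kernel command-line fragment: activates SME when CONFIG_AMD_MEM_ENCRYPT
# is built in but not active by default.
mem_encrypt=on
```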
            
    <div>
      <h2>Performance Testing</h2>
      <a href="#performance-testing">
        
      </a>
    </div>
    <p>To weigh the pros and cons of implementation against the potential risk of a stolen server, we had to test the performance of enabling TSME. We took a test server that mirrored a production edge metal with the following specs:</p><ul><li><p>Memory: 8 x 32GB 2933MHz</p></li><li><p>CPU: AMD 2nd Gen EPYC 7642 with SMT enabled and running NPS4 mode</p></li><li><p>OS: Debian 9</p></li><li><p>Kernel: <a href="https://lwn.net/Articles/809568/">5.4.12</a></p></li></ul><p>The performance tools we used were:</p><ul><li><p><a href="http://www.cs.virginia.edu/stream/ref.html">STREAM</a></p></li><li><p><a href="http://man7.org/linux/man-pages/man8/cryptsetup.8.html">Cryptsetup</a></p></li><li><p>Benchmarky</p></li></ul>
    <div>
      <h2>Stream</h2>
      <a href="#stream">
        
      </a>
    </div>
    <p>We used a custom STREAM binary with 24 threads, using all available cores, to measure the sustainable memory bandwidth (in MB/s). Four synthetic computational kernels are run, with the output of each kernel being used as an input to the next. The best rates observed are reported for each choice of thread count.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1kwLMDcfqdWZ0dP9dgpbBO/a8ac5a90cd57a212539df64002fd1a04/Stream-SME-On-vs-Off-1.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3C8c6Nfn0cScGA0dt4yTaN/1bf1a1f32cceefac591aab63a1ab1e1c/Stream-Performance-delta-1.png" />
            
            </figure><p>The figures above show 2.6% to 4.2% performance variation, with a mean of 3.7%. These were the highest numbers measured, which fell below an expected performance impact of &gt;5%.</p>
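<p>The percentages in these charts are simple relative deltas between the SME-off and SME-on runs. With made-up bandwidth numbers (not our measurements), the calculation looks like:</p>

```shell
# Hypothetical STREAM results in MB/s; only the formula matters here.
awk 'BEGIN {
  off = 105000                # sustained bandwidth with SME off
  on  = 101000                # sustained bandwidth with SME on
  printf "performance impact: %.1f%%\n", (off - on) / off * 100
}'
```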
    <div>
      <h2>Cryptsetup</h2>
      <a href="#cryptsetup">
        
      </a>
    </div>
    <p>While cryptsetup is normally used for encrypting disk partitions, it has a benchmarking utility that will report on a host’s cryptographic performance by iterating key derivation functions using memory only:</p><p>Example:</p>
            <pre><code>$ sudo cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      1162501 iterations per second for 256-bit key
PBKDF2-sha256    1403716 iterations per second for 256-bit key
PBKDF2-sha512    1161213 iterations per second for 256-bit key
PBKDF2-ripemd160  856679 iterations per second for 256-bit key
PBKDF2-whirlpool  661979 iterations per second for 256-bit key</code></pre>
            
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5BC1HclFkbzijWahfsXLxH/67dec2664c6a31625c99613aef6b925e/cryptsetup-benchmark-SME-Off-vs-On-1.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1FoosofQ0iHlQbLAdi8fuS/950d23104a94ad5bd66f15614dd72b49/crypt-Performance-delta-1.png" />
            
            </figure>
    <div>
      <h2>Benchmarky</h2>
      <a href="#benchmarky">
        
      </a>
    </div>
    <p>Benchmarky is a homegrown tool <a href="/author/ivan/">provided by our Performance team</a> to run synthetic workloads against a specific target to evaluate performance of different components. It uses <a href="https://workers.cloudflare.com/">Cloudflare Workers</a> to send requests and read stats on responses. In addition to that, it also reports versions of all important stack components and their CPU usage. Each test runs 256 concurrent clients, grabbing a cached 10kB PNG image from a performance testing endpoint, and calculating the requests per second (RPS).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2uHw5bMNlZiKFLb3Ks5weQ/37cbcd116f472c6b20a35ede49d26f15/benckmarky-SME-off-and-SME-on-1.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3h5FsuK2Lv5xS3qUwv382A/2161213c7b0fb3715d9ab3eae66d4cc3/bench-Performance-delta-1.png" />
            
            </figure>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>In the majority of test results, performance decreased by a nominal amount, actually less than we expected. AMD’s official <a href="https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf">white paper</a> on SME even states that encryption and decryption of memory through the AES engine does incur a small amount of additional latency for DRAM memory accesses, though dependent on the workload. Across all 11 data points, performance was down by an average of only 0.699%. Even at scale, enabling this feature has reduced the worry that <i>any</i> data could be exfiltrated from a stolen server.</p><p>While we wait for other hardware manufacturers to add support for <a href="https://en.wikichip.org/wiki/x86/tme">total memory encryption,</a> we are happy that AMD has set the bar high and is protecting our next generation edge hardware.</p>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">8ZqsGZD6Kw9pakjpzsIuP</guid>
            <dc:creator>Derek Chamorro</dc:creator>
            <dc:creator>Brian Bassett</dc:creator>
        </item>
        <item>
            <title><![CDATA[Impact of Cache Locality]]></title>
            <link>https://blog.cloudflare.com/impact-of-cache-locality/</link>
            <pubDate>Wed, 26 Feb 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ Our Gen X servers process more requests per second per core than our previous fleet. The AMD 2nd Gen EPYC 7642 processor’s large L3 cache minimizes L3 cache misses, and the time-savings add up quickly. ]]></description>
            <content:encoded><![CDATA[ <p>In the past, we didn't have the opportunity to evaluate as many CPUs as we do today. The hardware ecosystem was simple – Intel had consistently delivered industry-leading processors. Other vendors could not compete with them on both performance and cost. Recently it all changed: AMD has been challenging the status quo with their 2nd Gen EPYC processors.</p><p>This is not the first time that Intel has been challenged; previously <a href="/arm-takes-wing/">there was Qualcomm</a>, and we also worked with AMD to evaluate their 1st Gen EPYC processors, based on the original Zen architecture, but ultimately <a href="/a-tour-inside-cloudflares-g9-servers/">Intel prevailed</a>. AMD did not give up and unveiled their 2nd Gen EPYC processors codenamed Rome based on the latest Zen 2 architecture.</p><blockquote><p>Playing with some new fun kit. <a href="https://twitter.com/hashtag/epyc?src=hash&amp;ref_src=twsrc%5Etfw">#epyc</a> <a href="https://t.co/1No8Cmfzwl">pic.twitter.com/1No8Cmfzwl</a></p><p>— Matthew Prince (@eastdakota) <a href="https://twitter.com/eastdakota/status/1192898039003766785?ref_src=twsrc%5Etfw">November 8, 2019</a></p></blockquote><p>Rome brings many improvements over its predecessors. These include a die shrink from 14nm to 7nm, a doubling of the top end core count from 32 to 64, and a larger L3 cache size. Let’s emphasize the size of that L3 cache again: 32 MiB per Core Complex Die (CCD).</p><p>This time around, we have taken steps to understand our workloads at the hardware level through the use of hardware performance counters and profiling tools. Using these specialized CPU registers and profilers, we collected data on the AMD 2nd Gen EPYC and Intel Skylake-based Xeon processors in a lab environment, then validated our observations in production against other generations of servers from the past.</p>
    <div>
      <h2>Simulated Environment</h2>
      <a href="#simulated-environment">
        
      </a>
    </div>
    <p>CPU Specifications</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/42MpuiHDcFmNWT8c5GOh96/3666d84667b56b645e29aec5cfafdfdd/image-3.png" />
            
            </figure><p>We evaluated several Intel Cascade Lake and AMD 2nd Gen EPYC processors, trading off various factors between power and performance; the AMD EPYC 7642 CPU came out on top. The majority of Cascade Lake processors have 1.375 MiB L3 cache per core shared across all cores, a common theme that started with Skylake. On the other hand, the 2nd Gen EPYC processors start at 4 MiB per core. The AMD EPYC 7642 is a unique SKU since it has 256 MiB of L3 cache. Having a cache this large, approximately 5.33 MiB per core, sitting right next to each core means that a program spends fewer cycles fetching data from RAM, because more of its data is readily available in the L3 cache.</p>
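<p>The per-core figure is simply the part's 256 MiB of L3 divided across its 48 cores; a quick sanity check:</p>

```shell
# 256 MiB of L3 cache shared across the 48 cores of the EPYC 7642.
awk 'BEGIN { printf "%.2f MiB of L3 per core\n", 256 / 48 }'
```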
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7FVYxpe6TeVh26fV12eeBw/a285e437f90b2ac809811bb3645b1130/Infrastructure-before_2x.png" />
            
            </figure><p><i>Before (Intel)</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/zMlvsRWnuL6SwUUPyLHBu/13d459bee2cf7d0c195bb74b18c49555/Infrastructure-after-_2x--1-.png" />
            
            </figure><p><i>After (AMD)</i></p><p>Traditional cache layout has also changed with the introduction of 2nd Gen EPYC, a byproduct of AMD using a multi-chip module (MCM) design. The 256 MiB L3 cache is spread across 8 individual Core Complex Dies (CCDs), each formed by 2 Core Complexes (CCXs), with each CCX containing 16 MiB of L3 cache.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5oqKl2iToZamCbBNQBH2uj/ba9b72846f7f81c92e0ccf705f3be9c1/image6.png" />
            
            </figure><p><i>Core Complex (CCX) - Up to four cores</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6p5FRRU9VZTMWhkdNH7g4/acb4ff79d8916a20278f8c50e21c53bb/image9.png" />
            
            </figure><p><b>Core Complex Die (CCD) - Created by combining two CCXs</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2IyMxh74THzCubUs6H3SY1/a031494b9f0783406ef653b3e99ad598/image1-2.png" />
            
            </figure><p><i>AMD 2nd Gen EPYC 7642 - Created with 8x CCDs plus an I/O die in the center</i></p>
    <div>
      <h3>Methodology</h3>
      <a href="#methodology">
        
      </a>
    </div>
    <p>Our production traffic shares many characteristics of a sustained workload, one that typically does not induce large variations in operating frequency or periods of idle time. We picked a simulated traffic pattern that closely resembled our production traffic behavior: serving a cached 10 KiB PNG via HTTPS. We were interested in assessing the CPU’s maximum throughput in requests per second (RPS), one of our key metrics. Accordingly, we did not disable Intel Turbo Boost or AMD Precision Boost, nor did we match frequencies clock-for-clock, while measuring requests per second, instructions retired per second (IPS), L3 cache miss rate, and sustained operating frequency.</p>
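<p>The L3 miss rate itself is a ratio of two hardware counters. The counts below are invented for illustration; on Linux the raw values can come from something like <code>perf stat -e LLC-loads,LLC-load-misses</code> run against the workload:</p>

```shell
# Hypothetical counter readings; the ratio is the L3 (last-level) miss rate.
llc_loads=2000000
llc_load_misses=150000
awk -v l="$llc_loads" -v m="$llc_load_misses" \
    'BEGIN { printf "L3 miss rate: %.1f%%\n", m / l * 100 }'
```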
    <div>
      <h3>Results</h3>
      <a href="#results">
        
      </a>
    </div>
    <p>The 1P AMD 2nd Gen EPYC 7642 powered server took the lead and processed 50% more requests per second compared to our Gen 9’s 2P Intel Xeon Platinum 6162 server.</p><p>We are running a sustained workload, so we should end up with a sustained operating frequency that is higher than base clock. The AMD EPYC 7642’s sustained operating frequency, and therefore the number of cycles the processor had at its disposal, was approximately 20% greater than the Intel Xeon Platinum 6162’s, so frequency alone was not enough to explain the 50% gain in requests per second.</p><p>Taking a closer look, the number of instructions retired over time was far greater on the AMD 2nd Gen EPYC 7642 server, thanks to its low L3 cache miss rate.</p>
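<p>To make that gap concrete: if frequency contributes a factor of 1.20 but throughput rose by a factor of 1.50, the remainder must come from retiring more instructions per cycle. A rough decomposition using the figures from the text:</p>

```shell
# A 50% RPS gain at 20% higher sustained frequency implies the rest of the
# speedup came from IPC (instructions per cycle).
awk 'BEGIN { printf "implied IPC gain: %.0f%%\n", (1.50 / 1.20 - 1) * 100 }'
```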
    <div>
      <h3>Production Environment</h3>
      <a href="#production-environment">
        
      </a>
    </div>
    <p>CPU Specifications</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7lji8wRNR827IKHgDE3asr/3180fa701066cee223dde822b313d8ec/image-2.png" />
            
            </figure>
    <div>
      <h3>Methodology</h3>
      <a href="#methodology">
        
      </a>
    </div>
    <p>Our predominant bottleneck appears to be cache memory, and we saw significant improvement in requests per second as well as <a href="/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/">time to process a request</a> thanks to the low L3 cache miss rate. The data we present in this section was collected at a point of presence that spanned Gen 7 to Gen 9 servers. We also collected data from a secondary region to gain additional confidence that the results were not unique to one particular environment. Gen 9 is the baseline, just as in the previous section.</p><p>We put the 2nd Gen EPYC-based Gen X into production hoping that the results would closely mirror what we had previously seen in the lab. We found that the requests per second did not quite align with the results we had hoped for, but the AMD EPYC server still outperformed all previous generations, beating the Intel Gen 9 server by 36%.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4V8uzcFfnFwUIH0leecLnm/232bd88ceb2216efa4fae242a085a140/image8.png" />
            
            </figure><p>Sustained operating frequency was nearly identical to what we have seen back in the lab.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3DhckmjQqo7b9JtllouvVL/19b3fda2e8a5a277c511269f4bc7a148/image10.png" />
            
            </figure><p>Due to the lower than expected requests per second, we also saw lower instructions retired over time and higher L3 cache miss rate but maintained a lead over Gen 9, with 29% better performance.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The single AMD EPYC 7642 performed very well during our lab testing, beating our Gen 9 server’s dual Intel Xeon Platinum 6162 with the same total number of cores. Key factors we noticed were its large L3 cache, which led to a low L3 cache miss rate, as well as a higher sustained operating frequency. The AMD 2nd Gen EPYC 7642 did not have as big of an advantage in production, but nevertheless still outperformed all previous generations. The observation we made in production was based on a PoP that could have been influenced by a number of other factors such as, but not limited to, ambient temperature, timing, and other new products that will shape our traffic patterns in the future such as <a href="/webassembly-on-cloudflare-workers/">WebAssembly on Cloudflare Workers</a>. The AMD EPYC 7642 opens up the possibility for our upcoming Gen X server to maintain the same core count while processing more requests per second than its predecessor.</p><p>Got a passion for hardware? I think we should get in touch. We are always looking for talented and curious individuals to <a href="https://www.cloudflare.com/careers/departments/">join our team</a>. The data presented here would not have been possible were it not for the teamwork between many different individuals within Cloudflare. As a team, we strive to work together to create highly performant, reliable, and secure systems that will form the pillars of our rapidly growing <a href="https://www.cloudflare.com/network/">network</a> that spans 200 cities in more than 90 countries, and we are just getting started.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">2QRJzUFnyzSRaIXLggtDqr</guid>
            <dc:creator>Sung Park</dc:creator>
        </item>
        <item>
            <title><![CDATA[An EPYC trip to Rome: AMD is Cloudflare's 10th-generation Edge server CPU]]></title>
            <link>https://blog.cloudflare.com/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/</link>
            <pubDate>Tue, 25 Feb 2020 14:00:00 GMT</pubDate>
            <description><![CDATA[ As we looked at production ready systems to power our Gen X solution, we’re moving on from Gen 9's 48-core setup of dual socket Intel Xeon Platinum 6162's to a 48-core single socket AMD EPYC 7642. ]]></description>
            <content:encoded><![CDATA[ <p>More than 1 billion unique IP addresses pass through the <a href="/tag/cloudflare-network/">Cloudflare Network</a> each day. We serve on average 11 million HTTP requests per second and operate within 100ms of 95% of the Internet-connected population globally. Our network spans 200 cities in more than 90 countries, and our engineering teams have built an extremely <a href="/tag/speed-and-reliability/">fast and reliable infrastructure</a>.</p><p>We’re extremely proud of our work and are determined to help make the Internet a <a href="https://www.fastcompany.com/most-innovative-companies/2019/sectors/security">better and more secure place</a>. Cloudflare engineers who are involved with hardware dig into servers and their components to understand and select the best hardware to maximize the performance of our stack.</p><p>Our software stack is compute intensive and very much CPU bound, which drives our engineers to continuously optimize Cloudflare’s performance and reliability at all layers of our stack. On the server side, a straightforward way to increase computing power is to add more CPU cores: the more cores we can include in a server, the more output we can expect. This is important for us since the diversity of our products and customers has grown over time, with increasing demand that requires our servers to do more. To drive compute performance, we needed to increase core density, and that's what we did. 
Below are the processor details for the servers we’ve deployed since 2015, including core counts:</p><table><tr><td><p>---</p></td><td><p><b>Gen 6</b></p></td><td><p><b>Gen 7</b></p></td><td><p><b>Gen 8</b></p></td><td><p><b>Gen 9</b></p></td></tr><tr><td><p>Start of service</p></td><td><p>2015</p></td><td><p>2016</p></td><td><p>2017</p></td><td><p>2018</p></td></tr><tr><td><p>CPU</p></td><td><p>Intel Xeon E5-2630 v3</p></td><td><p>Intel Xeon E5-2630 v4</p></td><td><p>Intel Xeon Silver 4116</p></td><td><p>Intel Xeon Platinum 6162</p></td></tr><tr><td><p>Physical Cores</p></td><td><p>2 x 8</p></td><td><p>2 x 10</p></td><td><p>2 x 12</p></td><td><p>2 x 24</p></td></tr><tr><td><p>TDP</p></td><td><p>2 x 85W</p></td><td><p>2 x 85W</p></td><td><p>2 x 85W</p></td><td><p>2 x 150W</p></td></tr><tr><td><p>TDP per Core</p></td><td><p>10.65W</p></td><td><p>8.50W</p></td><td><p>7.08W</p></td><td><p>6.25W</p></td></tr></table><p>In 2018, we made a big jump in total number of cores per server with <a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9</a>. Our physical footprint was reduced by 33% compared to Gen 8, giving us increased capacity and computing power per rack. Thermal Design Power (TDP, aka typical power usage) is listed above to highlight that we've also become more power efficient over time. Power efficiency is important to us: first, because we'd like to be as <a href="/a-carbon-neutral-north-america/">carbon friendly</a> as we can; and second, so we can better utilize the provisioned power supplied by the data centers. But we know we can do better.</p><p>Our main defining metric is Requests per Watt. We can increase our Requests per Second number with more cores, but we have to stay within our power budget envelope. We are constrained by the data centers’ power infrastructure which, along with our selected power distribution units, leads to a power cap for each server rack. 
Adding servers to a rack increases power consumption at the rack level, and our operational costs increase significantly if we exceed a rack’s power cap and have to provision another rack. What we need is more compute power inside the same power envelope, which will drive a higher (better) Requests per Watt number – our key metric.</p><p>As you might imagine, we look at power consumption carefully in the design stage. From the above you can see that it’s not worth our time to deploy more power-hungry CPUs whose TDP per Core is higher than our current generation’s, since that would hurt our Requests per Watt metric. As we started looking at production-ready systems to power our Gen X solution, we took a long look at what is available in the market today, and we’ve made our decision. We’re moving on from Gen 9's 48-core setup of dual socket <a href="https://ark.intel.com/content/www/us/en/ark/products/136869/intel-xeon-platinum-6162-processor-33m-cache-1-90-ghz.html">Intel® Xeon® Platinum 6162</a>'s to a 48-core single socket <a href="https://www.amd.com/en/products/cpu/amd-epyc-7642">AMD EPYC™ 7642</a>.</p>
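<p>To make the Requests per Watt constraint concrete, here is a small sketch in Go with purely hypothetical numbers; the actual rack budgets and per-server figures are not published in this post:</p>

```go
package main

import "fmt"

// requestsPerWatt returns how many requests per second we get for each
// watt of rack power consumed -- the key metric discussed above.
func requestsPerWatt(rackRPS, rackWatts float64) float64 {
	return rackRPS / rackWatts
}

func main() {
	// Hypothetical figures for illustration only.
	rackCapWatts := 8000.0 // power cap set by the data center's PDUs
	serverWatts := 400.0   // measured draw of one server at peak
	rpsPerServer := 9000.0 // requests per second one server sustains

	// How many servers fit inside the rack's power envelope.
	servers := int(rackCapWatts / serverWatts)
	rackRPS := float64(servers) * rpsPerServer
	usedWatts := float64(servers) * serverWatts

	fmt.Printf("%d servers, %.0f RPS, %.2f req/W\n",
		servers, rackRPS, requestsPerWatt(rackRPS, usedWatts))
}
```

<p>A CPU that serves more requests at the same draw raises the req/W figure without touching the rack cap, which is why it is the deciding metric here.</p>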
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Ub4NDKC4KKIMsXY2cFPJW/2bb5250e410aa4c75cad9651ab1b84f7/1.jpg" />
            
            </figure><p>Gen X server setup with single socket 48-core AMD EPYC 7642.</p><table><tr><td><p><b>---</b></p></td><td><p><b>Intel</b></p></td><td><p><b>AMD</b></p></td></tr><tr><td><p>CPU</p></td><td><p>Xeon Platinum 6162</p></td><td><p>EPYC 7642</p></td></tr><tr><td><p>Microarchitecture</p></td><td><p>“Skylake”</p></td><td><p>“Zen 2”</p></td></tr><tr><td><p>Codename</p></td><td><p>“Skylake SP”</p></td><td><p>“Rome”</p></td></tr><tr><td><p>Process</p></td><td><p>14nm</p></td><td><p>7nm</p></td></tr><tr><td><p>Physical Cores</p></td><td><p>2 x 24</p></td><td><p>48</p></td></tr><tr><td><p>Frequency</p></td><td><p>1.9 GHz</p></td><td><p>2.4 GHz</p></td></tr><tr><td><p>L3 Cache / socket</p></td><td><p>24 x 1.375MiB</p></td><td><p>16 x 16MiB</p></td></tr><tr><td><p>Memory / socket</p></td><td><p>6 channels, up to DDR4-2400</p></td><td><p>8 channels, up to DDR4-3200</p></td></tr><tr><td><p>TDP</p></td><td><p>2 x 150W</p></td><td><p>225W</p></td></tr><tr><td><p>PCIe / socket</p></td><td><p>48 lanes</p></td><td><p>128 lanes</p></td></tr><tr><td><p>ISA</p></td><td><p>x86-64</p></td><td><p>x86-64</p></td></tr></table><p>From the specs, we see that the AMD chip lets us keep the same number of cores at a lower TDP. Gen 9's TDP per Core was 6.25W; Gen X's will be 4.69W, a 25% decrease. With a higher frequency, and perhaps a simpler single-socket setup, we can speculate that the AMD chip will perform better. We'll walk through a series of tests, simulations, and live production results in the rest of this blog to see how much better AMD performs.</p><p>As a side note before we go further: TDP is a simplified metric from the manufacturers’ datasheets that we use in the early stages of our server design and CPU selection process. A quick search suggests that AMD and Intel define TDP differently, which makes the spec unreliable for cross-vendor comparison. Actual CPU power draw, and more importantly server system power draw, are what we really factor into our final decisions.</p>
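<p>The 25% decrease quoted above can be checked directly from the rated TDPs and core counts:</p>

```go
package main

import "fmt"

// tdpPerCore divides a configuration's rated TDP across its cores.
func tdpPerCore(tdpWatts, cores float64) float64 {
	return tdpWatts / cores
}

func main() {
	gen9 := tdpPerCore(2*150, 48) // dual Intel Xeon Platinum 6162
	genX := tdpPerCore(225, 48)   // single AMD EPYC 7642

	decrease := (1 - genX/gen9) * 100
	fmt.Printf("Gen 9: %.2f W/core, Gen X: %.2f W/core (%.0f%% lower)\n",
		gen9, genX, decrease)
	// → Gen 9: 6.25 W/core, Gen X: 4.69 W/core (25% lower)
}
```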
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ue5oOlAdewYkerMSsEPWN/5cf921da1bc31eccd575af6670af66d3/2.png" />
            
            </figure>
    <div>
      <h3>Ecosystem Readiness</h3>
      <a href="#ecosystem-readiness">
        
      </a>
    </div>
    <p>At the beginning of our journey to choose our next CPU, we gathered a variety of processors from different vendors that could fit well with our software stack and services, which are written in C, LuaJIT, and Go. More details about benchmarking for our stack were explained when we benchmarked <a href="/arm-takes-wing/">Qualcomm's ARM® chip in the past</a>. We run the same suite of tests as Vlad's blog this time around, since it is a quick and easy "sniff test". This allows us to test a bunch of CPUs within a manageable time period before we commit more engineering effort to applying our full software stack.</p><p>We tried a variety of CPUs with different numbers of cores, sockets, and frequencies. Since we're explaining how we chose the AMD EPYC 7642, all the graphs in this blog focus on how AMD compares with our <a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9's Intel Xeon Platinum 6162 CPU</a> as a baseline.</p><p>Our results correspond to a full server node for both CPUs tested, meaning the numbers pertain to 2x 24-core processors for Intel, and 1x 48-core processor for AMD – a two socket Intel based server and a one socket AMD EPYC powered server. Before we started our testing, we changed the Cloudflare lab test servers’ BIOS settings to match our production server settings. This yielded average CPU frequencies of 3.03 GHz for AMD and 2.50 GHz for Intel, with very little variation. With gross simplification, we expect that with the same number of cores AMD would perform about 21% better than Intel. Let’s start with our crypto tests.</p>
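<p>The ~21% back-of-envelope expectation comes straight from the measured frequency ratio, assuming equal core counts and equal work done per clock:</p>

```go
package main

import "fmt"

// frequencyGain estimates the first-order speedup from clock speed alone,
// assuming the same core count and instructions per cycle.
func frequencyGain(testGHz, baseGHz float64) float64 {
	return (testGHz/baseGHz - 1) * 100
}

func main() {
	// Average sustained frequencies measured on the lab servers.
	fmt.Printf("expected AMD advantage: ~%.0f%%\n", frequencyGain(3.03, 2.50))
	// → expected AMD advantage: ~21%
}
```

<p>This is of course a gross simplification: cache sizes, memory bandwidth, and microarchitecture all move the real numbers, which is exactly what the benchmarks below measure.</p>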
    <div>
      <h3>Cryptography</h3>
      <a href="#cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3AZYQDj2nhVofkFNuK7fhb/dea112fd9ae2b2e91514b62e21d242d6/3.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2OpcfNqi2hraQrS8fZV0Lw/29fcc9dcf3ec73400a8fe784a12b26f5/4.png" />
            
            </figure><p>Looking promising for AMD. In public key cryptography, it does 18% better. Meanwhile, for symmetric key, AMD loses on AES-128-GCM, but it’s comparable overall.</p>
    <div>
      <h3>Compression</h3>
      <a href="#compression">
        
      </a>
    </div>
    <p>We do a lot of compression at the edge to save bandwidth and help deliver content faster. We test both the zlib and brotli libraries, written in C. All tests are done on the <a href="/">blog.cloudflare.com</a> HTML file in memory.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3IVKryfCIXY2c6vXt8KsPc/eab9765ff8650fc9fbe97a82cf89f0cf/5.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ycU6O5cbLREXNJV29sPDI/4472bc4ae36cfedae67535d08e5e10f8/6.png" />
            
            </figure><p>AMD wins by an average of 29% using gzip across all qualities. It does even better with brotli at qualities lower than 7, which we use for dynamic compression. There’s a throughput cliff starting at brotli-9; <a href="/arm-takes-wing/">Vlad’s explanation</a> is that Brotli consumes lots of memory and thrashes the cache. Nevertheless, AMD wins by a healthy margin.</p><p>A lot of our services are written in Go. In the following graphs, we’re redoing the crypto and compression tests in Go, along with RegExp on 32KB strings and the strings library.</p>
    <div>
      <h3>Go Cryptography</h3>
      <a href="#go-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1JRYfDXEC61iGL6Ds7LjL1/16b8980b2b2cf5bdda22769d0dad5fa6/7.png" />
            
            </figure>
    <div>
      <h3>Go Compression</h3>
      <a href="#go-compression">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5U6GEnTLU1VVvqbZtJZ1df/0e3a0fa77bbb3ce709e36503e9eaba66/Screen-Shot-2020-02-24-at-14.37.27.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3YQTrE7SfEiCXjivU7NXXf/c5b22a8a38b2eb48c0658d4b9b3a1f63/9.png" />
            
            </figure>
    <div>
      <h3>Go Regexp</h3>
      <a href="#go-regexp">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ZQn6rtg7fE6IJ3wNgpCf/9886faae1855b658347fccfe72802a0a/10.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/vvUuNzVuRrgGqVrGOxagJ/720667e78cb5b34eda7f4ca0d445bd9f/11.png" />
            
            </figure>
    <div>
      <h3>Go Strings</h3>
      <a href="#go-strings">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3BY6NTdgjUWGiDBwC0NMNv/8943902a6f047eb8d2f9079e9b1e432b/12.png" />
            
            </figure><p>AMD performs better in all of our Go benchmarks except ECDSA P256 Sign, where it loses by 38%; this is peculiar, since it does 24% better in the equivalent C test. It’s worth investigating what’s going on here. Elsewhere, AMD doesn’t win by as wide a margin, but it still proves to be better.</p>
    <div>
      <h3>LuaJIT</h3>
      <a href="#luajit">
        
      </a>
    </div>
    <p>We rely a lot on LuaJIT in our stack. As <a href="/arm-takes-wing/">Vlad said</a>, it’s the glue that holds Cloudflare together. We’re glad to show that AMD wins here as well.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4VM41wrmwYIpaJSvR70ztl/95f5f48fad6f48034422ea69b62c73b6/13.png" />
            
            </figure><p>Overall, our tests show a single EPYC 7642 to be more than competitive with two Xeon Platinum 6162s. While there are a couple of tests where AMD loses out, such as OpenSSL AES-128-GCM and Go ECDSA-P256 Sign, AMD wins in all the others. Treating all tests equally, AMD does on average 25% better than Intel.</p>
    <div>
      <h3>Performance Simulations</h3>
      <a href="#performance-simulations">
        
      </a>
    </div>
    <p>After our ‘sniff’ tests, we put our servers through another series of emulations which apply synthetic workloads simulating our edge software stack. Here we simulate scenarios with the different types of requests we see in production. Request types vary by asset size, whether they go over HTTP or HTTPS, whether they pass through the <a href="https://www.cloudflare.com/learning/ddos/glossary/web-application-firewall-waf/">WAF</a> or Workers, and many additional variables. The graph below shows the throughput comparison between the two CPUs for the request types we see most typically.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Rj1NqqXfYroVsefnGQQgS/c598bbf7b324ab6944236e91d5f5e28a/14.png" />
            
            </figure><p>The results above are ratios using Gen 9's Intel CPUs as the baseline, normalized at 1.0 on the X-axis. For example, looking at simple requests of 10KiB assets over HTTPS, we see that AMD does 1.50x better than Intel in Requests per Second. On average across the tests shown in the graph above, AMD performs 34% better than Intel. Considering that the TDP for the single AMD EPYC 7642 is 225W, compared to 300W for the two Intel CPUs, we're looking at AMD delivering up to 2.0x better Requests per Watt!</p><p>By this time, we were already leaning heavily toward a single socket setup with the AMD EPYC 7642 as our CPU for Gen X. We were excited to see exactly how well AMD EPYC servers would do in production, so we immediately shipped a number of the servers out to some of our data centers.</p>
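<p>The "up to 2.0x" headline follows from the best-case throughput ratio and the rated TDPs; a quick check:</p>

```go
package main

import "fmt"

// reqPerWattRatio compares two configurations by throughput and rated power.
func reqPerWattRatio(rpsRatio, basePowerW, testPowerW float64) float64 {
	return rpsRatio * (basePowerW / testPowerW)
}

func main() {
	// 1.50x is the best-case simulation result (10KiB assets over HTTPS);
	// 300W is the dual Xeon Platinum 6162, 225W the single EPYC 7642.
	ratio := reqPerWattRatio(1.50, 2*150, 225)
	fmt.Printf("Requests per Watt advantage: %.1fx\n", ratio)
	// → Requests per Watt advantage: 2.0x
}
```

<p>Using the 34% average instead of the 1.50x best case gives roughly 1.8x, which is why the post says "up to".</p>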
    <div>
      <h3>Live Production</h3>
      <a href="#live-production">
        
      </a>
    </div>
    <p>Step one, of course, was to get all our test servers set up for a production environment. All of our machines in the fleet are loaded with the same processes and services, which makes for a great apples-to-apples comparison. Like data centers everywhere, we have multiple generations of servers deployed, and we deploy our servers in clusters such that each cluster is fairly homogeneous by server generation. In some environments this can lead to varying utilization curves between clusters. This is not the case for us. Our engineers have optimized CPU utilization across all server generations so that whether a machine's CPU has 8 cores or 24, CPU usage is generally the same.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2CiOx8dprb9NuYH2kUrZQ/ef686808c946d6cc55aa2983dc50c839/15.png" />
            
            </figure><p>As you can see above, and to illustrate our ‘similar CPU utilization’ comment, there is no significant difference in CPU usage between Gen X AMD-powered servers and Gen 9 Intel-based servers. Both test and baseline servers are equally loaded, which is exactly what we want for a fair comparison. The two graphs below show the comparative number of requests processed at the single-core and all-core (whole-server) level.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/18XOFPOFmKWcIuheu6HWl8/e84fa7ff38c6c3f39115b60882a7c106/16.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/rwWBZrTmfJvIs7Em9aCXT/19945cb240e37ca3ba61853b3e72bf5b/17.png" />
            
            </figure><p>We see that AMD serves on average about 23% more requests. That's really good! We talked a lot about bringing more muscle in the <a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9 blog</a>. We have the same number of cores, yet AMD does more work, and does it with less power. The core count and TDP specs hinted at this at the outset, and it's really nice to see AMD deliver significantly more performance with better power efficiency.</p><p>But as we mentioned earlier, TDP isn’t a standardized spec across manufacturers, so let’s look at real power usage. Measuring server power consumption along with requests per second (RPS) yields the graph below:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3DnPzngVCU8da71wGJUleo/b132d4256abe8c60809cf7b2c6d27ea4/18.png" />
            
            </figure><p>Observing our servers’ request rate over their power consumption, the AMD Gen X server performs 28% better. While we could have expected more out of AMD since its TDP is 25% lower, keep in mind that TDP is very ambiguous. In fact, we saw that AMD’s actual power draw ran nearly at its spec TDP thanks to a sustained frequency well above base; Intel’s ran far below its spec. This is another reason why TDP is becoming a less reliable estimate of power draw. Moreover, the CPU is just one component contributing to the overall power of the system. Remember also that the Intel CPUs are integrated into a multi-node system as described in the <a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9 blog</a>, while the AMD CPU is in a regular 1U form-factor machine. That actually doesn’t favor AMD, since multi-node systems are designed for high density at lower power per node, yet it still outperformed the Intel system on a power-per-node basis.</p><p>Through the majority of comparisons from the datasheets, test simulations, and live production performance, the 1P AMD EPYC 7642 configuration performed significantly better than the 2P Intel Xeon 6162. We’ve seen in some environments that AMD can do up to 36% better in live production, and we believe we can achieve that consistently with some optimization on both our hardware and software.</p><p>So that's it. AMD wins.</p><p>The additional graphs below show the median and p99 NGINX processing latencies (mostly on-CPU time) between the two CPUs throughout 24 hours. On average, AMD processes about 25% faster. At p99, it does about 20-50% better depending on the time of day.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Fty1XuK9w4yupvRyHzLmu/0f1e8bc02e8f04d0c33752f6ff67fd98/19.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3yWj3MsDnro619KjtJwVAW/0868d32c8aa3f5c594b8e660e608eb6c/20.png" />
            
            </figure>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Hardware and Performance engineers at Cloudflare do significant research and testing to figure out the best server configuration for our customers. Solving big problems like this is why <a href="https://www.fastcompany.com/best-workplaces-for-innovators/2019">we love working here</a>, and we're also helping to solve yours with services like <a href="/introducing-cloudflare-workers/">serverless</a> edge compute and an array of security solutions such as <a href="/magic-transit-network-functions/">Magic Transit</a>, <a href="/argo-tunnel/">Argo Tunnel</a>, and <a href="/unmetered-mitigation/">DDoS protection</a>. All of the servers on the Cloudflare Network are designed to make our products work reliably, and we strive to make each new generation of our server design better than its predecessor. We believe the AMD EPYC 7642 is the answer to our Gen X processor question.</p><p>With <a href="https://developers.cloudflare.com/workers/">Cloudflare Workers</a>, developers have enjoyed deploying their applications to our <a href="/tag/cloudflare-network/">Network</a>, which is ever expanding across the globe. We’ve been proud to empower our customers by letting them focus on writing their code while we manage security and reliability in the cloud. We are now even more excited to say that their work will be deployed on our Gen X servers powered by 2nd Gen AMD EPYC processors.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6yI4yio6UovXiaJi79a78l/d248e04fd2b5c98c68b7312af953c54f/21.jpg" />
            
            </figure><p>Expanding Rome to a data center near you</p><p>Thanks to AMD, the EPYC 7642 allows us to increase our capacity and expand into more cities more easily. Rome wasn’t built in a day, but it will be <a href="/scaling-the-cloudflare-global/">very close to many of you</a>.</p><p>In the last couple of years, we've been experimenting with many Intel and AMD x86 chips along with ARM CPUs. We look forward to having these CPU manufacturers partner with us for future generations so that together we can help build a better Internet.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">7dSLKQwG5XrggG5HDcZm1x</guid>
            <dc:creator>Rob Dinh</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare’s Gen X: Servers for an Accelerated Future]]></title>
            <link>https://blog.cloudflare.com/cloudflares-gen-x-servers-for-an-accelerated-future/</link>
            <pubDate>Mon, 24 Feb 2020 13:00:00 GMT</pubDate>
            <description><![CDATA[ We designed and built Cloudflare’s network to be able to grow capacity quickly and inexpensively; to allow every server, in every city, to run every service; and to allow us to shift customers and traffic across our network efficiently. ]]></description>
            <content:encoded><![CDATA[ <p></p><blockquote><p><i>“Every server can run every service.”</i></p></blockquote><p>We designed and built Cloudflare’s network to be able to grow capacity quickly and inexpensively; to allow every server, in every city, to run every service; and to allow us to shift customers and traffic across our network efficiently. We deploy standard, commodity hardware, and our product developers and customers do not need to worry about the underlying servers. Our software automatically manages the deployment and execution of our developers’ code and our customers’ code across our network. Since we manage the execution and prioritization of code running across our network, we are both able to optimize the performance of our highest tier customers and effectively leverage idle capacity across our network.</p><p>An alternative approach might have been to run several fragmented networks with specialized servers designed to run specific features, such as the <a href="https://www.cloudflare.com/waf/">Firewall</a>, <a href="https://www.cloudflare.com/ddos/">DDoS protection</a> or <a href="https://workers.cloudflare.com/">Workers</a>. However, we believe that approach would have resulted in wasted idle resources and given us less flexibility to build new software or adopt the newest available hardware. And a single optimization target means we can provide security and performance at the same time.</p><p>We use Anycast to route a web request to the nearest Cloudflare data center (from among <a href="https://www.cloudflare.com/network/">200 cities</a>), improving performance and maximizing the surface area to fight attacks.</p><p>Once a datacenter is selected, we use Unimog, Cloudflare’s custom load balancing system, to dynamically balance requests across diverse generations of servers. 
We load balance at different layers: between cities, between physical deployments located across a city, between external Internet ports, between internal cables, between servers, and even between logical CPU threads within a server.</p><p>As demand grows, we can scale out by simply adding new servers, points of presence (PoPs), or cities to the global pool of available resources. If any server component has a hardware failure, it is gracefully de-prioritized or removed from the pool, to be batch repaired by our operations team. This architecture has enabled us to have no dedicated Cloudflare staff at any of the 200 cities, instead relying on help for infrequent physical tasks from the ISPs (or data centers) hosting our equipment.</p>
    <div>
      <h3>Gen X: Intel Not Inside</h3>
      <a href="#gen-x-intel-not-inside">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2cA2rKV0dU3II6avmDY7BS/fba1453ea9bc827e399f57971ce871af/AMD_EPYC-39.jpg" />
            
            </figure><p>We recently turned up our tenth generation of servers, “Gen X”, already deployed across major US cities, and in the process of being shipped worldwide. Compared with our prior server (<a href="/a-tour-inside-cloudflares-g9-servers/">Gen 9</a>), it processes as much as 36% more requests while costing substantially less. Additionally, it enables a ~50% decrease in L3 cache miss rate and up to 50% decrease in NGINX p99 latency, powered by a CPU rated at 25% lower TDP (thermal design power) per core.</p><p>Notably, for the first time, Intel is not inside. We are not using their hardware for any major server components such as the CPU, board, memory, storage, network interface card (or any type of accelerator). Given how critical Intel is to our industry, this would until recently have been unimaginable, and is in contrast with <a href="/a-tour-inside-cloudflares-g9-servers/">prior</a> <a href="/a-tour-inside-cloudflares-latest-generation-servers/">generations</a> which made extensive use of their hardware.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3uCBtfJ5jawcCdsdyipghO/0d99e66b8db30d4c872d2a540cf9072c/AMD2.png" />
            
            </figure><p><i>Intel-based Gen 9 server</i></p><p>This time, AMD is inside.</p><p>We were particularly impressed by the 2nd Gen AMD EPYC processors because they proved to be far more efficient for our customers’ workloads. Since the pendulum of technology leadership swings back and forth between providers, we wouldn’t be surprised if that changes over time. However, we were happy to adapt quickly to the components that made the most sense for us.</p>
    <div>
      <h3>Compute</h3>
      <a href="#compute">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3qvmFWlrsejJpb9NEG6ZqU/d830111b48ee0ed930132e88660063ba/pasted-image-0.png" />
            
            </figure><p>CPU efficiency is very important to our server design. Since we have a compute-heavy workload, our servers are typically limited by the CPU before other components. Cloudflare’s software stack scales quite well with additional cores. So, we care more about core count and power efficiency than dimensions such as clock speed.</p><p>We selected the AMD EPYC 7642 processor in a single-socket configuration for Gen X. This CPU has 48 cores (96 threads), a base clock speed of 2.4 GHz, and an L3 cache of 256 MB. While the rated power (225W) may seem high, it is lower than the combined TDP in our Gen 9 servers, and we preferred the performance of this CPU over lower-power variants. Although AMD offers a higher core count option with 64 cores, the performance gains for our software stack and usage weren’t compelling enough.</p><p>We have deployed the AMD EPYC 7642 in half a dozen Cloudflare data centers; it is considerably more powerful than the dual-socket pair of <a href="/a-tour-inside-cloudflares-g9-servers/">high-core count Intel processors</a> (Skylake as well as Cascade Lake) we used in the last generation.</p><p>Readers of our blog might remember our excitement around ARM processors. We even ported the entirety of our software stack to run on ARM, just as it does on x86, and have been maintaining it ever since, even though it calls for slightly more work from our software engineering teams. We did this leading up to the launch of <a href="/arm-takes-wing/">Qualcomm’s Centriq server CPU</a>, which eventually got shuttered. 
While none of the off-the-shelf ARM CPUs available at the moment are interesting to us, we remain optimistic about high core count offerings launching in 2020 and beyond, and look forward to a day when our servers are a mix of x86 (Intel and AMD) and ARM.</p><p>We aim to replace servers when the efficiency gains enabled by new equipment outweigh their cost.</p><p>The performance we’ve seen from the AMD EPYC 7642 processor has encouraged us to accelerate replacement of multiple generations of Intel-based servers.</p><p>Compute is our largest investment in a server. Our heaviest workloads, from the Firewall to <a href="/cloud-computing-without-containers/">Workers</a> (our serverless offering), often require more compute than other server resources. Also, the average size in kilobytes of a web request across our network tends to be small, influenced in part by the relative popularity of APIs and mobile applications. Our approach to server design is very different from that of traditional <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">content delivery networks</a> engineered to deliver large-object video libraries, for whom servers focused on storage might make more sense, and re-architecting to offer serverless is prohibitively capital intensive.</p><p>Our Gen X server is intentionally designed with an “empty” PCIe slot for a potential add-on card, if it can perform some functions more efficiently than the primary CPU. Would that be a GPU, FPGA, SmartNIC, custom ASIC, TPU or something else? We’re intrigued to explore the possibilities.</p><p>In accompanying blog posts over the next few days, our hardware engineers will describe how the AMD EPYC 7642 performed against the benchmarks we care about. We are thankful for their hard work.</p>
    <div>
      <h2>Memory, Storage &amp; Network</h2>
      <a href="#memory-storage-network">
        
      </a>
    </div>
    <p>Since we are typically limited by CPU, Gen X represented an opportunity to grow components such as RAM and SSD more slowly than compute.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/26Sxt3ggFdoQ0GzKF3EGss/368a3e0d940daae20b6feeb5a3e4fba2/AMD_EPYC-13.jpg" />
            
            </figure><p>For memory, we continued to use 256GB of RAM, as in our prior generation, but rated higher at 2933MHz. For storage, we continue to have ~3TB, but moved to a 3x1TB form factor using NVMe flash (instead of SATA) with increased available IOPS and higher endurance, which enables <a href="https://www.usenix.org/conference/vault20/presentation/korchagin">full disk encryption using LUKS without penalty</a>. For the network card, we continue to use a Mellanox 2x25G NIC.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ss6eIWw5bu2PmPfE000Wi/131a3644d28be05b4d00a113bc9565ea/AMD_EPYC-5.jpg" />
            
            </figure><p>We moved from our multi-node chassis back to a simple 1U form factor, designed to be lighter and less error-prone during operational work at the data center. We also added multiple new ODM partners to diversify how we manufacture our equipment and to take advantage of additional global warehousing.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7idOzt8cy7gwxh6oeOSpI5/302f275f840d028cbd2b0bb6a1a351cc/AMD_EPYC-7.jpg" />
            
            </figure>
    <div>
      <h3>Network Expansion</h3>
      <a href="#network-expansion">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7kEiqTtyNgUY66Rh34MpuG/42da64e77acad6084872058580ed2d89/AMD_EPYC-35.jpg" />
            
            </figure><p>Our newest generation of servers give us the flexibility to continue to build out our network even closer to every user on Earth. We’re proud of the hard work from across engineering teams on Gen X, and are grateful for the support of our partners. Be on the lookout for more blogs about these servers in the coming days.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[AMD]]></category>
            <guid isPermaLink="false">5NWs1t95pGWDeOMVAkbQ4S</guid>
            <dc:creator>Nitin Rao</dc:creator>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Expanded to 200 Cities in 2019]]></title>
            <link>https://blog.cloudflare.com/cloudflare-expanded-to-200-cities-in-2019/</link>
            <pubDate>Sat, 04 Jan 2020 17:00:00 GMT</pubDate>
            <description><![CDATA[ We have some exciting news to ring in the new decade: Cloudflare’s global network has expanded to 200 cities across 90+ countries. ]]></description>
            <content:encoded><![CDATA[ <p>We have exciting news: Cloudflare closed out the decade by reaching our <b>200th</b> city* across <b>90+</b> countries. Each new location increases the security, performance, and reliability of the 20-million-plus Internet properties on our network. Over the last quarter, we turned up seven data centers spanning from Chattogram, Bangladesh all the way to the Hawaiian Islands:</p><ul><li><p><b>Chattogram</b> &amp; <b>Dhaka</b>, Bangladesh. These data centers are our first in Bangladesh, ensuring that its <a href="https://data.worldbank.org/indicator/SP.POP.TOTL?locations=BD">161 million residents</a> will have a better experience on our network.</p></li><li><p><b>Honolulu</b>, Hawaii, USA. Honolulu is one of the most remote cities in the world; with our Honolulu data center up and running, Hawaiian visitors can be served 2,400 miles closer than ever before! Hawaii is a hub for many submarine cables in the Pacific, meaning that some Pacific Islands will also see significant improvements.</p></li><li><p><b>Adelaide</b>, Australia. Our 7th Australasian data center can be found “down under” in the capital of South Australia. Despite being Australia’s fifth-largest city, Adelaide is often overlooked for Australian interconnection. We, for one, are happy to establish a presence there, in its unique <a href="https://en.wikipedia.org/wiki/UTC%2B09:30">UTC+9:30 time zone</a>!</p></li><li><p><b>Thimphu</b>, Bhutan. Bhutan is the seventh <a href="https://en.wikipedia.org/wiki/South_Asian_Association_for_Regional_Cooperation">SAARC</a> (South Asian Association for Regional Cooperation) country with a Cloudflare network presence. Thimphu is our first Bhutanese data center, continuing our mission of <a href="http://betterinternet.com">security and performance for all</a>.</p></li><li><p><b>St George’s</b>, Grenada. 
Our Grenadian data center is joining the Grenada Internet Exchange (GREX), the first non-profit Internet Exchange (IX) in the English-speaking Caribbean.</p></li></ul><p>We’ve come a long way since our launch in 2010, moving from colocating in key Internet hubs to fanning out across the globe and partnering with local ISPs. This has allowed us to offer security, performance, and reliability to Internet users in all corners of the world. In addition to the 35 cities we added in 2019, we expanded our existing data centers behind-the-scenes. We believe there are a lot of opportunities to harness in 2020 as we look to bring our network and its edge-computing power closer and closer to everyone on the Internet.</p><p>*<i>Includes cities where we have data centers with active Internet ports and those where we are configuring our servers to handle traffic for more customers (at the time of publishing).</i></p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[USA]]></category>
            <category><![CDATA[Middle East]]></category>
            <category><![CDATA[Australia]]></category>
            <category><![CDATA[APJC]]></category>
            <guid isPermaLink="false">4LgQNJr6xVp1T7QdEMDMaj</guid>
            <dc:creator>Jon Rolfe</dc:creator>
        </item>
        <item>
            <title><![CDATA[Good Morning, Jakarta!]]></title>
            <link>https://blog.cloudflare.com/selamat-pagi-jakarta-customers/</link>
            <pubDate>Fri, 11 Oct 2019 00:00:25 GMT</pubDate>
            <description><![CDATA[ Beneath the veneer of glass and concrete, this is a city of surprises and many faces. On 3rd October 2019, we brought together a group of leaders from across a number of industries to connect in Central Jakarta, Indonesia.  ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4RVJIZItNiZvwtbysmwEfk/350995092dc9c746415709bf3994d511/Screen-Shot-2019-10-10-at-7.24.19-PM.png" />
            
            </figure><p>Beneath the veneer of glass and concrete, this is a city of surprises and many faces. On 3rd October 2019, we brought together a group of leaders from across a number of industries to connect in Central Jakarta, Indonesia.</p><p>The habit of sharing stories at the lunch table, exchanging ideas, listening to the viewpoints of people from all tiers, paying first-hand attention to input from customers, and hearing the dreams of some of life’s warriors may sound simple, but it is a source of inspiration and encouragement for the cyberspace community in this region.</p><p>Our new data center in Jakarta extends our Asia Pacific network to 64 cities, and our global network to 194 cities.</p>
    <div>
      <h2>Selamat Pagi</h2>
      <a href="#selamat-pagi">
        
      </a>
    </div>
    <p>Right on time, Kate Fleming extended a warm welcome to all our Indonesian guests. "We were especially appreciative of the investment of your time that you made coming to join us."</p><p>Kate is the Head of Customer Success for APAC. Australian-born, Kate spent the past 5 years living in Malaysia and Singapore. She leads a team of Customer Success Managers in Singapore. The Customer Success team is dispersed across multiple offices and time zones. We are the advocates for Cloudflare Enterprise customers. We help with your onboarding journey and various post-sales activities, from project and resource management planning to training, configuration recommendations, sharing best practices, escalation, and more.</p><p>"Today, the Indonesian Cloudflare team would like to share with you some insights and best practices around how Cloudflare is not only a critical part of any organization’s cyber security planning, but is working towards building a better Internet in the process." - Kate</p><hr />
    <div>
      <h2>Learning Modern Trends of Cyber Attacks</h2>
      <a href="#learning-modern-trends-of-cyber-attacks">
        
      </a>
    </div>
    <p>Ayush Verma, our Solutions Engineer for ASEAN and India, was there to unveil the latest cyber security trends. He shared insights on how to stay ahead of the game in the fast-changing online environment.</p><p>Get answers to questions like:</p><ul><li><p>How can I secure my site without sacrificing performance?</p></li><li><p>What are the latest trends in malicious attacks — and how should I prepare?</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/LoAMXlQEGCPAep6ajnjmF/c8d005ab2efb19ac17ef224da52fdad4/Screen-Shot-2019-10-10-at-7.26.51-PM.png" />
            
            </figure>
    <div>
      <h2>Superheroes Behind The Scenes</h2>
      <a href="#superheroes-behind-the-scenes">
        
      </a>
    </div>
    
    <div>
      <h4>We were very honored to have two industry leaders speak to us.</h4>
      <a href="#we-were-very-honored-to-have-two-industry-leaders-speak-to-us">
        
      </a>
    </div>
    <blockquote><p><b>Jullian Gafar,</b> the CTO from PT Viva Media Baru. PT Viva Media Baru is an online media company based out of Jakarta, Indonesia.</p><p><b>Firman Gautama</b>, the VP of Infrastructure &amp; Security from PT. Global Tiket Network. PT. Global Tiket Network offers hotel, flight, car rental, train, world-class event/concert, and attraction tickets.</p></blockquote><p>It was a golden opportunity to hear from the leaders themselves about what’s keeping them busy lately, their own approaches to cyber security, best practices, and easy-to-implement, cost-efficient strategies.</p><p><b>Fireside Chat Highlights</b>: Shoutout from Pak Firman, who was very pleased with the support he received from Kartika. He said, "Most sales people are hard to reach after completing a sale. Kartika always goes the extra mile; she stays engaged with me. The customer experience is just exceptional."</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3JQUuUzJRnkXHXu65ygeKV/068248c48f4f30f9df2d6d56b6d47d61/Screen-Shot-2019-10-10-at-7.29.50-PM-3.png" />
            
            </figure>
    <div>
      <h2>Our Mission Continues</h2>
      <a href="#our-mission-continues">
        
      </a>
    </div>
    <p>Thank you for giving us your time to connect. It brings us back to our roots and core mission of helping to build a better Internet. Based on the principle “The Result Never Betrays the Effort”, we believe that what we are striving for today, by creating various innovations in our services and strategies to improve your business, will in time produce the best results. For this reason, we offer our endless thanks for your support and loyalty in continuing to push forward with us. Always at your service!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5vUa0AdpaBvKtdSVSQWQXW/930f36f33a3a49de31d7bc6d95fcfbde/Screen-Shot-2019-10-10-at-7.31.40-PM.png" />
            
            </figure><blockquote><p>Cloudflare Event Crew in Indonesia #CloudflareJKT: Chris Chua (Organiser) | Kate Fleming | Bentara Frans | Ayush Verma | Welly Tandiono | Kartika Mulyo | Riyan Baharudin</p></blockquote> ]]></content:encoded>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[APJC]]></category>
            <category><![CDATA[Cloudflare Meetups]]></category>
            <guid isPermaLink="false">1j41X4bKQ3zPGNrmB62nmU</guid>
            <dc:creator>Chris Chua</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Global Network Expands to 193 Cities]]></title>
            <link>https://blog.cloudflare.com/scaling-the-cloudflare-global/</link>
            <pubDate>Thu, 15 Aug 2019 01:41:18 GMT</pubDate>
            <description><![CDATA[ Cloudflare’s global network currently spans 193 cities across 90+ countries. With over 20 million Internet properties on our network, we increase the security, performance, and reliability of large portions of the Internet every time we add a location. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare’s global network currently spans 193 cities across 90+ countries. With over 20 million Internet properties on our network, we increase the security, performance, and reliability of large portions of the Internet every time we add a location.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6othJ1CZ1L7QpVqAHMqu2/e3eaeb54fc1eb8f9a02e3dc8e6447770/image1-3.png" />
            
            </figure>
    <div>
      <h3>Expanding Network to New Cities</h3>
      <a href="#expanding-network-to-new-cities">
        
      </a>
    </div>
    <p>So far in 2019, we’ve added over thirty new locations: Amman, Antananarivo*, Arica*, Asunción, Baku, Bengaluru, Buffalo, Casablanca, Córdoba*, Cork, Curitiba, Dakar*, Dar es Salaam, Fortaleza, Geneva, Göteborg, Guatemala City, Hyderabad, Kigali, Kolkata, Male*, Maputo, Nagpur, Neuquén*, Nicosia, Nouméa, Ottawa, Port-au-Prince, Porto Alegre, Querétaro, Ramallah, and Thessaloniki.</p>
    <div>
      <h4>Our Humble Beginnings</h4>
      <a href="#our-humble-beginnings">
        
      </a>
    </div>
    <p>When Cloudflare launched in 2010, we focused on putting servers at the Internet’s crossroads: large data centers with key connections, like the Amsterdam Internet Exchange and Equinix Ashburn. This not only provided the most value to the most people at once, but was also easier to manage, since our servers sat in the same buildings as all the local ISPs, server providers, and other networks they needed to talk to in order to streamline our services.</p><p>This is a great approach for bootstrapping a global network, but we’re obsessed with <a href="/tag/speed-week/">speed in general</a>. There are over five hundred cities in the world with more than one million inhabitants, but only a handful of them have the kinds of major Internet exchanges that we targeted. Our goal as a company is to help make a better Internet for all, not just those lucky enough to live in areas with affordable and easily-accessible interconnection points. However, we ran up against two broad, nasty problems: a) we were running out of major Internet exchanges, and b) latency still wasn’t as low as we wanted. Clearly, we had to start scaling in new ways.</p><p>One of our first big steps was entering into partnerships around the world with local ISPs, who have many of the same problems we do: ISPs want to save money and provide fast Internet to their customers, but they often don’t have a major Internet exchange nearby to connect to. Adding Cloudflare equipment to their infrastructure effectively brought more of the Internet closer to them. We help them speed up millions of Internet properties while reducing costs by serving traffic locally. Additionally, since all of our servers are designed to support all our products, a relatively small physical footprint can also provide <a href="https://www.cloudflare.com/ddos/">security</a>, <a href="https://www.cloudflare.com/cdn/">performance</a>, <a href="https://www.cloudflare.com/load-balancing/">reliability</a>, and more.</p>
    <div>
      <h2>Upgrading Capacity in Existing Cities</h2>
      <a href="#upgrading-capacity-in-existing-cities">
        
      </a>
    </div>
    <p>Though it may be obvious and easy to overlook, continuing to build out existing locations is also a key facet of building a global network. This year, we have significantly increased the computational capacity at the edge of our network. Additionally, by making it <a href="https://www.cloudflare.com/partners/peering-portal/">easier</a> to <a href="https://bgp.he.net/report/exchanges#_participants">interconnect</a> with Cloudflare, we have increased the number of unique networks directly connected with us to over 8,000. This makes for a faster, more reliable Internet experience for the &gt;1 billion IPs that we see daily.</p><p>To make these capacity upgrades possible for our customers, efficient infrastructure deployment has been one of our keys to success. We want our infrastructure deployment to be targeted and flexible.</p>
    <div>
      <h3>Targeted Deployment</h3>
      <a href="#targeted-deployment">
        
      </a>
    </div>
    <p>The next Cloudflare customer through our door could be a small restaurant owner on a Pro plan with thousands of monthly pageviews or a <a href="https://www.cloudflare.com/case-studies/discord/">fast-growing global tech company like Discord</a>. As a result, we need to always stay one step ahead and synthesize a lot of data all at once for our customers.</p><p>To accommodate this expansion, our Capacity Planning team is learning new ways to optimize our servers. One key strategy is targeting exactly where to send our servers. However, staying on top of everything isn’t easy: we are a global <a href="https://www.cloudflare.com/learning/cdn/glossary/anycast-network/">anycast</a> network, which introduces unpredictability as to where incoming traffic goes. To make things even more difficult, each city can contain as many as five distinct deployments. Planning isn’t just a question of which city to send servers to; it’s one of which address.</p><p>To make sense of it all, we tackle the problem with simulations. Some, but not all, of the variables we model include historical traffic growth rates, foreseeable anomalous spikes (e.g., Cyber Day in Chile), consumption estimates from our live deal pipeline, product costs, user growth, and end-customer adoption. We also add in site reliability, potential for expansion, and expected regional expansion and partnerships, as well as strategic priorities and, of course, feedback from our fantastic Systems Reliability Engineers.</p>
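<p>As a toy illustration of the simulation idea above, a single-deployment projection might compound organic growth, add headroom for a known spike, and round up to whole servers. All numbers and the model itself are illustrative assumptions for this sketch, not Cloudflare's real planning model:</p>

```javascript
// Toy, single-deployment capacity projection. The growth rate, spike
// multiplier, and per-server capacity below are made-up illustrative
// numbers, not real Cloudflare figures.
function serversNeeded({ currentRps, monthlyGrowth, months, spikeMultiplier, rpsPerServer }) {
  // Compound organic growth over the planning horizon.
  const organic = currentRps * Math.pow(1 + monthlyGrowth, months);
  // Headroom for a foreseeable anomalous spike (e.g., Cyber Day in Chile).
  const peak = organic * spikeMultiplier;
  // Round up: capacity is bought in whole servers.
  return Math.ceil(peak / rpsPerServer);
}

// Example: 40k requests/sec today, 5% monthly growth, planning 12 months out.
serversNeeded({ currentRps: 40000, monthlyGrowth: 0.05, months: 12, spikeMultiplier: 2, rpsPerServer: 10000 });
```

<p>The real simulations fold in many more variables (site reliability, deal pipeline, partnerships), but the core shape is the same: project demand forward, then translate peak demand into hardware.</p>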
    <div>
      <h3>Flexible Supply Chain</h3>
      <a href="#flexible-supply-chain">
        
      </a>
    </div>
    <p>Knowing where to send a server is only the first of many challenges when it comes to a global network. Just like our user base, our supply chain must span the entire world while staying flexible enough to react quickly to time constraints, pricing changes (including taxes and tariffs), import/export restrictions, required certifications, local partnerships, and many more dynamic, location-specific variables. All the more reason to stay quick on our feet: there will always be unforeseen roadblocks and detours, even in the most well-prepared plans. For example, a planned expansion in our Prague location might warrant an expanded presence in Vienna for failover.</p><p>Once servers arrive at our data centers, our Data Center Deployment and Technical Operations teams work with our vendors and on-site data center personnel (our “Remote Hands” and “Smart Hands”) to install the physical servers, manage the cabling, and handle other early-stage provisioning processes.</p><p>Our <a href="/cloudflare-architecture-and-how-bpf-eats-the-world/">architecture</a>, which is designed so that every server can support every service, makes it easier to withstand hardware failures and efficiently load balance workloads between equipment and between locations.</p>
    <div>
      <h2><b>Join Our Team</b></h2>
      <a href="#join-our-team">
        
      </a>
    </div>
    <p>If working at a rapidly expanding, globally diverse company interests you, we’re <a href="https://cloudflare.com/careers">hiring</a> for scores of positions, including in the Infrastructure group. If you want to help increase hardware efficiency, deploy and maintain servers, work on our supply chain, or strengthen ISP partnerships, get in touch.</p><p>*<i>Represents cities where we have data centers with active Internet ports and where we are configuring our servers to handle traffic for more customers (at the time of publishing)</i></p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Africa]]></category>
            <category><![CDATA[South America]]></category>
            <category><![CDATA[India]]></category>
            <guid isPermaLink="false">8ETCTVINpOptfId22EZnC</guid>
            <dc:creator>Nitin Rao</dc:creator>
            <dc:creator>Jon Rolfe</dc:creator>
            <dc:creator>Eva Hoyer</dc:creator>
        </item>
        <item>
            <title><![CDATA[Solving Problems with Serverless – The Cloudflare LED Data Center Board, Part I]]></title>
            <link>https://blog.cloudflare.com/solving-problems-with-serverless-the-cloudflare-led-data-center-board-part-i/</link>
            <pubDate>Thu, 14 Feb 2019 20:00:00 GMT</pubDate>
            <description><![CDATA[ You know you have a cool job when your first project lets you bring your hobby into the office.  ]]></description>
            <content:encoded><![CDATA[ <p>You know you have a cool job when your first project lets you bring your hobby into the office.</p><p>That’s what happened to me just a few short weeks ago when I joined Cloudflare. The task: to create a light-up version of our Data Center map – we’re talking more than a hundred LEDs tied to the deployment state of each and every Cloudflare data center. This map will be a part of our booths, so it has to be able to travel; meaning we have to consider physical shipping <i>and</i> the ability to update the data when the map is away from the office. And the fun part – we are debuting it at SF Developer Week in late February (I even get to give a talk about it!) That gave me one week of software time in our San Francisco office, and a little over two and a half in the Austin office with the physical materials.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/eS2D5nN1ryxnbNscLYa3G/ab5a2b025cc3aa2e84545e2bccf6756b/Screen-Shot-2019-02-14-at-9.33.11-AM.png" />
            
            </figure><p>What the final LEDs will look like on a map of the world.</p><p>So what does this have to do with Serverless? Well, let’s think about where and how this map will need to operate: it will be going to expo halls and conferences, and we want it to show our most current data center statuses for at least each event, if not update once a day. But we don’t need to stay connected to the information store constantly, nor should we expect to be, over conference or expo WiFi.</p>
    <div>
      <h2>Data Stored about Data Centers</h2>
      <a href="#data-stored-about-data-centers">
        
      </a>
    </div>
    <p>The data stored about each data center falls into two distinct categories: data relevant to the data center itself, and data about how that data center should be rendered on the map. These are relatively simple data structures, however: for a data center, we want to store what city the data center is in, the latitude and longitude, and the status. We arbitrarily assign an ID integer to each data center, which we'll use to match this data with the data in the other store. We’re not going to pick and choose which data centers we want; we just pull them all down and let the microcontroller figure out how to display them.</p><p>Speaking of rendering, this is where the data store relevant to the display comes in. LEDs sit on strands numbered 0-7, and each LED on a strand is numbered 0-63. We need to store the ID of the data center we created for the first store, the strand number, and the LED number within the strand.</p><p>Both of these sets of data can be stored in a key-value store, with the ID number as the key and a JSON object representing either the data center or its representative LED on the map as the value. Because of this, coupled with the fact that we do not need to search or index this data, we decided to use <a href="https://developers.cloudflare.com/workers/kv/">Workers KV</a> data stores to keep this information.</p>
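<p>As a rough sketch of the two value shapes described above, joined on the shared ID (the field names and sample values here are illustrative, not the project's exact schema):</p>

```javascript
// Illustrative shapes for the two key-value stores, both keyed by the
// arbitrary data center ID. Field names and values are examples only.
const dataCenters = {
  "1": { city: "Honolulu", lat: 21.31, lon: -157.86, status: "active" },
  "2": { city: "Adelaide", lat: -34.93, lon: 138.6, status: "provisioning" },
};

// Display store: strand number (0-7) and LED number within the strand (0-63).
const ledPositions = {
  "1": { strand: 2, led: 17 },
  "2": { strand: 5, led: 40 },
};

// Joining the two stores on ID gives the microcontroller everything it
// needs to light the right LED with the right status.
function ledForDataCenter(id) {
  const dc = dataCenters[id];
  const pos = ledPositions[id];
  if (!dc || !pos) return null;
  return { strand: pos.strand, led: pos.led, status: dc.status };
}
```

<p>Keeping the physical layout in its own store means the map can be rewired (or rebuilt) without touching the data center records.</p>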
    <div>
      <h2>The Data Center and Data Center Map API</h2>
      <a href="#the-data-center-and-data-center-map-api">
        
      </a>
    </div>
    <p>We needed two APIs around the data centers, and we needed them fast: both in the immediate sense of having only a few weeks to complete the project, and in the sense that we needed the data to download relatively quickly over non-ideal Internet connections. We also knew this map would be traveling all over the world, so the API had to maintain at least decent latency no matter where the map was geographically.</p><p>This is where having hundreds of data centers comes in handy: each LED represents a data center that we could deploy <a href="https://developers.cloudflare.com/workers/about/">serverless Workers</a> to. We can deploy the API to the data centers before we leave the comfort of the office, and it'll be ready for us when we hit the conference floor. Workers also, unsurprisingly, work really well with Workers KV data stores, allowing us to rapidly develop APIs around our data.</p>
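<p>As a hypothetical sketch of the "pull them all down" API, a Worker-style handler could read every record out of a KV namespace and return one JSON payload for the board to download in a single request. The handler below takes the namespace as a parameter, and the use of a <code>list()</code>/<code>get()</code> key-value interface is our assumption for illustration, not the project's actual code:</p>

```javascript
// Hypothetical "fetch everything" handler: enumerate all keys in a
// key-value namespace, read each JSON value, and return one combined
// JSON array so the board needs only a single round trip.
async function handleRequest(kv) {
  const { keys } = await kv.list();
  const records = await Promise.all(
    keys.map(async ({ name }) => ({ id: name, ...JSON.parse(await kv.get(name)) }))
  );
  return JSON.stringify(records);
}
```

<p>Because the handler only depends on <code>list()</code> and <code>get()</code>, it can be exercised locally with a stub namespace before deploying.</p>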
    <div>
      <h2>Our Software Architecture Diagram</h2>
      <a href="#our-software-architecture-diagram">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3yB2l13rg61b9JTNZu9aZj/81ef8c85c292ed9d76899db9619c4f71/LED-PoP-Board-Architecture-Diagram.png" />
            
            </figure><p>We ended up with the architecture diagram above: two Workers KV data stores and two serverless Workers, all of which can be deployed across the world to make sure the physical map has updated data every time we head to a new show.</p><hr /><p>In the next post in this series, we'll take a look at the physical architecture of the sign:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/42wDBaBFqptcC9iCt99h6I/66d9376444ab592eef2a570ac16568d6/Image_20190212_102827.jpeg.jpeg" />
            
            </figure><p>We'll also take a look at the system we built that uses the architecture laid out in this post that consumes this data and turns it into an LED map – so keep an eye out for it next month!</p><hr /><p>Interested in deploying a Cloudflare Worker without setting up a domain on Cloudflare? We’re making it easier to get started building serverless applications with custom subdomains on <a href="https://workers.dev">workers.dev</a>. <i>If you’re already a Cloudflare customer, you can add Workers to your existing website</i> <a href="https://dash.cloudflare.com/workers"><i>here</i></a>.</p><p><a href="https://workers.dev">Reserve a workers.dev subdomain</a></p><hr /> ]]></content:encoded>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Fun]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Cloudflare Workers KV]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">219IbwcBMcEwHEr3ghkFSk</guid>
            <dc:creator>Kassian Wren</dc:creator>
        </item>
        <item>
            <title><![CDATA[Argo Tunnel + DC/OS]]></title>
            <link>https://blog.cloudflare.com/argo-tunnel-and-dc-os/</link>
            <pubDate>Mon, 21 Jan 2019 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare is proud to partner with Mesosphere on their new Argo Tunnel offering available within their DC/OS (Data Center / Operating System) catalogue! Before diving deeper into the offering itself, we’ll first do a quick overview of the Mesosphere platform, DC/OS. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare is proud to partner with Mesosphere on their new Argo Tunnel offering available within their DC/OS (Data Center / Operating System) catalogue! Before diving deeper into the offering itself, we’ll first do a quick overview of the Mesosphere platform, DC/OS.</p>
    <div>
      <h2>What is Mesosphere and DC/OS?</h2>
      <a href="#what-is-mesosphere-and-dc-os">
        
      </a>
    </div>
    <p>Mesosphere DC/OS provides application developers and operators an easy way to consistently deploy and run applications and data services on cloud providers and on-premise infrastructure. The unified developer and operator experience across clouds makes it easy to realize use cases like global reach, resource expansion, and business continuity.</p><p>In this multi-cloud world, Cloudflare and Mesosphere DC/OS are great complements. Mesosphere DC/OS provides the same common services experience for developers and operators, and Cloudflare provides the same common service access experience across cloud providers. DC/OS helps tremendously in avoiding vendor lock-in to a single provider, while Cloudflare can load balance traffic intelligently (in addition to providing many other services) at the edge between providers. This new offering will allow you to load balance through the use of Argo Tunnel.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2G5ZOTv2xHKdlDqCus7HnU/bd4cf9e0d12ac9fbb93e6b3a9476b8c0/multicloud-neautral_2x.png" />
            
            </figure>
    <div>
      <h3>Quick Tunnel Refresh</h3>
      <a href="#quick-tunnel-refresh">
        
      </a>
    </div>
    <p>Cloudflare Argo Tunnel is a private connection between your services and Cloudflare. Tunnel makes it such that only traffic that routes through the Cloudflare network can reach your service.</p><p>Cloudflare’s lightweight Argo Tunnel daemon creates an encrypted Tunnel between your origin web server and Cloudflare’s nearest data center — all without opening any public inbound ports. In other words, it’s a private link. Only Cloudflare can see the service and communicate with it, and for the rest of the internet, the service is reachable only through the hostname configured on Cloudflare. Check this out if you’d like to learn more.</p><p>By using Argo Tunnel, DC/OS is able to load balance your traffic to any of your hosts, wherever they are running on Earth, in any cloud provider! Need more instances in Paris? Just launch them! Are instances more affordable in a specific provider? Just launch them there, and thanks to Argo Tunnel and DC/OS your traffic will be directed to exactly where it belongs.</p>
    <div>
      <h3>Requirements</h3>
      <a href="#requirements">
        
      </a>
    </div>
    <p>In order to use this application in DC/OS, you must have a zone on Cloudflare with the <a href="/Argo/">Argo service enabled</a>. Argo can be enabled for any plan type and is billed on usage. Because this application requires the use of DC/OS ‘secrets’, the Enterprise version of DC/OS is required. To get started on Cloudflare, please see <a href="https://www.cloudflare.com/plans/">here</a> and sign up for an account. To do the same with DC/OS, please see <a href="https://docs.mesosphere.com/1.12/installing/evaluation/">here</a>.</p>
    <div>
      <h3>Cloudflare Argo Tunnel Support for DC/OS</h3>
      <a href="#cloudflare-argo-tunnel-support-for-dc-os">
        
      </a>
    </div>
    <p>Argo Tunnel is the fast way to make services that run on DC/OS private agents (that are only bound to the DC/OS internal network) accessible over the public Internet.</p><p>When you launch the Tunnel for your service, it creates persistent outbound connections to the two closest Cloudflare PoPs, through which the entire Cloudflare network routes to reach the service associated with the Tunnel. There is no need to configure DNS, update a NAT configuration, or modify firewall rules (connections are outbound). The Argo Tunnel exposed service gets all the benefits offered by the Cloudflare network (e.g. DDoS protection, CDN and performance, TLS, etc.).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/59NfzsE9nLW3aaKhzrSR1I/f0eb0aaa4d0e97feb5dfa942e9919446/Argo-Tunnel-DC-OS.png" />
            
            </figure><p>The Cloudflare Argo Tunnel Service is available from Mesosphere DC/OS catalog.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ILYxEsM8NriMNpWhNZrf7/fd0f075da7013e08e4d6a0c01e537636/Argo-2.png" />
            
            </figure><p>Configuration of the Argo Tunnel requires you to specify three things:</p><ul><li><p>Cloudflare Hostname - The DNS name of your service on the Cloudflare network. This is the address where you wish your service to be available on the Internet. (Note: adding a zone to Cloudflare is extremely simple; you can get started at <a href="https://www.cloudflare.com/plans/">https://www.cloudflare.com/plans/</a>.)</p></li><li><p>Local Service URL - The local URL of the service that you want to make available, on the machines running Argo Tunnel.</p></li><li><p>Load Balancer Pool - The load balancer pool you want the service to be part of. Use any value you like, keeping the value consistent for tunnels you wish to load balance traffic onto as a unit. Inside Cloudflare you can manage how your traffic is balanced between and inside your pools.</p></li></ul>
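<p>As a concrete illustration, the three settings above map naturally onto a tunnel configuration. The following is a hypothetical sketch only: the hostname, URL, and pool values are examples, and the exact key names used by the DC/OS catalog form may differ.</p>

```yaml
# Hypothetical sketch mirroring the three Argo Tunnel fields described above.
# Key names and values are illustrative, not the exact DC/OS package schema.
hostname: app.example.com      # Cloudflare Hostname: public DNS name on your zone
url: http://localhost:8080     # Local Service URL: where your service listens
lb-pool: us-west               # Load Balancer Pool: tunnels sharing a pool are
                               # balanced as a unit
```

<p>Tunnels launched in a second cluster with the same hostname but a different pool (say, <code>us-east</code>) would then appear as a second origin pool behind the same Cloudflare load balancer.</p>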
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5VsuRFMMuVxqBNba5cUjR0/0f06d0d22282b82610390f3c951bd3e6/LB-Configuration.png" />
            
            </figure><p>Assuming you do that setup for a service in a West Coast DC/OS cluster and an East Coast DC/OS cluster, with a respective us-west and us-east LB pool, then you end up with a Cloudflare load balancer globally balancing traffic between these clusters. The load balancer can be configured to do geosteering, which you can learn more about <a href="https://support.cloudflare.com/hc/en-us/articles/115000081911-Tutorial-How-to-Set-Up-Load-Balancing-Intelligent-Failover-on-Cloudflare">here</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5XLHj1d6PnXMgyoRrj7ICc/84715d0fc25f14045c0a36454d7d1f12/LB-Settings.png" />
            
            </figure><p>For more details, see the DC/OS Argo Tunnel <a href="https://github.com/dcos/examples/tree/master/cloudflare-argotunnel">documentation</a>. We hope this partnership is a meaningful step towards a simple multi-cloud solution for DC/OS customers.</p><p>To sign up for Cloudflare, click <a href="https://www.cloudflare.com/plans/">here</a>, and to sign up for DC/OS, click <a href="https://docs.mesosphere.com/1.12/installing/evaluation/">here</a>. We hope this partnership between Cloudflare and Mesosphere will help you drive private, secure, and performant multi-cloud deployments.</p> ]]></content:encoded>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Argo Smart Routing]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Cloudflare Tunnel]]></category>
            <category><![CDATA[Connectivity Cloud]]></category>
            <guid isPermaLink="false">61gNuny6p1knOUnyxbpVzZ</guid>
            <dc:creator>Tom Brightbill</dc:creator>
        </item>
        <item>
            <title><![CDATA[Ten new data centers: Cloudflare expands global network to 165 cities]]></title>
            <link>https://blog.cloudflare.com/ten-new-data-centers/</link>
            <pubDate>Thu, 20 Dec 2018 16:13:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare is excited to announce the addition of ten new data centers across the United States, Bahrain, Russia, Vietnam, Pakistan and France (Reunion).   ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Cloudflare is excited to announce the addition of ten new data centers across the United States, Bahrain, Russia, Vietnam, Pakistan and France (Réunion). We're delighted to help improve the performance and security of over 12 million domains across these diverse countries that collectively represent about half a billion Internet users.</p><p>Our global network now spans 165 cities, with <a href="/tag/march-of-cloudflare/">46 new cities</a> added just this year, and several dozen additional locations being actively worked on.</p>
    <div>
      <h3>United States of America</h3>
      <a href="#united-states-of-america">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5wnKguGqzNUNeAbmaoZ1Kj/355c42cb03326a8f50669808b74e8aab/Charlotte---Columbus.png" />
            
            </figure><p>Our expansion begins in the United States, where Cloudflare's 36th and 37th data centers in the nation serve <b>Charlotte</b> (North Carolina) and <b>Columbus</b> (Ohio) respectively. They are promising markets for interconnection, and join our existing deployments in Ashburn, Atlanta, Boston, Chicago, Dallas, Denver, Detroit, Houston, Indianapolis, Jacksonville, Kansas City, Las Vegas, Los Angeles, McAllen, Memphis, Miami, Minneapolis, Montgomery, Nashville, Newark, Norfolk, Omaha, Philadelphia, Portland, Richmond, Sacramento, Salt Lake City, San Diego, San Jose, Seattle, St. Louis, Tallahassee, and Tampa.</p>
    <div>
      <h3>Bahrain</h3>
      <a href="#bahrain">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5NeYXoKoCPOCmWZPk5IgrS/c30ab4520479b4ccc8872ae4cbe1e36c/Manama.png" />
            
            </figure><p>Cloudflare's <b>Manama</b> (Bahrain) data center, our 158th globally, further expands our <a href="https://www.cloudflare.com/network/">Middle East</a> coverage. A growing hub for cloud computing, including public sector adoption (with the Kingdom's "Cloud First" policy), Bahrain is attracting <a href="https://startupbahrain.com/about/">talent</a> and investment in innovative companies.</p>
    <div>
      <h3>Russia</h3>
      <a href="#russia">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3HAn8shGKNAQdXajgDg0Xv/8797f4072083bf8bf89b502c4341f4fa/St.-Petersburg.png" />
            
            </figure><p>Cloudflare's new <b>St. Petersburg</b> deployment serves as a point of redundancy to our existing <a href="/moscow/">Moscow</a> facility, while also expanding our surface area to withstand DDoS attacks and reducing latency for local Internet users. (Hint: If you live in Novosibirsk or other parts of Russia, stay tuned for upcoming Cloudflare deployments near you).</p>
    <div>
      <h3>Vietnam</h3>
      <a href="#vietnam">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/18UylfgT5boq2PjOb0r1dF/362b4625eb827c098344e49c4e5a2963/Hanoi-Ho---Chi-Minh-City.png" />
            
            </figure><p><b>Hànội and Hồ Chí Minh City,</b> the two most populated cities in Vietnam with estimated populations of 8 million and 9 million respectively, now host Cloudflare's 160th and 161st data centers.</p><p>On November 19, 1997, the Internet officially became available in Vietnam. Since then, several telecommunication companies - including VNPT, FPT, Viettel, CMC, VDC, and NetNam - have played a critical role in integrating the Internet into government systems, the business environment, school facilities, and many other organizations. With our new data centers in place, we are delighted to help provide a faster and safer Internet experience.</p>
    <div>
      <h3>Pakistan</h3>
      <a href="#pakistan">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/dGRKOOCwYSyCI6xr6jjea/fd94bf68fb3b06fd91c5bdd38c37a212/Islamabad---Karachi---Lahore.png" />
            
            </figure><p>The world's sixth most populous country, Pakistan is a land of delicious food, breathtaking natural beauty, poetry and, of course, cricket. Its natural beauty is exemplified by being home to 5 of the world's 14 mountains that are at least 8,000m high, including K2, the second highest peak in the world. Pakistan's rich history includes the 5,000 year old lost civilization of <a href="https://www.youtube.com/watch?v=QUng-iHhSzU">Mohenjo-daro</a>, with incredible design ranging from complex architecture on a grid layout to advanced water and sewage systems.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15pB2d3jaIKdkAyaXU1LpT/6762cfb8ffbdc00bd41bf435e8f88632/image-28.png" />
            
            </figure><p><a href="https://en.wikipedia.org/wiki/File:Nanga_Parbat_The_Killer_Mountain.jpg">Nanga Parbat</a> - Creative Commons Attribution-Share Alike 3.0 Unported</p><p>Today, Cloudflare is unveiling three new data centers housed in Pakistan, one in each of the most populous cities - <b>Karachi</b> and <b>Lahore</b> - alongside an additional data center in the capital city, <b>Islamabad</b>. We are already seeing latency per request decrease by over 3x and as much as 150ms, and expect this to further improve as we tune routing for all our customers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5NCC5ia9vYhfgwsZJAU02S/6447ee8bfdb1dae08db24da1f5bb2cf4/Pakistan_Latency.png" />
            
            </figure><p>Latency from PTCL to Cloudflare customers reduces by over 3x across Pakistan. Courtesy: Cedexis</p>
    <div>
      <h3>Réunion (France)</h3>
      <a href="#reunion-france">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/KZfCBUf3ysFggupCSVzNx/f7189508c4cda08337fee267aef17541/-Sainte-Marie-Re-union.png" />
            
            </figure><p>8,000 miles away, the final stop in today's expansion is <b>Sainte-Marie</b> on the island of Réunion, an overseas department of France off the coast of Madagascar (which can also expect some Cloudflare servers very soon!)</p>
    <div>
      <h3>Expansion ahead!</h3>
      <a href="#expansion-ahead">
        
      </a>
    </div>
    <p>Even beyond these, we are working on at least six new cities in each of Africa, Asia, Europe, North America, and South America. Guess 20 upcoming locations to receive Cloudflare swag.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Middle East]]></category>
            <category><![CDATA[North America]]></category>
            <category><![CDATA[USA]]></category>
            <category><![CDATA[Russia]]></category>
            <guid isPermaLink="false">2FknYP8WrpWDPr7i7DyeCX</guid>
            <dc:creator>Nitin Rao</dc:creator>
            <dc:creator>Junade Ali</dc:creator>
            <dc:creator>Tuyen Dinh</dc:creator>
        </item>
        <item>
            <title><![CDATA[Lagos, Nigeria - Cloudflare’s 155th city]]></title>
            <link>https://blog.cloudflare.com/lagos/</link>
            <pubDate>Tue, 30 Oct 2018 07:00:00 GMT</pubDate>
            <description><![CDATA[ At just shy of 200 million, Nigeria is the most populous country in Africa (Ethiopia is second and Egypt is third). That’s a lot of people to communicate with the world - and communicate they all do! ]]></description>
            <content:encoded><![CDATA[ <p>At just shy of 200 million, Nigeria is the most populous country in Africa (Ethiopia is second and Egypt is third). That’s a lot of people to communicate with the world - and communicate they all do!</p><p>According to a report published earlier this year, 84% of the Nigerian population owns a mobile device (a population of 193 million and 162 million mobile subscriptions). Again, that’s #1 for any country in Africa. But why so connected? Maybe because Nigeria (and Lagos specifically) is always on the move!</p><p>Lagos, as those who know the city say, never sleeps; it’s filled with color, from the food to the fashion to the diverse people going about their business. The vibrancy of the city is like a hard slap to the face: no matter what you have been told, your first time here will still knock you out. In Lagos, anything is possible; from the sadness of poverty to the clearly visible upper class, the city sucks you in like a surfer’s dream wave. Visitors come into Lagos and leave feeling like they’ve been through a unique experience. The traffic is mind-blowing, and the same goes for the work pace.</p><p>Lagos, a city always on the move!</p>
    <div>
      <h3>Cloudflare lands in Lagos, Nigeria</h3>
      <a href="#cloudflare-lands-in-lagos-nigeria">
        
      </a>
    </div>
    <p>Cloudflare has now entered Lagos with a secure facility colocated and interconnected to one of the primary undersea cable operators along with a full connection to both the local Internet Exchange (<a href="https://ixp.net.ng/">IXPN</a>) and the newly announced <a href="https://www.wafix.net/">WAF-IX Lagos</a> exchange. With every data center we add, local users not only connect faster and more reliably to sites proxied by Cloudflare, but our global DDoS mitigation becomes stronger and more robust. Instead of serving Nigeria from Europe (London, Lisbon, etc.), content can be delivered locally from Lagos.</p><p>If you follow the African telecom industry, then you'll know that West Africa’s economy is dominated by Nigeria, but few outside the region know that Nigeria is dominated by the phrase mobile-first. A country that’s mobile-first is one where mobile has eclipsed all other connectivity methods. Other countries with this status include Indonesia, China, Kenya, the Philippines, Myanmar, and many others. Being mobile-first doesn’t mean smartphone dominance (Myanmar qualifies because of the availability of $20 phones), but it does mean dependence on mobile infrastructure. In fact, SMS (texting) is still the most common communications method in most of these places. But smartphone penetration in Nigeria is growing fast, sitting at around 21 million according to the recent report referenced above.</p>
    <div>
      <h3>Nollywood</h3>
      <a href="#nollywood">
        
      </a>
    </div>
    <p>La La Land has Hollywood, India has Bollywood, and Nigeria has Nollywood. Yes, movie-making in Nigeria is a booming business with <a href="https://www.pbs.org/newshour/arts/inside-nollywood-the-booming-film-industry-that-makes-1500-movies-a-year">1,500 or more movies produced a year</a>, and yet Nollywood is only twenty-five years old. That’s still fewer movies than Bollywood, but that doesn’t worry any of the Nigerian statisticians, because they restate the revenue number on a per-capita basis, and that makes Nollywood bigger than Bollywood. But we digress.</p><p>The one thing that all that mobile penetration can provide is a distribution method for Nollywood’s movies, and while this has nothing to do with Cloudflare’s commitment to Nigeria, it’s interesting to see that Netflix (the movie and TV streaming giant) has just <a href="https://southerntimesafrica.com/site/news/netflix-invests-us8b-in-nollywood">invested heavily</a> in Nollywood!</p><p>Even Hollywood has tapped Nigerians for major roles. David Oyelowo, the British actor of Nigerian descent, played Martin Luther King Jr. in the 2014 movie Selma. Danny Glover (of Lethal Weapon and The Color Purple fame) is of Nigerian descent and has acted in a Nigeria-based movie called 93 Days. Then there’s Forest Whitaker (Last King of Scotland), John Boyega (Finn in Star Wars: The Force Awakens and The Last Jedi), and geek-favorite Richard Ayoade (Maurice Moss in The IT Crowd), all of Nigerian descent.</p>
    <div>
      <h3>Connectivity across and around Africa</h3>
      <a href="#connectivity-across-and-around-africa">
        
      </a>
    </div>
    <p>Lagos is well served by <a href="https://www.submarinecablemap.com/">undersea cables</a> that reach northwards to Europe and south towards South Africa. In fact, every African west coast undersea cable lands in Lagos, Nigeria.</p><ul><li><p>ACE (Africa Coast to Europe)</p></li><li><p>Glo-1 (and Glo-2)</p></li><li><p>MainOne</p></li><li><p>NCSCS (Nigeria Cameroon Submarine Cable System)</p></li><li><p>SAT-3/WASC</p></li><li><p>WACS (West African Cable System)</p></li></ul><p>While most of these cables are pumping bandwidth from north to south (i.e. Europe to Africa), some are now being used for inter-country connectivity at the IP backbone level. WACS has Nigeria, Angola, and South Africa connected. MainOne has Ghana and Nigeria connected, etc.</p>
    <div>
      <h3>Nigerian websites that instantly win</h3>
      <a href="#nigerian-websites-that-instantly-win">
        
      </a>
    </div>
    <p>When we look at the Alexa top-50 list for Nigerian users, we find that 18 of those sites are Cloudflare customers. That means an instant win for both consumers (those smartphone users) and the website or app operators. BTW: of the remaining 32 sites, 15 are owned by massive content players (search, online video, social media, operating systems or phones), and the remaining 17 sites (Wikipedia being one of the highest visited on that list) are mainly non-CDN’ed websites hosted outside the country (in Europe or the USA).</p>
    <div>
      <h3>After Nigeria</h3>
      <a href="#after-nigeria">
        
      </a>
    </div>
    <p>Cloudflare is far from finished in West Africa! We still have places like Senegal, Côte d’Ivoire, and Ghana to do (just to name a few). As always, stay tuned as Cloudflare grows the network! Oh, and by now any regular reader of our blogs will know that we are growing, both in network deployment and in fantastic staff. So pop over to the <a href="https://www.cloudflare.com/careers/departments/">jobs</a> page and see what interests you!</p> ]]></content:encoded>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Africa]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <guid isPermaLink="false">716D7VPVzZALxfv5AHiY4J</guid>
            <dc:creator>Martin J Levy</dc:creator>
        </item>
        <item>
            <title><![CDATA[Ulaanbaatar, Mongolia]]></title>
            <link>https://blog.cloudflare.com/mongolia/</link>
            <pubDate>Tue, 02 Oct 2018 23:59:00 GMT</pubDate>
            <description><![CDATA[ Whenever you get into a conversation about exotic travel or ponder visiting the four corners of the globe, inevitably you end up discussing Ulaanbaatar in Mongolia.  ]]></description>
            <content:encoded><![CDATA[ <p>Whenever you get into a conversation about exotic travel or ponder visiting the four corners of the globe, inevitably you end up discussing Ulaanbaatar in Mongolia. Travelers want to experience the rich culture and vivid blue skies of Mongolia; a feature which gives the country its nickname of <a href="https://www.travelweekly.com/Asia-Travel/Mongolia-Pristine-beauty-in-the-Land-of-the-Eternal-Blue-Sky">“Land of the Eternal Blue Sky”</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6RKZanh8oMvKp79ISqf2dm/ddc68cb7236d18605143b38389647e41/Ulaanbaatar-Mongolia.png" />
            
            </figure><p>Ulaanbaatar (or Ulan Bator, shortened to UB by many) is the capital of Mongolia and located nearly a mile above sea level just outside the <a href="https://en.wikipedia.org/wiki/Gobi_Desert">Gobi Desert</a> - a desert that spans a good percentage of Central Asia’s Mongolia. (The rest of the Gobi Desert extends into China.) The country is nestled squarely between Russia to the north and China to the south. It’s also home to some of the richest and most ancient customs and festivals around. It’s those festivals that successfully draw in the tourists who want to experience something quite unique. Luckily, even with all the tourists, Mongolia has managed to keep its local customs, both in the cities and within its nomadic tribes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1mbsB7HMMapvzAaAk7Aajk/6eab4afd146f4b23d0e689c4e625659e/Mongolia_1996_CIA_map.jpg" />
            
            </figure><p><i>via </i><a href="https://commons.wikimedia.org/wiki/File:Mongolia_1996_CIA_map.jpg"><i>Wikipedia</i></a></p><p>History also has drawn explorers and conquerors to and from the region; but more on that later.</p>
    <div>
      <h3>Cloudflare is also drawn into Mongolia</h3>
      <a href="#cloudflare-is-also-drawn-into-mongolia">
        
      </a>
    </div>
    <p>Any avid reader of our blogs will know that we frequently explain that the expansion of our network provides customers and end-users with both more capacity and less latency. That goal (covering 95% of the Internet population with 10 milliseconds or less of latency) means that Mongolia was seriously on our radar.</p><p>Now that we have a data center in Ulaanbaatar, Mongolia, latency into that blue-sky country is significantly reduced. Prior to this data center going live, we were shipping bits into the country via Hong Kong, a whopping 1,800 miles away (or 50 milliseconds if we talk latency). That's far! We know this new data center is a win-win for both mobile and broadband customers within the country and for Cloudflare customers as a whole.</p>
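<p>As a back-of-envelope check (our own sketch, not a Cloudflare measurement), the 50 millisecond figure is consistent with the physics of that distance:</p>

```python
# Rough propagation-delay floor for a fiber path, assuming light travels
# through fiber at about two-thirds the speed of light (~200,000 km/s).
# The distance and fiber speed here are illustrative approximations.

MILES_TO_KM = 1.609344
FIBER_KM_PER_S = 200_000  # approximate signal speed in optical fiber

def min_rtt_ms(distance_miles: float) -> float:
    """Theoretical minimum round-trip time over a straight fiber path."""
    km = distance_miles * MILES_TO_KM
    return 2 * km / FIBER_KM_PER_S * 1000

print(f"{min_rtt_ms(1800):.0f} ms")  # ~29 ms floor for 1,800 miles
```

<p>Real routes are never straight lines and add switching and queuing delay, which is how a roughly 29 ms physical floor becomes the roughly 50 ms observed via Hong Kong.</p>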
    <div>
      <h3>Just how did we get Cloudflare into Mongolia?</h3>
      <a href="#just-how-did-we-get-cloudflare-into-mongolia">
        
      </a>
    </div>
    <p>Ulaanbaatar is city number 154 on Cloudflare’s network. Our expansion into cities like Ulaanbaatar doesn’t just happen instantly; it takes many teams within Cloudflare in order to successfully deploy in a place like this.</p><p>However, before deploying, we need to negotiate a place to deploy into. A new city requires a secure data center for us to build into. A bandwidth partner is also required. We need to get access to the local networks and to also acquire cache-fill bandwidth in order to operate our CDN. Once we have those items negotiated, we can focus on the next steps. Any site we build has to match our own stringent security standards (we are <a href="https://www.cloudflare.com/learning/privacy/what-is-pci-dss-compliance/">PCI/DSS compliant</a> – hence all our data centers need to also be PCI/DSS compliant). That’s a paperwork process, which surprisingly takes longer than most other stages (because we care about security).</p><p>Then logistics kicks in. A BOM (Bill of Materials) is created. Serial numbers recorded. Power plugs chosen. Fun fact: Cloudflare data centers are nearly all identical, except the power cables. While we live in a world where fiber optic cables and connectors are universal, independent of location (or speed in some cases), the power connections we receive for our data centers vary widely as we expand around the globe.</p><p>The actual shipping is possibly the more interesting part of the logistics process. Getting a few pallets of hardware strapped up and ready to move is only a small part of the process. Paperwork again becomes the all-powerful issue. Each country has its own customs and import process, each shipping company has its own twist on how to process things, and each shipment needs to be coordinated with a receiving party. 
Our logistics team pulls off this miracle for new sites, upgrades for existing sites, replacement parts for existing sites, all while sometimes calmly listening to mundane on-hold music from around the globe.</p><p>Then our hardware arrives! Seriously, this is a biggie and those around the office that follow these new city builds are always celebrating on those days. The logistics team has done their part; now it’s time for the deployment team to kick in.</p><p>The deployment team’s main goal is to get hardware racked-and-stacked in a site where (in most cases) we are contracting out the remote-hands to do this work. Sometimes we send people to a site to build it; however, that’s not a scalable process and hence we use local remote-hands contractors to do this heavy lifting and cabling. There are diagrams, there are instructions, there are color-coded cables (‘cause the right cable needs to go into the right socket). Depending on the size of the build, it can be just a day’s work or up to a week’s worth of work. We vary our data center sizes based on the capacity needs for each city. Once racked-and-stacked, there is one more job to get done within the Infrastructure team. They get the initial private network connection enabled and secured. That single connection provides us with the ability to continue to the next step: actually setting up the network and servers at the new site.</p><p>Every new data center site is shipped with zero configuration loaded into network hardware and compute servers. They all ship raw with no Cloudflare knowledge embedded into them. This is by design. The network team’s first goal is to configure the routers and switches, which is mainly a bootstrap process in order for the hardware to phone home and securely request its full configuration setup. We have previously written about our extensive <a href="https://www.cloudflare.com/network-automation-at-scale-ebook/">network automation</a> methods. 
In the case of a new site, it’s not that different. Once the site can communicate back home, it’s clear to the software base that its configuration is out of date. Updates are pushed to the site and <a href="https://www.cloudflare.com/network-services/solutions/network-monitoring-tools/">network monitoring</a> is automatically enabled.</p><p>But let's not paint a perfect rosy picture. There can still be networking issues. Just one is worth pointing-out as it’s a recurring issue and one that plagues the industry globally. Fiber optic cables can sometimes be plugged in with their receive and transmit sides reversed. It’s a 50:50 chance of being right. Sometimes it just amazes us that we can’t get this fixed; but … a quick swap of the two connectors and we are back in business!</p>
    <div>
      <h3>Those explorers and conquerors</h3>
      <a href="#those-explorers-and-conquerors">
        
      </a>
    </div>
    <p>No conversation about Mongolia would be valid unless we discuss Genghis Khan. Back in the 13th century, Genghis Khan founded the Mongol Empire. He unified the tribes of what is now Mongolia (and beyond). He established the largest land empire in history and is well described both online and via various TV documentaries (The History Channel doesn’t skimp when it comes to <a href="https://www.history.com/topics/china/genghis-khan">covering him</a>). Genghis Khan was a name of honor that he didn’t receive till 1206. Before that he was just named “Temujin”.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2cvUCdM6c6ZEFTMwfsXzZc/5d751c72f8d759499f619c88273239ba/35367135654_1251f9fbf8_o.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/120420083@N05/35367135654/"><i>Photo</i></a><i> by </i><a href="https://www.flickr.com/photos/120420083@N05/"><i>SarahTz</i></a><i> CC by/</i><a href="https://creativecommons.org/licenses/by/2.0/"><i>2.0</i></a></p><p>Around 30 miles outside Ulaanbaatar is the equestrian statue of Genghis Khan on the banks of the Tuul River in <a href="https://en.wikipedia.org/wiki/Gorkhi-Terelj_National_Park">Gorkhi Terelj National Park</a>. Pictured above, this statue is 131 feet tall and built from 250 tons of steel.</p>
    <div>
      <h3>Meanwhile back in Mongolia in present time</h3>
      <a href="#meanwhile-back-in-mongolia-in-present-time">
        
      </a>
    </div>
    <p>We get to announce our new city (and country) data center during a very special time. The Golden Eagle Festival takes place during the first weekend of October (that’s October 6 and 7 this year). It’s a test of a hunter’s speed and agility. In this case the hunters (nomads of Mongolia) are practicing an ancient technique of using Golden Eagles to hunt. It takes place in the far west of the country in the Altai Mountains.</p><p>The most famous festival in Mongolia is the Naadam Festival in mid-July. There is so much going on within that festival, all of it a celebration of Mongolian and nomadic culture. The festival celebrates the local sports (wrestling, archery, and horse racing) along with plenty of arts. The opening ceremony can be quite elaborate.</p><p>When discussing Mongolia, travelers nearly always want to make sure their itineraries overlap at least one of these festivals!</p>
    <div>
      <h3>Cloudflare continues to expand globally</h3>
      <a href="#cloudflare-continues-to-expand-globally">
        
      </a>
    </div>
    <p>Last week was Cloudflare’s Birthday Week and we announced many services and products. Rest assured that everything we announced is instantly available in Ulaanbaatar, Mongolia (data center city 154) just like it’s available everywhere else on Cloudflare’s network.</p><p>One final trivia point regarding our Ulaanbaatar data center. With Ulaanbaatar live, we now have data centers covering the letters <b>A</b> thru <b>Z</b> (from Amsterdam to Zurich), i.e. with <b>U</b> added, every letter is now fully represented.</p><p>If you like what you’ve read and want to come join our infrastructure team, our logistics team, our network team, our SRE team, or any of the teams that help with these global expansions, then we would love to see your resume or CV. Look <a href="https://www.cloudflare.com/careers">here</a> for job openings.</p> ]]></content:encoded>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Asia]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Growth]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <guid isPermaLink="false">ek9tlQYUye5NB9Iy0IvoI</guid>
            <dc:creator>Martin J Levy</dc:creator>
        </item>
    </channel>
</rss>