
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, the technologies they use, and how to join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 17:41:41 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Measuring Hyper-Threading and Turbo Boost]]></title>
            <link>https://blog.cloudflare.com/measuring-hyper-threading-and-turbo-boost/</link>
            <pubDate>Tue, 05 Oct 2021 12:58:35 GMT</pubDate>
            <description><![CDATA[ Contemporary x86 processors implement some variants of Hyper-Threading and Turbo Boost. We decided to learn about the implications of these two technologies. ]]></description>
            <content:encoded><![CDATA[ <p>We often put together experiments that measure hardware performance to improve our understanding and provide insights to our hardware partners. We recently wanted to know more about Hyper-Threading and Turbo Boost. The last time we assessed these two technologies was when we were still <a href="/a-tour-inside-cloudflares-g9-servers/">deploying the Intel Xeons</a> (Skylake/Purley), but beginning with our Gen X servers we <a href="/technical-details-of-why-cloudflare-chose-amd-epyc-for-gen-x-servers/">switched over to the AMD EPYC</a> (Zen 2/Rome). This blog is about our latest attempt at quantifying the performance impact of Hyper-Threading and Turbo Boost on our AMD-based servers running our software stack.</p><p>Intel briefly introduced Hyper-Threading with NetBurst (Northwood) back in 2002, then reintroduced Hyper-Threading six years later with Nehalem, along with Turbo Boost. AMD presented its own implementation of these technologies with Zen in 2017, but AMD’s version of Turbo Boost actually dates back to AMD K10 (Thuban) in 2010, when it used to be called Turbo Core. Since Zen, AMD’s counterparts to Hyper-Threading and Turbo Boost have been known as simultaneous multithreading (SMT) and Core Performance Boost (CPB), respectively. The underlying implementations differ between the two vendors, but the high-level concept remains the same.</p><p>Hyper-Threading or simultaneous multithreading creates a second hardware thread within a processor’s core, also known as a logical core, by duplicating various parts of the core to support the context of a second application thread. The two hardware threads execute simultaneously within the core, across their dedicated and remaining shared resources. If the two hardware threads do not contend over a particular shared resource, throughput can increase substantially.</p><p>Turbo Boost or Core Performance Boost opportunistically allows the processor to operate beyond its rated base frequency, as long as the processor stays within guidelines set by Intel or AMD. Generally speaking, the higher the frequency, the faster the processor finishes a task.</p>
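<p>Before running an experiment like this, it helps to verify from userspace whether SMT is actually enabled. A minimal sketch for Linux, assuming a kernel recent enough to expose the sysfs SMT control; paths and tool availability vary, so the commands fall back gracefully:</p>

```shell
# 1 = SMT on, 0 = SMT off; the file may be absent on older kernels,
# hence the "unknown" fallback.
smt=$(cat /sys/devices/system/cpu/smt/active 2>/dev/null || echo unknown)
echo "SMT active: $smt"

# "Thread(s) per core: 2" is another indication that SMT is in use.
lscpu 2>/dev/null | grep 'Thread(s) per core' || true
```

<p>Where supported, SMT can be toggled at runtime by writing <code>on</code> or <code>off</code> to <code>/sys/devices/system/cpu/smt/control</code> as root.</p>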
    <div>
      <h2>Simulated Environment</h2>
      <a href="#simulated-environment">
        
      </a>
    </div>
    
    <div>
      <h3>CPU Specification</h3>
      <a href="#cpu-specification">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/STcQJqLbUhlxNm7hYmEc7/b33ed9909e4c56c6db4eb7d8090447dc/image2-12.png" />
            
            </figure><p>Our <a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">Gen X or 10th generation servers</a> are powered by the <a href="https://www.amd.com/en/products/cpu/amd-epyc-7642">AMD EPYC 7642</a>, based on the Zen 2 microarchitecture. The vast majority of Zen 2-based processors, along with those based on its successor Zen 3, which our <a href="/the-epyc-journey-continues-to-milan-in-cloudflares-11th-generation-edge-server/">Gen 11 servers are based on</a>, support simultaneous multithreading and Core Performance Boost.</p><p>Similar to Intel’s Hyper-Threading, AMD implemented 2-way simultaneous multithreading. The AMD EPYC 7642 has 48 cores, and with simultaneous multithreading enabled it can simultaneously execute 96 hardware threads. Core Performance Boost allows the AMD EPYC 7642 to operate anywhere between 2.3 and 3.3 GHz, depending on the workload and limitations imposed on the processor. With Core Performance Boost disabled, the processor operates at 2.3 GHz, the rated base frequency of the AMD EPYC 7642. We took our usual simulated traffic pattern of 10 KiB cached assets over HTTPS, <a href="/keepalives-considered-harmful/">provided by our performance team</a>, to generate a sustained workload that saturated the processor to 100% CPU utilization.</p>
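<p>Core Performance Boost can similarly be inspected from userspace. A sketch assuming the acpi-cpufreq driver, which exposes a global boost toggle; other cpufreq drivers expose different knobs, so treat these paths as illustrative:</p>

```shell
# 1 = boost (CPB) enabled, 0 = disabled; the file only exists with
# certain cpufreq drivers, hence the "unknown" fallback.
boost=$(cat /sys/devices/system/cpu/cpufreq/boost 2>/dev/null || echo unknown)
echo "Boost enabled: $boost"

# Sample CPU0's current operating frequency in kHz; under a sustained
# load with boost enabled this should sit near the turbo frequency.
freq=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq 2>/dev/null || echo unknown)
echo "CPU0 frequency (kHz): $freq"
```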
    <div>
      <h3>Results</h3>
      <a href="#results">
        
      </a>
    </div>
    <p>After establishing a baseline with simultaneous multithreading and Core Performance Boost disabled, we started enabling one feature at a time. When we enabled Core Performance Boost, the processor operated near its peak turbo frequency, hovering between 3.2 and 3.3 GHz, which is more than 39% higher than the base frequency. The higher operating frequency directly translated into 40% additional requests per second. We then disabled Core Performance Boost and enabled simultaneous multithreading. Similar to Core Performance Boost, simultaneous multithreading alone improved requests per second by 43%. Lastly, by enabling both features, we observed an 86% improvement in requests per second.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2NfKzNxMrKrtufGQuYjcHp/483e38c99e370940454158898f7fbe76/image4-8.png" />
            
            </figure><p>Core Performance Boost, simultaneous multithreading, or both generally lowered latencies. While Core Performance Boost consistently maintained a lower latency than the baseline, simultaneous multithreading gradually took longer to process a request as it reached tail latencies. Though not depicted in the figure below, when we examined beyond p9999 or the 99.99th percentile, simultaneous multithreading, even with the help of Core Performance Boost, increased latency sharply, by more than 150% over the baseline, presumably due to the two hardware threads contending over a shared resource within the core.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4aLHnhtg8eOGWoUCBe2g98/2865b9db5d5746d47c510ea6e80eec8f/image7-4.png" />
            
            </figure>
    <div>
      <h2>Production Environment</h2>
      <a href="#production-environment">
        
      </a>
    </div>
    <p>Moving into production, since our <a href="https://radar.cloudflare.com/">traffic fluctuates throughout the day</a>, we took four identical Gen X servers and measured them in parallel during peak hours. The only changes we made to the servers were enabling and disabling simultaneous multithreading and Core Performance Boost to create a comprehensive test matrix. We conducted the experiment in two different regions to identify any anomalies and mismatching trends. All trends were alike.</p><p>Before diving into the results, we should note that the baseline server operated at a higher CPU utilization than the others. Every generation, our servers deliver a noticeable improvement in performance. So our load balancer, named Unimog, <a href="/unimog-cloudflares-edge-load-balancer/">sends a different number of connections to the target server based on its generation</a> to balance out the CPU utilization. When we disabled simultaneous multithreading and Core Performance Boost, the baseline server’s performance degraded to the point where Unimog encountered a “guard rail”, or the lower limit on the requests sent to the server, and so its CPU utilization rose instead. Given that it operated at a higher CPU utilization, the baseline server processed more requests per second to meet the minimum performance threshold.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4f2ViyRNLst20bSKZ5I9fe/4fddc3014f888809e9b0a9f120234004/image8-5.png" />
            
            </figure>
    <div>
      <h3>Results</h3>
      <a href="#results">
        
      </a>
    </div>
    <p>Due to the skewed baseline, when Core Performance Boost was enabled, we only observed 7% additional requests per second. Next, simultaneous multithreading alone improved requests per second by 41%. Lastly, with both features enabled, we saw an 86% improvement in requests per second.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6E6GlehsjYmQmKnpjxgkut/cb7c54930c5d90c16d8ce45ecea429fe/image6-6.png" />
            
            </figure><p>Though we lack concrete baseline data, we can normalize requests per second by CPU utilization to approximate the improvement for each scenario. Once normalized, the estimated improvements in requests per second from Core Performance Boost and simultaneous multithreading were 36% and 80%, respectively. With both features enabled, requests per second improved by 136%.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6OGBo0BJ9POefn6afYw5Mc/97f445ce481673d379a7f52ec2907be6/image5-6.png" />
            
            </figure><p>Latency was not as interesting since the baseline server operated at a higher CPU utilization, and in turn, it produced a higher tail latency than we would have otherwise expected. All other servers maintained a lower latency due to their lower CPU utilization in conjunction with Core Performance Boost, simultaneous multithreading, or both.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4JPkPm4EP6DXEGd99nSyVo/5adb7926ab3a25c02ecf4b7c5e74d14b/image1-10.png" />
            
            </figure><p>At this point, our experiment did not go as we had planned. Our baseline was skewed, and we only got partially useful answers. However, we find experimenting to be important because we usually end up finding other helpful insights as well.</p><p>Let’s add power data. Since our baseline server was operating at a higher CPU utilization, we knew it was serving more requests and therefore consumed more power than it needed to. Enabling Core Performance Boost allowed the processor to run up to its peak turbo frequency, increasing power consumption by 35% over the skewed baseline. More interestingly, enabling simultaneous multithreading increased power consumption by only 7%. Combining Core Performance Boost with simultaneous multithreading resulted in a 58% increase in power consumption.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Qxyi4d9qWZDxc7F09w0bv/73b88a522a15b35d5796478fdb11b6de/image3-6.png" />
            
            </figure><p>AMD’s implementation of simultaneous multithreading appears to be power-efficient, as it achieves 41% additional requests per second while consuming only 7% more power compared to the skewed baseline. For completeness, using the data we have, we combined performance and power to obtain performance per watt and summarize power efficiency. We divided the non-normalized requests per second by power consumption to produce the requests per watt figure below. Our Gen X servers attained the best performance per watt by enabling just simultaneous multithreading.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ZY19SD0eKHUmUo6Zx3bCA/2b0250f752aea47f944a4c6d0b77bf5c/image9-4.png" />
            
            </figure>
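<p>The requests-per-watt figure is simply the ratio described above. A minimal sketch of the calculation, using placeholder numbers rather than our measured data:</p>

```shell
# performance per watt = requests per second / wall power in watts.
# The values below are illustrative placeholders, not measured results.
rps=100000
watts=250
awk -v r="$rps" -v w="$watts" 'BEGIN { printf "%.1f requests/watt\n", r / w }'
# prints: 400.0 requests/watt
```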
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>In our assessment of AMD’s implementation of Hyper-Threading and Turbo Boost, the original experiment we designed to measure requests per second and latency did not pan out as expected. As soon as we entered production, our baseline measurement was skewed due to the imbalance in CPU utilization and only partially reproduced our lab results.</p><p>We added power to the experiment and found other meaningful insights. By analyzing the performance and power characteristics of simultaneous multithreading and Core Performance Boost, we concluded that simultaneous multithreading could be a power-efficient mechanism to attain additional requests per second. Drawbacks of simultaneous multithreading include a long tail latency, which is currently curtailed by enabling Core Performance Boost. While the higher frequency enabled by Core Performance Boost provides latency reduction and more requests per second, we are mindful that the increase in power consumption is quite significant.</p><p>Do you want to help shape the Cloudflare network? This blog was a glimpse of the work we do at Cloudflare. <a href="https://www.cloudflare.com/careers/">Come join us</a> and help complete the feedback loop for our developers and hardware partners.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[Performance]]></category>
            <guid isPermaLink="false">6eb1jfSeTaXx6866LDX8Nl</guid>
            <dc:creator>Sung Park</dc:creator>
        </item>
        <item>
            <title><![CDATA[Designing Edge Servers with Arm CPUs to Deliver 57% More Performance Per Watt]]></title>
            <link>https://blog.cloudflare.com/designing-edge-servers-with-arm-cpus/</link>
            <pubDate>Tue, 27 Jul 2021 12:59:23 GMT</pubDate>
            <description><![CDATA[ Using Arm, Cloudflare can now securely process over ten times as many Internet requests for every watt of power consumed than we did for servers designed in 2013.  ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3LVPZk5aGg9j3ssCFOMVv2/7f1f2d76ba04f023e6f3f86712e3f9f4/Arm-CPUs-1.png" />
            
            </figure><p>Cloudflare has millions of free customers. Not only is it something we’re incredibly proud of in the context of helping to build a better Internet — but it’s something that has made the Cloudflare service measurably better. One of the ways we’ve benefited is that it’s created a very strong imperative for Cloudflare to maintain a network that is as efficient as possible. There’s simply no other way to serve so many free customers.</p><p>In the spirit of this, we are very excited about the latest step in our energy-efficiency journey: turning to Arm for our server CPUs. It has been a long journey getting here — we first started testing Arm CPUs all the way back in <a href="/arm-takes-wing/">November 2017</a>. It’s only recently, however, that the scale of the energy-efficiency improvement from Arm has become clear. Our first deployment of an Arm-based CPU, designed by Ampere, was earlier this month – July 2021.</p><p>Our most recently deployed generation of edge servers, Gen X, used AMD Rome CPUs. Compared with that, the newest Arm-based CPUs process an incredible <b>57% more Internet requests</b> per watt. While AMD has a sequel, Milan (which Cloudflare will also be deploying), it doesn’t achieve the same degree of energy efficiency that the Arm processor does — managing only 39% more requests per watt than the Rome CPUs in our existing fleet. As Arm-based CPUs become more widely deployed, and our software is further optimized to take advantage of the Arm architecture, we expect further improvements in the energy efficiency of Arm servers.</p><p>Using Arm, Cloudflare can now securely process over ten times as many Internet requests for every watt of power consumed than we did for servers designed in <a href="/a-tour-inside-cloudflares-latest-generation-servers/">2013</a>.</p><p>(In the graphic below, for 2021, the perforated data point refers to x86 CPUs, whereas the bold data point refers to Arm CPUs)</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3v5MYy7AdUHaWlj6rqnYqv/fdf0d4b3cd30df28b3e544ebfeb17524/image1-26.png" />
            
            </figure><p>As Arm server CPUs demonstrate their performance and become more widely deployed, we hope this will inspire x86 CPU manufacturers (such as Intel and AMD) to urgently take energy efficiency more seriously. This is especially important since, worldwide, x86 CPUs continue to represent the vast majority of global data center energy consumption.</p><p>Together, we can reduce the carbon impact of Internet use. The environment depends on it.</p> ]]></content:encoded>
            <category><![CDATA[Impact Week]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Sustainability]]></category>
            <category><![CDATA[Hardware]]></category>
            <guid isPermaLink="false">7nfv93CtdrvizCQW4AMpVD</guid>
            <dc:creator>Nitin Rao</dc:creator>
            <dc:creator>James Allworth</dc:creator>
            <dc:creator>Sung Park</dc:creator>
        </item>
        <item>
            <title><![CDATA[ARMs Race: Ampere Altra takes on the AWS Graviton2]]></title>
            <link>https://blog.cloudflare.com/arms-race-ampere-altra-takes-on-aws-graviton2/</link>
            <pubDate>Thu, 11 Mar 2021 12:00:00 GMT</pubDate>
            <description><![CDATA[ A comparison between the Ampere Altra and the AWS Graviton2, the two ARM Neoverse N1-based processors. ]]></description>
            <content:encoded><![CDATA[ <p>Over three years ago, we embraced the ARM ecosystem after <a href="/arm-takes-wing/">evaluating the Qualcomm Centriq</a>. The Centriq and its Falkor cores delivered a significant reduction in power consumption while maintaining a comparable performance against the processor that was powering our server fleet at the time. By the time we completed <a href="/porting-our-software-to-arm64/">porting our software stack to be compatible with ARM</a>, Qualcomm decided to <a href="https://www.theregister.com/2018/12/10/qualcomm_layoffs/">exit the server business</a>. Since then, we have been waiting for another server-grade ARM processor with hopes of improving our power efficiency across our <a href="https://www.cloudflare.com/network/">global network</a>, which now spans more than 200 cities in over 100 countries.</p><p>ARM has introduced the Neoverse N1 platform, the blueprint for creating power-efficient processors licensed to institutions that can customize the original design to meet their specific requirements. Ampere licensed the Neoverse N1 platform to create the Ampere Altra, a processor that allows companies that own and manage their own fleet of servers, like ourselves, to take advantage of the expanding ARM ecosystem. We have been working with Ampere to determine whether the Altra is the right processor to power our first generation of ARM edge servers.</p><p>The AWS Graviton2 is the only other Neoverse N1-based processor publicly accessible, though only through Amazon’s cloud product portfolio. We wanted to understand the differences between the two, so we compared Ampere’s single-socket server, named Mt. Snow, equipped with the Ampere Altra Q80-30, against an EC2 instance of the AWS Graviton2.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4gIm083qtHDE0lsyNQ9trI/871b344d6f1a3abc480c43ebec66973b/image22.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4p4UD5vZMIdLhnicWpFoNn/50f867f2a2841f22a9f0468594417926/image13-1.png" />
            
            </figure><p>The Mt. Snow 1P server equipped with the Ampere Altra Q80-30</p><p>The Ampere Altra and the AWS Graviton2 are both based on ARM’s Neoverse N1 platform, manufactured on the TSMC 7nm process. The N1 reference core features an 11-stage out-of-order execution pipeline along with the <a href="https://www.arm.com/solutions/infrastructure">following specifications</a> in ARM nomenclature.</p>
<table>
<thead>
  <tr>
    <th>Pipeline Stage</th>
    <th>Width</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Fetch</td>
    <td>4 instructions/cycle</td>
  </tr>
  <tr>
    <td>Decode</td>
    <td>4 instructions/cycle</td>
  </tr>
  <tr>
    <td>Rename</td>
    <td>4 Mops/cycle</td>
  </tr>
  <tr>
    <td>Dispatch</td>
    <td>8 µops/cycle</td>
  </tr>
  <tr>
    <td>Issue</td>
    <td>8 µops/cycle</td>
  </tr>
  <tr>
    <td>Commit</td>
    <td>8 µops/cycle</td>
  </tr>
</tbody>
</table>
<p>An <a href="https://developer.arm.com/ip-products/processors/neoverse/neoverse-n1">N1 core</a> contains 64KiB of L1 instruction and 64KiB of L1 data caches and a dedicated L2 cache that can be implemented with up to 1MiB per core. The L3 cache or System Level Cache can be implemented with up to 256MiB (up to 4MiB per slice, up to 64 slices) shared by all cores.</p>

<table>
<thead>
  <tr>
    <th>Cache Hierarchy</th>
    <th>Capacity</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>L1i Cache</td>
    <td>64 KiB</td>
  </tr>
  <tr>
    <td>L1d Cache</td>
    <td>64 KiB</td>
  </tr>
  <tr>
    <td>L2 Cache</td>
    <td>Up to 1 MiB</td>
  </tr>
  <tr>
    <td>L3 Cache</td>
    <td>Up to 256 MiB</td>
  </tr>
</tbody>
</table>    
<p>A pair of N1 cores can optionally be combined into a dual-core configuration through the Component Aggregation Layer, then placed inside the mesh interconnects named the Coherent Mesh Network or <a href="https://developer.arm.com/ip-products/system-ip/corelink-interconnect/corelink-coherent-mesh-network-family/corelink-cmn-600">CMN-600</a> that supports up to 64 pairs of cores, totaling 128 cores.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/13WYF0ug8Fll9whw3hAYQc/b2120a5ab9238d7fb88a45a9216bc955/image14-2.png" />
            
            </figure><p>Component Aggregation Layer (CAL) supports up to two N1 cores</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/38oEPfzBX5YmBmZPvDfE4q/732ba4ed273fabe97a548dd6604d0ed9/image7-3.png" />
            
            </figure><p>Coherent Hub Interface (CHI) is used to connect components to the Mesh Cross Point (XP)</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2xr86zntLVgCyTRqPOyZt6/a1662c7cb04adddf99c1c754a318783f/image17-1.png" />
            
            </figure><p>Coherent Mesh Network (CMN-600) can be scaled up to 8x8 XPs supporting up to 128 cores</p>
    <div>
      <h2>System Specifications</h2>
      <a href="#system-specifications">
        
      </a>
    </div>
    
<table>
<thead>
  <tr>
    <th></th>
    <th>AWS (Annapurna Labs)</th>
    <th>Ampere</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Processor Model</td>
    <td>AWS Graviton2</td>
    <td>Ampere Altra Q80-30</td>
  </tr>
  <tr>
    <td>ISA</td>
    <td>ARMv8.2</td>
    <td>ARMv8.2+</td>
  </tr>
  <tr>
    <td>Core / Thread Count</td>
    <td>64C / 64T</td>
    <td>80C / 80T</td>
  </tr>
  <tr>
    <td>Operating Frequency</td>
    <td>2.5 GHz</td>
    <td>3.0 GHz</td>
  </tr>
  <tr>
    <td>L1 Cache Size</td>
    <td>8 MiB (64 x 128 KiB)</td>
    <td>10 MiB (80 x 128 KiB)</td>
  </tr>
  <tr>
    <td>L2 Cache Size</td>
    <td>64 MiB (64 x 1 MiB)</td>
    <td>80 MiB (80 x 1 MiB)</td>
  </tr>
  <tr>
    <td>L3 Cache Size</td>
    <td>32 MiB</td>
    <td>32 MiB</td>
  </tr>
  <tr>
    <td>System Memory</td>
    <td>256 GB (8 x 32 GB DDR4-3200)</td>
    <td>256 GB (8 x 32 GB DDR4-3200)</td>
  </tr>
</tbody>
</table>
<p>We can verify that the two processors are based on the Neoverse N1 platform by checking <code>CPU part</code> inside <code>cpuinfo</code>, which returns <code>0xd0c</code>, the <a href="https://developer.arm.com/documentation/100616/0301/register-descriptions/aarch64-system-registers/midr-el1--main-id-register--el1">designated part number for the Neoverse N1</a>.</p>
            <pre><code>sung@ampere-altra:~$ cat /proc/cpuinfo | grep 'CPU part' | head -1
CPU part	: 0xd0c</code></pre>
            
            <pre><code>sung@aws-graviton2:~$ cat /proc/cpuinfo | grep 'CPU part' | head -1
CPU part	: 0xd0c</code></pre>
            <p>Ampere backported various features into the Ampere Altra, notably <a href="https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability">speculative side-channel attack mitigations</a> that include Meltdown and Spectre (variants 1 and 2) from the ARMv8.5 instruction set architecture, giving Altra the “+” designation in its ISA (ARMv8.2+).</p><p>The Ampere Altra has 25% more physical cores and a 20% higher operating frequency than the AWS Graviton2. Because of Altra’s higher core count and frequency, we should expect the Ampere Altra to perform up to 50% better (1.25 × 1.20 = 1.50) than the AWS Graviton2 in compute-bound workloads.</p><p>Both Ampere and AWS decided to implement 1MiB of L2 cache per core, but only 32MiB of L3 or System Level Cache, although the CMN-600 specification allows up to 256MiB. This leaves the Ampere Altra with a lower L3 cache-per-core ratio than the AWS Graviton2 and places Altra at a potential disadvantage in situations where the <a href="/impact-of-cache-locality/">application’s working set does not fit within the L3 cache</a>.</p>
    <div>
      <h2>Methodology</h2>
      <a href="#methodology">
        
      </a>
    </div>
    <p>We imaged both systems with Debian Buster and downloaded our open-source benchmark suite, <a href="https://github.com/cloudflare/cf_benchmark">cf_benchmark</a>. This benchmark suite executes 49 different workloads over approximately 15 to 30 minutes. Each workload utilizes a library that has either been used in our stack or been considered at one point or another. Due to the relatively short duration of the benchmark and the libraries called by the workloads, we run cf_benchmark as our smoke test on any new system brought in for evaluation. We recommend cf_benchmark to our technology partners when we discuss new hardware, primarily processors, to ensure these workloads can compile and run successfully.</p><p>We ran cf_benchmark multiple times and observed no significant run-to-run variations. Similar to industry standard benchmarks, we calculated the overall score using the geometric mean, to provide a more conservative result than the arithmetic mean across all 49 workloads, and category scores using their respective subsets of workloads.</p><p>In single-core scenarios, cf_benchmark spawns a single application thread, occupying a single hardware thread. In multi-core scenarios, cf_benchmark spawns as many application threads as needed until all hardware threads on the processor are occupied. Neither processor implements <a href="https://en.wikipedia.org/wiki/Simultaneous_multithreading">simultaneous multithreading</a>, so each physical core can only execute one hardware thread at a time.</p>
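<p>The geometric mean can be reproduced with standard tools as the exponential of the mean of the logarithms. A sketch over made-up per-workload scores (not our actual results):</p>

```shell
# Geometric mean = exp(mean(log(x))); unlike the arithmetic mean, one
# outlier workload cannot dominate the overall score.
printf '%s\n' 1.20 0.95 1.50 1.10 |
awk '{ s += log($1); n++ } END { printf "%.2f\n", exp(s / n) }'
# prints: 1.17
```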
    <div>
      <h2>Overall Performance Results</h2>
      <a href="#overall-performance-results">
        
      </a>
    </div>
    <p>Given its advantage in both operating frequency and sheer core count, the Ampere Altra took the lead in both single and multi-core performance.</p><p>In single-core performance, the Ampere Altra outperformed the AWS Graviton2 by 16%. The difference in operating frequencies between the two processors gave the Ampere Altra a proportional advantage of up to 20% in more than half of our single-core workloads.</p><p>In multi-core performance, the Ampere Altra continued to maintain its edge over the AWS Graviton2 by 31%. We expected the Ampere Altra to attain a theoretical advantage between 20% and 50%, given Altra’s higher operating frequency and core count while taking the inherent overheads associated with increasing core count, primarily scaling out the mesh network, into consideration. We observed that the majority of our multi-core workloads fell within this 20% to 50% range.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7hR6z2vf42BIvNqGDe7Fzo/d5334d860decb6ccc04213ba88016302/image27.png" />
            
            </figure>
    <div>
      <h2>Categorized Results</h2>
      <a href="#categorized-results">
        
      </a>
    </div>
    <p>In OpenSSL and LuaJIT, the Ampere Altra scaled proportionally with its operating frequency in single-core performance and continued to scale out with its core count in multi-core performance. The Ampere Altra maintained its lead in the vast majority of compression and Golang workloads with the exception of Brotli level 9 compression and some Golang multi-core workloads that were not able to keep the two processors busy at 100% CPU utilization.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7rgDArSL0EoMlSPBYXOPX0/5c85608e43ddc417565b18b12ead1fb7/image9-2.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Xk9KE86pFwyLKTumNkXG6/4713fc65624711bea6a0d21fb0ccc8a8/image29.png" />
            
            </figure>
    <div>
      <h2>OpenSSL</h2>
      <a href="#openssl">
        
      </a>
    </div>
    <p>As described <a href="/arm-takes-wing/">previously</a>, we use a fork of OpenSSL, named BoringSSL, in places such as <a href="https://www.cloudflare.com/learning/ssl/what-happens-in-a-tls-handshake/">TLS handshake</a> handling. The original OpenSSL comes with a built-in benchmark called <code>speed</code> that measures single and multi-core performance. The asymmetric and <a href="/it-takes-two-to-chacha-poly/">symmetric</a> crypto workloads found here saturated the processors at 100% CPU utilization. Altra’s performance aligned with our expectations throughout this entire category, maintaining a consistent 20% advantage over the AWS Graviton2 in single-core performance and continuing to scale with its core count to perform, on average, 47% better in multi-core performance.</p>
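<p>The same benchmark can be reproduced on any machine with a stock OpenSSL install. A sketch of the single and multi-core invocations; the algorithm choice is illustrative, and the sampling window is shortened here for brevity:</p>

```shell
# Single-core asymmetric crypto benchmark (RSA-2048 sign/verify),
# sampling for 1 second per operation instead of the default 10.
openssl speed -seconds 1 rsa2048

# Multi-core run: fork one worker per available hardware thread.
openssl speed -seconds 1 -multi "$(nproc)" rsa2048
```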
    <div>
      <h3>Public Key Cryptography</h3>
      <a href="#public-key-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6WTE3GQC2F1PJGKCOJlIBn/872241c8d731b3a23ba4fa1a6be58c42/image4-9.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/WyA8y24VhaTVc6KSGMwgA/94b04a984a96434c5e27f580eae4781b/image6-4.png" />
            
            </figure>
    <div>
      <h3>Symmetric Key Cryptography</h3>
      <a href="#symmetric-key-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/IAWDbmAmeDB398TEqlFz7/ab9e3db71dd253ebb31f86bdc294bd07/image3-10.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Vjjz0U0no96OWEUoPOyrL/43cc9a669378a0acc7b811d05def6848/image28.png" />
            
            </figure>
    <div>
      <h2>LuaJIT</h2>
      <a href="#luajit">
        
      </a>
    </div>
    <p>We commonly describe LuaJIT as the <a href="/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/">glue that holds Cloudflare together</a>. LuaJIT has a <a href="/pushing-nginx-to-its-limit-with-lua/">long history</a> here at Cloudflare for the same reason it is widely used in the videogame industry, where performance and responsiveness are non-negotiable requirements.</p><p>Although we have been transitioning from LuaJIT <a href="/building-fast-interpreters-in-rust/">to Rust</a>, we welcomed the <a href="https://developer.arm.com/documentation/100616/0400/functional-description/level-1-memory-system/cache-behavior/instruction-cache-coherency?lang=en">instruction cache coherency</a> implemented in the Ampere Altra. Instruction cache coherency addresses the <a href="https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/caches-and-self-modifying-code">drawbacks associated with LuaJIT</a>, such as its self-modifying behavior, where the next instruction to be fetched resides not in the instruction cache but in the data cache. With this feature enabled, the instruction cache becomes coherent with, or aware of, changes made in the data cache. The L2 cache becomes inclusive of the L1 instruction cache, meaning any cache line found in the L1 instruction cache should also be present in the L2 cache. Furthermore, the L2 cache intercepts any invalidations or stores made to its cache lines and propagates those changes up to the L1 instruction cache.</p><p>The Ampere Altra performed well in single-core performance except for fasta, where both processors stalled at the front-end of the pipeline. In the multi-core version of fasta, Altra did not spend as many cycles stalled in the front-end, likely due to Ampere's implementation of instruction cache coherency. All other workloads continued to scale in multi-core runs, with binary trees coming in 73% ahead. Spectral was omitted since the multi-core variant failed to run on both servers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1PTr1VFKBOtT3OJa2p1OIL/6c89fe575583d30dc7758ac071853be9/image25.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1v8ZAjMln1cgNxC74M6rFH/f64281063db5deafcdb2c0df3d2f12cb/image18-1.png" />
            
            </figure>
    <div>
      <h2>Compression</h2>
      <a href="#compression">
        
      </a>
    </div>
    <p><a href="/results-experimenting-brotli/">Brotli and Gzip</a> are the two primary types of compression we use at Cloudflare. Highly efficient compression algorithms help us strike a balance between spending more resources on a faster processor and provisioning larger storage capacity, while also letting us transfer content faster. The Ampere Altra performed well on both compression types with the exception of Brotli level 9 and Gzip level 4 multi-core performance.</p>
    <div>
      <h3>Brotli</h3>
      <a href="#brotli">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3XaMt4Zk04Sci4xEwYbEfZ/452e4a8fdd91eb329f72d5931ebf6fd4/image2-6.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/8HNEM6fNB1ZHdJBi9WXwq/0ac8c02525138234f57138f63785ccf7/image30.png" />
            
            </figure><p>Out of the entire benchmark suite, Brotli level 9 produced the largest delta between the Ampere Altra and the AWS Graviton2. Although we do not use <a href="/arm-takes-wing/">Brotli level 7 and above</a> when performing dynamic compression, we decided to investigate further.</p><p>From historical observations, we have noticed that most processors tend to experience significant performance degradation at level 9. Compression workloads generally contain a higher mix of branch instructions, and we found approximately 15% to 20% of the dynamic instructions to be branches. Intuitively, we first looked at the misprediction rate, since a high miss rate quickly adds up penalty cycles. To our surprise, however, we found the misprediction rate at level 9 to be very low.</p>
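<p>To see why we looked at mispredictions first, the expected cost can be sketched as branch mix × miss rate × per-miss penalty. Only the 15-20% branch mix comes from our measurements; the miss rate and penalty below are illustrative assumptions:</p>

```python
# Rough model of cycles lost to branch mispredictions per instruction.
# The ~20% branch mix is measured; the miss rate and per-miss penalty
# are illustrative assumptions, not measured values.
branch_mix = 0.20      # fraction of dynamic instructions that are branches
miss_rate = 0.01       # assumed misprediction rate (measured rate was very low)
penalty_cycles = 15    # assumed pipeline-flush cost per mispredicted branch

stall_cycles_per_instr = branch_mix * miss_rate * penalty_cycles
print(f"{stall_cycles_per_instr:.3f} stall cycles per instruction")  # 0.030
```

<p>Even at a 1% miss rate, this model charges only about 0.03 stall cycles per instruction, consistent with mispredictions not being the level 9 bottleneck.</p>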
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/386axQVNM8iSnFqudd6Ogm/05018c18024e803c3cd0b9119dec2ecc/image15-3.png" />
            
            </figure><p>By analyzing our dataset further, we found that the common underlying cause appeared to be the high number of page faults incurred at level 9. Ampere has demonstrated that increasing the page size from 4 KiB to 64 KiB alleviates the bottleneck and brings the Ampere Altra to parity with the AWS Graviton2. We plan to experiment with larger page sizes as we continue to evaluate Altra.</p>
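<p>The effect of larger pages is easy to quantify: for a fixed working set, 64 KiB pages cut the number of pages, and hence the worst-case number of faults needed to populate them, by 16x. The working-set size below is a made-up figure for illustration:</p>

```python
# How a larger page size shrinks the page count for a fixed working set.
# The 256 MiB working set is a hypothetical figure for illustration.
working_set = 256 * 1024 * 1024   # bytes

pages_4k = working_set // (4 * 1024)
pages_64k = working_set // (64 * 1024)

print(f"4 KiB pages:  {pages_4k}")                 # 65536
print(f"64 KiB pages: {pages_64k}")                # 4096
print(f"reduction:    {pages_4k // pages_64k}x")   # 16x
```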
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3DKo7g8rI17OWj0EYTaajg/0fd048aa115e975a5596ea6d4484e7c3/image16-1.png" />
            
            </figure>
    <div>
      <h3>Gzip</h3>
      <a href="#gzip">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Ysrj5aCW6xRAhfvaAU18x/6a4097d080255082d82f4e500b115850/image1-12.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Kq3MvWAn8SU4x6aMtXB70/d8ed96d3a91d806b2169533e3c3d320f/image19-1.png" />
            
            </figure>
    <div>
      <h2>Golang</h2>
      <a href="#golang">
        
      </a>
    </div>
    <p>LuaJIT, Rust, C++, and Go are some of our <a href="/go-at-cloudflare/">primary languages</a>; the Golang workloads found here include <a href="/go-crypto-bridging-the-performance-gap/">cryptography</a>, compression, regular expressions, and string manipulation. The Ampere Altra performed proportionally to its operating frequency against the AWS Graviton2 in the vast majority of single-core tests. In multi-core tests, the workloads that could not saturate the processors to 100% CPU utilization did not scale proportionally to the two processors’ respective operating frequencies and core counts.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57IXPa7WCSeZKgL1uxFnwZ/1464a91c07a9d37f2f54a5c2a40cf4e3/image11-1.png" />
            
            </figure>
    <div>
      <h3>Go Cryptography</h3>
      <a href="#go-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2sFiAVlYH3Yf7HRMHcs4lD/5d5767056e6bbdfcb1806a9a64837896/image8-3.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4MtZ4a8jzSBvuILTK63K5N/bee30dc62c8c19906b025be68093a6c0/image23.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5hUMrhpmi50JSnqTFSHCEI/f414af6ea4350c6001d3215414c476d9/image26.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Q4pFSFuq2vIdFqXRZTDuc/a150996a34937b5dc6a6abb03f8e6b81/image21-6.png" />
            
            </figure>
    <div>
      <h3>Go Compression &amp; String</h3>
      <a href="#go-compression-string">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6eO7U4KLBUdakYPkZAMDEm/a9a49d5d059be39825c2492876e612c8/image24.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2jfNL4hgXTgikqd1xxKGS/c9bd212d89e2a34f64a71bc076323815/image10-3.png" />
            
            </figure>
    <div>
      <h3>Go Regex</h3>
      <a href="#go-regex">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6hmpFGmyyhgx9cnCkTtbeJ/237481ec63e1f82fbdf42e986b9ed3a1/image12-1.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5NNwnydydlXjw9iTomiML6/bc6e5bd6a72555f2db48c56b54b76a74/image20.png" />
            
            </figure>
    <div>
      <h2>Frequency, Power &amp; Thermals</h2>
      <a href="#frequency-power-thermals">
        
      </a>
    </div>
    <p>We were unable to collect any telemetry on the AWS Graviton2: power would have been an interesting data point, but the EC2 instance exposed neither frequency nor power sensors. The Ampere model we tested was the Altra Q80-30, rated at an operating frequency of 3.0GHz and a thermal design power (TDP) of 210W.</p><p>The Ampere Altra maintained a sustained operating frequency throughout cf_benchmark, while Altra’s dynamic voltage and frequency scaling (DVFS) mechanism occasionally lowered the operating frequency between workloads to reduce power consumption. Package power varied depending on the workload, but Altra never consumed anywhere near its rated TDP, and the default fan curve kept the temperature below 70C. At the system level, by simply querying the baseboard management controller over <a href="https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface">IPMI</a>, we observed that the Ampere server consumed no more than 300W. More power-efficient servers translate directly into more <a href="/an-introduction-to-three-phase-power-and-pdus/">flexibility over how we can populate our racks</a>, and we hope this trend will continue in production.</p>
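<p>To illustrate that rack-planning flexibility: the 300 W per-server ceiling is the figure we observed over IPMI, while the per-rack power budget below is purely hypothetical:</p>

```python
# Servers per rack under a power budget. The 300 W per-server draw is the
# ceiling observed over IPMI; the 12 kW rack budget is a hypothetical
# figure for illustration only.
rack_budget_w = 12_000
server_draw_w = 300

servers_per_rack = rack_budget_w // server_draw_w
print(f"up to {servers_per_rack} servers per rack")  # 40
```

<p>Every watt shaved off a server's worst-case draw raises that ceiling, which is why sustained package power matters as much as peak performance to us.</p>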
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/p2T05N3TKr8JDSNhHYs89/192ec5af3153877a60a39acceea1f7f4/image31.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/12tiKqNAFC8IK9o76MiJ6F/62186a14aed6c406d84553a7ec14fc55/image5-10.png" />
            
            </figure>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>In our top-down evaluation, the Ampere Altra largely performed as we had expected throughout our benchmarking suite. We regularly observed the two Neoverse N1 processors perform to their respective operating frequencies and core counts, with the Ampere Altra coming out on top in these key areas over the AWS Graviton2. The Ampere Altra yielded better single-core performance due to its higher operating frequency, and multiplied that advantage in multi-core performance by scaling out with a higher core count. We also found our Ampere evaluation server consuming less power than we had anticipated and hope this trend will continue in production.</p><p>Throughout this initial set of testing and learning more about the underlying architecture, we found many aspects interesting when compared against contemporary x86 counterparts, such as the absence of simultaneous multithreading and dynamic frequency boost. We look forward to seeing which design philosophy makes more sense for the cloud. We are currently assessing Altra’s performance against our fleet of servers and collaborating with Ampere to define an ARM edge server based on Altra that could be deployed at scale.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Partners]]></category>
            <guid isPermaLink="false">2687iQec2rCYsiH8d1qEzz</guid>
            <dc:creator>Sung Park</dc:creator>
        </item>
        <item>
            <title><![CDATA[Gen X Performance Tuning]]></title>
            <link>https://blog.cloudflare.com/gen-x-performance-tuning/</link>
            <pubDate>Thu, 27 Feb 2020 14:00:00 GMT</pubDate>
            <description><![CDATA[ We have partnered with AMD to get the best performance out of this processor and today, we are highlighting our tuning efforts that led to an additional 6% performance. Thermal design power (TDP) and dynamic power, amongst others, play a critical role when tuning a system.  ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6TT8W5Et00ZvvobGvrgMEY/1424f7ef2b4d6cc216b4a7ababc750ee/image7-3.png" />
            
            </figure><p>We are using the AMD 2nd Gen EPYC 7642 for our <a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">tenth generation “Gen X” servers</a>. We found <a href="/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/">many aspects of this processor compelling</a>, such as its <a href="/impact-of-cache-locality/">increase in performance due to its frequency bump and cache-to-core ratio</a>. We have partnered with AMD to get the best performance out of this processor, and today we are highlighting the tuning efforts that led to an additional 6% performance.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/616TtlNrwVkyeHby6ReKSD/8127c8a658986889c1d54810ecd9433f/image14.png" />
            
            </figure>
    <div>
      <h2>Thermal Design Power &amp; Dynamic Power</h2>
      <a href="#thermal-design-power-dynamic-power">
        
      </a>
    </div>
    <p>Thermal design power (TDP) and dynamic power, amongst others, play a critical role when tuning a system. Many share a common belief that thermal design power is the maximum or average power drawn by the processor. The 48-core <a href="https://www.amd.com/en/products/cpu/amd-epyc-7642">AMD EPYC 7642</a> has a TDP rating of 225W, which is just as high as the 64-core <a href="https://www.amd.com/en/products/cpu/amd-epyc-7742">AMD EPYC 7742</a>. Intuitively, fewer cores should translate into lower power consumption, so why is the AMD EPYC 7642 expected to draw just as much power as the AMD EPYC 7742?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/420qeX6xoS09gLZaWH9F8p/051cc471d235fce238703d227f65a5f7/image10-2.png" />
            
            </figure><p><i>TDP comparison between the EPYC 7642, EPYC 7742, and top-end EPYC 7H12</i></p><p>Let’s take a step back and understand that TDP does not always mean the maximum or average power that the processor will draw. At a glance, TDP may provide a good estimate of the processor’s power draw, but TDP is really about how much heat the processor is expected to generate. TDP should be used as a guideline for designing cooling solutions with appropriate thermal capacitance. The cooling solution is expected to dissipate heat up to the TDP indefinitely; in turn, this helps chip designers determine a power budget and create new processors around that constraint. In the case of the AMD EPYC 7642, the extra power budget was spent on retaining all 256 MiB of its L3 cache and letting the cores operate at a higher sustained frequency as needed during our peak hours.</p><p>The overall power drawn by the processor depends on many different factors; dynamic or active power establishes the relationship between power and frequency. Dynamic power is a function of the chip’s capacitance, supply voltage, operating frequency, and activity factor. The activity factor depends on the characteristics of the workloads running on the processor, and different workloads have different characteristics. Some examples include <a href="/cloudflare-workers-unleashed/">Cloudflare Workers</a> or <a href="/seamless-remote-work-with-cloudflare-access/">Cloudflare for Teams</a>; the hotspots in these or any other programs will exercise different parts of the processor, affecting the activity factor.</p>
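<p>The textbook form of that relationship is P_dyn ≈ α·C·V²·f, where α is the activity factor, C the switched capacitance, V the supply voltage, and f the frequency. A small sketch with made-up constants shows why voltage and frequency together dominate the power bill:</p>

```python
# Dynamic power model: P = activity * capacitance * voltage^2 * frequency.
# All constants here are made-up illustrative values, not EPYC 7642 data.
def dynamic_power(activity, capacitance_f, voltage_v, freq_hz):
    return activity * capacitance_f * voltage_v ** 2 * freq_hz

base = dynamic_power(0.5, 1e-9, 1.0, 2.3e9)   # base clock at nominal voltage
boost = dynamic_power(0.5, 1e-9, 1.1, 3.3e9)  # boosted clock needs more voltage

print(f"boost/base power ratio: {boost / base:.2f}")
```

<p>Because voltage enters squared, a modest frequency boost that also requires a voltage bump raises dynamic power much faster than the frequency gain alone would suggest.</p>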
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/12AEwpVHIGoT9XGcqHo83V/319acd3abb48e14d48d5defd4dc8a0c4/image-7.png" />
            
            </figure>
    <div>
      <h2>Determinism Modes &amp; Configurable TDP</h2>
      <a href="#determinism-modes-configurable-tdp">
        
      </a>
    </div>
    <p>The latest AMD processors continue to implement AMD Precision Boost, which opportunistically allows the processor to operate at a higher frequency than its base frequency. How much higher the frequency can go depends on many different factors such as electrical, thermal, and power limitations.</p><p>AMD offers a knob known as determinism modes. This knob affects power and frequency, placing emphasis on one over the other depending on the determinism selected. <a href="https://www.amd.com/system/files/2017-06/Power-Performance-Determinism.pdf">There is a white paper</a> posted on AMD's website that goes into the nuanced details of determinism, and remembering how frequency and power are related, this was the simplest definition I took away from the paper:</p><p><b>Performance Determinism</b> - Power is a function of frequency.</p><p><b>Power Determinism</b> - Frequency is a function of power.</p><p>Another knob available to us is Configurable Thermal Design Power (cTDP), which allows the end user to reconfigure the factory-default thermal design power. The AMD EPYC 7642 is rated at 225W; however, AMD has given us guidance that this particular part can be reconfigured up to 240W. As mentioned previously, the cooling solution must support up to the TDP to avoid throttling. We did our due diligence and verified that our cooling solution can reliably dissipate up to 240W, even at higher ambient temperatures.</p><p>We gave these two knobs a try and got the results shown in the figures below, using 10 KiB web assets served over HTTPS <a href="/author/ivan/">provided by our performance team</a> over a sustained period of time to properly heat up the processor. The figures are broken down into the average operating frequency across all 48 cores, total package power drawn from the socket, and the highest reported die temperature out of the 8 dies laid out on the AMD EPYC 7642. Each figure compares the results we obtained from power and performance determinism, and finally cTDP at 240W using power determinism.</p><p>Performance determinism clearly emphasized stabilizing operating frequency by matching the frequency of the lowest-performing core over time. This mode appears ideal for tuners who value predictability over maximizing the processor’s operating frequency; in other words, the number of cycles the processor has at its disposal every second should be predictable. This is useful if two or more cores share data dependencies, since allowing the cores to work in unison can prevent one core from stalling another. Power determinism, on the other hand, maximized power and frequency as much as it could.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2yiJbib8HJNMkMZc7On1K4/19411800792deb9713cc65c943fc232c/image12.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KXAIrxO2pCwpDAMlHyKW/49769b6a78d78577c13ea0109d4410ac/image11-2.png" />
            
            </figure><p>Heat generated by power draw can be compounded further by ambient temperature. All components should stay within the safe operating temperatures specified by the vendor’s datasheet at all times. If unsure, reach out to your vendor as soon as possible. We put the AMD EPYC 7642 through several thermal tests and determined that it will operate within safe temperatures. Before diving into the figure below, it is worth mentioning that our fan speed ramped up over time, preventing the processor from reaching critical temperature; in other words, our cooling solution, a combination of fans and a heatsink rated for 240W, worked as intended. We have yet to see the processors throttle in production. The figure below shows the highest temperature of the 8 dies laid out on the AMD EPYC 7642.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/S2zzgzeBBeyhzBXxkU1Zz/8c7880845e49ceec2cf761265724d046/image8-3.png" />
            
            </figure><p>An unexpected byproduct of performance determinism was frequency jitter. It took longer for performance determinism to reach a steady-state frequency, which contradicted the predictable performance the mode was meant to achieve.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6LK2tZgnsa8m86vPQYdjrS/87474e0b887e438d97254fde28b2f8ac/image9-1.png" />
            
            </figure><p>Finally, here are the deltas from production, with performance determinism and the factory-default TDP of 225W as the baseline. Deltas under 10% can be influenced by a wide variety of factors, especially in production; however, none of our data showed a negative trend. Here are the averages.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ZT2L9MokepRw2GmZbKuxE/f3facacec8efefc02cf704137f09930e/image-5.png" />
            
            </figure>
    <div>
      <h2>Nodes Per Socket (NPS)</h2>
      <a href="#nodes-per-socket-nps">
        
      </a>
    </div>
    <p>The AMD EPYC 7642 <a href="/impact-of-cache-locality/">physically lays out its 8 dies across 4 quadrants on a single package</a>. With such a layout, AMD supports dividing the dies into NUMA domains, a feature called Nodes Per Socket or NPS. The available NPS options differ model-to-model; the AMD EPYC 7642 supports 4, 2, and 1 node(s) per socket. We thought this option was worth exploring even though none of these choices yields a shared last-level cache. We did not observe any significant performance deltas across the NPS options.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7aTakzcywLXCq9cVbJUapE/99073a5a3e205037b7561c641258091f/image15.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/26jjqqzj8LFJecjFpq8ucU/4306eb572afd428965162fbe454172c4/image13-1.png" />
            
            </figure>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Tuning the processor to yield an additional 6% throughput, or requests per second, out of the box has been a great start. Some parts will have more room for improvement than others. In production we were able to achieve an additional 2% requests per second with power determinism, and we achieved another 4% by reconfiguring the TDP to 240W. We will continue to work with AMD as well as internal teams within Cloudflare to identify additional areas of improvement. If you like the idea of supercharging the Internet on behalf of our customers, then <a href="https://www.cloudflare.com/careers/jobs/">come join us</a>.</p> ]]></content:encoded>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Gen X]]></category>
            <guid isPermaLink="false">5S0eOaziY4ST5v0PscTOAW</guid>
            <dc:creator>Sung Park</dc:creator>
        </item>
        <item>
            <title><![CDATA[Impact of Cache Locality]]></title>
            <link>https://blog.cloudflare.com/impact-of-cache-locality/</link>
            <pubDate>Wed, 26 Feb 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ Our Gen X servers process more requests per second per core than our previous fleet. The AMD 2nd Gen EPYC 7642 processor’s large L3 cache minimizes L3 cache misses, and the time-savings add up quickly. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>In the past, we didn't have the opportunity to evaluate as many CPUs as we do today. The hardware ecosystem was simple: Intel had consistently delivered industry-leading processors, and other vendors could not compete with them on both performance and cost. Recently it all changed: AMD has been challenging the status quo with their 2nd Gen EPYC processors.</p><p>This is not the first time that Intel has been challenged; previously <a href="/arm-takes-wing/">there was Qualcomm</a>, and we also worked with AMD and considered their 1st Gen EPYC processors based on the original Zen architecture, but ultimately, <a href="/a-tour-inside-cloudflares-g9-servers/">Intel prevailed</a>. AMD did not give up and unveiled their 2nd Gen EPYC processors, codenamed Rome, based on the latest Zen 2 architecture.</p><blockquote><p>Playing with some new fun kit. <a href="https://twitter.com/hashtag/epyc?src=hash&amp;ref_src=twsrc%5Etfw">#epyc</a> <a href="https://t.co/1No8Cmfzwl">pic.twitter.com/1No8Cmfzwl</a></p><p>— Matthew Prince (@eastdakota) <a href="https://twitter.com/eastdakota/status/1192898039003766785?ref_src=twsrc%5Etfw">November 8, 2019</a></p></blockquote><p>Rome made many improvements over its predecessors, including a die shrink from 14nm to 7nm, a doubling of the top-end core count from 32 to 64, and a larger L3 cache. Let’s emphasize the size of that L3 cache again: 32 MiB of L3 cache per Core Complex Die (CCD).</p><p>This time around, we have taken steps to understand our workloads at the hardware level through the use of hardware performance counters and profiling tools. Using these specialized CPU registers and profilers, we collected data on the AMD 2nd Gen EPYC and Intel Skylake-based Xeon processors in a lab environment, then validated our observations in production against other generations of servers from the past.</p>
    <div>
      <h2>Simulated Environment</h2>
      <a href="#simulated-environment">
        
      </a>
    </div>
    <p>CPU Specifications</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/42MpuiHDcFmNWT8c5GOh96/3666d84667b56b645e29aec5cfafdfdd/image-3.png" />
            
            </figure><p>We evaluated several Intel Cascade Lake and AMD 2nd Gen EPYC processors, trading off various factors between power and performance; the AMD EPYC 7642 CPU came out on top. The majority of Cascade Lake processors have 1.375 MiB of L3 cache per core shared across all cores, a common theme that started with Skylake. On the other hand, the 2nd Gen EPYC processors start at 4 MiB per core. The AMD EPYC 7642 is a unique SKU since it has 256 MiB of L3 cache. Having a cache this large, approximately 5.33 MiB per core, sitting right next to each core means that a program will spend fewer cycles fetching data from RAM, since more data can be kept readily available in the L3 cache.</p>
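<p>The per-core figure follows directly from the SKU specs quoted above:</p>

```python
# L3 cache per core on the parts discussed above.
epyc_7642_l3_mib = 256            # total L3, shared across the package
epyc_7642_cores = 48
cascade_lake_l3_per_core = 1.375  # MiB per core, typical Cascade Lake SKU

epyc_per_core = epyc_7642_l3_mib / epyc_7642_cores
print(f"EPYC 7642: {epyc_per_core:.2f} MiB of L3 per core")  # 5.33
print(f"ratio vs Cascade Lake: {epyc_per_core / cascade_lake_l3_per_core:.1f}x")
```

<p>That works out to nearly 4x the per-core L3 of a typical Cascade Lake part.</p>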
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7FVYxpe6TeVh26fV12eeBw/a285e437f90b2ac809811bb3645b1130/Infrastructure-before_2x.png" />
            
            </figure><p><i>Before (Intel)</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/zMlvsRWnuL6SwUUPyLHBu/13d459bee2cf7d0c195bb74b18c49555/Infrastructure-after-_2x--1-.png" />
            
            </figure><p><i>After (AMD)</i></p><p>The traditional cache layout has also changed with the introduction of 2nd Gen EPYC, a byproduct of AMD using a multi-chip module (MCM) design. The 256 MiB L3 cache is formed by 8 individual Core Complex Dies (CCDs), each made up of 2 Core Complexes (CCXs), with each CCX containing 16 MiB of L3 cache.</p>
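<p>The totals check out: 8 CCDs × 2 CCXs × 16 MiB per CCX gives the full 256 MiB:</p>

```python
# Reconstructing the EPYC 7642's L3 cache from its building blocks.
ccx_l3_mib = 16   # L3 per Core Complex (CCX)
ccx_per_ccd = 2   # CCXs per Core Complex Die (CCD)
ccds = 8          # CCDs on the package

total_l3_mib = ccds * ccx_per_ccd * ccx_l3_mib
print(f"total L3: {total_l3_mib} MiB")  # 256
```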
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5oqKl2iToZamCbBNQBH2uj/ba9b72846f7f81c92e0ccf705f3be9c1/image6.png" />
            
            </figure><p><i>Core Complex (CCX) - Up to four cores</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6p5FRRU9VZTMWhkdNH7g4/acb4ff79d8916a20278f8c50e21c53bb/image9.png" />
            
            </figure><p><i>Core Complex Die (CCD) - Created by combining two CCXs</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2IyMxh74THzCubUs6H3SY1/a031494b9f0783406ef653b3e99ad598/image1-2.png" />
            
            </figure><p><i>AMD 2nd Gen EPYC 7642 - Created with 8x CCDs plus an I/O die in the center</i></p>
    <div>
      <h3>Methodology</h3>
      <a href="#methodology">
        
      </a>
    </div>
    <p>Our production traffic shares many characteristics of a sustained workload, which typically does not induce large variation in operating frequencies nor enter periods of idle time. We picked a simulated traffic pattern that closely resembled our production traffic behavior: a cached 10 KiB PNG served via HTTPS. We were interested in assessing the CPU’s maximum throughput, or requests per second (RPS), one of our key metrics. With that being said, we did not disable Intel Turbo Boost or AMD Precision Boost, nor did we match the frequencies clock-for-clock, while measuring requests per second, instructions retired per second (IPS), L3 cache miss rate, and sustained operating frequency.</p>
    <div>
      <h3>Results</h3>
      <a href="#results">
        
      </a>
    </div>
    <p>The 1P AMD 2nd Gen EPYC 7642 powered server took the lead and processed 50% more requests per second than our Gen 9’s 2P Intel Xeon Platinum 6162 server.</p><p>We are running a sustained workload, so we should end up with a sustained operating frequency that is higher than the base clock. The AMD EPYC 7642’s operating frequency, or the number of cycles the processor had at its disposal, was approximately 20% greater than the Intel Xeon Platinum 6162’s, so frequency alone was not enough to explain the 50% gain in requests per second.</p><p>Taking a closer look, the number of instructions retired over time was far greater on the AMD 2nd Gen EPYC 7642 server, thanks to its low L3 cache miss rate.</p>
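<p>Separating the two effects is a rough decomposition: if frequency alone explains a 20% gain, the remainder must come from retiring more instructions per cycle:</p>

```python
# Decompose the 50% requests-per-second gain into frequency and per-cycle terms.
rps_gain = 1.50    # EPYC 7642 vs. Xeon Platinum 6162, measured
freq_gain = 1.20   # sustained frequency advantage, measured

# Whatever frequency does not explain must come from work done per cycle,
# which here we attribute largely to the lower L3 cache miss rate.
per_cycle_gain = rps_gain / freq_gain
print(f"implied per-cycle improvement: {per_cycle_gain - 1:.0%}")  # 25%
```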
    <div>
      <h3>Production Environment</h3>
      <a href="#production-environment">
        
      </a>
    </div>
    <p>CPU Specifications</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7lji8wRNR827IKHgDE3asr/3180fa701066cee223dde822b313d8ec/image-2.png" />
            
            </figure>
    <div>
      <h3>Methodology</h3>
      <a href="#methodology">
        
      </a>
    </div>
    <p>Our predominant bottleneck appears to be cache memory, and we saw significant improvement in requests per second as well as <a href="/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/">time to process a request</a> due to the low L3 cache miss rate. The data we present in this section was collected at a point of presence that spanned Gen 7 to Gen 9 servers. We also collected data from a secondary region to gain additional confidence that the data presented here was not unique to one particular environment. Gen 9 is the baseline, just as in the previous section.</p><p>We put the 2nd Gen EPYC-based Gen X into production hoping that the results would closely mirror what we had previously seen in the lab. We found that the requests per second did not quite match the results we had hoped for, but the AMD EPYC server still outperformed all previous generations, including beating the Intel Gen 9 server by 36%.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4V8uzcFfnFwUIH0leecLnm/232bd88ceb2216efa4fae242a085a140/image8.png" />
            
            </figure><p>Sustained operating frequency was nearly identical to what we had seen in the lab.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3DhckmjQqo7b9JtllouvVL/19b3fda2e8a5a277c511269f4bc7a148/image10.png" />
            
            </figure><p>Consistent with the lower than expected requests per second, we also saw fewer instructions retired over time and a higher L3 cache miss rate, but the server maintained its lead over Gen 9, with 29% better performance.</p>
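    <p>Applying the same frequency-times-IPC decomposition to the production numbers shows why the gap narrowed: with the sustained frequency essentially unchanged from the lab, a smaller requests-per-second lead implies a smaller effective IPC advantage. This is only a rough sanity check using the figures quoted in this post, and it assumes the roughly 20% frequency advantage carried over from the lab.</p>

```python
# Same decomposition (throughput ~ frequency x IPC), lab vs. production.
frequency_ratio = 1.20   # sustained frequency was nearly identical to the lab

lab_rps_ratio = 1.50     # lab: 50% more requests per second
prod_rps_ratio = 1.36    # production: 36% more requests per second

# The implied IPC advantage shrinks in production, matching the higher
# L3 cache miss rate and lower instructions retired that we observed.
lab_ipc = lab_rps_ratio / frequency_ratio
prod_ipc = prod_rps_ratio / frequency_ratio
print(f"Implied IPC advantage: lab ~{lab_ipc:.2f}x, production ~{prod_ipc:.2f}x")
```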
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The single AMD EPYC 7642 performed very well during our lab testing, beating our Gen 9 server's dual Intel Xeon Platinum 6162 processors with the same total number of cores. The key factors we noticed were its large L3 cache, which led to a low L3 cache miss rate, and its higher sustained operating frequency. The AMD 2nd Gen EPYC 7642 did not have as big an advantage in production, but it nevertheless outperformed all previous generations. The observations we made in production were based on a single PoP and could have been influenced by a number of other factors, including but not limited to ambient temperature, timing, and new products that will shape our traffic patterns in the future, such as <a href="/webassembly-on-cloudflare-workers/">WebAssembly on Cloudflare Workers</a>. The AMD EPYC 7642 opens up the possibility for our upcoming Gen X server to maintain the same core count while processing more requests per second than its predecessor.</p><p>Got a passion for hardware? I think we should get in touch. We are always looking for talented and curious individuals to <a href="https://www.cloudflare.com/careers/departments/">join our team</a>. The data presented here would not have been possible were it not for the teamwork between many different individuals at Cloudflare. As a team, we strive to create highly performant, reliable, and secure systems that form the pillars of our rapidly growing <a href="https://www.cloudflare.com/network/">network</a>, which spans 200 cities in more than 90 countries, and we are just getting started.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Data Center]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">2QRJzUFnyzSRaIXLggtDqr</guid>
            <dc:creator>Sung Park</dc:creator>
        </item>
    </channel>
</rss>