
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 09:15:29 GMT</lastBuildDate>
        <item>
            <title><![CDATA[The post-quantum future: challenges and opportunities]]></title>
            <link>https://blog.cloudflare.com/post-quantum-future/</link>
            <pubDate>Fri, 25 Feb 2022 16:03:25 GMT</pubDate>
            <description><![CDATA[ The story and path of post-quantum cryptography are clear. But what are the future challenges? In this blog post, we explore them. ]]></description>
            <content:encoded><![CDATA[ <p></p><blockquote><p><i>“People ask me to predict the future, when all I want to do is prevent it. Better yet, build it. Predicting the future is much too easy, anyway. You look at the people around you, the street you stand on, the visible air you breathe, and predict more of the same. To hell with more. I want better.”— </i><b><i>Ray Bradbury</i></b><i>, from Beyond 1984: The People Machines</i></p></blockquote><p>The <a href="/post-quantum-taxonomy/">story and the path are clear</a>: quantum computers are coming that will have the ability to break the cryptographic mechanisms we rely on to secure modern communications, but there is hope! The cryptographic community has designed new mechanisms to safeguard against this disruption. There are challenges: will the new safeguards be practical? How will the fast-evolving Internet migrate to this new reality? In <a href="/making-protocols-post-quantum/">other</a> <a href="/post-quantum-key-encapsulation/">blog</a> <a href="/post-quantum-key-encapsulation/">posts</a> in this series, we have outlined some potential solutions to these questions: there are new algorithms for maintaining confidentiality and authentication (in a “post-quantum” manner) in the protocols we use. But will they be fast enough to deploy at scale? Will they provide the required properties and work in all protocols? Are they easy to use?</p><p>Adding <a href="https://www.cloudflare.com/learning/ssl/quantum/what-is-post-quantum-cryptography/">post-quantum cryptography</a> into architectures and networks is not only about being novel and looking at interesting research problems or exciting engineering challenges. It is primarily about protecting people’s communications and data because a quantum adversary can decrypt not only future traffic but also any past traffic that was recorded. 
Quantum adversaries could also be capable of other attacks (by using quantum algorithms, for example) that we may be unaware of now, so protecting against them is, in a way, the challenge of facing the unknown. We can’t fully predict everything that will happen with the advent of quantum computers<sup>1</sup>, but we can prepare and build greater protections than the ones that currently exist. We do not see the future as apocalyptic, but as an opportunity to reflect, discover and build better.</p><p>What are the challenges, then? And related to this question: what have we learned from the past that enables us to <i>build better</i> in a post-quantum world?</p>
    <div>
      <h3>Beyond a post-quantum TLS</h3>
      <a href="#beyond-a-post-quantum-tls">
        
      </a>
    </div>
    <p>As we have shown in <a href="/post-quantum-taxonomy/">other</a> <a href="/making-protocols-post-quantum/">blog posts</a>, the most important security and privacy properties to protect in the face of a quantum computer are confidentiality and authentication. The <a href="https://en.wikipedia.org/wiki/Threat_model">threat model</a> for confidentiality is clear: quantum computers will not only be able to decrypt ongoing traffic, but also any traffic that was recorded and stored prior to their arrival. The threat model for authentication is a little more complex: a quantum computer could be used to impersonate a party (by successfully mounting a <a href="https://en.wikipedia.org/wiki/Man-in-the-middle_attack">monster-in-the-middle attack</a>, for example) in a connection or conversation, and it could also be used to retroactively modify elements of past messages, like the identity of a sender (by, for example, changing the authorship of a past message to a different party). Both threat models are important to consider and pose a problem not only for future traffic but also for any traffic sent now.</p><p>In the case of using a quantum computer to impersonate a party: how can this be done? Suppose an attacker is able to use a quantum computer to compromise a user’s TLS certificate private key (which has been used to sign lots of past connections). The attacker can then forge connections and pretend that they come from the honest user (by signing with the user’s key) to another user, let’s say Bob. Bob will think the connections are coming from the honest user (as they all did in the past), when, in reality, they are now coming from the attacker.</p><p>We have algorithms that protect confidentiality and authentication in the face of quantum threats. We know how to integrate them into TLS, as we have seen in <a href="/making-protocols-post-quantum/">this blog post</a>, so is that it? Will our connections then be safe? 
We argue that we will not yet be done, and these are the future challenges we see:</p><ul><li><p>Changing the key exchange of the TLS handshake is simple; changing the authentication of TLS, in practice, is hard.</p></li><li><p><a href="/monsters-in-the-middleboxes/">Middleboxes and middleware</a> in the network, such as antivirus software and corporate proxies, can be slow to upgrade, <a href="/why-tls-1-3-isnt-in-browsers-yet/">hindering the rollout</a> of new protocols.</p></li><li><p>TLS is not the only foundational protocol of the Internet; there are other protocols to take into account: some are very similar to TLS and easy to fix, while others, such as DNSSEC or QUIC, are more challenging.</p></li></ul><p><a href="https://datatracker.ietf.org/doc/html/rfc8446">TLS</a> (we will be focusing on its current version, which is 1.3) is a protocol that aims to achieve three primary security properties:</p><ul><li><p><a href="https://www.cloudflare.com/learning/ssl/what-happens-in-a-tls-handshake/"><b>Confidentiality: communication can be read only by the intended recipient,</b></a></p></li><li><p><a href="https://www.cloudflare.com/learning/ssl/what-happens-in-a-tls-handshake/"><b>Integrity: communication cannot be changed in transit, and</b></a></p></li><li><p><a href="https://www.cloudflare.com/learning/ssl/what-happens-in-a-tls-handshake/"><b>Authentication: we are assured communication comes from the peer we are talking to</b></a>.</p></li></ul><p>The first two properties are easy to maintain in a quantum-computer world: confidentiality is maintained by swapping the existing non-quantum-safe algorithm for a post-quantum one; integrity is maintained because the underlying symmetric algorithms (hash functions and MACs) remain intractable even for a quantum computer. What about the last property, authentication? 
There are three ways to achieve authentication in a TLS handshake, depending on whether server-only or mutual authentication is required:</p><ol><li><p>By using a ‘pre-shared’ key (PSK) generated from a previous run of the TLS connection that can be used to establish a new connection (this is often called "session resumption" or "resuming" with a PSK),</p></li><li><p>By using a <a href="/opaque-oblivious-passwords/">Password-Authenticated Key Exchange (PAKE)</a> for handshake authentication or post-handshake authentication (with the usage of <a href="https://datatracker.ietf.org/doc/draft-ietf-tls-exported-authenticator/">exported authenticators</a>, for example). This can be done by using the <a href="https://github.com/grittygrease/draft-sullivan-tls-opaque/blob/master/draft-sullivan-tls-opaque.md">OPAQUE</a> or the <a href="http://www.watersprings.org/pub/id/draft-barnes-tls-pake-00.html">SPAKE</a> protocols,</p></li><li><p>By using public certificates that advertise parameters that are employed to assure (i.e., create a proof of) the identity of the party you are talking to (this is by far the most common method).</p></li></ol><p>Securing the first authentication mechanism is easily achieved in the post-quantum sphere as a unique key is derived from the initial quantum-protected handshake. The third authentication mechanism (we return to the second, password-based mechanism below) does not pose a theoretical challenge (as public and private parameters are replaced with post-quantum counterparts) but rather faces practical limitations: certificate-based authentication involves multiple actors and it is difficult to properly synchronize this change with them, as we will see next. It is not only one public parameter and one public certificate: certificate-based authentication is achieved through the use of a certificate chain with multiple entities.</p><p>A certificate chain is a chain of trust by which different entities attest to the validity of public parameters to provide verifiability and confidence. 
Typically, for one party to authenticate another (for example, for a client to authenticate a server), a chain of certificates starting from a root Certificate Authority (CA) certificate is used, followed by at least one intermediate CA certificate, and finally by the leaf (or end-entity) certificate of the actual party. This is what you usually find in real-world connections. It is worth noting that the order of this chain (for TLS 1.3) does not require each certificate to certify the one immediately preceding it. Servers sometimes send both a current and a deprecated intermediate certificate for transitional purposes, or are configured incorrectly.</p><p>What are these certificates? Why do multiple actors need to validate them? A (digital) certificate certifies the ownership of a public key by the named party of the certificate: it attests that the party owns the private counterpart of the public parameter through the use of <a href="https://en.wikipedia.org/wiki/Digital_signature">digital signatures</a>. A CA is the entity that issues these certificates. Browsers, operating systems or mobile devices operate CA “membership” programs where a CA must meet certain criteria to be incorporated into the trusted set. Devices accept their CA root certificates as they come “pre-installed” in a root store. Root certificates are used to generate a number of intermediate certificates, which are in turn used to generate leaf certificates. This certificate chain process of generation, validation and revocation is not only a procedure that happens at a software level but rather an amalgamation of policies, rules, roles, and hardware<sup>2</sup> and software needs. This is what is often called the <a href="https://en.wikipedia.org/wiki/Public_key_infrastructure">Public Key Infrastructure (PKI)</a>.</p><p>All of the above goes to show that while we can change all of these parameters to post-quantum ones, it is not as simple as just modifying the TLS handshake. 
Certificate-based authentication involves many actors and processes, and it involves not only one algorithm (as in the key exchange phase) but typically at least six signatures: one handshake signature; two in the certificate chain; one <a href="/high-reliability-ocsp-stapling/">OCSP staple</a> and <a href="/introducing-certificate-transparency-and-nimbus/">two SCTs</a>. The last five signatures together are used to prove that the server you’re talking to is the right server for the website you’re visiting, for example. Of these five, the last three are essentially patches: the OCSP staple is used to deal with revoked certificates and the SCTs are used to detect rogue CAs. Starting with a clean slate, could we improve on the <i>status quo</i> with an efficient solution?</p><p>More pointedly, we can ask if indeed we still need to use this system of public attestation. The migration to post-quantum cryptography is also an opportunity to modify this system. The PKI as it exists is difficult to maintain, update, revoke, <a href="https://hal.inria.fr/hal-01625766/file/CSF2017-PKI%281%29.pdf">model, or compose.</a> We have an opportunity, perhaps, to rethink this system.</p><p>Even without considering making fundamental changes to public attestation, updating the existing complex system presents both technical and management/coordination challenges:</p><ul><li><p>On the technical side: are the post-quantum signatures, which have larger sizes and higher computation times, usable in our handshakes? We explore <a href="/sizing-up-post-quantum-signatures/">this idea in this experiment</a>, but we need more information. One potential solution is to cache intermediate certificates or to use other forms of authentication beyond digital signatures (like <a href="/kemtls-post-quantum-tls-without-signatures/">KEMTLS</a>).</p></li><li><p>On the management/coordination side: how are we going to coordinate the migration of this complex system? 
Will there be some kind of ceremony to update algorithms? How will we deal with the situation where some systems have updated but others have not? How will we revoke past certificates?</p></li></ul><p>This challenge brings to light that the migration to post-quantum cryptography is not only about the technical changes but is dependent on how the Internet works as the interconnected community that it is. Changing systems involves coordination and the collective willingness to do so.</p><p>On the other hand, post-quantum password-based authentication for TLS is still an open discussion. Most PAKE systems nowadays use <a href="https://en.wikipedia.org/wiki/Decisional_Diffie%E2%80%93Hellman_assumption">Diffie-Hellman assumptions</a>, which can be broken by a quantum computer. There are <a href="https://eprint.iacr.org/2020/1532">some</a> <a href="https://eprint.iacr.org/2019/1271.pdf">ideas</a> on how to transition their underlying algorithms to the post-quantum world, but these seem to be so inefficient as to render their deployment infeasible. It seems, though, that password authentication has an interesting property called “<a href="https://eprint.iacr.org/2021/696.pdf">quantum annoyance</a>”. A quantum computer can compromise the algorithm, but only one instance of the problem at a time, for each guess of a password: “Essentially, the adversary must guess a password, solve a discrete logarithm based on their guess, and then check to see if they were correct”, as <a href="https://eprint.iacr.org/2021/696.pdf">stated in the paper</a>. Early quantum computers might take quite a long time to solve each guess, which means that a quantum-annoying PAKE combined with a large password space could delay quantum adversaries considerably in their goal of recovering a large number of passwords. Password-based authentication for TLS, therefore, could be safe for a longer time. 
This does not mean, however, that it is immune to quantum adversaries.</p><p>The world of security protocols, though, does not end with TLS. There are many other security protocols (such as <a href="https://www.cloudflare.com/learning/dns/dnssec/ecdsa-and-dnssec/">DNSSEC</a>, <a href="https://www.wireguard.com/">WireGuard</a>, SSH, <a href="https://en.wikipedia.org/wiki/QUIC">QUIC</a>, and more) that will need to transition to post-quantum cryptography. For DNSSEC, the challenge is complicated by the fact that the protocol does not seem able to <a href="https://github.com/claucece/PQNet-Workshop/blob/main/slides/PQC%20and%20DNSSEC%20(with%20animations).pdf">handle large signatures or the high computation costs of verification</a>. According to <a href="https://www.sidnlabs.nl/downloads/7qGFW0DiOkov0vWyDK9qaK/de709198ac34477797b381f146639e27/Retrofitting_Post-Quantum_Cryptography_in_Internet_Protocols.pdf">research</a> from SIDN Labs, it seems like only Falcon-512 and Rainbow-I-CZ can be used in DNSSEC (note, though, that there is a <a href="https://eprint.iacr.org/2022/214">recent attack</a> on Rainbow).</p>
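<p>To make these size constraints concrete, here is a rough back-of-the-envelope sketch (a hypothetical Python calculation, not a Cloudflare tool) of how many bytes the six handshake signatures discussed above, plus the transmitted public keys, would add to a TLS handshake. Post-quantum sizes are taken from Table 1 below; Ed25519 is included for comparison, and the exact composition of a real chain varies:</p>

```python
# Rough estimate of the bytes that signatures and public keys add to a
# TLS handshake carrying one handshake signature, two certificate-chain
# signatures, one OCSP staple, and two SCTs (six signatures in total).
# All sizes are in bytes; post-quantum numbers come from Table 1 below.

SCHEMES = {
    "Ed25519 (classical)": {"pk": 32, "sig": 64},
    "Dilithium2": {"pk": 1312, "sig": 2420},
    "Falcon-512": {"pk": 897, "sig": 690},
}

NUM_SIGNATURES = 6   # handshake + 2 chain + OCSP staple + 2 SCTs
NUM_PUBLIC_KEYS = 2  # leaf and intermediate public keys sent on the wire

def handshake_signature_bytes(scheme: dict) -> int:
    """Bytes contributed by signatures plus transmitted public keys."""
    return NUM_SIGNATURES * scheme["sig"] + NUM_PUBLIC_KEYS * scheme["pk"]

for name, sizes in SCHEMES.items():
    print(f"{name}: ~{handshake_signature_bytes(sizes):,} bytes")
```

<p>With Dilithium2, signatures and public keys alone approach 17 KB, which is one reason caching intermediate certificates, or signature-free approaches like KEMTLS, are attractive.</p>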
<table>
<colgroup>
<col></col>
<col></col>
<col></col>
<col></col>
</colgroup>
<thead>
  <tr>
    <th>Scheme</th>
    <th>Public key size (bytes)</th>
    <th>Signature size (bytes)</th>
    <th>Speed of operations</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td colspan="4"><span>Finalists</span></td>
  </tr>
  <tr>
    <td>Dilithium2</td>
    <td>1,312</td>
    <td>2,420</td>
    <td>Very fast</td>
  </tr>
  <tr>
    <td>Falcon-512</td>
    <td>897</td>
    <td>690</td>
    <td>Fast, if you have the right hardware</td>
  </tr>
  <tr>
    <td>Rainbow-I-CZ</td>
    <td>103,648</td>
    <td>66</td>
    <td>Fast</td>
  </tr>
  <tr>
    <td colspan="4"><span>Alternate Candidates</span></td>
  </tr>
  <tr>
    <td><span>SPHINCS</span><sup>+</sup>-128f</td>
    <td>32</td>
    <td>17,088</td>
    <td>Slow</td>
  </tr>
  <tr>
    <td><span>SPHINCS</span><sup>+</sup><span>-128s</span></td>
    <td>32</td>
    <td>7,856</td>
    <td>Very slow</td>
  </tr>
  <tr>
    <td>GeMSS-128</td>
    <td>352,188</td>
    <td>33</td>
    <td>Very slow</td>
  </tr>
  <tr>
    <td>Picnic3</td>
    <td>35</td>
    <td>14,612</td>
    <td>Very slow</td>
  </tr>
</tbody>
</table><p>Table 1: Post-quantum signature algorithms (sizes in bytes). Falcon-512 and Rainbow-I-CZ are the algorithms suitable for DNSSEC.</p><p>What are the alternatives for a post-quantum DNSSEC? Perhaps the isogeny-based signature scheme <a href="https://eprint.iacr.org/2020/1240">SQISign</a> might be a solution if its verification time can be improved: the <a href="https://eprint.iacr.org/2020/1240.pdf">original paper</a> reports 42 ms (on a 3.40GHz Intel Core i7-6700 CPU, averaged over 250 runs), recently <a href="https://eprint.iacr.org/2022/234.pdf">improved to 25 ms</a>, but still slower than <a href="https://en.wikipedia.org/wiki/P-384">P-384</a>. Another solution might be <a href="https://eprint.iacr.org/2021/1144.pdf">MAYO</a>, for which, on an Intel i5-8400H CPU at 2.5GHz, a signing operation takes about 2.50 million cycles and a verification operation about 1.3 million cycles. There is a lot of research that needs to be done to make isogeny-based cryptography faster so that it fits the protocol’s needs, and to provide assurance of its security properties (research in this area is ongoing; see, for example, the <a href="https://isogenyschool2020.co.uk/">Isogeny School</a>). Another alternative could be to use other forms of authentication for this protocol, like <a href="https://blog.verisign.com/security/securing-the-dns-in-a-post-quantum-world-hash-based-signatures-and-synthesized-zone-signing-keys/">hash-based signatures</a>.</p><p>DNSSEC is just one example of a protocol where post-quantum cryptography has a long road ahead, as we need hands-on experimentation to go along with technical updates. For the other protocols, there is timely research: there are, for example, proposals for a <a href="https://eprint.iacr.org/2020/379.pdf">post-quantum WireGuard</a> and for a <a href="https://openquantumsafe.org/applications/ssh.html">post-quantum SSH</a>. 
More research, though, needs to be done on the practical implications of these changes over real connections.</p><p>One important thing to note here is that there will likely be an intermediate period in which security protocols provide a “hybrid” set of algorithms for transitional purposes, compliance and security. “Hybrid” means that both a pre-quantum (or classical) algorithm and a post-quantum one are used to generate the secret used to encrypt or provide authentication. The security reason for using this hybrid mode is to safeguard against the possibility that the new post-quantum algorithms are broken. There are still many unknowns here (a single codepoint versus multiple codepoints, contingency plans) that we need to consider.</p>
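<p>The hybrid construction can be sketched as follows (a minimal illustration only: the helper names are ours, and the two input secrets are stand-ins for what, say, an X25519 exchange and a post-quantum KEM would each produce). The traffic secret is derived from the concatenation of both shared secrets, so an attacker must break both algorithms to recover it:</p>

```python
import hashlib
import hmac
import secrets

def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869) with SHA-256."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()

def hkdf_expand(prk: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF-Expand (RFC 5869) with SHA-256."""
    out, block, counter = b"", b"", 1
    while len(out) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        out += block
        counter += 1
    return out[:length]

def hybrid_secret(classical_ss: bytes, pq_ss: bytes) -> bytes:
    # Both shared secrets feed the key derivation, so recovering the
    # output requires breaking *both* key exchanges.
    prk = hkdf_extract(b"hybrid-kex", classical_ss + pq_ss)
    return hkdf_expand(prk, b"traffic secret")

# Placeholder secrets standing in for the X25519 and post-quantum KEM outputs.
classical = secrets.token_bytes(32)
pq = secrets.token_bytes(32)
print(len(hybrid_secret(classical, pq)))  # a 32-byte traffic secret
```

<p>Because the secrets are combined inside the key derivation, the result is at least as strong as the stronger of the two component algorithms.</p>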
    <div>
      <h3>The failures of cryptography in practice</h3>
      <a href="#the-failures-of-cryptography-in-practice">
        
      </a>
    </div>
    <p>The Achilles’ heel of cryptography is often its introduction into the real world. Designing, implementing, and deploying cryptography is notoriously <a href="https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_garman.pdf">hard to get right</a> due to flaws in security proofs, implementation bugs, and vulnerabilities<sup>3</sup> of software and hardware. We often deploy cryptography that is found to be flawed and the cost of fixing it is immense (it involves resources, coordination, and more). As a community, though, we have some previous lessons to draw from. For TLS 1.3, we pushed for <a href="https://ieeexplore.ieee.org/document/7958593">verifying implementations</a> of the standard, and for using tools to <a href="https://bblanche.gitlabpages.inria.fr/publications/BlanchetETAPS12.pdf">analyze the symbolic and computational models</a>, as seen in <a href="/post-quantum-formal-analysis/">other</a> <a href="/post-quantum-easycrypt-jasmin/">blog posts</a>. Every time we design a new algorithm, we should aim for this same level of confidence, especially for the big migration to post-quantum cryptography.</p><p>In other blog posts, we have discussed our formal verification <a href="/post-quantum-easycrypt-jasmin/">efforts</a>, so we will not repeat these here. Rather, let’s focus on what remains to be done on the formal verification front. 
Verification, analysis and implementation are not yet complete and we still need to:</p><ul><li><p>Create easy-to-understand guides to what formal analysis is and how it can be used (as formal languages are unfamiliar to developers).</p></li><li><p>Develop user-tested APIs.</p></li><li><p>Flawlessly integrate post-quantum algorithms’ APIs into protocols’ APIs.</p></li><li><p>Test and analyze the boundaries between verified and unverified code.</p></li><li><p>Verify specifications at the standards level by, for example, integrating <a href="https://hal.inria.fr/hal-03176482/document">hacspec</a> into IETF drafts.</p></li></ul><p>Only by doing so can we prevent some of the security issues we have had in the past. Post-quantum cryptography will be a big migration and we can, if we are not careful, repeat the same issues of the past. We want a future that is better. We want to mitigate bugs and provide high assurance of the security of connections users have.</p>
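<p>As one concrete example of the implementation bugs we want to rule out (see the footnote on constant-time failures), consider checking a received MAC against the expected value. A naive comparison returns early at the first mismatch, so its running time leaks how many leading bytes were correct; a constant-time comparison touches every byte regardless. This is an illustrative Python sketch, not code from any verified library:</p>

```python
import hmac

def naive_equal(a: bytes, b: bytes) -> bool:
    # Returns at the first mismatch: the running time leaks how many
    # leading bytes of the secret were guessed correctly.
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True

def constant_time_equal(a: bytes, b: bytes) -> bool:
    # Accumulates differences over *all* bytes, so the running time does
    # not depend on where the first mismatch occurs.
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y
    return diff == 0

tag = b"expected-mac-value"
print(constant_time_equal(tag, b"expected-mac-valuX"))  # False
# The standard library ships a vetted helper for exactly this:
print(hmac.compare_digest(tag, b"expected-mac-value"))  # True
```

<p>Verified implementations and vetted helpers like <code>hmac.compare_digest</code> exist precisely so that developers do not have to get subtleties like this right by hand.</p>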
    <div>
      <h3>A post-quantum tunnel is born</h3>
      <a href="#a-post-quantum-tunnel-is-born">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5txFSxUXLbi6kiuyOjyTH9/47f7fe733085d94d038e1067862eaa67/image3-27.png" />
            
            </figure><p>We’ve described the challenges of post-quantum cryptography from a theoretical and practical perspective. These are problems we are working on and issues that we are analyzing. They will take time to solve. But what can you expect in the short term? What new ideas do we have? Let’s look at where else we can put post-quantum cryptography.</p><p><a href="/tunnel-for-everyone/">Cloudflare Tunnel</a> is a service that creates a secure, outbound-only connection between your services and Cloudflare by deploying a lightweight connector in your environment. This is the server side of the connection. At the other end, on the client side, we have <a href="/1111-warp-better-vpn/">WARP</a>, a ‘VPN’ for client devices that can secure and accelerate all HTTPS traffic. So, what if we add post-quantum cryptography to all our <a href="/post-quantumify-cloudflare">internal infrastructure</a>, and also add it to this server and the client endpoints? We would then have a post-quantum server-to-client connection where any request from the WARP client to a private network (one that uses Tunnel) is secure against a quantum adversary (the setup will be similar to what is <a href="https://developers.cloudflare.com/cloudflare-one/tutorials/warp-to-tunnel">detailed here</a>). Why would we want to do this? First, because it is great to have a connection that is fully protected against quantum computers. Second, because we can better measure the impacts of post-quantum cryptography in this environment (and even measure them in a mobile environment). This will also mean that we can provide guidelines to clients and servers on how to migrate to post-quantum cryptography. It would also be the first available service to do so at this scale. How will all of us experience this transition? 
Only time will tell, but we are excited to work towards this vision.</p><p>And, furthermore, as Tunnel uses the <a href="https://en.wikipedia.org/wiki/QUIC">QUIC protocol</a> <a href="/getting-cloudflare-tunnels-to-connect-to-the-cloudflare-network-with-quic/">in some cases</a> and WARP uses the WireGuard protocol, this means that we can experiment with post-quantum cryptography in protocols that are novel and have not seen much experimentation in the past.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6eSXBxWZXlVqSd1tnMTbqp/5b9bca705afc12678a743deb1a741a03/image2-25.png" />
            
            </figure><p>So, what is the future in the post-quantum cryptography era? The future is better and not just the same. The future is fast and more secure. Deploying cryptography can be challenging and we have had problems with it in the past; but post-quantum cryptography is the opportunity to dream of better security, it is the path to explore, and it is the reality to make it happen.</p><p>Thank you for reading our post-quantum blog post series and expect more post-quantum content and updates from us!</p><hr /><p><i>If you are a student enrolled in a PhD or equivalent research program and looking for an internship for 2022, see </i><a href="https://research.cloudflare.com/outreach/academic-programs/interns/"><b><i>open opportunities</i></b></a><b><i>.</i></b></p><p><i>If you’re interested in contributing to projects helping Cloudflare, </i><a href="https://www.cloudflare.com/en-gb/careers/jobs/?department=Engineering&amp;location=default"><i>our engineering teams are hiring</i></a><i>.</i></p><p><i>You can reach us with questions, comments, and research ideas at </i><a><i>ask-research@cloudflare.com</i></a><i>.</i></p><p>.....</p><p><sup>1 </sup>And when we do predict, we often predict more of the same attacks we are accustomed to: adversaries breaking into connections, security being tampered with. Is this all that they will be capable of?</p><p><sup>2</sup>The private part of a public key advertised in a certificate is often the target of attacks. An attacker who steals a certificate authority's private keys is able to forge certificates, for example. Private keys are almost always stored on a hardware security module (HSM), which prevents key extraction. This is a small example of how hardware is involved in the process.</p><p><sup>3</sup>Like constant-time failures, side-channel, and timing attacks.</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[Post-Quantum]]></category>
            <guid isPermaLink="false">5qmZiK0vIXo8sNj0Aj25ZB</guid>
            <dc:creator>Sofía Celi</dc:creator>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Research: Two Years In]]></title>
            <link>https://blog.cloudflare.com/cloudflare-research-two-years-in/</link>
            <pubDate>Sun, 10 Oct 2021 12:57:40 GMT</pubDate>
            <description><![CDATA[ What Cloudflare Research has been up to for the last two years. ]]></description>
            <content:encoded><![CDATA[ <p></p><blockquote><p>Great technology companies build innovative products and bring them into the world; iconic technology companies change the nature of the world itself.</p></blockquote><p>Cloudflare’s mission reflects our ambitions: to help build a better Internet. Fulfilling this mission requires a multifaceted approach that includes ongoing product innovation, strategic decision-making, and the audacity to challenge existing assumptions about the structure and potential of the Internet. <a href="/cloudflares-approach-to-research/">Two years ago</a>, Cloudflare Research was founded to explore opportunities that leverage fundamental and applied computer science research to help change the playing field.</p><p>We’re excited to share five operating principles that guide Cloudflare’s approach to applying research to help build a better Internet and five case studies that exemplify these principles. Cloudflare Research will be all over the blog for the next several days, so subscribe and follow along!</p>
    <div>
      <h2><b>Innovation</b> comes from all places</h2>
      <a href="#innovation-comes-from-all-places">
        
      </a>
    </div>
    <p>Innovative companies don’t become innovative by having one group of people within the company dedicated to the future; they become that way by having a culture where new ideas are free-flowing and can come from anyone. Research is most effective when it is permitted to grow beyond or outside isolated lab environments, is deeply integrated into all facets of a company’s work, and connected with the world at large. We think of our research team as a support organization dedicated to identifying and nurturing ideas that are between three and five years away.</p><p>Cloudflare Research prioritizes strong collaboration to stay on top of big questions from our product, ETI, and engineering teams, and from industry. Within Cloudflare, we learn about problems by embedding ourselves within other teams. Externally, we invite visiting researchers and academics, sit on boards and committees, contribute to open standards design and development, attend dozens of tech conferences a year, and publish papers at conferences.</p><p>Research engineering is <i>truly full-stack</i>. We incubate ideas from conception to prototype through to production code. Our team employs specialists from academic and industry backgrounds and generalists who help connect ideas across disciplines. We work closely with the product and engineering organizations on graduation plans where code ownership is critical. We also hire many research interns who help evaluate and de-risk new ideas before we help build them into production systems and eventually hand them off to other teams at the company.</p><p>Research questions can come from all parts of the company and even from customers. Our collaborative approach has led to many positive outcomes for Cloudflare and the Internet at large.</p>
    <div>
      <h3>Case Study #1: Password security</h3>
      <a href="#case-study-1-password-security">
        
      </a>
    </div>
    <p>Several years ago, a service called <a href="https://haveibeenpwned.com/">Have I Been Pwned</a>, which lets people check if their password has been compromised in a breach, started using Cloudflare to keep the site up and running under load. However, the setup also highlighted a privacy issue: Have I Been Pwned would inevitably have access to every password submitted to the service, making it a juicy target for attackers. This insight raised the question: can such a service be offered in a privacy-preserving way?</p><p>Like much of the research team’s work, the kernel of a solution came from somewhere else at the company. The first work on the problem came from members of the support engineering organization, who developed a <a href="/validating-leaked-passwords-with-k-anonymity/">clever solution</a> that mostly solved the problem. However, this initial investigation outlined a deep and complex problem space that could have applications far beyond passwords. At this point, the research team got involved and reached out to experts, including Thomas Ristenpart at Cornell Tech, to help study it more deeply.</p><p>This collaboration led us down a path to devise a new protocol and publish a paper entitled <a href="https://dl.acm.org/doi/pdf/10.1145/3319535.3354229">Protocols for Checking Compromised Credentials</a> at <a href="https://www.sigsac.org/ccs/CCS2021/">ACM CCS 2019</a>. The paper pointed us in a direction, but a desire to take this research and make it applicable to people led us to host our first visiting researcher to help build a prototype. We found even more inspiration and support internally when we learned that another team at Cloudflare was interested in adding capabilities to the Cloudflare <a href="https://www.cloudflare.com/waf/">Web Application Firewall</a> to detect breached passwords for customers. 
This team ended up helping us build the prototype and integrate it into Cloudflare’s service.</p><p>This combination of customer insight, internal alignment, and external collaboration not only resulted in a powerful and unique <a href="/patching-the-internet-against-global-brute-force-campaigns/">customer feature</a>, but it also helped advance the state of the art in privacy-preserving technology for an essential aspect of our online lives: authentication. We’ll be making an exciting announcement around this technology on Thursday, so stay tuned.</p>
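<p>The k-anonymity idea behind that first solution can be sketched in a few lines of Python. This is a simplified illustration of the range-query technique (not the protocol from the CCS 2019 paper): the client reveals only a short prefix of its password hash, fetches every breached suffix that shares the prefix, and finishes the check locally.</p>

```python
import hashlib

def sha1_hex(password: str) -> str:
    return hashlib.sha1(password.encode()).hexdigest().upper()

def server_suffixes(prefix: str, breach_corpus: set) -> set:
    # The server sees only a 5-character hash prefix, which is shared by
    # many unrelated passwords, and returns every matching suffix.
    return {h[5:] for h in breach_corpus if h.startswith(prefix)}

def is_breached(password: str, breach_corpus: set) -> bool:
    digest = sha1_hex(password)
    prefix, suffix = digest[:5], digest[5:]  # only the prefix leaves the client
    return suffix in server_suffixes(prefix, breach_corpus)
```

<p>Because many distinct passwords share any given 5-character prefix, the server learns very little about which password the client actually holds.</p>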
    <div>
      <h2><b>A question-oriented approach</b> leads to better products</h2>
      <a href="#a-question-oriented-approach-leads-to-better-products">
        
      </a>
    </div>
    <p>To support Cloudflare’s fast-paced roadmap, the Research team takes a question-first approach. Focusing on questions is the essence of research itself, but it’s important to recognize what that means for an industry-leading company. We never stop asking big questions about the problems our products are already solving. Changes like the move from desktop to mobile computing or the transition from on-premises security solutions to the cloud happened within a decade, so it’s important to ask questions like:</p><ul><li><p>How will social and geopolitical changes impact the products we’re building today?</p></li><li><p>What assumptions may be present in existing systems based on the homogeneous experiences of their creators?</p></li><li><p>What changes are happening in the industry that will affect the way the Internet works?</p></li><li><p>What opportunities do we see to leverage new and lower-cost hardware?</p></li><li><p>How can we scale to meet growing needs?</p></li><li><p>Are there new computing paradigms that need to be understood?</p></li><li><p>How are user expectations shifting?</p></li><li><p>Is there something we can build to help move the Internet in the right direction?</p></li><li><p>Which security or privacy risks are likely to be exacerbated in the coming years?</p></li></ul><p>By focusing on broader questions and being open to answers that don’t fit within the boundaries of existing solutions, we have the opportunity to see around corners. This type of foresight helps solve more than business problems; it can help improve products and the Internet at large. These types of questions are asked across the R&amp;D functions of the company, but research focuses on the questions that aren’t easily answered in the short term.</p>
    <div>
      <h3>Case Study #2: SSL/TLS Recommender</h3>
      <a href="#case-study-2-ssl-tls-recommender">
        
      </a>
    </div>
    <p>When Cloudflare made <a href="https://www.cloudflare.com/application-services/products/ssl/">SSL certificates free and automatic</a> for all customers, it was a big step towards securing the web. One of the early criticisms of Cloudflare’s SSL offerings was the so-called “mullet” critique. Cloudflare offers a mode called “Flexible SSL,” which enables encryption between the browser and Cloudflare but keeps the connection between Cloudflare and the origin site unencrypted for backward compatibility. It’s called the mullet criticism because Flexible SSL provides encryption on the front half of the connection but none on the back half.</p><p>Most security risks for online communication occur between the user and Cloudflare (at the ISP or in the coffee shop, for example), not between Cloudflare and the origin server. Customers who couldn’t enable encryption on their own servers still had a much better security posture with Cloudflare than without it.</p><p>In a few isolated instances, however, a customer did not take advantage of the most secure configuration possible, which resulted in unexpected and impactful security problems. Given the non-zero risk and acknowledging this valid product critique, we asked: what <i>could we do to improve the security posture of millions of websites?</i></p><p>With the help of research interns with Internet scanning and measurement expertise, we built an advanced scanning/crawling tool to see how much things could be improved. It turned out we could do a lot, so we worked with various teams across the company to connect our scanning infrastructure to Cloudflare’s product. We now offer a service to all customers called the <a href="https://developers.cloudflare.com/ssl/origin-configuration/ssl-tls-recommender">SSL/TLS recommender</a> that has helped thousands of customers secure their sites and trim their mullets. 
The underlying reason that websites don’t use encryption for parts of their backends is complex, and what makes this project a good example of asking questions in a pragmatic way is that it not only gave Cloudflare an improved product but also gave researchers a set of tools to further explore the fundamental problem.</p>
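<p>At its core, the recommender’s job is to map what the scanner observed at the origin to the strictest SSL/TLS mode that won’t break the site. A minimal sketch of that decision (the mode names are Cloudflare’s; the two boolean inputs are a hypothetical simplification of the real scan results):</p>

```python
def recommend_mode(origin_speaks_tls: bool, cert_publicly_valid: bool) -> str:
    # Prefer the strictest mode the origin's observed configuration supports.
    if origin_speaks_tls and cert_publicly_valid:
        return "Full (strict)"  # encrypted and authenticated on both halves
    if origin_speaks_tls:
        return "Full"           # encrypted to the origin, cert not validated
    return "Flexible"           # origin is HTTP-only: front half only
```

<p>The interesting engineering lives in producing those inputs reliably at scale; the recommendation itself is deliberately simple and conservative.</p>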
    <div>
      <h2>Build the tools today to deal with the issues of tomorrow</h2>
      <a href="#build-the-tools-today-to-deal-with-the-issues-of-tomorrow">
        
      </a>
    </div>
    <p>Another critical objective of research in the Cloudflare context is to help us prepare for the unknown. We identify and solve "what-if" scenarios based on how society is changing or may change. Thousands of companies rely on our infrastructure to serve their users and meet their business needs. We are continually improving their experience by future-proofing the technology that supports them during the best of times and the potential worst of times.</p><p>To prepare, we:</p><ol><li><p>Identify areas of future risk.</p></li><li><p>Explore the fundamentals of the problem.</p></li><li><p>Build expertise and validate that expertise in peer-reviewed venues.</p></li><li><p>Build operating experience with prototypes and large-scale experiments.</p></li><li><p>Build networks of relationships with those who can help in a crisis.</p></li></ol><p>We would like to emulate the strategic thinking and planning required of the often unseen groups that support society — like the <a href="https://www.cafiresci.org/research-publications">forestry service</a> or a public health department. When the fire hits or the next virus arrives, we have the best chance to not only get through it but to innovate and flourish because of the mitigations put in place and the relationships built while thinking through impending challenges.</p>
    <div>
      <h3>Case Study #3: IP Address Agility</h3>
      <a href="#case-study-3-ip-address-agility">
        
      </a>
    </div>
    <p>One area of impending risk for the Internet is the exhaustion of IPv4 addresses. There are only around four billion potential IPv4 addresses, which is fewer than the number of humans globally, let alone the number of Internet-connected devices. We decided to examine this problem more deeply in the context of Cloudflare’s unique approach to networking.</p><p>Cloudflare’s architecture challenges several assumptions about how different aspects of the Internet relate to each other. Historically, a server IP address corresponds to a specific machine. Cloudflare’s anycast architecture and server design allow any server to serve any customer site. This architecture has allowed the service to scale in incredible ways with very little work.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4OuQyv6Pdm1rWv8HYJPMGD/6cba6ee3b015ba82da845353a088afee/unnamed-7.png" />
            
            </figure><p>This insight led us to question other fundamental assumptions about how the Internet could potentially work. If an IP address doesn’t need to correspond with a specific server, what else could we decouple from it? Could we decouple hostnames from IPs? Maybe we could! We could also frame the question negatively: what do we lose if we don’t take advantage of this decoupling? It was worth an experiment, at least.</p><p>We decided to validate this assumption empirically. We ran an experiment serving all free customers from the same IP in an entire region in a cross-organizational effort that involved the DNS team, the IP addressing team, and the <a href="/unimog-cloudflares-edge-load-balancer/">edge load balancing team</a>. The experiment proved that using a single IP for millions of services was feasible, but it also highlighted some risks. The result was <a href="https://dl.acm.org/doi/10.1145/3452296.3472922">a paper</a> published in <a href="https://conferences.sigcomm.org/sigcomm/2021/">ACM SIGCOMM 2021</a> and a project to re-architect our authoritative DNS system to be far more flexible.</p>
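<p>The decoupling can be illustrated with a toy dispatcher. When every server can serve every site, the destination IP carries no routing information; the hostname the client presents (in the TLS SNI or Host header) does all the work. This is a hypothetical sketch, not Cloudflare’s actual load balancer:</p>

```python
# With anycast, the same IP is announced everywhere and any machine may
# receive the packet; the requested hostname, not the destination address,
# selects the customer configuration.
SITES = {
    "example.com": {"origin": "203.0.113.10", "tls": "Full (strict)"},
    "example.org": {"origin": "203.0.113.20", "tls": "Flexible"},
}

def dispatch(sni_hostname: str, dest_ip: str) -> dict:
    # dest_ip is deliberately ignored: millions of hostnames can share it.
    return SITES.get(sni_hostname, {"error": "unknown host"})
```

<p>Once the destination address is ignored, how many addresses you advertise becomes a policy choice rather than a capacity constraint.</p>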
    <div>
      <h2>Going further <b>together</b></h2>
      <a href="#going-further-together">
        
      </a>
    </div>
    <p>The Internet is not simply a loose federation of companies and billions of dollars of deployed hardware; it’s a network of interconnected relationships with a global societal impact. These relationships and the technical standards that govern them are the connective tissue that allows us to build important aspects of modern society on the Internet.</p><p>Millions of Internet properties use Cloudflare’s network, which serves tens of millions of requests per second. The decisions we make, and the products we release, have significant implications for the industry and billions of people. In the majority of cases, we only control one side of the Internet connection. To serve billions of users and Internet-connected devices, we are governed by and need to support protocols such as DNS, BGP, TLS, and QUIC, defined by technical standards organizations like the Internet Engineering Task Force (IETF).</p><p>Protocols evolve, and new ones constantly emerge to serve the evolving needs of the Internet. An important part of our mission to help build a more secure, more private, more performant, and more available Internet involves taking a leadership role in shaping these protocols, deploying them at scale, and building the open-source software that implements them.</p>
    <div>
      <h3>Case Study #4: Oblivious DNS</h3>
      <a href="#case-study-4-oblivious-dns">
        
      </a>
    </div>
    <p>When Cloudflare launched the 1.1.1.1 recursive DNS service in 2018, privacy was again at the forefront. We offered 1.1.1.1 with a new encryption standard called <a href="/dns-encryption-explained/">DNS-over-HTTPS</a> (DoH) to keep queries private as they travel over the Internet. However, even with DoH, a DNS query is associated with the IP address of the user who sent it. Cloudflare wrote a privacy policy and created technical measures to ensure that user IP data is only kept for a short amount of time and even engaged a top auditing firm to <a href="/announcing-the-results-of-the-1-1-1-1-public-dns-resolver-privacy-examination/">confirm this</a>. But still, the question remained: could we offer the service without needing to have this privacy-sensitive information in the first place? We created the <a href="/welcome-hidden-resolver/">Onion Resolver</a> to let users access the service over the Tor anonymity network, but it was extremely slow to use. Could we offer cryptographically private DNS with acceptable performance? It was an open problem.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5mSUIzWCQsR7WLfJpMjrpY/76e844601571b32df1aceb04d6db7de5/unnamed--1-.png" />
            
            </figure><p>The breakthrough came through a combination of sources. We learned about new results out of academia describing a novel proxying technique called <a href="https://www.sciendo.com/article/10.2478/popets-2019-0028">Oblivious DNS</a>. We also found other folks in the industry from Apple and Fastly who were working on the same problem. The resulting proposal combined Oblivious DNS and DoH into an elegant protocol called Oblivious DoH (or ODoH). ODoH was <a href="https://datatracker.ietf.org/doc/html/draft-pauly-dprive-oblivious-doh-07">published</a> and discussed extensively at the IETF. DNS is an especially standards-dependent protocol since different parties operate so many components of the system. ODoH adds another participant to the equation, making careful interoperability standards even more critical.</p><p>The research team took an early draft of ODoH and, with the help of an intern (whose experience on the team you’ll hear about tomorrow), we built a prototype on Cloudflare Workers. With it, we measured the performance and confirmed the viability of a large-scale ODoH deployment, leading to <a href="https://www.sciendo.com/article/10.2478/popets-2021-0085">this paper</a>, published at <a href="https://petsymposium.org/">PoPETS 2021</a>.</p><p>The results of our experimentation gave us the confidence to build a production-quality implementation. Our research engineers worked with the engineers on the resolver team to implement and <a href="/oblivious-dns/">launch ODoH</a> for 1.1.1.1. We also <a href="https://github.com/search?q=org%3Acloudflare+odoh">open-sourced</a> ODoH-related code in Go, Rust, and a Cloudflare Workers-compatible implementation. We worked with the open source community to land the protocol in widely available tools, like <a href="http://dnscrypt.info/">dnscrypt</a>, to help further ODoH adoption. 
ODoH is just one example of cross-industry work to develop standards that you’ll learn more about in an upcoming post.</p>
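<p>The privacy property ODoH targets can be modeled in a few lines: the proxy learns who is asking but not what, and the target learns what is asked but not by whom. In this toy sketch, a simple wrapper class stands in for the real HPKE encryption of the query to the target’s public key:</p>

```python
from dataclasses import dataclass

@dataclass
class Sealed:
    blob: str  # stand-in for an HPKE ciphertext only the target can open

def client_seal(query: str) -> Sealed:
    return Sealed(blob=query)

def proxy_forward(client_ip: str, msg: Sealed, resolve) -> str:
    # The proxy sees client_ip plus an opaque blob; it strips the client's
    # address before forwarding, so the target never learns who asked.
    return resolve(msg)

def target_resolve(msg: Sealed) -> str:
    # The target decrypts the query but only ever sees the proxy's address.
    answers = {"example.com": "93.184.216.34"}
    return answers.get(msg.blob, "NXDOMAIN")
```

<p>As long as the proxy and target don’t collude, neither party alone can link a user to a query, which is exactly the split of knowledge the protocol standardizes.</p>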
    <div>
      <h2>From theory to practice at a global scale</h2>
      <a href="#from-theory-to-practice-at-a-global-scale">
        
      </a>
    </div>
    <p>There are thousands of papers published at dozens of venues every year on Internet technology. The ideas in academic research can change our fundamental understanding of the world scientifically but often haven’t impacted users yet. Innovation comes from all places, and it's important to adopt ideas from the wider community since we're in a fortunate position to do so.</p><p>As an academic, you are rewarded for the breakthrough, not the follow-through. We value the ability to pursue, even drive, the follow-through because of the tangible improvements it offers the Internet. We find that we can often learn much more about an idea by building and deploying it than by only thinking it up and writing it down. The smallest wrinkle in a lab environment or theoretical paper can become a large issue at Internet scale, and seemingly minor insights can unlock enormous potential.</p><p>We are grateful at Cloudflare to be in a rare position to leverage the diversity of the Internet to further existing research and study its real-world impact. We can bring the insights and solutions described in papers to reality because we have the insight into user behavior required to do it: right now, <i>nearly every user on the Internet uses Cloudflare directly or indirectly.</i></p><p>We also have enough scale that problems that could once have been solved conventionally must now be solved with new tools. For example, cryptographic tools like identity-based encryption and zero-knowledge proofs have been around on paper for decades. They have only recently been deployed in a few live systems but are now emerging as valuable tools in the conventional Internet and have real-life impact.</p>
    <div>
      <h3>Case Study #5: Cryptographic Attestation of Personhood</h3>
      <a href="#case-study-5-cryptographic-attestation-of-personhood">
        
      </a>
    </div>
    <p>In Case Study #4, we explored a deep and fundamental networking question. As fun as exploring the plumbing can be, it’s also important to explore user experience questions because clients and customers can immediately feel them. A big part of Cloudflare’s value is protecting customers from bots, and one of the tools we use is the CAPTCHA. People <a href="https://twitter.com/Nazenko/status/1337987291483136002">hate</a> CAPTCHAs. Few aspects of the web experience inspire more seething anger than being stopped and asked to identify street signs and trucks in low-res images when trying to go to a site. Furthermore, the most common CAPTCHAs are inaccessible to many communities, including the blind and the visually impaired. The Bots team at Cloudflare has made great strides to reduce the number of CAPTCHAs shown on the server-side using <a href="https://www.cloudflare.com/learning/ai/what-is-machine-learning/">machine learning</a> (a challenging problem in itself). We decided to complement their work by asking: could we leverage client-side capabilities to provide protection equivalent to a CAPTCHA?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6v7w30ADISIpFn4zf3VyZ5/7f6cc630e8264d47e06d9ebc17be85b2/unnamed--2-.png" />
            
            </figure><p>This question led us down the path to developing a new technique we call the <a href="/introducing-cryptographic-attestation-of-personhood/">Cryptographic Attestation of Personhood (CAP)</a>. Instead of users proving to Cloudflare that they are human by performing a human task, we could ask them to prove that they have trusted physical hardware. We identified the widely deployed WebAuthn standard to do this so that millions of users could leverage their hardware security keys or the hardware biometric system built into their phones and laptops.</p><p>This project raised several exciting and profound privacy and user experience questions. With the help of a research intern experienced in anonymous authentication, we published a <a href="https://eprint.iacr.org/2021/1183">paper</a> at <a href="https://www.sac2021.ca/">SAC 2021</a> that leveraged and improved upon zero-knowledge <a href="/introducing-zero-knowledge-proofs-for-private-web-attestation-with-cross-multi-vendor-hardware/">cryptography systems</a> to add additional privacy features to our solution. Cloudflare’s ubiquity allows us to expose millions of users to this new idea and understand how it’s perceived: for example, do users think sites can collect biometrics from their users this way? We have an upcoming paper submission (with the user research team) about these user experience questions.</p><p>Some answers to these research-shaped questions were surprising, some weren’t, but in all cases, the outcome of taking a question-oriented approach was new tools for solving problems.</p>
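<p>The shape of the CAP flow is a standard challenge-response: the server sends a fresh random challenge, the user’s authenticator signs it with a key that never leaves the hardware, and the server verifies the result. The sketch below uses an HMAC purely as a stand-in for WebAuthn’s public-key signature and attestation, to show the message flow rather than the real cryptography:</p>

```python
import hashlib
import hmac
import secrets

DEVICE_KEY = secrets.token_bytes(32)  # lives inside the authenticator

def server_new_challenge() -> bytes:
    return secrets.token_bytes(16)    # fresh per attempt: prevents replay

def authenticator_sign(challenge: bytes) -> bytes:
    # Real WebAuthn uses an asymmetric signature (e.g. ECDSA) plus an
    # attestation that the key is hardware-bound; HMAC is a stand-in here.
    return hmac.new(DEVICE_KEY, challenge, hashlib.sha256).digest()

def server_verify(challenge: bytes, signature: bytes) -> bool:
    expected = hmac.new(DEVICE_KEY, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)
```

<p>Because the response proves possession of trusted hardware rather than the ability to read blurry images, the check takes one tap instead of a puzzle.</p>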
    <div>
      <h2>What to expect from the blog</h2>
      <a href="#what-to-expect-from-the-blog">
        
      </a>
    </div>
    <p>As mentioned above, Cloudflare Research will be all over the Cloudflare blog in the next several days to make several announcements and share technical explainers, including:</p><ul><li><p>A new landing page where you can read our research and find additional resources.</p></li><li><p>Technical deep dives into research papers and standards, including the projects referenced above (and MANY more).</p></li><li><p>Insights into our process and ways to collaborate with us.</p></li></ul><p>Happy reading!</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <guid isPermaLink="false">1Visyte4EnzFnqlLFbM0Mv</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Heartbleed Revisited]]></title>
            <link>https://blog.cloudflare.com/heartbleed-revisited/</link>
            <pubDate>Sat, 27 Mar 2021 13:00:00 GMT</pubDate>
            <description><![CDATA[ TLS key compromise is a risk for all web services. Taking lessons from Heartbleed, Cloudflare offers the latest features that make key compromise less of a risk. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>In 2014, a bug was found in OpenSSL, a popular encryption library used to secure the majority of servers on the Internet. This bug allowed attackers to abuse an obscure feature called TLS heartbeats to read memory from affected servers. Heartbleed was big news because it allowed attackers to extract the most important secret on a server: its TLS/SSL certificate private key. After confirming that the bug was <a href="/the-results-of-the-cloudflare-challenge/">easy to exploit</a>, we revoked and reissued over <a href="/the-heartbleed-aftermath-all-cloudflare-certificates-revoked-and-reissued/">100,000 certificates</a>, which highlighted some major issues with how the Internet is secured.</p><p>As much as Heartbleed and <a href="https://www.bankinfosecurity.com/private-keys-for-23000-digital-certificates-leaked-a-10689">other key compromise events</a> were painful for security and operations teams around the world, they also provided a learning opportunity for the industry. Over the past seven years, Cloudflare has taken the lessons of Heartbleed and applied them to improve the design of our systems and the resiliency of the Internet overall. Read on to learn how using Cloudflare reduces the risk of key compromise and reduces the cost of recovery if it happens.</p>
    <div>
      <h3>Keeping keys safe</h3>
      <a href="#keeping-keys-safe">
        
      </a>
    </div>
    <p>An important tenet of security system design is defense-in-depth. Important things should be protected with multiple layers of defense. This is why security-conscious people keep spare house keys in a secure lockbox rather than under the mat. For cryptographic systems that face the Internet, defense-in-depth means designing your systems so that keys are not one exploit away from being stolen. This wasn’t the case for OpenSSL and Heartbleed. Private keys were loaded into the memory of a memory-unsafe, Internet-facing process, so only a single memory disclosure bug was needed to exfiltrate them.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4utbSsV9R3VODJsjcgNPGb/6844289c0ab586527f721c9772af4202/image6-22.png" />
            
            </figure><p>Keyless SSL: keeping the server separate from the key</p><p>An effective defense-in-depth strategy for protecting TLS/SSL private keys is splitting the process into two parts: the private key/authentication part, and the encryption/decryption part. This is exactly why we developed Keyless SSL, to allow customers to keep full control over their private keys while allowing Cloudflare to handle the rest of the connection details. <a href="/keyless-ssl-the-nitty-gritty-technical-details/">Keyless SSL</a> provides physical separation between where a key is being used and where it is stored. We support Keyless SSL for both software and hardware security modules (HSMs), and today we’re announcing that we now support <a href="/keyless-ssl-supports-fips-140-2-l3-hsm">multiple cloud-based HSMs</a>.</p>
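<p>The separation Keyless SSL provides can be sketched as two processes: an Internet-facing edge that runs the handshake, and a key server that is the only process ever holding the private key. The class names and the HMAC standing in for a real RSA/ECDSA signature are illustrative only:</p>

```python
import hashlib
import hmac

class KeyServer:
    """Runs on the customer's infrastructure; the key never leaves it."""
    def __init__(self, private_key: bytes):
        self._key = private_key

    def sign(self, digest: bytes) -> bytes:
        # Stand-in for an RSA/ECDSA signature over the handshake transcript.
        return hmac.new(self._key, digest, hashlib.sha256).digest()

class EdgeServer:
    """Terminates TLS but holds no key material: one RPC per handshake."""
    def __init__(self, key_server: KeyServer):
        self._ks = key_server

    def handshake_signature(self, transcript: bytes) -> bytes:
        return self._ks.sign(hashlib.sha256(transcript).digest())
```

<p>With this split, a memory-disclosure bug in the edge process leaks at worst a handful of signatures, never the private key itself.</p>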
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Bndw9miSZSb1LZhH9bPoT/f9ba0c1e2d7ae1055f7f23cb7d65e534/image1-1.gif" />
            
            </figure><p>Geo Key Manager: configurable key management by geography</p><p>Heartbleed showed us that this strategy could also be useful to add an extra layer of security for private keys that we manage for our customers. In 2017, we launched <a href="/geo-key-manager-how-it-works/">Geo Key Manager</a>, a feature that allows customers to select which locations in the world they want their keys to be stored. Geo Key Manager protects keys against physical compromise of servers in different geographies. In 2019, we took this even further by implementing a strategy called <a href="/going-keyless-everywhere/">Keyless Everywhere</a>, moving all managed keys onto a system that provides logical separation between the Internet and the private keys.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7qPpBQgpHNVn59GIVqCD0h/4c6bfa78273296fe75e34325bb2c0119/image2-1.jpg" />
            
            </figure><p>Delegated Credentials: key separation with no latency</p><p>Keyless SSL is a great security solution, but depending on where keys are stored it can introduce some latency. Because of this, we worked with the IETF to develop a new standard called <a href="/keyless-delegation/">Delegated Credentials</a>, which allows TLS connections to use short-lived keys signed by the certificate’s private key, rather than using that key directly for every handshake. This eliminates the additional latency introduced by Keyless SSL. We support Delegated Credentials for all Keyless SSL and Geo Key Manager customers, and Firefox 89 (May 2021) will enable Delegated Credentials support by default.</p>
    <div>
      <h3>Making revocation work</h3>
      <a href="#making-revocation-work">
        
      </a>
    </div>
    <p>When Heartbleed happened and hundreds of thousands of certificates were deemed potentially compromised, the logical thing to do was to revoke and reissue these certificates. Doing so caused some unexpected consequences.</p><p>The first consequence of revocation was a massive surge of network traffic related to the revocation information. In 2014, there were three main mechanisms for certificate revocation:</p><ul><li><p>Certificate Revocation Lists (CRLs)</p><ul><li><p>A list of revoked serial numbers for a given certificate authority</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7k5nVuOjuLVYHHODuRVunR/97784bde40fad5bbf527f1e406c02cea/image3.jpg" />
            
            </figure></li></ul></li><li><p>Online Certificate Status Protocol (OCSP)</p><ul><li><p>A protocol to query a certificate authority about the revocation status of a certificate</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2QEflfTvpxn0peLcEXrC9R/e95fef4f9811c8ffb211f44776df06b5/image5.jpg" />
            
            </figure></li><li><p>OCSP can be queried by the browser, or responses can be included by the server at connection time (“OCSP stapling”)</p></li></ul></li><li><p>CRLSets — Chrome’s custom revocation system</p><ul><li><p>A meta-list of revoked serial numbers, limited to high-value certificates</p></li></ul></li></ul><p>Tens of thousands of certificates were revoked at once due to potential compromise from Heartbleed. After this, the CRL for a prominent CA, GlobalSign, increased from <a href="/the-hard-costs-of-heartbleed/">22KB to 4.7MB in a day</a>. This caused major disruptions in Cloudflare’s internal caching infrastructure and bandwidth spikes that affected the Internet at large, as all clients that relied on CRLs for certificate validation (mostly Microsoft Windows) downloaded the file.</p><p>After this revocation event, it became clear that there were other reasons why revocation wasn’t a functional system. In Firefox, if a user makes a connection to a site and a stapled OCSP response is not provided, Firefox queries the OCSP server for a response. However, Firefox implements a fail-open strategy: if the OCSP response takes too long, the revocation check is bypassed and the page is rendered. An attacker with a privileged network position could use a compromised <i>and revoked</i> certificate to attack users by simply blocking the OCSP request and letting the browser skip the revocation check. OCSP stapling is a way for a server to get the OCSP response to the browser in a reliable way, but since stapling is not a requirement for clients, an attacker could just not include the staple and the browser would fail open, leaving the user vulnerable to attack. 
OCSP also doesn’t provide strong protection against key compromise in fail-open mode (plus, OCSP is a <a href="https://www.eff.org/deeplinks/2020/11/macos-leaks-application-usage-forces-apple-make-hard-decisions">privacy leak</a>, but that’s another issue).</p><p>The situation in Chrome is even worse from a security perspective. Since neither OCSP nor CRLs are checked for the majority of certificates (CRLSets only contain revoked “Extended Validation” certificates), most certificates are trusted without checking revocation status. The solution to revoking the set of Cloudflare-managed certificates in Chrome for Heartbleed was actually a short patch to the Chromium codebase. Clearly not a scalable solution!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1XZmNGWOhzhZBL19rynmUH/49eb85d1d327415ee10ea49670b5b9c2/image7-11.png" />
            
            </figure><p>Mass revocation of certificates back in 2014 plainly didn’t work. Some of the certificates at the time were valid for up to five years, so a compromised key was a problem for a long time.</p><p>In 2015, a new standard emerged that seemed to provide a reasonable solution to this problem: <a href="https://scotthelme.co.uk/ocsp-must-staple/">OCSP Must-staple</a>. Certificates issued with the must-staple feature are only trusted if accompanied by a valid OCSP staple. If a must-staple certificate is compromised and revoked, then it can only be used to attack users for the lifetime of the last issued OCSP response (usually 10 days or less). This is a big improvement and allows certificate owners to limit their overall risk.</p><p>Cloudflare has supported best-effort OCSP stapling <a href="/ocsp-stapling-how-cloudflare-just-made-ssl-30/">since 2012</a>. In 2017, we set out to improve the reliability of our OCSP stapling so that we could support OCSP must-staple certificates. The result of this work was <a href="/high-reliability-ocsp-stapling/">High-reliability OCSP</a> stapling and a research study published <a href="https://dl.acm.org/doi/10.1145/3278532.3278543">in IMC</a> that demonstrated the feasibility of OCSP must-staple more broadly on the Internet. Cloudflare now supports OCSP must-staple certificates, providing an additional safety net in case of a future key compromise.</p>
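<p>The difference between fail-open OCSP and must-staple comes down to a few lines of client-side validation logic. A hypothetical sketch of the decision, where each status is <code>"good"</code>, <code>"revoked"</code>, or <code>None</code> when absent or blocked:</p>

```python
from typing import Optional

def accept_connection(must_staple: bool, staple: Optional[str],
                      ocsp_answer: Optional[str]) -> bool:
    # A staple, when present, always decides the outcome.
    if staple is not None:
        return staple == "good"
    # No staple but the certificate demands one: fail closed.
    if must_staple:
        return False
    # Legacy behavior: ask OCSP, but fail OPEN if the check is blocked --
    # the loophole an attacker holding a revoked key can exploit.
    if ocsp_answer is None:
        return True
    return ocsp_answer == "good"
```

<p>With must-staple, blocking the revocation check no longer helps the attacker; a revoked key is only usable for the lifetime of the last “good” staple.</p>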
    <div>
      <h3>2014 vs. now</h3>
      <a href="#2014-vs-now">
        
      </a>
    </div>
    <p>We’ve come a long way in seven years. Cloudflare is constantly innovating in the TLS/SSL key protection and security space. Here’s what changed over the past few years:</p><p>2014</p><ul><li><p>Five-year certificates</p></li><li><p>Opportunistic OCSP stapling</p></li><li><p>No OCSP must-staple</p></li><li><p>Keys in Internet-facing process</p></li><li><p>No Keyless SSL</p></li><li><p>No Delegated Credentials</p></li></ul><p>2021</p><ul><li><p>Configurable-lifetime certificates (from one year down to two weeks with <a href="/advanced-certificate-manager">ACM</a>)</p></li><li><p>100% OCSP stapling support</p></li><li><p>OCSP must-staple support</p></li><li><p>Keyless Everywhere</p></li><li><p>Keyless SSL + Cloud HSM support! (new)</p></li><li><p>Geo Key Manager</p></li><li><p>Delegated Credentials support</p></li></ul><p>These improvements and more are big reasons why Cloudflare is a leader in the security space and why the next Heartbleed won’t be as bad as the last.</p>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[TLS]]></category>
            <guid isPermaLink="false">tn3L4Hbe10jPgNgoWQyKf</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Securing the post-quantum world]]></title>
            <link>https://blog.cloudflare.com/securing-the-post-quantum-world/</link>
            <pubDate>Fri, 11 Dec 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ As quantum computing matures, R&D efforts in cryptography are keeping pace. We’re working with academia and industry peers to create new cryptography standards resilient to quantum computer attacks. ]]></description>
            <content:encoded><![CDATA[ <p></p>
    <div>
      <h3>Quantum computing is inevitable; cryptography prepares for the future</h3>
      <a href="#quantum-computing-is-inevitable-cryptography-prepares-for-the-future">
        
      </a>
    </div>
    <p>Quantum computing began in the early 1980s. It operates on principles of quantum physics rather than the limitations of circuits and electricity, which is why it is capable of processing certain highly complex mathematical problems so efficiently. Quantum computing could one day achieve things that classical computing simply cannot.</p><p>The evolution of quantum computers has been slow. Still, work is accelerating, thanks to the efforts of academic institutions such as Oxford, MIT, and the University of Waterloo, as well as companies like IBM, Microsoft, Google, and Honeywell. <a href="https://www.itpro.co.uk/technology/31818/what-is-quantum-computing">IBM has held a leadership role</a> in this innovation push and has named optimization the most likely application for consumers and organizations alike. Honeywell expects to release what it calls the <a href="https://spectrum.ieee.org/tech-talk/computing/hardware/honeywell-claims-it-has-most-powerful-quantum-computer">“world’s most powerful quantum computer”</a> for applications like fraud detection, optimization for trading strategies, security, <a href="https://www.cloudflare.com/learning/ai/what-is-machine-learning/">machine learning</a>, and chemistry and materials science.</p><p>In 2019, the Google Quantum Artificial Intelligence (AI) team announced that their 53-qubit machine (qubits are analogous to bits in classical computing) had achieved “quantum supremacy.” This was the first time a quantum computer was able to solve a problem faster than any classical computer in existence. This was considered a significant milestone.</p><p>Quantum computing will change the face of Internet security forever — particularly in the realm of cryptography, which is the way communications and information are secured across channels like the Internet. Cryptography is critical to almost every aspect of modern life, from banking to cellular communications to connected refrigerators and systems that keep subways running on time. 
This ultra-powerful, highly sophisticated new generation of computing has the potential to unravel decades of work that have been put into developing the cryptographic <a href="https://quantumalgorithmzoo.org/">algorithms</a> and standards we use today.</p>
    <div>
      <h3>Quantum computers will crack modern cryptographic algorithms</h3>
      <a href="#quantum-computers-will-crack-modern-cryptographic-algorithms">
        
      </a>
    </div>
    <p>Quantum computers can take a very large integer and find its prime factors extremely rapidly by using <a href="/the-quantum-menace/">Shor’s algorithm</a>. Why is this so important in the context of cryptographic security?</p><p>Most cryptography today is based on algorithms that incorporate difficult problems from number theory, like factoring. The forerunner of nearly all modern cryptographic schemes is RSA (Rivest-Shamir-Adleman), which was devised back in 1977. Basically, every participant in a public key cryptography system like RSA has both a public key and a private key. To send a secure message, data is encoded as a large number and scrambled using the public key of the person you want to send it to. The person on the receiving end can decrypt it with their private key. In RSA, the public key is a large number, and the private key is its prime factors. With Shor’s algorithm, a quantum computer with enough qubits could factor large numbers. For RSA, someone with a quantum computer can take a public key and factor it to get the private key, which allows them to read any message encrypted with that public key. This ability to factor numbers breaks nearly all modern cryptography. Since cryptography is what provides pervasive security for how we communicate and share information online, this has significant implications.</p><p>Theoretically, if an adversary were to gain control of a quantum computer, they could create total chaos. They could forge cryptographic certificates and impersonate banks to steal funds, disrupt Bitcoin, break into digital wallets, and access and decrypt confidential communications. Some liken this to Y2K. But, unlike Y2K, there’s no fixed date as to when existing cryptography will be rendered insecure. 
Researchers have been preparing and working hard to get ahead of the curve by building quantum-resistant cryptography solutions.</p><p>When will a quantum computer be built that is powerful enough to break all modern cryptography? By <a href="https://www.rand.org/content/dam/rand/pubs/research_reports/RR3100/RR3102/RAND_RR3102z.appendixC-D.pdf">some estimates</a>, it may take 10 to 15 years. Companies and universities have made a commitment to innovation in the field of quantum computing, and progress is certainly being made. Unlike classical computers, quantum computers rely on quantum effects, which only happen at the atomic scale. To instantiate a qubit, you need a particle that exhibits quantum effects, like an electron or a photon. These particles are extremely small and hard to manage, so one of the biggest hurdles to the realization of quantum computers is keeping the qubits stable long enough to carry out long computations such as the ones cryptanalytic algorithms require.</p>
    <div>
      <h3>Both quantum computing and quantum-resistant cryptography are works in progress</h3>
      <a href="#both-quantum-computing-and-quantum-resistant-cryptography-are-works-in-progress">
        
      </a>
    </div>
    <p>It takes a long time for hardware technology to develop and mature. Similarly, new cryptographic techniques take a long time to discover and refine. To protect today’s data from tomorrow’s quantum adversaries, we need new cryptographic techniques that are not vulnerable to Shor’s algorithm.</p><p>The <a href="https://csrc.nist.gov/Projects/Post-Quantum-Cryptography">National Institute of Standards and Technology (NIST)</a> is leading the charge in defining <a href="https://www.cloudflare.com/learning/ssl/quantum/what-is-post-quantum-cryptography/">post-quantum cryptography algorithms </a>to replace RSA and ECC. A project is currently underway to test and select a set of quantum-resistant algorithms that go beyond existing public-key cryptography. NIST plans to make a recommendation sometime between 2022 and 2024 for two to three algorithms for both encryption and digital signatures. As <a href="https://www.sciencemag.org/news/2019/08/cryptographers-scramble-protect-internet-hackers-quantum-computers">Dustin Moody</a>, a NIST mathematician, points out, the organization wants to cover as many bases as possible: “If some new attack is found that breaks all lattices, we’ll still have something to fall back on.”</p><p>We’re following closely. Participants in the NIST process have developed high-speed implementations of post-quantum algorithms for different computer architectures. We’ve taken some of these algorithms and tested them in Cloudflare’s systems in various capacities. Last year, Cloudflare and Google performed the <a href="/the-tls-post-quantum-experiment/">TLS Post-Quantum Experiment</a>, which involved implementing and supporting new key exchange mechanisms based on post-quantum cryptography for all Cloudflare customers for a period of a few months. 
As an edge provider, Cloudflare was well positioned to turn on post-quantum algorithms for millions of websites to measure performance and use these algorithms to provide confidentiality in TLS connections. This experiment led us to some useful insights around which algorithms we should focus on for TLS and which we should not (<a href="/sidh-go/">sorry, SIDH</a>!).</p><p>More recently, we have been working with researchers from the University of Waterloo and Radboud University on a new protocol called KEMTLS, which will be presented at <a href="https://rwc.iacr.org/2021/">Real World Crypto 2021</a>. In our last TLS experiment, we replaced the key negotiation part of TLS with quantum-safe alternatives but continued to rely on digital signatures. KEMTLS is designed to be fully post-quantum and relies only on key encapsulation mechanisms (KEMs), rather than handshake signatures.</p><p>On the implementation side, Cloudflare team members including Armando Faz-Hernández and visiting researcher Bas Westerbaan have developed high-speed assembly versions of several of the NIST finalists (<a href="https://pq-crystals.org/">Kyber, Dilithium</a>), as well as other relevant post-quantum algorithms (CSIDH, SIDH) in our <a href="https://github.com/cloudflare/circl">CIRCL cryptography library</a> written in Go.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4nOM9IQgBdMkqvr9OJjuu0/f5fd162299b8bb132f38b91a2c2e714a/image2-31.png" />
            
            </figure><p>A visualization of AVX2-optimized NTT for Kyber by Bas Westerbaan</p>
    <div>
      <h3>Post-quantum security, coming soon?</h3>
      <a href="#post-quantum-security-coming-soon">
        
      </a>
    </div>
    <p>Everything that is encrypted with today’s public key cryptography can be decrypted with tomorrow’s quantum computers. Imagine waking up one day, and everyone’s diary from 2020 is suddenly public. Although it’s impossible to find enough storage to record <i>all</i> the ciphertext sent over the Internet, there are <a href="https://en.wikipedia.org/wiki/Utah_Data_Center">current and active efforts</a> to collect a <i>lot</i> of it. This makes deploying post-quantum cryptography as soon as possible a pressing privacy concern.</p><p>Cloudflare is taking steps to accelerate this transition. First, we endeavor to use post-quantum cryptography for most internal services by the end of 2021. Second, we plan to be among the first services to offer post-quantum cipher suites to customers as standards emerge. We’re optimistic that collaborative efforts among NIST, Microsoft, Cloudflare, and other computing companies will yield a robust, standards-based solution. Although powerful quantum computers are likely in our future, Cloudflare is helping to make sure the Internet is ready for when they arrive.</p><p>For more on quantum computing, check out my interview with <a href="https://cloudflare.tv/event/2eWU9jyPcSCMqCcqb4HuEo">Scott Aaronson</a> and this segment by <a href="https://cloudflare.tv/event/5G46CmInDoEyAFmk9Ewi3O">Sofía Celi and Armando Faz</a> (in Spanish!) on Cloudflare TV.</p> ]]></content:encoded>
            <category><![CDATA[Privacy Week]]></category>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Post-Quantum]]></category>
            <guid isPermaLink="false">7xIMhuQpXYNYAvmjiKgiq</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Helping build the next generation of privacy-preserving protocols]]></title>
            <link>https://blog.cloudflare.com/next-generation-privacy-protocols/</link>
            <pubDate>Tue, 08 Dec 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ Today, we’re making several announcements around improving Internet protocols with respect to something important to our customers and Internet users worldwide: privacy. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3KjEAqn2Lizr1zW42YzTU4/6492bcae03200a5c1688671ecc3b6291/Privacy-protocols-2.png" />
            
            </figure><p>Over the last ten years, Cloudflare has become an important part of Internet infrastructure, powering websites, APIs, and web services to help make them more secure and efficient. The Internet is growing in terms of its capacity and the number of people using it and evolving in terms of its design and functionality. As a player in the Internet ecosystem, Cloudflare has a responsibility to help the Internet grow in a way that respects and provides value for its users. Today, we’re making several announcements around improving Internet protocols with respect to something important to our customers and Internet users worldwide: privacy.</p><p>These initiatives are:</p><ul><li><p>Fixing one of the last information leaks in HTTPS through <a href="/encrypted-client-hello"><b>Encrypted Client Hello (ECH)</b></a><b>,</b> previously known as <a href="/encrypted-sni/">Encrypted SNI</a></p></li><li><p>Making <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS</a> even more private by supporting <a href="/oblivious-dns"><b>Oblivious DNS-over-HTTPS (ODoH)</b></a></p></li><li><p>Developing a superior protocol for password authentication, <a href="/opaque-oblivious-passwords"><b>OPAQUE</b></a>, that makes password breaches less likely to occur</p></li></ul><p>Each of these projects impacts an aspect of the Internet that influences our online lives and digital footprints. Whether we know it or not, there is a lot of private information about us and our lives floating around online. This is something we can help fix.</p><p>For over a year, we have been working through standards bodies like the IETF and partnering with the biggest names in Internet technology (including Mozilla, Google, Equinix, and more) to design, deploy, and test these new privacy-preserving protocols at Internet scale. 
Each of these three protocols touches on a critical aspect of our online lives, and we expect them to help make real improvements to privacy online as they gain adoption.</p>
    <div>
      <h3>A continuing tradition at Cloudflare</h3>
      <a href="#a-continuing-tradition-at-cloudflare">
        
      </a>
    </div>
    <p>One of Cloudflare’s core missions is to support and develop technology that helps build a better Internet. As an industry, we’ve made exceptional progress in making the Internet more secure and robust. Cloudflare is proud to have played a part in this progress through multiple initiatives over the years.</p><p>Here are a few highlights:</p><ul><li><p><a href="/introducing-universal-ssl/"><b>Universal SSL</b></a>™. We’ve been one of the driving forces for encrypting the web. We launched Universal SSL in 2014 to give website encryption to our customers for free and have actively been working along with certificate authorities like Let’s Encrypt, web browsers, and website operators to help remove <a href="/tag/mixed-content-errors/">mixed content</a>. Before Universal SSL launched to give all Cloudflare customers HTTPS for free, only 30% of connections to websites were encrypted. Through the industry’s efforts, that number is now <a href="https://letsencrypt.org/stats/">80%</a> -- and a much more significant proportion of overall Internet traffic. Along with doing our part to encrypt the web, we have supported the Certificate Transparency project via <a href="/introducing-certificate-transparency-and-nimbus/">Nimbus</a> and <a href="https://ct.cloudflare.com/">Merkle Town</a>, which has improved accountability for the certificate ecosystem HTTPS relies on for trust.</p></li><li><p><b>TLS 1.3 and QUIC</b>. We’ve also been a proponent of upgrading existing security protocols. Take Transport Layer Security (TLS), the underlying protocol that secures HTTPS. Cloudflare engineers helped contribute to the design of TLS 1.3, the latest version of the standard, and <a href="/introducing-tls-1-3/">in 2016</a> we launched support for an early version of the protocol. This early deployment helped lead to improvements to the final version of the protocol. 
TLS 1.3 is now the most widely used encryption protocol on the web and a vital component of the <a href="/last-call-for-quic/">emerging QUIC standard</a>, of which we were also early adopters.</p></li><li><p><b>Securing Routing, Naming, and Time</b>. We’ve made major efforts to help secure other critical components of the Internet. Our efforts to help secure Internet routing through our <a href="/cloudflares-rpki-toolkit/">RPKI toolkit</a>, <a href="https://conferences.sigcomm.org/imc/2019/presentations/p221.pdf">measurement studies</a>, and “<a href="/is-bgp-safe-yet-rpki-routing-security-initiative/">Is BGP Safe Yet</a>” tool have significantly improved the Internet’s resilience against disruptive route leaks. Our time service (<a href="/secure-time/">time.cloudflare.com</a>) has helped keep people’s clocks in sync with more secure protocols like <a href="/nts-is-now-rfc/">NTS</a> and <a href="/roughtime/">Roughtime</a>. We’ve also made DNS more secure by supporting <a href="/dns-encryption-explained/">DNS-over-HTTPS and DNS-over-TLS</a> in 1.1.1.1 at launch, along with one-click DNSSEC in our <a href="/introducing-universal-dnssec/">authoritative DNS</a> service and <a href="/one-click-dnssec-with-cloudflare-registrar/">registrar</a>.</p></li></ul><p>Continuing to improve the security of the systems of trust online is critical to the Internet’s growth. However, there is a more fundamental principle at play: respect. The infrastructure underlying the Internet should be designed to respect its users.</p>
    <div>
      <h3>Building an Internet that respects users</h3>
      <a href="#building-an-internet-that-respects-users">
        
      </a>
    </div>
    <p>When you sign in to a specific website or service with a privacy policy, you know what that site is expected to do with your data. It’s explicit. Users have no such visibility into the operators of the Internet itself. You may have an agreement with your Internet Service Provider (ISP) and the site you’re visiting, but it’s doubtful that you even know which <a href="http://www.washingtonpost.com/graphics/national/security-of-the-internet/bgp">networks your data is traversing</a>. Most people don’t have a concept of the Internet beyond what they see on their screen, so it’s hard to imagine that people would accept or even understand what a privacy policy from a <a href="/the-relative-cost-of-bandwidth-around-the-world/">transit wholesaler</a> or an <a href="https://us-cert.cisa.gov/ncas/alerts/TA17-075A">inspection middlebox</a> would even mean.</p><p>Without encryption, Internet browsing information is implicitly shared with countless third parties online as information passes between networks. Without secure routing, users’ traffic can be hijacked and disrupted. Without privacy-preserving protocols, users’ online life is not as private as they would think or expect. The infrastructure of the Internet wasn’t built in a way that reflects their expectations.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7rjHEqRERPxkcFeoNRwfAX/37548cf8be78a4849c9a188c076ca483/image3.png" />
            
            </figure><p>Normal network flow</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/76hMzZSOArtOdRiTWCeXqH/b2063a3b7aef6e410f30efcbc242f4b6/image1-7.png" />
            
            </figure><p>Network flow with malicious route leak</p><p>The good news is that the Internet is continuously evolving. One of the groups that help guide that evolution is the Internet Architecture Board (IAB). The IAB provides architectural oversight to the Internet Engineering Task Force (IETF), the Internet’s main standard-setting body. The IAB recently published <a href="https://www.rfc-editor.org/rfc/rfc8890.html">RFC 8890</a>, which states that individual end-users should be prioritized when designing Internet protocols. It says that if there’s a conflict between the interests of end-users and the interest of service providers, corporations, or governments, IETF decisions should favor end users. One of the prime interests of end-users is the right to privacy, and the IAB published <a href="https://tools.ietf.org/html/rfc6973">RFC 6973</a> to indicate how Internet protocols should take privacy into account.</p><p>Today’s technical blog posts are about <b>improvements to the Internet designed to respect user privacy</b>. Privacy is a complex topic that spans multiple disciplines, so it’s essential to clarify what we mean by “improving privacy.” We are specifically talking about changing the protocols that handle privacy-sensitive information exposed “on-the-wire” and modifying them so that this data is exposed to fewer parties. This data continues to exist. It’s just no longer available or visible to third parties without building a mechanism to collect it at a higher layer of the Internet stack, the application layer. <i>These changes go beyond website encryption</i>; they go deep into the design of the systems that are foundational to making the Internet what it is.</p>
    <div>
      <h3>The toolbox: cryptography and secure proxies</h3>
      <a href="#the-toolbox-cryptography-and-secure-proxies">
        
      </a>
    </div>
    <p>Two tools for making sure data can be used without being seen are <i>cryptography</i> and <i>secure</i> <i>proxies</i>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/71bC5CqEyrYCZ0RpSJGbHI/922ebc973778951111a1a1881b978e71/Cryptography-and-Secure-Proxies.png" />
            
            </figure><p>Cryptography allows information to be transformed into a format that a very limited number of people (those with the key) can understand. Some describe cryptography as a tool that transforms data security problems into key management problems. This is a humorous but fair description. Cryptography makes it easier to reason about privacy because only key holders can view data.</p><p>Another tool for protecting access to data is isolation/segmentation. By physically limiting which parties have access to information, you effectively build privacy walls. A popular architecture is to rely on policy-aware proxies to pass data from one place to another. Such proxies can be configured to strip sensitive data or block data transfers between parties according to what the privacy policy says.</p><p>Both these tools are useful individually, but they can be even more effective if combined. Onion routing (the cryptographic technique <a href="/cloudflare-onion-service/">underlying Tor</a>) is one example of how proxies and encryption can be used in tandem to enforce strong privacy. Broadly, if party A wants to send data to party B, they can encrypt the data with party B’s key and encrypt the metadata with a proxy’s key and send it to the proxy.</p><p>Platforms and services built on top of the Internet can build in consent systems, like privacy policies presented through user interfaces. The infrastructure of the Internet relies on layers of underlying protocols. Because these layers of the Internet are so far below where the user interacts with them, it’s almost impossible to build a concept of user consent. In order to respect users and protect them from privacy issues, the protocols that glue the Internet together should be designed with privacy enabled by default.</p>
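<p>As a toy sketch of the onion pattern described above — data encrypted to party B, routing metadata encrypted to the proxy — the following Python snippet layers two encryptions and lets the proxy peel only its own. The hash-based stream cipher and shared keys are illustrative stand-ins, not a secure construction (real onion routing uses public-key cryptography):</p>

```python
import hashlib

def keystream_xor(key: bytes, data: bytes) -> bytes:
    # Toy stream cipher: XOR with a SHA-256-derived keystream. NOT secure.
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

# Stand-ins for real public keys (hypothetical names):
key_b = b"key-shared-with-B"
key_proxy = b"key-shared-with-proxy"

# Party A encrypts the data for B, then wraps routing metadata for the proxy.
inner = keystream_xor(key_b, b"hello B")
outer = keystream_xor(key_proxy, b"to:B|" + inner)

# The proxy peels its layer: it learns the destination, not the contents.
peeled = keystream_xor(key_proxy, outer)
dest, payload = peeled.split(b"|", 1)
assert dest == b"to:B"

# Only B can remove the final layer and read the data.
assert keystream_xor(key_b, payload) == b"hello B"
```

<p>The proxy sees who is talking to whom but not what is said; B sees what is said but, if the message arrives via the proxy, not who sent it.</p>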
    <div>
      <h3>Data vs. metadata</h3>
      <a href="#data-vs-metadata">
        
      </a>
    </div>
    <p>The transition from a mostly unencrypted web to an encrypted web has done a lot for end-user privacy. For example, the “<a href="https://codebutler.com/2010/10/24/firesheep/">coffeeshop stalker</a>” is no longer an issue for most sites. When accessing the majority of sites online, users are no longer broadcasting every aspect of their web browsing experience (search queries, browser versions, authentication cookies, etc.) over the Internet for any participant on the path to see. If a site is configured correctly to use HTTPS, users can be confident their data is secure from onlookers and reaches only the intended party, because their connections are both encrypted and authenticated.</p><p>However, HTTPS only protects the <i>content</i> of web requests. Even if you only browse sites over HTTPS, that doesn’t mean that your <i>browsing</i> <i>patterns</i> are private. This is because HTTPS fails to encrypt a critical aspect of the exchange: the metadata. When you make a phone call, the metadata is the phone number, not the call’s contents. Metadata is data about the data.</p><p>To illustrate the difference and why it matters, here’s a diagram of what happens when you visit a website like an imageboard. Say you’re going to a specific page on that board (<a href="https://images.com/room101/">https://.com/room101/</a>) that has specific embedded images hosted on .com.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2WDSyOoFNRtXcGj5XXQC9p/b1c9ce791e8d84798b93782c97703c37/image5-2.png" />
            
            </figure><p>Page load for an imageboard, returning an HTML page with an image from an embarrassing site</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6DvbyOK8cIcvblLUmPsQCl/c994c812f99f917e5ae4c86898da827c/image4.png" />
            
            </figure><p>Subresource fetch for the image from an embarrassing site</p><p>The space inside the dotted line here represents the part of the Internet that your data needs to transit. These include your local area network or coffee shop, your ISP, an Internet transit provider, and possibly the network portion of the cloud provider that hosts the server. Users often don’t have a relationship with these entities or a contract to prevent these parties from doing anything with the user’s data. And even if those entities don’t look at the data, a well-placed observer intercepting Internet traffic could see anything sent unencrypted. It would be best if they just didn’t see it at all. In this example, the fact that the user visited .com can be seen by an observer, which is expected. However, though the page content is encrypted, <i>which specific page you’ve visited</i> can still be seen, since .com is also visible.</p><p>It’s a general rule that if data is available to on-path parties on the Internet, some of these on-path parties will use this data. It’s also true that these on-path parties need some metadata in order to facilitate the transport of this data. This balance is explored in <a href="https://www.rfc-editor.org/rfc/rfc8558.html">RFC 8558</a>, which explains how protocols should be designed thoughtfully with respect to the balance between too much metadata (bad for privacy) and too little metadata (bad for operations).</p><p>In an ideal world, Internet protocols would be designed with the principle of least privilege. They would provide the minimum amount of information needed for the on-path parties (the pipes) to do the job of transporting the data to the right place and keep everything else confidential by default. Current protocols, including TLS 1.3 and QUIC, are important steps towards this ideal but fall short with respect to metadata privacy.</p>
    <div>
      <h3>Knowing both who you are and what you do online can lead to profiling</h3>
      <a href="#knowing-both-who-you-are-and-what-you-do-online-can-lead-to-profiling">
        
      </a>
    </div>
    <p>Today’s announcements reflect two metadata protection levels: the first involves limiting the amount of metadata available to third-party observers (like ISPs). The second involves restricting the amount of metadata that users share with service providers themselves.</p><p>Hostnames are an example of metadata that needs to be protected from third-party observers, which DoH and ECH intend to do. However, it doesn’t make sense to hide the hostname from the site you’re visiting. It also doesn’t make sense to hide it from a directory service like DNS. A DNS server needs to know which hostname you’re resolving to resolve it for you!</p><p>A privacy issue arises when a service provider knows about both what sites you’re visiting and who you are. Individual websites do not have this dangerous combination of information (except in the case of third party cookies, which <a href="https://www.cnbc.com/2020/01/14/google-chrome-to-end-support-for-third-party-cookies-within-two-years.html">are going away soon in browsers</a>), but DNS providers do. Thankfully, it’s not actually necessary for a DNS resolver to know <i>both</i> the hostname of the service you’re going to and which IP you’re coming from. Disentangling the two, which is the goal of ODoH, is good for privacy.</p>
    <div>
      <h3>The Internet is part of 'our' Infrastructure</h3>
      <a href="#the-internet-is-part-of-our-infrastructure">
        
      </a>
    </div>
    <p>Roads should be well-paved, well-lit, have accurate signage, and be optimally connected. They aren't designed to stop a car based on who's inside it. Nor should they be! Like transportation infrastructure, Internet infrastructure is responsible for getting data where it needs to go, not for looking inside packets and making judgments. But the Internet is made of computers and software, and software tends to be written to make decisions based on the data it has available to it.</p><p>Privacy-preserving protocols attempt to eliminate the temptation for infrastructure providers and others to peek inside and make decisions based on personal data. A non-privacy-preserving protocol like HTTP keeps data and metadata, like passwords, IP addresses, and hostnames, as explicit parts of the data sent over the wire. The fact that they are explicit means that they are available to any observer to collect and act on. A protocol like HTTPS improves upon this by making some of the data (such as passwords and site content) invisible on the wire using encryption.</p><p>The three protocols we are exploring today extend this concept.</p><ul><li><p><b>ECH</b> takes most of the unencrypted metadata in TLS (including the hostname) and encrypts it with a key that was fetched ahead of time.</p></li><li><p><b>ODoH</b> (a new variant of DoH co-designed by Apple, Cloudflare, and Fastly engineers) uses proxies and onion-like encryption to make the source of a DNS query invisible to the DNS resolver. This protects the user’s IP address when resolving hostnames.</p></li><li><p><b>OPAQUE</b> uses a new cryptographic technique to keep passwords hidden <b><i>even from the server</i></b>. 
Utilizing a construction called an Oblivious Pseudo-Random Function (as seen in <a href="/privacy-pass-the-math/">Privacy Pass</a>), the server does not learn the password; it only learns whether or not the user knows the password.</p></li></ul><p>By making sure Internet infrastructure acts more like physical infrastructure, user privacy is more easily protected. The Internet is more private if private data can only be collected where the user has a chance to consent to its collection.</p>
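<p>To give a flavor of the oblivious pseudo-random function (OPRF) that OPAQUE builds on, here is a toy Diffie–Hellman-style blinded evaluation in Python. The tiny group parameters and the sample password are purely illustrative; deployed OPRFs work over elliptic-curve groups.</p>

```python
import hashlib
import secrets

# Toy safe-prime group: p = 2q + 1, so the quadratic residues mod p
# form a subgroup of prime order q. Far too small to be secure.
p = 1019
q = 509

def hash_to_group(pw: bytes) -> int:
    h = int.from_bytes(hashlib.sha256(pw).digest(), "big") % p
    return pow(h, 2, p)  # squaring maps into the order-q subgroup

# Client: blind the hashed password so the server never sees it.
pw = b"correct horse battery staple"
r = secrets.randbelow(q - 1) + 1           # random blinding factor
blinded = pow(hash_to_group(pw), r, p)

# Server: apply its secret key k to the *blinded* value only.
k = secrets.randbelow(q - 1) + 1
evaluated = pow(blinded, k, p)

# Client: unblind, obtaining H(pw)^k without ever revealing pw.
r_inv = pow(r, -1, q)                      # inverse of r modulo the group order
oprf_output = pow(evaluated, r_inv, p)
assert oprf_output == pow(hash_to_group(pw), k, p)
```

<p>The server only ever sees the blinded value, so it learns nothing about the password; the client learns only the final OPRF output, from which login credentials can be derived.</p>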
    <div>
      <h3>Doing it together</h3>
      <a href="#doing-it-together">
        
      </a>
    </div>
    <p>As much as we’re excited about working on new ways to make the Internet more private, innovation at a global scale doesn’t happen in a vacuum. Each of these projects is the output of a collaborative group of individuals working out in the open in organizations like the IETF and the IRTF. Protocols must come about through a consensus process that involves all the parties that make up the interconnected set of systems that power the Internet. From browser builders to cryptographers, from DNS operators to website administrators, this is truly a global team effort.</p><p>We also recognize that sweeping technical changes to the Internet will inevitably also impact the technical community. Adopting these new protocols may have legal and policy implications. We are actively working with governments and civil society groups to help educate them about the impact of these potential changes.</p><p>We’re looking forward to sharing our work today and hope that more interested parties join in developing these protocols. The projects we are announcing today were designed by experts from academia, industry, and hobbyists together and were built by engineers from Cloudflare Research (including the work of interns, which we will highlight) with support from across Cloudflare.</p><p>If you’re interested in this type of work, <a href="https://www.cloudflare.com/careers/jobs/">we’re hiring</a>!</p> ]]></content:encoded>
            <category><![CDATA[Privacy Week]]></category>
            <category><![CDATA[Encryption]]></category>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[DoH]]></category>
            <category><![CDATA[Authentication]]></category>
            <category><![CDATA[Passwords]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Encrypted SNI]]></category>
            <category><![CDATA[Research]]></category>
            <guid isPermaLink="false">6Npild5sJTVfGo3GttHrTd</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[SAD DNS Explained]]></title>
            <link>https://blog.cloudflare.com/sad-dns-explained/</link>
            <pubDate>Fri, 13 Nov 2020 19:06:13 GMT</pubDate>
            <description><![CDATA[ Researchers from UC Riverside and Tsinghua University found a new way to revive a decade-old DNS cache poisoning attack. Read our deep dive into how the SAD DNS attack on DNS resolvers works, how we protect against this attack in 1.1.1.1, and what the future holds for DNS cache poisoning attacks. ]]></description>
            <content:encoded><![CDATA[ <p>This week, at the <a href="https://www.sigsac.org/ccs/CCS2020/conference-program.html">ACM CCS 2020 conference</a>, researchers from UC Riverside and Tsinghua University announced a new attack against the <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">Domain Name System (DNS)</a> called <a href="https://www.saddns.net/">SAD DNS</a> (Side channel AttackeD DNS). This attack leverages recent features of the networking stack in modern operating systems (like Linux) to allow attackers to revive a classic attack category: DNS cache poisoning. As part of a coordinated disclosure effort earlier this year, the researchers contacted Cloudflare and other major DNS providers and we are happy to announce that 1.1.1.1 Public Resolver is no longer vulnerable to this attack.</p><p>In this post, we’ll explain what the vulnerability was, how it relates to previous attacks of this sort, what mitigation measures we have taken to protect our users, and future directions the industry should consider to prevent this class of attacks from being a problem in the future.</p>
    <div>
      <h3>DNS Basics</h3>
      <a href="#dns-basics">
        
      </a>
    </div>
    <p>The Domain Name System (DNS) is what allows users of the Internet to get around without memorizing long sequences of numbers. What’s often called the “phonebook of the Internet” is more like a helpful system of translators that take natural language <a href="https://www.cloudflare.com/learning/dns/glossary/what-is-a-domain-name/">domain names</a> (like blog.cloudflare.com or gov.uk) and translate them into the native language of the Internet: IP addresses (like 192.0.2.254 or [2001:db8::cf]). This translation happens behind the scenes so that users only need to remember hostnames and don’t have to get bogged down with remembering IP addresses.</p><p>DNS is both a system and a protocol. It refers to the hierarchical system of computers that manage the data related to naming on a network and it refers to the language these computers use to speak to each other to communicate answers about naming. The DNS protocol consists of pairs of messages that correspond to questions and responses. Each DNS question (query) and answer (reply) follows a standard format and contains a set of parameters that contain relevant information such as the name of interest (such as blog.cloudflare.com) and the type of response record desired (such as A for IPv4 or AAAA for IPv6).</p>
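As a quick illustration of this translation (not from the original post), the Python standard library exposes the system resolver; "localhost" is used here so the sketch works without network access:

```python
import socket

# Ask the OS resolver to translate a hostname into addresses, the same
# job DNS performs for names like blog.cloudflare.com. "localhost" is
# used so this sketch works without network access.
results = socket.getaddrinfo("localhost", 443, proto=socket.IPPROTO_TCP)

# Each result tuple ends with a sockaddr whose first element is the IP.
addresses = sorted({entry[4][0] for entry in results})
print(addresses)
```

For a real domain you would pass its hostname instead, and the resolver chain described below does the work behind the scenes.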
    <div>
      <h3>The DNS Protocol and Spoofing</h3>
      <a href="#the-dns-protocol-and-spoofing">
        
      </a>
    </div>
    <p>These DNS messages are exchanged over a network between machines using a transport protocol. Originally, DNS used UDP, a simple stateless protocol in which messages are endowed with a set of metadata indicating a source port and a destination port. More recently, DNS has adapted to use more complex transport protocols such as TCP and even advanced protocols like TLS or HTTPS, which incorporate encryption and strong authentication into the mix (see Peter Wu’s blog post about <a href="/dns-encryption-explained/">DNS protocol encryption</a>).</p><p>Still, the most common transport protocol for message exchange is UDP, which has the advantages of being fast, ubiquitous and requiring no setup. Because UDP is stateless, the pairing of a response to an outstanding query is based on two main factors: the source address and port pair, and information in the DNS message. Given that UDP is both stateless and unauthenticated, anyone, and not just the recipient, can send a response with a forged source address and port, which opens up a range of potential problems.</p>
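To make the pairing concrete, here is an illustrative sketch (hypothetical values, not production code) that packs a minimal DNS query: the only per-query secrets an off-path attacker must guess are the 16-bit transaction ID inside the DNS header and, when randomized, the 16-bit UDP source port.

```python
import random
import struct

txid = random.getrandbits(16)            # 16 bits of in-message entropy
flags = 0x0100                           # standard query, recursion desired
header = struct.pack("!HHHHHH", txid, flags, 1, 0, 0, 0)  # 12-byte header

# Question section: QNAME=example.com, QTYPE=A (1), QCLASS=IN (1)
qname = b"".join(bytes([len(p)]) + p for p in (b"example", b"com")) + b"\x00"
query = header + qname + struct.pack("!HH", 1, 1)

src_port = random.randint(1024, 65535)   # chosen by the OS in practice
print(len(header), hex(txid), src_port)
```

Everything else in the message (name, type, class) is predictable, which is why the ID and source port carry all the protocol's entropy.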
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/dCFbp5FSjzJUvxRtOx371/3905b3efe9706c5044b5399aacca28d8/image4-6.png" />
            
            </figure><p>The blue portions contribute randomness</p><p>Since the transport layer is inherently unreliable and untrusted, the DNS protocol was designed with additional mechanisms to protect against forged responses. The first two bytes in the message form a message or transaction ID that must be the same in the query and response. When a DNS client sends a query, it sets the ID to a random value and expects the value in the response to match. This unpredictability introduces entropy into the protocol, which makes it less likely that a malicious party will be able to construct a valid DNS reply without first seeing the query. There are other variables used to pair query and response, such as the DNS query name and query type, but these are trivial to guess and don’t introduce additional entropy.</p><p>Those paying close attention to the diagram may notice that the amount of entropy introduced by this measure is only around 16 bits, which means that there are just 65,536 possibilities to go through to find the matching reply to a given query. More on this later.</p>
    <div>
      <h3>The DNS Ecosystem</h3>
      <a href="#the-dns-ecosystem">
        
      </a>
    </div>
    <p>DNS servers fall into a few main categories: recursive resolvers (like 1.1.1.1 or 8.8.8.8) and nameservers (like the <a href="/f-root/">DNS root servers</a> or <a href="https://www.cloudflare.com/dns/">Cloudflare Authoritative DNS</a>). There are also elements of the ecosystem that act as “forwarders” such as <a href="http://www.thekelleys.org.uk/dnsmasq/doc.html">dnsmasq</a>. In a typical DNS lookup, these DNS servers work together to complete the task of delivering the IP address for a specified domain to the client (the client is usually a stub resolver - a simple resolver built into an operating system). For more detailed information about the DNS ecosystem, take a look at <a href="https://www.cloudflare.com/learning/dns/dns-server-types/">our learning site</a>. The SAD DNS attack targets the communication between recursive resolvers and nameservers.</p><p>Each of the participants in DNS (client, resolver, nameserver) uses the DNS protocol to communicate with the others. Most of the latest innovations in DNS revolve around <a href="/dns-encryption-explained/">upgrading the transport</a> between users and recursive resolvers to use encryption. Upgrading the transport protocol between resolvers and authoritative servers is a bit more complicated as it requires a new discovery mechanism to instruct the resolver when (and when not) to use a more secure channel. Aside from a few examples like <a href="/encrypting-dns-end-to-end/">our work with Facebook</a> to encrypt recursive-to-authoritative traffic with DNS-over-TLS, most of these exchanges still happen over UDP. This is the core issue that enables this new attack on DNS, and one that we’ve seen before.</p>
    <div>
      <h3>Kaminsky’s Attack</h3>
      <a href="#kaminskys-attack">
        
      </a>
    </div>
    <p>Prior to 2008, recursive resolvers typically used a single open port (usually port 53) to send and receive messages to authoritative nameservers. This made guessing the source port trivial, so the only variable an attacker needed to guess to forge a response to a query was the 16-bit message ID. The attack <a href="https://www.linuxjournal.com/content/understanding-kaminskys-dns-bug">Kaminsky described</a> was relatively simple: whenever a recursive resolver queried the authoritative nameserver for a given domain, an attacker would flood the resolver with DNS responses covering some or all of the 65,536 possible message IDs. If a malicious answer with the right message ID arrived before the response from the authoritative server, the DNS cache would be effectively poisoned, returning the attacker’s chosen answer instead of the real one for as long as the DNS record remained valid (its TTL, or time-to-live).</p><p>For popular domains, resolvers contact authoritative servers once per TTL (which can be as short as 5 minutes), so there are plenty of opportunities to mount this attack. Forwarders that cache DNS responses are also vulnerable to this type of attack.</p>
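A back-of-the-envelope calculation (illustrative, not from Kaminsky's write-up) shows why guessing only the 16-bit message ID is practical:

```python
# With a known source port, only the 16-bit message ID must be guessed.
ID_SPACE = 2 ** 16                      # 65,536 possible message IDs

def hit_probability(n_guesses: int) -> float:
    """Chance that at least one of n distinct forged replies matches."""
    return min(n_guesses / ID_SPACE, 1.0)

per_window = hit_probability(600)       # 600 forged replies per open window
windows_needed = 1 / per_window         # expected windows until a poisoning
print(per_window, windows_needed)
```

Since the attacker gets a fresh window every TTL expiry, roughly a hundred attempts at a ~1% success rate each will poison the cache in a matter of hours for a short-TTL record.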
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1UE6h36FxkzSRIJlh4HQj4/843403c141ee61cf047365a9b8e7a0f5/image3-4.png" />
            
            </figure><p>In response to this attack, DNS resolvers started doing source port randomization and carefully checking the security ranking of cached data. To poison these updated resolvers, forged responses would not only need to guess the message ID, but they would also have to guess the source port, bringing the number of guesses from the tens of thousands to over four billion. This made the attack effectively infeasible. Furthermore, the IETF published <a href="https://tools.ietf.org/html/rfc5452">RFC 5452</a> on how to harden DNS against guessing attacks.</p><p>It should be noted that this attack did not work for DNSSEC-signed domains since their answers are digitally signed. However, even now in 2020, DNSSEC is far from universal.</p>
    <div>
      <h3>Defeating Source Port Randomization with Fragmentation</h3>
      <a href="#defeating-source-port-randomization-with-fragmentation">
        
      </a>
    </div>
    <p>Another way to avoid having to guess the source port number and message ID is to split the DNS response in two. As is often the case in computer security, old attacks become new again when attackers discover new capabilities. In 2012, researchers Amir Herzberg and Haya Schulman from Bar Ilan University <a href="https://arxiv.org/pdf/1205.4011.pdf;">discovered</a> that it was possible for a remote attacker to defeat the protections provided by source port randomization. This new attack leveraged another feature of UDP: fragmentation. For a primer on the topic of UDP fragmentation, check out our <a href="/ip-fragmentation-is-broken/">previous blog post</a> on the subject by Marek Majkowski.</p><p>The key to this attack is the fact that all the randomness that needs to be guessed in a DNS poisoning attack is concentrated at the beginning of the DNS message (UDP header and DNS header). If the UDP response packet (sometimes called a datagram) is split into two fragments, the first half containing the message ID and source port and the second containing part of the DNS response, then all an attacker needs to do is forge the second fragment and make sure that the fake second fragment arrives at the resolver before the true second fragment does. When a datagram is fragmented, each fragment is assigned a 16-bit ID (called the IP-ID), which is used to reassemble the datagram at the other end of the connection. Since the second fragment only has the IP-ID as entropy (again, a familiar refrain in this area), this attack is feasible with a relatively small number of forged packets. The downside of this attack is the precondition that the response must be fragmented in the first place, and the forged fragment must be carefully crafted to match the original section counts and UDP checksum.</p>
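To see why only 16 bits protect the second fragment, here is a sketch (hypothetical values, not from the paper) that packs two IPv4 fragment headers: both carry the same IP-ID, and the second fragment contains no UDP header, so there is no port or message ID left to guess.

```python
import struct

def ipv4_header(ip_id: int, more_fragments: bool, frag_offset_8byte: int) -> bytes:
    ver_ihl = (4 << 4) | 5                      # IPv4, 20-byte header
    flags_frag = ((1 if more_fragments else 0) << 13) | frag_offset_8byte
    return struct.pack(
        "!BBHHHBBH4s4s",
        ver_ihl, 0, 40, ip_id, flags_frag, 64, 17, 0,   # proto 17 = UDP
        bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]),   # example addresses
    )

IP_ID = 0x1337                                  # the only value linking the two
first = ipv4_header(IP_ID, more_fragments=True, frag_offset_8byte=0)
second = ipv4_header(IP_ID, more_fragments=False, frag_offset_8byte=185)

# Both fragments share the IP-ID at bytes 4-5 of the header; an attacker
# forging the second fragment only needs to match this 16-bit value.
print(len(first), first[4:6].hex())
```

The checksum field is left at zero here; a real forgery must also keep the UDP checksum of the reassembled datagram valid, as noted above.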
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6HX4qrdRButwYt5dYR5wum/3d6b5054b4d782279417484a12553bda/image5-1.png" />
            
            </figure><p>Also discussed in the original and <a href="https://pki.cad.sit.fraunhofer.de/media/doc-CCS2018.pdf">follow-up papers</a> is a method of forcing two remote servers to send packets between each other which are fragmented at an attacker-controlled point, making this attack much more feasible. The details are in the paper, but it boils down to the fact that the control mechanism for describing the maximum transmission unit (MTU) between two servers -- which determines at which point packets are fragmented -- can be set via a forged UDP packet.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/21fbVKVHIVx78Vr9H0kIo3/1b9012b11ef9a6bb39920b2429dba302/image1-8.png" />
            
            </figure><p>We explored this risk in a previous blog post in the context of certificate issuance last year when we introduced our <a href="/secure-certificate-issuance/">multi-path DCV service</a>, which mitigates this risk in the context of certificate issuance by making DNS queries from multiple vantage points. Nevertheless, fragmentation-based attacks are proving less and less effective as DNS providers move to eliminate support for fragmented DNS packets (one of the major goals of <a href="https://dnsflagday.net/2020/">DNS Flag Day 2020</a>).</p>
    <div>
      <h3>Defeating Source Port Randomization via ICMP error messages</h3>
      <a href="#defeating-source-port-randomization-via-icmp-error-messages">
        
      </a>
    </div>
    <p>Another way to defeat the source port randomization is to use some measurable property of the server that makes the source port easier to guess. If the attacker could ask the server which port number is being used for a pending query, that would make the construction of a spoofed packet much easier. No such thing exists, but it turns out there is something close enough - the attacker can discover which ports are surely closed (and thus avoid having to send traffic to them). One such mechanism is the ICMP “port unreachable” message.</p><p>Let’s say the target receives a UDP datagram destined for its IP and some port. The datagram either gets accepted (and silently discarded by the application) or rejected because the port is closed. If the port is closed, or more importantly, closed to the IP address that the UDP datagram was sent from, the target will send back an ICMP message notifying the sender that the port is closed. This is handy to know, since the attacker now doesn’t have to bother trying to guess the pending message ID on this port and can move on to other ports. A single scan of the server effectively reduces the search space of valid UDP responses from 2<sup>32</sup> (over four billion) to 2<sup>17</sup> (around a hundred thousand), at least in theory.</p><p>This trick doesn’t always work. Many resolvers use “connected” UDP sockets instead of “open” UDP sockets to exchange messages between the resolver and nameserver. Connected sockets are tied to the peer address and port on the OS layer, which makes it impossible for an attacker to guess which “connected” UDP sockets are established between the target and the victim, and since the attacker isn’t the victim, it can’t directly observe the outcome of the probe.</p><p>To overcome this, the researchers found a very clever trick: they leverage ICMP rate limits as a side channel to reveal whether a given port is open or not. 
ICMP rate limiting was introduced (somewhat ironically, given this attack) as a security feature to prevent a server from being used as an unwitting participant in a <a href="https://www.cloudflare.com/learning/ddos/smurf-ddos-attack/">reflection attack</a>. In broad terms, it is used to limit how many ICMP responses a server will send out in a given time period. Say an attacker wants to scan 10,000 ports and sends a burst of 10,000 UDP packets to a server configured with an ICMP rate limit of 50 per second: only the first 50 would get an ICMP “port unreachable” message in reply.</p><p>Rate limiting seems innocuous until you remember one of the core rules of data security: <i>don’t let private information influence publicly measurable metrics</i>. ICMP rate limiting violates this rule because the rate limiter’s behavior can be influenced by an attacker making guesses as to whether a “secret” port number is open or not.</p><blockquote><p><i>don’t let private information influence publicly measurable metrics</i></p></blockquote><p>An attacker wants to know whether the target has an open port, so it sends a spoofed UDP message from the authoritative server to that port. If the port is open, no ICMP reply is sent and the rate counter remains unchanged. If the port is inaccessible, an ICMP reply is sent (back to the authoritative server, not to the attacker) and the counter is incremented. Although the attacker doesn’t see the ICMP response, it has influenced the counter. The counter itself isn’t known outside the server, but whether it has hit the rate limit or not <i>can</i> be measured by any outside observer by sending a UDP packet and waiting for a reply. If an ICMP “port unreachable” reply comes back, the rate limit hasn’t been reached. No reply means the rate limit has been met. 
This leaks one bit of information about the counter to the outside observer, which in the end is enough to reveal the supposedly secret information (whether the spoofed request got through or not).</p>
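The side channel can be sketched as a toy simulation (not the researchers' code; the budget of 50 and port numbers are hypothetical): the simulated server answers probes to closed ports with ICMP "port unreachable" until a per-window budget is spent, while probes to open ports consume no budget.

```python
class Server:
    def __init__(self, open_ports):
        self.open_ports = set(open_ports)
        self.icmp_budget = 50

    def probe(self, port) -> bool:       # True = ICMP "port unreachable" sent
        if port in self.open_ports:
            return False                 # accepted and silently discarded
        if self.icmp_budget > 0:
            self.icmp_budget -= 1        # closed port: ICMP reply, budget spent
            return True
        return False                     # rate limit hit: ICMP suppressed

def set_contains_open_port(server, ports) -> bool:
    server.icmp_budget = 50              # a fresh rate-limit window
    for p in ports:                      # 50 spoofed probes "from the victim"
        server.probe(p)                  # the attacker cannot see these replies
    # Verification probe from the attacker's own address to a known-closed
    # port: an ICMP reply means the budget was not fully spent, i.e. at
    # least one spoofed probe hit an open port.
    return server.probe(0)

srv = Server(open_ports={33333})
all_closed = list(range(1000, 1050))             # 50 closed ports
one_open = list(range(2000, 2049)) + [33333]     # 49 closed + 1 open
print(set_contains_open_port(srv, all_closed),
      set_contains_open_port(srv, one_open))     # False True
```

A real attack then bisects the set containing the open port, exactly as the next paragraph describes.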
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4VZjJoJhfSuFB0QjVY74Xz/25462293f51df83350c36c5fe6090fa3/image2-7.png" />
            
            </figure><p>Diagram inspired by the <a href="https://www.sigsac.org/ccs/CCS2020/conference-program.html">original paper</a></p><p>Concretely, the attack works as follows: the attacker sends a bunch (large enough to trigger the rate limiting) of probe messages to the target, but with a forged source address of the victim. In the case where there are no open ports in the probed set, the target will send the same number of ICMP “port unreachable” responses back to the victim and trigger the rate limit on outgoing ICMP messages. The attacker can now send an additional verification message from its own address and observe whether an ICMP response comes back or not. If it does, then there was at least one port open in the set, and the attacker can divide the set and try again, or do a linear scan by inserting the suspected port number into a set of known closed ports. Using this approach, the attacker can narrow down to the open ports and try to guess the message ID until it is successful or gives up, similar to the original Kaminsky attack.</p><p>In practice there are some hurdles to successfully mounting this attack.</p><ul><li><p>First, the target IP, or a set of target IPs, must be discovered. This might be trivial in some cases - a single forwarder, or a fixed set of IPs that can be discovered by probing and observing attacker-controlled zones, but more difficult if the target IPs are partitioned across zones, as the attacker can’t see the resolver egress IP unless she can monitor the traffic for the victim domain.</p></li><li><p>The attack also requires a large enough ICMP outgoing rate limit in order to be able to scan with a reasonable speed. The scan speed is critical, as it must be completed while the query to the victim nameserver is still pending. 
As the scan speed is effectively fixed, the paper instead describes a method to potentially extend the window of opportunity by triggering the victim's <a href="https://kb.isc.org/docs/aa-01000">response rate limiting</a> (RRL), a technique to protect against floods of forged DNS queries. This may work if the victim implements RRL and the target resolver doesn’t implement a retry over TCP (<a href="https://casey.byu.edu/papers/2019_icnc_dns_rate_limit.pdf">A Quantitative Study of the Deployment of DNS Rate Limiting</a> shows about 16% of nameservers implement some sort of RRL).</p></li><li><p>Generally, busy resolvers will have ephemeral ports opening and closing, which introduces false positive open ports for the attacker, and ports open for different pending queries than the one being attacked.</p></li></ul><p>We’ve implemented an additional mitigation in 1.1.1.1 to prevent message ID guessing - if the resolver detects an ID enumeration attempt, it will stop accepting any more guesses and switch over to TCP. This reduces the number of attempts available to the attacker even if it guesses the IP address and port correctly, similar to how the number of password login attempts is limited.</p>
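That guess-limiting idea can be sketched in a few lines (hypothetical code and threshold, not 1.1.1.1's actual implementation): after too many mismatched IDs on one outstanding query, stop accepting UDP answers and retry over TCP, where spoofing is no longer possible.

```python
MAX_MISMATCHES = 5                       # hypothetical threshold

class PendingQuery:
    def __init__(self, txid: int):
        self.txid = txid
        self.mismatches = 0
        self.use_tcp = False

    def on_udp_reply(self, reply_txid: int) -> str:
        if self.use_tcp:
            return "ignored"             # UDP path already abandoned
        if reply_txid == self.txid:
            return "accepted"
        self.mismatches += 1
        if self.mismatches >= MAX_MISMATCHES:
            self.use_tcp = True          # retry the query over TCP
            return "switch-to-tcp"
        return "dropped"

q = PendingQuery(txid=0x1234)
outcomes = [q.on_udp_reply(guess) for guess in range(5)]
print(outcomes[-1], q.use_tcp)           # switch-to-tcp True
```

Like a login-attempt limit, this caps how many of the 65,536 IDs an attacker can try against a single pending query.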
    <div>
      <h3>Outlook</h3>
      <a href="#outlook">
        
      </a>
    </div>
    <p>Ultimately these are just mitigations, and the attacker might be willing to play the long game. As long as the transport layer is insecure and DNSSEC is not widely deployed, there will be different methods of chipping away at these mitigations.</p><p>It should be noted that trying to hide source IPs or open port numbers is a form of security through obscurity. Without strong cryptographic authentication, it will always be possible to use spoofing to poison DNS resolvers. The silver lining here is that DNSSEC exists and is designed to protect against this type of attack, and DNS servers are <a href="https://datatracker.ietf.org/doc/html/draft-ietf-dprive-phase2-requirements-02">moving to explore</a> cryptographically strong transports such as TLS for communicating between resolvers and authoritative servers.</p><p>At Cloudflare, we’ve been helping to reduce the friction of DNSSEC deployment, while also <a href="/encrypting-dns-end-to-end/">helping to improve transport security in the long run</a>. There is also an effort to increase entropy in DNS messages with <a href="https://tools.ietf.org/html/rfc7873">RFC 7873 - Domain Name System (DNS) Cookies</a>, and to make DNS over TCP support mandatory (<a href="https://tools.ietf.org/html/rfc7766">RFC 7766 - DNS Transport over TCP - Implementation Requirements</a>), with even more <a href="https://datatracker.ietf.org/doc/html/draft-wijngaards-dnsext-resolver-side-mitigation-01">documentation</a> on ways to mitigate this type of issue available in different places. All of these efforts are complementary, which is a good thing. 
The DNS ecosystem consists of many different parties and software with different requirements and opinions; as long as operators support at least one of these preventive measures, attacks of this type will become more and more difficult.</p><p>If you are an operator of a DNS forwarder or recursive DNS resolver, you should consider taking the following steps to protect yourself from this attack:</p><ul><li><p>Upgrade your Linux kernel with <a href="https://git.kernel.org/linus/b38e7819cae946e2edf869e604af1e65a5d241c5">https://git.kernel.org/linus/b38e7819cae946e2edf869e604af1e65a5d241c5</a> (included with v5.10), which makes the ICMP rate limit unpredictable</p></li><li><p>Block outgoing ICMP “port unreachable” messages with iptables, or lower the ICMP rate limit on Linux</p></li><li><p>Keep your DNS software up to date</p></li></ul><p>We’d like to thank the researchers for responsibly disclosing this attack and look forward to working with them in the future on efforts to strengthen the DNS.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <guid isPermaLink="false">2OIMCvAglmZug8ev7RgC92</guid>
            <dc:creator>Marek Vavruša</dc:creator>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Going Keyless Everywhere]]></title>
            <link>https://blog.cloudflare.com/going-keyless-everywhere/</link>
            <pubDate>Fri, 01 Nov 2019 13:01:00 GMT</pubDate>
            <description><![CDATA[ Time flies. The Heartbleed vulnerability was discovered just over five and a half years ago. Heartbleed became a household name not only because it was one of the first bugs with its own web page and logo, but because of what it revealed about the fragility of the Internet as a whole. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Time flies. The <a href="/tag/heartbleed/">Heartbleed</a> vulnerability was discovered just over five and a half years ago. Heartbleed became a household name not only because it was one of the first bugs with its own <a href="http://heartbleed.com/">web page</a> and <a href="http://heartbleed.com/heartbleed.png">logo</a>, but because of what it revealed about the fragility of the Internet as a whole. With Heartbleed, one tiny bug in a cryptography library exposed the personal data of the users of almost every website online.</p><p>Heartbleed is an example of an underappreciated class of bugs: remote memory disclosure vulnerabilities. High profile examples other than <a href="/tag/heartbleed/">Heartbleed</a> include <a href="/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/">Cloudbleed</a> and most recently <a href="https://arxiv.org/abs/1807.10535">NetSpectre</a>. These vulnerabilities allow attackers to extract secrets from servers by simply sending them specially-crafted packets. Cloudflare recently completed a multi-year project to make our platform more resilient against this category of bug.</p><p>For the last five years, the industry has been dealing with the consequences of the design that led to Heartbleed being so impactful. In this blog post we’ll dig into memory safety, and how we re-designed Cloudflare’s main product to protect private keys from the next Heartbleed.</p>
    <div>
      <h2>Memory Disclosure</h2>
      <a href="#memory-disclosure">
        
      </a>
    </div>
    <p>Perfect security is not possible for businesses with an online component. History has shown us that no matter how robust their security program, an unexpected exploit can leave a company exposed. One of the more famous recent incidents of this sort is Heartbleed, a vulnerability in a commonly used cryptography library called OpenSSL that exposed the inner details of millions of web servers to anyone with a connection to the Internet. Heartbleed made international news, caused millions of dollars of damage, and <a href="https://blog.malwarebytes.com/exploits-and-vulnerabilities/2019/09/everything-you-need-to-know-about-the-heartbleed-vulnerability/">still hasn’t been fully resolved</a>.</p><p>Typical web services only return data via well-defined public-facing interfaces called APIs. Clients don’t typically get to see what’s going on under the hood inside the server; that would be a huge privacy and security risk. Heartbleed broke that paradigm: it enabled anyone on the Internet to peek at the operating memory used by web servers, revealing privileged data usually not exposed via the API. Heartbleed could be used to extract data previously sent to the server, including passwords and credit card numbers. It could also reveal the inner workings and cryptographic secrets used inside the server, including TLS <a href="/the-results-of-the-cloudflare-challenge/">certificate private keys</a>.</p><p>Heartbleed let attackers peek behind the curtain, but not too far. Sensitive data could be extracted, but not everything on the server was at risk. For example, Heartbleed did not enable attackers to steal the content of databases held on the server. You may ask: why was some data at risk but not other data? The reason has to do with how modern operating systems are built.</p>
    <div>
      <h2>A simplified view of process isolation</h2>
      <a href="#a-simplified-view-of-process-isolation">
        
      </a>
    </div>
    <p>Most modern operating systems are split into multiple layers. These layers are analogous to security clearance levels. So-called user-space applications (like your browser) typically live in a low-security layer called user space. They only have access to computing resources (memory, CPU, networking) if the lower, more credentialed layers let them.</p><p>User-space applications need resources to function. For example, they need memory to store their code and working memory to do computations. However, it would be risky to give an application direct access to the physical RAM of the computer it’s running on. Instead, the raw computing elements are restricted to a lower layer called the operating system kernel. The kernel runs only specially designed software built to safely manage these resources and mediate access to them for user-space applications.</p><p>When a new user-space application process is launched, the kernel gives it a virtual memory space. This virtual memory space acts like real memory to the application but is actually a safely guarded translation layer the kernel uses to protect the real memory. Each application’s virtual memory space is like a parallel universe dedicated to that application. This makes it impossible for one process to view or modify another’s; the other applications are simply not addressable.</p>
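The isolation can be demonstrated with a small Unix-only Python sketch (illustrative, not how the kernel implements it): after a fork, the child gets its own copy of memory, so mutating a value there leaves the parent's copy untouched.

```python
import os

data = ["parent-secret"]             # lives in the parent's virtual memory

r, w = os.pipe()
pid = os.fork()                      # Unix-only: clone this process
if pid == 0:                         # child process
    data[0] = "child-overwrite"      # modifies only the child's copy
    os.write(w, data[0].encode())
    os._exit(0)

os.waitpid(pid, 0)                   # wait for the child to finish
child_view = os.read(r, 64).decode()
print(child_view, data[0])           # child-overwrite parent-secret
```

The child had to use an explicit channel (a pipe) to communicate its view back; it could not simply reach into the parent's memory, and vice versa.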
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/46WR5JrLwEtc94VDZ7YZJK/8dd78b2efd297c87c430362e7883b4d3/image9-3.png" />
            
            </figure>
    <div>
      <h2>Heartbleed, Cloudbleed and the process boundary</h2>
      <a href="#heartbleed-cloudbleed-and-the-process-boundary">
        
      </a>
    </div>
    <p>Heartbleed was a vulnerability in the OpenSSL library, which was part of many web server applications. These web servers run in user space, like any common application. This vulnerability caused the web server to return up to 64 kilobytes of its memory in response to a specially-crafted inbound request.</p><p>Cloudbleed was also a memory disclosure bug, albeit one specific to Cloudflare, that got its name because it was so similar to Heartbleed. With Cloudbleed, the vulnerability was not in OpenSSL, but instead in a secondary web server application used for HTML parsing. When this code parsed a certain sequence of HTML, it ended up inserting some process memory into the web page it was serving.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7qlhsxgqsCJwmzNREBXRYx/05a1c85bf7a8109890bf8f621f61bc55/image2.png" />
            
            </figure><p>It’s important to note that both of these bugs occurred in applications running in user space, not kernel space. This means that the memory exposed by the bug was necessarily part of the virtual memory of the application. Even if the bug were to expose megabytes of data, it would only expose data specific to that application, not other applications on the system.</p><p>In order for a web server to serve traffic over the encrypted HTTPS protocol, it needs access to the certificate’s private key, which is typically kept in the application’s memory. These keys were exposed to the Internet by Heartbleed. The Cloudbleed vulnerability affected a different process, the HTML parser, which doesn’t do HTTPS and therefore doesn’t keep the private key in memory. This meant that HTTPS keys were safe, even if other data in the HTML parser’s memory space wasn’t.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6B3QATkOKQfQndbFuDifgd/6b77a1e6fc06fdfa386158113aac5369/image4.png" />
            
            </figure><p>The fact that the HTML parser and the web server were different applications saved us from having to revoke and re-issue our customers’ <a href="https://www.cloudflare.com/application-services/products/ssl/">TLS certificates</a>. However, if another memory disclosure vulnerability is discovered in the web server, these keys are again at risk.</p>
    <div>
      <h2>Moving keys out of Internet-facing processes</h2>
      <a href="#moving-keys-out-of-internet-facing-processes">
        
      </a>
    </div>
    <p>Not all web servers keep private keys in memory. In some deployments, private keys are held in a separate machine called a Hardware Security Module (HSM). HSMs are built to withstand physical intrusion and tampering and are often built to comply with stringent compliance requirements. They can often be bulky and expensive. Web servers designed to take advantage of keys in an HSM connect to them over a physical cable and communicate with a specialized protocol called PKCS#11. This allows the web server to serve encrypted content while being physically separated from the private key.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1GBuW8GyHUgvg2VI7YKZpC/b5ca62effb7f36c3b7f3f8d80cc04f9e/image8-1.png" />
            
            </figure><p>At Cloudflare, we built our own way to separate a web server from a private key: <a href="/keyless-ssl-the-nitty-gritty-technical-details/">Keyless SSL</a>. Rather than keeping the keys in a separate physical machine connected to the server with a cable, the keys are kept in a key server operated by the customer in their own infrastructure (this can also be backed by an HSM).</p>
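The separation can be sketched in a few lines of Python (hypothetical code, not Cloudflare's implementation; HMAC stands in for the real RSA/ECDSA private-key operation used in TLS): the key material lives only in the key server, and the web server hands over a digest and receives a signature, so a memory-disclosure bug in the web server cannot leak the key.

```python
import hashlib
import hmac

class KeyServer:
    def __init__(self, key: bytes):
        self._key = key                       # never leaves this component

    def sign(self, digest: bytes) -> bytes:
        # Stand-in for a real private-key signing operation.
        return hmac.new(self._key, digest, hashlib.sha256).digest()

class WebServer:
    def __init__(self, key_server: KeyServer):
        self._key_server = key_server         # holds a handle, not the key

    def handshake_signature(self, transcript: bytes) -> bytes:
        digest = hashlib.sha256(transcript).digest()
        return self._key_server.sign(digest)  # only the digest crosses over

web = WebServer(KeyServer(b"private-key-material"))
sig = web.handshake_signature(b"ClientHello|ServerHello|...")
print(len(sig))                               # 32
```

In the real design the two classes are separate processes (or separate machines), and the call between them is a network round trip.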
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/73zinlbc5lRdAJuhY7q2Jl/2ac31d4a9220ed6ca468b553f105a1d2/image10-4.png" />
            
            </figure><p>More recently, we launched <a href="/introducing-cloudflare-geo-key-manager/">Geo Key Manager</a>, a service that allows users to store private keys in only select Cloudflare locations. Connections to locations that do not have access to the private key use Keyless SSL with a key server hosted in a datacenter that does have access.</p><p>In both Keyless SSL and Geo Key Manager, private keys are not only not part of the web server’s memory space, they’re often not even in the same country! This extreme degree of separation is not necessary to protect against the next Heartbleed. All that is needed is for the web server and the key server to not be part of the same application. So that’s what we did. We call this Keyless Everywhere.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/jFMg99U9Aiq8yNh43fx1l/d61671a3bb94f6fc415ad5aef8b4f808/image7-2.png" />
            
            </figure>
    <div>
      <h2>Keyless SSL is coming from inside the house</h2>
      <a href="#keyless-ssl-is-coming-from-inside-the-house">
        
      </a>
    </div>
    <p>Repurposing Keyless SSL for Cloudflare-held private keys was easy to conceptualize, but the path from idea to production wasn't so straightforward. The core functionality of Keyless SSL comes from the open-source <a href="https://github.com/cloudflare/gokeyless">gokeyless</a> daemon, which customers run on their own infrastructure, but internally we use it as a library and have replaced the main package with an implementation suited to our requirements (we've creatively dubbed it gokeyless-internal).</p><p>As with all major architecture changes, it’s prudent to test the model first on something new and low-risk. In our case, the test bed was our experimental <a href="/introducing-tls-1-3/">TLS 1.3</a> implementation. In order to quickly iterate through draft versions of the TLS specification and push releases without affecting the majority of Cloudflare customers, we <a href="/introducing-tls-1-3/">re-wrote our custom nginx web server in Go</a> and deployed it in parallel to our existing infrastructure. This server was designed from the start to never hold private keys and to rely solely on gokeyless-internal. At this time there was only a small amount of TLS 1.3 traffic, all of it coming from beta versions of browsers, which allowed us to work through the initial kinks of gokeyless-internal without exposing the majority of visitors to security risks or outages.</p><p>The first step towards making TLS 1.3 fully keyless was identifying and implementing the new functionality we needed to add to gokeyless-internal. Keyless SSL was designed to run on customer infrastructure, with the expectation of supporting only a handful of private keys. But our edge must simultaneously support millions of private keys, so we implemented the same <a href="/universal-ssl-how-it-scales/">lazy loading</a> logic we use in our web server, nginx. 
Furthermore, a typical customer deployment would put key servers behind a network load balancer, so they could be taken out of service for upgrades or other maintenance. Contrast this with our edge, where it’s important to maximize our resources by serving traffic during software upgrades. This problem is solved by the excellent <a href="/graceful-upgrades-in-go/">tableflip package</a> we use elsewhere at Cloudflare.</p><p>The next project to go Keyless was <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum</a>, which launched with default support for gokeyless-internal. With these small victories in hand, we had the confidence necessary to attempt the big challenge, which was porting our existing nginx infrastructure to a fully keyless model. After implementing the new functionality, and being satisfied with our integration tests, all that’s left is to turn this on in production and call it a day, right? Anyone with experience with large distributed systems knows how far "working in dev" is from "done," and this story is no different. Thankfully we anticipated problems, and built a fallback into nginx to complete the handshake itself if the gokeyless-internal path ran into trouble. This allowed us to expose gokeyless-internal to production traffic without risking downtime in the event that our reimplementation of the nginx logic was not 100% bug-free.</p>
    <div>
      <h2>When rolling back the code doesn’t roll back the problem</h2>
      <a href="#when-rolling-back-the-code-doesnt-roll-back-the-problem">
        
      </a>
    </div>
    <p>Our deployment plan was to enable Keyless Everywhere, find the most common causes of fallbacks, and then fix them. We could then repeat this process until all sources of fallbacks had been eliminated, after which we could remove access to private keys (and therefore the fallback) from nginx. One of the early causes of fallbacks was gokeyless-internal returning ErrKeyNotFound, indicating that it couldn’t find the requested private key in storage. This should not have been possible, since nginx only makes a request to gokeyless-internal after first finding the certificate and key pair in storage, and we always write the private key and certificate together. It turned out that in addition to returning the error for the intended case, where the key truly is not in storage, we were also returning it when transient errors like timeouts were encountered. To resolve this, we updated those transient error conditions to return ErrInternal, and deployed to our <a href="https://en.wikipedia.org/wiki/Sentinel_species">canary datacenters</a>. Strangely, we found that a handful of instances in a single datacenter started encountering high rates of fallbacks, and the logs from nginx indicated it was due to a timeout between nginx and gokeyless-internal. The timeouts didn’t occur right away, but once a system started logging some timeouts it never stopped. Even after we rolled back the release, the fallbacks continued with the old version of the software! Furthermore, while nginx was complaining about timeouts, gokeyless-internal seemed perfectly healthy and was reporting reasonable performance metrics (sub-millisecond median request latency).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5xlp86WlHoUaRjiXcdifcP/c06dad2ff1f58ff555470d8809d6bba8/image1-1.png" />
            
            </figure><p>To debug the issue, we added detailed logging to both nginx and gokeyless, and followed the chain of events backwards once timeouts were encountered.</p>
            <pre><code>➜ ~ grep 'timed out' nginx.log | grep Keyless | head -5
2018-07-25T05:30:49.000 29m41 2018/07/25 05:30:49 [error] 4525#0: *1015157 Keyless SSL request/response timed out while reading Keyless SSL response, keyserver: 127.0.0.1
2018-07-25T05:30:49.000 29m41 2018/07/25 05:30:49 [error] 4525#0: *1015231 Keyless SSL request/response timed out while waiting for Keyless SSL response, keyserver: 127.0.0.1
2018-07-25T05:30:49.000 29m41 2018/07/25 05:30:49 [error] 4525#0: *1015271 Keyless SSL request/response timed out while waiting for Keyless SSL response, keyserver: 127.0.0.1
2018-07-25T05:30:49.000 29m41 2018/07/25 05:30:49 [error] 4525#0: *1015280 Keyless SSL request/response timed out while waiting for Keyless SSL response, keyserver: 127.0.0.1
2018-07-25T05:30:50.000 29m41 2018/07/25 05:30:50 [error] 4525#0: *1015289 Keyless SSL request/response timed out while waiting for Keyless SSL response, keyserver: 127.0.0.1</code></pre>
            <p>You can see that the first request to log a timeout had id 1015157. It is also interesting that the first log line is “timed out while reading,” while all the others are “timed out while waiting,” and this latter message is the one that continues forever. Here is the matching request in the gokeyless log:</p>
            <pre><code>➜ ~ grep 'id=1015157 ' gokeyless.log | head -1
2018-07-25T05:30:39.000 29m41 2018/07/25 05:30:39 [DEBUG] connection 127.0.0.1:30520: worker=ecdsa-29 opcode=OpECDSASignSHA256 id=1015157 sni=announce.php?info_hash=%a8%9e%9dc%cc%3b1%c8%23%e4%93%21r%0f%92mc%0c%15%89&amp;peer_id=-ut353s-%ce%ad%5e%b1%99%06%24e%d5d%9a%08&amp;port=42596&amp;uploaded=65536&amp;downloaded=0&amp;left=0&amp;corrupt=0&amp;key=04a184b7&amp;event=started&amp;numwant=200&amp;compact=1&amp;no_peer_id=1 ip=104.20.33.147</code></pre>
            <p>Aha! That SNI value is clearly invalid (SNIs are like Host headers, i.e. they are domains, not URL paths), and it’s also quite long. Our storage system indexes certificates based on two indices: which SNI they correspond to, and which IP addresses they correspond to (for older clients that don’t support SNI). Our storage interface uses the memcached protocol, and the client library that gokeyless-internal uses rejects requests for keys longer than 250 characters (memcached’s maximum key length), whereas the nginx logic is to simply ignore the invalid SNI and treat the request as if it only had an IP. The change in our new release had shifted this condition from <code>ErrKeyNotFound</code> to <code>ErrInternal</code>, which triggered cascading problems in nginx. The “timeouts” it encountered were actually the result of throwing away all in-flight requests multiplexed on a connection whenever <code>ErrInternal</code> was returned for a single request. These requests were retried, but once this condition triggered, nginx became overloaded by the retried requests plus the continuous stream of new requests coming in with bad SNI, and was unable to recover. This explains why rolling back gokeyless-internal didn’t fix the problem.</p><p>This discovery finally brought our attention to nginx, which thus far had escaped blame since it had been working reliably with customer key servers for years. 
However, communicating over localhost with a multitenant key server is fundamentally different from reaching out over the public Internet to communicate with a customer’s key server, and we had to make the following changes:</p><ul><li><p>Instead of the long connection timeout and relatively short response timeout appropriate for customer key servers, a localhost key server calls for an extremely short connection timeout and a longer request timeout.</p></li><li><p>Similarly, it’s reasonable to retry (with backoff) if we time out waiting on a customer key server’s response, since we can’t trust the network. But over localhost, a timeout would only occur if gokeyless-internal were overloaded and the request were still queued for processing. In this case a retry would only increase the total work requested of gokeyless-internal, making the situation worse.</p></li><li><p>Most significantly, nginx must not throw away all requests multiplexed on a connection if any single one of them encounters an error, since a single connection no longer represents a single customer.</p></li></ul>
    <div>
      <h2>Implementations matter</h2>
      <a href="#implementations-matter">
        
      </a>
    </div>
    <p>CPU at the edge is one of our most precious assets, and it’s closely guarded by our performance team (aka CPU police). Soon after turning on Keyless Everywhere in one of our canary datacenters, they noticed gokeyless using ~50% of a core per instance. We were shifting the sign operations from nginx to gokeyless, so of course it would be using more CPU now. But nginx should have seen a commensurate reduction in CPU usage, right?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5UKCYIeE5MqU3j7jG8GFiy/38fbb7e9842218b75153a512d908ae31/image5.png" />
            
            </figure><p>Wrong. Elliptic curve operations are very fast in Go, but it’s known that <a href="https://github.com/golang/go/issues/21525">RSA operations are much slower than their BoringSSL counterparts</a>.</p><p>Although Go 1.11 includes optimizations for RSA math operations, we needed more speed. Well-tuned assembly code is required to match the performance of BoringSSL, so Armando Faz from our Crypto team helped claw back some of the lost CPU by reimplementing parts of the <a href="https://golang.org/pkg/math/big/">math/big</a> package with platform-dependent assembly in an internal fork of Go. The recent <a href="https://github.com/golang/go/wiki/AssemblyPolicy">assembly policy</a> of Go prefers the use of Go portable code instead of assembly, so these optimizations were not upstreamed. There is still room for more optimizations, and for that reason we’re still evaluating moving to cgo + BoringSSL for sign operations, despite <a href="https://dave.cheney.net/2016/01/18/cgo-is-not-go">cgo’s many downsides</a>.</p>
    <div>
      <h2>Changing our tooling</h2>
      <a href="#changing-our-tooling">
        
      </a>
    </div>
    <p>Process isolation is a powerful tool for protecting secrets in memory. Our move to Keyless Everywhere demonstrates that this is not a simple tool to leverage. Re-architecting an existing system such as nginx to use process isolation to protect secrets was time-consuming and difficult. Another approach to memory safety is to use a memory-safe language such as Rust.</p><p>Rust was originally developed by Mozilla but is starting <a href="https://www.infoq.com/articles/programming-language-trends-2019/">to be used much more widely</a>. The main advantage that Rust has over C/C++ is that it has memory safety features without a garbage collector.</p><p>Re-writing an existing application in a new language such as Rust is a daunting task. That said, many new Cloudflare features, from the powerful <a href="/announcing-firewall-rules/">Firewall Rules</a> feature to our <a href="/announcing-warp-plus/">1.1.1.1 with WARP</a> app, have been written in Rust to take advantage of its powerful memory-safety properties. We’re really happy with Rust so far and plan on using it even more in the future.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The harrowing aftermath of Heartbleed taught the industry a lesson that should have been obvious in retrospect: keeping important secrets in applications that can be accessed remotely via the Internet is a risky security practice. In the following years, with a lot of work, we leveraged process separation and Keyless SSL to ensure that the next Heartbleed wouldn’t put customer keys at risk.</p><p>However, this is not the end of the road. Memory disclosure vulnerabilities such as <a href="https://arxiv.org/abs/1807.10535">NetSpectre</a> have recently been discovered that can bypass application process boundaries, so we continue to actively explore new ways to keep keys secure.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/41IkUo52ZjxvkXCjUsoKGE/280a2f7580e8d374abffe61b61615bff/image3.png" />
            
            </figure><p></p> ]]></content:encoded>
            <category><![CDATA[Crypto Week]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Keyless SSL]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">4l12BK6lPNLLMUIUI3kNN</guid>
            <dc:creator>Nick Sullivan</dc:creator>
            <dc:creator>Chris Broglie</dc:creator>
        </item>
        <item>
            <title><![CDATA[Delegated Credentials for TLS]]></title>
            <link>https://blog.cloudflare.com/keyless-delegation/</link>
            <pubDate>Fri, 01 Nov 2019 13:00:00 GMT</pubDate>
            <description><![CDATA[ Announcing support for a new cryptographic protocol making it possible to deploy encrypted services while still maintaining performance and control of private keys: Delegated Credentials for TLS.  ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Today we’re happy to announce support for a new cryptographic protocol that helps make it possible to deploy encrypted services in a global network while still maintaining fast performance and tight control of private keys: Delegated Credentials for TLS. We have been working with partners from Facebook, Mozilla, and the broader IETF community to define this emerging standard. We’re excited to share the gory details today in this blog post.</p><p>Also, be sure to check out the blog posts on the topic by our friends at <a href="https://engineering.fb.com/security/delegated-credentials-improving-the-security-of-tls-certificates/">Facebook</a> and <a href="https://blog.mozilla.org/security/2019/11/01/validating-delegated-credentials-for-tls-in-firefox/">Mozilla</a>!</p>
    <div>
      <h2>Deploying TLS globally</h2>
      <a href="#deploying-tls-globally">
        
      </a>
    </div>
    <p>Many of the technical problems we face at Cloudflare are widely shared problems across the Internet industry. As gratifying as it can be to solve a problem for ourselves and our customers, it can be even more gratifying to solve a problem for the entire Internet. For the past three years, we have been working with peers in the industry to solve a specific shared problem in the TLS infrastructure space: How do you terminate TLS connections while storing keys remotely and maintaining performance and availability? Today we’re announcing that Cloudflare now supports Delegated Credentials, the result of this work.</p><p>Cloudflare’s TLS/SSL features are among the top reasons customers use our service. Configuring TLS is hard to do without internal expertise. By automating TLS, web site and web service operators gain the latest TLS features and the most secure configurations by default. It also reduces the risk of outages or bad press due to misconfigured or insecure encryption settings. Customers also gain early access to unique features like <a href="/introducing-tls-1-3/">TLS 1.3</a>, <a href="/towards-post-quantum-cryptography-in-tls/">post-quantum cryptography</a>, and <a href="/high-reliability-ocsp-stapling/">OCSP stapling</a> as they become available.</p><p>Unfortunately, for web services to authorize a service to terminate TLS for them, they have to trust the service with their private keys, which demands a high level of trust. For services with a global footprint, there is an additional level of nuance. 
They may operate multiple data centers located in places with varying levels of physical security, and each of these needs to be trusted to terminate TLS.</p><p>To tackle these problems of trust, Cloudflare has invested in two technologies: <a href="/announcing-keyless-ssl-all-the-benefits-of-cloudflare-without-having-to-turn-over-your-private-ssl-keys/">Keyless SSL</a>, which allows customers to use Cloudflare without sharing their private key with Cloudflare; and <a href="/introducing-cloudflare-geo-key-manager/">Geo Key Manager</a>, which allows customers to choose the geographical locations in which Cloudflare should keep their keys. Both of these technologies can be deployed without any changes to browsers or other clients. They also come with some downsides in the form of availability and performance degradation.</p><p>Keyless SSL introduces extra latency at the start of a connection. In order for a server without access to a private key to establish a connection with a client, that server needs to reach out to a key server, or a remote point of presence, and ask it to perform a private key operation. This not only adds latency to the connection, causing content to load more slowly, but it also introduces some troublesome operational constraints on the customer. Specifically, the server with access to the key needs to be highly available or the connection can fail. Sites often use Cloudflare to improve their site’s availability, so having to run a high-availability key server is an unwelcome requirement.</p>
    <div>
      <h2>Turning a pull into a push</h2>
      <a href="#turning-a-pull-into-a-push">
        
      </a>
    </div>
    <p>The reason services like Keyless SSL that rely on remote keys are so brittle is their architecture: they are pull-based rather than push-based. Every time a client attempts a handshake with a server that doesn’t have the key, the server needs to pull the authorization from the key server. An alternative way to build this sort of system is to periodically push a short-lived authorization key to the server and use that for handshakes. Switching from a pull-based model to a push-based model eliminates the additional latency, but it comes with additional requirements, including the need to change the client.</p><p>Enter the new TLS feature of <a href="https://tools.ietf.org/html/draft-ietf-tls-subcerts-04">Delegated Credentials</a> (DCs). A delegated credential is a short-lived key that the certificate’s owner has delegated for use in TLS. They work like a power of attorney: your server authorizes our server to terminate TLS for a limited time. When a browser that supports this protocol connects to our edge servers, we can show it this “power of attorney” instead of needing to reach back to a customer’s server to get it to authorize the TLS connection. This reduces latency and improves performance and reliability.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1wNKw1iaNaUHBESq06OXIK/fbbba3ca4614c398480a03e7ce00fc1b/pull-diagram-1.jpg" />
            
            </figure><p>The pull model</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/qfW3dQvRSnHowxNEW2lSf/d457a3d0b7c52f9c3523c7f2d73cd94a/push-diagram.jpg" />
            
            </figure><p>The push model</p><p>A fresh delegated credential can be created and pushed out to TLS servers long before the previous credential expires. Momentary blips in availability will not lead to broken handshakes for clients that support delegated credentials. Furthermore, a Delegated Credentials-enabled TLS connection is just as fast as a standard TLS connection: there’s no need to connect to the key server for every handshake. This removes the main drawback of Keyless SSL for DC-enabled clients.</p><p>Delegated credentials are intended to be an Internet Standard RFC that anyone can implement and use, not a replacement for Keyless SSL. Since browsers will need to be updated to support the standard, proprietary mechanisms like Keyless SSL and Geo Key Manager will continue to be useful. Delegated credentials aren’t just useful in our context, which is why we’ve developed it openly and with contributions from across industry and academia. Facebook has integrated them into their own TLS implementation, and you can read more about how they view the security benefits <a href="https://engineering.fb.com/security/delegated-credentials/">here.</a>  When it comes to improving the security of the Internet, we’re all on the same team.</p><p><i>"We believe delegated credentials provide an effective way to boost security by reducing certificate lifetimes without sacrificing reliability. This will soon become an Internet standard and we hope others in the industry adopt delegated credentials to help make the Internet ecosystem more secure."</i></p><p></p><p>— <b>Subodh Iyengar</b>, software engineer at Facebook</p>
    <div>
      <h2>Extensibility beyond the PKI</h2>
      <a href="#extensibility-beyond-the-pki">
        
      </a>
    </div>
    <p>At Cloudflare, we’re interested in pushing the state of the art forward by experimenting with new algorithms. In TLS, there are three main areas of experimentation: ciphers, key exchange algorithms, and authentication algorithms. Ciphers and key exchange algorithms are only dependent on two parties: the client and the server. This freedom allows us to deploy exciting new choices like <a href="/do-the-chacha-better-mobile-performance-with-cryptography/">ChaCha20-Poly1305</a> or <a href="/towards-post-quantum-cryptography-in-tls/">post-quantum key agreement</a> in lockstep with browsers. On the other hand, the authentication algorithms used in TLS are dependent on certificates, which introduces certificate authorities and the entire public key infrastructure into the mix.</p><p>Unfortunately, the public key infrastructure is very conservative in its choice of algorithms, making it harder to adopt newer cryptography for authentication algorithms in TLS. For instance, <a href="https://en.wikipedia.org/wiki/EdDSA">EdDSA</a>, a highly-regarded signature scheme, is not supported by certificate authorities, and <a href="https://www.mozilla.org/en-US/about/governance/policies/security-group/certs/policy/">root programs limit the certificates that will be signed.</a> With the emergence of quantum computing, experimenting with new algorithms is essential to determine which solutions are deployable and functional on the Internet.</p><p>Since delegated credentials introduce the ability to use new authentication key types without requiring changes to certificates themselves, this opens up a new area of experimentation. Delegated credentials can be used to provide a level of flexibility in the transition to post-quantum cryptography, by enabling new algorithms and modes of operation to coexist with the existing PKI infrastructure. It also enables tiny victories, like the ability to use smaller, faster Ed25519 signatures in TLS.</p>
    <div>
      <h2>Inside DCs</h2>
      <a href="#inside-dcs">
        
      </a>
    </div>
    <p>A delegated credential contains a public key and an expiry time. This bundle is then signed with the certificate’s private key, binding the delegated credential to the certificate for which it is acting as “power of attorney”. A client indicates its support for delegated credentials by including an extension in its Client Hello.</p><p>A server that supports delegated credentials composes the TLS Certificate Verify and Certificate messages as usual, but instead of signing with the certificate’s private key, it includes the certificate along with the DC, and signs with the DC’s private key. Therefore, the private key of the certificate only needs to be used for signing the DC itself.</p><p>Certificates used for signing delegated credentials require a special X.509 certificate extension (currently only available at <a href="https://docs.digicert.com/manage-certificates/certificate-profile-options/">DigiCert</a>). This requirement exists to avoid breaking assumptions people may have about the impact of temporary access to their keys on security, particularly in cases involving HSMs and the still unfixed <a href="/rfc-8446-aka-tls-1-3/">Bleichenbacher oracles</a> in older TLS versions. Temporary access to a key can enable signing lots of delegated credentials that start far in the future, which is why support was made opt-in. Early versions of QUIC had <a href="https://www.nds.ruhr-uni-bochum.de/media/nds/veroeffentlichungen/2015/08/21/Tls13QuicAttacks.pdf">similar issues</a>, and ended up adopting TLS to fix them. Protocol evolution on the Internet requires working well with existing protocols and their flaws.</p>
    <div>
      <h2>Delegated Credentials at Cloudflare and Beyond</h2>
      <a href="#delegated-credentials-at-cloudflare-and-beyond">
        
      </a>
    </div>
    <p>Currently we use delegated credentials as a performance optimization for Geo Key Manager and Keyless SSL. Customers can update their certificates to include the special extension for delegated credentials, and we will automatically create delegated credentials and distribute them to the edge through Keyless SSL or Geo Key Manager. For more information, see the <a href="https://developers.cloudflare.com/ssl/keyless-ssl/dc/">documentation.</a> It also enables us to be more conservative about where we keep keys for customers, improving our security posture.</p><p>Delegated Credentials would be useless if they weren’t also supported by browsers and other HTTP clients. Christopher Patton, a former intern at Cloudflare, implemented support in Firefox and its underlying NSS security library. <a href="https://blog.mozilla.org/security/2019/11/01/validating-delegated-credentials-for-tls-in-firefox/">This feature is now in the Nightly versions of Firefox</a>. You can turn it on by activating the configuration option security.tls.enable_delegated_credentials at about:config. Studies are ongoing on how effective this will be in a wider deployment. There is also support for Delegated Credentials in BoringSSL.</p><p><i>"At Mozilla we welcome ideas that help to make the Web PKI more robust. The Delegated Credentials feature can help to provide secure and performant TLS connections for our users, and we're happy to work with Cloudflare to help validate this feature."</i></p><p></p><p>— <b>Thyla van der Merwe</b>, Cryptography Engineering Manager at Mozilla</p><p>One open issue is the question of client clock accuracy. Until we have a wide-scale study, we won’t know how many connections using delegated credentials will break because of the 24-hour validity limit that is imposed. 
Some clients, in particular mobile clients, may have inaccurately set clocks, which are the root cause of one third of all <a href="https://www.cloudflare.com/learning/ssl/common-errors/">certificate errors</a> in Chrome. Part of the way we aim to solve this problem is by standardizing and improving <a href="/roughtime/">Roughtime</a>, so that web browsers and other services that need to validate certificates can do so independently of the client clock.</p><p>Cloudflare’s global scale means that we see connections from every corner of the world, and from many different kinds of connection and device. That reach enables us to find rare problems with the deployability of protocols. For example, our <a href="/why-tls-1-3-isnt-in-browsers-yet/">early deployment</a> helped inform the development of the TLS 1.3 standard. As we enable developing protocols like delegated credentials, we learn about obstacles that inform and affect their future development.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>As new protocols emerge, we'll continue to play a role in their development and bring their benefits to our customers. Today’s announcement of a technology that overcomes some limitations of Keyless SSL is just one example of how Cloudflare takes part in improving the Internet not just for our customers, but for everyone. During the standardization process of turning the draft into an RFC, we’ll continue to maintain our implementation and come up with new ways to apply delegated credentials.</p> ]]></content:encoded>
            <category><![CDATA[Crypto Week]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Keyless SSL]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">21MHSnISq1AaWWdB5lruxJ</guid>
            <dc:creator>Nick Sullivan</dc:creator>
            <dc:creator>Watson Ladd</dc:creator>
        </item>
        <item>
            <title><![CDATA[Tales from the Crypt(o team)]]></title>
            <link>https://blog.cloudflare.com/tales-from-the-crypt-o-team/</link>
            <pubDate>Sun, 27 Oct 2019 23:00:00 GMT</pubDate>
            <description><![CDATA[ Halloween season is upon us. This week we’re sharing a series of blog posts about work being done at Cloudflare involving cryptography, one of the spookiest technologies around. ]]></description>
            <content:encoded><![CDATA[

<p>Halloween season is upon us. This week we’re sharing a series of blog posts about work being done at Cloudflare involving cryptography, one of the spookiest technologies around. So subscribe to this blog and come back every day for tricks, treats, and deep technical content.</p>
    <div>
      <h2>A long-term mission</h2>
      <a href="#a-long-term-mission">
        
      </a>
    </div>
    <p>Cryptography is one of the most powerful technological tools we have, and Cloudflare has been at the forefront of using cryptography to help build a better Internet. Of course, we haven’t been alone on this journey. Making meaningful changes to the way the Internet works requires time, effort, experimentation, momentum, and willing partners. Cloudflare has been involved with several multi-year efforts to leverage cryptography to help make the Internet better.</p><p>Here are some highlights to expect this week:</p><ul><li><p>We’re renewing Cloudflare’s commitment to privacy-enhancing technologies by sharing some of the recent work being done on <a href="/cloudflare-supports-privacy-pass/">Privacy Pass</a>: <a href="/supporting-the-latest-version-of-the-privacy-pass-protocol/">Supporting the latest version of the Privacy Pass Protocol</a></p></li><li><p>We’re helping forge a path to a quantum-safe Internet by sharing some of the results of the <a href="/towards-post-quantum-cryptography-in-tls/">Post-quantum Cryptography</a> experiment: <a href="/the-tls-post-quantum-experiment/">The TLS Post-Quantum Experiment</a></p></li><li><p>We’re sharing the Rust-based software we use to power <a href="/secure-time/">time.cloudflare.com</a>: <a href="/announcing-cfnts/">Announcing cfnts: Cloudflare's implementation of NTS in Rust</a></p></li><li><p>We’re doing a deep dive into the technical details of <a href="https://developers.cloudflare.com/1.1.1.1/dns-over-https/">Encrypted DNS</a>: <a href="/dns-encryption-explained/">DNS Encryption Explained</a></p></li><li><p>We’re announcing support for a new technique we developed with industry partners to help keep TLS private keys more secure: <a href="/keyless-delegation/">Delegated Credentials for TLS</a>, and how we're keeping keys safe from memory disclosure attacks: <a href="/going-keyless-everywhere/">Going Keyless Everywhere</a></p></li></ul><p>The milestones we’re sharing this week would not be possible without 
partnerships with companies, universities, and individuals working in good faith to help build a better Internet together. Hopefully, this week provides a fun peek into the future of the Internet.</p> ]]></content:encoded>
            <category><![CDATA[Crypto Week]]></category>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">5Bgs7HuC6VeCOjcHUOVF0o</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare’s Approach to Research]]></title>
            <link>https://blog.cloudflare.com/cloudflares-approach-to-research/</link>
            <pubDate>Wed, 18 Sep 2019 14:03:38 GMT</pubDate>
            <description><![CDATA[ Cloudflare’s mission is to help build a better Internet. One of the tools used in pursuit of this goal is computer science research. We’ve learned that some of the difficult problems to solve are best approached through research ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Cloudflare’s mission is to help build a better Internet. One of the tools used in pursuit of this goal is computer science research. We’ve learned that some of the difficult problems to solve are best approached through research and experimentation to understand the solution before engineering it at scale. This research-focused approach to solving the big problems of the Internet is exemplified by the work of the Cryptography Research team, which leverages research to help build a safer, more secure and more performant Internet. Over the years, the team has worked on more than just cryptography, so <b>we’re taking the model we’ve developed and expanding the scope of the team to include more areas of computer science research</b>. Cryptography Research at Cloudflare is now Cloudflare Research. I am excited to share some of the insights we’ve learned over the years in this blog post.</p>
    <div>
      <h2>Cloudflare’s research model</h2>
      <a href="#cloudflares-research-model">
        
      </a>
    </div>
    <table>
      <thead>
        <tr><th>Principle</th><th>Description</th></tr>
      </thead>
      <tbody>
        <tr><td>Team structure</td><td><b>Hybrid approach.</b> We have a program that allows research engineers to be embedded into product and operations teams for temporary assignments. This gives people direct exposure to practical problems.</td></tr>
        <tr><td>Problem philosophy</td><td><b>Impact-focused.</b> We use our expertise and the expertise of partners in industry and academia to select projects that have the potential to make a big impact, and for which existing solutions are insufficient or not yet popularized.</td></tr>
        <tr><td>Promoting solutions</td><td><b>Open collaboration.</b> Popularizing winning ideas through public outreach, working with industry partners to promote standardization, and implementing ideas at scale to show they’re effective.</td></tr>
      </tbody>
    </table>
    <div>
      <h2>The hybrid approach to research</h2>
      <a href="#the-hybrid-approach-to-research">
        
      </a>
    </div>
    <p><i>“Super-ambitious goals tend to be unifying and energizing to people; but only if they believe there's a chance of success.” - Peter Diamandis</i></p><p>Given the scale and reach of Cloudflare, research problems (and opportunities) present themselves all the time. Our approach to research is a practical one. We choose to tackle <b>projects that have the potential to make a big impact, and for which existing solutions are insufficient</b>. <b>This stems from a belief that the interconnected systems that make up the Internet can be changed and improved in a fundamental way</b>. While some research problems are solvable in a few months, some may take years. We don’t shy away from long-term projects, but the Internet moves fast, so it’s important to break down long-term projects into smaller, independently-valuable pieces in order to continually provide value while pursuing a bigger vision.</p><p>Successful technological innovation is not purely about technical accomplishments. New creations need the social and political scaffolding to support them while being built, and the momentum and support to gain popularity. We are better able to innovate if grounded in a deep understanding of the current day-to-day. To stay grounded, our research team members spend part of their time solving practical problems that affect Cloudflare and our customers right now.</p><p>Cloudflare employs a hybrid research model similar to the model <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38149.pdf">pioneered by Google</a>. Innovation can come from everywhere in a company, so teams are encouraged to find the right balance between research and engineering activities. The research team works with the same tools, systems, and constraints as the rest of the engineering organization.</p><p>Research engineers are expected to write production-quality code and contribute to engineering activities. 
This enables researchers to leverage the rich data provided by Cloudflare’s production environment for experiments. To further break down silos, we have a program that allows research engineers to be embedded into product and operations teams for temporary assignments. This gives people direct exposure to practical problems.</p>
    <div>
      <h2>Continuing a successful tradition (our tradition)</h2>
      <a href="#continuing-a-successful-tradition-our-tradition">
        
      </a>
    </div>
    <p><i>“Skate to where the puck is going, not where it has been.” - Wayne Gretzky</i></p><p>The output of the research team is both new knowledge and technology that can lead to innovative products. Research works hand-in-hand with both product and engineering to help drive long-term positive outcomes for both Cloudflare and the Internet at large.</p><p><b>An example of a long-term project that requires both research and engineering is helping the Internet migrate from insecure to secure network protocols</b>. To tackle the problem, we pursued several smaller projects with discrete and measurable outcomes. This included:</p><ul><li><p>Building designs and <a href="/ecdsa-the-digital-signature-algorithm-of-a-better-internet/">supporting</a> <a href="/tls-session-resumption-full-speed-and-secure/">systems</a> to enable <a href="/introducing-universal-ssl/">Cloudflare SSL certificates to scale to all free customers</a></p></li><li><p><a href="/introducing-cfssl/">Open sourcing software</a> that helped Let’s Encrypt do the same</p></li><li><p><a href="/monsters-in-the-middleboxes/">Measuring the prevalence of middleboxes</a> harmful to security</p></li><li><p>Helping <a href="/rfc-8446-aka-tls-1-3/">design</a> and <a href="/introducing-0-rtt/">deploy TLS 1.3</a></p></li><li><p>Supporting accountability in the PKI with <a href="/introducing-certificate-transparency-and-nimbus/">Certificate Transparency</a></p></li><li><p>Promoting secure time synchronization via <a href="https://www.cloudflare.com/time/">NTS and Roughtime</a></p></li><li><p>Encrypting DNS with <a href="https://developers.cloudflare.com/1.1.1.1/dns-over-https/">DoH and DoT</a></p></li></ul><p>and many other smaller projects. Each step along the way contributed something concrete to help make the Internet more secure.</p><p>This year’s <a href="/welcome-to-crypto-week-2019/">Crypto Week</a> is a great example of the type of impact an effective hybrid research organization can make. 
Every day that week, a new announcement was made that helped take research results and realize their practical impact. From the League of Entropy, which is based on fundamental work by <a href="https://github.com/dedis/drand/">researchers at EPFL</a>, to Cloudflare Time Services, which helps address time security issues raised in papers by former Cloudflare intern <a href="https://scholar.google.com/citations?hl=en&amp;user=WVff3wsAAAAJ">Aanchal Malhotra</a>, to our own (currently running) <a href="/towards-post-quantum-cryptography-in-tls/">post-quantum experiment</a> with Google Chrome, <b>engineers at Cloudflare combined research with building large-scale production systems to help solve some unsolved problems on the Internet</b>.</p>
    <div>
      <h2>Open collaboration, open standards, and open source</h2>
      <a href="#open-collaboration-open-standards-and-open-source">
        
      </a>
    </div>
    <p><i>“We reject kings, presidents and voting. We believe in rough consensus and running code.” - Dave Clark</i></p><p>Effective research requires:</p><ul><li><p>Choosing interesting problems to solve</p></li><li><p>Popularizing the ideas discovered while studying the solution space</p></li><li><p>Implementing the ideas at scale to show they’re effective</p></li></ul><p>Cloudflare’s massive popularity puts us in a very privileged position. We can research, implement and deploy experiments at a scale that simply can’t be done by most organizations. This makes Cloudflare an attractive research partner for universities and other research institutions who have domain knowledge but not data. We rely on our own expertise along with that of peers in both academia and industry to decide which problems to tackle in order to achieve common goals and make new scientific progress. Our <a href="/monsters-in-the-middleboxes/">middlebox detection project</a>, proposed by researchers at the University of Michigan, is an example of such a problem.</p><p>We’re not purists who are only interested in pursuing our own ideas. Some interesting problems have already been solved, but the solution isn’t widely known or implemented. In this situation, we contribute our efforts to help elevate the best ideas and make them available to the public in an accessible way. Our early work popularizing <a href="/a-relatively-easy-to-understand-primer-on-elliptic-curve-cryptography/">elliptic curves</a> on <a href="/ecdsa-the-digital-signature-algorithm-of-a-better-internet/">the Internet</a> is such an example.</p><p>Popularizing an idea and implementing the idea at scale are two different things. Along with popularizing winning ideas, we want to ensure these ideas stick and provide benefits to Internet users. To promote the widespread deployment of useful ideas, we work on standards and deploy newly emerging standards early on. 
Doing so helps the industry easily adopt innovations and supports interoperability. For example, the work done for Crypto Week 2019 has helped the development of international technical standards. Aspects of the League of Entropy are <a href="https://tools.ietf.org/html/draft-irtf-cfrg-bls-signature-00">now being standardized</a> at the <a href="https://irtf.org/cfrg">CFRG</a>, Roughtime is now <a href="https://tools.ietf.org/html/draft-roughtime-aanchal-03">being considered for adoption</a> as an IETF standard, and we are presenting our post-quantum results as part of NIST’s <a href="https://csrc.nist.gov/Projects/Post-Quantum-Cryptography/Post-Quantum-Cryptography-Standardization">post-quantum cryptography standardization effort</a>.</p><p>Open source software is another key aspect of scaling the implementation of an idea. We <a href="https://github.com/cloudflare/circl">open source</a> associated code whenever possible. The research team collaborates with the wider research world as well as internally with other teams at Cloudflare.</p>
    <div>
      <h2>Focus areas going forward</h2>
      <a href="#focus-areas-going-forward">
        
      </a>
    </div>
    <p>Doing research, sharing it in an accessible way, working with top experts to validate it, and working on standardization has several benefits. It provides an opportunity to educate the public, further scientific understanding, and improve the state of the art; but it’s also a great way to attract candidates. Great engineers want to work on interesting projects and great researchers want to see their work have an impact. This hybrid research approach is attractive to both types of candidates.</p><p>Computer science is a vast arena, so the areas we’re currently focusing on are:</p><ul><li><p>Security and privacy</p></li><li><p>Cryptography</p></li><li><p>Internet measurement</p></li><li><p>Low-level networking and operating systems</p></li><li><p>Emerging networking paradigms</p></li></ul><p>Here are some highlights of publications we’ve co-authored over the last few years in these areas. We’ll be building on this tradition going forward.</p><ul><li><p><a href="https://dl.acm.org/citation.cfm?id=2995314">Attacking White-Box AES Constructions</a>. Created during investigations into protecting keys in untrusted locations.</p></li><li><p><a href="https://www.petsymposium.org/2018/files/papers/issue3/popets-2018-0026.pdf">Privacy Pass</a>. Arose from <a href="/cloudflare-supports-privacy-pass/">discussions</a> around reducing CAPTCHA friction without reducing site protection.</p></li><li><p><a href="https://dl.acm.org/citation.cfm?id=3278543">Is the Web Ready for OCSP Must-Staple?</a> Arose during the development of <a href="/high-reliability-ocsp-stapling/">high-availability OCSP stapling</a>.</p></li><li><p><a href="https://arxiv.org/abs/1905.13737">Protocols for Checking Compromised Credentials</a>. Work that came out of collaborations with <a href="https://haveibeenpwned.com/">Have I Been Pwned</a>.</p></li><li><p><a href="https://jhalderm.com/pub/papers/interception-ndss17.pdf">The Security Impact of HTTPS Interception</a>. 
Came about when exploring the end-to-end security of customer connections and discussing with researchers interested in the same topic. Resulted in <a href="/monsters-in-the-middleboxes/">MALCOLM</a>.</p></li><li><p><a href="https://ieeexplore.ieee.org/abstract/document/8406612/">In search of CurveSwap: Measuring elliptic curve implementations in the wild</a>. New attack devised when analyzing TLS 1.2 vulnerabilities.</p></li><li><p><a href="https://ai.google/research/pubs/pub47551">Does Certificate Transparency Break the Web? Measuring Adoption and Error Rate</a>. Partly motivated by the difficulty of running <a href="/introducing-certificate-transparency-and-nimbus/">Nimbus</a>.</p></li></ul><p>And by the way, we’re hiring!</p><p><a href="https://boards.greenhouse.io/cloudflare/jobs/1744503"><b>Product Management</b></a>: Help the research team explore the future of peer-to-peer systems by building and managing projects like the <a href="https://developers.cloudflare.com/distributed-web/">Distributed Web Gateway</a>.</p><p><a href="https://www.cloudflare.com/careers/jobs/?department=Engineering"><b>Engineering</b></a>:</p><ul><li><p>Engineering Manager (<a href="https://boards.greenhouse.io/cloudflare/jobs/1861694">San Francisco</a>, <a href="https://boards.greenhouse.io/cloudflare/jobs/1346216">London</a>)</p></li><li><p>Systems Engineer - Cryptography Research (<a href="https://boards.greenhouse.io/cloudflare/jobs/634967">San Francisco</a>)</p></li><li><p>Cryptography Research Engineer Internship (<a href="https://boards.greenhouse.io/cloudflare/jobs/608495">San Francisco</a>, <a href="https://boards.greenhouse.io/cloudflare/jobs/1025810">London</a>)</p></li></ul><p>If none of these fit you perfectly, but you still want to reach out, send us an email at <a href="mailto:ask-research@cloudflare.com">ask-research@cloudflare.com</a>.</p> ]]></content:encoded>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Research]]></category>
            <guid isPermaLink="false">4iB150QzhKAoKOvXt91ON7</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[How Cloudflare and Wall Street Are Helping Encrypt the Internet Today]]></title>
            <link>https://blog.cloudflare.com/how-cloudflare-and-wall-street-are-helping-encrypt-the-internet-today/</link>
            <pubDate>Fri, 13 Sep 2019 23:00:00 GMT</pubDate>
            <description><![CDATA[ Today has been a big day for Cloudflare, as we became a public company on the New York Stock Exchange (NYSE: NET). To mark the occasion, we decided to bring our favorite entropy machines to the floor of the NYSE. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Today has been a big day for Cloudflare, as we became a public company on the New York Stock Exchange (NYSE: NET). To mark the occasion, we decided to bring our favorite entropy machines to the floor of the NYSE. Footage of these lava lamps is being used as an additional seed to our <a href="/randomness-101-lavarand-in-production/"><b>entropy-generation system LavaRand</b></a> — bolstering Internet encryption for over 20 million Internet properties worldwide.</p><p><i>(This is mostly for fun. But when’s the last time you saw a lava lamp on the trading floor of the New York Stock Exchange?)</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2g2vH1hZmjN2EcjCIV8wfi/94702da9614d3dd97ddae35c5f8eeff8/NYSE-Lava-Lamps-Cropped.jpg" />
            
            </figure><p>A little context: generating truly random numbers using computers is impossible, because code is inherently deterministic (i.e. predictable). To compensate for this, engineers draw from pools of randomness created by entropy generators, which is a fancy term for "things that are truly unpredictable".</p><p>It turns out that lava lamps are fantastic sources of entropy, as was first shown by Silicon Graphics in the 1990s. It’s a torch we’ve been proud to carry forward: today, Cloudflare uses lava lamps to generate entropy that helps make millions of Internet properties more secure.</p>
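<p>The idea of amplifying a pool of entropy can be sketched in a few lines: condense an unpredictable input (here, a simulated camera frame) into a fixed-size seed by hashing it together with the operating system's own entropy pool. This is an illustrative Python sketch only; the function name and simulated frame are invented here, and LavaRand's real pipeline is more involved.</p>

```python
import hashlib
import os
import secrets

def seed_from_frame(frame_bytes: bytes) -> bytes:
    """Condense an unpredictable input (e.g. a camera frame of the
    lava-lamp wall) into a fixed-size seed. Hypothetical helper for
    illustration, not Cloudflare's implementation."""
    h = hashlib.sha256()
    h.update(frame_bytes)      # external, hard-to-predict input
    h.update(os.urandom(32))   # mix in the OS entropy pool
    return h.digest()          # 32-byte seed

# Simulate a "frame"; in practice this would be real camera data.
frame = secrets.token_bytes(1024)
seed = seed_from_frame(frame)
print(len(seed))  # 32
```

Because the hash mixes the frame with fresh OS entropy, even a partially predictable frame cannot make the seed predictable on its own.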
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ah3LRVUaM1zY6dND64ghZ/18ce5b07b000d6d951d97cc5753abe81/LavaRand1.png" />
            
            </figure><p>Housed in our San Francisco headquarters is a wall filled with dozens of lava lamps, undulating with mesmerizing randomness. We capture these lava lamps on video via a camera mounted across the room, and feed the resulting footage into an algorithm — called <a href="/lavarand-in-production-the-nitty-gritty-technical-details/"><b>LavaRand</b></a> — that amplifies the pure randomness of these lava lamps to dizzying extremes (computers can't create seeds of pure randomness, but they can massively amplify them).</p><p>Shortly before we rang the opening bell this morning, we recorded footage of our lava lamps in operation on the trading room floor of the New York Stock Exchange, and we're ingesting the footage into our LavaRand system. The resulting entropy is mixed with the myriad additional sources of entropy that we leverage every day, creating a cryptographically-secure source of randomness — fortified by Wall Street.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2i4n94z75iXaoZZFgXxg7y/f8cb4be9c50b4b4ff48b3f3aff354ff7/League-Of-Entropy.png" />
            
            </figure><p>We recently took our enthusiasm for randomness a step further by facilitating the <a href="/league-of-entropy/"><b>League of Entropy</b></a>, a consortium of global organizations and individual contributors, generating verifiable randomness via a globally distributed network. As one of the founding members of the League, LavaRand (pictured above) plays a key role in empowering developers worldwide with a pool of randomness with extreme entropy and high reliability.</p><p>And today, she’s enjoying the view from the podium!</p><hr /><p><i>One caveat: the lava lamps we run in our San Francisco headquarters are recorded in real-time, 24/7, giving us an ongoing stream of entropy. For reasons that are understandable, the NYSE doesn't allow for live video feeds from the exchange floor while it is in operation. But this morning they did let us record footage of the lava lamps operating shortly before the opening bell. The video was recorded and we're ingesting it into our LavaRand system (alongside many other entropy generators, including the lava lamps back in San Francisco).</i></p> ]]></content:encoded>
            <category><![CDATA[LavaRand]]></category>
            <category><![CDATA[Entropy]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Encryption]]></category>
            <guid isPermaLink="false">5cmkZ6Tia2iUD2yBBnQAOb</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Welcome to Crypto Week 2019]]></title>
            <link>https://blog.cloudflare.com/welcome-to-crypto-week-2019/</link>
            <pubDate>Sun, 16 Jun 2019 17:07:57 GMT</pubDate>
            <description><![CDATA[ The Internet is an extraordinarily complex and evolving ecosystem. Its constituent protocols range from the ancient and archaic (hello FTP) to the modern and sleek (meet WireGuard), with a fair bit of everything in between.  ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15NFCZdp2cpRHNfys7tpAq/15487537909975789f358f467d314eb0/image21.png" />
            
            </figure><p>The Internet is an extraordinarily complex and evolving ecosystem. Its constituent protocols range from the ancient and archaic (hello <a href="https://developers.cloudflare.com/spectrum/getting-started/ftp/">FTP</a>) to the modern and sleek (meet <a href="/boringtun-userspace-wireguard-rust/">WireGuard</a>), with a fair bit of everything in between. This evolution is ongoing, and as one of the <a href="https://bgp.he.net/report/exchanges#_participants">most connected</a> networks on the Internet, Cloudflare has a duty to be a good steward of this ecosystem. We take this responsibility to heart: Cloudflare’s mission is to help build a better Internet. In this spirit, we are very proud to announce Crypto Week 2019.</p><p>Every day this week we’ll announce a new project or service that uses modern cryptography to build a more secure, trustworthy Internet. Everything we release this week will be free and immediately useful. This blog is a fun exploration of the themes of the week.</p><ul><li><p>Monday: <a href="/league-of-entropy/"><b>The League of Entropy</b></a><b>, </b><a href="/inside-the-entropy/"><b>Inside the Entropy</b></a></p></li><li><p>Tuesday: <a href="/secure-certificate-issuance/"><b>Securing Certificate Issuance using Multipath Domain Control Validation</b></a></p></li><li><p>Wednesday: <a href="/cloudflare-ethereum-gateway/"><b>Cloudflare's Ethereum Gateway</b></a><b>, </b><a href="/continuing-to-improve-our-ipfs-gateway/"><b>Continuing to Improve our IPFS Gateway</b></a></p></li><li><p>Thursday: <a href="/the-quantum-menace/"><b>The Quantum Menace</b></a>, <a href="/towards-post-quantum-cryptography-in-tls/"><b>Towards Post-Quantum Cryptography in TLS</b></a>, <a href="/introducing-circl/"><b>Introducing CIRCL: An Advanced Cryptographic Library</b></a></p></li><li><p>Friday: <a href="/secure-time/"><b>Introducing time.cloudflare.com</b></a></p></li></ul>
    <div>
      <h3>The Internet of the Future</h3>
      <a href="#the-internet-of-the-future">
        
      </a>
    </div>
    <p>Many pieces of the Internet in use today were designed in a different era with different assumptions. The Internet’s success is based on strong foundations that support constant reassessment and improvement. Sometimes these improvements require deploying new protocols.</p><p>Performing an upgrade on a system as large and decentralized as the Internet can’t be done by decree:</p><ul><li><p>There are too many economic, cultural, political, and technological factors at play.</p></li><li><p>Changes must be compatible with existing systems and protocols to even be considered for adoption.</p></li><li><p>To gain traction, new protocols must provide tangible improvements for users. Nobody wants to install an update that doesn’t improve their experience!</p></li></ul><p><b>The last time the Internet had a complete reboot and upgrade</b> was during <a href="https://www.internetsociety.org/blog/2016/09/final-report-on-tcpip-migration-in-1983/">TCP/IP flag day</a> <b>in 1983</b>. Back then, the Internet (called ARPANET) had fewer than ten thousand hosts! To have an Internet-wide flag day today to switch over to a core new protocol is inconceivable; the scale and diversity of the components involved is way too massive. Too much would break. It’s challenging enough to deprecate <a href="https://dnsflagday.net/2019/">outmoded functionality</a>. In some ways, the open Internet is a victim of its own success. The bigger a system grows and the longer it stays the same, the <a href="/why-tls-1-3-isnt-in-browsers-yet/">harder it is to change</a>. The Internet is like a massive barge: it takes forever to steer in a different direction and it’s carrying a lot of garbage.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5qyclW7jXghND7AAOoygOv/483e2764aa861229b7487235355208b0/image16.jpg" />
            
            </figure><p>ARPANET, 1983 (<a href="https://www.computerhistory.org/internethistory/1980s/">Computer History Museum</a>)</p><p>As you would expect, many of the warts of the early Internet still remain. Both academic security researchers and real-life adversaries are still finding and exploiting vulnerabilities in the system. Many vulnerabilities are due to the fact that most of the protocols in use on the Internet have a weak notion of trust inherited from the early days. With 50 hosts online, it’s relatively easy to trust everyone, but in a world-scale system, that trust breaks down in fascinating ways. The primary tool to scale trust is cryptography, which helps provide some measure of accountability, though it has its own complexities.</p><p>In an ideal world, the Internet would provide a trustworthy substrate for human communication and commerce. Some people naïvely assume that this is the natural direction the evolution of the Internet will follow. However, constant improvement is not a given. 
<b>It’s possible that the Internet of the future will actually be</b> <b><i>worse</i></b> <b>than the Internet today: less open, less secure, less private, less</b> <b><i>trustworthy</i></b><b>.</b> There are strong incentives to weaken the Internet on a fundamental level by <a href="https://www.ispreview.co.uk/index.php/2019/04/google-uk-isps-and-gov-battle-over-encrypted-dns-and-censorship.html">Governments</a>, by businesses <a href="https://www.theatlantic.com/technology/archive/2017/03/encryption-wont-stop-your-internet-provider-from-spying-on-you/521208/">such as ISPs</a>, and even by the <a href="https://www.cyberscoop.com/tls-1-3-weakness-financial-industry-ietf/">financial institutions</a> entrusted with our personal data.</p><p>In a system with as many stakeholders as the Internet, <b>real change requires principled commitment from all invested parties.</b> At Cloudflare, we believe everyone is entitled to an Internet built on a solid foundation of trust. <b>Crypto Week</b> is our way of helping nudge the Internet’s evolution in a more trust-oriented direction. Each announcement this week helps bring the Internet of the future to the present in a tangible way.</p>
    <div>
      <h3>Ongoing Internet Upgrades</h3>
      <a href="#ongoing-internet-upgrades">
        
      </a>
    </div>
    <p>Before we explore the Internet of the future, let’s explore some of the previous and ongoing attempts to upgrade the Internet’s fundamental protocols.</p>
    <div>
      <h4>Routing Security</h4>
      <a href="#routing-security">
        
      </a>
    </div>
    <p>As we highlighted in <a href="/crypto-week-2018/">last year’s Crypto Week</a>, <b>one of the weak links on the Internet is routing</b>. Not all networks are directly connected.</p><p>To send data from one place to another, <b>you might have to rely on intermediary networks to pass your data along.</b> A packet sent from one host to another <b>may have to be passed through up to a dozen of these intermediary networks.</b> <i>No single network knows the full path the data will have to take to get to its destination; it only knows which network to pass it to next.</i> <b>The protocol that determines how packets are routed is called the</b> <a href="https://www.cloudflare.com/learning/security/glossary/what-is-bgp/"><b>Border Gateway Protocol (BGP)</b></a>. Generally speaking, networks use BGP to announce to each other which addresses they know how to route packets for and (dependent on a set of complex rules) these networks share what they learn with their neighbors.</p><p>Unfortunately, <b>BGP is completely insecure:</b></p><ul><li><p><b>Any network can announce any set of addresses to any other network,</b> even addresses they don’t control. This leads to a phenomenon called <i>BGP hijacking</i>, where networks are tricked into sending data to the wrong network.</p></li><li><p><b>A BGP hijack</b> is most often caused by accidental misconfiguration, but <b>can also be the result of malice on the network operator’s part</b>.</p></li><li><p><b>During a BGP hijack, a network inappropriately announces a set of addresses to other networks</b>, which results in packets destined for the announced addresses being routed through the illegitimate network.</p></li></ul>
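<p>To see why a hijack works, here is a toy Python sketch of longest-prefix-match route selection (the prefixes and network names are made up for illustration; real BGP best-path selection involves many rules beyond prefix length). A hijacker who announces a more-specific prefix wins the tie:</p>

```python
import ipaddress

# Toy routing table: prefix -> advertising network.
# Routers prefer the most-specific (longest) matching prefix.
routes = {
    ipaddress.ip_network("203.0.113.0/24"): "AS-legitimate",
}

def best_route(dest: str):
    dst = ipaddress.ip_address(dest)
    matches = [p for p in routes if dst in p]
    return routes[max(matches, key=lambda p: p.prefixlen)] if matches else None

print(best_route("203.0.113.10"))   # AS-legitimate

# A hijacker announces a more-specific prefix covering the same space...
routes[ipaddress.ip_network("203.0.113.0/25")] = "AS-hijacker"

# ...and now wins longest-prefix match for that half of the block.
print(best_route("203.0.113.10"))   # AS-hijacker
```

Nothing in the protocol itself stops the second announcement, which is exactly the gap RPKI (below) addresses.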
    <div>
      <h4>Understanding the risk</h4>
      <a href="#understanding-the-risk">
        
      </a>
    </div>
    <p>If the packets represent unencrypted data, this can be a big problem as it <b>allows the hijacker to read or even change the data:</b></p><ul><li><p>In 2018, a rogue network <a href="/bgp-leaks-and-crypto-currencies/">hijacked the addresses of a service called MyEtherWallet</a>. Financial transactions were routed through the attacker’s network, where they were modified, <b>resulting in the theft of over a hundred thousand dollars of cryptocurrency.</b></p></li></ul>
    <div>
      <h4>Mitigating the risk</h4>
      <a href="#mitigating-the-risk">
        
      </a>
    </div>
    <p>The <a href="/tag/rpki/">Resource Public Key Infrastructure (RPKI)</a> system helps bring some trust to BGP by <b>enabling networks to utilize cryptography to digitally sign network routes with certificates, making BGP hijacking much more difficult.</b></p><ul><li><p>This enables participants of the network to gain assurances about the authenticity of route advertisements. <a href="/introducing-certificate-transparency-and-nimbus/">Certificate Transparency</a> (CT) is a tool that enables additional trust for certificate-based systems. Cloudflare operates the <a href="https://ct.cloudflare.com/logs/cirrus">Cirrus CT log</a> to support RPKI.</p></li></ul><p>Since we announced our support of RPKI last year, routing security has made big strides. More routes are signed, more networks validate RPKI, and the <a href="https://github.com/cloudflare/cfrpki">software ecosystem has matured</a>, but this work is not complete. Most networks are still vulnerable to BGP hijacking. For example, <a href="https://www.cnet.com/news/how-pakistan-knocked-youtube-offline-and-how-to-make-sure-it-never-happens-again/">Pakistan knocked YouTube offline with a BGP hijack</a> back in 2008, and could likely do the same today. Adoption here is driven less by providing a benefit to users than by reducing systemic risk, which is not the strongest motivating factor for adopting a complex new technology. Full routing security on the Internet could take decades.</p>
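<p>The core of RPKI origin validation can be sketched as a lookup against Route Origin Authorizations (ROAs), which bind a prefix (up to a maximum length) to the AS number allowed to originate it. The data and ASNs below are invented for illustration, and a real validator also verifies the cryptographic signatures on the ROAs, which this sketch omits:</p>

```python
import ipaddress

# Toy ROA set: each entry authorizes one origin AS for a prefix.
roas = [
    {"prefix": ipaddress.ip_network("203.0.113.0/24"), "max_len": 24, "asn": 64500},
]

def rpki_validate(prefix: str, origin_asn: int) -> str:
    net = ipaddress.ip_network(prefix)
    covering = [r for r in roas if net.subnet_of(r["prefix"])]
    if not covering:
        return "not-found"   # no ROA covers this prefix
    for r in covering:
        if r["asn"] == origin_asn and net.prefixlen <= r["max_len"]:
            return "valid"
    return "invalid"         # covered, but wrong origin or too specific

print(rpki_validate("203.0.113.0/24", 64500))  # valid
print(rpki_validate("203.0.113.0/25", 64500))  # invalid (exceeds max_len)
print(rpki_validate("203.0.113.0/24", 64666))  # invalid (wrong origin AS)
```

Note that the more-specific /25 announcement from the hijack example above would come back "invalid" here, which is the point of the system.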
    <div>
      <h3>DNS Security</h3>
      <a href="#dns-security">
        
      </a>
    </div>
    <p>The <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">Domain Name System (DNS)</a> is the phone book of the Internet. Or, for anyone under 25 who doesn’t remember phone books, it’s the system that takes hostnames (like cloudflare.com or facebook.com) and returns the Internet address where that host can be found. For example, as of this publication, <a href="http://www.cloudflare.com">www.cloudflare.com</a> is 104.17.209.9 and 104.17.210.9 (IPv4) and 2606:4700::c629:d7a2, 2606:4700::c629:d6a2 (IPv6). Like BGP, <b>DNS is completely insecure. Queries and responses sent unencrypted over the Internet are modifiable by anyone on the path.</b></p><p>There are many ongoing attempts to add security to DNS, such as:</p><ul><li><p><a href="https://www.cloudflare.com/learning/dns/dnssec/ecdsa-and-dnssec/">DNSSEC</a> that <b>adds a chain of digital signatures to DNS responses</b></p></li><li><p>DoT/DoH that <b>wraps DNS queries in the TLS encryption protocol</b> (more on that later)</p></li></ul><p>Both technologies are slowly gaining adoption, but have a long way to go.</p>
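<p>The lookup described above can be seen from Python's standard resolver interface. This minimal sketch asks the system resolver for a host's addresses; queries like this travel unencrypted unless the resolver itself uses DoT or DoH:</p>

```python
import socket

def resolve(hostname: str):
    # Ask the system resolver for every address family it knows.
    infos = socket.getaddrinfo(hostname, None)
    # Deduplicate: getaddrinfo returns one entry per socket type.
    return sorted({info[4][0] for info in infos})

print(resolve("localhost"))  # e.g. ['127.0.0.1', '::1']
```

Running it against a public hostname (say, <code>resolve("www.cloudflare.com")</code>) returns whatever the resolver on the path answers, which is exactly why an unencrypted path is modifiable by anyone on it.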
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1dE9CBbFXOXM7eXg22tDnr/d5aac38c166f0b58eccaddaf77cf0c8d/DNSSEC-adoption-over-time-1.png" />
            
            </figure><p>DNSSEC-signed responses served by Cloudflare</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/186wxeg2yS6VN7fRqMeBS6/cd8261cbee791770fdbdde5b8e856974/DoT_DoH.png" />
            
            </figure><p>Cloudflare’s 1.1.1.1 resolver queries are already over 5% DoT/DoH</p><p>Just like RPKI, <b>securing DNS comes with a performance cost,</b> making it less attractive to users. However,</p><ul><li><p><b>Services like 1.1.1.1 provide</b> <a href="https://www.dnsperf.com/dns-provider/cloudflare"><b>extremely fast DNS</b></a>, which means that for many users, <a href="https://blog.mozilla.org/futurereleases/2019/04/02/dns-over-https-doh-update-recent-testing-results-and-next-steps/">encrypted DNS is faster than the unencrypted DNS</a> from their ISP.</p></li><li><p>This <b>performance improvement makes it appealing for users</b> of privacy-conscious applications, like Firefox and Cloudflare’s 1.1.1.1 app, to adopt secure DNS.</p></li></ul>
    <div>
      <h3>The Web</h3>
      <a href="#the-web">
        
      </a>
    </div>
    <p><b>Transport Layer Security (TLS)</b> is a cryptographic protocol that gives two parties the ability to communicate over an encrypted and authenticated channel. <b>TLS protects communications from eavesdroppers even in the event of a BGP hijack.</b> TLS is what puts the “S” in <b>HTTPS</b>. TLS protects web browsing against multiple types of network adversaries.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6SY5pZINknbDszMPvOf77c/84fb41510e25717243e5ff06d631eeb3/past-connection-1.png" />
            
            </figure><p>Requests hop from network to network over the Internet</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2SrnXAdMqSONKHaYJVarm7/5b3a59fd2159f3d19704cf267dc30cc0/MITM-past-copy-2.png" />
            
            </figure><p>For unauthenticated protocols, an attacker on the path can impersonate the server</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/H8ticQEIC5ebrX7WXsmPW/23e71675bd585233362bd62d9f6840e4/BGP-hijack-past-1.png" />
            
            </figure><p>Attackers can use BGP hijacking to change the path so that communication can be intercepted</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4gg6sYN8sqXSKeTPETI4b0/347ddfea0a9de130aa69726f1dd870da/PKI-validated-connectio-1.png" />
            
            </figure><p>Authenticated protocols are protected from interception attacks</p><p>The adoption of TLS on the web is partially driven by the fact that:</p><ul><li><p><b>It’s easy and free for websites to get an authentication certificate</b> (via <a href="https://letsencrypt.org/">Let’s Encrypt</a>, <a href="/introducing-universal-ssl/">Universal SSL</a>, etc.)</p></li><li><p>Browsers make TLS adoption appealing to website operators by <b>only supporting new web features such as HTTP/2 over HTTPS.</b></p></li></ul><p>This has led to the rapid adoption of HTTPS over the last five years.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/78UVeYpDvOgQp0CQIyb20R/97d29db95d234771da008b5f83dfc7dc/image12.jpg" />
            
            </figure><p>HTTPS adoption curve (<a href="https://transparencyreport.google.com/https/overview">from Google Chrome</a>)‌‌</p><p>To further that adoption, TLS recently got an upgrade in TLS 1.3, <b>making it faster </b><i><b>and</b></i><b> more secure (a combination we love)</b>. It’s taking over the Internet!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1NFORFUn1FFn10RTJ5HUCS/a21ff9ea0eaa41916107fa9522b13f38/tls.13-adoption-1.png" />
            
            </figure><p>TLS 1.3 adoption over the last 12 months (from Cloudflare's perspective)</p><p>Despite this fantastic progress in the adoption of security for routing, DNS, and the web, <b>there are still gaps in the trust model of the Internet.</b> There are other things needed to help build the Internet of the future. To find and identify these gaps, we lean on research experts.</p>
    <div>
      <h3>Research Farm to Table</h3>
      <a href="#research-farm-to-table">
        
      </a>
    </div>
    <p>Cryptographic security on the Internet is a hot topic and there have been many flaws and issues recently pointed out in academic journals. Researchers often <b>study the vulnerabilities of the past and ask:</b></p><ul><li><p>What other critical components of the Internet have the same flaws?</p></li><li><p>What underlying assumptions can subvert trust in these existing systems?</p></li></ul><p>The answers to these questions help us decide what to tackle next. Some recent research topics we’ve learned about include:</p><ul><li><p>Quantum Computing</p></li><li><p>Attacks on Time Synchronization</p></li><li><p>DNS attacks affecting Certificate issuance</p></li><li><p>Scaling distributed trust</p></li></ul><p>Cloudflare keeps abreast of these developments and we do what we can to bring these new ideas to the Internet at large. In this respect, we’re truly standing on the shoulders of giants.</p>
    <div>
      <h3>Future-proofing Internet Cryptography</h3>
      <a href="#future-proofing-internet-cryptography">
        
      </a>
    </div>
    <p>The new protocols we are currently deploying (RPKI, DNSSEC, DoT/DoH, TLS 1.3) use cryptographic algorithms first published in the 1970s and 1980s.</p><ul><li><p>The security of these algorithms is based on hard mathematical problems in the field of number theory, such as <a href="https://en.wikipedia.org/wiki/RSA_(cryptosystem)">factoring</a> and the <a href="/a-relatively-easy-to-understand-primer-on-elliptic-curve-cryptography/">elliptic curve discrete logarithm</a> problem.</p></li><li><p>If you can solve the hard problem, you can crack the code. <b>Using a bigger key makes the problem harder,</b> making it more difficult to break, <b>but also slows performance.</b></p></li></ul><p>Modern Internet protocols typically pick keys large enough to make it infeasible to break with <a href="https://whatis.techtarget.com/definition/classical-computing">classical computers</a>, but no larger. <b>The sweet spot is around 128 bits of security,</b> <b>meaning a computer has to do approximately 2</b>¹²⁸ <b>operations to break it.</b></p><p><a href="https://eprint.iacr.org/2013/635.pdf">Arjen Lenstra and others</a> created a useful measure of security levels by <b>comparing the amount of energy it takes to break a key to the amount of water you can boil</b> using that much energy. You can think of this as the electric bill you’d get if you run a computer long enough to crack the key.</p><ul><li><p><b>35-bit security</b> <b>is “Teaspoon security”</b> – It takes about the same amount of energy to break a 35-bit key as it does to boil a teaspoon of water (pretty easy).</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2iNGrNB4yH2iwZY14nQyWB/7bdf9a8e1874fe0444f532823e22fcd3/image20.png" />
            
            </figure><ul><li><p><b>65 bits gets you up to “Pool security” –</b> The energy needed to boil the average amount of water in a swimming pool.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1xzgnzDf3dwHcNWdTfAyeC/de62077e9a936e937a78c2dbc4874ece/image8.png" />
            
            </figure><ul><li><p><b>105 bits is “Sea Security”</b> – The energy needed to boil the Mediterranean Sea.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5CYdmr4aGJ67Wq0j9t3iXR/2df460cc061c0c747ab66be9a866beed/8reIkbszxaKMxOsDDEzOB4ljqnVtQdJBQsYEz-uL-AZnNL0jUKSd4CbSAz-yS9tvpi_ki1JoYZ_-ZktMSbqRtDSVFMjHvsyBtgmc2rPuiDr9b-Fj6DvEJvLF7tWP.png" />
            
            </figure><ul><li><p><b>114-bits is “Global Security” –</b>  The energy needed to boil all water on Earth.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/58XtfEmY4rpTkx7Y71AnDW/b1b31eaf34b318cc83bdcad3c20d85ff/image14.png" />
            
            </figure><ul><li><p><b>128-bit security is safely beyond that</b> <b>of Global Security</b> – Anything larger is excessive.</p></li><li><p><b>256-bit security corresponds to “Universal Security”</b> – The estimated mass-energy of the observable universe. So, if you ever hear someone suggest 256-bit AES, you know they mean business.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5oWF3OLLr5WVWIga7mMGf8/d21426fe5df41b9c85bc6b8eca2e4331/image18.png" />
            
            </figure>
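<p>The jumps in this list follow from one fact: each additional key bit doubles the work, and therefore the energy. Anchored at the teaspoon, the other levels fall out of a quick back-of-the-envelope calculation. The volumes below are rough assumptions, not precise figures: a 5&nbsp;ml teaspoon, a 2.5&nbsp;million-litre pool, about 3.75×10¹⁸ litres in the Mediterranean, and about 1.4×10²¹ litres of water on Earth:</p>

```python
import math

TEASPOON_LITRES = 0.005  # reference volume: boiling this ~ breaking a 35-bit key

def security_bits(litres: float) -> float:
    """Key length whose brute-force energy boils the given volume of water.

    Each extra key bit doubles the energy, so the security level grows with
    log2 of the volume relative to the 35-bit teaspoon anchor."""
    return 35 + math.log2(litres / TEASPOON_LITRES)

for name, litres in [("pool", 2.5e6),
                     ("Mediterranean Sea", 3.75e18),
                     ("all water on Earth", 1.4e21)]:
    print(f"{name}: ~{security_bits(litres):.0f} bits")
```

<p>Rounding to the nearest level reproduces the ~65, ~105, and ~114-bit figures above.</p>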
    <div>
      <h3>Post-Quantum of Solace</h3>
      <a href="#post-quantum-of-solace">
        
      </a>
    </div>
    <p>As far as we know, <b>the cryptographic algorithms we use are functionally uncrackable</b> by any algorithm a classical computer can run. <b>Quantum computers change this calculus.</b> Instead of transistors and bits, a quantum computer uses the effects of <a href="https://en.wikipedia.org/wiki/Quantum_entanglement">quantum mechanics</a> to perform calculations that just aren’t possible with classical computers. As you can imagine, quantum computers are very difficult to build. However, despite large-scale quantum computers not existing quite yet, computer scientists have already developed algorithms that can only run efficiently on quantum computers. Surprisingly, it turns out that <b>with a sufficiently powerful quantum computer, most of the hard mathematical problems we rely on for Internet security become easy!</b></p><p>Although there are still <a href="https://www.quantamagazine.org/gil-kalais-argument-against-quantum-computers-20180207/">quantum-skeptics</a> out there, <a href="http://fortune.com/2018/12/15/quantum-computer-security-encryption/">some experts</a> <b>estimate that within 15-30 years these large quantum computers will exist, which poses a risk to every security protocol online.</b> Progress is moving quickly; every few months a more powerful quantum computer <a href="https://en.wikipedia.org/wiki/Timeline_of_quantum_computing">is announced</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/281v5SkqTn89jPkEnFjkPE/587423b66380be1051c67f2f17263e69/image1-2.png" />
            
            </figure><p>Luckily, there are cryptography algorithms that rely on different hard math problems that seem to be resistant to attack from quantum computers. These math problems form the basis of so-called <i>quantum-resistant</i> (or <i>post-quantum</i>) cryptography algorithms that can run on classical computers. These algorithms can be used as substitutes for most of our current quantum-vulnerable algorithms.</p><ul><li><p>Some quantum-resistant algorithms (such as <a href="https://en.wikipedia.org/wiki/McEliece_cryptosystem">McEliece</a> and <a href="https://en.wikipedia.org/wiki/Lamport_signature">Lamport Signatures</a>) were invented decades ago, but there’s a reason they aren’t in common use: they <b>lack some of the nice properties of the algorithms we’re currently using, such as key size and efficiency.</b></p></li><li><p>Some quantum-resistant algorithms <b>require much larger keys to provide 128-bit security</b></p></li><li><p>Some are very <b>CPU intensive</b>,</p></li><li><p>And some just <b>haven’t been studied enough to know if they’re secure.</b></p></li></ul><p>It is possible to swap our current set of quantum-vulnerable algorithms with new quantum-resistant algorithms, but it’s a daunting engineering task. With widely deployed <a href="https://en.wikipedia.org/wiki/IPsec">protocols</a>, it is hard to make the transition from something fast and small to something slower, bigger or more complicated without providing concrete user benefits. <b>When exploring new quantum-resistant algorithms, minimizing user impact is of utmost importance</b> to encourage adoption. 
This is a big deal, because almost all the protocols we use to protect the Internet are vulnerable to quantum computers.</p><p>Cryptography-breaking quantum computing is still in the distant future, but we must start the transition to ensure that today’s secure communications are safe from tomorrow’s quantum-powered onlookers; however, that’s not the most <i>timely</i> problem with the Internet. We haven’t addressed that...yet.</p>
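<p>Of the older designs mentioned above, Lamport signatures make the trade-offs concrete: their security rests only on a hash function (which is why they are believed to resist quantum attack), but each key pair can sign exactly one message and the keys run to kilobytes instead of the tens of bytes an elliptic-curve key needs. A minimal sketch, not a production implementation:</p>

```python
import hashlib
import secrets

def keygen():
    # Private key: 256 pairs of random 32-byte secrets, one pair per digest bit.
    sk = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    # Public key: the SHA-256 hash of every secret.
    pk = [(hashlib.sha256(a).digest(), hashlib.sha256(b).digest()) for a, b in sk]
    return sk, pk  # sk alone is 2 * 256 * 32 = 16,384 bytes, and one-time use

def _bits(message: bytes):
    digest = hashlib.sha256(message).digest()
    return [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(256)]

def sign(sk, message: bytes):
    # Reveal one secret from each pair, selected by the message-digest bits.
    return [sk[i][bit] for i, bit in enumerate(_bits(message))]

def verify(pk, message: bytes, sig) -> bool:
    return all(hashlib.sha256(sig[i]).digest() == pk[i][bit]
               for i, bit in enumerate(_bits(message)))
```

<p>Revealing secrets bit by bit is also why a Lamport key must never sign twice: a second signature leaks enough secrets to forge new messages.</p>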
    <div>
      <h3>Attacking time</h3>
      <a href="#attacking-time">
        
      </a>
    </div>
    <p>Just like DNS, BGP, and HTTP, <b>the Network Time Protocol (NTP) is fundamental to how the Internet works</b>. And like these other protocols, it is <b>completely insecure</b>.</p><ul><li><p>Last year, <b>Cloudflare introduced</b> <a href="/roughtime/"><b>Roughtime</b></a> support as a mechanism for computers to access the current time from a trusted server in an authenticated way.</p></li><li><p>Roughtime is powerful because it <b>provides a way to distribute trust among multiple time servers</b> so that if one server attempts to lie about the time, it will be caught.</p></li></ul><p>However, Roughtime is not exactly a secure drop-in replacement for NTP.</p><ul><li><p><b>Roughtime lacks the complex mechanisms of NTP</b> that allow it to compensate for network latency and yet maintain precise time, especially if the time servers are remote. This leads to <b>imprecise time</b>.</p></li><li><p>Roughtime also <b>involves expensive cryptography that can further reduce precision</b>. This lack of precision makes Roughtime useful for browsers and other systems that need coarse time to validate certificates (most certificates are valid for 3 months or more), but some systems (such as those used for financial trading) require precision to the millisecond or below.</p></li></ul><p>With Roughtime we supported the time protocol of the future, but there are things we can do to help improve the health of security online <i>today</i>.</p>
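<p>The "caught if it lies" property comes from chaining queries: the nonce sent to each time server is derived from the previous server's signed reply, so a bogus reply ends up embedded, via the nonce, in the next server's signature and can be shown to a third party. A conceptual sketch of the chaining step (not the actual wire format; real Roughtime uses 64-byte nonces and Ed25519 signatures):</p>

```python
import hashlib
import secrets

def chained_nonce(prev_signed_reply: bytes) -> tuple[bytes, bytes]:
    """Derive the next request's nonce from the previous server's reply.

    The random blind keeps the nonce unpredictable; hashing in the reply
    means the next server unknowingly countersigns what the last one said."""
    blind = secrets.token_bytes(64)
    nonce = hashlib.sha512(blind + prev_signed_reply).digest()
    return nonce, blind  # keep the blind so the chain can be audited later
```

<p>An auditor who holds the blinds and the signed replies can recompute every nonce in the chain and pin a lie on the exact server that told it.</p>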
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/76dqfHQzRcGyA09D55f46W/35be1f73c91f7d4e128b9582638b80ab/image2-3.png" />
            
            </figure><p>Some academic researchers, including Aanchal Malhotra of Boston University, have <a href="https://www.cs.bu.edu/~goldbe/NTPattack.html">demonstrated</a> a variety of attacks against NTP, including <b>BGP hijacking and off-path User Datagram Protocol (UDP) attacks.</b></p><ul><li><p>Some of these attacks can be avoided by connecting to an NTP server that is close to you on the Internet.</p></li><li><p>However, to bring cryptographic trust to time while maintaining precision, we need something in between NTP and Roughtime.</p></li><li><p>To solve this, it’s natural to turn to the same system of trust that enabled us to patch HTTP and DNS: Web PKI.</p></li></ul>
    <div>
      <h3>Attacking the Web PKI</h3>
      <a href="#attacking-the-web-pki">
        
      </a>
    </div>
    <p>The Web PKI is similar to the RPKI, but is more widely visible since it relates to websites rather than routing tables.</p><ul><li><p>If you’ve ever clicked the lock icon on your browser’s address bar, you’ve interacted with it.</p></li><li><p>The PKI relies on a set of trusted organizations called Certificate Authorities (CAs) to issue certificates to websites and web services.</p></li><li><p>Websites use these certificates to authenticate themselves to clients as <a href="https://www.cloudflare.com/learning/ssl/transport-layer-security-tls/">part of the TLS protocol</a> in HTTPS.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/wzqfOvaCcj0TCr3ezbOY3/f87fc64402bee2de14b4c4ba5d0b93bb/pki-validated.png" />
            
            </figure><p>TLS provides encryption and integrity from the client to the server with the help of a digital certificate </p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1bOHTpPBXfi0VeYoJikyfD/cbc3e67f2294e322436a5b542bb3644e/attack-against-PKI-validated-connectio-2.png" />
            
            </figure><p>TLS connections are safe against MITM, because the client doesn’t trust the attacker’s certificate</p><p>While we were all patting ourselves on the back for moving the web to HTTPS, <a href="https://dl.acm.org/citation.cfm?id=3243790">some</a> <a href="https://www.princeton.edu/~pmittal/publications/bgp-tls-usenix18.pdf">researchers</a> managed to find and exploit <b>a weakness in the system: the process for getting HTTPS certificates.</b></p><p>Certificate Authorities (CAs) use a process called <b>domain control validation (DCV) to ensure that they only issue certificates to website owners who legitimately request them.</b></p><ul><li><p>Some CAs do this validation manually, which is secure, but <b>can’t scale to the total number of websites deployed today.</b></p></li><li><p>More progressive CAs have <b>automated this validation process, but rely on insecure methods</b> (HTTP and DNS) to validate domain ownership.</p></li></ul><p>Without ubiquitous cryptography in place (DNSSEC may never reach 100% deployment), there is <b>no completely secure way to bootstrap this system</b>. So, let’s look at how to distribute trust using other methods.</p><p><b>One tool at our disposal is the distributed nature of the Cloudflare network.</b></p><p>Cloudflare is global. We have locations all over the world connected to dozens of networks. That means we have different <i>vantage points</i>, resulting in different ways to traverse networks. This diversity can prove to be an advantage when dealing with BGP hijacking, since <b>an attacker would have to hijack multiple routes from multiple locations to affect all the traffic between Cloudflare and other distributed parts of the Internet.</b> The natural diversity of the network raises the cost of the attacks.</p><p>Using a distributed set of connections to the Internet as a quorum is a powerful way to distribute trust, with or without cryptography.</p>
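<p>That quorum idea can be sketched in a few lines: run the same validation lookup from several vantage points and accept the answer only if a supermajority agrees. The vantage names and the two-thirds threshold here are illustrative assumptions, not Cloudflare's production logic:</p>

```python
from collections import Counter

def quorum_result(observations: dict[str, str], threshold: float = 2 / 3):
    """Accept a validation lookup only if enough vantage points agree.

    observations maps vantage-point name -> the DNS/HTTP value seen there.
    Returns the agreed value, or None when no value reaches the threshold
    (as happens when a hijack poisons the view from too many locations)."""
    value, count = Counter(observations.values()).most_common(1)[0]
    return value if count / len(observations) >= threshold else None

# An attacker who hijacks one route changes one vantage point's view,
# but the honest majority still carries the quorum.
print(quorum_result({"lhr": "6.6.6.6", "sfo": "1.2.3.4", "sin": "1.2.3.4"}))
```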
    <div>
      <h3>Distributed Trust</h3>
      <a href="#distributed-trust">
        
      </a>
    </div>
    <p>This idea of distributing the source of trust is powerful. Last year we announced the <b>Distributed Web Gateway</b> that</p><ul><li><p>Enables users to access content on the InterPlanetary File System (IPFS), a network structured to <b>reduce the trust placed in any single party.</b></p></li><li><p>Even if a participant of the network is compromised, <b>it can’t be used to distribute compromised content</b> because the network is content-addressed.</p></li><li><p>However, using content-based addressing is <b>not the only way to distribute trust between multiple independent parties.</b></p></li></ul><p>Another way to distribute trust is to literally <b>split authority between multiple independent parties</b>. <a href="/red-october-cloudflares-open-source-implementation-of-the-two-man-rule/">We’ve explored this topic before.</a> In the context of Internet services, this means ensuring that no single server can authenticate itself to a client on its own. For example,</p><ul><li><p>In HTTPS the server’s private key is the lynchpin of its security. Compromising the owner of the private key (by <a href="https://www.theguardian.com/world/2013/oct/03/lavabit-ladar-levison-fbi-encryption-keys-snowden">hook</a> or by <a href="https://www.symantec.com/connect/blogs/how-attackers-steal-private-keys-digital-certificates">crook</a>) <b>gives an attacker the ability to impersonate (spoof) that service</b>. This single point of failure <b>puts services at risk.</b> You can mitigate this risk by distributing the authority to authenticate the service between multiple independently-operated services.</p></li></ul>
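<p>The splitting idea in that last bullet can be illustrated with the simplest possible n-of-n scheme: XOR secret sharing, where every share is required to reconstruct the secret and any incomplete set of shares is statistically random. This is a toy sketch of the principle; real systems (including the Red October design linked above) use richer constructions:</p>

```python
import secrets

def split_secret(secret: bytes, n: int) -> list[bytes]:
    """Split a secret into n shares; ALL n are required to recombine.

    Any n-1 shares are uniformly random and reveal nothing on their own,
    so compromising a single share-holder gains the attacker nothing."""
    shares = [secrets.token_bytes(len(secret)) for _ in range(n - 1)]
    last = secret
    for share in shares:
        last = bytes(a ^ b for a, b in zip(last, share))
    return shares + [last]

def recombine(shares: list[bytes]) -> bytes:
    out = bytes(len(shares[0]))
    for share in shares:
        out = bytes(a ^ b for a, b in zip(out, share))
    return out
```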
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3wuX22HpjVAbiwcwEDojxB/df21a1462febcf64f1a613ab075a104a/TLS-server-compromise-1.png" />
            
            </figure><p>TLS doesn’t protect against server compromise</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/34nZukSMSjU2QDwNr5zkhf/dd5a6024a01c4dc0c6c6153e0205d91c/future-distributed-trust-copy-2-1.png" />
            
            </figure><p>With distributed trust, multiple parties combine to protect the connection</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/29B6lr8FV35nqA8Bw9TgtL/34fb894d8d74dc083424398f1e65e80f/future-distributed-trust.png" />
            
            </figure><p>An attacker that has compromised one of the servers cannot break the security of the system</p><p>The Internet is an old, slow-moving barge, and we’ve only been able to improve it through the meticulous process of patching it piece by piece. Another option is to build new secure systems on top of this insecure foundation. IPFS is doing this, and IPFS is not alone in its design. <b>There has been more research into secure systems with decentralized trust in the last ten years than ever before</b>.</p><p>The result is radical new protocols and designs that use exotic new algorithms. These protocols do not supplant those at the core of the Internet (like TCP/IP), but instead, they sit on top of the existing Internet infrastructure, enabling new applications, much like HTTP did for the web.</p>
    <div>
      <h3>Gaining Traction</h3>
      <a href="#gaining-traction">
        
      </a>
    </div>
    <p>Some of the most innovative technical projects were considered failures because <b>they couldn’t attract users.</b> New technology has to bring tangible benefits to users to sustain it: useful functionality, content, and a decent user experience. Distributed projects, such as IPFS and others, are gaining popularity, but have not found mass adoption. This is a chicken-and-egg problem. New protocols have a high barrier to entry—<b>users have to install new software</b>—and because of the small audience, there is less incentive to create compelling content. <b>Decentralization and distributed trust are nice security features to have, but they are not products</b>. Users still need to get some benefit out of using the platform.</p><p>An example of a system to break this cycle is the web. In 1992 the web was hardly a cornucopia of awesomeness. <b>What helped drive the dominance of the web was its users.</b></p><ul><li><p>The growth of the user base meant <b>more incentive for people to build services</b>, and the availability of more services attracted more users. It was a virtuous cycle.</p></li><li><p>It’s hard for a platform to gain momentum, but once the cycle starts, a flywheel effect kicks in to help the platform grow.</p></li></ul><p>The <a href="https://www.cloudflare.com/distributed-web-gateway/">Distributed Web Gateway</a> project Cloudflare launched last year in Crypto Week is our way of exploring what happens if we try to kickstart that flywheel. By providing a secure, reliable, and fast interface from the classic web with its two billion users to the content on the distributed web, we give the fledgling ecosystem an audience.</p><ul><li><p><b>If the advantages provided by building on the distributed web are appealing to users, then the larger audience will help these services grow in popularity</b>.</p></li><li><p>This is somewhat reminiscent of how IPv6 gained adoption. 
It started as a niche technology only accessible using IPv4-to-IPv6 translation services.</p></li><li><p>IPv6 adoption has now <a href="https://www.internetsociety.org/resources/2018/state-of-ipv6-deployment-2018/">grown so much</a> that it is becoming a requirement for new services. For example, <b>Apple is</b> <a href="https://developer.apple.com/support/ipv6/"><b>requiring</b></a> <b>that all apps work in IPv6-only contexts.</b></p></li></ul><p>Eventually, as user-side implementations of distributed web technologies improve, people may move to using the distributed web natively rather than through an HTTP gateway. Or they may not! By leveraging Cloudflare’s global network to <b>give users access to new technologies based on distributed trust, we give these technologies a better chance at gaining adoption.</b></p>
    <div>
      <h3>Happy Crypto Week</h3>
      <a href="#happy-crypto-week">
        
      </a>
    </div>
    <p>At Cloudflare, we always support new technologies that help make the Internet better. Part of helping make a better Internet is scaling the systems of trust that underpin web browsing and protect them from attack. We provide the tools to create better systems of assurance with fewer points of vulnerability. We work with academic security researchers to get a vision of the future and engineer away vulnerabilities before they can become widespread. It’s a constant journey.</p><p>Cloudflare knows that none of this is possible without the work of researchers. From award-winning researchers publishing papers in top journals to the blog posts of clever hobbyists, dedicated and curious people are moving the state of knowledge of the world forward. However, the push to publish new and novel research sometimes holds researchers back from committing enough time and resources to fully realize their ideas. Great research can be powerful on its own, but it can have an even broader impact when combined with practical applications. We relish the opportunity to stand on the shoulders of these giants and use our engineering know-how and global reach to expand on their work to help build a better Internet.</p><p>So, to all of you dedicated researchers, <b>thank you for your work!</b> Crypto Week is yours as much as ours. If you’re working on something interesting and you want help to bring the results of your research to the broader Internet, please contact us at <a>ask-research@cloudflare.com</a>. We want to help you realize your dream of making the Internet safe and trustworthy.</p><p>If you're a research-oriented <a href="https://boards.greenhouse.io/cloudflare/jobs/1346216">engineering manager</a> or student, we're also hiring in <a href="https://boards.greenhouse.io/cloudflare/jobs/1025810">London</a> and <a href="https://boards.greenhouse.io/cloudflare/jobs/608495">San Francisco</a>!</p> ]]></content:encoded>
            <category><![CDATA[Crypto Week]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[BGP]]></category>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[RPKI]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[HTTPS]]></category>
            <guid isPermaLink="false">2Cs84t1yRSnIXcoIIszCGj</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Welcome to Crypto Week]]></title>
            <link>https://blog.cloudflare.com/crypto-week-2018/</link>
            <pubDate>Mon, 17 Sep 2018 13:00:00 GMT</pubDate>
            <description><![CDATA[ The Internet isn’t perfect. It was put together piecemeal through publicly funded research, private investment, and organic growth that has left us with an imperfect tapestry. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>The Internet is an amazing invention. We marvel at how it connects people, connects ideas, and makes the world smaller. But the Internet isn’t perfect. It was put together piecemeal through publicly funded research, private investment, and organic growth that has left us with an imperfect tapestry. It’s also evolving. People are constantly developing creative applications and finding new uses for existing Internet technology. Issues like privacy and security that were afterthoughts in the early days of the Internet are now supremely important. People are being tracked and monetized, websites and web services are being attacked in interesting new ways, and the fundamental system of trust the Internet is built on is showing signs of age. The Internet needs an upgrade, and one of the tools that can make things better is cryptography.</p><p>Every day this week, Cloudflare will be announcing support for a new technology that uses cryptography to make the Internet better (hint: <a href="/subscribe/">subscribe to the blog</a> to make sure you don't miss any of the news). Everything we are announcing this week is free to use and provides a meaningful step towards supporting a new capability or structural reinforcement. So why are we doing this? Because it’s good for the users and good for the Internet. Welcome to Crypto Week!</p>
    <div>
      <h3>Day 1: Distributed Web Gateway</h3>
      <a href="#day-1-distributed-web-gateway">
        
      </a>
    </div>
    <ul><li><p><a href="/distributed-web-gateway/">Cloudflare goes InterPlanetary - Introducing Cloudflare’s IPFS Gateway</a></p></li><li><p><a href="/e2e-integrity/">End-to-End Integrity with IPFS</a></p></li></ul>
    <div>
      <h3>Day 2: DNSSEC</h3>
      <a href="#day-2-dnssec">
        
      </a>
    </div>
    <ul><li><p><a href="/automatically-provision-and-maintain-dnssec/">Expanding DNSSEC Adoption</a></p></li></ul>
    <div>
      <h3>Day 3: RPKI</h3>
      <a href="#day-3-rpki">
        
      </a>
    </div>
    <ul><li><p><a href="/rpki/">RPKI - The required cryptographic upgrade to BGP routing</a></p></li><li><p><a href="/rpki-details/">RPKI and BGP: our path to securing Internet Routing</a></p></li></ul>
    <div>
      <h3>Day 4: Onion Routing</h3>
      <a href="#day-4-onion-routing">
        
      </a>
    </div>
    <ul><li><p><a href="/cloudflare-onion-service/">Introducing the Cloudflare Onion Service</a></p></li></ul>
    <div>
      <h3>Day 5: Roughtime</h3>
      <a href="#day-5-roughtime">
        
      </a>
    </div>
    <ul><li><p><a href="/roughtime/">Roughtime: Securing Time with Digital Signatures</a></p></li></ul>
    <div>
      <h2>A more trustworthy Internet</h2>
      <a href="#a-more-trustworthy-internet">
        
      </a>
    </div>
    <p>Everything we do online depends on a relationship between users, services, and networks that is supported by some sort of trust mechanism. These relationships can be physical (I plug my router into yours), contractual (I paid a <a href="https://www.cloudflare.com/learning/dns/glossary/what-is-a-domain-name-registrar/">registrar</a> for this <a href="https://www.cloudflare.com/learning/dns/glossary/what-is-a-domain-name/">domain name</a>), or reliant on a trusted third party (I sent a message to my friend on iMessage via Apple). The simple act of visiting a website involves hundreds of trust relationships, some explicit and some implicit. The sheer size of the Internet and number of parties involved make trust online incredibly complex. Cryptography is a tool that can be used to encode, enforce, and, most importantly, scale these trust relationships.</p><p>To illustrate this, let’s break down what happens when you visit a website. But before we can do this, we need to know the jargon.</p><ul><li><p><b>Autonomous Systems (100 thousand or so active)</b>: An AS corresponds to a network provider connected to the Internet. Each has a unique Autonomous System Number (ASN).</p></li><li><p><b>IP ranges (1 million or so)</b>: Each AS is assigned a set of numbers called IP addresses. Each of these IP addresses can be used by the AS to identify a computer on its network when connecting to other networks on the Internet. These addresses are assigned by the Regional Internet Registries (RIRs), of which there are five. Data sent from one IP address to another hops from one AS to another based on a “route” that is determined by a protocol called BGP.</p></li><li><p><b>Domain names (&gt;1 billion)</b>: Domain names are the human-readable names that correspond to Internet services (like “cloudflare.com” or “mail.google.com”). 
These Internet services are accessed via the Internet by connecting to their IP address, which can be obtained from their domain name via the <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">Domain Name System (DNS)</a>.</p></li><li><p><b>Content (infinite)</b>: The main use case of the Internet is to enable the transfer of specific pieces of data from one point on the network to another. This data can be of any form or type.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7wDRty2HT3gbpxOlTNAgm5/98e8e5b4168678a79bb091946813a845/name-to-asn-_3.5x.png" />
            
            </figure><p>When you type a website such as blog.cloudflare.com into your browser, a number of things happen. First, a (recursive) DNS service is contacted to get the IP address of the site. This DNS server is typically configured by your ISP when you connect to the Internet, or it can be a public service such as 1.1.1.1 or 8.8.8.8. A query to the DNS service travels from network to network along a path determined by BGP announcements. If the recursive DNS server does not know the answer to the query, then it contacts the appropriate authoritative DNS services, starting with a root DNS server, down to a top-level domain server (such as com or org), down to the DNS server that is authoritative for the domain. Once the DNS query has been answered, the browser sends an HTTP request to the site’s IP address (traversing a sequence of networks), and in response, the server sends the content of the website.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2JHDdXKcnw2lce7hUVVvjP/a76212811897165d5e081bdc56be32a4/colorful-crypto-overview--copy-3_3.5x.png" />
            
            </figure><p>So what’s the problem with this picture? For one, every DNS query and every network hop needs to be trusted in order to trust the content of the site. Any DNS query could be modified, a network could advertise an IP that belongs to another network, and any machine along the path could modify the content. When the Internet was small, there were mechanisms to combat this sort of subterfuge. Network operators had a personal relationship with each other and could punish bad behavior, but given the number of networks in existence (<a href="https://www.cidr-report.org/as2.0/autnums.html">almost 400,000 as of this week</a>), this is becoming difficult to scale.</p><p>Cryptography is a tool that can encode these trust relationships and make the whole system reliant on hard math rather than physical handshakes and hopes.</p>
    <div>
      <h3>Building a taller tower of turtles</h3>
      <a href="#building-a-taller-tower-of-turtles">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4thrwRrwuZ1ctzlstdLLqv/4e0c8c5b89d4faa9f8ff3c5d76251f01/turtles.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/wwarby/2499825928">Attribution 2.0 Generic (CC BY 2.0)</a></p><p>The two main tools that cryptography provides to help solve this problem are cryptographic hashes and digital signatures.</p><p>A hash function is a way to take any piece of data and transform it into a fixed-length string of data, called a digest or hash. A hash function is considered cryptographically strong if it is computationally infeasible (read: very hard) to find two inputs that result in the same digest, and if changing even one bit of the input results in a completely different digest. The most popular hash function that is considered secure is SHA-256, which has 256-bit outputs. For example, the SHA-256 hash of the word “crypto” is</p><p><code>DA2F073E06F78938166F247273729DFE465BF7E46105C13CE7CC651047BF0CA4</code></p><p>And the SHA-256 hash of “crypt0” is</p><p><code>7BA359D3742595F38347A0409331FF3C8F3C91FF855CA277CB8F1A3A0C0829C4</code></p><p>The other main tool is digital signatures. A digital signature is a value that can only be computed by someone with a private key, but can be verified by anyone with the corresponding public key. Digital signatures are a way for a private key holder to “sign,” or attest to, the authenticity of a specific message in a way that anyone can validate.</p><p>These two tools can be combined to solidify the trust relationships on the Internet. By giving private keys to the trusted parties who are responsible for defining the relationships between ASes, IPs, domain names, and content, you can create chains of trust that can be publicly verified. Rather than hope and pray, these relationships can be validated in real time at scale.</p><p>Let’s take our webpage loading example and see where digital signatures can be applied.</p><p><b>Routing</b>. Time-bound delegation of trust is defined through a system called the RPKI. 
RPKI defines an object called a Resource Certificate, an attestation that states that a given IP range belongs to a specific ASN for a given period of time, digitally signed by the RIR responsible for assigning the IP range. Networks share routes via BGP, and if a route is advertised for an IP that does not conform to the Resource Certificate, the network can choose not to accept it.</p>
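<p>As a sketch of what this buys a network: given the attestations published through the RPKI (the origin-authorization data derived from Resource Certificates), a router can classify any BGP announcement it hears. The data below is hypothetical and the signature-checking step is omitted; this only models the final origin-validation decision (RFC 6811 calls the outcomes valid, invalid, and not-found).</p>

```python
from ipaddress import ip_network

# Hypothetical authorization data: (prefix, max allowed length, ASN).
# Real RPKI objects are digitally signed by the RIRs; here we assume
# the signatures have already been verified.
roas = [
    (ip_network("192.0.2.0/24"), 24, 64496),
    (ip_network("198.51.100.0/22"), 24, 64497),
]

def bgp_origin_validity(prefix_str: str, origin_asn: int) -> str:
    """Classify a BGP announcement against the authorization data."""
    prefix = ip_network(prefix_str)
    covered = False
    for roa_prefix, max_len, asn in roas:
        if prefix.subnet_of(roa_prefix):
            covered = True
            # Valid only if the origin AS matches and the announced
            # prefix is no more specific than the ROA allows.
            if prefix.prefixlen <= max_len and origin_asn == asn:
                return "valid"
    return "invalid" if covered else "unknown"
```

A router that drops "invalid" announcements can no longer be tricked by a network advertising an IP range that belongs to someone else.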
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2GgcQBjG2ecuqivy9h7SCe/f6bea1c0abb365d2cab805c221d00ae6/roas_3x.png" />
            
            </figure><p><b>DNS.</b> Adding cryptographic assurance to routing is powerful, but if a network adversary can change the content of the data (such as the DNS responses), then the system is still at risk. DNSSEC is a system built to provide a trusted link between names and IP addresses. The root of trust in DNSSEC is the DNS root key, which is managed with an <a href="https://www.iana.org/dnssec/ceremonies">elaborate signing ceremony</a>.</p><p><b>HTTPS.</b> When you connect to a site, not only do you want it to be coming from the right host, you also want the content to be private. The Web PKI is a system that issues certificates to sites, allowing a site to bind its domain name to a key pair for a bounded period of time. Because there are many CAs, additional accountability systems like <a href="/introducing-certificate-transparency-and-nimbus/">certificate transparency</a> need to be involved to help keep the system in check.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6oSYpHOMRKf2EtzXWvcTbE/5b1100ad17540b9946cdbf91604a72fc/connection-to-asn_3.5x.png" />
            
            </figure><p>This cryptographic scaffolding turns the Internet into an encoded system of trust. With these systems in place, Internet users no longer need to trust every network and party involved in this diagram; they only need to trust the RIRs, DNSSEC, and the CAs (and know the correct time).</p><p>This week we’ll be making some announcements that help strengthen this system of accountability.</p>
    <div>
      <h2>Privacy and integrity with friends</h2>
      <a href="#privacy-and-integrity-with-friends">
        
      </a>
    </div>
    <p>The Internet is great because it connects us to each other, but the details of how it connects us are important. The technical choices made when the Internet was designed come with some interesting human implications.</p><p>One implication is <b>trackability</b>. Your IP address is contained on every packet you send over the Internet. This acts as a unique identifier for anyone (corporations, governments, etc.) to track what you do online. Furthermore, if you connect to a server, that server’s identity is sent in plaintext on the request <b>even over HTTPS</b>, revealing your browsing patterns to any intermediary who cares to look.</p><p>Another implication is <b>malleability</b>. Resources on the Internet are defined by <i>where</i> they are, not <i>what</i> they are. If you want to go to CNN or BBC, then you connect to the server for cnn.com or bbc.co.uk and validate the certificate to make sure it’s the right site. But once you’ve made the connection, there’s no good way to know that the actual content is what you expect it to be. If the server is hacked, it could send you anything, including dangerous malicious code. HTTPS is a secure pipe, but there’s no inherent way to make sure what gets sent through the pipe is what you expect.</p><p>Trackability and malleability are not inherent features of interconnectedness. It is possible to design networks that don’t have these downsides. It is also possible to build new networks with better characteristics on top of the existing Internet. The key ingredient is cryptography.</p>
    <div>
      <h3>Tracking-resilient networking</h3>
      <a href="#tracking-resilient-networking">
        
      </a>
    </div>
    <p>One of the networks built on top of the Internet that provides good privacy properties is Tor. The Tor network is run by a group of users who allow their computers to be used to route traffic for other members of the network. Using cryptography, it is possible to route traffic from one place to another without points along the path knowing both the source and the destination at the same time. This is called <a href="https://en.wikipedia.org/wiki/Onion_routing">onion routing</a> because it involves multiple layers of encryption, like an onion. Traffic coming out of the Tor network is “anonymous” because it could have come from anyone connected to the network. Everyone just blends in, making it hard to track individuals.</p>
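<p>The layering idea can be illustrated in a few lines of Python. This is a toy: the “cipher” below is a SHA-256-based XOR keystream standing in for the real per-hop encryption (Tor actually uses a proper block cipher in counter mode inside its relay cells, with keys negotiated via real key exchanges), but the idea of peeling one layer per relay is the same.</p>

```python
import hashlib

def xor_layer(key: bytes, data: bytes) -> bytes:
    # Toy XOR keystream derived from SHA-256; applying it twice with
    # the same key removes the layer (XOR is its own inverse).
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))

# One key per relay on the circuit: guard, middle, exit.
relay_keys = [b"guard-key", b"middle-key", b"exit-key"]
message = b"hello from the client"

# The client wraps the message in one layer per relay.
onion = message
for key in reversed(relay_keys):
    onion = xor_layer(key, onion)

# As the cell travels the circuit, each relay peels exactly one layer.
# No single relay sees both the source and the final plaintext's
# destination at the same time.
peeled = onion
for key in relay_keys:
    peeled = xor_layer(key, peeled)
```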
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/68kxhYzrB1vuZ1FogJ2MgK/e11e433e5f39a95cd6ef9ed4009b5c8b/Tor-Onion-Cloudflare.png" />
            
            </figure><p>Similarly, web services can use onion routing to serve content inside the Tor network without revealing their location to visitors. Instead of using a hostname to identify their network location, so-called onion services use a cryptographic public key as their address. There are hundreds of onion services in use, including the one <a href="/welcome-hidden-resolver/">we use for 1.1.1.1</a> or the one in <a href="https://en.wikipedia.org/wiki/Facebookcorewwwi.onion">use by Facebook</a>.</p><p>Troubles occur at the boundary between the Tor network and the rest of the Internet. This is especially true for users attempting to access services that rely on abuse prevention mechanisms based on reputation. Since Tor is used by both privacy-conscious users and malicious bots, connections from both get lumped together, and, as the expression goes, one bad apple ruins the bunch. This unfortunately exposes legitimate visitors to anti-abuse mechanisms like CAPTCHAs. Tools like <a href="/cloudflare-supports-privacy-pass/">Privacy Pass</a> help reduce this burden but don’t eliminate it completely. This week we’ll be announcing a new way to improve this situation.</p>
    <div>
      <h3>Bringing integrity to content delivery</h3>
      <a href="#bringing-integrity-to-content-delivery">
        
      </a>
    </div>
    <p>Let’s revisit the issue of malleability: the fact that you can’t always trust the other side of a connection to send you the content you expect. There are technologies that allow users to ensure the integrity of content without trusting the server. One such technology is a feature of HTML called <a href="https://www.w3.org/TR/SRI/">Subresource Integrity (SRI)</a>. SRI allows a webpage with sub-resources (like a script or stylesheet) to embed a unique cryptographic hash into the page so that when the sub-resource is loaded, it is checked to see that it matches the expected value. This protects the site from loading unexpected scripts from third parties, <a href="/an-introduction-to-javascript-based-ddos/">a known attack vector</a>.</p><p>Another idea is to flip this on its head: what if instead of fetching a piece of content from a specific location on the network, you asked the network to find a piece of content that matches a given hash? By assigning resources based on their actual content rather than by location, it’s possible to create a network in which you can fetch content from anywhere on the network and still know it’s authentic. This idea is called <i>content addressing</i> and there are networks built on top of the Internet that use it. These content addressed networks, based on protocols such as <a href="https://ipfs.io/">IPFS</a> and <a href="https://datproject.org/">DAT</a>, are blazing a trail for a new trend in Internet applications called the Distributed Web. With Distributed Web applications, malleability is no longer an issue, opening up a new set of possibilities.</p>
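<p>Computing an SRI value is straightforward: hash the exact bytes of the resource and base64-encode the digest. A minimal sketch in Python (the SRI spec allows sha256, sha384, and sha512; sha384 is the common choice, and the script content here is just an example):</p>

```python
import base64
import hashlib

def sri_integrity(resource: bytes, alg: str = "sha384") -> str:
    # The integrity attribute format is "<alg>-<base64 digest>".
    digest = hashlib.new(alg, resource).digest()
    return alg + "-" + base64.b64encode(digest).decode()

script = b'console.log("hello");'
integrity = sri_integrity(script)
# Embedded in the page roughly as:
#   <script src="https://example.com/app.js"
#           integrity="sha384-..." crossorigin="anonymous"></script>
```

If a third party ever swaps out the script, its hash no longer matches the integrity value baked into the page, and the browser refuses to run it.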
    <div>
      <h3>Combining strengths</h3>
      <a href="#combining-strengths">
        
      </a>
    </div>
    <p>Networks based on cryptographic principles, like Tor and IPFS, have one major downside compared to networks based on names: usability. Humans are exceptionally bad at remembering or distinguishing between cryptographically-relevant numbers. Take, for instance, the New York Times’ onion address:</p><p><code>https://www.nytimes3xbfgragh.onion/</code></p><p>This could easily be confused with similar-looking onion addresses, such as</p><p><code>https://www.nytimes3xfkdbgfg.onion/</code></p><p>which may be controlled by a malicious actor.</p><p>Content addressed networks are even worse from the perspective of regular people. For example, there is a snapshot of the Turkish version of Wikipedia on IPFS with the hash:</p><p><code>QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX</code></p><p>Try typing this hash into your browser without making a mistake.</p><p>These naming issues are things Cloudflare is perfectly positioned to help solve. First, by putting the hash address of an IPFS site in the DNS (and adding DNSSEC for trust) you can give your site a traditional hostname while maintaining a chain of trust.</p><p>Second, by enabling browsers to use a traditional DNS name to access the web through onion services, you can provide safer access to your site for Tor users, with the added benefit of being better able to distinguish between bots and humans. With Cloudflare as the glue, it is possible to connect both standard Internet and Tor users to websites and services on both the traditional web and the distributed web.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/rzDTSMsIy1c5frGel8Eff/ea82dba88874f5772d6f7bdc3cc54776/bowtie-diagram-crypto-week-2018-v02_medium-1.gif" />
            
            </figure><p>This is the promise of Crypto Week: using cryptographic guarantees to make a stronger, more trustworthy, and more private Internet without sacrificing usability.</p>
    <div>
      <h2>Happy Crypto Week</h2>
      <a href="#happy-crypto-week">
        
      </a>
    </div>
    <p>In conclusion, we’re working on many cutting-edge technologies based on cryptography and applying them to <a href="https://blog.cloudflare.com/50-years-of-the-internet-work-in-progress-to-a-better-internet/">make the Internet better</a>. The first announcement today is the launch of Cloudflare's <a href="/distributed-web-gateway/">Distributed Web Gateway</a> and <a href="/e2e-integrity/">browser extension</a>. Keep tabs on the Cloudflare blog for exciting updates as the week progresses.</p><p>I’m very proud of the team’s work on Crypto Week, which was made possible by a dedicated group of people, including several brilliant interns. If this type of work is interesting to you, Cloudflare is hiring for the <a href="https://boards.greenhouse.io/cloudflare/jobs/634967?gh_jid=634967">crypto team</a> and <a href="https://www.cloudflare.com/careers/">others</a>!</p> ]]></content:encoded>
            <category><![CDATA[Crypto Week]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[IPFS]]></category>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[Tor]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">2NzZFGM5fxcJ3xnCx2v7jD</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[A Detailed Look at RFC 8446 (a.k.a. TLS 1.3)]]></title>
            <link>https://blog.cloudflare.com/rfc-8446-aka-tls-1-3/</link>
            <pubDate>Fri, 10 Aug 2018 23:00:00 GMT</pubDate>
            <description><![CDATA[ TLS 1.3 (RFC 8446) was published today. This article provides a deep dive into the changes introduced in TLS 1.3 and its impact on the future of internet security. ]]></description>
            <content:encoded><![CDATA[ <p>For the last five years, the Internet Engineering Task Force (IETF), the standards body that defines internet protocols, has been working on standardizing the latest version of one of its most important security protocols: Transport Layer Security (TLS). TLS is used to secure the web (and much more!), providing encryption and ensuring the authenticity of every HTTPS website and API. The latest version of TLS, TLS 1.3 (<a href="https://www.rfc-editor.org/rfc/pdfrfc/rfc8446.txt.pdf">RFC 8446</a>), was published today. It is the first major overhaul of the protocol, bringing significant security and performance improvements. This article provides a deep dive into the changes introduced in TLS 1.3 and its impact on the future of internet security.</p>
    <div>
      <h3>An evolution</h3>
      <a href="#an-evolution">
        
      </a>
    </div>
    <p>One major way Cloudflare provides <a href="https://www.cloudflare.com/application-services/solutions/api-security/">security</a> is by supporting HTTPS for websites and web services such as APIs. With HTTPS (the “S” stands for secure), the communication between your browser and the server travels over an encrypted and authenticated channel. Serving your content over HTTPS instead of HTTP provides confidence to the visitor that the content they see is presented by the legitimate content owner and that the communication is safe from eavesdropping. This is a big deal in a world where online privacy is more important than ever.</p><p>The machinery under the hood that makes HTTPS secure is a protocol called TLS. It has its roots in a protocol called Secure Sockets Layer (SSL) developed in the mid-nineties at Netscape. By the end of the 1990s, Netscape handed SSL over to the IETF, who renamed it TLS and have been the stewards of the protocol ever since. Many people still refer to web encryption as SSL, even though the vast majority of services have switched over to supporting TLS only. The term SSL continues to have popular appeal and Cloudflare has kept the term alive through product names like <a href="/keyless-ssl-the-nitty-gritty-technical-details/">Keyless SSL</a> and <a href="/introducing-universal-ssl/">Universal SSL</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/59tFn3me3Oe6OcjT24CqYF/22a662ccc88b06adc516449b8e2be657/image5.png" />
            
            </figure><p>In the IETF, protocol specifications are published as documents called RFCs. TLS 1.0 was RFC 2246, TLS 1.1 was RFC 4346, and TLS 1.2 was RFC 5246. Today, TLS 1.3 was published as RFC 8446. Although RFCs are generally numbered in the order they are published, keeping 46 as part of each TLS RFC number is a nice touch.</p>
    <div>
      <h3>TLS 1.2 wears parachute pants and shoulder pads</h3>
      <a href="#tls-1-2-wears-parachute-pants-and-shoulder-pads">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5p8wOtJF3L8LEprZDoO0z7/c4742d6066b73c33e4f5e98afddb83ff/image11.jpg" />
            
            </figure><p><a href="https://memegenerator.net/Mc-Hammer-Pants">MC Hammer</a>, like SSL, was popular in the 90s</p><p>Over the last few years, TLS has seen its fair share of problems. First of all, there have been problems with the code that implements TLS, including <a href="/answering-the-critical-question-can-you-get-private-ssl-keys-using-heartbleed/">Heartbleed</a>, <a href="https://www.imperialviolet.org/2014/09/26/pkcs1.html">BERserk</a>, <a href="https://gotofail.com/">goto fail;</a>, and more. These issues are not fundamental to the protocol and mostly resulted from a lack of testing. Tools like <a href="https://github.com/RUB-NDS/TLS-Attacker">TLS Attacker</a> and <a href="https://security.googleblog.com/2016/12/project-wycheproof.html">Project Wycheproof</a> have helped improve the robustness of TLS implementations, but the more challenging problems faced by TLS have had to do with the protocol itself.</p><p>TLS was designed by engineers using tools from mathematicians. Many of the early design decisions from the days of SSL were made using heuristics and an incomplete understanding of how to design robust security protocols. That said, this isn’t the fault of the protocol designers (Paul Kocher, Phil Karlton, Alan Freier, Tim Dierks, Christopher Allen and others), as the entire industry was still learning how to do this properly. When TLS was designed, formal papers on the design of secure authentication protocols like Hugo Krawczyk’s landmark <a href="http://webee.technion.ac.il/~hugo/sigma-pdf.pdf">SIGMA</a> paper were still years away. TLS was 90s crypto: it meant well and seemed cool at the time, but the modern cryptographer’s design palette has moved on.</p><p>Many of the design flaws were discovered using <a href="https://en.wikipedia.org/wiki/Formal_verification">formal verification</a>. Academics attempted to prove certain security properties of TLS, but instead found counter-examples that were turned into real vulnerabilities. 
These weaknesses range from the purely theoretical (<a href="https://access.redhat.com/articles/2112261">SLOTH</a> and <a href="https://eprint.iacr.org/2018/298.pdf">CurveSwap</a>), to feasible for highly resourced attackers (<a href="https://weakdh.org/imperfect-forward-secrecy-ccs15.pdf">WeakDH</a>, <a href="/logjam-the-latest-tls-vulnerability-explained/">LogJam</a>, <a href="https://censys.io/blog/freak">FREAK</a>, <a href="https://nakedsecurity.sophos.com/2016/08/25/anatomy-of-a-cryptographic-collision-the-sweet32-attack/">SWEET32</a>), to practical and dangerous (<a href="https://en.wikipedia.org/wiki/POODLE">POODLE</a>, <a href="https://robotattack.org/">ROBOT</a>).</p>
    <div>
      <h3>TLS 1.2 is slow</h3>
      <a href="#tls-1-2-is-slow">
        
      </a>
    </div>
    <p>Encryption has always been important online, but historically it was only used for things like logging in or sending credit card information, leaving most other data exposed. There has been a major trend in the last few years towards using HTTPS for all traffic on the Internet. This has the positive effect of protecting more of what we do online from eavesdroppers and <a href="/an-introduction-to-javascript-based-ddos/">injection attacks</a>, but has the downside that new connections get a bit slower.</p><p>For a browser and web server to agree on a key, they need to exchange cryptographic data. The exchange, called the “handshake” in TLS, has remained largely unchanged since TLS was standardized in 1999. The handshake requires two additional round-trips between the browser and the server before encrypted data can be sent (or one when resuming a previous connection). The additional cost of the TLS handshake for HTTPS results in a noticeable hit to latency compared to HTTP alone. This additional delay can negatively impact performance-focused applications.</p>
    <div>
      <h3>Defining TLS 1.3</h3>
      <a href="#defining-tls-1-3">
        
      </a>
    </div>
    <p>Unsatisfied with the outdated design of TLS 1.2 and its two-round-trip overhead, the IETF set about defining a new version of TLS. In August 2013, Eric Rescorla laid out a <a href="https://www.ietf.org/proceedings/87/slides/slides-87-tls-5.pdf">wishlist of features</a> for the new protocol.</p><p>After <a href="https://www.ietf.org/mail-archive/web/tls/current/msg20938.html">some debate</a>, it was decided that this new version of TLS was to be called TLS 1.3. The main issues that drove the design of TLS 1.3 were mostly the same as those presented five years ago:</p><ul><li><p>reducing handshake latency</p></li><li><p>encrypting more of the handshake</p></li><li><p>improving resiliency to cross-protocol attacks</p></li><li><p>removing legacy features</p></li></ul><p>The specification was shaped by volunteers through an open design process, and after four years of diligent work and vigorous debate, TLS 1.3 is now in its final form: RFC 8446. As adoption increases, the new protocol will make the internet both faster and more secure.</p><p>In this blog post I will focus on the two main advantages TLS 1.3 has over previous versions: security and performance.</p>
    <div>
      <h3>Trimming the hedges</h3>
      <a href="#trimming-the-hedges">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57PQK3ofbneOYgOmlwNm4Y/e5c3c319903504002330efec0fc06db2/image10.jpg" />
            
            </figure><p><a href="https://commons.wikimedia.org/wiki/File:Williton_Highbridge_Nursery_topiary_garden.jpg">Creative Commons Attribution-Share Alike 3.0</a></p><p>In the last two decades, we as a society have learned a lot about how to write secure cryptographic protocols. The parade of cleverly-named attacks from POODLE to Lucky13 to SLOTH to LogJam showed that even TLS 1.2 contains antiquated ideas from the early days of cryptographic design. One of the design goals of TLS 1.3 was to correct previous mistakes by removing potentially dangerous design elements.</p>
    <div>
      <h4>Fixing key exchange</h4>
      <a href="#fixing-key-exchange">
        
      </a>
    </div>
    <p>TLS is a so-called “hybrid” cryptosystem. This means it uses both symmetric key cryptography (encryption and decryption keys are the same) and public key cryptography (encryption and decryption keys are different). Hybrid schemes are the predominant form of encryption used on the Internet and are used in <a href="https://en.wikipedia.org/wiki/Secure_Shell">SSH</a>, <a href="https://en.wikipedia.org/wiki/IPsec">IPsec</a>, <a href="https://en.wikipedia.org/wiki/Signal_Protocol">Signal</a>, <a href="https://www.wireguard.com/">WireGuard</a> and other protocols. In hybrid cryptosystems, public key cryptography is used to establish a shared secret between both parties, and the shared secret is used to create symmetric keys that can be used to encrypt the data exchanged.</p><p>As a general rule, public key crypto is slow and expensive (microseconds to milliseconds per operation) and symmetric key crypto is fast and cheap (nanoseconds per operation). Hybrid encryption schemes let you send a lot of encrypted data with very little overhead by only doing the expensive part once. Much of the work in TLS 1.3 has been about improving the part of the handshake where public keys are used to establish symmetric keys.</p>
    <div>
      <h4>RSA key exchange</h4>
      <a href="#rsa-key-exchange">
        
      </a>
    </div>
    <p>The public key portion of TLS is about establishing a shared secret. There are two main ways of doing this with public key cryptography. The simpler way is with public-key encryption: one party encrypts the shared secret with the other party’s public key and sends it along. The other party then uses its private key to decrypt the shared secret and ... voila! They both share the same secret. This technique was discovered in 1977 by Rivest, Shamir and Adleman and is called RSA key exchange. In TLS’s RSA key exchange, the shared secret is decided by the client, who then encrypts it to the server’s public key (extracted from the certificate) and sends it to the server.</p>
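<p>A toy version of this flow, with comically small primes and none of the padding that real deployments require (textbook RSA like this is completely insecure; it only shows the shape of the exchange):</p>

```python
# Server's RSA key pair (toy-sized; real keys are 2048+ bits).
p, q = 61, 53
n = p * q                          # public modulus
e = 17                             # public exponent
d = pow(e, -1, (p - 1) * (q - 1))  # private exponent

# The client chooses the shared secret and encrypts it to the
# server's public key (n, e), as in TLS RSA key exchange.
shared_secret = 42
ciphertext = pow(shared_secret, e, n)

# The server recovers the secret with its private key. Anyone who
# later steals d can do the same to a recorded ciphertext -- which
# is exactly why this mode is not forward secret.
recovered = pow(ciphertext, d, n)
```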
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6vFfpyW3dU5vLUnbrzOGk8/5b9c70ff88d0da6d3210fc37f13e8184/image4.png" />
            
            </figure><p>The other form of key exchange available in TLS is based on another form of public-key cryptography, invented by Diffie and Hellman in 1976, so-called Diffie-Hellman key agreement. In Diffie-Hellman, the client and server both start by creating a public-private key pair. They then send the public portion of their key share to the other party. When each party receives the public key share of the other, they combine it with their own private key and end up with the same value: the pre-main secret. The server then uses a digital signature to ensure the exchange hasn’t been tampered with. This key exchange is called “ephemeral” if the client and server both choose a new key pair for every exchange.</p>
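<p>The exchange itself fits in a few lines. The prime below is toy-sized for readability (real TLS uses named finite-field groups of 2048 bits or more, or elliptic-curve groups), and the final hash stands in for the HKDF-based key schedule TLS 1.3 actually uses:</p>

```python
import hashlib
import secrets

p = 2**64 - 59   # a toy prime; far too small for real use
g = 2

# Each side generates a fresh ("ephemeral") key pair per handshake,
# which is what makes the resulting keys forward secret.
client_priv = secrets.randbelow(p - 2) + 1
server_priv = secrets.randbelow(p - 2) + 1
client_share = pow(g, client_priv, p)   # sent to the server
server_share = pow(g, server_priv, p)   # sent to the client

# Each side combines the peer's public share with its own private key.
client_secret = pow(server_share, client_priv, p)
server_secret = pow(client_share, server_priv, p)

# Both arrive at the same value, from which symmetric keys are derived.
key = hashlib.sha256(client_secret.to_bytes(8, "big")).digest()
```

Because both private keys are thrown away after the handshake, recording the traffic and later compromising the server yields nothing: there is no long-term key that decrypts the conversation.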
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6tjgSGvMVdzh3LZvt1HZT1/98031ef05fdc4353af60d27062fdb67a/image3.png" />
            
            </figure><p>Both modes result in the client and server having a shared secret, but RSA mode has a serious downside: it’s not <a href="/staying-on-top-of-tls-attacks/">forward secret</a>. That means that if someone records the encrypted conversation and then gets ahold of the RSA private key of the server, they can decrypt the conversation. This even applies if the conversation was recorded and the key is obtained some time well into the future. In a world where national governments are recording encrypted conversations and using exploits like <a href="https://en.wikipedia.org/wiki/Heartbleed">Heartbleed</a> to steal private keys, this is a realistic threat.</p><p>RSA key exchange has been problematic for some time, and not just because it’s not forward-secret. It’s also notoriously difficult to do correctly. In 1998, Daniel Bleichenbacher discovered a vulnerability in the way RSA encryption was done in SSL and created what’s called the “million-message attack,” which allows an attacker to perform an RSA private key operation with a server’s private key by sending a million or so well-crafted messages and looking for differences in the error codes returned. The attack has been refined over the years and in some cases only requires thousands of messages, making it feasible to do from a laptop. It was recently discovered that major websites (including facebook.com) were also vulnerable to a variant of Bleichenbacher’s attack called the <a href="https://robotattack.org/">ROBOT attack</a> as recently as 2017.</p><p>To reduce the risks caused by non-forward secret connections and million-message attacks, RSA encryption was removed from TLS 1.3, leaving ephemeral Diffie-Hellman as the only key exchange mechanism. Removing RSA key exchange brings other advantages, as we will discuss in the performance section below.</p>
    <div>
      <h4>Diffie-Hellman named groups</h4>
      <a href="#diffie-hellman-named-groups">
        
      </a>
    </div>
    <p>When it comes to cryptography, giving too many options leads to the wrong option being chosen. This principle is most evident when it comes to choosing Diffie-Hellman parameters. In previous versions of TLS, the choice of the Diffie-Hellman parameters was up to the participants. This resulted in some implementations choosing incorrectly, leading to vulnerable deployments. TLS 1.3 takes this choice away.</p><p>Diffie-Hellman is a powerful tool, but not all Diffie-Hellman parameters are “safe” to use. The security of Diffie-Hellman depends on the difficulty of a specific mathematical problem called the <a href="https://en.wikipedia.org/wiki/Discrete_logarithm">discrete logarithm problem</a>. If you can solve the discrete logarithm problem for a set of parameters, you can extract the private key and break the security of the protocol. Generally speaking, the bigger the numbers used, the harder it is to solve the discrete logarithm problem. So if you choose small DH parameters, you’re in trouble.</p><p>The LogJam and WeakDH attacks of 2015 showed that many TLS servers could be tricked into using small numbers for Diffie-Hellman, allowing an attacker to break the security of the protocol and decrypt conversations.</p><p>Diffie-Hellman also requires the parameters to have certain other mathematical properties. In 2016, Antonio Sanso found an <a href="http://arstechnica.com/security/2016/01/high-severity-bug-in-openssl-allows-attackers-to-decrypt-https-traffic/">issue in OpenSSL</a> where parameters were chosen that lacked the right mathematical properties, resulting in another vulnerability.</p><p>TLS 1.3 takes the opinionated route, restricting the Diffie-Hellman parameters to ones that are known to be secure. However, it still leaves several options; permitting only one option makes it difficult to update TLS in case these parameters are found to be insecure some time in the future.</p>
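<p>To see why small parameters are fatal, here is the discrete logarithm problem brute-forced for a deliberately tiny group. Scale the prime up to 2048 bits and this search becomes infeasible; shrink it (as LogJam coerced servers into doing with 512-bit “export” groups, using far better algorithms than brute force) and the private key falls out:</p>

```python
p, g = 1019, 2      # an absurdly small prime modulus and generator
secret = 347        # some party's private key
public = pow(g, secret, p)   # the value sent over the wire

def brute_force_dlog(target: int, g: int, p: int) -> int:
    # Try every exponent until g^x mod p matches the public value.
    for x in range(p):
        if pow(g, x, p) == target:
            return x
    raise ValueError("no solution")

recovered = brute_force_dlog(public, g, p)
```

With the private exponent recovered, an eavesdropper can compute the shared secret exactly as the legitimate peer would, and the protocol's security collapses.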
    <div>
      <h3>Fixing ciphers</h3>
      <a href="#fixing-ciphers">
        
      </a>
    </div>
    <p>The other half of a hybrid crypto scheme is the actual encryption of data. This is done by combining a message authentication code and a symmetric cipher for which each party knows the key. As I’ll describe, there are many ways to encrypt data, most of which are wrong.</p>
    <div>
      <h4>CBC mode ciphers</h4>
      <a href="#cbc-mode-ciphers">
        
      </a>
    </div>
    <p>In the last section we described TLS as a hybrid encryption scheme, with a public key part and a symmetric key part. The public key part is not the only one that has caused trouble over the years. The symmetric key portion has also had its fair share of issues. In any secure communication scheme, you need both encryption (to keep things private) and integrity (to make sure people don’t modify, add, or delete pieces of the conversation). Symmetric key encryption is used to provide both encryption and integrity, but in TLS 1.2 and earlier, these two pieces were combined in the wrong way, leading to security vulnerabilities.</p><p>An algorithm that performs symmetric encryption and decryption is called a symmetric cipher. Symmetric ciphers usually come in two main forms: block ciphers and stream ciphers.</p><p>A stream cipher takes a fixed-size key and uses it to create a stream of pseudo-random data of arbitrary length, called a key stream. To encrypt with a stream cipher, you take your message and combine it with the key stream by XORing each bit of the key stream with the corresponding bit of your message. To decrypt, you take the encrypted message and XOR it with the key stream. Examples of pure stream ciphers are RC4 and ChaCha20. Stream ciphers are popular because they’re simple to implement and fast in software.</p><p>A block cipher is different from a stream cipher because it only encrypts fixed-sized messages. If you want to encrypt a message that is shorter or longer than the block size, you have to do a bit of work. For shorter messages, you have to pad the message with extra data up to the block size. For longer messages, you can split your message into blocks the cipher can encrypt and then use a block cipher mode to combine the pieces together. Alternatively, you can turn your block cipher into a stream cipher by encrypting a sequence of counters with a block cipher and using that as the key stream. This is called “counter mode”.
One popular way of encrypting arbitrary length data with a block cipher is a mode called cipher block chaining (CBC).</p>
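<p>The counter-mode construction described above can be sketched in a few lines of Python. As a stand-in for a real block cipher, this sketch uses SHA-256 as a keyed pseudo-random function; the key and nonce values are illustrative, and this is not a construction to use in production.</p>

```python
# Counter mode, sketched: "encrypt" successive counter values to build a
# keystream, then XOR it with the message. SHA-256 stands in here for a
# real block cipher such as AES.
import hashlib

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor_crypt(key: bytes, nonce: bytes, msg: bytes) -> bytes:
    # XOR is its own inverse, so the same function encrypts and decrypts.
    return bytes(m ^ k for m, k in zip(msg, keystream(key, nonce, len(msg))))

ct = xor_crypt(b"key", b"nonce", b"attack at dawn")
assert xor_crypt(b"key", b"nonce", ct) == b"attack at dawn"
```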
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5hHGcSStKo5bHWDn64PXHq/4801697c668fc061eab0c0ab57c2fdd8/image9.png" />
            
            </figure><p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/510G6uPdPGbwTJHcsGqPSc/726945c085f0912e1307e18ef4393563/image7.png" />
            
            </figure><p>In order to prevent people from tampering with data, encryption is not enough. Data also needs to be integrity-protected. For CBC-mode ciphers, this is done using something called a message-authentication code (MAC), which is like a fancy checksum with a key. Cryptographically strong MACs have the property that finding a MAC value that matches an input is practically impossible unless you know the secret key. There are two ways to combine MACs and CBC-mode ciphers. Either you encrypt first and then MAC the ciphertext, or you MAC the plaintext first and then encrypt the whole thing. In TLS, they chose the latter, MAC-then-Encrypt, which turned out to be the wrong choice.</p><p>You can blame this choice for <a href="https://www.youtube.com/watch?v=-_8-2pDFvmg">BEAST</a>, as well as a slew of padding oracle vulnerabilities such as <a href="http://www.isg.rhul.ac.uk/tls/Lucky13.html">Lucky 13</a> and <a href="https://eprint.iacr.org/2015/1129">Lucky Microseconds</a>. Read my previous post on <a href="/padding-oracles-and-the-decline-of-cbc-mode-ciphersuites/">padding oracle attacks</a> for a comprehensive explanation of these flaws. The interaction between CBC mode and padding was also the cause of the widely publicized <a href="/sslv3-support-disabled-by-default-due-to-vulnerability/">POODLE vulnerability</a> in SSLv3 and some implementations of TLS.</p><p>RC4 is a classic stream cipher designed by Ron Rivest (the “R” of RSA) that had been broadly supported since the early days of TLS. In 2013, it was found to have <a href="http://www.isg.rhul.ac.uk/tls/">measurable biases</a> that could be leveraged to allow attackers to decrypt messages.</p>
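<p>For contrast, here is a minimal Python sketch of the safer Encrypt-then-MAC composition (the ordering TLS 1.2’s CBC suites did not use): because the MAC covers the ciphertext, a tampered record is rejected before any decryption or padding logic runs. The key and record contents below are placeholders.</p>

```python
# Encrypt-then-MAC, sketched with HMAC-SHA256. The MAC is verified over
# the ciphertext first, so padding-oracle-style bugs never get a chance
# to observe decryption behavior.
import hmac, hashlib

TAG_LEN = 32  # SHA-256 output size

def protect(mac_key: bytes, ciphertext: bytes) -> bytes:
    tag = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()
    return ciphertext + tag

def unprotect(mac_key: bytes, record: bytes) -> bytes:
    ct, tag = record[:-TAG_LEN], record[-TAG_LEN:]
    if not hmac.compare_digest(tag, hmac.new(mac_key, ct, hashlib.sha256).digest()):
        raise ValueError("bad record MAC")  # reject before touching padding
    return ct

record = protect(b"mac-key", b"opaque ciphertext bytes")
assert unprotect(b"mac-key", record) == b"opaque ciphertext bytes"
```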
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7A6uMNALwNJtA0GSKOkxOL/5c9692baac1ab202e0e4d7f79e4ae8f2/image2.png" />
            
            </figure>
    <div>
      <h4>AEAD mode ciphers</h4>
      <a href="#aead-mode-ciphers">
        
      </a>
    </div>
    <p>In TLS 1.3, all the troublesome ciphers and cipher modes have been removed. You can no longer use CBC-mode ciphers or insecure stream ciphers such as RC4. The only type of symmetric crypto allowed in TLS 1.3 is a new construction called <a href="/it-takes-two-to-chacha-poly/">AEAD (authenticated encryption with additional data)</a>, which combines encryption and integrity into one seamless operation.</p>
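<p>To make the AEAD interface concrete, here is a toy seal/open pair in Python assembled from the stream-cipher and MAC pieces discussed earlier. It is emphatically not a real AEAD like AES-GCM or ChaCha20-Poly1305; it only illustrates the property that a single operation authenticates both the ciphertext and the additional data.</p>

```python
# Toy "AEAD" built from stdlib primitives (SHA-256 keystream + HMAC tag).
# Real TLS 1.3 uses AES-GCM or ChaCha20-Poly1305; this just mirrors the
# seal/open interface and the binding of additional data into the tag.
import hmac, hashlib

def _keystream(key, nonce, n):
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def seal(key, nonce, plaintext, additional_data):
    ct = bytes(a ^ b for a, b in zip(plaintext, _keystream(key, nonce, len(plaintext))))
    tag = hmac.new(key, nonce + additional_data + ct, hashlib.sha256).digest()
    return ct + tag

def open_(key, nonce, sealed, additional_data):
    ct, tag = sealed[:-32], sealed[-32:]
    expected = hmac.new(key, nonce + additional_data + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    return bytes(a ^ b for a, b in zip(ct, _keystream(key, nonce, len(ct))))

sealed = seal(b"key", b"nonce-1", b"hello", b"record-header")
assert open_(b"key", b"nonce-1", sealed, b"record-header") == b"hello"
```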
    <div>
      <h3>Fixing digital signatures</h3>
      <a href="#fixing-digital-signatures">
        
      </a>
    </div>
    <p>Another important part of TLS is authentication. In every connection, the server authenticates itself to the client using a digital certificate, which has a public key. In RSA-encryption mode, the server proves its ownership of the private key by decrypting the pre-main secret and computing a MAC over the transcript of the conversation. In Diffie-Hellman mode, the server proves ownership of the private key using a digital signature. If you’ve been following this blog post so far, it should be easy to guess that this was done incorrectly too.</p>
    <div>
      <h4>PKCS#1v1.5</h4>
      <a href="#pkcs-1v1-5">
        
      </a>
    </div>
    <p>Daniel Bleichenbacher has made a living identifying problems with RSA in TLS. In 2006, he devised a pen-and-paper attack against RSA signatures as used in TLS. It was later discovered that major TLS implementations, including those of NSS and OpenSSL, <a href="https://www.ietf.org/mail-archive/web/openpgp/current/msg00999.html">were vulnerable to this attack</a>. This issue again had to do with how difficult it is to implement padding correctly, in this case, the PKCS#1 v1.5 padding used in RSA signatures. In TLS 1.3, PKCS#1 v1.5 is removed in favor of the newer design <a href="https://en.wikipedia.org/wiki/Probabilistic_signature_scheme">RSA-PSS</a>.</p>
    <div>
      <h4>Signing the entire transcript</h4>
      <a href="#signing-the-entire-transcript">
        
      </a>
    </div>
    <p>We described earlier how the server uses a digital signature to prove that the key exchange hasn’t been tampered with. In TLS 1.2 and earlier, the server’s signature only covers part of the handshake. The other parts of the handshake, specifically the parts that are used to negotiate which symmetric cipher to use, are not signed by the private key. Instead, a symmetric MAC is used to ensure that the handshake was not tampered with. This oversight resulted in a number of high-profile vulnerabilities (FREAK, LogJam, etc.). In TLS 1.3 these are prevented because the server signs the entire handshake transcript.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2gJewngm3kPtCgXDvl7O4q/340d28439e4eaac4cd176359dfa19900/image1.png" />
            
            </figure><p>The FREAK, LogJam and CurveSwap attacks took advantage of two things:</p><ol><li><p>the fact that intentionally weak ciphers from the 1990s (called export ciphers) were still supported in many browsers and servers, and</p></li><li><p>the fact that the part of the handshake used to negotiate which cipher was used was not digitally signed.</p></li></ol><p>The on-path attacker can swap out the supported ciphers (or supported groups, or supported curves) from the client with an easily crackable choice that the server supports. They then break the key and forge two finished messages to make both parties think they’ve agreed on a transcript.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6PHIoZN6sxu78eUdoLWcbz/4f4a1fbbff72785aa2be24c5c9872e8f/image13.png" />
            
            </figure><p>These attacks are called downgrade attacks, and they allow attackers to force two participants to use the weakest cipher supported by both parties, even if more secure ciphers are supported. In this style of attack, the perpetrator sits in the middle of the handshake and changes the list of supported ciphers advertised from the client to the server to only include weak export ciphers. The server then chooses one of the weak ciphers, and the attacker figures out the key with a brute-force attack, allowing the attacker to forge the MACs on the handshake. In TLS 1.3, this type of downgrade attack is impossible because the server now signs the entire handshake, including the cipher negotiation.</p>
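<p>A rough Python sketch of why signing the full transcript blocks these downgrades: the signature covers a hash of every handshake message, so rewriting the client’s offered ciphers changes the hash and invalidates the signature. HMAC stands in here for the server’s real certificate signature, and the handshake messages are simplified strings.</p>

```python
# Why signing the full transcript stops downgrades: the signature covers
# a hash of every handshake message, so any on-path edit to the client's
# offered ciphers changes the hash and breaks verification. HMAC stands
# in for a real public-key signature; messages are simplified.
import hmac, hashlib

def transcript_hash(messages):
    h = hashlib.sha256()
    for m in messages:
        h.update(m)
    return h.digest()

server_key = b"stand-in-for-certificate-private-key"
client_hello = b"ClientHello: ciphers=[strong, weak-export]"
server_hello = b"ServerHello: cipher=strong"

sig = hmac.new(server_key, transcript_hash([client_hello, server_hello]),
               hashlib.sha256).digest()

# An attacker rewrites the offer to force a weak cipher...
tampered_hello = b"ClientHello: ciphers=[weak-export]"
# ...but the signature over the real transcript no longer verifies.
assert not hmac.compare_digest(
    sig,
    hmac.new(server_key, transcript_hash([tampered_hello, server_hello]),
             hashlib.sha256).digest())
```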
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/P4S0oZBnuJvkG23ljrAmN/a658f4a88dddcf2019fa22567e150a53/image14.png" />
            
            </figure>
    <div>
      <h3>Better living through simplification</h3>
      <a href="#better-living-through-simplification">
        
      </a>
    </div>
    <p>TLS 1.3 is a much more elegant and secure protocol with the removal of the insecure features listed above. This hedge-trimming allowed the protocol to be simplified in ways that make it easier to understand, and faster.</p>
    <div>
      <h4>No more take-out menu</h4>
      <a href="#no-more-take-out-menu">
        
      </a>
    </div>
    <p>In previous versions of TLS, the main negotiation mechanism was the ciphersuite. A ciphersuite encompassed almost everything that could be negotiated about a connection:</p><ul><li><p>type of certificates supported</p></li><li><p>hash function used for deriving keys (e.g., SHA1, SHA256, ...)</p></li><li><p>MAC function (e.g., HMAC with SHA1, SHA256, …)</p></li><li><p>key exchange algorithm (e.g., RSA, ECDHE, …)</p></li><li><p>cipher (e.g., AES, RC4, ...)</p></li><li><p>cipher mode, if applicable (e.g., CBC)</p></li></ul><p>Ciphersuites in previous versions of TLS had grown into monstrously large alphabet soups. Examples of commonly used ciphersuites include DHE-RC4-MD5 and ECDHE-ECDSA-AES-GCM-SHA256. Each ciphersuite was represented by a code point in a table maintained by an organization called the Internet Assigned Numbers Authority (IANA). Every time a new cipher was introduced, a new set of combinations needed to be added to the list. This resulted in a combinatorial explosion of code points representing every valid choice of these parameters. It had become a bit of a mess.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2W8o2jcAOgb3EUcGitVN8b/59d24160803d908ed4417494d57ea288/image8.png" />
            
            </figure><p>TLS 1.2</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/171emSynzRJ1MSpJv08jc1/18c165eb9c379b48418a5f25a47131e1/image16.png" />
            
            </figure><p></p><p>TLS 1.3</p><p>TLS 1.3 removes many of these legacy features, allowing for a clean split between three orthogonal negotiations:</p><ul><li><p>Cipher + HKDF Hash</p></li><li><p>Key Exchange</p></li><li><p>Signature Algorithm</p></li></ul>
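<p>A back-of-the-envelope way to see the win: with one code point per full ciphersuite, the options multiply, while with three orthogonal lists they merely add. The counts below are made up for illustration.</p>

```python
# Illustrative counts only: the number of IANA code points needed when
# every combination gets its own ciphersuite vs. three orthogonal lists.
key_exchanges = 4    # e.g., RSA, DHE, ECDHE, PSK
signatures = 3       # e.g., RSA, ECDSA, EdDSA
aead_ciphers = 5     # e.g., AES-GCM variants, ChaCha20-Poly1305, ...

combined = key_exchanges * signatures * aead_ciphers    # TLS 1.2 style
orthogonal = key_exchanges + signatures + aead_ciphers  # TLS 1.3 style

assert combined == 60
assert orthogonal == 12
```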
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5eHObkLXwOPPw9MEbaxSsc/8e20132ec65ebb83b1e43528711fe05d/image6.png" />
            
            </figure><p>This simplified cipher suite negotiation and radically reduced set of negotiation parameters opens up a new possibility: the TLS 1.3 handshake latency drops from two round-trips to just one, providing the performance boost that will help ensure TLS 1.3 is popular and widely adopted.</p>
    <div>
      <h3>Performance</h3>
      <a href="#performance">
        
      </a>
    </div>
    <p>When establishing a new connection to a server that you haven’t seen before, it takes two <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">round-trips</a> before data can be sent on the connection. This is not particularly noticeable in locations where the server and client are geographically close to each other, but it can make a big difference on mobile networks where latency can be as high as 200ms, an amount that is noticeable for humans.</p>
    <div>
      <h3>1-RTT mode</h3>
      <a href="#1-rtt-mode">
        
      </a>
    </div>
    <p>TLS 1.3 now has a radically simpler cipher negotiation model and a reduced set of key agreement options (no RSA, no user-defined DH parameters). This means that every connection will use a DH-based key agreement and the parameters supported by the server are likely easy to guess (ECDHE with X25519 or P-256). Because of this limited set of choices, the client can simply choose to send DH key shares in the first message instead of waiting until the server has confirmed which key shares it is willing to support. That way, the server can learn the shared secret and send encrypted data one round trip earlier. Chrome’s implementation of TLS 1.3, for example, sends an X25519 keyshare in the first message to the server.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3E3tuAB7cL1jf7HXHLegge/461301a79e282e3034a6acc1bb537e49/image3.png" />
            
            </figure><p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5xa8AA1zO4jZ4yjPcIukcM/a29464a13527710055cd6031cae54c92/image15.png" />
            
            </figure><p>In the rare situation that the server does not support one of the key shares sent by the client, the server can send a new message, the HelloRetryRequest, to let the client know which groups it supports. Because the list has been trimmed down so much, this is not expected to be a common occurrence.</p>
    <div>
      <h3>0-RTT resumption</h3>
      <a href="#0-rtt-resumption">
        
      </a>
    </div>
    <p>A further optimization was inspired by the <a href="https://docs.google.com/document/u/1/d/1g5nIXAIkN_Y-7XJW5K45IblHd_L2f5LTaDUDwvZ5L6g/edit">QUIC protocol</a>. It lets clients send encrypted data in their first message to the server, resulting in no additional latency cost compared to unencrypted HTTP. This is a big deal, and once TLS 1.3 is widely deployed, the encrypted web is sure to feel much snappier than before.</p><p>In TLS 1.2, there are two ways to resume a connection, <a href="/tls-session-resumption-full-speed-and-secure/">session ids and session tickets</a>. In TLS 1.3 these are combined to form a new mode called PSK (pre-shared key) resumption. The idea is that after a session is established, the client and server can derive a shared secret called the “resumption main secret”. This can either be stored on the server with an id (session id style) or encrypted by a key known only to the server (session ticket style). This session ticket is sent to the client and redeemed when resuming a connection.</p><p>For resumed connections, both parties share a resumption main secret so key exchange is not necessary except for providing forward secrecy. The next time the client connects to the server, it can take the secret from the previous session and use it to encrypt application data to send to the server, along with the session ticket. Something as amazing as sending encrypted data on the first flight does come with its downsides.</p>
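<p>Under the hood, secrets like this are derived with HKDF (RFC 5869). Here is a minimal Python sketch of HKDF-Expand, the expansion step TLS 1.3 uses to derive per-purpose keys; the label and secret below are illustrative stand-ins, not the exact TLS 1.3 HkdfLabel encoding.</p>

```python
# Minimal HKDF-Expand (RFC 5869, section 2.3) using HMAC-SHA256.
# TLS 1.3 wraps this with a structured label; the raw label used here
# is only a stand-in.
import hmac, hashlib

def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    out, block, counter = b"", b"", 1
    while len(out) < length:
        # T(i) = HMAC(PRK, T(i-1) || info || i)
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        out += block
        counter += 1
    return out[:length]

main_secret = b"\x00" * 32  # placeholder for a handshake-derived secret
resumption_secret = hkdf_expand(main_secret, b"resumption", 32)
assert len(resumption_secret) == 32
```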
    <div>
      <h3>Replayability</h3>
      <a href="#replayability">
        
      </a>
    </div>
    <p>There is no interactivity in 0-RTT data. It’s sent by the client, and consumed by the server without any interactions. This is great for performance, but comes at a cost: replayability. If an attacker captures a 0-RTT packet that was sent to a server, they can replay it and there’s a chance that the server will accept it as valid. This can have interesting negative consequences.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4aI0oRVRPjH8lmqKPfo2Uu/66828a933209d66d8f8ac0db0d77d54d/0-rtt-attack-_2x.png" />
            
            </figure><p>An example of dangerous replayed data is anything that changes state on the server. If you increment a counter, perform a database transaction, or do anything that has a permanent effect, it’s risky to put it in 0-RTT data.</p><p>As a client, you can try to protect against this by only putting “safe” requests into the 0-RTT data. In this context, “safe” means that the request won’t change server state. In HTTP, different methods are supposed to have different semantics. HTTP GET requests are supposed to be safe, so a browser can usually protect HTTPS servers against replay attacks by only sending GET requests in 0-RTT. Since most page loads start with a GET of “/”, this results in faster page load time.</p><p>Problems start to happen when data sent in 0-RTT is used for state-changing requests. To help prevent this failure case, TLS 1.3 also includes the time elapsed value in the session ticket. If this diverges too much, the client is either approaching the speed of light, or the value has been replayed. In either case, it’s prudent for the server to reject the 0-RTT data.</p><p>For more details about <a href="/introducing-0-rtt/">0-RTT, and the improvements to session resumption</a> in TLS 1.3, check out this previous blog post.</p>
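<p>That freshness check can be sketched as follows; the function name and the 10-second tolerance window are illustrative, not taken from the RFC.</p>

```python
# Sketch of the 0-RTT freshness check: the server compares the ticket age
# reported by the client with the actual time elapsed since the ticket
# was issued, and rejects early data if the two diverge.
def accept_0rtt(ticket_issued_at: float, client_ticket_age: float,
                now: float, tolerance: float = 10.0) -> bool:
    server_observed_age = now - ticket_issued_at
    return abs(server_observed_age - client_ticket_age) <= tolerance

# Fresh 0-RTT data: the ages agree, so the server accepts it.
assert accept_0rtt(ticket_issued_at=1000.0, client_ticket_age=5.0, now=1005.0)
# Replayed much later: the reported age is stale, so the server rejects it.
assert not accept_0rtt(ticket_issued_at=1000.0, client_ticket_age=5.0, now=1100.0)
```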
    <div>
      <h3>Deployability</h3>
      <a href="#deployability">
        
      </a>
    </div>
    <p>TLS 1.3 was a radical departure from TLS 1.2 and earlier, but in order to be deployed widely, it has to be backwards compatible with existing software. One of the reasons TLS 1.3 has taken so long to go from draft to final publication was the fact that some existing software (namely middleboxes) wasn’t playing nicely with the new changes. Even minor changes to the TLS 1.3 protocol that were visible on the wire (such as eliminating the redundant ChangeCipherSpec message, bumping the version from 0x0303 to 0x0304) ended up causing connection issues for some people.</p><p>Despite the fact that future flexibility was built into the TLS spec, some implementations made incorrect assumptions about how to handle future TLS versions. The phenomenon responsible for these failures is called <i>ossification</i> and I explore it more fully in the context of TLS in my previous post about <a href="/why-tls-1-3-isnt-in-browsers-yet/">why TLS 1.3 isn’t deployed yet</a>. To accommodate these changes, TLS 1.3 was modified to look a lot like TLS 1.2 session resumption (at least on the wire). This resulted in a much more functional, but less aesthetically pleasing protocol. This is the price you pay for upgrading one of the most widely deployed protocols online.</p>
    <div>
      <h3>Conclusions</h3>
      <a href="#conclusions">
        
      </a>
    </div>
    <p>TLS 1.3 is a modern security protocol built with modern tools like <a href="http://tls13tamarin.github.io/TLS13Tamarin/">formal</a> <a href="https://eprint.iacr.org/2016/081">analysis</a> that retains its backwards compatibility. It has been tested widely and iterated upon using real world deployment data. It’s a cleaner, faster, and more secure protocol ready to become the de facto two-party encryption protocol online. Draft 28 of TLS 1.3 is enabled by default for <a href="/you-get-tls-1-3-you-get-tls-1-3-everyone-gets-tls-1-3/">all Cloudflare customers</a>, and we will be rolling out the final version soon.</p><p>Publishing TLS 1.3 is a huge accomplishment. It is one of the best recent examples of how it is possible to take 20 years of deployed legacy code and change it on the fly, resulting in a better Internet for everyone. TLS 1.3 has been debated and analyzed for the last three years and it’s now ready for prime time. Welcome, RFC 8446.</p>
            <category><![CDATA[Encryption]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[TLS 1.3]]></category>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">2sBEBduE1Y7lYRV2e70E5m</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing Certificate Transparency and Nimbus]]></title>
            <link>https://blog.cloudflare.com/introducing-certificate-transparency-and-nimbus/</link>
            <pubDate>Fri, 23 Mar 2018 14:45:00 GMT</pubDate>
            <description><![CDATA[ Certificate Transparency (CT) is an ambitious project to help improve security online by bringing accountability to the system that protects HTTPS.  ]]></description>
            <content:encoded><![CDATA[ <p>Certificate Transparency (CT) is an ambitious project to help improve security online by bringing accountability to the system that protects HTTPS. Cloudflare is announcing support for this project by introducing two new public-good services:</p><ul><li><p><i>Nimbus</i>: A free and open certificate transparency log</p></li><li><p><a href="https://ct.cloudflare.com"><i>Merkle Town</i></a>: A dashboard for exploring the certificate transparency ecosystem</p></li></ul><p>In this blog post we’ll explain what Certificate Transparency is and how it will become a critical tool for ensuring user safety online. It’s important for website operators and certificate authorities to learn about CT as soon as possible, because participating in CT becomes mandatory in Chrome for all certificates issued after April 2018. We’ll also explain how Nimbus works and how CT uses a structure called a Merkle tree to scale to the point of supporting all trusted certificates on the Internet. For more about Merkle Town, read the <a href="/a-tour-through-merkle-town-cloudflares-ct-ecosystem-dashboard/">follow up post</a> by my colleague Patrick Donahue.</p>
    <div>
      <h3>Trust and Accountability</h3>
      <a href="#trust-and-accountability">
        
      </a>
    </div>
    <p>Everything we do online requires a baseline level of trust. When you use a browser to visit your bank’s website or your favorite social media site, you expect that the server on the other side of the connection is operated by the organization indicated in the address bar. This expectation is based on trust, and this trust is supported by a system called the web <a href="https://en.wikipedia.org/wiki/Public_key_infrastructure">Public Key Infrastructure</a> (web PKI).</p><p>From a high level, a PKI is similar to a system of notaries who can grant servers the authority to serve websites by giving them a signed object called a digital certificate. This certificate contains the name of the website, the name of the organization that requested the certificate, how long it’s valid for, and a public key. For each public key, there is an associated private key. In order to serve an HTTPS site with a given certificate, the server needs to prove ownership of the associated private key.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/17ArZT2zzy50umQZFOznIA/d0315e3b412ce3c9376ed234f45c1cd0/image8.png" />
            
            </figure><p>Sites obtain digital certificates from trusted third parties called Certificate Authorities (CAs). CAs validate the operator’s ownership of a domain before issuing them a certificate. If the issuing CA for a certificate is trusted by the browser, then that certificate can be used to serve content over HTTPS to visitors of the site. All this happens under the hood of the browser and is only surfaced to the user by the little green lock in your address bar, or a nasty error message if things go wrong.</p><p>The Web’s PKI is an elaborate and complicated system. There are dozens of organizations that operate the certificate authorities trusted by popular browsers. Different browsers manage their own “root programs” in which they choose which certificate authorities are trusted. Becoming a trusted CA often requires passing an audit (<a href="https://cabforum.org/webtrust-for-cas/">WebTrust for CAs</a>) and promising to follow a set of rules called the <a href="https://cabforum.org/baseline-requirements-documents/">Baseline Requirements</a>. These rules are set by a standards body called the CA/Browser forum, which consists of browsers and CAs. Each root program has its own application process and program-specific guidelines.</p><p>This system works great, until it doesn’t. As a system of trust, there are many ways for the PKI to fail. One of the risks inherent in the web PKI is that any certificate authority can issue any certificate for any website. That means that the <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=408949">Hong Kong Post Office</a> can issue a certificate valid for gmail.com or facebook.com, and every browser will trust it. This forces users to put a lot of faith in certificate authorities and hope they don’t misbehave by issuing a certificate to the wrong person. Even worse, if a CA gets hacked, then an attacker could issue any certificate without the CA knowing, putting users at an even greater risk.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2DTDhhiLOKMayPifq2JmYH/105d45d24b3aa0fd696b87463792c9b8/image11.png" />
            
            </figure><p>If someone manages to get a certificate for a site that isn’t theirs, they can impersonate that site to its visitors and steal their information. To get a sense of how many CAs are trusted by popular browsers, <a href="https://ccadb-public.secure.force.com/mozilla/IncludedCACertificateReport">here’s the list</a> of the 160 trusted CAs in Firefox. Microsoft and Apple maintain their own lists, which are comparably long (Cloudflare keeps track of these lists in our <a href="https://github.com/cloudflare/cfssl_trust">cfssl_trust</a> repository). By trusting all of these organizations implicitly when you browse the internet, users bear the risk of certificate authority misbehavior.</p><p>What’s missing from the PKI is accountability. If a mis-issued certificate is used to target an individual, there is no feedback mechanism to let anyone know that the CA misbehaved.</p><p>This is not a theoretical situation. In 2011, DigiNotar, a Dutch CA, <a href="https://threatpost.com/final-report-diginotar-hack-shows-total-compromise-ca-servers-103112/77170/">was hacked</a>. The hacker used their access to the CA to issue a certificate for *.google.com. They then attempted to use this certificate to impersonate Gmail and targeted users in Iran in an attempt to compromise their personal information.</p><p>The attack was detected by Google using a technique called public key pinning. Key pinning is a <a href="https://blog.qualys.com/ssllabs/2016/09/06/is-http-public-key-pinning-dead">risky technique</a> and only viable for very technically savvy organizations. The value provided by key pinning is usually overshadowed by the operational risk it introduces. It’s often considered the canonical example of a “foot gun” mechanism. Key pinning is being <a href="https://groups.google.com/a/chromium.org/forum/#!topic/blink-dev/he9tr7p3rZ8">deprecated by browsers</a>.</p><p>If key pinning is not the solution, what can be done to detect CA malfeasance? 
This is where Certificate Transparency comes in. The goal of CT is to make all certificates public so that mis-issued certificates can be detected and appropriate action taken. This helps bring accountability to balance the trust we collectively place in the fragile web PKI.</p>
    <div>
      <h3>Shining a light</h3>
      <a href="#shining-a-light">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2MQoZ8uEfBJZJbMWHSii9u/cd91d06e5206737c49a9ee6b7f2922ef/image12.jpg" />
            
            </figure><p><a href="https://www.pexels.com/@angele-j-35172">Creative Commons Zero (CC0) - angele-j</a></p><p>If all certificates are public, then so are mis-issued certificates. Certificate Transparency brings accountability to the web PKI using an append-only public ledger: an ordered list that can grow but never be rewritten. It puts trusted certificates into a list and makes that list available to anyone. That sounds easy, but given the decentralized nature of the internet, there are many challenges in making this a reliable bedrock for accountability.</p><p>The CT ecosystem is an extension of the PKI that introduces additional participants and roles. Along with browsers and CAs, new parties are introduced to play a role in the overall health of the system. These parties are:</p><ul><li><p>Log operators</p></li><li><p>Auditors</p></li><li><p>Monitors</p></li></ul><p>At a high level, a <i>log operator</i> is someone who manages the list of certificates, which can only be added to. If someone submits a browser-trusted certificate to a log, it must be incorporated into the list within a pre-set grace period called a maximum merge delay (MMD), which is usually 24 hours. The log gives the submitter back a receipt, called a Signed Certificate Timestamp (SCT). An SCT is a promise to include the certificate in the log within the grace period. You can interact with a CT log via the CT API (defined in <a href="https://tools.ietf.org/html/rfc6962">RFC 6962</a>).</p><p>An <i>auditor</i> is a third party that keeps log operators honest. They query logs from various vantage points on the internet and <a href="https://tools.ietf.org/html/draft-ietf-trans-gossip-05">gossip</a> with each other about what order certificates are in. They’re like the paparazzi of the PKI. They also keep track of whether an SCT has been honored or not by measuring the time it took between the SCT’s timestamp and the corresponding certificate showing up in the log. 
Along with being a dashboard, the Merkle Town backend is also an auditor; it has even <a href="https://groups.google.com/a/chromium.org/forum/#!msg/ct-policy/92HIh2vG6GA/hBEHxcpoCgAJ">detected issues in other logs</a>.</p><p>A <i>monitor</i> is a service that helps alert websites of mis-issuance. It crawls logs for new certificates and alerts website owners if a new certificate is found for their domain. Popular monitors include SSLMate’s <a href="https://sslmate.com/certspotter/">CertSpotter</a> and Facebook’s <a href="https://developers.facebook.com/tools/ct/">Certificate Transparency Monitoring</a>. Cloudflare is planning on offering a free log monitoring service integrated into the Cloudflare dashboard for customers by the end of the year.</p>
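<p>The ledger’s append-only claim is backed by a Merkle tree: a single root hash commits to every certificate in the log, and any append changes the root. Here is a minimal Python sketch; the 0x00/0x01 hash prefixes follow RFC 6962, but the tree-splitting rule is simplified relative to the RFC.</p>

```python
# Minimal Merkle root over a list of certificates. The root hash is a
# compact commitment to the whole ordered list: same list, same root;
# any append or edit, different root. Simplified relative to RFC 6962,
# which splits subtrees at the largest power of two.
import hashlib

def merkle_root(leaves):
    if not leaves:
        return hashlib.sha256(b"").digest()
    # RFC 6962 leaf hash: H(0x00 || leaf)
    level = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            # RFC 6962 interior node hash: H(0x01 || left || right)
            nxt.append(hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest())
        if len(level) % 2:          # odd node is promoted to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

certs = [b"cert-A", b"cert-B", b"cert-C"]
root = merkle_root(certs)
# Appending a certificate changes the commitment...
assert merkle_root(certs + [b"cert-D"]) != root
# ...but the same list always yields the same root.
assert merkle_root(certs) == root
```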
    <div>
      <h4>Letting browsers know that certificates are logged</h4>
      <a href="#letting-browsers-know-that-certificates-are-logged">
        
      </a>
    </div>
    <p>One way to push the web to adopt CT is for browsers to start requiring website certificates to be logged. Validating that a certificate is logged by querying logs directly when a connection is made is a potential privacy issue (exposing browser history to a third party), and adds latency. Instead, a better way to ensure that a certificate is logged is to require the server to present SCTs, the inclusion receipts described in the last section.</p><p>There are multiple ways that an SCT can be presented to the client. The most popular way is to embed the SCTs into the certificate when it’s issued. This involves the CA submitting a pre-certificate, a precursor to a certificate, to various CT logs to obtain SCTs before finalizing the certificate.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1X2MjNCxQqpKMNkwYvVcAh/e9db15c08ddff2447b7132ba98055464/image10.png" />
            
            </figure><p>For certificates that don’t have SCTs embedded, a server has other mechanisms to transmit SCTs to the client. SCTs can be included in the connection as a <a href="https://medium.com/@davetempleton/enabling-a-certificate-transparency-tls-extension-in-nginx-489b3b804f89">TLS extension</a>, or in a <a href="https://en.wikipedia.org/wiki/OCSP_stapling">stapled OCSP response</a>. These mechanisms are more difficult to do reliably, but they allow any certificate to be included in CT.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7a7iZnEIv9yKOif6dS8E79/436ad5a8386f5c8a9f0f471e3e117033/image14.png" />
            
            </figure><p>A certificate with embedded SCTs</p>
    <div>
      <h4>Browser CT requirements</h4>
      <a href="#browser-ct-requirements">
        
      </a>
    </div>
    <p>Not all logs are created equal. A malicious CA could collude with a set of logs to create a certificate and a set of SCTs and never incorporate the certificate into the logs. This certificate could then be used to attack users. To protect the ecosystem from collusion risks, the browsers that support CT have chosen to only accept SCTs from a list of vetted logs that are actively audited. There are also diversity requirements: logs should be managed by different entities on different infrastructures to avoid collusion and shared outages. Connections are said to be “CT Qualified” if they provide the client with enough SCTs from vetted logs to suit the risk profile of the certificate.</p><p>In order for a connection to be CT qualified in Chrome, it has to follow Chrome’s <a href="https://github.com/chromium/ct-policy/blob/master/ct_policy.md">CT Policy</a>. For most certificates, this means an SCT needs to be presented for at least one Google and one non-Google log trusted by Chrome. Long-lived certificates require more than two SCTs.</p><p>In order to become trusted by Chrome, a CT log must submit itself for inclusion and complete a 90-day monitoring period in which it must meet some stringent requirements, including the following:</p><ul><li><p>Have no outage that exceeds an MMD (maximum merge delay) of more than 24 hours</p></li><li><p>Have 99% uptime, with no outage lasting longer than the MMD (as measured by Google)</p></li><li><p>Incorporate a certificate for which an SCT has been issued by the Log within the MMD</p></li><li><p>Maintain the append-only property of the Log by providing consistent views of the Merkle Tree at all times and to all parties</p></li></ul><p>The full policy is <a href="https://github.com/chromium/ct-policy/blob/master/log_policy.md">here</a>. The complete list of vetted logs is available <a href="https://www.gstatic.com/ct/log_list/log_list.json">here</a>. 
Nimbus, Cloudflare’s family of CT logs, was included in Chrome 65, the current stable version of the browser.</p>
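<p>As a rough illustration of this kind of policy check (this is not Chrome’s actual implementation, and the SCT-count thresholds below are illustrative assumptions, not the real policy numbers), a verifier might reason like this:</p>

```python
# Sketch of a CT-qualification check: require log-operator diversity
# (at least one Google and one non-Google log) and more SCTs for
# longer-lived certificates. Thresholds here are illustrative only.

def ct_qualified(scts, lifetime_months):
    """scts: list of (log_name, operator) pairs from vetted logs."""
    operators = {operator for _, operator in scts}
    has_diversity = "Google" in operators and len(operators) > 1
    # Assumed thresholds for illustration: short-lived certificates need
    # two SCTs, longer-lived ones need more.
    required = 2 if lifetime_months <= 15 else 3
    return has_diversity and len(scts) >= required

scts = [("pilot", "Google"), ("nimbus2018", "Cloudflare")]
ct_qualified(scts, lifetime_months=12)   # True: one Google + one non-Google log
```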
    <div>
      <h4>All or nothing</h4>
      <a href="#all-or-nothing">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2gbUWaez8iOEBmGrTpRLtu/f087b62d12e5a27f8f30e178fc621d1f/image2.jpg" />
            
            </figure><p><a href="https://commons.wikimedia.org/wiki/File:CISV_trust_game.JPG">Creative Commons Attribution-Share Alike 3.0 - Gorskiya</a></p><p>CT only protects users if all certificates are logged. If a CA issues a certificate that is not logged in CT and will be trusted by browsers, then users can still be subject to targeted attacks.</p><p>For the last few years, Chrome has <a href="https://www.certificate-transparency.org/ev-ct-plan">required all Extended Validation (EV) certificates</a> to be CT qualified. Cloudflare has been making sure that all connections to Cloudflare are CT Qualified for Chrome, and has been since May 2017. This is done using an automatic submission service related to our recent OCSP <a href="/high-reliability-ocsp-stapling/">stapling project</a> and the SCT TLS extension. We monitor whether or not the set of SCTs we provide is conformant with browser policies using the <a href="https://tools.ietf.org/html/draft-ietf-httpbis-expect-ct-03">Expect-CT header</a>, which reports client errors back to our servers. Expect-CT is included in every HTTPS response Cloudflare serves.</p><p>This isn’t enough. As long as there are certificates that are trusted by browsers but not required to be CT Qualified, users are at risk. This is why the Chrome team announced that they will require Certificate Transparency for all newly issued, publicly trusted certificates starting in <a href="https://groups.google.com/a/chromium.org/forum/#!topic/ct-policy/sz_3W_xKBNY">April 2018</a>.</p>
    <div>
      <h4>Changes for Certificate Authorities</h4>
      <a href="#changes-for-certificate-authorities">
        
      </a>
    </div>
    <p>The safest way to make sure a certificate is always CT qualified when shown to a browser is to embed enough SCTs from vetted logs to conform to the browser’s policy. This is a big operational change for some CAs, because:</p><ul><li><p>It adds the additional step of having to understand which SCTs are required for different browsers’ CT policies and keeping up to date with policy changes</p></li><li><p>It adds a step in the issuance process where pre-certificates are submitted to different logs before the certificate is issued</p></li><li><p>Some CT logs may have outages or be slow to respond, so fallback strategies may be required to not delay issuance</p></li></ul><p>If you’re a CA operator and are struggling to implement this or worried about the additional delay caused by submitting pre-certificates to logs, Cloudflare can help. We’re offering an experimental API for CAs that takes pre-certificates and returns a valid set of SCTs for a given certificate. This API handles sorting out the browser policies and leverages <a href="https://www.cloudflare.com/argo/">Argo Smart Routing</a> to return SCTs with as little delay as possible. Please contact our CT team at <a>ct-logs@cloudflare.com</a> if you have any interest in this offering.</p>
    <div>
      <h3>Building a verifiable globally consistent log</h3>
      <a href="#building-a-verifiable-globally-consistent-log">
        
      </a>
    </div>
    <p>The PKI is huge. The <a href="/https-or-bust-chromes-plan-to-label-sites-as-not-secure/">industry-wide push for HTTPS</a> has introduced millions of new certificates into the web PKI. There are over a quarter-billion(!) certificates logged in CT, and this number is growing by almost a million per day. This number is sure to grow as we approach Chrome’s April deadline. Managing a high availability database of this size that you can only add to (a property called append-only) is a substantial engineering challenge.</p><p>The naïve data structure to use for an append-only database is a hash chain. In a hash chain, elements are aligned in order and combined using a <a href="https://simple.wikipedia.org/wiki/Cryptographic_hash_function">one-way hash function</a> like SHA-256. The diagram below describes how a hash chain is created from a list of values d1 to d8. Start with the first element, d1, which is hashed to a value a, which then becomes the head of the chain. Every time an element is added to the chain, a hash function is computed over two inputs: the current chain head and the new element. This hash value becomes the new chain head. Because one-way hash functions can’t be reversed, it’s computationally infeasible to change a value without changing every chain head computed after it. This makes a hash chain’s history very difficult to modify maliciously.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2v25fx7zDjnAqEUvNdLnVU/65bc5aa6933a36cbd7e85d69bcf19a41/image7.gif" />
            
            </figure><p>A hash chain is the optimal data structure for inserting new elements: only one hash needs to be computed for each element added. However, it is not an efficient data structure for validating whether an element is correctly included in a chain given a chain head. In the example below, there are six additional elements (b, d4-d8) needed to validate that d3 is correct on a chain of 8 values. You need approximately n/2 elements to verify an average element in a chain of length n. In computer science terms, this is called “linear scaling.”</p>
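<p>The chain construction and its tamper-evidence can be sketched in a few lines of Python (illustrative only, using SHA-256):</p>

```python
import hashlib

# Sketch of the hash chain described above: each new element is hashed
# together with the current chain head, so changing any element changes
# every head computed after it.

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def chain_head(elements):
    head = sha256(elements[0])
    for element in elements[1:]:
        head = sha256(head + element)   # hash(current head, new element)
    return head

elements = [b"d1", b"d2", b"d3", b"d4"]
head = chain_head(elements)

# Tampering with any element produces a different head. Verifying one
# element, though, requires rehashing through every later element,
# which is about n/2 values on average: linear in the chain length.
assert chain_head([b"d1", b"dX", b"d3", b"d4"]) != head
```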
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5fq19rbF5uMj1Xb2BRomJj/b4b54a16ca7cb6205b2d1b278a1c9258/image4.gif" />
            
            </figure><p>When building a system, it’s best to try to reduce complexity for the participants involved. For CT, the main participants we care about are the log operator and the auditor. If we were to choose a hash chain as our data structure, the job of the log operator would be easy but that of the auditor would be very hard. We can do better.</p><p>When you ask a computer scientist how to optimize an algorithm, nine out of ten times, the solution they will suggest is to use a tree (the other 1/10 times, they’ll suggest a <a href="https://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</a>). That’s exactly what we can do here. Specifically, we can use a data structure called a Merkle Tree. It’s like a hash chain, but rather than hashing elements in a line, you hash them in pairs.</p><p>For each new element, instead of hashing it into a running total, you arrange the elements into a <a href="https://en.wikipedia.org/wiki/Binary_tree">balanced binary tree</a> and compute the hash of the element with its sibling. This gives you half as many hashes as elements. These hashes are then arranged in pairs and hashed together to create the next level of the tree. This continues until you have one element, the top of the tree; this is called the tree head. Adding a new value to a Merkle tree requires the modification of at most one hash per level in the tree, following the path from the element up to the tree head.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/255yicnSkcbR6yUYKDh3s6/5658e680fd0a965f991bf8e0000e49db/image11.gif" />
            
            </figure><p>The depth of a binary tree is <a href="https://en.wikipedia.org/wiki/Logarithmic_scale">logarithmic</a> with respect to the number of elements. Roughly speaking, if the tree size is 8 = 2^3, then the depth is 4 (= 3+1); if it’s 1024 = 2^10, then the tree depth is 11 (= 10+1); for 1048576 = 2^20, the tree depth is 21 (= 20+1). The cost of insertion is at most log_2(size), which is worse than in a hash chain, but generally not too bad.</p><p>What makes the Merkle tree so useful is the efficiency of element validation. Instead of having to compute n/2 hashes like in a hash chain, you only need the elements in the tree that lead you to the root. This is called the co-path. In the diagram below, the co-path is computed for d3.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3LUFI4MfjCHSojuxEyyaPC/11cf7ab9c28fa92fbd799e2e3cdbfeb3/image5.gif" />
            
            </figure><p>The co-path consists of one value per level of the tree. The computation necessary to prove that an element is correct (an inclusion proof) is therefore logarithmic, not linear as in the case of a hash chain. Both insertion and validation are cheap relative to the size of the tree, making a Merkle tree the ideal data structure for CT.</p>
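<p>A minimal Python sketch of a Merkle tree with inclusion proofs makes the logarithmic co-path concrete (illustrative only: RFC 6962 additionally domain-separates leaf and interior hashes and handles trees whose size is not a power of two):</p>

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def tree_head(leaves):
    # Hash leaves, then repeatedly hash pairs until one value remains.
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def copath(leaves, index):
    """Collect the sibling hash at each level for the leaf at `index`."""
    level, proof = [h(leaf) for leaf in leaves], []
    while len(level) > 1:
        proof.append(level[index ^ 1])   # sibling of the current node
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(leaf, index, proof, head):
    # Recompute the path from the leaf to the head using only the co-path.
    node = h(leaf)
    for sibling in proof:
        node = h(sibling + node) if index % 2 else h(node + sibling)
        index //= 2
    return node == head

leaves = [b"d1", b"d2", b"d3", b"d4", b"d5", b"d6", b"d7", b"d8"]
head = tree_head(leaves)
proof = copath(leaves, 2)         # co-path for d3: 3 hashes, not n/2
verify(b"d3", 2, proof, head)     # True
```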
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2FpSweNBiv0xNDaXT6xZQ2/9e7f2c61b8bce51926a9a6f6346b89fa/image6.png" />
            
            </figure><p>A certificate transparency log is a Merkle tree where the leaf elements are certificates. Each log has a private key that it uses to sign the current tree head at regular intervals. Some CT logs are huge, with over a hundred million entries, but because of the efficiency of Merkle trees, inclusion proofs only require around 30 hashes. This structure provides a good balance between the cost to the log operator of adding certificates and the cost to the auditor of validating its consistency.</p>
    <div>
      <h3>Nimbus</h3>
      <a href="#nimbus">
        
      </a>
    </div>
    <p>Our own addition to the log ecosystem is Nimbus. Nimbus is a family of Certificate Transparency logs with an open acceptance policy. Nimbus accepts any certificate that is signed by a CA from our <a href="https://github.com/cloudflare/cfssl_trust">cfssl_trust</a> root store. The logs are organized by year, e.g. Nimbus 2018, Nimbus 2019, etc. with each log only accepting certificates that expire in that year.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2yxv4DH2K2TDPjxlI6DUbG/6292fc800aa8db6382ab4427eecd15ef/image8-1.png" />
            
            </figure><p>Nimbus is built with <a href="https://github.com/google/trillian">Trillian</a>, a Go-based implementation of a scalable Merkle tree. The data backend is custom, re-using components from Cloudflare’s high <a href="/how-cloudflare-analyzes-1m-dns-queries-per-second/">capacity logging infrastructure</a>, which runs entirely on Cloudflare’s bare metal infrastructure. Having a high-reliability and completely open log that is not dependent on shared cloud infrastructure adds insurance in case of outages. Nimbus is intended to bring diversity and stability to the CT ecosystem, and in the end, make the internet safer for everyone.</p> ]]></content:encoded>
            <category><![CDATA[Certificate Authority]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[Certificate Transparency]]></category>
            <guid isPermaLink="false">6R7Ib63o9iMkS1CgcMXOr1</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Why TLS 1.3 isn't in browsers yet]]></title>
            <link>https://blog.cloudflare.com/why-tls-1-3-isnt-in-browsers-yet/</link>
            <pubDate>Tue, 26 Dec 2017 20:30:00 GMT</pubDate>
            <description><![CDATA[ Upgrading a security protocol in an ecosystem as complex as the Internet is difficult. You need to update clients and servers and make sure everything in between continues to work correctly. The Internet is in the middle of such an upgrade right now.  ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4X5RY5RIBFMtAApEoeXPqO/accd858973a81426156cbde6907b0b09/split-1.jpg" />
            
            </figure><p>Upgrading a security protocol in an ecosystem as complex as the Internet is difficult. You need to update clients and servers and make sure everything in between continues to work correctly. The Internet is in the middle of such an upgrade right now. Transport Layer Security (TLS), the protocol that keeps web browsing confidential (and many people persist in calling SSL), is getting its first major overhaul with the introduction of TLS 1.3. Last year, Cloudflare was the first major provider to support <a href="/introducing-tls-1-3/">TLS 1.3</a> by default on the server side. We expected the client side would follow suit and be enabled in all major browsers soon thereafter. It has been over a year since Cloudflare’s TLS 1.3 launch and still, none of the major browsers have enabled TLS 1.3 by default.</p><p>The reductive answer to why TLS 1.3 hasn’t been deployed yet is <i>middleboxes</i>: network appliances designed to monitor and sometimes intercept HTTPS traffic inside corporate environments and mobile networks. Some of these middleboxes implemented TLS 1.2 incorrectly and now that’s blocking browsers from releasing TLS 1.3. However, simply blaming network appliance vendors would be disingenuous. The deeper truth of the story is that TLS 1.3, as it was originally designed, was incompatible with the way the Internet has evolved over time. How and why this happened is the multifaceted question I will be exploring in this blog post.</p><p>To help support this discussion with data, we built a tool to help check if your network is compatible with TLS 1.3:<a href="https://tls13.mitm.watch/">https://tls13.mitm.watch/</a></p>
    <div>
      <h3>How version negotiation used to work in TLS</h3>
      <a href="#how-version-negotiation-used-to-work-in-tls">
        
      </a>
    </div>
    <p>The Transport Layer Security protocol, TLS, is the workhorse that enables secure web browsing with HTTPS. The TLS protocol was adapted from an earlier protocol, Secure Sockets Layer (SSL), in the late 1990s. TLS currently has three versions: 1.0, 1.1 and 1.2. The protocol is very flexible and can evolve over time in different ways. Minor changes can be incorporated as <a href="https://blog.susanka.eu/what-are-tls-extensions/">“extensions”</a> (such as OCSP and Certificate Transparency) while larger and more fundamental changes often require a new version. TLS 1.3 is by far the largest change to the protocol in its history, completely revamping the cryptography and introducing features like <a href="/introducing-0-rtt/">0-RTT</a>.</p><p>Not every client and server support the same version of TLS—that would make it impossible to upgrade the protocol—so most support multiple versions simultaneously. In order to agree on a common version for a connection, clients and servers negotiate. TLS version negotiation is very simple. The client tells the server the <i>newest</i> version of the protocol that it supports and the server replies with the newest version of the protocol that they both support.</p><p>Versions in TLS are represented as two-byte values. Since TLS was adapted from SSLv3, the literal version numbers used in the protocol are just increments of the minor version:</p><ul><li><p>SSLv3 is 3.0</p></li><li><p>TLS 1.0 is 3.1</p></li><li><p>TLS 1.1 is 3.2</p></li><li><p>TLS 1.2 is 3.3</p></li></ul><p>When connecting to a server with TLS, the client sends its highest supported version at the beginning of the connection:</p>
            <pre><code>(3, 3) → server</code></pre>
            <p>A server that understands the same version can reply back with a message starting with the same version bytes.</p>
            <pre><code>(3, 3) → server
client ← (3, 3)</code></pre>
            <p>Or, if the server only knows an older version of the protocol, it can reply with an older version. For example, if the server only speaks TLS 1.0, it can reply with:</p>
            <pre><code>(3, 3) → server
client ← (3, 1)</code></pre>
    <p>If the client supports the version returned by the server, then they continue using that version of TLS and establish a secure connection. If the client and server don’t share a common version, the connection fails.</p><p>This negotiation was designed to be future-compatible. If a client sends a higher version than the server supports, the server should still be able to reply with whatever version the server supports. For example, if a client sends (3, 8) to a modern-day TLS 1.2-capable server, it should just reply back with (3, 3) and the handshake will continue as TLS 1.2.</p><p>Pretty simple, right? As it turns out, some servers didn’t implement this correctly, and this led to a chain of events that exposed web users to a serious security vulnerability.</p>
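<p>The intended server-side negotiation logic can be sketched in a few lines (illustrative Python, with versions written as the two-byte tuples used above):</p>

```python
# Sketch of pre-TLS-1.3 version negotiation as it was designed to work:
# the server selects the newest version both sides support, i.e. the
# minimum of the client's offered version and its own highest version.

SERVER_MAX = (3, 3)   # a TLS 1.2-capable server

def negotiate(client_version, server_max=SERVER_MAX):
    return min(client_version, server_max)

negotiate((3, 3))                     # (3, 3): both speak TLS 1.2
negotiate((3, 8))                     # (3, 3): a future client still gets TLS 1.2
negotiate((3, 3), server_max=(3, 1))  # (3, 1): server only speaks TLS 1.0
```

The faulty servers described next simply disconnected on unknown client versions instead of applying this rule.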
    <div>
      <h3>POODLE and the downgrade dance</h3>
      <a href="#poodle-and-the-downgrade-dance">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4fK2Yxs4OrlHhnpuqcLms4/b04f3af721d9c19fb5f19f40a31f4360/poodle-6102_960_720.jpg" />
            
            </figure><p><a href="https://pixabay.com/en/poodle-metal-iron-silhouette-6102/">CC0 Creative Commons</a></p><p>The last major upgrade to TLS was the introduction of TLS 1.2. During the roll-out of TLS 1.2 in browsers, it was found that some TLS 1.0 servers did not implement version negotiation correctly. When a client connected with a TLS connection advertising support for TLS 1.2, the faulty servers would disconnect instead of negotiating a version of TLS that they understood (like TLS 1.0).</p><p>Browsers had three options to deal with this situation:</p><ol><li><p>enable TLS 1.2, and a percentage of websites would stop working</p></li><li><p>delay the deployment of TLS 1.2 until these servers are fixed</p></li><li><p>retry with an older version of TLS if the connection fails</p></li></ol><p>One expectation that people have about their browsers is that when they are updated, websites keep working. The misbehaving servers were far too numerous to just break with an update, eliminating option 1). By this point, TLS 1.2 had been around for a year and servers were still broken; waiting longer wasn’t going to solve the situation, eliminating option 2). This left option 3) as the only viable choice.</p><p>To both enable TLS 1.2 and keep their users happy, most browsers implemented what’s called an “insecure downgrade”. When faced with a connection failure when connecting to a site, they would try again with TLS 1.1, then with TLS 1.0, then if that failed, SSLv3. This downgrade logic “fixed” these broken servers... at the cost of slow connection establishment.</p><p>However, insecure downgrades are called <i>insecure</i> for a reason. Client downgrades are triggered by a specific type of network failure, one that can be easily spoofed. From the client’s perspective, there’s no way to tell if this failure was caused by a faulty server or by an attacker on the network path of the connection. 
This means that network attackers can inject fake network failures and trick a client into connecting to a server with SSLv3, even if both support a newer protocol. At this point, there were no severe publicly-known vulnerabilities in SSLv3, so this didn’t seem like a big problem. Then POODLE happened.</p><p>In October 2014, Bodo Möller published <a href="https://security.googleblog.com/2014/10/this-poodle-bites-exploiting-ssl-30.html">POODLE</a>, a serious vulnerability in the SSLv3 protocol that allows an in-network attacker to reveal encrypted information with minimal effort. Because TLS 1.0 was so widely deployed on both clients and servers, very few connections on the web used SSLv3, which should have kept them safe. However, the insecure downgrade feature made it possible for attackers to downgrade any connection to SSLv3 if both parties supported it (which most of them did). The availability of this “downgrade dance” vector turned the risk posed by POODLE from a curiosity into a serious issue affecting most of the web.</p>
    <div>
      <h3>Fixing POODLE and removing insecure downgrade</h3>
      <a href="#fixing-poodle-and-removing-insecure-downgrade">
        
      </a>
    </div>
    <p>The fixes for POODLE were:</p><ol><li><p>disable SSLv3 on the client or the server</p></li><li><p>enable a new TLS feature called <a href="https://tools.ietf.org/html/rfc7507">SCSV</a>.</p></li></ol><p>SCSV was conveniently proposed by Bodo Möller and Adam Langley a few months before POODLE. Simply put, with SCSV the client adds an indicator, called a <i>downgrade canary</i>, into its client hello message if it is retrying due to a network error. If the server supports a newer version than the one advertised in the client hello and sees this canary, it can assume a downgrade attack is happening and close the connection. This is a nice feature, but it is optional and requires both clients and servers to update, leaving many web users exposed.</p><p>Browsers immediately removed support for SSLv3, with very little impact other than breaking some SSLv3-only websites. Users with older browsers had to depend on web servers disabling SSLv3. Cloudflare <a href="/sslv3-support-disabled-by-default-due-to-vulnerability/">did this immediately</a> for its customers, and so did most sites, but even in late 2017 <a href="https://www.ssllabs.com/ssl-pulse/">over 10% of sites measured by SSL Pulse still support SSLv3</a>.</p><p>Turning off SSLv3 was a feasible solution to POODLE because SSLv3 was not critical to the Internet. This raises the question: what happens if there’s a serious vulnerability in TLS 1.0? TLS 1.0 is still very widely used and depended on; turning it off in the browser would lock out around 10% of users. Also, despite SCSV being a nice solution to insecure downgrades, <a href="https://www.ssllabs.com/ssl-pulse/">many servers don’t support it</a>. 
The only option to ensure the safety of users against a future issue in TLS 1.0 is to disable insecure fallback.</p><p>After several years of bad performance due to the client having to reconnect multiple times, the majority of the websites that did not implement version negotiation correctly fixed their servers. This made it possible for some browsers to remove this insecure fallback: <a href="https://www.mozilla.org/en-US/firefox/37.0/releasenotes/">Firefox in 2015</a>, and <a href="https://developers.google.com/web/updates/2016/03/chrome-50-deprecations#remove_insecure_tls_version_fallback">Chrome in 2016</a>. Because of this, Chrome and Firefox users are now in a safer position in the event of a new TLS 1.0 vulnerability.</p>
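<p>The SCSV check described above can be sketched as follows (illustrative Python; the TLS_FALLBACK_SCSV cipher suite value and the inappropriate_fallback alert come from RFC 7507, but the function itself is a simplification):</p>

```python
# Sketch of the downgrade-canary check: a client retrying at a lower
# version adds TLS_FALLBACK_SCSV to its cipher suite list. A server that
# supports something newer than the offered version treats the canary as
# evidence of a downgrade attack and aborts the handshake.

TLS_FALLBACK_SCSV = 0x5600   # cipher suite value from RFC 7507
SERVER_MAX = (3, 3)          # server supports up to TLS 1.2

def accept_hello(client_version, cipher_suites, server_max=SERVER_MAX):
    if TLS_FALLBACK_SCSV in cipher_suites and client_version < server_max:
        return "inappropriate_fallback"   # alert: likely downgrade attack
    return min(client_version, server_max)

accept_hello((3, 3), [0xC02F])                     # (3, 3): normal handshake
accept_hello((3, 0), [0xC02F, TLS_FALLBACK_SCSV])  # alert: SSLv3 retry blocked
```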
    <div>
      <h3>Introducing a new version (again)</h3>
      <a href="#introducing-a-new-version-again">
        
      </a>
    </div>
    <p>Designing a protocol for future compatibility is hard; it’s easy to make mistakes if there isn’t a feedback mechanism. This is the case with TLS version negotiation: nothing breaks if you implement it wrong, just hypothetical future versions. There is no signal to developers that an implementation is flawed, and so mistakes can happen without being noticed. That is, until a new version of the protocol is deployed and your implementation fails, but by then the code is deployed and it could take years for everyone to upgrade. The fact that some server implementations failed to handle a “future” version of TLS correctly should have been expected.</p><p>TLS version negotiation, though simple to explain, is actually an example of a protocol design antipattern. It demonstrates a phenomenon in protocol design called <i>ossification</i>. If a protocol is designed with a flexible structure, but that flexibility is never used in practice, some implementation is going to assume it is constant. Adam Langley <a href="https://www.imperialviolet.org/2014/02/27/tlssymmetriccrypto.html">compares the phenomenon to rust</a>. If you rarely open a door, its hinge is more likely to rust shut. Protocol negotiation in TLS is such a hinge.</p><p>Around the time of TLS 1.2’s deployment, the discussions around designing an ambitious new version of TLS were beginning. It was going to be called TLS 1.3, and the version number was naturally chosen as 3.4 (or (3, 4)). By mid-2016, the TLS 1.3 draft had been through 15 iterations, and the version number and negotiation were basically set. At this point, browsers had removed the insecure fallback. It was assumed that servers that weren’t able to do TLS version negotiation correctly had already learned the lessons of POODLE and finally implemented version negotiation correctly. This turned out to not be the case. 
Again.</p><p>When presented with a client hello with version 3.4, a large percentage of TLS 1.2-capable servers would disconnect instead of replying with 3.3. Internet scans by Hanno Böck, David Benjamin, SSL Labs, and others confirmed that the failure rate for TLS 1.3 was very high, over 3% in many measurements. This was the exact same situation faced during the upgrade to TLS 1.2. History was repeating itself.</p><p>This unexpected setback caused a crisis of sorts for the people involved in the protocol’s design. Browsers did not want to re-enable the insecure downgrade and fight the uphill battle of oiling the protocol negotiation joint again for the next half-decade. But without a downgrade, using TLS 1.3 as written would break 3% of the Internet for their users. What could be done?</p><p>The controversial choice was to accept a proposal from David Benjamin to make the first TLS 1.3 message (the client hello) look like TLS 1.2. The version number from the client was changed back to (3, 3) and a new “version” extension was introduced with the list of supported versions inside. The server would return a server hello starting with (3, 4) if TLS 1.3 was supported, and (3, 3) or earlier otherwise. Draft 16 of TLS 1.3 contained this new and “improved” protocol negotiation logic.</p><p>And this worked. Servers for the most part were tolerant of this change and easily fell back to TLS 1.2, ignoring the new extension. But this was not the end of the story. Ossification is a fickle phenomenon. It not only affects clients and servers, but also everything on the network that interacts with a protocol.</p>
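<p>The Draft 16 negotiation change can be sketched like this (illustrative Python; real wire-format parsing and extension encoding are omitted, and the function names are hypothetical):</p>

```python
# Sketch of the Draft 16 approach: the client hello pins its legacy
# version field to (3, 3) and lists real support in a "supported versions"
# extension. A TLS 1.3-aware server picks the highest mutual version from
# that list; an older server never parses the extension and negotiates on
# the legacy field alone (modeled here by passing supported_versions=None).

TLS12, TLS13 = (3, 3), (3, 4)

def select_version(legacy_version, supported_versions, server_versions):
    if supported_versions:                       # TLS 1.3-aware server path
        mutual = set(supported_versions) & set(server_versions)
        if mutual:
            return max(mutual)
    return min(legacy_version, max(server_versions))  # old-style negotiation

select_version(TLS12, [TLS13, TLS12], [TLS13, TLS12])  # (3, 4): TLS 1.3 chosen
select_version(TLS12, [TLS13, TLS12], [TLS12])         # (3, 3): graceful fallback
```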
    <div>
      <h3>The real world is full of middleboxes</h3>
      <a href="#the-real-world-is-full-of-middleboxes">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/47fsWCuUoGyhvHj0mKoSxg/ae37b8a52bfff83574e2b59ce7d6481e/Ramen_Box_Wall_at_Anime_Midwest_2014.jpg" />
            
            </figure><p><a href="https://en.wikipedia.org/wiki/File:Ramen_Box_Wall_at_Anime_Midwest_2014.jpg">CC BY-SA 3.0</a></p><p>Readers of our blog may remember a post from earlier this year on <a href="/understanding-the-prevalence-of-web-traffic-interception/">HTTPS interception</a>. In it, we discussed a study that measured how many “secure” HTTPS connections were actually intercepted and decrypted by inspection software or hardware somewhere in between the browser and the web server. There are also passive inspection middleboxes that parse TLS and either block or divert connections based on visible data, such as ISPs that use hostname information (<a href="https://en.wikipedia.org/wiki/Server_Name_Indication">SNI</a>) from TLS connections to block “banned” websites.</p><p>In order to inspect traffic, these network appliances need to implement some or all of TLS. Every new device that supports TLS introduces a TLS implementation that makes assumptions about how the protocol should act. The more implementations there are, the more likely it is that ossification occurs.</p><p>In February of 2017, both Chrome and Firefox started enabling TLS 1.3 for a percentage of their users. The results were unexpectedly horrible. A much higher percentage of TLS 1.3 connections were failing than expected. For some users, no matter what the website, TLS 1.2 worked but TLS 1.3 did not.</p><p>Success rates for TLS 1.3 Draft 18</p>
            <pre><code>Firefox &amp; Cloudflare
97.8% for TLS 1.2
96.1% for TLS 1.3

Chrome &amp; Gmail
98.3% for TLS 1.2
92.3% for TLS 1.3</code></pre>
            <p>After some investigation, it was found that some widely deployed middleboxes, both of the intercepting and passive variety, were causing connections to fail.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/69FNZ8lQO5MBFxPjmGPYfm/8d337aa54dd37e517bbf8be988022f49/image1-1.png" />
            
            </figure><p>Because TLS has generally looked the same throughout its history, some network appliances made assumptions about how the protocol would evolve over time. When faced with a new version that violated these assumptions, these appliances failed in various ways.</p><p>Some features of TLS that were changed in TLS 1.3 were merely cosmetic. Things like the ChangeCipherSpec, session_id, and compression fields that had been part of the protocol since SSLv3 were removed. These fields turned out to be treated as essential features of TLS by some of these middleboxes, and removing them caused connection failures to skyrocket.</p><p>If a protocol is in use for a long enough time with a similar enough format, people building tools around that protocol will make assumptions about that format being constant. This is often not an intentional choice by developers, but an unintended consequence of how a protocol is used in practice. Developers of network devices may not understand every protocol used on the internet, so they often test against what they see on the network. If a part of a protocol that is supposed to be flexible never changes in practice, someone will assume it is a constant. The more implementations are created, the more likely this is to happen.</p><p>It would be disingenuous to put all of the blame for this on the specific implementers of these middleboxes. Yes, they created faulty implementations of TLS, but another way to think about it is that the original design of TLS lent itself to this type of failure. Implementers implement to the reality of the protocol, not the intention of the protocol’s designer or the text of the specification. In complex ecosystems with multiple implementers, unused joints rust shut.</p><p>Removing features that have been part of a protocol for 20 years and expecting it to simply “work” was wishful thinking.</p>
    <div>
      <h3>Making TLS 1.3 work</h3>
      <a href="#making-tls-1-3-work">
        
      </a>
    </div>
    <p>At the IETF meeting in Singapore last month, a set of changes to TLS 1.3 was proposed to help resolve this issue. These changes were based on an idea from Kyle Nekritz of Facebook: make TLS 1.3 look like TLS 1.2 to middleboxes. They re-introduce many of the parts of the protocol that were removed (session_id, ChangeCipherSpec, an empty compression field), along with some other tweaks that make TLS 1.3 look like TLS 1.2 in all the ways that matter to the broken middleboxes.</p><p>Several iterations of these changes were developed by <a href="/make-ssl-boring-again/">BoringSSL</a> and tested in Chrome over a series of months. Facebook performed similar experiments, and the two teams converged on the same set of changes.</p><p>Chrome Experiment Success Rates</p>
            <pre><code>TLS 1.2 - 98.6%
Experimental changes (PR 1091 on Github) - 98.8%</code></pre>
            <p>Firefox Experiment Success Rates</p>
            <pre><code>TLS 1.2 - 98.42%
Experimental changes (PR 1091 on Github) - 98.37%</code></pre>
            <p>These experiments showed that it was possible to modify TLS 1.3 to be compatible with middleboxes. They also demonstrated the ossification phenomenon. As we described in a previous section, the only part of the client hello that could have rusted shut, the version negotiation, did rust shut. This resulted in Draft 16, which moved version negotiation to an extension. As intermediaries between the client and the server, middleboxes also care about the server hello message. This message had many more hinges that were thought to be flexible but turned out not to be; almost all of them had rusted shut. The new “middlebox-friendly” changes took this reality into account. These experimental changes were incorporated into the specification in <a href="https://tools.ietf.org/html/draft-ietf-tls-tls13-22">TLS 1.3 Draft 22</a>.</p>
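The shape of this change can be sketched at the wire level. The following is a rough, illustrative Python fragment, not a real TLS implementation: the legacy version field stays pinned at TLS 1.2 (0x0303) and a non-empty session_id is sent, while the real version offer lives in the supported_versions extension (type 43), which is where Draft 16 moved version negotiation.

```python
import os
import struct

def build_compat_client_hello_fields():
    """Sketch of the "middlebox-friendly" TLS 1.3 client hello fields:
    the wire version claims TLS 1.2, and the real versions live in the
    supported_versions extension (type 43)."""
    legacy_version = 0x0303             # frozen at TLS 1.2 for middleboxes
    legacy_session_id = os.urandom(32)  # non-empty, mimicking TLS 1.2
    versions = [0x0304, 0x0303]         # offer TLS 1.3, fall back to TLS 1.2
    body = bytes([2 * len(versions)]) + b"".join(
        struct.pack("!H", v) for v in versions)
    extension = struct.pack("!HH", 43, len(body)) + body
    return legacy_version, legacy_session_id, extension

def parse_supported_versions(ext):
    """Read the version list back out of a supported_versions extension."""
    ext_type, length = struct.unpack("!HH", ext[:4])
    assert ext_type == 43 and length == len(ext) - 4
    count = ext[4] // 2
    return [struct.unpack("!H", ext[5 + 2 * i:7 + 2 * i])[0]
            for i in range(count)]
```

A middlebox inspecting only the legacy fields sees something indistinguishable from TLS 1.2, while a TLS 1.3 server reads the extension.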
    <div>
      <h3>Making sure this doesn’t happen again</h3>
      <a href="#making-sure-this-doesnt-happen-again">
        
      </a>
    </div>
    <p>The original protocol negotiation mechanism is unrecoverably burnt. That means it likely can’t be used in a future version of TLS without significant breakage. However, many of the other protocol negotiation features are still flexible, such as ciphersuite selection and extensions. It would be great to keep them that way.</p><p>Last year, Adam Langley wrote a great blog post about cryptographic protocol design (<a href="https://www.imperialviolet.org/2016/05/16/agility.html">https://www.imperialviolet.org/2016/05/16/agility.html</a>) that covers similar ground to this blog post. In his post, he proposes the adage “have one joint and keep it well oiled.” This is great advice for protocol designers. Ossification is real, as we have seen in TLS 1.3.</p><p>Along these lines, David Benjamin proposed a way to keep the most important joints in TLS oiled. His <a href="https://tools.ietf.org/html/draft-ietf-tls-grease-00">GREASE proposal</a> for TLS is designed to throw in random values where a protocol should be tolerant of new values. If popular implementations intersperse unknown ciphers, extensions, and versions in real-world deployments, then implementers will be forced to handle them correctly. GREASE is like WD-40 for the Internet.</p><p>One thing to note is that GREASE is intended to prevent servers from ossifying, not middleboxes, so there is still potential for more types of greasing to happen in TLS.</p><p>Even with GREASE, some servers were only found to be intolerant to TLS 1.3 as late as December 2017. GREASE only identifies servers that are intolerant to unknown extensions, but some servers may still be intolerant to specific extension ids. For example, RSA's BSAFE library used the extension id 40 for an experimental extension called "extended_random", associated with a <a href="/how-the-nsa-may-have-put-a-backdoor-in-rsas-cryptography-a-technical-primer/">theorized NSA backdoor</a>. 
Extension id 40 happens to be the extension id used for TLS 1.3 key shares. David Benjamin <a href="https://www.ietf.org/mail-archive/web/tls/current/msg25168.html">reported</a> that this library is still in use by some printers, which causes them to be TLS 1.3 intolerant. Matthew Green has a <a href="https://blog.cryptographyengineering.com/2017/12/19/the-strange-story-of-extended-random/">detailed write-up</a> of this compatibility issue.</p>
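The GREASE draft reserves sixteen two-byte code points for this purpose: values whose two bytes are equal and whose low nibbles are 0xA (0x0A0A, 0x1A1A, and so on up to 0xFAFA). A small sketch of how a client might generate and sprinkle one into a handshake:

```python
import random

# The 16 two-byte GREASE values reserved for cipher suites and
# extension types: both bytes equal, each ending in the nibble 0xA.
GREASE_VALUES = [(i << 12) | 0x0A00 | (i << 4) | 0x0A for i in range(16)]

def random_grease_value(rng=random):
    """Pick one GREASE value to intersperse among real code points,
    forcing peers to tolerate unknown values instead of rusting shut."""
    return rng.choice(GREASE_VALUES)

def is_grease(value):
    """True if a two-byte code point matches the GREASE pattern."""
    return value & 0x0F0F == 0x0A0A and (value >> 8) == (value & 0xFF)
```

Because these values are reserved and never assigned real meaning, a peer that chokes on one is, by definition, intolerant of unknown values.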
    <div>
      <h3>Help us understand the issue</h3>
      <a href="#help-us-understand-the-issue">
        
      </a>
    </div>
    <p>Cloudflare has been working with the Mozilla Firefox team to help measure this phenomenon, and Google and Facebook have been doing their own measurements. These experiments are hard to perform because the developers need to get protocol variants into browsers, which can take the entire release cycle of the browser (often months) to get into the hands of the users seeing issues. Cloudflare now supports the latest (hopefully) middlebox-compatible TLS 1.3 draft version (draft 22), but there’s always a chance we find a middlebox that is incompatible with this version.</p><p>To sidestep the browser release cycle, we took a shortcut. We <a href="https://tls13.mitm.watch">built a website</a> that you can use to see if TLS 1.3 works from your browser’s vantage point. This test was built by our Crypto intern, Peter Wu. It uses Adobe Flash, because that’s the only widespread cross-platform way to get access to raw sockets from a browser.</p><p>How it works:</p><ul><li><p>We cross-compiled our Golang-based TLS 1.3 client library (tls-tris) to JavaScript</p></li><li><p>We built a JavaScript library (called jssock) that connects tls-tris to the low-level socket interface exposed through Adobe Flash</p></li><li><p>We connect to a remote server using TLS 1.2 and TLS 1.3 and compare the results</p></li><li><p>If there is a mismatch, we gather information from the connection on both sides and send it back for analysis</p></li></ul><p>If you see a failure, <a>let us know</a>! If you’re in a corporate environment, share the middlebox information; if you’re on a residential network, tell us who your ISP is.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7Axq62MNEuKIwQmxQtCTi6/d91d0f0e56f705c4f89271f3e3776b26/Screen-Shot-2017-12-26-at-11.14.28-AM.jpg" />
            
            </figure><p>We’re excited for TLS 1.3 to finally be enabled by default in browsers. This experiment will hopefully help prove that the latest changes make it safe for users to upgrade.</p> ]]></content:encoded>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[TLS 1.3]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">5kv7ooxy0pFmxIAkvDxFsB</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare supports Privacy Pass]]></title>
            <link>https://blog.cloudflare.com/cloudflare-supports-privacy-pass/</link>
            <pubDate>Thu, 09 Nov 2017 16:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare supports Privacy Pass, a recently-announced privacy-preserving protocol developed in collaboration with researchers from Royal Holloway and the University of Waterloo.  ]]></description>
            <content:encoded><![CDATA[ <p></p>
    <div>
      <h3>Enabling anonymous access to the web with privacy-preserving cryptography</h3>
      <a href="#enabling-anonymous-access-to-the-web-with-privacy-preserving-cryptography">
        
      </a>
    </div>
    <p>Cloudflare supports Privacy Pass, a <a href="https://medium.com/@alxdavids/privacy-pass-6f0acf075288">recently-announced</a> privacy-preserving protocol developed in collaboration <a href="https://privacypass.github.io">with researchers from Royal Holloway and the University of Waterloo</a>. Privacy Pass leverages an idea from cryptography — zero-knowledge proofs — to let users prove their identity across multiple sites anonymously without enabling tracking. Users can now use the Privacy Pass browser extension to reduce the number of challenge pages presented by Cloudflare. We are happy to support this protocol and believe that it will help improve the browsing experience for some of the Internet’s least privileged users.</p><p>The Privacy Pass extension is available for both <a href="https://chrome.google.com/webstore/detail/privacy-pass/ajhmfdgkijocedmfjonnpjfojldioehi">Chrome</a> and <a href="https://addons.mozilla.org/en-US/firefox/addon/privacy-pass/">Firefox</a>. When people use anonymity services or shared IPs, it makes it more difficult for <a href="https://www.cloudflare.com/learning/security/how-to-secure-a-website/">website protection services</a> like Cloudflare to identify their requests as coming from legitimate users and not bots. Privacy Pass helps reduce the friction for these users—which include some of the most vulnerable users online—by providing them a way to prove that they are a human across multiple sites on the Cloudflare network. This is done without revealing their identity, and without exposing Cloudflare customers to additional threats from malicious bots. As the first service to support Privacy Pass, we hope to help demonstrate its usefulness and encourage more Internet services to adopt it.</p><p>Adding support for Privacy Pass is part of a broader initiative to help make the Internet accessible to as many people as possible. 
Because Privacy Pass will only be used by a small subset of users, we are also working on other improvements to our network in service of this goal. For example, we are making improvements in our request categorization logic to better identify bots and to improve the web experience for legitimate users who are negatively affected by Cloudflare’s current bot protection algorithms. As this system improves, users should see fewer challenges and site operators should see fewer requests from unwanted bots. We consider Privacy Pass a piece of this puzzle.</p><p>Privacy Pass is fully open source under a BSD license and the code is available <a href="https://github.com/privacypass/challenge-bypass-extension">on GitHub</a>. We encourage anyone who is interested to download the source code, play around with the implementations and contribute to the project. The Privacy Pass team has also open sourced a <a href="https://github.com/privacypass/challenge-bypass-server">reference implementation of the server</a> in Go if you want to test both sides of the system. Privacy Pass support at Cloudflare is currently in beta. If you find a bug, please let the team know by creating an issue on GitHub.</p><p>In this blog post I'll be going into depth about the problems that motivated our support for this project and how you can use it to reduce the annoyance factor of CAPTCHAs and other user challenges online.</p>
    <div>
      <h3>Enabling universal access to content</h3>
      <a href="#enabling-universal-access-to-content">
        
      </a>
    </div>
    <p>Cloudflare believes that the <a href="/ensuring-that-the-web-is-for-everyone/">web is for everyone</a>. This includes people who are accessing the web anonymously or through shared infrastructure. Tools like VPNs are useful for protecting your identity online, and people using these tools should have the same access as everyone else. We believe the vast collection of information and services that make up the Internet should be available to every person.</p><p>In a <a href="/the-trouble-with-tor/">blog post last year</a>, our CEO, Matthew Prince, spoke about the tension between security, anonymity, and convenience on the Internet. He posited that in order to secure a website or service while still allowing anonymous visitors, you have to sacrifice a bit of convenience for these users. This tradeoff is something that every website or web service has to make.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1rTJ4tISNkUI4x5SxAZIWU/3a9ad7898fa4811504aeb44db6b168d2/image5.jpg" />
            
            </figure><p>The Internet is full of bad actors. The frequency and severity of online attacks is <a href="http://techspective.net/2017/08/12/latest-ddos-trends-learning-experts/">rising every year</a>. This turbulent environment not only threatens websites and web services with attacks, it threatens their ability to stay online. As smaller and more diverse sites become targets of anonymous threats, a greater percentage of the Internet will choose to sacrifice user convenience in order to stay secure and universally accessible.</p><p>The average Internet user visits dozens of sites and services every day. Jumping through a hoop or two when trying to access a single website is not that big of a problem for people. Having to do that for every site you visit every day can be exhausting. This is the problem that Privacy Pass is perfectly designed to solve.</p><p>Privacy Pass doesn’t completely eliminate this inconvenience. Matthew’s trilemma still applies: anonymous users are still inconvenienced for sites that want security. What Privacy Pass does is to notably reduce that inconvenience for users with access to a browser. Instead of having to be inconvenienced thirty times to visit thirty different domains, you only have to be inconvenienced once to gain access to thirty domains on the Cloudflare network. Crucially, unlike unauthorized services like <a href="https://addons.mozilla.org/firefox/addon/cloudhole/">CloudHole</a>, Privacy Pass is designed to respect user privacy and anonymity. This is done using privacy-preserving cryptography, which prevents Cloudflare or anyone else from tracking a user’s browsing across sites. Before we go into how this works, let’s take a step back and take a look at why this is necessary.</p>
    <div>
      <h3>Am I a bot or not?</h3>
      <a href="#am-i-a-bot-or-not">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/43zt4JxSv0HW37HA0mfbpj/35736e017d0903dc6c0a89e135635e67/image2.jpg" />
            
            </figure><p><a href="https://commons.wikimedia.org/wiki/File:Metal_House_Battery_Operated_New_2010_Robots_You_are_Three_Times_a_Robot~~.jpg">D J Shin</a> Creative Commons Attribution-Share Alike 3.0 Unported</p><p>Without explicit information about the identity of a user, a web server has to rely on fuzzy signals to guess which request is from a bot and which is from a human. For example, bots often use automated scripts instead of web browsers to do their crawling. The way in which scripts make web requests is often different than how web browsers would make the same request in subtle ways.</p><p>A simple way for a user to prove they are not a bot to a website is by logging in. By providing valid authentication credentials tied to a long-term identity, a user is exchanging their anonymity for convenience. Having valid authentication credentials is a strong signal that a request is not from a bot. Typically, if you authenticate yourself to a website (say by entering your username and password) the website sets what’s called a “cookie”. A cookie is just a piece of data with an expiration date that’s stored by the browser. As long as the cookie hasn’t expired, the browser includes it as part of the subsequent requests to the server that set it. Authentication cookies are what websites use to know whether you’re logged in or not. Cookies are only sent on the domain that set them. A cookie set by site1.com is not sent for requests to site2.com. This prevents identity leakage from one site to another.</p><p>A request with an authentication cookie is usually not from a bot, so bot detection is much easier for sites that require authentication. Authentication is by definition de-anonymizing, so putting this in terms of Matthew’s trilemma, these sites can have security and convenience because they provide no anonymous access. 
The web would be a very different place if every website required authentication to display content, so this signal can only be used for a small set of sites. The question for the rest of the Internet becomes: without authentication cookies, what else can be used as a signal that a user is a person and not a bot?</p>
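The per-domain cookie scoping described above can be sketched with a toy cookie jar. This illustrates only the isolation property (a site1.com cookie is never sent to site2.com), not the full cookie-matching rules browsers implement:

```python
class CookieJar:
    """Toy cookie jar: cookies are keyed by the domain that set them
    and are only sent back to that same domain."""
    def __init__(self):
        self._by_domain = {}

    def set_cookie(self, domain, name, value):
        self._by_domain.setdefault(domain, {})[name] = value

    def cookies_for(self, domain):
        # A site1.com cookie never leaks into a site2.com request.
        return dict(self._by_domain.get(domain, {}))

jar = CookieJar()
jar.set_cookie("site1.com", "session", "alice-logged-in")
```

This isolation is exactly why an authentication cookie can't serve as a cross-site "I am human" signal, which is the gap Privacy Pass fills.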
    <div>
      <h3>The Turing Test</h3>
      <a href="#the-turing-test">
        
      </a>
    </div>
    <p>One thing that can be used is a user challenge: a task that the server asks the user to do before showing content. User challenges can come in many forms, from a <a href="https://en.wikipedia.org/wiki/Proof-of-work_system">proof-of-work</a> to a <a href="https://en.wikipedia.org/w/index.php?title=Guided_tour_puzzle_protocol">guided tour puzzle</a> to the classic CAPTCHA. A CAPTCHA (an acronym for "Completely Automated Public Turing test to tell Computers and Humans Apart") is a test to see if the user is a human or not. It often involves reading some scrambled letters or identifying certain slightly obscured objects — tasks that humans are generally better at than automated programs. The goal of a user challenge is not only to deter bots, but to gain confidence that a visitor is a person. Cloudflare uses a combination of different techniques as user challenges.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2VcN8gtQWsULtIcGMggqDJ/c51be6c6bb97edf8836d04b2542a4f63/image7.jpg" />
            
</figure><p>CAPTCHAs can be annoying and time-consuming to solve, so they are usually reserved for visitors with a high probability of being malicious.</p><p>The challenge system Cloudflare uses is cookie-based. If you solve a challenge correctly, Cloudflare will set a cookie called <code>CF_CLEARANCE</code> for the domain that presented the challenge. Clearance cookies are like authentication cookies, but instead of being tied to an identity, they are tied to the fact that you solved a challenge sometime in the past.</p><ol><li><p>Person sends Request</p></li><li><p>Server responds with a challenge</p></li><li><p>Person sends solution</p></li><li><p>Server responds with <code>set-cookie</code> and bypass cookie</p></li><li><p>Person sends new request with cookie</p></li><li><p>Server responds with content from origin</p></li></ol><p>Site visitors who are able to solve a challenge are much more likely to be people than bots; the harder the challenge, the more likely the visitor is a person. The presence of a valid <code>CF_CLEARANCE</code> cookie is a strong positive signal that a request is from a legitimate person.</p>
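The six-step flow above can be sketched as a toy clearance scheme: after a correct solution, the server mints a signed, expiring token (standing in for <code>CF_CLEARANCE</code>), and later requests are admitted only if the token verifies. The key, token layout, and HMAC construction here are invented for illustration, not Cloudflare's actual implementation:

```python
import hashlib
import hmac
import time

SERVER_KEY = b"server-side secret"  # hypothetical signing key

def issue_clearance(challenge_answer, expected, now=None, ttl=3600):
    """Steps 2-4: verify the solution, then mint a clearance cookie."""
    if challenge_answer != expected:
        return None
    expiry = int(now if now is not None else time.time()) + ttl
    mac = hmac.new(SERVER_KEY, str(expiry).encode(), hashlib.sha256).hexdigest()
    return f"{expiry}.{mac}"

def check_clearance(cookie, now=None):
    """Steps 5-6: admit the request only if the cookie is intact and unexpired."""
    try:
        expiry_str, mac = cookie.split(".")
    except (AttributeError, ValueError):
        return False
    good = hmac.new(SERVER_KEY, expiry_str.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, good):
        return False
    return int(expiry_str) > (now if now is not None else time.time())
```

Note that this token, like the real clearance cookie, is scoped to one domain; making one challenge solution usable across many domains without linkability is what Privacy Pass adds.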
    <div>
      <h3>How Privacy Pass protects your privacy: a voting analogy</h3>
      <a href="#how-privacy-pass-protects-your-privacy-a-voting-analogy">
        
      </a>
    </div>
    <p>You can use cryptography to prove that you have solved a challenge of a certain difficulty without revealing which challenge you solved. The technique that enables this is something called a <a href="https://en.wikipedia.org/wiki/Zero-knowledge_proof">Zero-knowledge proof</a>. This may sound scary, so let’s use a real-world scenario, vote certification, to explain the idea.</p><p>In some voting systems the operators of the voting center certify every ballot before sending them to be counted. This is to prevent people from adding fraudulent ballots while the ballots are being transferred from where the vote takes place to where the vote is counted.</p><p>An obvious mechanism would be to have the certifier sign every ballot that a voter submits. However, this would mean that the certifier, having just seen the person that handed them a ballot, would know how each person voted. Instead, we can use a better mechanism that preserves voters’ privacy using an envelope and some carbon paper.</p><ol><li><p>The voter fills out their ballot</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3jJCdqzHAp2kJrLYM3sYCT/5cfc32ff560877037977e7530faf1929/image6.png" />
            
            </figure></li><li><p>The voter puts their ballot into an envelope along with a piece of carbon paper, and seals the envelope</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7z8pR9zhr9HYM4OE9y2r4f/26fc8bcb4a1e4c92637cfc5b0f6ea0fb/image1.png" />
            
            </figure></li><li><p>The sealed envelope is given to the certifier</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3OhNebQdBXnoTBrBOTcDN9/54ad9db17b0956adc0a973f9a4d56b6b/image3.png" />
            
            </figure></li><li><p>The certifier signs the outside of the envelope. The pressure of the signature transfers the signature from the carbon paper to the ballot itself, effectively signing the ballot.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/fOvx6CrNvdsaPofVLPgPG/c33497b10a9212634b63c0bb349809dc/image8.png" />
            
            </figure></li><li><p>Later, when the ballot counter unseals the envelope, they see the certifier’s signature on the ballot.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4FSldNuKdhhv537vDigZtp/c35bd412ad8c468d9b2068149afef072/image4.png" />
            
            </figure></li></ol><p>With this system, a voting administrator can authenticate a ballot without knowing its content, and then the ballot can be verified by an independent assessor.</p><p>Privacy Pass is like vote certification for the Internet. In this analogy, Cloudflare’s challenge checking service is the vote certifier, Cloudflare’s bot detection service is the vote counter and the anonymous visitor is the voter. When a user encounters a challenge on site A, they put a ballot into a sealed envelope and send it to the server along with the challenge solution. The server then signs the envelope and returns it to the client. Since the server is effectively signing the ballot without knowing its contents, this is called a <i>blind signature</i>.</p><p>When the user sees a challenge on site B, the user takes the ballot out of the envelope and sends it to the server. The server then checks the signature on the ballot, which proves that the user has solved a challenge. Because the server has never seen the contents of the ballot, it doesn’t know which site the challenge was solved for, just that a challenge was solved.</p><p>It turns out that with the right cryptographic construction, you can approximate this scenario digitally. This is the idea behind Privacy Pass.</p><p>The Privacy Pass team implemented this using a privacy-preserving cryptographic construction called an Elliptic Curve Verifiable Oblivious Pseudo-Random Function (EC-VOPRF). Yes, it’s a mouthful. From the Privacy Pass Team:</p><blockquote><p>Every time the Privacy Pass plugin needs a new set of privacy passes, it creates a set of thirty random numbers <code>t1</code> to <code>t30</code>, hashes them into a curve (P-256 in our case), blinds them with a value <code>b</code> and sends them along with a challenge solution. The server returns the set of points multiplied by its private key and a batch discrete logarithm equivalence proof. 
Each pair <code>tn, HMAC(n,M)</code> constitutes a Privacy Pass and can be redeemed to solve a subsequent challenge. Voila!</p></blockquote><p>If none of these words make sense to you and you want to know more, check out the Privacy Pass team’s <a href="https://privacypass.github.io/protocol/">protocol design document</a>.</p>
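The blinding step at the heart of this construction can be illustrated with a toy oblivious PRF in a prime-order subgroup (tiny parameters, no elliptic curve, no DLEQ proof; the real protocol uses P-256 and batch proofs). The client blinds a hashed token with a secret <code>b</code>, the server exponentiates by its key <code>k</code> without ever seeing the token, and the client unblinds to recover the PRF output <code>T^k</code>. Requires Python 3.8+ for the modular inverse via <code>pow</code>:

```python
import hashlib
import random

# Toy group: the order-Q subgroup of Z_P*, with generator G.
# (Real Privacy Pass works over the P-256 elliptic curve instead.)
P, Q, G = 23, 11, 4

def hash_to_group(token: bytes) -> int:
    """Toy stand-in for hashing a token into the group."""
    e = int.from_bytes(hashlib.sha256(token).digest(), "big") % Q
    return pow(G, e + 1, P)

def client_blind(token: bytes):
    """Client hashes the token and blinds it with a random b."""
    T = hash_to_group(token)
    b = random.randrange(1, Q)
    return T, b, pow(T, b, P)           # send B = T^b to the server

def server_evaluate(B: int, k: int) -> int:
    """Server applies its key without learning T: Z = B^k = T^(bk)."""
    return pow(B, k, P)

def client_unblind(Z: int, b: int) -> int:
    """Client strips the blind: Z^(1/b) = T^k."""
    return pow(Z, pow(b, -1, Q), P)

k = 7                                   # server's secret key
T, b, B = client_blind(b"privacy-pass-token")
N = client_unblind(server_evaluate(B, k), b)
```

Because the server only ever sees <code>T^b</code> for a random <code>b</code>, it cannot link the issuance of a pass to its later redemption.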
    <div>
      <h3>Making it work in the browser</h3>
      <a href="#making-it-work-in-the-browser">
        
      </a>
    </div>
    <p>It takes more than a nice security protocol based on solid cryptography to make something useful in the real world. To bring the advantages of this protocol to users, the Privacy Pass team built a client in JavaScript and packaged it using <a href="https://developer.mozilla.org/en-US/Add-ons/WebExtensions/What_are_WebExtensions">WebExtensions</a>, a cross-browser framework for developing applications that run in the browser and modify website behavior. This standard is compatible with both Chrome and Firefox. A reference implementation of the server side of the protocol was <a href="https://github.com/privacypass/challenge-bypass-server">also implemented in Go</a>.</p><p>If you’re a web user and are annoyed by CAPTCHAs, you can download the Privacy Pass extension for Chrome <a href="https://chrome.google.com/webstore/detail/privacy-pass/ajhmfdgkijocedmfjonnpjfojldioehi">here</a> and for Firefox <a href="https://addons.mozilla.org/en-US/firefox/addon/privacy-pass/">here</a>. It will significantly improve your web browsing experience. Once it is installed, you’ll see a small icon on your browser with a number under it. The number is how many unused privacy passes you have. If you are running low on passes, simply click on the icon and select “Get More Passes,” which will load a CAPTCHA you can solve in exchange for thirty passes. Every time you visit a domain that requires a user challenge page to view, Privacy Pass will “spend” a pass and the content will load transparently. Note that you may see more than one pass spent when you load a site for the first time if the site has subresources from multiple domains.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/23gnEcti3z4owPHwH37m6o/09b9335b07c60d4e20b2510159f83440/Firefox-3--2-.png" />
            
            </figure><p>The Privacy Pass extension works by hooking into the browser and looking for HTTP responses that have a specific header that indicates support for the Privacy Pass protocol. When a challenge page is returned, the extension will either try to issue new privacy passes or redeem existing privacy passes. The cryptographic operations in the plugin were built on top of <a href="https://github.com/bitwiseshiftleft/sjcl">SJCL</a>.</p><p>If you’re a Cloudflare customer and want to opt out from supporting Privacy Pass, please <a href="https://support.cloudflare.com">contact our support team</a> and they will disable it for you. We are soon adding a toggle for Privacy Pass in the Firewall app in the Cloudflare dashboard.</p>
    <div>
      <h3>The web is for everyone</h3>
      <a href="#the-web-is-for-everyone">
        
      </a>
    </div>
    <p>The technology behind Privacy Pass is free for anyone to use. We see a bright future for this technology and think it will benefit from community involvement. The protocol is currently only deployed at Cloudflare, but it could easily be used across different organizations. It’s easy to imagine obtaining a Privacy Pass that proves that you have a Twitter or Facebook identity and using it to access other services on the Internet without revealing your identity, for example. There are a wide variety of applications of this technology that extend well beyond our current use cases.</p><p>If this technology is intriguing to you and you want to collaborate, please reach out to the Privacy Pass team on <a href="https://github.com/privacypass">GitHub</a>.</p> ]]></content:encoded>
            <category><![CDATA[Privacy]]></category>
            <category><![CDATA[CAPTCHA]]></category>
            <category><![CDATA[Privacy Pass]]></category>
            <category><![CDATA[Firefox]]></category>
            <category><![CDATA[Chrome]]></category>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">7vBxBfvbpwQEokzxhTdIy6</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Geo Key Manager: How It Works]]></title>
            <link>https://blog.cloudflare.com/geo-key-manager-how-it-works/</link>
            <pubDate>Tue, 26 Sep 2017 13:00:00 GMT</pubDate>
            <description><![CDATA[ Today we announced Geo Key Manager, a feature that gives customers control over where their private keys are stored with Cloudflare. This builds on a previous Cloudflare innovation called Keyless SSL and a novel cryptographic access control mechanism. ]]></description>
            <content:encoded><![CDATA[ <p>Today we announced <a href="/introducing-cloudflare-geo-key-manager">Geo Key Manager</a>, a feature that gives customers unprecedented control over where their private keys are stored when uploaded to Cloudflare. This feature builds on a previous Cloudflare innovation called Keyless SSL and a novel cryptographic access control mechanism based on both identity-based encryption and broadcast encryption. In this post we’ll explain the technical details of this feature, the first of its kind in the industry, and how Cloudflare leveraged its existing network and technologies to build it.</p>
    <div>
      <h3>Keys in different area codes</h3>
      <a href="#keys-in-different-area-codes">
        
      </a>
    </div>
    <p>Cloudflare launched <a href="/announcing-keyless-ssl-all-the-benefits-of-cloudflare-without-having-to-turn-over-your-private-ssl-keys/">Keyless SSL</a> three years ago to wide acclaim. With Keyless SSL, customers are able to take advantage of the full benefits of Cloudflare’s network while keeping their HTTPS private keys inside their own infrastructure. Keyless SSL has been popular with customers in industries with regulations around the control of access to private keys, such as the financial industry. Keyless SSL adoption has been slower outside these regulated industries, partly because it requires customers to run custom software (the key server) inside their infrastructure.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/121VptrQkTUYUPss5HLIWs/1f01586195f9611eb1490f98166bd8ef/image5.png" />
            
            </figure><p></p><p>Standard Configuration</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/W3u9jRPpuvciPzndNJYTE/f99f3e97afb98b68c17447bde600b805/image7.png" />
            
</figure><p></p><p>Keyless SSL</p><p>One of the motivating use cases for Keyless SSL was the expectation that customers may not trust a third party like Cloudflare with their private keys. We found that this is actually very uncommon; most customers actually do trust Cloudflare with their private keys. But we have found that sometimes customers would like a way to reduce the risk associated with having their keys in some physical locations around the world.</p><p>This is where Geo Key Manager is useful: it lets customers limit the exposure of their private keys to certain locations. It’s similar to Keyless SSL, but instead of having to run a key server inside your infrastructure, Cloudflare hosts key servers in the locations of your choosing. This reduces the complexity of deploying Keyless SSL and gives customers the control they care about. Geo Key Manager “just works” with no software required.</p>
    <div>
      <h3>A Keyless SSL Refresher</h3>
      <a href="#a-keyless-ssl-refresher">
        
      </a>
    </div>
    <p>Keyless SSL was developed at Cloudflare to make HTTPS more secure. Content served over HTTPS is both encrypted and authenticated so that eavesdroppers or attackers can’t read or modify it. HTTPS makes use of a protocol called Transport Layer Security (TLS) to keep this data safe.</p><p>TLS has two phases: a handshake phase and a data exchange phase. In the handshake phase, cryptographic keys are exchanged and a shared secret is established. As part of this exchange, the server proves its identity to the client using a certificate and a private key. In the data exchange phase, shared keys from the handshake are used to encrypt and authenticate the data.</p><p>A TLS handshake can be naturally broken down into two components:</p><ol><li><p>The private key operation</p></li><li><p>Everything else</p></li></ol><p>The private key operation is critical to the TLS handshake: it allows the server to prove that it owns a given certificate. Without this private key operation, the client has no way of knowing whether or not the server is authentic. In Keyless SSL, the private key operation is separated from the rest of the handshake. In a Keyless SSL handshake, instead of performing the private key operation locally, the server makes a remote procedure call to another server (called the key server) that controls the private key.</p>
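The split can be sketched with a toy key server: the web server performs "everything else" and delegates only the private key operation. This is a sketch under toy assumptions (textbook RSA with the classic tiny demo modulus 3233 and an in-process call standing in for the remote procedure call; real Keyless SSL uses real keys over an authenticated tunnel):

```python
import hashlib

class ToyKeyServer:
    """Holds the private key; answers only signing requests
    (the single private key operation in the handshake)."""
    def __init__(self, n, d):
        self._n, self._d = n, d

    def sign(self, digest: int) -> int:
        return pow(digest, self._d, self._n)   # textbook RSA, demo only

# Classic small RSA demo parameters (p=61, q=53, n=3233, e=17, d=2753).
N_MOD, E_PUB, D_PRIV = 3233, 17, 2753
key_server = ToyKeyServer(N_MOD, D_PRIV)

def web_server_handshake(transcript: bytes) -> int:
    """'Everything else' happens here; only signing is delegated."""
    digest = int.from_bytes(hashlib.sha256(transcript).digest(), "big") % N_MOD
    return key_server.sign(digest)             # the remote procedure call

def client_verify(transcript: bytes, signature: int) -> bool:
    """Client checks the signature against the server's public key."""
    digest = int.from_bytes(hashlib.sha256(transcript).digest(), "big") % N_MOD
    return pow(signature, E_PUB, N_MOD) == digest
```

The point of the design is visible in the code: a compromise of <code>web_server_handshake</code> never exposes <code>D_PRIV</code>, because the private exponent lives only inside the key server.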
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2qVkp63Sxh2oKSbGYq5S4t/57ce236a21aa8623fc7a7a7c3e9a1030/image8.gif" />
            
            </figure><p>Keyless SSL lets you logically separate concerns so that a compromise of the web server does not result in a compromise of the private key. This additional security comes at a cost. The remote procedure call from the server to the key server can add latency to the handshake, slowing down connection establishment. The additional latency cost corresponds to the round-trip time from the server to the key server, which can be as much as a second if the key server is on the other side of the world.</p><p>Luckily, this latency cost only applies to the first time you connect to a server. Once the handshake is complete, the key server is not involved. Furthermore, if you reconnect to a site you don’t have to pay the latency cost either because resuming a connection with <a href="/tls-session-resumption-full-speed-and-secure/">TLS Session Resumption</a> doesn’t require the private key.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5eRBZiNmYoLOrG7EKJiBtC/589586580fa3559e67aa222047ca7908/geo-key-frame_4x.gif" />
            
            </figure><p>Latency is only added for the initial connection.</p><p>For a deep-dive on this topic, read <a href="/keyless-ssl-the-nitty-gritty-technical-details/">this post I wrote</a> in 2014.</p><p>With Keyless SSL as a basic building block, we set out to design Geo Key Manager.</p>
    <div>
      <h3>Geo Key Manager: designing the architecture</h3>
      <a href="#geo-key-manager-designing-the-architecture">
        
      </a>
    </div>
    <p>Cloudflare has a truly international customer base and we’ve learnt that customers around the world have different regulatory and statutory requirements, and different risk profiles, concerning the placement of their private keys. There’s no one size fits all solution across the planet. With that philosophy in mind, we set out to design a very flexible system for deciding where keys can be kept.</p><p>The first problem to solve was access control. How could we limit the number of locations that a private key is sent to? Cloudflare has a database that takes user settings and distributes them to all edge locations <a href="/kyoto-tycoon-secure-replication/">securely, and very fast</a>. However, this system is optimized to synchronize an entire database worldwide; modifying it to selectively distribute keys to different locations was too big of an architectural change for Cloudflare. Instead, we decided to explore the idea of a cryptographic access control (CAC) system.</p><p>In a CAC system, data is encrypted and distributed everywhere. A piece of data can only be accessed if you have the decryption key. By only sending decryption keys to certain locations, we can effectively limit who has access to data. For example, we could encrypt customer private keys once—right after they’re uploaded—and send the encrypted keys to every location using the existing database replication system.</p><p>We’ve experimented with CAC systems before, most notably with the <a href="https://github.com/cloudflare/redoctober">Red October</a> project. With Red October, you can encrypt a piece of data so that multiple people are required to decrypt it. This is how <a href="/pal-a-container-identity-bootstrapping-tool/">PAL</a>, our Docker Secrets solution works. 
However, the Red October system is ill-suited to Geo Key Manager for a number of reasons:</p><ol><li><p>The more locations you encrypt to, the larger the encrypted key gets</p></li><li><p>There is no way to encrypt a key to “everywhere except a given location” without having to re-encrypt when new locations are added (which we do <a href="/portland/">frequently</a>)</p></li><li><p>There has to be a secure registry of each datacenter’s public key</p></li></ol><p>For Geo Key Manager we wanted something that provides users with granular control and can scale with Cloudflare’s growing network. We came up with the following requirements:</p><ul><li><p>Users can select from a set of pre-defined regions they would like their keys to be in (E.U., U.S., Highest Security, etc.)</p></li><li><p>Users can add specific datacenter locations outside of the chosen regions (e.g. all U.S. locations plus Toronto)</p></li><li><p>Users can choose selected datacenter locations inside their chosen regions that should not receive keys (e.g. all E.U. locations except London)</p></li><li><p>It should be fast to decrypt a key and easy to store it, no matter how complicated the configuration</p></li></ul><p>Building a system to satisfy these requirements gives customers the freedom to decide where their keys are kept and the scalability necessary to be useful with a growing network.</p>
    <div>
      <h3>Identity-based encryption</h3>
      <a href="#identity-based-encryption">
        
      </a>
    </div>
    <p>The cryptographic tool that allows us to satisfy these requirements is called identity-based encryption.</p><p>Unlike traditional public key cryptography, where each party has a public key and a private key, in identity-based encryption, your identity <i>is</i> your public key. This is a very powerful concept, because it allows you to encrypt data to a person without having to obtain their public key ahead of time. Instead of a large random-looking number, a public key can literally be any string, such as “<a>bob@example.com</a>” or “beebob”. Identity-based encryption is useful for email, where you can imagine encrypting a message using a person’s email address as the public key. Compare this to the complexity of using PGP, where you need to find someone’s public key and validate it before you send a message. With identity-based encryption, you can even encrypt data to someone before they have the private key associated with their identity (which is managed by a central key generation authority).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Q2nDn5ijrqWLGMEpqCvtL/8492334a36047cfa46df3a9c041d9e6a/image10.png" />
            
            </figure><p></p><p>Public Key Cryptography (PGP, etc.)</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1tKl78Up3D6ZQcQObkEscl/9052e1d04856d4a690cb368a984c5469/image1-2.png" />
            
            </figure><p></p><p>Identity-based Encryption</p><p>ID-based encryption was proposed by <a href="https://discovery.csc.ncsu.edu/Courses/csc774-S08/reading-assignments/shamir84.pdf">Shamir in the 80s</a>, but it wasn’t fully practical until <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.66.1131">Boneh and Franklin’s proposal</a> in 2001. Since then, a variety of interesting schemes have been discovered and put to use. The underlying cryptographic primitive that makes efficient ID-based encryption possible is called <a href="https://www.math.uwaterloo.ca/~ajmeneze/publications/pairings.pdf">Elliptic Curve Pairings</a>, a topic that deserves its own blog post.</p><p>Our scheme is based on the combination of two primitives:</p><p><b>Identity-Based Broadcast Encryption</b> (IBBE)</p><p><b>Identity-Based Revocation</b> (IBR)</p><p><i>Identity-Based Broadcast Encryption</i> (IBBE) is like an allowlist. It lets you take a piece of data and encrypt it to a set of recipients. The specific construction we use is from a 2007 paper by <a href="https://link.springer.com/content/pdf/10.1007/978-3-540-76900-2_12.pdf">Delerablee</a>. Critically, the size of the ciphertext does not depend on the number of recipients. This means we can efficiently encrypt data without it getting larger no matter how many recipients there are (or PoPs in our case).</p><p><i>Identity-based Revocation</i> (IBR) is like a blocklist. It lets you encrypt a piece of data so that all recipients can decrypt it except for a pre-defined set of recipients who are excluded. The implementation we used was from section 4 of a paper by <a href="https://pdfs.semanticscholar.org/5da9/eaa24ba749f1ae193800b6961a37b88da1de.pdf">Attrapadung et al. from 2011</a>. Again, the ciphertext size does not depend on the number of excluded identities.</p><p>These two primitives can be combined to create a very flexible cryptographic access control scheme.
To do this, create two sets of identities: an identity for each region, and an identity for each datacenter location. Once these sets have been decided, each server is provisioned with the identity-based encryption private keys for its region and its location.</p><p>With this in place, you can configure access to the key in terms of the following sets:</p><ul><li><p>Set of regions to encrypt to</p></li><li><p>Set of locations inside the region to exclude</p></li><li><p>Set of locations outside the region to include</p></li></ul><p>Here’s how you can encrypt a customer key so that a given access control policy (regions, blocklist, allowlist) can be enforced:</p><ol><li><p>Create a key encryption key KEK</p></li><li><p>Split it into two halves, KEK1 and KEK2</p></li><li><p>Encrypt KEK1 with IBBE to include the set of regions (allowlist regions)</p></li><li><p>Encrypt KEK2 with IBR to exclude locations in the regions defined in 3 (blocklist colo)</p></li><li><p>Encrypt KEK with IBBE to include locations not in the regions defined in 3 (allowlist colo)</p></li><li><p>Send all three encrypted keys (KEK1)<sub>region</sub>, (KEK2)<sub>exclude</sub>, (KEK)<sub>include</sub> to all locations</p></li></ol><p>This can be visualized as follows, with:</p><ul><li><p>Regions: U.S. and E.U.</p></li><li><p>Blocklist colos: Dallas and London</p></li><li><p>Allowlist colos: Sydney and Tokyo</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3dEYAkvLTz4wg4vnpSmJ1y/80f8c01cc9ad92f920c1886d14541898/image11.png" />
            
            </figure><p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5A6r4FXSSdko6wNNKA2sFi/d898890a6196d82796618e7245850a45/image4.jpg" />
            
            </figure><p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ZEZXyLDAkN0Gpxr0ils0f/fa28cb9857466912df7c5001c1ee62b2/image2-2.png" />
            
            </figure><p>The locations inside the regions allowlist can decrypt half of the key (KEK1), and need to also be outside the blocklist to decrypt the other half of the key (KEK2). In the example, London and Dallas only have access to KEK1 because they can’t decrypt KEK2. These locations can’t reconstruct KEK and therefore can’t decrypt the private key. Every other location in the E.U. and U.S. can decrypt KEK1 and KEK2 so they can construct KEK to decrypt the private key. Tokyo and Sydney can decrypt the KEK from the allowlist and use it to decrypt the private key.</p><p>This will make the private TLS key available in all of the EU and US except for Dallas and London and it will additionally be available in Tokyo and Sydney. The result is the following map:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/S9frEbHm1U5VAzfqpeTWW/dcb0b031d8d552c0b1322f243b02d360/image3.png" />
            
            </figure><p>This design lets customers choose with per-datacenter granularity where their private keys can be accessed. If someone connects to a datacenter that can’t decrypt the private key, that SSL connection is handled using Keyless SSL, where the key server is located in a location with access to the key. The Geo Key code automatically finds the nearest data center that can access the relevant TLS private key and uses Keyless SSL to handle the TLS handshake with the client.</p>
    <div>
      <h3>Creating a key server network</h3>
      <a href="#creating-a-key-server-network">
        
      </a>
    </div>
    <p>Any time you use Keyless SSL for a new connection, there’s going to be a latency cost for connecting to the key server. With Geo Key Manager, we wanted to reduce this latency cost as much as possible. In practical terms, this means we need to know which key server will respond fastest.</p><p>To solve this, we created an overlay network between all of our datacenters to measure latency. Every datacenter has an outbound connection to every other datacenter. Every few seconds, we send a “ping” (a <a href="https://github.com/cloudflare/gokeyless/blob/d129f600f7c60cc36d9da9ef99eefe430b05a3c4/protocol/protocol.go#L95">message in the Keyless SSL protocol</a>, not an <a href="https://en.wikipedia.org/wiki/Ping_(networking_utility)">ICMP message</a>) and we measure how long it takes the server to send a corresponding “pong”.</p><p>When a client connects to a site behind Cloudflare in a datacenter that can’t decrypt the private key for that site, we use metadata to find out which datacenters have access to the key. We then choose the datacenter that has the lowest latency according to our measurements and use that datacenter’s key server for Keyless SSL. If the location with the lowest latency is overloaded, we may choose another location with higher latency but more capacity.</p><p>The data from these measurements was used to construct the following map, highlighting the additional latency added for visitors around the world for the US-only configuration.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1BbX0LQqopTXlY86GvIpa8/922b82e0e9e38dc3865f3148d9156ce1/image6.png" />
            
            </figure><p>Latency added when keys are in U.S. only. Green: no latency cost, Yellow: &lt;50ms, Red: &gt;100ms</p>
    <div>
      <h3>In conclusion</h3>
      <a href="#in-conclusion">
        
      </a>
    </div>
    <p>We’re constantly innovating to provide our customers with powerful features that are simple to use. With Geo Key Manager, we are leveraging Keyless SSL and 21st century cryptography to improve private key security in an increasingly complicated geo-political climate.</p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Keyless SSL]]></category>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">6n44oy7eH4vHKsbzTZKngz</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[High-reliability OCSP stapling and why it matters]]></title>
            <link>https://blog.cloudflare.com/high-reliability-ocsp-stapling/</link>
            <pubDate>Mon, 10 Jul 2017 12:43:30 GMT</pubDate>
            <description><![CDATA[ At Cloudflare our focus is making the internet faster and more secure. Today we are announcing a new enhancement to our HTTPS service: High-Reliability OCSP stapling. ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare our focus is making the Internet faster and more secure. Today we are announcing a new enhancement to our HTTPS service: High-Reliability OCSP stapling. This feature is a step towards enabling an important security feature on the web: certificate revocation checking. Reliable OCSP stapling also improves connection times by up to 30% in some cases. In this post, we’ll explore the importance of certificate revocation checking in HTTPS, the challenges involved in making it reliable, and how we built a robust OCSP stapling service.</p>
    <div>
      <h3>Why revocation is hard</h3>
      <a href="#why-revocation-is-hard">
        
      </a>
    </div>
    <p>Digital certificates are the cornerstone of trust on the web. A digital certificate is like an identification card for a website. It contains identity information including the website’s hostname along with a cryptographic public key. In public key cryptography, each public key has an associated private key. This private key is kept secret by the site owner. For a browser to trust an HTTPS site, the site’s server must provide a certificate that is valid for the site’s hostname and a proof of control of the certificate’s private key. If someone gets access to a certificate’s private key, they can impersonate the site. Private key compromise is a serious risk to trust on the web.</p><p>Certificate revocation is a way to mitigate the risk of key compromise. A website owner can revoke a compromised certificate by informing the certificate issuer that it should no longer be trusted. For example, back in 2014, <a href="/the-heartbleed-aftermath-all-cloudflare-certificates-revoked-and-reissued/">Cloudflare revoked</a> all managed certificates after it was shown that the Heartbleed vulnerability could be <a href="/the-results-of-the-cloudflare-challenge/">used to steal private keys</a>. There are other reasons to revoke, but key compromise is the most common.</p><p>Certificate revocation has a spotty history. Most of the revocation checking mechanisms implemented today don’t protect site owners from key compromise. If you know about why revocation checking is broken, feel free to skip ahead to the OCSP stapling section <a href="#ocspstapling">below</a>.</p>
    <div>
      <h3>Revocation checking: a history of failure</h3>
      <a href="#revocation-checking-a-history-of-failure">
        
      </a>
    </div>
    <p>There are several ways a web browser can check whether a site’s certificate is revoked or not. The most well-known mechanisms are Certificate Revocation Lists (CRL) and Online Certificate Status Protocol (OCSP). A CRL is a signed list of serial numbers of certificates revoked by a CA. OCSP is a protocol that can be used to query a CA about the revocation status of a given certificate. An OCSP response contains signed assertions that a certificate is not revoked.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6lGxsndXR5EF4yGX4thxuq/6c0e5a895ae6674e8d783aa5d6cef071/image1-1.jpg" />
            
            </figure><p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ovTuDaVKP72LIddkcRGIE/cf81791b20b77dfbd5dbf4c0221926fa/Frame-658.png" />
            
            </figure><p>Certificates that support OCSP contain the responder's URL and those that support CRLs contain a URL where the CRL can be obtained. When a browser is served a certificate as part of an HTTPS connection, it can use the embedded URL to download a CRL or an OCSP response and check that the certificate hasn't been revoked before rendering the web page. The question then becomes: what should the browser do if the request for a CRL or OCSP response fails? As it turns out, both answers to that question are problematic.</p>
    <div>
      <h3>Hard-fail doesn’t work</h3>
      <a href="#hard-fail-doesnt-work">
        
      </a>
    </div>
    <p>When browsers encounter a web page and there’s a problem fetching revocation information, the safe option is to block the page and show a security warning. This is called a hard-fail strategy. This strategy is conservative from a security standpoint, but prone to false positives. For example, if the proof of non-revocation could not be obtained for a valid certificate, a hard-fail strategy will show a security warning. Showing a security warning when no security issue exists is dangerous because it can lead to <a href="http://www.wired.co.uk/article/cybersecurity-warnings-fatigue">warning fatigue</a> and teach users to click through security warnings, which is a bad idea.</p><p>In the real world, false positives are unavoidable. OCSP and CRL endpoints are subject to service outages and network errors. There are also common situations where these endpoints are completely inaccessible to the browser, such as when the browser is behind a captive portal. For some access points used in hotels and airplanes, unencrypted traffic (like OCSP requests) is blocked. A hard-fail strategy forces users behind captive portals and other networks that block OCSP requests to click through unnecessary security warnings. This reality is unpalatable to browser vendors.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3X32ji9gvjXGc5BGcrw6TH/e25942fb12b7844085fbb40c6cf811bd/image3-1.png" />
            
            </figure><p>Another drawback to a hard-fail strategy is that it puts an increased burden on certificate authorities to keep OCSP and CRL endpoints available and online. A broken OCSP or CRL server becomes a central point of failure for all certificates issued by a certificate authority. If browsers followed a hard-fail strategy, an OCSP outage would be an Internet outage. Certificate authorities are organizations optimized to provide trust and accountability, not necessarily resilient infrastructure. In a hard-fail world, the availability of the web as a whole would be limited by the ability for CAs to keep their OCSP services online at all times; a dangerous systemic risk to the internet as a whole.</p>
    <div>
      <h3>Soft-fail: it’s not much better</h3>
      <a href="#soft-fail-its-not-much-better">
        
      </a>
    </div>
    <p>To avoid the downsides of a hard-fail strategy, most browsers take another approach to certificate revocation checking. Upon seeing a new certificate, the browser will attempt to fetch the revocation information from the CRL or OCSP endpoint embedded in the certificate. If the revocation information is available, they rely on it, and otherwise they assume the certificate is not revoked and display the page without any errors. This is called a “soft-fail” strategy.</p><p>The soft-fail strategy has a critical security flaw. An attacker with network position can block the OCSP request. If this attacker also has the private key of a revoked certificate, they can intercept the outgoing connection for the site and present the revoked certificate to the browser. Since the browser doesn’t know the certificate is revoked and is following a soft-fail strategy, the page will load without alerting the user. As Adam Langley <a href="https://www.imperialviolet.org/2012/02/05/crlsets.html">described</a>: “soft-fail revocation checks are like a seat-belt that snaps when you crash. Even though it works 99% of the time, it's worthless because it only works when you don't need it.”</p><p>A soft-fail strategy also makes connections slower. If revocation information for a certificate is not already cached, the browser will block the rendering of the page until the revocation information is retrieved, or a timeout occurs. This additional step causes a noticeable and unwelcome delay, with marginal security benefits. This tradeoff is a hard sell for the performance-obsessed web. Because of the limited benefit, some browsers have eliminated live revocation checking for at least <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1366100">some subset of certificates</a>.</p><p>Live OCSP checking has an additional downside: it leaks private browsing information. OCSP requests are sent over unencrypted HTTP and are tied to a specific certificate. 
Sending an OCSP request tells the certificate authority which websites you are visiting. Furthermore, everyone on the network path between your browser and the OCSP server will also know which sites you are browsing.</p>
    <div>
      <h3>Alternative revocation checking</h3>
      <a href="#alternative-revocation-checking">
        
      </a>
    </div>
    <p>Some clients still perform soft-fail OCSP checking, but it’s becoming less common due to the performance and privacy downsides described above. To protect high-value certificates, some browsers have explored alternative mechanisms for revocation checking.</p><p>One technique is to pre-package a list of revoked certificates and distribute them through browser updates. Because the list of all revoked certificates is so large, only a few high-impact certificates are included in this list. This technique is called <a href="https://blog.mozilla.org/security/2015/03/03/revoking-intermediate-certificates-introducing-onecrl/">OneCRL</a> by Firefox and <a href="https://www.imperialviolet.org/2012/02/05/crlsets.html">CRLSets</a> by Chrome. This has been effective for some high-profile revocations, but it is by no means a complete solution. Not only does the list fail to cover all certificates, this technique also leaves a window of vulnerability between the time a certificate is revoked and the time the updated list reaches browsers.</p>
    <div>
      <h3>OCSP Stapling</h3>
      <a href="#ocsp-stapling">
        
      </a>
    </div>
    <p>OCSP stapling is a technique to get revocation information to browsers that fixes some of the performance and privacy issues associated with live OCSP fetching. In OCSP stapling, the server includes (or "staples") a current OCSP response for the certificate into the initial HTTPS connection. That removes the need for the browser to request the OCSP response itself. OCSP stapling is widely supported by modern browsers.</p><p>Not all servers support OCSP stapling, so browsers still take a soft-fail approach to warning the user when the OCSP response is not stapled. Some browsers (such as Safari, Edge and Firefox <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1366100">for now</a>) check certificate revocation for all certificates, so OCSP stapling can provide a performance boost of <a href="/ocsp-stapling-how-cloudflare-just-made-ssl-30/">up to 30%</a>. For browsers like Chrome that don’t check for revocation for all certificates, OCSP stapling provides a proof of non-revocation that they would not have otherwise.</p>
    <div>
      <h3>High-reliability OCSP stapling</h3>
      <a href="#high-reliability-ocsp-stapling">
        
      </a>
    </div>
    <p>Cloudflare <a href="/ocsp-stapling-how-cloudflare-just-made-ssl-30/">started offering OCSP stapling in 2012</a>. Cloudflare’s original implementation relied on code from <a href="https://www.nginx.com/press/globalsign-digicert-and-comodo-collaborate-nginx-improve-online/">nginx</a> that was able to provide OCSP stapling for some, but not all, connections. As Cloudflare’s network grew, the implementation wasn’t able to scale with it, resulting in a drop in the percentage of connections with OCSP responses stapled. The architecture we had chosen had served us well, but we could definitely do better.</p><p>In the last year we redesigned our OCSP stapling infrastructure to make it much more robust and reliable. We’re happy to announce that we now provide reliable OCSP stapling for connections to Cloudflare. As long as the certificate authority has set up OCSP for a certificate, Cloudflare will serve a valid OCSP stapled response. All Cloudflare customers now benefit from much more reliable OCSP stapling.</p>
    <div>
      <h3>OCSP stapling past</h3>
      <a href="#ocsp-stapling-past">
        
      </a>
    </div>
    <p>In Cloudflare’s original implementation of OCSP stapling, OCSP responses were fetched opportunistically. Given a connection that required a certificate, Cloudflare would check to see if there was a fresh OCSP response to staple. If there was, it would be included in the connection. If not, then the client would not be sent an OCSP response, and Cloudflare would send a request to refresh the OCSP response in the cache in preparation for the next request.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/695TLaZM0gWEN9q339DdiE/1a275ed24ab629605b3881660fe036e5/image2-1.jpg" />
            
            </figure><p>If a fresh OCSP response wasn’t cached, the connection wouldn’t get an OCSP staple. The next connection for that same certificate would get an OCSP staple, because the cache will have been populated.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/26xAQKhW2tBTI9uwb1R3OD/2095dd4e959366e8c4ddcf7f8bfe337a/image7.jpg" />
            
            </figure><p>This architecture was elegant, but not robust. First, there are several situations in which the client is guaranteed to not get an OCSP response. For example, the first request in every cache region and the first request after an OCSP response expires are guaranteed to not have an OCSP response stapled. With Cloudflare’s expansion to more locations, these failures were more common. Less popular sites would have their OCSP responses fetched less often resulting in an even lower ratio of stapled connections. Another reason that connections could be missing OCSP responses is if the OCSP request from Cloudflare to fill the cache failed. There was a lot of room for improvement.</p>
    <div>
      <h3>Our solution: OCSP pre-fetching</h3>
      <a href="#our-solution-ocsp-pre-fetching">
        
      </a>
    </div>
    <p>In order to be able to reliably include OCSP staples in all connections, we decided to change the model. Instead of fetching the OCSP response when a request came in, we would fetch it in a centralized location and distribute valid responses to all our servers. When a response started getting close to expiration, we’d fetch a new one. If an OCSP request failed, we put it into a queue to re-fetch at a later time. Since most OCSP staples are valid for around 7 days, there is a lot of flexibility in terms of refreshing expiring responses.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5VtZ72GF4qvtDLHF4YwnhS/0e788290c059f675821c94ba04194032/image5.jpg" />
            
            </figure><p>To keep our cache of OCSP responses fresh, we created an OCSP fetching service. This service ensures that there is a valid OCSP response for every certificate managed by Cloudflare. We constantly crawl our cache of OCSP responses and refresh those that are close to expiring. We also make sure to never cache invalid OCSP responses, as this can have <a href="https://www.theregister.co.uk/2017/05/29/certificate_snafu_sees_azure_portal_reject_firefox/">bad consequences</a>. This system has been running for several months now, and we are now reliably including OCSP staples for almost every HTTPS request.</p><p>Reliable stapling improves performance for browsers that would have otherwise fetched OCSP, but it also changes the optimal failure strategy for browsers. If a browser can reliably get an OCSP staple for a certificate, why not switch back from a soft-fail to a hard-fail strategy?</p>
    <div>
      <h3>OCSP must-staple</h3>
      <a href="#ocsp-must-staple">
        
      </a>
    </div>
    <p>As <a href="#softfail">described above</a>, the soft-fail strategy for validating OCSP responses opens up a security hole. An attacker with a revoked certificate can simply neglect to provide an OCSP response when a browser connects to it and the browser will accept their revoked certificate.</p><p>In the OCSP fetching case, a soft-fail approach makes sense. There are many reasons the browser would not be able to obtain an OCSP response: captive portals, broken OCSP servers, network unreliability and more. However, as we have shown with our high-reliability OCSP fetching service, it is possible for a server to fetch OCSP responses without any of these problems. OCSP responses are re-usable and are valid for several days. When one is close to expiring, the server can fetch a new one out-of-band and be able to reliably serve OCSP staples for all connections.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3DwccMgq7pMKrCnU5S7Ivj/fb5bbe40027ccf2d34f9cddb24f076cb/Swingline-stapler.jpg" />
            
            </figure><p><a href="https://commons.wikimedia.org/wiki/File:Swingline-stapler.jpg">Public Domain</a></p><p>If the client knows that a server will always serve OCSP staples for every connection, it can apply a hard-fail approach, failing a connection if the OCSP response is missing. This closes the security hole introduced by the soft-fail strategy. This is where OCSP must-staple fits in.</p><p>OCSP must-staple is an extension that can be added to a certificate that tells the browser to expect an OCSP staple whenever it sees the certificate. This acts as an explicit signal to the browser that it’s safe to use the more secure hard-fail strategy.</p><p>Firefox enforces OCSP must-staple, returning the following error if such a certificate is presented without a stapled OCSP response.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2qH0Q1eryXiXZ2yLcNKS6o/3fa23feece99f107ffc5c58a1d0d249a/image4.jpg" />
            
            </figure><p>Chrome provides the ability to mark a domain as “Expect-Staple”. If Chrome sees a certificate for the domain without a staple, it will send a report to a pre-configured report endpoint.</p>
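<p>The browser-side decision can be sketched as a small truth table. This is an illustrative model of the soft-fail/hard-fail logic described above, not any browser’s actual code:</p>

```go
package main

import "fmt"

// acceptConnection models the browser-side decision. A must-staple
// certificate turns a missing staple from a soft failure into a hard one.
// Function and parameter names are illustrative.
func acceptConnection(mustStaple, stapleProvided, stapleSaysRevoked bool) bool {
	if stapleProvided {
		// A staple is present: trust its verdict.
		return !stapleSaysRevoked
	}
	// No staple: hard-fail only when the certificate demands one.
	return !mustStaple
}

func main() {
	fmt.Println(acceptConnection(false, false, false)) // soft-fail: accepted despite missing staple
	fmt.Println(acceptConnection(true, false, false))  // must-staple, no staple: rejected, closing the hole
	fmt.Println(acceptConnection(true, true, false))   // must-staple with a good staple: accepted
}
```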
    <div>
      <h3>Reliability</h3>
      <a href="#reliability">
        
      </a>
    </div>
    <p>As part of our push to provide reliable OCSP stapling, we put our money where our mouths are and put an OCSP must-staple certificate on blog.cloudflare.com. Now, if we ever fail to serve an OCSP staple, this page will fail to load on browsers like Firefox that enforce must-staple. You can identify a must-staple certificate by looking for the “1.3.6.1.5.5.7.1.24” OID in the certificate details.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/HdZil165qgzzMx4Cvnm8K/f36c26003ee86ab25888fa1af9ca6f37/image6.jpg" />
            
</figure><p>Cloudflare customers can choose to upload must-staple custom certificates, but we encourage them not to do so yet because there may be a multi-second delay between the certificate being installed and our ability to populate the OCSP response cache. This will be fixed in the coming months. Aside from those first few seconds after upload, Cloudflare’s new OCSP fetching is robust enough to offer OCSP staples for every connection.</p><p>As of today, an attacker with access to the private key of a revoked certificate can still hijack the connection. All they need to do is place themselves on the network path of the connection and block the OCSP request. OCSP must-staple prevents this, since the attacker will not be able to obtain an OCSP response that says the certificate has not been revoked.</p>
    <div>
      <h3>The weird world of OCSP responders</h3>
      <a href="#the-weird-world-of-ocsp-responders">
        
      </a>
    </div>
    <p>For browsers, an OCSP failure is not the end of the world. Most browsers are configured to soft-fail when an OCSP responder returns an error, so users are unaffected by OCSP server failures. Some Certificate Authorities have had <a href="https://community.letsencrypt.org/t/may-19-2017-ocsp-and-issuance-outage-postmortem/34922">massive multi-day outages</a> in their OCSP servers without affecting the availability of sites that use their certificates.</p><p>There’s no strong feedback mechanism for broken or slow OCSP servers. This lack of feedback has led to an ecosystem of faulty or unreliable OCSP servers. We experienced this first-hand while developing high-reliability OCSP stapling. In this section, we’ll outline half a dozen unexpected behaviors we found when deploying high-reliability OCSP stapling. A big thanks goes out to all the CAs who fixed the issues we pointed out. CA names have been redacted to protect their identities.</p>
    <div>
      <h4>CA #1: Non-overlapping periods</h4>
      <a href="#ca-1-non-overlapping-periods">
        
      </a>
    </div>
    <p>We noticed that CA #1’s certificates were frequently missing their refresh deadline, and during debugging we were lucky enough to see this:</p>
            <pre><code>$ date -u
Sat Mar  4 02:45:35 UTC 2017
$ ocspfetch &lt;redacted, customer ID&gt;
This Update: 2017-03-04 01:45:49 +0000 UTC
Next Update: 2017-03-04 02:45:49 +0000 UTC
$ date -u
Sat Mar  4 02:45:48 UTC 2017
$ ocspfetch &lt;redacted, customer ID&gt;
This Update: 2017-03-04 02:45:49 +0000 UTC
Next Update: 2017-03-04 03:45:49 +0000 UTC</code></pre>
            <p>It shows that CA #1 had configured their OCSP responders to use an incredibly short validity period with almost no overlap between validity periods, which makes it functionally impossible to always have a fresh OCSP response for their certificates. We contacted them, and they reconfigured the responder to produce new responses every half-interval.</p>
    <div>
      <h4>CA #2: Wrong signature algorithm</h4>
      <a href="#ca-2-wrong-signature-algorithm">
        
      </a>
    </div>
    <p>Several certificates from CA #2 started failing with this error:</p>
            <pre><code> bad OCSP signature: crypto/rsa: verification error</code></pre>
            <p>The issue was that some OCSP responses claimed to be signed with SHA256-RSA when they were actually signed with SHA1-RSA (and the reverse: responses that indicated SHA1-RSA were actually signed with SHA256-RSA).</p>
    <div>
      <h4>CA #3: Malformed OCSP responses</h4>
      <a href="#ca-3-malformed-ocsp-responses">
        
      </a>
    </div>
    <p>When we first started the project, our tool was unable to parse dozens of certificates in our database because of this error:</p>
            <pre><code>asn1: structure error: integer not minimally-encoded</code></pre>
            <p>and many OCSP responses that we fetched from the same CA:</p>
            <pre><code>parsing ocsp response: bad OCSP signature: asn1: structure error: integer not minimally-encoded</code></pre>
            <p>This CA had begun issuing a small fraction (&lt;1%) of certificates with a minor formatting error that rendered them unparseable by Golang’s x509 package. After we contacted them directly, they quickly fixed the issue, but we still had to patch Golang’s parser to be more lenient about such encoding bugs.</p>
    <div>
      <h4>CA #4: Failed responder behind a load balancer</h4>
      <a href="#ca-4-failed-responder-behind-a-load-balancer">
        
      </a>
    </div>
    <p>A small number of CA #4’s OCSP responders fell into a “bad state” without the CA noticing, and would return 404 on every request. Since the CA used load balancing to round-robin requests across a number of responders, roughly 1 in 6 requests would fail inexplicably.</p><p>Twice between January 2017 and May 2017, this CA also experienced some kind of large data-loss event that caused them to return persistent “Try Later” responses for a large number of requests.</p>
    <div>
      <h4>CA #5: Delay between certificate and OCSP responder</h4>
      <a href="#ca-5-delay-between-certificate-and-ocsp-responder">
        
      </a>
    </div>
    <p>When CA #5 issues a certificate, there is a large delay between the time the certificate is issued and the time its OCSP responder is able to start returning signed responses. This has mostly been resolved, but the general pattern is dangerous for OCSP must-staple. There have been some <a href="https://github.com/sleevi/cabforum-docs/pull/2">changes discussed</a> by the <a href="https://cabforum.org/">CA/B Forum</a>, an organization that regulates the issuance and management of certificates, to require CAs to offer OCSP soon after issuance.</p>
    <div>
      <h4>CA #6: Extra certificates</h4>
      <a href="#ca-6-extra-certificates">
        
      </a>
    </div>
    <p>It is typical to embed only one certificate in an OCSP response, if any. The one embedded certificate is supposed to be a leaf specially issued for signing an intermediate's OCSP responses. However, several CAs embed multiple certificates: the leaf they use for signing OCSP responses, the intermediate itself, and sometimes all the intermediates up to the root certificate.</p>
    <div>
      <h3>Conclusions</h3>
      <a href="#conclusions">
        
      </a>
    </div>
    <p>We made OCSP stapling better and more reliable for Cloudflare customers. Despite the various strange behaviors we found in OCSP servers, we’ve been able to consistently serve OCSP responses for over 99.9% of connections since we’ve moved over to the new system. This great work was done by Brendan McMillion and Alessandro Ghedini. This is an important step in protecting the web community from attackers who have compromised certificate private keys.</p> ]]></content:encoded>
            <category><![CDATA[OCSP]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">7tbSyaQLOejVqQI7xEdGhi</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[How to make your site HTTPS-only]]></title>
            <link>https://blog.cloudflare.com/how-to-make-your-site-https-only/</link>
            <pubDate>Thu, 06 Jul 2017 13:35:00 GMT</pubDate>
            <description><![CDATA[ The Internet is getting more secure every day as people enable HTTPS, the secure version of HTTP, on their sites and services. ]]></description>
            <content:encoded><![CDATA[ <p>The Internet is getting more secure every day as people enable HTTPS, the secure version of HTTP, on their sites and services. Last year, Mozilla reported that the percentage of requests made by Firefox using encrypted HTTPS passed <a href="https://twitter.com/0xjosh/status/78697141295942042">50% for the first time</a>. HTTPS has numerous benefits that are not available over unencrypted HTTP, including improved performance with <a href="https://www.cloudflare.com/website-optimization/http2/">HTTP/2</a>, <a href="http://searchengineland.com/google-starts-giving-ranking-boost-secure-httpsssl-sites-199446">SEO benefits</a> for search engines like Google and the reassuring lock icon in the address bar.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2mExGu34GTxoBejOt9ef1Y/a173ce5dfc0b61d2213238afab253abe/image1.jpg" />
            
</figure><p>So how do you add HTTPS to your site or service? That’s simple: Cloudflare offers free and automatic HTTPS support for all customers, with no configuration required. Sign up for any plan and Cloudflare will issue an SSL certificate for you and serve your site over HTTPS.</p>
    <div>
      <h3>HTTPS-only</h3>
      <a href="#https-only">
        
      </a>
    </div>
    <p>Enabling HTTPS does not mean that all visitors are protected. If a visitor types your website’s name into the address bar of a browser or follows an HTTP link, they will land on the insecure HTTP version of your website. To make your site HTTPS-only, you need to redirect visitors from the HTTP version to the HTTPS version of your site.</p><p>Going HTTPS-only should be as easy as a click of a button, so we literally added one to the Cloudflare dashboard. Enable the “Always Use HTTPS” feature and all visitors to the HTTP version of your website will be redirected to the HTTPS version. You’ll find this option just above the HTTP Strict Transport Security setting, and it is of course also available through our <a href="https://api.cloudflare.com/#zone-settings-change-always-use-https-setting">API</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/19twvdqDrUtYthZ88NZCG1/9b888300f8ed08e480adfbcaad1c1ee0/image2-1.png" />
            
</figure><p>If you would like to redirect only a subset of requests, you can do so by creating a <a href="https://www.cloudflare.com/features-page-rules/">Page Rule</a>. Simply apply the “Always Use HTTPS” setting to any URL pattern.</p>
    <div>
      <h3>Securing your site: next steps</h3>
      <a href="#securing-your-site-next-steps">
        
      </a>
    </div>
    <p>Once you have confirmed that your site is fully functional with HTTPS-only enabled, you can take it a step further and enable HTTP Strict Transport Security (<a href="/enforce-web-policy-with-hypertext-strict-transport-security-hsts/">HSTS</a>). HSTS is a header that tells browsers that your site is available over HTTPS and will be for a set period of time. Once a browser sees an HSTS header for a site, it will automatically fetch the HTTPS version of HTTP pages without needing to follow redirects. HSTS can be enabled in the crypto app right under the Always Use HTTPS toggle.</p><p>It's also important to secure the connection between Cloudflare and your site. To do that, you can use Cloudflare's <a href="/cloudflare-ca-encryption-origin/">Origin CA</a> to get a free certificate for your origin server. Once your origin server is set up with HTTPS and a valid certificate, change your SSL mode to Full (strict) to get the highest level of security.</p> ]]></content:encoded>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[Automatic HTTPS]]></category>
            <category><![CDATA[Certificate Authority]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">2XThcNAZoHfneQaKkn6XXC</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
    </channel>
</rss>