The Cloudflare Blog

Introducing EmDash — the spiritual successor to WordPress that solves plugin security

Matt “TK” Taylor — Wed, 01 Apr 2026 13:00:00 GMT

The cost of building software has drastically decreased. We recently rebuilt Next.js in one week using AI coding agents. But for the past two months our agents have been working on an even more ambitious project: rebuilding the WordPress open source project from the ground up.

WordPress powers over 40% of the Internet. It is a massive success that has enabled anyone to be a publisher, and created a global community of WordPress developers. But the WordPress open source project will be 24 years old this year. Hosting a website has changed dramatically during that time. When WordPress was born, AWS EC2 didn’t exist. In the intervening years, that task has gone from renting virtual private servers, to uploading a JavaScript bundle to a globally distributed network at virtually no cost. It’s time to upgrade the most popular CMS on the Internet to take advantage of this change.

Our name for this new CMS is EmDash. We think of it as the spiritual successor to WordPress. It’s written entirely in TypeScript. It is serverless, but you can run it on your own hardware or any platform you choose. Plugins are securely sandboxed and can run in their own isolate, via Dynamic Workers, solving the fundamental security problem with the WordPress plugin architecture. And under the hood, EmDash is powered by Astro, the fastest web framework for content-driven websites.

EmDash is fully open source, MIT licensed, and available on GitHub. While EmDash aims to be compatible with WordPress functionality, no WordPress code was used to create EmDash. That allows us to license the open source project under the more permissive MIT license. We hope that allows more developers to adapt, extend, and participate in EmDash’s development.

You can deploy the EmDash v0.1.0 preview to your own Cloudflare account, or to any Node.js server today as part of our early developer beta:

Or you can try out the admin interface here in the EmDash Playground:

What WordPress has accomplished

The story of WordPress is a triumph of open source that enabled publishing at a scale never before seen. Few projects have had the same recognisable impact on the generation raised on the Internet. The contributors to WordPress’s core, and its many thousands of plugin and theme developers have built a platform that democratised publishing for millions; many lives and livelihoods being transformed by this ubiquitous software.

There will always be a place for WordPress, but there is also a lot more space for the world of content publishing to grow. A decade ago, people picking up a keyboard universally learned to publish their blogs with WordPress. Today it’s just as likely that person picks up Astro, or another TypeScript framework to learn and build with. The ecosystem needs an option that empowers a wide audience, in the same way it needed WordPress 23 years ago.

EmDash is committed to building on what WordPress created: an open source publishing stack that anyone can install and use at little cost, while fixing the core problems that WordPress cannot solve.

Solving the WordPress plugin security crisis

WordPress’ plugin architecture is fundamentally insecure. 96% of security issues for WordPress sites originate in plugins. In 2025, more high severity vulnerabilities were found in the WordPress ecosystem than the previous two years combined.

Why, after over two decades, is WordPress plugin security so problematic?

A WordPress plugin is a PHP script that hooks directly into WordPress to add or modify functionality. There is no isolation: a WordPress plugin has direct access to the WordPress site’s database and filesystem. When you install a WordPress plugin, you are trusting it with access to nearly everything, and trusting it to handle every malicious input or edge case perfectly.

EmDash solves this. In EmDash, each plugin runs in its own isolated sandbox: a Dynamic Worker. Rather than giving direct access to underlying data, EmDash provides the plugin with capabilities via bindings, based on what the plugin explicitly declares that it needs in its manifest. This security model has a strict guarantee: an EmDash plugin can only perform the actions explicitly declared in its manifest. You can know and trust upfront, before installing a plugin, exactly what you are granting it permission to do, similar to going through an OAuth flow and granting a 3rd party app a specific set of scoped permissions.

For example, a plugin that sends an email after a content item gets saved looks like this:

import { definePlugin } from "emdash";

export default () =>
  definePlugin({
    id: "notify-on-publish",
    version: "1.0.0",
    capabilities: ["read:content", "email:send"],
    hooks: {
      "content:afterSave": async (event, ctx) => {
        if (event.collection !== "posts" || event.content.status !==    "published") return;

        await ctx.email!.send({
          to: "editors@example.com",
          subject: `New post published: ${event.content.title}`,
          text: `"${event.content.title}" is now live.`,
         });

        ctx.log.info(`Notified editors about ${event.content.id}`);
      },
    },
  });

This plugin explicitly requests two capabilities: content:afterSave to hook into the content lifecycle, and email:send to access the ctx.email function. It is impossible for the plugin to do anything other than use these capabilities. It has no external network access. If it does need network access, it can specify the exact hostname it needs to talk to, as part of its definition, and be granted only the ability to communicate with a particular hostname.

And in all cases, because the plugin’s needs are declared statically, upfront, it can always be clear exactly what the plugin is asking for permission to be able to do, at install time. A platform or administrator could define rules for what plugins are or aren’t allowed to be installed by certain groups of users, based on what permissions they request, rather than an allowlist of approved or safe plugins.

Solving plugin security means solving marketplace lock-in

WordPress plugin security is such a real risk that WordPress.org manually reviews and approves each plugin in its marketplace. At the time of writing, that review queue is over 800 plugins long, and takes at least two weeks to traverse. The vulnerability surface area of WordPress plugins is so wide that in practice, all parties rely on marketplace reputation, ratings and reviews. And because WordPress plugins run in the same execution context as WordPress itself and are so deeply intertwined with WordPress code, some argue they must carry forward WordPress’ GPL license.

These realities combine to create a chilling effect on developers building plugins, and on platforms hosting WordPress sites.

Plugin security is the root of this problem. Marketplace businesses provide trust when parties otherwise cannot easily trust each other. In the case of the WordPress marketplace, the plugin security risk is so large and probable that many of your customers can only reasonably trust your plugin via the marketplace. But in order to be part of the marketplace your code must be licensed in a way that forces you to give it away for free everywhere other than that marketplace. You are locked in.

EmDash plugins have two important properties that mitigate this marketplace lock-in:

Plugins can have any license: they run independently of EmDash and share no code. It’s the plugin author’s choice.
Plugin code runs independently in a secure sandbox: a plugin can be provided to an EmDash site, and trusted, without the EmDash site ever seeing the code.

The first part is straightforward — as the plugin author, you choose what license you want. The same way you can when publishing to NPM, PyPi, Packagist or any other registry. It’s an open ecosystem for all, and up to the community, not the EmDash project, what license you use for plugins and themes.

The second part is where EmDash’s plugin architecture breaks free of the centralized marketplace.

Developers need to rely on a third party marketplace having vetted the plugin far less to be able to make decisions about whether to use or trust it. Consider the example plugin above that sends emails after content is saved; the plugin declares three things:

It only runs on the content:afterSave hook
It has the read:content capability
It has the email:send capability

The plugin can have tens of thousands of lines of code in it, but unlike a WordPress plugin that has access to everything and can talk to the public Internet, the person adding the plugin knows exactly what access they are granting to it. The clearly defined boundaries allow you to make informed decisions about security risks and to zoom in on more specific risks that relate directly to the capabilities the plugin is given.

The more that both sites and platforms can trust the security model to provide constraints, the more that sites and platforms can trust plugins, and break free of centralized control of marketplaces and reputation. Put another way: if you trust that food safety is enforced in your city, you’ll be adventurous and try new places. If you can’t trust that there might be a staple in your soup, you’ll be consulting Google before every new place you try, and it’s harder for everyone to open new restaurants.

Every EmDash site has x402 support built in — charge for access to content

The business model of the web is at risk, particularly for content creators and publishers. The old way of making content widely accessible, allowing all clients free access in exchange for traffic, breaks when there is no human looking at a site to advertise to, and the client is instead their agent accessing the web on their behalf. Creators need ways to continue to make money in this new world of agents, and to build new kinds of websites that serve what people’s agents need and will pay for. Decades ago a new wave of creators created websites that became great businesses (often using WordPress to power them) and a similar opportunity exists today.

x402 is an open, neutral standard for Internet-native payments. It lets anyone on the Internet easily charge, and any client pay on-demand, on a pay-per-use basis. A client, such as an agent, sends a HTTP request and receives a HTTP 402 Payment Required status code. In response, the client pays for access on-demand, and the server can let the client through to the requested content.

EmDash has built-in support for x402. This means anyone with an EmDash site can charge for access to their content without requiring subscriptions and with zero engineering work. All you need to do is configure which content should require payment, set how much to charge, and provide a Wallet address. The request/response flow ends up looking like this:

Every EmDash site has a built-in business model for the AI era.

Solving scale-to-zero for WordPress hosting platforms

WordPress is not serverless: it requires provisioning and managing servers, scaling them up and down like a traditional web application. To maximize performance, and to be able to handle traffic spikes, there’s no avoiding the need to pre-provision instances and run some amount of idle compute, or share resources in ways that limit performance. This is particularly true for sites with content that must be server rendered and cannot be cached.

EmDash is different: it’s built to run on serverless platforms, and make the most out of the v8 isolate architecture of Cloudflare’s open source runtime workerd. On an incoming request, the Workers runtime instantly spins up an isolate to execute code and serve a response. It scales back down to zero if there are no requests. And it only bills for CPU time (time spent doing actual work).

You can run EmDash anywhere, on any Node.js server — but on Cloudflare you can run millions of instances of EmDash using Cloudflare for Platforms that each instantly scale fully to zero or up to as many RPS as you need to handle, using the exact same network and runtime that the biggest websites in the world rely on.

Beyond cost optimizations and performance benefits, we’ve bet on this architecture at Cloudflare in part because we believe in having low cost and free tiers, and that everyone should be able to build websites that scale. We’re excited to help platforms extend the benefits of this architecture to their own customers, both big and small.

Modern frontend theming and architecture via Astro

EmDash is powered by Astro, the web framework for content-driven websites. To create an EmDash theme, you create an Astro project that includes:

Pages: Astro routes for rendering content (homepage, blog posts, archives, etc.)
Layouts: Shared HTML structure
Components: Reusable UI elements (navigation, cards, footers)
Styles: CSS or Tailwind configuration
A seed file: JSON that tells the CMS what content types and fields to create

This makes creating themes familiar to frontend developers who are increasingly choosing Astro, and to LLMs which are already trained on Astro.

WordPress themes, though incredibly flexible, operate with a lot of the same security risks as plugins, and the more popular and commonplace your theme, the more of a target it is. Themes run through integrating with functions.php which is an all-encompassing execution environment, enabling your theme to be both incredibly powerful and potentially dangerous. EmDash themes, as with dynamic plugins, turns this expectation on its head. Your theme can never perform database operations.

An AI Native CMS — MCP, CLI, and Skills for EmDash

The least fun part about working with any CMS is doing the rote migration of content: finding and replacing strings, migrating custom fields from one format to another, renaming, reordering and moving things around. This is either boring repetitive work or requires one-off scripts and “single-use” plugins and tools that are usually neither fun to write nor to use.

EmDash is designed to be managed programmatically by your AI agents. It provides the context and the tools that your agents need, including:

Agent Skills: Each EmDash instance includes Agent Skills that describe to your agent the capabilities EmDash can provide to plugins, the hooks that can trigger plugins, guidance on how to structure a plugin, and even how to port legacy WordPress themes to EmDash natively. When you give an agent an EmDash codebase, EmDash provides everything the agent needs to be able to customize your site in the way you need.
EmDash CLI: The EmDash CLI enables your agent to interact programmatically with your local or remote instance of EmDash. You can upload media, search for content, create and manage schemas, and do the same set of things you can do in the Admin UI.
Built-in MCP Server: Every EmDash instance provides its own remote Model Context Protocol (MCP) server, allowing you to do the same set of things you can do in the Admin UI.

Pluggable authentication, with Passkeys by default

EmDash uses passkey-based authentication by default, meaning there are no passwords to leak and no brute-force vectors to defend against. User management includes familiar role-based access control out of the box: administrators, editors, authors, and contributors, each scoped strictly to the actions they need. Authentication is pluggable, so you can set EmDash up to work with your SSO provider, and automatically provision access based on IdP metadata.

Import your WordPress sites to EmDash

You can import an existing WordPress site by either going to WordPress admin and exporting a WXR file, or by installing the EmDash Exporter plugin on a WordPress site, which configures a secure endpoint that is only exposed to you, and protected by a WordPress Application Password you control. Migrating content takes just a few minutes, and automatically works to bring any attached media into EmDash’s media library.

Creating any custom content types on WordPress that are not a Post or a Page has meant installing heavy plugins like Advanced Custom Fields, and squeezing the result into a crowded WordPress posts table. EmDash does things differently: you can define a schema directly in the admin panel, which will create entirely new EmDash collections for you, separately ordered in the database. On import, you can use the same capabilities to take any custom post types from WordPress, and create an EmDash content type from it.

For bespoke blocks, you can use the EmDash Block Kit Agent Skill to instruct your agent of choice and build them for EmDash.

Try it

EmDash is v0.1.0 preview, and we’d love you to try it, give feedback, and we welcome contributions to the EmDash GitHub repository.

If you’re just playing around and want to first understand what’s possible — try out the admin interface in the EmDash Playground.

To create a new EmDash site locally, via the CLI, run:

npm create emdash@latest

Or you can do the same via the Cloudflare dashboard below:

We’re excited to see what you build, and if you're active in the WordPress community, as a hosting platform, a plugin or theme author, or otherwise — we’d love to hear from you. Email us at emdash@cloudflare.com, and tell us what you’d like to see from the EmDash project.

If you want to stay up to date with major EmDash developments, you can leave your email address here.

How we use Abstract Syntax Trees (ASTs) to turn Workflows code into visual diagrams

André Venceslau — Fri, 27 Mar 2026 13:00:00 GMT

Cloudflare Workflows is a durable execution engine that lets you chain steps, retry on failure, and persist state across long-running processes. Developers use Workflows to power background agents, manage data pipelines, build human-in-the-loop approval systems, and more.

Last month, we announced that every workflow deployed to Cloudflare now has a complete visual diagram in the dashboard.

We built this because being able to visualize your applications is more important now than ever before. Coding agents are writing code that you may or may not be reading. However, the shape of what gets built still matters: how the steps connect, where they branch, and what's actually happening.

If you've seen diagrams from visual workflow builders before, those are usually working from something declarative: JSON configs, YAML, drag-and-drop. However, Cloudflare Workflows are just code. They can include Promises, Promise.all, loops, conditionals, and/or be nested in functions or classes. This dynamic execution model makes rendering a diagram a bit more complicated.

We use Abstract Syntax Trees (ASTs) to statically derive the graph, tracking Promise and await relationships to understand what runs in parallel, what blocks, and how the pieces connect.

Keep reading to learn how we built these diagrams, or deploy your first workflow and see the diagram for yourself.

Here’s an example of a diagram generated from Cloudflare Workflows code:

Dynamic workflow execution

Generally, workflow engines can execute according to either dynamic or sequential (static) execution order. Sequential execution might seem like the more intuitive solution: trigger workflow → step A → step B → step C, where step B starts executing immediately after the engine completes Step A, and so forth.

Cloudflare Workflows follow the dynamic execution model. Since workflows are just code, the steps execute as the runtime encounters them. When the runtime discovers a step, that step gets passed over to the workflow engine, which manages its execution. The steps are not inherently sequential unless awaited — the engine executes all unawaited steps in parallel. This way, you can write your workflow code as flow control without additional wrappers or directives. Here’s how the handoff works:

An engine, which is a “supervisor” Durable Object for that instance, spins up. The engine is responsible for the logic of the actual workflow execution.
The engine triggers a user worker via dynamic dispatch, passing control over to Workers runtime.
When Runtime encounters a step.do, it passes the execution back to the engine.
The engine executes the step, persists the result (or throws an error, if applicable) and triggers the user Worker again.

With this architecture, the engine does not inherently “know” the order of the steps that it is executing — but for a diagram, the order of steps becomes crucial information. The challenge here lies in getting the vast majority of workflows translated accurately into a diagnostically helpful graph; with the diagrams in beta, we will continue to iterate and improve on these representations.

Parsing the code

Fetching the script at deploy time, instead of run time, allows us to parse the workflow in its entirety to statically generate the diagram.

Taking a step back, here is the life of a workflow deployment:

To create the diagram, we fetch the script after it has been bundled by the internal configuration service which deploys Workers (step 2 under Workflow deployment). Then, we use a parser to create an abstract syntax tree (AST) representing the workflow, and our internal service generates and traverses an intermediate graph with all WorkflowEntrypoints and calls to workflows steps. We render the diagram based on the final result on our API.

When a Worker is deployed, the configuration service bundles (using esbuild by default) and minifies the code unless specified otherwise. This presents another challenge — while Workflows in TypeScript follow an intuitive pattern, their minified Javascript (JS) can be dense and indigestible. There are also different ways that code can be minified, depending on the bundler.

Here’s an example of Workflow code that shows agents executing in parallel:

const summaryPromise = step.do(
         `summary agent (loop ${loop})`,
         async () => {
           return runAgentPrompt(
             this.env,
             SUMMARY_SYSTEM,
             buildReviewPrompt(
               'Summarize this text in 5 bullet points.',
               draft,
               input.context
             )
           );
         }
       );
        const correctnessPromise = step.do(
         `correctness agent (loop ${loop})`,
         async () => {
           return runAgentPrompt(
             this.env,
             CORRECTNESS_SYSTEM,
             buildReviewPrompt(
               'List correctness issues and suggested fixes.',
               draft,
               input.context
             )
           );
         }
       );
        const clarityPromise = step.do(
         `clarity agent (loop ${loop})`,
         async () => {
           return runAgentPrompt(
             this.env,
             CLARITY_SYSTEM,
             buildReviewPrompt(
               'List clarity issues and suggested fixes.',
               draft,
               input.context
             )
           );
         }
       );

Bundling with rspack, a snippet of the minified code looks like this:

class pe extends e{async run(e,t){de("workflow.run.start",{instanceId:e.instanceId});const r=await t.do("validate payload",async()=>{if(!e.payload.r2Key)throw new Error("r2Key is required");if(!e.payload.telegramChatId)throw new Error("telegramChatId is required");return{r2Key:e.payload.r2Key,telegramChatId:e.payload.telegramChatId,context:e.payload.context?.trim()}}),s=await t.do("load source document from r2",async()=>{const e=await this.env.REVIEW_DOCUMENTS.get(r.r2Key);if(!e)throw new Error(`R2 object not found: ${r.r2Key}`);const t=(await e.text()).trim();if(!t)throw new Error("R2 object is empty");return t}),n=Number(this.env.MAX_REVIEW_LOOPS??"5"),o=this.env.RESPONSE_TIMEOUT??"7 days",a=async(s,i,c)=>{if(s>n)return le("workflow.loop.max_reached",{instanceId:e.instanceId,maxLoops:n}),await t.do("notify max loop reached",async()=>{await se(this.env,r.telegramChatId,`Review stopped after ${n} loops for ${e.instanceId}. Start again if you still need revisions.`)}),{approved:!1,loops:n,finalText:i};const h=t.do(`summary agent (loop ${s})`,async()=>te(this.env,"You summarize documents. Keep the output short, concrete, and factual.",ue("Summarize this text in 5 bullet points.",i,r.context)))...

Or, bundling with vite, here is a minified snippet:

class ht extends pe {
  async run(e, r) {
    b("workflow.run.start", { instanceId: e.instanceId });
    const s = await r.do("validate payload", async () => {
      if (!e.payload.r2Key)
        throw new Error("r2Key is required");
      if (!e.payload.telegramChatId)
        throw new Error("telegramChatId is required");
      return {
        r2Key: e.payload.r2Key,
        telegramChatId: e.payload.telegramChatId,
        context: e.payload.context?.trim()
      };
    }), n = await r.do(
      "load source document from r2",
      async () => {
        const i = await this.env.REVIEW_DOCUMENTS.get(s.r2Key);
        if (!i)
          throw new Error(`R2 object not found: ${s.r2Key}`);
        const c = (await i.text()).trim();
        if (!c)
          throw new Error("R2 object is empty");
        return c;
      }
    ), o = Number(this.env.MAX_REVIEW_LOOPS ?? "5"), l = this.env.RESPONSE_TIMEOUT ?? "7 days", a = async (i, c, u) => {
      if (i > o)
        return H("workflow.loop.max_reached", {
          instanceId: e.instanceId,
          maxLoops: o
        }), await r.do("notify max loop reached", async () => {
          await J(
            this.env,
            s.telegramChatId,
            `Review stopped after ${o} loops for ${e.instanceId}. Start again if you still need revisions.`
          );
        }), {
          approved: !1,
          loops: o,
          finalText: c
        };
      const h = r.do(
        `summary agent (loop ${i})`,
        async () => _(
          this.env,
          et,
          K(
            "Summarize this text in 5 bullet points.",
            c,
            s.context
          )
        )
      )...

Minified code can get pretty gnarly — and depending on the bundler, it can get gnarly in a bunch of different directions.

We needed a way to parse the various forms of minified code quickly and precisely. We decided oxc-parser from the JavaScript Oxidation Compiler (OXC) was perfect for the job. We first tested this idea by having a container running Rust. Every script ID was sent to a Cloudflare Queue, after which messages were popped and sent to the container to process. Once we confirmed this approach worked, we moved to a Worker written in Rust. Workers supports running Rust via WebAssembly, and the package was small enough to make this straightforward.

The Rust Worker is responsible for first converting the minified JS into AST node types, then converting the AST node types into the graphical version of the workflow that is rendered on the dashboard. To do this, we generate a graph of pre-defined node types for each workflow and translate into our graph representation through a series of node mappings.

Rendering the diagram

There were two challenges to rendering a diagram version of the workflow: how to track step and function relationships correctly, and how to define the workflow node types as simply as possible while covering all the surface area.

To guarantee that step and function relationships are tracked correctly, we needed to collect both the function and step names. As we discussed earlier, the engine only has information about the steps, but a step may be dependent on a function, or vice versa. For example, developers might wrap steps in functions or define functions as steps. They could also call steps within a function that come from different modules or rename steps.

Although the library passes the initial hurdle by giving us the AST, we still have to decide how to parse it. Some code patterns require additional creativity. For example, functions — within a WorkflowEntrypoint, there can be functions that call steps directly, indirectly, or not at all. Consider functionA, which contains console.log(await functionB(), await functionC()) where functionB calls a step.do(). In that case, both functionA and functionB should be included on the workflow diagram; however, functionC should not. To catch all functions which include direct and indirect step calls, we create a subgraph for each function and check whether it contains a step call itself or whether it calls another function which might. Those subgraphs are represented by a function node, which contains all of its relevant nodes. If a function node is a leaf of the graph, meaning it has no direct or indirect workflow steps within it, it is trimmed from the final output.

We check for other patterns as well, including a list of static steps from which we can infer the workflow diagram or variables, defined in up to ten different ways. If your script contains multiple workflows, we follow a similar pattern to the subgraphs created for functions, abstracted one level higher.

For every AST node type, we had to consider every way they could be used inside of a workflow: loops, branches, promises, parallels, awaits, arrow functions… the list goes on. Even within these paths, there are dozens of possibilities. Consider just a few of the possible ways to loop:

// for...of
for (const item of items) {
	await step.do(`process ${item}`, async () => item);
}
// while
while (shouldContinue) {
	await step.do('poll', async () => getStatus());
}
// map
await Promise.all(
	items.map((item) => step.do(`map ${item}`, async () => item)),
);
// forEach
await items.forEach(async (item) => {
	await step.do(`each ${item}`, async () => item);
});

And beyond looping, how to handle branching:

// switch / case
switch (action.type) {
	case 'create':
		await step.do('handle create', async () => {});
		break;
	default:
		await step.do('handle unknown', async () => {});
		break;
}

// if / else if / else
if (status === 'pending') {
	await step.do('pending path', async () => {});
} else if (status === 'active') {
	await step.do('active path', async () => {});
} else {
	await step.do('fallback path', async () => {});
}

// ternary operator
await (cond
	? step.do('ternary true branch', async () => {})
	: step.do('ternary false branch', async () => {}));

// nullish coalescing with step on RHS
const myStepResult =
	variableThatCanBeNullUndefined ??
	(await step.do('nullish fallback step', async () => 'default'));

// try/catch with finally
try {
	await step.do('try step', async () => {});
} catch (_e) {
	await step.do('catch step', async () => {});
} finally {
	await step.do('finally step', async () => {});
}

Our goal was to create a concise API that communicated what developers need to know without overcomplicating it. But converting a workflow into a diagram meant accounting for every pattern (whether it follows best practices, or not) and edge case possible. As we discussed earlier, each step is not explicitly sequential, by default, to any other step. If a workflow does not utilize await and Promise.all(), we assume that the steps will execute in the order in which they are encountered. But if a workflow included await, Promise or Promise.all(), we needed a way to track those relationships.

We decided on tracking execution order, where each node has a starts: and resolves: field. The starts and resolves indices tell us when a promise started executing and when it ends relative to the first promise that started without an immediate, subsequent conclusion. This correlates to vertical positioning in the diagram UI (i.e., all steps with starts:1 will be inline). If steps are awaited when they are declared, then starts and resolves will be undefined, and the workflow will execute in the order of the steps’ appearance to the runtime.

While parsing, when we encounter an unawaited Promise or Promise.all(), that node (or nodes) are marked with an entry number, surfaced in the starts field. If we encounter an await on that promise, the entry number is incremented by one and saved as the exit number (which is the value in resolves). This allows us to know which promises run at the same time and when they’ll complete in relation to each other.

export class ImplicitParallelWorkflow extends WorkflowEntrypoint {
 async run(event: WorkflowEvent, step: WorkflowStep) {
   const branchA = async () => {
     const a = step.do("task a", async () => "a"); //starts 1
     const b = step.do("task b", async () => "b"); //starts 1
     const c = await step.waitForEvent("task c", { type: "my-event", timeout: "1 hour" }); //starts 1 resolves 2
     await step.do("task d", async () => JSON.stringify(c)); //starts 2 resolves 3
     return Promise.all([a, b]); //resolves 3
   };

   const branchB = async () => {
     const e = step.do("task e", async () => "e"); //starts 1
     const f = step.do("task f", async () => "f"); //starts 1
     return Promise.all([e, f]); //resolves 2
   };

   await Promise.all([branchA(), branchB()]);

   await step.sleep("final sleep", 1000);
 }
}

You can see the steps’ alignment in the diagram:

After accounting for all of those patterns, we settled on the following list of node types:

| StepSleep
| StepDo
| StepWaitForEvent
| StepSleepUntil
| LoopNode
| ParallelNode
| TryNode
| BlockNode
| IfNode
| SwitchNode
| StartNode
| FunctionCall
| FunctionDef
| BreakNode;

Here are a few samples of API output for different behaviors:

function call:

{
  "functions": {
    "runLoop": {
      "name": "runLoop",
      "nodes": []
    }
  }
}

if condition branching to step.do:

{
  "type": "if",
  "branches": [
    {
      "condition": "loop > maxLoops",
      "nodes": [
        {
          "type": "step_do",
          "name": "notify max loop reached",
          "config": {
            "retries": {
              "limit": 5,
              "delay": 1000,
              "backoff": "exponential"
            },
            "timeout": 10000
          },
          "nodes": []
        }
      ]
    }
  ]
}

parallel with step.do and waitForEvent:

{
  "type": "parallel",
  "kind": "all",
  "nodes": [
    {
      "type": "step_do",
      "name": "correctness agent (loop ${...})",
      "config": {
        "retries": {
          "limit": 5,
          "delay": 1000,
          "backoff": "exponential"
        },
        "timeout": 10000
      },
      "nodes": [],
      "starts": 1
    },
...
    {
      "type": "step_wait_for_event",
      "name": "wait for user response (loop ${...})",
      "options": {
        "event_type": "user-response",
        "timeout": "unknown"
      },
      "starts": 3,
      "resolves": 4
    }
  ]
}

What’s next

Ultimately, the goal of these Workflow diagrams is to serve as a full-service debugging tool. That means you’ll be able to:

Trace an execution through the graph in real time
Discover errors, wait for human-in-the-loop approvals, and skip steps for testing
Access visualizations in local development

Check out the diagrams on your Workflow overview pages. If you have any feature requests or notice any bugs, share your feedback directly with the Cloudflare team by joining the Cloudflare Developers community on Discord.

Powering the agents: Workers AI now runs large models, starting with Kimi K2.5

Michelle Chen — Thu, 19 Mar 2026 19:53:16 GMT

We're making Cloudflare the best place for building and deploying agents. But reliable agents aren't built on prompts alone; they require a robust, coordinated infrastructure of underlying primitives.

At Cloudflare, we have been building these primitives for years: Durable Objects for state persistence, Workflows for long running tasks, and Dynamic Workers or Sandbox containers for secure execution. Powerful abstractions like the Agents SDK are designed to help you build agents on top of Cloudflare’s Developer Platform.

But these primitives only provided the execution environment. The agent still needed a model capable of powering it.

Starting today, Workers AI is officially in the big models game. We now offer frontier open-source models on our AI inference platform. We’re starting by releasing Moonshot AI’s Kimi K2.5 model on Workers AI. With a full 256k context window and support for multi-turn tool calling, vision inputs, and structured outputs, the Kimi K2.5 model is excellent for all kinds of agentic tasks. By bringing a frontier-scale model directly into the Cloudflare Developer Platform, we’re making it possible to run the entire agent lifecycle on a single, unified platform.

The heart of an agent is the AI model that powers it, and that model needs to be smart, with high reasoning capabilities and a large context window. Workers AI now runs those models.

The price-performance sweet spot

We spent the last few weeks testing Kimi K2.5 as the engine for our internal development tools. Within our OpenCode environment, Cloudflare engineers use Kimi as a daily driver for agentic coding tasks. We have also integrated the model into our automated code review pipeline; you can see this in action via our public code review agent, Bonk, on Cloudflare GitHub repos. In production, the model has proven to be a fast, efficient alternative to larger proprietary models without sacrificing quality.

Serving Kimi K2.5 began as an experiment, but it quickly became critical after reviewing how the model performs and how cost-efficient it is. As an illustrative example: we have an agent that does security reviews of Cloudflare’s codebases. This agent processes over 7B tokens per day, and using Kimi, it has caught more than 15 confirmed issues in a single codebase. Doing some rough math, if we had run this agent on a mid-tier proprietary model, we would have spent $2.4M a year for this single use case, on a single codebase. Running this agent with Kimi K2.5 cost just a fraction of that: we cut costs by 77% simply by making the switch to Workers AI.

As AI adoption increases, we are seeing a fundamental shift not only in how engineering teams are operating, but how individuals are operating. It is becoming increasingly common for people to have a personal agent like OpenClaw running 24/7. The volume of inference is skyrocketing.

This new rise in personal and coding agents means that cost is no longer a secondary concern; it is the primary blocker to scaling. When every employee has multiple agents processing hundreds of thousands of tokens per hour, the math for proprietary models stops working. Enterprises will look to transition to open-source models that offer frontier-level reasoning without the proprietary price tag. Workers AI is here to facilitate this shift, providing everything from serverless endpoints for a personal agent to dedicated instances powering autonomous agents across an entire organization.

The large model inference stack

Workers AI has served models, including LLMs, since its launch two years ago, but we’ve historically prioritized smaller models. Part of the reason was that for some time, open-source LLMs fell far behind the models from frontier model labs. This changed with models like Kimi K2.5, but to serve this type of very large LLM, we had to make changes to our inference stack. We wanted to share with you some of what goes on behind the scenes to support a model like Kimi.

We’ve been working on custom kernels for Kimi K2.5 to optimize how we serve the model, which is built on top of our proprietary Infire inference engine. Custom kernels improve the model’s performance and GPU utilization, unlocking gains that would otherwise go unclaimed if you were just running the model out of the box. There are also multiple techniques and hardware configurations that can be leveraged to serve a large model. Developers typically use a combination of data, tensor, and expert parallelization techniques to optimize model performance. Strategies like disaggregated prefill are also important, in which you separate the prefill and generation stages onto different machines in order to get better throughput or higher GPU utilization. Implementing these techniques and incorporating them into the inference stack takes a lot of dedicated experience to get right.

Workers AI has already done the experimentation with serving techniques to yield excellent throughput on Kimi K2.5. A lot of this does not come out of the box when you self-host an open-source model. The benefit of using a platform like Workers AI is that you don’t need to be a Machine Learning Engineer, a DevOps expert, or a Site Reliability Engineer to do the optimizations required to host it: we’ve already done the hard part, you just need to call an API.

Beyond the model — platform improvements for agentic workloads

In concert with this launch, we’ve also improved our platform and are releasing several new features to help you build better agents.

Prefix caching and surfacing cached tokens

When you work with agents, you are likely sending a large number of input tokens as part of the context: this could be detailed system prompts, tool definitions, MCP server tools, or entire codebases. Inputs can be as large as the model context window, so in theory, you could be sending requests with almost 256k input tokens. That’s a lot of tokens.

When an LLM processes a request, the request is broken down into two stages: the prefill stage processes input tokens and the output stage generates output tokens. These stages are usually sequential, where input tokens have to be fully processed before you can generate output tokens. This means that sometimes the GPU is not fully utilized while the model is doing prefill.

With multi-turn conversations, when you send a new prompt, the client sends all the previous prompts, tools, and context from the session to the model as well. The delta between consecutive requests is usually just a few new lines of input; all the other context has already gone through the prefill stage during a previous request. This is where prefix caching helps. Instead of doing prefill on the entire request, we can cache the input tensors from a previous request, and only do prefill on the new input tokens. This saves a lot of time and compute from the prefill stage, which means a faster Time to First Token (TTFT) and a higher Tokens Per Second (TPS) throughput as you’re not blocked on prefill.

Workers AI has always done prefix caching, but we are now surfacing cached tokens as a usage metric and offering a discount on cached tokens compared to input tokens. (Pricing can be found on the model page.) We also have new techniques for you to leverage in order to get a higher prefix cache hit rate, reducing your costs.

New session affinity header for higher cache hit rates

In order to route to the same model instance and take advantage of prefix caching, we use a new x-session-affinity header. When you send this header, you’ll improve your cache hit ratio, leading to more cached tokens and subsequently, faster TTFT, TPS, and lower inference costs.

You can pass the new header like below, with a unique string per session or per agent. Some clients like OpenCode implement this automatically out of the box. Our Agents SDK starter has already set up the wiring to do this for you, too.

curl -X POST \
"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/moonshotai/kimi-k2.5" \
  -H "Authorization: Bearer {API_TOKEN}" \
  -H "Content-Type: application/json" \
  -H "x-session-affinity: ses_12345678" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is prefix caching and why does it matter?"
      }
    ],
    "max_tokens": 2400,
    "stream": true
  }'

Redesigned async APIs

Serverless inference is really hard. With a pay-per-token business model, it’s cheaper on a single request basis because you don’t need to pay for entire GPUs to service your requests. But there’s a trade-off: you have to contend with other people’s traffic and capacity constraints, and there’s no strict guarantee that your request will be processed. This is not unique to Workers AI — it’s evidently the case across serverless model providers, given the frequent news reports of overloaded providers and service disruptions. While we always strive to serve your request and have built-in autoscaling and rebalancing, there are hard limitations (like hardware) that make this a challenge.

For volumes of requests that would exceed synchronous rate limits, you can submit batches of inferences to be completed asynchronously. We’re introducing a revamped Asynchronous API, which means that for asynchronous use cases, you won’t run into Out of Capacity errors and inference will execute durably at some point. Our async API looks more like flex processing than a batch API, where we process requests in the async queue as long as we have headroom in our model instances. With internal testing, our async requests usually execute within 5 minutes, but this will depend on what live traffic looks like. As we bring Kimi to the public, we will tune our scaling accordingly, but the async API is the best way to make sure you don’t run into capacity errors in durable workflows. This is perfect for use cases that are not real-time, such as code scanning agents or research agents.

Workers AI previously had an asynchronous API, but we’ve recently revamped the systems under the hood. We now rely on a pull-based system versus the historical push-based system, allowing us to pull in queued requests as soon as we have capacity. We’ve also added better controls to tune the throughput of async requests, monitoring GPU utilization in real-time and pulling in async requests when utilization is low, so that critical synchronous requests get priority while still processing asynchronous requests efficiently.

To use the asynchronous API, you would send your requests as seen below. We also have a way to set up event notifications so that you can know when the inference is complete instead of polling for the request.

// (1.) Push a request in queue
// pass queueRequest: true
let res = await env.AI.run("@cf/moonshotai/kimi-k2.5", {
  "requests": [{
    "messages": [{
      "role": "user",
      "content": "Tell me a joke"
    }]
  }, {
    "messages": [{
      "role": "user",
      "content": "Explain the Pythagoras theorem"
    }]
  }, ...{} ];
}, {
  queueRequest: true,
});


// (2.) grab the request id
let request_id;
if(res && res.request_id){
  request_id = res.request_id;
}
// (3.) poll the status
let res = await env.AI.run("@cf/moonshotai/kimi-k2.5", {
  request_id: request_id
});

if(res && res.status === "queued" || res.status === "running") {
 // retry by polling again
 ...
}
else 
 return Response.json(res); // This will contain the final completed response

Try it out today

Get started with Kimi K2.5 on Workers AI today. You can read our developer docs to find out model information and pricing, and how to take advantage of prompt caching via session affinity headers and asynchronous API. The Agents SDK starter also now uses Kimi K2.5 as its default model. You can also connect to Kimi K2.5 on Workers AI via Opencode. For a live demo, try it in our playground.

And if this set of problems around serverless inference, ML optimizations, and GPU infrastructure sound interesting to you — we’re hiring!

The truly programmable SASE platform

Abe Carryl — Mon, 02 Mar 2026 06:00:00 GMT

Every organization approaches security through a unique lens, shaped by their tooling, requirements, and history. No two environments look the same, and none stay static for long. We believe the platforms that protect them shouldn't be static either.

Cloudflare built our global network to be programmable by design, so we can help organizations unlock this flexibility and freedom. In this post, we’ll go deeper into what programmability means, and how Cloudflare One, our SASE platform, helps customers architect their security and networking with our building blocks to meet their unique and custom needs.

What programmability actually means

The term programmability has become diluted by the industry. Most security vendors claim programmability because they have public APIs, documented Terraform providers, webhooks, and alerting. That’s great, and Cloudflare offers all of those things too.

These foundational capabilities provide customization, infrastructure-as-code, and security operations automation, but they're table stakes. With traditional programmability, you can configure a webhook to send an alert to Slack when a policy triggers.

But the true value of programmability is something different. It is the ability to intercept a security event, enrich it with external context, and act on it in real time. Say a user attempts to access a regulated application containing sensitive financial data. Before the request completes, you query your learning management system to verify the user has completed the required compliance training. If their certification has expired, or they never completed it, access is denied, and they are redirected to the training portal. The policy did not just trigger an alert — it made the decision.

Building the most programmable SASE platform

The Cloudflare global network spans more than 330 cities across the globe and operates within approximately 50 milliseconds of 95% of the Internet-connected population. This network runs every service on every server in every data center. That means our industry-leading SASE platform and Developer Platform run side by side, on the same metal, making our Cloudflare services both composable and programmable.

When you use Cloudflare to protect your external web properties, you are using the same network, the same tools, and the same primitives as when you secure your users, devices, and private networks with Cloudflare One. Those are also the same primitives you use when you build and deploy full-stack applications on our Developer Platform. They are designed to work together — not because they were integrated after the fact, but because they were never separate to begin with.

By design, this allows customers to extend policy decisions with custom logic in real time. You can call an external risk API, inject dynamic headers, or validate browser attributes. You can route traffic based on your business logic without adding latency or standing up separate infrastructure. Standalone SASE providers without their own compute platform require you to deploy automation in a separate cloud, manually configure webhooks, and accept the round-trip latency and management overhead of stitching together disconnected systems. With Cloudflare, your Worker augments inline SASE services like Access to enforce custom policies, at the edge, in milliseconds.

What programmability unlocks

At its core, every security gateway operates on the same fundamental model. Traffic flows from sources, through policies, to destinations. The policies are where things get interesting, but in most platforms, your options are limited to predefined actions: allow, block, isolate, or quarantine.

We think there is a better way. What if you could invoke custom logic instead?

Rather than predefined actions, you could:

Dynamically inject headers based on user identity claims
Call external risk engines for a real-time verdict before allowing access
Enforce access controls based on location and working hours

Today, customers can already do many of these things with Cloudflare. And we are strengthening the integration between our SASE and Developer Platform to make this even easier. Programmability extensions, like the ones listed above, will be natively integrated into Cloudflare One, enabling customers to build real-time, custom logic into their security and networking policies. Inspect a request and make a decision in milliseconds. Or run a Worker on a schedule to analyze user activity and update policies accordingly, such as adding users to a high-risk list based on signals from an external system.

We are building this around the concept of actions: both managed and custom. Managed actions will provide templates for common scenarios like IT service management integrations, redirects, and compliance automation. Custom actions allow you to define your own logic entirely. When a Gateway HTTP policy matches, instead of being limited to allow, block, or isolate, you can invoke a Cloudflare Worker directly. Your code runs at the edge, in real time, with full access to the request context.

How customers are building today

While we are improving this experience, many customers are already using Cloudflare One and Developer Platform this way today. Here is a simple example that illustrates what you can do with this programmability.

Automated device session revocation

The problem: A customer wanted to enforce periodic re-authentication for their Cloudflare One Client users, similar to how traditional VPNs require users to re-authenticate every few hours. Cloudflare's pre-defined session controls are designed around per-application policies, not global time-based expiration.

The solution: A scheduled Cloudflare Worker that queries the Devices API, identifies devices that have been inactive longer than a specified threshold, and revokes their registrations, forcing users to re-authenticate via their identity provider.

export default {
  async scheduled(event, env, ctx) {
    const API_TOKEN = env.API_TOKEN;
    const ACCOUNT_ID = env.ACCOUNT_ID;
    const REVOKE_INTERVAL_MINUTES = parseInt(env.REVOKE_INTERVAL_MINUTES); // Reuse for inactivity threshold
    const DRY_RUN = env.DRY_RUN === 'true';

    const headers = {
      'Authorization': `Bearer ${API_TOKEN}`,
      'Content-Type': 'application/json'
    };

    let cursor = '';
    let allDevices = [];

    // Fetch all registrations with cursor-based pagination
    while (true) {
      let url = `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/devices/registrations?per_page=100`;
      if (cursor) {
        url += `&cursor=${cursor}`;
      }

      const devicesResponse = await fetch(url, { headers });
      const devicesData = await devicesResponse.json();
      if (!devicesData.success) {
        console.error('Failed to fetch registrations:', devicesData.errors);
        return;
      }

      allDevices = allDevices.concat(devicesData.result);

      // Extract next cursor (adjust if your response uses a different field, e.g., devicesData.result_info.cursor)
      cursor = devicesData.cursor || '';
      if (!cursor) break;
    }

    const now = new Date();

    for (const device of allDevices) {
      const lastSeen = new Date(device.last_seen_at);
      const minutesInactive = (now - lastSeen) / (1000 * 60);

      if (minutesInactive > REVOKE_INTERVAL_MINUTES) {
        console.log(`Registration ${device.id} inactive for ${minutesInactive} minutes.`);

        if (DRY_RUN) {
          console.log(`Dry run: Would delete registration ${device.id}`);
        } else {
          const deleteResponse = await fetch(
            `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/devices/registrations/${device.id}`,
            { method: 'DELETE', headers }
          );
          const deleteData = await deleteResponse.json();
          if (deleteData.success) {
            console.log(`Deleted registration ${device.id}`);
          } else {
            console.error(`Failed to delete ${device.id}:`, deleteData.errors);
          }
        }
      }
    }
  }
};

Configure the Worker with environment secrets (API_TOKEN, ACCOUNT_ID, REVOKE_INTERVAL_MINUTES) and a cron trigger (0 */4 * * * for every 4 hours), and you have automated session management. Just getting a simple feature like this into a vendor’s roadmap could take months, and even longer to move into a management interface.

But with automated device session revocation, our technical specialist deployed this policy with the customer in an afternoon. It's been running in production for months.

We’ve observed countless implementations like this across Cloudflare One deployments. We’ve seen users implement coaching pages and purpose justification workflows by using our existing redirect policies and Workers. Other users have built custom logic that evaluates browser attributes before making policy or routing decisions. Each solves a unique problem that would otherwise require waiting for a vendor to build a specific, niche integration with a third-party system. Instead, customers are building exactly what they need, on their timeline, with logic they own.

A programmable platform that changes the conversation

We believe the future of enterprise security isn't a monolithic platform that tries to do everything. It's a composable and programmable platform that gives customers the tools and flexibility to extend it in any direction.

For security teams, we expect our platform to change the conversation. Instead of filing a feature request and hoping it makes the roadmap, you can build a tailored solution that addresses your exact requirements today.

For our partners and managed security service providers (MSSPs), our platform opens up their ability to build and deliver solutions for their specific customer base. That means industry-specific solutions, or capabilities for customers in a specific regulatory environment. Custom integrations become a competitive advantage, not a professional services engagement.

And for our customers, it means you're building on a platform that is easy to deploy and fundamentally adaptable to your most complex and changing needs. Your security platform grows with you — it doesn’t constrain you.

What's next

We're just getting started. Throughout 2026, you'll see us continue to deepen the integration between Cloudflare One and our Developer Platform. We plan to start by creating custom actions in Cloudflare Gateway that support dynamic policy enforcement. These actions can use auxiliary data stored in your organization's existing databases without the administrative or compliance challenges of migrating that data into Cloudflare. These same custom actions will also support request augmentation to pass along Cloudflare attributes to your internal systems, for better logging and access decisions in your downstream systems.

In the meantime, the building blocks are already here. External evaluation rules, custom device posture checks, Gateway redirects, and the full power of Workers are available today. If you're not sure where to start, our developer documentation has guides and reference architectures for extending Cloudflare One.

We built Cloudflare on the belief that security should be ridiculously easy to use, but we also know that "easy" doesn't mean "one-size-fits-all." It means giving you the tools to build exactly what you need. We believe that’s the future of SASE.

We deserve a better streams API for JavaScript

James M Snell — Fri, 27 Feb 2026 06:00:00 GMT

Handling data in streams is fundamental to how we build applications. To make streaming work everywhere, the WHATWG Streams Standard (informally known as "Web streams") was designed to establish a common API to work across browsers and servers. It shipped in browsers, was adopted by Cloudflare Workers, Node.js, Deno, and Bun, and became the foundation for APIs like fetch(). It's a significant undertaking, and the people who designed it were solving hard problems with the constraints and tools they had at the time.

But after years of building on Web streams – implementing them in both Node.js and Cloudflare Workers, debugging production issues for customers and runtimes, and helping developers work through far too many common pitfalls – I've come to believe that the standard API has fundamental usability and performance issues that cannot be fixed easily with incremental improvements alone. The problems aren't bugs; they're consequences of design decisions that may have made sense a decade ago, but don't align with how JavaScript developers write code today.

This post explores some of the fundamental issues I see with Web streams and presents an alternative approach built around JavaScript language primitives that demonstrate something better is possible.

In benchmarks, this alternative can run anywhere between 2x to 120x faster than Web streams in every runtime I've tested it on (including Cloudflare Workers, Node.js, Deno, Bun, and every major browser). The improvements are not due to clever optimizations, but fundamentally different design choices that more effectively leverage modern JavaScript language features. I'm not here to disparage the work that came before; I'm here to start a conversation about what can potentially come next.

Where we're coming from

The Streams Standard was developed between 2014 and 2016 with an ambitious goal to provide "APIs for creating, composing, and consuming streams of data that map efficiently to low-level I/O primitives." Before Web streams, the web platform had no standard way to work with streaming data.

Node.js already had its own streaming API at the time that was ported to also work in browsers, but WHATWG chose not to use it as a starting point given that it is chartered to only consider the needs of Web browsers. Server-side runtimes only adopted Web streams later, after Cloudflare Workers and Deno each emerged with first-class Web streams support and cross-runtime compatibility became a priority.

The design of Web streams predates async iteration in JavaScript. The for await...of syntax didn't land until ES2018, two years after the Streams Standard was initially finalized. This timing meant the API couldn't initially leverage what would eventually become the idiomatic way to consume asynchronous sequences in JavaScript. Instead, the spec introduced its own reader/writer acquisition model, and that decision rippled through every aspect of the API.

Excessive ceremony for common operations

The most common task with streams is reading them to completion. Here's what that looks like with Web streams:

// First, we acquire a reader that gives an exclusive lock
// on the stream...
const reader = stream.getReader();
const chunks = [];
try {
  // Second, we repeatedly call read and await on the returned
  // promise to either yield a chunk of data or indicate we're
  // done.
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    chunks.push(value);
  }
} finally {
  // Finally, we release the lock on the stream
  reader.releaseLock();
}

You might assume this pattern is inherent to streaming. It isn't. The reader acquisition, the lock management, and the { value, done } protocol are all just design choices, not requirements. They are artifacts of how and when the Web streams spec was written. Async iteration exists precisely to handle sequences that arrive over time, but async iteration did not yet exist when the streams specification was written. The complexity here is pure API overhead, not fundamental necessity.

Consider the alternative approach now that Web streams do support for await...of:

const chunks = [];
for await (const chunk of stream) {
  chunks.push(chunk);
}

This is better in that there is far less boilerplate, but it doesn't solve everything. Async iteration was retrofitted onto an API that wasn't designed for it, and it shows. Features like BYOB (bring your own buffer) reads aren't accessible through iteration. The underlying complexity of readers, locks, and controllers are still there, just hidden. When something does go wrong, or when additional features of the API are needed, developers find themselves back in the weeds of the original API, trying to understand why their stream is "locked" or why releaseLock() didn't do what they expected or hunting down bottlenecks in code they don't control.

The locking problem

Web streams use a locking model to prevent multiple consumers from interleaving reads. When you call getReader(), the stream becomes locked. While locked, nothing else can read from the stream directly, pipe it, or even cancel it – only the code that is actually holding the reader can.

This sounds reasonable until you see how easily it goes wrong:

async function peekFirstChunk(stream) {
  const reader = stream.getReader();
  const { value } = await reader.read();
  // Oops — forgot to call reader.releaseLock()
  // And the reader is no longer available when we return
  return value;
}

const first = await peekFirstChunk(stream);
// TypeError: Cannot obtain lock — stream is permanently locked
for await (const chunk of stream) { /* never runs */ }

Forgetting releaseLock() permanently breaks the stream. The locked property tells you that a stream is locked, but not why, by whom, or whether the lock is even still usable. Piping internally acquires locks, making streams unusable during pipe operations in ways that aren't obvious.

The semantics around releasing locks with pending reads were also unclear for years. If you called read() but didn't await it, then called releaseLock(), what happened? The spec was recently clarified to cancel pending reads on lock release – but implementations varied, and code that relied on the previous unspecified behavior can break.

That said, it's important to recognize that locking in itself is not bad. It does, in fact, serve an important purpose to ensure that applications properly and orderly consume or produce data. The key challenge is with the original manual implementation of it using APIs like getReader() and releaseLock(). With the arrival of automatic lock and reader management with async iterables, dealing with locks from the users point of view became a lot easier.

For implementers, the locking model adds a fair amount of non-trivial internal bookkeeping. Every operation must check lock state, readers must be tracked, and the interplay between locks, cancellation, and error states creates a matrix of edge cases that must all be handled correctly.

BYOB: complexity without payoff

BYOB (bring your own buffer) reads were designed to let developers reuse memory buffers when reading from streams, an important optimization intended for high-throughput scenarios. The idea is sound: instead of allocating new buffers for each chunk, you provide your own buffer and the stream fills it.

In practice, (and yes, there are always exceptions to be found) BYOB is rarely used to any measurable benefit. The API is substantially more complex than default reads, requiring a separate reader type (ReadableStreamBYOBReader) and other specialized classes (e.g. ReadableStreamBYOBRequest), careful buffer lifecycle management, and understanding of ArrayBuffer detachment semantics. When you pass a buffer to a BYOB read, the buffer becomes detached – transferred to the stream – and you get back a different view over potentially different memory. This transfer-based model is error-prone and confusing:

const reader = stream.getReader({ mode: 'byob' });
const buffer = new ArrayBuffer(1024);
let view = new Uint8Array(buffer);

const result = await reader.read(view);
// 'view' should now be detached and unusable
// (it isn't always in every impl)
// result.value is a NEW view, possibly over different memory
view = result.value; // Must reassign

BYOB also can't be used with async iteration or TransformStreams, so developers who want zero-copy reads are forced back into the manual reader loop.

For implementers, BYOB adds significant complexity. The stream must track pending BYOB requests, handle partial fills, manage buffer detachment correctly, and coordinate between the BYOB reader and the underlying source. The Web Platform Tests for readable byte streams include dedicated test files just for BYOB edge cases: detached buffers, bad views, response-after-enqueue ordering, and more.

BYOB ends up being complex for both users and implementers, yet sees little adoption in practice. Most developers stick with default reads and accept the allocation overhead.

Most userland implementations of custom ReadableStream instances do not typically bother with all the ceremony required to correctly implement both default and BYOB read support in a single stream – and for good reason. It's difficult to get right and most of the time consuming code is typically going to fallback on the default read path. The example below shows what a "correct" implementation would need to do. It's big, complex, and error prone, and not a level of complexity that the typical developer really wants to have to deal with:

new ReadableStream({
    type: 'bytes',
    
    async pull(controller: ReadableByteStreamController) {      
      if (offset >= totalBytes) {
        controller.close();
        return;
      }
      
      // Check for BYOB request FIRST
      const byobRequest = controller.byobRequest;
      
      if (byobRequest) {
        // === BYOB PATH ===
        // Consumer provided a buffer - we MUST fill it (or part of it)
        const view = byobRequest.view!;
        const bytesAvailable = totalBytes - offset;
        const bytesToWrite = Math.min(view.byteLength, bytesAvailable);
        
        // Create a view into the consumer's buffer and fill it
        // not critical but safer when bytesToWrite != view.byteLength
        const dest = new Uint8Array(
          view.buffer,
          view.byteOffset,
          bytesToWrite
        );
        
        // Fill with sequential bytes (our "data source")
        // Can be any thing here that writes into the view
        for (let i = 0; i < bytesToWrite; i++) {
          dest[i] = (offset + i) & 0xFF;
        }
        
        offset += bytesToWrite;
        
        // Signal how many bytes we wrote
        byobRequest.respond(bytesToWrite);
        
      } else {
        // === DEFAULT READER PATH ===
        // No BYOB request - allocate and enqueue a chunk
        const bytesAvailable = totalBytes - offset;
        const chunkSize = Math.min(1024, bytesAvailable);
        
        const chunk = new Uint8Array(chunkSize);
        for (let i = 0; i < chunkSize; i++) {
          chunk[i] = (offset + i) & 0xFF;
        }
        
        offset += chunkSize;
        controller.enqueue(chunk);
      }
    },
    
    cancel(reason) {
      console.log('Stream canceled:', reason);
    }
  });

When a host runtime provides a byte-oriented ReadableStream from the runtime itself, for instance, as the body of a fetch Response, it is often far easier for the runtime itself to provide an optimized implementation of BYOB reads, but those still need to be capable of handling both default and BYOB reading patterns and that requirement brings with it a fair amount of complexity.

Backpressure: good in theory, broken in practice

Backpressure – the ability for a slow consumer to signal a fast producer to slow down – is a first-class concept in Web streams. In theory. In practice, the model has some serious flaws.

The primary signal is desiredSize on the controller. It can be positive (wants data), zero (at capacity), negative (over capacity), or null (closed). Producers are supposed to check this value and stop enqueueing when it's not positive. But there's nothing enforcing this: controller.enqueue() always succeeds, even when desiredSize is deeply negative.

new ReadableStream({
  start(controller) {
    // Nothing stops you from doing this
    while (true) {
      controller.enqueue(generateData()); // desiredSize: -999999
    }
  }
});

Stream implementations can and do ignore backpressure; and some spec-defined features explicitly break backpressure. tee(), for instance, creates two branches from a single stream. If one branch reads faster than the other, data accumulates in an internal buffer with no limit. A fast consumer can cause unbounded memory growth while the slow consumer catches up, and there's no way to configure this or opt out beyond canceling the slower branch.

Web streams do provide clear mechanisms for tuning backpressure behavior in the form of the highWaterMark option and customizable size calculations, but these are just as easy to ignore as desiredSize, and many applications simply fail to pay attention to them.

The same issues exist on the WritableStream side. A WritableStream has a highWaterMark and desiredSize. There is a writer.ready promise that producers of data are supposed to pay attention but often don't.

const writable = getWritableStreamSomehow();
const writer = writable.getWriter();

// Producers are supposed to wait for the writer.ready
// It is a promise that, when resolves, indicates that
// the writables internal backpressure is cleared and
// it is ok to write more data
await writer.ready;
await writer.write(...);

For implementers, backpressure adds complexity without providing guarantees. The machinery to track queue sizes, compute desiredSize, and invoke pull() at the right times must all be implemented correctly. However, since these signals are advisory, all that work doesn't actually prevent the problems backpressure is supposed to solve.

The hidden cost of promises

The Web streams spec requires promise creation at numerous points, often in hot paths and often invisible to users. Each read() call doesn't just return a promise; internally, the implementation creates additional promises for queue management, pull() coordination, and backpressure signaling.

This overhead is mandated by the spec's reliance on promises for buffer management, completion, and backpressure signals. While some of it is implementation-specific, much of it is unavoidable if you're following the spec as written. For high-frequency streaming – video frames, network packets, real-time data – this overhead is significant.

The problem compounds in pipelines. Each TransformStream adds another layer of promise machinery between source and sink. The spec doesn't define synchronous fast paths, so even when data is available immediately, the promise machinery still runs.

For implementers, this promise-heavy design constrains optimization opportunities. The spec mandates specific promise resolution ordering, making it difficult to batch operations or skip unnecessary async boundaries without risking subtle compliance failures. There are many hidden internal optimizations that implementers do make but these can be complicated and difficult to get right.

While I was writing this blog post, Vercel's Malte Ubl published their own blog post describing some research work Vercel has been doing around improving the performance of Node.js' Web streams implementation. In that post they discuss the same fundamental performance optimization problem that every implementation of Web streams face:

"Or consider pipeTo(). Each chunk passes through a full Promise chain: read, write, check backpressure, repeat. An {value, done} result object is allocated per read. Error propagation creates additional Promise branches.
None of this is wrong. These guarantees matter in the browser where streams cross security boundaries, where cancellation semantics need to be airtight, where you do not control both ends of a pipe. But on the server, when you are piping React Server Components through three transforms at 1KB chunks, the cost adds up.
We benchmarked native WebStream pipeThrough at 630 MB/s for 1KB chunks. Node.js pipeline() with the same passthrough transform: ~7,900 MB/s. That is a 12x gap, and the difference is almost entirely Promise and object allocation overhead." - Malte Ubl, https://vercel.com/blog/we-ralph-wiggumed-webstreams-to-make-them-10x-faster

As part of their research, they have put together a set of proposed improvements for Node.js' Web streams implementation that will eliminate promises in certain code paths which can yield a significant performance boost up to 10x faster, which only goes to prove the point: promises, while useful, add significant overhead. As one of the core maintainers of Node.js, I am looking forward to helping Malte and the folks at Vercel get their proposed improvements landed!

In a recent update made to Cloudflare Workers, I made similar kinds of modifications to an internal data pipeline that reduced the number of JavaScript promises created in certain application scenarios by up to 200x. The result is several orders of magnitude improvement in performance in those applications.

Real-world failures

Exhausting resources with unconsumed bodies

When fetch() returns a response, the body is a ReadableStream. If you only check the status and don't consume or cancel the body, what happens? The answer varies by implementation, but a common outcome is resource leakage.

async function checkEndpoint(url) {
  const response = await fetch(url);
  return response.ok; // Body is never consumed or cancelled
}

// In a loop, this can exhaust connection pools
for (const url of urls) {
  await checkEndpoint(url);
}

This pattern has caused connection pool exhaustion in Node.js applications using undici (the fetch() implementation built into Node.js), and similar issues have appeared in other runtimes. The stream holds a reference to the underlying connection, and without explicit consumption or cancellation, the connection may linger until garbage collection – which may not happen soon enough under load.

The problem is compounded by APIs that implicitly create stream branches. Request.clone() and Response.clone() perform implicit tee() operations on the body stream – a detail that's easy to miss. Code that clones a request for logging or retry logic may unknowingly create branched streams that need independent consumption, multiplying the resource management burden.

Now, to be certain, these types of issues are implementation bugs. The connection leak was definitely something that undici needed to fix in its own implementation, but the complexity of the specification does not make dealing with these types of issues easy.

"Cloning streams in Node.js's fetch() implementation is harder than it looks. When you clone a request or response body, you're calling tee() - which splits a single stream into two branches that both need to be consumed. If one consumer reads faster than the other, data buffers unbounded in memory waiting for the slow branch. If you don't properly consume both branches, the underlying connection leaks. The coordination required between two readers sharing one source makes it easy to accidentally break the original request or exhaust connection pools. It's a simple API call with complex underlying mechanics that are difficult to get right." - Matteo Collina, Ph.D. - Platformatic Co-Founder & CTO, Node.js Technical Steering Committee Chair

Falling headlong off the tee() memory cliff

tee() splits a stream into two branches. It seems straightforward, but the implementation requires buffering: if one branch is read faster than the other, the data must be held somewhere until the slower branch catches up.

const [forHash, forStorage] = response.body.tee();

// Hash computation is fast
const hash = await computeHash(forHash);

// Storage write is slow — meanwhile, the entire stream
// may be buffered in memory waiting for this branch
await writeToStorage(forStorage);

The spec does not mandate buffer limits for tee(). And to be fair, the spec allows implementations to implement the actual internal mechanisms for tee()and other APIs in any way they see fit so long as the observable normative requirements of the specification are met. But if an implementation chooses to implement tee() in the specific way described by the streams specification, then tee() will come with a built-in memory management issue that is difficult to work around.

Implementations have had to develop their own strategies for dealing with this. Firefox initially used a linked-list approach that led to O(n) memory growth proportional to the consumption rate difference. In Cloudflare Workers, we opted to implement a shared buffer model where backpressure is signaled by the slowest consumer rather than the fastest.

Transform backpressure gaps

TransformStream creates a readable/writable pair with processing logic in between. The transform() function executes on write, not on read. Processing of the transform happens eagerly as data arrives, regardless of whether any consumer is ready. This causes unnecessary work when consumers are slow, and the backpressure signaling between the two sides has gaps that can cause unbounded buffering under load. The expectation in the spec is that the producer of the data being transformed is paying attention to the writer.ready signal on the writable side of the transform but quite often producers just simply ignore it.

If the transform's transform() operation is synchronous and always enqueues output immediately, it never signals backpressure back to the writable side even when the downstream consumer is slow. This is a consequence of the spec design that many developers completely overlook. In browsers, where there's only a single user and typically only a small number of stream pipelines active at any given time, this type of foot gun is often of no consequence, but it has a major impact on server-side or edge performance in runtimes that serve thousands of concurrent requests.

const fastTransform = new TransformStream({
  transform(chunk, controller) {
    // Synchronously enqueue — this never applies backpressure
    // Even if the readable side's buffer is full, this succeeds
    controller.enqueue(processChunk(chunk));
  }
});

// Pipe a fast source through the transform to a slow sink
fastSource
  .pipeThrough(fastTransform)
  .pipeTo(slowSink);  // Buffer grows without bound

What TransformStreams are supposed to do is check for backpressure on the controller and use promises to communicate that back to the writer:

const fastTransform = new TransformStream({
  async transform(chunk, controller) {
    if (controller.desiredSize <= 0) {
      // Wait on the backpressure to clear somehow
    }

    controller.enqueue(processChunk(chunk));
  }
});

A difficulty here, however, is that the TransformStreamDefaultController does not have a ready promise mechanism like Writers do; so the TransformStream implementation would need to implement a polling mechanism to periodically check when controller.desiredSize becomes positive again.

The problem gets worse in pipelines. When you chain multiple transforms – say, parse, transform, then serialize – each TransformStream has its own internal readable and writable buffers. If implementers follow the spec strictly, data cascades through these buffers in a push-oriented fashion: the source pushes to transform A, which pushes to transform B, which pushes to transform C, each accumulating data in intermediate buffers before the final consumer has even started pulling. With three transforms, you can have six internal buffers filling up simultaneously.

Developers using the streams API are expected to remember to use options like highWaterMark when creating their sources, transforms, and writable destinations but often they either forget or simply choose to ignore it.

source
  .pipeThrough(parse)      // buffers filling...
  .pipeThrough(transform)  // more buffers filling...
  .pipeThrough(serialize)  // even more buffers...
  .pipeTo(destination);    // consumer hasn't started yet

Implementations have found ways to optimize transform pipelines by collapsing identity transforms, short-circuiting non-observable paths, deferring buffer allocation, or falling back to native code that does not run JavaScript at all. Deno, Bun, and Cloudflare Workers have all successfully implemented "native path" optimizations that can help eliminate much of the overhead, and Vercel's recent fast-webstreams research is working on similar optimizations for Node.js. But the optimizations themselves add significant complexity and still can't fully escape the inherently push-oriented model that TransformStream uses.

GC thrashing in server-side rendering

Streaming server-side rendering (SSR) is a particularly painful case. A typical SSR stream might render thousands of small HTML fragments, each passing through the streams machinery:

// Each component enqueues a small chunk
function renderComponent(controller) {
  controller.enqueue(encoder.encode(`${content}`));
}

// Hundreds of components = hundreds of enqueue calls
// Each one triggers promise machinery internally
for (const component of components) {
  renderComponent(controller);  // Promises created, objects allocated
}

Every fragment means promises created for read() calls, promises for backpressure coordination, intermediate buffer allocations, and { value, done } result objects – most of which become garbage almost immediately.

Under load, this creates GC pressure that can devastate throughput. The JavaScript engine spends significant time collecting short-lived objects instead of doing useful work. Latency becomes unpredictable as GC pauses interrupt request handling. I've seen SSR workloads where garbage collection accounts for a substantial portion (up to and beyond 50%) of total CPU time per request. That's time that could be spent actually rendering content.

The irony is that streaming SSR is supposed to improve performance by sending content incrementally. But the overhead of the streams machinery can negate those gains, especially for pages with many small components. Developers sometimes find that buffering the entire response is actually faster than streaming through Web streams, defeating the purpose entirely.

The optimization treadmill

To achieve usable performance, every major runtime has resorted to non-standard internal optimizations for Web streams. Node.js, Deno, Bun, and Cloudflare Workers have all developed their own workarounds. This is particularly true for streams wired up to system-level I/O, where much of the machinery is non-observable and can be short-circuited.

Finding these optimization opportunities can itself be a significant undertaking. It requires end-to-end understanding of the spec to identify which behaviors are observable and which can safely be elided. Even then, whether a given optimization is actually spec-compliant is often unclear. Implementers must make judgment calls about which semantics they can relax without breaking compatibility. This puts enormous pressure on runtime teams to become spec experts just to achieve acceptable performance.

These optimizations are difficult to implement, frequently error-prone, and lead to inconsistent behavior across runtimes. Bun's "Direct Streams" optimization takes a deliberately and observably non-standard approach, bypassing much of the spec's machinery entirely. Cloudflare Workers' IdentityTransformStream provides a fast-path for pass-through transforms but is Workers-specific and implements behaviors that are not standard for a TransformStream. Each runtime has its own set of tricks and the natural tendency is toward non-standard solutions, because that's often the only way to make things fast.

This fragmentation hurts portability. Code that performs well on one runtime may behave differently (or poorly) on another, even though it's using "standard" APIs. The complexity burden on runtime implementers is substantial, and the subtle behavioral differences create friction for developers trying to write cross-runtime code, particularly those maintaining frameworks that must be able to run efficiently across many runtime environments.

It is also necessary to emphasize that many optimizations are only possible in parts of the spec that are unobservable to user code. The alternative, like Bun "Direct Streams", is to intentionally diverge from the spec-defined observable behaviors. This means optimizations often feel "incomplete". They work in some scenarios but not in others, in some runtimes but not others, etc. Every such case adds to the overall unsustainable complexity of the Web streams approach which is why most runtime implementers rarely put significant effort into further improvements to their streams implementations once the conformance tests are passing.

Implementers shouldn't need to jump through these hoops. When you find yourself needing to relax or bypass spec semantics just to achieve reasonable performance, that's a sign something is wrong with the spec itself. A well-designed streaming API should be efficient by default, not require each runtime to invent its own escape hatches.

The compliance burden

A complex spec creates complex edge cases. The Web Platform Tests for streams span over 70 test files, and while comprehensive testing is a good thing, what's telling is what needs to be tested.

Consider some of the more obscure tests that implementations must pass:

Prototype pollution defense: One test patches Object.prototype.then to intercept promise resolutions, then verifies that pipeTo() and tee() operations don't leak internal values through the prototype chain. This tests a security property that only exists because the spec's promise-heavy internals create an attack surface.
WebAssembly memory rejection: BYOB reads must explicitly reject ArrayBuffers backed by WebAssembly memory, which look like regular buffers but can't be transferred. This edge case exists because of the spec's buffer detachment model – a simpler API wouldn't need to handle it.
Crash regression for state machine conflicts: A test specifically checks that calling byobRequest.respond() after enqueue() doesn't crash the runtime. This sequence creates a conflict in the internal state machine — the enqueue() fulfills the pending read and should invalidate the byobRequest, but implementations must gracefully handle the subsequent respond() rather than corrupting memory in order to cover the very likely possibility that developers are not using the complex API correctly.

These aren't contrived scenarios invented by test authors in total vacuum. They're consequences of the spec's design and reflect real world bugs.

For runtime implementers, passing the WPT suite means handling intricate corner cases that most application code will never encounter. The tests encode not just the happy path but the full matrix of interactions between readers, writers, controllers, queues, strategies, and the promise machinery that connects them all.

A simpler API would mean fewer concepts, fewer interactions between concepts, and fewer edge cases to get right resulting in more confidence that implementations actually behave consistently.

The takeaway

Web streams are complex for users and implementers alike. The problems with the spec aren't bugs. They emerge from using the API exactly as designed. They aren't issues that can be fixed solely through incremental improvements. They're consequences of fundamental design choices. To improve things we need different foundations.

A better streams API is possible

After implementing the Web streams spec multiple times across different runtimes and seeing the pain points firsthand, I decided it was time to explore what a better, alternative streaming API could look like if designed from first principles today.

What follows is a proof of concept: it's not a finished standard, not a production-ready library, not even necessarily a concrete proposal for something new, but a starting point for discussion that demonstrates the problems with Web streams aren't inherent to streaming itself; they're consequences of specific design choices that could be made differently. Whether this exact API is the right answer is less important than whether it sparks a productive conversation about what we actually need from a streaming primitive.

What is a stream?

Before diving into API design, it's worth asking: what is a stream?

At its core, a stream is just a sequence of data that arrives over time. You don't have all of it at once. You process it incrementally as it becomes available.

Unix pipes are perhaps the purest expression of this idea:

cat access.log | grep "error" | sort | uniq -c

Data flows left to right. Each stage reads input, does its work, writes output. There's no pipe reader to acquire, no controller lock to manage. If a downstream stage is slow, upstream stages naturally slow down as well. Backpressure is implicit in the model, not a separate mechanism to learn (or ignore).

In JavaScript, the natural primitive for "a sequence of things that arrive over time" is already in the language: the async iterable. You consume it with for await...of. You stop consuming by stopping iteration.

This is the intuition the new API tries to preserve: streams should feel like iteration, because that's what they are. The complexity of Web streams – readers, writers, controllers, locks, queuing strategies – obscures this fundamental simplicity. A better API should make the simple case simple and only add complexity where it's genuinely needed.

Design principles

I built the proof-of-concept alternative around a different set of principles.

Streams are iterables.

No custom ReadableStream class with hidden internal state. A readable stream is just an AsyncIterable. You consume it with for await...of. No readers to acquire, no locks to manage.

Pull-through transforms

Transforms don't execute until the consumer pulls. There's no eager evaluation, no hidden buffering. Data flows on-demand from source, through transforms, to the consumer. If you stop iterating, processing stops.

Explicit backpressure

Backpressure is strict by default. When a buffer is full, writes reject rather than silently accumulating. You can configure alternative policies – block until space is available, drop oldest, drop newest – but you have to choose explicitly. No more silent memory growth.

Batched chunks

Instead of yielding one chunk per iteration, streams yield Uint8Array[]: arrays of chunks. This amortizes the async overhead across multiple chunks, reducing promise creation and microtask latency in hot paths.

Bytes only

The API deals exclusively with bytes (Uint8Array). Strings are UTF-8 encoded automatically. There's no "value stream" vs "byte stream" dichotomy. If you want to stream arbitrary JavaScript values, use async iterables directly. While the API uses Uint8Array, it treats chunks as opaque. There is no partial consumption, no BYOB patterns, no byte-level operations within the streaming machinery itself. Chunks go in, chunks come out, unchanged unless a transform explicitly modifies them.

Synchronous fast paths matter

The API recognizes that synchronous data sources are both necessary and common. The application should not be forced to always accept the performance cost of asynchronous scheduling simply because that's the only option provided. At the same time, mixing sync and async processing can be dangerous. Synchronous paths should always be an option and should always be explicit.

The new API in action

Creating and consuming streams

In Web streams, creating a simple producer/consumer pair requires TransformStream, manual encoding, and careful lock management:

const { readable, writable } = new TransformStream();
const enc = new TextEncoder();
const writer = writable.getWriter();
await writer.write(enc.encode("Hello, World!"));
await writer.close();
writer.releaseLock();

const dec = new TextDecoder();
let text = '';
for await (const chunk of readable) {
  text += dec.decode(chunk, { stream: true });
}
text += dec.decode();

Even this relatively clean version requires: a TransformStream, manual TextEncoder and TextDecoder, and explicit lock release.

Here's the equivalent with the new API:

import { Stream } from 'new-streams';

// Create a push stream
const { writer, readable } = Stream.push();

// Write data — backpressure is enforced
await writer.write("Hello, World!");
await writer.end();

// Consume as text
const text = await Stream.text(readable);

The readable is just an async iterable. You can pass it to any function that expects one, including Stream.text() which collects and decodes the entire stream.

The writer has a simple interface: write(), writev() for batched writes, end() to signal completion, and abort() for errors. That's essentially it.

The Writer is not a concrete class. Any object that implements write(), end(), and abort() can be a writer making it easy to adapt existing APIs or create specialized implementations without subclassing. There's no complex UnderlyingSink protocol with start(), write(), close(), and abort() callbacks that must coordinate through a controller whose lifecycle and state are independent of the WritableStream it is bound to.

Here's a simple in-memory writer that collects all written data:

// A minimal writer implementation — just an object with methods
function createBufferWriter() {
  const chunks = [];
  let totalBytes = 0;
  let closed = false;

  const addChunk = (chunk) => {
    chunks.push(chunk);
    totalBytes += chunk.byteLength;
  };

  return {
    get desiredSize() { return closed ? null : 1; },

    // Async variants
    write(chunk) { addChunk(chunk); },
    writev(batch) { for (const c of batch) addChunk(c); },
    end() { closed = true; return totalBytes; },
    abort(reason) { closed = true; chunks.length = 0; },

    // Sync variants return boolean (true = accepted)
    writeSync(chunk) { addChunk(chunk); return true; },
    writevSync(batch) { for (const c of batch) addChunk(c); return true; },
    endSync() { closed = true; return totalBytes; },
    abortSync(reason) { closed = true; chunks.length = 0; return true; },

    getChunks() { return chunks; }
  };
}

// Use it
const writer = createBufferWriter();
await Stream.pipeTo(source, writer);
const allData = writer.getChunks();

No base class to extend, no abstract methods to implement, no controller to coordinate with. Just an object with the right shape.

Pull-through transforms

Under the new API design, transforms should not perform any work until the data is being consumed. This is a fundamental principle.

// Nothing executes until iteration begins
const output = Stream.pull(source, compress, encrypt);

// Transforms execute as we iterate
for await (const chunks of output) {
  for (const chunk of chunks) {
    process(chunk);
  }
}

Stream.pull() creates a lazy pipeline. The compress and encrypt transforms don't run until you start iterating output. Each iteration pulls data through the pipeline on demand.

This is fundamentally different from Web streams' pipeThrough(), which starts actively pumping data from the source to the transform as soon as you set up the pipe. Pull semantics mean you control when processing happens, and stopping iteration stops processing.

Transforms can be stateless or stateful. A stateless transform is just a function that takes chunks and returns transformed chunks:

// Stateless transform — a pure function
// Receives chunks or null (flush signal)
const toUpperCase = (chunks) => {
  if (chunks === null) return null; // End of stream
  return chunks.map(chunk => {
    const str = new TextDecoder().decode(chunk);
    return new TextEncoder().encode(str.toUpperCase());
  });
};

// Use it directly
const output = Stream.pull(source, toUpperCase);

Stateful transforms are simple objects with member functions that maintain state across calls:

// Stateful transform — a generator that wraps the source
function createLineParser() {
  // Helper to concatenate Uint8Arrays
  const concat = (...arrays) => {
    const result = new Uint8Array(arrays.reduce((n, a) => n + a.length, 0));
    let offset = 0;
    for (const arr of arrays) { result.set(arr, offset); offset += arr.length; }
    return result;
  };

  return {
    async *transform(source) {
      let pending = new Uint8Array(0);
      
      for await (const chunks of source) {
        if (chunks === null) {
          // Flush: yield any remaining data
          if (pending.length > 0) yield [pending];
          continue;
        }
        
        // Concatenate pending data with new chunks
        const combined = concat(pending, ...chunks);
        const lines = [];
        let start = 0;

        for (let i = 0; i < combined.length; i++) {
          if (combined[i] === 0x0a) { // newline
            lines.push(combined.slice(start, i));
            start = i + 1;
          }
        }

        pending = combined.slice(start);
        if (lines.length > 0) yield lines;
      }
    }
  };
}

const output = Stream.pull(source, createLineParser());

For transforms that need cleanup on abort, add an abort handler:

// Stateful transform with resource cleanup
function createGzipCompressor() {
  // Hypothetical compression API...
  const deflate = new Deflater({ gzip: true });

  return {
    async *transform(source) {
      for await (const chunks of source) {
        if (chunks === null) {
          // Flush: finalize compression
          deflate.push(new Uint8Array(0), true);
          if (deflate.result) yield [deflate.result];
        } else {
          for (const chunk of chunks) {
            deflate.push(chunk, false);
            if (deflate.result) yield [deflate.result];
          }
        }
      }
    },
    abort(reason) {
      // Clean up compressor resources on error/cancellation
    }
  };
}

For implementers, there's no Transformer protocol with start(), transform(), flush() methods and controller coordination passed into a TransformStream class that has its own hidden state machine and buffering mechanisms. Transforms are just functions or simple objects: far simpler to implement and test.

Explicit backpressure policies

When a bounded buffer fills up and a producer wants to write more, there are only a few things you can do:

Reject the write: refuse to accept more data
Wait: block until space becomes available
Discard old data: evict what's already buffered to make room
Discard new data: drop what's incoming

That's it. Any other response is either a variation of these (like "resize the buffer," which is really just deferring the choice) or domain-specific logic that doesn't belong in a general streaming primitive. Web streams currently always choose Wait by default.

The new API makes you choose one of these four explicitly:

strict (default): Rejects writes when the buffer is full and too many writes are pending. Catches "fire-and-forget" patterns where producers ignore backpressure.
block: Writes wait until buffer space is available. Use when you trust the producer to await writes properly.
drop-oldest: Drops the oldest buffered data to make room. Useful for live feeds where stale data loses value.
drop-newest: Discards incoming data when full. Useful when you want to process what you have without being overwhelmed.

const { writer, readable } = Stream.push({
  highWaterMark: 10,
  backpressure: 'strict' // or 'block', 'drop-oldest', 'drop-newest'
});

No more hoping producers cooperate. The policy you choose determines what happens when the buffer fills.

Here's how each policy behaves when a producer writes faster than the consumer reads:

// strict: Catches fire-and-forget writes that ignore backpressure
const strict = Stream.push({ highWaterMark: 2, backpressure: 'strict' });
strict.writer.write(chunk1);  // ok (not awaited)
strict.writer.write(chunk2);  // ok (fills slots buffer)
strict.writer.write(chunk3);  // ok (queued in pending)
strict.writer.write(chunk4);  // ok (pending buffer fills)
strict.writer.write(chunk5);  // throws! too many pending writes

// block: Wait for space (unbounded pending queue)
const blocking = Stream.push({ highWaterMark: 2, backpressure: 'block' });
await blocking.writer.write(chunk1);  // ok
await blocking.writer.write(chunk2);  // ok
await blocking.writer.write(chunk3);  // waits until consumer reads
await blocking.writer.write(chunk4);  // waits until consumer reads
await blocking.writer.write(chunk5);  // waits until consumer reads

// drop-oldest: Discard old data to make room
const dropOld = Stream.push({ highWaterMark: 2, backpressure: 'drop-oldest' });
await dropOld.writer.write(chunk1);  // ok
await dropOld.writer.write(chunk2);  // ok
await dropOld.writer.write(chunk3);  // ok, chunk1 discarded

// drop-newest: Discard incoming data when full
const dropNew = Stream.push({ highWaterMark: 2, backpressure: 'drop-newest' });
await dropNew.writer.write(chunk1);  // ok
await dropNew.writer.write(chunk2);  // ok
await dropNew.writer.write(chunk3);  // silently dropped

Explicit Multi-consumer patterns

// Share with explicit buffer management
const shared = Stream.share(source, {
  highWaterMark: 100,
  backpressure: 'strict'
});

const consumer1 = shared.pull();
const consumer2 = shared.pull(decompress);

Instead of tee() with its hidden unbounded buffer, you get explicit multi-consumer primitives. Stream.share() is pull-based: consumers pull from a shared source, and you configure the buffer limits and backpressure policy upfront.

There's also Stream.broadcast() for push-based multi-consumer scenarios. Both require you to think about what happens when consumers run at different speeds, because that's a real concern that shouldn't be hidden.

Sync/async separation

Not all streaming workloads involve I/O. When your source is in-memory and your transforms are pure functions, async machinery adds overhead without benefit. You're paying for coordination of "waiting" that adds no benefit.

The new API has complete parallel sync versions: Stream.pullSync(), Stream.bytesSync(), Stream.textSync(), and so on. If your source and transforms are all synchronous, you can process the entire pipeline without a single promise.

// Async — when source or transforms may be asynchronous
const textAsync = await Stream.text(source);

// Sync — when all components are synchronous
const textSync = Stream.textSync(source);

Here's a complete synchronous pipeline – compression, transformation, and consumption with zero async overhead:

// Synchronous source from in-memory data
const source = Stream.fromSync([inputBuffer]);

// Synchronous transforms
const compressed = Stream.pullSync(source, zlibCompressSync);
const encrypted = Stream.pullSync(compressed, aesEncryptSync);

// Synchronous consumption — no promises, no event loop trips
const result = Stream.bytesSync(encrypted);

The entire pipeline executes in a single call stack. No promises are created, no microtask queue scheduling occurs, and no GC pressure from short-lived async machinery. For CPU-bound workloads like parsing, compression, or transformation of in-memory data, this can be significantly faster than the equivalent Web streams code – which would force async boundaries even when every component is synchronous.

Web streams has no synchronous path. Even if your source has data ready and your transform is a pure function, you still pay for promise creation and microtask scheduling on every operation. Promises are fantastic for cases in which waiting is actually necessary, but they aren't always necessary. The new API lets you stay in sync-land when that's what you need.

Bridging the gap between this and web streams

The async iterator based approach provides a natural bridge between this alternative approach and Web streams. When coming from a ReadableStream to this new approach, simply passing the readable in as input works as expected when the ReadableStream is set up to yield bytes:

const readable = getWebReadableStreamSomehow();
const input = Stream.pull(readable, transform1, transform2);
for await (const chunks of input) {
  // process chunks
}

When adapting to a ReadableStream, a bit more work is required since the alternative approach yields batches of chunks, but the adaptation layer is as easily straightforward:

async function* adapt(input) {
  for await (const chunks of input) {
    for (const chunk of chunks) {
      yield chunk;
    }
  }
}

const input = Stream.pull(source, transform1, transform2);
const readable = ReadableStream.from(adapt(input));

How this addresses the real-world failures from earlier

Unconsumed bodies: Pull semantics mean nothing happens until you iterate. No hidden resource retention. If you don't consume a stream, there's no background machinery holding connections open.
The tee() memory cliff: Stream.share() requires explicit buffer configuration. You choose the highWaterMark and backpressure policy upfront: no more silent unbounded growth when consumers run at different speeds.
Transform backpressure gaps: Pull-through transforms execute on-demand. Data doesn't cascade through intermediate buffers; it flows only when the consumer pulls. Stop iterating, stop processing.
GC thrashing in SSR: Batched chunks (Uint8Array[]) amortize async overhead. Sync pipelines via Stream.pullSync() eliminate promise allocation entirely for CPU-bound workloads.

Performance

The design choices have performance implications. Here are benchmarks from the reference implementation of this possible alternative compared to Web streams (Node.js v24.x, Apple M1 Pro, averaged over 10 runs):

Scenario	Alternative	Web streams	Difference
Small chunks (1KB × 5000)	~13 GB/s	~4 GB/s	~3× faster
Tiny chunks (100B × 10000)	~4 GB/s	~450 MB/s	~8× faster
Async iteration (8KB × 1000)	~530 GB/s	~35 GB/s	~15× faster
Chained 3× transforms (8KB × 500)	~275 GB/s	~3 GB/s	~80–90× faster
High-frequency (64B × 20000)	~7.5 GB/s	~280 MB/s	~25× faster

The chained transform result is particularly striking: pull-through semantics eliminate the intermediate buffering that plagues Web streams pipelines. Instead of each TransformStream eagerly filling its internal buffers, data flows on-demand from consumer to source.

Now, to be fair, Node.js really has not yet put significant effort into fully optimizing the performance of its Web streams implementation. There's likely significant room for improvement in Node.js' performance results through a bit of applied effort to optimize the hot paths there. That said, running these benchmarks in Deno and Bun also show a significant performance improvement with this alternative iterator based approach than in either of their Web streams implementations as well.

Browser benchmarks (Chrome/Blink, averaged over 3 runs) show consistent gains as well:

Scenario	Alternative	Web streams	Difference
Push 3KB chunks	~135k ops/s	~24k ops/s	~5–6× faster
Push 100KB chunks	~24k ops/s	~3k ops/s	~7–8× faster
3 transform chain	~4.6k ops/s	~880 ops/s	~5× faster
5 transform chain	~2.4k ops/s	~550 ops/s	~4× faster
bytes() consumption	~73k ops/s	~11k ops/s	~6–7× faster
Async iteration	~1.1M ops/s	~10k ops/s	~40–100× faster

These benchmarks measure throughput in controlled scenarios; real-world performance depends on your specific use case. The difference between Node.js and browser gains reflects the distinct optimization paths each environment takes for Web streams.

It's worth noting that these benchmarks compare a pure TypeScript/JavaScript implementation of the new API against the native (JavaScript/C++/Rust) implementations of Web streams in each runtime. The new API's reference implementation has had no performance optimization work; the gains come entirely from the design. A native implementation would likely show further improvement.

The gains illustrate how fundamental design choices compound: batching amortizes async overhead, pull semantics eliminate intermediate buffering, and the freedom for implementations to use synchronous fast paths when data is available immediately all contribute.

"We’ve done a lot to improve performance and consistency in Node streams, but there’s something uniquely powerful about starting from scratch. New streams’ approach embraces modern runtime realities without legacy baggage, and that opens the door to a simpler, performant and more coherent streams model." - Robert Nagy, Node.js TSC member and Node.js streams contributor

What's next

I'm publishing this to start a conversation. What did I get right? What did I miss? Are there use cases that don't fit this model? What would a migration path for this approach look like? The goal is to gather feedback from developers who've felt the pain of Web streams and have opinions about what a better API should look like.

Try it yourself

A reference implementation for this alternative approach is available now and can be found at https://github.com/jasnell/new-streams.

API Reference: See the API.md for complete documentation
Examples: The samples directory has working code for common patterns

I welcome issues, discussions, and pull requests. If you've run into Web streams problems I haven't covered, or if you see gaps in this approach, let me know. But again, the idea here is not to say "Let's all use this shiny new object!"; it is to kick off a discussion that looks beyond the current status quo of Web Streams and returns back to first principles.

Web streams was an ambitious project that brought streaming to the web platform when nothing else existed. The people who designed it made reasonable choices given the constraints of 2014 – before async iteration, before years of production experience revealed the edge cases.

But we've learned a lot since then. JavaScript has evolved. A streaming API designed today can be simpler, more aligned with the language, and more explicit about the things that matter, like backpressure and multi-consumer behavior.

We deserve a better stream API. So let's talk about what that could look like.

How we rebuilt Next.js with AI in one week

Steve Faulkner — Tue, 24 Feb 2026 20:00:00 GMT

_{*This post was updated at 12:35 pm PT to fix a typo in the build time benchmarks.}

Last week, one engineer and an AI model rebuilt the most popular front-end framework from scratch. The result, vinext (pronounced "vee-next"), is a drop-in replacement for Next.js, built on Vite, that deploys to Cloudflare Workers with a single command. In early benchmarks, it builds production apps up to 4x faster and produces client bundles up to 57% smaller. And we already have customers running it in production.

The whole thing cost about $1,100 in tokens.

The Next.js deployment problem

Next.js is the most popular React framework. Millions of developers use it. It powers a huge chunk of the production web, and for good reason. The developer experience is top-notch.

But Next.js has a deployment problem when used in the broader serverless ecosystem. The tooling is entirely bespoke: Next.js has invested heavily in Turbopack but if you want to deploy it to Cloudflare, Netlify, or AWS Lambda, you have to take that build output and reshape it into something the target platform can actually run.

If you’re thinking: “Isn’t that what OpenNext does?”, you are correct.

That is indeed the problem OpenNext was built to solve. And a lot of engineering effort has gone into OpenNext from multiple providers, including us at Cloudflare. It works, but quickly runs into limitations and becomes a game of whack-a-mole.

Building on top of Next.js output as a foundation has proven to be a difficult and fragile approach. Because OpenNext has to reverse-engineer Next.js's build output, this results in unpredictable changes between versions that take a lot of work to correct.

Next.js has been working on a first-class adapters API, and we've been collaborating with them on it. It's still an early effort but even with adapters, you're still building on the bespoke Turbopack toolchain. And adapters only cover build and deploy. During development, next dev runs exclusively in Node.js with no way to plug in a different runtime. If your application uses platform-specific APIs like Durable Objects, KV, or AI bindings, you can't test that code in dev without workarounds.

Introducing vinext

What if instead of adapting Next.js output, we reimplemented the Next.js API surface on Vite directly? Vite is the build tool used by most of the front-end ecosystem outside of Next.js, powering frameworks like Astro, SvelteKit, Nuxt, and Remix. A clean reimplementation, not merely a wrapper or adapter. We honestly didn't think it would work. But it’s 2026, and the cost of building software has completely changed.

We got a lot further than we expected.

npm install vinext

Replace next with vinext in your scripts and everything else stays the same. Your existing app/, pages/, and next.config.js work as-is.

vinext dev          # Development server with HMR
vinext build        # Production build
vinext deploy       # Build and deploy to Cloudflare Workers

This is not a wrapper around Next.js and Turbopack output. It's an alternative implementation of the API surface: routing, server rendering, React Server Components, server actions, caching, middleware. All of it built on top of Vite as a plugin. Most importantly Vite output runs on any platform thanks to the Vite Environment API.

The numbers

Early benchmarks are promising. We compared vinext against Next.js 16 using a shared 33-route App Router application. Both frameworks are doing the same work: compiling, bundling, and preparing server-rendered routes. We disabled TypeScript type checking and ESLint in Next.js's build (Vite doesn't run these during builds), and used force-dynamic so Next.js doesn't spend extra time pre-rendering static routes, which would unfairly slow down its numbers. The goal was to measure only bundler and compilation speed, nothing else. Benchmarks run on GitHub CI on every merge to main.

Production build time:

Framework	Mean	vs Next.js
Next.js 16.1.6 (Turbopack)	7.38s	baseline
vinext (Vite 7 / Rollup)	4.64s	1.6x faster
vinext (Vite 8 / Rolldown)	1.67s	4.4x faster

Client bundle size (gzipped):

Framework	Gzipped	vs Next.js
Next.js 16.1.6	168.9 KB	baseline
vinext (Rollup)	74.0 KB	56% smaller
vinext (Rolldown)	72.9 KB	57% smaller

These benchmarks measure compilation and bundling speed, not production serving performance. The test fixture is a single 33-route app, not a representative sample of all production applications. We expect these numbers to evolve as three projects continue to develop. The full methodology and historical results are public. Take them as directional, not definitive.

The direction is encouraging, though. Vite's architecture, and especially Rolldown (the Rust-based bundler coming in Vite 8), has structural advantages for build performance that show up clearly here.

Deploying to Cloudflare Workers

vinext is built with Cloudflare Workers as the first deployment target. A single command takes you from source code to a running Worker:

vinext deploy

This handles everything: builds the application, auto-generates the Worker configuration, and deploys. Both the App Router and Pages Router work on Workers, with full client-side hydration, interactive components, client-side navigation, React state.

For production caching, vinext includes a Cloudflare KV cache handler that gives you ISR (Incremental Static Regeneration) out of the box:

import { KVCacheHandler } from "vinext/cloudflare";
import { setCacheHandler } from "next/cache";

setCacheHandler(new KVCacheHandler(env.MY_KV_NAMESPACE));

KV is a good default for most applications, but the caching layer is designed to be pluggable. That setCacheHandler call means you can swap in whatever backend makes sense. R2 might be a better fit for apps with large cached payloads or different access patterns. We're also working on improvements to our Cache API that should provide a strong caching layer with less configuration. The goal is flexibility: pick the caching strategy that fits your app.

Live examples running right now:

We also have a live example of Cloudflare Agents running in a Next.js app, without the need for workarounds like getPlatformProxy, since the entire app now runs in workerd, during both dev and deploy phases. This means being able to use Durable Objects, AI bindings, and every other Cloudflare-specific service without compromise. Have a look here.

Frameworks are a team sport

The current deployment target is Cloudflare Workers, but that's a small part of the picture. Something like 95% of vinext is pure Vite. The routing, the module shims, the SSR pipeline, the RSC integration: none of it is Cloudflare-specific.

Cloudflare is looking to work with other hosting providers about adopting this toolchain for their customers (the lift is minimal — we got a proof-of-concept working on Vercel in less than 30 minutes!). This is an open-source project, and for its long term success, we believe it’s important we work with partners across the ecosystem to ensure ongoing investment. PRs from other platforms are welcome. If you're interested in adding a deployment target, open an issue or reach out.

Status: Experimental

We want to be clear: vinext is experimental. It's not even one week old, and it has not yet been battle-tested with any meaningful traffic at scale. If you're evaluating it for a production application, proceed with appropriate caution.

That said, the test suite is extensive: over 1,700 Vitest tests and 380 Playwright E2E tests, including tests ported directly from the Next.js test suite and OpenNext's Cloudflare conformance suite. We’ve verified it against the Next.js App Router Playground. Coverage sits at 94% of the Next.js 16 API surface. Early results from real-world customers are encouraging. We've been working with National Design Studio, a team that's aiming to modernize every government interface, on one of their beta sites, CIO.gov. They're already running vinext in production, with meaningful improvements in build times and bundle sizes.

The README is honest about what's not supported and won't be, and about known limitations. We want to be upfront rather than overpromise.

What about pre-rendering?

vinext already supports Incremental Static Regeneration (ISR) out of the box. After the first request to any page, it's cached and revalidated in the background, just like Next.js. That part works today.

vinext does not yet support static pre-rendering at build time. In Next.js, pages without dynamic data get rendered during next build and served as static HTML. If you have dynamic routes, you use generateStaticParams() to enumerate which pages to build ahead of time. vinext doesn't do that… yet.

This was an intentional design decision for launch. It's on the roadmap, but if your site is 100% prebuilt HTML with static content, you probably won't see much benefit from vinext today. That said, if one engineer can spend $1,100 in tokens and rebuild Next.js, you can probably spend $10 and migrate to a Vite-based framework designed specifically for static content, like Astro (which also deploys to Cloudflare Workers).

For sites that aren't purely static, though, we think we can do something better than pre-rendering everything at build time.

Introducing Traffic-aware Pre-Rendering

Next.js pre-renders every page listed in generateStaticParams() during the build. A site with 10,000 product pages means 10,000 renders at build time, even though 99% of those pages may never receive a request. Builds scale linearly with page count. This is why large Next.js sites end up with 30-minute builds.

So we built Traffic-aware Pre-Rendering (TPR). It's experimental today, and we plan to make it the default once we have more real-world testing behind it.

The idea is simple. Cloudflare is already the reverse proxy for your site. We have your traffic data. We know which pages actually get visited. So instead of pre-rendering everything or pre-rendering nothing, vinext queries Cloudflare's zone analytics at deploy time and pre-renders only the pages that matter.

vinext deploy --experimental-tpr

  Building...
  Build complete (4.2s)

  TPR (experimental): Analyzing traffic for my-store.com (last 24h)
  TPR: 12,847 unique paths — 184 pages cover 90% of traffic
  TPR: Pre-rendering 184 pages...
  TPR: Pre-rendered 184 pages in 8.3s → KV cache

  Deploying to Cloudflare Workers...

For a site with 100,000 product pages, the power law means 90% of traffic usually goes to 50 to 200 pages. Those get pre-rendered in seconds. Everything else falls back to on-demand SSR and gets cached via ISR after the first request. Every new deploy refreshes the set based on current traffic patterns. Pages that go viral get picked up automatically. All of this works without generateStaticParams() and without coupling your build to your production database.

Taking on the Next.js challenge, but this time with AI

A project like this would normally take a team of engineers months, if not years. Several teams at various companies have attempted it, and the scope is just enormous. We tried once at Cloudflare! Two routers, 33+ module shims, server rendering pipelines, RSC streaming, file-system routing, middleware, caching, static export. There's a reason nobody has pulled it off.

This time we did it in under a week. One engineer (technically engineering manager) directing AI.

The first commit landed on February 13. By the end of that same evening, both the Pages Router and App Router had basic SSR working, along with middleware, server actions, and streaming. By the next afternoon, App Router Playground was rendering 10 of 11 routes. By day three, vinext deploy was shipping apps to Cloudflare Workers with full client hydration. The rest of the week was hardening: fixing edge cases, expanding the test suite, bringing API coverage to 94%.

What changed from those earlier attempts? AI got better. Way better.

Why this problem is made for AI

Not every project would go this way. This one did because a few things happened to line up at the right time.

Next.js is well-specified. It has extensive documentation, a massive user base, and years of Stack Overflow answers and tutorials. The API surface is all over the training data. When you ask Claude to implement getServerSideProps or explain how useRouter works, it doesn't hallucinate. It knows how Next works.

Next.js has an elaborate test suite. The Next.js repo contains thousands of E2E tests covering every feature and edge case. We ported tests directly from their suite (you can see the attribution in the code). This gave us a specification we could verify against mechanically.

Vite is an excellent foundation. Vite handles the hard parts of front-end tooling: fast HMR, native ESM, a clean plugin API, production bundling. We didn't have to build a bundler. We just had to teach it to speak Next.js. @vitejs/plugin-rsc is still early, but it gave us React Server Components support without having to build an RSC implementation from scratch.

The models caught up. We don't think this would have been possible even a few months ago. Earlier models couldn't sustain coherence across a codebase this size. New models can hold the full architecture in context, reason about how modules interact, and produce correct code often enough to keep momentum going. At times, I saw it go into Next, Vite, and React internals to figure out a bug. The state-of-the-art models are impressive, and they seem to keep getting better.

All of those things had to be true at the same time. Well-documented target API, comprehensive test suite, solid build tool underneath, and a model that could actually handle the complexity. Take any one of them away and this doesn't work nearly as well.

How we actually built it

Almost every line of code in vinext was written by AI. But here's the thing that matters more: every line passes the same quality gates you'd expect from human-written code. The project has 1,700+ Vitest tests, 380 Playwright E2E tests, full TypeScript type checking via tsgo, and linting via oxlint. Continuous integration runs all of it on every pull request. Establishing a set of good guardrails is critical to making AI productive in a codebase.

The process started with a plan. I spent a couple of hours going back and forth with Claude in OpenCode to define the architecture: what to build, in what order, which abstractions to use. That plan became the north star. From there, the workflow was straightforward:

Define a task ("implement the next/navigation shim with usePathname, useSearchParams, useRouter").
Let the AI write the implementation and tests.
Run the test suite.
If tests pass, merge. If not, give the AI the error output and let it iterate.
Repeat.

We wired up AI agents for code review too. When a PR was opened, an agent reviewed it. When review comments came back, another agent addressed them. The feedback loop was mostly automated.

It didn't work perfectly every time. There were PRs that were just wrong. The AI would confidently implement something that seemed right but didn't match actual Next.js behavior. I had to course-correct regularly. Architecture decisions, prioritization, knowing when the AI was headed down a dead end: that was all me. When you give AI good direction, good context, and good guardrails, it can be very productive. But the human still has to steer.

For browser-level testing, I used agent-browser to verify actual rendered output, client-side navigation, and hydration behavior. Unit tests miss a lot of subtle browser issues. This caught them.

Over the course of the project, we ran over 800 sessions in OpenCode. Total cost: roughly $1,100 in Claude API tokens.

What this means for software

Why do we have so many layers in the stack? This project forced me to think deeply about this question. And to consider how AI impacts the answer.

Most abstractions in software exist because humans need help. We couldn't hold the whole system in our heads, so we built layers to manage the complexity for us. Each layer made the next person's job easier. That's how you end up with frameworks on top of frameworks, wrapper libraries, thousands of lines of glue code.

AI doesn't have the same limitation. It can hold the whole system in context and just write the code. It doesn't need an intermediate framework to stay organized. It just needs a spec and a foundation to build on.

It's not clear yet which abstractions are truly foundational and which ones were just crutches for human cognition. That line is going to shift a lot over the next few years. But vinext is a data point. We took an API contract, a build tool, and an AI model, and the AI wrote everything in between. No intermediate framework needed. We think this pattern will repeat across a lot of software. The layers we've built up over the years aren't all going to make it.

Acknowledgments

Thanks to the Vite team. Vite is the foundation this whole thing stands on. @vitejs/plugin-rsc is still early days, but it gave me RSC support without having to build that from scratch, which would have been a dealbreaker. The Vite maintainers were responsive and helpful as I pushed the plugin into territory it hadn't been tested in before.

We also want to acknowledge the Next.js team. They've spent years building a framework that raised the bar for what React development could look like. The fact that their API surface is so well-documented and their test suite so comprehensive is a big part of what made this project possible. vinext wouldn't exist without the standard they set.

Try it

vinext includes an Agent Skill that handles migration for you. It works with Claude Code, OpenCode, Cursor, Codex, and dozens of other AI coding tools. Install it, open your Next.js project, and tell the AI to migrate:

npx skills add cloudflare/vinext

Then open your Next.js project in any supported tool and say:

migrate this project to vinext

The skill handles compatibility checking, dependency installation, config generation, and dev server startup. It knows what vinext supports and will flag anything that needs manual attention.

Or if you prefer doing it by hand:

npx vinext init    # Migrate an existing Next.js project
npx vinext dev     # Start the dev server
npx vinext deploy  # Ship to Cloudflare Workers

The source is at github.com/cloudflare/vinext. Issues, PRs, and feedback are welcome.

Code Mode: give agents an entire API in 1,000 tokens

Matt Carey — Fri, 20 Feb 2026 14:00:00 GMT

Model Context Protocol (MCP) has become the standard way for AI agents to use external tools. But there is a tension at its core: agents need many tools to do useful work, yet every tool added fills the model's context window, leaving less room for the actual task.

Code Mode is a technique we first introduced for reducing context window usage during agent tool use. Instead of describing every operation as a separate tool, let the model write code against a typed SDK and execute the code safely in a Dynamic Worker Loader. The code acts as a compact plan. The model can explore tool operations, compose multiple calls, and return just the data it needs. Anthropic independently explored the same pattern in their Code Execution with MCP post.

Today we are introducing a new MCP server for the entire Cloudflare API — from DNS and Zero Trust to Workers and R2 — that uses Code Mode. With just two tools, search() and execute(), the server is able to provide access to the entire Cloudflare API over MCP, while consuming only around 1,000 tokens. The footprint stays fixed, no matter how many API endpoints exist.

For a large API like the Cloudflare API, Code Mode reduces the number of input tokens used by 99.9%. An equivalent MCP server without Code Mode would consume 1.17 million tokens — more than the entire context window of the most advanced foundation models.

^{Code mode savings vs native MCP, measured with}^tiktoken

You can start using this new Cloudflare MCP server today. And we are also open-sourcing a new Code Mode SDK in the Cloudflare Agents SDK, so you can use the same approach in your own MCP servers and AI Agents.

Server‑side Code Mode

This new MCP server applies Code Mode server-side. Instead of thousands of tools, the server exports just two: search() and execute(). Both are powered by Code Mode. Here is the full tool surface area that gets loaded into the model context:

[
  {
    "name": "search",
    "description": "Search the Cloudflare OpenAPI spec. All $refs are pre-resolved inline.",
    "inputSchema": {
      "type": "object",
      "properties": {
        "code": {
          "type": "string",
          "description": "JavaScript async arrow function to search the OpenAPI spec"
        }
      },
      "required": ["code"]
    }
  },
  {
    "name": "execute",
    "description": "Execute JavaScript code against the Cloudflare API.",
    "inputSchema": {
      "type": "object",
      "properties": {
        "code": {
          "type": "string",
          "description": "JavaScript async arrow function to execute"
        }
      },
      "required": ["code"]
    }
  }
]

To discover what it can do, the agent calls search(). It writes JavaScript against a typed representation of the OpenAPI spec. The agent can filter endpoints by product, path, tags, or any other metadata and narrow thousands of endpoints to the handful it needs. The full OpenAPI spec never enters the model context. The agent only interacts with it through code.

When the agent is ready to act, it calls execute(). The agent writes code that can make Cloudflare API requests, handle pagination, check responses, and chain operations together in a single execution.

Both tools run the generated code inside a Dynamic Worker isolate — a lightweight V8 sandbox with no file system, no environment variables to leak through prompt injection and external fetches disabled by default. Outbound requests can be explicitly controlled with outbound fetch handlers when needed.

Example: Protecting an origin from DDoS attacks

Suppose a user tells their agent: "protect my origin from DDoS attacks." The agent's first step is to consult documentation. It might call the Cloudflare Docs MCP Server, use a Cloudflare Skill, or search the web directly. From the docs it learns: put Cloudflare WAF and DDoS protection rules in front of the origin.

Step 1: Search for the right endpoints The search tool gives the model a spec object: the full Cloudflare OpenAPI spec with all $refs pre-resolved. The model writes JavaScript against it. Here the agent looks for WAF and ruleset endpoints on a zone:

async () => {
  const results = [];
  for (const [path, methods] of Object.entries(spec.paths)) {
    if (path.includes('/zones/') &&
        (path.includes('firewall/waf') || path.includes('rulesets'))) {
      for (const [method, op] of Object.entries(methods)) {
        results.push({ method: method.toUpperCase(), path, summary: op.summary });
      }
    }
  }
  return results;
}

The server runs this code in a Workers isolate and returns:

[
  { "method": "GET",    "path": "/zones/{zone_id}/firewall/waf/packages",              "summary": "List WAF packages" },
  { "method": "PATCH",  "path": "/zones/{zone_id}/firewall/waf/packages/{package_id}", "summary": "Update a WAF package" },
  { "method": "GET",    "path": "/zones/{zone_id}/firewall/waf/packages/{package_id}/rules", "summary": "List WAF rules" },
  { "method": "PATCH",  "path": "/zones/{zone_id}/firewall/waf/packages/{package_id}/rules/{rule_id}", "summary": "Update a WAF rule" },
  { "method": "GET",    "path": "/zones/{zone_id}/rulesets",                           "summary": "List zone rulesets" },
  { "method": "POST",   "path": "/zones/{zone_id}/rulesets",                           "summary": "Create a zone ruleset" },
  { "method": "GET",    "path": "/zones/{zone_id}/rulesets/phases/{ruleset_phase}/entrypoint", "summary": "Get a zone entry point ruleset" },
  { "method": "PUT",    "path": "/zones/{zone_id}/rulesets/phases/{ruleset_phase}/entrypoint", "summary": "Update a zone entry point ruleset" },
  { "method": "POST",   "path": "/zones/{zone_id}/rulesets/{ruleset_id}/rules",        "summary": "Create a zone ruleset rule" },
  { "method": "PATCH",  "path": "/zones/{zone_id}/rulesets/{ruleset_id}/rules/{rule_id}", "summary": "Update a zone ruleset rule" }
]

The full Cloudflare API spec has over 2,500 endpoints. The model narrowed that to the WAF and ruleset endpoints it needs, without any of the spec entering the context window.

The model can also drill into a specific endpoint's schema before calling it. Here it inspects what phases are available on zone rulesets:

async () => {
  const op = spec.paths['/zones/{zone_id}/rulesets']?.get;
  const items = op?.responses?.['200']?.content?.['application/json']?.schema;
  // Walk the schema to find the phase enum
  const props = items?.allOf?.[1]?.properties?.result?.items?.allOf?.[1]?.properties;
  return { phases: props?.phase?.enum };
}

{
  "phases": [
    "ddos_l4", "ddos_l7",
    "http_request_firewall_custom", "http_request_firewall_managed",
    "http_response_firewall_managed", "http_ratelimit",
    "http_request_redirect", "http_request_transform",
    "magic_transit", "magic_transit_managed"
  ]
}

The agent now knows the exact phases it needs: ddos_l7 for DDoS protection and http_request_firewall_managed for WAF.

Step 2: Act on the API The agent switches to using execute. The sandbox gets a cloudflare.request() client that can make authenticated calls to the Cloudflare API. First the agent checks what rulesets already exist on the zone:

async () => {
  const response = await cloudflare.request({
    method: "GET",
    path: `/zones/${zoneId}/rulesets`
  });
  return response.result.map(rs => ({
    name: rs.name, phase: rs.phase, kind: rs.kind
  }));
}

[
  { "name": "DDoS L7",          "phase": "ddos_l7",                        "kind": "managed" },
  { "name": "Cloudflare Managed","phase": "http_request_firewall_managed", "kind": "managed" },
  { "name": "Custom rules",     "phase": "http_request_firewall_custom",   "kind": "zone" }
]

The agent sees that managed DDoS and WAF rulesets already exist. It can now chain calls to inspect their rules and update sensitivity levels in a single execution:

async () => {
  // Get the current DDoS L7 entrypoint ruleset
  const ddos = await cloudflare.request({
    method: "GET",
    path: `/zones/${zoneId}/rulesets/phases/ddos_l7/entrypoint`
  });

  // Get the WAF managed ruleset
  const waf = await cloudflare.request({
    method: "GET",
    path: `/zones/${zoneId}/rulesets/phases/http_request_firewall_managed/entrypoint`
  });
}

This entire operation, from searching the spec and inspecting a schema to listing rulesets and fetching DDoS and WAF configurations, took four tool calls.

The Cloudflare MCP server

We started with MCP servers for individual products. Want an agent that manages DNS? Add the DNS MCP server. Want Workers logs? Add the Workers Observability MCP server. Each server exported a fixed set of tools that mapped to API operations. This worked when the tool set was small, but the Cloudflare API has over 2,500 endpoints. No collection of hand-maintained servers could keep up.

The Cloudflare MCP server simplifies this. Two tools, roughly 1,000 tokens, and coverage of every endpoint in the API. When we add new products, the same search() and execute() code paths discover and call them — no new tool definitions, no new MCP servers. It even has support for the GraphQL Analytics API.

Our MCP server is built on the latest MCP specifications. It is OAuth 2.1 compliant, using Workers OAuth Provider to downscope the token to selected permissions approved by the user when connecting. The agent only gets the capabilities the user explicitly granted.

For developers, this means you can use a simple agent loop and still give your agent access to the full Cloudflare API with built-in progressive capability discovery.

Comparing approaches to context reduction

Several approaches have emerged to reduce how many tokens MCP tools consume:

Client-side Code Mode was our first experiment. The model writes TypeScript against typed SDKs and runs it in a Dynamic Worker Loader on the client. The tradeoff is that it requires the agent to ship with secure sandbox access. Code Mode is implemented in Goose and Anthropics Claude SDK as Programmatic Tool Calling.

Command-line interfaces are another path. CLIs are self-documenting and reveal capabilities as the agent explores. Tools like OpenClaw and Moltworker convert MCP servers into CLIs using MCPorter to give agents progressive disclosure. The limitation is obvious: the agent needs a shell, which not every environment provides and which introduces a much broader attack surface than a sandboxed isolate.

Dynamic tool search, as used by Anthropic in Claude Code, surfaces a smaller set of tools hopefully relevant to the current task. It shrinks context use but now requires a search function that must be maintained and evaluated, and each matched tool still uses tokens.

Each approach solves a real problem. But for MCP servers specifically, server-side Code Mode combines their strengths: fixed token cost regardless of API size, no modifications needed on the agent side, progressive discovery built in, and safe execution inside a sandboxed isolate. The agent just calls two tools with code. Everything else happens on the server.

Get started today

The Cloudflare MCP server is available now. Point your MCP client at the server URL and you'll be redirected to Cloudflare to authorize and select the permissions to grant to your agent. Add this config to your MCP client:

{
  "mcpServers": {
    "cloudflare-api": {
      "url": "https://mcp.cloudflare.com/mcp"
    }
  }
}

For CI/CD, automation, or if you prefer managing tokens yourself, create a Cloudflare API token with the permissions you need. Both user tokens and account tokens are supported and can be passed as bearer tokens in the Authorization header.

More information on different MCP setup configurations can be found at the Cloudflare MCP repository.

Looking forward

Code Mode solves context costs for a single API. But agents rarely talk to one service. A developer's agent might need the Cloudflare API alongside GitHub, a database, and an internal docs server. Each additional MCP server brings the same context window pressure we started with.

Cloudflare MCP Server Portals let you compose multiple MCP servers behind a single gateway with unified auth and access control. We are building a first-class Code Mode integration for all your MCP servers, and exposing them to agents with built-in progressive discovery and the same fixed-token footprint, regardless of how many services sit behind the gateway.

Building vertical microfrontends on Cloudflare’s platform

Brayden Wilmoth — Fri, 30 Jan 2026 14:00:00 GMT

Updated at 6:55 a.m. PT

Today, we’re introducing a new Worker template for Vertical Microfrontends (VMFE). This template allows you to map multiple independent Cloudflare Workers to a single domain, enabling teams to work in complete silos — shipping marketing, docs, and dashboards independently — while presenting a single, seamless application to the user.

Most microfrontend architectures are "horizontal", meaning different parts of a single page are fetched from different services. Vertical microfrontends take a different approach by splitting the application by URL path. In this model, a team owning the `/blog` path doesn't just own a component; they own the entire vertical stack for that route – framework, library choice, CI/CD and more. Owning the entire stack of a path, or set of paths, allows teams to have true ownership of their work and ship with confidence.

Teams face problems as they grow, where different frameworks serve varying use cases. A marketing website could be better utilized with Astro, for example, while a dashboard might be better with React. Or say you have a monolithic code base where many teams ship as a collective. An update to add new features from several teams can get frustratingly rolled back because a single team introduced a regression. How do we solve the problem of obscuring the technical implementation details away from the user and letting teams ship a cohesive user experience with full autonomy and control of their domains?

Vertical microfrontends can be the answer. Let’s dive in and explore how they solve developer pain points together.

What are vertical microfrontends?

A vertical microfrontend is an architectural pattern where a single independent team owns an entire slice of the application’s functionality, from the user interface all the way down to the CI/CD pipeline. These slices are defined by paths on a domain where you can associate individual Workers with specific paths:

/      = Marketing
/docs  = Documentation
/blog  = Blog
/dash  = Dashboard

We could take it a step further and focus on more granular sub-path Worker associations, too, such as a dashboard. Within a dashboard, you likely segment out various features or products by adding depth to your URL path (e.g. /dash/product-a) and navigating between two products could mean two entirely different code bases.

Now with vertical microfrontends, we could also have the following:

/dash/product-a  = WorkerA
/dash/product-b  = WorkerB

Each of the above paths are their own frontend project with zero shared code between them. The product-a and product-b routes map to separately deployed frontend applications that have their own frameworks, libraries, CI/CD pipelines defined and owned by their own teams. FINALLY.

You can now own your own code from end to end. But now we need to find a way to stitch these separate projects together, and even more so, make them feel as if they are a unified experience.

We experience this pain point ourselves here at Cloudflare, as the dashboard has many individual teams owning their own products. Teams must contend with the fact that changes made outside their control impact how users experience their product.

Internally, we are now using a similar strategy for our own dashboard. When users navigate from the core dashboard into our ZeroTrust product, in reality they are two entirely separate projects and the user is simply being routed to that project by its path /:accountId/one.

Visually unified experiences

Stitching these individual projects together to make them feel like a unified experience isn’t as difficult as you might think: It only takes a few lines of CSS magic. What we absolutely do not want to happen is to leak our implementation details and internal decisions to our users. If we fail to make this user experience feel like one cohesive frontend, then we’ve done a grave injustice to our users.

To accomplish this sleight of hand, let us take a little trip in understanding how view transitions and document preloading come into play.

View transitions

When we want to seamlessly navigate between two distinct pages while making it feel smooth to the end user, view transitions are quite useful. Defining specific DOM elements on our page to stick around until the next page is visible, and defining how any changes are handled, make for quite the powerful quilt-stitching tool for multi-page applications.

There may be, however, instances where making the various vertical microfrontends feel different is more than acceptable. Perhaps our marketing website, documentation, and dashboard are each uniquely defined, for instance. A user would not expect all three of those to feel cohesive as you navigate between the three parts. But… if you decide to introduce vertical slices to an individual experience such as the dashboard (e.g. /dash/product-a & /dash/product-b), then users should never know they are two different repositories/workers/projects underneath.

Okay, enough talk — let’s get to work. I mentioned it was low-effort to make two separate projects feel as if they were one to a user, and if you have yet to hear about CSS View Transitions then I’m about to blow your mind.

What if I told you that you could make animated transitions between different views — single-page app (SPA) or multi-page app (MPA) — feel as if they were one? Before any view transitions are added, if we navigate between pages owned by two different Workers, the interstitial loading state would be the white blank screen in our browser for some few hundred milliseconds until the full next page began rendering. Pages would not feel cohesive, and it certainly would not feel like a single-page application.

^{Appears as multiple navigation elements between each site.}

If we want elements to stick around, rather than seeing a white blank page, we can achieve that by defining CSS View Transitions. With the code below, we’re telling our current document page that when a view transition event is about to happen, keep the nav DOM element on the screen, and if any delta in appearance exists between our existing page and our destination page, then we’ll animate that with an ease-in-out transition.

All of a sudden, two different Workers feel like one.

@supports (view-transition-name: none) {
  ::view-transition-old(root),
  ::view-transition-new(root) {
    animation-duration: 0.3s;
    animation-timing-function: ease-in-out;
  }
  nav { view-transition-name: navigation; }
}

^{Appears as a single navigation element between three distinct sites.}

Preloading

Transitioning between two pages makes it look seamless — and we also want it to feel as instant as a client-side SPA. While currently Firefox and Safari do not support Speculation Rules, Chrome/Edge/Opera do support the more recent newcomer. The speculation rules API is designed to improve performance for future navigations, particularly for document URLs, making multi-page applications feel more like single-page applications.

Breaking it down into code, what we need to define is a script rule in a specific format that tells the supporting browsers how to prefetch the other vertical slices that are connected to our web application — likely linked through some shared navigation.

With that, our application prefetches our other microfrontends and holds them in our in-memory cache, so if we were to navigate to those pages it would feel nearly instant.

You likely won’t require this for clearly discernible vertical slices (marketing, docs, dashboard) because users would expect a slight load between them. However, it is highly encouraged to use when vertical slices are defined within a specific visible experience (e.g. within dashboard pages).

Between View Transitions and Speculation Rules, we are able to tie together entirely different code repositories to feel as if they were served from a single-page application. Wild if you ask me.

Zero-config request routing

Now we need a mechanism to host multiple applications, and a method to stitch them together as requests stream in. Defining a single Cloudflare Worker as the “Router” allows a single logical point (at the edge) to handle network requests and then forward them to whichever vertical microfrontend is responsible for that URL path. Plus it doesn’t hurt that then we can map a single domain to that router Worker and the rest “just works.”

Service bindings

If you have yet to explore Cloudflare Worker service bindings, then it is worth taking a moment to do so.

Service bindings allow one Worker to call into another, without going through a publicly-accessible URL. A Service binding allows Worker A to call a method on Worker B, or to forward a request from Worker A to Worker. Breaking it down further, the Router Worker can call into each vertical microfrontend Worker that has been defined (e.g. marketing, docs, dashboard), assuming each of them were Cloudflare Workers.

Why is this important? This is precisely the mechanism that “stitches” these vertical slices together. We’ll dig into how the request routing is handling the traffic split in the next section. But to define each of these microfrontends, we’ll need to update our Router Worker’s wrangler definition, so it knows which frontends it’s allowed to call into.

{
  "$schema": "./node_modules/wrangler/config-schema.json",
  "name": "router",
  "main": "./src/router.js",
  "services": [
    {
      "binding": "HOME",
      "service": "worker_marketing"
    },
    {
      "binding": "DOCS",
      "service": "worker_docs"
    },
    {
      "binding": "DASH",
      "service": "worker_dash"
    },
  ]
}

Our above sample definition is defined in our Router Worker, which then tells us that we are permitted to make requests into three separate additional Workers (marketing, docs, and dash). Granting permissions is as simple as that, but let’s tumble into some of the more complex logic with request routing and HTML rewriting network responses.

Request routing

With knowledge of the various other Workers we are able to call into if needed, now we need some logic in place to know where to direct network requests when. Since the Router Worker is assigned to our custom domain, all incoming requests hit it first at the network edge. It then determines which Worker should handle the request and manages the resulting response.

The first step is to map URL paths to associated Workers. When a certain request URL is received, we need to know where it needs to be forwarded. We do this by defining rules. While we support wildcard routes, dynamic paths, and parameter constraints, we are going to stay focused on the basics — literal path prefixes — as it illustrates the point more clearly.

In this example, we have three microfrontends:

/      = Marketing
/docs  = Documentation
/dash  = Dashboard

Each of the above paths need to be mapped to an actual Worker (see our wrangler definition for services in the section above). For our Router Worker, we define an additional variable with the following data, so we can know which paths should map to which service bindings. We now know where to route users as requests come in! Define a wrangler variable with the name ROUTES and the following contents:

{
  "routes":[
    {"binding": "HOME", "path": "/"},
    {"binding": "DOCS", "path": "/docs"},
    {"binding": "DASH", "path": "/dash"}
  ]
}

Let’s envision a user visiting our website path /docs/installation. Under the hood, what happens is the request first reaches our Router Worker which is in charge of understanding what URL paths map to which individual Workers. It understands that the /docs path prefix is mapped to our DOCS service binding which referencing our wrangler file points us at our worker_docs project. Our Router Worker, knowing that /docs is defined as a vertical microfrontend route, removes the /docs prefix from the path and forwards the request to our worker_docs Worker to handle the request and then finally returns whatever response we get.

Why does it drop the /docs path, though? This was an implementation detail choice that was made so that when the Worker is accessed via the Router Worker, it can clean up the URL to handle the request as if it were called from outside our Router Worker. Like any Cloudflare Worker, our worker_docs service might have its own individual URL where it can be accessed. We decided we wanted that service URL to continue to work independently. When it’s attached to our new Router Worker, it would automatically handle removing the prefix, so the service could be accessible from its own defined URL or through our Router Worker… either place, doesn’t matter.

HTMLRewriter

Splitting our various frontend services with URL paths (e.g. /docs or /dash) makes it easy for us to forward a request, but when our response contains HTML that doesn’t know it’s being reverse proxied through a path component… well, that causes problems.

Say our documentation website has an image tag in the response . If our user was visiting this page at https://website.com/docs/, then loading the logo.png file would likely fail because our /docs path is somewhat artificially defined only by our Router Worker.

Only when our services are accessed through our Router Worker do we need to do some HTML rewriting of absolute paths so our returned browser response references valid assets. In practice what happens is that when a request passes through our Router Worker, we pass the request to the correct Service Binding, and we receive the response from that. Before we pass that back to the client, we have an opportunity to rewrite the DOM — so where we see absolute paths, we go ahead and prepend that with the proxied path. Where previously our HTML was returning our image tag with we now modify it before returning to the client browser to .

Let’s return for a moment to the magic of CSS view transitions and document preloading. We could of course manually place that code into our projects and have it work, but this Router Worker will automatically handle that logic for us by also using HTMLRewriter.

In your Router Worker ROUTES variable, if you set smoothTransitions to true at the root level, then the CSS transition views code will be added automatically. Additionally, if you set the preload key within a route to true, then the script code speculation rules for that route will automatically be added as well.

Below is an example of both in action:

{
  "smoothTransitions":true, 
  "routes":[
    {"binding": "APP1", "path": "/app1", "preload": true},
    {"binding": "APP2", "path": "/app2", "preload": true}
  ]
}

Get started

You can start building with the Vertical Microfrontend template today.

Visit the Cloudflare Dashboard deeplink here or go to “Workers & Pages” and click the “Create application” button to get started. From there, click “Select a template” and then “Create microfrontend” and you can begin configuring your setup.

Check out the documentation to see how to map your existing Workers and enable View Transitions. We can't wait to see what complex, multi-team applications you build on the edge!

Introducing Moltworker: a self-hosted personal AI agent, minus the minis

Celso Martinho — Thu, 29 Jan 2026 14:00:00 GMT

Editorial note: As of January 30, 2026, Moltbot has been renamed to OpenClaw.

The Internet woke up this week to a flood of people buying Mac minis to run Moltbot (formerly Clawdbot), an open-source, self-hosted AI agent designed to act as a personal assistant. Moltbot runs in the background on a user's own hardware, has a sizable and growing list of integrations for chat applications, AI models, and other popular tools, and can be controlled remotely. Moltbot can help you with your finances, social media, organize your day — all through your favorite messaging app.

But what if you don’t want to buy new dedicated hardware? And what if you could still run your Moltbot efficiently and securely online? Meet Moltworker, a middleware Worker and adapted scripts that allows running Moltbot on Cloudflare's Sandbox SDK and our Developer Platform APIs.

A personal assistant on Cloudflare — how does that work?

Cloudflare Workers has never been as compatible with Node.js as it is now. Where in the past we had to mock APIs to get some packages running, now those APIs are supported natively by the Workers Runtime.

This has changed how we can build tools on Cloudflare Workers. When we first implemented Playwright, a popular framework for web testing and automation that runs on Browser Rendering, we had to rely on memfs. This was bad because not only is memfs a hack and an external dependency, but it also forced us to drift away from the official Playwright codebase. Thankfully, with more Node.js compatibility, we were able to start using node:fs natively, reducing complexity and maintainability, which makes upgrades to the latest versions of Playwright easy to do.

The list of Node.js APIs we support natively keeps growing. The blog post “A year of improving Node.js compatibility in Cloudflare Workers” provides an overview of where we are and what we’re doing.

We measure this progress, too. We recently ran an experiment where we took the 1,000 most popular NPM packages, installed and let AI loose, to try to run them in Cloudflare Workers, Ralph Wiggum as a "software engineer" style, and the results were surprisingly good. Excluding the packages that are build tools, CLI tools or browser-only and don’t apply, only 15 packages genuinely didn’t work. That's 1.5%.

Here’s a graphic of our Node.js API support over time:

We put together a page with the results of our internal experiment on npm packages support here, so you can check for yourself.

Moltbot doesn’t necessarily require a lot of Workers Node.js compatibility because most of the code runs in a container anyway, but we thought it would be important to highlight how far we got supporting so many packages using native APIs. This is because when starting a new AI agent application from scratch, we can actually run a lot of the logic in Workers, closer to the user.

The other important part of the story is that the list of products and APIs on our Developer Platform has grown to the point where anyone can build and run any kind of application — even the most complex and demanding ones — on Cloudflare. And once launched, every application running on our Developer Platform immediately benefits from our secure and scalable global network.

Those products and services gave us the ingredients we needed to get started. First, we now have Sandboxes, where you can run untrusted code securely in isolated environments, providing a place to run the service. Next, we now have Browser Rendering, where you can programmatically control and interact with headless browser instances. And finally, R2, where you can store objects persistently. With those building blocks available, we could begin work on adapting Moltbot.

How we adapted Moltbot to run on us

Moltbot on Workers, or Moltworker, is a combination of an entrypoint Worker that acts as an API router and a proxy between our APIs and the isolated environment, both protected by Cloudflare Access. It also provides an administration UI and connects to the Sandbox container where the standard Moltbot Gateway runtime and its integrations are running, using R2 for persistent storage.

^{High-level architecture diagram of Moltworker.}

Let's dive in more.

AI Gateway

Cloudflare AI Gateway acts as a proxy between your AI applications and any popular AI provider, and gives our customers centralized visibility and control over the requests going through.

Recently we announced support for Bring Your Own Key (BYOK), where instead of passing your provider secrets in plain text with every request, we centrally manage the secrets for you and can use them with your gateway configuration.

An even better option where you don’t have to manage AI providers' secrets at all end-to-end is to use Unified Billing. In this case you top up your account with credits and use AI Gateway with any of the supported providers directly, Cloudflare gets charged, and we will deduct credits from your account.

To make Moltbot use AI Gateway, first we create a new gateway instance, then we enable the Anthropic provider for it, then we either add our Claude key or purchase credits to use Unified Billing, and then all we need to do is set the ANTHROPIC_BASE_URL environment variable so Moltbot uses the AI Gateway endpoint. That’s it, no code changes necessary.

Once Moltbot starts using AI Gateway, you’ll have full visibility on costs and have access to logs and analytics that will help you understand how your AI agent is using the AI providers.

Note that Anthropic is one option; Moltbot supports other AI providers and so does AI Gateway. The advantage of using AI Gateway is that if a better model comes along from any provider, you don’t have to swap keys in your AI Agent configuration and redeploy — you can simply switch the model in your gateway configuration. And more, you specify model or provider fallbacks to handle request failures and ensure reliability.

Sandboxes

Last year we anticipated the growing need for AI agents to run untrusted code securely in isolated environments, and we announced the Sandbox SDK. This SDK is built on top of Cloudflare Containers, but it provides a simple API for executing commands, managing files, running background processes, and exposing services — all from your Workers applications.

In short, instead of having to deal with the lower-level Container APIs, the Sandbox SDK gives you developer-friendly APIs for secure code execution and handles the complexity of container lifecycle, networking, file systems, and process management — letting you focus on building your application logic with just a few lines of TypeScript. Here’s an example:

import { getSandbox } from '@cloudflare/sandbox';
export { Sandbox } from '@cloudflare/sandbox';

export default {
  async fetch(request: Request, env: Env): Promise {
    const sandbox = getSandbox(env.Sandbox, 'user-123');

    // Create a project structure
    await sandbox.mkdir('/workspace/project/src', { recursive: true });

    // Check node version
    const version = await sandbox.exec('node -v');

    // Run some python code
    const ctx = await sandbox.createCodeContext({ language: 'python' });
    await sandbox.runCode('import math; radius = 5', { context: ctx });
    const result = await sandbox.runCode('math.pi * radius ** 2', { context: ctx });

    return Response.json({ version, result });
  }
};

This fits like a glove for Moltbot. Instead of running Docker in your local Mac mini, we run Docker on Containers, use the Sandbox SDK to issue commands into the isolated environment and use callbacks to our entrypoint Worker, effectively establishing a two-way communication channel between the two systems.

R2 for persistent storage

The good thing about running things in your local computer or VPS is you get persistent storage for free. Containers, however, are inherently ephemeral, meaning data generated within them is lost upon deletion. Fear not, though — the Sandbox SDK provides the sandbox.mountBucket() that you can use to automatically, well, mount your R2 bucket as a filesystem partition when the container starts.

Once we have a local directory that is guaranteed to survive the container lifecycle, we can use that for Moltbot to store session memory files, conversations and other assets that are required to persist.

Browser Rendering for browser automation

AI agents rely heavily on browsing the sometimes not-so-structured web. Moltbot utilizes dedicated Chromium instances to perform actions, navigate the web, fill out forms, take snapshots, and handle tasks that require a web browser. Sure, we can run Chromium on Sandboxes too, but what if we could simplify and use an API instead?

With Cloudflare’s Browser Rendering, you can programmatically control and interact with headless browser instances running at scale in our edge network. We support Puppeteer, Stagehand, Playwright and other popular packages so that developers can onboard with minimal code changes. We even support MCP for AI.

In order to get Browser Rendering to work with Moltbot we do two things:

First we create a thin CDP proxy (CDP is the protocol that allows instrumenting Chromium-based browsers) from the Sandbox container to the Moltbot Worker, back to Browser Rendering using the Puppeteer APIs.
Then we inject a Browser Rendering skill into the runtime when the Sandbox starts.

From the Moltbot runtime perspective, it has a local CDP port it can connect to and perform browser tasks.

Zero Trust Access for authentication policies

Next up we want to protect our APIs and Admin UI from unauthorized access. Doing authentication from scratch is hard, and is typically the kind of wheel you don’t want to reinvent or have to deal with. Zero Trust Access makes it incredibly easy to protect your application by defining specific policies and login methods for the endpoints.

^{Zero Trust Access Login methods configuration for the Moltworker application.}

Once the endpoints are protected, Cloudflare will handle authentication for you and automatically include a JWT token with every request to your origin endpoints. You can then validate that JWT for extra protection, to ensure that the request came from Access and not a malicious third party.

Like with AI Gateway, once all your APIs are behind Access you get great observability on who the users are and what they are doing with your Moltbot instance.

Moltworker in action

Demo time. We’ve put up a Slack instance where we could play with our own instance of Moltbot on Workers. Here are some of the fun things we’ve done with it.

We hate bad news.

Here’s a chat session where we ask Moltbot to find the shortest route between Cloudflare in London and Cloudflare in Lisbon using Google Maps and take a screenshot in a Slack channel. It goes through a sequence of steps using Browser Rendering to navigate Google Maps and does a pretty good job at it. Also look at Moltbot’s memory in action when we ask him the second time.

We’re in the mood for some Asian food today, let’s get Moltbot to work for help.

We eat with our eyes too.

Let’s get more creative and ask Moltbot to create a video where it browses our developer documentation. As you can see, it downloads and runs ffmpeg to generate the video out of the frames it captured in the browser.

Run your own Moltworker

We open-sourced our implementation and made it available at https://github.com/cloudflare/moltworker, so you can deploy and run your own Moltbot on top of Workers today.

The README guides you through the necessary setup steps. You will need a Cloudflare account and a Workers Paid plan to access Sandbox Containers; however, all other products are either entirely free (like AI Gateway) or include generous free tiers that allow you to get started and scale under reasonable limits.

Note that Moltworker is a proof of concept, not a Cloudflare product. Our goal is to showcase some of the most exciting features of our Developer Platform that can be used to run AI agents and unsupervised code efficiently and securely, and get great observability while taking advantage of our global network.

Feel free to contribute to or fork our GitHub repository; we will keep an eye on it for a while for support. We are also considering contributing upstream to the official project with Cloudflare skills in parallel.

Conclusion

We hope you enjoyed this experiment, and we were able to convince you that Cloudflare is the perfect place to run your AI applications and agents. We’ve been working relentlessly trying to anticipate the future and release features like the Agents SDK that you can use to build your first agent in minutes, Sandboxes where you can run arbitrary code in an isolated environment without the complications of the lifecycle of a container, and AI Search, Cloudflare’s managed vector-based search service, to name a few.

Cloudflare now offers a complete toolkit for AI development: inference, storage APIs, databases, durable execution for stateful workflows, and built-in AI capabilities. Together, these building blocks make it possible to build and run even the most demanding AI applications on our global edge network.

If you're excited about AI and want to help us build the next generation of products and APIs, we're hiring.

Building a serverless, post-quantum Matrix homeserver

Nick Kuntz — Tue, 27 Jan 2026 14:00:00 GMT

^{* This post was updated at 11:45 a.m. Pacific time to clarify that the use case described here is a proof of concept and a personal project. Some sections have been updated for clarity.}

Matrix is the gold standard for decentralized, end-to-end encrypted communication. It powers government messaging systems, open-source communities, and privacy-focused organizations worldwide.

For the individual developer, however, the appeal is often closer to home: bridging fragmented chat networks (like Discord and Slack) into a single inbox, or simply ensuring your conversation history lives on infrastructure you control. Functionally, Matrix operates as a decentralized, eventually consistent state machine. Instead of a central server pushing updates, homeservers exchange signed JSON events over HTTP, using a conflict resolution algorithm to merge these streams into a unified view of the room's history.

But there is a "tax" to running it. Traditionally, operating a Matrix homeserver has meant accepting a heavy operational burden. You have to provision virtual private servers (VPS), tune PostgreSQL for heavy write loads, manage Redis for caching, configure reverse proxies, and handle rotation for TLS certificates. It’s a stateful, heavy beast that demands to be fed time and money, whether you’re using it a lot or a little.

We wanted to see if we could eliminate that tax entirely.

Spoiler: We could. In this post, we’ll explain how we ported a Matrix homeserver to Cloudflare Workers. The resulting proof of concept is a serverless architecture where operations disappear, costs scale to zero when idle, and every connection is protected by post-quantum cryptography by default. You can view the source code and deploy your own instance directly from Github.

From Synapse to Workers

Our starting point was Synapse, the Python-based reference Matrix homeserver designed for traditional deployments. PostgreSQL for persistence, Redis for caching, filesystem for media.

Porting it to Workers meant questioning every storage assumption we’d taken for granted.

The challenge was storage. Traditional homeservers assume strong consistency via a central SQL database. Cloudflare Durable Objects offers a powerful alternative. This primitive gives us the strong consistency and atomicity required for Matrix state resolution, while still allowing the application to run at the edge.

We ported the core Matrix protocol logic — event authorization, room state resolution, cryptographic verification — in TypeScript using the Hono framework. D1 replaces PostgreSQL, KV replaces Redis, R2 replaces the filesystem, and Durable Objects handle real-time coordination.

Here’s how the mapping worked out:

From monolith to serverless

Moving to Cloudflare Workers brings several advantages for a developer: simple deployment, lower costs, low latency, and built-in security.

Easy deployment: A traditional Matrix deployment requires server provisioning, PostgreSQL administration, Redis cluster management, TLS certificate renewal, load balancer configuration, monitoring infrastructure, and on-call rotations.

With Workers, deployment is simply: wrangler deploy. Workers handles TLS, load balancing, DDoS protection, and global distribution.

Usage-based costs: Traditional homeservers cost money whether anyone is using them or not. Workers pricing is request-based, so you pay when you’re using it, but costs drop to near zero when everyone’s asleep.

Lower latency globally: A traditional Matrix homeserver in us-east-1 adds 200ms+ latency for users in Asia or Europe. Workers, meanwhile, run in 300+ locations worldwide. When a user in Tokyo sends a message, the Worker executes in Tokyo.

Built-in security: Matrix homeservers can be high-value targets: They handle encrypted communications, store message history, and authenticate users. Traditional deployments require careful hardening: firewall configuration, rate limiting, DDoS mitigation, WAF rules, IP reputation filtering.

Workers provide all of this by default.

Post-quantum protection

Cloudflare deployed post-quantum hybrid key agreement across all TLS 1.3 connections in October 2022. Every connection to our Worker automatically negotiates X25519MLKEM768 — a hybrid combining classical X25519 with ML-KEM, the post-quantum algorithm standardized by NIST.

Classical cryptography relies on mathematical problems that are hard for traditional computers but trivial for quantum computers running Shor’s algorithm. ML-KEM is based on lattice problems that remain hard even for quantum computers. The hybrid approach means both algorithms must fail for the connection to be compromised.

Following a message through the system

Understanding where encryption happens matters for security architecture. When someone sends a message through our homeserver, here’s the actual path:

The sender’s client takes the plaintext message and encrypts it with Megolm — Matrix’s end-to-end encryption. This encrypted payload then gets wrapped in TLS for transport. On Cloudflare, that TLS connection uses X25519MLKEM768, making it quantum-resistant.

The Worker terminates TLS, but what it receives is still encrypted — the Megolm ciphertext. We store that ciphertext in D1, index it by room and timestamp, and deliver it to recipients. But we never see the plaintext. The message “Hello, world” exists only on the sender’s device and the recipient’s device.

When the recipient syncs, the process reverses. They receive the encrypted payload over another quantum-resistant TLS connection, then decrypt locally with their Megolm session keys.

Two layers, independent protection

This protects via two encryption layers that operate independently:

The transport layer (TLS) protects data in transit. It’s encrypted at the client and decrypted at the Cloudflare edge. With X25519MLKEM768, this layer is now post-quantum.

The application layer (Megolm E2EE) protects message content. It’s encrypted on the sender’s device and decrypted only on recipient devices. This uses classical Curve25519 cryptography.

Who sees what

Any Matrix homeserver operator — whether running Synapse on a VPS or this implementation on Workers — can see metadata: which rooms exist, who’s in them, when messages were sent. But no one in the infrastructure chain can see the message content, because the E2EE payload is encrypted on sender devices before it ever hits the network. Cloudflare terminates TLS and passes requests to your Worker, but both see only Megolm ciphertext. Media in encrypted rooms is encrypted client-side before upload, and private keys never leave user devices.

What traditional deployments would need

Achieving post-quantum TLS on a traditional Matrix deployment would require upgrading OpenSSL or BoringSSL to a version supporting ML-KEM, configuring cipher suite preferences correctly, testing client compatibility across all Matrix apps, monitoring for TLS negotiation failures, staying current as PQC standards evolve, and handling clients that don’t support PQC gracefully.

With Workers, it’s automatic. Chrome, Firefox, and Edge all support X25519MLKEM768. Mobile apps using platform TLS stacks inherit this support. The security posture improves as Cloudflare’s PQC deployment expands — no action required on our part.

The storage architecture that made it work

The key insight from porting Tuwunel was that different data needs different consistency guarantees. We use each Cloudflare primitive for what it does best.

D1 for the data model

D1 stores everything that needs to survive restarts and support queries: users, rooms, events, device keys. Over 25 tables covering the full Matrix data model.

CREATE TABLE events (
	event_id TEXT PRIMARY KEY,
	room_id TEXT NOT NULL,
	sender TEXT NOT NULL,
	event_type TEXT NOT NULL,
	state_key TEXT,
	content TEXT NOT NULL,
	origin_server_ts INTEGER NOT NULL,
	depth INTEGER NOT NULL
);

D1’s SQLite foundation meant we could port Tuwunel’s queries with minimal changes. Joins, indexes, and aggregations work as expected.

We learned one hard lesson: D1’s eventual consistency breaks foreign key constraints. A write to rooms might not be visible when a subsequent write to events checks the foreign key. We removed all foreign keys and enforce referential integrity in application code.

KV for ephemeral state

OAuth authorization codes live for 10 minutes, while refresh tokens last for a session.

// Store OAuth code with 10-minute TTL
kv.put(&format!("oauth_code:{}", code), &token_data)?
	.expiration_ttl(600)
	.execute()
	.await?;

KV’s global distribution means OAuth flows work fast regardless of where users are located.

R2 for media

Matrix media maps directly to R2, so you can upload an image, get back a content-addressed URL – and egress is free.

Durable Objects for atomicity

Some operations can’t tolerate eventual consistency. When a client claims a one-time encryption key, that key must be atomically removed. If two clients claim the same key, encrypted session establishment fails.

Durable Objects provide single-threaded, strongly consistent storage:

#[durable_object]
pub struct UserKeysObject {
	state: State,
	env: Env,
}

impl UserKeysObject {
	async fn claim_otk(&self, algorithm: &str) -> Result> {
    	// Atomic within single DO - no race conditions possible
    	let mut keys: Vec = self.state.storage()
        	.get("one_time_keys")
        	.await
        	.ok()
        	.flatten()
        	.unwrap_or_default();

    	if let Some(idx) = keys.iter().position(|k| k.algorithm == algorithm) {
        	let key = keys.remove(idx);
        	self.state.storage().put("one_time_keys", &keys).await?;
        	return Ok(Some(key));
    	}
    	Ok(None)
	}
}

We use UserKeysObject for E2EE key management, RoomObject for real-time room events like typing indicators and read receipts, and UserSyncObject for to-device message queues. The rest flows through D1.

Complete end-to-end encryption, complete OAuth

Our implementation supports the full Matrix E2EE stack: device keys, cross-signing keys, one-time keys, fallback keys, key backup, and dehydrated devices.

Modern Matrix clients use OAuth 2.0/OIDC instead of legacy password flows. We implemented a complete OAuth provider, with dynamic client registration, PKCE authorization, RS256-signed JWT tokens, token refresh with rotation, and standard OIDC discovery endpoints.

curl https://matrix.example.com/.well-known/openid-configuration
{
  "issuer": "https://matrix.example.com",
  "authorization_endpoint": "https://matrix.example.com/oauth/authorize",
  "token_endpoint": "https://matrix.example.com/oauth/token",
  "jwks_uri": "https://matrix.example.com/.well-known/jwks.json"
}

Point Element or any Matrix client at the domain, and it discovers everything automatically.

Sliding Sync for mobile

Traditional Matrix sync transfers megabytes of data on initial connection, draining mobile battery and data plans.

Sliding Sync lets clients request exactly what they need. Instead of downloading everything, clients get the 20 most recent rooms with minimal state. As users scroll, they request more ranges. The server tracks position and sends only deltas.

Combined with edge execution, mobile clients can connect and render their room list in under 500ms, even on slow networks.

The comparison

For a homeserver serving a small team:

	Traditional (VPS)	Workers
Monthly cost (idle)	$20-50	<$1
Monthly cost (active)	$20-50	$3-10
Global latency	100-300ms	20-50ms
Time to deploy	Hours	Seconds
Maintenance	Weekly	None
DDoS protection	Additional cost	Included
Post-quantum TLS	Complex setup	Automatic

^*^{Based on public rates and metrics published by DigitalOcean, AWS Lightsail, and Linode as of January 15, 2026.}

The economics improve further at scale. Traditional deployments require capacity planning and over-provisioning. Workers scale automatically.

The future of decentralized protocols

We started this as an experiment: could Matrix run on Workers? It can—and the approach can work for other stateful protocols, too.

By mapping traditional stateful components to Cloudflare’s primitives — Postgres to D1, Redis to KV, mutexes to Durable Objects — we can see that complex applications don't need complex infrastructure. We stripped away the operating system, the database management, and the network configuration, leaving only the application logic and the data itself.

Workers offers the sovereignty of owning your data, without the burden of owning the infrastructure.

I have been experimenting with the implementation and am excited for any contributions from others interested in this kind of service.

Ready to build powerful, real-time applications on Workers? Get started with Cloudflare Workers and explore Durable Objects for your own stateful edge applications. Join our Discord community to connect with other developers building at the edge.

Astro is joining Cloudflare

Fred Schott — Fri, 16 Jan 2026 14:00:00 GMT

The Astro Technology Company, creators of the Astro web framework, is joining Cloudflare.

Astro is the web framework for building fast, content-driven websites. Over the past few years, we’ve seen an incredibly diverse range of developers and companies use Astro to build for the web. This ranges from established brands like Porsche and IKEA, to fast-growing AI companies like Opencode and OpenAI. Platforms that are built on Cloudflare, like Webflow Cloud and Wix Vibe, have chosen Astro to power the websites their customers build and deploy to their own platforms. At Cloudflare, we use Astro, too — for our developer docs, website, landing pages, blog, and more. Astro is used almost everywhere there is content on the Internet.

By joining forces with the Astro team, we are doubling down on making Astro the best framework for content-driven websites for many years to come. The best version of Astro — Astro 6 — is just around the corner, bringing a redesigned development server powered by Vite. The first public beta release of Astro 6 is now available, with GA coming in the weeks ahead.

We are excited to share this news and even more thrilled for what it means for developers building with Astro. If you haven’t yet tried Astro — give it a spin and run npm create astro@latest.

What this means for Astro

Astro will remain open source, MIT-licensed, and open to contributions, with a public roadmap and open governance. All full-time employees of The Astro Technology Company are now employees of Cloudflare, and will continue to work on Astro. We’re committed to Astro’s long-term success and eager to keep building.

Astro wouldn’t be what it is today without an incredibly strong community of open-source contributors. Cloudflare is also committed to continuing to support open-source contributions, via the Astro Ecosystem Fund, alongside industry partners including Webflow, Netlify, Wix, Sentry, Stainless and many more.

From day one, Astro has been a bet on the web and portability: Astro is built to run anywhere, across clouds and platforms. Nothing changes about that. You can deploy Astro to any platform or cloud, and we’re committed to supporting Astro developers everywhere.

There are many web frameworks out there — so why are developers choosing Astro?

Astro has been growing rapidly:

Why? Many web frameworks have come and gone trying to be everything to everyone, aiming to serve the needs of both content-driven websites and web applications.

The key to Astro’s success: Instead of trying to serve every use case, Astro has stayed focused on five design principles. Astro is…

Content-driven: Astro was designed to showcase your content.
Server-first: Websites run faster when they render HTML on the server.
Fast by default: It should be impossible to build a slow website in Astro.
Easy to use: You don’t need to be an expert to build something with Astro.
Developer-focused: You should have the resources you need to be successful.

Astro’s Islands Architecture is a core part of what makes all of this possible. The majority of each page can be fast, static HTML — fast and simple to build by default, oriented around rendering content. And when you need it, you can render a specific part of a page as a client island, using any client UI framework. You can even mix and match multiple frameworks on the same page, whether that’s React.js, Vue, Svelte, Solid, or anything else:

Bringing back the joy in building websites

The more Astro and Cloudflare started talking, the clearer it became how much we have in common. Cloudflare’s mission is to help build a better Internet — and part of that is to help build a faster Internet. Almost all of us grew up building websites, and we want a world where people have fun building things on the Internet, where anyone can publish to a site that is truly their own.

When Astro first launched in 2021, it had become painful to build great websites — it felt like a fight with build tools and frameworks. It sounds strange to say it, with the coding agents and powerful LLMs of 2026, but in 2021 it was very hard to build an excellent and fast website without being a domain expert in JavaScript build tooling. So much has gotten better, both because of Astro and in the broader frontend ecosystem, that we take this almost for granted today.

The Astro project has spent the past five years working to simplify web development. So as LLMs, then vibe coding, and now true coding agents have come along and made it possible for truly anyone to build — Astro provided a foundation that was simple and fast by default. We’ve all seen how much better and faster agents get when building off the right foundation, in a well-structured codebase. More and more, we’ve seen both builders and platforms choose Astro as that foundation.

We’ve seen this most clearly through the platforms that both Cloudflare and Astro serve, that extend Cloudflare to their own customers in creative ways using Cloudflare for Platforms, and have chosen Astro as the framework that their customers build on.

When you deploy to Webflow Cloud, your Astro site just works and is deployed across Cloudflare’s network. When you start a new project with Wix Vibe, behind the scenes you’re creating an Astro site, running on Cloudflare. And when you generate a developer docs site using Stainless, that generates an Astro project, running on Cloudflare, powered by Starlight — a framework built on Astro.

Each of these platforms is built for a different audience. But what they have in common — beyond their use of Cloudflare and Astro — is they make it fun to create and publish content to the Internet. In a world where everyone can be both a builder and content creator, we think there are still so many more platforms to build and people to reach.

Astro 6 — new local dev server, powered by Vite

Astro 6 is coming, and the first open beta release is now available. To be one of the first to try it out, run:

npm create astro@latest -- --ref next

Or to upgrade your existing Astro app, run:

npx @astrojs/upgrade beta

Astro 6 brings a brand new development server, built on the Vite Environments API, that runs your code locally using the same runtime that you deploy to. This means that when you run astro dev with the Cloudflare Vite plugin, your code runs in workerd, the open-source Cloudflare Workers runtime, and can use Durable Objects, D1, KV, Agents and more. This isn’t just a Cloudflare feature: Any JavaScript runtime with a plugin that uses the Vite Environments API can benefit from this new support, and ensure local dev runs in the same environment, with the same runtime APIs as production.

Live Content Collections in Astro are also stable in Astro 6 and out of beta. These content collections let you update data in real time, without requiring a rebuild of your site. This makes it easy to bring in content that changes often, such as the current inventory in a storefront, while still benefitting from the built-in validation and caching that come with Astro’s existing support for content collections.

There’s more to Astro 6, including Astro’s most upvoted feature request — first-class support for Content Security Policy (CSP) — as well as simpler APIs, an upgrade to Zod 4, and more.

Doubling down on Astro

We're thrilled to welcome the Astro team to Cloudflare. We’re excited to keep building, keep shipping, and keep making Astro the best way to build content-driven sites. We’re already thinking about what comes next beyond V6, and we’d love to hear from you.

To keep up with the latest, follow the Astro blog and join the Astro Discord. Tell us what you’re building!

How Workers powers our internal maintenance scheduling pipeline

Kevin Deems — Mon, 22 Dec 2025 14:00:00 GMT

Cloudflare has data centers in over 330 cities globally, so you might think we could easily disrupt a few at any time without users noticing when we plan data center operations. However, the reality is that disruptive maintenance requires careful planning, and as Cloudflare grew, managing these complexities through manual coordination between our infrastructure and network operations specialists became nearly impossible.

It is no longer feasible for a human to track every overlapping maintenance request or account for every customer-specific routing rule in real time. We reached a point where manual oversight alone couldn't guarantee that a routine hardware update in one part of the world wouldn't inadvertently conflict with a critical path in another.

We realized we needed a centralized, automated "brain" to act as a safeguard — a system that could see the entire state of our network at once. By building this scheduler on Cloudflare Workers, we created a way to programmatically enforce safety constraints, ensuring that no matter how fast we move, we never sacrifice the reliability of the services on which our customers depend.

In this blog post, we’ll explain how we built it, and share the results we’re seeing now.

Building a system to de-risk critical maintenance operations

Picture an edge router that acts as one of a small, redundant group of gateways that collectively connect the public Internet to the many Cloudflare data centers operating in a metro area. In a populated city, we need to ensure that the multiple data centers sitting behind this small cluster of routers do not get cut off because the routers were all taken offline simultaneously.

Another maintenance challenge comes from our Zero Trust product, Dedicated CDN Egress IPs, which allows customers to choose specific data centers from which their user traffic will exit Cloudflare and be sent to their geographically close origin servers for low latency. (For the purpose of brevity in this post, we'll refer to the Dedicated CDN Egress IPs product as "Aegis," which was its former name.) If all the data centers a customer chose are offline at once, they would see higher latency and possibly 5xx errors, which we must avoid.

Our maintenance scheduler solves problems like these. We can make sure that we always have at least one edge router active in a certain area. And when scheduling maintenance, we can see if the combination of multiple scheduled events would cause all the data centers for a customer’s Aegis pools to be offline at the same time.

Before we created the scheduler, these simultaneous disruptive events could cause downtime for customers. Now, our scheduler notifies internal operators of potential conflicts, allowing us to propose a new time to avoid overlapping with other related data center maintenance events.

We define these operational scenarios, such as edge router availability and customer rules, as maintenance constraints which allow us to plan more predictable and safe maintenance.

Maintenance constraints

Every constraint starts with a set of proposed maintenance items, such as a network router or list of servers. We then find all the maintenance events in the calendar that overlap with the proposed maintenance time window.

Next, we aggregate product APIs, such as a list of Aegis customer IP pools. Aegis returns a set of IP ranges where a customer requested egress out of specific data center IDs, shown below.

[
    {
      "cidr": "104.28.0.32/32",
      "pool_name": "customer-9876",
      "port_slots": [
        {
          "dc_id": 21,
          "other_colos_enabled": true,
        },
        {
          "dc_id": 45,
          "other_colos_enabled": true,
        }
      ],
      "modified_at": "2023-10-22T13:32:47.213767Z"
    },
]

In this scenario, data center 21 and data center 45 relate to each other because we need at least one data center online for the Aegis customer 9876 to receive egress traffic from Cloudflare. If we tried to take data centers 21 and 45 down simultaneously, our coordinator would alert us that there would be unintended consequences for that customer workload.

We initially had a naive solution to load all data into a single Worker. This included all server relationships, product configurations, and metrics for product and infrastructure health to compute constraints. Even in our proof of concept phase, we ran into problems with “out of memory” errors.

We needed to be more cognizant of Workers’ platform limits. This required loading only as much data as was absolutely necessary to process the constraint’s business logic. If a maintenance request for a router in Frankfurt, Germany, comes in, we almost certainly do not care what is happening in Australia since there is no overlap across regions. Thus, we should only load data for neighboring data centers in Germany. We needed a more efficient way to process relationships in our dataset.

Graph processing on Workers

As we looked at our constraints, a pattern emerged where each constraint boiled down to two concepts: objects and associations. In graph theory, these components are known as vertices and edges, respectively. An object could be a network router and an association could be the list of Aegis pools in the data center that requires the router to be online. We took inspiration from Facebook’s TAO research paper to establish a graph interface on top of our product and infrastructure data. The API looks like the following:

type ObjectID = string

interface MainTAOInterface {
  object_get(id: ObjectID): Promise

  assoc_get(id1: ObjectID, atype: TAssocType): AsyncIterable
}

The core insight is that associations are typed. For example, a constraint would call the graph interface to retrieve Aegis product data.

async function constraint(c: AppContext, aegis: TAOAegisClient, datacenters: string[]): Promise> {
  const datacenterEntries = await Promise.all(
    datacenters.map(async (dcID) => {
      const iter = aegis.assoc_get(c, dcID, AegisAssocType.DATACENTER_INSIDE_AEGIS_POOL)
      const pools: string[] = []
      for await (const assoc of iter) {
        pools.push(assoc.id2)
      }
      return [dcID, pools] as const
    }),
  )

  const datacenterToPools = new Map(datacenterEntries)
  const uniquePools = new Set()
  for (const pools of datacenterToPools.values()) {
    for (const pool of pools) uniquePools.add(pool)
  }

  const poolTotalsEntries = await Promise.all(
    [...uniquePools].map(async (pool) => {
      const total = aegis.assoc_count(c, pool, AegisAssocType.AEGIS_POOL_CONTAINS_DATACENTER)
      return [pool, total] as const
    }),
  )

  const poolTotals = new Map(poolTotalsEntries)
  const poolAnalysis: Record = {}
  for (const [dcID, pools] of datacenterToPools.entries()) {
    for (const pool of pools) {
      poolAnalysis[pool] = {
        affectedDatacenters: new Set([dcID]),
        totalDatacenters: poolTotals.get(pool),
      }
    }
  }

  return poolAnalysis
}

We use two association types in the code above:

DATACENTER_INSIDE_AEGIS_POOL, which retrieves the Aegis customer pools that a data center resides in.
AEGIS_POOL_CONTAINS_DATACENTER, which retrieves the data centers an Aegis pool needs to serve traffic.

The associations are inverted indices of one another. The access pattern is exactly the same as before, but now the graph implementation has much more control of how much data it queries. Before, we needed to load all Aegis pools into memory and filter inside constraint business logic. Now, we can directly fetch only the data that matters to the application.

The interface is powerful because our graph implementation can improve performance behind the scenes without complicating the business logic. This lets us use the scalability of Workers and Cloudflare’s CDN to fetch data from our internal systems very quickly.

Fetch pipeline

We switched to using the new graph implementation, sending more targeted API requests. Response sizes dropped by 100x overnight, switching from loading a few massive requests to many tiny requests.

While this solves the issue of loading too much into memory, we now have a subrequest problem because instead of a few large HTTP requests, we make an order of magnitude more of small requests. Overnight, we started consistently breaching subrequest limits.

In order to solve this problem, we built a smart middleware layer between our graph implementation and the fetch API.

export const fetchPipeline = new FetchPipeline()
  .use(requestDeduplicator())
  .use(lruCacher({
    maxItems: 100,
  }))
  .use(cdnCacher())
  .use(backoffRetryer({
    retries: 3,
    baseMs: 100,
    jitter: true,
  }))
  .handler(terminalFetch);

If you’re familiar with Go, you may have seen the singleflight package before. We took inspiration from this idea and the first middleware component in the fetch pipeline deduplicates inflight HTTP requests, so they all wait on the same Promise for data instead of producing duplicate requests in the same Worker. Next, we use a lightweight Least Recently Used (LRU) cache to internally cache requests that we have already seen before.

Once both of those are complete, we use Cloudflare’s caches.default.match function to cache all GET requests in the region that the Worker is running. Since we have multiple data sources with different performance characteristics, we choose time to live (TTL) values carefully. For example, real-time data is only cached for 1 minute. Relatively static infrastructure data could be cached for 1–24 hours depending on the type of data. Power management data might be changed manually and infrequently, so we can cache it for longer at the edge.

In addition to those layers, we have the standard exponential backoff, retries and jitter. This helps reduce wasted fetch calls where a downstream resource might be unavailable temporarily. By backing off slightly, we increase the chance that we fetch the next request successfully. Conversely, if the Worker sends requests constantly without backoff, it will easily breach the subrequest limit when the origin starts returning 5xx errors.

Putting it all together, we saw ~99% cache hit rate. Cache hit rate is the percentage of HTTP requests served from Cloudflare’s fast cache memory (a "hit") versus slower requests to data sources running in our control plane (a "miss"), calculated as (hits / (hits + misses)). A high rate means better HTTP request performance and lower costs because querying data from cache in our Worker is an order of magnitude faster than fetching from an origin server in a different region. After tuning settings, for our in memory and CDN caches, hit rates have increased dramatically. Since much of our workload is real-time, we will never have a 100% hit rate as we must request fresh data at least once per minute.

We have talked about improving the fetching layer, but not about how we made origin HTTP requests faster. Our maintenance coordinator needs to react in real-time to network degradation and failure of machines in data centers. We use our distributed Prometheus query engine, Thanos, to deliver performant metrics from the edge into the coordinator.

Thanos in real-time

To explain how our choice in using the graph processing interface affected our real-time queries, let’s walk through an example. In order to analyze the health of edge routers, we could send the following query:

sum by (instance) (network_snmp_interface_admin_status{instance=~"edge.*"})

Originally, we asked our Thanos service, which stores Prometheus metrics, for a list of each edge router’s current health status and would manually filter for routers relevant to the maintenance inside the Worker. This is suboptimal for many reasons. For example, Thanos returned multi-MB responses which it needed to decode and encode. The Worker also needed to cache and decode these large HTTP responses only to filter out the majority of the data while processing a specific maintenance request. Since TypeScript is single-threaded and parsing JSON data is CPU-bound, sending two large HTTP requests means that one is blocked waiting for the other to finish parsing.

Instead, we simply use the graph to find targeted relationships such as the interface links between edge and spine routers, denoted as EDGE_ROUTER_NETWORK_CONNECTS_TO_SPINE.

sum by (lldp_name) (network_snmp_interface_admin_status{instance=~"edge01.fra03", lldp_name=~"spine.*"})

The result is 1 Kb on average instead of multiple MBs, or approximately 1000x smaller. This also massively reduces the amount of CPU required inside the Worker because we offload most of the deserialization to Thanos. As we explained before, this means we need to make a higher number of these smaller fetch requests, but load balancers in front of Thanos can spread the requests evenly to increase throughput for this use case.

Our graph implementation and fetch pipeline successfully tamed the 'thundering herd' of thousands of tiny real-time requests. However, historical analysis presents a different I/O challenge. Instead of fetching small, specific relationships, we need to scan months of data to find conflicting maintenance windows. In the past, Thanos would issue a massive amount of random reads to our object store, R2. To solve this massive bandwidth penalty without losing performance, we adopted a new approach the Observability team developed internally this year.

Historical data analysis

There are enough maintenance use cases that we must rely on historical data to tell us if our solution is accurate and will scale with the growth of Cloudflare’s network. We do not want to cause incidents, and we also want to avoid blocking proposed physical maintenance unnecessarily. In order to balance these two priorities, we can use time series data about maintenance events that happened two months or even a year ago to tell us how often a maintenance event is violating one of our constraints, e.g. edge router availability or Aegis. We blogged earlier this year about using Thanos to automatically release and revert software to the edge.

Thanos primarily fans out to Prometheus, but when Prometheus' retention is not enough to answer the query it has to download data from object storage — R2 in our case. Prometheus TSDB blocks were originally designed for local SSDs, relying on random access patterns that become a bottleneck when moved to object storage. When our scheduler needs to analyze months of historical maintenance data to identify conflicting constraints, random reads from object storage incur a massive I/O penalty. To solve this, we implemented a conversion layer that transforms these blocks into Apache Parquet files. Parquet is a columnar format native to big data analytics that organizes data by column rather than row, which — together with rich statistics — allows us to only fetch what we need.

Furthermore, since we are rewriting TSDB blocks into Parquet files, we can also store the data in a way that allows us to read the data in just a few big sequential chunks.

sum by (instance) (hmd:release_scopes:enabled{dc_id="45"})

In the example above we would choose the tuple “(__name__, dc_id)” as a primary sorting key so that metrics with the name “hmd:release_scopes:enabled” and the same value for “dc_id” get sorted close together.

Our Parquet gateway now issues precise R2 range requests to fetch only the specific columns relevant to the query. This reduces the payload from megabytes to kilobytes. Furthermore, because these file segments are immutable, we can aggressively cache them on the Cloudflare CDN.

This turns R2 into a low-latency query engine, allowing us to backtest complex maintenance scenarios against long-term trends instantly, avoiding the timeouts and high tail latency we saw with the original TSDB format. The graph below shows a recent load test, where Parquet reached up to 15x the P90 performance compared to the old system for the same query pattern.

To get a deeper understanding of how the Parquet implementation works, you can watch this talk at PromCon EU 2025, Beyond TSDB: Unlocking Prometheus with Parquet for Modern Scale.

Building for scale

By leveraging Cloudflare Workers, we moved from a system that ran out of memory to one that intelligently caches data and uses efficient observability tooling to analyze product and infrastructure data in real time. We built a maintenance scheduler that balances network growth with product performance.

But “balance” is a moving target.

Every day, we add more hardware around the world, and the logic required to maintain it without disrupting customer traffic gets exponentially harder with more products and types of maintenance operations. We’ve worked through the first set of challenges, but now we’re staring down more subtle, complex ones that only appear at this massive scale.

We need engineers who aren't afraid of hard problems. Join our Infrastructure team and come build with us.

Python Workers redux: fast cold starts, packages, and a uv-first workflow

Dominik Picheta — Mon, 08 Dec 2025 06:00:00 GMT

^{Note: This post was updated with additional details regarding AWS Lambda.}

Last year we announced basic support for Python Workers, allowing Python developers to ship Python to region: Earth in a single command and take advantage of the Workers platform.

Since then, we’ve been hard at work making the Python experience on Workers feel great. We’ve focused on bringing package support to the platform, a reality that’s now here — with exceptionally fast cold starts and a Python-native developer experience.

This means a change in how packages are incorporated into a Python Worker. Instead of offering a limited set of built-in packages, we now support any package supported by Pyodide, the WebAssembly runtime powering Python Workers. This includes all pure Python packages, as well as many packages that rely on dynamic libraries. We also built tooling around uv to make package installation easy.

We’ve also implemented dedicated memory snapshots to reduce cold start times. These snapshots result in serious speed improvements over other serverless Python vendors. In cold start tests using common packages, Cloudflare Workers start over 2.4x faster than AWS Lambda without SnapStart and 3x faster than Google Cloud Run.

In this blog post, we’ll explain what makes Python Workers unique and share some of the technical details of how we’ve achieved the wins described above. But first, for those who may not be familiar with Workers or serverless platforms – and especially those coming from a Python background — let us share why you might want to use Workers at all.

Deploying Python globally in 2 minutes

Part of the magic of Workers is simple code and easy global deployments. Let's start by showing how you can deploy a FastAPI app across the world with fast cold starts in less than two minutes.

A simple Worker using FastAPI can be implemented in a handful of lines:

from fastapi import FastAPI
from workers import WorkerEntrypoint
import asgi

app = FastAPI()

@app.get("/")
async def root():
   return {"message": "This is FastAPI on Workers"}

class Default(WorkerEntrypoint):
   async def fetch(self, request):
       return await asgi.fetch(app, request.js_object, self.env)

To deploy something similar, just make sure you have uv and npm installed, then run the following:

$ uv tool install workers-py
$ pywrangler init --template \
    https://github.com/cloudflare/python-workers-examples/03-fastapi
$ pywrangler deploy

With just a little code and a pywrangler deploy, you’ve now deployed your application across Cloudflare’s edge network that extends to 330 locations across 125 countries. No worrying about infrastructure or scaling.

And for many use cases, Python Workers are completely free. Our free tier offers 100,000 requests per day and 10ms CPU time per invocation. For more information, check out the pricing page in our documentation.

For more examples, check out the repo in GitHub. And read on to find out more about Python Workers.

So what can you do with Python Workers?

Now that you’ve got a Worker, just about anything is possible. You write the code, so you get to decide. Your Python Worker receives HTTP requests and can make requests to any server on the public Internet.

You can set up cron triggers, so your Worker runs on a regular schedule. Plus, if you have more complex requirements, you can make use of Workflows for Python Workers, or even long-running WebSocket servers and clients using Durable Objects.

Here are more examples of the sorts of things you can do using Python Workers:

Faster package cold starts

Serverless platforms like Workers save you money by only running your code when it’s necessary to do so. This means that if your Worker isn’t receiving requests, it may be shut down and will need to be restarted once a new request comes in. This typically incurs a resource overhead we refer to as the “cold start.” It’s important to keep these as short as possible to minimize latency for end users.

In standard Python, booting the runtime is expensive, and our initial implementation of Python Workers focused on making the runtime boot fast. However, we quickly realized that this wasn’t enough. Even if the Python runtime boots quickly, in real-world scenarios the initial startup usually includes loading modules from packages, and unfortunately, in Python many popular packages can take several seconds to load.

We set out to make cold starts fast, regardless of whether packages were loaded.

To measure realistic cold start performance, we set up a benchmark that imports common packages, as well as a benchmark running a “hello world” using a bare Python runtime. Standard Lambda is able to start just the runtime quickly, but once you need to import packages, the cold start times shoot up. In order to optimize for faster cold starts with packages, you can use SnapStart on Lambda (which we will be adding to the linked benchmarks shortly). This incurs a cost to store the snapshot and an additional cost on every restore. Python Workers will automatically apply memory snapshots for free for every Python Worker.

Here are the average cold start times when loading three common packages (httpx, fastapi and pydantic):

Platform	Mean Cold Start (secs)
Cloudflare Python Workers	1.027
AWS Lambda (without SnapStart)	2.502
Google Cloud Run	3.069

In this case, Cloudflare Python Workers have 2.4x faster cold starts than AWS Lambda without SnapStart and 3x faster cold starts than Google Cloud Run. We achieved these low cold start numbers by using memory snapshots, and in a later section we explain how we did so.

We are regularly running these benchmarks. Go here for up-to-date data and more info on our testing methodology.

We’re architecturally different from these other platforms — namely, Workers is isolate-based. Because of that, our aims are high, and we are planning for a zero cold start future.

Package tooling integrated with uv

The diverse package ecosystem is a large part of what makes Python so amazing. That’s why we’ve been hard at work ensuring that using packages in Workers is as easy as possible.

We realised that working with the existing Python tooling is the best path towards a great development experience. So we picked the uv package and project manager, as it’s fast, mature, and gaining momentum in the Python ecosystem.

We built our own tooling around uv called pywrangler. This tool essentially performs the following actions:

Reads your Worker’s pyproject.toml file to determine the dependencies specified in it
Includes your dependencies in a python_modules folder that lives in your Worker

Pywrangler calls out to uv to install the dependencies in a way that is compatible with Python Workers, and calls out to wrangler when developing locally or deploying Workers.

Effectively this means that you just need to run pywrangler dev and pywrangler deploy to test your Worker locally and deploy it.

Type hints

You can generate type hints for all of the bindings defined in your wrangler config using pywrangler types. These type hints will work with Pylance or with recent versions of mypy.

To generate the types, we use wrangler types to create typescript type hints, then we use the typescript compiler to generate an abstract syntax tree for the types. Finally, we use the TypeScript hints — such as whether a JS object has an iterator field — to generate mypy type hints that work with the Pyodide foreign function interface.

Decreasing cold start duration using snapshots

Python startup is generally quite slow and importing a Python module can trigger a large amount of work. We avoid running Python startup during a cold start using memory snapshots.

When a Worker is deployed, we execute the Worker’s top-level scope and then take a memory snapshot and store it alongside your Worker. Whenever we are starting a new isolate for the Worker, we restore the memory snapshot and the Worker is ready to handle requests, with no need to execute any Python code in preparation. This improves cold start times considerably. For instance, starting a Worker that imports fastapi, httpx and pydantic without snapshots takes around 10 seconds. With snapshots, it takes 1 second.

The fact that Pyodide is built on WebAssembly enables this. We can easily capture the full linear memory of the runtime and restore it.

Memory snapshots and Entropy

WebAssembly runtimes do not require features like address space layout randomization for security, so most of the difficulties with memory snapshots on a modern operating system do not arise. Just like with native memory snapshots, we still have to carefully handle entropy at startup to avoid using the XKCD random number generator (we’re very into actual randomness).

By snapshotting memory, we might inadvertently lock in a seed value for randomness. In this case, future calls for “random” numbers would consistently return the same sequence of values across many requests.

Avoiding this is particularly challenging because Python uses a lot of entropy at startup. These include the libc functions getentropy() and getrandom() and also reading from /dev/random and /dev/urandom. All of these functions share the same implementation in terms of the JavaScript crypto.getRandomValues() function.

In Cloudflare Workers, crypto.getRandomValues() has always been disabled at startup in order to allow us to switch to using memory snapshots in the future. Unfortunately, the Python interpreter cannot bootstrap without calling this function. And many packages also require entropy at startup time. There are essentially two purposes for this entropy:

Hash seeds for hash randomization
Seeds for pseudorandom number generators

Hash randomization we do at startup time and accept the cost that each specific Worker has a fixed hash seed. Python has no mechanism to allow replacing the hash seed after startup.

For pseudorandom number generators (PRNG), we take the following approach:

At deploy time:

Seed the PRNG with a fixed “poison seed”, then record the PRNG state.
Replace all APIs that call into the PRNG with an overlay that fails the deployment with a user error.
Execute the top level scope of user code.
Capture the snapshot.

At run time:

Assert that the PRNG state is unchanged. If it changed, we forgot the overlay for some method. Fail the deployment with an internal error.
After restoring the snapshot, reseed the random number generator before executing any handlers.

With this, we can ensure that PRNGs can be used while the Worker is running, but stop Workers from using them during initialization and pre-snapshot.

Memory snapshots and WebAssembly state

An additional difficulty arises when creating memory snapshots on WebAssembly: The memory snapshot we are saving consists only of the WebAssembly linear memory, but the full state of the Pyodide WebAssembly instance is not contained in the linear memory.

There are two tables outside of this memory.

One table holds the values of function pointers. Traditional computers use a “Von Neumann” architecture, which means that code exists in the same memory space as data, so that calling a function pointer is a jump to some memory address. WebAssembly has a “Harvard architecture” where code lives in a separate address space. This is key to most of the security guarantees of WebAssembly and in particular why WebAssembly does not need address space layout randomization. A function pointer in WebAssembly is an index into the function pointer table.

A second table holds all JavaScript objects referenced from Python. JavaScript objects cannot be directly stored into memory because the JavaScript virtual machine forbids directly obtaining a pointer to a JavaScript object. Instead, they are stored into a table and represented in WebAssembly as an index into the table.

We need to ensure that both of these tables are in exactly the same state after we restore a snapshot as they were when we captured the snapshot.

The function pointer table is always in the same state when the WebAssembly instance is initialized and is updated by the dynamic loader when we load dynamic libraries — native Python packages like numpy.

To handle dynamic loading:

When taking the snapshot, we patch the loader to record the load order of dynamic libraries, the address in memory where the metadata for each library is allocated, and the function pointer table base address for relocations.
When restoring the snapshot, we reload the dynamic libraries in the same order, and we use a patched memory allocator to place the metadata in the same locations. We assert that the current size of the function pointer table matches the function pointer table base we recorded for the dynamic library.

All of this ensures that each function pointer has the same meaning after we’ve restored the snapshot as it had when we took the snapshot.

To handle the JavaScript references, we implemented a fairly limited system. If a JavaScript object is accessible from globalThis by a series of property accesses, we record those property accesses and replay them when restoring the snapshot. If any reference exists to a JavaScript object that is not accessible in this way, we fail deployment of the Worker. This is good enough to deal with all the existing Python packages with Pyodide support, which do top level imports like:

from js import fetch

Reducing cold start frequency using sharding

Another important characteristic of our performance strategy for Python Workers is sharding. There is a very detailed description of what went into its implementation here. In short, we now route requests to existing Worker instances, whereas before we might have chosen to start a new instance.

Sharding was actually enabled for Python Workers first and proved to be a great test bed for it. A cold start is far more expensive in Python than in JavaScript, so ensuring requests are routed to an already-running isolate is especially important.

Where do we go from here?

This is just the start. We have many plans to make Python Workers better:

More developer-friendly tooling
Even faster cold starts by utilising our isolate architecture
Support for more packages
Support for native TCP sockets, native WebSockets, and more bindings

To learn more about Python Workers, check out the documentation available here. To get help, be sure to join our Discord.

Connecting to production: the architecture of remote bindings

Samuel Macleod — Wed, 12 Nov 2025 14:00:00 GMT

Remote bindings are bindings that connect to a deployed resource on your Cloudflare account instead of a locally simulated resource – and recently, we announced that remote bindings are now generally available.

With this launch, you can now connect to deployed resources like R2 buckets and D1 databases while running Worker code on your local machine. This means you can test your local code changes against real data and services, without the overhead of deploying for each iteration.

In this blog post, we’ll dig into the technical details of how we built it, creating a seamless local development experience.

Developing on the Workers platform

A key part of the Cloudflare Workers platform has been the ability to develop your code locally without having to deploy it every time you wanted to test something – though the way we’ve supported this has changed greatly over the years.

We started with wrangler dev running in remote mode. This works by deploying and connecting to a preview version of your Worker that runs on Cloudflare’s network every time you make a change to your code, allowing you to test things out as you develop. However, remote mode isn’t perfect — it’s complex and hard to maintain. And the developer experience leaves a lot to be desired: slow iteration speed, unstable debugging connections, and lack of support for multi-worker scenarios.

Those issues and others motivated a significant investment in a fully local development environment for Workers, which was released in mid-2023 and became the default experience for wrangler dev. Since then, we've put a huge amount of work into the local dev experience with Wrangler, the Cloudflare Vite plugin (alongside @cloudflare/vitest-pool-workers) & Miniflare.

Still, the original remote mode remained accessible via a flag: wrangler dev --remote. When using remote mode, all the DX benefits of a fully local experience and the improvements we’ve made over the last few years are bypassed. So why do people still use it? It enables one key unique feature: binding to remote resources while locally developing. When you use local mode to develop a Worker locally, all of your bindings are simulated locally using local (initially empty) data. This is fantastic for iterating on your app’s logic with test data – but sometimes that’s not enough, whether you want to share resources across your team, reproduce bugs tied to real data, or just be confident that your app will work in production with real resources.

Given this, we saw an opportunity: If we could bring the best parts of remote mode (i.e. access to remote resources) to wrangler dev, there’d be one single flow for developing Workers that would enable many use cases, while not locking people out of the advancements we’ve made to local development. And that’s what we did!

As of Wrangler v4.37.0 you can pick on a per-binding basis whether a binding should use remote or local resources, simply by specifying the remote option. It’s important to re-emphasise this—you only need to add remote: true! There’s no complex management of API keys and credentials involved, it all just works using Wrangler’s existing Oauth connection to the Cloudflare API.

{
  "name": "my-worker",
  "compatibility_date": "2025-01-01",
  "kv_namespaces": [{
    "binding": "KV",
    "id": "my-kv-id",
  },{
    "binding": "KV_2",
    "id": "other-kv-id",
    "remote": true
  }],
  "r2_buckets": [{
    "bucket_name": "my-r2-name",
    "binding": "R2"
  }]
}

The eagle-eyed among you might have realised that some bindings already worked like this, accessing remote resources from local dev. Most prominently, the AI binding was a trailblazer for what a general remote bindings solution could look like. From its introduction, the AI binding always connected to a remote resource, since a true local experience that supports all the different models you can use with Workers AI would be impractical and require a huge upfront download of AI models.

As we realised different products within Workers needed something similar to remote bindings (Images and Hyperdrive, for instance), we ended up with a bit of a patchwork of different solutions. We’ve now unified under a single remote bindings solution that works for all binding types.

How we built it

We wanted to make it really easy for developers to access remote resources without having to change their production Workers code, and so we landed on a solution that required us to fetch data from the remote resource at the point of use in your Worker.

const value = await env.KV.get("some-key")

^{The above code snippet shows accessing the “some-key” value in the env.KV}^{KV namespace}^{, which is not available locally and needs to be fetched over the network.}

So if that was our requirement, how would we get there? For instance, how would we get from a user calling env.KV.put(“key”, “value”) in their Worker to actually storing that in a remote KV store? The obvious solution was perhaps to use the Cloudflare API. We could have just replaced the entire env locally with stub objects that made API calls, transforming env.KV.put() into PUT http:///accounts/{account_id}/storage/kv/namespaces/{namespace_id}/values/{key_name}.

This would’ve worked great for KV, R2, D1, and other bindings with mature HTTP APIs, but it would have been a pretty complex solution to implement and maintain. We would have had to replicate the entire bindings API surface and transform every possible operation on a binding to an equivalent API call. Additionally, some binding operations don’t have an equivalent API call, and wouldn’t be supportable using this strategy.

Instead, we realised that we already had a ready-made API waiting for us — the one we use in production!

How bindings work under the hood in production

Most bindings on the Workers platform boil down to essentially a service binding. A service binding is a link between two Workers that allows them to communicate over HTTP or JSRPC (we’ll come back to JSRPC later).

For example, the KV binding is implemented as a service binding between your authored Worker and a platform Worker, speaking HTTP. The JS API for the KV binding is implemented in the Workers runtime, and translates calls like env.KV.get() to HTTP calls to the Worker that implements the KV service.

^{Diagram showing a simplified model of how a KV binding works in production}

You may notice that there’s a natural async network boundary here — between the runtime translating the env.KV.get() call and the Worker that implements the KV service. We realised that we could use that natural network boundary to implement remote bindings. Instead of the production runtime translating env.KV.get() to an HTTP call, we could have the local runtime (workerd) translate env.KV.get() to an HTTP call, and then send it directly to the KV service, bypassing the production runtime. And so that’s what we did!

^{Diagram showing a locally run worker with a single KV binding, with a single remote proxy client that communicates to the remote proxy server, which in turn communicates with the remote KV}

The above diagram shows a local Worker running with a remote KV binding. Instead of being handled by the local KV simulation, it’s now being handled by a remote proxy client. This Worker then communicates with a remote proxy server connected to the real remote KV resource, ultimately allowing the local Worker to communicate with the remote KV data seamlessly.

Each binding can independently either be handled by a remote proxy client (all connected to the same remote proxy server) or by a local simulation, allowing for very dynamic workflows where some bindings are locally simulated while others connect to the real remote resource, as illustrated in the example below:

^{The above diagram and config shows a Worker (running on your computer) bound to 3 different resources—two local (KV & R2), and one remote (KV_2)}

How JSRPC fits in

The above section deals with bindings that are backed by HTTP connections (like KV and R2), but modern bindings use JSRPC. That means we needed a way for the locally running workerd to speak JSRPC to a production runtime instance.

In a stroke of good luck, a parallel project was going on to make this possible, as detailed in the Cap’n Web blog. We integrated that by making the connection between the local workerd instance and the remote runtime instance communicate over websockets using Cap’n Web, enabling bindings backed by JSRPC to work. This includes newer bindings like Images, as well as JSRPC service bindings to your own Workers.

Remote bindings with Vite, Vitest and the JavaScript ecosystem

We didn't want to limit this exciting new feature to only wrangler dev. We wanted to support it in our Cloudflare Vite Plugin and vitest-pool-workers packages, as well as allowing any other potential tools and use cases from the JavaScript ecosystem to also benefit from it.

In order to achieve this, the wrangler package now exports utilities such as startRemoteProxySession that allow tools not leveraging wrangler dev to also support remote bindings. You can find more details in the official remote bindings documentation.

How do I try this out?

Just use wrangler dev! As of Wrangler v4.37.0 (@cloudflare/vite-plugin v1.13.0, @cloudflare/vitest-pool-workers v0.9.0), remote bindings are available in all projects, and can be turned on a per-binding basis by adding remote: true to the binding definition in your Wrangler config file.

A closer look at Python Workflows, now in beta

Caio Nogueira — Mon, 10 Nov 2025 14:00:00 GMT

Developers can already use Cloudflare Workflows to build long-running, multi-step applications on Workers. Now, Python Workflows are here, meaning you can use your language of choice to orchestrate multi-step applications.

With Workflows, you can automate a sequence of idempotent steps in your application with built-in error handling and retry behavior. But Workflows were originally supported only in TypeScript. Since Python is the de facto language of choice for data pipelines, artificial intelligence/machine learning, and task automation – all of which heavily rely on orchestration – this created friction for many developers.

Over the years, we’ve been giving developers the tools to build these applications in Python, on Cloudflare. In 2020, we brought Python to Workers via Transcrypt before directly integrating Python into workerd in 2024. Earlier this year, we built support for CPython along with any packages built in Pyodide, like matplotlib and pandas, in Workers. Now, Python Workflows are supported as well, so developers can create robust applications using the language they know best.

Why Python for Workflows?

Imagine you’re training an LLM. You need to label the dataset, feed data, wait for the model to run, evaluate the loss, adjust the model, and repeat. Without automation, you’d need to start each step, monitor manually until completion, and then start the next one. Instead, you could use a workflow to orchestrate the training of the model, triggering each step pending the completion of its predecessor. For any manual adjustments needed, like evaluating the loss and adjusting the model accordingly, you can implement a step that notifies you and waits for the necessary input.

Consider data pipelines, which are a top Python use case for ingesting and processing data. By automating the data pipeline through a defined set of idempotent steps, developers can deploy a workflow that handles the entire data pipeline for them.

Take another example: building AI agents, such as an agent to manage your groceries. Each week, you input your list of recipes, and the agent (1) compiles the list of necessary ingredients, (2) checks what ingredients you have left over from previous weeks, and (3) orders the differential for pickup from your local grocery store. Using a Workflow, this could look like:

await step.wait_for_event() the user inputs the grocery list
step.do() compile list of necessary ingredients
step.do() check list of necessary ingredients against left over ingredients
step.do() make an API call to place the order
step.do() proceed with payment

Using workflows as a tool to build agents on Cloudflare can simplify agents’ architecture and improve their odds for reaching completion through individual step retries and state persistence. Support for Python Workflows means building agents with Python is easier than ever.

How Python Workflows work

Cloudflare Workflows uses the underlying infrastructure that we created for durable execution, while providing an idiomatic way for Python users to write their workflows. In addition, we aimed for complete feature parity between the Javascript and the Python SDK. This is possible because Cloudflare Workers support Python directly in the runtime itself.

Creating a Python Workflow

Cloudflare Workflows are fully built on top of Workers and Durable Objects. Each element plays a part in storing Workflow metadata, and instance level information. For more detail on how the Workflows platform works, check out this blog post.

At the very bottom of the Workflows control plane sits the user Worker, which is the WorkflowEntrypoint. When the Workflow instance is ready to run, the Workflow engine will call into the run method of the user worker via RPC, which in this case will be a Python Worker.

This is an example skeleton for a Workflow declaration, provided by the official documentation:

export class MyWorkflow extends WorkflowEntrypoint {
  async run(event: WorkflowEvent, step: WorkflowStep) {
    // Steps here
  }
}

The run method, as illustrated above, provides a WorkflowStep parameter that implements the durable execution APIs. This is what users rely on for at-most-once execution. These APIs are implemented in JavaScript and need to be accessed in the context of the Python Worker.

A WorkflowStep must cross the RPC barrier, meaning the engine (caller) exposes it as an RpcTarget. This setup allows the user's Workflow (callee) to substitute the parameter with a stub. This stub then enables the use of durable execution APIs for Workflows by RPCing back to the engine. To read more about RPC serialization and how functions can be passed from caller and callee, read the Remote-Procedure call documentation.

All of this is true for both Python and JavaScript Workflows, since we don’t really change how the user Worker is called from the Workflows side. However, in the Python case, there is another barrier – language bridging between Python and the JavaScript module. When an RPC request targets a Python Worker, there is a Javascript entrypoint module responsible for proxying the request to be handled by the Python script, and then returned to the caller. This process typically involves type translation before and after handling the request.

Overcoming the language barrier

Python workers rely on Pyodide, which is a port of CPython to WebAssembly. Pyodide provides a foreign function interface (FFI) to JavaScript which allows for calling into JavaScript methods from Python. This is the mechanism that allows other bindings and Python packages to work within the Workers platform. Therefore, we use this FFI layer not only to allow using the Workflow binding directly, but also to provide WorkflowStep methods in Python. In other words, by considering that WorkflowEntrypoint is a special class for the runtime, the run method is manually wrapped so that WorkflowStep is exposed as a JsProxy instead of being type translated like other JavaScript objects. Moreover, by wrapping the APIs from the perspective of the user Worker, we allow ourselves to make some adjustments to the overall development experience, instead of simply exposing a JavaScript SDK to a different language with different semantics.

Making the Python Workflows SDK Pythonic

A big part of porting Workflows to Python includes exposing an interface that Python users will be familiar with and have no problems using, similarly to what happens with our JavaScript APIs. Let's take a step back and look at a snippet for a Workflow (written in Typescript) definition.

import { WorkflowEntrypoint, WorkflowStep, WorkflowEvent} from 'cloudflare:workers';
 
export class MyWorkflow extends WorkflowEntrypoint {
    async run(event: WorkflowEvent, step: WorkflowStep) {
        let state = step.do("my first step", async () => {
          // Access your properties via event.payload
          let userEmail = event.payload.userEmail
          let createdTimestamp = event.payload.createdTimestamp
          return {"userEmail": userEmail, "createdTimestamp": createdTimestamp}
	    })
 
        step.sleep("my first sleep", "30 minutes");
 
        await step.waitForEvent("receive example event", { type: "simple-event", timeout: "1 hour" })
 
   	 const developerWeek = Date.parse("22 Sept 2025 13:00:00 UTC");
        await step.sleepUntil("sleep until X times out", developerWeek)
    }
}

The Python implementation of the workflows API requires modification of the do method. Unlike other languages, Python does not easily support anonymous callbacks. This behavior is typically achieved through the use of decorators, which in this case allow us to intercept the method and expose it idiomatically. In other words, all parameters maintain their original order, with the decorated method serving as the callback.

The methods waitForEvent, sleep, and sleepUntil can retain their original signatures, as long as their names are converted to snake case.

Here’s the corresponding Python version for the same workflow, achieving similar behavior:

from workers import WorkflowEntrypoint
 
class MyWorkflow(WorkflowEntrypoint):
    async def run(self, event, step):
        @step.do("my first step")
        async def my_first_step():
            user_email = event["payload"]["userEmail"]
            created_timestamp = event["payload"]["createdTimestamp"]
            return {
                "userEmail": user_email,
                "createdTimestamp": created_timestamp,
            }
 
        await my_first_step()
 
        step.sleep("my first sleep", "30 minutes")
 
         await step.wait_for_event(
            "receive example event",
            "simple-event",
            timeout="1 hour",
        )
 
        developer_week = datetime(2024, 10, 24, 13, 0, 0, tzinfo=timezone.utc)
        await step.sleep_until("sleep until X times out", developer_week)

DAG Workflows

When designing Workflows, we’re often managing dependencies between steps even when some of these tasks can be handled concurrently. Even though we’re not thinking about it, many Workflows have a directed acyclic graph (DAG) execution flow. Concurrency is achievable in the first iteration of Python Workflows (i.e.: minimal port to Python Workers) because Pyodide captures Javascript thenables and proxies them into Python awaitables.

Consequently, asyncio.gather works as a counterpart to Promise.all. Although this is perfectly fine and ready to be used in the SDK, we also support a declarative approach.

One of the advantages of decorating the do method is that we can essentially provide further abstractions on the original API, and have them work on the entrypoint wrapper. Here’s an example of a Python API making use of the DAG capabilities introduced:

from workers import Response, WorkflowEntrypoint

class PythonWorkflowDAG(WorkflowEntrypoint):
    async def run(self, event, step):

        @step.do('dependency 1')
        async def dep_1():
            # does stuff
            print('executing dep1')

        @step.do('dependency 2')
        async def dep_2():
            # does stuff
            print('executing dep2')

        @step.do('demo do', depends=[dep_1, dep_2], concurrent=True)
        async def final_step(res1=None, res2=None):
            # does stuff
            print('something')

        await final_step()

This kind of approach makes the Workflow declaration much cleaner, leaving state management to the Workflows engine data plane, as well as the Python workers Workflow wrapper. Note that even though multiple steps can run with the same name, the engine will slightly modify the name of each step to ensure uniqueness. In Python Workflows, a dependency is considered resolved once the initial step involving it has been successfully completed.

Try it out

Check out writing Workers in Python and create your first Python Workflow today! If you have any feature requests or notice any bugs, share your feedback directly with the Cloudflare team by joining the Cloudflare Developers community on Discord.

How Workers VPC Services connects to your regional private networks from anywhere in the world

Thomas Gauvin — Wed, 05 Nov 2025 14:00:00 GMT

In April, we shared our vision for a global virtual private cloud on Cloudflare, a way to unlock your applications from regionally constrained clouds and on-premise networks, enabling you to build truly cross-cloud applications.

Today, we’re announcing the first milestone of our Workers VPC initiative: VPC Services. VPC Services allow you to connect to your APIs, containers, virtual machines, serverless functions, databases and other services in regional private networks via Cloudflare Tunnels from your Workers running anywhere in the world.

Once you set up a Tunnel in your desired network, you can register each service that you want to expose to Workers by configuring its host or IP address. Then, you can access the VPC Service as you would any other Workers service binding — Cloudflare’s network will automatically route to the VPC Service over Cloudflare’s network, regardless of where your Worker is executing:

export default {
  async fetch(request, env, ctx) {
    // Perform application logic in Workers here	

    // Call an external API running in a ECS in AWS when needed using the binding
    const response = await env.AWS_VPC_ECS_API.fetch("http://internal-host.com");

    // Additional application logic in Workers
    return new Response();
  },
};

Workers VPC is now available to everyone using Workers, at no additional cost during the beta, as is Cloudflare Tunnels. Try it out now. And read on to learn more about how it works under the hood.

Connecting the networks you trust, securely

Your applications span multiple networks, whether they are on-premise or in external clouds. But it’s been difficult to connect from Workers to your APIs and databases locked behind private networks.

We have previously described how traditional virtual private clouds and networks entrench you into traditional clouds. While they provide you with workload isolation and security, traditional virtual private clouds make it difficult to build across clouds, access your own applications, and choose the right technology for your stack.

A significant part of the cloud lock-in is the inherent complexity of building secure, distributed workloads. VPC peering requires you to configure routing tables, security groups and network access-control lists, since it relies on networking across clouds to ensure connectivity. In many organizations, this means weeks of discussions and many teams involved to get approvals. This lock-in is also reflected in the solutions invented to wrangle this complexity: Each cloud provider has their own bespoke version of a “Private Link” to facilitate cross-network connectivity, further restricting you to that cloud and the vendors that have integrated with it.

With Workers VPC, we’re simplifying that dramatically. You set up your Cloudflare Tunnel once, with the necessary permissions to access your private network. Then, you can configure Workers VPC Services, with the tunnel and hostname (or IP address and port) of the service you want to expose to Workers. Any request made to that VPC Service will use this configuration to route to the given service within the network.

{
  "type": "http",
  "name": "vpc-service-name",
  "http_port": 80,
  "https_port": 443,
  "host": {
    "hostname": "internally-resolvable-hostname.com",
    "resolver_network": {
      "tunnel_id": "0191dce4-9ab4-7fce-b660-8e5dec5172da"
    }
  }
}

This ensures that, once represented as a Workers VPC Service, a service in your private network is secured in the same way other Cloudflare bindings are, using the Workers binding model. Let’s take a look at a simple VPC Service binding example:

{
  "name": "WORKER-NAME",
  "main": "./src/index.js",
  "vpc_services": [
    {
      "binding": "AWS_VPC2_ECS_API",
      "service_id": "5634563546"
    }
  ]
}

Like other Workers bindings, when you deploy a Worker project that tries to connect to a VPC Service, the access permissions are verified at deploy time to ensure that the Worker has access to the service in question. And once deployed, the Worker can use the VPC Service binding to make requests to that VPC Service — and only that service within the network.

That’s significant: Instead of exposing the entire network to the Worker, only the specific VPC Service can be accessed by the Worker. This access is verified at deploy time to provide a more explicit and transparent service access control than traditional networks and access-control lists do.

This is a key factor in the design of Workers bindings: de facto security with simpler management and making Workers immune to Server-Side Request Forgery (SSRF) attacks. We’ve gone deep on the binding security model in the past, and it becomes that much more critical when accessing your private networks.

Notably, the binding model is also important when considering what Workers are: scripts running on Cloudflare’s global network. They are not, in contrast to traditional clouds, individual machines with IP addresses, and do not exist within networks. Bindings provide secure access to other resources within your Cloudflare account – and the same applies to Workers VPC Services.

A peek under the hood

So how do VPC Services and their bindings route network requests from Workers anywhere on Cloudflare’s global network to regional networks using tunnels? Let’s look at the lifecycle of a sample HTTP Request made from a VPC Service’s dedicated fetch() request represented here:

It all starts in the Worker code, where the .fetch() function of the desired VPC Service is called with a standard JavaScript Request (as represented with Step 1). The Workers runtime will use a Cap’n Proto remote-procedure-call to send the original HTTP request alongside additional context, as it does for many other Workers bindings.

The Binding Worker of the VPC Service System receives the HTTP request along with the binding context, in this case, the Service ID of the VPC Service being invoked. The Binding Worker will proxy this information to the Iris Service within an HTTP CONNECT connection, a standard pattern across Cloudflare’s bindings to place connection logic to Cloudflare’s edge services within Worker code rather than the Workers runtime itself (Step 2).

The Iris Service is the main service for Workers VPC. Its responsibility is to accept requests for a VPC Service and route them to the network in which your VPC Service is located. It does this by integrating with Apollo, an internal service of Cloudflare One. Apollo provides a unified interface that abstracts away the complexity of securely connecting to networks and tunnels, across various layers of networking.

To integrate with Apollo, Iris must complete two tasks. First, Iris will parse the VPC Service ID from the metadata and fetch the information of the tunnel associated with it from our configuration store. This includes the tunnel ID and type from the configuration store (Step 3), which is the information that Iris needs to send the original requests to the right tunnel.

Second, Iris will create the UDP datagrams containing DNS questions for the A and AAAA records of the VPC Service’s hostname. These datagrams will be sent first, via Apollo. Once DNS resolution is completed, the original request is sent along, with the resolved IP address and port (Step 4). That means that steps 4 through 7 happen in sequence twice for the first request: once for DNS resolution and a second time for the original HTTP Request. Subsequent requests benefit from Iris’ caching of DNS resolution information, minimizing request latency.

In Step 5, Apollo receives the metadata of the Cloudflare Tunnel that needs to be accessed, along with the DNS resolution UDP datagrams or the HTTP Request TCP packets. Using the tunnel ID, it determines which datacenter is connected to the Cloudflare Tunnel. This datacenter is in a region close to the Cloudflare Tunnel, and as such, Apollo will route the DNS resolution messages and the Original Request to the Tunnel Connector Service running in that datacenter (Step 5).

The Tunnel Connector Service is responsible for providing access to the Cloudflare Tunnel to the rest of Cloudflare’s network. It will relay the DNS resolution questions, and subsequently the original request to the tunnel over the QUIC protocol (Step 6).

Finally, the Cloudflare Tunnel will send the DNS resolution questions to the DNS resolver of the network it belongs to. It will then send the original HTTP Request from its own IP address to the destination IP and port (Step 7). The results of the request are then relayed all the way back to the original Worker, from the datacenter closest to the tunnel all the way to the original Cloudflare datacenter executing the Worker request.

What VPC Service allows you to build

This unlocks a whole new tranche of applications you can build on Cloudflare. For years, Workers have excelled at the edge, but they've largely been kept "outside" your core infrastructure. They could only call public endpoints, limiting their ability to interact with the most critical parts of your stack—like a private accounts API or an internal inventory database. Now, with VPC Services, Workers can securely access those private APIs, databases, and services, fundamentally changing what's possible.

This immediately enables true cross-cloud applications that span Cloudflare Workers and any other cloud like AWS, GCP or Azure. We’ve seen many customers adopt this pattern over the course of our private beta, establishing private connectivity between their external clouds and Cloudflare Workers. We’ve even done so ourselves, connecting our Workers to Kubernetes services in our core datacenters to power the control plane APIs for many of our services. Now, you can build the same powerful, distributed architectures, using Workers for global scale while keeping stateful backends in the network you already trust.

It also means you can connect to your on-premise networks from Workers, allowing you to modernize legacy applications with the performance and infinite scale of Workers. More interesting still are some emerging use cases for developer workflows. We’ve seen developers run cloudflared on their laptops to connect a deployed Worker back to their local machine for real-time debugging. The full flexibility of Cloudflare Tunnels is now a programmable primitive accessible directly from your Worker, opening up a world of possibilities.

The path ahead of us

VPC Services is the first milestone within the larger Workers VPC initiative, but we’re just getting started. Our goal is to make connecting to any service and any network, anywhere in the world, a seamless part of the Workers experience. Here’s what we’re working on next:

Deeper network integration. Starting with Cloudflare Tunnels was a deliberate choice. It's a highly available, flexible, and familiar solution, making it the perfect foundation to build upon. To provide more options for enterprise networking, we're going to be adding support for standard IPsec tunnels, Cloudflare Network Interconnect (CNI), and AWS Transit Gateway, giving you and your teams more choices and potential optimizations. Crucially, these connections will also become truly bidirectional, allowing your private services to initiate connections back to Cloudflare resources such as pushing events to Queues or fetching from R2.

Expanded protocol and service support. The next step beyond HTTP is enabling access to TCP services. This will first be achieved by integrating with Hyperdrive. We're evolving the previous Hyperdrive support for private databases to be simplified with VPC Services configuration, avoiding the need to add Cloudflare Access and manage security tokens. This creates a more native experience, complete with Hyperdrive's powerful connection pooling. Following this, we will add broader support for raw TCP connections, unlocking direct connectivity to services like Redis caches and message queues from Workers ‘connect()’.

Ecosystem compatibility. We want to make connecting to a private service feel as natural as connecting to a public one. To do so, we will be providing a unique autogenerated hostname for each Workers VPC Service, similar to Hyperdrive’s connection strings. This will make it easier to use Workers VPC with existing libraries and object–relational mapping libraries that may require a hostname (e.g., in a global ‘fetch()’ call or a MongoDB connection string). Workers VPC Service hostname will automatically resolve and route to the correct VPC Service, just as the ‘fetch()’ command does.

Get started with Workers VPC

We’re excited to release Workers VPC Services into open beta today. We’ve spent months building out and testing our first milestone for Workers to private network access. And we’ve refined it further based on feedback from both internal teams and customers during the closed beta.

Now, we’re looking forward to enabling everyone to build cross-cloud apps on Workers with Workers VPC, available for free during the open beta. With Workers VPC, you can bring your apps on private networks to region Earth, closer to your users and available to Workers across the globe.

Get started with Workers VPC Services for free now.

Building a better testing experience for Workflows, our durable execution engine for multi-step applications

Olga Silva — Tue, 04 Nov 2025 14:00:00 GMT

Cloudflare Workflows is our take on "Durable Execution." They provide a serverless engine, powered by the Cloudflare Developer Platform, for building long-running, multi-step applications that persist through failures. When Workflows became generally available earlier this year, they allowed developers to orchestrate complex processes that would be difficult or impossible to manage with traditional stateless functions. Workflows handle state, retries, and long waits, allowing you to focus on your business logic.

However, complex orchestrations require robust testing to be reliable. To date, testing Workflows was a black-box process. Although you could test if a Workflow instance reached completion through an await to its status, there was no visibility into the intermediate steps. This made debugging really difficult. Did the payment processing step succeed? Did the confirmation email step receive the correct data? You couldn't be sure without inspecting external systems or logs.

Why was this necessary?

As developers ourselves, we understand the need to ensure reliable code, and we heard your feedback loud and clear: the developer experience for testing Workflows needed to be better.

The black box nature of testing was one part of the problem. Beyond that, though, the limited testing offered came at a high cost. If you added a workflow to your project, even if you weren't testing the workflow directly, you were required to disable isolated storage because we couldn't guarantee isolation between tests. Isolated storage is a vitest-pool-workers feature to guarantee that each test runs in a clean, predictable environment, free from the side effects of other tests. Being forced to have it disabled meant that state could leak between tests, leading to flaky, unpredictable, and hard-to-debug failures.

This created a difficult choice for developers building complex applications. If your project used Workers, Durable Objects, and R2 alongside Workflows, you had to either abandon isolated testing for your entire project or skip testing. This friction resulted in a poor testing experience, which in turn discouraged the adoption of Workflows. Solving this wasn't just an improvement, it was a critical step in making Workflows part of any well-tested Cloudflare application.

Introducing isolated testing for Workflows

We're introducing a new set of APIs that enable comprehensive, granular, and isolated testing for your Workflows, all running locally and offline with vitest-pool-workers, our testing framework that supports running tests in the Workers runtime workerd. This enables fast, reliable, and cheap test runs that don't depend on a network connection.

They are available through the cloudflare:test module, with @cloudflare/vitest-pool-workers version 0.9.0 and above. The new test module provides two primary functions to introspect your Workflows:

introspectWorkflowInstance: useful for unit tests with known instance IDs
introspectWorkflow: useful for integration tests where IDs are typically generated dynamically.

Let's walk through a practical example.

A practical example: testing a blog moderation workflow

Imagine a simple Workflow for moderating a blog. When a user submits a comment, the Workflow requests a review from workers-ai. Based on the violation score returned, it then waits for a moderator to approve or deny the comment. If approved, it calls a step.do to publish the comment via an external API.

Testing this without our new APIs would be impossible. You'd have no direct way to simulate the step’s outcomes and simulate the moderator's approval. Now, you can mock everything.

Here’s the test code using introspectWorkflowInstance with a known instance ID:

import { env, introspectWorkflowInstance } from "cloudflare:test";

it("should mock a an ambiguous score, approve comment and complete", async () => {
   // CONFIG
   await using instance = await introspectWorkflowInstance(
       env.MODERATOR,
       "my-workflow-instance-id-123"
   );
   await instance.modify(async (m) => {
       await m.mockStepResult({ name: "AI content scan" }, { violationScore: 50 });
       await m.mockEvent({ 
           type: "moderation-approval", 
           payload: { action: "approved" },
       });
       await m.mockStepResult({ name: "publish comment" }, { status: "published" });
   });

   await env.MODERATOR.create({ id: "my-workflow-instance-id-123" });
   
   // ASSERTIONS
   expect(await instance.waitForStepResult({ name: "AI content scan" })).toEqual(
       { violationScore: 50 }
   );
   expect(
       await instance.waitForStepResult({ name: "publish comment" })
   ).toEqual({ status: "published" });

   await expect(instance.waitForStatus("complete")).resolves.not.toThrow();
});

This test mocks the outcomes of steps that require external API calls, such as the 'AI content scan', which calls Workers AI, and the 'publish comment' step, which calls an external blog API.

If the instance ID is not known, because you are either making a worker request that starts one/multiple Workflow instances with random generated ids, you can call introspectWorkflow(env.MY_WORKFLOW). Here’s the test code for that scenario, where only one Workflow instance is created:

it("workflow mock a non-violation score and be successful", async () => {
   // CONFIG
   await using introspector = await introspectWorkflow(env.MODERATOR);
   await introspector.modifyAll(async (m) => {
       await m.disableSleeps();
       await m.mockStepResult({ name: "AI content scan" }, { violationScore: 0 });
   });

   await SELF.fetch(`https://mock-worker.local/moderate`);

   const instances = introspector.get();
   expect(instances.length).toBe(1);

   // ASSERTIONS
   const instance = instances[0];
   expect(await instance.waitForStepResult({ name: "AI content scan"  })).toEqual({ violationScore: 0 });
   await expect(instance.waitForStatus("complete")).resolves.not.toThrow();
});

Notice how in both examples we’re calling the introspectors with await using - this is the Explicit Resource Management syntax from modern JavaScript. It is crucial here because when the introspector objects go out of scope at the end of the test, its disposal method is automatically called. This is how we ensure each test works with its own isolated storage.

The modify and modifyAll functions are the gateway to controlling instances. Inside its callback, you get access to a modifier object with methods to inject behavior such as mocking step outcomes, events and disabling sleeps.

You can find detailed documentation on the Workers Cloudflare Docs.

How we connected Vitest to the Workflows Engine

To understand the solution, you first need to understand the local architecture. When you run wrangler dev, your Workflows are powered by Miniflare, a simulator for testing Cloudflare Workers, and workerd. Each running workflow instance is backed by its own SQLite Durable Object, which we call the "Engine DO". This Engine DO is responsible for executing steps, persisting state, and managing the instance's lifecycle. It lives inside the local isolated Workers runtime.

Meanwhile, the Vitest test runner is a separate Node.js process living outside of workerd. This is why we have a Vitest custom pool that allows tests to run inside workerd called vitest-pool-workers. Vitest-pool-workers has a Runner Worker, which is a worker to run the tests with bindings to everything specified in the user wrangler.json file. This worker has access to the APIs under the “cloudflare:test” module. It communicates with Node.js through a special DO called Runner Object via WebSocket/RPC.

The first approach we considered was to use the test runner worker. In its current state, Runner worker has access to Workflow bindings from Workflows defined on the wrangler file. We considered also binding each Workflow's Engine DO namespace to this runner worker. This would give vitest-pool-workers direct access to the Engine DOs where it would be possible to directly call Engine methods.

While promising, this approach would have required undesirable changes to the core of Miniflare and vitest-pool-workers, making it too invasive for this single feature.

Firstly, we would have needed to add a new unsafe field to Miniflare's Durable Objects. Its sole purpose would be to specify the service name of our Engines, preventing Miniflare from applying its default user prefix which would otherwise prevent the Durable Objects from being found.

Secondly, vitest-pool-workers would have been forced to bind every Engine DO from the Workflows in the project to its runner, even those not being tested. This would introduce unwanted bindings into the test environment, requiring an additional cleanup to ensure they were not exposed to the user's tests env.

The breakthrough

The solution is a combination of privileged local-only APIs and Remote Procedure Calls (RPC).

First, we added a set of unsafe functions to the local implementation of the Workflows binding, functions that are not available in the production environment. They act as a controlled access point, accessible from the test environment, allowing the test runner to get a stub to a specific Engine DO by providing its instance ID.

Once the test runner has this stub, it uses RPC to call specific, trusted methods on the Engine DO via a special RpcTarget called WorkflowInstanceModifier. Any class that extends RpcTarget has its objects replaced by a stub. Calling a method on this stub, in turn, makes an RPC back to the original object.

This simpler approach is far less invasive because it's confined to the Workflows environment, which also ensures any future feature changes are safely isolated.

Introspecting Workflows with unknown IDs

When creating Workflows instances (either by create() or createBatch()) developers can provide a specific ID or have it automatically generated for them. This ID identifies the Workflow instance and is then used to create the associated Engine DO ID.

The logical starting point for implementation was introspectWorkflowInstance(binding, instanceID), as the instance ID is known in advance. This allows us to generate the Engine DO ID required to identify the engine associated with that Workflow instance.

But often, one part of your application (like an HTTP endpoint) will create a Workflow instance with a randomly generated ID. How can we introspect an instance when we don't know its ID until after it's created?

The answer was to use a powerful feature of JavaScript: Proxy objects.

When you use introspectWorkflow(binding), we wrap the Workflow binding in a Proxy. This proxy non-destructively intercepts all calls to the binding, specifically looking for .create() and .createBatch(). When your test triggers a workflow creation, the proxy inspects the call. It captures the instance ID — either one you provided or the random one generated — and immediately sets up the introspection on that ID, applying all the modifications you defined in the modifyAll call. The original creation call then proceeds as normal.

env[workflow] = new Proxy(env[workflow], {
  get(target, prop) {
    if (prop === "create") {
      return new Proxy(target.create, {
        async apply(_fn, _this, [opts = {}]) {

          // 1. Ensure an ID exists 
          const optsWithId = "id" in opts ? opts : { id: crypto.randomUUID(), ...opts };

          // 2. Apply test modifications before creation
          await introspectAndModifyInstance(optsWithId.id);

          // 3. Call the original 'create' method 
          return target.create(optsWithId);
        },
      });
    }

    // Same logic for createBatch()
  }
}

When the await using block from introspectWorkflow() finishes, or the dispose() method is called at the end of the test, the introspector is disposed of, and the proxy is removed, leaving the binding in its original state. It’s a low-impact approach that prioritizes developer experience and long-term maintainability.

Get started with testing Workflows

Ready to add tests to your Workflows? Here’s how to get started:

Update your dependencies: Make sure you are using @cloudflare/vitest-pool-workers version 0.9.0 or newer. Run the following command in your project: npm install @cloudflare/vitest-pool-workers@latest
Configure your test environment: If you're new to testing on Workers, follow our guide to write your first test.

Start writing tests: Import introspectWorkflowInstance or introspectWorkflow from cloudflare:test in your test files and use the patterns shown in this post to mock, control, and assert on your Workflow's behavior. Also check out the official API reference.

Keeping the Internet fast and secure: introducing Merkle Tree Certificates

Luke Valenta — Tue, 28 Oct 2025 13:00:00 GMT

The world is in a race to build its first quantum computer capable of solving practical problems not feasible on even the largest conventional supercomputers. While the quantum computing paradigm promises many benefits, it also threatens the security of the Internet by breaking much of the cryptography we have come to rely on.

To mitigate this threat, Cloudflare is helping to migrate the Internet to Post-Quantum (PQ) cryptography. Today, about 50% of traffic to Cloudflare's edge network is protected against the most urgent threat: an attacker who can intercept and store encrypted traffic today and then decrypt it in the future with the help of a quantum computer. This is referred to as the harvest now, decrypt later threat.

However, this is just one of the threats we need to address. A quantum computer can also be used to crack a server's TLS certificate, allowing an attacker to impersonate the server to unsuspecting clients. The good news is that we already have PQ algorithms we can use for quantum-safe authentication. The bad news is that adoption of these algorithms in TLS will require significant changes to one of the most complex and security-critical systems on the Internet: the Web Public-Key Infrastructure (WebPKI).

The central problem is the sheer size of these new algorithms: signatures for ML-DSA-44, one of the most performant PQ algorithms standardized by NIST, are 2,420 bytes long, compared to just 64 bytes for ECDSA-P256, the most popular non-PQ signature in use today; and its public keys are 1,312 bytes long, compared to just 64 bytes for ECDSA. That's a roughly 20-fold increase in size. Worse yet, the average TLS handshake includes a number of public keys and signatures, adding up to 10s of kilobytes of overhead per handshake. This is enough to have a noticeable impact on the performance of TLS.

That makes drop-in PQ certificates a tough sell to enable today: they don’t bring any security benefit before Q-day — the day a cryptographically relevant quantum computer arrives — but they do degrade performance. We could sit and wait until Q-day is a year away, but that’s playing with fire. Migrations always take longer than expected, and by waiting we risk the security and privacy of the Internet, which is dear to us.

It's clear that we must find a way to make post-quantum certificates cheap enough to deploy today by default for everyone — not just those that can afford it. In this post, we'll introduce you to the plan we’ve brought together with industry partners to the IETF to redesign the WebPKI in order to allow a smooth transition to PQ authentication with no performance impact (and perhaps a performance improvement!). We'll provide an overview of one concrete proposal, called Merkle Tree Certificates (MTCs), whose goal is to whittle down the number of public keys and signatures in the TLS handshake to the bare minimum required.

But talk is cheap. We know from experience that, as with any change to the Internet, it's crucial to test early and often. Today we're announcing our intent to deploy MTCs on an experimental basis in collaboration with Chrome Security. In this post, we'll describe the scope of this experiment, what we hope to learn from it, and how we'll make sure it's done safely.

The WebPKI today — an old system with many patches

Why does the TLS handshake have so many public keys and signatures?

Let's start with Cryptography 101. When your browser connects to a website, it asks the server to authenticate itself to make sure it's talking to the real server and not an impersonator. This is usually achieved with a cryptographic primitive known as a digital signature scheme (e.g., ECDSA or ML-DSA). In TLS, the server signs the messages exchanged between the client and server using its secret key, and the client verifies the signature using the server's public key. In this way, the server confirms to the client that they've had the same conversation, since only the server could have produced a valid signature.

If the client already knows the server's public key, then only 1 signature is required to authenticate the server. In practice, however, this is not really an option. The web today is made up of around a billion TLS servers, so it would be unrealistic to provision every client with the public key of every server. What's more, the set of public keys will change over time as new servers come online and existing ones rotate their keys, so we would need some way of pushing these changes to clients.

This scaling problem is at the heart of the design of all PKIs.

Trust is transitive

Instead of expecting the client to know the server's public key in advance, the server might just send its public key during the TLS handshake. But how does the client know that the public key actually belongs to the server? This is the job of a certificate.

A certificate binds a public key to the identity of the server — usually its DNS name, e.g., cloudflareresearch.com. The certificate is signed by a Certification Authority (CA) whose public key is known to the client. In addition to verifying the server's handshake signature, the client verifies the signature of this certificate. This establishes a chain of trust: by accepting the certificate, the client is trusting that the CA verified that the public key actually belongs to the server with that identity.

Clients are typically configured to trust many CAs and must be provisioned with a public key for each. Things are much easier however, since there are only 100s of CAs instead of billions. In addition, new certificates can be created without having to update clients.

These efficiencies come at a relatively low cost: for those counting at home, that's +1 signature and +1 public key, for a total of 2 signatures and 1 public key per TLS handshake.

That's not the end of the story, however. As the WebPKI has evolved, so have these chains of trust grown a bit longer. These days it's common for a chain to consist of two or more certificates rather than just one. This is because CAs sometimes need to rotate their keys, just as servers do. But before they can start using the new key, they must distribute the corresponding public key to clients. This takes time, since it requires billions of clients to update their trust stores. To bridge the gap, the CA will sometimes use the old key to issue a certificate for the new one and append this certificate to the end of the chain.

That's +1 signature and +1 public key, which brings us to 3 signatures and 2 public keys. And we still have a little ways to go.

Trust but verify

The main job of a CA is to verify that a server has control over the domain for which it’s requesting a certificate. This process has evolved over the years from a high-touch, CA-specific process to a standardized, mostly automated process used for issuing most certificates on the web. (Not all CAs fully support automation, however.) This evolution is marked by a number of security incidents in which a certificate was mis-issued to a party other than the server, allowing that party to impersonate the server to any client that trusts the CA.

Automation helps, but attacks are still possible, and mistakes are almost inevitable. Earlier this year, several certificates for Cloudflare's encrypted 1.1.1.1 resolver were issued without our involvement or authorization. This apparently occurred by accident, but it nonetheless put users of 1.1.1.1 at risk. (The mis-issued certificates have since been revoked.)

Ensuring mis-issuance is detectable is the job of the Certificate Transparency (CT) ecosystem. The basic idea is that each certificate issued by a CA gets added to a public log. Servers can audit these logs for certificates issued in their name. If ever a certificate is issued that they didn't request itself, the server operator can prove the issuance happened, and the PKI ecosystem can take action to prevent the certificate from being trusted by clients.

Major browsers, including Firefox and Chrome and its derivatives, require certificates to be logged before they can be trusted. For example, Chrome, Safari, and Firefox will only accept the server's certificate if it appears in at least two logs the browser is configured to trust. This policy is easy to state, but tricky to implement in practice:

Operating a CT log has historically been fairly expensive. Logs ingest billions of certificates over their lifetimes: when an incident happens, or even just under high load, it can take some time for a log to make a new entry available for auditors.
Clients can't really audit logs themselves, since this would expose their browsing history (i.e., the servers they wanted to connect to) to the log operators.

The solution to both problems is to include a signature from the CT log along with the certificate. The signature is produced immediately in response to a request to log a certificate, and attests to the log's intent to include the certificate in the log within 24 hours.

Per browser policy, certificate transparency adds +2 signatures to the TLS handshake, one for each log. This brings us to a total of 5 signatures and 2 public keys in a typical handshake on the public web.

The future WebPKI

The WebPKI is a living, breathing, and highly distributed system. We've had to patch it a number of times over the years to keep it going, but on balance it has served our needs quite well — until now.

Previously, whenever we needed to update something in the WebPKI, we would tack on another signature. This strategy has worked because conventional cryptography is so cheap. But 5 signatures and 2 public keys on average for each TLS handshake is simply too much to cope with for the larger PQ signatures that are coming.

The good news is that by moving what we already have around in clever ways, we can drastically reduce the number of signatures we need.

Crash course on Merkle Tree Certificates

Merkle Tree Certificates (MTCs) is a proposal for the next generation of the WebPKI that we are implementing and plan to deploy on an experimental basis. Its key features are as follows:

All the information a client needs to validate a Merkle Tree Certificate can be disseminated out-of-band. If the client is sufficiently up-to-date, then the TLS handshake needs just 1 signature, 1 public key, and 1 Merkle tree inclusion proof. This is quite small, even if we use post-quantum algorithms.
The MTC specification makes certificate transparency a first class feature of the PKI by having each CA run its own log of exactly the certificates they issue.

Let's poke our head under the hood a little. Below we have an MTC generated by one of our internal tests. This would be transmitted from the server to the client in the TLS handshake:

Looks like your average PEM encoded certificate. Let's decode it and look at the parameters:

$ openssl x509 -in merkle-tree-cert.pem -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 531 (0x213)
        Signature Algorithm: 1.3.6.1.4.1.44363.47.0
        Issuer: 1.3.6.1.4.1.44363.47.1=44363.48.3
        Validity
            Not Before: Oct 21 15:33:26 2025 GMT
            Not After : Oct 28 15:33:26 2025 GMT
        Subject: CN=cloudflareresearch.com
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    04:70:ed:e1:96:87:b4:22:ef:fb:dc:a9:cd:9c:5c:
                    ef:1e:9e:ab:1b:6d:d7:11:74:7b:76:c8:3c:a1:5f:
                    94:37:45:99:d8:80:e3:5c:24:4f:28:46:b5:bf:84:
                    60:d8:fc:eb:82:5a:c4:4e:33:90:c7:b3:36:51:0c:
                    92:6d:bf:88:27
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature
            X509v3 Extended Key Usage:
                TLS Web Server Authentication
            X509v3 Subject Alternative Name:
                DNS:cloudflareresearch.com, DNS:static-ct.cloudflareresearch.com
    Signature Algorithm: 1.3.6.1.4.1.44363.47.0
    Signature Value:
        00:00:00:00:00:00:02:00:00:00:00:00:00:00:02:58:00:e0:
        44:be:03:a5:bd:6a:b7:f2:9e:39:77:4c:16:4c:f8:06:e5:e1:
        55:c0:93:21:c6:79:83:3c:dd:5b:e6:57:89:c0:75:b3:4c:ec:
        75:8a:0b:53:a0:ca:1c:07:0c:1a:92:dd:c7:7c:a2:23:5d:83:
        0e:e4:23:43:38:af:43:20:a8:66:44:34:95:87:ea:2b:f0:0f:
        16:52:bb:ea:67:67:1e:89:36:4f:90:d4:05:55:89:46:f1:b7:
        b6:68:84:d3:57:31:ae:2b:c3:79:31:86:85:9d:24:ed:cf:25:
        a4:5c:fd:8f:f6:76:14:55:dd:67:2e:df:d6:8c:25:0d:52:48:
        c8:e3:fe:f9:7c:e6:a5:30:52:a5:b5:c7:3a:89:a5:c1:f6:4b:
        5b:95:ef:70:b8:91:fc:61:0f:6d:16:de:39:e9:a0:59:49:2b:
        34:71:7c:2a:16:da:c7:af:de:f7:01:94:10:c4:62:d1:f5:00:
        87:bd:e8:a2:f4:df:3b:35:79:27:0e:fc:cc:43:e7:60:5a:df:
        df:06:e8:d3:7e:eb:b3:bf:7b:25:43:0f:34:9a:26:c0:d3:6d:
        5d:0c:28:bc:87:58:58:15:00:00

While some of the parameters probably look familiar, others will look unusual. On the familiar side, the subject and public key are exactly what we might expect: the DNS name is cloudflareresearch.com and the public key is for a familiar signature algorithm, ECDSA-P256. This algorithm is not PQ, of course — in the future we would put ML-DSA-44 there instead.

On the unusual side, OpenSSL appears to not recognize the signature algorithm of the issuer and just prints the raw OID and bytes of the signature. There's a good reason for this: the MTC does not have a signature in it at all! So what exactly are we looking at?

The trick to leave out signatures is that a Merkle Tree Certification Authority (MTCA) produces its signatureless certificates in batches rather than individually. In place of a signature, the certificate has an inclusion proof of the certificate in a batch of certificates signed by the MTCA.

To understand how inclusion proofs work, let's think about a slightly simplified version of the MTC specification. To issue a batch, the MTCA arranges the unsigned certificates into a data structure called a Merkle tree that looks like this:

Each leaf of the tree corresponds to a certificate, and each inner node is equal to the hash of its children. To sign the batch, the MTCA uses its secret key to sign the head of the tree. The structure of the tree guarantees that each certificate in the batch was signed by the MTCA: if we tried to tweak the bits of any one of the certificates, the treehead would end up having a different value, which would cause the signature to fail.

An inclusion proof for a certificate consists of the hash of each sibling node along the path from the certificate to the treehead:

Given a validated treehead, this sequence of hashes is sufficient to prove inclusion of the certificate in the tree. This means that, in order to validate an MTC, the client also needs to obtain the signed treehead from the MTCA.

This is the key to MTC's efficiency:

Signed treeheads can be disseminated to clients out-of-band and validated offline. Each validated treehead can then be used to validate any certificate in the corresponding batch, eliminating the need to obtain a signature for each server certificate.
During the TLS handshake, the client tells the server which treeheads it has. If the server has a signatureless certificate covered by one of those treeheads, then it can use that certificate to authenticate itself. That's 1 signature,1 public key and 1 inclusion proof per handshake, both for the server being authenticated.

Now, that's the simplified version. MTC proper has some more bells and whistles. To start, it doesn’t create a separate Merkle tree for each batch, but it grows a single large tree, which is used for better transparency. As this tree grows, periodically (sub)tree heads are selected to be shipped to browsers, which we call landmarks. In the common case browsers will be able to fetch the most recent landmarks, and servers can wait for batch issuance, but we need a fallback: MTC also supports certificates that can be issued immediately and don’t require landmarks to be validated, but these are not as small. A server would provision both types of Merkle tree certificates, so that the common case is fast, and the exceptional case is slow, but at least it’ll work.

Experimental deployment

Ever since early designs for MTCs emerged, we’ve been eager to experiment with the idea. In line with the IETF principle of “running code”, it often takes implementing a protocol to work out kinks in the design. At the same time, we cannot risk the security of users. In this section, we describe our approach to experimenting with aspects of the Merkle Tree Certificates design without changing any trust relationships.

Let’s start with what we hope to learn. We have lots of questions whose answers can help to either validate the approach, or uncover pitfalls that require reshaping the protocol — in fact, an implementation of an early MTC draft by Maximilian Pohl and Mia Celeste did exactly this. We’d like to know:

What breaks? Protocol ossification (the tendency of implementation bugs to make it harder to change a protocol) is an ever-present issue with deploying protocol changes. For TLS in particular, despite having built-in flexibility, time after time we’ve found that if that flexibility is not regularly used, there will be buggy implementations and middleboxes that break when they see things they don’t recognize. TLS 1.3 deployment took years longer than we hoped for this very reason. And more recently, the rollout of PQ key exchange in TLS caused the Client Hello to be split over multiple TCP packets, something that many middleboxes weren't ready for.

What is the performance impact? In fact, we expect MTCs to reduce the size of the handshake, even compared to today's non-PQ certificates. They will also reduce CPU cost: ML-DSA signature verification is about as fast as ECDSA, and there will be far fewer signatures to verify. We therefore expect to see a reduction in latency. We would like to see if there is a measurable performance improvement.

What fraction of clients will stay up to date? Getting the performance benefit of MTCs requires the clients and servers to be roughly in sync with one another. We expect MTCs to have fairly short lifetimes, a week or so. This means that if the client's latest landmark is older than a week, the server would have to fallback to a larger certificate. Knowing how often this fallback happens will help us tune the parameters of the protocol to make fallbacks less likely.

In order to answer these questions, we are implementing MTC support in our TLS stack and in our certificate issuance infrastructure. For their part, Chrome is implementing MTC support in their own TLS stack and will stand up infrastructure to disseminate landmarks to their users.

As we've done in past experiments, we plan to enable MTCs for a subset of our free customers with enough traffic that we will be able to get useful measurements. Chrome will control the experimental rollout: they can ramp up slowly, measuring as they go and rolling back if and when bugs are found.

Which leaves us with one last question: who will run the Merkle Tree CA?

Bootstrapping trust from the existing WebPKI

Standing up a proper CA is no small task: it takes years to be trusted by major browsers. That’s why Cloudflare isn’t going to become a “real” CA for this experiment, and Chrome isn’t going to trust us directly.

Instead, to make progress on a reasonable timeframe, without sacrificing due diligence, we plan to "mock" the role of the MTCA. We will run an MTCA (on Workers based on our StaticCT logs), but for each MTC we issue, we also publish an existing certificate from a trusted CA that agrees with it. We call this the bootstrap certificate. When Chrome’s infrastructure pulls updates from our MTCA log, they will also pull these bootstrap certificates, and check whether they agree. Only if they do, they’ll proceed to push the corresponding landmarks to Chrome clients. In other words, Cloudflare is effectively just “re-encoding” an existing certificate (with domain validation performed by a trusted CA) as an MTC, and Chrome is using certificate transparency to keep us honest.

Conclusion

With almost 50% of our traffic already protected by post-quantum encryption, we’re halfway to a fully post-quantum secure Internet. The second part of our journey, post-quantum certificates, is the hardest yet though. A simple drop-in upgrade has a noticeable performance impact and no security benefit before Q-day. This means it’s a hard sell to enable today by default. But here we are playing with fire: migrations always take longer than expected. If we want to keep an ubiquitously private and secure Internet, we need a post-quantum solution that’s performant enough to be enabled by default today.

Merkle Tree Certificates (MTCs) solves this problem by reducing the number of signatures and public keys to the bare minimum while maintaining the WebPKI's essential properties. We plan to roll out MTCs to a fraction of free accounts by early next year. This does not affect any visitors that are not part of the Chrome experiment. For those that are, thanks to the bootstrap certificates, there is no impact on security.

We’re excited to keep the Internet fast and secure, and will report back soon on the results of this experiment: watch this space! MTC is evolving as we speak, if you want to get involved, please join the IETF PLANTS mailing list.

Announcing Workers automatic tracing, now in open beta

Nevi Shah — Tue, 28 Oct 2025 12:00:00 GMT

When your Worker slows down or starts throwing errors, finding the root cause shouldn't require hours of log analysis and trial-and-error debugging. You should have clear visibility into what's happening at every step of your application's request flow. This is feedback we’ve heard loud and clear from developers using Workers, and today we’re excited to announce an Open Beta for tracing on Cloudflare Workers! You can now:

Get automatic instrumentation for applications on the Workers platform: No manual setup, complex instrumentation, or code changes. It works out of the box.
Explore and investigate traces in the Cloudflare dashboard: Your traces are processed and available in the Workers Observability dashboard alongside your existing logs.
Export logs and traces to OpenTelemetry-compatible providers: Send OpenTelemetry traces (and correlated logs) to your observability provider of choice.

In 2024, we set out to build the best first-party observability of any cloud platform. We launched a new metrics dashboard to give better insights into how your Worker is performing, Workers Logs to automatically ingest and store logs for your Workers, a query builder to explore your data across any dimension and real-time logs to stream your logs in real time with advanced filtering capabilities. Starting today, you can get an even deeper understanding of your Workers applications by enabling automatic tracing!

What is Workers Tracing?

Workers traces capture and emit OpenTelemetry-compliant spans to show you detailed metadata and timing information on every operation your Worker performs. It helps you identify performance bottlenecks, resolve errors, and understand how your Worker interacts with other services on the Workers platform. You can now answer questions like:

Which calls are slowing down my application?
Which queries to my database take the longest?
What happened within a request that resulted in an error?

Tracing provides a visualization of each invocation's journey through various operations. Each operation is captured as a span, a timed segment that shows what happened and how long it took. Child spans nest within parent spans to show sub-operations and dependencies, creating a hierarchical view of your invocation’s execution flow. Each span can include contextual metadata or attributes that provide details for debugging and filtering events.

Full automatic instrumentation, no code changes

Previously, instrumenting your application typically required an understanding of the OpenTelemetry spec, multiple OTel libraries, and how they related to each other. Implementation was tedious and bloated your codebase with instrumentation code that obfuscated your application logic.

Setting up tracing typically meant spending hours integrating third-party SDKs, wrapping every database call and API request with instrumentation code, and debugging complex config files before you saw a single trace. This implementation overhead often makes observability an afterthought, leaving you without full visibility in production when issues arise.

What makes Workers Tracing truly magical is it’s completely automatic – no set up, no code changes, no wasted time. We took the approach of automatically instrumenting every I/O operation in your Workers, through a deep integration in workerd, our runtime, enabling us to capture the full extent of data flows through every invocation of your Workers.

You focus on your application logic. We take care of the instrumentation.

What you can trace today

The operations covered today are:

Binding calls: Interactions with various Worker bindings. KV reads and writes, R2 object storage operations, Durable Object invocations, and many more binding calls are automatically traced. This gives you complete visibility into how your Worker uses other services.
Fetch calls: All outbound HTTP requests are automatically instrumented, capturing timing, status codes, and request metadata. This enables you to quickly identify which external dependencies are affecting your application's performance.
Handler calls: Methods on a Worker that can receive and process external inputs, such as fetch handlers, scheduled handlers, and queue handlers. This gives you visibility into performance of how your Worker is being invoked.

Automatic attributes on every span

Our automated instrumentation captures each operation as a span. For example, a span generated by an R2 binding call (like a get or put operation) will automatically contain any available attributes, such as the operation type, the error if applicable, the object key, and duration. These detailed attributes provide the context you need to answer precise questions about your application without needing to manually log every detail.

We will continue to add more detailed attributes to spans and add the ability to trace an invocation across multiple Workers or external services. Our documentation contains a complete list of all instrumented spans and their attributes.

Investigate traces in the Workers dashboard

You can easily view traces directly within a specific Worker application in the Cloudflare dashboard, giving you immediate visibility into your application's performance. You’ll find a list of all trace events within your desired time frame and a trace visualization of each invocation including duration of each call and any available attributes. You can also query across all Workers on your account, letting you pinpoint issues occurring on multiple applications.

To get started viewing traces on your Workers application, you can set:

Export traces to OpenTelemetry compatible providers

However, we realize that some development teams need Workers data to live alongside other telemetry data in the tools they are already using. That’s why we’re also adding tracing exports, letting your team send, visualize and query data with your existing observability stack! Starting today, you can export traces directly to providers like Honeycomb, Grafana, Sentry or any other OpenTelemetry Protocol (OTLP) provider with an available endpoint.

Correlated logs and traces

We also support exporting OTLP-formatted logs that share the same trace ID, enabling third-party platforms to automatically correlate log entries with their corresponding traces. This lets you easily jump between spans and related log messages.

Set up your destination, enable exports, and go!

To start sending events to your destination of choice, first, configure your OTLP endpoint destination in the Cloudflare dashboard. For every destination you can specify a custom name and set custom headers to include API keys or app configuration.

Once you have your destination set up (e.g. honeycomb-tracing), set the following in your wrangler.jsonc and deploy:

Coming up for Workers observability

This is just the beginning of Workers providing the workflows and tools to get you the telemetry data you want, where you want it. We’re improving our support both for native tracing in the dashboard and for exporting other types of telemetry to 3rd parties. In the upcoming months we’ll be launching:

Support for more spans and attributes: We are adding more automatic traces for every part of the Workers platform. While our first goal is to give you visibility into the duration of every operation within your request, we also want to add detailed attributes. Your feedback on what’s missing will be extremely valuable here.
Trace context propagation: When building distributed applications, ensuring your traces connect across all of your services (even those outside of Cloudflare), automatically linking spans together to create complete, end-to-end visibility is critical. For example, a trace from Workers could be nested from a parent service or vice versa. When fully implemented, our automatic trace context propagation will follow W3C standards to ensure compatibility across your existing tools and services.
Support for custom spans and attributes: While automatic instrumentation gives you visibility into what’s happening within the Workers platform, we know you need visibility into your own application logic too. So, we’ll give you the ability to manually add your own spans as well.
Ability to export metrics: Today, metrics, logs and traces are available for you to monitor and view within the Workers dashboard. But the final missing piece is giving you the ability to export both infrastructure metrics (like request volume, error rates, and execution duration) and custom application metrics to your preferred observability provider.

What you can expect from tracing pricing

Today, at the start of beta, viewing traces in the Cloudflare dashboard and exporting traces to a 3rd party provider are both free. On January 15, 2026, tracing and log events will be charged the following pricing:

Viewing Workers traces in the Cloudflare dashboard

To view traces in the Cloudflare dashboard, you can do so on a Workers Free and Paid plan at the pricing shown below:

	Workers Free	Workers Paid
Included Volume	200K events per day	20M events per month
Additional Events	N/A	$0.60 per million logs
Retention	3 days	7 days

Exporting traces and logs

To export traces to a 3rd-party OTLP-compatible destination, you will need a Workers Paid subscription. Pricing is based on total span or log events with the following inclusions:

	Workers Free	Workers Paid
Events	Not available	10 million events per month
Additional events	$0.05 per million batched events

Enable tracing today

Ready to get started with tracing on your Workers application?

Check out our documentation: Learn how to get set up, read about current limitations and discover more about what’s coming up.
Join the chatter in our GitHub discussion: Your feedback will be extremely valuable in our beta period on our automatic instrumentation, tracing dashboard, and OpenTelemetry export flow. Head to our GitHub discussion to raise issues, put in feature requests and get in touch with us!

Unpacking Cloudflare Workers CPU Performance Benchmarks

Kenton Varda — Tue, 14 Oct 2025 20:00:25 GMT

On October 4, independent developer Theo Browne published a series of benchmarks designed to compare server-side JavaScript execution speed between Cloudflare Workers and Vercel, a competing compute platform built on AWS Lambda. The initial results showed Cloudflare Workers performing worse than Node.js on Vercel at a variety of CPU-intensive tasks, by a factor of as much as 3.5x.

We were surprised by the results. The benchmarks were designed to compare JavaScript execution speed in a CPU-intensive workload that never waits on external services. But, Cloudflare Workers and Node.js both use the same underlying JavaScript engine: V8, the open source engine from Google Chrome. Hence, one would expect the benchmarks to be executing essentially identical code in each environment. Physical CPUs can vary in performance, but modern server CPUs do not vary by anywhere near 3.5x.

On investigation, we discovered a wide range of small problems that contributed to the disparity, ranging from some bad tuning in our infrastructure, to differences between the JavaScript libraries used on each platform, to some issues with the test itself. We spent the week working on many of these problems, which means over the past week Workers got better and faster for all of our customers. We even fixed some problems that affect other compute providers but not us, such as an issue that made trigonometry functions much slower on Vercel. This post will dig into all the gory details.

It's important to note that the original benchmark was not representative of billable CPU usage on Cloudflare, nor did the issues involved impact most typical workloads. Most of the disparity was an artifact of the specific benchmark methodology. Read on to understand why.

With our fixes, the results now look much more like we'd expect:

There is still work to do, but we're happy to say that after these changes, Cloudflare now performs on par with Vercel in every benchmark case except the one based on Next.js. On that benchmark, the gap has closed considerably, and we expect to be able to eliminate it with further improvements detailed later in this post.

We are grateful to Theo for highlighting areas where we could make improvements, which will now benefit all our customers, and even many who aren't our customers.

Our benchmark methodology

We wanted to run Theo's test with no major design changes, in order to keep numbers comparable. Benchmark cases are nearly identical to Theo's original test but we made a couple changes in how we ran the test, in the hopes of making the results more accurate:

Theo ran the test client on a laptop connected by a Webpass internet connection in San Francisco, against Vercel instances running in its sfo1 region. In order to make our results easier to reproduce, we chose instead to run our test client directly in AWS's us-east-1 datacenter, invoking Vercel instances running in its iad1 region (which we understand to be in the same building). We felt this would minimize any impact from network latency. Because of this, Vercel's numbers are slightly better in our results than they were in Theo's.
We chose to use Vercel instances with 1 vCPU instead of 2. All of the benchmarks are single-threaded workloads, meaning they cannot take advantage of a second CPU anyway. Vercel's CTO, Malte Ubl, had stated publicly on X that using single-CPU instances would make no difference in this test, and indeed, we found this to be correct. Using 1 vCPU makes it easier to reason about pricing, since both Vercel and Cloudflare charge for CPU time ($0.128/hr for Vercel in iad1, and $0.072/hr for Cloudflare globally).
We made some changes to fix bugs in the test, for which we submitted a pull request. More on this below.

Cloudflare platform improvements

Theo's benchmarks covered a variety of frameworks, making it clear that no single JavaScript library could be at fault for the general problem. Clearly, we needed to look first at the Workers Runtime itself. And so we did, and we found two problems – not bugs, but tuning and heuristic choices which interacted poorly with the benchmarks as written.

Sharding and warm isolate routing: A problem of scheduling, not CPU speed

Over the last year we shipped smarter routing that sends traffic to warm isolates more often. That cuts cold starts for large apps, which matters for frameworks with heavy initialization requirements like Next.js. The original policy optimized for latency and throughput across billions of requests, but was less optimal for heavily CPU-bound workloads for the same reason that such workloads cause performance issues in other platforms like Node.js: When the CPU is busy computing an expensive operation for one request, other requests sent to the same isolate must wait for it to finish before they can proceed.

The system uses heuristics to detect when requests are getting blocked behind each other, and automatically spin up more isolates to compensate. However, these heuristics are not precise, and the particular workload generated by Theo's tests – in which a burst of expensive traffic would come from a single client – played poorly with our existing algorithm. As a result, the benchmarks showed much higher latency (and variability in latency) than would normally be expected.

It's important to understand that, as a result of this problem, the benchmark was not really measuring CPU time. Pricing on the Workers platform is based on CPU time – that is, time spent actually executing JavaScript code, as opposed to time waiting for things. Time spent waiting for the isolate to become available makes the request take longer, but is not billed as CPU time against the waiting request. So, this problem would not have affected your bill.

After analyzing the benchmarks, we updated the algorithm to detect sustained CPU-heavy work earlier, then bias traffic so that new isolates spin up faster. The result is that Workers can more effectively and efficiently autoscale when different workloads are applied. I/O-bound workloads coalesce into individual already warm isolates while CPU-bound are directed so that they do not block each other. This change has already been rolled out globally and is enabled automatically for everyone. It should be pretty clear from the graph when the change was rolled out:

V8 garbage collector tuning

While this scheduling issue accounted for the majority of the disparity in the benchmark, we did find a minor issue affecting code execution performance during our testing.

The range of issues that we uncovered in the framework code in these benchmarks repeatedly pointed at garbage collection and memory management issues as being key contributors to the results. But, we would expect these to be an issue with the same frameworks running in Node.js as well. To see exactly what was going on differently with Workers and why it was causing such a significant degradation in performance, we had to look inwards at our own memory management configuration.

The V8 garbage collector has a huge number of knobs that can be tuned that directly impact performance. One of these is the size of the "young generation". This is where newly created objects go initially. It's a memory area that's less compact, but optimized for short-lived objects. When objects have bounced around the "young space" for a few generations they get moved to the old space, which is more compact, but requires more CPU to reclaim.

V8 allows the embedding runtime to tune the size of the young generation. And it turns out, we had done so. Way back in June of 2017, just two months after the Workers project kicked off, we – or specifically, I, Kenton, as I was the only engineer on the project at the time – had configured this value according to V8's recommendations at the time for environments with 512MB of memory or less. Since Workers defaults to a limit of 128MB per isolate, this seemed appropriate.

V8's entire garbage collector has changed dramatically since 2017. When analyzing the benchmarks, it became apparent that the setting which made sense in 2017 no longer made sense in 2025, and we were now limiting V8's young space too rigidly. Our configuration was causing V8's garbage collection to work harder and more frequently than it otherwise needed to. As a result, we have backed off on the manual tuning and now allow V8 to pick its young space size more freely, based on its internal heuristics. This is already live on Cloudflare Workers, and it has given an approximately 25% boost to the benchmarks with only a small increase in memory usage. Of course, the benchmarks are not the only Workers that benefit: all Workers should now be faster. That said, for most Workers the difference has been much smaller.

Tuning OpenNext for performance

The platform changes solved most of the problem. Following the changes, our testing showed we were now even on all of the benchmarks save one: Next.js.

Next.js is a popular web application framework which, historically, has not had built-in support for hosting on a wide range of platforms. Recently, a project called OpenNext has arisen to fill the gap, making Next.js work well on many platforms, including Cloudflare. On investigation, we found several missing optimizations and other opportunities to improve performance, explaining much of why the benchmark performed poorly on Workers.

Unnecessary allocations and copies

When profiling the benchmark code, we noticed that garbage collection was dominating the timeline. From 10-25% of the request processing time was being spent reclaiming memory.

So we dug in and discovered that OpenNext, and in some cases Next.js and React itself, will often create unnecessary copies of internal data buffers at some of the worst times during the handling of the process. For instance, there's one pipeThrough() operation in the rendering pipeline that we saw creating no less than 50 2048-byte Buffer instances, whether they are actually used or not.

We further discovered that on every request, the Cloudflare OpenNext adapter has been needlessly copying every chunk of streamed output data as it’s passed out of the renderer and into the Workers runtime to return to users. Given this benchmark returns a 5 MB result on every request, that's a lot of data being copied!

In other places, we found that arrays of internal Buffer instances were being copied and concatenated using Buffer.concat for no other reason than to get the total number of bytes in the collection. That is, we spotted code of the form getBody().length. The function getBody() would concatenate a large number of buffers into a single buffer and return it, without storing the buffer anywhere. So, all that work was being done just to read the overall length. Obviously this was not intended, and fixing it was an easy win.

We've started opening a series of pull requests in OpenNext to fix these issues, and others in hot paths, removing some unnecessary allocations and copies:

We're not done. We intend to keep iterating through OpenNext code, making improvements wherever they’re needed – not only in the parts that run on Workers. Many of these improvements apply to other OpenNext platforms. The shared goal of OpenNext is to make NextJS as fast as possible regardless of where you choose to run your code.

Inefficient Streams Adapters

Much of the Next.js code was written to use Node.js's APIs for byte streams. Workers, however, prefers the web-standard Streams API, and uses it to represent HTTP request and response bodies. This necessitates using adapters to convert between the two APIs. When investigating the performance bottlenecks, we found a number of examples where inefficient streams adapters are being needlessly applied. For example:

const stream = Readable.toWeb(Readable.from(res.getBody()))

res.getBody() was performing a Buffer.concat(chunks) to copy accumulated chunks of data into a new Buffer, which was then passed as an iterable into a Node.js stream.Readable that was then wrapped by an adapter that returns a ReadableStream. While these utilities do serve a useful purpose, this becomes a data buffering nightmare since both Node.js streams and Web streams each apply their own internal buffers! Instead we can simply do:

const stream = ReadableStream.from(chunks);

This returns a ReadableStream directly from the accumulated chunks without additional copies, extraneous buffering, or passing everything through inefficient adaptation layers.

In other places we see that Next.js and React make extensive use of ReadableStream to pass bytes through, but the streams being created are value-oriented rather than byte-oriented! For example,

const readable = new ReadableStream({
  pull(controller) {
    controller.enqueue(chunks.shift());
    if (chunks.length === 0) {
      controller.close();
    }
});  // Default highWaterMark is 1!

Seems perfectly reasonable. However, there's an issue here. If the chunks are Buffer or Uint8Array instances, every instance ends up being a separate read by default. So if the chunk is only a single byte, or 1000 bytes, that's still always two reads. By converting this to a byte stream with a reasonable high water mark, we can make it possible to read this stream much more efficiently:

const readable = new ReadableStream({
  type: 'bytes',
  pull(controller) {
    controller.enqueue(chunks.shift());
    if (chunks.length === 0) {
      controller.close();
    }
}, { highWaterMark: 4096 });

Now, the stream can be read as a stream of bytes rather than a stream of distinct JavaScript values, and the individual chunks can be coalesced internally into 4096 byte chunks, making it possible to optimize the reads much more efficiently. Rather than reading each individual enqueued chunk one at a time, the ReadableStream will proactively call pull() repeatedly until the highWaterMark is reached. Reads then do not have to ask the stream for one chunk of data at a time.

While it would be best for the rendering pipeline to be using byte streams and paying attention to back pressure signals more, our implementation can still be tuned to better handle cases like this.

The bottom line? We've got some work to do! There are a number of improvements to make in the implementation of OpenNext and the adapters that allow it to work on Cloudflare that we will continue to investigate and iterate on. We've made a handful of these fixes already and we're already seeing improvements. Soon we also plan to start submitting patches to Next.js and React to make further improvements upstream that will ideally benefit the entire ecosystem.

JSON parsing

Aside from buffer allocations and streams, one additional item stood out like a sore thumb in the profiles: JSON.parse() with a reviver function. This is used in both React and Next.js and in our profiling this was significantly slower than it should be. We built a microbenchmark and found that JSON.parse with a reviver argument recently got even slower when the standard added a third argument to the reviver callback to provide access to the JSON source context.

For those unfamiliar with the reviver function, it allows an application to effectively customize how JSON is parsed. But it has drawbacks. The function gets called on every key-value pair included in the JSON structure, including every individual element of an Array that gets serialized. In Theo's NextJS benchmark, in any single request, it ends up being called well over 100,000 times!

Even though this problem affects all platforms, not just ours, we decided that we weren't just going to accept it. After all, we have contributors to V8 on the Workers runtime team! We've upstreamed a V8 patch that can speed up JSON.parse() with revivers by roughly 33 percent. That should be in V8 starting with version 14.3 (Chrome 143) and can help everyone using V8, not just Cloudflare: Node.js, Chrome, Deno, the entire ecosystem. If you are not using Cloudflare Workers or didn't change the syntax of your reviver you are currently suffering under the red performance bar.

We will continue to work with framework authors to reduce overhead in hot paths. Some changes belong in the frameworks, some belong in the engine, some in our platform.

Node.js's trigonometry problem

We are engineers, and we like to solve engineering problems — whether our own, or for the broader community.

Theo's benchmarks were actually posted in response to a different benchmark by another author which compared Cloudflare Workers against Vercel. The original benchmark focused on calling trigonometry functions (e.g. sine and cosine) in a tight loop. In this benchmark, Cloudflare Workers performed 3x faster than Node.js running on Vercel.

The author of the original benchmark offered this as evidence that Cloudflare Workers are just faster. Theo disagreed, and so did we. We expect to be faster, but not by 3x! We don't implement math functions ourselves; these come with V8. We weren't happy to just accept the win, so we dug in.

It turns out that Node.js is not using the latest, fastest path for these functions. Node.js can be built with either the clang or gcc compilers, and is written to support a broader range of operating systems and architectures than Workers. This means that Node.js' compilation often ends up using a lowest-common denominator for some things in order to provide support for the broadest range of platforms. V8 includes a compile-time flag that, in some configurations, allows it to use a faster implementation of the trig functions. In Workers, mostly by coincidence, that flag is enabled by default. In Node.js, it is not. We've opened a pull request to enable the flag in Node.js so that everyone benefits, at least on platforms where it can be supported.

Assuming that lands, and once AWS Lambda and Vercel are able to pick it up, we expect this specific gap to go away, making these operations faster for everyone. This change won't benefit our customers, since Cloudflare Workers already uses the faster trig functions, but a bug is a bug and we like making everything faster.

Benchmarks are hard

Even the best benchmarks have bias and tradeoffs. It's difficult to create a benchmark that is truly representative of real-world performance, and all too easy to misinterpret the results of benchmarks that are not. We particularly liked Planetscale's take on this subject.

These specific CPU-bound tests are not an ideal choice to represent web applications. Theo even notes this in his video. Most real-world applications on Workers and Vercel are bound by databases, downstream services, network, and page size. End user experience is what matters. CPU is one piece of that picture. That said, if a benchmark shows us slower, we take it seriously.

While the benchmarks helped us find and fix many real problems, we also found a few problems with the benchmarks themselves, which contributed to the apparent disparity in speed:

Running locally

The benchmark is designed to be run on your laptop, from which it hits Cloudflare's and Vercel's servers over the Internet. It makes the assumption that latency observed from the client is a close enough approximation of server-side CPU time. The reasons are fair: As Theo notes, Cloudflare does not permit an application to measure its own CPU time, in order to prevent timing side channel attacks. Actual CPU time can be seen in logs after the fact, but gathering those may be a lot of work. It's just easier to measure time from the client.

However, as Cloudflare and Vercel are hosted from different data centers, the network latency to each can be a factor in the benchmark, and this can skew the results. Typically, this effect will favor Cloudflare, because Cloudflare can run your Worker in locations spread across 330+ cities worldwide, and will tend to choose the closest one to you. Vercel, on the other hand, usually places compute in a central location, so latency will vary depending on your distance from that location.

For our own testing, to minimize this effect, we ran the benchmark client from a VM on AWS located in the same data center as our Vercel instances. Since Cloudflare is well-connected to every AWS location, we think this should have eliminated network latency from the picture. We chose AWS's us-east-1 / Vercel's iad1 for our test as it is widely seen as the default choice; any other choice could draw questions about cherry-picking.

Not all CPUs are equal

Cloudflare's servers aren't all identical. Although we refresh them aggressively, there will always be multiple generations of hardware in production at any particular time. Currently, this includes generations 10, 11, and 12 of our server hardware.

Other cloud providers are no different. No cloud provider simply throws away all their old servers every time a new version becomes available.

Of course, newer CPUs run faster, even for single-threaded workloads. The differences are not as large as they used to be 20-30 years ago, but they are not nothing. As such, an application may get (a little bit) lucky or unlucky depending on what machine it is assigned to.

In cloud environments, even identical CPUs can yield different performance depending on circumstances, due to multitenancy. The server your application is assigned to is running many others as well. In AWS Lambda, a server may be running hundreds of applications; in Cloudflare, with our ultra-efficient runtime, a server may be running thousands. These "noisy neighbors" won't share the same CPU core as your app, but they may share other resources, such as memory bandwidth. As a result, performance can vary.

It's important to note that these problems create correlated noise. That is, if you run the test again, the application is likely to remain assigned to the same machines as before – this is true of both Cloudflare and Vercel. So, this noise cannot be corrected by simply running more iterations. To correct for this type of noise on Cloudflare, one would need to initiate requests from a variety of geographic locations, in order to hit different Cloudflare data centers and therefore different machines. But, that is admittedly a lot of work. (We are not familiar with how best to get an application to switch machines on Vercel.)

A Next.js config bug

The Cloudflare version of the NextJS benchmark was not configured to use force-dynamic while the Vercel version was. This triggered curious behavior. Our understanding is that pages which are not "dynamic" should normally be rendered statically at build time. With OpenNext, however, it appears the pages are still rendered dynamically, but if multiple requests for the same page are received at the same time, OpenNext will only invoke the rendering once. Before we made the changes to fix our scheduling algorithm to avoid sending too many requests to the same isolate, this behavior may have somewhat counteracted that problem. Theo reports that he had disabled force-dynamic in the Cloudflare version specifically for this reason: with it on, our results were so bad as to appear outright broken, so he intentionally turned it off.

Ironically, though, once we fixed the scheduling issue, using "static" rendering (i.e. not enabling force-dynamic) hurt Cloudflare's performance for other reasons. It seems that when OpenNext renders a "cacheable" page, streaming of the response body is inhibited. This interacted poorly with a property of the benchmark client: it measured time-to-first-byte (TTFB), rather than total request/response time. When running in dynamic mode – as the test did on Vercel – the first byte would be returned to the client before the full page had been rendered. The rest of the rendering would happen as bytes streamed out. But with OpenNext in non-dynamic mode, the entire payload was rendered into a giant buffer upfront, before any bytes were returned to the client.

Due to the TTFB behavior of the benchmark client, in dynamic mode, the benchmark actually does not measure the time needed to fully render the page. We became suspicious when we noticed that Vercel's observability tools indicated more CPU time had been spent than the benchmark itself had reported.

One option would have been to change the benchmarks to use TTLB instead – that is, wait until the last byte is received before stopping the timer. However, this would make the benchmark even more affected by network differences: The responses are quite large, ranging from 2MB to 15MB, and so the results could vary depending on the bandwidth to the provider. Indeed, this would tend to favor Cloudflare, but as the point of the test is to measure CPU speed, not bandwidth, it would be an unfair advantage.

Once we changed the Cloudflare version of the test to use force-dynamic as well, matching the Vercel version, the streaming behavior then matched, making the request fair. This means that neither version is actually measuring the cost of rendering the full page to HTML, but at least they are now measuring the same thing.

As a side note, the original behavior allowed us to spot that OpenNext has a couple of performance bottlenecks in its implementation of the composable cache it uses to deduplicate rendering requests. While fixes to these aren't going to impact the numbers for this particular set of benchmarks, we're working on improving those pieces also.

A React SSR config bug

The React SSR benchmark contained a more basic configuration error. React inspects the environment variable NODE_ENV to decide whether the environment is "production" or a development environment. Many Node.js-based environments, including Vercel, set this variable automatically in production. Many frameworks, such as OpenNext, automatically set this variable for Workers in production as well. However, the React SSR benchmark was written against lower-level React APIs, not using any framework. In this case, the NODE_ENV variable wasn't being set at all.

And, unfortunately, when NODE_ENV is not set, React defaults to "dev mode", a mode that contains extra debugging checks and is therefore much slower than production mode. As a result, the numbers for Workers were much worse than they should have been.

Arguably, it may make sense for Workers to set this variable automatically for all deployed workers, particularly when Node.js compatibility is enabled. We are looking into doing this in the future, but for now we've updated the test to set it directly.

What we’re going to do next

Our improvements to the Workers Runtime are already live for all workers, so you do not need to change anything. Many apps will already see faster, steadier tail latency on compute heavy routes with less jitter during bursts. In places where garbage collection improved, some workloads will also use fewer billed CPU seconds.

We also sent Theo a pull request to update OpenNext with our improvements there, and with other test fixes.

But we're far from done. We still have work to do to close the gap between OpenNext and Next.js on Vercel – but given the other benchmark results, it's clear we can get there. We also have plans for further improvements to our scheduling algorithm, so that requests almost never block each other. We will continue to improve V8, and even Node.js – the Workers team employs multiple core contributors to each project. Our approach is simple: improve open source infrastructure so that everyone gets faster, then make sure our platform makes the most of those improvements.

And, obviously, we'll be writing more benchmarks, to make sure we're catching these kinds of issues ourselves in the future. If you have a benchmark that shows Workers being slower, send it to us with a repro. We will profile it, fix what we can upstream, and share back what we learn!