
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 10:18:54 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Improving your monitoring setup by integrating Cloudflare’s analytics data into Prometheus and Grafana]]></title>
            <link>https://blog.cloudflare.com/improving-your-monitoring-setup-by-integrating-cloudflares-analytics-data-into-prometheus-and-grafana/</link>
            <pubDate>Thu, 20 May 2021 13:00:15 GMT</pubDate>
<description><![CDATA[ Here at Labyrinth Labs, we put great emphasis on monitoring. Having a working monitoring setup is a critical part of the work we do for our clients. ]]></description>
            <content:encoded><![CDATA[ <p><i>The following is a guest post by Martin Hauskrecht, DevOps Engineer at Labyrinth Labs.</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1No9XvnguRN5bpcRgbSZBe/5e776b72c64573b51921171da88be6c1/image1-7.png" />
            
            </figure><p>Here at Labyrinth Labs, we put great emphasis on monitoring. Having a working monitoring setup is a critical part of the work we do for our clients.</p><p>Cloudflare's Analytics dashboard provides a lot of useful information for debugging and analytics purposes for our customer Pixel Federation. However, it doesn’t automatically integrate with existing monitoring tools such as Grafana and Prometheus, which our DevOps engineers use every day to monitor our infrastructure.</p><p>Cloudflare provides a Logs API, but the amount of logs we’d need to analyze is so vast that analyzing them would simply be inefficient and too pricey. Luckily, Cloudflare already does the hard work of aggregating our thousands of events per second and exposes them in an <a href="https://developers.cloudflare.com/analytics/graphql-api/">easy-to-use API</a>.</p><p>Having Cloudflare’s data from our zones integrated with other systems’ metrics would give us a better understanding of our systems and the ability to correlate metrics and create more useful alerts, making our Day-2 operations (e.g. debugging incidents or analyzing the usage of our systems) more efficient.</p><p>Since our monitoring stack is primarily based on Prometheus and Grafana, we decided to implement our own Prometheus exporter that pulls data from Cloudflare’s GraphQL Analytics API.</p>
    <div>
      <h3>Design</h3>
      <a href="#design">
        
      </a>
    </div>
    <p>Based on current cloud trends and our intention to use the exporter in Kubernetes, writing the code in Go was the obvious choice. Cloudflare provides an <a href="https://github.com/cloudflare/cloudflare-go">API SDK for Golang</a>, so common API tasks were easy to get started with.</p><p>We take advantage of Cloudflare’s GraphQL API to obtain analytics data about each of our zones and transform it into Prometheus metrics that are then exposed on a metrics endpoint.</p><p>We are able to obtain data about the total number and rate of requests, bandwidth, cache utilization, threats, SSL usage, and HTTP response codes. In addition, we are also able to monitor what type of content is being transmitted and what countries and locations the requests originate from.</p><p>All of this information is provided through the <i>httpRequests1mGroups</i> node in Cloudflare’s GraphQL API. If you want to see what datasets are available, you can find a brief description at <a href="https://developers.cloudflare.com/analytics/graphql-api/features/data-sets">https://developers.cloudflare.com/analytics/graphql-api/features/data-sets</a>.</p><p>On top of all of this, we can also obtain data for Cloudflare’s data centers. Our graphs can easily show the distribution of traffic among them, further helping in our evaluations. The data is obtained from the <code><i>httpRequestsAdaptiveGroups</i></code> node in GraphQL.</p><p>After running the queries against the GraphQL API, we simply format the results to follow the Prometheus metrics format and expose them on the /metrics endpoint. To make things faster, we use Goroutines and make the requests in parallel.</p>
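    <p>The transformation itself is straightforward. As a rough sketch of the idea (in Python for brevity; the exporter itself is written in Go, and the field and metric names below are illustrative, not the exporter's actual ones), one aggregated analytics record can be rendered into Prometheus exposition format like this:</p>
            <pre><code># Illustrative sketch: render one aggregated analytics record (a dict shaped
# roughly like a GraphQL httpRequests1mGroups result) as Prometheus metric
# lines. Metric and field names here are assumptions for the example.

def to_prometheus(zone, group):
    """Render one aggregated analytics record as Prometheus exposition lines."""
    lines = []
    lines.append(f'cloudflare_zone_requests_total{{zone="{zone}"}} {group["requests"]}')
    lines.append(f'cloudflare_zone_bandwidth_total{{zone="{zone}"}} {group["bytes"]}')
    for code, count in group["responseStatusMap"].items():
        lines.append(
            f'cloudflare_zone_requests_status{{zone="{zone}",status="{code}"}} {count}'
        )
    return "\n".join(lines)

sample = {
    "requests": 120345,
    "bytes": 987654321,
    "responseStatusMap": {"200": 118000, "404": 1800, "503": 545},
}
print(to_prometheus("example.com", sample))</code></pre>
            <p>Prometheus then scrapes lines in this format from the /metrics endpoint like any other target.</p>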
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1d6eIT9Tv7F4p5iLP4bYQR/1fc81995d3304539456ace57170af022/image4-9.png" />
            
            </figure>
    <div>
      <h3>Deployment</h3>
      <a href="#deployment">
        
      </a>
    </div>
    <p>Our primary intention was to use the exporter in Kubernetes. Therefore, it comes with a <a href="https://hub.docker.com/repository/docker/lablabs/cloudflare_exporter">Docker image</a> and <a href="https://github.com/lablabs/cloudflare-exporter/tree/master/charts/cloudflare-exporter">Helm chart</a> to make deployments easier. You might need to adjust the Service annotations to match your Prometheus configuration.</p><p>The exporter itself exposes the gathered metrics on the /metrics endpoint. Therefore setting the Prometheus annotations either on the pod or a Kubernetes service will do the job.</p>
            <pre><code>apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/scrape: "true"</code></pre>
            <p>We plan on adding a Prometheus ServiceMonitor to the Helm chart to make scraping the exporter even easier for those who use the Prometheus operator in Kubernetes.</p><p>The configuration is quite easy: you just provide your API email and key. Optionally, you can limit the scraping to selected zones only. Refer to our docs in the <a href="https://github.com/lablabs/cloudflare-exporter">GitHub repo</a> or see the example below.</p>
            <pre><code> env:
   - name: CF_API_EMAIL
     value: &lt;YOUR_API_EMAIL&gt;
   - name: CF_API_KEY
     value: &lt;YOUR_API_KEY&gt;

  # Optionally, you can filter zones by adding IDs following the example below.
  # - name: ZONE_XYZ
  #   value: &lt;zone_id&gt;</code></pre>
            <p>To deploy the exporter with Helm you simply need to run:</p>
            <pre><code>helm repo add lablabs-cloudflare-exporter https://lablabs.github.io/cloudflare-exporter
helm repo update

helm install cloudflare-exporter lablabs-cloudflare-exporter/cloudflare-exporter \
--set env[0].CF_API_EMAIL=&lt;API_EMAIL&gt; \
--set env[1].CF_API_KEY=&lt;API_KEY&gt;</code></pre>
            <p>We also provide a <a href="https://github.com/lablabs/cloudflare-exporter/blob/master/examples/helmfile/cloudflare-exporter.yaml">Helmfile</a> in our repo to make deployments easier; you just need to add your credentials.</p>
    <div>
      <h3>Visualizing the data</h3>
      <a href="#visualizing-the-data">
        
      </a>
    </div>
    <p>I’ve already explained how the exporter works and how you can get it running. As I mentioned before, we use Grafana to visualize our metrics from Prometheus. We’ve created a <a href="https://grafana.com/grafana/dashboards/13133">dashboard</a> that takes the data from Prometheus and puts it into use.</p><p>The dashboard is divided into several rows, which group individual panels for easier navigation. It allows you to target individual zones for metrics visualization.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4zGurfsIFldJAl7JWn1Ugm/f7e1173d0d3ccadb91de95d49902b6fc/image2-5.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Wui3JqzgAU1GAU5J7XXj3/b5fa5aab63702997b97d7be3cc5ed7f0/image3-6.png" />
            
            </figure><p>To make things even more beneficial for the operations team, you can use the gathered metrics to create alerts. These can be created either in Grafana directly or using Prometheus alert rules.</p><p>Furthermore, if you integrate <a href="https://github.com/thanos-io/thanos">Thanos</a> or <a href="https://grafana.com/oss/cortex/">Cortex</a> into your monitoring setup, you can store these metrics indefinitely.</p>
    <div>
      <h3>Future work</h3>
      <a href="#future-work">
        
      </a>
    </div>
    <p>We’d like to integrate even more analytics data into our exporters, eventually reaching every metric that Cloudflare’s GraphQL can provide. We plan on creating new metrics for firewall analytics, DoS analytics, and Network analytics soon.</p><p>Feel free to create a GitHub issue if you have any questions, problems, or suggestions. Any pull request is greatly appreciated.</p>
    <div>
      <h3>About us</h3>
      <a href="#about-us">
        
      </a>
    </div>
    <p><a href="https://lablabs.io/">Labyrinth Labs</a> helps companies build, run, deploy and scale software and infrastructure by embracing the right technologies and principles.</p> ]]></content:encoded>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Customers]]></category>
            <category><![CDATA[Prometheus]]></category>
            <category><![CDATA[Grafana]]></category>
            <category><![CDATA[Monitoring]]></category>
            <guid isPermaLink="false">4IQoisV5GHmCHWVLWy6LNK</guid>
            <dc:creator>Martin Hauskrecht</dc:creator>
        </item>
        <item>
            <title><![CDATA[Want to see your DNS analytics? We have a Grafana plugin for that]]></title>
            <link>https://blog.cloudflare.com/grafana-plugin/</link>
            <pubDate>Tue, 14 Feb 2017 18:04:08 GMT</pubDate>
            <description><![CDATA[ Curious where your DNS traffic is coming from, how much DNS traffic is on your domain, and what records people are querying for that don’t exist? We now have a Grafana plugin for you. ]]></description>
            <content:encoded><![CDATA[ <p>Curious where your DNS traffic is coming from, how much DNS traffic is on your domain, and what records people are querying for that don’t exist? We now have a Grafana plugin for you.</p><p>Grafana is an open source data visualization tool that you can use to integrate data from many sources into one cohesive dashboard, and <a href="http://docs.grafana.org/alerting/rules/">even use it to set up alerts</a>. We’re big Grafana fans here - we use Grafana internally for our ops metrics dashboards.</p><p>In the Cloudflare Grafana plugin, you can see the response code breakdown of your DNS traffic. During a random prefix flood, a common type of DNS DDoS attack where an attacker queries random subdomains to bypass DNS caches and overwhelm the origin nameservers, you will see the number of NXDOMAIN responses increase dramatically. It is also common during normal traffic to have a small amount of negative answers due to typos or clients searching for missing records.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2lx4BwSBiV5nQUNAbNolzl/4297d330f64822e001a478a9555d7f78/rcode.png" />
            
            </figure><p>You can also see the breakdown of queries by data center and by query type to understand where your traffic is coming from and what your domains are being queried for. This is very useful to identify localized issues, and to see how your traffic is spread globally.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5pOm5hXg0QRuSySo23lL3v/c917f8cd3db5f557eebfa75b3c4274c1/colo-qtype.png" />
            
            </figure><p>You can filter by specific data centers, record types, query types, response codes, and query name, so you can filter down to see analytics for just the MX records that are returning errors in one of the data centers, or understand whether the negative answers are generated because of a DNS attack, or misconfigured records.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3h1OmevfS9qBnIfHnTil6L/18f4fd1b286addec94829765eb60c9ab/Screen-Shot-2017-02-13-at-10.59.45-AM.png" />
            
            </figure><p>Once you have the Cloudflare Grafana Plugin installed, you can also make your own charts using the Cloudflare data set in Grafana, and integrate them into your existing dashboards.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/738PJEVjnkrzf5qnIUSL0S/7cca7ada36e01cb425efeb85a710e979/Screen-Shot-2017-02-13-at-11.08.11-AM.png" />
            
            </figure><p><a href="https://www.cloudflare.com/virtual-dns/">Virtual DNS customers</a> can also take advantage of the Grafana plugin. There is a custom Grafana dashboard that comes installed with the plugin to show traffic distribution and <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">RTT</a> from different Virtual DNS origins, as well as the top queries that are uncached or returning SERVFAIL.</p><p>The Grafana plugin is just one step to install once you have Grafana up and running:</p>
            <pre><code>grafana-cli plugins install cloudflare-app</code></pre>
            <p>Once you sign in using your user email and API key, the plugin will automatically discover domains and Virtual DNS clusters you have access to.</p><p>The Grafana plugin is built on our new <a href="https://developers.cloudflare.com/api/#dns-analytics-properties">DNS analytics API</a>. If you want to explore your DNS traffic but Grafana isn’t your tool of choice, our DNS analytics API is very easy to get started with. Here’s a curl to get you started:</p>
            <pre><code>curl -s -H 'X-Auth-Key:####' -H 'X-Auth-Email:####' 'https://api.cloudflare.com/client/v4/zones/####/dns_analytics/report?metrics=queryCount'</code></pre>
            <p>To make all of this work, Cloudflare DNS is answering and logging millions of queries each second. Having high resolution data at this scale enables us to quickly pinpoint and resolve problems, and we’re excited to share this with you. More on this in a follow up deep dive blog post on improvements in our new data pipeline.</p><p>Instructions for how to get started with Grafana are <a href="https://support.cloudflare.com/hc/en-us/articles/115002722267">here</a> and DNS analytics API documentation is <a href="https://api.cloudflare.com/#dns-analytics-properties">here</a>. Enjoy!</p><p><i>This blog post was edited on 9/20/18 to update installation instructions.</i></p> ]]></content:encoded>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Grafana]]></category>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">2mtV4vGRyX5uleTcj97Xrq</guid>
            <dc:creator>Marek Vavruša</dc:creator>
        </item>
        <item>
            <title><![CDATA[The Porcupine Attack: investigating millions of junk requests]]></title>
            <link>https://blog.cloudflare.com/the-porcupine-attack-investigating-millions-of-junk-requests/</link>
            <pubDate>Mon, 09 Jan 2017 14:08:58 GMT</pubDate>
            <description><![CDATA[ We extensively monitor our network and use multiple systems that give us visibility including external monitoring and internal alerts when things go wrong. ]]></description>
            <content:encoded><![CDATA[ <p>We extensively monitor our network and use multiple systems that give us visibility including external monitoring and internal alerts when things go wrong. One of the most useful systems is <a href="http://grafana.org/">Grafana</a> that allows us to quickly create arbitrary dashboards. And a heavy user of Grafana we are: at last count we had 645 different Grafana dashboards configured in our system!</p>
            <pre><code>grafana=&gt; select count(1) from dashboard;
 count
-------
   645
(1 row)</code></pre>
            <p>This post is not about our Grafana systems though. It's about something we noticed a few days ago, while looking at one of those dashboards. We noticed this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5J5NcPnvP0aNMRm7SdcuAI/68d1f8a5edf4f600666ca51aba6c4a7b/hedgehog.png" />
            
            </figure><p>This chart shows the number of HTTP requests per second handled by our systems globally. You can clearly see multiple spikes, and this chart most definitely should not look like a porcupine! The spikes were large in scale - <i>500k to 1M HTTP requests per second</i>. Something very strange was going on.</p>
    <div>
      <h3>Tracing the spikes<a href="#fn1">[1]</a></h3>
      <a href="#tracing-the-spikes">
        
      </a>
    </div>
    <p>Our intuition indicated an attack - but our attack mitigation systems didn't confirm it. We'd seen no major HTTP attacks at those times.</p><p>It would be <i>bad</i> if we were under such heavy HTTP attack and our mitigation systems didn't notice it. Without more ideas, we went back to one of our favorite debugging tools - <code>tcpdump</code>.</p><p>The spikes happened every 80 minutes and lasted about 10 minutes. We waited, and tried to catch the offending traffic. Here is what the HTTP traffic looked like on the wire:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/acRydqIF0JozrKjwO470H/d86d7e850604998cf33db9d1a66d4561/hwire.png" />
            
            </figure><p>The client had sent some binary junk to our HTTP server on port 80; they weren't even sending a fake GET or POST line!</p><p>Our server politely responded with an HTTP 400 error. This explains why it wasn't caught by our attack mitigation systems. Invalid HTTP requests don't trigger our HTTP DDoS mitigations - it makes no sense to mitigate traffic which is never accepted by NGINX in the first place!</p>
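            <p>You can reproduce this behaviour locally. The sketch below (ours, using Python's standard-library HTTP server rather than NGINX) sends the same kind of binary junk to a toy server and reads back the 400 response; the request never reaches any application handler:</p>
            <pre><code># Send binary junk to a local HTTP server: the request-line parser rejects it
# with a 400 before any handler (do_GET etc.) runs. Standard library only.
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):  # never reached by the junk request below
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # ephemeral port
threading.Thread(target=server.serve_forever, daemon=True).start()

with socket.create_connection(server.server_address) as s:
    # First bytes of the observed payload, terminated so readline() returns.
    s.sendall(bytes([0xA6, 0xEF, 0x39, 0x82, 0xCB, 0x15, 0x5E]) + b"\r\n")
    reply = b""
    while True:
        chunk = s.recv(4096)
        if not chunk:
            break
        reply += chunk

server.shutdown()
print(reply.split(b"\r\n")[0])  # status line contains 400</code></pre>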
    <div>
      <h3>The payload</h3>
      <a href="#the-payload">
        
      </a>
    </div>
    <p>At first glance the payload sent to HTTP servers seems random. A colleague of mine, <a href="https://twitter.com/chrisbranch">Chris Branch</a>, investigated and proved me wrong. The payload has patterns.</p><p>Let me show what's happening. Here are the first 24 bytes of the mentioned payload:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6xuqzBVwsArs569h0oJmAa/f44b29516fae40a6226870f8fb9f2853/hp1.png" />
            
            </figure><p>If you look closely, the pattern will start to emerge. Let's add some colors and draw it in not eight, but seven bytes per row:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/l8hPvQM54L9hQnRbRZ3S1/f70c83ac3c81f7488129eefc06cab033/hp2-1.png" />
            
            </figure><p>This checkerboard-like pattern is exhibited in most of the requests with payload sizes below 512 bytes.</p><p>Another <a href="https://twitter.com/jgrahamc">engineer</a> pointed out that there actually appear to be two separate sequences generated in the same fashion. Starting with the <code>a6</code> and the <code>cb</code>, take alternating bytes:</p>
            <pre><code>a6 ef 39 82 cb 15 5e a7 f0 3a 83 cc 16 5f
cb 15 5e a7 f0 3a 83 cc 16 5f a8 f1 3b</code></pre>
            <p>Aligning that differently shows that the second sequence is essentially the same as the first:</p>
            <pre><code>a6 ef 39 82 cb 15 5e a7 f0 3a 83 cc 16 5f
            cb 15 5e a7 f0 3a 83 cc 16 5f a8 f1 3b</code></pre>
            <p>Thinking of that as one sequence gives:</p>
            <pre><code>a6 ef 39 82 cb 15 5e a7 f0 3a 83 cc 16 5f a8 f1 3b</code></pre>
            <p>This is generated by starting at <code>ef</code> and adding, byte by byte (mod 256), the following repeating sequence:</p>
            <pre><code>4a 49 49 4a 49 49 49</code></pre>
            <p>The 'random' binary junk is actually generated by some simple code.</p><p>The length distribution of the requests is also interesting. Here's the histogram showing the popularity of particular lengths of payloads.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7CqXp7Q2ihFt6gWng6lJqP/2269db7f9437f2a681bad630554e290f/hlengths.png" />
            
            </figure><p>About 80% of the junk requests we received had lengths of up to 511 bytes, uniformly distributed.</p><p>The remaining 20% had lengths uniformly distributed between 512 and 2047 bytes, with a few interesting spikes. For some reason lengths of 979, 1383 and 1428 bytes stand out. The rest of the distribution looks uniform.</p>
    <div>
      <h3>The scale</h3>
      <a href="#the-scale">
        
      </a>
    </div>
    <p>The spikes were large. It takes a lot of firepower to generate a spike in our global HTTP statistics! On the first day the spikes reached about 600k junk requests per second. On the second day they went up to 1M rps. In total we recorded 37 spikes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/dRBYTfiRAb6aYn3yGhlKQ/61cc57d15679a62d4cb3b3f1bb714f00/hattacks.png" />
            
            </figure>
    <div>
      <h3>Geography</h3>
      <a href="#geography">
        
      </a>
    </div>
    <p>Unlike L3 attacks, L7 attacks require TCP/IP connections to be fully established. That means the source IP addresses are not spoofed and can be used to investigate the geographic distribution of attacking hosts.</p><p>The spikes were generated by IP addresses from all around the world. We recorded IP addresses from 4,912 distinct Autonomous Systems. Here are the top ASNs by number of unique attacking IP addresses:</p>
            <pre><code>Percent of unique IP addresses seen:
21.51% AS36947  # AS de Algerie Telecom, Algeria
 5.34% AS18881  # Telefonica Brasil S.A, Brasil
 3.60% AS7738   # Telemar Norte Leste S.A., Brasil
 3.48% AS27699  # Telefonica Brasil S.A, Brasil
 3.37% AS28573  # CLARO S.A., Brasil
 3.20% AS8167   # Brasil Telecom S/A, Brasil
 2.44% AS2609   # Tunisia BackBone, Tunisia
 2.22% AS6849   # PJSC "Ukrtelecom", Ukraine
 1.77% AS3320   # Deutsche Telekom AG, Germany
 1.73% AS12322  # Free SAS, France
 1.73% AS8452   # TE-AS, Egypt
 1.35% AS12880  # Information Technology Company, Iran
 1.30% AS37705  # TOPNET, Tunisia
 1.26% AS53006  # Algar Telecom S/A, Brasil
 1.22% AS36903  # ASN du reseaux MPLs de Maroc Telecom, Morocco
 ... 4897 AS numbers below 1% of IP addresses.</code></pre>
            <p>You get the picture - the traffic was sourced all over the place, with bias towards South America and North Africa. Here is the country distribution of attacking IPs:</p>
            <pre><code>Percent of unique IP addresses seen:
31.76% BR
21.76% DZ
 7.49% UA
 5.73% TN
 4.89% IR
 3.96% FR
 3.76% DE
 2.09% EG
 1.78% SK
 1.36% MA
 1.15% GB
 1.05% ES
 ... 109 countries below 1% of IP addresses</code></pre>
            <p>The traffic was truly global and launched with IPs from 121 countries. This kind of globally distributed attack is where <a href="/how-cloudflares-architecture-allows-us-to-scale-to-stop-the-largest-attacks/">Cloudflare's Anycast</a> network shines. During these spikes the load was nicely distributed across dozens of datacenters. <a href="/parabens-brasil-cloudflares-27th-data-center-now-live/">Our datacenter in São Paulo</a> absorbed the most traffic, roughly 4 times more traffic than the second in line - Paris. This chart shows how the traffic was distributed across many datacenters:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2qrKg7vQPH8xcmC2rMOVcn/f637b781e9fc738eddbdd164a10af826/hcolos.png" />
            
            </figure>
    <div>
      <h3>Unique IPs</h3>
      <a href="#unique-ips">
        
      </a>
    </div>
    <p>During each of the spikes our systems recorded 200k unique source IP addresses sending us junk requests.</p><p>Normally we would conclude that whoever generated the attack controlled roughly 200k bots, and that's it. But these spikes were different. It seems the bots rotated IPs aggressively. Here is an example: during these 16 spikes we recorded a total of a whopping 1.2M unique IP addresses attacking us.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6EtSV8z8pfbXnnebomGU0i/d8619334d05cd0637953111b5e650079/hmillion.png" />
            
            </figure><p>This can be explained by bots churning through IP addresses. We believe that out of the estimated 200k bots, between 50k and 100k bots changed their IP addresses during the 80 minutes between attacks. This resulted in 1.2M unique IP addresses during the 16 spikes happening over 24 hours.</p>
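    <p>The arithmetic behind that estimate is easy to check. Assuming a stable pool of 200k concurrently active bots plus a fresh batch of addresses appearing before each of the 15 subsequent spikes (our back-of-the-envelope model, not an exact one):</p>
            <pre><code># Back-of-the-envelope check of the churn estimate: a stable pool of ~200k
# bots, plus some number of bots re-appearing from a new IP before each of
# the 15 gaps between the 16 spikes.
pool = 200_000      # unique IPs seen in any single spike
total = 1_200_000   # unique IPs seen across all 16 spikes
spikes = 16

churn_per_gap = (total - pool) / (spikes - 1)
print(round(churn_per_gap))  # ~67k new addresses per 80-minute gap</code></pre>
            <p>That lands squarely inside the 50k-100k range estimated above.</p>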
    <div>
      <h3>A botnet?</h3>
      <a href="#a-botnet">
        
      </a>
    </div>
    <p>These spikes were unusual for a number of reasons.</p><ul><li><p>They were generated by a large number of IP addresses. We estimate 200k concurrent bots.</p></li><li><p>The bots were rotating IP addresses aggressively.</p></li><li><p>The bots were from around the world with an emphasis on South America and North Africa.</p></li><li><p>The traffic generated was enormous, reaching 1M junk connections per second.</p></li><li><p>The spikes happened exactly every 80 minutes and lasted for 10 minutes.</p></li><li><p>The payload of the traffic was junk, not a usual HTTP request attack.</p></li><li><p>The payload sizes were uniformly distributed.</p></li></ul><p>It's hard to draw conclusions, but we can imagine two possible scenarios. It is possible these spikes were an attack intended to break our HTTP servers.</p><p>A second possibility is that these spikes were legitimate connection attempts by some weird, obfuscated protocol. For some reason the clients were connecting to port 80/TCP and retried precisely every 80 minutes.</p><p>We are continuing our investigation. In the meantime we are looking for clues. Please do let us know if you have encountered this kind of TCP/IP payload. We are puzzled by these large spikes.</p><p>If you'd like to work on this type of problem <a href="https://www.cloudflare.com/join-our-team/">we're hiring</a> in London, San Francisco, Austin, Champaign and Singapore.</p><p><i>Update</i> A <a href="https://twitter.com/GGreg/status/818462582998704128">Twitter user</a> pointed out that the sequence <code>a6 ef 39 82 cb 15 5e a7 f0 3a 83 cc 16 5f a8 f1 3b</code> appears in this set of <a href="https://github.com/gvanas/KeccakCodePackage/blob/master/TestVectors/KetjeJr.txt">test vectors</a>, so we contacted the <a href="http://gva.noekeon.org/">author</a> who was kind enough to reply and point us to the <a href="https://github.com/gvanas/KeccakCodePackage/blob/master/Tests/testKetje.c#L46">code</a> that generated those vectors.</p>
            <pre><code>static void generateSimpleRawMaterial(unsigned char* data, unsigned int length, unsigned char seed1, unsigned int seed2)
{
    unsigned int i;

    for( i=0; i&lt;length; i++) {
        unsigned int iRolled = i*seed1;
        unsigned char byte = (iRolled+length+seed2)%0xFF;
        data[i] = byte;
    }
}</code></pre>
            <p>Since we identified above that the difference between two bytes seemed to be 0x49 or 0x4a, it's worth looking at the difference between bytes in this algorithm. Simplifying, bytes are generated from:</p>
            <pre><code>((i * seed1) + length + seed2)%0xFF</code></pre>
            <p>Ignoring the <code>% 0xff</code> for the moment, that's <code>(i * seed1) + length + seed2</code>. Taking the difference between two adjacent bytes (for <code>i</code> and <code>i+1</code>) gives a difference of just <code>seed1</code>.</p><p>Thus in our case it's likely that <code>seed1</code> is 0x49. It's fairly easy to end up with the following code to generate the sequence:</p>
            <pre><code>seed = 0x49
byte = 0xa6

do
   byte = (seed + byte) % 0xff
done</code></pre>
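            <p>This recurrence does reproduce the observed bytes exactly. Note the modulus is 0xff (255), not 256, which also explains why the byte-to-byte differences alternate between <code>49</code> and <code>4a</code>: every wrap of the mod-255 arithmetic shows up as an extra +1 when the bytes are read mod 256. In runnable form (our verification, in Python):</p>
            <pre><code># Verify that byte = (0x49 + byte) % 0xff, starting at 0xa6, reproduces
# the sequence recovered from the captured payloads.
observed = [0xa6, 0xef, 0x39, 0x82, 0xcb, 0x15, 0x5e, 0xa7, 0xf0,
            0x3a, 0x83, 0xcc, 0x16, 0x5f, 0xa8, 0xf1, 0x3b]

seed, byte = 0x49, 0xa6
generated = [byte]
for _ in range(len(observed) - 1):
    byte = (seed + byte) % 0xff
    generated.append(byte)

print(generated == observed)  # True</code></pre>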
            <p>One big mystery remaining is 'what's the 0x75 at the start of the junk data?'.</p><hr /><ol><li><p>Yes, we're aware that porcupines have spines/quills not spikes. <a href="#fnref1">↩︎</a></p></li></ol> ]]></content:encoded>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[Grafana]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">2W4ED6tvTfgStUaMEus73C</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
    </channel>
</rss>