How to Block Bots from Overwhelming Your Website using Cloudflare Firewall

As an online business owner, you wake up one day and realize your site load time has tripled. Looking at traffic logs, you see requests from weird crawler bots you never noticed before have shot up 20X overnight!

The bots are aggressively crawling indexed pages, hammering servers and impeding real visitors from accessing your site. Some are even trying to break into user accounts via credential stuffing.

Left unchecked, this bot epidemic threatens everything you've built. Your site's stability, security, and even revenue streams are at risk.

So what do we do? How do we take back control and block these troublesome bots?

This comprehensive guide will teach you how to leverage Cloudflare's firewall to selectively filter bad bot traffic. Follow along as we dive into:

  • The scale of today's bot problem
  • Distinguishing good bots from malicious ones
  • Steps to block bots on Cloudflare
  • Advanced firewall customization
  • Complementary bot mitigation strategies

Rapid Rise of Good & Bad Bots

First, let's grok the massive bot epidemic websites face today.

Bots Make Up 40%+ of Web Traffic

Industry analysis reveals bots today account for over 40% of requests to the average website.

The share of traffic from bots more than doubled in just the last three years, according to Imperva.

Bots have exploded thanks to:

  • Ubiquity of scalable cloud computing resources
  • Rising popularity of browser automation tools like Selenium
  • Improved machine learning crawling algorithms

Both good and malicious bots have multiplied rapidly.

So what exactly are these mysterious bots hitting our websites?

Web Robots & Crawlers Explained

Bots or "web robots" are applications that programmatically browse the web to perform certain tasks.

Search engine bots like Googlebot crawl sites to index new pages. Social media bots scrape public data for preview link shares. Price comparison sites acquire product pricing from ecommerce stores via scripts.

These automated agents browse web pages by spoofing user browsers. Most bots run either custom code or leverage browser automation software like Selenium or Puppeteer to emulate human visitors at scale.

Bots work their way through websites by recursively following links – a process called web crawling. As they visit each page, bots extract information to serve their own objectives.
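To make that concrete, here is a minimal sketch of a crawler in Python (the start URL and page limit are placeholders; real crawlers add politeness delays, robots.txt checks and error handling):

import urllib.parse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=50):
    """Breadth-first crawl: fetch a page, queue its same-site links."""
    seen, queue = set(), [start_url]
    host = urllib.parse.urlparse(start_url).netloc
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=10).text
        # Extract every hyperlink on the page and queue same-site ones
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            absolute = urllib.parse.urljoin(url, link["href"])
            if urllib.parse.urlparse(absolute).netloc == host:
                queue.append(absolute)
    return seen

print(len(crawl("https://example.com")))  # start URL is a placeholder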

But not all bots have good intentions…

Malicious Bots Pose Business & Security Threats

While some "good bots" provide value, more than half of all bot traffic is malicious, according to Imperva.

These bad bots hijack infrastructure to:

  • Steal copyrighted content via scraping
  • Spam contact forms on websites
  • Spread malware by probing sites for vulnerabilities
  • Amplify DDoS attacks by overwhelming servers
  • Enable credential stuffing by brute forcing login pages

Left unchecked, malicious bots can seriously harm websites:

  • Loss of revenue from content theft or degraded site UX
  • Legal liability & compliance violations from data breaches
  • Increased infrastructure costs to handle excessive traffic
  • Loss of search ranking from blocks by Google's anti-spam algorithms

With bots running amok, we need to take action. Now.

Distinguishing Good Bots from Bad Bots

With bots making up so much web traffic today, arbitrarily blocking all bots would be unwise.

So how do we allow legitimate "good" bots while selectively filtering out "bad" bots?

Let's contrast some features of these two categories:

Good bots:

  • Follow robots.txt rules (see the example below)
  • Respect crawl delays
  • Crawl pages at a limited rate
  • Scrape only public data
  • Identify themselves properly in the user agent

Bad bots:

  • Ignore robots.txt restrictions
  • Crawl aggressively without delays
  • Scrape private or copyrighted data
  • Spoof user agent identities
  • Spread malware and participate in attacks
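For reference, here is what those conventions look like in a robots.txt file – a hedged example that bars MJ12bot entirely and asks all other crawlers to slow down (Crawl-delay is a de facto directive that only some crawlers honor):

User-agent: MJ12bot
Disallow: /

User-agent: *
Crawl-delay: 10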

The challenge is that bad bots dynamically evolve their behavior to mimic real users and stay undetected.

Many leverage tricks like:

  • Rotating IP addresses
  • Randomizing user agents
  • Executing JavaScript like humans
  • Using residential proxies to mask origins

So simply checking user agent headers or IP addresses is no longer sufficient. More advanced techniques are required.

Browser Fingerprinting to Detect Evasive Bots

To reliably filter sophisticated bots, we can analyze their browser fingerprints.

A browser fingerprint comprises attributes like:

  • Screen size
  • Software versions
  • Supported MIME types
  • Benchmark speeds

These create a signature that uniquely identifies a specific browser on a device.
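As an illustrative sketch, hashing the collected attributes into such a signature might look like this in Python (the attribute names and values here are assumptions; production systems collect far more):

import hashlib

def fingerprint(attrs):
    """Reduce a dict of collected browser attributes to a stable signature."""
    canonical = "|".join(f"{key}={attrs[key]}" for key in sorted(attrs))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# Attribute names and values are illustrative assumptions
print(fingerprint({
    "screen": "1920x1080",
    "mime_types": "application/pdf,text/html",
    "webgl_renderer": "ANGLE (Intel HD 620)",
}))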

Human users will have diverse fingerprints across many types of devices like laptops, mobile phones and tablets.

However, bots usually run on fixed infrastructure, so they exhibit much less fingerprint variance, making them easier to spot.

Advanced bot management solutions leverage browser fingerprinting to reliably detect bots even if they spoof IPs or user agents. We'll discuss more solutions later.

First, let's look at configuring basic blocking on Cloudflare.

Blocking Bots with Cloudflare Firewall

Cloudflare runs a massive global network spanning 200+ cities worldwide. This allows them to absorb and filter malicious traffic before it reaches your origin infrastructure.

Let's examine how to leverage Cloudflare's firewall policies to block bad bots.

Here's what we'll cover:

  • Enabling the Cloudflare reverse proxy
  • Creating firewall rules to block specific bots
  • Leveraging Cloudflare's bad bot threat intelligence
  • Customizing firewall rule sensitivity

Let's get started!

Step 1 – Enable Cloudflare Proxy

First, we need to configure our DNS settings to route traffic through Cloudflare's servers.

Sign up for a Cloudflare account, add your site's domain, and update the nameserver records at your registrar to point to Cloudflare's nameservers. This funnels web traffic through their proxy network.
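At your registrar, the updated records would look something like this (Cloudflare assigns each account its own pair of nameservers, so the names below are hypothetical):

example.com.  NS  alice.ns.cloudflare.com.
example.com.  NS  bob.ns.cloudflare.com.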

Cloudflare also offers a free SSL certificate to encrypt all web traffic for better security.

Step 2 – Create Firewall Rules

Now we can formulate firewall policies to block bad bots.

Navigate to "Firewall" and select the "Firewall rules" tab. Hit "Create Firewall Rule".

Let's block the notorious MJ12bot crawler, as shown below.

We can block by user agent, IP, geography, request frequency and more. Use "Edit Expression" for complex logic.
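For example, this expression – a minimal sketch relying on MJ12bot advertising itself in its user agent string – with the rule action set to Block:

(http.user_agent contains "MJ12bot")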

Repeat to block other malicious bots plaguing your site.

Step 3 – Leverage Global Bot Threat Intelligence

Manually hunting down every new malicious bot is impractical.

Instead, tap into Cloudflare's constantly updated database of bad bots, identified across its international network handling 200 billion requests per day.

Under the Firewall's "Bot Management" section, enable the "Block Bad Bots" toggle.

This instantly blocks traffic from 4,000+ known malicious bots, saving us endless hours of log analysis. The list is continually updated as new threats emerge.

Step 4 – Adjust Rule Sensitivity

We can tweak the threat score threshold at which traffic gets challenged or blocked.

A more aggressive threshold blocks more bot traffic but risks also filtering some legitimate users. Start conservative and tighten gradually until site performance stabilizes.
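For instance, a hedged sketch of a rule that challenges any request Cloudflare scores as even mildly suspicious – the threshold of 14 is an assumption to tune against your own traffic:

(cf.threat_score gt 14)

Pair it with the "Managed Challenge" action so legitimate users caught in the net can still pass.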

Also review stats like total threats blocked in the Firewall analytics dashboard.

Proactively blocking bad bots with Cloudflare improves site security and accelerates performance.

Next, let's examine some advanced customization options.

Advanced Firewall Configuration & Custom Rules

Cloudflare firewall offers myriad ways to secure sites beyond basic user agent checks.

Rate Limiting Heavy Clients

We can throttle clients that hammer the site to prevent resource abuse.

Cloudflare's firewall expressions do not expose a per-client bandwidth counter, so heavy downloaders are best handled with rate limiting rules, which pair a matching expression with a request threshold. For instance, to rein in clients scraping the product catalog (the path and threshold here are illustrative):

(http.request.uri.path contains "/catalog/")

Configure the rule to block any IP exceeding, say, 100 requests per minute. This controls heavy scraping that degrades site performance, and the rule language supports a variety of flexible predicates.

Block Traffic from Anonymous Proxies

Scrapers often hide their location using anonymous proxy services.

We can blacklist networks of known proxies:

(ip.geoip.asnum eq 34934) or (ip.src in {93.184.216.0/24})

Prevent Scraping of Sensitive URLs

Say your /reports subfolder contains private inventory data. We want to ensure only employee browsers can access it.

Use a firewall rule to restrict the path:

http.request.uri.path contains "/reports/" and
  ip.src ne 192.168.0.1

With the action set to Block, the above turns away every client – bots included – except the internal office IP. This helps secure confidential data.

We can build sophisticated logic tailored to our specific traffic patterns.

Available Fields for Custom Rules

Numerous request attributes can be filtered on:

  • cf.client.bot – true for known good bots like search engine crawlers
  • http.request.uri.path – URL path being accessed
  • ip.src – visitor IP address
  • ip.geoip.country – geographic origin
  • http.cookie – cookies sent with the request
  • http.user_agent – browser user agent

Mix and match conditions to craft precise bot blocking policies.
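For example, a classic combination blocks clients that claim to be Googlebot in their user agent but are absent from Cloudflare's verified good bot list:

(http.user_agent contains "Googlebot") and not cf.client.bot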

Complementary Bot Defense Strategies

No single tactic can block all bots. A layered defense works best.

Some complementary approaches include:

  • CAPTCHAs – Distinguish humans from bots via challenges
  • Security Headers – Restrict resource access permissions
  • Virtual Patches – Patch app layer vulnerabilities instantly
  • Log Analysis – Continually monitor and block new threats
  • Scrape Protection – API-based data access for well-behaved scrapers

Combine network firewalls with application hardening for resilient bot defense.

Stay Vigilant of New Bot Behaviors

The cat and mouse game continues as sneaky new bots evolve innovative ways to bypass defenses.

Continually monitor site traffic and logs for patterns signaling malicious bots (a quick log-scanning sketch follows this list):

  • Sudden drops in bounce rates
  • Pages per session spikes
  • Bandwidth overages
  • Upticks in 404 errors
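Here is a minimal log-scanning sketch in Python (the log path and combined log format are assumptions; adjust the regex to your server's format) that surfaces abnormally noisy user agents:

import re
from collections import Counter

# In the combined log format, the user agent is the last quoted field
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

counts = Counter()
with open("access.log") as log:  # log path is an assumption
    for line in log:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1

# Surface the noisiest user agents for manual review
for agent, total in counts.most_common(10):
    print(f"{total:8d}  {agent}")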

Analyze client fingerprints for more clues: collect visitor browser attributes like screen resolution, DOM properties and WebGL renderer, then cluster them to identify groups with identical fingerprints indicative of bots – as in the sketch below.
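A hedged sketch of that clustering step, reusing the fingerprint() helper from earlier (the 1% share threshold is an assumption to tune):

from collections import Counter

def suspicious_fingerprints(fingerprints, share=0.01):
    """Flag signatures accounting for an outsized share of all visits."""
    counts = Counter(fingerprints)
    cutoff = max(1, int(len(fingerprints) * share))
    return [fp for fp, n in counts.most_common() if n > cutoff]

# fingerprints is a list of signature hashes, one per visit, e.g.:
print(suspicious_fingerprints(["a1b2"] * 500 + ["c3d4", "e5f6"]))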

Scrutinize traffic data, tweak rules aggressively and keep refining over time.

Be relentless in hunting down data-thieving bot swarms trying to game your defenses. This is about protecting our livelihood and the value we create online.

Show them no mercy when it comes to securing our hard work!

Take Control by Blocking Bad Bots

Left unchecked, nefarious bots can seriously harm website infrastructure, finances and credibility.

Leveraging Cloudflare's global bot visibility and firewall gives us a way to selectively filter malicious crawlers trying to freeload off our servers.

We can control precisely which bots to allow, which to block, and how traffic from each gets shaped based on its impact. This protects our online assets and safeguards the customer experience.

With great power comes great responsibility. Wield your new anti-bot skills judiciously by blocking threats while enabling benign bots that bring value.

Our website now runs faster with a trimmed-down bot load. Loyal customers enjoy improved page load times. Scarce server resources are freed up for actual human visitors. It's a win-win for everyone (except you, pesky bots!).

So don't surrender meekly as bot mobs run amok. Take a stand by deploying the robust bot blocking strategies outlined here. Our hard work and websites will thank us.