Stopping Bad Bots and Scrapers Cold: A 2800+ Word Guide

Unwanted requests have become the bane of website owners and infrastructure teams alike in 2023. Automated scraping, spamming and abuse have reached staggering levels across the internet:

  • Up to 25% of traffic on many sites comes from bad bots, according to Imperva research
  • The average website faces over 110 bot attacks daily, based on various threat reports
  • DDoS and brute force attacks often ride on the coattails of mass botnet activity

Faced with this onslaught, many site owners feel overwhelmed and unsure how to stem the tide. The first step? Understanding what you’re up against and meeting bots where they live – at the edge.

In this 2800+ word guide, we’ll explore battle-tested techniques for blocking bad bot traffic across three vital layers:

  • Apache web servers
  • Nginx proxies and load balancers
  • WordPress application firewalls

Equipped with the right tools and rules, you can filter bot abuse without impacting real visitors or SEO. Let’s dig in!

When Good Bots Go Bad

Before covering technical controls, it helps to examine modern bot behavior plaguing sites:

Web Scraping – Extracting pricing data, reviews and images en masse for competitive analysis. Amazon listings and flight search results are scraped especially aggressively.

Spam Commenting – Flooding blogs, forums and guestbooks with links to drive traffic. It has declined since peaking in 2016 but remains problematic, with roughly 127 million spam comments posted annually.

Fake Analytics Bots – Emulating humans to inflate traffic metrics and ad revenues. Referrer spammers like semalt.com specialize here.

Malicious Recon – Probing sites for vulnerabilities to exploit, such as SQL injection and XSS. Advanced persistent threats exhibit this behavior.

Brute Force Attacks – Attempting to crack passwords by testing millions of guesses via scripts and compromised devices. The infamous Mirai botnet spread this way.

Catching this junk traffic early prevents wasted resources and site issues down the road. Now let’s see how, starting with the web server…

Taming Apache – A Sheriff’s Guide

The Apache HTTP Server first launched in 1995 and still drives over 33% of websites, second only to Nginx in adoption. Its longevity comes from being open source, stable and extensible via modules like mod_rewrite for request filtering.

Here’s a common rewrite rule in .htaccess files or vhost configs leveraging mod_rewrite:

RewriteEngine On

# Block requests whose User-Agent matches either pattern
RewriteCond %{HTTP_USER_AGENT} ^NameOfBadBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^AnotherBadOne
RewriteRule .* - [F,L]

This uses regular expression (regex) logic to match bad bots by name and return 403 Forbidden errors. Some common troublemakers include (a sample rule targeting them follows the list):

  • MSIECrawler – Researchers exploiting IE browser bugs
  • MJ12bot – Aggregates/copies data to generate revenue
  • AhrefsBot – SEO analysis but scrapes heavily
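
A rule targeting these named agents might look like the sketch below; the [NC] flag makes the match case-insensitive, and the pattern list should be tuned against your own access logs:

# Block known abusive crawlers by User-Agent (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} (MSIECrawler|MJ12bot|AhrefsBot) [NC]
RewriteRule .* - [F,L]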

And referrer blocking:

# Block referrers
RewriteCond %{HTTP_REFERER} semalt\.com [NC,OR]
RewriteCond %{HTTP_REFERER} buttons\.com [NC]
RewriteRule .* - [F,L]

Preventing referrer spammers like semalt.com helps safeguard analytics. Enable mod_rewrite, tweak rules and restart Apache to apply changes:

a2enmod rewrite  

service apache2 restart
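
Note that rewrite rules placed in .htaccess files only take effect if the directory allows overrides. If the rules seem to do nothing, check the vhost config for something along these lines (the directory path shown is just an example):

<Directory /var/www/html>
    # Allow .htaccess files to define mod_rewrite directives
    AllowOverride FileInfo
</Directory>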

For added visibility, enable logging of blocked request details:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %I %O" combined
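
The LogFormat directive only defines the format; pair it with a CustomLog directive so requests are actually written with it (the log path is an example), and note that the %I and %O byte counters require mod_logio to be enabled:

# Write access entries using the format defined above
CustomLog /var/log/apache2/access.log combined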

Let’s see this in action against an attack on a login form:

[Sample Apache access log]

Based on this log, we can see multiple login POST requests from the Havij SQL injection tool. Our user-agent rules blocked the requests before any password guessing or injection could succeed.
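
The exact condition isn't shown above, but a user-agent rule along these lines would catch that tool's default signature (treat the pattern as an illustration to adapt, not a definitive fingerprint):

# Reject requests advertising the Havij injection tool in their User-Agent
RewriteCond %{HTTP_USER_AGENT} havij [NC]
RewriteRule .* - [F,L]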

On to leveraging Nginx for more advanced filtering…

Nginx Masters – Redirecting Bots into Blackholes

Nginx excels as a high-performance web server, load balancer and reverse proxy able to handle thousands of connections concurrently.

Although newer than Apache, Nginx gained popularity thanks to its memory efficiency, custom modules and scaling capacity. It now runs over 40% of the busiest sites.

Nginx evaluates and filters requests with if blocks, typically driven by variables you define yourself (see the map sketch after this example):

if ($bad_bot) {
  return 403; # Forbidden
}

if ($bad_referrer) {
  return 301 /sinkhole.html; # Redirect to a sinkhole page
}
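
The $bad_bot and $bad_referrer variables are not built into Nginx; they are typically defined with map blocks in the http context, along these lines (the patterns are illustrative):

# Flag suspicious user agents and referrers (http context)
map $http_user_agent $bad_bot {
  default        0;
  ~*MJ12bot      1;
  ~*AhrefsBot    1;
}

map $http_referer $bad_referrer {
  default                  0;
  ~*semalt                 1;
  ~*buttons-for-website    1;
}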

Benefits of this approach include:

  • Cleaner configuration-as-code syntax versus scattered .htaccess files
  • Flexibility in defining matching logic
  • Fine-tuning through location and server blocks
  • The ability to redirect bots into isolation

Let’s look at some syntax examples in depth. First, a basic user-agent block:

if ($http_user_agent ~ "Googlebot|bingbot|Baiduspider") {
  return 403;
}

This targets major search engine crawlers with a regex OR match to block them outright. Useful if you need to temporarily pause indexing.

Referrer filtering with redirects:

if ($http_referer ~* "semalt|buttons-for-website|.*fake") {
  rewrite ^(.*)$ /sinkhole.html break;
}

In the rewrite pattern, ^ and $ anchor the match to the start and end of the requested URI, so the entire URI is captured and sent to the sinkhole. The ~* operator makes the referer test case-insensitive and matches anywhere in the header, catching whole referrer domains as well as spam appended via URL parameters.
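
For the rewrite target to resolve, something has to answer /sinkhole.html. Assuming the if block sits at the server level, the rewritten URI goes through normal location matching, so a minimal sinkhole location like this would serve it (alternatively, just drop a static sinkhole.html into the document root):

# Minimal landing page for sinkholed bot traffic
location = /sinkhole.html {
  default_type text/html;
  return 200 "Nothing to see here.";
}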

After updating the nginx.conf file, a reload applies the new bot jailing rules:

/etc/init.d/nginx reload
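
On systemd-based distributions the equivalent is below; it is also worth validating the syntax first so a typo does not take the proxy down:

nginx -t                  # check configuration syntax
systemctl reload nginx    # apply changes without dropping connections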

And checking the access logs reveals redirected bot attempts:

[Nginx access log]

NICELY DONE! With bad bots pointed to oblivion, let’s secure WordPress specifically…

WordPress Wrangling – Lassoing Bad Bots

If your site runs on WordPress, performance and security plugins help manage bot threats. They offer user-friendly ways to block bad actors without complex server tweaks.

WordPress brute force attacks increased over 60% in 2022 according to Wordfence research. Bots target vulnerabilities in older PHP and plugin code, along with weak user passwords. This makes security plugins essential.

The popular Blackhole for Bad Bots plugin auto-detects high-risk traffic right out of the gate. Features include:

✅ Instantly blocking the top 50 malicious user agents

✅ Continuous scanning for vulnerability probes

✅ Monitoring attempts to crack admin logins via brute force

✅ Analyzing patterns across logins, file requests and key headers

Blackhole uses WordPress hooks to filter requests in memory before committing disk writes or MySQL queries. This maximizes efficiency while also updating firewall rules dynamically as new threats emerge.
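
To illustrate the hook-based approach in general terms (a simplified sketch, not Blackhole's actual code; the agent list is purely illustrative), a small must-use plugin could reject flagged user agents before WordPress does any heavy lifting:

<?php
// mu-plugin sketch: reject requests from flagged user agents early
add_action( 'init', function () {
    $agent    = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $bad_bots = array( 'MJ12bot', 'AhrefsBot', 'Havij' ); // illustrative list

    foreach ( $bad_bots as $bot ) {
        if ( stripos( $agent, $bot ) !== false ) {
            status_header( 403 ); // send a 403 Forbidden response
            exit( 'Forbidden' );
        }
    }
}, 1 );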

For example, it will automatically block reconnaissance probes like this request for a canary file:

66.249.64.0 - - [30/Dec/2022:18:59:13 +0000] "GET /wp-content/uploads/2022/12/test-canary-01.txt HTTP/1.1" 403 488 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"

Other leading plugins like Wordfence also provide firewall capabilities and work alongside Blackhole. The combination handles everything from country blocking to file change monitoring.

The one downside to app-level solutions? They only act after a request has already reached PHP, so they can't replace the lower-level filtering that web server rules provide, which matters if your hosting account doesn't let you edit server configs directly. But for most sites they strike the right balance.

Parting Thoughts – Multi-Layered Bot Defenses

In closing, no single tool can fully block all bot abuse and scraping threats targeting modern websites and APIs. The weaponry shown here should be part of a broader strategy including:

  • Analyzing traffic logs to identify top offenders
  • Requiring JavaScript and cookies to filter out headless scrapers
  • Implementing CAPTCHAs and other human challenges where needed
  • Considering commercial bot management services that proxy traffic
  • Hardening WordPress and other apps to prevent exploitation

With techniques covered for Apache, Nginx and WordPress, you now have more tools to defend websites against the growing bot menace. Deploy a mix across your environment, monitor activity closely and keep firewall rules updated.

Here’s to many years of happy, bot-free surfing! Let me know if you have any other tips for blocking bad bot traffic by tweeting me @bot_buster.