The Python Developer's Guide to Rotating Proxies for Web Scraping

When you're scraping data from websites using Python, your IP address is like your digital fingerprint. Make too many requests from the same IP and you risk getting blocked. That's where proxy servers come in.

By routing your requests through an intermediary proxy server, you can mask your real IP address. Even better, by rotating through a pool of proxy IPs, you make it look like the requests are coming from many different users instead of just you.

In this in-depth guide, we'll walk you through exactly how to implement proxy rotation in your Python web scraping projects for the best results. Let's get started!

Proxies 101: The Basics

First, let's cover some key terminology:

  • A proxy server acts as a gateway between you and the internet. It forwards your web requests to the target server, but with the proxy server's IP address instead of your own.

  • A static proxy uses a single IP address. It provides some anonymity but websites can still block the IP if they detect an unusual amount of traffic from it.

  • A rotating proxy gives you access to a pool of IP addresses that are constantly switched out, either randomly or at set intervals. Your requests get distributed across the pool of proxies.

The big advantages of rotating proxies are:

  1. Avoiding IP bans and rate limits, since each request comes from a different address
  2. Better anonymity and security, making it harder to track your web activity
  3. Ability to make a high volume of requests in a short time span
  4. Switching between different geolocations to bypass restrictions

For these reasons, rotating proxies are extremely useful for large-scale web scraping projects. With a pool of IPs to rotate through, you can scrape a high number of pages without getting blocked.
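
To make this concrete, here's a minimal example of routing a single request through one proxy with the requests library. The IP and port below are placeholders taken from the sample list later in this guide, not a live proxy you can rely on:

import requests

# Placeholder proxy address - substitute one you actually have access to
proxy = '101.32.104.120:1080'

# Both HTTP and HTTPS traffic will be forwarded through this proxy
r = requests.get('http://icanhazip.com',
                 proxies={'http': proxy, 'https': proxy},
                 timeout=5)

print(r.text)  # prints the proxy's IP address, not yours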

How to Rotate Proxies in Python

Now for the practical part – here's how to set up a Python script that rotates proxies for you:

Step 1: Get a list of proxy IP addresses

To get started, you'll need access to multiple proxy servers. The easiest way is to sign up with a proxy provider – they'll give you a pool of rotating proxies to use. More on some recommended providers later.

Alternatively, you can use free proxy lists available online. Just be aware these free proxies can be unreliable and may stop working at any time. For this guide, we'll assume you have your proxies in a text file, one per line:

101.32.104.120:1080  
212.129.62.3:8080
187.217.54.84:80

Read this file into your Python script and parse it into a set of proxies:

with open('proxy_list.txt') as f:
    raw_proxies = f.read().splitlines()

proxies = set()
for line in raw_proxies:
    ip, port = line.split(':')
    proxies.add(ip + ':' + port)

We use a set to store the proxies to avoid any duplicates.

Step 2: Check which proxies are working

Just because an IP is in your proxy list doesn't mean it actually works. Let's test them out by sending a request to a site like http://icanhazip.com that returns our current IP address.

We'll use the requests library to make the HTTP calls. Here's the code:

import requests

def check_proxy(proxy):
    try:
        r = requests.get('http://icanhazip.com', proxies={'http': proxy, 'https': proxy}, timeout=2)

        if r.status_code == 200:
            return True
        else:
            return False
    except requests.RequestException:
        return False

working_proxies = set()
for proxy in proxies:
    if check_proxy(proxy):
        working_proxies.add(proxy)
    else:
        print(f'Proxy {proxy} failed, skipping')

This loops through our list of proxies and sends a GET request to icanhazip.com with each one. If the request succeeds with a 200 status code, we know that proxy is working and add it to the working_proxies set.

We also set a timeout of 2 seconds. If the proxy server doesn't return a response within that time, it's too slow to be useful.
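
Checking proxies one at a time can be slow if your list is long. As a rough sketch (reusing the check_proxy() function above), you can test them in parallel with concurrent.futures from the standard library:

from concurrent.futures import ThreadPoolExecutor

proxy_list = list(proxies)

# Run check_proxy() across many proxies at once; 20 workers is an arbitrary choice
with ThreadPoolExecutor(max_workers=20) as executor:
    results = executor.map(check_proxy, proxy_list)

working_proxies = {proxy for proxy, ok in zip(proxy_list, results) if ok}

The complete class at the end of this guide uses a similar trick, spinning up one thread per proxy.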

Step 3: Separate proxies into working and broken sets

After testing all the proxies, we split them into two groups:

broken_proxies = proxies - working_proxies

By using Python's set subtraction, we get a new set broken_proxies containing all the non-working proxies. We'll check these again later in case any have come back online.

Step 4: Make requests using a random proxy

Time to actually put the proxies to use! Whenever we want to make a request, we'll randomly select a proxy from our working_proxies set.

Let's say the URL we want to scrape is https://quotes.toscrape.com/:

import random

def get_random_proxy():
    # random.choice needs a sequence, so convert the set to a list first
    return random.choice(list(working_proxies))

def scrape_site(url):
    proxy = get_random_proxy()
    try:
        r = requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=3)

        if r.status_code == 200:
            html = r.text
            # Parse the HTML here
            return html

    except requests.RequestException:
        # Assume this proxy has died: move it to the broken set
        broken_proxies.add(proxy)
        working_proxies.remove(proxy)

        if len(working_proxies) == 0:
            print('Ran out of working proxies!')
            return None

        # Retry the same URL with a different random proxy
        return scrape_site(url)

scrape_site('https://quotes.toscrape.com/')

This code does a few important things:

  1. The get_random_proxy() function selects a random proxy from the working set
  2. We pass this proxy to the requests.get() call, so the request goes through that proxy server
  3. If the request fails for any reason, we assume that proxy is no longer working and move it to the broken set
  4. If we run out of working proxies, the function prints a warning and returns None
  5. Otherwise, it recursively calls itself to try the same URL again with a different random proxy

By using this recursive structure, the script will keep trying new random proxies for each URL until it either gets a successful response or runs out of proxies.
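
In practice you'll usually be scraping more than one page. Here's a rough sketch of how scrape_site() might be used over a list of URLs (the page URLs are just examples from the same practice site):

import time

urls = [
    'https://quotes.toscrape.com/page/1/',
    'https://quotes.toscrape.com/page/2/',
    'https://quotes.toscrape.com/page/3/',
]

for url in urls:
    html = scrape_site(url)  # picks a fresh random proxy on every attempt
    if html is None:
        print(f'Giving up on {url}')
        continue
    # ... parse and store the HTML here ...
    time.sleep(1)  # small pause between pages to avoid hammering the site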

Step 5: Recheck broken proxies

Proxies that failed before might start working again later. So it's a good idea to periodically move the broken proxies back into the unchecked pool and re-test them.

We can use a Python threading.Timer to run this on a schedule, e.g. every 5 minutes:

import threading

def proxy_manager():
    global proxies
    proxies = proxies | broken_proxies

    broken_proxies.clear()

    threading.Timer(5 * 60, proxy_manager).start()

proxy_manager()

This proxy_manager() function moves all the broken proxies back into the main proxies set, then clears out broken_proxies. The next time the checking code runs, they'll all get retested.

By calling proxy_manager() once at the start and having it reschedule itself with threading.Timer, it keeps running indefinitely.
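
One thing to keep in mind: a plain threading.Timer thread will keep the Python process alive even after your scraping code has finished. If you want the program to be able to exit normally, a small variation is to mark the timer as a daemon thread before starting it:

import threading

def proxy_manager():
    global proxies
    proxies = proxies | broken_proxies
    broken_proxies.clear()

    # A daemon timer won't stop the interpreter from exiting
    timer = threading.Timer(5 * 60, proxy_manager)
    timer.daemon = True
    timer.start()

proxy_manager()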

Complete Python Proxy Rotator Code

Here's the full code for the proxy rotator, wrapped in a class for easy reuse:

import random
import requests
import threading

class ProxyRotator:
    def __init__(self, proxy_file):
        self.proxy_file = proxy_file

        with open(proxy_file) as f:
            raw_proxies = f.read().splitlines()

        self.proxies = set()
        for line in raw_proxies:
            ip, port = line.split(':')
            self.proxies.add(ip + ':' + port)

        self.working_proxies = set()
        self.broken_proxies = set()

        self.test_proxies()

    def test_proxies(self):
        def check(proxy):
            try:
                r = requests.get('http://icanhazip.com', proxies={'http': proxy, 'https': proxy}, timeout=2)
                if r.status_code == 200:
                    self.working_proxies.add(proxy)
                else:
                    self.broken_proxies.add(proxy)
            except requests.RequestException:
                self.broken_proxies.add(proxy)

        # Test every proxy in parallel, one thread per proxy
        threads = []
        for proxy in self.proxies:
            thread = threading.Thread(target=check, args=[proxy])
            threads.append(thread)
            thread.start()

        for thread in threads:
            thread.join()

        self.broken_proxies = self.proxies - self.working_proxies

    def get_proxy(self):
        if len(self.working_proxies) == 0:
            print('Ran out of working proxies, waiting for some to regenerate...')
            while len(self.working_proxies) == 0:
                self.test_proxies()

        # random.choice needs a sequence, so convert the set to a list
        return random.choice(list(self.working_proxies))

    def add_broken_proxy(self, proxy):
        self.broken_proxies.add(proxy)
        self.working_proxies.discard(proxy)

    def scrape_site(self, url):
        proxy = self.get_proxy()
        try:
            r = requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=3)
            if r.status_code == 200:
                return r.text
        except requests.RequestException:
            # The proxy failed mid-request: retire it and retry with another one
            self.add_broken_proxy(proxy)
            return self.scrape_site(url)

    def proxy_manager(self):
        self.proxies = self.proxies.union(self.broken_proxies)
        self.broken_proxies.clear()

        threading.Timer(5 * 60, self.proxy_manager).start()

rotator = ProxyRotator('proxy_list.txt')
rotator.proxy_manager()

html = rotator.scrape_site('https://quotes.toscrape.com/')
print(html)

Best Rotating Proxy Services

If you don't want the hassle of managing your own proxy lists, you're best off signing up for a paid rotating proxy service. Here are a few of the top providers:

  • Bright Data (formerly Luminati) – Huge peer-to-peer proxy network with over 72 million IPs
  • Smartproxy – Rotating datacenter and residential proxies, shared or private
  • Blazing SEO – Datacenter proxies optimized for fast performance and high success rates
  • Oxylabs – Residential and datacenter proxies with worldwide coverage
  • ScraperAPI – Manages proxy rotation, CAPTCHAs and retries automatically for you

Most of these services offer APIs to automatically rotate the proxies with each request. You just add your API key and the service takes care of the rest.
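
The exact hostname, port, and authentication details vary by provider, so check their docs, but the usage pattern generally looks something like this sketch (the gateway address and credentials below are made-up placeholders):

import requests

# Hypothetical gateway and credentials - replace with the values from
# your provider's dashboard; the real hostname and port will differ
USERNAME = 'your_username'
PASSWORD = 'your_password'
GATEWAY = f'http://{USERNAME}:{PASSWORD}@rotating-gateway.example.com:8000'

# The provider rotates the outgoing IP behind this single endpoint,
# so every request can exit from a different address
r = requests.get('https://quotes.toscrape.com/',
                 proxies={'http': GATEWAY, 'https': GATEWAY},
                 timeout=10)

print(r.status_code)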

Frequently Asked Questions

Are rotating proxies legal?

In most cases, yes. Proxies are just an intermediary between you and the web. It's how you use them that matters.

Rotating proxies are frequently used for perfectly legitimate purposes like web scraping, ad verification, price comparisons, and SEO monitoring. As long as you're not using them to access any illegal content or conduct malicious activity, you're in the clear.

However, it's important to check the terms of service of any website you're scraping. Some sites strictly prohibit the use of any automated bots or scrapers. In those cases, even rotating proxies won't necessarily prevent you from getting blocked.

When should I use a rotating proxy?

Rotating proxies are useful any time you need to make a high volume of requests to one or more websites. Without proxies, you'd quickly get rate limited or IP banned.

Some specific use cases include:

  • Web scraping at scale to extract data from websites
  • Automating social media posts and interactions
  • Monitoring search engine rankings for many keywords
  • Comparing prices across ecommerce sites
  • Anonymizing your web traffic and protecting your privacy
  • Bypassing geo-restrictions and accessing content from other countries

For these applications, rotating proxies are an essential tool. They allow you to spread your requests across many IP addresses to avoid detection and blocking.

How often should proxies be rotated?

It depends on the website you're targeting and how heavily you're scraping it. Some sites are very strict and will ban an IP after just a handful of requests. Others are more lenient and you can make hundreds of requests from one IP.

As a general rule, the more frequently you rotate your proxies, the harder it is for sites to detect and block you. Many rotating proxy services let you set the rotation interval, e.g. getting a new IP on every request, or every 1 minute, 10 minutes, etc.

For large scraping jobs, rotating on every request is the safest bet. The slight overhead from establishing a new connection each time is worth the added reliability. Your requests will be distributed across the maximum number of IPs.
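
If you do want a fixed interval instead of per-request rotation, a small sketch (reusing the working_proxies set from earlier in this guide) could look like this:

import random
import time

ROTATION_INTERVAL = 60  # seconds to keep the same IP before switching

current_proxy = random.choice(list(working_proxies))
last_rotation = time.monotonic()

def get_interval_proxy():
    """Return the current proxy, switching to a new one every ROTATION_INTERVAL seconds."""
    global current_proxy, last_rotation
    if time.monotonic() - last_rotation > ROTATION_INTERVAL:
        current_proxy = random.choice(list(working_proxies))
        last_rotation = time.monotonic()
    return current_proxy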

Conclusion

Whether you're a professional data scientist or just curious about web scraping, proxies are a critical part of your toolkit. With a pool of rotating proxies and some basic Python skills, you can gather data from almost any website at scale.

The tricky part is keeping those proxies working reliably, which means catching and replacing broken ones as you go. The code samples in this guide show you exactly how to do that.

However, setting up your own proxy infrastructure can be a lot of work. If you're serious about web scraping, you may want to outsource it to a professional proxy service that manages the rotating, throttling, and retrying of requests for you.

Whichever route you choose, you're now armed with the knowledge to start diving into the world of anonymous web scraping with Python. Get out there and happy scraping!