How to Scrape PerimeterX Protected Sites in 2024: The Ultimate Guide

Are you trying to scrape a website but getting blocked by "Please verify you are human" or "Press & Hold" challenges? Chances are the site is using PerimeterX, a popular bot detection service that makes web scraping difficult.

Don't worry though – in this in-depth guide, we'll walk you through exactly how PerimeterX works and the most effective techniques to bypass it in 2024. Whether you're a beginner or an experienced scraper, by the end of this article you'll be able to scrape PerimeterX protected sites with ease.

What is PerimeterX?

PerimeterX is a cybersecurity service used by many websites to detect and block automated bot traffic, including web scrapers. Its goal is to prevent bots from accessing site content, submitting forms, or performing other actions that could harm the website or its users.

You've likely encountered PerimeterX before in the form of "Press & Hold" or "Please verify you are human" challenges that appear when you visit a site. These CAPTCHAs aim to determine if you are a real human user or an automated bot. Solving the challenge grants you temporary access to the site.

How PerimeterX Blocks Web Scraping

So how does PerimeterX differentiate bots from human users? It uses a variety of signals and techniques, including:

  • Browser fingerprinting: PerimeterX scripts collect data points about your browser, such as screen size, installed plugins, OS, and more to create a unique "fingerprint". Fingerprints that carry known automation tell-tales, or that show up across an unusually large number of requests, get flagged as bots.

  • Behavioral analysis: Tools monitor user actions like mouse movements, typing patterns, and page navigation to see if the visitor behaves like a human.

  • IP reputation: Activity from IP addresses that are known to belong to data centers, or that have generated suspicious traffic in the past, is blocked.

  • Request patterns: An unusually high volume or frequency of requests from the same client is indicative of bot activity.

When PerimeterX suspects a visitor is a bot instead of a human, it displays a CAPTCHA challenge. Failing to solve the CAPTCHA will result in your access to the site being blocked. Even if you solve it, your IP may still be rate limited or banned if you continue to send bot-like traffic.

This multi-pronged approach makes PerimeterX very difficult to circumvent compared to other anti-bot solutions. As a result, normal web scraping tools like Scrapy and Selenium have a low success rate on sites protected by PerimeterX.
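
Part of the reason is easy to see in a quick experiment. Here is a minimal Python sketch (plain Selenium, purely for illustration) that asks the browser for a few of the data points a fingerprinting script can read. Note how a stock ChromeDriver exposes navigator.webdriver, which is an immediate giveaway:

from selenium import webdriver

# Illustrative only: dump a few of the signals a fingerprinting script can read
driver = webdriver.Chrome()
try:
    fingerprint = driver.execute_script("""
        return {
            userAgent: navigator.userAgent,
            platform: navigator.platform,
            language: navigator.language,
            screen: [screen.width, screen.height],
            pluginCount: navigator.plugins.length,
            webdriver: navigator.webdriver   // true for a stock ChromeDriver
        };
    """)
    print(fingerprint)
finally:
    driver.quit()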

Bypassing PerimeterX with Undetected-Chromedriver

The key to scraping PerimeterX protected sites is to convince it that your scraper is a regular human using a browser. One of the best tools for this is undetected-chromedriver, a patched version of Selenium's ChromeDriver that is optimized to avoid triggering bot detection.

Undetected-chromedriver automatically tweaks Selenium to avoid common signs of automation that trigger bot detection. It does this by:

  • Patching Chrome's navigator.webdriver property so it no longer reports automation
  • Removing other tell-tale ChromeDriver artifacts from the browser environment
  • Letting you set randomized, human-like User Agents
  • Allowing cookie persistence and customization across sessions

While no tool is perfect against PerimeterX, undetected-chromedriver is one of the most reliable options currently available. To further reduce the chance of getting blocked, you should pair it with premium proxies that rotate your IP address with each request.

Here's a basic example of how to use undetected-chromedriver in Python to scrape a PerimeterX protected page:

import undetected_chromedriver as uc

if __name__ == '__main__':
    options = uc.ChromeOptions()

    # Set a humanlike User Agent
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5593.96 Safari/537.36")

    # Launch Chrome with the custom options
    driver = uc.Chrome(options=options)

    driver.get('https://www.example.com')

    # Your scraping code here

    driver.quit()

The key parts are importing undetected_chromedriver, setting a User Agent to mimic a human user, and launching Chrome with the custom options. You can then visit the target URL and run your scraping logic as usual.
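
As a quick sanity check (a minimal sketch, not part of the scraping flow itself), you can ask the launched browser whether the usual automation flag is visible. With undetected-chromedriver, navigator.webdriver should not come back as true:

import undetected_chromedriver as uc

driver = uc.Chrome()
try:
    # A stock ChromeDriver would return True here; undetected-chromedriver
    # patches the property so it comes back empty or False instead
    print(driver.execute_script("return navigator.webdriver"))
finally:
    driver.quit()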

To add rotating premium proxies, you can specify the --proxy-server argument when launching Chrome:

options.add_argument('--proxy-server=46.29.20.183:8118')

Make sure to rotate to a fresh proxy IP regularly to avoid rate limiting. With a real browser this usually means launching each session behind a different proxy, or pointing Chrome at a provider's rotating gateway. You can find reliable proxies from providers like Bright Data or Zyte.
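
Here is a rough sketch of that per-session rotation, assuming you have a small pool of proxy endpoints from your provider (the addresses below are placeholders):

import random
import undetected_chromedriver as uc

# Placeholder proxy pool: substitute the IP:port entries from your provider
PROXIES = [
    "203.0.113.10:8000",
    "203.0.113.11:8000",
    "203.0.113.12:8000",
]

def new_driver():
    """Launch a fresh Chrome instance behind a randomly chosen proxy."""
    options = uc.ChromeOptions()
    options.add_argument(f"--proxy-server={random.choice(PROXIES)}")
    return uc.Chrome(options=options)

driver = new_driver()
driver.get('https://www.example.com')
# ... scrape, then quit and call new_driver() again for the next session ...
driver.quit()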

With undetected-chromedriver and rotating IPs, you should be able to scrape most PerimeterX protected sites without triggering bot detection. However, it's not a silver bullet.

Limitations of Undetected-Chromedriver

While a big improvement over regular Selenium, undetected-chromedriver still has some limitations:

  • Slow – Spinning up a full Chrome instance, waiting for page elements to load, and solving CAPTCHAs is time and resource intensive compared to lightweight scraping libraries.

  • Not 100% undetectable – Advanced bot fingerprinting used by some sites can still identify undetected-chromedriver as a bot. It's a cat-and-mouse game staying ahead of the latest techniques.

  • Difficult to scale – Running hundreds or thousands of concurrent Chrome instances for large scraping jobs is challenging to manage yourself in terms of CPU and memory requirements.

If you find yourself still getting blocked with undetected-chromedriver, need to scrape at scale, or simply want to save time and resources, you may want to consider using a web scraping API instead.

Using a Web Scraping API

A web scraping API is a service that handles the entire scraping process – fetching pages, parsing data, dealing with CAPTCHAs, and avoiding bans – and returns the data you need via an API endpoint. You simply specify the URL and data you want, and the API does the rest.

For PerimeterX protected sites, using a scraping API is often easier and more reliable than trying to bypass the anti-bot measures yourself. The API provider's infrastructure is optimized to simulate human users and adapt to the latest defenses.

ScrapingBee and ScraperAPI are two popular scraping APIs that work well on PerimeterX sites. Here's how to use the ScrapingBee API in Python:

import requests

API_KEY = 'PASTE_YOUR_API_KEY_HERE'

# The page you want to scrape
url = 'https://www.example.com'

# Route the request through ScrapingBee's API endpoint
response = requests.get(
    'https://app.scrapingbee.com/api/v1/',
    params={
        'api_key': API_KEY,
        'url': url,
        'render_js': 'true',
    },
)

print(response.text)

Simply replace the API_KEY with your own key, set the URL you want to scrape, and any other configuration options. The API will return the page HTML which you can then parse to extract the desired data.
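
For example, a minimal parsing step with BeautifulSoup (assuming beautifulsoup4 is installed) might pull the page title and links out of the HTML returned above:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# 'response' is the requests response from the ScrapingBee call above
soup = BeautifulSoup(response.text, 'html.parser')

# Grab a couple of common data points as an illustration
page_title = soup.title.get_text(strip=True) if soup.title else None
links = [a['href'] for a in soup.find_all('a', href=True)]

print(page_title)
print(len(links), 'links found')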

Using a scraping API abstracts away all the complexity of avoiding PerimeterX and saves you significant time and effort. The tradeoff is cost – most APIs charge a per-request fee. But for any kind of serious scraping project, the cost is well worth the reliability and scalability.

Step-by-Step Undetected-Chromedriver Setup

If you do want to go the DIY route with undetected-chromedriver for scraping PerimeterX sites, here's a step-by-step guide to getting set up:

  1. Install Python and pip if you don't already have them.

  2. Create a new virtual environment to isolate the project:

    python -m venv myenv
    source myenv/bin/activate
  3. Install undetected-chromedriver:

    pip install undetected-chromedriver
  4. In your Python script, import the necessary libraries:

    import undetected_chromedriver as uc
    import random
    from fake_useragent import UserAgent
  5. Set up ChromeDriver options with a randomized User Agent and a proxy:

    options = uc.ChromeOptions()

    ua = UserAgent()
    user_agent = ua.random
    options.add_argument(f'user-agent={user_agent}')

    options.add_argument('--proxy-server=191.37.227.48:5098')
  6. Launch Chrome and visit the target site:

    driver = uc.Chrome(options=options)
    driver.get('https://www.example.com')
  7. Interact with the page to solve any CAPTCHAs. This may require using Selenium methods like click(), send_keys(), etc. to mimic human behavior (see the sketch just after this list).

  8. Once you have access to the page, run your scraping logic to extract the data you need. Remember to randomize your requests to avoid being fingerprinted.

  9. When you're done, close the browser:

    driver.quit()
  10. Run your script and check that everything works. If you encounter any issues, debug by inspecting the browser console and network traffic.
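
For step 7, the exact interaction depends on the challenge you get. A common pattern for the "Press & Hold" widget is a long click-and-hold with Selenium's ActionChains. This is only a sketch: it reuses the driver from step 6, and the #px-captcha selector is an assumption you should verify against the actual challenge markup on your target site:

import time
import random
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

# Assumed selector for the challenge element; inspect the page to confirm it
hold_button = driver.find_element(By.CSS_SELECTOR, '#px-captcha')

ActionChains(driver).click_and_hold(hold_button).perform()
time.sleep(random.uniform(6, 10))   # keep holding for a humanlike duration
ActionChains(driver).release(hold_button).perform()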

That's the basic setup for using undetected-chromedriver to scrape PerimeterX protected sites. As you can see, there are quite a few steps involved compared to using a scraping API. But it gives you full control over the process.

Tips to Avoid Getting Blocked

Regardless of whether you use undetected-chromedriver or a scraping API to bypass PerimeterX, there are some general tips you should follow to minimize the risk of being detected:

  • Rotate your IP address with each request. This makes it harder to track your activity.

  • Randomize your User Agent on each request to avoid browser fingerprinting. Use real-world UAs that match your target audience.

  • Add random delays between requests to mimic human browsing patterns. Avoid making too many requests too quickly.

  • If using Selenium, randomize your window size. Real users have a variety of screen resolutions.

  • Set proper HTTP headers like Accept-Language and Referer where appropriate.

  • Avoid malformed requests and inconsistent header combinations, which are a telltale sign of a bot.

  • If you encounter a CAPTCHA, use a CAPTCHA solving service or try changing your IP.

The key is to be as undetectable as possible so that your requests blend in with normal user traffic. Any patterns or unusual activity will quickly get you blocked.
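
Here is a short sketch putting a few of these tips together with undetected-chromedriver and the fake_useragent package (pip install fake-useragent); the URLs below are placeholders:

import time
import random
import undetected_chromedriver as uc
from fake_useragent import UserAgent

ua = UserAgent()

options = uc.ChromeOptions()
options.add_argument(f'user-agent={ua.random}')   # fresh, real-world UA per session

driver = uc.Chrome(options=options)

# Real users have a variety of screen resolutions
driver.set_window_size(random.randint(1200, 1920), random.randint(700, 1080))

for page in ['https://www.example.com/page1', 'https://www.example.com/page2']:
    driver.get(page)
    time.sleep(random.uniform(3, 8))   # random, humanlike delay between requests
    # ... extract data here ...

driver.quit()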

Dealing with IP Bans and CAPTCHAs

Even with the best setup, you may still encounter CAPTCHAs or find your IP address banned when scraping PerimeterX sites. It's important to have a system in place to handle these roadblocks.

For CAPTCHAs, you have a few options:

  1. Solve them manually yourself if the volume is low

  2. Use a CAPTCHA solving service like 2captcha or DeathByCaptcha that uses human workers to solve CAPTCHAs at scale

  3. Use computer vision or OCR libraries such as OpenCV to try to solve image CAPTCHAs programmatically (low success rate)

  4. Rotate to a new IP address and user agent and try again

Getting around IP bans usually requires waiting it out or swapping to a new IP via proxies or VPNs. Residential proxies tend to be more resilient against bans than data center IPs.
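
Here is a rough sketch of option 4, retrying behind a fresh proxy when the page looks blocked. The proxy pool and the block-detection strings are assumptions you would adapt to your own provider and target site:

import random
import undetected_chromedriver as uc

PROXIES = ['203.0.113.20:8000', '203.0.113.21:8000']  # placeholder pool
BLOCK_MARKERS = ['Press & Hold', 'Please verify you are human']  # adapt as needed

def fetch_with_retries(url, max_attempts=3):
    """Fetch a page, swapping to a new proxy whenever it looks blocked."""
    for _ in range(max_attempts):
        options = uc.ChromeOptions()
        options.add_argument(f'--proxy-server={random.choice(PROXIES)}')
        driver = uc.Chrome(options=options)
        try:
            driver.get(url)
            html = driver.page_source
            if not any(marker in html for marker in BLOCK_MARKERS):
                return html
        finally:
            driver.quit()
    return None

html = fetch_with_retries('https://www.example.com')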

PerimeterX Web Scraping FAQs

To wrap up, let's answer some frequently asked questions about scraping PerimeterX sites:

Q: Can I use a headless browser to avoid PerimeterX detection?
A: Headless browsers are often detectable by anti-bot scripts. It's better to run a full, headed browser like undetected-chromedriver in non-headless mode.

Q: What's the best programming language for scraping PerimeterX sites?
A: Python, JavaScript, and Java all have good tools for browser automation and scraping. But Python's simplicity and extensive library ecosystem make it a top choice.

Q: Are free proxies sufficient or do I need paid ones?
A: Free proxies are often slow, unreliable, and may already be banned. It's worth investing in paid proxies from a reputable provider for the best results.

Q: How often do I need to update my scraping scripts?
A: PerimeterX and other anti-bot vendors are constantly adapting their techniques. You may need to update your scripts every few weeks or months to stay ahead of the arms race.

Q: Is it legal to scrape websites protected by PerimeterX?
A: Web scraping itself is legal in most jurisdictions. However, you should always respect a site's terms of service and robots.txt file. Don't scrape copyrighted content or anything behind a login wall without permission.

Conclusion

As you can see, bypassing PerimeterX bot detection to scrape websites is a complex task requiring careful consideration. Using an optimized, humanlike automated browser like undetected-chromedriver (in headed mode) in combination with rotating premium proxies is currently the most effective DIY method.

However, this DIY approach has some limitations in terms of scalability and long-term success rates. For large scale scraping of PerimeterX sites, you may want to consider using a web scraping API service instead to offload the work.

Whichever route you choose, make sure to implement request patterns that closely mimic human behavior. Randomize IPs, user agents, headers, delays, mouse movements, and anything else you can to avoid detection. Even then, be prepared to deal with CAPTCHAs and IP bans when they inevitably happen.

Web scraping is a constant game of cat and mouse, especially when dealing with sophisticated anti-bot measures like PerimeterX. But by understanding how it works and using the right tools and techniques covered in this guide, you can tilt the odds of success in your favor. Now get out there and start scraping those PerimeterX protected sites!