How to Scrape Data from Twitter Using Python and Selenium

Twitter is a gold mine of data for anyone interested in social media analysis, sentiment analysis, news tracking, brand monitoring, and more. While Twitter provides official APIs for accessing data, those APIs are limited in historical reach, rate limits, and the data they expose.

Web scraping provides an alternative way to collect Twitter data that is more flexible and customizable. And Python's Selenium package makes it relatively easy to programmatically scrape data from Twitter's frontend web pages.

In this guide, I'll walk through how to use Python and Selenium to scrape key data points from Twitter profiles and tweets, including:

  • Profile metadata like name, handle, bio, location, website, join date, and follower/following counts
  • Individual tweet data like text, timestamp, likes, retweets, and replies
  • Scrolling to collect tweets beyond the initial page load
  • Storing the scraped data
  • Handling common challenges and following best practices

Whether you're a data scientist, software engineer, researcher, or business analyst, this guide will equip you with the knowledge you need to scrape Twitter effectively. Let's dive in!

Prerequisites and Setup

Before we start scraping, we need to set up our environment. Here are the key things you'll need:

  1. Python 3.6+ installed
  2. Selenium package installed
  3. WebDriver for your browser of choice

I recommend using Google Chrome for scraping Twitter, as it tends to be the most stable and best supported by Selenium. You can install Selenium with pip and ChromeDriver with your system's package manager:

pip install selenium
brew install chromedriver  # on macOS
apt-get install chromium-chromedriver  # on Linux 

Make sure ChromeDriver is on your system PATH. (Recent Selenium releases, 4.6 and later, bundle Selenium Manager, which can also fetch a matching driver automatically.) On Windows, you may need to download the ChromeDriver executable and specify its path explicitly, as shown below.
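
For example, on Windows you can point Selenium at a manually downloaded driver via a Service object. A quick sketch; the path below is a placeholder, not a convention:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# hypothetical path: replace with wherever you saved chromedriver.exe
service = Service(executable_path="C:/tools/chromedriver.exe")
driver = webdriver.Chrome(service=service)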

We're now ready to start building our scraper!

Scraping Twitter Profile Data

The first thing our scraper will do is load a given Twitter profile page and extract key metadata. We'll extract:

  • Profile name
  • Profile handle (@username)
  • Bio
  • Location
  • Website
  • Join date
  • Follower count
  • Following count

Here's the code to do that:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # initialize ChromeDriver
driver.get("https://twitter.com/elonmusk")  # load Twitter profile

# wait for the profile header to render before reading from it
wait = WebDriverWait(driver, 10)

# extract profile name and handle (both live in the UserName div, separated by a newline)
name_element = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, "div[data-testid='UserName']")))
name = name_element.text.split("\n")[0]
handle = name_element.text.split("\n")[1].lstrip("@")  # drop the leading @

# extract profile metadata
# note: Twitter's data-testid attributes change periodically, so re-verify
# these selectors against the live page source before relying on them
bio = driver.find_element(By.CSS_SELECTOR, "div[data-testid='UserDescription']").text
location = driver.find_element(By.CSS_SELECTOR, "span[data-testid='UserLocation']").text
website = driver.find_element(By.CSS_SELECTOR, "a[data-testid='UserUrl']").get_attribute("href")
join_date = driver.find_element(By.CSS_SELECTOR, "span[data-testid='UserJoinDate']").text

# extract follower and following counts
follower_count = driver.find_element(By.CSS_SELECTOR, "div[data-testid='primaryColumn'] div[data-testid='follower_count'] a span").text
following_count = driver.find_element(By.CSS_SELECTOR, "div[data-testid='primaryColumn'] div[data-testid='following_count'] a span").text

print(f"Profile: {name} (@{handle})")
print(f"Bio: {bio}")
print(f"Location: {location}")
print(f"Website: {website}")
print(f"Joined: {join_date}")
print(f"Followers: {follower_count}")
print(f"Following: {following_count}")

This code uses CSS selectors, based on inspecting Twitter's page source, to locate the HTML elements containing the desired data points. WebDriverWait is used to wait for the profile header to appear before we start reading from the page.

A few things to note:

  • The profile name and handle are combined in one div, so we split them apart
  • The follower/following counts are inside spans nested within anchor (a) tags
  • We print out the extracted data at the end for demonstration purposes

When run on a profile like @elonmusk, the output looks like:

Profile: Elon Musk (@elonmusk)
Bio: Technoking of Tesla, Imperator of Mars
Location: Earth
Website: https://t.co/IaEUcVoJ7e
Joined: June 2009
Followers: 125.6M
Following: 184

We've now successfully extracted key profile metadata! Let's move on to scraping tweets.

Scraping Tweet Data

In addition to profile metadata, much of the real value lies in the content of the tweets themselves. Let's extend our script to also collect the following data points for each tweet:

  • Tweet text
  • Timestamp
  • Number of replies
  • Number of retweets
  • Number of likes

We'll grab tweets from the initial page load to start. Here's the code:

# extract tweets from the initial page load
# note: tweets render as article elements with data-testid="tweet";
# as with the profile selectors, verify this against the live markup
tweets = driver.find_elements(By.CSS_SELECTOR, "article[data-testid='tweet']")
for tweet in tweets:
    text = tweet.find_element(By.CSS_SELECTOR, "div[data-testid='tweetText']").text
    timestamp = tweet.find_element(By.TAG_NAME, "time").get_attribute("datetime")
    reply_count = tweet.find_element(By.CSS_SELECTOR, "div[data-testid='reply']").text
    retweet_count = tweet.find_element(By.CSS_SELECTOR, "div[data-testid='retweet']").text
    like_count = tweet.find_element(By.CSS_SELECTOR, "div[data-testid='like']").text

    print(f"Tweet: {text}\nTime: {timestamp}\nReplies: {reply_count} Retweets: {retweet_count} Likes: {like_count}\n")

For each tweet, we find the elements containing the text, timestamp, and reply/retweet/like counts, then extract their values. The timestamp is stored in the datetime attribute of the time element as an ISO 8601 string.
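
Since the timestamp comes back as an ISO 8601 string, you can parse it into a Python datetime for analysis. A minimal sketch:

from datetime import datetime

# the time element's datetime attribute ends in "Z"; swap it for an explicit
# UTC offset so fromisoformat() accepts it on Python versions before 3.11
raw = "2022-06-02T18:31:29.000Z"
parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
print(parsed.date())  # 2022-06-02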

Running the tweet loop on @elonmusk yields output like:

Tweet: Tesla AI Day pushed to Sept 30, as we may have an Optimus prototype working by then
Time: 2022-06-02T18:31:29.000Z 
Replies: 17.1K Retweets: 26.4K Likes: 329.6K

Tweet: Starship launch attempt soon https://t.co/QBprVsAhX5
Time: 2023-03-09T15:36:41.000Z
Replies: 9,510 Retweets: 6,834 Likes: 141K

...

We're now collecting tweets! However, this only grabs the initial tweets loaded on the page. To get more tweets, we need to scroll down the page.

Scrolling to Load More Tweets

Twitter loads more tweets as you scroll down the page. To scrape historical tweets, we need our scraper to mimic this scrolling behavior.

We can achieve this by using Selenium's execute_script method to run JavaScript that scrolls the page. Here's the code to collect tweets while scrolling:

import time

# scroll the page and extract tweets as they load
tweet_data = []
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # scrape all currently rendered tweets before scrolling further
    # (get_tweet_data is defined below)
    tweets = driver.find_elements(By.CSS_SELECTOR, "article[data-testid='tweet']")
    for tweet in tweets:
        data = get_tweet_data(tweet)
        if data and data not in tweet_data:  # skip failed extractions and duplicates
            tweet_data.append(data)

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give new tweets time to load
    new_height = driver.execute_script("return document.body.scrollHeight")

    if new_height == last_height:
        break  # no new content loaded, so we have reached the end
    last_height = new_height

print(f"Collected {len(tweet_data)} tweets")

Here's a breakdown of what this code does:

  1. Initialize a list tweet_data to store the scraped tweets
  2. Get the initial scrollHeight of the page
  3. Enter a loop that:
    • Scrapes all currently rendered tweets, skipping failures and duplicates
    • Scrolls to the bottom of the page
    • Waits 2 seconds for new tweets to load
    • Gets the new scrollHeight
    • Exits the loop if the scrollHeight hasn't changed

The get_tweet_data function extracts the tweet text, timestamp, and metrics from each tweet element, with error handling:

from selenium.common.exceptions import NoSuchElementException

def get_tweet_data(tweet):
    try:
        text = tweet.find_element(By.CSS_SELECTOR, "div[data-testid='tweetText']").text
        timestamp = tweet.find_element(By.TAG_NAME, "time").get_attribute("datetime")
        reply_count = tweet.find_element(By.CSS_SELECTOR, "div[data-testid='reply']").text
        retweet_count = tweet.find_element(By.CSS_SELECTOR, "div[data-testid='retweet']").text
        like_count = tweet.find_element(By.CSS_SELECTOR, "div[data-testid='like']").text
    except NoSuchElementException:
        return None  # skip elements missing an expected field (e.g. promoted content)

    return (text, timestamp, reply_count, retweet_count, like_count)
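
One performance note: checking whether each tweet is already in tweet_data rescans the whole list on every iteration. For long scrapes, tracking seen tweets in a set is much faster. A sketch, de-duplicating by timestamp and text:

# faster de-duplication for long scrapes: track seen tweets in a set
seen = set()
unique_tweets = []
for data in tweet_data:
    key = (data[1], data[0])  # (timestamp, text) as a cheap identity proxy
    if key not in seen:
        seen.add(key)
        unique_tweets.append(data)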

With scrolling in place, our scraper can now collect a large number of historical tweets! Bear in mind that Twitter only serves a limited window of a profile's timeline through scrolling, so the oldest tweets of a prolific account may stay out of reach.

Storing the Scraped Data

As a final step, let's store our scraped profile metadata and tweet data for future analysis. We'll write it to a JSON file for simplicity, but you could also use a CSV, database, or other storage method.

Here's the code:

import json

# store profile metadata
profile_data = {
    "name": name,
    "handle": handle, 
    "bio": bio,
    "location": location,
    "website": website,
    "join_date": join_date,
    "follower_count": follower_count,
    "following_count": following_count
}

# store tweet data
tweets_data = []
for tweet in tweet_data:
    tweets_data.append({
        "text": tweet[0],
        "timestamp": tweet[1],
        "reply_count": tweet[2],
        "retweet_count": tweet[3],
        "like_count": tweet[4]
    })

# write data to JSON file
with open("twitter_data.json", "w") as f:
    json.dump({"profile": profile_data, "tweets": tweets_data}, f)

print("Scraped data written to twitter_data.json")

This code creates dictionaries for the profile metadata and tweet data, then writes it all to a twitter_data.json file. You can modify this to store the data however you prefer.
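
If you'd rather have a flat file for spreadsheets or pandas, here's a minimal CSV variant, reusing the tweet_data list of tuples from earlier:

import csv

# write the scraped tweets to CSV instead of JSON
with open("twitter_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "timestamp", "reply_count", "retweet_count", "like_count"])
    writer.writerows(tweet_data)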

Challenges and Best Practices

While we've covered the core components of a Twitter scraper, there are some additional challenges to be aware of and best practices to follow:

  • Rate limiting: Twitter may block your IP if you make too many requests too quickly. Add randomized delays between requests (see the sketch after this list) and consider cycling through a pool of proxies.

  • Account login: Our scraper collected public data, but some data requires login. You can use Selenium to automate the login process. But be very careful with your login credentials!

  • JavaScript rendering: Twitter renders a lot of data on the frontend with JavaScript. Selenium handles this, but you may need to add explicit waits for certain elements to appear.

  • Inconsistent page loading: Web pages don't always load consistently with dynamic content. Add error handling to your element selectors and rerun your scraper if needed.

  • Ethical scraping: Respect the website's terms of service, robots.txt, and rate limits. Don't collect personal data. Consider the impact and purpose of your scraping.
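
Here's a small sketch combining two of these practices: a randomized delay helper for politeness, and an explicit wait so scraping only starts once tweets have actually rendered. The 2-5 second range and 10-second timeout are arbitrary starting points, not magic numbers:

import random
import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def polite_pause(min_s=2.0, max_s=5.0):
    """Sleep a random interval so requests don't land at a fixed cadence."""
    time.sleep(random.uniform(min_s, max_s))

# wait up to 10 seconds for the first tweet to render before scraping
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "article[data-testid='tweet']"))
)
polite_pause()  # then pause before the next scroll or request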

By being aware of these factors, you can build a robust and responsible Twitter scraper. Always aim to be respectful and only collect what you need.

Conclusion

We've walked through how to scrape key data points from Twitter profiles and tweets using Python and Selenium. With this foundation, you can collect Twitter data at scale for all sorts of analysis and modeling projects.

The full code for this guide is available on GitHub. Feel free to use and modify it for your own projects!

Some potential use cases for a Twitter scraper include:

  • Analyzing trends and sentiment around topics, hashtags, or accounts over time
  • Monitoring brand mentions and customer feedback
  • Creating datasets for NLP, network analysis, and ML models
  • Archiving tweets for research or historical purposes

Whatever your goal, remember to scrape ethically and respect the terms of service. Happy scraping!