How to Find HTML Elements by Multiple Tags with DOM Crawler

Web scraping is the process of automatically extracting data and content from websites. It allows you to gather information at scale from online sources. PHP is a popular language for web scraping due to its simplicity and powerful ecosystem of libraries and tools.

One of the most useful PHP libraries for web scraping is called DOM Crawler. DOM Crawler provides an intuitive way to parse and traverse HTML documents so that you can find and extract the data you need.

In this guide, we'll take an in-depth look at how to use DOM Crawler to find HTML elements by multiple tag names. We'll cover what DOM Crawler is, how to use it to load webpages, and the different ways to find elements, complete with code examples. Let's get started!

What is DOM Crawler?

DOM Crawler is a component of the Symfony framework that allows you to parse, traverse, and extract data from HTML and XML documents. It provides a simple, object-oriented interface for working with the Document Object Model (DOM) of a webpage.

With DOM Crawler, you can load the HTML of a URL or string and then find elements within it using methods like filter() and filterXPath(). You can chain these methods together to perform complex queries and traverse the DOM tree. Once you've found the elements you need, you can access their data and attributes.
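
For instance, chaining might look like the following sketch (the markup is invented for illustration, and it assumes the symfony/dom-crawler and symfony/css-selector packages are installed via Composer):

```php
use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><nav><a href="/a">A</a><a href="/b">B</a></nav></body></html>';

$crawler = new Crawler($html);

// Chain filter() calls to narrow the selection, then map over
// the matched links with each() to collect their href values.
$hrefs = $crawler->filter('nav')->filter('a')->each(
    fn (Crawler $link) => $link->attr('href')
);

echo implode(', ', $hrefs); // /a, /b
```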

DOM Crawler is lightweight, flexible, and easy to use. It doesn't require a headless browser like Puppeteer or Selenium, making it a good choice for simple web scraping tasks. However, it cannot execute JavaScript or handle complex, client-side rendered sites.

Loading an HTML Document

Before you can start finding elements with DOM Crawler, you first need to load an HTML document. There are a couple of ways to do this.

If you already have the HTML as a string, you can pass it directly to the Crawler constructor:

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body></body></html>';

$crawler = new Crawler($html);

More commonly, you'll want to load the HTML by making an HTTP request to a URL. To do this, you can use a library like Guzzle to fetch the page content and then pass the response body to Crawler:

use Symfony\Component\DomCrawler\Crawler;
use GuzzleHttp\Client;

$client = new Client();
$response = $client->get('https://example.com');
$html = $response->getBody()->getContents();

$crawler = new Crawler($html);

Once you've loaded an HTML document into a Crawler instance, you're ready to start finding elements!

Finding Elements by Tag Name

The simplest way to find elements with DOM Crawler is by tag name. To do this, you pass the tag name to the filter() method (which accepts CSS selectors and requires the symfony/css-selector component to be installed):

$crawler = $crawler->filter('h1');

This will return a new Crawler instance containing all the <h1> elements found in the document.

You can also pass any valid CSS selector to filter(). For example, to find all <a> elements within a <nav> element:

$crawler = $crawler->filter('nav a');

Finding Elements by Multiple Tags

Often, you'll want to find elements that match multiple tag names. For example, let's say you want to extract all the heading elements from a page, including <h1>, <h2>, <h3>, etc.

To find elements by multiple tags with DOM Crawler, you can pass a comma-separated list of tag names to filter():

$crawler = $crawler->filter('h1, h2, h3, h4, h5, h6');

This will return a Crawler instance containing all the elements that match any of the listed tag names.

Here's a complete example that demonstrates finding elements by multiple tags:

use Symfony\Component\DomCrawler\Crawler;
use GuzzleHttp\Client;

$client = new Client();
$response = $client->get('https://en.wikipedia.org/wiki/Web_scraping');
$html = $response->getBody()->getContents();

$crawler = new Crawler($html);

$headings = $crawler->filter('h1, h2, h3, h4, h5, h6');

foreach ($headings as $heading) {
    echo $heading->textContent."\n";
}

This code will output the text of all the heading elements found on the Wikipedia page for web scraping:

Web scraping
Techniques
Legal issues
United States
Australia
Methods to prevent web scraping
See also
Technical measures
Blocking
Anti-bot measures
Application firewall
Rate limiting
Legal measures
Related topics
References
External links
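
Iterating the Crawler as above yields raw DOMNode objects. If you'd rather stay in the Crawler API, the each() method runs a callback on every matched node (each one wrapped in its own Crawler) and collects the return values into an array; here's a minimal sketch on an invented snippet of HTML:

```php
use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><h1>Title</h1><p>Intro</p><h2>Section</h2></body></html>';

$crawler = new Crawler($html);

// each() wraps every matched node in a Crawler and gathers
// the callback results, here the text of each heading.
$headings = $crawler->filter('h1, h2, h3, h4, h5, h6')->each(
    fn (Crawler $node) => $node->text()
);

echo implode("\n", $headings); // Title, then Section
```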

Finding Elements by Attribute

In addition to tag names, you can also find elements by their attributes using DOM Crawler. This allows you to locate elements based on things like IDs, classes, data attributes, and more.

To find elements by attribute, you use attribute selectors in the CSS selector you pass to filter(). For example, to find all elements whose class attribute is exactly "example":

$crawler = $crawler->filter('[class="example"]');

Note that [class="example"] matches the attribute value exactly; to match elements that have "example" among several classes, use the class selector .example (or [class~="example"]).

You can use any of the supported attribute selectors, such as:

  • [attribute] – matches elements with the attribute, regardless of value
  • [attribute="value"] – matches elements with the attribute and exact value
  • [attribute*="value"] – matches elements with the attribute and value containing the substring
  • [attribute^="value"] – matches elements with the attribute and value starting with the substring
  • [attribute$="value"] – matches elements with the attribute and value ending with the substring
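
To make these selectors concrete, here is a small sketch; the markup and attribute values are invented for illustration:

```php
use Symfony\Component\DomCrawler\Crawler;

$html = '<div>'
      . '<a href="https://example.com/docs">Docs</a>'
      . '<a href="/about" data-id="7">About</a>'
      . '</div>';

$crawler = new Crawler($html);

// [attribute]: links that carry a data-id attribute, any value
echo $crawler->filter('a[data-id]')->text()."\n";        // About

// [attribute^="value"]: links whose href starts with "https"
echo $crawler->filter('a[href^="https"]')->text()."\n";  // Docs

// [attribute$="value"]: links whose href ends with "docs"
echo $crawler->filter('a[href$="docs"]')->text()."\n";   // Docs
```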

You can combine attribute selectors with tag names for more precise matching:

// Find all <a> elements whose class attribute is exactly "external"
$crawler = $crawler->filter('a[class="external"]');

// Find all elements with a "data-id" attribute
$crawler = $crawler->filter('[data-id]');

Finding Elements by XPath

For even more advanced element finding, you can use XPath expressions with DOM Crawler's filterXPath() method. XPath is a powerful query language for navigating XML and HTML documents.

To find elements using XPath, you pass an XPath expression to filterXPath():

$crawler = $crawler->filterXPath('//h1');

This will return all the <h1> elements in the document.

Some common XPath expressions include:

  • //tag – finds all <tag> elements anywhere in the document
  • /tag – matches a <tag> element only if it is the document's root element
  • //tag[@attribute="value"] – finds <tag> elements with a matching attribute and value
  • //tag[contains(@attribute, "value")] – finds <tag> elements whose attribute contains the value

XPath allows you to find elements based on their position, relationships, and more. Note that positional predicates are evaluated relative to each element's parent: //p[3] matches every <p> that is the third <p> child of its parent. To find the third <p> element in the whole document, wrap the expression in parentheses:

$crawler = $crawler->filterXPath('(//p)[3]');

To find all <a> elements that are direct children of <p> elements:

$crawler = $crawler->filterXPath('//p/a');

As you can see, XPath offers a lot of flexibility and specificity for finding elements. However, it can be less readable and maintainable than CSS selectors for simpler queries.

Accessing Element Data

Once you've found the elements you need with DOM Crawler, you'll want to access their data and attributes so you can output or save them.

To get the text content of an element, use the Crawler's text() method (which returns the text of the first matched node), or read the textContent property of the underlying DOMNode objects when iterating:

foreach ($crawler as $node) {
    $text = $node->textContent; 
    echo $text."\n";
}

To get an element's HTML content, including its child elements, use the Crawler's html() method. Iterating a Crawler yields plain DOMNode objects, which don't have an innerHTML property in PHP's DOM extension, so wrap each node in its own Crawler first:

foreach ($crawler as $node) {
    $html = (new Crawler($node))->html();
    echo $html."\n";
}

To get the value of an element's attribute, use the Crawler's attr() method and pass the attribute name, or call getAttribute() on the underlying DOM elements when iterating:

foreach ($crawler as $node) {
    $src = $node->getAttribute('src');
    echo $src."\n";
}

You can also extract data from forms and inputs using DOM Crawler:

$form = $crawler->filter(‘form‘)->form();
$name = $form->get(‘name‘)->getValue();

This finds a <form> element, gets the form field named "name", and retrieves its value.
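
Here's a fuller sketch with a hypothetical login form (the field names and URL are made up; the second constructor argument gives the Crawler a base URI so the form's action can be resolved):

```php
use Symfony\Component\DomCrawler\Crawler;

// A hypothetical form; field names are invented for illustration.
$html = '<form method="post" action="/login">'
      . '<input type="text" name="username" value="alice">'
      . '<input type="password" name="password" value="">'
      . '<button type="submit">Log in</button>'
      . '</form>';

// The second argument is the page URI, used to resolve the form action.
$crawler = new Crawler($html, 'https://example.com/login');

$form = $crawler->filter('form')->form();

// Read a single field's value...
echo $form->get('username')->getValue()."\n"; // alice

// ...or every field at once, as name => value pairs.
print_r($form->getValues());
```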

Tips for Using DOM Crawler

Here are a few best practices and tips to keep in mind when using DOM Crawler for web scraping:

  • Always be respectful of the websites you scrape and follow their robots.txt rules
  • Cache the pages you've already scraped to avoid unnecessary requests
  • Set a realistic request delay to avoid overloading servers
  • Handle errors and exceptions gracefully
  • Extract only the data you need to keep your scraping efficient
  • Use concurrency when possible to speed up scraping
  • Rotate user agents and IP addresses if you run into rate limiting or blocking
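
A few of these tips can be sketched together. This is only an outline under stated assumptions; the URLs, user agent string, and delay below are placeholders, not recommendations:

```php
use Symfony\Component\DomCrawler\Crawler;
use GuzzleHttp\Client;
use GuzzleHttp\Exception\GuzzleException;

$client = new Client([
    'timeout' => 10,
    'headers' => ['User-Agent' => 'MyScraper/1.0 (contact@example.com)'],
]);

$urls = ['https://example.com/page1', 'https://example.com/page2'];

foreach ($urls as $url) {
    try {
        $response = $client->get($url);
    } catch (GuzzleException $e) {
        // Handle errors gracefully instead of letting one bad
        // page abort the whole run.
        error_log("Failed to fetch $url: ".$e->getMessage());
        continue;
    }

    $crawler = new Crawler($response->getBody()->getContents());
    // ... extract only the data you need here ...

    // A realistic delay between requests (one second here).
    sleep(1);
}
```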

With DOM Crawler, CSS/XPath selectors, and HTTP clients like Guzzle, you can build robust and efficient web scrapers in PHP. Hopefully this guide has helped you understand the basics of finding elements by tag name and more advanced techniques.

Conclusion

In this guide, we covered how to use DOM Crawler to find HTML elements by one or more tag names. We also looked at alternative ways to find elements, such as by attributes or XPath expressions.

To recap, here are the key steps:

  1. Install DOM Crawler using Composer
  2. Load an HTML document from a URL or string using Crawler
  3. Find elements by passing tag names to filter()
  4. Find elements by multiple tags by passing comma-separated tag names to filter()
  5. Access element data using properties like textContent and methods like getAttribute()

DOM Crawler is a powerful tool for parsing and scraping web pages in PHP. With a basic understanding of CSS selectors and DOM traversal, you can build scripts to efficiently extract data from websites.

However, DOM Crawler is just one piece of the web scraping puzzle. You'll also need to make HTTP requests, handle authentication and rate limiting, and format and save your scraped data.

To learn more, check out the official Symfony DomCrawler documentation.

I hope you found this guide helpful for learning how to use DOM Crawler to find elements by multiple tags and more. Happy scraping!