Mastering the wget Command: A 20-Example Troubleshooting Guide for Sysadmins

As a fellow sysadmin, I know you rely on wget just as much as I do for web scraping, downloading files, and troubleshooting connectivity issues. It‘s easily one of the most useful *nix utilities out there.

In fact, wget has been a staple of developer and IT toolkits for over two decades – you'd be hard pressed to find a sysadmin who hasn't reached for it in the past year.

And with good reason – this powerful command line tool can help automate many tasks we deal with daily.

In this comprehensive 4,000+ word guide, I‘m going to demonstrate 20 handy wget command examples to simplify your job.

You‘ll learn tips like:

  • Downloading entire websites for offline development
  • Retrying failed downloads
  • Controlling bandwidth usage
  • Automating logins
  • Using proxies and Tor for privacy
  • Debugging HTTP issues

I‘ll break down each wget example in an easy to follow format with:

  • Exact syntax highlighted
  • Details on what it‘s doing
  • Sample output so you know what to expect

My goal is to help you master the full capabilities of wget whether you‘re just getting started or have been using it for decades.

Let‘s start by understanding why wget is so essential…

Why Sysadmins Can‘t Live Without wget

The name wget comes from "World Wide Web" and "get" – and fetching resources from web servers is exactly what this utility excels at.

It supports grabbing content over HTTP, HTTPS, and FTP, making it versatile for accessing sites and APIs on the public internet or private intranets.

What makes wget so useful compared to mainstream browsers like Chrome or Firefox?

A few key advantages:

  • Runs Non-Interactively: Call wget in the background without needing to be connected to the terminal session. Perfect for long running batch jobs.
  • Customizable: Headers, bandwidth throttling, output logging, and authentication can all be tailored to your needs.
  • Extensible: Interface with tools like cron, Selenium, Puppet, and other automation technologies.
  • Portable: It‘s baked into nearly every Linux distribution so your scripts and jobs can run on any modern *nix OS.
  • Lightweight: Saves system resources by not rendering sites visually like traditional web browsers.
  • Scriptable: Tight integration for calling wget commands from Bash, Python, Perl, Ruby, and more – see the sketch just below this list.
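
For instance, a minimal cron-friendly wrapper might look like this – the report URL, directories, and log file are placeholders, not anything specific to your environment:

#!/usr/bin/env bash
# Fetch a nightly report quietly, retrying flaky links and appending the outcome to a log.
wget --tries=3 --timeout=30 -nv \
     -P /var/data/reports \
     -a /var/log/report-fetch.log \
     https://example.com/reports/nightly.csv || echo "nightly report fetch failed" >&2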

Let‘s now dig into some practical examples demonstrating why wget should be your go-to tool for many web ops, development, and testing tasks.

1. Grab a Site‘s HTML

The most basic usage is to fetch a page‘s HTML content into a local .html file:

wget example.com

This saves the page as an index.html file in the current working directory. Here's what a run against a real site looks like:

$ wget marketingscoop.com
--2023-02-27 12:47:10--  https://www.marketingscoop.com/
Resolving marketingscoop.com... 104.21.23.181
Connecting to marketingscoop.com|104.21.23.181|:443... connected.   
HTTP request sent, awaiting response... 200 OK
Length: 129186 (126K) [text/html]
Saving to: ‘index.html‘

index.html                                            100%[====================================================>] 126.18K  --.-KB/s   in 0.04s  

2023-02-27 12:47:11 (3.12 MB/s) - ‘index.html‘ saved [129186/129186]

This provides a quick one-step way to pull down a page's raw HTML for inspection or offline review. Note that a bare wget call only fetches the HTML document itself – see the --page-requisites and --mirror examples below for grabbing CSS, JS, and images as well.
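
If you'd rather pick the output filename yourself, -O handles that (page.html here is just an illustrative name):

wget -O page.html example.com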

2. Retry On Connection Failure

Network hiccups happen. But appending --tries=5 tells wget to attempt the download up to 5 times before considering it failed (the default is 20 attempts):

wget --tries=5 example.com/important.zip

Some downloads are critical for your scripts and workflows. This simple argument ensures transient blips don‘t derail your automation.
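
For flakier endpoints, a slightly fuller sketch combines the retry count with a backoff between attempts and a network timeout – the values below are just reasonable starting points:

wget --tries=5 --waitretry=10 --timeout=30 example.com/important.zip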

3. Resume Partial Downloads

Big file downloads invariably get interrupted. Pass -c to have wget continue an incomplete file transfer:

wget -c example.com/bigfile.zip

Instead of starting from 0% when connectivity resumes, it picks up right where it left off – avoiding huge redundant downloads.
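
For truly unreliable links, a small shell loop can keep re-invoking the resume until the transfer finally completes – a rough sketch:

until wget -c example.com/bigfile.zip; do
    echo "retrying in 10 seconds..." >&2
    sleep 10
done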

4. Mirror Sites For Offline Development

My favorite trick – --mirror downloads a full website snapshot for offline viewing:

wget --mirror example.com

It starts with index.html, parses the links it finds, then recursively spiders the site, grabbing the HTML, images, JS, CSS and other assets along the way.

I lean on this constantly for debugging sites during plane rides or other offline work. Beats lugging around database snapshots!
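
For a snapshot that actually browses well offline, a common companion set converts links to local paths, grabs each page's requisites, and fixes up file extensions:

wget --mirror --convert-links --page-requisites --adjust-extension --no-parent example.com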

5. Download an Entire Sitemap‘s URLs

When dealing with large sites, identifying all available URLs is tedious.

Just grab the sitemap and pull the URLs out of it:

wget -q -O sitemap.xml example.com/sitemap.xml

Note that wget only discovers links by parsing HTML and CSS, so --recursive won't walk the entries inside an XML sitemap – extract them yourself and feed the list back in with -i, as sketched below.
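
A rough sketch of that extraction, assuming a standard sitemap with <loc> elements and GNU grep's -P flag (the urls.txt name is just illustrative):

grep -oP '(?<=<loc>)[^<]+' sitemap.xml > urls.txt
wget -i urls.txt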

6. Bypass Referer Checking

Some sites block unfavorable traffic by inspecting the Referer header.

Supply a plausible one with --referer so the request looks like it followed a normal link:

wget --referer=https://example.com/ https://example.com/file.zip

Now the download request carries a Referer header instead of arriving with none, which is usually enough to satisfy these checks.

7. Authenticate With Username/Password

Protected resources often require a login to access.

Specify credentials directly on the command line:

wget --http-user=jdoe --http-password=secret example.com/private-doc.pdf

Alternatively, --ask-password prompts for it interactively, keeping the secret out of your shell history and process list:

wget --http-user=jdoe --ask-password example.com/private-doc.pdf

This saves tedious manual logins when grabbing files.
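
For repeat jobs, credentials can live in ~/.netrc instead of on the command line – wget, like many classic network tools, consults it when no user/password is supplied. A minimal sketch (keep the file readable only by you):

cat >> ~/.netrc <<'EOF'
machine example.com
login jdoe
password secret
EOF
chmod 600 ~/.netrc

wget https://example.com/private-doc.pdf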

8. Throttle Bandwidth For Large Downloads

When dealing with 100GB+ downloads, limiting transfer speeds avoids disrupting business connectivity.

Use --limit-rate to throttle throughput:

wget --limit-rate=3m example.com/giant-file.db

This caps throughput at roughly 3 MB/s – the value is in bytes per second, with k and m suffixes accepted. I use this often when replicating datasets across regions to avoid saturating links.
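
A throttled transfer of this size takes a while, so it pairs nicely with backgrounding and a log file you can check on later (the log path is just an example):

wget -b -o /tmp/giant-file.log --limit-rate=3m example.com/giant-file.db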

9. Download a Page Plus Its Requisites

Rather than mirroring an entire site, sometimes you just want a single page along with the assets it needs to render.

The --page-requisites argument does exactly that:

wget --page-requisites example.com

It downloads the HTML first, then parses it and grabs whatever the page needs to render – CSS, JavaScript, and images pulled in via <link>, <script>, and <img> tags, rather than content linked from <a> tags.

Fantastic for pulling individual pages down to preview offline during development.

10. Follow Links From a CSV

Validating URLs can be tedious. Instead of manual checks, feed a simple CSV into wget:

wget -i /tmp/sites.csv 

It will iterate line by line, making an HTTP request for each entry. The -i flag expects plain text with one URL per line (or an HTML file if you add --force-html), so strip a multi-column CSV down to its URL column first – see the sketch below.

This scales trivial link validation up to thousands of sites.
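
A minimal sketch for a CSV whose first column holds the URLs – the /tmp/sites.csv path and column position are assumptions:

cut -d, -f1 /tmp/sites.csv | wget -nv -i - -o results.log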

11. Create Numbered Log Chunks

Breaking up giant log or text files is a common task.

wget has no built-in file splitting or output templating, but pairing it with GNU split handles this in one pass:

wget -O giant.log example.com/giant.log
split -d -a 3 -n l/100 --additional-suffix=.txt giant.log logs-

This downloads giant.log and carves it into 100 numbered chunks, logs-000.txt through logs-099.txt, without breaking lines – no manual intervention needed.

12. Chain Multiple Downloads

Need to kick off multiple wget runs one after another?

Semicolons allow sequencing commands:

wget site1.com; wget site2.com; wget site3.com

Each request starts as soon as the previous one finishes. Useful when aggregating data from disparate sources – two variations are sketched below.
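
Use && instead if a failure should stop the chain, or background each command with & to launch them all simultaneously:

wget site1.com && wget site2.com && wget site3.com
wget site1.com & wget site2.com & wget site3.com & wait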

13. Download Files to Specific Directories

I commonly need to organize downloads across projects and data sets.

The -P /path/to/save/ argument handles this automatically:

wget -r -np -nd -A "*.gz" -P /tmp/logs example.com/logstore/

Now every .gz archive under logstore/ lands neatly in /tmp/logs/ instead of my home directory. (A bare *.gz wildcard in the URL only works for FTP; over HTTP, combine -P with recursion and an -A accept list as shown.)

14. Fetch Resources Anonymously Over Tor

Sometimes downloading needs to be anonymous for privacy reasons.

Route traffic through Tor by wrapping the command with torsocks:

torsocks wget example.com/classified.pdf

wget has no native SOCKS option, so torsocks transparently redirects its connections through Tor's local SOCKS5 proxy listening on 127.0.0.1:9050.
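
To sanity-check that traffic really exits through Tor, you can fetch the Tor Project's check page through the same wrapper – the grep is just a rough heuristic on that page's wording:

torsocks wget -qO- https://check.torproject.org/ | grep -i congratulations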

15. Debug HTTP Issues

Web not behaving as expected? Decode what‘s happening under the hood with:

wget --debug -O /dev/null example.com

This enables verbose debug logging so you can inspect headers, status codes, and redirects to identify problems, while -O /dev/null discards the downloaded body to keep things tidy.

Before firing up Wireshark or tcpdump, I find this a quick first check.
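
When only the headers matter, a lighter variant combines --server-response with spider mode so nothing is downloaded at all:

wget -S --spider example.com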

16. Blast Files Down on Full Bandwidth

wget applies no bandwidth cap of its own by default; the --limit-rate=0 flag simply overrides any limit_rate set in your wgetrc.

Use it when you want maximum transfer speed despite a throttled default:

wget --limit-rate=0 cdimage.org/debian-11.6.0-amd64-netinst.iso

On high speed connections, I‘ve seen downloads in the hundreds of Mbps! Makes short work of giant ISO images.

17. Parallelize Downloads across Connections

Classic GNU wget works through a URL list one file at a time, which crawls along when you have hundreds of downloads.

wget itself has no parallel flag, so either switch to wget2, which supports parallel downloads natively, or fan classic wget out with xargs:

xargs -P 10 -n 1 wget -q < filelist

Here 10 concurrent wget processes split the effort. Adjust the -P value to find your ideal concurrency – and dial it back if you need to leave bandwidth for web servers and other infrastructure on the same link.

18. POST Data through HTTP

While wget primarily makes GET requests, it does support POST for interacting with APIs and web forms.

Pass form data with --post-data:

wget --post-data="username=jdoe&password=secret" example.com/login  

For API testing, often easier than coding scripts!
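
A sketch of an API-style POST that sends JSON and prints the response to stdout – the endpoint and payload are made up for illustration:

wget -qO- --header="Content-Type: application/json" --post-data='{"username":"jdoe"}' https://example.com/api/login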

19. Use a Random User Agent String

Some applications block common User Agent strings.

Use --user-agent (-U) to present a browser-like string instead of the default "Wget/…" identifier:

wget --user-agent="Mozilla/5.0 (X11; Linux x86_64)" restrictedsite.com/file

Now the request no longer advertises itself as wget, sidestepping blanket blocks on the stock UA. If a static string still gets flagged, rotate the value from a script – see the sketch below.
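
A minimal sketch of that rotation, assuming a hypothetical ua.txt file with one User-Agent string per line:

wget --user-agent="$(shuf -n 1 ua.txt)" restrictedsite.com/file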

20. Configure a Default Download Location

Repeatedly specifying custom directories grows tedious.

Define a default directory prefix with dir_prefix in your ~/.wgetrc:

echo 'dir_prefix = /usr/local/downloads' >> ~/.wgetrc

wget example.com/file # gets saved in /usr/local/downloads

Adjust the path to fit your scheme for centralizing downloads in one place.

Final Thoughts

Phew, that was quite an extensive walkthrough!

I demonstrated 20 practical examples of wielding the versatile wget utility for critical sysadmin tasks:

  • Automating data aggregation
  • Facilitating development testing
  • Debugging web apps
  • Downloading anonymously

The key lessons are:

  • Take advantage of wget when browser testing grows cumbersome
  • Script commands to scale tasks across thousands of sites
  • Control bandwidth, retry behavior, output locations and more

For even more wget capability, consult the man pages with man wget or check the GNU documentation.

Now you‘re fully equipped to tap wget for simplifying all aspects of systems administration – enjoy!

Let me know which creative ways you end up using wget out in the wild!