Lessons learned about Web Scraping

December 14, 2022

Introduction

I got my start in software back in 2015. Right out of Codecademy’s Python guide, I got interested in web scraping and the rest is history.

This blog post collects the lessons I have learned over the years and across multiple projects in this space.

Split your scraping and parsing

Initially my web scrapers would attempt to parse the data (such as extracting valuable information from HTML) as soon as the scraping was completed. This approach made scraping inefficient, and it meant that bugs unrelated to scraping could break the scraper.

Nowadays I split up the scraping and parsing processes. I decouple them by having the scraper save the raw data either to a database or to files, while the parser runs as a separate process that reads that data and extracts the valuable information.

To keep things simple, these different modes typically live in the same script, with a CLI argument used to run it in scraping mode or in parsing mode. This also allows for easy code re-use between both modes.
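A minimal sketch of what such a two-mode script could look like. The directory name, the function names and the placeholder page content are all assumptions made for illustration, and the title extraction is deliberately naive.

```python
import argparse
from pathlib import Path

RAW_DIR = Path("raw_pages")  # hypothetical location for the saved raw HTML


def scrape(raw_dir: Path) -> None:
    """Save pages untouched; no parsing happens in this mode."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    # Placeholder content standing in for a real HTTP response body.
    (raw_dir / "example.html").write_text("<title>Example</title>", encoding="utf-8")


def parse(raw_dir: Path) -> list[str]:
    """Read previously saved pages and extract data from them."""
    titles = []
    for path in sorted(raw_dir.glob("*.html")):
        html = path.read_text(encoding="utf-8")
        # Naive extraction for illustration; a real parser would use an HTML library.
        start = html.find("<title>")
        end = html.find("</title>")
        if start != -1 and end != -1:
            titles.append(html[start + len("<title>"):end])
    return titles


def main(argv=None) -> None:
    """Pick the mode from the command line: 'scrape' or 'parse'."""
    parser = argparse.ArgumentParser(description="Scrape or parse, never both at once.")
    parser.add_argument("mode", choices=["scrape", "parse"])
    args = parser.parse_args(argv)
    if args.mode == "scrape":
        scrape(RAW_DIR)
    else:
        for title in parse(RAW_DIR):
            print(title)
```

Calling main() at the bottom of the file, under the usual __main__ guard, would let the same script run as either "scrape" or "parse". A parsing bug then only affects the parse run, and the saved pages are still there for the next attempt.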

Keep your full source content

In a number of projects I have experienced a need to go back and get some new information from a source that I was already scraping. In each case, I ended up having to begin scraping again from scratch.

An alternative to this is to keep all the source content. Whether you store it as files or save it to a database doesn’t matter, as long as it is easy for you to re-inspect. This approach also allows you to iterate on your parsing without requesting the same pages over and over.

If storage space is an issue, either because you are retrieving large files (such as WAV files) or because you expect to scrape large volumes of data, use a compression format such as Brotli to minimize the impact.
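As a rough sketch of keeping compressed source content on disk: the example below uses gzip from the Python standard library rather than Brotli, purely so that it runs without extra packages. The function names and the idea of naming files after a hash of the URL are assumptions for illustration.

```python
import gzip
import hashlib
from pathlib import Path


def save_raw(raw_dir: Path, url: str, body: bytes) -> Path:
    """Compress a response body and store it under a name derived from the URL."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    # Hashing the URL gives a stable, filesystem-safe file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html.gz"
    path = raw_dir / name
    path.write_bytes(gzip.compress(body))
    return path


def load_raw(path: Path) -> bytes:
    """Return the original, uncompressed body for re-parsing."""
    return gzip.decompress(path.read_bytes())
```

HTML is repetitive and tends to compress well, so the saved file is usually much smaller than the page itself. Already-compressed media will see little benefit.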

Respect your scraping targets

In the beginning I would scrape rather liberally. “Computers are fast!” was my reasoning. Eventually I had one too many run-ins with my IP being flagged or blocked by my scraping targets (either automatically with a tool such as Cloudflare or even manually in some cases).

Whilst there are workarounds (such as rotating your IP address using a proxy), I have found that scraping respectfully is a better solution.

By implementing large delays between my scrapes, avoiding peak traffic times and lowering my need to have all the data NOW, I’ve largely eliminated these kinds of issues.
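One way to build such delays in, as a small sketch. The 30 second base, the random extra of up to 15 seconds and the fetch_page callable are assumptions, not recommendations for any particular site.

```python
import random
import time


def polite_delay(base_seconds: float, jitter_seconds: float) -> float:
    """Return base_seconds plus a random extra of up to jitter_seconds."""
    return base_seconds + random.uniform(0, jitter_seconds)


def scrape_politely(urls, fetch_page, base_seconds=30.0, jitter_seconds=15.0,
                    sleep=time.sleep):
    """Fetch each URL in turn, waiting between requests.

    fetch_page is a hypothetical callable that takes a URL and returns its body.
    sleep is injectable so the loop can be exercised without actually waiting.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # No wait before the first request, only between requests.
            sleep(polite_delay(base_seconds, jitter_seconds))
        results[url] = fetch_page(url)
    return results
```

The random component keeps the requests from arriving at a perfectly regular interval.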

If you can, scrape asynchronously

If your scrape target has liberal limits, consider scraping asynchronously. In practice this means using a distributed task runner (such as Celery for Python or Asynq for Go). Web scraping tends to be an I/O-bound problem, so running multiple workers means less time spent idle waiting on responses.
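Celery and Asynq need a broker and worker processes, which is more than fits in a short example. The sketch below is not a distributed task runner; it uses a thread pool from the Python standard library as a stand-in to show the same idea of several workers waiting on I/O at once. The fetch_page callable and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_concurrently(urls, fetch_page, max_workers=4):
    """Fetch several URLs at once using a small pool of worker threads.

    fetch_page is a hypothetical callable that takes a URL and returns its body.
    If it raises for any URL, the exception surfaces when results are collected.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        bodies = list(pool.map(fetch_page, urls))
    return dict(zip(urls, bodies))
```

Keeping max_workers small is one way to stay within the target's limits while still overlapping the waiting time.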

Handle exceptions and timeouts

Often enough in my first few projects I would leave a scraper running overnight only to discover that at 2am it died of an unhandled exception (such as a TLS handshake failure) or got stuck waiting for a response because requests.get() doesn’t have a default timeout.

Handle exceptions liberally and set sane default timeouts to avoid falling into this pit. You can also work around it by having your process restart automatically when it fails. However, your script could then get stuck in a loop, failing on the same page over and over. In this case I prefer to have the script skip that page and move on.
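A sketch of the skip-and-move-on approach. It uses urllib from the standard library instead of requests so that it runs without extra packages; with requests, the equivalent step is passing timeout= to requests.get(). The function names and the 10 second timeout are assumptions.

```python
import http.client
import urllib.request


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Fetch a page, giving up after `timeout` seconds instead of hanging."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def scrape_all(urls, fetch_page=fetch):
    """Fetch every URL, recording failures instead of crashing on them."""
    pages, failed = {}, []
    for url in urls:
        try:
            pages[url] = fetch_page(url)
        except (OSError, http.client.HTTPException) as exc:
            # OSError covers timeouts, connection errors, TLS failures and
            # urllib's URLError/HTTPError. Skip the page and keep going.
            failed.append((url, repr(exc)))
    return pages, failed
```

The failed list can be logged or retried later, so one bad page at 2am does not cost the rest of the night's run.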

Store all your data in one place

This is a more recent decision, inspired by the idea of doing things that don’t scale. There’s likely an academic answer to the question “which format/database is ideal to store scraped HTML?” However, the more realistic answer I now prefer is “postgres”, regardless of the domain.

For me this is a mix of simplicity (all your data, both raw and parsed, is in one place) and a lack of time to dedicate more than is absolutely necessary to side projects. Scaling issues are a good problem to have: they likely mean your project is successful by some metric.