System Design - Web Crawler

In this wiki, we will explore an approach to designing a web-crawling service.

Requirements

  • What is the purpose of the crawler here? Search engine indexing, data mining, or for some other purpose? => Search engine indexing
  • Type of content that needs to be parsed: HTML / Text / PDF / Image / Video => HTML only
  • How long do we need to store the parsed content? => 5 years
  • How do we handle a web page that is edited? => Each web page is parsed only once
  • Scale of web-crawling? => 1 billion pages per month
  • Robustness: The crawler should handle edge cases such as unresponsive web pages, crashed web servers, and malformed HTML.
  • Politeness: The crawler shouldn't send too many requests to the same website within a short span of time; otherwise the service may be treated as a DoS attack. A minimal rate-limiting sketch follows this list.
  • Extensibility: Flexible enough to support parsing new content types with minimal changes
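To make the politeness requirement concrete, here is a minimal sketch of a per-host rate limiter. The `PolitenessScheduler` class, the one-second delay, and the example URL are assumptions for illustration only; a real crawler would also honor robots.txt and any crawl-delay directives.

```python
import time
from urllib.parse import urlparse

class PolitenessScheduler:
    """Hypothetical sketch: enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay_seconds=1.0):  # assumed delay, not a tuned value
        self.min_delay = min_delay_seconds
        self.last_request_time = {}  # host -> timestamp of the last request sent

    def wait_for_slot(self, url):
        """Block until it is polite to send the next request to this URL's host."""
        host = urlparse(url).netloc
        last = self.last_request_time.get(host)
        now = time.monotonic()
        if last is not None and now - last < self.min_delay:
            time.sleep(self.min_delay - (now - last))
        self.last_request_time[host] = time.monotonic()

# Usage: call wait_for_slot(url) before each download so requests to one host stay spaced out.
scheduler = PolitenessScheduler(min_delay_seconds=1.0)
scheduler.wait_for_slot("https://example.com/page1")
```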
Back of the envelope estimation:
  • 1 billion web pages to be downloaded each month
  • QPS => 1 billion pages / 30 days / 24 hours / 3600 seconds => ~400 pages/second
  • Peak QPS = 2 * QPS => ~800 pages/second
  • Assume the average web page size is 500 KB
  • Storage requirement = 1 billion pages/month * 500 KB => 500 TB/month
  • Total Storage requirement = 500 TB * 12 months * 5 years => 30 PB
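The estimates above can be reproduced with a few lines of arithmetic; the inputs (1 billion pages/month, 500 KB/page, 5 years of retention, 2x peak factor) come straight from the assumptions listed here.

```python
pages_per_month = 1_000_000_000
seconds_per_month = 30 * 24 * 3600

qps = pages_per_month / seconds_per_month            # ~385 pages/second, rounded to ~400
peak_qps = 2 * qps                                   # ~800 pages/second

avg_page_size_bytes = 500 * 1000                     # 500 KB per page (assumed average)
storage_per_month_tb = pages_per_month * avg_page_size_bytes / 1e12  # 500 TB/month
total_storage_pb = storage_per_month_tb * 12 * 5 / 1000              # 30 PB over 5 years

print(round(qps), round(peak_qps), storage_per_month_tb, total_storage_pb)
```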

Architecture

Spider trap: A set of URLs (often dynamically generated) that causes the crawler to get stuck in an infinite loop. The crawler needs to detect and handle such traps; a minimal heuristic guard is sketched below.
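One common mitigation, shown here as a hedged sketch rather than a complete solution, is to cap crawl depth, URL length, and pages per host, and to flag URLs with repeated path segments. The thresholds and the `looks_like_spider_trap` helper are arbitrary assumptions for illustration.

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 2048       # assumed cap; trap URLs often grow without bound
MAX_DEPTH = 10              # assumed maximum link depth from the seed URL
MAX_PAGES_PER_HOST = 5000   # assumed per-host budget to bound endless sites

def looks_like_spider_trap(url, depth, pages_seen_for_host):
    """Heuristic guard: return True if the URL is likely part of a spider trap."""
    if len(url) > MAX_URL_LENGTH:
        return True
    if depth > MAX_DEPTH:
        return True
    if pages_seen_for_host > MAX_PAGES_PER_HOST:
        return True
    # Repeated path segments (e.g. /a/b/a/b/a/b/...) are another common trap signature.
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > 6 and len(segments) > 2 * len(set(segments)):
        return True
    return False

print(looks_like_spider_trap("https://example.com/a/b/a/b/a/b/a/b", depth=3, pages_seen_for_host=10))
```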

Future Study:

  • How do we know when a web page has been edited? Are there any optimizations for this?
  • How are Bloom filters and hash values used for URL and content deduplication? A small sketch is included below.
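As a starting point for that study, here is a minimal, self-contained sketch of how a Bloom filter can answer "have we seen this URL?" probabilistically, and how a content hash can detect whether a page's body has changed. The bit-array size, hash count, and helper names are assumptions, not tuned or standard values.

```python
import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter: no false negatives, small chance of false positives."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):  # assumed sizing for illustration
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def content_fingerprint(html):
    """Hash of the page body; a changed hash indicates the page was edited."""
    return hashlib.sha256(html.encode()).hexdigest()

seen_urls = BloomFilter()
seen_urls.add("https://example.com/")
print(seen_urls.might_contain("https://example.com/"))    # True
print(seen_urls.might_contain("https://example.com/x"))   # Very likely False
print(content_fingerprint("<html>v1</html>") != content_fingerprint("<html>v2</html>"))  # True
```

The Bloom filter keeps the "seen URL" check memory-efficient at the scale of a billion pages, while the content fingerprint gives a cheap way to decide whether a re-fetched page actually changed.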
Written on September 14, 2025