System Design - Web Crawler

In this wiki, we will explore an approach to designing a web-crawling service.

Requirements

  • What is the purpose of the crawler here? Search engine indexing, data mining, or for some other purpose? => Search engine indexing
  • Type of content that needs to be parsed: HTML / Text / PDF / Image / Video => HTML only
  • How long do we need to store the parsed content? => 5 years
  • How do we handle a web page that is edited? => Each web page is parsed only once
  • Scale of web-crawling? => 1 billion pages per month
  • Robustness: The crawler should handle edge cases such as unresponsive web pages, crashed web servers, and malformed HTML.
  • Politeness: The crawler shouldn't send too many requests to the same website within a short span of time; otherwise the service may be treated as a DoS attack. A minimal rate-limiting sketch follows this list.
  • Extensibility: Flexible enough to support parsing new content types with minimal changes
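To make the politeness requirement concrete, here is a minimal sketch of a per-host rate limiter. The `PolitenessScheduler` class, the one-second delay, and the example URL are assumptions for illustration only; a real crawler would also honor robots.txt and any crawl-delay directives.

```python
import time
from urllib.parse import urlparse

class PolitenessScheduler:
    """Hypothetical sketch: enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay_seconds=1.0):  # assumed delay, not a tuned value
        self.min_delay = min_delay_seconds
        self.last_request_time = {}  # host -> timestamp of the last request sent

    def wait_for_slot(self, url):
        """Block until it is polite to send the next request to this URL's host."""
        host = urlparse(url).netloc
        last = self.last_request_time.get(host)
        now = time.monotonic()
        if last is not None and now - last < self.min_delay:
            time.sleep(self.min_delay - (now - last))
        self.last_request_time[host] = time.monotonic()

# Usage: call wait_for_slot(url) before each download so requests to one host stay spaced out.
scheduler = PolitenessScheduler(min_delay_seconds=1.0)
scheduler.wait_for_slot("https://example.com/page1")
```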
Back of the envelope estimation:
  • 1 billion web pages to be downloaded each month
  • QPS => 1 billion pages / 30 days / 24 hours / 3600 seconds => ~400 pages/second
  • Peak QPS = 2 * QPS => ~800 pages/second
  • Assume the average web page size is 500 KB
  • Storage requirement = 1 billion pages/month * 500 KB => 500 TB/month
  • Total Storage requirement = 500 TB * 12 months * 5 years => 30 PB
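The estimates above can be reproduced with a few lines of arithmetic; the inputs (1 billion pages/month, 500 KB/page, 5 years of retention, 2x peak factor) come straight from the assumptions listed here.

```python
pages_per_month = 1_000_000_000
seconds_per_month = 30 * 24 * 3600

qps = pages_per_month / seconds_per_month            # ~385 pages/second, rounded to ~400
peak_qps = 2 * qps                                   # ~800 pages/second

avg_page_size_bytes = 500 * 1000                     # 500 KB per page (assumed average)
storage_per_month_tb = pages_per_month * avg_page_size_bytes / 1e12  # 500 TB/month
total_storage_pb = storage_per_month_tb * 12 * 5 / 1000              # 30 PB over 5 years

print(round(qps), round(peak_qps), storage_per_month_tb, total_storage_pb)
```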

Architecture

Spider trap: A set of URLs (often dynamically generated) that causes the crawler to get stuck in an infinite loop. The crawler needs to detect and handle such traps; a minimal heuristic guard is sketched below.
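One common mitigation, shown here as a hedged sketch rather than a complete solution, is to cap crawl depth, URL length, and pages per host, and to flag URLs with repeated path segments. The thresholds and the `looks_like_spider_trap` helper are arbitrary assumptions for illustration.

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 2048       # assumed cap; trap URLs often grow without bound
MAX_DEPTH = 10              # assumed maximum link depth from the seed URL
MAX_PAGES_PER_HOST = 5000   # assumed per-host budget to bound endless sites

def looks_like_spider_trap(url, depth, pages_seen_for_host):
    """Heuristic guard: return True if the URL is likely part of a spider trap."""
    if len(url) > MAX_URL_LENGTH:
        return True
    if depth > MAX_DEPTH:
        return True
    if pages_seen_for_host > MAX_PAGES_PER_HOST:
        return True
    # Repeated path segments (e.g. /a/b/a/b/a/b/...) are another common trap signature.
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > 6 and len(segments) > 2 * len(set(segments)):
        return True
    return False

print(looks_like_spider_trap("https://example.com/a/b/a/b/a/b/a/b", depth=3, pages_seen_for_host=10))
```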

Future Study:

  • How do we know when a web page has been edited? Are there any optimizations for this?
  • How are Bloom filters and hash values used for URL and content deduplication? A small sketch is included below.
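As a starting point for that study, here is a minimal, self-contained sketch of how a Bloom filter can answer "have we seen this URL?" probabilistically, and how a content hash can detect whether a page's body has changed. The bit-array size, hash count, and helper names are assumptions, not tuned or standard values.

```python
import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter: no false negatives, small chance of false positives."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):  # assumed sizing for illustration
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def content_fingerprint(html):
    """Hash of the page body; a changed hash indicates the page was edited."""
    return hashlib.sha256(html.encode()).hexdigest()

seen_urls = BloomFilter()
seen_urls.add("https://example.com/")
print(seen_urls.might_contain("https://example.com/"))    # True
print(seen_urls.might_contain("https://example.com/x"))   # Very likely False
print(content_fingerprint("<html>v1</html>") != content_fingerprint("<html>v2</html>"))  # True
```

The Bloom filter keeps the "seen URL" check memory-efficient at the scale of a billion pages, while the content fingerprint gives a cheap way to decide whether a re-fetched page actually changed.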
Written on September 14, 2025