
Building a Production-Ready Web Scraper: Architecture and Design Patterns

November 15, 2024
15 min read
By Muhammad Zaid
Web Scraping
Architecture
Python
Design Patterns

As a web scraping expert who's built enterprise-level scrapers handling millions of pages, I'll share the architectural patterns that ensure reliability, scalability, and maintainability.

The Components of a Production Scraper

A production-ready scraper needs:

  1. Scheduler: Manages scraping tasks
  2. Fetcher: Downloads pages
  3. Parser: Extracts data
  4. Storage: Saves results
  5. Monitor: Tracks performance

Architecture Overview

class ScraperArchitecture:
    def __init__(self):
        self.scheduler = Scheduler()
        self.fetcher = Fetcher()
        self.parser = Parser()
        self.storage = Storage()
        self.monitor = Monitor()

The Scheduler Component

Manages what to scrape and when:

from queue import PriorityQueue
from dataclasses import dataclass
from itertools import count
from datetime import datetime


@dataclass
class Task:
    url: str
    priority: int
    retry_count: int = 0


class Scheduler:
    def __init__(self):
        self.queue = PriorityQueue()
        self.visited = set()
        # Monotonic counter breaks ties so equal-priority entries never
        # fall back to comparing Task objects (which would raise TypeError)
        self._counter = count()

    def add_task(self, task: Task):
        if task.url not in self.visited:
            self.queue.put((task.priority, next(self._counter), task))

    def get_next_task(self):
        if not self.queue.empty():
            priority, _, task = self.queue.get()
            self.visited.add(task.url)
            return task
        return None
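Here's a minimal sketch of how the scheduler might be driven, assuming the Task and Scheduler classes above; the URLs are just placeholders:

scheduler = Scheduler()

# Lower numbers are dequeued first because PriorityQueue is a min-heap
scheduler.add_task(Task(url="https://example.com/category", priority=0))
scheduler.add_task(Task(url="https://example.com/product/1", priority=1))

while (task := scheduler.get_next_task()) is not None:
    print(task.priority, task.url)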

The Fetcher Component

Handles HTTP requests with retries:

import logging

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logger = logging.getLogger(__name__)


class Fetcher:
    def __init__(self):
        self.session = self._create_session()

    def _create_session(self):
        session = requests.Session()
        retry = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504]
        )
        adapter = HTTPAdapter(max_retries=retry)
        session.mount('http://', adapter)
        session.mount('https://', adapter)
        return session

    def fetch(self, url, **kwargs):
        try:
            response = self.session.get(url, timeout=30, **kwargs)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            logger.error(f"Error fetching {url}: {e}")
            return None
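Because fetch() forwards keyword arguments to requests.Session.get, per-request headers or proxies can be passed straight through. A quick sketch, with placeholder header and URL values:

fetcher = Fetcher()

response = fetcher.fetch(
    "https://example.com/product/1",
    headers={"User-Agent": "my-scraper/1.0"},  # placeholder UA string
    # proxies={"https": "http://proxy.example:8080"},  # optional, placeholder proxy
)
if response is not None:
    print(response.status_code, len(response.text))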

The Parser Component

Extracts and validates data:

import logging
from typing import Dict, Optional

from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


class Parser:
    def parse_product(self, html: str) -> Optional[Dict]:
        soup = BeautifulSoup(html, 'lxml')
        try:
            # The _extract_* helpers hold the site-specific selectors
            product = {
                'title': self._extract_title(soup),
                'price': self._extract_price(soup),
                'description': self._extract_description(soup),
                'images': self._extract_images(soup)
            }
            if self._validate_product(product):
                return product
        except Exception as e:
            logger.error(f"Parse error: {e}")
        return None

    def _validate_product(self, product: Dict) -> bool:
        required_fields = ['title', 'price']
        return all(product.get(field) for field in required_fields)
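The _extract_* helpers are left out above because they depend entirely on the target site's markup. As an illustration only, a title extractor for a hypothetical page might look like this (the h1.product-title selector is made up):

    def _extract_title(self, soup: BeautifulSoup) -> Optional[str]:
        # Selector is site-specific; adjust to the actual markup
        node = soup.select_one('h1.product-title')
        return node.get_text(strip=True) if node else None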

Rate Limiting

Respect target servers:

import time
from collections import deque


class RateLimiter:
    def __init__(self, max_requests: int, time_window: int):
        self.max_requests = max_requests
        self.time_window = time_window
        self.requests = deque()

    def wait_if_needed(self):
        now = time.time()
        # Remove requests that have fallen outside the time window
        while self.requests and self.requests[0] < now - self.time_window:
            self.requests.popleft()
        # Wait if the limit has been reached
        if len(self.requests) >= self.max_requests:
            sleep_time = self.time_window - (now - self.requests[0])
            if sleep_time > 0:
                time.sleep(sleep_time)
        self.requests.append(time.time())
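In practice the limiter sits directly in front of the fetcher, so every request pays the wait first. A small sketch combining the two classes above (placeholder URLs):

limiter = RateLimiter(max_requests=10, time_window=60)  # at most 10 requests per minute
fetcher = Fetcher()

for url in ["https://example.com/a", "https://example.com/b"]:
    limiter.wait_if_needed()  # blocks if the window is full
    fetcher.fetch(url)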

Data Storage

Efficient storage with deduplication:

import hashlib
from datetime import datetime
from typing import Dict

from sqlalchemy import create_engine, Column, String, Float, DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class Product(Base):
    __tablename__ = 'products'

    id = Column(String, primary_key=True)
    title = Column(String)
    price = Column(Float)
    url = Column(String, unique=True)
    scraped_at = Column(DateTime)


class Storage:
    def __init__(self, db_url: str):
        self.engine = create_engine(db_url)
        Base.metadata.create_all(self.engine)
        Session = sessionmaker(bind=self.engine)
        self.session = Session()

    def save_product(self, data: Dict):
        # Derive a stable ID from the URL so re-scrapes update rather than duplicate
        product_id = hashlib.md5(data['url'].encode()).hexdigest()
        product = Product(
            id=product_id,
            title=data.get('title'),
            price=data.get('price'),
            url=data['url'],
            scraped_at=datetime.now()
        )
        # merge() acts as an upsert keyed on the primary key
        self.session.merge(product)
        self.session.commit()
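A quick usage sketch, assuming a local SQLite file for testing instead of the production Postgres URL:

storage = Storage('sqlite:///products.db')  # any SQLAlchemy URL works; SQLite shown for testing

storage.save_product({
    'url': 'https://example.com/product/1',  # placeholder record
    'title': 'Example Product',
    'price': 19.99,
})
# Saving the same URL again updates the row instead of inserting a duplicate,
# because the ID is derived from the URL and session.merge() upserts on it.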

Monitoring and Alerts

Track scraper health:

from dataclasses import dataclass


@dataclass
class Metrics:
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    items_scraped: int = 0


class Monitor:
    def __init__(self):
        self.metrics = Metrics()

    def record_request(self, success: bool):
        self.metrics.total_requests += 1
        if success:
            self.metrics.successful_requests += 1
        else:
            self.metrics.failed_requests += 1

    def record_item(self):
        self.metrics.items_scraped += 1

    def get_success_rate(self) -> float:
        if self.metrics.total_requests == 0:
            return 0.0
        return self.metrics.successful_requests / self.metrics.total_requests
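The Metrics object gives you everything needed for basic alerting. A minimal sketch, assuming the Monitor above and plain logging (the 90% threshold is an arbitrary choice, and the alert hook is hypothetical):

import logging

logger = logging.getLogger(__name__)


def check_health(monitor: Monitor, min_success_rate: float = 0.9):
    # Hypothetical alert hook: only logs here, but this is where a
    # Slack/PagerDuty/webhook call would go in a real deployment.
    rate = monitor.get_success_rate()
    if monitor.metrics.total_requests > 0 and rate < min_success_rate:
        logger.warning("Success rate dropped to %.1f%% - check for blocking or site changes", rate * 100)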

Putting It All Together

class ProductionScraper:
    def __init__(self):
        self.scheduler = Scheduler()
        self.fetcher = Fetcher()
        self.parser = Parser()
        self.storage = Storage('postgresql://...')
        self.monitor = Monitor()
        self.rate_limiter = RateLimiter(max_requests=10, time_window=60)

    def run(self, urls: list):
        # Add initial tasks
        for url in urls:
            self.scheduler.add_task(Task(url=url, priority=1))

        # Process tasks
        while task := self.scheduler.get_next_task():
            self.rate_limiter.wait_if_needed()
            response = self.fetcher.fetch(task.url)
            self.monitor.record_request(response is not None)

            if response:
                product = self.parser.parse_product(response.text)
                if product:
                    product['url'] = task.url  # Storage keys records by URL
                    self.storage.save_product(product)
                    self.monitor.record_item()

        # Report results
        print(f"Success rate: {self.monitor.get_success_rate():.2%}")
        print(f"Items scraped: {self.monitor.metrics.items_scraped}")
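Kicking it off is then only a few lines; the seed URLs below are placeholders:

if __name__ == "__main__":
    scraper = ProductionScraper()
    scraper.run([
        "https://example.com/products?page=1",
        "https://example.com/products?page=2",
    ])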

Deployment Considerations

Docker Deployment

FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

CMD ["python", "scraper.py"]
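Inside a container you typically don't hard-code the database URL; reading it from the environment keeps the image generic. A minimal sketch, where DATABASE_URL is an assumed variable name injected at runtime:

import os

# DATABASE_URL is supplied via docker run -e or the Kubernetes pod spec, not baked into the image
db_url = os.environ.get("DATABASE_URL", "sqlite:///products.db")
storage = Storage(db_url)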

Kubernetes for Scale

apiVersion: batch/v1
kind: CronJob
metadata:
  name: web-scraper
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: scraper
              image: scraper:latest
          restartPolicy: OnFailure  # Job pods must use OnFailure or Never

Conclusion

Building production-ready web scrapers requires careful architecture and attention to detail. These patterns have helped me build scrapers that run reliably for years.

Need help with your web scraping project? As a data scraping expert and freelance Python developer, I can help you build scalable, reliable scraping solutions!

Need Expert Python Development?

Looking to hire a Python developer or need help with Django, web scraping, or automation projects? Let's work together!