Creative Codes
Scraping | June 1, 2026 | 8 min read

Building an E-Commerce Price Scraping Pipeline

Price monitoring across dozens of competitor sites, normalized into a clean database, with alerts when something changes. Here's how we build these pipelines.

Muhammad Hassan

Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.

Price scraping at scale has three failure modes most teams hit in the first week: blocked requests, inconsistent data formats, and no alerting when a target site changes its markup. Clients in e-commerce, retail intelligence, and procurement need prices from competitor sites, marketplaces, and supplier catalogs: normalized, updated daily or hourly, with alerts when something changes significantly. This is how we build these pipelines.

The problem with naive price scrapers

The simplest price scraper: hit a URL, extract the price, store it. This works for two weeks. Then:

  • The site updates its HTML structure. Your selector breaks silently. You're storing None without knowing it.
  • The site adds bot detection. Your scraper gets blocked.
  • A product has multiple variants (size, color). You're capturing the wrong price.
  • Currency formatting varies by locale. You're comparing "$1,299.00" to "1299.00" and they're not equal.
  • The product is temporarily out of stock. Is the missing price a scraping failure or an actual gap?

A production price scraping pipeline handles all of these. Here's how.

Step 1: Product URL management

Before scraping anything, you need a clean list of what to scrape and how. We maintain a product mapping table:

text
product_urls (table)
- id
- product_name
- our_sku (our internal identifier)
- competitor_name
- url
- price_selector (CSS selector for the price element)
- currency
- last_scraped_at
- last_successful_scrape_at
- is_active

The price_selector field matters: different pages, even from the same site, may have different HTML structures. Product pages vs. listing pages vs. search results all have different markup.

The last_successful_scrape_at field (distinct from last_scraped_at) shows which products haven't returned a price in a while. These need investigation.
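As a rough illustration of how the two timestamps can be used, here is a minimal sketch that creates the mapping table and lists active products whose last successful scrape is too old. It assumes SQLite purely for convenience; the column types, the 48-hour cutoff, and the helper names `to_iso` and `find_stale_products` are illustrative assumptions, not part of the pipeline described here.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Column names mirror the mapping table above; types and defaults are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS product_urls (
    id INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    our_sku TEXT NOT NULL,
    competitor_name TEXT NOT NULL,
    url TEXT NOT NULL,
    price_selector TEXT NOT NULL,
    currency TEXT NOT NULL,
    last_scraped_at TEXT,             -- UTC ISO-8601, updated on every attempt
    last_successful_scrape_at TEXT,   -- UTC ISO-8601, updated only when a price came back
    is_active INTEGER NOT NULL DEFAULT 1
)
"""


def to_iso(dt: datetime) -> str:
    """Format a timezone-aware datetime as UTC ISO-8601 so strings sort chronologically."""
    return dt.astimezone(timezone.utc).isoformat(timespec="seconds")


def find_stale_products(conn, max_age_hours=48, now=None):
    """Return (id, product_name, url) for active rows with no recent successful scrape."""
    now = now or datetime.now(timezone.utc)
    cutoff = to_iso(now - timedelta(hours=max_age_hours))
    return conn.execute(
        """
        SELECT id, product_name, url
        FROM product_urls
        WHERE is_active = 1
          AND (last_successful_scrape_at IS NULL OR last_successful_scrape_at < ?)
        ORDER BY id
        """,
        (cutoff,),
    ).fetchall()
```

Rows that have never succeeded (NULL) are treated as stale, which surfaces bad selectors on newly added URLs as well as ones that broke later.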

Step 2: Scraper architecture

For e-commerce sites, most pricing data is in the DOM (not behind API calls). We use Playwright for JavaScript-heavy sites and httpx for simpler ones.

python
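import re
from decimal import Decimal, InvalidOperation

import httpx
from bs4 import BeautifulSoup
from playwright.async_api import TimeoutError as PlaywrightTimeoutError
from playwright.async_api import async_playwright


def parse_price(raw: str | None) -> Decimal | None:
    """Normalize a price string to a Decimal, or return None if it can't be parsed."""
    if not raw:
        return None
    # Keep only digits and the two possible separators
    cleaned = re.sub(r"[^\d.,]", "", raw)
    if "," in cleaned and "." in cleaned:
        # Whichever separator comes last is the decimal separator
        if cleaned.rfind(",") > cleaned.rfind("."):
            cleaned = cleaned.replace(".", "").replace(",", ".")  # 1.299,00 -> 1299.00
        else:
            cleaned = cleaned.replace(",", "")  # 1,299.00 -> 1299.00
    elif "," in cleaned:
        head, _, tail = cleaned.rpartition(",")
        if cleaned.count(",") == 1 and len(tail) in (1, 2):
            cleaned = f"{head}.{tail}"  # 12,99 -> 12.99 (decimal comma)
        else:
            cleaned = cleaned.replace(",", "")  # 1,299 -> 1299 (thousands separator)
    elif cleaned.count(".") > 1:
        cleaned = cleaned.replace(".", "")  # 1.299.000 -> 1299000

    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


async def scrape_price(url: str, selector: str, use_browser: bool = False) -> dict:
    """Fetch a page and extract the raw and parsed price for one selector."""
    raw_price = None
    if use_browser:
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            try:
                page = await browser.new_page()
                await page.goto(url, wait_until="domcontentloaded")
                try:
                    # JS-rendered prices often appear after DOMContentLoaded, so wait for them
                    element = await page.wait_for_selector(
                        selector, state="attached", timeout=10_000
                    )
                except PlaywrightTimeoutError:
                    element = None
                raw_price = await element.inner_text() if element else None
            finally:
                await browser.close()  # Release the browser even if the page errors out
    else:
        async with httpx.AsyncClient(follow_redirects=True) as client:
            response = await client.get(url, headers={"User-Agent": "Mozilla/5.0..."})
            soup = BeautifulSoup(response.text, "html.parser")
            element = soup.select_one(selector)
            raw_price = element.get_text(strip=True) if element else None

    return {
        "raw_price": raw_price,
        "parsed_price": parse_price(raw_price),
        # True when the selector matched; parsed_price can still be None (e.g. "Out of stock")
        "scrape_success": raw_price is not None,
    }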

The parse_price function is critical. Prices come in every format imaginable: "€1.299,99", "$1,299.00", "1299.00 USD", "from $49". A parser that handles these correctly prevents silent data corruption.

Step 3: Change detection and alerts

Storing every scraped price creates a clean history. Change detection runs as a separate step after scraping:

python
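from decimal import Decimal

# get_last_price() and create_alert() are the pipeline's storage helpers (not shown here).


def detect_significant_changes(product_id: int, new_price: Decimal, threshold: float = 0.05):
    """Alert when the price moves by at least `threshold` (a fraction: 0.05 means 5%)."""
    previous = get_last_price(product_id)
    if previous is None or previous == 0:
        return  # First scrape or unusable baseline, nothing to compare against
    if new_price == previous:
        return  # No change, so no alert even when threshold is 0

    change_pct = abs(new_price - previous) / previous

    if change_pct >= Decimal(str(threshold)):
        alert_type = "price_drop" if new_price < previous else "price_increase"
        create_alert(
            product_id=product_id,
            alert_type=alert_type,
            old_price=previous,
            new_price=new_price,
            change_pct=float(change_pct),
        )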

For high-value products, we alert on any change. For commodity products, we use a threshold (5-10%) to avoid alert noise from minor fluctuations.

Info

Set per-product thresholds, not a single global threshold. A 5% price change on a $10 product is noise. A 5% change on a $5,000 product is worth an immediate alert. We store the threshold in the product_urls table alongside the selector.
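To make the per-row threshold idea concrete, here is a small sketch of the comparison as a pure function that takes the threshold as an argument instead of a global default. The function name `classify_change` and the example thresholds are hypothetical; in a real pipeline the threshold would be read from the product's row.

```python
from decimal import Decimal
from typing import Optional


def classify_change(previous: Decimal, new: Decimal, threshold: Decimal) -> Optional[str]:
    """Return 'price_drop', 'price_increase', or None if the move is below the threshold.

    `threshold` is a fraction of the previous price (Decimal("0.05") means 5%).
    A threshold of 0 means "alert on any change".
    """
    if previous <= 0 or new == previous:
        return None  # No usable baseline, or nothing changed
    change = abs(new - previous) / previous
    if change < threshold:
        return None
    return "price_drop" if new < previous else "price_increase"


# Hypothetical thresholds: a noisy commodity item vs. a high-value item
print(classify_change(Decimal("10.00"), Decimal("10.50"), Decimal("0.10")))     # None
print(classify_change(Decimal("5000.00"), Decimal("4990.00"), Decimal("0")))    # price_drop
```

Keeping the comparison free of database calls makes it easy to unit test the threshold logic separately from storage and alert delivery.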

Step 4: Handling failures

Price scraping has two types of failures:

  1. Scraping failure: the request failed, the selector didn't match, or the page structure changed
  2. Data anomaly: the scrape succeeded but the price is wrong (e.g., a discount badge was captured instead of the actual price, or the product is out-of-stock and showing "$0")

For scraping failures, we retry 3 times with exponential backoff, then mark the URL as needing manual review.
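The retry policy can be sketched roughly as follows. This assumes "retry 3 times" means three retries after the initial attempt, with delays of 1, 2, and 4 seconds; the `NeedsManualReview` exception and the injectable `sleep` parameter are illustrative choices for this sketch, not names from the pipeline.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class NeedsManualReview(Exception):
    """Raised once all retries are exhausted (hypothetical marker for this sketch)."""


def retry_with_backoff(
    task: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, retrying on any exception with delays of base_delay * 2**attempt."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):  # One initial attempt plus max_retries retries
        try:
            return task()
        except Exception as exc:
            last_error = exc
            if attempt < max_retries:
                sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s with the defaults
    raise NeedsManualReview(f"gave up after {max_retries} retries") from last_error
```

The caller can catch `NeedsManualReview` and flag the URL for review. Passing `sleep` in as a parameter lets tests run without actually waiting.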

For data anomalies, we validate against expected ranges:

python
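from decimal import Decimal

# get_price_history() is a storage helper (not shown), assumed to return recent prices as Decimals.


def validate_price(price: Decimal, product_id: int) -> bool:
    """Return True if the price is plausible given the last 30 days of history."""
    if price <= 0:
        return False  # "$0" and negative values are never valid prices
    history = get_price_history(product_id, days=30)
    if len(history) < 3:
        return True  # Not enough history to validate

    avg = sum(history) / len(history)
    # Accept only prices between 20% and 200% of the recent average
    return (avg * Decimal("0.20")) <= price <= (avg * Decimal("2.00"))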

Prices that fail validation are stored with a needs_review flag rather than being auto-published to clients.

Step 5: Scheduling and scale

For monitoring dozens of sites at daily frequency, a simple cron job works. For hundreds of sites at hourly frequency, you need a job queue.

Our standard setup:

  • Celery with Redis broker for job queuing
  • One worker per proxy pool (to rate-limit requests to each target site)
  • Priority queue: high-value products scrape more frequently
  • Monitoring: Prometheus metrics on success rate, latency, and failure counts per site
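The "high-value products scrape more frequently" idea can be illustrated independently of any queue library. The sketch below is not Celery code: it is a plain-Python stand-in that picks which products are due, given a per-product interval, and orders them most-overdue first. `ScrapeTarget` and `select_due` are hypothetical names, and the intervals are made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ScrapeTarget:
    product_id: int
    interval: timedelta                  # Shorter interval means higher priority
    last_scraped_at: Optional[datetime]  # None if never scraped


def select_due(targets: List[ScrapeTarget], now: datetime) -> List[ScrapeTarget]:
    """Return targets whose interval has elapsed, most overdue first."""

    def overdue_ratio(t: ScrapeTarget) -> float:
        if t.last_scraped_at is None:
            return float("inf")  # Never scraped: always first
        return (now - t.last_scraped_at) / t.interval

    due = [t for t in targets if overdue_ratio(t) >= 1.0]
    return sorted(due, key=overdue_ratio, reverse=True)
```

A scheduler tick would call `select_due` and enqueue one scrape job per returned target; with a real job queue the same ordering could be expressed through separate queues or task priorities.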

For the full job queue and error handling architecture we use when automation needs to act on these price changes, see Building Production n8n Workflows.

Proxy rotation for price scrapers

Price scraping is one of the most proxy-intensive use cases. Major retailers — Amazon, Walmart, Best Buy, Chewy, and most Shopify stores with anti-bot apps installed — block datacenter IP ranges on sight. For these targets, residential proxies are not optional.

The practical difference: a datacenter IP (AWS, DigitalOcean, Hetzner range) gets blocked on the first request or after 3-5 requests. A residential IP can sustain dozens of requests before triggering rate limiting, and rotating to a fresh IP resets the counter.

Rotation strategy we use for retail price scrapers:

  • 1 IP per 50 requests to the same domain. After 50 requests, rotate to a fresh residential IP.
  • For sites with aggressive bot protection (Cloudflare with JS challenge), 1 IP per 10-15 requests.
  • Always route requests from the same scraping session through the same IP. Switching IPs mid-session triggers fingerprint inconsistency alerts on Akamai and similar bot managers.
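The rotation rules above can be sketched as a small per-domain counter. This is a simplified illustration that treats "session" as "all requests to one domain", which is an assumption; the class name `DomainProxyRotator` and the proxy URLs are hypothetical, and a real residential provider would typically hand out IPs through its own gateway or session parameters.

```python
import itertools
from typing import Dict, List
from urllib.parse import urlsplit


class DomainProxyRotator:
    """Keep one proxy per domain and rotate it after a fixed number of requests."""

    def __init__(self, proxies: List[str], requests_per_ip: int = 50):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = itertools.cycle(proxies)
        self._limit = requests_per_ip
        self._state: Dict[str, list] = {}  # domain -> [current proxy, requests served]

    def proxy_for(self, url: str) -> str:
        """Return the proxy to use for this URL, rotating when the domain hits its limit."""
        domain = urlsplit(url).hostname or ""
        state = self._state.get(domain)
        if state is None or state[1] >= self._limit:
            state = [next(self._pool), 0]  # Fresh IP, counter reset
            self._state[domain] = state
        state[1] += 1
        return state[0]


# Hypothetical pool; use a lower limit (10-15) for aggressively protected sites
rotator = DomainProxyRotator(
    ["http://proxy-1.example:8000", "http://proxy-2.example:8000"], requests_per_ip=50
)
proxy = rotator.proxy_for("https://shop.example/product/123")
```

Because the counter is kept per domain, requests to the same site stay on one IP until the limit is reached, while a different site can start on a different IP.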

For sites without detection (smaller retailers, direct supplier catalogs, B2B pricing portals), datacenter proxies work fine and cost significantly less.

Cost comparison at scale:

| Volume | Proxy type | Approx. cost/month |
|--------|-----------|-------------------|
| 10K pages/day | Residential | $20-40 |
| 10K pages/day | Datacenter | $5-10 |
| 100K pages/day | Residential | $150-300 |
| 100K pages/day | Datacenter | $30-60 |

Residential proxy costs (we use Brightdata and Oxylabs depending on the target) are priced per GB of data transferred, typically $8-15/GB. A page with a price element and minimal assets runs 50-100KB. At 100K pages/day, that's 5-10GB/day, which at those per-GB rates works out to $40-150/day for residential traffic versus $3-5/day for datacenter.
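The per-GB arithmetic is easy to check with a few lines of code. This sketch assumes decimal units (1 GB = 1,000,000 KB) and that every page is billed at its full transfer size; the function name `daily_proxy_cost` is made up for the example.

```python
def daily_proxy_cost(pages_per_day: int, kb_per_page: float, usd_per_gb: float) -> float:
    """Estimate daily bandwidth cost in USD for per-GB proxy pricing."""
    gb_per_day = pages_per_day * kb_per_page / 1_000_000  # KB -> GB (decimal units)
    return gb_per_day * usd_per_gb


# 100K pages/day at the low and high ends of the page-size and price ranges
low = daily_proxy_cost(100_000, kb_per_page=50, usd_per_gb=8.0)     # 5 GB/day  -> 40.0
high = daily_proxy_cost(100_000, kb_per_page=100, usd_per_gb=15.0)  # 10 GB/day -> 150.0
```

Measuring the real bytes transferred per page for each target is worth doing before budgeting, since blocking images, fonts, and scripts can change the per-page size considerably.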

For most price monitoring projects, the mix is roughly 60% residential (for protected retailer sites) and 40% datacenter (for smaller sites). At 10K pages/day that's a ~$30/month proxy cost — negligible compared to the value of the monitoring data.

One more cost factor worth knowing: some residential proxy providers charge for failed requests too. If your selector breaks and the scraper is sending requests but returning no price, you're paying for nothing. This is another reason to instrument scrape success rate per site — a site with a success rate below 80% is not only giving you bad data, it's costing you more proxy bandwidth than it's worth.
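A per-site success rate check might look like the sketch below. It works on an in-memory list of `(site, success)` pairs for illustration; the function names are hypothetical, and in production the same numbers would more likely come from the metrics system or the scrape log.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def success_rate_by_site(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the fraction of successful scrapes per site."""
    counts = defaultdict(lambda: [0, 0])  # site -> [successes, total]
    for site, ok in results:
        counts[site][1] += 1
        if ok:
            counts[site][0] += 1
    return {site: successes / total for site, (successes, total) in counts.items()}


def sites_below(rates: Dict[str, float], minimum: float = 0.80) -> List[str]:
    """Return sites whose success rate is under the minimum, sorted by name."""
    return sorted(site for site, rate in rates.items() if rate < minimum)
```

Running this over each day's scrape results and alerting on the output of `sites_below` catches broken selectors and new bot protection before they burn through paid bandwidth.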

What the output looks like

A complete price monitoring pipeline outputs:

  • A price_history table with every recorded price, timestamp, and scrape status
  • A current_prices view showing the latest valid price per product
  • A price_alerts table with detected changes and their metadata
  • A dashboard (we typically build this in Grafana or deliver as a Notion database sync)
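As an illustration of how the current_prices view can relate to price_history, here is a minimal sketch using SQLite. The column names (`scrape_success`, `needs_review`, `scraped_at`) and the choice to store prices as text so they round-trip into Decimal are assumptions for the example, not the actual schema.

```python
import sqlite3
from decimal import Decimal

SCHEMA = """
CREATE TABLE price_history (
    id INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL,
    price TEXT,                          -- Decimal stored as text; NULL on failed scrapes
    scraped_at TEXT NOT NULL,            -- UTC ISO-8601 timestamp
    scrape_success INTEGER NOT NULL,
    needs_review INTEGER NOT NULL DEFAULT 0
);

-- Latest price per product, ignoring failed scrapes and rows awaiting review
CREATE VIEW current_prices AS
SELECT h.product_id, h.price, h.scraped_at
FROM price_history AS h
WHERE h.scrape_success = 1
  AND h.needs_review = 0
  AND h.scraped_at = (
      SELECT MAX(p.scraped_at)
      FROM price_history AS p
      WHERE p.product_id = h.product_id
        AND p.scrape_success = 1
        AND p.needs_review = 0
  );
"""


def load_current_prices(conn: sqlite3.Connection) -> dict:
    """Return {product_id: Decimal price} from the current_prices view."""
    rows = conn.execute("SELECT product_id, price FROM current_prices").fetchall()
    return {product_id: Decimal(price) for product_id, price in rows}
```

Because the view filters on both flags, a failed scrape or a suspicious price never replaces the last known good value that clients see.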

For clients who need this data in their existing systems, we deliver via webhook (POST to their API on each price change) or database sync (write directly to their Postgres or MySQL). The webhook delivery pattern uses the same architecture described in the webhook-driven automation post — store the payload on receipt, process asynchronously, and retry on downstream failure.


If you're building a price monitoring pipeline and need it to handle bot protection, data normalization, and change alerts, tell us about the scope.

Related: How We Scrape at Scale Without Getting Blocked | BookingKoala: End-to-End Booking and Dispatch Automation
