Creative Codes
Scraping | June 2, 2026 | 9 min read

Scraping Training Data at Scale: Quality, Deduplication, and Labeling Pipelines

Web scraping for ML training data has different requirements than scraping for analytics. Here's how to build a pipeline that produces clean, labeled, class-balanced datasets from the web.

Muhammad Hassan

Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.

Training data quality determines model quality, and most teams only discover data collection problems after training has already started. Scraping for training data is different from scraping for analytics. The requirements around quality, consistency, and label integrity are stricter, and mistakes made here compound during training. Here's how we build these pipelines.

What makes training data different

When you scrape data for analytics, a 5% error rate in parsing is annoying but manageable. You average over noise, remove outliers, and the insights survive.

When you scrape data for ML training, a 5% error rate in labels poisons the model. The model learns the errors. If 5% of your "positive" examples are actually negative, the model's decision boundary moves in the wrong direction. You won't know until you measure on a held-out test set — if you have one.

The three quality constraints that matter:

Label consistency. The same real-world condition should always produce the same label. This sounds obvious until you realize that the HTML structure you're using as a labeling signal changes across pages, products, or time periods.

Class balance. Real-world data is almost never balanced. If 95% of e-commerce reviews are 4-5 stars, a naive sentiment classifier trained on scraped reviews will achieve 95% accuracy by always predicting "positive." You need to be intentional about sampling.

Schema integrity. Every training record must have all required fields. A record with a missing label, a null feature value, or an out-of-range numeric field is a model bug waiting to happen.

Step 1: Schema-first collection

Before writing a single scraper, define the output schema. This is the contract between your data collection and your training pipeline.

python
from datetime import datetime
from typing import Literal

# Pydantic v1-style API; in v2 `validator` still works but is deprecated in favor of `field_validator`
from pydantic import BaseModel, validator


class ReviewRecord(BaseModel):
    review_id: str
    product_id: str
    text: str
    rating: int  # 1-5
    # The default lets callers omit the label; the validator below always overwrites it
    label: Literal["positive", "negative", "neutral"] = "neutral"
    source: str
    scraped_at: datetime
    word_count: int

    @validator("rating")
    def rating_must_be_valid(cls, v):
        if not 1 <= v <= 5:
            raise ValueError("rating must be 1-5")
        return v

    @validator("text")
    def text_must_be_substantial(cls, v):
        if len(v.split()) < 10:
            raise ValueError("review text too short to be useful")
        return v

    # always=True runs this even when the caller passes no label
    @validator("label", always=True)
    def label_from_rating(cls, v, values):
        # Derived label: computed from rating, whatever the caller passed
        rating = values.get("rating")
        if rating is None:
            return v  # rating failed its own validation; the record is rejected anyway
        if rating >= 4:
            return "positive"
        if rating <= 2:
            return "negative"
        return "neutral"

The schema does two things: it validates at collection time rather than post hoc, and it derives the label from a deterministic rule (a rating of 4 or 5 maps to positive, 1 or 2 to negative, 3 to neutral). This eliminates an entire category of label inconsistency.

Run every scraped record through Pydantic validation before it enters your dataset. Records that fail go to a failed_records table for investigation. In our experience, 2-8% of records fail schema validation on the first pass — usually because of unexpected page structures, locale-specific formatting, or structural changes on the target site.
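A minimal sketch of that routing, assuming a `validate` callable that raises `ValueError` on bad input (Pydantic's `ValidationError` is a `ValueError` subclass, so something like `lambda raw: ReviewRecord(**raw)` should fit). The names `partition_records` and `toy_validate` are illustrative, and the `failed` list stands in for the failed_records table.

```python
from typing import Any, Callable, Iterable


def partition_records(
    raw_records: Iterable[dict],
    validate: Callable[[dict], Any],
) -> tuple[list[Any], list[dict]]:
    """Split raw scraped dicts into validated records and failures."""
    valid, failed = [], []
    for raw in raw_records:
        try:
            valid.append(validate(raw))
        except ValueError as exc:
            # Keep the raw payload and the reason so failures can be investigated later.
            failed.append({"raw": raw, "error": str(exc)})
    return valid, failed


def toy_validate(raw: dict) -> dict:
    # Hypothetical stand-in for ReviewRecord(**raw), so the sketch runs without Pydantic.
    if not 1 <= raw.get("rating", 0) <= 5:
        raise ValueError("rating must be 1-5")
    return raw


valid, failed = partition_records(
    [{"rating": 5}, {"rating": 9}, {"rating": 2}], toy_validate
)
failure_rate = len(failed) / (len(valid) + len(failed))
```

Tracking `failure_rate` per source and per day makes a sudden jump easy to spot, which is often the first sign that the target site changed its markup.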

Step 2: Deduplication

Web content is deeply duplicated. The same product review appears on multiple retailer sites. News articles are syndicated. Forum posts are cross-posted. If your training set contains duplicates, you'll overfit to those examples without knowing it.

Three levels of deduplication:

Exact deduplication

Hash the content and reject duplicates:

python
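import hashlib

def content_hash(text: str) -> str:
    # Lowercase and trim so case or edge-whitespace differences hash identically
    return hashlib.sha256(text.lower().strip().encode("utf-8")).hexdigest()

# Inside the collection loop, before inserting each record
# (`records` and `db` stand in for your scraper output and database client):
for record in records:
    existing = db.query("SELECT id FROM records WHERE content_hash = %s", [content_hash(record.text)])
    if existing:
        continue  # exact duplicate, skip
    # ... insert the record together with its content_hash ...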

This catches word-for-word duplicates. Fast, zero false positives.
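The check-then-insert pattern above can race when several scraper workers write at once. One way to close that gap, sketched here with SQLite from the standard library, is to let a unique constraint on the hash column reject duplicates. The table layout and the `insert_if_new` helper are assumptions for illustration, not the pipeline's actual schema.

```python
import hashlib
import sqlite3


def content_hash(text: str) -> str:
    return hashlib.sha256(text.lower().strip().encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, text TEXT, content_hash TEXT UNIQUE)"
)


def insert_if_new(conn: sqlite3.Connection, text: str) -> bool:
    """Insert the record unless its hash already exists; return True if inserted."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO records (text, content_hash) VALUES (?, ?)",
        (text, content_hash(text)),
    )
    conn.commit()
    # rowcount is 1 when a row was written and 0 when the UNIQUE constraint ignored it.
    return cur.rowcount == 1


first = insert_if_new(conn, "Great product, fast shipping.")
second = insert_if_new(conn, "  great product, fast shipping.  ")
total = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

The second insert differs only in case and surrounding whitespace, so it hashes to the same value and is ignored by the database rather than by application code.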

Near-duplicate detection (MinHash)

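Exact hashing misses lightly edited copies: a changed word, extra punctuation, an added signature line. MinHash with locality-sensitive hashing (LSH) finds documents whose estimated Jaccard similarity is roughly 0.8 or higher. It is an approximation, so expect occasional misses and false matches near the threshold, and note that it works on token overlap, so genuine paraphrases with different wording still slip through.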

python
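from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # must match between the index and every MinHash inserted into it
lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)

def build_minhash(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    # Word-level tokens; the set avoids hashing repeated words more than once
    for word in set(text.lower().split()):
        m.update(word.encode("utf8"))
    return m

# Inside the collection loop (`records` stands in for your scraper output):
for record in records:
    m = build_minhash(record.text)
    if lsh.query(m):  # a non-empty list of keys means a near-duplicate is already indexed
        continue
    lsh.insert(record.review_id, m)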

For sentiment classification datasets, near-dedup typically removes 15-25% of records from e-commerce sites where boilerplate review text is common ("Great product, fast shipping" appears in slightly different forms thousands of times).
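To make the 0.8 threshold concrete, here is a small sketch of the exact Jaccard similarity that MinHash approximates, computed over word sets. The function name and the sample reviews are made up for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity over lowercase word sets: |A & B| / |A | B|."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


original = "great product fast shipping would buy again from this seller"
light_edit = "great product fast shipping would buy again from this store"
different = "arrived broken and the seller never answered my refund request"

sim_near = jaccard(original, light_edit)  # 9 shared words out of 11 distinct
sim_far = jaccard(original, different)    # 1 shared word out of 19 distinct
```

Changing one word in a ten-word review gives 9/11, about 0.82, which only just clears the threshold. Changing two gives 8/12, about 0.67, which does not. Short texts fall out of near-duplicate range quickly, so the threshold may need tuning per source.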

URL canonicalization

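The same content often lives at multiple URLs: example.com/product/123, example.com/product/123?ref=homepage, example.com/product/widget-123. Canonicalize URLs before storing to prevent collecting the same page through different entry points. Stripping tracking parameters handles the second form; slug variants like the third need a site-specific rule or the page's rel="canonical" link.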

python
from urllib.parse import parse_qsl, urlencode, urlparse

TRACKING_PARAMS = {"ref", "utm_source", "utm_medium", "utm_campaign"}

def canonicalize_url(url: str) -> str:
    parsed = urlparse(url)
    # Drop tracking parameters; keep the rest as (key, value) pairs so repeated keys survive
    kept_params = [(k, v) for k, v in parse_qsl(parsed.query) if k not in TRACKING_PARAMS]
    # Sort so that parameter order alone does not produce distinct URLs
    return parsed._replace(query=urlencode(sorted(kept_params))).geturl()
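A quick check of which URL variants actually collapse under parameter stripping. This sketch restates the function so it runs on its own; the sample URLs follow the examples in the text, with an `https://` scheme and a `color` parameter added as assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlparse

TRACKING_PARAMS = {"ref", "utm_source", "utm_medium", "utm_campaign"}


def canonicalize_url(url: str) -> str:
    parsed = urlparse(url)
    kept = sorted((k, v) for k, v in parse_qsl(parsed.query) if k not in TRACKING_PARAMS)
    return parsed._replace(query=urlencode(kept)).geturl()


variants = [
    "https://example.com/product/123",
    "https://example.com/product/123?ref=homepage",
    "https://example.com/product/123?utm_source=mail&color=red",
    "https://example.com/product/widget-123",
]
canonical = [canonicalize_url(u) for u in variants]
unique = set(canonical)  # three distinct URLs remain out of four
```

The `ref` variant collapses onto the bare URL and the `utm_source` parameter is dropped while `color` is kept, but the slug variant stays distinct because nothing in the query string identifies it as the same page.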

Step 3: Labeling pipelines

Schema-derived labels (like our rating → sentiment mapping) work when there's a reliable signal in the scraped data. When there isn't, you need a labeling step.

Rule-based labeling

Fast, deterministic, brittle. Define rules that map scraped fields to labels:

python
def label_job_posting(record: dict) -> str:
    if record["salary_max"] > 150000 and record["experience_years"] <= 3:
        return "high_comp_entry"
    elif "machine learning" in record["title"].lower():
        return "ml_role"
    # ...
    return "other"

Use this when your rules are high-confidence and you can validate them on a sample. It breaks down when edge cases are common or the labeling criteria are fuzzy.
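Validating rules on a sample can be as simple as comparing rule output against a small hand-labeled set. A sketch, where `rule_agreement`, the toy rule, and the hand labels are all hypothetical:

```python
from typing import Callable


def rule_agreement(
    sample: list[tuple[dict, str]],
    rule: Callable[[dict], str],
) -> tuple[float, list[dict]]:
    """Return the share of hand-labeled records the rule agrees with, plus the misses."""
    if not sample:
        raise ValueError("sample must not be empty")
    misses = []
    for record, expected in sample:
        got = rule(record)
        if got != expected:
            misses.append({"record": record, "expected": expected, "got": got})
    return 1 - len(misses) / len(sample), misses


def toy_rule(record: dict) -> str:
    # Hypothetical rule: title keyword match only.
    return "ml_role" if "machine learning" in record["title"].lower() else "other"


hand_labeled = [
    ({"title": "Machine Learning Engineer"}, "ml_role"),
    ({"title": "Backend Developer"}, "other"),
    ({"title": "ML Engineer"}, "ml_role"),  # abbreviation the keyword rule misses
    ({"title": "Data Analyst"}, "other"),
]
accuracy, misses = rule_agreement(hand_labeled, toy_rule)
```

Reading the `misses` list is usually more useful than the accuracy number itself, since each miss points at a rule that needs another branch or at a case that rules cannot handle.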

LLM-assisted labeling

For complex labels that require judgment (tone classification, content quality scoring, topic categorization), use an LLM as a labeler.

python
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def label_with_llm(text: str) -> dict:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Classify this product review as exactly one of: positive, negative, neutral.

Review: {text}

Respond with JSON only: {{"label": "...", "confidence": 0.0-1.0, "reason": "one sentence"}}"""
        }]
    )
    # Raises json.JSONDecodeError if the reply is anything other than bare JSON
    return json.loads(response.content[0].text)
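Model output is not guaranteed to be bare JSON, so it can help to parse defensively before trusting the label. A sketch, assuming the reply may arrive wrapped in a Markdown code fence or surrounded by stray prose; `parse_label_response` and `ALLOWED_LABELS` are names introduced here.

```python
import json
import re

ALLOWED_LABELS = {"positive", "negative", "neutral"}


def parse_label_response(raw: str) -> dict:
    """Extract and validate the JSON object from a labeling reply."""
    # Grab the outermost {...} span, which skips code fences or prose around it.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object in reply: {raw!r}")
    data = json.loads(match.group(0))  # JSONDecodeError is a ValueError subclass
    if data.get("label") not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {data.get('label')!r}")
    return data


fence = "`" * 3  # built this way only to keep the example readable
fenced_reply = (
    fence
    + 'json\n{"label": "negative", "confidence": 0.9, "reason": "Item broke."}\n'
    + fence
)
parsed = parse_label_response(fenced_reply)
```

Replies that fail parsing can be retried or routed to the same review queue as disagreements, so one malformed response does not stop a batch.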

For production labeling pipelines, we run each record through 3 independent LLM calls and take the majority vote. Records where all 3 disagree get flagged for human review:

python
from collections import Counter

def label_with_majority_vote(text: str) -> dict:
    # Three independent calls; variation between samples is what makes the vote informative
    labels = [label_with_llm(text)["label"] for _ in range(3)]
    label, votes = Counter(labels).most_common(1)[0]

    if votes >= 2:
        return {"label": label, "agreement": votes / 3, "needs_review": False}
    # All three calls returned different labels
    return {"label": None, "agreement": 1 / 3, "needs_review": True}

Cost at scale: Claude Haiku is ~$0.001 per labeling call. At 3 calls per record and 10,000 records, that's $30 for a labeled dataset. Acceptable for most training data budgets.

Human-in-the-loop

For the records flagged by majority vote disagreement, or for any domain where LLM labeling is unreliable (medical, legal, highly technical domains), route to human review. We use a simple internal tool: a Next.js page that shows the record text and asks for a label. Reviewers work through the queue and their decisions are written back to the database.

The 3-vote LLM approach typically reduces the human review queue to 5-10% of total records — the genuinely ambiguous cases where even humans would disagree.

Step 4: Class balance

After deduplication and labeling, check your class distribution:

python
from collections import Counter
import pandas as pd

df = pd.read_sql("SELECT label FROM training_records WHERE split = 'train'", conn)
counts = Counter(df["label"])
total = sum(counts.values())
print({k: f"{v} ({v/total*100:.1f}%)" for k, v in counts.items()})

For severe imbalance (>10:1 ratio), options:

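Oversample the minority class. Duplicate minority class examples (or use SMOTE for tabular data). Simple, effective for moderate imbalance. Do this only inside the training split, after the train/test split is made, so copies of the same example never land on both sides.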

Undersample the majority class. Randomly remove majority class examples. Loses data but keeps the dataset smaller.

Stratified sampling at collection time. If you know the target distribution before scraping, build it into your collection logic: stop collecting positive examples once you have N, continue collecting negatives until balanced.
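Collection-time quotas can be sketched as a small gate in front of the dataset writer. `QuotaCollector` and the per-class cap are assumptions for illustration.

```python
from collections import Counter


class QuotaCollector:
    """Accept records until each label reaches its cap, then reject that label."""

    def __init__(self, per_class_cap: int, labels: set[str]):
        self.cap = per_class_cap
        self.labels = labels
        self.counts: Counter = Counter()
        self.accepted: list[dict] = []

    def offer(self, record: dict) -> bool:
        label = record["label"]
        if label not in self.labels or self.counts[label] >= self.cap:
            return False
        self.counts[label] += 1
        self.accepted.append(record)
        return True

    def is_full(self) -> bool:
        # The scraper can stop once every class has hit its cap.
        return all(self.counts[label] >= self.cap for label in self.labels)


collector = QuotaCollector(per_class_cap=2, labels={"positive", "negative"})
stream = ["positive", "positive", "positive", "negative", "positive", "negative"]
results = [collector.offer({"label": label}) for label in stream]
```

One caveat with this approach: stopping a class early keeps whichever examples arrived first, so crawl order can bias the sample. Shuffling the URL frontier before collection is one way to reduce that.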

For most classification tasks at Creative Codes, we target a 60/40 split at worst, 50/50 where the data supports it.
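A 60/40 split corresponds to a majority-to-minority ratio of 1.5:1. As a sketch of undersampling toward that ceiling, with the function name, the fixed seed, and the `max_ratio` parameter as illustrative choices:

```python
import random
from collections import defaultdict


def undersample(records: list[dict], max_ratio: float = 1.5, seed: int = 0) -> list[dict]:
    """Randomly drop records so no class exceeds max_ratio times the smallest class."""
    by_label: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    smallest = min(len(group) for group in by_label.values())
    cap = int(smallest * max_ratio)  # largest allowed size for any single class

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, min(len(group), cap)))
    rng.shuffle(balanced)
    return balanced


data = [{"label": "positive"}] * 95 + [{"label": "negative"}] * 5
balanced = undersample(data)
n_pos = sum(r["label"] == "positive" for r in balanced)
n_neg = len(balanced) - n_pos
```

With 95 positives and 5 negatives this keeps 7 positives and all 5 negatives, which shows the cost of undersampling: the dataset shrinks to the scale of the minority class.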

The handoff to training

The output of the scraping + labeling pipeline is a clean dataset in a predictable format. We export as:

  • Parquet file on S3 for large datasets (efficient column-oriented storage)
  • CSV for datasets under 100K rows (simple, universally readable)
  • A database view (training_data_v1) that the training script reads directly

The dataset version is explicit. training_data_v1 is immutable once training starts. New data goes to training_data_v2. This prevents the model from silently changing between runs because the underlying data changed.
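One way to enforce that immutability is to record a fingerprint of each version when training starts and compare it before later runs. This is a sketch under assumptions: `dataset_fingerprint` is a hypothetical helper, and rows are taken to be JSON-serializable dicts with a `review_id` key.

```python
import hashlib
import json


def dataset_fingerprint(rows: list[dict], key: str = "review_id") -> str:
    """Hash the dataset contents in a stable order so any change alters the digest."""
    digest = hashlib.sha256()
    for row in sorted(rows, key=lambda r: r[key]):
        # sort_keys makes the serialization independent of dict insertion order.
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


v1 = [
    {"review_id": "a1", "text": "Works as described.", "label": "positive"},
    {"review_id": "b2", "text": "Stopped working in a week.", "label": "negative"},
]
frozen = dataset_fingerprint(v1)

same_rows_reordered = dataset_fingerprint(list(reversed(v1)))
relabeled = dataset_fingerprint([{**v1[0], "label": "neutral"}, v1[1]])
```

Row order does not affect the digest, but a single changed label does, so storing `frozen` next to the model artifact gives a cheap check that a later run read the same data.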

See From Training to Endpoint: How We Deploy Custom ML Models for the training-to-production workflow that follows this data preparation step.


If you need training data collected and labeled for a custom ML project, tell us what you're trying to classify.
