Scraping Training Data at Scale: Quality, Deduplication, and Labeling Pipelines
Web scraping for ML training data has different requirements than scraping for analytics. Here's how to build a pipeline that produces clean, labeled, class-balanced datasets from the web.
Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.
Training data quality determines model quality, and most teams only discover data collection problems after training has already started. Scraping for training data is different from scraping for analytics. The requirements around quality, consistency, and label integrity are stricter, and mistakes made here compound during training. Here's how we build these pipelines.
What makes training data different
When you scrape data for analytics, a 5% error rate in parsing is annoying but manageable. You average over noise, remove outliers, and the insights survive.
When you scrape data for ML training, a 5% error rate in labels poisons the model. The model learns the errors. If 5% of your "positive" examples are actually negative, the model's decision boundary moves in the wrong direction. You won't know until you measure on a held-out test set — if you have one.
The three quality constraints that matter:
Label consistency. The same real-world condition should always produce the same label. This sounds obvious until you realize that the HTML structure you're using as a labeling signal changes across pages, products, or time periods.
Class balance. Real-world data is almost never balanced. If 95% of e-commerce reviews are 4-5 stars, a naive sentiment classifier trained on scraped reviews will achieve 95% accuracy by always predicting "positive." You need to be intentional about sampling.
Schema integrity. Every training record must have all required fields. A record with a missing label, a null feature value, or an out-of-range numeric field is a model bug waiting to happen.
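The class-balance trap is easy to verify with arithmetic. The sketch below uses invented counts (950 positive, 50 negative) and a hypothetical `majority_baseline` helper; it only illustrates why accuracy alone is misleading and is not part of the pipeline described here.

```python
from collections import Counter


def majority_baseline(labels: list[str]) -> dict:
    """Score a 'classifier' that always predicts the most common label."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    accuracy = majority_count / len(labels)
    # Recall is 1.0 for the majority class and 0.0 for every other class,
    # because this baseline never predicts anything else.
    recall = {label: (1.0 if label == majority_label else 0.0) for label in counts}
    return {"predicts": majority_label, "accuracy": accuracy, "recall": recall}


# Illustrative counts only: 95% positive, 5% negative.
labels = ["positive"] * 950 + ["negative"] * 50
result = majority_baseline(labels)
print(result)
```

High accuracy with zero recall on the minority class is the signature of this failure, which is why per-class metrics are worth checking alongside accuracy.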
Step 1: Schema-first collection
Before writing a single scraper, define the output schema. This is the contract between your data collection and your training pipeline.
    from datetime import datetime
    from typing import Literal

    # Pydantic v1-style validator API (deprecated in v2 in favor of field_validator)
    from pydantic import BaseModel, validator


    class ReviewRecord(BaseModel):
        review_id: str
        product_id: str
        text: str
        rating: int  # 1-5
        label: Literal["positive", "negative", "neutral"]
        source: str
        scraped_at: datetime
        word_count: int

        @validator("rating")
        def rating_must_be_valid(cls, v):
            if not 1 <= v <= 5:
                raise ValueError("rating must be 1-5")
            return v

        @validator("text")
        def text_must_be_substantial(cls, v):
            if len(v.split()) < 10:
                raise ValueError("review text too short to be useful")
            return v

        @validator("label")
        def label_from_rating(cls, v, values):
            # Derived label: whatever the caller passes is overwritten from the rating,
            # so we don't rely on the caller to set it correctly.
            rating = values.get("rating")
            if rating and rating >= 4:
                return "positive"
            elif rating and rating <= 2:
                return "negative"
            return "neutral"

The schema does two things: it validates at collection time (not post-hoc), and it derives the label from a deterministic rule (rating >= 4 is positive, rating <= 2 is negative, anything else is neutral). This eliminates an entire category of label inconsistency.
Run every scraped record through Pydantic validation before it enters your dataset. Records that fail go to a failed_records table for investigation. In our experience, 2-8% of records fail schema validation on the first pass — usually because of unexpected page structures, locale-specific formatting, or structural changes on the target site.
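A minimal sketch of that routing step, assuming failures are kept as raw payload plus error message. `partition_records` and `demo_validate` are hypothetical names; in a real pipeline the validator would be something like `lambda raw: ReviewRecord(**raw)`, and the `failed` list would be written to the failed_records table.

```python
from typing import Callable, Iterable


def partition_records(
    raw_records: Iterable[dict],
    validate: Callable[[dict], object],
) -> tuple[list, list[dict]]:
    """Split raw scraped dicts into validated records and failures."""
    valid, failed = [], []
    for raw in raw_records:
        try:
            valid.append(validate(raw))
        except ValueError as exc:  # pydantic's ValidationError subclasses ValueError
            # Keep the raw payload and the reason so the failure can be investigated.
            failed.append({"raw": raw, "error": str(exc)})
    return valid, failed


# Stand-in validator so the example runs without third-party packages.
def demo_validate(raw: dict) -> dict:
    if not 1 <= raw["rating"] <= 5:
        raise ValueError("rating must be 1-5")
    return raw


valid, failed = partition_records(
    [{"rating": 5}, {"rating": 9}, {"rating": 2}], demo_validate
)
print(len(valid), "valid;", len(failed), "failed")
```

Storing the error string next to the raw payload makes it possible to group failures by cause, which is usually the fastest way to spot a changed page structure.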
Step 2: Deduplication
Web content is deeply duplicated. The same product review appears on multiple retailer sites. News articles are syndicated. Forum posts are cross-posted. If your training set contains duplicates, you'll overfit to those examples without knowing it.
Three levels of deduplication:
Exact deduplication
Hash the content and reject duplicates:
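    import hashlib

    def content_hash(text: str) -> str:
        # Normalize case and surrounding whitespace so trivial variants hash identically
        return hashlib.sha256(text.lower().strip().encode()).hexdigest()

    # Inside the collection loop, before inserting
    # (`records` and `db` stand in for your record stream and database client):
    for record in records:
        existing = db.query(
            "SELECT id FROM records WHERE content_hash = %s",
            [content_hash(record.text)],
        )
        if existing:
            continue  # exact duplicate, skip it

This catches word-for-word duplicates. It is fast and has effectively zero false positives.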
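Lowercasing and stripping still lets through copies that differ only in internal whitespace or Unicode form. The sketch below is one possible stronger normalization, using only the standard library. `normalized_hash` is a hypothetical helper, and whether these variants should count as duplicates is an assumption to check against your own data.

```python
import hashlib
import re
import unicodedata


def normalized_hash(text: str) -> str:
    """Hash text after Unicode (NFKC), whitespace, and case normalization."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text).strip().casefold()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


seen: set[str] = set()
unique = []
for text in ["Great product!", "great   product!\n", "Great product"]:
    digest = normalized_hash(text)
    if digest in seen:
        continue  # exact duplicate after normalization
    seen.add(digest)
    unique.append(text)
print(unique)
```

The second string collapses into the first, while the third (missing the exclamation mark) is kept, because punctuation is deliberately left untouched here.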
Near-duplicate detection (MinHash)
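Exact hashing misses lightly edited copies: a changed word, an added sentence, different punctuation. MinHash with locality-sensitive hashing (LSH) finds documents whose estimated Jaccard similarity (here, over the words they contain) is roughly 0.8 or higher.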
    from datasketch import MinHash, MinHashLSH

    lsh = MinHashLSH(threshold=0.8, num_perm=128)

    def build_minhash(text: str) -> MinHash:
        m = MinHash(num_perm=128)
        for word in text.lower().split():
            m.update(word.encode("utf8"))
        return m

    # During collection (`records` stands in for the stream of validated records):
    for record in records:
        m = build_minhash(record.text)
        if lsh.query(m):
            continue  # near-duplicate already indexed, skip it
        lsh.insert(record.review_id, m)

For sentiment classification datasets, near-dedup typically removes 15-25% of records from e-commerce sites where boilerplate review text is common ("Great product, fast shipping" appears in slightly different forms thousands of times).
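To get a feel for what a 0.8 threshold means, it helps to compute the exact Jaccard similarity that MinHash approximates. The sketch below does that over word sets for three invented reviews. It is an illustration only: LSH is probabilistic, so pairs close to the threshold may or may not be returned by a query.

```python
def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity between the word sets of two texts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Invented examples: a one-word edit versus an unrelated review.
original = "great product fast shipping would buy again from this seller"
edited = "great product fast shipping would buy again from this store"
unrelated = "arrived broken and support never replied"

print(round(jaccard(original, edited), 2))
print(round(jaccard(original, unrelated), 2))
```

A single changed word in a ten-word review already brings the similarity down to 9/11, just above the threshold, so short texts are far more sensitive to small edits than long ones.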
URL canonicalization
The same content often lives at multiple URLs: example.com/product/123, example.com/product/123?ref=homepage, example.com/product/widget-123. Canonicalize URLs before storing to prevent collecting the same page through different entry points.
    from urllib.parse import urlparse, urlencode, parse_qsl

    def canonicalize_url(url: str) -> str:
        parsed = urlparse(url)
        # Remove tracking parameters and sort the rest so parameter order does not matter.
        # Path variants such as /product/widget-123 still need site-specific rules.
        kept_params = {k: v for k, v in parse_qsl(parsed.query)
                       if k not in {"ref", "utm_source", "utm_medium", "utm_campaign"}}
        return parsed._replace(query=urlencode(sorted(kept_params.items()))).geturl()

Step 3: Labeling pipelines
Schema-derived labels (like our rating → sentiment mapping) work when there's a reliable signal in the scraped data. When there isn't, you need a labeling step.
Rule-based labeling
Fast, deterministic, brittle. Define rules that map scraped fields to labels:
    def label_job_posting(record: dict) -> str:
        if record["salary_max"] > 150000 and record["experience_years"] <= 3:
            return "high_comp_entry"
        elif "machine learning" in record["title"].lower():
            return "ml_role"
        # ...
        return "other"

Use this when your rules are high-confidence and you can validate them on a sample. It breaks down when edge cases are common or the labeling criteria are fuzzy.
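Validating rules on a sample can be as simple as comparing the rule's output with a small hand-labeled set and reading the misses. The sketch below assumes such a sample exists as (record, expected label) pairs. `rule_agreement`, `toy_rule`, and the four records are invented for illustration.

```python
from typing import Callable


def rule_agreement(
    sample: list[tuple[dict, str]],
    rule: Callable[[dict], str],
) -> tuple[float, list[dict]]:
    """Return the share of hand-labeled records the rule gets right, plus the misses."""
    misses = []
    for record, expected in sample:
        predicted = rule(record)
        if predicted != expected:
            misses.append({"record": record, "expected": expected, "predicted": predicted})
    return 1 - len(misses) / len(sample), misses


# Toy rule and toy hand-labeled sample, both invented for this example.
def toy_rule(record: dict) -> str:
    return "ml_role" if "machine learning" in record["title"].lower() else "other"


sample = [
    ({"title": "Machine Learning Engineer"}, "ml_role"),
    ({"title": "ML Platform Engineer"}, "ml_role"),  # the rule misses the abbreviation
    ({"title": "Backend Developer"}, "other"),
    ({"title": "Data Analyst"}, "other"),
]
agreement, misses = rule_agreement(sample, toy_rule)
print(f"agreement: {agreement:.0%}, misses: {len(misses)}")
```

The list of misses is usually more useful than the agreement number, since it shows exactly which edge cases the rule does not cover.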
LLM-assisted labeling
For complex labels that require judgment (tone classification, content quality scoring, topic categorization), use an LLM as a labeler.
    import json

    import anthropic

    client = anthropic.Anthropic()

    def label_with_llm(text: str) -> dict:
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=100,
            messages=[{
                "role": "user",
                "content": f"""Classify this product review as exactly one of: positive, negative, neutral.
    Review: {text}
    Respond with JSON only: {{"label": "...", "confidence": 0.0-1.0, "reason": "one sentence"}}"""
            }]
        )
        # Raises json.JSONDecodeError if the model returns anything other than bare JSON
        return json.loads(response.content[0].text)

For production labeling pipelines, we run each record through 3 independent LLM calls and take the majority vote. Records where all 3 calls disagree get flagged for human review:
    from collections import Counter

    def label_with_majority_vote(text: str) -> dict:
        # Three independent calls; label_with_llm decides which model is used
        labels = [label_with_llm(text)["label"] for _ in range(3)]
        counts = Counter(labels)
        majority = counts.most_common(1)[0]
        if majority[1] >= 2:
            return {"label": majority[0], "agreement": majority[1] / 3, "needs_review": False}
        else:
            # All three calls disagreed: no majority, route to human review
            return {"label": None, "agreement": 0.33, "needs_review": True}

Cost at scale: Claude Haiku costs roughly $0.001 per labeling call. At 3 calls per record and 10,000 records, that's $30 for a labeled dataset, which is acceptable for most training data budgets.
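The cost arithmetic scales linearly, so it is easy to project for larger datasets. The sketch below uses the rough $0.001 per-call figure from the text as a default. `labeling_cost` is a hypothetical helper, and the real per-call cost depends on prompt length, response length, and current pricing.

```python
def labeling_cost(
    records: int,
    calls_per_record: int = 3,
    cost_per_call: float = 0.001,  # rough per-call figure quoted in the text
) -> float:
    """Estimated labeling spend in dollars."""
    return records * calls_per_record * cost_per_call


for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} records -> ${labeling_cost(n):,.2f}")
```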
Human-in-the-loop
For the records flagged by majority vote disagreement, or for any domain where LLM labeling is unreliable (medical, legal, highly technical domains), route to human review. We use a simple internal tool: a Next.js page that shows the record text and asks for a label. Reviewers work through the queue and their decisions are written back to the database.
The 3-vote LLM approach typically reduces the human review queue to 5-10% of total records — the genuinely ambiguous cases where even humans would disagree.
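Routing on the `needs_review` flag can be sketched in a few lines. The vote results below are invented and follow the shape returned by `label_with_majority_vote`, with a hypothetical `review_id` key added so queued records can be traced back. `split_for_review` is likewise an illustrative name.

```python
def split_for_review(results: list[dict]) -> tuple[list[dict], list[dict], float]:
    """Separate auto-labeled records from those needing a human decision."""
    auto = [r for r in results if not r["needs_review"]]
    queue = [r for r in results if r["needs_review"]]
    review_rate = len(queue) / len(results) if results else 0.0
    return auto, queue, review_rate


# Invented vote results for illustration.
results = [
    {"review_id": "r1", "label": "positive", "agreement": 1.0, "needs_review": False},
    {"review_id": "r2", "label": "negative", "agreement": 2 / 3, "needs_review": False},
    {"review_id": "r3", "label": None, "agreement": 0.33, "needs_review": True},
    {"review_id": "r4", "label": "positive", "agreement": 1.0, "needs_review": False},
]
auto, queue, review_rate = split_for_review(results)
print(f"{len(auto)} auto-labeled, {len(queue)} queued ({review_rate:.0%})")
```

Tracking the review rate per batch gives an early warning: a sudden jump usually means the source content or the prompt has drifted.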
Step 4: Class balance
After deduplication and labeling, check your class distribution:
    from collections import Counter

    import pandas as pd

    # `conn` is an open database connection created elsewhere
    df = pd.read_sql("SELECT label FROM training_records WHERE split = 'train'", conn)
    counts = Counter(df["label"])
    total = sum(counts.values())
    print({k: f"{v} ({v/total*100:.1f}%)" for k, v in counts.items()})

For severe imbalance (worse than a 10:1 ratio), the options are:
Oversample the minority class. Duplicate minority class examples (or use SMOTE for tabular data). Simple, effective for moderate imbalance.
Undersample the majority class. Randomly remove majority class examples. Loses data but keeps the dataset smaller.
Stratified sampling at collection time. If you know the target distribution before scraping, build it into your collection logic: stop collecting positive examples once you have N, continue collecting negatives until balanced.
For most classification tasks at Creative Codes, we target a 60/40 split at worst, 50/50 where the data supports it.
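As an illustration of the undersampling option with a 60/40 cap, the sketch below randomly trims the largest class until its share is at most `max_majority_share`. The `undersample` helper and its parameters are hypothetical, it is written with the two-class case in mind, and the fixed seed is an assumption made so the sample is reproducible.

```python
import random
from collections import Counter


def undersample(records: list[dict], max_majority_share: float = 0.6, seed: int = 0) -> list[dict]:
    """Randomly drop majority-class records until their share is at most the cap."""
    if not 0.5 <= max_majority_share < 1:
        raise ValueError("max_majority_share must be in [0.5, 1)")
    if not records:
        return []
    counts = Counter(r["label"] for r in records)
    majority_label, majority_count = counts.most_common(1)[0]
    rest = len(records) - majority_count
    if rest == 0:
        return list(records)  # single-class data: nothing to balance against
    # Largest majority count m with m / (m + rest) <= max_majority_share
    # (the small epsilon guards against floating-point rounding just below an integer).
    allowed = int(rest * max_majority_share / (1 - max_majority_share) + 1e-9)
    if majority_count <= allowed:
        return list(records)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    majority = [r for r in records if r["label"] == majority_label]
    others = [r for r in records if r["label"] != majority_label]
    return others + rng.sample(majority, allowed)


# Invented data: 950 positive and 50 negative records.
records = [{"label": "positive"}] * 950 + [{"label": "negative"}] * 50
balanced = undersample(records)
print(Counter(r["label"] for r in balanced))
```

Apply this to the training split only; the test split should keep the real-world distribution so that evaluation reflects production conditions.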
The handoff to training
The output of the scraping + labeling pipeline is a clean dataset in a predictable format. We export as:
- Parquet file on S3 for large datasets (efficient column-oriented storage)
- CSV for datasets under 100K rows (simple, universally readable)
- A database view (training_data_v1) that the training script reads directly
The dataset version is explicit. training_data_v1 is immutable once training starts. New data goes to training_data_v2. This prevents the model from silently changing between runs because the underlying data changed.
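One way to enforce that immutability, sketched here as an assumption rather than a description of our tooling, is to store a fingerprint of the dataset when training starts and verify it before every later run. `dataset_fingerprint` is a hypothetical helper and the records are invented.

```python
import hashlib
import json


def dataset_fingerprint(records: list[dict]) -> str:
    """Order-independent SHA-256 over the records of a dataset version."""
    # Serialize each record deterministically, then sort so row order does not matter.
    # default=str lets non-JSON values such as datetimes be serialized.
    lines = sorted(
        json.dumps(r, sort_keys=True, ensure_ascii=False, default=str) for r in records
    )
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


v1 = [
    {"review_id": "a1", "label": "positive"},
    {"review_id": "b2", "label": "negative"},
]
frozen = dataset_fingerprint(v1)  # store this when training on v1 starts

# Later: the same rows in a different order still match the stored fingerprint...
print(dataset_fingerprint(list(reversed(v1))) == frozen)
# ...but any added or edited row does not.
changed = v1 + [{"review_id": "c3", "label": "neutral"}]
print(dataset_fingerprint(changed) == frozen)
```

If the fingerprint check fails, the safe response is to stop and cut a new version rather than retrain on silently changed data.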
See From Training to Endpoint: How We Deploy Custom ML Models for the training-to-production workflow that follows this data preparation step.
If you need training data collected and labeled for a custom ML project, tell us what you're trying to classify.