What anti-detection techniques do you use to avoid IP blocks?

We rotate residential proxies, randomize browser fingerprints using Playwright with stealth plugins, introduce behavioral delays between requests, and simulate realistic user sessions. Combined, these drop our block rate below 0.3% on most targets.

How many pages per day can your scrapers handle?

Our production architecture currently handles over 2 million pages per day across active clients. Throughput scales horizontally with additional browser instances and proxy pool size.

Can you scrape JavaScript-heavy SPAs?

Yes. We use Playwright with full browser rendering to handle React, Angular, and Vue SPAs. We wait for specific network requests to complete before extracting data, not just page load events.


Scraping · May 1, 2026 · 8 min read

How We Scrape at Scale Without Getting Blocked

Most scrapers get blocked because they're too fast, too predictable, or both. Here's the full architecture we use to run 2M+ requests daily at under 0.3% block rate.

Muhammad Hassan

Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.


Running 2M+ page requests daily at under 0.3% block rate is not about brute force. It's about making each session look like a normal browser user. This post walks through the anti-detection architecture we use at that scale, layer by layer.

The problem with naive scrapers

Most scrapers get blocked within hours of hitting a serious target. The reason isn't usually request rate. It's detectability. Modern bot protection systems like Cloudflare Bot Management, DataDome, and PerimeterX score dozens of signals simultaneously: browser fingerprints, TLS handshake patterns, behavioral cadence, and JavaScript execution environment. A scraper that fails even two or three of these checks gets flagged immediately.

The common fixes people try first don't work. Adding delays between requests helps only marginally. Rotating User-Agent headers is trivially detected, because the header no longer matches the rest of the fingerprint. A list of free proxies gets you blocked instantly because those IP ranges are pre-flagged. The problem isn't any single signal. It's the fingerprint profile as a whole.

What actually triggers a block

TLS fingerprint mismatch. Every browser produces a distinctive TLS ClientHello: the cipher suite order, the extension list, the supported elliptic curves. Python's requests library has a completely different fingerprint from Chrome 120. DataDome and Cloudflare Bot Management both match incoming TLS fingerprints against known browser profiles. A Python request that claims to be Chrome in its User-Agent fails this check immediately. The fix is to use a library like curl_cffi that implements real browser TLS profiles, or to drive actual browser instances via Playwright.
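To make the idea concrete, here is an illustrative sketch of a JA3-style fingerprint, one widely used way of hashing ClientHello fields into a single identifier. The article does not say which fingerprinting scheme any particular vendor uses, so JA3 is an assumption here, and the numeric values below are shortened, hypothetical examples rather than real captures.

```python
import hashlib


def ja3_hash(tls_version, ciphers, extensions, groups, point_formats):
    """Build a JA3-style string from ClientHello fields and return its MD5 hex digest.

    Each field is a list of decimal codes joined with "-"; fields are joined with ",".
    """
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, groups)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode("ascii")).hexdigest()


# Hypothetical, shortened ClientHello values for two different clients.
client_a = ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
client_b = ja3_hash(771, [4866, 4867, 4865], [0, 10, 11, 23], [29, 23], [0])

# Same protocol version, different ordering and extension set: different fingerprint.
print(client_a != client_b)
```

The point of the sketch is that the hash depends on order as well as content, so changing headers at the HTTP layer does nothing to the value a server computes from the handshake.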

Static residential or datacenter IPs. Datacenter IP ranges (AWS, DigitalOcean, Hetzner) are pre-blocked on most high-value targets. Residential proxies work better but fail when a single IP sends hundreds of requests in a short window. Any IP with an unusual request volume gets throttled or challenged with a CAPTCHA.

Headless browser leaks. Headless Chrome exposes over 30 JavaScript properties that reveal it's not a real browser. navigator.webdriver is set to true. The WebGL renderer identifies as SwiftShader instead of a real GPU. window.chrome is undefined. navigator.plugins is empty. Anti-bot scripts check all of these in milliseconds.

Inhuman request timing. Real users don't click exactly every 800ms. A scraper that spaces requests with fixed delays, or uses random.uniform(0.5, 1.5) without any behavioral model, produces a timing distribution that looks nothing like real traffic. Bot protection systems have baseline distributions from billions of real sessions. Yours needs to match.
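As a rough illustration of the difference, the sketch below samples delays from a log-normal distribution instead of a uniform range. The log-normal choice and the parameter values (`median_s`, `sigma`, `floor_s`) are assumptions made for this example; a real model would be fitted to recorded sessions.

```python
import math
import random
import statistics


def human_delay(rng, median_s=2.0, sigma=0.6, floor_s=0.3):
    """Sample a right-skewed delay: most values near the median, occasional long pauses."""
    # lognormvariate takes the mean and sigma of the underlying normal;
    # exp(mu) is the median of the resulting log-normal distribution.
    mu = math.log(median_s)
    return max(floor_s, rng.lognormvariate(mu, sigma))


rng = random.Random(0)  # seeded only so the example is repeatable
samples = [human_delay(rng) for _ in range(2000)]

# A long right tail pulls the mean above the median, unlike a uniform range.
print(round(statistics.median(samples), 2), round(statistics.mean(samples), 2))
```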

Missing browser entropy. Real browsers generate dozens of passive signals: CSS media query responses, audio context fingerprints, font rendering metrics. A bare Playwright instance without customization scores poorly on all of them.

How do we scrape 2M+ pages/day without getting blocked?

TLS impersonation

For high-volume scraping where we don't need full JavaScript rendering, we use curl_cffi with browser impersonation profiles (Chrome 120, Safari 17). This gives us real browser TLS fingerprints without the overhead of running an actual browser. Block rates on TLS-fingerprinted targets drop significantly compared to standard httpx or requests.

For JavaScript-rendered targets, we use Playwright with playwright-stealth plus custom patches. The standard stealth plugin handles the most common headless browser leaks. We add our own patches on top for:

WebGL renderer string (patched to match common GPU models: NVIDIA RTX 4070, Apple M2)
Canvas fingerprint (injected noise that varies per session while staying within realistic bounds)
Audio context fingerprint (subtle per-session variation)
navigator.plugins (populated with realistic plugin entries)
navigator.hardwareConcurrency and navigator.deviceMemory (consistent with the claimed device and GPU)
Battery API (returns realistic values or is disabled)

Each session uses a fresh profile generated from a profile bank of 50+ real device/browser combinations we've recorded from actual machines.
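A minimal sketch of how a profile bank might be represented, assuming a hypothetical `DeviceProfile` record and a two-entry bank with placeholder values. The key property it illustrates is internal consistency: every attribute in a session comes from the same recorded device, and a given session always maps to the same profile.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceProfile:
    """One recorded device/browser combination (field names are hypothetical)."""
    name: str
    webgl_renderer: str
    hardware_concurrency: int
    device_memory_gb: int
    platform: str


# Placeholder entries; a real bank would hold values recorded from actual machines.
PROFILE_BANK = [
    DeviceProfile("win-desktop", "NVIDIA RTX 4070", 16, 8, "Win32"),
    DeviceProfile("mac-laptop", "Apple M2", 8, 8, "MacIntel"),
]


def profile_for_session(session_id: str) -> DeviceProfile:
    """Pick one whole profile per session, never mixing fields across devices."""
    # Seeding from the session id means retries within a session reuse the same profile.
    rng = random.Random(session_id)
    return rng.choice(PROFILE_BANK)


print(profile_for_session("session-001").name)
```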

Proxy architecture

We run a three-tier proxy pool:

Residential proxies for high-security targets (Cloudflare Enterprise, DataDome, PerimeterX). Real residential IPs from ISP ranges. Rotation happens at the session level: one IP per logical browsing session. Switching IPs mid-session is a red flag. We use providers with city-level targeting so we can match the expected geographic distribution for a given target.

Mobile proxies for the highest-security targets. Mobile carrier IPs (4G/5G) are the hardest to block because carrier NAT means thousands of legitimate users share a single IP. Block rates with mobile proxies on Cloudflare-protected sites are an order of magnitude lower than with residential proxies.

Datacenter proxies for bulk low-risk targets that don't implement fingerprinting. Cheaper, faster, and high-bandwidth. We route any target that doesn't block on IP reputation through datacenter pools.

Proxy health monitoring runs continuously. IPs that return challenge pages or CAPTCHAs get pulled from rotation immediately. We maintain separate IP pools per target domain to prevent cross-contamination: if an IP gets flagged on site A, it doesn't affect sessions on site B.
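The rotation and health rules above can be sketched as follows. This is a simplified illustration, not our production code: `DomainProxyPool` and its method names are hypothetical, and it models per-domain isolation as per-domain health state over one shared list rather than physically separate pools.

```python
from collections import defaultdict


class DomainProxyPool:
    """Sticky proxy per (domain, session), with per-domain quarantine."""

    def __init__(self, proxies):
        self._all = list(proxies)
        self._quarantined = defaultdict(set)  # domain -> proxies flagged on that domain
        self._sessions = {}                   # (domain, session_id) -> proxy
        self._cursor = defaultdict(int)       # round-robin position per domain

    def proxy_for(self, domain, session_id):
        key = (domain, session_id)
        current = self._sessions.get(key)
        # Keep the same IP for the whole logical session while it stays healthy.
        if current is not None and current not in self._quarantined[domain]:
            return current
        healthy = [p for p in self._all if p not in self._quarantined[domain]]
        if not healthy:
            raise RuntimeError(f"no healthy proxies left for {domain}")
        proxy = healthy[self._cursor[domain] % len(healthy)]
        self._cursor[domain] += 1
        self._sessions[key] = proxy
        return proxy

    def report_challenge(self, domain, proxy):
        """Pull a proxy from rotation for this domain only."""
        self._quarantined[domain].add(proxy)


pool = DomainProxyPool(["p1", "p2"])
first = pool.proxy_for("a.example", "s1")
pool.report_challenge("a.example", first)
replacement = pool.proxy_for("a.example", "s1")
other_site = pool.proxy_for("b.example", "s9")
```

In this run the flagged proxy is replaced for `a.example` but remains available to `b.example`, which is the cross-contamination rule in miniature.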

Behavioral humanization

Real users don't move in straight lines or click at fixed intervals. Our behavioral layer generates synthetic session traces that match recorded human patterns:

Mouse movement: We use Bézier curves with randomized control points and variable speed profiles drawn from a distribution fitted to real session recordings. Movements follow a natural acceleration/deceleration curve rather than linear trajectories.

Scroll behavior: Pages are scrolled in variable-distance increments with pauses. Scroll speed and pause duration are drawn from a distribution, not a uniform range. We also handle momentum-style scrolling that modern browsers implement.

Typing simulation: For forms, keystrokes use realistic inter-key delays with occasional backspace/correction events at rates that match real error frequencies.

Request timing: Instead of uniform delays, we model page-load time variability plus a reading time distribution based on content length. A 2,000-word article gets a longer dwell time than a product listing.

Click targeting: We don't click exact element center coordinates. We add random offsets within the element bounds, biased toward the center to match real user behavior.
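Two of these ideas, curved mouse paths and center-biased click points, can be sketched in a few lines. This is an illustration under stated assumptions: the control-point spread, the smoothstep easing, and the Gaussian width are placeholder choices made for the example, not distributions fitted to real recordings.

```python
import random


def bezier_path(start, end, rng, steps=30):
    """Return points along a cubic Bezier curve with eased (non-linear) progress."""
    (x0, y0), (x3, y3) = start, end
    spread = 0.3 * max(abs(x3 - x0), abs(y3 - y0), 1.0)
    # Two randomized control points bend the path away from a straight line.
    x1 = x0 + (x3 - x0) * 0.3 + rng.uniform(-spread, spread)
    y1 = y0 + (y3 - y0) * 0.3 + rng.uniform(-spread, spread)
    x2 = x0 + (x3 - x0) * 0.7 + rng.uniform(-spread, spread)
    y2 = y0 + (y3 - y0) * 0.7 + rng.uniform(-spread, spread)
    points = []
    for i in range(steps + 1):
        u = i / steps
        t = u * u * (3 - 2 * u)  # smoothstep: slow start, faster middle, slow end
        mt = 1 - t
        x = mt**3 * x0 + 3 * mt**2 * t * x1 + 3 * mt * t**2 * x2 + t**3 * x3
        y = mt**3 * y0 + 3 * mt**2 * t * y1 + 3 * mt * t**2 * y2 + t**3 * y3
        points.append((x, y))
    return points


def click_point(box, rng):
    """Pick a click position inside box=(x, y, width, height), biased toward the center."""
    x, y, w, h = box
    cx = rng.gauss(x + w / 2, w / 6)
    cy = rng.gauss(y + h / 2, h / 6)
    # Clamp so the point never leaves the element bounds.
    return (min(max(cx, x), x + w), min(max(cy, y), y + h))


rng = random.Random(1)  # seeded only so the example is repeatable
path = bezier_path((0, 0), (100, 50), rng)
clicks = [click_point((10, 20, 80, 30), rng) for _ in range(500)]
```

Because the easing is applied to the curve parameter, equal time steps produce unequal distances, which gives the acceleration and deceleration described above.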

Handling JavaScript challenges

Cloudflare's JS challenge and Turnstile, DataDome's sensor collection, and similar systems run JavaScript in the browser to generate a challenge token before allowing access to the real page content.

For targets with these challenges, we pre-warm sessions: load the challenge page, let the JavaScript run in full, capture the challenge token, then proceed with the actual scraping. Session tokens are cached and reused until they expire. Most JS challenge tokens have 30-minute to 2-hour TTLs. This means we pay the challenge cost once and then run dozens of requests per session.
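The caching step can be sketched as a small TTL cache. `TokenCache` and the 60-second safety margin are hypothetical; the sketch only shows the expiry bookkeeping, not how a token is obtained.

```python
import time


class TokenCache:
    """Cache one challenge token per domain and drop it shortly before it expires."""

    def __init__(self, clock=time.monotonic, safety_margin_s=60.0):
        self._clock = clock
        self._margin = safety_margin_s
        self._entries = {}  # domain -> (token, expires_at)

    def put(self, domain, token, ttl_s):
        # Expire early so a token is never used right at the edge of its TTL.
        self._entries[domain] = (token, self._clock() + ttl_s - self._margin)

    def get(self, domain):
        entry = self._entries.get(domain)
        if entry is None:
            return None
        token, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[domain]
            return None
        return token


# Usage with a fake clock so expiry can be shown without waiting.
now = [0.0]
cache = TokenCache(clock=lambda: now[0])
cache.put("shop.example", "tok-123", ttl_s=1800)
fresh = cache.get("shop.example")
now[0] = 1750.0
expired = cache.get("shop.example")
```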

For CAPTCHA-blocked requests (which happen under 0.3% of the time in our production systems), we route to a CAPTCHA solving service rather than letting the job fail.

Infrastructure and scheduling

Distributed job queue: Scraping jobs are queued via BullMQ (Redis-backed). Workers pull jobs at configurable concurrency, which we tune per target based on observed block rates. Aggressive concurrency on a target that's watching per-IP rates triggers blocks; we back off automatically when error rates spike.
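One simple way to implement this kind of automatic back-off is additive-increase, multiplicative-decrease (AIMD). That technique is an assumption for this sketch, since the text above does not name the exact policy, and the class name, thresholds, and limits are hypothetical.

```python
class AdaptiveConcurrency:
    """Raise concurrency slowly while a target is healthy; halve it when errors spike."""

    def __init__(self, start=4, minimum=1, maximum=32, error_threshold=0.05):
        self.limit = start
        self.minimum = minimum
        self.maximum = maximum
        self.error_threshold = error_threshold

    def update(self, requests, errors):
        """Feed in counts from the last observation window and get the new limit."""
        rate = errors / requests if requests else 0.0
        if rate > self.error_threshold:
            self.limit = max(self.minimum, self.limit // 2)  # back off quickly
        else:
            self.limit = min(self.maximum, self.limit + 1)   # recover gradually
        return self.limit


controller = AdaptiveConcurrency(start=8)
after_healthy = controller.update(requests=100, errors=1)
after_spike = controller.update(requests=100, errors=20)
```

The asymmetry is deliberate: a target that has started blocking should see load drop immediately, while recovery should be slow enough not to trigger the same response again.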

Rotating request schedules: We don't scrape targets at uniform intervals. Real traffic to any website follows a daily pattern: high during business hours, low at night. Our scrapers mirror this distribution, concentrating requests during peak hours and reducing volume during off-hours.
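As a sketch of the idea, the function below splits a daily request budget across 24 hours in proportion to a weight curve. The weights are hypothetical placeholders, not measured traffic; in practice the curve would depend on the target's audience and time zone.

```python
def hourly_budget(daily_total, weights):
    """Split a daily request budget across 24 hours in proportion to the weights."""
    if len(weights) != 24:
        raise ValueError("expected one weight per hour")
    total_weight = sum(weights)
    raw = [daily_total * w / total_weight for w in weights]
    budget = [int(r) for r in raw]
    # Hand the rounding remainder to the hours with the largest fractional parts.
    remainder = daily_total - sum(budget)
    by_fraction = sorted(range(24), key=lambda h: raw[h] - budget[h], reverse=True)
    for hour in by_fraction[:remainder]:
        budget[hour] += 1
    return budget


# Hypothetical curve: quiet overnight, ramp up, busy 09:00-18:00, taper in the evening.
WEIGHTS = [1] * 7 + [3] * 2 + [6] * 9 + [3] * 4 + [1] * 2
budget = hourly_budget(10_000, WEIGHTS)
```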

Change detection: For recurring scrapers, we only re-fetch pages that have changed. We store content hashes and check Last-Modified and ETag headers before pulling full page content. On large sites, 60-80% of pages are unchanged between runs. Skipping them reduces cost and reduces our footprint on the target's server.
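A minimal sketch of the check, using standard HTTP conditional request headers (`If-None-Match`, `If-Modified-Since`) plus a content hash as a fallback. The shape of the `cached` record and the function names are hypothetical, and the HTTP call itself is left out so the example stays self-contained.

```python
import hashlib


def conditional_headers(cached):
    """Build conditional request headers from validators stored on the previous run."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def has_changed(status_code, body, cached):
    """Decide whether a page needs re-parsing, given the response and the cached record."""
    if status_code == 304:  # server confirmed nothing changed; no body was sent
        return False
    # Some servers ignore validators, so compare a hash of the body as a fallback.
    digest = hashlib.sha256(body).hexdigest()
    return digest != cached.get("content_hash")


cached = {
    "etag": '"abc123"',
    "last_modified": "Wed, 21 Oct 2015 07:28:00 GMT",
    "content_hash": hashlib.sha256(b"<html>v1</html>").hexdigest(),
}
headers = conditional_headers(cached)
```

A 304 response costs a round trip but no page body, which is where most of the bandwidth saving comes from.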

Dead-letter handling: Failed requests go to a dead-letter queue with metadata about the failure type (block, CAPTCHA, timeout, parsing error). We review these to identify new bot detection patterns and update our evasion strategies. A spike in a specific error type is usually a signal that the target has deployed new protection.
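The triage step might look roughly like this. The classification rules (status codes 403 and 429, a "captcha" substring check) and the spike thresholds are simplified assumptions for illustration; real detection pages vary by vendor.

```python
from collections import Counter


def classify_failure(status_code, body, timed_out=False, parse_ok=True):
    """Assign a coarse failure type for dead-letter metadata."""
    if timed_out:
        return "timeout"
    if "captcha" in body.lower():
        return "captcha"
    if status_code in (403, 429):
        return "block"
    if not parse_ok:
        return "parsing_error"
    return "other"


def spiking_types(previous, current, factor=3.0, min_count=10):
    """Return failure types whose count grew by more than `factor` between two windows."""
    prev, cur = Counter(previous), Counter(current)
    return sorted(
        kind for kind, n in cur.items()
        if n >= min_count and n > factor * max(prev[kind], 1)
    )


last_window = ["block"] * 2 + ["timeout"] * 3
this_window = ["block"] * 20 + ["timeout"] * 3
alerts = spiking_types(last_window, this_window)
```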

Results at scale

This architecture runs 2M+ page requests daily across Creative Codes' production scrapers. Current metrics:

Block rate: under 0.3% across all targets
CAPTCHA challenge rate: under 0.1%
Successful parse rate: 99.4% (failures are mostly structural changes to target pages, not blocks)
Average request latency: 1.2-3.8 seconds per page depending on target complexity

The targets range from simple HTML sites to JavaScript-heavy SPAs protected by Cloudflare Enterprise. The architecture handles both without code changes to the core scraper. Routing logic selects the right proxy tier and fingerprint profile based on a per-domain configuration.
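A per-domain configuration of this kind could be sketched as a lookup table with a default. The field names, tier labels, and example domains below are hypothetical and only illustrate the routing idea.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class TargetConfig:
    """Routing choices for one target (field names are hypothetical)."""
    proxy_tier: str = "datacenter"  # "datacenter", "residential", or "mobile"
    fetcher: str = "http"           # "http" (TLS-impersonating client) or "browser"
    max_concurrency: int = 8


DEFAULT = TargetConfig()
CONFIGS = {
    "shop.example": TargetConfig("residential", "browser", 4),
    "listings.example": TargetConfig("mobile", "browser", 2),
}


def config_for(url):
    """Find the most specific matching domain entry, falling back to the default."""
    host = urlsplit(url).hostname or ""
    parts = host.split(".")
    # Try "www.shop.example", then "shop.example", and so on.
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in CONFIGS:
            return CONFIGS[candidate]
    return DEFAULT


print(config_for("https://www.shop.example/p/1").proxy_tier)
```

Keeping these choices in data rather than code is what lets a new target be onboarded by tuning a config entry instead of changing the core scraper.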

What this means for your project

If you need data from a source that blocks conventional scrapers, the answer isn't a simpler tool. It's a more complete model of what legitimate browser traffic looks like.

Every component described here runs in our production pipelines. The configuration per target takes an hour or two to tune. The underlying infrastructure handles the rest.

Our web scraping service covers the full stack: TLS impersonation, proxy architecture, behavioral humanization, and production-grade infrastructure. We handle the engineering; you get the data. For a concrete example of this architecture applied to a specific marketplace, see how we built the BizBuySell scraper.

Once you have the data flowing, the next step is usually ML processing and classification: enrichment, anomaly detection, or feeding structured records into a decision pipeline that triggers automated actions.

If you're deciding which scraping tool to use for your project, see Playwright vs Scrapy vs Crawl4AI: When to Use Each for the decision framework we use on every new engagement.
