How do you handle Cloudflare protection when scraping?

Cloudflare's standard bot protection is bypassed using residential proxies, TLS fingerprint spoofing, and behavioral patterns that match real users. For sites with Cloudflare Turnstile or advanced challenge pages, we route through specialized proxy services designed for this.

What scraping tools do you use for large-scale projects?

Playwright with stealth plugins for JavaScript-heavy sites, Scrapy for high-throughput static pages, and custom Python services using httpx with managed proxy rotation for API-like endpoints. The choice depends on target site complexity and required throughput.

How do you handle data freshness for scraped datasets?

We build incremental scrapers that only re-fetch changed pages using ETag and Last-Modified headers where available, change detection on key fields for others, and full re-scrapes on a configurable schedule. Most clients run full refreshes daily with intraday incremental updates.
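A minimal sketch of the header-based part of that approach, using standard HTTP conditional requests. The cache layout (`etag` and `last_modified` keys) and both function names are assumptions for illustration, not the production code.

```python
def conditional_headers(cached: dict) -> dict:
    """Build request headers from validators saved on the previous fetch."""
    headers = {}
    if cached.get("etag"):
        # The server answers 304 if the ETag still matches.
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        # Fallback validator when no ETag was provided.
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def content_changed(status_code: int) -> bool:
    """A 304 Not Modified response means the cached copy is still current."""
    return status_code != 304
```

Pages that send neither validator fall through to the field-level change detection described above.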


Scraping · May 10, 2026 · 9 min read

How We Built a BizBuySell Scraping Pipeline That Tracks Thousands of Listings Daily

BizBuySell runs Cloudflare. Our first scraper got blocked in 20 minutes. Here's the architecture that fixed it.

Muhammad Hassan

Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.


BizBuySell is Cloudflare-protected and rate-limits aggressively. Our first scraper got blocked in 20 minutes. This is the architecture that then ran daily for more than six months without manual intervention, collecting acquisition intelligence from thousands of business listings.

The challenge

BizBuySell is the largest US marketplace for businesses for sale. Thousands of listings go live every day, each one containing asking price, trailing-twelve-month cash flow, gross revenue, EBITDA, seller financing terms, broker details, and a text description of the business.

At Creative Codes, we built a production scraping pipeline that monitors thousands of BizBuySell listings daily for a business acquisition advisory firm, eliminating 40+ hours of manual research per week. Their analysts were manually browsing listings for hours every day, copying financials into spreadsheets, and trying to identify undervalued opportunities before competitors. They missed deals constantly. A strong listing at $800K asking on a $280K cash flow would appear, get flooded with inquiries, and go under LOI before the team had even seen it.

The ask was straightforward: scrape everything, score it, and alert us the moment something good appears.

The execution was not.

Architecture decisions

BizBuySell serves content through a JavaScript-rendered interface. The listing cards load via XHR after the initial page paint, which rules out simple HTTP request scrapers like requests + BeautifulSoup. You need a real browser.

We evaluated Scrapy with Splash, Selenium, and Playwright. We chose Playwright for three reasons:

Native async: Playwright's async API handles concurrent page operations cleanly, which matters when you're paginating through hundreds of search result pages
First-class stealth support: playwright-stealth patches the most common fingerprinting vectors (WebGL, canvas, navigator, Chrome runtime) without the fragility of Selenium patching
Built-in waiting: wait_until="networkidle" saves significant complexity compared to writing explicit wait conditions for XHR completion

The core loop is straightforward:

python

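from playwright.async_api import async_playwright
from playwright_stealth import stealth_async  # assumes playwright-stealth 1.x; 2.x has a different entry point

# BusinessListing, parse_currency, and deduplicate are project helpers defined elsewhere.

async def field_text(card, selector: str) -> str | None:
    """Return the stripped text of a child element, or None if it is missing."""
    el = await card.query_selector(selector)
    text = await el.text_content() if el else None
    return text.strip() if text else None

async def scrape_listings(search_url: str) -> list[BusinessListing]:
    listings = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await stealth_async(page)
            await page.goto(search_url, wait_until="networkidle")

            while True:
                # Extract every card on the current results page.
                for card in await page.query_selector_all(".listing-card"):
                    listings.append(BusinessListing(
                        title=await field_text(card, ".listing-title"),
                        asking_price=parse_currency(await field_text(card, ".price")),
                        cash_flow=parse_currency(await field_text(card, ".cash-flow")),
                        revenue=parse_currency(await field_text(card, ".revenue")),
                        location=await field_text(card, ".location"),
                        industry=await field_text(card, ".category"),
                    ))

                # Stop when there is no further results page.
                next_btn = await page.query_selector(".next-page")
                if not next_btn:
                    break
                await next_btn.click()
                await page.wait_for_load_state("networkidle")
        finally:
            # Close the browser even if extraction fails midway.
            await browser.close()
    return deduplicate(listings)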

The parse_currency function handles the inconsistency in how BizBuySell formats financials: $1.2M, $1,200,000, Asking: $1.2m, and Not Disclosed all need different handling.
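As an illustration only, a parser covering the four formats named above could look like the sketch below. The regular expression, the suffix table, and the choice to return `None` for undisclosed values are assumptions, not the production implementation.

```python
import re
from typing import Optional

# Suffix multipliers; matching is case-insensitive ("$1.2M" and "$1.2m").
_MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}
_AMOUNT = re.compile(r"\$\s*([\d,]+(?:\.\d+)?)\s*([kmb])?\b", re.IGNORECASE)


def parse_currency(raw: Optional[str]) -> Optional[int]:
    """Return a whole-dollar amount, or None when no figure is disclosed."""
    if not raw:
        return None
    match = _AMOUNT.search(raw)
    if match is None:
        return None  # e.g. "Not Disclosed"
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").lower()
    return round(number * _MULTIPLIERS.get(suffix, 1))
```

Returning `None` rather than `0` keeps an undisclosed figure distinguishable from a genuinely invalid one during validation.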

Handling anti-detection

Tip: The key to sustainable scraping isn't speed, it's stealth. We throttle requests to mimic human browsing patterns rather than hammering the server as fast as possible.

BizBuySell uses Cloudflare for bot protection. Naive scrapers get challenged immediately. Our anti-detection stack:

Residential proxies: datacenter IPs are trivially fingerprinted. We rotate through a residential proxy pool, cycling IPs at the session level rather than per request, so a session always appears to come from the same location. An IP that changes on every request is a stronger bot signal than a consistent one.

Randomized delays: we draw inter-action delays from a log-normal distribution fitted to real browsing session data. The median delay is ~1.8 seconds, with a long tail that occasionally produces 6-8 second pauses. Perfectly uniform timing is a bot signal.

Browser fingerprint rotation: each new browser session generates a fresh fingerprint profile. Canvas hash, WebGL renderer string, font enumeration, and audio context fingerprints all vary. playwright-stealth handles most of this, but we layer additional patches for the vectors it misses.

Session reuse: rather than launching a new browser for every search page, we maintain browser sessions across multiple searches. Cold browser launches have a different fingerprint profile than established sessions with cookies and local storage populated.
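The session-level proxy rotation described above can be sketched as a small pool that pins one proxy to each session until that session is rotated. The class name, method names, and proxy strings are hypothetical; a real pool would come from the proxy provider.

```python
import random
from dataclasses import dataclass, field


@dataclass
class StickyProxyPool:
    """Assign one proxy per session and keep it until the session rotates."""
    proxies: list[str]
    _assigned: dict[str, str] = field(default_factory=dict)

    def proxy_for(self, session_id: str) -> str:
        # The same session always gets the same exit IP.
        if session_id not in self._assigned:
            self._assigned[session_id] = random.choice(self.proxies)
        return self._assigned[session_id]

    def rotate(self, session_id: str) -> str:
        # Pick a different proxy, e.g. after the session is challenged.
        current = self._assigned.get(session_id)
        candidates = [p for p in self.proxies if p != current] or self.proxies
        self._assigned[session_id] = random.choice(candidates)
        return self._assigned[session_id]
```

The returned value would then be passed to the browser when the session's context is created, so every page in that session shares one exit IP.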

Data pipeline

Once extracted, listings pass through a validation and normalization pipeline before hitting the database.

Scrape listings → Extract 30+ fields → Validate & normalize → Deduplicate → Score & store

Validation catches the most common issues: missing required fields, null financials where the listing has them but the parser failed, and obviously invalid values (asking price of $0, revenue lower than cash flow).
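A minimal sketch of such checks, assuming a simplified `Listing` record with only a handful of the 30+ fields. The field names and messages are illustrative. The "parser failed" case is not covered here, since detecting it requires comparing against the raw page text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Listing:
    url: str
    title: str
    asking_price: Optional[int]
    cash_flow: Optional[int]
    revenue: Optional[int]


def validate(listing: Listing) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if not listing.url or not listing.title:
        problems.append("missing required field")
    if listing.asking_price is not None and listing.asking_price <= 0:
        problems.append("asking price must be positive")
    if (listing.revenue is not None and listing.cash_flow is not None
            and listing.revenue < listing.cash_flow):
        problems.append("revenue lower than cash flow")
    return problems
```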

Deduplication uses a composite key: the listing URL plus a hash of the title and asking price. BizBuySell sometimes surfaces the same listing on multiple search result pages, and brokers occasionally relist businesses with minor title variations after price reductions.
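A literal reading of that composite key can be sketched as follows. The title normalization (lowercasing, collapsing whitespace) and the dict-shaped records are assumptions. Under this reading a record counts as a duplicate only when both parts of the key match; how the production pipeline treats relists with a changed title or price is not described here.

```python
import hashlib


def dedup_key(url: str, title: str, asking_price) -> tuple[str, str]:
    """Composite key: listing URL plus a hash of normalized title and price."""
    normalized = f"{' '.join(title.lower().split())}|{asking_price}"
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return (url, digest)


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each composite key, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(record["url"], record["title"], record["asking_price"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```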

The normalized records write to PostgreSQL. Each run is tracked with a run_id, so we store historical data rather than just current state. Clients can see how listing prices and multiples change over time.
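To illustrate why keying rows by run matters, here is a small sketch of a snapshot table. It uses Python's built-in `sqlite3` as a stand-in so it runs without a server; the production store is PostgreSQL, and the table and column names here are hypothetical.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the snapshot table: one row per listing per run."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS listing_snapshots (
               run_id INTEGER NOT NULL,
               url TEXT NOT NULL,
               asking_price INTEGER,
               PRIMARY KEY (run_id, url)
           )"""
    )
    return conn


def record_run(conn: sqlite3.Connection, run_id: int, listings: list[dict]) -> None:
    """Append this run's records instead of overwriting earlier ones."""
    conn.executemany(
        "INSERT INTO listing_snapshots (run_id, url, asking_price) VALUES (?, ?, ?)",
        [(run_id, item["url"], item["asking_price"]) for item in listings],
    )
    conn.commit()


def price_history(conn: sqlite3.Connection, url: str) -> list[tuple]:
    """Return (run_id, asking_price) pairs for one listing, oldest first."""
    return conn.execute(
        "SELECT run_id, asking_price FROM listing_snapshots WHERE url = ? ORDER BY run_id",
        (url,),
    ).fetchall()
```

Because each run appends rows rather than updating them, a price reduction shows up as two rows for the same URL with different `run_id` values.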

Rate adaptation

Cloudflare doesn't just block scrapers; it also throttles them. The challenge rate varies by time of day, source IP reputation, and scraping velocity. Our system adapts rather than using a fixed delay schedule.

We track the challenge encounter rate per session as a running metric. If more than one in twenty page loads returns a Cloudflare challenge page instead of content, the session manager backs off: it increases the inter-page delay by 40%, rotates the IP, and re-initializes the browser fingerprint. If the challenge rate drops below the threshold after five clean requests, it gradually returns to normal speed.

The adaptation logic sits between the crawler and the page pool:

python

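import math
import random

class RateAdaptiveSession:
    """Scale the inter-page delay up or down based on the observed challenge rate."""

    def __init__(self, target_challenge_rate: float = 0.05):
        self.target_challenge_rate = target_challenge_rate
        self.challenge_count = 0
        self.request_count = 0
        self.base_delay = 1.8  # median delay in seconds
        self.multiplier = 1.0

    def record_result(self, was_challenge: bool) -> None:
        self.request_count += 1
        if was_challenge:
            self.challenge_count += 1
        # Cumulative challenge rate over the session so far.
        rate = self.challenge_count / self.request_count
        if rate > self.target_challenge_rate and self.request_count > 10:
            # Back off: 40% longer delays, capped at 4x the base delay.
            self.multiplier = min(self.multiplier * 1.4, 4.0)
        elif rate < 0.02 and self.request_count > 20:
            # Recover gradually once challenges become rare.
            self.multiplier = max(self.multiplier * 0.9, 1.0)

    def next_delay(self) -> float:
        # Log-normal jitter around the current median delay.
        base = self.base_delay * self.multiplier
        return base * math.exp(random.gauss(0, 0.3))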

This keeps the scraper inside Cloudflare's tolerance window automatically without requiring manual tuning when BizBuySell tightens its bot detection after platform updates.

Self-healing and monitoring

BizBuySell updates its frontend periodically, and selector changes break scrapers silently: you get empty results instead of errors, which is the worst failure mode.

We solve this with yield monitoring: every run logs the extraction rate (fields extracted / fields expected). If the rate drops below 85% for two consecutive runs, the system automatically:

Fires a Slack alert with a sample of failing selectors
Pauses scheduled runs
Logs the affected search URLs for manual review

The threshold is configurable per field. Cash flow and asking price are required, with zero tolerance for extraction failure. Location and industry are optional, so we tolerate higher miss rates without alerting.
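The two-consecutive-runs rule can be sketched as a small counter. The class and parameter names are hypothetical, and the Slack alert and scheduler pause are left to the caller.

```python
from dataclasses import dataclass, field


@dataclass
class YieldMonitor:
    """Flag a problem after N consecutive runs below the extraction threshold."""
    threshold: float = 0.85
    consecutive_limit: int = 2
    _low_runs: int = field(default=0, init=False)

    def record_run(self, extracted: int, expected: int) -> bool:
        """Record one run; return True when runs should pause and an alert fire."""
        rate = extracted / expected if expected else 0.0
        # A healthy run resets the streak; a low run extends it.
        self._low_runs = self._low_runs + 1 if rate < self.threshold else 0
        return self._low_runs >= self.consecutive_limit
```

Per-field thresholds would fit the same shape, for example one monitor per field with `threshold=1.0` for required fields such as asking price.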

Production reliability

The self-healing and rate adaptation systems described above are not edge-case protections — they fire regularly. BizBuySell makes frontend changes roughly every 4-6 weeks: selector updates, layout changes, and occasionally structural shifts in how financial data is presented. Without automated monitoring, every one of those would produce a silent failure: the scraper returns empty records with no error, and the client loses days of data before anyone notices.

The yield monitoring system has triggered Slack alerts and paused runs on three occasions over six months in production. Each time, the alert included the specific selectors returning empty and a sample of the failing search URLs, which made root-cause analysis a 15-minute task rather than a multi-hour debugging session. Selector updates were deployed and a backfill ran within 2 hours of each alert.

The rate adaptation system operates continuously. Cloudflare's tolerance window is not static — it tightens during high-traffic periods (US market hours, particularly mornings when brokers post new listings) and loosens overnight. The session manager's automatic backoff means the scraper is always running close to the maximum allowed speed for current conditions without manual tuning.

Results

The platform monitors thousands of listings daily across the client's target markets. Analysts receive alerts within five minutes of a qualifying listing appearing. Manual research time dropped by approximately 40 hours per week.

The system has run without manual intervention for six months. The only operator actions were the three selector updates triggered automatically by yield monitoring. The client's team does not interact with the scraping layer — they use the dashboard and receive alerts. That's the goal for any production scraping system: when it works correctly, it should be invisible to the people it serves.

The full case study is on our work page, including the ML scoring layer that identifies undervalued listings and the dashboard that replaced the spreadsheet workflow.

If you're monitoring a marketplace that blocks conventional scrapers, our web scraping service covers the same stack: anti-detection, proxy management, structured extraction, and change monitoring. We scope the work before starting and build to production from day one. For the full anti-detection architecture (TLS impersonation, proxy tiering, behavioral humanization at 2M pages/day), see How We Scrape 2M Pages Daily Without Getting Blocked.
