Creative Codes
Scraping | May 10, 2026 | 10 min read

How We Built a BizBuySell Scraping Pipeline That Tracks Thousands of Listings Daily

BizBuySell runs Cloudflare. Our first scraper got blocked in 20 minutes. Here's the architecture that fixed it.

By Muhammad Hassan

The challenge

BizBuySell is the largest US marketplace for businesses for sale. Thousands of listings go live every day, each one containing asking price, trailing-twelve-month cash flow, gross revenue, EBITDA, seller financing terms, broker details, and a text description of the business.

Our client, a business acquisition advisory firm, needed to monitor all of it. Their analysts were manually browsing listings for hours every day, copying financials into spreadsheets, and trying to identify undervalued opportunities before competitors. They missed deals constantly. A strong listing at $800K asking on a $280K cash flow would appear, get flooded with inquiries, and go under LOI before the team had even seen it.

The ask was straightforward: scrape everything, score it, and alert us the moment something good appears.

The execution was not.

Architecture decisions

BizBuySell serves content through a JavaScript-rendered interface. The listing cards load via XHR after the initial page paint, which rules out simple HTTP request scrapers like requests + BeautifulSoup. You need a real browser.

We evaluated Scrapy with Splash, Selenium, and Playwright. We chose Playwright for three reasons:

  1. Native async: Playwright's async API handles concurrent page operations cleanly, which matters when you're paginating through hundreds of search result pages
  2. First-class stealth support: playwright-stealth patches the most common fingerprinting vectors (WebGL, canvas, navigator, Chrome runtime) without the fragility of Selenium patching
  3. Built-in waiting: wait_until="networkidle" saves significant complexity compared to writing explicit wait conditions for XHR completion

The core loop is straightforward:

python
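from playwright.async_api import async_playwright
from playwright_stealth import stealth_async  # playwright-stealth 1.x entry point

# BusinessListing, parse_currency, and deduplicate are defined elsewhere in the project.

async def text_of(card, selector: str) -> str:
    """Return the stripped text of a child element, or "" if it is missing."""
    node = await card.query_selector(selector)
    return ((await node.text_content()) or "").strip() if node else ""

async def scrape_listings(search_url: str) -> list[BusinessListing]:
    listings = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await stealth_async(page)
            await page.goto(search_url, wait_until="networkidle")

            while True:
                # Each card is an ElementHandle; fields are looked up inside it.
                for item in await page.query_selector_all(".listing-card"):
                    listings.append(BusinessListing(
                        title=await text_of(item, ".listing-title"),
                        asking_price=parse_currency(await text_of(item, ".price")),
                        cash_flow=parse_currency(await text_of(item, ".cash-flow")),
                        revenue=parse_currency(await text_of(item, ".revenue")),
                        location=await text_of(item, ".location"),
                        industry=await text_of(item, ".category"),
                    ))

                # Follow pagination until there is no "next" control left.
                next_btn = await page.query_selector(".next-page")
                if not next_btn:
                    break
                await next_btn.click()
                await page.wait_for_load_state("networkidle")
        finally:
            # Close the browser even if a page fails mid-run.
            await browser.close()
    return deduplicate(listings)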

The parse_currency function handles the inconsistency in how BizBuySell formats financials: $1.2M, $1,200,000, Asking: $1.2m, and Not Disclosed all need different handling.
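As a rough illustration of what that handling can look like, here is a minimal sketch that covers only the four formats named above. The regular expression, the suffix table, and the choice to return None for undisclosed values are assumptions for this example, not the production parser.

```python
import re
from typing import Optional

# Multipliers for magnitude suffixes such as "$1.2M" (assumed set: k, m, b).
_SUFFIX = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}
_AMOUNT = re.compile(r"\$?\s*(\d[\d,]*(?:\.\d+)?)\s*([kmb])?\b", re.IGNORECASE)

def parse_currency(raw: Optional[str]) -> Optional[float]:
    """Convert a displayed amount to dollars, or None when it is undisclosed."""
    if not raw:
        return None
    match = _AMOUNT.search(raw)
    if match is None:  # no digits at all, e.g. "Not Disclosed"
        return None
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").lower()
    # Round to cents to hide floating-point noise from the multiplication.
    return round(number * _SUFFIX.get(suffix, 1), 2)
```

Returning None rather than 0 for "Not Disclosed" keeps a missing value distinguishable from a genuinely zero one further down the pipeline.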

Handling anti-detection

Tip

The key to sustainable scraping isn't speed, it's stealth. We throttle requests to mimic human browsing patterns rather than hammering the server as fast as possible.

BizBuySell uses Cloudflare for bot protection. Naive scrapers get challenged immediately. Our anti-detection stack:

Residential proxies: datacenter IPs are trivially fingerprinted. We rotate through a residential proxy pool, cycling IPs at the session level (not per-request), so a session always appears to come from the same location. An IP that changes on every request is a stronger bot signal than one that stays consistent.
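Session-level rotation can be sketched as a small pool that pins each session to one proxy. The class name, method names, and proxy URLs here are hypothetical; only the returned dictionary shape follows Playwright's `proxy` option, which takes a `server` key.

```python
import random
from dataclasses import dataclass, field

@dataclass
class SessionProxyPool:
    """Pins each scraping session to one proxy for its whole lifetime."""
    proxies: list[str]
    rng: random.Random = field(default_factory=random.Random)
    _assigned: dict[str, str] = field(default_factory=dict)

    def proxy_for(self, session_id: str) -> dict[str, str]:
        # Pick a proxy on first use, then keep returning the same one.
        if session_id not in self._assigned:
            self._assigned[session_id] = self.rng.choice(self.proxies)
        return {"server": self._assigned[session_id]}

    def retire(self, session_id: str) -> None:
        # Forget the assignment so a later session gets a fresh draw.
        self._assigned.pop(session_id, None)
```

The result could then be passed along the lines of `browser.new_context(proxy=pool.proxy_for("session-1"))`, so every request in that context leaves from the same address.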

Randomized delays: we draw inter-action delays from a log-normal distribution fitted to real browsing session data. The median delay is ~1.8 seconds, with a long tail that occasionally produces 6-8 second pauses. Perfectly uniform timing is a bot signal.
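A minimal sketch of that sampling, using only the standard library. The 1.8 second median comes from the description above; the spread (`SIGMA`) and the upper cap are assumed values chosen for illustration, since the fitted parameters are not given here.

```python
import asyncio
import math
import random

MEDIAN_DELAY_S = 1.8   # from the text: median pause between actions
SIGMA = 0.6            # assumed spread; the fitted value is not published
MAX_DELAY_S = 10.0     # assumed safety cap on the long tail

def sample_delay(rng: random.Random) -> float:
    """Draw one inter-action pause in seconds from a log-normal distribution."""
    # For a log-normal distribution, median = exp(mu), so mu = ln(median).
    mu = math.log(MEDIAN_DELAY_S)
    return min(rng.lognormvariate(mu, SIGMA), MAX_DELAY_S)

async def human_pause(rng: random.Random) -> None:
    """Sleep for a sampled delay between two page actions."""
    await asyncio.sleep(sample_delay(rng))
```

With these assumed parameters, roughly two draws in a hundred exceed six seconds, which matches the occasional long pause described above.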

Browser fingerprint rotation: each new browser session generates a fresh fingerprint profile. Canvas hash, WebGL renderer string, font enumeration, and audio context fingerprints all vary. playwright-stealth handles most of this, but we layer additional patches for the vectors it misses.

Session reuse: rather than launching a new browser for every search page, we maintain browser sessions across multiple searches. Cold browser launches have a different fingerprint profile than established sessions with cookies and local storage populated.

Data pipeline

Once extracted, listings pass through a validation and normalization pipeline before hitting the database.

  1. Scrape listings
  2. Extract 30+ fields
  3. Validate & normalize
  4. Deduplicate
  5. Score & store

Validation catches the most common issues: missing required fields, null financials where the listing has them but the parser failed, and obviously invalid values (asking price of $0, revenue lower than cash flow).
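These checks can be sketched as a function that returns a list of problems per record. The `ListingFinancials` type, the field names, and the `parser_missed` heuristic (raw text contains a digit but nothing was parsed) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ListingFinancials:
    asking_price: Optional[float]
    cash_flow: Optional[float]
    revenue: Optional[float]

REQUIRED = ("asking_price", "cash_flow")  # assumed required fields

def parser_missed(raw_text: str, parsed: Optional[float]) -> bool:
    """True when the page showed a number but the parser produced nothing."""
    return parsed is None and any(ch.isdigit() for ch in raw_text)

def validate(listing: ListingFinancials) -> list[str]:
    """Return human-readable problems; an empty list means the record passes."""
    problems = []
    for name in REQUIRED:
        if getattr(listing, name) is None:
            problems.append(f"missing required field: {name}")
    if listing.asking_price is not None and listing.asking_price <= 0:
        problems.append("asking price must be positive")
    if (listing.revenue is not None and listing.cash_flow is not None
            and listing.revenue < listing.cash_flow):
        problems.append("revenue is lower than cash flow")
    return problems
```

Collecting every problem, rather than raising on the first one, makes it easier to see which checks fail most often across a run.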

Deduplication is done by a composite key of listing URL and a hash of the title + asking price. BizBuySell sometimes re-surfaces listings in multiple search result pages, and brokers occasionally relist businesses with minor title variations after price reductions.
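One way to read that composite key is sketched below. Treating a match on either the URL or the content hash as a duplicate, and normalizing titles by lowercasing and collapsing whitespace, are assumptions for this example; the `Listing` type is hypothetical.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Listing:
    url: str
    title: str
    asking_price: float

def content_hash(listing: Listing) -> str:
    """Hash the normalized title together with the asking price."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    title = " ".join(listing.title.lower().split())
    payload = f"{title}|{listing.asking_price:.0f}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def deduplicate(listings: list[Listing]) -> list[Listing]:
    """Keep the first listing seen for each URL or content hash."""
    seen_urls, seen_hashes, unique = set(), set(), []
    for listing in listings:
        digest = content_hash(listing)
        if listing.url in seen_urls or digest in seen_hashes:
            continue
        seen_urls.add(listing.url)
        seen_hashes.add(digest)
        unique.append(listing)
    return unique
```

An exact hash like this only catches cosmetic title differences such as casing and spacing. A relist that changes both the wording and the price would need fuzzier matching than this sketch provides.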

The normalized records write to PostgreSQL. Each run is tracked with a run_id, so we store historical data rather than just current state. Clients can see how listing prices and multiples change over time.
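The run-tracked layout can be illustrated with two tables, one row per run and one snapshot row per listing per run. This sketch uses Python's built-in sqlite3 as a stand-in for PostgreSQL so it runs anywhere; the table names, column names, and the definition of the multiple as asking price divided by cash flow are assumptions.

```python
import sqlite3

# sqlite3 stands in for PostgreSQL here; the schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE runs (run_id INTEGER PRIMARY KEY, started_at TEXT NOT NULL);
    CREATE TABLE listing_snapshots (
        run_id       INTEGER NOT NULL REFERENCES runs(run_id),
        listing_url  TEXT    NOT NULL,
        asking_price REAL,
        cash_flow    REAL,
        PRIMARY KEY (run_id, listing_url)
    );
""")

def record_run(started_at: str, rows: list[tuple[str, float, float]]) -> int:
    """Insert one run and a snapshot row for every listing it saw."""
    run_id = conn.execute(
        "INSERT INTO runs (started_at) VALUES (?)", (started_at,)
    ).lastrowid
    conn.executemany(
        "INSERT INTO listing_snapshots VALUES (?, ?, ?, ?)",
        [(run_id, url, price, cash_flow) for url, price, cash_flow in rows],
    )
    return run_id

def price_history(listing_url: str) -> list[tuple[str, float, float]]:
    """Return (started_at, asking_price, multiple) per run, oldest first."""
    return conn.execute(
        """SELECT r.started_at, s.asking_price, s.asking_price / s.cash_flow
           FROM listing_snapshots s JOIN runs r ON r.run_id = s.run_id
           WHERE s.listing_url = ? ORDER BY s.run_id""",
        (listing_url,),
    ).fetchall()
```

Because snapshots are appended per run instead of updated in place, a price reduction shows up as a new row rather than overwriting the old value.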

Self-healing and monitoring

BizBuySell updates its frontend periodically. When selectors change, scrapers break silently: you get empty results instead of errors, which is the worst failure mode.

We solve this with yield monitoring: every run logs the extraction rate (fields extracted / fields expected). If the rate drops below 85% for two consecutive runs, the system automatically:

  1. Fires a Slack alert with a sample of failing selectors
  2. Pauses scheduled runs
  3. Logs the affected search URLs for manual review

The threshold is configurable per field. Cash flow and asking price are required, so they have zero tolerance for extraction failure. Location and industry are optional, so we tolerate higher miss rates without alerting.
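The trigger logic can be sketched as a small monitor that remembers the most recent extraction rates. The class and method names are hypothetical; the 85% threshold and the two-run window come from the description above.

```python
from collections import deque

class YieldMonitor:
    """Tracks extraction rate per run and trips after consecutive low runs."""

    def __init__(self, threshold: float = 0.85, consecutive: int = 2):
        self.threshold = threshold
        self.recent = deque(maxlen=consecutive)  # rates of the latest runs

    def record_run(self, extracted: int, expected: int) -> bool:
        """Log one run; return True when alerting and pausing should fire."""
        # A run that expected nothing counts as a zero-yield run.
        rate = extracted / expected if expected else 0.0
        self.recent.append(rate)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(r < self.threshold for r in self.recent)
```

Per-field thresholds could be handled by keeping one monitor per field, for example a monitor with `threshold=1.0` for asking price and a looser one for location.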

Results

The platform now monitors thousands of listings daily across the client's target markets and industries. Analysts receive alerts within five minutes of a qualifying listing appearing. Manual research time dropped by approximately 40 hours per week.

The full case study is on our work page, including the ML scoring layer that identifies undervalued listings and the dashboard that replaced the spreadsheet workflow.
