Scraping · June 22, 2026 · 10 min read

Scraping Google Maps and Business Directories: Architecture and Anti-Detection

Google Maps is the most accurate source of local business data on the internet — and one of the hardest to scrape reliably. Here's the architecture that gets clean data from Maps and major directories without getting blocked.

Muhammad Hassan

Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.


Google Maps is the most accurate source of local business data on the internet: NAP (name, address, phone), hours, categories, reviews, and photos, updated continuously by business owners and Google's own verification systems. It's also one of the hardest web targets to scrape reliably — JavaScript-heavy, fingerprint-aware, and aggressive about blocking automated requests.

This post covers the full architecture for extracting local business data from Google Maps and major business directories (Yelp, Yellow Pages, Bing Places) at scale.

Why Google Maps is different from a typical web scraping target

Most websites render HTML that you can parse. Google Maps renders almost entirely in JavaScript — the underlying data is fetched via internal APIs and injected into the DOM after page load. A simple httpx GET request returns a JavaScript shell with no business data.

What this means practically:

You must use a headless browser (Playwright or Puppeteer) to get rendered content
Google detects and blocks headless browsers using canvas fingerprinting, WebGL checks, and behavior analysis
Rate limiting is aggressive — too many requests from the same IP in a short window triggers a CAPTCHA or soft block

The anti-detection layer is more critical here than on most targets.

Playwright setup for Google Maps

python

from playwright.async_api import async_playwright
import asyncio
import random

async def scrape_maps_listing(place_url: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
                "--disable-dev-shm-usage",
            ],
        )

        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
            viewport={"width": 1366, "height": 768},
            locale="en-US",
            timezone_id="America/New_York",
        )

        # Mask Playwright's automation fingerprint
        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined,
            });
        """)

        page = await context.new_page()

        # Random delay before navigation
        await asyncio.sleep(random.uniform(1.5, 3.5))

        await page.goto(place_url, wait_until="networkidle", timeout=30000)

        # Wait for the main content to render
        await page.wait_for_selector("h1.DUwDvf", timeout=10000)

        # Extract data
        result = await page.evaluate("""() => {
            const name = document.querySelector('h1.DUwDvf')?.innerText;
            const rating = document.querySelector('.F7nice span[aria-hidden]')?.innerText;
            const reviewCount = document.querySelector('.F7nice span[aria-label]')?.getAttribute('aria-label');
            const address = document.querySelector('button[data-item-id="address"]')?.innerText;
            const phone = document.querySelector('button[data-item-id^="phone"]')?.innerText;
            const website = document.querySelector('a[data-item-id="authority"]')?.href;
            const hours = Array.from(document.querySelectorAll('.t39EBf'))
                .map(el => el.innerText).join(' | ');
            
            return { name, rating, reviewCount, address, phone, website, hours };
        }""")

        await browser.close()
        return result

The CSS selectors for Google Maps break with UI updates, so plan to update them every few weeks. Selectors based on data-item-id attributes tend to be more stable than the obfuscated class names (DUwDvf, F7nice, and so on), which can change with any front-end release.

Search result extraction (business listing pages)

For collecting businesses by search query (e.g., "plumbers in Chicago"), you need to extract the full list of results from a Maps search page, not individual business URLs.

python

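from urllib.parse import quote_plus

async def extract_search_results(query: str, location: str) -> list[dict]:
    # URL-encode the search term so spaces and punctuation don't break the path
    search_term = quote_plus(f"{query} {location}")
    search_url = f"https://www.google.com/maps/search/{search_term}"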

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, args=["--disable-blink-features=AutomationControlled"])
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...",
            locale="en-US",
        )
        page = await context.new_page()

        await page.goto(search_url, wait_until="networkidle")

        # Scroll to load all results
        results_container = await page.wait_for_selector('[role="feed"]')
        previous_count = 0

        while True:
            await results_container.evaluate("el => el.scrollTop = el.scrollHeight")
            await asyncio.sleep(2)

            current_count = await page.evaluate(
                "() => document.querySelectorAll('.Nv2PK').length"
            )
            if current_count == previous_count:
                break  # No new results loaded
            previous_count = current_count

        # Extract all listings
        listings = await page.evaluate("""() => {
            return Array.from(document.querySelectorAll('.Nv2PK')).map(el => ({
                name: el.querySelector('.qBF1Pd')?.innerText,
                rating: el.querySelector('.MW4etd')?.innerText,
                reviewCount: el.querySelector('.UY7F9')?.innerText,
                category: el.querySelector('.W4Efsd:nth-child(1)')?.innerText,
                address: el.querySelector('.W4Efsd:nth-child(2)')?.innerText,
                url: el.querySelector('a.hfpxzc')?.href,
            }));
        }""")

        await browser.close()
        return listings

Search results pages typically show 20 listings per scroll, up to about 120 total per query. For broader coverage of a market, break the area into a grid and run multiple localized searches.

Geo-grid coverage strategy

A single Maps search query returns at most ~120 results, biased toward the geographic center of the search area. To collect all businesses in a large metro area, you need a grid of overlapping search queries.

The approach: divide the target area into a grid of cells, run a search centered on each cell, and deduplicate the results by Google Place ID.

python

import math
from dataclasses import dataclass

@dataclass
class GridCell:
    lat: float
    lng: float
    radius_km: float

def generate_grid(
    center_lat: float,
    center_lng: float,
    total_radius_km: float,
    cell_radius_km: float,
) -> list[GridCell]:
    """Generate a grid of search cells over the square bounding box of a circular area."""
    cells = []
    # Approximate: 1 degree lat ≈ 111 km, 1 degree lng varies by latitude
    lat_step = cell_radius_km * 1.5 / 111
    lng_step = cell_radius_km * 1.5 / (111 * math.cos(math.radians(center_lat)))

    lat = center_lat - total_radius_km / 111
    while lat <= center_lat + total_radius_km / 111:
        lng = center_lng - total_radius_km / (111 * math.cos(math.radians(center_lat)))
        while lng <= center_lng + total_radius_km / (111 * math.cos(math.radians(center_lat))):
            cells.append(GridCell(lat=lat, lng=lng, radius_km=cell_radius_km))
            lng += lng_step
        lat += lat_step

    return cells

For a city like Chicago (roughly 30km radius), a 3km cell radius gives a 4.5km step between cell centers, so the function above produces a 14 × 14 grid of 196 search cells. Each cell requires one search request. With residential proxies and 30-second inter-request delays, that's roughly 1.6 hours per query type per city.
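
The search function earlier takes a free-text location, so each grid cell still has to be turned into a search URL. Here is a minimal sketch, assuming Maps accepts a viewport hint of the form @lat,lng,zoomz after the query path. That is an observed URL pattern, not a documented API, and build_cell_search_url and its default zoom are hypothetical choices for illustration.

```python
from urllib.parse import quote


def build_cell_search_url(query: str, lat: float, lng: float, zoom: int = 14) -> str:
    """Build a Maps search URL centered on one grid cell.

    Assumption: the '@lat,lng,zoomz' segment biases results toward that
    viewport. The zoom level should be tuned so the viewport roughly
    matches the cell radius.
    """
    encoded_query = quote(query)  # percent-encode spaces and punctuation
    return (
        "https://www.google.com/maps/search/"
        f"{encoded_query}/@{lat:.6f},{lng:.6f},{zoom}z"
    )
```

Each GridCell from generate_grid can be passed through this as build_cell_search_url("plumbers", cell.lat, cell.lng).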

Google Maps uses its Place ID as a stable unique identifier for each business. Always extract and store the Place ID alongside other fields — it's the deduplication key when merging results from multiple grid cells.
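
Merging grid cells by Place ID could look roughly like the sketch below. It assumes each listing URL carries the Place ID as a "!19s" token inside its data parameter, which is a pattern seen in listing links rather than a documented contract, so verify it against live URLs first. extract_place_id and dedupe_listings are hypothetical helper names.

```python
import re
from typing import Optional

# Assumption: listing URLs embed the Place ID as "!19s<id>".
# This is an observed pattern and may change without notice.
PLACE_ID_PATTERN = re.compile(r"!19s([A-Za-z0-9_-]+)")


def extract_place_id(url: Optional[str]) -> Optional[str]:
    """Return the Place ID token from a listing URL, or None if not found."""
    if not url:
        return None
    match = PLACE_ID_PATTERN.search(url)
    return match.group(1) if match else None


def dedupe_listings(listings: list[dict]) -> list[dict]:
    """Keep the first listing seen per Place ID, falling back to the URL."""
    seen: set[str] = set()
    unique = []
    for listing in listings:
        url = listing.get("url")
        place_id = extract_place_id(url)
        key = place_id or url
        if key is not None:
            if key in seen:
                continue  # already collected from an overlapping cell
            seen.add(key)
        # Listings with no usable key are kept as-is so no data is lost
        unique.append({**listing, "place_id": place_id})
    return unique
```

Listings without any key pass through undeduplicated, which is a deliberate choice here: they are better caught later by the phone and address normalization step than silently dropped.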

Anti-detection at scale

For extracting thousands of listings, the anti-detection layer must be systematic:

Residential proxies: datacenter IPs tend to be blocked almost immediately on Maps, so residential proxies (Bright Data, Oxylabs) are effectively required. Rotate IPs per session, not per request: each browser session uses one IP from start to finish.

Request pacing: randomize delays between requests (2-5 seconds) and between browser sessions (30-120 seconds). Consistent cadence is a bot signal.

Browser fingerprint variation: rotate user agent strings, viewport sizes, and language/timezone combinations across sessions. A single fingerprint making thousands of requests is a trivial pattern to detect.

Session warming: don't go directly to a Maps URL from a fresh browser context. Load Google.com first, wait a few seconds, then navigate to Maps. This matches normal browser behavior.

CAPTCHA handling: when a CAPTCHA appears, don't retry immediately. Mark the session as blocked, rotate to a new IP, and resume from a clean context. At scale, some percentage of sessions will hit CAPTCHAs — design the pipeline to handle this gracefully rather than stopping entirely.
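
The rules above (one IP per session, jittered pacing, a fresh fingerprint and IP after a CAPTCHA) can be sketched as a small driver loop. This is an illustration under assumptions, not a production scheduler: CaptchaEncountered, SessionProfile, new_profile, and scrape_with_rotation are hypothetical names, the viewport and timezone pools are placeholders, and fetch stands in for whatever Playwright routine loads a page with a given profile.

```python
import random
import time
from dataclasses import dataclass


class CaptchaEncountered(Exception):
    """Raised by the fetch callable when a CAPTCHA page is detected."""


@dataclass(frozen=True)
class SessionProfile:
    proxy: str
    user_agent: str
    viewport: tuple[int, int]
    timezone_id: str


def new_profile(proxies, user_agents, rng, exclude_proxy=None) -> SessionProfile:
    """Pick a fresh proxy and fingerprint, avoiding the proxy that was just blocked."""
    candidates = [p for p in proxies if p != exclude_proxy] or list(proxies)
    return SessionProfile(
        proxy=rng.choice(candidates),
        user_agent=rng.choice(user_agents),
        # Placeholder pools; in practice, keep timezone consistent with proxy geo
        viewport=rng.choice([(1366, 768), (1440, 900), (1920, 1080)]),
        timezone_id=rng.choice(["America/New_York", "America/Chicago"]),
    )


def scrape_with_rotation(urls, fetch, proxies, user_agents, *,
                         max_attempts=3, sleep=time.sleep, rng=None):
    """Fetch each URL, keeping one profile per session and rotating on CAPTCHA."""
    rng = rng or random.Random()
    results, failed = {}, []
    profile = new_profile(proxies, user_agents, rng)
    for url in urls:
        for _ in range(max_attempts):
            try:
                results[url] = fetch(url, profile)
                sleep(rng.uniform(2, 5))  # jittered pause between requests
                break
            except CaptchaEncountered:
                # Session is burned: cool down, then resume on a new IP
                sleep(rng.uniform(30, 120))
                profile = new_profile(proxies, user_agents, rng,
                                      exclude_proxy=profile.proxy)
        else:
            failed.append(url)  # every attempt hit a CAPTCHA
    return results, failed
```

Failed URLs are returned rather than raised, so one bad stretch of sessions does not stop the whole run; they can be re-queued later.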

Yelp and Yellow Pages

Major business directories are easier than Google Maps — they render server-side HTML and don't use the same level of fingerprinting. But they still block at the IP level for high-volume scraping.

For Yelp, the business listing HTML is parseable with BeautifulSoup:

python

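import httpx
from bs4 import BeautifulSoup
from typing import Optional

def first_text(soup: BeautifulSoup, selector: str) -> Optional[str]:
    """Return the stripped text of the first match, or None if absent."""
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else None

def first_attr(soup: BeautifulSoup, selector: str, attr: str) -> Optional[str]:
    """Return an attribute of the first match, or None if absent."""
    el = soup.select_one(selector)
    return el.get(attr) if el else None

def scrape_yelp_listing(business_url: str, proxy: str) -> dict:
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
        "Accept-Language": "en-US,en;q=0.9",
    }

    # httpx >= 0.26 takes a single proxy URL via `proxy=`;
    # older releases used `proxies={"https://": proxy}` instead.
    response = httpx.get(
        business_url,
        headers=headers,
        proxy=proxy,
        timeout=15.0,
    )
    # Surface 403/429 blocks instead of parsing an error page
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    return {
        "name": first_text(soup, "h1.css-1se8maq"),
        "rating": first_attr(soup, '[aria-label*="star rating"]', "aria-label"),
        "address": first_text(soup, "address"),
        "phone": first_text(soup, '[href^="tel:"]'),
        "website": first_attr(soup, 'a[href*="biz_redir"]', "href"),
    }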

Yelp selectors change more frequently than most targets — build in automated selector validation (check that critical fields are non-null on a known URL) and alert when the success rate drops below a threshold.
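
A selector health check along those lines can be quite small. The sketch below assumes records are the dicts returned by the scraper; the critical field list and the 0.9 threshold are illustrative values, and the function names are hypothetical.

```python
# Illustrative choice of fields that must be present for a record to count as good
CRITICAL_FIELDS = ("name", "address", "phone")


def missing_critical_fields(record: dict, critical=CRITICAL_FIELDS) -> list[str]:
    """List the critical fields that came back empty or None."""
    return [field for field in critical if not record.get(field)]


def selector_success_rate(records: list[dict], critical=CRITICAL_FIELDS) -> float:
    """Fraction of records with every critical field populated."""
    if not records:
        return 0.0
    ok = sum(1 for r in records if not missing_critical_fields(r, critical))
    return ok / len(records)


def should_alert(records: list[dict], threshold: float = 0.9) -> bool:
    """True when the success rate drops below the (assumed) threshold."""
    return selector_success_rate(records) < threshold
```

Running this against a handful of known-good URLs on a schedule catches a broken selector before it silently fills the dataset with nulls.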

Data normalization across sources

When collecting the same business from multiple directories, normalization is required before deduplication. Phone numbers come in different formats ("(312) 555-1234" vs "+13125551234"), addresses have inconsistent abbreviations, and business names vary ("Joe's Plumbing LLC" vs "Joe's Plumbing").

An important note on data quality: Google Maps is more current than Yelp and Yellow Pages for most markets. When a business closes or changes hours, Google is typically updated first. For time-sensitive use cases (sales prospecting, delivery route planning), prioritize Google Maps data as the source of truth and treat other directories as supplementary. Cross-referencing helps surface quality issues: a business that appears on Google Maps but has no Yelp presence and no website is a signal worth flagging for manual review before using in a sales pipeline.
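
One way to encode that priority is a per-field merge that fills each field from the highest-priority source that has a value. This is a sketch under assumptions: records are plain dicts carrying a "source" key, and SOURCE_PRIORITY, merge_records, and the needs_review flag are hypothetical names.

```python
# Lower number = more trusted; ordering follows the priority described above
SOURCE_PRIORITY = {"google_maps": 0, "yelp": 1, "yellow_pages": 2}


def merge_records(records: list[dict]) -> dict:
    """Merge records for one business, preferring Google Maps field by field."""
    ordered = sorted(records, key=lambda r: SOURCE_PRIORITY.get(r.get("source"), 99))
    merged: dict = {}
    for record in ordered:
        for field, value in record.items():
            if field == "source":
                continue
            # Only fill gaps; never overwrite a higher-priority value
            if merged.get(field) in (None, "") and value not in (None, ""):
                merged[field] = value
    sources = [r["source"] for r in ordered if r.get("source")]
    merged["sources"] = sources
    # On Maps, absent from Yelp, and no website: flag for manual review
    merged["needs_review"] = (
        "google_maps" in sources
        and "yelp" not in sources
        and not merged.get("website")
    )
    return merged
```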

Standard normalization:

Phone: strip to digits, apply E.164 format
Address: parse with a library like usaddress (US) or libpostal (international) to extract structured components
Business name: lowercase, strip legal suffixes (LLC, Inc, Corp), trim whitespace

Deduplicate on a combination of normalized phone number + normalized address. Two records with the same phone and address are the same business, regardless of how the name appears.
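
A minimal sketch of those normalization steps follows. It assumes US/Canada numbers (country code +1) and uses a crude string-based address key as a stand-in for a real parser such as usaddress or libpostal. All function names are hypothetical.

```python
import re
from typing import Optional

# Legal suffixes named above, optionally preceded by a comma and followed by a period
LEGAL_SUFFIX = re.compile(r"[,\s]+(llc|inc|corp)\.?$", re.IGNORECASE)


def normalize_phone_us(raw: Optional[str]) -> Optional[str]:
    """Strip to digits and format as E.164, assuming a +1 number."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    if len(digits) != 10:
        return None  # not a complete US/Canada number
    return f"+1{digits}"


def normalize_name(raw: str) -> str:
    """Lowercase, collapse whitespace, and strip a trailing legal suffix."""
    name = " ".join(raw.lower().split())
    return LEGAL_SUFFIX.sub("", name).strip()


def normalize_address_key(raw: str) -> str:
    """Crude stand-in for a parsed address: lowercase, no punctuation."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", raw.lower())
    return " ".join(cleaned.split())


def dedup_key(record: dict) -> tuple:
    """Phone + address key; the business name is deliberately ignored."""
    return (
        normalize_phone_us(record.get("phone")),
        normalize_address_key(record.get("address") or ""),
    )
```

The string-based address key does not reconcile abbreviations such as "St" versus "Street", which is the main reason to swap in a real address parser before relying on it.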

Output schema

A complete local business record from this pipeline:

python

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class LocalBusinessRecord:
    name: str
    address_street: str
    address_city: str
    address_state: str
    address_postal: str
    address_country: str
    phone: str                   # E.164
    website: Optional[str]
    google_maps_url: Optional[str]
    yelp_url: Optional[str]
    rating: Optional[float]
    review_count: Optional[int]
    categories: list[str]
    hours: dict                  # {"monday": "9:00 AM - 5:00 PM", ...}
    source: str                  # "google_maps" | "yelp" | "yellow_pages"
    scraped_at: datetime

If you're building a local business data pipeline — for lead generation, competitive intelligence, or directory aggregation — tell us about the scope.
