Scraping Google Maps and Business Directories: Architecture and Anti-Detection
Google Maps is the most accurate source of local business data on the internet — and one of the hardest to scrape reliably. Here's the architecture that gets clean data from Maps and major directories without getting blocked.
Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.
Google Maps is the most accurate source of local business data on the internet: NAP (name, address, phone), hours, categories, reviews, and photos, updated continuously by business owners and Google's own verification systems. It's also one of the hardest web targets to scrape reliably — JavaScript-heavy, fingerprint-aware, and aggressive about blocking automated requests.
This post covers the full architecture for extracting local business data from Google Maps and major business directories (Yelp, Yellow Pages, Bing Places) at scale.
Why Google Maps is different from a typical web scraping target
Most websites render HTML that you can parse. Google Maps renders almost entirely in JavaScript — the underlying data is fetched via internal APIs and injected into the DOM after page load. A simple httpx GET request returns a JavaScript shell with no business data.
What this means practically:
- You must use a headless browser (Playwright or Puppeteer) to get rendered content
- Google detects and blocks headless browsers using canvas fingerprinting, WebGL checks, and behavior analysis
- Rate limiting is aggressive — too many requests from the same IP in a short window triggers a CAPTCHA or soft block
The anti-detection layer is more critical here than on most targets.
Playwright setup for Google Maps
```python
import asyncio
import random

from playwright.async_api import async_playwright


async def scrape_maps_listing(place_url: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
                "--disable-dev-shm-usage",
            ],
        )
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
            viewport={"width": 1366, "height": 768},
            locale="en-US",
            timezone_id="America/New_York",
        )
        # Mask Playwright's automation fingerprint
        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined,
            });
        """)
        page = await context.new_page()
        # Random delay before navigation
        await asyncio.sleep(random.uniform(1.5, 3.5))
        await page.goto(place_url, wait_until="networkidle", timeout=30000)
        # Wait for the main content to render
        await page.wait_for_selector("h1.DUwDvf", timeout=10000)
        # Extract data
        result = await page.evaluate("""() => {
            const name = document.querySelector('h1.DUwDvf')?.innerText;
            const rating = document.querySelector('.F7nice span[aria-hidden]')?.innerText;
            const reviewCount = document.querySelector('.F7nice span[aria-label]')?.getAttribute('aria-label');
            const address = document.querySelector('button[data-item-id="address"]')?.innerText;
            const phone = document.querySelector('button[data-item-id^="phone"]')?.innerText;
            const website = document.querySelector('a[data-item-id="authority"]')?.href;
            const hours = Array.from(document.querySelectorAll('.t39EBf'))
                .map(el => el.innerText).join(' | ');
            return { name, rating, reviewCount, address, phone, website, hours };
        }""")
        await browser.close()
        return result
```

The CSS selectors for Google Maps break with UI updates, so plan to update them every few weeks. Selectors based on data-item-id attributes are more stable than class names, which Google randomizes.
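The rating and review count come back from the page as raw strings, while the output schema later in this post expects numbers. Here is a minimal sketch of that conversion. It assumes the rating text looks like "4.6" and the review label looks like "1,234 reviews"; both formats are assumptions, since Maps localizes them. The helper names `parse_rating` and `parse_review_count` are hypothetical.

```python
import re
from typing import Optional


def parse_rating(raw: Optional[str]) -> Optional[float]:
    """Convert a rating string such as '4.6' (or '4,6') to a float."""
    if not raw:
        return None
    match = re.search(r"\d+(?:[.,]\d+)?", raw)
    if not match:
        return None
    # Some locales use a decimal comma; normalize it before converting.
    return float(match.group().replace(",", "."))


def parse_review_count(raw: Optional[str]) -> Optional[int]:
    """Convert a label such as '1,234 reviews' or '(87)' to an int.

    Assumes the count is written out in full (not abbreviated like '1.2K').
    """
    if not raw:
        return None
    digits = re.sub(r"\D", "", raw)
    return int(digits) if digits else None
```

Returning None instead of raising keeps one malformed listing from stopping a batch, and the null shows up later in any fill-rate monitoring.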
Search result extraction (business listing pages)
For collecting businesses by search query (e.g., "plumbers in Chicago"), you need to extract the full list of results from a Maps search page, not individual business URLs.
```python
import asyncio
from urllib.parse import quote_plus

from playwright.async_api import async_playwright


async def extract_search_results(query: str, location: str) -> list[dict]:
    # URL-encode the query so spaces and special characters survive
    search_url = f"https://www.google.com/maps/search/{quote_plus(f'{query} {location}')}"
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, args=["--disable-blink-features=AutomationControlled"])
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...",
            locale="en-US",
        )
        page = await context.new_page()
        await page.goto(search_url, wait_until="networkidle")
        # Scroll to load all results
        results_container = await page.wait_for_selector('[role="feed"]')
        previous_count = 0
        while True:
            await results_container.evaluate("el => el.scrollTop = el.scrollHeight")
            await asyncio.sleep(2)
            current_count = await page.evaluate(
                "() => document.querySelectorAll('.Nv2PK').length"
            )
            if current_count == previous_count:
                break  # No new results loaded
            previous_count = current_count
        # Extract all listings
        listings = await page.evaluate("""() => {
            return Array.from(document.querySelectorAll('.Nv2PK')).map(el => ({
                name: el.querySelector('.qBF1Pd')?.innerText,
                rating: el.querySelector('.MW4etd')?.innerText,
                reviewCount: el.querySelector('.UY7F9')?.innerText,
                category: el.querySelector('.W4Efsd:nth-child(1)')?.innerText,
                address: el.querySelector('.W4Efsd:nth-child(2)')?.innerText,
                url: el.querySelector('a.hfpxzc')?.href,
            }));
        }""")
        await browser.close()
        return listings
```

Search results pages typically load about 20 listings per scroll, up to about 120 total per query. For broader coverage of a market, break the area into a grid and run multiple localized searches.
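One way to localize a search is to put the viewport center into the URL, using the `@lat,lng,zoomz` segment that Maps URLs carry in the browser address bar. The sketch below assumes Maps honors that segment for search pages. The zoom level of 14 is an arbitrary starting point and `build_search_url` is a hypothetical helper, not part of any library.

```python
from urllib.parse import quote


def build_search_url(query: str, lat: float, lng: float, zoom: int = 14) -> str:
    """Build a Maps search URL centered on (lat, lng).

    Assumption: the '@lat,lng,zoomz' path segment centers the search viewport.
    """
    encoded = quote(query)  # percent-encode spaces and reserved characters
    return f"https://www.google.com/maps/search/{encoded}/@{lat:.6f},{lng:.6f},{zoom}z"


# Example: a search for plumbers centered on downtown Chicago
url = build_search_url("plumbers", 41.8781, -87.6298)
```

Six decimal places of latitude and longitude is roughly 0.1 m of precision, far more than a search viewport needs, and it keeps the generated URLs stable across runs.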
Geo-grid coverage strategy
A single Maps search query returns at most ~120 results, biased toward the geographic center of the search area. To collect all businesses in a large metro area, you need a grid of overlapping search queries.
The approach: divide the target area into a grid of cells, run a search centered on each cell, and deduplicate the results by Google Place ID.
```python
import math
from dataclasses import dataclass


@dataclass
class GridCell:
    lat: float
    lng: float
    radius_km: float


def generate_grid(
    center_lat: float,
    center_lng: float,
    total_radius_km: float,
    cell_radius_km: float,
) -> list[GridCell]:
    """Generate a square grid of search cells covering the bounding box of a circular area."""
    cells = []
    # Approximate: 1 degree lat ≈ 111 km, 1 degree lng varies by latitude
    km_per_deg_lng = 111 * math.cos(math.radians(center_lat))
    lat_step = cell_radius_km * 1.5 / 111
    lng_step = cell_radius_km * 1.5 / km_per_deg_lng
    lat = center_lat - total_radius_km / 111
    while lat <= center_lat + total_radius_km / 111:
        lng = center_lng - total_radius_km / km_per_deg_lng
        while lng <= center_lng + total_radius_km / km_per_deg_lng:
            cells.append(GridCell(lat=lat, lng=lng, radius_km=cell_radius_km))
            lng += lng_step
        lat += lat_step
    return cells
```

For a city like Chicago (roughly 30 km radius), a 3 km cell radius gives a 4.5 km step between cell centers, so this function produces a 14 × 14 grid of 196 search cells. Each cell requires one search request. With residential proxies and 30-second inter-request delays, that's about 1.6 hours per query type per city.
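Before committing proxy budget, it helps to check the cell count and run time for a given radius and delay. The sketch below is a closed-form estimate that assumes the same 1.5× cell-radius spacing and square bounding box as the grid function above. `estimate_grid_run` is a hypothetical helper, and the result is an estimate because floating-point accumulation in the real loop can shift the count by one row or column.

```python
import math


def estimate_grid_run(
    total_radius_km: float,
    cell_radius_km: float,
    delay_s: float,
) -> tuple[int, int, float]:
    """Return (cells_per_axis, total_cells, hours) for a square grid.

    Assumes cell centers are spaced 1.5 * cell_radius_km apart and that
    each cell costs one request followed by delay_s seconds of waiting.
    """
    step_km = cell_radius_km * 1.5
    # Cells sit at 0, step, 2*step, ... across the 2 * total_radius_km span.
    cells_per_axis = math.floor(2 * total_radius_km / step_km) + 1
    total_cells = cells_per_axis ** 2
    hours = total_cells * delay_s / 3600
    return cells_per_axis, total_cells, hours


per_axis, total, hours = estimate_grid_run(30, 3, 30)
print(f"{per_axis} x {per_axis} = {total} cells, about {hours:.1f} hours")
```

Halving the cell radius roughly quadruples the cell count, so cell size is the main lever on run time.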
Google Maps uses its Place ID as a stable unique identifier for each business. Always extract and store the Place ID alongside other fields — it's the deduplication key when merging results from multiple grid cells.
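Merging grid-cell results by Place ID can be sketched as follows. The sketch assumes each scraped record already carries a `place_id` key, which the extraction code in this post does not populate, so treat that field name as hypothetical. Records without an ID are kept separately for review rather than dropped.

```python
from typing import Iterable


def dedupe_by_place_id(cell_results: Iterable[list[dict]]) -> list[dict]:
    """Merge per-cell result lists, keeping the first record seen per Place ID."""
    seen: dict[str, dict] = {}
    missing_id: list[dict] = []
    for listings in cell_results:
        for record in listings:
            place_id = record.get("place_id")  # assumed field name
            if not place_id:
                # No key to deduplicate on: keep it for manual review.
                missing_id.append(record)
            elif place_id not in seen:
                seen[place_id] = record
    return list(seen.values()) + missing_id


# Two overlapping cells that both returned the same business
cell_a = [{"place_id": "id-1", "name": "Joe's Plumbing"}]
cell_b = [{"place_id": "id-1", "name": "Joe's Plumbing"},
          {"place_id": "id-2", "name": "Lakeview Rooter"}]
unique = dedupe_by_place_id([cell_a, cell_b])
```

Keeping the first record seen is the simplest policy. If cells are scraped days apart, preferring the most recent `scraped_at` may be the better choice.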
Anti-detection at scale
For extracting thousands of listings, the anti-detection layer must be systematic:
Residential proxies: datacenter IPs are typically blocked almost immediately on Maps. Residential proxies (Bright Data, Oxylabs) are required. Rotate IPs per session, not per request: each browser session uses one IP from start to finish.
Request pacing: randomize delays between requests (2-5 seconds) and between browser sessions (30-120 seconds). Consistent cadence is a bot signal.
Browser fingerprint variation: rotate user agent strings, viewport sizes, and language/timezone combinations across sessions. A single fingerprint making thousands of requests is a trivial pattern to detect.
Session warming: don't go directly to a Maps URL from a fresh browser context. Load Google.com first, wait a few seconds, then navigate to Maps. This matches normal browser behavior.
CAPTCHA handling: when a CAPTCHA appears, don't retry immediately. Mark the session as blocked, rotate to a new IP, and resume from a clean context. At scale, some percentage of sessions will hit CAPTCHAs — design the pipeline to handle this gracefully rather than stopping entirely.
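The session rules above (one proxy per session, a varied fingerprint per session, and no immediate retry after a block) can be sketched as a small driver. Everything here is illustrative: `SessionProfile`, `SessionBlocked`, and `run_with_rotation` are hypothetical names, the viewport and locale pools are placeholders, and the `scrape` callback is assumed to open its own browser context from the profile and raise `SessionBlocked` when it sees a CAPTCHA.

```python
import asyncio
import random
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional, Sequence


class SessionBlocked(Exception):
    """Raised by the scrape callback when it sees a CAPTCHA or soft block."""


@dataclass(frozen=True)
class SessionProfile:
    proxy: str
    user_agent: str
    viewport: tuple[int, int]
    locale: str
    timezone_id: str


# Placeholder pools; real values should match the proxy's region.
VIEWPORTS = [(1366, 768), (1440, 900), (1920, 1080)]
LOCALES = [("en-US", "America/New_York"), ("en-US", "America/Chicago")]


def pick_profile(proxy: str, user_agents: Sequence[str], rng: random.Random) -> SessionProfile:
    """Build a fingerprint for one session; the proxy stays fixed for its lifetime."""
    locale, timezone_id = rng.choice(LOCALES)
    return SessionProfile(proxy, rng.choice(user_agents), rng.choice(VIEWPORTS), locale, timezone_id)


async def run_with_rotation(
    scrape: Callable[[SessionProfile], Awaitable[dict]],
    proxies: Sequence[str],
    user_agents: Sequence[str],
    pause_range: tuple[float, float] = (30.0, 120.0),
    rng: Optional[random.Random] = None,
) -> Optional[dict]:
    """Run scrape() in fresh sessions until one succeeds or all proxies are blocked."""
    rng = rng or random.Random()
    available = list(proxies)
    while available:
        proxy = rng.choice(available)
        profile = pick_profile(proxy, user_agents, rng)
        try:
            return await scrape(profile)
        except SessionBlocked:
            # Retire the blocked IP and pause a random interval before the next session.
            available.remove(proxy)
            await asyncio.sleep(rng.uniform(*pause_range))
    return None  # every proxy was blocked; let the caller requeue the job
```

Returning None rather than raising lets the pipeline requeue the job and carry on with other work, which is the graceful-degradation behavior described above.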
Yelp and Yellow Pages
Major business directories are easier than Google Maps — they render server-side HTML and don't use the same level of fingerprinting. But they still block at the IP level for high-volume scraping.
For Yelp, the business listing HTML is parseable with BeautifulSoup:
```python
from typing import Optional

import httpx
from bs4 import BeautifulSoup


def scrape_yelp_listing(business_url: str, proxy: str) -> dict:
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
        "Accept-Language": "en-US,en;q=0.9",
    }
    response = httpx.get(
        business_url,
        headers=headers,
        proxy=proxy,  # httpx >= 0.26; older releases take proxies={"https://": proxy}
        timeout=15.0,
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # select_one returns None when nothing matches, so guard every lookup
    def text(selector: str) -> Optional[str]:
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else None

    def attr(selector: str, name: str) -> Optional[str]:
        node = soup.select_one(selector)
        return node.get(name) if node else None

    return {
        "name": text("h1.css-1se8maq"),
        "rating": attr('[aria-label*="star rating"]', "aria-label"),
        "address": text("address"),
        "phone": text('[href^="tel:"]'),
        "website": attr('a[href*="biz_redir"]', "href"),
    }
```

Yelp selectors change more frequently than most targets, so build in automated selector validation (check that critical fields are non-null on a known URL) and alert when the success rate drops below a threshold.
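The validation idea can be sketched as a fill-rate check over a batch of scraped records. The list of critical fields and the 0.9 threshold are assumptions to tune per target, and `field_fill_rates` and `failing_fields` are hypothetical helper names.

```python
CRITICAL_FIELDS = ("name", "address", "phone")


def field_fill_rates(records: list[dict], fields=CRITICAL_FIELDS) -> dict[str, float]:
    """Fraction of records in which each field is present and non-empty."""
    if not records:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for record in records if record.get(field)) / len(records)
        for field in fields
    }


def failing_fields(records: list[dict], threshold: float = 0.9, fields=CRITICAL_FIELDS) -> list[str]:
    """Fields whose fill rate fell below the threshold: likely broken selectors."""
    rates = field_fill_rates(records, fields)
    return [field for field, rate in rates.items() if rate < threshold]


batch = [
    {"name": "Joe's Plumbing", "address": "123 N Main St", "phone": "(312) 555-1234"},
    {"name": "Lakeview Rooter", "address": "9 W Oak Ave", "phone": None},
]
broken = failing_fields(batch)  # feed this list into whatever alerting you use
```

Running the same check against a few known-good URLs on a schedule separates "the selector broke" from "this batch of businesses simply has no phone listed".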
Data normalization across sources
When collecting the same business from multiple directories, normalization is required before deduplication. Phone numbers come in different formats ("(312) 555-1234" vs "+13125551234"), addresses have inconsistent abbreviations, and business names vary ("Joe's Plumbing LLC" vs "Joe's Plumbing").
An important note on data quality: Google Maps is more current than Yelp and Yellow Pages for most markets. When a business closes or changes hours, Google is typically updated first. For time-sensitive use cases (sales prospecting, delivery route planning), prioritize Google Maps data as the source of truth and treat other directories as supplementary. Cross-referencing helps surface quality issues: a business that appears on Google Maps but has no Yelp presence and no website is a signal worth flagging for manual review before using in a sales pipeline.
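The source-of-truth rule can be sketched as a per-field merge in which Google Maps values win and other directories only fill gaps. The sketch assumes the records passed in have already been matched as the same business. The priority table, the `sources` key, and the function names are all hypothetical.

```python
SOURCE_PRIORITY = {"google_maps": 0, "yelp": 1, "yellow_pages": 2}


def merge_by_priority(records: list[dict]) -> dict:
    """Merge records for one business; higher-priority sources win per field."""
    ordered = sorted(records, key=lambda r: SOURCE_PRIORITY.get(r.get("source"), 99))
    merged: dict = {}
    for record in ordered:
        for key, value in record.items():
            if key == "source":
                continue
            # Only fill a field that no higher-priority source has provided.
            if merged.get(key) in (None, "") and value not in (None, ""):
                merged[key] = value
    merged["sources"] = [r.get("source") for r in ordered]
    return merged


def needs_manual_review(merged: dict) -> bool:
    """Flag a business seen on Maps with no Yelp presence and no website."""
    sources = merged.get("sources", [])
    return "google_maps" in sources and "yelp" not in sources and not merged.get("website")
```

Recording which sources contributed makes the review flag cheap to compute and leaves an audit trail when a sales team questions a record.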
Standard normalization:
- Phone: strip to digits, apply E.164 format
- Address: parse with a library like usaddress (US) or libpostal (international) to extract structured components
- Business name: lowercase, strip legal suffixes (LLC, Inc, Corp), trim whitespace
Deduplicate on a combination of normalized phone number + normalized address. Two records with the same phone and address are the same business, regardless of how the name appears.
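A minimal standard-library sketch of these rules follows. It assumes US numbers only, and `normalize_address` is a crude stand-in for a real parser such as usaddress or libpostal. All function names here are hypothetical.

```python
import re
from typing import Optional

LEGAL_SUFFIXES = {"llc", "inc", "corp"}


def normalize_phone_us(raw: Optional[str]) -> Optional[str]:
    """Strip to digits and format as E.164, assuming a US (+1) number."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 10:
        return "+1" + digits
    if len(digits) == 11 and digits.startswith("1"):
        return "+" + digits
    return None  # not a recognizable US number


def normalize_name(raw: Optional[str]) -> str:
    """Lowercase, drop punctuation except apostrophes, strip trailing legal suffixes."""
    words = re.sub(r"[^\w\s']", " ", (raw or "").lower()).split()
    while words and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words)


def normalize_address(raw: Optional[str]) -> str:
    """Simplification: lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", (raw or "").lower()).split())


def dedupe_key(record: dict) -> tuple:
    """Same phone and same address means same business, whatever the name says."""
    return (normalize_phone_us(record.get("phone")), normalize_address(record.get("address")))


maps_rec = {"name": "Joe's Plumbing", "phone": "+13125551234", "address": "123 N Main St Chicago IL"}
yelp_rec = {"name": "Joe's Plumbing LLC", "phone": "(312) 555-1234", "address": "123 N. Main St., Chicago, IL"}
same_business = dedupe_key(maps_rec) == dedupe_key(yelp_rec)
```

This string-level address cleanup will not reconcile "Street" with "St", which is one reason to move to a structured parser once the pipeline handles real volume.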
Output schema
A complete local business record from this pipeline:
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class LocalBusinessRecord:
    name: str
    address_street: str
    address_city: str
    address_state: str
    address_postal: str
    address_country: str
    phone: str  # E.164
    website: Optional[str]
    google_maps_url: Optional[str]
    yelp_url: Optional[str]
    rating: Optional[float]
    review_count: Optional[int]
    categories: list[str]
    hours: dict  # {"monday": "9:00 AM - 5:00 PM", ...}
    source: str  # "google_maps" | "yelp" | "yellow_pages"
    scraped_at: datetime
```

If you're building a local business data pipeline (for lead generation, competitive intelligence, or directory aggregation), tell us about the scope.