Scraping · June 22, 2026 · 9 min read

B2B Lead Enrichment Pipelines: From Raw Email to Qualified Contact Data

A raw email address tells you almost nothing about a lead. A properly enriched record tells you company size, industry, job seniority, technology stack, and whether they match your ICP. Here's how to build the pipeline that gets you from one to the other.

Muhammad Hassan

Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.


A raw email address tells you almost nothing about a lead. A properly enriched record tells you company size, industry, job title, seniority, technology stack, funding stage, and whether this person matches your ideal customer profile. The gap between raw and enriched is where B2B sales efficiency lives — and building the pipeline that reliably closes that gap is the difference between a sales team spending time on good-fit prospects and one drowning in unqualified outreach.

This post covers the enrichment pipeline architecture: data sources, waterfall logic, normalization, ICP scoring, and routing, along with batching, caching, and what to do when enrichment fails.

What enrichment actually provides

From a business email address, enrichment APIs can return:

Person-level data: first name, last name, job title, seniority level (individual contributor / manager / director / VP / C-suite), LinkedIn URL, professional history

Company-level data: company name, website, company size (employee count range), industry (GICS/SIC classification), annual revenue range, headquarters location, founding year

Technology stack: what software the company uses (CRM, marketing automation, analytics tools, cloud providers), inferred from job postings and from website fingerprinting of script tags and headers

Funding data: last funding round, total funding raised, investors (from Crunchbase, PitchBook via third-party enrichment APIs)

Not every field is available for every contact. Enrichment coverage varies significantly by industry, company size, and how well-indexed the contact is. A VP at a 500-person SaaS company in the US will have high coverage. A manager at a regional services firm in Southeast Asia may return very little.

Data sources and their tradeoffs

Clearbit (now part of HubSpot): high quality for US/EU tech companies, lower coverage for non-English markets. Returns person + company data in one call. $99-299/month for small volumes.

Apollo.io: large database, strong coverage for B2B contacts globally. API access available for enrichment (separate from outbound sequences). Better coverage in emerging markets than Clearbit.

Hunter.io: email verification and domain-level company data. Doesn't return person-level detail but is useful for validating email deliverability before enrichment.

ZoomInfo: enterprise-grade, expensive, but has the best coverage for enterprise accounts. Overkill for most SMB pipelines.

Scraped sources: LinkedIn, Crunchbase, company websites. Legal constraints apply — check your jurisdiction and terms of service. For company-level data (technologies, headcount, recent funding), scraping is a viable complement to paid APIs.

Waterfall enrichment

No single provider covers 100% of contacts. A waterfall pattern tries multiple sources in sequence, stopping when sufficient data is found:

python

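import os
from dataclasses import dataclass, field
from typing import Optional

import httpx

# API keys are read from the environment; the variable names here are a convention, not a requirement.
CLEARBIT_KEY = os.environ.get("CLEARBIT_KEY", "")
APOLLO_KEY = os.environ.get("APOLLO_KEY", "")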

@dataclass
class EnrichedContact:
    email: str
    first_name: Optional[str] = None
    last_name: Optional[str] = None
    job_title: Optional[str] = None
    seniority: Optional[str] = None
    company_name: Optional[str] = None
    company_size: Optional[str] = None
    industry: Optional[str] = None
    linkedin_url: Optional[str] = None
    enrichment_sources: list[str] = field(default_factory=list)

    @property
    def is_sufficiently_enriched(self) -> bool:
        """True if we have the minimum fields needed for ICP scoring."""
        return bool(
            self.company_name
            and self.company_size
            and self.job_title
        )

def enrich_from_clearbit(email: str) -> Optional[dict]:
    try:
        response = httpx.get(
            "https://person.clearbit.com/v2/combined/find",
            params={"email": email},  # let httpx URL-encode the address (e.g. "+" tags)
            headers={"Authorization": f"Bearer {CLEARBIT_KEY}"},
            timeout=5.0,
        )
        if response.status_code == 200:
            data = response.json()
            # Either side can be null when only the person or only the company resolves,
            # so guard every nested lookup with `or {}` instead of a .get() default.
            person = data.get("person") or {}
            company = data.get("company") or {}
            name = person.get("name") or {}
            employment = person.get("employment") or {}
            # Clearbit returns a LinkedIn profile path (handle), not a full URL
            handle = (person.get("linkedin") or {}).get("handle")
            return {
                "first_name": name.get("givenName"),
                "last_name": name.get("familyName"),
                "job_title": employment.get("title"),
                "seniority": employment.get("seniority"),
                "company_name": company.get("name"),
                "company_size": (company.get("metrics") or {}).get("employeesRange"),
                "industry": (company.get("category") or {}).get("industry"),
                "linkedin_url": f"https://www.linkedin.com/{handle}" if handle else None,
                "source": "clearbit",
            }
    except (httpx.HTTPError, ValueError):
        # Network errors and malformed JSON both mean "no data from this provider"
        pass
    return None

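def headcount_to_range(count: Optional[int]) -> Optional[str]:
    """Convert a numeric headcount to a range label so company_size is always a string."""
    if not count:
        return None
    buckets = [(10, "1-10"), (50, "11-50"), (200, "51-200"), (500, "201-500"), (1000, "501-1000")]
    for upper, label in buckets:
        if count <= upper:
            return label
    return "1001+"

def enrich_from_apollo(email: str) -> Optional[dict]:
    try:
        response = httpx.post(
            "https://api.apollo.io/v1/people/match",
            json={"email": email, "reveal_personal_emails": False},
            headers={"X-Api-Key": APOLLO_KEY},
            timeout=5.0,
        )
        if response.status_code == 200:
            person = response.json().get("person") or {}
            org = person.get("organization") or {}
            return {
                "first_name": person.get("first_name"),
                "last_name": person.get("last_name"),
                "job_title": person.get("title"),
                "seniority": person.get("seniority"),
                "company_name": org.get("name"),
                # Apollo reports headcount as a number; bucket it so it matches
                # the range strings the ICP scoring step expects.
                "company_size": headcount_to_range(org.get("estimated_num_employees")),
                "industry": org.get("industry"),
                "linkedin_url": person.get("linkedin_url"),
                "source": "apollo",
            }
    except (httpx.HTTPError, ValueError):
        # Network errors and malformed JSON both mean "no data from this provider"
        pass
    return None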

def waterfall_enrich(email: str) -> EnrichedContact:
    contact = EnrichedContact(email=email)

    for enrich_fn in [enrich_from_clearbit, enrich_from_apollo]:
        result = enrich_fn(email)
        if result:
            # Fill in any missing fields from this source
            for key, value in result.items():
                if key == "source":
                    contact.enrichment_sources.append(value)
                    continue
                if value and not getattr(contact, key, None):
                    setattr(contact, key, value)

        if contact.is_sufficiently_enriched:
            break

    return contact

The waterfall tries Clearbit first (better quality for tech), falls back to Apollo (better coverage), and stops when enough data is found. This controls cost — you only call the second provider when the first doesn't return sufficient data.
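Merging two providers also means merging two vocabularies, so a normalization pass before scoring helps. The scoring step later in this post compares seniority against labels such as `c_suite` and `vp`; a provider that returns a different label for the same level would silently score zero. The sketch below maps labels onto one vocabulary. The alias table and the `normalize_seniority` helper are illustrative assumptions, so check them against what your providers actually return.

```python
from typing import Optional

# Assumed aliases: the left-hand labels are examples, not an authoritative
# list of what any particular provider returns.
SENIORITY_ALIASES = {
    "executive": "c_suite",
    "c-suite": "c_suite",
    "cxo": "c_suite",
    "vice president": "vp",
    "head": "director",
    "lead": "senior",
}

CANONICAL_SENIORITY = {
    "c_suite", "vp", "director", "manager", "senior", "entry", "owner", "founder",
}

def normalize_seniority(raw: Optional[str]) -> Optional[str]:
    """Return a canonical seniority label, or None when the value is unknown."""
    if not raw:
        return None
    key = raw.strip().lower()
    key = SENIORITY_ALIASES.get(key, key)
    # Treat "c suite", "c-suite" and "c_suite" as the same label
    key = key.replace(" ", "_").replace("-", "_")
    return key if key in CANONICAL_SENIORITY else None
```

Unknown values come back as `None` rather than passing through, which makes gaps in the alias table show up in coverage logs instead of hiding in the score.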

Technology stack detection

For selling technical products or services, knowing what tools a company already uses is high-signal enrichment. A company using HubSpot CRM and running on AWS with a React frontend has a different buyer profile than one using Salesforce and running on a custom Java stack.

Technology stack data comes from two places:

Job postings: companies list their tech stack in engineering job descriptions. Scraping a company's open roles and extracting technology mentions gives a reasonably current picture of their stack.
Website fingerprinting: tools like Wappalyzer, BuiltWith, and Datanyze detect technologies from website headers, script tags, and meta patterns.

python

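import httpx

# Substring fingerprints matched against the raw HTML. Keep them specific:
# a bare "G-" would match almost any page, and a plain link to a vendor's
# site is not proof that the company uses the product.
TECH_PATTERNS = {
    "HubSpot CRM": ["js.hs-scripts.com", "hubspot.com/hs-scripts", "hsforms.com"],
    "Salesforce": ["salesforce.com", "force.com"],
    "Intercom": ["intercom.io", "intercomcdn.com"],
    "Stripe": ["js.stripe.com"],
    "Segment": ["cdn.segment.com", "segment.io"],
    "Google Analytics 4": ["gtag/js?id=G-"],
}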

def detect_website_tech(domain: str) -> list[str]:
    try:
        response = httpx.get(f"https://{domain}", timeout=8.0, follow_redirects=True)
        html = response.text
        detected = []
        for tech, patterns in TECH_PATTERNS.items():
            if any(p in html for p in patterns):
                detected.append(tech)
        return detected
    except Exception:
        return []

For ICP scoring, technology stack signals can be decisive: if you're selling a data pipeline product, a company already using Segment + Snowflake is a much warmer prospect than one with no data infrastructure.
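The job-posting source can be handled in a similar way once the posting text is in hand. This is a minimal sketch, assuming the description has already been fetched as plain text; the `TECH_KEYWORDS` list and the `extract_tech_mentions` name are illustrative, not part of any library.

```python
import re

# Illustrative keyword list; extend it with the tools relevant to your ICP.
TECH_KEYWORDS = ("Snowflake", "Segment", "Salesforce", "HubSpot", "AWS", "React", "Java")

def extract_tech_mentions(job_text: str, keywords: tuple[str, ...] = TECH_KEYWORDS) -> list[str]:
    """Return the keywords that appear as standalone tokens in a job description."""
    found = []
    for kw in keywords:
        # Lookarounds instead of \b so that "Java" does not match inside
        # "JavaScript" and names containing punctuation still work.
        pattern = rf"(?<![A-Za-z0-9]){re.escape(kw)}(?![A-Za-z0-9])"
        # Case-sensitive on purpose: "react" and "segment" are ordinary English words.
        if re.search(pattern, job_text):
            found.append(kw)
    return found
```

Case-sensitive matching reduces false positives but does not remove them ("Segment" at the start of a sentence still matches), so treat the output as a signal to weigh rather than a confirmed stack.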

ICP scoring

Once enriched, score the contact against your ideal customer profile:

python

def calculate_icp_score(contact: EnrichedContact) -> tuple[int, str]:
    """Returns (score 0-100, tier A/B/C/D)."""
    score = 0

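    # Company size (example ICP: 50-500 employees)
    # Coerce to str: a provider may return a numeric headcount, and `in` on an int raises TypeError
    size = str(contact.company_size or "")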
    if any(s in size for s in ["51-200", "201-500", "51-100", "101-250"]):
        score += 35
    elif any(s in size for s in ["11-50", "501-1000"]):
        score += 20
    elif any(s in size for s in ["1-10"]):
        score += 5

    # Industry fit
    target_industries = {"technology", "saas", "e-commerce", "financial services", "healthcare"}
    if any(ind in (contact.industry or "").lower() for ind in target_industries):
        score += 30

    # Seniority (prefer decision makers)
    if contact.seniority in ("director", "vp", "c_suite", "owner", "founder"):
        score += 35
    elif contact.seniority in ("manager", "senior"):
        score += 20

    tier = "A" if score >= 80 else "B" if score >= 55 else "C" if score >= 30 else "D"
    return score, tier

Tier A and B contacts route to direct sales outreach. Tier C goes to a nurture email sequence. Tier D is deprioritized or excluded entirely — it's not worth a rep's time.
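The routing rule above reduces to a small lookup. In this sketch the destination names are hypothetical placeholders; map them to whatever queues or sequences your CRM uses.

```python
# Hypothetical destination names, chosen for illustration only.
TIER_ROUTES = {
    "A": "direct_sales",
    "B": "direct_sales",
    "C": "nurture_sequence",
    "D": "deprioritized",
}

def route_for_tier(tier: str) -> str:
    """Map an ICP tier to a destination; unknown tiers fall back to the lowest priority."""
    return TIER_ROUTES.get(tier.strip().upper(), "deprioritized")
```

Falling back to the lowest priority for an unrecognized tier keeps a scoring bug from pushing unqualified contacts to reps.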

Enrichment at scale: batching and caching

For large lists (import of 10,000 contacts, weekly re-enrichment of the entire CRM), enrichment must be batched to control rate limits and costs.

Key operational patterns:

Cache: store enrichment results per email for 90 days. Re-enriching the same email weekly is wasteful; job titles and company sizes don't change that fast.
Prioritize: enrich new leads immediately and batch-enrich historical records overnight.
Track coverage: log which fields came back null per provider. If a provider's coverage drops below 40%, investigate whether your query patterns have changed.
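The caching and coverage-tracking patterns can be sketched in a few lines. This version keeps everything in memory to stay self-contained; a production pipeline would more likely back it with a database table or a key-value store. `EnrichmentCache` and `field_coverage` are illustrative names, and the injectable `clock` exists only to make expiry testable.

```python
import time
from typing import Callable, Optional

CACHE_TTL_SECONDS = 90 * 24 * 60 * 60  # 90 days

class EnrichmentCache:
    """In-memory TTL cache keyed by lowercased email."""

    def __init__(self, ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.time):
        self._ttl = ttl
        self._clock = clock
        self._store: dict[str, tuple[float, dict]] = {}

    def get(self, email: str) -> Optional[dict]:
        entry = self._store.get(email.strip().lower())
        if entry is None:
            return None
        stored_at, record = entry
        if self._clock() - stored_at > self._ttl:
            return None  # expired: the caller should re-enrich
        return record

    def put(self, email: str, record: dict) -> None:
        self._store[email.strip().lower()] = (self._clock(), record)

def field_coverage(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of records with a non-empty value for each field."""
    if not records:
        return {f: 0.0 for f in fields}
    return {f: sum(1 for r in records if r.get(f)) / len(records) for f in fields}
```

Checking the cache before calling `waterfall_enrich`, and running `field_coverage` over each provider's raw results, gives both the cost control and the 40% coverage alarm described above.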

On a 10,000-contact list, Clearbit costs roughly $0.01-0.02 per successful enrichment, depending on plan. Apollo is cheaper per call but may require more fallback calls. A waterfall with both providers typically costs $0.015-0.03 per enriched contact, which puts a fully enriched 10,000-contact list at about $150-300.
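To see how hit rates drive that blended figure, here is a rough estimator. It assumes both providers bill only for successful enrichments; the hit rates and prices in the usage note are made-up inputs for illustration, not measured values.

```python
def waterfall_cost(contacts: int, first_cost: float, first_hit_rate: float,
                   second_cost: float, second_hit_rate: float) -> dict:
    """Estimate waterfall spend, assuming billing per successful enrichment."""
    first_hits = contacts * first_hit_rate
    # Only contacts the first provider missed are sent to the second one
    second_hits = (contacts - first_hits) * second_hit_rate
    total = first_hits * first_cost + second_hits * second_cost
    enriched = first_hits + second_hits
    return {
        "total_cost": round(total, 2),
        "enriched": round(enriched),
        "cost_per_enriched": round(total / enriched, 4) if enriched else 0.0,
    }
```

For example, `waterfall_cost(10_000, 0.02, 0.6, 0.01, 0.4)` models a first provider at $0.02 that resolves 60% of contacts and a second at $0.01 that resolves 40% of the remainder. Plans that bill per call rather than per success would need the second term computed on attempts instead of hits.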

What to do when enrichment fails

About 20-30% of contacts in a typical B2B list won't enrich successfully — personal email addresses, very small companies, or contacts in regions with low database coverage. For these:

Flag as "unenriched" in the CRM
Route to a lower-priority sequence with more generic messaging
Periodically re-try enrichment (people change jobs, new LinkedIn profiles appear)
Use domain-level company data (from Hunter or a web scrape) when person-level enrichment fails
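The four steps above can be combined into one fallback plan per failed contact. This is a sketch under stated assumptions: the free-mail domain list is illustrative and far from exhaustive, the flag and sequence names are placeholders, and the `max_attempts` cap is an addition not mentioned above.

```python
# Illustrative only; real lists of consumer mailbox domains are much longer.
FREE_MAIL_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com", "icloud.com"}

def plan_fallback(email: str, attempts: int, max_attempts: int = 3) -> dict:
    """Decide what to do with a contact whose person-level enrichment failed."""
    domain = email.rsplit("@", 1)[-1].strip().lower()
    is_personal = domain in FREE_MAIL_DOMAINS
    return {
        "crm_flag": "unenriched",
        "sequence": "generic_low_priority",
        # A personal mailbox says nothing about the employer, so a
        # domain-level company lookup only makes sense for business domains.
        "lookup_company_by_domain": not is_personal,
        # Assumed cap so dead addresses are not retried forever
        "retry_later": attempts < max_attempts,
    }
```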

One pattern worth implementing: a manual enrichment queue. For accounts that look like Tier A (large deal size, close to ICP) but are missing key fields, route to a human reviewer to fill the gaps rather than letting the automation downgrade them. The economics work when a single Tier A deal is worth more than the 15 minutes it takes a sales rep to look up the missing fields. For Tier C and D, the automation's best guess is good enough, and human review is not worth the time.

Finally, treat enrichment data as decaying. Job titles change. Companies get acquired. A contact enriched 18 months ago may now be at a different company entirely. Re-enriching the active CRM database quarterly — or whenever a contact re-engages — keeps the data fresh and the ICP scoring accurate.
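Both refresh triggers, age and re-engagement, fit in one predicate. This is a sketch: `needs_reenrichment` and its parameter names are hypothetical, and 90 days stands in for "quarterly".

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

REFRESH_INTERVAL = timedelta(days=90)  # roughly quarterly

def needs_reenrichment(enriched_at: Optional[datetime],
                       last_engaged_at: Optional[datetime],
                       now: Optional[datetime] = None) -> bool:
    """True when a record was never enriched, is stale, or the contact re-engaged since."""
    now = now or datetime.now(timezone.utc)
    if enriched_at is None:
        return True
    if now - enriched_at >= REFRESH_INTERVAL:
        return True
    # Engagement after the last enrichment is a trigger on its own
    return last_engaged_at is not None and last_engaged_at > enriched_at
```

All timestamps are expected to be timezone-aware; mixing naive and aware datetimes raises a `TypeError` on subtraction.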

If you're building a lead enrichment pipeline and need it to handle multiple sources, scoring, and CRM sync, tell us about the project.
