B2B Lead Enrichment Pipelines: From Raw Email to Qualified Contact Data
Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.
A raw email address tells you almost nothing about a lead. A properly enriched record tells you company size, industry, job title, seniority, technology stack, funding stage, and whether this person matches your ideal customer profile (ICP). The gap between raw and enriched is where B2B sales efficiency lives. Building a pipeline that reliably closes that gap is the difference between a sales team that spends its time on good-fit prospects and one that drowns in unqualified outreach.
This post covers the enrichment pipeline architecture: data sources, waterfall logic, normalization, and routing.
What enrichment actually provides
From a business email address, enrichment APIs can return:
Person-level data: first name, last name, job title, seniority level (individual contributor / manager / director / VP / C-suite), LinkedIn URL, professional history
Company-level data: company name, website, company size (employee count range), industry (GICS/SIC classification), annual revenue range, headquarters location, founding year
Technology stack: what software the company uses (CRM, marketing automation, analytics tools, cloud providers), inferred from job postings and from fingerprinting the scripts and headers on the company's website
Funding data: last funding round, total funding raised, investors (from Crunchbase, PitchBook via third-party enrichment APIs)
Not every field is available for every contact. Enrichment coverage varies significantly by industry, company size, and how well-indexed the contact is. A VP at a 500-person SaaS company in the US will have high coverage. A manager at a regional services firm in Southeast Asia may return very little.
Data sources and their tradeoffs
Clearbit (now part of HubSpot): high quality for US/EU tech companies, lower coverage for non-English markets. Returns person + company data in one call. $99-299/month for small volumes.
Apollo.io: large database, strong coverage for B2B contacts globally. API access available for enrichment (separate from outbound sequences). Better coverage in emerging markets than Clearbit.
Hunter.io: email verification and domain-level company data. Doesn't return person-level detail but is useful for validating email deliverability before enrichment.
ZoomInfo: enterprise-grade, expensive, but has the best coverage for enterprise accounts. Overkill for most SMB pipelines.
Scraped sources: LinkedIn, Crunchbase, company websites. Legal constraints apply — check your jurisdiction and terms of service. For company-level data (technologies, headcount, recent funding), scraping is a viable complement to paid APIs.
Waterfall enrichment
No single provider covers 100% of contacts. A waterfall pattern tries multiple sources in sequence, stopping when sufficient data is found:
```python
import os
from dataclasses import dataclass, field
from typing import Optional

import httpx

# API keys come from the environment; the variable names here are placeholders.
CLEARBIT_KEY = os.environ.get("CLEARBIT_KEY", "")
APOLLO_KEY = os.environ.get("APOLLO_KEY", "")


@dataclass
class EnrichedContact:
    email: str
    first_name: Optional[str] = None
    last_name: Optional[str] = None
    job_title: Optional[str] = None
    seniority: Optional[str] = None
    company_name: Optional[str] = None
    company_size: Optional[str] = None
    industry: Optional[str] = None
    linkedin_url: Optional[str] = None
    enrichment_sources: list[str] = field(default_factory=list)

    @property
    def is_sufficiently_enriched(self) -> bool:
        """True if we have the minimum fields needed for ICP scoring."""
        return bool(self.company_name and self.company_size and self.job_title)


def size_bucket(value) -> Optional[str]:
    """Turn a raw employee count into a range label; pass range strings through.

    The bucket boundaries are an assumption, chosen to line up with the
    range labels used in the ICP scoring example later in this post.
    """
    if value is None:
        return None
    if isinstance(value, int):
        for upper, label in [(10, "1-10"), (50, "11-50"), (200, "51-200"),
                             (500, "201-500"), (1000, "501-1000")]:
            if value <= upper:
                return label
        return "1001+"
    return str(value)


def enrich_from_clearbit(email: str) -> Optional[dict]:
    try:
        response = httpx.get(
            "https://person.clearbit.com/v2/combined/find",
            params={"email": email},  # let httpx URL-encode the address
            headers={"Authorization": f"Bearer {CLEARBIT_KEY}"},
            timeout=5.0,
        )
        if response.status_code != 200:
            return None
        data = response.json()
    except (httpx.HTTPError, ValueError):
        return None  # network error or malformed JSON: treat as a miss

    # Either half of the response can be null, so guard every level.
    person = data.get("person") or {}
    company = data.get("company") or {}
    name = person.get("name") or {}
    employment = person.get("employment") or {}
    return {
        "first_name": name.get("givenName"),
        "last_name": name.get("familyName"),
        "job_title": employment.get("title"),
        "seniority": employment.get("seniority"),
        "company_name": company.get("name"),
        "company_size": (company.get("metrics") or {}).get("employeesRange"),
        "industry": (company.get("category") or {}).get("industry"),
        # This field is a LinkedIn handle, not a full URL.
        "linkedin_url": (person.get("linkedin") or {}).get("handle"),
        "source": "clearbit",
    }


def enrich_from_apollo(email: str) -> Optional[dict]:
    try:
        response = httpx.post(
            "https://api.apollo.io/v1/people/match",
            json={"email": email, "reveal_personal_emails": False},
            headers={"X-Api-Key": APOLLO_KEY},
            timeout=5.0,
        )
        if response.status_code != 200:
            return None
        person = response.json().get("person") or {}
    except (httpx.HTTPError, ValueError):
        return None

    org = person.get("organization") or {}
    return {
        "first_name": person.get("first_name"),
        "last_name": person.get("last_name"),
        "job_title": person.get("title"),
        "seniority": person.get("seniority"),
        "company_name": org.get("name"),
        # A headcount here may be a raw number rather than a range label.
        "company_size": size_bucket(org.get("estimated_num_employees")),
        "industry": org.get("industry"),
        "linkedin_url": person.get("linkedin_url"),
        "source": "apollo",
    }


def waterfall_enrich(email: str) -> EnrichedContact:
    contact = EnrichedContact(email=email)
    for enrich_fn in [enrich_from_clearbit, enrich_from_apollo]:
        result = enrich_fn(email)
        if result:
            # Fill in any missing fields from this source
            for key, value in result.items():
                if key == "source":
                    contact.enrichment_sources.append(value)
                    continue
                if value and not getattr(contact, key, None):
                    setattr(contact, key, value)
        if contact.is_sufficiently_enriched:
            break
    return contact
```

The waterfall tries Clearbit first (better quality for tech), falls back to Apollo (better coverage), and stops when enough data is found. This controls cost: you only call the second provider when the first doesn't return sufficient data.
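Because the merged record can mix values from two providers, a normalization pass before scoring helps keep fields like seniority comparable. The following is a minimal sketch, not either provider's documented vocabulary: `normalize_seniority`, `RAW_SENIORITY`, and `TITLE_RULES` are hypothetical names, and the keyword tables are illustrative assumptions.

```python
import re
from typing import Optional

# Hypothetical mapping from provider-reported values to one shared label set.
RAW_SENIORITY = {
    "executive": "c_suite",
    "c_suite": "c_suite",
    "vp": "vp",
    "director": "director",
    "manager": "manager",
    "senior": "senior",
    "owner": "owner",
    "founder": "founder",
}

# Fallback keyword rules applied to the job title, checked in order (assumption).
TITLE_RULES = [
    ("c_suite", r"\b(ceo|cto|cfo|coo|cmo|chief)\b"),
    ("founder", r"\b(co-?founder|founder)\b"),
    ("owner", r"\bowner\b"),
    ("vp", r"\b(vp|vice president)\b"),
    ("director", r"\b(director|head of)\b"),
    ("manager", r"\bmanager\b"),
    ("senior", r"\b(senior|sr|principal|staff)\b"),
]


def normalize_seniority(raw: Optional[str], job_title: Optional[str] = None) -> Optional[str]:
    """Prefer the provider's seniority value; fall back to title keywords."""
    key = (raw or "").strip().lower()
    if key in RAW_SENIORITY:
        return RAW_SENIORITY[key]
    title = (job_title or "").lower()
    for label, pattern in TITLE_RULES:
        if re.search(pattern, title):
            return label
    return None  # unknown: leave the decision to scoring or a human
```

Keyword rules on titles are a blunt instrument, so it is worth logging the titles that fall through to `None` and extending the tables from real data.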
Technology stack detection
For selling technical products or services, knowing what tools a company already uses is high-signal enrichment. A company using HubSpot CRM and running on AWS with a React frontend has a different buyer profile than one using Salesforce and running on a custom Java stack.
Technology stack data comes from two places:
- Job postings: companies list their tech stack in engineering job descriptions. Scraping a company's open roles and extracting technology mentions gives a reasonably current picture of their stack.
- Website fingerprinting: tools like Wappalyzer, BuiltWith, and Datanyze detect technologies from website headers, script tags, and meta patterns.
```python
import httpx

# Substrings that suggest a tool when they appear in the homepage HTML.
# Plain substring checks can produce false positives (a link to salesforce.com
# is not proof of a Salesforce deployment), so treat hits as signals.
TECH_PATTERNS = {
    "HubSpot CRM": ["js.hs-scripts.com", "hsforms.com"],
    "Salesforce": ["salesforce.com", "force.com"],
    "Intercom": ["intercom.io", "intercomcdn.com"],
    "Stripe": ["js.stripe.com"],
    "Segment": ["cdn.segment.com", "segment.io"],
    # GA4 measurement IDs start with "G-"; a bare "G-" would match almost any page.
    "Google Analytics 4": ["gtag/js?id=G-"],
}


def detect_website_tech(domain: str) -> list[str]:
    try:
        response = httpx.get(f"https://{domain}", timeout=8.0, follow_redirects=True)
    except httpx.HTTPError:
        return []  # unreachable site: no signal rather than an error
    html = response.text
    detected = []
    for tech, patterns in TECH_PATTERNS.items():
        if any(p in html for p in patterns):
            detected.append(tech)
    return detected
```

For ICP scoring, technology stack signals can be decisive: if you're selling a data pipeline product, a company already using Segment + Snowflake is a much warmer prospect than one with no data infrastructure.
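The job-posting route can be sketched in the same spirit. This example assumes the posting text has already been fetched; `extract_tech_mentions` and the `TECH_KEYWORDS` table are hypothetical and would need to reflect the tools that matter for your own ICP.

```python
import re

# Illustrative keyword table (assumption): canonical name -> aliases to look for.
TECH_KEYWORDS = {
    "Snowflake": ["snowflake"],
    "AWS": ["aws", "amazon web services"],
    "React": ["react", "reactjs"],
    "Salesforce": ["salesforce"],
    "Java": ["java"],
}


def extract_tech_mentions(posting_text: str) -> list[str]:
    """Return the technologies mentioned in a job description."""
    lowered = posting_text.lower()
    found = []
    for tech, aliases in TECH_KEYWORDS.items():
        for alias in aliases:
            # Require non-alphanumeric neighbours so "java" does not
            # match inside "javascript".
            pattern = r"(?<![a-z0-9])" + re.escape(alias) + r"(?![a-z0-9])"
            if re.search(pattern, lowered):
                found.append(tech)
                break
    return found
```

Running this over all of a company's open engineering roles and counting mentions gives a rough ranking of which tools are central and which are incidental.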
ICP scoring
Once enriched, score the contact against your ideal customer profile:
```python
def calculate_icp_score(contact: EnrichedContact) -> tuple[int, str]:
    """Returns (score 0-100, tier A/B/C/D)."""
    score = 0

    # Company size (example ICP: 50-500 employees). Exact label matching avoids
    # substring collisions such as "1-10" matching inside "1001-10000".
    # The labels must match whatever your providers actually return.
    size = str(contact.company_size or "").strip()
    if size in {"51-200", "201-500", "51-100", "101-250"}:
        score += 35
    elif size in {"11-50", "501-1000"}:
        score += 20
    elif size == "1-10":
        score += 5

    # Industry fit
    target_industries = {"technology", "saas", "e-commerce", "financial services", "healthcare"}
    if any(ind in (contact.industry or "").lower() for ind in target_industries):
        score += 30

    # Seniority (prefer decision makers). "executive" is included on the
    # assumption that some providers use it for C-level contacts.
    seniority = (contact.seniority or "").lower()
    if seniority in ("director", "vp", "c_suite", "owner", "founder", "executive"):
        score += 35
    elif seniority in ("manager", "senior"):
        score += 20

    tier = "A" if score >= 80 else "B" if score >= 55 else "C" if score >= 30 else "D"
    return score, tier
```

Tier A and B contacts route to direct sales outreach. Tier C goes to a nurture email sequence. Tier D is deprioritized or excluded entirely, since it is not worth a rep's time.
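The routing step itself can stay very small. A minimal sketch, assuming hypothetical route names (`sales_outreach`, `nurture_sequence`, and so on) that you would map to your own CRM sequences; a contact with no tier because enrichment failed goes to a generic lower-priority sequence.

```python
from typing import Optional

# Hypothetical route names; replace with your CRM's sequence identifiers.
TIER_ROUTES = {
    "A": "sales_outreach",
    "B": "sales_outreach",
    "C": "nurture_sequence",
    "D": "excluded",
}


def route_contact(tier: Optional[str]) -> str:
    """Pick a destination for a scored contact; None means enrichment failed."""
    if tier is None:
        return "generic_sequence"
    # Unknown tiers are treated like D so nothing unscored reaches a rep.
    return TIER_ROUTES.get(tier.upper(), "excluded")
```

Keeping the mapping in a dictionary makes it easy to change thresholds and destinations independently of each other.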
Enrichment at scale: batching and caching
For large lists (an import of 10,000 contacts, or a periodic re-enrichment of the entire CRM), enrichment must be batched to stay within provider rate limits and keep costs predictable.
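As a rough illustration of batching, the sketch below processes emails in fixed-size chunks and pauses between chunks. The names `chunked` and `enrich_in_batches`, the batch size, and the pause length are all assumptions; real limits depend on your provider plan.

```python
import time
from typing import Callable, Iterator


def chunked(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enrich_in_batches(
    emails: list[str],
    enrich_one: Callable[[str], object],
    batch_size: int = 100,
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Enrich every email, pausing between batches to respect rate limits."""
    results = {}
    batches = list(chunked(emails, batch_size))
    for index, batch in enumerate(batches):
        for email in batch:
            results[email] = enrich_one(email)
        # No pause is needed after the final batch.
        if index < len(batches) - 1:
            sleep(pause_seconds)
    return results
```

The `sleep` parameter is injectable so the function can be tested without waiting; in production the default `time.sleep` applies.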
Key operational patterns:
- Cache: store enrichment results per email for 90 days. Re-enriching the same email weekly is wasteful, because job titles and company sizes don't change that fast.
- Prioritize: enrich new leads immediately and batch-enrich historical records overnight.
- Track coverage: log which fields came back null per provider. If a provider's coverage drops below 40%, investigate whether your query patterns have changed.
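The caching and coverage-tracking patterns can be sketched as follows. This is an in-memory illustration only: `EnrichmentCache` and `field_coverage` are hypothetical names, and a real pipeline would more likely back the cache with a database table or a key-value store.

```python
import time
from typing import Callable, Optional

CACHE_TTL_SECONDS = 90 * 24 * 60 * 60  # 90 days


class EnrichmentCache:
    """Per-email cache whose entries expire after a fixed time-to-live."""

    def __init__(self, ttl_seconds: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}

    def put(self, email: str, record: dict) -> None:
        self._entries[email.strip().lower()] = (self._clock(), record)

    def get(self, email: str) -> Optional[dict]:
        entry = self._entries.get(email.strip().lower())
        if entry is None:
            return None
        stored_at, record = entry
        if self._clock() - stored_at > self._ttl:
            return None  # stale: the caller should re-enrich
        return record


def field_coverage(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of records with a non-empty value for each field."""
    if not records:
        return {name: 0.0 for name in fields}
    return {
        name: sum(1 for r in records if r.get(name)) / len(records)
        for name in fields
    }
```

Computing `field_coverage` per provider on each batch gives the number to compare against the 40% threshold.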
At 10,000 contacts, Clearbit costs roughly $0.01-0.02 per successful enrichment (depending on plan). Apollo is cheaper per call but may require more fallback calls. A waterfall with both providers typically costs $0.015-0.03 per enriched contact.
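To sanity-check a blended figure for your own list, a simple expected-cost model helps. The sketch below assumes each provider bills only on a successful match and that any match from the first provider is sufficient, which will not hold exactly in practice; the function name and every number passed to it are hypothetical.

```python
def waterfall_cost_per_enriched(
    first_hit_rate: float,
    first_price: float,
    second_hit_rate: float,
    second_price: float,
) -> float:
    """Expected spend per successfully enriched contact in a two-step waterfall."""
    # Share of contacts that miss on the first provider and hit on the second.
    second_share = (1 - first_hit_rate) * second_hit_rate
    enriched_share = first_hit_rate + second_share
    if enriched_share == 0:
        raise ValueError("no contacts would be enriched with these hit rates")
    total_cost = first_hit_rate * first_price + second_share * second_price
    return total_cost / enriched_share


# Example inputs (all assumed): 60% hit rate at $0.02, then 50% at $0.01.
estimate = waterfall_cost_per_enriched(0.6, 0.02, 0.5, 0.01)
```

Plugging in hit rates measured from your own coverage logs makes the estimate far more useful than any published average.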
What to do when enrichment fails
About 20-30% of contacts in a typical B2B list won't enrich successfully — personal email addresses, very small companies, or contacts in regions with low database coverage. For these:
- Flag as "unenriched" in the CRM
- Route to a lower-priority sequence with more generic messaging
- Periodically retry enrichment (people change jobs, new LinkedIn profiles appear)
- Use domain-level company data (from Hunter or a web scrape) when person-level enrichment fails
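Some of these failures can be predicted before spending an API call. A minimal sketch, assuming a hypothetical `classify_email` helper and a deliberately short list of free-mail domains that you would extend for your own market:

```python
# Illustrative, incomplete list of consumer mailbox providers (assumption).
FREE_MAIL_DOMAINS = {
    "gmail.com",
    "yahoo.com",
    "outlook.com",
    "hotmail.com",
    "icloud.com",
    "proton.me",
}


def classify_email(email: str) -> str:
    """Return 'invalid', 'personal', or 'business' for an address."""
    local, separator, domain = email.strip().lower().rpartition("@")
    if not separator or not local or "." not in domain:
        return "invalid"
    if domain in FREE_MAIL_DOMAINS:
        return "personal"
    return "business"
```

Addresses classified as personal can be flagged as unenriched straight away, while the domain of a business address is the natural key for the domain-level fallback.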
One pattern worth implementing: a manual enrichment queue. For Tier A accounts (large deal size, close to ICP but missing key fields), route to a human reviewer to fill gaps manually rather than letting the automation downgrade them. The economics work when a single Tier A deal is worth more than the 15 minutes it takes a sales rep to manually look up the missing fields. For Tier C and D, the automation's best guess is good enough — human review is not worth the time.
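The decision to queue a contact for manual review can be written down explicitly. This is a sketch under stated assumptions: the function name, the `min_partial_score` default (borrowed from the Tier B threshold in the scoring example), and the cost inputs are all hypothetical.

```python
KEY_FIELDS = ("company_name", "company_size", "job_title")


def should_queue_for_manual_review(
    record: dict,
    partial_score: int,
    expected_deal_value: float,
    rep_hourly_cost: float,
    review_minutes: int = 15,
    min_partial_score: int = 55,
) -> bool:
    """True when a promising but incomplete record justifies a rep's time."""
    missing = [name for name in KEY_FIELDS if not record.get(name)]
    if not missing:
        return False  # nothing for a human to fill in
    if partial_score < min_partial_score:
        return False  # not close enough to the ICP to be worth the effort
    review_cost = rep_hourly_cost * review_minutes / 60
    return expected_deal_value > review_cost
```

Using an expected deal value (deal size multiplied by a realistic close rate) rather than the headline deal size keeps the queue from filling with long shots.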
Finally, treat enrichment data as decaying. Job titles change. Companies get acquired. A contact enriched 18 months ago may now be at a different company entirely. Re-enriching the active CRM database quarterly — or whenever a contact re-engages — keeps the data fresh and the ICP scoring accurate.