A sales team came to us with a problem they had completely misdiagnosed. They thought they had a volume problem. "We need more leads," they said. "Our reps are running dry." So they'd been buying lists. Exporting tens of thousands of contacts out of every tool they could expense. Dumping them into the CRM. And watching their reps drown.
What they actually had was a data quality problem wearing a volume costume.
Their reps weren't running dry. They were running blind. Half the records had no verified email. A quarter were duplicates the system happily counted twice. Job titles were three years stale. The "decision-makers" had left the company. And every bad record didn't just sit there harmlessly — it actively poisoned the well, torching domain reputation and burning rep hours on people who were never going to answer.
This is the write-up of the system we built to fix it. The client is anonymized — call them a mid-market B2B platform running an outbound-heavy sales motion across multiple personas and geographies. But the architecture is real, the code is real, and the numbers at the bottom are pulled straight from the engine after 90 days of operation.
> More leads was never the answer. A clean, enriched, deduplicated, source-tracked lead, one a rep can act on in ten seconds without second-guessing it, is worth more than fifty contacts scraped off a list nobody trusts.
>
> The thesis behind the entire build
Here's the trap nobody puts in the pitch deck. Every record you push into outreach has a hidden tax. A bad email bounces and drags your sender reputation down. A duplicate gets contacted twice and makes you look like a spammer. A stale title means your "personalized" opener is wrong on arrival. A missing phone number means the SDR skips the record entirely — so you paid to acquire a lead that gets zero touches.
The team had 40,000 contacts in their CRM. They felt rich. In reality, fewer than 12,000 were actually usable. The remaining 28,000-plus weren't neutral: they were a liability, silently degrading deliverability and eating rep time on a treadmill that went nowhere.
Before we wrote a single line of code, we built them a calculator to show exactly what their "big database" was really worth.
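A rough sketch of the arithmetic behind that kind of calculator is below. The function name, the parameter names, and the stale-title rate are hypothetical placeholders; only the 40,000-contact total and the "half unverified, a quarter duplicated" proportions come from the story above. The model also assumes the defects are independent of each other, which real CRM data rarely is.

```python
def usable_records(total, unverified_rate, duplicate_rate, stale_rate):
    """Estimate how many records survive every quality filter.

    Assumes each defect is independent of the others, so the share of
    clean records is the product of the individual pass rates.
    All rates are fractions between 0 and 1.
    """
    for rate in (unverified_rate, duplicate_rate, stale_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must be between 0 and 1")
    pass_rate = (1 - unverified_rate) * (1 - duplicate_rate) * (1 - stale_rate)
    usable = round(total * pass_rate)
    return {"usable": usable, "dead_weight": total - usable, "pass_rate": pass_rate}


# The stale_rate here is a placeholder for illustration, not a measured value.
estimate = usable_records(40_000, unverified_rate=0.5, duplicate_rate=0.25, stale_rate=0.2)
print(estimate["usable"], estimate["dead_weight"])
```

The useful part is less the exact output than how fast the product of a few ordinary-looking defect rates shrinks the headline count.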
We didn't sell them a list. We built them a lead operations engine — a repeatable pipeline that ingests raw targets from any source, enriches them through a waterfall, scrubs out the junk, scores what's left, and hands sales a CRM-ready dataset they can act on without a second of cleanup.
Three layers, running on a tight loop:
1. Sourcing. ICP-driven account and contact pulls from ZoomInfo, LinkedIn Sales Navigator, and Apollo, filtered by title, headcount, revenue band, geo, and tech stack. No broad scrapes. Every record enters with a reason.
2. Enrichment. A multi-provider enrichment waterfall in Clay fills firmographics, technographics, and verified contact data, falling through providers until a confident match lands, then validating every email before it's allowed downstream.
3. Clean, score, sync. Deterministic + fuzzy deduplication, field normalization, a fit/confidence score, then a clean push to the CRM with source tracking baked into every record. Reps get priority, not chaos.
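The sourcing rule that every record enters with a reason can be sketched as a filter that either rejects a candidate or stamps it with the criteria it matched. This is an illustration, not the production filter: the `IcpProfile` fields, the record keys, and every threshold below are hypothetical, and revenue band is left out to keep it short.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class IcpProfile:
    # Hypothetical ICP definition; every value here is a placeholder.
    title_keywords: frozenset = frozenset({"vp", "head", "director"})
    min_headcount: int = 50
    geos: frozenset = frozenset({"US", "UK"})
    required_tech: str = "target_tool"


def icp_reasons(record, icp):
    """Return human-readable match reasons, or None if any criterion fails."""
    title_words = set(re.findall(r"[a-z]+", (record.get("title") or "").lower()))
    matched_titles = sorted(title_words & icp.title_keywords)
    headcount = record.get("employee_count") or 0
    country = record.get("country")
    tech = record.get("tech_stack") or []

    if not matched_titles or headcount < icp.min_headcount:
        return None
    if country not in icp.geos or icp.required_tech not in tech:
        return None
    return [
        f"title:{matched_titles[0]}",
        f"headcount:{headcount}",
        f"geo:{country}",
        f"tech:{icp.required_tech}",
    ]


def admit(records, icp):
    """Keep only in-ICP records and stamp each with why it was admitted."""
    admitted = []
    for record in records:
        reasons = icp_reasons(record, icp)
        if reasons is not None:
            admitted.append({**record, "entry_reasons": reasons})
    return admitted
```

Carrying `entry_reasons` alongside the source stamp means a rep, or a later audit, can see why a contact is in the CRM at all.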
The whole thing is described below — diagram, then the actual code that powers each layer.
Once the data was clean, the next leak was time. A verified, high-fit lead is a perishable asset. The half-life of buyer intent is measured in minutes, not days — and this team was routing fresh inbound and high-intent records to reps on a four-hour lag because everything moved through a manual export.
We wired the clean output straight into a speed-to-lead router. The team had been ignoring the decay curve: every extra minute of response time lowers the odds of ever reaching the buyer.
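One way to picture that decay is a plain half-life model. This is purely illustrative: the function name is made up, the 30-minute half-life is a placeholder rather than a measured figure, and nothing here is the curve from the engagement itself.

```python
def relative_contact_odds(response_minutes, half_life_minutes):
    """Odds of reaching the buyer relative to an instant response (1.0).

    Models intent as simple exponential decay: the odds halve every
    half_life_minutes. Illustrative only, not a fitted curve.
    """
    if response_minutes < 0 or half_life_minutes <= 0:
        raise ValueError("response time must be >= 0 and half-life > 0")
    return 0.5 ** (response_minutes / half_life_minutes)


# Compare a 2-minute first touch with the old four-hour lag, using a
# placeholder half-life (an assumption, not a measured value).
HALF_LIFE_MINUTES = 30
for minutes in (2, 240):
    print(minutes, round(relative_contact_odds(minutes, HALF_LIFE_MINUTES), 4))
```

Whatever half-life your own data supports, the shape is the same: a four-hour lag spans many half-lives, while a two-minute response spans a small fraction of one.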
This is the part agencies never show you, because most of them don't have it. The engine isn't a person clicking export buttons. It's code. Here are three of the load-bearing pieces — lightly genericized, but structurally exactly what's running.
```javascript
// Clay "HTTP API" enrichment column: waterfall fallback logic.
// Runs per-row: try providers in order, stop at first confident hit,
// then hard-verify the email before the row is allowed downstream.
// callProvider() and verifyEmail() are defined elsewhere (not shown).
const providers = [
  { name: "zoominfo", minConfidence: 0.85 },
  { name: "apollo", minConfidence: 0.80 },
  { name: "clearbit", minConfidence: 0.75 }
];

async function enrichRow(row) {
  let match = null;
  for (const p of providers) {
    const res = await callProvider(p.name, {
      domain: row.company_domain,
      fullName: row.full_name,
      title: row.title
    });
    if (res && res.confidence >= p.minConfidence) {
      match = { ...res, sourced_from: p.name };
      break; // stop the waterfall: we have a confident hit
    }
  }
  if (!match) return { status: "no_match", route: "discard" };

  // A confident match with no email has nothing to verify: hold it.
  if (!match.email) {
    return { ...match, email_status: "missing", route: "hold" };
  }

  // Gate: never let an unverified email leave this column
  const verify = await verifyEmail(match.email);
  match.email_status = verify.result; // valid | risky | invalid
  match.route = verify.result === "valid" ? "sync" : "hold";
  return match;
}
```
```python
# Deterministic + fuzzy dedup, then normalize fields for clean CRM import.
import re

from rapidfuzz import fuzz


def norm_email(e):
    return (e or "").strip().lower()


def norm_domain(url):
    # Strip an optional scheme and an optional "www." prefix, then any path.
    d = re.sub(r"^(https?://)?(www\.)?", "", (url or "").strip().lower())
    return d.split("/")[0].strip()


def norm_name(n):
    return (n or "").strip().lower()


def dedupe(records):
    seen, clean = {}, []
    for r in records:
        key = norm_email(r.get("email"))
        domain = norm_domain(r.get("company_domain"))
        name = norm_name(r.get("full_name"))

        # 1. exact match on email = hard duplicate
        if key and key in seen:
            seen[key]["sources"].append(r.get("source"))
            continue

        # 2. fuzzy match: same domain + ~same name = likely dup.
        #    Skipped when either field is blank, so empty records
        #    never collapse into each other.
        dup = None
        if domain and name:
            for c in clean:
                same_domain = c["company_domain"] == domain
                if same_domain and fuzz.ratio(norm_name(c.get("full_name")), name) > 90:
                    dup = c
                    break
        if dup is not None:
            dup["sources"].append(r.get("source"))  # keep source tracking
            continue

        r["email"] = key
        r["company_domain"] = domain
        r["sources"] = [r.get("source")]
        if key:
            seen[key] = r
        clean.append(r)
    return clean
```
```javascript
// Serverless webhook: receives a verified, deduped record and
// upserts it into the CRM with a fit score + priority tier.
// upsertContact() and notifyRepInstant() are defined elsewhere (not shown).
export default async function handler(req, res) {
  const lead = req.body;
  if (!lead || !lead.email) {
    return res.status(400).json({ ok: false, error: "email is required" });
  }

  // Simple, transparent fit score: title + size + intent signals.
  let score = 0;
  if (/(vp|head|director|chief|founder)/i.test(lead.title || "")) score += 35;
  if (lead.employee_count >= 50) score += 25;
  if (lead.tech_stack?.includes("target_tool")) score += 25;
  if (lead.email_status === "valid") score += 15;

  const tier = score >= 70 ? "A_hot" : score >= 45 ? "B_warm" : "C_nurture";

  await upsertContact({
    email: lead.email,
    properties: {
      fit_score: score,
      priority_tier: tier,
      lead_source: lead.sourced_from,
      sourced_date: new Date().toISOString()
    }
  });

  // Hot leads jump the queue → routed for <2 min first touch.
  if (tier === "A_hot") await notifyRepInstant(lead.email);

  return res.status(200).json({ ok: true, score, tier });
}
```
Three points worth calling out. The waterfall stops at the first confident match — you don't pay every provider for every row, which is how enrichment budgets quietly explode. Nothing reaches the CRM without a verified email gate. And the fit score is deliberately simple and readable — a sales leader can look at it and understand exactly why a lead is tier A. No black box.
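To make the "no black box" point concrete, here is a Python mirror of the webhook's scoring rules that also returns the reason behind each point. The weights and tier thresholds are copied from the handler above; the `explain_fit` name and the wording of the reason strings are illustrative additions, not part of the running system.

```python
import re

TITLE_PATTERN = re.compile(r"(vp|head|director|chief|founder)", re.IGNORECASE)


def explain_fit(lead):
    """Score a lead with the same rules as the webhook and list each reason."""
    score, reasons = 0, []
    if TITLE_PATTERN.search(lead.get("title") or ""):
        score += 35
        reasons.append("senior title (+35)")
    if (lead.get("employee_count") or 0) >= 50:
        score += 25
        reasons.append("50+ employees (+25)")
    if "target_tool" in (lead.get("tech_stack") or []):
        score += 25
        reasons.append("uses target tool (+25)")
    if lead.get("email_status") == "valid":
        score += 15
        reasons.append("verified email (+15)")

    if score >= 70:
        tier = "A_hot"
    elif score >= 45:
        tier = "B_warm"
    else:
        tier = "C_nurture"
    return {"score": score, "tier": tier, "reasons": reasons}
```

Storing the reasons next to the score is one way to let a sales leader audit any tier assignment without reading code.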
The instinct is that "enrich everything through every provider" gets you the most data. It also gets you the biggest bill, most of it spent re-buying data you already had. The waterfall pattern only pays for the next provider when the previous one misses.
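The cost gap can be sketched with a few lines of arithmetic. Everything numeric below is hypothetical: the per-lookup prices and hit rates are placeholders, not vendor pricing or figures from this engagement, and the model assumes each provider's hit rate applies independently to whatever rows reach it.

```python
def enrich_all_cost(rows, providers):
    """Cost of sending every row to every provider."""
    return rows * sum(p["cost"] for p in providers)


def waterfall_cost(rows, providers):
    """Expected cost when each provider only sees rows the previous ones missed."""
    total, reaching = 0.0, float(rows)
    for p in providers:
        total += reaching * p["cost"]   # pay only for rows that got this far
        reaching *= 1 - p["hit_rate"]   # misses fall through to the next provider
    return total


# Hypothetical per-lookup prices and hit rates, for illustration only.
PROVIDERS = [
    {"name": "provider_a", "cost": 0.5, "hit_rate": 0.5},
    {"name": "provider_b", "cost": 0.25, "hit_rate": 0.5},
    {"name": "provider_c", "cost": 0.25, "hit_rate": 0.5},
]

print(enrich_all_cost(10_000, PROVIDERS), waterfall_cost(10_000, PROVIDERS))
```

The better the first provider's hit rate, the fewer rows ever reach the second and third, so ordering the waterfall by hit rate per dollar matters as much as the pattern itself.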
We don't report on impressions. We report on what the data did once it hit the sales floor. After one quarter of the engine running on a continuous loop:
The headline shift wasn't a number on a dashboard. It was that reps stopped opening records and sighing. Every contact that hit their queue was verified, enriched, scored, and explained. The "huge database" got smaller on paper — and the pipeline got bigger in reality.
| Dimension | The list-buying habit | The lead-ops engine |
|---|---|---|
| Success metric | Rows added to CRM | Usable, action-ready records |
| Email handling | Blast and hope | Verified before sync — gate enforced |
| Duplicates | Counted as "more leads" | Caught on entry, deterministic + fuzzy |
| Enrichment | One provider or none | Multi-provider waterfall, first-hit stop |
| Routing | Manual export, 4-hr lag | Scored + tiered, hot leads <2 min |
| Source tracking | Unknown / lost | Stamped on every record at ingest |
| Rep cleanup | Hours per list | Zero — it arrives clean |
> We deleted more than half their database and their pipeline went up. That's the whole lesson. A lead operations system isn't about how much data you can pour in. It's about how little garbage you let out.
>
> Arsalan Faysal, Revenue Systems Architect
You don't need a full audit to smell the problem. Run your own export and check these. If two or more are true, you're paying the dead-weight tax right now:
If that list stung, good. That's the same gut-check that started this build. The fix isn't more leads. It's an engine that makes every lead worth a rep's attention — and runs on its own once it's deployed.
I don't sell hours. I don't sell lists. I build the infrastructure that turns raw, messy targets into a clean, scored, sales-ready pipeline — and keeps it that way without you babysitting it. If your CRM is a landfill and your reps are doing data entry instead of selling, that's a system problem. And system problems have system fixes.