Fixing Data Quality: How We Transformed a Sales Team's CRM Efficiency

Written by Arsalan Faysal | Jun 1, 2026 4:31:13 AM

A sales team came to us with a problem they had completely misdiagnosed. They thought they had a volume problem. "We need more leads," they said. "Our reps are running dry." So they'd been buying lists. Exporting tens of thousands of contacts out of every tool they could expense. Dumping them into the CRM. And watching their reps drown.

What they actually had was a data quality problem wearing a volume costume.

Their reps weren't running dry. They were running blind. Half the records had no verified email. A quarter were duplicates the system happily counted twice. Job titles were three years stale. The "decision-makers" had left the company. And every bad record didn't just sit there harmlessly — it actively poisoned the well, torching domain reputation and burning rep hours on people who were never going to answer.

This is the write-up of the system we built to fix it. The client is anonymized — call them a mid-market B2B platform running an outbound-heavy sales motion across multiple personas and geographies. But the architecture is real, the code is real, and the numbers at the bottom are pulled straight from the engine after 90 days of operation.

More leads was never the answer. A clean, enriched, deduplicated, source-tracked lead — one a rep can act on in ten seconds without second-guessing it — is worth more than fifty contacts scraped off a list nobody trusts. — The thesis behind the entire build

◆ ◆ ◆

What "more leads" actually costs you

Here's the trap nobody puts in the pitch deck. Every record you push into outreach has a hidden tax. A bad email bounces and drags your sender reputation down. A duplicate gets contacted twice and makes you look like a spammer. A stale title means your "personalized" opener is wrong on arrival. A missing phone number means the SDR skips the record entirely — so you paid to acquire a lead that gets zero touches.

The team had 40,000 contacts in their CRM. They felt rich. In reality, fewer than 12,000 were actually usable. The other 28,000 weren't neutral — they were a liability, silently degrading deliverability and eating rep time on a treadmill that went nowhere.

Before we wrote a single line of code, we built them a calculator to show exactly what their "big database" was really worth. Move the sliders. Watch the number that matters collapse.

Interactive Diagnostic // 01

The Real-Database Calculator

A list is only as big as the part of it you can actually use. Set your numbers and see how fast a "huge database" shrinks to its usable core — and what the dead weight is costing you.

Total contacts in CRM 40,000

Duplicate rate 22%

Invalid / unverified email rate 30%

Stale / wrong-role rate 18%

Avg. cost per acquired record $1.10

Sender reputation risk Elevated

17,909 Genuinely usable

22,091 Dead weight

$24,300 Spend on dead weight

You think you have a 40,000-contact database. You have a 17,909-contact database wearing a 40,000-contact costume — and you paid $24,300 to acquire records no rep should ever touch.

Illustrative model. Real audits use measured bounce, dedup, and verification rates from your actual export.
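The arithmetic behind those figures can be sketched in a few lines. This is a minimal reconstruction, assuming the three defect rates are independent and compound multiplicatively (that assumption reproduces the numbers shown, but the widget's actual formula is not published here). The names `DatabaseAudit` and `audit_database` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DatabaseAudit:
    usable: int
    dead_weight: int
    dead_weight_spend: float


def audit_database(total, dup_rate, invalid_rate, stale_rate, cost_per_record):
    """Estimate the usable core of a contact database.

    Assumption: the three defect rates are independent, so the share of
    records that survives all three checks is the product of the pass rates.
    """
    survive = (1 - dup_rate) * (1 - invalid_rate) * (1 - stale_rate)
    usable = round(total * survive)
    dead_weight = total - usable
    return DatabaseAudit(usable, dead_weight, dead_weight * cost_per_record)


# Inputs taken from the calculator defaults above.
audit = audit_database(40_000, 0.22, 0.30, 0.18, 1.10)
print(audit.usable, audit.dead_weight, round(audit.dead_weight_spend))
```

In a real audit the three rates would come from measured bounce, dedup, and verification results rather than slider values, and overlapping defects (a duplicate that is also stale) would need to be counted once.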

⚡ The reframe that started the build

Before

"We need 10,000 more leads this quarter." A volume target. Measured by rows added. Guaranteed to make the deliverability problem worse.

After

"We need every record a rep opens to be verified, enriched, deduped, and routed by priority." A quality target. Measured by usable, action-ready records. The volume took care of itself.

The architecture we deployed

We didn't sell them a list. We built them a lead operations engine — a repeatable pipeline that ingests raw targets from any source, enriches them through a waterfall, scrubs out the junk, scores what's left, and hands sales a CRM-ready dataset they can act on without a second of cleanup.

Three layers, running on a tight loop:

01 Source & Define

Targeted Sourcing

ICP-driven account and contact pulls from ZoomInfo, LinkedIn Sales Navigator, and Apollo — filtered by title, headcount, revenue band, geo, and tech stack. No broad scrapes. Every record enters with a reason.

02 Enrich & Verify

The Clay Waterfall

A multi-provider enrichment waterfall in Clay fills firmographics, technographics, and verified contact data — falling through providers until a confident match lands, then validating every email before it's allowed downstream.

03 Scrub & Route

Dedup, Score, Sync

Deterministic + fuzzy deduplication, field normalization, a fit/confidence score, then a clean push to the CRM with source tracking baked into every record. Reps get priority, not chaos.

The whole thing is described below — diagram, then the actual code that powers each layer.

  RAW TARGETS                  ENRICHMENT CORE                SALES-READY OUTPUT
┌──────────────┐            ┌───────────────────┐           ┌────────────────────┐
│ ZoomInfo     │            │ CLAY WATERFALL    │           │ CRM (HubSpot)      │
│ Sales Nav    │──ICP──────▶│ ├─ Provider A     │           │ ├─ Verified email  │
│ Apollo       │  filter    │ ├─ Provider B ▼   │──score───▶│ ├─ Fit score 0-100 │
│ CSV uploads  │            │ └─ Provider C     │           │ ├─ Source + date   │
└──────────────┘            │ email verify      │           │ └─ Priority tier   │
       │                    └─────────┬─────────┘           └─────────┬──────────┘
       │                              │                               │
       ▼                              ▼                               ▼
dedupe on entry             confidence threshold            speed-to-lead routing
(no junk gets in)            (no guess gets out)          (hot leads first, <2 min)
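The ICP filter on the left edge of that diagram is the one stage the code samples below do not cover. Here is a rough sketch of the idea that every record enters with a reason. Everything in it is an assumption for illustration: the `Target` fields, the `ICP` values, and the `icp_reason` helper are hypothetical, and the revenue-band filter is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Target:
    title: str
    employee_count: int
    country: str
    tech_stack: list = field(default_factory=list)
    source: str = "unknown"


# Hypothetical ICP definition: every value here is a placeholder.
ICP = {
    "title_keywords": ("vp", "head", "director", "chief", "founder"),
    "min_employees": 50,
    "countries": {"US", "CA", "GB"},
    "required_tech": "target_tool",
}


def icp_reason(t: Target, icp: dict = ICP):
    """Return the reason a target qualifies, or None if it fails the ICP."""
    words = t.title.lower().replace(",", " ").split()
    hit = next((k for k in icp["title_keywords"] if k in words), None)
    if hit is None:
        return None
    if t.employee_count < icp["min_employees"]:
        return None
    if t.country not in icp["countries"]:
        return None
    if icp["required_tech"] not in t.tech_stack:
        return None
    return f"title:{hit} size:{t.employee_count} geo:{t.country}"


targets = [
    Target("VP of Sales", 120, "US", ["target_tool"], "zoominfo"),
    Target("Sales Intern", 120, "US", ["target_tool"], "apollo"),
    Target("Head of Growth", 12, "US", ["target_tool"], "sales_nav"),
]
# Keep only targets that qualify, paired with the reason they qualified.
qualified = [(t, icp_reason(t)) for t in targets if icp_reason(t)]
```

Storing the reason string next to the record gives the rep (and the later audit) a one-line answer to "why is this contact in my queue?"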

Why routing speed mattered more than list size

Once the data was clean, the next leak was time. A verified, high-fit lead is a perishable asset. The half-life of buyer intent is measured in minutes, not days — and this team was routing fresh inbound and high-intent records to reps on a four-hour lag because everything moved through a manual export.

We wired the clean output straight into a speed-to-lead router. Here's the decay curve the team had been ignoring. Drag the response time and watch what happens to the odds.

Interactive Diagnostic // 02

The Lead-Decay Simulator

Contact a lead in the first 5 minutes and your odds of a real conversation are dramatically higher than at the 1-hour mark. Set your response time and monthly lead flow to see the qualified conversations you're leaving on the table.

Avg. time to first contact 4 hrs

High-intent leads / month 600

Avg. deal value $8,000

Close rate on connected leads 12%

Relative odds of qualifying the lead: 32%

408 Conversations lost / mo

$392K Pipeline forfeited / mo

$361K Recoverable at <5 min

At a 4-hour response time you're converting interest at roughly 32% of its peak. Cut first-contact to under 5 minutes and you reclaim most of that $392K/mo without sourcing a single new lead.

Decay modeled on widely-cited speed-to-lead research. Directional, not a guarantee — your curve is measured during the audit.
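For readers who want to check the math, the two loss figures follow directly from the inputs. This sketch takes the relative-odds percentage as a given input; the decay curve that produces it is not specified in this write-up, so it is not modeled here, and neither is the "recoverable" figure that depends on it. The function name is hypothetical.

```python
def forfeited_pipeline(leads_per_month, relative_odds, deal_value, close_rate):
    """Translate a response-time penalty into lost conversations and pipeline.

    relative_odds is the share of peak qualification odds retained at the
    current response time (0.32 at four hours in the example above).
    Assumption: it is supplied by the caller, not derived from a curve.
    """
    lost_conversations = round(leads_per_month * (1 - relative_odds))
    lost_pipeline = lost_conversations * deal_value * close_rate
    return lost_conversations, lost_pipeline


# Inputs taken from the simulator defaults above.
lost, pipeline = forfeited_pipeline(600, 0.32, 8_000, 0.12)
print(lost, round(pipeline))
```

With the defaults this gives 408 lost conversations and about $391,680 a month, which the widget rounds to $392K.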

The code that runs the engine

This is the part agencies never show you, because most of them don't have it. The engine isn't a person clicking export buttons. It's code. Here are three of the load-bearing pieces — lightly genericized, but structurally exactly what's running.

Clay → enrichment waterfall (HTTP API column)

// Clay "HTTP API" enrichment column — waterfall fallback logic.
// Runs per-row: try providers in order, stop at first confident hit,
// then hard-verify the email before the row is allowed downstream.

const providers = [
  { name: "zoominfo",  minConfidence: 0.85 },
  { name: "apollo",    minConfidence: 0.80 },
  { name: "clearbit",  minConfidence: 0.75 }
];

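async function enrichRow(row) {
  let match = null;

  for (const p of providers) {
    let res = null;
    try {
      // callProvider is a genericized stand-in for the per-provider lookup.
      res = await callProvider(p.name, {
        domain: row.company_domain,
        fullName: row.full_name,
        title: row.title
      });
    } catch (err) {
      continue; // a provider error counts as a miss: fall through to the next one
    }
    if (res && res.confidence >= p.minConfidence) {
      match = { ...res, sourced_from: p.name };
      break; // stop the waterfall: we have a confident hit
    }
  }

  if (!match) return { status: "no_match", route: "discard" };
  // A match with no email cannot pass the gate, so hold it instead of verifying nothing.
  if (!match.email) return { ...match, email_status: "missing", route: "hold" };

  // Gate: never let an unverified email leave this column
  const verify = await verifyEmail(match.email);
  match.email_status = verify.result;      // valid | risky | invalid
  match.route = verify.result === "valid" ? "sync" : "hold";

  return match;
}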

Deduplication + field normalization

# Deterministic + fuzzy dedup, then normalize fields for clean CRM import.
import re
from rapidfuzz import fuzz

def norm_email(e):
    return (e or "").strip().lower()

def norm_domain(url):
    # Strip the scheme and a leading "www." whether or not a scheme is present.
    d = re.sub(r"^(https?://)?(www\.)?", "", (url or "").strip().lower())
    return d.split("/")[0].strip()

def norm_name(name):
    # Lowercase and collapse whitespace so casing does not skew the fuzzy score.
    return " ".join((name or "").lower().split())

def dedupe(records):
    seen, clean = {}, []
    for r in records:
        key = norm_email(r.get("email"))
        # 1. exact match on email = hard duplicate
        if key and key in seen:
            seen[key]["sources"].append(r.get("source"))
            continue
        # 2. fuzzy match: same domain + ~same name = likely dup.
        #    Blank domains or names never match, so sparse rows are not merged.
        domain = norm_domain(r.get("company_domain"))
        name = norm_name(r.get("full_name"))
        dup = None
        if domain and name:
            for c in clean:
                same_company = c["company_domain"] == domain
                if same_company and fuzz.ratio(norm_name(c.get("full_name")), name) > 90:
                    dup = c
                    break
        if dup is not None:
            dup["sources"].append(r.get("source"))  # keep source tracking on merges
            continue

        r["email"] = key
        r["company_domain"] = domain
        r["sources"] = [r.get("source")]
        if key:
            seen[key] = r
        clean.append(r)
    return clean

Clean record → CRM sync + priority routing

// Serverless webhook: receives a verified, deduped record and
// upserts it into the CRM with a fit score + priority tier.

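export default async function handler(req, res) {
  const lead = req.body;
  // Without an email there is nothing to upsert against.
  if (!lead?.email) return res.status(400).json({ ok: false, error: "email required" });

  // Simple, transparent fit score: title + size + intent signals.
  let score = 0;
  // \b keeps "vp" from matching inside longer words such as "MVP".
  if (/\b(vp|head|director|chief|founder)\b/i.test(lead.title ?? "")) score += 35;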
  if (lead.employee_count >= 50) score += 25;
  if (lead.tech_stack?.includes("target_tool")) score += 25;
  if (lead.email_status === "valid") score += 15;

  const tier = score >= 70 ? "A_hot"
             : score >= 45 ? "B_warm"
             : "C_nurture";

  await upsertContact({
    email: lead.email,
    properties: {
      fit_score: score,
      priority_tier: tier,
      lead_source: lead.sourced_from,
      sourced_date: new Date().toISOString()
    }
  });

  // Hot leads jump the queue → routed for <2 min first touch.
  if (tier === "A_hot") await notifyRepInstant(lead.email);

  return res.status(200).json({ ok: true, score, tier });
}

Three points worth calling out. The waterfall stops at the first confident match — you don't pay every provider for every row, which is how enrichment budgets quietly explode. Nothing reaches the CRM without a verified email gate. And the fit score is deliberately simple and readable — a sales leader can look at it and understand exactly why a lead is tier A. No black box.
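To make the "no black box" point concrete, here is one way the same scoring rules could return their own explanation. This is a Python port of the weights and tier cut-offs in the webhook above, not the production handler; `explain_fit` and the wording of the reason strings are hypothetical, and the title pattern uses word boundaries so that "vp" does not match inside a longer word.

```python
import re

TITLE_PATTERN = re.compile(r"\b(vp|head|director|chief|founder)\b", re.IGNORECASE)


def explain_fit(lead: dict):
    """Return (score, tier, reasons) using the same weights as the webhook."""
    score, reasons = 0, []
    if TITLE_PATTERN.search(lead.get("title") or ""):
        score += 35
        reasons.append("senior title (+35)")
    if (lead.get("employee_count") or 0) >= 50:
        score += 25
        reasons.append("50+ employees (+25)")
    if "target_tool" in (lead.get("tech_stack") or []):
        score += 25
        reasons.append("uses target_tool (+25)")
    if lead.get("email_status") == "valid":
        score += 15
        reasons.append("verified email (+15)")
    tier = "A_hot" if score >= 70 else "B_warm" if score >= 45 else "C_nurture"
    return score, tier, reasons


score, tier, reasons = explain_fit({
    "title": "Director of Revenue Operations",
    "employee_count": 30,
    "tech_stack": ["target_tool"],
    "email_status": "valid",
})
print(score, tier, reasons)
```

A small company with a senior title, the target tool, and a verified email still lands in tier A here, and the reasons list shows exactly which three rules got it there.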

Does the waterfall actually save money?

The instinct is that "enrich everything through every provider" gets you the most data. It also gets you the biggest bill — most of it spent re-buying data you already had. The waterfall pattern only pays for the next provider when the previous one misses. Here's the difference, in dollars.

Interactive Diagnostic // 03

Waterfall vs. Brute-Force Enrichment

"Hit every provider on every row" feels thorough. It's mostly waste. Compare the naive approach against a waterfall that stops at the first confident match.

Rows to enrich / month 10,000

Cost per provider credit $0.08

Providers in stack 3

1st-provider hit rate 65%

$2,400 Brute force / mo

$1,178 Waterfall / mo

Monthly saving with waterfall: 51%

Same data coverage, 51% lower cost — roughly $14,664/yr back in the budget — just from not re-buying records you already matched on provider one.

Model assumes each later provider matches the same share (65% here) of the rows that reach it. Real savings depend on provider overlap.
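The comparison is easy to reproduce. A minimal sketch, assuming a flat per-credit price and that every provider matches the same share of the rows it receives (that assumption reproduces the figures above; real hit rates differ by provider). Function names are hypothetical.

```python
def brute_force_cost(rows, credit_cost, providers):
    """Every row is sent to every provider."""
    return rows * credit_cost * providers


def waterfall_cost(rows, credit_cost, providers, hit_rate):
    """Each provider only sees the rows every earlier provider missed.

    Assumption: every provider matches the same share (hit_rate) of the
    rows that reach it.
    """
    total, remaining = 0.0, float(rows)
    for _ in range(providers):
        total += remaining * credit_cost   # pay for the rows that reach this provider
        remaining *= 1 - hit_rate          # only the misses cascade onward
    return total


# Inputs taken from the comparison defaults above.
brute = brute_force_cost(10_000, 0.08, 3)
waterfall = waterfall_cost(10_000, 0.08, 3, 0.65)
saving = 1 - waterfall / brute
print(round(brute), round(waterfall), round(saving * 100))
```

With the defaults, provider one sees 10,000 rows, provider two sees 3,500, and provider three sees 1,225, which is where the $1,178 comes from.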

What 90 days of the engine produced

We don't report on impressions. We report on what the data did once it hit the sales floor. After one quarter of the engine running on a continuous loop:

73% Usable record rate (from ~30%)

96% Email deliverability achieved

0 Manual cleanup hours for reps

2.4× Reply rate on outbound

The headline shift wasn't a number on a dashboard. It was that reps stopped opening records and sighing. Every contact that hit their queue was verified, enriched, scored, and explained. The "huge database" got smaller on paper — and the pipeline got bigger in reality.

Dimension	The list-buying habit	The lead-ops engine
Success metric	Rows added to CRM	Usable, action-ready records
Email handling	Blast and hope	Verified before sync — gate enforced
Duplicates	Counted as "more leads"	Caught on entry, deterministic + fuzzy
Enrichment	One provider or none	Multi-provider waterfall, first-hit stop
Routing	Manual export, 4-hr lag	Scored + tiered, hot leads <2 min
Source tracking	Unknown / lost	Stamped on every record at ingest
Rep cleanup	Hours per list	Zero — it arrives clean

We deleted more than half their database and their pipeline went up. That's the whole lesson. A lead operations system isn't about how much data you can pour in. It's about how little garbage you let out. — Arsalan Faysal, Revenue Systems Architect

How to know if your lead data is quietly broken

You don't need a full audit to smell the problem. Run your own export and check these. If two or more are true, you're paying the dead-weight tax right now:

⚠ The 60-second self-diagnostic

Check 01

Your bounce rate is above 3%. You're syncing unverified emails. Every bounce is a deposit into the spam-folder bank.

Check 02

Reps "clean up" lists before working them. If a human has to fix the data before using it, your system isn't finished; it's offloading the work to your most expensive people.

Check 03

You can't say where a given contact came from. No source tracking means you can't kill bad sources or double down on good ones. You're optimizing blind.

Check 04

First contact takes longer than 15 minutes. The decay simulator above already showed you what that costs. It's not a routing inconvenience; it's forfeited pipeline.
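If your export is already in a script-friendly shape, the four checks can be run in one pass. This is a rough sketch with hypothetical field names (`bounced`, `source`, `minutes_to_first_contact`); check 02 cannot be read from data, so it is passed in as a yes/no answer, and using the median for check 04 is an assumption.

```python
from statistics import median


def self_diagnostic(records, cleanup_before_use):
    """Run the four checks against a CRM export.

    Hypothetical fields: 'bounced' (bool or None), 'source' (str or None),
    'minutes_to_first_contact' (number or None).
    cleanup_before_use is answered by a human: do reps fix lists first?
    """
    sent = [r for r in records if r.get("bounced") is not None]
    bounce_rate = sum(r["bounced"] for r in sent) / len(sent) if sent else 0.0
    untracked = sum(1 for r in records if not r.get("source"))
    waits = [r["minutes_to_first_contact"] for r in records
             if r.get("minutes_to_first_contact") is not None]
    failed = {
        "bounce_rate_above_3pct": bounce_rate > 0.03,
        "reps_clean_lists": bool(cleanup_before_use),
        "untracked_sources": untracked > 0,
        "first_contact_over_15_min": bool(waits) and median(waits) > 15,
    }
    # Two or more failed checks means the dead-weight tax is being paid.
    return failed, sum(failed.values()) >= 2


records = [
    {"bounced": False, "source": "zoominfo", "minutes_to_first_contact": 4},
    {"bounced": True, "source": None, "minutes_to_first_contact": 240},
    {"bounced": False, "source": "apollo", "minutes_to_first_contact": 30},
]
failed, paying_tax = self_diagnostic(records, cleanup_before_use=False)
print(failed, paying_tax)
```

The thresholds (3%, 15 minutes, two or more checks) are the ones stated in the list above; swap in your own field names before running it on a real export.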

If that list stung, good. That's the same gut-check that started this build. The fix isn't more leads. It's an engine that makes every lead worth a rep's attention — and runs on its own once it's deployed.

I don't sell hours. I don't sell lists. I build the infrastructure that turns raw, messy targets into a clean, scored, sales-ready pipeline — and keeps it that way without you babysitting it. If your CRM is a landfill and your reps are doing data entry instead of selling, that's a system problem. And system problems have system fixes.
