Data Quality Framework: Standardize Your Process

Bad data costs businesses an estimated $3.1 trillion a year in the US alone (a widely cited IBM figure). Most analytics teams clean data ad hoc, introducing inconsistencies and errors. Learn the framework professional data teams use to catch problems at the source, automate quality checks, and build a culture of data integrity.

The 6 Dimensions of Data Quality

Data quality isn't binary—it's multidimensional. A dataset can pass five dimensions, fail the sixth, and still be worthless.

1. Accuracy: Values Match Reality

Customer age is 250? Invalid. Email without an @? Invalid. Revenue is £0? Technically valid, but suspicious. Check: statistical range rules (age 18-100), format validation (email regex), reasonableness checks (revenue > 0).
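A minimal sketch of these accuracy checks in Python (the field names and thresholds are illustrative, not a fixed standard):

```python
import re

def check_accuracy(record):
    """Flag values that violate simple accuracy rules (assumed thresholds)."""
    issues = []
    if not (18 <= record.get("age", 0) <= 100):          # statistical range rule
        issues.append("age out of range")
    if not re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", record.get("email", "")):
        issues.append("email format invalid")            # format validation
    if record.get("revenue", 0) <= 0:                    # reasonableness check
        issues.append("revenue not positive")
    return issues
```

Each rule returns a named issue rather than rejecting the row outright, so downstream code can decide whether to quarantine or just flag.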

2. Completeness: Required Fields Exist

Missing customer_id in a CRM export = unusable. Missing last_login? Depends on context. Check: row-level null counts, schema requirements by data source, context-aware nullability (nullable in one system, required in another).
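A completeness check can be as simple as counting nulls per required field; a sketch (which fields are "required" depends on your schema):

```python
def completeness_report(rows, required_fields):
    """Count missing values per required field across a list of row dicts."""
    missing = {f: 0 for f in required_fields}
    for row in rows:
        for f in required_fields:
            if row.get(f) in (None, ""):  # treat None and empty string as null
                missing[f] += 1
    return missing
```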

3. Consistency: Same Data, Same Meaning

Customer "John" in system A, "john" in system B. Are they the same? Format inconsistency creates analysis errors. Check: standardize case (uppercase/lowercase), match formats across systems (customer_id should always be 8 digits), flag mismatches.
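A sketch of the standardization step, assuming the 8-digit customer_id convention mentioned above:

```python
def normalize_customer(record):
    """Standardize case and pad customer_id to the assumed 8-digit format."""
    out = dict(record)
    out["name"] = record["name"].strip().lower()          # standardize case
    out["customer_id"] = record["customer_id"].strip().zfill(8)  # pad to 8 digits
    return out
```

Normalize before comparing across systems; "John" vs "john" then matches, and IDs with dropped leading zeros reconcile.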

4. Timeliness: Data Is Current

Customer joined 2018, last_login is 2018, account never used since. Is this data still valid? For reporting, yes. For marketing, no (don't target inactive customers). Check: SLA compliance (ingested within 4 hours), stale data detection (last update > 30 days), real-time dashboards vs. daily batches.
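The stale-data check above can be sketched as follows (the 30-day window is the example threshold, not a universal rule):

```python
from datetime import datetime, timedelta

def is_stale(last_update, now=None, max_age_days=30):
    """Flag records whose last update falls outside the staleness window."""
    now = now or datetime.now()
    return (now - last_update) > timedelta(days=max_age_days)
```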

5. Validity: Values Conform to Expected Type

Email address is stored as text when it should be email format. Phone number is "123" when it should be 10 digits. Date is "32-13-2025" (impossible date). Check: type validation (string vs. int), format validation (regex for email/phone), range validation (day 1-31, month 1-12).
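A sketch of type, format, and range validity checks; note that parsing the date catches impossible values like "32-13-2025" automatically:

```python
import re
from datetime import datetime

def is_valid_date(s, fmt="%d-%m-%Y"):
    """True only if s parses as a real calendar date in the given format."""
    try:
        datetime.strptime(s, fmt)
        return True
    except ValueError:
        return False

def validate_record(rec):
    """Validity checks; field names and the 10-digit phone rule are illustrative."""
    errors = []
    if not re.fullmatch(r"\d{10}", rec.get("phone", "")):
        errors.append("phone must be 10 digits")
    if not is_valid_date(rec.get("date", "")):
        errors.append("impossible or malformed date")
    return errors
```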

6. Uniqueness: No Unintended Duplicates

Customer appears twice in export (API retry issue). Transaction appears 3 times (data merge went wrong). Revenue is double-counted. Check: duplicate detection on primary keys, row-level hashes for exact duplicates, fuzzy matching for near-duplicates (same customer, slightly different name).
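The first two checks (key duplicates and exact-row duplicates) can be sketched in one pass; fuzzy matching needs a dedicated library and is omitted here:

```python
import hashlib

def find_duplicates(rows, key="customer_id"):
    """Return duplicated key values and rows that are exact repeats."""
    seen_keys, dup_keys = set(), set()
    seen_hashes, dup_rows = set(), []
    for row in rows:
        k = row.get(key)
        if k in seen_keys:
            dup_keys.add(k)               # primary-key duplicate
        seen_keys.add(k)
        # hash the full sorted row for exact-duplicate detection
        h = hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()
        if h in seen_hashes:
            dup_rows.append(row)
        seen_hashes.add(h)
    return dup_keys, dup_rows
```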

Source-Specific Cleaning Patterns

Different data sources have different quality problems. Tailor your checks to the source.

CRM Exports (Salesforce, HubSpot)

  • Duplicate contacts (same person created twice)
  • Merged records with inconsistent data
  • Custom fields with nulls (always blank)
  • Email/phone format inconsistencies
  • Deleted records re-imported accidentally

Web Analytics (Google Analytics, Mixpanel)

  • Session IDs not unique (client retry logic)
  • Timestamps out of order (clock skew)
  • Conversion events missing properties
  • User agents that don't parse
  • Bots counted as real users

Survey Data (Typeform, Qualtrics)

  • Responses outside allowed choices
  • Completion rate below threshold (too many drop-offs)
  • Response time too fast (bot/click-through)
  • Duplicate responses from same IP
  • Free-text fields with script injection attempts

Database Exports (SQL Dumps)

  • Foreign key constraint violations (referential integrity)
  • Stale data from production snapshots
  • Encoding issues (UTF-8 vs. Latin-1)
  • Schema changes not documented
  • Transaction log gaps (incomplete exports)

Automated vs. Manual Cleaning: When to Use Each

Automate the deterministic rules. Manually review the edge cases.

Automate These

  • Null detection: Flag rows missing required fields (catches 40% of issues, zero false positives)
  • Type validation: Column must be numeric/date/email, not text (simple regex, <1% false positive rate)
  • Format checking: Phone must be 10 digits, ZIP code 5 digits (deterministic rules)
  • Duplicate detection: Exact match on primary key columns (no guessing required)
  • Range validation: Age must be 0-150, revenue must be > 0 (mathematical rules)
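These deterministic rules can all be run in a single pass over the data; a minimal sketch of the check runner (the check names and thresholds are illustrative):

```python
def run_checks(rows, checks):
    """Run each automated check on every row; collect failures for review."""
    failures = []
    for i, row in enumerate(rows):
        for name, check in checks.items():
            if not check(row):
                failures.append((i, name))
    return failures

# Example rule set: null detection, range validation, reasonableness
checks = {
    "required_id": lambda r: r.get("customer_id") not in (None, ""),
    "age_range":   lambda r: 0 <= r.get("age", -1) <= 150,
    "revenue_pos": lambda r: r.get("revenue", 0) > 0,
}
```

The failures list becomes the manual-review queue described below: analysts see only the rows that tripped a rule, not the whole dataset.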

Manual Review These

  • Business logic: "Customer name is 'aaa'"—technically valid text, obviously wrong in context
  • Fuzzy duplicates: "John Smith" vs "Jon Smith"—are they the same person? (requires judgment)
  • Outliers: Customer spent £100,000 in one transaction—fraud or VIP? (investigation needed)
  • Change detection: Customer's industry changed from "Tech" to "Healthcare"—update or error? (context matters)

Best Practice: Hybrid Approach

(1) Run automated checks on every data load (catches 95% of issues in seconds). (2) Alert data owners to exceptions. (3) Route exceptions to manual review queue. (4) Analyst spends 5 minutes on 50 rows, not 2 hours on 5,000 rows. This hybrid approach scales: 10 automated checks → 10,000 rows/second, then humans review edge cases.

Building a Data Quality Culture

Data quality is a team problem. Solve it together.

Make quality visible: Publish monthly metrics on a shared dashboard. Show % completeness, null counts, duplicate rates, SLA compliance. When metrics improve, celebrate publicly. This creates accountability and momentum.

Define clear ownership: Data engineering owns pipeline quality (automated checks, SLA compliance). Product owns source quality (correct form fields, no spam submissions). Analytics owns transformation quality (correct joins, aggregate formulas). When something breaks, everyone knows who to contact.

Set SLAs (Service Level Agreements): Data must be 95% complete. Ingestion latency < 4 hours. Incident response time < 30 minutes. When data quality drops below SLA, pause release and alert stakeholders. This prevents bad data from reaching reports.

Automate enforcement: If data quality check fails, the pipeline stops with a clear error message. No manual intervention, no workarounds—this quarantines bad data and forces fixes at the source.

Train continuously: New hires need to understand data quality standards. Quarterly workshops on common issues. Internal docs on "how we clean CRM data" and "what to do when quality alerts fire." Culture sticks when it's documented and enforced.

ETL Pipeline Validation Strategy

Validate at each stage. Problems caught early are cheap. Problems caught in production reports are expensive.

Extract Stage

Verify row counts match source, schema changes are detected, no unexpected nulls. If extract fails, hold and alert before transform/load begins.
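A sketch of an extract-stage gate under these rules (the error messages and column comparison are illustrative; real pipelines usually compare against a stored schema version):

```python
def validate_extract(source_count, extracted_rows, expected_columns):
    """Hold the load and alert if extract-stage invariants fail."""
    errors = []
    if len(extracted_rows) != source_count:
        errors.append(f"row count mismatch: {len(extracted_rows)} != {source_count}")
    if extracted_rows:
        cols = set(extracted_rows[0])
        if cols != set(expected_columns):
            errors.append("schema change detected")
    return errors  # non-empty => stop before transform/load
```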

Transform Stage

Check calculations (revenue = price × quantity), aggregations match source totals (CTEs must reconcile), joins don't create unexpected nulls. Test with realistic data volumes—small volumes hide problems.

Load Stage

Confirm data reaches destination, indexes are fresh, downstream queries work. Run sample queries against final table. If anything fails, quarantine load and alert immediately.

GDPR Compliance in Data Cleaning

Non-compliance can cost up to €20 million or 4% of global annual turnover, whichever is higher. Build compliance into your data quality process.

Data access logs: Log who accessed personal data, when, and why. Audit trail must be immutable. This proves compliance if regulators ask.

Right to be forgotten: When a customer asks for deletion, purge their PII within 30 days. Anonymize their historical records (keep aggregate stats, remove identifying info). Automate enforcement: if deletion request age > 30 days, alert data team.

Retention policies: Don't keep personal data longer than needed. Data older than 2 years? Archive or delete. Implement retention windows in your ETL: tag old records, purge on schedule.
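The retention split can be sketched as follows (the 2-year window matches the example above; `created_at` is an assumed field name):

```python
from datetime import datetime, timedelta

def partition_by_retention(rows, now, window_days=730):
    """Split records into keep vs purge by a retention window (default ~2 years)."""
    keep, purge = [], []
    for row in rows:
        if now - row["created_at"] > timedelta(days=window_days):
            purge.append(row)   # tagged for archive or deletion
        else:
            keep.append(row)
    return keep, purge
```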

Consent tracking: Don't analyze opted-out users. Tag records with consent status. In your cleaning pipeline, filter out non-consented users before analysis. This is automatic, not manual.
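The consent filter belongs at the start of the cleaning pipeline; a sketch (assuming a boolean `consent` tag on each record):

```python
def filter_consented(rows):
    """Drop opted-out or untagged users before any analysis runs."""
    return [r for r in rows if r.get("consent") is True]
```

Records with a missing consent tag are excluded too, which is the safer default.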

Ready to standardize your data quality?

Use our Data Quality Framework generator to create standardized checklists for your data sources. Reduce cleaning time from 80% to 20% of analytics time.

Use the Data Quality Framework →