Data Quality Framework: Standardize Your Process

Bad data costs businesses an estimated $3.1 trillion a year in the US alone (a widely cited IBM figure). Most analytics teams clean data ad hoc, introducing inconsistencies and errors. Learn the framework professional data teams use to catch problems at the source, automate quality checks, and build a culture of data integrity.

The 6 Dimensions of Data Quality

Data quality isn't binary—it's multidimensional. A dataset can pass five dimensions, fail the sixth, and still be worthless.

1. Accuracy: Values Match Reality

Customer age is 250? Invalid. Email without an @? Invalid. Revenue is £0? Technically valid, but suspicious. Check: statistical range rules (age 18-100), format validation (email regex), reasonableness checks (revenue > 0).
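A minimal sketch of these accuracy checks in Python (the field names and thresholds are illustrative, not a fixed standard):

```python
import re

def check_accuracy(record):
    """Flag values that violate simple accuracy rules (assumed thresholds)."""
    issues = []
    if not (18 <= record.get("age", 0) <= 100):          # statistical range rule
        issues.append("age out of range")
    if not re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", record.get("email", "")):
        issues.append("email format invalid")            # format validation
    if record.get("revenue", 0) <= 0:                    # reasonableness check
        issues.append("revenue not positive")
    return issues
```

Each rule returns a named issue rather than rejecting the row outright, so downstream code can decide whether to quarantine or just flag.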

2. Completeness: Required Fields Exist

Missing customer_id in a CRM export = unusable. Missing last_login? Depends on context. Check: row-level null counts, schema requirements by data source, context-aware nullability (nullable in one system, required in another).
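A completeness check can be as simple as counting nulls per required field; a sketch (which fields are "required" depends on your schema):

```python
def completeness_report(rows, required_fields):
    """Count missing values per required field across a list of row dicts."""
    missing = {f: 0 for f in required_fields}
    for row in rows:
        for f in required_fields:
            if row.get(f) in (None, ""):  # treat None and empty string as null
                missing[f] += 1
    return missing
```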

3. Consistency: Same Data, Same Meaning

Customer "John" in system A, "john" in system B. Are they the same? Format inconsistency creates analysis errors. Check: standardize case (uppercase/lowercase), match formats across systems (customer_id should always be 8 digits), flag mismatches.
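A sketch of the standardization step, assuming the 8-digit customer_id convention mentioned above:

```python
def normalize_customer(record):
    """Standardize case and pad customer_id to the assumed 8-digit format."""
    out = dict(record)
    out["name"] = record["name"].strip().lower()          # standardize case
    out["customer_id"] = record["customer_id"].strip().zfill(8)  # pad to 8 digits
    return out
```

Normalize before comparing across systems; "John" vs "john" then matches, and IDs with dropped leading zeros reconcile.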

4. Timeliness: Data Is Current

Customer joined 2018, last_login is 2018, account never used since. Is this data still valid? For reporting, yes. For marketing, no (don't target inactive customers). Check: SLA compliance (ingested within 4 hours), stale data detection (last update > 30 days), real-time dashboards vs. daily batches.
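The stale-data check above can be sketched as follows (the 30-day window is the example threshold, not a universal rule):

```python
from datetime import datetime, timedelta

def is_stale(last_update, now=None, max_age_days=30):
    """Flag records whose last update falls outside the staleness window."""
    now = now or datetime.now()
    return (now - last_update) > timedelta(days=max_age_days)
```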

5. Validity: Values Conform to Expected Type

Email address is stored as text when it should be email format. Phone number is "123" when it should be 10 digits. Date is "32-13-2025" (impossible date). Check: type validation (string vs. int), format validation (regex for email/phone), range validation (day 1-31, month 1-12).
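A sketch of type, format, and range validity checks; note that parsing the date catches impossible values like "32-13-2025" automatically:

```python
import re
from datetime import datetime

def is_valid_date(s, fmt="%d-%m-%Y"):
    """True only if s parses as a real calendar date in the given format."""
    try:
        datetime.strptime(s, fmt)
        return True
    except ValueError:
        return False

def validate_record(rec):
    """Validity checks; field names and the 10-digit phone rule are illustrative."""
    errors = []
    if not re.fullmatch(r"\d{10}", rec.get("phone", "")):
        errors.append("phone must be 10 digits")
    if not is_valid_date(rec.get("date", "")):
        errors.append("impossible or malformed date")
    return errors
```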

6. Uniqueness: No Unintended Duplicates

Customer appears twice in export (API retry issue). Transaction appears 3 times (data merge went wrong). Revenue is double-counted. Check: duplicate detection on primary keys, row-level hashes for exact duplicates, fuzzy matching for near-duplicates (same customer, slightly different name).
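The first two checks (key duplicates and exact-row duplicates) can be sketched in one pass; fuzzy matching needs a dedicated library and is omitted here:

```python
import hashlib

def find_duplicates(rows, key="customer_id"):
    """Return duplicated key values and rows that are exact repeats."""
    seen_keys, dup_keys = set(), set()
    seen_hashes, dup_rows = set(), []
    for row in rows:
        k = row.get(key)
        if k in seen_keys:
            dup_keys.add(k)               # primary-key duplicate
        seen_keys.add(k)
        # hash the full sorted row for exact-duplicate detection
        h = hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()
        if h in seen_hashes:
            dup_rows.append(row)
        seen_hashes.add(h)
    return dup_keys, dup_rows
```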

Source-Specific Cleaning Patterns

Different data sources have different quality problems. Tailor your checks to the source.

CRM Exports (Salesforce, HubSpot)

  • Duplicate contacts (same person created twice)
  • Merged records with inconsistent data
  • Custom fields with nulls (always blank)
  • Email/phone format inconsistencies
  • Deleted records re-imported accidentally

Web Analytics (Google Analytics, Mixpanel)

  • Session IDs not unique (client retry logic)
  • Timestamps out of order (clock skew)
  • Conversion events missing properties
  • User agents that don't parse
  • Bots counted as real users

Survey Data (Typeform, Qualtrics)

  • Responses outside allowed choices
  • Completion rate below threshold (too many drop-offs)
  • Response time too fast (bot/click-through)
  • Duplicate responses from same IP
  • Free-text fields with script injection attempts

Database Exports (SQL Dumps)

  • Foreign key constraint violations (referential integrity)
  • Stale data from production snapshots
  • Encoding issues (UTF-8 vs. Latin-1)
  • Schema changes not documented
  • Transaction log gaps (incomplete exports)

Automated vs. Manual Cleaning: When to Use Each

Automate the deterministic rules. Manually review the edge cases.

Automate These

  • Null detection: Flag rows missing required fields (catches 40% of issues, zero false positives)
  • Type validation: Column must be numeric/date/email, not text (simple regex, <1% false positive rate)
  • Format checking: Phone must be 10 digits, ZIP code 5 digits (deterministic rules)
  • Duplicate detection: Exact match on primary key columns (no guessing required)
  • Range validation: Age must be 0-150, revenue must be > 0 (mathematical rules)
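These deterministic rules can all be run in a single pass over the data; a minimal sketch of the check runner (the check names and thresholds are illustrative):

```python
def run_checks(rows, checks):
    """Run each automated check on every row; collect failures for review."""
    failures = []
    for i, row in enumerate(rows):
        for name, check in checks.items():
            if not check(row):
                failures.append((i, name))
    return failures

# Example rule set: null detection, range validation, reasonableness
checks = {
    "required_id": lambda r: r.get("customer_id") not in (None, ""),
    "age_range":   lambda r: 0 <= r.get("age", -1) <= 150,
    "revenue_pos": lambda r: r.get("revenue", 0) > 0,
}
```

The failures list becomes the manual-review queue described below: analysts see only the rows that tripped a rule, not the whole dataset.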

Manual Review These

  • Business logic: "Customer name is 'aaa'"—technically valid text, obviously wrong in context
  • Fuzzy duplicates: "John Smith" vs "Jon Smith"—are they the same person? (requires judgment)
  • Outliers: Customer spent £100,000 in one transaction—fraud or VIP? (investigation needed)
  • Change detection: Customer's industry changed from "Tech" to "Healthcare"—update or error? (context matters)

Best Practice: Hybrid Approach

(1) Run automated checks on every data load (catches 95% of issues in seconds). (2) Alert data owners to exceptions. (3) Route exceptions to manual review queue. (4) Analyst spends 5 minutes on 50 rows, not 2 hours on 5,000 rows. This hybrid approach scales: 10 automated checks → 10,000 rows/second, then humans review edge cases.

Building a Data Quality Culture

Data quality is a team problem. Solve it together.

Make quality visible: Publish monthly metrics on a shared dashboard. Show % completeness, null counts, duplicate rates, SLA compliance. When metrics improve, celebrate publicly. This creates accountability and momentum.

Define clear ownership: Data engineering owns pipeline quality (automated checks, SLA compliance). Product owns source quality (correct form fields, no spam submissions). Analytics owns transformation quality (correct joins, aggregate formulas). When something breaks, everyone knows who to contact.

Set SLAs (Service Level Agreements): Data must be 95% complete. Ingestion latency < 4 hours. Incident response time < 30 minutes. When data quality drops below SLA, pause release and alert stakeholders. This prevents bad data from reaching reports.

Automate enforcement: If data quality check fails, the pipeline stops with a clear error message. No manual intervention, no workarounds—this quarantines bad data and forces fixes at the source.

Train continuously: New hires need to understand data quality standards. Quarterly workshops on common issues. Internal docs on "how we clean CRM data" and "what to do when quality alerts fire." Culture sticks when it's documented and enforced.

ETL Pipeline Validation Strategy

Validate at each stage. Problems caught early are cheap. Problems caught in production reports are expensive.

Extract Stage

Verify row counts match source, schema changes are detected, no unexpected nulls. If extract fails, hold and alert before transform/load begins.
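A sketch of an extract-stage gate under these rules (the error messages and column comparison are illustrative; real pipelines usually compare against a stored schema version):

```python
def validate_extract(source_count, extracted_rows, expected_columns):
    """Hold the load and alert if extract-stage invariants fail."""
    errors = []
    if len(extracted_rows) != source_count:
        errors.append(f"row count mismatch: {len(extracted_rows)} != {source_count}")
    if extracted_rows:
        cols = set(extracted_rows[0])
        if cols != set(expected_columns):
            errors.append("schema change detected")
    return errors  # non-empty => stop before transform/load
```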

Transform Stage

Check calculations (revenue = price × quantity), aggregations match source totals (CTEs must reconcile), joins don't create unexpected nulls. Test with realistic data volumes—small volumes hide problems.

Load Stage

Confirm data reaches destination, indexes are fresh, downstream queries work. Run sample queries against final table. If anything fails, quarantine load and alert immediately.

GDPR Compliance in Data Cleaning

Non-compliance can cost up to €20 million or 4% of global annual turnover, whichever is higher. Build compliance into your data quality process.

Data access logs: Log who accessed personal data, when, and why. Audit trail must be immutable. This proves compliance if regulators ask.

Right to be forgotten: When a customer asks for deletion, purge their PII within 30 days. Anonymize their historical records (keep aggregate stats, remove identifying info). Automate enforcement: if deletion request age > 30 days, alert data team.

Retention policies: Don't keep personal data longer than needed. Data older than 2 years? Archive or delete. Implement retention windows in your ETL: tag old records, purge on schedule.
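The retention split can be sketched as follows (the 2-year window matches the example above; `created_at` is an assumed field name):

```python
from datetime import datetime, timedelta

def partition_by_retention(rows, now, window_days=730):
    """Split records into keep vs purge by a retention window (default ~2 years)."""
    keep, purge = [], []
    for row in rows:
        if now - row["created_at"] > timedelta(days=window_days):
            purge.append(row)   # tagged for archive or deletion
        else:
            keep.append(row)
    return keep, purge
```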

Consent tracking: Don't analyze opted-out users. Tag records with consent status. In your cleaning pipeline, filter out non-consented users before analysis. This is automatic, not manual.
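The consent filter belongs at the start of the cleaning pipeline; a sketch (assuming a boolean `consent` tag on each record):

```python
def filter_consented(rows):
    """Drop opted-out or untagged users before any analysis runs."""
    return [r for r in rows if r.get("consent") is True]
```

Records with a missing consent tag are excluded too, which is the safer default.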

Ready to standardize your data quality?

Use our Data Quality Framework generator to create standardized checklists for your data sources. Reduce cleaning time from 80% to 20% of analytics time.

Use the Data Quality Framework →