Best Practices for Achieving Enterprise Data Quality in the 2020s

Data underpins digital transformation. Yet with data volumes exploding, varieties proliferating and velocity ratcheting up, getting a handle on trustworthy information remains an uphill climb. The costs of poor data quality keep rising, with Gartner estimating that organizations lose over $15 million annually because of flawed data.

But help has arrived! Data management and analytics teams now have access to versatile tools that automate the hitherto manual drudgery of cleansing, validating and enriching data. The market has seen furious innovation recently: established megavendors continue to bolster already robust product suites, while cloud-native disruptors apply machine learning to data quality challenges that have plagued business and technical users for decades.

This guide distills real-world guidance from customer journeys spanning early experimentation through proven, large-scale production deployments of data quality platforms. Let's get started!

Why Data Quality Solutions are Fundamental

Getting accurate information on anything from customer addresses to financial transactions has never been easy. But present times pose exacerbated strains:

  • Exploding data volume, velocity and complexity with Big Data make reliability a huge pain point. Social feeds, mobile apps, IoT sensors – new data types bring new chaos!
  • The trust and regulatory spotlight has never been brighter, with standards like GDPR. Manual methods cannot safeguard quality at this scale.
  • Maturing analytics, AI/ML and cloud journeys struggle with bad data. Your BI insights and ML predictions are only as good as the raw input!

Research by IDC shows that, on average, professionals waste over 45 days annually in non-productive time fixing bad data acquired from outside sources. And the problem cuts across business functions:

  • Sales and marketing teams struggle with obsolete, incomplete customer contact data, leading to missed cross-sell opportunities.
  • Finance contends with improper regulatory classifications and risk analysis assumptions undermined by false linkages between entities.
  • Supply chain and inventory management depend on harmonized, accurate supplier and logistics data. Outdated product catalog information directly causes lost revenue.

Forrester finds that 37% of businesses experience negative ROI from Big Data and analytics efforts directly attributable to data quality issues. The good news: advances in data quality tools over the last decade make it possible to finally get the problem under control.

Core Capabilities

Data quality solutions support systematically measuring, monitoring and improving the health of organizational information. Here are key capabilities:

Data Profiling – Scanning a wide spectrum of sources such as databases, files and enterprise apps to catalog metadata and build visibility into data integrity issues. Typical analyses include the following (a minimal sketch follows the list):

  • Volumetric analysis – Cardinality of fields, total record counts
  • Validity analysis – Constraints, relational integrity
  • Pattern analysis – Distinct values, data types and formats
  • Statistical analysis – Completeness, uniqueness and more
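
To make these checks concrete, here is a minimal profiling sketch in Python using pandas; the customers.csv file and its columns are hypothetical stand-ins, and commercial tools run equivalent scans directly against databases, files and enterprise apps.

```python
import pandas as pd

# Hypothetical customer extract; real tools connect to sources directly.
df = pd.read_csv("customers.csv")

profile = {}
for col in df.columns:
    series = df[col]
    profile[col] = {
        # Volumetric analysis: record counts and field cardinality
        "records": len(series),
        "distinct_values": series.nunique(dropna=True),
        # Statistical analysis: completeness and uniqueness
        "completeness_pct": round(100 * series.notna().mean(), 2),
        "uniqueness_pct": round(100 * series.nunique(dropna=True) / max(len(series), 1), 2),
        # Pattern analysis: inferred type plus a few sample values/formats
        "inferred_dtype": str(series.dtype),
        "sample_values": series.dropna().astype(str).head(3).tolist(),
    }

print(pd.DataFrame(profile).T)
```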

Validation and Standardization – Applying rules and machine learning models to catch lapses in accuracy, completeness and consistency, and enforcing uniformity for downstream consumption (see the sketch after the list below).

  • Value validation – Format, reasonableness
  • Referential integrity checks – Foreign keys, mandatory attributes
  • Merge/collapse standardization – Vendor name variants
  • Machine learning recommendations – Clustering duplicates
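
A rough rule-based illustration in Python follows; the fields, regular expression, reasonableness bounds and country mappings are illustrative assumptions rather than any vendor's built-in rules.

```python
import pandas as pd

# Hypothetical records in need of validation and standardization.
df = pd.DataFrame({
    "email": ["jane@acme.com", "not-an-email", None],
    "country": ["USA", "United States", "U.S."],
    "order_total": [120.50, -5.00, 9800.00],
})

# Value validation: format check plus a reasonableness range.
email_ok = df["email"].fillna("").str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
total_ok = df["order_total"].between(0, 100_000)

# Merge/collapse standardization: map name variants to one canonical value.
country_map = {"USA": "US", "United States": "US", "U.S.": "US"}
df["country_std"] = df["country"].map(country_map).fillna(df["country"])

# Flag rows that fail any rule so they can be quarantined or reviewed.
df["valid"] = email_ok & total_ok
print(df)
```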

Enrichment and Augmentation – Merging external data like demographic info to add value and context. Appending related data to complete existing records.
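
At its core, enrichment is a keyed join that appends external attributes to existing records; the sketch below uses hypothetical in-house customer rows and a purchased demographics extract keyed on postal code.

```python
import pandas as pd

# Internal records missing demographic context (hypothetical data).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "postal_code": ["10001", "94105", "60601"],
})

# External reference data from a third-party provider (also hypothetical).
demographics = pd.DataFrame({
    "postal_code": ["10001", "94105"],
    "median_income": [85000, 112000],
    "urban_density": ["high", "high"],
})

# Left join keeps every original record and appends related attributes.
enriched = customers.merge(demographics, on="postal_code", how="left")
print(enriched)
```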

Deduplication and Record Matching – Identifying multiple records that represent the same entity across systems via matching algorithms, and consolidating information accurately on customers, products and more (a sketch follows the list below).

  • Deterministic vs Probabilistic – Exact value vs approximate matching
  • Configurable similarity metrics – Jaro-Winkler, Levenshtein
  • Machine learning classification – Supervised models predicting matches
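
The sketch below captures the core idea of similarity-based matching with a hand-rolled Levenshtein distance; the record values and the 0.85 threshold are illustrative assumptions, and production tools layer on blocking, additional metrics such as Jaro-Winkler, and learned match models.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert, delete = current[j - 1] + 1, previous[j] + 1
            substitute = previous[j - 1] + (ca != cb)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance to a 0..1 similarity score."""
    a, b = a.lower().strip(), b.lower().strip()
    return 1 - levenshtein(a, b) / (max(len(a), len(b)) or 1)

# Probabilistic-style matching: flag likely duplicates above a threshold.
records = ["Jon Smith", "John Smith", "Jane Doe"]
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = similarity(records[i], records[j])
        verdict = "likely duplicate" if score >= 0.85 else "distinct"
        print(f"{records[i]} vs {records[j]}: {score:.2f} ({verdict})")
```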

Monitoring and Alerting – Tracking data quality KPIs like accuracy rate over time. Threshold-based notifications on metric changes.
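
As a hedged illustration of threshold-based alerting (the KPI names, values and notification hook are placeholders rather than a particular product's API):

```python
# Hypothetical daily KPIs produced by a profiling job.
daily_kpis = {
    "customer_email_completeness": 0.97,
    "order_amount_validity": 0.91,
    "duplicate_customer_rate": 0.06,
}

# Thresholds agreed with data stewards (assumed values).
# Completeness/validity are minimums; *_rate metrics are maximums.
thresholds = {
    "customer_email_completeness": 0.95,
    "order_amount_validity": 0.98,
    "duplicate_customer_rate": 0.02,
}

def notify(message: str) -> None:
    # Stand-in for an email, Slack or ticketing integration.
    print(f"ALERT: {message}")

for kpi, value in daily_kpis.items():
    limit = thresholds[kpi]
    breached = value > limit if kpi.endswith("_rate") else value < limit
    if breached:
        notify(f"{kpi} = {value:.2%} breached threshold of {limit:.2%}")
```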

Workflow – Designing repeatable QA routines with scheduling, messaging and historical visibility.
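
A repeatable QA routine is essentially the steps above chained together and run on a schedule; the sketch below wires hypothetical step functions into one run and appends the result to a history file, leaving actual scheduling to cron, an orchestrator or the vendor's own scheduler.

```python
import json
from datetime import datetime, timezone

def profile_sources() -> dict:
    # Placeholder: in practice this runs the profiling step shown earlier.
    return {"customer_email_completeness": 0.97}

def validate_and_standardize(metrics: dict) -> dict:
    # Placeholder: apply validation rules and report how many failed.
    return {"rules_failed": 2, **metrics}

def run_quality_workflow(history_path: str = "dq_history.jsonl") -> dict:
    """One scheduled run: profile, validate, then persist results for trend views."""
    result = validate_and_standardize(profile_sources())
    result["run_at"] = datetime.now(timezone.utc).isoformat()
    with open(history_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(result) + "\n")
    return result

if __name__ == "__main__":
    print(run_quality_workflow())
```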

Why Data Quality is Non-Negotiable for the 2020s

Sluggish data maturity directly drags down analytics ROI and causes competitive disadvantage. The costs cascade across the enterprise:

  • Customer analytics and sales predictions struggle when 20% of the underlying data is bad
  • Operational inefficiency with duplicate items, supplier records
  • Regulatory compliance risk with bad customer classification
  • Unreliable business insights, strategy decisions

But organizations leveraging modern data quality enjoy advantages:

  • Trusted data asset lineage provides analysis transparency
  • Consistent customer views enable superior engagement
  • Decisions carry analytical weight and objectivity
  • Automated issue notification prevents problems

With data serving as the catalyst for digital transformation and touching all parts of business, maintaining quality is no longer optional.

Specialized Data Quality Needs

While the capabilities discussed form the common foundation, additional niche functionality needs arise for certain emerging workloads:

Hybrid and Multicloud Architectures – Today's data landscape involves a heterogeneous mix of environments like AWS S3, Azure SQL, Databricks and Kafka streams. Data quality tools must handle diverse data infrastructure while also offering cloud delivery models themselves.

IoT, Geospatial and Time-Series Data – Telematics, connected vehicles and connected equipment bring exploding data complexity, along with location and temporal elements that require specialized treatment.

Sensitive Data Protection – As regulation mounts, expect more purpose-built privacy preservation, tokenization and masking capabilities when processing fields like healthcare records.

Unstructured Data Quality – Quality management applies not just to structured data from transaction systems but also to growing unstructured sources – documents, email, chats and multimedia – which require extended techniques like image tag analysis.

Master Data Management (MDM) – Consolidating data on core business entities like customer, product, asset centrally. Blending data quality and MDM delivers superior ROI.

Evaluating Solutions

The market has exploded with choices. I analyzed the top vendors across areas like:

Functionality Depth – Data discovery, validation, standardization, matching, enrichment and more

Ease of Use – Interface intuitiveness, complexity of building workflows/rules

Platform Support – Cloud, on-prem, hybrid data infrastructure fit

Scalability – Handling diverse data volumes across structured and unstructured data

Innovation Pace – Emerging capabilities via internal R&D, acquisitions

Ecosystem Integration – How seamlessly they partner with adjacent DBs, analytics and other data infrastructure

Cost – Licensing model (subscription vs perpetual license)

Here are leading options suitable for most needs:

Solution comparison at a glance (strengths, limitations and licensing):

Talend Data Quality
  • Strengths: Pioneer with strong DNA across profiling, cleansing, matching and monitoring. Trusted by leading brands.
  • Limitations: Steep learning curve, long time to value realization.
  • Licensing: Subscription starting at $1,500/month.

Informatica
  • Strengths: Mature offering excelling in collaborative data stewardship for large regulated deployments.
  • Limitations: Siloed cloud data support; needs tight integration with the broader stack.
  • Licensing: Starts at $100K based on use case complexity.

Ataccama ONE
  • Strengths: Innovative automated self-learning approach requiring minimal user oversight.
  • Limitations: Broad capabilities still maturing across all areas.
  • Licensing: Starts at $75K for a perpetual license.

Trifacta
  • Strengths: Democratizes data quality for business users with visually interactive data profiling.
  • Limitations: Mostly focused on data preparation needs.
  • Licensing: $4K per month base subscription.

IBM InfoSphere Information Server
  • Strengths: Strong information governance features tuned for enterprise security, scale and compliance.
  • Limitations: Functionality gaps remaining in cloud support and collaboration.
  • Licensing: Custom enterprise pricing, $100K+.

Melissa Data Quality Suite
  • Strengths: Unparalleled global address verification capability.
  • Limitations: Primarily address-focused, on-prem deployment model.
  • Licensing: Starts at $10K based on use case size.

Tamr
  • Strengths: Modern machine learning-based continuous data curation capability.
  • Limitations: Still new; long-term enterprise viability and support unknown.
  • Licensing: $100K+ per year subscription.

I also explored emerging solutions like Data.World, Census, Okera and SODA, which show promising innovation, leveraging Big Data technology while making fundamental advances in areas like data cataloging, blockchain infrastructure and distributed, query-based data quality.

Here are key highlights from my evaluation of leaders in the category:

Informatica Enterprise Data Quality

  • Background: 30+ years focused on data integration, quality and governance. 8,000+ customers. Acquired MDM specialist Siperian in 2010.
  • Key Differentiators: Hybrid cloud iPaaS foundation handling structured, unstructured data. Collaborative workflows between IT, business/data stewards. Reference data for address verification worldwide.
  • Recent Innovation: Infused CLAIRE metadata AI engine for smart recommendations improving data consistency, quality over time. Industry-specific packaged solutions.
  • Analyst POV: Top rankings in the Gartner Magic Quadrant (ability to execute and completeness of vision) and the Forrester Wave.
  • Sample Customers: American Honda Motors, Cardinal Health, Coca-Cola European Partners

"We expanded revenue through better data-driven customer engagement while lowering compliance risk" – Sr. Enterprise Architect, Top 5 US Bank

Talend Data Quality

  • Background: Founded in 2005, went public in 2016. 2,000 customers including leaders like AstraZeneca, Orange and Allianz.
  • Key Differentiators: Unified web interface delivering self-service workflow design capability for IT and Business Users. Integrated data health dashboard. Broad built-in data functions library.
  • Recent Innovation: Trust Score delivering single quantifiable metric on overall data quality. Stitching customer data from Snowflake, Databricks into workflows.
  • Analyst POV: Fastest growing data quality solution according to MarketsAndMarkets. Among leaders in latest Forrester Wave.
  • Sample Customers: Daimler, Ulta Beauty, Flipkart

"Talend dealt with our fragmented customer data quality while meeting strict compliance needs" – Lead Enterprise Architect, Top 10 Pharma Company

Ataccama ONE

  • Background: Founded 2013 from merger of data quality pure-plays. Raised $150M funding. 500+ enterprise customers globally.
  • Key Differentiators: Automated self-learning capability minimizes manual oversight. Business metadata catalog tracks KPIs like data freshness, completeness. Adapters for ~30 major data repository sources.
  • Recent Innovation: Automated Root Cause Analysis. Active Anchor modeling for enterprise ML data alignment. Ataccama Labs focusing on augmented data quality.
  • Analyst POV: Named a leader by Gartner, Forrester and G2, citing high ROI driven by its innovation pace coupled with an intuitive end-user experience.
  • Sample Customers: BBVA, Tribal Group, ING

"Ataccama immediately identified and fixed 60% of data quality issues through unsupervised ML" – Customer Data Executive, Top 10 US Retailer

Real World Implementation Success Strategies

Getting game-changing value out of data quality tools requires going beyond the basics. Through my consulting engagements, I've helped clients across industries develop modern data health capabilities. Here are proven guidelines:

Phase 1: Start With High Priority Domains – Profiling customer data delivers faster wins in acquisition and retention than tackling all data simultaneously. Quick, demonstrable benefits build momentum for long-term ROI.

Phase 2: Scale Out With MDM as Key Driver – Master Data Management for customers, products and other business entities gives structure for managing quality holistically after initial pilots. MDM and DQ are natural complements.

Phase 3: Expand Metrics Monitoring – Operationalize KPI tracking for usage, issue identification and accuracy. Dashboards that support drill-down into reliability over time, by domain, enable continuous improvement through analysis.
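
One lightweight way to operationalize this step (illustrative only; the domains, metrics and values are assumptions) is to retain per-domain KPI snapshots that dashboards can aggregate and drill into over time:

```python
import pandas as pd

# Hypothetical history of weekly data quality snapshots per domain.
history = pd.DataFrame({
    "week":         ["2024-01-01", "2024-01-01", "2024-01-08", "2024-01-08"],
    "domain":       ["customer",   "product",    "customer",   "product"],
    "completeness": [0.92,          0.88,         0.95,         0.90],
    "accuracy":     [0.90,          0.93,         0.94,         0.93],
})

# Trend by domain over time: the view a drill-down dashboard exposes.
trend = history.pivot_table(index="week", columns="domain",
                            values=["completeness", "accuracy"])
print(trend)

# Week-over-week change highlights where quality is improving or regressing.
print(trend.diff().dropna())
```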

Phase 4: Embrace Machine Learning – Leverage growing tool functionality that applies smart algorithms to cluster duplicates and classify errors automatically, without endless manual rule tuning.

The Bottom Line

The costs of coping with poor data quality remain staggeringly high while businesses demand trustworthy analytics fueling everything from customer personalization to IoT initiatives. Modern tools make it possible to systematically measure, manage and improve data reliability.

Leading solutions like Talend, Informatica and Tamr make fundamentally new capabilities accessible to data stewards through intuitive interfaces. Established platforms from IBM, SAP and Oracle offer robust breadth hardened by decades of enterprise deployment.

Prioritizing the data quality agenda is fundamental to unlocking value from analytics investments in the decade ahead. Initiatives like AI and cloud data migration cannot deliver upside without high grade data. The technology now exists to meet quality demands at scale for any industry. The time for action is now!