Saturday, June 17, 2017

DataWorks Summit 2017 - Netflix: Scaling Data Quality

Netflix processes over 700 billion events per day, with 300 TB of data warehouse writes and 5 PB of data warehouse reads each day. At that scale, when data quality degrades it can escalate into a major problem very quickly.


Data quality problems tend to originate in upstream systems, either when the volume changes dramatically or when values shift out from underneath the system. It is more important to find out "when" there is a problem than "why" there is a problem. To support this, Netflix developers have created a few tools:

  • Metacat
    • Federated Metastore, like an extended HCatalog
    • Contains statistics about the data on the partition
      • Missing data
      • Life Cycle
      • Audience using the data
      • Sunset date
      • Sunrise date
  • Quint
    • DQ Service
    • Defines metrics using a SQL-like syntax
    • Evaluation rules include normal distribution compared over X partitions
  • WAP (Write, Audit, Publish)
    • This is an ETL pattern that leverages the idea of separating schema from data.
      • Write the additional data to a new partition file
      • Audit table definition is created to match the original table + the new partition
        • Validate the data using the Quint rules
      • Publish the partition by updating the production table definition to include the new partition
This pattern is not easy to put into place, but library components help simplify it by applying common patterns based on the audience and life cycle of the data.
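The three WAP steps above can be sketched with a minimal in-memory model, where a "table" is a dict of partition names to rows. All function and partition names here are illustrative assumptions, not Netflix's actual APIs; the audit rule is a trivial row-count check standing in for Quint's richer validations.

```python
# Hypothetical sketch of the Write-Audit-Publish (WAP) pattern.
# A "table" is modeled as {partition_name: rows}; names are illustrative.

def write(staging, partition, rows):
    """Write: land new data in a staging area, not the production table."""
    staging[partition] = rows

def audit(staging, partition, min_rows=1):
    """Audit: validate the staged partition before it becomes visible.
    Here a simple row-count rule stands in for Quint's checks."""
    return len(staging.get(partition, [])) >= min_rows

def publish(prod, staging, partition):
    """Publish: point the production table definition at the audited partition."""
    prod[partition] = staging.pop(partition)

prod, staging = {}, {}
write(staging, "dt=2017-06-17", [{"event": "play"}, {"event": "stop"}])
if audit(staging, "dt=2017-06-17"):
    publish(prod, staging, "dt=2017-06-17")

# The partition is only visible in prod after it passes the audit.
assert "dt=2017-06-17" in prod and "dt=2017-06-17" not in staging
```

The key design point is that readers of the production table never see unaudited data: a failed audit leaves the staged partition invisible, and publishing is just a metadata update.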

Lessons learned
  • Query based validation may be enough
  • Not all tables require quality coverage
  • One size does not fit all: use multiple small components so that a customization doesn't require building a completely separate process.
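Query-based validation can indeed be simple: the evaluation rule mentioned under Quint (comparing a metric against a normal distribution over the last X partitions) reduces to a z-score check. The window size and sigma threshold below are assumptions for illustration, not Quint's actual defaults.

```python
# Illustrative sketch of a Quint-style evaluation rule: flag a new partition
# whose metric (e.g., row count) deviates by more than n_sigmas standard
# deviations from the last x_partitions. Parameters are assumed, not Quint's.
from statistics import mean, stdev

def within_normal_range(history, new_value, x_partitions=7, n_sigmas=3):
    window = history[-x_partitions:]
    mu, sigma = mean(window), stdev(window)
    if sigma == 0:  # all historical values identical
        return new_value == mu
    return abs(new_value - mu) <= n_sigmas * sigma

# Row counts for the last seven daily partitions:
history = [100, 104, 98, 101, 99, 103, 100]
within_normal_range(history, 102)  # typical day -> True
within_normal_range(history, 5)    # sudden volume drop -> False
```

A check like this catches the "volume changes dramatically" failure mode noted earlier without needing to know why the upstream system changed.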
