Saturday, June 17, 2017

DataWorks Summit 2017 - Netflix: Scaling Data Quality

Netflix processes over 700 billion events per day, with 300 TB of data warehouse writes and 5 PB of data warehouse reads each day. When data quality starts to degrade, it can escalate into a major problem very quickly.


Data quality problems tend to come from upstream systems, when either the volume changes dramatically or values shift out from underneath the system. It is more important to find out "when" there is a problem than "why" there is a problem. To support this, developers have created a few tools:

  • Metacat
    • Federated Metastore, like an extended HCatalog
    • Contains statistics about the data on the partition
      • Missing data
      • Life Cycle
      • Audience using the data
      • Sunset date
      • Sunrise date
  • Quint
    • Data quality (DQ) service
    • Defines metrics using a SQL-like syntax
    • Evaluation rules include checking a metric against its normal distribution over the previous X partitions
  • WAP (Write, Audit, Publish)
    • This is an ETL pattern that leverages the idea of separating schema from data.
      • Write the additional data to a new partition file
      • Audit table definition is created to match the original table + the new partition
        • Validate the data using the Quint rules
      • Publish the partition by updating the production table definition to include the new partition
This pattern is not easy to put into place, but library components help simplify it by applying common patterns based on the audience and life cycle of the data.
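
The Write, Audit, Publish steps and the normal-distribution rule can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not Netflix's implementation: `Table`, `within_normal_range`, `write_audit_publish`, the row-count metric, the 7-partition lookback, and the 3-sigma threshold are all hypothetical names and values chosen for the example, and the real Quint rule syntax is not shown in the talk notes.

```python
from statistics import mean, stdev


def within_normal_range(history, new_value, max_sigma=3.0):
    """Return True if new_value is within max_sigma standard deviations
    of the mean of the same metric over earlier partitions."""
    if len(history) < 2:
        # Assumption: with too little history there is nothing to compare to.
        return True
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return new_value == mu
    return abs(new_value - mu) / sigma <= max_sigma


class Table:
    """Toy stand-in for a table whose definition is separate from its data."""

    def __init__(self):
        self.storage = {}    # partition key -> rows (data that has been written)
        self.published = []  # partition keys the production definition exposes

    def write(self, key, rows):
        # Write: the data lands in storage but readers cannot see it yet.
        self.storage[key] = rows

    def audit(self, key, metric, lookback=7, max_sigma=3.0):
        # Audit: compare the new partition's metric with recent published ones.
        history = [metric(self.storage[k]) for k in self.published[-lookback:]]
        return within_normal_range(history, metric(self.storage[key]), max_sigma)

    def publish(self, key):
        # Publish: update the production definition to include the partition.
        self.published.append(key)

    def read(self):
        # Readers only ever see published partitions.
        return [row for k in self.published for row in self.storage[k]]


def write_audit_publish(table, key, rows, metric=len):
    """Run the three steps; return False if the audit rejects the partition."""
    table.write(key, rows)
    if not table.audit(key, metric):
        return False
    table.publish(key)
    return True
```

In this sketch a partition whose row count falls far outside the recent range is written but never published, so downstream readers keep seeing only the last known-good data while someone investigates.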

Lessons learned
  • Query based validation may be enough
  • Not all tables require quality coverage
  • One size does not fit all; use multiple small components so that customizations don't require a completely separate process.
