Saturday, June 17, 2017

DataWorks Summit 2017 - Netflix: Scaling Data Quality

Netflix processes over 700 billion events per day, with 300 TB of data warehouse writes and 5 PB of data warehouse reads each day. At that scale, when data quality degrades it can escalate into a major problem very quickly.


Data quality problems tend to originate in upstream systems, either when the volume changes dramatically or when values shift out from underneath the system. It is more important to find out "when" there is a problem than "why" there is a problem. To support this, Netflix developers have created a few tools:

  • Metacat
    • Federated Metastore, like an extended HCatalog
    • Contains statistics about the data on the partition
      • Missing data
      • Life Cycle
      • Audience using the data
      • Sunset date
      • Sunrise date
  • Quint
    • DQ Service
    • Defines metrics using a SQL-like syntax
    • Evaluation rules include normal distribution compared over X partitions
  • WAP (Write, Audit, Publish)
    • This is an ETL pattern that leverages the idea of separating schema from data.
      • Write the additional data to a new partition file
      • Audit table definition is created to match the original table + the new partition
        • Validate the data using the Quint rules
      • Publish the partition by updating the production table definition to include the new partition
This pattern is not easy to put into place, but library components help simplify it by applying common patterns based on the audience and life cycle of the data.
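The three WAP steps above can be sketched with a minimal in-memory model, where a "table" is a dict of partition names to rows. All function and partition names here are illustrative assumptions, not Netflix's actual APIs; the audit rule is a trivial row-count check standing in for Quint's richer validations.

```python
# Hypothetical sketch of the Write-Audit-Publish (WAP) pattern.
# A "table" is modeled as {partition_name: rows}; names are illustrative.

def write(staging, partition, rows):
    """Write: land new data in a staging area, not the production table."""
    staging[partition] = rows

def audit(staging, partition, min_rows=1):
    """Audit: validate the staged partition before it becomes visible.
    Here a simple row-count rule stands in for Quint's checks."""
    return len(staging.get(partition, [])) >= min_rows

def publish(prod, staging, partition):
    """Publish: point the production table definition at the audited partition."""
    prod[partition] = staging.pop(partition)

prod, staging = {}, {}
write(staging, "dt=2017-06-17", [{"event": "play"}, {"event": "stop"}])
if audit(staging, "dt=2017-06-17"):
    publish(prod, staging, "dt=2017-06-17")

# The partition is only visible in prod after it passes the audit.
assert "dt=2017-06-17" in prod and "dt=2017-06-17" not in staging
```

The key design point is that readers of the production table never see unaudited data: a failed audit leaves the staged partition invisible, and publishing is just a metadata update.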

Lessons learned
  • Query based validation may be enough
  • Not all tables require quality coverage
  • One size does not fit all: use multiple small components so that a customization doesn't require building a completely separate process.
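Query-based validation can indeed be simple: the evaluation rule mentioned under Quint (comparing a metric against a normal distribution over the last X partitions) reduces to a z-score check. The window size and sigma threshold below are assumptions for illustration, not Quint's actual defaults.

```python
# Illustrative sketch of a Quint-style evaluation rule: flag a new partition
# whose metric (e.g., row count) deviates by more than n_sigmas standard
# deviations from the last x_partitions. Parameters are assumed, not Quint's.
from statistics import mean, stdev

def within_normal_range(history, new_value, x_partitions=7, n_sigmas=3):
    window = history[-x_partitions:]
    mu, sigma = mean(window), stdev(window)
    if sigma == 0:  # all historical values identical
        return new_value == mu
    return abs(new_value - mu) <= n_sigmas * sigma

# Row counts for the last seven daily partitions:
history = [100, 104, 98, 101, 99, 103, 100]
within_normal_range(history, 102)  # typical day -> True
within_normal_range(history, 5)    # sudden volume drop -> False
```

A check like this catches the "volume changes dramatically" failure mode noted earlier without needing to know why the upstream system changed.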
