Data quality problems usually originate in upstream systems, either when the data volume changes dramatically or when values shift out from underneath the consuming system. It is more important to detect *when* there is a problem than to explain *why*. To support this, developers have created a few tools:
- Federated Metastore, like an extended HCatalog
  - Contains statistics about the data in each partition
    - Missing data
  - Life cycle
    - Sunrise date
    - Sunset date
  - Audience using the data
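As a rough illustration, a federated-metastore entry for a single partition might carry fields like the ones above. This is a minimal sketch; the class and field names are assumptions, not the actual metastore schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class PartitionMetadata:
    """Hypothetical federated-metastore record for one table partition."""
    table: str
    partition: str
    row_count: int                                   # basic partition statistic
    null_counts: dict = field(default_factory=dict)  # per-column missing data
    audience: str = "internal"                       # who consumes the data
    sunrise: Optional[date] = None                   # when the data goes live
    sunset: Optional[date] = None                    # when it can be retired

meta = PartitionMetadata(
    table="events",
    partition="dt=2015-06-01",
    row_count=1_250_000,
    null_counts={"user_id": 42},
    sunrise=date(2015, 6, 1),
    sunset=date(2016, 6, 1),
)
```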
- DQ Service
  - Defines metrics using a SQL-like syntax
  - Evaluation rules include comparing a metric against a normal distribution over the last X partitions
- WAP (Write, Audit, Publish)
  - An ETL pattern that leverages the idea of separating schema from data
  - Write the new data to a new partition file
  - Create an audit table definition that matches the original table plus the new partition
  - Validate the data using the Quint rules
  - Publish the partition by updating the production table definition to include the new partition
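The WAP steps above can be walked through with a toy in-memory model. Real implementations swap Hive/HCatalog table definitions; here a "table" is just a dict from partition name to rows, and every name is illustrative.

```python
# Production "table": partition name -> rows already published.
prod_table = {"dt=2015-05-31": [{"id": 1}, {"id": 2}]}

# Write: land the new data somewhere production does not yet reference.
new_name, new_rows = "dt=2015-06-01", [{"id": 3}, {"id": 4}]

# Audit: a throwaway table definition = production plus the new partition.
audit_table = dict(prod_table)
audit_table[new_name] = new_rows

def validate(table):
    """Stand-in for the real quality rules: no partition may be empty."""
    return all(len(rows) > 0 for rows in table.values())

# Publish: purely a metadata change -- production now includes the partition.
if validate(audit_table):
    prod_table[new_name] = new_rows

print(sorted(prod_table))  # ['dt=2015-05-31', 'dt=2015-06-01']
```

Because the publish step only touches the table definition, a failed audit leaves production untouched and readers never see the bad partition.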
This pattern is not easy to put into place, but library components help simplify it by applying common patterns based on the audience and life cycle of the data:
- Query-based validation may be enough
- Not all tables require quality coverage
- One size does not fit all: build from multiple small components so that a customization does not force the whole process to stand on its own
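For low-risk tables, query-based validation really can be this small: a single count over the new partition. A minimal sketch using SQLite as a stand-in warehouse; the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (dt TEXT, user_id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("2015-06-01", 1), ("2015-06-01", 2)])

# Query-based validation: did the partition land with any rows at all?
(count,) = conn.execute(
    "SELECT COUNT(*) FROM events WHERE dt = ?", ("2015-06-01",)
).fetchone()
assert count > 0, "partition dt=2015-06-01 landed empty"
print(count)  # 2
```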