Saturday, June 17, 2017

DataWorks Summit 2017 - Netflix: Scaling Data Quality

Netflix processes over 700 billion events per day, with 300 TB of data warehouse writes and 5 PB of data warehouse reads each day. When quality starts to go bad, it can escalate into a major problem very quickly.

Data quality problems tend to come from upstream systems, when either the volume changes dramatically or values shift out from underneath the system. It is more important to find out "when" there is a problem than "why" there is a problem. To support this, Netflix's developers have created a few tools:

  • Metacat
    • Federated Metastore, like an extended HCatalog
    • Contains statistics about the data on the partition
      • Missing data
      • Life Cycle
      • Audience using the data
      • Sunset date
      • Sunrise date
  • Quint
    • DQ Service
    • Defines metrics using a SQL-like syntax
    • Evaluation rules include normal distribution compared over X partitions
  • WAP (Write, Audit, Publish)
    • This is an ETL pattern that leverages the idea of separating schema from data.
      • Write the additional data to a new partition file
      • Audit table definition is created to match the original table + the new partition
        • Validate the data using the Quint rules
      • Publish the partition by updating the production table definition to include the new partition
This pattern is not easy to put into place, but library components help simplify it by applying common patterns based on the audience and life cycle of the data.
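The WAP steps above can be sketched as the Hive-style SQL each stage would issue. This is a minimal illustration, not Netflix's actual tooling; the table names (`plays`, `plays_staging`), the `incoming_batch` source, and the partition spec are all hypothetical.

```python
# Sketch of the Write-Audit-Publish (WAP) pattern as Hive-style SQL strings.
# All table, view, and partition names here are hypothetical.

def wap_statements(table, staging_table, partition):
    """Return the SQL text for each WAP step against a Hive-like engine."""
    part = ", ".join(f"{k}='{v}'" for k, v in partition.items())
    return {
        # Write: land the new data in a staging location, not production.
        "write": f"INSERT OVERWRITE TABLE {staging_table} "
                 f"PARTITION ({part}) SELECT * FROM incoming_batch",
        # Audit: a definition over production plus the candidate partition
        # lets quality rules (e.g. Quint metrics) run before anyone sees it.
        "audit": f"CREATE VIEW {table}_audit AS "
                 f"SELECT * FROM {table} UNION ALL "
                 f"SELECT * FROM {staging_table}",
        # Publish: swap the audited partition into the production table,
        # which is a metadata-only operation in Hive.
        "publish": f"ALTER TABLE {table} EXCHANGE PARTITION ({part}) "
                   f"WITH TABLE {staging_table}",
    }

stmts = wap_statements("plays", "plays_staging", {"ds": "2017-06-17"})
for step in ("write", "audit", "publish"):
    print(step, "->", stmts[step])
```

The point of the pattern is that the production table definition never references the new data until the audit passes, so bad data is never visible to consumers.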

Lessons learned
  • Query based validation may be enough
  • Not all tables require quality coverage
  • One size does not fit all; use multiple small components so that customizations don't force a process to be rebuilt completely on its own.
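As an illustration of the "query based validation may be enough" point, and of the Quint-style rule that compares a metric against a normal distribution over the last X partitions, here is a simplified sketch. The z-score threshold and the row counts are invented for the example.

```python
import statistics

def partition_looks_normal(history, new_value, z_threshold=3.0):
    """Flag a new partition's metric (e.g. row count) if it falls more than
    z_threshold standard deviations from the mean of recent partitions.
    A simplified stand-in for a Quint-style distribution rule."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return new_value == mean
    return abs(new_value - mean) / stdev <= z_threshold

# Row counts for the last 7 partitions, then two candidate counts for today.
recent = [980_000, 1_010_000, 995_000, 1_002_000, 990_000, 1_005_000, 998_000]
print(partition_looks_normal(recent, 1_001_000))  # → True (in line with history)
print(partition_looks_normal(recent, 120_000))    # → False (dramatic drop, flag it)
```

A check this small, run as a scheduled query per partition, is often all the coverage a table needs.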

Thursday, June 15, 2017

DataWorks Summit 2017 - Verizon Finance Data Lake

Verizon has stood up a finance data lake to be a shared Enterprise Data Repository and enable self-service data discovery. This combines data from multiple SAP systems into a single, easy to access data repository. Benefits included:

  • Simplifies access to ERP data
  • Reduces data replication
  • Enables sharing of data (eliminating point-to-point integration)
  • Drives the data archiving strategy
  • Rationalizes data sources
  • Provides a central set of business rules
  • Provides common reporting and analytic tools
  • Provides a unified location to apply data governance and security
In the architecture, data flows up from the source systems into the data lake (~1,000 tables). The data movement is a combination of direct file imports, batch ingestion with Sqoop and incremental ingestion. The incremental ingestion was originally done with CDC and partition swaps, but was eventually replaced with Attunity Replicate to simplify maintenance and speed up processing.
Governance and security are applied on the way in: sensitive data is either tokenized or not allowed into the cluster. The data is processed through the ETL layer, consisting mostly of Hive processes. The logical model is the semantic layer provided by the reporting tools.
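The original CDC-plus-partition-swap approach amounts to reconciling a base snapshot with a stream of change records, then swapping the merged result in as the new partition. A minimal sketch of that merge step, with invented keys and column names:

```python
# Simplified CDC merge: reconcile a base snapshot with ordered change records
# before swapping the result in as the new partition. Keys and columns are
# hypothetical; the real pipeline used Hive/Sqoop and later Attunity Replicate.

def apply_cdc(base_rows, changes):
    """base_rows: {key: row}; changes: list of (op, key, row) in arrival order."""
    merged = dict(base_rows)
    for op, key, row in changes:
        if op in ("insert", "update"):
            merged[key] = row          # upsert: the change wins over the snapshot
        elif op == "delete":
            merged.pop(key, None)      # tolerate deletes of keys we never saw
    return merged

base = {1: {"amount": 100}, 2: {"amount": 250}}
changes = [("update", 1, {"amount": 110}),
           ("insert", 3, {"amount": 75}),
           ("delete", 2, None)]
print(apply_cdc(base, changes))  # → {1: {'amount': 110}, 3: {'amount': 75}}
```

Tools like Attunity Replicate do this reconciliation continuously, which is what removed the maintenance burden of the hand-rolled swap jobs.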

There are areas to improve.

  • Data Validations are executed with manual scripts after the fact.
  • Data Governance is limited and needs review as more users are given access to the data
  • There is no managed data catalog at this time
  • Very few curated and maintained data sets
  • Self-service data preparation tools are being reviewed
  • Capabilities are limited to a small number of uses

Wednesday, June 14, 2017

DataWorks Summit 2017 - Day 1 Keynote

I was able to get to my first Hadoop conference this year; so far it's been really good. Below are my notes from the sessions I was able to attend today.

The sessions I went to today include:

  • Summit Keynote
  • Verizon: Finance Data Lake
  • Whoops, the numbers are wrong! Scaling Data Quality @ Netflix
  • Dancing Elephants - Efficiently working with object stores from Apache Spark and Hive
  • Governance Bots - Metadata driven compliance
  • Cloudy with a chance of Hadoop - real world considerations
  • Hadoop Journey at Walgreens
  • LLAP: Building Cloud-first BI

Summit Keynote

The keynote was considerably flashier than I was expecting; the first 10 minutes or so were a lot of fog and laser light show stuff. Not really my thing, and I wish they hadn't done that.

IBM Announcement

They did announce a very interesting partnership with IBM. IBM's Big Insights was based on open source Hadoop, with IBM maintaining their own distribution. Now IBM is going to leverage the Hortonworks Data Platform (HDP) with Big SQL and Data Science Experience (DSX) layered on top of it.

Hortonworks has stated that Big SQL is their recommendation for a distributed data warehouse, but that doesn't seem to change their message about Hive: Hive is still their platform for analytical SQL processing. The thing Big SQL brings to the table is its "Fluid Query" federation across multiple data sources that live in Hadoop (Hive, HBase, Spark and object storage through HDFS) and conventional RDBMSs (Oracle, SQL Server, Teradata, DB2, PDA / Netezza, etc.). Big SQL is a query optimizer and compiler that gathers data across all of those systems. I've never used it, but the notion is interesting.

As for DSX, I have only seen a little bit of it. At first glance it appears to stitch together some open source machine learning (ML) platforms (and SPSS) into a coherent set of tools focused on collaboration. There are video tutorials to help team members with varying experience learn to use the tools through DSX. They have a website for machine learning use cases by industry, but there isn't much up there yet. Their visualization engine, called Brunel, has been open-sourced and is available on GitHub. It can be used with Python, R and Spark, but I'm not clear how it stacks up against its competitors. They also focused on how simple it is to integrate the generated models into existing applications. The models can be tested, scored and monitored, and can generate alerts based on defined conditions.


There was also a round-table of folks using Hortonworks in different industries, including:

  • Duke Energy
  • Black Knight (Fidelity Mortgages)
  • Healthcare Services

Duke Energy

They came to the conclusion that they need to be a digital company that sells energy, instead of an energy company using some technology. This is a completely different approach from what they did before and required a major mental shift for everyone. There was a complete shift from commercial to open source software wherever the option exists. This model has been an advantage for attracting and retaining top talent. Above all else, they recommend making sure that governance is considered early in the journey; otherwise it will lead to problems.

They have many sensors (>4 million) feeding into their system, which would have overwhelmed conventional data warehouse appliances, so most data was getting lost. They are able to identify customer issues faster than ever and are using customer data to fuel their warehouse, applying machine learning to help with customer calls. An interesting thing they are working on is dynamically looking at the customer calling in, identifying anticipated needs and offering options faster than ever.

Black Knight (Fidelity Mortgage)

Black Knight uses Hadoop to improve their understanding of the customer. Similar to Duke, the cultural shift was hard, but it also attracted (and retained) talent that otherwise would not have joined the company. They are using the data they gather to predict customer behavior and areas of interest. They want to reach the point where a customer can get their mortgage in a few minutes instead of the time-intensive process it has been in the past.

Healthcare Services

Healthcare Services is two years into their data journey. This industry is even more heavily regulated and tends more toward data silos than most. To break out of the silos, they are switching to open source software wherever possible to fuel their ongoing machine learning efforts. For example, zip codes are highly correlated with lifespan, which can help identify populations that need help and preventive care. By providing clinics and focusing attention on those areas, those numbers can be improved.

Microsoft using YARN

Microsoft is switching to using YARN on their COSMOS clusters. These are very large clusters that need to support >50k nodes (each!) distributed throughout the globe. Previously COSMOS used their own resource negotiator, which processed Dryad DAGs (talk about a blast from the past!). Microsoft has been steadily contributing changes to YARN so that it can swap out the homegrown solution for an industry standard approach. They are a major project contributor and have dedicated MS Research members working on improving cluster size and creating predictable and preemptable processing allocations. This effectively allows them to create higher priority jobs that will always execute when needed, while other jobs run when resources free up again. They are achieving the scalability by adding a federation layer and having each YARN instance scale to 10k nodes. Fascinating stuff; I love keeping an eye on the MS Research site.