Saturday, June 17, 2017

DataWorks Summit 2017 - Netflix: Scaling Data Quality

Netflix processes over 700 billion events / day, with 300 TB data warehouse writes each day and 5 PB of data warehouse reads each day. When quality starts to go bad, it can escalate into a major problem very quickly.

Data quality problems tend to come from upstream systems, when either the volume changes dramatically or values shift out from underneath the system. It is more important to find out "when" there is a problem than "why" there is a problem. To support this, developers have created a few tools:

  • Metacat
    • Federated Metastore, like an extended HCatalog
    • Contains statistics about the data on the partition
      • Missing data
      • Life Cycle
      • Audience using the data
      • Sunset date
      • Sunrise date
  • Quint
    • DQ Service
    • Defines metrics using a sql-like syntax
    • Evaluation rules include normal distribution compared over X partitions
  • WAP (Write, Audit, Publish)
    • This is an ETL pattern that leverages the idea of separating schema from data.
      • Write the additional data to a new partition file
      • Audit table definition is created to match the original table + the new partition
        • Validate the data using the Quint rules
      • Publish the partition by updating the production table definition to include the new partition
This pattern is not easy to put into place, but library components help simplify it by applying common patterns based on the audience and life cycle of the data.
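
To make the Write, Audit, Publish steps concrete for myself, here is a minimal Python sketch. It is not Netflix's code and does not use Quint's sql-like syntax: the in-memory `storage` dict, the `Table` class, the row-count metric, the 3-sigma threshold and the 7-partition lookback are all my own assumptions, standing in for "normal distribution compared over X partitions".

```python
from statistics import mean, stdev


def within_normal_range(history, candidate, max_sigma=3.0):
    """True when candidate is within max_sigma standard deviations of the history mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return candidate == mu
    return abs(candidate - mu) <= max_sigma * sigma


class Table:
    """Hypothetical table definition: only a list of partition names (schema separate from data)."""

    def __init__(self, partitions):
        self.partitions = list(partitions)


def write_audit_publish(storage, production, name, rows, lookback=7):
    """Return True if the new partition was published, False if the audit rejected it."""
    # Write: store the new partition without touching the production definition.
    storage[name] = rows

    # Audit: a table definition that matches production plus the new partition.
    audit = Table(production.partitions + [name])
    history = [len(storage[p]) for p in production.partitions[-lookback:]]
    # With fewer than two earlier partitions there is nothing to compare against,
    # so this sketch lets the partition through unchecked.
    if len(history) >= 2 and not within_normal_range(history, len(rows)):
        return False  # the data stays written but is never visible in production

    # Publish: point the production definition at the audited set of partitions.
    production.partitions = audit.partitions
    return True
```

The point of the pattern shows up in the rejected case: the bad partition exists in storage, but nobody querying the production table ever sees it.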

Lessons learned
  • Query based validation may be enough
  • Not all tables require quality coverage
  • One size does not fit all. Use multiple small components so that customizations don't force a process to be built completely on its own.

Thursday, June 15, 2017

DataWorks Summit 2017 - Verizon Finance Data Lake

Verizon has stood up a finance data lake to be a shared Enterprise Data Repository and enable self-service data discovery. This combines data from multiple SAP systems into a single, easy to access data repository. Benefits included:

  • Simplifies access to ERP data
  • Reduces data replication
  • Enables sharing of data (eliminating point to point integration)
  • Drive data archiving strategy
  • Rationalize data sources
  • Central set of business rules
  • Common reporting and analytic tools
  • Unified location to apply Data Governance and Security
The Architecture has data flow up from the source systems and into the data lake (~1,000 tables). The data movement is a combination of direct file imports, batch ingestion with Sqoop and incremental ingestion. The incremental ingestion was done by CDC and partition swaps originally but was eventually replaced with Attunity Replicate to simplify the maintenance and speed up processing.
Governance and security are applied: sensitive data is either tokenized or not allowed into the cluster. The data is processed through the ETL layer, consisting mostly of Hive processes. The Logical model is the semantic layer provided by the reporting tools.

There are areas to improve.

  • Data Validations are executed with manual scripts after the fact.
  • Data Governance is limited and needs review as more users are given access to the data
  • There is no managed data catalog at this time
  • Very few curated and maintained data sets
  • Self-service data preparation tools are being reviewed
  • Capabilities are limited to a small number of uses

Wednesday, June 14, 2017

DataWorks Summit 2017 - Day 1 Keynote

I was able to get to my first Hadoop conference this year, and so far it's been really good. Below are my notes from the sessions I was able to attend today.

The sessions I went to today include:

  • Summit Keynote
  • Verizon: Finance Data Lake
  • Whoops, the numbers are wrong! Scaling Data Quality @ Netflix
  • Dancing Elephants - Efficiently working with object stores from Apache Spark and Hive
  • Governance Bots - Metadata driven compliance
  • Cloudy with a chance of Hadoop - real world considerations
  • Hadoop Journey at Walgreens
  • LLAP: Building Cloud-first BI

Summit Keynote

The keynote was considerably flashier than I was expecting; the first 10 minutes or so was a lot of fog and laser light show stuff. Not really my thing and I wish they didn't do that.

IBM Announcement

They did announce a very interesting partnership with IBM. IBM's BigInsights was based on open source Hadoop, with IBM maintaining their own distribution. Now IBM is going to leverage Hortonworks Data Platform (HDP) with Big SQL and Data Science Experience (DSX) layered on top of it.

Hortonworks has stated that for a distributed Data Warehouse Big SQL is their recommendation, but it doesn't seem to change their message about Hive. Hive is still their platform for Analytical SQL processing. The thing that Big SQL brings to the table is "Fluid Query" (Federation) across multiple data sources that live in Hadoop (Hive, HBase, Spark and object stores through HDFS) and conventional RDBMS (Oracle, SQL Server, Teradata, DB2, PDA / Netezza, etc.). Big SQL is a query optimizer and compiler to gather data across all of those systems. I've never used it, but the notion is interesting.

As for DSX, I have only seen a little bit of it. At first glance it appears that it stitches together some open source Machine Learning (ML) platforms (and SPSS) into a set of coherent tools that focus upon collaboration. There are video tutorials to help team members with varying experience in the tools learn about using them through DSX. They have a website for machine learning use cases by industry, but there isn't much up there yet. Their visualization engine has been open-sourced; it's called Brunel and is available on GitHub. It can be used with Python, R and Spark, but I'm not clear how it stacks up to its competitors. They also focused on how simple it is to integrate the generated models into existing applications. The models can be tested, scored and monitored, and can generate alerts based on defined conditions.


There was also a round-table of folks using Hortonworks in different industries, including:

  • Duke Energy
  • Black Knight (Fidelity Mortgages)
  • Healthcare Services

Duke Energy

They came to the conclusion that they need to be a digital company that sells energy, instead of an energy company using some technology. This is a completely different approach from what they did before and required a major mental shift for everyone. There was a complete shift from commercial to open source software whenever the option exists. This model has been an advantage for attracting and retaining top talent. Above all else, they recommend making sure that Governance be considered early in the journey, otherwise it will lead to problems.

They have many sensors (>4 million) feeding into their system, which would have overwhelmed conventional data warehouse appliances, so most data was getting lost. They are now able to identify customer issues faster than ever, and are using customer data to fuel their warehouse and applying machine learning to help with customer calls. An interesting thing they are working on is dynamically looking at the customer calling in, identifying anticipated needs and offering the options to them faster than ever.

Black Knight (Fidelity Mortgage)

Black Knight uses Hadoop to improve their understanding of the customer. Similar to Duke, the cultural shift was hard but it also attracted (and retained) talent that otherwise would not have joined the company. They are using the data they gather to predict customer behavior and areas of interest. They want to reach the point where a customer can get their mortgage in a few minutes instead of the time-intensive process it has been in the past.

Healthcare Services

Healthcare Services are two years into their data journey. This industry is even more heavily regulated and tends towards data silos more than most. To break out of the silos wherever possible they are switching to open source software to fuel their ongoing machine learning efforts. For example, zip codes are highly correlated to lifespan, which can help identify populations that need help and preventive care. By working to provide clinics and focus attention on those areas, those numbers can be improved.

Microsoft using YARN

Microsoft is switching to using YARN on their COSMOS clusters. These are very large clusters that need to support > 50k nodes (each!) distributed throughout the globe. Previously they used their own resource negotiator that processed Dryad DAGs (talk about a blast from the past!). Microsoft has been steadily contributing changes to YARN so that they can swap out their homegrown solution for an industry standard approach. They are a major project contributor and have dedicated MS Research members to work on improving cluster size and creating predictable and preemptable processing allocations. This effectively allows them to create higher priority jobs that will always execute when needed and allow other jobs to run when resources free up again. They are achieving the scalability by adding a federation layer and having each YARN instance scale to 10k nodes. Fascinating stuff, I love keeping an eye on the MS Research site.
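
To get a feel for the numbers and for what "preemptable" means, here is a toy Python sketch. It is not YARN's scheduler or federation code; the `(priority, nodes, name)` job tuples, the `admit` function and the "evict the lowest priority first" rule are my own simplifying assumptions.

```python
import math


def subclusters_needed(total_nodes, nodes_per_instance=10_000):
    """How many YARN instances a federation layer would need for a cluster this size."""
    return math.ceil(total_nodes / nodes_per_instance)


def admit(running, job, capacity):
    """Try to start job, preempting lower-priority work if needed.

    running: list of (priority, nodes, name) tuples; a higher priority wins.
    Returns (new_running, preempted). If the job cannot fit even after
    preempting every lower-priority job, nothing is changed.
    """
    priority, nodes, _ = job
    free = capacity - sum(n for _, n, _ in running)
    survivors = sorted(running, key=lambda j: j[0])  # lowest priority first
    preempted = []
    while free < nodes and survivors and survivors[0][0] < priority:
        victim = survivors.pop(0)
        preempted.append(victim)
        free += victim[1]
    if free < nodes:
        return running, []
    return survivors + [job], preempted
```

With the figures from the talk, `subclusters_needed(50_000)` comes out to 5 instances of 10k nodes each. The preempted jobs would simply be queued again and run once resources free up.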

Monday, March 30, 2015

EDW 2015: Data Governance Organization

Pharmaceutical model
Janet Lichtenburger and Nancy Makos presented a terrific view of how Data Governance works at their company. They implement functional (small) data governance, with the exception of data sensitivity.

Here is my brain dump of the session today. You're welcome to it if you can make heads or tails of it. :-)

Some numbers that show the challenge of Data Governance there are:
  • 5 people in Enterprise Architecture
  • 3 people in Data Governance
  • 130 people provide Data Stewardship support
  • 350k total employees

Data Governance is about Policies and Standards and is typically independent of implementations, as part of the Enterprise Architecture or Finance groups.

To encourage adoption, Data Governance could be considered an internal consulting service to support projects that is not charged back.

Organizational Model
Overall Model
  1. Enterprise Data Governance Executive Committee
    1. Meets only a few times a year
    2. Limited number of senior executives
  2. Data Governance Committees
    1. Each committee is chaired by Data Governance
    2. Divided into specific domains
    3. Meets as often as required by projects
    4. Up to 15 people on each Domain Specific Committee.

Data Steward Roles
  • Executive Stewards
    • Member of the Data Governance Executive Committee
    • Strategic Direction
    • Authority
  • Enterprise Stewards
    • Member of the Data Governance Committee
    • Development Support of Data Governance Policies
  • Operational Stewards
    • Communicates to promote policies
    • Endorses Data Standards
  • Domain Stewards
    • Recommends canonical structures
    • Endorses Data Standards

However the model is constructed, it must make sense for the business. All new policies should be considered from a Cost / Benefit analysis; the exception is regulatory requirements. Regulatory and Legal compliance are critical to avoid jail.

The goal of policies is to drive behavior changes needed for Enterprise Information Management to succeed.

People generally don't like change. A way to get buy-in is to amend existing processes instead of creating new ones. The amendments are smaller and less intrusive, and the stakeholders have already been identified. This also leads to more partnerships and shared endorsements of the changes.

There are typically a small number of extremely high value areas, policy should focus upon those.

Align with the Enterprise goals and other Enterprise ranging groups, there are a lot of shared concerns and ways that the teams can support each other.

Keep the policies easily accessible. Do not hide them in a 500 page volume; instead keep them somewhere easily discoverable, such as a wiki on the corporate intranet.

Policies that are overly broad and not enforceable can quickly cause legal / compliance problems. In those cases, no policy is a better choice than an unenforceable one.

Data Classification
The source and type of data both define the data classification required. Similarly, data from several, more open sources can be combined to escalate the protection required.

  • Restricted
    • Financial information, such as credit cards
  • Protected
    • Regulated information, such as HIPAA data
  • Private
    • Named Persons
  • Internal
    • Business Data
  • Public
    • Everything Else

The data classification levels are combined with Information Security levels for systems to identify where the data is able to be transmitted.
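
Here is how I picture the classification rules as code, as a rough Python sketch. The numeric ordering of the levels (I am assuming the list above runs from most to least protected), the `escalate` flag and the idea that a system's security level can be compared directly against a data level are my assumptions, not something the presenters showed.

```python
from enum import IntEnum


class Classification(IntEnum):
    """Levels from the session; the numeric ordering is an assumption."""
    PUBLIC = 0
    INTERNAL = 1
    PRIVATE = 2
    PROTECTED = 3
    RESTRICTED = 4


def combined_classification(sources, escalate=False):
    """Classification for a data set built from one or more sources.

    The result is at least as strict as the strictest source. Set escalate=True
    when combining the sources reveals more than any one of them does alone;
    the level is then raised by one step, capped at RESTRICTED.
    """
    level = max(sources)
    if escalate and level < Classification.RESTRICTED:
        level = Classification(level + 1)
    return level


def may_transmit(data_level, system_level):
    """Data may only move to a system cleared for at least its classification."""
    return system_level >= data_level
```

For example, combining a PUBLIC and an INTERNAL source with `escalate=True` gives PRIVATE, and `may_transmit` would then refuse to send it to a system that is only cleared for INTERNAL data.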

Anonymizing Data
Safe Harbor
Remove 18 identifying attributes from the data, which renders it fairly useless.

Expert Determination
An expert certifies that the probability of re-identifying an individual from the available data is very low.

There are tools out there, such as Parit(?) that are capable of doing this automatically after a survey and analysis of the data.

Standards Examples

USPS Publication 28 specifies international address requirements
ISO 15836 standards for tagging unstructured documents
ISO/IEC 11179-4 and -5 have naming standards for business metadata

Sunday, March 29, 2015

EDW 2015: Data Strategy

This session was presented by Lewis Broome (from Data Blueprint) and Brian Cassel (from the Massey Cancer Center).

Lewis presented his strategic model and roadmap using the case study of a logistics company implementing an ambitious project.

Brian spoke about the challenges he faced within the cancer center as he implemented analytics across data hidden in silos. This was primarily culture based but once he was past that he was able to use the existing Data Analytics hub to build a specialized data mart to support strategic review of the data.

This was really great stuff and my summary doesn't do it justice by a long shot.
Data Strategy
Business Needs
In order to get anywhere with discussions about data and ways to improve it throughout the organization, the value of the effort has to be made clear. Clean data may seem like the most obvious need in the world, but that view is too low level to make it on to the radar of senior management. Instead, it needs to clearly address a business need.

There are three aspects to consider
  • How will it mesh with the company Mission and Brand Promises?
    • Ex. FedEx: Your package will get there overnight. Guaranteed.
  • Does it improve the company's market position / provide a competitive advantage?
    • Michael Porter's Market Positioning Framework and his Competitive Advantage Framework provide a good way to think about this.
  • Will it improve the operating model and support the company's objectives?
    • Operating models improve by changing the degree of business integration or standardization.

If the data changes do not address any of these areas, it will not gain the support needed to succeed.

New capabilities that do not meet a business need aren't a program, they are a science project.

Current State of the Business
The current state assessment looks at
  • Existing Assets and Capabilities
  • Gaps in Assets and Capabilities
  • Constraints and Interdependencies
    • This can be the toughest stuff to identify.
    • BEWARE SHADOW SYSTEMS: typically Excel spreadsheets with macros or Access databases fixing data before feeding it into the next step of a process.
  • Cultural Readiness

Cultural Readiness
Cultural Readiness depends on 5 different areas
  • Vision
    • A clear message of what the program is expected to achieve
  • Skills
    • Ensure that the right people are part of the program
  • Incentive
    • The value and importance of the program should be clear to all of the participants
  • Resources
    • Backing the program will require more than just good will; tools, environments and training may all be required
  • Action Plan
    • The system boundaries being developed should be clearly defined

Capability Maturity Model Levels
  1. Starting point
    1. There is some data in a pile over there.
  2. Repeatable Process
    1. This is how we sweep the data into a pile and remove the bits of junk we find.
  3. Defined Process
    1. Sweep from left to right, avoid the dead bugs. Leave data in a pile.
  4. Managed Process
    1. The entire team has the same brooms and the dead bugs are highlighted and automatically avoided by the brooms.
  5. Optimizing
    1. Maybe we can add rules to avoid sweeping twigs into the pile as well.

The Roadmap establishes the path of the Data Management Program to achieve the strategic goals.

Leadership and Planning
    • Planning and Business Strategy Alignment
    • Program Management
    • Clearly Defined Imperatives, Tactics and KPI
    • Accountable to CDO
Project Development
    • Outcome Based Targets
    • Business Case and Project Scope
    • Program Execution

Project Model
Big Projects tend to fail at least twice, sometimes more than that, as the business learns what it really needs.

Always start with crawling and walking before going to running.
  • Governance should start with a small 'g', where it matters most. There are commonly 5-10 critical data elements, take care of those before setting targets higher.
  • Data Strategy as a top-down approach works best. Otherwise it is uncoordinated and is only capable of supporting tactical initiatives.
  • Data Architecture must focus on the business needs, not individual systems or applications.

Saturday, March 28, 2015

Questions and where to seek answers

There always seems to be an unending stream of questions.

In the past, a good number of them were about things like "How can we use this shiny new feature?" "What are the best practices for these scenarios?".

However, times change and so do the questions.

Microsoft has been following the lead of the SQL Server community and has steadily become more open with road maps and dialogs. We understand the platform and internals better than ever before, making it easier to address concerns.

Now the questions that need to be addressed are more strategic than tactical. With strategies, by their very nature, it is more challenging to find peers to discuss successes and (more importantly) failures in plans and their implementations.

I'm attending Enterprise Data World this year to have a chance to discuss ideas, strategies and technologies to bring new insights and make things run better than ever.

This will be fun!

Saturday, November 10, 2012

PASS Summit Day 3

The last day of PASS is pretty exhausting. My head was still spinning with ideas from the previous days and the tracks were a bit less balanced. To let the ideas settle in I investigated some different areas.


Microsoft is really getting into Hadoop in a big way. I am glad that they have partnered with Hortonworks (the original Yahoo Hadoop team) for the Windows implementation. Microsoft is making a serious attempt to be a good open source citizen and is actively contributing back to the Apache project. Their Hadoop product now has a name, HDInsight, and some interesting characteristics, such as:

  • Available via NuGet
  • Web Management console
  • Map-Reduce can be written in C# and JavaScript
  • ODBC Driver to connect to Hive
    • Normal .Net
    • LINQ 
    • Excel (with a friendly plug-in!)

There are two versions announced

  • Azure Edition
    • This version actually substitutes the Azure BlobStore for HDFS. Presumably this was done because they share many characteristics but the BlobStore is already tuned for the MS Datacenters and has presumably avoided the NameNode issue somehow.
    • A popular model is to gather data (such as web or application logs) into the blobstore and on a regular schedule spin up a large HDInsight cluster, run the analysis and then spin it back down. This technique is used by Halo 4.
  • On-Premises Edition
    • This sounds as though it is targeted at preconfigured Hyper-V Virtual Machines (I may be way off here).
    • The filesystem is HDFS, not NTFS or a version of the BlobStore
    • Enterprise infrastructure support coming
      • Microsoft Virtual Machine Manager
      • System Center HDInsight Management Pack
      • Active Directory for unified authentication

Additional features (maybe)

  • Scheduled
    • Oozie Workflow Scheduler
    • Mahout
  • UnScheduled (but popular requests)
    • SSIS Integration
    • SQL Server Management Studio


The Azure data integration story continues to improve. The model to move data is either using SSIS or Data Sync. However, the SSIS development best practices are a little bit different. To transfer data efficiently there are a combination of things to do:

  1. Compress your data being sent over the network
  2. Parallelism is good: adding threads will reduce the transfer time
  3. Robustness must be maintained by the package. A "blip" in the connection cannot force the entire operation to restart. That is a long and painful process; the transfer should be able to pick up where the blip happened. This is achieved by using a loop (with sleep) to keep retrying and sending small sets into staging tables and then pushing loaded sets from staging into final tables.
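
The retry loop from point 3 looks roughly like this when sketched in Python rather than SSIS. The `transfer_in_batches` function, the `send_batch` callback (standing in for "load this small set into a staging table"), and the batch size, retry count and sleep interval are all hypothetical values of mine.

```python
import time


def transfer_in_batches(rows, send_batch, batch_size=1000,
                        max_retries=5, sleep_seconds=1.0, sleep=time.sleep):
    """Send rows in small sets, retrying a failed set instead of restarting.

    send_batch(batch) is expected to load one small set into staging and to
    raise ConnectionError on a network "blip". Returns the number of rows sent.
    """
    sent = 0
    while sent < len(rows):
        batch = rows[sent:sent + batch_size]
        for attempt in range(1, max_retries + 1):
            try:
                send_batch(batch)
                break  # this set is safely in staging
            except ConnectionError:
                if attempt == max_retries:
                    raise  # give up, but everything before `sent` is already loaded
                sleep(sleep_seconds)  # wait out the blip, then retry the same set
        sent += len(batch)
    return sent
```

Because progress is tracked per set, a blip only costs one small batch. Pushing the loaded sets from staging into the final tables would be a separate step after this loop, and is not shown here.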


Encryption is becoming more of an issue as PII regulations become more prominent. Some things to keep in mind when planning security

  • Always salt the hash and encryption. 
  • Be aware of CPU scaling concerns
  • Encryption makes everything bigger
  • Only 4000 characters at a time work with hashing and encryption functions built into SQL Server.
  • Encrypted and hashed values will not compress well
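
Two of these points are easy to demonstrate outside SQL Server. The Python sketch below is only an illustration: the `salted_hash` and `compression_ratio` helpers are mine, SHA-256 with a 16-byte random salt is an assumed choice, and zlib stands in for whatever compression the database or backup uses.

```python
import hashlib
import os
import zlib


def salted_hash(value, salt=None):
    """Return (salt, digest); a fresh random salt is generated when none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.sha256(salt + value.encode("utf-8")).digest()
    return salt, digest


def compression_ratio(data):
    """Compressed size divided by original size (smaller means it compresses better)."""
    return len(zlib.compress(data)) / len(data)


# The same value hashed with two different salts gives unrelated digests, so
# identical values can no longer be spotted or looked up in a precomputed table.
salt_a, digest_a = salted_hash("555-12-3456")
salt_b, digest_b = salted_hash("555-12-3456")

# Repetitive plaintext compresses very well; a column of hashes barely shrinks at all.
plain = b"555-12-3456" * 1000
hashed = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(1000))
print(compression_ratio(plain), compression_ratio(hashed))
```

The salt has to be stored next to the hash so the value can be verified later. For passwords specifically, a key derivation function such as `hashlib.pbkdf2_hmac` is generally preferred over a single salted hash.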

What types of encryption should be considered?

  • Field Level
  • Data at rest (on disk or on backup)
  • Connection Level
  • SAN level

Field Level

At the field level this can get complex because it is either managed in the DB or the Application. Each has tradeoffs

  • The DB is hard to scale out by adding nodes if it is responsible for all encryption. 
  • The DB could manage to invisibly decrypt columns based on account permissions
  • The application would have to wrap all sensitive data with encrypt / decrypt logic if it was responsible for doing it.
  • The application can scale out to additional servers if needed to distribute the CPU load.

Data at Rest

When the data is at rest the database can encrypt it with minimal overhead using Transparent Data Encryption (TDE). This does mean that data is not encrypted in memory, so an attacker could view data sitting in memory through an exploit. TDE does make some changes silently. Instant File Initialization becomes disabled as soon as TDE is initiated. The shared database TEMPDB also must become encrypted.

Backup encryption is critical because there is no telling what will happen when tapes leave the security of a private datacenter and go touring through the outside world.

Connection Level

There are two ways to make sure your connection is encrypted. A simple one to set up is having Active Directory enact a policy requiring IPSec for all connections to the DB Server. The other one is to use certificates to establish the classic SSL connection encryption. This can be controlled at the connection string level so you can configure it for only a subset of the traffic and reduce the CPU load.


SAN Level Encryption (MPIO) is fascinating because it will stop the LUN from being restored to a machine without the proper certificates. This tends to require a lot of CPU resources, but it is possible to offload that encryption work to special hardware (such as the Emulex HBA) that can be integrated into the RSA Key Manager.


I did discover something a bit disappointing about DQS today. There are currently no matching services available. That means that there are no internal web services for identifying matches, nor are there any SSIS operations available. Matching work can ONLY be done through the client GUI. Without automation, DQS will have a real challenge working with some types of data.