Unimpeded By Sanity

Saturday, June 17, 2017

DataWorks Summit 2017 - Netflix: Scaling Data Quality

Netflix processes over 700 billion events / day, with 300 TB data warehouse writes each day and 5 PB of data warehouse reads each day. When quality starts to go bad, it can escalate into a major problem very quickly.

Data quality problems tend to be from upstream systems when either the volume changes dramatically or values shift out from underneath the system. It is more important to find out "when" there is a problem, than "why" there is a problem. To support this, developers have created a few tools:

Metacat

Federated Metastore, like an extended HCatalog
Contains statistics about the data on the partition

Missing data
Life Cycle
Audience using the data
Sunset date
Sunrise date

Quint

DQ Service
Defines metrics using a sql-like syntax
Evaluation rules include normal distribution compared over X partitions

WAP (Write, Audit, Publish)

This is an ETL pattern that leverages the idea of separating schema from data.

Write the additional data to a new partition file
Audit table definition is created to match the original table + the new partition

Validate the data using the Quint rules

Publish the partition by updating the production table definition to include the new partition

This pattern is not easy to put into place, but library components help simplify it it by applying common patterns based on the audience and life cycle of the data.

Lessons learned

Query based validation may be enough
Not all tables require quality coverage
One size does not fit all, use multiple small components so that customizations don't require the process to be completely on its own.

Thursday, June 15, 2017

DataWorks Summit 2017 - Verizon Finance Data Lake

Verizon has stood up a finance data lake to be a shared Enterprise Data Repository and enable self-service data discovery. This combines data from multiple SAP systems into a single, easy to access data repository. Benefits included:

Simplifies access to ERP data
Reduces data replication
Enables sharing of data (eliminating point to point integration)
Drive data archiving strategy
Rationalize data sources
Central set of business rules
Common reporting and analytic tools
Unified location to apply Data Governance and Security

The Architecture has data flow up from the source systems and into the data lake (~1,000 tables). The data movement is a combination of direct file imports, batch ingestion with Sqoop and incremental ingestion. The incremental ingestion was done by CDC and partition swaps originally but was eventually replaced with Attunity Replicate to simplify the maintenance and speed up processing.

Governance and security is applied, the sensitive data is either tokenized or not allowed into the cluster. The data is processed through the ETL layer, consisting mostly of Hive processes. The Logical model is the semantic layer provided by the reporting tools.

There are areas to improve.

Data Validations are executed with manual scripts after the fact.
Data Governance is limited and needs review as more users are given access to the data
There is no managed data catalog at this time
Very few curated and maintained data sets
Self-service data preparation tools are being reviewed
Capabilities are limited to small number of uses

Wednesday, June 14, 2017

DataWorks Summit 2017 - Day 1 Keynote

I was able to get to my first Hadoop conference this year, so far it's been really good. Below are my notes from the sessions I was able to attend today.

The sessions I went to today include:

Summit Keynote
Verizon: Finance Data Lake
Whoops, the numbers are wrong! Scaling Data Quality @ Netflix
Dancing Elephants - Efficiently working with object stores from Apache Spark and Hive
Governance Bots - Metadata driven compliance
Cloudy with a chance of Hadoop - real world considerations
Hadoop Journey at Walgreens
LLAP: Building Cloud-first BI

Summit Keynote

The keynote was considerably flashier than I was expecting, the first 10 minutes or so was a lot of fog and laser light shows stuff. Not really my thing and I wish they didn't do that.

IBM Announcement

They did announce a very interesting partnership with IBM. IBM's Big Insights was based on open source Hadoop, maintaining their own distribution.Now IBM is going to leverage Hortonworks Data Platform (HDP) with Big SQL and Data Science Experience (DSX) layered on top of it.

Hortonworks has stated that for a distributed Data Warehouse Big SQL is their recommendation, but it doesn't seem to change their message about Hive. Hive still is their platform for Analytical SQL processing. The thing that Big SQL brings to the table is their "Fluid Query" (Federation) across multiple data sources that live in Hadoop (Hive, HBase, Spark and Object stored through HDFS) and conventional RDBMS (Oracle, SQL Server, Teradata, DB2, PDA / Netezza, etc.). Big SQL is a query optimizer and compiler to gather data across all of those systems. I've never used it, but the notion is interesting.

As for DSX, I have only seen a little bit of it. At first glance it appears that it stitches together some opensource Machine Learning (ML) platforms (and SPSS) into a set of coherent tools that focus upon collaboration. There are video tutorials to help team members with varying experience in the tools learn about using them through DSX. They have a website for machine learning use cases by industry at www.ibm-ml-hub.com, but there isn't much up there yet. Their visualization engine has been open-sourced, it's called brunel and is available on github. It can be used with Python, R and Spark, but I'm not clear how it stacks up to its competitors. They also focused on how simple it is to integrate the models generated into existing applications. The models can be tested, scored, monitored and generate alerts based on defined conditions.

Round-table

There was also a round-table of folks using Hortonworks in different industries, including:

Duke Energy
Black Night (Fidelity Mortgages)
Healthcare Services

Duke Energy

They came to the conclusion that they need to be a digital company that sells energy, instead of an energy company using some technology. This is a completely different approach from what they did before and required a major mental shift for everyone. There was a complete shift from commercial to open source software whenever the option exists. This model has been an advantage for attracting and retaining top talent. Above all else, they recommend making sure that Governance be considered early in the journey, otherwise it will lead to problems.

They have many sensors (>4 million) feeding into their system which would have conventional data warehouse appliances, so most data was getting lost. They are able to identify customer issues faster than ever and are using customer data to fuel their warehouse and applying machine learning to help with customer calls. An interesting thing they are working on is dynamically looking at the customer calling in, identify anticipated needs and offer the options to them faster than ever.

Black Night (Fidelity Mortgage)

Black Night uses Hadoop to improve their understanding of the customer. Similar to Duke, the cultural shift was hard but it also attracted (and retained) talent that otherwise would not of joined the company. They are using the data they gather to predict customer behavior and areas of interest. They want to reach the point where a customer can get their mortgage in a few minutes instead of the time intensive process it has been in the past.

Healthcare Services

Healthcare Services are two years into their data journey. This industry is even more heavily regulated and tends to towards data silos than most. To break out of the silos wherever possible they are switching to opensource software to fuel their ongoing machine learning efforts. For example, zip codes are highly correlated to lifespan, this can help identify populations that need help and preventive care. By working to provide clinics and focus attention on those areas those numbers can be improved.

Microsoft using YARN

Microsoft is switching to using YARN on their COSMOS clusters. These are very large clusters that need to support > 50k nodes (each!) distributed throughout the globe. Previously it used their own resource negotiator that processed Dryad DAGs (talk about a blast from the past!). Microsoft has been steadily contributing changes to YARN so that it can swap out their homegrown solution for an indsutry standard approach. They are a major project contributor and have dedicated MS Research members to work on improving cluster size and creating predictable and preemptable processing allocations. This effectively allows them to create higher priority jobs that will always execute when needed and allow other jobs to run when resources free up again. They are achieving the scalability by adding a federation layer and having each YARN instance scale to 10k nodes. Fascinating stuff, I love keeping an eye on the MS Research site.

Monday, March 30, 2015

EDW 2015: Data Governance Organization

Pharmaceutical model

Janet Lichtenburger and Nancy Makos presented a terrific view of how Data Governance works at their company. They implement functional (small) data governance, with the exception of data sensitivity.

Here is my brain dump of the session today. You're welcome to it if you can make heads or tails of it. :-)

Some numbers that show the challenge of Data Governance there are:

5 people in Enterprise Architecture
3 people in Data Governance
130 people provide Data Stewardship support
350k total employees

Data Governance is about Policies and Standards and is typically independent of implementations, as part of the Enterprise Architecture or Finance groups.

To encourage adoption, Data Governance could be considered an internal consulting service to support projects ,that is not charged back.

Organizational Model

Overall Model

Enterprise Data Governance Executive Committee

Meets only a few times a year
Limited number of senior executives

Data Governance Committees

Each committee is chaired by Data Goverenance
Divided into specific domains
Meets as often as required by projects
Up to 15 people on each Domain Specific Committee.

Data Steward Roles

Executive Stewards

Member of the Data Governance Executive Committee
Strategic Direction
Authority

Enterprise Stewards

Member of the Data Governance Committee
Development Support of Data Governance Policies

Operational Stewards

Communicates to promote policies
Endorses Data Standards

Domain Stewards

Recommends canonical structures
Endorses Data Standards

However the model is constructed, it must make sense for the business. All new policies should be considered from a Cost / Benefit analysis, the exception is with regulatory requirements. Regulatory and Legal compliance are critical to avoid jail.

Policies

The goal of policies is to drive behavior changes needed for Enterprise Information Management to succeed.

People generally don't like change, a way to get buy-in is to amend existing processes instead of creating new ones. These are smaller, less intrusive and stakeholders have already been identified. This also leads to more partnerships and shared endorsements of the changes.

There are typically a small number of extremely high value areas, policy should focus upon those.

Align with the Enterprise goals and other Enterprise ranging groups, there are a lot of shared concerns and ways that the teams can support each other.

Keep the policies easily accessible, do not hide them in a 500 page volume, instead keep them somewhere easily discoverable, such as a wiki on the corporate intranet.

Policies that are overly broad and not enforceable can quickly cause legal / compliance problems. In those cases, no policy is a better choice than an unenforceable one.

Data Classification

The source and type of data both define the data classification required. Similarly, data from several, more open sources can be combined to escalate the protection required.

Restricted

Financial information, such as credit cards

Protected

Regulated information, such as HIPPA data

Private

Named Persons

Internal

Business Data

Public

Everything Else

The data classification levels are combined with Information Security levels for systems to identify where the data is able to be transmitted.

Anonymizing Data

Safe Harbor

Remove 18 identifying attributes from the data, which renders it fairly useless.

Expert Determination

Expert certifies that the data available is too low of a probability to re-identify an individual.

Automation

There are tools out there, such as Parit(?) that are capable of doing this automatically after a survey and analysis of the data.

Standards Examples

USPS Publication 28 specifies international address requirements

ISO15836 standards for tagging unstructured documents

ISO/IEC 11179 -4 and -5 has naming standards for business metadata

Sunday, March 29, 2015

EDW 2015: Data Strategy

This session was presented by Lewis Broome (from Data Blueprint) and Brian Cassel (from the Massey Cancer Center).

Lewis presented a his strategic model and roadmap using the case study of a logistics company implementing an ambitious project.

Brian spoke about the challenges he faced within the cancer center as he implemented analytics across data hidden in silos. This was primarily culture based but once he was past that he was able to use the existing Data Analytics hub to build a specialized data mart to support strategic review of the data.

This was really great stuff and my summary doesn't do it justice by a long shot.

Data Strategy

Business Needs

In order to get anywhere with discussions about data and mays to improve it throughout the organization, the value of the effort has to be made clear. Clean data may seem like the most obvious need in the world, but that view is too low level to make it on to the radar of senior management. Instead, it needs to clearly address a business need.

There are three aspects to consider

How will mesh with the company Mission and Brand Promises?

Ex. FedEx: Your package will get there overnight. Guaranteed.

Does it improve the company's market position / provide a competitive advantage?

Michael Porter's Market Positioning Framework and his Competitive Advantage Framework provide a good way to think about this.

Will it improve the operating model and support the company's objectives?

Operating models improve by changing the degree of business integration or standardization.

If the data changes do not address any of these areas, it will not gain the support needed to succeed.

New capabilities that do not meet a business need aren't a program, they are a science project.

Current State of the Business

The current state assessment looks at

Existing Assets and Capabilities
Gaps in Assets and Capabilities
Constraints and Interdependencies

This can be the toughest stuff to identify.
BEWARE SHADOW SYSTEMS typically excel spreadsheets with macros or access data fixing before feeding it into the next step of a process.

Cultural Readiness

Cultural Readiness

Cultural Readiness depends on 5 different areas

Vision

A clear message of what the program is expected to achieve

Skills

Ensure that the right people are part of the program

Incentive

The value and importance of the program should be clear to all of the participants

Resources

Backing the program will require more than just good will, tools, environments and training may all be required

Action Plan

The system boundaries being developed should be clearly defined

Capability Maturity Model Levels

Starting point

There is some data in a pile over there.

Repeatable Process

This is how we sweep the data into a pile and remove the bits of junk we find.

Defined Process

Sweep from left to right, avoid the dead bugs. Leave data in a pile.

Managed Process

The entire team has the same brooms and the dead bugs are highlighted and automatically avoided by the brooms.

Optimizing

Maybe we can add rules to avoid sweeping twigs into the pile as well.

Roadmap

The Roadmap establishes the path of the Data Management Program to achieve the strategic goals.

Leadership and Planning	Planning and Business Strategy Alignment Program Management Clearly Defined Imperatives, Tactics and KPI Accountable to CDO
Project Development	Outcome Based Targets Business Case and Project Scope Program Execution

Project Model

Big Projects tend to fail, at least twice sometimes more than that as the business learns what it really needs.

Always start with crawling and walking before going to running.

Governance should start with a small 'g', where it matters most. There are commonly 5-10 critical data elements, take care of those before setting targets higher.
Data Strategy as top-down approach works best. Otherwise it is uncoordinated and is only capable of supporting tactical initiatives.
Data Architecture must focus on the business needs, not individual systems or applications.

Saturday, March 28, 2015

Questions and where to seek answers

There always seems to be an unending stream of questions.

In the past, a good number of them were about things like "How can we use this shiny new feature?" "What are the best practices for these scenarios?".

However, times change and so do the questions.

Microsoft has been following the lead of the SQL Server community and steadily become more open with road maps and dialogs. We understand the platform and internals better than ever before, making it easier to address concerns.

Now the questions that need to be addressed are more strategic than tactical. Strategies, by their very nature are more challenging to find peers to discuss success and (more importantly) failures in plans and their implementations.

I'm attending Enterprise Data World this year to have a chance to discuss ideas, strategies and technologies to bring new insights and make things run better than ever.

This will be fun!

Saturday, November 10, 2012

PASS Summit Day 3

The last day of PASS is pretty exhausting. My head was still spinning with ideas from the previous days and the tracks were a bit less balanced. To let the ideas settle in I investigated some different areas.

Hadoop

Microsoft is really getting into Hadoop in a big way. I am glad that they have partnered with Hortonworks (the orginal Yahoo Hadoop team) for the Windows implementation. Microsoft is making a serious attempt to be a good opensource citizen and is actively contributing back to the Apache project. Their Hadoop product now has a name, HDInsight and some interesting characteristics, such as:

Available via NuGet
Web Management console
Map-Reduce can be written in C# and JavaScript
ODBC Driver to connect to Hive

Normal .Net
LINQ
Excel (with a friendly plug-in!)

There are two versions announced

Azure Edition

This version actually substitutes the Azure BlobStore for HDFS. Presumably this was done because they share many characteristics but the BlobStore is already tuned for the MS Datacenters and has presumably avoided the NameNode issue somehow.
A popular model is to gather data (such as web or application logs) into the blobstore and on a regular schedule spin up a large HDInsight cluster, run the analysis and then spin it back down. This technique is used by Halo 4.

On-Premises Edition

This sounds as though it is targeted at preconfigured Hyper-V Virtual Machines (I may be way off here).
The filesystem is HDFS, not NTFS or version of the BlobStore
Enterprise infratructure support coming

Microsoft Virtual Machine Manager
System Center HDInsight Management Pack
Active Directory for unified authentication

Additional features (maybe)

Scheduled

Oozi Worksflow Scheduler
Mahout

UnScheduled (but popular requests)

SSIS Integration
SQL Server Management Studio

Azure

The Azure data integration story continues to improve. The model to move data is either using SSIS or Data Sync. However, the SSIS development best practices are a little bit different. To transfer data efficiently there are a combination of things to do:

Compress your data being sent over the network
Parallelism is good but adding thread will reduce the transfer time
Robustness must be maintained by the package. A "blip" in th connection cannot force the entire operation to restart. This is long and painful process and should be able to pickup where the blip happened. This is achieved by using a loop (with sleep) to keep retrying and sending small sets into staging tables and then pushing loaded sets from staging into final tables.

Encryption

Encryption is becoming more of an issue as PII regulations become more prominent. Some things to keep in mind when planning security

Always salt the hash and encryption.
Be aware of CPU scaling concerns
Encryption makes everything bigger
Only 4000 characters at a time work with hashing and encryption functions built into SQL Server.
Encrypted and hashed values will not compress well

What types on encryption should be considered?

Field Level
Data at rest (on disk or on backup)
Connection Level
SAN level

Field Level

At the field level this can get complex because it is either managed in the DB or the Application. Each has tradeoffs

The DB is hard to scale out by adding nodes if it is responsible for all encryption.
The DB could manage to invisibly decrypt columns based on account permissions
The application would have to wrap all sensitive data with encrypt / decrypt logic if it was responsible for doing it.
The application can scale out to additional servers if needed to distribute the CPU load.

Data at Rest

When the data is at rest the database can encrypt it with minimal overhead using Transparent Data Encryption (TDE). This does mean that data is not encrypted in memory so an attacker could view data sitting through an exploit. TDE does do some changes silently. Instant File Initialization becomes disabled as soon as TDE is initiated. The shared database TEMPDB also must become encrypted.

Backup encryption is critical because there is no telling what will happen when tapes leave the security of a private datacenter and goes touring through the outside world.

Connection Level

There are two ways to make sure you connection is encrypted. A simple one to setup is having Active Directory enact a policy requiring IPSec for all connections to the DB Server. The other one is to use certificates to establish the classic SSL connection encryption. This can be controlled at the connection string level so you can configure it for only a subset of the traffic and reduce the CPU load.

SAN

SAN Level Encryption (MPIO) is fascinating because it will stop the LUN from being restored to a machine without the proper certificates. This tends to require a lot of CPU resources but it is possible to offload that encryption work to special (such as the Emulex HPA) that can be integrated into the RSA Key Manager.

DQS

I did discover something a bit disappointing about DQS today. There are curerntly no matching services available. That means that there are no internal web services for identifying matches nor are they any SSIS operations available. Matching work can ONLY be done through the client GUI. Without automation this will have a real challenge working with some types of data..