I was able to get to my first Hadoop conference this year, so far it's been really good. Below are my notes from the sessions I was able to attend today.
The sessions I went to today include:
- Summit Keynote
- Verizon: Finance Data Lake
- Whoops, the numbers are wrong! Scaling Data Quality @ Netflix
- Dancing Elephants - Efficiently working with object stores from Apache Spark and Hive
- Governance Bots - Metadata driven compliance
- Cloudy with a chance of Hadoop - real world considerations
- Hadoop Journey at Walgreens
- LLAP: Building Cloud-first BI
The keynote was considerably flashier than I was expecting, the first 10 minutes or so was a lot of fog and laser light shows stuff. Not really my thing and I wish they didn't do that.
They did announce a very interesting partnership with IBM. IBM's Big Insights was based on open source Hadoop, maintaining their own distribution.Now IBM is going to leverage Hortonworks Data Platform (HDP) with Big SQL and Data Science Experience (DSX) layered on top of it.
Hortonworks has stated that for a distributed Data Warehouse Big SQL is their recommendation, but it doesn't seem to change their message about Hive. Hive still is their platform for Analytical SQL processing. The thing that Big SQL brings to the table is their "Fluid Query" (Federation) across multiple data sources that live in Hadoop (Hive, HBase, Spark and Object stored through HDFS) and conventional RDBMS (Oracle, SQL Server, Teradata, DB2, PDA / Netezza, etc.). Big SQL is a query optimizer and compiler to gather data across all of those systems. I've never used it, but the notion is interesting.
As for DSX, I have only seen a little bit of it. At first glance it appears that it stitches together some opensource Machine Learning (ML) platforms (and SPSS) into a set of coherent tools that focus upon collaboration. There are video tutorials to help team members with varying experience in the tools learn about using them through DSX. They have a website for machine learning use cases by industry at www.ibm-ml-hub.com
, but there isn't much up there yet. Their visualization engine has been open-sourced, it's called brunel and is available on github
. It can be used with Python, R and Spark, but I'm not clear how it stacks up to its competitors. They also focused on how simple it is to integrate the models generated into existing applications. The models can be tested, scored, monitored and generate alerts based on defined conditions.
There was also a round-table of folks using Hortonworks in different industries, including:
- Duke Energy
- Black Night (Fidelity Mortgages)
- Healthcare Services
They came to the conclusion that they need to be a digital company that sells energy, instead of an energy company using some technology. This is a completely different approach from what they did before and required a major mental shift for everyone. There was a complete shift from commercial to open source software whenever the option exists. This model has been an advantage for attracting and retaining top talent. Above all else, they recommend making sure that Governance be considered early in the journey, otherwise it will lead to problems.
They have many sensors (>4 million) feeding into their system which would have conventional data warehouse appliances, so most data was getting lost. They are able to identify customer issues faster than ever and are using customer data to fuel their warehouse and applying machine learning to help with customer calls. An interesting thing they are working on is dynamically looking at the customer calling in, identify anticipated needs and offer the options to them faster than ever.
Black Night (Fidelity Mortgage)
Black Night uses Hadoop to improve their understanding of the customer. Similar to Duke, the cultural shift was hard but it also attracted (and retained) talent that otherwise would not of joined the company. They are using the data they gather to predict customer behavior and areas of interest. They want to reach the point where a customer can get their mortgage in a few minutes instead of the time intensive process it has been in the past.
Healthcare Services are two years into their data journey. This industry is even more heavily regulated and tends to towards data silos than most. To break out of the silos wherever possible they are switching to opensource software to fuel their ongoing machine learning efforts. For example, zip codes are highly correlated to lifespan, this can help identify populations that need help and preventive care. By working to provide clinics and focus attention on those areas those numbers can be improved.
Microsoft using YARN
Microsoft is switching to using YARN on their COSMOS clusters. These are very large clusters that need to support > 50k nodes (each!) distributed throughout the globe. Previously it used their own resource negotiator that processed Dryad DAGs (talk about a blast from the past!). Microsoft has been steadily contributing changes to YARN so that it can swap out their homegrown solution for an indsutry standard approach. They are a major project contributor and have dedicated MS Research members to work on improving cluster size and creating predictable and preemptable processing allocations. This effectively allows them to create higher priority jobs that will always execute when needed and allow other jobs to run when resources free up again. They are achieving the scalability by adding a federation layer and having each YARN instance scale to 10k nodes. Fascinating stuff, I love keeping an eye on the MS Research site