Friday, January 1, 2010

The Fourth Paradigm brought to life?

I was reading "The Fourth Paradigm" and thinking about the whole data-brick concept that is needed to handle the information deluge. To their credit, it is not a solution-focused book: they mention Hadoop, MapReduce and Beowulf clusters all over and hardly mention SQL Server at all. This is impressive because the man who inspired this book (Jim Gray) is the same database luminary who dragged SQL Server back from the dead with SQL Server 7.0. I'm only considering this in the context of business systems because that is where I have the most experience.

In the book, there are three main concerns.

  1. Capturing data
  2. Curating the data
  3. Analyzing the data

Capture
Capturing is not a problem: we currently capture a rather absurd amount of data, but we don't know what to do with it. Between Beowulf clusters and the MS Windows HPC (High Performance Computing) editions, we know how to collect tons of information.

Curate

I think that the curating of the data is intended to be done by Master Data Services. This provides a way to set up data approval processes and de-duplication based on customized business rules. Its whole purpose in life is taking care of information. Now, opening that data up so it is valuable to the world instead of a single company is a separate problem.
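To make "de-duplication based on customized business rules" a bit more concrete, here is a minimal sketch in Python. It is not the Master Data Services API. The matching rule (same email address, ignoring case and whitespace), the "keep the most recently updated record" policy, and all of the names and sample data are assumptions I made up for illustration.

```python
# Hypothetical sample data: two of these rows describe the same customer.
CUSTOMERS = [
    {"name": "Acme Corp", "email": "sales@acme.example", "updated": "2009-11-02"},
    {"name": "ACME Corporation", "email": " Sales@Acme.example ", "updated": "2009-12-15"},
    {"name": "Globex", "email": "info@globex.example", "updated": "2009-10-20"},
]


def match_key(record):
    """Assumed business rule: records with the same trimmed, lower-cased
    email address refer to the same customer."""
    return record["email"].strip().lower()


def deduplicate(records):
    """Keep one surviving record per match key, preferring the most
    recently updated one (ISO dates compare correctly as strings)."""
    survivors = {}
    for record in records:
        key = match_key(record)
        current = survivors.get(key)
        if current is None or record["updated"] > current["updated"]:
            survivors[key] = record
    return list(survivors.values())


if __name__ == "__main__":
    for customer in deduplicate(CUSTOMERS):
        print(customer["name"], customer["email"].strip())
```

A real curation tool layers approval workflows and fuzzier matching on top of this, but the core idea is the same: a rule decides which records match, and a policy decides which one survives.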

Data Analysis


Initially I was thinking that the data-brick concept for analyzing and retrieving the data was being brought to life as part of SQL Azure. Probably the best part of SQL Azure is that it could meet the need for centralized storage that can be rapidly accessed from anywhere in the world. Unfortunately, SQL Azure has a lot of drawbacks.

  • Extremely limited DB size (10GB?)
  • Limited ability to do query tuning (Can't even see the query plan?)
  • No controlled backup strategy (the "I need data from a week ago" scenario)

The limited database size is a real killer. This ain't it.

So what is the next option? I think Microsoft is addressing it through SQL 2008 R2 Parallel Data Warehouse Edition. Yup, it's a mouthful alright, but it is positioned in a very interesting spot. It's intended to be paired with Windows HPC (High Performance Computing) Edition. It's based on the parallel data warehouse software from DATAllegro. It distributes the query across a bunch of nodes, collects the results and presents them. It sounds awesome, right? The price certainly is (57k per processor!).
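The "send the query to a bunch of nodes, collect the results, present them" idea is a scatter-gather pattern, and a toy version fits in a few lines of Python. This is only a sketch of the general technique, not how Parallel Data Warehouse or DATAllegro actually work internally. The "nodes" here are just in-memory lists, and the table, the function names and the data are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "node" holds one slice of a (region, amount) sales table.
NODES = [
    [("east", 100), ("west", 50)],
    [("east", 25), ("north", 10)],
    [("west", 5), ("north", 40)],
]


def query_node(rows):
    """Run the partial query on one node: SUM(amount) GROUP BY region."""
    partial = {}
    for region, amount in rows:
        partial[region] = partial.get(region, 0) + amount
    return partial


def merge_partials(partials):
    """Combine the per-node partial sums into the final result."""
    total = {}
    for partial in partials:
        for region, amount in partial.items():
            total[region] = total.get(region, 0) + amount
    return total


def distributed_sum_by_region(nodes):
    """Scatter the query to every node in parallel, then gather and merge."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(query_node, nodes))
    return merge_partials(partials)


if __name__ == "__main__":
    print(distributed_sum_by_region(NODES))
```

The point of the pattern is that each node only scans its own slice, so adding nodes spreads the scan work out. The merge step stays cheap as long as the partial results are small, which is the case for aggregates like this one.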

I really like that MS has a coherent long-term strategy for all this stuff, one that I hadn't seen so clearly until reviewing Jim Gray's work.