Saturday, November 10, 2012

PASS Summit Day 3

The last day of PASS is pretty exhausting. My head was still spinning with ideas from the previous days and the tracks were a bit less balanced. To let the ideas settle in I investigated some different areas.


Microsoft is really getting into Hadoop in a big way. I am glad that they have partnered with Hortonworks (the orginal Yahoo Hadoop team) for the Windows implementation.  Microsoft is making a serious attempt to be a good opensource citizen and is actively contributing back to the Apache project. Their Hadoop product now has a name, HDInsight and some interesting characteristics, such as:

  • Available via NuGet
  • Web Management console
  • Map-Reduce can be written in C# and JavaScript
  • ODBC Driver to connect to Hive
    • Normal .Net
    • LINQ 
    • Excel (with a friendly plug-in!)

There are two versions announced

  • Azure Edition
    • This version actually substitutes the Azure BlobStore for HDFS. Presumably this was done because they share many characteristics but the BlobStore is already tuned for the MS Datacenters and has presumably avoided the NameNode issue somehow.
    • A popular model is to gather data (such as web or application logs) into the blobstore and on a regular schedule spin up a large HDInsight cluster, run the analysis and then spin it back down. This technique is used by Halo 4.
  • On-Premises Edition
    • This sounds as though it is targeted at preconfigured Hyper-V Virtual Machines (I may be way off here).
    • The filesystem is HDFS, not NTFS or version of the BlobStore
    • Enterprise infratructure support coming
      • Microsoft Virtual Machine Manager
      • System Center HDInsight Management Pack
      • Active Directory for unified authentication

Additional features (maybe)

  • Scheduled
    • Oozi Worksflow Scheduler
    • Mahout
  • UnScheduled (but popular requests)
    • SSIS Integration
    • SQL Server Management Studio


The Azure data integration story continues to improve. The model to move data is either using SSIS or Data Sync. However, the SSIS development best practices are a little bit different. To transfer data efficiently there are a combination of things to do:

  1. Compress your data being sent over the network
  2. Parallelism is good but adding thread will reduce the transfer time
  3. Robustness must be maintained by the package. A "blip" in th connection cannot force the entire operation to restart. This is long and painful process and should be able to pickup where the blip happened. This is achieved by using a loop (with sleep) to keep retrying and sending small sets into staging tables and then pushing loaded sets from staging into final tables.


Encryption is becoming more of an issue as PII regulations become more prominent. Some things to keep in mind when planning security

  • Always salt the hash and encryption. 
  • Be aware of CPU scaling concerns
  • Encryption makes everything bigger
  • Only 4000 characters at a time work with hashing and encryption functions built into SQL Server.
  • Encrypted and hashed values will not compress well

What types on encryption should be considered?

  • Field Level
  • Data at rest (on disk or on backup)
  • Connection Level
  • SAN level

Field Level

At the field level this can get complex because it is either managed in the DB or the Application. Each has tradeoffs

  • The DB is hard to scale out by adding nodes if it is responsible for all encryption. 
  • The DB could manage to invisibly decrypt columns based on account permissions
  • The application would have to wrap all sensitive data with encrypt / decrypt logic if it was responsible for doing it.
  • The application can scale out to additional servers if needed to distribute the CPU load.

Data at Rest

When the data is at rest the database can encrypt it with minimal overhead using Transparent Data Encryption (TDE). This does mean that data is not encrypted in memory so an attacker could view data sitting through an exploit. TDE does do some changes silently. Instant File Initialization becomes disabled as soon as TDE is initiated. The shared database TEMPDB also must become encrypted.

Backup encryption is critical because there is no telling what will happen when tapes leave the security of a private datacenter and goes touring through the outside world.

Connection Level

There are two ways to make sure you connection is encrypted. A simple one to setup is having Active Directory enact a policy requiring IPSec for all connections to the DB Server. The other one is to use certificates to establish the classic SSL connection encryption. This can be controlled at the connection string level so you can configure it for only a subset of the traffic and reduce the CPU load.


SAN Level Encryption (MPIO) is fascinating because it will stop the LUN from being restored to a machine without the proper certificates. This tends to require a lot of CPU resources but it is possible to offload that encryption work to special (such as the Emulex HPA) that can be integrated into the RSA Key Manager.


I did discover something a bit disappointing about DQS today. There are curerntly no matching services available. That means that there are no internal web services for identifying matches nor are they any SSIS operations available. Matching work can ONLY be done through the client GUI. Without automation this will have a real challenge working with some types of data..


  1. There are lots of information about hadoop have spread around the web, but this is a unique one according to me. The strategy you have updated here will make me to get to the next level in big data. Thanks for sharing this.

    Best hadoop training institute in chennai
    Hadoop Course in Chennai

  2. The content provided here is vital in increasing one's knowledge regarding hadoop, the way you have presented here is simply awesome. Thanks for sharing this. The uniqueness I see in your content made me to comment on this. Keep sharing article like this. Thanks :)

    Hadoop Training in Chennai | Hadoop Course in Chennai | Big data training in Chennai

  3. Updating with the current trend is strictly advisable and the content furnished here also states the same. Thanks for sharing this wonderful and worth able article in here. The way to expressed is simply awesome. Keep doing this job. Thanks :)

    JAVA J2EE Training Institutes in Chennai | JAVA Training | JAVA Course in Chennai | software testing training in chennai