The last day of PASS is pretty exhausting. My head was still spinning with ideas from the previous days, and the tracks were a bit less balanced. To let the ideas settle, I investigated some different areas.
Hadoop
Microsoft is really getting into Hadoop in a big way. I am glad that they have partnered with Hortonworks (the original Yahoo Hadoop team) for the Windows implementation. Microsoft is making a serious attempt to be a good open source citizen and is actively contributing back to the Apache project. Their Hadoop product now has a name, HDInsight, and some interesting characteristics, such as:
- Available via NuGet
- Web Management console
- ODBC Driver to connect to Hive
- Normal .NET
- Excel (with a friendly plug-in!)
Two versions were announced:
- Azure Edition
- This version actually substitutes the Azure BlobStore for HDFS. Presumably this was done because they share many characteristics but the BlobStore is already tuned for the MS Datacenters and has presumably avoided the NameNode issue somehow.
- A popular model is to gather data (such as web or application logs) into the blobstore and on a regular schedule spin up a large HDInsight cluster, run the analysis and then spin it back down. This technique is used by Halo 4.
- On-Premises Edition
- This sounds as though it is targeted at preconfigured Hyper-V Virtual Machines (I may be way off here).
- The filesystem is HDFS, not NTFS or a version of the BlobStore
- Enterprise infrastructure support coming
- Microsoft Virtual Machine Manager
- System Center HDInsight Management Pack
- Active Directory for unified authentication
Additional features (maybe)
- Oozie Workflow Scheduler
- Unscheduled (but popular requests)
- SSIS Integration
- SQL Server Management Studio
Azure
The Azure data integration story continues to improve. The model to move data is either using SSIS or Data Sync. However, the SSIS development best practices are a little bit different. To transfer data efficiently there is a combination of things to do:
- Compress your data being sent over the network
- Parallelism is good; adding threads will reduce the transfer time
- Robustness must be maintained by the package. A "blip" in the connection cannot force the entire operation to restart. Restarting is a long and painful process, so the package should be able to pick up where the blip happened. This is achieved by using a loop (with sleep) to keep retrying, sending small sets into staging tables and then pushing loaded sets from staging into final tables.
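The retry pattern in the last bullet can be sketched roughly as follows. This is not SSIS; it is a small Python illustration that uses an in-memory SQLite database as a stand-in for the Azure destination. The table names `staging` and `final`, the `TransientError` exception, and the `make_flaky_send` helper are all hypothetical names invented for the example.

```python
import sqlite3
import time


class TransientError(Exception):
    """Hypothetical stand-in for a dropped connection (a "blip")."""


def make_destination():
    """Create an in-memory database with a staging table and a final table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.execute("CREATE TABLE final (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.commit()
    return conn


def make_flaky_send(fail_on_calls):
    """Return a send function that raises on the given call numbers (1-based)."""
    calls = {"n": 0}

    def send(cur, batch):
        calls["n"] += 1
        if calls["n"] in fail_on_calls:
            raise TransientError("simulated blip")
        cur.executemany("INSERT INTO staging (id, payload) VALUES (?, ?)", batch)

    return send


def load_in_batches(conn, rows, send, batch_size=2, max_retries=5, delay=0.01):
    """Load rows in small batches; a failure only repeats the current batch."""
    loaded = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        for attempt in range(1, max_retries + 1):
            try:
                cur = conn.cursor()
                cur.execute("DELETE FROM staging")  # drop any earlier batch
                send(cur, batch)                     # small set into staging
                cur.execute("INSERT INTO final SELECT id, payload FROM staging")
                conn.commit()                        # batch is now durable
                loaded += len(batch)
                break
            except TransientError:
                conn.rollback()                      # undo the partial batch
                if attempt == max_retries:
                    raise
                time.sleep(delay * attempt)          # back off, then retry
    return loaded


conn = make_destination()
rows = [(i, f"row{i}") for i in range(5)]
loaded = load_in_batches(conn, rows, make_flaky_send({2, 3}))
final_ids = [r[0] for r in conn.execute("SELECT id FROM final ORDER BY id")]
```

Because each batch is committed on its own, two simulated blips in the middle of the run only cause the second batch to be resent; the first batch is never touched again.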
Encryption
Encryption is becoming more of an issue as PII regulations become more prominent. Some things to keep in mind when planning security:
- Always salt the hash and encryption.
- Be aware of CPU scaling concerns
- Encryption makes everything bigger
- Only 4000 characters at a time work with hashing and encryption functions built into SQL Server.
- Encrypted and hashed values will not compress well
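A few of these points (salting, size growth, and poor compression) can be seen in a small sketch. This uses Python's standard library rather than SQL Server's built-in functions, and the choice of PBKDF2 with SHA-256, the 16-byte salt, and the iteration count are assumptions made for the illustration, not recommendations from the session.

```python
import hashlib
import hmac
import os
import zlib
from typing import Optional, Tuple

ITERATIONS = 100_000  # assumed work factor; more iterations cost more CPU


def hash_value(value: str, salt: Optional[bytes] = None,
               iterations: int = ITERATIONS) -> Tuple[bytes, bytes]:
    """Return (salt, digest); a fresh random salt is made when none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", value.encode("utf-8"), salt, iterations)
    return salt, digest


def verify_value(value: str, salt: bytes, expected: bytes,
                 iterations: int = ITERATIONS) -> bool:
    """Re-hash with the stored salt and compare in constant time."""
    _, digest = hash_value(value, salt, iterations)
    return hmac.compare_digest(digest, expected)


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means it compresses better)."""
    return len(zlib.compress(data)) / len(data)


# The same value hashed twice gets two different salts and two different digests.
salt_a, digest_a = hash_value("123-45-6789", iterations=1000)
salt_b, digest_b = hash_value("123-45-6789", iterations=1000)

# A column of repeated plaintext versus the same column stored as salted hashes.
plain_column = ("123-45-6789\n" * 10).encode("utf-8")
hashed_column = b"".join(
    hash_value("123-45-6789", iterations=1000)[1] for _ in range(10)
)
plain_ratio = compression_ratio(plain_column)
hashed_ratio = compression_ratio(hashed_column)
```

The 11-character value becomes a 32-byte digest plus a 16-byte salt, and the hashed column looks like random bytes to the compressor, which is why page or backup compression gains little on such columns.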
What types of encryption should be considered?
- Field Level
- Data at rest (on disk or on backup)
- Connection Level
- SAN level
Field Level
At the field level this can get complex because encryption is managed either in the DB or in the application. Each has tradeoffs:
- The DB is hard to scale out by adding nodes if it is responsible for all encryption.
- The DB could manage to invisibly decrypt columns based on account permissions
- The application would have to wrap all sensitive data with encrypt / decrypt logic if it was responsible for doing it.
- The application can scale out to additional servers if needed to distribute the CPU load.
Data at Rest
When the data is at rest the database can encrypt it with minimal overhead using Transparent Data Encryption (TDE). This does mean that data is not encrypted in memory, so an attacker could view data sitting in memory through an exploit. TDE does make some changes silently. Instant File Initialization becomes disabled as soon as TDE is initiated. The shared database TEMPDB also must become encrypted.
Backup encryption is critical because there is no telling what will happen when tapes leave the security of a private datacenter and go touring through the outside world.