The last day of PASS was pretty exhausting. My head was still spinning with ideas from the previous days, and the tracks were a bit less balanced. To let the ideas settle in, I investigated some different areas.
Hadoop
Microsoft is really getting into Hadoop in a big way. I am glad that they have partnered with Hortonworks (the original Yahoo Hadoop team) for the Windows implementation. Microsoft is making a serious attempt to be a good open-source citizen and is actively contributing back to the Apache project. Their Hadoop product now has a name, HDInsight, and some interesting characteristics, such as:
- Available via NuGet
- Web Management console
- Map-Reduce can be written in C# and JavaScript (see the sketch after this list)
- ODBC Driver to connect to Hive
- Normal .NET
- LINQ
- Excel (with a friendly plug-in!)
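To give a feel for the developer experience, here is a minimal word-count sketch in C# written against the preview Microsoft .NET SDK for Hadoop (the Microsoft.Hadoop.MapReduce NuGet package). The exact type and member names come from the preview bits and may shift before release, so treat this as illustrative rather than definitive:

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce; // preview NuGet package for HDInsight

// Mapper: emit each word in the input line with a count of 1.
public class WordCountMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        foreach (var word in inputLine.Split(' '))
            context.EmitKeyValue(word, "1");
    }
}

// Reducer: sum the counts emitted for each word.
public class WordCountReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values,
                                ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Count().ToString());
    }
}
```

Submission is similarly small: the SDK exposes a Hadoop.Connect() entry point that can execute a mapper/reducer pair against the cluster, so the whole job stays in ordinary .NET code.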
Two versions were announced:
- Azure Edition
- This version actually substitutes the Azure Blob Store for HDFS. Presumably this was done because they share many characteristics, but the Blob Store is already tuned for the Microsoft datacenters and has somehow avoided the NameNode issue.
- A popular model is to gather data (such as web or application logs) into the Blob Store and, on a regular schedule, spin up a large HDInsight cluster, run the analysis, and then spin the cluster back down. This technique is used by Halo 4.
- On-Premises Edition
- This sounds as though it is targeted at preconfigured Hyper-V Virtual Machines (I may be way off here).
- The filesystem is HDFS, not NTFS or a version of the Blob Store
- Enterprise infrastructure support is coming:
- Microsoft Virtual Machine Manager
- System Center HDInsight Management Pack
- Active Directory for unified authentication
Additional features (maybe)
- Scheduled
- Oozie Workflow Scheduler
- Mahout
- Unscheduled (but popular requests)
- SSIS Integration
- SQL Server Management Studio
Azure
The Azure data integration story continues to improve. The model for moving data is to use either SSIS or Data Sync. However, SSIS development best practices are a little different for Azure. To transfer data efficiently, a combination of things needs to be done:
- Compress your data being sent over the network
- Parallelism is good; adding threads will reduce the transfer time
- Robustness must be maintained by the package. A "blip" in the connection cannot be allowed to force the entire operation to restart; that is a long and painful process, and the package should be able to pick up where the blip happened. This is achieved by using a loop (with a sleep) to keep retrying, sending small sets into staging tables, and then pushing the loaded sets from staging into the final tables (a sketch follows below).
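Here is a minimal C# sketch of that retry-into-staging pattern. The table names, batch shape, and retry counts are placeholders of mine, not anything prescribed in the session:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.Threading;

public static class AzureLoader
{
    // Push one small batch into staging, then move it to the final table,
    // retrying through transient connection "blips" instead of restarting
    // the whole transfer.
    public static void LoadBatchWithRetry(DataTable batch, string connectionString)
    {
        const int maxAttempts = 5;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                using (var connection = new SqlConnection(connectionString))
                {
                    connection.Open();

                    // Send a small set into the staging table.
                    using (var bulk = new SqlBulkCopy(connection))
                    {
                        bulk.DestinationTableName = "dbo.Orders_Staging"; // hypothetical
                        bulk.WriteToServer(batch);
                    }

                    // Push the loaded set from staging into the final table.
                    // In production this move should be transactional/idempotent.
                    using (var move = new SqlCommand(
                        "INSERT INTO dbo.Orders SELECT * FROM dbo.Orders_Staging; " +
                        "TRUNCATE TABLE dbo.Orders_Staging;", connection))
                    {
                        move.ExecuteNonQuery();
                    }
                }
                return; // this batch is committed; a later blip cannot undo it
            }
            catch (SqlException)
            {
                if (attempt >= maxAttempts) throw;
                // Sleep with a growing back-off, then retry just this batch.
                Thread.Sleep(TimeSpan.FromSeconds(10 * attempt));
            }
        }
    }
}
```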
Encryption
Encryption is becoming more of an issue as PII regulations become more prominent. Some things to keep in mind when planning security:
- Always salt both hashes and encrypted values
- Be aware of CPU scaling concerns
- Encryption makes everything bigger
- The hashing and encryption functions built into SQL Server only work on 4000 characters at a time (hashing in the application, as sketched after this list, avoids this limit)
- Encrypted and hashed values will not compress well
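Doing the hashing in the application layer is one way to respect those points: the salt is explicit, the CPU cost lands on a tier that scales out, and the 4000-character limit never comes into play. A minimal sketch of mine, using the standard .NET crypto classes:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class Hashing
{
    // Generate a fresh random salt; store it alongside the hash.
    public static byte[] NewSalt()
    {
        var salt = new byte[16];
        using (var rng = new RNGCryptoServiceProvider())
            rng.GetBytes(salt);
        return salt;
    }

    // SHA-256 over salt + value. Note the output is a fixed 32 bytes,
    // which is bigger than many raw values -- encryption makes everything bigger.
    public static byte[] SaltedHash(string sensitiveValue, byte[] salt)
    {
        using (var sha = SHA256.Create())
        {
            byte[] data = Encoding.UTF8.GetBytes(sensitiveValue);
            byte[] salted = new byte[salt.Length + data.Length];
            Buffer.BlockCopy(salt, 0, salted, 0, salt.Length);
            Buffer.BlockCopy(data, 0, salted, salt.Length, data.Length);
            return sha.ComputeHash(salted);
        }
    }
}
```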
What types of encryption should be considered?
- Field Level
- Data at rest (on disk or on backup)
- Connection Level
- SAN level
Field Level
At the field level this can get complex, because encryption is managed either in the DB or in the application. Each has tradeoffs:
- The DB is hard to scale out by adding nodes if it is responsible for all encryption.
- The DB can invisibly decrypt columns based on account permissions.
- The application has to wrap all sensitive data with encrypt/decrypt logic if it is responsible for doing it (see the sketch after this list).
- The application can scale out to additional servers if needed to distribute the CPU load.
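If the application is the one doing the wrapping, it looks roughly like the sketch below. The key handling here is illustrative only; real code would pull the key from a proper key store such as DPAPI or a dedicated key server:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class FieldCrypto
{
    // Every sensitive column value passes through these wrappers on its
    // way to and from the database.
    public static byte[] EncryptField(string plaintext, byte[] key)
    {
        using (var aes = Aes.Create())
        {
            aes.Key = key;
            aes.GenerateIV();
            using (var encryptor = aes.CreateEncryptor())
            {
                byte[] data = Encoding.UTF8.GetBytes(plaintext);
                byte[] cipher = encryptor.TransformFinalBlock(data, 0, data.Length);

                // Prepend the IV so each stored value is self-contained.
                // This is also why encrypted columns are bigger than the raw data.
                byte[] stored = new byte[aes.IV.Length + cipher.Length];
                Buffer.BlockCopy(aes.IV, 0, stored, 0, aes.IV.Length);
                Buffer.BlockCopy(cipher, 0, stored, aes.IV.Length, cipher.Length);
                return stored;
            }
        }
    }

    public static string DecryptField(byte[] stored, byte[] key)
    {
        using (var aes = Aes.Create())
        {
            aes.Key = key;
            byte[] iv = new byte[aes.BlockSize / 8];
            Buffer.BlockCopy(stored, 0, iv, 0, iv.Length);
            aes.IV = iv;
            using (var decryptor = aes.CreateDecryptor())
            {
                byte[] plain = decryptor.TransformFinalBlock(
                    stored, iv.Length, stored.Length - iv.Length);
                return Encoding.UTF8.GetString(plain);
            }
        }
    }
}
```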
Data at Rest
When the data is at rest, the database can encrypt it with minimal overhead using Transparent Data Encryption (TDE). This does mean that data is not encrypted in memory, so an attacker could view data sitting there through an exploit. TDE also makes some changes silently: Instant File Initialization becomes disabled as soon as TDE is enabled, and the shared database tempdb must also become encrypted.

Backup encryption is critical because there is no telling what will happen when tapes leave the security of a private datacenter and go touring through the outside world.
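For reference, turning TDE on is only a handful of DDL statements. A sketch of mine, run from C#; the certificate name, database name, and password are placeholders:

```csharp
using System.Data.SqlClient;

public static class TdeSetup
{
    public static void EnableTde(string connectionString)
    {
        // Placeholder names throughout; each step runs as its own batch.
        string[] steps =
        {
            "USE master; CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';",
            "USE master; CREATE CERTIFICATE TdeCert WITH SUBJECT = 'TDE certificate';",
            "USE SalesDb; CREATE DATABASE ENCRYPTION KEY " +
                "WITH ALGORITHM = AES_256 ENCRYPTION BY SERVER CERTIFICATE TdeCert;",
            "ALTER DATABASE SalesDb SET ENCRYPTION ON;"
        };

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            foreach (var step in steps)
                using (var command = new SqlCommand(step, connection))
                    command.ExecuteNonQuery();
        }
    }
}
```

Back the certificate up immediately after creating it; without it, the encrypted backups cannot be restored anywhere else.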