Unimpeded By Sanity: 2012

Saturday, November 10, 2012

PASS Summit Day 3

The last day of PASS is pretty exhausting. My head was still spinning with ideas from the previous days and the tracks were a bit less balanced. To let the ideas settle in I investigated some different areas.

Hadoop

Microsoft is really getting into Hadoop in a big way. I am glad that they have partnered with Hortonworks (the orginal Yahoo Hadoop team) for the Windows implementation. Microsoft is making a serious attempt to be a good opensource citizen and is actively contributing back to the Apache project. Their Hadoop product now has a name, HDInsight and some interesting characteristics, such as:

Available via NuGet
Web Management console
Map-Reduce can be written in C# and JavaScript
ODBC Driver to connect to Hive

Normal .Net
LINQ
Excel (with a friendly plug-in!)

There are two versions announced

Azure Edition

This version actually substitutes the Azure BlobStore for HDFS. Presumably this was done because they share many characteristics but the BlobStore is already tuned for the MS Datacenters and has presumably avoided the NameNode issue somehow.
A popular model is to gather data (such as web or application logs) into the blobstore and on a regular schedule spin up a large HDInsight cluster, run the analysis and then spin it back down. This technique is used by Halo 4.

On-Premises Edition

This sounds as though it is targeted at preconfigured Hyper-V Virtual Machines (I may be way off here).
The filesystem is HDFS, not NTFS or version of the BlobStore
Enterprise infratructure support coming

Microsoft Virtual Machine Manager
System Center HDInsight Management Pack
Active Directory for unified authentication

Additional features (maybe)

Scheduled

Oozi Worksflow Scheduler
Mahout

UnScheduled (but popular requests)

SSIS Integration
SQL Server Management Studio

Azure

The Azure data integration story continues to improve. The model to move data is either using SSIS or Data Sync. However, the SSIS development best practices are a little bit different. To transfer data efficiently there are a combination of things to do:

Compress your data being sent over the network
Parallelism is good but adding thread will reduce the transfer time
Robustness must be maintained by the package. A "blip" in th connection cannot force the entire operation to restart. This is long and painful process and should be able to pickup where the blip happened. This is achieved by using a loop (with sleep) to keep retrying and sending small sets into staging tables and then pushing loaded sets from staging into final tables.

Encryption

Encryption is becoming more of an issue as PII regulations become more prominent. Some things to keep in mind when planning security

Always salt the hash and encryption.
Be aware of CPU scaling concerns
Encryption makes everything bigger
Only 4000 characters at a time work with hashing and encryption functions built into SQL Server.
Encrypted and hashed values will not compress well

What types on encryption should be considered?

Field Level
Data at rest (on disk or on backup)
Connection Level
SAN level

Field Level

At the field level this can get complex because it is either managed in the DB or the Application. Each has tradeoffs

The DB is hard to scale out by adding nodes if it is responsible for all encryption.
The DB could manage to invisibly decrypt columns based on account permissions
The application would have to wrap all sensitive data with encrypt / decrypt logic if it was responsible for doing it.
The application can scale out to additional servers if needed to distribute the CPU load.

Data at Rest

When the data is at rest the database can encrypt it with minimal overhead using Transparent Data Encryption (TDE). This does mean that data is not encrypted in memory so an attacker could view data sitting through an exploit. TDE does do some changes silently. Instant File Initialization becomes disabled as soon as TDE is initiated. The shared database TEMPDB also must become encrypted.

Backup encryption is critical because there is no telling what will happen when tapes leave the security of a private datacenter and goes touring through the outside world.

Connection Level

There are two ways to make sure you connection is encrypted. A simple one to setup is having Active Directory enact a policy requiring IPSec for all connections to the DB Server. The other one is to use certificates to establish the classic SSL connection encryption. This can be controlled at the connection string level so you can configure it for only a subset of the traffic and reduce the CPU load.

SAN

SAN Level Encryption (MPIO) is fascinating because it will stop the LUN from being restored to a machine without the proper certificates. This tends to require a lot of CPU resources but it is possible to offload that encryption work to special (such as the Emulex HPA) that can be integrated into the RSA Key Manager.

DQS

I did discover something a bit disappointing about DQS today. There are curerntly no matching services available. That means that there are no internal web services for identifying matches nor are they any SSIS operations available. Matching work can ONLY be done through the client GUI. Without automation this will have a real challenge working with some types of data..

Friday, November 9, 2012

PASS Summit Day 2

Another great day filled with learning new techniques and discussing different challenges. There was more of a focus on Query optimization than anything else today. After the sessions We all went to the Experience Music Project where there was food, live band karaoke and a Dalek!

Great fun all around, I just need to get back to a normal sleeping schedule, this 4 hours a night thing is exhausting.

Technical notes below!

The session that Adam Machanic did about forcing better parallel execution plans was astounding. He took a 28 minute query and optimized it to about 3 seconds and by improving the parallelization.

The Optimizer Issue

The premise is that optimizer is outdated because the Memory to Data Volume ratio (size or speed) is nothing like the way it was 1997 (when the last big optimizer rewrite happened). This means the optimizer needs to be coaxed into doing the right thing.

The optimizer tends to serialize a lot of processing for a number of reasons that may not make sense now. The problem with serializing all of these processes is that Amdahl's Law (not theory, LAW) says that the number of serial processes substantially impacts performance as the number of processors increases. The goal is to make as many parallel as possible. The reason the optimizer does not default to this behaviour is because it is expecting greater disk latency than we have now that would have rendered this point moot.

Possible Approach

An example of a common item that inhibit can parallelism is the Scalar Function. Potential solutions to this are:

Convert it to an in-line TVF and CROSS APPLY it.
Create a CLR scalar function (weird but true, it works much faster)

To check how balanced the parallelism is you can open the properties of any step of the query plan and check view the actual number of rows returned for each thread (there is a little arrow thing hiding there).

There is a traceflag that will force the optimizer to try to make some part of query parallel (Trace Flag 8649) but this is not a good answer. The issue is that it doesn't ensure that you are actually removing the serial parts to balance the load.

One way to encourage the optimizer to seek the parallelism path without the Trace Flag is to use a TOP(@i) with an OPTION(OPTIMIZE FOR (@i = 2000000000)). This will make the optimizer expect 2 billion rows even if there are only 10,000 in the table. This will top the estimated cost calculate to the side of making the query parallel.

MAXDOP should be configured to less than the number of processing cores for each numa node. This avoids making for foreign memory requests which may be slow. It is ok to go higher, but this is a very sensible starting spot, be sure to test carefully.

Parallel Apply Pattern

The Parallel Apply Pattern can be used to enourage working with an managing efficient streams with the optimizer. The abstract form of this query is:

SELECT

myfields

FROM

[DRIVER TABLE]

CROSS APPLY

(

[CORRELATED PAYLOAD]

)

To execute this pattern the DRIVER of your query must be identified, for example is it really just calculating information about product stock levels? Then the driver table would be Product and the CORRELATED PAYLOAD would be the sophisticated joins, aggregations and calculations.

Roughly speaking, for each record in the DRIVER TABLE the CORRELATED PAYLOAD executes all of it's heavy set based work as a smaller piece rather than against the full DRIVER TABLE set. This is more efficient for parallel loads.

DRIVER TABLE issues

If the driver table is not going parallel there are a couple of choices after turning it into a derived table

1. Use the TOP techique mentioned above. The optimizer will think there are more rows.

2. Use a DISTINCT 0%ColA + ColA calculation in the select list to prevent a calculation from being extracted from the parallel stream.

CORRELATED PAYLOAD issues

If the CORRELATED PAYLOAD is still having it's calculations pulled away from the parallel stream that change the CROSS APPLY to an OUTER APPLY. This prevents the optimizer from knowing if it is safe to pull the calculations up (as a poor optimization choice).

Thursday, November 8, 2012

PASS Summit Day 1

Summit officially began today. I had a great time meeting new people, talking to vendors and learning stuff.

The big things today were they Keynote, Extended Events, Enterprise Information Management and Performance Tuning.

MS is clearly taking the threat from SAP's HANA seriously with making a new in-memory storage engine fr SQL Server. It will be in the next major release and is expected to speed up existing applications substantially.

Extended Events are interesting and really are complex enough that there would be a real benefit from having an actual book assembled about them. It is replacing a very simple tool (Profiler) with a very complex and powerful one. I learned just enough to know that I need to send a lot of quality time to understand it as much as I'd like.

So far it looks like Microsoft is taking the approach of going after the smaller, simpler data to Master instead of trying to battle in the arena of CDI or PIM where there are large established player. This follows the approach MS has had for a while. Something like Informatica is just overkill to manage Jill's spreadsheet that everyone shares, but MDS is perfect for managing and distributing the information throughout the company.

Some of my notes from today are below:

Keynote

MS Announced Hekaton : An in-memory engine for SQL Server

There will be an adviser on when to use it
Live as part of the DBMS server as a different storage engine. Heterogeneous
Lock and latch free
The storage engine for tables can be changed to move them in and out of memory. If the table no longer fits it switches to the conventional model. Because this can be done seamlessly the data is likely persisted to disk in the conventional DBMS way.

Exadata uses flash as a sort of cache fusion
HANA is designed as Memory primary and disk secondary so has huge performance drops when memory limits are exceeded.

Stored procedure optimization compiled to native code
Included in the next version of SQL Server

Parallel Data Warehouse
* Polybase allows seems connection to Hadoop / Hive and SQL Server DB to allow a single point of federated querying

Extended Events

SQL Trace and Profiler are deprecated. With SQL 2012 every traceable is now available in Extended Events

Extended events are much lighter than Profiler (similar to trace) of the previous options with support for multiple targets. http://www.sqlperformance.com/2012/10/sql-trace/observer-overhead-trace-extended-events
The available targets are:

Ringbuffer
Event Bucketing (Bucketizer) is a specialized ringbuffer, storing aggregates
Event Pairing allows the tracking of things like everything happening between lock_acquired and lock_released
Event file (xml)
Synchronous Event Counter only tracks number of times and event fired, no details

The SQL Server audit events have been merged into the normal extended events.

System Overhead (as tested by SQLCAT)

Query Time increases by 2 microseconds. As more actions are tracked this does increase.
Memory increases by 4MB for the RingBuffer
Disk storage is more verbose, always try to avoid logging to the Data / Log volumes
If the event tracking causes too much of a slowdown, it will be automatically disconnected until it has less impact on performance.

In general the learning curve is steeper but you can do more with Extended Events than with Profiler.

The GUI is not as nice and profiler and there is a larger learning curve. Also the SQL 2012 GUI does not work with SQL 2008.

Event information is recorded in an XML structure that is fairly consistent. Use XQuery to extract the results in a tabular form.

One mental model for understanding the different parts of the setup is to think of it like an insert statement

INSERT INTO [TARGET]
SELECT [EVENT DATA]
FROM [EVENT]
LEFT OUTER JOIN [ACTION]
WHERE [PREDICATE]

The predicate will short circuit, this is means that Criteria should be placed with the most restrictive rules first!

More good stuff on Jonathan Kehayias's blog

Because traces are very light they can run in the background. For example, the old SQL Trace Blackbox has returned as an Extended Event session called "system_health" and is visible under: Server>Management>Extended Events>Sessions and is using two targets, an event file and the ring_buffer.

Performance Tuning

5 Mains areas to consider when there is a problem

IO
CPU
Memory
Network
Locks and Blocks

Activity Monitor is MUCH better than it was in SQL 2005. Most of the common troubleshooting DMVs are presented there.

The only system process with a SPID > 50 is... SERVICE BROKER! That is a feature that really gets no respect and it turns out the other system process don't want to be associated with it either

When a WAIT x:y:z is displayed, the variables represent

x : DBID
y : File ID
z : Page

page 1 contains the page free space bitmap, contnention here is typically for heaps (most temp tables and table variables are heaps). Scalar and non-inlined TVFs can abuse this by creating and destroying objects swiftly.

Wednesday, November 7, 2012

Pre-Con Day 2 T-SQL Querying and Query Tuning Enhancements in SQL Server

This was the second (and last) day of the pre-conference events. I selected the T-SQL session for two reasons, I know that there were a lot of enhancements to core and it was being given by Itzik Ben-Gan, who is a SQL Master (he literally wrote the book on T-SQL query techniques).

I was a bit concerned because he enumerated the items he wanted to cover and I've used nearly all of them so I didn't expect to learn much. Much like yesterday, I was wrong.

There were two areas that I struggled to keep up with him during the session, those were the Aggregate Windowing (which I never used) and all the crazy ways you can use the APPLY operator. The rest of the session was good but I kept coming back to (and testing out) those two areas.

I didn't get much of a chance for socializing because I was tinkering with the new stuff I learned whenever I had a few free minutes and I did something bad to my ankle so I just hung around my room for the evening.

Below are my notes from this session:

T-SQL Querying and Query Tuning Enhancements in SQL Server

Extended Events Change

This was the last thing we discussed but it is SO important that I need to talk about it first. Extended Events have anew Global Field called Query Hash. This is a hash of the auto-paramterized query, capturing this allows the detection of all sorts of performance problems. All of the tiny, fast queries that execute ALL the time can be evaluated to see if your system is suffering the "death by a million cuts". This is HUGE.

CTE

CTEs join the rank of derived tables, views and table valued functions as a way to make the query syntax cleaner. By themselves, all of these options generate the exact same query plan and do nothing improve or degrade query performance. The biggest immediate benefit to non-recursive CTEs is the improved legibility they gain by allowing query reuse.

CTEs are evaluated Top to Bottom. This allows the first query to be referenced by the second query, Both the first and second queries could be referneced by the third query, etc.

To improve performance consider using something to actually materialize the data and reference that single result set. Examples of this would be actual tables, temp tables and table variables.

Recursive CTEs do not follow the rules mentioned above, instead they need to keep a running tally of the data. It does not persist it to a normal table but instead creates a B-Tree structure for the data on disk that has been optimized for the Recursive CTE's usage patterns.

Windowing Functions

Craziness. Absolute, amazing crazy that has been hiding from me since SQL 2005 that can be applied to any aggregate (but works best with cumulative aggregates).

This is work applied to the result set.

The Windowing Functions are made up of a bunch of different parts

OVER : Establishes the window of data over the result sets
PARTITION BY : Defines the size of the sliding window by referencing what columns cause the aggregate to reset. (Note - FILTER operator is not supported, which would be an explicit where to filter the partition)
ORDER BY : Sorting sequence to be used when calculating aggregates. This important for things like running totals.
Window Frame :

ROWS BETWEEN ... AND ...

Specify the number of result rows to consider, regardless of missing data
Duplicate ORDER BY values are calculated individually
Always Specify Row, it is not the default (Damn you ISO SQL Standards!) but it is much more efficient
UNBOUNDED PRECEDING : From the beginning of the Partition (Default)
UNBOUNDED FOLLOWING : To the end of the Partition
n PRECEDING : From the beginning of the Partition
n FOLLOWING : To the end of the Partition
CURRENT ROW : To the current row (think of a running total) (Default)

RANGE BETWEEN

Specify the value range to consider, this is not fully supported in SQL 2012
Duplicate ORDER BY values are grouped together for the aggregate calculation
Cannot tell the difference between?
Specify the number of result rows to consider (may skip missing values)
Use with extreme caution, it is the default when there is a partition
UNBOUNDED PRECEDING : From the beginning of the Partition (Default)
CURRENT ROW : To the current row (think of a running total) (Default)

Performance

This is NOT syntactic sugar. The query plan has a new operation called Window Spool which allows a "Fast Track" of results (for Linear Scaling) under two very common sets of conditions.

"BETWEEN Unbounded Preceding" looks only at directly previous row instead of maintaining a full list for each record.
The aggregate is a Cumulative Aggregate (such as SUM or COUNT, not a MIN or MAX) it is calculated by using the "Unbounded Preceding" behind the scenes and doing appropriate.

If the Fast-Track conditions are not met it turns from a 2n cost to an n^2 cost.

For indexing the window aggregate a POC pattern should be followed.

P: Partition (Quirk: Is nearly always ASC unless the same column is in the order by as DESC)
O: Order By
C: Coverage (can be in the INCLUDE)

To check for performance issues, use Extended Events or Statistics IO (which show the worktable)

New SQL 2012 Window Aggregates

LAG : Element from previous row in the Partition
LEAD : Element from next row in the Partition
FIRST_VALUE : Element from the FIRST row in the FRAME

Typically the frame and partition are the same when using FIRST_VALUE

LAST_VALUE : Element from the LAST row in the FRAME

Typically the frame and partition are the same when using LAST_VALUE

Paging

The new Offset - Fetch syntax was added as an extension of the ORDER BY clause. The implementation is not SQL Compliant yet because it is missing PERCENTAGE and WITH TIES options still

Performance

It performs similar to TOP with a discard of the first X rows (the offset). This is fine for the first few "pages" of data but does not scale deeply. To scale it, use a CTE on an indexed column and join to that, such as:

WITH Keys AS
(
SELECT orderid
FROM dbo.orders
ORDER BY orderid
OFFSET (@pagenum - 1) * @pagesize ROWS
FETCH NEXT @pagesize ROWS ONLY
)
SELECT
K.orderid
, O.orderdate
, O.custid
, O.empid
FROM
Keys as K
INNER JOIN Orders as O
ON O.orderid = K.orderid
ORDER BY K.orderid

PIVOT / UNPIVOT

This is not part of the SQL Standard. These are syntactical sugar for some common pivot requirements. As a pivot becomes more complex (such as multiple columns) they break down and require a more sophisticated model.

An efficient UNPIVOT can be created with the CROSS APPLY using a VALUES clause (which creates a derived table behind the scenes). This method uses VALUES as a constant scan which feeds the stream aggregate avoiding any worktables.

CREATE TABLE #t (id int , [2008] int, [2009] int , [2010] int, [2011] int)
;
INSERT into #t
values
(1, 100, 101,102,104)
,(2, 200, 201,202,204)
,(3, 300, 301,302,304)
,(4, 400, 401,402,404)

WITH
u AS
(
SELECT t.id, sum(val) AS total
FROM
#t t
CROSS APPLY
( VALUES
(t.id, 2008, t.[2008])
, (t.id, 2009, t.[2009])
, (t.id, 2010, t.[2010])
, (t.id, 2011, t.[2011])
) AS v (id, [year], val)
GROUP BY
t.id
)
SELECT * FROM #t t inner join u on u.id = t.id

Sequences

The new seqeunce object uses the same internals as the identity feature but is more configurable with different defaults.

Defaults

MIN value: Most sequences are not user visible values so it starts at the lowest value possible for the data type.
DATATYPE : BIGINT. This is effectively an alternative to using the GUID keys so it should be support a OT of values. BEWARE! This datatype SHOULD match the datatype in the table (especially if being used as a constraint), otherwise there will be a lot of conversions that may not have been expected.
CYCLE : OFF is the default value but the reason it exists is for IDs used in staging data that have a relatively short lifespan to avoid running out of keys.

Extended beyond the SQL Standard

It is able to be bound as a default constraint on a table. This will help with being able to eliminate the IDENTITY columns without impacting application code or adding triggers (which is the normal model).
There is an "OVER" clause available when generating IDs for a set to ensure the expected sequence.

Notes

This has all of the dangers on the IDENTITY when it comes to sequential values. The process of getting IDs is outside of the transaction, so once it's been handed off it is gone (unless the sequence cycles).
A single sequence can be used across multiple tables but this does increase the gap likelihood.

Merge

This is commonly thought of as an UPSERT but it can do a bit more. It compares the source (stg) to the target (active) and will cause different DML actions to be taken.

The "USING ... ON" clause is used matching the table from the source and left joining to the target. This is not a "WHERE" but left join conditions so it DOES NOT filter the source table. To filter the source table use a CTE as the source.

MS extended this to support executing an action if the record exists in the Target but not in the Source. This is great when doing full table refreshes.

Tips

Use a HOLDLOCK to keep the data consistent because the merge executes multiple actions.
Review any Target tables to ensure there are no triggers referrencing @@ROWCOUNT in the beginning. This value is set for the ENTIRE Merge command, not the single operation. Typically I've seen this used to do a quick eject from the trigger or switch the path from row parocessing to set processing. Look at the INSERTED / DELETED tables instead.
Unlike the OUTPUT from an INSERT statement, you can return the any generated values from both the source and target. If you need these values then make a merge with a non-mathcing condition (1=2).
If you only want to update the table when there has been a change there are two approaches, adding each column to the WHEN or adding an exists to the WHEN with an EXCEPT, like this:

MERGE INTO dbo.Customers AS TGT
USING dbo.CustomersStage AS SRC
ON TGT.custid = SRC.custid
WHEN MATCHED AND
EXISTS (
SELECT * FROM CustomersStage AS S WHERE s.custID = src.custid
EXCEPT
SELECT * FROM dbo.Customers AS T WHERE t.custid = tgt.custid
) THEN
UPDATE SET
TGT.companyname = SRC.companyname,
TGT.phone = SRC.phone,
TGT.address = SRC.address

Misc cool stuff

If the goal is to have a non-deterministic sort but an ORDER BY is required, the field list can replaced by a (SELECT NULL)
Left to Right reading of the Execution Plan is how the API works
The lack of a ; after each SQL Statement is officially Deprecated!
Avoid using IIF and CHOOSE statements. They were added as a stopgap to help with conversions from MS Access. Behind the scenes they are converted into CASE statements by the optimizer.
Interesting trick that can be embedded in other queries to get a delimited list:

SELECT N',' + QUOTENAME(orderyear) AS [text()]
FROM (SELECT DISTINCT YEAR(orderdate) AS orderyear FROM Sales.Orders) AS Y
ORDER BY orderyear
FOR XML PATH('')

Tuesday, November 6, 2012

PreCon Day 1

So the conference began. Registration was a snap, I got a nifty backpack that is big enough to hold my workstation (an Elitebook 8560w) and was remarkably comfortable. I don't think it will replace my Tom Bihn bag for travel but it may be my new commute backpack.

The session I attended today was SQL Server 2012 in a Highly Available World and, while it was not what I expected I found a lot to learn. I was expecting a focus on Availability Groups with maybe a section on Failover Cluster Instances (FCI). Instead this was a much broader session. It started with Log Shipping, Database Mirroring (deprecated in SQL 2012), FCI and then finally Availability Groups. While I didn't want the first bits (Log Shipping has been around forever and Mirroring is going away) it was a good refresher and I learned a lot.

The best part of today's session was working through a problem that the person next to me was having during the breaks. There were two real questions

Can you mirror upstream from SQL 2008R2 to SQL 2012?

It turns out you can, if the patches / services packs are correct. In fact this will be demonstrated later this week in one of the sessions!

Can you combine Failover Clusters with Availability Groups?

Yes, but it prevents automatic failover between datacenters, this may be fixed in a service pack.

I now have an intense need to get VMWare going on my laptop and start messing with cluster configurations.

My notes from this session are below.

PASS Pre-Con 1: SQL Server 2012 in a Highly Available World

This was an overview of the High Availability options typically used in SQL 2012. Topics ranged from the venerable Log Shipping to the newer and shinier Availability Groups.

Log Shipping

What happens when the database is set to "Bulk Logged" mode and log shipping is enabled? It turns out that the data never makes it over to the target system unless it is updated afterward. This is a good example of why data should only be bulk loaded into staging tables, cleaned and moved to final tables using normal, logged commands.

Test log shipping Disaster Recovery (with an outage window)

Server 1: "BACKUP LOG ... NORECOVERY"
Server 2: "RESTORE LOG ... WITH RECOVERY"
Do stuff (nothing bad) and flip it back
Server 2: "BACKUP LOG ... NORECOVERY"
Server 1: "RESTORE LOG ... WITH RECOVERY"

This maintains allows you to maintain the log chain and test DR as long as only legitimate transactions happen as part of the test.

Mirroring

The reason that this still matters is that is was JUST deprecated. That means it will still be there for at least the next two releases (SQL Server 2014 and 2016 presumably). The last version of SQL Server with mirroring will be supported for at least 10 years after release, that means it will be 2026 before it will be gone.

Unsupported DB Objects

FileStream
FileTable

Endpoint

The Log Stream endpoint hangs out on port 5022. This endpoint will ONLY talk to other SQL Servers and is always encrypted, typically by a server signed certificate.

Concerns

If everything is working well, Synchronous replication is expected to only add 5ms to the DML commands at distances < 100 km. If the distance is > 100km then Asynchronous Replication is recommended. The Mirroring overhead time is viewable in the Mirroring Monitor (which presumably fronts a DMV).

The greatest visible lag in mirroring is when the application realizes that it must switch to the secondary node. The application needs to have a connection timeout happen (5 - 10 secs) before flipping to the next server (which will of course work).

Mirroring is not impacted by single user mode (other than nobody doing transactions) because it is using a log reader and is not "in" the database.

Always On - Clustering (Local)

Cluster Configuration

The dreaded Hardware Compartibility List (HCL) is gone! The reason for this is that the storage options for the cluster now include network shares (not just iSCSI). Now who in their right mind thinks that network shares over SMB are a good idea for the DB? Well the enhancements to SMB (starting in 2.2 with Win 2008R2 but much improved in 3.0 with Win2012) are pretty amazing. They have removed the overly chatty aspects of the protocol and made it a viable alternative to SAN storage. There is a lot of value going Win2012 instead of Win2008R2 for the SMB 3.0 enhancements.

Sadly, the validation wizard still needs to execute to make sure that the node can be part of a cluster. The good news is that it is a lot less invasive (doesn't do things like take SAN Luns offline). It is still slow, while a 2 node cluster may take 10 minutes, larger ones can take a weekend or longer.

When setting up the cluster, if you are not a domain admin you will need to either get access to add computers to AD or get the new cluster name added to the directory before installing.

The heartbeat for the cluster is using IPv6, if it is not a dedicated connection between the servers the network must be able to accommodate it (it's 2012, probably time to update to IPv6 anyway).

Windows Core is completely supported for the database engine. An interesting thing about core is that (in case of emergency) you can now activate the full GUI when required and then flip it off again when done. The core instance actually keeps all the DLLs and extra files on disk, but in a packed form that is unpacked (into memory?) when needed.

Best Practice tip: Try to have your resource group name match the network group name for your cluster. It will improve manageability a lot.

Quorum Modes configuration

Node Majority Set should be chosen for large clusters (> 2 nodes) with an odd number of nodes
Node and File Share Quorum should be chosen for large clusters that may not have an odd number of nodes
Quorum Drive can be used for 2 node clusters but remember, this is a single point of failure.
Win2012 improvement: Dead Nodes are removed from voting after they are kicked out of the cluster so the quorum requirement is recalculated.

Database Cluster Configuration

TempDB can finally use direct storage! This allows the use of local SSD or even Fusion-IO cards (gasp!). This can be a lifesaver in systems with heavy TempDB usage.

The number of available SQL Server nodes per cluster is getting absurd in Enterprise Edition (but is still stuck at 2 for Std and the new BI Editions). The cap is somewhere around 64, but is really limited by the number of drive letters (26 - A,B,C,D = 22). Technically, this can probably be worked around with mount points but requires further research.

One thing that seemed to cause confusion was the idea that even named instances can be treated like a default instance if you left it hanging around on port 1433. I've always thought of instance naming as just a human readable shortcut for remembering the port number (I know for other protocols the name is more important but most systems are TCP/IP focused). So, if the instance is locked to port 1433, then when you try to connect without an instance name that is the door it knocks at.

When reviewing the cluster resource groups make sure that the mount points are explicitly listed and set as proper dependencies, this does not happen automatically at this time. This can be complex for mount points because the label can be a bit obscure, use the Disk ID to match the volume in Disk Management to the one in the Cluster Manager.

Slipstream installs are now deprecated. I didn't see that one coming. The reason for it is that every patch that comes out need to get an updated slipstream image generated. Now you can store the patches in a centralized location and point the installer at it. It will pickup and apply the patches automatically. This is a big improvement if you haven't automated your image generation yet, otherwise it's a minor change.

Only the Database Engine, Analysis Services and SQL Agent are cluster aware. The other SQL Server goodies are STILL not cluster aware. This includes the MDM solution (MDS + DQS).

Always On - Geo Distributed Clusters

There have always been a LOT of obstacles when doing Geo-Clustering, primarily around moving the data and managing the network configuration. As far as I know, MS is not touching the data side, they are leaving that to the SAN experts for this option. However, MS now has options for Multi-Subnet Failover where the IP Address can be set dynamically based on the available subnets. This feature may stop the network admins from wanting to kill you.

Always On - Availability Groups

Different from Mirroring

This is really a fork of the mirroring technology used previously. It is still an encrypted transaction log stream replayed from the primary to the secondaries with the same overhead.

There can only be a total of 4 replicas of any single availability group.

Synchronous Replicas are limited to 2
Asynchronous Replicas are limited to 4

This means you can have 4 Async Replics or 1 Sync + 3 Async or 2 Sync + 2 Async but NOT 3 Sync + 1 Async.

Improvements

Read-Only replica (independent statistics kept in tempdb)
Routing based on application intent
Increased number of worker threads
Increased cap from 20 to 200 Replicas / instance
Replication Publishers and Subscribers are supported
Filestream / Filetable
Extended events are supported

Limitation

Cannot be used across domains
Does not support the replication distributor database
No SQL Server Profiler Events are available for Availability Groups

Fancy stuff. You can set your read-intent route to point at a load balancer that will bounce you between several read-only instances if desired.

FCI + AG

The combination of Failover Cluster Instances can be used in conjunction with Availability Groups for a very comprehensive solution (licenses may get pricey though). The sequence for doing this is:

Install Windows Cluster Services
Install SQL Server Cluster
Enable Always On Availability Groups
Create / Restore Databases
Configure Availiability Groups

SQLCAT has a whitepaper on implementing this combined cluster.

Backups

Full and Differential backups shold be taken from the primary
Transaction Log backups can be taken from the secondary. The info is pushed up to the primary so it knows that the logs have been archived.
Only maintenance plans (and Ola's script) support backup rules so far.

Connecting

An Availability Group Listener is created (more Active Directory fun) and all systems should connect to that name.
The Availability Group Listener is not supported by the SQL Server Browser and is recommended to live on port 1433 to avoid confusion.
If there is a read intent connection and no secondaries are available to service the request it will execute on the primary.

Monday, November 5, 2012

PASS Summit 2012 Pre-PreCon

I flew into Seattle last night and I'm here for PASS Summit 2012 this week. I can hardly wait for things to get going (hence blogging at 5am).

Why is it so exciting? Well this is what I liked so much last time.

This is a SQL Server conference, but it is not run by Microsoft and SQL Server (and Data in all it's varied forms) is the ONLY topic. This means that the party line was not always followed and DB folk were not playing second fiddle to Developers or Server Admins.

The presenters were all top-notch. Not just good presenters but some of the best SQL Server presenters from across the country (world?) are here. They are experts in their arena, not only specific parts of SQL Server but on the broader topics of data modeling, data integration, data warehousing or enterprise data architecture. There were discussions that continued long after the sessions were over and private dialogues sometimes continued past the conference over email.

The people, not just the presenters were great too. There is something special about being surrounded by a totally new group of peers with different and exciting experiences. Everyone was having fun sharing war stories about the weird corner-cases that were encountered.

I learned more from that week at PASS Summit 2009 than I ever expected.

This week is gonna be great.

Sunday, October 7, 2012

Some, Any and All oh my!

Tales of the exotic and strange

When discussing some of the strange parts of a "new to me" system, a co-worker said "That's nothing, our queries have rules where the data has to meet Some, Any or All criteria!"

Now this sounded completely absurd to me so I just let it slide by, but it got me thinking. There are a number of unusual ANSI operators that appear to be logically redundant, maybe this is real.

Sure enough, it is real.

What is surprising, is that isn't an Operator but instead is a Modfier to existing operator.

Some / Any

Some and Any are synonyms. If the subset (query, or static list) has any records that meet the criteria records are returned.

Isn't that an "IN"?

Nope. "IN" is an operator, and would be the same as the combination "= ANY" but because this is a modifier you can pull some neat tricks like ">= ANY". This is handy for checking things like security levels or general product availability.

Example:

SELECT firstName, lastName

FROM employee AS e

WHERE

e.secId >= ANY

(

SELECT minSecId

FROM securedAssets AS sa

WHERE

sa.region = 'Inventory Mgmt'

)

;

All

All is another one with some interesting rules. Everything that is returned needs to meet the criteria. This is pretty terrific for things like being able to test if there is enough inventory to build an assembled item. You know if you are building 3 widgets, you need to have at least 3 of each part.

Example:

SELECT widgetName

FROM widget AS w

WHERE

3 >= ALL

(

SELECT availableWidgetParts

FROM widgetParts AS wp

WHERE

wp.widgetId = w.widgetId

)

;

Conclusion

Now this is a nifty feature, and there are plenty of other ways to do this for the majority of cases. I'll probably lump it in INTERSECT and EXCEPT as part of my seldom used, exotic query techniques.

Saturday, May 19, 2012

Data Modeling Rematch!

So I've got a lot of data modeling ahead of me at work and I need tool. The system was being tracked in Visio and that doesn't scale, for the reason I've noted before. So I did a breakdown before of the choices but I wanted to address my specific needs this time.

I like ModelRight, a lot. If you read the review you know it has an excellent working style. The mode-less property boxes are incredibly efficient and let you make changes remarkably quickly.

I also really enjoy ER/Studio Data Architect (referred to as ER/Studio thoughtout this post). The User Interface is extremely smooth and responsive. I do find that the modal dialog boxes slow me down a bit.

So I've been doing a bake-off between the two. They both support the same core functionality really well (forward and reverse engineering) and are fairly stable.

I'm working with a larger (~700 table) model and seeing how it does.

Data Modeling

So which is faster for actually developing a new model?

ER/Studio

This is very much an awkward process of dragging the table on to the screen, double-clicking it and adding entries to the table. There are custom domains that can be defined to group similar types together and provide guidelines to improve consistency. The tabs can be awkward and it feels like a very stilted process.

ModelRight

This is a bit more natural because after you drag the table over, you can start editing on the canvas or go to the docked modeless window. The same domain features are available to provide consistency for your columns and types. The modeless windows really do make this a much faster process to create and modify the tables.

Model Layout (Reverse Engineering)

ER/Studio

ER/Studio offers several different layout models to improve layout accuracy

Circular
Orthoganal
Hierarchical

These help to various extents based on your sub-model contents.

The canvas the ER/Studio has is unmatched for it's smoothness when manipulating the entities and adjusting the layout.

ModelRight

ModelRight only has auto-arrange, with an adjustable number to indicate spacing.
ModelRight does it better (after fiddling with the number).

The diagram needs less adjustment for clarity even if the UI isn't as smooth as ER/Studio offers.

Reporting

It's important to be able to distribute the model. Typical options for this are JPG, XML, HTML and RTF.

ER/Studio

JPG: Good
XML: Limited information is available but ok
HTML: Good. It's a nice tree structure and they provide a fairly good interface for it
RTF: Good. It's a word doc, nothing super fancy but it is pretty clear.

ModelRight

JPG: Good
XML: Good. Super Detailed info in this file
HTML: Brutal to setup. Basically it is an XML export that expects you to transform it with an XSLT. Complete Customization, but out of the box it is weak. On the other hand, the help file says that ModelRight support will help you craft any XSLT you need.
RTF: I don't think it's offered. May be available as an XSLT from support.

Sales and Support

ER/Studio

I've been going through the Embarcadero sales rep whenever I needed anything. There was some initial lag on responsiveness but when it was clear I was serious, things progressed well. She brought engineering on to a call to demonstrate functions they felt were superior to their competitors. It was very interesting and involved a lot of add on products.

I also contact support for two incidents. There was about a one day lag between requests for help and responses. Neither incident was resolved but the staff was courteous and tried to help.

ModelRight

The sales rep has been very responsive through email, but that was the only communication channel. I questioned her about features in the next two versions and the answers were useful without over promising.

Support has been remarkable. The support team was a bit terse in it's replies to my questions but always responded quickly. They went so far as to make an application build with fixes for bugs a couple of issues I found the same day. This support is amazing considering I'm still running the trial version.

Defaults

I'm not even sure if the section should be in here but I think it is important from the standpoint of balance.

ER/Studio

The defaults here are so good that I didn't even realize that a can configure a lot of the options to do something different. A lot of thought went into the initial configuration of out of the box.

ModelRight

This feels like it was developed by programmers. For example

The screen defaults to being overly crowded (even at 1600 x 900) with too many widgets active by default.
Reports REQUIRE an XSLT to be legible outside of Internet Explorer.

On the other hand, this leads to things like

A complete XML export of the model is available and can be diff'ed with older versions
Quick short-cut to take a JPG of your diagram
Ridiculously powerful scripting

Conclusion

ER/Studio is an excellent, polished tool. I have the budget to purchase it for myself and my team but not all of the secondary features. I could not get the repository or portal functionality which is what I need to be able to distribute the model to the development team.

ModelRight has similar (or better) features when compared to the baseline ER/Studio package. The price is about half but it also includes a free viewer and more flexibility if you are interested in putting a bit more work into it. Future enhancements slated for ModelRight 4 look very impressive and I'm glad to be on board with them for this project.

TLDR: ModelRight wins! Embarcadero has a lot to worry about with ModelRight. I will be using them as my data modeling platform of choice for the foreseeable future.

Wednesday, May 2, 2012

Sure it scales UP, but does it scale DOWN?

I've recently run into an interesting issue, scalability doesn't necessarily go both ways.

Now, intellectually I understand that you make trade-offs to scale to millions of users but that doesn't help my "gut feeling" that I am doing something wrong when I accept the trade-off.

Let's take the classic example of triggers. Triggers tend to be very wrong when it comes to performance, however they are highly effective at securing the data and ensuring business rules are followed. I weigh the need for triggers very carefully before I add them to any table. This can be much more effective than adding the logic to every system that must integrate with the database.

Small databases allow you to pull all sorts of fun tricks that you would never consider otherwise.

Performance concepts do not always scale down to provide the flexibility that is needed on some projects. I've been measuring system with 500GB as a baseline for a while now and forgetting about the ~100GB options.

Right now, I'm looking at different ways to do row, column and cell level security. Almost none of these are viable in a large database. In fact, most of these are not even viable in a "small" database but this gives me a lot of freedom to experiment with concepts, which is fun.

Saturday, April 7, 2012

Change, it's inevitable

I'm in the middle of changing jobs, I mean it literally. The last day for my old job was Friday, the first day of my new job is Monday, and it's Saturday night.

There is much to miss with my old position. I had the chance to participate in changing the core data strategy and the luxury of doing it with the right people. I can't stress this enough, the right people make the difference. I can't mention leaving without mentioning them because they are an incredible group.

When I announced my plans to move on, it was possible to promote from within because every had grown so very much. They are all brilliant and I will miss working with them.

So now I'm switching my role up a little bit.

I'm moving on to more of a straight Data Architect position instead of Enterprise data. I have the chance to build out a number of interesting "cloud" products that will have several very serious, real world applications.

There is a good bit of greenfield development and different agile techniques I expect to run into. I'll post as I learn new and interesting things.

Saturday, February 18, 2012

The State of Data Modeling Tools (Conclusion)

After reviewing a bunch of the data modeling tools out there, I've come to the conclusion that there are some great tools out there, some workable tools and some tools to avoid.

Data Modeling Tools Result Summary

CA ERwin

This is a toolset that has been in the decline for while. CA is trying to pull it out of the nose-dive it's in but it has a legacy code-base without a lot of the original team around to explain it. As a long term system I can't recommend it. It will continue to go downhill.

Visio 2010 Professional

Fine for small models but it will quickly become to unwieldy with a significant model. Microsoft doesn't seem interested in this aspect of the tool market and are not improving Visio for it.

MySQL Workbench

Stability is important. It's free but my time is not. I don't recommend this tool.

SQL Power Architect (Community Edition)

This tool needs better integration with the actual database. It has the most potential of any open source modeling tool I've seen.

Embarcadero's ER/Studio Data Architect (Trial)

This tool works flawlessly. If you can afford it, buy it. They've been doing this for over 15 years and have built an entire product line around modeling tools.

Toad Data Modeler (Trial)

I admit that the lack of stability irritated me greatly but I don't think that this product has improved much over the last few years and it may be on life support from Quest. Not recommended.

ModelRight Professional (Trial)

For the price (roughly 25% of ER/Studio) this tool works amazingly well. The only thing it lacks is logical modeling, if you have one platform (like most folks) this is a great choice.

I hope that some of you found my research helpful. Your great comments help me remember to post. :-)

ModelRight Professional

This is part of continuing set of reviews / opinion pieces about the state of data modeling tools that began over here.

Background

I was talking to a co-worker who was at LogicWorks before they were purchased by Platinum Technology (who were later purchased by Computer Associates) and he mentioned that there was a new ERD company on the block, ModelRight. It turns out that some the original Logic Works folks put this together and I gave it a whirl. It's UI is a bit different but worth investigating further.

Stability

This ia another product I used frequently over the evaluation period. The small model I tested with (< 50 table) worked great and the application was fast and responsive even after many days of use.

Price
ModelRight Professional: ~$1,095 ($895 + $199 maint)

ModelRight Professional for MySQL: ~$494 ($295 + $199 maint)

Features
Common Checklist

Logic Modeling: No
Physical Modeling: Yes
Reverse Engineering: Yes
Forward Engineering: Yes

Compare against files and live databases.

Sub Models: Yes

The UI is extremely responsive, not only do you grab things when you are intend to, but all of the properties for the selected item are immediately visible. This is because the properties are not in windows that popup but are in a docked pane. I like this very much but it requires a bit more screen space, which makes working with it on a laptop awkward. After I got used to it this made working with the model unexpectedly more efficient.

Reversing out the DB worked as expected and the submodel management was intuitive. Just drag the entity you want to add to a submodel into the model pane and it was added. Slick, fast and efficient.

The compare and forward engineering created some of the most efficient scripts of all the modeling tools I've used. They actually demonstrated features I didn't know about on MySQL and worked very well. I don't think I had to re-write any of the generated scripts.

Conclusion

If you have a single DB platform, ModelRight is definitely a modeling tool to consider. It is the only one that actually made me more efficient than I expected. The SQL it generated was was excellent and it was rock-solid.

Toad Data Modeler

This is part of continuing set of reviews / opinion pieces about the state of data modeling tools that began over here.

Background

Toad is legendary for it's Oracle database management software. When I started at my current job I was asked to evaluate their SQL Server tooling. My evaluation was that the SQL Server tools were not fully baked and that Toad Data Modeler treated non-Oracle DBMS platforms as second class citizens. That was 4 years ago, Quest has put a lot of resources into the Toad family since then, lets see if Toad Data Modeler (TDM) has improved.

Stability

I used it for a few days to make my evaluation. The system only supports 25 entities in trial mode so I had to limited my model pretty severely. I didn't realize that I hit the limit until I was using it for a while, and it crashed. As soon as I hit the limit, the app stopped saving and hours got lost.

Later on I used the application for a few hours and it crashed again. Luckily, because the model was kept small I was able to use my last save.

Price
Toad Data Modeler: ~$450

Features
Common Checklist

Logic Modeling: Yes
Physical Modeling: Yes
Reverse Engineering: Yes
Forward Engineering: Yes

Does not compare against the live DB, only against previous models.

Sub Models: Yes

The UI is reasonably responsive.

Reversing the DB works ok but the arrangement leaves a lot to be desired.

The model compare and forward engineering works ok, but not great. Not being able to compare against a live DB leads to issues.

Conclusion

Toad Data Modeler is not stable. I can't stress the value of stability in tool selection. I really didn't enjoy using the tool and cannot recommend it.

ER/Studio Data Architect

This is part of continuing set of reviews / opinion pieces about the state of data modeling tools that began over here.

Background

Embarcadero Technologies started off with making DB development and management tools. I ran into one of their tools (DBArtisan) in 1998 for managing my Sybase instances. At my next job I insisted on getting the full suite of Embarcadero Software. And my next job too. Unfortunately after that I wasn't able to justify the cost and it has been 7 years since I used the software. Did it age poorly, similar to what happened to ERwin? Let's see.

Stability

I used it nearly every day during the evaluation period, testing out a number of the different options. There were no problems with the small model I tested with (< 50 table) while it remained fast and responsive over many days being open and used.

Price
ER/Studio Multi-Platform: ~$4,615.00 ($3,700 + $915 maint)

ER/Studio Open Source Databases: ~$1,360 ($1,100 + $260 maint)

Features
Common Checklist

Logic Modeling: Yes
Physical Modeling: Yes
Reverse Engineering: Yes
Forward Engineering: Yes

Compare against files and live databases.

Sub Models: Yes

The UI is responsive, when you grab items they do what you expect them to. It's very similar to ERwin but the UI is much more responsive and stable.

After reversing out an existing database, the initial entity arrangement is surprisingly good, they have clearly put a lot of effort into it. There are several layout options and all of them work pretty well. Ultimately you still need to re-arrange the model a bit, but far less than any other tool I've used.

The Database compare and forward engineering makes very usable scripts that respect the integrity of the original model.

Conclusion

ER/Studio is an excellent product. There are add-ons to manage the model with a centralized repository and share database development tasks. There are also tools to establishing and maintaining conventions throughout your diagram.

In conclusion, ER/Studio is a great product, the price is high but it works flawlessly.

Saturday, January 7, 2012

Visio 2011 (ERD)

This is part of continuing set of reviews / opinion pieces about the state of data modeling tools that began over here.

Background

Visio is the default business diagramming tool in most organizations. I've used it off and on for after-the-fact documentation but never to create the actual model. The newest version of Visio (2010) included a slightly remodeled UI and some tools to help with choosing colors (surprisingly, that is very useful to me).

Stability
It's Visio, it works well for the small model I tested with (< 50 table) and had no stability issues.

Price
Visio Professional: ~$400.00

Features
Common Checklist

Logic Modeling: No
Physical Modeling: Yes
Reverse Engineering: Yes
Forward Engineering: No

Note, there is a plug-in created by Alberto Ferrari and hosted at CodePlex which does a non-comparing forward engineer of the DB

Sub Models: Yes

Once you get past the Visio UI quirks, documenting an existing system using Visio is a breeze. Yes, extra clicks are required everywhere. Yes, the relationship link points are limited and not fun to reset.

The initial arrangement of the entities is pretty bad, but after using SQL Power Architect and MySQL Workbench I can appreciate that the layout is not awful.

Microsoft has a team of writers maintaining the help system and there are several good web resources on building ERDs using Visio so there is plenty of available documentation.

The lack of a built-in forward engineering function can be worked around (as noted above) but it does not compare to the actual database.

Conclusion

I have used Visio to document data models in the past and I maintain that it can do this at an acceptable level.

Also Visio can be a good way to get down some initial thoughts on a model if that is all you have available to you.

I do not recommend using Visio for ongoing, iterative data model development. Forward engineering without a comparison function makes this work too awkward.