Monday, March 30, 2015

EDW 2015: Data Governance Organization

Pharmaceutical model
Janet Lichtenburger and Nancy Makos presented a terrific view of how Data Governance works at their company. They implement functional (small) data governance, with the exception of data sensitivity.

Here is my brain dump of the session today. You're welcome to it if you can make heads or tails of it. :-)

Some numbers that show the challenge of Data Governance there are:
  • 5 people in Enterprise Architecture
  • 3 people in Data Governance
  • 130 people provide Data Stewardship support
  • 350k total employees

Data Governance is about Policies and Standards and is typically independent of implementations, as part of the Enterprise Architecture or Finance groups.

To encourage adoption, Data Governance could be considered an internal consulting service that supports projects and is not charged back.

Organizational Model
Overall Model
  1. Enterprise Data Governance Executive Committee
    1. Meets only a few times a year
    2. Limited number of senior executives
  2. Data Governance Committees
    1. Each committee is chaired by Data Governance
    2. Divided into specific domains
    3. Meets as often as required by projects
    4. Up to 15 people on each Domain Specific Committee.

Data Steward Roles
  • Executive Stewards
    • Member of the Data Governance Executive Committee
    • Strategic Direction
    • Authority
  • Enterprise Stewards
    • Member of the Data Governance Committee
    • Development Support of Data Governance Policies
  • Operational Stewards
    • Communicates to promote policies
    • Endorses Data Standards
  • Domain Stewards
    • Recommends canonical structures
    • Endorses Data Standards

However the model is constructed, it must make sense for the business. All new policies should be evaluated with a cost / benefit analysis; the exception is regulatory requirements. Regulatory and legal compliance are critical to avoid jail.

The goal of policies is to drive behavior changes needed for Enterprise Information Management to succeed.

People generally don't like change. A way to get buy-in is to amend existing processes instead of creating new ones; these are smaller, less intrusive, and the stakeholders have already been identified. This also leads to more partnerships and shared endorsements of the changes.

There are typically a small number of extremely high-value areas; policy should focus on those.

Align with the Enterprise goals and other Enterprise-wide groups; there are a lot of shared concerns and ways that the teams can support each other.

Keep the policies easily accessible, do not hide them in a 500 page volume, instead keep them somewhere easily discoverable, such as a wiki on the corporate intranet.

Policies that are overly broad and not enforceable can quickly cause legal / compliance problems. In those cases, no policy is a better choice than an unenforceable one.

Data Classification
The source and type of data both define the data classification required. Additionally, data from several, more open sources can be combined in ways that escalate the protection required.

  • Restricted
    • Financial information, such as credit cards
  • Protected
    • Regulated information, such as HIPAA data
  • Private
    • Named Persons
  • Internal
    • Business Data
  • Public
    • Everything Else

The data classification levels are combined with Information Security levels for systems to identify where the data is able to be transmitted.

Anonymizing Data
Safe Harbor
Remove 18 identifying attributes from the data, which renders it fairly useless.

Expert Determination
An expert certifies that the probability of re-identifying an individual from the available data is sufficiently low.

There are tools out there, such as Parit(?), that are capable of doing this automatically after a survey and analysis of the data.

Standards Examples

USPS Publication 28 specifies postal addressing standards
ISO 15836 (the Dublin Core metadata element set) provides standards for tagging unstructured documents
ISO/IEC 11179 parts 4 and 5 have naming standards for business metadata

Sunday, March 29, 2015

EDW 2015: Data Strategy

This session was presented by Lewis Broome (from Data Blueprint) and Brian Cassel (from the Massey Cancer Center).

Lewis presented his strategic model and roadmap using the case study of a logistics company implementing an ambitious project.

Brian spoke about the challenges he faced within the cancer center as he implemented analytics across data hidden in silos. These were primarily cultural, but once he was past that he was able to use the existing data analytics hub to build a specialized data mart to support strategic review of the data.

This was really great stuff and my summary doesn't do it justice by a long shot.
Data Strategy
Business Needs
In order to get anywhere with discussions about data and ways to improve it throughout the organization, the value of the effort has to be made clear. Clean data may seem like the most obvious need in the world, but that view is too low level to make it onto the radar of senior management. Instead, it needs to clearly address a business need.

There are three aspects to consider
  • How will it mesh with the company Mission and Brand Promises?
    • Ex. FedEx: Your package will get there overnight. Guaranteed.
  • Does it improve the company's market position / provide a competitive advantage?
    • Michael Porter's Market Positioning Framework and his Competitive Advantage Framework provide a good way to think about this.
  • Will it improve the operating model and support the company's objectives?
    • Operating models improve by changing the degree of business integration or standardization.

If the data changes do not address any of these areas, it will not gain the support needed to succeed.

New capabilities that do not meet a business need aren't a program, they are a science project.

Current State of the Business
The current state assessment looks at
  • Existing Assets and Capabilities
  • Gaps in Assets and Capabilities
  • Constraints and Interdependencies
    • This can be the toughest stuff to identify.
    • BEWARE SHADOW SYSTEMS: typically Excel spreadsheets with macros or Access databases fixing data before feeding it into the next step of a process.
  • Cultural Readiness

Cultural Readiness
Cultural Readiness depends on 5 different areas
  • Vision
    • A clear message of what the program is expected to achieve
  • Skills
    • Ensure that the right people are part of the program
  • Incentive
    • The value and importance of the program should be clear to all of the participants
  • Resources
    • Backing the program will require more than just good will, tools, environments and training may all be required
  • Action Plan
    • The system boundaries being developed should be clearly defined

Capability Maturity Model Levels
  1. Starting point
    1. There is some data in a pile over there.
  2. Repeatable Process
    1. This is how we sweep the data into a pile and remove the bits of junk we find.
  3. Defined Process
    1. Sweep from left to right, avoid the dead bugs. Leave data in a pile.
  4. Managed Process
    1. The entire team has the same brooms and the dead bugs are highlighted and automatically avoided by the brooms.
  5. Optimizing
    1. Maybe we can add rules to avoid sweeping twigs into the pile as well.

The Roadmap establishes the path of the Data Management Program to achieve the strategic goals.

Leadership and Planning
    • Planning and Business Strategy Alignment
    • Program Management
    • Clearly Defined Imperatives, Tactics and KPI
    • Accountable to CDO
Project Development
    • Outcome Based Targets
    • Business Case and Project Scope
    • Program Execution

Project Model
Big projects tend to fail, at least twice and sometimes more than that, as the business learns what it really needs.

Always start with crawling and walking before going to running.
  • Governance should start with a small 'g', where it matters most. There are commonly 5-10 critical data elements, take care of those before setting targets higher.
  • Data Strategy as a top-down approach works best. Otherwise it is uncoordinated and is only capable of supporting tactical initiatives.
  • Data Architecture must focus on the business needs, not individual systems or applications.

Saturday, March 28, 2015

Questions and where to seek answers

There always seems to be an unending stream of questions.

In the past, a good number of them were about things like "How can we use this shiny new feature?" or "What are the best practices for these scenarios?".

However, times change and so do the questions.

Microsoft has been following the lead of the SQL Server community and steadily become more open with road maps and dialogs. We understand the platform and internals better than ever before, making it easier to address concerns.

Now the questions that need to be addressed are more strategic than tactical. Strategies, by their very nature, make it harder to find peers with whom to discuss successes and (more importantly) failures in plans and their implementations.

I'm attending Enterprise Data World this year to have a chance to discuss ideas, strategies and technologies to bring new insights and make things run better than ever.

This will be fun!

Saturday, November 10, 2012

PASS Summit Day 3

The last day of PASS is pretty exhausting. My head was still spinning with ideas from the previous days and the tracks were a bit less balanced. To let the ideas settle in I investigated some different areas.


Microsoft is really getting into Hadoop in a big way. I am glad that they have partnered with Hortonworks (the original Yahoo Hadoop team) for the Windows implementation. Microsoft is making a serious attempt to be a good open-source citizen and is actively contributing back to the Apache project. Their Hadoop product now has a name, HDInsight, and some interesting characteristics, such as:

  • Available via NuGet
  • Web Management console
  • Map-Reduce can be written in C# and JavaScript
  • ODBC Driver to connect to Hive
    • Normal .Net
    • LINQ 
    • Excel (with a friendly plug-in!)

There are two versions announced

  • Azure Edition
    • This version actually substitutes the Azure BlobStore for HDFS. Presumably this was done because they share many characteristics, the BlobStore is already tuned for the MS datacenters, and it has somehow avoided the NameNode issue.
    • A popular model is to gather data (such as web or application logs) into the blobstore and on a regular schedule spin up a large HDInsight cluster, run the analysis and then spin it back down. This technique is used by Halo 4.
  • On-Premises Edition
    • This sounds as though it is targeted at preconfigured Hyper-V Virtual Machines (I may be way off here).
    • The filesystem is HDFS, not NTFS or a version of the BlobStore
    • Enterprise infrastructure support coming
      • Microsoft Virtual Machine Manager
      • System Center HDInsight Management Pack
      • Active Directory for unified authentication

Additional features (maybe)

  • Scheduled
    • Oozie Workflow Scheduler
    • Mahout
  • UnScheduled (but popular requests)
    • SSIS Integration
    • SQL Server Management Studio


The Azure data integration story continues to improve. The model for moving data is to use either SSIS or Data Sync. However, the SSIS development best practices are a little bit different. To transfer data efficiently there is a combination of things to do:

  1. Compress your data being sent over the network
  2. Parallelism is good; adding threads will reduce the transfer time
  3. Robustness must be maintained by the package. A "blip" in the connection cannot be allowed to force the entire operation to restart; that is a long and painful process, and the package should be able to pick up where the blip happened. This is achieved by using a loop (with sleep) to keep retrying, sending small sets into staging tables and then pushing loaded sets from staging into final tables (a rough sketch follows this list).
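
Here is a minimal T-SQL sketch of that retry-with-staging idea. In practice this logic lives in the SSIS package; the table names, batch boundaries and delay below are my own placeholders, not from the session.

DECLARE @attempt int = 1, @maxAttempts int = 5, @done bit = 0;
DECLARE @batchStart int = 1, @batchEnd int = 10000;   -- hypothetical batch boundaries

WHILE @done = 0 AND @attempt <= @maxAttempts
BEGIN
    BEGIN TRY
        -- Send one small batch into a staging table; a connection "blip" only costs this batch.
        INSERT INTO dbo.Orders_Staging (orderid, orderdate, custid)
        SELECT orderid, orderdate, custid
        FROM dbo.Orders_Source
        WHERE orderid BETWEEN @batchStart AND @batchEnd;

        SET @done = 1;
    END TRY
    BEGIN CATCH
        SET @attempt += 1;
        WAITFOR DELAY '00:00:30';   -- sleep, then retry the same batch
    END CATCH
END;

-- Once all batches have landed, push the loaded set from staging into the final table.
INSERT INTO dbo.Orders (orderid, orderdate, custid)
SELECT orderid, orderdate, custid
FROM dbo.Orders_Staging;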


Encryption is becoming more of an issue as PII regulations become more prominent. Some things to keep in mind when planning security

  • Always salt the hash and encryption (a small sketch follows this list).
  • Be aware of CPU scaling concerns
  • Encryption makes everything bigger
  • Only 4000 characters at a time work with hashing and encryption functions built into SQL Server.
  • Encrypted and hashed values will not compress well
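
For the salting point above, a small sketch of what that might look like with the built-in functions (the value being hashed and how the salt is stored are my own assumptions):

DECLARE @salt varbinary(16) = CRYPT_GEN_RANDOM(16);   -- per-value random salt
DECLARE @ssn nvarchar(11)   = N'123-45-6789';

-- Store both the salt and the salted hash; the salt is needed to re-compute the hash later.
SELECT @salt AS salt,
       HASHBYTES('SHA2_256', @salt + CAST(@ssn AS varbinary(4000))) AS salted_hash;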

What types of encryption should be considered?

  • Field Level
  • Data at rest (on disk or on backup)
  • Connection Level
  • SAN level

Field Level

At the field level this can get complex because encryption is managed either in the DB or in the application. Each has tradeoffs (a rough example of the DB-managed option follows the list):

  • The DB is hard to scale out by adding nodes if it is responsible for all encryption. 
  • The DB could manage to invisibly decrypt columns based on account permissions
  • The application would have to wrap all sensitive data with encrypt / decrypt logic if it was responsible for doing it.
  • The application can scale out to additional servers if needed to distribute the CPU load.
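
As a rough illustration of the DB-managed option, SQL Server's symmetric key functions can handle the encrypt / decrypt work. The key, certificate, table and column names here are invented, and it assumes a database master key and the certificate already exist.

-- One-time setup (assumes certificate CustomerCert already exists)
CREATE SYMMETRIC KEY CustomerKey
    WITH ALGORITHM = AES_256
    ENCRYPTION BY CERTIFICATE CustomerCert;

OPEN SYMMETRIC KEY CustomerKey DECRYPTION BY CERTIFICATE CustomerCert;

-- Encrypt on write
UPDATE dbo.Customers
SET ssn_encrypted = ENCRYPTBYKEY(KEY_GUID('CustomerKey'), ssn_plain);

-- Decrypt on read; permissions on the key and certificate control who can do this
SELECT custid, CONVERT(nvarchar(11), DECRYPTBYKEY(ssn_encrypted)) AS ssn
FROM dbo.Customers;

CLOSE SYMMETRIC KEY CustomerKey;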

Data at Rest

When the data is at rest the database can encrypt it with minimal overhead using Transparent Data Encryption (TDE). This does mean that data is not encrypted in memory, so an attacker could view data sitting there through an exploit. TDE does make some changes silently: Instant File Initialization becomes disabled as soon as TDE is initiated, and the shared database TEMPDB also must become encrypted.
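
Turning TDE on is only a few statements; a minimal sketch, assuming a master key and a server certificate called TDECert are already in place and using a made-up database name:

USE MyDatabase;

CREATE DATABASE ENCRYPTION KEY
    WITH ALGORITHM = AES_256
    ENCRYPTION BY SERVER CERTIFICATE TDECert;

ALTER DATABASE MyDatabase SET ENCRYPTION ON;

-- Check the state; tempdb will also appear here as encrypted once any database uses TDE.
SELECT DB_NAME(database_id) AS database_name, encryption_state
FROM sys.dm_database_encryption_keys;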

Backup encryption is critical because there is no telling what will happen when tapes leave the security of a private datacenter and go touring through the outside world.

Connection Level

There are two ways to make sure your connection is encrypted. A simple one to set up is having Active Directory enact a policy requiring IPSec for all connections to the DB Server. The other is to use certificates to establish the classic SSL connection encryption. This can be controlled at the connection string level so you can configure it for only a subset of the traffic and reduce the CPU load.


SAN Level Encryption (MPIO) is fascinating because it will stop the LUN from being restored to a machine without the proper certificates. This tends to require a lot of CPU resources, but it is possible to offload that encryption work to specialized hardware (such as the Emulex HBA) that can be integrated into the RSA Key Manager.


I did discover something a bit disappointing about DQS today. There are currently no matching services available. That means that there are no internal web services for identifying matches, nor are there any SSIS operations available. Matching work can ONLY be done through the client GUI. Without automation this will be a real challenge when working with some types of data.

Friday, November 9, 2012

PASS Summit Day 2

Another great day filled with learning new techniques and discussing different challenges. There was more of a focus on query optimization than anything else today. After the sessions we all went to the Experience Music Project where there was food, live band karaoke and a Dalek!

Great fun all around. I just need to get back to a normal sleeping schedule; this 4 hours a night thing is exhausting.

Technical notes below!

The session that Adam Machanic did about forcing better parallel execution plans was astounding. He took a 28 minute query and optimized it to about 3 seconds by improving the parallelization.

The Optimizer Issue

The premise is that the optimizer is outdated because the Memory to Data Volume ratio (size or speed) is nothing like it was in 1997 (when the last big optimizer rewrite happened). This means the optimizer needs to be coaxed into doing the right thing.

The optimizer tends to serialize a lot of processing for a number of reasons that may not make sense now. The problem with serializing all of these processes is that Amdahl's Law (not theory, LAW) says that the serial portion of a workload increasingly limits performance gains as the number of processors increases. The goal is to make as many operations parallel as possible. The reason the optimizer does not default to this behaviour is that it expects greater disk latency than we have now, which would have rendered this point moot.

Possible Approach

An example of a common item that can inhibit parallelism is the Scalar Function. Potential solutions to this are (a sketch follows the list):
  • Convert it to an in-line TVF and CROSS APPLY it. 
  • Create a CLR scalar function (weird but true, it works much faster)
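
A quick sketch of the first option; the function and table names are made up, not from the session:

-- Instead of a scalar UDF called per row, e.g. dbo.fn_CustomerName(o.custid)...
CREATE FUNCTION dbo.tvf_CustomerName (@custid int)
RETURNS TABLE
AS
RETURN
(
    SELECT c.companyname
    FROM dbo.Customers AS c
    WHERE c.custid = @custid
);
GO

-- ...CROSS APPLY the inline TVF, which the optimizer can expand and parallelize.
SELECT o.orderid, cn.companyname
FROM dbo.Orders AS o
CROSS APPLY dbo.tvf_CustomerName(o.custid) AS cn;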

To check how balanced the parallelism is, you can open the properties of any step of the query plan and view the actual number of rows returned for each thread (there is a little arrow thing hiding there).

There is a trace flag that will force the optimizer to try to make some part of the query parallel (Trace Flag 8649), but this is not a good answer. The issue is that it doesn't ensure that you are actually removing the serial parts to balance the load.

One way to encourage the optimizer to seek the parallelism path without the Trace Flag is to use a TOP(@i) with an OPTION(OPTIMIZE FOR (@i = 2000000000)). This will make the optimizer expect 2 billion rows even if there are only 10,000 in the table, which tips the estimated cost calculation to the side of making the query parallel.
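
In code, the trick looks something like this sketch (my own table and column names):

DECLARE @i bigint = 2000000000;   -- at runtime the TOP never actually limits anything

SELECT TOP (@i) t.custid, t.orderdate, t.val
FROM dbo.OrderValues AS t
OPTION (OPTIMIZE FOR (@i = 2000000000));   -- the optimizer plans for 2 billion rows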

MAXDOP should be configured to less than the number of processing cores for each NUMA node. This avoids making foreign memory requests, which may be slow. It is ok to go higher, but this is a very sensible starting spot; be sure to test carefully.

Parallel Apply Pattern

The Parallel Apply Pattern can be used to encourage the optimizer to work with and manage efficient streams. The abstract form of this query is:
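
(The exact query from the session didn't make it into my notes; this is a rough sketch of the shape, with placeholder table and column names.)

SELECT d.ProductID, payload.*
FROM dbo.Product AS d                     -- DRIVER table: one row per item of interest
CROSS APPLY
(
    -- CORRELATED PAYLOAD: the heavy joins / aggregations, scoped to this one driver row
    SELECT SUM(s.Quantity) AS TotalQuantity
    FROM dbo.StockMovement AS s
    WHERE s.ProductID = d.ProductID
) AS payload;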


To execute this pattern the DRIVER of your query must be identified, for example is it really just calculating information about product stock levels? Then the driver table would be Product and the CORRELATED PAYLOAD would be the sophisticated joins, aggregations and calculations. 

Roughly speaking, for each record in the DRIVER TABLE the CORRELATED PAYLOAD executes all of its heavy set-based work as a smaller piece rather than against the full DRIVER TABLE set. This is more efficient for parallel loads.


If the driver table is not going parallel there are a couple of choices after turning it into a derived table:
1. Use the TOP technique mentioned above. The optimizer will think there are more rows.
2. Use a DISTINCT 0%ColA + ColA calculation in the select list to prevent a calculation from being extracted from the parallel stream.


If the CORRELATED PAYLOAD is still having its calculations pulled away from the parallel stream, then change the CROSS APPLY to an OUTER APPLY. This prevents the optimizer from knowing whether it is safe to pull the calculations up (a poor optimization choice).

Thursday, November 8, 2012

PASS Summit Day 1

Summit officially began today. I had a great time meeting new people, talking to vendors and learning stuff.

The big things today were the Keynote, Extended Events, Enterprise Information Management and Performance Tuning.

MS is clearly taking the threat from SAP's HANA seriously by building a new in-memory storage engine for SQL Server. It will be in the next major release and is expected to speed up existing applications substantially.

Extended Events are interesting and really are complex enough that there would be a real benefit from having an actual book assembled about them. They replace a very simple tool (Profiler) with a very complex and powerful one. I learned just enough to know that I need to spend a lot of quality time to understand it as much as I'd like.

So far it looks like Microsoft is taking the approach of going after the smaller, simpler data to Master instead of trying to battle in the arena of CDI or PIM where there are large established players. This follows the approach MS has had for a while. Something like Informatica is just overkill to manage Jill's spreadsheet that everyone shares, but MDS is perfect for managing and distributing the information throughout the company.

Some of my notes from today are below:


MS Announced Hekaton : An in-memory engine for SQL Server

  • There will be an adviser on when to use it
  • Lives as part of the DBMS server as a different storage engine (heterogeneous)
  • Lock and latch free
  • The storage engine for tables can be changed to move them in and out of memory. If the table no longer fits it switches to the conventional model. Because this can be done seamlessly the data is likely persisted to disk in the conventional DBMS way.
    • Exadata uses flash as a sort of cache fusion
    • HANA is designed as Memory primary and disk secondary so has huge performance drops when memory limits are exceeded.
  • Stored procedure optimization compiled to native code
  • Included in the next version of SQL Server

Parallel Data Warehouse
* Polybase allows seamless connection to Hadoop / Hive and the SQL Server DB to allow a single point of federated querying

Extended Events

SQL Trace and Profiler are deprecated. With SQL 2012 every traceable event is now available in Extended Events.

Extended Events are much lighter than the previous options such as Profiler (similar in weight to a server-side trace), with support for multiple targets.
The available targets are:

  • Ringbuffer
  • Event Bucketing (Bucketizer) is a specialized ringbuffer, storing aggregates
  • Event Pairing allows the tracking of things like everything happening between lock_acquired and lock_released
  • Event file (xml)
  • Synchronous Event Counter only tracks the number of times an event fired, no details

The SQL Server audit events have been merged into the normal extended events.

System Overhead (as tested by SQLCAT)

  • Query Time increases by 2 microseconds. As more actions are tracked this does increase.
  • Memory increases by 4MB for the RingBuffer
  • Disk storage is more verbose, always try to avoid logging to the Data / Log volumes
  • If the event tracking causes too much of a slowdown, it will be automatically disconnected until it has less impact on performance.

In general the learning curve is steeper but you can do more with Extended Events than with Profiler.

The GUI is not as nice as Profiler and there is a larger learning curve. Also the SQL 2012 GUI does not work with SQL 2008.

Event information is recorded in an XML structure that is fairly consistent. Use XQuery to extract the results in a tabular form.
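
For example, pulling events out of a ring_buffer target looks roughly like this (the session name and the attributes extracted are just illustrative):

SELECT xed.event_data.value('(@name)[1]', 'varchar(100)')        AS event_name,
       xed.event_data.value('(@timestamp)[1]', 'datetimeoffset') AS event_time
FROM
(
    SELECT CAST(t.target_data AS xml) AS target_data
    FROM sys.dm_xe_sessions AS s
    INNER JOIN sys.dm_xe_session_targets AS t
        ON t.event_session_address = s.address
    WHERE    = N'system_health'
      AND t.target_name = N'ring_buffer'
) AS src
CROSS APPLY src.target_data.nodes('RingBufferTarget/event') AS xed(event_data);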

One mental model for understanding the different parts of the setup is to think of it like an insert statement
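
Here is my rough attempt at that mapping as a sketch; the event, actions and predicate are ones I picked for illustration, not the presenter's:

CREATE EVENT SESSION [long_queries] ON SERVER
ADD EVENT sqlserver.sql_statement_completed              -- the "rows" being captured
(
    ACTION (sqlserver.sql_text, sqlserver.session_id)    -- the "column list" added to each row
    WHERE (duration > 1000000)                            -- the "WHERE clause" (predicate)
)
ADD TARGET package0.ring_buffer;                          -- the "INSERT INTO <table>" destination
GO
ALTER EVENT SESSION [long_queries] ON SERVER STATE = START;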


The predicate will short circuit; this means that criteria should be placed with the most restrictive rules first!

More good stuff on Jonathan Kehayias's blog 

Because traces are very light they can run in the background. For example, the old SQL Trace Blackbox has returned as an Extended Event session called "system_health" and is visible under: Server>Management>Extended Events>Sessions and is using two targets, an event file and the ring_buffer.

Performance Tuning

5 Main areas to consider when there is a problem

  • IO
  • CPU
  • Memory
  • Network
  • Locks and Blocks

Activity Monitor is MUCH better than it was in SQL 2005. Most of the common troubleshooting DMVs are presented there.

The only system process with a SPID > 50 is... SERVICE BROKER! That is a feature that really gets no respect, and it turns out the other system processes don't want to be associated with it either.

When a WAIT x:y:z is displayed, the variables represent

  • x : DBID
  • y : File ID
  • z : Page
    • page 1 contains the page free space bitmap; contention here is typically for heaps (most temp tables and table variables are heaps). Scalar and non-inlined TVFs can abuse this by creating and destroying objects swiftly.

Wednesday, November 7, 2012

Pre-Con Day 2 T-SQL Querying and Query Tuning Enhancements in SQL Server

This was the second (and last) day of the pre-conference events. I selected the T-SQL session for two reasons: I knew that there were a lot of enhancements to the core language, and it was being given by Itzik Ben-Gan, who is a SQL Master (he literally wrote the book on T-SQL query techniques).

I was a bit concerned because he enumerated the items he wanted to cover and I've used nearly all of them so I didn't expect to learn much. Much like yesterday, I was wrong.

There were two areas that I struggled to keep up with him during the session, those were the Aggregate Windowing (which I never used) and all the crazy ways you can use the APPLY operator. The rest of the session was good but I kept coming back to (and testing out) those two areas.

I didn't get much of a chance for socializing because I was tinkering with the new stuff I learned whenever I had a few free minutes and I did something bad to my ankle so I just hung around my room for the evening.

Below are my notes from this session:

T-SQL Querying and Query Tuning Enhancements in SQL Server

Extended Events Change

This was the last thing we discussed but it is SO important that I need to talk about it first. Extended Events have a new Global Field called Query Hash. This is a hash of the auto-parameterized query; capturing this allows the detection of all sorts of performance problems. All of the tiny, fast queries that execute ALL the time can be evaluated to see if your system is suffering "death by a million cuts". This is HUGE.


CTEs join the ranks of derived tables, views and table valued functions as a way to make the query syntax cleaner. By themselves, all of these options generate the exact same query plan and do nothing to improve or degrade query performance. The biggest immediate benefit of non-recursive CTEs is the improved legibility they gain by allowing query reuse.

CTEs are evaluated top to bottom. This allows the first query to be referenced by the second query; both the first and second queries could be referenced by the third query, etc.

To improve performance, consider using something to actually materialize the data and reference that single result set. Examples of this would be actual tables, temp tables and table variables; a quick sketch follows.
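
A hedged sketch of materializing once into a temp table instead of re-deriving the same CTE (the table and column names are invented):

-- Materialize the intermediate result a single time...
SELECT custid, YEAR(orderdate) AS orderyear, COUNT(*) AS numorders
INTO #CustOrders
FROM dbo.Orders
GROUP BY custid, YEAR(orderdate);

-- ...then reference the materialized set as often as needed.
SELECT custid, orderyear, numorders
FROM #CustOrders
WHERE numorders > 10;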

Recursive CTEs do not follow the rules mentioned above, instead they need to keep a running tally of the data. It does not persist it to a normal table but instead creates a B-Tree structure for the data on disk that has been optimized for the Recursive CTE's usage patterns.

Windowing Functions

Craziness. Absolute, amazing craziness that has been hiding from me since SQL 2005, and it can be applied to any aggregate (but works best with cumulative aggregates).

This is work applied to the result set.

The Windowing Functions are made up of a bunch of different parts (a running-total sketch follows this list)

  • OVER : Establishes the window of data over the result sets
  • PARTITION BY : Defines the size of the sliding window by referencing what columns cause the aggregate to reset. (Note - the FILTER operator, which would be an explicit WHERE to filter the partition, is not supported)
  • ORDER BY : Sorting sequence to be used when calculating aggregates. This is important for things like running totals.
  • Window Frame : 
    • ROWS BETWEEN ... AND ...
      • Specify the number of result rows to consider, regardless of missing data
      • Duplicate ORDER BY values are calculated individually
      • Always specify ROWS, it is not the default (Damn you ISO SQL Standards!) but it is much more efficient
      • UNBOUNDED PRECEDING : From the beginning of the Partition (Default)
      • UNBOUNDED FOLLOWING : To the end of the Partition
      • n PRECEDING : n rows before the current row
      • n FOLLOWING : n rows after the current row
      • CURRENT ROW : To the current row (think of a running total) (Default)
    • RANGE BETWEEN ... AND ...
      • Specify the value range to consider, this is not fully supported in SQL 2012
      • Duplicate ORDER BY values are grouped together for the aggregate calculation
      • Cannot tell the difference between? 
      • Specify the number of result rows to consider (may skip missing values)
      • Use with extreme caution, it is the default when there is a partition
      • UNBOUNDED PRECEDING : From the beginning of the Partition (Default)
      • CURRENT ROW : To the current row (think of a running total) (Default)
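
A minimal running-total sketch with an explicit ROWS frame (table and column names are placeholders):

SELECT custid, orderdate, val,
       SUM(val) OVER (PARTITION BY custid
                      ORDER BY orderdate
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM dbo.OrderValues;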


This is NOT syntactic sugar. The query plan has a new operation called Window Spool which allows a "Fast Track" of results (for Linear Scaling) under two very common sets of conditions.

  1. "BETWEEN Unbounded Preceding" looks only at directly previous row instead of maintaining a full list for each record. 
  2. The aggregate is a Cumulative Aggregate (such as SUM or COUNT, not a MIN or MAX) it is calculated by using the "Unbounded Preceding" behind the scenes and doing appropriate.

If the Fast-Track conditions are not met it turns from a 2n cost to an n^2 cost.

For indexing the window aggregate, a POC pattern should be followed (a sketch follows the list).

  • P: Partition (Quirk: Is nearly always ASC unless the same column is in the order by as DESC)
  • O: Order By
  • C: Coverage (can be in the INCLUDE)
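
A hedged sketch of a POC index for the running-total query above (again, made-up object names):

CREATE NONCLUSTERED INDEX idx_POC_running_total
ON dbo.OrderValues (custid, orderdate)   -- P: partition column, O: ORDER BY column
INCLUDE (val);                           -- C: coverage for the aggregated column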

To check for performance issues, use Extended Events or Statistics IO (which shows the worktable)

New SQL 2012 Window Aggregates

  • LAG : Element from previous row in the Partition
  • LEAD : Element from next row in the Partition
  • FIRST_VALUE : Element from the FIRST row in the FRAME
    • Typically the frame and partition are the same when using FIRST_VALUE
  • LAST_VALUE : Element from the LAST row in the FRAME
    • Typically the frame and partition are the same when using LAST_VALUE
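
A small sketch of all four together (placeholder names; note the explicit frame so LAST_VALUE sees the whole partition):

SELECT custid, orderdate, val,
       LAG(val)  OVER (PARTITION BY custid ORDER BY orderdate) AS prev_val,
       LEAD(val) OVER (PARTITION BY custid ORDER BY orderdate) AS next_val,
       FIRST_VALUE(val) OVER (PARTITION BY custid ORDER BY orderdate
                              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_val,
       LAST_VALUE(val)  OVER (PARTITION BY custid ORDER BY orderdate
                              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_val
FROM dbo.OrderValues;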


The new OFFSET-FETCH syntax was added as an extension of the ORDER BY clause. The implementation is not SQL compliant yet because it is still missing the PERCENT and WITH TIES options.


It performs similarly to TOP, with a discard of the first X rows (the offset). This is fine for the first few "pages" of data but does not scale deeply. To scale it, use a CTE on an indexed column and join to that, such as:

DECLARE @pagenum AS int = 3, @pagesize AS int = 25;

WITH Keys AS
(
    -- Page through the narrow, indexed keys only...
    SELECT orderid
    FROM dbo.Orders
    ORDER BY orderid
    OFFSET (@pagenum - 1) * @pagesize ROWS FETCH NEXT @pagesize ROWS ONLY
)
-- ...then join back for the wide columns.
SELECT K.orderid
    , O.orderdate
    , O.custid
    , O.empid
FROM Keys AS K
INNER JOIN dbo.Orders AS O
    ON O.orderid = K.orderid
ORDER BY K.orderid;


PIVOT and UNPIVOT are not part of the SQL Standard. They are syntactic sugar for some common pivot requirements. As a pivot becomes more complex (such as multiple columns) they break down and require a more sophisticated model.

An efficient UNPIVOT can be created with CROSS APPLY using a VALUES clause (which creates a derived table behind the scenes). This method uses VALUES as a constant scan which feeds the stream aggregate, avoiding any worktables.

CREATE TABLE #t (id int, [2008] int, [2009] int, [2010] int, [2011] int);
INSERT INTO #t
VALUES (1, 100, 101, 102, 104)
     , (2, 200, 201, 202, 204)
     , (3, 300, 301, 302, 304)
     , (4, 400, 401, 402, 404);

WITH u AS
(
    SELECT AS id, SUM(val) AS total
    FROM #t AS t
    CROSS APPLY
        ( VALUES
            (, 2008, t.[2008])
            , (, 2009, t.[2009])
            , (, 2010, t.[2010])
            , (, 2011, t.[2011])
        ) AS v (id, [year], val)
    GROUP BY
)
SELECT * FROM #t AS t INNER JOIN u ON =;


The new SEQUENCE object uses the same internals as the identity feature but is more configurable, with different defaults.


  • MIN value: Most sequences are not user visible values so it starts at the lowest value possible for the data type.
  • DATATYPE : BIGINT. This is effectively an alternative to using GUID keys so it should support a LOT of values. BEWARE! This datatype SHOULD match the datatype in the table (especially if being used as a constraint), otherwise there will be a lot of conversions that may not have been expected.
  • CYCLE : OFF is the default value but the reason it exists is for IDs used in staging data that have a relatively short lifespan to avoid running out of keys.

Extended beyond the SQL Standard

  • It is able to be bound as a default constraint on a table. This will help with eliminating IDENTITY columns without impacting application code or adding triggers (which is the normal workaround).
  • There is an "OVER" clause available when generating IDs for a set to ensure the expected sequence (a sketch of both extensions follows).


  • This has all of the dangers of IDENTITY when it comes to sequential values. The process of getting IDs is outside of the transaction, so once an ID has been handed out it is gone (unless the sequence cycles).
  • A single sequence can be used across multiple tables but this does increase the gap likelihood.


MERGE is commonly thought of as an UPSERT but it can do a bit more. It compares the source (stg) to the target (active) and will cause different DML actions to be taken.

The "USING ... ON" clause is used matching the table from the source and left joining to the target. This is not a "WHERE" but left join conditions so it DOES NOT filter the source table. To filter the source table use a CTE as the source.

MS extended this to support executing an action if the record exists in the Target but not in the Source. This is great when doing full table refreshes.


  • Use a HOLDLOCK to keep the data consistent because the merge executes multiple actions.
  • Review any Target tables to ensure there are no triggers referencing @@ROWCOUNT at the beginning. This value is set for the ENTIRE MERGE command, not the single operation. Typically I've seen this used to do a quick eject from the trigger or to switch the path from row processing to set processing. Look at the INSERTED / DELETED tables instead.
  • Unlike the OUTPUT from an INSERT statement, you can return any generated values from both the source and target. If you need these values then make a MERGE with a non-matching condition (1=2).
  • If you only want to update the table when there has been a change there are two approaches, adding each column to the WHEN or adding an exists to the WHEN with an EXCEPT, like this:

MERGE INTO dbo.Customers AS TGT
USING dbo.CustomersStage AS SRC
  ON TGT.custid = SRC.custid
WHEN MATCHED AND EXISTS
    (
        SELECT * FROM dbo.CustomersStage AS S WHERE S.custid = SRC.custid
        EXCEPT
        SELECT * FROM dbo.Customers AS T WHERE T.custid = TGT.custid
    ) THEN
  UPDATE SET
    TGT.companyname = SRC.companyname,
    TGT.phone =,
    TGT.address = SRC.address;

Misc cool stuff

  • If the goal is to have a non-deterministic sort but an ORDER BY is required, the field list can be replaced by a (SELECT NULL)
  • Left to Right reading of the Execution Plan is how the API works
  • The lack of a ; after each SQL Statement is officially Deprecated! 
  • Avoid using IIF and CHOOSE statements. They were added as a stopgap to help with conversions from MS Access. Behind the scenes they are converted into CASE statements by the optimizer.
  • Interesting trick that can be embedded in other queries to get a delimited list:

SELECT N',' + QUOTENAME(orderyear) AS [text()]
   FROM (SELECT DISTINCT YEAR(orderdate) AS orderyear FROM Sales.Orders) AS Y
   ORDER BY orderyear
   FOR XML PATH('')   -- the [text()] alias plus FOR XML PATH('') is what collapses the rows into one delimited string