Wednesday, November 7, 2012

Pre-Con Day 2 T-SQL Querying and Query Tuning Enhancements in SQL Server

This was the second (and last) day of the pre-conference events. I selected the T-SQL session for two reasons: I knew there were a lot of enhancements to the core language, and it was being given by Itzik Ben-Gan, who is a SQL Master (he literally wrote the book on T-SQL querying techniques).

I was a bit concerned because he enumerated the items he wanted to cover and I've used nearly all of them so I didn't expect to learn much. Much like yesterday, I was wrong.

There were two areas where I struggled to keep up with him during the session: Aggregate Windowing (which I had never used) and all the crazy ways you can use the APPLY operator. The rest of the session was good, but I kept coming back to (and testing out) those two areas.

I didn't get much of a chance to socialize because I was tinkering with the new stuff I learned whenever I had a few free minutes, and I did something bad to my ankle, so I just hung around my room for the evening.

Below are my notes from this session:


T-SQL Querying and Query Tuning Enhancements in SQL Server


Extended Events Change

This was the last thing we discussed but it is SO important that I need to talk about it first. Extended Events have a new Global Field called Query Hash. This is a hash of the auto-parameterized query, and capturing it allows the detection of all sorts of performance problems. All of the tiny, fast queries that execute ALL the time can be grouped and evaluated to see if your system is suffering "death by a million cuts". This is HUGE.
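As a minimal sketch, this is roughly what a session capturing the new field could look like (the session name and file path are my own choices, not from the talk):

```sql
-- Sketch: capture the query_hash action on every completed statement,
-- so cheap-but-frequent queries can later be grouped by hash.
CREATE EVENT SESSION QueryHashCapture ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(
    ACTION (sqlserver.query_hash, sqlserver.sql_text)
)
ADD TARGET package0.event_file (SET filename = N'QueryHashCapture.xel');

ALTER EVENT SESSION QueryHashCapture ON SERVER STATE = START;
```

Aggregating the captured events by query_hash is what surfaces the "million cuts" — each query is trivial on its own, but the totals per hash tell the real story.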

CTE

CTEs join the ranks of derived tables, views, and table-valued functions as ways to make query syntax cleaner. By themselves, all of these options generate the exact same query plan and do nothing to improve or degrade query performance. The biggest immediate benefit of non-recursive CTEs is the improved legibility they provide by allowing query reuse.

CTEs are evaluated top to bottom. This allows the first query to be referenced by the second query, both the first and second queries to be referenced by the third query, and so on.
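A quick sketch of that chaining, assuming the Sales.Orders sample table that appears later in these notes:

```sql
-- OrderCounts is defined first, then referenced by the second CTE.
WITH OrderCounts AS
(
    SELECT custid, COUNT(*) AS numorders
    FROM Sales.Orders
    GROUP BY custid
),
BigCustomers AS
(
    SELECT custid, numorders
    FROM OrderCounts          -- reference to the earlier CTE
    WHERE numorders >= 10
)
SELECT custid, numorders
FROM BigCustomers;
```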

To improve performance consider using something to actually materialize the data and reference that single result set. Examples of this would be actual tables, temp tables and table variables.
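For example (a sketch, again assuming Sales.Orders), each reference to a CTE is re-evaluated, so materializing an expensive intermediate result once can pay off:

```sql
-- Materialize the expensive intermediate result once...
SELECT custid, COUNT(*) AS numorders
INTO #OrderCounts
FROM Sales.Orders
GROUP BY custid;

-- ...then reference the single result set as many times as needed.
SELECT MAX(numorders) AS maxorders FROM #OrderCounts;
SELECT AVG(numorders * 1.0) AS avgorders FROM #OrderCounts;

DROP TABLE #OrderCounts;
```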

Recursive CTEs do not follow the rules mentioned above; instead they need to keep a running tally of the data. It is not persisted to a normal table; instead SQL Server creates a B-Tree structure for the data on disk that has been optimized for the recursive CTE's usage patterns.
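The classic shape of a recursive CTE, assuming an HR.Employees table with an empid/mgrid self-reference:

```sql
-- Walk an employee hierarchy from the root down.
WITH EmpsCTE AS
(
    SELECT empid, mgrid, firstname, 0 AS lvl
    FROM HR.Employees
    WHERE mgrid IS NULL               -- anchor member: the root

    UNION ALL

    SELECT E.empid, E.mgrid, E.firstname, C.lvl + 1
    FROM EmpsCTE AS C
        INNER JOIN HR.Employees AS E
            ON E.mgrid = C.empid      -- recursive member
)
SELECT empid, mgrid, firstname, lvl
FROM EmpsCTE;
```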

Windowing Functions

Craziness. Absolute, amazing craziness that has been hiding from me since SQL 2005 and can be applied to any aggregate (but works best with cumulative aggregates).

This is work applied to the result set of the query (after the FROM, WHERE, GROUP BY, and HAVING have been processed).

The Windowing Functions are made up of a bunch of different parts:

  • OVER : Establishes the window of data over the result set
  • PARTITION BY : Defines the size of the sliding window by specifying which columns cause the aggregate to reset. (Note - the FILTER operator is not supported; it would be an explicit WHERE to filter the partition)
  • ORDER BY : Sorting sequence to be used when calculating aggregates. This is important for things like running totals.
  • Window Frame : 
    • ROWS BETWEEN ... AND ...
      • Specify the number of result rows to consider, regardless of missing data 
      • Duplicate ORDER BY values are calculated individually
      • Always specify ROWS; it is not the default (Damn you ISO SQL Standards!) but it is much more efficient
      • UNBOUNDED PRECEDING : From the beginning of the Partition (Default)
      • UNBOUNDED FOLLOWING : To the end of the Partition
      • n PRECEDING : Starting n rows before the current row
      • n FOLLOWING : Ending n rows after the current row
      • CURRENT ROW : To the current row (think of a running total) (Default)
    • RANGE BETWEEN
      • Specify the value range to consider; this is not fully supported in SQL 2012 (only the UNBOUNDED and CURRENT ROW delimiters work, not n PRECEDING / n FOLLOWING)
      • Duplicate ORDER BY values are grouped together for the aggregate calculation (the frame cannot tell ties apart)
      • Use with extreme caution; it is the default when an ORDER BY is present
      • UNBOUNDED PRECEDING : From the beginning of the Partition (Default)
      • CURRENT ROW : To the current row, including its ORDER BY peers (Default)
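Putting the parts together, here is the canonical running-total query (Sales.OrderValues is from Itzik's sample database; substitute any table with a customer, date, and value column):

```sql
-- Running total per customer; the explicit ROWS frame avoids the RANGE
-- default and (as described below) qualifies for the fast track.
SELECT custid, orderdate, orderid, val,
       SUM(val) OVER (PARTITION BY custid
                      ORDER BY orderdate, orderid
                      ROWS BETWEEN UNBOUNDED PRECEDING
                           AND CURRENT ROW) AS runningtotal
FROM Sales.OrderValues;
```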


Performance

This is NOT syntactic sugar. The query plan has a new operator called Window Spool which allows a "Fast Track" of results (for linear scaling) under two very common sets of conditions.

  1. "BETWEEN UNBOUNDED PRECEDING" looks only at the directly previous row instead of maintaining a full list of rows for each record. 
  2. The aggregate is a cumulative aggregate (such as SUM or COUNT, not MIN or MAX); it is calculated by using the "UNBOUNDED PRECEDING" fast track behind the scenes and doing the appropriate arithmetic as the window moves.


If the Fast-Track conditions are not met, the cost goes from 2n to n^2.


For indexing the window aggregate a POC pattern should be followed.

  • P: Partition (Quirk: the partition columns are nearly always ASC, unless the same column appears in the ORDER BY as DESC)
  • O: Order By
  • C: Coverage (can be in the INCLUDE)
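For the running-total query above, a POC index could be sketched like this (assuming a dbo.Orders table with a val column):

```sql
-- P: custid (partition), O: orderdate, orderid (order by), C: val (coverage)
CREATE NONCLUSTERED INDEX idx_poc
ON dbo.Orders (custid, orderdate, orderid)
INCLUDE (val);
```

With this index the rows arrive pre-sorted for the Window Spool, so no Sort operator or worktable is needed.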


To check for performance issues, use Extended Events or STATISTICS IO (which shows the worktable)

New SQL 2012 Window Functions


  • LAG : Element from previous row in the Partition
  • LEAD : Element from next row in the Partition
  • FIRST_VALUE : Element from the FIRST row in the FRAME
    • Typically the frame and partition are the same when using FIRST_VALUE
  • LAST_VALUE : Element from the LAST row in the FRAME
    • Typically the frame and partition are the same when using LAST_VALUE
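A sketch showing all four together (Sales.OrderValues again from the sample database). Note that LAST_VALUE needs an explicit frame, because the default frame ends at the current row:

```sql
SELECT custid, orderdate, val,
       LAG(val)  OVER (PARTITION BY custid ORDER BY orderdate) AS prevval,
       LEAD(val) OVER (PARTITION BY custid ORDER BY orderdate) AS nextval,
       FIRST_VALUE(val) OVER (PARTITION BY custid
                              ORDER BY orderdate) AS firstval,
       -- without the explicit frame below, LAST_VALUE would just
       -- return the current row's value
       LAST_VALUE(val)  OVER (PARTITION BY custid ORDER BY orderdate
                              ROWS BETWEEN CURRENT ROW
                                   AND UNBOUNDED FOLLOWING) AS lastval
FROM Sales.OrderValues;
```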


Paging

The new OFFSET-FETCH syntax was added as an extension of the ORDER BY clause. The implementation is not fully SQL Standard compliant yet because it is still missing the PERCENT and WITH TIES options.
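The basic shape, assuming a dbo.Orders table:

```sql
-- Page 3 of the orders, 25 rows per page.
SELECT orderid, orderdate, custid
FROM dbo.Orders
ORDER BY orderid
OFFSET 50 ROWS FETCH NEXT 25 ROWS ONLY;
```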

Performance

It performs similarly to TOP with a discard of the first X rows (the offset). This is fine for the first few "pages" of data but does not scale for deep pages. To scale it, use a CTE on an indexed column and join to it, such as:

DECLARE @pagenum INT = 3, @pagesize INT = 25;

WITH Keys AS
(
    SELECT orderid
    FROM dbo.Orders
    ORDER BY orderid
    OFFSET (@pagenum - 1) * @pagesize ROWS
    FETCH NEXT @pagesize ROWS ONLY
)
SELECT
    K.orderid
    , O.orderdate
    , O.custid
    , O.empid
FROM
    Keys AS K
    INNER JOIN dbo.Orders AS O
        ON O.orderid = K.orderid
ORDER BY K.orderid;

PIVOT / UNPIVOT

This is not part of the SQL Standard. These are syntactic sugar for some common pivot requirements. As a pivot becomes more complex (such as pivoting multiple columns) they break down and require a more sophisticated model.

An efficient UNPIVOT can be created with CROSS APPLY using a VALUES clause (which creates a derived table behind the scenes). This method uses VALUES as a Constant Scan that feeds a Stream Aggregate, avoiding any worktables.

CREATE TABLE #t (id INT, [2008] INT, [2009] INT, [2010] INT, [2011] INT);

INSERT INTO #t
VALUES
  (1, 100, 101, 102, 104)
, (2, 200, 201, 202, 204)
, (3, 300, 301, 302, 304)
, (4, 400, 401, 402, 404);

WITH u AS
(
    SELECT t.id, SUM(v.val) AS total
    FROM #t AS t
    CROSS APPLY
        ( VALUES
            (t.id, 2008, t.[2008])
            , (t.id, 2009, t.[2009])
            , (t.id, 2010, t.[2010])
            , (t.id, 2011, t.[2011])
        ) AS v (id, [year], val)
    GROUP BY t.id
)
SELECT * FROM #t AS t INNER JOIN u ON u.id = t.id;

Sequences

The new sequence object uses the same internals as the identity feature but is more configurable and has different defaults.

Defaults


  • MIN value: Most sequences are not user-visible values, so it starts at the lowest value possible for the data type.
  • DATATYPE : BIGINT. This is effectively an alternative to using GUID keys, so it should support a LOT of values. BEWARE! This datatype SHOULD match the datatype in the table (especially if being used in a constraint); otherwise there will be a lot of conversions that may not have been expected.
  • CYCLE : OFF is the default value, but the reason the option exists is for IDs used in staging data with a relatively short lifespan, to avoid running out of keys.
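A sketch showing the options discussed (the sequence name is my own; note the datatype is set to INT to match a hypothetical INT key column, per the warning above):

```sql
-- Datatype matches the target column to avoid implicit conversions.
CREATE SEQUENCE dbo.SeqStageIDs AS INT
    START WITH 1
    INCREMENT BY 1
    CYCLE;   -- wraps around at the limit; suits short-lived staging IDs

-- Grab the next value.
SELECT NEXT VALUE FOR dbo.SeqStageIDs AS nextid;
```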


Extended beyond the SQL Standard


  • It can be bound as a default constraint on a table. This will help with eliminating IDENTITY columns without impacting application code or adding triggers (which is the normal model).
  • There is an "OVER" clause available when generating IDs for a set, to ensure the values are assigned in the expected sequence.
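Both extensions sketched together (dbo.SeqStageIDs, dbo.StageRows, and dbo.SomeSource are hypothetical names for illustration):

```sql
-- 1. Bind the sequence as a default, replacing an IDENTITY column.
CREATE TABLE dbo.StageRows
(
    rowid INT NOT NULL
        CONSTRAINT DF_StageRows_rowid
        DEFAULT (NEXT VALUE FOR dbo.SeqStageIDs),
    payload VARCHAR(100) NOT NULL
);

-- 2. OVER controls the order in which values are assigned to a set.
SELECT NEXT VALUE FOR dbo.SeqStageIDs OVER (ORDER BY S.payload) AS rowid,
       S.payload
FROM dbo.SomeSource AS S;
```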


Notes


  • This has all of the dangers of IDENTITY when it comes to sequential values. The process of getting IDs is outside of the transaction, so once a value has been handed out it is gone (unless the sequence cycles).
  • A single sequence can be used across multiple tables, but this does increase the likelihood of gaps.


Merge

This is commonly thought of as an UPSERT but it can do a bit more. It compares the source (e.g. a staging table) to the target (the active table) and causes different DML actions to be taken.

The "USING ... ON" clause matches the source table to the target, like a left join. These are join conditions, not a "WHERE", so they DO NOT filter the source table. To filter the source table, use a CTE as the source.

MS extended this to support executing an action if the record exists in the Target but not in the Source (WHEN NOT MATCHED BY SOURCE). This is great when doing full table refreshes.
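A sketch of a full refresh using that extension, with the same Customers / CustomersStage tables from the example further down (column list trimmed for brevity):

```sql
MERGE INTO dbo.Customers AS TGT
USING dbo.CustomersStage AS SRC
    ON TGT.custid = SRC.custid
WHEN MATCHED THEN
    UPDATE SET TGT.companyname = SRC.companyname
WHEN NOT MATCHED THEN
    INSERT (custid, companyname)
    VALUES (SRC.custid, SRC.companyname)
WHEN NOT MATCHED BY SOURCE THEN   -- the MS extension: in target, not in source
    DELETE;
```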

Tips


  • Use a HOLDLOCK hint to keep the data consistent, because the merge executes multiple actions.
  • Review any Target tables to ensure there are no triggers referencing @@ROWCOUNT at the beginning. This value is set for the ENTIRE MERGE command, not the single operation. Typically I've seen this used to do a quick exit from the trigger or to switch the path from row processing to set processing. Look at the INSERTED / DELETED tables instead.
  • Unlike the OUTPUT from an INSERT statement, you can return any generated values from both the source and the target. If you only need these values, then write a MERGE with a non-matching condition (1=2).
  • If you only want to update the table when there has been a change, there are two approaches: adding each column comparison to the WHEN, or adding an EXISTS with an EXCEPT to the WHEN, like this:


MERGE INTO dbo.Customers AS TGT
USING dbo.CustomersStage AS SRC
  ON TGT.custid = SRC.custid
WHEN MATCHED AND 
     EXISTS (    
            SELECT * FROM dbo.CustomersStage AS S WHERE S.custid = SRC.custid
            EXCEPT
            SELECT * FROM dbo.Customers AS T WHERE T.custid = TGT.custid
        ) THEN
  UPDATE SET
    TGT.companyname = SRC.companyname,
    TGT.phone = SRC.phone,
    TGT.address = SRC.address;


Misc cool stuff


  • If the goal is a non-deterministic sort but an ORDER BY is required, the field list can be replaced by (SELECT NULL)
  • Reading the Execution Plan left to right is how the API works
  • The lack of a ; after each SQL statement is officially deprecated! 
  • Avoid using the IIF and CHOOSE functions. They were added as a stopgap to help with conversions from MS Access. Behind the scenes they are converted into CASE expressions by the optimizer.
  • Interesting trick that can be embedded in other queries to get a delimited list:

SELECT N',' + QUOTENAME(orderyear) AS [text()]
   FROM (SELECT DISTINCT YEAR(orderdate) AS orderyear FROM Sales.Orders) AS Y
   ORDER BY orderyear
   FOR XML PATH('')
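The result carries a leading comma; a common companion pattern (my addition, not from the session) is to wrap it in STUFF to strip that first delimiter:

```sql
-- Wrapping the trick with STUFF removes the leading comma.
SELECT STUFF(
    (SELECT N',' + QUOTENAME(orderyear) AS [text()]
     FROM (SELECT DISTINCT YEAR(orderdate) AS orderyear
           FROM Sales.Orders) AS Y
     ORDER BY orderyear
     FOR XML PATH('')), 1, 1, N'') AS yearlist;
```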
