Unimpeded By Sanity

DataWorks Summit 2017 - Netflix: Scaling Data Quality

2017-06-17T06:00:00.000-04:00

Netflix processes over 700 billion events / day, with 300 TB data warehouse writes each day and 5 PB of data warehouse reads each day. When quality starts to go bad, it can escalate into a major problem very quickly.

Data quality problems tend to be from upstream systems when either the volume changes dramatically or values shift out from underneath the system. It is more important to find out "when" there is a problem, than "why" there is a problem. To support this, developers have created a few tools:

Metacat

Federated Metastore, like an extended HCatalog
Contains statistics about the data on the partition

Missing data
Life Cycle
Audience using the data
Sunset date
Sunrise date

Quint

DQ Service
Defines metrics using a sql-like syntax
Evaluation rules include normal distribution compared over X partitions

WAP (Write, Audit, Publish)

This is an ETL pattern that leverages the idea of separating schema from data.

Write the additional data to a new partition file
Audit table definition is created to match the original table + the new partition

Validate the data using the Quint rules

Publish the partition by updating the production table definition to include the new partition

This pattern is not easy to put into place, but library components help simplify it it by applying common patterns based on the audience and life cycle of the data.

Lessons learned

Query based validation may be enough
Not all tables require quality coverage
One size does not fit all, use multiple small components so that customizations don't require the process to be completely on its own.

DataWorks Summit 2017 - Verizon Finance Data Lake

2017-06-15T06:00:00.000-04:00

Verizon has stood up a finance data lake to be a shared Enterprise Data Repository and enable self-service data discovery. This combines data from multiple SAP systems into a single, easy to access data repository. Benefits included:

Simplifies access to ERP data
Reduces data replication
Enables sharing of data (eliminating point to point integration)
Drive data archiving strategy
Rationalize data sources
Central set of business rules
Common reporting and analytic tools
Unified location to apply Data Governance and Security

The Architecture has data flow up from the source systems and into the data lake (~1,000 tables). The data movement is a combination of direct file imports, batch ingestion with Sqoop and incremental ingestion. The incremental ingestion was done by CDC and partition swaps originally but was eventually replaced with Attunity Replicate to simplify the maintenance and speed up processing.

Governance and security is applied, the sensitive data is either tokenized or not allowed into the cluster. The data is processed through the ETL layer, consisting mostly of Hive processes. The Logical model is the semantic layer provided by the reporting tools.

There are areas to improve.

Data Validations are executed with manual scripts after the fact.
Data Governance is limited and needs review as more users are given access to the data
There is no managed data catalog at this time
Very few curated and maintained data sets
Self-service data preparation tools are being reviewed
Capabilities are limited to small number of uses

DataWorks Summit 2017 - Day 1 Keynote

2017-06-14T09:39:00.003-04:00

I was able to get to my first Hadoop conference this year, so far it's been really good. Below are my notes from the sessions I was able to attend today.

The sessions I went to today include:

Summit Keynote
Verizon: Finance Data Lake
Whoops, the numbers are wrong! Scaling Data Quality @ Netflix
Dancing Elephants - Efficiently working with object stores from Apache Spark and Hive
Governance Bots - Metadata driven compliance
Cloudy with a chance of Hadoop - real world considerations
Hadoop Journey at Walgreens
LLAP: Building Cloud-first BI

Summit Keynote

The keynote was considerably flashier than I was expecting, the first 10 minutes or so was a lot of fog and laser light shows stuff. Not really my thing and I wish they didn't do that.

IBM Announcement

They did announce a very interesting partnership with IBM. IBM's Big Insights was based on open source Hadoop, maintaining their own distribution.Now IBM is going to leverage Hortonworks Data Platform (HDP) with Big SQL and Data Science Experience (DSX) layered on top of it.

Hortonworks has stated that for a distributed Data Warehouse Big SQL is their recommendation, but it doesn't seem to change their message about Hive. Hive still is their platform for Analytical SQL processing. The thing that Big SQL brings to the table is their "Fluid Query" (Federation) across multiple data sources that live in Hadoop (Hive, HBase, Spark and Object stored through HDFS) and conventional RDBMS (Oracle, SQL Server, Teradata, DB2, PDA / Netezza, etc.). Big SQL is a query optimizer and compiler to gather data across all of those systems. I've never used it, but the notion is interesting.

As for DSX, I have only seen a little bit of it. At first glance it appears that it stitches together some opensource Machine Learning (ML) platforms (and SPSS) into a set of coherent tools that focus upon collaboration. There are video tutorials to help team members with varying experience in the tools learn about using them through DSX. They have a website for machine learning use cases by industry at www.ibm-ml-hub.com, but there isn't much up there yet. Their visualization engine has been open-sourced, it's called brunel and is available on github. It can be used with Python, R and Spark, but I'm not clear how it stacks up to its competitors. They also focused on how simple it is to integrate the models generated into existing applications. The models can be tested, scored, monitored and generate alerts based on defined conditions.

Round-table

There was also a round-table of folks using Hortonworks in different industries, including:

Duke Energy
Black Night (Fidelity Mortgages)
Healthcare Services

Duke Energy

They came to the conclusion that they need to be a digital company that sells energy, instead of an energy company using some technology. This is a completely different approach from what they did before and required a major mental shift for everyone. There was a complete shift from commercial to open source software whenever the option exists. This model has been an advantage for attracting and retaining top talent. Above all else, they recommend making sure that Governance be considered early in the journey, otherwise it will lead to problems.

They have many sensors (>4 million) feeding into their system which would have conventional data warehouse appliances, so most data was getting lost. They are able to identify customer issues faster than ever and are using customer data to fuel their warehouse and applying machine learning to help with customer calls. An interesting thing they are working on is dynamically looking at the customer calling in, identify anticipated needs and offer the options to them faster than ever.

Black Night (Fidelity Mortgage)

Black Night uses Hadoop to improve their understanding of the customer. Similar to Duke, the cultural shift was hard but it also attracted (and retained) talent that otherwise would not of joined the company. They are using the data they gather to predict customer behavior and areas of interest. They want to reach the point where a customer can get their mortgage in a few minutes instead of the time intensive process it has been in the past.

Healthcare Services

Healthcare Services are two years into their data journey. This industry is even more heavily regulated and tends to towards data silos than most. To break out of the silos wherever possible they are switching to opensource software to fuel their ongoing machine learning efforts. For example, zip codes are highly correlated to lifespan, this can help identify populations that need help and preventive care. By working to provide clinics and focus attention on those areas those numbers can be improved.

Microsoft using YARN

Microsoft is switching to using YARN on their COSMOS clusters. These are very large clusters that need to support > 50k nodes (each!) distributed throughout the globe. Previously it used their own resource negotiator that processed Dryad DAGs (talk about a blast from the past!). Microsoft has been steadily contributing changes to YARN so that it can swap out their homegrown solution for an indsutry standard approach. They are a major project contributor and have dedicated MS Research members to work on improving cluster size and creating predictable and preemptable processing allocations. This effectively allows them to create higher priority jobs that will always execute when needed and allow other jobs to run when resources free up again. They are achieving the scalability by adding a federation layer and having each YARN instance scale to 10k nodes. Fascinating stuff, I love keeping an eye on the MS Research site.

EDW 2015: Data Governance Organization

2015-03-30T20:26:00.001-04:00

Pharmaceutical model

Janet Lichtenburger and Nancy Makos presented a terrific view of how Data Governance works at their company. They implement functional (small) data governance, with the exception of data sensitivity.

Here is my brain dump of the session today. You're welcome to it if you can make heads or tails of it. :-)

Some numbers that show the challenge of Data Governance there are:

5 people in Enterprise Architecture
3 people in Data Governance
130 people provide Data Stewardship support
350k total employees

Data Governance is about Policies and Standards and is typically independent of implementations, as part of the Enterprise Architecture or Finance groups.

To encourage adoption, Data Governance could be considered an internal consulting service to support projects ,that is not charged back.

Organizational Model

Overall Model

Enterprise Data Governance Executive Committee

Meets only a few times a year
Limited number of senior executives

Data Governance Committees

Each committee is chaired by Data Goverenance
Divided into specific domains
Meets as often as required by projects
Up to 15 people on each Domain Specific Committee.

Data Steward Roles

Executive Stewards

Member of the Data Governance Executive Committee
Strategic Direction
Authority

Enterprise Stewards

Member of the Data Governance Committee
Development Support of Data Governance Policies

Operational Stewards

Communicates to promote policies
Endorses Data Standards

Domain Stewards

Recommends canonical structures
Endorses Data Standards

However the model is constructed, it must make sense for the business. All new policies should be considered from a Cost / Benefit analysis, the exception is with regulatory requirements. Regulatory and Legal compliance are critical to avoid jail.

Policies

The goal of policies is to drive behavior changes needed for Enterprise Information Management to succeed.

People generally don't like change, a way to get buy-in is to amend existing processes instead of creating new ones. These are smaller, less intrusive and stakeholders have already been identified. This also leads to more partnerships and shared endorsements of the changes.

There are typically a small number of extremely high value areas, policy should focus upon those.

Align with the Enterprise goals and other Enterprise ranging groups, there are a lot of shared concerns and ways that the teams can support each other.

Keep the policies easily accessible, do not hide them in a 500 page volume, instead keep them somewhere easily discoverable, such as a wiki on the corporate intranet.

Policies that are overly broad and not enforceable can quickly cause legal / compliance problems. In those cases, no policy is a better choice than an unenforceable one.

Data Classification

The source and type of data both define the data classification required. Similarly, data from several, more open sources can be combined to escalate the protection required.

Restricted

Financial information, such as credit cards

Protected

Regulated information, such as HIPPA data

Private

Named Persons

Internal

Business Data

Public

Everything Else

The data classification levels are combined with Information Security levels for systems to identify where the data is able to be transmitted.

Anonymizing Data

Safe Harbor

Remove 18 identifying attributes from the data, which renders it fairly useless.

Expert Determination

Expert certifies that the data available is too low of a probability to re-identify an individual.

Automation

There are tools out there, such as Parit(?) that are capable of doing this automatically after a survey and analysis of the data.

Standards Examples

USPS Publication 28 specifies international address requirements

ISO15836 standards for tagging unstructured documents

ISO/IEC 11179 -4 and -5 has naming standards for business metadata

EDW 2015: Data Strategy

2015-03-29T22:11:00.002-04:00

This session was presented by Lewis Broome (from Data Blueprint) and Brian Cassel (from the Massey Cancer Center).

Lewis presented a his strategic model and roadmap using the case study of a logistics company implementing an ambitious project.

Brian spoke about the challenges he faced within the cancer center as he implemented analytics across data hidden in silos. This was primarily culture based but once he was past that he was able to use the existing Data Analytics hub to build a specialized data mart to support strategic review of the data.

This was really great stuff and my summary doesn't do it justice by a long shot.

Data Strategy

Business Needs

In order to get anywhere with discussions about data and mays to improve it throughout the organization, the value of the effort has to be made clear. Clean data may seem like the most obvious need in the world, but that view is too low level to make it on to the radar of senior management. Instead, it needs to clearly address a business need.

There are three aspects to consider

How will mesh with the company Mission and Brand Promises?

Ex. FedEx: Your package will get there overnight. Guaranteed.

Does it improve the company's market position / provide a competitive advantage?

Michael Porter's Market Positioning Framework and his Competitive Advantage Framework provide a good way to think about this.

Will it improve the operating model and support the company's objectives?

Operating models improve by changing the degree of business integration or standardization.

If the data changes do not address any of these areas, it will not gain the support needed to succeed.

New capabilities that do not meet a business need aren't a program, they are a science project.

Current State of the Business

The current state assessment looks at

Existing Assets and Capabilities
Gaps in Assets and Capabilities
Constraints and Interdependencies

This can be the toughest stuff to identify.
BEWARE SHADOW SYSTEMS typically excel spreadsheets with macros or access data fixing before feeding it into the next step of a process.

Cultural Readiness

Cultural Readiness

Cultural Readiness depends on 5 different areas

Vision

A clear message of what the program is expected to achieve

Skills

Ensure that the right people are part of the program

Incentive

The value and importance of the program should be clear to all of the participants

Resources

Backing the program will require more than just good will, tools, environments and training may all be required

Action Plan

The system boundaries being developed should be clearly defined

Capability Maturity Model Levels

Starting point

There is some data in a pile over there.

Repeatable Process

This is how we sweep the data into a pile and remove the bits of junk we find.

Defined Process

Sweep from left to right, avoid the dead bugs. Leave data in a pile.

Managed Process

The entire team has the same brooms and the dead bugs are highlighted and automatically avoided by the brooms.

Optimizing

Maybe we can add rules to avoid sweeping twigs into the pile as well.

Roadmap

The Roadmap establishes the path of the Data Management Program to achieve the strategic goals.

Leadership and Planning	Planning and Business Strategy Alignment Program Management Clearly Defined Imperatives, Tactics and KPI Accountable to CDO
Project Development	Outcome Based Targets Business Case and Project Scope Program Execution

Project Model

Big Projects tend to fail, at least twice sometimes more than that as the business learns what it really needs.

Always start with crawling and walking before going to running.

Governance should start with a small 'g', where it matters most. There are commonly 5-10 critical data elements, take care of those before setting targets higher.
Data Strategy as top-down approach works best. Otherwise it is uncoordinated and is only capable of supporting tactical initiatives.
Data Architecture must focus on the business needs, not individual systems or applications.

Questions and where to seek answers

2015-03-28T19:54:00.002-04:00

There always seems to be an unending stream of questions.

In the past, a good number of them were about things like "How can we use this shiny new feature?" "What are the best practices for these scenarios?".

However, times change and so do the questions.

Microsoft has been following the lead of the SQL Server community and steadily become more open with road maps and dialogs. We understand the platform and internals better than ever before, making it easier to address concerns.

Now the questions that need to be addressed are more strategic than tactical. Strategies, by their very nature are more challenging to find peers to discuss success and (more importantly) failures in plans and their implementations.

I'm attending Enterprise Data World this year to have a chance to discuss ideas, strategies and technologies to bring new insights and make things run better than ever.

This will be fun!

PASS Summit Day 3

2012-11-10T07:37:00.002-05:00

The last day of PASS is pretty exhausting. My head was still spinning with ideas from the previous days and the tracks were a bit less balanced. To let the ideas settle in I investigated some different areas.

Hadoop

Microsoft is really getting into Hadoop in a big way. I am glad that they have partnered with Hortonworks (the orginal Yahoo Hadoop team) for the Windows implementation. Microsoft is making a serious attempt to be a good opensource citizen and is actively contributing back to the Apache project. Their Hadoop product now has a name, HDInsight and some interesting characteristics, such as:

Available via NuGet
Web Management console
Map-Reduce can be written in C# and JavaScript
ODBC Driver to connect to Hive

Normal .Net
LINQ
Excel (with a friendly plug-in!)

There are two versions announced

Azure Edition

This version actually substitutes the Azure BlobStore for HDFS. Presumably this was done because they share many characteristics but the BlobStore is already tuned for the MS Datacenters and has presumably avoided the NameNode issue somehow.
A popular model is to gather data (such as web or application logs) into the blobstore and on a regular schedule spin up a large HDInsight cluster, run the analysis and then spin it back down. This technique is used by Halo 4.

On-Premises Edition

This sounds as though it is targeted at preconfigured Hyper-V Virtual Machines (I may be way off here).
The filesystem is HDFS, not NTFS or version of the BlobStore
Enterprise infratructure support coming

Microsoft Virtual Machine Manager
System Center HDInsight Management Pack
Active Directory for unified authentication

Additional features (maybe)

Scheduled

Oozi Worksflow Scheduler
Mahout

UnScheduled (but popular requests)

SSIS Integration
SQL Server Management Studio

Azure

The Azure data integration story continues to improve. The model to move data is either using SSIS or Data Sync. However, the SSIS development best practices are a little bit different. To transfer data efficiently there are a combination of things to do:

Compress your data being sent over the network
Parallelism is good but adding thread will reduce the transfer time
Robustness must be maintained by the package. A "blip" in th connection cannot force the entire operation to restart. This is long and painful process and should be able to pickup where the blip happened. This is achieved by using a loop (with sleep) to keep retrying and sending small sets into staging tables and then pushing loaded sets from staging into final tables.

Encryption

Encryption is becoming more of an issue as PII regulations become more prominent. Some things to keep in mind when planning security

Always salt the hash and encryption.
Be aware of CPU scaling concerns
Encryption makes everything bigger
Only 4000 characters at a time work with hashing and encryption functions built into SQL Server.
Encrypted and hashed values will not compress well

What types on encryption should be considered?

Field Level
Data at rest (on disk or on backup)
Connection Level
SAN level

Field Level

At the field level this can get complex because it is either managed in the DB or the Application. Each has tradeoffs

The DB is hard to scale out by adding nodes if it is responsible for all encryption.
The DB could manage to invisibly decrypt columns based on account permissions
The application would have to wrap all sensitive data with encrypt / decrypt logic if it was responsible for doing it.
The application can scale out to additional servers if needed to distribute the CPU load.

Data at Rest

When the data is at rest the database can encrypt it with minimal overhead using Transparent Data Encryption (TDE). This does mean that data is not encrypted in memory so an attacker could view data sitting through an exploit. TDE does do some changes silently. Instant File Initialization becomes disabled as soon as TDE is initiated. The shared database TEMPDB also must become encrypted.

Backup encryption is critical because there is no telling what will happen when tapes leave the security of a private datacenter and goes touring through the outside world.

Connection Level

There are two ways to make sure you connection is encrypted. A simple one to setup is having Active Directory enact a policy requiring IPSec for all connections to the DB Server. The other one is to use certificates to establish the classic SSL connection encryption. This can be controlled at the connection string level so you can configure it for only a subset of the traffic and reduce the CPU load.

SAN

SAN Level Encryption (MPIO) is fascinating because it will stop the LUN from being restored to a machine without the proper certificates. This tends to require a lot of CPU resources but it is possible to offload that encryption work to special (such as the Emulex HPA) that can be integrated into the RSA Key Manager.

DQS

I did discover something a bit disappointing about DQS today. There are curerntly no matching services available. That means that there are no internal web services for identifying matches nor are they any SSIS operations available. Matching work can ONLY be done through the client GUI. Without automation this will have a real challenge working with some types of data..

PASS Summit Day 2

2012-11-09T00:40:00.000-05:00

Another great day filled with learning new techniques and discussing different challenges. There was more of a focus on Query optimization than anything else today. After the sessions We all went to the Experience Music Project where there was food, live band karaoke and a Dalek!

Great fun all around, I just need to get back to a normal sleeping schedule, this 4 hours a night thing is exhausting.

Technical notes below!

The session that Adam Machanic did about forcing better parallel execution plans was astounding. He took a 28 minute query and optimized it to about 3 seconds and by improving the parallelization.

The Optimizer Issue

The premise is that optimizer is outdated because the Memory to Data Volume ratio (size or speed) is nothing like the way it was 1997 (when the last big optimizer rewrite happened). This means the optimizer needs to be coaxed into doing the right thing.

The optimizer tends to serialize a lot of processing for a number of reasons that may not make sense now. The problem with serializing all of these processes is that Amdahl's Law (not theory, LAW) says that the number of serial processes substantially impacts performance as the number of processors increases. The goal is to make as many parallel as possible. The reason the optimizer does not default to this behaviour is because it is expecting greater disk latency than we have now that would have rendered this point moot.

Possible Approach

An example of a common item that inhibit can parallelism is the Scalar Function. Potential solutions to this are:

Convert it to an in-line TVF and CROSS APPLY it.
Create a CLR scalar function (weird but true, it works much faster)

To check how balanced the parallelism is you can open the properties of any step of the query plan and check view the actual number of rows returned for each thread (there is a little arrow thing hiding there).

There is a traceflag that will force the optimizer to try to make some part of query parallel (Trace Flag 8649) but this is not a good answer. The issue is that it doesn't ensure that you are actually removing the serial parts to balance the load.

One way to encourage the optimizer to seek the parallelism path without the Trace Flag is to use a TOP(@i) with an OPTION(OPTIMIZE FOR (@i = 2000000000)). This will make the optimizer expect 2 billion rows even if there are only 10,000 in the table. This will top the estimated cost calculate to the side of making the query parallel.

MAXDOP should be configured to less than the number of processing cores for each numa node. This avoids making for foreign memory requests which may be slow. It is ok to go higher, but this is a very sensible starting spot, be sure to test carefully.

Parallel Apply Pattern

The Parallel Apply Pattern can be used to enourage working with an managing efficient streams with the optimizer. The abstract form of this query is:

SELECT

myfields

FROM

[DRIVER TABLE]

CROSS APPLY

(

[CORRELATED PAYLOAD]

)

To execute this pattern the DRIVER of your query must be identified, for example is it really just calculating information about product stock levels? Then the driver table would be Product and the CORRELATED PAYLOAD would be the sophisticated joins, aggregations and calculations.

Roughly speaking, for each record in the DRIVER TABLE the CORRELATED PAYLOAD executes all of it's heavy set based work as a smaller piece rather than against the full DRIVER TABLE set. This is more efficient for parallel loads.

DRIVER TABLE issues

If the driver table is not going parallel there are a couple of choices after turning it into a derived table

1. Use the TOP techique mentioned above. The optimizer will think there are more rows.

2. Use a DISTINCT 0%ColA + ColA calculation in the select list to prevent a calculation from being extracted from the parallel stream.

CORRELATED PAYLOAD issues

If the CORRELATED PAYLOAD is still having it's calculations pulled away from the parallel stream that change the CROSS APPLY to an OUTER APPLY. This prevents the optimizer from knowing if it is safe to pull the calculations up (as a poor optimization choice).

PASS Summit Day 1

2012-11-08T09:22:00.004-05:00

Summit officially began today. I had a great time meeting new people, talking to vendors and learning stuff.

The big things today were they Keynote, Extended Events, Enterprise Information Management and Performance Tuning.

MS is clearly taking the threat from SAP's HANA seriously with making a new in-memory storage engine fr SQL Server. It will be in the next major release and is expected to speed up existing applications substantially.

Extended Events are interesting and really are complex enough that there would be a real benefit from having an actual book assembled about them. It is replacing a very simple tool (Profiler) with a very complex and powerful one. I learned just enough to know that I need to send a lot of quality time to understand it as much as I'd like.

So far it looks like Microsoft is taking the approach of going after the smaller, simpler data to Master instead of trying to battle in the arena of CDI or PIM where there are large established player. This follows the approach MS has had for a while. Something like Informatica is just overkill to manage Jill's spreadsheet that everyone shares, but MDS is perfect for managing and distributing the information throughout the company.

Some of my notes from today are below:

Keynote

MS Announced Hekaton : An in-memory engine for SQL Server

There will be an adviser on when to use it
Live as part of the DBMS server as a different storage engine. Heterogeneous
Lock and latch free
The storage engine for tables can be changed to move them in and out of memory. If the table no longer fits it switches to the conventional model. Because this can be done seamlessly the data is likely persisted to disk in the conventional DBMS way.

Exadata uses flash as a sort of cache fusion
HANA is designed as Memory primary and disk secondary so has huge performance drops when memory limits are exceeded.

Stored procedure optimization compiled to native code
Included in the next version of SQL Server

Parallel Data Warehouse
* Polybase allows seems connection to Hadoop / Hive and SQL Server DB to allow a single point of federated querying

Extended Events

SQL Trace and Profiler are deprecated. With SQL 2012 every traceable is now available in Extended Events

Extended events are much lighter than Profiler (similar to trace) of the previous options with support for multiple targets. http://www.sqlperformance.com/2012/10/sql-trace/observer-overhead-trace-extended-events
The available targets are:

Ringbuffer
Event Bucketing (Bucketizer) is a specialized ringbuffer, storing aggregates
Event Pairing allows the tracking of things like everything happening between lock_acquired and lock_released
Event file (xml)
Synchronous Event Counter only tracks number of times and event fired, no details

The SQL Server audit events have been merged into the normal extended events.

System Overhead (as tested by SQLCAT)

Query Time increases by 2 microseconds. As more actions are tracked this does increase.
Memory increases by 4MB for the RingBuffer
Disk storage is more verbose, always try to avoid logging to the Data / Log volumes
If the event tracking causes too much of a slowdown, it will be automatically disconnected until it has less impact on performance.

In general the learning curve is steeper but you can do more with Extended Events than with Profiler.

The GUI is not as nice and profiler and there is a larger learning curve. Also the SQL 2012 GUI does not work with SQL 2008.

Event information is recorded in an XML structure that is fairly consistent. Use XQuery to extract the results in a tabular form.

One mental model for understanding the different parts of the setup is to think of it like an insert statement

INSERT INTO [TARGET]
SELECT [EVENT DATA]
FROM [EVENT]
LEFT OUTER JOIN [ACTION]
WHERE [PREDICATE]

The predicate will short circuit, this is means that Criteria should be placed with the most restrictive rules first!

More good stuff on Jonathan Kehayias's blog

Because traces are very light they can run in the background. For example, the old SQL Trace Blackbox has returned as an Extended Event session called "system_health" and is visible under: Server>Management>Extended Events>Sessions and is using two targets, an event file and the ring_buffer.

Performance Tuning

5 Mains areas to consider when there is a problem

IO
CPU
Memory
Network
Locks and Blocks

Activity Monitor is MUCH better than it was in SQL 2005. Most of the common troubleshooting DMVs are presented there.

The only system process with a SPID > 50 is... SERVICE BROKER! That is a feature that really gets no respect and it turns out the other system process don't want to be associated with it either

When a WAIT x:y:z is displayed, the variables represent

x : DBID
y : File ID
z : Page

page 1 contains the page free space bitmap, contnention here is typically for heaps (most temp tables and table variables are heaps). Scalar and non-inlined TVFs can abuse this by creating and destroying objects swiftly.

Pre-Con Day 2 T-SQL Querying and Query Tuning Enhancements in SQL Server

2012-11-07T09:00:00.001-05:00

This was the second (and last) day of the pre-conference events. I selected the T-SQL session for two reasons, I know that there were a lot of enhancements to core and it was being given by Itzik Ben-Gan, who is a SQL Master (he literally wrote the book on T-SQL query techniques).

I was a bit concerned because he enumerated the items he wanted to cover and I've used nearly all of them so I didn't expect to learn much. Much like yesterday, I was wrong.

There were two areas that I struggled to keep up with him during the session, those were the Aggregate Windowing (which I never used) and all the crazy ways you can use the APPLY operator. The rest of the session was good but I kept coming back to (and testing out) those two areas.

I didn't get much of a chance for socializing because I was tinkering with the new stuff I learned whenever I had a few free minutes and I did something bad to my ankle so I just hung around my room for the evening.

Below are my notes from this session:

T-SQL Querying and Query Tuning Enhancements in SQL Server

Extended Events Change

This was the last thing we discussed but it is SO important that I need to talk about it first. Extended Events have anew Global Field called Query Hash. This is a hash of the auto-paramterized query, capturing this allows the detection of all sorts of performance problems. All of the tiny, fast queries that execute ALL the time can be evaluated to see if your system is suffering the "death by a million cuts". This is HUGE.

CTE

CTEs join the rank of derived tables, views and table valued functions as a way to make the query syntax cleaner. By themselves, all of these options generate the exact same query plan and do nothing improve or degrade query performance. The biggest immediate benefit to non-recursive CTEs is the improved legibility they gain by allowing query reuse.

CTEs are evaluated Top to Bottom. This allows the first query to be referenced by the second query, Both the first and second queries could be referneced by the third query, etc.

To improve performance consider using something to actually materialize the data and reference that single result set. Examples of this would be actual tables, temp tables and table variables.

Recursive CTEs do not follow the rules mentioned above, instead they need to keep a running tally of the data. It does not persist it to a normal table but instead creates a B-Tree structure for the data on disk that has been optimized for the Recursive CTE's usage patterns.

Windowing Functions

Craziness. Absolute, amazing crazy that has been hiding from me since SQL 2005 that can be applied to any aggregate (but works best with cumulative aggregates).

This is work applied to the result set.

The Windowing Functions are made up of a bunch of different parts

OVER : Establishes the window of data over the result sets
PARTITION BY : Defines the size of the sliding window by referencing what columns cause the aggregate to reset. (Note - FILTER operator is not supported, which would be an explicit where to filter the partition)
ORDER BY : Sorting sequence to be used when calculating aggregates. This important for things like running totals.
Window Frame :

ROWS BETWEEN ... AND ...

Specify the number of result rows to consider, regardless of missing data
Duplicate ORDER BY values are calculated individually
Always Specify Row, it is not the default (Damn you ISO SQL Standards!) but it is much more efficient
UNBOUNDED PRECEDING : From the beginning of the Partition (Default)
UNBOUNDED FOLLOWING : To the end of the Partition
n PRECEDING : From the beginning of the Partition
n FOLLOWING : To the end of the Partition
CURRENT ROW : To the current row (think of a running total) (Default)

RANGE BETWEEN

Specify the value range to consider, this is not fully supported in SQL 2012
Duplicate ORDER BY values are grouped together for the aggregate calculation
Cannot tell the difference between?
Specify the number of result rows to consider (may skip missing values)
Use with extreme caution, it is the default when there is a partition
UNBOUNDED PRECEDING : From the beginning of the Partition (Default)
CURRENT ROW : To the current row (think of a running total) (Default)

Performance

This is NOT syntactic sugar. The query plan has a new operation called Window Spool which allows a "Fast Track" of results (for Linear Scaling) under two very common sets of conditions.

"BETWEEN Unbounded Preceding" looks only at directly previous row instead of maintaining a full list for each record.
The aggregate is a Cumulative Aggregate (such as SUM or COUNT, not a MIN or MAX) it is calculated by using the "Unbounded Preceding" behind the scenes and doing appropriate.

If the Fast-Track conditions are not met it turns from a 2n cost to an n^2 cost.

For indexing the window aggregate a POC pattern should be followed.

P: Partition (Quirk: Is nearly always ASC unless the same column is in the order by as DESC)
O: Order By
C: Coverage (can be in the INCLUDE)

To check for performance issues, use Extended Events or Statistics IO (which show the worktable)

New SQL 2012 Window Aggregates

LAG : Element from previous row in the Partition
LEAD : Element from next row in the Partition
FIRST_VALUE : Element from the FIRST row in the FRAME

Typically the frame and partition are the same when using FIRST_VALUE

LAST_VALUE : Element from the LAST row in the FRAME

Typically the frame and partition are the same when using LAST_VALUE

Paging

The new Offset - Fetch syntax was added as an extension of the ORDER BY clause. The implementation is not SQL Compliant yet because it is missing PERCENTAGE and WITH TIES options still

Performance

It performs similar to TOP with a discard of the first X rows (the offset). This is fine for the first few "pages" of data but does not scale deeply. To scale it, use a CTE on an indexed column and join to that, such as:

WITH Keys AS
(
SELECT orderid
FROM dbo.orders
ORDER BY orderid
OFFSET (@pagenum - 1) * @pagesize ROWS
FETCH NEXT @pagesize ROWS ONLY
)
SELECT
K.orderid
, O.orderdate
, O.custid
, O.empid
FROM
Keys as K
INNER JOIN Orders as O
ON O.orderid = K.orderid
ORDER BY K.orderid

PIVOT / UNPIVOT

This is not part of the SQL Standard. These are syntactical sugar for some common pivot requirements. As a pivot becomes more complex (such as multiple columns) they break down and require a more sophisticated model.

An efficient UNPIVOT can be created with the CROSS APPLY using a VALUES clause (which creates a derived table behind the scenes). This method uses VALUES as a constant scan which feeds the stream aggregate avoiding any worktables.

CREATE TABLE #t (id int , [2008] int, [2009] int , [2010] int, [2011] int)
;
INSERT into #t
values
(1, 100, 101,102,104)
,(2, 200, 201,202,204)
,(3, 300, 301,302,304)
,(4, 400, 401,402,404)

WITH
u AS
(
SELECT t.id, sum(val) AS total
FROM
#t t
CROSS APPLY
( VALUES
(t.id, 2008, t.[2008])
, (t.id, 2009, t.[2009])
, (t.id, 2010, t.[2010])
, (t.id, 2011, t.[2011])
) AS v (id, [year], val)
GROUP BY
t.id
)
SELECT * FROM #t t inner join u on u.id = t.id

Sequences

The new seqeunce object uses the same internals as the identity feature but is more configurable with different defaults.

Defaults

MIN value: Most sequences are not user visible values so it starts at the lowest value possible for the data type.
DATATYPE : BIGINT. This is effectively an alternative to using the GUID keys so it should be support a OT of values. BEWARE! This datatype SHOULD match the datatype in the table (especially if being used as a constraint), otherwise there will be a lot of conversions that may not have been expected.
CYCLE : OFF is the default value but the reason it exists is for IDs used in staging data that have a relatively short lifespan to avoid running out of keys.

Extended beyond the SQL Standard

It is able to be bound as a default constraint on a table. This will help with being able to eliminate the IDENTITY columns without impacting application code or adding triggers (which is the normal model).
There is an "OVER" clause available when generating IDs for a set to ensure the expected sequence.

Notes

This has all of the dangers on the IDENTITY when it comes to sequential values. The process of getting IDs is outside of the transaction, so once it's been handed off it is gone (unless the sequence cycles).
A single sequence can be used across multiple tables but this does increase the gap likelihood.

Merge

This is commonly thought of as an UPSERT but it can do a bit more. It compares the source (stg) to the target (active) and will cause different DML actions to be taken.

The "USING ... ON" clause is used matching the table from the source and left joining to the target. This is not a "WHERE" but left join conditions so it DOES NOT filter the source table. To filter the source table use a CTE as the source.

MS extended this to support executing an action if the record exists in the Target but not in the Source. This is great when doing full table refreshes.

Tips

Use a HOLDLOCK to keep the data consistent because the merge executes multiple actions.
Review any Target tables to ensure there are no triggers referrencing @@ROWCOUNT in the beginning. This value is set for the ENTIRE Merge command, not the single operation. Typically I've seen this used to do a quick eject from the trigger or switch the path from row parocessing to set processing. Look at the INSERTED / DELETED tables instead.
Unlike the OUTPUT from an INSERT statement, you can return the any generated values from both the source and target. If you need these values then make a merge with a non-mathcing condition (1=2).
If you only want to update the table when there has been a change there are two approaches, adding each column to the WHEN or adding an exists to the WHEN with an EXCEPT, like this:

MERGE INTO dbo.Customers AS TGT
USING dbo.CustomersStage AS SRC
ON TGT.custid = SRC.custid
WHEN MATCHED AND
EXISTS (
SELECT * FROM CustomersStage AS S WHERE s.custID = src.custid
EXCEPT
SELECT * FROM dbo.Customers AS T WHERE t.custid = tgt.custid
) THEN
UPDATE SET
TGT.companyname = SRC.companyname,
TGT.phone = SRC.phone,
TGT.address = SRC.address

Misc cool stuff

If the goal is to have a non-deterministic sort but an ORDER BY is required, the field list can replaced by a (SELECT NULL)
Left to Right reading of the Execution Plan is how the API works
The lack of a ; after each SQL Statement is officially Deprecated!
Avoid using IIF and CHOOSE statements. They were added as a stopgap to help with conversions from MS Access. Behind the scenes they are converted into CASE statements by the optimizer.
Interesting trick that can be embedded in other queries to get a delimited list:

SELECT N',' + QUOTENAME(orderyear) AS [text()]
FROM (SELECT DISTINCT YEAR(orderdate) AS orderyear FROM Sales.Orders) AS Y
ORDER BY orderyear
FOR XML PATH('')

PreCon Day 1

2012-11-06T09:43:00.000-05:00

So the conference began. Registration was a snap, I got a nifty backpack that is big enough to hold my workstation (an Elitebook 8560w) and was remarkably comfortable. I don't think it will replace my Tom Bihn bag for travel but it may be my new commute backpack.

The session I attended today was SQL Server 2012 in a Highly Available World and, while it was not what I expected I found a lot to learn. I was expecting a focus on Availability Groups with maybe a section on Failover Cluster Instances (FCI). Instead this was a much broader session. It started with Log Shipping, Database Mirroring (deprecated in SQL 2012), FCI and then finally Availability Groups. While I didn't want the first bits (Log Shipping has been around forever and Mirroring is going away) it was a good refresher and I learned a lot.

The best part of today's session was working through a problem that the person next to me was having during the breaks. There were two real questions

Can you mirror upstream from SQL 2008R2 to SQL 2012?

It turns out you can, if the patches / services packs are correct. In fact this will be demonstrated later this week in one of the sessions!

Can you combine Failover Clusters with Availability Groups?

Yes, but it prevents automatic failover between datacenters, this may be fixed in a service pack.

I now have an intense need to get VMWare going on my laptop and start messing with cluster configurations.

My notes from this session are below.

PASS Pre-Con 1: SQL Server 2012 in a Highly Available World

This was an overview of the High Availability options typically used in SQL 2012. Topics ranged from the venerable Log Shipping to the newer and shinier Availability Groups.

Log Shipping

What happens when the database is set to "Bulk Logged" mode and log shipping is enabled? It turns out that the data never makes it over to the target system unless it is updated afterward. This is a good example of why data should only be bulk loaded into staging tables, cleaned and moved to final tables using normal, logged commands.

Test log shipping Disaster Recovery (with an outage window)

Server 1: "BACKUP LOG ... NORECOVERY"
Server 2: "RESTORE LOG ... WITH RECOVERY"
Do stuff (nothing bad) and flip it back
Server 2: "BACKUP LOG ... NORECOVERY"
Server 1: "RESTORE LOG ... WITH RECOVERY"

This maintains allows you to maintain the log chain and test DR as long as only legitimate transactions happen as part of the test.

Mirroring

The reason that this still matters is that is was JUST deprecated. That means it will still be there for at least the next two releases (SQL Server 2014 and 2016 presumably). The last version of SQL Server with mirroring will be supported for at least 10 years after release, that means it will be 2026 before it will be gone.

Unsupported DB Objects

FileStream
FileTable

Endpoint

The Log Stream endpoint hangs out on port 5022. This endpoint will ONLY talk to other SQL Servers and is always encrypted, typically by a server signed certificate.

Concerns

If everything is working well, Synchronous replication is expected to only add 5ms to the DML commands at distances < 100 km. If the distance is > 100km then Asynchronous Replication is recommended. The Mirroring overhead time is viewable in the Mirroring Monitor (which presumably fronts a DMV).

The greatest visible lag in mirroring is when the application realizes that it must switch to the secondary node. The application needs to have a connection timeout happen (5 - 10 secs) before flipping to the next server (which will of course work).

Mirroring is not impacted by single user mode (other than nobody doing transactions) because it is using a log reader and is not "in" the database.

Always On - Clustering (Local)

Cluster Configuration

The dreaded Hardware Compartibility List (HCL) is gone! The reason for this is that the storage options for the cluster now include network shares (not just iSCSI). Now who in their right mind thinks that network shares over SMB are a good idea for the DB? Well the enhancements to SMB (starting in 2.2 with Win 2008R2 but much improved in 3.0 with Win2012) are pretty amazing. They have removed the overly chatty aspects of the protocol and made it a viable alternative to SAN storage. There is a lot of value going Win2012 instead of Win2008R2 for the SMB 3.0 enhancements.

Sadly, the validation wizard still needs to execute to make sure that the node can be part of a cluster. The good news is that it is a lot less invasive (doesn't do things like take SAN Luns offline). It is still slow, while a 2 node cluster may take 10 minutes, larger ones can take a weekend or longer.

When setting up the cluster, if you are not a domain admin you will need to either get access to add computers to AD or get the new cluster name added to the directory before installing.

The heartbeat for the cluster is using IPv6, if it is not a dedicated connection between the servers the network must be able to accommodate it (it's 2012, probably time to update to IPv6 anyway).

Windows Core is completely supported for the database engine. An interesting thing about core is that (in case of emergency) you can now activate the full GUI when required and then flip it off again when done. The core instance actually keeps all the DLLs and extra files on disk, but in a packed form that is unpacked (into memory?) when needed.

Best Practice tip: Try to have your resource group name match the network group name for your cluster. It will improve manageability a lot.

Quorum Modes configuration

Node Majority Set should be chosen for large clusters (> 2 nodes) with an odd number of nodes
Node and File Share Quorum should be chosen for large clusters that may not have an odd number of nodes
Quorum Drive can be used for 2 node clusters but remember, this is a single point of failure.
Win2012 improvement: Dead Nodes are removed from voting after they are kicked out of the cluster so the quorum requirement is recalculated.

Database Cluster Configuration

TempDB can finally use direct storage! This allows the use of local SSD or even Fusion-IO cards (gasp!). This can be a lifesaver in systems with heavy TempDB usage.

The number of available SQL Server nodes per cluster is getting absurd in Enterprise Edition (but is still stuck at 2 for Std and the new BI Editions). The cap is somewhere around 64, but is really limited by the number of drive letters (26 - A,B,C,D = 22). Technically, this can probably be worked around with mount points but requires further research.

One thing that seemed to cause confusion was the idea that even named instances can be treated like a default instance if you left it hanging around on port 1433. I've always thought of instance naming as just a human readable shortcut for remembering the port number (I know for other protocols the name is more important but most systems are TCP/IP focused). So, if the instance is locked to port 1433, then when you try to connect without an instance name that is the door it knocks at.

When reviewing the cluster resource groups make sure that the mount points are explicitly listed and set as proper dependencies, this does not happen automatically at this time. This can be complex for mount points because the label can be a bit obscure, use the Disk ID to match the volume in Disk Management to the one in the Cluster Manager.

Slipstream installs are now deprecated. I didn't see that one coming. The reason for it is that every patch that comes out need to get an updated slipstream image generated. Now you can store the patches in a centralized location and point the installer at it. It will pickup and apply the patches automatically. This is a big improvement if you haven't automated your image generation yet, otherwise it's a minor change.

Only the Database Engine, Analysis Services and SQL Agent are cluster aware. The other SQL Server goodies are STILL not cluster aware. This includes the MDM solution (MDS + DQS).

Always On - Geo Distributed Clusters

There have always been a LOT of obstacles when doing Geo-Clustering, primarily around moving the data and managing the network configuration. As far as I know, MS is not touching the data side, they are leaving that to the SAN experts for this option. However, MS now has options for Multi-Subnet Failover where the IP Address can be set dynamically based on the available subnets. This feature may stop the network admins from wanting to kill you.

Always On - Availability Groups

Different from Mirroring

This is really a fork of the mirroring technology used previously. It is still an encrypted transaction log stream replayed from the primary to the secondaries with the same overhead.

There can only be a total of 4 replicas of any single availability group.

Synchronous Replicas are limited to 2
Asynchronous Replicas are limited to 4

This means you can have 4 Async Replics or 1 Sync + 3 Async or 2 Sync + 2 Async but NOT 3 Sync + 1 Async.

Improvements

Read-Only replica (independent statistics kept in tempdb)
Routing based on application intent
Increased number of worker threads
Increased cap from 20 to 200 Replicas / instance
Replication Publishers and Subscribers are supported
Filestream / Filetable
Extended events are supported

Limitation

Cannot be used across domains
Does not support the replication distributor database
No SQL Server Profiler Events are available for Availability Groups

Fancy stuff. You can set your read-intent route to point at a load balancer that will bounce you between several read-only instances if desired.

FCI + AG

The combination of Failover Cluster Instances can be used in conjunction with Availability Groups for a very comprehensive solution (licenses may get pricey though). The sequence for doing this is:

Install Windows Cluster Services
Install SQL Server Cluster
Enable Always On Availability Groups
Create / Restore Databases
Configure Availiability Groups

SQLCAT has a whitepaper on implementing this combined cluster.

Backups

Full and Differential backups shold be taken from the primary
Transaction Log backups can be taken from the secondary. The info is pushed up to the primary so it knows that the logs have been archived.
Only maintenance plans (and Ola's script) support backup rules so far.

Connecting

An Availability Group Listener is created (more Active Directory fun) and all systems should connect to that name.
The Availability Group Listener is not supported by the SQL Server Browser and is recommended to live on port 1433 to avoid confusion.
If there is a read intent connection and no secondaries are available to service the request it will execute on the primary.

PASS Summit 2012 Pre-PreCon

2012-11-05T08:31:00.001-05:00

I flew into Seattle last night and I'm here for PASS Summit 2012 this week. I can hardly wait for things to get going (hence blogging at 5am).

Why is it so exciting? Well this is what I liked so much last time.

This is a SQL Server conference, but it is not run by Microsoft and SQL Server (and Data in all it's varied forms) is the ONLY topic. This means that the party line was not always followed and DB folk were not playing second fiddle to Developers or Server Admins.

The presenters were all top-notch. Not just good presenters but some of the best SQL Server presenters from across the country (world?) are here. They are experts in their arena, not only specific parts of SQL Server but on the broader topics of data modeling, data integration, data warehousing or enterprise data architecture. There were discussions that continued long after the sessions were over and private dialogues sometimes continued past the conference over email.

The people, not just the presenters were great too. There is something special about being surrounded by a totally new group of peers with different and exciting experiences. Everyone was having fun sharing war stories about the weird corner-cases that were encountered.

I learned more from that week at PASS Summit 2009 than I ever expected.

This week is gonna be great.

Some, Any and All oh my!

2012-10-07T10:00:00.000-04:00

Tales of the exotic and strange

When discussing some of the strange parts of a "new to me" system, a co-worker said "That's nothing, our queries have rules where the data has to meet Some, Any or All criteria!"

Now this sounded completely absurd to me so I just let it slide by, but it got me thinking. There are a number of unusual ANSI operators that appear to be logically redundant, maybe this is real.

Sure enough, it is real.

What is surprising, is that isn't an Operator but instead is a Modfier to existing operator.

Some / Any

Some and Any are synonyms. If the subset (query, or static list) has any records that meet the criteria records are returned.

Isn't that an "IN"?

Nope. "IN" is an operator, and would be the same as the combination "= ANY" but because this is a modifier you can pull some neat tricks like ">= ANY". This is handy for checking things like security levels or general product availability.

Example:

SELECT firstName, lastName

FROM employee AS e

WHERE

e.secId >= ANY

(

SELECT minSecId

FROM securedAssets AS sa

WHERE

sa.region = 'Inventory Mgmt'

)

;

All

All is another one with some interesting rules. Everything that is returned needs to meet the criteria. This is pretty terrific for things like being able to test if there is enough inventory to build an assembled item. You know if you are building 3 widgets, you need to have at least 3 of each part.

Example:

SELECT widgetName

FROM widget AS w

WHERE

3 >= ALL

(

SELECT availableWidgetParts

FROM widgetParts AS wp

WHERE

wp.widgetId = w.widgetId

)

;

Conclusion

Now this is a nifty feature, and there are plenty of other ways to do this for the majority of cases. I'll probably lump it in INTERSECT and EXCEPT as part of my seldom used, exotic query techniques.

Data Modeling Rematch!

2012-05-19T16:59:00.000-04:00

So I've got a lot of data modeling ahead of me at work and I need tool. The system was being tracked in Visio and that doesn't scale, for the reason I've noted before. So I did a breakdown before of the choices but I wanted to address my specific needs this time.

I like ModelRight, a lot. If you read the review you know it has an excellent working style. The mode-less property boxes are incredibly efficient and let you make changes remarkably quickly.

I also really enjoy ER/Studio Data Architect (referred to as ER/Studio thoughtout this post). The User Interface is extremely smooth and responsive. I do find that the modal dialog boxes slow me down a bit.

So I've been doing a bake-off between the two. They both support the same core functionality really well (forward and reverse engineering) and are fairly stable.

I'm working with a larger (~700 table) model and seeing how it does.

Data Modeling

So which is faster for actually developing a new model?

ER/Studio

This is very much an awkward process of dragging the table on to the screen, double-clicking it and adding entries to the table. There are custom domains that can be defined to group similar types together and provide guidelines to improve consistency. The tabs can be awkward and it feels like a very stilted process.

ModelRight

This is a bit more natural because after you drag the table over, you can start editing on the canvas or go to the docked modeless window. The same domain features are available to provide consistency for your columns and types. The modeless windows really do make this a much faster process to create and modify the tables.

Model Layout (Reverse Engineering)

ER/Studio

ER/Studio offers several different layout models to improve layout accuracy

Circular
Orthoganal
Hierarchical

These help to various extents based on your sub-model contents.

The canvas the ER/Studio has is unmatched for it's smoothness when manipulating the entities and adjusting the layout.

ModelRight

ModelRight only has auto-arrange, with an adjustable number to indicate spacing.
ModelRight does it better (after fiddling with the number).

The diagram needs less adjustment for clarity even if the UI isn't as smooth as ER/Studio offers.

Reporting

It's important to be able to distribute the model. Typical options for this are JPG, XML, HTML and RTF.

ER/Studio

JPG: Good
XML: Limited information is available but ok
HTML: Good. It's a nice tree structure and they provide a fairly good interface for it
RTF: Good. It's a word doc, nothing super fancy but it is pretty clear.

ModelRight

JPG: Good
XML: Good. Super Detailed info in this file
HTML: Brutal to setup. Basically it is an XML export that expects you to transform it with an XSLT. Complete Customization, but out of the box it is weak. On the other hand, the help file says that ModelRight support will help you craft any XSLT you need.
RTF: I don't think it's offered. May be available as an XSLT from support.

Sales and Support

ER/Studio

I've been going through the Embarcadero sales rep whenever I needed anything. There was some initial lag on responsiveness but when it was clear I was serious, things progressed well. She brought engineering on to a call to demonstrate functions they felt were superior to their competitors. It was very interesting and involved a lot of add on products.

I also contact support for two incidents. There was about a one day lag between requests for help and responses. Neither incident was resolved but the staff was courteous and tried to help.

ModelRight

The sales rep has been very responsive through email, but that was the only communication channel. I questioned her about features in the next two versions and the answers were useful without over promising.

Support has been remarkable. The support team was a bit terse in it's replies to my questions but always responded quickly. They went so far as to make an application build with fixes for bugs a couple of issues I found the same day. This support is amazing considering I'm still running the trial version.

Defaults

I'm not even sure if the section should be in here but I think it is important from the standpoint of balance.

ER/Studio

The defaults here are so good that I didn't even realize that a can configure a lot of the options to do something different. A lot of thought went into the initial configuration of out of the box.

ModelRight

This feels like it was developed by programmers. For example

The screen defaults to being overly crowded (even at 1600 x 900) with too many widgets active by default.
Reports REQUIRE an XSLT to be legible outside of Internet Explorer.

On the other hand, this leads to things like

A complete XML export of the model is available and can be diff'ed with older versions
Quick short-cut to take a JPG of your diagram
Ridiculously powerful scripting

Conclusion

ER/Studio is an excellent, polished tool. I have the budget to purchase it for myself and my team but not all of the secondary features. I could not get the repository or portal functionality which is what I need to be able to distribute the model to the development team.

ModelRight has similar (or better) features when compared to the baseline ER/Studio package. The price is about half but it also includes a free viewer and more flexibility if you are interested in putting a bit more work into it. Future enhancements slated for ModelRight 4 look very impressive and I'm glad to be on board with them for this project.

TLDR: ModelRight wins! Embarcadero has a lot to worry about with ModelRight. I will be using them as my data modeling platform of choice for the foreseeable future.

Sure it scales UP, but does it scale DOWN?

2012-05-02T17:37:00.002-04:00

I've recently run into an interesting issue, scalability doesn't necessarily go both ways.

Now, intellectually I understand that you make trade-offs to scale to millions of users but that doesn't help my "gut feeling" that I am doing something wrong when I accept the trade-off.

Let's take the classic example of triggers. Triggers tend to be very wrong when it comes to performance, however they are highly effective at securing the data and ensuring business rules are followed. I weigh the need for triggers very carefully before I add them to any table. This can be much more effective than adding the logic to every system that must integrate with the database.

Small databases allow you to pull all sorts of fun tricks that you would never consider otherwise.

Performance concepts do not always scale down to provide the flexibility that is needed on some projects. I've been measuring system with 500GB as a baseline for a while now and forgetting about the ~100GB options.

Right now, I'm looking at different ways to do row, column and cell level security. Almost none of these are viable in a large database. In fact, most of these are not even viable in a "small" database but this gives me a lot of freedom to experiment with concepts, which is fun.

Change, it's inevitable

2012-04-07T21:40:00.000-04:00

I'm in the middle of changing jobs, I mean it literally. The last day for my old job was Friday, the first day of my new job is Monday, and it's Saturday night.

There is much to miss with my old position. I had the chance to participate in changing the core data strategy and the luxury of doing it with the right people. I can't stress this enough, the right people make the difference. I can't mention leaving without mentioning them because they are an incredible group.

When I announced my plans to move on, it was possible to promote from within because every had grown so very much. They are all brilliant and I will miss working with them.

So now I'm switching my role up a little bit.

I'm moving on to more of a straight Data Architect position instead of Enterprise data. I have the chance to build out a number of interesting "cloud" products that will have several very serious, real world applications.

There is a good bit of greenfield development and different agile techniques I expect to run into. I'll post as I learn new and interesting things.

The State of Data Modeling Tools (Conclusion)

2012-02-18T15:31:00.000-05:00

After reviewing a bunch of the data modeling tools out there, I've come to the conclusion that there are some great tools out there, some workable tools and some tools to avoid.

Data Modeling Tools Result Summary

CA ERwin

This is a toolset that has been in the decline for while. CA is trying to pull it out of the nose-dive it's in but it has a legacy code-base without a lot of the original team around to explain it. As a long term system I can't recommend it. It will continue to go downhill.

Visio 2010 Professional

Fine for small models but it will quickly become to unwieldy with a significant model. Microsoft doesn't seem interested in this aspect of the tool market and are not improving Visio for it.

MySQL Workbench

Stability is important. It's free but my time is not. I don't recommend this tool.

SQL Power Architect (Community Edition)

This tool needs better integration with the actual database. It has the most potential of any open source modeling tool I've seen.

Embarcadero's ER/Studio Data Architect (Trial)

This tool works flawlessly. If you can afford it, buy it. They've been doing this for over 15 years and have built an entire product line around modeling tools.

Toad Data Modeler (Trial)

I admit that the lack of stability irritated me greatly but I don't think that this product has improved much over the last few years and it may be on life support from Quest. Not recommended.

ModelRight Professional (Trial)

For the price (roughly 25% of ER/Studio) this tool works amazingly well. The only thing it lacks is logical modeling, if you have one platform (like most folks) this is a great choice.

I hope that some of you found my research helpful. Your great comments help me remember to post. :-)

ModelRight Professional

2012-02-18T15:12:00.000-05:00

This is part of continuing set of reviews / opinion pieces about the state of data modeling tools that began over here.

Background

I was talking to a co-worker who was at LogicWorks before they were purchased by Platinum Technology (who were later purchased by Computer Associates) and he mentioned that there was a new ERD company on the block, ModelRight. It turns out that some the original Logic Works folks put this together and I gave it a whirl. It's UI is a bit different but worth investigating further.

Stability

This ia another product I used frequently over the evaluation period. The small model I tested with (< 50 table) worked great and the application was fast and responsive even after many days of use.

Price
ModelRight Professional: ~$1,095 ($895 + $199 maint)

ModelRight Professional for MySQL: ~$494 ($295 + $199 maint)

Features
Common Checklist

Logic Modeling: No
Physical Modeling: Yes
Reverse Engineering: Yes
Forward Engineering: Yes

Compare against files and live databases.

Sub Models: Yes

The UI is extremely responsive, not only do you grab things when you are intend to, but all of the properties for the selected item are immediately visible. This is because the properties are not in windows that popup but are in a docked pane. I like this very much but it requires a bit more screen space, which makes working with it on a laptop awkward. After I got used to it this made working with the model unexpectedly more efficient.

Reversing out the DB worked as expected and the submodel management was intuitive. Just drag the entity you want to add to a submodel into the model pane and it was added. Slick, fast and efficient.

The compare and forward engineering created some of the most efficient scripts of all the modeling tools I've used. They actually demonstrated features I didn't know about on MySQL and worked very well. I don't think I had to re-write any of the generated scripts.

Conclusion

If you have a single DB platform, ModelRight is definitely a modeling tool to consider. It is the only one that actually made me more efficient than I expected. The SQL it generated was was excellent and it was rock-solid.

Toad Data Modeler

2012-02-18T15:00:00.000-05:00

This is part of continuing set of reviews / opinion pieces about the state of data modeling tools that began over here.

Background

Toad is legendary for it's Oracle database management software. When I started at my current job I was asked to evaluate their SQL Server tooling. My evaluation was that the SQL Server tools were not fully baked and that Toad Data Modeler treated non-Oracle DBMS platforms as second class citizens. That was 4 years ago, Quest has put a lot of resources into the Toad family since then, lets see if Toad Data Modeler (TDM) has improved.

Stability

I used it for a few days to make my evaluation. The system only supports 25 entities in trial mode so I had to limited my model pretty severely. I didn't realize that I hit the limit until I was using it for a while, and it crashed. As soon as I hit the limit, the app stopped saving and hours got lost.

Later on I used the application for a few hours and it crashed again. Luckily, because the model was kept small I was able to use my last save.

Price
Toad Data Modeler: ~$450

Features
Common Checklist

Logic Modeling: Yes
Physical Modeling: Yes
Reverse Engineering: Yes
Forward Engineering: Yes

Does not compare against the live DB, only against previous models.

Sub Models: Yes

The UI is reasonably responsive.

Reversing the DB works ok but the arrangement leaves a lot to be desired.

The model compare and forward engineering works ok, but not great. Not being able to compare against a live DB leads to issues.

Conclusion

Toad Data Modeler is not stable. I can't stress the value of stability in tool selection. I really didn't enjoy using the tool and cannot recommend it.

ER/Studio Data Architect

2012-02-18T14:13:00.000-05:00

This is part of continuing set of reviews / opinion pieces about the state of data modeling tools that began over here.

Background

Embarcadero Technologies started off with making DB development and management tools. I ran into one of their tools (DBArtisan) in 1998 for managing my Sybase instances. At my next job I insisted on getting the full suite of Embarcadero Software. And my next job too. Unfortunately after that I wasn't able to justify the cost and it has been 7 years since I used the software. Did it age poorly, similar to what happened to ERwin? Let's see.

Stability

I used it nearly every day during the evaluation period, testing out a number of the different options. There were no problems with the small model I tested with (< 50 table) while it remained fast and responsive over many days being open and used.

Price
ER/Studio Multi-Platform: ~$4,615.00 ($3,700 + $915 maint)

ER/Studio Open Source Databases: ~$1,360 ($1,100 + $260 maint)

Features
Common Checklist

Logic Modeling: Yes
Physical Modeling: Yes
Reverse Engineering: Yes
Forward Engineering: Yes

Compare against files and live databases.

Sub Models: Yes

The UI is responsive, when you grab items they do what you expect them to. It's very similar to ERwin but the UI is much more responsive and stable.

After reversing out an existing database, the initial entity arrangement is surprisingly good, they have clearly put a lot of effort into it. There are several layout options and all of them work pretty well. Ultimately you still need to re-arrange the model a bit, but far less than any other tool I've used.

The Database compare and forward engineering makes very usable scripts that respect the integrity of the original model.

Conclusion

ER/Studio is an excellent product. There are add-ons to manage the model with a centralized repository and share database development tasks. There are also tools to establishing and maintaining conventions throughout your diagram.

In conclusion, ER/Studio is a great product, the price is high but it works flawlessly.

Visio 2011 (ERD)

2012-01-07T18:13:00.000-05:00

This is part of continuing set of reviews / opinion pieces about the state of data modeling tools that began over here.

Background

Visio is the default business diagramming tool in most organizations. I've used it off and on for after-the-fact documentation but never to create the actual model. The newest version of Visio (2010) included a slightly remodeled UI and some tools to help with choosing colors (surprisingly, that is very useful to me).

Stability
It's Visio, it works well for the small model I tested with (< 50 table) and had no stability issues.

Price
Visio Professional: ~$400.00

Features
Common Checklist

Logic Modeling: No
Physical Modeling: Yes
Reverse Engineering: Yes
Forward Engineering: No

Note, there is a plug-in created by Alberto Ferrari and hosted at CodePlex which does a non-comparing forward engineer of the DB

Sub Models: Yes

Once you get past the Visio UI quirks, documenting an existing system using Visio is a breeze. Yes, extra clicks are required everywhere. Yes, the relationship link points are limited and not fun to reset.

The initial arrangement of the entities is pretty bad, but after using SQL Power Architect and MySQL Workbench I can appreciate that the layout is not awful.

Microsoft has a team of writers maintaining the help system and there are several good web resources on building ERDs using Visio so there is plenty of available documentation.

The lack of a built-in forward engineering function can be worked around (as noted above) but it does not compare to the actual database.

Conclusion

I have used Visio to document data models in the past and I maintain that it can do this at an acceptable level.

Also Visio can be a good way to get down some initial thoughts on a model if that is all you have available to you.

I do not recommend using Visio for ongoing, iterative data model development. Forward engineering without a comparison function makes this work too awkward.

SQL Power Architect

2012-01-07T16:53:00.000-05:00

This is part of continuing set of reviews / opinion pieces about the state of data modeling tools that began over here.

Background

I must say that SQL Power Architect came out of no where for me. I was googling around when I stumbled upon it and it sounded too good to be true. A multi-platform, open source database modeling tool with an Enterprise version available (for pay naturally) if you need advanced features. Watching some of the videos it sounds like this was a tool developed in house that evolved into a product for the company. I couldn't find much more history than that, if anyone has some let me know and I'll add it in here.

Stability
I was back to using my small (< 50 table) model and had no stability issues. Rock Solid.

Price
SQL Power Architect: ~$00.00
SQL Power Architect Enterprise: $495.00

Features
Common Checklist

Logic Modeling: Yes
Physical Modeling: No
Reverse Engineering: Yes
Forward Engineering: Yes
Sub Models: No

This is an interesting contrast to MySQL Workbench. SQL Power Architect insists that you do all of your work in the Logical Model (called the PlayPen Database) and generate SQL to apply to your Physical Model. Of course this gets awkward if you need to update your Logical Model from the actual database.

Once again Auto-Arrange is a very poor choice. So many of the relationships overlap that it is difficult to understand the model.

There is no documentation for the Community edition available. To understand how this application works your best bet is to visit the YouTube SQL Power Channel and play around for a while. In all fairness the Enterprise Edition does have documentation available.

The GUI, well it is in Java, so it does not feel native. The scrolling is wrong, either it scrolls too fast or begins scrolling before you would expect it to. In most apps this is isn't a problem but as you're trying to work on your diagram this can get really awkward.

Conclusion
This is easily the best open source tool I've seen for data modeling. It is easy to setup, works with JDBC connections and allows you to compare and forward engineer your changes.

The downside is that all DB schema changes are expected to originate from it, or be added to the Logical Model manually. And the Logical Model does not support sub-models. And the GUI makes working with the models challenging.

These issues can be worked around by making a lot of small projects as sub-models. This raises it's own challenges with overlapping entities and synchronizing projects.

All in all if you are working on an open source project with limited funds this may be your best approach but watch out, you will spend more effort managing this model than you expect.

The State of Data Modeling Tools (MySQL Workbench)

2011-12-31T13:55:00.000-05:00

Background
I have only begun using MySQL and trying out the WorkBench tools from Oracle was the next logical step. This product was originally DBDesigner 4 from fabForce.net before it's developer was hired by MySQL AB to create a new set of MySQL GUI tools. This application has grown from a physical database designer for MySQL to include additional query and managment features. It is much closer to SQL Server Management Studio than it is to ERwin.

Stability
I was using a relatively small (< 50 table) model to work with and the system would crash after about on hour of use. The crashing was more frequent on the Mac side of things than it was when running on the PC but neither were very stable.

Price
MySQL Workbench: ~$00.00

Features
Common Checklist

Logic Modeling: No
Physical Modeling: Yes
Reverse Engineering: Yes
Forward Engineering: Yes *

* Note: Does NOT compare to existing schema. It will make fresh. new objects!

Sub Models: No

The auto-arrange does a terrible job distributing the tables, many were overlapping. The relationship lines are not adjustable, they go where the app assigns them to be.

The GUI makes me think of Eclipse, which is clearly not native on any platform and it generally works as expected. The notation can be changed to fit your conventions and the automatic screen refreshing worked flawlessly (not something to take for granted after using ERwin).

Conclusion
This is a very good tool for visually showing developers how a few objects are related to each other and firmly belongs in the MySQL suite of tools.

Unfortunately, this application is just a bit too primitive to use for anything other than the most basic data modeling. The lack of a Logical Model is less than ideal, but not a deal breaker. I did find the crashing, lack of sub-models, minimal control of the relationship lines and the poor forward engineering very frustrating. I recommend using a piece of graph paper instead.

The State of Data Modeling Tools

2011-12-30T20:00:00.000-05:00

Lately have had a chance to do some data modeling for a MySQL project and I thought I'd do a series of posts with a brief review the tools for making ERDs.

Data Modeling Tools

Evaluation Criteria

Stability
Price
Features

ERwin Review
I have been using ERwin for a few years now so I am pretty familiar with this tool. Like it or not it has been around for ages (Logic Works, who originally made ERwin started in 1988) and is the benchmark all other system have to be measured against.

Stability
The most recent version has not been entirely stable, I cannot make it crash on demand but it will occasionally lock up.

Price
ERwin Data Modeler Std Edt + 1 yr Maint: ~$3,900

Features
Common Checklist

Logic Modeling: Yes
Physical Modeling: Yes
Reverse Engineering: Yes
Forward Engineering: Yes
Sub Models: Yes

Now having used it for a while I know a number of ERwin's quirks but stepping back and looking at it with a fresh eye, it is dated. The GUI looks like it is still built on an ancient toolkit which is likely why the work area screens require manual refreshes to clean up visual artifacts.

Conclusion
ERwin has the core features I'd like, but it's not entirely stable, the GUI is quirky and the product is expensive. For the price I would expect CA to be rolling updates more frequently to address these issues but ERwin has had minimal updates these last few years.

NoSQL Summary

2011-01-30T11:56:00.000-05:00

The best short description of when not to use NoSQL that I've ever seen is from a reddit comment by bucknuggets

Assumptions:

a key-value pair data store isn't optimal storage in the perfect world since data quality is very difficult to manage, reporting is a nightmare, and since most data doesn't map elegantly to key-values app complexity grows.
but it may be a very effective storage solution at certain scalability thresholds

Then for sites below those thresholds it is premature optimization - that adds unnecessary risk to the project.