Ad-Hoc Querying: Best Practices

When running ad-hoc queries against a production database it is paramount that a) you guard against any unintended data modifications, and b) you keep your “footprint” as small as possible in order to guard against interfering with the normal operation of the server.  It is to this end that I intend to start a list of best practices for this scenario.   Ultimately, the best practice is to NEVER run ad-hoc queries against a production machine, but we all do it from time to time, usually out of necessity, sometimes out of sloppiness.  So for those times (and I mean the necessary times, not the sloppy times), please please please follow these small tips.

 Please feel free to add to this list:

a) Use the WITH (NOLOCK) table hint on all SELECT statements.  (For instance, SELECT TOP 100 <fieldlist> FROM AuditTable WITH (NOLOCK))

Description: The WITH (NOLOCK) hint tells the SQL engine NOT to place any locks on the rows/tables/database when it performs the query.  It also tells the SQL engine to read the table as-is – this means that it will read transactions that are still in progress (“dirty reads”), and could result in phantom rows or “non-repeatable reads” (look them up if you care about the details).   WITH (NOLOCK) is synonymous with changing the transaction isolation level to READ UNCOMMITTED, but is much simpler to type out, so it is recommended for ad-hoc querying.  While this may compromise the integrity of your results on a high-transaction system, it also means that the query will never lock or block any other transactions.  My guess is that in 99.9% of the ad-hoc querying you will do on a transactional production system, you are more concerned about not disturbing the live system than about the possibility that the data you’re retrieving is currently being modified, and thus may not be entirely accurate to-the-second.
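To illustrate the equivalence, here is a minimal sketch (reusing the AuditTable example from above; substitute your own table and field list) showing the hint alongside the isolation-level approach:

-- Option 1: the table hint, applied per table
SELECT TOP 100 * FROM AuditTable WITH (NOLOCK)

-- Option 2: the equivalent isolation level, applied to the whole session
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
SELECT TOP 100 * FROM AuditTable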

b) Use the TOP keyword in your SELECT statements to minimize the data transfer and the disk reads required.

If you are querying a table of any significant size (and production systems often contain tables with millions or billions of rows), please use TOP to limit the scope of your SELECT statement.  Usually an ad-hoc query is trying to find out something specific, so don’t go running broad SELECT statements that will clog a finely-tuned production system.  Be as specific as possible, and only return the rows you need.  For instance, if you’re looking through an audit log for the most recent person to modify a particular piece of data, adding TOP 1 to “SELECT UserId FROM AuditTable WHERE DataId=x ORDER BY [Date] DESC” could result in a HUGE improvement in query speed.  In fact, if your [Date] column is indexed, especially if it has a clustered index, using TOP will prevent the server from even having to read or consider the rest of the rows!  Comparing the best-case scenario (clustered index on [Date]) to the worst-case scenario (no indexes at all), the server would only have to read one record, instead of reading every record in the table, sorting them all, and returning them all.  Give your database server a break.  Use TOP (and indexes, but that’s a different topic) 🙂
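Putting these two tips together, a minimal sketch of the audit-log lookup described above might look like this (DataId = x stands in for whatever record you are investigating):

-- Combine TOP and NOLOCK for a light-footprint ad-hoc lookup
SELECT TOP 1 UserId
FROM AuditTable WITH (NOLOCK)
WHERE DataId = x -- substitute the id you are investigating
ORDER BY [Date] DESC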

Repeating rows in a table based on an Integer Column

This one’s hard to describe, but is something I’ve had to do on numerous occasions.  The scenario is where you have an integer column that represents the number of times that row needs to be “repeated” in the resultset.  Consider the following fictional example.  A summary table contains the number of “Events” (let’s say for this example that they represent “logins” for a user).  This summary table simply has the UserID, a date field indicating the month, and a column showing how many logins that user had for the month.  Let’s create this table and fill it with fictional data:

DECLARE @MonthlySummary TABLE (
    UserId int,
    [Month] datetime,
    EventCountForMonth int
)
INSERT INTO @MonthlySummary SELECT 1, '3/1/2007', 2
INSERT INTO @MonthlySummary SELECT 1, '4/1/2007', 4
INSERT INTO @MonthlySummary SELECT 1, '5/1/2007', 1
INSERT INTO @MonthlySummary SELECT 2, '3/1/2007', 3
INSERT INTO @MonthlySummary SELECT 2, '4/1/2007', 2

What we want to do in this fictional example is to “blow out” the table, so that we have exactly one record in our result set for EACH login.  Note that the data only shows a COUNT of logins per month.  So, the “Event Count For Month” column contains the number of times the row needs to be repeated.

To accomplish this, we can use the Incrementing Numbers table that we created earlier today, and use a “conditional join”.  See the following code, and note the join that uses a “less than” operator.

SELECT e.*
FROM @MonthlySummary e
INNER JOIN IncrementingNumbers ON IncrementingNumbers.Number < e.EventCountForMonth
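With the sample data above, this returns 12 rows in total: user 1 appears twice for March, four times for April, and once for May, and user 2 appears three times for March and twice for April.  A quick way to verify the blow-out (run in the same batch as the table variable above):

-- Sanity check: each summary row should repeat EventCountForMonth times
SELECT e.UserId, e.[Month], COUNT(*) AS RepeatedRows
FROM @MonthlySummary e
INNER JOIN IncrementingNumbers ON IncrementingNumbers.Number < e.EventCountForMonth
GROUP BY e.UserId, e.[Month]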

Please post a comment if you can come up with other examples where this would come in handy.

Creating and populating a table of Incrementing Numbers

There are many cases when it is useful to have a table of incrementing numbers.  An incrementing-numbers table can be used in many SQL tricks, several of which are described in this blog.  To create such a table and populate it with the numbers 0 through 9,999, simply use one of the following scripts.

With SQL Server 2005:

CREATE TABLE IncrementingNumbers (Number int PRIMARY KEY);
WITH digit AS (SELECT 0 AS digit UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9)
INSERT INTO IncrementingNumbers (Number)
SELECT thousands.digit*1000 + hundreds.digit*100 + tens.digit*10 + ones.digit
FROM digit ones CROSS JOIN digit tens CROSS JOIN digit hundreds CROSS JOIN digit thousands
ORDER BY thousands.digit*1000 + hundreds.digit*100 + tens.digit*10 + ones.digit

With SQL Server 2000:

CREATE TABLE IncrementingNumbers (Number int PRIMARY KEY)
INSERT INTO IncrementingNumbers (Number)
SELECT thousands.digit*1000 + hundreds.digit*100 + tens.digit*10 + ones.digit
FROM
(SELECT 0 as digit UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) ones
CROSS JOIN (SELECT 0 as digit UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) tens
CROSS JOIN (SELECT 0 as digit UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) hundreds
CROSS JOIN (SELECT 0 as digit UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) thousands
ORDER BY thousands.digit*1000 + hundreds.digit*100 + tens.digit*10 + ones.digit
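Either version leaves you with the numbers 0 through 9,999 (10,000 rows).  A quick sanity check, assuming the script above completed successfully:

-- Expect MinNumber = 0, MaxNumber = 9999, TotalRows = 10000
SELECT MIN(Number) AS MinNumber, MAX(Number) AS MaxNumber, COUNT(*) AS TotalRows
FROM IncrementingNumbers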

Case-sensitive comparisons or joins

If you ever need to do a case-sensitive comparison or a case-sensitive join, you have several options.  My recommended solution is to take advantage of the COLLATE keyword.  Collations containing “CS” in their name are “Case-Sensitive”.  By default, new databases are typically case-insensitive; however, the collation can be modified at the database level or the column level.

To do a case-sensitive comparison, use syntax such as the following:

SELECT * FROM myTable WHERE myColumn COLLATE Latin1_General_CS_AS = 'MyValue'

This will match only records where the value of myColumn is exactly “MyValue”, and will not match “myValue” or any other case variants.

 This syntax can also be used on joins.  For instance:

SELECT * FROM Table1 INNER JOIN Table2 ON Table1.CaseSensitiveColumn COLLATE Latin1_General_CS_AS = Table2.CaseSensitiveColumn COLLATE Latin1_General_CS_AS

In SQL Server 2005, you could also achieve the same results by converting the field and the comparison value to VARBINARY(MAX).  However, I have not run benchmarks to compare the efficiency of the two approaches.  I would presume that the former technique is preferable, simply because it’s more intuitive – after all, the COLLATE syntax exists specifically to define case/accent/sorting/language settings.

It is also worth noting that this same technique comes in handy when doing cross-database joins.  If the two databases have different collation settings, which is quite a common dilemma, the collations can be specified in the same fashion as above.  However, do be careful that you don’t accidentally “create” case-sensitive comparisons by choosing to convert either side of your comparison to a case-sensitive collation.
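As a sketch (the database, table, and column names here are hypothetical), COLLATE DATABASE_DEFAULT coerces both sides of a cross-database join to the current database’s default collation:

-- Coerce both sides to the current database's default collation
SELECT *
FROM Database1.dbo.Customers c
INNER JOIN Database2.dbo.CustomerOrders o
ON c.CustomerCode COLLATE DATABASE_DEFAULT = o.CustomerCode COLLATE DATABASE_DEFAULT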

Brief introductions…

Today I begin my first venture into the world of blogging.  As a long-time SQL programmer who has gleaned much of his SQL prowess from the blogging community, I figure it’s about time that I finally give something back.

Primary Key: NEWID() vs. IDENTITY column

Many times tables do not have values that are inherently unique, and thus it is useful to use a surrogate key.  A surrogate key is a column that is added to the table strictly for the purpose of providing a unique value.

Many people use an IDENTITY column to achieve this.  However, I have also seen many many scenarios where the surrogate key was a UNIQUEIDENTIFIER column with the default value of NEWID().

Whenever possible, avoid NEWID() and use an IDENTITY column instead (a sketch follows the list below).  There are several reasons for this:

  1. By default, SQL Server generates a clustered index on the primary key column.  Using a NEWID() column as a clustered index will invariably lead to huge performance degradation.  Since a clustered index roughly corresponds to the physical order of rows on disk, each non-sequential insert forces a re-ordering (page split) in the index.  Since NEWID() generates non-sequential values, nearly all inserts will necessitate this re-ordering.
  2. The clustered index column is included in each non-clustered index as a reference point.  This may seem inconsequential, but think for a moment about disk space:  A UNIQUEIDENTIFIER takes 16 bytes.  On a 10 million row table, the primary key column would fill 160 million bytes, or roughly 160MB.  This isn’t much in the grand scheme of things.  However, if that table also has 20 non-clustered indexes (not uncommon), and each of them requires a reference to the clustered index’s column, then merely the pointers from the non-clustered indexes to the clustered index would fill 20 x 160MB, or 3.2GB.  Now, if the primary key were an IDENTITY column, which is merely a 4-byte INT, that would be reduced from 3.2GB to 800MB.
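As a minimal sketch (the table and column names are hypothetical), an IDENTITY-based surrogate key looks like this:

-- IDENTITY(1,1) starts at 1 and increments by 1 on each insert
CREATE TABLE Customers (
CustomerId int IDENTITY(1,1) PRIMARY KEY,
CustomerName varchar(100)
)
-- The surrogate key is generated automatically; no value is supplied
INSERT INTO Customers (CustomerName) SELECT 'Acme Corp'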

It is worth noting that if you do require a UNIQUEIDENTIFIER column, you can set its default value to NEWSEQUENTIALID() instead of NEWID(), which will at least generate incrementing values, avoiding unnecessary page splits.
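A sketch of that variation (again with hypothetical names); note that NEWSEQUENTIALID() can only be used in a DEFAULT constraint, not called directly in a SELECT:

-- NEWSEQUENTIALID() produces ever-increasing GUIDs, so inserts stay in index order
CREATE TABLE AuditEntries (
AuditEntryId uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
EntryText varchar(500)
)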

Generating random numbers in SQL

Ever needed to generate a random number in SQL?  If you only need one random number, then the built-in RAND() function works fine.  If you need to, say, randomize the rows in a select query, then sorting by NEWID() does the trick.  However, neither of these gives you the ability to create a column called RandomNumber, containing an actual random integer.  The cases where this is needed are few and far between, but I’ve had to do it on a number of occasions for everything from loading sample data into a table to verifying the distribution of data values.

Here’s the trick: 

SELECT RIGHT(ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)),1) AS [RandomNumber] FROM <myTable>

In a nutshell, I’m simply using the random value found in NEWID() and converting it into a useful number.  The “1” can be replaced with any number between 1 and 7.  It represents the number of digits that the random integer will use.  So, “1” will give a random number between 0-9, “2” would generate random numbers between 0-99, etc.  I’m calling “7” the maximum because beyond 7 digits the randomness becomes less reliable; the reason is simply that only some NEWID() values will produce INTs larger than 9,999,999.  Also note that RIGHT returns a string, so cast the result back to INT if you need a truly numeric column.
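For instance, here is a sketch that generates one three-digit random value per row, reusing the IncrementingNumbers table from earlier as a row source and casting the result back to INT:

-- One random integer between 0 and 999 per source row
SELECT CAST(RIGHT(ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)), 3) AS INT) AS RandomNumber
FROM IncrementingNumbers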

70-442

I took the 70-442 exam the other day, and passed!  This means that I now have an MCITP: Database Developer certification.

Study tips: Know your transaction isolation levels!  There are lots of questions involving concurrency requirements.  Other than that, I’d highly recommend the Sybex book on the exam.  Overall, the Sybex book is much more thorough than the MS Press training kit, though I have to say that I only “got” the whole transaction isolation stuff after going through the examples in the MS Press book.  If you don’t have the time/money to read both books, read the Sybex one, and study transaction isolation levels separately on the web.
