Here’s the secret — the most important thing to know about SQL Server (in my opinion) is indexing. In transaction processing systems (also known as OLTP databases), database activity involves several statements. These include insert, update, delete, and select. There is a competing need between the modification statements (insert, update, and delete) and reading the data (select). Since the database has both — what should we do?
We must carefully balance the needs of both. As our first step, let’s look at indexing.
Indexes potentially improve performance when we search and sort, but they potentially hurt performance when we add or modify data.
The two major types of indexes in SQL Server are clustered and non-clustered. SQL Server stores data physically in one of two ways. The first way is according to a clustered index, which means that the key of the index determines how the data is physically stored on disk, and the second way is called a heap. For example, let’s say I create an index on a column called customerid. Each row of the data would follow in order. In other words, customerid 2 would immediately follow customerid 1, and as new rows are added to the table, the index is updated sequentially.
Think of it this way, I have a filing cabinet, and as I place things in the filing cabinet I am mindful of the way I will retrieve them. In my case, I often lecture on a specific version of SQL Server, and I have technical journals that arrive monthly. In my filing cabinet, I have the journals sorted by type, actual journal, and date. Right now, I am beginning to write and lecture about SQL 2012. My system works well in this case because I know that the “Denali” project (now branded as SQL Server 2012) is in the journals that were written around 2010 and beyond. The retrieval of the data is exactly how it’s laid out in my file cabinet.
What if I need to write about a particular feature in SQL Server? For example, let’s say I’m asked to lecture over all of the different versions of Reporting Services. The data isn’t physically ordered that way. We will come back and use a non-clustered index for this.
First, we need to look at the other way data is physically organized. A heap is defined as “in no particular order”. Think of it this way, have you ever known (or been) a person with an incredibly messy desk? As paper comes into the office, it simply goes in the pile or various piles (I don’t keep my desk this way, I can’t function). I have always been fascinated by the people that do keep everything in piles and function perfectly well in what looks like chaos to me. Their data is in no particular order, but, amazingly, they can retrieve information quickly.
SQL Server is able to retrieve information quickly regardless of the clustered vs. heap storage. However, we need some other indexes to help. These are called non-clustered indexes. We can have indexes that are built on top of the clustered index or on top of the heap.
Say you travel to an unfamiliar city, and you have an address for your destination but no idea how to get to that location. You need a map or a GPS, right? You don’t know how to get to the physical location, but you know someone who knows. That’s the way non-clustered indexes work. The data is not physically stored according to the non-clustered index, but the non-clustered index points to the physical location — the heap or the clustered index.
Going back to the example above — writing about Reporting Services would be easy. I would simply have a lookup chart that lists the topic, and next to it I would have the journal name and journal date of the journals with Reporting Services articles.
When you run a query in SQL Server, the Database Engine physically has to retrieve the data you’ve requested. In order to do that, it must know the physical order. For efficiency, we want to store the data in an order that we will search (there are several considerations beyond the scope of this post, but for now, we are only considering the retrieval). If we request the data in a way that it isn’t physically stored, we want a way to look it up, thus needing a non-clustered index.
We only get one way to store the data physically at a time, a clustered index or heap. However, there are probably multiple ways that I search or sort the data. I might request a particular customer in one query, and in another query I might search by last name. In both cases, I need to physically retrieve the data. If my clustered index for the table is on customerid, the first query would use that. If I had a second query, a non-clustered index on last name could be used.
In summary, remember that indexes help us when we search or sort. We will see in another post how they potentially hurt performance on data modification.