SQL Index – Introduction
“An index makes the query fast” is the most basic explanation of an index.
An index is a distinct structure in the database that is built using the
create index statement. It requires its own disk space and holds a copy of
the indexed table data. That means an index is pure redundancy. Creating an
index does not change the table data; it just creates a new data structure
that refers to the table. A database index is, after all, very much like
the index at the end of a book: it occupies its own space, it is highly
redundant, and it refers to the actual information stored in a different
place.
Index Structure
Pic-1 |
In SQL Server, indexes are organized as B-trees. Each page in an index B-tree is called an index node.
-
The top node of the B-tree is called the root node.
-
The bottom level of nodes in the index is called the leaf nodes.
-
Any index levels between the root and the leaf nodes are
collectively known as intermediate levels.
In a clustered index, the leaf nodes contain the data pages of the
underlying table. The root and intermediate level nodes contain index pages
holding index rows. Each index row contains a key value and a pointer to
either an intermediate level page in the B-tree, or a data row in the leaf
level of the index. The pages in each level of the index are linked in a
doubly-linked list.
Implicit Indexes: They are created when a column is explicity defined with PRIMARY KEY, UNIQUE KEY Constraint.
Explicit Indexes: They are created using the "create index" syntax.
Data Pages
There are total 14 types of page in SQL server. Here we are taking only two
types of page to understand the index architecture. In SQL Server, the page
size is 8 KB.
Each page begins with a 96-byte header that is used to store system
information about the page.
Page Size
A page is 8KB. 1024 bytes X 8KB = 8192 Bytes.
Header is 96 Bytes
Row offset is 36 Bytes
Total 132 Bytes
So, 8192 Bytes - 132 Bytes = 8060 Bytes.
Every node in the above diagram denotes a page (Data or Index). The
intension is here to know how intermediate level in B-Tree increase or
decrease based on the size of indexes and choose the best data types for
index. And also you know how to reduce the no. of pages traversing.
Let’s take a small example to understand the how intermediate level
increase based on the no. of rows in a table and index column. Index
building always starts from 0- leaf level. In below picture we have a table
of 1,000,000,000 rows and we assume that 1 page can hold 100 index entries
only.
In 0-Leaf level would contain 1,000,000,000 / 100 = 10,000,000 pages (1
page can hold 100 index entries only), Similarly the calculation goes up to
root level.
Pic - 2 |
Increase the no. of intermediate levels means increase of reading, more no. of pages and that overall decrease the performance.
Types of Indexes in SQL Server
There are two kinds of indexes in SQL Server: clustered and nonclustered.
Clustered Index - A table can only have one clustered index, because the clustered index sorts the rows in the table itself. Any column or group of columns can make up a clustered index, but ideally a clustered index should be:
-
Small (of a small data type) –The clustered index key is the pointer contained in each nonclustered index. If you therefore have a clustered index key that is large – for example, a 16 bit UNIQUEIDENTIFIER – indexes will take up much more space than if the clustered index key were smaller (e.g., a 4 bit INT).
-
Unique or highly selective
– The more selective an index, the more efficient.
-
Ever-increasing – The clustered index orders the rows in the table. If the clustered index key is ever increasing, new rows are added to the end of the table. Otherwise, new rows are inserted in the middle of the table, and the database engine must reorganize the data on disk more often.
-
Static
– A frequently changing clustered index key will cause rows to be
reordered within the table, causing unnecessary overhead.
Clustered Index Structure
Pic - 3 |
A non-clustered index, on the other hand, creates a completely different
object within the table that contains the column(s) selected for indexing
and a pointer back to the table’s rows containing the data. It is like an
index in the last pages of a book, where keywords are sorted and contain
the page number to the material of the book for faster reference.
Nonclustered indexes cannot be sorted like clustered indexes; however, you
can create more than one nonclustered index per table or view. SQL Server
2005 supports up to 249 nonclustered indexes, and SQL Server 2008 support
up to 999
NonClustered Index Structure –
Pic - 4 |
Key lookup is a lookup into a Clustered table using Clustered key.
A Bookmark lookup or RID Lookup is a lookup into a heap table using a Row ID.
The Row ID is included in a non-clustered index in order to find the rest
of a table's data in the heap table. Since a heap table is a table without
a clustered index and is sorted unordered a Row ID is required for the
correlation
SQL SERVER – Index Levels, Page Count and DMV – sys.dm_db_index_physical_stats
I have created a table called dbo.Customer from AdventureWorksDW2012 Database and no index created on it initially and total records are around 19K
Pic - 5 |
-
When the table does not have any clustered index, table is called
heap. Below you can see indexed= 0 means heap table and also
showing the page count and index level is 0.
- After creating a clustered index on table let’s check the index structure, how SQL server store internally.
Here you can see the IndexID=1 means clustered index. Also you can see then index level and pages distribution.
Index level = 0 Leaf level
Index level = 1 Intermediate level
Index level = 2 Root level
-
After creating a non-clustered index on table let’s check the index
structure. In this case we have both the indexes Clustered and
NonClustered.
Pic - A |
- The leaf level of this index is spread over 979 pages and total 18484 records.
- The one and only intermediate level requires just 2 pages and total 979 records.
- The root level is, as always, a single page and in this example holding total 2 records.
- The leaf level of this index is spread over 78 pages and total 18484 records.
- No intermediate level requires here.
- The root level is, as always, a single page and in this example holding total 78 records.
-
In this case we have only NonClustered index and no any Clustered
index on table (It’s a HEAP table)
Pic - B |
- No Clustered Index (HEAP table)
- Total 18484 records spread over 979 pages.
Non Clustered Index -
- The leaf level of this index is spread over 87 pages and total 18484 records.
- No intermediate level requires here.
- The root level is, as always, a single page and in this example holding total 87 records.
NOTE: In Pic - A you saw the NonClustered index holding the same no. of
records in 78 pages instead 87 pages in Pic - B. The reason of this
difference is Clustered Index in Pic - A and HEAP table in Pic - B.
Here in this example Clustered Index column took less space than HEAP table
in 0-leaf level of NonClustered index , because in HEAP table NonClustered
Index use ROW ID instead of Clustered Index key to find the rest of a
table's data in the heap table
Before execute any SQL command and see the execution plan let’s see How the
pages arranged in B-Tree
-
In case of Clustered Index the Leaf level hold the actual data
pages.
-
In Case of Non Clustered index the Leaf level hold either "Key
Value and Clustered Index Key " (If table have Clustered Index) or
"Key Value and ROW ID" (If table does not have Clustered Index)
that point to Clustered Index Page or HEAP table respectively
To understand this lets start a real example.
-
I Created a table called dbo.Customer from AdventureWorksDW2012 Database and no index created on it initially and total records are around 19K
Select * from table and see the execution plan
--> As expected the optimizer choose Table Scan because no index present.
Create a clustered index on table for a column (best candidate is unique column)
Again Select * from table and see the execution plan.
-->In this query we need all the columns. And we have just created Clustered Index on this table. But we are not using this column in the where clause, so optimizer choose clustered index Scan instead of Seek operation.
--> After creation the Clustered Index on the table the index structure has been organized and inter connected based on the order of the index column.
Select all the column with where clause (clustered column)
--> In this query again we need all the columns but here we are using index column in the where clause. Whenever we use index column in the where clause Index Seek operation happened instead of index scan. Because every node in the B-tree have the index column. That's why it quickly get the desired result set.
Select all the column with where clause (Non index column)
-
--> In this query again we need all the columns and we are not using any index column in the where clause. Optimizer choose Clustered index Scan as we have clustered index on the table.
- How Nonclustered indexes work
Here we can use same table that we have created for clustered index example, but remember before use remove all the indexes.
-- Drop clustered index
Drop INDEX IDX_ClusteredIndex ON [dbo].[Customer]
GO
Select * from table and see the execution plan
--> As expected the optimizer choose Table Scan because no index present.
Create a Non-clustered index on table for a column
-- Create a Non clustered index
CREATE NONCLUSTERED INDEX IDX_NonClusteredIndex ON [dbo]. [Customer] (CustomerAlternateKey);
GO
Again Select * from table and see the execution plan.
--> As we know the NC Index does not hold the actual data at 0-leaf level, and in the above query we need all the columns in select list, and at this time NC index alone does not satisfy the required result. And to satisfy to get the desired result NC Index go to scan the Clustered Index page if table have the Clustered Index or go for Table Scan if no Clustered Index defined. And in this example at this point we do not have any Clustered index created. That's why optimizer choose Table Scan instead of NC Index scan.
--> As we know the NC Index does not hold the actual data at 0-leaf level, and in the above query we need all the columns in select list, and at this time NC index alone does not satisfy the required result. And to satisfy to get the desired result NC Index go to scan the Clustered Index page if table have the Clustered Index or go for Table Scan if no Clustered Index defined. And in this example at this point we do not have any Clustered index created. That's why optimizer choose Table Scan instead of NC Index scan.
Select indexed column only
--> In the first picture we required only NC column from table. In this case NC index alone satisfy the required result. That's why optimizer choose NC index scan.
--> In the second picture we required two columns and one of them is not part of the NC index (CustomerKey). So again NC index does not satisfy the required result. That's why optimizer choose Table Scan in all case if you required any column other than the NC index column.
--> In the first picture we required only NC column from table. In this case NC index alone satisfy the required result. That's why optimizer choose NC index scan.
--> In the second picture we required two columns and one of them is not part of the NC index (CustomerKey). So again NC index does not satisfy the required result. That's why optimizer choose Table Scan in all case if you required any column other than the NC index column.
If you choose only indexed column then only it use non clustered index and in all cases it go for table scan.
Select all the columns with where clause (Non-Clustered column)
--> In this query we required all the column, but at the same time we are using where clause to get the desired result set. As we are using NC index column in the where clause, you can see the NC index seek operation in the execution plan.
--> Also you can see the RID Lookup operation. As you know we need all the column from table and NC index does not have all the data to satisfy the result. And we do not have Clustered Index created in this table as of till this point. NC Index 0-leaf level hold the RID (Row Identifier) to connect the desired data pages. So To get all the columns data NC index use RID to get all the columns data.
Select all the column with where clause (Not non-clustered index column)
--> In this query we need all the columns and also we are not using any Index column in the where clause for filtration. Optimizer choose Table Scan.
Select all the columns with where clause (Non-Clustered column)
--> In this query we required all the column, but at the same time we are using where clause to get the desired result set. As we are using NC index column in the where clause, you can see the NC index seek operation in the execution plan.
--> Also you can see the RID Lookup operation. As you know we need all the column from table and NC index does not have all the data to satisfy the result. And we do not have Clustered Index created in this table as of till this point. NC Index 0-leaf level hold the RID (Row Identifier) to connect the desired data pages. So To get all the columns data NC index use RID to get all the columns data.
Select all the column with where clause (Not non-clustered index column)
--> In this query we need all the columns and also we are not using any Index column in the where clause for filtration. Optimizer choose Table Scan.
Create a clustered index on the same table for a column
-- Create a clustered index
CREATE CLUSTERED INDEX IDX_ClusteredIndex ON [dbo].[Customer] (CustomerKey);
GO
Select all the column from table using where clause (Non-Clustered column)
--> Again we need all the column and we are using NC index column in the where clause. As usual NC Index seek operation happened.
--> But in the second operation you can see Key Lookup operation instead of RID Lookup as I mentioned in above execution plan. the reason behind this was here we have created a Clustered index before execute this query. When we create a clustered index on the table, NC index 0-leaf level hold the Clustered Key instead of RID. That's why Key Lookup operation happened instead of RID lookup.
--> Again we need all the column and we are using NC index column in the where clause. As usual NC Index seek operation happened.
--> But in the second operation you can see Key Lookup operation instead of RID Lookup as I mentioned in above execution plan. the reason behind this was here we have created a Clustered index before execute this query. When we create a clustered index on the table, NC index 0-leaf level hold the Clustered Key instead of RID. That's why Key Lookup operation happened instead of RID lookup.
Select all the column from table
--> Here we need all the columns in the query. As you know now the table have Clustered Index, so optimizer always choose Clustered Index Scan if no any index column used in where condition.
Select all the column from table using where clause (not using any Index column)
---> Same as above. In this case also Clustered Index Scan operation happened because in the where clause no any index column used.
In my next post I'll explain about the INCLUDE column or convering index and Filter index
No comments:
Post a Comment