Sunday 13 October 2024

Statistics in SQL Server

SQL Server statistics are system objects that contain vital information about the distribution of index key values and regular column values. SQL Server uses a cost-based model to choose a "good enough" execution plan as quickly as possible. Cardinality estimation (estimating the number of rows to be processed at each step of query execution) is the most important factor in query optimization; it affects the join strategy, memory grant requirement, and worker thread selection, as well as the choice of indexes when accessing data.

When we have a query across multiple tables there are many ways in which SQL Server could decide to physically go about getting the results we want. It could query and join the tables in any order, and it could use different methods for matching records from the various tables we have joined together. It also needs to know how much memory to allocate for the operation, and to do that it needs an idea of the amount of data generated at each stage of processing. A lot of this requires cardinality estimation, and SQL Server uses objects called statistics to perform that calculation.

Although SQL Server automatically updates statistics, there are certain scenarios where we may want to update statistics manually. These scenarios include:

Ø  Large Data Changes: If a large percentage of data has been added, modified, or deleted from a table, we may want to update statistics manually to ensure that the query optimizer has the most accurate information.

Ø  Stale Statistics: If the statistics for a table or indexed view are stale (out of date), we may want to update them manually to ensure that the query optimizer has the most accurate information. This can happen when data is added, modified, or deleted from a table or indexed view after the statistics were last updated.

Ø  Performance Issues: If we are experiencing performance issues with a particular query, updating statistics may help to improve performance by providing the query optimizer with more accurate information.

Statistics can be kept up to date using the following methods.

Ø  Automatic Updates - SQL Server automatically updates statistics when a threshold of data modifications occurs, or when the query optimizer determines that the statistics are out of date. This is the default behavior for SQL Server, and it works well for most databases.
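Before relying on automatic updates, we can confirm whether automatic creation and updating of statistics is enabled for the current database (both are on by default):

```sql
-- Check the auto create / auto update statistics settings
-- for the current database
SELECT name,
       is_auto_create_stats_on,
       is_auto_update_stats_on
FROM   sys.databases
WHERE  name = DB_NAME();
```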

Ø  Manual Updates - We can manually update statistics using the UPDATE STATISTICS statement.

Update the statistics for a single table or indexed view using the following command:

UPDATE STATISTICS [table_name]

Where [table_name] is the name of the table or indexed view whose statistics are to be updated.
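UPDATE STATISTICS also accepts a specific statistics name and sampling options. As a sketch, using the dbo.Employee table and the ndx_employee_firstname index built later in this post:

```sql
-- Update all statistics on the table, reading every row
UPDATE STATISTICS dbo.Employee WITH FULLSCAN;

-- Update one named statistics object using a 50 percent sample
UPDATE STATISTICS dbo.Employee ndx_employee_firstname WITH SAMPLE 50 PERCENT;
```

FULLSCAN gives the most accurate statistics at the cost of reading the whole table; sampling trades some accuracy for speed on large tables.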

Update the statistics for all tables in a database using the following command:

EXEC sp_updatestats

When sp_updatestats runs, SQL Server updates only the statistics objects that it determines are out of date based on modifications since the last update. This means the update may not necessarily refresh all statistics for all columns in every table or indexed view.
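Before updating manually, we can check when each statistics object was last updated and how many modifications have accumulated since then using sys.dm_db_stats_properties (available from SQL Server 2008 R2 SP2 / 2012 SP1 onward). The query below uses the dbo.Employee example table built later in this post:

```sql
-- When were the statistics last updated, and how stale are they?
SELECT s.name AS statistics_name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter
FROM   sys.stats AS s
       CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE  s.object_id = OBJECT_ID('dbo.Employee');
```

A high modification_counter relative to rows is a good hint that a manual update is worthwhile.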

Let’s see an example.

Creating a table and inserting some records.

CREATE TABLE employee
  (
     id           INT IDENTITY PRIMARY KEY,
     firstname    NVARCHAR(100),
     lastname     NVARCHAR(100),
     placeofbirth NVARCHAR(100)
  )

CREATE NONCLUSTERED INDEX ndx_employee_firstname
  ON dbo.Employee(firstname)

--Insert some seed data
DECLARE @i INT = 1

WHILE @i <= 1000
  BEGIN
      INSERT INTO dbo.employee
                  (firstname, lastname, placeofbirth)
      SELECT 'FirstName' + CAST(@i AS VARCHAR(10)),
             'LastName' + CAST(@i AS VARCHAR(10)),
             'PlaceOfBirth' + CAST(@i AS VARCHAR(10))

      SET @i = @i + 1
  END;

 Records inserted successfully.

Selecting the records.
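The original post presumably showed a result grid here; a simple query such as the following confirms the seeded rows are in place:

```sql
-- Return a few of the seeded rows to confirm the insert worked
SELECT TOP (10) id, firstname, lastname, placeofbirth
FROM   dbo.employee
ORDER  BY id;
```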

See the statistics.

SELECT s.name AS statistics_name,
       c.name AS column_name,
       sc.stats_column_id,
       s.auto_created
FROM   sys.stats AS s
       INNER JOIN sys.stats_columns AS sc
               ON s.object_id = sc.object_id
                  AND s.stats_id = sc.stats_id
       INNER JOIN sys.columns AS c
               ON sc.object_id = c.object_id
                  AND c.column_id = sc.column_id
WHERE  s.object_id = OBJECT_ID('dbo.Employee');

We can see here that we have two statistics objects, one for each of the indexes our table has (the clustered primary key and the nonclustered index on firstname). SQL Server created these statistics when we created the indexes. So what happens when we query with a predicate on one of the columns that is not indexed? Assuming auto create statistics is turned on (it is by default), a new statistics object will be automatically created for that column…
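To trigger that behaviour, we can filter on the non-indexed lastname column (the value below is just one of the seeded rows) and then re-run the sys.stats query above; a new auto-created statistics object with a _WA_Sys_ name prefix should appear:

```sql
-- Predicate on a column with no index: with auto create statistics
-- on, SQL Server builds a _WA_Sys_... statistics object on lastname
SELECT id, firstname, lastname
FROM   dbo.employee
WHERE  lastname = 'LastName500';
```

The exact _WA_Sys_ suffix varies per database, which is why the DBCC commands below use the name generated in this particular example.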

See the stats

Stats Header

DBCC SHOW_STATISTICS('dbo.Employee','_WA_Sys_00000003_20ACD28B') WITH STAT_HEADER

From this we can see when the statistics were last updated and also how many rows of the full data it sampled to generate itself. 

Ø  Rows – The number of rows in our table.

Ø  Rows Sampled – this is how many rows were sampled to generate the statistics. SQL can generate or update statistics using sampling rather than reading all the rows. In this case we will see it did actually read the whole table.

Ø  Steps – If we imagine the statistics as a bar chart – this is the number of bars on the chart. Statistics objects have a maximum of 200 steps so if we have more distinct values in our column than that they will be grouped into steps.

Ø  Density – This is supposed to be the probability of a row having a particular value (calculated as 1 / Number of Distinct values in column)

The Density Vector

We can view the density information in this statistics object by running

DBCC SHOW_STATISTICS('dbo.Employee','_WA_Sys_00000003_20ACD28B') WITH DENSITY_VECTOR



Ø  All Density – This is supposed to be the probability of a row having a particular value (calculated as 1 / Number of Distinct values in column)

Ø  Average Length – Is what it says on the can. The average length of data in this column.

Ø  Columns – The names of the columns measured in this statistics object. We can have statistics across multiple columns. In this case it tells us these statistics are based on the “lastname” column.
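We can sanity-check the All Density figure ourselves. With 1000 distinct lastname values in the example table, the calculation 1 / (number of distinct values) should come out as 0.001:

```sql
-- All density = 1 / number of distinct values in the column
SELECT 1.0 / COUNT(DISTINCT lastname) AS all_density
FROM   dbo.employee;
```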

The Histogram

Lastly we have the Histogram option on SHOW_STATISTICS

DBCC SHOW_STATISTICS('dbo.Employee','_WA_Sys_00000003_20ACD28B') WITH HISTOGRAM

 

This last record set shows the distribution of the data, and is what we could effectively use to draw a graph of the relative frequencies of different groups of values. Each row represents a step – or bar on the bar chart – and as mentioned above there can be a maximum of 200 steps.

Ø  RANGE_HI_KEY – The upper limit of each step; each step contains all the values bigger than the RANGE_HI_KEY of the previous step, right up to and including this value.

Ø  RANGE_ROWS – The number of rows in the table that fall within this range, not including the rows that match the RANGE_HI_KEY itself.

Ø  EQ_ROWS – The number of rows equal to the HI_KEY

Ø  DISTINCT_RANGE_ROWS – The number of different values in the range that there is data for (excluding the HI_KEY).

Ø  AVERAGE_RANGE_ROWS – The average number of rows for a given value within the range.
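On SQL Server 2016 SP1 and later, the same histogram can also be read relationally via sys.dm_db_stats_histogram, which is handier than DBCC output when scripting checks across many statistics objects:

```sql
-- Read the histogram of every statistics object on the table
-- (requires SQL Server 2016 SP1 or later)
SELECT s.name AS statistics_name,
       h.step_number,
       h.range_high_key,
       h.range_rows,
       h.equal_rows,
       h.distinct_range_rows,
       h.average_range_rows
FROM   sys.stats AS s
       CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) AS h
WHERE  s.object_id = OBJECT_ID('dbo.Employee')
ORDER  BY s.name, h.step_number;
```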

 
