The APPROX_COUNT_DISTINCT function in SQL Server is an aggregate function that returns the approximate number of unique, non-null values in a group. It is designed for use with large datasets (millions of rows or more) where performance and responsiveness are more important than absolute precision. The function was introduced in SQL Server 2019 and is also available in Azure SQL Database and Azure Synapse Analytics. It uses the HyperLogLog (HLL) algorithm for estimation, which requires less memory than an exhaustive distinct count operation and is less likely to spill memory to disk. The function guarantees an error rate of up to 2% within a 97% probability.
Syntex
APPROX_COUNT_DISTINCT (expression)
COUNT (DISTINCT col) becomes slow and it consume memory
heavy and chance of spill-prone on large datasets. While APPROX_COUNT_DISTINCT
() is fast and low memory consuming. It is especially for billions of rows, telemetry,
logs or analytics purpose.
It is working with INT, BIGINT, UNIQUEIDENTIFIER, VARCHAR,
DATETIME, etc.
see the example
|
WITH nums AS
( SELECT TOP (10000000) ROW_NUMBER() OVER (ORDER BY (SELECT
NULL)) AS n FROM sys.objects a CROSS JOIN sys.objects b CROSS JOIN sys.objects c CROSS JOIN sys.objects d ) SELECT COUNT(DISTINCT n) AS Exact, APPROX_COUNT_DISTINCT(n) AS Approx FROM nums; |
usually we can see:
Exact = 10,000,000
Approx ≈ 9,950,000 – 10,050,000
Generally, we are using
|
Rows |
Recommendation |
|
< 1 million |
COUNT DISTINCT |
|
1M – 100M |
depends |
|
100M+ |
APPROX_COUNT_DISTINCT |
Never use it in below
Ø Financial reports
Ø Billing systems
Ø Compliance numbers
Ø Audits
We can use this on below
Ø Dashboards
Ø Trends
Ø Big data analytics
Ø Monitoring
Ø ML feature engineering
No comments:
Post a Comment
If you have any doubt, please let me know.