Friday, 20 March 2026

APPROX_COUNT_DISTINCT function in SQL server

The APPROX_COUNT_DISTINCT function in SQL Server is an aggregate function that returns the approximate number of unique, non-null values in a group. It is designed for use with large datasets (millions of rows or more) where performance and responsiveness are more important than absolute precision. The function was introduced in SQL Server 2019 and is also available in Azure SQL Database and Azure Synapse Analytics. It uses the HyperLogLog (HLL) algorithm for estimation, which requires less memory than an exhaustive distinct count operation and is less likely to spill memory to disk. The function guarantees an error rate of up to 2% within a 97% probability.

Syntex

APPROX_COUNT_DISTINCT (expression)

COUNT (DISTINCT col) becomes slow and it consume memory heavy and chance of spill-prone on large datasets. While APPROX_COUNT_DISTINCT () is fast and low memory consuming. It is especially for billions of rows, telemetry, logs or analytics purpose.

It is working with INT, BIGINT, UNIQUEIDENTIFIER, VARCHAR, DATETIME, etc.

see the example

WITH nums AS (

  SELECT TOP (10000000)

         ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n

  FROM sys.objects a

  CROSS JOIN sys.objects b

 CROSS JOIN sys.objects c

 CROSS JOIN sys.objects d

)

SELECT

  COUNT(DISTINCT n) AS Exact,

  APPROX_COUNT_DISTINCT(n) AS Approx

FROM nums;

usually we can see:

Exact = 10,000,000
Approx ≈ 9,950,000 – 10,050,000

Generally, we are using

Rows

Recommendation

< 1 million

COUNT DISTINCT

1M – 100M

depends

100M+

APPROX_COUNT_DISTINCT

Never use it in below

Ø  Financial reports

Ø  Billing systems

Ø  Compliance numbers

Ø  Audits

We can use this on below

Ø  Dashboards

Ø  Trends

Ø  Big data analytics

Ø  Monitoring

Ø  ML feature engineering

No comments:

Post a Comment

If you have any doubt, please let me know.

Popular Posts