Applying Distinct and Dense_Rank in SQL

Posted on January 04, 2021

Recently, I came up with a problem statement where I had to filter my data to retrieve the first 3 distinct rows for each value of a certain column in SQL. The input dataset seems to be as follows:

TableName: TrackingData

TrackingNumber	City	DeliveryStatus
1	NewYork	Delivered
1	NewYork	Delivered
2	NewYork	Delivered
3	Seattle	Delivered
3	Seattle	Delivered
3	Seattle	NotDelivered
4	Seattle	Delivered
5	Chicago	Delivered
6	Chicago	Delivered
7	Chicago	NotDelivered
8	Chicago	Delivered
9	Chicago	Delivered

applying-distinct-and-dense-rank-in-sql

Problem Statement

These are the conditions that I need to use to filter the datasets:

Tracking Numbers are specific to the Cities. One city can have multiple rows with the same tracking number but the same number cannot be assigned to another city
Output dataset should have distinct tracking numbers for each city.
The Delivery status should be “Delivered”.
Maximum 3 rows can be there for each city.

Based on the above criteria the resultant dataset should look like the following:

TrackingNumber	City	DeliveryStatus	RowNumber
5	Chicago	Delivered	1
6	Chicago	Delivered	2
8	Chicago	Delivered	3
1	NewYork	Delivered	1
1	NewYork	Delivered	1
2	NewYork	Delivered	2
3	Seattle	Delivered	1
3	Seattle	Delivered	1
4	Seattle	Delivered	2

Preparing the Dataset

	CREATE TABLE TrackingData
	(TrackingNumber int, City varcha, DeliveryStatus varchar);

view raw denseRank.sql hosted with ❤ by GitHub

	INSERT INTO TrackingData VALUES
	(1,'NewYork','Delivered'),
	(1,'NewYork','Delivered'),
	(2,'NewYork','Delivered'),
	(3,'Seattle','Delivered'),
	(3,'Seattle','Delivered'),
	(3,'Seattle','NotDelivered'),
	(4,'Seattle','Delivered'),
	(5,'Chicago','Delivered'),
	(6,'Chicago','Delivered'),
	(7,'Chicago','NotDelivered'),
	(8,'Chicago','Delivered'),
	(9,'Chicago','Delivered');

view raw denseRank.sql hosted with ❤ by GitHub

First Glance to Select Query

	SELECT
	*
	FROM (
	SELECT DISTINCT *,
	DENSE_RANK() OVER (PARTITION BY City ORDER BY TrackingNumber)AS'RowNumber'
	FROM TrackingData
	WHERE DeliveryStatus ='Delivered')AS newtable
	WHERE newtable.RowNumber <=3;

view raw denseRank.sql hosted with ❤ by GitHub

Looks complex? It’s easier than what it looks at the first glance. Let’s decompose the query and understand it on the chunk. We have mainly three criteria: Distinct, Delivered and max 3 rows.

First two criteria can be simply achieved with the following select statement:

	SELECT
	DISTINCT *
	FROM TrackingData
	WHERE DeliveryStatus ='Delivered';

view raw denseRank.sql hosted with ❤ by GitHub

TrackingNumber	City	DeliveryStatus
1	NewYork	Delivered
2	NewYork	Delivered
3	Seattle	Delivered
4	Seattle	Delivered
5	Chicago	Delivered
6	Chicago	Delivered
8	Chicago	Delivered
9	Chicago	Delivered

So the above table consists of rows with Distinct TrackingNumber for each City which is successfully delivered.

Now, How can we get Maximum 3 rows for each City? As we see in the above table there are 4 rows for Chicago but the expected result is 3.

We are going to follow the following steps:

Add a new index column for each row based on the Cities(like NewYork 1…n, Chicago 1…n)
Select the top 3 rows based on the Index column

We can use the Dense_Rank function to assign a unique rank value to a distinct row for each group of data. We will use the Partition and Over function to group our data based on the city name as below:

	SELECT
	*,
	DENSE_RANK() OVER (PARTITION BY City ORDER BY TrackingNumber) AS'RowNumber'
	FROM TrackingData;

view raw denseRank.sql hosted with ❤ by GitHub

Output

TrackingNumber	City	DeliveryStatus	RowNumber
5	Chicago	Delivered	1
6	Chicago	Delivered	2
7	Chicago	NotDelivered	3
8	Chicago	Delivered	4
9	Chicago	Delivered	5
1	NewYork	Delivered	1
1	NewYork	Delivered	1
2	NewYork	Delivered	2
3	Seattle	Delivered	1
3	Seattle	Delivered	1
3	Seattle	NotDelivered	1
4	Seattle	Delivered	2

As we see in the above table, the rank is specific to the city and the unique rank is given only for the distinct row. Same rows are having the same rank value.

Now, we can filter the data based on our RowNumber column to get the top 3 rows for each country as below:

	SELECT
	*
	FROM (
	SELECT *,
	DENSE_RANK() OVER (PARTITION BY City ORDER BY TrackingNumber)AS'RowNumber'
	FROM TrackingData)AS newTable
	WHERE newTable.'RowNumber'<=3;

view raw denseRank.sql hosted with ❤ by GitHub

Output

TrackingNumber	City	DeliveryStatus	RowNumber
5	Chicago	Delivered	1
6	Chicago	Delivered	2
7	Chicago	NotDelivered	3
1	NewYork	Delivered	1
1	NewYork	Delivered	1
2	NewYork	Delivered	2
3	Seattle	Delivered	1
3	Seattle	Delivered	1
3	Seattle	NotDelivered	1
4	Seattle	Delivered	2

Note: We cannot directly use the RowNumber in the where clause since it is a derived column and the column is created only after the where clause is executed based on logical processing order. So we need to use the derived tables.

Finally, let’s merge all our select statement into one to get the desired output.

	SELECT
	*
	FROM (
	SELECT DISTINCT *,
	DENSE_RANK() OVER (PARTITION BY City ORDER BY TrackingNumber) AS'RowNumber'
	FROM TrackingData WHERE DeliveryStatus ='Delivered') AS newtable
	WHERE newtable.RowNumber <=3;

view raw denseRank.sql hosted with ❤ by GitHub

Output

TrackingNumber	City	DeliveryStatus	RowNumber
5	Chicago	Delivered	1
6	Chicago	Delivered	2
8	Chicago	Delivered	3
1	NewYork	Delivered	1
1	NewYork	Delivered	1
2	NewYork	Delivered	2
3	Seattle	Delivered	1
3	Seattle	Delivered	1
4	Seattle	Delivered	2

I hope this was helpful to you. Please free to give feedback on comment section :).

programming

sql relational-database database dense-rank distinct partition sql-server ssms