Applying Distinct and Dense_Rank in SQL
Posted on
Recently, I came up with a problem statement where I had to filter my data to retrieve the first 3 distinct rows for each value of a certain column in SQL. The input dataset seems to be as follows:
TableName: TrackingData
TrackingNumber | City | DeliveryStatus |
---|---|---|
1 | NewYork | Delivered |
1 | NewYork | Delivered |
2 | NewYork | Delivered |
3 | Seattle | Delivered |
3 | Seattle | Delivered |
3 | Seattle | NotDelivered |
4 | Seattle | Delivered |
5 | Chicago | Delivered |
6 | Chicago | Delivered |
7 | Chicago | NotDelivered |
8 | Chicago | Delivered |
9 | Chicago | Delivered |
Problem Statement
These are the conditions that I need to use to filter the datasets:
- Tracking Numbers are specific to the Cities. One city can have multiple rows with the same tracking number but the same number cannot be assigned to another city
- Output dataset should have distinct tracking numbers for each city.
- The Delivery status should be “Delivered”.
- Maximum 3 rows can be there for each city.
Based on the above criteria the resultant dataset should look like the following:
TrackingNumber | City | DeliveryStatus | RowNumber |
---|---|---|---|
5 | Chicago | Delivered | 1 |
6 | Chicago | Delivered | 2 |
8 | Chicago | Delivered | 3 |
1 | NewYork | Delivered | 1 |
1 | NewYork | Delivered | 1 |
2 | NewYork | Delivered | 2 |
3 | Seattle | Delivered | 1 |
3 | Seattle | Delivered | 1 |
4 | Seattle | Delivered | 2 |
Preparing the Dataset
CREATE TABLE TrackingData | |
(TrackingNumber int, City varcha, DeliveryStatus varchar); |
INSERT INTO TrackingData VALUES | |
(1,'NewYork','Delivered'), | |
(1,'NewYork','Delivered'), | |
(2,'NewYork','Delivered'), | |
(3,'Seattle','Delivered'), | |
(3,'Seattle','Delivered'), | |
(3,'Seattle','NotDelivered'), | |
(4,'Seattle','Delivered'), | |
(5,'Chicago','Delivered'), | |
(6,'Chicago','Delivered'), | |
(7,'Chicago','NotDelivered'), | |
(8,'Chicago','Delivered'), | |
(9,'Chicago','Delivered'); |
First Glance to Select Query
SELECT | |
* | |
FROM ( | |
SELECT DISTINCT *, | |
DENSE_RANK() OVER (PARTITION BY City ORDER BY TrackingNumber)AS'RowNumber' | |
FROM TrackingData | |
WHERE DeliveryStatus ='Delivered')AS newtable | |
WHERE newtable.RowNumber <=3; |
Looks complex? It’s easier than what it looks at the first glance. Let’s decompose the query and understand it on the chunk. We have mainly three criteria: Distinct, Delivered and max 3 rows.
First two criteria can be simply achieved with the following select statement:
SELECT | |
DISTINCT * | |
FROM TrackingData | |
WHERE DeliveryStatus ='Delivered'; |
TrackingNumber | City | DeliveryStatus |
---|---|---|
1 | NewYork | Delivered |
2 | NewYork | Delivered |
3 | Seattle | Delivered |
4 | Seattle | Delivered |
5 | Chicago | Delivered |
6 | Chicago | Delivered |
8 | Chicago | Delivered |
9 | Chicago | Delivered |
So the above table consists of rows with Distinct TrackingNumber for each City which is successfully delivered.
Now, How can we get Maximum 3 rows for each City? As we see in the above table there are 4 rows for Chicago but the expected result is 3.
We are going to follow the following steps:
- Add a new index column for each row based on the Cities(like NewYork 1…n, Chicago 1…n)
- Select the top 3 rows based on the Index column
We can use the Dense_Rank function to assign a unique rank value to a distinct row for each group of data. We will use the Partition and Over function to group our data based on the city name as below:
SELECT | |
*, | |
DENSE_RANK() OVER (PARTITION BY City ORDER BY TrackingNumber) AS'RowNumber' | |
FROM TrackingData; |
Output
TrackingNumber | City | DeliveryStatus | RowNumber |
---|---|---|---|
5 | Chicago | Delivered | 1 |
6 | Chicago | Delivered | 2 |
7 | Chicago | NotDelivered | 3 |
8 | Chicago | Delivered | 4 |
9 | Chicago | Delivered | 5 |
1 | NewYork | Delivered | 1 |
1 | NewYork | Delivered | 1 |
2 | NewYork | Delivered | 2 |
3 | Seattle | Delivered | 1 |
3 | Seattle | Delivered | 1 |
3 | Seattle | NotDelivered | 1 |
4 | Seattle | Delivered | 2 |
As we see in the above table, the rank is specific to the city and the unique rank is given only for the distinct row. Same rows are having the same rank value.
Now, we can filter the data based on our RowNumber column to get the top 3 rows for each country as below:
SELECT | |
* | |
FROM ( | |
SELECT *, | |
DENSE_RANK() OVER (PARTITION BY City ORDER BY TrackingNumber)AS'RowNumber' | |
FROM TrackingData)AS newTable | |
WHERE newTable.'RowNumber'<=3; |
Output
TrackingNumber | City | DeliveryStatus | RowNumber |
---|---|---|---|
5 | Chicago | Delivered | 1 |
6 | Chicago | Delivered | 2 |
7 | Chicago | NotDelivered | 3 |
1 | NewYork | Delivered | 1 |
1 | NewYork | Delivered | 1 |
2 | NewYork | Delivered | 2 |
3 | Seattle | Delivered | 1 |
3 | Seattle | Delivered | 1 |
3 | Seattle | NotDelivered | 1 |
4 | Seattle | Delivered | 2 |
Note: We cannot directly use the RowNumber in the where clause since it is a derived column and the column is created only after the where clause is executed based on logical processing order. So we need to use the derived tables.
Finally, let’s merge all our select statement into one to get the desired output.
SELECT | |
* | |
FROM ( | |
SELECT DISTINCT *, | |
DENSE_RANK() OVER (PARTITION BY City ORDER BY TrackingNumber) AS'RowNumber' | |
FROM TrackingData WHERE DeliveryStatus ='Delivered') AS newtable | |
WHERE newtable.RowNumber <=3; |
Output
TrackingNumber | City | DeliveryStatus | RowNumber |
---|---|---|---|
5 | Chicago | Delivered | 1 |
6 | Chicago | Delivered | 2 |
8 | Chicago | Delivered | 3 |
1 | NewYork | Delivered | 1 |
1 | NewYork | Delivered | 1 |
2 | NewYork | Delivered | 2 |
3 | Seattle | Delivered | 1 |
3 | Seattle | Delivered | 1 |
4 | Seattle | Delivered | 2 |
I hope this was helpful to you. Please free to give feedback on comment section :).