Thursday, November 11, 2010

How to find and remove duplicate rows in SQL Server

How to find and remove duplicate rows in SQL Server:


Relational databases by design are supposed to eliminate redundancy. The whole idea of Redundancy is that
you should store information in one place in order to maintain data integrity. Reality however is far from this truth as there are many Production databases out there with redundant data.  As such the duplicate records are alive and well unfortunately!

In today’s discussion we are going to look at some sample code that will let you find duplicate data across one or many fields.  Next we will look at some different methods on how to remove and delete that duplicate data. For today’s exercises, we are going to be working with sample Microsoft Northwind database which has been around for a few years to say the least.  All this SQL code has been tested on SQL server 2008 Enterprise Edition.

-How do we create duplicate data for testing purposes?

First we are going to make a copy of the CUSTOMERS table using the following sql query.
Next we are going to reinsert the data for Country=UK twice into the same table.  We can do this by using the following SQL query to populate duplicate data.

Before we get started with finding and deleting duplicate records, we are going to do a simple record count. This will let us know how many rows are present in the CUSTOMERS_BAK table. Now that we have our baseline data we need to start looking at different methods of finding duplicates in CUSTOMERS_BAK table.

We are going to go through a set of examples to show you how to accomplish this task.

-How do you find duplicate records in a SQL table?

There are many ways of finding duplicate data in a SQL Server table, however the simplest and the most efficient method is to use the GROUP BY clause. Further information on this SQL expression can be found on this website.  In our case we are going to be using GROUP BY with a HAVING clause to find the repeating (duplicate) rows. The first query will let us find any rows with repeating Company Names in our table. We are going to use the following TSQL query in SQL Server for this.

SELECT COMPANYNAME
FROM CUSTOMERS_BAK
GROUP BY COMPANYNAME
HAVING COUNT(*)>1




This query is looking for those rows in the CUSTOMERS_BAK table where the record count is greater than 1 when you group them by COMPANYNAME field. Notice you will get the 7 rows from UK customers that we had added in an earlier step. Here is a screen capture of this duplicate query for you.


Duplicate Company Name

What if we wanted to find out, how often these records are repeated within the table.  We can easily do that by running sql command with COUNT function in it.

You will have noticed that so far we have only looked at one field, COMPANYNAME the above case.  In a similar fashion we can add multiple fields to the SQL query. Let us see if he wanted to find any duplicate rows with fields CITY and COUNTRY that are being repeated in our table. How would we find the duplicates for two fields instead of one?  Let us try this sql server command first.

SELECT CITY, COUNTRY, COUNT(*) AS [# OF TIMES]
FROM CUSTOMERS_BAK
GROUP BY CITY
HAVING COUNT(*)>1




When we ran the above SQL statement ro find duplicate records, we got the following error:

Msg 8120, Level 16, State 1, Line 1
Column 'CUSTOMERS_BAK.Country' is invalid in the select list because it is not contained
in either an aggregate function or the GROUP BY clause.


The reason for this error is that although we are selecting CITY, COUNTRY fields, in our GROUP BY clause, we only have the CITY field in the SELECT portion. The number of fields in the SELECT clause has to match the GROUP BY clause. It is ok to have the COUNT(*) as it is an aggregate function. This is how GROUP BY clause works because you are trying to find the CITY and the COUNTRY fields that are also grouped by the same fields. 

You can simply run the following query that will give you the end result. You will notice that there are 11 rows in our table that match the duplicate criteria. 

SELECT CITY, COUNTRY, COUNT(*) AS [# OF TIMES]
FROM CUSTOMERS_BAK
GROUP BY CITY, COUNTRY
HAVING COUNT(*)>1







Let us say that we actually have duplicate rows across all the fields in a MS SQL table. This is probably the most common type of situation that you will run into working with redundant data. In this one you simply need to add all the fields in the table and also make sure that all those fields are also added in the GROUP BY clause.  Using our CUSTOMERS_BAK table we can use the following tsql script to do just that.

You will notice that there are seven rows in the customer table that are repeated.

How do you remove duplicate rows that?

So far we have covered how to find duplicate data, the next question obviously is: How to delete duplicate rows? There are many options and ways on how you can eliminate duplicate rows, we are going to look at two of these options. Both these options involve inserting non-duplicate data into another table first and then renaming the new table to the original table.

Remove duplicate rows - Option 1

This is the one I prefer as it is more efficient and uses similar code to what we have already seen.  We are going to use a GROUP BY without the having clause. Here’s the query to remove duplicates. We have also included a screen capture from part of the sql statement that removes the duplicates.



Remove duplicate data


Remove duplicate rows - Option 2

Another method is to use the UNION expression to filter out the duplicate rows and leave distinct values behind. Here’s the sql command in order to do that.


Related articles to Finding and Deleting Duplicate rows:


-Remove Duplicate Rows in SQL Server
-Delete duplicate rows in a table
-Using common table expression to remove duplicate data
-Microsoft SQL Server 2008 Management and Administration
-Microsoft SQL Server 2008 Bible
-SQL Server Tutorials

Microsoft SQL Server 2008 Step by Step (Step by Step (Microsoft))