How to Find and Remove Duplicate records in a table in SQL

In this article, we will learn How to Find and Remove Duplicate records in a table in SQL. We will explore some of the most effective method for finding and removing duplicate records in SQL, including using GROUP BY Clause, INNER JOIN, EXIST CLAUSE and ROW_NUMBER Function.

We need to create sample table

 CREATE TABLE [dbo].[Customers](
	[id] [int] NOT NULL,
	[firstName] [nvarchar](max) NOT NULL,
	[lastName] [nvarchar](max) NOT NULL,
	[job] [nvarchar](max) NOT NULL,
	[amount] [real] NOT NULL,
	[tdate] [datetime2](7) NOT NULL,
	[email] [varchar](25) NULL
	)

Then Insert the Sample record

 INSERT [dbo].[Customers] ([id], [firstName], [lastName], [job], [amount], [tdate], [email]) VALUES (1, N'Astle', N'Vicky', N'Software Engineer', 50000, CAST(N'2021-01-29T00:00:00.0000000' AS DateTime2), N'astle@example.com')
INSERT [dbo].[Customers] ([id], [firstName], [lastName], [job], [amount], [tdate], [email]) VALUES (2, N'John', N'Vicky', N'software engineer', 55000, CAST(N'2022-12-03T00:00:00.0000000' AS DateTime2), N'John@example.com')
INSERT [dbo].[Customers] ([id], [firstName], [lastName], [job], [amount], [tdate], [email]) VALUES (3, N'Fleming', N'Stuart', N'software engineer', 55000, CAST(N'2022-12-29T00:00:00.0000000' AS DateTime2), N'astle@example.com')
INSERT [dbo].[Customers] ([id], [firstName], [lastName], [job], [amount], [tdate], [email]) VALUES (4, N'Philip', N'John', N'Manager', 650000, CAST(N'2022-12-29T00:00:00.0000000' AS DateTime2), N'ricky@example.com')
INSERT [dbo].[Customers] ([id], [firstName], [lastName], [job], [amount], [tdate], [email]) VALUES (5, N'Ricky', N'Fleming', N'DBA Administrator', 78000, CAST(N'2022-12-29T00:00:00.0000000' AS DateTime2), N'ricky@example.com')
INSERT [dbo].[Customers] ([id], [firstName], [lastName], [job], [amount], [tdate], [email]) VALUES (6, N'mack', N'Mack', N'Manager', 56000, CAST(N'2022-12-20T00:00:00.0000000' AS DateTime2), N'nack@exampl.com')
INSERT [dbo].[Customers] ([id], [firstName], [lastName], [job], [amount], [tdate], [email]) VALUES (7, N'steva', N'margh', N'Team Lead', 456000, CAST(N'2022-12-23T00:00:00.0000000' AS DateTime2), N'steva@example.com')

Here are few SQL methods for finding and removing duplicate records in a table using SQL

Table of Contents

GROUP BY and HAVING CLAUSE

Suppose we have a table customer with columns id, firstname, email etc.

Suppose we have a table customer with columns id, firstname, email etc. To find and remove duplicate records based on the “email” column, you can use the following SQL query

SELECT EMAIL,COUNT(*)
FROM Customers GROUP BY EMAIL
HAVING COUNT(*)>1

This query will group the rows in the “Customers” table by the “email” column, and the return the email and count of each group that has more than on row. To delete the duplicate records , you can use the following SQL query

DELETE FROM Customers WHERE id NOT IN(SELECT MAX(id) FROM Customers GROUP BY EMAIL)

This query will delete all records in the customers table that have a duplicate email, except for the one with highest id value

EXISTS CLAUSE

Suppose we have a table customer with columns id, firstname, email etc To find and remove duplicate records based on the “firstname” column, you can use the following SQL query

SELECT c.id,c.firstName
FROM Customers c WHERE EXISTS (
SELECT * FROM Customers cust
WHERE c.firstName=cust.firstName
AND c.id > cust.id
)

This query will select all rows in the customers table where there exists another row with the same firstname and a lower id value. To delete the duplicate records , you can use the following SQL query

DELETE c FROM Customers c,Customers cust
WHERE c.firstName=cust.firstName
AND c.id > cust.id

This query will delete all records in the customers table that have a duplicate firstname, except for the one with lowest id value

Document

ROW_NUMBER Function

In this query , I’m using the same example as in the above one, but this time I’m using the ROW_NUMBER Function to find the duplicate data

SELECT id, email , ROW_NUMBER() OVER (
     PARTITION BY email ORDER BY id
	 ) AS row_dup
	 FROM Customers

The query return the id, email and row number for each row from the email column. To remove duplicate rows, modify the query as follows

WITH CT AS(
SELECT id, email , ROW_NUMBER() OVER (
     PARTITION BY email ORDER BY id
	 ) AS row_dup
	 FROM Customers
	 )
	 DELETE FROM CT WHERE row_dup>1

This query will use a common table expression (CTE) to assign row numbers to each records in the customers table based on the email column, then delete any rows with a row number greater than 1