Channel: Finance, Tech & Analytics Career Resources | Imarticus Blog

How to Delete Duplicate Rows in SQL? Detailed Guide With Syntax And Examples


Accurate data matters in every database. Duplicate rows cause inconsistencies and waste storage. This guide shows how to remove duplicates in SQL, starting with simple approaches and working up to more advanced ones.

We'll explore a range of techniques, from the fundamental DISTINCT keyword to Common Table Expressions (CTEs) combined with the ROW_NUMBER() function. By the end, you'll be adept at using SQL to keep your data clean and efficient.

Deleting duplicate rows in SQL

In SQL, deleting duplicate rows means removing entries from a table that contain the same information, as judged by specific criteria. Duplicate rows can occur for various reasons, including data entry mistakes, integration of data from different sources, or incomplete deduplication processes.

Deleting duplicates helps to:

  • Improve data integrity: By eliminating redundant data, you make sure the tables hold correct and consistent data.
  • Save storage space: Duplicate rows occupy storage space needlessly, and removing them can improve database performance.
  • Enhance data analysis: Duplicate rows can skew analysis results. Removing them leads to more accurate and dependable insights.

How to delete duplicate rows in SQL using sample data

Working through sample data makes it easier to see how duplicate removal works in SQL:

Sample data

Let's consider a table named Customers with the following columns:

CustomerID | Name       | Email
1          | John Doe   | john.doe@email.com
2          | Jane Smith | jane.smith@email.com
3          | Mike Jones | mike.jones@email.com
4          | John Doe   | john.doe@email.com (duplicate)

This table contains two entries for John Doe: rows 1 and 4 share the same Name and Email and differ only in CustomerID. We can use an example like this to show how various SQL techniques find and delete duplicate rows.

Delete duplicate rows in SQL using GROUP BY

Using GROUP BY and HAVING clauses is a strong method to remove repeated rows in SQL. You select columns to group the data and then use the HAVING clause to filter the groups. It helps find rows with the same values in specific columns.

Here's how it works:

  • Group By: You choose which columns to group the data by. This puts rows with the same values in those columns into categories.
  • HAVING Clause: This filters the groups made by GROUP BY. You can use COUNT(*) inside HAVING to find groups with more than one row (copies).

How to remove duplicates in SQL with Group By and Having

To do this, follow the steps mentioned here.

Step 1. Define the duplicate criteria: Decide which columns identify a duplicate in your data. For example, in a list of customers, duplicates can be found by matching Name and Email together.

Step 2. Build the DELETE Query: This is the basic format:

DELETE FROM your_table_name
WHERE your_table_name.column_name_1 IN (
  SELECT column_name_1
  FROM your_table_name
  GROUP BY column_name_1, column_name_2  -- columns for duplicate check
  HAVING COUNT(*) > 1                    -- keep only groups with more than one row
);

Example

Consider a table named Products with columns ProductCode, ProductName, and Price. We want to delete duplicate products based on ProductCode and Price.

DELETE FROM Products
WHERE Products.ProductCode IN (
  SELECT ProductCode
  FROM Products
  GROUP BY ProductCode, Price
  HAVING COUNT(*) > 1
);

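Result: The subquery groups rows by ProductCode and Price, and HAVING COUNT(*) > 1 keeps only the groups that contain more than one row. The DELETE statement then removes every row whose ProductCode appears in one of those groups. Note that this includes the first copy in each group, as well as any row that shares the ProductCode at a different Price, so use it only when all matching rows should go. To keep one row per group, compare on a unique key or use the ROW_NUMBER() method described later.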
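One way to keep a single row per group is to delete everything except the lowest key in each group. The sketch below tries that idea against an in-memory SQLite database from Python. The ProductID key and the sample rows are assumptions added for illustration; the Products table described above does not define a unique key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (
        ProductID   INTEGER PRIMARY KEY,  -- hypothetical key, assumed for this sketch
        ProductCode TEXT,
        ProductName TEXT,
        Price       REAL
    );
    INSERT INTO Products (ProductCode, ProductName, Price) VALUES
        ('A1', 'Widget', 9.99),
        ('A1', 'Widget', 9.99),   -- duplicate of the first row
        ('A1', 'Widget', 12.50),  -- same code, different price: not a duplicate
        ('B2', 'Gadget', 4.00);
""")

# Keep the lowest ProductID in each (ProductCode, Price) group and delete the rest.
conn.execute("""
    DELETE FROM Products
    WHERE ProductID NOT IN (
        SELECT MIN(ProductID)
        FROM Products
        GROUP BY ProductCode, Price
    )
""")

remaining = conn.execute(
    "SELECT ProductID, ProductCode, Price FROM Products ORDER BY ProductID"
).fetchall()
print(remaining)  # rows 1, 3 and 4 remain; row 2 is removed
```

Only the second copy of the 9.99 row is deleted, while the 12.50 row with the same ProductCode is left alone.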

Fetching and Identifying the duplicate rows

Before removing duplicates in SQL, it's crucial to identify them accurately. Data science professionals often use SQL's querying and filtering features to pinpoint duplicate entries. Here are some methods to fetch and identify duplicate rows:

1. Using GROUP BY and COUNT(*)

This is a common approach that uses both grouping and aggregate functions. The idea is to group rows based on the columns that define duplicates. 

Use COUNT(*) to determine the number of rows in each group. Groups with a count greater than 1 indicate duplicates.

Syntax

SELECT column_name_1, column_name_2, ..., COUNT(*) AS row_count
FROM your_table_name
GROUP BY column_name_1, column_name_2, ...
HAVING COUNT(*) > 1;  -- show only the groups that contain duplicates
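As a quick check of the grouping approach, the following sketch loads the sample Customers rows into an in-memory SQLite database from Python and lists the combinations of Name and Email that occur more than once. Using SQLite and Python here is an assumption made for the sake of a runnable example; the query itself is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, Email TEXT);
    INSERT INTO Customers VALUES
        (1, 'John Doe',   'john.doe@email.com'),
        (2, 'Jane Smith', 'jane.smith@email.com'),
        (3, 'Mike Jones', 'mike.jones@email.com'),
        (4, 'John Doe',   'john.doe@email.com');
""")

# Group on the columns that define a duplicate and keep groups with more than one row.
duplicate_groups = conn.execute("""
    SELECT Name, Email, COUNT(*) AS row_count
    FROM Customers
    GROUP BY Name, Email
    HAVING COUNT(*) > 1
""").fetchall()
print(duplicate_groups)  # [('John Doe', 'john.doe@email.com', 2)]
```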

2. Using DISTINCT and Self-Join

This method uses DISTINCT to fetch the unique combinations and a join back to the original table to compare rows. Use SELECT DISTINCT on the columns that define a duplicate to get the unique combinations, then join that result to the original table on the same columns. Each unique combination is matched with every row that carries it, so a combination that matches more than one row is duplicated.

Syntax

SELECT t1.*
FROM (SELECT DISTINCT column_name_1, column_name_2, ... FROM your_table_name) AS unique_data
INNER JOIN your_table_name AS t1 ON (unique_data.column_name_1 = t1.column_name_1 AND ...);
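A plain self-join on a unique key is a closely related alternative that returns only the later copies. The sketch below is that swapped-in variant rather than the DISTINCT-based query above: it pairs each row with any earlier row that has the same Name and Email. It assumes the sample Customers data and runs on an in-memory SQLite database from Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, Email TEXT);
    INSERT INTO Customers VALUES
        (1, 'John Doe',   'john.doe@email.com'),
        (2, 'Jane Smith', 'jane.smith@email.com'),
        (3, 'Mike Jones', 'mike.jones@email.com'),
        (4, 'John Doe',   'john.doe@email.com');
""")

# A row is a later copy if another row has the same Name and Email and a lower CustomerID.
later_copies = conn.execute("""
    SELECT later.CustomerID, later.Name, later.Email, earlier.CustomerID AS first_id
    FROM Customers AS later
    JOIN Customers AS earlier
      ON later.Name = earlier.Name
     AND later.Email = earlier.Email
     AND later.CustomerID > earlier.CustomerID
    ORDER BY later.CustomerID
""").fetchall()
print(later_copies)  # [(4, 'John Doe', 'john.doe@email.com', 1)]
```

With three or more copies of the same customer, a later copy is paired with each earlier one, so it can appear more than once in the output.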

3. Using ROW_NUMBER()

This method assigns a row number within groups defined by duplicate criteria, allowing you to identify duplicates based on their order.

Syntax

SELECT *, ROW_NUMBER() OVER (PARTITION BY column_name_1, column_name_2, ... ORDER BY column_name_3) AS row_num
FROM your_table_name;
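To see what the numbering looks like, the sketch below applies ROW_NUMBER() to the sample Customers rows in an in-memory SQLite database from Python. It assumes a SQLite build with window function support (version 3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, Email TEXT);
    INSERT INTO Customers VALUES
        (1, 'John Doe',   'john.doe@email.com'),
        (2, 'Jane Smith', 'jane.smith@email.com'),
        (3, 'Mike Jones', 'mike.jones@email.com'),
        (4, 'John Doe',   'john.doe@email.com');
""")

# Number the rows inside each (Name, Email) group; any row_num above 1 marks a later copy.
numbered = conn.execute("""
    SELECT CustomerID, Name, Email,
           ROW_NUMBER() OVER (PARTITION BY Name, Email ORDER BY CustomerID) AS row_num
    FROM Customers
    ORDER BY CustomerID
""").fetchall()
for row in numbered:
    print(row)  # CustomerID 4 is the only row with row_num 2
```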

How do you choose the right method?

The right method depends on your needs and table size. GROUP BY with COUNT(*) is a good fit for most cases.

If your duplicate criteria are complex or you need to choose which copy to keep based on order, try ROW_NUMBER(). If you want to see every copy alongside the row it matches, a self-join can help.

Delete duplicate rows with an intermediate table

The intermediate table method removes duplicate rows by copying the distinct rows into a second table and then swapping it in for the original. Take, for example, a table called Customers with the columns CustomerID, Name, and Email that contains duplicate rows.

Steps

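  1. Create Intermediate Table: CREATE TABLE Customers_Temp LIKE Customers;
  2. Insert Distinct Rows: INSERT INTO Customers_Temp SELECT DISTINCT CustomerID, Name, Email FROM Customers;
  3. Drop Original Table: DROP TABLE Customers;
  4. Rename Intermediate Table: ALTER TABLE Customers_Temp RENAME TO Customers;

SELECT DISTINCT only collapses rows that match in every selected column, so copies that differ in CustomerID survive step 2. The original table has to be dropped (or renamed) before the final step, because the name Customers is still taken until then; back it up first if you might need it. CREATE TABLE ... LIKE is MySQL syntax, and other databases use different statements to copy a table's structure.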
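The steps above can be sketched end to end on an in-memory SQLite database from Python. Two details are assumptions made for this example: SQLite has no CREATE TABLE ... LIKE, so the intermediate table is declared explicitly, and the insert keeps the lowest CustomerID per Name and Email instead of using SELECT DISTINCT, because the sample copies differ in CustomerID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, Email TEXT);
    INSERT INTO Customers VALUES
        (1, 'John Doe',   'john.doe@email.com'),
        (2, 'Jane Smith', 'jane.smith@email.com'),
        (3, 'Mike Jones', 'mike.jones@email.com'),
        (4, 'John Doe',   'john.doe@email.com');
""")

conn.executescript("""
    -- 1. Create the intermediate table with the same columns.
    CREATE TABLE Customers_Temp (CustomerID INTEGER PRIMARY KEY, Name TEXT, Email TEXT);

    -- 2. Copy one row per (Name, Email) combination.
    INSERT INTO Customers_Temp (CustomerID, Name, Email)
        SELECT MIN(CustomerID), Name, Email
        FROM Customers
        GROUP BY Name, Email;

    -- 3. Drop the original and 4. rename the intermediate table into its place.
    DROP TABLE Customers;
    ALTER TABLE Customers_Temp RENAME TO Customers;
""")

kept = conn.execute(
    "SELECT CustomerID, Name, Email FROM Customers ORDER BY CustomerID"
).fetchall()
print(kept)  # three rows remain, one per customer
```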

Deleting duplicate rows using ROW_NUMBER() function

The ROW_NUMBER() function is a handy tool for deleting duplicate rows from a database table.

This function assigns a sequential number to each row within a partition of the result set, based on a specified ordering. It uses the following syntax:

ROW_NUMBER() OVER (PARTITION BY <column_list> ORDER BY <column_list>) AS row_num

where

  • PARTITION BY <column_list>: This clause groups rows together based on the specified columns. Rows within each group will be assigned unique row numbers.
  • ORDER BY <column_list>: This clause defines the order in which the rows within each partition will be numbered.

Example

Suppose you have a table named Customers with columns customer_id, name, and email. You want to delete duplicate customer entries based on name and email. Here's the query:

WITH cte AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY customer_id) AS row_num
  FROM Customers
)
DELETE FROM cte
WHERE row_num > 1;

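Result: This query first creates a CTE named cte that assigns a row number (row_num) to each row in the Customers table. The rows are partitioned by name and email and ordered by customer_id, so the first copy in each group gets row_num 1 and later copies get 2, 3, and so on. The DELETE statement then removes the rows where row_num is greater than 1, which deletes the underlying duplicates from Customers and keeps one row per group. Deleting through a CTE in this way is SQL Server syntax; in databases such as PostgreSQL, MySQL, or SQLite, delete from the base table instead and use the CTE only to select the keys of the rows to remove.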
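For databases that do not allow DELETE against a CTE, a common workaround is to let the CTE pick the keys and delete from the base table. Here is a minimal sketch of that variant on an in-memory SQLite database from Python, assuming SQLite 3.25 or later for window functions and the sample customer rows used earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    INSERT INTO Customers VALUES
        (1, 'John Doe',   'john.doe@email.com'),
        (2, 'Jane Smith', 'jane.smith@email.com'),
        (3, 'Mike Jones', 'mike.jones@email.com'),
        (4, 'John Doe',   'john.doe@email.com');
""")

# The CTE numbers the rows; the DELETE targets the base table by key.
conn.execute("""
    WITH ranked AS (
        SELECT customer_id,
               ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY customer_id) AS row_num
        FROM Customers
    )
    DELETE FROM Customers
    WHERE customer_id IN (SELECT customer_id FROM ranked WHERE row_num > 1)
""")

remaining_ids = [
    row[0] for row in conn.execute("SELECT customer_id FROM Customers ORDER BY customer_id")
]
print(remaining_ids)  # [1, 2, 3]
```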

Delete duplicate rows using Common Table Expressions (CTE)

Common Table Expressions (CTEs) offer a powerful way to delete duplicate rows from your database tables. Here's how you can use CTEs with the ROW_NUMBER() function for this task:

Step 1. Define the CTE

  • The CTE identifies the duplicate rows. Here, you'll use the ROW_NUMBER() function to assign a sequential number to each row within its group.
  • The PARTITION BY clause groups rows together based on specific columns. Numbering restarts at 1 for each group.
  • The ORDER BY clause defines the order in which rows within each group are numbered.

Step 2. Filter and delete

After creating the CTE, you can use the DELETE statement to target the CTE alias.

Within the DELETE statement, you'll filter for rows where the ROW_NUMBER() (often aliased as row_num) is greater than 1. This effectively removes duplicates while keeping the first occurrence of each unique combination.

How to remove duplicate rows in SQL using CTE

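While procedures are a great way to encapsulate logic, removing duplicates with CTEs is typically done within a single SQL statement. If you want a reusable procedure, the table name and column list have to be spliced into the statement as dynamic SQL, because SQL does not accept them as ordinary parameters. Here's an example in SQL Server syntax: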

Step 1. Procedure creation

   CREATE PROCEDURE RemoveDuplicates (
       @tableName VARCHAR(50),   -- Name of the table to process
       @columnList VARCHAR(200)  -- Comma-separated list of columns for duplicate check
   )
   AS
   BEGIN
       -- Implement the logic here
   END;

Step 2. Logic within the procedure (using CTE)

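   -- This logic goes inside the BEGIN ... END block from Step 1.
   -- A CTE name, table name, or column list cannot be supplied through a variable,
   -- so the statement is assembled as text and run with sp_executesql.
   DECLARE @sql NVARCHAR(MAX);

   SET @sql = N'WITH cte AS (
       SELECT *,
              ROW_NUMBER() OVER (PARTITION BY ' + @columnList + N' ORDER BY some_column) AS row_num
       FROM ' + QUOTENAME(@tableName) + N'
   )
   DELETE FROM cte  -- Delete from CTE
   WHERE row_num > 1;';

   -- @columnList is concatenated as-is: validate it first to avoid SQL injection.
   -- some_column is a placeholder for the column that decides which copy is kept.
   EXEC sp_executesql @sql;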
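The same idea of a reusable routine that builds the statement from a table name and a column list can be sketched in application code. The Python function below is a hypothetical helper, not part of any library: remove_duplicates, its parameters, and the identifier check are all assumptions made for illustration, and it targets SQLite 3.25 or later.

```python
import re
import sqlite3

# Conservative pattern for identifiers that are safe to splice into SQL text.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def remove_duplicates(conn, table, columns, key):
    """Delete all but the lowest-key row for each combination of `columns`.

    Returns the number of rows deleted. Hypothetical helper for illustration.
    """
    if not columns:
        raise ValueError("at least one column is required")
    for name in (table, key, *columns):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")

    partition = ", ".join(columns)
    sql = f"""
        WITH ranked AS (
            SELECT {key} AS row_key,
                   ROW_NUMBER() OVER (PARTITION BY {partition} ORDER BY {key}) AS row_num
            FROM {table}
        )
        DELETE FROM {table}
        WHERE {key} IN (SELECT row_key FROM ranked WHERE row_num > 1)
    """
    before = conn.total_changes
    conn.execute(sql)
    return conn.total_changes - before


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    INSERT INTO Customers VALUES
        (1, 'John Doe',   'john.doe@email.com'),
        (2, 'Jane Smith', 'jane.smith@email.com'),
        (3, 'Mike Jones', 'mike.jones@email.com'),
        (4, 'John Doe',   'john.doe@email.com');
""")

deleted = remove_duplicates(conn, "Customers", ["name", "email"], "customer_id")
remaining_ids = [
    row[0] for row in conn.execute("SELECT customer_id FROM Customers ORDER BY customer_id")
]
print(deleted, remaining_ids)
```

Identifiers cannot be bound as query parameters, which is why the helper checks each name against a strict pattern before formatting it into the statement.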

Delete duplicate rows in SQL using the RANK() function

The RANK() function in SQL can be a great tool for deleting duplicate rows from a table. The function assigns a ranking number to each row within a result set, considering a specified ordering. Similar to ROW_NUMBER(), it uses the following syntax:

RANK() OVER (PARTITION BY <column_list> ORDER BY <column_list>) AS rank_num

where

  • PARTITION BY <column_list>: This clause groups rows together based on the specified columns. Rows within each group will receive ranks.
  • ORDER BY <column_list>: This one defines the order in which the rows within each partition will be ranked.

Steps for deleting duplicates with RANK

The steps are explained here:

Step 1. Identify duplicates: Partition by the columns that define a duplicate and order by a unique column. Within each group, the first row gets rank 1 and every later copy gets a higher rank.

Step 2. Delete ranked duplicates: We can leverage a CTE to isolate the duplicates and then delete them based on the rank.

Example for RANK function

Suppose you have a table named Products with columns for product_id, name, and color. You want to remove duplicate rows in SQL by targeting the product entries based on name and color. Here's the query:

WITH cte AS (
  SELECT *, RANK() OVER (PARTITION BY name, color ORDER BY product_id) AS rank_num
  FROM Products
)
DELETE FROM cte
WHERE rank_num > 1;

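Result: This query first creates a CTE named cte. It assigns a rank_num to each row in the Products table. The partitioning is done by name and color, and the ordering is based on product_id. Because product_id is unique, the first row in each name and color group receives rank 1 and each later copy receives a higher rank. RANK() gives tied rows the same rank, so ordering by a column that repeats within a group could leave several copies at rank 1.

Then, the DELETE statement removes the rows where rank_num is greater than 1, eliminating duplicate entries. Deleting through a CTE like this is SQL Server syntax; other databases delete from the base table using the keys that the CTE selects.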
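The effect of the ORDER BY column on RANK() is easy to check with a small experiment. The sketch below uses an in-memory SQLite database from Python (version 3.25 or later assumed) with made-up product rows, and compares ranking by the unique product_id with ranking by name, which is identical for every row in a group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (product_id INTEGER PRIMARY KEY, name TEXT, color TEXT);
    INSERT INTO Products VALUES
        (1, 'Shirt', 'Red'),
        (2, 'Shirt', 'Red'),   -- duplicate of row 1
        (3, 'Shirt', 'Blue');
""")

# by_id orders on a unique column; by_name orders on a column that ties within each group.
ranks = conn.execute("""
    SELECT product_id,
           RANK() OVER (PARTITION BY name, color ORDER BY product_id) AS by_id,
           RANK() OVER (PARTITION BY name, color ORDER BY name)       AS by_name
    FROM Products
    ORDER BY product_id
""").fetchall()
print(ranks)  # [(1, 1, 1), (2, 2, 1), (3, 1, 1)]
```

Only the by_id ranking separates the duplicate (row 2 gets rank 2). With by_name every row ties at rank 1, so a filter on rank greater than 1 would delete nothing.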

Final Thoughts

Duplicate rows in your database can cause wasted space and skewed analysis. This article empowers you to remove them effectively using SQL. We explored methods like GROUP BY with HAVING for basic tasks, and advanced techniques with ROW_NUMBER() and CTEs.

Choosing the right method depends on your table size and needs. For a data-driven approach to managing your databases, consider Imarticus's Postgraduate Program in Data Science Analytics. This program equips you with the skills to wrangle, analyze, and visualize data, making you an expert in data management. Register instantly!

The post How to Delete Duplicate Rows in SQL? Detailed Guide With Syntax And Examples appeared first on Finance, Tech & Analytics Career Resources | Imarticus Blog.

