SQL
Identify and Remove Duplicate Rows from a SQL Table
Find duplicate records based on specific columns and safely remove them, keeping one row per duplicate group, using SQL window functions or self-joins.
-- Step 1: Identify duplicates (rows with rn > 1 are the extra copies)
SELECT
    id, email, username, created_at,
    -- id breaks ties so the numbering is deterministic when created_at values match
    ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at, id) AS rn
FROM
    users;

-- Step 2: Delete duplicates (keeping the oldest entry by created_at)
DELETE FROM users
WHERE id IN (
    SELECT id FROM (
        SELECT
            id,
            -- Same partitioning and ordering as Step 1
            ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at, id) AS rn
        FROM
            users
    ) AS subquery
    WHERE rn > 1
);
How it works: Step 1 uses the `ROW_NUMBER()` window function to number the rows within each partition of identical `email` values. Ordering by `created_at` gives the earliest entry `rn = 1`; a unique tie-breaker such as `id` in the `ORDER BY` keeps the choice deterministic when two rows share a timestamp. Step 2 applies the same numbering inside a `DELETE` statement. The inner query is wrapped in a derived table because a window function result cannot be filtered in the `WHERE` clause of the query that computes it. Every `id` with `rn > 1` is deleted, which leaves only the first (oldest) row for each email. Run Step 1 first to review which rows will be removed.
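The description also mentions self-joins, so here is a minimal sketch of that alternative for databases without window functions. It assumes `id` is unique and that the oldest row per `email` should be kept, with `id` as the tie-breaker. The column types and sample rows are hypothetical and exist only to make the example runnable; the derived-table wrapper mirrors Step 2 so the `DELETE` does not read directly from the table it modifies.

```sql
-- Hypothetical schema and sample data (types are assumptions, not from the original)
CREATE TABLE users (
    id         INTEGER PRIMARY KEY,
    email      VARCHAR(255) NOT NULL,
    username   VARCHAR(100) NOT NULL,
    created_at TIMESTAMP NOT NULL
);

INSERT INTO users (id, email, username, created_at) VALUES
    (1, 'a@example.com', 'alice',  '2023-01-01 10:00:00'),
    (2, 'a@example.com', 'alice2', '2023-02-01 10:00:00'),
    (3, 'b@example.com', 'bob',    '2023-01-15 09:00:00'),
    (4, 'a@example.com', 'alice3', '2023-01-01 10:00:00');

-- Self-join: a row is a duplicate if another row with the same email
-- is older, or equally old with a smaller id.
DELETE FROM users
WHERE id IN (
    SELECT dup_id FROM (
        SELECT newer.id AS dup_id
        FROM users AS newer
        JOIN users AS older
          ON older.email = newer.email
         AND (older.created_at < newer.created_at
              OR (older.created_at = newer.created_at AND older.id < newer.id))
    ) AS dups
);

-- Remaining rows: ids 1 and 3
SELECT id, email, username, created_at FROM users ORDER BY id;
```

The join compares every pair of rows that share an email, so on large tables an index on `email` is likely to matter; the window-function version usually scales better where it is available.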