By: Alejandro Villarreal
During database development, unique indexes (or constraints) are very important part of the design. There’s always data for which you do not want to have repeated values, like usernames or email addresses. Different applications might have their own uniqueness requirements, but they all usually have at least one. It’s great when we can identify those requirements from the start and create the appropriate indexes up front.
But what if your application now needs to enforce uniqueness on the values of a million-row table which has tons of duplicates in the production environment?
A naïve solution (and definitely not appropriate for a production environment) would be to replace all the values in that column with uniquely generated IDs, converted to the appropriate data type (I would guess varchar, most of the time). This, however, implies losing any reference to the original data, and looking for a particular record might become hard (in addition to the already stated problem that it just won’t cut it for a production environment).
A better solution is to keep the original duplicate value, and number the duplicates consecutively (i.e., change 3 instances of “test value” for “test value 1”, “test value 2” and “test value 3”). Depending on the application (and environment) at hand, this might or might not be appropriate, but if it is, then a simple query can help us achieve it. Let me show you with an example:
I’ll first create a very simple test table with 2 columns: a numeric ID and a string value. The second column will have duplicates that we want to remove.
CREATE TABLE #Table (Id int, Value varchar(20))
INSERT INTO #Table (Id, Value)
VALUES (1,'test'),(2,'test'),(3, 'test'),(4, 'another test'),(5, 'another test'),(6, 'another test')
The table then looks like this:
|
|
Id |
Value |
|
1 |
1 |
Test |
|
2 |
2 |
Test |
|
3 |
3 |
Test |
|
4 |
4 |
Another test |
|
5 |
5 |
Another test |
|
6 |
6 |
Another test |
To remove the duplicates, we will first determine the different values that the column currently has. Then we’ll iterate over them, and for each value, “iterate” over its duplicates (actually we’ll let SQL iterate over them), updating each one with a counter that will get incremented each time it’s used. That said, we execute the following statements to do it:
SELECT DISTINCT Value
INTO #Duplicates
FROM #Table
DECLARE @currentDuplicate varchar(20)
DECLARE @counter int
WHILE EXISTS (SELECT 1 FROM #Duplicates) BEGIN
SET @counter = 0
SELECT TOP 1 @currentDuplicate = Value FROM #Duplicates
UPDATE #Table
SET Value = Value + ' ' + CAST(@counter as varchar(3))
,@counter += 1
WHERE Value = @currentDuplicate
DELETE FROM #duplicates WHERE value = @currentDuplicate
END
DROP TABLE #Duplicates
After running those statements, the table contains the following:
|
|
Id |
Value |
|
1 |
1 |
Test 1 |
|
2 |
2 |
Test 2 |
|
3 |
3 |
Test 3 |
|
4 |
4 |
Another test 1 |
|
5 |
5 |
Another test 2 |
|
6 |
6 |
Another test 3 |
As you can see the duplicates are no more, and now we can create the desired unique index/constraint over the column. You should be careful and give enough range to the counter when casting it to a varchar: if at least 1 of the duplicate values has 1000 or more instances, then you’ll need a varchar(4) to store the number.
As a reminder, this solution might not be the most appropriate for all applications, but hopefully it will come in handy to someone in the future, as it has to me.