Blog

SQL Server: Fix duplicate values for uniqueness constraint.

By: Alejandro Villarreal

During database development, unique indexes (or constraints) are very important part of the design. There’s always data for which you do not want to have repeated values, like usernames or email addresses. Different applications might have their own uniqueness requirements, but they all usually have at least one. It’s great when we can identify those requirements from the start and create the appropriate indexes up front.

But what if your application now needs to enforce uniqueness on the values of a million-row table which has tons of duplicates in the production environment?

A naïve solution (and definitely not appropriate for a production environment) would be to replace all the values in that column with uniquely generated IDs, converted to the appropriate data type (I would guess varchar, most of the time). This, however, implies losing any reference to the original data, and looking for a particular record might become hard (in addition to the already stated problem that it just won’t cut it for a production environment).

A better solution is to keep the original duplicate value, and number the duplicates consecutively (i.e., change 3 instances of “test value” for “test value 1”, “test value 2” and “test value 3”). Depending on the application (and environment) at hand, this might or might not be appropriate, but if it is, then a simple query can help us achieve it. Let me show you with an example:

I’ll first create a very simple test table with 2 columns: a numeric ID and a string value. The second column will have duplicates that we want to remove.

CREATE TABLE #Table (Id int, Value varchar(20))

 

INSERT INTO #Table (Id, Value)

VALUES (1,'test'),(2,'test'),(3, 'test'),(4, 'another test'),(5, 'another test'),(6, 'another test')

The table then looks like this:

 

Id

Value

1

1

Test

2

2

Test

3

3

Test

4

4

Another test

5

5

Another test

6

6

Another test

 

To remove the duplicates, we will first determine the different values that the column currently has. Then we’ll iterate over them, and for each value, “iterate” over its duplicates (actually we’ll let SQL iterate over them), updating each one with a counter that will get incremented each time it’s used. That said, we execute the following statements to do it:

SELECT DISTINCT Value

INTO #Duplicates

FROM #Table

 

DECLARE @currentDuplicate varchar(20)

DECLARE @counter int

WHILE EXISTS (SELECT 1 FROM #Duplicates) BEGIN

      SET @counter = 0

      SELECT TOP 1 @currentDuplicate = Value FROM #Duplicates

      UPDATE #Table

            SET Value = Value + ' ' + CAST(@counter as varchar(3))

                  ,@counter += 1

            WHERE Value = @currentDuplicate

           

      DELETE FROM #duplicates WHERE value = @currentDuplicate

END

DROP TABLE #Duplicates

After running those statements, the table contains the following:

 

Id

Value

1

1

Test 1

2

2

Test 2

3

3

Test 3

4

4

Another test 1

5

5

Another test 2

6

6

Another test 3

 

As you can see the duplicates are no more, and now we can create the desired unique index/constraint over the column. You should be careful and give enough range to the counter when casting it to a varchar: if at least 1 of the duplicate values has 1000 or more instances, then you’ll need a varchar(4) to store the number.

As a reminder, this solution might not be the most appropriate for all applications, but hopefully it will come in handy to someone in the future, as it has to me.

 

Comments

Leave a comment

 
 
 
 
CAPTCHA Image Validation