Remove Duplicate MS Excel Data
Excel is frequently used for storing data like mailing lists. While most of
this information is better kept in a
database because of data integrity and data security issues, we can't
help but notice that many more people are familiar and comfortable with
Excel than Access or SQL Server. So while we at I Get It! Development, Inc.
would emphatically recommend moving such data into a database, we recognize
that much of it will stay in Excel and so needs to be worked on there.
One of the most common questions we find on
our website is how to remove duplicate rows or de-dupe data. The very first
step in deleting duplicates is figuring out what constitutes a duplicate.
The same street address, city and zip code may have different apartment or
suite numbers.
A catalog like Land's End may send catalogs
to both a husband and wife living at the same address. Merely finding two
addresses that are the same does not constitute a duplicate data record in
that case.
A catalog of fishing equipment may send
separate catalogs to younger customers and older customers; so even the same
names and addresses – James (dad) and James (son) – may also not constitute
duplicates but legitimately separate addressees!
For our purposes, let's define a duplicate
as a line that contains the same ZIP code and the same street address
including apartment or any secondary street address information. We only
want to send a one catalog to one residence. Our catalog doesn't
differentiate between male or female, parents or children.
We use the & operator (concatenation) to put
fields together. We append zip, street, and any secondary street information
in that order. See the formulas in the first graphic below, and the results
in the graphic beneath that.
The zip code will represent city and state
so it isn’t necessary to include them, but if you feel that your zip code
information may be compromised, use the city and state as well.
Select this entire column and copy it. Then
immediately go to Edit > Paste Special… and choose Values. We should get rid
of the formulas before we start sorting the data. Now sort your data on this
column but be sure that you are sorting all of your columns! It’s just too
easy to corrupt your data in Excel by sorting on only one column instead of
all of them!
On row 2 of the first blank column after the
dupe data, insert the formula =EXACT(H2,H1)
This formula checks the dupe column on the
same row with the dupe column on the previous row. Copy the formula all the
way down the column and you’ll see a series of TRUE and FALSE values. The
TRUE values indicate a duplicate has been found, so you can examine the
multiple records and delete the appropriate ones.
Another technique you can use is to use
COUNTIF. The COUNTIF function will show not only that duplicates exist, but
how many there are! On row 2 of the first blank column after the EXACT
formulas type =COUNTIF(H:H,H2). This says ‘look in all of column H and tell
me how many times the value in H2 appears’. Copy this formula all the way
down the column. You can sort on this formula (though you should copy, paste
special…values first!) in descending order and the duplicates will float up
to the top for easy manipulation.
The most effective and the least-liked step
is a hard visual search of the data. This method is 100% manual and 100%
disagreeable. Nobody likes to do it. But it is the only way to recognize
that James Wilson, Jim Wilson, and Jimmy Wilson are all the same person.
Visual inspection is the only way to recognize that One Washington Place is
the same as 1 Washington Place. You’ll never get your computer to do that
for you!
Hopefully, the steps above will help you get
rid of the easy 90% so that you can concentrate your human eye and human
brain on the difficult 10%.
Remember that data scrubbing is an
ongoing process and you should never go very long without
examining your business critical
data.