Scrubbing Your Data
Our
example database from the prior article
is quite simple. We need to break the transactional data shown into
three separate tables, each table containing only one type of
information. For instance, we need a place we can store Customer data
and only customer data—name, address, credit limit, etc. Each order
placed by that customer includes a customer code that links to the
customer data.
To get a list of our customers we’ll use the Advanced Filter and check
the data in Excel before the database import . You’ll often hear this
process called ‘scrubbing the data’. It is critically important and
cannot be automated:
DO NOT UNDERESTIMATE THE AMOUNT OF TIME THIS
PROCESS TAKES.
Data which has been left in an unsecured environment for
any length of time may take many days to clean. This process will be
easier to do in Excel if that is the environment you’re comfortable in.
In Excel, highlight the column labels that describe uniqueness for a
customer. In other words, don’t just select the name; select the name,
full address and phone number(s). This way, if a company has the same
name but different contact information, it will show up twice in our
unique list. That’s what we want! We can then do the research to find
out what information is correct. Copy the labels to an empty place in
the workbook. Go to Data, Advanced Filter—you’ll see this dialog box.
Click the picture to see the demonstration video!
Click the box ‘Copy to another location’. Indicate the full range of your existing
data for the List Range. For the Criteria Range, highlight the labels you pasted
into an empty place in your workbook. For the Copy To, highlight the criteria range
and more than enough rows to hold the unique list of customer data. This is critical,
as Excel will simply chop off any data which does not fit into the copy to range!
Finally, click unique records only and OK.
Sort the resulting list of company data by name to force company names to appear
next to each other. This will make companies with inaccurate data stand out. If
you are concerned that you may have wildly different company spellings, you may
wish to sort by telephone number or by address to see if there are duplicate addresses
with different company names.
Perform this process for all data to be imported into a relational database. When
you’re sure that the data is clean, you may import each type of data into its own
table with a unique code for each row. Your company may already assign unique codes
for each company (like a D&B number) and product (like a SKU). These are the
best codes to use as they already have meaning within your company. Should the time
come to integrate your database with another in your company, you will have the
ability to tie together data with the same company or product code.