Search
 

Technical Articles

Subscribe

Browse through the latest and greatest technical articles on differing SQL Server topics.

 

Current Articles | Categories | Search

Fuzzy Grouping by Brian Nordberg - Tuesday, September 02, 2008
534 Views | 0 Comments

Fuzzy Grouping

By Brian Nordberg

Finding similar, but not exact, duplicates using SQL Server Integration Services

SQL Server 2005 Integration Services (SSIS) introduces two new tools designed for Data Warehousing, but their uses are far more than just warehousing. Fuzzy Lookups and Fuzzy Grouping both can improve quality of data. Fuzzy lookups is designed to correct errors in lookup tables such as misspellings in cities or states. Fuzzy grouping finds duplicates in datasets. The hospital where I work strives to identify and merge duplicates to improve and maintain one complete electronic medical record for each patient. We have been using the fuzzy grouping tool for the past 6 months. The concept of fuzzy grouping is not new; many have used probabilistic linkage of datasets to find duplicates and related records in other datasets (for more info visit http://www.utcodes.org/Linkage/description.htm). Few commercial probabilistic linkage packages are available and they are expensive. With SSIS we now have a “free” tool to do probabilistic linkages.

Fix the SSIS executables first

First, download the memory leak fix http://support.microsoft.com/kb/912423/ . This fix resolves many memory issues when using record sets with over 1 million records, but there are still issues with more than 4 million records. Fuzzy grouping is memory and processor intensive, so running on servers separate from the main databases is recommended. On a 1.8 GHZ Pentium 4 with 512 MB, my tests running an 800,000 record dataset took five hours. Adding another 512MB reduced the run time to two hours.

Define a Dataset

Open Visual Studio and start a BI project, then define an OLE DB Source dataset. Any definable dataset may be grouped, but as many temporary tables are created you must have a connection to a SQL Server and be a user with permission to create tables. Also, limiting your dataset to the fields to be used for grouping and a primary key will increase performance. Be wary of your data types as many are not supported and datetime fields must be transformed to varchar.

Add a Fuzzy Grouping Step

Next add a Fuzzy Grouping package. Define the fields that identify the duplicates. Fields such as First Name, Last Name, Gender, and Date of Birth will prove sufficient to find duplicates. Fields with high discerning power such as Social Security number will improve accuracy and should be given a higher weight. A primary key should also be used as a pass-through in the dataset to aid in future joins of the result set. Define the type of grouping, either exact (data must be equal) or fuzzy (data is compared to determine similarity). The fuzzy grouping allows you to choose further options to ignore cases, punctuation, kana, non-spacing characters, and character width or symbols.

Minimum Similarities

Minimum Similarities should be defined for each field. These thresholds define how closely you want each of the values to correspond. Minimum similarities are between 0 and 1, where 1 means the values are exact and 0 means review everything. Setting a higher minimum similarity will speed the matching process. Minimum similarities are very dependent on data and the number of matches, and match quality will depend on these values. Setting minimum similarities too high will result in only closer to exact records being found, so you may need some experimentation to determine the best thresholds to use. For example, I like my dates of birth (dob) to be close, but I realize that data entry errors skew dates so I set a threshold of .20 for dob. In the example below May 12, 1972, is compared against May 12, 1971, and the similarity score is .84, so it would make the cut. If you are unsure, run some tests with no minimum similarities and review your resulting _similarity fields for each field. Take some averages and review the matches to determine what a good match is, and then set your threshold at this level.

Minimum Thresholds

An overall minimum threshold must also be set. I have found that matches with my data get weak after .80, so I set my minimum overall threshold at .80. Again this is dependent on your data, accuracy of entry staff, and how closely you want/need your records to match.

 

Define the Destination

The last step sets an OLE DB destination and either creates a new table or uses an existing one and matches the output fields in the fuzzy grouping step.

 

The outcome of the grouping will give a _key, in which is the unique key of the first table, a key _out field that identifies the unique groups, the _similarity_fieldname and the _score field that is the average of the _similarity_fieldname. The _score field is between 0 and 1 where a 0 is no similarity and 1 is exact. Running the following dataset through the grouping:

 

Outputs:

Here it grouped the first three records together and the last three together. Note it did not choose the correct name in grouping one, it simply noted that they are all the same. So, if you want the “correct” person, as we did for the hospital, you will need a human to review the results. In our hospital database of over 1.4 million records we found over 5000 duplicates. Unfortunately, the nature of the business requires us to review each of the datasets and pull medical records, so an automated merge process is out of the question. However, Fuzzy Grouping has proved to be a very valuable tool and has given us many useful reports.

Author Bio

Brian Nordberg has been with the University of Utah Health Care System since 1999. His work there includes cleaning/converting legacy data, data analysis, data warehousing, database administration, probabilistic linkage and stereotactic neurological imaging. Prior to Utah, he worked as a Systems Analyst for Northern Arizona University and the Arizona Superior Courts where he honed his skills in data analysis, statistics and data warehousing. Brian’s father introduced him to computing and BASIC programming in his youth. Brian’s passion lies in the great outdoors and he attempted to escape the computer industry by getting his BS and MS in Forestry. He and his wife of 13 years are outdoor enthusiasts and have recently dabbled in politics as they battle for tougher DUI laws.

Comments
Currently, there are no comments. Be the first to post one!
You must be logged in to post a comment. You can login here

Please login to view this content.

Register as a PASS member today for free.



Register
Forgot Password?

Hosted Trial