Data Matching (Recordlinkage and Fuzzymatcher)

17 minute read

1. Introduction

Record Linkage aka data matching/merging is to join the information from a variety of data sources. The recordlinkage and fuzzymaker are used for process of joining two data sets when they don’t have a common unique identifier.

This problem is a common business challenge and difficult to solve in a systematic way due to the tremendous amount of data sets. A naive approach using vlookup function in Excel requests a lot of human intervention. To reduce that, fuzzymatcher and recordlinkage can be used.

  1. fuzzymatcher allows us to match two pandas DataFrames using sqlite3 for finding potential matches and probabilistic record linkage for scoring those matches.
  2. recordlinkage is a library to link records in or between data sources.

2. Problem

If trying to merge two datasets that comes from two separate sources, you must have some challenge of removing duplicates; like the contents are the same but a machine could not recognize that they are the same. For instance, a machine cannot recognize the duplicates that do not match exactly.

When machine recognizes that the data match:

exact-match
| Fig1. Deterministic Matching |

This is called Deterministic matching. However, it happens that a part of data is mistyped or missing. Then, we use probability to match (Probabilistic matching). Using string distance algorithms like Levenshtein, Damerau-Levenshtein, Jaro-Winkler, q-gram, and cosine, we score a match to see how such matching is suitable.

analogy
| Fig2. Data Linkage/Matching Analogy |

Through the data-matching, Duplicates in the data can be eliminated.

dupls
| Fig3. Eliminating Duplicates (Deduplication) |

The vlookup function in Excel requests exhaustive work, so using fuzzymatcher or recordlinkage is recommended.

3. Data

For exploring the data matching, US hospital data was used for the following challenges:

  • Similar names of hospitals: to know they are different or the same (Saint Lukes, Saint Mary, etc.)
  • Unclear addresses of hospitals: some hospitals located in urban areas have ambiguous addresses.
  • A lot of assosiated clinics or facilities: hospitals tend to have many clinics and facilities associated and related nearby.
  • Name changes: name changes of hospitals are common.
  • Scalability: a thousands of medical facilities in the US is hard to scale.

The full data sets are available: Medicare.gov and CMS.gov. I used the cleaned version available here.

3.1. Importing packages

import pandas as pd
from pathlib import Path
import fuzzywuzzy as fw # works at python 3.6 ver.
import fuzzymatcher as fm # works at python 3.6 ver.
import recordlinkage as rl# works at python 3.6 ver.

4. Fuzzymatcher

4.1. Loading the data

hospital_accounts = pd.read_csv(
    'hospital_account_info.csv'
)
hospital_reimbursement = pd.read_csv(
    'hospital_reimbursement.csv'
)

US hospital account Data:

hospital_accounts.head(2)
Facility Name Address City State ZIP Code County Name Phone Number Hospital Type Hospital Ownership Acct_Name_Lookup
Account_Num
10605 SAGE MEMORIAL HOSPITAL STATE ROUTE 264 SOUTH 191 GANADO AZ 86505 APACHE (928) 755-4541 Critical Access Hospitals Voluntary non-profit - Private SAGE MEMORIAL HOSPITAL_STATE ROUTE 264 SOUTH 1...
24250 WOODRIDGE BEHAVIORAL CENTER 600 NORTH 7TH STREET WEST MEMPHIS AR 72301 CRITTENDEN (870) 394-4113 Psychiatric Proprietary WOODRIDGE BEHAVIORAL CENTER_600 NORTH 7TH STRE...

US reimbursement Data:

hospital_reimbursement.head(2)
Provider Name Provider Street Address Provider City Provider State Provider Zip Code Total Discharges Average Covered Charges Average Total Payments Average Medicare Payments Reimbursement_Name_Lookup
Provider_Num
839987 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 118 20855.61 5026.19 4115.52 SOUTHEAST ALABAMA MEDICAL CENTER_1108 ROSS CLA...
519118 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH BOAZ AL 35957 43 13289.09 5413.63 4490.93 MARSHALL MEDICAL CENTER SOUTH_2505 U S HIGHWAY...

4.2. Joining the data

Since the columns have different names, we need to define which columns to match for the left and right DataFrames. In this case, the ‘hospital_accounts’ will be the left and the ‘hospital_reimbursement’ will be the right.

# Columns to match on from df_left
left_on = ["Facility Name", "Address", "City", "State"]

# Columns to match on from df_right
right_on = [
    "Provider Name", "Provider Street Address", "Provider City",
    "Provider State"
]

Using function fuzzy_left_join(), I let the fuzzymatcher work on matching.

# Now perform the match
# It will take several minutes to run on this data set
matched_results = fm.fuzzy_left_join(hospital_accounts,
                                               hospital_reimbursement,
                                               left_on,
                                               right_on,
                                               left_id_col='Account_Num',
                                               right_id_col='Provider_Num')

NOTE For over 10 million combinations of the dataset, fuzzymatcher finds the best match for each combination, so it takes a while.

The DataFrame ‘matched_results’ contains all the data linked together as well as ‘best_match_score’ which shows the measure of the matching.

matched_results.head(2)
best_match_score __id_left __id_right Account_Num Facility Name Address City State ZIP Code County Name ... Provider_Num Provider Name Provider Street Address Provider City Provider State Provider Zip Code Total Discharges Average Covered Charges Average Total Payments Average Medicare Payments
0 -0.746613 10605 643595 10605 SAGE MEMORIAL HOSPITAL STATE ROUTE 264 SOUTH 191 GANADO AZ 86505 APACHE ... 643595 TYLER MEMORIAL HOSPITAL 5950 STATE ROUTE 6 WEST TUNKHANNOCK PA 18657 18 20482.94 5783.22 4929.22
234 -0.609873 24250 426767 24250 WOODRIDGE BEHAVIORAL CENTER 600 NORTH 7TH STREET WEST MEMPHIS AR 72301 CRITTENDEN ... 426767 CRISP REGIONAL HOSPITAL 902 7TH STREET NORTH CORDELE GA 31015 18 14655.94 5680.28 4899.39

4.3. Best and Worst Matches

The columns are rearranged for readability. The following data represent the top 3 best matches and the worst 3

# Reorder the columns to make viewing easier
cols = [
    "best_match_score", "Facility Name", "Provider Name", "Address", "Provider Street Address",
    "Provider City", "City", "Provider State", "State"
]
# Check the top 3 best matches
rearranged_best_matches=matched_results[cols].sort_values(by=['best_match_score'], ascending=False)
rearranged_best_matches.head(3)
best_match_score Facility Name Provider Name Address Provider Street Address Provider City City Provider State State
78037 3.090931 RARITAN BAY MEDICAL CENTER PERTH AMBOY DIVISION RARITAN BAY MEDICAL CENTER PERTH AMBOY DIVISION 530 NEW BRUNSWICK AVE 530 NEW BRUNSWICK AVE PERTH AMBOY PERTH AMBOY NJ NJ
533154 2.799072 ROBERT WOOD JOHNSON UNIVERSITY HOSPITAL ROBERT WOOD JOHNSON UNIVERSITY HOSPITAL ONE ROBERT WOOD JOHNSON PLACE ONE ROBERT WOOD JOHNSON PLACE NEW BRUNSWICK NEW BRUNSWICK NJ NJ
78626 2.785132 AVERA MCKENNAN HOSPITAL & UNIVERSITY HEALTH CE... AVERA MCKENNAN HOSPITAL & UNIVERSITY HEALTH CE... 1325 S CLIFF AVE POST OFFICE BOX 5045 1325 S CLIFF AVE POST OFFICE BOX 5045 SIOUX FALLS SIOUX FALLS SD SD
# Check the worst 3 matches
rearranged_worst_matches=matched_results[cols].sort_values(by=['best_match_score'],
                                  ascending=True)
rearranged_worst_matches.head(3)
# You can also use `tail()` from 'rearranged_best_matches'.
best_match_score Facility Name Provider Name Address Provider Street Address Provider City City Provider State State
426392 -2.268231 CENTRO MEDICO WILMA N VAZQUEZ BAPTIST MEDICAL CENTER EAST CARR. 2 KM 39.5 ROAD NUMBER 2 BO ALGARROBO 400 TAYLOR ROAD MONTGOMERY VEGA BAJA AL PR
83137 -2.124071 DOCTOR CENTER HOSPITAL SAN FERNANDO DE LA CARO... OVERLAKE HOSPITAL MEDICAL CENTER EDIF JESUS T PINEIRO AVE FERNANDEZ JUNCOS BO P... 1035-116TH AVE NE BELLEVUE CAROLINA WA PR
42545 -2.106746 HOSPITAL ONCOLOGICO DR ISAAC GONZALEZ MARTINEZ SCRIPPS MERCY HOSPITAL BO. MONACILLOS CARR 22 CENTRO MEDICO DE PUERTO... 4077 5TH AVE SAN DIEGO SAN JUAN CA PR

As you can see, ‘Provider State’ and ‘State’ are different; meaning that some hospitals at different states (AL, WA, or CA) are mismatched with some hospitals at PR (Puerto Rico).

Check the Threshold

Now let’s find the threshold to collect the acceptible data that properly matched. When we see some of the matches with their scores $\leq$ 75:

# Look at the matches below 0.75
matched_results[cols].query("best_match_score <= .75").sort_values(
    by=['best_match_score'], ascending=False).head(10)
best_match_score Facility Name Provider Name Address Provider Street Address Provider City City Provider State State
179656 0.745897 HCA HOUSTON HEALTHCARE KINGWOOD KINGWOOD MEDICAL CENTER 22999 US HWY 59 22999 US HWY 59 KINGWOOD KINGWOOD TX TX
147070 0.744798 ADVENTIST HEALTH SONORA SONORA REGIONAL MEDICAL CENTER 1000 GREENLEY ROAD 1000 GREENLEY ROAD SONORA SONORA CA CA
550827 0.739983 ADVENTHEALTH MANCHESTER MEMORIAL HOSPITAL 210 MARIE LANGDON DRIVE 210 MARIE LANGDON DRIVE MANCHESTER MANCHESTER KY KY
404399 0.739409 UPMC KANE KANE COMMUNITY HOSPITAL 4372 ROUTE 6 4372 ROUTE 6 KANE KANE PA PA
399780 0.738336 MARSHFIELD MEDICAL CENTER - RICE LAKE LAKEVIEW MEDICAL CENTER 1700 WEST STOUT STREET 1700 WEST STOUT STREET RICE LAKE RICE LAKE WI WI
477383 0.736923 TRINITY BETTENDORF UNITY POINT HEALTH TRINITY 4500 UTICA RIDGE ROAD 4500 UTICA RIDGE ROAD BETTENDORF BETTENDORF IA IA
160318 0.728097 RUSH COPLEY MEDICAL CENTER COPLEY MEMORIAL HOSPITAL 2000 OGDEN AVENUE 2000 OGDEN AVENUE AURORA AURORA IL IL
555173 0.724772 UNITYPOINT HEALTH - KEOKUK KEOKUK AREA HOSPITAL 1600 MORGAN STREET 1600 MORGAN STREET KEOKUK KEOKUK IA IA
368685 0.722856 CALIFORNIA PACIFIC MEDICAL CTR-DAVIES CAMPUS HOSP CALIFORNIA PACIFIC MEDICAL CTR-PACIFIC CAMPUS ... 601 DUBOCE AVENUE 2333 BUCHANAN STREET SAN FRANCISCO SAN FRANCISCO CA CA
431261 0.720821 KANSAS SPINE & SPECIALTY HOSPITAL, LLC KANSAS HEART HOSPITAL 3333 NORTH WEBB ROAD 3601 NORTH WEBB ROAD WICHITA WICHITA KS KS

It shows that the data becomes more ambiguous below 0.75. The CALIFORNIA PACIFIC MEDICAL CTR-DAVIES CAMPUS HOSP is different from CALIFORNIA PACIFIC MEDICAL CTR-PACIFIC CAMPUS.
By checking your data, you need to find and set the proper threshold for collecting data.

Overall, fuzzymatcher is a useful tool to have for medium sized data sets (around 10,000) due to its computational time. However, it is easy to use.

5. Recordlinkage

Record Linkage Toolkit can clean, standardize data, and score similarity of data like fuzzymatcher, but it has additional capabilities:

  • makes pairs of data by ‘blocking’ (block()) to limit the pool of potential matches and ‘grouping’ (sortedneighbourhood()) to cover the data with minor spelling mistakes.
  • compares data with similarity scores for different types of data.
  • provides Supervised and unsupervised classification algorithms for classifying data.

The trade-off is that it is a little more complicated to wrangle the results in order to do further validation. However, the steps are relatively standard pandas commands so do not let that intimidate you.

5.1. Loading the data with index_col

It needs to set the ‘index_col’ argument for indexing

# Re-read in the data using the index_col
hospital_accounts = pd.read_csv(
    'hospital_account_info.csv',
    index_col='Account_Num'
)
hospital_reimbursement = pd.read_csv(
    'hospital_reimbursement.csv',
    index_col='Provider_Num'
)
hospital_accounts.head(2)
Facility Name Address City State ZIP Code County Name Phone Number Hospital Type Hospital Ownership
Account_Num
10605 SAGE MEMORIAL HOSPITAL STATE ROUTE 264 SOUTH 191 GANADO AZ 86505 APACHE (928) 755-4541 Critical Access Hospitals Voluntary non-profit - Private
24250 WOODRIDGE BEHAVIORAL CENTER 600 NORTH 7TH STREET WEST MEMPHIS AR 72301 CRITTENDEN (870) 394-4113 Psychiatric Proprietary
hospital_reimbursement.head(2)
Provider Name Provider Street Address Provider City Provider State Provider Zip Code Total Discharges Average Covered Charges Average Total Payments Average Medicare Payments
Provider_Num
839987 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 118 20855.61 5026.19 4115.52
519118 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH BOAZ AL 35957 43 13289.09 5413.63 4490.93

5.2. Blocking

Blocking is nothing but a limiter of the size of comparisons. For instance, we want to compare the only hospitals that are in the same state. We can use this knowledge to setup a block on the state columns.

This will speed up the process, but adding some flexibility for minor spelling mistakes is important. Using SortedNeighborhood() function to clean the data

# Build the indexer
indexer = rl.Index()
# Can use full or block
indexer.full()
indexer.block(left_on='State', right_on='Provider State')

# Use sortedneighbor as a good option if data is not clean
indexer.sortedneighbourhood(left_on='State', right_on='Provider State')

It may toss a warning of large number of data. With the recordlinkage, we have some flexibility to influence how many pairs are evaluated. By using full indexer all potential pairs are evaluated (which we know is over 14M pairs). I will come back to some of the other options in a moment. Let’s continue with the full index and see how it performs.

The next step is to build up all the potential candidates to check:

candidates = indexer.index(hospital_accounts, hospital_reimbursement)
# Let's see how many matches we want to do
print(len(candidates))
14399283

This quick check just confirmed the total number of comparisons. With the block on state, the 14,399,283 candidates will be filtered to only include those where the state values are the same. It may take longer to take care of minor spelling mistakes, but it is a reasonable trade-off.

5.3. Comparing

Now that we have defined the left and right data sets and all the candidates, we can define how we want to perform the comparison logic using the Compare() function

# Takes 3 minutes using the full index.
# 14s using sorted neighbor
# 7s using blocking
compare = rl.Compare()
compare.exact('City', 'Provider City', label='City')
compare.string('Facility Name',
               'Provider Name',
               threshold=0.85,
               label='Hosp_Name')
compare.string('Address',
               'Provider Street Address',
               method='jarowinkler',
               threshold=0.85,
               label='Hosp_Address')
features = compare.compute(candidates, hospital_accounts,
                           hospital_reimbursement)

We can define several options for how we want to compare the columns of data. In this specific example, we look for an exact match on the city. I have also shown some examples of string comparison along with the threshold and algorithm to use for comparison. In addition to these options, you can define your own or use numeric, dates and geographic coordinates. Refer to the documentation for more examples.

Regardless of which option you use, the result is a features DataFrame that looks like this:

features
City Hosp_Name Hosp_Address
Account_Num Provider_Num
10605 537184 0 0.0 0.0
803181 0 0.0 0.0
450616 0 0.0 0.0
854377 0 0.0 0.0
560361 0 0.0 0.0
... ... ... ... ...
70226 815904 0 0.0 0.0
746090 0 0.0 0.0
193062 0 0.0 0.0
834984 0 0.0 0.0
365095 0 0.0 0.0

This DataFrame shows the results of all of the comparisons. There is one row for each row in the account and reimbursement DataFrames. The columns correspond to the comparisons we defined. 1 is a match and 0 is not.

5.4. Similarity Score

Given the large number of records with no matches, it is a little hard to see how many matches we might have. We can sum up the individual scores to see about the quality of the matches.

# What are the score totals?
features.sum(axis=1).value_counts().sort_index(ascending=False)
3.0      2285
2.0       451
1.0      7937
0.0    988187
dtype: int64

Now we know that there are 988,187 rows with no matching values whatsoever. 7,937 rows have at least one match, 451 have 2 and 2,285 have 3 matches.

To make the rest of the analysis easier, let’s get all the records with 2 or 3 matches and add a total score:

# Get the potential matches
potential_matches = features[features.sum(axis=1) > 1].reset_index()
potential_matches['Score'] = potential_matches.loc[:, 'City':'Hosp_Address'].sum(axis=1)
potential_matches.head()
Account_Num Provider_Num City Hosp_Name Hosp_Address Score
0 51216 268781 0 1.0 1.0 2.0
1 55272 556917 1 1.0 1.0 3.0
2 87807 854637 1 1.0 1.0 3.0
3 51151 783146 1 0.0 1.0 2.0
4 11740 260374 1 1.0 1.0 3.0

Here is how to interpret the table. For the first row, Account_Num 55,272 and Provider_Num 556,917 match on city, hospital name and hospital address.

Let’s look at these two and see how close they are:

hospital_accounts.loc[55272,:]
Facility Name         SCOTTSDALE OSBORN MEDICAL CENTER
Address                          7400 EAST OSBORN ROAD
City                                        SCOTTSDALE
State                                               AZ
ZIP Code                                         85251
County Name                                   MARICOPA
Phone Number                            (480) 882-4004
Hospital Type                     Acute Care Hospitals
Hospital Ownership                         Proprietary
Name: 55272, dtype: object
hospital_reimbursement.loc[556917,:]
Provider Name                SCOTTSDALE OSBORN MEDICAL CENTER
Provider Street Address                 7400 EAST OSBORN ROAD
Provider City                                      SCOTTSDALE
Provider State                                             AZ
Provider Zip Code                                       85251
Total Discharges                                           62
Average Covered Charges                               39572.2
Average Total Payments                                6551.47
Average Medicare Payments                             5451.89
Name: 556917, dtype: object

Those look like good matches.

5.5. Joining

Now that we know the matches, we need to wrangle the data to make it easier to review all the data together. I am going to make a concatenated name and address lookup for each of these source DataFrames.

# Add some convenience columns for comparing data
hospital_accounts['Acct_Name_Lookup'] = hospital_accounts[[
    'Facility Name', 'Address', 'City', 'State'
]].apply(lambda x: '_'.join(x), axis=1)
hospital_reimbursement['Reimbursement_Name_Lookup'] = hospital_reimbursement[[
    'Provider Name', 'Provider Street Address', 'Provider City',
    'Provider State'
]].apply(lambda x: '_'.join(x), axis=1)
reimbursement_lookup = hospital_reimbursement[['Reimbursement_Name_Lookup']].reset_index()
account_lookup = hospital_accounts[['Acct_Name_Lookup']].reset_index()
account_lookup.head(2)
Account_Num Acct_Name_Lookup
0 10605 SAGE MEMORIAL HOSPITAL_STATE ROUTE 264 SOUTH 1...
1 24250 WOODRIDGE BEHAVIORAL CENTER_600 NORTH 7TH STRE...
reimbursement_lookup.head(2)
Provider_Num Reimbursement_Name_Lookup
0 839987 SOUTHEAST ALABAMA MEDICAL CENTER_1108 ROSS CLA...
1 519118 MARSHALL MEDICAL CENTER SOUTH_2505 U S HIGHWAY...
account_merge = potential_matches.merge(account_lookup, how='left')
account_merge.head()
Account_Num Provider_Num City Hosp_Name Hosp_Address Score Acct_Name_Lookup
0 51216 268781 0 1.0 1.0 2.0 ST FRANCIS MEDICAL CENTER_2400 ST FRANCIS DRIV...
1 55272 556917 1 1.0 1.0 3.0 SCOTTSDALE OSBORN MEDICAL CENTER_7400 EAST OSB...
2 87807 854637 1 1.0 1.0 3.0 ORO VALLEY HOSPITAL_1551 EAST TANGERINE ROAD_O...
3 51151 783146 1 0.0 1.0 2.0 ST. LUKE'S BEHAVIORAL HOSPITAL, LP_1800 EAST V...
4 11740 260374 1 1.0 1.0 3.0 SUMMIT HEALTHCARE REGIONAL MEDICAL CENTER_2200...

The merge in the reimbursement data:

# Let's build a dataframe to  compare
final_merge = account_merge.merge(reimbursement_lookup, how='left')
cols = [
    'Account_Num', 'Provider_Num', 'Score', 'Acct_Name_Lookup',
    'Reimbursement_Name_Lookup'
]
final_merge[cols].sort_values(by=['Account_Num', 'Score'], ascending=False)
Account_Num Provider_Num Score Acct_Name_Lookup Reimbursement_Name_Lookup
2660 94995 825914 3.0 CLAIBORNE MEDICAL CENTER_1850 OLD KNOXVILLE HI... CLAIBORNE MEDICAL CENTER_1850 OLD KNOXVILLE HI...
1975 94953 819181 3.0 LAKE CHARLES MEMORIAL HOSPITAL_1701 OAK PARK B... LAKE CHARLES MEMORIAL HOSPITAL_1701 OAK PARK B...
1042 94943 680596 3.0 VALLEY PRESBYTERIAN HOSPITAL_15107 VANOWEN ST_... VALLEY PRESBYTERIAN HOSPITAL_15107 VANOWEN ST_...
2305 94923 403151 3.0 UNIVERSITY COLO HEALTH MEMORIAL HOSPITAL CENTR... UNIVERSITY COLO HEALTH MEMORIAL HOSPITAL CENTR...
2512 94887 752284 2.0 NEW YORK-PRESBYTERIAN BROOKLYN METHODIST HOSPI... NEW YORK METHODIST HOSPITAL_506 SIXTH STREET_B...
... ... ... ... ... ...
2080 10165 188247 3.0 UTAH VALLEY HOSPITAL_1034 NORTH 500 WEST_PROVO_UT UTAH VALLEY HOSPITAL_1034 NORTH 500 WEST_PROVO_UT
1825 10090 212069 3.0 CANONSBURG GENERAL HOSPITAL_100 MEDICAL BOULEV... CANONSBURG GENERAL HOSPITAL_100 MEDICAL BOULEV...
2424 10043 140535 3.0 BETH ISRAEL DEACONESS HOSPITAL - PLYMOUTH_275 ... BETH ISRAEL DEACONESS HOSPITAL - PLYMOUTH_275 ...
1959 10020 210657 3.0 ST FRANCIS MEDICAL CENTER_309 JACKSON STREET_M... ST FRANCIS MEDICAL CENTER_309 JACKSON STREET_M...
1958 10020 121670 2.0 ST FRANCIS MEDICAL CENTER_309 JACKSON STREET_M... UNIVERSITY HEALTH CONWAY_4864 JACKSON STREET_M...

6. Deduplicates Remover with Recordlinkage

Another additional uses of the Recordlinkage is for finding duplicate records in a data set. The process is very similar to matching except you pass match a single DataFrame against itself.

6.1. Loading the data

hospital_dupes = pd.read_csv(
    'hospital_account_dupes.csv',
    index_col='Account_Num')
hospital_dupes.head(2)
Facility Name Address City State ZIP Code County Name Phone Number Hospital Type Hospital Ownership
Account_Num
71730 SAGE MEMORIAL HOSPITAL STATE ROUTE 264 SOUTH 191 GANADO AZ 86505 APACHE (928) 755-4541 Critical Access Hospitals Voluntary non-profit - Private
70116 WOODRIDGE BEHAVIORAL CENTER 600 NORTH 7TH STREET WEST MEMPHIS AR 72301 CRITTENDEN (870) 394-4113 Psychiatric Proprietary

6.2. Blocking

Then create our indexer with a sortedneighbourhood block on State .

# Deduping follows the same process, you just use 1 single dataframe
dupe_indexer = rl.Index()
dupe_indexer.sortedneighbourhood(left_on='State')
dupe_candidate_links = dupe_indexer.index(hospital_dupes)

6.3. Comparing

We should check for duplicates based on city, name and address:

# Comparison step
compare_dupes = rl.Compare()
compare_dupes.string('City', 'City', threshold=0.85, label='City')
compare_dupes.string('Phone Number',
                     'Phone Number',
                     threshold=0.85,
                     label='Phone_Num')
compare_dupes.string('Facility Name',
                     'Facility Name',
                     threshold=0.80,
                     label='Hosp_Name')
compare_dupes.string('Address',
                     'Address',
                     threshold=0.85,
                     label='Hosp_Address')
dupe_features = compare_dupes.compute(dupe_candidate_links, hospital_dupes)

Because we are only comparing with a single DataFrame, the resulting DataFrame has an ‘Account_Num_1’ and ‘Account_Num_2’:

dupe_features
City Phone_Num Hosp_Name Hosp_Address
Account_Num_1 Account_Num_2
26270 28485 0.0 0.0 0.0 0.0
30430 0.0 0.0 0.0 0.0
43602 0.0 0.0 0.0 0.0
59585 28485 0.0 0.0 0.0 0.0
30430 0.0 0.0 0.0 0.0
... ... ... ... ... ...
64029 38600 0.0 0.0 0.0 0.0
35413 0.0 0.0 0.0 0.0
81525 0.0 0.0 0.0 0.0
82916 0.0 0.0 0.0 0.0
18907 0.0 0.0 0.0 0.0

6.4. Similarity Score

dupe_features.sum(axis=1).value_counts().sort_index(ascending=False)
3.0         7
2.0       206
1.0      7859
0.0    973205
dtype: int64
potential_dupes = dupe_features[dupe_features.sum(axis=1) > 2].reset_index()
potential_dupes['Score'] = potential_dupes.loc[:, 'City':'Hosp_Address'].sum(axis=1)
potential_dupes.sort_values(by=['Score'], ascending=True)
Account_Num_1 Account_Num_2 City Phone_Num Hosp_Name Hosp_Address Score
0 28494 37949 1.0 1.0 0.0 1.0 3.0
1 74835 77000 1.0 1.0 0.0 1.0 3.0
2 24549 28485 1.0 1.0 0.0 1.0 3.0
3 70366 52654 1.0 1.0 0.0 1.0 3.0
4 61685 24849 1.0 1.0 0.0 1.0 3.0
5 51567 41166 1.0 1.0 1.0 0.0 3.0
6 26495 41079 1.0 1.0 0.0 1.0 3.0

These 7 records are potentially duplicates; having a high likelihood of being duplicated. Let’s look at them more deeply:

# Take a look at one of the potential duplicates
hospital_dupes[hospital_dupes.index.isin([51567, 41166])]
Facility Name Address City State ZIP Code County Name Phone Number Hospital Type Hospital Ownership
Account_Num
41166 ST VINCENT HOSPITAL 835 S VAN BUREN ST GREEN BAY WI 54301 BROWN (920) 433-0111 Acute Care Hospitals Voluntary non-profit - Church
51567 SAINT VINCENT HOSPITAL 835 SOUTH VAN BUREN ST GREEN BAY WI 54301 BROWN (920) 433-0112 Acute Care Hospitals Voluntary non-profit - Church

Such potential duplicates can be confirmed with further checks. Their names and addresses are similar and their phone numbers are off by one digit.

As you can see, this method can be a powerful and relatively easy tool to inspect your data and check for duplicate records.

7. Summary

Matching/Linking different datasets on text fields like names and addresses is a common challenge in data processing. This article provided two useful data linkage packages to match two data from separate sources.

The fuzzymatcher uses sqlite 3 to simply match two pandas DataFrames together, based on probabilistic scoring. If you have a larger data set or need to use more complex matching logic, then the Recordlinkage could be a better tool for cleaning duplicates and joining data.

Categories:

Updated:

Leave a comment