Andrew,
I highly recommend the Data Quality Server in SAS. I am not the most comfortable programming in SAS, but found this essential for a match of names and addresses of companies across two very large (100,000+ observations) data sets.
You will want to standardize names as much as possible before running the match, but the program lets you do a lot of great things, like matching on the "sound" of the name, increasingly relaxing the spelling, etc. It is helpful to go in cycles, where you do a very exact string match first, then take those records out, match up a bit more loosely, etc. Then, on the remaining ones, I agree with the "eyeballs" folks. (Also spot-check the automated work visually).
One note of caution with the "eyeballs" approach -- if you ever have to go back and re-do your work, you will have to do the eyeballs part again. So, automate as much as possible!
Happy to discuss offline,
Kristina
-----Original Message-----
From: Maslach, David [mailto:
dmaslach@IVEY.UWO.CA]
Sent: Thursday, October 08, 2009 3:20 PM
Subject: Re: wanted: software to identify "close" matches in a datase t of names (either individuals or companies).
Andrew,
I had a similar problem. My dataset is large so I do not use excel. I
find the best method is to directly match.
Here is the pseudo-code:
1. Match one-to-one using the name fields - remove from sample, store
these as 'matched.'
2. Find any common identifiers (dates, locations, etc.) - use these to
match the 'unmatched' records as closely as possible using many-to-many.
Ie. match all records with the same year. Your merged database will get
very big at this moment.
3. Loop:
a. Remove a word from the one of the name fields that is common.
Ie. Merrill from name field a:
(Name field a: "Merrill Lynch" & Name field b: "Merrill Lynch Fenner
Smith") becomes (Name field a: "Lynch" & Name field b: "Merrill Lynch
Fenner Smith").
b. Check to see if Name field b contains name field a.
Ie. Lynch is contained in "Merrill Lynch Fenner Smith"
c. Put record in 'matched' location.
d. Repeat.
You will have to go through your databases and find 'common names'
manually. Don't worry - I have a large database (> 100 k records) and
it did not take that long to create the 'common name list.' I found
that the best method was to check the 'unmatched' database after each
run and see if there were any 'common names' left over.
I hope that helps.
David Maslach
University of Western Ontario
-----Original Message-----
From: Business Policy and Strategy List
[mailto:
BPS-NET@AOMLISTS.PACE.EDU] On Behalf Of Cui, Victor
Sent: Thursday, October 08, 2009 1:33 PM
To:
BPS-NET@AOMLISTS.PACE.EDU
Subject: Re: wanted: software to identify "close" matches in a datase t
of names (either individuals or companies).
Andrew,
I just did a fuzzy matching using Excel and Eyeballs. I would have used
SAS if my dataset is larger. SAS has several procedures which allow you
to "customize" your matching routines. But eyeballing before and after
any procedures that you use is necessary.
Victor
UBC
________________________________
From: Business Policy and Strategy List on behalf of Charles Williams
Sent: Thu 10/8/2009 10:06 AM
To:
BPS-NET@AOMLISTS.PACE.EDU
Subject: Re: wanted: software to identify "close" matches in a datase t
of names (either individuals or companies).
Andrew, I'd largely agree with Will (having done name matching with him
on one project) that eyeballs are essential and probably the best tool
for any database on the small end of large. For really huge datasets, I
would suggest Access plus some sql code.
Access used to even have its own fuzzy matching routine, though I'm not
sure that it does anymore. But if you google around for name matching &
access, fuzzy matching & access, and other similar terms you'll find
routines and suggestions for coding this sort of matching process.
Companies do this often with their mailing lists - though not enough
judging by the stream of identical catalogs to my doorstep.
Charlie
On Wed, Oct 7, 2009 at 11:33 PM, Andrew Von Nordenflycht <
vonetc@sfu.ca>
wrote:
I have several large datasets containing names of companies and
individual people. The companies or people can and do appear multiple
times (e.g., in different years) and I want to link all instances of the
same name. This is easy when the match is exact.
However, for a variety of reasons, such as typos or 'nicknames',
there are also many "close" matches - where the text does not match
exactly but is very likely to refer to the same entity (e.g., "Jhon
Smith" vs. "John Smith" or "Merrill Lynch" vs. "Merrill Lynch Fenner
Smith").
My goal is to identify these close matches in a systematic way
without manually going over the data. I presume the main function of
such a program or algorithm would be to identify "all but 1 character"
matches, and then "all but 2 character matches", etc. Preferably the
program would suggest close matches and let me decide if they are
matched.
Any ideas on useful software for this task would be appreciated.
Andrew von Nordenflycht
Assistant Professor, Strategy
Simon Fraser University
vonetc@sfu.ca
View my research on my SSRN Author page:
http://ssrn.com/author=100363 <http://ssrn.com/author=100363>
--
Charles Williams, Asst. Professor of Strategy
Fuqua School of Business, Duke University
P.O. Box 90120, Durham, NC 27708
tel: 919.660.7963 // fax: 919.681.6244