synthesis of advice on name matching

    Posted 10-28-2009 17:31

    Thanks to the many of you who offered helpful advice on matching company and person names across large datasets.

    I've attached a synthesis of the suggestions I received, broken down into various categories. Please note that I have not vetted or otherwise evaluated the various software programs listed – I am just reporting back the suggestions. If I develop additional insight upon further evaluation and use, I will report back to the listserv again. (I have also pasted the info at the bottom of this email, in case the attachment does not work.)

     

    Apologies to those of you who get this twice via both BPS and OMT listservs.

     

    regards

     

    Andrew von Nordenflycht

    Assistant Professor, Strategy

    Simon Fraser University

     

    View my research on my SSRN Author page:
    http://ssrn.com/author=100363

     

     

    Name Matching Software and Processes

    Summary of BPS & OMT Listserv Suggestions

    Many thanks to the 28 people who wrote back with suggestions within 2 days.

    Note: I have not investigated these suggested papers or programs yet. In order to report back to the listservs quickly, I am just synthesizing the suggestions I received. When I gather more info on some of these options, I will report back a second time.

     

    Generic Steps in the Name Matching (or Record Linking) Process

     

    ·         clean / simplify names

    o   remove punctuation and common words (from company names)

    ·         do successive rounds of matching with ever "looser" match criteria, removing obvious matches after each round

    o   for non-exact matches, use some kind of "string edit distance" algorithm, i.e., a measure of the distance between two strings, defined as the least-cost way of transforming one string into the other

    §  e.g., Levenshtein distance algorithm: http://en.wikipedia.org/wiki/Approximate_string_matching

    ·         check non-exact matches with eyeballs (manually), rather than accepting blindly

    o   but using software to suggest matches (and to prioritize based on their likelihood) can save a lot of eyeball time
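    The generic steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any respondent's actual method: the stop-word list, the function names (clean_name, levenshtein, match_rounds) and the max_distance threshold are all hypothetical and would need tuning to the datasets at hand. It cleans the names, takes exact matches on the cleaned names first, and then ranks the remaining near matches by Levenshtein distance so that a person can review the likeliest ones first.

```python
import string

# Hypothetical list of common company words to drop; tune it to your own data.
COMMON_WORDS = {"inc", "corp", "corporation", "co", "company", "ltd", "llc", "the"}


def clean_name(name):
    """Lowercase, strip ASCII punctuation, and drop common company words."""
    no_punct = name.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in no_punct.split() if w not in COMMON_WORDS)


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_rounds(left, right, max_distance=2):
    """Round 1: exact matches on cleaned names.
    Round 2: near matches within max_distance, sorted for manual review."""
    right_by_clean = {}
    for r in right:
        right_by_clean.setdefault(clean_name(r), []).append(r)

    exact, unmatched = [], []
    for l in left:
        key = clean_name(l)
        if key in right_by_clean:
            exact.append((l, right_by_clean[key][0]))
        else:
            unmatched.append(l)

    # Remove the obvious matches before the looser round.
    matched_right = {r for _, r in exact}
    remaining_right = [r for r in right if r not in matched_right]

    candidates = []
    for l in unmatched:
        for r in remaining_right:
            distance = levenshtein(clean_name(l), clean_name(r))
            if distance <= max_distance:
                candidates.append((distance, l, r))
    candidates.sort()  # smallest distance (likeliest match) first
    return exact, candidates
```

    Calling match_rounds(list_a, list_b) returns the exact pairs plus a list of (distance, name_a, name_b) candidates to check by eye. The second round compares every remaining pair, so for very large files some form of pre-grouping would be needed to keep it fast.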

     

    Relevant Papers

     

    ·         Trajtenberg, Shiff & Melamed. 2006. The "Names Game": Harnessing Inventors' Patent Data for Economic Research: http://www.nber.org/papers/w12479

    o    ABSTRACT: The goal of this paper is to lay out a methodology and corresponding computer algorithms, that allow us to extract the detailed data on inventors contained in patents, and harness it for economic research. Patent data has long been used in empirical research in economics, and yet the information on the identity (i.e. the names and location) of the patents' inventors has seldom been deployed in a large scale, primarily because of the "who is who" problem: the name of a given inventor may be spelled differently across her/his patents, and the exact same name may correspond to different inventors (i.e. the "John Smith" problem). Given that there are over 2 million patents with 2 inventors per patent on average, the "who is who" problem applies to over 4 million "records", which is obviously too large to tackle manually. We have thus developed an elaborate methodology and computerized procedure to address this problem in a comprehensive way. The end result is a list of 1.6 million unique inventors from all over the world, with detailed data on their patenting histories, their employers, co-inventors, etc. Forty percent of them have more than one patent, and 70,000 have more than 10 patents. We can trace those multiple inventors across time and space, and thus study the causes and consequences of their mobility across countries, regions, and employers. Given the increasing availability of large computerized data sets on individuals, there may be plenty of opportunities to deploy this methodology to other areas of economic research as well

    ·         Raffo & Lhuillery. 2009. Research Policy. How to play the 'Names Game': Patent Retrieval using Different Heuristics: http://dx.doi.org/10.1016/j.respol.2009.08.001

    o    ABSTRACT: Patent statistics represent a critical tool for scholars, statisticians and policy makers interested in innovation and intellectual property rights. Many analyses are based on heterogeneous methods delineating the inventors' or firms' patent portfolios without questioning the quality of the method employed. We assess different heuristics in order to provide a robust solution to automatically retrieve inventors in large patent datasets (PATSTAT). The solution we propose reduces the usual errors by 50% and casts doubts on the reliability of statistical indicators and micro-econometric results based on common matching procedures. Guidelines for researchers, TTOs, firms, venture capitalists and policy makers likely to implement a names game or to comment on results based on a names game are also provided

    ·         D. G. Feitelson, On identifying name equivalences in digital libraries. Information Research 9(4) paper 192, Jul 2004. http://InformationR.net/ir/9-4/paper192.html

    o   Note: author was happy to share his perl code

    o    ABSTRACT: The services provided by digital libraries can be much improved by correctly identifying variants of the same name. For example, this will allow for better retrieval of all the works by a certain author. We focus on variants caused by abbreviations of first names, and show that significant achievements are possible by simple lexical analysis and comparison of names. This is done in two steps: first a pairwise matching of names is performed, and then these are used to find cliques of equivalent names. However, these steps can each be performed in a variety of ways. We therefore conduct an experimental analysis using two real datasets to find which approaches actually work well in practice. Interestingly, this depends on the size of the repository, as larger repositories may have many more similar names

     

    Reviews of Research and/or Software

     

    ·         Australian National University Data Mining Group lists plenty of vendors and software for linking records. Note: some of the links are out-of-date: http://datamining.anu.edu.au/linkage.html

    ·         a Guelph postdoc has a brief page on record linkage, including some links to software: http://www.uoguelph.ca/~lantonie/recordlinkage.html.

     

    Existing Programs (in order of being suggested to me, not in order of relevance or quality)

     

    ·         custom software from Emory CS grad student (for hospital patient records): http://www.mathcs.emory.edu/Research/Area/datainfo/FRIL/

    ·         WordStat/Simstat from Provalis Research.

    ·         VBPro, a free program that runs under DOS.

    ·         Link Plus: US Centers for Disease Control and Prevention offers free record linkage software: http://www.cdc.gov/cancer/npcr/tools/

    ·         The Link King works through SAS: http://www.the-link-king.com/

    ·         DDupe (mentioned twice): http://www.cs.umd.edu/projects/linqs/ddupe/

    ·         MatchIT works well for (a) matching company names and addresses that are spelled differently across various large datasets, and (b) fuzzy matching of individuals' names: http://helpit.com/folders/software_solutions/batch_data_quality_us/ 
    caveats:

    o   a bit complicated to learn to use, and does require some "tuning" based on the particulars of the datasets

    o   it's not cheap

    ·         Soundex/Metaphone
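    For readers unfamiliar with the last item: Soundex reduces a name to a short code based on how it sounds, so that spelling variants such as "Smith" and "Smyth" receive the same code. Below is a minimal Python sketch of the standard American Soundex rules, assuming plain ASCII names; the function name soundex is just an illustrative choice, and Metaphone (a more elaborate scheme) is not shown.

```python
# Map each coded consonant to its Soundex digit; vowels, h, w, and y get no digit.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Return the 4-character American Soundex code (assumes ASCII letters)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of identical digits
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # a vowel resets the run, so a repeated digit is kept
        if len(code) == 4:
            break
    return code.ljust(4, "0")  # pad short codes with zeros
```

    Names that share a code (for example, soundex("Robert") and soundex("Rupert")) are only candidate matches; they still deserve the manual check described earlier.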

     

    Functions / algorithms w/in statistical or data management software

     

    ·         w/in SAS [SAS mentioned in at least 5 responses]

    o   "edit distance" functions:

    §  COMPGED for computing the generalized edit distance

    §  COMPLEV for computing the Levenshtein distance

    o   Data Quality Server in SAS -- the program lets you do a lot of great things, like matching on the "sound" of the name, increasingly relaxing the spelling, etc.

    ·         w/in Microsoft Access

    o   SQL "fuzzy grouping" functions: http://msdn.microsoft.com/en-us/library/ms141764.aspx

    o   MS Access + SQL code -- Access has had, and may still have, some "fuzzy matching" functions

    ·         w/in Microsoft Excel

    o   ASAP Utilities (Excel plug-in) free download: http://www.asap-utilities.com/
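    For those without access to the packages above, similar "suggest the closest names" behaviour is available in Python's standard library through difflib.get_close_matches. The sketch below is an illustration only: the wrapper name suggest_matches and the 0.8 cutoff are assumptions, and difflib scores similarity with a ratio from SequenceMatcher, which is not the same measure as Levenshtein distance or SAS's COMPGED.

```python
import difflib


def suggest_matches(name, candidates, n=3, cutoff=0.8):
    """Return up to n (candidate, similarity) pairs scoring at least cutoff, best first."""
    # Compare in lowercase, but report the candidate in its original spelling.
    lowered = {}
    for candidate in candidates:
        lowered.setdefault(candidate.lower(), candidate)
    target = name.lower()
    hits = difflib.get_close_matches(target, list(lowered), n=n, cutoff=cutoff)
    return [(lowered[h], round(difflib.SequenceMatcher(None, target, h).ratio(), 3))
            for h in hits]
```

    For example, suggest_matches("Jon Smith", ["John Smith", "Jane Smythe", "Bob Jones"]) proposes "John Smith" and leaves the other two out; lowering the cutoff loosens the match criteria for a later round.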

     

    PERL for custom programming = find a student (or other programmer) who can write a PERL program [suggested 4 times]

     

    ·         these can be written quickly

    o   example: $20/hr for 5 hours to write a program that found firm names and various attributes in 42,000 magazine articles

    ·         online primer on PERL: http://www.perl.com/pub/a/2000/10/begperl1.html

    ·         SAS has built-in perl-like functions

    ·         Chuck Martin is a programmer who writes such programs for academics: contact at cwurld@yahoo.com
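    To give a sense of how small such a custom program can be, here is a rough sketch of a script that counts mentions of known firm names in a set of article texts. It is written in Python rather than PERL purely for illustration, it is not the program described in the example above, and the function name count_firm_mentions and its inputs are hypothetical.

```python
import re
from collections import Counter


def count_firm_mentions(firm_names, articles):
    """Count case-insensitive, whole-word mentions of each firm name across articles."""
    # The lookarounds stop "Acme" from matching inside a longer word such as "Acmeville".
    patterns = {
        name: re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)", re.IGNORECASE)
        for name in firm_names
    }
    counts = Counter()
    for text in articles:
        for name, pattern in patterns.items():
            counts[name] += len(pattern.findall(text))
    return counts
```

    This only finds exact spellings of the names supplied; catching variant spellings would mean combining it with one of the fuzzy matching approaches listed earlier.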