wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

  • 1.  wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

    Posted 10-07-2009 23:33

    I have several large datasets containing names of companies and individual people. The companies or people can and do appear multiple times (e.g., in different years) and I want to link all instances of the same name. This is easy when the match is exact.

    However, for a variety of reasons, such as typos or 'nicknames', there are also many "close" matches – where the text does not match exactly but is very likely to refer to the same entity (e.g., "Jhon Smith" vs. "John Smith" or "Merrill Lynch" vs. "Merrill Lynch Fenner Smith").

    My goal is to identify these close matches in a systematic way without manually going over the data. I presume the main function of such a program or algorithm would be to identify "all but 1 character" matches, and then "all but 2 character matches", etc. Preferably the program would suggest close matches and let me decide if they are matched.
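
    For illustration, the "all but N characters" idea is close to what is usually called Levenshtein edit distance. Below is a minimal Python sketch, not a finished tool. The function names edit_distance and close_pairs and the max_distance cutoff are made up for this example, and it compares every pair of names, which is only practical for modest lists.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def close_pairs(names, max_distance):
    """Suggest pairs of distinct names within max_distance edits
    (case-insensitive), so a person can accept or reject each one."""
    unique = sorted(set(names))
    pairs = []
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            d = edit_distance(a.lower(), b.lower())
            if 0 < d <= max_distance:
                pairs.append((a, b, d))
    return pairs
```

    Plain Levenshtein distance counts a swap of two adjacent letters as two edits, so "Jhon Smith" vs. "John Smith" only appears once max_distance is 2. A long suffix such as "Merrill Lynch" vs. "Merrill Lynch Fenner Smith" is many edits away and needs a different kind of rule.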

    Any ideas on useful software for this task would be appreciated.

     

     

    Andrew von Nordenflycht

    Assistant Professor, Strategy

    Simon Fraser University

    vonetc@sfu.ca

     

    View my research on my SSRN Author page:
    http://ssrn.com/author=100363

     



  • 2.  wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

    Posted 10-08-2009 06:10
    I have not yet used it, but there is a program called DDupe that looks promising.  I just finished cleaning a dataset by hand, then found it.  It can be found at:
     
     
    Perhaps this will help.

    --
    ***********************************************
    Thomas E. Nelson
    University of Louisville Entrepreneurship PhD Candidate
    Office:  502.852.4874
    Home:  812.944.8380
    Cell:  765.212.1012
    ***********************************************
    My greatest hope is to be a man of unborrowed vision

    ***********************************************


  • 3.  wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

    Posted 10-08-2009 08:59
    Unfortunately, the only software I have ever found to work viably in such cases is eyeballs.

    will
    -------------------------------------------------------------------------------------
    Will Mitchell
    J. Rex Fuqua Professor of International Management, Professor of Strategy
    Duke University, The Fuqua School of Business
    Phone: 1.919.660.7994 | Fax: 1.919.681.6244 | email: will.mitchell@duke.edu | URL: willmitchell.org

    Watch our video at www.fuqua.duke.edu/wakeup

  • 4.  wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

    Posted 10-08-2009 09:26

    Andrew,

    I think Stata might be helpful for merging the different datasets.

     

    Mine Ozer, Ph.D.

    Assistant Professor of Management

    Division of Economics and Business

    SUNY Oneonta

    Oneonta, NY 13820

    Phone: 607-436-3047

     

     

     



  • 5.  wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

    Posted 10-08-2009 13:06
    Andrew, I'd largely agree with Will (having done name matching with him on one project) that eyeballs are essential and probably the best tool for any database on the small end of large. For really huge datasets, I would suggest Access plus some SQL code.

    Access used to even have its own fuzzy matching routine, though I'm not sure that it does anymore. But if you google around for name matching & access, fuzzy matching & access, and other similar terms you'll find routines and suggestions for coding this sort of matching process. Companies do this often with their mailing lists - though not enough judging by the stream of identical catalogs to my doorstep.

    Charlie


    --
    Charles Williams, Asst. Professor of Strategy
    Fuqua School of Business, Duke University
    P.O. Box 90120, Durham, NC 27708
    tel: 919.660.7963 // fax: 919.681.6244


  • 6.  wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

    Posted 10-08-2009 13:33
    Andrew,

    I just did some fuzzy matching using Excel and eyeballs. I would have used SAS if my dataset were larger. SAS has several procedures that allow you to "customize" your matching routines. But eyeballing before and after any procedure you use is necessary.

    Victor
    UBC


  • 7.  wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

    Posted 10-08-2009 15:20
    Andrew,

    I had a similar problem. My dataset is large, so I do not use Excel. I
    find the best method is to match directly.

    Here is the pseudo-code:

    1. Match one-to-one using the name fields. Remove the matches from the
    sample and store them as 'matched.'
    2. Find any common identifiers (dates, locations, etc.) and use these to
    match the 'unmatched' records as closely as possible using many-to-many,
    i.e., match all records with the same year. Your merged database will get
    very big at this point.
    3. Loop:
    a. Remove a common word from one of the name fields.

    E.g., remove "Merrill" from name field a:

    (Name field a: "Merrill Lynch" & Name field b: "Merrill Lynch Fenner
    Smith") becomes (Name field a: "Lynch" & Name field b: "Merrill Lynch
    Fenner Smith").

    b. Check whether name field b contains name field a.

    E.g., "Lynch" is contained in "Merrill Lynch Fenner Smith".

    c. If it does, put the record in the 'matched' location.

    d. Repeat.



    You will have to go through your databases and find 'common names'
    manually. Don't worry - I have a large database (> 100 k records) and
    it did not take that long to create the 'common name list.' I found
    that the best method was to check the 'unmatched' database after each
    run and see if there were any 'common names' left over.
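
    The steps above can be sketched in Python roughly as follows. This is only one reading of the pseudo-code, under several assumptions: records are (name, year) tuples, the year is the shared identifier, all common words are stripped in one pass rather than one per loop iteration, and the names match_records, strip_common and common_words are made up for the example.

```python
def strip_common(name, common_words):
    """Lower-case a name and drop every word on the hand-built common-word list."""
    return " ".join(w for w in name.lower().split() if w not in common_words)


def match_records(records_a, records_b, common_words):
    """Return (matched, unmatched) for records_a against records_b.

    records_a and records_b are lists of (name, year) tuples.
    matched holds (name_a, name_b, how) triples for later manual review.
    """
    matched, unmatched = [], []
    for name_a, year_a in records_a:
        # Step 1: exact match on the name field (case-insensitive).
        exact = [n for n, _ in records_b if n.lower() == name_a.lower()]
        if exact:
            matched.extend((name_a, n, "exact") for n in exact)
            continue
        # Step 2: restrict candidates to records sharing the identifier (year).
        candidates = [n for n, y in records_b if y == year_a]
        # Step 3: strip common words from field a, then test containment in field b.
        key = strip_common(name_a, common_words)
        hits = [n for n in candidates if key and key in n.lower()]
        if hits:
            matched.extend((name_a, n, "contains") for n in hits)
        else:
            unmatched.append((name_a, year_a))
    return matched, unmatched
```

    With common_words = {"merrill"}, "Merrill Lynch" (2005) is reduced to "lynch" and matched to "Merrill Lynch Fenner Smith" (2005). Because the test is plain substring containment, it would also accept a name such as "Lynchburg", so the 'contains' matches still need checking by eye.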

    I hope that helps.



    David Maslach
    University of Western Ontario





  • 8.  wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

    Posted 10-09-2009 06:48
    Andrew,
     
    I've been able to match company names and addresses that are spelled differently (as you describe) across various large datasets using MatchIT software. It also does fuzzy matching of individuals' names.
     
    MatchIT is a bit complicated to learn to use, and does require some "tuning" based on the particulars of the datasets. But once you've got the hang of it, it works very well and quickly. Details are here: http://helpit.com/folders/software_solutions/batch_data_quality_us/  
     
    Because it's not cheap, though, our centralized IT group purchased it and makes it available to faculty in the IT lab.
     
    -Mike
     

    -----------------------------------------------------------
    Michael Toffel
    Assistant Professor | Harvard Business School
    Morgan Hall 497 | Boston MA 02163
    tel +1 (617) 384-8043
    fax +1 (206) 339-7123
    http://people.hbs.edu/mtoffel/

     



  • 9.  wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

    Posted 10-09-2009 06:57
    Andrew,
    I highly recommend the Data Quality Server in SAS. I am not the most comfortable programming in SAS, but found this essential for a match of names and addresses of companies across two very large (100,000+ observations) data sets.

    You will want to standardize names as much as possible before running the match, but the program lets you do a lot of great things, like matching on the "sound" of the name, increasingly relaxing the spelling, etc. It is helpful to go in cycles, where you do a very exact string match first, then take those records out, match up a bit more loosely, etc. Then, on the remaining ones, I agree with the "eyeballs" folks. (Also spot-check the automated work visually).
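
    The cycle described above (exact first, then by sound, then with looser spelling) can be imitated outside SAS. The sketch below is a Python stand-in, not the SAS Data Quality Server: it swaps in classic Soundex for the "sound" pass and the standard library's difflib similarity ratio for the relaxed-spelling pass. The function names and the 0.85 cutoff are assumptions chosen for illustration.

```python
import difflib

_CODES = {c: d for d, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                                  "4": "l", "5": "mn", "6": "r"}.items()
          for c in letters}


def soundex(word):
    """Classic four-character Soundex code for a single word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out, prev = [], _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            out.append(code)
        if c not in "hw":  # h and w do not separate equal codes
            prev = code
    return (letters[0].upper() + "".join(out) + "000")[:4]


def phonetic_key(name):
    return " ".join(soundex(w) for w in name.split())


def run_pass(remaining, candidates, is_match, label, matches):
    """Match what we can with one rule and take matched records out of both lists."""
    still_unmatched = []
    for a in remaining:
        hit = next((b for b in candidates if is_match(a, b)), None)
        if hit is None:
            still_unmatched.append(a)
        else:
            candidates.remove(hit)
            matches.append((a, hit, label))
    return still_unmatched


def cascade_match(names_a, names_b, min_ratio=0.85):
    """Run increasingly loose passes; return (matches, leftovers for eyeballing)."""
    matches, remaining, pool = [], list(names_a), list(names_b)
    passes = [
        ("exact", lambda a, b: a.lower() == b.lower()),
        ("sound", lambda a, b: phonetic_key(a) == phonetic_key(b)),
        ("loose", lambda a, b: difflib.SequenceMatcher(
            None, a.lower(), b.lower()).ratio() >= min_ratio),
    ]
    for label, rule in passes:
        remaining = run_pass(remaining, pool, rule, label, matches)
    return matches, remaining
```

    Each match carries the label of the pass that produced it, so the looser passes can be spot-checked more heavily than the exact one, and the whole run can be repeated without redoing manual work.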

    One note of caution with the "eyeballs" approach -- if you ever have to go back and re-do your work, you will have to do the eyeballs part again. So, automate as much as possible!

    Happy to discuss offline,
    Kristina


  • 10.  wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

    Posted 10-09-2009 07:16
    Andrew,

    A forthcoming paper in Research Policy that compares different heuristics for patent retrieval might also be of interest to you (http://dx.doi.org/10.1016/j.respol.2009.08.001).

    Best regards,
    Marcel

    Marcel Bogers, Ph.D.
    University of Southern Denmark
    Alsion 2, 6400 Sønderborg, Denmark
    Phone: +45 6550 1284
    E-mail: bogers@mci.sdu.dk
    URL: www.marcelbogers.com



  • 11.  wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

    Posted 10-09-2009 10:57

    Andrew,

     

    I used Stata to match names in large datasets (>100K) using the following steps.

    First, "standardize" all the names. This may include, e.g., converting all characters to uppercase, ensuring there is only one space between any two words, and dropping any special signs or marks, such as " : , etc.

    Second, start with the "perfect" matches. That is, deal first with those names that can be matched exactly by machine.

    Third, do a "key word" match for the rest. For example, you can start by matching on the first 5 words, then on the first 4 words, then on the first 3 words, etc. It helps to check your dataset beforehand and learn about any "regularities" in it. For example, you may decide to drop all non-essential words in the names, such as "The", "A", etc. With this step, you'll have a matched list for instances like "Jhon Smith" vs. "John Smith" or "Merrill Lynch" vs. "Merrill Lynch Fenner Smith". You do need eyeballing, unfortunately.
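
    The three steps can also be sketched in Python rather than Stata. This is a rough illustration under stated assumptions: the stop-word list, the function names standardize and keyword_match, and the choice to stop at the first 2 words (min_words) are all made up for the example, and punctuation is replaced by a space rather than deleted outright.

```python
import re

STOP_WORDS = {"THE", "A", "AN"}  # hypothetical list of non-essential words


def standardize(name):
    """Uppercase, turn punctuation into spaces, collapse spaces, drop stop words."""
    cleaned = re.sub(r"[^\w\s]", " ", name.upper())
    return " ".join(w for w in cleaned.split() if w not in STOP_WORDS)


def keyword_match(names_a, names_b, max_words=5, min_words=2):
    """Perfect match first, then match on the first k words for decreasing k.

    Returns (matches, remaining); matches holds (name_a, name_b, how) triples.
    """
    std_b = [(standardize(b), b) for b in names_b]
    exact_lookup = {s: b for s, b in std_b}
    matches, remaining = [], []
    # Step 2: "perfect" matches on the standardized names.
    for a in names_a:
        s = standardize(a)
        if s in exact_lookup:
            matches.append((a, exact_lookup[s], "perfect"))
        else:
            remaining.append(a)
    # Step 3: "key word" matches on the first k words, k = max_words ... min_words.
    for k in range(max_words, min_words - 1, -1):
        index = {}
        for s, b in std_b:
            index.setdefault(tuple(s.split()[:k]), b)
        still_unmatched = []
        for a in remaining:
            key = tuple(standardize(a).split()[:k])
            if key and key in index:
                matches.append((a, index[key], f"first {k} words"))
            else:
                still_unmatched.append(a)
        remaining = still_unmatched
    return matches, remaining
```

    In this sketch "Merrill Lynch" is paired with "Merrill Lynch Fenner Smith" on the first 2 words. A misspelling inside the leading words, such as "Jhon", is not caught by it and stays in the remaining list for manual review.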

     

    Yong

     

    www.buffalo.edu/~yl67

     

     

     

     

     
