Lingua-EN-NameParse-1.33000755000000000000 012515627074 15370 5ustar00unknownunknown000000000000Lingua-EN-NameParse-1.33/Changes000444000000000000 2160712515625565 17051 0ustar00unknownunknown000000000000Revision history for Perl CPAN module Lingua::En::NameParse 1.33 Apr 22 2015 Removed 'Mr_J_Adam_Smith' name type. Names that were most likely double surnames were beiong incorrectly parsed For example 'Mr A Smythe Taylor' was getting middle name of 'Smythe'. Should be picked up as error. Surname should be 'Smythe-Taylor' Fold all text to upper case prior to parsing, regexp matching is faster if case insensitive Note that this means all components are now NameCased by default. The case_components method has been removed as compoenents methods achieves the same outcome The force_case option has been removed non_matching text moved from propereties to components Use extended regexp (/x) in grammar for better readability Simplified grammar and removed duplicate definitions Removed all full stops before parsing as this simplifies grammar, so: Mr. A.B. Smith => Mr AB Smith Added more valid prefixes 'report' method now tests all functions removed indirect object notation for 'new' method added error and warning description attribues to name object 1.32 11 Nov 2013 Removed &s from start of subroutine calls (obsolete syntax) Fixed typo RT bug 88573 case_surname does not process undefined or blank input, stops warning message occuring 1.31 24 Jun 2013 Wherever program returned 'undef' changed to just a return statement, fix for RT bug 85491 Added error handling to synopsis examples 1.30 31 Mar 2011 Added component ordering for Mr_J_Adam_Smith name type, thanks to John Hansen Corrected some of the documentation Added more military titles 1.29 23 Jan 2011 Corrected documentation of case_components module, thanks to John Hansen Removed invalid space after /Pilot Officer/ in extended titles grammar, thanks to John Hansen Added the 'Mr_J_Adam_Smith' name type, thanks to John Hansen Added the 'John' name type, thanks to Graham Seamen Moved NameGrammar.pm to Lingua::EN::NameParse::Grammar name space 1.28 3 Jan 2011 Added more extended titles including Pilot Officer, Count, Duke, Dutchess, Marquess (thanks to Hugh Myers) Allowed T-Bone as a given name (thanks to Hugh Myers) Added name type Mr_John_Adam_Smith (thanks to Chris Brown) Allowed joint names to be reverse cased Fixed bug when printing report on name of unknown type Moved saluation paramters from 'new' to 'salutation' method Allowed for two types of salutation, given_name and title_plus_surname 1.27 4 Jul 2010 Changed my cygwin fstab file to use noacl option for default mount points. Before this, the tar tool was creating directories with no read permision. 1.26 3 Jul 2010 Made line endings consistent in all files, with Unix style EOL markers 1.25 2 Jul 2010 Changed the parse method to match the documentation. It was returning a name object and an error flag, when only the flag was required. Note that this caused no problem if used as documented, the error value was still assigned correctly, and the name object assigned to nothing. Thanks to Chris Prather. Added 'use warnings' to modules 1.24 6 Jan 2008 Removed one more use to $_ in case_surname subroutine. Thanks to Ben Martin 1.23 23 Nov 2006 components method now returns undef if name cannot be parsed. Thanks to Patrick Kane. Stopped assigning to $_ in case_surname subroutine. Thanks to Richard Walker Created examples folder Fixed alignment problem in report method Removed optional title from John_Adam_Smith name type 1.22 12 Jul 2005 Added the Mr_John_Smith_&_Ms_Mary_Jones name type Added the John_Smith_&_Mary_Jones name type Added the John_&_Mary_Smith name type Added the A_Smith_&_B_Jones name type Updated distribution to current CPAN requirements 1.21 20 May 2005 Fixed bug that was creating a name cased surname of O'sullivan, now O'Sullivan Added report method to dump all name properties and components Added several new suffixes such as PhD 1.20 16 Feb 2004 Fixed bug that was preventing some extended titles, such as 'Major General' from working 1.19 27 Dec 2003 Fixed bug in rules.t, thanks to Assad Arnaud Added more valid surnames in _valid_name, thanks to Peter Schendzielorz 1.18 28 Mar 2002 Fixed bug in rules.t, thanks to Assad Arnaud Added more reserved words Added option to use a limited set of common titles (improves speed) 1.17 21 Mar 2002: Added option to remove joint names from grammar (100% speed improvement!) NOTE that by default, joint names (Mr & Mrs AB Smith) are not recognized Applied correct capitalization to single possessive word, such as French's Added Dalle, dela and dall' to list of Italian surname prefixes Added San to list of Spanish surname prefixes Detect reserved words, such as Pty Ltd in pre parse stage 1.16 24 Sep 2001: Minor additions to README file 1.15 25 Jul 2001: Added more complete list to surname_prefs.txt Allowed for a surname prefix of Dell', as in Dell'Arte Added case_all_reversed method to return name in the format of surname, initials and/or given_names, useful for alphabetical sorting Length of given name set to at least 2 characters for name type John_A_Smith Names such as "Al B Jones" now parse correctly Length of given name set to at least 2 characters for name type John_Adam_Smith Names such as "Al Brian Jones" now parse correctly Created a new token for middle names, at least 2 characters long and a look ahead to exclude prefixes. Names such as "Mary Jo White" and "John Van Der Wald" now parse correctly 1.14 19 Jul 2001: Moved prefixes and suffixes back into grammar tree to stop invalid combinations Removed pre parse stage Changed case_surname to correctly capitalize suffixes in roman numeral format Removed POD directives from README Added J_Adam_Smith name type, thanks to Michael Cesar 1.13 25 Apr 2001: Initialised values for all parsed name components to empty string Fixed POD errors, thanks to Jason Gallagher 1.12 4 Apr 2001: Allowed for titles in John_Adam_Smith name type (they were being consumed to early as given names). Fixed some incorrect title regular expressions Added 'dela' to list of surname Italian prefixes 1.11 24 Feb 2001: Fixed bug in Makefile.PL, caused by omission of NameGrammar.pm 1.10 22 Feb 2001: Allowed for apostrophes in given names, like D'Artagnan Fixed bug for suffixes in 'John_Smith' name type Added the John_Adam_Smith name type Added correct capitalization of possessive proper names such as Australia's to case_surname. Thanks to Dennis Ingram. Precursors and suffixes now detected in pre-parsing stage Removed precursor and suffix tree from grammar Removed Senior and Junior as suffixes, as they are valid surnames Extended range of suffixes in roman numeral format Moved grammar definition to separate module 1.05 7 Jan 2001: Updated README file Added test of lower casing prefixes to rules.t 1.04 10 Dec 2000: Allowed for reversed order names where the surname appears first, followed by a comma and the remaining components of the name, such as title, first name, initials etc. 1.03 25 Jul 2000: Added suffixes (like Jnr) to most single names Replaced '=head3' because they don't work with perldoc Removed search of @INC array in BEGIN subroutine Thanks to Douglas Wilson for these requests 1.02 19 Apr 2000: Removed title J\.? (abbreviation for Judge) as it clashed with names such as J A Smith, J. B. Jones etc. Thanks to Adam Huffman for spotting this bug 1.01 15 Apr 2000: auto_clean option now removes commas from input Improved regular expressions in 'clean' sub, thanks to Mark Summerfield Catered for initials with both dots and spaces, such as A. B. Smith, requested by Adam Huffman 1.00 27 Dec 1999: Added user defined file of surname capitalization over rides Allowed for salutations where precursor is not an estate 0.40 14 Sep 1999: Added the Mr_John_A_Smith and John_A_Smith name types Allowed for hyphenated given names 0.30 21 Aug 1999: Allowed for user defined length of initials Added the Mr_John_Smith name type Added the John_Smith name type Surnames with the D' prefix now correctly capitalised If a parsed name had no components, the components method returned an odd numbered hash and case_components returned 1. Both these methods now return undef in this situation 0.10 04 Jul 1999: Allowed for lower casing of surname prefixes 0.04 16 May 1999: Added test script for rule ordering Added more titles, improved documentation 0.03 02 May 1999: Altered output of test script to work with Test::Harness Modified &clean to remove single leading or trailing space 0.02 01 May 1999: Added test script, converted source to Unix format 0.01 25 Apr 1999: First Release Lingua-EN-NameParse-1.33/Makefile.PL000444000000000000 66710742041732 17457 0ustar00unknownunknown000000000000use ExtUtils::MakeMaker; # See lib/ExtUtils/MakeMaker.pm for details of how to influence # the contents of the Makefile that is written. WriteMakefile ( 'NAME' => 'Lingua::EN::NameParse', 'VERSION_FROM' => 'lib/Lingua/EN/NameParse.pm', 'ABSTRACT' => 'Manipulate peoples names, titles and initials', 'AUTHOR' => 'Kim Ryan', 'LICENSE' => 'perl', 'PREREQ_PM' => { 'Parse::RecDescent' => 0 } ); Lingua-EN-NameParse-1.33/MANIFEST000444000000000000 30512161770771 16634 0ustar00unknownunknown000000000000MANIFEST README Changes Makefile.PL examples/demo.pl lib/Lingua/EN/NameParse.pm lib/Lingua/EN/NameParse/Grammar.pm surname_prefs.txt t/main.t t/rules.t t/pod.t t/pod-coverage.t META.yml META.json Lingua-EN-NameParse-1.33/META.json000444000000000000 256312515627074 17154 0ustar00unknownunknown000000000000{ "abstract" : "Manipulate peoples names, titles and initials", "author" : [ "Kim Ryan " ], "dynamic_config" : 1, "generated_by" : "Module::Build version 0.421", "license" : [ "perl_5" ], "meta-spec" : { "url" : "http://search.cpan.org/perldoc?CPAN::Meta::Spec", "version" : "2" }, "name" : "Lingua-EN-NameParse", "prereqs" : { "build" : { "requires" : { "Test::More" : "0.94", "Test::Pod" : "1.04" } }, "configure" : { "requires" : { "Module::Build" : "0.38" } }, "runtime" : { "requires" : { "Parse::RecDescent" : "1", "Test::Pod::Coverage" : "1.04", "locale" : "1", "strict" : "1.04", "utf8" : "1.09", "warnings" : "1.12" } } }, "provides" : { "Lingua::EN::NameParse" : { "file" : "lib/Lingua/EN/NameParse.pm", "version" : "1.33" }, "Lingua::EN::NameParse::Grammar" : { "file" : "lib/Lingua/EN/NameParse/Grammar.pm", "version" : "1.33" } }, "release_status" : "stable", "resources" : { "license" : [ "http://dev.perl.org/licenses/" ] }, "version" : "1.33" } Lingua-EN-NameParse-1.33/META.yml000444000000000000 153712515627074 17004 0ustar00unknownunknown000000000000--- abstract: 'Manipulate peoples names, titles and initials' author: - 'Kim Ryan ' build_requires: Test::More: '0.94' Test::Pod: '1.04' configure_requires: Module::Build: '0.38' dynamic_config: 1 generated_by: 'Module::Build version 0.421, CPAN::Meta::Converter version 2.142060' license: perl meta-spec: url: http://module-build.sourceforge.net/META-spec-v1.4.html version: '1.4' name: Lingua-EN-NameParse provides: Lingua::EN::NameParse: file: lib/Lingua/EN/NameParse.pm version: '1.33' Lingua::EN::NameParse::Grammar: file: lib/Lingua/EN/NameParse/Grammar.pm version: '1.33' requires: Parse::RecDescent: '1' Test::Pod::Coverage: '1.04' locale: '1' strict: '1.04' utf8: '1.09' warnings: '1.12' resources: license: http://dev.perl.org/licenses/ version: '1.33' Lingua-EN-NameParse-1.33/README000444000000000000 476612515625716 16423 0ustar00unknownunknown000000000000NAME Lingua::EN::NameParse - routines for manipulating a person's name SYNOPSIS use Lingua::EN::NameParse qw(clean case_surname); # optional configuration arguments my %args = ( auto_clean => 1, force_case => 1, lc_prefix => 1, initials => 3, allow_reversed => 1, joint_names => 0, extended_titles => 0 ); my $name = Lingua::EN::NameParse->new(%args); $error = $name->parse("MR AC DE SILVA"); %name_comps = $name->components; $surname = $name_comps{surname_1}; # De Silva $correct_casing = $name->case_all; # Mr AC de Silva $correct_casing = $name->case_all_reversed ; # de Silva, AC $good_name = &clean("Bad Na9me "); # "Bad Name" $salutation = $name->salutation(salutation => 'Dear',sal_default => 'Friend')); # Dear Mr de Silva %my_properties = $name->properties; $number_surnames = $my_properties{number}; # 1 $bad_input = $my_properties{non_matching}; $name->report; # create a report listing all information about the parsed name $lc_prefix = 0; $correct_case = &case_surname("DE SILVA-MACNAY",$lc_prefix); # De Silva-MacNay DESCRIPTION This module takes as input a person or persons name in free format text such as, Mr AB & M/s CD MacNay-Smith MR J.L. D'ANGELO Estate Of The Late Lieutenant Colonel AB Van Der Heiden and attempts to parse it. If successful, the name is broken down into components and useful functions can be performed like : - converting upper or lower case values to name case (Mr AB MacNay ) - creating a personalised greeting or salutation (Dear Mr MacNay ) - extracting the names individual components (Mr,AB,MacNay ) - determining the type of format the name is in (Mr_A_Smith ) If the name cannot be parsed you have the option of cleaning the name of bad characters, or extracting any portion that was parsed and the portion that failed. This module can be used for analysing and improving the quality of lists of names. HOW TO INSTALL perl Makefile.PL make make test make install or perl Build.PL build build test build install AUTHOR NameParse was written by Kim Ryan, kimryan at cpan dot org COPYRIGHT AND LICENSE Copyright (c) 2015 Kim Ryan. All rights reserved. This program is free software; you can redistribute it and/or modify it under the terms of the Perl Artistic License (see http://www.perl.com/perl/misc/Artistic.html). Lingua-EN-NameParse-1.33/surname_prefs.txt000444000000000000 120 7326663104 21066 0ustar00unknownunknown000000000000Machin Macquarie Machar Macrae Macris Mackay Mackey Macaulay Macauley Macmillan Lingua-EN-NameParse-1.33/examples000755000000000000 012515627074 17206 5ustar00unknownunknown000000000000Lingua-EN-NameParse-1.33/examples/demo.pl000444000000000000 507512502714634 20626 0ustar00unknownunknown000000000000#!/usr/local/bin/perl -w # Demo script for Lingua::EN::NameParse.pm use Lingua::EN::NameParse qw(clean case_surname); use strict; # Quick casing, no parsing or context check my $input = "FRENCH'S"; print("$input :",case_surname($input,1),"\n\n"); my %args = ( auto_clean => 1, lc_prefix => 0, initials => 3, allow_reversed => 1, joint_names => 1, extended_titles => 1, ); my $name = Lingua::EN::NameParse->new(%args); # Open files to contain errors, a report on data quality and # an extract of all single names open(ERROR_FH,">errors.txt"); open(REPORT_FH,">report.txt"); open(EXTRACT_FH,">extract.txt"); my ($num_names,$num_errors); # loop over all lines in dDATA block below while () { chomp($_); my $input = $_; $num_names++; my $error = $name->parse($input); my %comps = $name->components; my %props = $name->properties; my $bad_part = $props{non_matching}; if ($error) { $num_errors++; printf(ERROR_FH "%-40.40s %-40.40s\n",$input,$bad_part); } if ( $props{type} eq 'Mr_A_Smith' ) { # extract all single names with title and initials printf(EXTRACT_FH "%-40.40s %-20.20s %-3.3s %-20.20s\n", $input,$comps{title_1},$comps{initials_1},$comps{surname_1}); } my $whole_name = $name->case_all; my $salutation = $name->salutation; printf(REPORT_FH "%-40.40s %-40.40s %-40.40s\n",$input,$whole_name,$salutation); } printf("BATCH DATA QUALITY: %5.2f percent\n",( 1- ($num_errors / $num_names)) *100 ); close(EXTRACT_FH); close(ERROR_FH); close(REPORT_FH); #------------------------------------------------------------------------------ __DATA__ MR AB MACMURDO LIEUTENANT COLONEL DE DE SILVA MR AB AND M/S CD VAN DER HEIDEN-MACNAY MR AS & D.E. DE LA MARE ESTATE OF THE LATE AB LE FONTAIN MR. EUGENE ANDERSON I MR. EUGENE ANDERSON II MR. EUGENE ANDERSON III MR. EUGENE ANDERSON IV MR. EUGENE ANDERSON V MR. EUGENE ANDERSON VI MR. EUGENE ANDERSON VII MR. EUGENE ANDERSON VIII MR. EUGENE ANDERSON IX MR. EUGENE ANDERSON X MR. EUGENE ANDERSON XI MR. EUGENE ANDERSON XII MR. EUGENE ANDERSON XIII MR KA MACQUARIE JNR. REVERAND S.A. VON DER MERVIN SNR BIG BROTHER & THE HOLDING COMPANY RIGHT HONOURABLE MR PJ KEATING MR TOM JONES Robert James Hawke EDWARD G WHITLAM JAMES BROWN MR AS SMI9TH prof a.s.d. genius Coltrane, Mr. John Davis, Miles A. Smith, Mr AB De Silva, Professor A.B. Air Marshall William Dunn Major General William Dunn J.R.R. Tolkien James Graham, Marquess of Montrose Flight Officer John Gillespie Magee Sir Author Conan Doyle Major JA DunnLingua-EN-NameParse-1.33/lib000755000000000000 012515627074 16136 5ustar00unknownunknown000000000000Lingua-EN-NameParse-1.33/lib/Lingua000755000000000000 012515627074 17355 5ustar00unknownunknown000000000000Lingua-EN-NameParse-1.33/lib/Lingua/EN000755000000000000 012515627074 17657 5ustar00unknownunknown000000000000Lingua-EN-NameParse-1.33/lib/Lingua/EN/NameParse.pm000444000000000000 12003612515626565 22273 0ustar00unknownunknown000000000000=head1 NAME Lingua::EN::NameParse - routines for manipulating a person's name =head1 SYNOPSIS use Lingua::EN::NameParse qw(clean case_surname); # optional configuration arguments my %args = ( auto_clean => 1, lc_prefix => 1, initials => 3, allow_reversed => 1, joint_names => 0, extended_titles => 0 ); my $name = Lingua::EN::NameParse->new(%args); $error = $name->parse("MR AC DE SILVA"); unless ( $error ) { %name_comps = $name->components; $surname = $name_comps{surname_1}; # De Silva $correct_casing = $name->case_all; # Mr AC de Silva $correct_casing = $name->case_all_reversed ; # de Silva, AC $good_name = clean("Bad Na9me "); # "Bad Name" $salutation = $name->salutation(salutation => 'Dear',sal_default => 'Friend')); # Dear Mr de Silva %my_properties = $name->properties; $number_surnames = $my_properties{number}; # 1 } $name->report; # create a report listing all information about the parsed name $lc_prefix = 0; $correct_case = case_surname("DE SILVA-MACNAY",$lc_prefix); # De Silva-MacNay =head1 DESCRIPTION This module takes as input a person or persons name in free format text such as, Mr AB & M/s CD MacNay-Smith MR J.L. D'ANGELO Estate Of The Late Lieutenant Colonel AB Van Der Heiden and attempts to parse it. If successful, the name is broken down into components and useful functions can be performed such as : converting upper or lower case values to name case (Mr AB MacNay ) creating a personalised greeting or salutation (Dear Mr MacNay ) extracting the names individual components (Mr,AB,MacNay ) determining the type of format the name is in (Mr_A_Smith ) If the name cannot be parsed you have the option of cleaning the name of bad characters, or extracting any portion that was parsed and the portion that failed. This module can be used for analysing and improving the quality of lists of names. =head1 DEFINITIONS The following terms are used by NameParse to define the components that can make up a name. Precursor - Estate of (The Late), Right Honourable ... Title - Mr, Mrs, Ms., Sir, Dr, Major, Reverend ... Conjunction - word to separate names or initials, such as "And" Initials - 1-3 letters, each with an optional space and/or dot Surname - De Silva, Van Der Heiden, MacNay-Smith, O'Reilly ... Suffix - Snr., Jnr, III, V ... Refer to the component grammar defined within the code for a complete list of combinations. 'Name casing' refers to the correct use of upper and lower case letters in peoples names, such as Mr AB McNay. To describe the formats supported by NameParse, a short hand representation of the name is used. The following formats are currently supported : Mr_John_Smith_&_Ms_Mary_Jones Mr_A_Smith_&_Ms_B_Jones Mr_&Ms_A_&_B_Smith Mr_A_&_Ms_B_Smith Mr_&_Ms_A_Smith Mr_A_&_B_Smith John_Smith_&_Mary_Jones John_&_Mary_Smith A_Smith_&_B_Jones Mr_John_Adam_Smith Mr_John_A_Smith Mr_John_Smith Mr_A_Smith John_Adam_Smith John_A_Smith J_Adam_Smith John_Smith A_Smith John Precursors and suffixes may be applied to single names that include a surname =head1 METHODS =head2 new The C method creates an instance of a name object and sets up the grammar used to parse names. This must be called before any of the following methods are invoked. Note that the object only needs to be created ONCE, and should be reused with new input data. Calling C repeatedly will significantly slow your program down. Various setup options may be defined in a hash that is passed as an optional argument to the C method. Note that all the arguments are optional. You need to define the combination of arguments that are appropriate for your usage. my %args = ( auto_clean => 1, lc_prefix => 1, initials => 3, allow_reversed => 1 ); my $name = Lingua::EN::NameParse->new(%args); =over 4 =item auto_clean When this option is set to a positive value, any call to the C method that fails will attempt to 'clean' the name and then reparse it. See the C method for details. This is useful for dirty data with embedded unprintable or non alphabetic characters. =item lc_prefix When this option is set to a positive value, it will force the C and C methods to lower case the first letter of each word that occurs in the prefix portion of a surname. For example, Mr AB de Silva, or Ms AS von der Heiden. =item initials Allows the user to control the number of letters that can occur in the initials. Valid settings are 1,2 or 3. If no value is supplied a default of 2 is used. =item allow_reversed When this option is set to a positive value, names in reverse order will be processed. The only valid format is the surname followed by a comma and the rest of the name, which can be in any of the combinations allowed by non reversed names. Some examples are: Smith, Mr AB Jones, Jim De Silva, Professor A.B. The program changes the order of the name back to the non reversed format, and then performs the normal parsing. Note that if the name can be parsed, the fact that it's order was originally reversed, is not recorded as a property of the name object. =item joint_names When this option is set to a positive value, joint names are accounted for: Mr_A_Smith_&Ms_B_Jones Mr_&Ms_A_&B_Smith Mr_A_&Ms_B_Smith Mr_&Ms_A_Smith Mr_A_&B_Smith Note that if this option is not specified, than by default joint names are ignored. Disabling joint names speeds up the processing a lot. =item extended_titles When this option is set to a positive value, all combinations of titles, such as Colonel, Mother Superior are used. If this value is not set, only the following titles are accounted for: Mr Ms M/s Mrs Miss Dr Sir Dame Note that if this option is not specified, than by default extended titles are ignored. Disabling extended titles speeds up the processing. =back =head2 parse $error = $name->parse("MR AC DE SILVA"); The C method takes a single parameter of a text string containing a name. It attempts to parse the name and break it down into the components Returns an error flag, if the name was parsed successfully, it's value is 0, otherwise a 1. This step is a prerequisite for the following methods. =head2 case_all $correct_casing = $name->case_all; The C method converts the first letter of each component to capitals and the remainder to lower case, with the following exceptions- initials remain capitalised surname spelling such as MacNay-Smith, O'Brien and Van Der Heiden are preserved - see C for user defined exceptions A complete definition of the capitalising rules can be found by studying the case_surname function. The method returns the entire cased name as text. =head2 case_all_reversed $correct_casing = $name->case_all_reversed; The C method applies the same type of casing as C. However, the name is returned as surname followed by a comma and the rest of the name, which can be any of the combinations allowed for a name, except the title. Some examples are: "Smith, John", "De Silva, A.B." This is useful for sorting names alphabetically by surname. The method returns the entire reverse order cased name as text. =head2 components %my_name = $name->components; $cased_surname = $my_name{surname_1}; The C method does the same thing as the C method, but returns the name cased components in a hash. The following keys are used for each component: precursor title_1 title_2 given_name_1 given_name_2 initials_1 initials_2 middle_name conjunction_1 conjunction_2 surname_1 surname_2 suffix If a component has no matching data for a given name, it will not appear in the hash If the name could not be parsed, this method returns null. If you assign the return value to a hash, you should check the error status returned by the C method first. Ohterwise, you will get an odd number of values assigned to the hash. =head2 case_surname $correct_casing = case_surname("DE SILVA-MACNAY" [,$lc_prefix]); C is a stand alone function that does not require a name object. The input is a text string. An optional input argument controls the casing rules for prefix portions of a surname, as described above in the C section. The output is a string converted to the correct casing for surnames. See C for user defined exceptions This function is useful when you know you are only dealing with names that do not have initials like "Mr John Jones". It is much faster than the case_all method, but does not understand context, and cannot detect errors on strings that are not personal names. =head2 surname_prefs.txt Some surnames can have more than one form of valid capitalisation, such as MacQuarie or Macquarie. Where the user wants to specify one form as the default, a text file called surname_prefs.txt should be created and placed in the same location as the NameParse module. The text file should contain one surname per line, in the capitalised form you want, such as Macquarie MacHado NameParse will still operate if the file does not exist =head2 salutation $salutation = $name->salutation(salutation => 'Dear',sal_default => 'Friend',sal_type => 'given_name')); The C method converts a name into a personal greeting, such as "Dear Mr & Mrs O'Brien" or "Dear Sue and John" Optional parameters may be specided in a hash as follows: salutation: The greeting word such as 'Dear' or 'Greetings'. If not spefied than 'Dear' is used sal_default: The default word used when a personalised salution cannot be generated. If not specified, than 'Friend' is used. sal_type: Can be either 'given_name' such as 'Dear Sue' or 'title_plus_name' such as 'Dear Ms Smith' If not specified, than 'given_name' is used. If an error is detected during parsing, such as with the name "AB Smith & Associates", then the value of sal_default is used instead of a given name, or a title and surname. If the input string contains a conjunction, an 's' is added to the value of sal_default. If the name contains a precursor, a default salutation is produced. =head2 clean $good_name = clean("Bad Na9me"); C is a stand alone function that does not require a name object. The input is a text string and the output is the string with: all repeating spaces removed all characters not in the set (A-Z a-z - ' , . &) removed =head2 properties The C method returns all the properties of the name, non_matching, number and type, as a hash. =over 4 =item type The type of format a name is in, as one of the following strings: Mr_A_Smith_&Ms_B_Jones Mr_&Ms_A_&B_Smith Mr_A_&Ms_B_Smith Mr_&Ms_A_Smith Mr_A_&B_Smith Mr_John_Adam_Smith Mr_John_A_Smith Mr_John_Smith Mr_A_Smith John_Adam_Smith John_A_Smith J_Adam_Smith John_Smith A_Smith John unknown =item non_matching Returns any unmatched section that was found. =back =head2 report Create a formatted text report to standard output listing - the input string, - the name and value of each defined component - any non matching component =head1 LIMITATIONS The huge number of character combinations that can form a valid names makes it is impossible to correctly identify them all. Firstly, there are many ambiguities, which have no right answer. Macbeth or MacBeth, are both valid spellings Is ED WOOD E.D. Wood or Edward Wood Is 'Mr Rapid Print' a name or a company Does John Bradfield Smith have a middle name of Bradfield, or a surname of Bradfield-Smith? One approach is to have large lookup files of names and words, statistical rules and fuzzy logic to attempt to derive context. This approach gives high levels of accuracy but uses a lot of your computers time and resources. NameParse takes the approach of using a limited set of rules, based on the formats that are commonly used by business to represent peoples names. This gives us fairly high accuracy, with acceptable speed and program size. NameParse will accept names from many countries, like Van Der Heiden, De La Mare and Le Fontain. Having said that, it is still biased toward English, because the precursors, titles and conjunctions are based on English usage. Names with two or more words, but no separating hyphen are not recognized. This is a real quandary as Indian, Chinese and other names can have several components. If these are allowed for, any component after the surname will also be picked up. For example in "Mr AB Jones Trading As Jones Pty Ltd" will return a surname of "Jones Trading". Because of the large combination of possible names defined in the grammar, the program is not very fast, except for the more limited C subroutine. See the "Future Directions" section for possible speed ups. As the parser has a very limited understanding of context, the "John_Adam_Smith" name type is most likely to cause problems, as it contains no known tokens like a title. A string such as "National Australia Bank" would be accepted as a valid name, first name National etc. Supplying a list of common pronouns as exceptions could solve this problem. =head1 REFERENCES "The Wordsworth Dictionary of Abbreviations & Acronyms" (1997) Australian Standard AS4212-1994 "Geographic Information Systems - Data Dictionary for transfer of street addressing information" =head1 FUTURE DIRECTIONS Define grammar for other languages. Hopefully, all that would be needed is to specify a new module with its own grammar, and inherit all the existing methods. I don't have the knowledge of the naming conventions for non-english languages. =head1 SEE ALSO L, L, L, L, L =head1 BUGS Names with accented characters (acute, circumfelx etc) will not be parsed correctly. A work around is to replace the character class [a-z] with \w in the appropriate rules in the grammar tree, but this could lower the accuracy of names based purely on ASCII text. =head1 CREDITS Thanks to all the people who provided ideas and suggestions, including - Damian Conway, author of Parse::RecDescent Mark Summerfield author of Lingua::EN::NameCase, Ron Savage, Alastair Adam Huffman, Douglas Wilson Peter Schendzielorz =head1 AUTHOR NameParse was written by Kim Ryan =head1 COPYRIGHT AND LICENSE Copyright (c) 2015 Kim Ryan. All rights reserved. This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself. =cut #------------------------------------------------------------------------------- package Lingua::EN::NameParse; use strict; use warnings; use Lingua::EN::NameParse::Grammar; use Parse::RecDescent; use Exporter; use vars qw (@ISA @EXPORT_OK); our $VERSION = '1.33'; @ISA = qw(Exporter); @EXPORT_OK = qw(clean case_surname); #------------------------------------------------------------------------------- # Create a new instance of a name parsing object. This step is time consuming # and should normally only be called once in your program. sub new { my $class = shift; my %args = @_; my $name = {}; bless($name,$class); # Default to 2 initials per name. Can be overwritten if user defines # 'initials' as a key in the hash supplied to new method. $name->{initials} = 2; my $current_key; foreach my $current_key (keys %args) { $name->{$current_key} = $args{$current_key}; } my $grammar = Lingua::EN::NameParse::Grammar::_create($name); $name->{parse} = new Parse::RecDescent($grammar); return ($name); } #------------------------------------------------------------------------------- # Attempt to parse a string and retrieve it's components and properties # Requires a name object to have been created with the 'new' method' # Returns: an error code, 0 for success, 1 for failure sub parse { my $name = shift; my ($input_string) = @_; chomp($input_string); # If reverse ordered names are allowed, swap the surname component, before # the comma, with the rest of the name. Rejoin the name, replacing comma # with a space. if ( $name->{allow_reversed} and $input_string =~ /,/ ) { my ($first,$second) = split(/,/,$input_string); $input_string = join(' ',$second,$first); } $name->{comps} = (); $name->{properties} = (); $name->{properties}{type} = 'unknown'; $name->{error} = 0; $name->{error_desc} = ''; $name->{warning} = 0; $name->{warning_desc} = ''; $name->{original_input} = $input_string; $name->{input_string} = $input_string; $name = _pre_parse($name); unless ( $name->{error} ) { if ( $name->{auto_clean} ) { $name->{input_string} = clean($name->{input_string}); } $name = _assemble($name); _validate($name); } return($name->{error}); } #------------------------------------------------------------------------------- # Clean the input string. Can be called as a stand alone function. sub clean { my ($input_string) = @_; # remove illegal characters $input_string =~ s/[^A-Za-z\-\'\.&\/ ]//go; # remove repeating spaces $input_string =~ s/ +/ /go ; # remove any remaining leading or trailing space $input_string =~ s/^ //; return($input_string); } #------------------------------------------------------------------------------- # Given a name object, apply correct capitalisation to each component of a person's name. # Return all cased components in a hash. # Else return no value. sub components { my $name = shift; if ( $name->{properties}{type} eq 'unknown' ) { return; } else { my %orig_components = %{ $name->{comps} }; my ($current_key,%cased_components); foreach $current_key ( keys %orig_components ) { my $cased_value; if ( $current_key =~ /initials/ ) # initials_1, possibly initials_2 { $cased_value = uc($orig_components{$current_key}); } elsif ( $current_key =~ /surname|suffix/ ) { $cased_value = case_surname($orig_components{$current_key},$name->{lc_prefix}); } elsif ( $current_key eq 'type') { $cased_value = $orig_components{$current_key}; } else { $cased_value = _case_word($orig_components{$current_key}); } $cased_components{$current_key} = $cased_value; } return(%cased_components); } } #------------------------------------------------------------------------------- # Hash of of lists, indicating the order that name components are assembled in. # Each list element is itself the name of the key value in a name object. # Used by the case_all and case_all_reversed methods. # These hashes are created here globally, as quite a large overhead is # imposed if the are created locally, each time the method is invoked my %component_order= ( 'Mr_John_Smith_&_Ms_Mary_Jones' => ['title_1','given_name_1','surname_1','conjunction_1','title_2','given_name_2','surname_2'], 'Mr_A_Smith_&_Ms_B_Jones' => ['title_1','initials_1','surname_1','conjunction_1','title_2','initials_2','surname_2'], 'Mr_&_Ms_A_&_B_Smith' => ['title_1','conjunction_1','title_2','initials_1','conjunction_2','initials_2','surname_1'], 'Mr_A_&_Ms_B_Smith' => ['title_1','initials_1','conjunction_1','title_2','initials_2','surname_1'], 'Mr_&_Ms_A_Smith' => ['title_1','conjunction_1','title_2','initials_1','surname_1'], 'Mr_A_&_B_Smith' => ['title_1','initials_1','conjunction_1','initials_2','surname_1'], 'John_Smith_&Mary_Jones' => ['given_name_1','surname_1','conjunction_1','given_name_2','surname_2'], 'John_&_Mary_Smith' => ['given_name_1','conjunction_1','given_name_2','surname_1'], 'A_Smith_&_B_Jones' => ['initials_1','surname_1','conjunction_1','initials_2','surname_2'], 'Mr_John_Adam_Smith' => ['precursor','title_1','given_name_1','middle_name','surname_1','suffix'], 'Mr_John_A_Smith' => ['precursor','title_1','given_name_1','initials_1','surname_1','suffix'], 'Mr_John_Smith' => ['precursor','title_1','given_name_1','surname_1','suffix'], 'Mr_A_Smith' => ['precursor','title_1','initials_1','surname_1','suffix'], 'John_Adam_Smith' => ['precursor','given_name_1','middle_name','surname_1','suffix'], 'John_A_Smith' => ['precursor','given_name_1','initials_1','surname_1','suffix'], 'J_Adam_Smith' => ['precursor','initials_1','middle_name','surname_1','suffix'], 'John_Smith' => ['precursor','given_name_1','surname_1','suffix'], 'A_Smith' => ['precursor','initials_1','surname_1','suffix'], 'John' => ['given_name_1'] ); # only include names with a single surname my %reverse_component_order= ( 'Mr_&_Ms_A_&_B_Smith' => ['surname_1','title_1','conjunction_1','title_2','initials_1','conjunction_1','initials_2'], 'Mr_A_&_Ms_B_Smith' => ['surname_1','title_1','initials_1','conjunction_1','title_2','initials_2'], 'Mr_&_Ms_A_Smith' => ['surname_1','title_1','title_1','conjunction_1','title_2','initials_1'], 'Mr_A_&_B_Smith' => ['surname_1','title_1','initials_1','conjunction_1','initials_2'], 'John_&_Mary_Smith' => ['surname_1','given_name_1','conjunction_1','given_name_2'], 'Mr_John_Adam_Smith' => ['surname_1','title_1','given_name_1','middle_name','suffix'], 'Mr_John_A_Smith' => ['surname_1','title_1','given_name_1','initials_1','suffix'], 'Mr_John_Smith' => ['surname_1','title_1','given_name_1','suffix'], 'Mr_A_Smith' => ['surname_1','title_1','initials_1','suffix'], 'John_Adam_Smith' => ['surname_1','given_name_1','middle_name','suffix'], 'John_A_Smith' => ['surname_1','given_name_1','initials_1','suffix'], 'J_Adam_Smith' => ['surname_1','initials_1','middle_name','suffix'], 'John_Smith' => ['surname_1','given_name_1','suffix'], 'A_Smith' => ['surname_1','initials_1','suffix'] ); #------------------------------------------------------------------------------- # Apply correct capitalisation to a person's entire name # If the name type is unknown, return undef # Else, return a string of all cased components in correct order sub case_all { my $name = shift; my @cased_name; if ( $name->{properties}{type} eq 'unknown' ) { return undef; } unless ( $component_order{$name->{properties}{type}} ) { # component order missing in array defined above warn "Component order not defined for: $name->{properties}{type}"; return; } my %component_vals = $name->components; my @order = @{ $component_order{$name->{properties}{type}} }; foreach my $component_key ( @order ) { # As some components such as precursors are optional, they will appear # in the order array but may or may not have have a value, so only # process defined values if ( $component_vals{$component_key} ) { push(@cased_name,$component_vals{$component_key}); } } if ( $name->{comps}{non_matching} ) { # Despite errors, try to name case non-matching section. As the format # of this section is unknown, surname case will provide the best # approximation, but still fail on initials of more than 1 letter push(@cased_name,case_surname($name->{comps}{non_matching},$name->{lc_prefix})); } return(join(' ',@cased_name)); } #------------------------------------------------------------------------------- =head1 case_all_reversed Apply correct capitalisation to a person's entire name and reverse the order so that surname is first, followed by the other components, such as: Smith, Mr John A Useful for creating a list of names that can be sorted by surname. If name type is unknown , returns null If the name type has a joint name, such as 'Mr_A_Smith_Ms_B_Jones', return null, as it is ambiguous which surname to place at the start of the string Else, returns a string of all cased components in correct reversed order =cut sub case_all_reversed { my $name = shift; my @cased_name_reversed; unless ( $name->{properties}{type} eq 'unknown' ) { unless ( $reverse_component_order{$name->{properties}{type} } ) { # this type of name should not be reversed, such as two surnames return; } my %component_vals = $name->components; my @reverse_order = @{ $reverse_component_order{$name->{properties}{type} } }; foreach my $component_key ( @reverse_order ) { # As some components such as precursors are optional, they will appear # in the order array but may or may not have have a value, so only # process defined values my $component_value = $component_vals{$component_key}; if ( $component_value ) { if ($component_key eq 'surname_1') { $component_value .= ','; } push(@cased_name_reversed,$component_value); } } } return(join(' ',@cased_name_reversed)); } #------------------------------------------------------------------------------- # The user may specify their own preferred spelling for surnames. # These should be placed in a text file called surname_prefs.txt # in the same location as the module itself. BEGIN { # Obtain the full path to NameParse module, defined in the %INC hash. my $prefs_file_location = $INC{"Lingua/EN/NameParse.pm"}; # Now substitute the name of the preferences file $prefs_file_location =~ s/NameParse\.pm$/surname_prefs.txt/; if ( open(PREFERENCES_FH,"<$prefs_file_location") ) { my @surnames = ; foreach my $name ( @surnames ) { chomp($name); # Build hash, lower case name is key for case insensitive # comparison, while value holds the actual capitalisation $Lingua::EN::surname_preferences{lc($name)} = $name; } close(PREFERENCES_FH); } } #------------------------------------------------------------------------------- # Apply correct capitalisation to a person's surname. Can be called as a # stand alone function. sub case_surname { my ($surname,$lc_prefix) = @_; unless ($surname) { return(''); } # If the user has specified a preferred capitalisation for this # surname in the surname_prefs.txt, it should be returned now. if ($Lingua::EN::surname_preferences{lc($surname)} ) { return($Lingua::EN::surname_preferences{lc($surname)}); } # Lowercase everything $surname = lc($surname); # Now uppercase first letter of every word. By checking on word boundaries, # we will account for apostrophes (D'Angelo) and hyphenated names $surname =~ s/\b(\w)/\u$1/g; # Name case Macs and Mcs # Exclude names with 1-2 letters after prefix like Mack, Macky, Mace # Exclude names ending in a,c,i,o,z or j, typically Polish or Italian if ( $surname =~ /\bMac[a-z]{2,}[^a|c|i|o|z|j]\b/i ) { $surname =~ s/\b(Mac)([a-z]+)/$1\u$2/ig; # Now correct for "Mac" exceptions $surname =~ s/MacHin/Machin/; $surname =~ s/MacHlin/Machlin/; $surname =~ s/MacHar/Machar/; $surname =~ s/MacKle/Mackle/; $surname =~ s/MacKlin/Macklin/; $surname =~ s/MacKie/Mackie/; # Portuguese $surname =~ s/MacHado/Machado/; # Lithuanian $surname =~ s/MacEvicius/Macevicius/; $surname =~ s/MacIulis/Maciulis/; $surname =~ s/MacIas/Macias/; } elsif ( $surname =~ /\bMc/i ) { $surname =~ s/\b(Mc)([a-z]+)/$1\u$2/ig; } # Exceptions (only 'Mac' name ending in 'o' ?) $surname =~ s/Macmurdo/MacMurdo/; if ( $lc_prefix ) { # Lowercase first letter of every word in prefix. The trailing space # prevents the surname from being altered. Note that spellings like # d'Angelo are not accounted for. $surname =~ s/\b(\w+ )/\l$1/g; } # Correct for possessives such as "John's" or "Australia's". Although this # should not occur in a person's name, they are valid for proper names. # As this subroutine may be used to capitalise words other than names, # we may need to account for this case. Note that the 's' must be at the # end of the string $surname =~ s/(\w+)'S(\s+)/$1's$2/; $surname =~ s/(\w+)'S$/$1's/; # Correct for roman numerals, excluding single letter cases I,V and X, # which will work with the above code $surname =~ s/\b(I{2,3})\b/\U$1/i; # 2nd, 3rd $surname =~ s/\b(IV)\b/\U$1/i; # 4th $surname =~ s/\b(VI{1,3})\b/\U$1/i; # 6th, 7th, 8th $surname =~ s/\b(IX)\b/\U$1/i; # 9th $surname =~ s/\b(XI{1,3})\b/\U$1/i; # 11th, 12th, 13th return($surname); } #------------------------------------------------------------------------------- # Create a personalised greeting from one or two person's names # Returns the salutation as a string, such as "Dear Mr Smith", or "Dear Sue" sub salutation { my $name = shift; my %args = @_; my $salutation = 'Dear'; my $sal_default = 'Friend'; my $sal_type = 'title_plus_surname'; # Check to see if we should override defualts with any user specified preferences if ( %args ) { foreach my $current_key (keys %args) { $current_key eq 'salutation' and $salutation = $args{$current_key}; $current_key eq 'sal_default' and $sal_default = $args{$current_key}; $current_key eq 'sal_type' and $sal_type = $args{$current_key}; } } my @greeting; push(@greeting,$salutation); # Personalised salutations cannot be created for Estates or people # without some title if ( $name->{error} or ( $name->{comps}{precursor} and $name->{comps}{precursor} =~ /ESTATE/) ) { # Despite an error, the presence of a conjunction probably # means we are dealing with 2 or more people. # For example Mr AB Smith & John Jones if ( $name->{input_string} =~ / (AND|&) / ) { $sal_default .= 's'; } push(@greeting,$sal_default); } else { my %component_vals = $name->components; if ( $sal_type eq 'given_name') { if ( $component_vals{'given_name_1'} ) { push(@greeting,$component_vals{'given_name_1'}); if ( $component_vals{'given_name_2'} ) { push(@greeting,$component_vals{'conjunction_1'}); push(@greeting,$component_vals{'given_name_2'}); } } else { # No given name such as 'A_Smith','J_Adam_Smith','Mr_A_Smith' # Must use default push(@greeting,$sal_default); } } elsif ( $sal_type eq 'title_plus_surname' ) { if ( $name->{properties}{number} == 1 ) { if ( $component_vals{'title_1'} ) { push(@greeting,$component_vals{'title_1'}); push(@greeting,$component_vals{'surname_1'}); } else { # No title such as 'A_Smith','J_Adam_Smith', so must use default push(@greeting,$sal_default); } } elsif ( $name->{properties}{number} == 2 ) { # a joint name my $type = $name->{properties}{type}; if ( $type eq 'Mr_&Ms_A_Smith' or $type eq 'Mr_A_&Ms_B_Smith' or $type eq 'Mr_&Ms_A_&B_Smith' ) { # common surname push(@greeting,$component_vals{'title_1'}); push(@greeting,$component_vals{'conjunction_1'}); push(@greeting,$component_vals{'title_2'}); push(@greeting,$component_vals{'surname_1'}); } elsif ( $type eq 'Mr_A_Smith_&Ms_B_Jones' or $type eq 'Mr_John_Smith_&Ms_Mary_Jones' ) { push(@greeting,$component_vals{'title_1'}); push(@greeting,$component_vals{'surname_1'}); push(@greeting,$component_vals{'conjunction_1'}); push(@greeting,$component_vals{'title_2'}); push(@greeting,$component_vals{'surname_2'}); } else { # No title such as A_Smith_&B_Jones', 'John_Smith_&Mary_Jones' # Must use default push(@greeting,$sal_default); } } } else { warn "Invalid sal_type : ", $sal_type; push(@greeting,$sal_default); } } return(join(' ',@greeting)); } #------------------------------------------------------------------------------- # Return all name properties as a hash sub properties { my $name = shift; return(%{ $name->{properties} }); } #------------------------------------------------------------------------------- # Create a text report to standard output listing # - the input string, # - the name of each defined component, if it exists # - any non matching component sub report { my $name = shift; my %props = $name->properties; my $fmt = "%-20.20s : %s\n"; printf($fmt,"Original Input",$name->{original_input}); printf($fmt,"Cleaned Input",$name->{input_string}); printf($fmt,"Case all",$name->case_all); printf($fmt,"Case all reversed",$name->case_all_reversed); printf($fmt,"Salutation",$name->salutation(salutation => 'Dear',sal_default => 'Friend', sal_type => 'title_plus_surname')); printf($fmt,"Type", $props{type}); printf($fmt,"Parsing Error", $name->{error}); printf($fmt,"Error description : ", $name->{error_desc}); printf($fmt,"Parsing Warning", $name->{warning}); printf($fmt,"Warning description", $name->{warning_desc}); unless ($props{type} eq 'unknown') { my %comps = $name->components; if ( %comps ) { print("\nCOMPONENTS\n"); foreach my $value ( sort keys %comps) { if ($value and $comps{$value}) { printf($fmt,$value,$comps{$value}); } } } } } #------------------------------------------------------------------------------- # PRIVATE METHODS #------------------------------------------------------------------------------- sub _pre_parse { my $name = shift; # strip all full stops $name->{input_string} =~ s/\.//g; # Fold all text to upper case, as these are used in all regular expressions withun thr grammar tree $name->{input_string} = uc($name->{input_string}); # Check that common reserved word (as found in company names) do not appear if ( $name->{input_string} =~ /\BPTY LTD$|\BLTD$|\BPLC$|ASSOCIATION|DEPARTMENT|NATIONAL|SOCIETY/ ) { $name->{error} = 1; $name->{comps}{non_matching} = $name->{input_string}; $name->{error_desc} = 'Reserved words found in name'; } # For the case of a single name such as 'Voltaire' we need to add a trailing space # to the input string. This is because the grammar tree expects a terminator (the space) # optionally followed by other productions or non matching text $name->{input_string} .= ' '; if ( $name->{input_string} =~ /^[A-Z]{2,}(\-)?[A-Z]{0,}$/ ) { $name->{input_string} .= ' '; } return($name); } #------------------------------------------------------------------------------- # Initialise all components to empty string. Assemble hashes of components # and properties as part of the name object # sub _assemble { my $name = shift; # Use Parse::RecDescent to do the parsing. 'full_name' is a label for the complete grammar tree # defined in Lingua::EN::NameParse::Grammar my $parsed_name = $name->{parse}->full_name($name->{input_string}); # Place components into a separate hash, so they can be easily returned # for the user to inspect and modify. my @all_comps = qw(precursor title_1 given_name_1 initials_1 middle_name surname_1 conjunction_1 title_2 given_name_2 initials_2 surname_2 conjunction_2 suffix non_matching); foreach my $comp (@all_comps) { # set all components to empty string, as any of them could be accessed, even if they don't exist $name->{comps}{$comp} = ''; if (defined($parsed_name->{$comp})) { # Copy over existing componets. $name->{comps}{$comp} = _trim_space($parsed_name->{$comp}); } } $name->{properties}{number} = 0; $name->{properties}{number} = $parsed_name->{number}; $name->{properties}{type} = $parsed_name->{type}; return($name); } #------------------------------------------------------------------------------- # For correct matching, the grammar of each component must include the trailing space that separates it # from any following word. This should now be removed from the components, and will be restored by the # case_all and salutation methods, if called. sub _trim_space { my ($string) = @_; if ($string) { $string =~ s/ $//; } return($string); } #------------------------------------------------------------------------------- # Check if any name components have illegal characters, or do not have the # correct syntax for a valid name. sub _validate { my $name = shift; my %comps = $name->components; if ( $comps{non_matching} ) { $name->{warning} = 1; $name->{warning_desc} .= ";non_matching text found : $comps{non_matching}"; } elsif ( $name->{input_string} =~ /[^A-Za-z\-\'\.,&\/ ]/ ) { # illegal characters found $name->{error} = 1; $name->{error_desc} = 'illegal characters found'; } if ( not _valid_name($comps{given_name_1}) ) { $name->{warning} = 1; $name->{warning_desc} .= ";no vowel sound in given_name_1 : $comps{given_name_1}"; } elsif ( not _valid_name($comps{middle_name}) ) { $name->{warning} = 1; $name->{warning_desc} .= ";no vowel sound in middle_name : $comps{middle_name}"; } elsif ( not _valid_name($comps{surname_1}) ) { $name->{warning} = 1; $name->{warning_desc} .= ";no vowel sound in surname_1 : $comps{surname_1}"; } elsif ( not _valid_name($comps{surname_2}) ) { $name->{warning} = 1; $name->{warning_desc} .= ";no vowel sound in surname_2 : $comps{surname_2}"; } } #------------------------------------------------------------------------------- # If the name has an assigned value, check that it contains a vowel sound, # or matches the exceptions to this rule. # Returns 1 if name is valid, otherwise 0 sub _valid_name { my ($name) = @_; if ( not $name ) { return(1); } # Names should have a vowel sound, # valid exceptions are Ng, Tsz,Md, Cng,Hng,Chng etc elsif ( $name and $name =~ /[AEIOUYJ]|^(NG|TSZ|MD|(C?H|[PTS])NG)$/ ) { return(1); } else { return(0); } } #------------------------------------------------------------------------------- # Upper case first letter, lower case the rest, for all words in string sub _case_word { my ($word) = @_; if ($word) { $word =~ s/(\w+)/\u\L$1/g; } return($word); } #------------------------------------------------------------------------------- return(1); Lingua-EN-NameParse-1.33/lib/Lingua/EN/NameParse000755000000000000 012515627074 21532 5ustar00unknownunknown000000000000Lingua-EN-NameParse-1.33/lib/Lingua/EN/NameParse/Grammar.pm000444000000000000 4034212515626325 23634 0ustar00unknownunknown000000000000=head1 NAME Lingua::EN::NameGrammar - grammar tree for Lingua::EN::NameParse =head1 SYNOPSIS Internal functions called from NameParse.pm module =head1 DESCRIPTION Grammar tree of personal name syntax for module. The grammar defined here is for use with the Parse::RecDescent module. Note that parsing is done depth first, meaning match the shortest string first. To avoid premature matches, when one rule is a sub set of another longer rule, it must appear after the longer rule. See the Parse::RecDescent documentation for more details. =head1 AUTHOR NameParse::Grammar was written by Kim Ryan . =head1 COPYRIGHT AND LICENSE Copyright (c) 2015 Kim Ryan. All rights reserved. This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself. =cut #------------------------------------------------------------------------------ package Lingua::EN::NameParse::Grammar; use strict; use warnings; our $VERSION = '1.33'; # Rules that define valid orderings of a names components my $rules_start = q{ full_name : }; my $rules_joint_names = q{ # A (?) refers to an optional component, occurring 0 or more times. # Optional items are returned as an array, which for our case will # always consist of one element, when they exist. title given_name surname conjunction title given_name surname non_matching(?) { # block of code to define actions upon successful completion of a # 'production' or rule # Two separate people $return = { # Parse::RecDescent lets you return a single scalar, which we use as # an anonymous hash reference title_1 => $item[1], given_name_1 => $item[2], surname_1 => $item[3], conjunction_1 => $item[4], title_2 => $item[5], given_name_2 => $item[6], surname_2 => $item[7], non_matching => $item[8][0], number => 2, type => 'Mr_John_Smith_&_Ms_Mary_Jones' } } | title initials surname conjunction title initials surname non_matching(?) { $return = { title_1 => $item[1], initials_1 => $item[2], surname_1 => $item[3], conjunction_1 => $item[4], title_2 => $item[5], initials_2 => $item[6], surname_2 => $item[7], non_matching => $item[8][0], number => 2, type => 'Mr_A_Smith_&_Ms_B_Jones' } } | title initials conjunction title initials surname non_matching(?) { # Two related people, own initials, shared surname $return = { title_1 => $item[1], initials_1 => $item[2], conjunction_1 => $item[3], title_2 => $item[4], initials_2 => $item[5], surname_1 => $item[6], non_matching => $item[7][0], number => 2, type => 'Mr_A_&_Ms_B_Smith' } } | title initials conjunction initials surname non_matching(?) { # Two related people, shared title, separate initials, # shared surname. Example, father and son, sisters $return = { title_1 => $item[1], initials_1 => $item[2], conjunction_1 => $item[3], initials_2 => $item[4], surname_1 => $item[5], non_matching => $item[6][0], number => 2, type => 'Mr_A_&_B_Smith' } } | title conjunction title initials conjunction initials surname non_matching(?) { # Two related people, own initials, shared surname $return = { title_1 => $item[1], conjunction_1 => $item[2], title_2 => $item[3], initials_1 => $item[4], conjunction_2 => $item[5], initials_2 => $item[6], surname_1 => $item[7], non_matching => $item[8][0], number => 2, type => 'Mr_&_Ms_A_&_B_Smith' } } | title conjunction title initials surname non_matching(?) { # Two related people, shared initials, shared surname $return = { title_1 => $item[1], conjunction_1 => $item[2], title_2 => $item[3], initials_1 => $item[4], surname_1 => $item[5], non_matching => $item[6][0], number => 2, type => 'Mr_&_Ms_A_Smith' } } | given_name surname conjunction given_name surname non_matching(?) { $return = { given_name_1 => $item[1], surname_1 => $item[2], conjunction_1 => $item[3], given_name_2 => $item[4], surname_2 => $item[5], non_matching => $item[6][0], number => 2, type => 'John_Smith_&_Mary_Jones' } } | initials surname conjunction initials surname non_matching(?) { $return = { initials_1 => $item[1], surname_1 => $item[2], conjunction_1 => $item[3], initials_2 => $item[4], surname_2 => $item[5], non_matching => $item[6][0], number => 2, type => 'A_Smith_&_B_Jones' } } | given_name conjunction given_name surname non_matching(?) { $return = { given_name_1 => $item[1], conjunction_1 => $item[2], given_name_2 => $item[3], surname_2 => $item[4], non_matching => $item[5][0], number => 2, type => 'John_&_Mary_Smith' } } | }; my $rules_single_names = q{ precursor(?) title given_name_standard middle_name surname suffix(?) non_matching(?) { $return = { precursor => $item[1][0], title_1 => $item[2], given_name_1 => $item[3], middle_name => $item[4], surname_1 => $item[5], suffix => $item[6][0], non_matching => $item[7][0], number => 1, type => 'Mr_John_Adam_Smith' } } | precursor(?) title given_name_standard single_initial surname suffix(?) non_matching(?) { $return = { precursor => $item[1][0], title_1 => $item[2], given_name_1 => $item[3], initials_1 => $item[4], surname_1 => $item[5], suffix => $item[6][0], non_matching => $item[7][0], number => 1, type => 'Mr_John_A_Smith' } } | precursor(?) title given_name surname suffix(?) non_matching(?) { $return = { precursor => $item[1][0], title_1 => $item[2], given_name_1 => $item[3], surname_1 => $item[4], suffix => $item[5][0], non_matching => $item[6][0], number => 1, type => 'Mr_John_Smith' } } | precursor(?) title initials surname suffix(?) non_matching(?) { $return = { precursor => $item[1][0], title_1 => $item[2], initials_1 => $item[3], surname_1 => $item[4], suffix => $item[5][0], non_matching => $item[6][0], number => 1, type => 'Mr_A_Smith' } } | precursor(?) given_name_standard middle_name surname suffix(?) non_matching(?) { $return = { precursor => $item[1][0], given_name_1 => $item[2], middle_name => $item[3], surname_1 => $item[4], suffix => $item[5][0], non_matching => $item[6][0], number => 1, type => 'John_Adam_Smith' } } | precursor(?) given_name_standard single_initial surname suffix(?) non_matching(?) { $return = { precursor => $item[1][0], given_name_1 => $item[2], initials_1 => $item[3], surname_1 => $item[4], suffix => $item[5][0], non_matching => $item[6][0], number => 1, type => 'John_A_Smith' } } | precursor(?) single_initial middle_name surname suffix(?) non_matching(?) { $return = { precursor => $item[1][0], initials_1 => $item[2], middle_name => $item[3], surname_1 => $item[4], suffix => $item[5][0], non_matching => $item[6][0], number => 1, type => 'J_Adam_Smith' } } | precursor(?) given_name surname suffix(?) non_matching(?) { $return = { precursor => $item[1][0], given_name_1 => $item[2], surname_1 => $item[3], suffix => $item[4][0], non_matching => $item[5][0], number => 1, type => 'John_Smith' } } | precursor(?) initials surname suffix(?) non_matching(?) { $return = { precursor => $item[1][0], initials_1 => $item[2], surname_1 => $item[3], suffix => $item[4][0], non_matching => $item[5][0], number => 1, type => 'A_Smith' } } | given_name_standard non_matching(?) { $return = { given_name_1 => $item[1], non_matching => $item[2][0], number => 1, type => 'John' } } | non_matching(?) { $return = { non_matching => $item[1][0], number => 0, type => 'unknown' } } }; #------------------------------------------------------------------------------ # Individual components that a name can be composed from. Components are # expressed as literals or Perl regular expressions. my $titles = q{ # Place most frequent titles frist to speed matching title : /(MR|MS|M\/S|MRS|MISS|DR) / }; my $extended_titles = q{ | /( SIR| MESSRS| # Plural or Mr MADAME?| MME| # Madame MISTER| MASTER| MAST| MS?GR| # Monsignor COUNT| COUNTESS| DUKE| DUCHESS| LORD| LADY| MARQUESS| # Medical DOCTOR|SISTER|MATRON| # Legal JUDGE|JUSTICE| # Police DET|INSP|CONST| # Military BRIGDIER|BRIG| CAPTAIN|CAPT| COLONEL|COL| COMMANDER IN CHIEF|COMMANDER| COMMODORE| CDR| # Commander, Commodore FIELD\ MARSHALL| FLIGHT\ OFFICER| FL OFF| FLIGHT\ LIEUTENANT|FLT LT| PILOT\ OFFICER| GENERAL\ OF\ THE\ ARMY|GENERAL|GEN| PTE|PRIVATE| SGT|SARGENT| AIR\ COMMANDER| AIR\ COMMODORE| AIR\ MARSHALL| LIEUTENANT\ COLONEL|LT\ COL| LT\ GEN| LT\ CDR| LIEUTENANT|LT|LEUT|LIEUT| MAJOR GENERAL|MAJ GEN| MAJOR|MAJ| # Religious RABBI| BISHOP| BROTHER| CHAPLAIN| FATHER| PASTOR| MOTHER\ SUPERIOR|MOTHER| MOST\ REVER[E|A]ND| MUFTI| VERY\ REVER[E|A]ND| REVER[E|A]ND| MT\ REVD|V\ REVD|REVD| # Other PROFESSOR| PROF| ALDERMAN|ALD )\ /x }; my $common = q{ precursor : /( ESTATE\ OF\ THE\ LATE| ESTATE\ OF| HIS\ EXCELLENCY| HIS\ HONOU?R| HER\ EXCELLENCY| HER\ HONOU?R| THE\ RIGHT HONOU?RABLE| THE\ HONOU?RABLE| RIGHT\ HONOU?RABLE| THE\ RT\ HON| THE\ HON| RT\ HON )\ /x conjunction : /AND |& / # Used in the John_A_Smith and J_Adam_Smith name types, as well as when intials are set to 1 single_initial: /[A-Z] / # Examples are Jo-Anne, D'Artagnan, O'Shaugnessy La'Keishia, T-Bone split_given_name : /[A-Z]{1,}['|-][A-Z]{2,} / constonant: /[A-DF-HJ-NP-TV-Z]]/ # For use with John_Adam_Smith and John_A_Smith name types given_name_standard: /[A-Z]{3,} / | /[AEIOU]/ constonant / / | constonant /[AEIOUY] / | split_given_name # Patronymic, place name and other surname prefixes prefix: /( [A|E]L| # ARABIC, GREEK, AP| # WELSH BEN| # HEBREW DELLA|DELLE|DALLE| # ITALIAN DELA| DELL?| DE\ LA| DE\ LOS| DE| D[A|I|U]| L[A|E|O]| ST| # ABBREVIATION FOR SAINT SAN| # SPANISH # DUTCH DEN| VON\ DER| VON| VAN\ DE[N|R]| VAN )\ /x | /[D|L|O]'/ # ITALIAN, IRISH OR FRENCH, abbreviation for 'the', 'of' etc | /D[A|E]LL'/ middle_name: # Dont grab surname prefix too early. For example, John Van Dam could be # interpreted as middle name of Van and Surname of Dam. So exclude prefixs # from middle names ...!prefix given_name { $return = $item[2]; } # Use look-ahead to avoid ambiguity between surname and suffix. For example, # John Smith Snr, would detect Snr as the surname and Smith as the middle name surname : ...!suffix first_surname second_surname(?) { if ( $item[2] and $item[3][0] ) { $return = "$item[2]$item[3][0]"; } else { $return = $item[2]; } } first_surname : prefix name { $return = "$item[1]$item[2]"; } | name second_surname : '-' name { if ( $item[1] and $item[2] ) { $return = "$item[1]$item[2]"; } } # Note space will not occur for first part of a hphenated surname # AddressParse::_valid_name will do further check on name context name : /[A-Z]{2,} ?/ suffix: /( ESQUIRE| ESQ | SN?R| # Senior JN?R| # Junior PHD | MD | LLB | XI{1,3}| # 11th, 12th, 13th X | # 10th IV | # 4th VI{1,3} | # 6th, 7th, 8th V | # 5th IX | # 9th I{1,3} # 1st, 2nd, 3rd )\ /x # One or more characters. non_matching: /.*/ }; # Define given name combinations, specifying the minimum number of letters. # The correct pair of rules is determined by the 'initials' key in the hash # passed to the 'new' method. my $given_name_min_2 = q{ given_name : given_name_standard }; # Joe, Jo-Anne ... my $given_name_min_3 = q{ given_name: /[A-Z]{3,} / | split_given_name }; # John ... my $given_name_min_4 = q{ given_name: /[A-Z]{4,} / | split_given_name }; # Define initials combinations specifying the minimum and maximum letters. # Order from most complex to simplest, to avoid premature matching. # 'A' my $initials_1 = q{ initials : single_initial }; #'AB' 'A B' my $initials_2 = q{ initials: /([A-Z] ){1,2}/ | /([A-Z]){1,2} / }; # 'ABC' or 'A B C' my $initials_3 = q{ initials: /([A-Z] ){1,3}/ | /([A-Z]){1,3} / }; #------------------------------------------------------------------------------- # Assemble correct combination for grammar tree. sub _create { my $name = shift; my $grammar = $rules_start; if ( $name->{joint_names} ) { $grammar .= $rules_joint_names; } $grammar .= $rules_single_names; $grammar .= $common; $grammar .= $titles; if ( $name->{extended_titles} ) { $grammar .= $extended_titles; } $name->{initials} > 3 and $name->{initials} = 3; $name->{initials} < 1 and $name->{initials} = 1; # Define limit of when a string is treated as an initial, or # a given name. For example, if initials are set to 2, MR TO SMITH # will have initials of T & O and no given name, but MR TOM SMITH will # have no initials, and a given name of Tom. if ( $name->{initials} == 1 ) { $grammar .= $given_name_min_2 . $initials_1; } elsif ( $name->{initials} == 2 ) { $grammar .= $initials_2 . $given_name_min_3; } elsif ( $name->{initials} == 3 ) { $grammar .= $given_name_min_4 . $initials_3; } return($grammar); } #------------------------------------------------------------------------------- 1; Lingua-EN-NameParse-1.33/t000755000000000000 012515627074 15633 5ustar00unknownunknown000000000000Lingua-EN-NameParse-1.33/t/main.t000444000000000000 462612510113525 17073 0ustar00unknownunknown000000000000#------------------------------------------------------------------------------ # Test script for Lingua::EN::NameParse.pm # Author : Kim Ryan #------------------------------------------------------------------------------ use strict; use Test::Simple tests => 12; use Lingua::EN::NameParse qw(clean case_surname); my $input; # Test case_surname subroutine $input = "BIG BROTHER & THE HOLDING COMPANY"; ok(case_surname($input) eq 'Big Brother & The Holding Company','case_surname'); my %args = ( salutation => 'Dear', sal_default => 'Friend', auto_clean => 1, initials => 2, allow_reversed => 1, joint_names => 1, extended_titles => 1 ); my $name = Lingua::EN::NameParse->new(%args); $input = "MR AB MACHLIN"; $name->parse($input); ok( $name->case_all eq 'Mr AB Machlin','Mac prefix exception'); $input = "MR AB MACHLIN & JANE O'BRIEN"; $name->parse($input); ok( $name->case_all eq "Mr AB Machlin & Jane O'Brien" ,'name casing'); $input = "john smith"; $name->parse($input); ok( $name->salutation eq 'Dear Friend' ,'default salutation'); $input = "DR. A.B.C. FEELGOOD"; $name->parse($input); ok( $name->salutation(sal_type => 'title_plus_surname') eq 'Dear Dr Feelgood' ,'title_plus_surname salutation'); $input = "DR ANDREW FEELGOOD"; $name->parse($input); ok( $name->salutation(sal_type => 'given_name') eq 'Dear Andrew' ,'given_name salutation'); $input = "Estate Of The Late Lieutenant Colonel AB Van Der Heiden Jnr"; $name->parse($input); my %comps = $name->components; ok ( ($comps{precursor} eq 'Estate Of The Late' and $comps{title_1} eq 'Lieutenant Colonel' and $comps{initials_1} eq 'AB' and $comps{surname_1} eq 'Van Der Heiden' and $comps{suffix} eq 'Jnr'), 'component extraction'); # Test properties $input = "m/s de de silva"; $name->parse($input); my %props = $name->properties; ok( ($props{number} == 1 and $props{type} eq 'Mr_A_Smith','properties'),'properties'); # Test non matching $input = "PROF A BRAIN & ASSOCIATES"; $name->parse($input); %comps = $name->components; ok( $comps{non_matching} eq '& Associates','non matching'); $input = ' Bad Na89me!'; ok( clean($input) eq 'Bad Name','cleaning'); $input = "de silva, m/s de"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'Mr_A_Smith','reverse order'); my $lc_prefix = 1; ok( case_surname("DE SILVA-O'SULLIVAN",$lc_prefix) eq "de Silva-O'Sullivan" ,'lower casing of surname prefix'); Lingua-EN-NameParse-1.33/t/pod-coverage.t000444000000000000 25410264364676 20516 0ustar00unknownunknown000000000000#!perl -T use Test::More; eval "use Test::Pod::Coverage 1.04"; plan skip_all => "Test::Pod::Coverage 1.04 required for testing POD coverage" if $@; all_pod_coverage_ok(); Lingua-EN-NameParse-1.33/t/pod.t000444000000000000 21410264364632 16711 0ustar00unknownunknown000000000000#!perl -T use Test::More; eval "use Test::Pod 1.14"; plan skip_all => "Test::Pod 1.14 required for testing POD" if $@; all_pod_files_ok(); Lingua-EN-NameParse-1.33/t/rules.t000444000000000000 632612510113545 17302 0ustar00unknownunknown000000000000#------------------------------------------------------------------------------ # File : rules.t - test script for Lingua::EN::NameParse.pm # Author : Kim Ryan #------------------------------------------------------------------------------ use strict; use Test::Simple tests => 19; use Lingua::EN::NameParse; my %args = ( joint_names => 1 ); my $name = Lingua::EN::NameParse->new(%args); my ($input,%props); # Test order of rule evaluation $input = "MR ADAM SMITH & MS DEBRA JONES"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'Mr_John_Smith_&_Ms_Mary_Jones', 'Mr_John_Smith_&_Ms_Mary_Jones format'); $input = "MR AB SMITH & MS D.F. JONES"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'Mr_A_Smith_&_Ms_B_Jones', 'Mr_A_Smith_&_Ms_B_Jones format'); $input = "MR AND MRS AB & D.F. JONES"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'Mr_&_Ms_A_&_B_Smith', 'Mr_&_Ms_A_&_B_Smith format'); $input = "MR AB AND MS D.F. JONES"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'Mr_A_&_Ms_B_Smith', 'Mr_A_&_Ms_B_Smith format'); $input = "MR AND MS D.F. JONES"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'Mr_&_Ms_A_Smith', 'Mr_&_Ms_A_Smith format'); $input = "MR AB AND D.G. JONES"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'Mr_A_&_B_Smith', 'Mr_A_&_B_Smith format'); $input = "ADAM SMITH & DEBRA JONES"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'John_Smith_&_Mary_Jones', 'John_Smith_&_Mary_Jones format'); $input = "ADAM & DEBRA SMITH"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'John_&_Mary_Smith', 'John_&_Mary_Smith format'); $input = "A SMITH & D JONES "; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'A_Smith_&_B_Jones', 'A_Smith_&_B_Jones format'); $input = "MR JOHN FITZGERALD KENNEDY"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'Mr_John_Adam_Smith', 'Mr_John_Adam_Smith format'); $input = "MR JOHN F KENNEDY"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'Mr_John_A_Smith', 'Mr_John_A_Smith format'); $input = "MR TOM JONES"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'Mr_John_Smith', 'Mr_John_Smith format'); $input = "MR AB JONES"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'Mr_A_Smith', 'Mr_A_Smith format'); $input = "WILLIAM JEFFERSON CLINTON"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'John_Adam_Smith', 'John_Adam_Smith format'); $input = "F SCOTT FITZGERALD"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'J_Adam_Smith', 'J_Adam_Smith format'); $input = "JOHN F KENNEDY"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'John_A_Smith', 'John_A_Smith format'); $input = "TOM JONES"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'John_Smith', 'John_Smith format'); $input = "AB JONES"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'A_Smith', 'A_Smith format'); $input = "Voltaire"; $name->parse($input); %props = $name->properties; ok( $props{type} eq 'John', 'John format');