String-Approx-3.27/LGPL

GNU LIBRARY GENERAL PUBLIC LICENSE Version 2, June 1991 Copyright (C) 1991 Free Software Foundation, Inc. 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. [This is the first released version of the library GPL. It is numbered 2 because it goes with version 2 of the ordinary GPL.] Preamble The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public Licenses are intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This license, the Library General Public License, applies to some specially designated Free Software Foundation software, and to any other libraries whose authors decide to use it. You can use it for your libraries, too. When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the library, or if you modify it. For example, if you distribute copies of the library, whether gratis or for a fee, you must give the recipients all the rights that we gave you. You must make sure that they, too, receive or can get the source code.
If you link a program with the library, you must provide complete object files to the recipients so that they can relink them with the library, after making changes to the library and recompiling it. And you must show them these terms so they know their rights. Our method of protecting your rights has two steps: (1) copyright the library, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the library. Also, for each distributor's protection, we want to make certain that everyone understands that there is no warranty for this free library. If the library is modified by someone else and passed on, we want its recipients to know that what they have is not the original version, so that any problems introduced by others will not reflect on the original authors' reputations. Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that companies distributing free software will individually obtain patent licenses, thus in effect transforming the program into proprietary software. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all. Most GNU software, including some libraries, is covered by the ordinary GNU General Public License, which was designed for utility programs. This license, the GNU Library General Public License, applies to certain designated libraries. This license is quite different from the ordinary one; be sure to read it in full, and don't assume that anything in it is the same as in the ordinary license. The reason we have a separate public license for some libraries is that they blur the distinction we usually make between modifying or adding to a program and simply using it. Linking a program with a library, without changing the library, is in some sense simply using the library, and is analogous to running a utility program or application program. 
However, in a textual and legal sense, the linked executable is a combined work, a derivative of the original library, and the ordinary General Public License treats it as such. Because of this blurred distinction, using the ordinary General Public License for libraries did not effectively promote software sharing, because most developers did not use the libraries. We concluded that weaker conditions might promote sharing better. However, unrestricted linking of non-free programs would deprive the users of those programs of all benefit from the free status of the libraries themselves. This Library General Public License is intended to permit developers of non-free programs to use free libraries, while preserving your freedom as a user of such programs to change the free libraries that are incorporated in them. (We have not seen how to achieve this as regards changes in header files, but we have achieved it as regards changes in the actual functions of the Library.) The hope is that this will lead to faster development of free libraries. The precise terms and conditions for copying, distribution and modification follow. Pay close attention to the difference between a "work based on the library" and a "work that uses the library". The former contains code derived from the library, while the latter only works together with the library. Note that it is possible for a library to be covered by the ordinary General Public License rather than by this special one. GNU LIBRARY GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License Agreement applies to any software library which contains a notice placed by the copyright holder or other authorized party saying it may be distributed under the terms of this Library General Public License (also called "this License"). Each licensee is addressed as "you". 
A "library" means a collection of software functions and/or data prepared so as to be conveniently linked with application programs (which use some of those functions and data) to form executables. The "Library", below, refers to any such software library or work which has been distributed under these terms. A "work based on the Library" means either the Library or any derivative work under copyright law: that is to say, a work containing the Library or a portion of it, either verbatim or with modifications and/or translated straightforwardly into another language. (Hereinafter, translation is included without limitation in the term "modification".) "Source code" for a work means the preferred form of the work for making modifications to it. For a library, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the library. Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running a program using the Library is not restricted, and output from such a program is covered only if its contents constitute a work based on the Library (independent of the use of the Library in a tool for writing it). Whether that is true depends on what the Library does and what the program that uses the Library does. 1. You may copy and distribute verbatim copies of the Library's complete source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and distribute a copy of this License along with the Library. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. 
You may modify your copy or copies of the Library or any portion of it, thus forming a work based on the Library, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) The modified work must itself be a software library. b) You must cause the files modified to carry prominent notices stating that you changed the files and the date of any change. c) You must cause the whole of the work to be licensed at no charge to all third parties under the terms of this License. d) If a facility in the modified Library refers to a function or a table of data to be supplied by an application program that uses the facility, other than as an argument passed when the facility is invoked, then you must make a good faith effort to ensure that, in the event an application does not supply such function or table, the facility still operates, and performs whatever part of its purpose remains meaningful. (For example, a function in a library to compute square roots has a purpose that is entirely well-defined independent of the application. Therefore, Subsection 2d requires that any application-supplied function or table used by this function must be optional: if the application does not supply it, the square root function must still compute square roots.) These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Library, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Library, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. 
Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Library. In addition, mere aggregation of another work not based on the Library with the Library (or with a work based on the Library) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. You may opt to apply the terms of the ordinary GNU General Public License instead of this License to a given copy of the Library. To do this, you must alter all the notices that refer to this License, so that they refer to the ordinary GNU General Public License, version 2, instead of to this License. (If a newer version than version 2 of the ordinary GNU General Public License has appeared, then you can specify that version instead if you wish.) Do not make any other change in these notices. Once this change is made in a given copy, it is irreversible for that copy, so the ordinary GNU General Public License applies to all subsequent copies and derivative works made from that copy. This option is useful when you wish to copy part of the code of the Library into a program that is not a library. 4. You may copy and distribute the Library (or a portion or derivative of it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange. If distribution of object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place satisfies the requirement to distribute the source code, even though third parties are not compelled to copy the source along with the object code. 5. 
A program that contains no derivative of any portion of the Library, but is designed to work with the Library by being compiled or linked with it, is called a "work that uses the Library". Such a work, in isolation, is not a derivative work of the Library, and therefore falls outside the scope of this License. However, linking a "work that uses the Library" with the Library creates an executable that is a derivative of the Library (because it contains portions of the Library), rather than a "work that uses the library". The executable is therefore covered by this License. Section 6 states terms for distribution of such executables. When a "work that uses the Library" uses material from a header file that is part of the Library, the object code for the work may be a derivative work of the Library even though the source code is not. Whether this is true is especially significant if the work can be linked without the Library, or if the work is itself a library. The threshold for this to be true is not precisely defined by law. If such an object file uses only numerical parameters, data structure layouts and accessors, and small macros and small inline functions (ten lines or less in length), then the use of the object file is unrestricted, regardless of whether it is legally a derivative work. (Executables containing this object code plus portions of the Library will still fall under Section 6.) Otherwise, if the work is a derivative of the Library, you may distribute the object code for the work under the terms of Section 6. Any executables containing that work also fall under Section 6, whether or not they are linked directly with the Library itself. 6. 
As an exception to the Sections above, you may also compile or link a "work that uses the Library" with the Library to produce a work containing portions of the Library, and distribute that work under terms of your choice, provided that the terms permit modification of the work for the customer's own use and reverse engineering for debugging such modifications. You must give prominent notice with each copy of the work that the Library is used in it and that the Library and its use are covered by this License. You must supply a copy of this License. If the work during execution displays copyright notices, you must include the copyright notice for the Library among them, as well as a reference directing the user to the copy of this License. Also, you must do one of these things: a) Accompany the work with the complete corresponding machine-readable source code for the Library including whatever changes were used in the work (which must be distributed under Sections 1 and 2 above); and, if the work is an executable linked with the Library, with the complete machine-readable "work that uses the Library", as object code and/or source code, so that the user can modify the Library and then relink to produce a modified executable containing the modified Library. (It is understood that the user who changes the contents of definitions files in the Library will not necessarily be able to recompile the application to use the modified definitions.) b) Accompany the work with a written offer, valid for at least three years, to give the same user the materials specified in Subsection 6a, above, for a charge no more than the cost of performing this distribution. c) If distribution of the work is made by offering access to copy from a designated place, offer equivalent access to copy the above specified materials from the same place. d) Verify that the user has already received a copy of these materials or that you have already sent this user a copy. 
For an executable, the required form of the "work that uses the Library" must include any data and utility programs needed for reproducing the executable from it. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. It may happen that this requirement contradicts the license restrictions of other proprietary libraries that do not normally accompany the operating system. Such a contradiction means you cannot use both them and the Library together in an executable that you distribute. 7. You may place library facilities that are a work based on the Library side-by-side in a single library together with other library facilities not covered by this License, and distribute such a combined library, provided that the separate distribution of the work based on the Library and of the other library facilities is otherwise permitted, and provided that you do these two things: a) Accompany the combined library with a copy of the same work based on the Library, uncombined with any other library facilities. This must be distributed under the terms of the Sections above. b) Give prominent notice with the combined library of the fact that part of it is a work based on the Library, and explaining where to find the accompanying uncombined form of the same work. 8. You may not copy, modify, sublicense, link with, or distribute the Library except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense, link with, or distribute the Library is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 9. 
You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Library or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Library (or any work based on the Library), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Library or works based on it. 10. Each time you redistribute the Library (or any work based on the Library), the recipient automatically receives a license from the original licensor to copy, distribute, link with or modify the Library subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License. 11. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Library at all. For example, if a patent license would not permit royalty-free redistribution of the Library by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Library. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply, and the section as a whole is intended to apply in other circumstances. 
It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 12. If the distribution and/or use of the Library is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Library under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 13. The Free Software Foundation may publish revised and/or new versions of the Library General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Library specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Library does not specify a license version number, you may choose any version ever published by the Free Software Foundation. 14. 
If you wish to incorporate parts of the Library into other free programs whose distribution conditions are incompatible with these, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Libraries If you develop a new library, and you want it to be of the greatest possible use to the public, we recommend making it free software that everyone can redistribute and change. 
You can do so by permitting redistribution under these terms (or, alternatively, under the terms of the ordinary General Public License). To apply these terms, attach the following notices to the library. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. Copyright (C) This library is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public License for more details. You should have received a copy of the GNU Library General Public License along with this library; if not, write to the Free Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA Also add information on how to contact you by electronic and paper mail. You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the library, if necessary. Here is a sample; alter the names: Yoyodyne, Inc., hereby disclaims all copyright interest in the library `Frob' (a library for tweaking knobs) written by James Random Hacker. , 1 April 1990 Ty Coon, President of Vice That's all there is to it! 
String-Approx-3.27/t/adist.t

use String::Approx qw(adist adistr adistword adistrword);
use Test::More tests => 32;

is(adist("abc", "abc"), 0);
is(adist("abc", "abd"), 1);
is(adist("abc", "ade"), 2);
is(adist("abc", "def"), 3);

is(adistr("abc", "abd"), 1/3);

$a = adist("abc", ["abc", "abd", "ade", "def"]);

is($a->[0], 0);
is($a->[1], 1);
is($a->[2], 2);
is($a->[3], 3);
is(@$a, 4);

$a = adist(["abc", "abd", "ade", "def"], "abc");

is($a->[0], 0);
is($a->[1], 1);
is($a->[2], 2);
is($a->[3], 3);
is(@$a, 4);

$a = adist(["abc", "abd", "ade", "def"], ["abc", "abd", "ade", "def"]);

is($a->[0]->[0], 0);
is($a->[1]->[2], 2);
is($a->[2]->[1], 1);
is($a->[3]->[3], 0);
is(@$a, 4);

is(adist("abcd", "abc"), -1);
is(adistr("abcd", "abc"), -1/4);

is(adist("abcde", "abc"), -2);
is(adistr("abcde", "abc"), -2/5);

my @a = adist("abc", "abd", "ade", "def");
is($a[2], 3);

{
    my @abd = ("abd", "bad");
    my @r = adistr("abc", @abd);
    is(@r, 2);
    is($r[0], 1/3);
    is($r[1], 2/3);
}

is(adist("abc", ""), 3);
is(adist("", "abc"), 3);
is(adist("", ""), 0);
is(adist("\x{100}", ""), 1);

String-Approx-3.27/t/util

sub t {
    my ($a, $b) = @_;
    my ($wa, $wb, $db);
    my $fail = 0;

    foreach (@$a) { chomp }
    foreach (@$b) { chomp }

    my @oa = @$a;
    my @ob = @$b;

    if (@$a == @$b) {
        for $wa (@$a) {
            $wb = shift(@$b);
            $db = defined $wb;
            if ($db) {
                $wa =~ s/^\s+//;
                $wa =~ s/\s+$//;
                $wb =~ s/^\s+//;
                $wb =~ s/\s+$//;
                $wa =~ s/\n//g;
                $wb =~ s/\n//g;
            }
            if (not $db or $wa ne $wb) {
                print STDERR "# ne: $wa $wb\n";
                $fail = 1;
                last;
            }
        }
    } else {
        print STDERR "# !=: ", scalar @$a, " ", scalar @$b, "\n";
        $fail = 1;
    }

    if ($fail) {
        print STDERR "# EXPECTED: @oa\n";
        print STDERR "# GOT: @ob\n";
    }

    return !$fail;
}

1;

String-Approx-3.27/t/aslice.t

use String::Approx 'aslice';
use Test::More tests => 20;

@s = aslice("xyz", "abcdef");
is(@s, 1);
is(@{$s[0]}, 0);

@s = aslice("xyz", "abcdefxyzghi");
is(@s, 1);
is($s[0]->[0], 6);
is($s[0]->[1], 4);

@s = aslice("xyz", ["i"], "ABCDEFXYZGHI");
is(@s, 1);
is($s[0]->[0], 6);
is($s[0]->[1], 4);

@s = aslice("xyz", ["minimal_distance"], "abcdefx!yzghi");
print "# @{$s[0]}\n";
is(@s, 1);
is($s[0]->[0], 6);
is($s[0]->[1], 4);
is($s[0]->[2], 1);

@s = aslice("xyz", ["minimal_distance"], "abcdefxzghi");
print "# @{$s[0]}\n";
is(@s, 1);
is($s[0]->[0], 6);
is($s[0]->[1], 2);
is($s[0]->[2], 1);

@s = aslice("xyz", ["minimal_distance"], "abcdefx!zghi");
print "# @{$s[0]}\n";
is(@s, 1);
is($s[0]->[0], 6);
is($s[0]->[1], 3);
is($s[0]->[2], 1);

# that's it.

String-Approx-3.27/t/amatch.t

use String::Approx 'amatch';
use Test::More tests => 15;

chdir('t') or die "could not chdir to 't'";

require 'util';

# test 1

open(WORDS, 'words') or die "could not find words";

my @words = <WORDS>;

ok(
  t(
    [qw(
        appeal dispel erlangen hyperbola merlin
        parlance pearl perk superappeal superlative
       )],
    [amatch('perl', @words)]));

# test 2: same as 1 but no insertions allowed

ok(
  t(
    [qw(
        appeal dispel erlangen hyperbola merlin
        parlance perk superappeal superlative
       )],
    [amatch('perl', ['I0'], @words)]));

# test 3: same as 1 but no deletions allowed

ok(
  t(
    [qw(
        appeal hyperbola merlin parlance
        pearl perk superappeal superlative
       )],
    [amatch('perl', ['D0'], @words)]));

# test 4: same as 1 but no substitutions allowed

ok(
  t(
    [qw(
        dispel erlangen hyperbola merlin
        pearl perk superappeal superlative
       )],
    [amatch('perl', ['S0'], @words)]));

# test 5: 2-differences

ok(
  t(
    [qw(
        aberrant accelerate appeal dispel erlangen felicity
        gibberish hyperbola iterate legerdemain merlin mermaid
        oatmeal park parlance Pearl pearl perk
        petal superappeal superlative supple twirl zealous
       )],
    [amatch('perl', [2], @words)]));

# test 6: i(gnore case)

ok(
  t(
    [qw(
        appeal dispel erlangen hyperbola merlin parlance
        Pearl pearl perk superappeal superlative
       )],
    [amatch('perl', ['i'], @words)]));

# test 7: test for undefined input
{
    undef $_;
    local *SAVERR;
    open SAVERR, ">&STDERR";
    close STDERR;
    my $error;
    open STDERR, ">", \$error;
    ok(!defined amatch("foo"));
    ok($error =~ /what are you/);
    close STDERR;
    open STDERR, ">&SAVERR";
}
$_ = 'foo'; # anything defined so later tests do not fret

# test 8: test just for acceptance of a very long pattern
ok(!amatch("abcdefghij" x 10));

# test 9: test long pattern matching
$_ = 'xyz' x 10 . 'abc0defghijabc1defghij' . 'zyx' x 10;
ok(amatch('abcdefghij' x 2));

# test 10: test stingy matching.
ok(
  t(
    [qw(
        appeal dispel erlangen hyperbola merlin
        parlance pearl perk superappeal superlative
       )],
    [amatch('perl', ['?'], @words)]));

ok(!amatch("xyz", ""));
ok(amatch("", "xyz"));
ok(amatch("", ""));
ok(amatch("\x{100}d", "ab\x{100}cd"));

# that's it.

String-Approx-3.27/t/aindex.t

use String::Approx 'aindex';
use Test::More tests => 16;

is(aindex("xyz", "abcdef"), -1);
is(aindex("xyz", "abcdefxyz"), 6);
is(aindex("xyz", "abcdefxgh"), -1);
is(aindex("xyz", "abcdefyzg"), 6);
is(aindex("xyz", "abcdefxgz"), 6);
is(aindex("xyz", "abcdexfyz"), 5);

is(aindex("xyz", ["initial_position=3"], "xyzabcde"), -1);
is(aindex("xyz", ["initial_position=1"], "xyzabcde"), 1);
is(aindex("xyz", ["final_position=5"], "abcdexyz"), -1);
is(aindex("xyz", ["initial_position=2", "final_position=6"], "xyzabcxyz"), -1);
is(aindex("xyz", ["initial_position=2", "final_position=7"], "xyzabcxyz"), 6);
is(aindex("xyz", ["initial_position=1", "final_position=6"], "xyzabcxyz"), 1);

is(aindex("xyz", ""), -1);
is(aindex("", "xyz"), 0);
is(aindex("", ""), 0);
is(aindex("\x{100}d", "ab\x{100}cd"), 2);

# that's it.
String-Approx-3.27/t/words

aberrant
accelerate
appeal
dispel
dispute
erlangen
felicity
foobar
gibberish
hyperbola
hyphen
iterate
legerdemain
merlin
mermaid
nouveau
cameau
oatmeal
park
parlance
Pearl
pearl
perk
petal
superappeal
superlative
supple
twirl
xyzzy
zealous

String-Approx-3.27/t/arindex.t

use String::Approx 'arindex';
use Test::More tests => 3;

is(arindex("xyz", "abcxyzdefxyz"), 9);
is(arindex("xyz", "abcxyzdefghi"), 3);
is(arindex("xyz", "abcwyzdefghi"), 3);

String-Approx-3.27/t/asubst.t

use String::Approx qw(asubstitute);
use Test::More tests => 12;

chdir('t') or die "could not chdir to 't'";

require 'util';

# test 1

open(WORDS, 'words') or die "could not find words";

ok(
  t(
    [qw(
        ap(peal) dis(pel) (erl)angen hy(per)bola m(erl)in
        (parl)ance (pearl) (per)k su(per)appeal su(per)lative
       )],
    [asubstitute('perl', '($&)', <WORDS>)]));

close(WORDS);

# test 2: like 1 but no insertions allowed

open(WORDS, 'words') or die "could not find 'words'";

ok(
  t(
    [qw(
        a(ppeal) dis(pel) (erl)angen hy(perb)ola m(erl)in
        (parl)ance (perk) su(pera)ppeal su(perl)ative
       )],
    [asubstitute('perl', '($&)', ['I0'], <WORDS>)]));

close(WORDS);

# test 3: like 1 but no deletions allowed

open(WORDS, 'words') or die "could not find 'words'";

ok(
  t(
    [qw(
        a(ppeal) hy(perb)ola m(erl)in (parl)ance
        (pearl) (perk) su(pera)ppeal su(perla)tive
       )],
    [asubstitute('perl', '($&)', ['D0'], <WORDS>)]));

close(WORDS);

# test 4: like 1 but no substitutions allowed

open(WORDS, 'words') or die "could not find 'words'";

ok(
  t(
    [qw(
        dis(pel) (erl)angen hy(per)bola m(erl)in
        (pearl) (per)k su(per)appeal su(perla)tive
       )],
    [asubstitute('perl', '($&)', ['S0'], <WORDS>)]));

close(WORDS);

# test 5: 2-differences

open(WORDS, 'words') or die;

ok(
  t(
    [qw(
        ab(err)ant acc(el)erate a(ppeal) dis(pel) (erla)ngen
        f(el)icity gibb(eri)sh hy(perbol)a it(era)te l(egerd)emain
m(erli)n m(erm)aid oatm(eal) (park) (parla)nce P(earl) (pearl) (perk) (petal) su(perap)peal su(perlat)ive su(ppl)e twi(rl) z(eal)ous )],
    [asubstitute('perl', '($&)', [2], <WORDS>)]));
close(WORDS);

# test 6: i(gnore case)
open(WORDS, 'words') or die;
ok(
  t(
    [qw( a(ppeal) dis(pel) (erl)angen hy(perb)ola m(erl)in (parl)ance (Pearl) (pearl) (perk) su(pera)ppeal su(perla)tive )],
    [asubstitute('perl', '($&)', ['i'], <WORDS>)]));
close(WORDS);

# test 7: both i(gnore case) and g(lobally)
open(WORDS, 'words') or die;
ok(
  t(
    [qw( a(ppeal) dis(pel) (erl)angen hy(perb)ola m(erl)in (parl)ance (Pearl) (pearl) (perk) su(pera)p(peal) su(perla)tive )],
    [asubstitute('perl', '($&)', ['ig'], <WORDS>)]));
close(WORDS);

# test 8: exercise all of $` $& $'
open(WORDS, 'words') or die;
ok(
  t(
    [qw( ap(ap:peal:) dis(dis:pel:) (:erl:angen)angen hy(hy:per:bola)bola m(m:erl:in)in (:parl:ance)ance (:pearl:) (:per:k)k su(su:per:appeal)appeal su(su:per:lative)lative )],
    [asubstitute('perl', q(($`:$&:$')), map {chomp;$_} <WORDS>)]));
close(WORDS);

# test 9: $_
$_ = "foo";
eval 'asubstitute("foo","bar")';
is($_, 'bar');
$_ = 'foo'; # anything defined so later tests do not fret

# test 10: test for undefined input
{
    undef $_;
    local *SAVERR;
    open SAVERR, ">&STDERR";
    close STDERR;
    my $error;
    open STDERR, ">", \$error;
    ok(!defined asubstitute("foo","bar"));
    ok($error =~ /what are you/);
    close STDERR;
    open STDERR, ">&SAVERR";
}
$_ = 'foo'; # anything defined so later tests do not fret

# test 11: test fuzzier substitution.
open(WORDS, 'words') or die; ok( t( [qw( ab(ab:err:ant)ant acc(acc:el:erate)erate a(a:ppeal:) dis(dis:pel:) (:erla:ngen)ngen f(f:el:icity)icity gibb(gibb:eri:sh)sh hy(hy:perbol:a)a it(it:era:te)te l(l:egerd:emain)emain m(m:erli:n)n m(m:erm:aid)aid oatm(oatm:eal:) (:park:) (:parla:nce)nce P(P:earl:) (:pearl:) (:perk:) (:petal:) su(su:perap:peal)peal su(su:perlat:ive)ive su(su:ppl:e)e twi(twi:rl:) z(z:eal:ous)ous )], [asubstitute('perl', q(($`:$&:$')), [q(2)], map {chomp;$_} )])); close(WORDS); # eof String-Approx-3.27/t/user.t0000644010002300116100000001665712077532111013700 0ustar jhieng# User-supplied test cases. # (These *were* bugs :-) use String::Approx qw(amatch aindex adist); use Test::More tests => 42; chdir('t') or die "could not chdir to 't'"; require 'util'; local $^W = 1; # test long pattern both matching and not matching # Thanks to Alberto Fontaneda # for this test case and also thanks to Dmitrij Frishman # for testing this test. { my @l = ('perl warks fiine','parl works fine', 'perl worrs', 'perl warkss'); my @m = amatch('perl works fin', [2] , @l); ok($m[0] eq 'perl warks fiine' && $m[1] eq 'parl works fine'); #print "m = (@{[join(':',@m)]})\n"; } # Slaven Rezic { my @w=('one hundred','two hundred','three hundred','bahnhofstr. (koepenick)'); my @m=amatch('bahnhofstr. ', ['i',3], @w); ok(t(['bahnhofstr. 
(koepenick)'],[@m])); } # Greg Ward ok(amatch('mcdonald', 'macdonald')); ok(amatch('macdonald', 'mcdonald')); ok(amatch('mcdonald', ['I0'], 'macdonald')); ok(amatch('mcdonald', ['I1'], 'mcdonaald')); ok(!amatch('mcdonald', ['I1'], 'mcdonaaald')); ok(amatch('mcdonald', ['1I1'], 'mcdonaald')); ok(amatch('mcdonald', ['2I2'], 'mcdonaaald')); # Kevin Greiner @IN = ("AK_ANCHORAGE A-7 NW","AK A ANCHORAGE B-8 NE"); $Title = "AK_ANCHORAGE A-7 NE"; ok(amatch($Title, @IN)); # Ricky Houghton @names = ("madeleine albright","william clinton"); @matches = amatch("madeleine albriqhl",@names); ok($matches[0] eq "madeleine albright"); # test 9: Jared August ok(amatch("Dopeman (Remix)",["i","50%"],"Dopeman (Remix)")); # Steve A. Chervitz # Short vs. Long behaved differently than Long vs. Short. # s1 and s1_1 are identical except for an extra extension at end of s1. $s1 = "MSRTGHGGLMPVNGLGFPPQNVARVVVWECLNEHSRWRPYTATVCHHIENVLKEDARGSVVLGQVDAQ". "LVPYIIDLQSMHQFRQDTGTMRPVRRNFYDPSSAPGKGIVWEWENDGGAWTAYDMDICITIQNAYEKQHPWLW_GBH"; $s1_1 = "MSRPGHGGLMPVNGLGFPPQNVARVVVWECLNEHSRWRPYTATVCHHIENVLKEDARGSVVLGQVDAQ". "LVPYIIDLQSMHQFRQDTGTMRPVRRNFYDPSSAPGKGIVWEWENDGGAWTAYDMDICITIQNAYEKQHPWLW"; ok(amatch($s1, ['5%'], $s1_1)); # this failed to match ok(amatch($s1_1, ['5%'], $s1)); # s1_1 vs. s1: (attempting to disallow insertions). ok(amatch($s1_1, ['5%','I0'], $s1)); #----------------------------------------------------------------------- # Position dependency of approximate matching. # There is a position dependency for matching. If two strings differ # at two neighboring (or very close) positions, they will not match # with approximation. If the differences are well-separated, they # will match with approximation. $s2 = "DLSSLGFCYLIYFNSMSQMNRQTRRRRRLRRRLDLAYPLTVGSIPKSQSWPVGASSGQPCSCQQCLLVNSTRAVSN". "VILASQRRKVPPAPPLPPPPPPGGPPGALAVRPSATFTGAALWAAPAAGPAEPAPPPGAPPRSPGAPGGARTPGQNNLNR". "PGPQRTTSVSARASIPPGVPALPVKNLNGTGPVHPALAGMTGILLCAAGLPVCLTRAPKPILHPPPVSKSDVKPVPGVPG". 
"VCRKTKKKHLKKSKNPEDVVRRYMQKVKNPPDEDCTICMERLVTASGYEGVLRHKGVRPELVGRLGRCGHMYHLLCLVAMY". "SNGNKDGSLQCPTCKAIYGEKTGTQPPGKMEFHLIPHSLPGFPDTQTIRIVYDIPTGIQGPEHPNPGKKFTARGFPRHCYL". "PNNEKGRKVLRLLITAWERRLIFTIGTSNTTGESDTVVWNEIHHKTEFGSNLTGHGYPDASYLDNVLAELTAQGVSEAAGKA"; # s2_1 has two nearby substitutions relative to s2 indicated with '_' $s2_1 = "DLSSLGFCYL_YFNSMSQMN_QTRRRRRLRRRLDLAYPLTVGSIPKSQSWPVGASSGQPCSCQQCLLVNSTRAVSN". "VILASQRRKVPPAPPLPPPPPPGGPPGALAVRPSATFTGAALWAAPAAGPAEPAPPPGAPPRSPGAPGGARTPGQNNLNR". "PGPQRTTSVSARASIPPGVPALPVKNLNGTGPVHPALAGMTGILLCAAGLPVCLTRAPKPILHPPPVSKSDVKPVPGVPG". "VCRKTKKKHLKKSKNPEDVVRRYMQKVKNPPDEDCTICMERLVTASGYEGVLRHKGVRPELVGRLGRCGHMYHLLCLVAMY". "SNGNKDGSLQCPTCKAIYGEKTGTQPPGKMEFHLIPHSLPGFPDTQTIRIVYDIPTGIQGPEHPNPGKKFTARGFPRHCYL". "PNNEKGRKVLRLLITAWERRLIFTIGTSNTTGESDTVVWNEIHHKTEFGSNLTGHGYPDASYLDNVLAELTAQGVSEAAGKA"; # s2_2 has two far apart substitutions relative to s2 indicated with '_' $s2_2 = "DLSSLGFC_LIYFNSMSQMNRQTRRRRRLRRRLDLAYPLTVGSIPKSQSWPVGASSGQPCSCQQCLLVNSTRAVSN". "VILASQRRKVPPAPPLPPPPPPGGPPGALAVRPSATFTGAALWAAPAAGPAEPAPPPGAPPRSPGAPGGARTPGQNNLNR". "PGPQRTTSVSARASIPPGVPALPVKNLNGTGPVHPALAGMTGILLCAAGLPVCLTRAPKPILHPPPVSKSDVKPVPGVPG". "VCRKTKKKHLKKSKNPEDVVRRYMQKVKNPPDEDCTICMERLVTASGYEGVLRHKGVRPELVGRLGRCGHMYHLLCLVAMY". "SNGNKDGSLQCPTCKAIYGEKTGTQPPGKMEFHLIPHSLPGFPDTQTIRIVYDIPTGIQGPEHPNPGKKFTARGFPRHCYL". "PNNEKGRKVLRLLITAWERRLIFTIGTSNTTGESDTVVWNEIHHKTEFGSNLTG_GYPDASYLDNVLAELTAQGVSEAAGKA"; # s2 vs s2_1: (substitutions close together) ok(amatch($s2, [10], $s2_1)); # s2 vs s2_2: (substitutions far apart) ok(amatch($s2, [10], $s2_2)); #----------------------------------------------------------------------- # Difference in behavior of % differences versus absolute number of # differences. $s3 = "MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNCNGVITKDEAEKLFNQDVDAAVRG". "ILRNAKLKPVYDSLDAVRRCALINMVFGMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGT". "WDAYKNL"; # s3_1 contains two substitutions '_' and one deletion relative to s3. 
$s3_1 = "MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRN_NGVITKDEAEKLFNQDVDAVRG". "ILRNAKLKPVYDSLDAVRRCALINMVF_MGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGT". "WDAYKNL"; # s3 vs s3_1: (matching with 10% differences) ok(amatch($s3, ['10%'], $s3_1)); # s3 vs s3_1: (matching with 10 differences) ok(amatch($s3, ['10'], $s3_1)); # Bob J.A. Schijvenaars @gloslist = ('computer', 'computermonitorsalesman'); @matches = amatch('computers', [1,'g'], @gloslist); $a = ''; for (@matches) { $a .= "|$_|"; } @matches = amatch('computers', [2,'g'], @gloslist); $b = ''; for (@matches) { $b .= "|$_|"; } ok($a eq $b and $a eq "|computer||computermonitorsalesman|"); # Rick Wise ok(amatch('abc', [10], 'abd')); # Ilya Sandler $_="ABCDEF"; ok(!amatch("ABCDEF","VWXYZ")); ok(amatch("BURTONUPONTRENT",['5'], "BURTONONTRENT")); ok(amatch("BURTONONTRENT",['5'], "BURTONUPONTRENT")); # Chris Rosin and # Mark Land . ok(!amatch("Karivaratharajan", "Rajan")); ok(amatch("Rajan", "Karivaratharajan")); ok(amatch("Ferna", "Fernandez")); ok(!amatch("Fernandez", "Ferna")); # Mitch Helle ok(!amatch('ffffff', 'a')); ok(!amatch('fffffffffff', 'a')); ok(!amatch('fffffffffffffffffffff', 'ab')); # Anirvan Chatterjee ok(amatch("", "foo")); # Rob Fugina open(MAKEFILEPL, "../Makefile.PL") or die "$0: failed to open Makefile.PL: $!"; # Don't let a debugging version escape the laboratory. 
my $debugging = grep {/^[^#]*-DAPSE_DEBUGGING/} ; ok(!$debugging); close(MAKEFILEPL); warn "(You have -DAPSE_DEBUGGING turned on!)\n" if $debugging; # David Curiel is(aindex("xyz", "abxefxyz"), 5); # Stefan Ram is(aindex( "knral", "nisinobikttatnbankfirknalt" ), 21); is(aindex( "knral", "nbankfirknalt"), 8); is(aindex( "knral", "nkfirknalt"), 5); # Chris Rosin is(adist('MOM','XXMOXXMOXX'), 1); # Frank Tobin is(aindex('----foobar----',[1],'----aoobar----'), 0); # Damian Keefe is(aindex('tccaacttctctgtgactgaccaaagaa','tctttgcatccaatactccaacttctctgtggctgaccaaagaattggcacctatcttgccagtcaggtagttctgatgggtccagcacagactggctgcctgggggagaaagacagcattgatttgaagtggtgaacactataactcccctagctcatcacaaaacaagcagacaagaaccacagcttc'), 16); # Juha Muilu is(aindex("pattern", "aaaaaaaaapattern"), 9); # Ji Y Park # 0% must mean 0. $_="TTES"; ok(!amatch("test", ["i I0% S0% D0%"])); # eof String-Approx-3.27/apse.h0000644010002300116100000001500407762152244013366 0ustar jhieng/* Copyright (C) Jarkko Hietaniemi, 1998,1999,2000,2001,2002,2003. All Rights Reserved. This program is free software; you can redistribute it and/or modify it under the terms of either: a) the GNU Library General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version, or b) the "Artistic License" which comes with Perl source code. Other free software licensing schemes are negotiable. Furthermore: (1) This software is provided as-is, without warranties or obligations of any kind. (2) You shall include this copyright notice intact in all copies and derived materials. 
*/ /* $Id: apse.h,v 1.14 1998/12/15 12:42:04 jhi Exp $ */ #ifndef APSE_H #define APSE_H #define APSE_MAJOR_VERSION 0 #define APSE_MINOR_VERSION 16 #include #ifdef APSE_VEC_T typedef APSE_VEC_T apse_vec_t; #else typedef unsigned long apse_vec_t; #endif #ifdef APSE_SIZE_T typedef APSE_SIZE_T apse_size_t; #else typedef unsigned long apse_size_t; #endif #ifdef APSE_SSIZE_T typedef APSE_SSIZE_T apse_ssize_t; #else typedef long apse_ssize_t; #endif #ifdef APSE_BOOL_T typedef APSE_BOOL_T apse_bool_t; #else typedef int apse_bool_t; #endif typedef struct apse_s { apse_size_t pattern_size; apse_vec_t* pattern_mask; apse_vec_t* case_mask; apse_vec_t* fold_mask; apse_size_t edit_distance; apse_bool_t has_different_distances; apse_size_t different_distances_max; apse_size_t edit_insertions; apse_size_t edit_deletions; apse_size_t edit_substitutions; apse_bool_t use_minimal_distance; apse_size_t bitvectors_in_state; apse_size_t bytes_in_state; apse_size_t bytes_in_all_states; apse_size_t largest_distance; unsigned char* text; apse_size_t text_size; apse_size_t text_position; apse_size_t text_initial_position; apse_size_t text_final_position; apse_size_t text_position_range; apse_vec_t* state; apse_vec_t* prev_state; apse_size_t prev_equal; apse_size_t prev_active; apse_size_t match_begin_bitvector; apse_vec_t match_begin_bitmask; apse_vec_t match_begin_prefix; apse_size_t match_end_bitvector; apse_vec_t match_end_bitmask; apse_bool_t match_state; apse_size_t match_begin; apse_size_t match_end; void* (*match_bot_callback) (struct apse_s *ap); void* (*match_begin_callback)(struct apse_s *ap); void* (*match_fail_callback) (struct apse_s *ap); void* (*match_end_callback) (struct apse_s *ap); void* (*match_eot_callback) (struct apse_s *ap); apse_size_t exact_positions; apse_vec_t* exact_mask; apse_bool_t is_greedy; void* custom_data; apse_size_t custom_data_size; } apse_t; apse_t *apse_create(unsigned char* pattern, apse_size_t pattern_size, apse_size_t edit_distance); apse_bool_t 
apse_match(apse_t* ap, unsigned char* text, apse_size_t text_size); apse_bool_t apse_match_next(apse_t* ap, unsigned char* text, apse_size_t text_size); apse_ssize_t apse_index(apse_t* ap, unsigned char* text, apse_size_t text_size); apse_ssize_t apse_index_next(apse_t* ap, unsigned char* text, apse_size_t text_size); apse_bool_t apse_slice(apse_t* ap, unsigned char* text, apse_size_t text_size, apse_size_t* match_begin, apse_size_t* match_size); apse_bool_t apse_slice_next(apse_t* ap, unsigned char* text, apse_size_t text_size, apse_size_t* match_begin, apse_size_t* match_size); void apse_reset(apse_t *ap); apse_bool_t apse_set_pattern(apse_t* ap, unsigned char* pattern, apse_size_t pattern_size); apse_bool_t apse_set_edit_distance(apse_t *ap, apse_size_t edit_distance); apse_size_t apse_get_edit_distance(apse_t *ap); apse_bool_t apse_set_minimal_distance(apse_t* ap, apse_bool_t minimal); apse_bool_t apse_get_minimal_distance(apse_t* ap); apse_bool_t apse_set_text_position(apse_t *ap, apse_size_t text_position); apse_size_t apse_get_text_position(apse_t *ap); apse_bool_t apse_set_text_initial_position(apse_t *ap, apse_size_t text_initial_position); apse_size_t apse_get_text_initial_position(apse_t *ap); apse_bool_t apse_set_text_final_position(apse_t *ap, apse_size_t text_final_position); apse_size_t apse_get_text_final_position(apse_t *ap); apse_bool_t apse_set_text_position_range(apse_t *ap, apse_size_t text_position_range); apse_size_t apse_get_text_position_range(apse_t *ap); apse_bool_t apse_set_insertions(apse_t *ap, apse_size_t insertions); apse_bool_t apse_set_deletions(apse_t *ap, apse_size_t deletions); apse_bool_t apse_set_substitutions(apse_t *ap, apse_size_t substitutions); apse_size_t apse_get_insertions(apse_t *ap); apse_size_t apse_get_deletions(apse_t *ap); apse_size_t apse_get_substitutions(apse_t *ap); apse_bool_t apse_set_caseignore_slice(apse_t* ap, apse_ssize_t caseignore_begin, apse_ssize_t caseignore_size, apse_bool_t caseignore); void 
apse_set_greedy(apse_t *ap, apse_bool_t greedy); apse_bool_t apse_get_greedy(apse_t *ap); void apse_set_match_bot_callback(apse_t *ap, void* (*match_bot_callback)(apse_t* ap)); void apse_set_match_begin_callback(apse_t *ap, void* (*match_begin_callback)(apse_t* ap)); void apse_set_match_fail_callback(apse_t *ap, void* (*match_fail_callback)(apse_t* ap)); void apse_set_match_end_callback(apse_t *ap, void* (*match_end_callback)(apse_t* ap)); void apse_set_match_eot_callback(apse_t *ap, void* (*match_eot_callback)(apse_t* ap)); void* (*apse_get_match_bot_callback(apse_t *ap))(apse_t *ap); void* (*apse_get_match_begin_callback(apse_t *ap))(apse_t *ap); void* (*apse_get_match_fail_callback(apse_t *ap))(apse_t *ap); void* (*apse_get_match_end_callback(apse_t *ap))(apse_t *ap); void* (*apse_get_match_eot_callback(apse_t *ap))(apse_t *ap); apse_bool_t apse_set_anychar(apse_t *ap, apse_ssize_t pattern_index); apse_bool_t apse_set_charset(apse_t* ap, apse_ssize_t pattern_index, unsigned char* set, apse_size_t set_size, apse_bool_t complement); apse_bool_t apse_set_exact_slice(apse_t* ap, apse_ssize_t exact_begin, apse_ssize_t exact_size, apse_bool_t exact); void apse_set_custom_data(apse_t* ap, void* custom_data, apse_size_t custom_data_size); void* apse_get_custom_data(apse_t* ap, apse_size_t* custom_data_size); void apse_destroy(apse_t *ap); #define APSE_MATCH_BAD ((apse_size_t) -1) #endif /* #ifndef APSE_H */ String-Approx-3.27/Artistic0000644010002300116100000002130612077535752014000 0ustar jhieng The Artistic License 2.0 Copyright (c) 2000-2006, The Perl Foundation. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Preamble This license establishes the terms under which a given free software Package may be copied, modified, distributed, and/or redistributed. 
The intent is that the Copyright Holder maintains some artistic control over the development of that Package while still keeping the Package available as open source and free software. You are always permitted to make arrangements wholly outside of this license directly with the Copyright Holder of a given Package. If the terms of this license do not permit the full use that you propose to make of the Package, you should contact the Copyright Holder and seek a different licensing arrangement. Definitions "Copyright Holder" means the individual(s) or organization(s) named in the copyright notice for the entire Package. "Contributor" means any party that has contributed code or other material to the Package, in accordance with the Copyright Holder's procedures. "You" and "your" means any person who would like to copy, distribute, or modify the Package. "Package" means the collection of files distributed by the Copyright Holder, and derivatives of that collection and/or of those files. A given Package may consist of either the Standard Version, or a Modified Version. "Distribute" means providing a copy of the Package or making it accessible to anyone else, or in the case of a company or organization, to others outside of your company or organization. "Distributor Fee" means any fee that you charge for Distributing this Package or providing support for this Package to another party. It does not mean licensing fees. "Standard Version" refers to the Package if it has not been modified, or has been modified only in ways explicitly requested by the Copyright Holder. "Modified Version" means the Package, if it has been changed, and such changes were not explicitly requested by the Copyright Holder. "Original License" means this Artistic License as Distributed with the Standard Version of the Package, in its current version or as it may be modified by The Perl Foundation in the future. 
"Source" form means the source code, documentation source, and configuration files for the Package. "Compiled" form means the compiled bytecode, object code, binary, or any other form resulting from mechanical transformation or translation of the Source form. Permission for Use and Modification Without Distribution (1) You are permitted to use the Standard Version and create and use Modified Versions for any purpose without restriction, provided that you do not Distribute the Modified Version. Permissions for Redistribution of the Standard Version (2) You may Distribute verbatim copies of the Source form of the Standard Version of this Package in any medium without restriction, either gratis or for a Distributor Fee, provided that you duplicate all of the original copyright notices and associated disclaimers. At your discretion, such verbatim copies may or may not include a Compiled form of the Package. (3) You may apply any bug fixes, portability changes, and other modifications made available from the Copyright Holder. The resulting Package will still be considered the Standard Version, and as such will be subject to the Original License. Distribution of Modified Versions of the Package as Source (4) You may Distribute your Modified Version as Source (either gratis or for a Distributor Fee, and with or without a Compiled form of the Modified Version) provided that you clearly document how it differs from the Standard Version, including, but not limited to, documenting any non-standard features, executables, or modules, and provided that you do at least ONE of the following: (a) make the Modified Version available to the Copyright Holder of the Standard Version, under the Original License, so that the Copyright Holder may include your modifications in the Standard Version. (b) ensure that installation of your Modified Version does not prevent the user installing or running the Standard Version. 
In addition, the Modified Version must bear a name that is different from the name of the Standard Version. (c) allow anyone who receives a copy of the Modified Version to make the Source form of the Modified Version available to others under (i) the Original License or (ii) a license that permits the licensee to freely copy, modify and redistribute the Modified Version using the same licensing terms that apply to the copy that the licensee received, and requires that the Source form of the Modified Version, and of any works derived from it, be made freely available in that license fees are prohibited but Distributor Fees are allowed. Distribution of Compiled Forms of the Standard Version or Modified Versions without the Source (5) You may Distribute Compiled forms of the Standard Version without the Source, provided that you include complete instructions on how to get the Source of the Standard Version. Such instructions must be valid at the time of your distribution. If these instructions, at any time while you are carrying out such distribution, become invalid, you must provide new instructions on demand or cease further distribution. If you provide valid instructions or cease distribution within thirty days after you become aware that the instructions are invalid, then you do not forfeit any of your rights under this license. (6) You may Distribute a Modified Version in Compiled form without the Source, provided that you comply with Section 4 with respect to the Source of the Modified Version. Aggregating or Linking the Package (7) You may aggregate the Package (either the Standard Version or Modified Version) with other packages and Distribute the resulting aggregation provided that you do not charge a licensing fee for the Package. Distributor Fees are permitted, and licensing fees for other components in the aggregation are permitted. 
The terms of this license apply to the use and Distribution of the Standard or Modified Versions as included in the aggregation. (8) You are permitted to link Modified and Standard Versions with other works, to embed the Package in a larger work of your own, or to build stand-alone binary or bytecode versions of applications that include the Package, and Distribute the result without restriction, provided the result does not expose a direct interface to the Package. Items That are Not Considered Part of a Modified Version (9) Works (including, but not limited to, modules and scripts) that merely extend or make use of the Package, do not, by themselves, cause the Package to be a Modified Version. In addition, such works are not considered parts of the Package itself, and are not subject to the terms of this license. General Provisions (10) Any use, modification, and distribution of the Standard or Modified Versions is governed by this Artistic License. By using, modifying or distributing the Package, you accept this license. Do not use, modify, or distribute the Package, if you do not accept this license. (11) If your Modified Version has been derived from a Modified Version made by someone other than you, you are nevertheless required to ensure that your Modified Version complies with the requirements of this license. (12) This license does not grant you the right to use any trademark, service mark, tradename, or logo of the Copyright Holder. (13) This license includes the non-exclusive, worldwide, free-of-charge patent license to make, have made, use, offer to sell, sell, import and otherwise transfer the Package with respect to any patent claims licensable by the Copyright Holder that are necessarily infringed by the Package. 
If you institute patent litigation (including a cross-claim or counterclaim) against any party alleging that the Package constitutes direct or contributory patent infringement, then this Artistic License to you shall terminate on the date that such litigation is filed. (14) Disclaimer of Warranty: THE PACKAGE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS "AS IS' AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES. THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT ARE DISCLAIMED TO THE EXTENT PERMITTED BY YOUR LOCAL LAW. UNLESS REQUIRED BY LAW, NO COPYRIGHT HOLDER OR CONTRIBUTOR WILL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE OF THE PACKAGE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. String-Approx-3.27/MANIFEST0000644010002300116100000000046612077532257013425 0ustar jhiengApprox.pm Approx.xs apse.c apse.h Artistic ChangeLog COPYRIGHT COPYRIGHT.agrep LGPL Makefile.PL MANIFEST PROBLEMS README README.apse t/adist.t t/aindex.t t/amatch.t t/aslice.t t/arindex.t t/asubst.t t/user.t t/util t/words typemap META.yml Module meta-data (added by MakeMaker) String-Approx-3.27/Approx.pm0000644010002300116100000006217512077535500014102 0ustar jhiengpackage String::Approx; require v5.8.0; $VERSION = '3.27'; use strict; local $^W = 1; use Carp; use vars qw($VERSION @ISA @EXPORT @EXPORT_OK); require Exporter; require DynaLoader; @ISA = qw(Exporter DynaLoader); @EXPORT_OK = qw(amatch asubstitute aindex aslice arindex adist adistr adistword adistrword); bootstrap String::Approx $VERSION; my $CACHE_MAX = 1000; # high water mark my $CACHE_PURGE = 0.75; # purge this much of the least used my $CACHE_N_PURGE; # purge this many of the least used sub cache_n_purge () { $CACHE_N_PURGE = $CACHE_MAX * $CACHE_PURGE; $CACHE_N_PURGE = 1 if $CACHE_N_PURGE < 1; return $CACHE_N_PURGE; } cache_n_purge(); sub cache_max (;$) { if (@_ == 0) { return $CACHE_MAX; } else { $CACHE_MAX = 
shift; } $CACHE_MAX = 0 if $CACHE_MAX < 0; cache_n_purge(); } sub cache_purge (;$) { if (@_ == 0) { return $CACHE_PURGE; } else { $CACHE_PURGE = shift; } if ($CACHE_PURGE < 0) { $CACHE_PURGE = 0; } elsif ($CACHE_PURGE > 1) { $CACHE_PURGE = 1; } cache_n_purge(); } my %_simple; my %_simple_usage_count; sub _cf_simple { my $P = shift; my @usage = sort { $_simple_usage_count{$a} <=> $_simple_usage_count{$b} } grep { $_ ne $P } keys %_simple_usage_count; # Make room, delete the least used entries. $#usage = $CACHE_N_PURGE - 1; delete @_simple_usage_count{@usage}; delete @_simple{@usage}; } sub _simple { my $P = shift; my $_simple = new(__PACKAGE__, $P); if ($CACHE_MAX) { $_simple{$P} = $_simple unless exists $_simple{$P}; $_simple_usage_count{$P}++; if (keys %_simple_usage_count > $CACHE_MAX) { _cf_simple($P); } } return ( $_simple ); } sub _parse_param { use integer; my ($n, @param) = @_; my %param; foreach (@param) { while ($_ ne '') { s/^\s+//; if (s/^([IDS]\s*)?(\d+)(\s*%)?//) { my $k = defined $3 ? (($2-1) * $n) / 100 + ($2 ? 1 : 0) : $2; if (defined $1) { $param{$1} = $k; } else { $param{k} = $k; } } elsif (s/^initial_position\W+(\d+)\b//) { $param{'initial_position'} = $1; } elsif (s/^final_position\W+(\d+)\b//) { $param{'final_position'} = $1; } elsif (s/^position_range\W+(\d+)\b//) { $param{'position_range'} = $1; } elsif (s/^minimal_distance\b//) { $param{'minimal_distance'} = 1; } elsif (s/^i//) { $param{ i } = 1; } elsif (s/^g//) { $param{ g } = 1; } elsif (s/^\?//) { $param{'?'} = 1; } else { warn "unknown parameter: '$_'\n"; return; } } } return %param; } my %_param_key; my %_parsed_param; my %_complex; my %_complex_usage_count; sub _cf_complex { my $P = shift; my @usage = sort { $_complex_usage_count{$a} <=> $_complex_usage_count{$b} } grep { $_ ne $P } keys %_complex_usage_count; # Make room, delete the least used entries. 
$#usage = $CACHE_N_PURGE - 1; delete @_complex_usage_count{@usage}; delete @_complex{@usage}; } sub _complex { my ($P, @param) = @_; unshift @param, length $P; my $param = "@param"; my $_param_key; my %param; my $complex; my $is_new; unless (exists $_param_key{$param}) { %param = _parse_param(@param); $_parsed_param{$param} = { %param }; $_param_key{$param} = join(" ", %param); } else { %param = %{ $_parsed_param{$param} }; } $_param_key = $_param_key{$param}; if ($CACHE_MAX) { if (exists $_complex{$P}->{$_param_key}) { $complex = $_complex{$P}->{$_param_key}; } } unless (defined $complex) { if (exists $param{'k'}) { $complex = new(__PACKAGE__, $P, $param{k}); } else { $complex = new(__PACKAGE__, $P); } $_complex{$P}->{$_param_key} = $complex if $CACHE_MAX; $is_new = 1; } if ($is_new) { $complex->set_greedy unless exists $param{'?'}; $complex->set_insertions($param{'I'}) if exists $param{'I'}; $complex->set_deletions($param{'D'}) if exists $param{'D'}; $complex->set_substitutions($param{'S'}) if exists $param{'S'}; $complex->set_caseignore_slice if exists $param{'i'}; $complex->set_text_initial_position($param{'initial_position'}) if exists $param{'initial_position'}; $complex->set_text_final_position($param{'final_position'}) if exists $param{'final_position'}; $complex->set_text_position_range($param{'position_range'}) if exists $param{'position_range'}; $complex->set_minimal_distance($param{'minimal_distance'}) if exists $param{'minimal_distance'}; } if ($CACHE_MAX) { $_complex_usage_count{$P}->{$_param_key}++; # If our cache overfloweth. if (scalar keys %_complex_usage_count > $CACHE_MAX) { _cf_complex($P); } } return ( $complex, %param ); } sub cache_disable { cache_max(0); } sub cache_flush_all { my $old_purge = cache_purge(); cache_purge(1); _cf_simple(''); _cf_complex(''); cache_purge($old_purge); } sub amatch { my $P = shift; return 1 unless length $P; my $a = ((@_ && ref $_[0] eq 'ARRAY') ? 
_complex($P, @{ shift(@_) }) : _simple($P))[0]; if (@_) { if (wantarray) { return grep { $a->match($_) } @_; } else { foreach (@_) { return 1 if $a->match($_); } return 0; } } if (defined $_) { if (wantarray) { return $a->match($_) ? $_ : undef; } else { return 1 if $a->match($_); } } return $a->match($_) if defined $_; warn "amatch: \$_ is undefined: what are you matching?\n"; return; } sub _find_substitute { my ($ri, $rs, $i, $s, $S, $rn) = @_; push @{ $ri }, $i; push @{ $rs }, $s; my $pre = substr($_, 0, $i); my $old = substr($_, $i, $s); my $suf = substr($_, $i + $s); my $new = $S; $new =~ s/\$\`/$pre/g; $new =~ s/\$\&/$old/g; $new =~ s/\$\'/$suf/g; push @{ $rn }, $new; } sub _do_substitute { my ($rn, $ri, $rs, $rS) = @_; my $d = 0; my $n = $_; foreach my $i (0..$#$rn) { substr($n, $ri->[$i] + $d, $rs->[$i]) = $rn->[$i]; $d += length($rn->[$i]) - $rs->[$i]; } push @{ $rS }, $n; } sub asubstitute { my $P = shift; my $S = shift; my ($a, %p) = (@_ && ref $_[0] eq 'ARRAY') ? _complex($P, @{ shift(@_) }) : _simple($P); my ($i, $s, @i, @s, @n, @S); if (@_) { if (exists $p{ g }) { foreach (@_) { @s = @i = @n = (); while (($i, $s) = $a->slice_next($_)) { if (defined $i) { _find_substitute(\@i, \@s, $i, $s, $S, \@n); } } _do_substitute(\@n, \@i, \@s, \@S) if @n; } } else { foreach (@_) { @s = @i = @n = (); ($i, $s) = $a->slice($_); if (defined $i) { _find_substitute(\@i, \@s, $i, $s, $S, \@n); _do_substitute(\@n, \@i, \@s, \@S); } } } return @S; } elsif (defined $_) { if (exists $p{ g }) { while (($i, $s) = $a->slice_next($_)) { if (defined $i) { _find_substitute(\@i, \@s, $i, $s, $S, \@n); } } _do_substitute(\@n, \@i, \@s, \@S) if @n; } else { ($i, $s) = $a->slice($_); if (defined $i) { _find_substitute(\@i, \@s, $i, $s, $S, \@n); _do_substitute(\@n, \@i, \@s, \@S); } } return $_ = $n[0]; } else { warn "asubstitute: \$_ is undefined: what are you substituting?\n"; return; } } sub aindex { my $P = shift; return 0 unless length $P; my $a = ((@_ && ref $_[0] eq 'ARRAY') ? 
_complex($P, @{ shift(@_) }) : _simple($P))[0]; $a->set_greedy; # The *first* match, thank you. if (@_) { if (wantarray) { return map { $a->index($_) } @_; } else { return $a->index($_[0]); } } return $a->index($_) if defined $_; warn "aindex: \$_ is undefined: what are you indexing?\n"; return; } sub aslice { my $P = shift; return (0, 0) unless length $P; my $a = ((@_ && ref $_[0] eq 'ARRAY') ? _complex($P, @{ shift(@_) }) : _simple($P))[0]; $a->set_greedy; # The *first* match, thank you. if (@_) { return map { [ $a->slice($_) ] } @_; } return $a->slice($_) if defined $_; warn "aslice: \$_ is undefined: what are you slicing?\n"; return; } sub _adist { my $s0 = shift; my $s1 = shift; my ($aslice) = aslice($s0, ['minimal_distance', @_], $s1); my ($index, $size, $distance) = @$aslice; my ($l0, $l1) = map { length } ($s0, $s1); return $l0 <= $l1 ? $distance : -$distance; } sub adist { my $a0 = shift; my $a1 = shift; if (length($a0) == 0) { return length($a1); } if (length($a1) == 0) { return length($a0); } my @m = ref $_[0] eq 'ARRAY' ? @{shift()} : (); if (ref $a0 eq 'ARRAY') { if (ref $a1 eq 'ARRAY') { return [ map { adist($a0, $_, @m) } @{$a1} ]; } else { return [ map { _adist($_, $a1, @m) } @{$a0} ]; } } elsif (ref $a1 eq 'ARRAY') { return [ map { _adist($a0, $_, @m) } @{$a1} ]; } else { if (wantarray) { return map { _adist($a0, $_, @m) } ($a1, @_); } else { return _adist($a0, $a1, @m); } } } sub adistr { my $a0 = shift; my $a1 = shift; my @m = ref $_[0] eq 'ARRAY' ? shift : (); if (ref $a0 eq 'ARRAY') { if (ref $a1 eq 'ARRAY') { my $l0 = length(); return $l0 ? [ map { adist($a0, $_, @m) } @{$a1} ] : [ ]; } else { return [ map { my $l0 = length(); $l0 ? _adist($_, $a1, @m) / $l0 : undef } @{$a0} ]; } } elsif (ref $a1 eq 'ARRAY') { my $l0 = length($a0); return [] unless $l0; return [ map { _adist($a0, $_, @m) / $l0 } @{$a1} ]; } else { my $l0 = length($a0); if (wantarray) { return map { $l0 ? 
                _adist($a0, $_, @m) / $l0 : undef } ($a1, @_);
        } else {
            return undef unless $l0;
            return _adist($a0, $a1, @m) / $l0;
        }
    }
}

sub adistword {
    return adist($_[0], $_[1], ['position_range=0']);
}

sub adistrword {
    return adistr($_[0], $_[1], ['position_range=0']);
}

sub arindex {
    my $P = shift;
    my $l = length $P;
    return 0 unless $l;
    my $R = reverse $P;
    my $a = ((@_ && ref $_[0] eq 'ARRAY') ?
                 _complex($R, @{ shift(@_) }) : _simple($R))[0];

    $a->set_greedy; # The *first* match, thank you.

    if (@_) {
        if (wantarray) {
            return map {
                my $aindex = $a->index(scalar reverse());
                $aindex == -1 ? $aindex : (length($_) - $aindex - $l);
            } @_;
        } else {
            my $aindex = $a->index(scalar reverse $_[0]);
            return $aindex == -1 ? $aindex : (length($_[0]) - $aindex - $l);
        }
    }
    if (defined $_) {
        my $aindex = $a->index(scalar reverse());
        return $aindex == -1 ? $aindex : (length($_) - $aindex - $l);
    }

    warn "arindex: \$_ is undefined: what are you indexing?\n";
    return;
}

1;
__END__

=pod

=head1 NAME

String::Approx - Perl extension for approximate matching (fuzzy matching)

=head1 SYNOPSIS

    use String::Approx 'amatch';

    print if amatch("foobar");

    my @matches = amatch("xyzzy", @inputs);

    my @catches = amatch("plugh", ['2'], @inputs);

=head1 DESCRIPTION

String::Approx lets you match and substitute strings approximately.
With this you can emulate errors: typing errorrs, speling errors,
closely related vocabularies (colour color), genetic mutations (GAG
ACT), abbreviations (McScot, MacScot).

NOTE: String::Approx suits the task of B<string matching>, not
B<string comparison>, and it works for B<strings>, not for B<text>.

If you want to compare strings for similarity, you probably just want
the Levenshtein edit distance (explained below), as provided by the
Text::Levenshtein and Text::LevenshteinXS modules on CPAN. See also
Text::WagnerFischer and Text::PhraseDistance. (There are functions
for this in String::Approx, e.g. adist(), but their results sometimes
differ from the bare Levenshtein et al.)
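As a rough illustration of the plain Levenshtein edit distance mentioned above, here is a minimal sketch in C (the language of the bundled apse.c engine). It is the textbook dynamic-programming formulation, not the algorithm String::Approx itself uses, and its results may differ from adist(). The names levenshtein, min3 and LEV_MAX_LEN are hypothetical and exist only for this example; the fixed length limit is an assumption made to keep the sketch free of memory allocation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical limit for this sketch only: longer strings are rejected. */
#define LEV_MAX_LEN 255

/* Smallest of three values. */
static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/*
 * Plain Levenshtein distance between s and t: the minimum number of
 * single-character insertions, deletions and substitutions needed to
 * turn s into t.  Returns -1 if either string exceeds LEV_MAX_LEN.
 * Hypothetical helper, not part of String::Approx or apse.
 */
int levenshtein(const char *s, const char *t)
{
    size_t n = strlen(s), m = strlen(t);
    size_t row[LEV_MAX_LEN + 1];   /* one row of the distance table */
    size_t i, j;

    if (n > LEV_MAX_LEN || m > LEV_MAX_LEN)
        return -1;

    /* Distance from the empty prefix of s to each prefix of t. */
    for (j = 0; j <= m; j++)
        row[j] = j;

    for (i = 1; i <= n; i++) {
        size_t diag = row[0];      /* value from the previous row, column j-1 */
        row[0] = i;
        for (j = 1; j <= m; j++) {
            size_t up = row[j];    /* value from the previous row, column j */
            size_t cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            row[j] = min3(up + 1,          /* deletion */
                          row[j - 1] + 1,  /* insertion */
                          diag + cost);    /* match or substitution */
            diag = up;
        }
    }
    return (int)row[m];
}
```

With this sketch, levenshtein("lead", "gold") gives 3 and levenshtein("feet", "fete") gives 2. Unlike amatch(), it compares whole strings; it does not look for an approximately matching substring.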
If you want to compare things like text or source code, consisting of
B<words> or B<tokens> and B<phrases> and B<sentences>, or
B<expressions> and B<statements>, you should probably use some other
tool than String::Approx, for example the standard UNIX diff(1)
tool, or the Algorithm::Diff module from CPAN.

The measure of B<approximateness> is the I<Levenshtein edit distance>.
It is the total number of "edits": insertions,

    word world

deletions,

    monkey money

and substitutions

    sun fun

required to transform a string into another string. For example, to
transform I<"lead"> into I<"gold">, you need three edits:

    lead gead goad gold

The edit distance of "lead" and "gold" is therefore three, or 75%.

B<String::Approx> uses the Levenshtein edit distance as its measure,
but String::Approx is not well-suited for comparing strings of
different length. In other words, if you want a "fuzzy eq", see above.
String::Approx is more like regular expressions or index(): it finds
substrings that are close matches.

=head1 MATCH

    use String::Approx 'amatch';

    $matched     = amatch("pattern")
    $matched     = amatch("pattern", [ modifiers ])

    $any_matched = amatch("pattern", @inputs)
    $any_matched = amatch("pattern", [ modifiers ], @inputs)

    @match       = amatch("pattern")
    @match       = amatch("pattern", [ modifiers ])

    @matches     = amatch("pattern", @inputs)
    @matches     = amatch("pattern", [ modifiers ], @inputs)

Match B<pattern> approximately. In list context return the matched
B<@inputs>. If no inputs are given, match against the B<$_>. In scalar
context return true if I<any> of the inputs match, false if none match.

Notice that the pattern is a string, not a regular expression. None of
the regular expression notations (^, ., *, and so on) work. They are
characters just like the others. Note-on-note: some limited form of
I<"regular expressionism"> is planned in future: for example character
classes ([abc]) and I<any-chars> (.). But that feature will be turned
on by a special I<modifier> (just a guess: "r"), so there should be no
backward compatibility problem.

Notice also that matching is not symmetric.
The inputs are matched against the pattern, not the other way round.
In other words: the pattern can be a substring, a submatch, of an
input element. An input element is always a superstring of the
pattern.

=head2 MODIFIERS

With the modifiers you can control the amount of approximateness and
certain other control variables. The modifiers are one or more
strings, for example B<"i">, within a string optionally separated by
whitespace. The modifiers are inside an anonymous array: the B<[ ]> in
the syntax are not notational, they really do mean B<[ ]>, for example
B<[ "i", "2" ]>. B<["2 i"]> would be identical.

The implicit default approximateness is 10%, rounded up. In other
words: every tenth character in the pattern may be an error, an edit.
You can explicitly set the maximum approximateness by supplying a
modifier like

    number
    number%

Examples: B<"3">, B<"15%">.

Note that C<0%> is not rounded up, it is equal to C<0>.

Using a similar syntax you can separately control the maximum number
of insertions, deletions, and substitutions by prefixing the numbers
with I, D, or S, like this:

    Inumber
    Inumber%
    Dnumber
    Dnumber%
    Snumber
    Snumber%

Examples: B<"I2">, B<"D20%">, B<"S0">.

You can ignore case (B<"A"> becomes equal to B<"a"> and vice versa) by
adding the B<"i"> modifier.

For example

    [ "i 25%", "S0" ]

means I<ignore case>, I<allow every fourth character to be "an edit">,
but allow I<no substitutions>. (See L<NOTES> about disallowing
substitutions or insertions.)

NOTE: setting C<I0 D0 S0> is not equivalent to using index().
If you want to use index(), use index().

=head1 SUBSTITUTE

    use String::Approx 'asubstitute';

    @substituted = asubstitute("pattern", "replacement")
    @substituted = asubstitute("pattern", "replacement", @inputs)
    @substituted = asubstitute("pattern", "replacement", [ modifiers ])
    @substituted = asubstitute("pattern", "replacement",
                               [ modifiers ], @inputs)

Substitute approximate B<pattern> with B<replacement> and return, as a
list, copies of the B<@inputs> elements that matched the pattern, with
the substitutions made. If no inputs are given, substitute in the
B<$_>.
The replacement can contain magic strings B<$&>, B<$`>, B<$'> that
stand for the matched string, the string before it, and the string
after it, respectively. All the other arguments are as in
C<amatch()>, plus one additional modifier, B<"g">, which means
substitute globally (all the matches in an element and not just the
first one, as is the default).

See L<BAD NEWS> about the unfortunate stinginess of C<asubstitute()>.

=head1 INDEX

    use String::Approx 'aindex';

    $index   = aindex("pattern")
    @indices = aindex("pattern", @inputs)
    $index   = aindex("pattern", [ modifiers ])
    @indices = aindex("pattern", [ modifiers ], @inputs)

Like C<amatch()> but returns the index/indices at which the pattern
matches approximately. In list context and if C<@inputs> are used,
returns a list of indices, one index for each input element.
If there's no approximate match, C<-1> is returned as the index.

NOTE: if there is character repetition (e.g. "aa") either in
the pattern or in the text, the returned index might start
"too early". This is consistent with the goal of the module
of matching "as early as possible", just like regular expressions
(that there might be a "less approximate" match starting later is
somewhat irrelevant).

There's also backwards-scanning C<arindex()>.

=head1 SLICE

    use String::Approx 'aslice';

    ($index, $size)   = aslice("pattern")
    ([$i0, $s0], ...) = aslice("pattern", @inputs)
    ($index, $size)   = aslice("pattern", [ modifiers ])
    ([$i0, $s0], ...) = aslice("pattern", [ modifiers ], @inputs)

Like C<aindex()> but returns also the size (length) of the match.
If the match fails, returns an empty list (when matching against
C<$_>) or an empty anonymous list corresponding to the particular
input.

NOTE: size of the match will very probably be something you did not
expect (such as longer than the pattern, or a negative number). This
may or may not be fixed in future releases. Also the beginning of the
match may vary from the expected as with aindex(), see above.
If the modifier "minimal_distance" is used, the minimal possible edit
distance is returned as the third element:

    ($index, $size, $distance) = aslice("pattern", [ modifiers ])
    ([$i0, $s0, $d0], ...)     = aslice("pattern", [ modifiers ], @inputs)

=head1 DISTANCE

    use String::Approx 'adist';

    $dist = adist("pattern", $input);
    @dist = adist("pattern", @input);

Return the I<edit distance> or distances between the pattern and the
input or inputs. Zero edit distance means exact match. (Remember that
the match can 'float' in the inputs, the match is a substring match.)
If the pattern is longer than the input or inputs, the returned
distance or distances is or are negative.

    use String::Approx 'adistr';

    $dist = adistr("pattern", $input);
    @dist = adistr("pattern", @inputs);

Return the B<relative> I<edit distance> or distances between the
pattern and the input or inputs. Zero relative edit distance means
exact match, one means completely different. (Remember that the
match can 'float' in the inputs, the match is a substring match.) If
the pattern is longer than the input or inputs, the returned distance
or distances is or are negative.

You can use adist() or adistr() to sort the inputs according to their
approximateness:

    my %d;
    @d{@inputs} = map { abs } adistr("pattern", @inputs);
    my @d = sort { $d{$a} <=> $d{$b} } @inputs;

Now C<@d> contains the inputs, the most like C<"pattern"> first.

=head1 CONTROLLING THE CACHE

C<String::Approx> maintains a LU (least-used) cache that holds the
'matching engines' for each instance of a I<pattern+modifiers>. The
cache is intended to help the case where you match a small set of
patterns against a large set of strings. However, the more engines you
cache, the more memory you use.

If you have a lot of different patterns or if you have a lot of memory
to burn, you may want to control the cache yourself. For example,
allowing a larger cache consumes more memory but probably runs a
little bit faster since the cache fills (and needs flushing) less
often.

The cache has two parameters: I<max> and I<purge>.
The first one is the maximum size of the cache and the second one is
the cache flushing ratio: when the number of cache entries exceeds
I<max>, I<max> times I<purge> cache entries are flushed. The default
values are 1000 and 0.75, respectively, which means that when the
1001st entry would be cached, the 750 least used entries will be
removed from the cache. To access the parameters you can use the calls

    $now_max = String::Approx::cache_max();
    String::Approx::cache_max($new_max);

    $now_purge = String::Approx::cache_purge();
    String::Approx::cache_purge($new_purge);

    $limit = String::Approx::cache_n_purge();

To be honest, there are actually B<two> caches: the first one is used
for the patterns with no modifiers, the second one for the patterns
with pattern modifiers. Using the standard parameters you will
therefore actually cache up to 2000 entries. The above calls control
both caches for the same price.

To disable caching completely use

    String::Approx::cache_disable();

Note that this doesn't flush any possibly existing cache entries; to
do that use

    String::Approx::cache_flush_all();

=head1 NOTES

Because matching is by I<substrings>, not by whole strings, insertions
and substitutions often produce very similar results: "abcde" matches
"axbcde" either by insertion B<or> substitution of "x".

The maximum edit distance is also the maximum number of edits.
That is, the B<"I2"> in

    amatch("abcd", ["I2"])

is useless because the maximum edit distance is (implicitly) 1.
You may have meant to say

    amatch("abcd", ["2D1S1"])

or something like that.

If you want to simulate transposes

    feet fete

you need to allow at least edit distance of two because in terms of
our edit primitives a transpose is first one deletion and then one
insertion.
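The interaction between the implicit default and a per-operation limit such as "I2" can be sketched in C as follows. This is only an illustration under stated assumptions: it assumes that "10%, rounded up" is a plain ceiling of one tenth of the pattern length, and the names default_max_edits and effective_limit are hypothetical, not part of String::Approx or apse. The clamping step mirrors what apse_set_insertions(), apse_set_deletions() and apse_set_substitutions() do in apse.c, where a requested limit larger than the edit distance is reduced to the edit distance.

```c
#include <stddef.h>

/*
 * Hypothetical helper: default maximum number of edits for a pattern
 * of the given length, assuming "10%, rounded up" means the ceiling
 * of pattern_length / 10.
 */
size_t default_max_edits(size_t pattern_length)
{
    return (pattern_length + 9) / 10;
}

/*
 * Hypothetical helper: a per-operation limit (insertions, deletions
 * or substitutions) can never exceed the overall maximum number of
 * edits, so it is capped at that maximum.
 */
size_t effective_limit(size_t requested, size_t max_edits)
{
    return requested > max_edits ? max_edits : requested;
}
```

For the four-character pattern "abcd", default_max_edits(4) is 1, so effective_limit(2, 1) is also 1: asking for two insertions has no effect unless the overall maximum is raised as well.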
=head2 TEXT POSITION

The starting and ending positions of matching, substituting, indexing,
or slicing can be changed from the beginning and end of the input(s)
to some other positions by using either or both of the modifiers

    "initial_position=24"
    "final_position=42"

or both of the modifiers

    "initial_position=24"
    "position_range=10"

By setting the B<"position_range"> to be zero you can limit (anchor)
the operation to happen only once (if a match is possible) at the
position.

=head1 VERSION

Major release 3.

=head1 CHANGES FROM VERSION 2

=head2 GOOD NEWS

=over 4

=item The version 3 is 2-3 times faster than version 2

=item No pattern length limitation

The algorithm is independent of the pattern length: its time
complexity is I<O(kn)>, where I<k> is the number of edits and I<n> the
length of the text (input). The preprocessing of the pattern will of
course take some I<O(m)> (I<m> being the pattern length) time, but
C<amatch()> and C<asubstitute()> cache the result of this
preprocessing so that it is done only once per pattern.

=back

=head2 BAD NEWS

=over 4

=item You do need a C compiler to install the module

Perl's regular expressions are no longer used; instead a faster and
more scalable algorithm written in C is used.

=item C<asubstitute()> is now always stingy

The string matched and substituted is now always stingy, as short
as possible. It used to be as long as possible. This is an unfortunate
change stemming from switching the matching algorithm. Example: with
edit distance of two and substituting for B<"word"> from B<"cork"> and
B<"wool"> previously did match B<"cork"> and B<"wool">. Now it does
match B<"or"> and B<"wo">. As little as possible, or, in other words,
with as much approximateness, as many edits, as possible. Because
there is no I<need> to match the B<"c"> of B<"cork">, it is not
matched.
=item no more C<aregex()> because regular expressions are no longer used

=item no more C<compat1> for String::Approx version 1 compatibility

=back

=head1 ACKNOWLEDGEMENTS

The following people have provided valuable test cases, documentation
clarifications, and other feedback:

Jared August, Arthur Bergman, Anirvan Chatterjee, Steve A. Chervitz,
Aldo Calpini, David Curiel, Teun van den Dool, Alberto Fontaneda,
Rob Fugina, Dmitrij Frishman, Lars Gregersen, Kevin Greiner,
B. Elijah Griffin, Mike Hanafey, Mitch Helle, Ricky Houghton,
'idallen', Helmut Jarausch, Damian Keefe, Ben Kennedy, Craig Kelley,
Franz Kirsch, Dag Kristian, Mark Land, J. D. Laub, John P. Linderman,
Tim Maher, Juha Muilu, Sergey Novoselov, Andy Oram, Ji Y Park,
Eric Promislow, Nikolaus Rath, Stefan Ram, Slaven Rezic,
Dag Kristian Rognlien, Stewart Russell, Slaven Rezic, Chris Rosin,
Pasha Sadri, Ilya Sandler, Bob J.A. Schijvenaars, Ross Smith,
Frank Tobin, Greg Ward, Rich Williams, Rick Wise.

The matching algorithm was developed by Udi Manber, Sun Wu, and Burra
Gopal in the Department of Computer Science, University of Arizona.

=head1 AUTHOR

Jarkko Hietaniemi

=head1 COPYRIGHT AND LICENSE

Copyright 2001-2013 by Jarkko Hietaniemi

This library is free software; you can redistribute it and/or modify
it under either the terms of the Artistic License 2.0, or the GNU
Library General Public License, Version 2. See the files Artistic and
LGPL for more details.

Furthermore: no warranties or obligations of any kind are given, and
the separate file F<COPYRIGHT> must be included intact in all copies
and derived materials.
=cut String-Approx-3.27/typemap0000644010002300116100000000125006633044665013670 0ustar jhiengTYPEMAP apse_t * O_OBJECT apse_bool_t T_IV apse_size_t T_UV apse_ssize_t T_IV ########################################################################### OUTPUT O_OBJECT sv_setref_pv( $arg, CLASS, (void*)$var ); T_UV $var = ($type)SvUV($arg) ########################################################################### INPUT O_OBJECT if( sv_isobject($arg) && (SvTYPE(SvRV($arg)) == SVt_PVMG) ) $var = ($type)SvIV((SV*)SvRV( $arg )); else{ warn( \"${Package}::$func_name() -- $var is not a blessed SV reference\" ); XSRETURN_UNDEF; } T_UV sv_setuv($arg, (UV)$var); ########################################################################### String-Approx-3.27/COPYRIGHT0000444010002300116100000000134206642366304013556 0ustar jhiengCopyright (C) Jarkko Hietaniemi, 1998. All Rights Reserved. This program is free software; you can redistribute it and/or modify it under the terms of either: a) the GNU Library General Public License as published by the Free Software Foundation; either version 1, or (at your option) any later version, or b) the "Artistic License"; the original version of which is released with the Perl distribution. These licenses come with the software as files "LGPL" and "Artistic". Other free software licensing schemes are negotiable. Furthermore: (1) This software is provided as-is, without warranties or obligations of any kind. (2) You shall include this copyright notice intact in all copies and derived materials. String-Approx-3.27/apse.c0000644010002300116100000011732410416201327013354 0ustar jhieng/* Copyright (C) by Jarkko Hietaniemi, 1998,1999,2000,2001,2002,2003,2006. All Rights Reserved. 
This program is free software; you can redistribute it and/or modify
it under the terms of either:

a) the GNU Library General Public License as published by the Free
   Software Foundation; either version 2, or (at your option) any
   later version, or

b) the "Artistic License" which comes with Perl source code.

Other free software licensing schemes are negotiable.

Furthermore:

(1) This software is provided as-is, without warranties or
    obligations of any kind.

(2) You shall include this copyright notice intact in all copies
    and derived materials.

*/

/*

  $Id: apse.c,v 1.1 1999/06/23 16:09:13 jhi Exp jhi $

*/

#include "apse.h"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <assert.h>

#define APSE_BITS_IN_BITVEC	(8*sizeof(apse_vec_t))

#define APSE_CHAR_MAX		256

#ifdef APSE_DEBUGGING
#define APSE_DEBUG(x) x
#else
#define APSE_DEBUG(x)
#endif

#define APSE_BIT(i)		((apse_vec_t)1 << ((i)%APSE_BITS_IN_BITVEC))
#define APSE_IDX(p, q, i)	((p)*(q)+((i)/APSE_BITS_IN_BITVEC))
#define APSE_BIT_SET(bv, p, q, i) ((bv[APSE_IDX(p, q, i)] |=  APSE_BIT(i)))
#define APSE_BIT_CLR(bv, p, q, i) ((bv[APSE_IDX(p, q, i)] &= ~APSE_BIT(i)))
#define APSE_BIT_TST(bv, p, q, i) ((bv[APSE_IDX(p, q, i)] &   APSE_BIT(i)))

#define APSE_MATCH_STATE_BOT		0
#define APSE_MATCH_STATE_SEARCH		1
#define APSE_MATCH_STATE_BEGIN		2
#define APSE_MATCH_STATE_FAIL		3
#define APSE_MATCH_STATE_GREEDY		4
#define APSE_MATCH_STATE_END		5
#define APSE_MATCH_STATE_EOT		6

#define APSE_TEST_HIGH_BIT(i) \
	(((i) & ((apse_vec_t)1 << (APSE_BITS_IN_BITVEC - 1))) ? \
1 : 0) /* In case you are reading the TR 91-11 of University of Arizona, page 6: * j+1 state * j prev_state * d i * d-1 prev_i */ #define APSE_NEXT_EXACT(state, prev_state, text, i, carry) \ (state[i] = ((prev_state[i] << 1 | carry) & text)) #define APSE_NEXT_APPROX(state, prev_state, text, i, prev_i, carry) \ (state[i] = (((prev_state[i] << 1) & text) | \ prev_state[prev_i] | \ ((state[prev_i] | prev_state[prev_i]) << 1) | \ carry)) #define APSE_NEXT_COMMON(state, prev_state, text, i) \ (state[i] = (prev_state[i] << 1) & text) #define APSE_NEXT_INSERT(state, prev_state, i, prev_i) \ (state[i] |= prev_state[prev_i]) #define APSE_NEXT_DELETE(state, i, prev_i) \ (state[i] |= (state[prev_i] << 1)) #define APSE_NEXT_SUBSTI(state, prev_state, i, prev_i) \ (state[i] |= (prev_state[prev_i] << 1)) #define APSE_NEXT_CARRY(state, i, carry) \ (state[i] |= carry) #define APSE_EXACT_MATCH_BEGIN(ap) (ap->state[0] & 1) #define APSE_APPROX_MATCH_BEGIN(ap) \ (ap->state[ap->largest_distance + ap->match_begin_bitvector] > \ ap->match_begin_prefix && \ ap->state[ap->largest_distance + ap->match_begin_bitvector] & \ ap->match_begin_prefix) #define APSE_PREFIX_DELETE_MASK(ap) \ do { if (ap->edit_deletions < ap->edit_distance && \ ap->text_position < ap->edit_distance) \ ap->state[h] &= ap->match_begin_bitmask; } while (0) #define APSE_DEBUG_SINGLE(ap, i) \ APSE_DEBUG(printf("%c %2ld %2ld %s\n", \ isprint(ap->text[ap->text_position])? \ ap->text[ap->text_position]:'.', \ ap->text_position, i, \ _apse_fbin(ap->state[i], \ ap->pattern_size, 1))) #define APSE_DEBUG_MULTIPLE_FIRST(ap, i) \ APSE_DEBUG(printf("%c %2ld %2ld", \ isprint(ap->text[ap->text_position])? 
\ ap->text[ap->text_position]:'.', \ ap->text_position, i)) #define APSE_DEBUG_MULTIPLE_REST(ap, i, j) \ APSE_DEBUG(printf(" %s", \ _apse_fbin(ap->state[j], \ ap->pattern_size, \ i == ap->bitvectors_in_state-1))) #ifdef APSE_DEBUGGING static char *_apse_fbin(apse_vec_t v, apse_size_t n, apse_bool_t last); static char *_apse_fbin(apse_vec_t v, apse_size_t n, apse_bool_t last) { static char s[APSE_BITS_IN_BITVEC + 1] = { 0 }; /* non-reentrant */ if (v) { static const char *b = "0000100001001100001010100110111000011001010111010011101101111111"; apse_size_t i; for (i = 0; i < APSE_BITS_IN_BITVEC && i < n && v; i += 4) { (void)memcpy(s + i, b + ((v & 0x0f) << 2), (size_t)4); v >>= 4; } if (i < APSE_BITS_IN_BITVEC) memset(s + i, '0', APSE_BITS_IN_BITVEC - i); } else memset(s, '0', APSE_BITS_IN_BITVEC); if (last) s[n % APSE_BITS_IN_BITVEC] = 0; return s; } #endif /* The code begins. */ apse_bool_t apse_set_pattern(apse_t* ap, unsigned char* pattern, apse_size_t pattern_size) { apse_size_t i; if (ap->case_mask) free(ap->case_mask); if (ap->fold_mask) free(ap->fold_mask); ap->pattern_mask = 0; ap->fold_mask = 0; ap->case_mask = 0; ap->is_greedy = 0; ap->prev_equal = 0; ap->prev_active = 0; ap->pattern_size = pattern_size; ap->bitvectors_in_state = (pattern_size - 1)/APSE_BITS_IN_BITVEC + 1; if (ap->edit_distance) ap->largest_distance = ap->edit_distance * ap->bitvectors_in_state; else ap->largest_distance = 0; ap->bytes_in_state = ap->bitvectors_in_state * sizeof(apse_vec_t); ap->case_mask = calloc((apse_size_t)APSE_CHAR_MAX, ap->bytes_in_state); if (ap->case_mask == 0) goto out; for (i = 0; i < pattern_size; i++) APSE_BIT_SET(ap->case_mask, (unsigned)pattern[i], ap->bitvectors_in_state, i); ap->pattern_mask = ap->case_mask; ap->match_end_bitmask = (apse_vec_t)1 << ((pattern_size - 1) % APSE_BITS_IN_BITVEC); out: if (ap && ap->case_mask) return 1; else { if (ap->case_mask) free(ap->case_mask); if (ap) free(ap); return 0; } } void apse_set_greedy(apse_t *ap, apse_bool_t 
greedy) { ap->is_greedy = greedy; } apse_bool_t apse_get_greedy(apse_t *ap) { return ap->is_greedy; } void apse_set_match_bot_callback(apse_t *ap, void* (*match_bot_callback)(apse_t* ap)) { ap->match_bot_callback = match_bot_callback; } void apse_set_match_begin_callback(apse_t *ap, void* (*match_begin_callback)(apse_t* ap)) { ap->match_begin_callback = match_begin_callback; } void apse_set_match_fail_callback(apse_t *ap, void* (*match_fail_callback)(apse_t* ap)) { ap->match_fail_callback = match_fail_callback; } void apse_set_match_end_callback(apse_t *ap, void* (*match_end_callback)(apse_t* ap)) { ap->match_end_callback = match_end_callback; } void apse_set_match_eot_callback(apse_t *ap, void* (*match_eot_callback)(apse_t* ap)) { ap->match_eot_callback = match_eot_callback; } void* (*apse_get_match_bot_callback(apse_t * ap))(apse_t *ap) { return ap->match_bot_callback; } void* (*apse_get_match_begin_callback(apse_t * ap))(apse_t *ap) { return ap->match_begin_callback; } void* (*apse_get_match_fail_callback(apse_t * ap))(apse_t *ap) { return ap->match_fail_callback; } void* (*apse_get_match_end_callback(apse_t * ap))(apse_t *ap) { return ap->match_end_callback; } void* (*apse_get_match_eot_callback(apse_t * ap))(apse_t *ap) { return ap->match_eot_callback; } static int _apse_wrap_slice(apse_t* ap, apse_ssize_t begin_in, apse_ssize_t size_in, apse_ssize_t* begin_out, apse_ssize_t* size_out) { if (begin_in < 0) { if ((apse_size_t)-begin_in > ap->pattern_size) return 0; begin_in = ap->pattern_size + begin_in; } if (size_in < 0) { if (-size_in > begin_in) return 0; size_in = -size_in; begin_in -= size_in; } if ((apse_size_t)begin_in >= ap->pattern_size) return 0; if ((apse_size_t)begin_in + size_in > ap->pattern_size) size_in = ap->pattern_size - begin_in; if (begin_out) *begin_out = begin_in; if (size_out) *size_out = size_in; return 1; } apse_bool_t apse_set_anychar(apse_t *ap, apse_ssize_t pattern_index) { apse_size_t bitvectors_in_state = ap->bitvectors_in_state; 
apse_ssize_t true_index, i; apse_bool_t okay = 0; if (!_apse_wrap_slice(ap, pattern_index, (apse_ssize_t)1, &true_index, 0)) goto out; for (i = 0; i < APSE_CHAR_MAX; i++) APSE_BIT_SET(ap->case_mask, i, bitvectors_in_state, pattern_index); if (ap->fold_mask) for (i = 0; i < APSE_CHAR_MAX; i++) APSE_BIT_SET(ap->fold_mask, i, bitvectors_in_state, pattern_index); okay = 1; out: return okay; } apse_bool_t apse_set_charset(apse_t* ap, apse_ssize_t pattern_index, unsigned char* set, apse_size_t set_size, apse_bool_t complement) { apse_size_t bitvectors_in_state = ap->bitvectors_in_state; apse_ssize_t true_index; apse_bool_t okay = 0; apse_size_t i; if (!_apse_wrap_slice(ap, pattern_index, (apse_ssize_t)1, &true_index, 0)) goto out; if (complement) { for (i = 0; i < set_size; i++) APSE_BIT_CLR(ap->case_mask, (unsigned)set[i], bitvectors_in_state, true_index); } else { for (i = 0; i < set_size; i++) APSE_BIT_SET(ap->case_mask, (unsigned)set[i], bitvectors_in_state, true_index); } if (ap->fold_mask) apse_set_caseignore_slice(ap, pattern_index, (apse_ssize_t)1, (apse_bool_t)1); okay = 1; out: return okay; } static void _apse_reset_state(apse_t* ap) { apse_size_t i, j; (void)memset(ap->state, 0, ap->bytes_in_all_states); (void)memset(ap->prev_state, 0, ap->bytes_in_all_states); ap->prev_equal = 0; ap->prev_active = 0; for (i = 1; i <= ap->edit_distance; i++) { for (j = 0; j < i; j++) { #ifdef APSE_DEBUGGING int k = APSE_IDX(i, ap->bitvectors_in_state, j); int l = ap->bytes_in_all_states/sizeof(apse_vec_t); assert (k < l); #endif APSE_BIT_SET(ap->prev_state, i, ap->bitvectors_in_state, j); } } } apse_bool_t apse_set_text_position(apse_t *ap, apse_size_t text_position) { ap->text_position = text_position; return 1; } apse_size_t apse_get_text_position(apse_t *ap) { return ap->text_position; } apse_bool_t apse_set_text_initial_position(apse_t *ap, apse_size_t text_initial_position) { ap->text_initial_position = text_initial_position; return 1; } apse_size_t 
apse_get_text_initial_position(apse_t *ap) { return ap->text_initial_position; } apse_bool_t apse_set_text_final_position(apse_t *ap, apse_size_t text_final_position) { ap->text_final_position = text_final_position; return 1; } apse_size_t apse_get_text_final_position(apse_t *ap) { return ap->text_final_position; } apse_bool_t apse_set_text_position_range(apse_t *ap, apse_size_t text_position_range) { ap->text_position_range = text_position_range; return 1; } apse_size_t apse_get_text_position_range(apse_t *ap) { return ap->text_position_range; } void apse_reset(apse_t *ap) { _apse_reset_state(ap); ap->text_position = ap->text_initial_position; #if 0 ap->text_position_range = APSE_MATCH_BAD; /* Do not reset this. */ #endif ap->match_state = APSE_MATCH_STATE_BOT; ap->match_begin = APSE_MATCH_BAD; ap->match_end = APSE_MATCH_BAD; } apse_bool_t apse_set_edit_distance(apse_t *ap, apse_size_t edit_distance) { /* TODO: waste not--reuse if possible */ if (ap->state) free(ap->state); if (ap->prev_state) free(ap->prev_state); if (edit_distance >= ap->pattern_size) edit_distance = ap->pattern_size; ap->edit_distance = edit_distance; ap->bytes_in_all_states = (edit_distance + 1) * ap->bytes_in_state; ap->state = ap->prev_state = 0; ap->state = calloc(edit_distance + 1, ap->bytes_in_state); if (ap->state == 0) goto out; ap->prev_state = calloc(edit_distance + 1, ap->bytes_in_state); if (ap->prev_state == 0) goto out; apse_reset(ap); if (!ap->has_different_distances) { ap->edit_insertions = edit_distance; ap->edit_deletions = edit_distance; ap->edit_substitutions = edit_distance; } if (ap->edit_distance && ap->bitvectors_in_state) ap->largest_distance = ap->edit_distance * ap->bitvectors_in_state; else ap->largest_distance = 0; ap->match_begin_bitvector = (edit_distance + 1) / APSE_BITS_IN_BITVEC; ap->match_begin_prefix = ((apse_vec_t)1 << edit_distance) - 1; ap->match_begin_bitmask = ((apse_vec_t)1 << edit_distance) - 1; ap->match_end_bitvector = (ap->pattern_size - 1) / 
APSE_BITS_IN_BITVEC; #ifdef APSE_DEBUGGING if (ap->has_different_distances) { printf("(edit distances: "); printf("insertions = %ld, deletions = %ld, substitutions = %ld)\n", ap->edit_insertions, ap->edit_deletions, ap->edit_substitutions); } else printf("(edit_distance = %ld)\n", ap->edit_distance); #endif out: return ap->state && ap->prev_state; } apse_size_t apse_get_edit_distance(apse_t *ap) { return ap->edit_distance; } apse_bool_t apse_set_minimal_distance(apse_t* ap, apse_bool_t minimal) { ap->use_minimal_distance = minimal; return 1; } apse_bool_t apse_get_minimal_distance(apse_t *ap) { return ap->use_minimal_distance; } apse_bool_t apse_set_exact_slice(apse_t* ap, apse_ssize_t exact_begin, apse_ssize_t exact_size, apse_bool_t exact) { apse_ssize_t true_begin, true_size; apse_bool_t okay = 0; apse_size_t i, j; if (!ap->exact_mask) { ap->exact_mask = calloc((size_t)1, ap->bytes_in_state); if (ap->exact_mask == 0) goto out; ap->exact_positions = 0; } if (!_apse_wrap_slice(ap, exact_begin, exact_size, &true_begin, &true_size)) goto out; if (exact) { for (i = true_begin, j = true_begin + true_size; i < j && i < ap->pattern_size; i++) { if (!APSE_BIT_TST(ap->exact_mask, 0, 0, i)) ap->exact_positions++; APSE_BIT_SET(ap->exact_mask, 0, 0, i); } } else { for (i = true_begin, j = true_begin + true_size; i < j && i < ap->pattern_size; i++) { if (APSE_BIT_TST(ap->exact_mask, 0, 0, i)) ap->exact_positions--; APSE_BIT_CLR(ap->exact_mask, 0, 0, i); } } okay = 1; out: return okay; } apse_bool_t apse_set_caseignore_slice(apse_t* ap, apse_ssize_t caseignore_begin, apse_ssize_t caseignore_size, apse_bool_t caseignore) { apse_size_t i, j; int k; apse_ssize_t true_begin, true_size; apse_bool_t okay = 0; if (!ap->fold_mask) { ap->fold_mask = calloc((apse_size_t)APSE_CHAR_MAX, ap->bytes_in_state); if (ap->fold_mask == 0) goto out; memcpy(ap->fold_mask, ap->case_mask, APSE_CHAR_MAX * ap->bytes_in_state); ap->pattern_mask = ap->fold_mask; } if (!_apse_wrap_slice(ap, 
caseignore_begin, caseignore_size, &true_begin, &true_size)) goto out; if (caseignore) { for (i = true_begin, j = true_begin + true_size; i < j && i < ap->pattern_size; i++) { for (k = 0; k < APSE_CHAR_MAX; k++) { if (APSE_BIT_TST(ap->case_mask, k, ap->bitvectors_in_state, i)) { if (isupper(k)) APSE_BIT_SET(ap->fold_mask, tolower(k), ap->bitvectors_in_state, i); else if (islower(k)) APSE_BIT_SET(ap->fold_mask, toupper(k), ap->bitvectors_in_state, i); } } } } else { for (i = true_begin, j = true_begin + true_size; i < j && i < ap->pattern_size; i++) { for (k = 0; k < APSE_CHAR_MAX; k++) { if (APSE_BIT_TST(ap->case_mask, k, ap->bitvectors_in_state, i)) { if (isupper(k)) APSE_BIT_CLR(ap->fold_mask, tolower(k), ap->bitvectors_in_state, i); else if (islower(k)) APSE_BIT_CLR(ap->fold_mask, toupper(k), ap->bitvectors_in_state, i); } } } } okay = 1; out: return okay; } void apse_destroy(apse_t *ap) { if (ap->case_mask) free(ap->case_mask); if (ap->fold_mask) free(ap->fold_mask); if (ap->state) free(ap->state); if (ap->prev_state) free(ap->prev_state); if (ap->exact_mask) free(ap->exact_mask); free(ap); } apse_t *apse_create(unsigned char* pattern, apse_size_t pattern_size, apse_size_t edit_distance) { apse_t *ap; apse_bool_t okay = 0; APSE_DEBUG(printf("(apse version %u.%u)\n", APSE_MAJOR_VERSION, APSE_MINOR_VERSION)); APSE_DEBUG( printf("(pattern = \"%s\", pattern_size = %ld)\n", pattern, pattern_size)); ap = calloc((size_t)1, sizeof(*ap)); if (ap == 0) return 0; ap->pattern_size = 0; ap->pattern_mask = 0; ap->edit_distance = 0; ap->has_different_distances = 0; ap->edit_insertions = 0; ap->edit_deletions = 0; ap->edit_substitutions = 0; ap->use_minimal_distance = 0; ap->bitvectors_in_state = 0; ap->bytes_in_state = 0; ap->bytes_in_all_states = 0; ap->largest_distance = 0; ap->text = 0; ap->text_size = 0; ap->text_position = 0; ap->text_initial_position = 0; ap->text_final_position = APSE_MATCH_BAD; ap->text_position_range = APSE_MATCH_BAD; ap->state = 0; ap->prev_state = 
0; ap->match_begin_bitmask = 0; ap->match_begin_prefix = 0; ap->match_end_bitvector = 0; ap->match_end_bitmask = 0; ap->match_state = APSE_MATCH_STATE_BOT; ap->match_begin = APSE_MATCH_BAD; ap->match_end = APSE_MATCH_BAD; ap->match_bot_callback = 0; ap->match_begin_callback = 0; ap->match_fail_callback = 0; ap->match_end_callback = 0; ap->match_eot_callback = 0; ap->exact_positions = 0; ap->exact_mask = 0; ap->is_greedy = 0; ap->custom_data = 0; ap->custom_data_size = 0; if (!apse_set_pattern(ap, (unsigned char *)pattern, pattern_size)) goto out; if (!apse_set_edit_distance(ap, edit_distance)) goto out; ap->edit_insertions = ap->edit_deletions = ap->edit_substitutions = ap->edit_distance; ap->largest_distance = edit_distance * ap->bitvectors_in_state; #ifdef APSE_DEBUGGING printf("(size of bitvector = %ld, bitvectors_in_state = %ld)\n", (long)sizeof(apse_vec_t), ap->bitvectors_in_state); printf("(bytes_in_state = %ld, states = %ld, bytes_in_all_states = %ld)\n", ap->bytes_in_state, ap->edit_distance + 1, ap->bytes_in_all_states); printf("(match_begin_bitvector = %ld, match_begin_bitmask = %s)\n", ap->match_begin_bitvector, _apse_fbin(ap->match_begin_bitmask, ap->pattern_size, 1)); printf("(match_end_bitvector = %ld, match_end_bitmask = %s)\n", ap->match_end_bitvector, _apse_fbin(ap->match_end_bitmask, ap->pattern_size, 1)); printf("(largest_distance = %ld, match_begin_prefix = %s)\n", ap->largest_distance, _apse_fbin(ap->match_begin_prefix, ap->pattern_size, 1)); if (ap->bitvectors_in_state == 1) printf("(single bitvector"); else printf("(multiple bitvectors"); printf(")\n"); #endif okay = 1; out: if (!okay) { apse_destroy(ap); ap = 0; } return ap; } apse_bool_t apse_set_insertions(apse_t *ap, apse_size_t insertions) { apse_bool_t okay = 0; if (insertions > ap->edit_distance) insertions = ap->edit_distance; ap->edit_insertions = insertions; ap->has_different_distances = 1; APSE_DEBUG((printf("(edit distances: insertions = %ld, deletions = %ld, substitutions = 
%ld)\n", ap->edit_insertions, ap->edit_deletions, ap->edit_substitutions))); okay = 1; return okay; } apse_size_t apse_get_insertions(apse_t *ap) { if (ap->has_different_distances) return ap->edit_insertions; else return ap->edit_distance; } apse_bool_t apse_set_deletions(apse_t *ap, apse_size_t deletions) { apse_bool_t okay = 0; if (deletions > ap->edit_distance) deletions = ap->edit_distance; ap->edit_deletions = deletions; ap->has_different_distances = 1; APSE_DEBUG((printf("(edit distances: insertions = %ld, deletions = %ld, substitutions = %ld)\n", ap->edit_insertions, ap->edit_deletions, ap->edit_substitutions))); okay = 1; return okay; } apse_size_t apse_get_deletions(apse_t *ap) { if (ap->has_different_distances) return ap->edit_deletions; else return ap->edit_distance; } apse_bool_t apse_set_substitutions(apse_t *ap, apse_size_t substitutions) { apse_bool_t okay = 0; if (substitutions > ap->edit_distance) substitutions = ap->edit_distance; ap->edit_substitutions = substitutions; ap->has_different_distances = 1; APSE_DEBUG((printf("(edit distances: insertions = %ld, deletions = %ld, substitutions = %ld)\n", ap->edit_insertions, ap->edit_deletions, ap->edit_substitutions))); okay = 1; return okay; } apse_size_t apse_get_substitutions(apse_t *ap) { if (ap->has_different_distances) return ap->edit_substitutions; else return ap->edit_distance; } #ifdef APSE_DEBUGGING static const char* apse_match_state_name(apse_t* ap) { switch (ap->match_state) { case APSE_MATCH_STATE_BOT: return "BOT"; case APSE_MATCH_STATE_SEARCH: return "SEARCH"; case APSE_MATCH_STATE_BEGIN: return "BEGIN"; case APSE_MATCH_STATE_FAIL: return "FAIL"; case APSE_MATCH_STATE_GREEDY: return "GREEDY"; case APSE_MATCH_STATE_END: return "END"; case APSE_MATCH_STATE_EOT: return "EOT"; default: return "***UNKNOWN***"; } } #endif static void _apse_match_bot(apse_t *ap) { APSE_DEBUG(printf("(text = \"%*.*s\", text_size = %ld)\n", (int)ap->text_size, (int)ap->text_size, ap->text, ap->text_size)); 
apse_reset(ap); APSE_DEBUG(printf("(match state %s)\n", apse_match_state_name(ap))); APSE_DEBUG(printf("(text begin %ld)\n", ap->text_position)); if (ap->match_bot_callback) ap->match_bot_callback(ap); } static void _apse_match_begin(apse_t *ap) { ap->match_state = APSE_MATCH_STATE_BEGIN; APSE_DEBUG(printf("(match state %s)\n", apse_match_state_name(ap))); APSE_DEBUG(printf("(match begin %ld)\n", ap->text_position)); ap->match_begin = ap->text_position; if (ap->match_begin_callback) ap->match_begin_callback(ap); } static void _apse_match_fail(apse_t *ap) { ap->match_state = APSE_MATCH_STATE_FAIL; APSE_DEBUG(printf("(match state %s)\n", apse_match_state_name(ap))); ap->match_begin = APSE_MATCH_BAD; APSE_DEBUG(printf("(match fail %ld)\n", ap->text_position)); if (ap->match_fail_callback) ap->match_fail_callback(ap); ap->match_state = APSE_MATCH_STATE_SEARCH; APSE_DEBUG(printf("(match state %s)\n", apse_match_state_name(ap))); } static void _apse_match_end(apse_t *ap) { ap->match_state = APSE_MATCH_STATE_END; APSE_DEBUG(printf("(match state %s)\n", apse_match_state_name(ap))); #ifdef APSE_DEBUGGING printf("(match end %ld)\n", ap->match_end); printf("(match string \"%.*s\")\n", (int)(ap->match_end - ap->match_begin + 1), ap->text + ap->match_begin); printf("(match length %ld)\n", ap->match_end - ap->match_begin + 1); #endif if (ap->match_end_callback) ap->match_end_callback(ap); ap->match_state = APSE_MATCH_STATE_SEARCH; APSE_DEBUG(printf("(match state %s)\n", apse_match_state_name(ap))); } static void _apse_match_eot(apse_t *ap) { ap->match_state = APSE_MATCH_STATE_EOT; APSE_DEBUG(printf("(match state %s)\n", apse_match_state_name(ap))); ap->text_position = ap->text_size; APSE_DEBUG(printf("(text end %ld)\n", ap->text_position)); if (ap->match_eot_callback) ap->match_eot_callback(ap); } static apse_bool_t _apse_match_next_state(apse_t *ap) { apse_size_t h, i, j, k; apse_vec_t match; k = ap->edit_distance * ap->bitvectors_in_state; switch (ap->match_state) { case 
APSE_MATCH_STATE_SEARCH: if (APSE_EXACT_MATCH_BEGIN(ap) || APSE_APPROX_MATCH_BEGIN(ap)) _apse_match_begin(ap); break; case APSE_MATCH_STATE_BEGIN: { apse_size_t equal = 0; apse_size_t active = 0; for (h = 0; h <= k; h += ap->bitvectors_in_state) { for (i = h, j = h + ap->bitvectors_in_state - 1; i < j; j--) if (ap->state[j] != ap->prev_state[j]) break; if (ap->prev_state[j] == ap->state[j]) equal++; if (ap->state[j]) active++; } #ifdef APSE_DEBUGGING printf("(equal = %d, active = %d)\n", equal, active); #endif if ((equal == ap->edit_distance + 1 && ap->is_greedy == 0) || (equal < ap->prev_equal && ap->prev_active && active > ap->prev_active && ap->text_position - ap->match_begin < 8 * ap->bytes_in_state && !APSE_BIT_TST(ap->state, ap->edit_distance, ap->bitvectors_in_state, ap->text_position - ap->match_begin))) { ap->match_begin = ap->text_position; #ifdef APSE_DEBUGGING printf("(slide begin %d)\n", ap->match_begin); #endif } else if (active == 0) _apse_match_fail(ap); ap->prev_equal = equal; ap->prev_active = active; } break; default: break; } for (match = 0, h = 0; h <= k; h += ap->bitvectors_in_state) match |= ap->state[h + ap->match_end_bitvector]; if (match & ap->match_end_bitmask) { if (ap->match_state == APSE_MATCH_STATE_BEGIN) { if (ap->is_greedy) { ap->match_state = APSE_MATCH_STATE_GREEDY; APSE_DEBUG(printf("(match state %s)\n", apse_match_state_name(ap))); APSE_DEBUG( printf("(greedy match continue %ld)\n", ap->text_position)); } else { ap->match_state = APSE_MATCH_STATE_END; ap->match_end = ap->text_position; } } } else if (ap->match_state == APSE_MATCH_STATE_GREEDY) { ap->match_state = APSE_MATCH_STATE_END; APSE_DEBUG(printf("(match state %s)\n", apse_match_state_name(ap))); ap->match_end = ap->text_position - 1; APSE_DEBUG(printf("(match end %ld)\n", ap->match_end)); } return ap->match_state; } static void _apse_exact_multiple(apse_t* ap) { apse_size_t h; apse_size_t g = ap->edit_distance * ap->bitvectors_in_state; for (h = 0; h < 
ap->bitvectors_in_state; h++)
        ap->state[g + h] &= ~ap->exact_mask[h];
}

static apse_bool_t _apse_match_single_simple(apse_t *ap) {
    /* single apse_vec_t, edit_distance */
    APSE_DEBUG(printf("(match single simple)\n"));
    for ( ; ap->text_position < ap->text_size; ap->text_position++) {
        apse_vec_t t =
            ap->pattern_mask[(unsigned)ap->text[ap->text_position] *
                             ap->bitvectors_in_state];
        apse_size_t h, g;

        APSE_NEXT_EXACT(ap->state, ap->prev_state, t, (apse_size_t)0, 1);
        APSE_DEBUG_SINGLE(ap, (apse_size_t)0);
        for (g = 0, h = 1; h <= ap->edit_distance; g = h, h++) {
            APSE_NEXT_APPROX(ap->state, ap->prev_state, t, h, g, 1);
            APSE_DEBUG_SINGLE(ap, h);
        }
        if (ap->exact_positions)
            ap->state[ap->edit_distance] &= ~ap->exact_mask[0];
        if (_apse_match_next_state(ap) == APSE_MATCH_STATE_END)
            return 1;
        (void)memcpy(ap->prev_state, ap->state, ap->bytes_in_all_states);
    }
    return 0;
}

static apse_bool_t _apse_match_multiple_simple(apse_t *ap) {
    /* multiple apse_vec_t:s, edit_distance */
    apse_size_t h, i;

    APSE_DEBUG(printf("(match multiple simple)\n"));
    for ( ; ap->text_position < ap->text_size; ap->text_position++) {
        apse_vec_t *t =
            ap->pattern_mask +
            (unsigned)ap->text[ap->text_position] * ap->bitvectors_in_state;
        apse_vec_t c, d;

        APSE_DEBUG_MULTIPLE_FIRST(ap, (apse_size_t)0);
        for (c = 1, i = 0; i < ap->bitvectors_in_state; i++, c = d) {
            d = APSE_TEST_HIGH_BIT(ap->state[i]);
            APSE_NEXT_EXACT(ap->state, ap->prev_state, t[i], i, c);
            APSE_DEBUG_MULTIPLE_REST(ap, i, i);
        }
        APSE_DEBUG(printf("\n"));
        for (h = 1; h <= ap->edit_distance; h++) {
            apse_size_t kj = h * ap->bitvectors_in_state,
                        jj = kj - ap->bitvectors_in_state;

            APSE_DEBUG_MULTIPLE_FIRST(ap, h);
            for (c = 1, i = 0; i < ap->bitvectors_in_state;
                 i++, kj++, jj++, c = d) {
                d = APSE_TEST_HIGH_BIT(ap->state[kj]);
                APSE_NEXT_APPROX(ap->state, ap->prev_state, t[i], kj, jj, c);
                APSE_DEBUG_MULTIPLE_REST(ap, i, kj);
            }
            APSE_DEBUG(printf("\n"));
        }
        if (ap->exact_positions)
            _apse_exact_multiple(ap);
        if (_apse_match_next_state(ap) ==
APSE_MATCH_STATE_END) return 1; (void)memcpy(ap->prev_state, ap->state, ap->bytes_in_all_states); } return 0; } static apse_bool_t _apse_match_single_complex(apse_t *ap) { /* single apse_vec_t, has_different_distances */ APSE_DEBUG(printf("(match single complex)\n")); for ( ; ap->text_position < ap->text_size; ap->text_position++) { unsigned char o = ap->text[ap->text_position]; apse_vec_t t = ap->pattern_mask[(unsigned int)o * ap->bitvectors_in_state]; apse_size_t h, g; APSE_NEXT_EXACT(ap->state, ap->prev_state, t, (apse_size_t)0, 1); APSE_DEBUG_SINGLE(ap, (apse_size_t)0); for (g = 0, h = 1; h <= ap->edit_distance; g = h, h++) { apse_bool_t has_insertions = h <= ap->edit_insertions; apse_bool_t has_deletions = h <= ap->edit_deletions; apse_bool_t has_substitutions = h <= ap->edit_substitutions; APSE_NEXT_COMMON(ap->state, ap->prev_state, t, h); if (has_insertions) APSE_NEXT_INSERT(ap->state, ap->prev_state, h, g); if (has_deletions) APSE_NEXT_DELETE(ap->state, h, g); if (has_substitutions) APSE_NEXT_SUBSTI(ap->state, ap->prev_state, h, g); APSE_NEXT_CARRY(ap->state, h, has_deletions || has_substitutions ? 
1 : 0); APSE_PREFIX_DELETE_MASK(ap); APSE_DEBUG_SINGLE(ap, h); } if (ap->exact_positions) ap->state[ap->edit_distance] &= ~ap->exact_mask[0]; if (_apse_match_next_state(ap) == APSE_MATCH_STATE_END) return 1; (void)memcpy(ap->prev_state, ap->state, ap->bytes_in_all_states); } return 0; } static apse_bool_t _apse_match_multiple_complex(apse_t *ap) { /* multiple apse_vec_t:s, has_different_distances */ apse_size_t h, i; APSE_DEBUG(printf("(match multiple complex)\n")); for ( ; ap->text_position < ap->text_size; ap->text_position++) { unsigned char o = ap->text[ap->text_position]; apse_vec_t *t = ap->pattern_mask + (unsigned)o * ap->bitvectors_in_state; apse_vec_t c, d; APSE_DEBUG_MULTIPLE_FIRST(ap, (apse_size_t)0); for (c = 1, i = 0; i < ap->bitvectors_in_state; i++, c = d) { d = APSE_TEST_HIGH_BIT(ap->state[i]); APSE_NEXT_EXACT(ap->state, ap->prev_state, t[i], i, c); APSE_DEBUG_MULTIPLE_REST(ap, i, i); } APSE_DEBUG(printf("\n")); for (h = 1; h <= ap->edit_distance; h++) { apse_size_t kj = h * ap->bitvectors_in_state, jj = kj - ap->bitvectors_in_state; apse_bool_t has_insertions = h <= ap->edit_insertions; apse_bool_t has_deletions = h <= ap->edit_deletions; apse_bool_t has_substitutions = h <= ap->edit_substitutions; APSE_DEBUG_MULTIPLE_FIRST(ap, h); /* Is there such a thing as too much manual optimization? 
*/ if (has_insertions) { if (has_deletions && has_substitutions) { for (c = 1, i = 0; i < ap->bitvectors_in_state; i++, kj++, jj++, c = d) { d = APSE_TEST_HIGH_BIT(ap->state[kj]); APSE_NEXT_COMMON(ap->state, ap->prev_state, t[i], kj); APSE_NEXT_INSERT(ap->state, ap->prev_state, kj, jj); APSE_NEXT_DELETE(ap->state, kj, jj); APSE_NEXT_SUBSTI(ap->state, ap->prev_state, kj, jj); APSE_NEXT_CARRY(ap->state, kj, c); APSE_PREFIX_DELETE_MASK(ap); APSE_DEBUG_MULTIPLE_REST(ap, i, kj); } } else if (has_deletions) { for (c = 1, i = 0; i < ap->bitvectors_in_state; i++, kj++, jj++, c = d) { d = APSE_TEST_HIGH_BIT(ap->state[kj]); APSE_NEXT_COMMON(ap->state, ap->prev_state, t[i], kj); APSE_NEXT_INSERT(ap->state, ap->prev_state, kj, jj); APSE_NEXT_DELETE(ap->state, kj, jj); APSE_NEXT_CARRY(ap->state, kj, c); APSE_PREFIX_DELETE_MASK(ap); APSE_DEBUG_MULTIPLE_REST(ap, i, kj); } } else if (has_substitutions) { for (c = 1, i = 0; i < ap->bitvectors_in_state; i++, kj++, jj++, c = d) { d = APSE_TEST_HIGH_BIT(ap->state[kj]); APSE_NEXT_COMMON(ap->state, ap->prev_state, t[i], kj); APSE_NEXT_INSERT(ap->state, ap->prev_state, kj, jj); APSE_NEXT_SUBSTI(ap->state, ap->prev_state, kj, jj); APSE_NEXT_CARRY(ap->state, kj, c); APSE_PREFIX_DELETE_MASK(ap); APSE_DEBUG_MULTIPLE_REST(ap, i, kj); } } else { for (c = 1, i = 0; i < ap->bitvectors_in_state; i++, kj++, jj++, c = d) { d = APSE_TEST_HIGH_BIT(ap->state[kj]); APSE_NEXT_COMMON(ap->state, ap->prev_state, t[i], kj); APSE_NEXT_INSERT(ap->state, ap->prev_state, kj, jj); APSE_NEXT_CARRY(ap->state, kj, c); APSE_PREFIX_DELETE_MASK(ap); APSE_DEBUG_MULTIPLE_REST(ap, i, kj); } } } else { if (has_deletions && has_substitutions) { for (c = 1, i = 0; i < ap->bitvectors_in_state; i++, kj++, jj++, c = d) { d = APSE_TEST_HIGH_BIT(ap->state[kj]); APSE_NEXT_COMMON(ap->state, ap->prev_state, t[i], kj); APSE_NEXT_DELETE(ap->state, kj, jj); APSE_NEXT_SUBSTI(ap->state, ap->prev_state, kj, jj); APSE_NEXT_CARRY(ap->state, kj, c); APSE_PREFIX_DELETE_MASK(ap); 
APSE_DEBUG_MULTIPLE_REST(ap, i, kj); } } else if (has_deletions) { for (c = 1, i = 0; i < ap->bitvectors_in_state; i++, kj++, jj++, c = d) { d = APSE_TEST_HIGH_BIT(ap->state[kj]); APSE_NEXT_COMMON(ap->state, ap->prev_state, t[i], kj); APSE_NEXT_DELETE(ap->state, kj, jj); APSE_NEXT_CARRY(ap->state, kj, c); APSE_PREFIX_DELETE_MASK(ap); APSE_DEBUG_MULTIPLE_REST(ap, i, kj); } } else if (has_substitutions) { for (c = 1, i = 0; i < ap->bitvectors_in_state; i++, kj++, jj++, c = d) { d = APSE_TEST_HIGH_BIT(ap->state[kj]); APSE_NEXT_COMMON(ap->state, ap->prev_state, t[i], kj); APSE_NEXT_SUBSTI(ap->state, ap->prev_state, kj, jj); APSE_NEXT_CARRY(ap->state, kj, c); APSE_PREFIX_DELETE_MASK(ap); APSE_DEBUG_MULTIPLE_REST(ap, i, kj); } } } APSE_DEBUG(printf("\n")); if (ap->exact_positions) _apse_exact_multiple(ap); if (_apse_match_next_state(ap) == APSE_MATCH_STATE_END) return 1; (void)memcpy(ap->prev_state, ap->state, ap->bytes_in_all_states); } } return 0; } static apse_bool_t __apse_match(apse_t *ap, unsigned char *text, apse_size_t text_size) { apse_bool_t did_match = 0; APSE_DEBUG(printf("(match enter)\n")); if (ap->match_state == APSE_MATCH_STATE_BOT) { ap->text = text; if (ap->text_final_position == APSE_MATCH_BAD) ap->text_size = text_size; else ap->text_size = ap->text_final_position > text_size ? 
text_size : ap->text_final_position + 1; _apse_match_bot(ap); } else if (ap->match_state == APSE_MATCH_STATE_EOT) goto leave; if (ap->edit_deletions >= ap->pattern_size || ap->edit_substitutions >= ap->pattern_size) { ap->match_state = APSE_MATCH_STATE_END; ap->match_begin = ap->text_initial_position; ap->match_end = ap->text_size - 1; ap->text_position = ap->text_size; goto out; } if (ap->pattern_size - ap->edit_deletions > ap->text_size - ap->text_initial_position) { ap->match_state = APSE_MATCH_STATE_EOT; ap->text_position = ap->text_size; goto out; } if (text_size + ap->edit_distance < ap->pattern_size + ap->text_position) { ap->text_position = ap->text_size; goto eot; } if (ap->match_state == APSE_MATCH_STATE_SEARCH) { ap->text_position++; _apse_reset_state(ap); } if (ap->text_position_range != APSE_MATCH_BAD && ap->text_position - ap->text_initial_position > ap->text_position_range) { ap->match_state = APSE_MATCH_STATE_END; goto eot; } ap->match_state = APSE_MATCH_STATE_SEARCH; APSE_DEBUG(printf("(match state %s)\n", apse_match_state_name(ap))); APSE_DEBUG(printf("(match search %ld)\n", ap->text_position)); if (ap->has_different_distances) { if (ap->bitvectors_in_state == 1) { if (_apse_match_single_complex(ap)) goto out; } else { if (_apse_match_multiple_complex(ap)) goto out; } } else { if (ap->bitvectors_in_state == 1) { if (_apse_match_single_simple(ap)) goto out; } else { if (_apse_match_multiple_simple(ap)) goto out; } } out: if (ap->match_state == APSE_MATCH_STATE_GREEDY) { ap->match_state = APSE_MATCH_STATE_END; ap->match_end = ap->text_position - 1; APSE_DEBUG(printf("(greedy match end %ld)\n", ap->match_end)); } if (ap->match_state == APSE_MATCH_STATE_END) { _apse_match_end(ap); did_match = 1; } eot: if (ap->text_position == ap->text_size) _apse_match_eot(ap); leave: APSE_DEBUG(printf("(match leave)\n")); return did_match; } static apse_bool_t _apse_match(apse_t *ap, unsigned char *text, apse_size_t text_size) { if (ap->use_minimal_distance) { 
apse_set_edit_distance(ap, 0); if (__apse_match(ap, text, text_size)) return 1; else { apse_size_t minimal_edit_distance; apse_size_t previous_edit_distance = 0; apse_size_t next_edit_distance; for (next_edit_distance = 1; next_edit_distance <= ap->pattern_size; next_edit_distance *= 2) { apse_set_edit_distance(ap, next_edit_distance); if (__apse_match(ap, text, text_size)) break; previous_edit_distance = next_edit_distance; } minimal_edit_distance = next_edit_distance; if (next_edit_distance > 1) { do { minimal_edit_distance = (previous_edit_distance + next_edit_distance) / 2; if (minimal_edit_distance == previous_edit_distance) break; apse_set_edit_distance(ap, minimal_edit_distance); if (__apse_match(ap, text, text_size)) next_edit_distance = minimal_edit_distance; else previous_edit_distance = minimal_edit_distance; } while (previous_edit_distance <= next_edit_distance); if (!__apse_match(ap, text, text_size)) minimal_edit_distance++; } apse_set_edit_distance(ap, minimal_edit_distance); __apse_match(ap, text, text_size); return 1; } } else return __apse_match(ap, text, text_size); } apse_bool_t apse_match(apse_t *ap, unsigned char *text, apse_size_t text_size) { apse_bool_t did_match = _apse_match(ap, text, text_size); _apse_match_eot(ap); apse_reset(ap); return did_match; } apse_bool_t apse_match_next(apse_t *ap, unsigned char *text, apse_size_t text_size) { apse_bool_t did_match = _apse_match(ap, text, text_size); if (!did_match) ap->match_state = APSE_MATCH_STATE_BOT; return did_match; } apse_ssize_t apse_index(apse_t *ap, unsigned char *text, apse_size_t text_size) { apse_size_t did_match = _apse_match(ap, text, text_size); _apse_match_eot(ap); ap->match_state = APSE_MATCH_STATE_BOT; return did_match ? ap->match_begin : APSE_MATCH_BAD; } apse_ssize_t apse_index_next(apse_t *ap, unsigned char *text, apse_size_t text_size) { apse_bool_t did_match = _apse_match(ap, text, text_size); if (!did_match) ap->match_state = APSE_MATCH_STATE_BOT; return did_match ? 
        ap->match_begin : APSE_MATCH_BAD;
}

static apse_bool_t _apse_slice(apse_t *ap,
                               unsigned char *text,
                               apse_size_t text_size,
                               apse_size_t *match_begin,
                               apse_size_t *match_size) {
    apse_bool_t did_match = _apse_match(ap, text, text_size);

    if (did_match) {
        if (match_begin)
            *match_begin = ap->match_begin;
        if (match_size)
            *match_size = ap->match_end - ap->match_begin + 1;
    } else {
        if (match_begin)
            *match_begin = APSE_MATCH_BAD;
        if (match_size)
            *match_size = APSE_MATCH_BAD;
    }
    return did_match;
}

apse_bool_t apse_slice(apse_t *ap,
                       unsigned char *text,
                       apse_size_t text_size,
                       apse_size_t *match_begin,
                       apse_size_t *match_size) {
    apse_bool_t did_match =
        _apse_slice(ap, text, text_size, match_begin, match_size);

    _apse_match_eot(ap);
    ap->match_state = APSE_MATCH_STATE_BOT;
    return did_match;
}

apse_bool_t apse_slice_next(apse_t* ap,
                            unsigned char* text,
                            apse_size_t text_size,
                            apse_size_t* match_begin,
                            apse_size_t* match_size) {
    apse_bool_t did_match =
        _apse_slice(ap, text, text_size, match_begin, match_size);

    if (!did_match)
        ap->match_state = APSE_MATCH_STATE_BOT;
    return did_match;
}

void apse_set_custom_data(apse_t* ap,
                          void* custom_data,
                          apse_size_t custom_data_size) {
    ap->custom_data = custom_data;
    ap->custom_data_size = custom_data_size;
}

void* apse_get_custom_data(apse_t* ap, apse_size_t* custom_data_size) {
    if (custom_data_size)
        *custom_data_size = ap->custom_data_size;
    return ap->custom_data;
}
String-Approx-3.27/PROBLEMS0000644010002300116100000000030406744554366013442 0ustar jhieng
3.08:

* t/aslice.t: for some strange reason the tests 5 and 6 fail in SunOS 4.1.4, I got two separate cpan-testers reports on this. If the test is run manually, it succeeds. Go figure.
String-Approx-3.27/META.yml0000640010002300116100000000075112077546526013542 0ustar jhieng--- #YAML:1.0 name: String-Approx version: 3.27 abstract: ~ author: [] license: unknown distribution_type: module configure_requires: ExtUtils::MakeMaker: 0 build_requires: ExtUtils::MakeMaker: 0 requires: Test::More: 0 no_index: directory: - t - inc generated_by: ExtUtils::MakeMaker version 6.57_05 meta-spec: url: http://module-build.sourceforge.net/META-spec-v1.4.html version: 1.4 String-Approx-3.27/Approx.xs0000644010002300116100000001344507241123737014116 0ustar jhieng#ifdef __cplusplus extern "C" { #endif #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "patchlevel.h" #ifdef __cplusplus } #endif #include "apse.h" #if PATCHLEVEL < 5 # define PL_na na #endif MODULE = String::Approx PACKAGE = String::Approx PROTOTYPES: DISABLE apse_t* new(CLASS, pattern, ...) char* CLASS SV* pattern CODE: apse_t* ap; apse_size_t edit_distance; IV pattern_size = sv_len(pattern); if (items == 2) edit_distance = ((pattern_size-1)/10)+1; else if (items == 3) edit_distance = (apse_size_t)SvIV(ST(2)); else { warn("Usage: new(pattern[, edit_distance])\n"); XSRETURN_UNDEF; } ap = apse_create((unsigned char *)SvPV(pattern, PL_na), pattern_size, edit_distance); if (ap) { RETVAL = ap; } else { warn("unable to allocate"); XSRETURN_UNDEF; } OUTPUT: RETVAL void DESTROY(ap) apse_t* ap CODE: apse_destroy(ap); apse_bool_t match(ap, text) apse_t* ap SV* text CODE: RETVAL = apse_match(ap, (unsigned char *)SvPV(text, PL_na), sv_len(text)); OUTPUT: RETVAL apse_bool_t match_next(ap, text) apse_t* ap SV* text CODE: RETVAL = apse_match_next(ap, (unsigned char *)SvPV(text, PL_na), sv_len(text)); OUTPUT: RETVAL apse_ssize_t index(ap, text) apse_t* ap SV* text CODE: RETVAL = apse_index(ap, (unsigned char *)SvPV(text, PL_na), sv_len(text)); OUTPUT: RETVAL void slice(ap, text) apse_t* ap SV* text PREINIT: apse_size_t match_begin; apse_size_t match_size; PPCODE: if (ap->use_minimal_distance) { 
apse_slice(ap, (unsigned char *)SvPV(text, PL_na), (apse_size_t)sv_len(text), &match_begin, &match_size); EXTEND(sp, 3); PUSHs(sv_2mortal(newSViv(match_begin))); PUSHs(sv_2mortal(newSViv(match_size))); PUSHs(sv_2mortal(newSViv(ap->edit_distance))); } else if (apse_slice(ap, (unsigned char *)SvPV(text, PL_na), (apse_size_t)sv_len(text), &match_begin, &match_size)) { EXTEND(sp, 2); PUSHs(sv_2mortal(newSViv(match_begin))); PUSHs(sv_2mortal(newSViv(match_size))); } void slice_next(ap, text) apse_t* ap SV* text PREINIT: apse_size_t match_begin; apse_size_t match_size; PPCODE: if (apse_slice_next(ap, (unsigned char *)SvPV(text, PL_na), sv_len(text), &match_begin, &match_size)) { EXTEND(sp, 2); PUSHs(sv_2mortal(newSViv(match_begin))); PUSHs(sv_2mortal(newSViv(match_size))); if (ap->use_minimal_distance) { EXTEND(sp, 1); PUSHs(sv_2mortal(newSViv(ap->edit_distance))); } } void set_greedy(ap) apse_t* ap CODE: apse_set_greedy(ap, 1); apse_bool_t set_caseignore_slice(ap, ...) apse_t* ap PREINIT: apse_size_t offset; apse_size_t size; apse_bool_t ignore; CODE: offset = items < 2 ? 0 : (apse_size_t)SvIV(ST(1)); size = items < 3 ? ap->pattern_size : (apse_size_t)SvIV(ST(2)); ignore = items < 4 ? 
1 : (apse_bool_t)SvIV(ST(3)); RETVAL = apse_set_caseignore_slice(ap, offset, size, ignore); OUTPUT: RETVAL apse_bool_t set_insertions(ap, insertions) apse_t* ap apse_size_t insertions = SvUV($arg); CODE: RETVAL = apse_set_insertions(ap, insertions); OUTPUT: RETVAL apse_bool_t set_deletions(ap, deletions) apse_t* ap apse_size_t deletions = SvUV($arg); CODE: RETVAL = apse_set_deletions(ap, deletions); OUTPUT: RETVAL apse_bool_t set_substitutions(ap, substitutions) apse_t* ap apse_size_t substitutions = SvUV($arg); CODE: RETVAL = apse_set_substitutions(ap, substitutions); OUTPUT: RETVAL apse_bool_t set_edit_distance(ap, edit_distance) apse_t* ap apse_size_t edit_distance = SvUV($arg); CODE: RETVAL = apse_set_edit_distance(ap, edit_distance); OUTPUT: RETVAL apse_size_t get_edit_distance(ap) apse_t* ap CODE: ST(0) = sv_newmortal(); sv_setiv(ST(0), apse_get_edit_distance(ap)); apse_bool_t set_text_initial_position(ap, text_initial_position) apse_t* ap apse_size_t text_initial_position = SvUV($arg); CODE: RETVAL = apse_set_text_initial_position(ap, text_initial_position); OUTPUT: RETVAL apse_bool_t set_text_final_position(ap, text_final_position) apse_t* ap apse_size_t text_final_position = SvUV($arg); CODE: RETVAL = apse_set_text_final_position(ap, text_final_position); OUTPUT: RETVAL apse_bool_t set_text_position_range(ap, text_position_range) apse_t* ap apse_size_t text_position_range = SvUV($arg); CODE: RETVAL = apse_set_text_position_range(ap, text_position_range); OUTPUT: RETVAL void set_minimal_distance(ap, b) apse_t* ap apse_bool_t b CODE: apse_set_minimal_distance(ap, b); String-Approx-3.27/ChangeLog0000644010002300116100000002524612077535760014053 0ustar jhieng2013-01-22 Jarkko Hietaniemi * Resolve https://rt.cpan.org/Ticket/Display.html?id=69029 Do not die (adist()) on empty pattern string. * Resolve https://rt.cpan.org/Ticket/Display.html?id=36707 Do not die on undefined inputs, just return undef. 
* Resolve https://rt.cpan.org/Ticket/Display.html?id=82341 Explicitly specify the licensing to be Artistic 2 or LGPL 2. * Modernize all the tests to use Test::More. * Add some tests for UTF-8 inputs. * Mark some stalled/ancient bugs as resolved. * Delete the hopelessly obsolete BUGS file. * Released as 3.27. 2006-04-09 Jarkko Hietaniemi * Try to underline, highlight, and explain the fact that String::Approx does not do a good job for comparing strings "with fuzz", use the Levenshtein et al for that. * aindex() might return "too early" indices if either the pattern or the text contain repetitive characters, this seems to be a tricky defect to fix and somewhat conflicting with our model (return "as early as possible" matches, just like regular expressions) (to get "as late as possible" matches one would basically have to keep retrying at later indices until one fails), so for now only document this known problem. The same goes for aslice(). * [INTERNAL] use Test::More (not 100% yet) * [INTERNAL] apse.c: do not reset text_position_range in apse_reset (thanks to Helmut Jarausch) * [INTERNAL] apse.c: add #include . * [INTERNAL] apse.c: small signed/unsigned cleanups. * Released as 3.26. 2005-05-24 Jarkko Hietaniemi * Pure documentation cleanup release to address http://rt.cpan.org/NoAuth/Bug.html?id=12196 "Small String::Approx Pod issue" * Released as 3.25. 2005-01-02 Jarkko Hietaniemi * Pure documentation cleanup release to address http://rt.cpan.org/NoAuth/Bug.html?id=6668 "Unfortunate perldoc rendering of String::Approx" * Released as 3.24. Mon Nov 30 15:18:15 2003 Jarkko Hietaniemi * Safeguards against trying to use greater edit distance than the pattern is long, inspired by 'idallen'. * Advise against using String::Approx for text comparisons, since String::Approx is meant for strings. * Released as 3.23. Sun Oct 19 12:17:20 2003 Jarkko Hietaniemi * adistr($pattern, @inputs) returned absolute, not relative, distances in list context, reported by 'idallen'. 
* Released as 3.22. Sat Oct 18 10:29:30 2003 Jarkko Hietaniemi * asubstitute() didn't substitute in $_ as promised, reported by Tim Maher. * Released as 3.21. Mon May 12 22:09:59 2003 Jarkko Hietaniemi * Bug report, analysis, and patch from Rich Williams for a nasty segfault (no easy test case, sadly). * Clarify the documentation about the weirdness of the 'size' from aslice(). * Released as 3.20. Sun Aug 18 01:37:53 2002 Jarkko Hietaniemi * Fixed a frontend bug which caused 0% to be rounded up to 1, (instead of being equal to 0) found by Ji Y Park. * Released as 3.19. Tue Oct 16 20:06:53 2001 Jarkko Hietaniemi * Documented how one can use the adist() and adistr() to sort the inputs according on their approximateness, suggested by Arthur Bergman. * Fixed yet another aindex() bug found by Juha Muilu. * Pasha Sadri figured out what was the stupid bug breaking complex cache flushing: I was calling a wrong non-existent sub: _cache_flush_complex was called instead of _flush_cache_complex. Argh. * Released as 3.18. Fri Aug 3 15:47:10 2001 Jarkko Hietaniemi * 3.17: Add COPYRIGHT and LICENSE to the pod. Thu Jun 28 23:50:14 2001 Jarkko Hietaniemi * Released as 3.16. * Ben Kennedy found yet another silly memory leak, in Approx.pm:_complex() this time. Sat Feb 10 04:18:09 2001 Jarkko Hietaniemi * Rewrote _simple() and _complex(), now the silly memory leak in the caching code should be gone, the problem was reported at least by Frank Kirsch, Dag Kristian Rognlien, Chris Rosin, and Ben Kennedy. * There were actually two memory leaks, the other one was mild enough not to be noticed. While fixing the mild one found a long standing bug, fixing it made asubst() to matches to greedy again (they were always stingy because of the bug). * Fixed a bug in adist() (actually, aslice(), which adist() internally uses) which made adist('MOM','XXMOXXMOXX') to return 0, not 1, reported by Chris Rosin. 
* Fixed a bug in aindex() where aindex('----foobar----', '----aoobar----', 1) would report 10 instead of 0, reported by Trank Tobin. * Fixed a bug in aindex() where reported by Damian Keefe. * Added an API to control the size of the cache(s), as suggested by B. Elijah Griffin. * Added Aldo Calpini's test failure in Win32 as an open bug. Don't know what's wrong yet, the test output looks garbled. * Internal: use sv_len() instead of SvCUR() to be Unicode-safe. Thu Nov 16 22:35:13 2000 Jarkko Hietaniemi * Fixed the apse.c:_apse_match() for loop bug reported by Ross Smith and Eric Promislow. * Add adist() and adistr() which tell the edit distance and the relative edit distance. For both zero is an exact match, and negative means that the pattern is longer than the input. For the relative edit distances 1 means that the strings are completely different, and the negative values never reach -1. This feature has been requested by zillions. * Add arindex() (a backward aindex()). * Add BUGS list. Two open bugs. Wed Sep 6 19:23:55 2000 Jarkko Hietaniemi * Released as 3.13. * apse.c: fixed an aindex() bug where a better match doesn't move the match_begin forward. Reported by David Curiel and Stefan Ram. * Approx.pm: fix a cache flushing bug where the most current pattern was accidentally flushed, fix from Ben Kennedy. Revert back to using real words: "approximateness" it is (drop the "approximity" coinage). Sun Apr 30 22:39:09 2000 Jarkko Hietaniemi * Released as 3.12. * ChangeLog: Explain the bug fixed in 3.10 a little bit more. Mon Apr 24 23:19:20 2000 Jarkko Hietaniemi * Makefile.PL: The -DAPSE_DEBUGGING was accidentally left on, resulting in a deluge of debug messages being spewed to stderr. Oops. Reported by Rob Fugina. * Approx.pm: Bump to 3.11. Tue Apr 18 06:48:34 2000 Jarkko Hietaniemi * Released as 3.10. * apse.c: Fix an insidious buffer overrun, found by heroic debugging by J. D. Laub (already back in November 1999, but I'm amazingly good at procrastinating). 
This is probably the bug that plagued String::Approx 3.0* in HP-UX and Solaris. I couldn't anymore repeat it in HP-UX, in Solaris it was still going strong. The bug was really well camouflaged: it didn't overstep the buffer area as a whole, in which case it would have been found by Purify-like tools. What was happening was that the code is maintaining a 2-dimensional bitmask (or a 3-dimensional bitbuffer, if you will), and under certain circumstances the bits from one dimension leaked into another. The only user-visible API that was tickling the bug was aslice(), so if you weren't using that, you were probably okay. In Solaris (and at some point HP-UX) the malloced buffers happened to be laid out so that bug was visible (e.g. in Linux, AIX, and Digital UNIX, it wasn't). Mon Jul 19 10:42:13 1999 Jarkko Hietaniemi * Approx.pm: Released 3.09. * Approx.pm: Do the right things if pattern is empty (the bug found by Anirvan Chatterjee). * apse.c: Fixed the cases "this-has-so-many-edits-that-it-is- going-to-match-no-matter-what" and "there-are-not-enough-edits- to-ever-match-this" (the latter bug found by Mitch Helle). Thu Jun 24 02:00:26 1999 Jarkko Hietaniemi * Approx.pm: Released 3.08. * Approx.pm: Add aslice() as suggested by Mike Hanafey. * apse.h: Introduce use_minimal_distance. * apse.c: Implement use_minimal_distance. * t/aslice.t: Test the new feature. * MANIFEST: Add t/aslice.t. Wed Jun 23 20:07:05 1999 Jarkko Hietaniemi * Approx.pm: Released 3.07. * Approx.pm: Add aindex() as suggested by Mike Hanafey. * apse.h: Introduce text_initial_position and text_final_position. * apse.c: Implement text_initial_position and text_final_position. * t/aindex.t: Test the new feature. * MANIFEST: Add t/aindex.t. Wed Jun 16 14:19:39 1999 Jarkko Hietaniemi * Approx.pm: Released 3.06. * Approx.pm: Release 3.06: Fixed a bug in caching of parsed parameters (the absolute length of the pattern must be cached, too). 
(Reported by Chris Rosin and Mark Land) * Approx.pm: Fixed a couple of typos and introduced the coinage 'approximity' (to replace the clunky 'approximateness'). * t/user.t: Added four new tests from Rosin and Land. Fri Jan 8 16:09:52 1999 Jarkko Hietaniemi * Approx.pm: Release 3.05. No functional changes or bug fixes (no bugs reported), just a release that includes both the Artistic License and LGPL. Wed Dec 30 11:07:28 1998 Jarkko Hietaniemi * MANIFEST: Added Artistic and LGPL, reworded COPYRIGHT slightly to comply. Thu Dec 17 13:17:58 1998 Jarkko Hietaniemi * Approx.pm: Released 3.04. * Approx.pm: Fixed a parameter parsing bug: "i 1" was not accepted. (Reported by Bob J.A. Schijvenaars) * Approx.pm: Documented that matching is asymmetric: the inputs are matched against the pattern, not the other way round. * Approx.pm: Added a few C<>s and I<>s to the pod. * Approx.pm: Added an automatic flush to the pattern compilation caches (triggered by a high water mark). * Approx.pm: Added "require 5.004_04;", Previously only Makefile.PL had this, but better be paranoid. Wed Dec 16 10:28:07 1998 Jarkko Hietaniemi * Approx.pm: Released 3.03. * Approx.pm: Added confirmation from Udi Manber to README: it's not a problem that I have looked at agrep code. My code is my code and can be used within the limits set by my copyright. * Approx.xs: Removed a lot of the glue code because it is not yet reachable via Approx.pm. Later. Wed Dec 16 01:23:58 1998 Jarkko Hietaniemi * Approx.pm: Released 3.02. * README: Added a clarification about our relationship with agrep. There is no common code. None. Waiting for confirmation from Udi Manber. (The concern raised by Slaven Rezic). * Approx.xs: fixed the PL_na mess for now, will have to figure out the correct way. Tested under 5.004-maint, 5.005-maint, and 5.005-devel. * Approx.pm: Released 3.01 (actually already Dec 15). 
String-Approx-3.27/COPYRIGHT.agrep

This material was developed by Sun Wu, Udi Manber and Burra Gopal at the University of Arizona, Department of Computer Science. Permission is granted to copy this software, to redistribute it on a nonprofit basis, and to use it for any purpose, subject to the following restrictions and understandings.

1. Any copy made of this software must include this copyright notice in full.
2. All materials developed as a consequence of the use of this software shall duly acknowledge such use, in accordance with the usual standards of acknowledging credit in academic research.
3. The authors have made no warranty or representation that the operation of this software will be error-free or suitable for any application, and they are under under no obligation to provide any services, by way of maintenance, update, or otherwise. The software is an experimental prototype offered on an as-is basis.
4. Redistribution for profit requires the express, written permission of the authors.

String-Approx-3.27/README.apse

This code implements the basic 'bitap' algorithm for approximate string matching, within "edit distance" (Levenshtein measure). The edit distance is defined as the total number of insertions, deletions, and substitutions required to transform a string to (possibly a substring of) another. "sale" is within 1 edit from "salve", "ale", and "same", and within 2 edits from "fake".

The algorithm was developed by Udi Manber and Sun Wu, as described in "Fast text search allowing errors", Communications of the ACM, Vol 35, No. 10, October 1992. An earlier version of the paper, "Fast Text Searching with Errors", Technical Report TR-91-11, Department of Computer Science, University of Arizona, Tucson, AZ, June 1991, is available online [1].
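To make the edit-distance examples above concrete, here is a minimal sketch in C of the classic dynamic-programming Levenshtein distance between two whole strings. It is not code from apse.c: the function name edit_distance is hypothetical, and unlike the bitap matcher it compares entire strings rather than searching for the best-matching substring.

```c
#include <stdlib.h>
#include <string.h>

/* Smallest of three values. */
static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/*
 * Hypothetical helper, not part of apse.c.
 * Returns the number of insertions, deletions, and substitutions
 * needed to turn string a into string b, or (size_t)-1 if memory
 * allocation fails.  Only one row of the DP table is kept.
 */
size_t edit_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t i, j, result;
    size_t *row = malloc((lb + 1) * sizeof *row);

    if (row == NULL)
        return (size_t)-1;

    /* Distance from the empty prefix of a to each prefix of b. */
    for (j = 0; j <= lb; j++)
        row[j] = j;

    for (i = 1; i <= la; i++) {
        size_t diag = row[0];           /* value for (i-1, j-1) */
        row[0] = i;                     /* i deletions reach "" */
        for (j = 1; j <= lb; j++) {
            size_t up = row[j];         /* value for (i-1, j) */
            size_t cost = (a[i - 1] != b[j - 1]) ? 1 : 0;
            row[j] = min3(up + 1,           /* delete a[i-1]  */
                          row[j - 1] + 1,   /* insert b[j-1]  */
                          diag + cost);     /* match or substitute */
            diag = up;
        }
    }

    result = row[lb];
    free(row);
    return result;
}
```

Under these assumptions, edit_distance("sale", "salve"), edit_distance("sale", "ale"), and edit_distance("sale", "same") each give 1, and edit_distance("sale", "fake") gives 2, in line with the examples above.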
The 'bitap' algorithm is best known as part of the agrep(1) ("approximate grep") implementation [2]. The implementation has been explored in "agrep - a fast approximate pattern matching tool", Proceedings of the Winter 1992 USENIX Conference, USENIX Association, Berkeley, CA. A version of that paper is available online [3]. A newer version of agrep, by Udi Manber, Sun Wu, and Burra Gopal, comes with the Glimpse [4][5] indexing tool, the search engine of Harvest [6]. The version number of agrep hasn't been updated from 2.04 but the interface has been librarised and the code made more portable (for example--the original release didn't handle ISO Latin, only pure ASCII). Note also that the 'bitap' algorithm is just one of the many string searching algorithms agrep uses--see the agrep implementations for more details. 'bitap' is not the fastest approximate matching algorithm in agrep, but it is the most versatile one.

[1] ftp://ftp.cs.arizona.edu/agrep/agrep.ps.1.Z
[2] ftp://ftp.cs.arizona.edu/agrep/agrep-2.04.tar.Z
[3] ftp://ftp.cs.arizona.edu/agrep/agrep.ps.1.Z
[4] http://glimpse.cs.arizona.edu/
[5] ftp://ftp.cs.arizona.edu/glimpse/glimpse-4.1.src.tar.gz
[6] http://harvest.transarc.com/

agrep itself is not in the public domain; it is copyrighted by the University of Arizona.

I also used the book "String Searching Algorithms", Graham A. Stephen, World Scientific 1994, ISBN 981-02-1829-X, in which the 'bitap' ("Wu-Manber k-differences shift-add") and many other string searching algorithms are nicely summarized.

This code doesn't implement the "partition-scan" improvement described in TR-91-11, so this could still be made to run faster.
Neither does it implement all the described extensions (implemented are "sets of characters" (any-character and caseignoring as special cases of this) and "patterns with and without errors"; missing are: "wild cards" (Kleene star), "unknown number of errors" (finding out the edit distance when given two strings), "non-uniform costs", "set of patterns", "long patterns", and "regular expressions"), so it can still be made to run slower, too. In place of "non-uniform costs" feature I have an invention of mine where one can for example completely disallow substitutions. The feature is still largely untested (as is the whole program, come to think of it). Please read the COPYRIGHT. This implementation shares no code with agrep, none, it was made from scratch based on the Manber and Wu papers. Still, I have looked at the source code of agrep. Therefore I also include in this distribution the original agrep copyright, in the file COPYRIGHT.agrep. The inclusion of this file in no way affects the copyright of my code or the applicability of it to any purposes, even commercial ones. The existence of COPYRIGHT.agrep only serves the clause (2) in it, courtesy. The clauses (1) and (4) don't apply because this is not agrep. In clause (3) our copyrights agree, we guarantee nothing. This interpretation has been kindly sanctioned by Udi Manber. If you have any questions, detailed bug reports, enhancement suggestions, feature requests, or fan mail (I would like to know of any uses you put this code into), please feel free to contact me at . All bugs are mine, mine, all mine. 
Jarkko Hietaniemi November 1998

String-Approx-3.27/Makefile.PL

use ExtUtils::MakeMaker;

WriteMakefile(
    NAME         => 'String::Approx',
    VERSION_FROM => 'Approx.pm',
    OBJECT       => 'Approx.o apse.o',
#   CCFLAGS      => '-DAPSE_DEBUGGING -g',
#   CCFLAGS      => '-Wall -O -W -Waggregate-return -Wbad-function-cast -Wcast-align -Wcast-qual -Wconversion -Wendif-labels -Wfloat-equal -Wmissing-prototypes -Wmissing-noreturn -Wnested-externs -Wpointer-arith -Wredundant-decls -Wshadow -Wsign-compare -Wstrict-prototypes -Wwrite-strings -Wformat=2 -Wdisabled-optimization -ansi -pedantic',
    dist         => { 'COMPRESS' => 'gzip' },
    PREREQ_PM    => {
        'Test::More' => 0,
    },
);

String-Approx-3.27/README

Welcome to String::Approx 3.0. This release is a major update from String::Approx 2, of which 2.7 was the last release. See below for the future of version 2.

The most important change was that the underlying algorithm was changed completely. Instead of doing everything in Perl using regular expressions, we now do the matching in C using the so-called Manber-Wu k-differences algorithm shift-add. You have met this algorithm if you have used the agrep utility or the Glimpse indexing system. Because this implementation shares no code with agrep, only the well-publicized algorithms, the use of this software is limited only by my copyright (file COPYRIGHT). This interpretation has been kindly confirmed by Udi Manber. More details in the file README.apse.

This change brings both good and bad news. Good news first.

* We are now 2-3 times faster than String::Approx version 2.
* There should be no limit on the length of the pattern. Extremely long patterns haven't been tested much, though, so please do not assume overmuch. The algorithm used is actually _independent_ of the length of the pattern: it's O(kn) where k is the number of errors and n the length of the text.
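As an illustration of the shift-and idea behind the O(kn) bound, the following is a textbook-style sketch in C of Wu-Manber k-differences matching. It is not the code in apse.c: the name bitap_search and the limits (patterns of 1 to 64 characters, at most 8 errors) are assumptions chosen so that each state fits in one machine word and each text character costs a constant number of word operations per error level.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define BITAP_MAX_ERRORS 8   /* arbitrary cap for this sketch */

/*
 * Hypothetical helper, not part of apse.c.
 * Returns the index in text of the last character of the first
 * approximate occurrence of pattern (at most k edits), or -1 if
 * there is none or the arguments exceed the sketch's limits.
 * Bit i of state[d] is set when pattern[0..i] matches some
 * substring ending at the current text position with <= d edits.
 */
long bitap_search(const char *text, const char *pattern, unsigned k)
{
    unsigned long long mask[UCHAR_MAX + 1] = {0};
    unsigned long long state[BITAP_MAX_ERRORS + 1];
    unsigned long long accept;
    size_t m = strlen(pattern);
    size_t i, j;
    unsigned d;

    if (m == 0 || m > 64 || k > BITAP_MAX_ERRORS)
        return -1;

    accept = 1ULL << (m - 1);
    /* mask[c] has bit i set where pattern[i] == c. */
    for (i = 0; i < m; i++)
        mask[(unsigned char)pattern[i]] |= 1ULL << i;
    /* Before any text: the first d pattern characters can be deleted. */
    for (d = 0; d <= k; d++)
        state[d] = (1ULL << d) - 1;

    for (j = 0; text[j] != '\0'; j++) {
        unsigned long long c = mask[(unsigned char)text[j]];
        unsigned long long prev_old = state[0];   /* old state[d-1] */

        state[0] = ((state[0] << 1) | 1ULL) & c;  /* exact matching */
        for (d = 1; d <= k; d++) {
            unsigned long long cur_old = state[d];
            state[d] = (((cur_old << 1) | 1ULL) & c) /* match          */
                     | prev_old                      /* extra text char */
                     | (prev_old << 1)               /* substitution    */
                     | (state[d - 1] << 1)           /* skipped pattern char */
                     | 1ULL;                         /* 1-char prefix, d >= 1 */
            prev_old = cur_old;
        }
        if (state[k] & accept)
            return (long)j;
    }
    return -1;
}
```

With these definitions, bitap_search("wholesale", "sale", 0) returns 8, the index of the final "e". bitap_search("salve", "sale", 1) returns 2, because "sal" is already within one edit of "sale"; the reported end position is where a match is first possible, not where the closest match ends.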
Then the bad news:

* You do need a C compiler. If your system does not have a C compiler you should either get one or find a friendly soul to compile this extension for you. I won't be deleting String::Approx 2.7 because in some restrictive environments compiling C is not an option.
* The semantics of asubstitute() have changed. The match is now always stingy; that is, as short as possible. Previously if you matched for "word" with edit distance of two, "cork" and "wool" both would have matched -- and they still do. But: the matching parts will be "or" and "wo" and nothing more. There's still the stingy option but that's rather pointless now as far as shortening the match goes: it is now useful only for slightly speeding up matching. This change messes up asubstitute() rather badly, I'm very sorry about this. I will try to think of something better in future releases.
* aregex() is gone, never to return, because we no longer use regular expressions.
* The 'compat1' option to be backward compatible with String::Approx release 1 is no longer supported.

Perl 5.8 is required.

This library is free software; you can redistribute it and/or modify it under either the terms of the Artistic License 2.0, or the GNU Library General Public License, Version 2. See the files Artistic and LGPL for more details.

About the future of String::Approx version 2:

(1) I repeat: I won't be deleting version 2.7.
(2) Any updates to version 2 are unlikely. The software simply grew past maintainability.

Installation is as horribly complicated as before:

    perl Makefile.PL
    make
    make test
    make install

Let me (jhi@iki.fi) know of any trouble. Also let me know of cool uses you put String::Approx into. That's it. Enjoy.