KinoSearch1-1.01000755000765000765 011462203446 13531 5ustar00marvinmarvin000000000000KinoSearch1-1.01/ApacheLicense2.0.txt000444000765000765 2514411462203446 17361 0ustar00marvinmarvin000000000000 Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. Copyright [yyyy] [name of copyright owner] Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. KinoSearch1-1.01/Build.PL000444000765000765 2423311462203446 15206 0ustar00marvinmarvin000000000000use strict; use warnings; use 5.008003; use Module::Build; use File::Find qw( find ); use Storable qw( store retrieve ); use File::Spec::Functions qw( catfile ); use Fcntl; =for comment Build.PL file generates a slew of files on the fly before writing the Build script. =cut my $ks_xs_filepath = 'KinoSearch1.xs'; # keep lists of which .c .h and .xs files need to be rewritten and cleaned up my %needs_rewrite; # retrieve a list of modification times from when Build.PL was last run my $lastmod = -f 'lastmod' ? retrieve('lastmod') : {}; # If Build.PL was modified, force recompile of KinoSearch1.xs my $build_pl_lastmod = ( stat('Build.PL') )[9]; if ( !exists $lastmod->{'Build.PL'} or $lastmod->{'Build.PL'} != $build_pl_lastmod ) { $needs_rewrite{$ks_xs_filepath} = 1; } $lastmod->{'Build.PL'} = $build_pl_lastmod; # hold filepath => content pairs for generated files my %code = ( $ks_xs_filepath => '' ); my %source_pm = (); # grab all .pm filepaths, making sure that KinoSearch1.pm is first my @pm_filepaths; find( { wanted => sub { if ( $File::Find::name =~ /KinoSearch1\.pm$/ ) { unshift @pm_filepaths, $File::Find::name; } elsif ( $File::Find::name =~ /\.pm$/ ) { push @pm_filepaths, $File::Find::name; } }, no_chdir => 1, }, 'lib', ); for my $pm_filepath (@pm_filepaths) { open( my $module_fh, '<', $pm_filepath ) or die "couldn't open file '$pm_filepath': $!"; my $module_text = do { local $/; <$module_fh> }; my $outfilepath; # grab code that's delimited by an xs, c, or h __TAG__ my $inside = ''; my $line_count = 0; while ( $module_text =~ /(.*?(?:\n|\r\n|\r))/g ) { $line_count++; my $line = $1; if ( $line =~ /^__(\w+)__/ ) { # the tag to begin a block $inside = $1; if ( $inside eq 'XS' ) { # all XS code goes into one file: lib/KinoSearch1.xs $outfilepath = $ks_xs_filepath; } elsif ( $inside eq 'H' or $inside eq 'C' ) { # each .c and .h code section becomes its own file $outfilepath = $pm_filepath; $outfilepath =~ s/lib//; $outfilepath =~ s/\W//g; $outfilepath =~ s/pm$//; if ( $inside eq 'H' ) { $outfilepath .= ".h"; # prepend an #include to KinoSearch1.xs $code{$ks_xs_filepath} = qq|#include "$outfilepath"\n$code{$ks_xs_filepath}|; } else { $outfilepath .= ".c"; } $outfilepath = catfile( 'src', $outfilepath ); my $line_start = $line_count + 1; $code{$outfilepath} = qq|#line $line_start "$pm_filepath"\n|; } if ( $inside =~ /^(?:XS|H|C)$/ ) { # if the file has been modified, force a recompile my $mod_time = ( stat($module_fh) )[9]; if ( !exists $lastmod->{$pm_filepath}{$inside} or $lastmod->{$pm_filepath}{$inside} != $mod_time ) { $needs_rewrite{$outfilepath} = 1; } $lastmod->{$pm_filepath}{$inside} = $mod_time; $source_pm{$outfilepath} = $pm_filepath; } } elsif ( $inside =~ /^(?:XS|H|C)$/ ) { $code{$outfilepath} .= $line; } } } # write all the files that have been modified. for my $outfilepath ( keys %needs_rewrite ) { my $autogen_header = <<"END_AUTOGEN"; /*********************************************** !!!! DO NOT EDIT THIS FILE !!!! This file was auto-generated by Build.PL from $source_pm{$outfilepath} ***********************************************/ END_AUTOGEN print "Writing $outfilepath\n"; unlink $outfilepath; sysopen( my $fh, $outfilepath, O_CREAT | O_EXCL | O_WRONLY ) or die "Couldn't open file '$outfilepath' for writing: $!"; print $fh "$autogen_header$code{$outfilepath}" or die "Print to '$outfilepath' failed: $!"; close $fh or die "Couldn't close file '$outfilepath': $!"; } =begin Rationale All of KinoSearch1's C-struct types share the same typemap profile, but can't be mapped to a single type. Instead of tediously hand-editing the typemap file, we autogenerate the file. Adding a new type is now as simple as adding an item to the @struct_classes array (provided it follows the same pattern as all the others). =end Rationale =cut # write the typemap file. if ( $needs_rewrite{$ks_xs_filepath} ) { my @struct_classes = qw( KinoSearch1::Analysis::Stemmer::Stemmifier KinoSearch1::Analysis::Token KinoSearch1::Analysis::TokenBatch KinoSearch1::Index::SegTermEnum KinoSearch1::Index::TermBuffer KinoSearch1::Index::TermDocs KinoSearch1::Index::TermInfo KinoSearch1::Index::TermInfosWriter KinoSearch1::Search::HitCollector KinoSearch1::Search::MatchBatch KinoSearch1::Search::Scorer KinoSearch1::Search::Similarity KinoSearch1::Store::InStream KinoSearch1::Store::OutStream KinoSearch1::Util::BitVector KinoSearch1::Util::BoolSet KinoSearch1::Util::PriorityQueue KinoSearch1::Util::SortExternal ); my $typemap_start = qq|\nTYPEMAP\n|; my $typemap_input = qq|\n\nINPUT\n|; my $typemap_output = qq|\n\nOUTPUT\n|; for my $struct_class (@struct_classes) { my ($ctype) = $struct_class =~ /([^:]+$)/; my $uc_ctype = uc($ctype); $ctype .= ' *'; $typemap_start .= "$ctype\t$uc_ctype\n"; my $input_frag = <<'END_INPUT'; #UC_CTYPE# if (sv_derived_from($arg, \"#STRUCT_CLASS#\")) { $var = INT2PTR($type,( SvIV((SV*)SvRV($arg)) ) ); } else Perl_croak(aTHX_ \"$var is not of type #STRUCT_CLASS#\") END_INPUT $input_frag =~ s/#UC_CTYPE#/$uc_ctype/gsm; $input_frag =~ s/#STRUCT_CLASS#/$struct_class/gsm; $typemap_input .= $input_frag; my $output_frag .= <<'END_OUTPUT'; #UC_CTYPE# sv_setref_pv($arg, \"#STRUCT_CLASS#\", (void*)$var); END_OUTPUT $output_frag =~ s/#UC_CTYPE#/$uc_ctype/gsm; $output_frag =~ s/#STRUCT_CLASS#/$struct_class/gsm; $typemap_output .= $output_frag; } # blast it out print "Writing typemap\n"; unlink 'typemap'; sysopen( my $typemap_fh, 'typemap', O_CREAT | O_WRONLY | O_EXCL ) or die "Couldn't open 'typemap' for writing: $!"; print $typemap_fh "# Auto-generated file.\n\n" or die "Print to 'typemap' failed: $!"; print $typemap_fh "$typemap_start $typemap_input $typemap_output" or die "Print to 'typemap' failed: $!"; } # record mod times in anticipation of Build.PL's next run store( $lastmod, 'lastmod' ); my $builder = Module::Build->new( module_name => 'KinoSearch1', license => 'perl', dist_author => 'Marvin Humphrey ', dist_version_from => 'lib/KinoSearch1.pm', requires => { 'Compress::Zlib' => 0, 'Lingua::Stem::Snowball' => 0.94, 'Lingua::StopWords' => 0.02, }, build_requires => { 'ExtUtils::CBuilder' => 0, 'ExtUtils::ParseXS' => 0, }, create_makefile_pl => 'passthrough', # extra_compiler_flags => [ # '-Wall', '-Wextra', # '-pedantic', '-ansi', # '-DPERL_GCC_PEDANTIC', '-std=c89', # ], xs_files => { $ks_xs_filepath => 'lib/KinoSearch1.xs' }, c_source => 'src', add_to_cleanup => [ keys %code, 'KinoSearch1-*', 'typemap', 'MANIFEST.bak', 'lastmod', 'perltidy.ERR', '*.o', ], ); my @no_index_files = qw( buildlib/KinoSearch1/Test/TestUtils.pm devel/dump_index devel/hexdebug devel/kinotidy devel/kinotidyrc devel/predit devel/scan_enum.plx devel/test_tidiness.plx devel/valgrind_test.plx lib/KinoSearch1/Index/CompoundFileReader.pm lib/KinoSearch1/Index/CompoundFileWriter.pm lib/KinoSearch1/Index/DelDocs.pm lib/KinoSearch1/Index/FieldInfos.pm lib/KinoSearch1/Index/FieldsReader.pm lib/KinoSearch1/Index/FieldsWriter.pm lib/KinoSearch1/Index/IndexFileNames.pm lib/KinoSearch1/Index/IndexReader.pm lib/KinoSearch1/Index/MultiReader.pm lib/KinoSearch1/Index/MultiTermDocs.pm lib/KinoSearch1/Index/NormsReader.pm lib/KinoSearch1/Index/PostingsWriter.pm lib/KinoSearch1/Index/SegInfos.pm lib/KinoSearch1/Index/SegReader.pm lib/KinoSearch1/Index/SegTermDocs.pm lib/KinoSearch1/Index/SegTermEnum.pm lib/KinoSearch1/Index/SegWriter.pm lib/KinoSearch1/Index/TermBuffer.pm lib/KinoSearch1/Index/TermDocs.pm lib/KinoSearch1/Index/TermEnum.pm lib/KinoSearch1/Index/TermInfo.pm lib/KinoSearch1/Index/TermInfosReader.pm lib/KinoSearch1/Index/TermInfosWriter.pm lib/KinoSearch1/Index/TermVector.pm lib/KinoSearch1/Search/BooleanClause.pm lib/KinoSearch1/Search/BooleanScorer.pm lib/KinoSearch1/Search/HitCollector.pm lib/KinoSearch1/Search/HitQueue.pm lib/KinoSearch1/Search/PhraseScorer.pm lib/KinoSearch1/Search/Scorer.pm lib/KinoSearch1/Search/Searchable.pm lib/KinoSearch1/Search/TermScorer.pm lib/KinoSearch1/Search/Weight.pm lib/KinoSearch1/Store/FSLock.pm lib/KinoSearch1/Store/InStream.pm lib/KinoSearch1/Store/Lock.pm lib/KinoSearch1/Store/OutStream.pm lib/KinoSearch1/Store/RAMLock.pm lib/KinoSearch1/Util/BitVector.pm lib/KinoSearch1/Util/ByteBuf.pm lib/KinoSearch1/Util/Carp.pm lib/KinoSearch1/Util/CClass.pm lib/KinoSearch1/Util/Class.pm lib/KinoSearch1/Util/IntMap.pm lib/KinoSearch1/Util/MathUtils.pm lib/KinoSearch1/Util/MemManager.pm lib/KinoSearch1/Util/PriorityQueue.pm lib/KinoSearch1/Util/SortExternal.pm lib/KinoSearch1/Util/StringHelper.pm lib/KinoSearch1/Util/ToolSet.pm lib/KinoSearch1/Util/ToStringUtils.pm lib/KinoSearch1/Util/VerifyArgs.pm ); $builder->meta_add( { no_index => { files => \@no_index_files } } ); $builder->create_build_script(); KinoSearch1-1.01/Changes000444000765000765 27411462203445 15143 0ustar00marvinmarvin000000000000Revision history for KinoSearch1 1.01 2010-10-27 * Attempt to hide private classes from search.cpan.org more effectively. 1.00 2010-03-17 Forked from KinoSearch version 0.165. KinoSearch1-1.01/Makefile.PL000444000765000765 226311462203446 15643 0ustar00marvinmarvin000000000000# Note: this file was auto-generated by Module::Build::Compat version 0.3607 unless (eval "use Module::Build::Compat 0.02; 1" ) { print "This module requires Module::Build to install itself.\n"; require ExtUtils::MakeMaker; my $yn = ExtUtils::MakeMaker::prompt (' Install Module::Build now from CPAN?', 'y'); unless ($yn =~ /^y/i) { die " *** Cannot install without Module::Build. Exiting ...\n"; } require Cwd; require File::Spec; require CPAN; # Save this 'cause CPAN will chdir all over the place. my $cwd = Cwd::cwd(); CPAN::Shell->install('Module::Build::Compat'); CPAN::Shell->expand("Module", "Module::Build::Compat")->uptodate or die "Couldn't install Module::Build, giving up.\n"; chdir $cwd or die "Cannot chdir() back to $cwd: $!"; } eval "use Module::Build::Compat 0.02; 1" or die $@; Module::Build::Compat->run_build_pl(args => \@ARGV); my $build_script = 'Build'; $build_script .= '.com' if $^O eq 'VMS'; exit(0) unless(-e $build_script); # cpantesters convention require Module::Build; Module::Build::Compat->write_makefile(build_class => 'Module::Build'); KinoSearch1-1.01/MANIFEST000444000765000765 1443411462203446 15045 0ustar00marvinmarvin000000000000ApacheLicense2.0.txt Build.PL Changes buildlib/KinoSearch1/Test/TestUtils.pm devel/dump_index devel/hexdebug devel/kinotidy devel/kinotidyrc devel/predit devel/scan_enum.plx devel/test_tidiness.plx devel/valgrind_test.plx lib/KinoSearch1.pm lib/KinoSearch1/Analysis/Analyzer.pm lib/KinoSearch1/Analysis/LCNormalizer.pm lib/KinoSearch1/Analysis/PolyAnalyzer.pm lib/KinoSearch1/Analysis/Stemmer.pm lib/KinoSearch1/Analysis/Stopalizer.pm lib/KinoSearch1/Analysis/Token.pm lib/KinoSearch1/Analysis/TokenBatch.pm lib/KinoSearch1/Analysis/Tokenizer.pm lib/KinoSearch1/Docs/FileFormat.pod lib/KinoSearch1/Docs/Tutorial.pod lib/KinoSearch1/Document/Doc.pm lib/KinoSearch1/Document/Field.pm lib/KinoSearch1/Highlight/Encoder.pm lib/KinoSearch1/Highlight/Formatter.pm lib/KinoSearch1/Highlight/Highlighter.pm lib/KinoSearch1/Highlight/SimpleHTMLEncoder.pm lib/KinoSearch1/Highlight/SimpleHTMLFormatter.pm lib/KinoSearch1/Index/CompoundFileReader.pm lib/KinoSearch1/Index/CompoundFileWriter.pm lib/KinoSearch1/Index/DelDocs.pm lib/KinoSearch1/Index/FieldInfos.pm lib/KinoSearch1/Index/FieldsReader.pm lib/KinoSearch1/Index/FieldsWriter.pm lib/KinoSearch1/Index/IndexFileNames.pm lib/KinoSearch1/Index/IndexReader.pm lib/KinoSearch1/Index/MultiReader.pm lib/KinoSearch1/Index/MultiTermDocs.pm lib/KinoSearch1/Index/NormsReader.pm lib/KinoSearch1/Index/PostingsWriter.pm lib/KinoSearch1/Index/SegInfos.pm lib/KinoSearch1/Index/SegReader.pm lib/KinoSearch1/Index/SegTermDocs.pm lib/KinoSearch1/Index/SegTermEnum.pm lib/KinoSearch1/Index/SegWriter.pm lib/KinoSearch1/Index/Term.pm lib/KinoSearch1/Index/TermBuffer.pm lib/KinoSearch1/Index/TermDocs.pm lib/KinoSearch1/Index/TermEnum.pm lib/KinoSearch1/Index/TermInfo.pm lib/KinoSearch1/Index/TermInfosReader.pm lib/KinoSearch1/Index/TermInfosWriter.pm lib/KinoSearch1/Index/TermVector.pm lib/KinoSearch1/InvIndexer.pm lib/KinoSearch1/QueryParser/QueryParser.pm lib/KinoSearch1/Search/BooleanClause.pm lib/KinoSearch1/Search/BooleanQuery.pm lib/KinoSearch1/Search/BooleanScorer.pm lib/KinoSearch1/Search/Hit.pm lib/KinoSearch1/Search/HitCollector.pm lib/KinoSearch1/Search/HitQueue.pm lib/KinoSearch1/Search/Hits.pm lib/KinoSearch1/Search/MultiSearcher.pm lib/KinoSearch1/Search/PhraseQuery.pm lib/KinoSearch1/Search/PhraseScorer.pm lib/KinoSearch1/Search/Query.pm lib/KinoSearch1/Search/QueryFilter.pm lib/KinoSearch1/Search/Scorer.pm lib/KinoSearch1/Search/Searchable.pm lib/KinoSearch1/Search/SearchClient.pm lib/KinoSearch1/Search/SearchServer.pm lib/KinoSearch1/Search/Similarity.pm lib/KinoSearch1/Search/TermQuery.pm lib/KinoSearch1/Search/TermScorer.pm lib/KinoSearch1/Search/Weight.pm lib/KinoSearch1/Searcher.pm lib/KinoSearch1/Store/FSInvIndex.pm lib/KinoSearch1/Store/FSLock.pm lib/KinoSearch1/Store/InStream.pm lib/KinoSearch1/Store/InvIndex.pm lib/KinoSearch1/Store/Lock.pm lib/KinoSearch1/Store/OutStream.pm lib/KinoSearch1/Store/RAMInvIndex.pm lib/KinoSearch1/Store/RAMLock.pm lib/KinoSearch1/Util/BitVector.pm lib/KinoSearch1/Util/ByteBuf.pm lib/KinoSearch1/Util/Carp.pm lib/KinoSearch1/Util/CClass.pm lib/KinoSearch1/Util/Class.pm lib/KinoSearch1/Util/IntMap.pm lib/KinoSearch1/Util/MathUtils.pm lib/KinoSearch1/Util/MemManager.pm lib/KinoSearch1/Util/PriorityQueue.pm lib/KinoSearch1/Util/SortExternal.pm lib/KinoSearch1/Util/StringHelper.pm lib/KinoSearch1/Util/ToolSet.pm lib/KinoSearch1/Util/ToStringUtils.pm lib/KinoSearch1/Util/VerifyArgs.pm Makefile.PL MANIFEST copy META.yml README src/ppport.h t/000-load.t t/001-build_invindexes.t t/002-kinosearch.t t/010-verify_args.t t/011-class.t t/012-priority_queue.t t/013-bit_vector.t t/015-sort_external.t t/101-simple_template_io.t t/102-strings_template_io.t t/103-repeats_template_io.t t/104-parse_template_io.t t/105-invindex.t t/106-locking.t t/150-polyanalyzer.t t/152-token_batch.t t/153-lc_normalizer.t t/154-tokenizer.t t/155-stopalizer.t t/201-field_infos.t t/202-term.t t/203-compound_file_reader.t t/204-fields_reader.t t/205-seg_reader.t t/206-seg_infos.t t/207-seg_term_enum.t t/208-terminfo.t t/209-term_infos_reader.t t/210-deldocs.t t/211-seg_term_docs.t t/212-multi_term_docs.t t/213-segment_merging.t t/214-spec_field.t t/302-many_fields.t t/303-highlighter.t t/501-termquery.t t/502-phrasequery.t t/503-booleanquery.t t/504-similarity.t t/505-hit_queue.t t/506-hit_collector.t t/507-query_filter.t t/508-hits.t t/509-multi_searcher.t t/510-remote_search.t t/601-queryparser.t t/602-boosts.t t/603-query_boosts.t t/604-simple_search.t t/701-uscon.t t/999-remove_invindexes.t t/benchmarks/extract_reuters.plx t/benchmarks/indexers/BenchmarkingIndexer.pm t/benchmarks/indexers/kinosearch_indexer.plx t/benchmarks/indexers/LuceneIndexer.java t/benchmarks/indexers/plucene_indexer.plx t/benchmarks/README.txt t/pod-coverage.t t/pod.t t/us_constitution/amend1.html t/us_constitution/amend10.html t/us_constitution/amend11.html t/us_constitution/amend12.html t/us_constitution/amend13.html t/us_constitution/amend14.html t/us_constitution/amend15.html t/us_constitution/amend16.html t/us_constitution/amend17.html t/us_constitution/amend18.html t/us_constitution/amend19.html t/us_constitution/amend2.html t/us_constitution/amend20.html t/us_constitution/amend21.html t/us_constitution/amend22.html t/us_constitution/amend23.html t/us_constitution/amend24.html t/us_constitution/amend25.html t/us_constitution/amend26.html t/us_constitution/amend27.html t/us_constitution/amend3.html t/us_constitution/amend4.html t/us_constitution/amend5.html t/us_constitution/amend6.html t/us_constitution/amend7.html t/us_constitution/amend8.html t/us_constitution/amend9.html t/us_constitution/art1sec1.html t/us_constitution/art1sec10.html t/us_constitution/art1sec2.html t/us_constitution/art1sec3.html t/us_constitution/art1sec4.html t/us_constitution/art1sec5.html t/us_constitution/art1sec6.html t/us_constitution/art1sec7.html t/us_constitution/art1sec8.html t/us_constitution/art1sec9.html t/us_constitution/art2sec1.html t/us_constitution/art2sec2.html t/us_constitution/art2sec3.html t/us_constitution/art2sec4.html t/us_constitution/art3sec1.html t/us_constitution/art3sec2.html t/us_constitution/art3sec3.html t/us_constitution/art4sec1.html t/us_constitution/art4sec2.html t/us_constitution/art4sec3.html t/us_constitution/art4sec4.html t/us_constitution/art5.html t/us_constitution/art6.html t/us_constitution/art7.html t/us_constitution/index.html t/us_constitution/preamble.html t/us_constitution/uscon.css KinoSearch1-1.01/META.yml000444000765000765 2477211462203446 15173 0ustar00marvinmarvin000000000000--- abstract: 'search engine library' author: - 'Marvin Humphrey ' build_requires: ExtUtils::CBuilder: 0 ExtUtils::ParseXS: 0 configure_requires: Module::Build: 0.36 generated_by: 'Module::Build version 0.3607' license: perl meta-spec: url: http://module-build.sourceforge.net/META-spec-v1.4.html version: 1.4 name: KinoSearch1 no_index: files: - buildlib/KinoSearch1/Test/TestUtils.pm - devel/dump_index - devel/hexdebug - devel/kinotidy - devel/kinotidyrc - devel/predit - devel/scan_enum.plx - devel/test_tidiness.plx - devel/valgrind_test.plx - lib/KinoSearch1/Index/CompoundFileReader.pm - lib/KinoSearch1/Index/CompoundFileWriter.pm - lib/KinoSearch1/Index/DelDocs.pm - lib/KinoSearch1/Index/FieldInfos.pm - lib/KinoSearch1/Index/FieldsReader.pm - lib/KinoSearch1/Index/FieldsWriter.pm - lib/KinoSearch1/Index/IndexFileNames.pm - lib/KinoSearch1/Index/IndexReader.pm - lib/KinoSearch1/Index/MultiReader.pm - lib/KinoSearch1/Index/MultiTermDocs.pm - lib/KinoSearch1/Index/NormsReader.pm - lib/KinoSearch1/Index/PostingsWriter.pm - lib/KinoSearch1/Index/SegInfos.pm - lib/KinoSearch1/Index/SegReader.pm - lib/KinoSearch1/Index/SegTermDocs.pm - lib/KinoSearch1/Index/SegTermEnum.pm - lib/KinoSearch1/Index/SegWriter.pm - lib/KinoSearch1/Index/TermBuffer.pm - lib/KinoSearch1/Index/TermDocs.pm - lib/KinoSearch1/Index/TermEnum.pm - lib/KinoSearch1/Index/TermInfo.pm - lib/KinoSearch1/Index/TermInfosReader.pm - lib/KinoSearch1/Index/TermInfosWriter.pm - lib/KinoSearch1/Index/TermVector.pm - lib/KinoSearch1/Search/BooleanClause.pm - lib/KinoSearch1/Search/BooleanScorer.pm - lib/KinoSearch1/Search/HitCollector.pm - lib/KinoSearch1/Search/HitQueue.pm - lib/KinoSearch1/Search/PhraseScorer.pm - lib/KinoSearch1/Search/Scorer.pm - lib/KinoSearch1/Search/Searchable.pm - lib/KinoSearch1/Search/TermScorer.pm - lib/KinoSearch1/Search/Weight.pm - lib/KinoSearch1/Store/FSLock.pm - lib/KinoSearch1/Store/InStream.pm - lib/KinoSearch1/Store/Lock.pm - lib/KinoSearch1/Store/OutStream.pm - lib/KinoSearch1/Store/RAMLock.pm - lib/KinoSearch1/Util/BitVector.pm - lib/KinoSearch1/Util/ByteBuf.pm - lib/KinoSearch1/Util/Carp.pm - lib/KinoSearch1/Util/CClass.pm - lib/KinoSearch1/Util/Class.pm - lib/KinoSearch1/Util/IntMap.pm - lib/KinoSearch1/Util/MathUtils.pm - lib/KinoSearch1/Util/MemManager.pm - lib/KinoSearch1/Util/PriorityQueue.pm - lib/KinoSearch1/Util/SortExternal.pm - lib/KinoSearch1/Util/StringHelper.pm - lib/KinoSearch1/Util/ToolSet.pm - lib/KinoSearch1/Util/ToStringUtils.pm - lib/KinoSearch1/Util/VerifyArgs.pm provides: KinoSearch1: file: lib/KinoSearch1.pm version: 1.01 KinoSearch1::Analysis::Analyzer: file: lib/KinoSearch1/Analysis/Analyzer.pm KinoSearch1::Analysis::LCNormalizer: file: lib/KinoSearch1/Analysis/LCNormalizer.pm KinoSearch1::Analysis::PolyAnalyzer: file: lib/KinoSearch1/Analysis/PolyAnalyzer.pm KinoSearch1::Analysis::Stemmer: file: lib/KinoSearch1/Analysis/Stemmer.pm KinoSearch1::Analysis::Stopalizer: file: lib/KinoSearch1/Analysis/Stopalizer.pm KinoSearch1::Analysis::Token: file: lib/KinoSearch1/Analysis/Token.pm KinoSearch1::Analysis::TokenBatch: file: lib/KinoSearch1/Analysis/TokenBatch.pm KinoSearch1::Analysis::Tokenizer: file: lib/KinoSearch1/Analysis/Tokenizer.pm KinoSearch1::Document::Doc: file: lib/KinoSearch1/Document/Doc.pm KinoSearch1::Document::Field: file: lib/KinoSearch1/Document/Field.pm KinoSearch1::Highlight::Encoder: file: lib/KinoSearch1/Highlight/Encoder.pm KinoSearch1::Highlight::Formatter: file: lib/KinoSearch1/Highlight/Formatter.pm KinoSearch1::Highlight::Highlighter: file: lib/KinoSearch1/Highlight/Highlighter.pm KinoSearch1::Highlight::SimpleHTMLEncoder: file: lib/KinoSearch1/Highlight/SimpleHTMLEncoder.pm KinoSearch1::Highlight::SimpleHTMLFormatter: file: lib/KinoSearch1/Highlight/SimpleHTMLFormatter.pm KinoSearch1::Index::CompoundFileReader: file: lib/KinoSearch1/Index/CompoundFileReader.pm KinoSearch1::Index::CompoundFileWriter: file: lib/KinoSearch1/Index/CompoundFileWriter.pm KinoSearch1::Index::DelDocs: file: lib/KinoSearch1/Index/DelDocs.pm KinoSearch1::Index::FieldInfos: file: lib/KinoSearch1/Index/FieldInfos.pm KinoSearch1::Index::FieldsReader: file: lib/KinoSearch1/Index/FieldsReader.pm KinoSearch1::Index::FieldsWriter: file: lib/KinoSearch1/Index/FieldsWriter.pm KinoSearch1::Index::IndexFileNames: file: lib/KinoSearch1/Index/IndexFileNames.pm KinoSearch1::Index::IndexReader: file: lib/KinoSearch1/Index/IndexReader.pm KinoSearch1::Index::MultiReader: file: lib/KinoSearch1/Index/MultiReader.pm KinoSearch1::Index::MultiTermDocs: file: lib/KinoSearch1/Index/MultiTermDocs.pm KinoSearch1::Index::NormsReader: file: lib/KinoSearch1/Index/NormsReader.pm KinoSearch1::Index::PostingsWriter: file: lib/KinoSearch1/Index/PostingsWriter.pm KinoSearch1::Index::SegInfo: file: lib/KinoSearch1/Index/SegInfos.pm KinoSearch1::Index::SegInfos: file: lib/KinoSearch1/Index/SegInfos.pm KinoSearch1::Index::SegReader: file: lib/KinoSearch1/Index/SegReader.pm KinoSearch1::Index::SegTermDocs: file: lib/KinoSearch1/Index/SegTermDocs.pm KinoSearch1::Index::SegTermEnum: file: lib/KinoSearch1/Index/SegTermEnum.pm KinoSearch1::Index::SegWriter: file: lib/KinoSearch1/Index/SegWriter.pm KinoSearch1::Index::Term: file: lib/KinoSearch1/Index/Term.pm KinoSearch1::Index::TermBuffer: file: lib/KinoSearch1/Index/TermBuffer.pm KinoSearch1::Index::TermDocs: file: lib/KinoSearch1/Index/TermDocs.pm KinoSearch1::Index::TermEnum: file: lib/KinoSearch1/Index/TermEnum.pm KinoSearch1::Index::TermInfo: file: lib/KinoSearch1/Index/TermInfo.pm KinoSearch1::Index::TermInfosReader: file: lib/KinoSearch1/Index/TermInfosReader.pm KinoSearch1::Index::TermInfosWriter: file: lib/KinoSearch1/Index/TermInfosWriter.pm KinoSearch1::Index::TermVector: file: lib/KinoSearch1/Index/TermVector.pm KinoSearch1::InvIndexer: file: lib/KinoSearch1/InvIndexer.pm KinoSearch1::QueryParser::QueryParser: file: lib/KinoSearch1/QueryParser/QueryParser.pm KinoSearch1::Search::BitCollector: file: lib/KinoSearch1/Search/HitCollector.pm KinoSearch1::Search::BooleanClause: file: lib/KinoSearch1/Search/BooleanClause.pm KinoSearch1::Search::BooleanQuery: file: lib/KinoSearch1/Search/BooleanQuery.pm KinoSearch1::Search::BooleanScorer: file: lib/KinoSearch1/Search/BooleanScorer.pm KinoSearch1::Search::BooleanWeight: file: lib/KinoSearch1/Search/BooleanQuery.pm KinoSearch1::Search::CacheDFSource: file: lib/KinoSearch1/Search/MultiSearcher.pm KinoSearch1::Search::FilteredCollector: file: lib/KinoSearch1/Search/HitCollector.pm KinoSearch1::Search::Hit: file: lib/KinoSearch1/Search/Hit.pm KinoSearch1::Search::HitCollector: file: lib/KinoSearch1/Search/HitCollector.pm KinoSearch1::Search::HitQueue: file: lib/KinoSearch1/Search/HitQueue.pm KinoSearch1::Search::HitQueueCollector: file: lib/KinoSearch1/Search/HitCollector.pm KinoSearch1::Search::Hits: file: lib/KinoSearch1/Search/Hits.pm KinoSearch1::Search::MultiSearcher: file: lib/KinoSearch1/Search/MultiSearcher.pm KinoSearch1::Search::OffsetCollector: file: lib/KinoSearch1/Search/HitCollector.pm KinoSearch1::Search::PhraseQuery: file: lib/KinoSearch1/Search/PhraseQuery.pm KinoSearch1::Search::PhraseScorer: file: lib/KinoSearch1/Search/PhraseScorer.pm KinoSearch1::Search::PhraseWeight: file: lib/KinoSearch1/Search/PhraseQuery.pm KinoSearch1::Search::Query: file: lib/KinoSearch1/Search/Query.pm KinoSearch1::Search::QueryFilter: file: lib/KinoSearch1/Search/QueryFilter.pm KinoSearch1::Search::Scorer: file: lib/KinoSearch1/Search/Scorer.pm KinoSearch1::Search::SearchClient: file: lib/KinoSearch1/Search/SearchClient.pm KinoSearch1::Search::SearchServer: file: lib/KinoSearch1/Search/SearchServer.pm KinoSearch1::Search::Searchable: file: lib/KinoSearch1/Search/Searchable.pm KinoSearch1::Search::Similarity: file: lib/KinoSearch1/Search/Similarity.pm KinoSearch1::Search::TermQuery: file: lib/KinoSearch1/Search/TermQuery.pm KinoSearch1::Search::TermScorer: file: lib/KinoSearch1/Search/TermScorer.pm KinoSearch1::Search::TermWeight: file: lib/KinoSearch1/Search/TermQuery.pm KinoSearch1::Search::TitleSimilarity: file: lib/KinoSearch1/Search/Similarity.pm KinoSearch1::Search::Weight: file: lib/KinoSearch1/Search/Weight.pm KinoSearch1::Searcher: file: lib/KinoSearch1/Searcher.pm KinoSearch1::Store::FSInvIndex: file: lib/KinoSearch1/Store/FSInvIndex.pm KinoSearch1::Store::FSLock: file: lib/KinoSearch1/Store/FSLock.pm KinoSearch1::Store::InStream: file: lib/KinoSearch1/Store/InStream.pm KinoSearch1::Store::InvIndex: file: lib/KinoSearch1/Store/InvIndex.pm KinoSearch1::Store::Lock: file: lib/KinoSearch1/Store/Lock.pm KinoSearch1::Store::OutStream: file: lib/KinoSearch1/Store/OutStream.pm KinoSearch1::Store::RAMInvIndex: file: lib/KinoSearch1/Store/RAMInvIndex.pm KinoSearch1::Store::RAMLock: file: lib/KinoSearch1/Store/RAMLock.pm KinoSearch1::Util::BitVector: file: lib/KinoSearch1/Util/BitVector.pm KinoSearch1::Util::ByteBuf: file: lib/KinoSearch1/Util/ByteBuf.pm KinoSearch1::Util::CClass: file: lib/KinoSearch1/Util/CClass.pm KinoSearch1::Util::Carp: file: lib/KinoSearch1/Util/Carp.pm KinoSearch1::Util::Class: file: lib/KinoSearch1/Util/Class.pm KinoSearch1::Util::IntMap: file: lib/KinoSearch1/Util/IntMap.pm KinoSearch1::Util::MathUtils: file: lib/KinoSearch1/Util/MathUtils.pm KinoSearch1::Util::MemManager: file: lib/KinoSearch1/Util/MemManager.pm KinoSearch1::Util::PriorityQueue: file: lib/KinoSearch1/Util/PriorityQueue.pm KinoSearch1::Util::SortExternal: file: lib/KinoSearch1/Util/SortExternal.pm KinoSearch1::Util::StringHelper: file: lib/KinoSearch1/Util/StringHelper.pm KinoSearch1::Util::ToStringUtils: file: lib/KinoSearch1/Util/ToStringUtils.pm KinoSearch1::Util::ToolSet: file: lib/KinoSearch1/Util/ToolSet.pm KinoSearch1::Util::VerifyArgs: file: lib/KinoSearch1/Util/VerifyArgs.pm requires: Compress::Zlib: 0 Lingua::Stem::Snowball: 0.94 Lingua::StopWords: 0.02 resources: license: http://dev.perl.org/licenses/ version: 1.01 KinoSearch1-1.01/README000444000765000765 77711462203445 14540 0ustar00marvinmarvin000000000000KinoSearch1 Search engine library. INSTALLATION To install this module, run the following commands: perl Build.PL perl Build perl Build test perl Build install XS CODE KinoSearch1 is a Perl/XS distribution, despite the apparent lack of .xs files. KinoSearch1.xs gets written on the fly when 'perl Build.PL' is invoked. COPYRIGHT AND LICENCE Copyright 2005-2010 Marvin Humphrey This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. KinoSearch1-1.01/buildlib000755000765000765 011462203446 15317 5ustar00marvinmarvin000000000000KinoSearch1-1.01/buildlib/KinoSearch1000755000765000765 011462203446 17426 5ustar00marvinmarvin000000000000KinoSearch1-1.01/buildlib/KinoSearch1/Test000755000765000765 011462203446 20345 5ustar00marvinmarvin000000000000KinoSearch1-1.01/buildlib/KinoSearch1/Test/TestUtils.pm000444000765000765 1443311462203446 23025 0ustar00marvinmarvin000000000000use strict; use warnings; package KinoSearch1::Test::TestUtils; use base qw( Exporter ); our @EXPORT_OK = qw( working_dir create_working_dir remove_working_dir create_index create_persistent_test_index test_index_loc persistent_test_index_loc init_test_index_loc get_uscon_docs utf8_test_strings test_analyzer ); use KinoSearch1::InvIndexer; use KinoSearch1::Store::RAMInvIndex; use KinoSearch1::Store::FSInvIndex; use KinoSearch1::Analysis::Tokenizer; use KinoSearch1::Analysis::TokenBatch; use KinoSearch1::Analysis::PolyAnalyzer; use File::Spec::Functions qw( catdir catfile curdir ); use Encode qw( _utf8_off ); use File::Path qw( rmtree ); use Carp; my $working_dir = catfile( curdir(), 'kinosearch_test' ); # Return a directory within the system's temp directory where we will put all # testing scratch files. sub working_dir {$working_dir} sub create_working_dir { mkdir( $working_dir, 0700 ) or die "Can't mkdir '$working_dir': $!"; } # Verify that this user owns the working dir, then zap it. Returns true upon # success. sub remove_working_dir { return unless -d $working_dir; rmtree $working_dir; return 1; } # Return a location for a test index to be used by a single test file. If # the test file crashes it cannot clean up after itself, so we put the cleanup # routine in a single test file to be run at or near the end of the test # suite. sub test_index_loc { return catdir( $working_dir, 'test_index' ); } # Return a location for a test index intended to be shared by multiple test # files. It will be cleaned as above. sub persistent_test_index_loc { return catdir( $working_dir, 'persistent_test_index' ); } # Destroy anything left over in the test_index location, then create the # directory. Finally, return the path. sub init_test_index_loc { my $dir = test_index_loc(); rmtree $dir; die "Can't clean up '$dir'" if -e $dir; mkdir $dir or die "Can't mkdir '$dir': $!"; return $dir; } # Build a RAM index, using the supplied array of strings as source material. # The index will have a single field: "content". sub create_index { my @docs = @_; my $tokenizer = KinoSearch1::Analysis::Tokenizer->new; my $invindex = KinoSearch1::Store::RAMInvIndex->new; my $invindexer = KinoSearch1::InvIndexer->new( invindex => $invindex, analyzer => $tokenizer, create => 1, ); $invindexer->spec_field( name => 'content' ); for (@docs) { my $doc = $invindexer->new_doc; $doc->set_value( content => $_ ); $invindexer->add_doc($doc); } $invindexer->finish; return $invindex; } # Slurp us constitition docs and build hashrefs. sub get_uscon_docs { my $uscon_dir = catdir( 't', 'us_constitution' ); opendir( my $uscon_dh, $uscon_dir ) or die "couldn't opendir '$uscon_dir': $!"; my @filenames = grep {/\.html$/} sort readdir $uscon_dh; closedir $uscon_dh or die "couldn't closedir '$uscon_dir': $!"; my %docs; for my $filename (@filenames) { next if $filename eq 'index.html'; my $filepath = catfile( $uscon_dir, $filename ); open( my $fh, '<', $filepath ) or die "couldn't open file '$filepath': $!"; my $content = do { local $/; <$fh> }; $content =~ m#(.*?)#s or die "couldn't isolate title in '$filepath'"; my $title = $1; $content =~ m#
(.*?)
#s or die "couldn't isolate bodytext in '$filepath'"; my $bodytext = $1; $bodytext =~ s/<.*?>//sg; $bodytext =~ s/\s+/ /sg; $docs{$filename} = { title => $title, bodytext => $bodytext, url => "/us_constitution/$filename", }; } return \%docs; } sub create_persistent_test_index { my $invindexer; my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new( language => 'en' ); $invindexer = KinoSearch1::InvIndexer->new( invindex => persistent_test_index_loc(), create => 1, analyzer => $polyanalyzer, ); $invindexer->spec_field( name => 'content' ); for ( 0 .. 10000 ) { my $doc = $invindexer->new_doc; $doc->set_value( content => "zz$_" ); $invindexer->add_doc($doc); } $invindexer->finish; undef $invindexer; $invindexer = KinoSearch1::InvIndexer->new( invindex => persistent_test_index_loc(), analyzer => $polyanalyzer, ); $invindexer->spec_field( name => 'content' ); my $source_docs = get_uscon_docs(); for ( values %$source_docs ) { my $doc = $invindexer->new_doc; $doc->set_value( content => $_->{bodytext} ); $invindexer->add_doc($doc); } $invindexer->finish; undef $invindexer; $invindexer = KinoSearch1::InvIndexer->new( invindex => persistent_test_index_loc(), analyzer => $polyanalyzer, ); $invindexer->spec_field( name => 'content' ); my @chars = ( 'a' .. 'z' ); for ( 0 .. 1000 ) { my $content = ''; for my $num_words ( 1 .. int( rand(20) ) ) { for ( 1 .. ( int( rand(10) ) + 10 ) ) { $content .= @chars[ rand(@chars) ]; } $content .= ' '; } my $doc = $invindexer->new_doc; $doc->set_value( content => $content ); $invindexer->add_doc($doc); } $invindexer->finish( optimize => 1 ); } # Return 3 strings useful for verifying UTF-8 integrity. sub utf8_test_strings { my $smiley = "\x{263a}"; my $not_a_smiley = $smiley; _utf8_off($not_a_smiley); my $frowny = $not_a_smiley; utf8::upgrade($frowny); return ( $smiley, $not_a_smiley, $frowny ); } # Verify an Analyzer's analyze() method. sub test_analyzer { my ( $analyzer, $source, $expected, $message ) = @_; my $batch = KinoSearch1::Analysis::TokenBatch->new; $batch->append( $source, 0, length($source) ); $batch = $analyzer->analyze($batch); my @got; while ( $batch->next ) { push @got, $batch->get_text; } Test::More::is_deeply( \@got, $expected, "analyze: $message" ); } 1; __END__ __COPYRIGHT__ Copyright 2005-2010 Marvin Humphrey This program is free software; you can redistribute it and/or modify under the same terms as Perl itself. KinoSearch1-1.01/devel000755000765000765 011462203446 14630 5ustar00marvinmarvin000000000000KinoSearch1-1.01/devel/dump_index000555000765000765 403511462203446 17051 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use KinoSearch1::Index::IndexReader; my $where = shift @ARGV; if( !$where || ! -e $where ){ die "please specify an index location at the command line\n"; } my $r = KinoSearch1::Index::IndexReader->new( invindex => $where ); my @readers = ref $r->{sub_readers} eq 'ARRAY' ? @{ $r->{sub_readers} } : $r; print "We have " . @readers . " readers\n"; print "\n\nDocuments:\n"; for my $reader (@readers) { print "Segment " . $reader->get_seg_name . " has " . $reader->max_doc . " docs\n"; my $finfos = $reader->get_finfos; my $term_enum = $reader->terms; print "Fields:\n"; my %fields; for my $field ( $finfos->get_infos ) { $fields{ $field->get_field_num } = $field->get_name; print "\t" . $field->get_field_num . ": " . $field->get_name; my @info; foreach my $i (qw(indexed stored analyzed vectorized binary compressed)) { my $method = "get_$i"; push @info, $i if ( $field->$method ); } print " [" . join( ',', map { substr( $_, 0, 1 ) } sort @info ) . "]" if (@info); print "\n"; } print "Terms:\n"; my $td = $reader->term_docs; while ( $term_enum->next ) { my $term = $term_enum->get_term; print $term->to_string . "\n"; $td->seek($term); while ( $td->next ) { print "\t Doc " . $td->get_doc . " (" . $td->get_doc_freq . " occurrences)\n"; } } } print "Total documents: " . $r->max_doc . " in " . @readers . " segments\n"; __END__ =head1 NAME dump_index - dump the contents of an index =head1 SYNOPSIS perl dump_index $INDEX_LOCATION =head1 DESCRIPTION This will dump out an index in human readable form. =head1 AUTHOR Adapted from a Plucene-based version by Brian Phillips. =head1 COPYRIGHT AND LICENCE Copyright 2006 Brian Phillips. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. =cut KinoSearch1-1.01/devel/hexdebug000555000765000765 57311462203446 16473 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use Data::Hexdumper qw( hexdump ); die "Usage: hexdebug FILE FILE_POS BYTES_TO_DUMP" unless @ARGV == 3; open( my $fh, '<', $ARGV[0] ) or die "Couldn't open file '$ARGV[0]': $!"; seek $fh, $ARGV[1], 0; my $content; read( $fh, $content, $ARGV[2] ); print hexdump( data => $content, space_as_space => 1, ); KinoSearch1-1.01/devel/kinotidy000555000765000765 31711462203446 16526 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use File::Spec::Functions qw( catfile ); use Perl::Tidy; my $rc_filepath = catfile( 'devel', 'kinotidyrc' ); Perl::Tidy::perltidy( perltidyrc => $rc_filepath ); KinoSearch1-1.01/devel/kinotidyrc000555000765000765 125511462203446 17075 0ustar00marvinmarvin000000000000# This set is derived from the recommendations in Perl Best Practices. --maximum-line-length=78 # 78 cols --indent-columns=4 # 4 spaces --continuation-indentation=4 # 4 spaces --vertical-tightness=2 # maximum --closing-token-indentation=0 # none --paren-tightness=1 # medium --brace-tightness=1 # medium --square-bracket-tightness=1 # medium --block-brace-tightness=1 # medium --no-space-for-semicolon --no-outdent-long-quotes --noblanks-before-comments # all operators except ** << >> ! ~ ^ --want-break-before="% + - * / x & | && || . %= += -= *= /= x= &= |= &&= ||= .= **= >>= <<= ^= = < > == >= <= != =~ !~" KinoSearch1-1.01/devel/predit000555000765000765 323611462203446 16206 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use Text::Diff qw( diff ); use File::Find qw( find ); use Getopt::Std qw( getopts ); $File::Find::dont_use_nlink = 1; use vars qw( $opt_s $opt_v ); getopts('sv:'); die "usage: predit [-s] [-v pattern] [perl code] [file ...]" unless @ARGV >= 2; my $code = shift @ARGV; my $exclude; if (defined $opt_v) { $exclude = qr/$opt_v/; } for my $start (@ARGV) { find( { wanted => \&maybe_edit, no_chdir => 1, }, $start, ); } sub maybe_edit { my $filepath = $File::Find::name; return if ( defined $exclude and $File::Find::name =~ $exclude ); return unless -f $filepath; open( my $fh, "<", $filepath ) or die "Couldn't open '$filepath' for reading: $!"; my $orig = ''; my $edited = ''; if ($opt_s) { local $/; $orig = $edited = <$fh>; for ($edited) { eval $code; die $@ if $@; } } else { while (<$fh>) { $orig .= $_; eval $code; die $@ if $@; $edited .= $_; } } close $fh or die "Couldn't close '$filepath': $!"; if ($edited eq $orig) { print "No change to $filepath\n"; return; } # confirm with user that the change worked as intended. my $diff = diff( \$orig, \$edited ); print "\nFILE: $filepath\n$diff\nApply? "; my $response = ; return unless $response =~ /^y/i; print "Applying edit...\n"; open( $fh, ">", $filepath ) or die "Couldn't open '$filepath' for writing: $!"; print $fh $edited; close $fh or die "Couldn't close '$filepath': $!"; } KinoSearch1-1.01/devel/scan_enum.plx000444000765000765 156411462203446 17470 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; $|++; use Time::HiRes qw( time ); use KinoSearch1::Store::FSInvIndex; use KinoSearch1::Index::FieldInfos; use KinoSearch1::Index::CompoundFileReader; use KinoSearch1::Index::SegTermEnum; my $invindex = KinoSearch1::Store::FSInvIndex->new( path => $ARGV[0], ); my $cfs_reader = KinoSearch1::Index::CompoundFileReader->new( invindex => $invindex, seg_name => '_1', ); my $finfos = KinoSearch1::Index::FieldInfos->new; $finfos->read_infos( $cfs_reader->open_instream('_1.fnm')); my $t0 = time; while (1) { print "."; # 1 for 1 .. 10000; my $instream = $cfs_reader->open_instream('_1.tis'); my $enum = KinoSearch1::Index::SegTermEnum->new( finfos => $finfos, instream => $instream, ); $enum->fill_cache(); # 1 while defined (my $term = $enum->next); } print ((time - $t0) . " secs\n"); KinoSearch1-1.01/devel/test_tidiness.plx000444000765000765 261511462203445 20376 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; =for comment test_tidiness.plx => check all Perl modules and test files to see if their tidiness is up to date. Since this has no effect on users, it's not part of the standard test suite. =cut use File::Find qw( find ); use File::Spec::Functions qw( catfile ); use Text::Diff; use Perl::Tidy; use Test::More 'no_plan'; # grab all .pm filepaths my @paths; find( { wanted => sub { push @paths, $File::Find::name if $File::Find::name =~ /\.pm$/; }, no_chdir => 1, }, 'lib', ); # grab all .t files find( { wanted => sub { push @paths, $File::Find::name if $File::Find::name =~ /\.t$/; }, no_chdir => 1, }, 't', ); my $rc_filepath = catfile('devel', 'kinotidyrc'); ok(-f $rc_filepath, "found $rc_filepath"); for my $path (@paths) { # grab orig text open( my $module_fh, '<', $path ) or die "couldn't open file '$path' for reading: $!"; my $orig_text = do { local $/; <$module_fh> }; close $module_fh; my $tidied = ''; Perl::Tidy::perltidy( source => \$orig_text, destination => \$tidied, perltidyrc => $rc_filepath, ); is( index($orig_text, $tidied), 0, "$path" ); if (index($orig_text, $tidied) != 0) { warn diff(\$orig_text, \$tidied); ; } } KinoSearch1-1.01/devel/valgrind_test.plx000555000765000765 106111462203446 20360 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; $|++; =begin comment Run the test suite under valgrind and log the output. =end comment =cut opendir(my $t_dir, 't') or die "Couldn't opendir 't': $!"; my @t_files = sort grep { /\.t$/ } readdir $t_dir; closedir $t_dir; open(my $log_fh, '>', "valgrind_test.log"); for my $t_file (@t_files) { my $command = "valgrind --leak-check=full $^X -Mblib t/$t_file 2>&1"; my $output = "\n\n" . (scalar localtime(time)) . "\n$command\n"; $output .= `$command`; print $output; print $log_fh $output; }KinoSearch1-1.01/lib000755000765000765 011462203446 14277 5ustar00marvinmarvin000000000000KinoSearch1-1.01/lib/KinoSearch1.pm000444000765000765 1536711462203446 17135 0ustar00marvinmarvin000000000000package KinoSearch1; use strict; use warnings; use 5.008003; our $VERSION = '1.01'; use constant K_DEBUG => 0; use XSLoader; # This loads a large number of disparate subs. XSLoader::load( 'KinoSearch1', $VERSION ); use base qw( Exporter ); our @EXPORT_OK = qw( K_DEBUG kdump ); sub kdump { require Data::Dumper; my $kdumper = Data::Dumper->new( [@_] ); $kdumper->Sortkeys( sub { return [ sort keys %{ $_[0] } ] } ); $kdumper->Indent(1); warn $kdumper->Dump; } 1; __END__ __XS__ #include "limits.h" #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #define NEED_newRV_noinc_GLOBAL #include "ppport.h" MODULE = KinoSearch1 PACKAGE = KinoSearch1 PROTOTYPES: disable BOOT: items = 0; =for comment Return 1 if memory debugging is enabled. See KinoSearch1::Util::MemManager. =cut IV memory_debugging_enabled() CODE: RETVAL = KINO_MEM_LEAK_DEBUG; OUTPUT: RETVAL __POD__ =head1 NAME KinoSearch1 - search engine library =head1 VERSION 1.01 =head1 STABLE FORK KinoSearch1 is a fork of L version 0.165 intended to provide stability and backwards compatibility. For the latest features, see the main branch. =head1 SYNOPSIS First, write an application to build an inverted index, or "invindex", from your document collection. use KinoSearch1::InvIndexer; use KinoSearch1::Analysis::PolyAnalyzer; my $analyzer = KinoSearch1::Analysis::PolyAnalyzer->new( language => 'en' ); my $invindexer = KinoSearch1::InvIndexer->new( invindex => '/path/to/invindex', create => 1, analyzer => $analyzer, ); $invindexer->spec_field( name => 'title', boost => 3, ); $invindexer->spec_field( name => 'bodytext' ); while ( my ( $title, $bodytext ) = each %source_documents ) { my $doc = $invindexer->new_doc; $doc->set_value( title => $title ); $doc->set_value( bodytext => $bodytext ); $invindexer->add_doc($doc); } $invindexer->finish; Then, write a second application to search the invindex: use KinoSearch1::Searcher; use KinoSearch1::Analysis::PolyAnalyzer; my $analyzer = KinoSearch1::Analysis::PolyAnalyzer->new( language => 'en' ); my $searcher = KinoSearch1::Searcher->new( invindex => '/path/to/invindex', analyzer => $analyzer, ); my $hits = $searcher->search( query => "foo bar" ); while ( my $hit = $hits->fetch_hit_hashref ) { print "$hit->{title}\n"; } =head1 DESCRIPTION KinoSearch1 is a loose port of the Java search engine library Apache Lucene, written in Perl and C. The archetypal application is website search, but it can be put to many different uses. =head2 Features =over =item * Extremely fast and scalable - can handle millions of documents =item * Incremental indexing (addition/deletion of documents to/from an existing index). =item * Full support for 12 Indo-European languages. =item * Support for boolean operators AND, OR, and AND NOT; parenthetical groupings, and prepended +plus and -minus =item * Algorithmic selection of relevant excerpts and highlighting of search terms within excerpts =item * Highly customizable query and indexing APIs =item * Phrase matching =item * Stemming =item * Stoplists =back =head2 Getting Started KinoSearch1 has many, many classes, but you only need to get aquainted with three to start with: =over =item * L =item * L =item * L =back Probably the quickest way to get something up and running is to cut and paste the sample applications out of L and adapt them for your purposes. =head1 SEE ALSO The actively developed main branch, L. The KinoSearch homepage, where you'll find links to the mailing list and so on, is L. The Lucene homepage is L. L, for an overview of the invindex file format. =head1 History Search::Kinosearch 0.02x, no longer supported, is this suite's forerunner. L is a pure-Perl port of Lucene 1.3. KinoSearch is a from-scratch project which attempts to draws on the lessons of both. The API is not compatible with either. KinoSearch is named for Kino, the main character in John Steinbeck's novella, "The Pearl". =head1 SUPPORT Please direct support questions to the KinoSearch mailing list: subscription information at L. =head1 BUGS UTF-8 scalars are not indexed properly. This is fixed in KinoSearch 0.3x, but cannot be fixed in KinoSearch1 without breaking index compatibility with KinoSearch 0.165 from which KinoSearch1 was forked. Not thread-safe. Indexing crashes reliably on Solaris 2.9 or other systems which are fussy about pointer alignment. Please report any other bugs or feature requests to C, or through the web interface at L. =head1 COPYRIGHT & LICENSE Copyright 2005-2010 Marvin Humphrey This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. Terms of usage for Apache Lucene, from which portions of KinoSearch1 are derived, are spelled out in the Apache License: see the file "ApacheLicense2.0.txt". =head1 DISCLAIMER OF WARRANTY BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. =cut KinoSearch1-1.01/lib/KinoSearch1000755000765000765 011462203446 16406 5ustar00marvinmarvin000000000000KinoSearch1-1.01/lib/KinoSearch1/InvIndexer.pm000444000765000765 4106311462203446 21200 0ustar00marvinmarvin000000000000package KinoSearch1::InvIndexer; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); use constant UNINITIALIZED => 0; use constant INITIALIZED => 1; use constant FINISHED => 2; BEGIN { __PACKAGE__->init_instance_vars( # constructor args / members create => undef, invindex => undef, analyzer => undef, # members reader => undef, analyzers => undef, sinfos => undef, finfos => undef, doc_template => undef, frozen_doc => undef, similarity => undef, field_sims => undef, seg_writer => undef, write_lock => undef, state => UNINITIALIZED, ); } use Storable qw( freeze thaw ); use File::Spec::Functions qw( catfile tmpdir ); use KinoSearch1::Document::Doc; use KinoSearch1::Document::Field; use KinoSearch1::Analysis::Analyzer; use KinoSearch1::Store::FSInvIndex; use KinoSearch1::Index::FieldInfos; use KinoSearch1::Index::FieldsReader; use KinoSearch1::Index::IndexReader; use KinoSearch1::Index::SegInfos; use KinoSearch1::Index::SegWriter; use KinoSearch1::Index::IndexFileNames qw( WRITE_LOCK_NAME COMMIT_LOCK_NAME WRITE_LOCK_TIMEOUT COMMIT_LOCK_TIMEOUT ); use KinoSearch1::Search::Similarity; sub init_instance { my $self = shift; $self->{analyzers} = {}; $self->{field_sims} = {}; # use a no-op Analyzer if not supplied $self->{analyzer} ||= KinoSearch1::Analysis::Analyzer->new; # create a few members $self->{similarity} = KinoSearch1::Search::Similarity->new; $self->{sinfos} = KinoSearch1::Index::SegInfos->new; $self->{doc_template} = KinoSearch1::Document::Doc->new; # confirm or create an InvIndex object my $invindex; if ( blessed( $self->{invindex} ) and $self->{invindex}->isa('KinoSearch1::Store::InvIndex') ) { $invindex = $self->{invindex}; $self->{create} = $invindex->get_create unless defined $self->{create}; } elsif ( defined $self->{invindex} ) { $invindex = $self->{invindex} = KinoSearch1::Store::FSInvIndex->new( create => $self->{create}, path => $self->{invindex}, ); } else { croak("Required parameter 'invindex' not supplied"); } # get a write lock for this invindex. my $write_lock = $invindex->make_lock( lock_name => WRITE_LOCK_NAME, timeout => WRITE_LOCK_TIMEOUT, ); if ( $write_lock->obtain ) { # only assign if successful, otherwise DESTROY unlocks (bad!) $self->{write_lock} = $write_lock; } else { croak( "invindex locked: " . $write_lock->get_lock_name ); } # read/write SegInfos eval { $invindex->run_while_locked( lock_name => COMMIT_LOCK_NAME, timeout => COMMIT_LOCK_TIMEOUT, do_body => sub { $self->{create} ? $self->{sinfos}->write_infos($invindex) : $self->{sinfos}->read_infos($invindex); }, ); }; if ($@) { $self->{create} ? croak("failed to create invindex: $@") : croak("failed to open existing invindex: $@"); } # get a finfos and maybe a reader if ( $self->{create} ) { $self->{finfos} = KinoSearch1::Index::FieldInfos->new; } else { $self->{reader} = KinoSearch1::Index::IndexReader->new( invindex => $invindex ); $self->{finfos} = $self->{reader}->generate_field_infos; } # more initialization is coming after fields are spec'd... } sub _delayed_init { my $self = shift; my ( $invindex, $finfos, $field_sims ) = @{$self}{qw( invindex finfos field_sims )}; confess("finish has been called") if $self->{state} == FINISHED; confess("internal error: already initialized") if $self->{state} == INITIALIZED; $self->{state} = INITIALIZED; # create a cloning template my $doc = $self->{doc_template}; for my $field ( $doc->get_fields ) { $field->set_field_num( $finfos->get_field_num( $field->get_name ) ); } $self->{frozen_doc} = freeze($doc); # set sim for each field my $main_sim = $self->{similarity}; for my $finfo ( $finfos->get_infos ) { $field_sims->{ $finfo->get_name } ||= $main_sim; } # name a new segment and create a SegWriter my $out_seg_name = $self->_new_seg_name; $self->{seg_writer} = KinoSearch1::Index::SegWriter->new( invindex => $invindex, seg_name => $out_seg_name, finfos => $finfos->clone, field_sims => $field_sims, ); } sub spec_field { my $self = shift; # don't allow new fields to be spec'd once the seg is in motion croak("Too late to spec field (new_doc has been called)") unless $self->{state} == UNINITIALIZED; # detect or define a Field object my $field; if ( blessed( $_[0] ) ) { $field = shift; } else { eval { $field = KinoSearch1::Document::Field->new(@_) }; croak $@ if $@; } # cache fnm_bits and fdt_bits $field->set_fnm_bits( KinoSearch1::Index::FieldInfos->encode_fnm_bits($field) ); $field->set_fdt_bits( KinoSearch1::Index::FieldsReader->encode_fdt_bits($field) ); # establish which analyzer will be used against the field $self->{analyzers}{ $field->get_name } = ( $field->get_analyzer || $self->{analyzer} ); # don't copy the analyzer into the template, so that it can be overridden $field->set_analyzer(undef); # add the field to the finfos and the template. $self->{finfos}->add_field($field); $self->{doc_template}->add_field($field); } sub new_doc { my $self = shift; $self->_delayed_init unless $self->{state} == INITIALIZED; return thaw( $self->{frozen_doc} ); } sub set_similarity { if ( @_ == 3 ) { my ( $self, $field_name, $sim ) = @_; $self->{field_sims}{$field_name} = $sim; } else { $_[0]->{similarity} = $_[1]; } } sub add_doc { my ( $self, $doc ) = @_; # assign analyzers for my $field ( $doc->get_fields ) { if ( $field->get_analyzed ) { next if $field->get_analyzer; my $fieldname = $field->get_name; $field->set_analyzer( $self->{analyzers}{$fieldname} ); } } # add doc to output segment $self->{seg_writer}->add_doc($doc); } sub add_invindexes { my ( $self, @invindexes ) = @_; confess("Can't call add_invindexes after new_doc") if $self->{state} == INITIALIZED; # verify or obtain InvIndex objects for (@invindexes) { if ( !a_isa_b( $_, 'KinoSearch1::Store::InvIndex' ) ) { $_ = KinoSearch1::Store::FSInvIndex->new( path => $_ ); } } # get a reader for each invindex my @readers = map { KinoSearch1::Index::IndexReader->new( invindex => $_ ) } @invindexes; # merge finfos and init for my $reader (@readers) { $self->{finfos}->consolidate( $reader->get_finfos ); } $self->_delayed_init; # add all segments in each of the supplied invindexes my $seg_writer = $self->{seg_writer}; for my $reader (@readers) { $seg_writer->add_segment($_) for $reader->segreaders_to_merge('all'); } } sub delete_docs_by_term { my ( $self, $term ) = @_; confess("Not a KinoSearch1::Index::Term") unless a_isa_b( $term, 'KinoSearch1::Index::Term' ); return unless $self->{reader}; $self->_delayed_init unless $self->{state} == INITIALIZED; $self->{reader}->delete_docs_by_term($term); } our %finish_defaults = ( optimize => 0, ); sub finish { my $self = shift; confess kerror() unless verify_args( \%finish_defaults, @_ ); my %args = ( %finish_defaults, @_ ); # if no changes were made to the index, don't write anything if ( $self->{state} == UNINITIALIZED ) { if ( !$args{optimize} ) { return; } else { $self->_delayed_init; } } my ( $invindex, $sinfos, $seg_writer ) = @{$self}{qw( invindex sinfos seg_writer )}; # perform segment merging my @to_merge = $self->{reader} ? $self->{reader}->segreaders_to_merge( $args{optimize} ) : (); $seg_writer->add_segment($_) for @to_merge; $sinfos->delete_segment( $_->get_seg_name ) for @to_merge; # finish the segment $seg_writer->finish; # now that the seg is complete, write its info to the 'segments' file my $doc_count = $seg_writer->get_doc_count; if ($doc_count) { $sinfos->add_info( KinoSearch1::Index::SegInfo->new( seg_name => $seg_writer->get_seg_name, doc_count => $doc_count, invindex => $invindex, ) ); } # commit changes to the invindex $invindex->run_while_locked( lock_name => COMMIT_LOCK_NAME, timeout => COMMIT_LOCK_TIMEOUT, do_body => sub { $self->{reader}->commit_deletions if defined $self->{reader}; $sinfos->write_infos($invindex); }, ); my @files_to_delete = $self->_generate_deletions_list( \@to_merge ); push @files_to_delete, $self->_read_delqueue; # close reader, so that we can delete its files if appropriate $self->{reader}->close if defined $self->{reader}; $self->_purge_merged(@files_to_delete); $self->_release_locks; $self->{state} = FINISHED; } # Given an array of SegReaders, return a list of their files. sub _generate_deletions_list { my ( $self, $readers_to_merge ) = @_; my $invindex = $self->{invindex}; my @segs_to_merge = map { $_->get_seg_name } @$readers_to_merge; my @deletions = grep { $invindex->file_exists($_) } map { ( "$_.cfs", "$_.del" ) } @segs_to_merge; return @deletions; } # Retrieve a list of files that weren't successfully deleted before. sub _read_delqueue { my ( $self, $readers_to_merge ) = @_; my $invindex = $self->{invindex}; my @deletions; if ( $invindex->file_exists('delqueue') ) { my $instream = $invindex->open_instream('delqueue'); my $num_in_queue = $instream->lu_read('i'); @deletions = $instream->lu_read("T$num_in_queue"); $instream->close; } return @deletions; } # Delete segments that have been folded into the new segment. sub _purge_merged { my ( $self, @deletions ) = @_; my $invindex = $self->{invindex}; my @delqueue; for my $deletion (@deletions) { eval { $invindex->delete_file($deletion) }; # Win32: if the deletion fails (because a reader is open), queue it if ( $@ and $invindex->file_exists($deletion) ) { push @delqueue, $deletion; } } $self->_write_delqueue(@delqueue); } sub _write_delqueue { my ( $self, @delqueue ) = @_; my $invindex = $self->{invindex}; my $num_files = scalar @delqueue; if ($num_files) { # we have files that weren't successfully deleted, so write list my $outstream = $invindex->open_outstream('delqueue.new'); $outstream->lu_write( "iT$num_files", $num_files, @delqueue ); $outstream->close; $invindex->rename_file( 'delqueue.new', 'delqueue' ); } elsif ( $invindex->file_exists('delqueue') ) { # no files to delete, so delete the delqueue file if it's there $invindex->delete_file('delqueue'); } } # Release the write lock - if it's there. sub _release_locks { my $self = shift; if ( defined $self->{write_lock} ) { $self->{write_lock}->release if $self->{write_lock}->is_locked; undef $self->{write_lock}; } } # Generate segment names (no longer Lucene compatible, as of 0.06). sub _new_seg_name { my $self = shift; my $counter = $self->{sinfos}->get_counter; $self->{sinfos}->set_counter( ++$counter ); return "_$counter"; } sub DESTROY { shift->_release_locks } 1; __END__ =head1 NAME KinoSearch1::InvIndexer - build inverted indexes =head1 SYNOPSIS use KinoSearch1::InvIndexer; use KinoSearch1::Analysis::PolyAnalyzer; my $analyzer = KinoSearch1::Analysis::PolyAnalyzer->new( language => 'en' ); my $invindexer = KinoSearch1::InvIndexer->new( invindex => '/path/to/invindex', create => 1, analyzer => $analyzer, ); $invindexer->spec_field( name => 'title' boost => 3, ); $invindexer->spec_field( name => 'bodytext' ); while ( my ( $title, $bodytext ) = each %source_documents ) { my $doc = $invindexer->new_doc($title); $doc->set_value( title => $title ); $doc->set_value( bodytext => $bodytext ); $invindexer->add_doc($doc); } $invindexer->finish; =head1 DESCRIPTION The InvIndexer class is KinoSearch1's primary tool for creating and modifying inverted indexes, which may be searched using L. =head1 METHODS =head2 new my $invindexer = KinoSearch1::InvIndexer->new( invindex => '/path/to/invindex', # required create => 1, # default: 0 analyzer => $analyzer, # default: no-op Analyzer ); Create an InvIndexer object. =over =item * B - can be either a filepath, or an InvIndex subclass such as L or L. =item * B - create a new invindex, clobbering an existing one if necessary. =item * B - an object which subclasses L, such as a L. =back =head2 spec_field $invindexer->spec_field( name => 'url', # required boost => 1, # default: 1, analyzer => undef, # default: analyzer spec'd in new() indexed => 0, # default: 1 analyzed => 0, # default: 1 stored => 1, # default: 1 compressed => 0, # default: 0 vectorized => 0, # default: 1 ); Define a field. =over =item * B - the field's name. =item * B - A multiplier which determines how much a field contributes to a document's score. =item * B - By default, all indexed fields are analyzed using the analyzer that was supplied to new(). Supplying an alternate for a given field overrides the primary analyzer. =item * B - index the field, so that it can be searched later. =item * B - analyze the field, using the relevant Analyzer. Fields such as "category" or "product_number" might be indexed but not analyzed. =item * B - store the field, so that it can be retrieved when the document turns up in a search. =item * B - compress the stored field, using the zlib compression algorithm. =item * B - store the field's "term vectors", which are required by L for excerpt selection and search term highlighting. =back =head2 new_doc my $doc = $invindexer->new_doc; Spawn an empty L object, primed to accept values for the fields spec'd by spec_field. =head2 add_doc $invindexer->add_doc($doc); Add a document to the invindex. =head2 add_invindexes my $invindexer = KinoSearch1::InvIndexer->new( invindex => $invindex, analyzer => $analyzer, ); $invindexer->add_invindexes( $another_invindex, $yet_another_invindex ); $invindexer->finish; Absorb existing invindexes into this one. May only be called once per InvIndexer. add_invindexes() and add_doc() cannot be called on the same InvIndexer. =head2 delete_docs_by_term my $term = KinoSearch1::Index::Term->new( 'id', $unique_id ); $invindexer->delete_docs_by_term($term); Mark any document which contains the supplied term as deleted, so that it will be excluded from search results. For more info, see L in KinoSearch1::Docs::FileFormat. =head2 finish $invindexer->finish( optimize => 1, # default: 0 ); Finish the invindex. Invalidates the InvIndexer. Takes one hash-style parameter. =over =item * B - If optimize is set to 1, the invindex will be collapsed to its most compact form, which will yield the fastest queries. =back =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Searcher.pm000444000765000765 1560011462203446 20657 0ustar00marvinmarvin000000000000package KinoSearch1::Searcher; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Search::Searchable ); BEGIN { __PACKAGE__->init_instance_vars( # params/members invindex => undef, analyzer => undef, # members reader => undef, close_reader => 0, # not implemented yet ); __PACKAGE__->ready_get(qw( reader )); } use KinoSearch1::Store::FSInvIndex; use KinoSearch1::Index::IndexReader; use KinoSearch1::Search::Hits; use KinoSearch1::Search::HitCollector; use KinoSearch1::Search::Similarity; use KinoSearch1::QueryParser::QueryParser; use KinoSearch1::Search::BooleanQuery; use KinoSearch1::Analysis::Analyzer; sub init_instance { my $self = shift; $self->{analyzer} ||= KinoSearch1::Analysis::Analyzer->new; $self->{similarity} = KinoSearch1::Search::Similarity->new; $self->{field_sims} = {}; if ( !defined $self->{reader} ) { # confirm or create an InvIndex object my $invindex; if ( blessed( $self->{invindex} ) and $self->{invindex}->isa('KinoSearch1::Store::InvIndex') ) { $invindex = $self->{invindex}; } elsif ( defined $self->{invindex} ) { $invindex = $self->{invindex} = KinoSearch1::Store::FSInvIndex->new( create => $self->{create}, path => $self->{invindex}, ); } else { croak("valid 'reader' or 'invindex' must be supplied"); } # now that we have an invindex, get a reader for it $self->{reader} = KinoSearch1::Index::IndexReader->new( invindex => $self->{invindex} ); } } my %search_args = ( query => undef, filter => undef, num_docs => undef, ); sub search { my $self = shift; my %args = @_ == 1 ? ( %search_args, query => $_[0] ) : ( %search_args, @_ ); confess kerror() unless verify_args( \%search_args, %args ); # turn a query string into a query against all fields if ( !a_isa_b( $args{query}, 'KinoSearch1::Search::Query' ) ) { $args{query} = $self->_prepare_simple_search( $args{query} ); } return KinoSearch1::Search::Hits->new( searcher => $self, %args ); } sub get_field_names { my $self = shift; return $self->{reader}->get_field_names(@_); } # Search for the query string against all indexed fields sub _prepare_simple_search { my ( $self, $query_string ) = @_; my $indexed_field_names = $self->get_field_names( indexed => 1 ); my $query_parser = KinoSearch1::QueryParser::QueryParser->new( fields => $indexed_field_names, analyzer => $self->{analyzer}, ); return $query_parser->parse($query_string); } my %search_hit_collector_args = ( hit_collector => undef, weight => undef, filter => undef, sort_spec => undef, ); sub search_hit_collector { my $self = shift; confess kerror() unless verify_args( \%search_hit_collector_args, @_ ); my %args = ( %search_hit_collector_args, @_ ); # wrap the collector if there's a filter my $collector = $args{hit_collector}; if ( defined $args{filter} ) { $collector = KinoSearch1::Search::FilteredCollector->new( filter_bits => $args{filter}->bits($self), hit_collector => $args{hit_collector}, ); } # accumulate hits into the HitCollector if the query is valid my $scorer = $args{weight}->scorer( $self->{reader} ); if ( defined $scorer ) { $scorer->score_batch( hit_collector => $collector, end => $self->{reader}->max_doc, ); } } sub fetch_doc { $_[0]->{reader}->fetch_doc( $_[1] ) } sub max_doc { shift->{reader}->max_doc } sub doc_freq { my ( $self, $term ) = @_; return $self->{reader}->doc_freq($term); } sub create_weight { my ( $self, $query ) = @_; return $query->to_weight($self); } sub rewrite { my ( $self, $query ) = @_; my $reader = $self->{reader}; while (1) { my $rewritten = $query->rewrite($reader); last if ( 0 + $rewritten == 0 + $query ); $query = $rewritten; } return $query; } sub close { my $self = shift; $self->{reader}->close if $self->{close_reader}; } 1; __END__ =head1 NAME KinoSearch1::Searcher - execute searches =head1 SYNOPSIS my $analyzer = KinoSearch1::Analysis::PolyAnalyzer->new( language => 'en', ); my $searcher = KinoSearch1::Searcher->new( invindex => $invindex, analyzer => $analyzer, ); my $hits = $searcher->search( query => 'foo bar' ); =head1 DESCRIPTION Use the Searcher class to perform queries against an invindex. =head1 METHODS =head2 new my $searcher = KinoSearch1::Searcher->new( invindex => $invindex, analyzer => $analyzer, ); Constructor. Takes two labeled parameters, both of which are required. =over =item * B - can be either a path to an invindex, or a L object. =item * B - An object which subclasses L, such as a L. This B be identical to the Analyzer used at index-time, or the results won't match up. =back =head2 search my $hits = $searcher->search( query => $query, # required filter => $filter, # default: undef (no filtering) ); Process a search and return a L object. search() expects labeled hash-style parameters. =over =item * B - Can be either an object which subclasses L, or a query string. If it's a query string, it will be parsed using a QueryParser and a search will be performed against all indexed fields in the invindex. For more sophisticated searching, supply Query objects, such as TermQuery and BooleanQuery. =item * B - Must be a L. Search results will be limited to only those documents which pass through the filter. =back =head1 Caching a Searcher When a Searcher is created, a small portion of the invindex is loaded into memory. For large document collections, this startup time may become noticeable, in which case reusing the searcher is likely to speed up your search application. Caching a Searcher is especially helpful when running a high-activity app under mod_perl. Searcher objects always represent a snapshot of an invindex as it existed when the Searcher was created. If you want the search results to reflect modifications to an invindex, you must create a new Searcher after the update process completes. =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. KinoSearch1-1.01/lib/KinoSearch1/Analysis000755000765000765 011462203446 20171 5ustar00marvinmarvin000000000000KinoSearch1-1.01/lib/KinoSearch1/Analysis/Analyzer.pm000444000765000765 247611462203446 22462 0ustar00marvinmarvin000000000000package KinoSearch1::Analysis::Analyzer; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members language => '', ); } # usage: $token_batch = $analyzer->analyze($token_batch); sub analyze { return $_[1] } 1; __END__ =head1 NAME KinoSearch1::Analysis::Analyzer - base class for analyzers =head1 SYNOPSIS # abstract base class -- you probably want PolyAnalyzer, not this. =head1 DESCRIPTION In KinoSearch1, an Analyzer is a filter which processes text, transforming it from one form into another. For instance, an analyzer might break up a long text into smaller pieces (L), or it might convert text to lowercase (L). =head1 METHODS =head2 analyze (EXPERIMENTAL) $token_batch = $analyzer->analyze($token_batch); All Analyzer subclasses provide an C method. C takes a single L as input, and it returns a TokenBatch, either the same one (probably transformed in some way), or a new one. =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Analysis/LCNormalizer.pm000444000765000765 216411462203445 23227 0ustar00marvinmarvin000000000000package KinoSearch1::Analysis::LCNormalizer; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Analysis::Analyzer ); use locale; BEGIN { __PACKAGE__->init_instance_vars(); } sub analyze { my ( $self, $batch ) = @_; # lc all of the terms, one by one while ( $batch->next ) { $batch->set_text( lc( $batch->get_text ) ); } $batch->reset; return $batch; } 1; __END__ =head1 NAME KinoSearch1::Analysis::LCNormalizer - convert input to lower case =head1 SYNOPSIS my $lc_normalizer = KinoSearch1::Analysis::LCNormalizer->new; my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new( analyzers => [ $lc_normalizer, $tokenizer, $stemmer ], ); =head1 DESCRIPTION This class basically says C in a longwinded way which KinoSearch1's Analysis apparatus can understand. =head1 CONSTRUCTOR =head2 new Construct a new LCNormalizer. Takes one labeled parameter, C, though it's a no-op for now. =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Analysis/PolyAnalyzer.pm000444000765000765 644011462203446 23321 0ustar00marvinmarvin000000000000package KinoSearch1::Analysis::PolyAnalyzer; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Analysis::Analyzer ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members analyzers => undef, ); } use KinoSearch1::Analysis::LCNormalizer; use KinoSearch1::Analysis::Tokenizer; use KinoSearch1::Analysis::Stemmer; sub init_instance { my $self = shift; my $language = $self->{language} = lc( $self->{language} ); # create a default set of analyzers if language was specified if ( !defined $self->{analyzers} ) { croak("Must specify either 'language' or 'analyzers'") unless $language; $self->{analyzers} = [ KinoSearch1::Analysis::LCNormalizer->new( language => $language ), KinoSearch1::Analysis::Tokenizer->new( language => $language ), KinoSearch1::Analysis::Stemmer->new( language => $language ), ]; } } sub analyze { my ( $self, $token_batch ) = @_; # iterate through each of the anayzers in order $token_batch = $_->analyze($token_batch) for @{ $self->{analyzers} }; return $token_batch; } 1; __END__ =head1 NAME KinoSearch1::Analysis::PolyAnalyzer - multiple analyzers in series =head1 SYNOPSIS my $analyzer = KinoSearch1::Analysis::PolyAnalyzer->new( language => 'es', ); # or... my $analyzer = KinoSearch1::Analysis::PolyAnalyzer->new( analyzers => [ $lc_normalizer, $custom_tokenizer, $snowball_stemmer, ], ); =head1 DESCRIPTION A PolyAnalyzer is a series of Analyzers -- objects which inherit from L -- each of which will be called upon to "analyze" text in turn. You can either provide the Analyzers yourself, or you can specify a supported language, in which case a PolyAnalyzer consisting of an L, a L, and a L will be generated for you. Supported languages: en => English, da => Danish, de => German, es => Spanish, fi => Finnish, fr => French, it => Italian, nl => Dutch, no => Norwegian, pt => Portuguese, ru => Russian, sv => Swedish, =head1 CONSTRUCTOR =head2 new() my $analyzer = KinoSearch1::Analysis::PolyAnalyzer->new( language => 'en', ); Construct a PolyAnalyzer object. If the parameter C is specified, it will override C and no attempt will be made to generate a default set of Analyzers. =over =item B - Must be an ISO code from the list of supported languages. =item B - Must be an arrayref. Each element in the array must inherit from KinoSearch1::Analysis::Analyzer. The order of the analyzers matters. Don't put a Stemmer before a Tokenizer (can't stem whole documents or paragraphs -- just individual words), or a Stopalizer after a Stemmer (stemmed words, e.g. "themselv", will not appear in a stoplist). In general, the sequence should be: normalize, tokenize, stopalize, stem. =back =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Analysis/Stemmer.pm000444000765000765 375111462203446 22306 0ustar00marvinmarvin000000000000package KinoSearch1::Analysis::Stemmer; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Analysis::Analyzer ); our %supported_languages; BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members stemmifier => undef, ); } use Lingua::Stem::Snowball qw( stemmers ); # build a list of supported languages. $supported_languages{$_} = 1 for stemmers(); sub init_instance { my $self = shift; # verify language param my $language = $self->{language} = lc( $self->{language} ); croak("Unsupported language: '$language'") unless $supported_languages{$language}; # create instance of Snowball stemmer $self->{stemmifier} = Lingua::Stem::Snowball->new( lang => $language ); } sub analyze { my ( $self, $batch ) = @_; # replace terms with stemmed versions. my $all_texts = $batch->get_all_texts; $self->{stemmifier}->stem_in_place($all_texts); $batch->set_all_texts($all_texts); $batch->reset; return $batch; } 1; __END__ =head1 NAME KinoSearch1::Analysis::Stemmer - reduce related words to a shared root =head1 SYNOPSIS my $stemmer = KinoSearch1::Analysis::Stemmer->new( language => 'es' ); my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new( analyzers => [ $lc_normalizer, $tokenizer, $stemmer ], ); =head1 DESCRIPTION Stemming reduces words to a root form. For instance, "horse", "horses", and "horsing" all become "hors" -- so that a search for 'horse' will also match documents containing 'horses' and 'horsing'. This class is a wrapper around L, so it supports the same languages. =head1 METHODS =head2 new Create a new stemmer. Takes a single named parameter, C, which must be an ISO two-letter code that Lingua::Stem::Snowball understands. =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Analysis/Stopalizer.pm000444000765000765 1012111462203446 23033 0ustar00marvinmarvin000000000000package KinoSearch1::Analysis::Stopalizer; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Analysis::Analyzer ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members stoplist => undef, ); } use Lingua::StopWords; sub init_instance { my $self = shift; my $language = $self->{language} = lc( $self->{language} ); # verify a supplied stoplist if ( defined $self->{stoplist} ) { croak("stoplist must be a hashref") unless reftype( $self->{stoplist} ) eq 'HASH'; } else { # create a stoplist if language was supplied if ( $language =~ /^(?:da|de|en|es|fr|it|nl|no|pt|ru|sv)$/ ) { $self->{stoplist} = Lingua::StopWords::getStopWords($language); } # No Finnish stoplist, though we have a stemmmer. elsif ( $language eq 'fi' ) { $self->{stoplist} = {}; } else { confess "Invalid language: '$language'"; } } } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Analysis::Stopalizer SV* analyze(self_hash, batch_sv) HV *self_hash; SV *batch_sv; PREINIT: TokenBatch *batch; CODE: Kino1_extract_struct( batch_sv, batch, TokenBatch*, "KinoSearch1::Analysis::TokenBatch"); Kino1_Stopalizer_analyze(self_hash, batch); SvREFCNT_inc(batch_sv); RETVAL = batch_sv; OUTPUT: RETVAL __H__ #ifndef H_KINOSEARCH_ANALYSIS_STOPALIZER #define H_KINOSEARCH_ANALYSIS_STOPALIZER 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1AnalysisToken.h" #include "KinoSearch1AnalysisTokenBatch.h" #include "KinoSearch1UtilVerifyArgs.h" TokenBatch* Kino1_Stopalizer_analyze(HV*, TokenBatch*); #endif /* include guard */ __C__ #include "KinoSearch1AnalysisStopalizer.h" TokenBatch* Kino1_Stopalizer_analyze(HV* self_hash, TokenBatch *batch) { SV **sv_ptr; HV *stoplist_hv; Token *token; sv_ptr = hv_fetch(self_hash, "stoplist", 8, 0); if (sv_ptr == NULL) Kino1_confess("no element 'stoplist'"); if (!SvROK(*sv_ptr)) Kino1_confess("not a hashref"); stoplist_hv = (HV*)SvRV(*sv_ptr); Kino1_Verify_extract_arg(self_hash, "stoplist", 8); while (Kino1_TokenBatch_next(batch)) { token = batch->current; if (hv_exists(stoplist_hv, token->text, token->len)) { token->len = 0; } } Kino1_TokenBatch_reset(batch); return batch; } __POD__ =head1 NAME KinoSearch1::Analysis::Stopalizer - suppress a "stoplist" of common words =head1 SYNOPSIS my $stopalizer = KinoSearch1::Analysis::Stopalizer->new( language => 'fr', ); my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new( analyzers => [ $lc_normalizer, $tokenizer, $stopalizer, $stemmer ], ); =head1 DESCRIPTION A "stoplist" is collection of "stopwords": words which are common enough to be of little value when determining search results. For example, so many documents in English contain "the", "if", and "maybe" that it may improve both performance and relevance to block them. # before @token_texts = ('i', 'am', 'the', 'walrus'); # after @token_texts = ('', '', '', 'walrus'); =head1 CONSTRUCTOR =head2 new my $stopalizer = KinoSearch1::Analysis::Stopalizer->new( language => 'de', ); # or... my $stopalizer = KinoSearch1::Analysis::Stopalizer->new( stoplist => \%stoplist, ); new() takes two possible parameters, C and C. If C is supplied, it will be used, overriding the behavior indicated by the value of C. =over =item B - must be a hashref, with stopwords as the keys of the hash and values set to 1. =item B - must be the ISO code for a language. Loads a default stoplist supplied by L. =back =head1 SEE ALSO L =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Analysis/Token.pm000444000765000765 521611462203445 21747 0ustar00marvinmarvin000000000000package KinoSearch1::Analysis::Token; 1; __END__ __H__ #ifndef H_KINOSEARCH_ANALYSIS_TOKEN #define H_KINOSEARCH_ANALYSIS_TOKEN 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1UtilMemManager.h" typedef struct Token Token; struct Token { char *text; STRLEN len; I32 start_offset; I32 end_offset; I32 pos_inc; Token *next; Token *prev; }; Token* Kino1_Token_new(char* text, STRLEN len, I32 start_offset, I32 end_offset, I32 pos_inc); void Kino1_Token_destroy(Token*); #endif /* include guard */ __C__ #include "KinoSearch1AnalysisToken.h" Token* Kino1_Token_new(char* text, STRLEN len, I32 start_offset, I32 end_offset, I32 pos_inc) { Token *token; /* allocate */ Kino1_New(0, token, 1, Token); /* allocate and assign */ token->text = Kino1_savepvn(text, len); /* assign */ token->len = len; token->start_offset = start_offset; token->end_offset = end_offset; token->pos_inc = pos_inc; /* init */ token->next = NULL; token->prev = NULL; return token; } void Kino1_Token_destroy(Token *token) { Kino1_Safefree(token->text); Kino1_Safefree(token); } __POD__ =head1 NAME KinoSearch1::Analysis::Token - unit of text =head1 SYNOPSIS # private class - no public API =head1 PRIVATE CLASS You can't actually instantiate a Token object at the Perl level -- however, you can affect individual Tokens within a TokenBatch by way of TokenBatch's (experimental) API. =head1 DESCRIPTION Token is the fundamental unit used by KinoSearch1's Analyzer subclasses. Each Token has 4 attributes: text, start_offset, end_offset, and pos_inc (for position increment). The text of a token is a string. A Token's start_offset and end_offset locate it within a larger text, even if the Token's text attribute gets modified -- by stemming, for instance. The Token for "beating" in the text "beating a dead horse" begins life with a start_offset of 0 and an end_offset of 7; after stemming, the text is "beat", but the end_offset is still 7. The position increment, which defaults to 1, is a an advanced tool for manipulating phrase matching. Ordinarily, Tokens are assigned consecutive position numbers: 0, 1, and 2 for "three blind mice". However, if you set the position increment for "blind" to, say, 1000, then the three tokens will end up assigned to positions 0, 1, and 1001 -- and will no longer produce a phrase match for the query '"three blind mice"'. =head1 COPYRIGHT Copyright 2006-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Analysis/TokenBatch.pm000444000765000765 4603111462203445 22731 0ustar00marvinmarvin000000000000package KinoSearch1::Analysis::TokenBatch; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::CClass ); 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Analysis::TokenBatch void new(either_sv) SV *either_sv; PREINIT: const char *class; TokenBatch *batch; PPCODE: /* determine the class */ class = sv_isobject(either_sv) ? sv_reftype(either_sv, 0) : SvPV_nolen(either_sv); /* build object */ batch = Kino1_TokenBatch_new(); ST(0) = sv_newmortal(); sv_setref_pv(ST(0), class, (void*)batch); XSRETURN(1); void append(batch, text_sv, start_offset, end_offset, ...) TokenBatch *batch; SV *text_sv; I32 start_offset; I32 end_offset; PREINIT: char *text; STRLEN len; I32 pos_inc = 1; Token *token; PPCODE: text = SvPV(text_sv, len); if (items == 5) pos_inc = SvIV( ST(4) ); else if (items > 5) Kino1_confess("Too many arguments: %d", items); token = Kino1_Token_new(text, len, start_offset, end_offset, pos_inc); Kino1_TokenBatch_append(batch, token); =for comment Add many tokens to the batch, by supplying the string to be tokenized, and arrays of token starts and token ends (specified in bytes). =cut void add_many_tokens(batch, string_sv, starts_av, ends_av) TokenBatch *batch; SV *string_sv; AV *starts_av; AV *ends_av; PREINIT: char *string_start; STRLEN len, start_offset, end_offset; I32 i, max; SV **start_sv_ptr; SV **end_sv_ptr; Token *token; PPCODE: { string_start = SvPV(string_sv, len); max = av_len(starts_av); for (i = 0; i <= max; i++) { /* retrieve start */ start_sv_ptr = av_fetch(starts_av, i, 0); if (start_sv_ptr == NULL) Kino1_confess("Failed to retrieve @starts array element"); start_offset = SvIV(*start_sv_ptr); /* retrieve end */ end_sv_ptr = av_fetch(ends_av, i, 0); if (end_sv_ptr == NULL) Kino1_confess("Failed to retrieve @ends array element"); end_offset = SvIV(*end_sv_ptr); /* sanity check the offsets to make sure they're inside the string */ if (start_offset > len) Kino1_confess("start_offset > len (%d > %"UVuf")", start_offset, (UV)len); if (end_offset > len) Kino1_confess("end_offset > len (%d > %"UVuf")", end_offset, (UV)len); /* calculate the start of the substring and add the token */ token = Kino1_Token_new( (string_start + start_offset), (end_offset - start_offset), start_offset, end_offset, 1 ); Kino1_TokenBatch_append(batch, token); } } =begin comment Add the postings to the segment. Postings are serialized and dumped into a SortExternal sort pool. The actual writing takes place later. The serialization algo is designed so that postings emerge from the sort pool in the order ideal for writing an index after a simple lexical sort. The concatenated components are: field number term text null byte document number positions (C array of U32) term length =end comment =cut void build_posting_list(batch, doc_num, field_num) TokenBatch *batch; U32 doc_num; U16 field_num; PPCODE: Kino1_TokenBatch_build_plist(batch, doc_num, field_num); void set_all_texts(batch, texts_av) TokenBatch *batch; AV *texts_av; PREINIT: Token *token; I32 i, max; SV **sv_ptr; char *text; STRLEN len; PPCODE: { token = batch->first; max = av_len(texts_av); for (i = 0; i <= max; i++) { if (token == NULL) { Kino1_confess("Batch size %d doesn't match array size %d", batch->size, (max + 1)); } sv_ptr = av_fetch(texts_av, i, 0); if (sv_ptr == NULL) { Kino1_confess("Encountered a null SV* pointer"); } text = SvPV(*sv_ptr, len); Kino1_Safefree(token->text); token->text = Kino1_savepvn(text, len); token->len = len; token = token->next; } } void get_all_texts(batch) TokenBatch *batch; PREINIT: Token *token; AV *out_av; PPCODE: { out_av = newAV(); token = batch->first; while (token != NULL) { SV *text = newSVpvn(token->text, token->len); av_push(out_av, text); token = token->next; } XPUSHs(sv_2mortal( newRV_noinc((SV*)out_av) )); XSRETURN(1); } SV* _set_or_get(batch, ...) TokenBatch *batch; ALIAS: set_text = 1 get_text = 2 set_start_offset = 3 get_start_offset = 4 set_end_offset = 5 get_end_offset = 6 set_pos_inc = 7 get_pos_inc = 8 set_size = 9 get_size = 10 set_postings = 11 get_postings = 12 set_tv_string = 13 get_tv_string = 14 CODE: { /* fail if looking for info on a single token but there isn't one */ if ((ix < 7) && (batch->current == NULL)) Kino1_confess("TokenBatch doesn't currently hold a valid token"); KINO_START_SET_OR_GET_SWITCH case 1: { Token *current = batch->current; char *text; Kino1_Safefree(current->text); text = SvPV( ST(1), current->len ); current->text = Kino1_savepvn( text, current->len ); } /* fall through */ case 2: RETVAL = newSVpvn(batch->current->text, batch->current->len); break; case 3: batch->current->start_offset = SvIV( ST(1) ); /* fall through */ case 4: RETVAL = newSViv(batch->current->start_offset); break; case 5: batch->current->end_offset = SvIV( ST(1) ); /* fall through */ case 6: RETVAL = newSViv(batch->current->end_offset); break; case 7: batch->current->pos_inc = SvIV( ST(1) ); /* fall through */ case 8: RETVAL = newSViv(batch->current->pos_inc); break; case 9: Kino1_confess("Can't set size on a TokenBatch object"); /* fall through */ case 10: RETVAL = newSVuv(batch->size); break; case 11: Kino1_confess("can't set_postings"); /* fall through */ case 12: RETVAL = newRV_inc( (SV*)batch->postings ); break; case 13: Kino1_confess("can't set_tv_string"); /* fall through */ case 14: RETVAL = newSVsv(batch->tv_string); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL void reset(batch) TokenBatch *batch; PPCODE: Kino1_TokenBatch_reset(batch); I32 next(batch) TokenBatch *batch; CODE: RETVAL = Kino1_TokenBatch_next(batch); OUTPUT: RETVAL void DESTROY(batch) TokenBatch *batch; PPCODE: Kino1_TokenBatch_destroy(batch); __H__ #ifndef H_KINOSEARCH_ANALYSIS_TOKENBATCH #define H_KINOSEARCH_ANALYSIS_TOKENBATCH 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "ppport.h" #include "KinoSearch1AnalysisToken.h" #include "KinoSearch1IndexTerm.h" #include "KinoSearch1UtilCarp.h" #include "KinoSearch1UtilMathUtils.h" #include "KinoSearch1UtilMemManager.h" #include "KinoSearch1UtilStringHelper.h" typedef struct tokenbatch { Token *first; Token *last; Token *current; I32 size; I32 initialized; AV *postings; SV *tv_string; } TokenBatch; TokenBatch* Kino1_TokenBatch_new(); void Kino1_TokenBatch_destroy(TokenBatch *batch); void Kino1_TokenBatch_append(TokenBatch *batch, Token *token); I32 Kino1_TokenBatch_next(TokenBatch *batch); void Kino1_TokenBatch_reset(TokenBatch *batch); void Kino1_TokenBatch_build_plist(TokenBatch*, U32, U16); #endif /* include guard */ __C__ #include "KinoSearch1AnalysisTokenBatch.h" TokenBatch* Kino1_TokenBatch_new() { TokenBatch *batch; /* allocate */ Kino1_New(0, batch, 1, TokenBatch); /* init */ batch->first = NULL; batch->last = NULL; batch->current = NULL; batch->size = 0; batch->initialized = 0; batch->tv_string = &PL_sv_undef; batch->postings = (AV*)&PL_sv_undef; return batch; } void Kino1_TokenBatch_destroy(TokenBatch *batch) { Token *token = batch->first; while (token != NULL) { Token *next = token->next; Kino1_Token_destroy(token); token = next; } SvREFCNT_dec( (SV*)batch->postings ); SvREFCNT_dec(batch->tv_string); Kino1_Safefree(batch); } I32 Kino1_TokenBatch_next(TokenBatch *batch) { /* enter iterative mode */ if (batch->initialized == 0) { batch->current = batch->first; batch->initialized = 1; } /* continue iterative mode */ else { batch->current = batch->current->next; } return batch->current == NULL ? 0 : 1; } void Kino1_TokenBatch_reset(TokenBatch *batch) { batch->initialized = 0; } void Kino1_TokenBatch_append(TokenBatch *batch, Token *token) { token->next = NULL; token->prev = batch->last; /* if this is the first token added, init */ if (batch->first == NULL) { batch->first = token; batch->last = token; } else { batch->last->next = token; batch->last = token; } batch->size++; } #define POSDATA_LEN 12 #define DOC_NUM_LEN 4 #define NULL_BYTE_LEN 1 #define TEXT_LEN_LEN 2 /* Encode postings in the serialized format expected by PostingsWriter, plus * the term vector expected by FieldsWriter. */ void Kino1_TokenBatch_build_plist(TokenBatch *batch, U32 doc_num, U16 field_num) { char doc_num_buf[4]; char field_num_buf[2]; char text_len_buf[2]; char vint_buf[5]; HV *pos_hash; HE *he; AV *out_av; I32 i = 0; I32 overlap, num_bytes, num_positions; I32 num_postings = 0; SV **sv_ptr; char *text, *source_ptr, *dest_ptr, *end_ptr; char *last_text = ""; STRLEN text_len, len, fake_len; STRLEN last_len = 0; SV *serialized_sv; SV *tv_string_sv; U32 *source_u32, *dest_u32, *end_u32; /* prepare doc num and field num in anticipation of upcoming loop */ Kino1_encode_bigend_U32(doc_num, doc_num_buf); Kino1_encode_bigend_U16(field_num, field_num_buf); /* build a posting list hash. */ pos_hash = newHV(); while (Kino1_TokenBatch_next(batch)) { Token* token = batch->current; /* either start a new hash entry or retrieve an existing one */ if (!hv_exists(pos_hash, token->text, token->len)) { /* the values are the serialized scalars */ if (token->len > 65535) Kino1_confess("Maximum token length is 65535; got %d", token->len); Kino1_encode_bigend_U16(token->len, text_len_buf); /* allocate the serialized scalar */ len = TEXT_LEN_LEN /* for now, put text_len at top */ + KINO_FIELD_NUM_LEN /* encoded field number */ + token->len /* term text */ + NULL_BYTE_LEN /* the term text's null byte */ + DOC_NUM_LEN + POSDATA_LEN + TEXT_LEN_LEN /* eventually, text_len goes at end */ + NULL_BYTE_LEN; /* the scalar's null byte */ serialized_sv = newSV(len); SvPOK_on(serialized_sv); source_ptr = SvPVX(serialized_sv); dest_ptr = source_ptr; /* concatenate a bunch of stuff onto the serialized scalar */ Copy(text_len_buf, dest_ptr, TEXT_LEN_LEN, char); dest_ptr += TEXT_LEN_LEN; Copy(field_num_buf, dest_ptr, KINO_FIELD_NUM_LEN, char); dest_ptr += KINO_FIELD_NUM_LEN; Copy(token->text, dest_ptr, token->len, char); dest_ptr += token->len; *dest_ptr = '\0'; dest_ptr += NULL_BYTE_LEN; Copy(doc_num_buf, dest_ptr, DOC_NUM_LEN, char); dest_ptr += DOC_NUM_LEN; SvCUR_set(serialized_sv, (dest_ptr - source_ptr)); /* store the text => serialized_sv pair in the pos_hash */ (void)hv_store(pos_hash, token->text, token->len, serialized_sv, 0); } else { /* retrieve the serialized scalar and allocate more space */ sv_ptr = hv_fetch(pos_hash, token->text, token->len, 0); if (sv_ptr == NULL) Kino1_confess("unexpected null sv_ptr"); serialized_sv = *sv_ptr; len = SvCUR(serialized_sv) + POSDATA_LEN /* allocate space for upcoming posdata */ + TEXT_LEN_LEN /* extra space for encoded text length */ + NULL_BYTE_LEN; SvGROW( serialized_sv, len ); } /* append position, start offset, end offset to the serialized_sv */ dest_u32 = (U32*)SvEND(serialized_sv); *dest_u32++ = (U32)i; i += token->pos_inc; *dest_u32++ = token->start_offset; *dest_u32++ = token->end_offset; len = SvCUR(serialized_sv) + POSDATA_LEN; SvCUR_set(serialized_sv, len); /* destroy the token, because nobody else will -- XXX MAYBE? */ /* Kino1_Token_destroy(token); */ } /* allocate and presize the array to hold the output */ num_postings = hv_iterinit(pos_hash); out_av = newAV(); av_extend(out_av, num_postings); /* collect serialized scalars into an array */ i = 0; while ((he = hv_iternext(pos_hash))) { serialized_sv = HeVAL(he); /* transfer text_len to end of serialized scalar */ source_ptr = SvPVX(serialized_sv); dest_ptr = SvEND(serialized_sv); Copy(source_ptr, dest_ptr, TEXT_LEN_LEN, char); SvCUR(serialized_sv) += TEXT_LEN_LEN; source_ptr += TEXT_LEN_LEN; sv_chop(serialized_sv, source_ptr); SvREFCNT_inc(serialized_sv); av_store(out_av, i, serialized_sv); i++; } /* we're done with the positions hash, so kill it off */ SvREFCNT_dec(pos_hash); /* start the term vector string */ tv_string_sv = newSV(20); SvPOK_on(tv_string_sv); num_bytes = Kino1_OutStream_encode_vint(num_postings, vint_buf); sv_catpvn(tv_string_sv, vint_buf, num_bytes); /* sort the posting lists lexically */ sortsv(AvARRAY(out_av), num_postings, Perl_sv_cmp); /* iterate through the array, making changes to the serialized scalars */ for (i = 0; i < num_postings; i++) { serialized_sv = *(av_fetch(out_av, i, 0)); /* find the beginning of the term text */ text = SvPV(serialized_sv, fake_len); text += KINO_FIELD_NUM_LEN; /* save the text_len; we'll move it forward later */ end_ptr = SvEND(serialized_sv) - TEXT_LEN_LEN; text_len = Kino1_decode_bigend_U16( end_ptr ); Kino1_encode_bigend_U16(text_len, text_len_buf); source_ptr = SvPVX(serialized_sv) + KINO_FIELD_NUM_LEN + text_len + NULL_BYTE_LEN + DOC_NUM_LEN; source_u32 = (U32*)source_ptr; dest_u32 = source_u32; end_u32 = (U32*)end_ptr; /* append the string diff to the tv_string */ overlap = Kino1_StrHelp_string_diff(last_text, text, last_len, text_len); num_bytes = Kino1_OutStream_encode_vint(overlap, vint_buf); sv_catpvn( tv_string_sv, vint_buf, num_bytes ); num_bytes = Kino1_OutStream_encode_vint( (text_len - overlap), vint_buf ); sv_catpvn( tv_string_sv, vint_buf, num_bytes ); sv_catpvn( tv_string_sv, (text + overlap), (text_len - overlap) ); /* append the number of positions for this term */ num_positions = SvCUR(serialized_sv) - KINO_FIELD_NUM_LEN - text_len - NULL_BYTE_LEN - DOC_NUM_LEN - TEXT_LEN_LEN; num_positions /= POSDATA_LEN; num_bytes = Kino1_OutStream_encode_vint(num_positions, vint_buf); sv_catpvn( tv_string_sv, vint_buf, num_bytes ); while (source_u32 < end_u32) { /* keep only the positions in the serialized scalars */ num_bytes = Kino1_OutStream_encode_vint(*source_u32, vint_buf); sv_catpvn( tv_string_sv, vint_buf, num_bytes ); *dest_u32++ = *source_u32++; /* add start_offset to tv_string */ num_bytes = Kino1_OutStream_encode_vint(*source_u32, vint_buf); sv_catpvn( tv_string_sv, vint_buf, num_bytes ); source_u32++; /* add end_offset to tv_string */ num_bytes = Kino1_OutStream_encode_vint(*source_u32, vint_buf); sv_catpvn( tv_string_sv, vint_buf, num_bytes ); source_u32++; } /* restore the text_len and close the scalar */ dest_ptr = (char*)dest_u32; Copy(text_len_buf, dest_ptr, TEXT_LEN_LEN, char); dest_ptr += TEXT_LEN_LEN; len = dest_ptr - SvPVX(serialized_sv); SvCUR_set(serialized_sv, len); last_text = text; last_len = text_len; } /* store the postings array and the term vector string */ SvREFCNT_dec(batch->tv_string); batch->tv_string = tv_string_sv; SvREFCNT_dec(batch->postings); batch->postings = out_av; } __POD__ =head1 NAME KinoSearch1::Analysis::TokenBatch - a collection of tokens =head1 SYNOPSIS while ( $batch->next ) { $batch->set_text( lc( $batch->get_text ) ); } =head1 EXPERIMENTAL API TokenBatch's API should be considered experimental and is likely to change. =head1 DESCRIPTION A TokenBatch is a collection of L which you can add to, then iterate over. =head1 METHODS =head2 new my $batch = KinoSearch1::Analysis::TokenBatch->new; Constructor. =head2 append $batch->append( $text, $start_offset, $end_offset, $pos_inc ); Add a Token to the end of the batch. Accepts either three or four arguments: text, start_offset, end_offset, and an optional position increment which defaults to 1 if not supplied. For a description of what these arguments mean, see the docs for L. =head2 next while ( $batch->next ) { # ... } Proceed to the next token in the TokenBatch. Returns true if the TokenBatch ends up located at valid token. =head1 ACCESSOR METHODS All of TokenBatch's accessor methods affect the current Token. Calling any of these methods when the TokenBatch is not located at a valid Token will trigger an exception. =head2 set_text get_text Set/get the text of the current Token. =head2 set_start_offset get_start_offset Set/get the start_offset of the current Token. =head2 set_end_offset get_end_offset Set/get the end_offset of the current Token. =head2 set_pos_inc get_pos_inc Set/get the position increment of the current Token. =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Analysis/Tokenizer.pm000444000765000765 745511462203445 22650 0ustar00marvinmarvin000000000000package KinoSearch1::Analysis::Tokenizer; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Analysis::Analyzer ); use locale; BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members token_re => undef, # regex for a single token # members separator_re => undef, # regex for separations between tokens ); } use KinoSearch1::Analysis::TokenBatch; sub init_instance { my $self = shift; # supply defaults if token_re wasn't specified if ( !defined $self->{token_re} ) { $self->{token_re} = qr/\b\w+(?:'\w+)?\b/; $self->{separator_re} = qr/\W*/; } # if user-defined token_re... if ( !defined $self->{separator_re} ) { # define separator using lookahead $self->{separator_re} = qr/ .*? # match up to... (?= # but not including... $self->{token_re} # a token, |\z # or the end of the string )/xsm; } } sub analyze { my ( $self, $batch ) = @_; my $new_batch = KinoSearch1::Analysis::TokenBatch->new; my $token_re = $self->{token_re}; my $separator_re = $self->{separator_re}; # alias input to $_ while ( $batch->next ) { local $_ = $batch->get_text; # ensure that pos is set to 0 for this scalar pos = 0; # accumulate token start_offsets and end_offsets my ( @starts, @ends ); 1 while ( m/$separator_re/g and push @starts, pos and m/$token_re/g and push @ends, pos ); # correct for overshoot $#starts = $#ends; # add the new tokens to the batch $new_batch->add_many_tokens( $_, \@starts, \@ends ); } return $new_batch; } 1; __END__ =head1 NAME KinoSearch1::Analysis::Tokenizer - customizable tokenizing =head1 SYNOPSIS my $whitespace_tokenizer = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\S+/, ); # or... my $word_char_tokenizer = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\w+/, ); # or... my $apostrophising_tokenizer = KinoSearch1::Analysis::Tokenizer->new; # then... once you have a tokenizer, put it into a PolyAnalyzer my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new( analyzers => [ $lc_normalizer, $word_char_tokenizer, $stemmer ], ); =head1 DESCRIPTION Generically, "tokenizing" is a process of breaking up a string into an array of "tokens". # before: my $string = "three blind mice"; # after: @tokens = qw( three blind mice ); KinoSearch1::Analysis::Tokenizer decides where it should break up the text based on the value of C. # before: my $string = "Eats, Shoots and Leaves."; # tokenized by $whitespace_tokenizer @tokens = qw( Eats, Shoots and Leaves. ); # tokenized by $word_char_tokenizer @tokens = qw( Eats Shoots and Leaves ); =head1 METHODS =head2 new # match "O'Henry" as well as "Henry" and "it's" as well as "it" my $token_re = qr/ \b # start with a word boundary \w+ # Match word chars. (?: # Group, but don't capture... '\w+ # ... an apostrophe plus word chars. )? # Matching the apostrophe group is optional. \b # end with a word boundary /xsm; my $tokenizer = KinoSearch1::Analysis::Tokenizer->new( token_re => $token_re, # default: what you see above ); Constructor. Takes one hash style parameter. =over =item * B - must be a pre-compiled regular expression matching one token. =back =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Docs000755000765000765 011462203446 17276 5ustar00marvinmarvin000000000000KinoSearch1-1.01/lib/KinoSearch1/Docs/FileFormat.pod000444000765000765 2336511462203446 22220 0ustar00marvinmarvin000000000000=head1 NAME KinoSearch1::Docs::FileFormat - overview of invindex file format =head1 OVERVIEW It is not necessary to understand the guts of the Lucene-derived "invindex" file format in order to use KinoSearch1, but it may be helpful if you are interested in tweaking for high performance, exotic usage, or debugging and development. On a file system, all the files in an invindex exist in one, flat directory. Conceptually, the files have a hierarchical relationship: an invindex is made up of "segments", each of which is an independent inverted index, and each segment is made up of several subsections. [invindex]--| |-"segments" file | |-[segments]------| |--[seg _0]--| | |--[postings] | |--[stored fields] | |--[deletions] | |--[seg _1]--| | |--[postings] | |--[stored fields] | |--[deletions] | |--[ ... ]---| The "segments" file keeps a list of the segments that make up an invindex. When a new segment is being written, KinoSearch1 may put files into the directory, but until the segments file is updated, a Searcher reading the index won't know about them. Each segment is an independent inverted index. All the files which belong to a given segment share a common prefix which consists of an underscore followed by 1 or more decimal digits: _0, _67, _1058. A fully optimized index has only a single segment. In theory there are many files which make up each segment. However, when you look inside an invindex not in the process of being updated, you'll probably see only the segments file and files with either a .cfs or .del extension. The .cfs file, a "compound" file which is consolidated when a segment is finalized, "contains" all the other per-segment files. Segments are written once, and with the exception of the deletions file, are never modified once written. They are deleted when their data is written to new segments during the process of optimization. =head1 A segment's component parts Each segment can be said to have four logical parts: postings, stored fields, the deletions file, and the term vectors data. =head2 Stored fields The stored fields are organized into two files. =over =item * [seg_name]B<.fdx> - Field inDeX - pointers to field data =item * [seg_name]B<.fdt> - Field DaTa - the actual stored fields =back When a document turns up as a hit in a search and must be retrieved, KinoSearch1 looks at the Field inDeX file to see where in the data file the document's stored fields start, then retrieves all of them from the .fdt file in one lump. _1.fdx--| |--[doc#0 => 0]----->_1.fdt--| | |--[bodytext] | |--[title] | |--[url] |--[doc#1 => 305]----->_1.fdt--| # byte 305 | |--[bodytext] | |--[title] | |--[url] |--[...]--------------->_1.fdt--|--[...] If a field is marked as "vectorized", its "term vectors" are also stored in the .fdx file. =head2 Postings "Posting" is a technical term from the field of Information Retrieval which refers to an single instance of a one term indexing one document. If you are looking at the index in the back of a book, and you see that "freedom" is referenced on pages 8, 86, and 240, that would be three postings, which taken together form a "posting list". The same terminology applies to an index in electronic form. The postings data is spread out over 4 main files (not including field normalization data, which we'll get to in a moment). From lowest to highest in the hierarchy, they are... [seg_name]B<.prx> - PRoXimity data. A list of the positions at which terms appear in any given document. The .prx file is just a raw stream of VInts; the document numbers and terms are implicitly indicated by files higher up the hierarchy. [seg_name]B<.frq> - FReQuency data for terms. If a term has a frequency of 5 in a given document, that implies that there will be 5 entries in the .prx file. The terms themselves are implicitly specified by the .tis file. _1.frq--| |--[doc#40 => 2]----->_1.prx--|--[54,107] |--[doc#0 => 1]----->_1.prx--|--[6] |--[doc#6 => 1]----->_1.prx--|--[504] |--[doc#36 => 3]----->_1.prx--|--[2,33,747] |--[...]------------->_1.frq--|--[...] [seg_name]B<.tis> - TermInfoS. Among the items stored here is the term's doc_freq, which is the number of documents the term appears in. If a term has a doc_freq of 22 in a given collection, that implies that there will be 22 corresponding entries in the .frq file. Terms are ordered lexically, first by field, then by term text. _1.tis--| |--[...]----------------------->_1.frq--|--[...] |--[bodytext:mule => 1]-->_1.frq--|--[doc#40 => 2] |--[bodytext:multitude => 3]-->_1.frq--|--[doc#0 => 1] | |--[doc#6 => 1] | |--[doc#36 => 3] |--[bodytext:navigate => 1]-->_1.frq--|--[doc#21 => 1] |--[...]----------------------->_1.frq--|--[...] |--[title:amendment => 27]-->_1.frq--|--[doc#21 => 1] | |--[doc#22 => 1] |--[...]----------------------->_1.frq--|--[...] [seg_name]B<.tii> - TermInfos Index. This file, which is decompressed and loaded into RAM as soon as the IndexReader is initialized, contains a small subset of the .tis data, with pointers to locations in the .tis file. It is used to locate the right general vicinity in the .tis file as quickly as possible. _1.tii--| |--[bodytext:a => 20]---------->_1.tis--|--[bodytext:a] # byte 20 | |--[bodytext:about] | |--[bodytext:absolute] | |--[...] |--[bodytext:mule => 27065]---->_1.tis--|--[bodytext:mule] | |--[bodytext:multitude] | |--[...] |--[title:amendment => 56992]-->_1.tis--|--[title:amendment] |--[...] Here's a simplified version of how a search for "freedom" against a given segment plays out: =over =item 1 The searcher asks the .tii file, "Do you know anything about 'freedom'?" The .tii file replies, "Can't say for sure, but if the .tis file does, 'freedom' is probably somewhere around byte 21008". =item 2 The .tis file tells the searcher "Yes, we have 2 documents which contain 'freedom'. You'll find them in the .frq file starting at byte 66991." =item 3 The .frq file says "document number 40 has 1 'freedom', and document 44 has 8. If you need to know more, like if any 'freedom' is part of the phrase 'freedom of speech', take a look at the .prx file starting at..." =item 4 If the searcher is only looking for 'freedom' in isolation, that's where it stops. It already knows enough to assign the documents scores against "freedom", with the 8-freedom document scoring higher than the single-freedom document. =back =head2 Deletions When a document is "deleted" from a segment, it is not actually purged from the postings data and the stored fields data right away; it is merely marked as "deleted", via the .del file. The .del file contains a bit vector with one bit for each document in the segment; if bit #254 is set then document 254 is deleted, and if it turns up in a search it will be masked out. It is only when a segment's contents are rewritten to a new segment during the segment-merging process that deleted documents truly go away. =head2 Field Normalization Files For the sake of simplicity, the example search scenario above omits the role played the field normalization files, or "fieldnorms" for short. These files have the (theoretical) suffix of .f followed by an integer -- .f0, .f1, etc. Each segment contains one such file for every indexed field. By default, the fieldnorms' job is to make sure that a field which is 100 terms long and contains 10 mentions of the word 'freedom' scores higher than a field which also contains 10 mentions of the word 'freedom', but is 1000 terms in length. The idea is that the higher the density of the desired term, the more relevant the document. The fieldnorms files contain one byte per document per indexed field, and all of them must be loaded into RAM before a search can be executed. =head1 Document Numbers Document numbers are ephemeral. They change every time a document gets moved from one segment to a new one during optimization. If you need to assign a primary key to each document, you need to create a field and populate it with an externally generated unique identifier. =head1 Not compatible with Java Lucene The file format used by KinoSearch1 is closely related to the Lucene compound index format. (The technical specification for Lucene's file format is distributed along with Lucene.) However, indexes generated by Lucene and KinoSearch1 are not compatible. =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. KinoSearch1-1.01/lib/KinoSearch1/Docs/Tutorial.pod000444000765000765 2240511462203446 21765 0ustar00marvinmarvin000000000000=head1 NAME KinoSearch1::Docs::Tutorial - sample indexing and search applications =head1 DESCRIPTION The following sample code for invindexer.plx and search.cgi can be used to create a simple search engine. It requires the html presentation of the US Constitution included in the distribution for KinoSearch1, under C. Note that a proper indexer for html documents would not rely on quick-n-dirty regular expressions for stripping tags, as this one does for the sake of brevity -- it would use a dedicated parsing module such as L. =head2 invindexer.plx #!/usr/bin/perl use strict; use warnings; use File::Spec; use KinoSearch1::InvIndexer; use KinoSearch1::Analysis::PolyAnalyzer; ### In order for invindexer.plx to work correctly, you must modify ### $source_dir, $path_to_invindex, and possibly $base_url. ### ### $source_dir must lead to the directory containing the US ### Constitution html files. ### ### $path_to_invindex is the future location of the invindex. ### ### $base_url should reflect the location of the us_constitution directory ### when accessed via a web browser. my $source_dir = ''; my $path_to_invindex = ''; my $base_url = '/us_constitution'; opendir( my $source_dh, $source_dir ) or die "Couldn't opendir '$source_dir': $!"; my @filenames = grep {/\.html/} readdir $source_dh; closedir $source_dh or die "Couldn't closedir '$source_dir': $!"; ### STEP 1: Choose an Analyzer. my $analyzer = KinoSearch1::Analysis::PolyAnalyzer->new( language => 'en', ); ### STEP 2: Create a InvIndexer object. my $invindexer = KinoSearch1::InvIndexer->new( analyzer => $analyzer, invindex => $path_to_invindex, create => 1, ); ### STEP 3: Define fields. $invindexer->spec_field( name => 'title' ); $invindexer->spec_field( name => 'bodytext', vectorized => 1, ); $invindexer->spec_field( name => 'url', indexed => 0, ); foreach my $filename (@filenames) { next if $filename eq 'index.html'; my $filepath = File::Spec->catfile( $source_dir, $filename ); open( my $fh, '<', $filepath ) or die "couldn't open file '$filepath': $!"; my $content = do { local $/; <$fh> }; ### STEP 4: Start a new document. my $doc = $invindexer->new_doc; $content =~ m#(.*?)#s or die "couldn't isolate title in '$filepath'"; my $title = $1; $content =~ m#
(.*?)
#s or die "couldn't isolate bodytext in '$filepath'"; my $bodytext = $1; $bodytext =~ s/<.*?>/ /gsm; # quick and dirty tag stripping ### STEP 5: Set the value for each field. $doc->set_value( url => "$base_url/$filename" ); $doc->set_value( title => $title ); $doc->set_value( bodytext => $bodytext ); ### STEP 6 Add the document to the invindex. $invindexer->add_doc($doc); ### STEP 7 Repeat steps 3-5 for each document in the collection. } ### STEP 8 Finalize the invindex. $invindexer->finish; =head2 search.cgi #!/usr/bin/perl -T use strict; use warnings; use CGI; use List::Util qw( max min ); use POSIX qw( ceil ); use KinoSearch1::Searcher; use KinoSearch1::Analysis::PolyAnalyzer; use KinoSearch1::Highlight::Highlighter; my $cgi = CGI->new; my $q = $cgi->param('q'); my $offset = $cgi->param('offset'); my $hits_per_page = 10; $q = '' unless defined $q; $offset = 0 unless defined $offset; ### In order for search.cgi to work, $path_to_invindex must be modified so ### that it points to the invindex created by invindexer.plx, and ### $base_url may have to change to reflect where a web-browser should ### look for the us_constitution directory. my $path_to_invindex = ''; my $base_url = '/us_constitution'; ### STEP 1: Specify the same Analyzer used to create the invindex. my $analyzer = KinoSearch1::Analysis::PolyAnalyzer->new( language => 'en', ); ### STEP 2: Create a Searcher object. my $searcher = KinoSearch1::Searcher->new( invindex => $path_to_invindex, analyzer => $analyzer, ); ### STEP 3: Feed a query to the Search object. my $hits = $searcher->search($q); ### STEP 4: Arrange for highlighted excerpts to be created. my $highlighter = KinoSearch1::Highlight::Highlighter->new( excerpt_field => 'bodytext' ); $hits->create_excerpts( highlighter => $highlighter ); ### STEP 5: Process the search. $hits->seek( $offset, $hits_per_page ); ### STEP 6: Format the results however you like. # create result list my $report = ''; while ( my $hit = $hits->fetch_hit_hashref ) { my $score = sprintf( "%0.3f", $hit->{score} ); $report .= qq|

$hit->{title} $score
$hit->{excerpt}
$hit->{url}

|; } $q = CGI::escapeHTML($q); # display info about the number of hits, paging links my $total_hits = $hits->total_hits; my $num_hits_info; if ( !length $q ) { # no query, no display $num_hits_info = ''; } elsif ( $total_hits == 0 ) { # alert the user that their search failed $num_hits_info = qq|

No matches for $q

|; } else { # calculate the nums for the first and last hit to display my $last_result = min( ( $offset + $hits_per_page ), $total_hits ); my $first_result = min( ( $offset + 1 ), $last_result ); # display the result nums, start paging info $num_hits_info = qq|

Results $first_result-$last_result of $total_hits for $q.

Results Page: |; # calculate first and last hits pages to display / link to my $current_page = int( $first_result / $hits_per_page ) + 1; my $last_page = ceil( $total_hits / $hits_per_page ); my $first_page = max( 1, ( $current_page - 9 ) ); $last_page = min( $last_page, ( $current_page + 10 ) ); # create a url for use in paging links my $href = $cgi->url( -relative => 1 ) . "?" . $cgi->query_string; $href .= ";offset=0" unless $href =~ /offset=/; # generate the "Prev" link; if ( $current_page > 1 ) { my $new_offset = ( $current_page - 2 ) * $hits_per_page; $href =~ s/(?<=offset=)\d+/$new_offset/; $num_hits_info .= qq|<= Prev\n|; } # generate paging links for my $page_num ( $first_page .. $last_page ) { if ( $page_num == $current_page ) { $num_hits_info .= qq|$page_num \n|; } else { my $new_offset = ( $page_num - 1 ) * $hits_per_page; $href =~ s/(?<=offset=)\d+/$new_offset/; $num_hits_info .= qq|$page_num\n|; } } # generate the "Next" link if ( $current_page != $last_page ) { my $new_offset = $current_page * $hits_per_page; $href =~ s/(?<=offset=)\d+/$new_offset/; $num_hits_info .= qq|Next =>\n|; } # finish paging links $num_hits_info .= "

\n"; } # blast it all out print "Content-type: text/html\n\n"; print < KinoSearch: $q
$report $num_hits_info

Powered by KinoSearch

END_HTML =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. KinoSearch1-1.01/lib/KinoSearch1/Document000755000765000765 011462203446 20164 5ustar00marvinmarvin000000000000KinoSearch1-1.01/lib/KinoSearch1/Document/Doc.pm000444000765000765 452011462203446 21365 0ustar00marvinmarvin000000000000package KinoSearch1::Document::Doc; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # special member - used to keep track of boost _kino_boost => 1, ); } sub set_value { my ( $self, $field_name, $value ) = @_; carp("undef supplied to set_value") unless defined $value; $self->{$field_name}->set_value($value); } sub get_value { return $_[0]->{ $_[1] }->get_value; } sub get_field { $_[0]->{ $_[1] } } sub boost_field { $_[0]->{ $_[1] }->set_boost( $_[2] ); } sub set_boost { $_[0]->{_kino_boost} = $_[1] } sub get_boost { $_[0]->{_kino_boost} } # set the analyzer for a field sub set_analyzer { $_[0]->{ $_[1] }->set_analyzer( $_[2] ); } sub add_field { my ( $self, $field ) = @_; croak("argument to add_field must be a KinoSearch1::Document::Field") unless $field->isa('KinoSearch1::Document::Field'); $self->{ $field->get_name } = $field; } # retrieve all fields sub get_fields { return grep {ref} values %{ $_[0] }; } # Return the doc as a hashref, with the field names as hash keys and the # field # values as values. sub to_hashref { my $self = shift; my %hash; $hash{ $_->get_name } = $_->get_value for grep {ref} values %$self; return \%hash; } 1; __END__ =head1 NAME KinoSearch1::Document::Doc - a document =head1 SYNOPSIS my $doc = $invindexer->new_doc; $doc->set_value( title => $title ); $doc->set_value( bodytext => $bodytext ); $invindexer->add($doc); =head1 DESCRIPTION A Doc object is akin to a row in a database, in that it is made up of several fields, each of which has a value. Doc objects are only created via factory methods of other classes. =head1 METHODS =head2 set_value get_value $doc->set_value( title => $title_text ); my $text = $doc->get_value( 'title' ); C and C are used to modify and access the values of the fields within a Doc object. =head2 set_boost get_boost $doc->set_boost(2.5); C is a scoring multiplier. Setting boost to something other than 1 causes a document to score better or worse against a given query relative to other documents. =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Document/Field.pm000444000765000765 1614711462203446 21733 0ustar00marvinmarvin000000000000package KinoSearch1::Document::Field; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor args / members name => undef, analyzer => undef, boost => 1, stored => 1, indexed => 1, analyzed => 1, vectorized => 1, binary => 0, compressed => 0, omit_norms => 0, field_num => undef, value => '', fnm_bits => undef, fdt_bits => undef, tv_string => '', tv_cache => undef, ); __PACKAGE__->ready_get_set( qw( value tv_string boost indexed stored analyzed vectorized binary compressed analyzer field_num name omit_norms ) ); } use KinoSearch1::Index::FieldsReader; use KinoSearch1::Index::FieldInfos; use KinoSearch1::Index::TermVector; use Storable qw( dclone ); sub init_instance { my $self = shift; # field name is required croak("Missing required parameter 'name'") unless length $self->{name}; # don't index binary fields if ( $self->{binary} ) { $self->{indexed} = 0; $self->{analyzed} = 0; } } sub clone { my $self = shift; return dclone($self); } # Given two Field objects, return a child which has all the positive # attributes of both parents (meaning: values are OR'd). sub breed_with { my ( $self, $other ) = @_; my $kid = $self->clone; for (qw( indexed vectorized )) { $kid->{$_} ||= $other->{$_}; } return $kid; } sub set_fnm_bits { $_[0]->{fnm_bits} = $_[1] } sub get_fnm_bits { my $self = shift; $self->{fnm_bits} = KinoSearch1::Index::FieldInfos->encode_fnm_bits($self) unless defined $self->{fnm_bits}; return $self->{fnm_bits}; } sub set_fdt_bits { $_[0]->{fdt_bits} = $_[1] } sub get_fdt_bits { my $self = shift; $self->{fdt_bits} = KinoSearch1::Index::FieldsReader->encode_fdt_bits($self) unless defined $self->{fdt_bits}; return $self->{fdt_bits}; } sub get_value_len { bytes::length( $_[0]->{value} ) } # Return a TermVector object for a given Term, if it's in this field. sub term_vector { my ( $self, $term_text ) = @_; return unless bytes::length( $self->{tv_string} ); if ( !defined $self->{tv_cache} ) { $self->{tv_cache} = _extract_tv_cache( $self->{tv_string} ); } if ( exists $self->{tv_cache}{$term_text} ) { my ( $positions, $starts, $ends ) = _unpack_posdata( $self->{tv_cache}{$term_text} ); my $term_vector = KinoSearch1::Index::TermVector->new( text => $term_text, field => $self->{name}, positions => $positions, start_offsets => $starts, end_offsets => $ends, ); return $term_vector; } return; } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Document::Field =for comment Return ref to a hash where the keys are term texts and the values are encoded positional data. =cut void _extract_tv_cache(tv_string_sv) SV *tv_string_sv; PREINIT: HV *tv_cache_hv; PPCODE: tv_cache_hv = Kino1_Field_extract_tv_cache(tv_string_sv); XPUSHs( sv_2mortal( newRV_noinc( (SV*)tv_cache_hv ) ) ); XSRETURN(1); =for comment Decompress positional data. =cut void _unpack_posdata(posdata_sv) SV *posdata_sv; PREINIT: AV *positions_av, *starts_av, *ends_av; PPCODE: positions_av = newAV(); starts_av = newAV(); ends_av = newAV(); Kino1_Field_unpack_posdata(posdata_sv, positions_av, starts_av, ends_av); XPUSHs(sv_2mortal( newRV_noinc((SV*)positions_av) )); XPUSHs(sv_2mortal( newRV_noinc((SV*)starts_av) )); XPUSHs(sv_2mortal( newRV_noinc((SV*)ends_av) )); XSRETURN(3); __H__ #ifndef H_KINOSEARCH_FIELD #define H_KINOSEARCH_FIELD 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1StoreInStream.h" #include "KinoSearch1UtilCarp.h" HV* Kino1_Field_extract_tv_cache(SV*); void Kino1_Field_unpack_posdata(SV*, AV*, AV*, AV*); #endif /* include guard */ __C__ #include "KinoSearch1DocumentField.h" HV* Kino1_Field_extract_tv_cache(SV *tv_string_sv) { HV *tv_cache_hv; char *tv_string, *bookmark_ptr, *key; char **tv_ptr; STRLEN len, tv_len, overlap, key_len; SV *text_sv, *nums_sv; I32 i, num_terms, num_positions; /* allocate a new hash */ tv_cache_hv = newHV(); /* extract pointers */ tv_string = SvPV(tv_string_sv, tv_len); tv_ptr = &tv_string; /* create a base text scalar */ text_sv = newSV(1); SvPOK_on(text_sv); *(SvEND(text_sv)) = '\0'; /* read the number of vectorized terms in the field */ num_terms = Kino1_InStream_decode_vint(tv_ptr); for (i = 0; i < num_terms; i++) { /* decompress the term text */ overlap = Kino1_InStream_decode_vint(tv_ptr); SvCUR_set(text_sv, overlap); len = Kino1_InStream_decode_vint(tv_ptr); sv_catpvn(text_sv, *tv_ptr, len); *tv_ptr += len; key = SvPV(text_sv, key_len); /* get positions & offsets string */ num_positions = Kino1_InStream_decode_vint(tv_ptr); bookmark_ptr = *tv_ptr; while(num_positions--) { /* leave nums compressed to save a little mem */ (void)Kino1_InStream_decode_vint(tv_ptr); (void)Kino1_InStream_decode_vint(tv_ptr); (void)Kino1_InStream_decode_vint(tv_ptr); } len = *tv_ptr - bookmark_ptr; nums_sv = newSVpvn(bookmark_ptr, len); /* store the $text => $posdata pair in the output hash */ hv_store(tv_cache_hv, key, key_len, nums_sv, 0); } SvREFCNT_dec(text_sv); return tv_cache_hv; } void Kino1_Field_unpack_posdata(SV *posdata_sv, AV *positions_av, AV *starts_av, AV *ends_av) { STRLEN len; char *posdata, *posdata_end; char **posdata_ptr; SV *num_sv; posdata = SvPV(posdata_sv, len); posdata_ptr = &posdata; posdata_end = SvEND(posdata_sv); /* translate encoded VInts to Perl scalars */ while(*posdata_ptr < posdata_end) { num_sv = newSViv( Kino1_InStream_decode_vint(posdata_ptr) ); av_push(positions_av, num_sv); num_sv = newSViv( Kino1_InStream_decode_vint(posdata_ptr) ); av_push(starts_av, num_sv); num_sv = newSViv( Kino1_InStream_decode_vint(posdata_ptr) ); av_push(ends_av, num_sv); } if (*posdata_ptr != posdata_end) Kino1_confess("Bad encoding of posdata"); } __POD__ =head1 NAME KinoSearch1::Document::Field - a field within a document =head1 SYNOPSIS # no public interface =head1 DESCRIPTION Fields can only be defined or manipulated indirectly, via InvIndexer and Doc. =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Highlight000755000765000765 011462203446 20315 5ustar00marvinmarvin000000000000KinoSearch1-1.01/lib/KinoSearch1/Highlight/Encoder.pm000444000765000765 140111462203446 22363 0ustar00marvinmarvin000000000000package KinoSearch1::Highlight::Encoder; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars(); } sub encode { shift->abstract_death } 1; __END__ =head1 NAME KinoSearch1::Highlight::Encoder - encode excerpted text =head1 SYNOPSIS # abstract base class =head1 DESCRIPTION Encoder objects are invoked by Highlighter objects for every piece of text that makes it into an excerpt. The archetypal implementation is KinoSearch1::Highlight::SimpleHTMLEncoder. =head1 METHODS =head2 encode my $encoded = $encoder->encode($text); =head1 COPYRIGHT Copyright 2006-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Highlight/Formatter.pm000444000765000765 152411462203446 22755 0ustar00marvinmarvin000000000000package KinoSearch1::Highlight::Formatter; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars(); } sub highlight_term { shift->abstract_death } 1; __END__ =head1 NAME KinoSearch1::Highlight::Formatter - format highlighted bits within excerpts =head1 SYNOPSIS # abstract base class =head1 DESCRIPTION Formatter objects serve one purpose: they highlight pieces of text within an excerpt. The text might be a single term, or it might be a phrase. =head1 METHODS =head2 highlight my $highlighted = $formatter->highlight($text); Highlight text by e.g. surrounding it with asterisks, or html "strong" tags, etc. =head1 COPYRIGHT Copyright 2006-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Highlight/Highlighter.pm000444000765000765 3114211462203446 23267 0ustar00marvinmarvin000000000000package KinoSearch1::Highlight::Highlighter; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); use locale; BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members excerpt_field => undef, analyzer => undef, formatter => undef, encoder => undef, terms => undef, excerpt_length => 200, pre_tag => undef, # back compat post_tag => undef, # back compat token_re => qr/\b\w+(?:'\w+)?\b/, # members limit => undef, ); __PACKAGE__->ready_get_set(qw( terms )); } use KinoSearch1::Highlight::SimpleHTMLFormatter; use KinoSearch1::Highlight::SimpleHTMLEncoder; sub init_instance { my $self = shift; croak("Missing required arg 'excerpt_field'") unless defined $self->{excerpt_field}; $self->{terms} = []; # assume HTML if ( !defined $self->{encoder} ) { $self->{encoder} = KinoSearch1::Highlight::SimpleHTMLEncoder->new; } if ( !defined $self->{formatter} ) { my ( $pre_tag, $post_tag ) = @{$self}{qw( pre_tag post_tag )}; $pre_tag = '' unless defined $pre_tag; $post_tag = '' unless defined $post_tag; $self->{formatter} = KinoSearch1::Highlight::SimpleHTMLFormatter->new( pre_tag => $pre_tag, post_tag => $post_tag, ); } # scoring window is 1.66 * excerpt_length, with the loc in the middle $self->{limit} = int( $self->{excerpt_length} / 3 ); } sub generate_excerpt { my ( $self, $doc ) = @_; my $excerpt_length = $self->{excerpt_length}; my $limit = $self->{limit}; my $token_re = $self->{token_re}; # retrieve the text from the chosen field my $field = $doc->get_field( $self->{excerpt_field} ); my $text = $field->get_value; my $text_length = bytes::length $text; return '' unless $text_length; # determine the rough boundaries of the excerpt my $posits = $self->_starts_and_ends($field); my $best_location = $self->_calc_best_location($posits); my $top = $best_location - $limit; # expand the excerpt if the best location is near the end $top = $text_length - $excerpt_length < $top ? $text_length - $excerpt_length : $top; # if the best starting point is the very beginning, cool... if ( $top <= 0 ) { $top = 0; } # ... otherwise ... else { # lop off $top bytes $text = bytes::substr( $text, $top ); # try to start the excerpt at a sentence boundary if ($text =~ s/ \A ( \C{0,$limit}? \.\s+ ) //xsm ) { $top += bytes::length($1); } # no sentence boundary, so we'll need an ellipsis else { # skip past possible partial tokens, prepend an ellipsis if ($text =~ s/ \A ( \C{0,$limit}? # don't go outside the window $token_re # match possible partial token .*? # ... and any junk following that token ) (?=$token_re) # just before the start of a full token... /... /xsm # ... insert an ellipsis ) { $top += bytes::length($1); $top -= 4 # three dots and a space } } } # remove possible partial tokens from the end of the excerpt $text = bytes::substr( $text, 0, $excerpt_length + 1 ); if ( bytes::length($text) > $excerpt_length ) { my $extra_char = chop $text; # if the extra char wasn't part of a token, we aren't splitting one if ( $extra_char =~ $token_re ) { $text =~ s/$token_re$//; # if this is unsuccessful, that's fine } } # if the excerpt doesn't end with a full stop, end with an an ellipsis if ( $text !~ /\.\s*\Z/xsm ) { $text =~ s/\W+\Z//xsm; while ( bytes::length($text) + 4 > $excerpt_length ) { my $extra_char = chop $text; if ( $extra_char =~ $token_re ) { $text =~ s/\W+$token_re\Z//xsm; # if unsuccessful, that's fine } $text =~ s/\W+\Z//xsm; } $text .= ' ...'; } # remap locations now that we know the starting and ending bytes $text_length = bytes::length($text); my @relative_starts = map { $_->[0] - $top } @$posits; my @relative_ends = map { $_->[1] - $top } @$posits; # get rid of pairs with at least one member outside the text while ( @relative_starts and $relative_starts[0] < 0 ) { shift @relative_starts; shift @relative_ends; } while ( @relative_ends and $relative_ends[-1] > $text_length ) { pop @relative_starts; pop @relative_ends; } # insert highlight tags my $formatter = $self->{formatter}; my $encoder = $self->{encoder}; my $output_text = ''; my ( $start, $end, $last_start, $last_end ) = ( undef, undef, 0, 0 ); while (@relative_starts) { $end = shift @relative_ends; $start = shift @relative_starts; $output_text .= $encoder->encode( bytes::substr( $text, $last_end, $start - $last_end ) ); $output_text .= $formatter->highlight( $encoder->encode( bytes::substr( $text, $start, $end - $start ) ) ); $last_end = $end; } $output_text .= $encoder->encode( bytes::substr( $text, $last_end ) ); return $output_text; } =for comment Find all points in the text where a relevant term begins and ends. For terms that are part of a phrase, only include points that are part of the phrase. =cut sub _starts_and_ends { my ( $self, $field ) = @_; my @posits; my %done; TERM: for my $term ( @{ $self->{terms} } ) { if ( a_isa_b( $term, 'KinoSearch1::Index::Term' ) ) { my $term_text = $term->get_text; next TERM if $done{$term_text}; $done{$term_text} = 1; # add all starts and ends my $term_vector = $field->term_vector($term_text); next TERM unless defined $term_vector; my $starts = $term_vector->get_start_offsets; my $ends = $term_vector->get_end_offsets; while (@$starts) { push @posits, [ shift @$starts, shift @$ends, 1 ]; } } # intersect positions for phrase terms else { # if not a Term, it's an array of Terms representing a phrase my @term_texts = map { $_->get_text } @$term; my $phrase_text = join( ' ', @term_texts ); next TERM if $done{$phrase_text}; $done{$phrase_text} = 1; my $posit_vec = KinoSearch1::Util::BitVector->new; my @term_vectors = map { $field->term_vector($_) } @term_texts; # make sure all terms are present next TERM unless scalar @term_vectors == scalar @term_texts; my $i = 0; for my $tv (@term_vectors) { # one term missing, ergo no phrase next TERM unless defined $tv; if ( $i == 0 ) { $posit_vec->set( @{ $tv->get_positions } ); } else { # filter positions using logical "and" my $other_posit_vec = KinoSearch1::Util::BitVector->new; $other_posit_vec->set( grep { $_ >= 0 } map { $_ - $i } @{ $tv->get_positions } ); $posit_vec->logical_and($other_posit_vec); } $i++; } # add only those starts/ends that belong to a valid position my $tv_start_positions = $term_vectors[0]->get_positions; my $tv_starts = $term_vectors[0]->get_start_offsets; my $tv_end_positions = $term_vectors[-1]->get_positions; my $tv_ends = $term_vectors[-1]->get_end_offsets; $i = 0; my $j = 0; my $last_token_index = $#term_vectors; for my $valid_position ( @{ $posit_vec->to_arrayref } ) { while ( $i <= $#$tv_start_positions ) { last if ( $tv_start_positions->[$i] >= $valid_position ); $i++; } $valid_position += $last_token_index; while ( $j <= $#$tv_end_positions ) { last if ( $tv_end_positions->[$j] >= $valid_position ); $j++; } push @posits, [ $tv_starts->[$i], $tv_ends->[$j], scalar @$term ]; $i++; $j++; } } } # sort, uniquify and return @posits = sort { $a->[0] <=> $b->[0] || $b->[1] <=> $a->[1] } @posits; my @unique; my $last = ~0; for (@posits) { push @unique, $_ if $_->[0] != $last; $last = $_->[0]; } return \@unique; } =for comment Select the byte address representing the greatest keyword density. Because the algorithm counts bytes rather than characters, it will degrade if the number of bytes per character is larger than 1. =cut sub _calc_best_location { my ( $self, $posits ) = @_; my $window = $self->{limit} * 2; # if there aren't any keywords, take the excerpt from the top of the text return 0 unless @$posits; my %locations = map { ( $_->[0] => 0 ) } @$posits; # if another keyword is in close proximity, add to the loc's score for my $loc_index ( 0 .. $#$posits ) { # only score positions that are in range my $location = $posits->[$loc_index][0]; my $other_loc_index = $loc_index - 1; while ( $other_loc_index > 0 ) { my $diff = $location - $posits->[$other_loc_index][0]; last if $diff > $window; my $num_tokens_at_pos = $posits->[$other_loc_index][2]; $locations{$location} += ( 1 / ( 1 + log($diff) ) ) * $num_tokens_at_pos; --$other_loc_index; } $other_loc_index = $loc_index + 1; while ( $other_loc_index <= $#$posits ) { my $diff = $posits->[$other_loc_index] - $location; last if $diff > $window; my $num_tokens_at_pos = $posits->[$other_loc_index][2]; $locations{$location} += ( 1 / ( 1 + log($diff) ) ) * $num_tokens_at_pos; ++$other_loc_index; } } # return the highest scoring position return ( sort { $locations{$b} <=> $locations{$a} } keys %locations )[0]; } 1; __END__ =head1 NAME KinoSearch1::Highlight::Highlighter - create and highlight excerpts =head1 SYNOPSIS my $highlighter = KinoSearch1::Highlight::Highlighter->new( excerpt_field => 'bodytext', ); $hits->create_excerpts( highlighter => $highlighter ); =head1 DESCRIPTION KinoSearch1's Highlighter can be used to select a relevant snippet from a document, and to surround search terms with highlighting tags. It handles both stems and phrases correctly and efficiently, using special-purpose data generated at index-time. =head1 METHODS =head2 new my $highlighter = KinoSearch1::Highlight::Highlighter->new( excerpt_field => 'bodytext', # required excerpt_length => 150, # default: 200 formatter => $formatter, # default: SimpleHTMLFormatter encoder => $encoder, # default: SimpleHTMLEncoder ); Constructor. Takes hash-style parameters: =over =item * B - the name of the field from which to draw the excerpt. This field B be C. =item * B - the length of the excerpt, in I. This should probably use characters as a unit instead of bytes, and the behavior is likely to change in the future. =item * B - an object which subclasses L, used to perform the actual highlighting. =item * B - an object which subclasses L. All excerpt text gets passed through the encoder, including highlighted terms. By default, this is a SimpleHTMLEncoder, which encodes HTML entities. =item * B - deprecated. =item * B - deprecated. =back =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Highlight/SimpleHTMLEncoder.pm000444000765000765 164611462203445 24234 0ustar00marvinmarvin000000000000package KinoSearch1::Highlight::SimpleHTMLEncoder; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars(); } sub encode { my $text = $_[1]; for ($text) { s/&/&/g; s/"/"/g; s//>/g; } return $text; } 1; __END__ =head1 NAME KinoSearch1::Highlight::SimpleHTMLEncoder - encode a few HTML entities =head1 SYNOPSIS # returns '"Hey, you!"' my $encoded = $encoder->encode('"Hey, you!"'); =head1 DESCRIPTION Implemetation of L which encodes HTML entities. Currently, this module takes a minimal approach, encoding only '<', '>', '&', and '"'. That is likely to change in the future. =head1 COPYRIGHT Copyright 2006-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Highlight/SimpleHTMLFormatter.pm000444000765000765 306411462203446 24615 0ustar00marvinmarvin000000000000package KinoSearch1::Highlight::SimpleHTMLFormatter; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Highlight::Formatter ); BEGIN { __PACKAGE__->init_instance_vars( pre_tag => '', post_tag => '', ); } sub highlight { my ( $self, $text ) = @_; return "$self->{pre_tag}$text$self->{post_tag}"; } 1; __END__ =head1 NAME KinoSearch1::Highlight::SimpleHTMLFormatter - surround highlight bits with tags =head1 SYNOPSIS my $formatter = KinoSearch1::Highlight::SimpleHTMLFormatter->new( pre_tag => '', post_tag => '', ); # returns "foo" my $highlighted = $formatter->highlight("foo"); =head1 DESCRIPTION This subclass of L highlights text by surrounding it with HTML "strong" tags. =head1 METHODS =head2 new Constructor. Takes hash-style params. my $formatter = KinoSearch1::Highlight::SimpleHTMLFormatter->new( pre_tag => '*', # default: '' post_tag => '*', # default: '' ); =over =item * B - a string which will be inserted immediately prior to the highlightable text, typically to accentuate it. If you don't want highlighting, set both C and C to C<''>. =item * B - a string which will be inserted immediately after the highlightable text. =back =head1 COPYRIGHT Copyright 2006-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Index000755000765000765 011462203446 17455 5ustar00marvinmarvin000000000000KinoSearch1-1.01/lib/KinoSearch1/Index/CompoundFileReader.pm000444000765000765 527711462203446 23672 0ustar00marvinmarvin000000000000package KinoSearch1::Index::CompoundFileReader; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Store::InvIndex ); # !! BEGIN { __PACKAGE__->init_instance_vars( # members / constructor params invindex => undef, seg_name => undef, # members instream => undef, entries => undef, ); } use KinoSearch1::Store::InStream; sub init_instance { my $self = shift; my ( $seg_name, $invindex ) = @{$self}{ 'seg_name', 'invindex' }; # read in names and lengths for all the "files" within the compound file my $instream = $self->{instream} = $invindex->open_instream("$seg_name.cfs"); my $num_entries = $instream->lu_read('V'); my @offsets_and_names = $instream->lu_read( 'QT' x $num_entries ); my $offset = shift @offsets_and_names; my %entries; while (@offsets_and_names) { my $filename = shift @offsets_and_names; my $next_offset = shift @offsets_and_names || $instream->length; $entries{$filename} = { offset => $offset, len => $next_offset - $offset, }; $offset = $next_offset; } $self->{entries} = \%entries; } sub open_instream { my ( $self, $filename ) = @_; my $entry = $self->{entries}{$filename}; croak("filename '$filename' is not accessible") unless defined $entry; open( my $duped_fh, '<&=', $self->{instream}->get_fh ) or confess("Couldn't dupe filehandle: $!"); return KinoSearch1::Store::InStream->new( $duped_fh, $entry->{offset}, $entry->{len} ); } sub slurp_file { my ( $self, $filename ) = @_; my $instream = $self->open_instream($filename); my $contents = $instream->lu_read( 'a' . $self->{entries}{$filename}{len} ); $instream->close; return $contents; } sub close { shift->{instream}->close } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Index::CompoundFileReader - read from a compound file ==head1 SYNOPSIS my $comp_file_reader = KinoSearch1::Index::CompoundFileReader->new( invindex => $invindex, filename => "$seg_name.cfs", ); my $instream = $comp_file_reader->open_instream("$seg_name.fnm"); ==head1 DESCRIPTION A CompoundFileReader provides access to the files contained within the compound file format written by CompoundFileWriter. The InStream objects it spits out behave largely like InStreams opened against discrete files -- $instream->seek(0) seeks to the beginning of the sub-file, not the beginning of the compound file. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/CompoundFileWriter.pm000444000765000765 502311462203445 23730 0ustar00marvinmarvin000000000000package KinoSearch1::Index::CompoundFileWriter; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members invindex => undef, filename => undef, # members entries => undef, merged => 0, ); } sub init_instance { my $self = shift; $self->{entries} = {}; } # Add a file to the list of files-to-merge. sub add_file { my ( $self, $filename ) = @_; croak("filename '$filename' already added") if $self->{entries}{$filename}; $self->{entries}{$filename} = 1; } # Write a compound file. sub finish { my $self = shift; my $invindex = $self->{invindex}; my $filename = $self->{filename}; my @files_to_merge = keys %{ $self->{entries} }; croak('no entries defined') unless @files_to_merge; # ensure that the file only gets written once; open the outfile croak('merge already performed') if $self->{merged}; $self->{merged} = 1; $invindex->delete_file($filename) if $invindex->file_exists($filename); my $outstream = $invindex->open_outstream($filename); # write number of files, plus data_offset placeholders my @to_write = map { ( 0, $_ ) } @files_to_merge; unshift @to_write, scalar @files_to_merge; my $template = 'V' . ( 'QT' x scalar @files_to_merge ); $outstream->lu_write( $template, @to_write ); # copy data my @data_offsets; my $out_fh = $outstream; for my $file (@files_to_merge) { push @data_offsets, $outstream->tell; my $instream = $invindex->open_instream($file); $outstream->absorb($instream); } # rewrite number of files, plus real data offsets $outstream->seek(0); @to_write = map { ( shift @data_offsets, $_ ) } @files_to_merge; unshift @to_write, scalar @files_to_merge; $outstream->lu_write( $template, @to_write ); $outstream->close; } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Index::CompoundFileWriter - consolidate invindex files ==head1 DESCRIPTION CompoundFileWriter takes a list of pre-existing files and writes a new file which combines them into one. It writes a header containing filenames and filepointers, then writes a data section containing file content. The original files are not deleted, so cleanup must be done externally. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/DelDocs.pm000444000765000765 1174011462203446 21510 0ustar00marvinmarvin000000000000package KinoSearch1::Index::DelDocs; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::BitVector ); use KinoSearch1::Util::IntMap; # instance vars: my %num_deletions; sub new { my $self = shift->SUPER::new; $num_deletions{"$self"} = 0; return $self; } # Read a deletions file if one exists. sub read_deldocs { my ( $self, $invindex, $filename ) = @_; # load the file into memory if it's there if ( $invindex->file_exists($filename) ) { my $instream = $invindex->open_instream($filename); my $byte_size; ( $byte_size, $num_deletions{"$self"} ) = $instream->lu_read('ii'); $self->set_bits( $instream->lu_read("a$byte_size") ); $instream->close; } } # Blast out a hard copy of the deletions held in memory. sub write_deldocs { my ( $self, $invindex, $filename, $max_doc ) = @_; if ( $invindex->file_exists($filename) ) { $invindex->delete_file($filename); } my $outstream = $invindex->open_outstream($filename); # pad out deldocs->bits $self->set_capacity($max_doc); # write header followed by deletions data my $byte_size = ceil( $max_doc / 8 ); $outstream->lu_write( "iia$byte_size", $byte_size, $num_deletions{"$self"}, $self->get_bits, ); $outstream->close; } # Mark a doc as deleted. sub set { my ( $self, $doc_num ) = @_; # ... only if it isn't already deleted if ( !$self->get($doc_num) ) { $self->SUPER::set($doc_num); $num_deletions{"$self"}++; } } # Delete all the docs represented by a TermDocs object. sub delete_by_term_docs { my ( $self, $term_docs ) = @_; $num_deletions{"$self"} += _delete_by_term_docs( $self, $term_docs ); } # Undelete a doc. sub clear { my ( $self, $doc_num ) = @_; # ... only if it was deleted before if ( $self->get($doc_num) ) { $self->SUPER::clear($doc_num); $num_deletions{"$self"}--; } } sub get_num_deletions { $num_deletions{"$_[0]"} } # Map around deleted documents. sub generate_doc_map { my ( $self, $max, $offset ) = @_; my $map = $self->_generate_doc_map( $max, $offset ); return KinoSearch1::Util::IntMap->new($map); } # If these get implemented, we'll need to write a range_count(first, last) # method for BitVector. sub bulk_set { shift->todo_death } sub bulk_clear { shift->todo_death } sub close { } sub DESTROY { my $self = shift; delete $num_deletions{"$self"}; $self->SUPER::DESTROY; } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Index::DelDocs SV* _generate_doc_map(deldocs, max, offset); BitVector *deldocs; I32 max; I32 offset; PREINIT: SV *map_sv; CODE: map_sv = Kino1_DelDocs_generate_doc_map(deldocs, max, offset); RETVAL = newRV_noinc(map_sv); OUTPUT: RETVAL I32 _delete_by_term_docs(deldocs, term_docs) BitVector *deldocs; TermDocs *term_docs; CODE: RETVAL = Kino1_DelDocs_delete_by_term_docs(deldocs, term_docs); OUTPUT: RETVAL __H__ #ifndef H_KINOSEARCH_DELDOCS #define H_KINOSEARCH_DELDOCS 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1IndexTermDocs.h" #include "KinoSearch1UtilBitVector.h" SV* Kino1_DelDocs_generate_doc_map(BitVector*, I32, I32); I32 Kino1_DelDocs_delete_by_term_docs(BitVector*, TermDocs*); #endif /* include guard */ __C__ #include "KinoSearch1IndexDelDocs.h" SV* Kino1_DelDocs_generate_doc_map(BitVector *deldocs, I32 max, I32 offset) { SV *doc_map_sv; I32 *doc_map; I32 new_doc_num; int i; /* allocate space for the doc map */ doc_map_sv = newSV(max * sizeof(I32) + 1); SvCUR_set(doc_map_sv, max * sizeof(I32)); SvPOK_on(doc_map_sv); doc_map = (I32*)SvPVX(doc_map_sv); /* -1 for a deleted doc, a new number otherwise */ new_doc_num = 0; for (i = 0; i < max; i++) { if (Kino1_BitVec_get(deldocs, i)) *doc_map++ = -1; else *doc_map++ = offset + new_doc_num++; } return doc_map_sv; } I32 Kino1_DelDocs_delete_by_term_docs(BitVector* deldocs, TermDocs* term_docs) { I32 doc; I32 num_deleted = 0; /* iterate through term docs, marking each doc returned as deleted */ while (term_docs->next(term_docs)) { doc = term_docs->get_doc(term_docs); if (Kino1_BitVec_get(deldocs, doc)) continue; Kino1_BitVec_set(deldocs, doc); num_deleted++; } return num_deleted; } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Index::DelDocs - manage documents deleted from an invindex ==head1 DESCRIPTION DelDocs provides the low-level mechanisms for declaring a document deleted from a segment, and for finding out whether or not a particular document has been deleted. Note that documents are not actually gone from the invindex until the segment gets rewritten. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/FieldInfos.pm000444000765000765 1445311462203446 22221 0ustar00marvinmarvin000000000000package KinoSearch1::Index::FieldInfos; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class Exporter ); use constant INDEXED => "\x01"; use constant VECTORIZED => "\x02"; use constant OMIT_NORMS => "\x10"; our @EXPORT_OK; BEGIN { __PACKAGE__->init_instance_vars( # members by_name => undef, by_num => undef, from_file => 0, ); __PACKAGE__->ready_get_set(qw( from_file )); @EXPORT_OK = qw( INDEXED VECTORIZED OMIT_NORMS ); } use KinoSearch1::Document::Field; sub init_instance { my $self = shift; $self->{by_name} = {}; $self->{by_num} = []; } sub clone { my $self = shift; my $evil_twin = __PACKAGE__->new; $evil_twin->{from_file} = $self->{from_file}; my @by_num; my %by_name; for my $finfo ( @{ $self->{by_num} } ) { my $dupe = $finfo->clone; push @by_num, $dupe; $by_name{ $finfo->get_name } = $dupe; } $evil_twin->{by_num} = \@by_num; $evil_twin->{by_name} = \%by_name; return $evil_twin; } # Add a user-supplied Field object to the collection. sub add_field { my ( $self, $field ) = @_; croak("Not a KinoSearch1::Document::Field") unless a_isa_b( $field, 'KinoSearch1::Document::Field' ); # don't mod Field objects for segments that are read back in croak("Can't update FieldInfos that were read in from file") if $self->{from_file}; # add the field my $fieldname = $field->get_name; $self->{by_name}{$fieldname} = $field; $self->_assign_field_nums; } # Return the number of fields in the segment. sub size { scalar @{ $_[0]->{by_num} } } # Return a list of the Field objects. sub get_infos { @{ $_[0]->{by_num} } } # Given a fieldname, return its number. sub get_field_num { my ( $self, $name ) = @_; return undef unless exists $self->{by_name}{$name}; my $num = $self->{by_name}{$name}->get_field_num; return $num; } # Given a fieldname, return its FieldInfo. sub info_by_name { $_[0]->{by_name}{ $_[1] } } # Given a field number, return its fieldInfo. sub info_by_num { $_[0]->{by_num}[ $_[1] ] } # Given the field number (new, not original), return the name of the field. sub field_name { my ( $self, $num ) = @_; my $name = $self->{by_num}[$num]->get_name; croak("Don't know about field number $num") unless defined $name; return $name; } # Sort all the fields lexically by name and assign ascending numbers. sub _assign_field_nums { my $self = shift; confess("Can't _assign_field_nums when from_file") if $self->{from_file}; # assign field nums according to lexical order of field names @{ $self->{by_num} } = sort { $a->get_name cmp $b->get_name } values %{ $self->{by_name} }; my $inc = 0; $_->set_field_num( $inc++ ) for @{ $self->{by_num} }; } # Decode an existing .fnm file. sub read_infos { my ( $self, $instream ) = @_; my ( $by_name, $by_num ) = @{$self}{qw( by_name by_num )}; # set flag indicating that this FieldInfos object has been read in $self->{from_file} = 1; # read in infos from stream my $num_fields = $instream->lu_read('V'); my @names_and_bits = $instream->lu_read( 'Ta' x $num_fields ); my $field_num = 0; while ( $field_num < $num_fields ) { my ( $name, $bits ) = splice( @names_and_bits, 0, 2 ); my $info = KinoSearch1::Document::Field->new( field_num => $field_num, name => $name, indexed => ( "$bits" & INDEXED ) eq INDEXED ? 1 : 0, vectorized => ( "$bits" & VECTORIZED ) eq VECTORIZED ? 1 : 0, fnm_bits => $bits, ); $by_name->{$name} = $info; # order of storage implies lexical order by name and field number push @$by_num, $info; $field_num++; } } # Write .fnm file. sub write_infos { my ( $self, $outstream ) = @_; $outstream->lu_write( 'V', scalar @{ $self->{by_num} } ); for my $finfo ( @{ $self->{by_num} } ) { $outstream->lu_write( 'Ta', $finfo->get_name, $finfo->get_fnm_bits, ); } } # Merge two FieldInfos objects, redefining fields as necessary and generating # new field numbers. sub consolidate { my ( $self, @others ) = @_; my $infos = $self->{by_name}; # Make *this* finfos the master FieldInfos object for my $other (@others) { while ( my ( $name, $other_finfo ) = each %{ $other->{by_name} } ) { if ( exists $infos->{$name} ) { $infos->{$name} = $other_finfo->breed_with( $infos->{$name} ); } else { $infos->{$name} = $other_finfo->clone; } } } $self->_assign_field_nums; } # Generate a mapping of field numbers between two FieldInfos objects. Should # be called by the superset. sub generate_field_num_map { my ( $self, $other ) = @_; my $map = ''; for my $other_finfo ( @{ $other->{by_num} } ) { my $orig_finfo = $self->{by_name}{ $other_finfo->get_name }; $map .= pack( 'I', $orig_finfo->get_field_num ); } return KinoSearch1::Util::IntMap->new( \$map ); } sub encode_fnm_bits { my ( undef, $field ) = @_; my $bits = "\0"; for ($bits) { $_ |= INDEXED if $field->get_indexed; $_ |= VECTORIZED if $field->get_vectorized; $_ |= OMIT_NORMS if $field->get_omit_norms; } return $bits; } sub decode_fnm_bits { my ( undef, $field, $bits ) = @_; $field->set_indexed( ( $bits & INDEXED ) eq INDEXED ); $field->set_vectorized( ( $bits & VECTORIZED ) eq VECTORIZED ); $field->set_omit_norms( ( $bits & OMIT_NORMS ) eq OMIT_NORMS ); } sub close { } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Index::FieldInfos - track field characteristics ==head1 SYNOPSIS my $finfos = KinoSearch1::Index::FieldInfos->new; $finfos->read_infos($instream); ==head1 DESCRIPTION A FieldInfos object tracks the characteristics of all fields in a given segment. KinoSearch1 counts on having field nums assigned to fields by lexically sorted order of field names, but indexes generated by Java Lucene are not likely to have this property. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/FieldsReader.pm000444000765000765 1003411462203446 22517 0ustar00marvinmarvin000000000000package KinoSearch1::Index::FieldsReader; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class Exporter ); use constant ANALYZED => "\x01"; use constant BINARY => "\x02"; use constant COMPRESSED => "\x04"; our @EXPORT_OK; BEGIN { @EXPORT_OK = qw( ANALYZED BINARY COMPRESSED ); __PACKAGE__->init_instance_vars( # constructor params / members finfos => undef, fdata_stream => undef, findex_stream => undef, # members size => undef, ); } use Compress::Zlib qw( uncompress ); use KinoSearch1::Document::Field; use KinoSearch1::Document::Doc; sub init_instance { my $self = shift; # derive the number of documents in the segment $self->{size} = $self->{findex_stream}->length / 8; } # Return number of documents in segment. sub get_size { $_[0]->{size} } # Retrieve raw field data from files. Either the data will be turned into # full-on Field and Doc objects by fetch_doc, or it will be passed on mostly # intact when merging segments (field numbers will be modified). sub fetch_raw { my ( $self, $doc_num ) = @_; my ( $findex_stream, $fdata_stream ) = @{$self}{ 'findex_stream', 'fdata_stream' }; # get data file pointer from index $findex_stream->seek( $doc_num * 8 ); my $start = $findex_stream->lu_read('Q'); # retrieve one doc's worth of field data $fdata_stream->seek($start); my $num_fields = $fdata_stream->lu_read('V'); my $template = 'VaTT' x $num_fields; my @raw = $fdata_stream->lu_read($template); return ( $num_fields, \@raw ); } # Given a doc_num, rebuild a Doc object from the fields that were # stored. sub fetch_doc { my ( $self, $doc_num ) = @_; my $finfos = $self->{finfos}; # start a new Doc object, read in data my $doc = KinoSearch1::Document::Doc->new; my ( $num_fields, $data ) = $self->fetch_raw($doc_num); # docode stored data and build up the Doc object Field by Field. for ( 1 .. $num_fields ) { my ( $field_num, $bits, $string, $tv_string ) = splice( @$data, 0, 4 ); # decode fnm bits my $analyzed = ( $bits & ANALYZED ) eq ANALYZED ? 1 : 0; my $binary = ( $bits & BINARY ) eq BINARY ? 1 : 0; my $compressed = ( $bits & COMPRESSED ) eq COMPRESSED ? 1 : 0; # create a field object, merging in the FieldInfo data, and add it my $finfo = $finfos->info_by_num($field_num); my $field = KinoSearch1::Document::Field->new( %$finfo, field_num => $field_num, analyzed => $analyzed, binary => $binary, compressed => $compressed, fdt_bits => $bits, value => $compressed ? uncompress($string) : $string, tv_string => $tv_string, ); $doc->add_field($field); } return $doc; } sub decode_fdt_bits { my ( undef, $field, $bits ) = @_; $field->set_analyzed( ( $bits & ANALYZED ) eq ANALYZED ); $field->set_binary( ( $bits & BINARY ) eq BINARY ); $field->set_compressed( ( $bits & COMPRESSED ) eq COMPRESSED ); } sub encode_fdt_bits { my ( undef, $field ) = @_; my $bits = "\0"; for ($bits) { $_ |= ANALYZED if $field->get_analyzed; $_ |= BINARY if $field->get_binary; $_ |= COMPRESSED if $field->get_compressed; } return $bits; } sub close { my $self = shift; $self->{findex_stream}->close; $self->{fdata_stream}->close; } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Index::FieldsReader - retrieve stored documents ==head1 DESCRIPTION FieldsReader's purpose is to retrieve stored documents from the invindex. In addition to returning fully decoded Doc objects, it can pass on raw data -- for instance, compressed fields stay compressed -- for the purpose of merging segments efficiently. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/FieldsWriter.pm000444000765000765 624411462203446 22561 0ustar00marvinmarvin000000000000package KinoSearch1::Index::FieldsWriter; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members invindex => undef, seg_name => undef, # members fdata_stream => undef, findex_stream => undef, ); } use Compress::Zlib qw( compress ); sub init_instance { my $self = shift; my $invindex = $self->{invindex}; # open an index stream and a data stream. my $fdx_file = "$self->{seg_name}.fdx"; my $fdt_file = "$self->{seg_name}.fdt"; for ( $fdx_file, $fdt_file, ) { $invindex->delete_file($_) if $invindex->file_exists($_); } $self->{findex_stream} = $invindex->open_outstream($fdx_file); $self->{fdata_stream} = $invindex->open_outstream($fdt_file); } sub add_doc { my ( $self, $doc ) = @_; # record the data stream's current file pointer in the index. $self->{findex_stream}->lu_write( 'Q', $self->{fdata_stream}->tell ); # only store fields marked as "stored" my @stored = sort { $a->get_field_num <=> $b->get_field_num } grep $_->get_stored, $doc->get_fields; # add the number of stored fields in the Doc my @to_write = ( scalar @stored ); # add flag bits and value for each stored field for (@stored) { push @to_write, ( $_->get_field_num, $_->get_fdt_bits ); push @to_write, $_->get_compressed ? compress( $_->get_value ) : $_->get_value; push @to_write, $_->get_tv_string; } # write out data my $lu_template = 'V' . ( 'VaTT' x scalar @stored ); $self->{fdata_stream}->lu_write( $lu_template, @to_write ); } sub add_segment { my ( $self, $seg_reader, $doc_map, $field_num_map ) = @_; my ( $findex_stream, $fdata_stream ) = @{$self}{qw( findex_stream fdata_stream )}; my $fields_reader = $seg_reader->get_fields_reader; my $max = $seg_reader->max_doc; return unless $max; $max -= 1; for my $orig ( 0 .. $max ) { # if the doc isn't deleted, copy it to the new seg next unless defined $doc_map->get($orig); # write pointer $findex_stream->lu_write( 'Q', $fdata_stream->tell ); # retrieve all fields my ( $num_fields, $all_data ) = $fields_reader->fetch_raw($orig); # write number of fields $fdata_stream->lu_write( 'V', $num_fields ); # write data for each field for ( 1 .. $num_fields ) { my ( $field_num, @some_data ) = splice( @$all_data, 0, 4 ); $fdata_stream->lu_write( 'VaTT', $field_num_map->get($field_num), @some_data ); } } } sub finish { my $self = shift; $self->{fdata_stream}->close; $self->{findex_stream}->close; } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Index::FieldsWriter - write stored fields to an invindex ==head1 DESCRIPTION FieldsWriter writes fields which are marked as stored to the field data and field index files. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/IndexFileNames.pm000444000765000765 342211462203446 23004 0ustar00marvinmarvin000000000000package KinoSearch1::Index::IndexFileNames; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( Exporter ); our @EXPORT_OK = qw( SEGMENTS DELETEABLE SORTFILE_EXTENSION @INDEX_EXTENSIONS @COMPOUND_EXTENSIONS @VECTOR_EXTENSIONS @SCRATCH_EXTENSIONS WRITE_LOCK_NAME WRITE_LOCK_TIMEOUT COMMIT_LOCK_NAME COMMIT_LOCK_TIMEOUT ); # name of the index segments file use constant SEGMENTS => 'segments'; # name of the index deletable file use constant DELETABLE => 'deletable'; # extension of the temporary file used by the SortExternal sort pool use constant SORTFILE_EXTENSION => '.srt'; # Most, but not all of Lucene file extenstions. Missing are the ".f$num" # extensions. Also note that 'segments' and 'deletable' don't have # extensions. our @INDEX_EXTENSIONS = qw( cfs fnm fdx fdt tii tis frq prx del tvx tvd tvf tvp ); # extensions for files which are subsumed into the cfs compound file our @COMPOUND_EXTENSIONS = qw( fnm frq prx fdx fdt tii tis ); # file extensions for term vectors our @VECTOR_EXTENSIONS = qw( tvd tvx tvf ); our @SCRATCH_EXTENSIONS = qw( srt ); # names and constants for lockfiles use constant WRITE_LOCK_NAME => 'write.lock'; use constant COMMIT_LOCK_NAME => 'commit.lock'; use constant WRITE_LOCK_TIMEOUT => 1000; use constant COMMIT_LOCK_TIMEOUT => 10_000; 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Index::IndexFileNames - filenames and suffixes used in an invindex ==head1 DESCRIPTION This module abstracts the names of the files that make up an invindex, similarly to the way InStream and OutStream abstract filehandle operations. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/IndexReader.pm000444000765000765 1265211462203446 22370 0ustar00marvinmarvin000000000000package KinoSearch1::Index::IndexReader; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members invindex => undef, seg_infos => undef, close_invindex => 1, invindex_owner => 1, ); __PACKAGE__->ready_get(qw( invindex )); } use KinoSearch1::Store::FSInvIndex; use KinoSearch1::Index::SegReader; use KinoSearch1::Index::MultiReader; use KinoSearch1::Index::SegInfos; use KinoSearch1::Index::IndexFileNames qw( WRITE_LOCK_NAME WRITE_LOCK_TIMEOUT COMMIT_LOCK_NAME COMMIT_LOCK_TIMEOUT ); sub new { my $temp = shift->SUPER::new(@_); return $temp->_open_multi_or_segreader; } # Returns a subclass of IndexReader: either a MultiReader or a SegReader, # depending on whether an invindex contains more than one segment. sub _open_multi_or_segreader { my $self = shift; # confirm an InvIndex object or make one using a supplied filepath. if ( !a_isa_b( $self->{invindex}, 'KinoSearch1::Store::InvIndex' ) ) { $self->{invindex} = KinoSearch1::Store::FSInvIndex->new( path => $self->{invindex} ); } my $invindex = $self->{invindex}; # read the segments file and decide what to do my $reader; $invindex->run_while_locked( lock_name => COMMIT_LOCK_NAME, timeout => COMMIT_LOCK_TIMEOUT, do_body => sub { my $seg_infos = KinoSearch1::Index::SegInfos->new; $seg_infos->read_infos($invindex); # create a SegReader for each segment in the invindex my @seg_readers; for my $sinfo ( $seg_infos->infos ) { push @seg_readers, KinoSearch1::Index::SegReader->new( seg_name => $sinfo->get_seg_name, invindex => $invindex, ); } # if there's one SegReader use it; otherwise make a MultiReader $reader = @seg_readers == 1 ? $seg_readers[0] : KinoSearch1::Index::MultiReader->new( invindex => $invindex, sub_readers => \@seg_readers, ); }, ); return $reader; } =begin comment my $num = $reader->max_doc; Return the highest document number available to the reader. =end comment =cut sub max_doc { shift->abstract_death } =begin comment my $num = $reader->num_docs; Return the number of (non-deleted) documents available to the reader. =end comment =cut sub num_docs { shift->abstract_death } =begin comment my $term_docs = $reader->term_docs($term); Given a Term, return a TermDocs subclass. =end comment =cut sub term_docs { shift->abstract_death } =begin comment my $norms_reader = $reader->norms_reader($field_name); Given a field name, return a NormsReader object. =end comment =cut sub norms_reader { shift->abstract_death } =begin comment $reader->delete_docs_by_term( $term ); Delete all the documents available to the reader that index the given Term. =end comment =cut sub delete_docs_by_term { shift->abstract_death } =begin comment $boolean = $reader->has_deletions Return true if any documents have been marked as deleted. =end comment =cut sub has_deletions { shift->abstract_death } =begin comment my $enum = $reader->terms($term); Given a Term, return a TermEnum subclass. The Enum will be be pre-located via $enum->seek($term) to the right spot. =end comment =cut sub terms { shift->abstract_death } =begin comment my $field_names = $reader->get_field_names( indexed => $indexed_fields_only, ); Return a hashref which is a list of field names. If the parameter 'indexed' is true, return only the names of fields which are indexed. =end comment =cut sub get_field_names { shift->abstract_death } =begin comment my $infos = $reader->generate_field_infos; Return a new FieldInfos object, describing all the fields held by the reader. The FieldInfos object will be consolidated, and thus may not be representative of every field in every segment if there are conflicting definitions. =end comment =cut sub generate_field_infos { shift->abstract_death } =begin comment my @sparse_segreaders = $reader->segreaders_to_merge; my @all_segreaders = $reader->segreaders_to_merge('all'); Find segments which are good candidates for merging, as they don't contain many valid documents. Returns an array of SegReaders. If passed an argument, return all SegReaders. =end comment =cut sub segreaders_to_merge { shift->abstract_death } =begin comment $reader->close; Release all resources. =end comment =cut sub close { shift->abstract_death } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Index::IndexReader - base class for objects which read invindexes ==head1 DESCRIPTION There are two subclasses of the abstract base class IndexReader: SegReader, which reads a single segment, and MultiReader, which condenses the output of several SegReaders. Since each segment is a self-contained inverted index, a SegReader is in effect a complete index reader. The constructor for IndexReader returns either a SegReader if the index has only one segment, or a MultiReader if there are multiple segments. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/MultiReader.pm000444000765000765 1262511462203446 22413 0ustar00marvinmarvin000000000000package KinoSearch1::Index::MultiReader; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Index::IndexReader ); BEGIN { __PACKAGE__->init_instance_vars( invindex => undef, sub_readers => undef, starts => undef, max_doc => 0, norms_cache => undef, ); } use KinoSearch1::Index::FieldInfos; use KinoSearch1::Index::SegReader; use KinoSearch1::Index::MultiTermDocs; # use KinoSearch1::Util::Class's new() # Note: can't inherit IndexReader's new() without recursion problems *new = *KinoSearch1::Util::Class::new; sub init_instance { my $self = shift; $self->{sub_readers} ||= []; $self->{starts} ||= []; $self->{norms_cache} ||= {}; $self->_init_sub_readers; } sub _init_sub_readers { my $self = shift; my @starts; my $max_doc = 0; for my $sub_reader ( @{ $self->{sub_readers} } ) { push @starts, $max_doc; $max_doc += $sub_reader->max_doc; } $self->{starts} = \@starts; $self->{max_doc} = $max_doc; } sub max_doc { shift->{max_doc} } sub num_docs { my $self = shift; my $num_docs = 0; $num_docs += $_->num_docs for @{ $self->{sub_readers} }; return $num_docs; } sub term_docs { my ( $self, $term ) = @_; my $term_docs = KinoSearch1::Index::MultiTermDocs->new( sub_readers => $self->{sub_readers}, starts => $self->{starts}, ); $term_docs->seek($term); return $term_docs; } sub doc_freq { my ( $self, $term ) = @_; my $doc_freq = 0; $doc_freq += $_->doc_freq($term) for @{ $self->{sub_readers} }; return $doc_freq; } sub fetch_doc { my ( $self, $doc_num ) = @_; my $reader_index = $self->_reader_index($doc_num); $doc_num -= $self->{starts}[$reader_index]; return $self->{sub_readers}[$reader_index]->fetch_doc($doc_num); } sub delete_docs_by_term { my ( $self, $term ) = @_; $_->delete_docs_by_term($term) for @{ $self->{sub_readers} }; } sub commit_deletions { my $self = shift; $_->commit_deletions for @{ $self->{sub_readers} }; } # Determine which sub-reader a document resides in sub _reader_index { my ( $self, $doc_num ) = @_; my $starts = $self->{starts}; my ( $lo, $mid, $hi ) = ( 0, undef, $#$starts ); while ( $hi >= $lo ) { $mid = ( $lo + $hi ) >> 1; my $mid_start = $starts->[$mid]; if ( $doc_num < $mid_start ) { $hi = $mid - 1; } elsif ( $doc_num > $mid_start ) { $lo = $mid + 1; } else { while ( $mid < $#$starts and $starts->[ $mid + 1 ] == $mid_start ) { $mid++; } return $mid; } } return $hi; } sub norms_reader { # TODO refactor and minimize copying my ( $self, $field_num ) = @_; if ( exists $self->{norms_cache}{$field_num} ) { return $self->{norms_cache}{$field_num}; } else { my $bytes = ''; for my $seg_reader ( @{ $self->{sub_readers} } ) { my $seg_norms_reader = $seg_reader->norms_reader($field_num); $bytes .= ${ $seg_norms_reader->get_bytes } if $seg_norms_reader; } my $norms_reader = $self->{norms_cache}{$field_num} = KinoSearch1::Index::NormsReader->new( bytes => $bytes, max_doc => $self->max_doc, ); return $norms_reader; } } sub generate_field_infos { my $self = shift; my $new_finfos = KinoSearch1::Index::FieldInfos->new; my @sub_finfos = map { $_->generate_field_infos } @{ $self->{sub_readers} }; $new_finfos->consolidate(@sub_finfos); return $new_finfos; } sub get_field_names { my $self = shift; my %field_names; for my $sub_reader ( @{ $self->{sub_readers} } ) { my $sub_field_names = $sub_reader->get_field_names; @field_names{@$sub_field_names} = (1) x scalar @$sub_field_names; } return [ keys %field_names ]; } sub segreaders_to_merge { my ( $self, $all ) = @_; return unless @{ $self->{sub_readers} }; return @{ $self->{sub_readers} } if $all; # sort by ascending size in docs my @sorted_sub_readers = sort { $a->num_docs <=> $b->num_docs } @{ $self->{sub_readers} }; # find sparsely populated segments my $total_docs = 0; my $threshold = -1; for my $i ( 0 .. $#sorted_sub_readers ) { $total_docs += $sorted_sub_readers[$i]->num_docs; if ( $total_docs < fibonacci( $i + 5 ) ) { $threshold = $i; } } # if any of the segments are sparse, return their readers if ( $threshold > -1 ) { return @sorted_sub_readers[ 0 .. $threshold ]; } else { return; } } # Generate fibonacci series my %fibo_cache; sub fibonacci { my $n = shift; return $fibo_cache{$n} if exists $fibo_cache{$n}; my $result = $n < 2 ? $n : fibonacci( $n - 1 ) + fibonacci( $n - 2 ); $fibo_cache{$n} = $result; return $result; } sub close { my $self = shift; return unless $self->{close_invindex}; $_->close for @{ $self->{sub_readers} }; } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Index::MultiReader - read from a multi-segment invindex ==head1 DESCRIPTION Multi-segment implementation of IndexReader. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/MultiTermDocs.pm000444000765000765 2307311462203446 22730 0ustar00marvinmarvin000000000000package KinoSearch1::Index::MultiTermDocs; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Index::TermDocs ); BEGIN { __PACKAGE__->init_instance_vars( sub_readers => undef, starts => undef, ); } our %instance_vars; sub new { my $self = shift->SUPER::new; confess kerror() unless verify_args( \%instance_vars, @_ ); my %args = ( %instance_vars, @_ ); # get a SegTermDocs for each segment my $sub_readers = $args{sub_readers} || []; my $starts = $args{starts} || []; my @sub_term_docs = map { $_->term_docs } @$sub_readers; _init_child( $self, \@sub_term_docs, $starts ); return $self; } sub seek { my ( $self, $term ) = @_; $_->seek($term) for @{ $self->_get_sub_term_docs }; $self->_reset_pointer; } sub set_read_positions { my ( $self, $val ) = @_; $_->set_read_positions($val) for @{ $self->_get_sub_term_docs }; } sub close { my $self = shift; $_->close for @{ $self->_get_sub_term_docs }; } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Index::MultiTermDocs void _init_child(term_docs, sub_term_docs_avref, starts_av) TermDocs *term_docs; SV *sub_term_docs_avref; AV *starts_av; PPCODE: Kino1_MultiTermDocs_init_child(term_docs, sub_term_docs_avref, starts_av); =for comment Helper for seek(). =cut void _reset_pointer(term_docs) TermDocs *term_docs; PREINIT: MultiTermDocsChild *child; PPCODE: child = (MultiTermDocsChild*)term_docs->child; child->base = 0; child->pointer = 0; child->current = NULL; SV* _set_or_get(term_docs, ...) TermDocs *term_docs; ALIAS: _set_sub_term_docs = 1 _get_sub_term_docs = 2 CODE: { MultiTermDocsChild *child = (MultiTermDocsChild*)term_docs->child; KINO_START_SET_OR_GET_SWITCH case 1: Kino1_confess("Can't set sub_term_docs"); /* fall through */ case 2: RETVAL = newSVsv( child->sub_term_docs_avref ); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL __H__ #ifndef H_KINO_MULTI_TERM_DOCS #define H_KINO_MULTI_TERM_DOCS 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1IndexTermDocs.h" #include "KinoSearch1UtilCClass.h" #include "KinoSearch1UtilMemManager.h" typedef struct multitermdocschild { I32 num_subs; I32 base; I32 pointer; SV *sub_term_docs_avref; U32 *starts; SV *term_sv; TermDocs **sub_term_docs; TermDocs *current; } MultiTermDocsChild; void Kino1_MultiTermDocs_init_child(TermDocs*, SV*, AV*); void Kino1_MultiTermDocs_set_doc_freq_death(TermDocs*, U32); U32 Kino1_MultiTermDocs_get_doc_freq(TermDocs*); U32 Kino1_MultiTermDocs_get_doc(TermDocs*); U32 Kino1_MultiTermDocs_get_freq(TermDocs*); SV* Kino1_MultiTermDocs_get_positions(TermDocs*); U32 Kino1_MultiTermDocs_bulk_read(TermDocs*, SV*, SV*, U32); bool Kino1_MultiTermDocs_next(TermDocs*); bool Kino1_MultiTermDocs_skip_to(TermDocs*, U32); void Kino1_MultiTermDocs_destroy(TermDocs*); #endif /* include guard */ __C__ #include "KinoSearch1IndexMultiTermDocs.h" void Kino1_MultiTermDocs_init_child(TermDocs* term_docs, SV *sub_term_docs_avref, AV *starts_av) { MultiTermDocsChild *child; I32 i; SV **sv_ptr; AV *sub_term_docs_av; /* allocate */ Kino1_New(0, child, 1, MultiTermDocsChild); term_docs->child = child; /* assign */ child->current = NULL; child->base = 0; child->pointer = 0; /* extract AV* and take stock of how many sub-TermDocs we've got */ child->sub_term_docs_avref = newSVsv(sub_term_docs_avref);; sub_term_docs_av = (AV*)SvRV(sub_term_docs_avref); child->num_subs = av_len(sub_term_docs_av) + 1; /* extract starts from starts array, subTermDocs from the subs array */ Kino1_New(0, child->starts, child->num_subs, U32); Kino1_New(0, child->sub_term_docs, child->num_subs, TermDocs*); for (i = 0; i < child->num_subs; i++) { sv_ptr = av_fetch(starts_av, i, 0); if (sv_ptr == NULL) Kino1_confess("starts array doesn't have enough valid members"); child->starts[i] = (U32)SvUV(*sv_ptr); sv_ptr = av_fetch(sub_term_docs_av, i, 0); if (sv_ptr == NULL) Kino1_confess("TermDocs array doesn't have enough valid members"); Kino1_extract_struct(*sv_ptr, child->sub_term_docs[i], TermDocs*, "KinoSearch1::Index::TermDocs"); } /* assign method pointers */ term_docs->set_doc_freq = Kino1_MultiTermDocs_set_doc_freq_death; term_docs->get_doc_freq = Kino1_MultiTermDocs_get_doc_freq; term_docs->get_doc = Kino1_MultiTermDocs_get_doc; term_docs->get_freq = Kino1_MultiTermDocs_get_freq; term_docs->get_positions = Kino1_MultiTermDocs_get_positions; term_docs->bulk_read = Kino1_MultiTermDocs_bulk_read; term_docs->next = Kino1_MultiTermDocs_next; term_docs->skip_to = Kino1_MultiTermDocs_skip_to; term_docs->destroy = Kino1_MultiTermDocs_destroy; } void Kino1_MultiTermDocs_set_doc_freq_death(TermDocs *term_docs, U32 doc_freq) { Kino1_confess("can't set doc_freq on a MultiTermDocs"); } U32 Kino1_MultiTermDocs_get_doc_freq(TermDocs *term_docs) { MultiTermDocsChild *child; TermDocs *sub_td; I32 i; U32 doc_freq = 0; /* sum the doc_freqs of all segments */ child = (MultiTermDocsChild*)term_docs->child; for (i = 0; i < child->num_subs; i++) { sub_td = child->sub_term_docs[i]; doc_freq += sub_td->get_doc_freq(sub_td); } return doc_freq; } U32 Kino1_MultiTermDocs_get_doc(TermDocs *term_docs) { MultiTermDocsChild *child; child = (MultiTermDocsChild*)term_docs->child; if (child->current == NULL) return KINO_TERM_DOCS_SENTINEL; return child->current->get_doc(child->current) + child->base; } U32 Kino1_MultiTermDocs_get_freq(TermDocs *term_docs) { MultiTermDocsChild *child; child = (MultiTermDocsChild*)term_docs->child; if (child->current == NULL) return KINO_TERM_DOCS_SENTINEL; return child->current->get_freq(child->current); } SV* Kino1_MultiTermDocs_get_positions(TermDocs *term_docs) { MultiTermDocsChild *child; child = (MultiTermDocsChild*)term_docs->child; if (child->current == NULL) return &PL_sv_undef; return child->current->get_positions(child->current); } U32 Kino1_MultiTermDocs_bulk_read(TermDocs *term_docs, SV *doc_nums_sv, SV *freqs_sv, U32 num_wanted) { MultiTermDocsChild *child; U32 i, num_got, base; U32 *doc_nums; child = (MultiTermDocsChild*)term_docs->child; while (1) { /* move to the next SegTermDocs */ while (child->current == NULL) { if (child->pointer < child->num_subs) { child->base = child->starts[ child->pointer ]; child->current = child->sub_term_docs[ child->pointer ]; child->pointer++; } else { return 0; } } num_got = child->current->bulk_read( child->current, doc_nums_sv, freqs_sv, num_wanted ); if (num_got == 0) { /* no more docs left in this segment */ child->current = NULL; } else { /* add the start offset for this seg to each doc */ base = child->base; doc_nums = (U32*)SvPVX(doc_nums_sv); for (i = 0; i < num_got; i++) { *doc_nums++ += base; } return num_got; } } } bool Kino1_MultiTermDocs_next(TermDocs* term_docs) { MultiTermDocsChild *child; child = (MultiTermDocsChild*)term_docs->child; if ( child->current != NULL && child->current->next(child->current) ) { return 1; } else if (child->pointer < child->num_subs) { /* try next segment */ child->base = child->starts[ child->pointer ]; child->current = child->sub_term_docs[ child->pointer ]; child->pointer++; return term_docs->next(term_docs); /* recurse */ } else { /* done with all segments */ return 0; } } bool Kino1_MultiTermDocs_skip_to(TermDocs *term_docs, U32 target) { MultiTermDocsChild *child = (MultiTermDocsChild*)term_docs->child; if ( child->current != NULL && child->current->skip_to(child->current, (target - child->base)) ) { return TRUE; } else if (child->pointer < child->num_subs) { /* try next segment */ child->base = child->starts[ child->pointer ]; child->current = child->sub_term_docs[ child->pointer ]; child->pointer++; return term_docs->skip_to(term_docs, target); /* recurse */ } else { return FALSE; } } void Kino1_MultiTermDocs_destroy(TermDocs* term_docs) { MultiTermDocsChild *child; child = (MultiTermDocsChild*)term_docs->child; SvREFCNT_dec(child->sub_term_docs_avref); Kino1_Safefree(child->sub_term_docs); Kino1_Safefree(child->starts); Kino1_Safefree(child); Kino1_TermDocs_destroy(term_docs); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Index::MultiTermDocs - multi-segment TermDocs ==head1 DESCRIPTION Multi-segment implementation of KinoSearch1::Index::TermDocs. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/NormsReader.pm000444000765000765 243711462203446 22377 0ustar00marvinmarvin000000000000package KinoSearch1::Index::NormsReader; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members instream => undef, max_doc => undef, bytes => undef, ); } sub init_instance { my $self = shift; confess("Internal error: max_doc is required") unless defined $self->{max_doc}; } # return a reference to a byte-array of norms sub get_bytes { my $self = shift; $self->_ensure_read; return \$self->{bytes}; } # Lazily read in the raw array of norms. sub _ensure_read { my $self = shift; if ( !defined $self->{bytes} ) { $self->{bytes} = $self->{instream}->lu_read( 'a' . $self->{max_doc} ); } } sub close { shift->{instream}->close } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Index::NormsReader - read field normalization data ==head1 DESCRIPTION NormsReader accesses the encoded norms which are built up, one byte per document, for indexed fields. ==head1 SEE ALSO L ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/PostingsWriter.pm000444000765000765 3515711462203446 23206 0ustar00marvinmarvin000000000000package KinoSearch1::Index::PostingsWriter; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( #constructor params / members invindex => undef, seg_name => undef, # members sort_pool => undef, ); } use KinoSearch1::Index::TermInfo; use KinoSearch1::Index::TermInfosWriter; use KinoSearch1::Util::SortExternal; sub init_instance { my $self = shift; # create a SortExternal object which autosorts the posting list cache $self->{sort_pool} = KinoSearch1::Util::SortExternal->new( invindex => $self->{invindex}, seg_name => $self->{seg_name}, ); } # Add all the postings in an inverted document to the sort pool. sub add_postings { my ( $self, $postings_array ) = @_; $self->{sort_pool}->feed(@$postings_array); } # Bulk add all the postings in a segment to the sort pool. sub add_segment { my ( $self, $seg_reader, $doc_map ) = @_; my $term_enum = $seg_reader->terms; my $term_docs = $seg_reader->term_docs; $term_docs->set_read_positions(1); _add_segment( $self->{sort_pool}, $term_enum, $term_docs, $doc_map ); } =for comment Process all the postings in the sort pool. Generate the freqs and positions files. Hand off data to TermInfosWriter for the generating the term dictionaries. =cut sub write_postings { my $self = shift; my ( $invindex, $seg_name ) = @{$self}{ 'invindex', 'seg_name' }; $self->{sort_pool}->sort_all; my $tinfos_writer = KinoSearch1::Index::TermInfosWriter->new( invindex => $invindex, seg_name => $seg_name, ); my $frq_file = "$seg_name.frq"; my $prx_file = "$seg_name.prx"; for ( $frq_file, $prx_file ) { $invindex->delete_file($_) if $invindex->file_exists($_); } my $frq_out = $invindex->open_outstream($frq_file); my $prx_out = $invindex->open_outstream($prx_file); _write_postings( $self->{sort_pool}, $tinfos_writer, $frq_out, $prx_out ); $frq_out->close; $prx_out->close; $tinfos_writer->finish; } sub finish { my $self = shift; $self->{sort_pool}->close; } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Index::PostingsWriter void _write_postings (sort_pool, tinfos_writer, frq_out, prx_out) SortExternal *sort_pool; TermInfosWriter *tinfos_writer; OutStream *frq_out; OutStream *prx_out; PPCODE: Kino1_PostWriter_write_postings(sort_pool, tinfos_writer, frq_out, prx_out); void _add_segment(sort_pool, term_enum, term_docs, doc_map_ref) SortExternal *sort_pool; SegTermEnum *term_enum; TermDocs *term_docs; SV *doc_map_ref; PPCODE: Kino1_PostWriter_add_segment(sort_pool, term_enum, term_docs, doc_map_ref); __H__ #ifndef H_KINOSEARCH_INDEX_POSTINGS_WRITER #define H_KINOSEARCH_INDEX_POSTINGS_WRITER 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1IndexSegTermEnum.h" #include "KinoSearch1IndexTerm.h" #include "KinoSearch1IndexTermDocs.h" #include "KinoSearch1IndexTermInfosWriter.h" #include "KinoSearch1StoreOutStream.h" #include "KinoSearch1UtilByteBuf.h" #include "KinoSearch1UtilSortExternal.h" void Kino1_PostWriter_write_postings(SortExternal*, TermInfosWriter*, OutStream*, OutStream*); void Kino1_PostWriter_add_segment(SortExternal*, SegTermEnum*, TermDocs*, SV*); #endif /* include guard */ __C__ #include "KinoSearch1IndexPostingsWriter.h" static void Kino1_PostWriter_deserialize(ByteBuf*, ByteBuf*, ByteBuf*, U32*, U32*); static void Kino1_PostWriter_write_positions(OutStream*, ByteBuf*); void Kino1_PostWriter_write_postings(SortExternal *sort_pool, TermInfosWriter *tinfos_writer, OutStream *frq_out, OutStream *prx_out) { ByteBuf *posting = NULL; ByteBuf *positions, *termstring, *last_termstring; TermInfo *tinfo; U32 doc_num = 0; U32 freq = 0; U32 last_doc_num = 0; U32 last_skip_doc = 0; double frq_ptr, prx_ptr; double last_skip_frq_ptr = 0.0; double last_skip_prx_ptr = 0.0; I32 iter = 0; I32 i; AV *skip_data_av; SV *skip_sv; posting = Kino1_BB_new_string("", 0); last_termstring = Kino1_BB_new_string("\0\0", 2); termstring = Kino1_BB_new_view(NULL, 0); positions = Kino1_BB_new_view(NULL, 0); tinfo = Kino1_TInfo_new(); skip_data_av = newAV(); skip_sv = &PL_sv_undef; /* each loop is one field, one term, one doc_num, many positions */ while (1) { /* retrieve the next posting from the sort pool */ Kino1_BB_destroy(posting); posting = sort_pool->fetch(sort_pool); /* SortExternal returns NULL when exhausted */ if (posting == NULL) { goto FINAL_ITER; } /* each iter, add a doc to the doc_freq for a given term */ iter++; tinfo->doc_freq++; /* lags by 1 iter */ /* break up the serialized posting into its parts */ Kino1_PostWriter_deserialize(posting, termstring, positions, &doc_num, &freq); /* on the first iter, prime the "heldover" variables */ if (iter == 1) { Kino1_BB_assign_string(last_termstring, termstring->ptr, termstring->size); tinfo->doc_freq = 0; tinfo->frq_fileptr = frq_out->tell(frq_out); tinfo->prx_fileptr = prx_out->tell(prx_out); tinfo->skip_offset = frq_out->tell(frq_out); tinfo->index_fileptr = 0; } else if ( iter == -1 ) { /* never true; can only get here via goto */ /* prepare to clear out buffers and exit loop */ FINAL_ITER: { iter = -1; Kino1_BB_destroy(termstring); termstring = Kino1_BB_new_string("\0\0", 2); tinfo->doc_freq++; } } /* create skipdata (unused by KinoSearch1 at present) */ if ( (tinfo->doc_freq + 1) % tinfos_writer->skip_interval == 0 ) { frq_ptr = frq_out->tell(frq_out); prx_ptr = prx_out->tell(prx_out); av_push(skip_data_av, newSViv(last_doc_num - last_skip_doc )); av_push(skip_data_av, newSViv(frq_ptr - last_skip_frq_ptr)); av_push(skip_data_av, newSViv(prx_ptr - last_skip_prx_ptr)); last_skip_doc = last_doc_num; last_skip_frq_ptr = frq_ptr; last_skip_prx_ptr = prx_ptr; } /* if either the term or fieldnum changes, process the last term */ if ( Kino1_BB_compare(termstring, last_termstring) ) { /* take note of where we are for the term dictionary */ frq_ptr = frq_out->tell(frq_out); prx_ptr = prx_out->tell(prx_out); /* write skipdata if there is any */ if (av_len(skip_data_av) != -1) { /* kludge to compensate for doc_freq's 1-iter lag */ if ( (tinfo->doc_freq + 1) % tinfos_writer->skip_interval == 0 ) { /* remove 1 cycle of skip data */ for (i = 3; i > 0; i--) { skip_sv = av_pop(skip_data_av); SvREFCNT_dec(skip_sv); } } if (av_len(skip_data_av) != -1) { /* tell tinfos_writer about the non-zero skip amount */ tinfo->skip_offset = frq_ptr - tinfo->frq_fileptr; /* write out the skip data */ i = av_len(skip_data_av); while (i-- > -1) { skip_sv = av_shift(skip_data_av); frq_out->write_vint(frq_out, SvIV(skip_sv) ); SvREFCNT_dec(skip_sv); } /* update the filepointer for the file we just wrote to */ frq_ptr = frq_out->tell(frq_out); } } /* init skip data in preparation for the next term */ last_skip_doc = 0; last_skip_frq_ptr = frq_ptr; last_skip_prx_ptr = prx_ptr; /* hand off to TermInfosWriter */ Kino1_TInfosWriter_add(tinfos_writer, last_termstring, tinfo); /* start each term afresh */ tinfo->doc_freq = 0; tinfo->frq_fileptr = frq_ptr; tinfo->prx_fileptr = prx_ptr; tinfo->skip_offset = 0; tinfo->index_fileptr = 0; /* remember the termstring so we can write string diffs */ Kino1_BB_assign_string(last_termstring, termstring->ptr, termstring->size); last_doc_num = 0; } /* break out of loop on last iter before writing invalid data */ if (iter == -1) { Kino1_TInfo_destroy(tinfo); Kino1_BB_destroy(termstring); Kino1_BB_destroy(last_termstring); Kino1_BB_destroy(positions); Kino1_BB_destroy(posting); SvREFCNT_dec( (SV*)skip_data_av ); return; } /* write positions data */ Kino1_PostWriter_write_positions(prx_out, positions); /* write freq data */ /* doc_code is delta doc_num, shifted left by 1. */ if (freq == 1) { U32 doc_code = (doc_num - last_doc_num) << 1; /* set low bit of doc_code to 1 to indicate freq of 1 */ doc_code += 1; frq_out->write_vint(frq_out, doc_code); } else { U32 doc_code = (doc_num - last_doc_num) << 1; /* leave low bit of doc_code at 0, record explicit freq */ frq_out->write_vint(frq_out, doc_code); frq_out->write_vint(frq_out, freq); } /* remember last doc num because we need it for delta encoding */ last_doc_num = doc_num; } } /* Pull apart a serialized posting into its component parts */ #define DOC_NUM_LEN 4 #define TEXT_LEN_LEN 2 #define NULL_BYTE_LEN 1 void Kino1_PostWriter_add_segment(SortExternal *sort_pool, SegTermEnum* term_enum, TermDocs *term_docs, SV *doc_map_ref) { I32 *doc_map; I32 doc_num, max_doc; char doc_num_buf[4]; char text_len_buf[4]; SV *positions_sv, *doc_map_sv; ByteBuf *posting; TermBuffer *term_buf; char *positions_ptr; STRLEN len, common_len, positions_len; /* extract the doc number remapping array */ doc_map_sv = SvRV(doc_map_ref); doc_map = (I32*)SvPV(doc_map_sv, len); max_doc = len / sizeof(I32); term_buf = term_enum->term_buf; posting = Kino1_BB_new_string("", 0); while (Kino1_SegTermEnum_next(term_enum)) { /* start with the termstring and the null byte */ Kino1_encode_bigend_U16(term_buf->text_len, text_len_buf); common_len = term_buf->text_len + KINO_FIELD_NUM_LEN; Kino1_BB_assign_string(posting, term_buf->termstring->ptr, common_len); Kino1_BB_cat_string(posting, "\0", NULL_BYTE_LEN); common_len += NULL_BYTE_LEN; term_docs->seek_tinfo(term_docs, term_enum->tinfo); while (term_docs->next(term_docs)) { posting->size = common_len; /* can't ever be gt posting->cap */ /* concat the remapped doc number */ doc_num = term_docs->get_doc(term_docs); if (doc_num == -1) continue; if (doc_num > max_doc) Kino1_confess("doc_num > max_doc: %d %d", doc_num, max_doc); doc_num = doc_map[doc_num]; Kino1_encode_bigend_U32(doc_num, doc_num_buf); Kino1_BB_cat_string(posting, doc_num_buf, DOC_NUM_LEN); /* concat the positions */ positions_sv = term_docs->get_positions(term_docs); positions_ptr = SvPV(positions_sv, positions_len); Kino1_BB_cat_string(posting, positions_ptr, positions_len); /* concat the term_length */ Kino1_BB_cat_string(posting, text_len_buf, TEXT_LEN_LEN); /* add the posting to the sortpool */ sort_pool->feed(sort_pool, posting->ptr, posting->size); } } Kino1_BB_destroy(posting); } static void Kino1_PostWriter_deserialize(ByteBuf *posting, ByteBuf *termstring, ByteBuf *positions, U32 *doc_num_ptr, U32 *freq_ptr) { char *ptr; STRLEN len; /* extract termstring_len, decoding packed 'n', assign termstring */ ptr = posting->ptr + posting->size - TEXT_LEN_LEN; termstring->size = Kino1_decode_bigend_U16(ptr) + KINO_FIELD_NUM_LEN; Kino1_BB_assign_view(termstring, posting->ptr, termstring->size); /* extract and assign doc_num, decoding packed 'N' */ ptr = posting->ptr + termstring->size + NULL_BYTE_LEN; *doc_num_ptr = Kino1_decode_bigend_U32(ptr); /* make positions ByteBuf a view of the positional data in the posting */ ptr = posting->ptr + termstring->size + NULL_BYTE_LEN + DOC_NUM_LEN; len = posting->size - termstring->size - NULL_BYTE_LEN - DOC_NUM_LEN - TEXT_LEN_LEN; Kino1_BB_assign_view(positions, ptr, len); /* calculate freq by counting the number of positions, assign */ *freq_ptr = len / 4; } /* Write out the positions data using delta encoding. */ static void Kino1_PostWriter_write_positions(OutStream *prx_out, ByteBuf *positions) { U32 *current_pos_ptr, *end; U32 last_pos; U32 pos_delta; /* extract 32 bit unsigned integers from positions_sv. */ current_pos_ptr = (U32*)positions->ptr; end = current_pos_ptr + (positions->size / 4); last_pos = 0; while (current_pos_ptr < end) { /* get delta and write out as VInt */ pos_delta = *current_pos_ptr - last_pos; prx_out->write_vint(prx_out, pos_delta); /* advance pointers */ last_pos = *current_pos_ptr; current_pos_ptr++; } } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Index::PostingsWriter - write postings data to an invindex ==head1 DESCRIPTION PostingsWriter creates posting lists. It writes the frequency and and positional data files, plus feeds data to TermInfosWriter. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/SegInfos.pm000444000765000765 706011462203446 21670 0ustar00marvinmarvin000000000000package KinoSearch1::Index::SegInfos; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); use constant FORMAT => -1; BEGIN { __PACKAGE__->init_instance_vars( # members infos => undef, counter => 0, version => undef, ); __PACKAGE__->ready_get_set(qw( counter )); } use Time::HiRes qw( time ); sub init_instance { my $self = shift; $self->{infos} = {}; $self->{version} ||= int( time * 1000 ); } # Add a SegInfo to the collection. sub add_info { my ( $self, $info ) = @_; $self->{infos}{"$info->{seg_name}"} = $info; } # Remove the info corresponding to a segment; sub delete_segment { my ( $self, $seg_name ) = @_; confess("no segment named '$seg_name'") unless exists $self->{infos}{$seg_name}; delete $self->{infos}{$seg_name}; } # Return number of segments in invindex. sub size { scalar keys %{ $_[0]->{infos} } } # Retrieve all infos. sub infos { sort { $a->{seg_name} cmp $b->{seg_name} } values %{ $_[0]->{infos} }; } # Decode "segments" file. sub read_infos { my ( $self, $invindex ) = @_; my $instream = $invindex->open_instream('segments'); # support only recent index formats my $format = $instream->lu_read('i'); croak("unsupported format: '$format'") unless $format == FORMAT; # read header @{$self}{ 'version', 'counter' } = $instream->lu_read('Qi'); my $num_segs = $instream->lu_read('i'); # build one SegInfo object for each segment if ($num_segs) { my @file_contents = $instream->lu_read( 'Ti' x $num_segs ); while (@file_contents) { my ( $seg_name, $doc_count ) = splice( @file_contents, 0, 2 ); $self->{infos}{$seg_name} = KinoSearch1::Index::SegInfo->new( seg_name => $seg_name, doc_count => $doc_count, invindex => $invindex, ); } } } # Write "segments" file sub write_infos { my ( $self, $invindex ) = @_; my $num_segs = scalar keys %{ $self->{infos} }; my $tempname = 'segments.new'; $invindex->delete_file($tempname) if $invindex->file_exists($tempname); my $outstream = $invindex->open_outstream($tempname); # prepare header $self->{version}++; my @outstuff = ( FORMAT, $self->{version}, $self->{counter}, $num_segs ); # prepare data push @outstuff, map { ( $self->{infos}{$_}{seg_name}, $self->{infos}{$_}{doc_count} ) } sort keys %{ $self->{infos} }; # write it all out my $template = 'iQii' . ( 'Ti' x $num_segs ); $outstream->lu_write( $template, @outstuff ); $outstream->close; # clobber the old segments file $invindex->rename_file( $tempname, "segments" ); } package KinoSearch1::Index::SegInfo; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members seg_name => '', doc_count => 0, invindex => undef, ); __PACKAGE__->ready_get(qw( seg_name doc_count invindex )); } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Index::SegInfos - manage segment statistical data ==head1 DESCRIPTION SegInfos ties together the segments which make up an invindex. It stores a little information about each, plus some unifying information such as the counter used to name new segments. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/SegReader.pm000444000765000765 1370411462203446 22036 0ustar00marvinmarvin000000000000package KinoSearch1::Index::SegReader; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Index::IndexReader ); BEGIN { __PACKAGE__->init_instance_vars( # params/members invindex => undef, seg_name => undef, # members comp_file_reader => undef, tinfos_reader => undef, finfos => undef, fields_reader => undef, freq_stream => undef, prox_stream => undef, deldocs => undef, norms_readers => undef, ); __PACKAGE__->ready_get( qw( finfos fields_reader freq_stream prox_stream deldocs seg_name ) ); } use KinoSearch1::Index::CompoundFileReader; use KinoSearch1::Index::TermInfosReader; use KinoSearch1::Index::FieldsReader; use KinoSearch1::Index::FieldInfos; use KinoSearch1::Index::NormsReader; use KinoSearch1::Index::SegTermDocs; use KinoSearch1::Index::DelDocs; # use KinoSearch1::Util::Class's new() # Note: can't inherit IndexReader's new() without recursion problems *new = *KinoSearch1::Util::Class::new; sub init_instance { my $self = shift; my ( $seg_name, $invindex ) = @{$self}{ 'seg_name', 'invindex' }; $self->{norms_readers} = {}; # initialize DelDocs $self->{deldocs} = KinoSearch1::Index::DelDocs->new( invindex => $invindex, seg_name => $seg_name, ); $self->{deldocs}->read_deldocs( $invindex, "$seg_name.del" ) if ( $invindex->file_exists("$seg_name.del") ); # initialize a CompoundFileReader my $comp_file_reader = $self->{comp_file_reader} = KinoSearch1::Index::CompoundFileReader->new( invindex => $invindex, seg_name => $seg_name, ); # initialize FieldInfos my $finfos = $self->{finfos} = KinoSearch1::Index::FieldInfos->new; $finfos->read_infos( $comp_file_reader->open_instream("$seg_name.fnm") ); # initialize FieldsReader $self->{fields_reader} = KinoSearch1::Index::FieldsReader->new( finfos => $finfos, fdata_stream => $comp_file_reader->open_instream("$seg_name.fdt"), findex_stream => $comp_file_reader->open_instream("$seg_name.fdx"), ); # initialize TermInfosReader $self->{tinfos_reader} = KinoSearch1::Index::TermInfosReader->new( invindex => $comp_file_reader, seg_name => $seg_name, finfos => $finfos, ); # open the frequency data, the positional data, and the norms $self->{freq_stream} = $comp_file_reader->open_instream("$seg_name.frq"); $self->{prox_stream} = $comp_file_reader->open_instream("$seg_name.prx"); $self->_open_norms; } sub max_doc { shift->{fields_reader}->get_size } sub num_docs { my $self = shift; return $self->max_doc - $self->{deldocs}->get_num_deletions; } sub delete_docs_by_term { my ( $self, $term ) = @_; my $term_docs = $self->term_docs($term); $self->{deldocs}->delete_by_term_docs($term_docs); } sub commit_deletions { my $self = shift; return unless $self->{deldocs}->get_num_deletions; my $filename = $self->{seg_name} . ".del"; $self->{deldocs} ->write_deldocs( $self->{invindex}, $filename, $self->max_doc ); } sub has_deletions { shift->{deldocs}->get_num_deletions } sub _open_norms { my $self = shift; my ( $seg_name, $finfos, $comp_file_reader ) = @{$self}{ 'seg_name', 'finfos', 'comp_file_reader' }; my $max_doc = $self->max_doc; # create a NormsReader for each indexed field. for my $finfo ( $finfos->get_infos ) { next unless $finfo->get_indexed; my $filename = "$seg_name.f" . $finfo->get_field_num; my $instream = $comp_file_reader->open_instream($filename); $self->{norms_readers}{ $finfo->get_name } = KinoSearch1::Index::NormsReader->new( instream => $instream, max_doc => $max_doc, ); } } sub terms { my ( $self, $term ) = @_; return $self->{tinfos_reader}->terms($term); } sub fetch_term_info { my ( $self, $term ) = @_; return $self->{tinfos_reader}->fetch_term_info($term); } sub get_skip_interval { shift->{tinfos_reader}->get_skip_interval; } sub doc_freq { my ( $self, $term ) = @_; my $tinfo = $self->{tinfos_reader}->fetch_term_info($term); return defined $tinfo ? $tinfo->get_doc_freq : 0; } sub term_docs { my ( $self, $term ) = @_; my $term_docs = KinoSearch1::Index::SegTermDocs->new( reader => $self, ); $term_docs->seek($term); return $term_docs; } sub norms_reader { my ( $self, $field_name ) = @_; return unless exists $self->{norms_readers}{$field_name}; return $self->{norms_readers}{$field_name}; } sub get_field_names { my ( $self, %args ) = @_; my @fields = $self->{finfos}->get_infos; @fields = grep { $_->get_indexed } @fields if $args{indexed}; my @names = map { $_->get_name } @fields; return \@names; } sub generate_field_infos { my $self = shift; my $new_finfos = $self->{finfos}->clone; $new_finfos->set_from_file(0); return $new_finfos; } sub fetch_doc { $_[0]->{fields_reader}->fetch_doc( $_[1] ); } sub segreaders_to_merge { my ( $self, $all ) = @_; return $self if $all; return; } sub close { my $self = shift; return unless $self->{close_invindex}; $self->{deldocs}->close; $self->{finfos}->close; $self->{fields_reader}->close; $self->{tinfos_reader}->close; $self->{comp_file_reader}->close; $self->{freq_stream}->close; $self->{prox_stream}->close; $_->close for values %{ $self->{norms_readers} }; } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Index::SegReader - read from a single-segment invindex ==head1 DESCRIPTION Single-segment implementation of IndexReader. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/SegTermDocs.pm000444000765000765 3543711462203446 22363 0ustar00marvinmarvin000000000000package KinoSearch1::Index::SegTermDocs; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Index::TermDocs ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params reader => undef, ); } our %instance_vars; sub new { my $self = shift->SUPER::new; confess kerror() unless verify_args( \%instance_vars, @_ ); my %args = ( %instance_vars, @_ ); my $reader = $args{reader}; _init_child($self); # dupe some stuff from the parent reader. $self->_set_reader($reader); $self->_set_skip_interval( $reader->get_skip_interval ); $self->_set_freq_stream( $reader->get_freq_stream()->clone_stream ); $self->_set_skip_stream( $reader->get_freq_stream()->clone_stream ); $self->_set_prox_stream( $reader->get_prox_stream()->clone_stream ); $self->_set_deldocs( $reader->get_deldocs ); return $self; } sub seek { my ( $self, $term ) = @_; my $tinfo = defined $term ? $self->_get_reader()->fetch_term_info($term) : undef; $self->seek_tinfo($tinfo); } sub close { my $self = shift; $self->_get_freq_stream()->close; $self->_get_prox_stream()->close; $self->_get_skip_stream()->close; } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Index::SegTermDocs void _init_child(term_docs) TermDocs *term_docs; PPCODE: Kino1_SegTermDocs_init_child(term_docs); SV* _set_or_get(term_docs, ...) TermDocs *term_docs; ALIAS: _set_count = 1 _get_count = 2 _set_freq_stream = 3 _get_freq_stream = 4 _set_prox_stream = 5 _get_prox_stream = 6 _set_skip_stream = 7 _get_skip_stream = 8 _set_deldocs = 9 _get_deldocs = 10 _set_reader = 11 _get_reader = 12 set_read_positions = 13 get_read_positions = 14 _set_skip_interval = 15 _get_skip_interval = 16 CODE: { SegTermDocsChild *child = (SegTermDocsChild*)term_docs->child; KINO_START_SET_OR_GET_SWITCH case 1: child->count = SvUV(ST(1)); /* fall through */ case 2: RETVAL = newSVuv(child->count); break; case 3: SvREFCNT_dec(child->freq_stream_sv); child->freq_stream_sv = newSVsv( ST(1) ); Kino1_extract_struct( child->freq_stream_sv, child->freq_stream, InStream*, "KinoSearch1::Store::InStream"); /* fall through */ case 4: RETVAL = newSVsv(child->freq_stream_sv); break; case 5: SvREFCNT_dec(child->prox_stream_sv); child->prox_stream_sv = newSVsv( ST(1) ); Kino1_extract_struct( child->prox_stream_sv, child->prox_stream, InStream*, "KinoSearch1::Store::InStream"); /* fall through */ case 6: RETVAL = newSVsv(child->prox_stream_sv); break; case 7: SvREFCNT_dec(child->skip_stream_sv); child->skip_stream_sv = newSVsv( ST(1) ); Kino1_extract_struct( child->skip_stream_sv, child->skip_stream, InStream*, "KinoSearch1::Store::InStream"); /* fall through */ case 8: RETVAL = newSVsv(child->skip_stream_sv); break; case 9: SvREFCNT_dec(child->deldocs_sv); child->deldocs_sv = newSVsv( ST(1) ); Kino1_extract_struct( child->deldocs_sv, child->deldocs, BitVector*, "KinoSearch1::Index::DelDocs" ); /* fall through */ case 10: RETVAL = newSVsv(child->deldocs_sv); break; case 11: SvREFCNT_dec(child->reader_sv); if (!sv_derived_from( ST(1), "KinoSearch1::Index::IndexReader") ) Kino1_confess("not a KinoSearch1::Index::IndexReader"); child->reader_sv = newSVsv( ST(1) ); /* fall through */ case 12: RETVAL = newSVsv(child->reader_sv); break; case 13: child->read_positions = SvTRUE( ST(1) ) ? 1 : 0; /* fall through */ case 14: RETVAL = newSViv(child->read_positions); break; case 15: child->skip_interval = SvUV(ST(1)); /* fall through */ case 16: RETVAL = newSVuv(child->skip_interval); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL __H__ #ifndef H_KINO_SEG_TERM_DOCS #define H_KINO_SEG_TERM_DOCS 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1UtilBitVector.h" #include "KinoSearch1IndexTermDocs.h" #include "KinoSearch1IndexTermInfo.h" #include "KinoSearch1StoreInStream.h" #include "KinoSearch1UtilMemManager.h" typedef struct segtermdocschild { U32 count; U32 doc_freq; U32 doc; U32 freq; U32 skip_doc; U32 skip_count; U32 num_skips; SV *positions; U32 read_positions; U32 skip_interval; InStream *freq_stream; InStream *prox_stream; InStream *skip_stream; bool have_skipped; double frq_fileptr; double prx_fileptr; double skip_fileptr; BitVector *deldocs; SV *freq_stream_sv; SV *prox_stream_sv; SV *skip_stream_sv; SV *deldocs_sv; SV *reader_sv; } SegTermDocsChild; void Kino1_SegTermDocs_init_child(TermDocs*); void Kino1_SegTermDocs_set_doc_freq(TermDocs*, U32); U32 Kino1_SegTermDocs_get_doc_freq(TermDocs*); U32 Kino1_SegTermDocs_get_doc(TermDocs*); U32 Kino1_SegTermDocs_get_freq(TermDocs*); SV* Kino1_SegTermDocs_get_positions(TermDocs*); U32 Kino1_SegTermDocs_bulk_read(TermDocs*, SV*, SV*, U32); void Kino1_SegTermDocs_seek_tinfo(TermDocs*, TermInfo*); bool Kino1_SegTermDocs_next(TermDocs*); bool Kino1_SegTermDocs_skip_to(TermDocs*, U32 target); bool Kino1_SegTermDocs_skip_to_with_positions(TermDocs*); void Kino1_SegTermDocs_destroy(TermDocs*); #endif /* include guard */ __C__ #include "KinoSearch1IndexSegTermDocs.h" static void load_positions(TermDocs *term_docs); void Kino1_SegTermDocs_init_child(TermDocs *term_docs) { SegTermDocsChild *child; Kino1_New(1, child, 1, SegTermDocsChild); term_docs->child = child; child->doc_freq = KINO_TERM_DOCS_SENTINEL; child->doc = KINO_TERM_DOCS_SENTINEL; child->freq = KINO_TERM_DOCS_SENTINEL; /* child->positions starts life as an empty string */ child->positions = newSV(1); SvCUR_set(child->positions, 0); SvPOK_on(child->positions); term_docs->set_doc_freq = Kino1_SegTermDocs_set_doc_freq; term_docs->get_doc_freq = Kino1_SegTermDocs_get_doc_freq; term_docs->get_doc = Kino1_SegTermDocs_get_doc; term_docs->get_freq = Kino1_SegTermDocs_get_freq; term_docs->get_positions = Kino1_SegTermDocs_get_positions; term_docs->bulk_read = Kino1_SegTermDocs_bulk_read; term_docs->seek_tinfo = Kino1_SegTermDocs_seek_tinfo; term_docs->next = Kino1_SegTermDocs_next; term_docs->skip_to = Kino1_SegTermDocs_skip_to; term_docs->destroy = Kino1_SegTermDocs_destroy; child->freq_stream_sv = &PL_sv_undef; child->prox_stream_sv = &PL_sv_undef; child->skip_stream_sv = &PL_sv_undef; child->deldocs_sv = &PL_sv_undef; child->reader_sv = &PL_sv_undef; child->count = 0; child->read_positions = 0; /* off by default */ } void Kino1_SegTermDocs_set_doc_freq(TermDocs *term_docs, U32 doc_freq) { SegTermDocsChild *child; child = (SegTermDocsChild*)term_docs->child; child->doc_freq = doc_freq; } U32 Kino1_SegTermDocs_get_doc_freq(TermDocs *term_docs) { SegTermDocsChild *child; child = (SegTermDocsChild*)term_docs->child; return child->doc_freq; } U32 Kino1_SegTermDocs_get_doc(TermDocs *term_docs) { SegTermDocsChild *child; child = (SegTermDocsChild*)term_docs->child; return child->doc; } U32 Kino1_SegTermDocs_get_freq(TermDocs *term_docs) { SegTermDocsChild *child; child = (SegTermDocsChild*)term_docs->child; return child->freq; } SV* Kino1_SegTermDocs_get_positions(TermDocs *term_docs) { SegTermDocsChild *child; child = (SegTermDocsChild*)term_docs->child; return child->positions; } U32 Kino1_SegTermDocs_bulk_read(TermDocs *term_docs, SV* doc_nums_sv, SV* freqs_sv, U32 num_wanted) { SegTermDocsChild *child; InStream *freq_stream; U32 doc_code; U32 *doc_nums; U32 *freqs; STRLEN len; U32 num_got = 0; /* local copies */ child = (SegTermDocsChild*)term_docs->child; freq_stream = child->freq_stream; /* allocate space in supplied SVs and make them POK, if necessary */ len = num_wanted * sizeof(U32); SvUPGRADE(doc_nums_sv, SVt_PV); SvUPGRADE(freqs_sv, SVt_PV); SvPOK_on(doc_nums_sv); SvPOK_on(freqs_sv); doc_nums = (U32*)SvGROW(doc_nums_sv, len + 1); freqs = (U32*)SvGROW(freqs_sv, len + 1); while (child->count < child->doc_freq && num_got < num_wanted) { /* manually inlined call to term_docs->next */ child->count++; doc_code = freq_stream->read_vint(freq_stream);; child->doc += doc_code >> 1; if (doc_code & 1) child->freq = 1; else child->freq = freq_stream->read_vint(freq_stream); /* if the doc isn't deleted... */ if ( !Kino1_BitVec_get(child->deldocs, child->doc) ) { /* ... append to results */ *doc_nums++ = child->doc; *freqs++ = child->freq; num_got++; } } /* set the string end to the end of the U32 array */ SvCUR_set(doc_nums_sv, (num_got * sizeof(U32))); SvCUR_set(freqs_sv, (num_got * sizeof(U32))); return num_got; } bool Kino1_SegTermDocs_next(TermDocs *term_docs) { SegTermDocsChild *child = (SegTermDocsChild*)term_docs->child; InStream *freq_stream = child->freq_stream; U32 doc_code; while (1) { /* bail if we're out of docs */ if (child->count == child->doc_freq) { return 0; } /* decode delta doc */ doc_code = freq_stream->read_vint(freq_stream); child->doc += doc_code >> 1; /* if the stored num was odd, the freq is 1 */ if (doc_code & 1) { child->freq = 1; } /* otherwise, freq was stored as a VInt. */ else { child->freq = freq_stream->read_vint(freq_stream); } child->count++; /* read positions if desired */ if (child->read_positions) load_positions(term_docs); /* if the doc isn't deleted... success! */ if (!Kino1_BitVec_get(child->deldocs, child->doc)) break; } return 1; } static void load_positions(TermDocs *term_docs) { SegTermDocsChild *child = (SegTermDocsChild*)term_docs->child; InStream *prox_stream = child->prox_stream; STRLEN len = child->freq * sizeof(U32); U32 *positions, *positions_end; U32 position = 0; SvGROW( child->positions, len ); SvCUR_set(child->positions, len); positions = (U32*)SvPVX(child->positions); positions_end = (U32*)SvEND(child->positions); while (positions < positions_end) { position += prox_stream->read_vint(prox_stream); *positions++ = position; } } void Kino1_SegTermDocs_seek_tinfo(TermDocs *term_docs, TermInfo *tinfo) { SegTermDocsChild *child; child = (SegTermDocsChild*)term_docs->child; child->count = 0; if (tinfo == NULL) { child->doc_freq = 0; } else { child->doc = 0; child->freq = 0; child->skip_doc = 0; child->skip_count = 0; child->have_skipped = FALSE; child->num_skips = tinfo->doc_freq / child->skip_interval; child->doc_freq = tinfo->doc_freq; child->frq_fileptr = tinfo->frq_fileptr; child->prx_fileptr = tinfo->prx_fileptr; child->skip_fileptr = tinfo->frq_fileptr + tinfo->skip_offset; child->freq_stream->seek( child->freq_stream, tinfo->frq_fileptr ); child->prox_stream->seek( child->prox_stream, tinfo->prx_fileptr ); } } bool Kino1_SegTermDocs_skip_to(TermDocs *term_docs, U32 target) { SegTermDocsChild *child = (SegTermDocsChild*)term_docs->child; if (child->doc_freq >= child->skip_interval) { InStream *freq_stream = child->freq_stream; InStream *prox_stream = child->prox_stream; InStream *skip_stream = child->skip_stream; U32 last_skip_doc = child->skip_doc; double last_frq_fileptr = freq_stream->tell(freq_stream); double last_prx_fileptr = -1; I32 num_skipped = -1 - (child->count % child->skip_interval); if (!child->have_skipped) { child->skip_stream->seek(child->skip_stream, child->skip_fileptr); child->have_skipped = TRUE; } while (target > child->skip_doc) { last_skip_doc = child->skip_doc; last_frq_fileptr = child->frq_fileptr; last_prx_fileptr = child->prx_fileptr; if (child->skip_doc != 0 && child->skip_doc >= child->doc) { num_skipped += child->skip_interval; } if (child->skip_count >= child->num_skips) { break; } child->skip_doc += skip_stream->read_vint(skip_stream); child->frq_fileptr += skip_stream->read_vint(skip_stream); child->prx_fileptr += skip_stream->read_vint(skip_stream); child->skip_count++; } /* if there's something to skip, skip it */ if (last_frq_fileptr > freq_stream->tell(freq_stream)) { freq_stream->seek(freq_stream, last_frq_fileptr); if (child->read_positions) { prox_stream->seek(prox_stream, last_prx_fileptr); } child->doc = last_skip_doc; child->count += num_skipped; } } /* done skipping, so scan */ do { if (!term_docs->next(term_docs)) { return FALSE; } } while (target > child->doc); return TRUE; } void Kino1_SegTermDocs_destroy(TermDocs *term_docs){ SegTermDocsChild *child; child = (SegTermDocsChild*)term_docs->child; SvREFCNT_dec(child->positions); SvREFCNT_dec(child->freq_stream_sv); SvREFCNT_dec(child->prox_stream_sv); SvREFCNT_dec(child->skip_stream_sv); SvREFCNT_dec(child->deldocs_sv); SvREFCNT_dec(child->reader_sv); Kino1_Safefree(child); Kino1_TermDocs_destroy(term_docs); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Index::SegTermDocs - single-segment TermDocs ==head1 DESCRIPTION Single-segment implemetation of KinoSearch1::Index::TermDocs. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/SegTermEnum.pm000444000765000765 3737611462203445 22402 0ustar00marvinmarvin000000000000package KinoSearch1::Index::SegTermEnum; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params finfos => undef, instream => undef, is_index => 0, ); } our %instance_vars; use KinoSearch1::Index::Term; use KinoSearch1::Index::TermInfo; use KinoSearch1::Index::TermBuffer; sub new { # verify params my $ignore = shift; my %args = ( %instance_vars, @_ ); confess kerror() unless verify_args( \%instance_vars, %args ); # get a TermBuffer helper object my $term_buffer = KinoSearch1::Index::TermBuffer->new( finfos => $args{finfos}, ); return _new_helper( @args{ 'instream', 'is_index', 'finfos', }, $term_buffer ); } sub clone_enum { my $self = shift; # dupe instream and seek it to the start of the file, so init works right my $instream = $self->_get_instream; my $new_stream = $instream->clone_stream; $new_stream->seek(0); # create a new object and seek it to the right term/terminfo my $evil_twin = __PACKAGE__->new( finfos => $self->_get_finfos, instream => $new_stream, is_index => $self->is_index, ); $evil_twin->seek( $instream->tell, $self->_get_position, $self->get_termstring, $self->get_term_info ); return $evil_twin; } # Locate the Enum to a particular spot. sub seek { my ( $self, $pointer, $position, $termstring, $tinfo ) = @_; # seek the filehandle my $instream = $self->_get_instream; $instream->seek($pointer); # set values as if we'd scanned here from the start of the Enum $self->_set_position($position); $self->_set_termstring($termstring); $self->_set_term_info($tinfo); } sub close { my $instream = $_[0]->_get_instream; $instream->close; } # return a Term, if the Enum is currently valid. sub get_term { my $self = shift; my $termstring = $self->get_termstring; return unless defined $termstring; return KinoSearch1::Index::Term->new_from_string( $termstring, $self->_get_finfos ); } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Index::SegTermEnum SegTermEnum* _new_helper(instream_sv, is_index, finfos_sv, term_buffer_sv) SV *instream_sv; I32 is_index; SV *finfos_sv SV *term_buffer_sv; CODE: RETVAL = Kino1_SegTermEnum_new_helper(instream_sv, is_index, finfos_sv, term_buffer_sv); OUTPUT: RETVAL =for comment fill_cache() loads the entire Enum into memory. This should only be called for index Enums -- never for primary Enums. =cut void fill_cache(obj) SegTermEnum *obj; PPCODE: Kino1_SegTermEnum_fill_cache(obj); =begin comment scan_to() iterates through the Enum until the Enum's state is ge the target. This is called on the main Enum, after seek() has gotten it close. You don't want to scan through the entire main Enum, just through a small part. Scanning through an Enum is an involved process, due to the heavy data compression. See the Java Lucene File Format definition for details. =end comment =cut void scan_to(obj, target_termstring_sv) SegTermEnum *obj; SV *target_termstring_sv; PREINIT: char *ptr; STRLEN len; PPCODE: ptr = SvPV(target_termstring_sv, len); if (len < 2) Kino1_confess("length of termstring < 2: %"UVuf, (UV)len); Kino1_SegTermEnum_scan_to(obj, ptr, len); =for comment Reset the Enum to the top, so that after next() is called, the Enum is located at the first term in the segment. =cut void reset(obj) SegTermEnum *obj; PPCODE: Kino1_SegTermEnum_reset(obj); =for comment next() advances the state of the Enum one term. If the current position of the Enum is valid, it returns 1; when the Enum is exhausted, it returns 0. =cut IV next(obj) SegTermEnum *obj; CODE: RETVAL = Kino1_SegTermEnum_next(obj); OUTPUT: RETVAL =for comment For an Enum which has been loaded into memory, scan to the target as quickly as possible. =cut I32 scan_cache(obj, target_termstring_sv) SegTermEnum *obj; SV *target_termstring_sv; PREINIT: char *ptr; STRLEN len; CODE: ptr = SvPV(target_termstring_sv, len); if (len < 2) Kino1_confess("length of termstring < 2: %"UVuf, (UV)len); RETVAL = Kino1_SegTermEnum_scan_cache(obj, ptr, len); OUTPUT: RETVAL =for comment Setters and getters for members in the SegTermEnum struct. Not all of these are useful. =cut SV* _set_or_get(obj, ...) SegTermEnum *obj; ALIAS: _set_instream = 1 _get_instream = 2 _set_finfos = 3 _get_finfos = 4 _set_size = 5 get_size = 6 _set_termstring = 7 get_termstring = 8 _set_term_info = 9 get_term_info = 10 _set_index_interval = 11 get_index_interval = 12 _set_skip_interval = 13 get_skip_interval = 14 _set_position = 15 _get_position = 16 _set_is_index = 17 is_index = 18 CODE: { KINO_START_SET_OR_GET_SWITCH case 0: croak("can't call _get_or_set on it's own"); break; /* probably unreachable */ case 1: SvREFCNT_dec(obj->instream_sv); obj->instream_sv = newSVsv( ST(1) ); /* fall through */ case 2: RETVAL = newSVsv(obj->instream_sv); break; case 3: SvREFCNT_dec(obj->finfos); obj->finfos = newSVsv( ST(1) ); /* fall through */ case 4: RETVAL = newSVsv(obj->finfos); break; case 5: obj->enum_size = (I32)SvIV( ST(1) ); /* fall through */ case 6: RETVAL = newSViv(obj->enum_size); break; case 7: if ( SvOK( ST(1) ) ) { STRLEN len = SvCUR( ST(1) ); if (len < KINO_FIELD_NUM_LEN) Kino1_confess("Internal error: termstring too short"); Kino1_TermBuf_set_termstring(obj->term_buf, SvPVX(ST(1)), len); } else { Kino1_TermBuf_reset(obj->term_buf); } /* fall through */ case 8: RETVAL = (obj->term_buf->termstring == NULL) ? &PL_sv_undef : newSVpv( obj->term_buf->termstring->ptr, obj->term_buf->termstring->size ); break; case 9: { TermInfo* new_tinfo; Kino1_extract_struct( ST(1), new_tinfo, TermInfo*, "KinoSearch1::Index::TermInfo"); Kino1_TInfo_destroy(obj->tinfo); obj->tinfo = Kino1_TInfo_dupe(new_tinfo); } /* fall through */ case 10: { TermInfo* new_tinfo; RETVAL = newSV(0); new_tinfo = Kino1_TInfo_dupe(obj->tinfo); sv_setref_pv(RETVAL, "KinoSearch1::Index::TermInfo", (void*)new_tinfo); } break; case 11: obj->index_interval = SvIV( ST(1) ); /* fall through */ case 12: RETVAL = newSViv(obj->index_interval); break; case 13: obj->skip_interval = SvIV( ST(1) ); /* fall through */ case 14: RETVAL = newSViv(obj->skip_interval); break; case 15: obj->position = SvIV( ST(1) ); /* fall through */ case 16: RETVAL = newSViv(obj->position); break; case 17: Kino1_confess("can't set is_index"); /* fall through */ case 18: RETVAL = newSViv(obj->is_index); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL void DESTROY(obj) SegTermEnum* obj; PPCODE: Kino1_SegTermEnum_destroy(obj); __H__ #ifndef H_KINOSEARCH_INDEX_SEG_TERM_ENUM #define H_KINOSEARCH_INDEX_SEG_TERM_ENUM 1 #include "EXTERN.h" #include "perl.h" #include "KinoSearch1IndexTermBuffer.h" #include "KinoSearch1IndexTermInfo.h" #include "KinoSearch1StoreInStream.h" #include "KinoSearch1UtilByteBuf.h" #include "KinoSearch1UtilCarp.h" #include "KinoSearch1UtilCClass.h" #include "KinoSearch1UtilMemManager.h" #include "KinoSearch1UtilStringHelper.h" typedef struct segtermenum { SV *finfos; SV *instream_sv; SV *term_buf_ref; TermBuffer *term_buf; TermInfo *tinfo; InStream *instream; I32 is_index; I32 enum_size; I32 position; I32 index_interval; I32 skip_interval; ByteBuf **termstring_cache; TermInfo **tinfos_cache; } SegTermEnum; SegTermEnum* Kino1_SegTermEnum_new_helper(SV*, I32, SV*, SV*); void Kino1_SegTermEnum_reset(SegTermEnum*); I32 Kino1_SegTermEnum_next(SegTermEnum*); void Kino1_SegTermEnum_fill_cache(SegTermEnum*); void Kino1_SegTermEnum_scan_to(SegTermEnum*, char*, I32); I32 Kino1_SegTermEnum_scan_cache(SegTermEnum*, char*, I32); void Kino1_SegTermEnum_destroy(SegTermEnum*); #endif /* include guard */ __C__ #include "KinoSearch1IndexSegTermEnum.h" SegTermEnum* Kino1_SegTermEnum_new_helper(SV *instream_sv, I32 is_index, SV *finfos_sv, SV *term_buffer_sv) { I32 format; InStream *instream; SegTermEnum *obj; /* allocate */ Kino1_New(0, obj, 1, SegTermEnum); obj->tinfo = Kino1_TInfo_new(); /* init */ obj->tinfos_cache = NULL; obj->termstring_cache = NULL; /* save instream, finfos, and term_buffer, incrementing refcounts */ obj->instream_sv = newSVsv(instream_sv); obj->finfos = newSVsv(finfos_sv); obj->term_buf_ref = newSVsv(term_buffer_sv); Kino1_extract_struct(term_buffer_sv, obj->term_buf, TermBuffer*, "KinoSearch1::Index::TermBuffer"); Kino1_extract_struct(instream_sv, obj->instream, InStream*, "KinoSearch1::Store::InStream"); instream = obj->instream; /* determine whether this is a primary or index enum */ obj->is_index = is_index; /* reject older or newer index formats */ format = (I32)instream->read_int(instream); if (format != -2) Kino1_confess("Unsupported index format: %d", format); /* read in some vars */ obj->enum_size = instream->read_long(instream); obj->index_interval = instream->read_int(instream); obj->skip_interval = instream->read_int(instream); /* define the position of the Enum as "not yet started" */ obj->position = -1; return obj; } #define KINO_SEG_TERM_ENUM_HEADER_LEN 20 void Kino1_SegTermEnum_reset(SegTermEnum* obj) { obj->position = -1; obj->instream->seek(obj->instream, KINO_SEG_TERM_ENUM_HEADER_LEN); Kino1_TermBuf_reset(obj->term_buf); Kino1_TInfo_reset(obj->tinfo); } I32 Kino1_SegTermEnum_next(SegTermEnum *obj) { InStream *instream; TermInfo *tinfo; /* make some local copies for clarity of code */ instream = obj->instream; tinfo = obj->tinfo; /* if we've run out of terms, null out the termstring and return */ if (++obj->position >= obj->enum_size) { Kino1_TermBuf_reset(obj->term_buf); return 0; } /* read in the term */ Kino1_TermBuf_read(obj->term_buf, instream); /* read doc freq */ tinfo->doc_freq = instream->read_vint(instream); /* adjust file pointers. */ tinfo->frq_fileptr += instream->read_vlong(instream); tinfo->prx_fileptr += instream->read_vlong(instream); /* read skip data (which doesn't do anything right now) */ if (tinfo->doc_freq >= obj->skip_interval) tinfo->skip_offset = instream->read_vint(instream); else tinfo->skip_offset = 0; /* read filepointer to main enum if this is an index enum */ if (obj->is_index) tinfo->index_fileptr += instream->read_vlong(instream); return 1; } void Kino1_SegTermEnum_fill_cache(SegTermEnum* obj) { TermBuffer *term_buf; TermInfo *tinfo; TermInfo **tinfos_cache; ByteBuf **termstring_cache; /* allocate caches */ if (obj->tinfos_cache != NULL) Kino1_confess("Internal error: cache already filled"); Kino1_New(0, obj->termstring_cache, obj->enum_size, ByteBuf*); Kino1_New(0, obj->tinfos_cache, obj->enum_size, TermInfo*); /* make some local copies */ tinfo = obj->tinfo; term_buf = obj->term_buf; tinfos_cache = obj->tinfos_cache; termstring_cache = obj->termstring_cache; while (Kino1_SegTermEnum_next(obj)) { /* copy tinfo and termstring into caches */ *tinfos_cache++ = Kino1_TInfo_dupe(tinfo); *termstring_cache++ = Kino1_BB_clone(term_buf->termstring); } } void Kino1_SegTermEnum_scan_to(SegTermEnum *obj, char *target_termstring, I32 target_termstring_len) { TermBuffer *term_buf = obj->term_buf; ByteBuf target; /* make convenience copies */ target.ptr = target_termstring; target.size = target_termstring_len; /* keep looping until the termstring is lexically ge target */ do { const I32 comparison = Kino1_BB_compare(term_buf->termstring, &target); if ( comparison >= 0 && obj->position != -1) { break; } } while (Kino1_SegTermEnum_next(obj)); } I32 Kino1_SegTermEnum_scan_cache(SegTermEnum *obj, char *target_termstring, I32 target_len) { TermBuffer *term_buf = obj->term_buf; ByteBuf **termstrings = obj->termstring_cache; ByteBuf target; I32 lo = 0; I32 hi = obj->enum_size - 1; I32 result = -100; I32 mid, comparison; /* make convenience copies */ target.ptr = target_termstring; target.size = target_len; if (obj->tinfos_cache == NULL) Kino1_confess("Internal Error: fill_cache hasn't been called yet"); /* divide and conquer */ while (hi >= lo) { mid = (lo + hi) >> 1; comparison = Kino1_BB_compare(&target, termstrings[mid]); if (comparison < 0) hi = mid - 1; else if (comparison > 0) lo = mid + 1; else { result = mid; break; } } result = hi == -1 ? 0 /* indicating that target lt first entry */ : result == -100 ? hi /* if result is still -100, it wasn't set */ : result; /* set the state of the Enum/TermBuffer as if we'd called scan_to */ obj->position = result; Kino1_TermBuf_set_termstring(term_buf, termstrings[result]->ptr, termstrings[result]->size); Kino1_TInfo_destroy(obj->tinfo); obj->tinfo = Kino1_TInfo_dupe( obj->tinfos_cache[result] ); return result; } void Kino1_SegTermEnum_destroy(SegTermEnum *obj) { /* put out the garbage for collection */ SvREFCNT_dec(obj->finfos); SvREFCNT_dec(obj->instream_sv); SvREFCNT_dec(obj->term_buf_ref); Kino1_TInfo_destroy(obj->tinfo); /* if fill_cache was called, free all of that... */ if (obj->tinfos_cache != NULL) { I32 iter; ByteBuf **termstring_cache = obj->termstring_cache; TermInfo **tinfos_cache = obj->tinfos_cache; for (iter = 0; iter < obj->enum_size; iter++) { Kino1_BB_destroy(*termstring_cache++); Kino1_TInfo_destroy(*tinfos_cache++); } Kino1_Safefree(obj->tinfos_cache); Kino1_Safefree(obj->termstring_cache); } /* last, the SegTermEnum object itself */ Kino1_Safefree(obj); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Index::SegTermEnum - single-segment TermEnum ==head1 DESCRIPTION Single-segment implementation of KinoSearch1::Index::TermEnum. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/SegWriter.pm000444000765000765 2137511462203445 22112 0ustar00marvinmarvin000000000000package KinoSearch1::Index::SegWriter; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members invindex => undef, seg_name => undef, finfos => undef, field_sims => undef, # members norm_outstreams => undef, fields_writer => undef, postings_writer => undef, doc_count => 0, ); __PACKAGE__->ready_get(qw( seg_name doc_count )); } use KinoSearch1::Analysis::TokenBatch; use KinoSearch1::Index::FieldsWriter; use KinoSearch1::Index::PostingsWriter; use KinoSearch1::Index::CompoundFileWriter; use KinoSearch1::Index::IndexFileNames qw( @COMPOUND_EXTENSIONS SORTFILE_EXTENSION ); sub init_instance { my $self = shift; my ( $invindex, $seg_name, $finfos ) = @{$self}{ 'invindex', 'seg_name', 'finfos' }; # init norms my $norm_outstreams = $self->{norm_outstreams} = []; my @indexed_field_nums = map { $_->get_field_num } grep { $_->get_indexed } $finfos->get_infos; for my $field_num (@indexed_field_nums) { my $filename = "$seg_name.f$field_num"; $invindex->delete_file($filename) if $invindex->file_exists($filename); $norm_outstreams->[$field_num] = $invindex->open_outstream($filename); } # init FieldsWriter $self->{fields_writer} = KinoSearch1::Index::FieldsWriter->new( invindex => $invindex, seg_name => $seg_name, ); # init PostingsWriter $self->{postings_writer} = KinoSearch1::Index::PostingsWriter->new( invindex => $invindex, seg_name => $seg_name, ); } # Add a document to the segment. sub add_doc { my ( $self, $doc ) = @_; my $norm_outstreams = $self->{norm_outstreams}; my $postings_cache = $self->{postings_cache}; my $field_sims = $self->{field_sims}; my $doc_boost = $doc->get_boost; for my $indexed_field ( grep { $_->get_indexed } $doc->get_fields ) { my $field_name = $indexed_field->get_name; my $token_batch = KinoSearch1::Analysis::TokenBatch->new; # if the field has content, put it in the TokenBatch if ( $indexed_field->get_value_len ) { $token_batch->append( $indexed_field->get_value, 0, $indexed_field->get_value_len ); } # analyze the field if ( $indexed_field->get_analyzed ) { $token_batch = $indexed_field->get_analyzer()->analyze($token_batch); } # invert the doc $token_batch->build_posting_list( $self->{doc_count}, $indexed_field->get_field_num ); # prepare to store the term vector, if the field is vectorized if ( $indexed_field->get_vectorized and $indexed_field->get_stored ) { $indexed_field->set_tv_string( $token_batch->get_tv_string ); } # encode a norm into a byte, write it to an outstream my $norm_val = $doc_boost * $indexed_field->get_boost * $field_sims->{$field_name} ->lengthnorm( $token_batch->get_size ); my $outstream = $norm_outstreams->[ $indexed_field->get_field_num ]; $outstream->lu_write( 'a', $field_sims->{$field_name}->encode_norm($norm_val) ); # feed PostingsWriter $self->{postings_writer}->add_postings( $token_batch->get_postings ); } # store fields $self->{fields_writer}->add_doc($doc); $self->{doc_count}++; } sub add_segment { my ( $self, $seg_reader ) = @_; # prepare to bulk add my $deldocs = $seg_reader->get_deldocs; my $doc_map = $deldocs->generate_doc_map( $seg_reader->max_doc, $self->{doc_count} ); my $field_num_map = $self->{finfos}->generate_field_num_map( $seg_reader->get_finfos ); # bulk add the slab of documents to the various writers $self->_merge_norms( $seg_reader, $doc_map ); $self->{fields_writer} ->add_segment( $seg_reader, $doc_map, $field_num_map ); $self->{postings_writer}->add_segment( $seg_reader, $doc_map ); $self->{doc_count} += $seg_reader->num_docs; } # Bulk write norms. sub _merge_norms { my ( $self, $seg_reader, $doc_map ) = @_; my $norm_outstreams = $self->{norm_outstreams}; my $field_sims = $self->{field_sims}; my @indexed_fields = grep { $_->get_indexed } $self->{finfos}->get_infos; for my $field (@indexed_fields) { my $field_name = $field->get_name; my $outstream = $norm_outstreams->[ $field->get_field_num ]; my $norms_reader = $seg_reader->norms_reader($field_name); # if the field was indexed before, copy the norms if ( defined $norms_reader ) { _write_remapped_norms( $outstream, $doc_map, $norms_reader->get_bytes ); } else { # the field isn't in the input segment, so write a default my $zeronorm = $field_sims->{$field_name}->lengthnorm(0); my $num_docs = $seg_reader->num_docs; my $normstring = $field_sims->{$field_name}->encode_norm($zeronorm) x $num_docs; $outstream->lu_write( "a$num_docs", $normstring ); } } } # Finish writing the segment. sub finish { my $self = shift; my ( $invindex, $seg_name ) = @{$self}{ 'invindex', 'seg_name' }; # write Term Dictionary, positions. $self->{postings_writer}->write_postings; # write FieldInfos my $fnm_file = "$seg_name.fnm"; $invindex->delete_file($fnm_file) if $invindex->file_exists($fnm_file); my $finfos_outstream = $invindex->open_outstream("$seg_name.fnm"); $self->{finfos}->write_infos($finfos_outstream); $finfos_outstream->close; # close down all the writers, so we can open the files they've finished. $self->{postings_writer}->finish; $self->{fields_writer}->finish; for ( @{ $self->{norm_outstreams} } ) { $_->close if defined; } # consolidate compound file - if we actually added any docs my @compound_files = map {"$seg_name.$_"} @COMPOUND_EXTENSIONS; if ( $self->{doc_count} ) { my $compound_file_writer = KinoSearch1::Index::CompoundFileWriter->new( invindex => $invindex, filename => "$seg_name.tmp", ); push @compound_files, map { "$seg_name.f" . $_->get_field_num } grep { $_->get_indexed } $self->{finfos}->get_infos; $compound_file_writer->add_file($_) for @compound_files; $compound_file_writer->finish; $invindex->rename_file( "$seg_name.tmp", "$seg_name.cfs" ); } # delete files that are no longer needed; $invindex->delete_file($_) for @compound_files; my $sort_file_name = "$seg_name" . SORTFILE_EXTENSION; $invindex->delete_file($sort_file_name) if $invindex->file_exists($sort_file_name); } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Index::SegWriter void _write_remapped_norms(outstream, doc_map_ref, norms_ref) OutStream *outstream; SV *doc_map_ref; SV *norms_ref; PPCODE: Kino1_SegWriter_write_remapped_norms(outstream, doc_map_ref, norms_ref); __H__ #ifndef H_KINOSEARCH_SEG_WRITER #define H_KINOSEARCH_SEG_WRITER 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1StoreOutStream.h" #include "KinoSearch1UtilCarp.h" void Kino1_SegWriter_write_remapped_norms(OutStream*, SV*, SV*); #endif /* include guard */ __C__ #include "KinoSearch1IndexSegWriter.h" void Kino1_SegWriter_write_remapped_norms(OutStream *outstream, SV *doc_map_ref, SV* norms_ref) { SV *norms_sv, *doc_map_sv; I32 *doc_map, *doc_map_end; char *norms; STRLEN doc_map_len, norms_len; /* extract doc map and norms arrays */ doc_map_sv = SvRV(doc_map_ref); doc_map = (I32*)SvPV(doc_map_sv, doc_map_len); doc_map_end = (I32*)SvEND(doc_map_sv); norms_sv = SvRV(norms_ref); norms = SvPV(norms_sv, norms_len); if (doc_map_len != norms_len * sizeof(I32)) Kino1_confess("Mismatched doc_map and norms"); /* write a norm for each non-deleted doc */ while (doc_map < doc_map_end) { if (*doc_map != -1) { outstream->write_byte(outstream, *norms); } doc_map++; norms++; } } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Index::SegWriter - write one segment of an invindex ==head1 DESCRIPTION SegWriter is a conduit through which information fed to InvIndexer passes on its way to low-level writers such as FieldsWriter and TermInfosWriter. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/Term.pm000444000765000765 430311462203445 21056 0ustar00marvinmarvin000000000000package KinoSearch1::Index::Term; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( field => undef, text => undef, ); __PACKAGE__->ready_get_set(qw( field text )); } sub new { croak("usage: KinoSearch1::Index::Term->new( field, text )") unless @_ == 3; return bless { field => $_[1], text => $_[2], }, __PACKAGE__; } # Alternate, internal constructor. sub new_from_string { my ( $class, $termstring, $finfos ) = @_; my $field_num = unpack( 'n', bytes::substr( $termstring, 0, 2, '' ) ); my $field_name = $finfos->field_name($field_num); return __PACKAGE__->new( $field_name, $termstring ); } # Return an encoded termstring. Requires a FieldInfos to discover fieldnum. sub get_termstring { confess('usage: $term->get_termstring($finfos)') unless @_ == 2; my ( $self, $finfos ) = @_; my $field_num = $finfos->get_field_num( $self->{field} ); return unless defined $field_num; return pack( 'n', $field_num ) . $self->{text}; } sub to_string { my $self = shift; return "$self->{field}:$self->{text}"; } 1; __END__ __H__ #ifndef H_KINOSEARCH_INDEX_TERM #define H_KINOSEARCH_INDEX_TERM 1 /* Field Number Length -- the number of bytes occupied by the field number at * the top of a TermString. */ #define KINO_FIELD_NUM_LEN 2 #endif /* include guard */ __POD__ =head1 NAME KinoSearch1::Index::Term - string of text associated with a field =head1 SYNOPSIS my $foo_term = KinoSearch1::Index::Term->new( 'content', 'foo' ); my $term_query = KinoSearch1::Search::TermQuery->new( term => $foo_term ); =head1 DESCRIPTION The Term is the unit of search. It has two characteristics: a field name, and term text. =head1 METHODS =head2 new my $term = KinoSearch1::Index::Term->new( FIELD_NAME, TERM_TEXT ); Constructor. =head2 set_text get_text set_field get_field Getters and setters. =head2 to_string Returns a string representation of the Term object. =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Index/TermBuffer.pm000444000765000765 1163211462203446 22234 0ustar00marvinmarvin000000000000package KinoSearch1::Index::TermBuffer; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::CClass ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params finfos => undef, ); } our %instance_vars; sub new { my $class = shift; $class = ref($class) || $class; my %args = ( %instance_vars, @_ ); confess kerror() unless verify_args( \%instance_vars, %args ); my $self = _new( $class, $args{finfos}->size ); return $self; } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Index::TermBuffer void _new(class, finfos_size) char *class; I32 finfos_size; PREINIT: TermBuffer *term_buf; PPCODE: term_buf = Kino1_TermBuf_new(finfos_size); ST(0) = sv_newmortal(); sv_setref_pv(ST(0), class, (void*)term_buf); XSRETURN(1); void DESTROY(term_buf) TermBuffer *term_buf; PPCODE: Kino1_TermBuf_destroy(term_buf); __H__ #ifndef H_KINOSEARCH_INDEX_TERM_BUFFER #define H_KINOSEARCH_INDEX_TERM_BUFFER 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1IndexTerm.h" #include "KinoSearch1StoreInStream.h" #include "KinoSearch1UtilByteBuf.h" #include "KinoSearch1UtilCarp.h" #include "KinoSearch1UtilMemManager.h" typedef struct termbuffer { ByteBuf *termstring; I32 text_len; I32 max_field_num; } TermBuffer; TermBuffer* Kino1_TermBuf_new(I32); void Kino1_TermBuf_read(TermBuffer*, InStream*); void Kino1_TermBuf_reset(TermBuffer*); void Kino1_TermBuf_set_termstring(TermBuffer*, char*, I32); void Kino1_TermBuf_destroy(TermBuffer*); #endif /* include guard */ __C__ #include "KinoSearch1IndexTermBuffer.h" static void Kino1_TermBuf_set_text_len(TermBuffer*, I32); TermBuffer* Kino1_TermBuf_new(I32 finfos_size) { TermBuffer *term_buf; /* allocate */ Kino1_New(0, term_buf, 1, TermBuffer); /* reset the TermBuffer */ term_buf->termstring = NULL; Kino1_TermBuf_reset(term_buf); /* derive max_field_num */ term_buf->max_field_num = finfos_size - 1; return term_buf; } /* Decode the next term in a term dictionary file (.tii, .tis), but don't turn * it into a full-fledged Term object. */ void Kino1_TermBuf_read(TermBuffer *term_buf, InStream *instream) { I32 text_overlap, finish_chars_len, total_text_len, field_num; /* read bytes which are shared between the last term text and this */ text_overlap = instream->read_vint(instream); finish_chars_len = instream->read_vint(instream); total_text_len = text_overlap + finish_chars_len; Kino1_TermBuf_set_text_len(term_buf, total_text_len); instream->read_chars(instream, term_buf->termstring->ptr, (text_overlap + KINO_FIELD_NUM_LEN), finish_chars_len); /* read field num */ field_num = instream->read_vint(instream); if (field_num > term_buf->max_field_num && field_num != -1) Kino1_confess("Internal error: field_num %d > max_field_num %d", field_num, term_buf->max_field_num); Kino1_encode_bigend_U16( (U16)field_num, term_buf->termstring->ptr); } /* Set the TermBuffer object to a sentinel state, indicating that it does not * hold a valid Term */ void Kino1_TermBuf_reset(TermBuffer *term_buf) { if (term_buf->termstring != NULL) { Kino1_BB_destroy(term_buf->termstring); term_buf->termstring = NULL; } term_buf->text_len = 0; } void Kino1_TermBuf_set_termstring(TermBuffer *term_buf, char* ptr, I32 len) { /* the passed in len includes the length of the encoded field num */ if (len < 2) Kino1_confess("can't set_termstring with a len < 2: %d", len); Kino1_TermBuf_set_text_len(term_buf, len - KINO_FIELD_NUM_LEN); Kino1_BB_assign_string(term_buf->termstring, ptr, len); } /* Set the length of the term text, and ensure that there's enough memory * allocated to hold term text that size. */ static void Kino1_TermBuf_set_text_len(TermBuffer *term_buf, I32 new_len) { ByteBuf* termstring = term_buf->termstring; /* initialize if necessary, with a field number of 0 */ if (termstring == NULL) { termstring = Kino1_BB_new_string("\0\0", 2); term_buf->termstring = termstring; } /* realloc and set lengths */ Kino1_BB_grow(termstring, new_len + KINO_FIELD_NUM_LEN); termstring->size = new_len + KINO_FIELD_NUM_LEN; term_buf->text_len = new_len; /* null-terminate */ termstring->ptr[ termstring->size ] = '\0'; } void Kino1_TermBuf_destroy(TermBuffer *term_buf) { Kino1_TermBuf_reset(term_buf); Kino1_Safefree(term_buf); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Index::TermBuffer - decode a term dictionary one Term at a time ==head1 DESCRIPTION A TermBuffer iterates through a term dictionary, holding one current term in a buffer. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/TermDocs.pm000444000765000765 2024211462203446 21710 0ustar00marvinmarvin000000000000package KinoSearch1::Index::TermDocs; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::CClass ); BEGIN { __PACKAGE__->init_instance_vars(); } =begin comment $term_docs->seek($term); Locate the TermDocs object at a particular term. =end comment =cut sub seek { shift->abstract_death } sub close { shift->abstract_death } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Index::TermDocs void new(either_sv) SV *either_sv; PREINIT: const char *class; TermDocs *term_docs; PPCODE: /* determine the class */ class = sv_isobject(either_sv) ? sv_reftype(either_sv, 0) : SvPV_nolen(either_sv); /* build object */ term_docs = Kino1_TermDocs_new(); ST(0) = sv_newmortal(); sv_setref_pv(ST(0), class, (void*)term_docs); XSRETURN(1); void seek_tinfo(term_docs, maybe_tinfo_sv) TermDocs *term_docs; SV *maybe_tinfo_sv; PREINIT: TermInfo *tinfo = NULL; PPCODE: /* if maybe_tinfo_sv is undef, tinfo is NULL */ if (SvOK(maybe_tinfo_sv)) { Kino1_extract_struct(maybe_tinfo_sv, tinfo, TermInfo*, "KinoSearch1::Index::TermInfo"); } term_docs->seek_tinfo(term_docs, tinfo); =begin comment while ($term_docs->next) { # ... } Advance the TermDocs object to the next document. Returns false when the iterator is exhausted, true otherwise. =end comment =cut bool next(term_docs) TermDocs *term_docs; CODE: RETVAL = term_docs->next(term_docs); OUTPUT: RETVAL U32 bulk_read(term_docs, doc_nums_sv, freqs_sv, num_wanted) TermDocs *term_docs SV *doc_nums_sv; SV *freqs_sv; U32 num_wanted; CODE: RETVAL = term_docs->bulk_read(term_docs, doc_nums_sv, freqs_sv, num_wanted); OUTPUT: RETVAL =begin comment To do. =end comment =cut bool skip_to(term_docs, target) TermDocs *term_docs; U32 target; CODE: RETVAL = term_docs->skip_to(term_docs, target); OUTPUT: RETVAL SV* _parent_set_or_get(term_docs, ...) TermDocs *term_docs; ALIAS: set_doc = 1 get_doc = 2 set_freq = 3 get_freq = 4 set_positions = 5 get_positions = 6 set_doc_freq = 7 get_doc_freq = 8 PREINIT: U32 num; CODE: { KINO_START_SET_OR_GET_SWITCH case 1: Kino1_confess("Can't set_doc"); /* fall through */ case 2: num = term_docs->get_doc(term_docs); RETVAL = num == KINO_TERM_DOCS_SENTINEL ? &PL_sv_undef : newSVuv(num); break; case 3: Kino1_confess("Can't set_freq"); /* fall through */ case 4: num = term_docs->get_freq(term_docs); RETVAL = num == KINO_TERM_DOCS_SENTINEL ? &PL_sv_undef : newSVuv(num); break; case 5: Kino1_confess("Can't set_positions"); /* fall through */ case 6: RETVAL = newSVsv(term_docs->get_positions(term_docs)); break; case 7: term_docs->set_doc_freq(term_docs, (U32)SvUV(ST(1)) ); /* fall through */ case 8: num = term_docs->get_doc_freq(term_docs); RETVAL = num == KINO_TERM_DOCS_SENTINEL ? &PL_sv_undef : newSVuv(num); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL void DESTROY(term_docs) TermDocs *term_docs; PPCODE: term_docs->destroy(term_docs); __H__ #ifndef H_KINO_TERM_DOCS #define H_KINO_TERM_DOCS 1 #define KINO_TERM_DOCS_SENTINEL 0xFFFFFFFF #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1UtilMemManager.h" #include "KinoSearch1IndexTermInfo.h" typedef struct termdocs { void *child; SV *positions; void (*set_doc_freq)(struct termdocs*, U32); U32 (*get_doc_freq)(struct termdocs*); U32 (*get_doc)(struct termdocs*); U32 (*get_freq)(struct termdocs*); SV* (*get_positions)(struct termdocs*); void (*seek_tinfo)(struct termdocs*, TermInfo*); bool (*next)(struct termdocs*); bool (*skip_to)(struct termdocs*, U32); U32 (*bulk_read)(struct termdocs*, SV*, SV*, U32); void (*destroy)(struct termdocs*); } TermDocs; TermDocs* Kino1_TermDocs_new(); void Kino1_TermDocs_set_doc_freq_death(TermDocs*, U32); U32 Kino1_TermDocs_get_doc_freq_death(TermDocs*); U32 Kino1_TermDocs_get_doc_death(TermDocs*); U32 Kino1_TermDocs_get_freq_death(TermDocs*); SV* Kino1_TermDocs_get_positions_death(TermDocs*); void Kino1_TermDocs_seek_tinfo_death(TermDocs*, TermInfo*); bool Kino1_TermDocs_next_death(TermDocs*); bool Kino1_TermDocs_skip_to_death(TermDocs*, U32); U32 Kino1_TermDocs_bulk_read_death(TermDocs*, SV*, SV*, U32); void Kino1_TermDocs_destroy(TermDocs*); #endif /* include guard */ __C__ #include "KinoSearch1IndexTermDocs.h" TermDocs* Kino1_TermDocs_new() { TermDocs* term_docs; Kino1_New(0, term_docs, 1, TermDocs); term_docs->child = NULL; /* force the subclass to override functions */ term_docs->set_doc_freq = Kino1_TermDocs_set_doc_freq_death; term_docs->get_doc_freq = Kino1_TermDocs_get_doc_freq_death; term_docs->get_doc = Kino1_TermDocs_get_doc_death; term_docs->get_freq = Kino1_TermDocs_get_freq_death; term_docs->get_positions = Kino1_TermDocs_get_positions_death; term_docs->seek_tinfo = Kino1_TermDocs_seek_tinfo_death; term_docs->next = Kino1_TermDocs_next_death; term_docs->skip_to = Kino1_TermDocs_skip_to_death; term_docs->destroy = Kino1_TermDocs_destroy; return term_docs; } void Kino1_TermDocs_set_doc_freq_death(TermDocs *term_docs, U32 doc_freq) { Kino1_confess("term_docs->set_doc_freq must be defined in a subclass"); } U32 Kino1_TermDocs_get_doc_freq_death(TermDocs *term_docs) { Kino1_confess("term_docs->get_doc_freq must be defined in a subclass"); return 1; } U32 Kino1_TermDocs_get_doc_death(TermDocs *term_docs) { Kino1_confess("term_docs->get_doc must be defined in a subclass"); return 1; } U32 Kino1_TermDocs_get_freq_death(TermDocs *term_docs) { Kino1_confess("term_docs->get_freq must be defined in a subclass"); return 1; } SV* Kino1_TermDocs_get_positions_death(TermDocs *term_docs) { Kino1_confess("term_docs->get_positions must be defined in a subclass"); return &PL_sv_undef; } void Kino1_TermDocs_seek_tinfo_death(TermDocs *term_docs, TermInfo *tinfo) { Kino1_confess("term_docs->seek_tinfo must be defined in a subclass"); } bool Kino1_TermDocs_next_death(TermDocs *term_docs) { Kino1_confess("term_docs->next must be defined in a subclass"); return 1; } U32 Kino1_TermDocs_bulk_read_death(TermDocs* term_docs, SV* doc_nums_sv, SV* freqs_sv, U32 num_wanted) { Kino1_confess("term_docs->bulk_read must be defined in a subclass"); return 1; } bool Kino1_TermDocs_skip_to_death(TermDocs *term_docs, U32 target) { Kino1_confess("term_docs->skip_to must be defined in a subclass"); return 1; } void Kino1_TermDocs_destroy(TermDocs *term_docs) { Kino1_Safefree(term_docs); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Index::TermDocs - retrieve list of docs which contain a Term ==head1 SYNOPSIS # abstract base class, but here's how a subclass works: $term_docs->seek($term); my $num_got = $term_docs->bulk_read( $docs, $freqs, $num_to_read ); my @doc_nums = unpack( 'I*', $docs ); my @tf_ds = unpack( 'I*', $freqs ); # term frequency in document # alternately... $term_docs->set_read_positions(1); while ($term_docs->next) { do_something_with( doc => $term_docs->get_doc, freq => $term_docs->get_freq, positions => $term_docs->get_positions, ); } ==head1 DESCRIPTION Feed a TermDocs object a Term to get docs (and freqs). If a term is present in the portion of an index that a TermDocs subclass is responsible for, the object is used to access the doc_nums for the documents in which it appears, plus the number of appearances, plus (optionally), the positions at which the term appears in the document. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/TermEnum.pm000444000765000765 336511462203446 21713 0ustar00marvinmarvin000000000000package KinoSearch1::Index::TermEnum; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::CClass ); BEGIN { __PACKAGE__->init_instance_vars(); } =begin comment $term_enum->seek($term); Locate the Enum to a particular spot. =end comment =cut sub seek { shift->abstract_death } =begin comment my $evil_twin = $term_enum->clone_enum; Return a dupe, in the same state as the orig. =end comment =cut sub clone_enum { shift->abstract_death } =begin comment my $not_end_of_enum_yet = $term_enum->next; Proceed to the next term. Return true until we fall off the end of the Enum, then return false. =end comment =cut sub next { shift->abstract_death } sub skip_to { shift->todo_death } =begin comment my $termstring = $term_enum->get_termstring; Return a termstring, if the Enum is in a state where it's valid to do so. Otherwise, return undef. =end comment =cut sub get_termstring { shift->abstract_death } sub get_terminfo { shift->abstract_death } sub get_index_interval { shift->abstract_death } sub get_size { shift->abstract_death } sub close { shift->abstract_death } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Index::TermEnum - scan through a list of Terms ==head1 SYNOPSIS # abstract base class ==head1 DESCRIPTION Conceptually, a TermEnum is a array of Term => TermInfo pairs, sorted lexically by term field name, then term text. The implementations in KinoSearch1 solve the same problem that tied arrays solve: it is possible to iterate through the array while loading as little as possible into memory. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/TermInfo.pm000444000765000765 1117111462203446 21714 0ustar00marvinmarvin000000000000package KinoSearch1::Index::TermInfo; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::CClass ); 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Index::TermInfo TermInfo* new( class_sv, doc_freq, frq_fileptr, prx_fileptr, skip_offset, index_fileptr ) SV *class_sv; I32 doc_freq; double frq_fileptr; double prx_fileptr; I32 skip_offset; double index_fileptr; PREINIT: TermInfo *tinfo; CODE: class_sv = NULL; /* suppress "unused variable" warning */ Kino1_New(0, tinfo, 1, TermInfo); tinfo->doc_freq = doc_freq; tinfo->frq_fileptr = frq_fileptr; tinfo->prx_fileptr = prx_fileptr; tinfo->skip_offset = skip_offset; tinfo->index_fileptr = index_fileptr; RETVAL = tinfo; OUTPUT: RETVAL =begin comment Duplicate a TermInfo object. =end comment =cut TermInfo* clone(tinfo) TermInfo *tinfo; CODE: RETVAL = Kino1_TInfo_dupe(tinfo); OUTPUT: RETVAL =for comment Zero out the TermInfo object. =cut void reset(tinfo) TermInfo *tinfo; PPCODE: Kino1_TInfo_reset(tinfo); =begin comment Setters and getters. =end comment =cut SV* _set_or_get(tinfo, ...) TermInfo *tinfo; ALIAS: set_doc_freq = 1 get_doc_freq = 2 set_frq_fileptr = 3 get_frq_fileptr = 4 set_prx_fileptr = 5 get_prx_fileptr = 6 set_skip_offset = 7 get_skip_offset = 8 set_index_fileptr = 9 get_index_fileptr = 10 CODE: { KINO_START_SET_OR_GET_SWITCH case 1: tinfo->doc_freq = SvIV(ST(1)); /* fall through */ case 2: RETVAL = newSViv(tinfo->doc_freq); break; case 3: tinfo->frq_fileptr = SvNV(ST(1)); /* fall through */ case 4: RETVAL = newSVnv(tinfo->frq_fileptr); break; case 5: tinfo->prx_fileptr = SvNV(ST(1)); /* fall through */ case 6: RETVAL = newSVnv(tinfo->prx_fileptr); break; case 7: tinfo->skip_offset = SvIV(ST(1)); /* fall through */ case 8: RETVAL = newSViv(tinfo->skip_offset); break; case 9: tinfo->index_fileptr = SvNV(ST(1)); /* fall through */ case 10: RETVAL = newSVnv(tinfo->index_fileptr); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL void DESTROY(tinfo) TermInfo* tinfo; CODE: Kino1_Safefree(tinfo); __H__ #ifndef H_KINOSEARCH_INDEX_TERM_INFO #define H_KINOSEARCH_INDEX_TERM_INFO 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1UtilMemManager.h" typedef struct terminfo { I32 doc_freq; double frq_fileptr; double prx_fileptr; I32 skip_offset; double index_fileptr; } TermInfo; TermInfo* Kino1_TInfo_new(); TermInfo* Kino1_TInfo_dupe(TermInfo*); void Kino1_TInfo_reset(TermInfo*); void Kino1_TInfo_destroy(TermInfo*); #endif /* include guard */ __C__ #include "KinoSearch1IndexTermInfo.h" TermInfo* Kino1_TInfo_new() { TermInfo* tinfo; Kino1_New(0, tinfo, 1, TermInfo); Kino1_TInfo_reset(tinfo); return tinfo; } /* Allocate and return a copy of the supplied TermInfo. */ TermInfo* Kino1_TInfo_dupe(TermInfo *tinfo) { TermInfo* new_tinfo; Kino1_New(0, new_tinfo, 1, TermInfo); new_tinfo->doc_freq = tinfo->doc_freq; new_tinfo->frq_fileptr = tinfo->frq_fileptr; new_tinfo->prx_fileptr = tinfo->prx_fileptr; new_tinfo->skip_offset = tinfo->skip_offset; new_tinfo->index_fileptr = tinfo->index_fileptr; return new_tinfo; } void Kino1_TInfo_reset(TermInfo *tinfo) { tinfo->doc_freq = 0; tinfo->frq_fileptr = 0.0; tinfo->prx_fileptr = 0.0; tinfo->skip_offset = 0; tinfo->index_fileptr = 0.0; } void Kino1_TInfo_destroy(TermInfo *tinfo) { Kino1_Safefree(tinfo); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Index::TermInfo - filepointer/statistical data for a Term ==head1 SYNOPSIS my $tinfo = KinoSearch1::Index::TermInfo->new( $doc_freq, $frq_fileptr, $prx_fileptr, $skip_offset, $index_fileptr ); ==head1 DESCRIPTION The TermInfo contains pointer data indicating where a term can be found in various files, plus the document frequency of the term. The index_fileptr member variable is only used if the TermInfo is part of the .tii stream; it is a filepointer to a locations in the main .tis file. ==head1 METHODS ==head2 new Constructor. All 5 arguments are required. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/TermInfosReader.pm000444000765000765 731711462203445 23210 0ustar00marvinmarvin000000000000package KinoSearch1::Index::TermInfosReader; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members invindex => undef, seg_name => undef, finfos => undef, # members orig_enum => undef, index_enum => undef, ); } use KinoSearch1::Index::SegTermEnum; sub init_instance { my $self = shift; my $invindex = $self->{invindex}; # prepare a main Enum which can access all terms $self->{orig_enum} = KinoSearch1::Index::SegTermEnum->new( finfos => $self->{finfos}, instream => $invindex->open_instream("$self->{seg_name}.tis"), ); # load an index Enum into memory which can point to places in main $self->{index_enum} = KinoSearch1::Index::SegTermEnum->new( finfos => $self->{finfos}, instream => $invindex->open_instream("$self->{seg_name}.tii"), is_index => 1, ); $self->{index_enum}->fill_cache; } # Return a SegTermEnum, pre-located at the right spot if a Term is supplied. sub terms { my ( $self, $term ) = @_; if ( defined $term ) { $self->fetch_term_info($term); } else { $self->{orig_enum}->reset; } return $self->{orig_enum}->clone_enum; } # Given a Term, return a TermInfo if the Term is present in the segment, or # undef if it's not. sub fetch_term_info { my ( $self, $term ) = @_; my $termstring = $term->get_termstring( $self->{finfos} ); # termstring will be undefined if field doesn't exist return unless defined $termstring; $self->_seek_enum($termstring); return $self->_scan_enum($termstring); } # Locate the main Enum as close as possible to where the term might be found. sub _seek_enum { my ( $self, $termstring ) = @_; my $index_enum = $self->{index_enum}; # get the approximate possible location of the term in the main Enum my $tii_position = $index_enum->scan_cache($termstring); my $ballpark_termstring = $index_enum->get_termstring; my $ballpark_tinfo = $index_enum->get_term_info; # point the main Enum just before the term $self->{orig_enum}->seek( $ballpark_tinfo->get_index_fileptr, ( ( $tii_position * $self->{orig_enum}->get_index_interval ) - 1 ), $ballpark_termstring, $ballpark_tinfo, ); } # One-by-one targeted iteration through TermEnum. sub _scan_enum { my ( $self, $target_termstring ) = @_; my $orig_enum = $self->{orig_enum}; # iterate through the Enum until the result is ge the term $orig_enum->scan_to($target_termstring); # if the stopping point matches the target, return info; otherwise, undef my $found_termstring = $orig_enum->get_termstring; if ( defined $found_termstring and $found_termstring eq $target_termstring ) { return $orig_enum->get_term_info; } return; } sub get_skip_interval { shift->{orig_enum}->get_skip_interval; } sub close { my $self = shift; $self->{orig_enum}->close; $self->{index_enum}->close; } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Index::TermInfosReader - look up Terms in an invindex ==head1 DESCRIPTION A TermInfosReader manages the relationship between two SegTermEnum objects - a primary and an index. It would be possible, though extremely inefficient, to scan through a single SegTermEnum every time you wanted to know about a Term. Having an index makes the process much quicker, and you need a TermInfosReader to deal with the index. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/TermInfosWriter.pm000444000765000765 2136611462203446 23303 0ustar00marvinmarvin000000000000package KinoSearch1::Index::TermInfosWriter; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params invindex => undef, seg_name => undef, is_index => 0, index_interval => 1024, skip_interval => 16, ); } our %instance_vars; sub new { my $class = shift; confess kerror() unless verify_args( \%instance_vars, @_ ); my %args = ( %instance_vars, @_ ); my $invindex = $args{invindex}; # open an outstream my $suffix = $args{is_index} ? 'tii' : 'tis'; my $filename = "$args{seg_name}.$suffix"; $invindex->delete_file($filename) if $invindex->file_exists($filename); my $outstream = $args{invindex}->open_outstream($filename); my $self = _new( $outstream, @args{qw( is_index index_interval skip_interval )} ); # create the tii doppelganger if ( !$args{is_index} ) { my $other = __PACKAGE__->new( invindex => $invindex, seg_name => $args{seg_name}, is_index => 1, ); $self->_set_other($other); $other->_set_other($self); } return $self; } sub finish { my $self = shift; my $outstream = $self->_get_outstream; # seek to near the head and write the number of terms processed $outstream->seek(4); $outstream->lu_write( 'Q', $self->_get_size ); # cue the doppelganger's exit if ( !$self->_get_is_index ) { $self->_get_other()->finish; } $outstream->close; } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Index::TermInfosWriter TermInfosWriter* _new(outstream_sv, is_index, index_interval, skip_interval) SV *outstream_sv; I32 is_index; I32 index_interval; I32 skip_interval; CODE: RETVAL = Kino1_TInfosWriter_new(outstream_sv, is_index, index_interval, skip_interval); OUTPUT: RETVAL =for comment Add a Term (encoded as a termstring) and its associated TermInfo. =cut void add(obj, termstring_sv, tinfo) TermInfosWriter *obj; SV *termstring_sv; TermInfo *tinfo; PREINIT: ByteBuf bb; STRLEN len; PPCODE: bb.ptr = SvPV(termstring_sv, len); bb.size = len; Kino1_TInfosWriter_add(obj, &bb, tinfo); =for comment Export the FORMAT constant to Perl. =cut IV FORMAT() CODE: RETVAL = KINO_TINFOS_FORMAT; OUTPUT: RETVAL SV* _set_or_get(obj, ...) TermInfosWriter *obj; ALIAS: _set_other = 1 _get_other = 2 _get_outstream = 4 _get_is_index = 6 _get_size = 8 CODE: { KINO_START_SET_OR_GET_SWITCH case 1: SvREFCNT_dec(obj->other_sv); obj->other_sv = newSVsv( ST(1) ); Kino1_extract_struct(obj->other_sv, obj->other, TermInfosWriter*, "KinoSearch1::Index::TermInfosWriter"); /* fall through */ case 2: RETVAL = newSVsv(obj->other_sv); break; case 4: RETVAL = newSVsv(obj->fh_sv); break; case 6: RETVAL = newSViv(obj->is_index); break; case 8: RETVAL = newSViv(obj->size); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL void DESTROY(obj) TermInfosWriter *obj; PPCODE: Kino1_TInfosWriter_destroy(obj); __H__ #ifndef H_KINO_TERM_INFOS_WRITER #define H_KINO_TERM_INFOS_WRITER 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1IndexTerm.h" #include "KinoSearch1IndexTermInfo.h" #include "KinoSearch1StoreOutStream.h" #include "KinoSearch1UtilByteBuf.h" #include "KinoSearch1UtilCClass.h" #include "KinoSearch1UtilMathUtils.h" #include "KinoSearch1UtilMemManager.h" #include "KinoSearch1UtilStringHelper.h" #define KINO_TINFOS_FORMAT -2 typedef struct terminfoswriter { OutStream *fh; SV *fh_sv; I32 is_index; I32 index_interval; I32 skip_interval; struct terminfoswriter* other; SV *other_sv; ByteBuf *last_termstring; TermInfo *last_tinfo; I32 last_fieldnum; double last_tis_ptr; I32 size; } TermInfosWriter; TermInfosWriter* Kino1_TInfosWriter_new(SV*, I32, I32, I32); void Kino1_TInfosWriter_add(TermInfosWriter*, ByteBuf*, TermInfo*); void Kino1_TInfosWriter_destroy(TermInfosWriter*); #endif /* include guard */ __C__ #include "KinoSearch1IndexTermInfosWriter.h" TermInfosWriter* Kino1_TInfosWriter_new(SV *outstream_sv, I32 is_index, I32 index_interval, I32 skip_interval) { TermInfosWriter *obj; /* allocate */ Kino1_New(0, obj, 1, TermInfosWriter); /* assign */ obj->is_index = is_index; obj->index_interval = index_interval; obj->skip_interval = skip_interval; obj->fh_sv = newSVsv(outstream_sv); Kino1_extract_struct(obj->fh_sv, obj->fh, OutStream*, "KinoSearch1::Store::OutStream"); /* NOTE: this value forces the first field_num in the .tii file to -1. * Do not change it. */ obj->last_termstring = Kino1_BB_new_string("\xff\xff", 2); obj->last_tinfo = Kino1_TInfo_new(); obj->last_fieldnum = -1; obj->last_tis_ptr = 0, obj->size = 0; obj->other = NULL; obj->other_sv = &PL_sv_undef; /* write file header */ obj->fh->write_int(obj->fh, KINO_TINFOS_FORMAT); obj->fh->write_long(obj->fh, 0.0); /* return to fill in later */ obj->fh->write_int(obj->fh, index_interval); obj->fh->write_int(obj->fh, skip_interval); return obj; } /* Write out a term/terminfo combo. */ void Kino1_TInfosWriter_add(TermInfosWriter* obj, ByteBuf* termstring_bb, TermInfo* tinfo) { char *termstring, *last_tstring; STRLEN termstring_len, last_tstring_len; I32 field_num; I32 overlap; char *diff_start_str; STRLEN diff_len; OutStream* fh; /* make local copy */ fh = obj->fh; /* write a subset of the entries to the .tii index */ if ( (obj->size % obj->index_interval == 0) && (!obj->is_index) ) { Kino1_TInfosWriter_add(obj->other, obj->last_termstring, obj->last_tinfo); } /* extract string pointers and string lengths */ termstring = termstring_bb->ptr; last_tstring = obj->last_termstring->ptr; termstring_len = termstring_bb->size; last_tstring_len = obj->last_termstring->size; /* to obtain field number, decode packed 'n' at top of termstring */ field_num = (I16)Kino1_decode_bigend_U16(termstring); /* move past field_num */ termstring += KINO_FIELD_NUM_LEN; last_tstring += KINO_FIELD_NUM_LEN; termstring_len -= KINO_FIELD_NUM_LEN; last_tstring_len -= KINO_FIELD_NUM_LEN; /* count how many bytes the strings share at the top */ overlap = Kino1_StrHelp_string_diff(last_tstring, termstring, last_tstring_len, termstring_len); diff_start_str = termstring + overlap; diff_len = termstring_len - overlap; /* write number of common bytes */ fh->write_vint(fh, overlap); /* write common bytes */ fh->write_string(fh, diff_start_str, diff_len); /* write field number and doc_freq */ fh->write_vint(fh, field_num); fh->write_vint(fh, tinfo->doc_freq); /* delta encode filepointers */ fh->write_vlong(fh, (tinfo->frq_fileptr - obj->last_tinfo->frq_fileptr) ); fh->write_vlong(fh, (tinfo->prx_fileptr - obj->last_tinfo->prx_fileptr) ); /* write skipdata */ if (tinfo->doc_freq >= obj->skip_interval) fh->write_vint(fh, tinfo->skip_offset); /* the .tii index file gets a pointer to the location of the primary */ if (obj->is_index) { double tis_ptr; tis_ptr = obj->other->fh->tell(obj->other->fh); obj->fh->write_vlong(obj->fh, (tis_ptr - obj->last_tis_ptr)); obj->last_tis_ptr = tis_ptr; } /* track number of terms */ obj->size++; /* remember for delta encoding */ Kino1_BB_assign_string(obj->last_termstring, termstring_bb->ptr, termstring_bb->size); StructCopy(tinfo, obj->last_tinfo, TermInfo); } void Kino1_TInfosWriter_destroy(TermInfosWriter *obj) { SvREFCNT_dec(obj->fh_sv); SvREFCNT_dec(obj->other_sv); Kino1_BB_destroy(obj->last_termstring); Kino1_TInfo_destroy(obj->last_tinfo); Kino1_Safefree(obj); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Index::TermInfosWriter - write a term dictionary ==head1 DESCRIPTION The TermInfosWriter write both parts of the term dictionary. The primary instance creates a shadow TermInfosWriter that writes the index. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Index/TermVector.pm000444000765000765 174611462203446 22252 0ustar00marvinmarvin000000000000package KinoSearch1::Index::TermVector; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # params / members field => undef, text => undef, positions => undef, start_offsets => undef, end_offsets => undef, ); __PACKAGE__->ready_get_set( qw( field text positions start_offsets end_offsets ) ); } sub init_instance { my $self = shift; $self->{$_} ||= [] for qw( positions start_offsets end_offsets ); } 1; __END__ __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Index::TermVector - Term freq and positional data ==head1 DESCRIPTION Ancillary information about a Term. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/QueryParser000755000765000765 011462203446 20670 5ustar00marvinmarvin000000000000KinoSearch1-1.01/lib/KinoSearch1/QueryParser/QueryParser.pm000444000765000765 2641311462203446 23673 0ustar00marvinmarvin000000000000package KinoSearch1::QueryParser::QueryParser; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor args / members analyzer => undef, default_boolop => 'OR', default_field => undef, # back compat fields => undef, # members bool_groups => undef, phrases => undef, bool_group_re => undef, phrase_re => undef, label_inc => 0, ); } use KinoSearch1::Analysis::TokenBatch; use KinoSearch1::Analysis::Tokenizer; use KinoSearch1::Search::BooleanQuery; use KinoSearch1::Search::PhraseQuery; use KinoSearch1::Search::TermQuery; use KinoSearch1::Index::Term; sub init_instance { my $self = shift; $self->{bool_groups} = {}; $self->{phrases} = {}; croak("default_boolop must be either 'AND' or 'OR'") unless $self->{default_boolop} =~ /^(?:AND|OR)$/; # create a random string that presumably won't appear in a search string my @chars = ( 'A' .. 'Z' ); my $randstring = ''; $randstring .= $chars[ rand @chars ] for ( 1 .. 16 ); $self->{randstring} = $randstring; # create labels which won't appear in search strings $self->{phrase_re} = qr/^(_phrase$randstring\d+)/; $self->{bool_group_re} = qr/^(_boolgroup$randstring\d+)/; # verify fields param my $fields = defined $self->{fields} ? $self->{fields} : [ $self->{default_field} ]; croak("Required parameter 'fields' not supplied as arrayref") unless ( defined $fields and reftype($fields) eq 'ARRAY' ); $self->{fields} = $fields; # verify analyzer croak("Missing required param 'analyzer'") unless a_isa_b( $self->{analyzer}, 'KinoSearch1::Analysis::Analyzer' ); } # regex matching a quoted string my $quoted_re = qr/ " # opening quote ( # capture [^"]*? # anything not a quote ) (?:"|$) # closed by either a quote or end of string /xsm; # regex matching a parenthetical group my $paren_re = qr/ \( # opening paren ( # capture [^()]*? # anything not a paren ) (?:\)|$) # closed by paren or end of string /xsm; # regex matching a negating boolean operator my $neg_re = qr/^(?: NOT\s+ # NOT followed by space |-(?=\S) # minus followed by something not-spacey )/xsm; # regex matching a requiring boolean operator my $req_re = qr/^ \+(?=\S) # plus followed by something not-spacey /xsm; # regex matching a field indicator my $field_re = qr/^ ( # capture [^"(:\s]+ # non-spacey string ) : # followed by : /xsm; sub parse { my ( $self, $qstring_orig, $default_fields ) = @_; $qstring_orig = '' unless defined $qstring_orig; $default_fields ||= $self->{fields}; my $default_boolop = $self->{default_boolop}; my @clauses; # substitute contiguous labels for phrases and boolean groups my $qstring = $self->_extract_phrases($qstring_orig); $qstring = $self->_extract_boolgroups($qstring); local $_ = $qstring; while ( bytes::length $_ ) { # fast-forward past whitespace next if s/^\s+//; my $occur = $default_boolop eq 'AND' ? 'MUST' : 'SHOULD'; if (s/^AND\s+//) { if (@clauses) { # require the previous clause (unless it's negated) if ( $clauses[-1]{occur} eq 'SHOULD' ) { $clauses[-1]{occur} = 'MUST'; } } # require this clause $occur = 'MUST'; } elsif (s/^OR\s+//) { if (@clauses) { $clauses[-1]{occur} = 'SHOULD'; } $occur = 'SHOULD'; } # detect tokens which cause this clause to be required or negated if (s/$neg_re//) { $occur = 'MUST_NOT'; } elsif (s/$req_re//) { $occur = 'MUST'; } # set the field my $fields = s/^$field_re// ? [$1] : $default_fields; # if a phrase label is detected... if (s/$self->{phrase_re}//) { my $query; # retreive the text and analyze it my $orig_phrase_text = delete $self->{phrases}{$1}; my $token_texts = $self->_analyze($orig_phrase_text); if (@$token_texts) { my $query = $self->_get_field_query( $fields, $token_texts ); push @clauses, { query => $query, occur => $occur } if defined $query; } } # if a label indicating a bool group is detected... elsif (s/$self->{bool_group_re}//) { # parse boolean subqueries recursively my $inner_text = delete $self->{bool_groups}{$1}; my $query = $self->parse( $inner_text, $fields ); push @clauses, { query => $query, occur => $occur }; } # what's left is probably a term elsif (s/([^"(\s]+)//) { my $token_texts = $self->_analyze($1); @$token_texts = grep { $_ ne '' } @$token_texts; if (@$token_texts) { my $query = $self->_get_field_query( $fields, $token_texts ); push @clauses, { occur => $occur, query => $query }; } } } if ( @clauses == 1 and $clauses[0]{occur} ne 'MUST_NOT' ) { # if it's just a simple query, return it unwrapped return $clauses[0]{query}; } else { # otherwise, build a boolean query my $bool_query = KinoSearch1::Search::BooleanQuery->new; for my $clause (@clauses) { $bool_query->add_clause( query => $clause->{query}, occur => $clause->{occur}, ); } return $bool_query; } } # Wrap a TermQuery/PhraseQuery to deal with multiple fields. sub _get_field_query { my ( $self, $fields, $token_texts ) = @_; my @queries = grep { defined $_ } map { $self->_gen_single_field_query( $_, $token_texts ) } @$fields; if ( @queries == 0 ) { return; } elsif ( @queries == 1 ) { return $queries[0]; } else { my $wrapper_query = KinoSearch1::Search::BooleanQuery->new; for my $query (@queries) { $wrapper_query->add_clause( query => $query, occur => 'SHOULD', ); } return $wrapper_query; } } # Create a TermQuery, a PhraseQuery, or nothing. sub _gen_single_field_query { my ( $self, $field, $token_texts ) = @_; if ( @$token_texts == 1 ) { my $term = KinoSearch1::Index::Term->new( $field, $token_texts->[0] ); return KinoSearch1::Search::TermQuery->new( term => $term ); } elsif ( @$token_texts > 1 ) { my $phrase_query = KinoSearch1::Search::PhraseQuery->new; for my $token_text (@$token_texts) { $phrase_query->add_term( KinoSearch1::Index::Term->new( $field, $token_text ), ); } return $phrase_query; } } # break a string into tokens sub _analyze { my ( $self, $string ) = @_; my $token_batch = KinoSearch1::Analysis::TokenBatch->new; $token_batch->append( $string, 0, bytes::length($string) ); $token_batch = $self->{analyzer}->analyze($token_batch); my @token_texts; while ( $token_batch->next ) { push @token_texts, $token_batch->get_text; } return \@token_texts; } # replace all phrases with labels sub _extract_phrases { my ( $self, $qstring ) = @_; while ( $qstring =~ $quoted_re ) { my $label = sprintf( "_phrase$self->{randstring}%d", $self->{label_inc}++ ); $qstring =~ s/$quoted_re/$label /; # extra space for safety # store the phrase text for later retrieval $self->{phrases}{$label} = $1; } return $qstring; } # recursively replace boolean groupings with labels, innermost first sub _extract_boolgroups { my ( $self, $qstring ) = @_; while ( $qstring =~ $paren_re ) { my $label = sprintf( "_boolgroup$self->{randstring}%d", $self->{label_inc}++ ); $qstring =~ s/$paren_re/$label /; # extra space for safety # store the text for later retrieval $self->{bool_groups}{$label} = $1; } return $qstring; } 1; __END__ =head1 NAME KinoSearch1::QueryParser::QueryParser - transform a string into a Query object =head1 SYNOPSIS my $query_parser = KinoSearch1::QueryParser::QueryParser->new( analyzer => $analyzer, fields => [ 'bodytext' ], ); my $query = $query_parser->parse( $query_string ); my $hits = $searcher->search( query => $query ); =head1 DESCRIPTION The QueryParser accepts search strings as input and produces Query objects, suitable for feeding into L. =head2 Syntax The following constructs are recognized by QueryParser. =over =item * Boolean operators 'AND', 'OR', and 'AND NOT'. =item * Prepented +plus and -minus, indicating that the labeled entity should be either required or forbidden -- be it a single word, a phrase, or a parenthetical group. =item * Logical groups, delimited by parentheses. =item * Phrases, delimited by double quotes. =item * Field-specific terms, in the form of C. (The field specified by fieldname will be used instead of the QueryParser's default fields). A field can also be given to a logical group, in which case it is the same as if the field had been prepended onto every term in the group. For example: C is the same as C. =back =head1 METHODS =head2 new my $query_parser = KinoSearch1::QueryParser::QueryParser->new( analyzer => $analyzer, # required fields => [ 'bodytext' ], # required default_boolop => 'AND', # default: 'OR' ); Constructor. Takes hash-style parameters: =over =item * B - An object which subclasses L. This B be identical to the Analyzer used at index-time, or the results won't match up. =item * B - the names of the fields which will be searched against. Must be supplied as an arrayref. =item * B - deprecated. Use C instead. =item * B - two possible values: 'AND' and 'OR'. The default is 'OR', which means: return documents which match any of the query terms. If you want only documents which match all of the query terms, set this to 'AND'. =back =head2 parse my $query = $query_parser->parse( $query_string ); Turn a query string into a Query object. Depending on the contents of the query string, the returned object could be any one of several subclasses of L. =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Search000755000765000765 011462203446 17613 5ustar00marvinmarvin000000000000KinoSearch1-1.01/lib/KinoSearch1/Search/BooleanClause.pm000444000765000765 215111462203446 23021 0ustar00marvinmarvin000000000000package KinoSearch1::Search::BooleanClause; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( occur => 'SHOULD', query => undef, ); } sub init_instance { my $self = shift; croak("invalid value for 'occur': '$self->{occur}'") unless $self->{occur} =~ /^(?:MUST|MUST_NOT|SHOULD)$/; } __PACKAGE__->ready_get_set(qw( occur query )); sub is_required { shift->{occur} eq 'MUST' } sub is_prohibited { shift->{occur} eq 'MUST_NOT' } my %string_representations = ( MUST => '+', MUST_NOT => '-', SHOULD => '', ); sub to_string { my $self = shift; my $string = $string_representations{"$self->{occur}"} . $self->{query}->to_string; return $string; } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Search::BooleanClause - clause in a BooleanQuery ==head1 DESCRIPTION A clause in a BooleanQuery. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Search/BooleanQuery.pm000444000765000765 1304111462203445 22731 0ustar00marvinmarvin000000000000package KinoSearch1::Search::BooleanQuery; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Search::Query ); BEGIN { __PACKAGE__->init_instance_vars( # constructor args / members disable_coord => 0, # members clauses => undef, max_clause_count => 1024, ); __PACKAGE__->ready_get(qw( clauses )); } use KinoSearch1::Search::BooleanClause; sub init_instance { my $self = shift; $self->{clauses} = []; } # Add an subquery tagged with boolean characteristics. sub add_clause { my $self = shift; my $clause = @_ == 1 ? shift : KinoSearch1::Search::BooleanClause->new(@_); push @{ $self->{clauses} }, $clause; confess("not a BooleanClause") unless a_isa_b( $clause, 'KinoSearch1::Search::BooleanClause' ); confess("Too many clauses") if @{ $self->{clauses} } > $self->{max_clause_count}; } sub get_similarity { my ( $self, $searcher ) = @_; if ( $self->{disable_coord} ) { confess "disable_coord not implemented yet"; } return $searcher->get_similarity; } sub extract_terms { my $self = shift; my @terms; for my $clause ( @{ $self->{clauses} } ) { push @terms, $clause->get_query()->extract_terms; } return @terms; } sub create_weight { my ( $self, $searcher ) = @_; return KinoSearch1::Search::BooleanWeight->new( parent => $self, searcher => $searcher, ); } sub clone { shift->todo_death } package KinoSearch1::Search::BooleanWeight; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Search::Weight ); BEGIN { __PACKAGE__->init_instance_vars( # members weights => undef, ); } use KinoSearch1::Search::BooleanScorer; sub init_instance { my $self = shift; $self->{weights} = []; my ( $weights, $searcher ) = @{$self}{ 'weights', 'searcher' }; $self->{similarity} = $self->{parent}->get_similarity($searcher); for my $clause ( @{ $self->{parent}{clauses} } ) { my $query = $clause->get_query; push @$weights, $query->create_weight($searcher); } undef $self->{searcher}; # don't want the baggage } sub get_value { shift->{parent}->get_boost } sub sum_of_squared_weights { my $self = shift; my $sum = 0; $sum += $_->sum_of_squared_weights for @{ $self->{weights} }; # compound the weight of each sub-Weight $sum *= $self->{parent}->get_boost**2; return $sum; } sub normalize { my ( $self, $query_norm ) = @_; $_->normalize($query_norm) for @{ $self->{weights} }; } sub scorer { my ( $self, $reader ) = @_; my $scorer = KinoSearch1::Search::BooleanScorer->new( similarity => $self->{similarity}, ); # add all the subscorers one by one my $clauses = $self->{parent}{clauses}; my $i = 0; for my $weight ( @{ $self->{weights} } ) { my $clause = $clauses->[ $i++ ]; my $subscorer = $weight->scorer($reader); if ( defined $subscorer ) { $scorer->add_subscorer( $subscorer, $clause->get_occur ); } elsif ( $clause->is_required ) { # if any required clause fails, the whole thing fails return undef; } } return $scorer; } 1; __END__ =head1 NAME KinoSearch1::Search::BooleanQuery - match boolean combinations of Queries =head1 SYNOPSIS my $bool_query = KinoSearch1::Search::BooleanQuery->new; $bool_query->add_clause( query => $term_query, occur => 'MUST' ); my $hits = $searcher->search( query => $bool_query ); =head1 DESCRIPTION BooleanQueries are super-Query objects which match boolean combinations of other Queries. One way of producing a BooleanQuery is to feed a query string along the lines of C to a L object: my $bool_query = $query_parser->parse( 'this AND NOT that' ); It's also possible to achieve the same end by manually constructing the query piece by piece: my $bool_query = KinoSearch1::Search::BooleanQuery->new; my $this_query = KinoSearch1::Search::TermQuery->new( term => KinoSearch1::Index::Term->new( 'bodytext', 'this' ), ); $bool_query->add_clause( query => $this_query, occur => 'MUST' ); my $that_query = KinoSearch1::Search::TermQuery->new( term => KinoSearch1::Index::Term->new( 'bodytext', 'that' ), ); $bool_query->add_clause( query => $that_query, occur => 'MUST_NOT' ); QueryParser objects and hand-rolled Queries can work together: my $general_query = $query_parser->parse($q); my $news_only = KinoSearch1::Search::TermQuery->new( term => KinoSearch1::Index::Term->new( 'category', 'news' ); ); $bool_query->add_clause( query => $general_query, occur => 'MUST' ); $bool_query->add_clause( query => $news_only, occur => 'MUST' ); =head1 METHODS =head2 new my $bool_query = KinoSearch1::Search::BooleanQuery->new; Constructor. Takes no arguments. =head2 add_clause $bool_query->add_clause( query => $query, # required occur => 'MUST', # default: 'SHOULD' ); Add a clause to the BooleanQuery. Takes hash-style parameters: =over =item * B - an object which belongs to a subclass of L. =item * B - must be one of three possible values: 'SHOULD', 'MUST', or 'MUST_NOT'. =back =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Search/BooleanScorer.pm000444000765000765 2524011462203446 23066 0ustar00marvinmarvin000000000000package KinoSearch1::Search::BooleanScorer; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Search::Scorer ); BEGIN { __PACKAGE__->init_instance_vars() } our %instance_vars; sub new { my $self = shift->SUPER::new; confess kerror() unless verify_args( \%instance_vars, @_ ); my %args = ( %instance_vars, @_ ); $self->set_similarity( $args{similarity} ); $self->_init_child; return $self; } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Search::BooleanScorer void _init_child(scorer) Scorer *scorer; PPCODE: Kino1_BoolScorer_init_child(scorer); =for comment Add a scorer for a sub-query of the BooleanQuery. =cut void add_subscorer(scorer, subscorer_sv, occur) Scorer *scorer; SV *subscorer_sv; char *occur; PREINIT: BoolScorerChild* child; Scorer *subscorer; SV *subscorer_sv_copy; PPCODE: child = (BoolScorerChild*)scorer->child; Kino1_extract_struct(subscorer_sv, subscorer, Scorer*, "KinoSearch1::Search::Scorer"); subscorer_sv_copy = newSVsv(subscorer_sv); av_push(child->subscorers_av, subscorer_sv_copy); Kino1_BoolScorer_add_subscorer(scorer, subscorer, occur); SV* _boolean_scorer_set_or_get(scorer, ...) Scorer* scorer; ALIAS: _get_subscorer_storage = 2 CODE: { BoolScorerChild* child = (BoolScorerChild*)scorer->child; KINO_START_SET_OR_GET_SWITCH case 2: RETVAL = newRV((SV*)child->subscorers_av); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL void DESTROY(scorer) Scorer *scorer; PPCODE: Kino1_BoolScorer_destroy(scorer); __H__ #ifndef H_KINO_BOOLEAN_SCORER #define H_KINO_BOOLEAN_SCORER 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1SearchScorer.h" #include "KinoSearch1UtilMemManager.h" #define KINO_MATCH_BATCH_SIZE (1 << 11) #define KINO_MATCH_BATCH_DOC_MASK (KINO_MATCH_BATCH_SIZE - 1) /* A MatchBatch can hold scoring data for 2048 documents. */ typedef struct matchbatch { U32 count; float *scores; U32 *matcher_counts; U32 *bool_masks; U32 *recent_docs; } MatchBatch; typedef struct boolsubscorer { Scorer *scorer; U32 bool_mask; bool done; struct boolsubscorer *next_subscorer; } BoolSubScorer; typedef struct boolscorerchild { U32 doc; U32 end; U32 max_coord; float *coord_factors; U32 required_mask; U32 prohibited_mask; U32 next_mask; MatchBatch *mbatch; BoolSubScorer *subscorers; /* linked list */ AV *subscorers_av; } BoolScorerChild; void Kino1_BoolScorer_init_child(Scorer*); MatchBatch* Kino1_BoolScorer_new_mbatch(); void Kino1_BoolScorer_clear_mbatch(MatchBatch*); void Kino1_BoolScorer_compute_coord_factors(Scorer*); void Kino1_BoolScorer_add_subscorer(Scorer*, Scorer*, char*); bool Kino1_BoolScorer_next(Scorer*); float Kino1_BoolScorer_score(Scorer*); U32 Kino1_BoolScorer_doc(Scorer*); void Kino1_BoolScorer_destroy(Scorer*); #endif /* include guard */ __C__ #include "KinoSearch1SearchBooleanScorer.h" void Kino1_BoolScorer_init_child(Scorer *scorer) { BoolScorerChild *child; Kino1_New(0, child, 1, BoolScorerChild); scorer->child = child; /* define Scorer's abstract methods */ scorer->next = Kino1_BoolScorer_next; scorer->doc = Kino1_BoolScorer_doc; scorer->score = Kino1_BoolScorer_score; /* init */ child->end = 0; child->max_coord = 1; child->coord_factors = NULL; child->required_mask = 0; child->prohibited_mask = 0; child->next_mask = 1; child->mbatch = Kino1_BoolScorer_new_mbatch(); child->subscorers = NULL; child->subscorers_av = newAV(); } MatchBatch* Kino1_BoolScorer_new_mbatch() { MatchBatch* mbatch; /* allocate and init */ Kino1_New(0, mbatch, 1, MatchBatch); Kino1_New(0, mbatch->scores, KINO_MATCH_BATCH_SIZE, float); Kino1_New(0, mbatch->matcher_counts, KINO_MATCH_BATCH_SIZE, U32); Kino1_New(0, mbatch->bool_masks, KINO_MATCH_BATCH_SIZE, U32); Kino1_New(0, mbatch->recent_docs, KINO_MATCH_BATCH_SIZE, U32); mbatch->count = 0; return mbatch; } /* Return a MatchBatch to a "zeroed" state. Only the matcher_counts and the * count are actually cleared; the rest get initialized the next time a doc * gets captured. */ void Kino1_BoolScorer_clear_mbatch(MatchBatch *mbatch) { Zero(mbatch->matcher_counts, KINO_MATCH_BATCH_SIZE, U32); mbatch->count = 0; } /* BooleanScorers award bonus points to documents which match multiple * subqueries. This routine calculates the size of the bonuses. */ void Kino1_BoolScorer_compute_coord_factors(Scorer *scorer) { BoolScorerChild *child; float *coord_factors; U32 i; child = (BoolScorerChild*)scorer->child; Kino1_New(0, child->coord_factors, (child->max_coord + 1), float); coord_factors = child->coord_factors; for (i = 0; i <= child->max_coord; i++) { *coord_factors++ = scorer->sim->coord(scorer->sim, i, child->max_coord); } } void Kino1_BoolScorer_add_subscorer(Scorer* main_scorer, Scorer* subscorer, char *occur) { BoolScorerChild *child; BoolSubScorer *bool_subscorer; child = (BoolScorerChild*)main_scorer->child; Kino1_New(0, bool_subscorer, 1, BoolSubScorer); bool_subscorer->scorer = subscorer; /* if this scorer is required or negated, assign it a unique mask bit. */ if (strnEQ(occur, "SHOULD", 6)) { bool_subscorer->bool_mask = 0; child->max_coord++; } else { if (child->next_mask == 0) { Kino1_confess("more than 32 required or prohibited clauses"); } bool_subscorer->bool_mask = child->next_mask; child->next_mask <<= 1; if (strnEQ(occur, "MUST_NOT", 8)) { child->prohibited_mask |= bool_subscorer->bool_mask; } else { /* "MUST" occur */ child->max_coord++; child->required_mask |= bool_subscorer->bool_mask; } } /* prime the pump */ bool_subscorer->done = !subscorer->next(subscorer); /* link up the linked list of subscorers */ bool_subscorer->next_subscorer = child->subscorers; child->subscorers = bool_subscorer; } bool Kino1_BoolScorer_next(Scorer* scorer) { BoolScorerChild *child; MatchBatch *mbatch; bool more; U32 doc; U32 masked_doc; U32 bool_mask; BoolSubScorer *sub; child = (BoolScorerChild*)scorer->child; mbatch = child->mbatch; do { while (mbatch->count-- > 0) { /* check to see if the doc is prohibited */ doc = mbatch->recent_docs[ mbatch->count ]; masked_doc = doc & KINO_MATCH_BATCH_DOC_MASK; bool_mask = mbatch->bool_masks[masked_doc]; if ( (bool_mask & child->prohibited_mask) == 0 && (bool_mask & child->required_mask) == child->required_mask ) { /* it's not prohibited, so next() was successful */ child->doc = doc; return 1; } } /* refill the queue, processing all docs within the next range */ Kino1_BoolScorer_clear_mbatch(mbatch); more = 0; child->end += KINO_MATCH_BATCH_SIZE; /* iterate through subscorers, caching results to the MatchBatch */ for (sub = child->subscorers; sub != NULL; sub = sub->next_subscorer) { Scorer *scorer = sub->scorer; while (!sub->done && scorer->doc(scorer) < child->end) { doc = scorer->doc(scorer); masked_doc = doc & KINO_MATCH_BATCH_DOC_MASK; if (mbatch->matcher_counts[masked_doc] == 0) { /* first scorer to hit this doc */ mbatch->recent_docs[mbatch->count] = doc; mbatch->count++; mbatch->matcher_counts[masked_doc] = 1; mbatch->scores[masked_doc] = scorer->score(scorer); mbatch->bool_masks[masked_doc] = sub->bool_mask; } else { mbatch->matcher_counts[masked_doc]++; mbatch->scores[masked_doc] += scorer->score(scorer); mbatch->bool_masks[masked_doc] |= sub->bool_mask; } /* check whether this scorer is exhausted */ sub->done = !scorer->next(scorer); } /* if at least one scorer succeeded, loop back */ if (!sub->done) { more = 1; } } } while (mbatch->count > 0 || more); /* out of docs! we're done. */ return 0; } float Kino1_BoolScorer_score(Scorer* scorer){ BoolScorerChild *child = (BoolScorerChild*)scorer->child; MatchBatch *mbatch = child->mbatch; U32 masked_doc; float score; if (child->coord_factors == NULL) { Kino1_BoolScorer_compute_coord_factors(scorer); } /* retrieve the docs accumulated score from the MatchBatch */ masked_doc = child->doc & KINO_MATCH_BATCH_DOC_MASK; score = mbatch->scores[masked_doc]; /* assign bonus for multi-subscorer matches */ score *= child->coord_factors[ mbatch->matcher_counts[masked_doc] ]; return score; } U32 Kino1_BoolScorer_doc(Scorer* scorer) { BoolScorerChild *child = (BoolScorerChild*)scorer->child; return child->doc; } void Kino1_BoolScorer_destroy(Scorer * scorer) { BoolSubScorer *sub, *next_sub; BoolScorerChild *child; child = (BoolScorerChild*)scorer->child; if (child->mbatch != NULL) { Kino1_Safefree(child->mbatch->scores); Kino1_Safefree(child->mbatch->matcher_counts); Kino1_Safefree(child->mbatch->bool_masks); Kino1_Safefree(child->mbatch->recent_docs); Kino1_Safefree(child->mbatch); } sub = child->subscorers; while (sub != NULL) { next_sub = sub->next_subscorer; Kino1_Safefree(sub); sub = next_sub; /* individual scorers will be GC'd on their own by Perl */ } Kino1_Safefree(child->coord_factors); SvREFCNT_dec((SV*)child->subscorers_av); Kino1_Safefree(child); Kino1_Scorer_destroy(scorer); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Search::BooleanScorer - scorer for BooleanQuery ==head1 DESCRIPTION Implementation of Scorer for BooleanQuery. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Search/Hit.pm000444000765000765 321411462203445 21031 0ustar00marvinmarvin000000000000package KinoSearch1::Search::Hit; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members id => undef, score => undef, searcher => undef, # members doc => undef, hashref => undef, ); __PACKAGE__->ready_get(qw( id score )); } sub get_doc { my $self = shift; $self->{doc} ||= $self->{searcher}->fetch_doc( $self->{id} ); return $self->{doc}; } sub get_field_values { my $self = shift; if ( !defined $self->{hashref} ) { if ( !defined $self->{doc} ) { $self->get_doc; } $self->{hashref} = $self->{doc}->to_hashref; } return $self->{hashref}; } 1; __END__ =head1 NAME KinoSearch1::Search::Hit - successful match against a Query =head1 DESCRIPTION A Hit object is a storage vessel which holds a Doc, a floating point score, and an integer document id. =head1 METHODS =head2 get_doc my $doc = $hit->get_doc; Return the Hit's KinoSearch1::Document::Doc object. =head2 get_score my $score = $hit->get_score; Return the Hit's score. =head2 get_id my $doc_number = $hit->get_id; Return the Hit's document number. Note that this document number is not permanent, and will likely become invalid the next time the index is updated. =head2 get_field_values my $hashref = $hit->get_field_values; Return the values of the Hit's constituent fields as a hashref. =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. KinoSearch1-1.01/lib/KinoSearch1/Search/HitCollector.pm000444000765000765 2437311462203446 22732 0ustar00marvinmarvin000000000000package KinoSearch1::Search::HitCollector; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::CClass ); # all xs, other than the pragmas/includes package KinoSearch1::Search::HitQueueCollector; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Search::HitCollector ); BEGIN { __PACKAGE__->init_instance_vars( # constructor args size => undef, ); } our %instance_vars; use KinoSearch1::Search::HitQueue; sub new { my $self = shift->SUPER::new; confess kerror() unless verify_args( \%instance_vars, @_ ); my %args = @_; croak("Required parameter: 'size'") unless defined $args{size}; my $hit_queue = KinoSearch1::Search::HitQueue->new( max_size => $args{size} ); $self->_set_storage($hit_queue); $self->_define_collect; return $self; } *get_total_hits = *KinoSearch1::Search::HitCollector::get_i; *get_hit_queue = *KinoSearch1::Search::HitCollector::get_storage; sub get_max_size { shift->get_hit_queue->get_max_size; } package KinoSearch1::Search::BitCollector; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Search::HitCollector ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params capacity => 0, ); } our %instance_vars; use KinoSearch1::Util::BitVector; sub new { my $self = shift->SUPER::new; confess kerror() unless verify_args( \%instance_vars, @_ ); my %args = ( %instance_vars, @_ ); my $bit_vec = KinoSearch1::Util::BitVector->new( capacity => $args{capacity} ); $self->_set_storage($bit_vec); $self->_define_collect; return $self; } *get_bit_vector = *KinoSearch1::Search::HitCollector::get_storage; package KinoSearch1::Search::FilteredCollector; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Search::HitCollector ); BEGIN { __PACKAGE__->init_instance_vars( hit_collector => undef, filter_bits => undef, ); } our %instance_vars; sub new { my $self = shift->SUPER::new; confess kerror() unless verify_args( \%instance_vars, @_ ); my %args = @_; croak("Required parameter: 'hit_collector'") unless a_isa_b( $args{hit_collector}, "KinoSearch1::Search::HitCollector" ); $self->_set_filter_bits( $args{filter_bits} ); $self->_set_storage( $args{hit_collector} ); $self->_define_collect; return $self; } package KinoSearch1::Search::OffsetCollector; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Search::HitCollector ); BEGIN { __PACKAGE__->init_instance_vars( hit_collector => undef, offset => undef, ); } our %instance_vars; sub new { my $self = shift->SUPER::new; confess kerror() unless verify_args( \%instance_vars, @_ ); my %args = @_; croak("Required parameter: 'hit_collector'") unless a_isa_b( $args{hit_collector}, "KinoSearch1::Search::HitCollector" ); $self->_set_f( $args{offset} ); $self->_set_storage( $args{hit_collector} ); $self->_define_collect; return $self; } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Search::HitCollector void new(either_sv) SV *either_sv; PREINIT: const char *class; HitCollector *hc; PPCODE: hc = Kino1_HC_new(); class = sv_isobject(either_sv) ? sv_reftype(either_sv, 0) : SvPV_nolen(either_sv); ST(0) = sv_newmortal(); sv_setref_pv(ST(0), class, (void*)hc); XSRETURN(1); =begin comment $hit_collector->collect( $doc_num, $score ); Process a doc_num/score combination. In production, this method should not be called from Perl, as collecting hits is an extremely data-intensive operation. =end comment =cut void collect(hc, doc_num, score) HitCollector *hc; U32 doc_num; float score; PPCODE: hc->collect(hc, doc_num, score); SV* _set_or_get(hc, ...) HitCollector *hc; ALIAS: _set_storage = 1 get_storage = 2 _set_i = 3 get_i = 4 _set_f = 5 _get_f = 6 _set_filter_bits = 7 _get_filter_bits = 8 CODE: { KINO_START_SET_OR_GET_SWITCH case 1: SvREFCNT_dec(hc->storage_ref); hc->storage_ref = newSVsv( ST(1) ); Kino1_extract_anon_struct(hc->storage_ref, hc->storage); /* fall through */ case 2: RETVAL = newSVsv(hc->storage_ref); break; case 3: hc->i = SvUV( ST(1) ); /* fall through */ case 4: RETVAL = newSVuv(hc->i); break; case 5: hc->f = SvNV( ST(1) ); /* fall through */ case 6: RETVAL = newSVnv(hc->f); break; case 7: SvREFCNT_dec(hc->filter_bits_ref); hc->filter_bits_ref = newSVsv( ST(1) ); Kino1_extract_struct( hc->filter_bits_ref, hc->filter_bits, BitVector*, "KinoSearch1::Util::BitVector" ); /* fall through */ case 8: RETVAL = newSVsv(hc->filter_bits_ref); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL void DESTROY(hc) HitCollector *hc; PPCODE: Kino1_HC_destroy(hc); MODULE = KinoSearch1 PACKAGE = KinoSearch1::Search::HitQueueCollector void _define_collect(hc) HitCollector *hc; PPCODE: hc->collect = Kino1_HC_collect_HitQueue; MODULE = KinoSearch1 PACKAGE = KinoSearch1::Search::BitCollector void _define_collect(hc) HitCollector *hc; PPCODE: hc->collect = Kino1_HC_collect_BitVec; MODULE = KinoSearch1 PACKAGE = KinoSearch1::Search::FilteredCollector void _define_collect(hc); HitCollector *hc; PPCODE: hc->collect = Kino1_HC_collect_filtered; MODULE = KinoSearch1 PACKAGE = KinoSearch1::Search::OffsetCollector void _define_collect(hc); HitCollector *hc; PPCODE: hc->collect = Kino1_HC_collect_offset; __H__ #ifndef H_KINO_HIT_COLLECTOR #define H_KINO_HIT_COLLECTOR 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1UtilCarp.h" #include "KinoSearch1UtilMathUtils.h" #include "KinoSearch1UtilBitVector.h" #include "KinoSearch1UtilPriorityQueue.h" #include "KinoSearch1UtilMemManager.h" typedef struct hitcollector { void (*collect)(struct hitcollector*, U32, float); float f; U32 i; void *storage; SV *storage_ref; BitVector *filter_bits; SV *filter_bits_ref; } HitCollector; HitCollector* Kino1_HC_new(); void Kino1_HC_collect_death(HitCollector*, U32, float); void Kino1_HC_collect_HitQueue(HitCollector*, U32, float); void Kino1_HC_collect_BitVec(HitCollector*, U32, float); void Kino1_HC_collect_filtered(HitCollector*, U32, float); void Kino1_HC_collect_offset(HitCollector*, U32, float); void Kino1_HC_destroy(HitCollector*); #endif /* include guard */ __C__ #include "KinoSearch1SearchHitCollector.h" HitCollector* Kino1_HC_new() { HitCollector *hc; /* allocate memory and init */ Kino1_New(0, hc, 1, HitCollector); hc->f = 0; hc->i = 0; hc->storage = NULL; hc->storage_ref = &PL_sv_undef; hc->filter_bits = NULL; hc->filter_bits_ref = &PL_sv_undef; /* force the subclass to spec a collect method */ hc->collect = Kino1_HC_collect_death; return hc; } void Kino1_HC_collect_death(HitCollector *hc, U32 doc_num, float score) { Kino1_confess("hit_collector->collect must be assigned in a subclass"); } void Kino1_HC_collect_HitQueue(HitCollector *hc, U32 doc_num, float score) { /* add to the total number of hits */ hc->i++; /* bail if the score doesn't exceed the minimum */ if (score < hc->f) { return; } else { SV *element; char doc_num_buf[4]; PriorityQueue *hit_queue; hit_queue = (PriorityQueue*)hc->storage; /* put a dualvar scalar -- encoded doc_num in PV, score in NV */ element = sv_newmortal(); (void)SvUPGRADE(element, SVt_PVNV); Kino1_encode_bigend_U32(doc_num, &doc_num_buf); sv_setpvn(element, doc_num_buf, (STRLEN)4); SvNV_set(element, (double)score); SvNOK_on(element); (void)Kino1_PriQ_insert(hit_queue, element); /* store the bubble score in a more accessible spot */ if (hit_queue->size == hit_queue->max_size) { SV *least_sv; least_sv = Kino1_PriQ_peek(hit_queue); hc->f = SvNV(least_sv); } } } void Kino1_HC_collect_BitVec(HitCollector *hc, U32 doc_num, float score) { BitVector *bit_vec; bit_vec = (BitVector*)hc->storage; /* add to the total number of hits */ hc->i++; /* add the doc_num to the BitVector */ Kino1_BitVec_set(bit_vec, doc_num); } void Kino1_HC_collect_filtered(HitCollector *hc, U32 doc_num, float score) { if (hc->filter_bits == NULL) { Kino1_confess("filter_bits not set on FilteredCollector"); } if (Kino1_BitVec_get(hc->filter_bits, doc_num)) { HitCollector *inner_collector; inner_collector = (HitCollector*)hc->storage; inner_collector->collect(inner_collector, doc_num, score); } } void Kino1_HC_collect_offset(HitCollector *hc, U32 doc_num, float score) { HitCollector *inner_collector = (HitCollector*)hc->storage; U32 offset_doc_num = doc_num + hc->f; inner_collector->collect(inner_collector, offset_doc_num, score); } void Kino1_HC_destroy(HitCollector *hc) { SvREFCNT_dec(hc->storage_ref); SvREFCNT_dec(hc->filter_bits_ref); Kino1_Safefree(hc); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Search::HitCollector - process doc/score pairs ==head1 DESCRIPTION A Scorer spits out raw doc_num/score pairs; a HitCollector decides what to do with them, based on the hc->collect method. A HitQueueCollector keeps the highest scoring N documents and their associated scores in a HitQueue while iterating through a large list. A BitCollector builds a BitVector with a set bit for each doc number (scores are irrelevant). A FilterCollector wraps another HitCollector, only allowing the inner collector to "see" doc_num/score pairs which make it through the filter. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Search/HitQueue.pm000444000765000765 451411462203446 22043 0ustar00marvinmarvin000000000000package KinoSearch1::Search::HitQueue; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::PriorityQueue ); BEGIN { __PACKAGE__->init_instance_vars() } use KinoSearch1::Search::Hit; sub new { my $either = shift; my $self = $either->SUPER::new(@_); $self->define_less_than; return $self; } # Create an array of "empty" Hit objects -- they have scores and ids, # but the stored fields have yet to be retrieved. sub hits { my ( $self, $start_offset, $num_wanted, $searcher ) = @_; my @hits = @{ $self->pop_all }; if ( defined $start_offset and defined $num_wanted ) { @hits = splice( @hits, $start_offset, $num_wanted ); } @hits = map { KinoSearch1::Search::Hit->new( id => unpack( 'N', "$_" ), score => 0 + $_, searcher => $searcher ) } @hits; return \@hits; } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Search::HitQueue void define_less_than(hitq) PriorityQueue *hitq; PPCODE: hitq->less_than = &Kino1_HitQ_less_than; __H__ #ifndef H_KINOSEARCH_SEARCH_HIT_QUEUE #define H_KINOSEARCH_SEARCH_HIT_QUEUE 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" bool Kino1_HitQ_less_than(SV*, SV*); #endif /* include guard */ __C__ #include "KinoSearch1SearchHitQueue.h" /* Compare the NV then the PV for two scalars. */ bool Kino1_HitQ_less_than(SV* a, SV* b) { char *ptr_a, *ptr_b; if (SvNV(a) == SvNV(b)) { ptr_a = SvPVX(a); ptr_b = SvPVX(b); /* sort by doc_num second */ return (bool) (memcmp(ptr_b, ptr_a, 4) < 0); } /* sort by score first */ return SvNV(a) < SvNV(b); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Search::HitQueue - track highest scoring docs ==head1 DESCRIPTION HitQueue, a subclass of KinoSearch1::Util::PriorityQueue, keeps track of score/doc_num pairs. Each pair is stored in a single scalar, with the document number in the PV and the score in the NV. The encoding algorithm is functionally equivalent to this: my $encoded_doc_num = pack('N', $doc_num); my $doc_num_slash_score = dualvar( $score, $encoded_doc_num ); ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Search/Hits.pm000444000765000765 1222211462203445 21233 0ustar00marvinmarvin000000000000package KinoSearch1::Search::Hits; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # params/members searcher => undef, query => undef, filter => undef, sort_spec => undef, num_docs => undef, # members weight => undef, highlighter => undef, hit_docs => undef, total_hits => undef, ); __PACKAGE__->ready_get(qw( hit_docs )); } use KinoSearch1::Highlight::Highlighter; use KinoSearch1::Search::HitCollector; sub init_instance { my $self = shift; croak("required parameter 'query' not supplied") unless $self->{query}; croak("required parameter 'searcher' not supplied") unless $self->{searcher}; # turn the Query into a Weight (so the Query won't get mussed) $self->{weight} = $self->{searcher}->create_weight( $self->{query} ); } sub seek { my ( $self, $start_offset, $num_wanted ) = @_; croak('Usage: $hits->seek( START, NUM_TO_RETRIEVE );') unless @_ = 3; # collect enough to satisfy both the offset and the num wanted my $collector = KinoSearch1::Search::HitQueueCollector->new( size => $num_wanted + $start_offset, ); # execute the search! $self->{searcher}->search_hit_collector( hit_collector => $collector, weight => $self->{weight}, filter => $self->{filter}, sort_spec => $self->{sort_spec}, ); $self->{total_hits} = $collector->get_total_hits; my $hit_queue = $collector->get_hit_queue; # turn the HitQueue into an array of Hit objects $self->{hit_docs} = $hit_queue->hits( $start_offset, $num_wanted, $self->{searcher} ); } sub total_hits { my $self = shift; $self->seek( 0, 100 ) unless defined $self->{total_hits}; return $self->{total_hits}; } sub fetch_hit { my $self = shift; $self->seek( 0, 100 ) unless defined $self->{total_hits}; my $hit = shift @{ $self->{hit_docs} }; return unless defined $hit; return $hit; } sub fetch_hit_hashref { my $self = shift; $self->seek( 0, 100 ) unless defined $self->{total_hits}; # bail if there aren't any more *captured* hits my $hit = shift @{ $self->{hit_docs} }; return unless defined $hit; # lazily fetch stored fields my $hashref = $hit->get_field_values; if ( !exists $hashref->{score} ) { $hashref->{score} = $hit->get_score; } if ( defined $self->{highlighter} and !exists $hashref->{excerpt} ) { $hashref->{excerpt} = $self->{highlighter}->generate_excerpt( $hit->get_doc ); } return $hashref; } my %create_excerpts_defaults = ( highlighter => undef, ); sub create_excerpts { my $self = shift; confess kerror() unless verify_args( \%create_excerpts_defaults, @_ ); my %args = ( %create_excerpts_defaults, @_ ); $self->{highlighter} = $args{highlighter}; $self->{highlighter}->set_terms( [ $self->{query}->extract_terms ] ); } 1; =head1 NAME KinoSearch1::Search::Hits - access search results =head1 SYNOPSIS my $hits = $searcher->search( query => $query ); $hits->seek( 0, 10 ); while ( my $hashref = $hits->fetch_hit_hashref ) { print "

$hashref->{title} $hashref->{score}

\n"; } =head1 DESCRIPTION Hits objects are used to access the results of a search. By default, a hits object provides access to the top 100 matches; the seek() method provides finer-grained control. A classic application would be paging through hits. The first time, seek to a START of 0, and retrieve 10 documents. If the user wants to see more -- and there are more than 10 total hits -- seek to a START of 10, and retrieve 10 more documents. And so on. =head1 METHODS =head2 seek $hits->seek( START, NUM_TO_RETRIEVE ); Position the Hits iterator at START, and capture NUM_TO_RETRIEVE docs. =head2 total_hits my $num_that_matched = $hits->total_hits; Return the total number of documents which matched the query used to produce the Hits object. (This number is unlikely to match NUM_TO_RETRIEVE.) =head2 fetch_hit while ( my $hit = $hits->fetch_hit ) { # ... } Return the next hit as a KinoSearch1::Search::Hit object. =head2 fetch_hit_hashref while ( my $hashref = $hits->fetch_hit_hashref ) { # ... } Return the next hit as a hashref, with the field names as keys and the field values as values. An entry for C will also be present, as will an entry for C if create_excerpts() was called earlier. However, if the document contains stored fields named "score" or "excerpt", they will not be clobbered. =head2 create_excerpts my $highlighter = KinoSearch1::Highlight::Highlighter->new( excerpt_field => 'bodytext', ); $hits->create_excerpts( highlighter => $highlighter ); Use the supplied highlighter to generate excerpts. See L. =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Search/MultiSearcher.pm000444000765000765 1433611462203446 23104 0ustar00marvinmarvin000000000000package KinoSearch1::Search::MultiSearcher; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Searcher ); BEGIN { __PACKAGE__->init_instance_vars( # members / constructor args searchables => undef, # members starts => undef, max_doc => undef, ); } use KinoSearch1::Search::Similarity; sub init_instance { my $self = shift; $self->{field_sims} = {}; # derive max_doc, relative start offsets my $max_doc = 0; my @starts; for my $searchable ( @{ $self->{searchables} } ) { push @starts, $max_doc; $max_doc += $searchable->max_doc; } $self->{max_doc} = $max_doc; $self->{starts} = \@starts; # default similarity $self->{similarity} = KinoSearch1::Search::Similarity->new unless defined $self->{similarity}; } sub get_field_names { my $self = shift; my %field_names; for my $searchable ( @{ $self->{searchables} } ) { my $sub_field_names = $searchable->get_field_names; @field_names{@$sub_field_names} = (1) x scalar @$sub_field_names; } return [ keys %field_names ]; } sub max_doc { shift->{max_doc} } sub close { } sub subsearcher { my ( $self, $doc_num ) = @_; my $i = -1; for ( @{ $self->{starts} } ) { last if $_ > $doc_num; $i++; } return $i; } sub doc_freq { my ( $self, $term ) = @_; my $doc_freq = 0; $doc_freq += $_->doc_freq($term) for @{ $self->{searchables} }; return $doc_freq; } sub fetch_doc { my ( $self, $doc_num ) = @_; my $i = $self->subsearcher($doc_num); my $searchable = $self->{searchables}[$i]; $doc_num -= $self->{starts}[$i]; return $searchable->fetch_doc($doc_num); } my %search_hit_collector_args = ( hit_collector => undef, weight => undef, filter => undef, sort_spec => undef, ); sub search_hit_collector { my $self = shift; confess kerror() unless verify_args( \%search_hit_collector_args, @_ ); my %args = ( %search_hit_collector_args, @_ ); my ( $searchables, $starts ) = @{$self}{qw( searchables starts )}; for my $i ( 0 .. $#$searchables ) { my $searchable = $searchables->[$i]; my $start = $starts->[$i]; my $collector = KinoSearch1::Search::OffsetCollector->new( hit_collector => $args{hit_collector}, offset => $start ); $searchable->search_hit_collector( %args, hit_collector => $collector ); } } sub rewrite { my ( $self, $orig_query ) = @_; # not necessary to rewrite until we add query types that need it return $orig_query; #my @queries = map { $_->rewrite($orig_query) } @{ $self->{searchables} }; #my $combined = $queries->[0]->combine(\@queries); #return $combined; } sub create_weight { my ( $self, $query ) = @_; my $searchables = $self->{searchables}; my $rewritten_query = $self->rewrite($query); # generate an array of unique terms my @terms = $rewritten_query->extract_terms; my %unique_terms; for my $term (@terms) { if ( a_isa_b( $term, "KinoSearch1::Index::Term" ) ) { $unique_terms{ $term->to_string } = $term; } else { # PhraseQuery returns an array of terms $unique_terms{ $_->to_string } = $_ for @$term; } } @terms = values %unique_terms; my @stringified = keys %unique_terms; # get an aggregated doc_freq for each term my @aggregated_doc_freqs = (0) x scalar @terms; for my $i ( 0 .. $#$searchables ) { my $doc_freqs = $searchables->[$i]->doc_freqs( \@terms ); for my $j ( 0 .. $#terms ) { $aggregated_doc_freqs[$j] += $doc_freqs->[$j]; } } # prepare a hashmap of stringified_term => doc_freq pairs. my %doc_freq_map; @doc_freq_map{@stringified} = @aggregated_doc_freqs; my $cache_df_source = KinoSearch1::Search::CacheDFSource->new( doc_freq_map => \%doc_freq_map, max_doc => $self->max_doc, similarity => $self->get_similarity, ); return $rewritten_query->to_weight($cache_df_source); } package KinoSearch1::Search::CacheDFSource; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Search::Searchable ); BEGIN { __PACKAGE__->init_instance_vars( doc_freq_map => {}, max_doc => undef, ); __PACKAGE__->ready_get(qw( max_doc )); } sub init_instance { } sub doc_freq { my ( $self, $term ) = @_; my $df = $self->{doc_freq_map}{ $term->to_string }; confess( "df for " . $term->to_string . " not available" ) unless defined $df; } sub doc_freqs { my $self = shift; my @doc_freqs = map { $self->doc_freq($_) } @_; return \@doc_freqs; } sub max_doc { shift->{max_doc} } sub rewrite { return $_[1]; } =for comment Dummy class, only here to support initialization of Weights from Queries. =cut 1; __END__ =head1 NAME KinoSearch1::Search::MultiSearcher - Aggregate results from multiple searchers. =head1 SYNOPSIS for my $server_name (@server_names) { push @searchers, KinoSearch1::Search::SearchClient->new( peer_address => "$server_name:$port", analyzer => $analyzer, password => $pass, ); } my $multi_searcher = KinoSearch1::Search::MultiSearcher->new( searchables => \@searchers, analyzer => $analyzer, ); my $hits = $multi_searcher->search( query => $query ); =head1 DESCRIPTION Aside from the arguments to its constructor, MultiSearcher looks and acts just like a L object. The primary use for MultiSearcher is to aggregate results from several remote searchers via L, diffusing the cost of searching a large corpus over multiple machines. =head1 METHODS =head2 new Constructor. Takes two hash-style parameters, both of which are required. =over =item * B - an item which subclasses L. =item * B - a reference to an array of searchers. =back =head1 COPYRIGHT Copyright 2006-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Search/PhraseQuery.pm000444000765000765 1073511462203446 22604 0ustar00marvinmarvin000000000000package KinoSearch1::Search::PhraseQuery; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Search::Query ); BEGIN { __PACKAGE__->init_instance_vars( # constructor args / members slop => 0, # members field => undef, terms => undef, positions => undef, ); __PACKAGE__->ready_get_set(qw( slop )); __PACKAGE__->ready_get(qw( terms )); } use KinoSearch1::Search::TermQuery; use KinoSearch1::Document::Field; use KinoSearch1::Util::ToStringUtils qw( boost_to_string ); sub init_instance { my $self = shift; $self->{terms} = []; $self->{positions} = []; } # Add a term/position combo to the query. The position is specified # explicitly in order to allow for phrases with gaps, two terms at the same # position, etc. sub add_term { my ( $self, $term, $position ) = @_; my $field = $term->get_field; $self->{field} = $field unless defined $self->{field}; croak("Mismatched fields in phrase query: '$self->{field}' '$field'") unless ( $field eq $self->{field} ); if ( !defined $position ) { $position = @{ $self->{positions} } ? $self->{positions}[-1] + 1 : 0; } push @{ $self->{terms} }, $term; push @{ $self->{positions} }, $position; } sub create_weight { my ( $self, $searcher ) = @_; # optimize for one-term phrases if ( @{ $self->{terms} } == 1 ) { my $term_query = KinoSearch1::Search::TermQuery->new( term => $self->{terms}[0], ); return $term_query->create_weight($searcher); } else { return KinoSearch1::Search::PhraseWeight->new( parent => $self, searcher => $searcher, ); } } sub extract_terms { shift->{terms} } sub to_string { my ( $self, $proposed_field ) = @_; my $string = $proposed_field eq $self->{field} ? qq(") : qq($proposed_field:"); $string .= ( $_->get_text . ' ' ) for @{ $self->{terms} }; $string .= qq("); $string .= qq(~$self->{slop}) if $self->{slop}; $string .= boost_to_string( $self->get_boost ); return $string; } package KinoSearch1::Search::PhraseWeight; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Search::Weight ); BEGIN { __PACKAGE__->init_instance_vars(); } use KinoSearch1::Search::PhraseScorer; sub init_instance { my $self = shift; $self->{similarity} = $self->{parent}->get_similarity( $self->{searcher} ); $self->{idf} = $self->{similarity} ->idf( $self->{parent}->get_terms, $self->{searcher} ); undef $self->{searcher}; # don't want the baggage } sub scorer { my ( $self, $reader ) = @_; my $query = $self->{parent}; # look up each term my @term_docs; for my $term ( @{ $query->{terms} } ) { # bail if any one of the terms isn't in the index return unless $reader->doc_freq($term); my $td = $reader->term_docs($term); push @term_docs, $td; # turn on positions $td->set_read_positions(1); } # bail if there are no terms return unless @term_docs; my $norms_reader = $reader->norms_reader( $query->{field} ); return KinoSearch1::Search::PhraseScorer->new( weight => $self, slop => $query->{slop}, similarity => $self->{similarity}, norms_reader => $norms_reader, term_docs => \@term_docs, phrase_offsets => $query->{positions}, ); } 1; __END__ =head1 NAME KinoSearch1::Search::PhraseQuery - match ordered list of Terms =head1 SYNOPSIS my $phrase_query = KinoSearch1::Search::PhraseQuery->new; for ( qw( the who ) ) { my $term = KinoSearch1::Index::Term( 'bodytext', $_ ); $phrase_query->add_term($term); } my $hits = $searcher->search( query => $phrase_query ); =head1 DESCRIPTION PhraseQuery is a subclass of L for matching against ordered collections of terms. =head1 METHODS =head2 new my $phrase_query = KinoSearch1::Search::PhraseQuery->new; Constructor. Takes no arguments. =head2 add_term $phrase_query->add_term($term); Append a term to the phrase to be matched. Takes one argument, a L object. =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Search/PhraseScorer.pm000444000765000765 2774711462203445 22746 0ustar00marvinmarvin000000000000package KinoSearch1::Search::PhraseScorer; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Search::Scorer ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params weight => undef, term_docs => undef, phrase_offsets => undef, norms_reader => undef, slop => 0, ); } our %instance_vars; sub new { my $either = shift; confess kerror() unless verify_args( \%instance_vars, @_ ); my %args = ( %instance_vars, @_ ); my $self = $either->SUPER::new; $self->_init_child; # set/derive some member vars $self->_set_norms( $args{norms_reader}->get_bytes ); $self->set_similarity( $args{similarity} ); $self->_set_weight_value( $args{weight}->get_value ); confess("Sloppy phrase matching not yet implemented") unless $args{slop} == 0; # TODO -- enable slop. $self->_set_slop( $args{slop} ); # sort terms by ascending frequency confess("positions count doesn't match term count") unless $#{ $args{term_docs} } == $#{ $args{phrase_offsets} }; my @by_size = sort { $a->[0]->get_doc_freq <=> $b->[0]->get_doc_freq } map { [ $args{term_docs}[$_], $args{phrase_offsets}[$_] ] } 0 .. $#{ $args{term_docs} }; my @term_docs = map { $_->[0] } @by_size; my @phrase_offsets = map { $_->[1] } @by_size; $self->_init_elements( \@term_docs, \@phrase_offsets ); return $self; } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Search::PhraseScorer void _init_child(scorer) Scorer *scorer; PPCODE: Kino1_PhraseScorer_init_child(scorer); void _init_elements(scorer, term_docs_av, phrase_offsets_av) Scorer *scorer; AV *term_docs_av; AV *phrase_offsets_av; PREINIT: PhraseScorerChild *child; I32 i; SV **sv_ptr; IV tmp; PPCODE: { child = (PhraseScorerChild*)scorer->child; SvREFCNT_inc(term_docs_av); SvREFCNT_dec(child->term_docs_av); child->term_docs_av = term_docs_av; child->num_elements = av_len(term_docs_av) + 1; Kino1_New(0, child->term_docs, child->num_elements, TermDocs*); Kino1_New(0, child->phrase_offsets, child->num_elements, U32); /* create an array of TermDocs* */ for(i = 0; i < child->num_elements; i++) { sv_ptr = av_fetch(term_docs_av, i, 0); tmp = SvIV((SV*)SvRV( *sv_ptr )); child->term_docs[i] = INT2PTR(TermDocs*, tmp); sv_ptr = av_fetch(phrase_offsets_av, i, 0); child->phrase_offsets[i] = SvIV( *sv_ptr ); } } SV* _phrase_scorer_set_or_get(scorer, ...) Scorer *scorer; ALIAS: _set_slop = 1 _get_slop = 2 _set_weight_value = 3 _get_weight_value = 4 _set_norms = 5 _get_norms = 6 CODE: { PhraseScorerChild *child = (PhraseScorerChild*)scorer->child; KINO_START_SET_OR_GET_SWITCH case 1: child->slop = SvIV( ST(1) ); /* fall through */ case 2: RETVAL = newSViv(child->slop); break; case 3: child->weight_value = SvNV( ST(1) ); /* fall through */ case 4: RETVAL = newSVnv(child->weight_value); break; case 5: SvREFCNT_dec(child->norms_sv); child->norms_sv = newSVsv( ST(1) ); { SV* bytes_deref_sv; bytes_deref_sv = SvRV(child->norms_sv); if (SvPOK(bytes_deref_sv)) { child->norms = (unsigned char*)SvPVX(bytes_deref_sv); } else { child->norms = NULL; } } /* fall through */ case 6: RETVAL = newSVsv(child->norms_sv); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL void DESTROY(scorer) Scorer *scorer; PPCODE: Kino1_PhraseScorer_destroy(scorer); __H__ #ifndef H_KINO_PHRASE_SCORER #define H_KINO_PHRASE_SCORER 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1IndexTermDocs.h" #include "KinoSearch1SearchScorer.h" #include "KinoSearch1UtilMemManager.h" typedef struct phrasescorerchild { U32 doc; U32 slop; U32 num_elements; TermDocs **term_docs; U32 *phrase_offsets; float phrase_freq; float weight_value; U32 first_time; unsigned char *norms; SV *anchor_set; float (*calc_phrase_freq)(Scorer*); SV *norms_sv; AV *term_docs_av; } PhraseScorerChild; void Kino1_PhraseScorer_init_child(Scorer*); bool Kino1_PhraseScorer_next(Scorer*); float Kino1_PhraseScorer_calc_phrase_freq(Scorer*); U32 Kino1_PhraseScorer_doc(Scorer*); float Kino1_PhraseScorer_score(Scorer*); void Kino1_PhraseScorer_destroy(Scorer*); #endif /* include guard */ __C__ #include "KinoSearch1SearchPhraseScorer.h" void Kino1_PhraseScorer_init_child(Scorer *scorer) { PhraseScorerChild *child; /* allocate */ Kino1_New(0, child, 1, PhraseScorerChild); scorer->child = child; child->anchor_set = newSV(0); /* init */ child->doc = 0xFFFFFFFF; child->slop = 0; child->first_time = 1; child->phrase_freq = 0.0; child->norms = NULL; child->phrase_offsets = NULL; child->term_docs_av = (AV*)&PL_sv_undef; child->norms_sv = &PL_sv_undef;; /* define abstract methods */ scorer->next = Kino1_PhraseScorer_next; scorer->score = Kino1_PhraseScorer_score; scorer->doc = Kino1_PhraseScorer_doc; child->calc_phrase_freq = Kino1_PhraseScorer_calc_phrase_freq; } bool Kino1_PhraseScorer_next(Scorer *scorer) { PhraseScorerChild *child; TermDocs **term_docs; U32 candidate; U32 i; child = (PhraseScorerChild*)scorer->child; term_docs = child->term_docs; child->phrase_freq = 0.0; child->doc = 0xFFFFFFFF; if (child->first_time) { child->first_time = 0; /* advance all except the first term_docs */ for (i = 1; i < child->num_elements; i++) { if ( !term_docs[i]->next(term_docs[i]) ) return 0; } } /* seed the search */ if ( !term_docs[0]->next(term_docs[0]) ) return 0; candidate = term_docs[0]->get_doc(term_docs[0]); /* find a doc which contains all the terms */ FIND_COMMON_DOC: while (1) { for (i = 0; i < child->num_elements; i++) { U32 thisdoc = term_docs[i]->get_doc(term_docs[i]); if (thisdoc > candidate) candidate = thisdoc; } for (i = 0; i < child->num_elements; i++) { U32 thisdoc = term_docs[i]->get_doc(term_docs[i]); if (thisdoc < candidate) { if (!term_docs[i]->skip_to(term_docs[i], candidate)) return 0; } } for (i = 0; i < child->num_elements; i++) { if (term_docs[i]->get_doc(term_docs[i]) != candidate) { goto FIND_COMMON_DOC; } } break; /* success! */ } /* if the terms don't actually form a phrase, skip to the next doc */ child->phrase_freq = child->calc_phrase_freq(scorer); if (child->phrase_freq == 0.0) return scorer->next(scorer); /* success! */ child->doc = candidate; return 1; } float Kino1_PhraseScorer_calc_phrase_freq(Scorer *scorer) { PhraseScorerChild *child; TermDocs **term_docs; U32 *anchors; U32 *anchors_start; U32 *anchors_end; U32 *new_anchors; U32 *candidates; U32 *candidates_end; U32 phrase_offset; U32 i; STRLEN len; child = (PhraseScorerChild*)scorer->child; term_docs = child->term_docs; /* create an anchor set */ sv_setsv( child->anchor_set, term_docs[0]->get_positions(term_docs[0]) ); anchors_start = (U32*)SvPVX(child->anchor_set); anchors = anchors_start; anchors_end = (U32*)SvEND(child->anchor_set); phrase_offset = child->phrase_offsets[0]; while(anchors < anchors_end) { *anchors++ -= phrase_offset; } /* match the positions of other terms against the anchor set */ for (i = 1; i < child->num_elements; i++) { phrase_offset = child->phrase_offsets[i]; anchors = anchors_start; new_anchors = anchors_start; anchors_end = (U32*)SvEND(child->anchor_set); new_anchors = anchors; candidates = (U32*)SvPVX( term_docs[i]->get_positions(term_docs[i]) ); candidates_end = (U32*)SvEND( term_docs[i]->get_positions(term_docs[i]) ); while (anchors < anchors_end) { U32 target; /* Discard positions that occur too early in the field to match as * a part of the phrase. For example, if the field begins with * "The ants go marching one by one", that initial "the" cannot * match as the second term in a phrase search for * "fight the power". */ target = phrase_offset; while (candidates < candidates_end && *candidates < target) { candidates++; } if (candidates == candidates_end) break; /* Discard partial matches which seemed promising earlier but * which fail on this go-round. */ target = *candidates - phrase_offset; while (anchors < anchors_end && *anchors < target) { anchors++; } if (anchors == anchors_end) break; /* Blast past any positions for the current term which are too low * for the partial phrase matched in earlier iters. */ target = *anchors + phrase_offset; while (candidates < candidates_end && *candidates < target) { candidates++; } if (candidates == candidates_end) break; /* Does the current position fall into the slot? */ if (*candidates == target) { /* The anchor has made it through another elimination round. */ *new_anchors = *anchors; new_anchors++; } anchors++; } /* winnow down the size of the anchor set */ len = (char*)new_anchors - (char*)anchors_start; SvCUR_set(child->anchor_set, len); } /* the number of anchors left is the phrase freq */ len = SvCUR(child->anchor_set); return (float) len / sizeof(U32); } U32 Kino1_PhraseScorer_doc(Scorer *scorer) { PhraseScorerChild* child = (PhraseScorerChild*)scorer->child; return child->doc; } float Kino1_PhraseScorer_score(Scorer *scorer) { PhraseScorerChild* child; float score; unsigned char norm; child = (PhraseScorerChild*)scorer->child; /* calculate raw score */ score = scorer->sim->tf(scorer->sim, child->phrase_freq) * child->weight_value; /* normalize */ norm = child->norms[ child->doc ]; score *= scorer->sim->norm_decoder[norm]; return score; } void Kino1_PhraseScorer_destroy(Scorer *scorer) { PhraseScorerChild *child; child = (PhraseScorerChild*)scorer->child; Kino1_Safefree(child->term_docs); Kino1_Safefree(child->phrase_offsets); SvREFCNT_dec(child->norms_sv); SvREFCNT_dec((SV*)child->term_docs_av); SvREFCNT_dec(child->anchor_set); Kino1_Safefree(child); Kino1_Scorer_destroy(scorer); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Search::PhraseScorer - scorer for PhraseQuery ==head1 DESCRIPTION Score phrases. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Search/Query.pm000444000765000765 1031111462203446 21427 0ustar00marvinmarvin000000000000package KinoSearch1::Search::Query; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members boost => 1, ); __PACKAGE__->ready_get_set(qw( boost )); } =begin comment my $string = $query->to_string( $field_name ); Return a string representation of the query. $field_name is a default field, and affects how the string is generated -- for instance, if a TermQuery's field matches $field_name, the field will be omitted, while if it doesn't match, the field will be included in the string. =end comment =cut sub to_string { shift->abstract_death } =begin comment my $weight = $query->create_weight($searcher); Only low-level Queries which rewrite themselves implement this method. =end comment =cut sub create_weight { shift->abstract_death } # Derive a weight for a high-level query. sub to_weight { # in Lucene, this method is simply "weight" my ( $self, $searcher ) = @_; my $rewritten_self = $searcher->rewrite($self); my $weight = $rewritten_self->create_weight($searcher); my $sum = $weight->sum_of_squared_weights; my $sim = $self->get_similarity($searcher); my $norm = $sim->query_norm($sum); $weight->normalize($norm); return $weight; } =begin comment my $rewritten_query = $query->rewrite( $index_reader ); Called by high-level Queries that wish to reformulate themselves as agglomerations of low-level queries. =end comment =cut sub rewrite { return shift } =begin comment my @terms = $query->extract_terms; Return all the Terms within this query. =end comment =cut sub extract_terms { shift->abstract_death } # These will be needed by MultiSearcher if we add queries which rewrite # themselves. sub combine { shift->todo_death } sub merge_boolean_queries { shift->todo_death } # return the Similarity implementation used by the Query. sub get_similarity { my ( $self, $searcher, $field_name ) = @_; # This can be overriden in subclasses, allowing alternative Sims. return defined $field_name ? $searcher->get_similarity($field_name) : $searcher->get_similarity; } sub clone { shift->todo_death } 1; __END__ =head1 NAME KinoSearch1::Search::Query - base class for search queries =head1 SYNOPSIS # abstract base class =head1 DESCRIPTION Base class for queries to be performed against an invindex. L is one example. =head1 METHODS =head2 set_boost get_boost $term_query_a->set_boost(2); $boolean_query->add_clause( query => $term_query_a, occur => 'SHOULD' ); $boolean_query->add_clause( query => $term_query_b, occur => 'SHOULD' ); The boost of any Query is 1.0 by default. Setting boost to a number greater than one increases a Query's relative contribution to a score, and setting boost to a lower number decreases the contribution. =begin devdocs A Query in KinoSearch1 is a highly abstracted representation. It must be transformed in several ways before the index is actually consulted to see how documents score against it. First, a Query must be "rewritten", a task that falls to the searcher. Rewriting something as simple as a TermQuery just means returning the original object; other more complex Queries, e.g. the as-yet-unimplemented SpanQueries, may get transformed into collections of simpler queries -- such as TermQueries. Next, a Weight must be derived from a Query. The role of a Weight is to hold all data which changes as the search gets processed -- allowing still-pristine Query objects to be reused later. The Weight object is used to derive a Scorer. The scorer iterates over the documents which match the query, producing doc_num => score pairs. These pairs are are processed by a HitCollector. Different types of HitCollectors yield different results. Here's another way of looking at the divided responsibilities: # 1. Searcher-dependent info belongs in the Weight. # 2. IndexReader-dependent info belongs in the Scorer. =end devdocs =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Search/QueryFilter.pm000444000765000765 437011462203446 22565 0ustar00marvinmarvin000000000000package KinoSearch1::Search::QueryFilter; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members query => undef, # members cached_bits => undef, ); } use KinoSearch1::Search::HitCollector; sub init_instance { my $self = shift; confess("required parameter query is not a KinoSearch1::Search::Query") unless a_isa_b( $self->{query}, 'KinoSearch1::Search::Query' ); } sub bits { my ( $self, $searcher ) = @_; # fill the cache if ( !defined $self->{cache} ) { my $collector = KinoSearch1::Search::BitCollector->new( capacity => $searcher->max_doc, ); # perform the search $searcher->search_hit_collector( weight => $self->{query}->to_weight($searcher), hit_collector => $collector, ); # save the bitvector of doc hits $self->{cached_bits} = $collector->get_bit_vector; } return $self->{cached_bits}; } 1; __END__ =head1 NAME KinoSearch1::Search::QueryFilter - build a filter based on results of a query =head1 SYNOPSIS my $books_only_query = KinoSearch1::Search::TermQuery->new( term => KinoSearch1::Index::Term->new( 'category', 'books' ); ); my $filter = KinoSearch1::Search::QueryFilter->new( query => $books_only_query; ); my $hits = $searcher->search( query => $query_string, filter => $filter, ); =head1 DESCRIPTION A QueryFilter spawns a result set that can be used to filter the results of another query. The effect is very similar to adding a required clause to a L -- however, a QueryFilter caches its results, so it is more efficient if you use it more than once. =head1 METHODS =head2 new my $filter = KinoSearch1::Search::QueryFilter->new( query => $query; ); Constructor. Takes one hash-style parameter, C, which must be an object belonging to a subclass of L. =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Search/Scorer.pm000444000765000765 1346111462203445 21567 0ustar00marvinmarvin000000000000package KinoSearch1::Search::Scorer; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::CClass Exporter ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params similarity => undef, ); } our %instance_vars; our @EXPORT_OK = qw( %score_batch_args ); sub new { my $class = shift; verify_args( \%instance_vars, @_ ); my %args = ( %instance_vars, @_ ); $class = ref($class) || $class; my $self = _construct_parent($class); if ( defined $args{similarity} ) { $self->set_similarity( $args{similarity} ); } return $self; } our %score_batch_args = ( hit_collector => undef, start => 0, end => 2**31, ); =begin comment my $explanation = $scorer->explain($doc_num); Provide an Explanation for how this scorer rates a given doc. =end comment =cut sub explain { shift->abstract_death } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Search::Scorer void _construct_parent(class) char *class; PREINIT: Scorer *scorer; PPCODE: scorer = Kino1_Scorer_new(); ST(0) = sv_newmortal(); sv_setref_pv(ST(0), class, (void*)scorer); XSRETURN(1); SV* _scorer_set_or_get(scorer, ...) Scorer *scorer; ALIAS: set_similarity = 1 get_similarity = 2 CODE: { KINO_START_SET_OR_GET_SWITCH case 1: SvREFCNT_dec(scorer->similarity_sv); scorer->similarity_sv = newSVsv( ST(1) ); Kino1_extract_struct( scorer->similarity_sv, scorer->sim, Similarity*, "KinoSearch1::Search::Similarity" ); /* fall through */ case 2: RETVAL = newSVsv(scorer->similarity_sv); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL =begin comment my $score = $scorer->score; Calculate and return a score for the scorer's current document. =end comment =cut float score(scorer) Scorer* scorer; CODE: RETVAL = scorer->score(scorer); OUTPUT: RETVAL =begin comment my $valid_state = $scorer->next; Move the internal state of the scorer to the next document. Return false when there are no more documents to score. =end comment =cut bool next(scorer) Scorer* scorer; CODE: RETVAL = scorer->next(scorer); OUTPUT: RETVAL =begin comment $scorer->score_batch( hit_collector => $collector, start => $start, end => $end, ); Execute the scoring number crunching, accumulating results via the $hit_collector. TODO: Doesn't actually pay any attention to start/end at present. =end comment =cut void score_batch(scorer, ...) Scorer *scorer; PREINIT: HV *args_hash; U32 start, end; HitCollector *hc; PPCODE: /* process hash-style params */ Kino1_Verify_build_args_hash(args_hash, "KinoSearch1::Search::Scorer::score_batch_args", 1); Kino1_extract_struct_from_hv(args_hash, hc, "hit_collector", 13, HitCollector*, "KinoSearch1::Search::HitCollector"); start = (U32)SvUV( Kino1_Verify_extract_arg(args_hash, "start", 5) ); end = (U32)SvUV( Kino1_Verify_extract_arg(args_hash, "end", 3) ); /* execute scoring loop */ while (scorer->next(scorer)) { hc->collect( hc, scorer->doc(scorer), scorer->score(scorer) ); } =begin comment Not implemented yet. =end comment =cut bool skip_to(scorer, target_doc_num) Scorer* scorer; U32 target_doc_num; CODE: RETVAL = scorer->skip_to(scorer, target_doc_num); OUTPUT: RETVAL void DESTROY(scorer) Scorer *scorer; PPCODE: Kino1_Scorer_destroy(scorer); __H__ #ifndef H_KINO_SCORER #define H_KINO_SCORER 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1SearchSimilarity.h" #include "KinoSearch1UtilMemManager.h" #include "KinoSearch1UtilCarp.h" typedef struct scorer { void *child; Similarity *sim; float (*score)(struct scorer*); bool (*next)(struct scorer*); U32 (*doc)(struct scorer*); bool (*skip_to)(struct scorer*, U32); SV *similarity_sv; } Scorer; Scorer* Kino1_Scorer_new(); float Kino1_Scorer_score_death(Scorer*); bool Kino1_Scorer_next_death(Scorer*); U32 Kino1_Scorer_doc_death(Scorer*); bool Kino1_Scorer_skip_to_death(Scorer*, U32); void Kino1_Scorer_destroy(Scorer*); #endif /* include guard */ __C__ #include "KinoSearch1SearchScorer.h" Scorer* Kino1_Scorer_new() { Scorer* scorer; Kino1_New(0, scorer, 1, Scorer); scorer->child = NULL; scorer->sim = NULL; scorer->next = Kino1_Scorer_next_death; scorer->score = Kino1_Scorer_score_death; scorer->skip_to = Kino1_Scorer_skip_to_death; scorer->similarity_sv = &PL_sv_undef; return scorer; } float Kino1_Scorer_score_death(Scorer* scorer) { Kino1_confess("scorer->score must be defined in a subclass"); return 1.0; } bool Kino1_Scorer_next_death(Scorer* scorer) { Kino1_confess("scorer->next must be defined in a subclass"); return 1; } U32 Kino1_Scorer_doc_death(Scorer* scorer) { Kino1_confess("scorer->doc must be defined in a subclass"); return 1; } bool Kino1_Scorer_skip_to_death(Scorer* scorer, U32 target_doc_num) { Kino1_confess("scorer->skip_to must be defined in a subclass"); return 1; } void Kino1_Scorer_destroy(Scorer* scorer) { SvREFCNT_dec(scorer->similarity_sv); Kino1_Safefree(scorer); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Search::Scorer - score documents against a Query ==head1 DESCRIPTION Abstract base class for scorers. Scorers iterate through a list of documents, producing score/doc_num pairs for further processing, typically by a HitCollector. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Search/Searchable.pm000444000765000765 560711462203446 22347 0ustar00marvinmarvin000000000000package KinoSearch1::Search::Searchable; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # members similarity => undef, field_sims => undef, # {} ); } use KinoSearch1::Search::Similarity; =begin comment my $hits = $searchable->search($query_string); my $hits = $searchable->search( query => $query, filter => $filter, sort_spec => $sort_spec, ); =end comment =cut sub search { shift->abstract_death } =begin comment my $explanation = $searchable->explain( $weight, $doc_num ); Provide an Explanation for how the document represented by $doc_num scored agains $weight. Useful for probing the guts of Similarity. =end comment =cut sub explain { shift->todo_death } =begin comment my $doc_num = $searchable->max_doc; Return one larger than the largest doc_num. =end comment =cut sub max_doc { shift->abstract_death } =begin comment my $doc = $searchable->fetch_doc($doc_num); Generate a Doc object, retrieving the stored fields from the invindex. =end comment =cut sub fetch_doc { shift->abstract_death } =begin comment my $doc_freq = $searchable->doc_freq($term); Return the number of documents which contain this Term. Used for calculating Weights. =end comment =cut sub doc_freq { shift->abstract_death } =begin comment $searchable->set_similarity($sim); $searchable->set_similarity( $field_name, $alternate_sim ); my $sim = $searchable->get_similarity; my $alt_sim = $searchable->get_similarity($field_name); Set or get Similarity. If a field name is included, set/retrieve the Similarity instance for that field only. =end comment =cut sub set_similarity { if ( @_ == 3 ) { my ( $self, $field_name, $sim ) = @_; $self->{field_sims}{$field_name} = $sim; } else { $_[0]->{similarity} = $_[1]; } } sub get_similarity { my ( $self, $field_name ) = @_; if ( defined $field_name and exists $self->{field_sims}{$field_name} ) { return $self->{field_sims}{$field_name}; } else { return $self->{similarity}; } } # not sure these are needed (call $query->create_weight($searcher) instead) sub create_weight { shift->unimplemented_death } sub rewrite_query { shift->unimplemented_death } sub doc_freqs { my ( $self, $terms ) = @_; my @doc_freqs = map { $self->doc_freq($_) } @$terms; return \@doc_freqs; } sub close { } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Search::Searchable - base class for searching an invindex ==head1 DESCRIPTION Abstract base class for objects which search an invindex. The most prominent subclass is KinoSearch1::Searcher. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Search/SearchClient.pm000444000765000765 1150511462203446 22674 0ustar00marvinmarvin000000000000package KinoSearch1::Search::SearchClient; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Searcher ); use Storable qw( nfreeze thaw ); BEGIN { __PACKAGE__->init_instance_vars( # params/members analyzer => undef, peer_address => undef, password => undef, # members similarity => undef, ); } use IO::Socket::INET; sub init_instance { my $self = shift; $self->{similarity} ||= KinoSearch1::Search::Similarity->new; $self->{field_sims} = {}; # establish a connection my $sock = IO::Socket::INET->new( PeerAddr => $self->{peer_address}, Proto => 'tcp', ); confess("No socket: $!") unless $sock; $sock->autoflush(1); $self->{sock} = $sock; # verify password print $sock "$self->{password}\n"; chomp( my $response = <$sock> ); confess("Failed to connect: '$response'") unless $response =~ /accept/i; } =for comment Make a remote procedure call. For every call that does not close/terminate the socket connection, expect a response back that's been serialized using Storable. =cut sub _rpc { my ( $self, $method, $args ) = @_; my $sock = $self->{sock}; my $serialized = nfreeze($args); my $packed_len = pack( 'N', bytes::length($serialized) ); print $sock "$method\n$packed_len$serialized"; # bail out if we're either closing or shutting down the server remotely return if $method eq 'done'; return if $method eq 'terminate'; # decode response $sock->read( $packed_len, 4 ); my $arg_len = unpack( 'N', $packed_len ); my $check_val = read( $sock, $serialized, $arg_len ); confess("Tried to read $arg_len bytes, got $check_val") unless ( defined $arg_len and $check_val == $arg_len ); return thaw($serialized); } sub get_field_names { my $self = shift; return $self->_rpc( 'get_field_names', {} ); } my %search_hit_collector_args = ( hit_collector => undef, weight => undef, filter => undef, sort_spec => undef, ); sub search_hit_collector { my $self = shift; confess kerror() unless verify_args( \%search_hit_collector_args, @_ ); my %args = ( %search_hit_collector_args, @_ ); confess("remote filtered search not supported") if defined $args{filter}; # replace the HitCollector with a size rather than serialize it my $collector = delete $args{hit_collector}; if ( a_isa_b( $collector, "KinoSearch1::Search::OffsetCollector" ) ) { $args{num_wanted} = $collector->get_storage->get_max_size; } else { $args{num_wanted} = $collector->get_max_size; } # Make the remote call, which returns a hashref of doc => score pairs. # Accumulate hits into the HitCollector if the query is valid. my $score_pairs = $self->_rpc( 'search_hit_collector', \%args ); while ( my ( $doc, $score ) = each %$score_pairs ) { $collector->collect( $doc, $score ); } } sub terminate { my $self = shift; return $self->_rpc( 'terminate', {} ); } sub fetch_doc { my ( $self, $doc_num ) = @_; return $self->_rpc( 'fetch_doc', { doc_num => $doc_num } ); } sub max_doc { my $self = shift; return $self->_rpc( 'max_doc', {} ); } sub doc_freq { my ( $self, $term ) = @_; return $self->_rpc( 'doc_freq', { term => $term } ); } sub doc_freqs { my ( $self, $terms ) = @_; return $self->_rpc( 'doc_freqs', { terms => $terms } ); } sub close { my $self = shift; $self->_rpc( 'done', {} ); my $sock = $self->{sock}; close $sock or confess("Error when closing socket: $!"); undef $self->{sock}; } sub DESTROY { my $self = shift; $self->close if defined $self->{sock}; } 1; __END__ =head1 NAME KinoSearch1::Search::SearchClient - connect to a remote SearchServer =head1 SYNOPSIS my $client = KinoSearch1::Search::SearchClient->new( peer_address => 'searchserver1:7890', password => $pass, analyzer => $analyzer, ); my $hits = $client->search( query => $query ); =head1 DESCRIPTION SearchClient is a subclass of L which can be used to search an index on a remote machine made accessible via L. =head1 METHODS =head2 new Constructor. Takes hash-style params. =over =item * B - The name/IP and the port number which the client should attempt to connect to. =item * B - Password to be supplied to the SearchServer when initializing socket connection. =item * B - An object belonging to a subclass of L =back =head1 LIMITATIONS Limiting search results with a QueryFilter is not yet supported. =head1 COPYRIGHT Copyright 2006-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Search/SearchServer.pm000444000765000765 1422611462203445 22726 0ustar00marvinmarvin000000000000package KinoSearch1::Search::SearchServer; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # params/members searchable => undef, port => undef, password => undef, # members sock => undef, ); } use IO::Socket::INET; use IO::Select; use Storable qw( nfreeze thaw ); sub init_instance { my $self = shift; confess("Missing required param 'password'") unless defined $self->{password}; # establish a listening socket confess("Invalid port") unless $self->{port} =~ /^\d+$/; my $sock = IO::Socket::INET->new( LocalPort => $self->{port}, Proto => 'tcp', Listen => SOMAXCONN, Reuse => 1, ); confess("No socket: $!") unless $sock; $sock->autoflush(1); $self->{sock} = $sock; } my %dispatch = ( get_field_names => \&do_get_field_names, max_doc => \&do_max_doc, doc_freq => \&do_doc_freq, doc_freqs => \&do_doc_freqs, search_hit_collector => \&do_search_hit_collector, fetch_doc => \&do_fetch_doc, terminate => undef, ); sub serve { my $self = shift; my $main_sock = $self->{sock}; my $read_set = IO::Select->new($main_sock); while ( my @ready = $read_set->can_read ) { for my $readhandle (@ready) { # if this is the main handle, we have a new client, so accept if ( $readhandle == $main_sock ) { my $client_sock = $main_sock->accept; # verify password my $pass = <$client_sock>; chomp($pass) if defined $pass; if ( defined($pass) && $pass eq $self->{password} ) { $read_set->add($client_sock); print $client_sock "accepted\n"; } else { print $client_sock "password incorrect\n"; } } # otherwise it's a client sock, so process the request else { my $client_sock = $readhandle; my ( $check_val, $buf, $len, $method, $args ); chomp( $method = <$client_sock> ); # if "done", the client's closing if ( $method eq 'done' ) { $read_set->remove($client_sock); $client_sock->close; next; } # remote signal to close the server elsif ( $method eq 'terminate' ) { $read_set->remove($client_sock); $client_sock->close; $main_sock->close; return; } # sanity check the method name elsif ( !$dispatch{$method} ) { print $client_sock "ERROR: Bad method name: $method\n"; next; } # process the method call read( $client_sock, $buf, 4 ); $len = unpack( 'N', $buf ); read( $client_sock, $buf, $len ); my $response = $dispatch{$method}->( $self, thaw($buf) ); my $frozen = nfreeze($response); my $packed_len = pack( 'N', bytes::length($frozen) ); print $client_sock $packed_len . $frozen; } } } } sub do_get_field_names { my ( $self, $args ) = @_; return $self->{searchable}->get_field_names(%$args); } sub do_doc_freq { my ( $self, $args ) = @_; my $doc_freq = $self->{searchable}->doc_freq( $args->{term} ); return { doc_freq => $doc_freq }; } sub do_doc_freqs { my ( $self, $args ) = @_; return $self->{searchable}->doc_freqs( $args->{terms} ); } sub do_search_hit_collector { my ( $self, $args ) = @_; confess("remote filtered search not supported") if defined $args->{filter}; my $collector = KinoSearch1::Search::HitQueueCollector->new( size => $args->{num_wanted} ); my $scorer = $args->{weight}->scorer( $self->{searchable}->get_reader ); if ( defined $scorer ) { $scorer->score_batch( hit_collector => $collector, end => $self->{searchable}->max_doc, ); } my $hit_queue = $collector->get_hit_queue; my $hit_docs = $hit_queue->hits; my %score_docs; $score_docs{ $_->get_id } = $_->get_score for @$hit_docs; return \%score_docs; } sub do_max_doc { my ( $self, $args ) = @_; my $max_doc = $self->{searchable}->max_doc; return { max_doc => $max_doc }; } sub do_fetch_doc { my ( $self, $args ) = @_; return $self->{searchable}->fetch_doc( $args->{doc_num} ); } 1; __END__ =head1 NAME KinoSearch1::Search::SearchServer - make a Searcher remotely accessible =head1 SYNOPSIS my $searcher = KinoSearch1::Searcher->new( analyzer => $analyzer, invindex => '/path/to/invindex', ); my $server = KinoSearch1::Search::SearchServer->new( searchable => $searcher, port => 7890, password => $pass, ); $server->serve; =head1 DESCRIPTION The SearchServer class, in conjunction with L, makes it possible to run a search on one machine and report results on another. By aggregating several SearchClients under a L, the cost of searching what might have been a prohibitively large monolithic index can be distributed across multiple nodes, each with its own, smaller index. =head1 METHODS =head2 new Constructor. Takes hash-style parameters. =over =item * B - The L that the SearchServer will wrap. =item * B - the port on localhost that the server should open and listen on. =item * B - a password which must be supplied by clients. =back =head2 serve Open a listening socket on localhost and wait for SearchClients to connect. =head1 COPYRIGHT Copyright 2006-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Search/Similarity.pm000444000765000765 1700311462203446 22455 0ustar00marvinmarvin000000000000package KinoSearch1::Search::Similarity; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::CClass ); BEGIN { __PACKAGE__->init_instance_vars(); } # See _float_to_byte. *encode_norm = *_float_to_byte; *decode_norm = *_byte_to_float; # Calculate the Inverse Document Frequecy for one or more Term in a given # collection (the Searcher represents the collection). # # If multiple Terms are supplied, their idfs are summed. sub idf { my ( $self, $term_or_terms, $searcher ) = @_; my $max_doc = $searcher->max_doc; my $terms = ref $term_or_terms eq 'ARRAY' ? $term_or_terms : [$term_or_terms]; return 1 unless $max_doc; # guard against log of zero error # accumulate IDF my $idf = 0; for my $term (@$terms) { my $doc_freq = $searcher->doc_freq($term); $idf += 1 + log( $max_doc / ( 1 + $searcher->doc_freq($term) ) ); } return $idf; } # Normalize a Query's weight so that it is comparable to other Queries. sub query_norm { my ( $self, $sum_of_squared_weights ) = @_; return 0 if ( $sum_of_squared_weights == 0 ); # guard against div by zero return ( 1 / sqrt($sum_of_squared_weights) ); } # KLUDGE -- see comment at STORABLE_thaw. sub STORABLE_freeze { my ( $self, $cloning ) = @_; return if $cloning; return "1"; } package KinoSearch1::Search::TitleSimilarity; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Search::Similarity ); sub new { my $self = shift->SUPER::new(@_); $self->_use_title_tf; return $self; } sub lengthnorm { return 0 unless $_[1]; return 1 / sqrt( $_[1] ); } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Search::Similarity =begin comment KLUDGE!! Rather than attempt to serialize a Similarity, we just create a new one. =end comment =cut void STORABLE_thaw(blank_obj, cloning, serialized) SV *blank_obj; SV *cloning; SV *serialized; PPCODE: { Similarity *sim = Kino1_Sim_new(); SV *deep_obj = SvRV(blank_obj); sv_setiv(deep_obj, PTR2IV(sim)); } void new(either_sv) SV *either_sv; PREINIT: const char *class; Similarity *sim; PPCODE: /* determine the class */ class = sv_isobject(either_sv) ? sv_reftype(either_sv, 0) : SvPV_nolen(either_sv); /* build object */ sim = Kino1_Sim_new(); ST(0) = sv_newmortal(); sv_setref_pv(ST(0), class, (void*)sim); XSRETURN(1); =for comment Provide a normalization factor for a field based on the square-root of the number of terms in it. =cut float lengthnorm(sim, num_terms) Similarity *sim; U32 num_terms; CODE: num_terms = num_terms < 100 ? 100 : num_terms; RETVAL = (float)1 / sqrt(num_terms); OUTPUT: RETVAL =for comment Return a score factor based on the frequency of a term in a given document. The default implementation is sqrt(freq). Other implementations typically produce ascending scores with ascending freqs, since the more times a doc matches, the more relevant it is likely to be. =cut float tf(sim, freq) Similarity *sim; U32 freq; CODE: RETVAL = sim->tf(sim, freq); OUTPUT: RETVAL =for comment _float_to_byte and _byte_to_float encode and decode between 32-bit IEEE floating point numbers and a 5-bit exponent, 3-bit mantissa float. The range covered by the single-byte encoding is 7x10^9 to 2x10^-9. The accuracy is about one significant decimal digit. =cut SV* _float_to_byte(sim, f) Similarity *sim; float f; PREINIT: char b; CODE: b = Kino1_Sim_float2byte(sim, f); RETVAL = newSVpv(&b, 1); OUTPUT: RETVAL float _byte_to_float(sim, b) Similarity *sim; char b; CODE: RETVAL = Kino1_Sim_byte2float(sim, b); OUTPUT: RETVAL =for comment The norm_decoder caches the 256 possible byte => float pairs, obviating the need to call decode_norm over and over for a scoring implementation that knows how to use it. =cut SV* get_norm_decoder(sim) Similarity *sim; CODE: RETVAL = newSVpv( (char*)sim->norm_decoder, (256 * sizeof(float)) ); OUTPUT: RETVAL float coord(sim, overlap, max_overlap) Similarity *sim; U32 overlap; U32 max_overlap; CODE: RETVAL = sim->coord(sim, overlap, max_overlap); OUTPUT: RETVAL void _use_title_tf(sim) Similarity *sim; PPCODE: sim->tf = Kino1_Sim_title_tf; void DESTROY(sim) Similarity *sim; PPCODE: Kino1_Sim_destroy(sim); __H__ #ifndef H_KINO_SIMILARITY #define H_KINO_SIMILARITY 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1UtilMemManager.h" typedef struct similarity { float (*tf)(struct similarity*, float); float (*coord)(struct similarity*, U32, U32); float *norm_decoder; } Similarity; Similarity* Kino1_Sim_new(); float Kino1_Sim_default_tf(Similarity*, float); float Kino1_Sim_title_tf(Similarity*, float); char Kino1_Sim_float2byte(Similarity*, float); float Kino1_Sim_byte2float(Similarity*, char); float Kino1_Sim_coord(Similarity*, U32, U32); void Kino1_Sim_destroy(Similarity*); #endif /* include guard */ __C__ #include "KinoSearch1SearchSimilarity.h" Similarity* Kino1_Sim_new() { int i; unsigned char aUChar; Similarity *sim; Kino1_New(0, sim, 1, Similarity); /* cache decoded norms */ Kino1_New(0, sim->norm_decoder, 256, float); for (i = 0; i < 256; i++) { aUChar = i; sim->norm_decoder[i] = Kino1_Sim_byte2float(sim, (char)aUChar); } sim->tf = Kino1_Sim_default_tf; sim->coord = Kino1_Sim_coord; return sim; } float Kino1_Sim_default_tf(Similarity *sim, float freq) { return( sqrt(freq) ); } float Kino1_Sim_title_tf(Similarity *sim, float freq) { return 1.0; } char Kino1_Sim_float2byte(Similarity *sim, float f) { char norm; I32 mantissa; I32 exponent; I32 bits; if (f < 0.0) f = 0.0; if (f == 0.0) { norm = 0; } else { bits = *(I32*)&f; mantissa = (bits & 0xffffff) >> 21; exponent = (((bits >> 24) & 0x7f)-63) + 15; if (exponent > 31) { exponent = 31; mantissa = 7; } if (exponent < 0) { exponent = 0; mantissa = 1; } norm = (char)((exponent << 3) | mantissa); } return norm; } float Kino1_Sim_byte2float(Similarity *sim, char b) { I32 mantissa; I32 exponent; I32 result; if (b == 0) { result = 0; } else { mantissa = b & 7; exponent = (b >> 3) & 31; result = ((exponent+(63-15)) << 24) | (mantissa << 21); } return *(float*)&result; } /* Calculate a score factor based on the number of terms which match. */ float Kino1_Sim_coord(Similarity *sim, U32 overlap, U32 max_overlap) { if (max_overlap == 0) return 1; return (float)overlap / (float)max_overlap; } void Kino1_Sim_destroy(Similarity *sim) { Kino1_Safefree(sim->norm_decoder); Kino1_Safefree(sim); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Search::Similarity - calculate how closely two items match ==head1 DESCRIPTION The Similarity class encapsulates some of the math used when calculating scores. TitleSimilarity is tuned for best results with title fields. ==head1 SEE ALSO The Lucene equivalent of this class provides a thorough discussion of the Lucene scoring algorithm, which KinoSearch1 implements. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Search/TermQuery.pm000444000765000765 632711462203446 22253 0ustar00marvinmarvin000000000000package KinoSearch1::Search::TermQuery; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Search::Query ); use KinoSearch1::Util::ToStringUtils qw( boost_to_string ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members term => undef, ); __PACKAGE__->ready_get(qw( term )); } sub init_instance { my $self = shift; confess("parameter 'term' is not a KinoSearch1::Index::Term") unless a_isa_b( $self->{term}, 'KinoSearch1::Index::Term' ); } sub create_weight { my ( $self, $searcher ) = @_; my $weight = KinoSearch1::Search::TermWeight->new( parent => $self, searcher => $searcher, ); } sub extract_terms { shift->{term} } sub to_string { my ( $self, $proposed_field ) = @_; my $field = $self->{term}->get_field; my $string = $proposed_field eq $field ? '' : "$field:"; $string .= $self->{term}->get_text . boost_to_string( $self->{boost} ); return $string; } sub get_similarity { my ( $self, $searcher ) = @_; my $field_name = $self->{term}->get_field; return $searcher->get_similarity($field_name); } sub equals { shift->todo_death } package KinoSearch1::Search::TermWeight; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Search::Weight ); use KinoSearch1::Search::TermScorer; our %instance_vars = __PACKAGE__->init_instance_vars(); sub init_instance { my $self = shift; $self->{similarity} = $self->{parent}->get_similarity( $self->{searcher} ); $self->{idf} = $self->{similarity} ->idf( $self->{parent}->get_term, $self->{searcher} ); # kill this because we don't want its baggage. undef $self->{searcher}; } sub scorer { my ( $self, $reader ) = @_; my $term = $self->{parent}{term}; my $term_docs = $reader->term_docs($term); return unless defined $term_docs; return unless $term_docs->get_doc_freq; my $norms_reader = $reader->norms_reader( $term->get_field ); return KinoSearch1::Search::TermScorer->new( weight => $self, term_docs => $term_docs, similarity => $self->{similarity}, norms_reader => $norms_reader, ); } sub to_string { my $self = shift; return "weight(" . $self->{parent}->to_string . ")"; } 1; __END__ =head1 NAME KinoSearch1::Search::TermQuery - match individual Terms =head1 SYNOPSIS my $term = KinoSearch1::Index::Term->new( $field, $term_text ); my $term_query = KinoSearch1::Search::TermQuery->new( term => $term, ); my $hits = $searcher->search( query => $term_query ); =head1 DESCRIPTION TermQuery is a subclass of L for matching individual L. Note that since Term objects are associated with one and only one field, so are TermQueries. =head1 METHODS =head2 new my $term_query = KinoSearch1::Search::TermQuery->new( term => $term, ); Constructor. Takes hash-style parameters: =over =item * B - a L. =back =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Search/TermScorer.pm000444000765000765 2375411462203446 22426 0ustar00marvinmarvin000000000000package KinoSearch1::Search::TermScorer; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Search::Scorer ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params weight => undef, term_docs => undef, norms_reader => undef, ); } our %instance_vars; use KinoSearch1::Search::Scorer qw( %score_batch_args ); sub new { my $self = shift->SUPER::new; confess kerror() unless verify_args( \%instance_vars, @_ ); my %args = ( %instance_vars, @_ ); $self->_init_child; $self->_set_term_docs( $args{term_docs} ); $self->_set_norms( $args{norms_reader}->get_bytes ); $self->set_similarity( $args{similarity} ); $self->_set_weight( $args{weight} ); $self->_set_weight_value( $args{weight}->get_value ); $self->_fill_score_cache; return $self; } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Search::TermScorer void _init_child(scorer) Scorer *scorer; PPCODE: Kino1_TermScorer_init_child(scorer); =for comment Build up a cache of scores for common (i.e. low) freqs, so they don't have to be continually recalculated. =cut void _fill_score_cache(scorer) Scorer* scorer; PPCODE: Kino1_TermScorer_fill_score_cache(scorer); void score_batch(scorer, ...) Scorer *scorer; PREINIT: HV *args_hash; U32 start, end; HitCollector *hc; PPCODE: /* process hash-style params */ Kino1_Verify_build_args_hash(args_hash, "KinoSearch1::Search::TermScorer::score_batch_args", 1); Kino1_extract_struct_from_hv(args_hash, hc, "hit_collector", 13, HitCollector*, "KinoSearch1::Search::HitCollector"); start = (U32)SvUV( Kino1_Verify_extract_arg(args_hash, "start", 5) ); end = (U32)SvUV( Kino1_Verify_extract_arg(args_hash, "end", 3) ); Kino1_TermScorer_score_batch(scorer, start, end, hc); SV* _term_scorer_set_or_get(scorer, ...) Scorer *scorer; ALIAS: _set_term_docs = 1 _get_term_docs = 2 _set_weight = 3 _get_weight = 4 _set_weight_value = 5 _get_weight_value = 6 _set_norms = 7 _get_norms = 8 CODE: { TermScorerChild *child = (TermScorerChild*)scorer->child; KINO_START_SET_OR_GET_SWITCH case 1: SvREFCNT_dec(child->term_docs_sv); child->term_docs_sv = newSVsv( ST(1) ); Kino1_extract_struct( child->term_docs_sv, child->term_docs, TermDocs*, "KinoSearch1::Index::TermDocs"); /* fall through */ case 2: RETVAL = newSVsv(child->term_docs_sv); break; case 3: SvREFCNT_dec(child->weight_sv); if (!sv_derived_from( ST(1), "KinoSearch1::Search::Weight")) Kino1_confess("not a KinoSearch1::Search::Weight"); child->weight_sv = newSVsv( ST(1) ); /* fall through */ case 4: RETVAL = newSVsv(child->weight_sv); break; case 5: child->weight_value = SvNV( ST(1) ); /* fall through */ case 6: RETVAL = newSVnv(child->weight_value); break; case 7: SvREFCNT_dec(child->norms_sv); child->norms_sv = newSVsv( ST(1) ); { SV* bytes_deref_sv; bytes_deref_sv = SvRV(child->norms_sv); if (SvPOK(bytes_deref_sv)) { child->norms = (unsigned char*)SvPVX(bytes_deref_sv); } else { child->norms = NULL; } } /* fall through */ case 8: RETVAL = newSVsv(child->norms_sv); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL void DESTROY(scorer) Scorer *scorer; PPCODE: Kino1_TermScorer_destroy(scorer); __H__ #ifndef H_KINO_TERM_SCORER #define H_KINO_TERM_SCORER 1 #define KINO_SCORE_CACHE_SIZE 32 #define KINO_TERM_SCORER_SENTINEL 0xFFFFFFFF #include "EXTERN.h" #include "perl.h" #include "KinoSearch1IndexTermDocs.h" #include "KinoSearch1SearchHitCollector.h" #include "KinoSearch1SearchScorer.h" #include "KinoSearch1UtilMemManager.h" typedef struct termscorerchild { U32 doc; TermDocs* term_docs; U32 pointer; U32 pointer_max; float weight_value; unsigned char *norms; float *score_cache; U32 *doc_nums; U32 *freqs; SV *doc_nums_sv; SV *freqs_sv; SV *weight_sv; SV *term_docs_sv; SV *norms_sv; } TermScorerChild; void Kino1_TermScorer_init_child(Scorer*); void Kino1_TermScorer_fill_score_cache(Scorer*); bool Kino1_TermScorer_next(Scorer*); float Kino1_TermScorer_score(Scorer*); void Kino1_TermScorer_score_batch(Scorer*, U32, U32, HitCollector*); U32 Kino1_TermScorer_doc(Scorer*); void Kino1_TermScorer_destroy(Scorer*); #endif /* include guard */ __C__ #include "KinoSearch1SearchTermScorer.h" void Kino1_TermScorer_init_child(Scorer *scorer){ TermScorerChild *child; /* allocate */ Kino1_New(0, child, 1, TermScorerChild); scorer->child = child; child->doc_nums_sv = newSV(0); child->freqs_sv = newSV(0); /* define abstract methods */ scorer->next = Kino1_TermScorer_next; scorer->doc = Kino1_TermScorer_doc; scorer->score = Kino1_TermScorer_score; /* init */ child->doc = 0; child->term_docs = NULL; child->pointer = 0; child->pointer_max = 0; child->doc_nums = NULL; child->freqs = NULL; child->weight_value = 0.0; child->norms = NULL; child->score_cache = NULL; child->weight_sv = &PL_sv_undef; child->term_docs_sv = &PL_sv_undef; child->norms_sv = &PL_sv_undef; } void Kino1_TermScorer_fill_score_cache(Scorer *scorer) { TermScorerChild *child; float *cache_ptr; int i; child = (TermScorerChild*)scorer->child; Kino1_Safefree(child->score_cache); Kino1_New(0, child->score_cache, KINO_SCORE_CACHE_SIZE, float); cache_ptr = child->score_cache; for (i = 0; i < KINO_SCORE_CACHE_SIZE; i++) { *cache_ptr++ = scorer->sim->tf(scorer->sim, i) * child->weight_value; } } void Kino1_TermScorer_destroy(Scorer *scorer) { TermScorerChild *child; child = (TermScorerChild*)scorer->child; Kino1_Safefree(child->score_cache); SvREFCNT_dec(child->term_docs_sv); SvREFCNT_dec(child->norms_sv); SvREFCNT_dec(child->weight_sv); SvREFCNT_dec(child->doc_nums_sv); SvREFCNT_dec(child->freqs_sv); Kino1_Safefree(child); Kino1_Scorer_destroy(scorer); } bool Kino1_TermScorer_next(Scorer* scorer) { TermScorerChild *child = (TermScorerChild*)scorer->child; /* refill the queue if needed */ if (++child->pointer >= child->pointer_max) { child->pointer_max = child->term_docs->bulk_read(child->term_docs, child->doc_nums_sv, child->freqs_sv, 1024); child->doc_nums = (U32*)SvPV_nolen(child->doc_nums_sv); child->freqs = (U32*)SvPV_nolen(child->freqs_sv); if (child->pointer_max != 0) { child->pointer = 0; } else { child->doc = KINO_TERM_SCORER_SENTINEL; /* TODO Lucene calls termDocs.close() here. */ return 0; } } child->doc = child->doc_nums[child->pointer]; return 1; } float Kino1_TermScorer_score(Scorer* scorer) { TermScorerChild *child; U32 freq; float score; unsigned char norm; child = (TermScorerChild*)scorer->child; freq = child->freqs[child->pointer]; if (freq < KINO_SCORE_CACHE_SIZE) { /* cache hit, so we don't need to recompute the whole score */ score = child->score_cache[freq]; } else { score = scorer->sim->tf(scorer->sim, freq) * child->weight_value; } /* normalize for field */ norm = child->norms[child->doc]; score *= scorer->sim->norm_decoder[norm]; return score; } void Kino1_TermScorer_score_batch(Scorer *scorer, U32 start, U32 end, HitCollector* hc) { TermScorerChild *child; U32 freq; unsigned char norm; float score; child = (TermScorerChild*)scorer->child; scorer->next(scorer); while(child->doc < end) { freq = child->freqs[child->pointer]; if (freq < KINO_SCORE_CACHE_SIZE) { /* cache hit, so we don't need to recompute the whole score */ score = child->score_cache[freq]; } else { score = scorer->sim->tf(scorer->sim, freq) * child->weight_value; } /* normalize for field */ norm = child->norms[child->doc]; score *= scorer->sim->norm_decoder[norm]; hc->collect(hc, child->doc, score); /* time for a refill? */ if (++child->pointer >= child->pointer_max) { /* try to get more docs and freqs */ child->pointer_max = child->term_docs->bulk_read( child->term_docs, child->doc_nums_sv, child->freqs_sv, 1024); child->doc_nums = (U32*)SvPV_nolen(child->doc_nums_sv); child->freqs = (U32*)SvPV_nolen(child->freqs_sv); /* bail if we didn't get any more docs */ if (child->pointer_max != 0) { child->pointer = 0; } else { child->doc = KINO_TERM_SCORER_SENTINEL; /* TODO Lucene calls termDocs.close() here. */ return; } } child->doc = child->doc_nums[ child->pointer ]; } } U32 Kino1_TermScorer_doc(Scorer* scorer) { TermScorerChild *child = (TermScorerChild*)scorer->child; return child->doc; } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Search::TermScorer - scorer for TermQuery ==head1 DESCRIPTION Subclass of Scorer which scores individual Terms. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Search/Weight.pm000444000765000765 442511462203446 21542 0ustar00marvinmarvin000000000000package KinoSearch1::Search::Weight; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor args / members parent => undef, searcher => undef, # members similarity => undef, value => 0, idf => undef, query_norm => undef, query_weight => undef, ); } # Return the Query that the Weight was derived from. sub get_query { shift->{parent} } # Return the Weight's numerical value, now that it's been calculated. sub get_value { shift->{value} } # Return a damping/normalization factor for the Weight/Query. sub sum_of_squared_weights { my $self = shift; $self->{query_weight} = $self->{idf} * $self->{parent}->get_boost; return ( $self->{query_weight}**2 ); } # Normalize the Weight/Query, so that it produces more comparable numbers in # context of other Weights/Queries. sub normalize { my ( $self, $query_norm ) = @_; $self->{query_norm} = $query_norm; $self->{query_weight} *= $query_norm; $self->{value} = $self->{query_weight} * $self->{idf}; } =begin comment my $scorer = $weight->scorer( $index_reader ); Return a subclass of scorer, primed with values and ready to crunch numbers. =end comment =cut sub scorer { shift->abstract_death } =begin comment my $explanation = $weight->explain( $index_reader, $doc_num ); Explain how a document scores. =end comment =cut sub explain { shift->todo_death } sub to_string { my $self = shift; return "weight(" . $self->{parent}->to_string . ")"; } 1; __END__ __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Search::Weight - Searcher-dependent transformation of a Query ==head1 SYNOPSIS # abstract base class ==head1 DESCRIPTION In one sense, a Weight is the weight of a Query object. Conceptually, a Query's "weight" ought to be a single number: a co-efficient... and indeed, eventually a Weight object gets turned into a $weight_value. However, since calculating that multiplier is rather complex, the calculations are encapsulated within a class. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Store000755000765000765 011462203446 17502 5ustar00marvinmarvin000000000000KinoSearch1-1.01/lib/KinoSearch1/Store/FSInvIndex.pm000444000765000765 1246711462203445 22203 0ustar00marvinmarvin000000000000package KinoSearch1::Store::FSInvIndex; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Store::InvIndex ); our $LOCK_DIR; # used by FSLock use File::Spec::Functions qw( canonpath catfile catdir tmpdir no_upwards ); use Fcntl; BEGIN { __PACKAGE__->init_instance_vars(); # confirm or create a directory to put lockfiles in $LOCK_DIR = catdir( tmpdir, 'kinosearch_lockdir' ); if ( !-d $LOCK_DIR ) { mkdir $LOCK_DIR or die "couldn't mkdir '$LOCK_DIR': $!"; chmod 0777, $LOCK_DIR; } } use Digest::MD5 qw( md5_hex ); use KinoSearch1::Store::InStream; use KinoSearch1::Store::OutStream; use KinoSearch1::Store::FSLock; use KinoSearch1::Index::IndexFileNames; sub init_instance { my $self = shift; # clean up path. my $path = $self->{path} = canonpath( $self->{path} ); if ( $self->{create} ) { # clear out lockfiles related to this path my $lock_prefix = $self->get_lock_prefix; opendir( my $lock_dh, $LOCK_DIR ) or confess("Couldn't opendir '$LOCK_DIR': $!"); my @lockfiles = grep {/$lock_prefix/} readdir $lock_dh; closedir $lock_dh or confess("Couldn't closedir '$LOCK_DIR': $!"); for (@lockfiles) { $_ = catfile( $LOCK_DIR, $_ ); unlink $_ or confess("couldn't unlink '$_': $!"); } # blast any existing index files if ( -e $path ) { opendir( my $invindex_dh, $path ) or confess("Couldn't opendir '$path': $!"); my @to_remove; for ( readdir $invindex_dh ) { if (/(^\w+\.(?:cfs|del)$)/) { push @to_remove, $1; } elsif ( $_ eq 'segments' ) { push @to_remove, 'segments'; } elsif ( $_ eq 'delqueue' ) { push @to_remove, 'delqueue'; } } for my $removable (@to_remove) { $removable = catfile( $path, $removable ); unlink $removable or confess "Couldn't unlink file '$removable': $!"; } closedir $invindex_dh or confess("Couldn't closedir '$path': $!"); } if ( !-d $path ) { mkdir $path or confess("Couldn't mkdir '$path': $!"); } } # by now, we should have a directory, so throw an error if we don't if ( !-d $path ) { confess("Can't open invindex location '$path': $! ") unless -e $path; confess("invindex location '$path' isn't a directory"); } } sub open_outstream { my ( $self, $filename ) = @_; my $filepath = catfile( $self->{path}, $filename ); sysopen( my $fh, $filepath, O_CREAT | O_RDWR | O_EXCL ) or confess("Couldn't open file '$filepath': $!"); binmode($fh); return KinoSearch1::Store::OutStream->new($fh); } sub open_instream { my ( $self, $filename, $offset, $len ) = @_; my $filepath = catfile( $self->{path}, $filename ); # must be unbuffered, or PerlIO messes up with the shared handles open( my $fh, "<:unix", $filepath ) or confess("Couldn't open file '$filepath': $!"); binmode($fh); return KinoSearch1::Store::InStream->new( $fh, $offset, $len ); } sub list { my $self = shift; opendir( my $dh, $self->{path} ) or confess("Couldn't opendir '$self->{path}'"); my @files = no_upwards( readdir $dh ); closedir $dh or confess("Couldn't closedir '$self->{path}'"); return @files; } sub file_exists { my ( $self, $filename ) = @_; return -e catfile( $self->{path}, $filename ); } sub rename_file { my ( $self, $from, $to ) = @_; $_ = catfile( $self->{path}, $_ ) for ( $from, $to ); rename( $from, $to ) or confess("couldn't rename file '$from' to '$to': $!"); } sub delete_file { my ( $self, $filename ) = @_; $filename = catfile( $self->{path}, $filename ); unlink $filename or confess("couldn't unlink file '$filename': $!"); } sub slurp_file { my ( $self, $filename ) = @_; my $filepath = catfile( $self->{path}, $filename ); open( my $fh, "<", $filepath ) or confess("Couldn't open file '$filepath': $!"); binmode($fh); local $/; return <$fh>; } sub make_lock { my $self = shift; return KinoSearch1::Store::FSLock->new( @_, invindex => $self ); } # Create a hashed string derived from this invindex's path. sub get_lock_prefix { my $self = shift; return "kinosearch-" . md5_hex( canonpath( $self->{path} ) ); } sub close { } 1; __END__ =head1 NAME KinoSearch1::Store::FSInvIndex - file system InvIndex =head1 SYNOPSIS my $invindex = KinoSearch1::Store::FSInvIndex->new( path => '/path/to/invindex', create => 1, ); =head1 DESCRIPTION Implementation of KinoSearch1::Store::InvIndex using a single file system directory and multiple files. =head1 CONSTRUCTOR =head2 new C takes two parameters: =over =item B - the location of the invindex. =item B - if set to 1, create a fresh invindex, clobbering an existing one if necessary. Default value is 0, indicating that an existing invindex should be opened. =back =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Store/FSLock.pm000444000765000765 441711462203446 21324 0ustar00marvinmarvin000000000000package KinoSearch1::Store::FSLock; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Store::Lock ); BEGIN { __PACKAGE__->init_instance_vars() } use Fcntl qw( :DEFAULT :flock ); use File::Spec::Functions qw( catfile ); use KinoSearch1::Store::FSInvIndex; my $disable_locks = 0; # placeholder -- locks always enabled for now sub init_instance { my $self = shift; # derive the lockfile's filepath $self->{lock_name} = catfile( $KinoSearch1::Store::FSInvIndex::LOCK_DIR, # TODO fix this stupid hack $self->{invindex}->get_lock_prefix . "-$self->{lock_name}" ); } sub do_obtain { my $self = shift; return 1 if $disable_locks; my $lock_name = $self->{lock_name}; # check for locks created by old processes and remove them if ( -e $lock_name ) { open( my $fh, $lock_name ) or confess "Can't open $lock_name: $!"; my $line = <$fh>; $line =~ /pid: (\d+)/; my $pid = $1; close $fh or confess "Can't close '$lock_name': $!"; unless ( kill 0 => $pid ) { warn "Lockfile looks dead - removing"; unlink $lock_name or confess "Can't unlink '$lock_name: $!"; } } # create a lock by creating a lockfile return unless sysopen( my $fh, $lock_name, O_CREAT | O_WRONLY | O_EXCL ); # print pid and path to the lock file, using YAML for future compat print $fh "pid: $$\ninvindex: " . $self->{invindex}->get_path . "\n"; close $fh or confess "Can't close '$lock_name': $!"; # success! return 1; } sub release { my $self = shift; return if $disable_locks; # release the lock by removing the lockfile from the file system unlink $self->{lock_name} or confess("Couldn't unlink file '$self->{lock_name}': $!"); } sub is_locked { # if the lockfile exists, the resource is locked return ( -e $_[0]->{lock_name} or $disable_locks ); } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Store::FSLock - lock an FSInvIndex ==head1 DESCRIPTION File-system-based implementation of L. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Store/InStream.pm000444000765000765 4212311462203445 21740 0ustar00marvinmarvin000000000000package KinoSearch1::Store::InStream; use base qw( KinoSearch1::Util::CClass ); use strict; use warnings; use KinoSearch1::Util::ToolSet; sub close { CORE::close shift->get_fh } =for comment Dupe the filehandle and create a new object around the dupe. Seek the dupe to the same spot as the original. =cut sub clone_stream { my $self = shift; open( my $duped_fh, '<&=', $self->get_fh ) or confess("Couldn't dupe filehandle: $!"); my $evil_twin = __PACKAGE__->new( $duped_fh, $self->get_offset, $self->length, ); $evil_twin->seek( $self->tell ); return $evil_twin; } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Store::InStream =begin comment my $instream = KinoSearch1::Store::Instream->new( $filehandle, $offset, $length ); Constructor. Takes 1-3 arguments, and unlike most classes in the KinoSearch1 suite, the arguments to the constructor are not labeled parameters. The second argument, an offset, defaults to 0 if not supplied. Non-zero offsets get factored in when calling seek and tell. The last argument, a length, is the length of the "file" in bytes. Supplying an explicit value is only essential for InStreams which are assigned to read a portion of a compound file -- otherwise, the length gets auto-calculated correctly. =end comment =cut InStream* new(class, fh_sv, ...) char *class; SV *fh_sv; PREINIT: double offset = 0; double len = -1; CODE: if (items > 2) { SV* offset_sv; offset_sv = ST(2); if (SvOK(offset_sv)) offset = SvNV(offset_sv); } if (items > 3) { SV *len_sv; len_sv = ST(3); if (SvOK(len_sv)) len = SvNV(len_sv); } RETVAL = Kino1_InStream_new(class, fh_sv, offset, len); OUTPUT: RETVAL =for comment Seek to target plus the object's start offset. =cut void seek(instream, target) InStream *instream; double target; PPCODE: instream->seek(instream, target); =for comment Return the filehandle's position minus the offset. =cut double tell(instream) InStream *instream; CODE: RETVAL = instream->tell(instream); OUTPUT: RETVAL =for comment Return the length of the "file" in bytes, factoring in the offset. =cut double length(instream) InStream *instream; CODE: RETVAL = instream->len; OUTPUT: RETVAL =begin comment @items = $instream->lu_read( TEMPLATE ); Read the items specified by TEMPLATE from the InStream. =end comment =cut SV* _set_or_get(instream, ...) InStream *instream; ALIAS: set_len = 1 get_len = 2 set_offset = 3 get_offset = 4 set_fh = 5 get_fh = 6 CODE: { KINO_START_SET_OR_GET_SWITCH case 1: instream->len = SvNV( ST(1) ); /* fall through */ case 2: RETVAL = newSVnv(instream->len); break; case 3: instream->offset = SvNV( ST(1) ); /* fall through */ case 4: RETVAL = newSVnv(instream->offset); break; case 5: Kino1_confess("Can't set_fh"); /* fall through */ case 6: RETVAL = newSVsv(instream->fh_sv); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL void lu_read (instream, template_sv) InStream *instream; SV *template_sv PREINIT: STRLEN tpt_len; /* bytelength of template */ char *template; /* ptr to a spot in the template */ char *tpt_end; /* ptr to the end of the template */ int repeat_count; /* number of times to repeat sym */ char sym; /* the current symbol in the template */ char countsym; /* used when calculating repeat counts */ IV aIV; SV *aSV; char aChar; char* string; STRLEN len; PPCODE: { /* prepare template string pointers */ template = SvPV(template_sv, tpt_len); tpt_end = SvEND(template_sv); repeat_count = 0; while (1) { if (repeat_count == 0) { /* fast-forward past space characters */ while (*template == ' ' && template < tpt_end) { template++; } /* break out of the loop if we've exhausted the template */ if (template == tpt_end) { break; } /* derive the current symbol and a possible digit repeat sym */ sym = *template++; countsym = *template; if (template == tpt_end) { /* sym is last char in template, so process once */ repeat_count = 1; } else if (countsym >= '0' && countsym <= '9') { /* calculate numerical repeat count */ repeat_count = countsym - KINO_NUM_CHAR_OFFSET; countsym = *(++template); while ( template <= tpt_end && countsym >= '0' && countsym <= '9' ) { repeat_count = (repeat_count * 10) + (countsym - KINO_NUM_CHAR_OFFSET); countsym = *(++template); } } else { /* no numeric repeat count, so process sym only once */ repeat_count = 1; } } /* thwart potential infinite loop */ if (repeat_count < 1) Kino1_confess( "invalid repeat_count: %d", repeat_count); switch(sym) { case 'a': /* arbitrary binary data */ len = repeat_count; repeat_count = 1; aSV = newSV(len + 1); SvCUR_set(aSV, len); SvPOK_on(aSV); string = SvPVX(aSV); instream->read_bytes(instream, string, len); break; case 'b': /* signed byte */ case 'B': /* unsigned byte */ aChar = instream->read_byte(instream); if (sym == 'b') aIV = (signed char)aChar; else aIV = (unsigned char)aChar; aSV = newSViv(aIV); break; case 'i': /* signed 32-bit integer */ aSV = newSViv( (I32)instream->read_int(instream) ); break; case 'I': /* unsigned 32-bit integer */ aSV = newSVuv( instream->read_int(instream) ); break; case 'Q': /* unsigned "64-bit integer" */ aSV = newSVnv( instream->read_long(instream) ); break; case 'T': /* string */ len = instream->read_vint(instream); aSV = newSV(len + 1); SvCUR_set(aSV, len); SvPOK_on(aSV); string = SvPVX(aSV); instream->read_chars(instream, string, 0, len); break; case 'V': /* VInt */ aSV = newSVuv( instream->read_vint(instream) ); break; case 'W': /* VLong */ aSV = newSVnv( instream->read_vlong(instream) ); break; default: aSV = NULL; /* suppress unused var compiler warning */ Kino1_confess("Invalid type in template: '%c'", sym); } /* Put a scalar on the stack, use up one symbol or repeater */ XPUSHs( sv_2mortal(aSV) ); repeat_count -= 1; } } void DESTROY(instream) InStream *instream; PPCODE: Kino1_InStream_destroy(instream); __H__ #ifndef H_KINOSEARCH_STORE_INSTREAM #define H_KINOSEARCH_STORE_INSTREAM 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1UtilCarp.h" #include "KinoSearch1UtilMathUtils.h" /* Detect whether we're on an ASCII or EBCDIC machine. */ #if '0' == 240 #define KINO_NUM_CHAR_OFFSET 240 #else #define KINO_NUM_CHAR_OFFSET 48 #endif #define KINO_IO_STREAM_BUF_SIZE 1024 typedef struct instream { PerlIO *fh; SV *fh_sv; double offset; double len; char *buf; Off_t buf_start; /* file position of start of buffer */ int buf_len; /* number of valid bytes in the buffer */ int buf_pos; /* next byte to read */ void (*seek)(struct instream*, double); double (*tell)(struct instream*); char (*read_byte)(struct instream*); void (*read_bytes)(struct instream*, char*, STRLEN); void (*read_chars)(struct instream*, char*, STRLEN, STRLEN); U32 (*read_int)(struct instream*); double (*read_long)(struct instream*); U32 (*read_vint)(struct instream*); double (*read_vlong)(struct instream*); } InStream; InStream* Kino1_InStream_new (char*, SV*, double, double); void Kino1_InStream_seek (InStream*, double); double Kino1_InStream_tell (InStream*); void Kino1_InStream_refill (InStream*); char Kino1_InStream_read_byte (InStream*); void Kino1_InStream_read_bytes (InStream*, char*, STRLEN); void Kino1_InStream_read_chars (InStream*, char*, STRLEN, STRLEN); U32 Kino1_InStream_read_int (InStream*); double Kino1_InStream_read_long (InStream*); U32 Kino1_InStream_decode_vint(char**); U32 Kino1_InStream_read_vint (InStream*); double Kino1_InStream_read_vlong (InStream*); void Kino1_InStream_destroy (InStream*); #endif /* include guard */ __C__ #include "KinoSearch1StoreInStream.h" InStream* Kino1_InStream_new(char *class, SV *fh_sv, double offset, double len ) { InStream *instream; /* allocate */ Kino1_New(0, instream, 1, InStream); /* assign */ instream->fh_sv = newSVsv(fh_sv); instream->fh = IoIFP( sv_2io(fh_sv) ); instream->offset = offset; /* init buffer */ instream->buf = NULL; instream->buf_start = 0; instream->buf_len = 0; instream->buf_pos = 0; /* seek */ if (offset != 0) { PerlIO_seek(instream->fh, offset, 0); } /* calculate len if an (intentionally) invalid value was supplied */ if (len < 0.0) { double bookmark = PerlIO_tell(instream->fh); PerlIO_seek(instream->fh, 0, 2); len = PerlIO_tell(instream->fh); PerlIO_seek(instream->fh, bookmark, 0); } instream->len = len; /* assign methods */ instream->seek = Kino1_InStream_seek; instream->tell = Kino1_InStream_tell; instream->read_byte = Kino1_InStream_read_byte; instream->read_bytes = Kino1_InStream_read_bytes; instream->read_chars = Kino1_InStream_read_chars; instream->read_int = Kino1_InStream_read_int; instream->read_long = Kino1_InStream_read_long; instream->read_vint = Kino1_InStream_read_vint; instream->read_vlong = Kino1_InStream_read_vlong; return instream; } void Kino1_InStream_seek(InStream *instream, double target) { /* seek within buffer if possible */ if ( (target >= instream->buf_start) && (target < (instream->buf_start + instream->buf_pos)) ) { instream->buf_pos = target - instream->buf_start; } /* nope, not possible, so seek within file and prepare to refill */ else { instream->buf_start = target; instream->buf_pos = 0; instream->buf_len = 0; PerlIO_seek(instream->fh, target + instream->offset, 0); } } double Kino1_InStream_tell(InStream *instream) { return instream->buf_start + instream->buf_pos; } void Kino1_InStream_refill(InStream *instream) { int check_val; /* wait to allocate buffer until it's needed */ if (instream->buf == NULL) Kino1_New(0, instream->buf, KINO_IO_STREAM_BUF_SIZE, char); /* add bytes read to file position, reset */ instream->buf_start += instream->buf_pos; instream->buf_pos = 0; /* calculate the number of bytes to read */ if (KINO_IO_STREAM_BUF_SIZE < instream->len - instream->buf_start) instream->buf_len = KINO_IO_STREAM_BUF_SIZE; else instream->buf_len = instream->len - instream->buf_start; /* perform the file operations */ PerlIO_seek(instream->fh, 0, 1); check_val = PerlIO_seek(instream->fh, (instream->buf_start + instream->offset), 0); if (check_val == -1) Kino1_confess("refill: PerlIO_seek failed: %d", errno); check_val = PerlIO_read(instream->fh, instream->buf, instream->buf_len); if (check_val != instream->buf_len) Kino1_confess("refill: tried to read %d bytes, got %d: %d", instream->buf_len, check_val, errno); } char Kino1_InStream_read_byte(InStream *instream) { if (instream->buf_pos >= instream->buf_len) Kino1_InStream_refill(instream); return instream->buf[ instream->buf_pos++ ]; } void Kino1_InStream_read_bytes (InStream *instream, char* buf, STRLEN len) { if (instream->buf_pos + len < instream->buf_len) { /* request is entirely within buffer, so copy */ Copy((instream->buf + instream->buf_pos), buf, len, char); instream->buf_pos += len; } else { /* get the request from the file and reset buffer */ int check_val; Off_t start; start = instream->tell(instream); check_val = PerlIO_seek(instream->fh, (start + instream->offset), 0); if (check_val == -1) Kino1_confess("read_bytes: PerlIO_seek failed: %d", errno ); check_val = PerlIO_read(instream->fh, buf, len); if (check_val < len) Kino1_confess("read_bytes: tried to read %"UVuf" bytes, got %d", (UV)len, check_val); /* reset vars and refill if there's more in the file */ instream->buf_start = start + len; instream->buf_pos = 0; instream->buf_len = 0; if (instream->buf_start < instream->len) Kino1_InStream_refill(instream); } } /* This is just a wrapper for read_bytes, but that may change. It should * be used whenever Lucene character data is being read, typically after * read_vint as part of a String read. If and when a change does come, it will * be a lot easier to track down all the relevant code fragments if read_chars * gets used consistently. */ void Kino1_InStream_read_chars(InStream *instream, char *buf, STRLEN start, STRLEN len) { buf += start; instream->read_bytes(instream, buf, len); } U32 Kino1_InStream_read_int (InStream *instream) { unsigned char buf[4]; instream->read_bytes(instream, (char*)buf, 4); return Kino1_decode_bigend_U32(buf); } double Kino1_InStream_read_long (InStream *instream) { unsigned char buf[8]; double aDouble; /* get 8 bytes from the stream */ instream->read_bytes(instream, (char*)buf, 8); /* get high 4 bytes, multiply by 2**32 */ aDouble = Kino1_decode_bigend_U32(buf); aDouble = aDouble * pow(2.0, 32.0); /* decode low four bytes as unsigned int and add to total */ aDouble += Kino1_decode_bigend_U32(&buf[4]); return aDouble; } /* read in a Variable INTeger, stored in 1-5 bytes */ U32 Kino1_InStream_read_vint (InStream *instream) { unsigned char aUChar; int bitshift; U32 aU32; /* start by reading one byte; use the lower 7 bits */ aUChar = (unsigned char)instream->read_byte(instream); aU32 = aUChar & 0x7f; /* keep reading and shifting as long as the high bit is set */ for (bitshift = 7; (aUChar & 0x80) != 0; bitshift += 7) { aUChar = (unsigned char)instream->read_byte(instream); aU32 |= (aUChar & 0x7f) << bitshift; } return aU32; } U32 Kino1_InStream_decode_vint(char **source_ptr) { char *source; int bitshift; U32 aU32; source = *source_ptr; aU32 = (unsigned char)*source & 0x7f; for (bitshift = 7; (*source & 0x80) != 0; bitshift += 7) { source++; aU32 |= ((unsigned char)*source & 0x7f) << bitshift; } source++; *source_ptr = source; return aU32; } double Kino1_InStream_read_vlong (InStream *instream) { unsigned char aUChar; int bitshift; double aDouble; aUChar = (unsigned char)instream->read_byte(instream); aDouble = aUChar & 0x7f; for (bitshift = 7; (aUChar & 0x80) != 0; bitshift += 7) { aUChar = (unsigned char)instream->read_byte(instream); aDouble += (aUChar & 0x7f) * pow(2, bitshift); } return aDouble; } void Kino1_InStream_destroy(InStream* instream) { SvREFCNT_dec(instream->fh_sv); Kino1_Safefree(instream->buf); Kino1_Safefree(instream); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Store::InStream - filehandles for reading invindexes ==head1 SYNOPSIS # isa blessed filehandle my $instream = $invindex->open_instream( $filehandle, $offset, $length ); my @ten_vints = $instream->lu_read('V10'); ==head1 DESCRIPTION The InStream class abstracts out all input operations to KinoSearch1. InStream is implemented as a inside-out object around a blessed filehandle. It would almost be possible to use an ordinary filehandle, but the objectification is necessary because InStreams have to be capable of pretending that they are acting upon a distinct file when in reality they may be reading only a portion of a compound file. For the template used by lu_read, see InStream's companion, L. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Store/InvIndex.pm000444000765000765 706211462203445 21725 0ustar00marvinmarvin000000000000package KinoSearch1::Store::InvIndex; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( create => 0, path => undef, ); } __PACKAGE__->ready_get(qw( create path )); =begin comment my $outstream = $invindex->open_outstream($filename); Given a filename, return an OutStream object. =end comment =cut sub open_outstream { shift->abstract_death } =begin comment my $instream = $invindex->open_instream($filename); Given a filename, return an InStream object. =end comment =cut sub open_instream { shift->abstract_death } =begin comment my @files = $invindex->list; Return a list of all the files in the InvIndex =end comment =cut sub list { shift->abstract_death } =begin comment my $truth = $invindex->file_exists($filename); Indicate whether the invindex contains a file with the given filename. =end comment =cut sub file_exists { shift->abstract_death } =begin comment $invindex->rename_file( $from, $to ); Rename a file. =end comment =cut sub rename_file { shift->abstract_death } =begin comment $invindex->delete_file($filename); Delete a file from the invindex. =end comment =cut sub delete_file { shift->abstract_death } =begin comment my $file_contents = $invindex->slurp_file($filename); Return a scalar with the file's contents. Only for small files, obviously. =end comment =cut sub slurp_file { shift->abstract_death } =begin comment my $lock = $invindex->make_lock( lock_name => $name, timeout => $timeout, # milliseconds ); Factory method for creating a KinoSearch1::Store::Lock subclassed object. =end comment =cut sub make_lock { shift->abstract_death } =begin comment $invindex->run_while_locked( lock_name => $name, timeout => $timeout, # milliseconds do_body => \&do_some_stuff, ); Create a Lock object and obtain a lock, run the subroutine specified by the do_body parameter, then release the lock and discard the Lock object. The hash-style argument labels include all the arguments to make_lock, plus do_body. =end comment =cut sub run_while_locked { my ( $self, %args ) = @_; my $do_body = delete $args{do_body}; my $lock = $self->make_lock( %args, invindex => $self, ); my $locked; eval { $locked = $lock->obtain; $do_body->(); }; $lock->release if $lock->is_locked; confess $@ if $@; } =begin comment $invindex->close() Release any reserved resources. =end comment =cut sub close { shift->abstract_death } 1; __END__ =head1 NAME KinoSearch1::Store::InvIndex - inverted index =head1 SYNOPSIS # abstract base class =head1 DESCRIPTION An InvIndex is an abstract representation of an inverted index, KinoSearch1's core data structure. The archetypal implementation of an invindex, FSInvIndex, is a single directory holding a collection of files. However, to allow alternative implementations such as RAMInvIndex, i/o and file manipulation are abstracted out rather than executed directly by KinoSearch1's classes. A "file" within an invindex might be a real file -- or it might be a ram file, or eventually a database record, etc. Similarly, C<< $invindex->delete_file($filename) >> might delete a file from the file system, or a key-value pair from a hash, or something else. =head1 SEE ALSO L =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Store/Lock.pm000444000765000765 405411462203446 21070 0ustar00marvinmarvin000000000000package KinoSearch1::Store::Lock; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members invindex => undef, lock_name => undef, timeout => 0, ); } use constant LOCK_POLL_INTERVAL => 1000; # Attempt to aquire lock once per second until the timeout has been reached. sub obtain { my $self = shift; # calculate maximum seconds to sleep my $sleepcount = $self->{timeout} / LOCK_POLL_INTERVAL; # keep trying to obtain lock until timeout is reached my $locked = $self->do_obtain; while ( !$locked ) { croak("Couldn't get lock using '$self->{lock_name}'") if $sleepcount-- <= 0; sleep 1; $locked = $self->do_obtain; } return $locked; } =begin comment my $locked = $lock->do_obtain; Do the actual work to aquire the lock and return a boolean reflecting success/failure. =end comment =cut sub do_obtain { shift->abstract_death } =begin comment $lock->release; Release the lock. =end comment =cut sub release { shift->abstract_death } =begin comment my $locked_or_not = $lock->is_locked; Return true if the resource is locked, false otherwise. =end comment =cut sub is_locked { shift->abstract_death } # Getter for lock_name. sub get_lock_name { shift->{lock_name} } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Store::Lock - mutex lock on an invindex ==head1 SYNOPSIS # abstract base class, but here's typical usage: my $lock = $invindex->make_lock( lock_name => COMMIT_LOCK_NAME, timeout => 5000, ); ==head1 DESCRIPTION The Lock class produces an interprocess mutex lock. It does not rely on flock(). Lock must be subclassed, and instances must be constructed using the C factory method of KinoSearch1::Store::InvIndex. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Store/OutStream.pm000444000765000765 4202311462203446 22141 0ustar00marvinmarvin000000000000package KinoSearch1::Store::OutStream; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::CClass ); sub close { my $self = shift; $self->flush; CORE::close $self->get_fh; } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Store::OutStream =for comment Constructor - takes one arg: a filehandle. =cut OutStream* new(class, fh_sv) char *class; SV *fh_sv; CODE: RETVAL = Kino1_OutStream_new(class, fh_sv); OUTPUT: RETVAL void seek(outstream, target) OutStream *outstream; double target; PPCODE: outstream->seek(outstream, target); double tell(outstream) OutStream *outstream; CODE: RETVAL = outstream->tell(outstream); OUTPUT: RETVAL double length(outstream) OutStream *outstream; CODE: RETVAL = Kino1_OutStream_length(outstream); OUTPUT: RETVAL void flush(outstream); OutStream *outstream; PPCODE: Kino1_OutStream_flush(outstream); =for comment Write the entire contents of an instream to an outstream. =cut void absorb(outstream, instream) OutStream *outstream; InStream *instream; PPCODE: Kino1_OutStream_absorb(outstream, instream); SV* _set_or_get(outstream, ...) OutStream *outstream; ALIAS: set_fh = 1 get_fh = 2 CODE: { KINO_START_SET_OR_GET_SWITCH case 1: Kino1_confess("Can't set_fh"); /* fall through */ case 2: RETVAL = newSVsv(outstream->fh_sv); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL =begin comment $outstream->lu_write( TEMPLATE, LIST ); Write the items in LIST to the OutStream using the serialization schemes specified by TEMPLATE. =end comment =cut void lu_write (outstream, template_sv, ...) OutStream *outstream; SV *template_sv; PREINIT: STRLEN tpt_len; /* bytelength of template */ char *template; /* ptr to a spot in the template */ char *tpt_end; /* ptr to the end of the template */ int repeat_count; /* number of times to repeat sym */ int item_count; /* current place in @_ */ char sym; /* the current symbol in the template */ char countsym; /* used when calculating repeat counts */ I32 aI32; U32 aU32; double aDouble; SV *aSV; char *string; STRLEN string_len; PPCODE: { /* require an object, a template, and at least 1 item */ if (items < 2) { Kino1_confess("lu_write error: too few arguments"); } /* prepare the template and get pointers */ template = SvPV(template_sv, tpt_len); tpt_end = template + tpt_len; /* reject an empty template */ if (tpt_len == 0) { Kino1_confess("lu_write error: TEMPLATE cannot be empty string"); } /* init counters */ repeat_count = 0; item_count = 2; while (1) { /* only process template if we're not in the midst of a repeat */ if (repeat_count == 0) { /* fast-forward past space characters */ while (*template == ' ' && template < tpt_end) { template++; } /* if we're done, return or throw error */ if (template == tpt_end || item_count == items) { if (item_count != items) { Kino1_confess( "lu_write error: Too many ITEMS, not enough TEMPLATE"); } else if (template != tpt_end) { Kino1_confess( "lu_write error: Too much TEMPLATE, not enough ITEMS"); } else { /* success! */ break; } } /* derive the current symbol and a possible digit repeat sym */ sym = *template++; countsym = *template; if (template == tpt_end) { /* sym is last char in template */ repeat_count = 1; } else if (countsym >= '0' && countsym <= '9') { /* calculate numerical repeat count */ repeat_count = countsym - KINO_NUM_CHAR_OFFSET; countsym = *(++template); while ( template <= tpt_end && countsym >= '0' && countsym <= '9' ) { repeat_count = (repeat_count * 10) + (countsym - KINO_NUM_CHAR_OFFSET); countsym = *(++template); } } else { /* no numeric repeat count, so process sym only once */ repeat_count = 1; } } switch(sym) { case 'a': /* arbitrary binary data */ aSV = ST(item_count); if (!SvOK(aSV)) { Kino1_confess("Internal error: undef at lu_write 'a'"); } string = SvPV(aSV, string_len); if (repeat_count != string_len) { Kino1_confess( "lu_write error: repeat_count != string_len: %d %d", repeat_count, string_len); } Kino1_OutStream_write_bytes(outstream, string, string_len); /* trigger next sym */ repeat_count = 1; break; case 'b': /* signed byte */ case 'B': /* unsigned byte */ aI32 = SvIV( ST(item_count) ); Kino1_OutStream_write_byte(outstream, (char)(aI32 & 0xff)); break; case 'i': /* signed 32-bit integer */ aI32 = SvIV( ST(item_count) ); Kino1_OutStream_write_int(outstream, (U32)aI32); break; case 'I': /* unsigned 32-bit integer */ aU32 = SvUV( ST(item_count) ); Kino1_OutStream_write_int(outstream, aU32); break; case 'Q': /* unsigned "64-bit" integer */ aDouble = SvNV( ST(item_count) ); Kino1_OutStream_write_long(outstream, aDouble); break; case 'V': /* VInt */ aU32 = SvUV( ST(item_count) ); Kino1_OutStream_write_vint(outstream, aU32); break; case 'W': /* VLong */ aDouble = SvNV( ST(item_count) ); Kino1_OutStream_write_vlong(outstream, aDouble); break; case 'T': /* string */ aSV = ST(item_count); string = SvPV(aSV, string_len); Kino1_OutStream_write_string(outstream, string, string_len); break; default: Kino1_confess("Illegal character in template: %c", sym); } /* use up one repeat_count and one item from the stack */ repeat_count--; item_count++; } } void DESTROY(outstream) OutStream *outstream; PPCODE: Kino1_OutStream_destroy(outstream); __H__ #ifndef H_KINOIO #define H_KINOIO 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1StoreInStream.h" #include "KinoSearch1UtilCarp.h" #include "KinoSearch1UtilMathUtils.h" typedef struct outstream { PerlIO *fh; SV *fh_sv; char *buf; Off_t buf_start; int buf_pos; void (*seek) (struct outstream*, double); double (*tell) (struct outstream*); void (*write_byte) (struct outstream*, char); void (*write_bytes) (struct outstream*, char*, STRLEN); void (*write_int) (struct outstream*, U32); void (*write_long) (struct outstream*, double); void (*write_vint) (struct outstream*, U32); void (*write_vlong) (struct outstream*, double); void (*write_string)(struct outstream*, char*, STRLEN); } OutStream; OutStream* Kino1_OutStream_new (char*, SV*); void Kino1_OutStream_seek (OutStream*, double); double Kino1_OutStream_tell (OutStream*); double Kino1_OutStream_length (OutStream*); void Kino1_OutStream_flush (OutStream*); void Kino1_OutStream_absorb (OutStream*, InStream*); void Kino1_OutStream_write_byte (OutStream*, char); void Kino1_OutStream_write_bytes (OutStream*, char*, STRLEN); void Kino1_OutStream_write_int (OutStream*, U32); void Kino1_OutStream_write_long (OutStream*, double); void Kino1_OutStream_write_vint (OutStream*, U32); int Kino1_OutStream_encode_vint (U32, char*); void Kino1_OutStream_write_vlong (OutStream*, double); void Kino1_OutStream_write_string (OutStream*, char*, STRLEN); void Kino1_OutStream_destroy (OutStream*); #endif /* include guard */ __C__ #include "KinoSearch1StoreOutStream.h" OutStream* Kino1_OutStream_new(char* class, SV* fh_sv) { OutStream *outstream; /* allocate */ Kino1_New(0, outstream, 1, OutStream); /* assign */ outstream->fh_sv = newSVsv(fh_sv); outstream->fh = IoOFP( sv_2io(fh_sv) ); /* init buffer */ Kino1_New(0, outstream->buf, KINO_IO_STREAM_BUF_SIZE, char); outstream->buf_start = 0; outstream->buf_pos = 0; /* assign methods */ outstream->seek = Kino1_OutStream_seek; outstream->tell = Kino1_OutStream_tell; outstream->write_byte = Kino1_OutStream_write_byte; outstream->write_bytes = Kino1_OutStream_write_bytes; outstream->write_int = Kino1_OutStream_write_int; outstream->write_long = Kino1_OutStream_write_long; outstream->write_vint = Kino1_OutStream_write_vint; outstream->write_vlong = Kino1_OutStream_write_vlong; outstream->write_string = Kino1_OutStream_write_string; return outstream; } void Kino1_OutStream_seek(OutStream *outstream, double target) { Kino1_OutStream_flush(outstream); outstream->buf_start = target; PerlIO_seek(outstream->fh, target, 0); } double Kino1_OutStream_tell(OutStream *outstream) { return outstream->buf_start + outstream->buf_pos; } double Kino1_OutStream_length(OutStream *outstream) { double len; /* flush, go to end, note length, return to bookmark */ Kino1_OutStream_flush(outstream); PerlIO_seek(outstream->fh, 0, 2); len = PerlIO_tell(outstream->fh); PerlIO_seek(outstream->fh, outstream->buf_start, 0); return len; } void Kino1_OutStream_flush(OutStream *outstream) { PerlIO_write(outstream->fh, outstream->buf, outstream->buf_pos); outstream->buf_start += outstream->buf_pos; outstream->buf_pos = 0; } void Kino1_OutStream_absorb(OutStream *outstream, InStream *instream) { double bytes_left, bytes_this_iter; char *buf; int check_val; /* flush, then "borrow" the buffer */ Kino1_OutStream_flush(outstream); buf = outstream->buf; bytes_left = instream->len; while (bytes_left > 0) { bytes_this_iter = bytes_left < KINO_IO_STREAM_BUF_SIZE ? bytes_left : KINO_IO_STREAM_BUF_SIZE; instream->read_bytes(instream, buf, bytes_this_iter); check_val = PerlIO_write(outstream->fh, buf, bytes_this_iter); if (check_val != bytes_this_iter) { Kino1_confess("outstream->absorb error: %"UVuf", %d", (UV)bytes_this_iter, check_val); } bytes_left -= bytes_this_iter; outstream->buf_start += bytes_this_iter; } } void Kino1_OutStream_write_byte(OutStream *outstream, char aChar) { if (outstream->buf_pos >= KINO_IO_STREAM_BUF_SIZE) Kino1_OutStream_flush(outstream); outstream->buf[ outstream->buf_pos++ ] = aChar; } void Kino1_OutStream_write_bytes(OutStream *outstream, char *bytes, STRLEN len) { /* if this data is larger than the buffer size, flush and write */ if (len >= KINO_IO_STREAM_BUF_SIZE) { int check_val; Kino1_OutStream_flush(outstream); check_val = PerlIO_write(outstream->fh, bytes, len); if (check_val != len) { Kino1_confess("Write error: tried to write %"UVuf", got %d", (UV)len, check_val); } outstream->buf_start += len; } /* if there's not enough room in the buffer, flush then add */ else if (outstream->buf_pos + len >= KINO_IO_STREAM_BUF_SIZE) { Kino1_OutStream_flush(outstream); Copy(bytes, (outstream->buf + outstream->buf_pos), len, char); outstream->buf_pos += len; } /* if there's room, just add these bytes to the buffer */ else { Copy(bytes, (outstream->buf + outstream->buf_pos), len, char); outstream->buf_pos += len; } } void Kino1_OutStream_write_int(OutStream *outstream, U32 aU32) { unsigned char buf[4]; Kino1_encode_bigend_U32(aU32, buf); outstream->write_bytes(outstream, (char*)buf, 4); } void Kino1_OutStream_write_long(OutStream *outstream, double aDouble) { unsigned char buf[8]; U32 aU32; /* derive the upper 4 bytes by truncating a quotient */ aU32 = floor( ldexp( aDouble, -32 ) ); Kino1_encode_bigend_U32(aU32, buf); /* derive the lower 4 bytes by taking a modulus against 2**32 */ aU32 = fmod(aDouble, (pow(2.0, 32.0))); Kino1_encode_bigend_U32(aU32, &buf[4]); /* print encoded Long to the output handle */ outstream->write_bytes(outstream, (char*)buf, 8); } void Kino1_OutStream_write_vint(OutStream *outstream, U32 aU32) { char buf[5]; int num_bytes; num_bytes = Kino1_OutStream_encode_vint(aU32, buf); outstream->write_bytes(outstream, buf, num_bytes); } /* Encode a VInt. buf must have room for at 5 bytes. */ int Kino1_OutStream_encode_vint(U32 aU32, char *buf) { int num_bytes = 0; while ((aU32 & ~0x7f) != 0) { buf[num_bytes++] = ( (aU32 & 0x7f) | 0x80 ); aU32 >>= 7; } buf[num_bytes++] = aU32 & 0x7f; return num_bytes; } void Kino1_OutStream_write_vlong(OutStream *outstream, double aDouble) { unsigned char buf[10]; int num_bytes = 0; U32 aU32; while (aDouble > 127.0) { /* take modulus of num against 128 */ aU32 = fmod(aDouble, 128); buf[num_bytes++] = ( (aU32 & 0x7f) | 0x80 ); /* right shift for floating point! */ aDouble = floor( ldexp( aDouble, -7 ) ); } buf[num_bytes++] = aDouble; outstream->write_bytes(outstream, (char*)buf, num_bytes); } void Kino1_OutStream_write_string(OutStream *outstream, char *string, STRLEN len) { Kino1_OutStream_write_vint(outstream, (U32)len); Kino1_OutStream_write_bytes(outstream, string, len); } void Kino1_OutStream_destroy(OutStream *outstream) { Kino1_OutStream_flush(outstream); SvREFCNT_dec(outstream->fh_sv); Kino1_Safefree(outstream->buf); Kino1_Safefree(outstream); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Store::OutStream - filehandles for writing invindexes ==head1 SYNOPSIS # isa blessed filehandle my $outstream = $invindex->open_outstream( $filename ); $outstream->lu_write( 'V8', @eight_vints ); ==head1 DESCRIPTION The OutStream class abstracts all of KinoSearch1's output operations. It is akin to a narrowly-implemented, specialized IO::File. Unlike its counterpart InStream, OutStream cannot be assigned an arbitrary C or C. ==head2 lu_write / lu_read template lu_write and it's opposite number, InStream's lu_read, provide a pack/unpack-style interface for handling primitive data types required by the Lucene index file format. The most notable of these specialized data types is the VInt, or Variable Integer, which is similar to the BER compressed integer (pack template 'w'). All fixed-width integer formats are stored in big-endian order (high-byte first). Signed integers use twos-complement encoding. The maximum allowable value both Long and VLong is 2**52 because it is stored inside the NV (double) storage pocket of a perl Scalar, which has a 53-bit mantissa. a Arbitrary binary data, copied to/from the scalar's PV (string) b 8-bit integer, signed B 8-bit integer, unsigned i 32-bit integer, signed I 32-bit integer, unsigned Q 64-bit integer, unsigned (max value 2**52) V VInt variable-width integer, unsigned (max value 2**32) W VLong variable-width integer, unsigned (max value 2**52) T Lucene string, which is a VInt indicating the length in bytes followed by the string. The string must be valid UTF-8. Numeric repeat counts are supported: $outstream->lu_write( 'V2 T', 0, 1, "a string" ); Other features of pack/unpack such as parentheses, infinite repeats via '*', and slash notation are not. A numeric repeat count following 'a' indicates how many bytes to read, while a count following any other symbol indicates how many scalars of that type to return. ( $three_byte_string, @eight_vints ) = $instream->lu_read('a3V8'); The behavior of lu_read and lu_write is much more strict with regards to a mismatch between TEMPLATE and LIST than pack/unpack, which are fairly forgiving in what they will accept. lu_read will confess() if it cannot read all the items specified by TEMPLATE from the InStream, and lu_write will confess() if the number of items in LIST does not match the expression in TEMPLATE. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Store/RAMInvIndex.pm000444000765000765 726711462203446 22275 0ustar00marvinmarvin000000000000package KinoSearch1::Store::RAMInvIndex; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Store::InvIndex ); BEGIN { __PACKAGE__->init_instance_vars( # members ramfiles => undef, ); } use Digest::MD5 qw( md5_hex ); use KinoSearch1::Store::FSInvIndex; use KinoSearch1::Store::InStream; use KinoSearch1::Store::OutStream; use KinoSearch1::Store::RAMLock; sub init_instance { my $self = shift; $self->{ramfiles} = {}; # read in an FSInvIndex if specified $self->_read_invindex if defined $self->{path}; } sub _read_invindex { my $self = shift; # open an FSInvIndex for reading my $source_invindex = KinoSearch1::Store::FSInvIndex->new( path => $self->{path}, ); # copy every file in the FSInvIndex into RAM. for my $filename ( $source_invindex->list ) { my $source_stream = $source_invindex->open_instream($filename); my $outstream = $self->open_outstream($filename); $outstream->absorb($source_stream); $source_stream->close; $outstream->close; } $source_invindex->close; } sub open_outstream { my ( $self, $filename ) = @_; # use perl scalars as virtual files; my $data = ''; $self->{ramfiles}{$filename} = \$data; open( my $fh, '>:scalar', \$data ) or die $!; binmode($fh); return KinoSearch1::Store::OutStream->new($fh); } sub open_instream { my ( $self, $filename, $offset, $len ) = @_; confess("File '$filename' not loaded into RAM") unless exists $self->{ramfiles}{$filename}; open( my $fh, '<:scalar', $self->{ramfiles}{$filename} ) or die $!; binmode($fh); return KinoSearch1::Store::InStream->new( $$fh, $offset, $len ); } sub list { keys %{ $_[0]->{ramfiles} }; } sub file_exists { my ( $self, $filename ) = @_; return ( exists $self->{ramfiles}{$filename} ); } sub rename_file { my ( $self, $from, $to ) = @_; confess("File '$from' not currently in RAM") unless exists $self->{ramfiles}{$from}; $self->{ramfiles}{$to} = delete $self->{ramfiles}{$from}; } sub delete_file { my ( $self, $filename ) = @_; my $file = delete $self->{ramfiles}{$filename}; confess("File '$filename' not currently in RAM") unless $file; } sub slurp_file { my ( $self, $filename ) = @_; my $virtual_file_ref = $self->{ramfiles}{$filename}; confess("File '$filename' not currently in RAM") unless defined $virtual_file_ref; # return a copy of the virtual file's content return $$virtual_file_ref; } sub make_lock { my $self = shift; return KinoSearch1::Store::RAMLock->new( @_, invindex => $self ); } sub close { } 1; __END__ =head1 NAME KinoSearch1::Store::RAMInvIndex - in-memory InvIndex =head1 SYNOPSIS my $invindex = KinoSearch1::Store::RAMInvIndex->new( path => '/path/to/invindex', ); # or... my $invindex = KinoSearch1::Store::RAMInvIndex->new; =head1 DESCRIPTION RAMInvIndex is an entirely in-memory implementation of KinoSearch1::Store::InvIndex. It serves two main purposes. First, it's possible to load an existing FSInvIndex into memory, which can improve search-speed -- if you have that kind of RAM to spare. Needless to say, any FSInvIndex you try to load this way should be appropriately modest in size. Second, RAMInvIndex is handy for testing and development. =head1 CONSTRUCTOR =head2 new Create a RAMInvIndex object. C takes one optional parameter, C. If C is supplied, KinoSearch1 will try to read an FSInvIndex at that location into memory. =head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey =head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. =cut KinoSearch1-1.01/lib/KinoSearch1/Store/RAMLock.pm000444000765000765 202711462203445 21425 0ustar00marvinmarvin000000000000package KinoSearch1::Store::RAMLock; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Store::Lock ); BEGIN { __PACKAGE__->init_instance_vars() } sub do_obtain { my $self = shift; # bail if the virtual lockfile already exists return if $self->{invindex}->file_exists( $self->{lock_name} ); # create a virtual lockfile my $temp = $self->{invindex}->open_outstream( $self->{lock_name} ); $temp->close; return 1; } sub release { my $self = shift; # delete the virtual lockfile $self->{invindex}->delete_file( $self->{lock_name} ); } sub is_locked { my $self = shift; return $self->{invindex}->file_exists( $self->{lock_name} ); } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Store::RAMLock - lock a RAMInvIndex ==head1 DESCRIPTION Implementation of KinoSearch1::Store::Lock entirely in memory. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Util000755000765000765 011462203446 17323 5ustar00marvinmarvin000000000000KinoSearch1-1.01/lib/KinoSearch1/Util/BitVector.pm000444000765000765 3553511462203446 21752 0ustar00marvinmarvin000000000000package KinoSearch1::Util::BitVector; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::CClass ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params capacity => 0, ); } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Util::BitVector void new(either_sv, ...) SV *either_sv; PREINIT: const char *class; HV *args_hash; U32 capacity; BitVector *bit_vec; PPCODE: /* determine the class */ class = sv_isobject(either_sv) ? sv_reftype(either_sv, 0) : SvPV_nolen(either_sv); /* process hash-style params */ Kino1_Verify_build_args_hash(args_hash, "KinoSearch1::Util::BitVector::instance_vars", 1); capacity = (U32)SvUV( Kino1_Verify_extract_arg(args_hash, "capacity", 8) ); /* build object */ bit_vec = Kino1_BitVec_new(capacity); ST(0) = sv_newmortal(); sv_setref_pv(ST(0), class, (void*)bit_vec); XSRETURN(1); =for comment Return true if the bit indcated by $num has been set, false if it hasn't (regardless of whether $num lies within the bounds of the object's capacity). =cut bool get(bit_vec, num) BitVector *bit_vec; U32 num; CODE: RETVAL = Kino1_BitVec_get(bit_vec, num); OUTPUT: RETVAL =for comment Set the bit at $num to 1. =cut void set(bit_vec, ...) BitVector *bit_vec; PREINIT: U32 i, num; PPCODE: for (i = 1; i < items; i++) { num = (U32)( SvUV( ST(i) ) ); Kino1_BitVec_set(bit_vec, num); } =for comment Clear the bit at $num (i.e. set it to 0). =cut void clear(bit_vec, num) BitVector *bit_vec; U32 num; PPCODE: Kino1_BitVec_clear(bit_vec, num); =for comment Set all the bits bounded by $first and $last, inclusive, to 1. =cut void bulk_set(bit_vec, first, last) BitVector *bit_vec; U32 first; U32 last; PPCODE: Kino1_BitVec_bulk_set(bit_vec, first, last); =for comment Clear all the bits bounded by $first and $last, inclusive. =cut void bulk_clear(bit_vec, first, last) BitVector *bit_vec; U32 first; U32 last; PPCODE: Kino1_BitVec_bulk_clear(bit_vec, first, last); =for comment Given $num, return either $num (if it is set), the next set bit above it, or if no such bit exists, undef (from Perl) or a sentinel (0xFFFFFFFF) from C. =cut SV* next_set_bit(bit_vec, num) BitVector *bit_vec; U32 num; CODE: num = Kino1_BitVec_next_set_bit(bit_vec, num); RETVAL = num == KINO_BITVEC_SENTINEL ? &PL_sv_undef : newSVuv(num); OUTPUT: RETVAL =for comment Given $num, return $num (if it is clear), or the next clear bit above it. The highest number that next_clear_bit can return is the object's capacity. =cut SV* next_clear_bit(bit_vec, num) BitVector *bit_vec; U32 num; CODE: num = Kino1_BitVec_next_clear_bit(bit_vec, num); RETVAL = num == KINO_BITVEC_SENTINEL ? &PL_sv_undef : newSVuv(num); OUTPUT: RETVAL =for comment Modify the BitVector so that only bits which remain set are those which 1) were already set in this BitVector, and 2) were also set in the other BitVector. =cut void logical_and(bit_vec, other) BitVector *bit_vec; BitVector *other; PPCODE: Kino1_BitVec_logical_and(bit_vec, other); =for comment Return a count of the number of set bits in the BitVector. =cut U32 count(bit_vec) BitVector *bit_vec; CODE: RETVAL = Kino1_BitVec_count(bit_vec); OUTPUT: RETVAL =for comment Return an arrayref of the with each element the number of a set bit. =cut void to_arrayref(bit_vec) BitVector *bit_vec; PREINIT: AV *out_av; PPCODE: out_av = Kino1_BitVec_to_array(bit_vec); XPUSHs( sv_2mortal(newRV_noinc( (SV*)out_av )) ); XSRETURN(1); =for comment Setters and getters. A quirk: set_bits automatically adjusts capacity upwards to the appropriate multiple of 8 if necessary. =cut SV* _set_or_get(bit_vec, ...) BitVector *bit_vec; ALIAS: set_capacity = 1 get_capacity = 2 set_bits = 3 get_bits = 4 PREINIT: STRLEN len; U32 new_capacity; char *new_bits; CODE: { KINO_START_SET_OR_GET_SWITCH case 1: new_capacity = SvUV(ST(1)); if (new_capacity < bit_vec->capacity) { Kino1_BitVec_shrink(bit_vec, new_capacity); } else if (new_capacity > bit_vec->capacity) { Kino1_BitVec_grow(bit_vec, new_capacity); } /* fall through */ case 2: RETVAL = newSVuv(bit_vec->capacity); break; case 3: Kino1_Safefree(bit_vec->bits); new_bits = SvPV(ST(1), len); bit_vec->bits = (unsigned char*)Kino1_savepvn(new_bits, len); bit_vec->capacity = len << 3; /* fall through */ case 4: len = ceil(bit_vec->capacity / 8.0); RETVAL = newSVpv((char*)bit_vec->bits, len); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL void DESTROY(bit_vec) BitVector *bit_vec; PPCODE: Kino1_BitVec_destroy(bit_vec); __H__ #ifndef H_KINO_BIT_VECTOR #define H_KINO_BIT_VECTOR 1 #include "limits.h" #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1UtilMathUtils.h" #include "KinoSearch1UtilCarp.h" #include "KinoSearch1UtilMemManager.h" #define KINO_BITVEC_SENTINEL 0xFFFFFFFF typedef struct bitvector { U32 capacity; unsigned char *bits; } BitVector; BitVector* Kino1_BitVec_new(U32); BitVector* Kino1_BitVec_clone(BitVector*); void Kino1_BitVec_grow(BitVector*, U32); void Kino1_BitVec_shrink(BitVector *, U32); void Kino1_BitVec_set(BitVector*, U32); void Kino1_BitVec_clear(BitVector*, U32); void Kino1_BitVec_bulk_set(BitVector*, U32, U32); void Kino1_BitVec_bulk_clear(BitVector*, U32, U32); bool Kino1_BitVec_get(BitVector*, U32); U32 Kino1_BitVec_next_set_bit(BitVector*, U32); U32 Kino1_BitVec_next_clear_bit(BitVector*, U32); void Kino1_BitVec_logical_and(BitVector*, BitVector*); U32 Kino1_BitVec_count(BitVector*); AV* Kino1_BitVec_to_array(BitVector*); void Kino1_BitVec_destroy(BitVector*); #endif /* include guard */ __C__ #include "KinoSearch1UtilBitVector.h" static unsigned char bitmasks[] = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, }; BitVector* Kino1_BitVec_new(U32 capacity) { BitVector *bit_vec; Kino1_New(0, bit_vec, 1, BitVector); bit_vec->capacity = 0; bit_vec->bits = NULL; Kino1_BitVec_grow(bit_vec, capacity); return bit_vec; } BitVector* Kino1_BitVec_clone(BitVector *bit_vec) { BitVector *evil_twin; U32 byte_size; Kino1_New(0, evil_twin, 1, BitVector); byte_size = ceil(bit_vec->capacity / 8.0); evil_twin->bits = (unsigned char*)Kino1_savepvn((char*)bit_vec->bits, byte_size); evil_twin->capacity = bit_vec->capacity; return evil_twin; } void Kino1_BitVec_grow(BitVector *bit_vec, U32 capacity) { U32 byte_size; U32 old_capacity; /* derive size in bytes from size in bits */ byte_size = ceil(capacity / 8.0); if (capacity > bit_vec->capacity && bit_vec->bits != NULL) { U32 old_byte_size; old_byte_size = ceil(bit_vec->capacity / 8.0); Kino1_Renew(bit_vec->bits, byte_size, unsigned char); /* zero out all new bits, since Renew doesn't guarantee they're 0 */ old_capacity = bit_vec->capacity; bit_vec->capacity = capacity; Kino1_BitVec_bulk_clear(bit_vec, old_capacity, capacity - 1); /* shouldn't be necessary, but Valgrind reports an error without it */ if (byte_size > old_byte_size) { memset( (bit_vec->bits + old_byte_size), 0x00, (byte_size - old_byte_size) ); } } else if (bit_vec->bits == NULL) { Kino1_Newz(0, bit_vec->bits, byte_size, unsigned char); bit_vec->capacity = capacity; } } void Kino1_BitVec_shrink(BitVector *bit_vec, U32 capacity) { U32 byte_size; if (capacity >= bit_vec->capacity) return; /* derive size in bytes from size in bits */ byte_size = ceil(capacity / 8.0); Kino1_Renew(bit_vec->bits, byte_size, unsigned char); bit_vec->capacity = capacity; } void Kino1_BitVec_set(BitVector *bit_vec, U32 num) { if (num >= bit_vec->capacity) Kino1_BitVec_grow(bit_vec, num + 1); bit_vec->bits[ (num >> 3) ] |= bitmasks[num & 0x7]; } void Kino1_BitVec_clear(BitVector *bit_vec, U32 num) { if (num >= bit_vec->capacity) Kino1_BitVec_grow(bit_vec, num + 1); bit_vec->bits[ (num >> 3) ] &= ~(bitmasks[num & 0x7]); } void Kino1_BitVec_bulk_set(BitVector *bit_vec, U32 first, U32 last) { unsigned char *ptr; U32 num_bytes; /* detect range errors */ if (first > last) { Kino1_confess("bitvec range error: %d %d %d", first, last, bit_vec->capacity); } /* grow the bits if necessary */ if (last >= bit_vec->capacity) { Kino1_BitVec_grow(bit_vec, last); } /* set partial bytes */ while (first % 8 != 0 && first <= last) { Kino1_BitVec_set(bit_vec, first++); } while (last % 8 != 0 && last >= first) { Kino1_BitVec_set(bit_vec, last--); } Kino1_BitVec_set(bit_vec, last); /* mass set whole bytes */ if (last > first) { ptr = bit_vec->bits + (first >> 3); num_bytes = (last - first) >> 3; memset(ptr, 0xff, num_bytes); } } void Kino1_BitVec_bulk_clear(BitVector *bit_vec, U32 first, U32 last) { unsigned char *ptr; U32 num_bytes; /* detect range errors */ if (first > last) { Kino1_confess("bitvec range error: %d %d %d", first, last, bit_vec->capacity); } /* grow the bits if necessary */ if (last >= bit_vec->capacity) { Kino1_BitVec_grow(bit_vec, last); } /* clear partial bytes */ while (first % 8 != 0 && first <= last) { Kino1_BitVec_clear(bit_vec, first++); } while (last % 8 != 0 && last >= first) { Kino1_BitVec_clear(bit_vec, last--); } Kino1_BitVec_clear(bit_vec, last); /* mass clear whole bytes */ if (last > first) { ptr = bit_vec->bits + (first >> 3); num_bytes = (last - first) >> 3; memset(ptr, 0, num_bytes); } } bool Kino1_BitVec_get(BitVector *bit_vec, U32 num) { if (num >= bit_vec->capacity) return 0; return (bit_vec->bits[ (num >> 3) ] & bitmasks[num & 0x7]) != 0; } U32 Kino1_BitVec_next_set_bit(BitVector *bit_vec, U32 num) { U32 outval; unsigned char *bits_ptr; unsigned char *end_ptr; int i; U32 byte_size; if (num >= bit_vec->capacity) { return KINO_BITVEC_SENTINEL; } outval = KINO_BITVEC_SENTINEL; bits_ptr = bit_vec->bits + (num >> 3) ; byte_size = ceil(bit_vec->capacity / 8.0); end_ptr = bit_vec->bits + byte_size; while (outval == KINO_BITVEC_SENTINEL) { if (*bits_ptr != 0) { /* check each num in represented in this byte */ outval = (bits_ptr - bit_vec->bits) * 8; for (i = 0; i < 8; i++) { if (Kino1_BitVec_get(bit_vec, outval) == 1) { if (outval < bit_vec->capacity && outval >= num) { return outval; } } outval++; } /* nothing valid, so reset the sentinel */ outval = KINO_BITVEC_SENTINEL; } if (++bits_ptr >= end_ptr) break; } /* nothing valid, so return a sentinel */ return KINO_BITVEC_SENTINEL; } U32 Kino1_BitVec_next_clear_bit(BitVector *bit_vec, U32 num) { U32 outval; unsigned char *bits_ptr; unsigned char *end_ptr; int i; if (num >= bit_vec->capacity) { return num; } outval = KINO_BITVEC_SENTINEL; bits_ptr = bit_vec->bits + (num >> 3) ; end_ptr = bit_vec->bits + (bit_vec->capacity >> 3); while (outval == KINO_BITVEC_SENTINEL) { if (*bits_ptr != 0xFF) { /* check each num in represented in this byte */ outval = (bits_ptr - bit_vec->bits) * 8; for (i = 0; i < 8; i++) { if (Kino1_BitVec_get(bit_vec, outval) == 0) { if (outval < bit_vec->capacity && outval >= num) { return outval; } } outval++; } /* nothing valid, so reset the sentinel */ outval = KINO_BITVEC_SENTINEL; } if (++bits_ptr >= end_ptr) break; } /* didn't find clear bits in the set, so return 1 larger than the max */ return bit_vec->capacity; } void Kino1_BitVec_logical_and(BitVector *bit_vec, BitVector *other) { U32 num = 0; while (1) { num = Kino1_BitVec_next_set_bit(bit_vec, num); if (num == KINO_BITVEC_SENTINEL) break; if ( !Kino1_BitVec_get(other, num) ) Kino1_BitVec_clear(bit_vec, num); num++; } } const U32 BYTE_COUNTS[256] = { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8 }; U32 Kino1_BitVec_count(BitVector *bit_vec) { U32 count = 0; U32 byte_size = ceil(bit_vec->capacity / 8.0); unsigned char *ptr = bit_vec->bits; unsigned char *limit = ptr + byte_size; for( ; ptr < limit; ptr++) { count += BYTE_COUNTS[*ptr]; } return count; } AV* Kino1_BitVec_to_array(BitVector* bit_vec) { U32 num = 0; AV *out_av; out_av = newAV(); while (1) { num = Kino1_BitVec_next_set_bit(bit_vec, num); if (num == KINO_BITVEC_SENTINEL) break; av_push( out_av, newSViv(num) ); num++; } return out_av; } void Kino1_BitVec_destroy(BitVector* bit_vec) { Kino1_Safefree(bit_vec->bits); Kino1_Safefree(bit_vec); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Util::BitVector - a set of bits ==head1 DESCRIPTION A vector of bits, which grows as needed. The implementation is designed to resemble both org.apache.lucene.util.BitVector and java.util.BitSet. Accessible from both C and Perl. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Util/ByteBuf.pm000444000765000765 1201311462203445 21372 0ustar00marvinmarvin000000000000package KinoSearch1::Util::ByteBuf; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::CClass ); 1; __END__ __H__ #ifndef H_KINOSEARCH_UTIL_BYTEBUF #define H_KINOSEARCH_UTIL_BYTEBUF 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1UtilCarp.h" #include "KinoSearch1UtilMemManager.h" typedef struct bytebuf { char *ptr; I32 size; /* number of valid chars */ I32 cap; /* allocated bytes, including any null termination */ U32 flags; } ByteBuf; ByteBuf* Kino1_BB_new(I32); ByteBuf* Kino1_BB_new_string(char*, I32); ByteBuf* Kino1_BB_new_view(char*, I32); ByteBuf* Kino1_BB_clone(ByteBuf*); void Kino1_BB_assign_view(ByteBuf*, char*, I32); void Kino1_BB_assign_string(ByteBuf*, char*, I32); void Kino1_BB_cat_string(ByteBuf*, char*, I32); void Kino1_BB_grow(ByteBuf*, I32); I32 Kino1_BB_compare(ByteBuf*, ByteBuf*); void Kino1_BB_destroy(ByteBuf*); #endif /* include guard */ __C__ #include "KinoSearch1UtilByteBuf.h" #define KINO_BB_VIEW 0x1 /* Return a pointer to a new ByteBuf capable of holding a string of [size] * bytes. Though the ByteBuf's size member is set, none of the allocated * memory is initialized. */ ByteBuf* Kino1_BB_new(I32 size) { ByteBuf *bb; /* allocate */ Kino1_New(0, bb, 1, ByteBuf); Kino1_New(0, bb->ptr, size + 1, char); /* assign */ bb->size = size; bb->cap = size + 1; bb->flags = 0; return bb; } /* Return a pointer to a new ByteBuf which holds a copy of the passed in * string. */ ByteBuf* Kino1_BB_new_string(char *ptr, I32 size) { ByteBuf *bb; /* allocate */ Kino1_New(0, bb, 1, ByteBuf); Kino1_New(0, bb->ptr, size + 1, char); /* copy */ Copy(ptr, bb->ptr, size, char); /* assign */ bb->size = size; bb->cap = size + 1; bb->ptr[size] = '\0'; /* null terminate */ bb->flags = 0; return bb; } /* Return a pointer to a new "view" ByteBuf, offing a persective on the passed * in string. */ ByteBuf* Kino1_BB_new_view(char *ptr, I32 size) { ByteBuf *bb; /* allocate */ Kino1_New(0, bb, 1, ByteBuf); /* assign */ bb->ptr = ptr; bb->size = size; bb->cap = 0; bb->flags = 0 | KINO_BB_VIEW; return bb; } /* Return a "real" copy of the ByteBuf (regardless of whether it was a "view" * ByteBuf before). */ ByteBuf* Kino1_BB_clone(ByteBuf *bb) { if (bb == NULL) return NULL; else return Kino1_BB_new_string(bb->ptr, bb->size); } /* Assign the ptr and size members to the passed in values. Downgrade the * ByteBuf to a "view" ByteBuf and free any existing assigned memory if * necessary. */ void Kino1_BB_assign_view(ByteBuf *bb, char*ptr, I32 size) { /* downgrade the ByteBuf to a view */ if (!bb->flags & KINO_BB_VIEW) { Kino1_Safefree(bb->ptr); bb->flags |= KINO_BB_VIEW; } /* assign */ bb->ptr = ptr; bb->size = size; } /* Copy the passed-in string into the ByteBuf. Allocate more memory if * necessary. */ void Kino1_BB_assign_string(ByteBuf *bb, char* ptr, I32 size) { Kino1_BB_grow(bb, size); Copy(ptr, bb->ptr, size, char); bb->size = size; } /* Concatenate the passed-in string onto the end of the ByteBuf. Allocate more * memory as needed. */ void Kino1_BB_cat_string(ByteBuf *bb, char* ptr, I32 size) { I32 new_size; new_size = bb->size + size; Kino1_BB_grow(bb, new_size); Copy(ptr, (bb->ptr + bb->size), size, char); bb->size = new_size; } /* Assign more memory to the ByteBuf, if it doesn't already have enough room * to hold a string of [size] bytes. Cannot shrink the allocation. */ void Kino1_BB_grow(ByteBuf *bb, I32 new_size) { if (bb->flags & KINO_BB_VIEW) Kino1_confess("grow called on 'view' ByteBuf"); /* bail out if the buffer's already at least as big as required */ if (bb->cap > new_size) return; Kino1_Renew(bb->ptr, (new_size + 1), char); bb->cap = new_size; } void Kino1_BB_destroy(ByteBuf *bb) { if (bb == NULL) return; if (!(bb->flags & KINO_BB_VIEW)) Kino1_Safefree(bb->ptr); Kino1_Safefree(bb); } /* Lexically compare two ByteBufs. */ I32 Kino1_BB_compare(ByteBuf *a, ByteBuf *b) { I32 size; I32 comparison; size = a->size < b->size ? a->size : b->size; comparison = memcmp(a->ptr, b->ptr, size); if (comparison == 0 && a->size != b->size) comparison = a->size < b->size ? -1 : 1; return comparison; } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Util::ByteBuf - stripped down scalar ==head1 DESCRIPTION The ByteBuf is a C struct that's essentially a growable string of char. It's like a stripped down scalar that can only deal with strings. It knows its own size and capacity, so it can contain arbitrary binary data. "View" ByteBufs don't own their own strings. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Util/Carp.pm000444000765000765 214511462203446 20705 0ustar00marvinmarvin000000000000package KinoSearch1::Util::Carp; 1; __END__ __H__ #ifndef H_KINOSEARCH_UTIL_CARP #define H_KINOSEARCH_UTIL_CARP 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1UtilMemManager.h" void Kino1_confess (char*, ...); #endif /* include guard */ __C__ #include "KinoSearch1UtilCarp.h" void Kino1_confess (char* pat, ...) { va_list args; SV *error_sv; dSP; error_sv = newSV(0); va_start(args, pat); sv_vsetpvf(error_sv, pat, &args); va_end(args); ENTER; SAVETMPS; PUSHMARK(SP); XPUSHs( sv_2mortal(error_sv) ); PUTBACK; call_pv("Carp::confess", G_DISCARD); FREETMPS; LEAVE; } __END__ ==begin devdocs ==head1 NAME KinoSearch1::Util::Carp - stack traces from C ==head1 DESCRIPTION This module makes it possible to invoke Carp::confess() from C. Modules that use it will need to "use Carp;" -- which is usually taken care of by "use KinoSearch1::Util::ToolSet;". ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Util/CClass.pm000444000765000765 663111462203446 21174 0ustar00marvinmarvin000000000000package KinoSearch1::Util::CClass; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base('KinoSearch1::Util::Class'); 1; __END__ __H__ #ifndef H_KINO_CCLASS #define H_KINO_CCLASS 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1UtilCarp.h" #define KINO_START_SET_OR_GET_SWITCH \ /* if called as a setter, make sure the extra arg is there */ \ if (ix % 2 == 1 && items != 2) \ croak("usage: $seg_term_enum->set_xxxxxx($val)"); \ switch (ix) { #define KINO_END_SET_OR_GET_SWITCH \ default: Kino1_confess("Internal error. ix: %d", ix); \ RETVAL = &PL_sv_undef; /* quiet compiler warning */ \ break; /* probably unreachable */ \ } #define Kino1_extract_struct( perl_obj, dest, cname, class ) \ if (sv_derived_from( perl_obj, class )) { \ IV tmp = SvIV( (SV*)SvRV(perl_obj) ); \ dest = INT2PTR(cname, tmp); \ } \ else { \ dest = NULL; /* suppress unused var warning */ \ Kino1_confess("not a %s", class); \ } #define Kino1_extract_anon_struct( perl_obj, dest ) \ if (sv_derived_from( perl_obj, "KinoSearch1::Util::CClass" )) { \ IV tmp = SvIV( (SV*)SvRV(perl_obj) ); \ dest = INT2PTR(void*, tmp); \ } \ else { \ dest = NULL; /* suppress unused var warning */ \ Kino1_confess("not derived from KinoSearch1::Util::CClass"); \ } #define Kino1_extract_struct_from_hv(hash, dest, key, key_len, cname, class) \ { \ SV **sv_ptr; \ sv_ptr = hv_fetch(hash, key, key_len, 0); \ if (sv_ptr == NULL) \ Kino1_confess("Failed to retrieve hash entry '%s'", key); \ if (sv_derived_from( *sv_ptr, class )) { \ IV tmp = SvIV( (SV*)SvRV(*sv_ptr) ); \ dest = INT2PTR(cname, tmp); \ } \ else { \ dest = NULL; /* suppress unused var warning */ \ Kino1_confess("not a %s", class); \ } \ } #endif /* include guard */ __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Util::CClass - base class for C-struct objects ==head1 DESCRIPTION KinoSearch1's C-struct objects use this as a base class, rather than KinoSearch1::Util::Class. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Util/Class.pm000444000765000765 1651511462203446 21113 0ustar00marvinmarvin000000000000package KinoSearch1::Util::Class; use strict; use warnings; use KinoSearch1::Util::ToolSet; use KinoSearch1::Util::VerifyArgs qw( verify_args kerror ); sub new { my $class = shift; # leave the rest of @_ intact. # find a defaults hash and verify args $class = ref($class) || $class; my $defaults; { no strict 'refs'; $defaults = \%{ $class . '::instance_vars' }; } if ( !verify_args( $defaults, @_ ) ) { # if a user-based subclass, find KinoSearch1 parent class and verify. my $kinoclass = _traverse_at_isa($class); confess kerror() unless $kinoclass; { no strict 'refs'; $defaults = \%{ $kinoclass . '::instance_vars' }; } confess kerror() unless verify_args( $defaults, @_ ); } # merge var => val pairs into new object, call customizable init routine my $self = bless { %$defaults, @_ }, $class; $self->init_instance; return $self; } # Walk @ISA until a parent class starting with 'KinoSearch1::' is found. sub _traverse_at_isa { my $orig = shift; { no strict 'refs'; my $at_isa = \@{ $orig . '::ISA' }; for my $parent (@$at_isa) { return $parent if $parent =~ /^KinoSearch1::/; my $grand_parent = _traverse_at_isa($parent); return $grand_parent if $grand_parent; } }; return ''; } sub init_instance { } sub init_instance_vars { my $package = shift; no strict 'refs'; no warnings 'once'; my $first_isa = ${ $package . '::ISA' }[0]; %{ $package . '::instance_vars' } = ( %{ $first_isa . '::instance_vars' }, @_ ); } sub ready_get_set { ready_get(@_); ready_set(@_); } sub ready_get { my $package = shift; no strict 'refs'; for my $member (@_) { *{ $package . "::get_$member" } = sub { return $_[0]->{$member} }; } } sub ready_set { my $package = shift; no strict 'refs'; for my $member (@_) { *{ $package . "::set_$member" } = sub { $_[0]->{$member} = $_[1] }; } } =for Rationale: KinoSearch1 is not thread-safe. Among other things, the C-struct-based classes cause segfaults or bus errors when their data gets double-freed by DESTROY. Therefore, CLONE dies with a user-friendly error message before that happens. =cut sub CLONE { my $package = shift; die( "CLONE invoked by package '$package', indicating that threads " . "or Win32 fork were initiated, but KinoSearch1 is not thread-safe" ); } sub abstract_death { my ( undef, $filename, $line, $methodname ) = caller(1); die "ERROR: $methodname', called at $filename line $line, is an " . "abstract method and must be defined in a subclass"; } sub unimplemented_death { my ( undef, $filename, $line, $methodname ) = caller(1); die "ERROR: $methodname, called at $filename line $line, is " . "intentionally unimplemented in KinoSearch1, though it is part " . "of Lucene"; } sub todo_death { my ( undef, $filename, $line, $methodname ) = caller(1); die "ERROR: $methodname, called at $filename line $line, is not " . "implemented yet in KinoSearch1, but is on the todo list"; } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Util::Class - class building utility ==head1 SYNOPSIS package KinoSearch1::SomePackage::SomeClass; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members foo => undef, bar => {}, # members baz => {}, ); } ==head1 DESCRIPTION KinoSearch1::Util::Class is a class-building utility a la L, L, etc. It provides four main services: ==over ==item 1 A mechanism for inheriting instance variable declarations. ==item 2 A constructor with basic argument checking. ==item 3 Manufacturing of get_xxxx and set_xxxx methods. ==item 4 Convenience methods which help in defining abstract classes. ==back ==head1 VARIABLES ==head2 %instance_vars The %instance_vars hash, which is always a package global, serves as a template for the creation of a hash-based object. It is built up from all the %instance_vars hashes in the module's parent classes, using init_instance_vars(). Key-value pairs in an %instance_vars hash are labeled as "constructor params" and/or "members". Items which are labeled as constructor params can be used as arguments to new(). BEGIN { __PACKAGE__->init_instance_vars( # constructor params / members foo => undef, bar => 10, # members baz => '', ); } # ok: specifies foo, uses default for bar, derives baz my $object = __PACKAGE__->new( foo => $foo ); # not ok: baz isn't a constructor param my $object = __PACKAGE__->new( baz => $baz ); # ok if a parent class defines boffo as a constructor param my $object = __PACKAGE__->new( foo => $foo, boffo => $boffo, ); %instance_vars may only contain scalar values, as the defaults are merged into the object using a shallow copy. init_instance_vars() must be called from within a BEGIN block and before any C directives load a child class -- if children are born before their parents, inheritance gets screwed up. ==head1 METHODS ==head2 new A generic constructor with basic argument checking. new() expects hash-style labeled parameters; the label names must be present in the %instance_vars hash, or it will croak(). After verifying the labeled parameters, new() merges %instance_vars and @_ into a new object. It then calls $self->init_instance() before returning the blessed reference. ==head2 init_instance $self->init_instance(); Perform customized initialization routine. By default, this is a no-op. ==head2 init_instance_vars BEGIN { __PACKAGE__->init_instance_vars( a_safe_variable_name_that_wont_clash => 1, freep_warble => undef, ); } Package method only. Creates a package global %instance_vars hash in the passed in package which consists of the passed in arguments plus all the key-value pairs in the parent class's %instance_vars hash. ==head2 ready_get_set ready_get ready_set # create get_foo(), set_foo(), get_bar(), set_bar() in __PACKAGE__ BEGIN { __PACKAGE__->ready_get_set(qw( foo bar )) }; Mass manufacture getters and setters. The setters do not return a meaningful value. ==head2 abstract_death unimplemented_death todo_death sub an_abstract_method { shift->abstract_death } sub an_unimplemented_method { shift->unimplemented_death } sub maybe_someday { shift->todo_death } These are just different ways to die(), and are of little interest until your particular application comes face to face with one of them. abstract_death indicates that a method must be defined in a subclass. unimplemented_death indicates a feature/function that will probably not be implemented. Typically, this would appear for a sub that a developer intimately familiar with Lucene would expect to find. todo_death indicates a feature that might get implemented someday. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Util/IntMap.pm000444000765000765 314011462203446 21204 0ustar00marvinmarvin000000000000package KinoSearch1::Util::IntMap; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::Class ); sub new { my ( $class, $map ) = @_; $class = ref($class) || $class; return bless $map, $class; } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Util::IntMap =for comment Return either the remapped number, or undef if orig has been removed. =cut SV* get(int_map_ref, orig); SV *int_map_ref; I32 orig; PREINIT: I32 result; CODE: result = Kino1_IntMap_get(int_map_ref, orig); RETVAL = result == -1 ? &PL_sv_undef : newSViv(result); OUTPUT: RETVAL __H__ #ifndef H_KINOSEARCH_INT_MAP #define H_KINOSEARCH_INT_MAP 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" I32 Kino1_IntMap_get(SV*, I32); #endif /* include guard */ __C__ #include "KinoSearch1UtilIntMap.h" I32 Kino1_IntMap_get(SV* int_map_ref, I32 orig) { SV *int_map_sv; I32 *map; STRLEN len; int_map_sv = SvRV(int_map_ref); map = (I32*)SvPV(int_map_sv, len); if (orig * sizeof(I32) > len) { return -1; } return map[orig]; } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Util::IntMap - compact array of integers ==head1 DESCRIPTION An IntMap is a C array of I32, stored in a scalar. The get() method returns either the number present at the index requested, or undef if either the index is out of range or the number at the index is -1. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Util/MathUtils.pm000444000765000765 372111462203445 21732 0ustar00marvinmarvin000000000000package KinoSearch1::Util::MathUtils; use strict; use warnings; use base qw( Exporter ); our @EXPORT_OK = qw( ceil ); 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Util::MathUtils double ceil(in) double in; CODE: RETVAL = ceil(in); OUTPUT: RETVAL __H__ #ifndef H_KINOSEARCH_UTIL_MATH_UTILS #define H_KINOSEARCH_UTIL_MATH_UTILS 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1UtilMemManager.h" void Kino1_encode_bigend_U32(U32, void*); void Kino1_encode_bigend_U16(U16, void*); U32 Kino1_decode_bigend_U32(void*); U16 Kino1_decode_bigend_U16(void*); #endif /* include guard */ __C__ #include "KinoSearch1UtilMathUtils.h" void Kino1_encode_bigend_U32(U32 aU32, void *vbuf) { unsigned char *buf; buf = (unsigned char*)vbuf; * buf = (aU32 & 0xff000000) >> 24; *(buf + 1) = (aU32 & 0x00ff0000) >> 16; *(buf + 2) = (aU32 & 0x0000ff00) >> 8; *(buf + 3) = (aU32 & 0x000000ff); } void Kino1_encode_bigend_U16(U16 aU16, void *vbuf) { unsigned char *buf; buf = (unsigned char*)vbuf; * buf = (aU16 & 0xff00) >> 8; *(buf + 1) = (aU16 & 0x00ff); } U32 Kino1_decode_bigend_U32(void *vbuf) { unsigned char *buf; U32 aU32; buf = (unsigned char*)vbuf; aU32 = (* buf << 24) | (*(buf + 1) << 16) | (*(buf + 2) << 8) | *(buf + 3); return aU32; } U16 Kino1_decode_bigend_U16(void *vbuf) { unsigned char *buf; U16 aU16; buf = (unsigned char*)vbuf; aU16 = (*buf << 8) | *(buf + 1); return aU16; } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Util::MathUtils - various math utilities ==head1 DESCRIPTION Provide various math related utilities, including endcoding/decoding integers in guaranteed Big-endian byte order. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Util/MemManager.pm000444000765000765 456411462203446 22040 0ustar00marvinmarvin000000000000package KinoSearch1::Util::MemManager; 1; __END__ __H__ #ifndef H_KINO_MEM_MANAGER #define H_KINO_MEM_MANAGER 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1UtilCarp.h" /* Set this to 1 to enable debugging. */ #define KINO_MEM_LEAK_DEBUG 0 #if KINO_MEM_LEAK_DEBUG #define Kino1_New(x,v,n,t) \ (v = (t*)Kino1_New_wrapper(x,(n*sizeof(t)))) #define Kino1_Newz(x,v,n,t) \ (v = (t*)Kino1_Newz_wrapper(x,(n*sizeof(t)))) #define Kino1_Renew(v,n,t) \ (v = (t*)Kino1_Renew_wrapper(v, n*sizeof(t))) #define Kino1_Safefree(x) \ Kino1_Safefree_wrapper(x) #define Kino1_savepvn(p,n) \ Kino1_savepvn_wrapper(p,n) #else #define Kino1_New(x,v,n,t) New(x,v,n,t) #define Kino1_Newz(x,v,n,t) Newz(x,v,n,t) #define Kino1_Renew(v,n,t) Renew(v,n,t) #define Kino1_Safefree(v) Safefree(v) #define Kino1_savepvn(p,n) savepvn(p,n) #endif void* Kino1_New_wrapper(int, size_t); void* Kino1_Newz_wrapper(int, size_t); void* Kino1_Renew_wrapper(void*, size_t); void Kino1_Safefree_wrapper(void*); char* Kino1_savepvn_wrapper(const char*, I32); #endif /* include guard */ __C__ #include "KinoSearch1UtilMemManager.h" void* Kino1_New_wrapper(int x, size_t num) { void* ptr; ptr = malloc(num); return ptr; } void* Kino1_Newz_wrapper(int x, size_t num) { char* ptr; ptr = (char*)malloc(num); memset(ptr, 0, num); return (void*)ptr; } void* Kino1_Renew_wrapper(void* ptr, size_t num) { void* new_ptr; new_ptr = realloc(ptr, num); return new_ptr; } void Kino1_Safefree_wrapper(void* ptr) { /* Safefree(ptr); */ free(ptr); } char* Kino1_savepvn_wrapper(const char* pv, I32 len) { char* ptr; ptr = (char*)malloc(len + 1); if (ptr == NULL) Kino1_confess("Out of memory"); ptr[len] = '\0'; memcpy(ptr, pv, len); return ptr; } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Util::MemManager - wrappers which aid memory debugging ==head1 DESCRIPTION In normal mode, the C functions in this module are macro aliases for Perl's memory management tools. In debug mode, memory management passes through local functions which make hunting down bugs with Valgrind easier. No Perl interface. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Util/PriorityQueue.pm000444000765000765 2301711462203446 22667 0ustar00marvinmarvin000000000000package KinoSearch1::Util::PriorityQueue; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::CClass ); BEGIN { __PACKAGE__->init_instance_vars( # constructor args max_size => undef, ); } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Util::PriorityQueue void new(either_sv, ...) SV *either_sv; PREINIT: const char *class; HV *args_hash; U32 max_size; PriorityQueue *pq; PPCODE: /* determine the class */ class = sv_isobject(either_sv) ? sv_reftype(either_sv, 0) : SvPV_nolen(either_sv); /* process hash-style params */ Kino1_Verify_build_args_hash(args_hash, "KinoSearch1::Util::PriorityQueue::instance_vars", 1); max_size = (U32)SvUV( Kino1_Verify_extract_arg(args_hash, "max_size", 8) ); /* build object */ pq = Kino1_PriQ_new(max_size); ST(0) = sv_newmortal(); sv_setref_pv(ST(0), class, (void*)pq); XSRETURN(1); =for comment Add an element to the Queue if either... a) the queue isn't full, or b) the element belongs in the queue and should displace another =cut void insert(pq, element) PriorityQueue *pq; SV *element; PPCODE: Kino1_PriQ_insert(pq, element); =for comment Pop the *least* item off of the priority queue. =cut SV* pop(pq) PriorityQueue *pq; CODE: RETVAL = Kino1_PriQ_pop(pq); if (RETVAL == Nullsv) { RETVAL = &PL_sv_undef; } else { RETVAL = newSVsv(RETVAL); } OUTPUT: RETVAL =for comment Return the least item in the queue, but don't remove it. =cut SV* peek(pq) PriorityQueue *pq; CODE: RETVAL = Kino1_PriQ_peek(pq); if (RETVAL == Nullsv) { RETVAL = &PL_sv_undef; } else { RETVAL = newSVsv(RETVAL); } OUTPUT: RETVAL =for comment Empty the queue into an array, with the highest priority item at index 0. =cut void pop_all(pq) PriorityQueue *pq; PREINIT: AV* out_av; PPCODE: out_av = Kino1_PriQ_pop_all(pq); XPUSHs( sv_2mortal(newRV_noinc( (SV*)out_av )) ); SV* _set_or_get(pq, ...) PriorityQueue *pq; ALIAS: get_size = 2 get_max_size = 4 CODE: { KINO_START_SET_OR_GET_SWITCH case 2: RETVAL = newSVuv(pq->size); break; case 4: RETVAL = newSVuv(pq->max_size); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL void DESTROY(pq) PriorityQueue *pq; PPCODE: Kino1_PriQ_destroy(pq); __H__ #ifndef H_KINOSEARCH_UTIL_PRIORITY_QUEUE #define H_KINOSEARCH_UTIL_PRIORITY_QUEUE 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1UtilCarp.h" #include "KinoSearch1UtilMemManager.h" typedef struct priorityqueuec { U32 size; U32 max_size; SV **heap; bool (*less_than)(SV*, SV*); } PriorityQueue; PriorityQueue* Kino1_PriQ_new (U32 max_size); bool Kino1_PriQ_insert(PriorityQueue*, SV*); SV* Kino1_PriQ_pop(PriorityQueue*); SV* Kino1_PriQ_peek(PriorityQueue*); AV* Kino1_PriQ_pop_all(PriorityQueue*); void Kino1_PriQ_destroy(PriorityQueue*); bool Kino1_PriQ_default_less_than( SV*, SV* ); void Kino1_PriQ_dump(PriorityQueue*); #endif /* include guard */ __C__ #include "KinoSearch1UtilPriorityQueue.h" static void Kino1_PriQ_put(PriorityQueue*, SV*); static SV* Kino1_PriQ_top(PriorityQueue*); static void Kino1_PriQ_adjust_top(PriorityQueue*); static void Kino1_PriQ_clear(PriorityQueue*); static void Kino1_PriQ_up_heap(PriorityQueue*); static void Kino1_PriQ_down_heap(PriorityQueue*); PriorityQueue* Kino1_PriQ_new (U32 max_size) { PriorityQueue *pq; U32 i, heap_size; Kino1_New(0, pq, 1, PriorityQueue); pq->size = 0; pq->max_size = max_size; pq->less_than = Kino1_PriQ_default_less_than; /* allocate space for the heap, assign all slots to Nullsv */ heap_size = max_size + 1; Kino1_New(0, pq->heap, heap_size, SV*); for (i = 0; i < heap_size; i++) { pq->heap[i] = Nullsv; } return pq; } /* Add an element to the heap. Throw an error if too many elements * are added. */ static void Kino1_PriQ_put(PriorityQueue *pq, SV *element) { /* extend heap */ if (pq->size >= pq->max_size) { Kino1_confess("PriorityQueue exceeded max_size: %d %d", pq->size, pq->max_size); } pq->size++; /* put element into heap */ pq->heap[ pq->size ] = newSVsv(element); /* adjust heap */ Kino1_PriQ_up_heap(pq); } bool Kino1_PriQ_insert(PriorityQueue *pq, SV *element) { SV *scratch_sv; /* absorb element if there's a vacancy */ if (pq->size < pq->max_size) { Kino1_PriQ_put(pq, element); return 1; } /* otherwise, compete for the slot */ else { scratch_sv = Kino1_PriQ_top(pq); if( pq->size > 0 && !pq->less_than(element, scratch_sv)) { /* if the new element belongs in the queue, replace something */ scratch_sv = pq->heap[1]; SvREFCNT_dec(scratch_sv); pq->heap[1] = newSVsv(element); Kino1_PriQ_adjust_top(pq); return 1; } else { return 0; } } } /* Return the least item in the queue, or Nullsv if queue is empty. */ static SV* Kino1_PriQ_top(PriorityQueue *pq) { if (pq->size > 0) { return pq->heap[1]; /* note: no refcount manip */ } else { return Nullsv; } } SV* Kino1_PriQ_pop(PriorityQueue *pq) { SV *result; if (pq->size > 0) { /* mortalize the first value and save it */ result = sv_2mortal( pq->heap[1] ); /* move last to first and adjust heap */ pq->heap[1] = pq->heap[ pq->size ]; pq->heap[ pq->size ] = Nullsv; pq->size--; Kino1_PriQ_down_heap(pq); return result; } else { return Nullsv; } } SV* Kino1_PriQ_peek(PriorityQueue *pq) { if (pq->size > 0) { return pq->heap[1]; } else { return Nullsv; } } AV* Kino1_PriQ_pop_all(PriorityQueue *pq) { AV* out_av; I32 i; SV* element; /* allocate an empty AV; return immediately if the queue is empty */ out_av = newAV(); if (pq->size == 0) { return out_av; } /* map the queue nodes onto the array in reverse order */ av_extend(out_av, pq->size - 1); for (i = pq->size - 1; i >= 0; i--) { element = newSVsv( Kino1_PriQ_pop(pq) ); av_store(out_av, i, element); } return out_av; } /* Alias for down_heap. Should be called when the item at the top changes. */ static void Kino1_PriQ_adjust_top(PriorityQueue *pq) { Kino1_PriQ_down_heap(pq); } /* Free all the elements in the heap and set size to 0. */ static void Kino1_PriQ_clear(PriorityQueue *pq) { U32 i; SV **sv_ptr; sv_ptr = (pq->heap + 1); /* node 0 is held empty, to make the algo clearer */ for (i = 1; i <= pq->size; i++) { SvREFCNT_dec(*sv_ptr); *sv_ptr = Nullsv; sv_ptr++; } pq->size = 0; } /* Heap adjuster. */ static void Kino1_PriQ_up_heap(PriorityQueue *pq) { U32 i, j; SV *node; i = pq->size; node = pq->heap[i]; /* save bottom node */ j = i >> 1; while ( j > 0 && pq->less_than(node, pq->heap[j]) ) { pq->heap[i] = pq->heap[j]; i = j; j = j >> 1; } pq->heap[i] = node; } /* Heap adjuster. */ static void Kino1_PriQ_down_heap(PriorityQueue *pq) { U32 i, j, k; SV *node; /* save top node */ i = 1; node = pq->heap[i]; /* find smaller child */ j = i << 1; k = j + 1; if ( k <= pq->size && pq->less_than(pq->heap[k], pq->heap[j]) ) { j = k; } while ( j <= pq->size && pq->less_than(pq->heap[j], node) ) { pq->heap[i] = pq->heap[j]; i = j; j = i << 1; k = j + 1; if ( k <= pq->size && pq->less_than(pq->heap[k], pq->heap[j]) ) { j = k; } } pq->heap[i] = node; } /* Compare the integer values of two scalars. */ bool Kino1_PriQ_default_less_than(SV* a, SV* b) { if ( SvIV(a) < SvIV(b) ) { return 1; } else { return 0; } } void Kino1_PriQ_destroy(PriorityQueue *pq) { Kino1_PriQ_clear(pq); Kino1_Safefree(pq->heap); Kino1_Safefree(pq); } /* Print integer values for all items in the Queue. */ void Kino1_PriQ_dump(PriorityQueue *pq) { U32 i; SV **sv_ptr; sv_ptr = (pq->heap + 1); for (i = 1; i <= pq->size; i++) { IV j = SvIV(*sv_ptr); fprintf(stderr, "%"IVdf" ", j); sv_ptr++; } fprintf(stderr, "\n"); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Util::PriorityQueue - classic heap sort / priority queue ==head1 DESCRIPTION PriorityQueue implements a textbook heap-sort/priority-queue algorithm. This particular variant leaves slot 0 in the queue open in order to keep the relationship between node rank and index clear in the up_heap and down_heap routines. The nodes in this implementation are all perl scalars, which allows us to use Perl's reference counting to manage memory. However, the underlying queue management methods are all written in C, which allows them to be used within other C routines without expensive callbacks to Perl. Subclass constructors must redefine the C pointer-to-function, less_than. The default behavior is to compare the SvIV value of two scalars. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Util/SortExternal.pm000444000765000765 5646511462203445 22507 0ustar00marvinmarvin000000000000package KinoSearch1::Util::SortExternal; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( KinoSearch1::Util::CClass ); BEGIN { __PACKAGE__->init_instance_vars( # constructor args invindex => undef, seg_name => undef, mem_threshold => 2**24, ); } our %instance_vars; sub new { my $class = shift; verify_args( \%instance_vars, @_ ); my %args = ( %instance_vars, @_ ); my $invindex = $args{invindex}; $class = ref($class) || $class; my $filename = "$args{seg_name}.srt"; $invindex->delete_file($filename) if $invindex->file_exists($filename); my $outstream = $invindex->open_outstream($filename); return _new( $class, $outstream, @args{qw( invindex seg_name mem_threshold )} ); } # Prepare to start fetching sorted results. sub sort_all { my $self = shift; # deal with any items in the cache right now if ( $self->_get_num_runs == 0 ) { # if we've never exceeded mem_threshold, sort in-memory $self->_sort_cache; } else { # create a run from whatever's in the cache right now $self->_sort_run; } # done adding elements, so close file and reopen as an instream $self->_get_outstream->close; my $filename = $self->_get_seg_name . ".srt"; my $instream = $self->_get_invindex()->open_instream($filename); $self->_set_instream($instream); # allow fetching now that we're set up $self->_enable_fetch; } sub close { shift->_get_instream()->close } 1; __END__ __XS__ MODULE = KinoSearch1 PACKAGE = KinoSearch1::Util::SortExternal void _new(class, outstream_sv, invindex_sv, seg_name_sv, mem_threshold) char *class; SV *outstream_sv; SV *invindex_sv; SV *seg_name_sv; I32 mem_threshold; PREINIT: SortExternal *sortex; PPCODE: sortex = Kino1_SortEx_new(outstream_sv, invindex_sv, seg_name_sv, mem_threshold); ST(0) = sv_newmortal(); sv_setref_pv( ST(0), class, (void*)sortex ); XSRETURN(1); =for comment Add one or more items to the sort pool. =cut void feed(sortex, ...) SortExternal *sortex; PREINIT: I32 i; PPCODE: for (i = 1; i < items; i++) { SV const * item_sv = ST(i); if (!SvPOK(item_sv)) continue; sortex->feed(sortex, SvPVX(item_sv), SvCUR(item_sv)); } =for comment Fetch the next sorted item from the sort pool. sort_all must be called first. =cut SV* fetch(sortex) SortExternal *sortex; PREINIT: ByteBuf *bb; CODE: bb = sortex->fetch(sortex); if (bb == NULL) { RETVAL = newSV(0); } else { RETVAL = newSVpvn(bb->ptr, bb->size); Kino1_BB_destroy(bb); } OUTPUT: RETVAL =for comment Sort all items currently in memory. =cut void _sort_cache(sortex) SortExternal *sortex; PPCODE: Kino1_SortEx_sort_cache(sortex); =for comment Sort everything in memory and write the sorted elements to disk, creating a SortExRun C object. =cut void _sort_run(sortex); SortExternal *sortex; PPCODE: Kino1_SortEx_sort_run(sortex); =for comment Turn on fetching. =cut void _enable_fetch(sortex) SortExternal *sortex; PPCODE: Kino1_SortEx_enable_fetch(sortex); SV* _set_or_get(sortex, ...) SortExternal *sortex; ALIAS: _set_outstream = 1 _get_outstream = 2 _set_instream = 3 _get_instream = 4 _set_num_runs = 5 _get_num_runs = 6 _set_invindex = 7 _get_invindex = 8 _set_seg_name = 9 _get_seg_name = 10 CODE: { KINO_START_SET_OR_GET_SWITCH case 1: SvREFCNT_dec(sortex->outstream_sv); sortex->outstream_sv = newSVsv( ST(1) ); Kino1_extract_struct(sortex->outstream_sv, sortex->outstream, OutStream*, "KinoSearch1::Store::OutStream"); /* fall through */ case 2: RETVAL = newSVsv(sortex->outstream_sv); break; case 3: SvREFCNT_dec(sortex->instream_sv); sortex->instream_sv = newSVsv( ST(1) ); Kino1_extract_struct(sortex->instream_sv, sortex->instream, InStream*, "KinoSearch1::Store::InStream"); /* fall through */ case 4: RETVAL = newSVsv(sortex->instream_sv); break; case 5: Kino1_confess("can't set num_runs"); /* fall through */ case 6: RETVAL = newSViv(sortex->num_runs); break; case 7: Kino1_confess("can't set_invindex"); /* fall through */ case 8: RETVAL = newSVsv(sortex->invindex_sv); break; case 9: Kino1_confess("can't set_seg_name"); /* fall through */ case 10: RETVAL = newSVsv(sortex->seg_name_sv); break; KINO_END_SET_OR_GET_SWITCH } OUTPUT: RETVAL void DESTROY(sortex) SortExternal *sortex; PPCODE: Kino1_SortEx_destroy(sortex); __H__ #ifndef H_KINOSEARCH_UTIL_SORT_EXTERNAL #define H_KINOSEARCH_UTIL_SORT_EXTERNAL 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1StoreInStream.h" #include "KinoSearch1StoreOutStream.h" #include "KinoSearch1UtilByteBuf.h" #include "KinoSearch1UtilCClass.h" #include "KinoSearch1UtilMemManager.h" typedef struct sortexrun { double start; double file_pos; double end; ByteBuf **cache; I32 cache_cap; I32 cache_elems; I32 cache_pos; I32 slice_size; } SortExRun; typedef struct sortexternal { ByteBuf **cache; /* item cache, both incoming and outgoing */ I32 cache_cap; /* allocated limit for cache */ I32 cache_elems; /* number of elems in cache */ I32 cache_pos; /* index of current element in cache */ ByteBuf **scratch; /* memory for use by mergesort */ I32 scratch_cap; /* allocated limit for scratch */ I32 mem_threshold; /* bytes of mem allowed for cache */ I32 cache_bytes; /* bytes of mem occupied by cache */ I32 run_cache_limit; /* bytes of mem allowed each run cache */ SortExRun **runs; I32 num_runs; SV *outstream_sv; OutStream *outstream; SV *instream_sv; InStream *instream; SV *invindex_sv; SV *seg_name_sv; void (*feed) (struct sortexternal*, char*, I32); ByteBuf* (*fetch)(struct sortexternal*); } SortExternal; SortExternal* Kino1_SortEx_new(SV*, SV*, SV*, I32); void Kino1_SortEx_feed(SortExternal*, char*, I32); ByteBuf* Kino1_SortEx_fetch(SortExternal*); ByteBuf* Kino1_SortEx_fetch_death(SortExternal*); void Kino1_SortEx_enable_fetch(SortExternal*); void Kino1_SortEx_sort_cache(SortExternal*); void Kino1_SortEx_sort_run(SortExternal*); void Kino1_SortEx_destroy(SortExternal*); #endif /* include guard */ __C__ #include "KinoSearch1UtilSortExternal.h" static SortExRun* Kino1_SortEx_new_run(double, double); static void Kino1_SortEx_grow_bufbuf(ByteBuf***, I32, I32); static I32 Kino1_SortEx_refill_run(SortExternal*, SortExRun*); static void Kino1_SortEx_refill_cache(SortExternal*); static void Kino1_SortEx_merge_runs(SortExternal*); static ByteBuf* Kino1_SortEx_find_endpost(SortExternal*); static I32 Kino1_SortEx_define_slice(SortExRun*, ByteBuf*); static void Kino1_SortEx_mergesort(ByteBuf**, ByteBuf**, I32); static void Kino1_SortEx_msort(ByteBuf**, ByteBuf**, U32, U32); static void Kino1_SortEx_merge(ByteBuf**, U32, ByteBuf**, U32, ByteBuf**); static void Kino1_SortEx_clear_cache(SortExternal*); static void Kino1_SortEx_clear_run_cache(SortExRun*); static void Kino1_SortEx_destroy_run(SortExRun*); #define KINO_PER_ITEM_OVERHEAD (sizeof(ByteBuf) + sizeof(ByteBuf*)) SortExternal* Kino1_SortEx_new(SV *outstream_sv, SV *invindex_sv, SV *seg_name_sv, I32 mem_threshold) { SortExternal *sortex; /* allocate */ Kino1_New(0, sortex, 1, SortExternal); Kino1_New(0, sortex->cache, 100, ByteBuf*); Kino1_New(0, sortex->runs, 1, SortExRun*); /* init */ sortex->scratch = NULL; sortex->scratch_cap = 0; sortex->cache_cap = 100; sortex->cache_elems = 0; sortex->cache_pos = 0; sortex->cache_bytes = 0; sortex->num_runs = 0; sortex->instream_sv = &PL_sv_undef; sortex->feed = Kino1_SortEx_feed; sortex->fetch = Kino1_SortEx_fetch_death; /* assign */ sortex->outstream_sv = newSVsv(outstream_sv); Kino1_extract_struct(outstream_sv, sortex->outstream, OutStream*, "KinoSearch1::Store::OutStream"); sortex->invindex_sv = newSVsv(invindex_sv); sortex->seg_name_sv = newSVsv(seg_name_sv); sortex->mem_threshold = mem_threshold; /* derive */ sortex->run_cache_limit = mem_threshold / 2; return sortex; } /* Create a new SortExRun object */ static SortExRun* Kino1_SortEx_new_run(double start, double end) { SortExRun *run; /* allocate */ Kino1_New(0, run, 1, SortExRun); Kino1_New(0, run->cache, 100, ByteBuf*); /* init */ run->cache_cap = 100; run->cache_elems = 0; run->cache_pos = 0; /* assign */ run->start = start; run->file_pos = start; run->end = end; return run; } void Kino1_SortEx_feed(SortExternal* sortex, char* ptr, I32 len) { /* add room for more cache elements if needed */ if (sortex->cache_elems == sortex->cache_cap) { /* add 100, plus 10% of the current capacity */ sortex->cache_cap = sortex->cache_cap + 100 + (sortex->cache_cap / 8); Kino1_Renew(sortex->cache, sortex->cache_cap, ByteBuf*); } sortex->cache[ sortex->cache_elems ] = Kino1_BB_new_string(ptr, len); sortex->cache_elems++; /* track memory consumed */ sortex->cache_bytes += KINO_PER_ITEM_OVERHEAD; sortex->cache_bytes += len + 1; /* check if it's time to flush the cache */ if (sortex->cache_bytes >= sortex->mem_threshold) Kino1_SortEx_sort_run(sortex); } ByteBuf* Kino1_SortEx_fetch(SortExternal *sortex) { if (sortex->cache_pos >= sortex->cache_elems) Kino1_SortEx_refill_cache(sortex); if (sortex->cache_elems > 0) { return sortex->cache[ sortex->cache_pos++ ]; } else { return NULL; } } ByteBuf* Kino1_SortEx_fetch_death(SortExternal *sortex) { ByteBuf *bb = NULL; Kino1_confess("can't call fetch before sort_all"); return bb; } void Kino1_SortEx_enable_fetch(SortExternal *sortex) { sortex->fetch = Kino1_SortEx_fetch; } /* Allocate more memory to an array of pointers to pointers to ByteBufs, if * the current allocation isn't sufficient. */ static void Kino1_SortEx_grow_bufbuf(ByteBuf ***bb_buf, I32 current, I32 desired) { if (current < desired) Kino1_Renew(*bb_buf, desired, ByteBuf*); } /* Sort the main cache. */ void Kino1_SortEx_sort_cache(SortExternal *sortex) { Kino1_SortEx_grow_bufbuf(&sortex->scratch, sortex->scratch_cap, sortex->cache_elems); Kino1_SortEx_mergesort(sortex->cache, sortex->scratch, sortex->cache_elems); } void Kino1_SortEx_sort_run(SortExternal *sortex) { OutStream *outstream; ByteBuf **cache, **cache_end; ByteBuf *bb; double start, end; /* bail if there's nothing in the cache */ if (sortex->cache_bytes == 0) return; /* allocate space for a new run */ sortex->num_runs++; Kino1_Renew(sortex->runs, sortex->num_runs, SortExRun*); /* make local copies */ outstream = sortex->outstream; cache = sortex->cache; /* mark start of run */ start = outstream->tell(outstream); /* write sorted items to file */ Kino1_SortEx_sort_cache(sortex); cache_end = cache + sortex->cache_elems; for (cache = sortex->cache; cache < cache_end; cache++) { bb = *cache; outstream->write_vint(outstream, bb->size); outstream->write_bytes(outstream, bb->ptr, bb->size); } /* clear the cache */ Kino1_SortEx_clear_cache(sortex); /* mark end of run and build a new SortExRun object */ end = outstream->tell(outstream); sortex->runs[ sortex->num_runs - 1 ] = Kino1_SortEx_new_run(start, end); /* recalculate the size allowed for each run's cache */ sortex->run_cache_limit = (sortex->mem_threshold / 2) / sortex->num_runs; sortex->run_cache_limit = sortex->run_cache_limit < 65536 ? 65536 : sortex->run_cache_limit; } /* Recover sorted items from disk, up to the allowable memory limit. */ static I32 Kino1_SortEx_refill_run(SortExternal* sortex, SortExRun *run) { InStream *instream; double end; I32 run_cache_bytes = 0; int num_elems = 0; /* number of items recovered */ I32 len; ByteBuf *bb; I32 run_cache_limit; /* see if we actually need to refill */ if (run->cache_elems - run->cache_pos) return run->cache_elems - run->cache_pos; else Kino1_SortEx_clear_run_cache(run); /* make local copies */ instream = sortex->instream; run_cache_limit = sortex->run_cache_limit; end = run->end; instream->seek(instream, run->file_pos); while (1) { /* bail if we've read everything in this run */ if (instream->tell(instream) >= end) { /* make sure we haven't read too much */ if (instream->tell(instream) > end) { UV pos = instream->tell(instream); Kino1_confess( "read past end of run: %"UVuf", %"UVuf, pos, (UV)end ); } break; } /* bail if we've hit the ceiling for this run's cache */ if (run_cache_bytes > run_cache_limit) break; /* retrieve and decode len; allocate a ByteBuf and recover the string */ len = instream->read_vint(instream); bb = Kino1_BB_new(len); instream->read_bytes(instream, bb->ptr, len); bb->ptr[len] = '\0'; /* add to the run's cache */ if (num_elems == run->cache_cap) { run->cache_cap = run->cache_cap + 100 + (run->cache_cap / 8); Kino1_Renew(run->cache, run->cache_cap, ByteBuf*); } run->cache[ num_elems ] = bb; /* track how much we've read so far */ num_elems++; run_cache_bytes += len + 1 + KINO_PER_ITEM_OVERHEAD; } /* reset the cache array position and length; remember file pos */ run->cache_elems = num_elems; run->cache_pos = 0; run->file_pos = instream->tell(instream); return num_elems; } /* Refill the main cache, drawing from the caches of all runs. */ static void Kino1_SortEx_refill_cache(SortExternal *sortex) { ByteBuf *endpost; SortExRun *run; I32 i = 0; I32 total = 0; /* free all the existing ByteBufs, as they've been fetched by now */ Kino1_SortEx_clear_cache(sortex); /* make sure all runs have at least one item in the cache */ while (i < sortex->num_runs) { run = sortex->runs[i]; if ( (run->cache_elems > run->cache_pos) || (Kino1_SortEx_refill_run(sortex, run)) ) { i++; } else { /* discard empty runs */ Kino1_SortEx_destroy_run(run); sortex->num_runs--; sortex->runs[i] = sortex->runs[ sortex->num_runs ]; sortex->runs[ sortex->num_runs ] = NULL; } } if (!sortex->num_runs) return; /* move as many items as possible into the sorting cache */ endpost = Kino1_SortEx_find_endpost(sortex); for (i = 0; i < sortex->num_runs; i++) { total += Kino1_SortEx_define_slice(sortex->runs[i], endpost); } /* make sure we have enough room in both the main cache and the scratch */ Kino1_SortEx_grow_bufbuf(&sortex->cache, sortex->cache_cap, total); Kino1_SortEx_grow_bufbuf(&sortex->scratch, sortex->scratch_cap, total); Kino1_SortEx_merge_runs(sortex); sortex->cache_elems = total; } /* Merge all the items which are "in-range" from all the Runs into the main * cache. */ static void Kino1_SortEx_merge_runs(SortExternal *sortex) { SortExRun *run; ByteBuf ***slice_starts; ByteBuf **cache = sortex->cache; I32 *slice_sizes; I32 i = 0, j = 0, slice_size = 0, num_slices = 0; Kino1_New(0, slice_starts, sortex->num_runs, ByteBuf**); Kino1_New(0, slice_sizes, sortex->num_runs, I32); /* copy all the elements in range into the cache */ j = 0; for (i = 0; i < sortex->num_runs; i++) { run = sortex->runs[i]; slice_size = run->slice_size; if (slice_size == 0) continue; slice_sizes[j] = slice_size; slice_starts[j] = cache; Copy( (run->cache + run->cache_pos), cache, slice_size, ByteBuf* ); run->cache_pos += slice_size; cache += slice_size; num_slices = ++j; } /* exploit previous sorting, rather than sort cache naively */ while (num_slices > 1) { /* leave the first slice intact if the number of slices is odd */ i = 0; j = 0; while (i < num_slices) { if (num_slices - i >= 2) { /* merge two consecutive slices */ slice_size = slice_sizes[i] + slice_sizes[i+1]; Kino1_SortEx_merge(slice_starts[i], slice_sizes[i], slice_starts[i+1], slice_sizes[i+1], sortex->scratch); slice_sizes[j] = slice_size; slice_starts[j] = slice_starts[i]; Copy(sortex->scratch, slice_starts[j], slice_size, ByteBuf*); i += 2; j += 1; } else if (num_slices - i >= 1) { /* move single slice pointer */ slice_sizes[j] = slice_sizes[i]; slice_starts[j] = slice_starts[i]; i += 1; j += 1; } } num_slices = j; } Kino1_Safefree(slice_starts); Kino1_Safefree(slice_sizes); } /* Return a pointer to the item in one of the runs' caches which is * the highest in sort order, but which we can guarantee is lower in sort * order than any item which has yet to enter a run cache. */ static ByteBuf* Kino1_SortEx_find_endpost(SortExternal *sortex) { int i; ByteBuf *endpost = NULL, *candidate = NULL; SortExRun *run; for (i = 0; i < sortex->num_runs; i++) { /* get a run and verify no errors */ run = sortex->runs[i]; if (run->cache_pos == run->cache_elems || run->cache_elems < 1) Kino1_confess("find_endpost encountered an empty run cache"); /* get the last item in this run's cache */ candidate = run->cache[ run->cache_elems - 1 ]; /* if it's the first run, the item is automatically the new endpost */ if (i == 0) { endpost = candidate; continue; } /* if it's less than the current endpost, it's the new endpost */ else if (Kino1_BB_compare(candidate, endpost) < 0) { endpost = candidate; } } return endpost; } /* Record the number of items in the run's cache which are lexically * less than or equal to the endpost. */ static I32 Kino1_SortEx_define_slice(SortExRun *run, ByteBuf *endpost) { I32 lo, mid, hi, delta; ByteBuf **cache = run->cache; /* operate on a slice of the cache */ lo = run->cache_pos - 1; hi = run->cache_elems; /* binary search */ while (hi - lo > 1) { mid = (lo + hi) / 2; delta = Kino1_BB_compare(cache[mid], endpost); if (delta > 0) hi = mid; else lo = mid; } run->slice_size = lo == -1 ? 0 : (lo - run->cache_pos) + 1; return run->slice_size; } /* Standard merge sort. */ static void Kino1_SortEx_mergesort(ByteBuf **bufbuf, ByteBuf **scratch, I32 buf_size) { if (buf_size == 0) return; Kino1_SortEx_msort(bufbuf, scratch, 0, buf_size - 1); } /* Standard merge sort msort function. */ static void Kino1_SortEx_msort(ByteBuf **bufbuf, ByteBuf **scratch, U32 left, U32 right) { I32 mid; if (right > left) { mid = ( (right+left)/2 ) + 1; Kino1_SortEx_msort(bufbuf, scratch, left, mid - 1); Kino1_SortEx_msort(bufbuf, scratch, mid, right); Kino1_SortEx_merge( (bufbuf + left), (mid - left), (bufbuf + mid), (right - mid + 1), scratch); Copy( scratch, (bufbuf + left), (right - left + 1), ByteBuf* ); } } /* Standard mergesort merge function. This variant is capable of merging two * discontiguous source arrays. Copying elements back into the source is left * for the caller. */ static void Kino1_SortEx_merge(ByteBuf **left_ptr, U32 left_size, ByteBuf **right_ptr, U32 right_size, ByteBuf **dest) { ByteBuf **left_boundary, **right_boundary; left_boundary = left_ptr + left_size; right_boundary = right_ptr + right_size; while (left_ptr < left_boundary && right_ptr < right_boundary) { if (Kino1_BB_compare(*left_ptr, *right_ptr) < 1) { *dest++ = *left_ptr++; } else { *dest++ = *right_ptr++; } } while (left_ptr < left_boundary) { *dest++ = *left_ptr++; } while (right_ptr < right_boundary) { *dest++ = *right_ptr++; } } static void Kino1_SortEx_clear_cache(SortExternal *sortex) { ByteBuf **cache, **cache_end; cache_end = sortex->cache + sortex->cache_elems; /* only blow away items that haven't been released */ for (cache = sortex->cache + sortex->cache_pos; cache < cache_end; cache++ ) { Kino1_BB_destroy(*cache); } sortex->cache_bytes = 0; sortex->cache_elems = 0; sortex->cache_pos = 0; } static void Kino1_SortEx_clear_run_cache(SortExRun *run) { ByteBuf **cache, **cache_end; cache_end = run->cache + run->cache_elems; /* only destroy items which haven't been passed to the main cache */ for (cache = run->cache + run->cache_pos; cache < cache_end; cache++) { Kino1_BB_destroy(*cache); } run->cache_elems = 0; run->cache_pos = 0; } void Kino1_SortEx_destroy(SortExternal *sortex) { I32 i; /* delegate to Perl garbage collector */ SvREFCNT_dec(sortex->outstream_sv); SvREFCNT_dec(sortex->instream_sv); SvREFCNT_dec(sortex->invindex_sv); SvREFCNT_dec(sortex->seg_name_sv); /* free the cache and the scratch */ Kino1_SortEx_clear_cache(sortex); Kino1_Safefree(sortex->cache); Kino1_Safefree(sortex->scratch); /* free all of the runs and the array that held them */ for (i = 0; i < sortex->num_runs; i++) { Kino1_SortEx_destroy_run(sortex->runs[i]); } Kino1_Safefree(sortex->runs); /* free me */ Kino1_Safefree(sortex); } static void Kino1_SortEx_destroy_run(SortExRun *run) { Kino1_SortEx_clear_run_cache(run); Kino1_Safefree(run->cache); Kino1_Safefree(run); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Util::SortExternal - external sorting ==head1 DESCRIPTION External sorting implementation, using lexical comparison. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Util/StringHelper.pm000444000765000765 357511462203446 22436 0ustar00marvinmarvin000000000000package KinoSearch1::Util::StringHelper; 1; __END__ __H__ #ifndef H_KINO_STRING_HELPER #define H_KINO_STRING_HELPER 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1UtilCarp.h" I32 Kino1_StrHelp_string_diff(char*, char*, STRLEN, STRLEN); I32 Kino1_StrHelp_compare_strings(char*, char*, STRLEN, STRLEN); I32 Kino1_StrHelp_compare_svs(SV*, SV*); #endif /* include guard */ __C__ #include "KinoSearch1UtilStringHelper.h" /* return the number of bytes that two strings have in common */ I32 Kino1_StrHelp_string_diff(char *str1, char *str2, STRLEN len1, STRLEN len2) { STRLEN i, len; len = len1 <= len2 ? len1 : len2; for (i = 0; i < len; i++) { if (*str1++ != *str2++) break; } return i; } /* memcmp, but with lengths for both pointers, not just one */ I32 Kino1_StrHelp_compare_strings(char *a, char *b, STRLEN a_len, STRLEN b_len) { STRLEN len; I32 comparison = 0; if (a == NULL || b == NULL) Kino1_confess("Internal error: can't compare unallocated pointers"); len = a_len < b_len? a_len : b_len; if (len > 0) comparison = memcmp(a, b, len); /* if a is a substring of b, it's less than b, so return a neg num */ if (comparison == 0) comparison = a_len - b_len; return comparison; } /* compare the PVs of two scalars */ I32 Kino1_StrHelp_compare_svs(SV *sva, SV *svb) { char *a, *b; STRLEN a_len, b_len; a = SvPV(sva, a_len); b = SvPV(svb, b_len); return Kino1_StrHelp_compare_strings(a, b, a_len, b_len); } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Util::StringHelper - String related utilities ==head1 DESCRIPTION String related utilities, e.g. string comparison functions. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Util/ToolSet.pm000444000765000765 165611462203446 21417 0ustar00marvinmarvin000000000000package KinoSearch1::Util::ToolSet; use strict; use warnings; use bytes; no bytes; use base qw( Exporter ); use Carp qw( carp croak cluck confess ); # everything except readonly and set_prototype use Scalar::Util qw( refaddr blessed dualvar isweak refaddr reftype tainted weaken isvstring looks_like_number ); use KinoSearch1 qw( K_DEBUG kdump ); use KinoSearch1::Util::VerifyArgs qw( verify_args kerror a_isa_b ); use KinoSearch1::Util::MathUtils qw( ceil ); our @EXPORT = qw( carp croak cluck confess refaddr blessed dualvar isweak refaddr reftype tainted weaken isvstring looks_like_number K_DEBUG kdump kerror verify_args a_isa_b ceil ); 1; __END__ __COPYRIGHT__ Copyright 2005-2010 Marvin Humphrey This program is free software; you can redistribute it and/or modify under the same terms as Perl itself. KinoSearch1-1.01/lib/KinoSearch1/Util/ToStringUtils.pm000444000765000765 123111462203446 22605 0ustar00marvinmarvin000000000000package KinoSearch1::Util::ToStringUtils; use strict; use warnings; use KinoSearch1::Util::ToolSet; use base qw( Exporter ); our @EXPORT_OK = qw( boost_to_string ); # return a stringified numerical boost if it actually does anything. sub boost_to_string { my $boost = shift; return $boost == 1 ? '' : "^$boost"; } 1; __END__ ==begin devdocs ==head1 NAME KinoSearch1::Util::ToStringUtils - common routines which aid stringification ==head1 DESCRIPTION Provide functions which help with to_string. ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS, etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/lib/KinoSearch1/Util/VerifyArgs.pm000444000765000765 1002011462203446 22110 0ustar00marvinmarvin000000000000package KinoSearch1::Util::VerifyArgs; use strict; use warnings; use Scalar::Util qw( blessed ); use Carp; use base qw( Exporter ); our @EXPORT_OK = qw( verify_args kerror a_isa_b ); my $kerror; sub kerror {$kerror} # Verify that named parameters exist in a defaults hash. sub verify_args { my $defaults = shift; # leave the rest of @_ intact # verify that args came in pairs if ( @_ % 2 ) { my ( $package, $filename, $line ) = caller(1); $kerror = "Parameter error: odd number of args at $filename line $line\n"; return 0; } # verify keys, ignore values while (@_) { my ( $var, undef ) = ( shift, shift ); next if exists $defaults->{$var}; my ( $package, $filename, $line ) = caller(1); $kerror = "Invalid parameter: '$var' at $filename line $line\n"; return 0; } return 1; } =begin comment a_isa_b serves the same purpose as the isa method from UNIVERSAL, only it is called as a function rather than a method. # safer than $foo->isa($class), which crashes if $foo isn't blessed my $confirm = a_isa_b( $foo, $class ); =end comment =cut sub a_isa_b { my ( $item, $class_name ) = @_; return 0 unless blessed($item); return $item->isa($class_name); } 1; __END__ __H__ #ifndef H_KINO_VERIFY_ARGS #define H_KINO_VERIFY_ARGS 1 #include "EXTERN.h" #include "perl.h" #include "XSUB.h" #include "KinoSearch1UtilCarp.h" /* Return a mortalized hash, built using a defaults hash and @_. */ #define Kino1_Verify_build_args_hash(args_hash, defaults_hash_name, stack_st)\ /* dXSARGS in the next function pops a stack marker, so we push one */ \ PUSHMARK(SP); \ args_hash = Kino1_Verify_do_build_args_hash(defaults_hash_name, stack_st); HV* Kino1_Verify_do_build_args_hash(char*, I32); SV* Kino1_Verify_extract_arg(HV*, char*, I32); #endif /* include guard */ __C__ #include "KinoSearch1UtilVerifyArgs.h" HV* Kino1_Verify_do_build_args_hash(char* defaults_hash_name, I32 stack_st) { HV *defaults_hash, *args_hash; char *key; I32 key_len; STRLEN len; SV *key_sv, *val_sv, *val_copy_sv; I32 stack_pos; dXSARGS; /* create the args hash and mortalize it */ args_hash = newHV(); args_hash = (HV*)sv_2mortal( (SV*)args_hash ); /* NOTE: the defaults hash must be declared using "our" */ defaults_hash = get_hv(defaults_hash_name, 0); if (defaults_hash == NULL) Kino1_confess("Can't find hash named %s", defaults_hash_name); /* make the args hash a copy of the defaults hash */ (void)hv_iterinit(defaults_hash); while ((val_sv = hv_iternextsv(defaults_hash, &key, &key_len))) { val_copy_sv = newSVsv(val_sv); hv_store(args_hash, key, key_len, val_copy_sv, 0); } /* verify and copy hash-style params into args hash from stack */ if ((items - stack_st) % 2 != 0) Kino1_confess("Expecting hash-style params, " "got odd number of args"); stack_pos = stack_st; while (stack_pos < items) { key_sv = ST(stack_pos++); key = SvPV(key_sv, len); key_len = len; if (!hv_exists(args_hash, key, key_len)) { Kino1_confess("Invalid parameter: '%s'", key); } val_sv = ST(stack_pos++); val_copy_sv = newSVsv(val_sv); hv_store(args_hash, key, key_len, val_copy_sv, 0); } return args_hash; } SV* Kino1_Verify_extract_arg(HV* hash, char* key, I32 key_len) { SV** sv_ptr; sv_ptr = hv_fetch(hash, key, key_len, 0); if (sv_ptr == NULL) Kino1_confess("Failed to retrieve hash entry '%s'", key); return *sv_ptr; } __POD__ ==begin devdocs ==head1 NAME KinoSearch1::Util::VerifyArgs - some validation functions ==head1 DESCRIPTION Provide some utility functions under the general heading of "verification". ==head1 COPYRIGHT Copyright 2005-2010 Marvin Humphrey ==head1 LICENSE, DISCLAIMER, BUGS etc. See L version 1.01. ==end devdocs ==cut KinoSearch1-1.01/src000755000765000765 011462203446 14320 5ustar00marvinmarvin000000000000KinoSearch1-1.01/src/ppport.h000444000765000765 46204211462203446 16222 0ustar00marvinmarvin000000000000#if 0 <<'SKIP'; #endif /* ---------------------------------------------------------------------- ppport.h -- Perl/Pollution/Portability Version 3.14 Automatically created by Devel::PPPort running under perl 5.008007. Do NOT edit this file directly! -- Edit PPPort_pm.PL and the includes in parts/inc/ instead. Use 'perldoc ppport.h' to view the documentation below. ---------------------------------------------------------------------- SKIP =pod =head1 NAME ppport.h - Perl/Pollution/Portability version 3.14 =head1 SYNOPSIS perl ppport.h [options] [source files] Searches current directory for files if no [source files] are given --help show short help --version show version --patch=file write one patch file with changes --copy=suffix write changed copies with suffix --diff=program use diff program and options --compat-version=version provide compatibility with Perl version --cplusplus accept C++ comments --quiet don't output anything except fatal errors --nodiag don't show diagnostics --nohints don't show hints --nochanges don't suggest changes --nofilter don't filter input files --strip strip all script and doc functionality from ppport.h --list-provided list provided API --list-unsupported list unsupported API --api-info=name show Perl API portability information =head1 COMPATIBILITY This version of F is designed to support operation with Perl installations back to 5.003, and has been tested up to 5.10.0. =head1 OPTIONS =head2 --help Display a brief usage summary. =head2 --version Display the version of F. =head2 --patch=I If this option is given, a single patch file will be created if any changes are suggested. This requires a working diff program to be installed on your system. =head2 --copy=I If this option is given, a copy of each file will be saved with the given suffix that contains the suggested changes. This does not require any external programs. Note that this does not automagially add a dot between the original filename and the suffix. If you want the dot, you have to include it in the option argument. If neither C<--patch> or C<--copy> are given, the default is to simply print the diffs for each file. This requires either C or a C program to be installed. =head2 --diff=I Manually set the diff program and options to use. The default is to use C, when installed, and output unified context diffs. =head2 --compat-version=I Tell F to check for compatibility with the given Perl version. The default is to check for compatibility with Perl version 5.003. You can use this option to reduce the output of F if you intend to be backward compatible only down to a certain Perl version. =head2 --cplusplus Usually, F will detect C++ style comments and replace them with C style comments for portability reasons. Using this option instructs F to leave C++ comments untouched. =head2 --quiet Be quiet. Don't print anything except fatal errors. =head2 --nodiag Don't output any diagnostic messages. Only portability alerts will be printed. =head2 --nohints Don't output any hints. Hints often contain useful portability notes. Warnings will still be displayed. =head2 --nochanges Don't suggest any changes. Only give diagnostic output and hints unless these are also deactivated. =head2 --nofilter Don't filter the list of input files. By default, files not looking like source code (i.e. not *.xs, *.c, *.cc, *.cpp or *.h) are skipped. =head2 --strip Strip all script and documentation functionality from F. This reduces the size of F dramatically and may be useful if you want to include F in smaller modules without increasing their distribution size too much. The stripped F will have a C<--unstrip> option that allows you to undo the stripping, but only if an appropriate C module is installed. =head2 --list-provided Lists the API elements for which compatibility is provided by F. Also lists if it must be explicitly requested, if it has dependencies, and if there are hints or warnings for it. =head2 --list-unsupported Lists the API elements that are known not to be supported by F and below which version of Perl they probably won't be available or work. =head2 --api-info=I Show portability information for API elements matching I. If I is surrounded by slashes, it is interpreted as a regular expression. =head1 DESCRIPTION In order for a Perl extension (XS) module to be as portable as possible across differing versions of Perl itself, certain steps need to be taken. =over 4 =item * Including this header is the first major one. This alone will give you access to a large part of the Perl API that hasn't been available in earlier Perl releases. Use perl ppport.h --list-provided to see which API elements are provided by ppport.h. =item * You should avoid using deprecated parts of the API. For example, using global Perl variables without the C prefix is deprecated. Also, some API functions used to have a C prefix. Using this form is also deprecated. You can safely use the supported API, as F will provide wrappers for older Perl versions. =item * If you use one of a few functions or variables that were not present in earlier versions of Perl, and that can't be provided using a macro, you have to explicitly request support for these functions by adding one or more C<#define>s in your source code before the inclusion of F. These functions or variables will be marked C in the list shown by C<--list-provided>. Depending on whether you module has a single or multiple files that use such functions or variables, you want either C or global variants. For a C function or variable (used only in a single source file), use: #define NEED_function #define NEED_variable For a global function or variable (used in multiple source files), use: #define NEED_function_GLOBAL #define NEED_variable_GLOBAL Note that you mustn't have more than one global request for the same function or variable in your project. Function / Variable Static Request Global Request ----------------------------------------------------------------------------------------- PL_signals NEED_PL_signals NEED_PL_signals_GLOBAL eval_pv() NEED_eval_pv NEED_eval_pv_GLOBAL grok_bin() NEED_grok_bin NEED_grok_bin_GLOBAL grok_hex() NEED_grok_hex NEED_grok_hex_GLOBAL grok_number() NEED_grok_number NEED_grok_number_GLOBAL grok_numeric_radix() NEED_grok_numeric_radix NEED_grok_numeric_radix_GLOBAL grok_oct() NEED_grok_oct NEED_grok_oct_GLOBAL load_module() NEED_load_module NEED_load_module_GLOBAL my_snprintf() NEED_my_snprintf NEED_my_snprintf_GLOBAL my_strlcat() NEED_my_strlcat NEED_my_strlcat_GLOBAL my_strlcpy() NEED_my_strlcpy NEED_my_strlcpy_GLOBAL newCONSTSUB() NEED_newCONSTSUB NEED_newCONSTSUB_GLOBAL newRV_noinc() NEED_newRV_noinc NEED_newRV_noinc_GLOBAL newSVpvn_flags() NEED_newSVpvn_flags NEED_newSVpvn_flags_GLOBAL newSVpvn_share() NEED_newSVpvn_share NEED_newSVpvn_share_GLOBAL sv_2pv_flags() NEED_sv_2pv_flags NEED_sv_2pv_flags_GLOBAL sv_2pvbyte() NEED_sv_2pvbyte NEED_sv_2pvbyte_GLOBAL sv_catpvf_mg() NEED_sv_catpvf_mg NEED_sv_catpvf_mg_GLOBAL sv_catpvf_mg_nocontext() NEED_sv_catpvf_mg_nocontext NEED_sv_catpvf_mg_nocontext_GLOBAL sv_pvn_force_flags() NEED_sv_pvn_force_flags NEED_sv_pvn_force_flags_GLOBAL sv_setpvf_mg() NEED_sv_setpvf_mg NEED_sv_setpvf_mg_GLOBAL sv_setpvf_mg_nocontext() NEED_sv_setpvf_mg_nocontext NEED_sv_setpvf_mg_nocontext_GLOBAL vload_module() NEED_vload_module NEED_vload_module_GLOBAL vnewSVpvf() NEED_vnewSVpvf NEED_vnewSVpvf_GLOBAL warner() NEED_warner NEED_warner_GLOBAL To avoid namespace conflicts, you can change the namespace of the explicitly exported functions / variables using the C macro. Just C<#define> the macro before including C: #define DPPP_NAMESPACE MyOwnNamespace_ #include "ppport.h" The default namespace is C. =back The good thing is that most of the above can be checked by running F on your source code. See the next section for details. =head1 EXAMPLES To verify whether F is needed for your module, whether you should make any changes to your code, and whether any special defines should be used, F can be run as a Perl script to check your source code. Simply say: perl ppport.h The result will usually be a list of patches suggesting changes that should at least be acceptable, if not necessarily the most efficient solution, or a fix for all possible problems. If you know that your XS module uses features only available in newer Perl releases, if you're aware that it uses C++ comments, and if you want all suggestions as a single patch file, you could use something like this: perl ppport.h --compat-version=5.6.0 --cplusplus --patch=test.diff If you only want your code to be scanned without any suggestions for changes, use: perl ppport.h --nochanges You can specify a different C program or options, using the C<--diff> option: perl ppport.h --diff='diff -C 10' This would output context diffs with 10 lines of context. If you want to create patched copies of your files instead, use: perl ppport.h --copy=.new To display portability information for the C function, use: perl ppport.h --api-info=newSVpvn Since the argument to C<--api-info> can be a regular expression, you can use perl ppport.h --api-info=/_nomg$/ to display portability information for all C<_nomg> functions or perl ppport.h --api-info=/./ to display information for all known API elements. =head1 BUGS If this version of F is causing failure during the compilation of this module, please check if newer versions of either this module or C are available on CPAN before sending a bug report. If F was generated using the latest version of C and is causing failure of this module, please file a bug report using the CPAN Request Tracker at L. Please include the following information: =over 4 =item 1. The complete output from running "perl -V" =item 2. This file. =item 3. The name and version of the module you were trying to build. =item 4. A full log of the build that failed. =item 5. Any other information that you think could be relevant. =back For the latest version of this code, please get the C module from CPAN. =head1 COPYRIGHT Version 3.x, Copyright (c) 2004-2008, Marcus Holland-Moritz. Version 2.x, Copyright (C) 2001, Paul Marquess. Version 1.x, Copyright (C) 1999, Kenneth Albanowski. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. =head1 SEE ALSO See L. =cut use strict; # Disable broken TRIE-optimization BEGIN { eval '${^RE_TRIE_MAXBUF} = -1' if $] >= 5.009004 && $] <= 5.009005 } my $VERSION = 3.14; my %opt = ( quiet => 0, diag => 1, hints => 1, changes => 1, cplusplus => 0, filter => 1, strip => 0, version => 0, ); my($ppport) = $0 =~ /([\w.]+)$/; my $LF = '(?:\r\n|[\r\n])'; # line feed my $HS = "[ \t]"; # horizontal whitespace # Never use C comments in this file! my $ccs = '/'.'*'; my $cce = '*'.'/'; my $rccs = quotemeta $ccs; my $rcce = quotemeta $cce; eval { require Getopt::Long; Getopt::Long::GetOptions(\%opt, qw( help quiet diag! filter! hints! changes! cplusplus strip version patch=s copy=s diff=s compat-version=s list-provided list-unsupported api-info=s )) or usage(); }; if ($@ and grep /^-/, @ARGV) { usage() if "@ARGV" =~ /^--?h(?:elp)?$/; die "Getopt::Long not found. Please don't use any options.\n"; } if ($opt{version}) { print "This is $0 $VERSION.\n"; exit 0; } usage() if $opt{help}; strip() if $opt{strip}; if (exists $opt{'compat-version'}) { my($r,$v,$s) = eval { parse_version($opt{'compat-version'}) }; if ($@) { die "Invalid version number format: '$opt{'compat-version'}'\n"; } die "Only Perl 5 is supported\n" if $r != 5; die "Invalid version number: $opt{'compat-version'}\n" if $v >= 1000 || $s >= 1000; $opt{'compat-version'} = sprintf "%d.%03d%03d", $r, $v, $s; } else { $opt{'compat-version'} = 5; } my %API = map { /^(\w+)\|([^|]*)\|([^|]*)\|(\w*)$/ ? ( $1 => { ($2 ? ( base => $2 ) : ()), ($3 ? ( todo => $3 ) : ()), (index($4, 'v') >= 0 ? ( varargs => 1 ) : ()), (index($4, 'p') >= 0 ? ( provided => 1 ) : ()), (index($4, 'n') >= 0 ? ( nothxarg => 1 ) : ()), } ) : die "invalid spec: $_" } qw( AvFILLp|5.004050||p AvFILL||| CLASS|||n CX_CURPAD_SAVE||| CX_CURPAD_SV||| CopFILEAV|5.006000||p CopFILEGV_set|5.006000||p CopFILEGV|5.006000||p CopFILESV|5.006000||p CopFILE_set|5.006000||p CopFILE|5.006000||p CopSTASHPV_set|5.006000||p CopSTASHPV|5.006000||p CopSTASH_eq|5.006000||p CopSTASH_set|5.006000||p CopSTASH|5.006000||p CopyD|5.009002||p Copy||| CvPADLIST||| CvSTASH||| CvWEAKOUTSIDE||| DEFSV|5.004050||p END_EXTERN_C|5.005000||p ENTER||| ERRSV|5.004050||p EXTEND||| EXTERN_C|5.005000||p F0convert|||n FREETMPS||| GIMME_V||5.004000|n GIMME|||n GROK_NUMERIC_RADIX|5.007002||p G_ARRAY||| G_DISCARD||| G_EVAL||| G_NOARGS||| G_SCALAR||| G_VOID||5.004000| GetVars||| GvSV||| Gv_AMupdate||| HEf_SVKEY||5.004000| HeHASH||5.004000| HeKEY||5.004000| HeKLEN||5.004000| HePV||5.004000| HeSVKEY_force||5.004000| HeSVKEY_set||5.004000| HeSVKEY||5.004000| HeUTF8||5.011000| HeVAL||5.004000| HvNAME||| INT2PTR|5.006000||p IN_LOCALE_COMPILETIME|5.007002||p IN_LOCALE_RUNTIME|5.007002||p IN_LOCALE|5.007002||p IN_PERL_COMPILETIME|5.008001||p IS_NUMBER_GREATER_THAN_UV_MAX|5.007002||p IS_NUMBER_INFINITY|5.007002||p IS_NUMBER_IN_UV|5.007002||p IS_NUMBER_NAN|5.007003||p IS_NUMBER_NEG|5.007002||p IS_NUMBER_NOT_INT|5.007002||p IVSIZE|5.006000||p IVTYPE|5.006000||p IVdf|5.006000||p LEAVE||| LVRET||| MARK||| MULTICALL||5.011000| MY_CXT_CLONE|5.009002||p MY_CXT_INIT|5.007003||p MY_CXT|5.007003||p MoveD|5.009002||p Move||| NOOP|5.005000||p NUM2PTR|5.006000||p NVTYPE|5.006000||p NVef|5.006001||p NVff|5.006001||p NVgf|5.006001||p Newxc|5.009003||p Newxz|5.009003||p Newx|5.009003||p Nullav||| Nullch||| Nullcv||| Nullhv||| Nullsv||| ORIGMARK||| PAD_BASE_SV||| PAD_CLONE_VARS||| PAD_COMPNAME_FLAGS||| PAD_COMPNAME_GEN_set||| PAD_COMPNAME_GEN||| PAD_COMPNAME_OURSTASH||| PAD_COMPNAME_PV||| PAD_COMPNAME_TYPE||| PAD_RESTORE_LOCAL||| PAD_SAVE_LOCAL||| PAD_SAVE_SETNULLPAD||| PAD_SETSV||| PAD_SET_CUR_NOSAVE||| PAD_SET_CUR||| PAD_SVl||| PAD_SV||| PERL_ABS|5.008001||p PERL_BCDVERSION|5.011000||p PERL_GCC_BRACE_GROUPS_FORBIDDEN|5.008001||p PERL_HASH|5.004000||p PERL_INT_MAX|5.004000||p PERL_INT_MIN|5.004000||p PERL_LONG_MAX|5.004000||p PERL_LONG_MIN|5.004000||p PERL_MAGIC_arylen|5.007002||p PERL_MAGIC_backref|5.007002||p PERL_MAGIC_bm|5.007002||p PERL_MAGIC_collxfrm|5.007002||p PERL_MAGIC_dbfile|5.007002||p PERL_MAGIC_dbline|5.007002||p PERL_MAGIC_defelem|5.007002||p PERL_MAGIC_envelem|5.007002||p PERL_MAGIC_env|5.007002||p PERL_MAGIC_ext|5.007002||p PERL_MAGIC_fm|5.007002||p PERL_MAGIC_glob|5.011000||p PERL_MAGIC_isaelem|5.007002||p PERL_MAGIC_isa|5.007002||p PERL_MAGIC_mutex|5.011000||p PERL_MAGIC_nkeys|5.007002||p PERL_MAGIC_overload_elem|5.007002||p PERL_MAGIC_overload_table|5.007002||p PERL_MAGIC_overload|5.007002||p PERL_MAGIC_pos|5.007002||p PERL_MAGIC_qr|5.007002||p PERL_MAGIC_regdata|5.007002||p PERL_MAGIC_regdatum|5.007002||p PERL_MAGIC_regex_global|5.007002||p PERL_MAGIC_shared_scalar|5.007003||p PERL_MAGIC_shared|5.007003||p PERL_MAGIC_sigelem|5.007002||p PERL_MAGIC_sig|5.007002||p PERL_MAGIC_substr|5.007002||p PERL_MAGIC_sv|5.007002||p PERL_MAGIC_taint|5.007002||p PERL_MAGIC_tiedelem|5.007002||p PERL_MAGIC_tiedscalar|5.007002||p PERL_MAGIC_tied|5.007002||p PERL_MAGIC_utf8|5.008001||p PERL_MAGIC_uvar_elem|5.007003||p PERL_MAGIC_uvar|5.007002||p PERL_MAGIC_vec|5.007002||p PERL_MAGIC_vstring|5.008001||p PERL_QUAD_MAX|5.004000||p PERL_QUAD_MIN|5.004000||p PERL_REVISION|5.006000||p PERL_SCAN_ALLOW_UNDERSCORES|5.007003||p PERL_SCAN_DISALLOW_PREFIX|5.007003||p PERL_SCAN_GREATER_THAN_UV_MAX|5.007003||p PERL_SCAN_SILENT_ILLDIGIT|5.008001||p PERL_SHORT_MAX|5.004000||p PERL_SHORT_MIN|5.004000||p PERL_SIGNALS_UNSAFE_FLAG|5.008001||p PERL_SUBVERSION|5.006000||p PERL_UCHAR_MAX|5.004000||p PERL_UCHAR_MIN|5.004000||p PERL_UINT_MAX|5.004000||p PERL_UINT_MIN|5.004000||p PERL_ULONG_MAX|5.004000||p PERL_ULONG_MIN|5.004000||p PERL_UNUSED_ARG|5.009003||p PERL_UNUSED_CONTEXT|5.009004||p PERL_UNUSED_DECL|5.007002||p PERL_UNUSED_VAR|5.007002||p PERL_UQUAD_MAX|5.004000||p PERL_UQUAD_MIN|5.004000||p PERL_USE_GCC_BRACE_GROUPS|5.009004||p PERL_USHORT_MAX|5.004000||p PERL_USHORT_MIN|5.004000||p PERL_VERSION|5.006000||p PL_DBsignal|5.005000||p PL_DBsingle|||pn PL_DBsub|||pn PL_DBtrace|||pn PL_Sv|5.005000||p PL_compiling|5.004050||p PL_copline|5.011000||p PL_curcop|5.004050||p PL_curstash|5.004050||p PL_debstash|5.004050||p PL_defgv|5.004050||p PL_diehook|5.004050||p PL_dirty|5.004050||p PL_dowarn|||pn PL_errgv|5.004050||p PL_expect|5.011000||p PL_hexdigit|5.005000||p PL_hints|5.005000||p PL_last_in_gv|||n PL_laststatval|5.005000||p PL_modglobal||5.005000|n PL_na|5.004050||pn PL_no_modify|5.006000||p PL_ofs_sv|||n PL_perl_destruct_level|5.004050||p PL_perldb|5.004050||p PL_ppaddr|5.006000||p PL_rsfp_filters|5.004050||p PL_rsfp|5.004050||p PL_rs|||n PL_signals|5.008001||p PL_stack_base|5.004050||p PL_stack_sp|5.004050||p PL_statcache|5.005000||p PL_stdingv|5.004050||p PL_sv_arenaroot|5.004050||p PL_sv_no|5.004050||pn PL_sv_undef|5.004050||pn PL_sv_yes|5.004050||pn PL_tainted|5.004050||p PL_tainting|5.004050||p POP_MULTICALL||5.011000| POPi|||n POPl|||n POPn|||n POPpbytex||5.007001|n POPpx||5.005030|n POPp|||n POPs|||n PTR2IV|5.006000||p PTR2NV|5.006000||p PTR2UV|5.006000||p PTR2ul|5.007001||p PTRV|5.006000||p PUSHMARK||| PUSH_MULTICALL||5.011000| PUSHi||| PUSHmortal|5.009002||p PUSHn||| PUSHp||| PUSHs||| PUSHu|5.004000||p PUTBACK||| PerlIO_clearerr||5.007003| PerlIO_close||5.007003| PerlIO_context_layers||5.009004| PerlIO_eof||5.007003| PerlIO_error||5.007003| PerlIO_fileno||5.007003| PerlIO_fill||5.007003| PerlIO_flush||5.007003| PerlIO_get_base||5.007003| PerlIO_get_bufsiz||5.007003| PerlIO_get_cnt||5.007003| PerlIO_get_ptr||5.007003| PerlIO_read||5.007003| PerlIO_seek||5.007003| PerlIO_set_cnt||5.007003| PerlIO_set_ptrcnt||5.007003| PerlIO_setlinebuf||5.007003| PerlIO_stderr||5.007003| PerlIO_stdin||5.007003| PerlIO_stdout||5.007003| PerlIO_tell||5.007003| PerlIO_unread||5.007003| PerlIO_write||5.007003| Perl_signbit||5.009005|n PoisonFree|5.009004||p PoisonNew|5.009004||p PoisonWith|5.009004||p Poison|5.008000||p RETVAL|||n Renewc||| Renew||| SAVECLEARSV||| SAVECOMPPAD||| SAVEPADSV||| SAVETMPS||| SAVE_DEFSV|5.004050||p SPAGAIN||| SP||| START_EXTERN_C|5.005000||p START_MY_CXT|5.007003||p STMT_END|||p STMT_START|||p STR_WITH_LEN|5.009003||p ST||| SV_CONST_RETURN|5.009003||p SV_COW_DROP_PV|5.008001||p SV_COW_SHARED_HASH_KEYS|5.009005||p SV_GMAGIC|5.007002||p SV_HAS_TRAILING_NUL|5.009004||p SV_IMMEDIATE_UNREF|5.007001||p SV_MUTABLE_RETURN|5.009003||p SV_NOSTEAL|5.009002||p SV_SMAGIC|5.009003||p SV_UTF8_NO_ENCODING|5.008001||p SVf_UTF8|5.006000||p SVf|5.006000||p SVt_IV||| SVt_NV||| SVt_PVAV||| SVt_PVCV||| SVt_PVHV||| SVt_PVMG||| SVt_PV||| Safefree||| Slab_Alloc||| Slab_Free||| Slab_to_rw||| StructCopy||| SvCUR_set||| SvCUR||| SvEND||| SvGAMAGIC||5.006001| SvGETMAGIC|5.004050||p SvGROW||| SvIOK_UV||5.006000| SvIOK_notUV||5.006000| SvIOK_off||| SvIOK_only_UV||5.006000| SvIOK_only||| SvIOK_on||| SvIOKp||| SvIOK||| SvIVX||| SvIV_nomg|5.009001||p SvIV_set||| SvIVx||| SvIV||| SvIsCOW_shared_hash||5.008003| SvIsCOW||5.008003| SvLEN_set||| SvLEN||| SvLOCK||5.007003| SvMAGIC_set|5.009003||p SvNIOK_off||| SvNIOKp||| SvNIOK||| SvNOK_off||| SvNOK_only||| SvNOK_on||| SvNOKp||| SvNOK||| SvNVX||| SvNV_set||| SvNVx||| SvNV||| SvOK||| SvOOK||| SvPOK_off||| SvPOK_only_UTF8||5.006000| SvPOK_only||| SvPOK_on||| SvPOKp||| SvPOK||| SvPVX_const|5.009003||p SvPVX_mutable|5.009003||p SvPVX||| SvPV_const|5.009003||p SvPV_flags_const_nolen|5.009003||p SvPV_flags_const|5.009003||p SvPV_flags_mutable|5.009003||p SvPV_flags|5.007002||p SvPV_force_flags_mutable|5.009003||p SvPV_force_flags_nolen|5.009003||p SvPV_force_flags|5.007002||p SvPV_force_mutable|5.009003||p SvPV_force_nolen|5.009003||p SvPV_force_nomg_nolen|5.009003||p SvPV_force_nomg|5.007002||p SvPV_force|||p SvPV_mutable|5.009003||p SvPV_nolen_const|5.009003||p SvPV_nolen|5.006000||p SvPV_nomg_const_nolen|5.009003||p SvPV_nomg_const|5.009003||p SvPV_nomg|5.007002||p SvPV_set||| SvPVbyte_force||5.009002| SvPVbyte_nolen||5.006000| SvPVbytex_force||5.006000| SvPVbytex||5.006000| SvPVbyte|5.006000||p SvPVutf8_force||5.006000| SvPVutf8_nolen||5.006000| SvPVutf8x_force||5.006000| SvPVutf8x||5.006000| SvPVutf8||5.006000| SvPVx||| SvPV||| SvREFCNT_dec||| SvREFCNT_inc_NN|5.009004||p SvREFCNT_inc_simple_NN|5.009004||p SvREFCNT_inc_simple_void_NN|5.009004||p SvREFCNT_inc_simple_void|5.009004||p SvREFCNT_inc_simple|5.009004||p SvREFCNT_inc_void_NN|5.009004||p SvREFCNT_inc_void|5.009004||p SvREFCNT_inc|||p SvREFCNT||| SvROK_off||| SvROK_on||| SvROK||| SvRV_set|5.009003||p SvRV||| SvRXOK||5.009005| SvRX||5.009005| SvSETMAGIC||| SvSHARED_HASH|5.009003||p SvSHARE||5.007003| SvSTASH_set|5.009003||p SvSTASH||| SvSetMagicSV_nosteal||5.004000| SvSetMagicSV||5.004000| SvSetSV_nosteal||5.004000| SvSetSV||| SvTAINTED_off||5.004000| SvTAINTED_on||5.004000| SvTAINTED||5.004000| SvTAINT||| SvTRUE||| SvTYPE||| SvUNLOCK||5.007003| SvUOK|5.007001|5.006000|p SvUPGRADE||| SvUTF8_off||5.006000| SvUTF8_on||5.006000| SvUTF8||5.006000| SvUVXx|5.004000||p SvUVX|5.004000||p SvUV_nomg|5.009001||p SvUV_set|5.009003||p SvUVx|5.004000||p SvUV|5.004000||p SvVOK||5.008001| SvVSTRING_mg|5.009004||p THIS|||n UNDERBAR|5.009002||p UTF8_MAXBYTES|5.009002||p UVSIZE|5.006000||p UVTYPE|5.006000||p UVXf|5.007001||p UVof|5.006000||p UVuf|5.006000||p UVxf|5.006000||p WARN_ALL|5.006000||p WARN_AMBIGUOUS|5.006000||p WARN_ASSERTIONS|5.011000||p WARN_BAREWORD|5.006000||p WARN_CLOSED|5.006000||p WARN_CLOSURE|5.006000||p WARN_DEBUGGING|5.006000||p WARN_DEPRECATED|5.006000||p WARN_DIGIT|5.006000||p WARN_EXEC|5.006000||p WARN_EXITING|5.006000||p WARN_GLOB|5.006000||p WARN_INPLACE|5.006000||p WARN_INTERNAL|5.006000||p WARN_IO|5.006000||p WARN_LAYER|5.008000||p WARN_MALLOC|5.006000||p WARN_MISC|5.006000||p WARN_NEWLINE|5.006000||p WARN_NUMERIC|5.006000||p WARN_ONCE|5.006000||p WARN_OVERFLOW|5.006000||p WARN_PACK|5.006000||p WARN_PARENTHESIS|5.006000||p WARN_PIPE|5.006000||p WARN_PORTABLE|5.006000||p WARN_PRECEDENCE|5.006000||p WARN_PRINTF|5.006000||p WARN_PROTOTYPE|5.006000||p WARN_QW|5.006000||p WARN_RECURSION|5.006000||p WARN_REDEFINE|5.006000||p WARN_REGEXP|5.006000||p WARN_RESERVED|5.006000||p WARN_SEMICOLON|5.006000||p WARN_SEVERE|5.006000||p WARN_SIGNAL|5.006000||p WARN_SUBSTR|5.006000||p WARN_SYNTAX|5.006000||p WARN_TAINT|5.006000||p WARN_THREADS|5.008000||p WARN_UNINITIALIZED|5.006000||p WARN_UNOPENED|5.006000||p WARN_UNPACK|5.006000||p WARN_UNTIE|5.006000||p WARN_UTF8|5.006000||p WARN_VOID|5.006000||p XCPT_CATCH|5.009002||p XCPT_RETHROW|5.009002||p XCPT_TRY_END|5.009002||p XCPT_TRY_START|5.009002||p XPUSHi||| XPUSHmortal|5.009002||p XPUSHn||| XPUSHp||| XPUSHs||| XPUSHu|5.004000||p XSRETURN_EMPTY||| XSRETURN_IV||| XSRETURN_NO||| XSRETURN_NV||| XSRETURN_PV||| XSRETURN_UNDEF||| XSRETURN_UV|5.008001||p XSRETURN_YES||| XSRETURN|||p XST_mIV||| XST_mNO||| XST_mNV||| XST_mPV||| XST_mUNDEF||| XST_mUV|5.008001||p XST_mYES||| XS_VERSION_BOOTCHECK||| XS_VERSION||| XSprePUSH|5.006000||p XS||| ZeroD|5.009002||p Zero||| _aMY_CXT|5.007003||p _pMY_CXT|5.007003||p aMY_CXT_|5.007003||p aMY_CXT|5.007003||p aTHXR_|5.011000||p aTHXR|5.011000||p aTHX_|5.006000||p aTHX|5.006000||p add_data|||n addmad||| allocmy||| amagic_call||| amagic_cmp_locale||| amagic_cmp||| amagic_i_ncmp||| amagic_ncmp||| any_dup||| ao||| append_elem||| append_list||| append_madprops||| apply_attrs_my||| apply_attrs_string||5.006001| apply_attrs||| apply||| atfork_lock||5.007003|n atfork_unlock||5.007003|n av_arylen_p||5.009003| av_clear||| av_create_and_push||5.009005| av_create_and_unshift_one||5.009005| av_delete||5.006000| av_exists||5.006000| av_extend||| av_fake||| av_fetch||| av_fill||| av_iter_p||5.011000| av_len||| av_make||| av_pop||| av_push||| av_reify||| av_shift||| av_store||| av_undef||| av_unshift||| ax|||n bad_type||| bind_match||| block_end||| block_gimme||5.004000| block_start||| boolSV|5.004000||p boot_core_PerlIO||| boot_core_UNIVERSAL||| boot_core_mro||| boot_core_xsutils||| bytes_from_utf8||5.007001| bytes_to_uni|||n bytes_to_utf8||5.006001| call_argv|5.006000||p call_atexit||5.006000| call_list||5.004000| call_method|5.006000||p call_pv|5.006000||p call_sv|5.006000||p calloc||5.007002|n cando||| cast_i32||5.006000| cast_iv||5.006000| cast_ulong||5.006000| cast_uv||5.006000| check_type_and_open||| check_uni||| checkcomma||| checkposixcc||| ckWARN|5.006000||p ck_anoncode||| ck_bitop||| ck_concat||| ck_defined||| ck_delete||| ck_die||| ck_each||| ck_eof||| ck_eval||| ck_exec||| ck_exists||| ck_exit||| ck_ftst||| ck_fun||| ck_glob||| ck_grep||| ck_index||| ck_join||| ck_lengthconst||| ck_lfun||| ck_listiob||| ck_match||| ck_method||| ck_null||| ck_open||| ck_readline||| ck_repeat||| ck_require||| ck_retarget||| ck_return||| ck_rfun||| ck_rvconst||| ck_sassign||| ck_select||| ck_shift||| ck_sort||| ck_spair||| ck_split||| ck_subr||| ck_substr||| ck_svconst||| ck_trunc||| ck_unpack||| ckwarn_d||5.009003| ckwarn||5.009003| cl_and|||n cl_anything|||n cl_init_zero|||n cl_init|||n cl_is_anything|||n cl_or|||n clear_placeholders||| closest_cop||| convert||| cop_free||| cr_textfilter||| create_eval_scope||| croak_nocontext|||vn croak|||v csighandler||5.009003|n curmad||| custom_op_desc||5.007003| custom_op_name||5.007003| cv_ckproto_len||| cv_ckproto||| cv_clone||| cv_const_sv||5.004000| cv_dump||| cv_undef||| cx_dump||5.005000| cx_dup||| cxinc||| dAXMARK|5.009003||p dAX|5.007002||p dITEMS|5.007002||p dMARK||| dMULTICALL||5.009003| dMY_CXT_SV|5.007003||p dMY_CXT|5.007003||p dNOOP|5.006000||p dORIGMARK||| dSP||| dTHR|5.004050||p dTHXR|5.011000||p dTHXa|5.006000||p dTHXoa|5.006000||p dTHX|5.006000||p dUNDERBAR|5.009002||p dVAR|5.009003||p dXCPT|5.009002||p dXSARGS||| dXSI32||| dXSTARG|5.006000||p deb_curcv||| deb_nocontext|||vn deb_stack_all||| deb_stack_n||| debop||5.005000| debprofdump||5.005000| debprof||| debstackptrs||5.007003| debstack||5.007003| debug_start_match||| deb||5.007003|v del_sv||| delete_eval_scope||| delimcpy||5.004000| deprecate_old||| deprecate||| despatch_signals||5.007001| destroy_matcher||| die_nocontext|||vn die_where||| die|||v dirp_dup||| div128||| djSP||| do_aexec5||| do_aexec||| do_aspawn||| do_binmode||5.004050| do_chomp||| do_chop||| do_close||| do_dump_pad||| do_eof||| do_exec3||| do_execfree||| do_exec||| do_gv_dump||5.006000| do_gvgv_dump||5.006000| do_hv_dump||5.006000| do_ipcctl||| do_ipcget||| do_join||| do_kv||| do_magic_dump||5.006000| do_msgrcv||| do_msgsnd||| do_oddball||| do_op_dump||5.006000| do_op_xmldump||| do_open9||5.006000| do_openn||5.007001| do_open||5.004000| do_pmop_dump||5.006000| do_pmop_xmldump||| do_print||| do_readline||| do_seek||| do_semop||| do_shmio||| do_smartmatch||| do_spawn_nowait||| do_spawn||| do_sprintf||| do_sv_dump||5.006000| do_sysseek||| do_tell||| do_trans_complex_utf8||| do_trans_complex||| do_trans_count_utf8||| do_trans_count||| do_trans_simple_utf8||| do_trans_simple||| do_trans||| do_vecget||| do_vecset||| do_vop||| docatch||| doeval||| dofile||| dofindlabel||| doform||| doing_taint||5.008001|n dooneliner||| doopen_pm||| doparseform||| dopoptoeval||| dopoptogiven||| dopoptolabel||| dopoptoloop||| dopoptosub_at||| dopoptowhen||| doref||5.009003| dounwind||| dowantarray||| dump_all||5.006000| dump_eval||5.006000| dump_exec_pos||| dump_fds||| dump_form||5.006000| dump_indent||5.006000|v dump_mstats||| dump_packsubs||5.006000| dump_sub||5.006000| dump_sv_child||| dump_trie_interim_list||| dump_trie_interim_table||| dump_trie||| dump_vindent||5.006000| dumpuntil||| dup_attrlist||| emulate_cop_io||| eval_pv|5.006000||p eval_sv|5.006000||p exec_failed||| expect_number||| fbm_compile||5.005000| fbm_instr||5.005000| fd_on_nosuid_fs||| feature_is_enabled||| filter_add||| filter_del||| filter_gets||| filter_read||| find_and_forget_pmops||| find_array_subscript||| find_beginning||| find_byclass||| find_hash_subscript||| find_in_my_stash||| find_runcv||5.008001| find_rundefsvoffset||5.009002| find_script||| find_uninit_var||| first_symbol|||n fold_constants||| forbid_setid||| force_ident||| force_list||| force_next||| force_version||| force_word||| forget_pmop||| form_nocontext|||vn form||5.004000|v fp_dup||| fprintf_nocontext|||vn free_global_struct||| free_tied_hv_pool||| free_tmps||| gen_constant_list||| get_arena||| get_aux_mg||| get_av|5.006000||p get_context||5.006000|n get_cvn_flags||5.009005| get_cv|5.006000||p get_db_sub||| get_debug_opts||| get_hash_seed||| get_hv|5.006000||p get_mstats||| get_no_modify||| get_num||| get_op_descs||5.005000| get_op_names||5.005000| get_opargs||| get_ppaddr||5.006000| get_re_arg||| get_sv|5.006000||p get_vtbl||5.005030| getcwd_sv||5.007002| getenv_len||| glob_2number||| glob_2pv||| glob_assign_glob||| glob_assign_ref||| gp_dup||| gp_free||| gp_ref||| grok_bin|5.007003||p grok_hex|5.007003||p grok_number|5.007002||p grok_numeric_radix|5.007002||p grok_oct|5.007003||p group_end||| gv_AVadd||| gv_HVadd||| gv_IOadd||| gv_SVadd||| gv_autoload4||5.004000| gv_check||| gv_const_sv||5.009003| gv_dump||5.006000| gv_efullname3||5.004000| gv_efullname4||5.006001| gv_efullname||| gv_ename||| gv_fetchfile_flags||5.009005| gv_fetchfile||| gv_fetchmeth_autoload||5.007003| gv_fetchmethod_autoload||5.004000| gv_fetchmethod||| gv_fetchmeth||| gv_fetchpvn_flags||5.009002| gv_fetchpv||| gv_fetchsv||5.009002| gv_fullname3||5.004000| gv_fullname4||5.006001| gv_fullname||| gv_get_super_pkg||| gv_handler||5.007001| gv_init_sv||| gv_init||| gv_name_set||5.009004| gv_stashpvn|5.004000||p gv_stashpvs||5.009003| gv_stashpv||| gv_stashsv||| he_dup||| hek_dup||| hfreeentries||| hsplit||| hv_assert||5.011000| hv_auxinit|||n hv_backreferences_p||| hv_clear_placeholders||5.009001| hv_clear||| hv_common_key_len||5.010000| hv_common||5.010000| hv_copy_hints_hv||| hv_delayfree_ent||5.004000| hv_delete_common||| hv_delete_ent||5.004000| hv_delete||| hv_eiter_p||5.009003| hv_eiter_set||5.009003| hv_exists_ent||5.004000| hv_exists||| hv_fetch_ent||5.004000| hv_fetchs|5.009003||p hv_fetch||| hv_free_ent||5.004000| hv_iterinit||| hv_iterkeysv||5.004000| hv_iterkey||| hv_iternext_flags||5.008000| hv_iternextsv||| hv_iternext||| hv_iterval||| hv_kill_backrefs||| hv_ksplit||5.004000| hv_magic_check|||n hv_magic||| hv_name_set||5.009003| hv_notallowed||| hv_placeholders_get||5.009003| hv_placeholders_p||5.009003| hv_placeholders_set||5.009003| hv_riter_p||5.009003| hv_riter_set||5.009003| hv_scalar||5.009001| hv_store_ent||5.004000| hv_store_flags||5.008000| hv_stores|5.009004||p hv_store||| hv_undef||| ibcmp_locale||5.004000| ibcmp_utf8||5.007003| ibcmp||| incline||| incpush_if_exists||| incpush||| ingroup||| init_argv_symbols||| init_debugger||| init_global_struct||| init_i18nl10n||5.006000| init_i18nl14n||5.006000| init_ids||| init_interp||| init_main_stash||| init_perllib||| init_postdump_symbols||| init_predump_symbols||| init_stacks||5.005000| init_tm||5.007002| instr||| intro_my||| intuit_method||| intuit_more||| invert||| io_close||| isALNUM||| isALPHA||| isDIGIT||| isLOWER||| isSPACE||| isUPPER||| is_an_int||| is_gv_magical_sv||| is_gv_magical||| is_handle_constructor|||n is_list_assignment||| is_lvalue_sub||5.007001| is_uni_alnum_lc||5.006000| is_uni_alnumc_lc||5.006000| is_uni_alnumc||5.006000| is_uni_alnum||5.006000| is_uni_alpha_lc||5.006000| is_uni_alpha||5.006000| is_uni_ascii_lc||5.006000| is_uni_ascii||5.006000| is_uni_cntrl_lc||5.006000| is_uni_cntrl||5.006000| is_uni_digit_lc||5.006000| is_uni_digit||5.006000| is_uni_graph_lc||5.006000| is_uni_graph||5.006000| is_uni_idfirst_lc||5.006000| is_uni_idfirst||5.006000| is_uni_lower_lc||5.006000| is_uni_lower||5.006000| is_uni_print_lc||5.006000| is_uni_print||5.006000| is_uni_punct_lc||5.006000| is_uni_punct||5.006000| is_uni_space_lc||5.006000| is_uni_space||5.006000| is_uni_upper_lc||5.006000| is_uni_upper||5.006000| is_uni_xdigit_lc||5.006000| is_uni_xdigit||5.006000| is_utf8_alnumc||5.006000| is_utf8_alnum||5.006000| is_utf8_alpha||5.006000| is_utf8_ascii||5.006000| is_utf8_char_slow|||n is_utf8_char||5.006000| is_utf8_cntrl||5.006000| is_utf8_common||| is_utf8_digit||5.006000| is_utf8_graph||5.006000| is_utf8_idcont||5.008000| is_utf8_idfirst||5.006000| is_utf8_lower||5.006000| is_utf8_mark||5.006000| is_utf8_print||5.006000| is_utf8_punct||5.006000| is_utf8_space||5.006000| is_utf8_string_loclen||5.009003| is_utf8_string_loc||5.008001| is_utf8_string||5.006001| is_utf8_upper||5.006000| is_utf8_xdigit||5.006000| isa_lookup||| items|||n ix|||n jmaybe||| join_exact||| keyword||| leave_scope||| lex_end||| lex_start||| linklist||| listkids||| list||| load_module_nocontext|||vn load_module|5.006000||pv localize||| looks_like_bool||| looks_like_number||| lop||| mPUSHi|5.009002||p mPUSHn|5.009002||p mPUSHp|5.009002||p mPUSHs|5.011000||p mPUSHu|5.009002||p mXPUSHi|5.009002||p mXPUSHn|5.009002||p mXPUSHp|5.009002||p mXPUSHs|5.011000||p mXPUSHu|5.009002||p mad_free||| madlex||| madparse||| magic_clear_all_env||| magic_clearenv||| magic_clearhint||| magic_clearpack||| magic_clearsig||| magic_dump||5.006000| magic_existspack||| magic_freearylen_p||| magic_freeovrld||| magic_getarylen||| magic_getdefelem||| magic_getnkeys||| magic_getpack||| magic_getpos||| magic_getsig||| magic_getsubstr||| magic_gettaint||| magic_getuvar||| magic_getvec||| magic_get||| magic_killbackrefs||| magic_len||| magic_methcall||| magic_methpack||| magic_nextpack||| magic_regdata_cnt||| magic_regdatum_get||| magic_regdatum_set||| magic_scalarpack||| magic_set_all_env||| magic_setamagic||| magic_setarylen||| magic_setcollxfrm||| magic_setdbline||| magic_setdefelem||| magic_setenv||| magic_sethint||| magic_setisa||| magic_setmglob||| magic_setnkeys||| magic_setpack||| magic_setpos||| magic_setregexp||| magic_setsig||| magic_setsubstr||| magic_settaint||| magic_setutf8||| magic_setuvar||| magic_setvec||| magic_set||| magic_sizepack||| magic_wipepack||| magicname||| make_matcher||| make_trie_failtable||| make_trie||| malloced_size|||n malloc||5.007002|n markstack_grow||| matcher_matches_sv||| measure_struct||| memEQ|5.004000||p memNE|5.004000||p mem_collxfrm||| mess_alloc||| mess_nocontext|||vn mess||5.006000|v method_common||| mfree||5.007002|n mg_clear||| mg_copy||| mg_dup||| mg_find||| mg_free||| mg_get||| mg_length||5.005000| mg_localize||| mg_magical||| mg_set||| mg_size||5.005000| mini_mktime||5.007002| missingterm||| mode_from_discipline||| modkids||| mod||| more_bodies||| more_sv||| moreswitches||| mro_get_linear_isa_c3||| mro_get_linear_isa_dfs||| mro_get_linear_isa||5.009005| mro_isa_changed_in||| mro_meta_dup||| mro_meta_init||| mro_method_changed_in||5.009005| mul128||| mulexp10|||n my_atof2||5.007002| my_atof||5.006000| my_attrs||| my_bcopy|||n my_betoh16|||n my_betoh32|||n my_betoh64|||n my_betohi|||n my_betohl|||n my_betohs|||n my_bzero|||n my_chsize||| my_clearenv||| my_cxt_index||| my_cxt_init||| my_dirfd||5.009005| my_exit_jump||| my_exit||| my_failure_exit||5.004000| my_fflush_all||5.006000| my_fork||5.007003|n my_htobe16|||n my_htobe32|||n my_htobe64|||n my_htobei|||n my_htobel|||n my_htobes|||n my_htole16|||n my_htole32|||n my_htole64|||n my_htolei|||n my_htolel|||n my_htoles|||n my_htonl||| my_kid||| my_letoh16|||n my_letoh32|||n my_letoh64|||n my_letohi|||n my_letohl|||n my_letohs|||n my_lstat||| my_memcmp||5.004000|n my_memset|||n my_ntohl||| my_pclose||5.004000| my_popen_list||5.007001| my_popen||5.004000| my_setenv||| my_snprintf|5.009004||pvn my_socketpair||5.007003|n my_sprintf||5.009003|vn my_stat||| my_strftime||5.007002| my_strlcat|5.009004||pn my_strlcpy|5.009004||pn my_swabn|||n my_swap||| my_unexec||| my_vsnprintf||5.009004|n my||| need_utf8|||n newANONATTRSUB||5.006000| newANONHASH||| newANONLIST||| newANONSUB||| newASSIGNOP||| newATTRSUB||5.006000| newAVREF||| newAV||| newBINOP||| newCONDOP||| newCONSTSUB|5.004050||p newCVREF||| newDEFSVOP||| newFORM||| newFOROP||| newGIVENOP||5.009003| newGIVWHENOP||| newGP||| newGVOP||| newGVREF||| newGVgen||| newHVREF||| newHVhv||5.005000| newHV||| newIO||| newLISTOP||| newLOGOP||| newLOOPEX||| newLOOPOP||| newMADPROP||| newMADsv||| newMYSUB||| newNULLLIST||| newOP||| newPADOP||| newPMOP||| newPROG||| newPVOP||| newRANGE||| newRV_inc|5.004000||p newRV_noinc|5.004000||p newRV||| newSLICEOP||| newSTATEOP||| newSUB||| newSVOP||| newSVREF||| newSV_type||5.009005| newSVhek||5.009003| newSViv||| newSVnv||| newSVpvf_nocontext|||vn newSVpvf||5.004000|v newSVpvn_flags|5.011000||p newSVpvn_share|5.007001||p newSVpvn_utf8|5.011000||p newSVpvn|5.004050||p newSVpvs_flags|5.011000||p newSVpvs_share||5.009003| newSVpvs|5.009003||p newSVpv||| newSVrv||| newSVsv||| newSVuv|5.006000||p newSV||| newTOKEN||| newUNOP||| newWHENOP||5.009003| newWHILEOP||5.009003| newXS_flags||5.009004| newXSproto||5.006000| newXS||5.006000| new_collate||5.006000| new_constant||| new_ctype||5.006000| new_he||| new_logop||| new_numeric||5.006000| new_stackinfo||5.005000| new_version||5.009000| new_warnings_bitfield||| next_symbol||| nextargv||| nextchar||| ninstr||| no_bareword_allowed||| no_fh_allowed||| no_op||| not_a_number||| nothreadhook||5.008000| nuke_stacks||| num_overflow|||n offer_nice_chunk||| oopsAV||| oopsCV||| oopsHV||| op_clear||| op_const_sv||| op_dump||5.006000| op_free||| op_getmad_weak||| op_getmad||| op_null||5.007002| op_refcnt_dec||| op_refcnt_inc||| op_refcnt_lock||5.009002| op_refcnt_unlock||5.009002| op_xmldump||| open_script||| pMY_CXT_|5.007003||p pMY_CXT|5.007003||p pTHX_|5.006000||p pTHX|5.006000||p packWARN|5.007003||p pack_cat||5.007003| pack_rec||| package||| packlist||5.008001| pad_add_anon||| pad_add_name||| pad_alloc||| pad_block_start||| pad_check_dup||| pad_compname_type||| pad_findlex||| pad_findmy||| pad_fixup_inner_anons||| pad_free||| pad_leavemy||| pad_new||| pad_peg|||n pad_push||| pad_reset||| pad_setsv||| pad_sv||5.011000| pad_swipe||| pad_tidy||| pad_undef||| parse_body||| parse_unicode_opts||| parser_dup||| parser_free||| path_is_absolute|||n peep||| pending_Slabs_to_ro||| perl_alloc_using|||n perl_alloc|||n perl_clone_using|||n perl_clone|||n perl_construct|||n perl_destruct||5.007003|n perl_free|||n perl_parse||5.006000|n perl_run|||n pidgone||| pm_description||| pmflag||| pmop_dump||5.006000| pmop_xmldump||| pmruntime||| pmtrans||| pop_scope||| pregcomp||5.009005| pregexec||| pregfree2||5.011000| pregfree||| prepend_elem||| prepend_madprops||| printbuf||| printf_nocontext|||vn process_special_blocks||| ptr_table_clear||5.009005| ptr_table_fetch||5.009005| ptr_table_find|||n ptr_table_free||5.009005| ptr_table_new||5.009005| ptr_table_split||5.009005| ptr_table_store||5.009005| push_scope||| put_byte||| pv_display||5.006000| pv_escape||5.009004| pv_pretty||5.009004| pv_uni_display||5.007003| qerror||| qsortsvu||| re_compile||5.009005| re_croak2||| re_dup_guts||| re_intuit_start||5.009005| re_intuit_string||5.006000| readpipe_override||| realloc||5.007002|n reentrant_free||| reentrant_init||| reentrant_retry|||vn reentrant_size||| ref_array_or_hash||| refcounted_he_chain_2hv||| refcounted_he_fetch||| refcounted_he_free||| refcounted_he_new||| refcounted_he_value||| refkids||| refto||| ref||5.011000| reg_check_named_buff_matched||| reg_named_buff_all||5.009005| reg_named_buff_exists||5.009005| reg_named_buff_fetch||5.009005| reg_named_buff_firstkey||5.009005| reg_named_buff_iter||| reg_named_buff_nextkey||5.009005| reg_named_buff_scalar||5.009005| reg_named_buff||| reg_namedseq||| reg_node||| reg_numbered_buff_fetch||| reg_numbered_buff_length||| reg_numbered_buff_store||| reg_qr_package||| reg_recode||| reg_scan_name||| reg_skipcomment||| reg_stringify||5.009005| reg_temp_copy||| reganode||| regatom||| regbranch||| regclass_swash||5.009004| regclass||| regcppop||| regcppush||| regcurly|||n regdump_extflags||| regdump||5.005000| regdupe_internal||| regexec_flags||5.005000| regfree_internal||5.009005| reghop3|||n reghop4|||n reghopmaybe3|||n reginclass||| reginitcolors||5.006000| reginsert||| regmatch||| regnext||5.005000| regpiece||| regpposixcc||| regprop||| regrepeat||| regtail_study||| regtail||| regtry||| reguni||| regwhite|||n reg||| repeatcpy||| report_evil_fh||| report_uninit||| require_pv||5.006000| require_tie_mod||| restore_magic||| rninstr||| rsignal_restore||| rsignal_save||| rsignal_state||5.004000| rsignal||5.004000| run_body||| run_user_filter||| runops_debug||5.005000| runops_standard||5.005000| rvpv_dup||| rxres_free||| rxres_restore||| rxres_save||| safesyscalloc||5.006000|n safesysfree||5.006000|n safesysmalloc||5.006000|n safesysrealloc||5.006000|n same_dirent||| save_I16||5.004000| save_I32||| save_I8||5.006000| save_aelem||5.004050| save_alloc||5.006000| save_aptr||| save_ary||| save_bool||5.008001| save_clearsv||| save_delete||| save_destructor_x||5.006000| save_destructor||5.006000| save_freeop||| save_freepv||| save_freesv||| save_generic_pvref||5.006001| save_generic_svref||5.005030| save_gp||5.004000| save_hash||| save_hek_flags|||n save_helem||5.004050| save_hptr||| save_int||| save_item||| save_iv||5.005000| save_lines||| save_list||| save_long||| save_magic||| save_mortalizesv||5.007001| save_nogv||| save_op||| save_padsv||5.007001| save_pptr||| save_re_context||5.006000| save_scalar_at||| save_scalar||| save_set_svflags||5.009000| save_shared_pvref||5.007003| save_sptr||| save_svref||| save_vptr||5.006000| savepvn||| savepvs||5.009003| savepv||| savesharedpvn||5.009005| savesharedpv||5.007003| savestack_grow_cnt||5.008001| savestack_grow||| savesvpv||5.009002| sawparens||| scalar_mod_type|||n scalarboolean||| scalarkids||| scalarseq||| scalarvoid||| scalar||| scan_bin||5.006000| scan_commit||| scan_const||| scan_formline||| scan_heredoc||| scan_hex||| scan_ident||| scan_inputsymbol||| scan_num||5.007001| scan_oct||| scan_pat||| scan_str||| scan_subst||| scan_trans||| scan_version||5.009001| scan_vstring||5.009005| scan_word||| scope||| screaminstr||5.005000| seed||5.008001| sequence_num||| sequence_tail||| sequence||| set_context||5.006000|n set_numeric_local||5.006000| set_numeric_radix||5.006000| set_numeric_standard||5.006000| setdefout||| setenv_getix||| share_hek_flags||| share_hek||5.004000| si_dup||| sighandler|||n simplify_sort||| skipspace0||| skipspace1||| skipspace2||| skipspace||| softref2xv||| sortcv_stacked||| sortcv_xsub||| sortcv||| sortsv_flags||5.009003| sortsv||5.007003| space_join_names_mortal||| ss_dup||| stack_grow||| start_force||| start_glob||| start_subparse||5.004000| stashpv_hvname_match||5.011000| stdize_locale||| strEQ||| strGE||| strGT||| strLE||| strLT||| strNE||| str_to_version||5.006000| strip_return||| strnEQ||| strnNE||| study_chunk||| sub_crush_depth||| sublex_done||| sublex_push||| sublex_start||| sv_2bool||| sv_2cv||| sv_2io||| sv_2iuv_common||| sv_2iuv_non_preserve||| sv_2iv_flags||5.009001| sv_2iv||| sv_2mortal||| sv_2num||| sv_2nv||| sv_2pv_flags|5.007002||p sv_2pv_nolen|5.006000||p sv_2pvbyte_nolen|5.006000||p sv_2pvbyte|5.006000||p sv_2pvutf8_nolen||5.006000| sv_2pvutf8||5.006000| sv_2pv||| sv_2uv_flags||5.009001| sv_2uv|5.004000||p sv_add_arena||| sv_add_backref||| sv_backoff||| sv_bless||| sv_cat_decode||5.008001| sv_catpv_mg|5.004050||p sv_catpvf_mg_nocontext|||pvn sv_catpvf_mg|5.006000|5.004000|pv sv_catpvf_nocontext|||vn sv_catpvf||5.004000|v sv_catpvn_flags||5.007002| sv_catpvn_mg|5.004050||p sv_catpvn_nomg|5.007002||p sv_catpvn||| sv_catpvs|5.009003||p sv_catpv||| sv_catsv_flags||5.007002| sv_catsv_mg|5.004050||p sv_catsv_nomg|5.007002||p sv_catsv||| sv_catxmlpvn||| sv_catxmlsv||| sv_chop||| sv_clean_all||| sv_clean_objs||| sv_clear||| sv_cmp_locale||5.004000| sv_cmp||| sv_collxfrm||| sv_compile_2op||5.008001| sv_copypv||5.007003| sv_dec||| sv_del_backref||| sv_derived_from||5.004000| sv_destroyable||5.010000| sv_does||5.009004| sv_dump||| sv_dup||| sv_eq||| sv_exp_grow||| sv_force_normal_flags||5.007001| sv_force_normal||5.006000| sv_free2||| sv_free_arenas||| sv_free||| sv_gets||5.004000| sv_grow||| sv_i_ncmp||| sv_inc||| sv_insert||| sv_isa||| sv_isobject||| sv_iv||5.005000| sv_kill_backrefs||| sv_len_utf8||5.006000| sv_len||| sv_magic_portable|5.011000|5.004000|p sv_magicext||5.007003| sv_magic||| sv_mortalcopy||| sv_ncmp||| sv_newmortal||| sv_newref||| sv_nolocking||5.007003| sv_nosharing||5.007003| sv_nounlocking||| sv_nv||5.005000| sv_peek||5.005000| sv_pos_b2u_midway||| sv_pos_b2u||5.006000| sv_pos_u2b_cached||| sv_pos_u2b_forwards|||n sv_pos_u2b_midway|||n sv_pos_u2b||5.006000| sv_pvbyten_force||5.006000| sv_pvbyten||5.006000| sv_pvbyte||5.006000| sv_pvn_force_flags|5.007002||p sv_pvn_force||| sv_pvn_nomg|5.007003|5.005000|p sv_pvn||5.005000| sv_pvutf8n_force||5.006000| sv_pvutf8n||5.006000| sv_pvutf8||5.006000| sv_pv||5.006000| sv_recode_to_utf8||5.007003| sv_reftype||| sv_release_COW||| sv_replace||| sv_report_used||| sv_reset||| sv_rvweaken||5.006000| sv_setiv_mg|5.004050||p sv_setiv||| sv_setnv_mg|5.006000||p sv_setnv||| sv_setpv_mg|5.004050||p sv_setpvf_mg_nocontext|||pvn sv_setpvf_mg|5.006000|5.004000|pv sv_setpvf_nocontext|||vn sv_setpvf||5.004000|v sv_setpviv_mg||5.008001| sv_setpviv||5.008001| sv_setpvn_mg|5.004050||p sv_setpvn||| sv_setpvs|5.009004||p sv_setpv||| sv_setref_iv||| sv_setref_nv||| sv_setref_pvn||| sv_setref_pv||| sv_setref_uv||5.007001| sv_setsv_cow||| sv_setsv_flags||5.007002| sv_setsv_mg|5.004050||p sv_setsv_nomg|5.007002||p sv_setsv||| sv_setuv_mg|5.004050||p sv_setuv|5.004000||p sv_tainted||5.004000| sv_taint||5.004000| sv_true||5.005000| sv_unglob||| sv_uni_display||5.007003| sv_unmagic||| sv_unref_flags||5.007001| sv_unref||| sv_untaint||5.004000| sv_upgrade||| sv_usepvn_flags||5.009004| sv_usepvn_mg|5.004050||p sv_usepvn||| sv_utf8_decode||5.006000| sv_utf8_downgrade||5.006000| sv_utf8_encode||5.006000| sv_utf8_upgrade_flags||5.007002| sv_utf8_upgrade||5.007001| sv_uv|5.005000||p sv_vcatpvf_mg|5.006000|5.004000|p sv_vcatpvfn||5.004000| sv_vcatpvf|5.006000|5.004000|p sv_vsetpvf_mg|5.006000|5.004000|p sv_vsetpvfn||5.004000| sv_vsetpvf|5.006000|5.004000|p sv_xmlpeek||| svtype||| swallow_bom||| swap_match_buff||| swash_fetch||5.007002| swash_get||| swash_init||5.006000| sys_init3||5.010000|n sys_init||5.010000|n sys_intern_clear||| sys_intern_dup||| sys_intern_init||| sys_term||5.010000|n taint_env||| taint_proper||| tmps_grow||5.006000| toLOWER||| toUPPER||| to_byte_substr||| to_uni_fold||5.007003| to_uni_lower_lc||5.006000| to_uni_lower||5.007003| to_uni_title_lc||5.006000| to_uni_title||5.007003| to_uni_upper_lc||5.006000| to_uni_upper||5.007003| to_utf8_case||5.007003| to_utf8_fold||5.007003| to_utf8_lower||5.007003| to_utf8_substr||| to_utf8_title||5.007003| to_utf8_upper||5.007003| token_free||| token_getmad||| tokenize_use||| tokeq||| tokereport||| too_few_arguments||| too_many_arguments||| uiv_2buf|||n unlnk||| unpack_rec||| unpack_str||5.007003| unpackstring||5.008001| unshare_hek_or_pvn||| unshare_hek||| unsharepvn||5.004000| unwind_handler_stack||| update_debugger_info||| upg_version||5.009005| usage||| utf16_to_utf8_reversed||5.006001| utf16_to_utf8||5.006001| utf8_distance||5.006000| utf8_hop||5.006000| utf8_length||5.007001| utf8_mg_pos_cache_update||| utf8_to_bytes||5.006001| utf8_to_uvchr||5.007001| utf8_to_uvuni||5.007001| utf8n_to_uvchr||| utf8n_to_uvuni||5.007001| utilize||| uvchr_to_utf8_flags||5.007003| uvchr_to_utf8||| uvuni_to_utf8_flags||5.007003| uvuni_to_utf8||5.007001| validate_suid||| varname||| vcmp||5.009000| vcroak||5.006000| vdeb||5.007003| vdie_common||| vdie_croak_common||| vdie||| vform||5.006000| visit||| vivify_defelem||| vivify_ref||| vload_module|5.006000||p vmess||5.006000| vnewSVpvf|5.006000|5.004000|p vnormal||5.009002| vnumify||5.009000| vstringify||5.009000| vverify||5.009003| vwarner||5.006000| vwarn||5.006000| wait4pid||| warn_nocontext|||vn warner_nocontext|||vn warner|5.006000|5.004000|pv warn|||v watch||| whichsig||| write_no_mem||| write_to_stderr||| xmldump_all||| xmldump_attr||| xmldump_eval||| xmldump_form||| xmldump_indent|||v xmldump_packsubs||| xmldump_sub||| xmldump_vindent||| yyerror||| yylex||| yyparse||| yywarn||| ); if (exists $opt{'list-unsupported'}) { my $f; for $f (sort { lc $a cmp lc $b } keys %API) { next unless $API{$f}{todo}; print "$f ", '.'x(40-length($f)), " ", format_version($API{$f}{todo}), "\n"; } exit 0; } # Scan for possible replacement candidates my(%replace, %need, %hints, %warnings, %depends); my $replace = 0; my($hint, $define, $function); sub find_api { my $code = shift; $code =~ s{ / (?: \*[^*]*\*+(?:[^$ccs][^*]*\*+)* / | /[^\r\n]*) | "[^"\\]*(?:\\.[^"\\]*)*" | '[^'\\]*(?:\\.[^'\\]*)*' }{}egsx; grep { exists $API{$_} } $code =~ /(\w+)/mg; } while () { if ($hint) { my $h = $hint->[0] eq 'Hint' ? \%hints : \%warnings; if (m{^\s*\*\s(.*?)\s*$}) { for (@{$hint->[1]}) { $h->{$_} ||= ''; # suppress warning with older perls $h->{$_} .= "$1\n"; } } else { undef $hint } } $hint = [$1, [split /,?\s+/, $2]] if m{^\s*$rccs\s+(Hint|Warning):\s+(\w+(?:,?\s+\w+)*)\s*$}; if ($define) { if ($define->[1] =~ /\\$/) { $define->[1] .= $_; } else { if (exists $API{$define->[0]} && $define->[1] !~ /^DPPP_\(/) { my @n = find_api($define->[1]); push @{$depends{$define->[0]}}, @n if @n } undef $define; } } $define = [$1, $2] if m{^\s*#\s*define\s+(\w+)(?:\([^)]*\))?\s+(.*)}; if ($function) { if (/^}/) { if (exists $API{$function->[0]}) { my @n = find_api($function->[1]); push @{$depends{$function->[0]}}, @n if @n } undef $function; } else { $function->[1] .= $_; } } $function = [$1, ''] if m{^DPPP_\(my_(\w+)\)}; $replace = $1 if m{^\s*$rccs\s+Replace:\s+(\d+)\s+$rcce\s*$}; $replace{$2} = $1 if $replace and m{^\s*#\s*define\s+(\w+)(?:\([^)]*\))?\s+(\w+)}; $replace{$2} = $1 if m{^\s*#\s*define\s+(\w+)(?:\([^)]*\))?\s+(\w+).*$rccs\s+Replace\s+$rcce}; $replace{$1} = $2 if m{^\s*$rccs\s+Replace (\w+) with (\w+)\s+$rcce\s*$}; if (m{^\s*$rccs\s+(\w+)\s+depends\s+on\s+(\w+(\s*,\s*\w+)*)\s+$rcce\s*$}) { push @{$depends{$1}}, map { s/\s+//g; $_ } split /,/, $2; } $need{$1} = 1 if m{^#if\s+defined\(NEED_(\w+)(?:_GLOBAL)?\)}; } for (values %depends) { my %s; $_ = [sort grep !$s{$_}++, @$_]; } if (exists $opt{'api-info'}) { my $f; my $count = 0; my $match = $opt{'api-info'} =~ m!^/(.*)/$! ? $1 : "^\Q$opt{'api-info'}\E\$"; for $f (sort { lc $a cmp lc $b } keys %API) { next unless $f =~ /$match/; print "\n=== $f ===\n\n"; my $info = 0; if ($API{$f}{base} || $API{$f}{todo}) { my $base = format_version($API{$f}{base} || $API{$f}{todo}); print "Supported at least starting from perl-$base.\n"; $info++; } if ($API{$f}{provided}) { my $todo = $API{$f}{todo} ? format_version($API{$f}{todo}) : "5.003"; print "Support by $ppport provided back to perl-$todo.\n"; print "Support needs to be explicitly requested by NEED_$f.\n" if exists $need{$f}; print "Depends on: ", join(', ', @{$depends{$f}}), ".\n" if exists $depends{$f}; print "\n$hints{$f}" if exists $hints{$f}; print "\nWARNING:\n$warnings{$f}" if exists $warnings{$f}; $info++; } print "No portability information available.\n" unless $info; $count++; } $count or print "Found no API matching '$opt{'api-info'}'."; print "\n"; exit 0; } if (exists $opt{'list-provided'}) { my $f; for $f (sort { lc $a cmp lc $b } keys %API) { next unless $API{$f}{provided}; my @flags; push @flags, 'explicit' if exists $need{$f}; push @flags, 'depend' if exists $depends{$f}; push @flags, 'hint' if exists $hints{$f}; push @flags, 'warning' if exists $warnings{$f}; my $flags = @flags ? ' ['.join(', ', @flags).']' : ''; print "$f$flags\n"; } exit 0; } my @files; my @srcext = qw( .xs .c .h .cc .cpp -c.inc -xs.inc ); my $srcext = join '|', map { quotemeta $_ } @srcext; if (@ARGV) { my %seen; for (@ARGV) { if (-e) { if (-f) { push @files, $_ unless $seen{$_}++; } else { warn "'$_' is not a file.\n" } } else { my @new = grep { -f } glob $_ or warn "'$_' does not exist.\n"; push @files, grep { !$seen{$_}++ } @new; } } } else { eval { require File::Find; File::Find::find(sub { $File::Find::name =~ /($srcext)$/i and push @files, $File::Find::name; }, '.'); }; if ($@) { @files = map { glob "*$_" } @srcext; } } if (!@ARGV || $opt{filter}) { my(@in, @out); my %xsc = map { /(.*)\.xs$/ ? ("$1.c" => 1, "$1.cc" => 1) : () } @files; for (@files) { my $out = exists $xsc{$_} || /\b\Q$ppport\E$/i || !/($srcext)$/i; push @{ $out ? \@out : \@in }, $_; } if (@ARGV && @out) { warning("Skipping the following files (use --nofilter to avoid this):\n| ", join "\n| ", @out); } @files = @in; } die "No input files given!\n" unless @files; my(%files, %global, %revreplace); %revreplace = reverse %replace; my $filename; my $patch_opened = 0; for $filename (@files) { unless (open IN, "<$filename") { warn "Unable to read from $filename: $!\n"; next; } info("Scanning $filename ..."); my $c = do { local $/; }; close IN; my %file = (orig => $c, changes => 0); # Temporarily remove C/XS comments and strings from the code my @ccom; $c =~ s{ ( ^$HS*\#$HS*include\b[^\r\n]+\b(?:\Q$ppport\E|XSUB\.h)\b[^\r\n]* | ^$HS*\#$HS*(?:define|elif|if(?:def)?)\b[^\r\n]* ) | ( ^$HS*\#[^\r\n]* | "[^"\\]*(?:\\.[^"\\]*)*" | '[^'\\]*(?:\\.[^'\\]*)*' | / (?: \*[^*]*\*+(?:[^$ccs][^*]*\*+)* / | /[^\r\n]* ) ) }{ defined $2 and push @ccom, $2; defined $1 ? $1 : "$ccs$#ccom$cce" }mgsex; $file{ccom} = \@ccom; $file{code} = $c; $file{has_inc_ppport} = $c =~ /^$HS*#$HS*include[^\r\n]+\b\Q$ppport\E\b/m; my $func; for $func (keys %API) { my $match = $func; $match .= "|$revreplace{$func}" if exists $revreplace{$func}; if ($c =~ /\b(?:Perl_)?($match)\b/) { $file{uses_replace}{$1}++ if exists $revreplace{$func} && $1 eq $revreplace{$func}; $file{uses_Perl}{$func}++ if $c =~ /\bPerl_$func\b/; if (exists $API{$func}{provided}) { $file{uses_provided}{$func}++; if (!exists $API{$func}{base} || $API{$func}{base} > $opt{'compat-version'}) { $file{uses}{$func}++; my @deps = rec_depend($func); if (@deps) { $file{uses_deps}{$func} = \@deps; for (@deps) { $file{uses}{$_} = 0 unless exists $file{uses}{$_}; } } for ($func, @deps) { $file{needs}{$_} = 'static' if exists $need{$_}; } } } if (exists $API{$func}{todo} && $API{$func}{todo} > $opt{'compat-version'}) { if ($c =~ /\b$func\b/) { $file{uses_todo}{$func}++; } } } } while ($c =~ /^$HS*#$HS*define$HS+(NEED_(\w+?)(_GLOBAL)?)\b/mg) { if (exists $need{$2}) { $file{defined $3 ? 'needed_global' : 'needed_static'}{$2}++; } else { warning("Possibly wrong #define $1 in $filename") } } for (qw(uses needs uses_todo needed_global needed_static)) { for $func (keys %{$file{$_}}) { push @{$global{$_}{$func}}, $filename; } } $files{$filename} = \%file; } # Globally resolve NEED_'s my $need; for $need (keys %{$global{needs}}) { if (@{$global{needs}{$need}} > 1) { my @targets = @{$global{needs}{$need}}; my @t = grep $files{$_}{needed_global}{$need}, @targets; @targets = @t if @t; @t = grep /\.xs$/i, @targets; @targets = @t if @t; my $target = shift @targets; $files{$target}{needs}{$need} = 'global'; for (@{$global{needs}{$need}}) { $files{$_}{needs}{$need} = 'extern' if $_ ne $target; } } } for $filename (@files) { exists $files{$filename} or next; info("=== Analyzing $filename ==="); my %file = %{$files{$filename}}; my $func; my $c = $file{code}; my $warnings = 0; for $func (sort keys %{$file{uses_Perl}}) { if ($API{$func}{varargs}) { unless ($API{$func}{nothxarg}) { my $changes = ($c =~ s{\b(Perl_$func\s*\(\s*)(?!aTHX_?)(\)|[^\s)]*\))} { $1 . ($2 eq ')' ? 'aTHX' : 'aTHX_ ') . $2 }ge); if ($changes) { warning("Doesn't pass interpreter argument aTHX to Perl_$func"); $file{changes} += $changes; } } } else { warning("Uses Perl_$func instead of $func"); $file{changes} += ($c =~ s{\bPerl_$func(\s*)\((\s*aTHX_?)?\s*} {$func$1(}g); } } for $func (sort keys %{$file{uses_replace}}) { warning("Uses $func instead of $replace{$func}"); $file{changes} += ($c =~ s/\b$func\b/$replace{$func}/g); } for $func (sort keys %{$file{uses_provided}}) { if ($file{uses}{$func}) { if (exists $file{uses_deps}{$func}) { diag("Uses $func, which depends on ", join(', ', @{$file{uses_deps}{$func}})); } else { diag("Uses $func"); } } $warnings += hint($func); } unless ($opt{quiet}) { for $func (sort keys %{$file{uses_todo}}) { print "*** WARNING: Uses $func, which may not be portable below perl ", format_version($API{$func}{todo}), ", even with '$ppport'\n"; $warnings++; } } for $func (sort keys %{$file{needed_static}}) { my $message = ''; if (not exists $file{uses}{$func}) { $message = "No need to define NEED_$func if $func is never used"; } elsif (exists $file{needs}{$func} && $file{needs}{$func} ne 'static') { $message = "No need to define NEED_$func when already needed globally"; } if ($message) { diag($message); $file{changes} += ($c =~ s/^$HS*#$HS*define$HS+NEED_$func\b.*$LF//mg); } } for $func (sort keys %{$file{needed_global}}) { my $message = ''; if (not exists $global{uses}{$func}) { $message = "No need to define NEED_${func}_GLOBAL if $func is never used"; } elsif (exists $file{needs}{$func}) { if ($file{needs}{$func} eq 'extern') { $message = "No need to define NEED_${func}_GLOBAL when already needed globally"; } elsif ($file{needs}{$func} eq 'static') { $message = "No need to define NEED_${func}_GLOBAL when only used in this file"; } } if ($message) { diag($message); $file{changes} += ($c =~ s/^$HS*#$HS*define$HS+NEED_${func}_GLOBAL\b.*$LF//mg); } } $file{needs_inc_ppport} = keys %{$file{uses}}; if ($file{needs_inc_ppport}) { my $pp = ''; for $func (sort keys %{$file{needs}}) { my $type = $file{needs}{$func}; next if $type eq 'extern'; my $suffix = $type eq 'global' ? '_GLOBAL' : ''; unless (exists $file{"needed_$type"}{$func}) { if ($type eq 'global') { diag("Files [@{$global{needs}{$func}}] need $func, adding global request"); } else { diag("File needs $func, adding static request"); } $pp .= "#define NEED_$func$suffix\n"; } } if ($pp && ($c =~ s/^(?=$HS*#$HS*define$HS+NEED_\w+)/$pp/m)) { $pp = ''; $file{changes}++; } unless ($file{has_inc_ppport}) { diag("Needs to include '$ppport'"); $pp .= qq(#include "$ppport"\n) } if ($pp) { $file{changes} += ($c =~ s/^($HS*#$HS*define$HS+NEED_\w+.*?)^/$1$pp/ms) || ($c =~ s/^(?=$HS*#$HS*include.*\Q$ppport\E)/$pp/m) || ($c =~ s/^($HS*#$HS*include.*XSUB.*\s*?)^/$1$pp/m) || ($c =~ s/^/$pp/); } } else { if ($file{has_inc_ppport}) { diag("No need to include '$ppport'"); $file{changes} += ($c =~ s/^$HS*?#$HS*include.*\Q$ppport\E.*?$LF//m); } } # put back in our C comments my $ix; my $cppc = 0; my @ccom = @{$file{ccom}}; for $ix (0 .. $#ccom) { if (!$opt{cplusplus} && $ccom[$ix] =~ s!^//!!) { $cppc++; $file{changes} += $c =~ s/$rccs$ix$rcce/$ccs$ccom[$ix] $cce/; } else { $c =~ s/$rccs$ix$rcce/$ccom[$ix]/; } } if ($cppc) { my $s = $cppc != 1 ? 's' : ''; warning("Uses $cppc C++ style comment$s, which is not portable"); } my $s = $warnings != 1 ? 's' : ''; my $warn = $warnings ? " ($warnings warning$s)" : ''; info("Analysis completed$warn"); if ($file{changes}) { if (exists $opt{copy}) { my $newfile = "$filename$opt{copy}"; if (-e $newfile) { error("'$newfile' already exists, refusing to write copy of '$filename'"); } else { local *F; if (open F, ">$newfile") { info("Writing copy of '$filename' with changes to '$newfile'"); print F $c; close F; } else { error("Cannot open '$newfile' for writing: $!"); } } } elsif (exists $opt{patch} || $opt{changes}) { if (exists $opt{patch}) { unless ($patch_opened) { if (open PATCH, ">$opt{patch}") { $patch_opened = 1; } else { error("Cannot open '$opt{patch}' for writing: $!"); delete $opt{patch}; $opt{changes} = 1; goto fallback; } } mydiff(\*PATCH, $filename, $c); } else { fallback: info("Suggested changes:"); mydiff(\*STDOUT, $filename, $c); } } else { my $s = $file{changes} == 1 ? '' : 's'; info("$file{changes} potentially required change$s detected"); } } else { info("Looks good"); } } close PATCH if $patch_opened; exit 0; sub try_use { eval "use @_;"; return $@ eq '' } sub mydiff { local *F = shift; my($file, $str) = @_; my $diff; if (exists $opt{diff}) { $diff = run_diff($opt{diff}, $file, $str); } if (!defined $diff and try_use('Text::Diff')) { $diff = Text::Diff::diff($file, \$str, { STYLE => 'Unified' }); $diff = <
$tmp") { print F $str; close F; if (open F, "$prog $file $tmp |") { while () { s/\Q$tmp\E/$file.patched/; $diff .= $_; } close F; unlink $tmp; return $diff; } unlink $tmp; } else { error("Cannot open '$tmp' for writing: $!"); } return undef; } sub rec_depend { my($func, $seen) = @_; return () unless exists $depends{$func}; $seen = {%{$seen||{}}}; return () if $seen->{$func}++; my %s; grep !$s{$_}++, map { ($_, rec_depend($_, $seen)) } @{$depends{$func}}; } sub parse_version { my $ver = shift; if ($ver =~ /^(\d+)\.(\d+)\.(\d+)$/) { return ($1, $2, $3); } elsif ($ver !~ /^\d+\.[\d_]+$/) { die "cannot parse version '$ver'\n"; } $ver =~ s/_//g; $ver =~ s/$/000000/; my($r,$v,$s) = $ver =~ /(\d+)\.(\d{3})(\d{3})/; $v = int $v; $s = int $s; if ($r < 5 || ($r == 5 && $v < 6)) { if ($s % 10) { die "cannot parse version '$ver'\n"; } } return ($r, $v, $s); } sub format_version { my $ver = shift; $ver =~ s/$/000000/; my($r,$v,$s) = $ver =~ /(\d+)\.(\d{3})(\d{3})/; $v = int $v; $s = int $s; if ($r < 5 || ($r == 5 && $v < 6)) { if ($s % 10) { die "invalid version '$ver'\n"; } $s /= 10; $ver = sprintf "%d.%03d", $r, $v; $s > 0 and $ver .= sprintf "_%02d", $s; return $ver; } return sprintf "%d.%d.%d", $r, $v, $s; } sub info { $opt{quiet} and return; print @_, "\n"; } sub diag { $opt{quiet} and return; $opt{diag} and print @_, "\n"; } sub warning { $opt{quiet} and return; print "*** ", @_, "\n"; } sub error { print "*** ERROR: ", @_, "\n"; } my %given_hints; my %given_warnings; sub hint { $opt{quiet} and return; my $func = shift; my $rv = 0; if (exists $warnings{$func} && !$given_warnings{$func}++) { my $warn = $warnings{$func}; $warn =~ s!^!*** !mg; print "*** WARNING: $func\n", $warn; $rv++; } if ($opt{hints} && exists $hints{$func} && !$given_hints{$func}++) { my $hint = $hints{$func}; $hint =~ s/^/ /mg; print " --- hint for $func ---\n", $hint; } $rv; } sub usage { my($usage) = do { local(@ARGV,$/)=($0); <> } =~ /^=head\d$HS+SYNOPSIS\s*^(.*?)\s*^=/ms; my %M = ( 'I' => '*' ); $usage =~ s/^\s*perl\s+\S+/$^X $0/; $usage =~ s/([A-Z])<([^>]+)>/$M{$1}$2$M{$1}/g; print < }; my($copy) = $self =~ /^=head\d\s+COPYRIGHT\s*^(.*?)^=\w+/ms; $copy =~ s/^(?=\S+)/ /gms; $self =~ s/^$HS+Do NOT edit.*?(?=^-)/$copy/ms; $self =~ s/^SKIP.*(?=^__DATA__)/SKIP if (\@ARGV && \$ARGV[0] eq '--unstrip') { eval { require Devel::PPPort }; \$@ and die "Cannot require Devel::PPPort, please install.\\n"; if (\$Devel::PPPort::VERSION < $VERSION) { die "$0 was originally generated with Devel::PPPort $VERSION.\\n" . "Your Devel::PPPort is only version \$Devel::PPPort::VERSION.\\n" . "Please install a newer version, or --unstrip will not work.\\n"; } Devel::PPPort::WriteFile(\$0); exit 0; } print <$0" or die "cannot strip $0: $!\n"; print OUT "$pl$c\n"; exit 0; } __DATA__ */ #ifndef _P_P_PORTABILITY_H_ #define _P_P_PORTABILITY_H_ #ifndef DPPP_NAMESPACE # define DPPP_NAMESPACE DPPP_ #endif #define DPPP_CAT2(x,y) CAT2(x,y) #define DPPP_(name) DPPP_CAT2(DPPP_NAMESPACE, name) #ifndef PERL_REVISION # if !defined(__PATCHLEVEL_H_INCLUDED__) && !(defined(PATCHLEVEL) && defined(SUBVERSION)) # define PERL_PATCHLEVEL_H_IMPLICIT # include # endif # if !(defined(PERL_VERSION) || (defined(SUBVERSION) && defined(PATCHLEVEL))) # include # endif # ifndef PERL_REVISION # define PERL_REVISION (5) /* Replace: 1 */ # define PERL_VERSION PATCHLEVEL # define PERL_SUBVERSION SUBVERSION /* Replace PERL_PATCHLEVEL with PERL_VERSION */ /* Replace: 0 */ # endif #endif #define _dpppDEC2BCD(dec) ((((dec)/100)<<8)|((((dec)%100)/10)<<4)|((dec)%10)) #define PERL_BCDVERSION ((_dpppDEC2BCD(PERL_REVISION)<<24)|(_dpppDEC2BCD(PERL_VERSION)<<12)|_dpppDEC2BCD(PERL_SUBVERSION)) /* It is very unlikely that anyone will try to use this with Perl 6 (or greater), but who knows. */ #if PERL_REVISION != 5 # error ppport.h only works with Perl version 5 #endif /* PERL_REVISION != 5 */ #ifdef I_LIMITS # include #endif #ifndef PERL_UCHAR_MIN # define PERL_UCHAR_MIN ((unsigned char)0) #endif #ifndef PERL_UCHAR_MAX # ifdef UCHAR_MAX # define PERL_UCHAR_MAX ((unsigned char)UCHAR_MAX) # else # ifdef MAXUCHAR # define PERL_UCHAR_MAX ((unsigned char)MAXUCHAR) # else # define PERL_UCHAR_MAX ((unsigned char)~(unsigned)0) # endif # endif #endif #ifndef PERL_USHORT_MIN # define PERL_USHORT_MIN ((unsigned short)0) #endif #ifndef PERL_USHORT_MAX # ifdef USHORT_MAX # define PERL_USHORT_MAX ((unsigned short)USHORT_MAX) # else # ifdef MAXUSHORT # define PERL_USHORT_MAX ((unsigned short)MAXUSHORT) # else # ifdef USHRT_MAX # define PERL_USHORT_MAX ((unsigned short)USHRT_MAX) # else # define PERL_USHORT_MAX ((unsigned short)~(unsigned)0) # endif # endif # endif #endif #ifndef PERL_SHORT_MAX # ifdef SHORT_MAX # define PERL_SHORT_MAX ((short)SHORT_MAX) # else # ifdef MAXSHORT /* Often used in */ # define PERL_SHORT_MAX ((short)MAXSHORT) # else # ifdef SHRT_MAX # define PERL_SHORT_MAX ((short)SHRT_MAX) # else # define PERL_SHORT_MAX ((short) (PERL_USHORT_MAX >> 1)) # endif # endif # endif #endif #ifndef PERL_SHORT_MIN # ifdef SHORT_MIN # define PERL_SHORT_MIN ((short)SHORT_MIN) # else # ifdef MINSHORT # define PERL_SHORT_MIN ((short)MINSHORT) # else # ifdef SHRT_MIN # define PERL_SHORT_MIN ((short)SHRT_MIN) # else # define PERL_SHORT_MIN (-PERL_SHORT_MAX - ((3 & -1) == 3)) # endif # endif # endif #endif #ifndef PERL_UINT_MAX # ifdef UINT_MAX # define PERL_UINT_MAX ((unsigned int)UINT_MAX) # else # ifdef MAXUINT # define PERL_UINT_MAX ((unsigned int)MAXUINT) # else # define PERL_UINT_MAX (~(unsigned int)0) # endif # endif #endif #ifndef PERL_UINT_MIN # define PERL_UINT_MIN ((unsigned int)0) #endif #ifndef PERL_INT_MAX # ifdef INT_MAX # define PERL_INT_MAX ((int)INT_MAX) # else # ifdef MAXINT /* Often used in */ # define PERL_INT_MAX ((int)MAXINT) # else # define PERL_INT_MAX ((int)(PERL_UINT_MAX >> 1)) # endif # endif #endif #ifndef PERL_INT_MIN # ifdef INT_MIN # define PERL_INT_MIN ((int)INT_MIN) # else # ifdef MININT # define PERL_INT_MIN ((int)MININT) # else # define PERL_INT_MIN (-PERL_INT_MAX - ((3 & -1) == 3)) # endif # endif #endif #ifndef PERL_ULONG_MAX # ifdef ULONG_MAX # define PERL_ULONG_MAX ((unsigned long)ULONG_MAX) # else # ifdef MAXULONG # define PERL_ULONG_MAX ((unsigned long)MAXULONG) # else # define PERL_ULONG_MAX (~(unsigned long)0) # endif # endif #endif #ifndef PERL_ULONG_MIN # define PERL_ULONG_MIN ((unsigned long)0L) #endif #ifndef PERL_LONG_MAX # ifdef LONG_MAX # define PERL_LONG_MAX ((long)LONG_MAX) # else # ifdef MAXLONG # define PERL_LONG_MAX ((long)MAXLONG) # else # define PERL_LONG_MAX ((long) (PERL_ULONG_MAX >> 1)) # endif # endif #endif #ifndef PERL_LONG_MIN # ifdef LONG_MIN # define PERL_LONG_MIN ((long)LONG_MIN) # else # ifdef MINLONG # define PERL_LONG_MIN ((long)MINLONG) # else # define PERL_LONG_MIN (-PERL_LONG_MAX - ((3 & -1) == 3)) # endif # endif #endif #if defined(HAS_QUAD) && (defined(convex) || defined(uts)) # ifndef PERL_UQUAD_MAX # ifdef ULONGLONG_MAX # define PERL_UQUAD_MAX ((unsigned long long)ULONGLONG_MAX) # else # ifdef MAXULONGLONG # define PERL_UQUAD_MAX ((unsigned long long)MAXULONGLONG) # else # define PERL_UQUAD_MAX (~(unsigned long long)0) # endif # endif # endif # ifndef PERL_UQUAD_MIN # define PERL_UQUAD_MIN ((unsigned long long)0L) # endif # ifndef PERL_QUAD_MAX # ifdef LONGLONG_MAX # define PERL_QUAD_MAX ((long long)LONGLONG_MAX) # else # ifdef MAXLONGLONG # define PERL_QUAD_MAX ((long long)MAXLONGLONG) # else # define PERL_QUAD_MAX ((long long) (PERL_UQUAD_MAX >> 1)) # endif # endif # endif # ifndef PERL_QUAD_MIN # ifdef LONGLONG_MIN # define PERL_QUAD_MIN ((long long)LONGLONG_MIN) # else # ifdef MINLONGLONG # define PERL_QUAD_MIN ((long long)MINLONGLONG) # else # define PERL_QUAD_MIN (-PERL_QUAD_MAX - ((3 & -1) == 3)) # endif # endif # endif #endif /* This is based on code from 5.003 perl.h */ #ifdef HAS_QUAD # ifdef cray #ifndef IVTYPE # define IVTYPE int #endif #ifndef IV_MIN # define IV_MIN PERL_INT_MIN #endif #ifndef IV_MAX # define IV_MAX PERL_INT_MAX #endif #ifndef UV_MIN # define UV_MIN PERL_UINT_MIN #endif #ifndef UV_MAX # define UV_MAX PERL_UINT_MAX #endif # ifdef INTSIZE #ifndef IVSIZE # define IVSIZE INTSIZE #endif # endif # else # if defined(convex) || defined(uts) #ifndef IVTYPE # define IVTYPE long long #endif #ifndef IV_MIN # define IV_MIN PERL_QUAD_MIN #endif #ifndef IV_MAX # define IV_MAX PERL_QUAD_MAX #endif #ifndef UV_MIN # define UV_MIN PERL_UQUAD_MIN #endif #ifndef UV_MAX # define UV_MAX PERL_UQUAD_MAX #endif # ifdef LONGLONGSIZE #ifndef IVSIZE # define IVSIZE LONGLONGSIZE #endif # endif # else #ifndef IVTYPE # define IVTYPE long #endif #ifndef IV_MIN # define IV_MIN PERL_LONG_MIN #endif #ifndef IV_MAX # define IV_MAX PERL_LONG_MAX #endif #ifndef UV_MIN # define UV_MIN PERL_ULONG_MIN #endif #ifndef UV_MAX # define UV_MAX PERL_ULONG_MAX #endif # ifdef LONGSIZE #ifndef IVSIZE # define IVSIZE LONGSIZE #endif # endif # endif # endif #ifndef IVSIZE # define IVSIZE 8 #endif #ifndef PERL_QUAD_MIN # define PERL_QUAD_MIN IV_MIN #endif #ifndef PERL_QUAD_MAX # define PERL_QUAD_MAX IV_MAX #endif #ifndef PERL_UQUAD_MIN # define PERL_UQUAD_MIN UV_MIN #endif #ifndef PERL_UQUAD_MAX # define PERL_UQUAD_MAX UV_MAX #endif #else #ifndef IVTYPE # define IVTYPE long #endif #ifndef IV_MIN # define IV_MIN PERL_LONG_MIN #endif #ifndef IV_MAX # define IV_MAX PERL_LONG_MAX #endif #ifndef UV_MIN # define UV_MIN PERL_ULONG_MIN #endif #ifndef UV_MAX # define UV_MAX PERL_ULONG_MAX #endif #endif #ifndef IVSIZE # ifdef LONGSIZE # define IVSIZE LONGSIZE # else # define IVSIZE 4 /* A bold guess, but the best we can make. */ # endif #endif #ifndef UVTYPE # define UVTYPE unsigned IVTYPE #endif #ifndef UVSIZE # define UVSIZE IVSIZE #endif #ifndef sv_setuv # define sv_setuv(sv, uv) \ STMT_START { \ UV TeMpUv = uv; \ if (TeMpUv <= IV_MAX) \ sv_setiv(sv, TeMpUv); \ else \ sv_setnv(sv, (double)TeMpUv); \ } STMT_END #endif #ifndef newSVuv # define newSVuv(uv) ((uv) <= IV_MAX ? newSViv((IV)uv) : newSVnv((NV)uv)) #endif #ifndef sv_2uv # define sv_2uv(sv) ((PL_Sv = (sv)), (UV) (SvNOK(PL_Sv) ? SvNV(PL_Sv) : sv_2nv(PL_Sv))) #endif #ifndef SvUVX # define SvUVX(sv) ((UV)SvIVX(sv)) #endif #ifndef SvUVXx # define SvUVXx(sv) SvUVX(sv) #endif #ifndef SvUV # define SvUV(sv) (SvIOK(sv) ? SvUVX(sv) : sv_2uv(sv)) #endif #ifndef SvUVx # define SvUVx(sv) ((PL_Sv = (sv)), SvUV(PL_Sv)) #endif /* Hint: sv_uv * Always use the SvUVx() macro instead of sv_uv(). */ #ifndef sv_uv # define sv_uv(sv) SvUVx(sv) #endif #if !defined(SvUOK) && defined(SvIOK_UV) # define SvUOK(sv) SvIOK_UV(sv) #endif #ifndef XST_mUV # define XST_mUV(i,v) (ST(i) = sv_2mortal(newSVuv(v)) ) #endif #ifndef XSRETURN_UV # define XSRETURN_UV(v) STMT_START { XST_mUV(0,v); XSRETURN(1); } STMT_END #endif #ifndef PUSHu # define PUSHu(u) STMT_START { sv_setuv(TARG, (UV)(u)); PUSHTARG; } STMT_END #endif #ifndef XPUSHu # define XPUSHu(u) STMT_START { sv_setuv(TARG, (UV)(u)); XPUSHTARG; } STMT_END #endif #ifdef HAS_MEMCMP #ifndef memNE # define memNE(s1,s2,l) (memcmp(s1,s2,l)) #endif #ifndef memEQ # define memEQ(s1,s2,l) (!memcmp(s1,s2,l)) #endif #else #ifndef memNE # define memNE(s1,s2,l) (bcmp(s1,s2,l)) #endif #ifndef memEQ # define memEQ(s1,s2,l) (!bcmp(s1,s2,l)) #endif #endif #ifndef MoveD # define MoveD(s,d,n,t) memmove((char*)(d),(char*)(s), (n) * sizeof(t)) #endif #ifndef CopyD # define CopyD(s,d,n,t) memcpy((char*)(d),(char*)(s), (n) * sizeof(t)) #endif #ifdef HAS_MEMSET #ifndef ZeroD # define ZeroD(d,n,t) memzero((char*)(d), (n) * sizeof(t)) #endif #else #ifndef ZeroD # define ZeroD(d,n,t) ((void)memzero((char*)(d), (n) * sizeof(t)), d) #endif #endif #ifndef PoisonWith # define PoisonWith(d,n,t,b) (void)memset((char*)(d), (U8)(b), (n) * sizeof(t)) #endif #ifndef PoisonNew # define PoisonNew(d,n,t) PoisonWith(d,n,t,0xAB) #endif #ifndef PoisonFree # define PoisonFree(d,n,t) PoisonWith(d,n,t,0xEF) #endif #ifndef Poison # define Poison(d,n,t) PoisonFree(d,n,t) #endif #ifndef Newx # define Newx(v,n,t) New(0,v,n,t) #endif #ifndef Newxc # define Newxc(v,n,t,c) Newc(0,v,n,t,c) #endif #ifndef Newxz # define Newxz(v,n,t) Newz(0,v,n,t) #endif #ifndef PERL_UNUSED_DECL # ifdef HASATTRIBUTE # if (defined(__GNUC__) && defined(__cplusplus)) || defined(__INTEL_COMPILER) # define PERL_UNUSED_DECL # else # define PERL_UNUSED_DECL __attribute__((unused)) # endif # else # define PERL_UNUSED_DECL # endif #endif #ifndef PERL_UNUSED_ARG # if defined(lint) && defined(S_SPLINT_S) /* www.splint.org */ # include # define PERL_UNUSED_ARG(x) NOTE(ARGUNUSED(x)) # else # define PERL_UNUSED_ARG(x) ((void)x) # endif #endif #ifndef PERL_UNUSED_VAR # define PERL_UNUSED_VAR(x) ((void)x) #endif #ifndef PERL_UNUSED_CONTEXT # ifdef USE_ITHREADS # define PERL_UNUSED_CONTEXT PERL_UNUSED_ARG(my_perl) # else # define PERL_UNUSED_CONTEXT # endif #endif #ifndef NOOP # define NOOP /*EMPTY*/(void)0 #endif #ifndef dNOOP # define dNOOP extern int /*@unused@*/ Perl___notused PERL_UNUSED_DECL #endif #ifndef NVTYPE # if defined(USE_LONG_DOUBLE) && defined(HAS_LONG_DOUBLE) # define NVTYPE long double # else # define NVTYPE double # endif typedef NVTYPE NV; #endif #ifndef INT2PTR # if (IVSIZE == PTRSIZE) && (UVSIZE == PTRSIZE) # define PTRV UV # define INT2PTR(any,d) (any)(d) # else # if PTRSIZE == LONGSIZE # define PTRV unsigned long # else # define PTRV unsigned # endif # define INT2PTR(any,d) (any)(PTRV)(d) # endif # define NUM2PTR(any,d) (any)(PTRV)(d) # define PTR2IV(p) INT2PTR(IV,p) # define PTR2UV(p) INT2PTR(UV,p) # define PTR2NV(p) NUM2PTR(NV,p) # if PTRSIZE == LONGSIZE # define PTR2ul(p) (unsigned long)(p) # else # define PTR2ul(p) INT2PTR(unsigned long,p) # endif #endif /* !INT2PTR */ #undef START_EXTERN_C #undef END_EXTERN_C #undef EXTERN_C #ifdef __cplusplus # define START_EXTERN_C extern "C" { # define END_EXTERN_C } # define EXTERN_C extern "C" #else # define START_EXTERN_C # define END_EXTERN_C # define EXTERN_C extern #endif #if defined(PERL_GCC_PEDANTIC) # ifndef PERL_GCC_BRACE_GROUPS_FORBIDDEN # define PERL_GCC_BRACE_GROUPS_FORBIDDEN # endif #endif #if defined(__GNUC__) && !defined(PERL_GCC_BRACE_GROUPS_FORBIDDEN) && !defined(__cplusplus) # ifndef PERL_USE_GCC_BRACE_GROUPS # define PERL_USE_GCC_BRACE_GROUPS # endif #endif #undef STMT_START #undef STMT_END #ifdef PERL_USE_GCC_BRACE_GROUPS # define STMT_START (void)( /* gcc supports ``({ STATEMENTS; })'' */ # define STMT_END ) #else # if defined(VOIDFLAGS) && (VOIDFLAGS) && (defined(sun) || defined(__sun__)) && !defined(__GNUC__) # define STMT_START if (1) # define STMT_END else (void)0 # else # define STMT_START do # define STMT_END while (0) # endif #endif #ifndef boolSV # define boolSV(b) ((b) ? &PL_sv_yes : &PL_sv_no) #endif /* DEFSV appears first in 5.004_56 */ #ifndef DEFSV # define DEFSV GvSV(PL_defgv) #endif #ifndef SAVE_DEFSV # define SAVE_DEFSV SAVESPTR(GvSV(PL_defgv)) #endif /* Older perls (<=5.003) lack AvFILLp */ #ifndef AvFILLp # define AvFILLp AvFILL #endif #ifndef ERRSV # define ERRSV get_sv("@",FALSE) #endif /* Hint: gv_stashpvn * This function's backport doesn't support the length parameter, but * rather ignores it. Portability can only be ensured if the length * parameter is used for speed reasons, but the length can always be * correctly computed from the string argument. */ #ifndef gv_stashpvn # define gv_stashpvn(str,len,create) gv_stashpv(str,create) #endif /* Replace: 1 */ #ifndef get_cv # define get_cv perl_get_cv #endif #ifndef get_sv # define get_sv perl_get_sv #endif #ifndef get_av # define get_av perl_get_av #endif #ifndef get_hv # define get_hv perl_get_hv #endif /* Replace: 0 */ #ifndef dUNDERBAR # define dUNDERBAR dNOOP #endif #ifndef UNDERBAR # define UNDERBAR DEFSV #endif #ifndef dAX # define dAX I32 ax = MARK - PL_stack_base + 1 #endif #ifndef dITEMS # define dITEMS I32 items = SP - MARK #endif #ifndef dXSTARG # define dXSTARG SV * targ = sv_newmortal() #endif #ifndef dAXMARK # define dAXMARK I32 ax = POPMARK; \ register SV ** const mark = PL_stack_base + ax++ #endif #ifndef XSprePUSH # define XSprePUSH (sp = PL_stack_base + ax - 1) #endif #if (PERL_BCDVERSION < 0x5005000) # undef XSRETURN # define XSRETURN(off) \ STMT_START { \ PL_stack_sp = PL_stack_base + ax + ((off) - 1); \ return; \ } STMT_END #endif #ifndef PERL_ABS # define PERL_ABS(x) ((x) < 0 ? -(x) : (x)) #endif #ifndef dVAR # define dVAR dNOOP #endif #ifndef SVf # define SVf "_" #endif #ifndef UTF8_MAXBYTES # define UTF8_MAXBYTES UTF8_MAXLEN #endif #ifndef PERL_HASH # define PERL_HASH(hash,str,len) \ STMT_START { \ const char *s_PeRlHaSh = str; \ I32 i_PeRlHaSh = len; \ U32 hash_PeRlHaSh = 0; \ while (i_PeRlHaSh--) \ hash_PeRlHaSh = hash_PeRlHaSh * 33 + *s_PeRlHaSh++; \ (hash) = hash_PeRlHaSh; \ } STMT_END #endif #ifndef PERL_SIGNALS_UNSAFE_FLAG #define PERL_SIGNALS_UNSAFE_FLAG 0x0001 #if (PERL_BCDVERSION < 0x5008000) # define D_PPP_PERL_SIGNALS_INIT PERL_SIGNALS_UNSAFE_FLAG #else # define D_PPP_PERL_SIGNALS_INIT 0 #endif #if defined(NEED_PL_signals) static U32 DPPP_(my_PL_signals) = D_PPP_PERL_SIGNALS_INIT; #elif defined(NEED_PL_signals_GLOBAL) U32 DPPP_(my_PL_signals) = D_PPP_PERL_SIGNALS_INIT; #else extern U32 DPPP_(my_PL_signals); #endif #define PL_signals DPPP_(my_PL_signals) #endif /* Hint: PL_ppaddr * Calling an op via PL_ppaddr requires passing a context argument * for threaded builds. Since the context argument is different for * 5.005 perls, you can use aTHXR (supplied by ppport.h), which will * automatically be defined as the correct argument. */ #if (PERL_BCDVERSION <= 0x5005005) /* Replace: 1 */ # define PL_ppaddr ppaddr # define PL_no_modify no_modify /* Replace: 0 */ #endif #if (PERL_BCDVERSION <= 0x5004005) /* Replace: 1 */ # define PL_DBsignal DBsignal # define PL_DBsingle DBsingle # define PL_DBsub DBsub # define PL_DBtrace DBtrace # define PL_Sv Sv # define PL_compiling compiling # define PL_copline copline # define PL_curcop curcop # define PL_curstash curstash # define PL_debstash debstash # define PL_defgv defgv # define PL_diehook diehook # define PL_dirty dirty # define PL_dowarn dowarn # define PL_errgv errgv # define PL_expect expect # define PL_hexdigit hexdigit # define PL_hints hints # define PL_laststatval laststatval # define PL_na na # define PL_perl_destruct_level perl_destruct_level # define PL_perldb perldb # define PL_rsfp_filters rsfp_filters # define PL_rsfp rsfp # define PL_stack_base stack_base # define PL_stack_sp stack_sp # define PL_statcache statcache # define PL_stdingv stdingv # define PL_sv_arenaroot sv_arenaroot # define PL_sv_no sv_no # define PL_sv_undef sv_undef # define PL_sv_yes sv_yes # define PL_tainted tainted # define PL_tainting tainting /* Replace: 0 */ #endif /* Warning: PL_expect, PL_copline, PL_rsfp, PL_rsfp_filters * Do not use this variable. It is internal to the perl parser * and may change or even be removed in the future. Note that * as of perl 5.9.5 you cannot assign to this variable anymore. */ /* TODO: cannot assign to these vars; is it worth fixing? */ #if (PERL_BCDVERSION >= 0x5009005) # define PL_expect (PL_parser ? PL_parser->expect : 0) # define PL_copline (PL_parser ? PL_parser->copline : 0) # define PL_rsfp (PL_parser ? PL_parser->rsfp : (PerlIO *) 0) # define PL_rsfp_filters (PL_parser ? PL_parser->rsfp_filters : (AV *) 0) #endif #ifndef dTHR # define dTHR dNOOP #endif #ifndef dTHX # define dTHX dNOOP #endif #ifndef dTHXa # define dTHXa(x) dNOOP #endif #ifndef pTHX # define pTHX void #endif #ifndef pTHX_ # define pTHX_ #endif #ifndef aTHX # define aTHX #endif #ifndef aTHX_ # define aTHX_ #endif #if (PERL_BCDVERSION < 0x5006000) # ifdef USE_THREADS # define aTHXR thr # define aTHXR_ thr, # else # define aTHXR # define aTHXR_ # endif # define dTHXR dTHR #else # define aTHXR aTHX # define aTHXR_ aTHX_ # define dTHXR dTHX #endif #ifndef dTHXoa # define dTHXoa(x) dTHXa(x) #endif #ifndef mPUSHs # define mPUSHs(s) PUSHs(sv_2mortal(s)) #endif #ifndef PUSHmortal # define PUSHmortal PUSHs(sv_newmortal()) #endif #ifndef mPUSHp # define mPUSHp(p,l) sv_setpvn(PUSHmortal, (p), (l)) #endif #ifndef mPUSHn # define mPUSHn(n) sv_setnv(PUSHmortal, (NV)(n)) #endif #ifndef mPUSHi # define mPUSHi(i) sv_setiv(PUSHmortal, (IV)(i)) #endif #ifndef mPUSHu # define mPUSHu(u) sv_setuv(PUSHmortal, (UV)(u)) #endif #ifndef mXPUSHs # define mXPUSHs(s) XPUSHs(sv_2mortal(s)) #endif #ifndef XPUSHmortal # define XPUSHmortal XPUSHs(sv_newmortal()) #endif #ifndef mXPUSHp # define mXPUSHp(p,l) STMT_START { EXTEND(sp,1); sv_setpvn(PUSHmortal, (p), (l)); } STMT_END #endif #ifndef mXPUSHn # define mXPUSHn(n) STMT_START { EXTEND(sp,1); sv_setnv(PUSHmortal, (NV)(n)); } STMT_END #endif #ifndef mXPUSHi # define mXPUSHi(i) STMT_START { EXTEND(sp,1); sv_setiv(PUSHmortal, (IV)(i)); } STMT_END #endif #ifndef mXPUSHu # define mXPUSHu(u) STMT_START { EXTEND(sp,1); sv_setuv(PUSHmortal, (UV)(u)); } STMT_END #endif /* Replace: 1 */ #ifndef call_sv # define call_sv perl_call_sv #endif #ifndef call_pv # define call_pv perl_call_pv #endif #ifndef call_argv # define call_argv perl_call_argv #endif #ifndef call_method # define call_method perl_call_method #endif #ifndef eval_sv # define eval_sv perl_eval_sv #endif #ifndef PERL_LOADMOD_DENY # define PERL_LOADMOD_DENY 0x1 #endif #ifndef PERL_LOADMOD_NOIMPORT # define PERL_LOADMOD_NOIMPORT 0x2 #endif #ifndef PERL_LOADMOD_IMPORT_OPS # define PERL_LOADMOD_IMPORT_OPS 0x4 #endif /* Replace: 0 */ /* Replace perl_eval_pv with eval_pv */ #ifndef eval_pv #if defined(NEED_eval_pv) static SV* DPPP_(my_eval_pv)(char *p, I32 croak_on_error); static #else extern SV* DPPP_(my_eval_pv)(char *p, I32 croak_on_error); #endif #ifdef eval_pv # undef eval_pv #endif #define eval_pv(a,b) DPPP_(my_eval_pv)(aTHX_ a,b) #define Perl_eval_pv DPPP_(my_eval_pv) #if defined(NEED_eval_pv) || defined(NEED_eval_pv_GLOBAL) SV* DPPP_(my_eval_pv)(char *p, I32 croak_on_error) { dSP; SV* sv = newSVpv(p, 0); PUSHMARK(sp); eval_sv(sv, G_SCALAR); SvREFCNT_dec(sv); SPAGAIN; sv = POPs; PUTBACK; if (croak_on_error && SvTRUE(GvSV(errgv))) croak(SvPVx(GvSV(errgv), na)); return sv; } #endif #endif #ifndef vload_module #if defined(NEED_vload_module) static void DPPP_(my_vload_module)(U32 flags, SV *name, SV *ver, va_list *args); static #else extern void DPPP_(my_vload_module)(U32 flags, SV *name, SV *ver, va_list *args); #endif #ifdef vload_module # undef vload_module #endif #define vload_module(a,b,c,d) DPPP_(my_vload_module)(aTHX_ a,b,c,d) #define Perl_vload_module DPPP_(my_vload_module) #if defined(NEED_vload_module) || defined(NEED_vload_module_GLOBAL) void DPPP_(my_vload_module)(U32 flags, SV *name, SV *ver, va_list *args) { dTHR; dVAR; OP *veop, *imop; OP * const modname = newSVOP(OP_CONST, 0, name); /* 5.005 has a somewhat hacky force_normal that doesn't croak on SvREADONLY() if PL_compling is true. Current perls take care in ck_require() to correctly turn off SvREADONLY before calling force_normal_flags(). This seems a better fix than fudging PL_compling */ SvREADONLY_off(((SVOP*)modname)->op_sv); modname->op_private |= OPpCONST_BARE; if (ver) { veop = newSVOP(OP_CONST, 0, ver); } else veop = NULL; if (flags & PERL_LOADMOD_NOIMPORT) { imop = sawparens(newNULLLIST()); } else if (flags & PERL_LOADMOD_IMPORT_OPS) { imop = va_arg(*args, OP*); } else { SV *sv; imop = NULL; sv = va_arg(*args, SV*); while (sv) { imop = append_elem(OP_LIST, imop, newSVOP(OP_CONST, 0, sv)); sv = va_arg(*args, SV*); } } { const line_t ocopline = PL_copline; COP * const ocurcop = PL_curcop; const int oexpect = PL_expect; #if (PERL_BCDVERSION >= 0x5004000) utilize(!(flags & PERL_LOADMOD_DENY), start_subparse(FALSE, 0), veop, modname, imop); #else utilize(!(flags & PERL_LOADMOD_DENY), start_subparse(), modname, imop); #endif PL_expect = oexpect; PL_copline = ocopline; PL_curcop = ocurcop; } } #endif #endif #ifndef load_module #if defined(NEED_load_module) static void DPPP_(my_load_module)(U32 flags, SV *name, SV *ver, ...); static #else extern void DPPP_(my_load_module)(U32 flags, SV *name, SV *ver, ...); #endif #ifdef load_module # undef load_module #endif #define load_module DPPP_(my_load_module) #define Perl_load_module DPPP_(my_load_module) #if defined(NEED_load_module) || defined(NEED_load_module_GLOBAL) void DPPP_(my_load_module)(U32 flags, SV *name, SV *ver, ...) { va_list args; va_start(args, ver); vload_module(flags, name, ver, &args); va_end(args); } #endif #endif #ifndef newRV_inc # define newRV_inc(sv) newRV(sv) /* Replace */ #endif #ifndef newRV_noinc #if defined(NEED_newRV_noinc) static SV * DPPP_(my_newRV_noinc)(SV *sv); static #else extern SV * DPPP_(my_newRV_noinc)(SV *sv); #endif #ifdef newRV_noinc # undef newRV_noinc #endif #define newRV_noinc(a) DPPP_(my_newRV_noinc)(aTHX_ a) #define Perl_newRV_noinc DPPP_(my_newRV_noinc) #if defined(NEED_newRV_noinc) || defined(NEED_newRV_noinc_GLOBAL) SV * DPPP_(my_newRV_noinc)(SV *sv) { SV *rv = (SV *)newRV(sv); SvREFCNT_dec(sv); return rv; } #endif #endif /* Hint: newCONSTSUB * Returns a CV* as of perl-5.7.1. This return value is not supported * by Devel::PPPort. */ /* newCONSTSUB from IO.xs is in the core starting with 5.004_63 */ #if (PERL_BCDVERSION < 0x5004063) && (PERL_BCDVERSION != 0x5004005) #if defined(NEED_newCONSTSUB) static void DPPP_(my_newCONSTSUB)(HV *stash, const char *name, SV *sv); static #else extern void DPPP_(my_newCONSTSUB)(HV *stash, const char *name, SV *sv); #endif #ifdef newCONSTSUB # undef newCONSTSUB #endif #define newCONSTSUB(a,b,c) DPPP_(my_newCONSTSUB)(aTHX_ a,b,c) #define Perl_newCONSTSUB DPPP_(my_newCONSTSUB) #if defined(NEED_newCONSTSUB) || defined(NEED_newCONSTSUB_GLOBAL) void DPPP_(my_newCONSTSUB)(HV *stash, const char *name, SV *sv) { U32 oldhints = PL_hints; HV *old_cop_stash = PL_curcop->cop_stash; HV *old_curstash = PL_curstash; line_t oldline = PL_curcop->cop_line; PL_curcop->cop_line = PL_copline; PL_hints &= ~HINT_BLOCK_SCOPE; if (stash) PL_curstash = PL_curcop->cop_stash = stash; newSUB( #if (PERL_BCDVERSION < 0x5003022) start_subparse(), #elif (PERL_BCDVERSION == 0x5003022) start_subparse(0), #else /* 5.003_23 onwards */ start_subparse(FALSE, 0), #endif newSVOP(OP_CONST, 0, newSVpv((char *) name, 0)), newSVOP(OP_CONST, 0, &PL_sv_no), /* SvPV(&PL_sv_no) == "" -- GMB */ newSTATEOP(0, Nullch, newSVOP(OP_CONST, 0, sv)) ); PL_hints = oldhints; PL_curcop->cop_stash = old_cop_stash; PL_curstash = old_curstash; PL_curcop->cop_line = oldline; } #endif #endif /* * Boilerplate macros for initializing and accessing interpreter-local * data from C. All statics in extensions should be reworked to use * this, if you want to make the extension thread-safe. See ext/re/re.xs * for an example of the use of these macros. * * Code that uses these macros is responsible for the following: * 1. #define MY_CXT_KEY to a unique string, e.g. "DynaLoader_guts" * 2. Declare a typedef named my_cxt_t that is a structure that contains * all the data that needs to be interpreter-local. * 3. Use the START_MY_CXT macro after the declaration of my_cxt_t. * 4. Use the MY_CXT_INIT macro such that it is called exactly once * (typically put in the BOOT: section). * 5. Use the members of the my_cxt_t structure everywhere as * MY_CXT.member. * 6. Use the dMY_CXT macro (a declaration) in all the functions that * access MY_CXT. */ #if defined(MULTIPLICITY) || defined(PERL_OBJECT) || \ defined(PERL_CAPI) || defined(PERL_IMPLICIT_CONTEXT) #ifndef START_MY_CXT /* This must appear in all extensions that define a my_cxt_t structure, * right after the definition (i.e. at file scope). The non-threads * case below uses it to declare the data as static. */ #define START_MY_CXT #if (PERL_BCDVERSION < 0x5004068) /* Fetches the SV that keeps the per-interpreter data. */ #define dMY_CXT_SV \ SV *my_cxt_sv = get_sv(MY_CXT_KEY, FALSE) #else /* >= perl5.004_68 */ #define dMY_CXT_SV \ SV *my_cxt_sv = *hv_fetch(PL_modglobal, MY_CXT_KEY, \ sizeof(MY_CXT_KEY)-1, TRUE) #endif /* < perl5.004_68 */ /* This declaration should be used within all functions that use the * interpreter-local data. */ #define dMY_CXT \ dMY_CXT_SV; \ my_cxt_t *my_cxtp = INT2PTR(my_cxt_t*,SvUV(my_cxt_sv)) /* Creates and zeroes the per-interpreter data. * (We allocate my_cxtp in a Perl SV so that it will be released when * the interpreter goes away.) */ #define MY_CXT_INIT \ dMY_CXT_SV; \ /* newSV() allocates one more than needed */ \ my_cxt_t *my_cxtp = (my_cxt_t*)SvPVX(newSV(sizeof(my_cxt_t)-1));\ Zero(my_cxtp, 1, my_cxt_t); \ sv_setuv(my_cxt_sv, PTR2UV(my_cxtp)) /* This macro must be used to access members of the my_cxt_t structure. * e.g. MYCXT.some_data */ #define MY_CXT (*my_cxtp) /* Judicious use of these macros can reduce the number of times dMY_CXT * is used. Use is similar to pTHX, aTHX etc. */ #define pMY_CXT my_cxt_t *my_cxtp #define pMY_CXT_ pMY_CXT, #define _pMY_CXT ,pMY_CXT #define aMY_CXT my_cxtp #define aMY_CXT_ aMY_CXT, #define _aMY_CXT ,aMY_CXT #endif /* START_MY_CXT */ #ifndef MY_CXT_CLONE /* Clones the per-interpreter data. */ #define MY_CXT_CLONE \ dMY_CXT_SV; \ my_cxt_t *my_cxtp = (my_cxt_t*)SvPVX(newSV(sizeof(my_cxt_t)-1));\ Copy(INT2PTR(my_cxt_t*, SvUV(my_cxt_sv)), my_cxtp, 1, my_cxt_t);\ sv_setuv(my_cxt_sv, PTR2UV(my_cxtp)) #endif #else /* single interpreter */ #ifndef START_MY_CXT #define START_MY_CXT static my_cxt_t my_cxt; #define dMY_CXT_SV dNOOP #define dMY_CXT dNOOP #define MY_CXT_INIT NOOP #define MY_CXT my_cxt #define pMY_CXT void #define pMY_CXT_ #define _pMY_CXT #define aMY_CXT #define aMY_CXT_ #define _aMY_CXT #endif /* START_MY_CXT */ #ifndef MY_CXT_CLONE #define MY_CXT_CLONE NOOP #endif #endif #ifndef IVdf # if IVSIZE == LONGSIZE # define IVdf "ld" # define UVuf "lu" # define UVof "lo" # define UVxf "lx" # define UVXf "lX" # else # if IVSIZE == INTSIZE # define IVdf "d" # define UVuf "u" # define UVof "o" # define UVxf "x" # define UVXf "X" # endif # endif #endif #ifndef NVef # if defined(USE_LONG_DOUBLE) && defined(HAS_LONG_DOUBLE) && \ defined(PERL_PRIfldbl) && (PERL_BCDVERSION != 0x5006000) /* Not very likely, but let's try anyway. */ # define NVef PERL_PRIeldbl # define NVff PERL_PRIfldbl # define NVgf PERL_PRIgldbl # else # define NVef "e" # define NVff "f" # define NVgf "g" # endif #endif #ifndef SvREFCNT_inc # ifdef PERL_USE_GCC_BRACE_GROUPS # define SvREFCNT_inc(sv) \ ({ \ SV * const _sv = (SV*)(sv); \ if (_sv) \ (SvREFCNT(_sv))++; \ _sv; \ }) # else # define SvREFCNT_inc(sv) \ ((PL_Sv=(SV*)(sv)) ? (++(SvREFCNT(PL_Sv)),PL_Sv) : NULL) # endif #endif #ifndef SvREFCNT_inc_simple # ifdef PERL_USE_GCC_BRACE_GROUPS # define SvREFCNT_inc_simple(sv) \ ({ \ if (sv) \ (SvREFCNT(sv))++; \ (SV *)(sv); \ }) # else # define SvREFCNT_inc_simple(sv) \ ((sv) ? (SvREFCNT(sv)++,(SV*)(sv)) : NULL) # endif #endif #ifndef SvREFCNT_inc_NN # ifdef PERL_USE_GCC_BRACE_GROUPS # define SvREFCNT_inc_NN(sv) \ ({ \ SV * const _sv = (SV*)(sv); \ SvREFCNT(_sv)++; \ _sv; \ }) # else # define SvREFCNT_inc_NN(sv) \ (PL_Sv=(SV*)(sv),++(SvREFCNT(PL_Sv)),PL_Sv) # endif #endif #ifndef SvREFCNT_inc_void # ifdef PERL_USE_GCC_BRACE_GROUPS # define SvREFCNT_inc_void(sv) \ ({ \ SV * const _sv = (SV*)(sv); \ if (_sv) \ (void)(SvREFCNT(_sv)++); \ }) # else # define SvREFCNT_inc_void(sv) \ (void)((PL_Sv=(SV*)(sv)) ? ++(SvREFCNT(PL_Sv)) : 0) # endif #endif #ifndef SvREFCNT_inc_simple_void # define SvREFCNT_inc_simple_void(sv) STMT_START { if (sv) SvREFCNT(sv)++; } STMT_END #endif #ifndef SvREFCNT_inc_simple_NN # define SvREFCNT_inc_simple_NN(sv) (++SvREFCNT(sv), (SV*)(sv)) #endif #ifndef SvREFCNT_inc_void_NN # define SvREFCNT_inc_void_NN(sv) (void)(++SvREFCNT((SV*)(sv))) #endif #ifndef SvREFCNT_inc_simple_void_NN # define SvREFCNT_inc_simple_void_NN(sv) (void)(++SvREFCNT((SV*)(sv))) #endif #ifndef newSVpvn # define newSVpvn(data,len) ((data) \ ? ((len) ? newSVpv((data), (len)) : newSVpv("", 0)) \ : newSV(0)) #endif #ifndef newSVpvn_utf8 # define newSVpvn_utf8(s, len, u) newSVpvn_flags((s), (len), (u) ? SVf_UTF8 : 0) #endif #ifndef SVf_UTF8 # define SVf_UTF8 0 #endif #ifndef newSVpvn_flags #if defined(NEED_newSVpvn_flags) static SV * DPPP_(my_newSVpvn_flags)(pTHX_ const char * s, STRLEN len, U32 flags); static #else extern SV * DPPP_(my_newSVpvn_flags)(pTHX_ const char * s, STRLEN len, U32 flags); #endif #ifdef newSVpvn_flags # undef newSVpvn_flags #endif #define newSVpvn_flags(a,b,c) DPPP_(my_newSVpvn_flags)(aTHX_ a,b,c) #define Perl_newSVpvn_flags DPPP_(my_newSVpvn_flags) #if defined(NEED_newSVpvn_flags) || defined(NEED_newSVpvn_flags_GLOBAL) SV * DPPP_(my_newSVpvn_flags)(pTHX_ const char *s, STRLEN len, U32 flags) { SV *sv = newSVpvn(s, len); SvFLAGS(sv) |= (flags & SVf_UTF8); return (flags & SVs_TEMP) ? sv_2mortal(sv) : sv; } #endif #endif /* Backwards compatibility stuff... :-( */ #if !defined(NEED_sv_2pv_flags) && defined(NEED_sv_2pv_nolen) # define NEED_sv_2pv_flags #endif #if !defined(NEED_sv_2pv_flags_GLOBAL) && defined(NEED_sv_2pv_nolen_GLOBAL) # define NEED_sv_2pv_flags_GLOBAL #endif /* Hint: sv_2pv_nolen * Use the SvPV_nolen() or SvPV_nolen_const() macros instead of sv_2pv_nolen(). */ #ifndef sv_2pv_nolen # define sv_2pv_nolen(sv) SvPV_nolen(sv) #endif #ifdef SvPVbyte /* Hint: SvPVbyte * Does not work in perl-5.6.1, ppport.h implements a version * borrowed from perl-5.7.3. */ #if (PERL_BCDVERSION < 0x5007000) #if defined(NEED_sv_2pvbyte) static char * DPPP_(my_sv_2pvbyte)(pTHX_ SV * sv, STRLEN * lp); static #else extern char * DPPP_(my_sv_2pvbyte)(pTHX_ SV * sv, STRLEN * lp); #endif #ifdef sv_2pvbyte # undef sv_2pvbyte #endif #define sv_2pvbyte(a,b) DPPP_(my_sv_2pvbyte)(aTHX_ a,b) #define Perl_sv_2pvbyte DPPP_(my_sv_2pvbyte) #if defined(NEED_sv_2pvbyte) || defined(NEED_sv_2pvbyte_GLOBAL) char * DPPP_(my_sv_2pvbyte)(pTHX_ SV *sv, STRLEN *lp) { sv_utf8_downgrade(sv,0); return SvPV(sv,*lp); } #endif /* Hint: sv_2pvbyte * Use the SvPVbyte() macro instead of sv_2pvbyte(). */ #undef SvPVbyte #define SvPVbyte(sv, lp) \ ((SvFLAGS(sv) & (SVf_POK|SVf_UTF8)) == (SVf_POK) \ ? ((lp = SvCUR(sv)), SvPVX(sv)) : sv_2pvbyte(sv, &lp)) #endif #else # define SvPVbyte SvPV # define sv_2pvbyte sv_2pv #endif #ifndef sv_2pvbyte_nolen # define sv_2pvbyte_nolen(sv) sv_2pv_nolen(sv) #endif /* Hint: sv_pvn * Always use the SvPV() macro instead of sv_pvn(). */ /* Hint: sv_pvn_force * Always use the SvPV_force() macro instead of sv_pvn_force(). */ /* If these are undefined, they're not handled by the core anyway */ #ifndef SV_IMMEDIATE_UNREF # define SV_IMMEDIATE_UNREF 0 #endif #ifndef SV_GMAGIC # define SV_GMAGIC 0 #endif #ifndef SV_COW_DROP_PV # define SV_COW_DROP_PV 0 #endif #ifndef SV_UTF8_NO_ENCODING # define SV_UTF8_NO_ENCODING 0 #endif #ifndef SV_NOSTEAL # define SV_NOSTEAL 0 #endif #ifndef SV_CONST_RETURN # define SV_CONST_RETURN 0 #endif #ifndef SV_MUTABLE_RETURN # define SV_MUTABLE_RETURN 0 #endif #ifndef SV_SMAGIC # define SV_SMAGIC 0 #endif #ifndef SV_HAS_TRAILING_NUL # define SV_HAS_TRAILING_NUL 0 #endif #ifndef SV_COW_SHARED_HASH_KEYS # define SV_COW_SHARED_HASH_KEYS 0 #endif #if (PERL_BCDVERSION < 0x5007002) #if defined(NEED_sv_2pv_flags) static char * DPPP_(my_sv_2pv_flags)(pTHX_ SV * sv, STRLEN * lp, I32 flags); static #else extern char * DPPP_(my_sv_2pv_flags)(pTHX_ SV * sv, STRLEN * lp, I32 flags); #endif #ifdef sv_2pv_flags # undef sv_2pv_flags #endif #define sv_2pv_flags(a,b,c) DPPP_(my_sv_2pv_flags)(aTHX_ a,b,c) #define Perl_sv_2pv_flags DPPP_(my_sv_2pv_flags) #if defined(NEED_sv_2pv_flags) || defined(NEED_sv_2pv_flags_GLOBAL) char * DPPP_(my_sv_2pv_flags)(pTHX_ SV *sv, STRLEN *lp, I32 flags) { STRLEN n_a = (STRLEN) flags; return sv_2pv(sv, lp ? lp : &n_a); } #endif #if defined(NEED_sv_pvn_force_flags) static char * DPPP_(my_sv_pvn_force_flags)(pTHX_ SV * sv, STRLEN * lp, I32 flags); static #else extern char * DPPP_(my_sv_pvn_force_flags)(pTHX_ SV * sv, STRLEN * lp, I32 flags); #endif #ifdef sv_pvn_force_flags # undef sv_pvn_force_flags #endif #define sv_pvn_force_flags(a,b,c) DPPP_(my_sv_pvn_force_flags)(aTHX_ a,b,c) #define Perl_sv_pvn_force_flags DPPP_(my_sv_pvn_force_flags) #if defined(NEED_sv_pvn_force_flags) || defined(NEED_sv_pvn_force_flags_GLOBAL) char * DPPP_(my_sv_pvn_force_flags)(pTHX_ SV *sv, STRLEN *lp, I32 flags) { STRLEN n_a = (STRLEN) flags; return sv_pvn_force(sv, lp ? lp : &n_a); } #endif #endif #if (PERL_BCDVERSION < 0x5008008) || ( (PERL_BCDVERSION >= 0x5009000) && (PERL_BCDVERSION < 0x5009003) ) # define DPPP_SVPV_NOLEN_LP_ARG &PL_na #else # define DPPP_SVPV_NOLEN_LP_ARG 0 #endif #ifndef SvPV_const # define SvPV_const(sv, lp) SvPV_flags_const(sv, lp, SV_GMAGIC) #endif #ifndef SvPV_mutable # define SvPV_mutable(sv, lp) SvPV_flags_mutable(sv, lp, SV_GMAGIC) #endif #ifndef SvPV_flags # define SvPV_flags(sv, lp, flags) \ ((SvFLAGS(sv) & (SVf_POK)) == SVf_POK \ ? ((lp = SvCUR(sv)), SvPVX(sv)) : sv_2pv_flags(sv, &lp, flags)) #endif #ifndef SvPV_flags_const # define SvPV_flags_const(sv, lp, flags) \ ((SvFLAGS(sv) & (SVf_POK)) == SVf_POK \ ? ((lp = SvCUR(sv)), SvPVX_const(sv)) : \ (const char*) sv_2pv_flags(sv, &lp, flags|SV_CONST_RETURN)) #endif #ifndef SvPV_flags_const_nolen # define SvPV_flags_const_nolen(sv, flags) \ ((SvFLAGS(sv) & (SVf_POK)) == SVf_POK \ ? SvPVX_const(sv) : \ (const char*) sv_2pv_flags(sv, DPPP_SVPV_NOLEN_LP_ARG, flags|SV_CONST_RETURN)) #endif #ifndef SvPV_flags_mutable # define SvPV_flags_mutable(sv, lp, flags) \ ((SvFLAGS(sv) & (SVf_POK)) == SVf_POK \ ? ((lp = SvCUR(sv)), SvPVX_mutable(sv)) : \ sv_2pv_flags(sv, &lp, flags|SV_MUTABLE_RETURN)) #endif #ifndef SvPV_force # define SvPV_force(sv, lp) SvPV_force_flags(sv, lp, SV_GMAGIC) #endif #ifndef SvPV_force_nolen # define SvPV_force_nolen(sv) SvPV_force_flags_nolen(sv, SV_GMAGIC) #endif #ifndef SvPV_force_mutable # define SvPV_force_mutable(sv, lp) SvPV_force_flags_mutable(sv, lp, SV_GMAGIC) #endif #ifndef SvPV_force_nomg # define SvPV_force_nomg(sv, lp) SvPV_force_flags(sv, lp, 0) #endif #ifndef SvPV_force_nomg_nolen # define SvPV_force_nomg_nolen(sv) SvPV_force_flags_nolen(sv, 0) #endif #ifndef SvPV_force_flags # define SvPV_force_flags(sv, lp, flags) \ ((SvFLAGS(sv) & (SVf_POK|SVf_THINKFIRST)) == SVf_POK \ ? ((lp = SvCUR(sv)), SvPVX(sv)) : sv_pvn_force_flags(sv, &lp, flags)) #endif #ifndef SvPV_force_flags_nolen # define SvPV_force_flags_nolen(sv, flags) \ ((SvFLAGS(sv) & (SVf_POK|SVf_THINKFIRST)) == SVf_POK \ ? SvPVX(sv) : sv_pvn_force_flags(sv, DPPP_SVPV_NOLEN_LP_ARG, flags)) #endif #ifndef SvPV_force_flags_mutable # define SvPV_force_flags_mutable(sv, lp, flags) \ ((SvFLAGS(sv) & (SVf_POK|SVf_THINKFIRST)) == SVf_POK \ ? ((lp = SvCUR(sv)), SvPVX_mutable(sv)) \ : sv_pvn_force_flags(sv, &lp, flags|SV_MUTABLE_RETURN)) #endif #ifndef SvPV_nolen # define SvPV_nolen(sv) \ ((SvFLAGS(sv) & (SVf_POK)) == SVf_POK \ ? SvPVX(sv) : sv_2pv_flags(sv, DPPP_SVPV_NOLEN_LP_ARG, SV_GMAGIC)) #endif #ifndef SvPV_nolen_const # define SvPV_nolen_const(sv) \ ((SvFLAGS(sv) & (SVf_POK)) == SVf_POK \ ? SvPVX_const(sv) : sv_2pv_flags(sv, DPPP_SVPV_NOLEN_LP_ARG, SV_GMAGIC|SV_CONST_RETURN)) #endif #ifndef SvPV_nomg # define SvPV_nomg(sv, lp) SvPV_flags(sv, lp, 0) #endif #ifndef SvPV_nomg_const # define SvPV_nomg_const(sv, lp) SvPV_flags_const(sv, lp, 0) #endif #ifndef SvPV_nomg_const_nolen # define SvPV_nomg_const_nolen(sv) SvPV_flags_const_nolen(sv, 0) #endif #ifndef SvMAGIC_set # define SvMAGIC_set(sv, val) \ STMT_START { assert(SvTYPE(sv) >= SVt_PVMG); \ (((XPVMG*) SvANY(sv))->xmg_magic = (val)); } STMT_END #endif #if (PERL_BCDVERSION < 0x5009003) #ifndef SvPVX_const # define SvPVX_const(sv) ((const char*) (0 + SvPVX(sv))) #endif #ifndef SvPVX_mutable # define SvPVX_mutable(sv) (0 + SvPVX(sv)) #endif #ifndef SvRV_set # define SvRV_set(sv, val) \ STMT_START { assert(SvTYPE(sv) >= SVt_RV); \ (((XRV*) SvANY(sv))->xrv_rv = (val)); } STMT_END #endif #else #ifndef SvPVX_const # define SvPVX_const(sv) ((const char*)((sv)->sv_u.svu_pv)) #endif #ifndef SvPVX_mutable # define SvPVX_mutable(sv) ((sv)->sv_u.svu_pv) #endif #ifndef SvRV_set # define SvRV_set(sv, val) \ STMT_START { assert(SvTYPE(sv) >= SVt_RV); \ ((sv)->sv_u.svu_rv = (val)); } STMT_END #endif #endif #ifndef SvSTASH_set # define SvSTASH_set(sv, val) \ STMT_START { assert(SvTYPE(sv) >= SVt_PVMG); \ (((XPVMG*) SvANY(sv))->xmg_stash = (val)); } STMT_END #endif #if (PERL_BCDVERSION < 0x5004000) #ifndef SvUV_set # define SvUV_set(sv, val) \ STMT_START { assert(SvTYPE(sv) == SVt_IV || SvTYPE(sv) >= SVt_PVIV); \ (((XPVIV*) SvANY(sv))->xiv_iv = (IV) (val)); } STMT_END #endif #else #ifndef SvUV_set # define SvUV_set(sv, val) \ STMT_START { assert(SvTYPE(sv) == SVt_IV || SvTYPE(sv) >= SVt_PVIV); \ (((XPVUV*) SvANY(sv))->xuv_uv = (val)); } STMT_END #endif #endif #if (PERL_BCDVERSION >= 0x5004000) && !defined(vnewSVpvf) #if defined(NEED_vnewSVpvf) static SV * DPPP_(my_vnewSVpvf)(pTHX_ const char * pat, va_list * args); static #else extern SV * DPPP_(my_vnewSVpvf)(pTHX_ const char * pat, va_list * args); #endif #ifdef vnewSVpvf # undef vnewSVpvf #endif #define vnewSVpvf(a,b) DPPP_(my_vnewSVpvf)(aTHX_ a,b) #define Perl_vnewSVpvf DPPP_(my_vnewSVpvf) #if defined(NEED_vnewSVpvf) || defined(NEED_vnewSVpvf_GLOBAL) SV * DPPP_(my_vnewSVpvf)(pTHX_ const char *pat, va_list *args) { register SV *sv = newSV(0); sv_vsetpvfn(sv, pat, strlen(pat), args, Null(SV**), 0, Null(bool*)); return sv; } #endif #endif #if (PERL_BCDVERSION >= 0x5004000) && !defined(sv_vcatpvf) # define sv_vcatpvf(sv, pat, args) sv_vcatpvfn(sv, pat, strlen(pat), args, Null(SV**), 0, Null(bool*)) #endif #if (PERL_BCDVERSION >= 0x5004000) && !defined(sv_vsetpvf) # define sv_vsetpvf(sv, pat, args) sv_vsetpvfn(sv, pat, strlen(pat), args, Null(SV**), 0, Null(bool*)) #endif #if (PERL_BCDVERSION >= 0x5004000) && !defined(sv_catpvf_mg) #if defined(NEED_sv_catpvf_mg) static void DPPP_(my_sv_catpvf_mg)(pTHX_ SV * sv, const char * pat, ...); static #else extern void DPPP_(my_sv_catpvf_mg)(pTHX_ SV * sv, const char * pat, ...); #endif #define Perl_sv_catpvf_mg DPPP_(my_sv_catpvf_mg) #if defined(NEED_sv_catpvf_mg) || defined(NEED_sv_catpvf_mg_GLOBAL) void DPPP_(my_sv_catpvf_mg)(pTHX_ SV *sv, const char *pat, ...) { va_list args; va_start(args, pat); sv_vcatpvfn(sv, pat, strlen(pat), &args, Null(SV**), 0, Null(bool*)); SvSETMAGIC(sv); va_end(args); } #endif #endif #ifdef PERL_IMPLICIT_CONTEXT #if (PERL_BCDVERSION >= 0x5004000) && !defined(sv_catpvf_mg_nocontext) #if defined(NEED_sv_catpvf_mg_nocontext) static void DPPP_(my_sv_catpvf_mg_nocontext)(SV * sv, const char * pat, ...); static #else extern void DPPP_(my_sv_catpvf_mg_nocontext)(SV * sv, const char * pat, ...); #endif #define sv_catpvf_mg_nocontext DPPP_(my_sv_catpvf_mg_nocontext) #define Perl_sv_catpvf_mg_nocontext DPPP_(my_sv_catpvf_mg_nocontext) #if defined(NEED_sv_catpvf_mg_nocontext) || defined(NEED_sv_catpvf_mg_nocontext_GLOBAL) void DPPP_(my_sv_catpvf_mg_nocontext)(SV *sv, const char *pat, ...) { dTHX; va_list args; va_start(args, pat); sv_vcatpvfn(sv, pat, strlen(pat), &args, Null(SV**), 0, Null(bool*)); SvSETMAGIC(sv); va_end(args); } #endif #endif #endif /* sv_catpvf_mg depends on sv_catpvf_mg_nocontext */ #ifndef sv_catpvf_mg # ifdef PERL_IMPLICIT_CONTEXT # define sv_catpvf_mg Perl_sv_catpvf_mg_nocontext # else # define sv_catpvf_mg Perl_sv_catpvf_mg # endif #endif #if (PERL_BCDVERSION >= 0x5004000) && !defined(sv_vcatpvf_mg) # define sv_vcatpvf_mg(sv, pat, args) \ STMT_START { \ sv_vcatpvfn(sv, pat, strlen(pat), args, Null(SV**), 0, Null(bool*)); \ SvSETMAGIC(sv); \ } STMT_END #endif #if (PERL_BCDVERSION >= 0x5004000) && !defined(sv_setpvf_mg) #if defined(NEED_sv_setpvf_mg) static void DPPP_(my_sv_setpvf_mg)(pTHX_ SV * sv, const char * pat, ...); static #else extern void DPPP_(my_sv_setpvf_mg)(pTHX_ SV * sv, const char * pat, ...); #endif #define Perl_sv_setpvf_mg DPPP_(my_sv_setpvf_mg) #if defined(NEED_sv_setpvf_mg) || defined(NEED_sv_setpvf_mg_GLOBAL) void DPPP_(my_sv_setpvf_mg)(pTHX_ SV *sv, const char *pat, ...) { va_list args; va_start(args, pat); sv_vsetpvfn(sv, pat, strlen(pat), &args, Null(SV**), 0, Null(bool*)); SvSETMAGIC(sv); va_end(args); } #endif #endif #ifdef PERL_IMPLICIT_CONTEXT #if (PERL_BCDVERSION >= 0x5004000) && !defined(sv_setpvf_mg_nocontext) #if defined(NEED_sv_setpvf_mg_nocontext) static void DPPP_(my_sv_setpvf_mg_nocontext)(SV * sv, const char * pat, ...); static #else extern void DPPP_(my_sv_setpvf_mg_nocontext)(SV * sv, const char * pat, ...); #endif #define sv_setpvf_mg_nocontext DPPP_(my_sv_setpvf_mg_nocontext) #define Perl_sv_setpvf_mg_nocontext DPPP_(my_sv_setpvf_mg_nocontext) #if defined(NEED_sv_setpvf_mg_nocontext) || defined(NEED_sv_setpvf_mg_nocontext_GLOBAL) void DPPP_(my_sv_setpvf_mg_nocontext)(SV *sv, const char *pat, ...) { dTHX; va_list args; va_start(args, pat); sv_vsetpvfn(sv, pat, strlen(pat), &args, Null(SV**), 0, Null(bool*)); SvSETMAGIC(sv); va_end(args); } #endif #endif #endif /* sv_setpvf_mg depends on sv_setpvf_mg_nocontext */ #ifndef sv_setpvf_mg # ifdef PERL_IMPLICIT_CONTEXT # define sv_setpvf_mg Perl_sv_setpvf_mg_nocontext # else # define sv_setpvf_mg Perl_sv_setpvf_mg # endif #endif #if (PERL_BCDVERSION >= 0x5004000) && !defined(sv_vsetpvf_mg) # define sv_vsetpvf_mg(sv, pat, args) \ STMT_START { \ sv_vsetpvfn(sv, pat, strlen(pat), args, Null(SV**), 0, Null(bool*)); \ SvSETMAGIC(sv); \ } STMT_END #endif #ifndef newSVpvn_share #if defined(NEED_newSVpvn_share) static SV * DPPP_(my_newSVpvn_share)(pTHX_ const char *src, I32 len, U32 hash); static #else extern SV * DPPP_(my_newSVpvn_share)(pTHX_ const char *src, I32 len, U32 hash); #endif #ifdef newSVpvn_share # undef newSVpvn_share #endif #define newSVpvn_share(a,b,c) DPPP_(my_newSVpvn_share)(aTHX_ a,b,c) #define Perl_newSVpvn_share DPPP_(my_newSVpvn_share) #if defined(NEED_newSVpvn_share) || defined(NEED_newSVpvn_share_GLOBAL) SV * DPPP_(my_newSVpvn_share)(pTHX_ const char *src, I32 len, U32 hash) { SV *sv; if (len < 0) len = -len; if (!hash) PERL_HASH(hash, (char*) src, len); sv = newSVpvn((char *) src, len); sv_upgrade(sv, SVt_PVIV); SvIVX(sv) = hash; SvREADONLY_on(sv); SvPOK_on(sv); return sv; } #endif #endif #ifndef SvSHARED_HASH # define SvSHARED_HASH(sv) (0 + SvUVX(sv)) #endif #ifndef WARN_ALL # define WARN_ALL 0 #endif #ifndef WARN_CLOSURE # define WARN_CLOSURE 1 #endif #ifndef WARN_DEPRECATED # define WARN_DEPRECATED 2 #endif #ifndef WARN_EXITING # define WARN_EXITING 3 #endif #ifndef WARN_GLOB # define WARN_GLOB 4 #endif #ifndef WARN_IO # define WARN_IO 5 #endif #ifndef WARN_CLOSED # define WARN_CLOSED 6 #endif #ifndef WARN_EXEC # define WARN_EXEC 7 #endif #ifndef WARN_LAYER # define WARN_LAYER 8 #endif #ifndef WARN_NEWLINE # define WARN_NEWLINE 9 #endif #ifndef WARN_PIPE # define WARN_PIPE 10 #endif #ifndef WARN_UNOPENED # define WARN_UNOPENED 11 #endif #ifndef WARN_MISC # define WARN_MISC 12 #endif #ifndef WARN_NUMERIC # define WARN_NUMERIC 13 #endif #ifndef WARN_ONCE # define WARN_ONCE 14 #endif #ifndef WARN_OVERFLOW # define WARN_OVERFLOW 15 #endif #ifndef WARN_PACK # define WARN_PACK 16 #endif #ifndef WARN_PORTABLE # define WARN_PORTABLE 17 #endif #ifndef WARN_RECURSION # define WARN_RECURSION 18 #endif #ifndef WARN_REDEFINE # define WARN_REDEFINE 19 #endif #ifndef WARN_REGEXP # define WARN_REGEXP 20 #endif #ifndef WARN_SEVERE # define WARN_SEVERE 21 #endif #ifndef WARN_DEBUGGING # define WARN_DEBUGGING 22 #endif #ifndef WARN_INPLACE # define WARN_INPLACE 23 #endif #ifndef WARN_INTERNAL # define WARN_INTERNAL 24 #endif #ifndef WARN_MALLOC # define WARN_MALLOC 25 #endif #ifndef WARN_SIGNAL # define WARN_SIGNAL 26 #endif #ifndef WARN_SUBSTR # define WARN_SUBSTR 27 #endif #ifndef WARN_SYNTAX # define WARN_SYNTAX 28 #endif #ifndef WARN_AMBIGUOUS # define WARN_AMBIGUOUS 29 #endif #ifndef WARN_BAREWORD # define WARN_BAREWORD 30 #endif #ifndef WARN_DIGIT # define WARN_DIGIT 31 #endif #ifndef WARN_PARENTHESIS # define WARN_PARENTHESIS 32 #endif #ifndef WARN_PRECEDENCE # define WARN_PRECEDENCE 33 #endif #ifndef WARN_PRINTF # define WARN_PRINTF 34 #endif #ifndef WARN_PROTOTYPE # define WARN_PROTOTYPE 35 #endif #ifndef WARN_QW # define WARN_QW 36 #endif #ifndef WARN_RESERVED # define WARN_RESERVED 37 #endif #ifndef WARN_SEMICOLON # define WARN_SEMICOLON 38 #endif #ifndef WARN_TAINT # define WARN_TAINT 39 #endif #ifndef WARN_THREADS # define WARN_THREADS 40 #endif #ifndef WARN_UNINITIALIZED # define WARN_UNINITIALIZED 41 #endif #ifndef WARN_UNPACK # define WARN_UNPACK 42 #endif #ifndef WARN_UNTIE # define WARN_UNTIE 43 #endif #ifndef WARN_UTF8 # define WARN_UTF8 44 #endif #ifndef WARN_VOID # define WARN_VOID 45 #endif #ifndef WARN_ASSERTIONS # define WARN_ASSERTIONS 46 #endif #ifndef packWARN # define packWARN(a) (a) #endif #ifndef ckWARN # ifdef G_WARN_ON # define ckWARN(a) (PL_dowarn & G_WARN_ON) # else # define ckWARN(a) PL_dowarn # endif #endif #if (PERL_BCDVERSION >= 0x5004000) && !defined(warner) #if defined(NEED_warner) static void DPPP_(my_warner)(U32 err, const char *pat, ...); static #else extern void DPPP_(my_warner)(U32 err, const char *pat, ...); #endif #define Perl_warner DPPP_(my_warner) #if defined(NEED_warner) || defined(NEED_warner_GLOBAL) void DPPP_(my_warner)(U32 err, const char *pat, ...) { SV *sv; va_list args; PERL_UNUSED_ARG(err); va_start(args, pat); sv = vnewSVpvf(pat, &args); va_end(args); sv_2mortal(sv); warn("%s", SvPV_nolen(sv)); } #define warner Perl_warner #define Perl_warner_nocontext Perl_warner #endif #endif /* concatenating with "" ensures that only literal strings are accepted as argument * note that STR_WITH_LEN() can't be used as argument to macros or functions that * under some configurations might be macros */ #ifndef STR_WITH_LEN # define STR_WITH_LEN(s) (s ""), (sizeof(s)-1) #endif #ifndef newSVpvs # define newSVpvs(str) newSVpvn(str "", sizeof(str) - 1) #endif #ifndef newSVpvs_flags # define newSVpvs_flags(str, flags) newSVpvn_flags(str "", sizeof(str) - 1, flags) #endif #ifndef sv_catpvs # define sv_catpvs(sv, str) sv_catpvn(sv, str "", sizeof(str) - 1) #endif #ifndef sv_setpvs # define sv_setpvs(sv, str) sv_setpvn(sv, str "", sizeof(str) - 1) #endif #ifndef hv_fetchs # define hv_fetchs(hv, key, lval) hv_fetch(hv, key "", sizeof(key) - 1, lval) #endif #ifndef hv_stores # define hv_stores(hv, key, val) hv_store(hv, key "", sizeof(key) - 1, val, 0) #endif #ifndef SvGETMAGIC # define SvGETMAGIC(x) STMT_START { if (SvGMAGICAL(x)) mg_get(x); } STMT_END #endif #ifndef PERL_MAGIC_sv # define PERL_MAGIC_sv '\0' #endif #ifndef PERL_MAGIC_overload # define PERL_MAGIC_overload 'A' #endif #ifndef PERL_MAGIC_overload_elem # define PERL_MAGIC_overload_elem 'a' #endif #ifndef PERL_MAGIC_overload_table # define PERL_MAGIC_overload_table 'c' #endif #ifndef PERL_MAGIC_bm # define PERL_MAGIC_bm 'B' #endif #ifndef PERL_MAGIC_regdata # define PERL_MAGIC_regdata 'D' #endif #ifndef PERL_MAGIC_regdatum # define PERL_MAGIC_regdatum 'd' #endif #ifndef PERL_MAGIC_env # define PERL_MAGIC_env 'E' #endif #ifndef PERL_MAGIC_envelem # define PERL_MAGIC_envelem 'e' #endif #ifndef PERL_MAGIC_fm # define PERL_MAGIC_fm 'f' #endif #ifndef PERL_MAGIC_regex_global # define PERL_MAGIC_regex_global 'g' #endif #ifndef PERL_MAGIC_isa # define PERL_MAGIC_isa 'I' #endif #ifndef PERL_MAGIC_isaelem # define PERL_MAGIC_isaelem 'i' #endif #ifndef PERL_MAGIC_nkeys # define PERL_MAGIC_nkeys 'k' #endif #ifndef PERL_MAGIC_dbfile # define PERL_MAGIC_dbfile 'L' #endif #ifndef PERL_MAGIC_dbline # define PERL_MAGIC_dbline 'l' #endif #ifndef PERL_MAGIC_mutex # define PERL_MAGIC_mutex 'm' #endif #ifndef PERL_MAGIC_shared # define PERL_MAGIC_shared 'N' #endif #ifndef PERL_MAGIC_shared_scalar # define PERL_MAGIC_shared_scalar 'n' #endif #ifndef PERL_MAGIC_collxfrm # define PERL_MAGIC_collxfrm 'o' #endif #ifndef PERL_MAGIC_tied # define PERL_MAGIC_tied 'P' #endif #ifndef PERL_MAGIC_tiedelem # define PERL_MAGIC_tiedelem 'p' #endif #ifndef PERL_MAGIC_tiedscalar # define PERL_MAGIC_tiedscalar 'q' #endif #ifndef PERL_MAGIC_qr # define PERL_MAGIC_qr 'r' #endif #ifndef PERL_MAGIC_sig # define PERL_MAGIC_sig 'S' #endif #ifndef PERL_MAGIC_sigelem # define PERL_MAGIC_sigelem 's' #endif #ifndef PERL_MAGIC_taint # define PERL_MAGIC_taint 't' #endif #ifndef PERL_MAGIC_uvar # define PERL_MAGIC_uvar 'U' #endif #ifndef PERL_MAGIC_uvar_elem # define PERL_MAGIC_uvar_elem 'u' #endif #ifndef PERL_MAGIC_vstring # define PERL_MAGIC_vstring 'V' #endif #ifndef PERL_MAGIC_vec # define PERL_MAGIC_vec 'v' #endif #ifndef PERL_MAGIC_utf8 # define PERL_MAGIC_utf8 'w' #endif #ifndef PERL_MAGIC_substr # define PERL_MAGIC_substr 'x' #endif #ifndef PERL_MAGIC_defelem # define PERL_MAGIC_defelem 'y' #endif #ifndef PERL_MAGIC_glob # define PERL_MAGIC_glob '*' #endif #ifndef PERL_MAGIC_arylen # define PERL_MAGIC_arylen '#' #endif #ifndef PERL_MAGIC_pos # define PERL_MAGIC_pos '.' #endif #ifndef PERL_MAGIC_backref # define PERL_MAGIC_backref '<' #endif #ifndef PERL_MAGIC_ext # define PERL_MAGIC_ext '~' #endif /* That's the best we can do... */ #ifndef sv_catpvn_nomg # define sv_catpvn_nomg sv_catpvn #endif #ifndef sv_catsv_nomg # define sv_catsv_nomg sv_catsv #endif #ifndef sv_setsv_nomg # define sv_setsv_nomg sv_setsv #endif #ifndef sv_pvn_nomg # define sv_pvn_nomg sv_pvn #endif #ifndef SvIV_nomg # define SvIV_nomg SvIV #endif #ifndef SvUV_nomg # define SvUV_nomg SvUV #endif #ifndef sv_catpv_mg # define sv_catpv_mg(sv, ptr) \ STMT_START { \ SV *TeMpSv = sv; \ sv_catpv(TeMpSv,ptr); \ SvSETMAGIC(TeMpSv); \ } STMT_END #endif #ifndef sv_catpvn_mg # define sv_catpvn_mg(sv, ptr, len) \ STMT_START { \ SV *TeMpSv = sv; \ sv_catpvn(TeMpSv,ptr,len); \ SvSETMAGIC(TeMpSv); \ } STMT_END #endif #ifndef sv_catsv_mg # define sv_catsv_mg(dsv, ssv) \ STMT_START { \ SV *TeMpSv = dsv; \ sv_catsv(TeMpSv,ssv); \ SvSETMAGIC(TeMpSv); \ } STMT_END #endif #ifndef sv_setiv_mg # define sv_setiv_mg(sv, i) \ STMT_START { \ SV *TeMpSv = sv; \ sv_setiv(TeMpSv,i); \ SvSETMAGIC(TeMpSv); \ } STMT_END #endif #ifndef sv_setnv_mg # define sv_setnv_mg(sv, num) \ STMT_START { \ SV *TeMpSv = sv; \ sv_setnv(TeMpSv,num); \ SvSETMAGIC(TeMpSv); \ } STMT_END #endif #ifndef sv_setpv_mg # define sv_setpv_mg(sv, ptr) \ STMT_START { \ SV *TeMpSv = sv; \ sv_setpv(TeMpSv,ptr); \ SvSETMAGIC(TeMpSv); \ } STMT_END #endif #ifndef sv_setpvn_mg # define sv_setpvn_mg(sv, ptr, len) \ STMT_START { \ SV *TeMpSv = sv; \ sv_setpvn(TeMpSv,ptr,len); \ SvSETMAGIC(TeMpSv); \ } STMT_END #endif #ifndef sv_setsv_mg # define sv_setsv_mg(dsv, ssv) \ STMT_START { \ SV *TeMpSv = dsv; \ sv_setsv(TeMpSv,ssv); \ SvSETMAGIC(TeMpSv); \ } STMT_END #endif #ifndef sv_setuv_mg # define sv_setuv_mg(sv, i) \ STMT_START { \ SV *TeMpSv = sv; \ sv_setuv(TeMpSv,i); \ SvSETMAGIC(TeMpSv); \ } STMT_END #endif #ifndef sv_usepvn_mg # define sv_usepvn_mg(sv, ptr, len) \ STMT_START { \ SV *TeMpSv = sv; \ sv_usepvn(TeMpSv,ptr,len); \ SvSETMAGIC(TeMpSv); \ } STMT_END #endif #ifndef SvVSTRING_mg # define SvVSTRING_mg(sv) (SvMAGICAL(sv) ? mg_find(sv, PERL_MAGIC_vstring) : NULL) #endif /* Hint: sv_magic_portable * This is a compatibility function that is only available with * Devel::PPPort. It is NOT in the perl core. * Its purpose is to mimic the 5.8.0 behaviour of sv_magic() when * it is being passed a name pointer with namlen == 0. In that * case, perl 5.8.0 and later store the pointer, not a copy of it. * The compatibility can be provided back to perl 5.004. With * earlier versions, the code will not compile. */ #if (PERL_BCDVERSION < 0x5004000) /* code that uses sv_magic_portable will not compile */ #elif (PERL_BCDVERSION < 0x5008000) # define sv_magic_portable(sv, obj, how, name, namlen) \ STMT_START { \ SV *SvMp_sv = (sv); \ char *SvMp_name = (char *) (name); \ I32 SvMp_namlen = (namlen); \ if (SvMp_name && SvMp_namlen == 0) \ { \ MAGIC *mg; \ sv_magic(SvMp_sv, obj, how, 0, 0); \ mg = SvMAGIC(SvMp_sv); \ mg->mg_len = -42; /* XXX: this is the tricky part */ \ mg->mg_ptr = SvMp_name; \ } \ else \ { \ sv_magic(SvMp_sv, obj, how, SvMp_name, SvMp_namlen); \ } \ } STMT_END #else # define sv_magic_portable(a, b, c, d, e) sv_magic(a, b, c, d, e) #endif #ifdef USE_ITHREADS #ifndef CopFILE # define CopFILE(c) ((c)->cop_file) #endif #ifndef CopFILEGV # define CopFILEGV(c) (CopFILE(c) ? gv_fetchfile(CopFILE(c)) : Nullgv) #endif #ifndef CopFILE_set # define CopFILE_set(c,pv) ((c)->cop_file = savepv(pv)) #endif #ifndef CopFILESV # define CopFILESV(c) (CopFILE(c) ? GvSV(gv_fetchfile(CopFILE(c))) : Nullsv) #endif #ifndef CopFILEAV # define CopFILEAV(c) (CopFILE(c) ? GvAV(gv_fetchfile(CopFILE(c))) : Nullav) #endif #ifndef CopSTASHPV # define CopSTASHPV(c) ((c)->cop_stashpv) #endif #ifndef CopSTASHPV_set # define CopSTASHPV_set(c,pv) ((c)->cop_stashpv = ((pv) ? savepv(pv) : Nullch)) #endif #ifndef CopSTASH # define CopSTASH(c) (CopSTASHPV(c) ? gv_stashpv(CopSTASHPV(c),GV_ADD) : Nullhv) #endif #ifndef CopSTASH_set # define CopSTASH_set(c,hv) CopSTASHPV_set(c, (hv) ? HvNAME(hv) : Nullch) #endif #ifndef CopSTASH_eq # define CopSTASH_eq(c,hv) ((hv) && (CopSTASHPV(c) == HvNAME(hv) \ || (CopSTASHPV(c) && HvNAME(hv) \ && strEQ(CopSTASHPV(c), HvNAME(hv))))) #endif #else #ifndef CopFILEGV # define CopFILEGV(c) ((c)->cop_filegv) #endif #ifndef CopFILEGV_set # define CopFILEGV_set(c,gv) ((c)->cop_filegv = (GV*)SvREFCNT_inc(gv)) #endif #ifndef CopFILE_set # define CopFILE_set(c,pv) CopFILEGV_set((c), gv_fetchfile(pv)) #endif #ifndef CopFILESV # define CopFILESV(c) (CopFILEGV(c) ? GvSV(CopFILEGV(c)) : Nullsv) #endif #ifndef CopFILEAV # define CopFILEAV(c) (CopFILEGV(c) ? GvAV(CopFILEGV(c)) : Nullav) #endif #ifndef CopFILE # define CopFILE(c) (CopFILESV(c) ? SvPVX(CopFILESV(c)) : Nullch) #endif #ifndef CopSTASH # define CopSTASH(c) ((c)->cop_stash) #endif #ifndef CopSTASH_set # define CopSTASH_set(c,hv) ((c)->cop_stash = (hv)) #endif #ifndef CopSTASHPV # define CopSTASHPV(c) (CopSTASH(c) ? HvNAME(CopSTASH(c)) : Nullch) #endif #ifndef CopSTASHPV_set # define CopSTASHPV_set(c,pv) CopSTASH_set((c), gv_stashpv(pv,GV_ADD)) #endif #ifndef CopSTASH_eq # define CopSTASH_eq(c,hv) (CopSTASH(c) == (hv)) #endif #endif /* USE_ITHREADS */ #ifndef IN_PERL_COMPILETIME # define IN_PERL_COMPILETIME (PL_curcop == &PL_compiling) #endif #ifndef IN_LOCALE_RUNTIME # define IN_LOCALE_RUNTIME (PL_curcop->op_private & HINT_LOCALE) #endif #ifndef IN_LOCALE_COMPILETIME # define IN_LOCALE_COMPILETIME (PL_hints & HINT_LOCALE) #endif #ifndef IN_LOCALE # define IN_LOCALE (IN_PERL_COMPILETIME ? IN_LOCALE_COMPILETIME : IN_LOCALE_RUNTIME) #endif #ifndef IS_NUMBER_IN_UV # define IS_NUMBER_IN_UV 0x01 #endif #ifndef IS_NUMBER_GREATER_THAN_UV_MAX # define IS_NUMBER_GREATER_THAN_UV_MAX 0x02 #endif #ifndef IS_NUMBER_NOT_INT # define IS_NUMBER_NOT_INT 0x04 #endif #ifndef IS_NUMBER_NEG # define IS_NUMBER_NEG 0x08 #endif #ifndef IS_NUMBER_INFINITY # define IS_NUMBER_INFINITY 0x10 #endif #ifndef IS_NUMBER_NAN # define IS_NUMBER_NAN 0x20 #endif #ifndef GROK_NUMERIC_RADIX # define GROK_NUMERIC_RADIX(sp, send) grok_numeric_radix(sp, send) #endif #ifndef PERL_SCAN_GREATER_THAN_UV_MAX # define PERL_SCAN_GREATER_THAN_UV_MAX 0x02 #endif #ifndef PERL_SCAN_SILENT_ILLDIGIT # define PERL_SCAN_SILENT_ILLDIGIT 0x04 #endif #ifndef PERL_SCAN_ALLOW_UNDERSCORES # define PERL_SCAN_ALLOW_UNDERSCORES 0x01 #endif #ifndef PERL_SCAN_DISALLOW_PREFIX # define PERL_SCAN_DISALLOW_PREFIX 0x02 #endif #ifndef grok_numeric_radix #if defined(NEED_grok_numeric_radix) static bool DPPP_(my_grok_numeric_radix)(pTHX_ const char ** sp, const char * send); static #else extern bool DPPP_(my_grok_numeric_radix)(pTHX_ const char ** sp, const char * send); #endif #ifdef grok_numeric_radix # undef grok_numeric_radix #endif #define grok_numeric_radix(a,b) DPPP_(my_grok_numeric_radix)(aTHX_ a,b) #define Perl_grok_numeric_radix DPPP_(my_grok_numeric_radix) #if defined(NEED_grok_numeric_radix) || defined(NEED_grok_numeric_radix_GLOBAL) bool DPPP_(my_grok_numeric_radix)(pTHX_ const char **sp, const char *send) { #ifdef USE_LOCALE_NUMERIC #ifdef PL_numeric_radix_sv if (PL_numeric_radix_sv && IN_LOCALE) { STRLEN len; char* radix = SvPV(PL_numeric_radix_sv, len); if (*sp + len <= send && memEQ(*sp, radix, len)) { *sp += len; return TRUE; } } #else /* older perls don't have PL_numeric_radix_sv so the radix * must manually be requested from locale.h */ #include dTHR; /* needed for older threaded perls */ struct lconv *lc = localeconv(); char *radix = lc->decimal_point; if (radix && IN_LOCALE) { STRLEN len = strlen(radix); if (*sp + len <= send && memEQ(*sp, radix, len)) { *sp += len; return TRUE; } } #endif #endif /* USE_LOCALE_NUMERIC */ /* always try "." if numeric radix didn't match because * we may have data from different locales mixed */ if (*sp < send && **sp == '.') { ++*sp; return TRUE; } return FALSE; } #endif #endif #ifndef grok_number #if defined(NEED_grok_number) static int DPPP_(my_grok_number)(pTHX_ const char * pv, STRLEN len, UV * valuep); static #else extern int DPPP_(my_grok_number)(pTHX_ const char * pv, STRLEN len, UV * valuep); #endif #ifdef grok_number # undef grok_number #endif #define grok_number(a,b,c) DPPP_(my_grok_number)(aTHX_ a,b,c) #define Perl_grok_number DPPP_(my_grok_number) #if defined(NEED_grok_number) || defined(NEED_grok_number_GLOBAL) int DPPP_(my_grok_number)(pTHX_ const char *pv, STRLEN len, UV *valuep) { const char *s = pv; const char *send = pv + len; const UV max_div_10 = UV_MAX / 10; const char max_mod_10 = UV_MAX % 10; int numtype = 0; int sawinf = 0; int sawnan = 0; while (s < send && isSPACE(*s)) s++; if (s == send) { return 0; } else if (*s == '-') { s++; numtype = IS_NUMBER_NEG; } else if (*s == '+') s++; if (s == send) return 0; /* next must be digit or the radix separator or beginning of infinity */ if (isDIGIT(*s)) { /* UVs are at least 32 bits, so the first 9 decimal digits cannot overflow. */ UV value = *s - '0'; /* This construction seems to be more optimiser friendly. (without it gcc does the isDIGIT test and the *s - '0' separately) With it gcc on arm is managing 6 instructions (6 cycles) per digit. In theory the optimiser could deduce how far to unroll the loop before checking for overflow. */ if (++s < send) { int digit = *s - '0'; if (digit >= 0 && digit <= 9) { value = value * 10 + digit; if (++s < send) { digit = *s - '0'; if (digit >= 0 && digit <= 9) { value = value * 10 + digit; if (++s < send) { digit = *s - '0'; if (digit >= 0 && digit <= 9) { value = value * 10 + digit; if (++s < send) { digit = *s - '0'; if (digit >= 0 && digit <= 9) { value = value * 10 + digit; if (++s < send) { digit = *s - '0'; if (digit >= 0 && digit <= 9) { value = value * 10 + digit; if (++s < send) { digit = *s - '0'; if (digit >= 0 && digit <= 9) { value = value * 10 + digit; if (++s < send) { digit = *s - '0'; if (digit >= 0 && digit <= 9) { value = value * 10 + digit; if (++s < send) { digit = *s - '0'; if (digit >= 0 && digit <= 9) { value = value * 10 + digit; if (++s < send) { /* Now got 9 digits, so need to check each time for overflow. */ digit = *s - '0'; while (digit >= 0 && digit <= 9 && (value < max_div_10 || (value == max_div_10 && digit <= max_mod_10))) { value = value * 10 + digit; if (++s < send) digit = *s - '0'; else break; } if (digit >= 0 && digit <= 9 && (s < send)) { /* value overflowed. skip the remaining digits, don't worry about setting *valuep. */ do { s++; } while (s < send && isDIGIT(*s)); numtype |= IS_NUMBER_GREATER_THAN_UV_MAX; goto skip_value; } } } } } } } } } } } } } } } } } } numtype |= IS_NUMBER_IN_UV; if (valuep) *valuep = value; skip_value: if (GROK_NUMERIC_RADIX(&s, send)) { numtype |= IS_NUMBER_NOT_INT; while (s < send && isDIGIT(*s)) /* optional digits after the radix */ s++; } } else if (GROK_NUMERIC_RADIX(&s, send)) { numtype |= IS_NUMBER_NOT_INT | IS_NUMBER_IN_UV; /* valuep assigned below */ /* no digits before the radix means we need digits after it */ if (s < send && isDIGIT(*s)) { do { s++; } while (s < send && isDIGIT(*s)); if (valuep) { /* integer approximation is valid - it's 0. */ *valuep = 0; } } else return 0; } else if (*s == 'I' || *s == 'i') { s++; if (s == send || (*s != 'N' && *s != 'n')) return 0; s++; if (s == send || (*s != 'F' && *s != 'f')) return 0; s++; if (s < send && (*s == 'I' || *s == 'i')) { s++; if (s == send || (*s != 'N' && *s != 'n')) return 0; s++; if (s == send || (*s != 'I' && *s != 'i')) return 0; s++; if (s == send || (*s != 'T' && *s != 't')) return 0; s++; if (s == send || (*s != 'Y' && *s != 'y')) return 0; s++; } sawinf = 1; } else if (*s == 'N' || *s == 'n') { /* XXX TODO: There are signaling NaNs and quiet NaNs. */ s++; if (s == send || (*s != 'A' && *s != 'a')) return 0; s++; if (s == send || (*s != 'N' && *s != 'n')) return 0; s++; sawnan = 1; } else return 0; if (sawinf) { numtype &= IS_NUMBER_NEG; /* Keep track of sign */ numtype |= IS_NUMBER_INFINITY | IS_NUMBER_NOT_INT; } else if (sawnan) { numtype &= IS_NUMBER_NEG; /* Keep track of sign */ numtype |= IS_NUMBER_NAN | IS_NUMBER_NOT_INT; } else if (s < send) { /* we can have an optional exponent part */ if (*s == 'e' || *s == 'E') { /* The only flag we keep is sign. Blow away any "it's UV" */ numtype &= IS_NUMBER_NEG; numtype |= IS_NUMBER_NOT_INT; s++; if (s < send && (*s == '-' || *s == '+')) s++; if (s < send && isDIGIT(*s)) { do { s++; } while (s < send && isDIGIT(*s)); } else return 0; } } while (s < send && isSPACE(*s)) s++; if (s >= send) return numtype; if (len == 10 && memEQ(pv, "0 but true", 10)) { if (valuep) *valuep = 0; return IS_NUMBER_IN_UV; } return 0; } #endif #endif /* * The grok_* routines have been modified to use warn() instead of * Perl_warner(). Also, 'hexdigit' was the former name of PL_hexdigit, * which is why the stack variable has been renamed to 'xdigit'. */ #ifndef grok_bin #if defined(NEED_grok_bin) static UV DPPP_(my_grok_bin)(pTHX_ const char * start, STRLEN * len_p, I32 * flags, NV * result); static #else extern UV DPPP_(my_grok_bin)(pTHX_ const char * start, STRLEN * len_p, I32 * flags, NV * result); #endif #ifdef grok_bin # undef grok_bin #endif #define grok_bin(a,b,c,d) DPPP_(my_grok_bin)(aTHX_ a,b,c,d) #define Perl_grok_bin DPPP_(my_grok_bin) #if defined(NEED_grok_bin) || defined(NEED_grok_bin_GLOBAL) UV DPPP_(my_grok_bin)(pTHX_ const char *start, STRLEN *len_p, I32 *flags, NV *result) { const char *s = start; STRLEN len = *len_p; UV value = 0; NV value_nv = 0; const UV max_div_2 = UV_MAX / 2; bool allow_underscores = *flags & PERL_SCAN_ALLOW_UNDERSCORES; bool overflowed = FALSE; if (!(*flags & PERL_SCAN_DISALLOW_PREFIX)) { /* strip off leading b or 0b. for compatibility silently suffer "b" and "0b" as valid binary numbers. */ if (len >= 1) { if (s[0] == 'b') { s++; len--; } else if (len >= 2 && s[0] == '0' && s[1] == 'b') { s+=2; len-=2; } } } for (; len-- && *s; s++) { char bit = *s; if (bit == '0' || bit == '1') { /* Write it in this wonky order with a goto to attempt to get the compiler to make the common case integer-only loop pretty tight. With gcc seems to be much straighter code than old scan_bin. */ redo: if (!overflowed) { if (value <= max_div_2) { value = (value << 1) | (bit - '0'); continue; } /* Bah. We're just overflowed. */ warn("Integer overflow in binary number"); overflowed = TRUE; value_nv = (NV) value; } value_nv *= 2.0; /* If an NV has not enough bits in its mantissa to * represent a UV this summing of small low-order numbers * is a waste of time (because the NV cannot preserve * the low-order bits anyway): we could just remember when * did we overflow and in the end just multiply value_nv by the * right amount. */ value_nv += (NV)(bit - '0'); continue; } if (bit == '_' && len && allow_underscores && (bit = s[1]) && (bit == '0' || bit == '1')) { --len; ++s; goto redo; } if (!(*flags & PERL_SCAN_SILENT_ILLDIGIT)) warn("Illegal binary digit '%c' ignored", *s); break; } if ( ( overflowed && value_nv > 4294967295.0) #if UVSIZE > 4 || (!overflowed && value > 0xffffffff ) #endif ) { warn("Binary number > 0b11111111111111111111111111111111 non-portable"); } *len_p = s - start; if (!overflowed) { *flags = 0; return value; } *flags = PERL_SCAN_GREATER_THAN_UV_MAX; if (result) *result = value_nv; return UV_MAX; } #endif #endif #ifndef grok_hex #if defined(NEED_grok_hex) static UV DPPP_(my_grok_hex)(pTHX_ const char * start, STRLEN * len_p, I32 * flags, NV * result); static #else extern UV DPPP_(my_grok_hex)(pTHX_ const char * start, STRLEN * len_p, I32 * flags, NV * result); #endif #ifdef grok_hex # undef grok_hex #endif #define grok_hex(a,b,c,d) DPPP_(my_grok_hex)(aTHX_ a,b,c,d) #define Perl_grok_hex DPPP_(my_grok_hex) #if defined(NEED_grok_hex) || defined(NEED_grok_hex_GLOBAL) UV DPPP_(my_grok_hex)(pTHX_ const char *start, STRLEN *len_p, I32 *flags, NV *result) { const char *s = start; STRLEN len = *len_p; UV value = 0; NV value_nv = 0; const UV max_div_16 = UV_MAX / 16; bool allow_underscores = *flags & PERL_SCAN_ALLOW_UNDERSCORES; bool overflowed = FALSE; const char *xdigit; if (!(*flags & PERL_SCAN_DISALLOW_PREFIX)) { /* strip off leading x or 0x. for compatibility silently suffer "x" and "0x" as valid hex numbers. */ if (len >= 1) { if (s[0] == 'x') { s++; len--; } else if (len >= 2 && s[0] == '0' && s[1] == 'x') { s+=2; len-=2; } } } for (; len-- && *s; s++) { xdigit = strchr((char *) PL_hexdigit, *s); if (xdigit) { /* Write it in this wonky order with a goto to attempt to get the compiler to make the common case integer-only loop pretty tight. With gcc seems to be much straighter code than old scan_hex. */ redo: if (!overflowed) { if (value <= max_div_16) { value = (value << 4) | ((xdigit - PL_hexdigit) & 15); continue; } warn("Integer overflow in hexadecimal number"); overflowed = TRUE; value_nv = (NV) value; } value_nv *= 16.0; /* If an NV has not enough bits in its mantissa to * represent a UV this summing of small low-order numbers * is a waste of time (because the NV cannot preserve * the low-order bits anyway): we could just remember when * did we overflow and in the end just multiply value_nv by the * right amount of 16-tuples. */ value_nv += (NV)((xdigit - PL_hexdigit) & 15); continue; } if (*s == '_' && len && allow_underscores && s[1] && (xdigit = strchr((char *) PL_hexdigit, s[1]))) { --len; ++s; goto redo; } if (!(*flags & PERL_SCAN_SILENT_ILLDIGIT)) warn("Illegal hexadecimal digit '%c' ignored", *s); break; } if ( ( overflowed && value_nv > 4294967295.0) #if UVSIZE > 4 || (!overflowed && value > 0xffffffff ) #endif ) { warn("Hexadecimal number > 0xffffffff non-portable"); } *len_p = s - start; if (!overflowed) { *flags = 0; return value; } *flags = PERL_SCAN_GREATER_THAN_UV_MAX; if (result) *result = value_nv; return UV_MAX; } #endif #endif #ifndef grok_oct #if defined(NEED_grok_oct) static UV DPPP_(my_grok_oct)(pTHX_ const char * start, STRLEN * len_p, I32 * flags, NV * result); static #else extern UV DPPP_(my_grok_oct)(pTHX_ const char * start, STRLEN * len_p, I32 * flags, NV * result); #endif #ifdef grok_oct # undef grok_oct #endif #define grok_oct(a,b,c,d) DPPP_(my_grok_oct)(aTHX_ a,b,c,d) #define Perl_grok_oct DPPP_(my_grok_oct) #if defined(NEED_grok_oct) || defined(NEED_grok_oct_GLOBAL) UV DPPP_(my_grok_oct)(pTHX_ const char *start, STRLEN *len_p, I32 *flags, NV *result) { const char *s = start; STRLEN len = *len_p; UV value = 0; NV value_nv = 0; const UV max_div_8 = UV_MAX / 8; bool allow_underscores = *flags & PERL_SCAN_ALLOW_UNDERSCORES; bool overflowed = FALSE; for (; len-- && *s; s++) { /* gcc 2.95 optimiser not smart enough to figure that this subtraction out front allows slicker code. */ int digit = *s - '0'; if (digit >= 0 && digit <= 7) { /* Write it in this wonky order with a goto to attempt to get the compiler to make the common case integer-only loop pretty tight. */ redo: if (!overflowed) { if (value <= max_div_8) { value = (value << 3) | digit; continue; } /* Bah. We're just overflowed. */ warn("Integer overflow in octal number"); overflowed = TRUE; value_nv = (NV) value; } value_nv *= 8.0; /* If an NV has not enough bits in its mantissa to * represent a UV this summing of small low-order numbers * is a waste of time (because the NV cannot preserve * the low-order bits anyway): we could just remember when * did we overflow and in the end just multiply value_nv by the * right amount of 8-tuples. */ value_nv += (NV)digit; continue; } if (digit == ('_' - '0') && len && allow_underscores && (digit = s[1] - '0') && (digit >= 0 && digit <= 7)) { --len; ++s; goto redo; } /* Allow \octal to work the DWIM way (that is, stop scanning * as soon as non-octal characters are seen, complain only iff * someone seems to want to use the digits eight and nine). */ if (digit == 8 || digit == 9) { if (!(*flags & PERL_SCAN_SILENT_ILLDIGIT)) warn("Illegal octal digit '%c' ignored", *s); } break; } if ( ( overflowed && value_nv > 4294967295.0) #if UVSIZE > 4 || (!overflowed && value > 0xffffffff ) #endif ) { warn("Octal number > 037777777777 non-portable"); } *len_p = s - start; if (!overflowed) { *flags = 0; return value; } *flags = PERL_SCAN_GREATER_THAN_UV_MAX; if (result) *result = value_nv; return UV_MAX; } #endif #endif #if !defined(my_snprintf) #if defined(NEED_my_snprintf) static int DPPP_(my_my_snprintf)(char * buffer, const Size_t len, const char * format, ...); static #else extern int DPPP_(my_my_snprintf)(char * buffer, const Size_t len, const char * format, ...); #endif #define my_snprintf DPPP_(my_my_snprintf) #define Perl_my_snprintf DPPP_(my_my_snprintf) #if defined(NEED_my_snprintf) || defined(NEED_my_snprintf_GLOBAL) int DPPP_(my_my_snprintf)(char *buffer, const Size_t len, const char *format, ...) { dTHX; int retval; va_list ap; va_start(ap, format); #ifdef HAS_VSNPRINTF retval = vsnprintf(buffer, len, format, ap); #else retval = vsprintf(buffer, format, ap); #endif va_end(ap); if (retval >= (int)len) Perl_croak(aTHX_ "panic: my_snprintf buffer overflow"); return retval; } #endif #endif #ifdef NO_XSLOCKS # ifdef dJMPENV # define dXCPT dJMPENV; int rEtV = 0 # define XCPT_TRY_START JMPENV_PUSH(rEtV); if (rEtV == 0) # define XCPT_TRY_END JMPENV_POP; # define XCPT_CATCH if (rEtV != 0) # define XCPT_RETHROW JMPENV_JUMP(rEtV) # else # define dXCPT Sigjmp_buf oldTOP; int rEtV = 0 # define XCPT_TRY_START Copy(top_env, oldTOP, 1, Sigjmp_buf); rEtV = Sigsetjmp(top_env, 1); if (rEtV == 0) # define XCPT_TRY_END Copy(oldTOP, top_env, 1, Sigjmp_buf); # define XCPT_CATCH if (rEtV != 0) # define XCPT_RETHROW Siglongjmp(top_env, rEtV) # endif #endif #if !defined(my_strlcat) #if defined(NEED_my_strlcat) static Size_t DPPP_(my_my_strlcat)(char * dst, const char * src, Size_t size); static #else extern Size_t DPPP_(my_my_strlcat)(char * dst, const char * src, Size_t size); #endif #define my_strlcat DPPP_(my_my_strlcat) #define Perl_my_strlcat DPPP_(my_my_strlcat) #if defined(NEED_my_strlcat) || defined(NEED_my_strlcat_GLOBAL) Size_t DPPP_(my_my_strlcat)(char *dst, const char *src, Size_t size) { Size_t used, length, copy; used = strlen(dst); length = strlen(src); if (size > 0 && used < size - 1) { copy = (length >= size - used) ? size - used - 1 : length; memcpy(dst + used, src, copy); dst[used + copy] = '\0'; } return used + length; } #endif #endif #if !defined(my_strlcpy) #if defined(NEED_my_strlcpy) static Size_t DPPP_(my_my_strlcpy)(char * dst, const char * src, Size_t size); static #else extern Size_t DPPP_(my_my_strlcpy)(char * dst, const char * src, Size_t size); #endif #define my_strlcpy DPPP_(my_my_strlcpy) #define Perl_my_strlcpy DPPP_(my_my_strlcpy) #if defined(NEED_my_strlcpy) || defined(NEED_my_strlcpy_GLOBAL) Size_t DPPP_(my_my_strlcpy)(char *dst, const char *src, Size_t size) { Size_t length, copy; length = strlen(src); if (size > 0) { copy = (length >= size) ? size - 1 : length; memcpy(dst, src, copy); dst[copy] = '\0'; } return length; } #endif #endif #endif /* _P_P_PORTABILITY_H_ */ /* End of File ppport.h */ KinoSearch1-1.01/t000755000765000765 011462203446 13774 5ustar00marvinmarvin000000000000KinoSearch1-1.01/t/000-load.t000444000765000765 65611462203445 15520 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use Test::More 'no_plan'; use File::Find 'find'; my @modules; find( { no_chdir => 1, wanted => sub { return unless $File::Find::name =~ /\.pm$/; push @modules, $File::Find::name; } }, 'lib' ); for (@modules) { s/^.*?KinoSearch1/KinoSearch1/; s/\.pm$//; s/[^a-zA-Z1-9]+/::/g; eval "use_ok('" . $_ . "');"; } KinoSearch1-1.01/t/001-build_invindexes.t000444000765000765 152711462203446 20154 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use lib 'buildlib'; use Test::More tests => 4; use File::Spec::Functions qw( catfile ); use KinoSearch1::Test::TestUtils qw( working_dir create_working_dir remove_working_dir create_persistent_test_index persistent_test_index_loc ); remove_working_dir(); ok( !-e working_dir(), "Working dir doesn't exist" ); create_working_dir(); ok( -e working_dir(), "Working dir successfully created" ); create_persistent_test_index(); my $path = persistent_test_index_loc(); ok( -d $path, "created invindex directory" ); opendir( my $test_invindex_dh, $path ) or die "Couldn't opendir '$path': $!"; my @cfs_files = grep {m/\.cfs$/} readdir $test_invindex_dh; closedir $test_invindex_dh or die "Couldn't closedir '$path': $!"; cmp_ok( scalar @cfs_files, '>', 0, "at least one .cfs file exists" ); KinoSearch1-1.01/t/002-kinosearch.t000444000765000765 36211462203446 16724 0ustar00marvinmarvin000000000000use strict; use warnings; use Test::More tests => 3; BEGIN { use_ok( 'KinoSearch1', 'K_DEBUG' ) } ok( !K_DEBUG, "DEBUG mode should be disabled" ); ok( !KinoSearch1::memory_debugging_enabled(), "Memory debugging should be disabled" ); KinoSearch1-1.01/t/010-verify_args.t000444000765000765 123411462203446 17134 0ustar00marvinmarvin000000000000use strict; use warnings; use Test::More tests => 6; BEGIN { use_ok( 'KinoSearch1::Util::VerifyArgs', qw( kerror verify_args ) ) } my %defaults = ( foo => 'FOO', bar => 'BAR' ); sub check { return verify_args( \%defaults, @_ ); } my $dest = {}; my $ret = check( odd => 'number', of => ); is( $ret, 0, "An odd number of args fails verify_args" ); like( kerror(), qr/odd/, "verify_args sets the right error string" ); $ret = check( bad => 'badness' ); is( $ret, 0, "An invalid arg chokes verify_args" ); like( kerror(), qr/invalid/i, "verify_args sets the right error string" ); $ret = check( foo => 'boo' ); is( $ret, 1, "A valid arg passes verify_args" ); KinoSearch1-1.01/t/011-class.t000444000765000765 343511462203446 15727 0ustar00marvinmarvin000000000000use strict; use warnings; use Test::More tests => 9; package KinoSearch1::TestClass; use base qw( KinoSearch1::Util::Class ); BEGIN { __PACKAGE__->init_instance_vars( foo => 'correct', ); __PACKAGE__->ready_get_set('foo'); } sub die_an_abstract_death { shift->abstract_death } sub die_an_unimplemented_death { shift->unimplemented_death } sub die_a_todo_death { shift->todo_death } our $version = $KinoSearch1::VERSION; package MySubClass; use base qw( Exporter KinoSearch1::TestClass ); package main; # These should NOT be accessed. our %instance_vars = ( foo => 'wrong', bar => 'wrong', ); my $verify_version = defined $KinoSearch1::TestClass::version; is( $verify_version, 1, "Using this class should grant access to " . "package globals in the KinoSearch1:: namespace" ); can_ok( 'KinoSearch1::Util::Class', 'new' ); my $util_class_object = KinoSearch1::Util::Class->new(); is( ref $util_class_object, 'KinoSearch1::Util::Class', "constructor works." ); my $test_obj = KinoSearch1::TestClass->new; is( $test_obj->{foo}, 'correct', "Inheritance works as expected" ); eval { $test_obj->die_an_abstract_death }; like( $@, qr/abstract/i, "abstract_death produces a meaningful error message" ); eval { $test_obj->die_a_todo_death }; like( $@, qr/todo/i, "todo_death produces a meaningful error message" ); eval { $test_obj->die_an_unimplemented_death }; like( $@, qr/unimplemented/i, "unimplemented_death produces a meaningful error message" ); my $subclassed_obj = MySubClass->new( foo => 'boo' ); is( $subclassed_obj->get_foo, "boo", "KinoSearch1 objects can be subclassed outside the KinoSearch1 hierarchy" ); $subclassed_obj->set_foo("hoo"); is( $subclassed_obj->get_foo, "hoo", "ready_get_set creates valid setter and getter" ); KinoSearch1-1.01/t/012-priority_queue.t000444000765000765 210711462203446 17703 0ustar00marvinmarvin000000000000use strict; use warnings; use Test::More tests => 7; BEGIN { use_ok('KinoSearch1::Util::PriorityQueue') } my $pq = KinoSearch1::Util::PriorityQueue->new( max_size => 5 ); $pq->insert($_) for ( 3, 1, 2, 20, 10 ); is( $pq->peek, 1, "peek at the least item in the queue" ); is_deeply( $pq->pop_all, [ 20, 10, 3, 2, 1 ], "pop_all sorts correctly" ); is( $pq->get_max_size, 5, "get_max_size" ); $pq->insert($_) for ( 3, 1, 2, 20, 10 ); my @prioritized; for ( 1 .. 4 ) { push @prioritized, $pq->pop; } is( $pq->get_size, 1, "get_size" ); $pq->insert(7); push @prioritized, $pq->pop; is_deeply( \@prioritized, [ 1, 2, 3, 10, 7 ], "insert, pop, and sort correctly" ); 1 while defined $pq->pop; # empty queue; $pq = KinoSearch1::Util::PriorityQueue->new( max_size => 5 ); @prioritized = (); $pq->insert($_) for ( 1 .. 10, -3, 1590 .. 1600, 5 ); push @prioritized, $pq->pop for 1 .. 5; is_deeply( \@prioritized, [ 1596 .. 1600 ], "insert properly discards waste" ); 1 while defined $pq->pop; # empty queue; @prioritized = (); $pq->insert($_) for ( 3, 1, 2, 20, 10 ); KinoSearch1-1.01/t/013-bit_vector.t000444000765000765 450311462203446 16761 0ustar00marvinmarvin000000000000use strict; use warnings; use Test::More tests => 15; use List::Util qw( shuffle ); BEGIN { use_ok('KinoSearch1::Util::BitVector') } my $bit_vec = KinoSearch1::Util::BitVector->new( capacity => 9 ); $bit_vec->set(2); my @got = map { $bit_vec->get($_) } 0 .. 9; is_deeply( \@got, [ '', '', 1, '', '', '', '', '', '', '' ], "set and get, including out-of-range get" ); $bit_vec = KinoSearch1::Util::BitVector->new( capacity => 9 ); $bit_vec->set(5); is( $bit_vec->next_set_bit(5), 5, 'next_set_bit' ); is( $bit_vec->next_clear_bit(5), 6, 'next_clear_bit' ); is( $bit_vec->next_set_bit(6), undef, "next_set_bit should return undef when " . "the there's no set bit between the val and the end" ); $bit_vec->set(6); $bit_vec->set(7); $bit_vec->set(8); is( $bit_vec->next_clear_bit(5), 9, "... in same situation, clear returns the first out of range" ); $bit_vec = KinoSearch1::Util::BitVector->new( capacity => 25 ); $bit_vec->bulk_set( 1, 22 ); is_deeply( $bit_vec->to_arrayref, [ 1 .. 22 ], "bulk set" ); $bit_vec->bulk_clear( 2, 21 ); @got = map { $bit_vec->get($_) } 0 .. 24; is_deeply( $bit_vec->to_arrayref, [ 1, 22 ], "bulk clear" ); $bit_vec = KinoSearch1::Util::BitVector->new; is( $bit_vec->get_capacity, 0, "default capacity of 0" ); $bit_vec->set_bits("\x02"); is( $bit_vec->get_capacity, 8, "set_bits has side effect of new capacity" ); is_deeply( $bit_vec->to_arrayref, [1], "set_bits was successful" ); $bit_vec->set(9); is( $bit_vec->get_capacity, 10, "capacity should grow with above-range set" ); is( $bit_vec->get_bits, "\x02\x02", "bits have grown with above-range set" ); $bit_vec = KinoSearch1::Util::BitVector->new; my $other = KinoSearch1::Util::BitVector->new; $bit_vec->set( 1 .. 3, 10, 20, 30 ); $other->set( 2 .. 10, 25 .. 35 ); $bit_vec->logical_and($other); is_deeply( $bit_vec->to_arrayref, [ 2, 3, 10, 30 ], "logical_and" ); $bit_vec = KinoSearch1::Util::BitVector->new; my @counts; for ( shuffle 1 .. 64 ) { $bit_vec->set($_); push @counts, $bit_vec->count; } is_deeply( \@counts, [ 1 .. 64 ], 'count() returns the right number of bits' ); # valgrind only - detect off-by-one error for my $cap ( 5 .. 24 ) { $bit_vec = KinoSearch1::Util::BitVector->new( capacity => $cap ); $bit_vec->set( $cap - 2 ); for ( 0 .. $cap ) { $bit_vec->next_set_bit($_); } } KinoSearch1-1.01/t/015-sort_external.t000444000765000765 547511462203446 17525 0ustar00marvinmarvin000000000000use strict; use warnings; use lib 'buildlib'; use Test::More tests => 8; use File::Spec; use List::Util qw( shuffle ); BEGIN { use_ok("KinoSearch1::Util::SortExternal"); } use KinoSearch1::Test::TestUtils qw( create_index ); my $invindex = create_index(); my ( $sortex, @orig, @sort_output ); sub init_sortex { $sortex = KinoSearch1::Util::SortExternal->new( invindex => $invindex, seg_name => '_1', @_, ); } init_sortex; @orig = ( 'a' .. 'z' ); $sortex->feed($_) for shuffle(@orig); $sortex->sort_all; while ( my $result = $sortex->fetch ) { push @sort_output, $result; } is_deeply( \@sort_output, \@orig, "sort letters" ); @orig = (); @sort_output = (); init_sortex; @orig = qw( a a a b c d x x x x x x y y ); $sortex->feed($_) for shuffle(@orig); $sortex->sort_all; while ( defined( my $result = $sortex->fetch ) ) { push @sort_output, $result; } is_deeply( \@sort_output, \@orig, "sort repeated letters" ); @orig = (); @sort_output = (); init_sortex; @orig = ( '', '', 'a' .. 'z' ); $sortex->feed($_) for shuffle(@orig); $sortex->sort_all; while ( defined( my $result = $sortex->fetch ) ) { push @sort_output, $result; } is_deeply( \@sort_output, \@orig, "sort letters and empty strings" ); @orig = (); @sort_output = (); init_sortex( mem_threshold => 30 ); @orig = 'a' .. 'z'; $sortex->feed($_) for ( shuffle(@orig) ); $sortex->sort_all; while ( my $result = $sortex->fetch ) { push @sort_output, $result; } is_deeply( \@sort_output, \@orig, "... with an absurdly low mem_threshold" ); @orig = (); @sort_output = (); init_sortex( mem_threshold => 1 ); @orig = 'a' .. 'z'; $sortex->feed($_) for ( shuffle(@orig) ); $sortex->sort_all; while ( my $result = $sortex->fetch ) { push @sort_output, $result; } is_deeply( \@sort_output, \@orig, "... with an even lower mem_threshold" ); @orig = (); @sort_output = (); init_sortex; $sortex->sort_all; @sort_output = $sortex->fetch; is_deeply( \@sort_output, [undef], "Sorting nothing returns undef" ); @sort_output = (); init_sortex( mem_threshold => 20_000 ); @orig = map { pack( 'N', $_ ) } ( 0 .. 11_000 ); $sortex->feed( shuffle(@orig) ); $sortex->sort_all; while ( defined( my $item = $sortex->fetch ) ) { push @sort_output, $item; } is_deeply( \@sort_output, \@orig, "Sorting packed integers..." ); @sort_output = (); exit; init_sortex( mem_threshold => 20_000 ); @orig = (); for my $iter ( 0 .. 1_000 ) { my $string = ''; for my $string_len ( 0 .. int( rand(1200) ) ) { $string .= pack( 'C', int( rand(256) ) ); } push @orig, $string; } @orig = sort @orig; $sortex->feed($_) for shuffle(@orig); $sortex->sort_all; while ( defined( my $item = $sortex->fetch ) ) { push @sort_output, $item; } is_deeply( \@sort_output, \@orig, "Random binary strings of random length" ); @sort_output = (); KinoSearch1-1.01/t/101-simple_template_io.t000444000765000765 366011462203446 20475 0ustar00marvinmarvin000000000000#!/usr/bin/perl -w use strict; use warnings; use Test::More tests => 13; BEGIN { use_ok('KinoSearch1::Store::RAMInvIndex'); } my $invindex = KinoSearch1::Store::RAMInvIndex->new; my ( @nums, $packed, $template ); sub check_io { my ( $filename, $tpt ) = ( shift, shift ); my $outstream = $invindex->open_outstream($filename); $outstream->lu_write( $tpt, @_ ); $outstream->close; my $instream = $invindex->open_instream($filename); my @got = $instream->lu_read($tpt); is_deeply( \@got, \@_, $filename ); } @nums = ( -128 .. 127 ); $packed = pack( 'c256', @nums ); check_io( "signed byte", 'b256', @nums ); is( $invindex->slurp_file('signed byte'), $packed, "pack and lu_write handle signed bytes identically" ); @nums = ( 0 .. 255 ); $packed = pack( 'C*', @nums ); check_io( "unsigned byte", 'B256', @nums ); is( $invindex->slurp_file('unsigned byte'), $packed, "pack and lu_write handle unsigned bytes identically" ); @nums = map { $_ * 1_000_000 + int( rand() * 1_000_000 ) } -1000 .. 1000; push @nums, ( -1 * ( 2**31 ), 2**31 - 1 ); check_io( "signed int", 'i' . scalar @nums, @nums ); @nums = map { $_ * 1_000_000 + int( rand() * 1_000_000 ) } 1000 .. 3000; push @nums, ( 0, 1, 2**32 - 1 ); $packed = pack( 'N*', @nums ); check_io( "unsigned int", 'I' . scalar @nums, @nums ); is( $invindex->slurp_file('unsigned int'), $packed, "pack and lu_write handle unsigned int32s identically" ); @nums = map { $_ * 2 } 0 .. 5; check_io( 'unsigned long small', 'Q' . scalar @nums, @nums ); @nums = map { $_ * 2**31 } 0 .. 2000; $_ += int( rand( 2**16 ) ) for @nums; check_io( 'unsigned long large', 'Q' . scalar @nums, @nums ); @nums = ( 0 .. 127 ); check_io( 'VLong small', 'W' . scalar @nums, @nums ); @nums = ( 128 .. 500 ); check_io( 'VLong medium', 'W' . scalar @nums, @nums ); @nums = map { $_ * 2**31 } 0 .. 2000; $_ += int( rand( 2**16 ) ) for @nums; check_io( 'VLong large', 'W' . scalar @nums, @nums ); KinoSearch1-1.01/t/102-strings_template_io.t000444000765000765 170111462203446 20670 0ustar00marvinmarvin000000000000use strict; use warnings; use Test::More tests => 5; BEGIN { use_ok('KinoSearch1::Store::RAMInvIndex'); } my $invindex = KinoSearch1::Store::RAMInvIndex->new; my ( @items, $packed, $template ); sub check_io { my ( $filename, $tpt ) = ( shift, shift ); my $outstream = $invindex->open_outstream($filename); $outstream->lu_write( $tpt, @_ ); $outstream->close; my $instream = $invindex->open_instream($filename); my @got = $instream->lu_read($tpt); is_deeply( \@got, \@_, $filename ); } my @chars = ( qw( a b c d 1 ), "\n", "\0", " ", " ", "\xf0\x9d\x84\x9e" ); for ( 0, 22, 300 ) { @items = ( 'a' x $_ ); check_io( "string of length $_", 'T', @items ); } { @items = (); for ( 1 .. 50 ) { my $string_len = int( rand() * 5 ); my $str = ''; $str .= $chars[ rand @chars ] for 1 .. $string_len; push @items, $str; } check_io( "50 strings", "T50", @items ); } KinoSearch1-1.01/t/103-repeats_template_io.t000444000765000765 262011462203446 20644 0ustar00marvinmarvin000000000000use strict; use warnings; use Test::More tests => 18; BEGIN { use_ok('KinoSearch1::Store::RAMInvIndex'); } my $invindex = KinoSearch1::Store::RAMInvIndex->new; my ( @items, $packed, $template ); sub check_io { my ( $filename, $tpt ) = ( shift, shift ); my $outstream = $invindex->open_outstream($filename); $outstream->lu_write( $tpt, @_ ); $outstream->close; my $instream = $invindex->open_instream($filename); my @got = $instream->lu_read($tpt); is_deeply( \@got, \@_, $filename ); } # Verify numeric repeats for signed char. for ( -127, 2, 20, 127 ) { @items = ( -128 .. $_ ); my $set = $_ + 129; $packed = pack( "c$set", @items ); check_io( "b$set", "b$set", @items ); is( $invindex->slurp_file("b$set"), $packed, "pack and lu_write handle signed bytes identically" ); } # Verify numeric repeats for unsigned char. for ( 2, 20, 127 ) { @items = ( 1 .. $_ ); $packed = pack( "C$_", @items ); check_io( "B$_", "B$_", @items ); is( $invindex->slurp_file("B$_"), $packed, "pack and lu_write handle unsigned bytes identically" ); } # Multiple repeats in one template. for my $num ( 2, 19, 101 ) { @items = ( 1 .. $num ); @items = (@items) x 8; push @items, 'foo'; my $template = ''; $template .= $_ . "$num " for (qw( T V b B i I Q W )); $template .= 'T'; check_io( $template, $template, @items ); } KinoSearch1-1.01/t/104-parse_template_io.t000444000765000765 163211462203446 20316 0ustar00marvinmarvin000000000000use strict; use warnings; use Test::More tests => 4; BEGIN { use_ok('KinoSearch1::Store::RAMInvIndex'); } my $invindex = KinoSearch1::Store::RAMInvIndex->new; my ( @items, $packed, $template ); sub check_io { my ( $filename, $tpt ) = ( shift, shift ); my $outstream = $invindex->open_outstream($filename); $outstream->lu_write( $tpt, @_ ); $outstream->close; my $instream = $invindex->open_instream($filename); my @got = $instream->lu_read($tpt); is_deeply( \@got, \@_, $filename ); } my $outstream = $invindex->open_outstream("fake_file"); eval { $outstream->lu_write( 'u', 'foo' ); }; like( $@, qr/illegal character/i, "Illegal symbol in template caught" ); @items = qw( foo bar ); check_io( "leading and trailing whitespace", " T T ", @items ); @items = ( qw( foo bar baz ), 0 .. 5 ); $template = "TT2Ti3Qb"; check_io( "Tightly packed template", $template, @items ); KinoSearch1-1.01/t/105-invindex.t000444000765000765 566511462203446 16461 0ustar00marvinmarvin000000000000#!perl -wT use strict; use warnings; use lib 'buildlib'; use Test::More tests => 22; use File::Spec::Functions qw( catfile ); use File::Path qw( rmtree ); use Fcntl; BEGIN { use_ok('KinoSearch1::Store::RAMInvIndex'); use_ok('KinoSearch1::Store::FSInvIndex'); } use KinoSearch1::Test::TestUtils qw( test_index_loc ); my $fs_invindex_loc = test_index_loc(); # clean up from previous tests if needed. rmtree($fs_invindex_loc); eval { my $fs_invindex = KinoSearch1::Store::FSInvIndex->new( path => $fs_invindex_loc, ); }; like( $@, qr/invindex/, "opening an invindex that doesn't exist fails without create => 1" ); my $fs_invindex = KinoSearch1::Store::FSInvIndex->new( create => 1, path => $fs_invindex_loc, ); my $king = "I'm the king of rock."; my $outstream = $fs_invindex->open_outstream('king_of_rock'); $outstream->lu_write( 'a' . bytes::length($king), $king ); $outstream->close; my $ram_invindex = KinoSearch1::Store::RAMInvIndex->new( create => 1, path => $fs_invindex_loc, ); ok( $ram_invindex->file_exists('king_of_rock'), "RAMInvIndex successfully reads existing FSInvIndex" ); for my $invindex ( $fs_invindex, $ram_invindex ) { my @files = $invindex->list; is_deeply( \@files, ['king_of_rock'], "list lists files" ); my $slurped = $invindex->slurp_file('king_of_rock'); is( $slurped, $king, "slurp_file works" ); my $lock = $invindex->make_lock( lock_name => 'lock_robster', timeout => 0, ); my $competing_lock = $invindex->make_lock( lock_name => 'lock_robster', timeout => 0, ); $lock->obtain; eval { $competing_lock->obtain }; like( $@, qr/get lock/, "shouldn't get lock on existing resource" ); ok( $lock->is_locked, "lock is locked" ); $lock->release; ok( !$lock->is_locked, "release works" ); $invindex->run_while_locked( lock_name => 'lock_robster', timeout => 1000, do_body => sub { $invindex->rename_file( 'king_of_rock', 'king_of_lock' ); }, ); ok( !$invindex->file_exists('king_of_rock'), "file successfully removed while locked" ); is( $invindex->file_exists('king_of_lock'), 1, "file successfully moved while locked" ); $invindex->delete_file('king_of_lock'); ok( !$invindex->file_exists('king_of_lock'), "delete_file works" ); } my $foo_path = catfile( $fs_invindex_loc, 'foo' ); my $cfs_path = catfile( $fs_invindex_loc, '_1.cfs' ); for ( $foo_path, $cfs_path ) { sysopen( my $fh, $_, O_CREAT | O_EXCL | O_WRONLY ) or die "Couldn't open '$_' for writing: $!"; print $fh 'stuff'; } $fs_invindex = KinoSearch1::Store::FSInvIndex->new( create => 1, path => $fs_invindex_loc, ); ok( -e $foo_path, "creating an invindex shouldn't wipe an unrelated file" ); ok( !-e catfile( $fs_invindex_loc, '_1.cfs' ), "... but it should clean the cfs file" ); # clean up rmtree($fs_invindex_loc); KinoSearch1-1.01/t/106-locking.t000444000765000765 345411462203445 16255 0ustar00marvinmarvin000000000000#!perl use strict; use warnings; use Time::HiRes qw( sleep ); use Test::More; BEGIN { if ( $^O =~ /mswin/i ) { plan( 'skip_all', "fork on Windows not supported by KS" ); } else { plan( tests => 3 ); } use_ok 'KinoSearch1::Store::FSLock'; } use KinoSearch1::Store::FSInvIndex; my $lock_path = "$KinoSearch1::Store::FSInvIndex::LOCK_DIR/test-foo"; Dead_locks_are_removed: { # Remove any existing lockfile unlink $lock_path; die "Can't unlink '$lock_path'" if -e $lock_path; # Fake index for test simplicity my $mock_index = MockIndex->new( prefix => 'test' ); sub make_lock { my $lock = KinoSearch1::Store::FSLock->new( invindex => $mock_index, lock_name => 'foo', ); $lock->obtain; return $lock; } # Fork a process that will create a lock and then exit my $pid = fork(); if ( $pid == 0 ) { # child make_lock(); exit; } else { waitpid( $pid, 0 ); } ok( -e $lock_path, "child secured lock" ); # The locking attempt will fail if the pid from the process that made the # lock is active, so do the best we can to see whether another process # started up with the child's pid (which would be weird). my $pid_active = kill( 0, $pid ); eval { make_lock() }; warn $@ if $@; my $saved_err = $@; $pid_active ||= kill( 0, $pid ); SKIP: { skip( "Child's pid is active", 1 ) if $pid_active; ok( !$saved_err, 'second lock attempt clobbered dead lock file and did not die' ); } # clean up unlink $lock_path; } package MockIndex; use strict; use warnings; sub new { my ( $class, %args ) = @_; bless \%args, $class; } sub get_path {"bar"} sub get_lock_prefix { $_[0]->{prefix} } KinoSearch1-1.01/t/150-polyanalyzer.t000444000765000765 333711462203445 17357 0ustar00marvinmarvin000000000000use strict; use warnings; use lib 'buildlib'; use Test::More tests => 5; use KinoSearch1::Test::TestUtils qw( test_analyzer ); use KinoSearch1::Analysis::LCNormalizer; use KinoSearch1::Analysis::Tokenizer; use KinoSearch1::Analysis::Stopalizer; use KinoSearch1::Analysis::Stemmer; use KinoSearch1::Analysis::PolyAnalyzer; use KinoSearch1::Analysis::TokenBatch; my $source_text = 'Eats, shoots and leaves.'; my $lc_normalizer = KinoSearch1::Analysis::LCNormalizer->new; my $tokenizer = KinoSearch1::Analysis::Tokenizer->new; my $stopalizer = KinoSearch1::Analysis::Stopalizer->new( language => 'en' ); my $stemmer = KinoSearch1::Analysis::Stemmer->new( language => 'en' ); my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new( analyzers => [], ); test_analyzer( $polyanalyzer, $source_text, [$source_text], 'no sub analyzers' ); $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new( analyzers => [$lc_normalizer], ); test_analyzer( $polyanalyzer, $source_text, ['eats, shoots and leaves.'], 'with LCNormalizer' ); $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new( analyzers => [ $lc_normalizer, $tokenizer ], ); test_analyzer( $polyanalyzer, $source_text, [ 'eats', 'shoots', 'and', 'leaves' ], 'with Tokenizer' ); $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new( analyzers => [ $lc_normalizer, $tokenizer, $stopalizer ], ); test_analyzer( $polyanalyzer, $source_text, [ 'eats', 'shoots', '', 'leaves' ], 'with Stopalizer' ); $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new( analyzers => [ $lc_normalizer, $tokenizer, $stopalizer, $stemmer, ], ); test_analyzer( $polyanalyzer, $source_text, [ 'eat', 'shoot', '', 'leav' ], 'with Stemmer' ); KinoSearch1-1.01/t/152-token_batch.t000444000765000765 155711462203446 17114 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use lib 'buildlib'; use Test::More tests => 3; use KinoSearch1::Test::TestUtils qw( utf8_test_strings ); BEGIN { use_ok('KinoSearch1::Analysis::TokenBatch') } use KinoSearch1::Analysis::Token; my $batch = KinoSearch1::Analysis::TokenBatch->new; $batch->append( "car", 0, 3 ); $batch->append( "bike", 10, 14 ); $batch->append( "truck", 20, 25 ); my @texts; while ( $batch->next ) { push @texts, $batch->get_text; } is_deeply( \@texts, [qw( car bike truck )], "return tokens in order" ); TODO: { local $TODO = "Known UTF-8 bugs, fixed in KS 0.3x"; my ( $smiley, $not_a_smiley, $frowny ) = utf8_test_strings(); $batch = KinoSearch1::Analysis::TokenBatch->new; $batch->append( $smiley, 0, bytes::length($smiley) ); $batch->next; is( $batch->get_text, $smiley, "TokenBatch handles UTF-8 correctly" ); } KinoSearch1-1.01/t/153-lc_normalizer.t000444000765000765 132311462203445 17462 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use Test::More tests => 3; BEGIN { use_ok('KinoSearch1::Analysis::LCNormalizer') } use KinoSearch1::Analysis::TokenBatch; my $lc_normalizer = KinoSearch1::Analysis::LCNormalizer->new; my $batch = KinoSearch1::Analysis::TokenBatch->new; $batch->append( "caPiTal ofFensE", 0, 15 ); $batch = $lc_normalizer->analyze($batch); $batch->next; is( $batch->get_text, "capital offense", "lc plain text" ); $batch = KinoSearch1::Analysis::TokenBatch->new; $batch->append( $_, 10, 20 ) for qw( eL sEE ); $batch = $lc_normalizer->analyze($batch); my @texts; while ( $batch->next ) { push @texts, $batch->get_text; } is_deeply( \@texts, [qw( el see )], "analyze an existing batch" ); KinoSearch1-1.01/t/154-tokenizer.t000444000765000765 255611462203445 16646 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use Test::More tests => 7; BEGIN { use_ok('KinoSearch1::Analysis::Tokenizer') } use KinoSearch1::Analysis::TokenBatch; my $tokenizer = KinoSearch1::Analysis::Tokenizer->new; my $batch = KinoSearch1::Analysis::TokenBatch->new; $batch->append( "a b c", 0, 5 ); $batch = $tokenizer->analyze($batch); my ( @token_texts, @start_offsets, @end_offsets ); while ( $batch->next ) { push @token_texts, $batch->get_text; push @start_offsets, $batch->get_start_offset; push @end_offsets, $batch->get_end_offset; } is_deeply( \@token_texts, [qw( a b c )], "correct texts" ); is_deeply( \@start_offsets, [ 0, 2, 4, ], "correct start offsets" ); is_deeply( \@end_offsets, [ 1, 3, 5, ], "correct end offsets" ); $tokenizer = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/./ ); $batch = KinoSearch1::Analysis::TokenBatch->new; $batch->append( "a b c", 0, 5 ); $batch = $tokenizer->analyze($batch); @token_texts = (); @start_offsets = (); @end_offsets = (); while ( $batch->next ) { push @token_texts, $batch->get_text; push @start_offsets, $batch->get_start_offset; push @end_offsets, $batch->get_end_offset; } is_deeply( \@token_texts, [ 'a', ' ', 'b', ' ', 'c' ], "texts: custom re" ); is_deeply( \@start_offsets, [ 0 .. 4 ], "starts: custom re" ); is_deeply( \@end_offsets, [ 1 .. 5 ], "ends: custom re" ); KinoSearch1-1.01/t/155-stopalizer.t000444000765000765 124211462203446 17021 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use Test::More tests => 2; BEGIN { use_ok('KinoSearch1::Analysis::Stopalizer') } use KinoSearch1::Analysis::TokenBatch; use KinoSearch1::Analysis::Tokenizer; my $tokenizer = KinoSearch1::Analysis::Tokenizer->new; my $batch = KinoSearch1::Analysis::TokenBatch->new; $batch->append( "i am the walrus", 0, 5 ); $batch = $tokenizer->analyze($batch); my $stopalizer = KinoSearch1::Analysis::Stopalizer->new( language => 'en' ); $batch = $stopalizer->analyze($batch); my @token_texts; while ( $batch->next ) { push @token_texts, $batch->get_text; } is_deeply( \@token_texts, [ '', '', '', 'walrus' ], "stopwords stopalized" ); KinoSearch1-1.01/t/201-field_infos.t000444000765000765 467211462203446 17110 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use lib 'buildlib'; use Test::More tests => 7; use File::Spec::Functions qw( catfile ); BEGIN { use_ok('KinoSearch1::Index::CompoundFileReader'); use_ok('KinoSearch1::Index::FieldInfos'); use_ok('KinoSearch1::Document::Field'); } use KinoSearch1::Test::TestUtils qw( create_index ); my $finfos = KinoSearch1::Index::FieldInfos->new; for my $name (qw( x b a content )) { $finfos->add_field( KinoSearch1::Document::Field->new( name => $name, vectorized => 0, ) ); } my @nums = map { $finfos->get_field_num($_) } qw( a b content x ); is_deeply( \@nums, [ 0, 1, 2, 3 ], "field nums should reflect lexical order" ); my $invindex = create_index( 'a', 'a b' ); my $cfs_reader = KinoSearch1::Index::CompoundFileReader->new( invindex => $invindex, seg_name => '_1', ); my $outstream = $invindex->open_outstream('finfos_test'); $finfos->write_infos($outstream); $outstream->close; $finfos = KinoSearch1::Index::FieldInfos->new; my $instream = $invindex->open_instream('finfos_test'); $finfos->read_infos($instream); $instream->close; my $finfos2 = KinoSearch1::Index::FieldInfos->new; $instream = $cfs_reader->open_instream("_1.fnm"); $finfos2->read_infos($instream); my %correct = ( name => 'content', field_num => 0, indexed => 1, vectorized => 1, fnm_bits => "\x3", ); my ($finfo) = grep { $_->get_name eq 'content' } $finfos2->get_infos; my %test; $test{$_} = $finfo->{$_} for keys %correct; is_deeply( \%test, \%correct, "Reading and writing, plus get_infos" ); my $master_finfos = KinoSearch1::Index::FieldInfos->new; $master_finfos->consolidate( $finfos, $finfos2 ); my $new_content_finfo = $master_finfos->info_by_name('content'); is( $new_content_finfo->get_vectorized, 1, "consolidate and breed_with merge field characteristics properly" ); $finfos = KinoSearch1::Index::FieldInfos->new; my @correct = ( 'a' .. 'z' ); for my $name ( reverse @correct ) { $finfos->add_field( KinoSearch1::Document::Field->new( name => $name ) ); } $outstream = $invindex->open_outstream('finfos_test2'); $finfos->write_infos($outstream); $outstream->close; $finfos = KinoSearch1::Index::FieldInfos->new; $instream = $invindex->open_instream('finfos_test2'); $finfos->read_infos($instream); my @got = map { $finfos->info_by_num($_)->get_name } 0 .. 25; is_deeply( \@got, \@correct, "field numbers still correct after write/read" ); KinoSearch1-1.01/t/202-term.t000444000765000765 64211462203446 15550 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use Test::More tests => 3; use List::Util qw( shuffle ); BEGIN { use_ok('KinoSearch1::Index::Term') } my $foo_term = KinoSearch1::Index::Term->new( "f1", "foo" ); my $bar_term = KinoSearch1::Index::Term->new( "f3", "bar" ); is( $foo_term->get_text, 'foo', "get_text should return correct val" ); is( $bar_term->get_field, "f3", "get_field should return correct val" ); KinoSearch1-1.01/t/203-compound_file_reader.t000444000765000765 166211462203445 20771 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use lib 'buildlib'; use Test::More tests => 5; use File::Spec::Functions qw( catfile ); BEGIN { use_ok('KinoSearch1::Index::CompoundFileReader'); use_ok( 'KinoSearch1::Index::IndexFileNames', qw( @COMPOUND_EXTENSIONS ) ); } use KinoSearch1::Test::TestUtils qw( create_index ); my $invindex = create_index('a'); my $cfs_reader = KinoSearch1::Index::CompoundFileReader->new( invindex => $invindex, seg_name => '_1', ); my $instream = $cfs_reader->open_instream('_1.tis'); isa_ok( $instream, 'KinoSearch1::Store::InStream' ); my $tis_bytecount = $instream->length; is( $cfs_reader->slurp_file('_1.tis'), $instream->lu_read("a$tis_bytecount"), "slurp_file gets the right bytes" ); my @files = sort map {"_1.$_"} ( @COMPOUND_EXTENSIONS, 'f0' ); my @cfs_entries = sort keys %{ $cfs_reader->{entries} }; is_deeply( \@cfs_entries, \@files, "get all the right files" ); KinoSearch1-1.01/t/204-fields_reader.t000444000765000765 476611462203446 17426 0ustar00marvinmarvin000000000000use strict; use warnings; use Test::More tests => 22; BEGIN { use_ok('KinoSearch1::InvIndexer') } BEGIN { use_ok('KinoSearch1::Store::RAMInvIndex') } BEGIN { use_ok('KinoSearch1::Index::FieldsReader') } BEGIN { use_ok('KinoSearch1::Index::CompoundFileReader') } BEGIN { use_ok('KinoSearch1::Index::FieldInfos') } my $invindex = KinoSearch1::Store::RAMInvIndex->new; my $invindexer = KinoSearch1::InvIndexer->new( invindex => $invindex, create => 1, ); # This valid UTF-8 string includes skull and crossbones, null byte -- however, # it is not flagged as UTF-8. my $bin_val = my $val = "a b c \xe2\x98\xA0 \0a"; my %field_specs = ( text => { indexed => 1, binary => 0, compressed => 0, value => $val, }, text_comp => { indexed => 1, binary => 0, compressed => 1, value => $val, }, bin => { indexed => 0, binary => 1, compressed => 0, value => $bin_val, }, bin_comp => { indexed => 0, binary => 1, compressed => 1, value => $bin_val, }, ); while ( my ( $name, $spec ) = each %field_specs ) { $invindexer->spec_field( name => $name, indexed => $spec->{indexed}, binary => $spec->{binary}, compressed => $spec->{compressed}, ); } my $doc = $invindexer->new_doc; $doc->set_value( $_ => $field_specs{$_}{value} ) for keys %field_specs; $invindexer->add_doc($doc); $invindexer->finish; my $cfs_reader = KinoSearch1::Index::CompoundFileReader->new( invindex => $invindex, seg_name => '_1', ); my $finfos = KinoSearch1::Index::FieldInfos->new; $finfos->read_infos( $cfs_reader->open_instream('_1.fnm') ); my $fields_reader = KinoSearch1::Index::FieldsReader->new( finfos => $finfos, fdata_stream => $cfs_reader->open_instream('_1.fdt'), findex_stream => $cfs_reader->open_instream('_1.fdx'), ); $doc = $fields_reader->fetch_doc(0); isa_ok( $doc, 'KinoSearch1::Document::Doc' ); #while ( my ( $name, $spec ) = each %field_specs ) { for my $field ( $doc->get_fields ) { my $name = $field->get_name; my $spec = $field_specs{$name}; is( $field->get_indexed, $spec->{indexed}, "correct val for indexed" ); is( $field->get_binary, $spec->{binary}, "correct val for binary" ); is( $field->get_compressed, $spec->{compressed}, "correct val for compressed" ); is( $field->get_value, $spec->{value}, "correct content" ); } KinoSearch1-1.01/t/205-seg_reader.t000444000765000765 207511462203445 16725 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use lib 'buildlib'; use Test::More tests => 8; BEGIN { use_ok('KinoSearch1::Index::IndexReader'); use_ok('KinoSearch1::Index::Term'); } use KinoSearch1::Test::TestUtils qw( create_index ); my $invindex = create_index( "What's he building in there?", "What's he building in there?", "We have a right to know." ); my $reader = KinoSearch1::Index::IndexReader->new( invindex => $invindex ); isa_ok( $reader, 'KinoSearch1::Index::SegReader', "single segment indexes cause new to return a SegReader" ); isa_ok( $reader->norms_reader('content'), 'KinoSearch1::Index::NormsReader' ); ok( !$reader->has_deletions, "has_deletions returns false if no deletions" ); is( $reader->max_doc, 3, "max_doc returns correct number" ); my $term = KinoSearch1::Index::Term->new( 'content', 'building' ); my $enum = $reader->terms($term); isa_ok( $enum, 'KinoSearch1::Index::SegTermEnum', "terms returns a SegTermEnum" ); my $tinfo = $enum->get_term_info; is( $tinfo->get_doc_freq, 2, "correct place in enum" ); KinoSearch1-1.01/t/206-seg_infos.t000444000765000765 45111462203446 16557 0ustar00marvinmarvin000000000000use strict; use warnings; use lib 'buildlib'; use Test::More tests => 1; use File::Spec::Functions qw( catfile ); BEGIN { use_ok('KinoSearch1::Index::SegInfos') } use KinoSearch1::Test::TestUtils qw( create_index ); create_index( "a", "a b" ); my $sinfos = KinoSearch1::Index::SegInfos->new; KinoSearch1-1.01/t/207-seg_term_enum.t000444000765000765 413511462203445 17457 0ustar00marvinmarvin000000000000use strict; use warnings; use Test::More tests => 9; BEGIN { use_ok('KinoSearch1::Store::RAMInvIndex'); use_ok('KinoSearch1::InvIndexer'); use_ok('KinoSearch1::Index::SegTermEnum'); use_ok('KinoSearch1::Index::CompoundFileReader'); use_ok('KinoSearch1::Index::FieldInfos'); } my $invindex = KinoSearch1::Store::RAMInvIndex->new; my $invindexer = KinoSearch1::InvIndexer->new( invindex => $invindex, create => 1, ); $invindexer->spec_field( name => 'a' ); $invindexer->spec_field( name => 'b' ); $invindexer->spec_field( name => 'c' ); my @animals = qw( cat dog tick ); for my $animal (@animals) { my $doc = $invindexer->new_doc; $doc->set_value( $_ => $animal ) for qw( a b c ); $invindexer->add_doc($doc); } $invindexer->finish; my $cfs_reader = KinoSearch1::Index::CompoundFileReader->new( invindex => $invindex, seg_name => '_1', ); my $finfos = KinoSearch1::Index::FieldInfos->new; $finfos->read_infos( $cfs_reader->open_instream('_1.fnm') ); my $enum = KinoSearch1::Index::SegTermEnum->new( finfos => $finfos, instream => $cfs_reader->open_instream('_1.tis'), ); my @fields; my @texts; my ( $pointer, $position, $termstring, $tinfo ); while ( $enum->next ) { my $ts = $enum->get_termstring; my $term = KinoSearch1::Index::Term->new_from_string( $ts, $finfos ); push @fields, $term->get_field; push @texts, $term->get_text; if ( $term->get_text eq 'tick' and $term->get_field eq 'b' ) { $pointer = $enum->_get_instream->tell; $position = $enum->_get_position; $termstring = $enum->get_termstring; $tinfo = $enum->get_term_info; } } is_deeply( \@fields, [qw( a a a b b b c c c )], "correct fields" ); my @correct_texts = (@animals) x 3; is_deeply( \@texts, \@correct_texts, "correct terms" ); $enum->seek( $pointer, $position, $termstring, $tinfo ); $enum->next; my $ts = $enum->get_termstring; my $term = KinoSearch1::Index::Term->new_from_string( $ts, $finfos ); is( $term->get_text, 'cat', "enum seeks to correct term (ptr)" ); is( $term->get_field, 'c', "enum seeks to correct term (field)" ); KinoSearch1-1.01/t/208-terminfo.t000444000765000765 274511462203446 16460 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use Test::More tests => 18; BEGIN { use_ok('KinoSearch1::Index::TermInfo'); } my $tinfo = KinoSearch1::Index::TermInfo->new( 10, 20, 30, 40, 50 ); my $cloned_tinfo = $tinfo->clone; isnt( 0 + $tinfo, 0 + $cloned_tinfo, "the clone should be a separate C struct" ); is( $tinfo->get_doc_freq, 10, "new sets doc_freq correctly" ); is( $tinfo->get_doc_freq, 10, "... doc_freq cloned" ); is( $tinfo->get_frq_fileptr, 20, "new sets frq_fileptr correctly" ); is( $tinfo->get_frq_fileptr, 20, "... frq_fileptr cloned" ); is( $tinfo->get_prx_fileptr, 30, "new sets prx_fileptr correctly" ); is( $tinfo->get_prx_fileptr, 30, "... prx_fileptr cloned" ); is( $tinfo->get_skip_offset, 40, "new sets skip_offset correctly" ); is( $tinfo->get_skip_offset, 40, "... skip_offset cloned" ); is( $tinfo->get_index_fileptr, 50, "new sets index_fileptr correctly" ); is( $tinfo->get_index_fileptr, 50, "... index_fileptr cloned" ); $tinfo->set_doc_freq(5); is( $tinfo->get_doc_freq, 5, "set/get doc_freq" ); is( $cloned_tinfo->get_doc_freq, 10, "setting orig doesn't affect clone" ); $tinfo->set_frq_fileptr(15); is( $tinfo->get_frq_fileptr, 15, "set/get frq_fileptr" ); $tinfo->set_prx_fileptr(25); is( $tinfo->get_prx_fileptr, 25, "set/get prx_fileptr" ); $tinfo->set_skip_offset(35); is( $tinfo->get_skip_offset, 35, "set/get skip_offset" ); $tinfo->set_index_fileptr(45); is( $tinfo->get_index_fileptr, 45, "set/get index_fileptr" ); KinoSearch1-1.01/t/209-term_infos_reader.t000444000765000765 275211462203445 20322 0ustar00marvinmarvin000000000000use strict; use warnings; use lib 'buildlib'; use Test::More tests => 6; BEGIN { use_ok('KinoSearch1::Index::TermInfosReader'); use_ok('KinoSearch1::Index::CompoundFileReader'); use_ok('KinoSearch1::Index::FieldInfos'); use_ok('KinoSearch1::Index::Term'); } use KinoSearch1::Test::TestUtils qw( create_index ); my @docs; my @chars = ( 'a' .. 'z' ); for ( 0 .. 1000 ) { my $content = ''; for my $num_words ( 0 .. int( rand(20) ) ) { for my $num_chars ( 1 .. int( rand(10) ) ) { $content .= @chars[ rand(@chars) ]; } $content .= ' '; } push @docs, "$content\n"; } my $invindex = create_index( ( 1 .. 1000 ), ( ("a") x 100 ), "Foo", @docs, "Foo", "A MAN", "A PLAN", "A CANAL", "PANAMA" ); my $comp_file_reader = KinoSearch1::Index::CompoundFileReader->new( invindex => $invindex, seg_name => '_1', ); my $finfos = KinoSearch1::Index::FieldInfos->new; $finfos->read_infos( $comp_file_reader->open_instream('_1.fnm') ); my $tinfos_reader = KinoSearch1::Index::TermInfosReader->new( invindex => $comp_file_reader, seg_name => '_1', finfos => $finfos, ); my $term = KinoSearch1::Index::Term->new( 'content', 'A' ); my $tinfo = $tinfos_reader->fetch_term_info($term); is( $tinfo->get_doc_freq, 3, "correct retrieval #1" ); $term = KinoSearch1::Index::Term->new( 'content', "Foo" ); $tinfo = $tinfos_reader->fetch_term_info($term); is( $tinfo->get_doc_freq, 2, "correct retrieval #2" ); KinoSearch1-1.01/t/210-deldocs.t000444000765000765 345211462203446 16237 0ustar00marvinmarvin000000000000use strict; use warnings; use lib 'buildlib'; use Test::More tests => 12; BEGIN { use_ok('KinoSearch1::Index::DelDocs') } use KinoSearch1::Test::TestUtils qw( create_index ); my $invindex = create_index( 'a' .. 'e' ); my $deldocs = KinoSearch1::Index::DelDocs->new(); $deldocs->read_deldocs( $invindex, "_1.del" ); $deldocs->set(3); $deldocs->set(3); my @deleted_or_not = map { $deldocs->get($_) } 0 .. 4; is_deeply( \@deleted_or_not, [ '', '', '', 1, '' ], "set works" ); is( $deldocs->get_num_deletions, 1, "set increments num_deletions, once" ); my $doc_map = $deldocs->generate_doc_map( 5, 0 ); my $correct_doc_map = pack( 'i*', 0, 1, 2, -1, 3 ); is( $$doc_map, $correct_doc_map, "doc map maps around deleted docs" ); $doc_map = $deldocs->generate_doc_map( 5, 100 ); is( $doc_map->get(4), 103, "doc map handles offset correctly" ); is( $doc_map->get(3), undef, "doc_map handled deletions correctly" ); is( $doc_map->get(6), undef, "doc_map returns undef for out of range" ); $deldocs->clear(3); $deldocs->clear(3); $deldocs->clear(3); is( $deldocs->get_num_deletions, 0, "clear decrements num_deletions, once" ); $deldocs->set(2); $deldocs->set(1); $deldocs->write_deldocs( $invindex, "_1.del", 8 ); $deldocs = KinoSearch1::Index::DelDocs->new(); $deldocs->read_deldocs( $invindex, "_1.del" ); @deleted_or_not = map { $deldocs->get($_) } 0 .. 7; is_deeply( \@deleted_or_not, [ '', 1, 1, '', '', '', '', '' ], "write_deldocs and read_deldocs save/recover deletions correctly" ); is( $deldocs->get_num_deletions, 2, "write_deldocs and read_deldocs save/recover num_deletions correctly" ); is( $deldocs->get_capacity, 8, "write_deldocs wrote correct number of bytes" ); $deldocs->write_deldocs( $invindex, "_1.del", 8 ); ok( $invindex->file_exists("_1.del"), "overwrite existing deletions file" ); KinoSearch1-1.01/t/211-seg_term_docs.t000444000765000765 301611462203446 17434 0ustar00marvinmarvin000000000000use strict; use warnings; use lib 'buildlib'; use Test::More tests => 7; BEGIN { use_ok('KinoSearch1::Index::SegTermDocs'); use_ok('KinoSearch1::Index::IndexReader'); } use KinoSearch1::Test::TestUtils qw( create_index ); my $invindex = create_index( qw( a b c ), 'c c d' ); my $reader = KinoSearch1::Index::IndexReader->new( invindex => $invindex ); my $term = KinoSearch1::Index::Term->new( 'content', 'c' ); my $term_docs = $reader->term_docs($term); my ( $docs, $freqs, $prox ); $term_docs->bulk_read( $docs, $freqs, 1024 ); my @doc_nums = unpack( 'I*', $docs ); is_deeply( \@doc_nums, [ 2, 3 ], "correct doc_nums" ); my @freq_nums = unpack( 'I*', $freqs ); is_deeply( \@freq_nums, [ 1, 2 ], "correct freqs" ); $term_docs->set_read_positions(1); $term_docs->seek($term); $prox = ''; $prox .= $term_docs->get_positions while $term_docs->next; my @prox_nums = unpack( 'I*', $prox ); is_deeply( \@prox_nums, [ 0, 0, 1 ], "correct positions" ); $term_docs->_get_deldocs()->set(2); $term_docs->seek($term); $term_docs->bulk_read( $docs, $freqs, 1024 ); @doc_nums = unpack( 'I*', $docs ); is_deeply( \@doc_nums, [3], "deletions are honored" ); my @documents = ( qw( c ), 'c c d', ); push @documents, "$_ c" for 0 .. 200; $invindex = create_index(@documents); $reader = KinoSearch1::Index::IndexReader->new( invindex => $invindex ); $term_docs = $reader->term_docs($term); $term_docs->bulk_read( $docs, $freqs, 1024 ); @doc_nums = unpack( 'I*', $docs ); is_deeply( \@doc_nums, [ 0 .. 202 ], "large number of doc_nums correct" ); KinoSearch1-1.01/t/212-multi_term_docs.t000444000765000765 441311462203445 20012 0ustar00marvinmarvin000000000000use strict; use warnings; use Test::More tests => 10; BEGIN { use_ok('KinoSearch1::Index::SegTermDocs'); use_ok('KinoSearch1::Index::MultiTermDocs'); use_ok('KinoSearch1::Index::IndexReader'); use_ok('KinoSearch1::InvIndexer'); use_ok('KinoSearch1::Analysis::Tokenizer'); use_ok('KinoSearch1::Store::RAMInvIndex'); } my $invindex = KinoSearch1::Store::RAMInvIndex->new(); my $tokenizer = KinoSearch1::Analysis::Tokenizer->new; my $id = 0; for my $iter ( 1 .. 4 ) { my $invindexer = KinoSearch1::InvIndexer->new( create => $iter == 1 ? 1 : 0, invindex => $invindex, analyzer => $tokenizer, ); $invindexer->spec_field( name => 'content' ); $invindexer->spec_field( name => 'id' ); for my $letter ( 'a' .. 'y' ) { my $doc = $invindexer->new_doc; my $content = ( "$letter " x $iter ) . 'z'; $doc->set_value( content => $content ); $doc->set_value( id => $id++ ); $invindexer->add_doc($doc); } $invindexer->finish; } my $reader = KinoSearch1::Index::IndexReader->new( invindex => $invindex ); my $term = KinoSearch1::Index::Term->new( 'content', 'c' ); my $term_docs = $reader->term_docs($term); my ( $docs, $freqs, $prox ) = ( '', '', '' ); my ( $d, $f ); while ( $term_docs->bulk_read( $d, $f, 1024 ) ) { $docs .= $d; $freqs .= $f; } my @doc_nums = unpack( 'I*', $docs ); is_deeply( \@doc_nums, [ 2, 27, 52, 77 ], "correct doc_nums" ); my @freq_nums = unpack( 'I*', $freqs ); is_deeply( \@freq_nums, [ 1, 2, 3, 4 ], "correct freqs" ); $term_docs->set_read_positions(1); $term_docs->seek($term); $prox = ''; $prox .= $term_docs->get_positions while $term_docs->next; my @prox_nums = unpack( 'I*', $prox ); is_deeply( \@prox_nums, [ 0, 0, 1, 0, 1, 2, 0, 1, 2, 3 ], "correct positions" ); my $invindexer = KinoSearch1::InvIndexer->new( invindex => $invindex, analyzer => $tokenizer, ); $invindexer->delete_docs_by_term( KinoSearch1::Index::Term->new( id => 52 ) ); $invindexer->finish; $reader = KinoSearch1::Index::IndexReader->new( invindex => $invindex ); $term_docs = $reader->term_docs($term); @doc_nums = (); push @doc_nums, $term_docs->get_doc, while $term_docs->next; is_deeply( \@doc_nums, [ 2, 27, 77 ], "deletions handled properly" ); KinoSearch1-1.01/t/213-segment_merging.t000444000765000765 621211462203446 17774 0ustar00marvinmarvin000000000000use strict; use warnings; use lib 'buildlib'; use Test::More tests => 10; use File::Path qw( rmtree ); use File::Spec::Functions qw( catfile ); use File::stat qw( stat ); BEGIN { use_ok('KinoSearch1::InvIndexer'); use_ok('KinoSearch1::Searcher'); use_ok('KinoSearch1::Analysis::Tokenizer'); use_ok('KinoSearch1::Index::IndexReader'); } use KinoSearch1::Test::TestUtils qw( create_index init_test_index_loc ); my $invindex_loc = init_test_index_loc(); my ( $invindexer, $searcher, $hits, $another_invindex, $yet_another_invindex ); my $tokenizer = KinoSearch1::Analysis::Tokenizer->new; my $fake_norm_file = catfile( $invindex_loc, '_4.f0' ); sub init_invindexer { my %args = @_; undef $invindexer; $invindexer = KinoSearch1::InvIndexer->new( invindex => $invindex_loc, analyzer => $tokenizer, %args, ); if ( $args{create} ) { open( my $fh, '>', $fake_norm_file ) or die "can't open $fake_norm_file: $!"; print $fh "blah"; } $invindexer->spec_field( name => 'letters' ); } my $create = 1; my @correct; for my $num_letters ( reverse 1 .. 10 ) { init_invindexer( create => $create ); $create = 0; for my $letter ( 'a' .. 'b' ) { my $doc = $invindexer->new_doc; my $content = ( "$letter " x $num_letters ) . 'z'; $doc->set_value( letters => $content ); $invindexer->add_doc($doc); push @correct, $content if $letter eq 'b'; } $invindexer->finish; } ok( !-f $fake_norm_file, "overwrote fake leftover norm file" ); $searcher = KinoSearch1::Searcher->new( invindex => $invindex_loc, analyzer => $tokenizer, ); $hits = $searcher->search( query => 'b' ); is( $hits->total_hits, 10, "correct total_hits from merged invindex" ); my @got; push @got, $hits->fetch_hit_hashref->{letters} for 1 .. $hits->total_hits; is_deeply( \@got, \@correct, "correct top scoring hit from merged invindex" ); init_invindexer(); $another_invindex = create_index( "atlantic ocean", "fresh fish" ); $yet_another_invindex = create_index("bonus"); $invindexer->add_invindexes( $another_invindex, $yet_another_invindex ); $invindexer->finish; $searcher = KinoSearch1::Searcher->new( invindex => $invindex_loc, analyzer => $tokenizer, ); $hits = $searcher->search( query => 'fish' ); is( $hits->total_hits, 1, "correct total_hits after add_invindexes" ); is( $hits->fetch_hit_hashref->{content}, 'fresh fish', "other invindexes successfully absorbed" ); undef $searcher; undef $hits; # Open an IndexReader, to prevent the deletion of files on Win32 and verify # the deletequeue mechanism. my $reader = KinoSearch1::Index::IndexReader->new( invindex => $invindex_loc, ); init_invindexer(); $invindexer->finish( optimize => 1 ); $reader->close; init_invindexer(); $invindexer->finish( optimize => 1 ); opendir( my $invindex_dh, $invindex_loc ) or die "Couldn't opendir '$invindex_loc': $!"; my @cfs_files = grep {m/\.cfs$/} readdir $invindex_dh; closedir $invindex_dh, $invindex_loc or die "Couldn't closedir '$invindex_loc': $!"; is( scalar @cfs_files, 1, "merged segment files successfully deleted" ); # Clean up. rmtree($invindex_loc); KinoSearch1-1.01/t/214-spec_field.t000444000765000765 517311462203446 16725 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use Test::More tests => 12; use KinoSearch1::Store::RAMInvIndex; use KinoSearch1::Analysis::Tokenizer; use KinoSearch1::Analysis::PolyAnalyzer; use KinoSearch1::InvIndexer; use KinoSearch1::Searcher; use KinoSearch1::Search::TermQuery; use KinoSearch1::Index::Term; my $tokenizer = KinoSearch1::Analysis::Tokenizer->new; my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new( language => 'en' ); my $invindex = KinoSearch1::Store::RAMInvIndex->new( create => 1 ); my $invindexer = KinoSearch1::InvIndexer->new( invindex => $invindex, analyzer => $tokenizer, ); $invindexer->spec_field( name => 'analyzed', ); $invindexer->spec_field( name => 'polyanalyzed', analyzer => $polyanalyzer, ); $invindexer->spec_field( name => 'unanalyzed', analyzed => 0, ); $invindexer->spec_field( name => 'unpolyanalyzed', analyzed => 0, analyzer => $polyanalyzer, ); $invindexer->spec_field( name => 'unindexed_but_analyzed', indexed => 0, ); $invindexer->spec_field( name => 'unanalyzed_unindexed', analyzed => 0, indexed => 0, ); sub add_a_doc { my $field_name = shift; my $doc = $invindexer->new_doc; $doc->set_value( $field_name => 'United States' ); $invindexer->add_doc($doc); } add_a_doc($_) for qw( analyzed polyanalyzed unanalyzed unpolyanalyzed unindexed_but_analyzed unanalyzed_unindexed ); $invindexer->finish; sub check { my ( $field_name, $query_text, $expected_num_hits ) = @_; my $query = KinoSearch1::Search::TermQuery->new( term => KinoSearch1::Index::Term->new( $field_name, $query_text ), ); my $searcher = KinoSearch1::Searcher->new( invindex => $invindex, analyzer => $tokenizer, # doesn't matter - no QueryParser ); my $hits = $searcher->search( query => $query ); is( $hits->total_hits, $expected_num_hits, "$field_name correct num hits " ); # don't check the contents of the hit if there aren't any return unless $expected_num_hits; my $hit = $hits->fetch_hit_hashref; is( $hit->{$field_name}, 'United States', "$field_name correct doc returned" ); } check( 'analyzed', 'States', 1 ); check( 'polyanalyzed', 'state', 1 ); check( 'unanalyzed', 'United States', 1 ); check( 'unpolyanalyzed', 'United States', 1 ); check( 'unindexed_but_analyzed', 'state', 0 ); check( 'unindexed_but_analyzed', 'United States', 0 ); check( 'unanalyzed_unindexed', 'state', 0 ); check( 'unanalyzed_unindexed', 'United States', 0 ); KinoSearch1-1.01/t/302-many_fields.t000444000765000765 243411462203446 17115 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use Test::More 'no_plan'; use KinoSearch1::Store::RAMInvIndex; use KinoSearch1::Analysis::Tokenizer; use KinoSearch1::InvIndexer; use KinoSearch1::Searcher; my $tokenizer = KinoSearch1::Analysis::Tokenizer->new; for my $num_fields ( 1 .. 10 ) { # build an invindex with $num_fields fields, and the same content in each my $invindex = KinoSearch1::Store::RAMInvIndex->new( create => 1 ); my $invindexer = KinoSearch1::InvIndexer->new( invindex => $invindex, analyzer => $tokenizer, ); for my $field_name ( 1 .. $num_fields ) { $invindexer->spec_field( name => $field_name ); } for my $content ( 'a' .. 'z', 'x x y' ) { my $doc = $invindexer->new_doc; for my $field_name ( 1 .. $num_fields ) { $doc->set_value( $field_name => $content ); } $invindexer->add_doc($doc); } $invindexer->finish; # see if our search results match as expected. my $searcher = KinoSearch1::Searcher->new( invindex => $invindex, analyzer => $tokenizer, ); my $hits = $searcher->search('x'); $hits->seek( 0, 100 ); is( $hits->total_hits, 2, "correct number of hits for $num_fields fields" ); my $top_hit = $hits->fetch_hit_hashref; } KinoSearch1-1.01/t/303-highlighter.t000444000765000765 453211462203445 17122 0ustar00marvinmarvin000000000000use strict; use warnings; use lib 'buildlib'; use Test::More tests => 9; BEGIN { use_ok('KinoSearch1::Searcher'); use_ok('KinoSearch1::Analysis::Tokenizer'); use_ok('KinoSearch1::Highlight::Highlighter'); } use KinoSearch1::InvIndexer; use KinoSearch1::Store::RAMInvIndex; my $tokenizer = KinoSearch1::Analysis::Tokenizer->new; my $invindex = KinoSearch1::Store::RAMInvIndex->new( create => 1 ); my $invindexer = KinoSearch1::InvIndexer->new( invindex => $invindex, analyzer => $tokenizer, ); $invindexer->spec_field( name => 'content' ); $invindexer->spec_field( name => 'alt', boost => 0.1 ); my $string = '1 2 3 4 5 ' x 20; # 200 characters $string .= 'a b c d x y z h i j k '; $string .= '6 7 8 9 0 ' x 20; my $with_quotes = '"I see," said the blind man.'; for ( $string, $with_quotes ) { my $doc = $invindexer->new_doc; $doc->set_value( content => $_ ); $invindexer->add_doc($doc); } { my $doc = $invindexer->new_doc; $doc->set_value( alt => $string . " and extra stuff so it scores lower" ); $doc->set_value( content => "x but not why or 2ee" ); $invindexer->add_doc($doc); } $invindexer->finish; my $searcher = KinoSearch1::Searcher->new( invindex => $invindex, analyzer => $tokenizer, ); my $highlighter = KinoSearch1::Highlight::Highlighter->new( excerpt_field => 'content', ); my $hits = $searcher->search( query => '"x y z" AND b' ); $hits->create_excerpts( highlighter => $highlighter ); $hits->seek( 0, 2 ); my $hit = $hits->fetch_hit_hashref; like( $hit->{excerpt}, qr/b.*?z/, "excerpt contains all relevant terms" ); like( $hit->{excerpt}, qr#x y z#, "highlighter tagged the phrase" ); like( $hit->{excerpt}, qr#b#, "highlighter tagged the single term" ); like( $hits->fetch_hit_hashref()->{excerpt}, qr/x/, "excerpt field with partial hit doesn't cause highlighter freakout" ); $hits = $searcher->search( query => 'x "x y z" AND b' ); $hits->create_excerpts( highlighter => $highlighter ); $hits->seek( 0, 2 ); like( $hits->fetch_hit_hashref()->{excerpt}, qr/x y z/, "query with same word in both phrase and term doesn't cause freakout" ); $hits = $searcher->search( query => 'blind' ); $hits->create_excerpts( highlighter => $highlighter ); like( $hits->fetch_hit_hashref()->{excerpt}, qr/quot/, "HTML entity encoded properly" ); KinoSearch1-1.01/t/501-termquery.t000444000765000765 172411462203446 16662 0ustar00marvinmarvin000000000000#!/usr/bin/perl use lib 'buildlib'; use Test::More tests => 7; BEGIN { use_ok('KinoSearch1::Search::TermQuery'); use_ok('KinoSearch1::Index::Term'); use_ok('KinoSearch1::Searcher'); } use KinoSearch1::Test::TestUtils qw( create_index ); my $invindex = create_index( 'a', 'b', 'c c c d', 'c d', 'd' .. 'z', ); my $term = KinoSearch1::Index::Term->new( 'content', 'c' ); my $term_query = KinoSearch1::Search::TermQuery->new( term => $term ); my $searcher = KinoSearch1::Searcher->new( invindex => $invindex ); my $hits = $searcher->search( query => $term_query ); $hits->seek( 0, 50 ); is( $hits->total_hits, 2, "correct number of hits returned" ); my $hashref = $hits->fetch_hit_hashref; is( $hashref->{content}, 'c c c d', "most relevant doc is highest" ); $hashref = $hits->fetch_hit_hashref; is( $hashref->{content}, 'c d', "second most relevant" ); $hits->seek( 1, 50 ); $hashref = $hits->fetch_hit_hashref; is( $hashref->{content}, 'c d', "fresh seek" ); KinoSearch1-1.01/t/502-phrasequery.t000444000765000765 265711462203446 17204 0ustar00marvinmarvin000000000000#!/usr/bin/perl use lib 'buildlib'; use Test::More tests => 5; BEGIN { use_ok('KinoSearch1::Search::PhraseQuery') } use KinoSearch1::Test::TestUtils qw( create_index ); use KinoSearch1::Index::Term; use KinoSearch1::Searcher; my $best_match = 'x a b c d a b c d'; my @docs = ( 1 .. 20, 'a b c a b c a b c d', 'a b c d x x a', 'a c b d', 'a x x x b x x x c x x x x x x d x', $best_match, 'a' .. 'z', ); my $invindex = create_index(@docs); my $searcher = KinoSearch1::Searcher->new( invindex => $invindex ); my $phrase_query = KinoSearch1::Search::PhraseQuery->new( slop => 0 ); for (qw( a b c d )) { my $term = KinoSearch1::Index::Term->new( 'content', $_ ); $phrase_query->add_term($term); } my $hits = $searcher->search( query => $phrase_query ); $hits->seek( 0, 50 ); is( $hits->total_hits, 3, "correct number of hits" ); my $first_hit = $hits->fetch_hit_hashref; is( $first_hit->{content}, $best_match, 'best match appears first' ); my $second_hit = $hits->fetch_hit_hashref; ok( $first_hit->{score} > $second_hit->{score}, "best match scores higher: $first_hit->{score} > $second_hit->{score}" ); $phrase_query = KinoSearch1::Search::PhraseQuery->new( slop => 0 ); for (qw( c a )) { my $term = KinoSearch1::Index::Term->new( 'content', $_ ); $phrase_query->add_term($term); } $hits = $searcher->search( query => $phrase_query ); is( $hits->total_hits, 1, 'avoid underflow when subtracting offset' ); KinoSearch1-1.01/t/503-booleanquery.t000444000765000765 415311462203445 17332 0ustar00marvinmarvin000000000000#!/usr/bin/perl use lib 'buildlib'; use Test::More tests => 6; use File::Spec::Functions qw( catfile ); BEGIN { use_ok('KinoSearch1::Search::BooleanQuery'); use_ok('KinoSearch1::Search::BooleanScorer'); } use KinoSearch1::Test::TestUtils qw( create_index ); use KinoSearch1::Search::TermQuery; use KinoSearch1::Index::Term; use KinoSearch1::Searcher; use KinoSearch1::Analysis::Tokenizer; my @docs = ( 'a' .. 'h', 'c c', 'c d e' ); push @docs, ('x') x 90; push @docs, ('c d x'); my $invindex = create_index(@docs); my $tokenizer = KinoSearch1::Analysis::Tokenizer->new; my $searcher = KinoSearch1::Searcher->new( invindex => $invindex, analyzer => $tokenizer, ); my $bool_query = KinoSearch1::Search::BooleanQuery->new; my $c_query = KinoSearch1::Search::TermQuery->new( term => KinoSearch1::Index::Term->new( 'content', 'c' ), ); my $d_query = KinoSearch1::Search::TermQuery->new( term => KinoSearch1::Index::Term->new( 'content', 'd' ), ); my $e_query = KinoSearch1::Search::TermQuery->new( term => KinoSearch1::Index::Term->new( 'content', 'e' ), ); $bool_query->add_clause( query => $c_query, occur => 'SHOULD', ); my $hits = $searcher->search( query => $bool_query ); $hits->seek( 0, 10 ); is( $hits->total_hits, 4, "single clause" ); $bool_query->add_clause( query => $d_query, occur => 'MUST', ); $hits = $searcher->search( query => $bool_query ); $hits->seek( 0, 10 ); is( $hits->total_hits, 3, "c +d" ); $bool_query->add_clause( query => $e_query, occur => 'MUST_NOT', ); $hits = $searcher->search( query => $bool_query ); $hits->seek( 0, 10 ); is( $hits->total_hits, 2, "c +d -e" ); $bool_query = KinoSearch1::Search::BooleanQuery->new; $bool_query->add_clause( query => $c_query, occur => 'SHOULD' ); my $sub_query = KinoSearch1::Search::BooleanQuery->new; $sub_query->add_clause( query => $d_query, occur => 'SHOULD', ); $sub_query->add_clause( query => $e_query, occur => 'SHOULD', ); $bool_query->add_clause( query => $sub_query, occur => 'SHOULD' ); $hits = $searcher->search( query => $bool_query ); $hits->seek( 0, 50 ); is( $hits->total_hits, 6, "nested BooleanQuery" ); KinoSearch1-1.01/t/504-similarity.t000444000765000765 423611462203446 17017 0ustar00marvinmarvin000000000000#!/usr/bin/perl use Test::More tests => 5; BEGIN { use_ok('KinoSearch1::Search::Similarity'); } use KinoSearch1::Store::RAMInvIndex; use KinoSearch1::Analysis::Tokenizer; use KinoSearch1::InvIndexer; use KinoSearch1::Searcher; my $sim = KinoSearch1::Search::Similarity->new; my @bytes = map { pack( 'C', $_ ) } ( 100, 110, 120, 130, 140 ); my @floats = ( 0.015625, 0.09375, 0.5, 3.0, 16.0 ); my @transformed = map { $sim->decode_norm($_) } @bytes; is_deeply( \@floats, \@transformed, "decode_norm more or less matches Java Lucene behavior" ); @bytes = map { pack( 'C', $_ ) } 0 .. 255; @floats = map { $sim->decode_norm($_) } @bytes; @transformed = map { $sim->encode_norm($_) } @floats; is_deeply( \@transformed, \@bytes, "encode_norm and decode_norm are complementary" ); my $norm_decoder = $sim->get_norm_decoder; @transformed = (); for ( 0 .. 255 ) { push @transformed, unpack( 'f', bytes::substr( $norm_decoder, $_ * 4, 4 ) ); } is_deeply( \@transformed, \@floats, "using the norm_decoder produces desired results" ); my $invindex = KinoSearch1::Store::RAMInvIndex->new( create => 1 ); my $tokenizer = KinoSearch1::Analysis::Tokenizer->new; my $invindexer = KinoSearch1::InvIndexer->new( analyzer => $tokenizer, invindex => $invindex, ); $invindexer->spec_field( name => 'body' ); $invindexer->spec_field( name => 'title', boost => 2, ); my $title_sim = KinoSearch1::Search::TitleSimilarity->new; $invindexer->set_similarity( title => $title_sim ); my %source_docs = ( 'spam spam spam spam' => 'a load of spam', 'not spam' => 'not spam not even close to spam', ); while ( my ( $title, $body ) = each %source_docs ) { my $doc = $invindexer->new_doc; $doc->set_value( title => $title ); $doc->set_value( body => $body ); $invindexer->add_doc($doc); } $invindexer->finish; undef $invindexer; my $searcher = KinoSearch1::Searcher->new( invindex => $invindex, analyzer => $tokenizer, ); $searcher->set_similarity( title => $title_sim ); my $hits = $searcher->search( query => 'spam' ); is( $hits->fetch_hit_hashref->{'title'}, 'not spam', "TitleSimilarity works well on title fields" ); KinoSearch1-1.01/t/505-hit_queue.t000444000765000765 161111462203446 16614 0ustar00marvinmarvin000000000000use strict; use warnings; use Test::More tests => 3; use Scalar::Util qw( dualvar ); BEGIN { use_ok('KinoSearch1::Search::HitQueue') } my $hq = KinoSearch1::Search::HitQueue->new( max_size => 10 ); my @docs_and_scores = ( [ 1.0, 0 ], [ 0.1, 5 ], [ 0.1, 10 ], [ 0.9, 1000 ], [ 1.0, 3000 ], [ 1.0, 2000 ], ); my @scoredocs = map { dualvar( $_->[0], pack( 'N', $_->[1] ) ) } @docs_and_scores; my @correct_order = sort { $b <=> $a or $a cmp $b } @scoredocs; my @correct_docs = map { unpack( 'N', "$_" ) } @correct_order; my @correct_scores = map { 0 + $_ } @correct_order; my $hit_docs; $hq->insert($_) for @scoredocs; $hit_docs = $hq->hits; my @scores = map { $_->get_score } @$hit_docs; is_deeply( \@scores, \@correct_scores, "rank by scores first" ); my @doc_nums = map { $_->get_id } @$hit_docs; is_deeply( \@doc_nums, \@correct_docs, "rank by doc_num after score" ); KinoSearch1-1.01/t/506-hit_collector.t000444000765000765 137511462203445 17465 0ustar00marvinmarvin000000000000use strict; use warnings; use Test::More tests => 4; BEGIN { use_ok('KinoSearch1::Search::HitCollector') } my @docs_and_scores = ( [ 0, 2 ], [ 5, 0 ], [ 10, 0 ], [ 1000, 1 ] ); my $hc = KinoSearch1::Search::HitQueueCollector->new( size => 3, ); $hc->collect( $_->[0], $_->[1] ) for @docs_and_scores; my $hit_queue = $hc->get_storage; isa_ok( $hit_queue, 'KinoSearch1::Search::HitQueue' ); my @scores = map { $_->get_score } @{ $hit_queue->hits }; is_deeply( \@scores, [ 2, 1, 0 ], "collect into HitQueue" ); $hc = KinoSearch1::Search::BitCollector->new; $hc->collect( $_->[0], $_->[1] ) for @docs_and_scores; is_deeply( $hc->get_bit_vector()->to_arrayref, [ 0, 5, 10, 1000 ], "BitCollector produces a valid BitVector with the right doc nums" ); KinoSearch1-1.01/t/507-query_filter.t000444000765000765 162111462203446 17341 0ustar00marvinmarvin000000000000use strict; use warnings; use lib 'buildlib'; use Test::More tests => 2; use KinoSearch1::Search::HitCollector; use KinoSearch1::Searcher; use KinoSearch1::Analysis::Tokenizer; use KinoSearch1::Search::TermQuery; use KinoSearch1::Index::Term; BEGIN { use_ok('KinoSearch1::Search::QueryFilter') } use KinoSearch1::Test::TestUtils qw( create_index ); my $invindex = create_index( 'a x', 'b x', 'c x', 'a y', 'b y', 'c y' ); my $searcher = KinoSearch1::Searcher->new( invindex => $invindex, analyzer => KinoSearch1::Analysis::Tokenizer->new, ); my $only_a_query = KinoSearch1::Search::TermQuery->new( term => KinoSearch1::Index::Term->new( 'content', 'a' ), ); my $filter = KinoSearch1::Search::QueryFilter->new( query => $only_a_query, ); my $hits = $searcher->search( query => 'x y', filter => $filter, ); $hits->seek( 0, 50 ); is( $hits->total_hits, 2, "filtering a query works" ); KinoSearch1-1.01/t/508-hits.t000444000765000765 211411462203446 15575 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use lib 'buildlib'; use Test::More tests => 4; BEGIN { use_ok('KinoSearch1::Search::Hits') } use KinoSearch1::Searcher; use KinoSearch1::Analysis::Tokenizer; use KinoSearch1::Test::TestUtils qw( create_index ); my @docs = ( 'a b', 'a a b', 'a a a b', 'x' ); my $invindex = create_index(@docs); my $searcher = KinoSearch1::Searcher->new( invindex => $invindex, analyzer => KinoSearch1::Analysis::Tokenizer->new, ); my $hits = $searcher->search( query => 'a' ); my @ids; my @retrieved; while ( my $hit = $hits->fetch_hit ) { push @ids, $hit->get_id; my $doc = $hit->get_doc; push @retrieved, $doc->get_value('content'); } is_deeply( \@ids, [ 2, 1, 0 ], "get_id()" ); is_deeply( \@retrieved, [ @docs[ 2, 1, 0 ] ], "correct content via fetch_hit() and get_doc()" ); @retrieved = (); $hits = $searcher->search( query => 'a' ); while ( my $hashref = $hits->fetch_hit_hashref ) { push @retrieved, $hashref->{content}; } is_deeply( \@retrieved, [ @docs[ 2, 1, 0 ] ], "correct content via fetch_hit_hashref()" ); KinoSearch1-1.01/t/509-multi_searcher.t000444000765000765 207311462203446 17641 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use Test::More tests => 4; use lib 'buildlib'; BEGIN { use_ok('KinoSearch1::Search::MultiSearcher') } use KinoSearch1::Searcher; use KinoSearch1::Analysis::Tokenizer; use KinoSearch1::Test::TestUtils qw( create_index ); my $invindex_a = create_index( 'x a', 'x b', 'x c' ); my $invindex_b = create_index( 'y b', 'y c', 'y d' ); my $tokenizer = KinoSearch1::Analysis::Tokenizer->new; my $searcher_a = KinoSearch1::Searcher->new( analyzer => $tokenizer, invindex => $invindex_a, ); my $searcher_b = KinoSearch1::Searcher->new( analyzer => $tokenizer, invindex => $invindex_b, ); my $multi_searcher = KinoSearch1::Search::MultiSearcher->new( searchables => [ $searcher_a, $searcher_b ], analyzer => $tokenizer, ); my $hits = $multi_searcher->search('a'); is( $hits->total_hits, 1, "Find hit in first searcher" ); $hits = $multi_searcher->search('d'); is( $hits->total_hits, 1, "Find hit in second searcher" ); $hits = $multi_searcher->search('c'); is( $hits->total_hits, 2, "Find hits in both searchers" ); KinoSearch1-1.01/t/510-remote_search.t000444000765000765 471411462203446 17447 0ustar00marvinmarvin000000000000use strict; use warnings; use Test::More; use Time::HiRes qw( sleep ); use IO::Socket::INET; use lib 'buildlib'; my $PORT_NUM = 7890; BEGIN { if ( $^O =~ /mswin/i ) { plan( 'skip_all', "fork on Windows not supported by KS" ); } } use KinoSearch1::Search::SearchServer; use KinoSearch1::Search::SearchClient; use KinoSearch1::Searcher; use KinoSearch1::Analysis::Tokenizer; use KinoSearch1::Search::MultiSearcher; use KinoSearch1::Test::TestUtils qw( create_index ); my $tokenizer = KinoSearch1::Analysis::Tokenizer->new; my $kid; $kid = fork; if ($kid) { sleep .25; # allow time for the server to set up the socket die "Failed fork: $!" unless defined $kid; } else { my $invindex = create_index( 'x a', 'x b', 'x c' ); my $searcher = KinoSearch1::Searcher->new( analyzer => $tokenizer, invindex => $invindex, ); my $server = KinoSearch1::Search::SearchServer->new( port => $PORT_NUM, searchable => $searcher, password => 'foo', ); $server->serve; exit(0); } my $test_client_sock = IO::Socket::INET->new( PeerAddr => "localhost:$PORT_NUM", Proto => 'tcp', ); if ($test_client_sock) { plan( tests => 4 ); undef $test_client_sock; } else { plan( 'skip_all', "Can't get a socket: $!" ); } my $tokenizer2 = KinoSearch1::Analysis::Tokenizer->new; my $searchclient = KinoSearch1::Search::SearchClient->new( analyzer => $tokenizer2, peer_address => "localhost:$PORT_NUM", password => 'foo', ); my $hits = $searchclient->search('x'); is( $hits->total_hits, 3, "retrieved hits from search server" ); $hits = $searchclient->search('a'); is( $hits->total_hits, 1, "retrieved hit from search server" ); my $invindex_b = create_index( 'y b', 'y c', 'y d' ); my $searcher_b = KinoSearch1::Searcher->new( analyzer => $tokenizer, invindex => $invindex_b, ); my $multi_searcher = KinoSearch1::Search::MultiSearcher->new( analyzer => $tokenizer, searchables => [ $searcher_b, $searchclient ], ); $hits = $multi_searcher->search('b'); is( $hits->total_hits, 2, "retrieved hits from MultiSearcher" ); my %results; $results{ $hits->fetch_hit_hashref()->{content} } = 1; $results{ $hits->fetch_hit_hashref()->{content} } = 1; my %expected = ( 'x b' => 1, 'y b' => 1, ); is_deeply( \%results, \%expected, "docs fetched from both local and remote" ); END { $searchclient->terminate if defined $searchclient; kill( TERM => $kid ) if $kid; } KinoSearch1-1.01/t/601-queryparser.t000444000765000765 1253211462203446 17227 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use lib 'buildlib'; use KinoSearch1 qw( kdump ); use Test::More tests => 217; BEGIN { use_ok('KinoSearch1::QueryParser::QueryParser') } use KinoSearch1::InvIndexer; use KinoSearch1::Searcher; use KinoSearch1::Store::RAMInvIndex; use KinoSearch1::Analysis::Tokenizer; use KinoSearch1::Analysis::Stopalizer; use KinoSearch1::Analysis::PolyAnalyzer; my $whitespace_tokenizer = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\S+/ ); my $stopalizer = KinoSearch1::Analysis::Stopalizer->new( stoplist => { x => 1 } ); my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new( analyzers => [ $whitespace_tokenizer, $stopalizer, ], ); my @docs = ( 'x', 'y', 'z', 'x a', 'x a b', 'x a b c', 'x foo a b c d', ); my $invindex = KinoSearch1::Store::RAMInvIndex->new( create => 1 ); my $stop_invindex = KinoSearch1::Store::RAMInvIndex->new( create => 1 ); my $invindexer = KinoSearch1::InvIndexer->new( invindex => $invindex, analyzer => $whitespace_tokenizer, ); my $stop_invindexer = KinoSearch1::InvIndexer->new( invindex => $stop_invindex, analyzer => $polyanalyzer, ); $invindexer->spec_field( name => 'content' ); $stop_invindexer->spec_field( name => 'content' ); for my $content_string (@docs) { my $doc = $invindexer->new_doc; $doc->set_value( content => $content_string ); $invindexer->add_doc($doc); $doc = $stop_invindexer->new_doc; $doc->set_value( content => $content_string ); $stop_invindexer->add_doc($doc); } $invindexer->finish; $stop_invindexer->finish; my $OR_parser = KinoSearch1::QueryParser::QueryParser->new( analyzer => $whitespace_tokenizer, default_field => 'content', ); my $AND_parser = KinoSearch1::QueryParser::QueryParser->new( analyzer => $whitespace_tokenizer, default_field => 'content', default_boolop => 'AND', ); my $OR_stop_parser = KinoSearch1::QueryParser::QueryParser->new( analyzer => $polyanalyzer, default_field => 'content', ); my $AND_stop_parser = KinoSearch1::QueryParser::QueryParser->new( analyzer => $polyanalyzer, default_field => 'content', default_boolop => 'AND', ); my $searcher = KinoSearch1::Searcher->new( invindex => $invindex ); my $stop_searcher = KinoSearch1::Searcher->new( invindex => $stop_invindex ); my @logical_tests = ( 'b' => [ 3, 3, 3, 3, ], '(a)' => [ 4, 4, 4, 4, ], '"a"' => [ 4, 4, 4, 4, ], '"(a)"' => [ 0, 0, 0, 0, ], '("a")' => [ 4, 4, 4, 4, ], 'a b' => [ 4, 3, 4, 3, ], 'a (b)' => [ 4, 3, 4, 3, ], 'a "b"' => [ 4, 3, 4, 3, ], 'a ("b")' => [ 4, 3, 4, 3, ], 'a "(b)"' => [ 4, 0, 4, 0, ], '(a b)' => [ 4, 3, 4, 3, ], '"a b"' => [ 3, 3, 3, 3, ], '("a b")' => [ 3, 3, 3, 3, ], '"(a b)"' => [ 0, 0, 0, 0, ], 'a b c' => [ 4, 2, 4, 2, ], 'a (b c)' => [ 4, 2, 4, 2, ], 'a "b c"' => [ 4, 2, 4, 2, ], 'a ("b c")' => [ 4, 2, 4, 2, ], 'a "(b c)"' => [ 4, 0, 4, 0, ], '"a b c"' => [ 2, 2, 2, 2, ], '-x' => [ 0, 0, 0, 0, ], 'x -c' => [ 3, 3, 0, 0, ], 'x "-c"' => [ 5, 0, 0, 0, ], 'x +c' => [ 2, 2, 2, 2, ], 'x "+c"' => [ 5, 0, 0, 0, ], '+x +c' => [ 2, 2, 2, 2, ], '+x -c' => [ 3, 3, 0, 0, ], '-x +c' => [ 0, 0, 2, 2, ], '-x -c' => [ 0, 0, 0, 0, ], 'x y' => [ 6, 0, 1, 1, ], 'x a d' => [ 5, 1, 4, 1, ], 'x "a d"' => [ 5, 0, 0, 0, ], '"x a"' => [ 3, 3, 3, 3, ], 'x AND y' => [ 0, 0, 1, 1, ], 'x OR y' => [ 6, 6, 1, 1, ], 'x AND NOT y' => [ 5, 5, 0, 0, ], 'x (b OR c)' => [ 5, 3, 3, 3, ], 'x AND (b OR c)' => [ 3, 3, 3, 3, ], 'x OR (b OR c)' => [ 5, 5, 3, 3, ], 'x (y OR c)' => [ 6, 2, 3, 3, ], 'x AND (y OR c)' => [ 2, 2, 3, 3, ], 'a AND NOT (b OR "c d")' => [ 1, 1, 1, 1, ], 'a AND NOT "a b"' => [ 1, 1, 1, 1, ], 'a AND NOT ("a b" OR "c d")' => [ 1, 1, 1, 1, ], '+"b c" -d' => [ 1, 1, 1, 1, ], '"a b" +d' => [ 1, 1, 1, 1, ], 'x AND NOT (b OR (c AND d))' => [ 2, 2, 0, 0, ], '-(+notthere)' => [ 0, 0, 0, 0 ], 'content:b' => [ 3, 3, 3, 3, ], 'bogusfield:a' => [ 0, 0, 0, 0, ], 'bogusfield:a content:b' => [ 3, 0, 3, 0, ], 'content:b content:c' => [ 3, 2, 3, 2 ], 'content:(b c)' => [ 3, 2, 3, 2 ], 'bogusfield:(b c)' => [ 0, 0, 0, 0 ], ); my $i = 0; while ( $i < @logical_tests ) { my $qstring = $logical_tests[$i]; $i++; my $query = $OR_parser->parse($qstring); my $hits = $searcher->search( query => $query ); $hits->seek( 0, 50 ); is( $hits->total_hits, $logical_tests[$i][0], "OR: $qstring" ); $query = $AND_parser->parse($qstring); $hits = $searcher->search( query => $query ); $hits->seek( 0, 50 ); is( $hits->total_hits, $logical_tests[$i][1], "AND: $qstring" ); $query = $OR_stop_parser->parse($qstring); $hits = $stop_searcher->search( query => $query ); $hits->seek( 0, 50 ); is( $hits->total_hits, $logical_tests[$i][2], "stoplist-OR: $qstring" ); $query = $AND_stop_parser->parse($qstring); $hits = $stop_searcher->search( query => $query ); $hits->seek( 0, 50 ); is( $hits->total_hits, $logical_tests[$i][3], "stoplist-AND: $qstring" ); $i++; $hits->{searcher} = undef; $hits->{reader} = undef; $hits->{weight} = undef; #kdump($query); #exit; } KinoSearch1-1.01/t/602-boosts.t000444000765000765 721211462203446 16136 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use lib 'buildlib'; use Test::More 'no_plan'; use KinoSearch1::Store::RAMInvIndex; use KinoSearch1::Searcher; use KinoSearch1::InvIndexer; use KinoSearch1::Analysis::Tokenizer; my $control_invindex = KinoSearch1::Store::RAMInvIndex->new( create => 1 ); my $boosted_fields_invindex_a = KinoSearch1::Store::RAMInvIndex->new( create => 1 ); my $boosted_fields_invindex_b = KinoSearch1::Store::RAMInvIndex->new( create => 1 ); my $boosted_docs_invindex = KinoSearch1::Store::RAMInvIndex->new( create => 1 ); my $analyzer = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\S+/ ); my $control_invindexer = KinoSearch1::InvIndexer->new( invindex => $control_invindex, analyzer => $analyzer, ); my $boosted_fields_invindexer_a = KinoSearch1::InvIndexer->new( invindex => $boosted_fields_invindex_a, analyzer => $analyzer, ); my $boosted_fields_invindexer_b = KinoSearch1::InvIndexer->new( invindex => $boosted_fields_invindex_b, analyzer => $analyzer, ); my $boosted_docs_invindexer = KinoSearch1::InvIndexer->new( invindex => $boosted_docs_invindex, analyzer => $analyzer, ); for ( $control_invindexer, $boosted_fields_invindexer_b, $boosted_docs_invindexer ) { $_->spec_field( name => 'content' ); $_->spec_field( name => 'category' ); } $boosted_fields_invindexer_a->spec_field( name => 'content' ); $boosted_fields_invindexer_a->spec_field( name => 'category', boost => 100, ); my %source_docs = ( 'x' => '', 'x a a a a' => 'x a', 'a b' => 'x a a', ); while ( my ( $content, $category ) = each %source_docs ) { my $doc = $control_invindexer->new_doc; $doc->set_value( content => $content ); $doc->set_value( category => $category ); $control_invindexer->add_doc($doc); $doc = $boosted_fields_invindexer_a->new_doc; $doc->set_value( content => $content ); $doc->set_value( category => $category ); $boosted_fields_invindexer_a->add_doc($doc); $doc = $boosted_fields_invindexer_b->new_doc; $doc->set_value( content => $content ); $doc->set_value( category => $category ); $doc->boost_field( content => 5 ) if ( $content =~ 'b' ); $boosted_fields_invindexer_b->add_doc($doc); $doc = $boosted_docs_invindexer->new_doc; $doc->set_value( content => $content ); $doc->set_value( category => $category ); $doc->set_boost(5) if ( $content =~ 'b' ); $boosted_docs_invindexer->add_doc($doc); } $control_invindexer->finish; $boosted_fields_invindexer_a->finish; $boosted_fields_invindexer_b->finish; $boosted_docs_invindexer->finish; my $searcher = KinoSearch1::Searcher->new( invindex => $control_invindex, analyzer => $analyzer, ); my $hits = $searcher->search('a'); $hits->seek( 0, 1 ); my $hit = $hits->fetch_hit_hashref; is( $hit->{content}, "x a a a a", "best doc ranks highest with no boosting" ); $searcher = KinoSearch1::Searcher->new( invindex => $boosted_fields_invindex_a, analyzer => $analyzer, ); $hits = $searcher->search('a'); $hits->seek( 0, 3 ); $hit = $hits->fetch_hit_hashref; is( $hit->{content}, 'a b', "boost from spec_field works" ); $searcher = KinoSearch1::Searcher->new( invindex => $boosted_fields_invindex_b, analyzer => $analyzer, ); $hits = $searcher->search('a'); $hits->seek( 0, 1 ); $hit = $hits->fetch_hit_hashref; is( $hit->{content}, 'a b', "boost from \$doc->boost_field works" ); $searcher = KinoSearch1::Searcher->new( invindex => $boosted_docs_invindex, analyzer => $analyzer, ); $hits = $searcher->search('a'); $hits->seek( 0, 1 ); $hit = $hits->fetch_hit_hashref; is( $hit->{content}, 'a b', "boost from \$doc->set_boost works" ); KinoSearch1-1.01/t/603-query_boosts.t000444000765000765 314211462203446 17362 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use lib 'buildlib'; use Test::More tests => 2; use KinoSearch1::Test::TestUtils qw( create_index ); use KinoSearch1::Searcher; use KinoSearch1::InvIndexer; use KinoSearch1::Analysis::Tokenizer; use KinoSearch1::Search::TermQuery; use KinoSearch1::Search::PhraseQuery; use KinoSearch1::Search::BooleanQuery; use KinoSearch1::Index::Term; my $doc_1 = 'a a a a a a a a a a a a b c d x y'; my $doc_2 = 'a b c d x y x y'; my $invindex = create_index( $doc_1, $doc_2 ); my $analyzer = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\S+/ ); my $searcher = KinoSearch1::Searcher->new( invindex => $invindex, analyzer => $analyzer, ); my $a_query = KinoSearch1::Search::TermQuery->new( term => KinoSearch1::Index::Term->new( 'content', 'a' ) ); my $x_y_query = KinoSearch1::Search::PhraseQuery->new; $x_y_query->add_term( KinoSearch1::Index::Term->new( 'content', 'x' ) ); $x_y_query->add_term( KinoSearch1::Index::Term->new( 'content', 'y' ) ); my $combined_query = KinoSearch1::Search::BooleanQuery->new; $combined_query->add_clause( query => $a_query, occur => 'SHOULD' ); $combined_query->add_clause( query => $x_y_query, occur => 'SHOULD' ); my $hits = $searcher->search( query => $combined_query ); $hits->seek( 0, 50 ); my $hit = $hits->fetch_hit_hashref; is( $hit->{content}, $doc_1, "best doc ranks highest with no boosting" ); my $first_score = $hit->{score}; $x_y_query->set_boost(20); $hits = $searcher->search( query => $combined_query ); $hits->seek( 0, 50 ); $hit = $hits->fetch_hit_hashref; is( $hit->{content}, $doc_2, "boosting a sub query succeeds" ); KinoSearch1-1.01/t/604-simple_search.t000444000765000765 411711462203446 17446 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use Test::More tests => 12; use KinoSearch1::Searcher; use KinoSearch1::InvIndexer; use KinoSearch1::Store::RAMInvIndex; use KinoSearch1::Analysis::Tokenizer; use KinoSearch1::QueryParser::QueryParser; my $tokenizer = KinoSearch1::Analysis::Tokenizer->new; my $invindex = KinoSearch1::Store::RAMInvIndex->new( create => 1 ); my $invindexer = KinoSearch1::InvIndexer->new( analyzer => $tokenizer, invindex => $invindex, ); $invindexer->spec_field( name => 'title' ); $invindexer->spec_field( name => 'body' ); my %docs = ( 'a' => 'foo', 'b' => 'bar', ); while ( my ( $title, $body ) = each %docs ) { my $doc = $invindexer->new_doc; $doc->set_value( title => $title ); $doc->set_value( body => $body ); $invindexer->add_doc($doc); } $invindexer->finish; my $searcher = KinoSearch1::Searcher->new( analyzer => $tokenizer, invindex => $invindex, ); my $or_parser = KinoSearch1::QueryParser::QueryParser->new( analyzer => $tokenizer, fields => [ 'title', 'body', ], ); my $and_parser = KinoSearch1::QueryParser::QueryParser->new( analyzer => $tokenizer, fields => [ 'title', 'body', ], default_boolop => 'AND', ); sub test_qstring { my ( $qstring, $expected, $message ) = @_; my $hits = $searcher->search( query => $qstring ); is( $hits->total_hits, $expected, $message ); my $query = $or_parser->parse($qstring); $hits = $searcher->search( query => $query ); is( $hits->total_hits, $expected, "OR: $message" ); $query = $and_parser->parse($qstring); $hits = $searcher->search( query => $query ); is( $hits->total_hits, $expected, "AND: $message" ); } test_qstring( 'a foo', 1, "simple match across multiple fields" ); test_qstring( 'a -foo', 0, "match of negated term on any field should exclude document" ); test_qstring( 'a +foo', 1, "failure to match of required term on a field " . "should not exclude doc if another field matches." ); test_qstring( '+a +foo', 1, "required terms spread across disparate fields should match" ); KinoSearch1-1.01/t/701-uscon.t000444000765000765 171011462203446 15751 0ustar00marvinmarvin000000000000use strict; use warnings; use lib 'buildlib'; use Test::More tests => 10; BEGIN { use_ok('KinoSearch1::Searcher'); use_ok('KinoSearch1::Analysis::PolyAnalyzer'); } use KinoSearch1::Test::TestUtils qw( persistent_test_index_loc ); my $tokenizer = KinoSearch1::Analysis::PolyAnalyzer->new( language => 'en' ); my $searcher = KinoSearch1::Searcher->new( invindex => persistent_test_index_loc(), analyzer => $tokenizer, ); my %searches = ( 'United' => 34, 'shall' => 50, 'not' => 27, '"shall not"' => 21, 'shall not' => 51, 'Congress' => 31, 'Congress AND United' => 22, '(Congress AND United) OR ((Vice AND President) OR "free exercise")' => 28, ); while ( my ( $qstring, $num_expected ) = each %searches ) { my $hits = $searcher->search($qstring); $hits->seek( 0, 100 ); is( $hits->total_hits, $num_expected, $qstring ); } KinoSearch1-1.01/t/999-remove_invindexes.t000444000765000765 35511462203446 20362 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use lib 'buildlib'; use Test::More tests => 1; use KinoSearch1::Test::TestUtils qw( remove_working_dir working_dir ); remove_working_dir(); ok( !-e working_dir(), "working_dir is no more" ); KinoSearch1-1.01/t/pod-coverage.t000444000765000765 44611462203446 16655 0ustar00marvinmarvin000000000000#!perl -T use constant RUN_AUTHOR_ONLY_TESTS => 0; use Test::More; eval "use Test::Pod::Coverage 1.04"; plan skip_all => "Test::Pod::Coverage 1.04 required for testing POD coverage" if $@; plan skip_all => "Only run during development" unless RUN_AUTHOR_ONLY_TESTS; all_pod_coverage_ok(); KinoSearch1-1.01/t/pod.t000444000765000765 21411462203445 15054 0ustar00marvinmarvin000000000000#!perl -T use Test::More; eval "use Test::Pod 1.14"; plan skip_all => "Test::Pod 1.14 required for testing POD" if $@; all_pod_files_ok(); KinoSearch1-1.01/t/benchmarks000755000765000765 011462203446 16111 5ustar00marvinmarvin000000000000KinoSearch1-1.01/t/benchmarks/extract_reuters.plx000555000765000765 706311462203446 22227 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use File::Spec::Functions qw( catfile catdir ); use Cwd qw( getcwd ); # ensure call from correct location and with required arg my $source_dir = $ARGV[0]; die "Usage: ./extract_reuters.plx /path/to/expanded/archive" unless -d $source_dir; my $working_dir = getcwd; die "Must be run from the benchmarks/ directory" unless ( $working_dir =~ /benchmarks\W*$/ ); # create the main output directory my $main_out_dir = 'extracted_corpus'; if ( !-d $main_out_dir ) { mkdir $main_out_dir or die "Couldn't mkdir '$main_out_dir': $!"; } # get a list of the sgm files opendir SOURCE_DIR, $source_dir or die "Couldn't open directory: $!"; my @sgm_files = grep {/\.sgm$/} readdir SOURCE_DIR; closedir SOURCE_DIR or die "Couldn't close directory: $!"; die "Couldn't find all the sgm files" unless @sgm_files == 22; # track number of story docs my $num_files = 0; for my $sgm_file (@sgm_files) { # get the sgm file my $sgm_filepath = catfile( $source_dir, $sgm_file ); print "Processing $sgm_filepath\n"; open( my $sgm_fh, '<', $sgm_filepath ) or die "Couldn't open file '$sgm_filepath': $!"; # prepare output directory $sgm_file =~ /(\d+)\.sgm$/ or die "no match"; my $out_dir = catdir( $main_out_dir, "articles$1" ); if ( !-d $out_dir ) { mkdir $out_dir or die "Couldn't create directory '$out_dir': $!"; } my $in_body = 0; my $in_title = 0; my ( $title, $body ); while (<$sgm_fh>) { # start a new story doc if (///) { $in_title = 1; $title = ''; } $title .= $_ if $in_title; if (s/.*?//) { $in_body = 1; $body = ''; } $body .= $_ if $in_body; if (m#.*#) { $in_title = 0; $title =~ s#.*##s; } if (m#.*#) { $in_body = 0; $body =~ s#.*##s; } # write out a finished article doc if (m##) { die "Malformed data" if ( $in_title or $in_body ); if ( length $title and length $body ) { my $out_filename = sprintf( "article%05d.txt", $num_files ); my $out_filepath = catfile( $out_dir, $out_filename ); open( my $out_fh, '>', $out_filepath ) or die "Couldn't open '$out_filepath' for writing: $!"; $title =~ s/^\s*//; $title =~ s/\s*$//; print $out_fh "$title\n\n" or die "print failed: $!"; print $out_fh $body or die "print failed: $!"; close $out_fh or die "Couldn't close '$out_filepath': $!"; $num_files++; } } } } print "Total articles extracted: $num_files\n"; __END__ =head1 NAME extract_reuters.plx - parse Reuters 21578 corpus into individual files =head1 SYNOPSIS ./extract_reuters.plx /path/to/expanded/reuters/archive =head1 DESCRIPTION This script will extract TITLE and BODY for each item in the Reuters 21578 corpus into individual files. It expects to be passed the location of the decompressed archive as a command line argument. =head1 AUTHOR Marvin Humphrey E marvin at rectangular dot com E. =head1 COPYRIGHT AND LICENSE Copyright 2006-2010 Marvin Humphrey This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. =cut KinoSearch1-1.01/t/benchmarks/README.txt000555000765000765 330111462203446 17744 0ustar00marvinmarvin000000000000Indexing Benchmarks The purpose of this experiment is to test raw indexing speed, using Reuters-21578, Distribution 1.0 as a test corpus. As of this writing, Reuters-21578 is available at: http://www.daviddlewis.com/resources/testcollections/reuters21578 The corpus comes packaged in SGML, which means we need to preprocess it so that our results are not infected by differences between SGML parsers. A simple perl script, "./extract_reuters.plx" is supplied, which expands the Reuters articles out into the file system, 1 article per file, with the title as the first line of text. It takes one command line argument: the location of the un-tarred Reuters collection. ./extract_reuters.plx /path/to/reuters_collection Filepaths are hard-coded, and the assumption is that the apps will be run from within the benchmarks/ directory. Each of the indexing apps takes four optional command line arguments: * The number of documents to index. * The number of times to repeat the indexing process. * The increment, or number of docs to add during each index writer instance. * Whether or not the main text should be stored and vectorized. $ perl -Mblib indexers/kinosearch_indexer.plx \ > --docs=1000 --reps=6 --increment=10 --store=1 $ java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer \ > -docs 1000 -reps 6 -increment 10 -store 1 If no command line args are supplied, the apps will index the entire 19043 article collection once, using a single index writer, and will neither store nor vectorize the main text. Upon finishing, each app will produce a "truncated mean" report: the slowest 25% and fastest 25% of reps will be discarded, and the rest will be averaged. KinoSearch1-1.01/t/benchmarks/indexers000755000765000765 011462203446 17732 5ustar00marvinmarvin000000000000KinoSearch1-1.01/t/benchmarks/indexers/BenchmarkingIndexer.pm000555000765000765 1772311462203445 24370 0ustar00marvinmarvin000000000000package BenchmarkingIndexer; use strict; use warnings; use Carp; use Config; use File::Spec::Functions qw( catfile catdir ); use POSIX qw( uname ); sub new { my $either = shift; my $class = ref($either) || $either; return bless { docs => undef, increment => undef, store => undef, engine => undef, version => undef, index_dir => undef, corpus_dir => 'extracted_corpus', article_filepaths => undef, @_, }, $class; } sub init_indexer { confess "abstract method" } sub build_index { confess "abstract method" } sub delayed_init { my $self = shift; my $article_filepaths = $self->{article_filepaths} = $self->build_file_list; $self->{docs} = @$article_filepaths unless defined $self->{docs}; $self->{increment} = $self->{docs} + 1 unless defined $self->{increment}; } # Return a lexically sorted list of all article files from all subdirs. sub build_file_list { my $self = shift; my $corpus_dir = $self->{corpus_dir}; my @article_filepaths; opendir CORPUS_DIR, $corpus_dir or confess "Can't opendir '$corpus_dir': $!"; my @article_dir_names = grep {/articles/} readdir CORPUS_DIR; for my $article_dir_name (@article_dir_names) { my $article_dir = catdir( $corpus_dir, $article_dir_name ); opendir ARTICLE_DIR, $article_dir or die "Can't opendir '$article_dir': $!"; push @article_filepaths, map { catfile( $article_dir, $_ ) } grep {m/^article\d+\.txt$/} readdir ARTICLE_DIR; } @article_filepaths = sort @article_filepaths; $self->{article_filepaths} = \@article_filepaths; } # Print out stats for one run. sub print_interim_report { my ( $self, %args ) = @_; printf( "%-3d Secs: %.2f Docs: %-4d\n", @args{qw( rep secs count )} ); } sub start_report { # start the output print '-' x 60 . "\n"; } # Print out aggregate stats sub print_final_report { my ( $self, $times ) = @_; # produce mean and truncated mean my @sorted_times = sort @$times; my $num_to_chop = int( @sorted_times >> 2 ); my $mean = 0; my $trunc_mean = 0; my $num_kept = 0; for ( my $i = 0; $i < @sorted_times; $i++ ) { $mean += $sorted_times[$i]; # discard fastest 25% and slowest 25% of runs next if $i < $num_to_chop; next if $i > ( $#sorted_times - $num_to_chop ); $trunc_mean += $sorted_times[$i]; $num_kept++; } $mean /= @sorted_times; $trunc_mean /= $num_kept; my $num_discarded = @sorted_times - $num_kept; $mean = sprintf("%.2f", $mean); $trunc_mean = sprintf("%.2f", $trunc_mean); # get some info about the system my $thread_support = $Config{usethreads} ? "yes" : "no"; my @uname_info = (uname)[0, 2, 4]; print <{engine} $self->{version} Perl $Config{version} Thread support: $thread_support @uname_info Mean: $mean secs Truncated mean ($num_kept kept, $num_discarded discarded): $trunc_mean secs ------------------------------------------------------------ END_REPORT } package BenchmarkingIndexer::KinoSearch1; use strict; use warnings; use base qw( BenchmarkingIndexer ); use Time::HiRes qw( gettimeofday ); sub new { my $class = shift; my $self = $class->SUPER::new(@_); require KinoSearch1; require KinoSearch1::InvIndexer; require KinoSearch1::Analysis::Tokenizer; $self->{index_dir} = 'kinosearch_index'; $self->{engine} = 'KinoSearch1'; $self->{version} = $KinoSearch1::VERSION; return $self; } sub init_indexer { my ( $self, $count ) = @_; my $create = $count ? 0 : 1; # spec out the invindexer my $analyzer = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\S+/, ); my $invindexer = KinoSearch1::InvIndexer->new( invindex => $self->{index_dir}, create => $create, analyzer => $analyzer, ); $invindexer->spec_field( name => 'body', stored => $self->{store}, vectorized => $self->{store}, ); $invindexer->spec_field( name => 'title', vectorized => 0, ); return $invindexer; } # Build an index, stopping at $max docs if $max > 0. sub build_index { my $self = shift; $self->delayed_init; my ( $max, $increment, $article_filepaths ) = @{$self}{qw( docs increment article_filepaths )}; # start timer my $start = gettimeofday(); my $invindexer = $self->init_indexer(0); my $count = 0; while ($count < $max) { for my $article_filepath (@$article_filepaths) { # the title is the first line, the body is the rest open( my $article_fh, '<', $article_filepath ) or die "Can't open file '$article_filepath'"; my $title = <$article_fh>; my $body = do { local $/; <$article_fh> }; # add content to index my $doc = $invindexer->new_doc; $doc->set_value( title => $title ); $doc->set_value( body => $body ); $invindexer->add_doc($doc); # bail if we've reached spec'd number of docs $count++; last if $count >= $max; if ( $count % $increment == 0 and $count ) { $invindexer->finish; undef $invindexer; $invindexer = $self->init_indexer($count); } } } # finish index $invindexer->finish( optimize => 1 ); # return elapsed seconds my $end = gettimeofday(); my $secs = $end - $start; return ( $count, $secs ); } package BenchmarkingIndexer::Plucene; use strict; use warnings; use base qw( BenchmarkingIndexer ); use Time::HiRes qw( gettimeofday ); sub new { my $class = shift; my $self = $class->SUPER::new(@_); require Plucene; require Plucene::Document; require Plucene::Document::Field; require Plucene::Index::Writer; require Plucene::Analysis::WhitespaceAnalyzer; $self->{index_dir} = 'plucene_index'; $self->{engine} = 'Plucene'; $self->{version} = $Plucene::VERSION; return $self; } sub init_indexer { my ( $self, $count ) = @_; my $create = $count ? 0 : 1; my $writer = Plucene::Index::Writer->new( $self->{index_dir}, Plucene::Analysis::WhitespaceAnalyzer->new(), $create ); $writer->set_mergefactor(1000); return $writer; } # Build an index, stopping at $max docs if $max > 0. sub build_index { my $self = shift; $self->delayed_init; my ( $max, $increment, $article_filepaths ) = @{$self}{qw( docs increment article_filepaths )}; # cause text to be stored if spec'd my $field_constructor = $self->{store} ? 'Text' : 'UnStored'; # start timer my $start = gettimeofday(); my $writer = $self->init_indexer(0); my $count = 0; while ($count < $max) { for my $article_filepath (@$article_filepaths) { # the title is the first line, the body is the rest open( my $article_fh, '<', $article_filepath ) or die "Can't open file '$article_filepath'"; my $title = <$article_fh>; my $body = do { local $/; <$article_fh> }; # add content to index my $doc = Plucene::Document->new; $doc->add( Plucene::Document::Field->Text( title => $title ) ); $doc->add( Plucene::Document::Field->$field_constructor( body => $body ) ); $writer->add_document($doc); # bail if we've reached spec'd number of docs $count++; last if ( $count >= $max ); if ( $count % $increment == 0 and $count ) { undef $writer; $writer = $self->init_indexer($count); } } } # finish index $writer->optimize; # return elapsed seconds my $end = gettimeofday(); my $secs = $end - $start; return ( $count, $secs ); } 1; KinoSearch1-1.01/t/benchmarks/indexers/kinosearch_indexer.plx000555000765000765 365511462203446 24474 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use lib 'indexers'; use Getopt::Long; use Cwd qw( getcwd ); use BenchmarkingIndexer; # verify that we're running from the right directory; my $working_dir = getcwd; die "Must be run from benchmarks/" unless $working_dir =~ /benchmarks\W*$/; # index all docs and run one iter unless otherwise spec'd my ( $num_reps, $max_to_index, $increment, $store, $build_index ); GetOptions( 'reps=s' => \$num_reps, 'docs=s' => \$max_to_index, 'increment=s' => \$increment, 'store=s' => \$store, 'build_index=s' => \$build_index, ); $num_reps = 1 unless defined $num_reps; my $bencher = BenchmarkingIndexer::KinoSearch1->new( docs => $max_to_index, increment => $increment, store => $store, ); if ($build_index) { my ( $count, $secs ) = $bencher->build_index; print "docs: $count elapsed: $secs\n"; exit; } else { $bencher->start_report; my @times; for my $rep ( 1 .. $num_reps ) { # spawn an index-building child process my $command = "$^X "; # try to figure out if this program was called with -Mblib for (@INC) { next unless /\bblib\b/; # propagate -Mblib to the child $command .= "-Mblib "; last; } $command .= "$0 --build_index=1 "; $command .= "--docs=$max_to_index " if $max_to_index; $command .= "--store=$store " if $store; $command .= "--increment=$increment " if $increment; my $output = `$command`; # extract elapsed time from the output of the child $output =~ /^docs: (\d+) elapsed: ([\d.]+)/ or die "no match: '$output'"; my $docs = $1; my $secs = $2; push @times, $secs; $bencher->print_interim_report( rep => $rep, secs => $secs, count => $docs, ); } $bencher->print_final_report(\@times); } KinoSearch1-1.01/t/benchmarks/indexers/LuceneIndexer.java000555000765000765 1673511462203446 23523 0ustar00marvinmarvin000000000000import org.apache.lucene.index.IndexWriter; import org.apache.lucene.analysis.WhitespaceAnalyzer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import java.io.File; import java.io.BufferedReader; import java.io.FileReader; import java.text.DecimalFormat; import java.util.Date; import java.util.Vector; import java.util.Collections; import java.util.Arrays; /** * LuceneIndexer - benchmarking app * usage: java LuceneIndexer [-docs MAX_TO_INDEX] [-reps NUM_REPETITIONS] * * Recommended options: -server -Xmx500M -XX:CompileThreshold=100 */ public class LuceneIndexer { static File corpusDir = new File("extracted_corpus"); static File indexDir = new File("lucene_index"); static String[] fileList; public static void main (String[] args) throws Exception { // verify that we're running from the right directory String curDir = new File(".").getCanonicalPath(); if (!curDir.endsWith("benchmarks")) throw new Exception("Must be run from benchmarks/ "); // assemble the sorted list of article files fileList = buildFileList(); // parse command line args int maxToIndex = fileList.length; // default: index all docs int numReps = 1; // default: run once int increment = 0; boolean store = false; String arg; int i = 0; while (i < (args.length - 1) && args[i].startsWith("-")) { arg = args[i++]; if (arg.equals("-docs")) maxToIndex = Integer.parseInt(args[i++]); else if (arg.equals("-reps")) numReps = Integer.parseInt(args[i++]); else if (arg.equals("-increment")) increment = Integer.parseInt(args[i++]); else if (arg.equals("-store")) { if (Integer.parseInt(args[i++]) != 0) store = true; } else throw new Exception("Unknown argument: " + arg); } increment = increment == 0 ? maxToIndex + 1 : increment; // start the output System.out.println("---------------------------------------------------"); // build the index numReps times, then print a final report float[] times = new float[numReps]; for (int rep = 1; rep <= numReps; rep++) { // start the clock and build the index long start = new Date().getTime(); int numIndexed = buildIndex(fileList, maxToIndex, increment, store); // stop the clock and print a report long end = new Date().getTime(); float secs = (float)(end - start) / 1000; times[rep - 1] = secs; printInterimReport(rep, secs, numIndexed); } printFinalReport(times); } // Return a lexically sorted list of all article files from all subdirs. static String[] buildFileList () throws Exception { File[] articleDirs = corpusDir.listFiles(); Vector filePaths = new Vector(); for (int i = 0; i < articleDirs.length; i++) { File[] articles = articleDirs[i].listFiles(); for (int j = 0; j < articles.length; j++) { String path = articles[j].getPath(); if (path.indexOf("article") == -1) continue; filePaths.add(path); } } Collections.sort(filePaths); return (String[])filePaths.toArray(new String[filePaths.size()]); } // Initialize an IndexWriter static IndexWriter initWriter (int count) throws Exception { boolean create = count > 0 ? false : true; IndexWriter writer = new IndexWriter(indexDir, new WhitespaceAnalyzer(), create); writer.setMaxBufferedDocs(1000); writer.setUseCompoundFile(false); return writer; } // Build an index, stopping at maxToIndex docs if maxToIndex > 0. static int buildIndex (String[] fileList, int maxToIndex, int increment, boolean store) throws Exception { IndexWriter writer = initWriter(0); int docsSoFar = 0; while (docsSoFar < maxToIndex) { for (int i = 0; i < fileList.length; i++) { // add content to index File f = new File(fileList[i]); Document doc = new Document(); BufferedReader br = new BufferedReader(new FileReader(f)); try { // the title is the first line String title; if ( (title = br.readLine()) == null) throw new Exception("Failed to read title"); Field titleField = new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO); doc.add(titleField); // the body is the rest if (store) { StringBuffer buf = new StringBuffer(); String str; while ( (str = br.readLine()) != null ) buf.append( str ); String body = buf.toString(); Field bodyField = new Field("body", body, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS); doc.add(bodyField); } else { Field bodyField = new Field("body", br); doc.add(bodyField); } writer.addDocument(doc); } finally { br.close(); } docsSoFar++; if (docsSoFar >= maxToIndex) break; if (docsSoFar % increment == 0) { writer.close(); writer = initWriter(docsSoFar); } } } // finish index int numIndexed = writer.docCount(); writer.optimize(); writer.close(); return numIndexed; } // Print out stats for one run. private static void printInterimReport(int rep, float secs, int numIndexed) { DecimalFormat secsFormat = new DecimalFormat("#,##0.00"); String secString = secsFormat.format(secs); System.out.println(rep + " Secs: " + secString + " Docs: " + numIndexed); } // Print out aggregate stats private static void printFinalReport(float[] times) { // produce mean and truncated mean Arrays.sort(times); float meanTime = 0.0f; float truncatedMeanTime = 0.0f; int numToChop = times.length >> 2; int numKept = 0; for (int i = 0; i < times.length; i++) { meanTime += times[i]; // discard fastest 25% and slowest 25% of reps if (i < numToChop || i >= (times.length - numToChop)) continue; truncatedMeanTime += times[i]; numKept++; } meanTime /= times.length; truncatedMeanTime /= numKept; int numDiscarded = times.length - numKept; DecimalFormat format = new DecimalFormat("#,##0.00"); String meanString = format.format(meanTime); String truncatedMeanString = format.format(truncatedMeanTime); // get the Lucene version Package lucenePackage = org.apache.lucene.LucenePackage.get(); String luceneVersion = lucenePackage.getSpecificationVersion(); System.out.println("---------------------------------------------------"); System.out.println("Lucene " + luceneVersion); System.out.println("JVM " + System.getProperty("java.version") + " (" + System.getProperty("java.vendor") + ")"); System.out.println(System.getProperty("os.name") + " " + System.getProperty("os.version") + " " + System.getProperty("os.arch")); System.out.println("Mean: " + meanString + " secs"); System.out.println("Truncated mean (" + numKept + " kept, " + numDiscarded + " discarded): " + truncatedMeanString + " secs"); System.out.println("---------------------------------------------------"); } } KinoSearch1-1.01/t/benchmarks/indexers/plucene_indexer.plx000555000765000765 365011462203446 23774 0ustar00marvinmarvin000000000000#!/usr/bin/perl use strict; use warnings; use lib 'indexers'; use Getopt::Long; use Cwd qw( getcwd ); use BenchmarkingIndexer; # verify that we're running from the right directory; my $working_dir = getcwd; die "Must be run from benchmarks/" unless $working_dir =~ /benchmarks\W*$/; # index all docs and run one iter unless otherwise spec'd my ( $num_reps, $max_to_index, $increment, $store, $build_index ); GetOptions( 'reps=s' => \$num_reps, 'docs=s' => \$max_to_index, 'increment=s' => \$increment, 'store=s' => \$store, 'build_index=s' => \$build_index, ); $num_reps = 1 unless defined $num_reps; my $bencher = BenchmarkingIndexer::Plucene->new( docs => $max_to_index, increment => $increment, store => $store, ); if ($build_index) { my ( $count, $secs ) = $bencher->build_index; print "docs: $count elapsed: $secs\n"; exit; } else { $bencher->start_report; my @times; for my $rep ( 1 .. $num_reps ) { # spawn an index-building child process my $command = "$^X "; # try to figure out if this program was called with -Mblib for (@INC) { next unless /\bblib\b/; # propagate -Mblib to the child $command .= "-Mblib "; last; } $command .= "$0 --build_index=1 "; $command .= "--docs=$max_to_index " if $max_to_index; $command .= "--store=$store " if $store; $command .= "--increment=$increment " if $increment; my $output = `$command`; # extract elapsed time from the output of the child $output =~ /^docs: (\d+) elapsed: ([\d.]+)/ or die "no match: '$output'"; my $docs = $1; my $secs = $2; push @times, $secs; $bencher->print_interim_report( rep => $rep, secs => $secs, count => $docs, ); } $bencher->print_final_report(\@times); } KinoSearch1-1.01/t/us_constitution000755000765000765 011462203446 17245 5ustar00marvinmarvin000000000000KinoSearch1-1.01/t/us_constitution/amend1.html000444000765000765 146411462203446 21442 0ustar00marvinmarvin000000000000 Amendment I

Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.

KinoSearch1-1.01/t/us_constitution/amend10.html000444000765000765 130511462203446 21514 0ustar00marvinmarvin000000000000 Amendment X

The powers not delegated to the United States by the Constitution, nor prohibited by it to the States, are reserved to the States respectively, or to the people.

KinoSearch1-1.01/t/us_constitution/amend11.html000444000765000765 142411462203446 21517 0ustar00marvinmarvin000000000000 Amendment XI

The Judicial power of the United States shall not be construed to extend to any suit in law or equity, commenced or prosecuted against one of the United States by Citizens of another State, or by Citizens or Subjects of any Foreign State.

KinoSearch1-1.01/t/us_constitution/amend12.html000444000765000765 555211462203446 21526 0ustar00marvinmarvin000000000000 Amendment XII

The Electors shall meet in their respective states, and vote by ballot for President and Vice-President, one of whom, at least, shall not be an inhabitant of the same state with themselves; they shall name in their ballots the person voted for as President, and in distinct ballots the person voted for as Vice-President, and they shall make distinct lists of all persons voted for as President, and of all persons voted for as Vice-President and of the number of votes for each, which lists they shall sign and certify, and transmit sealed to the seat of the government of the United States, directed to the President of the Senate;

The President of the Senate shall, in the presence of the Senate and House of Representatives, open all the certificates and the votes shall then be counted;

The person having the greatest Number of votes for President, shall be the President, if such number be a majority of the whole number of Electors appointed; and if no person have such majority, then from the persons having the highest numbers not exceeding three on the list of those voted for as President, the House of Representatives shall choose immediately, by ballot, the President. But in choosing the President, the votes shall be taken by states, the representation from each state having one vote; a quorum for this purpose shall consist of a member or members from two-thirds of the states, and a majority of all the states shall be necessary to a choice. And if the House of Representatives shall not choose a President whenever the right of choice shall devolve upon them, before the fourth day of March next following, then the Vice-President shall act as President, as in the case of the death or other constitutional disability of the President.

The person having the greatest number of votes as Vice-President, shall be the Vice-President, if such number be a majority of the whole number of Electors appointed, and if no person have a majority, then from the two highest numbers on the list, the Senate shall choose the Vice-President; a quorum for the purpose shall consist of two-thirds of the whole number of Senators, and a majority of the whole number shall be necessary to a choice. But no person constitutionally ineligible to the office of President shall be eligible to that of Vice-President of the United States.

KinoSearch1-1.01/t/us_constitution/amend13.html000444000765000765 153111462203446 21520 0ustar00marvinmarvin000000000000 Amendment XIII

1. Neither slavery nor involuntary servitude, except as a punishment for crime whereof the party shall have been duly convicted, shall exist within the United States, or any place subject to their jurisdiction.

2. Congress shall have power to enforce this article by appropriate legislation.

KinoSearch1-1.01/t/us_constitution/amend14.html000444000765000765 610611462203445 21523 0ustar00marvinmarvin000000000000 Amendment XIV

1. All persons born or naturalized in the United States, and subject to the jurisdiction thereof, are citizens of the United States and of the State wherein they reside. No State shall make or enforce any law which shall abridge the privileges or immunities of citizens of the United States; nor shall any State deprive any person of life, liberty, or property, without due process of law; nor deny to any person within its jurisdiction the equal protection of the laws.

2. Representatives shall be apportioned among the several States according to their respective numbers, counting the whole number of persons in each State, excluding Indians not taxed. But when the right to vote at any election for the choice of electors for President and Vice-President of the United States, Representatives in Congress, the Executive and Judicial officers of a State, or the members of the Legislature thereof, is denied to any of the male inhabitants of such State, being twenty-one years of age, and citizens of the United States, or in any way abridged, except for participation in rebellion, or other crime, the basis of representation therein shall be reduced in the proportion which the number of such male citizens shall bear to the whole number of male citizens twenty-one years of age in such State.

3. No person shall be a Senator or Representative in Congress, or elector of President and Vice-President, or hold any office, civil or military, under the United States, or under any State, who, having previously taken an oath, as a member of Congress, or as an officer of the United States, or as a member of any State legislature, or as an executive or judicial officer of any State, to support the Constitution of the United States, shall have engaged in insurrection or rebellion against the same, or given aid or comfort to the enemies thereof. But Congress may by a vote of two-thirds of each House, remove such disability.

4. The validity of the public debt of the United States, authorized by law, including debts incurred for payment of pensions and bounties for services in suppressing insurrection or rebellion, shall not be questioned. But neither the United States nor any State shall assume or pay any debt or obligation incurred in aid of insurrection or rebellion against the United States, or any claim for the loss or emancipation of any slave; but all such debts, obligations and claims shall be held illegal and void.

5. The Congress shall have power to enforce, by appropriate legislation, the provisions of this article.

KinoSearch1-1.01/t/us_constitution/amend15.html000444000765000765 150011462203446 21516 0ustar00marvinmarvin000000000000 Amendment XV

1. The right of citizens of the United States to vote shall not be denied or abridged by the United States or by any State on account of race, color, or previous condition of servitude.

2. The Congress shall have power to enforce this article by appropriate legislation.

KinoSearch1-1.01/t/us_constitution/amend16.html000444000765000765 135211462203446 21524 0ustar00marvinmarvin000000000000 Amendment XVI

The Congress shall have power to lay and collect taxes on incomes, from whatever source derived, without apportionment among the several States, and without regard to any census or enumeration.

KinoSearch1-1.01/t/us_constitution/amend17.html000444000765000765 254511462203446 21532 0ustar00marvinmarvin000000000000 Amendment XVII

The Senate of the United States shall be composed of two Senators from each State, elected by the people thereof, for six years; and each Senator shall have one vote. The electors in each State shall have the qualifications requisite for electors of the most numerous branch of the State legislatures.

When vacancies happen in the representation of any State in the Senate, the executive authority of such State shall issue writs of election to fill such vacancies: Provided, That the legislature of any State may empower the executive thereof to make temporary appointments until the people fill the vacancies by election as the legislature may direct.

This amendment shall not be so construed as to affect the election or term of any Senator chosen before it becomes valid as part of the Constitution.

KinoSearch1-1.01/t/us_constitution/amend18.html000444000765000765 237611462203446 21535 0ustar00marvinmarvin000000000000 Amendment XVIII

1. After one year from the ratification of this article the manufacture, sale, or transportation of intoxicating liquors within, the importation thereof into, or the exportation thereof from the United States and all territory subject to the jurisdiction thereof for beverage purposes is hereby prohibited.

2. The Congress and the several States shall have concurrent power to enforce this article by appropriate legislation.

3. This article shall be inoperative unless it shall have been ratified as an amendment to the Constitution by the legislatures of the several States, as provided in the Constitution, within seven years from the date of the submission hereof to the States by the Congress.

KinoSearch1-1.01/t/us_constitution/amend19.html000444000765000765 141411462203446 21526 0ustar00marvinmarvin000000000000 Amendment XIX

The right of citizens of the United States to vote shall not be denied or abridged by the United States or by any State on account of sex.

Congress shall have power to enforce this article by appropriate legislation.

KinoSearch1-1.01/t/us_constitution/amend2.html000444000765000765 127011462203446 21436 0ustar00marvinmarvin000000000000 Amendment II

A well regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed.

KinoSearch1-1.01/t/us_constitution/amend20.html000444000765000765 473611462203446 21530 0ustar00marvinmarvin000000000000 Amendment XX

1. The terms of the President and Vice President shall end at noon on the 20th day of January, and the terms of Senators and Representatives at noon on the 3d day of January, of the years in which such terms would have ended if this article had not been ratified; and the terms of their successors shall then begin.

2. The Congress shall assemble at least once in every year, and such meeting shall begin at noon on the 3d day of January, unless they shall by law appoint a different day.

3. If, at the time fixed for the beginning of the term of the President, the President elect shall have died, the Vice President elect shall become President. If a President shall not have been chosen before the time fixed for the beginning of his term, or if the President elect shall have failed to qualify, then the Vice President elect shall act as President until a President shall have qualified; and the Congress may by law provide for the case wherein neither a President elect nor a Vice President elect shall have qualified, declaring who shall then act as President, or the manner in which one who is to act shall be selected, and such person shall act accordingly until a President or Vice President shall have qualified.

4. The Congress may by law provide for the case of the death of any of the persons from whom the House of Representatives may choose a President whenever the right of choice shall have devolved upon them, and for the case of the death of any of the persons from whom the Senate may choose a Vice President whenever the right of choice shall have devolved upon them.

5. Sections 1 and 2 shall take effect on the 15th day of October following the ratification of this article.

6. This article shall be inoperative unless it shall have been ratified as an amendment to the Constitution by the legislatures of three-fourths of the several States within seven years from the date of its submission.

KinoSearch1-1.01/t/us_constitution/amend21.html000444000765000765 217611462203446 21525 0ustar00marvinmarvin000000000000 Amendment XXI

1. The eighteenth article of amendment to the Constitution of the United States is hereby repealed.

2. The transportation or importation into any State, Territory, or possession of the United States for delivery or use therein of intoxicating liquors, in violation of the laws thereof, is hereby prohibited.

3. The article shall be inoperative unless it shall have been ratified as an amendment to the Constitution by conventions in the several States, as provided in the Constitution, within seven years from the date of the submission hereof to the States by the Congress.

KinoSearch1-1.01/t/us_constitution/amend22.html000444000765000765 273211462203446 21524 0ustar00marvinmarvin000000000000 Amendment XXII

1. No person shall be elected to the office of the President more than twice, and no person who has held the office of President, or acted as President, for more than two years of a term to which some other person was elected President shall be elected to the office of the President more than once. But this Article shall not apply to any person holding the office of President, when this Article was proposed by the Congress, and shall not prevent any person who may be holding the office of President, or acting as President, during the term within which this Article becomes operative from holding the office of President or acting as President during the remainder of such term.

2. This article shall be inoperative unless it shall have been ratified as an amendment to the Constitution by the legislatures of three-fourths of the several States within seven years from the date of its submission to the States by the Congress.

KinoSearch1-1.01/t/us_constitution/amend23.html000444000765000765 244711462203445 21527 0ustar00marvinmarvin000000000000 Amendment XXIII

1. The District constituting the seat of Government of the United States shall appoint in such manner as the Congress may direct: A number of electors of President and Vice President equal to the whole number of Senators and Representatives in Congress to which the District would be entitled if it were a State, but in no event more than the least populous State; they shall be in addition to those appointed by the States, but they shall be considered, for the purposes of the election of President and Vice President, to be electors appointed by a State; and they shall meet in the District and perform such duties as provided by the twelfth article of amendment.

2. The Congress shall have power to enforce this article by appropriate legislation.

KinoSearch1-1.01/t/us_constitution/amend24.html000444000765000765 172711462203446 21531 0ustar00marvinmarvin000000000000 Amendment XXIV

1. The right of citizens of the United States to vote in any primary or other election for President or Vice President, for electors for President or Vice President, or for Senator or Representative in Congress, shall not be denied or abridged by the United States or any State by reason of failure to pay any poll tax or other tax.

2. The Congress shall have power to enforce this article by appropriate legislation.

KinoSearch1-1.01/t/us_constitution/amend25.html000444000765000765 562311462203445 21530 0ustar00marvinmarvin000000000000 Amendment XXV

1. In case of the removal of the President from office or of his death or resignation, the Vice President shall become President.

2. Whenever there is a vacancy in the office of the Vice President, the President shall nominate a Vice President who shall take office upon confirmation by a majority vote of both Houses of Congress.

3. Whenever the President transmits to the President pro tempore of the Senate and the Speaker of the House of Representatives his written declaration that he is unable to discharge the powers and duties of his office, and until he transmits to them a written declaration to the contrary, such powers and duties shall be discharged by the Vice President as Acting President.

4. Whenever the Vice President and a majority of either the principal officers of the executive departments or of such other body as Congress may by law provide, transmit to the President pro tempore of the Senate and the Speaker of the House of Representatives their written declaration that the President is unable to discharge the powers and duties of his office, the Vice President shall immediately assume the powers and duties of the office as Acting President.

Thereafter, when the President transmits to the President pro tempore of the Senate and the Speaker of the House of Representatives his written declaration that no inability exists, he shall resume the powers and duties of his office unless the Vice President and a majority of either the principal officers of the executive department or of such other body as Congress may by law provide, transmit within four days to the President pro tempore of the Senate and the Speaker of the House of Representatives their written declaration that the President is unable to discharge the powers and duties of his office. Thereupon Congress shall decide the issue, assembling within forty eight hours for that purpose if not in session. If the Congress, within twenty one days after receipt of the latter written declaration, or, if Congress is not in session, within twenty one days after Congress is required to assemble, determines by two thirds vote of both Houses that the President is unable to discharge the powers and duties of his office, the Vice President shall continue to discharge the same as Acting President; otherwise, the President shall resume the powers and duties of his office.

KinoSearch1-1.01/t/us_constitution/amend26.html000444000765000765 150111462203445 21520 0ustar00marvinmarvin000000000000 Amendment XXVI

1. The right of citizens of the United States, who are eighteen years of age or older, to vote shall not be denied or abridged by the United States or by any State on account of age.

2. The Congress shall have power to enforce this article by appropriate legislation.

KinoSearch1-1.01/t/us_constitution/amend27.html000444000765000765 132111462203445 21521 0ustar00marvinmarvin000000000000 Amendment XXVII

No law, varying the compensation for the services of the Senators and Representatives, shall take effect, until an election of Representatives shall have intervened.

KinoSearch1-1.01/t/us_constitution/amend3.html000444000765000765 130411462203445 21434 0ustar00marvinmarvin000000000000 Amendment III

No Soldier shall, in time of peace be quartered in any house, without the consent of the Owner, nor in time of war, but in a manner to be prescribed by law.

KinoSearch1-1.01/t/us_constitution/amend4.html000444000765000765 156211462203446 21444 0ustar00marvinmarvin000000000000 Amendment IV

The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.

KinoSearch1-1.01/t/us_constitution/amend5.html000444000765000765 216111462203445 21440 0ustar00marvinmarvin000000000000 Amendment V

No person shall be held to answer for a capital, or otherwise infamous crime, unless on a presentment or indictment of a Grand Jury, except in cases arising in the land or naval forces, or in the Militia, when in actual service in time of War or public danger; nor shall any person be subject for the same offense to be twice put in jeopardy of life or limb; nor shall be compelled in any criminal case to be a witness against himself, nor be deprived of life, liberty, or property, without due process of law; nor shall private property be taken for public use, without just compensation.

KinoSearch1-1.01/t/us_constitution/amend6.html000444000765000765 200411462203446 21436 0ustar00marvinmarvin000000000000 Amendment VI

In all criminal prosecutions, the accused shall enjoy the right to a speedy and public trial, by an impartial jury of the State and district wherein the crime shall have been committed, which district shall have been previously ascertained by law, and to be informed of the nature and cause of the accusation; to be confronted with the witnesses against him; to have compulsory process for obtaining witnesses in his favor, and to have the Assistance of Counsel for his defence.

KinoSearch1-1.01/t/us_constitution/amend7.html000444000765000765 147011462203445 21444 0ustar00marvinmarvin000000000000 Amendment VII

In Suits at common law, where the value in controversy shall exceed twenty dollars, the right of trial by jury shall be preserved, and no fact tried by a jury, shall be otherwise re-examined in any Court of the United States, than according to the rules of the common law.

KinoSearch1-1.01/t/us_constitution/amend8.html000444000765000765 123111462203446 21441 0ustar00marvinmarvin000000000000 Amendment VIII

Excessive bail shall not be required, nor excessive fines imposed, nor cruel and unusual punishments inflicted.

KinoSearch1-1.01/t/us_constitution/amend9.html000444000765000765 125011462203446 21443 0ustar00marvinmarvin000000000000 Amendment IX

The enumeration in the Constitution, of certain rights, shall not be construed to deny or disparage others retained by the people.

KinoSearch1-1.01/t/us_constitution/art1sec1.html000444000765000765 131011462203445 21705 0ustar00marvinmarvin000000000000 Article I Section 1

All legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives.

KinoSearch1-1.01/t/us_constitution/art1sec10.html000444000765000765 313011462203445 21767 0ustar00marvinmarvin000000000000 Article I Section 10

No State shall enter into any Treaty, Alliance, or Confederation; grant Letters of Marque and Reprisal; coin Money; emit Bills of Credit; make any Thing but gold and silver Coin a Tender in Payment of Debts; pass any Bill of Attainder, ex post facto Law, or Law impairing the Obligation of Contracts, or grant any Title of Nobility.

No State shall, without the Consent of the Congress, lay any Imposts or Duties on Imports or Exports, except what may be absolutely necessary for executing it's inspection Laws: and the net Produce of all Duties and Imposts, laid by any State on Imports or Exports, shall be for the Use of the Treasury of the United States; and all such Laws shall be subject to the Revision and Controul of the Congress.

No State shall, without the Consent of Congress, lay any duty of Tonnage, keep Troops, or Ships of War in time of Peace, enter into any Agreement or Compact with another State, or with a foreign Power, or engage in War, unless actually invaded, or in such imminent Danger as will not admit of delay.

KinoSearch1-1.01/t/us_constitution/art1sec2.html000444000765000765 457011462203445 21721 0ustar00marvinmarvin000000000000 Article I Section 2

The House of Representatives shall be composed of Members chosen every second Year by the People of the several States, and the Electors in each State shall have the Qualifications requisite for Electors of the most numerous Branch of the State Legislature.

No Person shall be a Representative who shall not have attained to the Age of twenty five Years, and been seven Years a Citizen of the United States, and who shall not, when elected, be an Inhabitant of that State in which he shall be chosen.

Representatives and direct Taxes shall be apportioned among the several States which may be included within this Union, according to their respective Numbers, which shall be determined by adding to the whole Number of free Persons, including those bound to Service for a Term of Years, and excluding Indians not taxed, three fifths of all other Persons.

The actual Enumeration shall be made within three Years after the first Meeting of the Congress of the United States, and within every subsequent Term of ten Years, in such Manner as they shall by Law direct. The Number of Representatives shall not exceed one for every thirty Thousand, but each State shall have at Least one Representative; and until such enumeration shall be made, the State of New Hampshire shall be entitled to chuse three, Massachusetts eight, Rhode Island and Providence Plantations one, Connecticut five, New York six, New Jersey four, Pennsylvania eight, Delaware one, Maryland six, Virginia ten, North Carolina five, South Carolina five and Georgia three.

When vacancies happen in the Representation from any State, the Executive Authority thereof shall issue Writs of Election to fill such Vacancies.

The House of Representatives shall chuse their Speaker and other Officers; and shall have the sole Power of Impeachment.

KinoSearch1-1.01/t/us_constitution/art1sec3.html000444000765000765 510711462203446 21720 0ustar00marvinmarvin000000000000 Article I Section 3

The Senate of the United States shall be composed of two Senators from each State, chosen by the Legislature thereof, for six Years; and each Senator shall have one Vote.

Immediately after they shall be assembled in Consequence of the first Election, they shall be divided as equally as may be into three Classes. The Seats of the Senators of the first Class shall be vacated at the Expiration of the second Year, of the second Class at the Expiration of the fourth Year, and of the third Class at the Expiration of the sixth Year, so that one third may be chosen every second Year; and if Vacancies happen by Resignation, or otherwise, during the Recess of the Legislature of any State, the Executive thereof may make temporary Appointments until the next Meeting of the Legislature, which shall then fill such Vacancies.

No person shall be a Senator who shall not have attained to the Age of thirty Years, and been nine Years a Citizen of the United States, and who shall not, when elected, be an Inhabitant of that State for which he shall be chosen.

The Vice President of the United States shall be President of the Senate, but shall have no Vote, unless they be equally divided.

The Senate shall chuse their other Officers, and also a President pro tempore, in the absence of the Vice President, or when he shall exercise the Office of President of the United States.

The Senate shall have the sole Power to try all Impeachments. When sitting for that Purpose, they shall be on Oath or Affirmation. When the President of the United States is tried, the Chief Justice shall preside: And no Person shall be convicted without the Concurrence of two thirds of the Members present.

Judgment in Cases of Impeachment shall not extend further than to removal from Office, and disqualification to hold and enjoy any Office of honor, Trust or Profit under the United States: but the Party convicted shall nevertheless be liable and subject to Indictment, Trial, Judgment and Punishment, according to Law.

KinoSearch1-1.01/t/us_constitution/art1sec4.html000444000765000765 174711462203446 21727 0ustar00marvinmarvin000000000000 Article I Section 4

The Times, Places and Manner of holding Elections for Senators and Representatives, shall be prescribed in each State by the Legislature thereof; but the Congress may at any time by Law make or alter such Regulations, except as to the Place of Chusing Senators.

The Congress shall assemble at least once in every Year, and such Meeting shall be on the first Monday in December, unless they shall by Law appoint a different Day.

KinoSearch1-1.01/t/us_constitution/art1sec5.html000444000765000765 307311462203445 21721 0ustar00marvinmarvin000000000000 Article I Section 5

Each House shall be the Judge of the Elections, Returns and Qualifications of its own Members, and a Majority of each shall constitute a Quorum to do Business; but a smaller number may adjourn from day to day, and may be authorized to compel the Attendance of absent Members, in such Manner, and under such Penalties as each House may provide.

Each House may determine the Rules of its Proceedings, punish its Members for disorderly Behavior, and, with the Concurrence of two-thirds, expel a Member.

Each House shall keep a Journal of its Proceedings, and from time to time publish the same, excepting such Parts as may in their Judgment require Secrecy; and the Yeas and Nays of the Members of either House on any question shall, at the Desire of one fifth of those Present, be entered on the Journal.

Neither House, during the Session of Congress, shall, without the Consent of the other, adjourn for more than three days, nor to any other Place than that in which the two Houses shall be sitting.

KinoSearch1-1.01/t/us_constitution/art1sec6.html000444000765000765 260711462203446 21725 0ustar00marvinmarvin000000000000 Article I Section 6

The Senators and Representatives shall receive a Compensation for their Services, to be ascertained by Law, and paid out of the Treasury of the United States. They shall in all Cases, except Treason, Felony and Breach of the Peace, be privileged from Arrest during their Attendance at the Session of their respective Houses, and in going to and returning from the same; and for any Speech or Debate in either House, they shall not be questioned in any other Place.

No Senator or Representative shall, during the Time for which he was elected, be appointed to any civil Office under the Authority of the United States which shall have been created, or the Emoluments whereof shall have been increased during such time; and no Person holding any Office under the United States, shall be a Member of either House during his Continuance in Office.

KinoSearch1-1.01/t/us_constitution/art1sec7.html000444000765000765 444011462203446 21723 0ustar00marvinmarvin000000000000 Article I Section 7

All bills for raising Revenue shall originate in the House of Representatives; but the Senate may propose or concur with Amendments as on other Bills.

Every Bill which shall have passed the House of Representatives and the Senate, shall, before it become a Law, be presented to the President of the United States; If he approve he shall sign it, but if not he shall return it, with his Objections to that House in which it shall have originated, who shall enter the Objections at large on their Journal, and proceed to reconsider it. If after such Reconsideration two thirds of that House shall agree to pass the Bill, it shall be sent, together with the Objections, to the other House, by which it shall likewise be reconsidered, and if approved by two thirds of that House, it shall become a Law. But in all such Cases the Votes of both Houses shall be determined by Yeas and Nays, and the Names of the Persons voting for and against the Bill shall be entered on the Journal of each House respectively. If any Bill shall not be returned by the President within ten Days (Sundays excepted) after it shall have been presented to him, the Same shall be a Law, in like Manner as if he had signed it, unless the Congress by their Adjournment prevent its Return, in which Case it shall not be a Law.

Every Order, Resolution, or Vote to which the Concurrence of the Senate and House of Representatives may be necessary (except on a question of Adjournment) shall be presented to the President of the United States; and before the Same shall take Effect, shall be approved by him, or being disapproved by him, shall be repassed by two thirds of the Senate and House of Representatives, according to the Rules and Limitations prescribed in the Case of a Bill.

KinoSearch1-1.01/t/us_constitution/art1sec8.html000444000765000765 645211462203446 21731 0ustar00marvinmarvin000000000000 Article I Section 8

The Congress shall have Power To lay and collect Taxes, Duties, Imposts and Excises, to pay the Debts and provide for the common Defence and general Welfare of the United States; but all Duties, Imposts and Excises shall be uniform throughout the United States;

To borrow money on the credit of the United States;

To regulate Commerce with foreign Nations, and among the several States, and with the Indian Tribes;

To establish an uniform Rule of Naturalization, and uniform Laws on the subject of Bankruptcies throughout the United States;

To coin Money, regulate the Value thereof, and of foreign Coin, and fix the Standard of Weights and Measures;

To provide for the Punishment of counterfeiting the Securities and current Coin of the United States;

To establish Post Offices and Post Roads;

To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries;

To constitute Tribunals inferior to the supreme Court;

To define and punish Piracies and Felonies committed on the high Seas, and Offenses against the Law of Nations;

To declare War, grant Letters of Marque and Reprisal, and make Rules concerning Captures on Land and Water;

To raise and support Armies, but no Appropriation of Money to that Use shall be for a longer Term than two Years;

To provide and maintain a Navy;

To make Rules for the Government and Regulation of the land and naval Forces;

To provide for calling forth the Militia to execute the Laws of the Union, suppress Insurrections and repel Invasions;

To provide for organizing, arming, and disciplining the Militia, and for governing such Part of them as may be employed in the Service of the United States, reserving to the States respectively, the Appointment of the Officers, and the Authority of training the Militia according to the discipline prescribed by Congress;

To exercise exclusive Legislation in all Cases whatsoever, over such District (not exceeding ten Miles square) as may, by Cession of particular States, and the acceptance of Congress, become the Seat of the Government of the United States, and to exercise like Authority over all Places purchased by the Consent of the Legislature of the State in which the Same shall be, for the Erection of Forts, Magazines, Arsenals, dock-Yards, and other needful Buildings; And

To make all Laws which shall be necessary and proper for carrying into Execution the foregoing Powers, and all other Powers vested by this Constitution in the Government of the United States, or in any Department or Officer thereof.

KinoSearch1-1.01/t/us_constitution/art1sec9.html000444000765000765 402011462203446 21717 0ustar00marvinmarvin000000000000 Article I Section 9

The Migration or Importation of such Persons as any of the States now existing shall think proper to admit, shall not be prohibited by the Congress prior to the Year one thousand eight hundred and eight, but a tax or duty may be imposed on such Importation, not exceeding ten dollars for each Person.

The privilege of the Writ of Habeas Corpus shall not be suspended, unless when in Cases of Rebellion or Invasion the public Safety may require it.

No Bill of Attainder or ex post facto Law shall be passed. No capitation, or other direct, Tax shall be laid, unless in Proportion to the Census or Enumeration herein before directed to be taken.

No Tax or Duty shall be laid on Articles exported from any State.

No Preference shall be given by any Regulation of Commerce or Revenue to the Ports of one State over those of another: nor shall Vessels bound to, or from, one State, be obliged to enter, clear, or pay Duties in another.

No Money shall be drawn from the Treasury, but in Consequence of Appropriations made by Law; and a regular Statement and Account of the Receipts and Expenditures of all public Money shall be published from time to time.

No Title of Nobility shall be granted by the United States: And no Person holding any Office of Profit or Trust under them, shall, without the Consent of the Congress, accept of any present, Emolument, Office, or Title, of any kind whatever, from any King, Prince or foreign State.

KinoSearch1-1.01/t/us_constitution/art2sec1.html000444000765000765 1051511462203445 21735 0ustar00marvinmarvin000000000000 Article II Section 1

The executive Power shall be vested in a President of the United States of America. He shall hold his Office during the Term of four Years, and, together with the Vice-President chosen for the same Term, be elected, as follows:

Each State shall appoint, in such Manner as the Legislature thereof may direct, a Number of Electors, equal to the whole Number of Senators and Representatives to which the State may be entitled in the Congress: but no Senator or Representative, or Person holding an Office of Trust or Profit under the United States, shall be appointed an Elector.

The Electors shall meet in their respective States, and vote by Ballot for two persons, of whom one at least shall not lie an Inhabitant of the same State with themselves. And they shall make a List of all the Persons voted for, and of the Number of Votes for each; which List they shall sign and certify, and transmit sealed to the Seat of the Government of the United States, directed to the President of the Senate. The President of the Senate shall, in the Presence of the Senate and House of Representatives, open all the Certificates, and the Votes shall then be counted. The Person having the greatest Number of Votes shall be the President, if such Number be a Majority of the whole Number of Electors appointed; and if there be more than one who have such Majority, and have an equal Number of Votes, then the House of Representatives shall immediately chuse by Ballot one of them for President; and if no Person have a Majority, then from the five highest on the List the said House shall in like Manner chuse the President. But in chusing the President, the Votes shall be taken by States, the Representation from each State having one Vote; a quorum for this Purpose shall consist of a Member or Members from two-thirds of the States, and a Majority of all the States shall be necessary to a Choice. In every Case, after the Choice of the President, the Person having the greatest Number of Votes of the Electors shall be the Vice President. But if there should remain two or more who have equal Votes, the Senate shall chuse from them by Ballot the Vice-President.

The Congress may determine the Time of chusing the Electors, and the Day on which they shall give their Votes; which Day shall be the same throughout the United States.

No person except a natural born Citizen, or a Citizen of the United States, at the time of the Adoption of this Constitution, shall be eligible to the Office of President; neither shall any Person be eligible to that Office who shall not have attained to the Age of thirty-five Years, and been fourteen Years a Resident within the United States.

In Case of the Removal of the President from Office, or of his Death, Resignation, or Inability to discharge the Powers and Duties of the said Office, the same shall devolve on the Vice President, and the Congress may by Law provide for the Case of Removal, Death, Resignation or Inability, both of the President and Vice President, declaring what Officer shall then act as President, and such Officer shall act accordingly, until the Disability be removed, or a President shall be elected.

The President shall, at stated Times, receive for his Services, a Compensation, which shall neither be increased nor diminished during the Period for which he shall have been elected, and he shall not receive within that Period any other Emolument from the United States, or any of them.

Before he enter on the Execution of his Office, he shall take the following Oath or Affirmation:

"I do solemnly swear (or affirm) that I will faithfully execute the Office of President of the United States, and will to the best of my Ability, preserve, protect and defend the Constitution of the United States."

KinoSearch1-1.01/t/us_constitution/art2sec2.html000444000765000765 352711462203446 21724 0ustar00marvinmarvin000000000000 Article II Section 2

The President shall be Commander in Chief of the Army and Navy of the United States, and of the Militia of the several States, when called into the actual Service of the United States; he may require the Opinion, in writing, of the principal Officer in each of the executive Departments, upon any subject relating to the Duties of their respective Offices, and he shall have Power to Grant Reprieves and Pardons for Offenses against the United States, except in Cases of Impeachment.

He shall have Power, by and with the Advice and Consent of the Senate, to make Treaties, provided two thirds of the Senators present concur; and he shall nominate, and by and with the Advice and Consent of the Senate, shall appoint Ambassadors, other public Ministers and Consuls, Judges of the supreme Court, and all other Officers of the United States, whose Appointments are not herein otherwise provided for, and which shall be established by Law: but the Congress may by Law vest the Appointment of such inferior Officers, as they think proper, in the President alone, in the Courts of Law, or in the Heads of Departments.

The President shall have Power to fill up all Vacancies that may happen during the Recess of the Senate, by granting Commissions which shall expire at the End of their next Session.

KinoSearch1-1.01/t/us_constitution/art2sec3.html000444000765000765 215611462203446 21722 0ustar00marvinmarvin000000000000 Article II Section 3

He shall from time to time give to the Congress Information of the State of the Union, and recommend to their Consideration such Measures as he shall judge necessary and expedient; he may, on extraordinary Occasions, convene both Houses, or either of them, and in Case of Disagreement between them, with Respect to the Time of Adjournment, he may adjourn them to such Time as he shall think proper; he shall receive Ambassadors and other public Ministers; he shall take Care that the Laws be faithfully executed, and shall Commission all the Officers of the United States.

KinoSearch1-1.01/t/us_constitution/art2sec4.html000444000765000765 137111462203446 21721 0ustar00marvinmarvin000000000000 Article II Section 4

The President, Vice President and all civil Officers of the United States, shall be removed from Office on Impeachment for, and Conviction of, Treason, Bribery, or other high Crimes and Misdemeanors.

KinoSearch1-1.01/t/us_constitution/art3sec1.html000444000765000765 170611462203446 21721 0ustar00marvinmarvin000000000000 Article III Section 1

The judicial Power of the United States, shall be vested in one supreme Court, and in such inferior Courts as the Congress may from time to time ordain and establish. The Judges, both of the supreme and inferior Courts, shall hold their Offices during good Behavior, and shall, at stated Times, receive for their Services a Compensation which shall not be diminished during their Continuance in Office.

KinoSearch1-1.01/t/us_constitution/art3sec2.html000444000765000765 360211462203446 21717 0ustar00marvinmarvin000000000000 Article III Section 2

The judicial Power shall extend to all Cases, in Law and Equity, arising under this Constitution, the Laws of the United States, and Treaties made, or which shall be made, under their Authority; to all Cases affecting Ambassadors, other public Ministers and Consuls; to all Cases of admiralty and maritime Jurisdiction; to Controversies to which the United States shall be a Party; to Controversies between two or more States; between a State and Citizens of another State; between Citizens of different States; between Citizens of the same State claiming Lands under Grants of different States, and between a State, or the Citizens thereof, and foreign States, Citizens or Subjects.

In all Cases affecting Ambassadors, other public Ministers and Consuls, and those in which a State shall be Party, the supreme Court shall have original Jurisdiction. In all the other Cases before mentioned, the supreme Court shall have appellate Jurisdiction, both as to Law and Fact, with such Exceptions, and under such Regulations as the Congress shall make.

Trial of all Crimes, except in Cases of Impeachment, shall be by Jury; and such Trial shall be held in the State where the said Crimes shall have been committed; but when not committed within any State, the Trial shall be at such Place or Places as the Congress may by Law have directed.

KinoSearch1-1.01/t/us_constitution/art3sec3.html000444000765000765 202711462203446 21720 0ustar00marvinmarvin000000000000 Article III Section 3

Treason against the United States, shall consist only in levying War against them, or in adhering to their Enemies, giving them Aid and Comfort. No Person shall be convicted of Treason unless on the Testimony of two Witnesses to the same overt Act, or on Confession in open Court.

The Congress shall have power to declare the Punishment of Treason, but no Attainder of Treason shall work Corruption of Blood, or Forfeiture except during the Life of the Person attainted.

KinoSearch1-1.01/t/us_constitution/art4sec1.html000444000765000765 150011462203446 21712 0ustar00marvinmarvin000000000000 Article IV Section 1

Full Faith and Credit shall be given in each State to the public Acts, Records, and judicial Proceedings of every other State. And the Congress may by general Laws prescribe the Manner in which such Acts, Records and Proceedings shall be proved, and the Effect thereof.

KinoSearch1-1.01/t/us_constitution/art4sec2.html000444000765000765 236311462203446 21723 0ustar00marvinmarvin000000000000 Article IV Section 2

The Citizens of each State shall be entitled to all Privileges and Immunities of Citizens in the several States.

A Person charged in any State with Treason, Felony, or other Crime, who shall flee from Justice, and be found in another State, shall on demand of the executive Authority of the State from which he fled, be delivered up, to be removed to the State having Jurisdiction of the Crime.

No Person held to Service or Labour in one State, under the Laws thereof, escaping into another, shall, in Consequence of any Law or Regulation therein, be discharged from such Service or Labour, But shall be delivered up on Claim of the Party to whom such Service or Labour may be due.

KinoSearch1-1.01/t/us_constitution/art4sec3.html000444000765000765 224711462203445 21724 0ustar00marvinmarvin000000000000 Article IV Section 3

New States may be admitted by the Congress into this Union; but no new States shall be formed or erected within the Jurisdiction of any other State; nor any State be formed by the Junction of two or more States, or parts of States, without the Consent of the Legislatures of the States concerned as well as of the Congress.

The Congress shall have Power to dispose of and make all needful Rules and Regulations respecting the Territory or other Property belonging to the United States; and nothing in this Constitution shall be so construed as to Prejudice any Claims of the United States, or of any particular State.

KinoSearch1-1.01/t/us_constitution/art4sec4.html000444000765000765 150311462203446 21720 0ustar00marvinmarvin000000000000 Article IV Section 4

The United States shall guarantee to every State in this Union a Republican Form of Government, and shall protect each of them against Invasion; and on Application of the Legislature, or of the Executive (when the Legislature cannot be convened) against domestic Violence.

KinoSearch1-1.01/t/us_constitution/art5.html000444000765000765 254311462203446 21147 0ustar00marvinmarvin000000000000 Article V

The Congress, whenever two thirds of both Houses shall deem it necessary, shall propose Amendments to this Constitution, or, on the Application of the Legislatures of two thirds of the several States, shall call a Convention for proposing Amendments, which, in either Case, shall be valid to all Intents and Purposes, as part of this Constitution, when ratified by the Legislatures of three fourths of the several States, or by Conventions in three fourths thereof, as the one or the other Mode of Ratification may be proposed by the Congress; Provided that no Amendment which may be made prior to the Year One thousand eight hundred and eight shall in any Manner affect the first and fourth Clauses in the Ninth Section of the first Article; and that no State, without its Consent, shall be deprived of its equal Suffrage in the Senate.

KinoSearch1-1.01/t/us_constitution/art6.html000444000765000765 273511462203446 21153 0ustar00marvinmarvin000000000000 Article VI

All Debts contracted and Engagements entered into, before the Adoption of this Constitution, shall be as valid against the United States under this Constitution, as under the Confederation.

This Constitution, and the Laws of the United States which shall be made in Pursuance thereof; and all Treaties made, or which shall be made, under the Authority of the United States, shall be the supreme Law of the Land; and the Judges in every State shall be bound thereby, any Thing in the Constitution or Laws of any State to the Contrary notwithstanding.

The Senators and Representatives before mentioned, and the Members of the several State Legislatures, and all executive and judicial Officers, both of the United States and of the several States, shall be bound by Oath or Affirmation, to support this Constitution; but no religious Test shall ever be required as a Qualification to any Office or public Trust under the United States.

KinoSearch1-1.01/t/us_constitution/art7.html000444000765000765 373011462203446 21150 0ustar00marvinmarvin000000000000 Article VII

The Ratification of the Conventions of nine States, shall be sufficient for the Establishment of this Constitution between the States so ratifying the Same.

Done in Convention by the Unanimous Consent of the States present the Seventeenth Day of September in the Year of our Lord one thousand seven hundred and Eighty seven and of the Independence of the United States of America the Twelfth. In Witness whereof We have hereunto subscribed our Names.

Go Washington - President and deputy from Virginia

New Hampshire - John Langdon, Nicholas Gilman

Massachusetts - Nathaniel Gorham, Rufus King

Connecticut - Wm Saml Johnson, Roger Sherman

New York - Alexander Hamilton

New Jersey - Wil Livingston, David Brearley, Wm Paterson, Jona. Dayton

Pensylvania - B Franklin, Thomas Mifflin, Robt Morris, Geo. Clymer, Thos FitzSimons, Jared Ingersoll, James Wilson, Gouv Morris

Delaware - Geo. Read, Gunning Bedford jun, John Dickinson, Richard Bassett, Jaco. Broom

Maryland - James McHenry, Dan of St Tho Jenifer, Danl Carroll

Virginia - John Blair, James Madison Jr.

North Carolina - Wm Blount, Richd Dobbs Spaight, Hu Williamson

South Carolina - J. Rutledge, Charles Cotesworth Pinckney, Charles Pinckney, Pierce Butler

Georgia - William Few, Abr Baldwin

Attest: William Jackson, Secretary

KinoSearch1-1.01/t/us_constitution/index.html000444000765000765 763411462203446 21411 0ustar00marvinmarvin000000000000 US Constitution KinoSearch1-1.01/t/us_constitution/preamble.html000444000765000765 146311462203446 22063 0ustar00marvinmarvin000000000000 Preamble

We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.

KinoSearch1-1.01/t/us_constitution/uscon.css000444000765000765 62211462203446 21223 0ustar00marvinmarvin000000000000body,table,textarea { font-family: Arial, Helvetica, sans-serif; } body { font-size: 90%; background: #fff; margin: 0 0 0 0; } div#navigation { background: #ddfff6; padding: 10px; border-bottom: 1px solid #555; } div#bodytext { margin: 10px 10px 10px 10px; } form#usconSearch { display: inline; margin-bottom: 0px; } span.excerptURL { color: green; }