MediaWiki-DumpFile-0.2.2/000755 000765 000024 00000000000 12173343122 015361 5ustar00tylerstaff000000 000000
MediaWiki-DumpFile-0.2.2/Changes000644 000765 000024 00000006646 12173331531 016671 0ustar00tylerstaff000000 000000
Revision history for MediaWiki-DumpFile

0.2.2 July 22, 2013
  * Fixed bug #69983 "type map failed for varbinary"
  * Included transparent support for compressed XML and SQL files
  * Applied patch by CHOCOLATE per bug #85143

0.2.1
  * No longer enforces version checking of the XML document schema by default, but it remains as an option

0.2.0
  * No longer depends on XML::CompactTree::XS but does recommend it. Just install it via CPAN and it will be used automatically to make things faster.
  * Fast mode is here! ::Pages, ::FastPages, and ::Compat::Pages can all be very fast by giving up support for everything besides titles and the text contents of the first entry in the dump file.
  * Resolved bug #63453 "categories() does not work in MediaWiki::DumpFile::Compat"
  * MediaWiki::DumpFile::Compat did not perform caching on methods the way Parse::MediaWikiDump did; also from bug #63453
  * Added formal error messages for a few events, most notably when the page dump version changes to an unknown one. Now the user is directed to our documentation, where there are explicit instructions on how to report a bug and on how to use an environment variable to override the version check when it refuses to run.
  * Added the XML benchmarking suite I created to study XML processing speeds to the distro; hopefully more people will be interested in the shootout.
  * Ported over documentation from Parse::MediaWikiDump, giving ::Compat full documentation in this module as well.
  * ::Compat::page now uses optimized regex caching and compilation

0.1.8
  * Resolved bug #58107 "MediaWiki::DumpFile::SQL, not all dumps have LOCK"

0.1.7
  * Added current_byte() and size() methods to ::Pages
  * Documented bug #56843 in ::Compat

0.1.6
  * API changes in XML::TreePuller
  * Added some missing deps to Makefile.PL

0.1.5
  * Made the CPAN indexer unhappy with me

0.1.3
  * Added compatibility with Parse::MediaWikiDump via MediaWiki::DumpFile::Compat

0.1.2
  * Fixed bug #55758 - When calling subroutine comment ($revision->comment), it throws null pointer exception.
  * Added a check to the ::Pages constructor to complain if the input specified as a file is not really in the file system

0.1.1
  * Split off MediaWiki::DumpFile::XML into XML::TreePuller and made it available on CPAN
  * Removed LIMITATIONS section of MediaWiki::DumpFile documentation

0.0.10
  * Fixed bug in subclassing of pages object to handle version 0.4 dump files
  * Fixed incorrect dependencies listed in Makefile.PL

0.0.9
  * Added support for dump files with versions less than 0.3 and for versions equal to 0.4, with support for the redirect feature of the 0.4 dump file
  * Refactored the interface to ::XML, getting it ready to be split off into its own CPAN module
  * Changed the constructor class to require the parsing modules when the methods are invoked instead of use()ing them when the module is loaded.

0.0.8
  * Added minor method to ::Pages::Revision
  * Created ::Pages::Revision::Contributor

0.0.7
  * Added MediaWiki::DumpFile::FastPages - twice as fast as MediaWiki::DumpFile::Pages but with limited features.
  * Improved documentation for MediaWiki::DumpFile::Pages

0.0.6
  * Remembered to start listing changes
  * Implemented a method to get the contents of the create_table statement in ::SQL per bug #53371

0.0.1
  * First version, released on an unsuspecting world.
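The 0.1.3 entry above introduced the compatibility layer: existing Parse::MediaWikiDump code can keep its calls unchanged by loading MediaWiki::DumpFile::Compat instead. A minimal sketch patterned on t/90-compat-pages.t further down in this archive; the fixture path comes from that test and the print statements are only illustrative:

use strict;
use warnings;
use MediaWiki::DumpFile::Compat;    # supplies the Parse::MediaWikiDump package

# the same factory call legacy Parse::MediaWikiDump code already makes
my $pages = Parse::MediaWikiDump->pages('t/compat.pages_test.xml');

while (defined(my $page = $pages->next)) {
    my $text = $page->text;         # a reference to the article text
    print $page->title, "\n";
    print $$text;
}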
MediaWiki-DumpFile-0.2.2/lib/000755 000765 000024 00000000000 12173343122 016127 5ustar00tylerstaff000000 000000 MediaWiki-DumpFile-0.2.2/Makefile.PL000644 000765 000024 00000001730 11616413320 017333 0ustar00tylerstaff000000 000000 use strict; use warnings; use ExtUtils::MakeMaker; WriteMakefile( META_MERGE => { recommends => { 'XML::CompactTree::XS' => '0.02' } }, NAME => 'MediaWiki::DumpFile', AUTHOR => q{"Tyler Riddle "}, VERSION_FROM => 'lib/MediaWiki/DumpFile.pm', ABSTRACT_FROM => 'lib/MediaWiki/DumpFile.pm', ($ExtUtils::MakeMaker::VERSION >= 6.3002 ? ('LICENSE'=> 'perl') : ()), PL_FILES => {}, PREREQ_PM => { 'Test::More' => '0.94', 'Test::Simple' => '0.94', 'Scalar::Util' => '1.21', 'XML::TreePuller' => '0.1.0', 'Data::Compare' => '1.2101', 'Test::Exception' => '0.27', 'File::Find::Rule' => '0.32', 'File::Type' => '0.22', 'IO::Uncompress::AnyUncompress' => '2.0.37', }, dist => { COMPRESS => 'gzip -9f', SUFFIX => 'gz', }, clean => { FILES => 'MediaWiki-DumpFile-*' }, ); MediaWiki-DumpFile-0.2.2/MANIFEST000644 000765 000024 00000004613 11616413422 016520 0ustar00tylerstaff000000 000000 Changes MANIFEST Makefile.PL README lib/MediaWiki/DumpFile.pm lib/MediaWiki/DumpFile/SQL.pm lib/MediaWiki/DumpFile/Pages.pm lib/MediaWiki/DumpFile/Pages/Lib.pm lib/MediaWiki/DumpFile/FastPages.pm lib/MediaWiki/DumpFile/Compat.pm lib/MediaWiki/DumpFile/Compat/link.pod lib/MediaWiki/DumpFile/Compat/Links.pod lib/MediaWiki/DumpFile/Compat/page.pod lib/MediaWiki/DumpFile/Compat/Pages.pod lib/MediaWiki/DumpFile/Compat/Revisions.pod lib/MediaWiki/DumpFile/Benchmarks.pod t/20-FastPages.t t/20-Pages.t t/20-SQL.t t/00-load.t t/90-compat-links.t t/90-compat-pages.t t/90-compat-revisions.t t/95-compat-pages-single_revision_only.t t/specieswiki-20091204-user_groups.data t/specieswiki-20091204-user_groups.sql t/pages_test.xml t/compat.pages_test.xml t/compat.links_test.sql t/compat.revisions_test.xml t/50-fuzz-fast-mode.off t/91-compat-pre_factory.t TODO speed_test/Bench.pm speed_test/benchmark.pl speed_test/bin/expat.c speed_test/bin/iksemel.c speed_test/bin/libxml.c speed_test/bin/Makefile speed_test/MODULES speed_test/README speed_test/test_cases/expat.t speed_test/test_cases/libxml.t speed_test/test_cases/MediaWiki-DumpFile-Compat_fastmode.t speed_test/test_cases/MediaWiki-DumpFile-Compat.t speed_test/test_cases/MediaWiki-DumpFile-FastPages.t speed_test/test_cases/MediaWiki-DumpFile-Pages.t speed_test/test_cases/Parse-MediaWikiDump.t speed_test/test_cases/XML-Bare.t speed_test/test_cases/XML-CompactTree-XS.t speed_test/test_cases/XML-CompactTree.t speed_test/test_cases/XML-LibXML-Reader.t speed_test/test_cases/XML-LibXML-SAX-ChunkParser.t speed_test/test_cases/XML-LibXML-SAX.t speed_test/test_cases/XML-LibXML-SAX_charjoin.t speed_test/test_cases/XML-Parser-Expat.t speed_test/test_cases/XML-Parser-ExpatNB.t speed_test/test_cases/XML-Parser.t speed_test/test_cases/XML-Parser_string_append.t speed_test/test_cases/XML-Records.t speed_test/test_cases/XML-Rules.t speed_test/test_cases/XML-SAX-Expat.t speed_test/test_cases/XML-SAX-ExpatXS.t speed_test/test_cases/XML-SAX-ExpatXS_nocharjoin.t speed_test/test_cases/XML-SAX-PurePerl.t speed_test/test_cases/XML-TreePuller_config.t speed_test/test_cases/XML-TreePuller_element.t speed_test/test_cases/XML-TreePuller_xpath.t speed_test/test_cases/XML-Twig.t speed_test/test_cases/MediaWiki-DumpFile-Pages_fastmode.t META.yml Module meta-data (added by MakeMaker) META.json Module JSON meta-data (added by MakeMaker) MediaWiki-DumpFile-0.2.2/META.json000644 000765 000024 00000002455 
12173343122 017010 0ustar00tylerstaff000000 000000
{
   "abstract" : "Process various dump files from a MediaWiki instance",
   "author" : [
      "\"Tyler Riddle \""
   ],
   "dynamic_config" : 1,
   "generated_by" : "ExtUtils::MakeMaker version 6.66, CPAN::Meta::Converter version 2.120921",
   "license" : [
      "perl_5"
   ],
   "meta-spec" : {
      "url" : "http://search.cpan.org/perldoc?CPAN::Meta::Spec",
      "version" : "2"
   },
   "name" : "MediaWiki-DumpFile",
   "no_index" : {
      "directory" : [
         "t",
         "inc"
      ]
   },
   "prereqs" : {
      "build" : {
         "requires" : {
            "ExtUtils::MakeMaker" : "0"
         }
      },
      "configure" : {
         "requires" : {
            "ExtUtils::MakeMaker" : "0"
         }
      },
      "runtime" : {
         "recommends" : {
            "XML::CompactTree::XS" : "0.02"
         },
         "requires" : {
            "Data::Compare" : "1.2101",
            "File::Find::Rule" : "0.32",
            "File::Type" : "0.22",
            "IO::Uncompress::AnyUncompress" : "v2.0.37",
            "Scalar::Util" : "1.21",
            "Test::Exception" : "0.27",
            "Test::More" : "0.94",
            "Test::Simple" : "0.94",
            "XML::TreePuller" : "v0.1.0"
         }
      }
   },
   "release_status" : "stable",
   "version" : "v0.2.2"
}
MediaWiki-DumpFile-0.2.2/META.yml000644 000765 000024 00000001400 12173343122 016625 0ustar00tylerstaff000000 000000
---
abstract: 'Process various dump files from a MediaWiki instance'
author:
  - "\"Tyler Riddle \""
build_requires:
  ExtUtils::MakeMaker: 0
configure_requires:
  ExtUtils::MakeMaker: 0
dynamic_config: 1
generated_by: 'ExtUtils::MakeMaker version 6.66, CPAN::Meta::Converter version 2.120921'
license: perl
meta-spec:
  url: http://module-build.sourceforge.net/META-spec-v1.4.html
  version: 1.4
name: MediaWiki-DumpFile
no_index:
  directory:
    - t
    - inc
recommends:
  XML::CompactTree::XS: 0.02
requires:
  Data::Compare: 1.2101
  File::Find::Rule: 0.32
  File::Type: 0.22
  IO::Uncompress::AnyUncompress: v2.0.37
  Scalar::Util: 1.21
  Test::Exception: 0.27
  Test::More: 0.94
  Test::Simple: 0.94
  XML::TreePuller: v0.1.0
version: v0.2.2
MediaWiki-DumpFile-0.2.2/README000644 000765 000024 00000002675 11543133737 016254 0ustar00tylerstaff000000 000000
MediaWiki-DumpFile

This module is used to parse various dump files from a MediaWiki instance.
The most likely case is that you will want to be parsing content at
http://download.wikimedia.org/backup-index.html provided by Wikimedia, which
includes the English and all other language Wikipedias.

This module could also be considered Parse::MediaWikiDump version 2. It has
been created as a separate distribution to improve the API without breaking
existing code that is using Parse::MediaWikiDump.

INSTALLATION

To install this module, run the following commands:

	perl Makefile.PL
	make
	make test
	make install

SUPPORT AND DOCUMENTATION

After installing, you can find documentation for this module with the
perldoc command.

    perldoc MediaWiki::DumpFile

You can also look for information at:

    RT, CPAN's request tracker
        http://rt.cpan.org/NoAuth/Bugs.html?Dist=MediaWiki-DumpFile

    AnnoCPAN, Annotated CPAN documentation
        http://annocpan.org/dist/MediaWiki-DumpFile

    CPAN Ratings
        http://cpanratings.perl.org/d/MediaWiki-DumpFile

    Search CPAN
        http://search.cpan.org/dist/MediaWiki-DumpFile/

COPYRIGHT AND LICENCE

Copyright (C) 2009 "Tyler Riddle"

This program is free software; you can redistribute it and/or modify it
under the terms of either: the GNU General Public License as published
by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.
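The README stops at installation; the API itself is exercised by the bundled tests (t/20-Pages.t and t/20-FastPages.t, later in this archive). A minimal usage sketch patterned on those tests; t/pages_test.xml is the fixture shipped with the distribution, and the print statements are only illustrative:

use strict;
use warnings;
use MediaWiki::DumpFile;

my $mw = MediaWiki::DumpFile->new;

# Full interface: one object per <page> element, with revision metadata.
my $pages = $mw->pages('t/pages_test.xml');
print "dump schema version ", $pages->version, "\n";
while (defined(my $page = $pages->next)) {
    print $page->title, "\n";
    print $page->revision->text, "\n";
}

# Fast interface: titles and text only, returned as a flat pair.
my $fast = $mw->fastpages('t/pages_test.xml');
while (my ($title, $text) = $fast->next) {
    print "$title\n";
}

Per the 0.2.2 Changes entry earlier in this archive, the same calls accept compressed dump files transparently.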
MediaWiki-DumpFile-0.2.2/speed_test/000755 000765 000024 00000000000 12173343122 017520 5ustar00tylerstaff000000 000000 MediaWiki-DumpFile-0.2.2/t/000755 000765 000024 00000000000 12173343122 015624 5ustar00tylerstaff000000 000000 MediaWiki-DumpFile-0.2.2/TODO000644 000765 000024 00000000000 11543133737 016050 0ustar00tylerstaff000000 000000 MediaWiki-DumpFile-0.2.2/t/00-load.t000644 000765 000024 00000000313 11543133737 017153 0ustar00tylerstaff000000 000000 use Test::More tests => 2; BEGIN { use_ok( 'MediaWiki::DumpFile' ); use_ok(' MediaWiki::DumpFile::Compat'); } diag( "Testing MediaWiki::DumpFile $MediaWiki::DumpFile::VERSION, Perl $], $^X" ); MediaWiki-DumpFile-0.2.2/t/20-FastPages.t000644 000765 000024 00000001534 11543133737 020121 0ustar00tylerstaff000000 000000 use strict; use warnings; use Test::Simple tests => 14; use Data::Compare; use Data::Dumper; use MediaWiki::DumpFile; my $test_data = "t/pages_test.xml"; my $mw = MediaWiki::DumpFile->new; my $p = $mw->fastpages($test_data); test_suite($p); die "die could not open $test_data: $!" unless open(INPUT, $test_data); $p = $mw->fastpages(\*INPUT); test_suite($p); sub test_suite { my ($p) = @_; test_one($p->next); test_two($p->next); test_three($p->next); ok(! scalar($p->next)); } sub test_one { my ($title, $text) = @_; ok($title eq 'Talk:Title Test Value'); ok($text eq 'Text Test Value'); } sub test_two { my ($title, $text) = @_; ok($title eq 'Title Test Value #2'); ok($text eq '#redirect : [[fooooo]]'); } sub test_three { my ($title, $text) = @_; ok($title eq 'Title Test Value #3'); ok($text eq '#redirect [[fooooo]]'); }MediaWiki-DumpFile-0.2.2/t/20-Pages.t000644 000765 000024 00000005605 11543133737 017306 0ustar00tylerstaff000000 000000 use strict; use warnings; use Test::Simple tests => 89; use Data::Compare; use Data::Dumper; use MediaWiki::DumpFile; our $TEST = 'file'; my $test_data = "t/pages_test.xml"; my $mw = MediaWiki::DumpFile->new; my $p = $mw->pages($test_data); test_suite($p); $TEST = 'filehandle'; die "die could not open $test_data: $!" unless open(INPUT, $test_data); $p = $mw->pages(\*INPUT); test_suite($p); sub test_suite { my ($p) = @_; my %namespace_test_values = new_namespace_data(); my %namespace_test_against; ok($p->version eq '0.3'); ok($p->sitename eq 'Sitename Test Value'); ok($p->base eq 'Base Test Value'); ok($p->case eq 'Case Test Value'); %namespace_test_against = $p->namespaces; ok(Compare(\%namespace_test_values, \%namespace_test_against)); ok(defined($p->current_byte)); ok($p->current_byte != 0); if ($TEST ne 'filehandle') { ok($p->size == 2259); } test_one($p->next); test_two($p->next); test_three($p->next); ok(! defined($p->next)); } sub new_namespace_data { return ( '-1' => 'Special', '0' => '', '1' => 'Talk', ); } sub test_one { my ($page) = @_; my $revision = $page->revision; ok($page->title eq 'Talk:Title Test Value'); ok($page->id == 1); ok($revision->text eq 'Text Test Value'); ok($revision->id == 47084); ok($revision->timestamp eq '2005-07-09T18:41:10Z'); ok($revision->comment eq ''); #bug #55758 ok($revision->minor == 1); ok($page->revision->contributor->username eq 'Username Test Value'); ok($page->revision->contributor->id == 1292); ok(! 
defined($page->revision->contributor->ip)); ok($page->revision->contributor->astext eq 'Username Test Value'); ok($page->revision->contributor eq 'Username Test Value'); } sub test_two { my ($page) = @_; my @revisions = $page->revision; my $revision; ok($page->title eq 'Title Test Value #2'); ok($page->id == 2); $revision = shift(@revisions); ok($revision->id == 47085); ok($revision->timestamp eq '2005-07-09T18:41:10Z'); ok($revision->comment eq 'Comment Test Value 2'); ok($revision->text eq '#redirect : [[fooooo]]'); ok($revision->minor == 1); $revision = shift(@revisions); ok($revision->id == 12345); ok($revision->timestamp eq '2006-07-09T18:41:10Z'); ok($revision->comment eq 'Comment Test Value 3'); ok($revision->text eq 'more test data'); ok($revision->minor == 0); } sub test_three { my ($page) = @_; my $revision = $page->revision; ok($page->title eq 'Title Test Value #3'); ok($page->id == 3); ok($revision->id == 57086); ok($revision->timestamp eq '2008-07-09T18:41:10Z'); ok($revision->comment eq 'Second Comment Test Value'); ok($revision->text eq 'Expecting this data'); ok($revision->minor == 1); ok($revision->contributor->ip eq '194.187.135.27'); ok(! defined($revision->contributor->username)); ok(! defined($revision->contributor->id)); ok($revision->contributor->astext eq '194.187.135.27'); ok($revision->contributor eq '194.187.135.27'); }MediaWiki-DumpFile-0.2.2/t/20-SQL.t000644 000765 000024 00000002207 11543133737 016701 0ustar00tylerstaff000000 000000 use strict; use warnings; use Data::Dumper; use Data::Compare; use Storable qw(nstore retrieve); use Test::Simple tests => 106; use MediaWiki::DumpFile; my $test_file = 't/specieswiki-20091204-user_groups.sql'; my $mw = MediaWiki::DumpFile->new; my $p = $mw->sql($test_file); test_suite($p); die "could not open $test_file: $!" unless open(FILE, $test_file); $p = $mw->sql(\*FILE); test_suite($p); sub test_suite { my ($p) = @_; my $data = retrieve('t/specieswiki-20091204-user_groups.data'); my @schema = $p->schema; ok($p->table_name eq 'user_groups'); ok($p->table_statement eq table_statement_data()); ok($schema[0][0] eq 'ug_user'); ok($schema[0][1] eq 'int'); ok($schema[1][0] eq 'ug_group'); ok($schema[1][1] eq 'varchar'); ok(! 
defined($schema[2])); while(defined(my $row = $p->next)) { my $test_against = shift(@$data); ok(Compare($test_against, $row)); } } sub table_statement_data { return "CREATE TABLE `user_groups` ( `ug_user` int(5) unsigned NOT NULL default '0', `ug_group` varchar(16) binary NOT NULL default '', PRIMARY KEY (`ug_user`,`ug_group`), KEY `ug_group` (`ug_group`) ) TYPE=InnoDB; "; }MediaWiki-DumpFile-0.2.2/t/50-fuzz-fast-mode.off000644 000765 000024 00000001560 11543133737 021430 0ustar00tylerstaff000000 000000 use strict; use warnings; use MediaWiki::Dumpfile; use MediaWiki::DumpFile::Pages; use MediaWiki::DumpFile::FastPages; my $input = shift(@ARGV); my $pages = MediaWiki::DumpFile::Pages->new(input => $input, fast => 1); my @titles; print "generating titles\n"; while(my ($title, $text) = $pages->next) { push(@titles, $title); } while(1) { print "starting loop\n"; my @copy = reverse(@titles); $pages = MediaWiki::DumpFile::Pages->new(input => $input); my $fast = 0; while(1) { my $title; if (rand(1) > .5) { $fast = 1; } else { $fast = 0; } if ($fast) { ($title) = $pages->next($fast); last unless defined $title; } else { my $page = $pages->next; last unless defined $page; $title = $page->title; } die "failed" unless pop(@copy) eq $title; } die "did not read number of entries right" unless scalar(@copy) == 0; }MediaWiki-DumpFile-0.2.2/t/90-compat-links.t000644 000765 000024 00000000676 11543133737 020662 0ustar00tylerstaff000000 000000 #!perl use Test::Simple tests =>4; use strict; use warnings; use MediaWiki::DumpFile::Compat; use Data::Dumper; my $file = 't/compat.links_test.sql'; my $links = Parse::MediaWikiDump->links($file); my $sum; my $last_link; while(my $link = $links->next) { $sum += $link->from; $last_link = $link; } ok($sum == 92288); ok($last_link->from == 3955); ok($last_link->to eq '...Baby_One_More_Time_(single)'); ok($last_link->namespace == 0); MediaWiki-DumpFile-0.2.2/t/90-compat-pages.t000644 000765 000024 00000005327 11543133737 020637 0ustar00tylerstaff000000 000000 #!perl -w use Test::Simple tests => 110; use strict; use MediaWiki::DumpFile::Compat; use Data::Dumper; my $file = 't/compat.pages_test.xml'; my $fh; my $pages; my $mode; $mode = 'file'; test_all($file); open($fh, $file) or die "could not open $file: $!"; $mode = 'handle'; test_all($fh); sub test_all { $pages = Parse::MediaWikiDump->pages(shift); test_one(); test_two(); test_three(); test_four(); test_five(); ok(! defined($pages->next)); } sub test_one { ok($pages->sitename eq 'Sitename Test Value'); ok($pages->base eq 'Base Test Value'); ok($pages->generator eq 'Generator Test Value'); ok($pages->case eq 'Case Test Value'); ok($pages->namespaces->[0]->[0] == -2); ok($pages->namespaces_names->[0] eq 'Media'); ok($pages->current_byte != 0); ok($pages->version eq '0.3'); if ($mode eq 'file') { ok($pages->size == 3117); } elsif ($mode eq 'handle') { ok(! defined($pages->size)) } else { die "invalid test mode"; } my $page = $pages->next; my $text = $page->text; ok(defined($page)); ok($page->title eq 'Talk:Title Test Value'); ok($page->id == 1); ok($page->revision_id == 47084); ok($page->username eq 'Username Test Value'); ok($page->userid == 1292); ok($page->timestamp eq '2005-07-09T18:41:10Z'); ok($page->userid == 1292); ok($page->minor); ok($$text eq "Text Test Value\n"); ok($page->namespace eq 'Talk'); ok(! defined($page->redirect)); ok(! 
defined($page->categories)); } sub test_two { my $page = $pages->next; my $text = $page->text; ok($page->title eq 'Title Test Value #2'); ok($page->id == 2); ok($page->revision_id eq '47085'); ok($page->username eq 'Username Test Value 2'); ok($page->timestamp eq '2005-07-09T18:41:10Z'); ok($page->userid == 1292); ok($page->minor); ok($$text eq "#redirect : [[fooooo]]\n"); ok($page->namespace eq ''); ok($page->redirect eq 'fooooo'); ok(! defined($page->categories)); } sub test_three { my $page = $pages->next; ok(defined($page)); ok($page->title eq 'Title Test Value #3'); ok($page->id == 3); ok($page->timestamp eq '2005-07-09T18:41:10Z'); ok($page->username eq 'Username Test Value'); ok($page->userid == 1292); ok(ref($page->categories) eq 'ARRAY'); ok($page->categories->[0] eq 'test'); } sub test_four { my $page = $pages->next; ok(defined($page)); ok($page->id == 4); ok($page->timestamp eq '2005-07-09T18:41:10Z'); ok($page->username eq 'Username Test Value'); ok($page->userid == 1292); #test for bug 36255 ok($page->namespace eq ''); ok($page->title eq 'NotANameSpace:Bar'); } sub test_five { my $page = $pages->next; ok(defined($page)); ok($page->id == 5); ok($page->title eq 'Moar Tests'); ok(! defined($page->username)); ok(! defined($page->userid)); ok($page->userip eq '62.104.212.74'); } MediaWiki-DumpFile-0.2.2/t/90-compat-revisions.t000644 000765 000024 00000006334 11543133737 021560 0ustar00tylerstaff000000 000000 #!perl -w use Test::Simple tests => 114; use strict; use MediaWiki::DumpFile::Compat; use Data::Dumper; my $file = 't/compat.revisions_test.xml'; my $fh; my $revisions; my $mode; $mode = 'file'; test_all($file); open($fh, $file) or die "could not open $file: $!"; $mode = 'handle'; test_all($fh); sub test_all { $revisions = Parse::MediaWikiDump->revisions(shift); test_siteinfo(); test_one(); test_two(); test_three(); test_four(); test_five(); test_six(); ok(! defined($revisions->next)); } sub test_siteinfo { ok($revisions->sitename eq 'Sitename Test Value'); ok($revisions->base eq 'Base Test Value'); ok($revisions->generator eq 'Generator Test Value'); ok($revisions->case eq 'Case Test Value'); ok($revisions->namespaces->[0]->[0] == -2); ok($revisions->namespaces_names->[0] eq 'Media'); ok($revisions->current_byte != 0); ok($revisions->version eq '0.3'); if ($mode eq 'file') { ok($revisions->size == 3570); } elsif ($mode eq 'handle') { ok(! defined($revisions->size)); } else { die "invalid test mode"; } } #the first two tests check everything to make sure information #is not leaking across pages due to accumulator errors. sub test_one { my $page = $revisions->next; my $text = $page->text; ok(defined($page)); ok($page->title eq 'Talk:Title Test Value'); ok($page->id == 1); ok($page->revision_id == 47084); ok($page->username eq 'Username Test Value 1'); ok($page->userid == 1292); ok($page->timestamp eq '2005-07-09T18:41:10Z'); ok($page->userid == 1292); ok($page->minor); ok($$text eq "Text Test Value 1\n"); ok($page->namespace eq 'Talk'); ok(! defined($page->redirect)); ok(! defined($page->categories)); } sub test_two { my $page = $revisions->next; my $text = $page->text; ok($page->title eq 'Title Test Value #2'); ok($page->id == 2); ok($page->revision_id eq '47085'); ok($page->username eq 'Username Test Value 2'); ok($page->timestamp eq '2006-07-09T18:41:10Z'); ok($page->userid == 12); ok($page->minor); ok($$text eq "#redirect : [[fooooo]]"); ok($page->namespace eq ''); ok($page->redirect eq 'fooooo'); ok(! 
defined($page->categories)); } sub test_three { my $page = $revisions->next; my $text = $page->text; ok(defined($page)); ok($page->redirect eq 'fooooo'); ok($page->title eq 'Title Test Value #2'); ok($page->id == 2); ok($page->timestamp eq '2005-07-09T18:41:10Z'); ok($page->username eq 'Username Test Value'); ok($page->userid == 1292); ok(! $page->minor); } sub test_four { my $page = $revisions->next; my $text = $page->text; ok(defined($page)); ok($page->id == 4); ok($page->timestamp eq '2005-07-09T18:41:10Z'); ok($page->username eq 'Username Test Value'); ok($page->userid == 1292); #test for bug 36255 ok($page->namespace eq ''); ok($page->title eq 'NotANameSpace:Bar'); } #test for Bug 50092 sub test_five { my $page = $revisions->next; ok($page->title eq 'Bug 50092 Test'); ok(defined(${$page->text})); } #test for bug 53361 sub test_six { my $page = $revisions->next; ok($page->title eq 'Test for bug 53361'); ok($page->username eq 'Ben-Zin'); ok(! defined($page->userip)); $page = $revisions->next; ok($page->title eq 'Test for bug 53361'); ok($page->userip eq '62.104.212.74'); ok(! defined($page->username)); } MediaWiki-DumpFile-0.2.2/t/91-compat-pre_factory.t000644 000765 000024 00000000461 11543133737 022050 0ustar00tylerstaff000000 000000 use Test::Simple tests => 3; use strict; use MediaWiki::DumpFile::Compat; ok(defined(Parse::MediaWikiDump::Pages->new('t/compat.pages_test.xml'))); ok(defined(Parse::MediaWikiDump::Revisions->new('t/compat.revisions_test.xml'))); ok(defined(Parse::MediaWikiDump::Links->new('t/compat.links_test.sql')));MediaWiki-DumpFile-0.2.2/t/95-compat-pages-single_revision_only.t000644 000765 000024 00000000535 11543133737 025076 0ustar00tylerstaff000000 000000 #!perl -w use strict; use warnings; use Test::Exception tests => 1; use MediaWiki::DumpFile::Compat; my $file = 't/compat.revisions_test.xml'; throws_ok { test() } qr/^only one revision per page is allowed$/, 'one revision per article ok'; sub test { my $pages = Parse::MediaWikiDump->pages($file); while(defined($pages->next)) { }; }; MediaWiki-DumpFile-0.2.2/t/compat.links_test.sql000644 000765 000024 00000002220 11543133737 022013 0ustar00tylerstaff000000 000000 -- MySQL dump 9.11 -- -- Host: benet Database: simplewiki -- ------------------------------------------------------ -- Server version 4.0.22-log -- -- Table structure for table `pagelinks` -- DROP TABLE IF EXISTS `pagelinks`; CREATE TABLE `pagelinks` ( `pl_from` int(8) unsigned NOT NULL default '0', `pl_namespace` int(11) NOT NULL default '0', `pl_title` varchar(255) binary NOT NULL default '', UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`), KEY `pl_namespace` (`pl_namespace`,`pl_title`) ) TYPE=InnoDB; -- -- Dumping data for table `pagelinks` -- /*!40000 ALTER TABLE `pagelinks` DISABLE KEYS */; LOCK TABLES `pagelinks` WRITE; INSERT INTO `pagelinks` VALUES (7759,-1,'Recentchanges'),(4016,0,'\"Captain\"_Lou_Albano'),(7491,0,'\"Captain\"_Lou_Albano'),(9935,0,'\"Dimebag\"_Darrell'),(7617,0,'\"Hawkeye\"_Pierce'),(1495,0,'$1'),(1495,0,'$2'),(4901,0,'\',_art_title,_\''),(4376,0,'\'Abd_Al-Rahman_Al_Sufi'),(12418,0,'\'Allo_\'Allo!'),(4045,0,'\'Newton\'s_cradle\'_toy'),(4045,0,'\'Push-and-go\'_toy_car'),(7794,0,'\'Salem\'s_Lot'),(4670,0,'(2340_Hathor'),(1876,0,'(Mt.'),(4400,0,'(c)Brain'),(3955,0,'...Baby_One_More_Time_(single)'); MediaWiki-DumpFile-0.2.2/t/compat.pages_test.xml000644 000765 000024 00000006055 11543133737 022005 0ustar00tylerstaff000000 000000 Sitename Test Value Base Test Value Generator Test Value Case Test Value Media Special Talk 
User User talk Wikipedia Wikipedia talk Image Image talk MediaWiki MediaWiki talk Template Template talk Help Help talk Category Category talk Talk:Title Test Value 1 47084 2005-07-09T18:41:10Z Username Test Value1292 Comment Test Value Text Test Value Title Test Value #2 2 47085 2005-07-09T18:41:10Z Username Test Value 21292 Comment Test Value #redirect : [[fooooo]] Title Test Value #3 3 47086 2005-07-09T18:41:10Z Username Test Value1292 Comment Test Value [[category:test]] #redirect [[fooooo]] NotANameSpace:Bar 4 47088 2005-07-09T18:41:10Z Username Test Value1292 Comment Test Value test for bug #36255 - Parse::MediaWikiDump::page::namespace may return a string which is not really a namespace Moar Tests 5 38847 2002-10-31T14:53:37Z 62.104.212.74 MediaWiki-DumpFile-0.2.2/t/compat.revisions_test.xml000644 000765 000024 00000006762 11543133737 022734 0ustar00tylerstaff000000 000000 Sitename Test Value Base Test Value Generator Test Value Case Test Value Media Special Talk User User talk Wikipedia Wikipedia talk Image Image talk MediaWiki MediaWiki talk Template Template talk Help Help talk Category Category talk Talk:Title Test Value 1 47084 2005-07-09T18:41:10Z Username Test Value 11292 Comment Test Value 1 Text Test Value 1 Title Test Value #2 2 47085 2006-07-09T18:41:10Z Username Test Value 212 Comment Test Value 2 #redirect : [[fooooo]] 47086 2005-07-09T18:41:10Z Username Test Value1292 Comment Test Value #redirect [[fooooo]] NotANameSpace:Bar 4 47088 2005-07-09T18:41:10Z Username Test Value1292 Comment Test Value test for bug #36255 - Parse::MediaWikiDump::page::namespace may return a string which is not really a namespace Bug 50092 Test 5 47089 2005-07-09T18:41:10Z Username Test Value1292 Comment Test Value Test for bug 53361 145 38841 2002-09-08T22:15:32Z Ben-Zin 9 en: 38847 2002-10-31T14:53:37Z 62.104.212.74 MediaWiki-DumpFile-0.2.2/t/pages_test.xml000644 000765 000024 00000004323 11543133737 020517 0ustar00tylerstaff000000 000000 Sitename Test Value Base Test Value Generator Test Value Case Test Value Special Talk Talk:Title Test Value 1 47084 2005-07-09T18:41:10Z Username Test Value1292 Text Test Value Title Test Value #2 2 47085 2005-07-09T18:41:10Z Username Test Value 21292 Comment Test Value 2 #redirect : [[fooooo]] 12345 2006-07-09T18:41:10Z Username Test Value 31211 Comment Test Value 3 more test data Title Test Value #3 3 47086 2007-07-09T18:41:10Z Username Test Value1292 Comment Test Value #redirect [[fooooo]] 57086 2008-07-09T18:41:10Z 194.187.135.27 Second Comment Test Value Expecting this data MediaWiki-DumpFile-0.2.2/t/specieswiki-20091204-user_groups.data000644 000765 000024 00000003666 11543133737 024274 0ustar00tylerstaff000000 000000 pst0. 
botug_group 1062ug_user botug_group 3560ug_user botug_group 3774ug_user botug_group 5652ug_user botug_group 7298ug_user botug_group 7369ug_user botug_group 7568ug_user botug_group 8033ug_user botug_group 9737ug_user botug_group 132832ug_user bureaucratug_group 6ug_user bureaucratug_group 42ug_user bureaucratug_group 67ug_user bureaucratug_group 210ug_user bureaucratug_group 568ug_user bureaucratug_group 1033ug_user bureaucratug_group 1387ug_user bureaucratug_group 2297ug_user bureaucratug_group 3194ug_user bureaucratug_group 4305ug_user bureaucratug_group 6748ug_user bureaucratug_group 6764ug_user bureaucratug_group 6873ug_user bureaucratug_group 6984ug_user sysopug_group 6ug_user sysopug_group 42ug_user sysopug_group 67ug_user sysopug_group 78ug_user sysopug_group 210ug_user sysopug_group 410ug_user sysopug_group 568ug_user sysopug_group 1033ug_user sysopug_group 1387ug_user sysopug_group 2297ug_user sysopug_group 2391ug_user sysopug_group 3082ug_user sysopug_group 3194ug_user sysopug_group 4305ug_user sysopug_group 6748ug_user sysopug_group 6764ug_user sysopug_group 6806ug_user sysopug_group 6873ug_user sysopug_group 6984ug_user sysopug_group 7298ug_user sysopug_group 8033ug_user sysopug_group 9863ug_userMediaWiki-DumpFile-0.2.2/t/specieswiki-20091204-user_groups.sql000644 000765 000024 00000004214 11543133737 024150 0ustar00tylerstaff000000 000000 -- MySQL dump 10.11 -- -- Host: db27 Database: specieswiki -- ------------------------------------------------------ -- Server version 4.0.40-wikimedia-log /*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */; /*!40103 SET TIME_ZONE='+00:00' */; /*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */; /*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */; /*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */; /*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */; -- -- Table structure for table `user_groups` -- DROP TABLE IF EXISTS `user_groups`; SET @saved_cs_client = @@character_set_client; SET character_set_client = utf8; CREATE TABLE `user_groups` ( `ug_user` int(5) unsigned NOT NULL default '0', `ug_group` varchar(16) binary NOT NULL default '', PRIMARY KEY (`ug_user`,`ug_group`), KEY `ug_group` (`ug_group`) ) TYPE=InnoDB; SET character_set_client = @saved_cs_client; -- -- Dumping data for table `user_groups` -- LOCK TABLES `user_groups` WRITE; /*!40000 ALTER TABLE `user_groups` DISABLE KEYS */; INSERT INTO `user_groups` VALUES (1062,'bot'),(3560,'bot'),(3774,'bot'),(5652,'bot'),(7298,'bot'),(7369,'bot'),(7568,'bot'),(8033,'bot'),(9737,'bot'),(132832,'bot'),(6,'bureaucrat'),(42,'bureaucrat'),(67,'bureaucrat'),(210,'bureaucrat'),(568,'bureaucrat'),(1033,'bureaucrat'),(1387,'bureaucrat'),(2297,'bureaucrat'),(3194,'bureaucrat'),(4305,'bureaucrat'),(6748,'bureaucrat'),(6764,'bureaucrat'),(6873,'bureaucrat'),(6984,'bureaucrat'),(6,'sysop'),(42,'sysop'),(67,'sysop'),(78,'sysop'),(210,'sysop'),(410,'sysop'),(568,'sysop'),(1033,'sysop'),(1387,'sysop'),(2297,'sysop'),(2391,'sysop'),(3082,'sysop'),(3194,'sysop'),(4305,'sysop'),(6748,'sysop'),(6764,'sysop'),(6806,'sysop'),(6873,'sysop'),(6984,'sysop'),(7298,'sysop'),(8033,'sysop'),(9863,'sysop'); /*!40000 ALTER TABLE `user_groups` ENABLE KEYS */; UNLOCK TABLES; /*!40103 SET TIME_ZONE=@OLD_TIME_ZONE */; /*!40101 SET SQL_MODE=@OLD_SQL_MODE */; /*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */; /*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */; /*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */; -- Dump completed on 2009-12-04 18:19:57 
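The user_groups dump above is the fixture read by t/20-SQL.t earlier in this archive. A minimal sketch of driving the SQL interface over it, patterned on that test; treating each row as a hash keyed by column name is inferred from the stored comparison data and the schema() output, so take the row access as an assumption:

use strict;
use warnings;
use MediaWiki::DumpFile;

my $mw  = MediaWiki::DumpFile->new;
my $sql = $mw->sql('t/specieswiki-20091204-user_groups.sql');

print $sql->table_name, "\n";      # "user_groups"
print $sql->table_statement;       # the raw CREATE TABLE statement

# schema() returns column name/type pairs, e.g. ug_user => int
for my $col ($sql->schema) {
    print "$col->[0] is a $col->[1]\n";
}

# one row per tuple in the INSERT statements; hash-per-row access is assumed here
while (defined(my $row = $sql->next)) {
    print "$row->{ug_user} $row->{ug_group}\n";
}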
MediaWiki-DumpFile-0.2.2/speed_test/Bench.pm000644 000765 000024 00000006447 11543133737 021121 0ustar00tylerstaff000000 000000 package #go away cpan indexer! Bench; use strict; use warnings; our $hack; our $profile; sub Article { my ($title, $text) = @_; $title = '' unless defined $title; $text = '' unless defined $text; print "Title: $title\n"; print "$text\n"; # if (defined($ENV{PROFILE})) { # $profile++; # # exit 0 if $profile >= $ENV{PROFILE}; # } } package Bench::ExpatHandler; sub new { return(bless({ }, $_[0])); } sub starthandler { my ($self, $e, $element) = @_; if ($element eq 'title') { $self->char_on; $self->{in_title} = 1; } elsif ($element eq 'text') { $self->char_on; $self->{in_text} = 1; } } sub endhandler { my ($self, $e, $element) = @_; if ($self->{in_text}) { $self->{text} = $self->get_chars; $self->char_off; $self->{in_text} = 0; } elsif ($self->{in_title}) { $self->{title} = $self->get_chars; $self->char_off; $self->{in_title} = 0; } elsif ($element eq 'revision') { Bench::Article($self->{title}, $self->{text}); } } sub char_handler { my ($self, $e, $chars) = @_; if ($self->{char}) { push(@{ $self->{a} }, $chars); } } sub char_on { my ($self) = @_; $self->{a} = []; $self->{char} = 1; } sub char_off { my ($self) = @_; $self->{a} = []; $self->{char} = 0; } sub get_chars { my ($self) = @_; return join('', @{ $self->{a} }); } package Bench::SAXHandler; sub new { my ($class) = @_; return bless({}, $class); } sub start_element { my ($self, $element) = @_; $element = $element->{Name}; if ($element eq 'title') { $self->char_on; $self->{in_title} = 1; } elsif ($element eq 'text') { $self->char_on; $self->{in_text} = 1; } } sub end_element { my ($self, $element) = @_; $element = $element->{Name}; if ($self->{in_text}) { $self->{text} = $self->get_chars; $self->char_off; $self->{in_text} = 0; } elsif ($self->{in_title}) { $self->{title} = $self->get_chars; $self->char_off; $self->{in_title} = 0; } elsif ($element eq 'revision') { Bench::Article($self->{title}, $self->{text}); } } sub characters { my ($self, $characters) = @_; if ($self->{char}) { push(@{ $self->{a} }, $characters->{Data}); } } sub char_on { my ($self) = @_; $self->{a} = []; $self->{char} = 1; } sub char_off { my ($self) = @_; $self->{a} = []; $self->{char} = 0; } sub get_chars { my ($self) = @_; return join('', @{ $self->{a} }); } package Bench::CompactTree; use strict; use warnings; use XML::LibXML::Reader; sub run { my ($reader, $tree_sub) = @_; $reader->nextElement('page'); my $i = 0; while(++$i) { my $page = &$tree_sub($reader); my $p; last unless defined $page; die "expected element" unless $page->[0] == XML_READER_TYPE_ELEMENT; die "expected " unless $page->[1] eq 'page'; my $title = $page->[4]->[1]->[4]->[0]->[1]; my $text; foreach(@{$page->[4]}) { next unless $_->[0] == XML_READER_TYPE_ELEMENT; if ($_->[1] eq 'revision') { $p = $_->[4]; last; } } foreach(@$p) { next unless $_->[0] == XML_READER_TYPE_ELEMENT; if ($_->[1] eq 'text') { $text = $_->[4]->[0]->[1]; last; } } $text = '' unless defined $text; Bench::Article($title, $text); my $ret = $reader->nextElement('page'); die "read error" if $ret == -1; last if $ret == 0; die "expected 1" unless $ret == 1; } } 1;MediaWiki-DumpFile-0.2.2/speed_test/benchmark.pl000644 000765 000024 00000012122 11543133737 022016 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use Data::Dumper; use Digest::MD5; use XML::Parser; use bytes; use YAML; autoflush(\*STDOUT); binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $testdir = shift(@ARGV); 
my $datadir = shift(@ARGV); my $iterations = shift(@ARGV); my $output = shift(@ARGV); if (! defined($output)) { $output = 'results.data'; } my $log = 'results.log'; my @all_iterations; if (! defined($datadir)) { print STDERR "Usage: $0 [number of iterations]\n"; exit(1); } if (! defined($iterations)) { $iterations = 1; } my ($datafh, $logfh); die "could not open $output: $!" unless open($datafh, "> $output"); die "could not open log: $!" unless open($logfh, "> $log"); autoflush($logfh); while($iterations--) { print "Iterations remaining: ", $iterations + 1, "\n"; foreach my $data (get_contents($datadir)) { my $data_path = $data; my %report; my $md5; my $len; print "Benchmarking $data\n"; $report{tests} = []; $report{filename} = $data; $report{size} = filesize($data_path); print "Generating md5sum: "; ($len, $md5) = get_md5sum($data_path); $report{markup_density} = 1 - ($len / $report{size}); $report{md5sum} = $md5; print "$md5\n"; print "Markup density: ", $report{markup_density}, "\n"; foreach my $test (get_contents($testdir)) { my $command = "$test $data_path"; my $start_time = time; print "running $command: "; my %results = bench($command); print time - $start_time, " seconds "; if (! defined($results{fail_reason}) && (! defined($md5) || $results{md5sum} ne $md5)) { $results{failed} = 1; $results{fail_reason} = "md5sum mismatch"; } if ($results{failed}) { print "FAILED "; } print "\n"; push(@{$report{tests}}, { name => $test, %results }); } my @rankings = make_rankings(%report); $report{tests} = \@rankings; print $logfh Dump(\%report); push(@all_iterations, \%report); } } print $datafh Dump(\@all_iterations) or die "could not save results to $output: $!"; sub bench { my ($command) = @_; my ($read, $write); my ($child, $result); my ($cuser, $csys, $md5); my %results; pipe($read, $write); # autoflush($write); $child = fork(); if ($child == 0) { bench_child($command, $write); die "child should exit"; } $result = <$read>; waitpid($child, 0) or die "could not waitpid($child, 0)"; ($cuser, $csys, $md5) = parse_result($result); if (defined($cuser)) { $results{runtimes}->{system} = $csys; $results{runtimes}->{user} = $cuser; $results{runtimes}->{total} = $csys + $cuser; $results{md5sum} = $md5; } else { $results{failed} = 1; $results{fail_reason} = "benchmark execution error"; } return %results; } sub bench_child { my ($command, $write) = @_; my $md5 = Digest::MD5->new; my ($child_user, $child_sys); my $fh; open($fh, "$command |") or die "could not execute $command for reading"; while(<$fh>) { $md5->add($_); } if (! close($fh)) { print STDERR "FAILED "; print $write "FAILED\n"; exit(1); } if ($? >> 8) { print STDERR "FAILED "; print $write "FAILED\n"; exit(1); } (undef, undef, $child_user, $child_sys) = times; print $write "$child_user $child_sys ", $md5->hexdigest, "\n" or die "could not write to pipe: $!"; exit(0); } sub autoflush { my ($fh) = @_; my $old = select($fh); $| = 1; print ''; select($old); return; } sub parse_result { my ($text) = @_; if ($text !~ m/^([0-9.]+) ([0-9.]+) (.+)/) { return(); } return ($1, $2, $3); } sub filesize { my ($file) = @_; my @stat = stat($file); return $stat[7]; } sub get_contents { my ($dir) = @_; my @contents; if (-f $dir) { return ($dir); } die "could not open $dir: $!" unless opendir(DIR, $dir); foreach (sort(readdir(DIR))) { next if m/^\./; #next unless m/\.t$/; push(@contents, $dir . '/' . 
$_); } closedir(DIR); return @contents; } sub get_md5sum { my ($file) = @_; my $command; my $md5 = Digest::MD5->new; my $fh; my $prog; my $len; if (-x "bin/libxml") { $prog = 'test_cases/libxml.t'; } else { $prog = 'test_cases/XML-CompactTree-XS.t'; } $command = "$prog $file"; open($fh, "$command |") or die "could not execute $command for reading"; while(<$fh>) { $md5->add($_); $len += bytes::length($_); } close($fh) or die "could not close $command"; if ($? >> 8) { die "could not generate md5sum"; } return ($len, $md5->hexdigest); } sub make_rankings { my (%data) = @_; my @tests = sort_tests($data{tests}); my $fastest = $tests[0]->{runtimes}->{total}; my $size = $data{size}; if (! defined($fastest) || $fastest == 0) { die "no successful tests were run"; } foreach (@tests) { my $total = $_->{runtimes}->{total}; next unless defined $total; $_->{'MiB/sec'} = $size / $total / 1024 / 1024; $_->{percentage} = int($total / $fastest * 100); } return @tests; } sub sort_tests { return sort({ if ($a->{failed}) { return 1; } elsif ($b->{failed}) { return -1; } $a->{runtimes}->{total} <=> $b->{runtimes}->{total} } @{$_[0]} ); } MediaWiki-DumpFile-0.2.2/speed_test/bin/000755 000765 000024 00000000000 12173343122 020270 5ustar00tylerstaff000000 000000 MediaWiki-DumpFile-0.2.2/speed_test/MODULES000644 000765 000024 00000000350 11543133737 020562 0ustar00tylerstaff000000 000000 YAML Parse::MediaWikiDump XML::LibXML XML::Parser XML::SAX XML::SAX::ExpatXS XML::SAX::Expat XML::Twig XML::Bare HTML::Entities XML::CompactTree::XS XML::CompactTree XML::LibXML::SAX::ChunkParser XML::Parser XML::Records XML::Rules MediaWiki-DumpFile-0.2.2/speed_test/README000644 000765 000024 00000034001 11543133737 020407 0ustar00tylerstaff000000 000000 ABOUT This is a benchmark system for XML parsers against various language editions of the Wikipedia. The benchmark is to print all the article titles and text of a dump file specified on the command line to standard output. There are implementations for many perl parsing modules both high and low level. There are even implementations written in C that perform very fast. The benchmark.pl program is used to run a series of benchmarks. It takes two required arguments and one optional. The first required argument is a path to a directory full of tests to execute. The second required argument is a path to a directory full of dump files to execute the tests against. Both of these directories will be executed according to sort() on their file names. The third argument is a number of iterations to perform, the default being 1. Output goes to two files: results.log and results.data - they both are the output from YAML of an internal data structure that represents the test report. The results.log file is written to each time all the tests have been run against a specific file and lets you keep an eye on how long running jobs are performing. The results.data file is the cumulative data for all iterations and is written at the end of the entire run. The benchmark.pl utility and all of the tests are only guaranteed to work if executed from the root directory of this software package. The C based parsers are in the bin/ directory and can be compiled by executing make in that directory. The Iksemel parser is not currently functional for unknown reasons. THE CHALLENGE First and foremost the most important thing to keep in mind is that the English Wikipedia is currently 22 gigabytes of XML in a single file. 
You will not be able to use any XML processing system that requires the entire document to fit into RAM. Each benchmark must gather up the title and text for each Wikipedia article for an arbitrary XML dump file. In the spirit of making this test approximate a real world scenario you must collect all character data together and make it available at one time. For instance in the perl benchmarks they actually invoke a common method that prints the article title and text for them. In the C based tests they simply collect all the data and print it out at once. EXAMPLES Doing a test run: foodmotron:XML_Speed_Test tyler$ ./benchmark.pl test_cases data Iterations remaining: 1 Benchmarking 20-simplewiki-20091021-pages-articles.xml Generating md5sum: 8fa1e9de18b8da7523ebfe2dac53482a running test_cases/MediaWiki-DumpFile-SimplePages.t data/20-simplewiki-20091021-pages-articles.xml: 12 seconds running test_cases/Parse-MediaWikiDump.t data/20-simplewiki-20091021-pages-articles.xml: 66 seconds running test_cases/XML-Bare.t data/20-simplewiki-20091021-pages-articles.xml: 7 seconds running test_cases/XML-LibXML-Reader.t data/20-simplewiki-20091021-pages-articles.xml: 12 seconds running test_cases/XML-LibXML-SAX.t data/20-simplewiki-20091021-pages-articles.xml: 68 seconds running test_cases/XML-Parser-ExpatNB.t data/20-simplewiki-20091021-pages-articles.xml: 44 seconds running test_cases/XML-Parser.t data/20-simplewiki-20091021-pages-articles.xml: 42 seconds running test_cases/XML-SAX-Expat.t data/20-simplewiki-20091021-pages-articles.xml: 183 seconds running test_cases/XML-SAX-ExpatXS.t data/20-simplewiki-20091021-pages-articles.xml: 33 seconds running test_cases/XML-SAX-ExpatXS_nocharjoin.t data/20-simplewiki-20091021-pages-articles.xml: 62 seconds running test_cases/XML-SAX-PurePerl.t data/20-simplewiki-20091021-pages-articles.xml: 585 seconds running test_cases/XML-Twig.t data/20-simplewiki-20091021-pages-articles.xml: 204 seconds running test_cases/expat.t data/20-simplewiki-20091021-pages-articles.xml: 7 seconds running test_cases/libxml.t data/20-simplewiki-20091021-pages-articles.xml: 7 seconds foodmotron:XML_Speed_Test tyler$ The report: $VAR1 = [ { 'filename' => '20-simplewiki-20091021-pages-articles.xml', 'tests' => [ { 'runtimes' => { 'system' => '0.4', 'user' => '5.78', 'total' => '6.18' }, 'name' => 'libxml.t', 'percentage' => 100, 'md5sum' => '8fa1e9de18b8da7523ebfe2dac53482a', 'MiB/sec' => '35.1349971055213' }, { 'runtimes' => { 'system' => '0.37', 'user' => '6.32', 'total' => '6.69' }, 'name' => 'XML-Bare.t', 'percentage' => 108, 'md5sum' => '8fa1e9de18b8da7523ebfe2dac53482a', 'MiB/sec' => '32.4565444113784' }, { 'runtimes' => { 'system' => '0.4', 'user' => '6.55', 'total' => '6.95' }, 'name' => 'expat.t', 'percentage' => 112, 'md5sum' => '8fa1e9de18b8da7523ebfe2dac53482a', 'MiB/sec' => '31.2423427499455' }, { 'runtimes' => { 'system' => '0.83', 'user' => '10.62', 'total' => '11.45' }, 'name' => 'XML-LibXML-Reader.t', 'percentage' => 185, 'md5sum' => '8fa1e9de18b8da7523ebfe2dac53482a', 'MiB/sec' => '18.963692760884' }, { 'runtimes' => { 'system' => '0.42', 'user' => '11.33', 'total' => '11.75' }, 'name' => 'MediaWiki-DumpFile-SimplePages.t', 'percentage' => 190, 'md5sum' => '8fa1e9de18b8da7523ebfe2dac53482a', 'MiB/sec' => '18.4795133712444' }, { 'runtimes' => { 'system' => '0.55', 'user' => '32', 'total' => '32.55' }, 'name' => 'XML-SAX-ExpatXS.t', 'percentage' => 526, 'md5sum' => '8fa1e9de18b8da7523ebfe2dac53482a', 'MiB/sec' => '6.67079207717731' }, { 'runtimes' => { 'system' => '0.26', 
'user' => '41.55', 'total' => '41.81' }, 'name' => 'XML-Parser.t', 'percentage' => 676, 'md5sum' => '8fa1e9de18b8da7523ebfe2dac53482a', 'MiB/sec' => '5.19335762047648' }, { 'runtimes' => { 'system' => '0.46', 'user' => '42.1', 'total' => '42.56' }, 'name' => 'XML-Parser-ExpatNB.t', 'percentage' => 688, 'md5sum' => '8fa1e9de18b8da7523ebfe2dac53482a', 'MiB/sec' => '5.1018393353412' }, { 'runtimes' => { 'system' => '0.53', 'user' => '60.13', 'total' => '60.66' }, 'name' => 'XML-SAX-ExpatXS_nocharjoin.t', 'percentage' => 981, 'md5sum' => '8fa1e9de18b8da7523ebfe2dac53482a', 'MiB/sec' => '3.5795298732628' }, { 'runtimes' => { 'system' => '0.49', 'user' => '65.33', 'total' => '65.82' }, 'name' => 'Parse-MediaWikiDump.t', 'percentage' => 1065, 'md5sum' => '8fa1e9de18b8da7523ebfe2dac53482a', 'MiB/sec' => '3.29891039368158' }, { 'runtimes' => { 'system' => '0.87', 'user' => '66.01', 'total' => '66.88' }, 'name' => 'XML-LibXML-SAX.t', 'percentage' => 1082, 'md5sum' => '8fa1e9de18b8da7523ebfe2dac53482a', 'MiB/sec' => '3.24662503158076' }, { 'runtimes' => { 'system' => '1.32', 'user' => '179.77', 'total' => '181.09' }, 'name' => 'XML-SAX-Expat.t', 'percentage' => 2930, 'md5sum' => '8fa1e9de18b8da7523ebfe2dac53482a', 'MiB/sec' => '1.19904070965885' }, { 'runtimes' => { 'system' => '1.95', 'user' => '201.49', 'total' => '203.44' }, 'name' => 'XML-Twig.t', 'percentage' => 3291, 'md5sum' => '8fa1e9de18b8da7523ebfe2dac53482a', 'MiB/sec' => '1.06731361635923' }, { 'runtimes' => { 'system' => '3.45', 'user' => '577.07', 'total' => '580.52' }, 'name' => 'XML-SAX-PurePerl.t', 'percentage' => 9393, 'md5sum' => '8fa1e9de18b8da7523ebfe2dac53482a', 'MiB/sec' => '0.374034110990356' } ], 'md5sum' => '8fa1e9de18b8da7523ebfe2dac53482a', 'size' => 227681797 } ]; One of the fastest benchmarks: #!/usr/bin/env perl use strict; use warnings; use Data::Dumper; use XML::LibXML; use XML::LibXML::Reader; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); $| = 1; print ''; use Bench; my $reader = XML::LibXML::Reader->new(location => shift(@ARGV)); my $title; while(1) { my $type = $reader->nodeType; if ($type == XML_READER_TYPE_ELEMENT) { if ($reader->name eq 'title') { $title = get_text($reader); } elsif ($reader->name eq 'text') { my $text = get_text($reader); Bench::Article($title, $text); } $reader->nextElement; next; } last unless $reader->read; } sub get_text { my ($r) = @_; my @buffer; my $type; while($r->nodeType != XML_READER_TYPE_TEXT && $r->nodeType != XML_READER_TYPE_END_ELEMENT) { $r->read or die "could not read"; } while($r->nodeType != XML_READER_TYPE_END_ELEMENT) { if ($r->nodeType == XML_READER_TYPE_TEXT) { push(@buffer, $r->value); } $r->read or die "could not read"; } return join('', @buffer); } __END__ TEST DATA You can find various MediaWiki dump files via http://download.wikimedia.org/ I use the following various language Wikipedia dump files for my testing: http://download.wikimedia.org/cvwiki/20091208/cvwiki-20091208-pages-articles.xml.bz2 http://download.wikimedia.org/simplewiki/20091203/simplewiki-20091203-pages-articles.xml.bz2 http://download.wikimedia.org/enwiki/20091103/enwiki-20091103-pages-articles.xml.bz2 TODO * It would be nice if the C based parsers were glued to perl with XS so they invoke the Bench::Article method just like the perl based parsers do. * One common string buffering library between all C based parsers would be nice but I could not get this functional. There is a lot of other code duplication as well. 
* A C implementation of libxml's reader interface would be fun to compare against the perl one. AUTHOR Test suite and initial tests created by Tyler Riddle Please send any patches to me and feel free to add yourself to the contributors list. CONTRIBUTORS * "Sebastian Bober " - Concept behind the XML::Bare implementationMediaWiki-DumpFile-0.2.2/speed_test/test_cases/000755 000765 000024 00000000000 12173343122 021655 5ustar00tylerstaff000000 000000 MediaWiki-DumpFile-0.2.2/speed_test/test_cases/expat.t000644 000765 000024 00000000044 11543133737 023172 0ustar00tylerstaff000000 000000 #!/bin/bash cat "$1" | ./bin/expat MediaWiki-DumpFile-0.2.2/speed_test/test_cases/libxml.t000644 000765 000024 00000000045 11543133737 023341 0ustar00tylerstaff000000 000000 #!/bin/bash cat "$1" | ./bin/libxml MediaWiki-DumpFile-0.2.2/speed_test/test_cases/MediaWiki-DumpFile-Compat.t000644 000765 000024 00000000700 11543133737 026637 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use Bench; use MediaWiki::DumpFile::Compat; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $file = shift(@ARGV) or die "must specify file"; my $fh; open($fh, $file) or die "could not open $file: $!"; my $articles = Parse::MediaWikiDump::Pages->new($fh); my $i; while(defined(my $one = $articles->next)) { Bench::Article($one->title, ${$one->text}); # if (++$i > 1000) { exit 1}; } MediaWiki-DumpFile-0.2.2/speed_test/test_cases/MediaWiki-DumpFile-Compat_fastmode.t000644 000765 000024 00000000731 11543133737 030525 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use Bench; use MediaWiki::DumpFile::Compat; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $file = shift(@ARGV) or die "must specify file"; my $fh; open($fh, $file) or die "could not open $file: $!"; my $articles = Parse::MediaWikiDump::Pages->new(input => $fh, fast_mode => 1); my $i; while(defined(my $one = $articles->next)) { Bench::Article($one->title, ${$one->text}); # if (++$i > 1000) { exit 1}; } MediaWiki-DumpFile-0.2.2/speed_test/test_cases/MediaWiki-DumpFile-FastPages.t000644 000765 000024 00000000516 11543133737 027276 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use Bench; use MediaWiki::DumpFile::FastPages; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $file = shift(@ARGV) or die "must specify file"; my $dump = MediaWiki::DumpFile::FastPages->new($file); while(my ($title, $text) = $dump->next) { Bench::Article($title, $text); }MediaWiki-DumpFile-0.2.2/speed_test/test_cases/MediaWiki-DumpFile-Pages.t000644 000765 000024 00000000664 11543133737 026464 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl BEGIN { unshift(@INC, '/Users/tyler/work/eclipse-workspace/MediaWiki-DumpFile/lib'); } use strict; use warnings; use Bench; use MediaWiki::DumpFile::Pages; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $file = shift(@ARGV) or die "must specify file"; my $dump = MediaWiki::DumpFile::Pages->new($file); while(defined(my $page = $dump->next)) { Bench::Article($page->title, $page->revision->text); }MediaWiki-DumpFile-0.2.2/speed_test/test_cases/MediaWiki-DumpFile-Pages_fastmode.t000644 000765 000024 00000000564 11543133737 030345 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use Bench; use MediaWiki::DumpFile::Pages; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $file = shift(@ARGV) or die "must specify file"; my $dump = MediaWiki::DumpFile::Pages->new(input => $file, fast_mode => 1); while(defined(my $page = 
$dump->next)) { Bench::Article($page->title, $page->revision->text); }MediaWiki-DumpFile-0.2.2/speed_test/test_cases/Parse-MediaWikiDump.t000644 000765 000024 00000000620 11543133737 025612 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use Bench; use Parse::MediaWikiDump; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $file = shift(@ARGV) or die "must specify file"; my $fh; open($fh, $file) or die "could not open $file: $!"; my $articles = Parse::MediaWikiDump::Pages->new($fh); while(defined(my $one = $articles->next)) { Bench::Article($one->title, ${$one->text}); } MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-Bare.t000644 000765 000024 00000001513 11543133737 023362 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl #this idea came from sbob from #perl on Freenode #"Sebastian Bober " #it's non-XML compliant and evil but damn is #it fast #die "not XML compliant"; use strict; use warnings; use XML::Bare; use HTML::Entities; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); use Bench; my $file = shift(@ARGV); my $fh; die "could not open $file: $!" unless open($fh, $file); while(<$fh>) { last unless m/$/; } $/ = "\n"; while(<$fh>) { last if m/<\/mediawiki>/; my $xml = XML::Bare->new(text => $_); my $root = $xml->parse; my $title = $root->{page}->{title}->{value}; my $text = $root->{page}->{revision}->{text}->{value}; if (! defined($title)) { $title = ''; } if (! defined($text)) { $text = ''; } Bench::Article(decode_entities($title), decode_entities($text)); }MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-CompactTree-XS.t000644 000765 000024 00000000526 11543133737 025252 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl #thank you Petr Pajas use strict; use warnings; use Bench; binmode(STDOUT, ':utf8'); use XML::CompactTree::XS; my $reader = XML::LibXML::Reader->new(location => shift(@ARGV)); Bench::CompactTree::run($reader, \&read_tree); sub read_tree { my ($r) = @_; return XML::CompactTree::XS::readSubtreeToPerl($r, 0); }MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-CompactTree.t000644 000765 000024 00000000570 11543133737 024721 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl #thank you Petr Pajas use strict; use warnings; use Bench; binmode(STDOUT, ':utf8'); use XML::LibXML::Reader; use XML::CompactTree; my $reader = XML::LibXML::Reader->new(location => shift(@ARGV)); Bench::CompactTree::run($reader, \&read_tree); sub read_tree { my ($r) = @_; return XML::CompactTree::readSubtreeToPerl($r, XCT_DOCUMENT_ROOT); }MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-LibXML-Reader.t000644 000765 000024 00000002064 11543133737 025002 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use XML::LibXML; use XML::LibXML::Reader; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); $| = 1; print ''; use Bench; my $reader = XML::LibXML::Reader->new(location => shift(@ARGV)); my $title; while(1) { my $type = $reader->nodeType; if ($type == XML_READER_TYPE_ELEMENT) { if ($reader->name eq 'title') { $title = get_text($reader); last unless $reader->nextElement('text') == 1; next; } elsif ($reader->name eq 'text') { my $text = get_text($reader); Bench::Article($title, $text); last unless $reader->nextElement('title') == 1; next; } } last unless $reader->nextElement == 1; } sub get_text { my ($r) = @_; my @buffer; my $type; while($r->nodeType != XML_READER_TYPE_TEXT && $r->nodeType != XML_READER_TYPE_END_ELEMENT) { $r->read or die "could not read"; } while($r->nodeType != XML_READER_TYPE_END_ELEMENT) { if ($r->nodeType 
== XML_READER_TYPE_TEXT) { push(@buffer, $r->value); } $r->read or die "could not read"; } return join('', @buffer); }MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-LibXML-SAX-ChunkParser.t000644 000765 000024 00000001142 11543133737 026452 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use Bench; use XML::LibXML::SAX::ChunkParser; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $handler = Bench::SAXHandler->new(); my $parser = XML::LibXML::SAX::ChunkParser->new( Handler => $handler ); my $file = shift(@ARGV); my $fh; die "could not open $file: $!" unless open($fh, $file); while(1) { my ($buf, $ret); $ret = read($fh, $buf, 32768); if (! defined($ret)) { die "could not read: $!"; } elsif ($ret == 0) { #doesn't work unless this is commented out #$parser->finish; last; } else { $parser->parse_chunk($buf); } } MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-LibXML-SAX.t000644 000765 000024 00000000421 11543133737 024226 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use Bench; use XML::LibXML::SAX; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $handler = Bench::SAXHandler->new(); my $parser = XML::LibXML::SAX->new( Handler => $handler ); $parser->parse_file(shift(@ARGV)); MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-LibXML-SAX_charjoin.t000644 000765 000024 00000000721 11543133737 026106 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl #this is for an experimental extension to XML::LibXML::SAX - see #https://rt.cpan.org/Ticket/Display.html?id=52368 use strict; use warnings; use Bench; use XML::LibXML::SAX; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $handler = Bench::SAXHandler->new(); my $parser = XML::LibXML::SAX->new( Handler => $handler ); $parser->set_feature('http://xmlns.perl.org/sax/join-character-data', 1); $parser->parse_file(shift(@ARGV)); MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-Parser-Expat.t000644 000765 000024 00000001015 11543133737 025021 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use XML::Parser; use Bench; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $self = Bench::ExpatHandler->new; my $parser = XML::Parser::Expat->new; $parser->setHandlers( Start => sub { $self->starthandler(@_) }, End => sub { $self->endhandler(@_) }, Char => sub { $self->char_handler(@_) }, ); if(scalar(@ARGV)) { my $file = shift(@ARGV); die "could not open $file: $!" unless open(FILE, $file); $parser->parse(\*FILE); } else { $parser->parse(\*STDIN); } MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-Parser-ExpatNB.t000644 000765 000024 00000001313 11543133737 025242 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use XML::Parser; use Bench; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $self = Bench::ExpatHandler->new; my $fh; my $parser = XML::Parser->new(Handlers => { Start => sub { $self->starthandler(@_) }, End => sub { $self->endhandler(@_) }, Char => sub { $self->char_handler(@_) }, }); if(scalar(@ARGV)) { my ($file) = @ARGV; die "could not open $file: $!" unless open($fh, $file); } else { $fh = \*STDIN; } my $nb = $parser->parse_start; while(1) { my ($buf, $ret); $ret = read($fh, $buf, 32768); if (! 
defined($ret)) { die "could not read: $!"; } elsif ($ret == 0) { $nb->parse_done; last; } else { $nb->parse_more($buf); } } MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-Parser.t000644 000765 000024 00000000661 11543133737 023750 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use XML::Parser; use Bench; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $self = Bench::ExpatHandler->new; my $parser = XML::Parser->new(Handlers => { Start => sub { $self->starthandler(@_) }, End => sub { $self->endhandler(@_) }, Char => sub { $self->char_handler(@_) }, }); if(scalar(@ARGV)) { $parser->parsefile(shift(@ARGV)); } else { $parser->parse(\*STDIN); }MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-Parser_string_append.t000644 000765 000024 00000002535 11543133737 026667 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use XML::Parser; use Bench; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $self = ExpatHandler->new; my $parser = XML::Parser->new(Handlers => { Start => sub { $self->starthandler(@_) }, End => sub { $self->endhandler(@_) }, Char => sub { $self->char_handler(@_) }, }); if(scalar(@ARGV)) { $parser->parsefile(shift(@ARGV)); } else { $parser->parse(\*STDIN); } package ExpatHandler; sub new { return(bless({ }, $_[0])); } sub starthandler { my ($self, $e, $element) = @_; if ($element eq 'title') { $self->char_on; $self->{in_title} = 1; } elsif ($element eq 'text') { $self->char_on; $self->{in_text} = 1; } } sub endhandler { my ($self, $e, $element) = @_; if ($self->{in_text}) { $self->{text} = $self->get_chars; $self->char_off; $self->{in_text} = 0; } elsif ($self->{in_title}) { $self->{title} = $self->get_chars; $self->char_off; $self->{in_title} = 0; } elsif ($element eq 'revision') { Bench::Article($self->{title}, $self->{text}); } } sub char_handler { my ($self, $e, $chars) = @_; if ($self->{char}) { $self->{a} .= $chars; } } sub char_on { my ($self) = @_; $self->{a} = undef; $self->{char} = 1; } sub char_off { my ($self) = @_; $self->{a} = undef; $self->{char} = 0; } sub get_chars { my ($self) = @_; return $self->{a}; } MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-Records.t000644 000765 000024 00000000556 11543133737 024120 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); use Bench; use XML::Records; my $p = XML::Records->new(shift(@ARGV)) or die "$!"; $p->set_records('page'); while(defined(my $page = $p->get_record)) { my $title = $page->{title}; my $text = $page->{revision}->{text}->[0]; Bench::Article($title, $text); } MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-Rules.t000644 000765 000024 00000000554 11543133737 023607 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use XML::Rules; binmode(STDOUT, ':utf8'); use Bench; my ($title, $text); my $p = XML::Rules->new(rules => [ _default => undef, title => sub { $title = $_[1]->{_content} }, text => sub { Bench::Article($title, $_[1]->{_content}) }, ]); die "could not open" unless open(FILE, shift(@ARGV)); $p->parse(\*FILE);MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-SAX-Expat.t000644 000765 000024 00000000420 11543133737 024217 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use Bench; use XML::SAX::Expat; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $handler = Bench::SAXHandler->new(); my $parser = XML::SAX::Expat->new( Handler => $handler ); $parser->parse_file(shift(@ARGV)); 
MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-SAX-ExpatXS.t000644 000765 000024 00000000456 11543133737 024503 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use Bench; use XML::SAX::ExpatXS; binmode(STDOUT, ':utf8'); binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $handler = Bench::SAXHandler->new(); my $parser = XML::SAX::ExpatXS->new( Handler => $handler ); $parser->parse_file(shift(@ARGV)); MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-SAX-ExpatXS_nocharjoin.t000644 000765 000024 00000000541 11543133737 026710 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use Bench; use XML::SAX::ExpatXS; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $handler = Bench::SAXHandler->new(); my $parser = XML::SAX::ExpatXS->new( Handler => $handler ); $parser->set_feature('http://xmlns.perl.org/sax/join-character-data' => 0); $parser->parse_file(shift(@ARGV)); MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-SAX-PurePerl.t000644 000765 000024 00000000460 11543133737 024700 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use Bench; use XML::SAX::PurePerl; binmode(STDOUT, ':utf8'); binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $handler = Bench::SAXHandler->new(); my $parser = XML::SAX::PurePerl->new( Handler => $handler ); $parser->parse_file(shift(@ARGV)); MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-TreePuller_config.t000644 000765 000024 00000000742 11543133737 026124 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use XML::TreePuller; use Bench; binmode(STDOUT, ':utf8'); my $xml = XML::TreePuller->new(location => shift(@ARGV)); $xml->config('/mediawiki/page/title', 'subtree'); $xml->config('/mediawiki/page/revision/text', 'subtree'); my $title; while(my ($path, $e) = $xml->next) { if ($path eq '/mediawiki/page/title') { $title = $e->text; } elsif ($path eq '/mediawiki/page/revision/text') { Bench::Article($title, $e->text); } }MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-TreePuller_element.t000644 000765 000024 00000000526 11543133737 026310 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use XML::TreePuller; use Bench; binmode(STDOUT, ':utf8'); my $xml = XML::TreePuller->new(location => shift(@ARGV)); $xml->config('/mediawiki/page', 'subtree'); while(defined(my $e = $xml->next)) { Bench::Article($e->get_elements('title')->text, $e->get_elements('revision/text')->text); }MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-TreePuller_xpath.t000644 000765 000024 00000000563 11543133737 026004 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; use XML::TreePuller; use Bench; binmode(STDOUT, ':utf8'); my $xml = XML::TreePuller->new(location => shift(@ARGV)); $xml->config('/mediawiki/page', 'subtree'); while(defined(my $e = $xml->next)) { my $t = $e->xpath('/page'); Bench::Article($e->xpath('/page/title')->text, $e->xpath('/page/revision/text')->text); }MediaWiki-DumpFile-0.2.2/speed_test/test_cases/XML-Twig.t000644 000765 000024 00000000750 11543133737 023425 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl use strict; use warnings; binmode(STDOUT, ':utf8'); use XML::Twig; use Bench; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); my $twig = XML::Twig->new( twig_handlers => { page => \&page_handler, } ); $twig->parsefile(shift(@ARGV)); sub page_handler { my ($twig, $page) = @_; my $title = $page->first_child('title')->text; my $text = 
$page->first_child('revision')->first_child('text')->text; Bench::Article($title, $text); $twig->purge; } MediaWiki-DumpFile-0.2.2/speed_test/bin/expat.c000644 000765 000024 00000010605 11543133737 021570 0ustar00tylerstaff000000 000000 #include #include #include #include struct stringbuffer { struct stringbuffer *next; struct stringbuffer *cur; char *data; int len; int depth; }; struct parseinfo { int in_title; int in_text; int do_char; XML_Parser parser; struct stringbuffer *buffer; }; void * safe_malloc(size_t); struct stringbuffer * stringbuffer_new(void); void char_on(struct parseinfo *info); void char_off(struct parseinfo *info); /* parseinfo functions */ struct parseinfo * parseinfo_new(void) { struct parseinfo *new = safe_malloc(sizeof(struct parseinfo)); new->in_title = 0; new->in_text = 0; new->do_char = 0; new->buffer = NULL; new->parser = NULL; return new; } /* end parseinfo functions */ /* string buffering functions */ struct stringbuffer * stringbuffer_new(void) { struct stringbuffer *new = safe_malloc(sizeof(struct stringbuffer)); new->cur = new; new->next = NULL; new->data = NULL; new->len = 0; new->depth = 0; return new; } void stringbuffer_free(struct stringbuffer *buffer) { struct stringbuffer *p1 = buffer; struct stringbuffer *p2; while(p1) { p2 = p1->next; if (p1->data) { free(p1->data); } free(p1); p1 = p2; } return; } int stringbuffer_length(struct stringbuffer *buffer) { int length = 0; struct stringbuffer *cur; for(cur = buffer; cur; cur = cur->next) { length += cur->len; } return length; } void stringbuffer_append(struct stringbuffer *buffer, const XML_Char *newstring, int len) { char *copy = safe_malloc(len); memcpy(copy, newstring, len); buffer->cur->data = copy; buffer->cur->len = len; buffer->cur->next = stringbuffer_new(); buffer->cur = buffer->cur->next; buffer->depth++; } char * stringbuffer_string(struct stringbuffer *buffer) { int length = stringbuffer_length(buffer); char *new = malloc(length + 1); char *p = new; int copied = 0; struct stringbuffer *cur; for(cur = buffer;cur;cur = cur->next) { if (! 
cur->data) { continue; } if ((copied = copied + cur->len) > length) { fprintf(stderr, "string overflow\n"); abort(); } strncpy(p, cur->data, cur->len); p += cur->len; } new[length] = '\0'; // fprintf(stderr, "append depth: %i\n", buffer->depth); return new; } /* end string buffering functions */ /* expat handlers */ void charh(void *user, const XML_Char *s, int len) { struct parseinfo *info = (struct parseinfo *)user; // if (info->do_char) { // stringbuffer_append(info->buffer, s, len); // } } void starth(void *user, const char *el, const char **attr) { struct parseinfo *info = (struct parseinfo *)user; // if (strcmp(el, "title") == 0) { // info->in_title = 1; // char_on(info); // } else if (strcmp(el, "text") == 0) { // info->in_text = 1; // char_on(info); // } } void endh(void *user, const char *el) { // struct parseinfo *info = (struct parseinfo *)user; // // if (info->in_text && strcmp(el, "text") == 0) { // char *string = stringbuffer_string(info->buffer); // info->in_text = 0; // // printf("%s\n", string); // // free(string); // // char_off(info); // } else if (info->in_title && strcmp(el, "title") == 0) { // char *string = stringbuffer_string(info->buffer); // // info->in_title = 0; // // printf("Title: %s\n", string); // fprintf(stderr, "Title: %s\n", string); // // free(string); // // char_off(info); // } } /* end of expat handlers */ int main(int argc, char **argv) { XML_Parser p = XML_ParserCreate(NULL); char *buf = safe_malloc(BUFSIZ); struct parseinfo *info = parseinfo_new(); XML_SetElementHandler(p, starth, endh); XML_SetCharacterDataHandler(p, charh); XML_SetUserData(p, info); info->parser = p; for (;;) { int done; int len; len = fread(buf, 1, BUFSIZ, stdin); if (ferror(stdin)) { fprintf(stderr, "Read error\n"); exit(-1); } done = feof(stdin); if (! XML_Parse(p, buf, len, done)) { fprintf(stderr, "Parse error at line %d:\n%s\n", (int)XML_GetCurrentLineNumber(p), XML_ErrorString(XML_GetErrorCode(p))); exit(-1); } if (done) break; } exit(0); } void * safe_malloc(size_t size) { void *new = malloc(size); if (! 
new) { fprintf(stderr, "could not malloc\n"); exit(1); } return new; } void char_on(struct parseinfo *info) { info->do_char = 1; info->buffer = stringbuffer_new(); } void char_off(struct parseinfo *info) { info->do_char = 0; stringbuffer_free(info->buffer); info->buffer = NULL; } MediaWiki-DumpFile-0.2.2/speed_test/bin/iksemel.c000644 000765 000024 00000011372 11543133737 022102 0ustar00tylerstaff000000 000000 /* Iksemel does not seem to support xpath */ #include #include #include #include #include struct stringbuffer { struct stringbuffer *next; struct stringbuffer *cur; char *data; int len; int depth; }; struct parseinfo { int in_title; int in_text; int do_char; iksparser *parser; struct stringbuffer *buffer; }; void * safe_malloc(size_t); struct stringbuffer * stringbuffer_new(void); void char_on(struct parseinfo *info); void char_off(struct parseinfo *info); /* parseinfo functions */ struct parseinfo * parseinfo_new(void) { struct parseinfo *new = safe_malloc(sizeof(struct parseinfo)); new->in_title = 0; new->in_text = 0; new->do_char = 0; new->buffer = NULL; new->parser = NULL; return new; } /* end parseinfo functions */ /* string buffering functions */ struct stringbuffer * stringbuffer_new(void) { struct stringbuffer *new = safe_malloc(sizeof(struct stringbuffer)); new->cur = new; new->next = NULL; new->data = NULL; new->len = 0; new->depth = 0; return new; } void stringbuffer_free(struct stringbuffer *buffer) { struct stringbuffer *p1 = buffer; struct stringbuffer *p2; while(p1) { p2 = p1->next; if (p1->data) { free(p1->data); } free(p1); p1 = p2; } return; } int stringbuffer_length(struct stringbuffer *buffer) { int length = 0; struct stringbuffer *cur; for(cur = buffer; cur; cur = cur->next) { length += cur->len; } return length; } void stringbuffer_append(struct stringbuffer *buffer, char *newstring, int len) { char *copy = safe_malloc(len); strncpy(copy, newstring, len); buffer->cur->data = copy; buffer->cur->len = len; buffer->cur->next = stringbuffer_new(); buffer->cur = buffer->cur->next; buffer->depth++; } char * stringbuffer_string(struct stringbuffer *buffer) { int length = stringbuffer_length(buffer); char *new = safe_malloc(length + 1); char *p = new; int copied = 0; struct stringbuffer *cur; for(cur = buffer;cur;cur = cur->next) { if (! 
cur->data) { continue; } if ((copied = copied + cur->len) > length) { fprintf(stderr, "string overflow\n"); abort(); } strncpy(p, cur->data, cur->len); p += cur->len; } new[length] = '\0'; // fprintf(stderr, "append depth: %i\n", buffer->depth); return new; } /* end string buffering functions */ /* SAX handlers */ int charh(void *p, char *s, size_t len) { struct parseinfo *info = (struct parseinfo *)p; if (info->do_char) { stringbuffer_append(info->buffer, (char *)s, (int)len); } return IKS_OK; } void starth(struct parseinfo *info, char *el, char **attr) { if (strcmp(el, "title") == 0) { info->in_title = 1; char_on(info); } else if (strcmp(el, "text") == 0) { info->in_text = 1; char_on(info); } } void endh(struct parseinfo *info, char *el) { if (info->in_text && strcmp(el, "text") == 0) { char *string = stringbuffer_string(info->buffer); info->in_text = 0; printf("%s\n", string); free(string); char_off(info); } else if (info->in_title && strcmp(el, "title") == 0) { char *string = stringbuffer_string(info->buffer); info->in_title = 0; printf("Title: %s\n", string); // fprintf(stderr, "Title: %s\n", string); free(string); char_off(info); } } /* end of expat handlers */ int ikshandler(void *p, char *name, char **attributes, int type) { struct parseinfo *userdata = (struct parseinfo *)p; if (type == IKS_OPEN) { starth(userdata, name, attributes); } else if (type == IKS_CLOSE) { endh(userdata, name); } else if (type == IKS_SINGLE) { starth(userdata, name, attributes); endh(userdata, name); } else { fprintf(stderr, "invalid iks type: %i\n", type); exit(1); } return IKS_OK; } iksparser * new_iksemel(struct parseinfo *parseinfo) { iksparser *p = iks_sax_new(parseinfo, ikshandler, charh); return p; } int main(int argc, char **argv) { struct parseinfo *info = parseinfo_new(); iksparser *p = new_iksemel(info); char *buf = safe_malloc(BUFSIZ); int result; for (;;) { int done; int len; len = fread(buf, 1, BUFSIZ, stdin); if (ferror(stdin)) { fprintf(stderr, "Read error\n"); exit(-1); } done = feof(stdin); if ((result = iks_parse(p, buf, len, done)) != IKS_OK) { fprintf(stderr, "IKS_OK:%i IKS_NOMEM:%i IKS_BADXML:%i IKS_HOOK:%i - error number: %i\n", IKS_OK, IKS_NOMEM, IKS_BADXML, IKS_HOOK, result); exit(1); } if (done) break; } exit(0); } void * safe_malloc(size_t size) { void *new = malloc(size); if (! 
new) { fprintf(stderr, "could not malloc\n"); exit(1); } return new; } void char_on(struct parseinfo *info) { info->do_char = 1; info->buffer = stringbuffer_new(); } void char_off(struct parseinfo *info) { info->do_char = 0; stringbuffer_free(info->buffer); info->buffer = NULL; } MediaWiki-DumpFile-0.2.2/speed_test/bin/libxml.c000644 000765 000024 00000011026 11543133737 021734 0ustar00tylerstaff000000 000000 #include #include #include #include #include struct stringbuffer { struct stringbuffer *next; struct stringbuffer *cur; char *data; int len; int depth; }; struct parseinfo { int in_title; int in_text; int do_char; xmlParserCtxtPtr parser; struct stringbuffer *buffer; }; void * safe_malloc(size_t); struct stringbuffer * stringbuffer_new(void); void char_on(struct parseinfo *info); void char_off(struct parseinfo *info); /* parseinfo functions */ struct parseinfo * parseinfo_new(void) { struct parseinfo *new = safe_malloc(sizeof(struct parseinfo)); new->in_title = 0; new->in_text = 0; new->do_char = 0; new->buffer = NULL; new->parser = NULL; return new; } /* end parseinfo functions */ /* string buffering functions */ struct stringbuffer * stringbuffer_new(void) { struct stringbuffer *new = safe_malloc(sizeof(struct stringbuffer)); new->cur = new; new->next = NULL; new->data = NULL; new->len = 0; new->depth = 0; return new; } void stringbuffer_free(struct stringbuffer *buffer) { struct stringbuffer *p1 = buffer; struct stringbuffer *p2; while(p1) { p2 = p1->next; if (p1->data) { free(p1->data); } free(p1); p1 = p2; } return; } int stringbuffer_length(struct stringbuffer *buffer) { int length = 0; struct stringbuffer *cur; for(cur = buffer; cur; cur = cur->next) { length += cur->len; } return length; } void stringbuffer_append(struct stringbuffer *buffer, char *newstring, int len) { char *copy = safe_malloc(len); strncpy(copy, newstring, len); buffer->cur->data = copy; buffer->cur->len = len; buffer->cur->next = stringbuffer_new(); buffer->cur = buffer->cur->next; buffer->depth++; } char * stringbuffer_string(struct stringbuffer *buffer) { int length = stringbuffer_length(buffer); char *new = safe_malloc(length + 1); char *p = new; int copied = 0; struct stringbuffer *cur; for(cur = buffer;cur;cur = cur->next) { if (! 
cur->data) { continue; } if ((copied = copied + cur->len) > length) { fprintf(stderr, "string overflow\n"); abort(); } strncpy(p, cur->data, cur->len); p += cur->len; } new[length] = '\0'; // fprintf(stderr, "append depth: %i\n", buffer->depth); return new; } /* end string buffering functions */ /* expat handlers */ void charh(void *user, const xmlChar *s, int len) { struct parseinfo *info = (struct parseinfo *)user; if (info->do_char) { stringbuffer_append(info->buffer, (char *)s, len); } } void starth(void *user, const xmlChar *el, const xmlChar **attr) { struct parseinfo *info = (struct parseinfo *)user; if (strcmp(el, "title") == 0) { info->in_title = 1; char_on(info); } else if (strcmp(el, "text") == 0) { info->in_text = 1; char_on(info); } } void endh(void *user, const xmlChar *el) { struct parseinfo *info = (struct parseinfo *)user; if (info->in_text && strcmp(el, "text") == 0) { char *string = stringbuffer_string(info->buffer); info->in_text = 0; printf("%s\n", string); free(string); char_off(info); } else if (info->in_title && strcmp(el, "title") == 0) { char *string = stringbuffer_string(info->buffer); info->in_title = 0; printf("Title: %s\n", string); // fprintf(stderr, "Title: %s\n", string); free(string); char_off(info); } } /* end of expat handlers */ xmlParserCtxtPtr new_libxml(struct parseinfo *parseinfo) { xmlSAXHandler *saxHandler = safe_malloc(sizeof(xmlSAXHandler)); xmlParserCtxtPtr p; LIBXML_TEST_VERSION memset(saxHandler, 0, sizeof(saxHandler)); saxHandler->startElement = starth; saxHandler->endElement = endh; saxHandler->characters = charh; p = xmlCreatePushParserCtxt(saxHandler, parseinfo, NULL, 0, NULL); return p; } int main(int argc, char **argv) { struct parseinfo *info = parseinfo_new(); xmlParserCtxtPtr p = new_libxml(info); char *buf = safe_malloc(BUFSIZ); for (;;) { int done; int len; len = fread(buf, 1, BUFSIZ, stdin); if (ferror(stdin)) { fprintf(stderr, "Read error\n"); exit(-1); } done = feof(stdin); if (xmlParseChunk(p, buf, len, done)) { fprintf(stderr, "XML parse failed\n"); exit(1); } if (done) break; } exit(0); } void * safe_malloc(size_t size) { void *new = malloc(size); if (! 
new) { fprintf(stderr, "could not malloc\n"); exit(1); } return new; } void char_on(struct parseinfo *info) { info->do_char = 1; info->buffer = stringbuffer_new(); } void char_off(struct parseinfo *info) { info->do_char = 0; stringbuffer_free(info->buffer); info->buffer = NULL; } MediaWiki-DumpFile-0.2.2/speed_test/bin/Makefile000644 000765 000024 00000001065 11543133737 021743 0ustar00tylerstaff000000 000000 CC = gcc CCOPTS = LIBS=-L/opt/local/lib INCLUDES=-I/opt/local/include all: expat libxml clean: rm -rf expat libxml iksemel *.o *.dSYM #must be recursive for Mac OS expat: expat.c Makefile $(CC) $(CCOPTS) $(LIBS) $(INCLUDES) -o expat -lexpat expat.c libxml: libxml.c Makefile $(CC) $(CCOPTS) $(LIBS) $(INCLUDES) `xml-config --cflags` `xml-config --libs` -o libxml libxml.c iksemel: iksemel.c Makefile $(CC) $(CCOPTS) $(LIBS) $(INCLUDES) -o iksemel -liksemel iksemel.c naive: naive.c Makefile $(CC) $(CCOPTS) $(LIBS) $(INCLUDES) -m64 -g -o naive naive.cMediaWiki-DumpFile-0.2.2/lib/MediaWiki/000755 000765 000024 00000000000 12173343122 017772 5ustar00tylerstaff000000 000000 MediaWiki-DumpFile-0.2.2/lib/MediaWiki/DumpFile/000755 000765 000024 00000000000 12173343122 021477 5ustar00tylerstaff000000 000000 MediaWiki-DumpFile-0.2.2/lib/MediaWiki/DumpFile.pm000644 000765 000024 00000021063 12173331345 022043 0ustar00tylerstaff000000 000000 package MediaWiki::DumpFile; our $VERSION = '0.2.2'; use warnings; use strict; use Carp qw(croak); sub new { my ($class, %files) = @_; my $self = {}; bless($self, $class); return $self; } sub sql { if (! defined($_[1])) { croak "must specify a filename or open filehandle"; } require MediaWiki::DumpFile::SQL; return MediaWiki::DumpFile::SQL->new($_[1]); } sub pages { my ($class, @args) = @_; require MediaWiki::DumpFile::Pages; return MediaWiki::DumpFile::Pages->new(@args); } sub fastpages { if (! defined($_[1])) { croak "must specify a filename or open filehandle"; } require MediaWiki::DumpFile::FastPages; return MediaWiki::DumpFile::FastPages->new($_[1]); } 1; __END__ =head1 NAME MediaWiki::DumpFile - Process various dump files from a MediaWiki instance =head1 SYNOPSIS use MediaWiki::DumpFile; $mw = MediaWiki::DumpFile->new; $sql = $mw->sql($filename); $sql = $mw->sql(\*FH); $pages = $mw->pages($filename); $pages = $mw->pages(\*FH); $fastpages = $mw->fastpages($filename); $fastpages = $mw->fastpages(\*FH); use MediaWiki::DumpFile::Compat; $pmwd = Parse::MediaWikiDump->new; =head1 ABOUT This module is used to parse various dump files from a MediaWiki instance. The most likely case is that you will want to be parsing content at http://download.wikimedia.org/backup-index.html provided by WikiMedia which includes the English and all other language Wikipedias. This module is the successor to Parse::MediaWikiDump acting as a near full replacement in feature set and providing an independent 100% backwards compatible API that is faster than Parse::MediaWikiDump is (see the MediaWiki::DumpFile::Compat and MediaWiki::DumpFile::Benchmarks documentation for details). =head1 STATUS This software is maturing into a stable and tested state with known users; the API is stable and will not be changed. The software is actively being maintained and improved; please submit bug reports, feature requests, and other feedback to the author using the bug reporting features described below. =head1 FUNCTIONS =head2 sql Return an instance of MediaWiki::DumpFile::SQL. This object can be used to parse any arbitrary SQL dump file used to recreate a single table in the MediaWiki instance. 
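For example, iterating the rows of a pagelinks dump might look like the following sketch; the file name is a placeholder and the pl_from, pl_namespace, and pl_title column names are specific to the pagelinks table, but next() handing back one row at a time as a hash reference keyed by column name is how the compatibility layer in this distribution consumes it:

  use MediaWiki::DumpFile;

  my $mw  = MediaWiki::DumpFile->new;
  my $sql = $mw->sql('pagelinks.sql'); #placeholder file name

  #each row comes back as a hash reference keyed by column name
  while (defined(my $row = $sql->next)) {
      print "$row->{pl_from} -> $row->{pl_title}\n";
  }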
=head2 pages

Return an instance of MediaWiki::DumpFile::Pages. This object parses the contents of the page dump file and supports both single and multiple revisions per article as well as associated metadata. The page can be parsed in either normal or fast mode where fast mode is only capable of parsing the article titles and text contents, with restrictions.

=head2 fastpages

Return an instance of MediaWiki::DumpFile::FastPages. This class is a subclass of MediaWiki::DumpFile::Pages that configures it to use fast mode by default and uses a tuned iterator interface with slightly less overhead.

=head1 SPEED

MediaWiki::DumpFile now runs in a slower configuration when installed without the recommended Perl modules; this was done so that the package can be installed without a C compiler and still have some utility. There is also a fast mode available when parsing the XML document that can give significant speed boosts while giving up support for anything except the article titles and text contents. If you want to decrease the processing overhead of this system, follow this guide:

=over 4

=item Install XML::CompactTree::XS

Having this module on your system will cause XML::TreePuller to use it automatically - this will net you a dramatic speed boost if it is not already installed. This can give you a 3-4 times speed increase when not using fast mode.

=item Use fast mode if possible

Details of fast mode and the restrictions it imposes are in the MediaWiki::DumpFile::Pages documentation. Fast mode is also available in the compatibility library as a new option. Fast mode can give you a further 3-4 times speed increase over parsing with XML::CompactTree::XS installed, but it does not require that module to function; fast mode is nearly the same speed with or without XML::CompactTree::XS installed.

=item Stop using compatibility mode

If you are using the compatibility API you lose performance; the compatibility API is a set of wrappers around the MediaWiki::DumpFile API and, while it is faster than the original Parse::MediaWikiDump::Pages, it is still slower than MediaWiki::DumpFile::Pages by a few percent.

=item Use MediaWiki::DumpFile::FastPages

This is a subclass of MediaWiki::DumpFile::Pages that configures it by default to run in fast mode and uses a tuned iterator that decreases overhead another few percent. This is generally the absolute fastest fully supported and tested way to parse the XML dump files.

=item Start hacking

I've put some considerable effort into finding the fastest ways to parse the XML dump files. Probably the most important part of this research has been an XML benchmarking suite I created specifically for measuring the performance of parsing the Mediawiki page dump files. The benchmark suite is present in the module tarball in the speed_test/ directory. It contains a comprehensive set of test cases to measure the performance of a good number of XML parsers and parsing schemes from CPAN. You can use this suite as a starting point to see how various parsers work and how fast they go; you can also use it to reliably verify the impact of experiments in parsing performance.

The result of my research into XML parsers was to create XML::TreePuller, which is the heart of the XML processing system in MediaWiki::DumpFile::Pages - it's fast, but I'm positive there is room for improvement. Increasing the speed of that module will increase the speed of MediaWiki::DumpFile::Pages as well.
Please consider sharing the results of your hacking with me by opening a ticket in the bug reporting system as documented below.

The following test cases are notable and could be used by anyone who just needs to extract article titles and text:

=over 4

=item XML-Bare

Wow is it fast! And wrong! Just so very wrong... but it does pass the tests *shrug*

=back

=back

=head2 Benchmarks

See MediaWiki::DumpFile::Benchmarks for a comprehensive report on dump file processing speeds.

=head1 AUTHOR

Tyler Riddle, C<< >>

=head1 LIMITATIONS

=over 4

=item English Wikipedia comprehensive dump files not supported

There are two types of Mediawiki dump files sharing one schema: ones with one revision of a page per entry and ones with multiple revisions of a page per entry. This software is designed to parse either case and provide a consistent API; however, it comes with the restriction that an entire entry must fit in memory. The normal English Wikipedia dump file is around 20 gigabytes and each entry easily fits into RAM on most machines. In the case of the comprehensive English Wikipedia dump files the file itself is measured in the terabytes and a single entry can be 20 gigabytes or more. It is technically possible for the original Parse::MediaWikiDump::Revisions (not the compatibility version provided in this module) to parse that dump file; however, Parse::MediaWikiDump runs at a few megabytes per second under the best of conditions.

=back

=head1 BUGS

Please report any bugs or feature requests to C, or through the web interface at L. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

=over 4

=item 56843 ::Pages->current_byte() wraps at 2 gigs+

If you have a large XML file, where the file size is greater than what a signed 32-bit integer can hold, the returned value from this method can go negative.

=back

=head1 SUPPORT

You can find documentation for this module with the perldoc command.

  perldoc MediaWiki::DumpFile

You can also look for information at:

=over 4

=item * RT: CPAN's request tracker L

=item * AnnoCPAN: Annotated CPAN documentation L

=item * CPAN Ratings L

=item * Search CPAN L

=back

=head1 ACKNOWLEDGEMENTS

All of the people who reported bugs or feature requests for Parse::MediaWikiDump.

=head1 COPYRIGHT & LICENSE

Copyright 2009 "Tyler Riddle". This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.

MediaWiki-DumpFile-0.2.2/lib/MediaWiki/DumpFile/Benchmarks.pod000644 000765 000024 00000015223 11543133737 024274 0ustar00tylerstaff000000 000000 =head1 NAME

MediaWiki::DumpFile::Benchmarks - Documentation on parsing speeds

=head1 ENVIRONMENT

The tests were conducted on Debian/squeeze using the vendor-supplied Perl version 5.10.1. All modules were installed from CPAN and not from the Debian package archives. The host hardware is a 4-core Intel i7 at 2.8 GHz with 16 GiB of RAM and disks that can read at 100 MiB/sec.

=head1 SOFTWARE

The benchmark software is included in the MediaWiki::DumpFile distribution tarball in the speed_test/ directory. The tested version of MediaWiki::DumpFile is version 0.2.0. Parse::MediaWikiDump version 1.0.6 is included for comparison against the original implementation.

=head1 RESULTS

All times are expressed in seconds. See the SPEED section of the MediaWiki::DumpFile documentation for an explanation of the various parsing modes.
=head2 With XML::CompactTree::XS =head3 English Wikipedia markup_density: 0.171869387735729 size: 25247483017 tests: - MiB/sec: 35.0121798854383 name: suite//MediaWiki-DumpFile-FastPages.t percentage: 100 runtimes: system: 25.04 total: 687.7 user: 662.66 - MiB/sec: 30.0620222578669 name: suite//MediaWiki-DumpFile-Pages_fastmode.t percentage: 116 runtimes: system: 26.52 total: 800.94 user: 774.42 - MiB/sec: 23.3355715754023 name: suite//MediaWiki-DumpFile-Compat_fastmode.t percentage: 150 runtimes: system: 28.1 total: 1031.81 user: 1003.71 - MiB/sec: 9.00969754502099 name: suite//MediaWiki-DumpFile-Pages.t percentage: 388 runtimes: system: 31.56 total: 2672.44 user: 2640.88 - MiB/sec: 7.83820750529512 name: suite//MediaWiki-DumpFile-Compat.t percentage: 446 runtimes: system: 31.28 total: 3071.86 user: 3040.58 - MiB/sec: 5.19432459350304 name: suite//Parse-MediaWikiDump.t percentage: 674 runtimes: system: 23.22 total: 4635.42 user: 4612.2 =head3 Simple English Wikipedia markup_density: 0.202659609191331 size: 227681797 tests: - MiB/sec: 27.9092907599128 name: suite//MediaWiki-DumpFile-FastPages.t percentage: 100 runtimes: system: 0.17 total: 7.78 user: 7.61 - MiB/sec: 26.19231388566 name: suite//MediaWiki-DumpFile-Pages_fastmode.t percentage: 106 runtimes: system: 0.15 total: 8.29 user: 8.14 - MiB/sec: 19.632394404351 name: suite//MediaWiki-DumpFile-Compat_fastmode.t percentage: 142 runtimes: system: 0.19 total: 11.06 user: 10.87 - MiB/sec: 7.20180040172874 name: suite//MediaWiki-DumpFile-Pages.t percentage: 387 runtimes: system: 0.27 total: 30.15 user: 29.88 - MiB/sec: 6.39382456160546 name: suite//MediaWiki-DumpFile-Compat.t percentage: 436 runtimes: system: 0.26 total: 33.96 user: 33.7 - MiB/sec: 4.14457495919301 name: suite//Parse-MediaWikiDump.t percentage: 673 runtimes: system: 0.12 total: 52.39 user: 52.27 =head3 Chuvash Wikipedia markup_density: 0.18934898819024 size: 39436366 tests: - MiB/sec: 25.7598968401347 name: suite//MediaWiki-DumpFile-FastPages.t percentage: 100 runtimes: system: 0.03 total: 1.46 user: 1.43 - MiB/sec: 22.6562948112028 name: suite//MediaWiki-DumpFile-Pages_fastmode.t percentage: 113 runtimes: system: 0.06 total: 1.66 user: 1.6 - MiB/sec: 17.9949518596156 name: suite//MediaWiki-DumpFile-Compat_fastmode.t percentage: 143 runtimes: system: 0.04 total: 2.09 user: 2.05 - MiB/sec: 7.46219233861045 name: suite//MediaWiki-DumpFile-Pages.t percentage: 345 runtimes: system: 0.04 total: 5.04 user: 5 - MiB/sec: 6.45102047797542 name: suite//MediaWiki-DumpFile-Compat.t percentage: 399 runtimes: system: 0.05 total: 5.83 user: 5.78 - MiB/sec: 4.15574026371234 name: suite//Parse-MediaWikiDump.t percentage: 619 runtimes: system: 0.02 total: 9.05 user: 9.03 =head2 With out XML::CompactTree::XS =head3 Simple English Wikipedia markup_density: 0.202659609191331 size: 227681797 tests: - MiB/sec: 30.0740002925376 name: suite//MediaWiki-DumpFile-FastPages.t percentage: 100 runtimes: system: 0.18 total: 7.22 user: 7.04 - MiB/sec: 24.3697286321124 name: suite//MediaWiki-DumpFile-Pages_fastmode.t percentage: 123 runtimes: system: 0.23 total: 8.91 user: 8.68 - MiB/sec: 20.2550636298621 name: suite//MediaWiki-DumpFile-Compat_fastmode.t percentage: 148 runtimes: system: 0.19 total: 10.72 user: 10.53 - MiB/sec: 4.21456292919491 name: suite//Parse-MediaWikiDump.t percentage: 713 runtimes: system: 0.12 total: 51.52 user: 51.4 - MiB/sec: 4.06770854462573 name: suite//MediaWiki-DumpFile-Pages.t percentage: 739 runtimes: system: 0.24 total: 53.38 user: 53.14 - MiB/sec: 3.79871032386497 name: 
suite//MediaWiki-DumpFile-Compat.t percentage: 791 runtimes: system: 0.28 total: 57.16 user: 56.88 =head3 Chuvash Wikipedia markup_density: 0.18934898819024 size: 39436366 tests: - MiB/sec: 25.5846594466644 name: suite//MediaWiki-DumpFile-FastPages.t percentage: 100 runtimes: system: 0.04 total: 1.47 user: 1.43 - MiB/sec: 20.4399181448895 name: suite//MediaWiki-DumpFile-Pages_fastmode.t percentage: 125 runtimes: system: 0.04 total: 1.84 user: 1.8 - MiB/sec: 14.5210229291879 name: suite//MediaWiki-DumpFile-Compat_fastmode.t percentage: 176 runtimes: system: 0.03 total: 2.59 user: 2.56 - MiB/sec: 4.11481940772393 name: suite//Parse-MediaWikiDump.t percentage: 621 runtimes: system: 0.03 total: 9.14 user: 9.11 - MiB/sec: 3.87726282336048 name: suite//MediaWiki-DumpFile-Pages.t percentage: 659 runtimes: system: 0.03 total: 9.7 user: 9.67 - MiB/sec: 3.53140369827199 name: suite//MediaWiki-DumpFile-Compat.t percentage: 724 runtimes: system: 0.06 total: 10.65 user: 10.59 MediaWiki-DumpFile-0.2.2/lib/MediaWiki/DumpFile/Compat/000755 000765 000024 00000000000 12173343122 022722 5ustar00tylerstaff000000 000000 MediaWiki-DumpFile-0.2.2/lib/MediaWiki/DumpFile/Compat.pm000644 000765 000024 00000023356 11543133737 023302 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl #Parse::MediaWikiDump compatibility package MediaWiki::DumpFile::Compat; our $VERSION = '0.2.0'; package #go away indexer! Parse::MediaWikiDump; use strict; use warnings; sub new { my ($class) = @_; return bless({}, $class); } sub pages { shift(@_); return Parse::MediaWikiDump::Pages->new(@_); } sub revisions { shift(@_); return Parse::MediaWikiDump::Revisions->new(@_); } sub links { shift(@_); return Parse::MediaWikiDump::Links->new(@_); } package #go away indexer! Parse::MediaWikiDump::Links; use strict; use warnings; use MediaWiki::DumpFile::SQL; sub new { my ($class, $source) = @_; my $self = {}; my $sql; $Carp::CarpLevel++; $sql = MediaWiki::DumpFile::SQL->new($source); $Carp::CarpLevel--; if (! defined($sql)) { die "could not create SQL parser"; } $self->{sql} = $sql; return bless($self, $class); } sub next { my ($self) = @_; my $next = $self->{sql}->next; unless(defined($next)) { return undef; } return Parse::MediaWikiDump::link->new($next); } package #go away indexer! Parse::MediaWikiDump::link; use strict; use warnings; use Data::Dumper; sub new { my ($class, $self) = @_; bless($self, $class); } sub from { return $_[0]->{pl_from}; } sub namespace { return $_[0]->{pl_namespace}; } sub to { return $_[0]->{pl_title}; } package #go away indexer! 
Parse::MediaWikiDump::Revisions; use strict; use warnings; use Data::Dumper; use MediaWiki::DumpFile::Pages; sub new { my ($class, @args) = @_; my $self = { queue => [] }; my $mediawiki; $Carp::CarpLevel++; $mediawiki = MediaWiki::DumpFile::Pages->new(@args); $Carp::CarpLevel--; $self->{mediawiki} = $mediawiki; return bless($self, $class); } sub version { return $_[0]->{mediawiki}->version; } sub sitename { return $_[0]->{mediawiki}->sitename; } sub base { return $_[0]->{mediawiki}->base; } sub generator { return $_[0]->{mediawiki}->generator; } sub case { return $_[0]->{mediawiki}->case; } sub namespaces { my $cache = $_[0]->{cache}->{namespaces}; if(defined($cache)) { return $cache; } my %namespaces = $_[0]->{mediawiki}->namespaces; my @temp; while(my ($key, $val) = each(%namespaces)) { push(@temp, [$key, $val]); } @temp = sort({$a->[0] <=> $b->[0]} @temp); $_[0]->{cache}->{namespaces} = \@temp; return \@temp; } sub namespaces_names { my ($self) = @_; my @result; return $self->{cache}->{namespaces_names} if defined $self->{cache}->{namespaces_names}; foreach (@{ $_[0]->namespaces }) { push(@result, $_->[1]); } $self->{cache}->{namespaces_names} = \@result; return \@result; } sub current_byte { return $_[0]->{mediawiki}->current_byte; } sub size { return $_[0]->{mediawiki}->size; } sub get_category_anchor { my ($self) = @_; my $namespaces = $self->namespaces; my $cache = $self->{cache}; my $ret = undef; if (defined($cache->{category_anchor})) { return $cache->{category_anchor}; } foreach (@$namespaces) { my ($id, $name) = @$_; if ($id == 14) { $ret = $name; } } $self->{cache}->{category_anchor} = $ret; return $ret; } sub next { my $self = $_[0]; my $queue = $_[0]->{queue}; my $next = shift(@$queue); my @results; return $next if defined $next; $next = $self->{mediawiki}->next; return undef unless defined $next; foreach ($next->revision) { push(@$queue, Parse::MediaWikiDump::page->new($next, $self->namespaces, $self->get_category_anchor, $_)); } return shift(@$queue); } package #go away indexer! Parse::MediaWikiDump::Pages; use strict; use warnings; our @ISA = qw(Parse::MediaWikiDump::Revisions); sub next { my $self = $_[0]; my $next = $self->{mediawiki}->next; my $revision_count; return undef unless defined $next; $revision_count = scalar(@{[$next->revision]}); #^^^^^ because scalar($next->revision) doesn't work if ($revision_count > 1) { die "only one revision per page is allowed\n"; } return Parse::MediaWikiDump::page->new($next, $self->namespaces, $self->get_category_anchor); } package #go away indexer! 
Parse::MediaWikiDump::page; use strict; use warnings; our %REGEX_CACHE_CATEGORIES; sub new { my ($class, $page, $namespaces, $category_anchor, $revision) = @_; my $self = {page => $page, namespaces => $namespaces, category_anchor => $category_anchor}; $self->{revision} = $revision; return bless($self, $class); } sub _revision { if (defined($_[0]->{revision})) { return $_[0]->{revision}}; return $_[0]->{page}->revision; } sub text { my $text = $_[0]->_revision->text; return \$text; } sub title { return $_[0]->{page}->title; } sub id { return $_[0]->{page}->id; } sub revision_id { return $_[0]->_revision->id; } sub username { return $_[0]->_revision->contributor->username; } sub userid { return $_[0]->_revision->contributor->id; } sub userip { return $_[0]->_revision->contributor->ip; } sub timestamp { return $_[0]->_revision->timestamp; } sub minor { return $_[0]->_revision->minor; } sub namespace { my ($self) = @_; my $title = $self->title; my $namespace = ''; if (defined($self->{cache}->{namespace})) { return $self->{cache}->{namespace}; } if ($title =~ m/^([^:]+):(.*)/o) { foreach (@{ $self->{namespaces} } ) { my ($num, $name) = @$_; if ($1 eq $name) { $namespace = $1; last; } } } $self->{cache}->{namespace} = $namespace; return $namespace; } sub redirect { my ($self) = @_; my $text = $self->text; my $ret; return $self->{cache}->{redirect} if defined $self->{cache}->{redirect}; if ($$text =~ m/^#redirect\s*:?\s*\[\[([^\]]*)\]\]/io) { $ret = $1; } else { $ret = undef; } $self->{cache}->{redirect} = $ret; return $ret; } sub categories { my ($self) = @_; my $anchor = $$self{category_anchor}; my $text = $self->text; my @cats; my $ret; return $self->{cache}->{categories} if defined $self->{cache}->{categories}; if (! defined($REGEX_CACHE_CATEGORIES{$anchor})) { $REGEX_CACHE_CATEGORIES{$anchor} = qr/\[\[$anchor:\s*([^\]]+)\]\]/i; } while($$text =~ /$REGEX_CACHE_CATEGORIES{$anchor}/g) { my $buf = $1; #deal with the pipe trick $buf =~ s/\|.*$//; push(@cats, $buf); } if (scalar(@cats) == 0) { $ret = undef; } else { $ret = \@cats; } return $ret; } 1; __END__ =head1 NAME MediaWiki::DumpFile::Compat - Compatibility with Parse::MediaWikiDump =head1 SYNOPSIS use MediaWiki::DumpFile::Compat; $pmwd = Parse::MediaWikiDump->new; $pages = $pmwd->pages('pages-articles.xml'); $revisions = $pmwd->revisions('pages-articles.xml'); $links = $pmwd->links('links.sql'); =head1 ABOUT This software suite provides the tools needed to process the contents of the XML page dump files and the SQL based links dump file from a Mediawiki instance. This is a compatibility layer between MediaWiki::Dumpfile and Parse::MediaWikiDump; instead of "use Parse::MediaWikiDump;" you "use MediaWiki::DumpFile::Compat;". The benefit of using the new compatibility module is an increased processing speed - see the MediaWiki::DumpFile::Benchmarks documentation for benchmark results. =head1 MORE DOCUMENTATION The original Parse::MediaWikiDump documentation is also available in this package; it has been updated to include new features introduced by MediaWiki::DumpFile. You can find the documentation in the following locations: =over 4 =item MediaWiki::DumpFile::Compat::Pages =item MediaWiki::DumpFile::Compat::Revisions =item MediaWiki::DumpFile::Compat::page =item MediaWiki::DumpFile::Compat::Links =item MediaWiki::DumpFile::Compat::link =back =head1 USAGE This module is a factory class that allows you to create instances of the individual parser objects. 
=over 4 =item $pmwd->pages Returns a Parse::MediaWikiDump::Pages object capable of parsing an article XML dump file with one revision per each article. =item $pmwd->revisions Returns a Parse::MediaWikiDump::Revisions object capable of parsing an article XML dump file with multiple revisions per each article. =item $pmwd->links Returns a Parse::MediaWikiDump::Links object capable of parsing an article links SQL dump file. =back =head2 General All parser creation invocations require a location of source data to parse; this argument can be either a filename or a reference to an already open filehandle. This entire software suite will die() upon errors in the file or if internal inconsistencies have been detected. If this concerns you then you can wrap the portion of your code that uses these calls with eval(). =head1 COMPATIBILITY Any deviation of the behavior of MediaWiki::DumpFile::Compat from Parse::MediaWikiDump that is not listed below is a bug. Please report it so that this package can act as a near perfect standin for the original. Compatibility is verified by using the existing Parse::MediaWikiDump test suite with the following adjustments: =head2 Parse::MediaWikiDump::Pages =over 4 =item Parse::MediaWikiDump did not need to load all revisions of an article into memory when processing dump files that contain more than one revision but this compatibility module does. The API does not change but the memory requirements for parsing those dump files certainly do. It is, however, highly unlikely that you will notice this as most of the documents with many revisions per article are so large that Parse::MediaWikiDump would not have been able to parse them in any reasonable timeframe. =item The order of the results from namespaces() is now sorted by the namespace ID instead of being in document order =back =head2 Parse::MediaWikiDump::Links =over 4 =item Order of values from next() is now in identical order as SQL file. =back =head1 BUGS =over 4 =item The value of current_byte() wraps at around 2 gigabytes of input XML; see http://rt.cpan.org/Public/Bug/Display.html?id=56843 =back =head1 LIMITATIONS =over 4 =item This compatibility layer is not yet well tested. =backMediaWiki-DumpFile-0.2.2/lib/MediaWiki/DumpFile/FastPages.pm000644 000765 000024 00000005261 11543133737 023727 0ustar00tylerstaff000000 000000 #!/usr/bin/env perl package MediaWiki::DumpFile::FastPages; our $VERSION = '0.2.0'; use base qw(MediaWiki::DumpFile::Pages); use strict; use warnings; use Data::Dumper; sub new { my ($class, $input) = @_; use Carp qw(croak); my $self; if (! defined($input)) { croak "you must provide either a filename or an already open file handle"; } $self = $class->SUPER::new(input => $input, fast_mode => 1); bless($self, $class); return $self; } sub next { my ($self) = @_; return $self->_fast_next; } 1; __END__ =head1 NAME MediaWiki::DumpFile::FastPages - Fastest way to parse a page dump file =head1 SYNOPSIS use MediaWiki::DumpFile::FastPages; $pages = MediaWiki::DumpFile::FastPages->new($file); $pages = MediaWiki::DumpFile::FastPages->new(\*FH); while(($title, $text) = $pages->next) { print "Title: $title\n"; print "Text: $text\n"; } =head1 ABOUT This is a subclass of MediaWiki::DumpFile::Pages that configures it to run in fast mode and uses a custom iterator that dispenses with the duck-typed MediaWiki::DumpFile::Pages::Page object that fast mode uses giving a slight processing speed boost. See the MediaWiki::DumpFile::Pages documentation for information about fast mode. 
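As a sketch of the relationship between the two classes (the dump file name below is only a placeholder), compare this iterator with the parent class running in fast mode:

  use MediaWiki::DumpFile::FastPages;
  use MediaWiki::DumpFile::Pages;

  #tuned iterator: next() returns a (title, text) list directly
  my $fast = MediaWiki::DumpFile::FastPages->new('pages.xml');

  while (my ($title, $text) = $fast->next) {
      print "Title: $title\n";
  }

  #parent class in fast mode: next() returns a restricted page object
  #that supports only the title, text, and revision methods
  my $pages = MediaWiki::DumpFile::Pages->new(input => 'pages.xml', fast_mode => 1);

  while (defined(my $page = $pages->next)) {
      print 'Title: ', $page->title, "\n";
  }

Both forms give up everything except the title and the text of the first revision of each page.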
=head1 METHODS All of the methods of MediaWiki::DumpFile::Pages are also available on this subclass. =head2 new This is the constructor for this package. It is called with a single parameter: the location of a MediaWiki pages dump file or a reference to an already open file handle. =head2 next Returns a two element list where the first element is the article title and the second element is the article text. Returns an empty list when there are no more pages available. =head1 AUTHOR Tyler Riddle, C<< >> =head1 BUGS Please see MediaWiki::DumpFile for information on how to report bugs in this software. =head1 HISTORY This package originally started life as a very limited hack using only XML::LibXML::Reader and seeking to text and title nodes in the document. Implementing a parser for the full document was a daunting task and this package sat in the hopes that other people might find it useful. Because XML::TreePuller can expose the underlying XML::LibXML::Reader object and sync itself back up after the cursor was moved out from underneath it, I was able to integrate the logic from this package into the main ::Pages parser. =head1 COPYRIGHT & LICENSE Copyright 2009 "Tyler Riddle". This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License. See http://dev.perl.org/licenses/ for more information. MediaWiki-DumpFile-0.2.2/lib/MediaWiki/DumpFile/Pages/000755 000765 000024 00000000000 12173343122 022536 5ustar00tylerstaff000000 000000 MediaWiki-DumpFile-0.2.2/lib/MediaWiki/DumpFile/Pages.pm000644 000765 000024 00000047035 11616412543 023112 0ustar00tylerstaff000000 000000 package MediaWiki::DumpFile::Pages; our $VERSION = '0.2.2'; our $TESTED_SCHEMA_VERSION = 0.5; use strict; use warnings; use Scalar::Util qw(reftype); use Carp qw(croak); use Data::Dumper; use XML::TreePuller; use XML::LibXML::Reader; use IO::Uncompress::AnyUncompress qw($AnyUncompressError); sub new { my ($class, @args) = @_; my $self = {}; my $reftype; my $xml; my $input; my %conf; my $io; bless($self, $class); $self->{siteinfo} = undef; $self->{version} = undef; $self->{fast_mode} = undef; $self->{version_ignore} = 1; if (scalar(@args) == 0) { croak "must specify a file path or open file handle object or a hash of options"; } elsif (scalar(@args) == 1) { $input = $args[0]; } elsif (! scalar(@args) % 2) { croak "must specify a hash as an argument"; } else { %conf = @args; if (! defined($input = $conf{input})) { croak "input is a required option"; } if (defined($conf{fast_mode})) { $self->{fast_mode} = $conf{fast_mode}; } if (defined($conf{strict})) { $self->{version_ignore} = $conf{version_ignore}; } } $reftype = reftype($input); if (! defined($reftype)) { if (! 
-e $input) { croak("$input is not a file"); } } elsif ($reftype ne 'GLOB') { croak('must provide a GLOB reference'); } $self->{input} = $input; $io = IO::Uncompress::AnyUncompress->new($input); $xml = $self->_new_puller(IO => $io); if (exists($ENV{MEDIAWIKI_DUMPFILE_VERSION_IGNORE})) { $self->{version_ignore} = $ENV{MEDIAWIKI_DUMPFILE_VERSION_IGNORE}; } if (exists($ENV{MEDIAWIKI_DUMPFILE_FAST_MODE})) { $self->{fast_mode} = $ENV{MEDIAWIKI_DUMPFILE_FAST_MODE}; } $self->{xml} = $xml; $self->{reader} = $xml->reader; $self->{input} = $input; $self->{io} = $io; $self->_init_xml; return $self; } sub next { my ($self, $fast) = @_; my $version; my $new; if ($fast || $self->{fast_mode}) { my ($title, $text); if ($self->{finished}) { return (); } eval { ($title, $text) = $self->_fast_next; }; if ($@) { chomp($_); croak("E_XML_PARSE_FAILED \"$@\" see the ERRORS section of the MediaWiki::DumpFile::Pages Perl module documentation for what to do"); } unless (defined($title)) { $self->{finished} = 1; return (); } return MediaWiki::DumpFile::Pages::FastPage->new($title, $text); } if ($self->{finished}) { return undef; } $version = $self->{version}; eval { $new = $self->{xml}->next; }; if ($@) { chomp($_); croak("E_XML_PARSE_FAILED \"$@\" see the ERRORS section of the MediaWiki::DumpFile::Pages Perl module documentation for what to do"); } unless (defined($new)) { $self->{finished} = 1; return undef; } return MediaWiki::DumpFile::Pages::Page->new($new, $version); } sub size { my $source = $_[0]->{input}; unless(defined($source) && ref($source) eq '') { return undef; } #if we are decompressing a file on the fly then don't report the size #of the file because we don't actually know the uncompressed size, #only the compressed size if (defined($_[0]->{io}->getHeaderInfo)) { return undef; } my @stat = stat($source); return $stat[7]; } sub current_byte { return $_[0]->{xml}->reader->byteConsumed; } sub completed { my ($self) = @_; my $size = $self->size; my $current = $self->current_byte; return -1 unless (defined($size) && defined($current)); return int($current / $size * 100); } sub version { return $_[0]->{version}; } #private methods sub _init_xml { my ($self) = @_; my $xml = $self->{xml}; my $version; $xml->iterate_at('/mediawiki', 'short'); $xml->iterate_at('/mediawiki/siteinfo', 'subtree'); $xml->iterate_at('/mediawiki/page', 'subtree'); $version = $self->{version} = $xml->next->attribute('version'); unless ($self->{version_ignore}) { $self->_version_enforce($version); } if ($version > 0.2) { $self->{siteinfo} = $xml->next; bless($self, 'MediaWiki::DumpFile::PagesSiteinfo'); } return undef; } sub _version_enforce { my ($self, $version) = @_; if ($version > $TESTED_SCHEMA_VERSION) { my $filename; my $msg; if (ref($self->{input}) eq '') { $filename = $self->{input}; } else { $filename = ref($self->{input}); } $msg = "E_UNTESTED_DUMP_VERSION Version $version dump file \"$filename\" has not been tested with "; $msg .= __PACKAGE__ . 
" version $VERSION; see the ERRORS section of the MediaWiki::DumpFile::Pages Perl module documentation for what to do"; die $msg; } } sub _new_puller { my ($self, @args) = @_; my $ret; eval { $ret = XML::TreePuller->new(@args) }; if ($@) { chomp($@); croak("E_XML_CREATE_FAILED \"$@\" see the ERRORS section of the MediaWiki::DumpFile::Pages Perl module documentation for what to do") } return $ret; } sub _get_text { my ($self) = @_; my $r = $self->{reader}; my @buffer; my $type; while($r->nodeType != XML_READER_TYPE_TEXT && $r->nodeType != XML_READER_TYPE_END_ELEMENT) { $r->read or die "could not read"; } while($r->nodeType != XML_READER_TYPE_END_ELEMENT) { if ($r->nodeType == XML_READER_TYPE_TEXT) { push(@buffer, $r->value); } $r->read or die "could not read"; } return join('', @buffer); } sub _fast_next { my ($self) = @_; my $reader = $self->{reader}; my ($title, $text); if ($self->{finished}) { return (); } while(1) { my $type = $reader->nodeType; if ($type == XML_READER_TYPE_ELEMENT) { if ($reader->name eq 'title') { $title = $self->_get_text(); last unless $reader->nextElement('text') == 1; next; } elsif ($reader->name eq 'text') { $text = $self->_get_text(); $reader->nextElement('page'); last; } } last unless $reader->nextElement == 1; } if (! defined($title) || ! defined($text)) { $self->{finished} = 1; return (); } return($title, $text); } package MediaWiki::DumpFile::PagesSiteinfo; use base qw(MediaWiki::DumpFile::Pages); use Data::Dumper; use MediaWiki::DumpFile::Pages::Lib qw(_safe_text); sub _site_info { my ($self, $name) = @_; my $siteinfo = $self->{siteinfo}; return _safe_text($siteinfo, $name); } sub sitename { return $_[0]->_site_info('sitename'); } sub base { return $_[0]->_site_info('base'); } sub generator { return $_[0]->_site_info('generator'); } sub case { return $_[0]->_site_info('case'); } sub namespaces { my ($self) = @_; my @e = $self->{siteinfo}->get_elements('namespaces/namespace'); my %ns; map({ $ns{ $_->attribute('key') } = $_->text } @e); return %ns; } package MediaWiki::DumpFile::Pages::Page; use strict; use warnings; use Data::Dumper; use MediaWiki::DumpFile::Pages::Lib qw(_safe_text); sub new { my ($class, $element, $version) = @_; my $self = { tree => $element }; bless($self, $class); if ($version >= 0.4) { bless ($self, 'MediaWiki::DumpFile::Pages::Page000004000'); } return $self; } sub title { return _safe_text($_[0]->{tree}, 'title'); } sub id { return _safe_text($_[0]->{tree}, 'id'); } sub revision { my ($self) = @_; my @revisions; foreach ($self->{tree}->get_elements('revision')) { push(@revisions, MediaWiki::DumpFile::Pages::Page::Revision->new($_)); } if (wantarray()) { return (@revisions); } return pop(@revisions); } package MediaWiki::DumpFile::Pages::Page000004000; use base qw(MediaWiki::DumpFile::Pages::Page); use strict; use warnings; sub redirect { return 1 if defined $_[0]->{tree}->get_elements('redirect'); return 0; } package MediaWiki::DumpFile::Pages::Page::Revision; use strict; use warnings; use MediaWiki::DumpFile::Pages::Lib qw(_safe_text); sub new { my ($class, $tree) = @_; my $self = { tree => $tree }; return bless($self, $class); } sub text { return _safe_text($_[0]->{tree}, 'text'); } sub id { return _safe_text($_[0]->{tree}, 'id'); } sub timestamp { return _safe_text($_[0]->{tree}, 'timestamp'); } sub comment { return _safe_text($_[0]->{tree}, 'comment'); } sub minor { return 1 if defined $_[0]->{tree}->get_elements('minor'); return 0; } sub contributor { return MediaWiki::DumpFile::Pages::Page::Revision::Contributor->new( 
$_[0]->{tree}->get_elements('contributor') ); } package MediaWiki::DumpFile::Pages::Page::Revision::Contributor; use strict; use warnings; use Carp qw(croak); use overload '""' => 'astext', fallback => 'TRUE'; sub new { my ($class, $tree) = @_; my $self = { tree => $tree }; return bless($self, $class); } sub astext { my ($self) = @_; if (defined($self->ip)) { return $self->ip; } return $self->username; } sub username { my $user = $_[0]->{tree}->get_elements('username'); return undef unless defined $user; return $user->text; } sub id { my $id = $_[0]->{tree}->get_elements('id'); return undef unless defined $id; return $id->text; } sub ip { my $ip = $_[0]->{tree}->get_elements('ip'); return undef unless defined $ip; return $ip->text; } package MediaWiki::DumpFile::Pages::FastPage; sub new { my ($class, $title, $text) = @_; my $self = { title => $title, text => $text }; bless($self, $class); return $self; } sub title { return $_[0]->{title}; } sub text { return $_[0]->{text}; } sub revision { return $_[0]; } 1; __END__ =head1 NAME MediaWiki::DumpFile::Pages - Process an XML dump file of pages from a MediaWiki instance =head1 SYNOPSIS use MediaWiki::DumpFile::Pages; #dump files up to version 0.5 are tested $input = 'file-name.xml'; #many supported compression formats $input = 'file-name.xml.bz2'; $input = 'file-name.xml.gz'; $input = \*FH; $pages = MediaWiki::DumpFile::Pages->new($input); #default values %opts = ( input => $input, fast_mode => 0, version_ignore => 1 ); #override configuration options passed to constructor $ENV{MEDIAWIKI_DUMPFILE_VERSION_IGNORE} = 0; $ENV{MEDIAWIKI_DUMPFILE_FAST_MODE} = 1; $pages = MediaWiki::DumpFile::Pages->new(%opts); $version = $pages->version; #version 0.3 and later dump files only $sitename = $pages->sitename; $base = $pages->base; $generator = $pages->generator; $case = $pages->case; %namespaces = $pages->namespaces; #all versions while(defined($page = $pages->next) { print 'Title: ', $page->title, "\n"; } $title = $page->title; $id = $page->id; $revision = $page->revision; @revisions = $page->revision; $text = $revision->text; $id = $revision->id; $timestamp = $revision->timestamp; $comment = $revision->comment; $contributor = $revision->contributor; #version 0.4 and later dump files only $bool = $revision->redirect; $username = $contributor->username; $id = $contributor->id; $ip = $contributor->ip; $username_or_ip = $contributor->astext; $username_or_ip = "$contributor"; =head1 METHODS =head2 new This is the constructor for this package. If it is called with a single parameter it must be the input to use for parsing. The input is specified as either the location of a MediaWiki pages dump file or a reference to an already open file handle. If more than one argument is passed to new it must be a hash of options. The keys are named =over 4 =item input This is the input to parse as documented earlier. =item fast_mode Have the iterator run in fast mode by default; defaults to false. See the section on fast mode below. =item version_ignore Do not enforce parsing of only tested schemas in the XML document; defaults to true =back =head2 version Returns the version of the dump file. =head2 sitename Returns the sitename from the MediaWiki instance. Requires a dump file of at least version 0.3. =head2 base Returns the URL used to access the MediaWiki instance. Requires a dump file of at least version 0.3. =head2 generator Returns the version of MediaWiki that generated the dump file. Requires a dump file of at least version 0.3. 
=head2 case Returns the case sensitivity configuration of the MediaWiki instance. Requires a dump file of at least version 0.3. =head2 namespaces Returns a hash where the key is the numerical namespace id and the value is the plain text namespace name. The main namespace has an id of 0 and an empty string value. Requires a dump file of at least version 0.3. =head2 next Accepts an optional boolean argument to control fast mode. If the argument is specified it forces fast mode on or off. Otherwise the mode is controlled by the fast_mode configuration option. See the section below on fast mode for more information. It is safe to intermix calls between fast and normal mode in one parsing session. In all modes undef is returned if there is no more data to parse. In normal mode an instance of MediaWiki::DumpFile::Pages::Page is returned and the full API is available. In fast mode an instance of MediaWiki::DumpFile::Pages::FastPage is returned; the only methods supported are title, text, and revision. This class can act as a stand-in for MediaWiki::DumpFile::Pages::Page except it will throw an error if any attempt is made to access any other part of the API. =head2 size Returns the size of the input file in bytes or if the input specified is a reference to a file handle it returns undef. =head2 current_byte Returns the number of bytes of XML that have been successfully parsed. =head1 FAST MODE Fast mode is a way to get increased parsing performance while dropping some of the features available in the parser. If you only require the titles and text from a page then fast mode will decrease the amount of time required just to parse the XML file; some times drastically. When fast mode is used on a dump file that has more than one revision of a single article in it only the text of the first article in the dump file will be returned; the other revisions of the article will be silently skipped over. =head1 MediaWiki::DumpFile::Pages::Page This object represents a distinct Mediawiki page and is used to access the page data and metadata. The following methods are available: =over 4 =item title Returns a string of the page title =item id Returns a numerical page identification =item revision In scalar context returns the last revision in the dump for this page; in array context returns a list of all revisions made available for the page in the same order as the dump file. All returned data is an instance of MediaWiki::DumpFile::Pages::Revision =back =head1 MediaWiki::DumpFile::Pages::Page::Revision This object represents a distinct revision of a page from the Mediawiki dump file. The standard dump files contain only the most specific revision of each page and the comprehensive dump files contain all revisions for each page. The following methods are available: =over 4 =item text Returns the page text for this specific revision of the page. =item id Returns the numerical revision id for this specific revision - this is independent of the page id. =item timestamp Returns a string value representing the time the revision was created. The string is in the format of "2008-07-09T18:41:10Z". =item comment Returns the comment made about the revision when it was created. =item contributor Returns an instance of MediaWiki::DumpFile::Pages::Page::Revision::Contributor =item minor Returns true if the edit was marked as being minor or false otherwise =item redirect Returns true if the page is a redirect to another page or false otherwise. Requires a dump file of at least version 0.4. 
=back =head1 MediaWiki::DumpFile::Pages::Page::Revision::Contributor This object provides access to the contributor of a specific revision of a page. When used in a scalar context it will return the username of the editor if the editor was logged in or the IP address of the editor if the edit was anonymous. =over 4 =item username Returns the username of the editor if the editor was logged in when the edit was made or undef otherwise. =item id Returns the numerical id of the editor if the editor was logged in or undef otherwise. =item ip Returns the IP address of the editor if the editor was anonymous or undef otherwise. =item astext Returns the username of the editor if they were logged in or the IP address if the editor was anonymous. =back =head1 ERRORS =head2 E_XML_CREATE_FAILED Error creating XML parser object While trying to build the XML::TreePuller object a fatal error occured; the error message from the parser was included in the generated error output you saw. At the time of writing this document the error messages are not very helpful but for some reason the XML parser rejected the document; here's a list of things to check: =over 4 =item Make sure the file exists and is readable =item Make sure the file is actually an XML file and is not compressed =back =head2 E_XML_PARSE_FAILED XML parser failed during parsing Something went wrong with the XML parser - the error from the parser was included in the generated error message. This happens when there is a severe error parsing the document such as a syntax error. =head2 E_UNTESTED_DUMP_VERSION Untested dump file versions The dump files created by Mediawiki include a versioned XML schema. This software is tested with the most recent known schema versions and can be configured to enforce a specific tested schema. MediaWiki::DumpFile::Pages no longer enforces the versions by default but the software author using this library has indicated that it should. When this happens it dies with an error like the following: E_UNTESTED_DUMP_VERSION Version 0.4 dump file "t/simpleenglish-wikipedia.xml" has not been tested with MediaWiki::DumpFile::Pages version 0.1.9; see the ERRORS section of the MediaWiki::DumpFile::Pages Perl module documentation for what to do at lib/MediaWiki/DumpFile/Pages.pm line 148. If you encounter this condition you can do the following: =over 4 =item Check your module version The error message should have the version number of this module in it. Check CPAN and see if there is a newer version with official support. The web page http://search.cpan.org/dist/MediaWiki-DumpFile/lib/MediaWiki/DumpFile/Pages.pm will show the highest supported version dump files near the top of the SYNOPSIS. =item Check the bug database It is possible the issue has been resolved already but the update has not made it onto CPAN yet. See this web page http://rt.cpan.org/Public/Dist/Display.html?Name=mediawiki-dumpfile and check for an open bug report relating to the version number changing. =item Be adventurous If you just want to have the software run anyway and see what happens you can set the environment variable MEDIAWIKI_DUMPFILE_VERSION_IGNORE to a true value which will cause the module to silently ignore the case and continue parsing the document. 
You can set the environment and run your program at the same time with a command like this: MEDIAWIKI_DUMPFILE_VERSION_IGNORE=1 ./wikiscript.pl This may work fine or it may fail in subtle ways silently - there is no way to know for sure with out studying the schema to see if the changes are backwards compatible. =item Open a bug report You can use the same URL for rt.cpan.org above to create a new ticket in MediaWiki-DumpFile or just send an email to "bug-mediawiki-dumpfile at rt.cpan.org". Be sure to use a title for the bug that others will be able to use to find this case as well and to include the full text from the error message. Please also specify if you were adventurous or not and if it was successful for you. =back =head1 AUTHOR Tyler Riddle, C<< >> =head1 BUGS Please see MediaWiki::DumpFile for information on how to report bugs in this software. =head1 COPYRIGHT & LICENSE Copyright 2009 "Tyler Riddle". This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License. See http://dev.perl.org/licenses/ for more information. MediaWiki-DumpFile-0.2.2/lib/MediaWiki/DumpFile/SQL.pm000644 000765 000024 00000020627 12172343632 022510 0ustar00tylerstaff000000 000000 package MediaWiki::DumpFile::SQL; our $VERSION = '0.2.2'; use strict; use warnings; use Data::Dumper; use Carp qw(croak); use Scalar::Util qw(reftype); use IO::Uncompress::AnyUncompress qw($AnyUncompressError); #public methods sub new { my ($class, $file) = @_; my $self = { }; if (! defined($file)) { croak "must specify a filename or open filehandle"; } bless($self, $class); $self->{buffer} = []; $self->{file} = $file; $self->{fh} = undef; $self->{table_name} = undef; $self->{schema} = undef; $self->{type_map} = undef; $self->{table_statement} = undef; $self->create_type_map; $self->open_file; $self->parse_table; return $self; } sub next { my ($self) = @_; my $buffer = $self->{buffer}; my $next; while(! defined($next = shift(@$buffer))) { if (! $self->parse_more) { return undef; } } return $next; } sub table_name { my ($self) = @_; return $self->{table_name}; } sub table_statement { my ($self) = @_; return $self->{table_statement}; } sub schema { my ($self) = @_; return @{$self->{schema}}; } #private methods sub open_file { my ($self) = @_; my $file = $self->{file}; my $type = reftype($file); my $fh; $self->{fh} = $fh = IO::Uncompress::AnyUncompress->new($file); my $line = <$fh>; if ($line !~ m/^-- MySQL dump/) { die "expected MySQL dump file"; } return; } sub parse_table { my ($self) = @_; my $fh = $self->{fh}; my $found = 0; my $table; my $table_statement; my @cols; #find the CREATE TABLE line and get the table name while(<$fh>) { if (m/^CREATE TABLE `([^`]+)` \(/) { $table = $1; $table_statement = $_; last; } } die "expected CREATE TABLE" unless defined($table); while(<$fh>) { $table_statement .= $_; if (m/^\)/) { last; } elsif (m/^\s+`([^`]+)` (\w+)/) { #this regex ^^^^ matches column names and types push(@cols, [$1, $2]); } } if (! scalar(@cols)) { die "Could not find columns for $table"; } $self->{table_name} = $table; $self->{schema} = \@cols; $self->{table_statement} = $table_statement; return 1; } #returns false at EOF or true if more data was parsed sub parse_more { my ($self) = @_; my $fh = $self->{fh}; my $insert; if (! defined($fh)) { return 0; } while(1) { $insert = <$fh>; if (! 
defined($insert)) { close($fh) or die "could not close: $!"; $self->{fh} = undef; return 0; } if ($insert =~ m/^INSERT INTO/) { $self->parse($insert); return 1; } } } #this parses a complete INSERT line into the individual #components sub parse { my ($self, $string) = @_; my $buffer = $self->{buffer}; my $compiled = $self->compile_config; my $found = 0; $_ = $string; #check the table name m/^INSERT INTO `(.*?)` VALUES /g or die "expected header"; if ($self->{table_name} ne $1) { die "table name mismatch: $1"; } while(1) { my %new; my $depth = 0; #apply the various regular expressions to the #string in order foreach my $handler (@$compiled) { my ($col, $cb) = @$handler; my $ret; $depth++; #these callbacks also use $_ eval { $ret = &$cb }; if ($@) { die "parse error pos:" . pos() . " depth:$depth error: $@"; } #column names starting with # are part of the parser, not user data if ($col !~ m/^#/) { $new{$col} = $ret; } } push(@$buffer, \%new); $found++; if (m/\G, ?/gc) { #^^^^ match the delimiter between rows next; } elsif (m/\G;$/gc) { #^^^^ match end of statement last; } else { die "expected delimter or end of statement. pos:" . pos; } } return $found; } #functions for the parsing engine #maps between MySQL types and our types sub create_type_map { my ($self) = @_; $self->{type_map} = { int => 'int', tinyint => 'int', bigint => 'int', char => 'varchar', varchar => 'varchar', enum => 'varchar', double => 'float', timestamp => 'int', blob => 'varchar', mediumblob => 'varchar', mediumtext => 'varchar', tinyblob => 'varchar', varbinary => 'varchar', }; return 1; } #convert the schema into a list of callbacks #that match the schema and extract data from it sub compile_config { my ($self) = @_; my $schema = $self->{schema}; my @handlers; push(@handlers, ['#start', new_start_data()]); foreach (@$schema) { my ($name, $type) = @$_; my $oldtype = $type; $type = $self->{type_map}->{lc($type)}; if (! defined($type)) { die "type map failed for $oldtype"; } if ($type eq 'int') { push(@handlers, [$name, new_int()], ['#delim', new_delim()]); } elsif ($type eq 'varchar') { push(@handlers, [$name, new_varchar()], ['#delim', new_delim()]); } elsif($type eq 'float') { push(@handlers, [$name, new_float()], ['#delim', new_delim()]); } else { die "unknown type: $type"; } } pop(@handlers); #gets rid of that extra delimiter push(@handlers, ['#end', new_end_data()]); return \@handlers; } sub unescape { my ($input) = @_; if ($input eq '\\\\') { return '\\'; } elsif ($input eq "\\'") { return("'"); } elsif ($input eq '\\"') { return '"'; } elsif ($input eq '\\n') { return "\n"; } elsif ($input eq '\\t') { return "\t"; } else { die "can not unescape $input"; } } #functions that create callbacks that match and extract #data from INSERT lines #it is critical that these regular expressions use the /gc option #or the parser will stop functioning as soon as a regex with #out those options is encountered and debugging becomes #almost impossible sub new_int { return sub { m/\GNULL/gc and return undef; m/\G(-?[\d]+)/gc or die "expected int"; return $1; }; } sub new_float { return sub { m/\GNULL/gc and return undef; m/\G(-?[\d]+(?:\.[\d]+(e-?[\d]+)?)?)/gc or die "expected float"; return $1; } } sub new_varchar { return sub { my $data; m/\GNULL/gc and return undef; #does not handle very long strings; crashes perl 5.8.9 causes 5.10.1 to error out #m/\G'((\\.|[^'])*)'/gc or die "expected varchar"; #thanks somni! 
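	#the pattern below uses the "unrolled loop" idiom: it matches runs of
	#characters that are neither backslash nor single quote, optionally
	#separated by a backslash escape pair, so the regex engine never has to
	#backtrack through the nested quantifiers of the commented-out pattern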
m/'((?:[^\\']*(?:\\.[^\\']*)*))'/gc or die "expected varchar"; $data = $1; $data =~ s/(\\.)/unescape($1)/e; return $data; } } sub new_delim { return sub { m/\G, ?/gc or die "expected delimiter"; return undef; }; } sub new_start_data { return sub { m/\G\(/gc or die "expected start of data set"; return undef; }; } sub new_end_data { return sub { m/\G\)/gc or die "expected end of data set"; return undef; } } 1; __END__ =head1 NAME MediaWiki::DumpFile::SQL - Process SQL dump files from a MediaWiki instance =head1 SYNOPSIS use MediaWiki::DumpFile::SQL; $sql = MediaWiki::DumpFile::SQL->new('dumpfile.sql'); #many compression formats are supported $sql = MediaWiki::DumpFile::SQL->new('dumpfile.sql.gz'); $sql = MediaWiki::DumpFile::SQL->new('dumpfile.sql.bz2'); $sql = MediaWiki::DumpFile::SQL->new(\*FH); @schema = $sql->schema; $name = $sql->table_name; while(defined($row = $sql->next)) { #do something with the data from the row } =head1 FUNCTIONS =head2 new This is the constructor for this package. It is called with a single parameter: the location of a MediaWiki SQL dump file or a reference to an already open file handle. Only the definition and data for the B table in a SQL dump file is processed. =head2 next Returns a hash reference where the key is the column name from the table and the value of the hash is the value for that column and row. Returns undef when there is no more data available. =head2 table_name Returns a string of the table name that was discovered in the dump file. =head2 table_statement Returns a string of the unparsed CREATE TABLE statement straight from the dump file. =head2 schema Returns a list that represents the table schema as it was parsed by this module. Each item in the list is a reference to an array with two elements. The first element is the name of the row and the second element is the MySQL datatype associated with the row. The list is in an identical order as the definitions in the CREATE TABLE statement. =head1 AUTHOR Tyler Riddle, C<< >> =head1 BUGS Please see MediaWiki::DumpFile for information on how to report bugs in this software. =head1 COPYRIGHT & LICENSE Copyright 2009 "Tyler Riddle". This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License. See http://dev.perl.org/licenses/ for more information. MediaWiki-DumpFile-0.2.2/lib/MediaWiki/DumpFile/Pages/Lib.pm000644 000765 000024 00000000433 11543133737 023613 0ustar00tylerstaff000000 000000 package MediaWiki::DumpFile::Pages::Lib; use strict; use warnings; use Exporter 'import'; our @EXPORT_OK = qw(_safe_text); sub _safe_text { my ($data, $name) = @_; my $found = $data->get_elements($name); if (! defined($found)) { return ''; } return $found->text; } 1;MediaWiki-DumpFile-0.2.2/lib/MediaWiki/DumpFile/Compat/link.pod000644 000765 000024 00000001044 11543133737 024373 0ustar00tylerstaff000000 000000 =head1 NAME Parse::MediaWikiDump::link - Object representing a link from one article to another =head1 ABOUT This object is used to access the data associated with each individual link between articles in a MediaWiki instance. =head1 METHODS =over 4 =item $link->from Returns the article id (not the name) that the link orginiates from. 
=item $link->namespace Returns the namespace id (not the name) that the link points to =item $link->to Returns the article title (not the id and not including the namespace) that the link points to MediaWiki-DumpFile-0.2.2/lib/MediaWiki/DumpFile/Compat/Links.pod000644 000765 000024 00000004324 11543133737 024522 0ustar00tylerstaff000000 000000 =head1 NAME Parse::MediaWikiDump::Links - Object capable of processing link dump files =head1 ABOUT This object is used to access content of the SQL based category dump files by providing an iterative interface for extracting the individual article links to the same. Objects returned are an instance of Parse::MediaWikiDump::link. =head1 SYNOPSIS use MediaWiki::DumpFile::Compat; $pmwd = Parse::MediaWikiDump->new; $links = $pmwd->links('pagelinks.sql'); $links = $pmwd->links(\*FILEHANDLE); #print the links between articles while(defined($link = $links->next)) { print 'from ', $link->from, ' to ', $link->namespace, ':', $link->to, "\n"; } =head1 METHODS =over 4 =item Parse::MediaWikiDump::Links->new Create a new instance of a page links dump file parser =item $links->next Return the next available Parse::MediaWikiDump::link object or undef if there is no more data left =back =head1 EXAMPLE =head2 List all links between articles in a friendly way #!/usr/bin/perl use strict; use warnings; use MediaWiki::DumpFile::Compat; my $pmwd = Parse::MediaWikiDump->new; my $links = $pmwd->links(shift) or die "must specify a pagelinks dump file"; my $dump = $pmwd->pages(shift) or die "must specify an article dump file"; my %id_to_namespace; my %id_to_pagename; binmode(STDOUT, ':utf8'); #build a map between namespace ids to namespace names foreach (@{$dump->namespaces}) { my $id = $_->[0]; my $name = $_->[1]; $id_to_namespace{$id} = $name; } #build a map between article ids and article titles while(my $page = $dump->next) { my $id = $page->id; my $title = $page->title; $id_to_pagename{$id} = $title; } $dump = undef; #cleanup since we don't need it anymore while(my $link = $links->next) { my $namespace = $link->namespace; my $from = $link->from; my $to = $link->to; my $namespace_name = $id_to_namespace{$namespace}; my $fully_qualified; my $from_name = $id_to_pagename{$from}; if ($namespace_name eq '') { #default namespace $fully_qualified = $to; } else { $fully_qualified = "$namespace_name:$to"; } print "Article \"$from_name\" links to \"$fully_qualified\"\n"; } MediaWiki-DumpFile-0.2.2/lib/MediaWiki/DumpFile/Compat/page.pod000644 000765 000024 00000004271 11543133737 024357 0ustar00tylerstaff000000 000000 =head1 NAME Parse::MediaWikiDump::page - Object representing a specific revision of a MediaWiki page =head1 ABOUT This object is returned from the "next" method of Parse::MediaWikiDump::Pages and Parse::MediaWikiDump::Revisions. You most likely will not be creating instances of this particular object yourself instead you use this object to access the information about a page in a MediaWiki instance. =head1 SYNOPSIS use MediaWiki::DumpFile::Compat; $pages = Parse::MediaWikiDump::Pages->new('pages-articles.xml'); #get all the records from the dump files, one record at a time while(defined($page = $pages->next)) { print "title '", $page->title, "' id ", $page->id, "\n"; } =head1 METHODS =over 4 =item $page->redirect Returns an empty string (such as '') for the main namespace or a string containing the name of the namespace. =item $page->categories Returns a reference to an array that contains a list of categories or undef if there are no categories. 
This method does not understand templates and may not return all the categories the article actually belongs in. =item $page->title Returns a string of the full article title including the namespace if present =item $page->namespace Returns a string of the namespace of the article or an empty string if the article is in the default namespace =item $page->id Returns a number that is the id for the page in the MediaWiki instance =item $page->revision_id Returns a number that is the revision id for the page in the MediaWiki instance =item $page->timestamp Returns a string in the following format: 2005-07-09T18:41:10Z =item $page->username Returns a string of the username responsible for this specific revision of the article or undef if the editor was anonymous =item $page->userid Returns a number that is the id for the user returned by $page->username or undef if the editor was anonymous =item $page->userip Returns a string of the IP of the editor if the edit was anonymous or undef otherwise =item $page->minor Returns 1 if this article was flaged as a minor edit otherwise returns 0 =item $page->text Returns a reference to a string that contains the article title text =back MediaWiki-DumpFile-0.2.2/lib/MediaWiki/DumpFile/Compat/Pages.pod000644 000765 000024 00000014206 11543133737 024501 0ustar00tylerstaff000000 000000 =head1 NAME Parse::MediaWikiDump::Pages - Object capable of processing dump files with a single revision per article =head1 ABOUT This object is used to access the metadata associated with a MediaWiki instance and provide an iterative interface for extracting the individual articles out of the same. This module does not allow more than one revision for each specific article; to parse a comprehensive dump file use the Parse::MediaWikiDump::Revisions object. =head1 SYNOPSIS use MediaWiki::DumpFile::Compat; $pmwd = Parse::MediaWikiDump->new; $input = 'pages-articles.xml'; $input = \*FILEHANDLE; $pages = $pmwd->pages(); $pages = $pmwd->pages(); $pages = $pmwd->pages(input => $input, fast_mode => 0); #print the title and id of each article inside the dump file while(defined($page = $pages->next)) { print "title '", $page->title, "' id ", $page->id, "\n"; } =head1 METHODS =over 4 =item $pages->new Open the specified MediaWiki dump file. If the single argument to this method is a string it will be used as the path to the file to open. If the argument is a reference to a filehandle the contents will be read from the filehandle as specified. If more than one argument is supplied the arguments must be a hash of configuration options. The input option is required and is the same as previously described. The fast_mode option is optional, defaults to being off, and if set to a true value will cause the parser to run in a mode that is much faster but only provides access to the title and text contents of a page. See the MediaWiki::DumpFile::Pages for details about fast mode. =item $pages->next Returns an instance of the next available Parse::MediaWikiDump::page object or returns undef if there are no more articles left. =item $pages->version Returns a plain text string of the dump file format revision number =item $pages->sitename Returns a plain text string that is the name of the MediaWiki instance. =item $pages->base Returns the URL to the instances main article in the form of a string. =item $pages->generator Returns a string containing 'MediaWiki' and a version number of the instance that dumped this file. 
Example: 'MediaWiki 1.14alpha' =item $pages->case Returns a string describing the case sensitivity configured in the instance. =item $pages->namespaces Returns a reference to an array of references. Each reference is to another array with the first item being the unique identifier of the namespace and the second element containing a string that is the name of the namespace. =item $pages->namespaces_names Returns an array reference the array contains strings of all the namespaces each as an element. =item $pages->current_byte Returns the number of bytes that has been processed so far =item $pages->size Returns the total size of the dump file in bytes. =back =head2 Scan an article dump file for double redirects that exist in the most recent article revision #!/usr/bin/perl #progress information goes to STDERR, a list of double redirects found #goes to STDOUT binmode(STDOUT, ":utf8"); binmode(STDERR, ":utf8"); use strict; use warnings; use MediaWiki::DumpFile::Compat; my $file = shift(@ARGV); my $pmwd = Parse::MediaWikiDump->new; my $pages; my $page; my %redirs; my $artcount = 0; my $file_size; my $start = time; if (defined($file)) { $file_size = (stat($file))[7]; $pages = $pmwd->pages($file); } else { print STDERR "No file specified, using standard input\n"; $pages = $pmwd->pages(\*STDIN); } #the case of the first letter of titles is ignored - force this option #because the other values of the case setting are unknown die 'this program only supports the first-letter case setting' unless $pages->case eq 'first-letter'; print STDERR "Analyzing articles:\n"; while(defined($page = $pages->next)) { update_ui() if ++$artcount % 500 == 0; #main namespace only next unless $page->namespace eq ''; next unless defined($page->redirect); my $title = case_fixer($page->title); #create a list of redirects indexed by their original name $redirs{$title} = case_fixer($page->redirect); } my $redir_count = scalar(keys(%redirs)); print STDERR "done; searching $redir_count redirects:\n"; my $count = 0; #if a redirect location is also a key to the index we have a double redirect foreach my $key (keys(%redirs)) { my $redirect = $redirs{$key}; if (defined($redirs{$redirect})) { print "$key\n"; $count++; } } print STDERR "discovered $count double redirects\n"; #removes any case sensativity from the very first letter of the title #but not from the optional namespace name sub case_fixer { my $title = shift; #check for namespace if ($title =~ /^(.+?):(.+)/) { $title = $1 . ':' . ucfirst($2); } else { $title = ucfirst($title); } return $title; } sub pretty_bytes { my $bytes = shift; my $pretty = int($bytes) . ' bytes'; if (($bytes = $bytes / 1024) > 1) { $pretty = int($bytes) . ' kilobytes'; } if (($bytes = $bytes / 1024) > 1) { $pretty = sprintf("%0.2f", $bytes) . ' megabytes'; } if (($bytes = $bytes / 1024) > 1) { $pretty = sprintf("%0.4f", $bytes) . 
' gigabytes'; } return $pretty; } sub pretty_number { my $number = reverse(shift); $number =~ s/(...)/$1,/g; $number = reverse($number); $number =~ s/^,//; return $number; } sub update_ui { my $seconds = time - $start; my $bytes = $pages->current_byte; print STDERR " ", pretty_number($artcount), " articles; "; print STDERR pretty_bytes($bytes), " processed; "; if (defined($file_size)) { my $percent = int($bytes / $file_size * 100); print STDERR "$percent% completed\n"; } else { my $bytes_per_second = int($bytes / $seconds); print STDERR pretty_bytes($bytes_per_second), " per second\n"; } } =head2 Version 0.4 This class was updated to support version 0.4 dump files from a MediaWiki instance but it does not currently support any of the new information available in those files. MediaWiki-DumpFile-0.2.2/lib/MediaWiki/DumpFile/Compat/Revisions.pod000644 000765 000024 00000007625 11543133737 025432 0ustar00tylerstaff000000 000000 =head1 NAME Parse::MediaWikiDump::Revisions - Object capable of processing dump files with multiple revisions per article =head1 ABOUT This object is used to access the metadata associated with a MediaWiki instance and provide an iterative interface for extracting the individual article revisions out of the same. To guarantee that there is only a single revision per article use the Parse::MediaWikiDump::Pages object. =head1 SYNOPSIS use MediaWiki::DumpFile::Compat; $pmwd = Parse::MediaWikiDump->new; $revisions = $pmwd->revisions('pages-articles.xml'); $revisions = $pmwd->revisions(\*FILEHANDLE); #print the title and id of each article inside the dump file while(defined($page = $revisions->next)) { print "title '", $page->title, "' id ", $page->id, "\n"; } =head1 METHODS =over 4 =item $revisions->new Open the specified MediaWiki dump file. If the single argument to this method is a string it will be used as the path to the file to open. If the argument is a reference to a filehandle the contents will be read from the filehandle as specified. =item $revisions->next Returns an instance of the next available Parse::MediaWikiDump::page object or returns undef if there are no more articles left. =item $revisions->version Returns a plain text string of the dump file format revision number =item $revisions->sitename Returns a plain text string that is the name of the MediaWiki instance. =item $revisions->base Returns the URL to the instances main article in the form of a string. =item $revisions->generator Returns a string containing 'MediaWiki' and a version number of the instance that dumped this file. Example: 'MediaWiki 1.14alpha' =item $revisions->case Returns a string describing the case sensitivity configured in the instance. =item $revisions->namespaces Returns a reference to an array of references. Each reference is to another array with the first item being the unique identifier of the namespace and the second element containing a string that is the name of the namespace. =item $revisions->namespaces_names Returns an array reference the array contains strings of all the namespaces each as an element. =item $revisions->current_byte Returns the number of bytes that has been processed so far =item $revisions->size Returns the total size of the dump file in bytes. 
=back =head1 EXAMPLE =head2 Extract the article text of each revision of an article using a given title #!/usr/bin/perl use strict; use warnings; use MediaWiki::DumpFile::Compat; my $file = shift(@ARGV) or die "must specify a MediaWiki dump of the current pages"; my $title = shift(@ARGV) or die "must specify an article title"; my $pmwd = Parse::MediaWikiDump->new; my $dump = $pmwd->revisions($file); my $found = 0; binmode(STDOUT, ':utf8'); binmode(STDERR, ':utf8'); #this is the only currently known value but there could be more in the future if ($dump->case ne 'first-letter') { die "unable to handle any case setting besides 'first-letter'"; } $title = case_fixer($title); while(my $revision = $dump->next) { if ($revision->title eq $title) { print STDERR "Located text for $title revision ", $revision->revision_id, "\n"; my $text = $revision->text; print $$text; $found = 1; } } print STDERR "Unable to find article text for $title\n" unless $found; exit 1; #removes any case sensativity from the very first letter of the title #but not from the optional namespace name sub case_fixer { my $title = shift; #check for namespace if ($title =~ /^(.+?):(.+)/) { $title = $1 . ':' . ucfirst($2); } else { $title = ucfirst($title); } return $title; } =head1 LIMITATIONS =head2 Version 0.4 This class was updated to support version 0.4 dump files from a MediaWiki instance but it does not currently support any of the new information available in those files.
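If you do need the version 0.4 additions, one option is to drop down to the native MediaWiki::DumpFile::Pages interface, which exposes the new redirect flag. The sketch below is not part of the compatibility API, the dump file name is hypothetical, and note that in the current release the redirect accessor is implemented on the page object:

  use MediaWiki::DumpFile::Pages;

  my $pages = MediaWiki::DumpFile::Pages->new('pages-meta-history.xml');

  while (defined(my $page = $pages->next)) {
      #redirect() is only provided for version 0.4 and later dump files
      print $page->title, " is a redirect\n" if $page->redirect;
  }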