. All rights reserved. This program is free
software; you can redistribute it and/or modify it under the same
terms as Perl itself.
HTML-Diff-0.60/t/ 000755 000765 000024 00000000000 12614726540 013557 5 ustar 00neilb staff 000000 000000 HTML-Diff-0.60/t/01-diff.t 000755 000765 000024 00000026034 12314631025 015070 0 ustar 00neilb staff 000000 000000 #!/usr/bin/perl
use strict;
use Test;
BEGIN { plan tests => 10 }
use Getopt::Long;
my ($verbose);
GetOptions("verbose!" => \$verbose) or die "Parsing command line failed.";
use Data::Dumper;
use HTML::Diff qw(line_diff word_diff html_word_diff);
my $test_text_a = "Four score
and seven years ago, our forefathers
brought forth on this continent
a new nation conceived
in liberty and
dedicated to the proposition
that all men are created equal.
Now our great nation is
engaged in civil war.";
my $test_text_b = "Four score
and seven years ago, our forefathers
brought forth on this continent
dedicated to the proposition
that all men are created equal.
Now our great nation is enagaged
in civil war";
my $test_text_c = 'PEOPLE said, "The evening-bell is sounding, the sun is setting." A strange wondrous tone was heard in the narrow streets of a large town. It was like the sound of a church-bell: but it was only heard for a moment, for the rolling of the carriages, and the voices of the multitude made too great a noise.';
my $test_text_d = 'PEOPLE said, "The bell is sounding." A strange wondrous was heard tone in the narrow streets of a large town. A long time passed. It was like the sound of a church-bell: but it was only heard for a moment, for the rolling of the carriages, and the voices of the multitude made too great a noise.';
# This next pair tests the usual HTML changes
# (e.g. "a b c d" -> "a b c d" considers all of "a b c d"
# as a change). It also tests that whitespace changes are effectively ignored.
my $test_html_a = '
September 23, 1996
< this is an unkown html tag />
Anne Condon, University of Wisconsin
Faith Fich, University of Toronto
Greg N. Frederickson, Purdue University
Andrew V. Goldberg, NEC Research Institute
David S. Johnson, AT&T Bell Laboratories
Michael C. Loui, University of Illinois at Urbana-Champaign
Steven Mahaney, DIMACS
Prabhakar Raghavan, IBM Almaden Research Center
John Savage, Brown University
Alan Selman, SUNY at Buffalo
David B. Shmoys, Cornell University
Abstract.
This report focuses on two core areas of theory of computing:
discrete algorithms and computational complexity theory.
The report
reviews the purposes and goals of theoretical research,
summarizes selected past and recent achievements,
explains the importance of sustaining core research,
and identifies promising opportunities for future research.
Some research opportunities build bridges between
theory of computing and other areas of computer science,
and other science and engineering disciplines.
';
my $test_html_b = '
September 23, 1996
Anne Condon, University of Wisconsin
Faith Fich, University of Toronto
Greg N. Frederickson, Purdue University
Andrew V. Goldberg, NEC Research Institute
David S. Johnson, AT&T Bell Laboratories
Michael C. Loui, University of Illinois at Urbana-Champaign
Steven Mahaney, DIMACS
Prabhakar Raghavan, IBM Almaden Research Center
John Savage, Brown University
Alan Selman, SUNY at Buffalo
David B. Shmoys, Cornell University
Abstract. This report focuses on two core areas of
theory of computing: discrete algorithms and computational complexity
theory. The report reviews the purposes and goals of theoretical
research, summarizes selected past and recent achievements, explains
the importance of sustaining core research, and identifies promising
opportunities for future research. Some research opportunities build
bridges between theory of computing and other areas of computer
science, and other science and engineering disciplines.
';
sub print_diff {
my $ch;
my ($chunks) = @_;
foreach $ch (@$chunks) {
my ($flag, $m, $o) = @$ch;
unless ($flag eq 'u') {
print "<< old\n";
print "$o";
print ">> new\n";
print "$m";
print "==\n";
# TBD: make some kind of warning about lacking a newline at the end
} else {
print "$m";
}
}
}
sub test_diff_continuity {
my ($a, $b, $diffalgo, $ignore_whitespace) = @_;
my $chunks = &$diffalgo($a, $b);
my ($runningb, $runninga);
$runninga = $runningb = "";
my $ch;
foreach $ch (@$chunks)
{
my ($flag, $ach, $bch) = @$ch;
$runninga .= $ach || '';
$runningb .= $bch || '';
}
if ($ignore_whitespace) {
$a =~ s/\s\s+/ /g;
$b =~ s/\s\s+/ /g;
$runninga =~ s/\s\s+/ /g;
$runningb =~ s/\s\s+/ /g;
}
return ($a eq $runninga) && ($b eq $runningb);
}
sub expect_diff {
my ($a, $b, $algo, $expectation) = @_;
}
if ($verbose) {
my $chunks = HTML::Diff::line_diff($test_text_a, $test_text_b);
print_diff($chunks);
$chunks = HTML::Diff::word_diff($test_text_c, $test_text_d);
print "\n";
print_diff($chunks);
}
ok(deep_compare(html_word_diff('', ''), [['', undef, undef]]));
ok(deep_compare(html_word_diff('0', ''), [['-', '0', '']]));
ok(deep_compare(html_word_diff('', '0'), [['+', '', '0']]));
ok(deep_compare(html_word_diff('0', '0'), [['u', '0', '0']]));
ok(deep_compare(html_word_diff('a b b', 'b b c'),
[['-', 'a ', ''],
['u', 'b ', 'b '],
['c', 'b', 'b c']]));
print "Testing line_diff on test_text_a and test_text_b\n" if $verbose;
ok(test_diff_continuity($test_text_a, $test_text_b,
\&HTML::Diff::line_diff));
print "Testing html_word_diff on test_text_a and test_text_b\n"
if $verbose;
ok(test_diff_continuity($test_text_a, $test_text_b,
\&HTML::Diff::html_word_diff));
print "Testing html_word_diff on test_html_a and test_html_b\n"
if $verbose;
ok(test_diff_continuity($test_html_a, $test_html_b,
\&HTML::Diff::html_word_diff, 1));
my $result = HTML::Diff::html_word_diff($test_html_a, $test_html_b);
# Use the following lines to capture a "correct" result (when you
# think you've got one) which can be used to validate future tests
# open OUT, ">expect";
# print OUT Dumper($result);
# close OUT;
# This value is the result we expect from HTML::Diff::html_word_diff()
# If the actual result differs by one byte, it's a failure.
# When the diff code is changed, you'll need to calculate a new expected
# value using the lines above, and paste the resulting value below.
my $expect = [
[
'-',
'
',
''
],
[
'u',
'
'
],
[
'-',
'
',
''
],
[
'u',
'September 23, 1996
',
'September 23, 1996
'
],
[
'-',
'< this is an unkown html tag />
',
''
],
[
'u',
'Anne Condon, University of Wisconsin
Faith Fich, University of Toronto
Greg N. Frederickson, Purdue University
Andrew V. Goldberg, NEC Research Institute
David S. Johnson, AT&T Bell Laboratories
Michael C. Loui, University of Illinois at Urbana-Champaign
Steven Mahaney, DIMACS ',
'Anne Condon, University of Wisconsin
Faith Fich, University of Toronto
Greg N. Frederickson, Purdue University
Andrew V. Goldberg, NEC Research Institute
David S. Johnson, AT&T Bell Laboratories
Michael C. Loui, University of Illinois at Urbana-Champaign
Steven Mahaney, DIMACS '
],
[
'c',
'
',
'
'
],
[
'u',
'Prabhakar Raghavan, IBM Almaden Research Center
John Savage, Brown University
Alan Selman, SUNY at Buffalo
David B. Shmoys, Cornell University
Abstract.
This report focuses on two core areas of theory of computing:
discrete algorithms and computational complexity theory.
The report
reviews the purposes and goals of theoretical research,
summarizes selected past and recent achievements,
',
'Prabhakar Raghavan, IBM Almaden Research Center
John Savage, Brown University
Alan Selman, SUNY at Buffalo
David B. Shmoys, Cornell University
Abstract. This report focuses on two core areas of
theory of computing: discrete algorithms and computational complexity
theory. The report reviews the purposes and goals of theoretical
research, summarizes selected past and recent achievements, '
],
[
'c',
'explains the importance of ',
'explains
the importance of '
],
[
'u',
'sustaining core research,
and identifies promising opportunities for future research.
Some research opportunities build bridges between
theory of computing and other areas of computer science,
and other science and engineering disciplines.
',
'sustaining core research, and identifies promising
opportunities for future research. Some research opportunities build
bridges between theory of computing and other areas of computer
science, and other science and engineering disciplines.
'
],
];
ok(deep_compare($result, $expect));
# Given two array refs of array refs, of array refs... return true if
# the two structures are isomorphic and all the corresponding scalars
# are equal
# TBD: make it more efficient; builds up call stack too much.
# TBD: Take a binary test as an arg, to replace eq
sub deep_compare {
my ($a, $b) = @_;
my ($x, $y);
if (!ref($a) && !ref($b)) {
return $a eq $b;
} else {
return 0 unless ((ref($a) eq 'ARRAY') && (ref($b) eq 'ARRAY'));
while ($x = shift @$a) {
$y = shift @$b;
return 0 unless deep_compare($x, $y);
}
}
return 1;
}
my $diffchunks = HTML::Diff::html_word_diff($test_html_a, $test_html_b);
if ($verbose) {
print "Result of diff:\n";
print "[$_]\n" foreach (map {join "||", @$_} @$diffchunks);
}
sub check_diff_integrity {
my $failure = 0;
foreach my $chunk (@{$_[0]}) {
my ($mark, $left, $right) = @$chunk;
if ($mark ne 'u' && $left eq $right) {
print "[$left] is [$right] but HTML::Diff thinks they're different!\n";
$failure = 1;
}
}
return !$failure;
}
my $A = "
Search
";
my $B = "Search
";
$result = html_word_diff($A, $B);
ok(check_diff_integrity($result));
sub diff_file {
my ($left, $right) = @_;
open LEFT, $left;
open RIGHT, $right;
$/ = undef;
my $Left = ;
my $Right = ;
close LEFT;
close RIGHT;
my $diff_chunks = html_word_diff($Left, $Right);
print_diff($diff_chunks);
print "\n";
}
1;
HTML-Diff-0.60/lib/HTML/ 000755 000765 000024 00000000000 12614726540 014626 5 ustar 00neilb staff 000000 000000 HTML-Diff-0.60/lib/HTML/Diff.pm 000644 000765 000024 00000016764 12614726507 016055 0 ustar 00neilb staff 000000 000000 package HTML::Diff;
our $VERSION = '0.60';
use 5.006;
use strict;
use warnings;
use Exporter;
our @ISA = qw(Exporter);
our @EXPORT = qw(line_diff word_diff html_word_diff);
# This list of tags is taken from the XHTML spec and includes
# all those for which no closing tag is expected. In addition
# the pattern below matches any tag which ends with a slash /
our @UNBALANCED_TAGS = qw(br hr p li base basefont meta link
col colgroup frame input isindex area
embed img bgsound marquee);
use Algorithm::Diff 'sdiff';
sub member {
my ($item, @list) = @_;
return scalar(grep {$_ eq $item} @list);
}
sub html_word_diff {
my ($left, $right) = @_;
# Split the two texts into words and tags.
my (@leftchks) = $left =~ m/(<[^>]*>\s*|[^<]+)/gm;
my (@rightchks) = $right =~ m/(<[^>]*>\s*|[^<]+)/gm;
@leftchks = map { $_ =~ /^<[^>]*>$/ ? $_ : ($_ =~ m/(\S+\s*)/gm) }
@leftchks;
@rightchks = map { $_ =~ /^<[^>]*>$/ ? $_ : ($_ =~ m/(\S+\s*)/gm) }
@rightchks;
# Remove blanks; maybe the above regexes could handle this?
@leftchks = grep { $_ ne '' } @leftchks;
@rightchks = grep { $_ ne '' } @rightchks;
# Now we process each segment by turning it into a pair. The first element
# is the text as we want it to read in the result. The second element is
# the value we will to use in comparisons. It contains an identifier
# for each of the balanced tags that it lies within.
# This subroutine holds state in the tagstack variable
my $tagstack = [];
my $smear_tags = sub {
if ($_ =~ /^<.*>/) {
if ($_ =~ m|^|) {
my ($tag) = m|^\s*([^ \t\n\r>]*)|;
$tag = lc $tag;
# print STDERR "Found closer of $tag with " . (scalar @$tagstack) . " stack items\n";
# If we found the closer for the tag on top
# of the stack, pop it off.
if ((scalar @$tagstack) > 0 && $$tagstack[-1] eq $tag) {
my $stacktag = pop @$tagstack;
}
return [$_, $tag];
} else {
my ($tag) = m|^<\s*([^\s>]*)|;
$tag = lc $tag;
# print STDERR "Found opener of $tag with " . (scalar @$tagstack) . " stack items\n";
if (member($tag, @UNBALANCED_TAGS) || $tag =~ m#/\s*>$#)
{ # (tags without correspond closer tags)
return [$_, $tag];
} else {
push @$tagstack, $tag;
}
return [$_, $_];
}
} else {
my $result = [$_, (join "!!!", (@$tagstack, $_)) ];
return $result;
}
};
# Now do the "smear tags" operation across each of the chunk-lists
$tagstack = [];
@leftchks = map { &$smear_tags } @leftchks;
# TBD: better modularity would preclude having to reset the stack
$tagstack = [];
@rightchks = map { &$smear_tags } @rightchks;
# print STDERR Data::Dumper::Dumper(\@leftchks);
# print STDERR Data::Dumper::Dumper(\@rightchks);
# Now do the diff, using the "comparison" half of the pair to
# compare two chuncks.
my $chunks = sdiff(\@leftchks, \@rightchks,
sub { $_ = elem_cmprsn(shift); $_ =~ s/\s+$/ /g; $_ });
# print STDERR Data::Dumper::Dumper($chunks);
# Finally, process the output of sdiff by concatenating
# consecutive chunks that were "unchanged."
my $lastsignal = '';
my $lbuf = "";
my $rbuf = "";
my @result;
my $ch;
foreach $ch (@$chunks) {
my ($signal, $left, $right) = @$ch;
if ($signal ne $lastsignal && $lastsignal ne '') {
if ($signal ne 'u' && $lastsignal ne 'u') {
$signal = 'c';
} else {
push @result, [$lastsignal, $lbuf, $rbuf];
$lbuf = "";
$rbuf = "";
}
}
# if ($signal eq 'u' && $lastsignal ne 'u') {
# push @result, [$lastsignal, $lbuf, $rbuf]
# unless $lastsignal eq '';
# $lbuf = "";
# $rbuf = "";
# } elsif ($signal ne 'u' && $lastsignal eq 'u') {
# push @result, [$lastsignal, $lbuf, $rbuf];
# $lbuf = "";
# $rbuf = "";
# }
my $lelem = elem_mkp($left);
my $relem = elem_mkp($right);
$lbuf .= (defined $lelem ? $lelem : '');
$rbuf .= (defined $relem ? $relem : '');
$lastsignal = $signal;
}
push @result, [$lastsignal, $lbuf, $rbuf];
return \@result;
}
# these are like "accessors" for the two halves of the diff-chunk pairs
sub elem_mkp {
my ($e) = @_;
return undef unless ref $e eq 'ARRAY';
my ($mkp, $cmp) = @$e;
return $mkp;
}
sub elem_cmprsn {
my ($e) = @_;
return undef unless ref $e eq 'ARRAY';
my ($mkp, $cmp) = @$e;
return $cmp;
}
# Finally a couple of non-HTML diff routines
sub line_diff {
my ($left, $right) = @_;
my (@leftchks) = $left =~ m/(.*\n?)/gm;
my (@rightchks) = $right =~ m/(.*\n?)/gm;
my $result = sdiff(\@leftchks, \@rightchks);
# my @result = map { [ $_->[1], $_->[2] ] } @$result;
return $result;
}
sub word_diff {
my ($left, $right) = @_;
my (@leftchks) = $left =~ m/([^\s]*\s?)/gm;
my (@rightchks) = $right =~ m/([^\s]*\s?)/gm;
my $result = sdiff(\@leftchks, \@rightchks);
my @result = (map { [ $_->[1], $_->[2] ] } @$result);
return $result;
}
1;
=head1 NAME
HTML::Diff - compare two HTML strings and return a list of differences
=head1 SYNOPSIS
use HTML::Diff;
$result = html_word_diff($left_text, $right_text);
=head1 DESCRIPTION
This module compares two strings of HTML and returns a list of a
chunks which indicate the diff between the two input strings, where
changes in formatting are considered changes.
HTML::Diff does not strictly parse the HTML. Instead, it uses regular
expressions to make a decent effort at understanding the given HTML.
As a result, there are many valid HTML documents for which it will not
produce the correct answer. But there may be some invalid HTML
documents for which it gives you the answer you're looking for. Your
mileage may vary; test it on lots of inputs from your domain before
relying on it.
Returns a reference to a list of triples [, , ].
Each triple represents a check of the input texts. The flag tells you
whether it represents a deletion, insertion, a modification, or an
unchanged chunk.
Every character of each input text is accounted for by some triple in
the output. Specifically, Concatenating all the members from
the return value should produce C<$left_text>, and likewise the
members concatenate together to produce C<$right_text>.
The is either C<'u'>, C<'+'>, C<'-'>, or C<'c'>, indicating
whether the two chunks are the same, the $right_text contained this
chunk and the left chunk didn't, or vice versa, or the two chunks are
simply different. This follows the usage of Algorithm::Diff.
The difference is computed on a word-by-word basis, "breaking" on
visible words in the HTML text. If a tag only is changed, it will not
be returned as an independent chunk but will be shown as a change to
one of the neighboring words. For balanced tags, such as , it
is intended that a change to the tag will be treated as a change to
all words in between.
=head1 SEE ALSO
L provides the diff algorithm used in this module.
L can generate a diff between two XML files, and also
patch an XML file, given a diff.
=head1 REPOSITORY
L
=head1 AUTHOR
Whipped up by Ezra elias kilty Cooper, Eezra@ezrakilty.netE.
Patch contributed by Adam Easjo@koldfront.dkE.
=head1 COPYRIGHT AND LICENSE
Copyright 2003-2014 by Ezra elias kilty Cooper, Eezra@ezrakilty.netE
This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=cut