LaTeX-TOM-1.03000755001750001750 011675167673 11677 5ustar00stssts000000000000LaTeX-TOM-1.03/META.yml000444001750001750 146211675167673 13310 0ustar00stssts000000000000--- abstract: 'A module for parsing, analyzing, and manipulating LaTeX documents.' author: - 'Steven Schubiger ' build_requires: Test::More: 0 configure_requires: Module::Build: 0.38 dynamic_config: 1 generated_by: 'Module::Build version 0.38, CPAN::Meta::Converter version 2.110930' license: perl meta-spec: url: http://module-build.sourceforge.net/META-spec-v1.4.html version: 1.4 name: LaTeX-TOM provides: LaTeX::TOM: file: lib/LaTeX/TOM.pm version: 1.03 LaTeX::TOM::Node: file: lib/LaTeX/TOM/Node.pm version: 0.03 LaTeX::TOM::Parser: file: lib/LaTeX/TOM/Parser.pm version: 0.07 LaTeX::TOM::Tree: file: lib/LaTeX/TOM/Tree.pm version: 0.04 requires: Carp: 0 File::Basename: 0 resources: license: http://dev.perl.org/licenses/ version: 1.03 LaTeX-TOM-1.03/META.json000444001750001750 255111675167673 13460 0ustar00stssts000000000000{ "abstract" : "A module for parsing, analyzing, and manipulating LaTeX documents.", "author" : [ "Steven Schubiger " ], "dynamic_config" : 1, "generated_by" : "Module::Build version 0.38, CPAN::Meta::Converter version 2.110930", "license" : [ "perl_5" ], "meta-spec" : { "url" : "http://search.cpan.org/perldoc?CPAN::Meta::Spec", "version" : "2" }, "name" : "LaTeX-TOM", "prereqs" : { "build" : { "requires" : { "Test::More" : 0 } }, "configure" : { "requires" : { "Module::Build" : "0.38" } }, "runtime" : { "requires" : { "Carp" : 0, "File::Basename" : 0 } } }, "provides" : { "LaTeX::TOM" : { "file" : "lib/LaTeX/TOM.pm", "version" : "1.03" }, "LaTeX::TOM::Node" : { "file" : "lib/LaTeX/TOM/Node.pm", "version" : "0.03" }, "LaTeX::TOM::Parser" : { "file" : "lib/LaTeX/TOM/Parser.pm", "version" : "0.07" }, "LaTeX::TOM::Tree" : { "file" : "lib/LaTeX/TOM/Tree.pm", "version" : "0.04" } }, "release_status" : "stable", "resources" : { "license" : [ "http://dev.perl.org/licenses/" ] }, "version" : 
"1.03" } LaTeX-TOM-1.03/Makefile.PL000444001750001750 75511675167673 13775 0ustar00stssts000000000000# Note: this file was auto-generated by Module::Build::Compat version 0.3800 use ExtUtils::MakeMaker; WriteMakefile ( 'PL_FILES' => {}, 'INSTALLDIRS' => 'site', 'NAME' => 'LaTeX::TOM', 'EXE_FILES' => [], 'VERSION_FROM' => 'lib/LaTeX/TOM.pm', 'PREREQ_PM' => { 'Test::More' => 0, 'File::Basename' => 0, 'Carp' => 0 } ) ; LaTeX-TOM-1.03/Build.PL000444001750001750 100511675167673 13324 0ustar00stssts000000000000## Created by make2build 0.17 use strict; use warnings; use Module::Build; my $build = Module::Build->new ( module_name => 'LaTeX::TOM', dist_author => 'Steven Schubiger ', dist_version_from => 'lib/LaTeX/TOM.pm', requires => { 'Carp' => 0, 'File::Basename' => 0, }, build_requires => { 'Test::More' => 0 }, license => 'perl', create_readme => 1, create_makefile_pl => 'traditional', ); $build->create_build_script; LaTeX-TOM-1.03/INSTALL000444001750001750 43511675167673 13047 0ustar00stssts000000000000To install LaTeX::TOM, type the following: perl Build.PL ./Build ./Build test ./Build install Or, if you're on a platform (like DOS or Windows) that doesn't like the "./" notation, you can do this: perl Build.PL perl Build perl Build test perl Build install LaTeX-TOM-1.03/TODO000444001750001750 474511675167673 12536 0ustar00stssts000000000000- Only apply LAST definition of a particular mapping. This could be determined by keeping track of positions of mapping declarations as in a preorder traversal, done in stage 3 of parsing. - Somehow we need to speed up the application of mappings. Instead of recurring through the entire tree, can't we just make USED_COMMANDS contain at each entry a "postings list" of where in the tree the command occurs? CAVEAT: if we do this, we'd have to also update these postings lists each time a mapping is applied, because we're making new subtrees with new commands being exercised. 
On the bright side, if this is properly executed, we could get massive speedups on applying the mappings (for n mappings, we'd remove n full tree traversals!) - Normalize TEXT nodes on all splits (adjacent TEXT nodes should be combined) - Have linked-list stuff maintained as the parsing proceeds. - Add interface to brace matching in parser (and make sure it works). - Add environment boundary matching, with interface. - Perhaps remove array-based tree representation entirely. - Add file names to parser errors, so we know which included files are being complained about. - Add another parsing stage to pull group nodes that are part of multi-group commands into a single "parameter" list. This would be kind of analogous to the "attributes" list of an XML node. This will require information about all built-in commands (though we might be able to get away with grabbing all GROUPs until we hit non-whitespace TEXT nodes.) - Parse all commands into COMMAND nodes, not just ones that have groups as children. This is more for consistency than structural usefulness. This would fairly radically affect later parsing stages. Note that currently some commands will appear in the results of getting all TEXT nodes. This doesn't really make sense; the semantics of unrecognized commands is not that they are plain text. - In a group, embedded commands go from the embedded location to the end of the group boundary. So for each embedded command we should really be splitting the text of the group, and making the right half a child of the command, and the left half a previous sibling. Currently the parser just makes the entire group a child of the command, which is wrong (but not likely to be encountered due to convention). - Should we be treating square-braces ([]) as groups, analogously to what we do for curly braces? Read up on the full semantics of the square braces. 
LaTeX-TOM-1.03/Changes000444001750001750 1637311675167673 13361 0ustar00stssts000000000000Revision history for Perl extension LaTeX::TOM. 1.03 2011-12-23 - Merged development version to stable. 1.02_01 2011-11-24 - Change commented debug statements in ::Parser to be invokable. - Alter _debug() to print filename and line number. 1.02 2011-11-13 - Merged development version to stable. 1.01_01 2011-10-09 - Refactor new(), copy() and split() Node methods. - Adjust setNodeText(). - Replace Node's boolean values with true/false. - Enable warnings for the Node class. - Remove obsoleted LICENSE file. 1.01 2011-08-19 - Merged development version to stable. 1.00_08 2011-08-18 - Test getCommandNodesByName(), getEnvironmentsByName() and getNodesByCondition(). 1.00_07 2011-08-15 - Fix parsing user-defined mappings and add a test. [rt #48540 - Jesse S. Bangs] - Don't pass the parser object to the Tree constructor. - Adjust some code indentation. 1.00_06 2011-08-03 - Fix setting instance config data for the main constructor. - Introduce error handlers to minimize code repetition. - Change commented debug statements in parse() to be invokable. - Rename print() to _debug_tree() and wrap it twice in order to emit output to STDOUT/STDERR. - Alter _debug_tree() further to use the output handler being passed in and enhance the code layout. - Be less verbose for variable names when assigning user options. 1.00_05 2011-07-29 - Refactor _getTextAndCommentNodes(), which includes: - Move creating a comment or text node to a lexical subroutine. - Store the type as string and adjust comments accordingly. - Append line to string directly instead of pushing to an array. - (Re)set initialization variables with short-circuit operators. - Use underscores within variable names where appropriate. - Reformat visually the regular expressions used. 1.00_04 2011-07-27 - Strengthen the check for a \input file filename extension. - Add File::Basename as dependency. 
1.00_03 2011-07-26 - Fix an error when dereferencing the nodes of a subtree. - Improve the \bibliography handling code and add a test. - Make reading a \input file more strict. - Test that empty \input files are not skipped. - Bless into current package for the Node/Tree constructors. - Adjust some code indentation. 1.00_02 2011-07-24 - Improve the \input handling code and add tests. - Raise error in _readFile() when a file cannot be opened. - Use lexical filehandle and slurp file in _readFile(). - Substitute warn with carp in _addInputs(). 1.00_01 2011-07-20 - Use true as boolean value when initializing config data. - Change the mention of the primary contact. - Reword the documentation a bit. - Remove broken website link and the accompanying text. - Update broken license link. - Skip documentation tests for non-release testing. 1.00 Wed Oct 7 10:56:12 CEST 2009 - Merged development version to stable. 0.9_03 Sun Aug 23 16:59:26 CEST 2009 - Initialize user options by calling a lexical subroutine. - Replace calls to nonexistent copyTree/splitTextNode subs with calls to the copy/split methods. - Populate the config data hashes with true values at runtime. - Remove the superfluous use of 'defined' when checking booleans. - Declare globals with 'our' instead of 'use vars'. - Assign the config data at once within the parser object. - Set initial version numbers for the Node, Parser and Tree class. 0.9_02 Sun Aug 16 12:31:18 CEST 2009 - Fix \input lines parsing failure with "read inputs flag" set. [rt #48538 - Jesse S. Bangs] 0.9_01 Wed Aug 12 14:25:08 CEST 2009 - Use code reference instead of string eval in getNodesByCondition(). [rt #48551 - Jesse S. Bangs] - Fix some warnings which were suppressed within the tests. 0.9 Tue Apr 29 12:21:00 CEST 2008 - Added support for dealing with starred commands. [James Bowlin ] - Merged development version to stable. 
0.8_02 Thu Feb 21 21:08:50 CET 2008 - Added further test-files to suite (i.e., ones that test the parser, tree and node functionality). - Fixed a slight documentation error (the method getTopLevelNodes() returns a list, and *not* an array reference). 0.8_01 Tue Feb 19 15:29:40 CET 2008 - Added basic test-file basic.t. - Added CREDITS and LICENSE sections to the documentation. 0.8 Mon Oct 8 10:23:01 CEST 2007 - Fixed failing tests pod.t & pod-coverage.t (adjusted plans). 0.7 Tue Aug 28 00:12:03 CEST 2007 - Added formatting tags to the documentation where appropriate and enlisted all methods within the documentation index. 0.6 Wed Mar 14 01:05:09 CET 2007 - Merged development version to stable. 0.5_05 Sun Feb 18 11:30:51 CET 2007 - Fixing reference types in all ...->{children}->{nodes}->[...] occurrences in the LaTeX::TOM::Parser::_applyMapping and LaTeX::TOM::Node::getLastChild subroutines. [Otakar Smrz, otakar.smrz@mff.cuni.cz] 0.5_04 Fri Feb 16 10:41:21 CET 2007 - Fixed approximately half a dozen broken hash keys in references with {node} instead of {nodes} as subkey. 0.5_03 Fri Feb 16 02:00:52 CET 2007 - Fixed wrong spelling of $self->{node} to $self->{nodes} within LaTeX::TOM::Parser. 0.5_02 Mon Feb 12 03:37:11 CET 2007 - Added suitable (albeit slightly modified) pod.t & pod-coverage.t to the test directory. - Documented LaTeX::TOM's constructor new(). 0.5_01 Mon Feb 5 08:47:05 CET 2007 - Resolved accidentally swapped $prev/$next pointers in assignment in LaTeX::TOM::Node's listify(), resulting in misbehaviour of getNextGroupNode(), getPreviousSibling() and the like. - Added fully qualified package declaration to LaTeX::TOM::Parser, LaTeX::TOM::Node & LaTeX::TOM::Tree. Removed class specification from sub declarations likewise. - LaTeX::TOM's constructor, new() reblesses a LaTeX::TOM::Parser object with the references to the global variables defined within LaTeX::TOM. @_ is passed unaltered to LaTeX::TOM::Parser's new(). 
- LaTeX::TOM establishes an ISA relationship with LaTeX::TOM::Parser and LaTeX::TOM::Parser with LaTeX::TOM::Node/LaTeX::TOM::Tree. - LaTeX::TOM::Tree's constructor, new() now returns a blessed hash reference instead of previously a blessed array reference, because we're basically reblessing the $parser object. - Extracted the TODO part from LaTeX::TOM and put it in a separate file named TODO in the root of the distribution. 0.5 Sun Dec 31 01:47:36 CET 2006 - Percents (%) and braces ({}) within verbatim blocks are now taken care of while parsing. - Replaced all occurrences of tabs within the code with literal whitespace. 0.3 Sun Dec 24 11:37:21 CET 2006 - Initial CPAN version. 02c ??? - Bug fixes: Handling of newlines and whitespace between commands and parameters and groups, handling of \w+\d+ commands (thanks Leo Tenenblat for both of these), documentation bugfix: "parseFile", not "parsefile". 02b ??? - License included (BSD), some minor code indenting cleanups. 02 ??? - This is the first release version. 01 ??? - Non-OOP version of the current functionality. Not released. LaTeX-TOM-1.03/README000444001750001750 3267311675167673 12747 0ustar00stssts000000000000NAME LaTeX::TOM - A module for parsing, analyzing, and manipulating LaTeX documents. SYNOPSIS use LaTeX::TOM; $parser = LaTeX::TOM->new; $document = $parser->parseFile('mypaper.tex'); $latex = $document->toLaTeX; $specialnodes = $document->getNodesByCondition(sub { my $node = shift; return ( $node->getNodeType eq 'TEXT' && $node->getNodeText =~ /magic string/ ); }); $sections = $document->getNodesByCondition(sub { my $node = shift; return ( $node->getNodeType eq 'COMMAND' && $node->getCommandName =~ /section$/ ); }); $indexme = $document->getIndexableText; $document->print; DESCRIPTION This module provides a parser which parses and interprets (though not fully) LaTeX documents and returns a tree-based representation of what it finds. This tree is a `LaTeX::TOM::Tree'. 
The tree contains `LaTeX::TOM::Node' nodes. This module should be especially useful to anyone who wants to do processing of LaTeX documents that requires extraction of plain-text information, or altering of the plain-text components (or alternatively, the math-text components). COMPONENTS LaTeX::TOM::Parser The parser recognizes 3 parameters upon creation. The parameters, in order, are parse error handling (= 0 || 1 || 2) Determines what happens when a parse error is encountered. `0' results in a warning. `1' results in a die. `2' results in silence. Note that particular groupings in LaTeX (i.e. newcommands and the like) contain invalid TeX or LaTeX, so you nearly always need this parameter to be `0' or `2' to completely parse the document. read inputs flag (= 0 || 1) This flag determines whether a scan for `\input' and `\input-like' commands is performed, and the resulting called files parsed and added to the parent parse tree. `0' means no, `1' means do it. Note that this will happen recursively if it is turned on. Also, bibliographies (.bbl files) are detected and included. apply mappings flag (= 0 || 1) This flag determines whether (most) user-defined mappings are applied. This means `\defs', `\newcommands', and `\newenvironments'. This is critical for properly analyzing the content of the document, as this must be phrased in terms of the semantics of the original TeX and LaTeX commands, not ad hoc user macros. So, for instance, do not expect plain-text extraction to work properly with this option off. The parser returns a `LaTeX::TOM::Tree' ($document in the SYNOPSIS). LaTeX::TOM::Node Nodes may be of the following types: TEXT `TEXT' nodes can be thought of as representing the plain-text portions of the LaTeX document. This includes math and anything else that is not a recognized TeX or LaTeX command, or user-defined command. In reality, `TEXT' nodes contain commands that this parser does not yet recognize the semantics of. 
COMMAND A `COMMAND' node represents a TeX command. It always has child nodes in a tree, though the tree might be empty if the command operates on zero parameters. An example of a command is \textbf{blah} This would parse into a `COMMAND' node for `textbf', which would have a subtree containing the `TEXT' node with text ``blah.'' ENVIRONMENT Similarly, TeX environments parse into `ENVIRONMENT' nodes, which have metadata about the environment, along with a subtree representing what is contained in the environment. For example, \begin{equation} r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \end{equation} Would parse into an `ENVIRONMENT' node of the class ``equation'' with a child tree containing the result of parsing ```r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.''' GROUP A `GROUP' is like an anonymous `COMMAND'. Since you can put whatever you want in curly-braces (`{}') in TeX in order to make semantically isolated regions, this separation is preserved by the parser. A `GROUP' is just the subtree of the parsed contents of plain curly-braces. It is important to note that currently only the first `GROUP' in a series of `GROUP's following a LaTeX command will actually be parsed into a `COMMAND' node. The reason is that, for the initial purposes of this module, it was not necessary to recognize additional `GROUP's as additional parameters to the `COMMAND'. However, this is something that this module really should do eventually. Currently if you want all the parameters to a multi-parametered command, you'll need to pick out all the following `GROUP' nodes yourself. Eventually this will become something like a list which is stored in the `COMMAND' node, much like XML::DOM's treatment of attributes. These are, in a sense, apart from the rest of the document tree. Then `GROUP' nodes will become much more rare. 
COMMENT A `COMMENT' node is very similar to a `TEXT' node, except it is specifically for lines beginning with ```%''' (the TeX comment delimiter) or the right-hand portion of a line that has ```%''' at some internal point. LaTeX::TOM::Trees As mentioned before, the Tree is the return result of a parse. The tree is nothing more than an arrayref of Nodes, some of which may contain their own trees. This is useful knowledge at this point, since the user isn't provided with a full suite of convenient tree-modification methods. However, Trees do already have some very convenient methods, described in the next section. METHODS LaTeX::TOM new Instantiate a new parser object. In this section all of the methods for each of the components are listed and described. LaTeX::TOM::Parser The methods for the parser (aside from the constructor, discussed above) are: parseFile (filename) Read in the contents of *filename* and parse them, returning a `LaTeX::TOM::Tree'. parse (string) Parse the string *string* and return a `LaTeX::TOM::Tree'. LaTeX::TOM::Tree This section contains methods for the Trees returned by the parser. copy Duplicate a tree into new memory. print A debug print of the structure of the tree. plainText Returns an arrayref which is a list of strings representing the text of all `getNodePlainTextFlag = 1' `TEXT' nodes, in an inorder traversal. indexableText A method like the above but which goes one step further; it cleans all of the returned text and concatenates it into a single string which one could consider having all of the standard information retrieval value for the document, making it useful for indexing. toLaTeX Return a string representing the LaTeX encoded by the tree. This is especially useful to get a normal document again, after modifying nodes of the tree. getTopLevelNodes Return a list of `LaTeX::TOM::Nodes' at the top level of the Tree. getAllNodes Return an arrayref with all nodes of the tree. This "flattens" the tree. 
getCommandNodesByName (name) Return an arrayref with all `COMMAND' nodes in the tree which have a name matching *name*. getEnvironmentsByName (name) Return an arrayref with all `ENVIRONMENT' nodes in the tree which have a class matching *name*. getNodesByCondition (code reference) This is a catch-all search method which can be used to pull out nodes that match pretty much any perl expression, without manually having to traverse the tree. *code reference* is a perl code reference which receives as its first argument the node of the tree that is currently scrutinized and is expected to return a boolean value. See the SYNOPSIS for examples. getFirstNode Returns the first node of the tree. This is useful if you want to walk the tree yourself, starting with the first node. LaTeX::TOM::Node This section contains the methods for nodes of the parsed Trees. getNodeType Returns the type, one of `TEXT', `COMMAND', `ENVIRONMENT', `GROUP', or `COMMENT', as described above. getNodeText Applicable for `TEXT' or `COMMENT' nodes; this returns the document text they contain. This is undef for other node types. setNodeText Set the node text, also for `TEXT' and `COMMENT' nodes. getNodeStartingPosition Get the starting character position in the document of this node. For `TEXT' and `COMMENT' nodes, this will be where the text begins. For `ENVIRONMENT', `COMMAND', or `GROUP' nodes, this will be the position of the *last* character of the opening identifier. getNodeEndingPosition Same as above, but for last character. For `GROUP', `ENVIRONMENT', or `COMMAND' nodes, this will be the *first* character of the closing identifier. getNodeOuterStartingPosition Same as getNodeStartingPosition, but for `GROUP', `ENVIRONMENT', or `COMMAND' nodes, this returns the *first* character of the opening identifier. getNodeOuterEndingPosition Same as getNodeEndingPosition, but for `GROUP', `ENVIRONMENT', or `COMMAND' nodes, this returns the *last* character of the closing identifier. 
getNodeMathFlag This applies to any node type. It is `1' if the node sets, or is contained within, a math mode region. `0' otherwise. `TEXT' nodes which have this flag as `1' can be assumed to be the actual mathematics contained in the document. getNodePlainTextFlag This applies only to `TEXT' nodes. It is `1' if the node is non-math and is visible (in other words, will end up being a part of the output document). One would only want to index `TEXT' nodes with this property, for information retrieval purposes. getEnvironmentClass This applies only to `ENVIRONMENT' nodes. Returns what class of environment the node represents (the `X' in `\begin{X}' and `\end{X}'). getCommandName This applies only to `COMMAND' nodes. Returns the name of the command (the `X' in `\X{...}'). getChildTree This applies only to `COMMAND', `ENVIRONMENT', and `GROUP' nodes: it returns the `LaTeX::TOM::Tree' which is ``under'' the calling node. getFirstChild This applies only to `COMMAND', `ENVIRONMENT', and `GROUP' nodes: it returns the first node from the first level of the child subtree. getLastChild Same as above, but for the last node of the first level. getPreviousSibling Return the prior node on the same level of the tree. getNextSibling Same as above, but for following node. getParent Get the parent node of this node in the tree. getNextGroupNode This is an interesting function, and kind of a hack because of the way the parser makes the current tree. Basically it will give you the next sibling that is a `GROUP' node, until it either hits the end of the tree level, a `TEXT' node which doesn't match `/^\s*$/', or a `COMMAND' node. This is useful for finding all `GROUP'ed parameters after a `COMMAND' node (see comments for `GROUP' in the `COMPONENTS' / `LaTeX::TOM::Node' section). You can just have a while loop that calls this method until it gets `undef', and you'll know you've found all the parameters to a command. 
Note: this may be bad, but `TEXT' Nodes matching `/^\s*\[[0-9]+\]$/' (optional parameter groups) are treated as if they were 'blank'. CAVEATS Due to the lack of tree-modification methods, currently this module is mostly useful for minor modifications to the parsed document, for instance, altering the text of `TEXT' nodes but not deleting the nodes. Of course, the user can still do this by breaking abstraction and directly modifying the Tree. Also note that the parsing is not complete. This module was not written with the intention of being able to produce output documents the way ``latex'' does. The intent was instead to be able to analyze and modify the document on a logical level with regard to the content; it doesn't care about the document formatting and outputting side of TeX/LaTeX. There is much work still to be done. See the TODO file in the root of the distribution. BUGS Probably plenty. However, this module has performed fairly well on a set of ~1000 research publications from the Computing Research Repository, so I deemed it ``good enough'' to use for purposes similar to mine. Please let the maintainer know of parser errors if you discover any. CREDITS Thanks to the following people (in order of appearance), who have contributed valuable suggestions and patches: Otakar Smrz Moritz Lenz James Bowlin Jesse S. Bangs AUTHORS Written by Aaron Krowne Maintained by Steven Schubiger LICENSE This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself. 
See http://dev.perl.org/licenses/ LaTeX-TOM-1.03/MANIFEST000444001750001750 127411675167673 13171 0ustar00stssts000000000000Build.PL Changes INSTALL lib/LaTeX/TOM.pm lib/LaTeX/TOM/Node.pm lib/LaTeX/TOM/Parser.pm lib/LaTeX/TOM/Tree.pm Makefile.PL MANIFEST This list of files META.json META.yml README t/00-load.t t/01-basic.t t/02-parser.t t/03-tree.t t/04-node.t t/bibliography.t t/by_condition.t t/by_name.t t/data/bibliography.t/sample.bbl t/data/bibliography.t/sample.in t/data/input.t/00-image_skip.pstex_t t/data/input.t/01-basic.in t/data/input.t/01-basic.tex t/data/input.t/02-guess.in t/data/input.t/02-guess.tex t/data/input.t/03-empty.in t/data/input.t/03-empty.tex t/data/input.t/04-psfig_ignore.tex t/data/input.tex t/data/mapping.t/mapping.in t/data/tex.in t/input.t t/mapping.t t/pod-coverage.t t/pod.t TODO LaTeX-TOM-1.03/lib000755001750001750 011675167673 12445 5ustar00stssts000000000000LaTeX-TOM-1.03/lib/LaTeX000755001750001750 011675167673 13422 5ustar00stssts000000000000LaTeX-TOM-1.03/lib/LaTeX/TOM.pm000444001750001750 4220411675167673 14576 0ustar00stssts000000000000############################################################################### # # LaTeX::TOM (TeX Object Model) # # Version 1.03 # # ---------------------------------------------------------------------------- # # originally written by Aaron Krowne (akrowne@vt.edu) # July 2002 # # Virginia Polytechnic Institute and State University # Department of Computer Science # Digital Libraries Research Laboratory # # now maintained by Steven Schubiger (schubiger@cpan.org) # April 2008 # # ---------------------------------------------------------------------------- # # This module provides some decent semantic handling of LaTeX documents. It is # inspired by XML::DOM, so users of that module should be able to acclimate # themselves to this one quickly. 
Basically the subroutines in this package # allow you to parse a LaTeX document into its logical structure, including # groupings, commands, environments, and comments. These all go into a tree # which is built as arrays of Perl hashes. # ############################################################################### package LaTeX::TOM; use strict; use base qw(LaTeX::TOM::Parser); use constant true => 1; our $VERSION = '1.03'; our (%INNERCMDS, %MATHENVS, %MATHBRACKETS, %BRACELESS, %TEXTENVS, $PARSE_ERRORS_FATAL, $DEBUG); # BEGIN CONFIG SECTION ######################################################## # these are commands that can be "embedded" within a grouping to alter the # environment of that grouping. For instance {\bf text}. Without listing the # command names here, the parser will treat such sequences as plain text. # %INNERCMDS = map { $_ => true } ( 'bf', 'md', 'em', 'up', 'sl', 'sc', 'sf', 'rm', 'it', 'tt', 'noindent', 'mathtt', 'mathbf', 'tiny', 'scriptsize', 'footnotesize', 'small', 'normalsize', 'large', 'Large', 'LARGE', 'huge', 'Huge', 'HUGE', ); # these commands put their environments into math mode # %MATHENVS = map { $_ => true } ( 'align', 'equation', 'eqnarray', 'displaymath', 'ensuremath', 'math', '$$', '$', '\[', '\(', ); # these commands/environments put their children in text (non-math) mode # %TEXTENVS = map { $_ => true } ( 'tiny', 'scriptsize', 'footnotesize', 'small', 'normalsize', 'large', 'Large', 'LARGE', 'huge', 'Huge', 'HUGE', 'text', 'textbf', 'textmd', 'textsc', 'textsf', 'textrm', 'textsl', 'textup', 'texttt', 'mbox', 'fbox', 'section', 'subsection', 'subsubsection', 'em', 'bf', 'emph', 'it', 'enumerate', 'description', 'itemize', 'trivlist', 'list', 'proof', 'theorem', 'lemma', 'thm', 'prop', 'lem', 'table', 'tabular', 'tabbing', 'caption', 'footnote', 'center', 'flushright', 'document', 'article', 'titlepage', 'title', 'author', 'titlerunninghead', 'authorrunninghead', 'affil', 'email', 'abstract', 'thanks', 'algorithm', 
'nonumalgorithm', 'references', 'thebibliography', 'bibitem', 'verbatim', 'verbatimtab', 'quotation', 'quote', ); # these form sets of simple mode delimiters # %MATHBRACKETS = ( '$$' => '$$', '$' => '$', # '\[' => '\]', # these are problematic and handled separately now # '\(' => '\)', ); # these commands require no braces, and their parameters are simply the # "word" following the command declaration # %BRACELESS = map { $_ => true } ( 'oddsidemargin', 'evensidemargin', 'topmargin', 'headheight', 'headsep', 'textwidth', 'textheight', 'input', ); # default value controlling how fatal parse errors are # # 0 = warn, 1 = die, 2 = silent # $PARSE_ERRORS_FATAL = 0; # debugging mode (internal use) # # 0 = off, 1 = messages, 2 = messages and code # $DEBUG = 0; # END CONFIG SECTION ########################################################## sub new { my $class = shift; return __PACKAGE__->SUPER::new(@_); } 1; =head1 NAME LaTeX::TOM - A module for parsing, analyzing, and manipulating LaTeX documents. =head1 SYNOPSIS use LaTeX::TOM; $parser = LaTeX::TOM->new; $document = $parser->parseFile('mypaper.tex'); $latex = $document->toLaTeX; $specialnodes = $document->getNodesByCondition(sub { my $node = shift; return ( $node->getNodeType eq 'TEXT' && $node->getNodeText =~ /magic string/ ); }); $sections = $document->getNodesByCondition(sub { my $node = shift; return ( $node->getNodeType eq 'COMMAND' && $node->getCommandName =~ /section$/ ); }); $indexme = $document->getIndexableText; $document->print; =head1 DESCRIPTION This module provides a parser which parses and interprets (though not fully) LaTeX documents and returns a tree-based representation of what it finds. This tree is a C<LaTeX::TOM::Tree>. The tree contains C<LaTeX::TOM::Node> nodes. This module should be especially useful to anyone who wants to do processing of LaTeX documents that requires extraction of plain-text information, or altering of the plain-text components (or alternatively, the math-text components). 
=head1 COMPONENTS =head2 LaTeX::TOM::Parser The parser recognizes 3 parameters upon creation. The parameters, in order, are =over 4 =item parse error handling (= B<0> || 1 || 2) Determines what happens when a parse error is encountered. C<0> results in a warning. C<1> results in a die. C<2> results in silence. Note that particular groupings in LaTeX (i.e. newcommands and the like) contain invalid TeX or LaTeX, so you nearly always need this parameter to be C<0> or C<2> to completely parse the document. =item read inputs flag (= 0 || B<1>) This flag determines whether a scan for C<\input> and C<\input-like> commands is performed, and the resulting called files parsed and added to the parent parse tree. C<0> means no, C<1> means do it. Note that this will happen recursively if it is turned on. Also, bibliographies (F<.bbl> files) are detected and included. =item apply mappings flag (= 0 || B<1>) This flag determines whether (most) user-defined mappings are applied. This means C<\defs>, C<\newcommands>, and C<\newenvironments>. This is critical for properly analyzing the content of the document, as this must be phrased in terms of the semantics of the original TeX and LaTeX commands, not ad hoc user macros. So, for instance, do not expect plain-text extraction to work properly with this option off. =back The parser returns a C<LaTeX::TOM::Tree> ($document in the SYNOPSIS). =head2 LaTeX::TOM::Node Nodes may be of the following types: =over 4 =item TEXT C<TEXT> nodes can be thought of as representing the plain-text portions of the LaTeX document. This includes math and anything else that is not a recognized TeX or LaTeX command, or user-defined command. In reality, C<TEXT> nodes contain commands that this parser does not yet recognize the semantics of. =item COMMAND A C<COMMAND> node represents a TeX command. It always has child nodes in a tree, though the tree might be empty if the command operates on zero parameters. 
An example of a command is

 \textbf{blah}

This would parse into a C<COMMAND> node for C<textbf>, which would have a subtree
containing the C<TEXT> node with text ``blah.''

=item ENVIRONMENT

Similarly, TeX environments parse into C<ENVIRONMENT> nodes, which have metadata
about the environment, along with a subtree representing what is contained in
the environment.

For example,

 \begin{equation}
   r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
 \end{equation}

Would parse into an C<ENVIRONMENT> node of the class ``equation'' with a child
tree containing the result of parsing C<``r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.''>

=item GROUP

A C<GROUP> is like an anonymous C<COMMAND>. Since you can put whatever you want in
curly-braces (C<{}>) in TeX in order to make semantically isolated regions, this
separation is preserved by the parser. A C<GROUP> is just the subtree of the
parsed contents of plain curly-braces.

It is important to note that currently only the first C<GROUP> in a series of
C<GROUP>s following a LaTeX command will actually be parsed into a C<COMMAND>
node. The reason is that, for the initial purposes of this module, it was not
necessary to recognize additional C<GROUP>s as additional parameters to the
C<COMMAND>. However, this is something that this module really should do
eventually. Currently if you want all the parameters to a multi-parametered
command, you'll need to pick out all the following C<GROUP> nodes yourself.

Eventually this will become something like a list which is stored in the
C<COMMAND> node, much like L<XML::DOM>'s treatment of attributes. These are, in a
sense, apart from the rest of the document tree. Then C<GROUP> nodes will become
much more rare.

=item COMMENT

A C<COMMENT> node is very similar to a C<TEXT> node, except it is specifically for
lines beginning with C<``%''> (the TeX comment delimiter) or the right-hand
portion of a line that has C<``%''> at some internal point.

=back

=head2 LaTeX::TOM::Trees

As mentioned before, the Tree is the return result of a parse.

The tree is nothing more than an arrayref of Nodes, some of which may contain
their own trees.
This is useful knowledge at this point, since the user isn't provided
with a full suite of convenient tree-modification methods. However, Trees do
already have some very convenient methods, described in the next section.

=head1 METHODS

=head2 LaTeX::TOM

=head3 new

=over 4

=item C<>

Instantiate a new parser object.

=back

In this section all of the methods for each of the components are listed and
described.

=head2 LaTeX::TOM::Parser

The methods for the parser (aside from the constructor, discussed above) are:

=head3 parseFile (filename)

=over 4

=item C<>

Read in the contents of I<filename> and parse them, returning a
C<LaTeX::TOM::Tree>.

=back

=head3 parse (string)

=over 4

=item C<>

Parse the string I<string> and return a C<LaTeX::TOM::Tree>.

=back

=head2 LaTeX::TOM::Tree

This section contains methods for the Trees returned by the parser.

=head3 copy

=over 4

=item C<>

Duplicate a tree into new memory.

=back

=head3 print

=over 4

=item C<>

A debug print of the structure of the tree.

=back

=head3 plainText

=over 4

=item C<>

Returns an arrayref which is a list of strings representing the text of all
C<TEXT> nodes with the plain-text flag set (see C<getNodePlainTextFlag>), in an
inorder traversal.

=back

=head3 indexableText

=over 4

=item C<>

A method like the above but which goes one step further; it cleans all of the
returned text and concatenates it into a single string which one could consider
having all of the standard information retrieval value for the document,
making it useful for indexing.

=back

=head3 toLaTeX

=over 4

=item C<>

Return a string representing the LaTeX encoded by the tree. This is especially
useful to get a normal document again, after modifying nodes of the tree.

=back

=head3 getTopLevelNodes

=over 4

=item C<>

Return a list of C<LaTeX::TOM::Node> objects at the top level of the Tree.

=back

=head3 getAllNodes

=over 4

=item C<>

Return an arrayref with B<all> nodes of the tree. This "flattens" the tree.

=back

=head3 getCommandNodesByName (name)

=over 4

=item C<>

Return an arrayref with all C<COMMAND> nodes in the tree which have a name
matching I<name>.

=back

=head3 getEnvironmentsByName (name)

=over 4

=item C<>

Return an arrayref with all C<ENVIRONMENT> nodes in the tree which have a class
matching I<name>.

=back

=head3 getNodesByCondition (code reference)

=over 4

=item C<>

This is a catch-all search method which can be used to pull out nodes that
match pretty much any perl expression, without manually having to traverse the
tree. I<code reference> is a perl code reference which receives as its first
argument the node of the tree that is currently scrutinized and is expected to
return a boolean value. See the SYNOPSIS for examples.

=back

=head3 getFirstNode

=over 4

=item C<>

Returns the first node of the tree. This is useful if you want to walk the tree
yourself, starting with the first node.

=back

=head2 LaTeX::TOM::Node

This section contains the methods for nodes of the parsed Trees.

=head3 getNodeType

=over 4

=item C<>

Returns the type, one of C<TEXT>, C<COMMAND>, C<ENVIRONMENT>, C<GROUP>, or
C<COMMENT>, as described above.

=back

=head3 getNodeText

=over 4

=item C<>

Applicable for C<TEXT> or C<COMMENT> nodes; this returns the document text they
contain. This is undef for other node types.

=back

=head3 setNodeText

=over 4

=item C<>

Set the node text, also for C<TEXT> and C<COMMENT> nodes.

=back

=head3 getNodeStartingPosition

=over 4

=item C<>

Get the starting character position in the document of this node. For C<TEXT>
and C<COMMENT> nodes, this will be where the text begins. For C<ENVIRONMENT>,
C<COMMAND>, or C<GROUP> nodes, this will be the position of the I<last>
character of the opening identifier.

=back

=head3 getNodeEndingPosition

=over 4

=item C<>

Same as above, but for last character. For C<GROUP>, C<ENVIRONMENT>, or
C<COMMAND> nodes, this will be the I<first> character of the closing identifier.

=back

=head3 getNodeOuterStartingPosition

=over 4

=item C<>

Same as getNodeStartingPosition, but for C<GROUP>, C<ENVIRONMENT>, or C<COMMAND>
nodes, this returns the I<first> character of the opening identifier.

=back

=head3 getNodeOuterEndingPosition

=over 4

=item C<>

Same as getNodeEndingPosition, but for C<GROUP>, C<ENVIRONMENT>, or C<COMMAND>
nodes, this returns the I<last> character of the closing identifier.

=back

=head3 getNodeMathFlag

=over 4

=item C<>

This applies to any node type. It is C<1> if the node sets, or is contained
within, a math mode region. C<0> otherwise. C<TEXT> nodes which have this flag as
C<1> can be assumed to be the actual mathematics contained in the document.

=back

=head3 getNodePlainTextFlag

=over 4

=item C<>

This applies only to C<TEXT> nodes. It is C<1> if the node is non-math B<and> is
visible (in other words, will end up being a part of the output document). One
would only want to index C<TEXT> nodes with this property, for information
retrieval purposes.

=back

=head3 getEnvironmentClass

=over 4

=item C<>

This applies only to C<ENVIRONMENT> nodes. Returns what class of environment the
node represents (the C<X> in C<\begin{X}> and C<\end{X}>).

=back

=head3 getCommandName

=over 4

=item C<>

This applies only to C<COMMAND> nodes. Returns the name of the command (the C<X>
in C<\X{...}>).

=back

=head3 getChildTree

=over 4

=item C<>

This applies only to C<COMMAND>, C<ENVIRONMENT>, and C<GROUP> nodes: it returns
the C<LaTeX::TOM::Tree> which is ``under'' the calling node.

=back

=head3 getFirstChild

=over 4

=item C<>

This applies only to C<COMMAND>, C<ENVIRONMENT>, and C<GROUP> nodes: it returns
the first node from the first level of the child subtree.

=back

=head3 getLastChild

=over 4

=item C<>

Same as above, but for the last node of the first level.

=back

=head3 getPreviousSibling

=over 4

=item C<>

Return the prior node on the same level of the tree.

=back

=head3 getNextSibling

=over 4

=item C<>

Same as above, but for following node.

=back

=head3 getParent

=over 4

=item C<>

Get the parent node of this node in the tree.

=back

=head3 getNextGroupNode

=over 4

=item C<>

This is an interesting function, and kind of a hack because of the way the
parser makes the current tree. Basically it will give you the next sibling that
is a C<GROUP> node, until it either hits the end of the tree level, a C<TEXT>
node which is not blank, or a C<COMMAND> node.

This is useful for finding all C<GROUP>ed parameters after a C<COMMAND> node (see
comments for C<GROUP> in the C<COMPONENTS> / C<LaTeX::TOM::Node> section).
You can just have a while loop that calls this method until it gets C<undef>,
and you'll know you've found all the parameters to a command.

Note: this may be bad, but C<TEXT> nodes that look like optional parameter
groups are treated as if they were 'blank'.

=back

=head1 CAVEATS

Due to the lack of tree-modification methods, currently this module is
mostly useful for minor modifications to the parsed document, for instance,
altering the text of C<TEXT> nodes but not deleting the nodes. Of course, the
user can still do this by breaking abstraction and directly modifying the Tree.

Also note that the parsing is not complete. This module was not written with
the intention of being able to produce output documents the way ``latex'' does.
The intent was instead to be able to analyze and modify the document on a
logical level with regard to the content; it doesn't care about the document
formatting and outputting side of TeX/LaTeX.

There is much work still to be done. See the F list in the F source.

=head1 BUGS

Probably plenty. However, this module has performed fairly well on a set of
~1000 research publications from the Computing Research Repository, so I
deemed it ``good enough'' to use for purposes similar to mine.

Please let the maintainer know of parser errors if you discover any.

=head1 CREDITS

Thanks to the following people (in order of appearance), who have contributed
valuable suggestions and patches:

 Otakar Smrz
 Moritz Lenz
 James Bowlin
 Jesse S. Bangs

=head1 AUTHORS

Written by Aaron Krowne

Maintained by Steven Schubiger

=head1 LICENSE

This program is free software; you may redistribute it and/or
modify it under the same terms as Perl itself.

See L =cut LaTeX-TOM-1.03/lib/LaTeX/TOM000755001750001750 011675167673 14061 5ustar00stssts000000000000LaTeX-TOM-1.03/lib/LaTeX/TOM/Parser.pm000444001750001750 15543111675167673 16061 0ustar00stssts000000000000############################################################################### # # LaTeX::TOM::Parser # # The parsing class # ############################################################################### package LaTeX::TOM::Parser; use strict; use base qw( LaTeX::TOM::Node LaTeX::TOM::Tree ); use constant true => 1; use constant false => 0; use Carp qw(carp croak); use File::Basename qw(fileparse); our $VERSION = '0.07'; my %error_handlers = ( 0 => sub { warn "parse error: $_[0].\n" }, 1 => sub { die "parse error: $_[0].\n" }, 2 => sub {}, ); # Constructor # sub new { my $class = shift; no strict 'refs'; my $self = bless { config => { BRACELESS => \%{"${class}::BRACELESS"}, INNERCMDS => \%{"${class}::INNERCMDS"}, MATHENVS => \%{"${class}::MATHENVS"}, MATHBRACKETS => \%{"${class}::MATHBRACKETS"}, PARSE_ERRORS_FATAL => ${"${class}::PARSE_ERRORS_FATAL"}, TEXTENVS => \%{"${class}::TEXTENVS"}, }, }; $self->_init(@_); return $self; } # Set/reset "globals" # sub _init { my $parser = shift; my ($parse_errors_fatal, $readinputs, $applymappings) = @_; my $retrieve_opt_default = sub { my ($opt, $default) = @_; return $opt if defined $opt; return $default; }; # set user options # $parser->{readinputs} = $retrieve_opt_default->($readinputs, 0); $parser->{applymappings} = $retrieve_opt_default->($applymappings, 0); $parser->{PARSE_ERRORS_FATAL} = $retrieve_opt_default->($parse_errors_fatal, $parser->{config}{PARSE_ERRORS_FATAL}); # init internal stuff # $parser->{MATHBRACKETS} = $parser->{config}{MATHBRACKETS}; # this will hold a running list/hash of commands that have been remapped $parser->{MAPPEDCMDS} = {}; # this will hold a running list/hash of commands that have been used. We dont # bother apply mappings except to commands that have been used. 
$parser->{USED_COMMANDS} = {}; # no file yet $parser->{file} = undef; } # Parse a LaTeX file, return a tree. You probably want this method. # sub parseFile { my $parser = shift; my $filename = shift; # init variables # $parser->{file} = $filename; # file name member data my $tree = {}; # init output tree # read in text from file or bomb out # my $text = _readFile($filename, true); # do the parse # $tree = $parser->parse($text); return $tree; } # main parsing entrypoint # sub parse { my $parser = shift; my ($text) = @_; # first half of parsing (goes up to finding commands, reading inputs) # my ($tree, $bracehash) = $parser->_parseA($text); _debug( 'done with _parseA', sub { $tree->_warn() }, ); # handle mappings # $parser->_applyMappings($tree) if $parser->{applymappings}; _debug( 'done with _applyMappings', sub { $tree->_warn() }, ); # second half of parsing (environments) # $parser->_parseB($tree); _debug( 'done with _parseB', sub { $tree->_warn() }, ); # once all the above is done we can propegate math/plaintext modes down # $parser->_propegateModes($tree, 0, 0); # math = 0, plaintext = 0 _debug( 'done with _propegateModes', sub { $tree->_warn() }, ); # handle kooky \[ \] math mode # if (not exists $parser->{MAPPEDCMDS}->{'\\['}) { # math mode (\[ \], \( \)) $parser->_stage5($tree, {'\\[' => '\\]', '\\(' => '\\)'}, 1); $parser->_propegateModes($tree, 0, 0); # have to do this again of course $parser->{MATHBRACKETS}->{'\\['} = '\\]'; # put back in brackets list for $parser->{MATHBRACKETS}->{'\\('} = '\\)'; # printing purposes. } _debug( undef, sub { $tree->_warn() }, ); $tree->listify; # add linked-list stuff return $tree; } # Parsing with no mappings and no externally accessible parser object. # sub _basicparse { my $parser = shift; # @_ would break code my $text = shift; my $parse_errors_fatal = (defined $_[0] ? $_[0] : $parser->{config}{PARSE_ERRORS_FATAL}); my $readinputs = (defined $_[1] ? 
$_[1] : 1); $parser = LaTeX::TOM::Parser->new($parse_errors_fatal, $readinputs); my ($tree, $bracehash) = $parser->_parseA($text); $parser->_parseB($tree); $tree->listify; # add linked-list stuff return ($tree, $bracehash); } # start the tree. separate out comment and text nodes. # sub _stage1 { my $parser = shift; my $text = shift; my @nodes = _getTextAndCommentNodes($text, 0, length($text)); return LaTeX::TOM::Tree->new([@nodes]); } # this stage parses the braces ({}) and adds the corresponding structure to # the tree. # sub _stage2 { my $parser = shift; my $tree = shift; my $bracehash = shift || undef; my $startidx = shift || 0; # last two params for starting at some specific my $startpos = shift || 0; # node and offset. my %blankhash; if (not defined $bracehash) { $bracehash = {%blankhash}; } my $leftidx = -1; my $leftpos = -1; my $leftcount = 0; # loop through the nodes for (my $i = $startidx; $i < @{$tree->{nodes}}; $i++) { my $node = $tree->{nodes}[$i]; my $spos = $node->{start}; # get text start position # set position placeholder within the text block my $pos = ($i == $startidx) ? 
$startpos : 0; if ($node->{type} eq 'TEXT') { _debug("parseStage2: looking at text node: [$node->{content}]", undef); my ($nextpos, $brace) = _findbrace($node->{content}, $pos); while ($nextpos != -1) { $pos = $nextpos + 1; # update position pointer # handle left brace if ($brace eq '{') { _debug("found '{' at position $nextpos, leftcount is $leftcount", undef); if ($leftcount == 0) { $leftpos = $nextpos; $leftidx = $i } $leftcount++; } # handle right brance elsif ($brace eq '}') { _debug("found '}' at position $nextpos, leftcount is $leftcount", undef); my $rightpos = $nextpos; $leftcount--; # found the corresponding right brace to our starting left brace if ($leftcount == 0) { # see if we have to split the text node into 3 parts # if ($leftidx == $i) { my ($leftside, $textnode3) = $node->split($rightpos, $rightpos); my ($textnode1, $textnode2) = $leftside->split($leftpos, $leftpos); # make the new GROUP node my $groupnode = LaTeX::TOM::Node->new( {type => 'GROUP', start => $textnode2->{start} - 1, end => $textnode2->{end} + 1, children => LaTeX::TOM::Tree->new([$textnode2]), }); # splice the new subtree into the old location splice @{$tree->{nodes}}, $i, 1, $textnode1, $groupnode, $textnode3; # add to the brace-pair lookup table $bracehash->{$groupnode->{start}} = $groupnode->{end}; $bracehash->{$groupnode->{end}} = $groupnode->{start}; # recur into new child node $parser->_stage2($groupnode->{children}, $bracehash); $i++; # skip to textnode3 for further processing } # split across nodes # else { my ($textnode1, $textnode2) = $tree->{nodes}[$leftidx]->split($leftpos, $leftpos); my ($textnode3, $textnode4) = $node->split($rightpos, $rightpos); # remove nodes in between the node we found '{' in and the node # we found '}' in # my @removed = splice @{$tree->{nodes}}, $leftidx+1, $i-$leftidx-1; # create a group node that contains the text after the left brace, # then all the nodes up until the next text node, then the text # before the right brace. 
# my $groupnode = LaTeX::TOM::Node->new( {type => 'GROUP', start => $textnode2->{start} - 1, end => $textnode3->{end} + 1, children => LaTeX::TOM::Tree->new( [$textnode2, @removed, $textnode3]), }); # replace the two original text nodes with the leftover left and # right portions, as well as the group node with everything in # the middle. # splice @{$tree->{nodes}}, $leftidx, 2, $textnode1, $groupnode, $textnode4; # add to the brace-pair lookup table $bracehash->{$groupnode->{start}} = $groupnode->{end}; $bracehash->{$groupnode->{end}} = $groupnode->{start}; # recur into new child nodes $parser->_stage2($groupnode->{children}, $bracehash); # step back to textnode4 on this level for further processing $i -= scalar @removed; } $leftpos = -1; # reset left data $leftidx = -1; last; } # $leftcount == 0 # check for '}'-based error # if ($leftcount < 0) { $error_handlers{$parser->{PARSE_ERRORS_FATAL}}->("'}' before '{' at " . ($spos + $rightpos)); $leftcount = 0; # reset and continue } } # right brace ($nextpos, $brace) = _findbrace($node->{content}, $pos); } # while (braces left) } # if TEXT } # loop over all nodes # check for extra '{' parse error # if ($leftcount > 0) { my $spos = $tree->{nodes}[$leftidx]->{start}; # get text start position $error_handlers{$parser->{PARSE_ERRORS_FATAL}}->("unmatched '{' at " . ($spos + $leftpos)); # try to continue on, after the offending brace $parser->_stage2($tree, $bracehash, $leftidx, $leftpos + 1); } return $bracehash; } # this stage finds LaTeX commands and accordingly turns GROUP nodes into # command nodes, labeled with the command # sub _stage3 { my $parser = shift; my $tree = shift; my $parent = shift; for (my $i = 0; $i< @{$tree->{nodes}}; $i++) { my $node = $tree->{nodes}[$i]; # check text node for command tag if ($node->{type} eq 'TEXT') { my $text = $node->{content}; # inner command (such as {\command text text}). 
our regexp checks to see # if this text chunk begins with \command, since that would be the case # due to the previous parsing stages. if found, the parent node is # promoted to a command. # if ($text =~ /^\s*\\(\w+\*?)/ && defined $parent && $parser->{config}{INNERCMDS}->{$1}) { my $command = $1; # if the parent is already a command node, we have to make a new # nested command node # if ($parent->{type} eq 'COMMAND') { # make a new command node my $newnode = LaTeX::TOM::Node->new( {type => 'COMMAND', command => $command, start => $parent->{start}, end => $parent->{end}, position => 'inner', children => $parent->{children} }); # point parent to it $parent->{children} = LaTeX::TOM::Tree->new([$newnode]); # start over at this level (get additional inner commands) $parent = $newnode; $i = -1; $parser->{USED_COMMANDS}->{$newnode->{command}} = 1; } # parent is a naked group, we can make it into a command node # elsif ($parent->{type} eq 'GROUP') { $parent->{type} = 'COMMAND'; $parent->{command} = $command; $parent->{position} = 'inner'; # start over at this level $i = -1; $parser->{USED_COMMANDS}->{$parent->{command}} = 1; } $node->{content} =~ s/^\s*\\(?:\w+\*?)//o; } # outer command (such as \command{parameters}). our regexp checks to # see if this text chunk ends in \command, since that would be the case # due to the previous parsing stages. 
# if ($text =~ /(?:^|[^\\])(\\\w+\*?(\s*\[.*?\])?)\s*$/os && defined $tree->{nodes}[$i+1] && $tree->{nodes}[$i+1]->{type} eq 'GROUP') { my $tag = $1; _debug("found text node [$text] with command tag [$tag]", undef); # remove the text $node->{content} =~ s/\\\w+\*?\s*(?:\[.*?\])?\s*$//os; # parse it for command and ops $tag =~ /^\\(\w+\*?)\s*(?:\[(.*?)\])?$/os; my $command = $1; my $opts = $2; # make the next node a command node with the above data my $next = $tree->{nodes}[$i+1]; $next->{type} = 'COMMAND'; $next->{command} = $command; $next->{opts} = $opts; $next->{position} = 'outer'; $parser->{USED_COMMANDS}->{$next->{command}} = 1; } # recognize braceless commands # if ($text =~ /(\\(\w+\*?)[ \t]+(\S+))/gso || $text =~ /(\\(\w+)(\d+))/gso) { my $all = $1; my $command = $2; my $param = $3; if ($parser->{config}{BRACELESS}->{$command}) { # warn "found braceless command $command with param $param"; # get location to split from node text my $a = index $node->{content}, $all, 0; my $b = $a + length($all) - 1; # make all the new nodes # new left and right text nodes my ($leftnode, $rightnode) = $node->split($a, $b); # param contents node my $pstart = index $node->{content}, $param, $a; my $newchild = LaTeX::TOM::Node->new( {type => 'TEXT', start => $node->{start} + $pstart, end => $node->{start} + $pstart + length($param) - 1, content => $param }); # new command node my $commandnode = LaTeX::TOM::Node->new( {type => 'COMMAND', braces => 0, command => $command, start => $node->{start} + $a, end => $node->{start} + $b, children => LaTeX::TOM::Tree->new([$newchild]), }); $parser->{USED_COMMANDS}->{$commandnode->{command}} = 1; # splice these all into the original array splice @{$tree->{nodes}}, $i, 1, $leftnode, $commandnode, $rightnode; # make the rightnode the node we're currently analyzing $node = $rightnode; # make sure outer loop will continue parsing *after* rightnode $i += 2; } } } # recur if ($node->{type} eq 'GROUP' || $node->{type} eq 'COMMAND') { 
$parser->_stage3($node->{children}, $node); } } } # this stage finds \begin{x} \end{x} environments and shoves their contents # down into a new child node, with a parent node of ENVIRONMENT type. # # this has the effect of making the tree deeper, since much of the structure # is in environment tags and will now be picked up. # # for ENVIRONMENTs, "start" means the ending } on the \begin tag, # "end" means the starting \ on the \end tag, # "ostart" is the starting \ on the "begin" tag, # "oend" is the ending } on the "end" tag, and # and "class" is the "x" from above. # sub _stage4 { my $parser = shift; my $tree = shift; my $bcount = 0; # \begin "stack count" my $class = ""; # environment class my $bidx = 0; # \begin array index. for (my $i = 0; $i < @{$tree->{nodes}}; $i++) { my $node = $tree->{nodes}->[$i]; # see if this is a "\begin" command node if ($node->{type} eq 'COMMAND' && $node->{command} eq 'begin') { _debug("parseStage4: found a begin COMMAND node, $node->{children}->{nodes}[0]->{content} @ $node->{start}", undef); # start a new "stack" if ($bcount == 0) { $bidx = $i; $bcount++; $class = $node->{children}->{nodes}->[0]->{content}; _debug("parseStage4: opening environment tag found, class = $class", undef); } # add to the "stack" elsif ($node->{children}->{nodes}->[0]->{content} eq $class) { $bcount++; _debug("parseStage4: incrementing tag count for $class", undef); } } # handle "\end" command nodes elsif ($node->{type} eq 'COMMAND' && $node->{command} eq 'end' && $node->{children}->{nodes}->[0]->{content} eq $class) { $bcount--; _debug("parseStage4: decrementing tag count for $class", undef); # we found our closing "\end" tag. replace everything with the proper # ENVIRONMENT tag and subtree. 
# if ($bcount == 0) { _debug("parseStage4: closing environment $class", undef); # first we must take everything between the "\begin" and "\end" # nodes and put them in a new array, removing them from the old one my @newarray = splice @{$tree->{nodes}}, $bidx+1, $i - ($bidx + 1); # make the ENVIRONMENT node my $start = $tree->{nodes}[$bidx]->{end}; my $end = $node->{start}; my $envnode = LaTeX::TOM::Node->new( {type => 'ENVIRONMENT', class => $class, start => $start, # "inner" start and end end => $end, ostart => $start - length('begin') - length($class) - 2, oend => $end + length('end') + length($class) + 2, children => LaTeX::TOM::Tree->new([@newarray]), }); if ($parser->{config}{MATHENVS}->{$envnode->{class}}) { $envnode->{math} = 1; } # replace the \begin and \end COMMAND nodes with the single # environment node splice @{$tree->{nodes}}, $bidx, 2, $envnode; $class = ""; # reset class. # i is going to change by however many nodes we removed $i -= scalar @newarray; # recur into the children $parser->_stage4($envnode->{children}); } } # recur in general elsif ($node->{children}) { $parser->_stage4($node->{children}); } } # parse error if we're missing an "\end" tag. if ($bcount > 0) { $error_handlers{$parser->{PARSE_ERRORS_FATAL}}->( "missing \\end{$class} for \\begin{$class} at position $tree->{nodes}[$bidx]->{end}" ); } } # This is the "math" stage: here we grab simple-delimeter math modes from # the text they are embedded in, and turn those into new groupings, with the # "math" flag set. 
# # having this top level to go over all the bracket types prevents some pretty # bad combinatorial explosion # sub _stage5 { my $parser = shift; my $tree = shift; my $caremath = shift || 0; my $brackets = $parser->{MATHBRACKETS}; # loop through all the different math mode bracket types foreach my $left (sort {length($b) <=> length($a)} keys %$brackets) { my $right = $brackets->{$left}; $parser->_stage5_r($tree, $left, $right, $caremath); } } # recursive meat of above # sub _stage5_r { my $parser = shift; my $tree = shift; my $left = shift; my $right = shift; my $caremath = shift || 0; # do we care if we're already in math mode? # this matters for \( \), \[ \] my $leftpos = -1; # no text pos for found left brace yet. my $leftidx = -1; # no array index for found left brace yet. # loop through the nodes for (my $i = 0; $i < scalar @{$tree->{nodes}}; $i++) { my $node = $tree->{nodes}[$i]; my $pos = 0; # position placeholder within the text block my $spos = $node->{start}; # get text start position if ($node->{type} eq 'TEXT' && (!$caremath || (!$node->{math} && $caremath))) { # search for left brace if we haven't started a pair yet if ($leftidx == -1) { $leftpos = _findsymbol($node->{content}, $left, $pos); if ($leftpos != -1) { _debug("found (left) $left in [$node->{content}]", undef); $leftidx = $i; $pos = $leftpos + 1; # next pos to search from } } # search for a right brace if ($leftpos != -1) { my $rightpos = _findsymbol($node->{content}, $right, $pos); # found if ($rightpos != -1) { # we have to split the text node into 3 parts if ($leftidx == $i) { _debug("splitwithin: found (right) $right in [$node->{content}]", undef); my ($leftnode, $textnode3) = $node->split($rightpos, $rightpos + length($right) - 1); my ($textnode1, $textnode2) = $leftnode->split($leftpos, $leftpos + length($left) - 1); my $startpos = $spos; # get text start position # make the math ENVIRONMENT node my $mathnode = LaTeX::TOM::Node->new( {type => 'ENVIRONMENT', class => $left, # use left 
delim as class math => 1, start => $startpos + $leftpos, ostart => $startpos + $leftpos - length($left) + 1, end => $startpos + $rightpos, oend => $startpos + $rightpos + length($right) - 1, children => LaTeX::TOM::Tree->new([$textnode2]), }); splice @{$tree->{nodes}}, $i, 1, $textnode1, $mathnode, $textnode3; $i++; # skip ahead two nodes, so we'll be parsing textnode3 } # split across nodes else { _debug("splitacross: found (right) $right in [$node->{content}]", undef); # create new set of 4 smaller text nodes from the original two # that contain the left and right delimeters # my ($textnode1, $textnode2) = $tree->{nodes}[$leftidx]->split($leftpos, $leftpos + length($left) - 1); my ($textnode3, $textnode4) = $tree->{nodes}[$i]->split($rightpos, $rightpos + length($right) - 1); # nodes to remove "from the middle" (between the left and right # text nodes which contain the delimeters) # my @remnodes = splice @{$tree->{nodes}}, $leftidx+1, $i - $leftidx - 1; # create a math node that contains the text after the left brace, # then all the nodes up until the next text node, then the text # before the right brace. # my $mathnode = LaTeX::TOM::Node->new( {type => 'ENVIRONMENT', class => $left, math => 1, start => $textnode2->{start} - 1, end => $textnode3->{end} + 1, ostart => $textnode2->{start} - 1 - length($left) + 1, oend => $textnode3->{end} + 1 + length($right) - 1, children => LaTeX::TOM::Tree->new( [$textnode2, @remnodes, $textnode3]), }); # replace (TEXT_A, ... , TEXT_B) with the mathnode created above splice @{$tree->{nodes}}, $leftidx, 2, $textnode1, $mathnode, $textnode4; # do all nodes again but the very leftmost # $i = $leftidx; } $leftpos = -1; # reset left data $leftidx = -1; } # right brace } # left brace else { my $rightpos = _findsymbol($node->{content}, $right, $pos); if ($rightpos != -1) { my $startpos = $node->{start}; # get text start position $error_handlers{$parser->{PARSE_ERRORS_FATAL}}->("unmatched '$right' at " . 
($startpos + $rightpos)); } } } # if TEXT # recur, but not into verbatim environments! # elsif ($node->{children} && !( ($node->{type} eq 'COMMAND' && $node->{command} =~ /^verb/) || ($node->{type} eq 'ENVIRONMENT' && $node->{class} =~ /^verbatim/))) { if ($LaTeX::TOM::DEBUG) { my $message = "Recurring into $node->{type} node "; $message .= $node->{command} if ($node->{type} eq 'COMMAND'); $message .= $node->{class} if ($node->{type} eq 'ENVIRONMENT'); _debug($message, undef); } $parser->_stage5_r($node->{children}, $left, $right, $caremath); } } # loop over text blocks if ($leftpos != -1) { my $startpos = $tree->{nodes}[$leftidx]->{start}; # get text start position $error_handlers{$parser->{PARSE_ERRORS_FATAL}}->("unmatched '$left' at " . ($startpos + $leftpos)); } } # This stage propegates the math mode flag and plaintext flags downward. # # After this is done, we can make the claim that only text nodes marked with # the plaintext flag should be printed. math nodes will have the "math" flag, # and also plantext = 0. # sub _propegateModes { my $parser = shift; my $tree = shift; my $math = shift; # most likely want to call this with 0 my $plaintext = shift; # ditto this-- default to nothing visible. foreach my $node (@{$tree->{nodes}}) { # handle text nodes on this level. set flags. 
        #
        if ($node->{type} eq 'TEXT') {
            $node->{math} = $math;
            $node->{plaintext} = $plaintext;
        }

        # propagate flags downward, possibly modified
        #
        elsif (defined $node->{children}) {

            my $mathflag = $math;      # math propagates down by default
            my $plaintextflag = 0;     # plaintext flag does NOT propagate by default

            # handle math or plain text forcing envs
            #
            if ($node->{type} eq 'ENVIRONMENT' || $node->{type} eq 'COMMAND') {
                if (defined $node->{class} && (
                        $parser->{config}{MATHENVS}->{$node->{class}} ||
                        $parser->{config}{MATHENVS}->{"$node->{class}*"})
                   )
                {
                    $mathflag = 1;
                    $plaintextflag = 0;
                }
                elsif (($node->{type} eq 'COMMAND' &&
                        ($parser->{config}{TEXTENVS}->{$node->{command}} ||
                         $parser->{config}{TEXTENVS}->{"$node->{command}*"})) ||
                       ($node->{type} eq 'ENVIRONMENT' &&
                        ($parser->{config}{TEXTENVS}->{$node->{class}} ||
                         # starred lookup uses the environment class here;
                         # ENVIRONMENT nodes carry no {command} key
                         $parser->{config}{TEXTENVS}->{"$node->{class}*"}))
                      ) {

                    $mathflag = 0;
                    $plaintextflag = 1;
                }
            }

            # groupings change nothing
            #
            elsif ($node->{type} eq 'GROUP') {
                $mathflag = $math;
                $plaintextflag = $plaintext;
            }

            # recur
            $parser->_propegateModes($node->{children}, $mathflag, $plaintextflag);
        }
    }
}

# apply a mapping to text nodes in a tree
#
# for newcommands and defs: mapping is a hash:
#
# {name, nparams, template, type}
#
# name is a string
# nparams is an integer
# template is a tree fragment containing text nodes with #x flags, where
# parameters will be replaced.
# type is "command"
#
# for newenvironments:
#
# {name, nparams, btemplate, etemplate, type}
#
# same as above, except type is "environment" and there are two templates,
# btemplate and etemplate.
#
sub _applyMapping {
    my $parser = shift;
    my $tree = shift;
    my $mapping = shift;
    my $i = shift || 0;    # index to start with, in tree.
my $applications = 0; # keep track of # of applications for (; $i < @{$tree->{nodes}}; $i++) { my $node = $tree->{nodes}[$i]; # begin environment nodes # if ($node->{type} eq 'COMMAND' && $node->{command} eq 'begin' && $node->{children}->{nodes}[0]->{content} eq $mapping->{name} ) { # grab the nparams next group nodes as parameters # my @params = (); my $remain = $mapping->{nparams}; my $j = 1; while ($remain > 0 && ($i + $j) < scalar @{$tree->{nodes}}) { my $node = $tree->{nodes}[$i + $j]; # grab group node if ($node->{type} eq 'GROUP') { push @params, $node->{children}; $remain--; } $j++; } # if we didn't get enough group nodes, bomb out next if $remain; # otherwise make new subtree my $applied = _applyParamsToTemplate($mapping->{btemplate}, @params); # splice in the result splice @{$tree->{nodes}}, $i, $j, @{$applied->{nodes}}; # skip past all the new stuff $i += scalar @{$applied->{nodes}} - 1; } # end environment nodes # elsif ($node->{type} eq 'COMMAND' && $node->{command} eq 'end' && $node->{children}->{nodes}[0]->{content} eq $mapping->{name} ) { # make new subtree (no params) my $applied = $mapping->{etemplate}->copy(); # splice in the result splice @{$tree->{nodes}}, $i, 1, @{$applied->{nodes}}; # skip past all the new stuff $i += scalar @{$applied->{nodes}} - 1; $applications++; # only count end environment nodes } # newcommand nodes # elsif ($node->{type} eq 'COMMAND' && $node->{command} eq $mapping->{name} && $mapping->{nparams} ) { my @params = (); # children of COMMAND node will be first parameter push @params, $node->{children}; # find next nparams GROUP nodes and push their children onto @params my $remain = $mapping->{nparams} - 1; my $j = 1; while ($remain > 0 && ($i + $j) < scalar @{$tree->{nodes}}) { my $node = $tree->{nodes}[$i + $j]; # grab group node if ($node->{type} eq 'GROUP') { push @params, $node->{children}; $remain--; } $j++; } # if we didn't get enough group nodes, bomb out next if ($remain > 0); # apply the params to the template my 
            $applied = _applyParamsToTemplate($mapping->{template}, @params);

            # splice in the result
            splice @{$tree->{nodes}}, $i, $j, @{$applied->{nodes}};

            # skip past all the new stuff
            $i += scalar @{$applied->{nodes}} - 1;

            $applications++;
        }

        # find 0-param mappings
        elsif ($node->{type} eq 'TEXT' && !$mapping->{nparams}) {

            my $text = $node->{content};
            my $command = $mapping->{name};

            # find occurrences of the mapping command
            #
            my $wordend = ($command =~ /\w$/ ? 1 : 0);

            while (($wordend && $text =~ /\\\Q$command\E(\W|$)/g) ||
                   (!$wordend && $text =~ /\\\Q$command\E/g)) {

                _debug("found occurrence of mapping $command", undef);

                my $idx = index $node->{content}, '\\' . $command, 0;

                # split the text node at that command
                my ($leftnode, $rightnode) = $node->split($idx, $idx + length($command));

                # copy the mapping template
                my $applied = $mapping->{template}->copy();

                # splice the new nodes in
                splice @{$tree->{nodes}}, $i, 1, $leftnode, @{$applied->{nodes}}, $rightnode;

                # adjust i so we end up on rightnode when we're done
                $i += scalar @{$applied->{nodes}} + 1;

                # get the next node (the node list lives under the 'nodes' key)
                $node = $tree->{nodes}[$i];

                # count application
                $applications++;
            }
        }

        # recur
        elsif ($node->{children}) {
            $applications += $parser->_applyMapping($node->{children}, $mapping);
        }
    }

    return $applications;
}

# find and apply all mappings in the tree, progressively and recursively.
# a mapping applies to the entire tree and subtree consisting of nodes AFTER
# itself in the level array.
#
sub _applyMappings {
    my $parser = shift;
    my $tree = shift;

    for (my $i = 0; $i < @{$tree->{nodes}}; $i++) {

        my $prev = $tree->{nodes}[$i-1];
        my $node = $tree->{nodes}[$i];

        # find newcommands
        if ($node->{type} eq 'COMMAND' &&
            $node->{command} =~ /^(re)?newcommand$/) {

            my $mapping = _makeMapping($tree, $i);
            next if (!$mapping->{name}); # skip fragged commands

            if ($parser->{USED_COMMANDS}->{$mapping->{name}}) {
                _debug("applying (nc) mapping $mapping->{name}", undef);
            } else {
                _debug("NOT applying (nc) mapping $mapping->{name}", undef);
                next;
            }

            # add to mappings list
            #
            $parser->{MAPPEDCMDS}->{"\\$mapping->{name}"} = 1;

            _debug("found a mapping with name $mapping->{name}, $mapping->{nparams} params", undef);

            # remove the mapping declaration
            #
            splice @{$tree->{nodes}}, $i, $mapping->{skip} + 1;

            # apply the mapping
            my $count = $parser->_applyMapping($tree, $mapping, $i);

            if ($count > 0) {
                _debug("printing altered subtree", sub { $tree->_warn() });
            }

            $i--; # since we removed the cmd node, check this index again
        }

        # handle "\newenvironment" mappings
        elsif ($node->{type} eq 'COMMAND' &&
               $node->{command} =~ /^(re)?newenvironment$/) {

            # make a mapping hash
            #
            my $mapping = $parser->_makeEnvMapping($tree, $i);
            next if (!$mapping->{name}); # skip fragged commands.

            _debug("applying (ne) mapping $mapping->{name}", undef);

            # remove the mapping declaration
            #
            splice @{$tree->{nodes}}, $i, $mapping->{skip} + 1;

            # apply the mapping
            #
            my $count = $parser->_applyMapping($tree, $mapping, $i);
        }

        # handle "\def" style commands.
elsif ($node->{type} eq 'COMMAND' && defined $prev && $prev->{type} eq 'TEXT' && $prev->{content} =~ /\\def\s*$/o) { _debug("found def style mapping $node->{command}", undef); # remove the \def $prev->{content} =~ s/\\def\s*$//o; # make the mapping my $mapping = {name => $node->{command}, nparams => 0, template => $node->{children}->copy(), type => 'command'}; next if (!$mapping->{name}); # skip fragged commands if ($parser->{USED_COMMANDS}->{$mapping->{name}}) { _debug("applying (def) mapping $mapping->{name}", undef); } else { _debug("NOT applying (def) mapping $mapping->{name}", undef); next; } # add to mappings list # $parser->{MAPPEDCMDS}->{"\\$mapping->{name}"} = 1; _debug("template is", sub { $mapping->{template}->_warn() }); # remove the command node splice @{$tree->{nodes}}, $i, 1; # apply the mapping my $count = $parser->_applyMapping($tree, $mapping, $i); $i--; # check this index again } # recur elsif ($node->{children}) { $parser->_applyMappings($node->{children}); } } } # read files from \input commands and place into the tree, parsed # # also include bibliographies # sub _addInputs { my $parser = shift; my $tree = shift; for (my $i = 0; $i < @{$tree->{nodes}}; $i++) { my $node = $tree->{nodes}[$i]; if ($node->{type} eq 'COMMAND' && $node->{command} eq 'input' ) { my $file = $node->{children}->{nodes}[0]->{content}; next if $file =~ /pstex/; # ignore pstex images _debug("reading input file $file", undef); my $contents; my $filename = fileparse($file); my $has_extension = qr/\.\S+$/; # read in contents of file if (-e $file && $filename =~ $has_extension) { $contents = _readFile($file); } elsif ($filename !~ $has_extension) { $file = "$file.tex"; $contents = _readFile($file) if -e $file; } # dump Psfig/TeX files, they aren't useful to us and have # nonconforming syntax. Use declaration line as our heuristic. # if (defined $contents && $contents =~ m!^ \% \s*? 
Psfig/TeX \s* $!mx ) { undef $contents; carp "ignoring Psfig input `$file'"; } # actually do the parse of the sub-content # if (defined $contents) { # parse into a tree my ($subtree,) = $parser->_basicparse($contents, $parser->{PARSE_ERRORS_FATAL}); # replace \input command node with subtree splice @{$tree->{nodes}}, $i, 1, @{$subtree->{nodes}}; # step back $i--; } } elsif ($node->{type} eq 'COMMAND' && $node->{command} eq 'bibliography' ) { # try to find a .bbl file # foreach my $file (<*.bbl>) { my $contents = _readFile($file); if (defined $contents) { my ($subtree,) = $parser->_basicparse($contents, $parser->{PARSE_ERRORS_FATAL}); splice @{$tree->{nodes}}, $i, 1, @{$subtree->{nodes}}; $i--; } } } # recur if ($node->{children}) { $parser->_addInputs($node->{children}); } } } # do pre-mapping parsing # sub _parseA { my $parser = shift; my $text = shift; my $tree = $parser->_stage1($text); my $bracehash = $parser->_stage2($tree); $parser->_stage3($tree); $parser->_addInputs($tree) if $parser->{readinputs}; return ($tree, $bracehash); } # do post-mapping parsing (make environments) # sub _parseB { my $parser = shift; my $tree = shift; $parser->_stage4($tree); _debug("done with parseStage4", undef); $parser->_stage5($tree, 0); _debug("done with parseStage5", undef); } ############################################################################### # # Parser "Static" Subroutines # ############################################################################### # find next unescaped char in some text # sub _uindex { my $text = shift; my $char = shift; my $pos = shift; my $realbrace = 0; my $idx = -1; # get next opening brace do { $realbrace = 1; $idx = index $text, $char, $pos; if ($idx != -1) { $pos = $idx + 1; my $prevchar = substr $text, $idx - 1, 1; if ($prevchar eq '\\') { $realbrace = 0; $idx = -1; } } } while (!$realbrace); return $idx; } # support function: find the next occurrence of some symbol which is # not escaped. 
# sub _findsymbol { my $text = shift; my $symbol = shift; my $pos = shift; my $realhit = 0; my $index = -1; # get next occurrence of the symbol do { $realhit = 1; $index = index $text, $symbol, $pos; if ($index != -1) { $pos = $index + 1; # make sure this occurrence isn't escaped. this is imperfect. # my $prevchar = ($index - 1 >= 0) ? (substr $text, $index - 1, 1) : ''; my $pprevchar = ($index - 2 >= 0) ? (substr $text, $index - 2, 1) : ''; if ($prevchar eq '\\' && $pprevchar ne '\\') { $realhit = 0; $index = -1; } } } while (!$realhit); return $index; } # support function: find the earliest next brace in some (flat) text # sub _findbrace { my $text = shift; my $pos = shift; my $realbrace = 0; my $index_o = -1; my $index_c = -1; my $pos_o = $pos; my $pos_c = $pos; # get next opening brace do { $realbrace = 1; $index_o = index $text, '{', $pos_o; if ($index_o != -1) { $pos_o = $index_o + 1; # make sure this brace isn't escaped. this is imperfect. # my $prevchar = ($index_o - 1 >= 0) ? (substr $text, $index_o - 1, 1) : ''; my $pprevchar = ($index_o - 2 >= 0) ? (substr $text, $index_o - 2, 1) : ''; if ($prevchar eq '\\' && $pprevchar ne '\\') { $realbrace = 0; $index_o = -1; } } } while (!$realbrace); # get next closing brace do { $realbrace = 1; $index_c = index $text, '}', $pos_c; if (($index_c - 1) >= 0 && substr($text, $index_c - 1, 1) eq ' ') { $pos_c = $index_c + 1; $index_c = -1; } if ($index_c != -1) { $pos_c = $index_c + 1; # make sure this brace isn't escaped. this is imperfect. # my $prevchar = ($index_c - 1 >= 0) ? (substr $text, $index_c - 1, 1) : ''; my $pprevchar = ($index_c - 2 >= 0) ? 
                (substr $text, $index_c - 2, 1) : '';
            if ($prevchar eq '\\' && $pprevchar ne '\\') {
                $realbrace = 0;
                $index_c = -1;
            }
        }
    } while (!$realbrace);

    # handle all find cases
    return (-1, '') if ($index_o == -1 && $index_c == -1);
    return ($index_o, '{') if ($index_c == -1 || ($index_o != -1 && $index_o < $index_c));
    return ($index_c, '}') if ($index_o == -1 || $index_c < $index_o);
}

# skip "blank nodes" in a tree, starting at some position. will finish
# at the first non-blank node (i.e., not a comment or whitespace TEXT node).
#
sub _skipBlankNodes {
    my $tree = shift;
    my $i = shift;

    while ($tree->{nodes}[$i]->{type} eq 'COMMENT' ||
           ($tree->{nodes}[$i]->{type} eq 'TEXT' &&
            $tree->{nodes}[$i]->{content} =~ /^\s*$/s)) {
        $i++;
    }

    return $i;
}

# is the passed-in node a valid parameter node? for this to be true, it must
# either be a GROUP or a position = inner command.
#
sub _validParamNode {
    my $node = shift;

    return 1 if ($node->{type} eq 'GROUP' ||
                 ($node->{type} eq 'COMMAND' && $node->{position} eq 'inner'));

    return 0;
}

# duplicate a valid param node. This means for a group, copy the child tree.
# for a command, make a new tree with just the command node and its child tree.
# sub _duplicateParam { my $parser = shift; my $node = shift; if ($node->{type} eq 'GROUP') { return $node->{children}->copy(); } elsif ($node->{type} eq 'COMMAND') { my $subtree = $node->{children}->copy(); # copy child subtree my $nodecopy = $node->copy(); # make a new node with old data $nodecopy->{children} = $subtree; # set the child pointer to new subtree # return a new tree with the new node (subtree) as its only element return LaTeX::TOM::Tree->new([$nodecopy]); } return undef; } # make a mapping from a newenvironment fragment # # newenvironments have the following syntax: # # \newenvironment{name}[nparams]?{beginTeX}{endTeX} # sub _makeEnvMapping { my $parser = shift; my $tree = shift; my $i = shift; return undef if ($tree->{nodes}[$i]->{type} ne 'COMMAND' || ($tree->{nodes}[$i]->{command} ne 'newenvironment' && $tree->{nodes}[$i]->{command} ne 'renewenvironment')); # figure out command (first child, text node) my $command = $tree->{nodes}[$i]->{children}->{nodes}[0]->{content}; if ($command =~ /^\s*\\(\S+)\s*$/) { $command = $1; } my $next = $i+1; # figure out number of params my $nparams = 0; if ($tree->{nodes}[$next]->{type} eq 'TEXT') { my $text = $tree->{nodes}[$next]->{content}; if ($text =~ /^\s*\[\s*([0-9])+\s*\]\s*$/) { $nparams = $1; } $next++; } # default templates-- just repeat the declarations # my ($btemplate) = $parser->_basicparse("\\begin{$command}", 2, 0); my ($etemplate) = $parser->_basicparse("\\end{$command}", 2, 0); my $endpos = $next; # get two group subtrees... one for the begin and one for the end # templates. 
we only ignore whitespace TEXT nodes and comments # $next = _skipBlankNodes($tree, $next); if (_validParamNode($tree->{nodes}[$next])) { $btemplate = $parser->_duplicateParam($tree->{nodes}[$next]); $next++; $next = _skipBlankNodes($tree, $next); if (_validParamNode($tree->{nodes}[$next])) { $etemplate = $parser->_duplicateParam($tree->{nodes}[$next]); $endpos = $next; } } # build and return the mapping hash # return {name => $command, nparams => $nparams, btemplate => $btemplate, # begin template etemplate => $etemplate, # end template skip => $endpos - $i, type => 'environment'}; } # make a mapping from a newcommand fragment # takes tree pointer and index of command node # # newcommands have the following syntax: # # \newcommand{\name}[nparams]?{anyTeX} # sub _makeMapping { my $tree = shift; my $i = shift; return undef if ($tree->{nodes}[$i]->{type} ne 'COMMAND' || ($tree->{nodes}[$i]->{command} ne 'newcommand' && $tree->{nodes}[$i]->{command} ne 'renewcommand')); # figure out command (first child, text node) my $command = $tree->{nodes}[$i]->{children}->{nodes}[0]->{content}; if ($command =~ /^\s*\\(\S+)\s*$/) { $command = $1; } my $next = $i+1; # figure out number of params my $nparams = 0; if ($tree->{nodes}[$next]->{type} eq 'TEXT') { my $text = $tree->{nodes}[$next]->{content}; if ($text =~ /^\s*\[\s*([0-9])+\s*\]\s*$/) { $nparams = $1; } $next++; } # grab subtree template (array ref) # my $template; if ($tree->{nodes}[$next]->{type} eq 'GROUP') { $template = $tree->{nodes}[$next]->{children}->copy(); } else { return undef; } # build and return the mapping hash # return {name => $command, nparams => $nparams, template => $template, skip => $next - $i, type => 'command'}; } # this sub is the main entry point for the sub that actually takes a set of # parameter trees and inserts them into a template tree. 
the return result, # newly allocated, should be plopped back into the original tree where the # parameters (along with the initial command invocation) # sub _applyParamsToTemplate { my $template = shift; my @params = @_; # have to copy the template to a freshly allocated tree # my $applied = $template->copy(); # now recursively apply the params. # _applyParamsToTemplate_r($applied, @params); return $applied; } # recursive helper for above # sub _applyParamsToTemplate_r { my $template = shift; my @params = @_; for (my $i = 0; $i < @{$template->{nodes}}; $i++) { my $node = $template->{nodes}[$i]; if ($node->{type} eq 'TEXT') { my $text = $node->{content}; # find occurrences of the parameter flags # if ($text =~ /(#([0-9]+))/) { my $all = $1; my $num = $2; # get the index of the flag we just found # my $idx = index $text, $all, 0; # split the node on the location of the flag # my ($leftnode, $rightnode) = $node->split($idx, $idx + length($all) - 1); # make a copy of the param we want # my $param = $params[$num - 1]->copy(); # splice the new text nodes, along with the parameter subtree, into # the old location # splice @{$template->{nodes}}, $i, 1, $leftnode, @{$param->{nodes}}, $rightnode; # skip forward to where $rightnode is in $template on next iteration # $i += scalar @{$param->{nodes}}; } } # recur elsif (defined $node->{children}) { _applyParamsToTemplate_r($node->{children}, @params); } } } # This sub takes a chunk of the document text between two points and makes # it into a list of TEXT nodes and COMMENT nodes, as we would expect from # '%' prefixed LaTeX comment lines # sub _getTextAndCommentNodes { my ($text, $begins, $ends) = @_; my $node_text = substr $text, $begins, $ends - $begins; _debug("getTextAndCommentNodes: looking at [$node_text]", undef); my $make_node = sub { my ($mode_type, $begins, $start_pos, $output) = @_; return LaTeX::TOM::Node->new({ type => uc $mode_type, start => $begins + $start_pos, end => $begins + $start_pos + length($output) - 1, 
            content => $output,
        });
    };

    my @lines = split (/( (?:\s* # whitespace (?($mode_type, $begins, $start_pos, $output);
            $start_pos += length($output); # update start position
            $output = $line;
            $mode_type = $line_type;
        }
    }

    push @nodes, $make_node->($mode_type, $begins, $start_pos, $output) if defined $output;

    return @nodes;
}

# Read in the contents of a text file on disk. Return in string scalar.
#
sub _readFile {
    my ($file, $raise_error) = @_;

    $raise_error ||= false;

    my $opened = open(my $fh, '<', $file);

    unless ($opened) {
        croak "Cannot open $file: $!" if $raise_error;
        return undef;
    }

    my $contents = do { local $/; <$fh> };
    close($fh);

    return $contents;
}

sub _debug {
    my ($message, $code) = @_;

    my $DEBUG = $LaTeX::TOM::DEBUG;

    return unless $DEBUG >= 1 && $DEBUG <= 2;

    my ($filename, $line) = (caller)[1,2];
    my $caller = join ':', (fileparse($filename))[0], $line;

    warn "$caller: $message\n" if $DEBUG >= 1 && defined $message;
    $code->() if $DEBUG == 2 && defined $code;
}

1;
LaTeX-TOM-1.03/lib/LaTeX/TOM/Node.pm000444001750001750 1073011675167673 15462 0ustar00stssts000000000000
###############################################################################
#
# LaTeX::TOM::Node
#
# This package defines an object for nodes in the TOM tree, and methods to go
# with them.
#
###############################################################################

package LaTeX::TOM::Node;

use strict;
use warnings;
use constant true  => 1;
use constant false => 0;

our $VERSION = '0.03';

# Make a new Node: turn input hash into object.
#
sub new {
    my $class = shift;
    my ($node) = @_;

    return bless $node || {};
}

# "copy constructor"
#
sub copy {
    my $node = shift;

    # shallow-copy the node's hash so the caller gets a distinct object;
    # blessing the input itself would hand back the original node
    return bless { %$node };
}

# Split a text node into two text nodes, with the first ending before point a,
# and the second starting after point b. actually returns NEW nodes, does not
# alter the input node.
#
# Note: a and b are relative to the contents of the node, not the original
# document.
#
# Note2: a and b are not jointly constrained.
You can split after location x # without losing any characters by setting a = x + 1 and b = x. # sub split { my $node = shift; my ($a, $b) = @_; return (undef) x 2 unless $node->{type} eq 'TEXT'; my $left_text = substr $node->{content}, 0, $a; my $right_text = substr $node->{content}, $b + 1, length($node->{content}) - $b; my $left_node = LaTeX::TOM::Node->new({ type => 'TEXT', start => $node->{start}, end => $node->{start} + $a - 1, content => $left_text, }); my $right_node = LaTeX::TOM::Node->new({ type => 'TEXT', start => $node->{start} + $b + 1, end => $node->{start} + length $node->{content}, content => $right_text, }); return ($left_node, $right_node); } # # accessor methods # sub getNodeType { my $node = shift; return $node->{type}; } sub getNodeText { my $node = shift; return $node->{content}; } sub setNodeText { my $node = shift; my ($text) = @_; $node->{content} = $text; } sub getNodeStartingPosition { my $node = shift; return $node->{start}; } sub getNodeEndingPosition { my $node = shift; return $node->{end}; } sub getNodeMathFlag { my $node = shift; return $node->{math} ? true : false; } sub getNodePlainTextFlag { my $node = shift; return $node->{plaintext} ? true : false; } sub getNodeOuterStartingPosition { my $node = shift; return (defined $node->{ostart} ? $node->{ostart} : $node->{start}); } sub getNodeOuterEndingPosition { my $node = shift; return (defined $node->{oend} ? 
$node->{oend} : $node->{end}); } sub getEnvironmentClass { my $node = shift; return $node->{class}; } sub getCommandName { my $node = shift; return $node->{command}; } # # linked-list accessors # sub getChildTree { my $node = shift; return $node->{children}; } sub getFirstChild { my $node = shift; if ($node->{children}) { return $node->{children}->{nodes}[0]; } return undef; } sub getLastChild { my $node = shift; if ($node->{children}) { return $node->{children}->{nodes}[-1]; } return undef; } sub getPreviousSibling { my $node = shift; return $node->{prev}; } sub getNextSibling { my $node = shift; return $node->{'next'}; } sub getParent { my $node = shift; return $node->{parent}; } # This is an interesting function, and kind of a hack because of the way the # parser makes the current tree. Basically it will give you the next sibling # that is a GROUP node, until it either hits the end of the tree level, a TEXT # node which doesn't match /^\s*$/, or a COMMAND node. # # This is useful for finding all GROUPed parameters after a COMMAND node. You # can just have a while loop that calls this method until it gets 'undef'. # # Note: this may be bad, but TEXT Nodes matching /^\s*\[[0-9]+\]$/ (optional # parameter groups) are treated as if they were whitespace. # sub getNextGroupNode { my $node = shift; my $next = $node; while ($next = $next->{'next'}) { # found a GROUP node. if ($next->{type} eq 'GROUP') { return $next; } # see if we should skip a node elsif ($next->{type} eq 'COMMENT' || ($next->{type} eq 'TEXT' && ($next->{content} =~ /^\s*$/ || $next->{content} =~ /^\s*\[\s*[0-9]+\s*\]\s*$/ ))) { next; } else { return undef; } } return undef; } 1; LaTeX-TOM-1.03/lib/LaTeX/TOM/Tree.pm000444001750001750 2123311675167673 15474 0ustar00stssts000000000000############################################################################### # # LaTeX::TOM::Tree # # This package defines a TOM Tree object. 
# ############################################################################### package LaTeX::TOM::Tree; use strict; use Carp qw(croak); our $VERSION = '0.04'; # "constructor" # sub new { my $class = shift; my $nodes = shift || []; # empty array for tree structure my $self = { nodes => $nodes, }; return bless $self; } # make a copy of a tree, recursively # sub copy { my $tree = shift; # input tree my @output; # output array (to become tree) foreach my $node (@{$tree->{nodes}}) { # make a copy of the node's hash definition # my $nodecopy = $node->copy(); # grab a copy of children, if any exist # if ($node->{children}) { my $children = $node->{children}->copy(); $nodecopy->{children} = $children; } # add hashref to new node to array for this level push @output, $nodecopy; } # each subtree is a tree return __PACKAGE__->new([@output]); } sub print { shift->_debug_tree(@_, sub { print STDOUT $_[0] }); } sub _warn { shift->_debug_tree(@_, sub { print STDERR $_[0] }); } # Print out the LaTeX "TOM" tree. Good for debugging our parser. # sub _debug_tree { my $tree = shift; my $output_handler = pop; my ($level) = @_; $level ||= 0; foreach my $node (@{$tree->{nodes}}) { my $spacer = ' ' x ($level * 2); $output_handler->($spacer); # print grouping/command info if ($node->{type} eq 'COMMAND') { $output_handler->(sprintf "($node->{type}) \\$node->{command} %s @ [$node->{start}, $node->{end}]", $node->{opts} ? 
"[$node->{opts}]" : "\b", ); } elsif ($node->{type} eq 'GROUP') { $output_handler->("($node->{type}) [$node->{start}, $node->{end}]"); } elsif ($node->{type} eq 'ENVIRONMENT') { $output_handler->("($node->{type}) $node->{class} @ inner [$node->{start}, $node->{end}] outer [$node->{ostart}, $node->{oend}]"); } elsif ($node->{type} =~ /^(?:TEXT|COMMENT)$/) { my $space_out = do { local $_ = "$spacer $node->{type} |"; s/[A-Z]/ /go; $_; }; my $max_len = 80 - length($space_out); my $print_text = do { local $_ = $node->{content}; s/^(.{0,$max_len}).*$/$1/gm; s/\n/\n$space_out/gs; $_; }; $output_handler->("($node->{type}) |$print_text\""); } $output_handler->(' ** math mode **') if $node->{math}; $output_handler->(' ** plaintext **') if $node->{plaintext}; $output_handler->("\n"); # recur if (defined $node->{children}) { my ($wrapper) = (caller(1))[3] =~ /.+::(.+)$/; $node->{children}->$wrapper($level + 1); } } } # pull out the plain text (non-math) TEXT nodes. returns an array of strings. # sub plainText { my $tree = shift; my $stringlist = []; foreach my $node (@{$tree->{nodes}}) { if ($node->{type} eq 'TEXT' && $node->{plaintext}) { push @$stringlist, $node->{content}; } if ($node->{children}) { push @$stringlist, @{$node->{children}->plainText()}; } } return $stringlist; } # Get the plaintext of a LaTeX DOM and whittle it down into a word list # suitable for indexing. # sub indexableText { my $tree = shift; my $pt = $tree->plainText(); my $text = join (' ', @$pt); # kill leftover commands $text =~ s/\\\w+\*?//gso; # kill nonpunctuation $text =~ s/[^\w\-0-9\s]//gso; # kill non-intraword hyphens $text =~ s/(\W)\-+(\W)/$1 $2/gso; $text =~ s/(\w)\-+(\W)/$1 $2/gso; $text =~ s/(\W)\-+(\w)/$1 $2/gso; # kill small words $text =~ s/\b[^\s]{1,2}\b//gso; # kill purely numerical "words" $text =~ s/\b[0-9]+\b//gso; # compress whitespace $text =~ s/\s+/ /gso; return $text; } # Convert tree to LaTeX. 
If our output doesn't compile to the same final # document, something is amiss (we don't, however, guarantee that the output # TeX will be identical to the input, due to certain normalizations.) # sub toLaTeX { my $tree = shift; my $parent = shift; my $str = ""; foreach my $node (@{$tree->{nodes}}) { if ($node->{type} eq 'TEXT' || $node->{type} eq 'COMMENT') { $str .= $node->{content}; } elsif ($node->{type} eq 'GROUP') { $str .= '{' . $node->{children}->toLaTeX($node) . '}'; } elsif ($node->{type} eq 'COMMAND') { if ($node->{position} eq 'outer') { $str .= "\\$node->{command}" . '{' . $node->{children}->toLaTeX($node) . '}'; } elsif ($node->{position} eq 'inner') { if (defined $parent && # dont add superfluous braces $parent->{start} == $node->{start} && $parent->{end} == $node->{end}) { $str .= "\\$node->{command}" . ' ' . $node->{children}->toLaTeX($node); } else { $str .= '{' . "\\$node->{command}" . $node->{children}->toLaTeX($node) . '}'; } } elsif ($node->{braces} == 0) { $str .= "\\$node->{command}" . ' ' . $node->{children}->toLaTeX($node); } } elsif ($node->{type} eq 'ENVIRONMENT') { # handle special math mode envs my $MATHBRACKETS = \%LaTeX::TOM::MATHBRACKETS; if (defined $MATHBRACKETS->{$node->{class}}) { # print with left and lookup right brace. $str .= $node->{class} . $node->{children}->toLaTeX($node) . $MATHBRACKETS->{$node->{class}}; } # standard \begin/\end envs else { $str .= "\\begin{$node->{class}}" . $node->{children}->toLaTeX($node) . "\\end{$node->{class}}"; } } } return $str; } # Augment the nodes in the tree with pointers to all neighboring nodes, so # traversal is easier for the user who is given a lone node. This is a hack, # we should really be maintaining this all along. # # Note that child pointers are already taken care of. 
# sub listify { my $tree = shift; my $parent = shift; for (my $i = 0; $i < scalar @{$tree->{nodes}}; $i++) { my $prev = undef; my $next = undef; $prev = $tree->{nodes}[$i - 1] if ($i > 0); $next = $tree->{nodes}[$i + 1] if ($i + 1 < scalar @{$tree->{nodes}}); $tree->{nodes}[$i]->{'prev'} = $prev; $tree->{nodes}[$i]->{'next'} = $next; $tree->{nodes}[$i]->{'parent'} = $parent; # recur, with parent info if ($tree->{nodes}[$i]->{children}) { $tree->{nodes}[$i]->{children}->listify($tree->{nodes}[$i]); } } } ############################################################################### # "Tree walking" methods. # sub getTopLevelNodes { my $tree = shift; return @{$tree->{nodes}}; } sub getAllNodes { my $tree = shift; my @nodelist; foreach my $node (@{$tree->{nodes}}) { push @nodelist, $node; if ($node->{children}) { push @nodelist, @{$node->{children}->getAllNodes()}; } } return [@nodelist]; } sub getNodesByCondition { my $tree = shift; my $condition = shift; # XXX rt #48551 - string eval no longer supported (12/08/2009) unless (ref $condition eq 'CODE') { croak 'getNodesByCondition(): code reference expected'; } my @nodelist; foreach my $node (@{$tree->{nodes}}) { # evaluate the perl code condition and if the result evaluates to true, # push this node # if ($condition->($node)) { push @nodelist, $node; } if ($node->{children}) { push @nodelist, @{$node->{children}->getNodesByCondition($condition)}; } } return [@nodelist]; } sub getCommandNodesByName { my $tree = shift; my $name = shift; return $tree->getNodesByCondition( sub { my $node = shift; return ($node->{type} eq 'COMMAND' && $node->{command} eq $name); } ); } sub getEnvironmentsByName { my $tree = shift; my $name = shift; return $tree->getNodesByCondition( sub { my $node = shift; return ($node->{type} eq 'ENVIRONMENT' && $node->{class} eq $name); } ); } sub getFirstNode { my $tree = shift; return $tree->{nodes}[0]; } 1; LaTeX-TOM-1.03/t000755001750001750 011675167673 12142 
5ustar00stssts000000000000
LaTeX-TOM-1.03/t/by_name.t000555001750001750 142211675167673 14100 0ustar00stssts000000000000
#!/usr/bin/perl

use strict;
use warnings;

use LaTeX::TOM;
use Test::More tests => 2;

my $parser = LaTeX::TOM->new;
my $tex = do { local $/; <DATA> };
my $tree = $parser->parse($tex);

{
    my $nodes = $tree->getCommandNodesByName('section');
    my $ok = (@$nodes == 1
           && $nodes->[0]->getNodeType eq 'COMMAND'
           && $nodes->[0]->getCommandName eq 'section'
    );
    ok($ok, 'getCommandNodesByName');
}

{
    my $nodes = $tree->getEnvironmentsByName('document');
    my $ok = (@$nodes == 1
           && $nodes->[0]->getNodeType eq 'ENVIRONMENT'
           && $nodes->[0]->getEnvironmentClass eq 'document'
    );
    ok($ok, 'getEnvironmentsByName');
}

__DATA__
\documentclass[10pt]{article}
\begin{document}
\section{abc}
\end{document}
LaTeX-TOM-1.03/t/pod.t000555001750001750 43011675167673 13226 0ustar00stssts000000000000
#!/usr/bin/perl

use strict;
use warnings;

use Test::More;

plan skip_all => 'tests for release testing' unless $ENV{RELEASE_TESTING};

eval "use Test::Pod 1.14";
plan skip_all => "Test::Pod 1.14 required for testing POD" if $@;

plan tests => 1;
pod_file_ok('lib/LaTeX/TOM.pm');
LaTeX-TOM-1.03/t/mapping.t000555001750001750 133311675167673 14122 0ustar00stssts000000000000
#!/usr/bin/perl

use strict;
use warnings;

use File::Spec;
use FindBin qw($Bin);
use LaTeX::TOM;
use Test::More tests => 1;

my $abs_path = File::Spec->catfile($Bin, 'data', 'mapping.t');

my $parser = LaTeX::TOM->new(0,0,1);
my $tex = do { local $/; <DATA> };

my $tree_string = $parser->parse($tex);
my $tree_file   = $parser->parseFile(File::Spec->catfile($abs_path, 'mapping.in'));

is_deeply(
    [ grep /\S/, split /\n/, $tree_string->toLaTeX ],
    [ grep /\S/, split /\n/, $tree_file->toLaTeX   ],
'mapping');

__DATA__
\documentclass[10pt]{article}
\newenvironment{centered}{\begin{center}}{\end{center}}
\newcommand{\bold}[1]{{\bf #1}}
\begin{document}
\begin{centered}
foo \bold{bar} baz
\end{centered}
\end{document}
LaTeX-TOM-1.03/t/03-tree.t000555001750001750 224611675167673 13652 0ustar00stssts000000000000#!/usr/bin/perl use strict; use warnings; use File::Spec; use FindBin qw($Bin); use LaTeX::TOM; use Test::More tests => 8; my $file = File::Spec->catfile($Bin, 'data', 'tex.in'); my $tex = do { open(my $fh, '<', $file) or die "Cannot open $file: $!\n"; local $/; <$fh>; }; my $parser = LaTeX::TOM->new; my $tree = $parser->parseFile($file); is_deeply($tree->plainText, [ 'Some Test Doc', "\n" . " \\maketitle\n" . " \\mainmatter\n" . " ", "\n ", "\n", ], 'Tree as plain text'); is($tree->indexableText, 'Some Test Doc ', 'Tree as indexable text'); is($tree->toLaTeX, do { $_ = $tex; $_ =~ s/\[.*?pt\]//; $_ }, 'Tree to LaTeX'); is(@{$tree->getAllNodes}, 19, 'Amount of all nodes'); is($tree->getTopLevelNodes, 9, 'Amount of top level nodes'); is(@{$tree->getCommandNodesByName('title')}, 1, "Amount of 'title' command nodes"); is(@{$tree->getEnvironmentsByName('document')}, 1, "Amount of 'document' environment nodes"); is(@{$tree->getNodesByCondition(sub { my $node = shift; return ($node->getNodeType eq 'COMMAND' && $node->getCommandName eq 'title'); })}, 1, "Amount of 'title' command nodes by condition"); LaTeX-TOM-1.03/t/00-load.t000555001750001750 15411675167673 13603 0ustar00stssts000000000000#!/usr/bin/perl use strict; use warnings; use Test::More tests => 1; BEGIN { use_ok('LaTeX::TOM'); } LaTeX-TOM-1.03/t/pod-coverage.t000555001750001750 46311675167673 15025 0ustar00stssts000000000000#!/usr/bin/perl use strict; use warnings; use Test::More; plan skip_all => 'tests for release testing' unless $ENV{RELEASE_TESTING}; eval "use Test::Pod::Coverage 1.04"; plan skip_all => "Test::Pod::Coverage 1.04 required for testing POD coverage" if $@; plan tests => 1; pod_coverage_ok('LaTeX::TOM'); LaTeX-TOM-1.03/t/04-node.t000555001750001750 554411675167673 13645 0ustar00stssts000000000000#!/usr/bin/perl use strict; use warnings; use File::Spec; use FindBin qw($Bin); use LaTeX::TOM; use 
Test::More tests => 56;

my $parser = LaTeX::TOM->new;
my $tree = $parser->parseFile(File::Spec->catfile($Bin, 'data', 'tex.in'));

my @expected_all = (
    [ 'TEXT', '' ],
    [ 'COMMAND', 'NeedsTeXFormat' ],
    [ 'TEXT', 'LaTeX2e' ],
    [ 'TEXT', "\n" ],
    [ 'COMMAND', 'documentclass' ],
    [ 'TEXT', 'book' ],
    [ 'TEXT', "\n" ],
    [ 'COMMAND', 'title' ],
    [ 'TEXT', 'Some Test Doc' ],
    [ 'TEXT', "\n" ],
    [ 'ENVIRONMENT', 'document' ],
    [ 'TEXT', "\n" . " \\maketitle\n" . " \\mainmatter\n" . " " ],
    [ 'COMMAND', 'chapter*' ],
    [ 'TEXT', "Preface" ],
    [ 'TEXT', "\n " ],
    [ 'COMMAND', 'input' ],
    [ 'TEXT', 't/data/input.tex' ],
    [ 'TEXT', "\n" ],
    [ 'TEXT', "\n" ],
);

my @expected_top = (
    [ 'TEXT', '' ],
    [ 'COMMAND', 'NeedsTeXFormat' ],
    [ 'TEXT', "\n" ],
    [ 'COMMAND', 'documentclass' ],
    [ 'TEXT', "\n" ],
    [ 'COMMAND', 'title' ],
    [ 'TEXT', "\n" ],
    [ 'ENVIRONMENT', 'document' ],
    [ 'TEXT', "\n" ],
    [ 'TEXT', "\n" . " \\maketitle\n" . " \\mainmatter\n" . " " ],
);

verify_nodes(@{$tree->getAllNodes}, \@expected_all);
verify_nodes($tree->getTopLevelNodes, \@expected_top);

sub verify_nodes
{
    my $expected = pop;
    my @nodes = @_;

    foreach my $node (@nodes) {
        my $node_type = $node->getNodeType;
        my $expected = shift @$expected;

        my $desc = $expected->[1];

        my $cnt = 0;
        $cnt++ while $desc =~ /\n/g;

        if (!length $desc) {
            $desc = 'undef';
        }
        elsif ($cnt >= 1 && $desc !~ /\w/) {
            $desc = 'newline';
            $desc .= 's' if $cnt > 1;
        }
        else {
            $desc =~ s/\n//g;
            $desc =~ tr/ //d;
        }

        if (my ($type) = $node_type =~ /^(TEXT|COMMENT)$/) {
            ok($expected->[0] =~ $type, $type);
            ok($expected->[1] eq $node->getNodeText, $desc);
        }
        elsif ($node_type eq 'ENVIRONMENT') {
            ok($expected->[0] =~ $node_type, $node_type);
            ok($expected->[1] eq $node->getEnvironmentClass, $desc);
        }
        elsif ($node_type eq 'COMMAND') {
            ok($expected->[0] =~ $node_type, $node_type);
            ok($expected->[1] =~ $node->getCommandName, $desc);
        }
    }
}
LaTeX-TOM-1.03/t/input.t000555001750001750 526511675167673 13636 0ustar00stssts000000000000
#!/usr/bin/perl

use strict;
use warnings;
use constant true => 1;
use constant false => 0;

use File::Spec;
use FindBin qw($Bin);
use LaTeX::TOM;
use Test::More tests => 6;

my $set_input = sub { ${$_[0]} =~ s/\$INPUT/\\input{$_[1]}/ };

my $abs_path = File::Spec->catfile($Bin, 'data', 'input.t');
my $rel_path = File::Spec->abs2rel($abs_path);

my $parser = LaTeX::TOM->new(0,1,0);

my $data = do { local $/; <DATA> };

my @tests = (
    # input file # tex file # message # input file must exist
    [ '00-image_skip.pstex_t', undef, 'skip pstex' , true ],
    [ '01-basic.tex', '01-basic.in', 'basic' , true ],
    [ '02-guess', '02-guess.in', 'guess' , false ], # file extension for '02-guess' missing on purpose
    [ '03-empty.tex', '03-empty.in', 'empty', true ],
    [ '04-psfig_ignore.tex', undef, 'ignore Psfig', true ],
);

SKIP:
{
    skip 'test for release testing', 1 unless $ENV{RELEASE_TESTING};

    # Check that all input files exist or bogus results may ensue.
    my $exist = true;
    foreach my $test (@tests) {
        my ($input_file, $must_exist) = @$test[0,3];
        if ($must_exist) {
            $exist &= -e File::Spec->catfile($abs_path, $input_file) ? true : false;
        }
    }
    ok($exist, '\input test files exist');
}

sub check_unaltered_tex
{
    my $input_file = $_[0]->[0];
    $input_file = File::Spec->catfile($rel_path, $input_file);

    my $tex = $data;
    $set_input->(\$tex, $input_file);

    my $tree = $parser->parse($tex);

    return scalar grep /\\input\{\Q$input_file\E\}/, split /\n/, $tree->toLaTeX;
}

{
    my $skipped = check_unaltered_tex($tests[0]);
    my $message = $tests[0]->[2];

    ok($skipped, $message);
}

{
    foreach my $test (@tests[1..3]) {
        my ($input_file, $tex_file, $message) = @$test;

        $input_file = File::Spec->catfile($rel_path, $input_file);
        $tex_file = File::Spec->catfile($abs_path, $tex_file);

        my $tex = $data;
        $set_input->(\$tex, $input_file);

        my $tree_string = $parser->parse($tex);
        my $tree_file = $parser->parseFile($tex_file);

        is_deeply(
            [ grep /\S/, split /\n/, $tree_string->toLaTeX ],
            [ grep /\S/, split /\n/, $tree_file->toLaTeX ],
        $message);
    }
}

{
    my $seen_warning = false;
    local $SIG{__WARN__} = sub
    {
        warn $_[0] and return unless $_[0] =~ /^ignoring Psfig/;
        $seen_warning = true;
    };
    my $ignored = check_unaltered_tex($tests[4]);
    my $message = $tests[4]->[2];

    ok($ignored && $seen_warning, $message);
}

__DATA__
\documentclass[10pt]{article}
\begin{document}
$INPUT
\end{document}
LaTeX-TOM-1.03/t/by_condition.t000555001750001750 130511675167673 15146 0ustar00stssts000000000000
#!/usr/bin/perl

use strict;
use warnings;

use LaTeX::TOM;
use Test::More tests => 1;

my $parser = LaTeX::TOM->new;

my $tex = do { local $/; <DATA> };

my $tree = $parser->parse($tex);

my $nodes = $tree->getNodesByCondition(sub {
    my $node = shift;
    return (
        $node->getNodeType eq 'COMMAND'
     && $node->getCommandName =~ /section$/
    );
});

my $count = 3;

my $ok = (
    @$nodes == $count
 && (grep { $_->getNodeType eq 'COMMAND' } @$nodes) == $count
 && (grep { $_->getCommandName =~ /section$/ } @$nodes) == $count
);

ok($ok, 'getNodesByCondition');

__DATA__
\documentclass[10pt]{article}
\begin{document}
\section{abc}
\subsection{def}
\subsubsection{ghi}
\end{document}
LaTeX-TOM-1.03/t/bibliography.t000555001750001750 117311675167673 15144 0ustar00stssts000000000000
#!/usr/bin/perl

use strict;
use warnings;

use File::Spec;
use FindBin qw($Bin);
use LaTeX::TOM;
use Test::More tests => 1;

my $abs_path = File::Spec->catfile($Bin, 'data', 'bibliography.t');

my $parser = LaTeX::TOM->new(0,1,0);

my $tex = do { local $/; <DATA> };

chdir $abs_path;

my $tree_string = $parser->parse($tex);
my $tree_file = $parser->parseFile(File::Spec->catfile($abs_path, 'sample.in'));

is_deeply(
    [ grep /\S/, split /\n/, $tree_string->toLaTeX ],
    [ grep /\S/, split /\n/, $tree_file->toLaTeX ],
'sample');

__DATA__
\documentclass[10pt]{article}
\begin{document}
\bibliography{sample}
\end{document}
LaTeX-TOM-1.03/t/02-parser.t000555001750001750 117111675167673 14202 0ustar00stssts000000000000
#!/usr/bin/perl

use strict;
use warnings;

use File::Spec;
use FindBin qw($Bin);
use LaTeX::TOM;
use Test::More tests => 1;

my $tex = do { local $/; <DATA> };
my $texfile = File::Spec->catfile($Bin, 'data', 'tex.in');

my $parser = LaTeX::TOM->new(0,1,0);
my $tree_string = $parser->parse($tex);
my $tree_file = $parser->parseFile($texfile);

is_deeply($tree_string, $tree_file, 'Tree read from string equals tree read from file');

__DATA__
\NeedsTeXFormat{LaTeX2e}
\documentclass[11pt]{book}
\title{Some Test Doc}
\begin{document}
 \maketitle
 \mainmatter
 \chapter*{Preface}
 \input{t/data/input.tex}
\end{document}
LaTeX-TOM-1.03/t/01-basic.t000555001750001750 55511675167673 13753 0ustar00stssts000000000000
#!/usr/bin/perl

use strict;
use warnings;

use File::Spec;
use FindBin qw($Bin);
use LaTeX::TOM;
use Test::More tests => 2;

my $parser = LaTeX::TOM->new;
ok($parser->isa('LaTeX::TOM::Parser'), 'Parser object is-a LaTeX::TOM::Parser object');

my $tree = $parser->parseFile(File::Spec->catfile($Bin, 'data', 'tex.in'));
ok($tree, 'Parser returned a defined tree');
LaTeX-TOM-1.03/t/data000755001750001750 011675167673 13053 5ustar00stssts000000000000
LaTeX-TOM-1.03/t/data/input.tex000444001750001750 
4511675167673 15030 0ustar00stssts000000000000
\begin{verbatim}
text
\end{verbatim}
LaTeX-TOM-1.03/t/data/tex.in000444001750001750 27511675167673 14324 0ustar00stssts000000000000
\NeedsTeXFormat{LaTeX2e}
\documentclass[11pt]{book}
\title{Some Test Doc}
\begin{document}
 \maketitle
 \mainmatter
 \chapter*{Preface}
 \input{t/data/input.tex}
\end{document}
LaTeX-TOM-1.03/t/data/mapping.t000755001750001750 011675167673 14750 5ustar00stssts000000000000
LaTeX-TOM-1.03/t/data/mapping.t/mapping.in000444001750001750 15411675167673 17050 0ustar00stssts000000000000
\documentclass[10pt]{article}
\begin{document}
\begin{center}
foo {\bf bar} baz
\end{center}
\end{document}
LaTeX-TOM-1.03/t/data/input.t000755001750001750 011675167673 14454 5ustar00stssts000000000000
LaTeX-TOM-1.03/t/data/input.t/02-guess.tex000444001750001750 6611675167673 16642 0ustar00stssts000000000000
No great discovery was ever made without a bold guess
LaTeX-TOM-1.03/t/data/input.t/03-empty.in000444001750001750 7611675167673 16462 0ustar00stssts000000000000
\documentclass[10pt]{article}
\begin{document}
\end{document}
LaTeX-TOM-1.03/t/data/input.t/03-empty.tex000444001750001750 011675167673 16617 0ustar00stssts000000000000
LaTeX-TOM-1.03/t/data/input.t/00-image_skip.pstex_t000444001750001750 1011675167673 20475 0ustar00stssts000000000000
skipped
LaTeX-TOM-1.03/t/data/input.t/04-psfig_ignore.tex000444001750001750 1411675167673 20162 0ustar00stssts000000000000
% Psfig/TeX
LaTeX-TOM-1.03/t/data/input.t/02-guess.in000444001750001750 16411675167673 16467 0ustar00stssts000000000000
\documentclass[10pt]{article}
\begin{document}
No great discovery was ever made without a bold guess
\end{document}
LaTeX-TOM-1.03/t/data/input.t/01-basic.tex000444001750001750 12111675167673 16604 0ustar00stssts000000000000
\begin{verbatim}
Tade kuu mushi mo sukizuki -- japanese proverb
\end{verbatim}
LaTeX-TOM-1.03/t/data/input.t/01-basic.in000444001750001750 21711675167673 16420 0ustar00stssts000000000000
\documentclass[10pt]{article}
\begin{document}
\begin{verbatim}
Tade kuu mushi mo sukizuki -- japanese proverb
\end{verbatim}
\end{document}
LaTeX-TOM-1.03/t/data/bibliography.t000755001750001750 011675167673 15770 5ustar00stssts000000000000
LaTeX-TOM-1.03/t/data/bibliography.t/sample.bbl000444001750001750 11311675167673 20042 0ustar00stssts000000000000
\begin{thebibliography}{99}
\bibitem{reference} text
\end{thebibliography}
LaTeX-TOM-1.03/t/data/bibliography.t/sample.in000444001750001750 21111675167673 17710 0ustar00stssts000000000000
\documentclass[10pt]{article}
\begin{document}
\begin{thebibliography}{99}
\bibitem{reference} text
\end{thebibliography}
\end{document}