nokogumbo-2.0.5/README.md
# Nokogumbo - a Nokogiri interface to the Gumbo HTML5 parser.

Nokogumbo provides the ability for a Ruby program to invoke
[our version of the Gumbo HTML5 parser](https://github.com/rubys/nokogumbo/tree/master/gumbo-parser/src)
and to access the result as a
[Nokogiri::HTML::Document](http://rdoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document).

[![Travis-CI Build Status](https://travis-ci.org/rubys/nokogumbo.svg)](https://travis-ci.org/rubys/nokogumbo)
[![Appveyor Build Status](https://ci.appveyor.com/api/projects/status/github/rubys/nokogumbo)](https://ci.appveyor.com/project/rubys/nokogumbo/branch/master)

## Usage

```ruby
require 'nokogumbo'
doc = Nokogiri.HTML5(string)
```

To parse an HTML fragment, a `fragment` method is provided.

```ruby
require 'nokogumbo'
doc = Nokogiri::HTML5.fragment(string)
```

Because HTML is often fetched via the web, a convenience interface to
HTTP get is also provided:

```ruby
require 'nokogumbo'
doc = Nokogiri::HTML5.get(uri)
```

## Parsing options

The document and fragment parsing methods,
- `Nokogiri.HTML5(html, url = nil, encoding = nil, options = {})`
- `Nokogiri::HTML5.parse(html, url = nil, encoding = nil, options = {})`
- `Nokogiri::HTML5::Document.parse(html, url = nil, encoding = nil, options = {})`
- `Nokogiri::HTML5.fragment(html, encoding = nil, options = {})`
- `Nokogiri::HTML5::DocumentFragment.parse(html, encoding = nil, options = {})`

support options that are different from Nokogiri's.

The three currently supported options are `:max_errors`, `:max_tree_depth` and
`:max_attributes`, described below.

### Error reporting

Nokogumbo contains an experimental parse error reporting facility. By default,
no parse errors are reported but this can be configured by passing the
`:max_errors` option to `::parse` or `::fragment`.
```ruby
require 'nokogumbo'
doc = Nokogiri::HTML5.parse('<span/>Hi there!</span foo=bar />', max_errors: 10)
doc.errors.each do |err|
  puts(err)
end
```

This prints the following.

```
1:1: ERROR: Expected a doctype token
<span/>Hi there!</span foo=bar />
^
1:1: ERROR: Start tag of nonvoid HTML element ends with '/>', use '>'.
<span/>Hi there!</span foo=bar />
^
1:17: ERROR: End tag ends with '/>', use '>'.
<span/>Hi there!</span foo=bar />
                ^
1:17: ERROR: End tag contains attributes.
<span/>Hi there!</span foo=bar />
                ^
```

Using `max_errors: -1` results in an unlimited number of errors being returned.

The errors returned by `#errors` are instances of
[`Nokogiri::XML::SyntaxError`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/SyntaxError).

The [HTML standard](https://html.spec.whatwg.org/multipage/parsing.html#parse-errors)
defines a number of standard parse error codes. These error codes only cover
the "tokenization" stage of parsing HTML. The parse errors in the "tree
construction" stage do not have standardized error codes (yet).

As a convenience to Nokogumbo users, the defined error codes are available
via the
[`Nokogiri::XML::SyntaxError#str1`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/SyntaxError#str1-instance_method)
method.

```ruby
require 'nokogumbo'
doc = Nokogiri::HTML5.parse('<span/>Hi there!</span foo=bar />', max_errors: 10)
doc.errors.each do |err|
  puts("#{err.line}:#{err.column}: #{err.str1}")
end
```

This prints the following.

```
1:1: generic-parser
1:1: non-void-html-element-start-tag-with-trailing-solidus
1:17: end-tag-with-trailing-solidus
1:17: end-tag-with-attributes
```

Note that the first error is `generic-parser` because it's an error from the
tree construction stage and doesn't have a standardized error code.

For the purposes of semantic versioning, the error messages, error locations,
and error codes are not part of Nokogumbo's public API. That is, these are
subject to change without Nokogumbo's major version number changing. These may
be stabilized in the future.
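
When many errors are collected, it can help to summarize them by code. The following is a minimal sketch, not part of Nokogumbo: `ParseError`, `tally_error_codes`, and `partition_by_stage` are hypothetical names, and `ParseError` is only a stand-in so the example runs without the gem installed. It assumes each error responds to `#str1` as described above, and that tree construction errors are reported as `generic-parser`. Since error codes are not part of the public API, treat the string comparison as fragile.

```ruby
# Hypothetical stand-in for Nokogiri::XML::SyntaxError so that this sketch
# runs without the gem; with Nokogumbo, the errors come from `doc.errors`.
ParseError = Struct.new(:line, :column, :str1)

# Count how often each error code (the value of #str1) occurs.
def tally_error_codes(errors)
  errors.each_with_object(Hash.new(0)) { |err, counts| counts[err.str1] += 1 }
end

# Split errors into tokenizer errors (which carry standardized codes) and
# tree construction errors (assumed to be reported as 'generic-parser').
def partition_by_stage(errors)
  errors.partition { |err| err.str1 != 'generic-parser' }
end

errors = [
  ParseError.new(1, 1, 'generic-parser'),
  ParseError.new(1, 1, 'non-void-html-element-start-tag-with-trailing-solidus'),
  ParseError.new(1, 17, 'end-tag-with-trailing-solidus'),
  ParseError.new(1, 17, 'end-tag-with-attributes')
]

tokenizer_errors, tree_errors = partition_by_stage(errors)
puts "tokenizer errors: #{tokenizer_errors.size}"
puts "tree construction errors: #{tree_errors.size}"
tally_error_codes(errors).each { |code, count| puts "#{code}: #{count}" }
```

With Nokogumbo loaded, the same helpers could be called as `tally_error_codes(doc.errors)` after parsing with a non-zero `max_errors`.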
### Maximum tree depth The maximum depth of the DOM tree parsed by the various parsing methods is configurable by the `:max_tree_depth` option. If the depth of the tree would exceed this limit, then an [ArgumentError](https://ruby-doc.org/core-2.5.0/ArgumentError.html) is thrown. This limit (which defaults to `Nokogumbo::DEFAULT_MAX_TREE_DEPTH = 400`) can be removed by giving the option `max_tree_depth: -1`. ``` ruby html = '' + '
' * 1000 doc = Nokogiri.HTML5(html) # raises ArgumentError: Document tree depth limit exceeded doc = Nokogiri.HTML5(html, max_tree_depth: -1) ``` ### Attribute limit per element The maximum number of attributes per DOM element is configurable by the `:max_attributes` option. If a given element would exceed this limit, then an [ArgumentError](https://ruby-doc.org/core-2.5.0/ArgumentError.html) is thrown. This limit (which defaults to `Nokogumbo::DEFAULT_MAX_ATTRIBUTES = 400`) can be removed by giving the option `max_attributes: -1`. ``` ruby html = '
' # "
" doc = Nokogiri.HTML5(html) # raises ArgumentError: Attributes per element limit exceeded doc = Nokogiri.HTML5(html, max_attributes: -1) ``` ## HTML Serialization After parsing HTML, it may be serialized using any of the Nokogiri [serialization methods](https://www.rubydoc.info/gems/nokogiri/Nokogiri/XML/Node). In particular, `#serialize`, `#to_html`, and `#to_s` will serialize a given node and its children. (This is the equivalent of JavaScript's `Element.outerHTML`.) Similarly, `#inner_html` will serialize the children of a given node. (This is the equivalent of JavaScript's `Element.innerHTML`.) ``` ruby doc = Nokogiri::HTML5("Hello world!") puts doc.serialize # Prints: Hello world! ``` Due to quirks in how HTML is parsed and serialized, it's possible for a DOM tree to be serialized and then re-parsed, resulting in a different DOM. Mostly, this happens with DOMs produced from invalid HTML. Unfortunately, even valid HTML may not survive serialization and re-parsing. In particular, a newline at the start of `pre`, `listing`, and `textarea` elements is ignored by the parser. ``` ruby doc = Nokogiri::HTML5(<<-EOF)
Content
EOF puts doc.at('/html/body/pre').serialize # Prints:
Content
``` In this case, the original HTML is semantically equivalent to the serialized version. If the `pre`, `listing`, or `textarea` content starts with two newlines, the first newline will be stripped on the first parse and the second newline will be stripped on the second, leading to semantically different DOMs. Passing the parameter `preserve_newline: true` will cause two or more newlines to be preserved. (A single leading newline will still be removed.) ``` ruby doc = Nokogiri::HTML5(<<-EOF) Content EOF puts doc.at('/html/body/listing').serialize(preserve_newline: true) # Prints: # # Content ``` ## Encodings Nokogumbo always parses HTML using [UTF-8](https://en.wikipedia.org/wiki/UTF-8); however, the encoding of the input can be explicitly selected via the optional `encoding` parameter. This is most useful when the input comes not from a string but from an IO object. When serializing a document or node, the encoding of the output string can be specified via the `:encoding` options. Characters that cannot be encoded in the selected encoding will be encoded as [HTML numeric entities](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references). ``` ruby frag = Nokogiri::HTML5.fragment('아는 길도 물어가라') html = frag.serialize(encoding: 'US-ASCII') puts html # Prints: 아는 길도 물어가라 frag = Nokogiri::HTML5.fragment(html) puts frag.serialize # Prints: 아는 길도 물어가라 ``` (There's a [bug](https://bugs.ruby-lang.org/issues/15033) in all current versions of Ruby that can cause the entity encoding to fail. Of the mandated supported encodings for HTML, the only encoding I'm aware of that has this bug is `'ISO-2022-JP'`. I recommend avoiding this encoding.) ## Examples ```ruby require 'nokogumbo' puts Nokogiri::HTML5.get('http://nokogiri.org').search('ol li')[2].text ``` ## Notes * The `Nokogiri::HTML5.fragment` function takes a string and parses it as a HTML5 document. 
The `<html>`, `<head>`, and `<body>` elements are removed from this document,
and any children of these elements that remain are returned as a
`Nokogiri::HTML::DocumentFragment`.
* The `Nokogiri::HTML5.parse` function takes a string and passes it to the
`gumbo_parse_with_options` function, using the default options.
The resulting Gumbo parse tree is then walked.
  * If the necessary Nokogiri and [libxml2](http://xmlsoft.org/html/) headers
    can be found at installation time then an
    [xmlDoc](http://xmlsoft.org/html/libxml-tree.html#xmlDoc) tree is
    produced and a single Nokogiri Ruby object is constructed to wrap the xmlDoc
    structure. Nokogiri only produces Ruby objects as necessary, so all searching
    is done using the underlying libxml2 libraries.
  * If the necessary headers are not present at installation time, then
    Nokogiri Ruby objects are created for each Gumbo node. Other than
    memory usage and CPU time, the results should be equivalent.

* The `Nokogiri::HTML5.get` function takes care of following redirects,
https, and determining the character encoding of the result, based on the
rules defined in the HTML5 specification for doing so.

* Instead of uppercase element names, lowercase element names are produced.

* Instead of returning `unknown` as the element name for unknown tags, the
original tag name is returned verbatim.

# Flavors of Nokogumbo

Nokogumbo uses libxml2, the XML library underlying Nokogiri, to speed up
parsing. If the libxml2 headers are not available, then Nokogumbo resorts to
using Nokogiri's Ruby API to construct the DOM tree.

Nokogiri can be configured to either use the system library version of libxml2
or use a bundled version. By default (as of Nokogiri version 1.8.4), Nokogiri
will use a bundled version.

To prevent differences between versions of libxml2, Nokogumbo will only use
libxml2 if the build process can find the exact same version used by Nokogiri.
This leads to three possibilities:

1. Nokogiri is compiled with the bundled libxml2.
   In this case, Nokogumbo will (by default) use the same version of libxml2.
2. Nokogiri is compiled with the system libxml2. In this case, if the libxml2
   headers are available, then Nokogumbo will (by default) use the system
   version and headers.
3. Nokogiri is compiled with the system libxml2 but its headers aren't
   available at build time for Nokogumbo. In this case, Nokogumbo will use the
   slower Ruby API.

Using libxml2 can be required by passing `-- --with-libxml2` to
`bundle exec rake` or to `gem install`. Using libxml2 can be prohibited by
instead passing `-- --without-libxml2`.

Functionally, the only difference between using libxml2 or not is in the
behavior of `Nokogiri::XML::Node#line`. If it is used, then `#line` will
return the line number of the corresponding node. Otherwise, it will return 0.

# Installation

    git clone https://github.com/rubys/nokogumbo.git
    cd nokogumbo
    bundle install
    rake gem
    gem install pkg/nokogumbo*.gem

# Related efforts

* [ruby-gumbo](https://github.com/nevir/ruby-gumbo#readme) -- a ruby binding
for the Gumbo HTML5 parser.
* [lua-gumbo](https://gitlab.com/craigbarnes/lua-gumbo) -- a lua binding for
the Gumbo HTML5 parser.

nokogumbo-2.0.5/gumbo-parser/src/tag.c
/*
 Copyright 2011 Google Inc.

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

 https://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License. */ #include "gumbo.h" #include "util.h" #include "tag_lookup.h" #include #include static const char kGumboTagNames[GUMBO_TAG_LAST+1][15] = { [GUMBO_TAG_HTML] = "html", [GUMBO_TAG_HEAD] = "head", [GUMBO_TAG_TITLE] = "title", [GUMBO_TAG_BASE] = "base", [GUMBO_TAG_LINK] = "link", [GUMBO_TAG_META] = "meta", [GUMBO_TAG_STYLE] = "style", [GUMBO_TAG_SCRIPT] = "script", [GUMBO_TAG_NOSCRIPT] = "noscript", [GUMBO_TAG_TEMPLATE] = "template", [GUMBO_TAG_BODY] = "body", [GUMBO_TAG_ARTICLE] = "article", [GUMBO_TAG_SECTION] = "section", [GUMBO_TAG_NAV] = "nav", [GUMBO_TAG_ASIDE] = "aside", [GUMBO_TAG_H1] = "h1", [GUMBO_TAG_H2] = "h2", [GUMBO_TAG_H3] = "h3", [GUMBO_TAG_H4] = "h4", [GUMBO_TAG_H5] = "h5", [GUMBO_TAG_H6] = "h6", [GUMBO_TAG_HGROUP] = "hgroup", [GUMBO_TAG_HEADER] = "header", [GUMBO_TAG_FOOTER] = "footer", [GUMBO_TAG_ADDRESS] = "address", [GUMBO_TAG_P] = "p", [GUMBO_TAG_HR] = "hr", [GUMBO_TAG_PRE] = "pre", [GUMBO_TAG_BLOCKQUOTE] = "blockquote", [GUMBO_TAG_OL] = "ol", [GUMBO_TAG_UL] = "ul", [GUMBO_TAG_LI] = "li", [GUMBO_TAG_DL] = "dl", [GUMBO_TAG_DT] = "dt", [GUMBO_TAG_DD] = "dd", [GUMBO_TAG_FIGURE] = "figure", [GUMBO_TAG_FIGCAPTION] = "figcaption", [GUMBO_TAG_MAIN] = "main", [GUMBO_TAG_DIV] = "div", [GUMBO_TAG_A] = "a", [GUMBO_TAG_EM] = "em", [GUMBO_TAG_STRONG] = "strong", [GUMBO_TAG_SMALL] = "small", [GUMBO_TAG_S] = "s", [GUMBO_TAG_CITE] = "cite", [GUMBO_TAG_Q] = "q", [GUMBO_TAG_DFN] = "dfn", [GUMBO_TAG_ABBR] = "abbr", [GUMBO_TAG_DATA] = "data", [GUMBO_TAG_TIME] = "time", [GUMBO_TAG_CODE] = "code", [GUMBO_TAG_VAR] = "var", [GUMBO_TAG_SAMP] = "samp", [GUMBO_TAG_KBD] = "kbd", [GUMBO_TAG_SUB] = "sub", [GUMBO_TAG_SUP] = "sup", [GUMBO_TAG_I] = "i", [GUMBO_TAG_B] = "b", [GUMBO_TAG_U] = "u", [GUMBO_TAG_MARK] = "mark", [GUMBO_TAG_RUBY] = "ruby", [GUMBO_TAG_RT] = "rt", [GUMBO_TAG_RP] = "rp", [GUMBO_TAG_BDI] = "bdi", [GUMBO_TAG_BDO] = "bdo", [GUMBO_TAG_SPAN] = "span", 
[GUMBO_TAG_BR] = "br", [GUMBO_TAG_WBR] = "wbr", [GUMBO_TAG_INS] = "ins", [GUMBO_TAG_DEL] = "del", [GUMBO_TAG_IMAGE] = "image", [GUMBO_TAG_IMG] = "img", [GUMBO_TAG_IFRAME] = "iframe", [GUMBO_TAG_EMBED] = "embed", [GUMBO_TAG_OBJECT] = "object", [GUMBO_TAG_PARAM] = "param", [GUMBO_TAG_VIDEO] = "video", [GUMBO_TAG_AUDIO] = "audio", [GUMBO_TAG_SOURCE] = "source", [GUMBO_TAG_TRACK] = "track", [GUMBO_TAG_CANVAS] = "canvas", [GUMBO_TAG_MAP] = "map", [GUMBO_TAG_AREA] = "area", [GUMBO_TAG_MATH] = "math", [GUMBO_TAG_MI] = "mi", [GUMBO_TAG_MO] = "mo", [GUMBO_TAG_MN] = "mn", [GUMBO_TAG_MS] = "ms", [GUMBO_TAG_MTEXT] = "mtext", [GUMBO_TAG_MGLYPH] = "mglyph", [GUMBO_TAG_MALIGNMARK] = "malignmark", [GUMBO_TAG_ANNOTATION_XML] = "annotation-xml", [GUMBO_TAG_SVG] = "svg", [GUMBO_TAG_FOREIGNOBJECT] = "foreignobject", [GUMBO_TAG_DESC] = "desc", [GUMBO_TAG_TABLE] = "table", [GUMBO_TAG_CAPTION] = "caption", [GUMBO_TAG_COLGROUP] = "colgroup", [GUMBO_TAG_COL] = "col", [GUMBO_TAG_TBODY] = "tbody", [GUMBO_TAG_THEAD] = "thead", [GUMBO_TAG_TFOOT] = "tfoot", [GUMBO_TAG_TR] = "tr", [GUMBO_TAG_TD] = "td", [GUMBO_TAG_TH] = "th", [GUMBO_TAG_FORM] = "form", [GUMBO_TAG_FIELDSET] = "fieldset", [GUMBO_TAG_LEGEND] = "legend", [GUMBO_TAG_LABEL] = "label", [GUMBO_TAG_INPUT] = "input", [GUMBO_TAG_BUTTON] = "button", [GUMBO_TAG_SELECT] = "select", [GUMBO_TAG_DATALIST] = "datalist", [GUMBO_TAG_OPTGROUP] = "optgroup", [GUMBO_TAG_OPTION] = "option", [GUMBO_TAG_TEXTAREA] = "textarea", [GUMBO_TAG_KEYGEN] = "keygen", [GUMBO_TAG_OUTPUT] = "output", [GUMBO_TAG_PROGRESS] = "progress", [GUMBO_TAG_METER] = "meter", [GUMBO_TAG_DETAILS] = "details", [GUMBO_TAG_SUMMARY] = "summary", [GUMBO_TAG_MENU] = "menu", [GUMBO_TAG_MENUITEM] = "menuitem", [GUMBO_TAG_APPLET] = "applet", [GUMBO_TAG_ACRONYM] = "acronym", [GUMBO_TAG_BGSOUND] = "bgsound", [GUMBO_TAG_DIR] = "dir", [GUMBO_TAG_FRAME] = "frame", [GUMBO_TAG_FRAMESET] = "frameset", [GUMBO_TAG_NOFRAMES] = "noframes", [GUMBO_TAG_LISTING] = "listing", [GUMBO_TAG_XMP] = "xmp", 
  [GUMBO_TAG_NEXTID] = "nextid",
  [GUMBO_TAG_NOEMBED] = "noembed",
  [GUMBO_TAG_PLAINTEXT] = "plaintext",
  [GUMBO_TAG_RB] = "rb",
  [GUMBO_TAG_STRIKE] = "strike",
  [GUMBO_TAG_BASEFONT] = "basefont",
  [GUMBO_TAG_BIG] = "big",
  [GUMBO_TAG_BLINK] = "blink",
  [GUMBO_TAG_CENTER] = "center",
  [GUMBO_TAG_FONT] = "font",
  [GUMBO_TAG_MARQUEE] = "marquee",
  [GUMBO_TAG_MULTICOL] = "multicol",
  [GUMBO_TAG_NOBR] = "nobr",
  [GUMBO_TAG_SPACER] = "spacer",
  [GUMBO_TAG_TT] = "tt",
  [GUMBO_TAG_RTC] = "rtc",
  [GUMBO_TAG_DIALOG] = "dialog",
  [GUMBO_TAG_UNKNOWN] = "",
  [GUMBO_TAG_LAST] = "",
};

const char* gumbo_normalized_tagname(GumboTag tag) {
  assert(tag <= GUMBO_TAG_LAST);
  const char *tagname = kGumboTagNames[tag];
  assert(tagname);
  return tagname;
}

void gumbo_tag_from_original_text(GumboStringPiece* text) {
  if (text->data == NULL) {
    return;
  }
  assert(text->length >= 2);
  assert(text->data[0] == '<');
  assert(text->data[text->length - 1] == '>');
  if (text->data[1] == '/') {
    // End tag
    assert(text->length >= 3);
    text->data += 2; // Move past </
    text->length -= 3;
  } else {
    // Start tag
    text->data += 1; // Move past <
    text->length -= 2;
    for (const char* c = text->data; c != text->data + text->length; ++c) {
      switch (*c) {
        case '\t':
        case '\n':
        case '\f':
        case ' ':
        case '/':
          text->length = c - text->data;
          return;
      }
    }
  }
}

GumboTag gumbo_tagn_enum(const char *tagname, size_t tagname_length) {
  const TagHashSlot *slot = gumbo_tag_lookup(tagname, tagname_length);
  return slot ?
slot->tag : GUMBO_TAG_UNKNOWN; } nokogumbo-2.0.5/gumbo-parser/src/tag_lookup.c0000644000004100000410000003237714030710665021266 0ustar www-datawww-data/* ANSI-C code produced by gperf version 3.1 */ /* Command-line: gperf -m100 lib/tag_lookup.gperf */ /* Computed positions: -k'1-2,$' */ /* Filtered by: mk/gperf-filter.sed */ #include "tag_lookup.h" #include "macros.h" #include "ascii.h" #include #define TOTAL_KEYWORDS 150 #define MIN_WORD_LENGTH 1 #define MAX_WORD_LENGTH 14 #define MIN_HASH_VALUE 9 #define MAX_HASH_VALUE 271 /* maximum key range = 263, duplicates = 0 */ static inline unsigned int hash (register const char *str, register size_t len) { static const unsigned short asso_values[] = { 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 9, 7, 6, 4, 4, 3, 4, 3, 3, 272, 272, 272, 272, 272, 272, 272, 70, 83, 152, 7, 16, 61, 98, 5, 76, 102, 126, 12, 19, 54, 54, 31, 97, 3, 4, 9, 33, 136, 113, 86, 15, 272, 272, 272, 272, 272, 272, 272, 70, 83, 152, 7, 16, 61, 98, 5, 76, 102, 126, 12, 19, 54, 54, 31, 97, 3, 4, 9, 33, 136, 113, 86, 15, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272, 272 }; register unsigned int hval = len; switch (hval) { default: hval 
+= asso_values[(unsigned char)str[1]+3]; /*FALLTHROUGH*/ case 1: hval += asso_values[(unsigned char)str[0]]; break; } return hval + asso_values[(unsigned char)str[len - 1]]; } const TagHashSlot * gumbo_tag_lookup (register const char *str, register size_t len) { static const unsigned char lengthtable[] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 2, 2, 2, 6, 2, 6, 2, 4, 0, 7, 6, 3, 0, 3, 0, 6, 6, 8, 5, 0, 0, 4, 5, 5, 8, 0, 2, 4, 5, 2, 0, 5, 4, 2, 0, 7, 0, 8, 5, 0, 0, 0, 0, 0, 0, 5, 3, 4, 5, 1, 4, 0, 4, 1, 2, 8, 7, 7, 6, 6, 8, 2, 8, 4, 2, 0, 6, 0, 0, 3, 4, 6, 13, 4, 4, 6, 8, 0, 8, 4, 0, 6, 0, 8, 4, 5, 0, 2, 2, 9, 2, 4, 0, 8, 4, 2, 4, 8, 7, 0, 2, 5, 2, 0, 6, 0, 3, 2, 2, 6, 3, 8, 7, 2, 5, 7, 0, 2, 6, 2, 4, 3, 0, 10, 5, 6, 3, 1, 2, 0, 6, 0, 5, 5, 0, 3, 0, 3, 3, 1, 4, 6, 4, 7, 3, 0, 0, 2, 10, 10, 0, 0, 6, 1, 4, 6, 3, 0, 2, 5, 6, 4, 3, 4, 0, 7, 3, 0, 0, 0, 4, 0, 0, 5, 0, 0, 0, 6, 0, 14, 8, 1, 3, 0, 0, 7, 3, 0, 0, 0, 0, 0, 0, 5, 3, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 7, 6, 0, 0, 0, 0, 0, 5, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 3 }; static const TagHashSlot wordlist[] = { {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"s", GUMBO_TAG_S}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"h6", GUMBO_TAG_H6}, {"h5", GUMBO_TAG_H5}, {"h4", GUMBO_TAG_H4}, {"h3", GUMBO_TAG_H3}, {"spacer", GUMBO_TAG_SPACER}, {"h2", GUMBO_TAG_H2}, {"header", GUMBO_TAG_HEADER}, {"h1", GUMBO_TAG_H1}, {"head", GUMBO_TAG_HEAD}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"details", GUMBO_TAG_DETAILS}, {"select", GUMBO_TAG_SELECT}, {"dir", GUMBO_TAG_DIR}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"del", GUMBO_TAG_DEL}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"source", GUMBO_TAG_SOURCE}, 
{"legend", GUMBO_TAG_LEGEND}, {"datalist", GUMBO_TAG_DATALIST}, {"meter", GUMBO_TAG_METER}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"math", GUMBO_TAG_MATH}, {"label", GUMBO_TAG_LABEL}, {"table", GUMBO_TAG_TABLE}, {"template", GUMBO_TAG_TEMPLATE}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"rp", GUMBO_TAG_RP}, {"time", GUMBO_TAG_TIME}, {"title", GUMBO_TAG_TITLE}, {"hr", GUMBO_TAG_HR}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"tbody", GUMBO_TAG_TBODY}, {"samp", GUMBO_TAG_SAMP}, {"tr", GUMBO_TAG_TR}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"marquee", GUMBO_TAG_MARQUEE}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"menuitem", GUMBO_TAG_MENUITEM}, {"small", GUMBO_TAG_SMALL}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"embed", GUMBO_TAG_EMBED}, {"map", GUMBO_TAG_MAP}, {"menu", GUMBO_TAG_MENU}, {"param", GUMBO_TAG_PARAM}, {"p", GUMBO_TAG_P}, {"nobr", GUMBO_TAG_NOBR}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"span", GUMBO_TAG_SPAN}, {"u", GUMBO_TAG_U}, {"em", GUMBO_TAG_EM}, {"noframes", GUMBO_TAG_NOFRAMES}, {"section", GUMBO_TAG_SECTION}, {"noembed", GUMBO_TAG_NOEMBED}, {"nextid", GUMBO_TAG_NEXTID}, {"footer", GUMBO_TAG_FOOTER}, {"noscript", GUMBO_TAG_NOSCRIPT}, {"dl", GUMBO_TAG_DL}, {"progress", GUMBO_TAG_PROGRESS}, {"font", GUMBO_TAG_FONT}, {"mo", GUMBO_TAG_MO}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"script", GUMBO_TAG_SCRIPT}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"pre", GUMBO_TAG_PRE}, {"main", GUMBO_TAG_MAIN}, {"object", GUMBO_TAG_OBJECT}, {"foreignobject", GUMBO_TAG_FOREIGNOBJECT}, {"form", GUMBO_TAG_FORM}, {"data", GUMBO_TAG_DATA}, {"applet", GUMBO_TAG_APPLET}, {"fieldset", GUMBO_TAG_FIELDSET}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"textarea", GUMBO_TAG_TEXTAREA}, {"abbr", GUMBO_TAG_ABBR}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"figure", GUMBO_TAG_FIGURE}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"optgroup", GUMBO_TAG_OPTGROUP}, {"meta", GUMBO_TAG_META}, {"tfoot", 
GUMBO_TAG_TFOOT}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"ul", GUMBO_TAG_UL}, {"li", GUMBO_TAG_LI}, {"plaintext", GUMBO_TAG_PLAINTEXT}, {"rb", GUMBO_TAG_RB}, {"body", GUMBO_TAG_BODY}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"basefont", GUMBO_TAG_BASEFONT}, {"ruby", GUMBO_TAG_RUBY}, {"mi", GUMBO_TAG_MI}, {"base", GUMBO_TAG_BASE}, {"frameset", GUMBO_TAG_FRAMESET}, {"summary", GUMBO_TAG_SUMMARY}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"dd", GUMBO_TAG_DD}, {"frame", GUMBO_TAG_FRAME}, {"td", GUMBO_TAG_TD}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"option", GUMBO_TAG_OPTION}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"svg", GUMBO_TAG_SVG}, {"br", GUMBO_TAG_BR}, {"ol", GUMBO_TAG_OL}, {"dialog", GUMBO_TAG_DIALOG}, {"sup", GUMBO_TAG_SUP}, {"multicol", GUMBO_TAG_MULTICOL}, {"article", GUMBO_TAG_ARTICLE}, {"rt", GUMBO_TAG_RT}, {"image", GUMBO_TAG_IMAGE}, {"listing", GUMBO_TAG_LISTING}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"dt", GUMBO_TAG_DT}, {"mglyph", GUMBO_TAG_MGLYPH}, {"tt", GUMBO_TAG_TT}, {"html", GUMBO_TAG_HTML}, {"wbr", GUMBO_TAG_WBR}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"figcaption", GUMBO_TAG_FIGCAPTION}, {"style", GUMBO_TAG_STYLE}, {"strike", GUMBO_TAG_STRIKE}, {"dfn", GUMBO_TAG_DFN}, {"a", GUMBO_TAG_A}, {"th", GUMBO_TAG_TH}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"hgroup", GUMBO_TAG_HGROUP}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"mtext", GUMBO_TAG_MTEXT}, {"thead", GUMBO_TAG_THEAD}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"var", GUMBO_TAG_VAR}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"xmp", GUMBO_TAG_XMP}, {"kbd", GUMBO_TAG_KBD}, {"i", GUMBO_TAG_I}, {"link", GUMBO_TAG_LINK}, {"output", GUMBO_TAG_OUTPUT}, {"mark", GUMBO_TAG_MARK}, {"acronym", GUMBO_TAG_ACRONYM}, {"div", GUMBO_TAG_DIV}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"ms", GUMBO_TAG_MS}, {"malignmark", GUMBO_TAG_MALIGNMARK}, {"blockquote", GUMBO_TAG_BLOCKQUOTE}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"center", GUMBO_TAG_CENTER}, {"b", GUMBO_TAG_B}, {"desc", GUMBO_TAG_DESC}, {"canvas", GUMBO_TAG_CANVAS}, {"col", GUMBO_TAG_COL}, 
{(char*)0,GUMBO_TAG_UNKNOWN}, {"mn", GUMBO_TAG_MN}, {"track", GUMBO_TAG_TRACK}, {"iframe", GUMBO_TAG_IFRAME}, {"code", GUMBO_TAG_CODE}, {"sub", GUMBO_TAG_SUB}, {"area", GUMBO_TAG_AREA}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"address", GUMBO_TAG_ADDRESS}, {"ins", GUMBO_TAG_INS}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"cite", GUMBO_TAG_CITE}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"input", GUMBO_TAG_INPUT}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"keygen", GUMBO_TAG_KEYGEN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"annotation-xml", GUMBO_TAG_ANNOTATION_XML}, {"colgroup", GUMBO_TAG_COLGROUP}, {"q", GUMBO_TAG_Q}, {"big", GUMBO_TAG_BIG}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"bgsound", GUMBO_TAG_BGSOUND}, {"nav", GUMBO_TAG_NAV}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"video", GUMBO_TAG_VIDEO}, {"img", GUMBO_TAG_IMG}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"audio", GUMBO_TAG_AUDIO}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"caption", GUMBO_TAG_CAPTION}, {"strong", GUMBO_TAG_STRONG}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"aside", GUMBO_TAG_ASIDE}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"button", GUMBO_TAG_BUTTON}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"bdo", GUMBO_TAG_BDO}, 
{(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"bdi", GUMBO_TAG_BDI}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"blink", GUMBO_TAG_BLINK}, {(char*)0,GUMBO_TAG_UNKNOWN}, {(char*)0,GUMBO_TAG_UNKNOWN}, {"rtc", GUMBO_TAG_RTC} }; if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH) { register unsigned int key = hash (str, len); if (key <= MAX_HASH_VALUE) if (len == lengthtable[key]) { register const char *s = wordlist[key].key; if (s && (((unsigned char)*str ^ (unsigned char)*s) & ~32) == 0 && !gumbo_ascii_strncasecmp(str, s, len)) return &wordlist[key]; } } return 0; } nokogumbo-2.0.5/gumbo-parser/src/utf8.h0000644000004100000410000001206714030710665020007 0ustar www-datawww-data#ifndef GUMBO_UTF8_H_ #define GUMBO_UTF8_H_ // This contains an implementation of a UTF-8 iterator and decoder suitable for // a HTML5 parser. This does a bit more than straight UTF-8 decoding. The // HTML5 spec specifies that: // 1. Decoding errors are parse errors. // 2. Certain other codepoints (e.g. control characters) are parse errors. // 3. Carriage returns and CR/LF groups are converted to line feeds. // https://encoding.spec.whatwg.org/#utf-8-decode // // Also, we want to keep track of source positions for error handling. 
As a // result, we fold all that functionality into this decoder, and can't use an // off-the-shelf library. // // This header is internal-only, which is why we prefix functions with only // utf8_ or utf8_iterator_ instead of gumbo_utf8_. #include #include #include "gumbo.h" #include "macros.h" #ifdef __cplusplus extern "C" { #endif struct GumboInternalError; struct GumboInternalParser; // Unicode replacement char. #define kUtf8ReplacementChar 0xFFFD #define kUtf8BomChar 0xFEFF #define kUtf8MaxChar 0x10FFFF typedef struct GumboInternalUtf8Iterator { // Points at the start of the code point most recently read into 'current'. const char* _start; // Points at the mark. The mark is initially set to the beginning of the // input. const char* _mark; // Points past the end of the iter, like a past-the-end iterator in the STL. const char* _end; // The code point under the cursor. int _current; // The width in bytes of the current code point. size_t _width; // The SourcePosition for the current location. GumboSourcePosition _pos; // The SourcePosition for the mark. GumboSourcePosition _mark_pos; // Pointer back to the GumboParser instance, for configuration options and // error recording. struct GumboInternalParser* _parser; } Utf8Iterator; // Returns true if this Unicode code point is a surrogate. CONST_FN static inline bool utf8_is_surrogate(int c) { return c >= 0xD800 && c <= 0xDFFF; } // Returns true if this Unicode code point is a noncharacter. CONST_FN static inline bool utf8_is_noncharacter(int c) { return (c >= 0xFDD0 && c <= 0xFDEF) || ((c & 0xFFFF) == 0xFFFE) || ((c & 0xFFFF) == 0xFFFF); } // Returns true if this Unicode code point is a control. CONST_FN static inline bool utf8_is_control(int c) { return ((unsigned int)c < 0x1Fu) || (c >= 0x7F && c <= 0x9F); } // Initializes a new Utf8Iterator from the given byte buffer. The source does // not have to be NUL-terminated, but the length must be passed in explicitly. 
void utf8iterator_init ( struct GumboInternalParser* parser, const char* source, size_t source_length, Utf8Iterator* iter ); // Advances the current position by one code point. void utf8iterator_next(Utf8Iterator* iter); // Returns the current code point as an integer. static inline int utf8iterator_current(const Utf8Iterator* iter) { return iter->_current; } // Retrieves and fills the output parameter with the current source position. static inline void utf8iterator_get_position ( const Utf8Iterator* iter, GumboSourcePosition* output ) { *output = iter->_pos; } // Retrieves the marked position. static inline GumboSourcePosition utf8iterator_get_mark_position ( const Utf8Iterator* iter ) { return iter->_mark_pos; } // Retrieves a character pointer to the start of the current character. static inline const char* utf8iterator_get_char_pointer(const Utf8Iterator* iter) { return iter->_start; } // Retrieves the width of the current character. static inline size_t utf8iterator_get_width(const Utf8Iterator* iter) { return iter->_width; } // Retrieves a character pointer to 1 past the end of the buffer. This is // necessary for certain state machines and string comparisons that would like // to look directly for ASCII text in the buffer without going through the // decoder. static inline const char* utf8iterator_get_end_pointer(const Utf8Iterator* iter) { return iter->_end; } // Retrieves a character pointer to the marked position. static inline const char* utf8iterator_get_mark_pointer(const Utf8Iterator* iter) { return iter->_mark; } // If the upcoming text in the buffer matches the specified prefix (which has // length 'length'), consume it and return true. Otherwise, return false with // no other effects. If the length of the string would overflow the buffer, // this returns false. Note that prefix should not contain null bytes because // of the use of strncmp/strncasecmp internally. All existing use-cases adhere // to this. 
bool utf8iterator_maybe_consume_match ( Utf8Iterator* iter, const char* prefix, size_t length, bool case_sensitive ); // "Marks" a particular location of interest in the input stream, so that it can // later be reset() to. There's also the ability to record an error at the // point that was marked, as oftentimes that's more useful than the last // character before the error was detected. void utf8iterator_mark(Utf8Iterator* iter); // Returns the current input stream position to the mark. void utf8iterator_reset(Utf8Iterator* iter); #ifdef __cplusplus } #endif #endif // GUMBO_UTF8_H_ nokogumbo-2.0.5/gumbo-parser/src/svg_attrs.c0000644000004100000410000001353614030710665021132 0ustar www-datawww-data/* ANSI-C code produced by gperf version 3.1 */ /* Command-line: gperf -m100 lib/svg_attrs.gperf */ /* Computed positions: -k'1,10,$' */ /* Filtered by: mk/gperf-filter.sed */ #include "replacement.h" #include "macros.h" #include "ascii.h" #include #define TOTAL_KEYWORDS 58 #define MIN_WORD_LENGTH 4 #define MAX_WORD_LENGTH 19 #define MIN_HASH_VALUE 5 #define MAX_HASH_VALUE 77 /* maximum key range = 73, duplicates = 0 */ static inline unsigned int hash (register const char *str, register size_t len) { static const unsigned char asso_values[] = { 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 5, 78, 39, 14, 1, 31, 31, 13, 13, 78, 78, 22, 25, 10, 2, 7, 78, 22, 0, 1, 3, 1, 78, 0, 36, 14, 17, 20, 78, 78, 78, 78, 5, 78, 39, 14, 1, 31, 31, 13, 13, 78, 78, 22, 25, 10, 2, 7, 78, 22, 0, 1, 3, 1, 78, 0, 36, 14, 17, 20, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 
78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 78 }; register unsigned int hval = len; switch (hval) { default: hval += asso_values[(unsigned char)str[9]]; /*FALLTHROUGH*/ case 9: case 8: case 7: case 6: case 5: case 4: case 3: case 2: case 1: hval += asso_values[(unsigned char)str[0]+2]; break; } return hval + asso_values[(unsigned char)str[len - 1]]; } const StringReplacement * gumbo_get_svg_attr_replacement (register const char *str, register size_t len) { static const unsigned char lengthtable[] = { 0, 0, 0, 0, 0, 4, 0, 7, 7, 0, 8, 9, 10, 11, 11, 11, 11, 10, 16, 18, 16, 12, 16, 11, 13, 11, 12, 11, 16, 0, 17, 9, 9, 8, 9, 10, 13, 10, 12, 14, 8, 4, 12, 19, 7, 9, 12, 12, 11, 14, 10, 19, 8, 16, 13, 16, 16, 15, 10, 12, 0, 0, 13, 13, 13, 0, 0, 9, 16, 0, 0, 0, 0, 0, 0, 0, 0, 17 }; static const StringReplacement wordlist[] = { {(char*)0,(char*)0}, {(char*)0,(char*)0}, {(char*)0,(char*)0}, {(char*)0,(char*)0}, {(char*)0,(char*)0}, {"refx", "refX"}, {(char*)0,(char*)0}, {"viewbox", "viewBox"}, {"targetx", "targetX"}, {(char*)0,(char*)0}, {"calcmode", "calcMode"}, {"maskunits", "maskUnits"}, {"viewtarget", "viewTarget"}, {"tablevalues", "tableValues"}, {"markerunits", "markerUnits"}, {"stitchtiles", "stitchTiles"}, {"startoffset", "startOffset"}, {"numoctaves", "numOctaves"}, {"requiredfeatures", "requiredFeatures"}, {"requiredextensions", "requiredExtensions"}, {"specularexponent", "specularExponent"}, {"surfacescale", "surfaceScale"}, {"specularconstant", "specularConstant"}, {"repeatcount", "repeatCount"}, {"clippathunits", "clipPathUnits"}, {"filterunits", "filterUnits"}, {"lengthadjust", "lengthAdjust"}, {"markerwidth", "markerWidth"}, {"maskcontentunits", "maskContentUnits"}, {(char*)0,(char*)0}, {"limitingconeangle", "limitingConeAngle"}, 
{"pointsatx", "pointsAtX"}, {"repeatdur", "repeatDur"}, {"keytimes", "keyTimes"}, {"keypoints", "keyPoints"}, {"keysplines", "keySplines"}, {"gradientunits", "gradientUnits"}, {"textlength", "textLength"}, {"stddeviation", "stdDeviation"}, {"primitiveunits", "primitiveUnits"}, {"edgemode", "edgeMode"}, {"refy", "refY"}, {"spreadmethod", "spreadMethod"}, {"preserveaspectratio", "preserveAspectRatio"}, {"targety", "targetY"}, {"pointsatz", "pointsAtZ"}, {"markerheight", "markerHeight"}, {"patternunits", "patternUnits"}, {"baseprofile", "baseProfile"}, {"systemlanguage", "systemLanguage"}, {"zoomandpan", "zoomAndPan"}, {"patterncontentunits", "patternContentUnits"}, {"glyphref", "glyphRef"}, {"xchannelselector", "xChannelSelector"}, {"attributetype", "attributeType"}, {"kernelunitlength", "kernelUnitLength"}, {"ychannelselector", "yChannelSelector"}, {"diffuseconstant", "diffuseConstant"}, {"pathlength", "pathLength"}, {"kernelmatrix", "kernelMatrix"}, {(char*)0,(char*)0}, {(char*)0,(char*)0}, {"preservealpha", "preserveAlpha"}, {"attributename", "attributeName"}, {"basefrequency", "baseFrequency"}, {(char*)0,(char*)0}, {(char*)0,(char*)0}, {"pointsaty", "pointsAtY"}, {"patterntransform", "patternTransform"}, {(char*)0,(char*)0}, {(char*)0,(char*)0}, {(char*)0,(char*)0}, {(char*)0,(char*)0}, {(char*)0,(char*)0}, {(char*)0,(char*)0}, {(char*)0,(char*)0}, {(char*)0,(char*)0}, {"gradienttransform", "gradientTransform"} }; if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH) { register unsigned int key = hash (str, len); if (key <= MAX_HASH_VALUE) if (len == lengthtable[key]) { register const char *s = wordlist[key].from; if (s && (((unsigned char)*str ^ (unsigned char)*s) & ~32) == 0 && !gumbo_ascii_strncasecmp(str, s, len)) return &wordlist[key]; } } return 0; } nokogumbo-2.0.5/gumbo-parser/src/parser.h0000644000004100000410000000255214030710665020413 0ustar www-datawww-data#ifndef GUMBO_PARSER_H_ #define GUMBO_PARSER_H_ #ifdef __cplusplus extern "C" { #endif // 
Contains the definition of the top-level GumboParser structure that's // threaded through basically every internal function in the library. struct GumboInternalParserState; struct GumboInternalOutput; struct GumboInternalOptions; struct GumboInternalTokenizerState; // An overarching struct that's threaded through (nearly) all functions in the // library, OOP-style. This gives each function access to the options and // output, along with any internal state needed for the parse. typedef struct GumboInternalParser { // Settings for this parse run. const struct GumboInternalOptions* _options; // Output for the parse. struct GumboInternalOutput* _output; // The internal tokenizer state, defined as a pointer to avoid a cyclic // dependency on html5tokenizer.h. The main parse routine is responsible for // initializing this on parse start, and destroying it on parse end. // End-users will never see a non-garbage value in this pointer. struct GumboInternalTokenizerState* _tokenizer_state; // The internal parser state. Initialized on parse start and destroyed on // parse end; end-users will never see a non-garbage value in this pointer. struct GumboInternalParserState* _parser_state; } GumboParser; #ifdef __cplusplus } #endif #endif // GUMBO_PARSER_H_ nokogumbo-2.0.5/gumbo-parser/src/attribute.h0000644000004100000410000000046714030710665021125 0ustar www-datawww-data#ifndef GUMBO_ATTRIBUTE_H_ #define GUMBO_ATTRIBUTE_H_ #include "gumbo.h" #ifdef __cplusplus extern "C" { #endif // Release the memory used for a GumboAttribute, including the attribute itself void gumbo_destroy_attribute(GumboAttribute* attribute); #ifdef __cplusplus } #endif #endif // GUMBO_ATTRIBUTE_H_ nokogumbo-2.0.5/gumbo-parser/src/token_buffer.c0000644000004100000410000000443414030710665021564 0ustar www-datawww-data/* Copyright 2018 Stephen Checkoway Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. 
You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ #include #include "ascii.h" #include "token_buffer.h" #include "tokenizer.h" #include "util.h" struct GumboInternalCharacterToken { GumboSourcePosition position; GumboStringPiece original_text; int c; }; void gumbo_character_token_buffer_init(GumboCharacterTokenBuffer* buffer) { buffer->data = NULL; buffer->length = 0; buffer->capacity = 0; } void gumbo_character_token_buffer_append ( const GumboToken* token, GumboCharacterTokenBuffer* buffer ) { assert(token->type == GUMBO_TOKEN_WHITESPACE || token->type == GUMBO_TOKEN_CHARACTER); if (buffer->length == buffer->capacity) { if (buffer->capacity == 0) buffer->capacity = 10; else buffer->capacity *= 2; size_t bytes = sizeof(*buffer->data) * buffer->capacity; buffer->data = gumbo_realloc(buffer->data, bytes); } size_t index = buffer->length++; buffer->data[index].position = token->position; buffer->data[index].original_text = token->original_text; buffer->data[index].c = token->v.character; } void gumbo_character_token_buffer_get ( const GumboCharacterTokenBuffer* buffer, size_t index, struct GumboInternalToken* output ) { assert(index < buffer->length); int c = buffer->data[index].c; output->type = gumbo_ascii_isspace(c)? 
GUMBO_TOKEN_WHITESPACE : GUMBO_TOKEN_CHARACTER; output->position = buffer->data[index].position; output->original_text = buffer->data[index].original_text; output->v.character = c; } void gumbo_character_token_buffer_clear(GumboCharacterTokenBuffer* buffer) { buffer->length = 0; } void gumbo_character_token_buffer_destroy(GumboCharacterTokenBuffer* buffer) { gumbo_free(buffer->data); buffer->data = NULL; buffer->length = 0; buffer->capacity = 0; } nokogumbo-2.0.5/gumbo-parser/src/string_buffer.h0000644000004100000410000000410014030710665021745 0ustar www-datawww-data#ifndef GUMBO_STRING_BUFFER_H_ #define GUMBO_STRING_BUFFER_H_ #include #include #include "gumbo.h" #ifdef __cplusplus extern "C" { #endif // A struct representing a mutable, growable string. This consists of a // heap-allocated buffer that may grow (by doubling) as necessary. When // converting to a string, this allocates a new buffer that is only as long as // it needs to be. Note that the internal buffer here is *not* nul-terminated, // so be sure not to use ordinary string manipulation functions on it. typedef struct { // A pointer to the beginning of the string. NULL if length == 0. char* data; // The length of the string fragment, in bytes. May be zero. size_t length; // The capacity of the buffer, in bytes. size_t capacity; } GumboStringBuffer; // Initializes a new GumboStringBuffer. void gumbo_string_buffer_init(GumboStringBuffer* output); // Ensures that the buffer contains at least a certain amount of space. Most // useful with snprintf and the other length-delimited string functions, which // may want to write directly into the buffer. void gumbo_string_buffer_reserve ( size_t min_capacity, GumboStringBuffer* output ); // Appends a single Unicode codepoint onto the end of the GumboStringBuffer. // This is essentially a UTF-8 encoder, and may add 1-4 bytes depending on the // value of the codepoint. 
void gumbo_string_buffer_append_codepoint (
  int c,
  GumboStringBuffer* output
);

// Appends a string onto the end of the GumboStringBuffer.
void gumbo_string_buffer_append_string (
  const GumboStringPiece* str,
  GumboStringBuffer* output
);

// Converts this string buffer to const char*, allocating a new buffer for it.
char* gumbo_string_buffer_to_string(const GumboStringBuffer* input);

// Reinitialize this string buffer. This clears it by setting length=0. It
// does not zero out the buffer itself.
void gumbo_string_buffer_clear(GumboStringBuffer* input);

// Deallocates this GumboStringBuffer.
void gumbo_string_buffer_destroy(GumboStringBuffer* buffer);

#ifdef __cplusplus
}
#endif

#endif // GUMBO_STRING_BUFFER_H_
nokogumbo-2.0.5/gumbo-parser/src/string_buffer.c0000644000004100000410000000571514030710665021755 0ustar www-datawww-data
/*
 Copyright 2010 Google Inc.

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

 https://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
*/

#include
#include "string_buffer.h"
#include "util.h"

// Size chosen via statistical analysis of ~60K websites.
// 99% of text nodes and 98% of attribute names/values fit in this initial size.
static const size_t kDefaultStringBufferSize = 5; static void maybe_resize_string_buffer ( size_t additional_chars, GumboStringBuffer* buffer ) { size_t new_length = buffer->length + additional_chars; size_t new_capacity = buffer->capacity; while (new_capacity < new_length) { new_capacity *= 2; } if (new_capacity != buffer->capacity) { buffer->data = gumbo_realloc(buffer->data, new_capacity); buffer->capacity = new_capacity; } } void gumbo_string_buffer_init(GumboStringBuffer* output) { output->data = gumbo_alloc(kDefaultStringBufferSize); output->length = 0; output->capacity = kDefaultStringBufferSize; } void gumbo_string_buffer_reserve ( size_t min_capacity, GumboStringBuffer* output ) { maybe_resize_string_buffer(min_capacity - output->length, output); } void gumbo_string_buffer_append_codepoint ( int c, GumboStringBuffer* output ) { // num_bytes is actually the number of continuation bytes, 1 less than the // total number of bytes. This is done to keep the loop below simple and // should probably change if we unroll it. 
int num_bytes, prefix; if (c <= 0x7f) { num_bytes = 0; prefix = 0; } else if (c <= 0x7ff) { num_bytes = 1; prefix = 0xc0; } else if (c <= 0xffff) { num_bytes = 2; prefix = 0xe0; } else { num_bytes = 3; prefix = 0xf0; } maybe_resize_string_buffer(num_bytes + 1, output); output->data[output->length++] = prefix | (c >> (num_bytes * 6)); for (int i = num_bytes - 1; i >= 0; --i) { output->data[output->length++] = 0x80 | (0x3f & (c >> (i * 6))); } } void gumbo_string_buffer_append_string ( const GumboStringPiece* str, GumboStringBuffer* output ) { maybe_resize_string_buffer(str->length, output); memcpy(output->data + output->length, str->data, str->length); output->length += str->length; } char* gumbo_string_buffer_to_string(const GumboStringBuffer* input) { char* buffer = gumbo_alloc(input->length + 1); memcpy(buffer, input->data, input->length); buffer[input->length] = '\0'; return buffer; } void gumbo_string_buffer_clear(GumboStringBuffer* input) { input->length = 0; } void gumbo_string_buffer_destroy(GumboStringBuffer* buffer) { gumbo_free(buffer->data); } nokogumbo-2.0.5/gumbo-parser/src/token_type.h0000644000004100000410000000056014030710665021275 0ustar www-datawww-data#ifndef GUMBO_TOKEN_TYPE_H_ #define GUMBO_TOKEN_TYPE_H_ // An enum representing the type of token. typedef enum { GUMBO_TOKEN_DOCTYPE, GUMBO_TOKEN_START_TAG, GUMBO_TOKEN_END_TAG, GUMBO_TOKEN_COMMENT, GUMBO_TOKEN_WHITESPACE, GUMBO_TOKEN_CHARACTER, GUMBO_TOKEN_CDATA, GUMBO_TOKEN_NULL, GUMBO_TOKEN_EOF } GumboTokenType; #endif // GUMBO_TOKEN_TYPE_H_ nokogumbo-2.0.5/gumbo-parser/src/util.c0000644000004100000410000000324114030710665020063 0ustar www-datawww-data/* Copyright 2017-2018 Craig Barnes. Copyright 2010 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. 
You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ #include #include #include #include "util.h" #include "gumbo.h" void* gumbo_alloc(size_t size) { void* ptr = malloc(size); if (unlikely(ptr == NULL)) { perror(__func__); abort(); } return ptr; } void* gumbo_realloc(void* ptr, size_t size) { ptr = realloc(ptr, size); if (unlikely(ptr == NULL)) { perror(__func__); abort(); } return ptr; } void gumbo_free(void* ptr) { free(ptr); } char* gumbo_strdup(const char* str) { const size_t size = strlen(str) + 1; // The strdup(3) function isn't available in strict "-std=c99" mode // (it's part of POSIX, not C99), so use malloc(3) and memcpy(3) // instead: char* buffer = gumbo_alloc(size); return memcpy(buffer, str, size); } #ifdef GUMBO_DEBUG #include // Debug function to trace operation of the parser // (define GUMBO_DEBUG to use). void gumbo_debug(const char* format, ...) { va_list args; va_start(args, format); vprintf(format, args); va_end(args); fflush(stdout); } #else void gumbo_debug(const char* UNUSED_ARG(format), ...) 
{} #endif nokogumbo-2.0.5/gumbo-parser/src/ascii.c0000644000004100000410000000464214030710665020204 0ustar www-datawww-data#include "ascii.h" int gumbo_ascii_strcasecmp(const char *s1, const char *s2) { int c1, c2; while (*s1 && *s2) { c1 = (int)(unsigned char) gumbo_ascii_tolower(*s1); c2 = (int)(unsigned char) gumbo_ascii_tolower(*s2); if (c1 != c2) { return (c1 - c2); } s1++; s2++; } return (((int)(unsigned char) *s1) - ((int)(unsigned char) *s2)); } int gumbo_ascii_strncasecmp(const char *s1, const char *s2, size_t n) { int c1, c2; while (n && *s1 && *s2) { n -= 1; c1 = (int)(unsigned char) gumbo_ascii_tolower(*s1); c2 = (int)(unsigned char) gumbo_ascii_tolower(*s2); if (c1 != c2) { return (c1 - c2); } s1++; s2++; } if (n) { return (((int)(unsigned char) *s1) - ((int)(unsigned char) *s2)); } return 0; } const unsigned char _gumbo_ascii_table[0x80] = { 0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x03,0x03,0x01,0x03,0x03,0x01,0x01, 0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01, 0x02,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, 0x1c,0x1c,0x1c,0x1c,0x1c,0x1c,0x1c,0x1c,0x1c,0x1c,0x00,0x00,0x00,0x00,0x00,0x00, 0x00,0x28,0x28,0x28,0x28,0x28,0x28,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20, 0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x00,0x00,0x00,0x00,0x00, 0x00,0x50,0x50,0x50,0x50,0x50,0x50,0x40,0x40,0x40,0x40,0x40,0x40,0x40,0x40,0x40, 0x40,0x40,0x40,0x40,0x40,0x40,0x40,0x40,0x40,0x40,0x40,0x00,0x00,0x00,0x00,0x00, }; // Table generation code. 
// clang -DGUMBO_GEN_TABLE=1 ascii.c && ./a.out && rm a.out #if GUMBO_GEN_TABLE #include int main() { printf("const unsigned char _gumbo_ascii_table[0x80] = {"); for (int c = 0; c < 0x80; ++c) { unsigned int x = 0; // https://infra.spec.whatwg.org/#ascii-code-point if (c <= 0x1f) x |= GUMBO_ASCII_CNTRL; if (c == 0x09 || c == 0x0a || c == 0x0c || c == 0x0d || c == 0x20) x |= GUMBO_ASCII_SPACE; if (c >= 0x30 && c <= 0x39) x |= GUMBO_ASCII_DIGIT; if ((c >= 0x30 && c <= 0x39) || (c >= 0x41 && c <= 0x46)) x |= GUMBO_ASCII_UPPER_XDIGIT; if ((c >= 0x30 && c <= 0x39) || (c >= 0x61 && c <= 0x66)) x |= GUMBO_ASCII_LOWER_XDIGIT; if (c >= 0x41 && c <= 0x5a) x |= GUMBO_ASCII_UPPER_ALPHA; if (c >= 0x61 && c <= 0x7a) x |= GUMBO_ASCII_LOWER_ALPHA; printf("%s0x%02x,", (c % 16 == 0? "\n " : ""), x); } printf("\n};\n"); return 0; } #endif nokogumbo-2.0.5/gumbo-parser/src/replacement.h0000644000004100000410000000122714030710665021414 0ustar www-datawww-data#ifndef GUMBO_REPLACEMENT_H_ #define GUMBO_REPLACEMENT_H_ #include #include "gumbo.h" typedef struct { const char *const from; const char *const to; } StringReplacement; const StringReplacement *gumbo_get_svg_tag_replacement ( const char* str, size_t len ); const StringReplacement *gumbo_get_svg_attr_replacement ( const char* str, size_t len ); typedef struct { const char *const from; const char *const local_name; const GumboAttributeNamespaceEnum attr_namespace; } ForeignAttrReplacement; const ForeignAttrReplacement *gumbo_get_foreign_attr_replacement ( const char* str, size_t len ); #endif // GUMBO_REPLACEMENT_H_ nokogumbo-2.0.5/gumbo-parser/src/tokenizer.c0000644000004100000410000035662314030710665021137 0ustar www-datawww-data/* Copyright 2010 Google Inc. Copyright 2017-2018 Craig Barnes Copyright 2018 Stephen Checkoway Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. 
You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ /* Coding conventions specific to this file: 1. Functions that fill in a token should be named emit_*, and should be followed immediately by a return from the tokenizer. 2. Functions that shuffle data from temporaries to final API structures should be named finish_*, and be called just before the tokenizer exits the state that accumulates the temporary. 3. All internal data structures should be kept in an initialized state from tokenizer creation onwards, ready to accept input. When a buffer's flushed and reset, it should be deallocated and immediately reinitialized. 4. Make sure there are appropriate break statements following each state. 5. Assertions on the state of the temporary and tag buffers are usually a good idea, and should go at the entry point of each state when added. 6. Statement order within states goes: 1. Add parse errors, if appropriate. 2. Call finish_* functions to build up tag state. 2. Switch to new state. Set _reconsume flag if appropriate. 3. Perform any other temporary buffer manipulation. 4. Emit tokens 5. Return/break. This order ensures that we can verify that every emit is followed by a return, ensures that the correct state is recorded with any parse errors, and prevents parse error position from being messed up by possible mark/resets in temporary buffer manipulation. 
*/ #include #include #include "tokenizer.h" #include "ascii.h" #include "attribute.h" #include "char_ref.h" #include "error.h" #include "gumbo.h" #include "parser.h" #include "string_buffer.h" #include "token_type.h" #include "tokenizer_states.h" #include "utf8.h" #include "util.h" #include "vector.h" // Compared against _temporary_buffer to determine if we're in // double-escaped script mode. static const GumboStringPiece kScriptTag = {.data = "script", .length = 6}; // An enum for the return value of each individual state. Each of the emit_* // functions should return EMIT_TOKEN and should be called as // return emit_foo(parser, ..., output); // Each of the handle_*_state functions that do not return emit_* should // instead return CONTINUE to indicate to gumbo_lex to continue lexing. typedef enum { EMIT_TOKEN, CONTINUE, } StateResult; // This is a struct containing state necessary to build up a tag token, // character by character. typedef struct GumboInternalTagState { // A buffer to accumulate characters for various GumboStringPiece fields. GumboStringBuffer _buffer; // A pointer to the start of the original text corresponding to the contents // of the buffer. const char* _original_text; // The current tag enum, computed once the tag name state has finished so that // the buffer can be re-used for building up attributes. GumboTag _tag; // The current tag name. It's set at the same time that _tag is set if _tag // is set to GUMBO_TAG_UNKNOWN. char *_name; // The starting location of the text in the buffer. GumboSourcePosition _start_pos; // The current list of attributes. This is copied (and ownership of its data // transferred) to the GumboStartTag token upon completion of the tag. New // attributes are added as soon as their attribute name state is complete, and // values are filled in by operating on _attributes.data[attributes.length-1]. GumboVector /* GumboAttribute */ _attributes; // If true, the next attribute value to be finished should be dropped. 
This // happens if a duplicate attribute name is encountered - we want to consume // the attribute value, but shouldn't overwrite the existing value. bool _drop_next_attr_value; // The last start tag to have been emitted by the tokenizer. This is // necessary to check for appropriate end tags. GumboTag _last_start_tag; // If true, then this is a start tag. If false, it's an end tag. This is // necessary to generate the appropriate token type at tag-closing time. bool _is_start_tag; // If true, then this tag is "self-closing" and doesn't have an end tag. bool _is_self_closing; } GumboTagState; // This is the main tokenizer state struct, containing all state used by in // tokenizing the input stream. typedef struct GumboInternalTokenizerState { // The current lexer state. Starts in GUMBO_LEX_DATA. GumboTokenizerEnum _state; // A flag indicating whether the current input character needs to reconsumed // in another state, or whether the next input character should be read for // the next iteration of the state loop. This is set when the spec reads // "Reconsume the current input character in..." bool _reconsume_current_input; // A flag indicating whether the adjusted current node is a foreign element. // This is set by gumbo_tokenizer_set_is_adjusted_current_node_foreign and // checked in the markup declaration state. bool _is_adjusted_current_node_foreign; // A flag indicating whether the tokenizer is in a CDATA section. If so, then // text tokens emitted will be GUMBO_TOKEN_CDATA. bool _is_in_cdata; // Certain states (notably character references) may emit two character tokens // at once, but the contract for lex() fills in only one token at a time. The // extra character is buffered here, and then this is checked on entry to // lex(). If a character is stored here, it's immediately emitted and control // returns from the lexer. kGumboNoChar is used to represent 'no character // stored.' 
// // Note that characters emitted through this mechanism will have their source // position marked as the character under the mark, i.e. multiple characters // may be emitted with the same position. This is desirable for character // references, but unsuitable for many other cases. Use the _temporary_buffer // mechanism if the buffered characters must have their original positions in // the document. int _buffered_emit_char; // A temporary buffer to accumulate characters, as described by the "temporary // buffer" phrase in the tokenizer spec. We use this in a somewhat unorthodox // way: In situations where the spec calls for inserting characters into the // temporary buffer that exactly match the input in order to emit them as // character tokens, we don't actually do it. // Instead, we mark the input and reset the input to it using set_mark() and // emit_from_mark(). We do use the temporary buffer for other uses such as // DOCTYPEs, comments, and detecting escaped tags are handled like any other end tag, putting the script's // text into a text node child and closing the current node. } assert(token->type == GUMBO_TOKEN_END_TAG); GumboNode* node = get_current_node(parser); GumboTag tag = token->v.end_tag.tag; const char* name = token->v.end_tag.name; assert(node != NULL); if (!node_tagname_is(node, tag, name)) parser_add_parse_error(parser, token); int i = parser->_parser_state->_open_elements.length; for (--i; i > 0;) { // Here we move up the stack until we find an HTML element (in which // case we do nothing) or we find the element that we're about to // close (in which case we pop everything we've seen until that // point.) gumbo_debug("Foreign %s node at %d.\n", node->v.element.name, i); if (node_tagname_is(node, tag, name)) { gumbo_debug("Matches.\n"); while (node != pop_current_node(parser)) { // Pop all the nodes below the current one. 
        // Node is guaranteed to
        // be an element on the stack of open elements (set below), so
        // this loop is guaranteed to terminate.
      }
      return;
    }
    --i;
    node = parser->_parser_state->_open_elements.data[i];
    if (node->v.element.tag_namespace == GUMBO_NAMESPACE_HTML) {
      // The loop continues only in foreign namespaces.
      break;
    }
  }
  assert(node->v.element.tag_namespace == GUMBO_NAMESPACE_HTML);
  if (i == 0)
    return;
  // We can't call handle_token directly because the current node is still in
  // a foreign namespace, so it would re-enter this and result in infinite
  // recursion.
  handle_html_content(parser, token);
}

// https://html.spec.whatwg.org/multipage/parsing.html#tree-construction
static void handle_token(GumboParser* parser, GumboToken* token) {
  if (
    parser->_parser_state->_ignore_next_linefeed
    && token->type == GUMBO_TOKEN_WHITESPACE
    && token->v.character == '\n'
  ) {
    parser->_parser_state->_ignore_next_linefeed = false;
    ignore_token(parser);
    return;
  }
  // This needs to be reset both here and in the conditional above to catch both
  // the case where the next token is not whitespace (so we don't ignore
  // whitespace in the middle of <pre> tags) and where there are multiple
  // whitespace tokens (so we don't ignore the second one).
  parser->_parser_state->_ignore_next_linefeed = false;

  if (tag_is(token, kEndTag, GUMBO_TAG_BODY)) {
    parser->_parser_state->_closed_body_tag = true;
  }
  if (tag_is(token, kEndTag, GUMBO_TAG_HTML)) {
    parser->_parser_state->_closed_html_tag = true;
  }

  const GumboNode* current_node = get_adjusted_current_node(parser);
  assert (
    !current_node
    || current_node->type == GUMBO_NODE_ELEMENT
    || current_node->type == GUMBO_NODE_TEMPLATE
  );
  if (current_node)
    gumbo_debug("Current node: <%s>.\n", current_node->v.element.name);
  if (!current_node ||
      current_node->v.element.tag_namespace == GUMBO_NAMESPACE_HTML ||
      (is_mathml_integration_point(current_node) &&
          (token->type == GUMBO_TOKEN_CHARACTER ||
              token->type == GUMBO_TOKEN_WHITESPACE ||
              token->type == GUMBO_TOKEN_NULL ||
              (token->type == GUMBO_TOKEN_START_TAG &&
                  !tag_in(token, kStartTag,
                      &(const TagSet){TAG(MGLYPH), TAG(MALIGNMARK)})))) ||
      (current_node->v.element.tag_namespace == GUMBO_NAMESPACE_MATHML &&
          node_qualified_tag_is(
              current_node, GUMBO_NAMESPACE_MATHML, GUMBO_TAG_ANNOTATION_XML) &&
          tag_is(token, kStartTag, GUMBO_TAG_SVG)) ||
      (is_html_integration_point(current_node) &&
          (token->type == GUMBO_TOKEN_START_TAG ||
              token->type == GUMBO_TOKEN_CHARACTER ||
              token->type == GUMBO_TOKEN_NULL ||
              token->type == GUMBO_TOKEN_WHITESPACE)) ||
      token->type == GUMBO_TOKEN_EOF) {
    handle_html_content(parser, token);
  } else {
    handle_in_foreign_content(parser, token);
  }
}
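
/*
 * Illustrative sketch, not part of Gumbo: the dispatch rule in handle_token()
 * above can be modelled as a pure predicate over a few facts about the
 * adjusted current node and the incoming token. Every name below (SketchNode,
 * SketchToken, SketchTokenKind, sketch_use_html_rules) is hypothetical and
 * exists only for this example; the real code derives the same facts from
 * GumboNode and GumboToken. Character, whitespace and NUL tokens are folded
 * into a single "text" kind here because the rule treats them alike.
 *
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical token kinds, reduced to what the dispatch rule inspects.
typedef enum {
  SKETCH_TOKEN_START_TAG,
  SKETCH_TOKEN_END_TAG,
  SKETCH_TOKEN_TEXT,  // character, whitespace, or NUL character tokens
  SKETCH_TOKEN_EOF
} SketchTokenKind;

typedef struct {
  SketchTokenKind kind;
  bool is_mglyph_or_malignmark;  // only meaningful for start tags
  bool is_svg;                   // only meaningful for start tags
} SketchToken;

// Hypothetical summary of the adjusted current node.
typedef struct {
  bool in_html_namespace;
  bool is_mathml_text_integration_point;
  bool is_mathml_annotation_xml;
  bool is_html_integration_point;
} SketchNode;

// Returns true when the token should go through the HTML insertion-mode
// rules, and false when it should be handled as foreign content.
static bool sketch_use_html_rules(
  const SketchNode* node,
  const SketchToken* tok
) {
  // No adjusted current node, or an HTML one: always use the HTML rules.
  if (node == NULL || node->in_html_namespace)
    return true;
  // End of file is always handled by the HTML rules.
  if (tok->kind == SKETCH_TOKEN_EOF)
    return true;
  const bool is_text = tok->kind == SKETCH_TOKEN_TEXT;
  const bool is_start = tok->kind == SKETCH_TOKEN_START_TAG;
  // MathML text integration point: text, or any start tag except
  // mglyph and malignmark.
  if (node->is_mathml_text_integration_point
      && (is_text || (is_start && !tok->is_mglyph_or_malignmark)))
    return true;
  // MathML annotation-xml followed by an svg start tag.
  if (node->is_mathml_annotation_xml && is_start && tok->is_svg)
    return true;
  // HTML integration point: text or any start tag.
  if (node->is_html_integration_point && (is_text || is_start))
    return true;
  return false;
}
```
 *
 * Keeping the rule as a side-effect-free predicate like this would make it
 * easy to check each clause against the tree construction dispatcher in the
 * HTML standard; the production code inlines the same conditions instead.
 */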

static GumboNode* create_fragment_ctx_element (
  const char* tag_name,
  GumboNamespaceEnum ns,
  const char* encoding
) {
  assert(tag_name);
  GumboTag tag = gumbo_tagn_enum(tag_name, strlen(tag_name));
  GumboNodeType type =
    ns == GUMBO_NAMESPACE_HTML && tag == GUMBO_TAG_TEMPLATE
    ? GUMBO_NODE_TEMPLATE : GUMBO_NODE_ELEMENT;
  GumboNode* node = create_node(type);
  GumboElement* element = &node->v.element;
  element->children = kGumboEmptyVector;
  if (encoding) {
    gumbo_vector_init(1, &element->attributes);
    GumboAttribute* attr = gumbo_alloc(sizeof(GumboAttribute));
    attr->attr_namespace = GUMBO_ATTR_NAMESPACE_NONE;
    attr->name = "encoding"; // Do not free this!
    attr->original_name = kGumboEmptyString;
    attr->value = encoding; // Do not free this!
    attr->original_value = kGumboEmptyString;
    attr->name_start = kGumboEmptySourcePosition;
    gumbo_vector_add(attr, &element->attributes);
  } else {
    element->attributes = kGumboEmptyVector;
  }
  element->tag = tag;
  element->tag_namespace = ns;
  element->name = tag_name; // Do not free this!
  element->original_tag = kGumboEmptyString;
  element->original_end_tag = kGumboEmptyString;
  element->start_pos = kGumboEmptySourcePosition;
  element->end_pos = kGumboEmptySourcePosition;
  return node;
}

static void destroy_fragment_ctx_element(GumboNode* ctx) {
  assert(ctx->type == GUMBO_NODE_ELEMENT || ctx->type == GUMBO_NODE_TEMPLATE);
  GumboElement* element = &ctx->v.element;
  element->name = NULL; // Do not free.
  if (element->attributes.length > 0) {
    assert(element->attributes.length == 1);
    GumboAttribute* attr = gumbo_vector_pop(&element->attributes);
    // Do not free attr->name or attr->value, just free the attr.
    gumbo_free(attr);
  }
  destroy_node(ctx);
}

static void fragment_parser_init (
  GumboParser* parser,
  const GumboOptions* options
) {
  assert(options->fragment_context != NULL);
  const char* fragment_ctx = options->fragment_context;
  GumboNamespaceEnum fragment_namespace = options->fragment_namespace;
  const char* fragment_encoding = options->fragment_encoding;
  GumboQuirksModeEnum quirks = options->quirks_mode;
  bool ctx_has_form_ancestor = options->fragment_context_has_form_ancestor;

  GumboNode* root;
  // 2.
  get_document_node(parser)->v.document.doc_type_quirks_mode = quirks;

  // 3.
  parser->_parser_state->_fragment_ctx =
    create_fragment_ctx_element(fragment_ctx, fragment_namespace, fragment_encoding);
  GumboTag ctx_tag = parser->_parser_state->_fragment_ctx->v.element.tag;

  // 4.
  if (fragment_namespace == GUMBO_NAMESPACE_HTML) {
    // Non-HTML namespaces always start in the DATA state.
    switch (ctx_tag) {
      case GUMBO_TAG_TITLE:
      case GUMBO_TAG_TEXTAREA:
        gumbo_tokenizer_set_state(parser, GUMBO_LEX_RCDATA);
        break;

      case GUMBO_TAG_STYLE:
      case GUMBO_TAG_XMP:
      case GUMBO_TAG_IFRAME:
      case GUMBO_TAG_NOEMBED:
      case GUMBO_TAG_NOFRAMES:
        gumbo_tokenizer_set_state(parser, GUMBO_LEX_RAWTEXT);
        break;

      case GUMBO_TAG_SCRIPT:
        gumbo_tokenizer_set_state(parser, GUMBO_LEX_SCRIPT_DATA);
        break;

      case GUMBO_TAG_NOSCRIPT:
        /* scripting is disabled in Gumbo, so leave the tokenizer
         * in the default data state */
        break;

      case GUMBO_TAG_PLAINTEXT:
        gumbo_tokenizer_set_state(parser, GUMBO_LEX_PLAINTEXT);
        break;

      default:
        /* default data state */
        break;
    }
  }

  // 5. 6. 7.
  root = insert_element_of_tag_type (
    parser,
    GUMBO_TAG_HTML,
    GUMBO_INSERTION_IMPLIED
  );
  parser->_output->root = root;

  // 8.
  if (ctx_tag == GUMBO_TAG_TEMPLATE) {
    push_template_insertion_mode(parser, GUMBO_INSERTION_MODE_IN_TEMPLATE);
  }

  // 10.
  reset_insertion_mode_appropriately(parser);

  // 11.
  if (ctx_has_form_ancestor
      || (ctx_tag == GUMBO_TAG_FORM
          && fragment_namespace == GUMBO_NAMESPACE_HTML)) {
    static const GumboNode form_ancestor = {
      .type = GUMBO_NODE_ELEMENT,
      .parent = NULL,
      .index_within_parent = -1,
      .parse_flags = GUMBO_INSERTION_BY_PARSER,
      .v.element = {
        .children = GUMBO_EMPTY_VECTOR_INIT,
        .tag = GUMBO_TAG_FORM,
        .name = NULL,
        .tag_namespace = GUMBO_NAMESPACE_HTML,
        .original_tag = GUMBO_EMPTY_STRING_INIT,
        .original_end_tag = GUMBO_EMPTY_STRING_INIT,
        .start_pos = GUMBO_EMPTY_SOURCE_POSITION_INIT,
        .end_pos = GUMBO_EMPTY_SOURCE_POSITION_INIT,
        .attributes = GUMBO_EMPTY_VECTOR_INIT,
      },
    };
    // This cast is okay because _form_element is only modified if it is
    // in the list of open elements. This one never will be.
    parser->_parser_state->_form_element = (GumboNode *)&form_ancestor;
  }
}

GumboOutput* gumbo_parse(const char* buffer) {
  return gumbo_parse_with_options (
    &kGumboDefaultOptions,
    buffer,
    strlen(buffer)
  );
}

GumboOutput* gumbo_parse_with_options (
  const GumboOptions* options,
  const char* buffer,
  size_t length
) {
  GumboParser parser;
  parser._options = options;
  output_init(&parser);
  gumbo_tokenizer_state_init(&parser, buffer, length);
  parser_state_init(&parser);

  if (options->fragment_context != NULL)
    fragment_parser_init(&parser, options);

  GumboParserState* state = parser._parser_state;
  gumbo_debug (
    "Parsing %.*s.\n",
    (int) length,
    buffer
  );

  // Sanity check so that infinite loops die with an assertion failure instead
  // of hanging the process before we ever get an error.
  uint_fast32_t loop_count = 0;

  const unsigned int max_tree_depth = options->max_tree_depth;
  GumboToken token;

  do {
    if (state->_reprocess_current_token) {
      state->_reprocess_current_token = false;
    } else {
      GumboNode* adjusted_current_node = get_adjusted_current_node(&parser);
      gumbo_tokenizer_set_is_adjusted_current_node_foreign (
        &parser,
        adjusted_current_node &&
          adjusted_current_node->v.element.tag_namespace != GUMBO_NAMESPACE_HTML
      );
      gumbo_lex(&parser, &token);
    }

    const char* token_type = "text";
    switch (token.type) {
      case GUMBO_TOKEN_DOCTYPE:
        token_type = "doctype";
        break;
      case GUMBO_TOKEN_START_TAG:
        if (token.v.start_tag.tag == GUMBO_TAG_UNKNOWN)
          token_type = token.v.start_tag.name;
        else
          token_type = gumbo_normalized_tagname(token.v.start_tag.tag);
        break;
      case GUMBO_TOKEN_END_TAG:
        token_type = gumbo_normalized_tagname(token.v.end_tag.tag);
        break;
      case GUMBO_TOKEN_COMMENT:
        token_type = "comment";
        break;
      default:
        break;
    }
    gumbo_debug (
      "Handling %s token @%lu:%lu in state %u.\n",
      (char*) token_type,
      (unsigned long)token.position.line,
      (unsigned long)token.position.column,
      state->_insertion_mode
    );

    state->_current_token = &token;
    state->_self_closing_flag_acknowledged = false;

    handle_token(&parser, &token);

    // Check for memory leaks when ownership is transferred from start tag
    // tokens to nodes.
    assert (
      state->_reprocess_current_token
      || token.type != GUMBO_TOKEN_START_TAG
      || (token.v.start_tag.attributes.data == NULL
          && token.v.start_tag.name == NULL)
    );

    if (!state->_reprocess_current_token) {
      // If we're done with the token, check for unacknowledged self-closing
      // flags on start tags.
      if (token.type == GUMBO_TOKEN_START_TAG &&
          token.v.start_tag.is_self_closing &&
          !state->_self_closing_flag_acknowledged) {
        GumboError* error = gumbo_add_error(&parser);
        if (error) {
          // This is essentially a tokenizer error that's only caught during
          // tree construction.
          error->type = GUMBO_ERR_NON_VOID_HTML_ELEMENT_START_TAG_WITH_TRAILING_SOLIDUS;
          error->original_text = token.original_text;
          error->position = token.position;
        }
      }
      // Make sure we free the end tag's name since it doesn't get transferred
      // to a node.
      if (token.type == GUMBO_TOKEN_END_TAG &&
          token.v.end_tag.tag == GUMBO_TAG_UNKNOWN)
        gumbo_free(token.v.end_tag.name);
    }

    if (unlikely(state->_open_elements.length > max_tree_depth)) {
      parser._output->status = GUMBO_STATUS_TREE_TOO_DEEP;
      gumbo_debug("Tree depth limit exceeded.\n");
      break;
    }

    ++loop_count;
    assert(loop_count < 1000000000UL);

  } while (
    (token.type != GUMBO_TOKEN_EOF || state->_reprocess_current_token)
    && !(options->stop_on_first_error && parser._output->document_error)
  );

  finish_parsing(&parser);
  // For API uniformity reasons, if the doctype still has nulls, convert them to
  // empty strings.
  GumboDocument* doc_type = &parser._output->document->v.document;
  if (doc_type->name == NULL) {
    doc_type->name = gumbo_strdup("");
  }
  if (doc_type->public_identifier == NULL) {
    doc_type->public_identifier = gumbo_strdup("");
  }
  if (doc_type->system_identifier == NULL) {
    doc_type->system_identifier = gumbo_strdup("");
  }

  parser_state_destroy(&parser);
  gumbo_tokenizer_state_destroy(&parser);
  return parser._output;
}

const char* gumbo_status_to_string(GumboOutputStatus status) {
  switch (status) {
    case GUMBO_STATUS_OK:
      return "OK";
    case GUMBO_STATUS_OUT_OF_MEMORY:
      return "System allocator returned NULL during parsing";
    case GUMBO_STATUS_TOO_MANY_ATTRIBUTES:
      return "Attributes per element limit exceeded";
    case GUMBO_STATUS_TREE_TOO_DEEP:
      return "Document tree depth limit exceeded";
    default:
      return "Unknown GumboOutputStatus value";
  }
}

void gumbo_destroy_node(GumboNode* node) {
  destroy_node(node);
}

void gumbo_destroy_output(GumboOutput* output) {
  destroy_node(output->document);
  for (unsigned int i = 0; i < output->errors.length; ++i) {
    gumbo_error_destroy(output->errors.data[i]);
  }
  gumbo_vector_destroy(&output->errors);
  gumbo_free(output);
}
/* nokogumbo-2.0.5/gumbo-parser/src/error.c */
/*
 Copyright 2010 Google Inc.

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
*/

#include <assert.h>
#include <inttypes.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>
#include "ascii.h"
#include "error.h"
#include "gumbo.h"
#include "macros.h"
#include "parser.h"
#include "string_buffer.h"
#include "util.h"
#include "vector.h"

// Prints a formatted message to a StringBuffer. This automatically resizes the
// StringBuffer as necessary to fit the message. Returns the number of bytes
// written.
static int PRINTF(2) print_message (
  GumboStringBuffer* output,
  const char* format,
  ...
) {
  va_list args;
  int remaining_capacity = output->capacity - output->length;
  va_start(args, format);
  int bytes_written = vsnprintf (
    output->data + output->length,
    remaining_capacity,
    format,
    args
  );
  va_end(args);
#if _MSC_VER && _MSC_VER < 1900
  if (bytes_written == -1) {
    // vsnprintf returns -1 on older MSVC++ if there's not enough capacity,
    // instead of returning the number of bytes that would've been written had
    // there been enough. In this case, we'll double the buffer size and hope
    // it fits when we retry (letting it fail and returning 0 if it doesn't),
    // since there's no way to smartly resize the buffer.
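    gumbo_string_buffer_reserve(output->capacity * 2, output);
    // Recompute the free space now that the buffer has grown.
    remaining_capacity = output->capacity - output->length;
    va_start(args, format);
    int result = vsnprintf (
      output->data + output->length,
      remaining_capacity,
      format,
      args
    );
    va_end(args);
    if (result == -1) {
      return 0;
    }
    // Account for the bytes written by the successful retry.
    output->length += result;
    return result;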
  }
#else
  // -1 in standard C99 indicates an encoding error. Return 0 and do nothing.
  if (bytes_written == -1) {
    return 0;
  }
#endif

  if (bytes_written >= remaining_capacity) {
    gumbo_string_buffer_reserve(output->capacity + bytes_written, output);
    remaining_capacity = output->capacity - output->length;
    va_start(args, format);
    bytes_written = vsnprintf (
      output->data + output->length,
      remaining_capacity,
      format,
      args
    );
    va_end(args);
  }
  output->length += bytes_written;
  return bytes_written;
}

static void print_tag_stack (
  const GumboParserError* error,
  GumboStringBuffer* output
) {
  print_message(output, "  Currently open tags: ");
  for (unsigned int i = 0; i < error->tag_stack.length; ++i) {
    if (i) {
      print_message(output, ", ");
    }
    GumboTag tag = (GumboTag) error->tag_stack.data[i];
    print_message(output, "%s", gumbo_normalized_tagname(tag));
  }
  gumbo_string_buffer_append_codepoint('.', output);
}

static void handle_tokenizer_error (
  const GumboError* error,
  GumboStringBuffer* output
) {
  switch (error->type) {
  case GUMBO_ERR_ABRUPT_CLOSING_OF_EMPTY_COMMENT:
    print_message (
      output,
      "Empty comment abruptly closed by '%s', use '-->'.",
      error->v.tokenizer.state == GUMBO_LEX_COMMENT_START? ">" : "->"
    );
    break;
  case GUMBO_ERR_ABRUPT_DOCTYPE_PUBLIC_IDENTIFIER:
    print_message (
      output,
      "DOCTYPE public identifier missing closing %s.",
      error->v.tokenizer.state == GUMBO_LEX_DOCTYPE_PUBLIC_ID_DOUBLE_QUOTED?
        "quotation mark (\")" : "apostrophe (')"
    );
    break;
  case GUMBO_ERR_ABRUPT_DOCTYPE_SYSTEM_IDENTIFIER:
    print_message (
      output,
      "DOCTYPE system identifier missing closing %s.",
      error->v.tokenizer.state == GUMBO_LEX_DOCTYPE_SYSTEM_ID_DOUBLE_QUOTED?
        "quotation mark (\")" : "apostrophe (')"
    );
    break;
  case GUMBO_ERR_ABSENCE_OF_DIGITS_IN_NUMERIC_CHARACTER_REFERENCE:
    print_message (
      output,
      "Numeric character reference '%.*s' does not contain any %sdigits.",
      (int)error->original_text.length, error->original_text.data,
      error->v.tokenizer.state == GUMBO_LEX_HEXADECIMAL_CHARACTER_REFERENCE_START? "hexadecimal " : ""
    );
    break;
  case GUMBO_ERR_CDATA_IN_HTML_CONTENT:
    print_message(output, "CDATA section outside foreign (SVG or MathML) content.");
    break;
  case GUMBO_ERR_CHARACTER_REFERENCE_OUTSIDE_UNICODE_RANGE:
    print_message (
      output,
      "Numeric character reference '%.*s' references a code point that is outside the valid Unicode range.",
      (int)error->original_text.length, error->original_text.data
    );
    break;
  case GUMBO_ERR_CONTROL_CHARACTER_IN_INPUT_STREAM:
    print_message (
      output,
      "Input contains prohibited control code point U+%04X.",
      error->v.tokenizer.codepoint
    );
    break;
  case GUMBO_ERR_CONTROL_CHARACTER_REFERENCE:
    print_message (
      output,
      "Numeric character reference '%.*s' references prohibited control code point U+%04X.",
      (int)error->original_text.length, error->original_text.data,
      error->v.tokenizer.codepoint
    );
    break;
  case GUMBO_ERR_END_TAG_WITH_ATTRIBUTES:
    print_message(output, "End tag contains attributes.");
    break;
  case GUMBO_ERR_DUPLICATE_ATTRIBUTE:
    print_message(output, "Tag contains multiple attributes with the same name.");
    break;
  case GUMBO_ERR_END_TAG_WITH_TRAILING_SOLIDUS:
    print_message(output, "End tag ends with '/>', use '>'.");
    break;
  case GUMBO_ERR_EOF_BEFORE_TAG_NAME:
    print_message(output, "End of input where a tag name is expected.");
    break;
  case GUMBO_ERR_EOF_IN_CDATA:
    print_message(output, "End of input in CDATA section.");
    break;
  case GUMBO_ERR_EOF_IN_COMMENT:
    print_message(output, "End of input in comment.");
    break;
  case GUMBO_ERR_EOF_IN_DOCTYPE:
    print_message(output, "End of input in DOCTYPE.");
    break;
  case GUMBO_ERR_EOF_IN_SCRIPT_HTML_COMMENT_LIKE_TEXT:
    print_message(output, "End of input in text that resembles an HTML comment inside script element content.");
    break;
  case GUMBO_ERR_EOF_IN_TAG:
    print_message(output, "End of input in tag.");
    break;
  case GUMBO_ERR_INCORRECTLY_CLOSED_COMMENT:
    print_message(output, "Comment closed incorrectly by '--!>', use '-->'.");
    break;
  case GUMBO_ERR_INCORRECTLY_OPENED_COMMENT:
    print_message(output, "Comment, DOCTYPE, or CDATA opened incorrectly, use '|\Z)/m, '')
          data.scan(/<meta.*?>/m).each do |meta|
            encoding ||= meta[/charset=["']?([^>]*?)($|["'\s>])/im, 1]
          end
        end

        # if all else fails, default to the official default encoding for HTML
        encoding ||= Encoding::ISO_8859_1

        # change the encoding to match the detected or inferred encoding
        body = body.dup
        begin
          body.force_encoding(encoding)
        rescue ArgumentError
          body.force_encoding(Encoding::ISO_8859_1)
        end
      end

      body.encode(Encoding::UTF_8)
    end
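=begin
The charset sniffing above can be tried in isolation. The following is a
minimal sketch, not part of Nokogumbo: `sniff_meta_charset` is a hypothetical
helper, and the comment-stripping and meta-tag patterns are assumptions. Only
the `charset=` pattern is copied from the code above.

```ruby
# Hypothetical helper: return the first charset label found in a <meta> tag,
# or nil when there is none.
def sniff_meta_charset(html)
  # Assumed pattern: drop HTML comments so commented-out tags are ignored.
  data = html.gsub(/<!--.*?(-->|\Z)/m, '')
  encoding = nil
  # Assumed pattern: visit every <meta ...> tag in document order.
  data.scan(/<meta.*?>/m).each do |meta|
    # Same charset pattern as read_and_encode; the first match wins.
    encoding ||= meta[/charset=["']?([^>]*?)($|["'\s>])/im, 1]
  end
  encoding
end
```

The returned label is an arbitrary string taken from the document, so a caller
still needs the `rescue ArgumentError` fallback used above before trusting it.
=end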

    def self.serialize_node_internal(current_node, io, encoding, options)
      case current_node.type
      when XML::Node::ELEMENT_NODE
        ns = current_node.namespace
        ns_uri = ns.nil? ? nil : ns.href
        # XXX(sfc): attach namespaces to all nodes, even html?
        if ns_uri.nil? || ns_uri == HTML_NAMESPACE || ns_uri == MATHML_NAMESPACE || ns_uri == SVG_NAMESPACE
          tagname = current_node.name
        else
          tagname = "#{ns.prefix}:#{current_node.name}"
        end
        io << '<' << tagname
        current_node.attribute_nodes.each do |attr|
          attr_ns = attr.namespace
          if attr_ns.nil?
            attr_name = attr.name
          else
            ns_uri = attr_ns.href
            if ns_uri == XML_NAMESPACE
              attr_name = 'xml:' + attr.name.sub(/^[^:]*:/, '')
            elsif ns_uri == XMLNS_NAMESPACE && attr.name.sub(/^[^:]*:/, '') == 'xmlns'
              attr_name = 'xmlns'
            elsif ns_uri == XMLNS_NAMESPACE
              attr_name = 'xmlns:' + attr.name.sub(/^[^:]*:/, '')
            elsif ns_uri == XLINK_NAMESPACE
              attr_name = 'xlink:' + attr.name.sub(/^[^:]*:/, '')
            else
              attr_name = "#{attr_ns.prefix}:#{attr.name}"
            end
          end
          io << ' ' << attr_name << '="' << escape_text(attr.content, encoding, true) << '"'
        end
        io << '>'
        if !%w[area base basefont bgsound br col embed frame hr img input keygen
               link meta param source track wbr].include?(current_node.name)
          io << "\n" if options[:preserve_newline] && prepend_newline?(current_node)
          current_node.children.each do |child|
            # XXX(sfc): Templates handled specially?
            serialize_node_internal(child, io, encoding, options)
          end
          io << '</' << tagname << '>'
        end
      when XML::Node::TEXT_NODE
        parent = current_node.parent
        if parent.element? && %w[style script xmp iframe noembed noframes plaintext noscript].include?(parent.name)
          io << current_node.content
        else
          io << escape_text(current_node.content, encoding, false)
        end
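      when XML::Node::CDATA_SECTION_NODE
        io << '<![CDATA[' << current_node.content << ']]>'
      when XML::Node::COMMENT_NODE
        io << '<!--' << current_node.content << '-->'
      when XML::Node::PI_NODE
        io << '<?' << current_node.content << '>'
      when XML::Node::DOCUMENT_TYPE_NODE, XML::Node::DTD_NODE
        io << '<!DOCTYPE ' << current_node.name << '>'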
      when XML::Node::HTML_DOCUMENT_NODE, XML::Node::DOCUMENT_FRAG_NODE
        current_node.children.each do |child|
          serialize_node_internal(child, io, encoding, options)
        end
      else
        raise "Unexpected node '#{current_node.name}' of type #{current_node.type}"
      end
    end

    def self.escape_text(text, encoding, attribute_mode)
      if attribute_mode
        text = text.gsub(/[&\u00a0"]/,
                           '&' => '&amp;', "\u00a0" => '&nbsp;', '"' => '&quot;')
      else
        text = text.gsub(/[&\u00a0<>]/,
                           '&' => '&amp;', "\u00a0" => '&nbsp;',  '<' => '&lt;', '>' => '&gt;')
      end
      # Not part of the standard
      text.encode(encoding, fallback: lambda { |c| "&\#x#{c.ord.to_s(16)};" })
    end
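=begin
The escaping rules above are easy to check outside the serializer. This is a
minimal standalone sketch: `escape_html_text` is a hypothetical top-level copy
of the logic, written on the assumption that the replacement values are the
usual HTML entities (`&amp;`, `&nbsp;`, `&quot;`, `&lt;`, `&gt;`).

```ruby
# Hypothetical standalone copy of the escaping logic, for experimentation.
def escape_html_text(text, encoding, attribute_mode)
  escaped =
    if attribute_mode
      # Attribute values: escape &, no-break space and the double quote.
      text.gsub(/[&\u00a0"]/,
                '&' => '&amp;', "\u00a0" => '&nbsp;', '"' => '&quot;')
    else
      # Text nodes: escape &, no-break space, < and >.
      text.gsub(/[&\u00a0<>]/,
                '&' => '&amp;', "\u00a0" => '&nbsp;', '<' => '&lt;', '>' => '&gt;')
    end
  # Characters the target encoding cannot represent become hexadecimal
  # character references instead of raising an error.
  escaped.encode(encoding, fallback: lambda { |c| "&#x#{c.ord.to_s(16)};" })
end
```

For example, `escape_html_text("caf\u00e9", 'US-ASCII', false)` yields
`caf&#xe9;`, while angle brackets are left alone in attribute mode.
=end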

    def self.prepend_newline?(node)
      return false unless %w[pre textarea listing].include?(node.name) && !node.children.empty?
      first_child = node.children[0]
      first_child.text? && first_child.content.start_with?("\n")
    end
  end
end
# nokogumbo-2.0.5/lib/nokogumbo/html5/document.rb
module Nokogiri
  module HTML5
    class Document < Nokogiri::HTML::Document
      def self.parse(string_or_io, url = nil, encoding = nil, **options, &block)
        yield options if block_given?
        string_or_io = '' unless string_or_io

        if string_or_io.respond_to?(:encoding) && string_or_io.encoding.name != 'ASCII-8BIT'
          encoding ||= string_or_io.encoding.name
        end

        if string_or_io.respond_to?(:read) && string_or_io.respond_to?(:path)
          url ||= string_or_io.path
        end
        unless string_or_io.respond_to?(:read) || string_or_io.respond_to?(:to_str)
          raise ArgumentError.new("not a string or IO object")
        end
        do_parse(string_or_io, url, encoding, options)
      end

      def self.read_io(io, url = nil, encoding = nil, **options)
        raise ArgumentError.new("io object doesn't respond to :read") unless io.respond_to?(:read)
        do_parse(io, url, encoding, options)
      end

      def self.read_memory(string, url = nil, encoding = nil, **options)
        raise ArgumentError.new("string object doesn't respond to :to_str") unless string.respond_to?(:to_str)
        do_parse(string, url, encoding, options)
      end

      def fragment(tags = nil)
        DocumentFragment.new(self, tags, self.root)
      end

      def to_xml(options = {}, &block)
        # Bypass XML::Document#to_xml which doesn't add
        # XML::Node::SaveOptions::AS_XML like XML::Node#to_xml does.
        XML::Node.instance_method(:to_xml).bind(self).call(options, &block)
      end

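      private
      # Note: `private` does not apply to methods defined with `def self.`,
      # so do_parse remains callable from outside the class.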
      def self.do_parse(string_or_io, url, encoding, options)
        string = HTML5.read_and_encode(string_or_io, encoding)
        max_attributes = options[:max_attributes] || Nokogumbo::DEFAULT_MAX_ATTRIBUTES
        max_errors = options[:max_errors] || options[:max_parse_errors] || Nokogumbo::DEFAULT_MAX_ERRORS
        max_depth = options[:max_tree_depth] || Nokogumbo::DEFAULT_MAX_TREE_DEPTH
        doc = Nokogumbo.parse(string, url, max_attributes, max_errors, max_depth)
        doc.encoding = 'UTF-8'
        doc
      end
    end
  end
end
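=begin
The option handling in `do_parse` relies on `||`, which has one consequence
worth spelling out: `0` is truthy in Ruby, so an explicit `max_errors: 0` is
kept and does not fall through to `:max_parse_errors`. The sketch below is a
hypothetical standalone version of that lookup; `SKETCH_DEFAULTS` and
`resolve_limits` are illustrative names, and the default values are copied
from the `Nokogumbo::DEFAULT_*` constants in lib/nokogumbo.rb.

```ruby
# Illustrative defaults, mirroring Nokogumbo::DEFAULT_MAX_ATTRIBUTES,
# DEFAULT_MAX_ERRORS and DEFAULT_MAX_TREE_DEPTH.
SKETCH_DEFAULTS = { max_attributes: 400, max_errors: 0, max_tree_depth: 400 }.freeze

# Hypothetical helper: resolve the three parser limits from an options hash.
def resolve_limits(options)
  {
    max_attributes: options[:max_attributes] || SKETCH_DEFAULTS[:max_attributes],
    # :max_parse_errors is only consulted when :max_errors is nil or false.
    max_errors: options[:max_errors] || options[:max_parse_errors] ||
                SKETCH_DEFAULTS[:max_errors],
    max_tree_depth: options[:max_tree_depth] || SKETCH_DEFAULTS[:max_tree_depth],
  }
end
```

Passing `nil` for any option therefore selects the default rather than
disabling the limit.
=end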
# nokogumbo-2.0.5/lib/nokogumbo/html5/node.rb
require 'nokogiri'

module Nokogiri
  module HTML5
    module Node
      # HTML elements can have attributes that contain colons.
      # Nokogiri::XML::Node#[]= treats names with colons as a prefixed QName
      # and tries to create an attribute in a namespace. This is especially
      # annoying with attribute names like xml:lang since libxml2 will
      # actually create the xml namespace if it doesn't exist already.
      def add_child_node_and_reparent_attrs(node)
        return super(node) unless document.is_a?(HTML5::Document)
        # I'm not sure what this method is supposed to do. Reparenting
        # namespaces is handled by libxml2, including child namespaces which
        # this method wouldn't handle.
        # https://github.com/sparklemotion/nokogiri/issues/1790
        add_child_node(node)
        #node.attribute_nodes.find_all { |a| a.namespace }.each do |attr|
        #  attr.remove
        #  ns = attr.namespace
        #  a["#{ns.prefix}:#{attr.name}"] = attr.value
        #end
      end

      def inner_html(options = {})
        return super(options) unless document.is_a?(HTML5::Document)
        result = options[:preserve_newline] && HTML5.prepend_newline?(self) ? "\n" : ""
        result << children.map { |child| child.to_html(options) }.join
        result
      end

      def write_to(io, *options)
        return super(io, *options) unless document.is_a?(HTML5::Document)
        options = options.first.is_a?(Hash) ? options.shift : {}
        encoding = options[:encoding] || options[0]
        if Nokogiri.jruby?
          save_options = options[:save_with] || options[1]
          indent_times = options[:indent] || 0
        else
          save_options = options[:save_with] || options[1] || XML::Node::SaveOptions::FORMAT
          indent_times = options[:indent] || 2
        end
        indent_string = (options[:indent_text] || ' ') * indent_times

        config = XML::Node::SaveOptions.new(save_options.to_i)
        yield config if block_given?

        config_options = config.options
        if (config_options & (XML::Node::SaveOptions::AS_XML | XML::Node::SaveOptions::AS_XHTML) != 0)
          # Use Nokogiri's serializing code.
          native_write_to(io, encoding, indent_string, config_options)
        else
          # Serialize including the current node.
          encoding ||= document.encoding || Encoding::UTF_8
          internal_ops = {
            preserve_newline: options[:preserve_newline] || false
          }
          HTML5.serialize_node_internal(self, io, encoding, internal_ops)
        end
      end

      def fragment(tags)
        return super(tags) unless document.is_a?(HTML5::Document)
        DocumentFragment.new(document, tags, self)
      end
    end
    # Monkey patch
    XML::Node.prepend(HTML5::Node)
  end
end

# vim: set shiftwidth=2 softtabstop=2 tabstop=8 expandtab:
# nokogumbo-2.0.5/lib/nokogumbo/html5/document_fragment.rb
require 'nokogiri'

module Nokogiri
  module HTML5
    class DocumentFragment < Nokogiri::HTML::DocumentFragment
      attr_accessor :document
      attr_accessor :errors

      # Create a document fragment.
      def initialize(doc, tags = nil, ctx = nil, options = {})
        self.document = doc
        self.errors = []
        return self unless tags

        max_attributes = options[:max_attributes] || Nokogumbo::DEFAULT_MAX_ATTRIBUTES
        max_errors = options[:max_errors] || Nokogumbo::DEFAULT_MAX_ERRORS
        max_depth = options[:max_tree_depth] || Nokogumbo::DEFAULT_MAX_TREE_DEPTH
        tags = Nokogiri::HTML5.read_and_encode(tags, nil)
        Nokogumbo.fragment(self, tags, ctx, max_attributes, max_errors, max_depth)
      end

      def serialize(options = {}, &block)
        # Bypass XML::Document.serialize which doesn't support options even
        # though XML::Node.serialize does!
        XML::Node.instance_method(:serialize).bind(self).call(options, &block)
      end

      # Parse a document fragment from +tags+, returning a DocumentFragment.
      def self.parse(tags, encoding = nil, options = {})
        doc = HTML5::Document.new
        tags = HTML5.read_and_encode(tags, encoding)
        doc.encoding = 'UTF-8'
        new(doc, tags, nil, options)
      end

      def extract_params params # :nodoc:
        handler = params.find do |param|
          ![Hash, String, Symbol].include?(param.class)
        end
        params -= [handler] if handler

        hashes = []
        while Hash === params.last || params.last.nil?
          hashes << params.pop
          break if params.empty?
        end
        ns, binds = hashes.reverse

        ns ||=
          begin
            ns = Hash.new
            children.each { |child| ns.merge!(child.namespaces) }
            ns
          end

        [params, handler, ns, binds]
      end

    end
  end
end
# vim: set shiftwidth=2 softtabstop=2 tabstop=8 expandtab:
# nokogumbo-2.0.5/lib/nokogumbo.rb
require 'nokogiri'

if ((defined?(Nokogiri::HTML5) && Nokogiri::HTML5.respond_to?(:parse)) &&
    (defined?(Nokogiri::Gumbo) && Nokogiri::Gumbo.respond_to?(:parse)) &&
    !(ENV.key?("NOKOGUMBO_IGNORE_NOKOGIRI_HTML5") && ENV["NOKOGUMBO_IGNORE_NOKOGIRI_HTML5"] != "false"))

  warn "NOTE: nokogumbo: Using Nokogiri::HTML5 provided by Nokogiri. See https://github.com/sparklemotion/nokogiri/issues/2205 for more information."

  ::Nokogumbo = ::Nokogiri::Gumbo
else
  require 'nokogumbo/html5'
  require 'nokogumbo/nokogumbo'

  module Nokogumbo
    # The default maximum number of attributes per element.
    DEFAULT_MAX_ATTRIBUTES = 400

    # The default maximum number of errors for parsing a document or a fragment.
    DEFAULT_MAX_ERRORS = 0

    # The default maximum depth of the DOM tree produced by parsing a document
    # or fragment.
    DEFAULT_MAX_TREE_DEPTH = 400
  end
end

require 'nokogumbo/version'
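=begin
The environment check at the top of this file treats the variable as an
opt-out switch: any value other than the literal string "false" counts as
set, including the empty string. A minimal sketch of just that predicate
follows; `ignore_nokogiri_html5?` is a hypothetical name, and it takes a
Hash-like `env` so it can be exercised without modifying `ENV`.

```ruby
# Hypothetical predicate mirroring the NOKOGUMBO_IGNORE_NOKOGIRI_HTML5 check.
def ignore_nokogiri_html5?(env)
  env.key?("NOKOGUMBO_IGNORE_NOKOGIRI_HTML5") &&
    env["NOKOGUMBO_IGNORE_NOKOGIRI_HTML5"] != "false"
end
```

When this returns false and Nokogiri already provides both
`Nokogiri::HTML5.parse` and `Nokogiri::Gumbo.parse`, the file aliases
`Nokogumbo` to `Nokogiri::Gumbo` instead of loading its own extension.
=end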
# nokogumbo-2.0.5/nokogumbo.gemspec
#########################################################
# This file has been automatically generated by gem2tgz #
#########################################################
# -*- encoding: utf-8 -*-
# stub: nokogumbo 2.0.5 ruby lib
# stub: ext/nokogumbo/extconf.rb

Gem::Specification.new do |s|
  s.name = "nokogumbo".freeze
  s.version = "2.0.5"

  s.required_rubygems_version = Gem::Requirement.new(">= 0".freeze) if s.respond_to? :required_rubygems_version=
  s.metadata = { "bug_tracker_uri" => "https://github.com/rubys/nokogumbo/issues", "changelog_uri" => "https://github.com/rubys/nokogumbo/blob/master/CHANGELOG.md", "homepage_uri" => "https://github.com/rubys/nokogumbo/#readme", "source_code_uri" => "https://github.com/rubys/nokogumbo" } if s.respond_to? :metadata=
  s.require_paths = ["lib".freeze]
  s.authors = ["Sam Ruby".freeze, "Stephen Checkoway".freeze]
  s.date = "2021-03-19"
  s.description = "Nokogumbo allows a Ruby program to invoke the Gumbo HTML5 parser and access the result as a Nokogiri parsed document.".freeze
  s.email = ["rubys@intertwingly.net".freeze, "s@pahtak.org".freeze]
  s.extensions = ["ext/nokogumbo/extconf.rb".freeze]
  s.files = ["LICENSE.txt".freeze, "README.md".freeze, "ext/nokogumbo/extconf.rb".freeze, "ext/nokogumbo/nokogumbo.c".freeze, "gumbo-parser/src/ascii.c".freeze, "gumbo-parser/src/ascii.h".freeze, "gumbo-parser/src/attribute.c".freeze, "gumbo-parser/src/attribute.h".freeze, "gumbo-parser/src/char_ref.c".freeze, "gumbo-parser/src/char_ref.h".freeze, "gumbo-parser/src/error.c".freeze, "gumbo-parser/src/error.h".freeze, "gumbo-parser/src/foreign_attrs.c".freeze, "gumbo-parser/src/gumbo.h".freeze, "gumbo-parser/src/insertion_mode.h".freeze, "gumbo-parser/src/macros.h".freeze, "gumbo-parser/src/parser.c".freeze, "gumbo-parser/src/parser.h".freeze, "gumbo-parser/src/replacement.h".freeze, "gumbo-parser/src/string_buffer.c".freeze, "gumbo-parser/src/string_buffer.h".freeze, "gumbo-parser/src/string_piece.c".freeze, "gumbo-parser/src/svg_attrs.c".freeze, "gumbo-parser/src/svg_tags.c".freeze, "gumbo-parser/src/tag.c".freeze, "gumbo-parser/src/tag_lookup.c".freeze, "gumbo-parser/src/tag_lookup.h".freeze, "gumbo-parser/src/token_buffer.c".freeze, "gumbo-parser/src/token_buffer.h".freeze, "gumbo-parser/src/token_type.h".freeze, "gumbo-parser/src/tokenizer.c".freeze, "gumbo-parser/src/tokenizer.h".freeze, "gumbo-parser/src/tokenizer_states.h".freeze, "gumbo-parser/src/utf8.c".freeze, "gumbo-parser/src/utf8.h".freeze, "gumbo-parser/src/util.c".freeze, "gumbo-parser/src/util.h".freeze, "gumbo-parser/src/vector.c".freeze, "gumbo-parser/src/vector.h".freeze, "lib/nokogumbo.rb".freeze, "lib/nokogumbo/html5.rb".freeze, "lib/nokogumbo/html5/document.rb".freeze, "lib/nokogumbo/html5/document_fragment.rb".freeze, "lib/nokogumbo/html5/node.rb".freeze, "lib/nokogumbo/version.rb".freeze]
  s.homepage = "https://github.com/rubys/nokogumbo/#readme".freeze
  s.licenses = ["Apache-2.0".freeze]
  s.required_ruby_version = Gem::Requirement.new(">= 2.1".freeze)
  s.rubygems_version = "2.7.6.2".freeze
  s.summary = "Nokogiri interface to the Gumbo HTML5 parser".freeze

  if s.respond_to? :specification_version then
    s.specification_version = 4

    if Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0') then
      s.add_runtime_dependency(%q<nokogiri>.freeze, [">= 1.8.4", "~> 1.8"])
    else
      s.add_dependency(%q<nokogiri>.freeze, [">= 1.8.4", "~> 1.8"])
    end
  else
    s.add_dependency(%q<nokogiri>.freeze, [">= 1.8.4", "~> 1.8"])
  end
end
# nokogumbo-2.0.5/ext/nokogumbo/extconf.rb
require 'rubygems'
require 'fileutils'
require 'mkmf'
require 'nokogiri'

$CFLAGS += " -std=c99"
$LDFLAGS.gsub!('-Wl,--no-undefined', '')
$DLDFLAGS.gsub!('-Wl,--no-undefined', '')
$warnflags = CONFIG['warnflags'] = '-Wall'

NG_SPEC = Gem::Specification.find_by_name('nokogiri', "= #{Nokogiri::VERSION}")

def download_headers
  begin
    require 'yaml'

    dependencies = YAML.load_file(File.join(NG_SPEC.gem_dir, 'dependencies.yml'))
    version = dependencies['libxml2']['version']
    host = RbConfig::CONFIG["host_alias"].empty? ? RbConfig::CONFIG["host"] : RbConfig::CONFIG["host_alias"]
    path = File.join('ports', host, 'libxml2', version, 'include/libxml2')
    return path if File.directory?(path)

    # Make sure we're using the same version Nokogiri uses
    dep_index = NG_SPEC.dependencies.index { |dep| dep.name == 'mini_portile2' and dep.type == :runtime }
    return nil if dep_index.nil?
    requirement = NG_SPEC.dependencies[dep_index].requirement.to_s

    gem 'mini_portile2', requirement
    require 'mini_portile2'
    p = MiniPortile::new('libxml2', version).tap do |r|
      r.host = RbConfig::CONFIG["host_alias"].empty? ? RbConfig::CONFIG["host"] : RbConfig::CONFIG["host_alias"]
      r.files = [{
        url: "http://xmlsoft.org/sources/libxml2-#{r.version}.tar.gz",
        sha256: dependencies['libxml2']['sha256']
      }]
      r.configure_options += [
        "--without-python",
        "--without-readline",
        "--with-c14n",
        "--with-debug",
        "--with-threads"
      ]
    end
    p.download unless p.downloaded?
    p.extract
    p.configure unless p.configured?
    system('make', '-C', "tmp/#{p.host}/ports/libxml2/#{version}/libxml2-#{version}/include/libxml", 'install-xmlincHEADERS')
    path
  rescue
    puts 'failed to download/install headers'
    nil
  end
end

required = arg_config('--with-libxml2')
prohibited = arg_config('--without-libxml2')
if required and prohibited
  abort "cannot use both --with-libxml2 and --without-libxml2"
end

have_libxml2 = false
have_ng = false

def windows?
  ::RUBY_PLATFORM =~ /mingw|mswin/
end

def modern_nokogiri?
  nokogiri_version = Gem::Version.new(Nokogiri::VERSION)
  requirement = windows? ? ">= 1.11.2" : ">= 1.11.0.rc4"
  Gem::Requirement.new(requirement).satisfied_by?(nokogiri_version)
end
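=begin
A minimal sketch of the version gate above, with the version string and
platform passed in so the comparison can be tried without Nokogiri installed.
The helper name modern_version_sketch? is hypothetical; the requirement strings
are the ones used by modern_nokogiri?.

```ruby
require 'rubygems'

# Hypothetical variant of modern_nokogiri? that takes its inputs explicitly.
def modern_version_sketch?(version_string, windows: false)
  requirement = windows ? ">= 1.11.2" : ">= 1.11.0.rc4"
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(version_string))
end
```

Prerelease segments sort before the corresponding release, so 1.11.0.rc3 fails
the non-Windows check while 1.11.0.rc4 and 1.11.0 pass it.
=end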

if !prohibited
  if modern_nokogiri?
    append_cflags(Nokogiri::VERSION_INFO["nokogiri"]["cppflags"])
    ldflags = Nokogiri::VERSION_INFO["nokogiri"]["ldflags"] # may be nil for nokogiri pre-1.11.2
    append_ldflags(ldflags)
    # Guard against nil before calling empty?, since ldflags may be absent.
    have_libxml2 = if ldflags.nil? || ldflags.empty?
                     have_header('libxml/tree.h')
                   else
                     have_func("xmlNewDoc", "libxml/tree.h")
                   end
  end

  if !have_libxml2
    if Nokogiri::VERSION_INFO.include?('libxml') and
       Nokogiri::VERSION_INFO['libxml']['source'] == 'packaged'
      # Nokogiri has libxml2 built in. Find the headers.
      libxml2_path = File.join(Nokogiri::VERSION_INFO['libxml']['libxml2_path'],
                               'include/libxml2')
      if find_header('libxml/tree.h', libxml2_path)
        have_libxml2 = true
      else
        # Unfortunately, some versions of Nokogiri delete these files.
        # https://github.com/sparklemotion/nokogiri/pull/1788
        # Try to download them
        libxml2_path = download_headers
        unless libxml2_path.nil?
          have_libxml2 = find_header('libxml/tree.h', libxml2_path)
        end
      end
    else
      # Nokogiri is compiled with system headers.
      # Hack to work around broken mkmf on macOS
      # (https://bugs.ruby-lang.org/issues/14992 fixed now)
      if RbConfig::MAKEFILE_CONFIG['LIBPATHENV'] == 'DYLD_LIBRARY_PATH'
        RbConfig::MAKEFILE_CONFIG['LIBPATHENV'] = 'DYLD_FALLBACK_LIBRARY_PATH'
      end

      pkg_config('libxml-2.0')
      have_libxml2 = have_library('xml2', 'xmlNewDoc')
    end
  end

  if required and !have_libxml2
    abort "libxml2 required but could not be located"
  end


  if have_libxml2
    have_ng = have_header('nokogiri.h') || find_header('nokogiri.h', File.join(NG_SPEC.gem_dir, 'ext/nokogiri'))
  end
end

if have_libxml2 and have_ng
  $CFLAGS += " -DNGLIB=1"
end

# Collect the gumbo-parser source and header files (built in place via VPATH).
ext_dir = File.dirname(__FILE__)

Dir.chdir(ext_dir) do
  $srcs = Dir['*.c', '../../gumbo-parser/src/*.c']
  $hdrs = Dir['*.h', '../../gumbo-parser/src/*.h']
end
$INCFLAGS << ' -I$(srcdir)/../../gumbo-parser/src'
$VPATH << '$(srcdir)/../../gumbo-parser/src'

create_makefile('nokogumbo/nokogumbo') do |conf|
  conf.map! do |chunk|
    chunk.gsub(/^HDRS = .*$/, "HDRS = #{$hdrs.map { |h| File.join('$(srcdir)', h)}.join(' ')}")
  end
end
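=begin
A minimal sketch of the HDRS rewrite performed in the create_makefile block
above, applied to a plain string instead of a real Makefile. The helper name
rewrite_hdrs_sketch and the sample chunk are hypothetical.

```ruby
# Hypothetical helper: replace the HDRS line of a Makefile chunk with the
# given header paths, each prefixed with $(srcdir).
def rewrite_hdrs_sketch(chunk, hdrs)
  joined = hdrs.map { |h| File.join('$(srcdir)', h) }.join(' ')
  chunk.gsub(/^HDRS = .*$/, "HDRS = #{joined}")
end
```

Only the line that starts with "HDRS = " changes; every other line of the
chunk is passed through untouched.
=end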
# vim: set sw=2 sts=2 ts=8 et:
// nokogumbo-2.0.5/ext/nokogumbo/nokogumbo.c
//
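// nokogumbo.c defines the following:
//
//   module Nokogumbo
//     def self.parse(input, url, max_attributes, max_errors, max_depth) # returns Nokogiri::HTML5::Document
//     def self.fragment(doc_fragment, tags, ctx, max_attributes, max_errors, max_depth) # returns nil
//   end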
//
// Processing starts by calling gumbo_parse_with_options.  The resulting
// document tree is then walked:
//
//  * if Nokogiri and libxml2 headers are available at compile time,
//    (if NGLIB) then a parallel libxml2 tree is constructed, and the
//    final document is then wrapped using Nokogiri_wrap_xml_document.
//    This approach reduces memory and CPU requirements as Ruby objects
//    are only built when necessary.
//
//  * if the necessary headers are not available at compile time, Nokogiri
//    methods are called instead, producing the equivalent functionality.
//

#include <assert.h>
#include <ruby.h>
#include <ruby/version.h>

#include "gumbo.h"

// class constants
static VALUE Document;

// Interned symbols
static ID internal_subset;
static ID parent;

/* Backwards compatibility to Ruby 2.1.0 */
#if RUBY_API_VERSION_CODE < 20200
#define ONIG_ESCAPE_UCHAR_COLLISION 1
#include <ruby/encoding.h>

static VALUE rb_utf8_str_new(const char *str, long length) {
  return rb_enc_str_new(str, length, rb_utf8_encoding());
}

static VALUE rb_utf8_str_new_cstr(const char *str) {
  return rb_enc_str_new_cstr(str, rb_utf8_encoding());
}

static VALUE rb_utf8_str_new_static(const char *str, long length) {
  return rb_enc_str_new(str, length, rb_utf8_encoding());
}
#endif

#if NGLIB
#include <nokogiri.h>
#include <libxml/tree.h>
#include <libxml/HTMLtree.h>

#define NIL NULL
#else
#define NIL Qnil

// These are defined by nokogiri.h
static VALUE cNokogiriXmlSyntaxError;
static VALUE cNokogiriXmlElement;
static VALUE cNokogiriXmlText;
static VALUE cNokogiriXmlCData;
static VALUE cNokogiriXmlComment;

// Interned symbols.
static ID new;
static ID node_name_;

// Map libxml2 types to Ruby VALUE.
typedef VALUE xmlNodePtr;
typedef VALUE xmlDocPtr;
typedef VALUE xmlNsPtr;
typedef VALUE xmlDtdPtr;
typedef char xmlChar;
#define BAD_CAST

// Redefine libxml2 API as Ruby function calls.
static xmlNodePtr xmlNewDocNode(xmlDocPtr doc, xmlNsPtr ns, const xmlChar *name, const xmlChar *content) {
  assert(ns == NIL && content == NULL);
  return rb_funcall(cNokogiriXmlElement, new, 2, rb_utf8_str_new_cstr(name), doc);
}

static xmlNodePtr xmlNewDocText(xmlDocPtr doc, const xmlChar *content) {
  VALUE str = rb_utf8_str_new_cstr(content);
  return rb_funcall(cNokogiriXmlText, new, 2, str, doc);
}

static xmlNodePtr xmlNewCDataBlock(xmlDocPtr doc, const xmlChar *content, int len) {
  VALUE str = rb_utf8_str_new(content, len);
  // CDATA.new takes arguments in the opposite order from Text.new.
  return rb_funcall(cNokogiriXmlCData, new, 2, doc, str);
}

static xmlNodePtr xmlNewDocComment(xmlDocPtr doc, const xmlChar *content) {
  VALUE str = rb_utf8_str_new_cstr(content);
  return rb_funcall(cNokogiriXmlComment, new, 2, doc, str);
}

static xmlNodePtr xmlAddChild(xmlNodePtr parent, xmlNodePtr cur) {
  ID add_child;
  CONST_ID(add_child, "add_child");
  return rb_funcall(parent, add_child, 1, cur);
}

static void xmlSetNs(xmlNodePtr node, xmlNsPtr ns) {
  ID namespace_;
  CONST_ID(namespace_, "namespace=");
  rb_funcall(node, namespace_, 1, ns);
}

static void xmlFreeDoc(xmlDocPtr doc) { }

static VALUE Nokogiri_wrap_xml_document(VALUE klass, xmlDocPtr doc) {
  return doc;
}

static VALUE find_dummy_key(VALUE collection) {
  VALUE r_dummy = Qnil;
  char dummy[5] = "a";
  size_t len = 1;
  ID key_;
  CONST_ID(key_, "key?");
  while (len < sizeof dummy) {
    r_dummy = rb_utf8_str_new(dummy, len);
    if (rb_funcall(collection, key_, 1, r_dummy) == Qfalse)
      return r_dummy;
    for (size_t i = 0; ; ++i) {
      if (dummy[i] == 0) {
        dummy[i] = 'a';
        ++len;
        break;
      }
      if (dummy[i] == 'z')
        dummy[i] = 'a';
      else {
        ++dummy[i];
        break;
      }
    }
  }
  // This collection has 475254 elements?? Give up.
  rb_raise(rb_eArgError, "Failed to find a dummy key.");
}
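
/*
A minimal sketch, in Ruby, of the enumeration find_dummy_key performs, assuming
any collection that responds to key?. The names next_dummy_sketch and
find_dummy_key_sketch are hypothetical and exist only for illustration.

```ruby
# Advance a candidate the way the C loop does: the first character changes
# fastest, and a run of 'z' wraps to 'a' and grows the string by one.
def next_dummy_sketch(candidate)
  chars = candidate.chars
  chars.each_index do |i|
    if chars[i] == 'z'
      chars[i] = 'a'
    else
      chars[i] = chars[i].succ
      return chars.join
    end
  end
  chars.join + 'a'
end

# Return the first candidate of at most max_len characters that is not a key.
def find_dummy_key_sketch(collection, max_len = 4)
  candidate = 'a'
  while candidate.length <= max_len
    return candidate unless collection.key?(candidate)
    candidate = next_dummy_sketch(candidate)
  end
  raise ArgumentError, 'Failed to find a dummy key.'
end
```

With at most four lowercase letters there are 26 + 26^2 + 26^3 + 26^4 = 475254
candidates, which is where the figure in the comment above comes from.
*/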

// This should return an xmlAttrPtr, but we don't need it and it's easier to
// not get the result.
static void xmlNewNsProp (
  xmlNodePtr node,
  xmlNsPtr ns,
  const xmlChar *name,
  const xmlChar *value
) {
  ID set_attribute;
  CONST_ID(set_attribute, "set_attribute");

  VALUE rvalue = rb_utf8_str_new_cstr(value);

  if (RTEST(ns)) {
    // This is an easy case, we have a namespace so it's enough to do
    // node["#{ns.prefix}:#{name}"] = value
    ID prefix;
    CONST_ID(prefix, "prefix");
    VALUE ns_prefix = rb_funcall(ns, prefix, 0);
    VALUE qname = rb_sprintf("%" PRIsVALUE ":%s", ns_prefix, name);
    rb_funcall(node, set_attribute, 2, qname, rvalue);
    return;
  }

  size_t len = strlen(name);
  VALUE rname = rb_utf8_str_new(name, len);
  if (memchr(name, ':', len) == NULL) {
    // This is the easiest case. There's no colon so we can do
    // node[name] = value.
    rb_funcall(node, set_attribute, 2, rname, rvalue);
    return;
  }

  // Nokogiri::XML::Node#set_attribute calls xmlSetProp(node, name, value)
  // which behaves roughly as
  // if name is a QName prefix:local
  //   if node->doc has a namespace ns corresponding to prefix
  //     return xmlSetNsProp(node, ns, local, value)
  // return xmlSetNsProp(node, NULL, name, value)
  //
  // If the prefix is "xml", then the namespace lookup will create it.
  //
  // By contrast, xmlNewNsProp does not do this parsing and creates an attribute
  // with the name and value exactly as given. This is the behavior that we
  // want.
  //
  // Thus, for attribute names like "xml:lang", #set_attribute will create an
  // attribute with namespace "xml" and name "lang". This is incorrect for
  // html elements (but correct for foreign elements).
  //
  // Work around this by inserting a dummy attribute and then changing the
  // name, if needed.

  // Find a dummy attribute string that doesn't already exist.
  VALUE dummy = find_dummy_key(node);
  // Add the dummy attribute.
  rb_funcall(node, set_attribute, 2, dummy, rvalue);

  // Remove the old attribute, if it exists.
  ID remove_attribute;
  CONST_ID(remove_attribute, "remove_attribute");
  rb_funcall(node, remove_attribute, 1, rname);

  // Rename the dummy
  ID attribute;
  CONST_ID(attribute, "attribute");
  VALUE attr = rb_funcall(node, attribute, 1, dummy);
  rb_funcall(attr, node_name_, 1, rname);
}
#endif

// URI = system id
// external id = public id
static xmlDocPtr new_html_doc(const char *dtd_name, const char *system, const char *public)
{
#if NGLIB
  // These two libxml2 functions take the public and system ids in
  // opposite orders.
  htmlDocPtr doc = htmlNewDocNoDtD(/* URI */ NULL, /* ExternalID */NULL);
  assert(doc);
  if (dtd_name)
    xmlCreateIntSubset(doc, BAD_CAST dtd_name, BAD_CAST public, BAD_CAST system);
  return doc;
#else
  // remove internal subset from newly created documents
  VALUE doc;
  // If system and public are both NULL, Document#new is going to set default
  // values for them so we're going to have to remove the internal subset
  // which seems to leak memory in Nokogiri, so leak as little as possible.
  if (system == NULL && public == NULL) {
    ID remove;
    CONST_ID(remove, "remove");
    doc = rb_funcall(Document, new, 2, /* URI */ Qnil, /* external_id */ rb_utf8_str_new_static("", 0));
    rb_funcall(rb_funcall(doc, internal_subset, 0), remove, 0);
    if (dtd_name) {
      // We need to create an internal subset now.
      ID create_internal_subset;
      CONST_ID(create_internal_subset, "create_internal_subset");
      rb_funcall(doc, create_internal_subset, 3, rb_utf8_str_new_cstr(dtd_name), Qnil, Qnil);
    }
  } else {
    assert(dtd_name);
    // Rather than removing and creating the internal subset as we did above,
    // just create and then rename one.
    VALUE r_system = system ? rb_utf8_str_new_cstr(system) : Qnil;
    VALUE r_public = public ? rb_utf8_str_new_cstr(public) : Qnil;
    doc = rb_funcall(Document, new, 2, r_system, r_public);
    rb_funcall(rb_funcall(doc, internal_subset, 0), node_name_, 1, rb_utf8_str_new_cstr(dtd_name));
  }
  return doc;
#endif
}

static xmlNodePtr get_parent(xmlNodePtr node) {
#if NGLIB
  return node->parent;
#else
  if (!rb_respond_to(node, parent))
    return Qnil;
  return rb_funcall(node, parent, 0);
#endif
}

static GumboOutput *perform_parse(const GumboOptions *options, VALUE input) {
  assert(RTEST(input));
  Check_Type(input, T_STRING);
  GumboOutput *output = gumbo_parse_with_options (
    options,
    RSTRING_PTR(input),
    RSTRING_LEN(input)
  );

  const char *status_string = gumbo_status_to_string(output->status);
  switch (output->status) {
  case GUMBO_STATUS_OK:
    break;
  case GUMBO_STATUS_TOO_MANY_ATTRIBUTES:
  case GUMBO_STATUS_TREE_TOO_DEEP:
    gumbo_destroy_output(output);
    rb_raise(rb_eArgError, "%s", status_string);
  case GUMBO_STATUS_OUT_OF_MEMORY:
    gumbo_destroy_output(output);
    rb_raise(rb_eNoMemError, "%s", status_string);
  }
  return output;
}

static xmlNsPtr lookup_or_add_ns (
  xmlDocPtr doc,
  xmlNodePtr root,
  const char *href,
  const char *prefix
) {
#if NGLIB
  xmlNsPtr ns = xmlSearchNs(doc, root, BAD_CAST prefix);
  if (ns)
    return ns;
  return xmlNewNs(root, BAD_CAST href, BAD_CAST prefix);
#else
  ID add_namespace_definition;
  CONST_ID(add_namespace_definition, "add_namespace_definition");
  VALUE rprefix = rb_utf8_str_new_cstr(prefix);
  VALUE rhref = rb_utf8_str_new_cstr(href);
  return rb_funcall(root, add_namespace_definition, 2, rprefix, rhref);
#endif
}

static void set_line(xmlNodePtr node, size_t line) {
#if NGLIB
  // libxml2 uses 65535 to mean look elsewhere for the line number on some
  // nodes.
  if (line < 65535)
    node->line = (unsigned short)line;
#else
  // XXX: If Nokogiri gets a `#line=` method, we'll use that.
#endif
}

// Construct an XML tree rooted at xml_output_node from the Gumbo tree rooted
// at gumbo_node.
static void build_tree (
  xmlDocPtr doc,
  xmlNodePtr xml_output_node,
  const GumboNode *gumbo_node
) {
  xmlNodePtr xml_root = NIL;
  xmlNodePtr xml_node = xml_output_node;
  size_t child_index = 0;

  while (true) {
    assert(gumbo_node != NULL);
    const GumboVector *children = gumbo_node->type == GUMBO_NODE_DOCUMENT?
      &gumbo_node->v.document.children : &gumbo_node->v.element.children;
    if (child_index >= children->length) {
      // Move up the tree and to the next child.
      if (xml_node == xml_output_node) {
        // We've built as much of the tree as we can.
        return;
      }
      child_index = gumbo_node->index_within_parent + 1;
      gumbo_node = gumbo_node->parent;
      xml_node = get_parent(xml_node);
      // Children of fragments don't share the same root, so reset it and
      // it'll be set below. In the non-fragment case, this will only happen
      // after the html element has been finished at which point there are no
      // further elements.
      if (xml_node == xml_output_node)
        xml_root = NIL;
      continue;
    }
    const GumboNode *gumbo_child = children->data[child_index++];
    xmlNodePtr xml_child;

    switch (gumbo_child->type) {
      case GUMBO_NODE_DOCUMENT:
        abort(); // Bug in Gumbo.

      case GUMBO_NODE_TEXT:
      case GUMBO_NODE_WHITESPACE:
        xml_child = xmlNewDocText(doc, BAD_CAST gumbo_child->v.text.text);
        set_line(xml_child, gumbo_child->v.text.start_pos.line);
        xmlAddChild(xml_node, xml_child);
        break;

      case GUMBO_NODE_CDATA:
        xml_child = xmlNewCDataBlock(doc, BAD_CAST gumbo_child->v.text.text,
                                     (int) strlen(gumbo_child->v.text.text));
        set_line(xml_child, gumbo_child->v.text.start_pos.line);
        xmlAddChild(xml_node, xml_child);
        break;

      case GUMBO_NODE_COMMENT:
        xml_child = xmlNewDocComment(doc, BAD_CAST gumbo_child->v.text.text);
        set_line(xml_child, gumbo_child->v.text.start_pos.line);
        xmlAddChild(xml_node, xml_child);
        break;

      case GUMBO_NODE_TEMPLATE:
        // XXX: Should create a template element and a new DocumentFragment
      case GUMBO_NODE_ELEMENT:
      {
        xml_child = xmlNewDocNode(doc, NIL, BAD_CAST gumbo_child->v.element.name, NULL);
        set_line(xml_child, gumbo_child->v.element.start_pos.line);
        if (xml_root == NIL)
          xml_root = xml_child;
        xmlNsPtr ns = NIL;
        switch (gumbo_child->v.element.tag_namespace) {
        case GUMBO_NAMESPACE_HTML:
          break;
        case GUMBO_NAMESPACE_SVG:
          ns = lookup_or_add_ns(doc, xml_root, "http://www.w3.org/2000/svg", "svg");
          break;
        case GUMBO_NAMESPACE_MATHML:
          ns = lookup_or_add_ns(doc, xml_root, "http://www.w3.org/1998/Math/MathML", "math");
          break;
        }
        if (ns != NIL)
          xmlSetNs(xml_child, ns);
        xmlAddChild(xml_node, xml_child);

        // Add the attributes.
        const GumboVector* attrs = &gumbo_child->v.element.attributes;
        for (size_t i=0; i < attrs->length; i++) {
          const GumboAttribute *attr = attrs->data[i];

          switch (attr->attr_namespace) {
            case GUMBO_ATTR_NAMESPACE_XLINK:
              ns = lookup_or_add_ns(doc, xml_root, "http://www.w3.org/1999/xlink", "xlink");
              break;

            case GUMBO_ATTR_NAMESPACE_XML:
              ns = lookup_or_add_ns(doc, xml_root, "http://www.w3.org/XML/1998/namespace", "xml");
              break;

            case GUMBO_ATTR_NAMESPACE_XMLNS:
              ns = lookup_or_add_ns(doc, xml_root, "http://www.w3.org/2000/xmlns/", "xmlns");
              break;

            default:
              ns = NIL;
          }
          xmlNewNsProp(xml_child, ns, BAD_CAST attr->name, BAD_CAST attr->value);
        }

        // Add children for this element.
        child_index = 0;
        gumbo_node = gumbo_child;
        xml_node = xml_child;
      }
    }
  }
}
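
/*
A minimal sketch, in Ruby, of the non-recursive walk build_tree uses: descend
into each child as it is visited, and when a node's children are exhausted,
use its index within its parent to resume at the next sibling. SketchNode,
link_sketch and walk_sketch are hypothetical stand-ins; only the traversal
order is modelled, not the libxml2 or Nokogiri node construction.

```ruby
# Hypothetical stand-in for GumboNode with just the fields the walk needs.
SketchNode = Struct.new(:name, :children, :parent, :index_within_parent)

# Fill in parent and index_within_parent for every node below root.
def link_sketch(root)
  root.children.each_with_index do |child, i|
    child.parent = root
    child.index_within_parent = i
    link_sketch(child)
  end
  root
end

# Visit every descendant of root in document order without recursion.
def walk_sketch(root)
  visited = []
  node = root
  child_index = 0
  loop do
    if child_index >= node.children.length
      # Children exhausted: stop at the root, otherwise move up and right.
      return visited if node.equal?(root)
      child_index = node.index_within_parent + 1
      node = node.parent
      next
    end
    child = node.children[child_index]
    child_index += 1
    visited << child.name
    # Descend; a leaf simply has no children and is left on the next pass.
    node = child
    child_index = 0
  end
end
```

The C version differs in that it only descends into element and template
nodes; text, CDATA and comment nodes are appended and the loop stays at the
current parent.
*/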

static void add_errors(const GumboOutput *output, VALUE rdoc, VALUE input, VALUE url) {
  const char *input_str = RSTRING_PTR(input);
  size_t input_len = RSTRING_LEN(input);

  // Add parse errors to rdoc.
  if (output->errors.length) {
    const GumboVector *errors = &output->errors;
    VALUE rerrors = rb_ary_new2(errors->length);

    for (size_t i=0; i < errors->length; i++) {
      GumboError *err = errors->data[i];
      GumboSourcePosition position = gumbo_error_position(err);
      char *msg;
      size_t size = gumbo_caret_diagnostic_to_string(err, input_str, input_len, &msg);
      VALUE err_str = rb_utf8_str_new(msg, size);
      free(msg);
      VALUE syntax_error = rb_class_new_instance(1, &err_str, cNokogiriXmlSyntaxError);
      const char *error_code = gumbo_error_code(err);
      VALUE str1 = error_code? rb_utf8_str_new_static(error_code, strlen(error_code)) : Qnil;
      rb_iv_set(syntax_error, "@domain", INT2NUM(1)); // XML_FROM_PARSER
      rb_iv_set(syntax_error, "@code", INT2NUM(1));   // XML_ERR_INTERNAL_ERROR
      rb_iv_set(syntax_error, "@level", INT2NUM(2));  // XML_ERR_ERROR
      rb_iv_set(syntax_error, "@file", url);
      rb_iv_set(syntax_error, "@line", INT2NUM(position.line));
      rb_iv_set(syntax_error, "@str1", str1);
      rb_iv_set(syntax_error, "@str2", Qnil);
      rb_iv_set(syntax_error, "@str3", Qnil);
      rb_iv_set(syntax_error, "@int1", INT2NUM(0));
      rb_iv_set(syntax_error, "@column", INT2NUM(position.column));
      rb_ary_push(rerrors, syntax_error);
    }
    rb_iv_set(rdoc, "@errors", rerrors);
  }
}

typedef struct {
  GumboOutput *output;
  VALUE input;
  VALUE url_or_frag;
  xmlDocPtr doc;
} ParseArgs;

static void parse_args_mark(void *parse_args) {
  ParseArgs *args = parse_args;
  rb_gc_mark_maybe(args->input);
  rb_gc_mark_maybe(args->url_or_frag);
}

// Wrap a ParseArgs pointer. The underlying ParseArgs must outlive the
// wrapper.
static VALUE wrap_parse_args(ParseArgs *args) {
  return Data_Wrap_Struct(rb_cData, parse_args_mark, RUBY_NEVER_FREE, args);
}

// Returns the underlying ParseArgs wrapped by wrap_parse_args.
static ParseArgs *unwrap_parse_args(VALUE obj) {
  ParseArgs *args;
  Data_Get_Struct(obj, ParseArgs, args);
  return args;
}

static VALUE parse_cleanup(VALUE parse_args) {
  ParseArgs *args = unwrap_parse_args(parse_args);
  gumbo_destroy_output(args->output);
  // Make sure garbage collection doesn't mark the objects as being live based
  // on references from the ParseArgs. This may be unnecessary.
  args->input = Qnil;
  args->url_or_frag = Qnil;
  if (args->doc != NIL)
    xmlFreeDoc(args->doc);
  return Qnil;
}

static VALUE parse_continue(VALUE parse_args);

// Parse a string using gumbo_parse_with_options into a Nokogiri document.
static VALUE parse(VALUE self, VALUE input, VALUE url, VALUE max_attributes, VALUE max_errors, VALUE max_depth) {
  GumboOptions options = kGumboDefaultOptions;
  options.max_attributes = NUM2INT(max_attributes);
  options.max_errors = NUM2INT(max_errors);
  options.max_tree_depth = NUM2INT(max_depth);

  GumboOutput *output = perform_parse(&options, input);
  ParseArgs args = {
    .output = output,
    .input = input,
    .url_or_frag = url,
    .doc = NIL,
  };
  VALUE parse_args = wrap_parse_args(&args);

  return rb_ensure(parse_continue, parse_args, parse_cleanup, parse_args);
}

static VALUE parse_continue(VALUE parse_args) {
  ParseArgs *args = unwrap_parse_args(parse_args);
  GumboOutput *output = args->output;
  xmlDocPtr doc;
  if (output->document->v.document.has_doctype) {
    const char *name   = output->document->v.document.name;
    const char *public = output->document->v.document.public_identifier;
    const char *system = output->document->v.document.system_identifier;
    public = public[0] ? public : NULL;
    system = system[0] ? system : NULL;
    doc = new_html_doc(name, system, public);
  } else {
    doc = new_html_doc(NULL, NULL, NULL);
  }
  args->doc = doc; // Make sure doc gets cleaned up if an error is thrown.
  build_tree(doc, (xmlNodePtr)doc, output->document);
  VALUE rdoc = Nokogiri_wrap_xml_document(Document, doc);
  args->doc = NIL; // The Ruby runtime now owns doc so don't delete it.
  add_errors(output, rdoc, args->input, args->url_or_frag);
  return rdoc;
}

static int lookup_namespace(VALUE node, bool require_known_ns) {
  ID namespace, href;
  CONST_ID(namespace, "namespace");
  CONST_ID(href, "href");
  VALUE ns = rb_funcall(node, namespace, 0);

  if (NIL_P(ns))
    return GUMBO_NAMESPACE_HTML;
  ns = rb_funcall(ns, href, 0);
  assert(RTEST(ns));
  Check_Type(ns, T_STRING);

  const char *href_ptr = RSTRING_PTR(ns);
  size_t href_len = RSTRING_LEN(ns);
#define NAMESPACE_P(uri) (href_len == sizeof uri - 1 && !memcmp(href_ptr, uri, href_len))
  if (NAMESPACE_P("http://www.w3.org/1999/xhtml"))
    return GUMBO_NAMESPACE_HTML;
  if (NAMESPACE_P("http://www.w3.org/1998/Math/MathML"))
    return GUMBO_NAMESPACE_MATHML;
  if (NAMESPACE_P("http://www.w3.org/2000/svg"))
    return GUMBO_NAMESPACE_SVG;
#undef NAMESPACE_P
  if (require_known_ns)
    // Use a precision (%.*s) so that at most href_len bytes are printed.
    rb_raise(rb_eArgError, "Unexpected namespace URI \"%.*s\"", (int)href_len, href_ptr);
  return -1;
}
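
/*
A minimal sketch, in Ruby, of the mapping lookup_namespace implements, assuming
the namespace href is already available as a String or nil. The constant
KNOWN_NAMESPACES_SKETCH, the helper lookup_namespace_sketch and the symbols it
returns are hypothetical; the C function returns GumboNamespaceEnum values and
-1 for an unknown namespace.

```ruby
KNOWN_NAMESPACES_SKETCH = {
  'http://www.w3.org/1999/xhtml'       => :html,
  'http://www.w3.org/1998/Math/MathML' => :mathml,
  'http://www.w3.org/2000/svg'         => :svg
}.freeze

# A node without a namespace is treated as HTML. Unknown URIs either raise
# or yield nil, depending on require_known_ns.
def lookup_namespace_sketch(href, require_known_ns:)
  return :html if href.nil?
  KNOWN_NAMESPACES_SKETCH.fetch(href) do
    raise ArgumentError, "Unexpected namespace URI \"#{href}\"" if require_known_ns
    nil
  end
end
```
*/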

static xmlNodePtr extract_xml_node(VALUE node) {
#if NGLIB
  xmlNodePtr xml_node;
  Data_Get_Struct(node, xmlNode, xml_node);
  return xml_node;
#else
  return node;
#endif
}

static VALUE fragment_continue(VALUE parse_args);

static VALUE fragment (
  VALUE self,
  VALUE doc_fragment,
  VALUE tags,
  VALUE ctx,
  VALUE max_attributes,
  VALUE max_errors,
  VALUE max_depth
) {
  ID name = rb_intern_const("name");
  const char *ctx_tag;
  GumboNamespaceEnum ctx_ns;
  GumboQuirksModeEnum quirks_mode;
  bool form = false;
  const char *encoding = NULL;

  if (NIL_P(ctx)) {
    ctx_tag = "body";
    ctx_ns = GUMBO_NAMESPACE_HTML;
  } else if (TYPE(ctx) == T_STRING) {
    ctx_tag = StringValueCStr(ctx);
    ctx_ns = GUMBO_NAMESPACE_HTML;
    size_t len = RSTRING_LEN(ctx);
    const char *colon = memchr(ctx_tag, ':', len);
    if (colon) {
      switch (colon - ctx_tag) {
      case 3:
        if (st_strncasecmp(ctx_tag, "svg", 3) != 0)
          goto error;
        ctx_ns = GUMBO_NAMESPACE_SVG;
        break;
      case 4:
        if (st_strncasecmp(ctx_tag, "html", 4) == 0)
          ctx_ns = GUMBO_NAMESPACE_HTML;
        else if (st_strncasecmp(ctx_tag, "math", 4) == 0)
          ctx_ns = GUMBO_NAMESPACE_MATHML;
        else
          goto error;
        break;
      default:
      error:
        // Use a precision (%.*s) so that only the prefix before the colon is printed.
        rb_raise(rb_eArgError, "Invalid context namespace '%.*s'", (int)(colon - ctx_tag), ctx_tag);
      }
      ctx_tag = colon+1;
    } else {
      // For convenience, put 'svg' and 'math' in their namespaces.
      if (len == 3 && st_strncasecmp(ctx_tag, "svg", 3) == 0)
        ctx_ns = GUMBO_NAMESPACE_SVG;
      else if (len == 4 && st_strncasecmp(ctx_tag, "math", 4) == 0)
        ctx_ns = GUMBO_NAMESPACE_MATHML;
    }

    // Check if it's a form.
    form = ctx_ns == GUMBO_NAMESPACE_HTML && st_strcasecmp(ctx_tag, "form") == 0;
  } else {
    ID element_ = rb_intern_const("element?");

    // Context fragment name.
    VALUE tag_name = rb_funcall(ctx, name, 0);
    assert(RTEST(tag_name));
    Check_Type(tag_name, T_STRING);
    ctx_tag = StringValueCStr(tag_name);

    // Context fragment namespace.
    ctx_ns = lookup_namespace(ctx, true);

    // Check for a form ancestor, including self.
    for (VALUE node = ctx;
         !NIL_P(node);
         node = rb_respond_to(node, parent) ? rb_funcall(node, parent, 0) : Qnil) {
      if (!RTEST(rb_funcall(node, element_, 0)))
        continue;
      VALUE element_name = rb_funcall(node, name, 0);
      if (RSTRING_LEN(element_name) == 4
          && !st_strcasecmp(RSTRING_PTR(element_name), "form")
          && lookup_namespace(node, false) == GUMBO_NAMESPACE_HTML) {
        form = true;
        break;
      }
    }

    // Encoding.
    if (RSTRING_LEN(tag_name) == 14
        && !st_strcasecmp(ctx_tag, "annotation-xml")) {
      VALUE enc = rb_funcall(ctx, rb_intern_const("[]"),
                             rb_utf8_str_new_static("encoding", 8));
      if (RTEST(enc)) {
        Check_Type(enc, T_STRING);
        encoding = StringValueCStr(enc);
      }
    }
  }

  // Quirks mode.
  VALUE doc = rb_funcall(doc_fragment, rb_intern_const("document"), 0);
  VALUE dtd = rb_funcall(doc, internal_subset, 0);
  if (NIL_P(dtd)) {
    quirks_mode = GUMBO_DOCTYPE_NO_QUIRKS;
  } else {
    VALUE dtd_name = rb_funcall(dtd, name, 0);
    VALUE pubid = rb_funcall(dtd, rb_intern_const("external_id"), 0);
    VALUE sysid = rb_funcall(dtd, rb_intern_const("system_id"), 0);
    quirks_mode = gumbo_compute_quirks_mode (
      NIL_P(dtd_name)? NULL:StringValueCStr(dtd_name),
      NIL_P(pubid)? NULL:StringValueCStr(pubid),
      NIL_P(sysid)? NULL:StringValueCStr(sysid)
    );
  }

  // Perform a fragment parse.
  int depth = NUM2INT(max_depth);
  GumboOptions options = kGumboDefaultOptions;
  options.max_attributes = NUM2INT(max_attributes);
  options.max_errors = NUM2INT(max_errors);
  // Add one to account for the HTML element.
  options.max_tree_depth = depth < 0 ? -1 : (depth + 1);
  options.fragment_context = ctx_tag;
  options.fragment_namespace = ctx_ns;
  options.fragment_encoding = encoding;
  options.quirks_mode = quirks_mode;
  options.fragment_context_has_form_ancestor = form;

  GumboOutput *output = perform_parse(&options, tags);
  ParseArgs args = {
    .output = output,
    .input = tags,
    .url_or_frag = doc_fragment,
    .doc = (xmlDocPtr)extract_xml_node(doc),
  };
  VALUE parse_args = wrap_parse_args(&args);
  rb_ensure(fragment_continue, parse_args, parse_cleanup, parse_args);
  return Qnil;
}
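
/*
A minimal sketch, in Ruby, of how fragment interprets a String context such as
"svg:path", following the branch for T_STRING above. The helper name
parse_context_sketch and the returned symbols are hypothetical; the form
ancestor, encoding and quirks-mode handling are not modelled.

```ruby
# Split a context string into [namespace, tag_name].
def parse_context_sketch(ctx)
  prefix, colon, local = ctx.partition(':')
  if colon.empty?
    # No prefix: bare "svg" and "math" are placed in their own namespaces.
    return [:svg, ctx] if ctx.casecmp?('svg')
    return [:mathml, ctx] if ctx.casecmp?('math')
    return [:html, ctx]
  end
  namespace =
    if prefix.casecmp?('svg') then :svg
    elsif prefix.casecmp?('html') then :html
    elsif prefix.casecmp?('math') then :mathml
    else raise ArgumentError, "Invalid context namespace '#{prefix}'"
    end
  [namespace, local]
end
```

As in the C code, the split happens at the first colon and the prefix is
matched case-insensitively.
*/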

static VALUE fragment_continue(VALUE parse_args) {
  ParseArgs *args = unwrap_parse_args(parse_args);
  GumboOutput *output = args->output;
  VALUE doc_fragment = args->url_or_frag;
  xmlDocPtr xml_doc = args->doc;

  args->doc = NIL; // The Ruby runtime owns doc so make sure we don't delete it.
  xmlNodePtr xml_frag = extract_xml_node(doc_fragment);
  build_tree(xml_doc, xml_frag, output->root);
  add_errors(output, doc_fragment, args->input, rb_utf8_str_new_static("#fragment", 9));
  return Qnil;
}

// Initialize the Nokogumbo class and fetch constants we will use later.
void Init_nokogumbo() {
  rb_funcall(rb_mKernel, rb_intern_const("gem"), 1, rb_utf8_str_new_static("nokogiri", 8));
  rb_require("nokogiri");

  VALUE line_supported = Qtrue;

#if !NGLIB
  // Class constants.
  VALUE mNokogiri = rb_const_get(rb_cObject, rb_intern_const("Nokogiri"));
  VALUE mNokogiriXml = rb_const_get(mNokogiri, rb_intern_const("XML"));
  cNokogiriXmlSyntaxError = rb_const_get(mNokogiriXml, rb_intern_const("SyntaxError"));
  rb_gc_register_mark_object(cNokogiriXmlSyntaxError);
  cNokogiriXmlElement = rb_const_get(mNokogiriXml, rb_intern_const("Element"));
  rb_gc_register_mark_object(cNokogiriXmlElement);
  cNokogiriXmlText = rb_const_get(mNokogiriXml, rb_intern_const("Text"));
  rb_gc_register_mark_object(cNokogiriXmlText);
  cNokogiriXmlCData = rb_const_get(mNokogiriXml, rb_intern_const("CDATA"));
  rb_gc_register_mark_object(cNokogiriXmlCData);
  cNokogiriXmlComment = rb_const_get(mNokogiriXml, rb_intern_const("Comment"));
  rb_gc_register_mark_object(cNokogiriXmlComment);

  // Interned symbols.
  new = rb_intern_const("new");
  node_name_ = rb_intern_const("node_name=");

  // #line is not supported (returns 0)
  line_supported = Qfalse;
#endif

  // Class constants.
  VALUE HTML5 = rb_const_get(mNokogiri, rb_intern_const("HTML5"));
  Document = rb_const_get(HTML5, rb_intern_const("Document"));
  rb_gc_register_mark_object(Document);

  // Interned symbols.
  internal_subset = rb_intern_const("internal_subset");
  parent = rb_intern_const("parent");

  // Define Nokogumbo module with parse and fragment methods.
  VALUE Gumbo = rb_define_module("Nokogumbo");
  rb_define_singleton_method(Gumbo, "parse", parse, 5);
  rb_define_singleton_method(Gumbo, "fragment", fragment, 6);

  // Add private constant for testing.
  rb_define_const(Gumbo, "LINE_SUPPORTED", line_supported);
  rb_funcall(Gumbo, rb_intern_const("private_constant"), 1,
             rb_utf8_str_new_cstr("LINE_SUPPORTED"));
}

// vim: set shiftwidth=2 softtabstop=2 tabstop=8 expandtab:
nokogumbo-2.0.5/LICENSE.txt

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.