html2text-0.2.0/0000755000175000017500000000000013100046034014721 5ustar balasankarcbalasankarc
html2text-0.2.0/README.md0000644000175000017500000000304013100046034016175 0ustar balasankarcbalasankarc
html2text [![Build Status](https://travis-ci.org/soundasleep/html2text_ruby.svg?branch=master)](https://travis-ci.org/soundasleep/html2text_ruby)
==============

`html2text` is a very simple script that uses Nokogiri's DOM methods to load HTML from a string, and then iterates over the resulting DOM to correctly output plain text. For example:

```html
Ignored Title

Hello, World!

This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.

Even mismatched tags.

A div
Another div
A div
within a div
A link
```

Will be converted into:

```text
Hello, World!

This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.

Even mismatched tags.

A div
Another div
A div
within a div

[A link](http://foo.com)
```

See the [original blog post](http://journals.jevon.org/users/jevon-phd/entry/19818) or the related [StackOverflow answer](http://stackoverflow.com/a/2564472/39531).

## Installing

TODO

Install the gem, then you can:

```ruby
require 'html2text'

text = Html2Text.convert(html)
```

## Tests

See all of the test cases defined in [spec/examples/](spec/examples/). These can be run with:

```
bundle install
rspec
```

## License

`html2text` is licensed under MIT.

## Other versions

Also see [html2text](https://github.com/soundasleep/html2text), the original PHP implementation.
html2text-0.2.0/spec/0000755000175000017500000000000013100046034015653 5ustar balasankarcbalasankarc
html2text-0.2.0/spec/examples_spec.rb0000644000175000017500000000135413100046034021033 0ustar balasankarcbalasankarc
require "spec_helper"

describe Html2Text do
  describe "#convert" do
    let(:text) { Html2Text.convert(html) }

    examples = Dir[File.dirname(__FILE__) + "/examples/*.html"]

    examples.each do |filename|
      context "#{filename}" do
        let(:html) { File.read(filename) }
        let(:text_file) { filename.sub(".html", ".txt") }
        let(:expected) { Html2Text.fix_newlines(File.read(text_file)) }

        it "has an expected output" do
          expect(File.exist?(text_file)).to eq(true), "'#{text_file}' did not exist"
        end

        it "converts to text" do
          expect(text).to eq(expected)
        end
      end
    end

    it "has examples to test" do
      expect(examples.size).to_not eq(0)
    end
  end
end
html2text-0.2.0/spec/html2text_spec.rb0000644000175000017500000000145413100046034021151 0ustar balasankarcbalasankarc
require "spec_helper"

describe Html2Text do
  describe "#convert" do
    let(:text) { Html2Text.convert(html) }

    context "an empty line" do
      let(:html) { "" }

      it "is an empty line" do
        expect(text).to eq("")
      end
    end
    context "a simple string" do
      let(:html) { "hello world" }

      it "is the same string" do
        expect(text).to eq("hello world")
      end
    end
  end

  describe "#remove_leading_and_trailing_whitespace" do
    let(:subject) { Html2Text.new(nil).remove_leading_and_trailing_whitespace(input) }

    context "an empty string" do
      let(:input) { "" }

      it { is_expected.to eq("") }
    end

    context "many new lines" do
      let(:input) { "hello\n world \n yes" }

      it { is_expected.to eq("hello\nworld\nyes") }
    end
  end
end
html2text-0.2.0/spec/spec_helper.rb0000644000175000017500000000017113100046034020470 0ustar balasankarcbalasankarc
require "rspec"
require "rspec/collection_matchers"

require File.join(File.dirname(__FILE__), "..", "lib", "html2text")
html2text-0.2.0/spec/examples/0000755000175000017500000000000013100046034017471 5ustar balasankarcbalasankarc
html2text-0.2.0/spec/examples/basic.html0000644000175000017500000000064213100046034021442 0ustar balasankarcbalasankarc
Ignored Title

Hello, World!

This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.

Even mismatched tags.

A div
Another div
A div
within a div

Another line
Yet another line

A link
html2text-0.2.0/spec/examples/more-anchors.txt0000644000175000017500000000044313100046034022630 0ustar balasankarcbalasankarc
Anchor tests

Visit http://openiaml.org or openiaml.org or http://openiaml.org.

To visit with SSL, visit https://openiaml.org or openiaml.org or https://openiaml.org.

To mail, email support@openiaml.org or mailto:support@openiaml.org or support@openiaml.org or mailto:support@openiaml.org.
html2text-0.2.0/spec/examples/table.html0000644000175000017500000000141413100046034021446 0ustar balasankarcbalasankarc
Ignored Title

Hello, World!

Col A Col B
Data A1 Data B1
Data A2 Data B2
Data A3 Data B4
Total A Total B
html2text-0.2.0/spec/examples/full_email.txt0000644000175000017500000000233613100046034022347 0ustar balasankarcbalasankarc
http://localhost/home 16 December 2015 Account 123 Hi Susan Here is your cat report. You have found 5 cats less than anyone else [Find more cats](http://localhost/cats) Down the road Across the hall Your achievements You're currently finding about 12 cats per day [Number of cats found] --------------------------------------------------------------- Your last cat was found two days ago. One type of cat is a kitten. Special account A1 12.345 http://localhost/logout How can you find more cats? Look in trash cans Start meowing Eat cat food Some cats like to hang out in trash cans. Some cats do not. Some cats are attracted to similar tones. So one day your tears may smell like cat food, attracting more cats. https://localhost/about https://localhost/about https://localhost/about [Cats are great.](https://github.com/soundasleep/html2text_ruby) [Find more cats.](https://github.com/soundasleep/html2text_ruby) [Do more things.](https://github.com/soundasleep/html2text_ruby) [Contact us](http://localhost/contact) cats@cats.com Monday and Friday https://github.com/soundasleep/html2text https://github.com/soundasleep/html2text_ruby Having trouble seeing this email? [View it online](http://localhost/view_it_online).
html2text-0.2.0/spec/examples/anchors.html0000644000175000017500000000065613100046034022023 0ustar balasankarcbalasankarc
A document without any HTML open/closing tags.
We try and use the representation given by common browsers of the HTML document, so that it looks similar when converted to plain text. visit foo.com - or http://www.foo.com link

An anchor which will not appear

html2text-0.2.0/spec/examples/lists.html0000644000175000017500000000033613100046034021517 0ustar balasankarcbalasankarc

List tests

Add some lists.

  1. one
  2. two
  3. three

An unordered list

html2text-0.2.0/spec/examples/test4.txt0000644000175000017500000000001713100046034021273 0ustar balasankarcbalasankarc
1
2
3
4
5 6
html2text-0.2.0/spec/examples/lists.txt0000644000175000017500000000015513100046034021371 0ustar balasankarcbalasankarc
List tests

Add some lists.

- one
- two
- three

An unordered list

- one
- two
- three

- one
- two
- three
html2text-0.2.0/spec/examples/nbsp.html0000644000175000017500000000006013100046034021315 0ustar balasankarcbalasankarc
hello &nbsp; world &amp; people &lt; &gt; &NBSP;
html2text-0.2.0/spec/examples/full_email.html0000644000175000017500000003557113100046034022503 0ustar balasankarcbalasankarc

Hi Susan

Here is your cat report.

Down the road

Across the hall

Your achievements

You're currently finding about
12 cats
per day
 
Number of cats found

Your last cat was found two days ago.

One type of cat is a kitten.

Special account A1

12.345

How can you find more cats?

Look in trash cans

Start meowing

Eat cat food

Some cats like to hang out in trash cans. Some cats do not. Some cats are attracted to similar tones. So one day your tears may smell like cat food, attracting more cats.
html2text-0.2.0/spec/examples/table.txt0000644000175000017500000000013213100046034021315 0ustar balasankarcbalasankarc
Hello, World!

Col A Col B
Data A1 Data B1
Data A2 Data B2
Data A3 Data B4
Total A Total B
html2text-0.2.0/spec/examples/images.txt0000644000175000017500000000053313100046034021500 0ustar balasankarcbalasankarc
One:

Two: [two]

Three: [three]

Four: [four]

With links

One: http://localhost

Two: [two](http://localhost)

Three: [three](http://localhost)

Four: [four](http://localhost)

With links with titles

One: [one link](http://localhost)

Two: [two link](http://localhost)

Three: [three link](http://localhost)

Four: [four link](http://localhost)
html2text-0.2.0/spec/examples/test3.txt0000644000175000017500000000002213100046034021266 0ustar balasankarcbalasankarc
test one
test two
html2text-0.2.0/spec/examples/test4.html0000644000175000017500000000003713100046034021422 0ustar balasankarcbalasankarc
1
2
3
4
5 6
html2text-0.2.0/spec/examples/test3.html0000644000175000017500000000002613100046034021417 0ustar balasankarcbalasankarc
test one
test two
html2text-0.2.0/spec/examples/anchors.txt0000644000175000017500000000055013100046034021667 0ustar balasankarcbalasankarc
A document without any HTML open/closing tags.
---------------------------------------------------------------
We try and use the representation given by common browsers of the HTML document, so that it looks similar when converted to plain text. [visit foo.com](http://foo.com) - or http://www.foo.com [link](http://foo.com)

[An anchor which will not appear]
html2text-0.2.0/spec/examples/nbsp.txt0000644000175000017500000000003713100046034021174 0ustar balasankarcbalasankarc
hello world & people < > &NBSP;
html2text-0.2.0/spec/examples/basic.txt0000644000175000017500000000037213100046034021315 0ustar balasankarcbalasankarc
Hello, World! This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly. Even mismatched tags. A div Another div A div within a div Another line Yet another line [A link](http://foo.com)
html2text-0.2.0/spec/examples/images.html0000644000175000017500000000210013100046034021615 0ustar balasankarcbalasankarc

One:

Two: two

Three:

Four: four alt

With links

One:

Two: two

Three:

Four: four alt

With links with titles

One:

Two: two

Three:

Four: four alt

html2text-0.2.0/spec/examples/more-anchors.html0000644000175000017500000000104613100046034022755 0ustar balasankarcbalasankarc

Anchor tests

Visit http://openiaml.org or openiaml.org or http://openiaml.org.

To visit with SSL, visit https://openiaml.org or openiaml.org or https://openiaml.org.

To mail, email support@openiaml.org or mailto:support@openiaml.org or support@openiaml.org or mailto:support@openiaml.org.

html2text-0.2.0/html2text.gemspec0000644000175000017500000000715013100046034020224 0ustar balasankarcbalasankarc
#########################################################
# This file has been automatically generated by gem2tgz #
#########################################################
# -*- encoding: utf-8 -*-
# stub: html2text 0.2.0 ruby lib

Gem::Specification.new do |s|
  s.name = "html2text".freeze
  s.version = "0.2.0"

  s.required_rubygems_version = Gem::Requirement.new(">= 0".freeze) if s.respond_to? :required_rubygems_version=
  s.require_paths = ["lib".freeze]
  s.authors = ["Jevon Wright".freeze]
  s.date = "2015-12-18"
  s.description = "A Ruby component to convert HTML into a plain text format.".freeze
  s.email = ["jevon@powershop.co.nz".freeze]
  s.files = ["LICENSE.md".freeze, "README.md".freeze, "lib/html2text.rb".freeze, "lib/html2text/version.rb".freeze, "spec/examples/anchors.html".freeze, "spec/examples/anchors.txt".freeze, "spec/examples/basic.html".freeze, "spec/examples/basic.txt".freeze, "spec/examples/full_email.html".freeze, "spec/examples/full_email.txt".freeze, "spec/examples/images.html".freeze, "spec/examples/images.txt".freeze, "spec/examples/lists.html".freeze, "spec/examples/lists.txt".freeze, "spec/examples/more-anchors.html".freeze, "spec/examples/more-anchors.txt".freeze, "spec/examples/nbsp.html".freeze, "spec/examples/nbsp.txt".freeze, "spec/examples/table.html".freeze, "spec/examples/table.txt".freeze, "spec/examples/test3.html".freeze, "spec/examples/test3.txt".freeze, "spec/examples/test4.html".freeze, "spec/examples/test4.txt".freeze, "spec/examples_spec.rb".freeze, "spec/html2text_spec.rb".freeze, "spec/spec_helper.rb".freeze]
  s.homepage = "https://github.com/soundasleep/html2text_ruby".freeze
  s.licenses = ["MIT".freeze]
  s.rubygems_version = "2.5.2".freeze
  s.summary = "Convert HTML into plain text.".freeze
  s.test_files = ["spec/examples/anchors.html".freeze, "spec/examples/anchors.txt".freeze, "spec/examples/basic.html".freeze, "spec/examples/basic.txt".freeze, "spec/examples/full_email.html".freeze, "spec/examples/full_email.txt".freeze, "spec/examples/images.html".freeze, "spec/examples/images.txt".freeze, "spec/examples/lists.html".freeze, "spec/examples/lists.txt".freeze, "spec/examples/more-anchors.html".freeze, "spec/examples/more-anchors.txt".freeze, "spec/examples/nbsp.html".freeze, "spec/examples/nbsp.txt".freeze, "spec/examples/table.html".freeze, "spec/examples/table.txt".freeze, "spec/examples/test3.html".freeze, "spec/examples/test3.txt".freeze, "spec/examples/test4.html".freeze, "spec/examples/test4.txt".freeze, "spec/examples_spec.rb".freeze, "spec/html2text_spec.rb".freeze, "spec/spec_helper.rb".freeze]

  if s.respond_to? :specification_version then
    s.specification_version = 4

    if Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0') then
      s.add_development_dependency(%q.freeze, [">= 0"])
      s.add_runtime_dependency(%q.freeze, ["~> 1.6"])
      s.add_development_dependency(%q.freeze, [">= 0"])
      s.add_development_dependency(%q.freeze, [">= 0"])
      s.add_development_dependency(%q.freeze, [">= 0"])
    else
      s.add_dependency(%q.freeze, [">= 0"])
      s.add_dependency(%q.freeze, ["~> 1.6"])
      s.add_dependency(%q.freeze, [">= 0"])
      s.add_dependency(%q.freeze, [">= 0"])
      s.add_dependency(%q.freeze, [">= 0"])
    end
  else
    s.add_dependency(%q.freeze, [">= 0"])
    s.add_dependency(%q.freeze, ["~> 1.6"])
    s.add_dependency(%q.freeze, [">= 0"])
    s.add_dependency(%q.freeze, [">= 0"])
    s.add_dependency(%q.freeze, [">= 0"])
  end
end
html2text-0.2.0/LICENSE.md0000644000175000017500000000203413100046034016324 0ustar balasankarcbalasankarc
Copyright 2015 Jevon Wright

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
html2text-0.2.0/lib/0000755000175000017500000000000013100046034015467 5ustar balasankarcbalasankarc
html2text-0.2.0/lib/html2text/0000755000175000017500000000000013100046034017422 5ustar balasankarcbalasankarc
html2text-0.2.0/lib/html2text/version.rb0000644000175000017500000000005013100046034021427 0ustar balasankarcbalasankarc
class Html2Text
  VERSION = "0.2.0"
end
html2text-0.2.0/lib/html2text.rb0000644000175000017500000000745213100046034017757 0ustar balasankarcbalasankarc
require 'nokogiri'

class Html2Text
  attr_reader :doc

  def initialize(doc)
    @doc = doc
  end

  def self.convert(html)
    html = fix_newlines(replace_entities(html))
    doc = Nokogiri::HTML(html)

    Html2Text.new(doc).convert
  end

  def self.fix_newlines(text)
    text.gsub("\r\n", "\n").gsub("\r", "\n")
  end

  def self.replace_entities(text)
    text.gsub("&nbsp;", " ").gsub("\u00a0", " ")
  end

  def convert
    output = iterate_over(doc)
    output = remove_leading_and_trailing_whitespace(output)
    output = remove_unnecessary_empty_lines(output)
    output.strip
  end

  def remove_leading_and_trailing_whitespace(text)
    text.gsub(/[ \t]*\n[ \t]*/im, "\n").gsub(/ *\t */im, "\t")
  end

  def remove_unnecessary_empty_lines(text)
    text.gsub(/\n\n\n*/im, "\n\n")
  end

  def trimmed_whitespace(text)
    # Replace whitespace characters with a space (equivalent to \s)
    text.gsub(/[\t\n\f\r ]+/im, " ")
  end

  def next_node_name(node)
    next_node = node.next_sibling
    while next_node != nil
      break if next_node.element?
      next_node = next_node.next_sibling
    end

    if next_node && next_node.element?
      next_node.name.downcase
    end
  end

  def iterate_over(node)
    return trimmed_whitespace(node.text) if node.text?

    if ["style", "head", "title", "meta", "script"].include?(node.name.downcase)
      return ""
    end

    output = []

    output << prefix_whitespace(node)
    output += node.children.map do |child|
      iterate_over(child)
    end
    output << suffix_whitespace(node)

    output = output.compact.join("") || ""

    if node.name.downcase == "a"
      output = wrap_link(node, output)
    end
    if node.name.downcase == "img"
      output = image_text(node)
    end

    output
  end

  def prefix_whitespace(node)
    case node.name.downcase
    when "hr"
      "---------------------------------------------------------------\n"

    when "h1", "h2", "h3", "h4", "h5", "h6", "ol", "ul"
      "\n"

    when "tr", "p", "div"
      "\n"

    when "td", "th"
      "\t"

    when "li"
      "- "
    end
  end

  def suffix_whitespace(node)
    case node.name.downcase
    when "h1", "h2", "h3", "h4", "h5", "h6"
      # add another line
      "\n"

    when "p", "br"
      "\n" if next_node_name(node) != "div"

    when "li"
      "\n"

    when "div"
      # add one line only if the next child isn't a div
      "\n" if next_node_name(node) != "div" && next_node_name(node) != nil
    end
  end

  # links are returned in [text](link) format
  def wrap_link(node, output)
    href = node.attribute("href")
    name = node.attribute("name")

    output = output.strip

    # remove double [[ ]]s from linking images
    if output[0] == "[" && output[-1] == "]"
      output = output[1, output.length - 2]

      # for linking images, the title of the <a> overrides the title of the <img>
      if node.attribute("title")
        output = node.attribute("title").to_s
      end
    end

    # if there is no link text, but a title attr
    if output.empty? && node.attribute("title")
      output = node.attribute("title").to_s
    end

    if href.nil?
      if !name.nil?
        output = "[#{output}]"
      end
    else
      href = href.to_s

      if href != output && href != "mailto:#{output}" &&
          href != "http://#{output}" && href != "https://#{output}"
        if output.empty?
          output = href
        else
          output = "[#{output}](#{href})"
        end
      end
    end

    case next_node_name(node)
    when "h1", "h2", "h3", "h4", "h5", "h6"
      output += "\n"
    end

    output
  end

  def image_text(node)
    if node.attribute("title")
      "[" + node.attribute("title").to_s + "]"
    elsif node.attribute("alt")
      "[" + node.attribute("alt").to_s + "]"
    else
      ""
    end
  end
end
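
The `href` branch of `wrap_link` decides between bare text, a bare URL, and the `[text](link)` form. The sketch below restates that decision for plain strings so it can be tried without Nokogiri. `format_link` is a hypothetical helper that is not part of the gem, and it deliberately leaves out the `title` and `name` attribute handling and the trailing newline added before headings.

```ruby
# Hypothetical helper (not part of the gem): mirrors only the href branch
# of Html2Text#wrap_link, using plain strings instead of Nokogiri nodes.
def format_link(text, href)
  text = text.strip
  return text if href.nil?

  # A link whose text already shows its target is left as bare text.
  redundant = [text, "mailto:#{text}", "http://#{text}", "https://#{text}"].include?(href)
  return text if redundant

  # A link without any text falls back to the bare URL.
  return href if text.empty?

  "[#{text}](#{href})"
end

puts format_link("A link", "http://foo.com")           # [A link](http://foo.com)
puts format_link("openiaml.org", "http://openiaml.org") # openiaml.org
puts format_link("", "http://localhost")               # http://localhost
```

These three cases correspond to lines in the `basic.txt`, `more-anchors.txt` and `images.txt` fixtures respectively.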