--- beautifulsoup4-4.4.1/beautifulsoup4.egg-info/PKG-INFO ---
Metadata-Version: 1.1
Name: beautifulsoup4
Version: 4.4.1
Summary: Screen-scraping library
Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
Author: Leonard Richardson
Author-email: leonardr@segfault.org
License: MIT
Download-URL: http://www.crummy.com/software/BeautifulSoup/bs4/download/
Description: Beautiful Soup sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Text Processing :: Markup :: XML
Classifier: Topic :: Text Processing :: Markup :: SGML
Classifier: Topic :: Software Development :: Libraries :: Python Modules

--- beautifulsoup4-4.4.1/beautifulsoup4.egg-info/top_level.txt ---
bs4

--- beautifulsoup4-4.4.1/beautifulsoup4.egg-info/SOURCES.txt ---
AUTHORS.txt
COPYING.txt
MANIFEST.in
NEWS.txt
README.txt
TODO.txt
convert-py3k
setup.cfg
setup.py
test-all-versions
beautifulsoup4.egg-info/PKG-INFO
beautifulsoup4.egg-info/SOURCES.txt
beautifulsoup4.egg-info/dependency_links.txt
beautifulsoup4.egg-info/requires.txt
beautifulsoup4.egg-info/top_level.txt
bs4/__init__.py
bs4/dammit.py
bs4/diagnose.py
bs4/element.py
bs4/testing.py
bs4/builder/__init__.py
bs4/builder/_html5lib.py
bs4/builder/_htmlparser.py
bs4/builder/_lxml.py
bs4/tests/__init__.py
bs4/tests/test_builder_registry.py
bs4/tests/test_docs.py
bs4/tests/test_html5lib.py
bs4/tests/test_htmlparser.py
bs4/tests/test_lxml.py
bs4/tests/test_soup.py
bs4/tests/test_tree.py
doc/Makefile
doc.zh/Makefile
doc.zh/source/conf.py
doc/source/6.1.jpg
doc/source/conf.py
doc/source/index.rst
scripts/demonstrate_parser_differences.py
scripts/demonstration_markup.txt

--- beautifulsoup4-4.4.1/beautifulsoup4.egg-info/dependency_links.txt ---

--- beautifulsoup4-4.4.1/beautifulsoup4.egg-info/requires.txt ---
[lxml]
lxml
[html5lib]
html5lib

--- beautifulsoup4-4.4.1/setup.py ---
from setuptools import setup, find_packages

setup(
    name="beautifulsoup4",
    version="4.4.1",
    author="Leonard Richardson",
    author_email="leonardr@segfault.org",
    url="http://www.crummy.com/software/BeautifulSoup/bs4/",
    download_url="http://www.crummy.com/software/BeautifulSoup/bs4/download/",
    description="Screen-scraping library",
    long_description="Beautiful Soup sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.",
    license="MIT",
    packages=find_packages(exclude=["tests*"]),
    extras_require={
        "lxml": ["lxml"],
        "html5lib": ["html5lib"],
    },
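    # The extras above are opt-in: a plain `pip install beautifulsoup4` brings
    # in no third-party parser, while e.g. `pip install "beautifulsoup4[lxml]"`
    # also installs lxml (standard setuptools extras behavior).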
    use_2to3=True,
    classifiers=[
        "Development Status :: 5 - Production/Stable",
        "Intended Audience :: Developers",
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python",
        "Programming Language :: Python :: 2",
        "Programming Language :: Python :: 3",
        "Topic :: Text Processing :: Markup :: HTML",
        "Topic :: Text Processing :: Markup :: XML",
        "Topic :: Text Processing :: Markup :: SGML",
        "Topic :: Software Development :: Libraries :: Python Modules",
    ],
)

--- beautifulsoup4-4.4.1/test-all-versions ---
python2.7 -m unittest discover -s bs4 && ./convert-py3k
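# The line above runs the suite under Python 2.7; convert-py3k (as its name
# suggests) then translates the source with 2to3 and repeats the tests under
# Python 3.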

--- beautifulsoup4-4.4.1/setup.cfg ---
[egg_info]
tag_build =
tag_date = 0
tag_svn_revision = 0

--- beautifulsoup4-4.4.1/doc.zh/Makefile ---
# Makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
PAPER =
BUILDDIR = build
# Internal variables.
PAPEROPT_a4 = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest
help:
@echo "Please use \`make
Instead of ignoring the dangling </p> tag, html5lib pairs it with an
opening <p> tag. This parser also adds an empty <head> tag to the
document.

Here's the same document parsed with Python's built-in HTML parser::

 BeautifulSoup("<a></p>", "html.parser")
 # <a></a>

Like html5lib, this parser ignores the closing </p> tag. Unlike
html5lib, this parser makes no attempt to create a well-formed HTML
document by adding a <head> tag. Unlike lxml, it doesn't even bother
to add an <html> tag.

Since the document "<a></p>" is invalid, none of these techniques is
the "correct" way to handle it. The html5lib parser uses techniques
that are part of the HTML5 standard, so it has the best claim on
being the "correct" way, but all three techniques are legitimate.

Differences between parsers can affect your script. If you're planning
on distributing your script to other people, or running it on multiple
machines, you should specify a parser in the ``BeautifulSoup``
constructor. That will reduce the chances that your users parse a
document differently from the way you parse it.
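
For example, here's a minimal sketch of pinning the parser, assuming
lxml has been installed (``"html.parser"`` ships with Python and needs
no extra install)::

 from bs4 import BeautifulSoup

 # Naming the parser explicitly makes results reproducible across machines.
 soup = BeautifulSoup("<a></p>", "lxml")

 # The stdlib fallback requires no third-party package.
 soup = BeautifulSoup("<a></p>", "html.parser")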

Encodings
=========

Any HTML or XML document is written in a specific encoding like ASCII
or UTF-8. But when you load that document into Beautiful Soup, you'll
discover it's been converted to Unicode::

 markup = "<h1>Sacr\xe9 bleu!</h1>"
 soup = BeautifulSoup(markup)
 soup.h1
 # <h1>Sacré bleu!</h1>
 soup.h1.string
 # u'Sacr\xe9 bleu!'

Beautiful Soup uses a sub-library called `Unicode, Dammit`_ to detect
a document's encoding and convert it to Unicode. The autodetected
encoding is available as the ``.original_encoding`` attribute of the
``BeautifulSoup`` object::

 soup.original_encoding
 # 'utf-8'
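
If the guess is wrong, or you know the encoding in advance, you can
override the detection. A minimal sketch, using the constructor's
``from_encoding`` argument::

 # Skip autodetection and decode the markup as ISO-8859-8.
 soup = BeautifulSoup(markup, from_encoding="iso-8859-8")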

Output encoding
===============

When you write out a document from Beautiful Soup, you get a UTF-8
document, even if the document wasn't in UTF-8 to begin with. Here's a
document written in the Latin-1 encoding::

 markup = b'''
  <html>
   <head>
    <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
   </head>
   <body>
    <p>Sacr\xe9 bleu!</p>
   </body>
  </html>
 '''

 soup = BeautifulSoup(markup)
 print(soup.prettify())
 # <html>
 #  <head>
 #   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
 #  </head>
 #  <body>
 #   <p>
 #    Sacré bleu!
 #   </p>
 #  </body>
 # </html>

Note that the <meta> tag has been rewritten to reflect the fact that
the document is now in UTF-8.

If you don't want UTF-8, you can pass an encoding into ``prettify()``::

 print(soup.prettify("latin-1"))
 # <html>
 #  <head>
 #   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
 # ...

You can also call encode() on the ``BeautifulSoup`` object, or any
element in the soup, just as if it were a Python string::

 soup.p.encode("latin-1")
 # '<p>Sacr\xe9 bleu!</p>'

 soup.p.encode("utf-8")
 # '<p>Sacr\xc3\xa9 bleu!</p>'

Any characters that can't be represented in your chosen encoding will
be converted into numeric XML entity references. Here's a document
that includes the Unicode character SNOWMAN::

 markup = u"<b>\N{SNOWMAN}</b>"
 snowman_soup = BeautifulSoup(markup)
 tag = snowman_soup.b

The SNOWMAN character can be part of a UTF-8 document (it looks like
☃), but there's no representation for that character in ISO-Latin-1 or
ASCII, so it's converted into "&#9731;" for those encodings::

 print(tag.encode("utf-8"))
 # <b>☃</b>

 print(tag.encode("latin-1"))
 # <b>&#9731;</b>

 print(tag.encode("ascii"))
 # <b>&#9731;</b>

Unicode, Dammit
---------------

You can use Unicode, Dammit without using Beautiful Soup. It's useful
whenever you have data in an unknown encoding and you just want it to
become Unicode::

 from bs4 import UnicodeDammit
 dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
 print(dammit.unicode_markup)
 # Sacré bleu!
 dammit.original_encoding
 # 'utf-8'

Unicode, Dammit's guesses will get a lot more accurate if you install
the ``chardet`` or ``cchardet`` Python libraries. The more data you
give Unicode, Dammit, the more accurately it will guess. If you have
your own suspicions as to what the encoding might be, you can pass
them in as a list::

 dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
 print(dammit.unicode_markup)
 # Sacré bleu!
 dammit.original_encoding
 # 'latin-1'

Unicode, Dammit has two special features that Beautiful Soup doesn't
use.

Smart quotes
^^^^^^^^^^^^

You can use Unicode, Dammit to convert Microsoft smart quotes to HTML
or XML entities::

 markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"
" UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup # u'I just “love” Microsoft Word’s smart quotes
' UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup # u'I just “love” Microsoft Word’s smart quotes

You can also convert Microsoft smart quotes to ASCII quotes::

 UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
 # u'<p>I just "love" Microsoft Word\'s smart quotes</p>'

Hopefully you'll find this feature useful, but Beautiful Soup doesn't
use it. Beautiful Soup prefers the default behavior, which is to
convert Microsoft smart quotes to Unicode characters along with
everything else::

 UnicodeDammit(markup, ["windows-1252"]).unicode_markup
 # u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'

Inconsistent encodings
^^^^^^^^^^^^^^^^^^^^^^

Sometimes a document is mostly in UTF-8, but contains Windows-1252
characters such as (again) Microsoft smart quotes. This can happen
when a website includes data from multiple sources. You can use
``UnicodeDammit.detwingle()`` to turn such a document into pure
UTF-8. Here's a simple example::

 snowmen = (u"\N{SNOWMAN}" * 3)
 quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
 doc = snowmen.encode("utf8") + quote.encode("windows_1252")

This document is a mess. The snowmen are in UTF-8 and the quotes are
in Windows-1252. You can display the snowmen or the quotes, but not
both::

 print(doc)
 # ☃☃☃�I like snowmen!�

 print(doc.decode("windows-1252"))
 # â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”

Decoding the document as UTF-8 raises a ``UnicodeDecodeError``, and
decoding it as Windows-1252 gives you gibberish. Fortunately,
``UnicodeDammit.detwingle()`` will convert the string to pure UTF-8,
allowing you to decode it to Unicode and display the snowmen and quote
marks simultaneously::

 new_doc = UnicodeDammit.detwingle(doc)
 print(new_doc.decode("utf8"))
 # ☃☃☃“I like snowmen!”

``UnicodeDammit.detwingle()`` only knows how to handle Windows-1252
embedded in UTF-8 (or vice versa, I suppose), but this is the most
common case.

Note that you must call ``UnicodeDammit.detwingle()`` on your data
before passing it into ``BeautifulSoup`` or the ``UnicodeDammit``
constructor. Beautiful Soup assumes that a document has a single
encoding, whatever it might be. If you pass it a document that
contains both UTF-8 and Windows-1252, it's likely to think the whole
document is Windows-1252, and the document will come out looking like
``â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”``.

``UnicodeDammit.detwingle()`` is new in Beautiful Soup 4.1.0.

Comparing objects for equality
==============================

Beautiful Soup says that two ``NavigableString`` or ``Tag`` objects
are equal when they represent the same HTML or XML markup. In this
example, the two <b> tags are treated as equal, even though they live
in different parts of the object tree, because they both look like
"<b>pizza</b>"::

 markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
" soup = BeautifulSoup(markup, 'html.parser') first_b, second_b = soup.find_all('b') print first_b == second_b # True print first_b.previous_element == second_b.previous_element # False If you want to see whether two variables refer to exactly the same object, use `is`:: print first_b is second_b # False Copying Beautiful Soup objects ============================== You can use ``copy.copy()`` to create a copy of any ``Tag`` or ``NavigableString``:: import copy p_copy = copy.copy(soup.p) print p_copy #I want pizza and more pizza!

The copy is considered equal to the original, since it represents the
same markup as the original, but it's not the same object::

 print(soup.p == p_copy)
 # True

 print(soup.p is p_copy)
 # False

The only real difference is that the copy is completely detached from
the original Beautiful Soup object tree, just as if ``extract()`` had
been called on it::

 print(p_copy.parent)
 # None

This is because two different ``Tag`` objects can't occupy the same
space at the same time.

Parsing only part of a document
===============================

Let's say you want to use Beautiful Soup to look at a document's <a>
tags. It's a waste of time and memory to parse the entire document and
then go over it again looking for <a> tags. It would be much faster to
ignore everything that wasn't an <a> tag in the first place. The
``SoupStrainer`` class allows you to choose which parts of an incoming
document are parsed. You just create a ``SoupStrainer`` and pass it in
to the ``BeautifulSoup`` constructor as the ``parse_only`` argument.

(Note that *this feature won't work if you're using the html5lib
parser*. If you use html5lib, the whole document will be parsed, no
matter what. This is because html5lib constantly rearranges the parse
tree as it works, and if some part of the document didn't actually make
it into the parse tree, it'll crash. To avoid confusion, in the
examples below I'll be forcing Beautiful Soup to use Python's built-in
parser.)

``SoupStrainer``
----------------

The ``SoupStrainer`` class takes the same arguments as a typical
method from `Searching the tree`_: :ref:`name <name>`, :ref:`attrs
<attrs>`, :ref:`string <string>`, and :ref:`**kwargs <kwargs>`. Here
are three ``SoupStrainer`` objects::

 from bs4 import SoupStrainer

 only_a_tags = SoupStrainer("a")

 only_tags_with_id_link2 = SoupStrainer(id="link2")

 def is_short_string(string):
     return len(string) < 10

 only_short_strings = SoupStrainer(string=is_short_string)

I'm going to bring back the "three sisters" document one more time,
and we'll see what the document looks like when it's parsed with these
three ``SoupStrainer`` objects::

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
""" print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify()) # # Elsie # # # Lacie # # # Tillie # print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify()) # # Lacie # print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify()) # Elsie # , # Lacie # and # Tillie # ... # You can also pass a ``SoupStrainer`` into any of the methods covered in `Searching the tree`_. This probably isn't terribly useful, but I thought I'd mention it:: soup = BeautifulSoup(html_doc) soup.find_all(only_short_strings) # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie', # u'\n\n', u'...', u'\n'] Troubleshooting =============== .. _diagnose: ``diagnose()`` -------------- If you're having trouble understanding what Beautiful Soup does to a document, pass the document into the ``diagnose()`` function. (New in Beautiful Soup 4.2.0.) Beautiful Soup will print out a report showing you how different parsers handle the document, and tell you if you're missing a parser that Beautiful Soup could be using:: from bs4.diagnose import diagnose data = open("bad.html").read() diagnose(data) # Diagnostic running on Beautiful Soup 4.2.0 # Python version 2.7.3 (default, Aug 1 2012, 05:16:07) # I noticed that html5lib is not installed. Installing it may help. # Found lxml version 2.3.2.0 # # Trying to parse your data with html.parser # Here's what html.parser did with the document: # ... Just looking at the output of diagnose() may show you how to solve the problem. Even if not, you can paste the output of ``diagnose()`` when asking for help. Errors when parsing a document ------------------------------ There are two different kinds of parse errors. There are crashes, where you feed a document to Beautiful Soup and it raises an exception, usually an ``HTMLParser.HTMLParseError``. And there is unexpected behavior, where a Beautiful Soup parse tree looks a lot different than the document used to create it. Almost none of these problems turn out to be problems with Beautiful Soup. This is not because Beautiful Soup is an amazingly well-written piece of software. It's because Beautiful Soup doesn't include any parsing code. Instead, it relies on external parsers. If one parser isn't working on a certain document, the best solution is to try a different parser. See `Installing a parser`_ for details and a parser comparison. The most common parse errors are ``HTMLParser.HTMLParseError: malformed start tag`` and ``HTMLParser.HTMLParseError: bad end tag``. These are both generated by Python's built-in HTML parser library, and the solution is to :ref:`install lxml or html5lib.