Standards-compliant library for parsing and serializing HTML documents and fragments in Python

html5lib, updated 🕥 2023-03-03 19:25:15

html5lib

.. image:: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml/badge.svg
   :target: https://github.com/html5lib/html5lib-python/actions/workflows/python-tox.yml

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.

Usage

Simple usage follows this pattern:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      document = html5lib.parse(f)

or:

.. code-block:: python

  import html5lib
  document = html5lib.parse("<p>Hello World!")

By default, the document will be an xml.etree element instance. Whenever possible, html5lib chooses the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x).
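
As a rough illustration of the default tree, the sketch below parses a fragment and inspects the resulting xml.etree element. It assumes html5lib is installed; the XHTML constant and the variable names are illustrative only, and namespaceHTMLElements=False is shown as one way to get tag names without the namespace prefix.

```python
import html5lib

# By default html5lib places HTML elements in the XHTML namespace.
XHTML = "{http://www.w3.org/1999/xhtml}"

document = html5lib.parse("<p>Hello World!")
# The parser supplies the missing <html>, <head> and <body> elements,
# and parse() returns the root <html> element.
root_tag = document.tag
paragraph = document.find(".//" + XHTML + "p")

# Passing namespaceHTMLElements=False yields plain tag names instead.
plain = html5lib.parse("<p>Hello World!", namespaceHTMLElements=False)
plain_paragraph = plain.find("body/p")

print(root_tag)
print(paragraph.text, "|", plain_paragraph.text)
```

Searching with the namespaced tag name is easy to forget; when a lookup such as find("body/p") unexpectedly returns None, the namespace is a likely cause.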

Two other tree types are supported: xml.dom.minidom and lxml.etree. To use an alternative format, specify the name of a treebuilder:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When using html5lib with urllib2 (Python 2), the charset from the HTTP response should be passed to html5lib as follows:

.. code-block:: python

  from contextlib import closing
  from urllib2 import urlopen
  import html5lib

  with closing(urlopen("http://example.com/")) as f:
      document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))

When using html5lib with urllib.request (Python 3), the charset from the HTTP response should be passed to html5lib as follows:

.. code-block:: python

  from urllib.request import urlopen
  import html5lib

  with urlopen("http://example.com/") as f:
      document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())
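
For context, on Python 3 f.info() returns the response headers as an email.message.Message-style object, and get_content_charset() reads the charset parameter of the Content-Type header. The sketch below shows that lookup offline using only the standard library; the header values are made up for illustration.

```python
from email.message import Message

# Stand-in for the headers object returned by f.info(); values are hypothetical.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

# get_content_charset() returns the charset parameter coerced to lower case.
charset = headers.get_content_charset()
print(charset)

# Without a charset parameter the method returns None.
no_charset = Message()
no_charset["Content-Type"] = "text/html"
missing = no_charset.get_content_charset()
print(missing)
```

When the server sends no charset, the value passed as transport_encoding is None, and html5lib has to determine the encoding from the document itself.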

To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      parser = html5lib.HTMLParser(strict=True)
      document = parser.parse(f)
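
To make the difference concrete, here is a minimal sketch comparing a default parser with a strict one on the same input. It assumes html5lib is installed and that the strict parser raises html5lib.html5parser.ParseError; the markup and variable names are illustrative.

```python
import html5lib
from html5lib.html5parser import ParseError

# No doctype, which the HTML specification treats as a parse error.
markup = "<p>Hello World!"

# A default (lenient) parser recovers and records the problems it saw.
lenient = html5lib.HTMLParser()
lenient.parse(markup)
error_count = len(lenient.errors)

# A strict parser raises on the first parse error instead.
strict = html5lib.HTMLParser(strict=True)
try:
    strict.parse(markup)
    raised = False
except ParseError as exc:
    raised = True
    print("parse error:", exc)

print(error_count, raised)
```

Strict mode is mostly useful for validating markup you control; real-world pages routinely contain parse errors that the lenient parser recovers from.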

When you're instantiating parser objects explicitly, pass a treebuilder class as the tree keyword argument to use an alternative document format:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")
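
As a small follow-up sketch, the document returned by the "dom" treebuilder can be queried with the standard xml.dom.minidom API. This assumes html5lib is installed; the variable names are illustrative.

```python
import html5lib

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")

# The result is a minidom Document, so the usual DOM calls apply.
root_name = minidom_document.documentElement.tagName
paragraph = minidom_document.getElementsByTagName("p")[0]
text = paragraph.firstChild.data

print(root_name, text)
```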

More documentation is available at https://html5lib.readthedocs.io/.

Installation

html5lib works on CPython 2.7+, CPython 3.5+ and PyPy. To install:

.. code-block:: bash

$ pip install html5lib

The goal is to support a (non-strict) superset of the versions that `pip supports <https://pip.pypa.io/en/stable/installing/#python-and-os-compatibility>`_.

Optional Dependencies

The following third-party libraries may be used for additional functionality:

  • lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy where it is known to cause segfaults);

  • genshi has a treewalker (but not builder); and

  • chardet can be used as a fallback when character encoding cannot be determined.

Bugs

Please report any bugs on the `issue tracker <https://github.com/html5lib/html5lib-python/issues>`_.

Tests

Unit tests require the pytest and mock libraries and can be run using the pytest command in the root directory.

Test data are contained in a separate `html5lib-tests <https://github.com/html5lib/html5lib-tests>`_ repository and included as a submodule, thus for git checkouts they must be initialized::

  $ git submodule init
  $ git submodule update

If you have all compatible Python implementations available on your system, you can run tests on all of them using the tox utility, which can be found on PyPI.

Questions?

Check out `the docs <https://html5lib.readthedocs.io/en/latest/>`_. Still need help? Go to our `GitHub Discussions <https://github.com/html5lib/html5lib-python/discussions>`_.

You can also browse the archives of the `html5lib-discuss mailing list <https://www.mail-archive.com/[email protected]/>`_.

Issues

Fuzzing reveals a number of parse errors

opened on 2023-03-20 13:40:48 by leonardr

I'm the lead developer of Beautiful Soup, which has html5lib as an optional dependency. Over the past couple of years I've gotten a number of notifications from Google's oss-fuzz project about unhandled exceptions that actually turned out to be problems in html5lib. There wasn't much I could do with these errors, but now that it looks like html5lib maintenance is picking up, I can pass them on to you. (Sorry. :crying_cat_face:)

I've incorporated the fuzz reports into the Beautiful Soup test suite, and the test cases themselves are here, but here's a general picture of what problems I see. In each case, I believe just parsing the bad markup is enough to trigger the error.

clusterfuzz-testcase-minimized-bs4_fuzzer-4999465949331456

Markup: b')<a><math><TR><a><mI><a><p><a>'

Error:

```
self = , node = , refNode = None

def insertBefore(self, node, refNode):
    index = self.element.index(refNode.element)

E AttributeError: 'NoneType' object has no attribute 'element'
```

clusterfuzz-testcase-minimized-bs4_fuzzer-5843991618256896

Markup: b'-<math><sElect><mi><sElect><sElect>'

Error:

```
def resetInsertionMode(self):
    ...
    # Check for conditions that should only happen in the innerHTML
    # case
    if nodeName in ("select", "colgroup", "head", "html"):
        assert self.innerHTML

E AssertionError
```

clusterfuzz-testcase-minimized-bs4_fuzzer-6241471367348224

Markup: b'ñ<table><svg><html>'

Error:

```
self = .InTablePhase object at 0x7f8f405ad440>

def processEOF(self):
    if self.tree.openElements[-1].name != "html":
        self.parser.parseError("eof-in-table")
    else:
        assert self.parser.innerHTML

E AssertionError
```

clusterfuzz-testcase-minimized-bs4_fuzzer-6600557255327744

Markup: b'\t<TABLE><<!>;<!><<!>.<lec><th>i><a><mat\x00\x01<mi\x00a><math>><th><mI>chardeta\xff\xff\xff\xff<><th><mI><||||||||A<select><>qu?\xbemath><th><mie>qu'

Error:

```
self = .InTableBodyPhase object at 0x7f8f4184ce00>

def clearStackToTableBodyContext(self):
    while self.tree.openElements[-1].name not in ("tbody", "tfoot",
                                                  "thead", "html"):
        # self.parser.parseError("unexpected-implied-end-tag-in-table",
        #  {"name": self.tree.openElements[-1].name})
        self.tree.openElements.pop()
    if self.tree.openElements[-1].name == "html":
        assert self.parser.innerHTML

E AssertionError
```

Also reported to me recently was the issue that was reported to you as issue #557.
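
A minimal way to check these reports against an installed html5lib might look like the sketch below. It reuses two of the minimized markups quoted above; the CASES mapping and the try_parse helper are hypothetical names, and whether a given case still fails depends on the html5lib version in use.

```python
import html5lib

# Minimized markup copied from two of the fuzz reports above.
CASES = {
    "4999465949331456": b")<a><math><TR><a><mI><a><p><a>",
    "5843991618256896": b"-<math><sElect><mi><sElect><sElect>",
}


def try_parse(markup):
    """Return None if parsing succeeds, else the exception type name."""
    try:
        html5lib.parse(markup)
    except Exception as exc:  # broad on purpose: the reports span several types
        return type(exc).__name__
    return None


results = {name: try_parse(markup) for name, markup in CASES.items()}
for name, outcome in sorted(results.items()):
    print(name, outcome or "parsed without error")
```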

Constant phases

opened on 2023-03-03 19:25:14 by ambv

This is #272 but rebased on top of current master.

I did some cursory benchmarking of this and see no difference between master and this branch.

| Python  | Tests executed | Time with PR | Time on master |
| ------- | -------------- | ------------ | -------------- |
| 2.7.18  | 28238          | 47.3s        | 47.2s          |
| 3.9.16  | 19062          | 13.4s        | 13.6s          |
| 3.11.2  | 23563          | 12.4s        | 12.4s          |
| pypy3.8 | 23563          | 44.3s        | 44.4s          |

Each timing is the geometric mean of 30 runs. These were executed on Homebrew builds of all the Pythons above on an M1 Max MacBook Pro. The slow result from pypy3 (v7.3.9) is because it runs under Rosetta (x86-64 emulation). The important thing is to compare the time on the PR with the time on master.
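
For readers unfamiliar with the aggregation, a geometric mean over repeated timings can be computed as in this sketch. The run times below are invented placeholders, not the measurements from the table, and the variable names are illustrative.

```python
from statistics import geometric_mean  # available since Python 3.8

# Hypothetical per-run wall-clock times in seconds (NOT the data above).
runs_pr = [12.3, 12.5, 12.4]
runs_master = [12.4, 12.4, 12.5]

geo_pr = geometric_mean(runs_pr)
geo_master = geometric_mean(runs_master)

# A ratio close to 1.0 means no measurable difference between branches.
ratio = geo_pr / geo_master
print(round(geo_pr, 2), round(geo_master, 2), round(ratio, 3))
```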

Found a bug in version 1.1: "AssertionError: <EMPTY MESSAGE>" caused by html5parser parsing failure.

opened on 2023-02-20 06:45:58 by aT0ngMu

Hi, I found a bug in version 1.1: "AssertionError: <EMPTY MESSAGE>" caused by an html5parser parsing failure. The following is the crash stack information:

[1676863707] === Uncaught Python exception: ===
[1676863707] AssertionError: <EMPTY MESSAGE>
[1676863707] Traceback (most recent call last):
[1676863707]   File "/home/server1/adashwy/DriverCollections/exp_drivers/pyfuzzgen_drivers/bs4/beautifulsoup_driver/beautifulsoup_driver.py", line 40, in TestOneInput
[1676863707]     instance = BeautifulSoup(remaining_data, features=parsers[idx])
[1676863707]   File "/home/server1/.local/lib/python3.8/site-packages/bs4/__init__.py", line 333, in __init__
[1676863707]     self._feed()
[1676863707]   File "/home/server1/.local/lib/python3.8/site-packages/bs4/__init__.py", line 452, in _feed
[1676863707]     self.builder.feed(self.markup)
[1676863707]   File "/home/server1/.local/lib/python3.8/site-packages/bs4/builder/_html5lib.py", line 99, in feed
[1676863707]     doc = parser.parse(markup, **extra_kwargs)
[1676863707]   File "/usr/local/lib/python3.8/dist-packages/html5lib/html5parser.py", line 284, in parse
[1676863707]     self._parse(stream, False, None, *args, **kwargs)
[1676863707]   File "/usr/local/lib/python3.8/dist-packages/html5lib/html5parser.py", line 133, in _parse
[1676863707]     self.mainLoop()
[1676863707]   File "/usr/local/lib/python3.8/dist-packages/html5lib/html5parser.py", line 242, in mainLoop
[1676863707]     new_token = phase.processEndTag(new_token)
[1676863707]   File "/usr/local/lib/python3.8/dist-packages/html5lib/html5parser.py", line 2534, in processEndTag
[1676863707]     new_token = self.parser.phase.processEndTag(token)
[1676863707]   File "/usr/local/lib/python3.8/dist-packages/html5lib/html5parser.py", line 496, in processEndTag
[1676863707]     return func(token)
[1676863707]   File "/usr/local/lib/python3.8/dist-packages/html5lib/html5parser.py", line 2089, in endTagTable
[1676863707]     self.clearStackToTableBodyContext()
[1676863707]   File "/usr/local/lib/python3.8/dist-packages/html5lib/html5parser.py", line 2036, in clearStackToTableBodyContext
[1676863707]     assert self.parser.innerHTML
[1676863707] AssertionError: <EMPTY MESSAGE>

The following test case triggers the crash: testcase.zip

1.1: test suite uses no longer maintained `pytest-expect`

opened on 2021-12-31 08:52:01 by kloczek

The html5lib test suite uses pytest-expect, which appears to have been unmaintained since 2016. In addition, pytest-expect uses umsgpack in its own test suite, and umsgpack is already marked as deprecated: https://github.com/gsnedders/pytest-expect/issues/15. While html5lib is being refreshed for Python 3.10 and 3.11, it looks like it would be good to remove or replace those bits as well.

Add position information for text nodes

opened on 2021-04-16 15:33:37 by corynezin

Would it be possible to add position information, i.e. line+column to text nodes? Or, at least make this information available to the tree builder? I implemented a very minimal proof of concept to add the information to each token and pass that along to the dom tree builder and obtain the following result:

```
import html5lib

html = '<div>&amp;<p>b<span>c</span></p> cab</div>'

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))

doc = parser.parse(html)

def parse(n):
    for c in n.childNodes:
        if hasattr(c, 'sourcepos'):
            print(c.sourcepos, c)
        parse(c)

parse(doc)
```

None <DOM Element: head at 0x10bbed0d0>
None <DOM Element: body at 0x10bbed1f0>
(1, 5) <DOM Element: div at 0x10bbfb790>
(1, 10) <DOM Text node "'&'">
(1, 13) <DOM Element: p at 0x10bbfb820>
(1, 14) <DOM Text node "'b'">
(1, 20) <DOM Element: span at 0x10bbfb8b0>
(1, 21) <DOM Text node "'c'">
(1, 33) <DOM Text node "' '">
(1, 36) <DOM Text node "'cab'">

I would be willing to implement it.

consider making html5lib.tokenizer public

opened on 2021-04-12 05:36:33 by mgrandi

Hello,

In version https://github.com/html5lib/html5lib-python/releases/tag/0.999999999, html5lib.tokenizer was made private.

The wpull project (https://github.com/ArchiveTeam/wpull) uses this library. If we were ever to migrate to the 1.x versions, this would negatively impact the application: instead of just tokenizing a webpage (see https://github.com/ArchiveTeam/wpull/blob/a4ff4a93f613ce18ad3c515aa3d4f5848a88b98c/wpull/document/htmlparse/html5lib_.py), we would have to use full tree parsing, which is slower and uses more RAM.

Is there any reason this was made private when the 1.x branch was released?