You can use Amara 2.x and html5lib to process HTML, even horrible "tag soup" HTML, and to some extent non-well-formed XML. The following script (cleanup.py) demonstrates this. It reads tag soup from standard input and writes a cleaned-up result to standard output.

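#cleanup.py
import sys
from amara.bindery import html
from amara import xml_print

# Parse the (possibly malformed) HTML from standard input into a document tree
doc = html.parse(sys.stdin)

# Serialize the repaired tree to standard output as well-formed XML
xml_print(doc)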

For example, take this input:

<html><body onload="" color="white"><p>Spam!<p>Eggs!</body></html>

Run through cleanup.py:

echo '<html><body onload="" color="white"><p>Spam!<p>Eggs!</body></html>' | python cleanup.py
<?xml version="1.0" encoding="UTF-8"?>
<html><head/><body color="white" onload=""><p>Spam!</p><p>Eggs!
</p></body></html>
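
The cleaned output above closes the first <p> as soon as the second one opens, and closes everything still open when </body> arrives. The following is only a toy sketch of that one rule, written against the Python 3 standard library (html.parser) rather than Amara or html5lib. The names SoupCloser and close_tags are made up for this illustration, and the real parsers do much more (for example, adding the missing <head/> and handling void elements such as <br>).

```python
from html import escape
from html.parser import HTMLParser


class SoupCloser(HTMLParser):
    """Toy repair pass (hypothetical helper, not part of Amara or html5lib).

    Closes an open <p> when a new <p> starts, and closes any elements
    still open when an ancestor's end tag or the end of input is reached.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # names of the currently open elements
        self.out = []    # output fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        # A new paragraph implicitly ends the previous one
        if tag == "p" and "p" in self.stack:
            self._close_through("p")
        attr_text = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
        )
        self.out.append("<%s%s>" % (tag, attr_text))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # End tags with no matching open element are silently dropped
        if tag in self.stack:
            self._close_through(tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def _close_through(self, tag):
        # Pop and close elements until the named one has been closed
        while self.stack:
            name = self.stack.pop()
            self.out.append("</%s>" % name)
            if name == tag:
                break

    def result(self):
        self.close()  # flush any buffered text
        while self.stack:
            self.out.append("</%s>" % self.stack.pop())
        return "".join(self.out)


def close_tags(soup):
    parser = SoupCloser()
    parser.feed(soup)
    return parser.result()


print(close_tags('<html><body onload="" color="white"><p>Spam!<p>Eggs!</body></html>'))
# prints: <html><body onload="" color="white"><p>Spam!</p><p>Eggs!</p></body></html>
```

Unlike cleanup.py, this sketch keeps attributes in their original order and does not invent a <head/> element. It is meant only to show why the two paragraphs come out properly closed.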

It will even find HTML buried in junk:

XXX<html><body onload="" color="white"><p>Spam!<p>Eggs!</body></html>YYY

Run through cleanup.py:

echo 'XXX<html><body onload="" color="white"><p>Spam!<p>Eggs!</body></html>YYY' | python cleanup.py
<?xml version="1.0" encoding="UTF-8"?>
<html><head/><body color="white" onload="">XXX<p>Spam!</p><p>Eggs!YYY
</p></body></html>

As you can see, it pulls the surrounding text into the HTML body. The same happens with stray elements around the document:

<foo>XXXX</foo><html><body onload="" color="white"><p>Hi  !</p></body></html>YYYY<foo/>

Run through cleanup.py:

echo '<foo>XXXX</foo><html><body onload="" color="white"><p>Hi  !</p></body></html>YYYY<foo/>' | python cleanup.py
<?xml version="1.0" encoding="UTF-8"?>
<html><head/><body color="white" onload=""><foo>XXXX</foo><p>Hi  !</p>YYYY<foo>
</foo></body></html>

Amara/Tagsoup (last edited 2010-12-03 17:56:19 by LuisMiguel)