You can use Amara 2.x and html5lib to process HTML, even horrible "tag soup" HTML, and to some extent non-well-formed XML. The following script (cleanup.py) demonstrates this. It reads tag soup from standard input and writes a cleaned-up XML result to standard output.
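#cleanup.py
import sys
from amara.bindery import html
from amara import xml_print
# Parse the tag soup arriving on standard input
doc = html.parse(sys.stdin)
# Write the repaired tree to standard output as XML
xml_print(doc)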
For example, given this input:
<html><body onload="" color="white"><p>Spam!<p>Eggs!</body></html>
Run through cleanup.py:
echo '<html><body onload="" color="white"><p>Spam!<p>Eggs!</body></html>' | python cleanup.py
<?xml version="1.0" encoding="UTF-8"?>
<html><head/><body color="white" onload=""><p>Spam!</p><p>Eggs!
</p></body></html>
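Since the cleaned result is XML, ordinary XML tools can consume it. The following is a minimal sketch of that idea, not part of Amara: it pastes the output shown above into a string (the name CLEANED is just for illustration) and reads it back with the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Output of cleanup.py for the Spam!/Eggs! input, copied from above.
CLEANED = b"""<?xml version="1.0" encoding="UTF-8"?>
<html><head/><body color="white" onload=""><p>Spam!</p><p>Eggs!
</p></body></html>"""

# fromstring raises ET.ParseError if the input is not well-formed XML.
root = ET.fromstring(CLEANED)
body = root.find("body")

# Both paragraphs are now properly closed siblings inside <body>.
paragraphs = [p.text.strip() for p in body.findall("p")]
print(paragraphs)

# The attributes of <body> survive the cleanup.
print(sorted(body.attrib.items()))
```

The original input would make the same parser raise an error, because neither p element is ever closed.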
It will even find HTML buried in junk:
XXX<html><body onload="" color="white"><p>Spam!<p>Eggs!</body></html>YYY
Run through cleanup.py:
echo 'XXX<html><body onload="" color="white"><p>Spam!<p>Eggs!</body></html>YYY' | python cleanup.py
<?xml version="1.0" encoding="UTF-8"?>
<html><head/><body color="white" onload="">XXX<p>Spam!</p><p>Eggs!YYY
</p></body></html>
As you can see, it pulls the surrounding material into the HTML body. The same goes for stray elements around the HTML:
<foo>XXXX</foo><html><body onload="" color="white"><p>Hi !</p></body></html>YYYY<foo/>
Run through cleanup.py:
echo '<foo>XXXX</foo><html><body onload="" color="white"><p>Hi !</p></body></html>YYYY<foo/>' | python cleanup.py
<?xml version="1.0" encoding="UTF-8"?>
<html><head/><body color="white" onload=""><foo>XXXX</foo><p>Hi !</p>YYYY<foo>
</foo></body></html>
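To see exactly where the stray material ended up, you can walk the children of body in that last result. This is a small sketch using only the standard library; describe_children is a hypothetical helper written for this illustration, and the output string is copied from the run above.

```python
import xml.etree.ElementTree as ET

# Output of cleanup.py for the <foo>...</foo> input, copied from above.
CLEANED = b"""<?xml version="1.0" encoding="UTF-8"?>
<html><head/><body color="white" onload=""><foo>XXXX</foo><p>Hi !</p>YYYY<foo>
</foo></body></html>"""


def describe_children(parent):
    """Return (tag, text, tail) for each direct child, whitespace-stripped."""
    return [
        (child.tag, (child.text or "").strip(), (child.tail or "").strip())
        for child in parent
    ]


body = ET.fromstring(CLEANED).find("body")
rows = describe_children(body)
for row in rows:
    print(row)
```

The leading foo element keeps its XXXX text and becomes the first child of body, the YYYY text ends up as the tail of the p element, and the trailing foo element lands at the end of body.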
