
html2sax is a parser for HTML documents. It reads HTML documents and reports their content through callbacks of the Java™ SAX API.
There are many partly malformed HTML documents on the
web. Many page authors don't care about correctness as long as their
browser manages to display their pages.
Web browsers have specially adjusted parsers that can repair
malformed HTML pages to a certain degree. Unfortunately,
there is no standard for correcting mistakes, so every browser
behaves differently on malformed pages.
To bring a tool to the web, you need to parse web pages.
There is a lot of fancy XML technology out there, but
HTML is actually not XML. XHTML is XML, but it is not very
widespread today.
The original intention was to have a really simple HTML parser
that just splits an HTML document into its lexical parts
(tags, attributes, and text).
The parser should handle errors gracefully and continue
after them. It should not try to repair documents, because
the goal was to extract certain parts using
XPath queries.
There is also a Java version of HTML Tidy.
It does its best to repair malformed HTML documents.
It works quite well, but in my opinion it does too much for some
applications.
html2sax is designed to be the front end of a web spider reading websites.
It can handle (almost?) all error situations, but will not try to
correct problematic HTML pages. It operates on a very low level and is quite fast.
Tests showed that it is twice as fast as HTML Tidy.
html2sax works as a SAX parser. SAX is normally designed
to handle only real XML, so don't expect many SAX-based tools to work with
this 'weak' parser.
The parser was written using JavaCC
to support powerful lexical error recovery features.
The parser supports the following features:
There are several restrictions for html2sax that you should be aware of:
The only requirement for the parser is a Java™ 1.5 JRE. Earlier Java versions won't work because the code uses generics.
Usage is quite simple. The following example runs the parser:
    SAXParserFactory factory = SAXParserFactory.newInstance(
        "de.tynne.htmltosax.HtmlToSaxParserFactory", null);
    SAXParser parser = factory.newSAXParser();
    YourCallback s = new YourCallback();
    parser.parse(new InputSource(
        new URL(args[0]).openStream()), s);
A working example is in the file Sample.java in the source distribution.
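The usage example refers to a YourCallback class without showing it. As a minimal sketch, such a callback could be a subclass of the standard org.xml.sax.helpers.DefaultHandler that records element names and text. The class name TagCollector and its collect helper are hypothetical and not part of html2sax. To stay self-contained, the sketch runs against the JDK's default SAX parser with well-formed input as a stand-in; with html2sax you would presumably pass the same handler to the parser created in the usage example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical callback: records element names and text as SAX events arrive. */
class TagCollector extends DefaultHandler {
    private final List<String> elements = new ArrayList<String>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        // qName is the tag name as it was written in the document.
        elements.add(qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver text in several chunks, so append instead of assigning.
        text.append(ch, start, length);
    }

    public List<String> getElements() {
        return elements;
    }

    public String getText() {
        return text.toString();
    }

    /**
     * Assumption: uses the JDK's default SAX parser as a stand-in for html2sax,
     * so the input has to be well-formed here.
     */
    public static TagCollector collect(String markup) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TagCollector handler = new TagCollector();
        parser.parse(new InputSource(new StringReader(markup)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        TagCollector c = collect("<html><body><p>Hello</p></body></html>");
        // prints "[html, body, p] Hello"
        System.out.println(c.getElements() + " " + c.getText());
    }
}
```

Because the handler only depends on the SAX callback interface, it does not need to know which parser produced the events.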
html2sax can be downloaded here:
The GPG key to verify your downloads is located here.
html2sax was written by Stephan Fuhrmann. You can reach me at s_fuhrm (at) web.de.
html2sax is licensed under the LGPL 2.1 and only under this license version. Please see www.gnu.org for more details on the license.