html2sax

html2sax is a parser for HTML documents. It reads HTML documents and reports their content as callbacks through the Java™ SAX API.

Background

There are many partly malformed HTML documents on the web. Many page authors don't care about correctness as long as their browser manages to display their page. Web browsers have specially adjusted parsers that can repair malformed HTML pages to a certain degree. Unfortunately, there is no standard for correcting mistakes, so every browser behaves differently on broken pages.
To bring a tool to the web, you need to parse web pages. There is a lot of fancy XML technology out there, but HTML is not actually XML. XHTML is XML, but it is not very widespread today.

Purpose

The original intention was to have a really simple HTML parser that just splits an HTML document into its lexical parts (tags, attributes, and text). The parser should handle errors gracefully and continue after them. It should not try to repair documents, because the goal was to extract certain parts using XPath queries.
There is also a Java version of HTML Tidy. It does its best at repairing malformed HTML documents. It works quite well, but in my opinion it does too much for some applications.

The library

html2sax is designed to be the frontend of a web spider reading websites. It can handle (almost?) all error situations, but will not try to correct problematic HTML pages. It operates on a very low level and is quite fast. Tests showed that it is twice as fast as HTML Tidy.
html2sax works as a SAX parser. SAX is normally designed to handle only real XML, so don't expect many SAX-based tools to work with this 'weak' parser.
The parser was written using JavaCC to support powerful lexical error recovery features.

Features

The parser supports the following features:

Restrictions

There are several restrictions for html2sax that you should be aware of:

Requirements

The only requirement for the parser is a Java™ 1.5 JRE. Earlier Java versions won't work because the code uses generics.

Usage

Usage is quite simple. The following example runs the parser:

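	import java.net.URL;
	import javax.xml.parsers.SAXParser;
	import javax.xml.parsers.SAXParserFactory;
	import org.xml.sax.InputSource;

	// Request the html2sax factory by class name instead of the JAXP default.
	SAXParserFactory factory =
		SAXParserFactory.newInstance(
			"de.tynne.htmltosax.HtmlToSaxParserFactory",
			null);
	SAXParser parser = factory.newSAXParser();
	// YourCallback is your own handler, for example a DefaultHandler subclass.
	YourCallback s = new YourCallback();
	// Parse the document at the URL given as the first command-line argument.
	parser.parse(new InputSource(
		new URL(args[0]).openStream()), s);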

A working example is in the file Sample.java in the source distribution.
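
The usage example leaves YourCallback undefined. As a rough sketch of what such a callback might look like, the following handler collects the href value of every a element, which is the kind of thing a web spider needs. It only relies on the standard SAX classes from the JDK. It is not taken from Sample.java, and it assumes that html2sax reports element and attribute names through the qName argument (with localName as a fallback) and in whatever case the page uses; check this against the actual parser output.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical callback for illustration: collects link targets from <a> elements.
class YourCallback extends DefaultHandler {
	private final List<String> links = new ArrayList<String>();

	@Override
	public void startElement(String uri, String localName,
			String qName, Attributes attributes) {
		// Assumption: the tag name arrives in qName; fall back to localName.
		String name = (qName != null && qName.length() > 0) ? qName : localName;
		if (!"a".equalsIgnoreCase(name)) {
			return;
		}
		for (int i = 0; i < attributes.getLength(); i++) {
			String attrName = attributes.getQName(i);
			if (attrName == null || attrName.length() == 0) {
				attrName = attributes.getLocalName(i);
			}
			// HTML names are case-insensitive, so compare without regard to case.
			if ("href".equalsIgnoreCase(attrName)) {
				links.add(attributes.getValue(i));
			}
		}
	}

	// Returns the link targets seen so far, in document order.
	public List<String> getLinks() {
		return links;
	}
}
```

After parser.parse(...) returns, s.getLinks() would hold the collected targets. Because the parser does not repair documents, the handler should not expect a matching endElement call for every startElement.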

Downloads

html2sax can be downloaded here:

The GPG key to verify your downloads is located here.

Author

html2sax was written by Stephan Fuhrmann. You can reach me at s_fuhrm (at) web.de.

License

html2sax is licensed under the LGPL 2.1 and only under this license version. Please see www.gnu.org for more details on the license.
