Fixing Java XML parser problems with W3C

Introduction

Java's default Xerces implementation loads the DTD of an XML document from the location specified in the document (observed with java version "1.6.0_18"). This behavior has been causing problems for some time and affects many Java applications. A while ago I noticed that two Java applications had stopped working and aborted with this exception:

Exception in thread "main" java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313)
	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:677)
	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1315)
	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1282)
	at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:283)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1194)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1090)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:1003)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
	at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:235)
	at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
	at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:208)
	...

Problem analysis

The DTD is loaded based on the doctype declaration in the header of the XML document. In this example, it's an XHTML document:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
...
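
To see which external identifiers the parser actually requests for such a document, a small probe can help. The following is only a sketch: the class name DoctypeProbe and its method are hypothetical, and it assumes the default JAXP parser, which asks its EntityResolver for the external DTD even when validation is switched off. The resolver answers with an empty entity, so no network connection is opened.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical helper: records which external entities the parser asks for. */
class DoctypeProbe {

	/** Parses the XML text and returns the system IDs the parser tried to load. */
	static List<String> requestedSystemIds(String xml) {
		final List<String> seen = new ArrayList<String>();
		try {
			DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
			builder.setEntityResolver(new EntityResolver() {
				@Override
				public InputSource resolveEntity(String publicId, String systemId) {
					seen.add(systemId);
					// An empty entity keeps the parser from opening a URL connection.
					return new InputSource(new StringReader(""));
				}
			});
			builder.parse(new InputSource(new StringReader(xml)));
		} catch (Exception e) {
			throw new IllegalStateException("Parsing failed", e);
		}
		return seen;
	}

	public static void main(String[] args) {
		String xml = "<?xml version=\"1.0\"?>\n"
			+ "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
			+ "       \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
			+ "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\">"
			+ "<head><title>t</title></head><body/></html>";
		// Prints the list of system IDs the parser asked the resolver for.
		System.out.println(requestedSystemIds(xml));
	}
}
```

Note that an empty DTD is only good enough for this kind of inspection. It does not help when the DTD is needed for validation or for entity definitions.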

For a while, W3C tolerated applications all over the world loading the DTDs, but at some point its servers could no longer cope with the load these requests produced. On the one hand, loading the DTD is not really necessary: the doctype is meant as a type identifier rather than a dynamically loaded document type definition. A look at the tcpdump of the failed URL connection showed that some HTTP headers pointed to this location: http://www.w3.org/brief/MTE2. That page tells the story of problematic clients and the heavy load on the W3C servers. W3C blocks Java clients from accessing certain DTD URLs. The problem: I need the DTDs to validate my XML documents.

Solution

I came to a nice solution fairly quickly: why not cache the files on the local filesystem? Here is the code to enable file caching in your application:

import java.io.*;
import java.net.*;
import java.util.*;

/**
 * Caches URI access in a local hard disk directory.
 * @author Stephan Fuhrmann
 */
public class MyResponseCache extends ResponseCache {

	/** The cache base directory. */
	private File base;

	/**
	 * Creates a new response cache.
	 * @param inBase the base directory where the cache is located.
	 */
	public MyResponseCache(File inBase) {
		if (!inBase.exists()) {
			inBase.mkdirs();
		}

		this.base = inBase;
	}

	/** Converts a URI to a file reference.
	 * @see URLEncoder#encode(java.lang.String, java.lang.String)
	 */
	private final File uriToFile(URI uri) throws UnsupportedEncodingException {
		String name = URLEncoder.encode(uri.toString(), "UTF-8");
		return new File(base, name);
	}

	@Override
	public CacheResponse get(URI uri, String rqstMethod, Map<String, List<String>> rqstHeaders) throws IOException {
		if (rqstMethod.equals("GET")) {
			final File f = uriToFile(uri);
			if (f.exists()) {
				return new CacheResponse() {

					@Override
					public Map<String, List<String>> getHeaders() throws IOException {
						return new HashMap<String, List<String>>();
					}

					@Override
					public InputStream getBody() throws IOException {
						return new FileInputStream(f);
					}
				};
			} else {
				return null;
			}
		} else {
			return null;
		}
	}

	@Override
	public CacheRequest put(URI uri, URLConnection conn) throws IOException {

		final File f = uriToFile(uri);
		if (!f.exists()) {
			return new CacheRequest() {
				@Override
				public OutputStream getBody() throws IOException {
					return new FileOutputStream(f);
				}

				@Override
				public void abort() {
					// well I don't care about the result of this ;)
					f.delete();
				}
			};
		} else {
			return null;
		}
	}
}

You may need to adjust some settings for your application. For example, you could restrict the get method to cache only W3C URIs, or serve the DTDs contained in your JAR. The code above is used like this in your main class:

public class MainClass {

	...
	
	public static void main(String[] args) {
		ResponseCache rc = new MyResponseCache(new File("cache"));
		ResponseCache.setDefault(rc);
		
		...
	}
}

What does this do? It creates a cache directory in the current working directory of the XML application and stores files with URL-encoded names like this:

http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml1%2FDTD%2Fxhtml1-strict.dtd 
http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml1%2FDTD%2Fxhtml-special.ent
http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml1%2FDTD%2Fxhtml-lat1.ent
http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml1%2FDTD%2Fxhtml-symbol.ent
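
The mapping between a URL and its cache file name can be tried out in isolation. This is a minimal sketch, and the class name CacheFileNames is hypothetical. It uses the same URLEncoder call as uriToFile and adds the reverse direction with URLDecoder, which is handy for finding out which URL a cached file belongs to.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/** Hypothetical helper: converts between URLs and cache file names. */
class CacheFileNames {

	/** Encodes a URL the same way uriToFile does. */
	static String toFileName(String url) {
		try {
			return URLEncoder.encode(url, "UTF-8");
		} catch (UnsupportedEncodingException e) {
			// UTF-8 is supported by every Java runtime.
			throw new IllegalStateException(e);
		}
	}

	/** Recovers the original URL from a cache file name. */
	static String toUrl(String fileName) {
		try {
			return URLDecoder.decode(fileName, "UTF-8");
		} catch (UnsupportedEncodingException e) {
			throw new IllegalStateException(e);
		}
	}

	public static void main(String[] args) {
		String name = toFileName("http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
		System.out.println(name);
		System.out.println(toUrl(name));
	}
}
```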

As you can see, http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml1%2FDTD%2Fxhtml1-strict.dtd is the encoded form of the URL http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. My applications work again after this little fix. I hope that this will be fixed in Xerces / the JDK at some point.
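
If you want to restrict the cache to W3C URIs as mentioned above, a small filter is one way to do it. This is only a sketch under my own assumptions: the class name W3cUriFilter is hypothetical, and the rule (host www.w3.org, file name ending in .dtd or .ent) is simply derived from the four cached files listed above. Adjust it to the documents your application loads.

```java
import java.net.URI;

/** Hypothetical filter: decides which URIs the response cache should handle. */
class W3cUriFilter {

	/** Returns true for DTD and entity files hosted on www.w3.org. */
	static boolean isCacheable(URI uri) {
		String host = uri.getHost();
		String path = uri.getPath();
		if (host == null || path == null) {
			return false;
		}
		boolean w3c = host.equalsIgnoreCase("www.w3.org");
		boolean dtdOrEntity = path.endsWith(".dtd") || path.endsWith(".ent");
		return w3c && dtdOrEntity;
	}

	public static void main(String[] args) {
		System.out.println(isCacheable(URI.create("http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd")));
		System.out.println(isCacheable(URI.create("http://example.org/some.dtd")));
	}
}
```

To use it, let both get and put in MyResponseCache return null right away when isCacheable(uri) is false. All other requests then bypass the cache and go to the network as before.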
