Ever need to parse some HTML in Java or Groovy? No matter what the source, you're almost always guaranteed to get bad, malformed garbage as a response when scraping. Rather than ditch XML readers and bust out regex, you can transform this data into well-formed XHTML with tools like TagSoup.
The following class is a utility class meant to abstract away the nonsense of cleaning HTML. I've found that HTML data can come from a number of different sources (even files), so being able to clean strings containing HTML is immensely useful.
HtmlParsingUtils.cleanHtml(String html) cleans the HTML passed as an argument and returns parseable XHTML. It deliberately does not resolve or validate entities. Resolving them means fetching DTDs from the w3.org servers, and hitting those servers too frequently earns you 503 errors, which is exactly what happens when you use TagSoup for massive scraping.
XOM is my XML handling library of choice, although the code can be adapted if your requirements rule out XOM.
Example usage:
String htmlData = obtainHtmlFromElsewhere();
String xml = HtmlParsingUtils.cleanHtml( htmlData );
Builder builder = HtmlParsingUtils.createNonValidatingXmlBuilder();
Document doc = builder.build(xml,HtmlParsingUtils.xhtmlBaseUri);
//Query: Find all anchor tags
Nodes nodes = doc.query("//html:a",HtmlParsingUtils.htmlCtx);
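Why the "html:" prefix in the query? The cleaned document puts every element in the XHTML namespace, so an unprefixed XPath name like "//a" matches nothing. Here is a minimal sketch of that behaviour using only the JDK's DOM and XPath classes instead of XOM; the class name NamespacedQueryDemo and the helper countMatches are made up for illustration and are not part of HtmlParsingUtils.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: JDK-only stand-in for the XOM query shown above.
public class NamespacedQueryDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Counts the nodes matching an XPath expression, with "html" bound to the XHTML namespace. */
    static int countMatches(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // without this the parser ignores namespaces entirely
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Plays the same role as XOM's XPathContext("html", xhtmlBaseUri).
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "html" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                        ? Collections.singletonList("html").iterator()
                        : Collections.<String>emptyIterator();
            }
        });

        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<a href=\"/one\">one</a><a href=\"/two\">two</a></body></html>";
        System.out.println(countMatches(xml, "//a"));      // prints 0: unprefixed name, no namespace
        System.out.println(countMatches(xml, "//html:a")); // prints 2: prefix bound to XHTML
    }
}
```

The same rule applies in XOM, which is why htmlCtx has to be passed to doc.query along with the expression.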
Presented below is the Groovy adaptation of HtmlParsingUtils. The Java version is identical, except that the dummy entity resolver is declared inline as a private class rather than as a closure.
import nu.xom.Builder
import nu.xom.XPathContext
import org.jdom.input.SAXBuilder
import org.xml.sax.EntityResolver
import org.xml.sax.InputSource
import org.xml.sax.XMLReader
import org.xml.sax.helpers.XMLReaderFactory
/**
 * Utility class for handling html
 */
class HtmlParsingUtils
{
    // Answers every entity lookup with an empty stream, so nothing is fetched from w3.org
    public static final EntityResolver dummyEntityResolver = { String publicId, String systemId ->
        return new InputSource(new StringReader(""))
    } as EntityResolver;

    public static final String xhtmlBaseUri = "http://www.w3.org/1999/xhtml"

    //Required for querying
    public static final XPathContext htmlCtx = new XPathContext("html", xhtmlBaseUri)

    /**
     * @param str possibly non-well-formed html
     * @return xhtml representation of the argument data
     */
    public static String cleanHtml(String str)
    {
        // JDOM builder backed by the TagSoup SAX parser
        SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
        builder.setEntityResolver(dummyEntityResolver);
        Reader input = new StringReader(str);
        org.jdom.Document doc = builder.build(input);
        String cleanXmlDoc = new org.jdom.output.XMLOutputter().outputString(doc);
        return cleanXmlDoc;
    }

    /**
     * @return XOM Builder that does not validate or resolve entities.
     */
    public static Builder createNonValidatingXmlBuilder()
    {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        reader.setEntityResolver(dummyEntityResolver);
        Builder builder = new Builder(reader);
        return builder;
    }
}
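For the Java version, the dummy entity resolver could look roughly like the sketch below. To keep it runnable without TagSoup, JDOM or XOM on the classpath, it plugs the resolver into the JDK's own DOM parser; the class name DummyResolverDemo, the helper rootNameOf and the resolverCalls counter are assumptions added for the demonstration, not part of the original utility.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical demo class showing the resolver as a private nested class.
public class DummyResolverDemo {
    // Counter added only so the demo can show the resolver was consulted.
    static int resolverCalls = 0;

    /** Java counterpart of the Groovy closure: every entity lookup gets an empty stream. */
    private static final class DummyEntityResolver implements EntityResolver {
        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            resolverCalls++;
            // Returning an empty source stops the parser from fetching the DTD itself.
            return new InputSource(new StringReader(""));
        }
    }

    /** Parses the XML with the dummy resolver installed and returns the root element name. */
    static String rootNameOf(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(new DummyEntityResolver());
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getNodeName();
    }

    public static void main(String[] args) throws Exception {
        // The DOCTYPE points at w3.org, but the resolver answers locally instead.
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html><body><p>hello</p></body></html>";
        System.out.println(rootNameOf(xml));
        System.out.println("resolver consulted: " + (resolverCalls > 0));
    }
}
```

One caveat with this approach: because the DTD is replaced by an empty stream, named entities such as &nbsp; are no longer declared, so the document handed to the builder should already have them expanded.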