<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Prog-a-Month &#187; code-snippet</title>
	<atom:link href="http://www.progamonth.com/?feed=rss2&#038;cat=18" rel="self" type="application/rss+xml" />
	<link>http://www.progamonth.com</link>
	<description>The journeys of a programmer</description>
	<lastBuildDate>Sat, 22 May 2010 23:01:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Easy HTML web scraping with Groovy and Java. (w/XOM)</title>
		<link>http://www.progamonth.com/?p=312</link>
		<comments>http://www.progamonth.com/?p=312#comments</comments>
		<pubDate>Wed, 05 May 2010 04:28:19 +0000</pubDate>
		<dc:creator>Stefan Kendall</dc:creator>
				<category><![CDATA[code-snippet]]></category>
		<category><![CDATA[prog]]></category>
		<category><![CDATA[groovy]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[xom]]></category>

		<guid isPermaLink="false">http://www.progamonth.com/?p=312</guid>
		<description><![CDATA[Ever need to parse some HTML in Java or Groovy? No matter what the source, you&#8217;re almost always guaranteed to get bad, malformed garbage as a response when scraping. Rather than ditch XML readers and bust out regex, you can transform this data into good XHTML with tools like TagSoup. The following class is a [...]]]></description>
			<content:encoded><![CDATA[<p>Ever need to parse some HTML in Java or Groovy? No matter what the source, you&#8217;re almost always guaranteed to get bad, malformed garbage as a response when scraping. Rather than ditch XML readers and bust out regex, you can transform this data into good XHTML with tools like <a href="http://home.ccil.org/~cowan/tagsoup/">TagSoup</a>.</p>
<p>The following class is a utility class meant to abstract away the nonsense of cleaning HTML. I&#8217;ve found that HTML data can come from a number of different sources (even files), so being able to clean strings containing HTML is immensely useful.</p>
<p><strong>HtmlParsingUtils.cleanHtml(String html)</strong> cleans the HTML passed as an argument and returns parseable XHTML. It does not resolve external entities, which avoids the 503 errors that <a href="http://www.w3.org">w3.org</a> returns if you hit the W3C servers too frequently, which is exactly what happens when doing massive scraping with TagSoup.</p>
<p><a href="http://www.xom.nu">XOM</a> is my XML handling library of choice, although the code can be adapted if your requirements rule out XOM.</p>
<p><em>Example usage:</em></p>
<pre name="code" class="java">
String htmlData = obtainHtmlFromElsewhere();
String xml = HtmlParsingUtils.cleanHtml( htmlData );

Builder builder = HtmlParsingUtils.createNonValidatingXmlBuilder();
Document doc = builder.build(xml,HtmlParsingUtils.xhtmlBaseUri);
//Query: Find all anchor tags
Nodes nodes = doc.query("//html:a",HtmlParsingUtils.htmlCtx);
</pre>
<p>Presented below is the Groovy adaptation of <em>HtmlParsingUtils</em>. The Java version is identical, except that the dummy entity resolver is declared inline as a private class rather than as a closure.</p>
<pre name="code" class="java">
import nu.xom.Builder
import nu.xom.XPathContext
import org.jdom.input.SAXBuilder
import org.xml.sax.EntityResolver
import org.xml.sax.InputSource
import org.xml.sax.XMLReader
import org.xml.sax.helpers.XMLReaderFactory

/**
 * Utility class for handling html
 */
class HtmlParsingUtils
{
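	// Answers every external entity or DTD request with an empty stream, so nothing is fetched remotely
	public static final EntityResolver dummyEntityResolver = { String publicId, String systemId ->
		return new InputSource(new StringReader(""))
	} as EntityResolver;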

	public static final xhtmlBaseUri = "http://www.w3.org/1999/xhtml"

	//Required for querying
	public static final XPathContext htmlCtx = new XPathContext("html", xhtmlBaseUri)

	/**
	 * @param str possibly non-well-formed html
	 * @return xhtml representation of the argument data
	 */
	public static String cleanHtml(String str)
	{
		// Use TagSoup as the SAX driver so malformed HTML is repaired while parsing
		SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
		builder.setEntityResolver(dummyEntityResolver);
		Reader input = new StringReader(str);
		org.jdom.Document doc = builder.build(input);
		// Serialize the repaired tree back out as an XML string
		String cleanXmlDoc = new org.jdom.output.XMLOutputter().outputString(doc);

		return cleanXmlDoc;
	}

	/**
	 * @return XOM Builder that does not validate or resolve entities.
	 */
	public static Builder createNonValidatingXmlBuilder()
	{
		XMLReader reader = XMLReaderFactory.createXMLReader();
		reader.setEntityResolver(dummyEntityResolver);
		Builder builder = new Builder(reader);

		return builder;
	}
}
</pre>
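<p>For the Java variant described above, a minimal sketch of the dummy entity resolver as a private nested class might look like the following. The names <em>DummyResolverSketch</em>, <em>createNonResolvingReader</em> and <em>parses</em> are hypothetical, and the JDK's built-in SAX parser stands in for XOM so the snippet runs without extra jars. Treat it as an illustration under those assumptions, not as the original Java source.</p>

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class DummyResolverSketch {

    // Java counterpart of the Groovy closure: every external entity or DTD
    // request is answered with an empty stream, so nothing is fetched.
    private static class DummyEntityResolver implements EntityResolver {
        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            return new InputSource(new StringReader(""));
        }
    }

    // Assumption: the JDK's SAX parser is used here in place of
    // XMLReaderFactory.createXMLReader() from the Groovy version.
    public static XMLReader createNonResolvingReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setEntityResolver(new DummyEntityResolver());
        return reader;
    }

    // Hypothetical helper: returns true if the XML parses as well-formed
    // without the parser ever following the DOCTYPE's system identifier.
    public static boolean parses(String xml) {
        try {
            XMLReader reader = createNonResolvingReader();
            reader.setContentHandler(new DefaultHandler());
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```

<p>The object returned by <em>createNonResolvingReader()</em> plays the same role as the <em>XMLReader</em> that the Groovy class passes to <em>new Builder(reader)</em>.</p>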
]]></content:encoded>
			<wfw:commentRss>http://www.progamonth.com/?feed=rss2&amp;p=312</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Copying a file in Java</title>
		<link>http://www.progamonth.com/?p=232</link>
		<comments>http://www.progamonth.com/?p=232#comments</comments>
		<pubDate>Wed, 17 Feb 2010 16:38:15 +0000</pubDate>
		<dc:creator>Stefan Kendall</dc:creator>
				<category><![CDATA[code-snippet]]></category>

		<guid isPermaLink="false">http://www.progamonth.com/?p=232</guid>
		<description><![CDATA[Code snippet. Why isn&#8217;t this part of the Java API? Just use commons. &#60;dependency&#62; &#60;groupId&#62;commons-io&#60;/groupId&#62; &#60;artifactId&#62;commons-io&#60;/artifactId&#62; &#60;version&#62;1.4&#60;/version&#62; &#60;/dependency&#62; FileUtils gives us this: public static void copyFile(File srcFile, File destFile) throws IOException]]></description>
			<content:encoded><![CDATA[<p><del>Code snippet. Why isn&#8217;t this part of the Java API?</del></p>
<p>Just use commons.</p>
<pre class = "xml" name = "code">
&lt;dependency&gt;
   &lt;groupId&gt;commons-io&lt;/groupId&gt;
   &lt;artifactId&gt;commons-io&lt;/artifactId&gt;
   &lt;version&gt;1.4&lt;/version&gt;
&lt;/dependency&gt;
</pre>
<p><a href="http://commons.apache.org/io/api-1.4/org/apache/commons/io/FileUtils.html#copyFile(java.io.File, java.io.File)">FileUtils</a> gives us this:</p>
<pre class="java" name="code">public static void copyFile(File srcFile, File destFile) throws IOException</pre>
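<p>If adding a dependency is not an option, the JDK itself has offered a copy routine since Java 7. The sketch below is a stand-in under that assumption, not the commons-io implementation: <em>CopyFileSketch</em> is a hypothetical class that mirrors the <em>copyFile</em> signature above on top of <em>java.nio.file.Files.copy</em>.</p>

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class CopyFileSketch {

    // Hypothetical helper with the same signature as FileUtils.copyFile,
    // backed by the JDK's java.nio.file API (Java 7 and later).
    public static void copyFile(File srcFile, File destFile) throws IOException {
        // REPLACE_EXISTING overwrites destFile if it is already present
        Files.copy(srcFile.toPath(), destFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
    }
}
```

<p>Calling <em>CopyFileSketch.copyFile(src, dest)</em> then reads the same as the commons-io call, which keeps a later switch between the two cheap.</p>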
]]></content:encoded>
			<wfw:commentRss>http://www.progamonth.com/?feed=rss2&amp;p=232</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
