Using the Lobo Cobra Toolkit to Retrieve the HTML of Rendered Pages
As noted in Using JRex to Retrieve the HTML of Rendered Pages, there is a difference between the raw HTML of a web page and the HTML after JavaScript has run. A typical browser loads the raw HTML, parses it, and builds a Document Object Model (DOM). If the page includes JavaScript that runs when the page loads, that code can modify the DOM. For data mining operations, useful information may be lost if an application looks only at the raw HTML.
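Consider, for example, this contrived page (my own example, not from the Cobra documentation). The raw HTML contains only an empty list; the script fills it in when the page loads:

<html>
  <body onload="fill()">
    <ul id="items"></ul>
    <script type="text/javascript">
      // Runs at load time and modifies the DOM: the raw HTML never
      // contains the list item, but the rendered DOM does.
      function fill() {
        var li = document.createElement("li");
        li.appendChild(document.createTextNode("added by JavaScript"));
        document.getElementById("items").appendChild(li);
      }
    </script>
  </body>
</html>

An application that scans only the raw HTML sees an empty ul element; one that walks the rendered DOM sees the added li.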
I originally used JRex, a Java wrapper for the Mozilla Gecko layout engine, to render HTML pages. While looking for a better engine for extracting the HTML of rendered pages, I found the Lobo Project, which includes the Cobra Toolkit, an HTML rendering engine, and the LoboBrowser, a browser built on that toolkit. The code is pure Java.
My initial comparison of JRex and Cobra found the following salient facts:
I took a few lines from the "getting-started" page for Cobra and combined them with the JRex code I wrote that does a simple recursive traversal of the DOM. (See my JRex page for a discussion of the traversal.) The resulting Render class is shown below.
package com.benjysbrain.cobra ;

import java.io.InputStream ;
import java.net.URI ;
import java.net.URL ;
import java.net.URLConnection ;
import java.util.logging.Level ;
import java.util.logging.Logger ;

import org.lobobrowser.html.UserAgentContext ;
import org.lobobrowser.html.parser.DocumentBuilderImpl ;
import org.lobobrowser.html.parser.InputSourceImpl ;
import org.lobobrowser.html.test.SimpleUserAgentContext ;
import org.w3c.dom.Document ;
import org.w3c.dom.Element ;
import org.w3c.dom.Node ;
import org.w3c.dom.NodeList ;

/**
   Render opens a URL, uses Cobra to render the HTML and apply
   JavaScript, and then does a simple tree traversal of the DOM,
   printing beginning and end tag names.

   Subclass this class and override the doElement(org.w3c.dom.Element)
   and doTagEnd(org.w3c.dom.Element) methods to do some real work.
   In the base class, doElement() prints the tag name and doTagEnd()
   prints a closing version of the tag.

   This class is a rewrite of org.benjysbrain.htmlgrab.Render, which
   uses JRex.

   Copyright (c) 2008 by Ben E. Cline. This code is presented as a
   teaching aid. No warranty is expressed or implied.

   http://www.benjysbrain.com/

   @author Benjy Cline
   @version 1/2008
*/
public class Render {

  String url ;  // The page to be processed.

  // These variables can be used in subclasses and are created from
  // url. baseURL can be used to construct the absolute URL of the
  // relative URLs in the page. hostBase is just the http://host.com/
  // part of the URL and can be used to construct the full URL of
  // URLs in the page that are site relative, e.g., "/xyzzy.jpg".
  // Variable host is set to the host part of url, e.g., host.com.

  String baseURL ;
  String hostBase ;
  String host ;

  /**
     Create a Render object with a target URL.
  */
  public Render(String url) {
    this.url = url ;
  }

  /**
     Load the given URL using Cobra. When the page is loaded,
     recurse on the DOM and call doElement()/doTagEnd() for
     each Element node. Return false on error.
  */
  public boolean parsePage() {

    // From the Lobo forum. Disable all logging.
    Logger.getLogger("").setLevel(Level.OFF) ;

    // Parse the URL and build baseURL and hostBase for use by
    // doElement() and doTagEnd().
    URI uri = null ;
    URL urlObj = null ;
    try {
      uri = new URI(url) ;
      urlObj = new URL(url) ;
    }
    catch(Exception e) {
      System.out.println(e) ;
      return false ;
    }

    String path = uri.getPath() ;
    host = uri.getHost() ;
    String port = "" ;
    if(uri.getPort() != -1) port = Integer.toString(uri.getPort()) ;
    if(!port.equals("")) port = ":" + port ;
    baseURL = "http://" + uri.getHost() + port + path ;
    hostBase = "http://" + uri.getHost() + port ;

    // Open a connection to the HTML page and use Cobra to parse it.
    // Cobra does not return until the page is loaded.
    try {
      URLConnection connection = urlObj.openConnection() ;
      InputStream in = connection.getInputStream() ;
      UserAgentContext context = new SimpleUserAgentContext() ;
      DocumentBuilderImpl dbi = new DocumentBuilderImpl(context) ;
      Document document = dbi.parse(new InputSourceImpl(in, url,
                                                        "ISO-8859-1")) ;

      // Do a recursive traversal on the top-level DOM node.
      Element ex = document.getDocumentElement() ;
      doTree((Node) ex) ;
    }
    catch(Exception e) {
      System.out.println("parsePage(" + url + "): " + e) ;
      return false ;
    }

    return true ;
  }

  /**
     Recurse the DOM starting with Node node. For each Node of
     type Element, call doElement() with it and recurse over its
     children. The Elements refer to the HTML tags, and the children
     are tags contained inside the parent tag.
  */
  public void doTree(Node node) {
    if(node instanceof Element) {
      Element element = (Element) node ;

      // Visit tag.
      doElement(element) ;

      // Visit all the children, i.e., tags contained in this tag.
      NodeList nl = element.getChildNodes() ;
      if(nl == null) return ;
      int num = nl.getLength() ;
      for(int i = 0 ; i < num ; i++) doTree(nl.item(i)) ;

      // Visit the tag end.
      doTagEnd(element) ;
    }
  }

  /**
     Called for each Element in the DOM. Override to do real work.
     The base class version prints the tag name.
  */
  public void doElement(Element element) {
    System.out.println("<" + element.getTagName() + ">") ;
  }

  /**
     Called after the children of an Element have been visited.
     The base class version prints a closing version of the tag.
  */
  public void doTagEnd(Element element) {
    System.out.println("</" + element.getTagName() + ">") ;
  }

  /**
     Parse the URL given on the command line, or the CNN home
     page if no URL is given.
  */
  public static void main(String[] argv) {
    String url = "http://www.cnn.com/" ;
    if(argv.length == 1) url = argv[0] ;
    Render render = new Render(url) ;
    render.parsePage() ;
  }
}
The program is run from the command line:
java com.benjysbrain.cobra.Render [url]
where [url] is an optional URL. If the URL is not specified, the program parses the CNN home page and lists the tag names for beginning and ending tags. The Cobra jar files, cobra.jar and js.jar, have to be in the classpath.
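For example, on Linux or Mac OS X, with both jars in the current directory (the paths are illustrative; use ; instead of : as the classpath separator on Windows):

java -cp cobra.jar:js.jar:. com.benjysbrain.cobra.Render http://www.example.com/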
Because JRex and Cobra both use the org.w3c.dom DOM interfaces, the two versions of the code are very similar.
To use Render to do real work, extend it in a subclass and override doElement() and doTagEnd() to extract information from the DOM. To extract tag attributes, first call the Element's hasAttributes() method, which returns true if the tag has any attributes. Then call getAttributes() to obtain a NamedNodeMap of the tag's attribute nodes. Each Node in the map holds an attribute/value pair: getNodeName() returns the attribute name, and getNodeValue() returns the attribute value.
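As a sketch, here is a hypothetical subclass, AttributeRender (my own name, not part of Cobra), that prints each start tag along with its attributes:

package com.benjysbrain.cobra ;

import org.w3c.dom.Element ;
import org.w3c.dom.NamedNodeMap ;
import org.w3c.dom.Node ;

/**
   Hypothetical Render subclass that prints each start tag with
   its attribute/value pairs.
*/
public class AttributeRender extends Render {

  public AttributeRender(String url) {
    super(url) ;
  }

  // Print the tag name and any attribute/value pairs.
  public void doElement(Element element) {
    StringBuilder buf = new StringBuilder("<" + element.getTagName()) ;
    if(element.hasAttributes()) {
      NamedNodeMap attrs = element.getAttributes() ;
      for(int i = 0 ; i < attrs.getLength() ; i++) {
        Node attr = attrs.item(i) ;
        buf.append(" " + attr.getNodeName() + "=\"" +
                   attr.getNodeValue() + "\"") ;
      }
    }
    buf.append(">") ;
    System.out.println(buf.toString()) ;
  }
}

End tags are still printed by the base class's doTagEnd().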
I tried Cobra in my automatic comic extraction system, and it works well. As the library matures, I expect some of the limitations noted above to disappear.
If you have comments, suggestions, or questions, feel free to contact me at the e-mail address given in the footer of this page.
Update: The Cobra library comes with source for some sample code in org.lobobrowser.html.test. For example, org.lobobrowser.html.test.TestEntry is a Swing app that parses a web page and presents three views: web page, DOM tree, and source. In perusing the code, I found some helpful techniques. One of these is to add the following lines after the URLConnection object is created:
connection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible;) Cobra/0.96.1+");
connection.setRequestProperty("Cookie", "");
Once I added these lines to some code I was experimenting with, I was able to parse a web site that was troublesome.
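In the Render class above, these calls slot into parsePage() right after the connection is created and before the input stream is opened:

URLConnection connection = urlObj.openConnection() ;

// Identify ourselves with a browser-like User-Agent and send an
// empty cookie; some sites respond differently without these.
connection.setRequestProperty("User-Agent",
    "Mozilla/4.0 (compatible;) Cobra/0.96.1+") ;
connection.setRequestProperty("Cookie", "") ;

InputStream in = connection.getInputStream() ;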
Update: See Cobra Parsing: Disable Persistent Connections and Set Socket Timeouts.
Update (3/1/2011): It appears that Cobra is not being updated, and there are a number of complicated JavaScript pages it doesn't handle well. I am now using Crowbar and HtmlUnit for HTML parsing from Java.