Using the Lobo Cobra Toolkit to Retrieve the HTML of Rendered Pages

This page is dated. Please see Parsing Web Pages in Java.


As noted in Using JRex to Retrieve the HTML of Rendered Pages, there is a difference between the raw HTML of a web page and the HTML after JavaScript has been applied. A typical browser loads raw HTML, parses it, and builds a Document Object Model (DOM). If the page includes JavaScript that runs when the page is loaded, that code can modify the DOM. For data mining operations, useful information might be lost if an application looks only at the raw HTML.

I originally used JRex, a Java wrapper for the Mozilla Gecko layout engine, to render HTML pages. While looking for a better engine for extracting the HTML of rendered pages, I found the Cobra Toolkit, which is part of the Lobo Project. The project consists of the Cobra Toolkit, which renders HTML, and the LoboBrowser, which is built on the toolkit. The code is pure Java.

My initial comparison of JRex and Cobra found the following salient facts:

Sample Code

I took a few lines from the "getting-started" page for Cobra and combined these with the JRex code I wrote that does a simple recursive traversal of the DOM. Please see my JRex page for a discussion of the traversal.

    package com.benjysbrain.cobra ;

    import java.io.* ;
    import org.lobobrowser.html.*;
    import org.lobobrowser.html.gui.*;
    import org.lobobrowser.html.parser.*;
    import org.lobobrowser.html.test.* ;
    import org.w3c.dom.* ;
    import org.xml.sax.InputSource;
    import org.lobobrowser.html.domimpl.HTMLElementImpl ;
    import java.net.*;
    import java.util.* ;
    import java.util.logging.* ;

    /**
       Render - This object is a wrapper for the Cobra Toolkit, which is
       part of the Lobo Project (http://html.xamjwg.org/index.jsp).
       Cobra is a "pure Java HTML renderer and DOM parser."
       <p>
       Render opens a URL, uses Cobra to render that HTML and apply
       JavaScript. It then does a simple tree traversal of the DOM to
       print beginning and end tag names.
       <p>
       Subclass this class and override the
       <i>doElement(org.w3c.dom.Element element)</i> and
       <i>doTagEnd(org.w3c.dom.Element element)</i> methods to do some
       real work. In the base class, doElement() prints the tag name and
       doTagEnd() prints a closing version of the tag.
       <p>
       This class is a rewrite of org.benjysbrain.htmlgrab.Render that
       uses JRex.
       <p>
       Copyright (c) 2008 by Ben E. Cline. This code is presented as a
       teaching aid. No warranty is expressed or implied.
       <p>
       http://www.benjysbrain.com/

       @author Benjy Cline
       @version 1/2008
    */

    public class Render {
        String url ;        // The page to be processed.

        // These variables can be used in subclasses and are created from
        // url. baseURL can be used to construct the absolute URL of the
        // relative URLs in the page. hostBase is just the http://host.com/
        // part of the URL and can be used to construct the full URL of
        // URLs in the page that are site relative, e.g., "/xyzzy.jpg".
        // Variable host is set to the host part of url, e.g., host.com.

        String baseURL ;
        String hostBase ;
        String host ;

        /** Create a Render object with a target URL. */

        public Render(String url) {
            this.url = url ;
        }

        /** Load the given URL using Cobra. When the page is loaded,
            recurse on the DOM and call doElement()/doTagEnd() for
            each Element node. Return false on error.
        */

        public boolean parsePage() {
            // From Lobo forum. Disable all logging.

            Logger.getLogger("").setLevel(Level.OFF);

            // Parse the URL and build baseURL and hostBase for use by
            // doElement() and doTagEnd().

            URI uri = null ;
            URL urlObj = null ;
            try {
                uri = new URI(url) ;
                urlObj = new URL(url) ;
            } catch(Exception e) {
                System.out.println(e) ;
                return false ;
            }

            String path = uri.getPath() ;

            host = uri.getHost() ;
            String port = "" ;
            if(uri.getPort() != -1) port = Integer.toString(uri.getPort()) ;
            if(!port.equals("")) port = ":" + port ;

            baseURL = "http://" + uri.getHost() + port + path ;
            hostBase = "http://" + uri.getHost() + port ;

            // Open a connection to the HTML page and use Cobra to parse it.
            // Cobra does not return until page is loaded.

            try {
                URLConnection connection = urlObj.openConnection();
                InputStream in = connection.getInputStream();

                UserAgentContext context = new SimpleUserAgentContext();
                DocumentBuilderImpl dbi = new DocumentBuilderImpl(context);
                Document document =
                    dbi.parse(new InputSourceImpl(in, url, "ISO-8859-1")) ;

                // Do a recursive traversal on the top-level DOM node.

                Element ex = document.getDocumentElement() ;
                doTree((Node) ex) ;
            } catch(Exception e) {
                System.out.println("parsePage(" + url + "): " + e) ;
                return false ;
            }

            return true ;
        }

        /** Recurse the DOM starting with Node node. For each Node of
            type Element, call doElement() with it and recurse over its
            children. The Elements refer to the HTML tags, and the children
            are tags contained inside the parent tag.
        */

        public void doTree(Node node) {
            if(node instanceof Element) {
                Element element = (Element) node ;

                // Visit tag.

                doElement(element) ;

                // Visit all the children, i.e., tags contained in this tag.

                NodeList nl = element.getChildNodes() ;
                if(nl == null) return ;
                int num = nl.getLength() ;
                for(int i=0; i<num; i++)
                    doTree(nl.item(i)) ;

                // Process the end of this tag.

                doTagEnd(element) ;
            }
        }

        /** Simple doElement to print the tag name of the Element.
            Override to do something real.
        */

        public void doElement(Element element) {
            System.out.println("<" + element.getTagName() + ">") ;
        }

        /** Simple doTagEnd() to print the closing tag of the Element.
            Override to do something real.
        */

        public void doTagEnd(Element element) {
            System.out.println("</" + element.getTagName() + ">") ;
        }

        /** Main: java com.benjysbrain.cobra.Render [url]. Open Render on
            www.cnn.com by default. Parse the page and print the beginning
            and end tags.
        */

        public static void main(String[] args) {
            String url ="http://www.cnn.com/" ;
            if(args.length == 1) url = args[0] ;
            Render p = new Render(url) ;
            p.parsePage() ;
            System.exit(0) ;
        }
    }

Run the program with the JVM:

     java com.benjysbrain.cobra.Render [url]

where [url] is an optional URL. If the URL is not specified, the program parses the CNN home page and lists the tag names for beginning and ending tags. The Cobra jar files, cobra.jar and js.jar, have to be in the classpath.

Because JRex and Cobra both use the org.w3c.dom DOM interfaces, the two versions of the code are very similar.
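The baseURL and hostBase fields that parsePage() builds are meant for turning relative links found in the page into absolute URLs. As an alternative to string concatenation, the JDK's java.net.URI.resolve() handles both page-relative and site-relative links. The following is only a minimal sketch: the class name LinkResolver, the method name resolve, and the example URLs are made up for illustration and are not part of Render or Cobra.

```java
import java.net.URI;

/** Hypothetical helper; not part of Render or the Cobra Toolkit. */
public class LinkResolver {

    /**
     * Resolve a link found in a page against the URL of that page.
     * Note: URI.create() throws IllegalArgumentException if either
     * string is not a syntactically valid URI (e.g., contains spaces).
     */
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://host.com/comics/index.html";

        // Page-relative link: resolved against the page's directory.
        System.out.println(resolve(page, "today.gif"));

        // Site-relative link: resolved against the host.
        System.out.println(resolve(page, "/xyzzy.jpg"));

        // Absolute link: returned unchanged.
        System.out.println(resolve(page, "http://other.com/a.png"));
    }
}
```

A subclass of Render could call a helper like this from doElement() with the page URL and the value of an href or src attribute.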

To use Render to do real work, extend it in a subclass. You can override doElement() and doTagEnd() to extract information from the DOM. To extract tag attributes, first run the boolean method hasAttributes() of the Element object. If the tag has attributes, this method will return true. You can then use the getAttributes() method to obtain a NamedNodeMap object, which you can use to access the tag attributes. The Node objects referenced by the NamedNodeMap contain attribute/value pairs. The getNodeName() method of Node returns the attribute name, while the getNodeValue() method returns the attribute value.
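The attribute handling described above can be sketched as follows. So that the example runs without cobra.jar, it assumes the JDK's built-in XML parser as a stand-in for Cobra and feeds it a small well-formed snippet; the org.w3c.dom calls (hasAttributes(), getAttributes(), getNodeName(), getNodeValue()) are the same ones a Render subclass would use in doElement(). The class name AttributeDump and its methods are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical demo class; not part of Render or the Cobra Toolkit. */
public class AttributeDump {

    /** Recurse over the DOM and record "tag name=value" for each attribute. */
    static void collect(Node node, List<String> out) {
        if (!(node instanceof Element)) return;
        Element element = (Element) node;

        // Only ask for the attribute map when the tag has attributes.
        if (element.hasAttributes()) {
            NamedNodeMap attrs = element.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                out.add(element.getTagName() + " "
                        + attr.getNodeName() + "=" + attr.getNodeValue());
            }
        }

        // Visit the children, as doTree() does in Render.
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    /** Parse a well-formed XHTML string (stand-in for a Cobra-parsed page). */
    static List<String> attributesOf(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page =
            "<html><body><a href=\"/xyzzy.jpg\">pic</a><p>text</p></body></html>";
        for (String line : attributesOf(page)) {
            System.out.println(line);
        }
    }
}
```

In a real subclass of Render, the body of the hasAttributes() check would go inside an overridden doElement(), and the traversal would be left to doTree().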

I tried Cobra in my automatic comic extraction system, and it works well. As the library matures, I expect some of the limitations noted above to disappear.

If you have comments, suggestions, or questions, feel free to contact me at the e-mail address given in the footer of this page.

Update: The Cobra library comes with the source for some sample code in org.lobobrowser.html.test. For example, org.lobobrowser.html.test.TestEntry is a Swing app that parses a web page and presents three views: web page, DOM tree, and source. In perusing the code, I found some helpful techniques. One of these is to add the following lines after a URLConnection object is created:

 connection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible;) Cobra/0.96.1+");
 connection.setRequestProperty("Cookie", "");

Once I added these lines to some code I was experimenting with, I was able to parse a web site that was troublesome.
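For context, here is a minimal sketch of where those two lines sit relative to the creation of the connection. The class name BrowserLikeConnection, the method name open, and the example URL are assumptions made for illustration. Nothing is sent over the network in this sketch, because a URLConnection does not connect until its input stream (or a similar method) is requested.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

/** Hypothetical helper; not part of Render or the Cobra Toolkit. */
public class BrowserLikeConnection {

    /** Create a connection and set the request headers before it is used. */
    static URLConnection open(String url) {
        try {
            URLConnection connection = URI.create(url).toURL().openConnection();

            // The two lines from the Cobra sample code. They must run
            // before getInputStream() is called on the connection.
            connection.setRequestProperty("User-Agent",
                "Mozilla/4.0 (compatible;) Cobra/0.96.1+");
            connection.setRequestProperty("Cookie", "");

            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = open("http://www.example.com/");
        System.out.println(connection.getRequestProperty("User-Agent"));
    }
}
```

In Render, the equivalent place is in parsePage(), between the call to urlObj.openConnection() and the call to connection.getInputStream().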

Update: See Cobra Parsing: Disable Persistent Connections and Set Socket Timeouts.

Update (3/1/2011): It appears that Cobra is not being updated, and there are a number of complicated JavaScript pages it doesn't handle well. I am now using Crowbar and HtmlUnit for HTML parsing from Java.


This page © copyright 2008 by Ben E. Cline.  E-Mail: