Using the Lobo Cobra Toolkit to Retrieve the HTML of Rendered Pages
As noted in Using JRex to Retrieve the HTML of Rendered Pages, there is a difference between the raw HTML of a web page and the HTML after JavaScript has run. A typical browser loads the raw HTML, parses it, and builds a Document Object Model (DOM). If the page includes JavaScript that runs when the page loads, that code can modify the DOM. For data mining operations, useful information may be lost if an application looks only at the raw HTML.
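Consider, for example, this contrived page (my own example, not from the Cobra documentation). The raw HTML contains only an empty list; the script fills it in when the page loads:

<html>
  <body onload="fill()">
    <ul id="items"></ul>
    <script type="text/javascript">
      // Runs at load time and modifies the DOM: the raw HTML never
      // contains the list item, but the rendered DOM does.
      function fill() {
        var li = document.createElement("li");
        li.appendChild(document.createTextNode("added by JavaScript"));
        document.getElementById("items").appendChild(li);
      }
    </script>
  </body>
</html>

An application that scans only the raw HTML sees an empty ul element; one that walks the rendered DOM sees the added li.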
I originally used JRex, a Java wrapper for the Mozilla Gecko layout engine, to render HTML pages. While looking for a better engine for extracting the HTML of rendered pages, I found the Lobo Project, which includes the Cobra Toolkit, an HTML rendering engine, and the LoboBrowser, a browser built on that toolkit. The code is pure Java.
My initial comparison of JRex and Cobra found the following salient facts:
I took a few lines from the "getting-started" page for Cobra and combined them with the JRex code I wrote that does a simple recursive traversal of the DOM. (See my JRex page for a discussion of the traversal.) The resulting Render class is shown below.
package com.benjysbrain.cobra ;

import java.io.InputStream ;
import java.net.URI ;
import java.net.URL ;
import java.net.URLConnection ;
import java.util.logging.Level ;
import java.util.logging.Logger ;

import org.lobobrowser.html.UserAgentContext ;
import org.lobobrowser.html.parser.DocumentBuilderImpl ;
import org.lobobrowser.html.parser.InputSourceImpl ;
import org.lobobrowser.html.test.SimpleUserAgentContext ;
import org.w3c.dom.Document ;
import org.w3c.dom.Element ;
import org.w3c.dom.Node ;
import org.w3c.dom.NodeList ;

/**
   Render opens a URL, uses Cobra to render the HTML and apply
   JavaScript, and then does a simple tree traversal of the DOM,
   printing beginning and end tag names.

   Subclass this class and override the doElement(org.w3c.dom.Element)
   and doTagEnd(org.w3c.dom.Element) methods to do some real work.
   In the base class, doElement() prints the tag name and doTagEnd()
   prints a closing version of the tag.

   This class is a rewrite of org.benjysbrain.htmlgrab.Render, which
   uses JRex.

   Copyright (c) 2008 by Ben E. Cline. This code is presented as a
   teaching aid. No warranty is expressed or implied.

   http://www.benjysbrain.com/

   @author Benjy Cline
   @version 1/2008
*/
public class Render {

  String url ;  // The page to be processed.

  // These variables can be used in subclasses and are created from
  // url. baseURL can be used to construct the absolute URL of the
  // relative URLs in the page. hostBase is just the http://host.com/
  // part of the URL and can be used to construct the full URL of
  // URLs in the page that are site relative, e.g., "/xyzzy.jpg".
  // Variable host is set to the host part of url, e.g., host.com.

  String baseURL ;
  String hostBase ;
  String host ;

  /**
     Create a Render object with a target URL.
  */
  public Render(String url) {
    this.url = url ;
  }

  /**
     Load the given URL using Cobra. When the page is loaded,
     recurse on the DOM and call doElement()/doTagEnd() for
     each Element node. Return false on error.
  */
  public boolean parsePage() {

    // From the Lobo forum. Disable all logging.
    Logger.getLogger("").setLevel(Level.OFF) ;

    // Parse the URL and build baseURL and hostBase for use by
    // doElement() and doTagEnd().
    URI uri = null ;
    URL urlObj = null ;
    try {
      uri = new URI(url) ;
      urlObj = new URL(url) ;
    }
    catch(Exception e) {
      System.out.println(e) ;
      return false ;
    }

    String path = uri.getPath() ;
    host = uri.getHost() ;
    String port = "" ;
    if(uri.getPort() != -1) port = Integer.toString(uri.getPort()) ;
    if(!port.equals("")) port = ":" + port ;
    baseURL = "http://" + uri.getHost() + port + path ;
    hostBase = "http://" + uri.getHost() + port ;

    // Open a connection to the HTML page and use Cobra to parse it.
    // Cobra does not return until the page is loaded.
    try {
      URLConnection connection = urlObj.openConnection() ;
      InputStream in = connection.getInputStream() ;
      UserAgentContext context = new SimpleUserAgentContext() ;
      DocumentBuilderImpl dbi = new DocumentBuilderImpl(context) ;
      Document document = dbi.parse(new InputSourceImpl(in, url,
                                                        "ISO-8859-1")) ;

      // Do a recursive traversal on the top-level DOM node.
      Element ex = document.getDocumentElement() ;
      doTree((Node) ex) ;
    }
    catch(Exception e) {
      System.out.println("parsePage(" + url + "): " + e) ;
      return false ;
    }

    return true ;
  }

  /**
     Recurse the DOM starting with Node node. For each Node of
     type Element, call doElement() with it and recurse over its
     children. The Elements refer to the HTML tags, and the children
     are tags contained inside the parent tag.
  */
  public void doTree(Node node) {
    if(node instanceof Element) {
      Element element = (Element) node ;

      // Visit tag.
      doElement(element) ;

      // Visit all the children, i.e., tags contained in this tag.
      NodeList nl = element.getChildNodes() ;
      if(nl == null) return ;
      int num = nl.getLength() ;
      for(int i = 0 ; i < num ; i++) doTree(nl.item(i)) ;

      // Visit the tag end.
      doTagEnd(element) ;
    }
  }

  /**
     Called for each Element in the DOM. Override to do real work.
     The base class version prints the tag name.
  */
  public void doElement(Element element) {
    System.out.println("<" + element.getTagName() + ">") ;
  }

  /**
     Called after the children of an Element have been visited.
     The base class version prints a closing version of the tag.
  */
  public void doTagEnd(Element element) {
    System.out.println("</" + element.getTagName() + ">") ;
  }

  /**
     Parse the URL given on the command line, or the CNN home
     page if no URL is given.
  */
  public static void main(String[] argv) {
    String url = "http://www.cnn.com/" ;
    if(argv.length == 1) url = argv[0] ;
    Render render = new Render(url) ;
    render.parsePage() ;
  }
}
The program is run from the command line:
java com.benjysbrain.cobra.Render [url]
where [url] is an optional URL. If the URL is not specified, the program parses the CNN home page and lists the tag names for beginning and ending tags. The Cobra jar files, cobra.jar and js.jar, have to be in the classpath.
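For example, on Linux or Mac OS X, with both jars in the current directory (the paths are illustrative; use ; instead of : as the classpath separator on Windows):

java -cp cobra.jar:js.jar:. com.benjysbrain.cobra.Render http://www.example.com/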
Because JRex and Cobra both use the org.w3c.dom DOM interfaces, the two versions of the code are very similar.
To use Render to do real work, extend it in a subclass and override doElement() and doTagEnd() to extract information from the DOM. To extract tag attributes, first call the Element's hasAttributes() method, which returns true if the tag has any attributes. Then call getAttributes() to obtain a NamedNodeMap of the tag's attribute nodes. Each Node in the map holds an attribute/value pair: getNodeName() returns the attribute name, and getNodeValue() returns the attribute value.
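As a sketch, here is a hypothetical subclass, AttributeRender (my own name, not part of Cobra), that prints each start tag along with its attributes:

package com.benjysbrain.cobra ;

import org.w3c.dom.Element ;
import org.w3c.dom.NamedNodeMap ;
import org.w3c.dom.Node ;

/**
   Hypothetical Render subclass that prints each start tag with
   its attribute/value pairs.
*/
public class AttributeRender extends Render {

  public AttributeRender(String url) {
    super(url) ;
  }

  // Print the tag name and any attribute/value pairs.
  public void doElement(Element element) {
    StringBuilder buf = new StringBuilder("<" + element.getTagName()) ;
    if(element.hasAttributes()) {
      NamedNodeMap attrs = element.getAttributes() ;
      for(int i = 0 ; i < attrs.getLength() ; i++) {
        Node attr = attrs.item(i) ;
        buf.append(" " + attr.getNodeName() + "=\"" +
                   attr.getNodeValue() + "\"") ;
      }
    }
    buf.append(">") ;
    System.out.println(buf.toString()) ;
  }
}

End tags are still printed by the base class's doTagEnd().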
I tried Cobra in my automatic comic extraction system, and it works well. As the library matures, I expect some of the limitations noted above to disappear.
If you have comments, suggestions, or questions, feel free to contact me at the e-mail address given in the footer of this page.
Update: The Cobra library comes with source for some sample code in org.lobobrowser.html.test. For example, org.lobobrowser.html.test.TestEntry is a Swing app that parses a web page and presents three views: web page, DOM tree, and source. In perusing the code, I found some helpful techniques. One of these is to add the following lines after the URLConnection object is created:
connection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible;) Cobra/0.96.1+");
connection.setRequestProperty("Cookie", "");
Once I added these lines to some code I was experimenting with, I was able to parse a web site that was troublesome.
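In the Render class above, these calls slot into parsePage() right after the connection is created and before the input stream is opened:

URLConnection connection = urlObj.openConnection() ;

// Identify ourselves with a browser-like User-Agent and send an
// empty cookie; some sites respond differently without these.
connection.setRequestProperty("User-Agent",
    "Mozilla/4.0 (compatible;) Cobra/0.96.1+") ;
connection.setRequestProperty("Cookie", "") ;

InputStream in = connection.getInputStream() ;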
Update: See Cobra Parsing: Disable Persistent Connections and Set Socket Timeouts.
Update (3/1/2011): It appears that Cobra is not being updated, and there are a number of complicated JavaScript pages it doesn't handle well. I am now using Crowbar and HtmlUnit for HTML parsing from Java.