Automatic Extraction of Comics From Web Pages

Introduction

This page describes an ongoing project to extract newspaper-like comics from web pages. A typical comics page contains a single comic surrounded by various other images, including spacing images and advertisements. Other complications include dynamic HTML and the use of Adobe Flash to present comics on some pages.

A primary goal of the project is to extract comics automatically, without the need for user intervention when a comics page layout changes. Other systems exist to extract comics from the web, e.g., dailystrips; however, these are less than automatic.

Raw vs. Rendered HTML

Typically, web browsers read HTML from a web server, construct a Document Object Model (DOM) from the HTML, and then render the page in a browser window. If the page includes JavaScript code, the DOM can be modified as the page loads or when various events occur. I will refer to the HTML from the web server as the "raw HTML" and the HTML of the rendered page with JavaScript changes applied as the "HTML of the rendered page" or just the "rendered HTML."

In at least one comics page, the raw HTML contains a tag of the form

  <IMG SRC="" NAME="IMG1">
to display the comic. But looking at the rendered HTML shows that the SRC attribute has been modified to contain the URL of the comic. In this case, it is necessary to produce the rendered HTML before extracting the comic.

The advantage of parsing raw HTML is that it is typically faster than rendering the page first to obtain the rendered HTML. The main disadvantage of parsing raw HTML is that dynamic content is missed, as in the example above. Another disadvantage is that raw HTML might contain multiple references to the comic because JavaScript conditionals are not evaluated. For example, if a page contains sections for different browsers and each section contains a reference to the comic, the parsed HTML will contain multiple references to the comic.

In the current version of my comic extraction program, I use both a simple HTML parser to extract raw HTML and a rendering engine to build the rendered HTML. Because I work in Java, I only explored Java tools: HTML Parser, JRex, and the Cobra Toolkit.

My software currently runs the HTML Parser for most sites and JRex for the one site that writes the comic URL dynamically. The Cobra Toolkit seems to work well; however, my experience with it is limited. The Cobra Toolkit does not appear to have a Flash plug-in, so it behaves differently from JRex, which does.

Update: I have abandoned the JRex parser. Of the twenty comics I retrieve, fifteen are handled by the HTML Parser and the remaining five are handled by the Cobra parser. The HTML Parser seems to be a bit faster; however, Cobra is needed to handle sites that generate their image tags from JavaScript code.

Update 3/1/2011: I have abandoned the Cobra Toolkit also as it fails to parse some complicated JavaScript pages. See Another Approach to Java HTML Parsing for details on using Crowbar for comics extraction.

Locating Comics Automatically

Typically, comics have a limited number of formats and sizes. There is the typical daily strip divided into three sections, e.g., Calvin and Hobbes, and there are single-frame strips like The Quigmans. Sunday strips are larger and can be divided into a number of sections. The comic is typically an <IMG> tag; however, it can occur as a Flash object. The image can be surrounded by advertisements and other images.

Empirically, I determined the following rule for finding comics on typical web pages using the area and aspect ratio of images and Flash objects. The area is in square pixels, and the aspect ratio is defined as height/width.

  Comic if area > 80,000 && aspect > 0.2 && aspect < 3.25
For the major comics sites and some online newspapers, this rule works extremely well. It occasionally results in a false positive, allowing an advertisement to be tagged as a comic; however, it rarely fails to extract a comic.
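The rule above can be sketched in Java as follows. This is only an illustration, not the extractor's actual code: the class name ComicFilter, the method name isComic, and the sample image sizes are assumptions made for the example.

```java
/** Sketch of the area/aspect-ratio filter rule. All names here are illustrative. */
final class ComicFilter {
    static final long MIN_AREA = 80_000;    // square pixels
    static final double MIN_ASPECT = 0.2;   // aspect = height / width
    static final double MAX_ASPECT = 3.25;

    private ComicFilter() {}

    /** Returns true when an image or Flash object of this size passes the rule. */
    static boolean isComic(int width, int height) {
        if (width <= 0 || height <= 0) {
            return false; // size unknown or invalid: cannot be classified
        }
        long area = (long) width * height;
        double aspect = (double) height / width;
        return area > MIN_AREA && aspect > MIN_ASPECT && aspect < MAX_ASPECT;
    }

    public static void main(String[] args) {
        // Hypothetical sizes chosen only to exercise the rule.
        System.out.println(isComic(600, 200)); // wide strip-like image
        System.out.println(isComic(728, 90));  // banner-shaped image, area too small
        System.out.println(isComic(160, 600)); // tall, narrow image, aspect too large
    }
}
```

Note that an advertisement with strip-like dimensions passes this test just as a comic does, which is how the occasional false positive arises.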

Two additional pieces of information improve the performance of the rule. First, if it is known that the comic web page displays the comic using an <IMG> tag, then ignoring Flash objects on the page will reduce the number of ads that pass through the filter.

Second, the method used to determine the size of an image can enhance the filtering process. The width and height attributes on an <IMG> tag are optional. Tags that include size information may have both attributes or only one. In one case, an image that was accepted by the filter rule was an advertisement with "width=100" on the <IMG> tag. In this case, computing the actual image size caused the filter to reject the image. But, in another case, the primary comic was displayed as a Flash object while a full-sized image with attribute resizing was used as an advertisement. By ignoring both Flash and <IMG> attributes, the filter rule would find the full-sized image of the strip.

These two pieces of information augment the filter rule; however, they are transient in nature. The layout of the comic strip page can be altered by the owner at any time, requiring adjustment to how a page is handled.

I recently added a switch to the comics extractor that causes it to extract only the first comic on a page. One comic site started publishing a week's worth of strips on a single page, so the switch reduces the clutter on my daily extracted comics page.

In the limited number of cases I've studied, I don't find the ALT attribute for images to be helpful in locating comics. Many sites do not specify this attribute.

Some comic pages serve the image only as a Flash object. Flash objects are more difficult to deal with. One approach is to capture the entire <OBJECT> tag that defines the comic to display. Usually, to support multiple browser types, the tag includes an <EMBED> tag. Embedding the captured <OBJECT> tag in a web page will typically display the image; however, this method does not work with some sites.

Note that some pages use JavaScript to check for the Flash plug-in before generating the <OBJECT> tag. If no plug-in is found, a simple <IMG> tag is generated instead. If you are processing the raw HTML of the page, your parser will find both the Flash and image versions of the strip.

HTTP Referrer

Some comic servers will not serve a comic image unless they receive an HTTP referrer that is the page containing the image. If you blindly use something like the Java URLConnection class and don't set the referrer, the comic will not be extracted.
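A minimal sketch of setting the referrer with URLConnection is shown below. The class and method names and the example URLs are assumptions for illustration. Note that the HTTP header name is spelled "Referer", with a single "r".

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;

/** Sketch: request a comic image while naming the containing page as the referrer. */
final class RefererFetch {
    private RefererFetch() {}

    /** Prepares the connection; no network traffic happens until it is read. */
    static URLConnection openWithReferer(String imageUrl, String pageUrl) throws IOException {
        URLConnection conn = URI.create(imageUrl).toURL().openConnection();
        conn.setRequestProperty("Referer", pageUrl); // header name per the HTTP specification
        return conn;
    }

    /** Downloads the image bytes (InputStream.readAllBytes requires Java 9 or later). */
    static byte[] fetch(String imageUrl, String pageUrl) throws IOException {
        URLConnection conn = openWithReferer(imageUrl, pageUrl);
        try (InputStream in = conn.getInputStream()) {
            return in.readAllBytes();
        }
    }
}
```

A call such as RefererFetch.fetch("http://comics.example/strip.gif", "http://comics.example/today.html") would then send the page URL along with the image request; both URLs are placeholders.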

Software Architecture

My comic extractor is written in Java and uses JRex, HTMLParser, and MySQL. JRex is used for pages where execution of JavaScript is required to find a comic, and HTMLParser is used when simple parsing is adequate to find the comic image. MySQL is used to save copies of the extracted images. The images are retained for one week. In the case of comics that are viewed as Flash objects, the <OBJECT> tag is retained in the database instead of the image.

A JSP site with a single custom tag is used to display the images for any day during the past 7 days.

My ongoing comic strip extractor uses a combination of the aspect ratio/area filter rule and knowledge about particular pages. I try to limit the latter so that extraction is as automatic as possible. In some cases, the extra knowledge allows the extractor to ignore Flash versions of a strip and extract an image instead.

Currently, the extractor takes the following arguments:

The JRex and HTMLParser versions of the code share a common object, MinePage, that implements the extraction logic. Subclasses of the TagHolder class hide the underlying parser from MinePage.
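The idea of hiding the parser behind TagHolder can be illustrated with a small sketch. The actual TagHolder and MinePage interfaces are not published, so every method below, along with the MapTagHolder subclass, is hypothetical; a real subclass would wrap a JRex DOM node or an HTMLParser tag instead of a map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical parser-neutral view of a single tag. */
abstract class TagHolder {
    abstract String tagName();
    abstract String attribute(String name); // null when the attribute is absent
}

/** Stand-in subclass backed by a plain map rather than a real parser node. */
final class MapTagHolder extends TagHolder {
    private final String name;
    private final Map<String, String> attributes;

    MapTagHolder(String name, Map<String, String> attributes) {
        this.name = name;
        this.attributes = attributes;
    }

    @Override String tagName() { return name; }
    @Override String attribute(String attrName) { return attributes.get(attrName); }
}

/** Hypothetical extraction logic that sees only TagHolder, never the parser itself. */
final class MinePage {
    private MinePage() {}

    /** Collects the non-empty SRC values of IMG tags in page order. */
    static List<String> imageSources(List<? extends TagHolder> tags) {
        List<String> sources = new ArrayList<>();
        for (TagHolder tag : tags) {
            String src = tag.attribute("SRC");
            if ("IMG".equalsIgnoreCase(tag.tagName()) && src != null && !src.isEmpty()) {
                sources.add(src);
            }
        }
        return sources;
    }
}
```

With this arrangement, switching parsers means writing another TagHolder subclass while the extraction logic stays unchanged.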

The JRex version of the comic strip extractor is slower and more prone to error than the HTMLParser version. The JRex parser will sometimes pop up a browser window as my code processes the DOM. It sometimes crashes the underlying JVM. Some pages with modal windows might cause my code to hang since I handle a minimum of JRex events.

The "skip Flash" flag tells the software to ignore Flash objects in the page. I turn on this feature for most pages where the comic strip is presented as a simple image or where the strip is presented in both image and Flash forms. Because many animated advertisements are Flash objects, ignoring Flash can reduce the number of false positives created by ads.

The "image size calculation" flag is used to cause the extractor to ignore WIDTH and HEIGHT attributes on <IMG> tags when calculating the size of images on a page. Image size is used in the aspect ratio and area calculations in the filter rule.

By default, the comic strip extractor looks for HEIGHT and WIDTH attributes. If both are found, it calculates the area and aspect ratio using the attribute values. If only one attribute is given, the extractor downloads the image and determines the size. It then calculates the size such that the given attribute specifies one dimension and the other is calculated to keep the proper image aspect ratio. If no attributes are given, the calculations are computed from the size of the downloaded image.
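The size calculation described above might look like the following sketch. It is not the extractor's code: the names ImageSizer and effectiveSize are assumptions, sizes are passed as {width, height} arrays for brevity, and the download step is represented by a Supplier so that it runs only when it is needed.

```java
import java.util.function.Supplier;

/** Sketch of choosing the width and height that feed the filter rule. */
final class ImageSizer {
    private ImageSizer() {}

    /**
     * attrWidth and attrHeight are null when the IMG tag omits them.
     * ignoreAttributes models the "image size calculation" flag.
     * actualSize supplies {width, height} of the downloaded image on demand.
     */
    static int[] effectiveSize(Integer attrWidth, Integer attrHeight,
                               boolean ignoreAttributes, Supplier<int[]> actualSize) {
        if (ignoreAttributes) {
            attrWidth = null;
            attrHeight = null;
        }
        if (attrWidth != null && attrHeight != null) {
            return new int[] {attrWidth, attrHeight}; // both given: no download needed
        }
        int[] actual = actualSize.get();
        if (attrWidth != null) {
            // Scale the height to keep the image's own aspect ratio.
            int height = (int) Math.round((double) actual[1] * attrWidth / actual[0]);
            return new int[] {attrWidth, height};
        }
        if (attrHeight != null) {
            // Scale the width to keep the image's own aspect ratio.
            int width = (int) Math.round((double) actual[0] * attrHeight / actual[1]);
            return new int[] {width, attrHeight};
        }
        return actual; // no attributes: use the downloaded size as-is
    }
}
```

For example, an 800x400 image tagged with only WIDTH=400 is treated as 400x200, while the same image with the flag turned on is treated as 800x400; these numbers are made up for illustration.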

If a Flash object does not have both the HEIGHT and WIDTH attributes, my software is unable to determine the size of the comic. In this case, the comic extractor software fails for the site. One strip has a specific HEIGHT attribute with a "WIDTH=100%" attribute.

As noted above, sometimes large images are used for ads but are resized by the browser to a smaller size. Other times, a comic might be resized to a smaller size as an ad, but this image might be more desirable than the Flash version used by the site.

Because JavaScript is not executed for comic pages that are parsed with HTML Parser, there may be duplicate references to the comic. I use a hash table to store image URLs, allowing the deletion of duplicate references.
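Duplicate removal of this kind can be sketched with a hash-based set. The example below uses LinkedHashSet, which is my choice for the illustration rather than a detail from the extractor, because it also preserves the order in which URLs appear on the page. The class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

/** Sketch: drop repeated image URLs while keeping first-seen page order. */
final class ImageUrlDeduper {
    private ImageUrlDeduper() {}

    static List<String> dedupe(List<String> urls) {
        // LinkedHashSet is hash-based and remembers insertion order.
        return new ArrayList<>(new LinkedHashSet<>(urls));
    }
}
```

Keeping page order matters if only the first comic on a page is wanted afterwards.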

The recently added "Just First" flag causes the extractor to ignore all but the first comic on a page.

Reliability and Performance

Because I use this software for my personal use, I have it extract only fourteen comic pages daily. Some of these are sites that host a single comic while others like gocomics™ provide several of the comics I read. I need to construct a test suite that includes a larger number of comic sites.

I've been using a version of this code for about a year. The system reliably extracts comics from the fourteen strips I read. The primary problem with the system is the occasional false positive: an ad that is presented as a comic. On a recent morning, one of the major comic strip providers had an ad for a Calvin and Hobbes anthology that was on each page and that was not blocked by my filter rule. A tweak on the filter rule to exclude this ad would have caused the Sunday Mother Goose and Grimm to be treated as an ad.

I only use the JRex version for one strip, so it only takes a few minutes to extract the comics. I run the extractor as a scheduled task before I arrive at work.

The problem with the system is that it is not as automatic as I would like. By selecting the parser and using the "skip Flash" and "image size calculation" flags, I'm giving the comic strip extractor extra information that can be transient. This makes my system a hybrid between automatic extraction and a system where all extraction information is given by the user. Furthermore, the main filter rule might fail to work reliably in the future if webmasters drastically change the comic strip pages.

I continue to work on the system to increase its reliability by adding more intelligence. I've considered doing some image processing to exclude images that are not line drawings. I could also analyze the surrounding tags to try to exclude advertisements. Perhaps a machine learning approach would provide the best solution.

The comic extractor is a research project, and I am not sharing the ever-changing code at this time.

Update 5/2011

This project has grown into a more extensive data mining project for automatically recognizing and extracting comics from the web. See The Comics Miner for a brief description.


This page © copyright 2008 by Ben E. Cline.