The Comics Miner - Automatic Extraction of Comic Images from the Web

Introduction

The Comics Miner software is part of a research project to develop techniques to automatically identify and extract comic images from web pages. Current systems typically rely on regular expressions or other manually defined parameters, such as the nth image on a page, to extract comics. The problem with this approach is that human intervention is required each time a comics site is redesigned. Another approach uses the area and aspect ratio of images to decide which are comics. This approach is automatic; however, it is susceptible to false positives, i.e., advertisements and banners that are marked as comics.

Adult humans typically have no trouble inspecting a web page and deciding which images are comics and which are advertisements, banners, or decoration; however, this task is difficult for software systems. In part, the difficulty arises from the fact that comics on the web can vary from simple line drawings to photographic images with speech balloons. Also, comic images, banners, and advertisement images can share many of the same features on a given site. The variety in page formats between comic sites also increases the difficulty in identifying comics on a page.

The Comics Miner software relies on artificial intelligence and information retrieval techniques to perform at a high level of accuracy on this task. Earlier versions used Artificial Neural Networks (ANN), while more recent versions use Support Vector Machines (SVM). The quality of these classifiers is enhanced using heuristic rules.

Two uses for this software are creating a personalized daily comics page and more general web crawling. For a personalized comics page creator, a user gives the system a list of comics pages, and the system returns a comics page each day containing the most recent comics from the given comics pages. The system acts autonomously in that a site redesign should not require human intervention in order to successfully extract a comic.

The software can also be used in a more general crawler to look for comic images in general sites including commerce sites and news pages. A more general crawler can be used to build a library of web images for research into comics identification.

Image Attributes

The Comics Miner uses a number of image attributes to differentiate comics from other images. These attributes can be divided into three categories: features, page location, and corresponding text attributes.

Some of the image features used in the Comics Miner are:

Image size: height and width
Image area: height * width
Image location on the page: (x, y)
Aspect ratio
Texture features such as correlation and angular second moment
Grayscale histogram statistics
Web page structure and IMG tag context information
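
Texture features such as angular second moment and correlation are commonly computed from a gray-level co-occurrence matrix. The sketch below is an illustration only, not the Comics Miner's code. It assumes a horizontal one-pixel offset, a non-symmetric matrix, and an image already quantized to a small number of gray levels; the function names are hypothetical.

```python
import math

def cooccurrence(image, levels):
    """Normalized co-occurrence matrix for horizontally adjacent pixel pairs.

    `image` is a list of rows of integer gray levels in range(levels).
    """
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for row in image:
        for left, right in zip(row, row[1:]):
            counts[left][right] += 1
            total += 1
    if total == 0:
        raise ValueError("each row needs at least two pixels")
    return [[c / total for c in r] for r in counts]

def angular_second_moment(p):
    """Sum of squared matrix entries; 1.0 for a perfectly uniform image."""
    return sum(v * v for row in p for v in row)

def correlation(p):
    """Correlation between the gray levels of neighboring pixels."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    mean_i = sum(i * v for i, _, v in cells)
    mean_j = sum(j * v for _, j, v in cells)
    var_i = sum((i - mean_i) ** 2 * v for i, _, v in cells)
    var_j = sum((j - mean_j) ** 2 * v for _, j, v in cells)
    if var_i == 0 or var_j == 0:
        # Undefined for a constant image; returning 0.0 is this sketch's choice.
        return 0.0
    cov = sum((i - mean_i) * (j - mean_j) * v for i, j, v in cells)
    return cov / math.sqrt(var_i * var_j)

# Hypothetical two-level test patterns.
flat = [[0, 0, 0], [0, 0, 0]]
checker = [[0, 1, 0], [1, 0, 1]]
print(angular_second_moment(cooccurrence(flat, 2)))
print(correlation(cooccurrence(checker, 2)))
```

A flat region concentrates all the mass in one matrix cell, while an alternating pattern spreads it out and yields a strongly negative correlation.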

The native size and aspect ratio of an image might differ from the actual rendered size and aspect ratio depending on the height and width parameters of the image tag. For example, the same image might be rendered as 500 x 200 pixels when it is used as the main comic image on a page and also be rendered as a 100 x 40 image when used as a thumbnail for a link.
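
The difference between native and rendered geometry can be sketched as follows. This is a minimal illustration, assuming a hypothetical native size of 500 x 200 pixels and hypothetical names (ImageGeometry, rendered_geometry). It follows the usual browser behavior in which a single width or height attribute scales the other dimension proportionally.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ImageGeometry:
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height

    @property
    def aspect_ratio(self) -> float:
        return self.width / self.height

def rendered_geometry(native: ImageGeometry,
                      tag_width: Optional[int] = None,
                      tag_height: Optional[int] = None) -> ImageGeometry:
    """Apply the IMG tag's width/height attributes to the native size."""
    if tag_width is None and tag_height is None:
        return native
    if tag_height is None:
        # Only width given: scale height to keep the native aspect ratio.
        tag_height = round(native.height * tag_width / native.width)
    elif tag_width is None:
        # Only height given: scale width to keep the native aspect ratio.
        tag_width = round(native.width * tag_height / native.height)
    return ImageGeometry(tag_width, tag_height)

native = ImageGeometry(500, 200)                 # assumed native size
main = rendered_geometry(native)                 # used as the main comic
thumb = rendered_geometry(native, 100, 40)       # used as a thumbnail
print(main.area, main.aspect_ratio)
print(thumb.area, thumb.aspect_ratio)
```

Both renderings share an aspect ratio of 2.5, but their areas differ by a factor of 25, which is why area and aspect ratio carry different information.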

Image placement on a page can be useful in determining its usage. For a typical comics web page, the main comic is located in the top, left part of the page. For pages with multiple comic images, the latest comic is typically above the older comics.

Although the IMG tags that specify the typical comics image do not use the alternate text attribute, ALT, this attribute is used on some comics pages. The ALT text might specify the name and date of the comic or it might say "advertisement" for an advertisement. Text surrounding the IMG tag can also be helpful for differentiating comics from other images.
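
One way to collect ALT text is sketched below with Python's standard html.parser module. The class and function names, the sample markup, and the keyword check are assumptions made for illustration. The source does not describe how the Comics Miner actually reads or scores ALT text.

```python
from html.parser import HTMLParser

class ImgAltCollector(HTMLParser):
    """Collects (src, alt) pairs for every IMG tag in a page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "img":
            attr_map = dict(attrs)
            self.images.append((attr_map.get("src"), attr_map.get("alt")))

def looks_like_ad(alt):
    """Crude hint: True when the ALT text mentions an advertisement."""
    return bool(alt) and "advert" in alt.lower()

# Hypothetical page fragment.
page = """
<IMG SRC="strip.png" ALT="Example Comic for 2012-01-15">
<img src="banner.png" alt="Advertisement">
<img src="spacer.gif">
"""

collector = ImgAltCollector()
collector.feed(page)
for src, alt in collector.images:
    print(src, repr(alt), looks_like_ad(alt))
```

An image with no ALT attribute yields None, which matches the observation that the typical comic image carries no ALT text at all.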

The remainder of this description concentrates on the use of the Comics Miner in producing a personalized daily comics page where the user is interested in the comic of the day. By relaxing some of the behaviors for this task, the software can also be used for more general purposes such as building a library of comic images for image research.

Figure 1 gives the architecture of the Comics Miner software. The system can extract images either from live web pages or from a library of comic and non-comic images created by crawling various web pages. For web images, the next stage discards small images under 62,000 square pixels. It is extremely rare to find a comic below this area unless it is being used as a thumbnail. This pre-filtering reduces the computational burden of examining all the small images, such as spacers and small advertisements, that occur on the typical web page.
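
The area pre-filter can be sketched in a few lines. Only the 62,000 square pixel threshold comes from the description above. The record layout and the sample images are hypothetical.

```python
MIN_AREA = 62_000  # square pixels, the threshold stated in the text

def prefilter(images):
    """Keep only images whose rendered area reaches the minimum."""
    return [img for img in images
            if img["width"] * img["height"] >= MIN_AREA]

# Hypothetical images found on one page.
page_images = [
    {"src": "spacer.gif", "width": 1, "height": 1},
    {"src": "banner.png", "width": 468, "height": 60},
    {"src": "strip.png", "width": 900, "height": 280},
]

for img in prefilter(page_images):
    print(img["src"])
```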

Figure 1 - Comics Miner Architecture

A constraint-based reasoning area filter follows the area pre-filter. It works in conjunction with the following stages. The image area constraint is set so that only images that are large enough to likely be a comic image are considered. The next stage attempts to classify images as either comics or other images. If the classifier finds no comic image, the area constraint is relaxed, and more images from the page are processed by the classifier.
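
The relaxation loop described above might look like the following sketch. The starting threshold, the relaxation factor, and the names are assumptions, since the text gives only the 62,000 square pixel floor. The classify argument stands in for the trained classifier.

```python
def find_comics(images, classify,
                start_area=200_000,   # assumed initial constraint
                floor_area=62_000,    # pre-filter threshold from the text
                relax=0.5):           # assumed relaxation factor
    """Classify large images first; relax the area constraint if none pass."""
    threshold = start_area
    while True:
        candidates = [img for img in images if img["area"] >= threshold]
        comics = [img for img in candidates if classify(img)]
        if comics or threshold <= floor_area:
            return comics
        # No comic found: admit smaller images on the next pass.
        threshold = max(floor_area, threshold * relax)

# Hypothetical page: a large banner and a smaller comic strip.
images = [
    {"name": "banner", "area": 250_000, "is_comic": False},
    {"name": "strip", "area": 90_000, "is_comic": True},
]

found = find_comics(images, classify=lambda img: img["is_comic"])
print([img["name"] for img in found])
```

For simplicity this version reclassifies images it has already rejected. A fuller implementation would likely remember earlier classifier results.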

The image classifier uses some of the attributes outlined in the previous section, such as the grayscale histogram and ALT attribute text, to determine if an image is a comic or a non-comic image. Both artificial neural networks and support vector machines have been used in the Comics Miner. These classifiers are trained using images in a library of images extracted from comics pages, news sites, and commerce sites.

If there are multiple images that pass the classifier filter, a knowledge-based filter selects the most likely image to be the comic of the day. The rules used in this stage are based on the relative locations of the images.
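
A location rule of this kind could be as simple as preferring the image nearest the top of the page, with ties broken toward the left. The sketch below shows that single rule as an assumption. The actual rule set is not given here, and the names and coordinates are hypothetical.

```python
def pick_comic_of_the_day(comics):
    """Choose the top-most candidate; break ties by the left-most position."""
    if not comics:
        return None
    return min(comics, key=lambda img: (img["y"], img["x"]))

# Hypothetical candidates that all passed the classifier.
candidates = [
    {"src": "yesterday.png", "x": 40, "y": 900},
    {"src": "today.png", "x": 40, "y": 220},
    {"src": "two-days-ago.png", "x": 40, "y": 1580},
]

print(pick_comic_of_the_day(candidates)["src"])
```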

The current incarnation of the software has an accuracy in excess of 98% when used to crawl a number of web comics pages and news and commerce pages. A typical run crawls 68 pages: 61 comics pages on 27 sites, 4 commerce sites, and 3 news sites. There are over 3000 images on these pages; all but about 220 are discarded due to their small size. The accuracy is computed as the number of images marked correctly as comics or non-comics out of the 220 or so images considered.
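
The accuracy computation is a simple ratio, sketched below with hypothetical function names. The second function works out what "in excess of 98%" implies for roughly 220 images. Since 220 is itself approximate, the resulting error count is only indicative.

```python
def accuracy(predicted, actual):
    """Fraction of images whose comic/non-comic label was predicted correctly."""
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def max_errors(n_images, min_accuracy):
    """Largest number of mistakes that still keeps accuracy above the bound."""
    errors = 0
    while (n_images - (errors + 1)) / n_images > min_accuracy:
        errors += 1
    return errors

# Tiny hypothetical run: True means "comic".
print(accuracy([True, False, False, True], [True, False, True, True]))
# With about 220 images considered, accuracy above 98% allows few mistakes.
print(max_errors(220, 0.98))
```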

Update (1/2016)

A simple analysis of the effects of the feature set on the quality of the comics classifier shows that the grayscale statistics, the IMG ALT attribute value, and page structure add little to selecting comics. The page location of an image, its area, and its height and width contribute significantly to the classifier.

Update (1/2017)

One of the major comics sites changed its layout to handle adaptive image sizes to support a range of devices with varying screen real estate. The Comics Miner was using very small browser windows, resulting in a small image size being reported by the embedded browser when the <IMG> tags were embedded in a <PICTURE> tag. This change invalidated much of the training used by the Comics Miner. Larger browser windows were therefore used, and the system was retrained.

Update (10/2025)

The Comics Miner has worked well for almost a decade with occasional retraining. On 4/1/2025, gocomics.com, from which I scrape a number of comics, changed its format. The old embedded Chromium no longer found all the page images, including the main comic.

When I first started using CEF (Chromium Embedded Framework), the Java bindings didn't seem mature, so I wrote my scraper in C#. Using C# impacted the portability of the system. The current Java binding is probably more mature, but I decided to replace CEF with Selenium. Selenium works with Java and multiple browsers and runs on multiple operating systems. Both Firefox and Chrome can be used with the Comics Miner.
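
Collecting rendered image sizes and positions through Selenium might look like the sketch below. It uses Selenium's Python bindings purely for illustration; the Comics Miner itself is described as using Java. The window size, function names, and record layout are assumptions. The selenium package is imported lazily so that collect_images can be exercised with any object offering the same methods.

```python
def open_browser(width=1600, height=1200):
    """Start Firefox with a large window (the size is an assumed value)."""
    from selenium import webdriver  # requires the selenium package
    driver = webdriver.Firefox()
    driver.set_window_size(width, height)
    return driver

def collect_images(driver):
    """Return rendered geometry and attributes for every IMG element."""
    records = []
    # "tag name" is the locator string behind Selenium's By.TAG_NAME.
    for element in driver.find_elements("tag name", "img"):
        size = element.size          # {"width": ..., "height": ...}
        location = element.location  # {"x": ..., "y": ...}
        records.append({
            "src": element.get_attribute("src"),
            "alt": element.get_attribute("alt"),
            "x": location["x"],
            "y": location["y"],
            "width": size["width"],
            "height": size["height"],
        })
    return records

# Typical use (not run here, since it needs a browser and driver installed):
#   driver = open_browser()
#   driver.get("https://example.com/")
#   images = collect_images(driver)
#   driver.quit()
```

Because the browser reports rendered sizes, the window size chosen in open_browser affects the features seen by the classifier, as the 1/2017 update notes.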

Articles

Automatic Recognition and Extraction of Web Comics by Ben E. Cline. This article describes the state of the Comics Miner system at the beginning of 2012.
Automatic Extraction of Web Comics by Ben E. Cline. This 2011 article describes an earlier version of the system called the ComicsExplorer. That version used an ANN instead of an SVM and was not as accurate as the Comics Miner.