OCR for World Wide Web images

Abstract
A significant amount of text now present in World Wide Web documents is embedded in image data, and a large portion of it does not appear elsewhere at all. To make this information available, we need to develop techniques for recovering textual information from in-line Web images. In this paper, we describe two methods for Web image OCR. Recognizing text extracted from in-line Web images is difficult because characters in these images are often rendered at a low spatial resolution. Such images are typically considered to be 'low quality' by traditional OCR technologies. Our proposed methods utilize the information contained in the color bits to compensate for the loss of information due to low sampling resolution. The first method uses a polynomial surface fitting technique for object recognition. The second method is based on the traditional n-tuple technique. We collected a small set of character samples from Web documents and tested the two algorithms. Preliminary experimental results show that our n-tuple method works quite well. However, the surface fitting method performs rather poorly due to the coarseness and small number of color shades used in the text.© (1997) COPYRIGHT SPIE--The International Society for Optical Engineering. Downloading of the abstract is permitted for personal use only.