European language determination from image

Abstract
The authors have developed a technique for determining the language of a document from an image of its text. This work is restricted to a small subset of European languages, but uses techniques that should be applicable to many more. The method first makes generalizations about images of characters, then performs gross classification of the isolated characters and agglomerates these class identities into spatially isolated (word) tokens. Analysis of corpora in English, French and German yields training data for a language classifier designed to codify the spatial relationships of the connected components that compose the letter-forms. Linear discriminant analysis provides the classification criteria against which the test data are evaluated. The resulting process takes in images of text and produces a language classification based on image representations and generalizations about the relative frequency of token shapes in the target languages.
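The token-building step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it works on plain text as a stand-in for character images, and the shape classes (`A`, `x`, `g`, `i`) and the function names are assumptions made for this example, since the abstract does not list the actual classes. The linear discriminant analysis stage is omitted; the sketch only produces the relative token shape frequencies that such a classifier could be trained on.

```python
from collections import Counter

# Hypothetical coarse shape classes; the abstract does not specify the real ones.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")
DOTTED = set("ij")


def shape_class(ch):
    """Map one character to a coarse shape code (assumed scheme)."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"  # tall: rises above the x-height
    if ch in DESCENDERS:
        return "g"  # drops below the baseline
    if ch in DOTTED:
        return "i"  # body plus a detached dot (two connected components)
    if ch.isalpha():
        return "x"  # fits within the x-height
    return ""       # punctuation is dropped in this sketch


def word_tokens(text):
    """Turn each whitespace-separated word into a word shape token."""
    tokens = []
    for word in text.split():
        token = "".join(shape_class(c) for c in word)
        if token:
            tokens.append(token)
    return tokens


def token_frequencies(text):
    """Relative frequency of each word shape token in the text."""
    counts = Counter(word_tokens(text))
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}
```

Under this assumed scheme, `word_tokens("the dog")` gives `["AAx", "Axg"]`. Frequency tables computed this way over English, French and German corpora would be the kind of feature a discriminant classifier could separate, because short, frequent function words differ in shape from one language to the next.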