OCR
Introduction
Optical Character Recognition (OCR), less precisely described as image-to-text conversion, is an almost barren field on Linux. The shortage of open-source solutions is surpassed only by the lack of commercial products.
Since OCR is not well developed in open source and is covered only by a few, mostly experimental, programs, this guide is of limited use for now. One may hope that it will grow more useful over time. OCR itself is done via some of the following steps:
- Segmentation (cutting the image into images of each line/character)
- Recognition by features, or by discrete Fourier transformation and comparison with a database
- Comparison with trigrams (three-letter sequences) or a dictionary.
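The segmentation step above can be illustrated with a minimal sketch. It assumes a binarized image stored as a list of rows, where 1 is ink and 0 is background, and it cuts wherever a row (or column) is completely empty. The function names are hypothetical and are not taken from any of the programs discussed here; real engines use considerably more robust methods.

```python
def ink_runs(has_ink):
    """Return (start, end) index pairs, end exclusive, for each run of True values."""
    runs, start = [], None
    for i, flag in enumerate(has_ink):
        if flag and start is None:
            start = i                    # a run of ink begins
        elif not flag and start is not None:
            runs.append((start, i))      # an empty row/column ends the run
            start = None
    if start is not None:
        runs.append((start, len(has_ink)))
    return runs


def segment_lines(image):
    """Split a binary image into text lines at completely empty rows."""
    return ink_runs([any(row) for row in image])


def segment_chars(image, top, bottom):
    """Split one text line (rows top..bottom-1) into characters at empty columns."""
    columns = zip(*image[top:bottom])
    return ink_runs([any(col) for col in columns])


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
for top, bottom in segment_lines(page):
    print((top, bottom), segment_chars(page, top, bottom))
```

Touching or overlapping characters defeat this simple approach, which is presumably why gocr has a separate mode for dividing overlapping characters.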
While feature recognition has the huge advantage of not needing to be trained, it often does not achieve the results of a well-trained FFT comparison.
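To give an idea of what a Fourier-based comparison against a trained database involves, here is a heavily simplified sketch. It assumes each character has already been reduced to a one-dimensional column profile (ink pixels per column) of fixed length, uses a naive discrete Fourier transform rather than a real FFT, and picks the database entry whose magnitude spectrum is closest. All names and the tiny database are hypothetical; this is not how any particular OCR program implements it.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the naive discrete Fourier transform of a sequence."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def nearest_label(profile, database):
    """Return the label whose stored profile has the closest magnitude spectrum."""
    target = dft_magnitudes(profile)
    return min(
        database,
        key=lambda label: math.dist(target, dft_magnitudes(database[label])),
    )


# Hypothetical "trained" database: label -> column profile of a sample glyph.
database = {
    "I": [0, 0, 5, 0, 0, 0],
    "L": [5, 1, 1, 1, 0, 0],
}

# The same stroke as "I", shifted one column to the right.
print(nearest_label([0, 0, 0, 5, 0, 0], database))
```

The magnitude spectrum does not change when the profile is shifted cyclically, so a glyph that sits slightly off-centre in its box still matches its template. The quality of the result depends entirely on how well the database was trained.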
tesseract-ocr
By far the best; head and shoulders above gocr, xocr, ocrad, ocre, clara, ...:
http://code.google.com/p/tesseract-ocr
See also http://code.google.com/p/ocropus/
Installation
/etc/portage/package.keywords can now be a directory, which makes it much easier to manage. Simply put a file in it listing the packages to keyword.
# echo "app-text/tesseract ~x86" >> /etc/portage/package.keywords/tesseract
# emerge tesseract
# [ebuild R ] app-text/tesseract-2.00 USE="tiff" LINGUAS="-de -en -es -fr -it -nl" 0 kB
# *** You must select one of these LINGUAS variables, otherwise no dictionary/language information is downloaded! ***
gocr
gocr is another comparatively advanced free OCR program for Linux, and one might find it worth a try. gocr uses feature analysis, so no training is needed. The latest version has crude FFT comparison database capabilities, which are switched off by default. As of version 0.40 there is no trigram or dictionary comparison in gocr.
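Because gocr does no dictionary comparison itself, such a step could in principle be applied to its output afterwards. The following is only a sketch of the idea: it uses Python's standard difflib module as a stand-in for a real dictionary comparison, and the function name, word list, and cutoff value are assumptions chosen for illustration.

```python
import difflib


def correct_words(text, dictionary, cutoff=0.75):
    """Replace each unknown word by its closest dictionary entry, if one is close enough."""
    corrected = []
    for word in text.split():
        if word.lower() in dictionary:
            corrected.append(word)       # already a known word
            continue
        matches = difflib.get_close_matches(
            word.lower(), dictionary, n=1, cutoff=cutoff
        )
        # Keep the original word when nothing in the dictionary is similar enough.
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


words = ["optical", "character", "recognition"]
print(correct_words("optlcal charactcr recognition", words))
```

A cutoff that is too low will "correct" proper names and numbers into unrelated dictionary words, so the value needs tuning for the material at hand.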
Installation
emerge gocr
or just copy the gocr executable (gocr.exe) to any directory in your PATH.
Usage
gocr filename.pnm
will analyze the image and output the text. It should even be able to recognize barcodes, though I have not tested this myself.
Man page
gocr [options] pnm_file_name # use - for stdin
options:
-h - get this help
-i name - input image file (pnm,pgm,pbm,ppm,pcx,...)
-i - - read PNM from stdin (djpeg -gray a.jpg | gocr -)
-o name - output file (redirection of stdout)
-e name - logging file (redirection of stderr)
-x name - progress output (file or fifo)
-p name - database path (including final slash, default is ./db/)
-f fmt - output format (ISO8859_1 TeX HTML XML UTF8 ASCII)
-l num - threshold grey level 0<160<=255 (0 = autodetect)
-d num - dust_size (remove smaller clusters, -1 = autodetect)
-s num - spacewidth/dots (0 = autodetect)
-v num - verbose [summed]
    1 print more info
    2 list shapes of boxes (see -c)
    4 list pattern of boxes (see -c)
    8 print pattern after recognition
    16 print line infos
    32 debug outXX.pgm
-c string - list of chars (_ = not recognized chars, debug)
-C string - char filter (ex. hexdigits: 0-9A-Fx, only ASCII)
-m num - operation modes, ~ = switch off
    2 use database (early development)
    4 layout analysis, zoning (development)
    8 ~ compare non recognized chars
    16 ~ divide overlapping chars
    32 ~ context correction
    64 char packing (development)
    130 extend database, prompts user (128+2, early development)
    256 switch off the OCR engine (makes sense together with -m 2)
-n 1 only numbers
examples:
gocr -v 33 text1.pbm # some infos + out30.bmp
gocr -v 7 -c _YV text1.pbm # list unknown, Y and V chars
djpeg -pnm -gray text.jpg | gocr - # use jpeg-file via pipe
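The -v values in the help text above are "summed", meaning each listed number is one bit and the argument is their sum. The small sketch below decodes such a value. The table is copied from the help output; the function name is hypothetical and the script is not part of gocr.

```python
# Bit values and descriptions as listed under -v in the gocr help text.
VERBOSE_FLAGS = {
    1: "print more info",
    2: "list shapes of boxes",
    4: "list pattern of boxes",
    8: "print pattern after recognition",
    16: "print line infos",
    32: "debug outXX.pgm",
}


def explain_verbose(value):
    """List the verbose options that are switched on in a summed -v value."""
    return [name for bit, name in VERBOSE_FLAGS.items() if value & bit]


print(explain_verbose(33))  # 33 = 1 + 32, as in the first example above
print(explain_verbose(7))   # 7 = 1 + 2 + 4, as in the second example
```

The same reading applies to -m, where the help text itself spells out 130 as 128+2.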
Others
Other OCR engines are Clara OCR (one of the few trainable programs, which has been undergoing a rewrite since the end of 2003), ocre, and GNU ocrad. xplab and gamera are experimental and are not real OCR programs themselves, but toolkits meant to aid the later construction of OCR engines.
Among them, ocre seems to be the most developed, with ocrad close behind. None of them currently features trigram or dictionary comparison.
ABBYY has released the FineReader OCR engine for Linux. Since it is both closed source and relatively expensive, I have not tried it yet.
Vividata provides Optical Character Recognition and Image Processing software for Linux and UNIX environments for commercial usage, high-volume applications, and customized applications.
