Re: Text in images to a text file



On 29 Jun 2006 13:21:31 -0700, bobblebob@xxxxxxxxx staggered into the
Black Sun and said:
> Does anyone know of any software for Linux which converts text in
> images to a text file?

What you're looking for is called "Optical Character Recognition" and is
usually abbreviated to "OCR". There are two Free packages called gocr and
ocrad, but they both suck compared to the commercial stuff available for
Windows. Seriously. My test image was a 300 DPI black-and-white TIFF of a
professionally typeset page, very well scanned, no skew, ~2800 chars
on the page. Number of mis-recognized chars:

Old Typereader: 0
Old Omnipage: 2
gocr 0.39: >50
ocrad 0.10: >100

...and that's for text that's as good as it gets. If you have skew,
blotch, curl, weird fonts, or anything like that, performance goes down
*sharply*.
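
If you want to run the same kind of comparison on your own scans, here is
a minimal sketch in Python. It assumes you have a hand-proofed reference
text for the page and counts errors as the Levenshtein (edit) distance
between that reference and the OCR output. That metric and the function
name edit_distance are my assumptions for illustration; the counts above
were not necessarily tallied this way.

```python
def edit_distance(truth: str, ocr: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn `ocr` into `truth`."""
    # prev[j] holds the distance between the truth prefix handled so far
    # and the first j characters of the OCR output.
    prev = list(range(len(ocr) + 1))
    for i, t in enumerate(truth, start=1):
        curr = [i]
        for j, o in enumerate(ocr, start=1):
            cost = 0 if t == o else 1
            curr.append(min(prev[j] + 1,          # char missing from OCR output
                            curr[j - 1] + 1,      # extra char in OCR output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    truth = "Murgatroyd"
    ocr = "IVIurgatroycl"
    errors = edit_distance(truth, ocr)
    print(f"{errors} errors in {len(truth)} reference chars")
```

For a whole page you would read the two text files and pass their
contents in; dividing the result by the length of the reference text
gives a per-character error rate you can compare across engines.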

> I have loads of paper documents that I intend to scan. The contents of
> these documents need to eventually end up on a web site as text.

Even the best OCR engine available (Finereader? Latest Omnipage?) is
not perfect. If you require perfection, you'll need to proof every
single page by hand to catch the problems. If you can't proof
everything, you'll need to store the scanned images so people can still
read the text when your engine recognizes "Murgatroyd" as
"IVIurgatroycl". HTH anyway,

--
Matt G|There is no Darkness in Eternity/But only Light too dim for us to see
Brainbench MVP for Linux Admin / mail: TRAP + SPAN don't belong
http://www.brainbench.com / "He is a rhythmic movement of the
-----------------------------/ penguins, is Tux." --MegaHAL