Google has re-released an open source version of optical character recognition
(OCR) software originally produced by HP.
The Tesseract program was developed by HP between 1985 and 1995 and in its final
year ranked among the top three OCR packages in a competition organised by the
University of Nevada, Las Vegas (UNLV).
Google said in a statement that, although some people might wonder why the
search giant was interested in OCR technology, it fitted in with the company's
plans to make information available online.
"We are all about making information available to users, and when this
information is in a paper document, OCR is the process by which we can convert
the pages of this document into text that can then be used for indexing," said
Eric Case on the official Google Code blog.
HP stopped working on Tesseract in 1995 and released the code to the Information
Science Research Institute at UNLV a couple of years ago so that it could be
developed as open source.
"UNLV was happy to oblige, but they asked for our help in fixing a few bugs
that had crept in since 1995 (ever heard of bit rot?)," wrote Case.
"We tracked down the most obvious ones and decided a couple of months ago
that Tesseract OCR was stable enough to be re-released as open source."
Google originally chose to keep the launch low-profile, but today's announcement
includes an advert for engineers to work on the project.
The software currently supports only English, does not include a page layout
analysis module, struggles with greyscale and colour documents, and will not
match the accuracy of the best commercial OCR packages currently available.
"Yet, as far as we know, despite its shortcomings, Tesseract is far more
accurate than any other open source OCR package out there," wrote Case.