
Google re-releases open source OCR software

by Matt Chapman

05 Sep 2006


Google has re-released an open source version of optical character recognition (OCR) software originally produced by HP.

The Tesseract program was developed by HP between 1985 and 1995 and, in its final year, ranked among the top three OCR packages in a competition organised by the University of Nevada, Las Vegas (UNLV).

Google said in a statement that, although some people might wonder why the search giant was interested in OCR technology, it fitted in with the company's plans to make information available online.

"We are all about making information available to users, and when this information is in a paper document, OCR is the process by which we can convert the pages of this document into text that can then be used for indexing," said Eric Case on the official Google Code blog.
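
To illustrate the indexing step Case describes, here is a minimal Python sketch of how text recovered from scanned pages might be turned into a searchable index. It is an assumption-laden illustration, not Google's method: the build_index function and the sample page text are hypothetical, and the OCR step itself is taken as already done.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the sorted page numbers that contain it."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        # Split on anything that is not a letter or digit.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


# Hypothetical OCR output keyed by page number (not real Tesseract output).
ocr_pages = {
    1: "Tesseract was developed by HP",
    2: "HP released the code to UNLV",
}

index = build_index(ocr_pages)
print(index["hp"])            # page numbers whose text contains "hp"
print(index.get("unlv", []))  # empty list if the word never appears
```

A lookup in such an index returns the pages on which a word appears, which is the basic operation a search engine needs once paper documents have been converted to text.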

HP stopped working on Tesseract in 1995 and released the code to the Information Science Research Institute at UNLV a couple of years ago so that it could be developed for open source. 

"UNLV was happy to oblige, but they asked for our help in fixing a few bugs that had crept in since 1995 (ever heard of bit rot?)," wrote Case.

"We tracked down the most obvious ones and decided a couple of months ago that Tesseract OCR was stable enough to be re-released as open source."

Google originally chose to keep the launch low-profile, but today's announcement includes an advert for engineers to work on the project.

The software currently supports only English, does not include a page layout analysis module, struggles with greyscale and colour documents, and will not match the accuracy of the best commercial OCR packages currently available.

"Yet, as far as we know, despite its shortcomings, Tesseract is far more accurate than any other open source OCR package out there," wrote Case.
