OCR (Optical Character Recognition) technology converts a printed copy of a document into an electronic version. For example, a multi-page document is scanned into a TIFF file, the file is loaded into an OCR program that recognizes the text, and the result is saved as an editable file. Some applications let you scan pages and convert their content into a document in one step.
Although the technology was originally developed to recognize printed characters, it can also be used for handwritten text. For example, postal services such as USPS use OCR software to process letters and parcels automatically by reading the address.
OCR Applications
OCR is a widespread technology for recognizing text inside images such as scanned documents and photographs. It is used to convert almost any type of image containing typed, handwritten or printed text into machine-readable text data.
OCR became popular in the early 1990s, when it was used in attempts to digitize historical materials. Since then the method has improved significantly, and on clean printed sources it now achieves near-perfect recognition accuracy. Advanced techniques, such as Zonal OCR, are used to automate complex workflows by converting typewritten text into digital documents. After the scanned material has been processed, the text can be edited in word processors such as Microsoft Word or Google Docs.
Before this technology appeared, manual retyping was the only way to digitize printed documents. This not only took a lot of time but also introduced inaccuracies and errors into the copy. OCR often works as a “hidden” technology inside many well-known systems and services, including data entry automation, indexing for search engines, automatic license plate recognition, and assistance for blind and visually impaired people.
The process of determining text accuracy
Each step of the OCR process affects the accuracy of the final text. It begins with the printed document itself. If it has marks, spots or poor contrast, the recognition software will make mistakes and the result will be incorrect. To avoid these problems, start from a cleaner copy of the printed original.
The first step is to scan the printed text. OCR software works with image files, so a scanner or a good digital camera is used to create clear images of the documents. It is better to convert scanned files to black and white, because recognition works on a binary image: black pixels are treated as text, and white pixels act as the background.
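The black-and-white conversion described above can be illustrated with a minimal sketch. It assumes the scan is already available as rows of grayscale values from 0 (black) to 255 (white). The function name binarize and the fixed threshold of 128 are illustrative assumptions, not taken from any particular OCR program, which may well choose its threshold differently.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 1 = ink, 0 = background.

    The threshold value is an assumption chosen for illustration.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


# A tiny made-up 3x4 "scan": dark pixels form a rough vertical stroke.
page = [
    [250, 245, 30, 240],
    [248, 20, 25, 243],
    [251, 249, 35, 247],
]

bw = binarize(page)
for row in bw:
    # Show ink as "#" and background as "." for a quick visual check.
    print("".join("#" if bit else "." for bit in row))
```

Pixels darker than the threshold become text and everything else becomes background, which is why marks and poor contrast on the original lead directly to recognition errors.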
The second step is to identify the characters. The speed of this process depends on the OCR program used. Most programs analyze each element one by one. The main purpose of the application is to identify characters, but good programs recognize not only text but also tables and other layout elements.
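One simple way to picture character-by-character identification is template matching: each isolated glyph is compared with stored reference shapes, and the closest one wins. The sketch below is only an illustration under that assumption. The 3x3 bitmaps and the names TEMPLATES, hamming and identify are hypothetical, and real engines rely on much richer features than this.

```python
# Hypothetical 3x3 reference bitmaps ("1" = ink, "0" = background).
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}


def hamming(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def identify(glyph, templates=TEMPLATES):
    """Return the template name with the fewest differing pixels."""
    return min(templates, key=lambda name: hamming(glyph, templates[name]))


# A "T" with one noisy pixel in the bottom-right corner.
noisy_t = ("111", "010", "011")
print(identify(noisy_t))
```

A single stray pixel still leaves the glyph closest to the right template, but heavier noise can tip the match to the wrong character.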
The process is not ideal, as many factors affect accuracy. Programs designed for optical character recognition are reviewed below, and it is up to the user to choose the most suitable one. Many OCR programs have built-in spell checkers and highlight misspelled words. Some are sophisticated enough to flag inconsistent wording and grammatical errors, so the user only has to make the necessary corrections.
The last step is to save the finished document in the desired format. If the application does not produce the required format, you can use one of the numerous free online converters.
Optical technology for braille
Optical Character Recognition (OCR) technology gives blind or visually impaired people the ability to scan printed text and have it read aloud. The system uses speech output and can also display the information on a braille display.
There are three main elements of optical character recognition systems: image acquisition, text recognition and reading. First, the printed document is captured by the camera; then the OCR software converts the image into recognized characters and words; finally, the synthesizer in the system reads the recognized text aloud or sends it to the braille display. The information can be stored in electronic format on the device running the OCR software, or in the memory of a stand-alone device.
The process takes into account the logical structure of the language. The system will conclude, for example, that a misread word such as “thls” at the beginning of a sentence is a mistake and should be read as “this”. It uses a vocabulary and applies spell-checking methods similar to those used in many text editors.
All OCR systems create temporary files containing the recognized characters and the page layout. On some systems, these can be converted to formats that can be opened with widely used computer applications such as text editors, spreadsheets, and databases.
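The vocabulary-based correction described above can be sketched with Python's standard difflib module. This is a minimal illustration, not the method of any specific OCR product: the word list, the cutoff value and the function name correct_word are assumptions made for the example.

```python
import difflib

# Hypothetical vocabulary; a real system would use a full dictionary.
VOCABULARY = ["the", "this", "text", "scanner", "page"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.6):
    """Replace a word missing from the vocabulary with its closest match.

    Words that are already known, or that have no sufficiently
    similar entry, are returned unchanged.
    """
    if word.lower() in vocabulary:
        return word
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# "thls" stands in for a misread "this" (an "i" recognized as "l").
print(correct_word("thls"))
```

Raising the cutoff makes the correction more conservative, at the cost of leaving more misread words untouched.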
Selection of programs for text recognition
It is recommended that you choose text recognition software deliberately. It is best to run your own tests or to take into account the opinions of experienced users.
Testing is carried out taking into account the following factors:
- Accuracy is what distinguishes good OCR from bad. However, it is unrealistic to expect 100% accuracy from a text recognition application. Factors such as the quality of the original documents and the resolution of the image significantly affect the final result. Good OCR programs reach 98% accuracy when using a modern scanner and a source document in satisfactory condition.
- Multilingualism. Most programs today have this property. OCR examines each character individually to identify it. If the program is designed to recognize only English letters, it cannot accurately interpret special characters such as accented letters, for example "é". Such software will represent these characters with the closest English equivalent. When using an application that supports multiple languages, specify the language of the document to ensure recognition accuracy.
- Handwriting support. Text created using a keyboard is easily recognized by any program. Handwriting, however, is a completely different recognition problem. Handwriting varies widely from person to person: some people write carefully, but most handwriting is not legible enough. Only high-quality OCR programs can cope with handwriting, so archiving handwritten material requires software designed for handwriting.
- Automation level. OCR can be run automatically or interactively. If you need to process many pages at once, it is better to consider automatic programs. With this function you can convert documents in a few clicks while doing other tasks, and the resulting PDF, TXT or DOC file is easy to find. Most free text recognition programs offer only limited automation.
- Layout preservation. The main purpose of these programs is to convert text into electronic form, and some do not preserve the layout of the original document, so the final version needs lengthy editing. A good program keeps the original layout, so the final copy requires only minor editing. Such applications preserve columns, tables, and graphics as in the original.
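When running your own tests, the accuracy factor from the list above can be measured by comparing OCR output with a hand-checked reference text. The sketch below shows one possible way to do it, assuming character accuracy is defined as one minus the edit distance divided by the reference length. That definition and the function names are illustrative choices, not a standard prescribed by any of the programs discussed here.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute (or keep) a character
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference, recognized):
    """Share of reference characters recognized correctly, from 0.0 to 1.0."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


# Made-up example: the final "l" was misread as the digit "1".
score = character_accuracy("optical", "optica1")
print(f"{score:.1%}")
```

Running the same reference pages through several candidate programs and comparing the scores gives a like-for-like basis for the choice.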
Popular Mobile Software
OCR is great for transferring text from physical sources directly to a digital document. There are various types of programs and applications for desktop and mobile devices. They vary in price and have their own key distinguishing features.
Most popular Android scanners:
- Office Lens - Provides page scanning and OCR for Android users for free. To convert, you need an Internet connection.
- PDF scanners (for example, ABBYY TextGrabber, CamScanner, MDScan, OCR Instantly) - perform scanning followed by OCR. There are no restrictions on the number of scanned pages in the software and no watermarks.
- Online OCR. This web service is simple and convenient to use. It supports 46 languages, the output document is limited to 5 MB, and the result is easy to convert to Microsoft Word, Excel or plain text format. After registration, you can convert multi-page PDF, RTF and Excel files of up to 100 MB in size. For large recognition volumes there is a paid version.
Google Docs
For those already familiar with Google Docs, the OCR built into Google Drive is an option. For best results, the text in the source should be in a common font such as Arial or Times New Roman. You can improve the result by making sure the scanned image has uniform illumination and clear contrast. Images can be processed as individual files (JPG, PNG, GIF) or as multi-page PDF documents. The feature supports most languages.
Google offers many tutorials and cloud processing capabilities, although many users feel that the service lacks sufficiently advanced functions and options. If you use the Google Drive app for Android, you can scan pages directly from the app with the camera on your smartphone. Otherwise, upload documents obtained from a scanner connected to a computer, or in any other way, to start recognition in Google Drive. For individuals, Google Drive offers a free storage tier of 15 GB, expandable to 100 GB via Google One for $1.99 per month.
ABBYY Optical Recognition
ABBYY FineReader has been a document-processing tool for a long time. It is a complete solution for both business and home users. It provides all the functions needed to extract text from scans with full readability and to keep digitized materials neatly organized. In addition to recognizing text and converting it to PDF, Microsoft Office or other formats, the program can also compare documents and add annotations and comments.
ABBYY FineReader can convert material in batch mode, supports many output formats, and recognizes 192 languages. There are companion mobile applications for when you need a quick scan from your phone.
The software is not the most modern, but it is simple, functional and does an excellent job. It has a strong reputation as one of the best options in the field of optical character recognition. A free trial version is available. A standard one-time perpetual license costs from $199.99.
If that seems expensive, a good alternative is the online version of ABBYY FineReader. It is limited to 10 pages per month but comes with all the other premium features. Registration is required for access. It supports many input file formats, and you can choose the output format, such as PDF, Word, Excel, PowerPoint or EPUB.
Adobe Acrobat Cloud Service
Adobe Acrobat meets all the requirements and offers an impressive list of features and options, although the price is slightly steeper than that of competitors. For the full set of OCR features, choose the Pro version of Adobe Acrobat DC. DC stands for “Document Cloud”, and the product integrates cleanly with Adobe’s cloud solution if you need to access your files from any computer. There is also simple, seamless integration with other Adobe services, such as Photoshop.
A user who pays for the Pro version of Adobe Acrobat DC gets all the text recognition tools, the ability to add comments and reviews to the content, a specialized service for scanning tables, and the ability to quickly compare two documents. Materials can be edited directly on screen a few seconds after they are scanned.
The Adobe brand implies a certain level of quality, and users are impressed by the intuitiveness and capabilities of Adobe Acrobat DC. Subscription to the service starts at $12.99 per month.
The best free software
Free OCR to Word is one of the better free optical character recognition programs. Tesseract is one of the most powerful engines for this type of software and is considered one of the most accurate. The program supports several image formats, including multi-page TIFF. It can be used completely free of charge to extract text from the supplied images.
The Tesseract engine was originally developed at Hewlett-Packard Labs between 1985 and 1994. In 1995 it ranked among the top three recognition engines, and some further changes were made to it in 1996. It works on Windows, Linux, and Mac OS X. FreeOCR can process images containing multi-column and multi-language text. It handles PDF files, supports TWAIN devices such as scanners, and has a familiar two-pane interface whose settings are easy to understand.
Free OCR to Word can save a lot of time by removing the need to retype existing text. The program takes a document, a scan or an image and converts it into readable, editable text. The software can be downloaded for free and saves its output in Word format. OCR to Word is optimized for all types of scanners and claims a 98% accuracy rating. It has a modern interface that makes all tasks easy to access, and rotation functions for when an image is not oriented correctly on screen. The software also extracts text from images captured with smartphones or digital cameras with high accuracy.
Linux character recognition
OCRFeeder provides a convenient graphical interface for Linux. It is essentially a front end for a set of image, OCR, and text tools, with features such as printing and spell checking. It does not read characters itself but calls other OCR applications configured as “recognition engines”. It has predefined settings for Tesseract, CuneiForm, GOCR and Ocrad.
The user only needs to install one or more chosen engines in Ubuntu and then locate them in the OCRFeeder settings. You can add other engines and change these settings manually. Several different engines can be configured in one application. The main OCRFeeder window lets you choose on the fly which engine to use for a specific area, and there is also a setting for the default engine. To select the language of the text being read with Tesseract or CuneiForm, add the “-l” switch with the appropriate language or script code to the settings of that engine, for example “-l pol” for Polish or “-l dan-frak” for Danish Fraktur.
The Tesseract optical character recognition engine could at first recognize only English text; version 2.x made it multilingual. If necessary, you can install more than one language dictionary. Newer versions identify languages by ISO 639-2 codes.
After successful installation, use the command "tesseract IMAGE_PATH OUTPUT_BASE", where IMAGE_PATH is the path to the image and OUTPUT_BASE is the base name of the output file. Tesseract automatically gives the output document the extension ".txt". You can also specify the option "-l" followed by the language code. For versions of Tesseract earlier than the third, it is very important that the image is in Tagged Image File Format (TIFF) and has the extension ".tif" and not ".tiff". The command line should look like this: "$ tesseract ~/input.tif output".
Here "input.tif" is the document to convert, located in the home folder, and "output" is the base name of the file that Tesseract will create as "output.txt". A PDF must first be converted to TIFF, for example with ImageMagick, before it is passed to Tesseract.
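The command line described above can also be assembled from a script, which is convenient when many files have to be processed. The sketch below only builds and prints the command. The helper name tesseract_command is hypothetical, and the positional arguments and the "-l" option are used exactly as described in the text.

```python
import shlex


def tesseract_command(image_path, output_base, language=None):
    """Build the argument list for a basic Tesseract run (hypothetical helper)."""
    command = ["tesseract", image_path, output_base]
    if language:
        # "-l" selects the language code, for example "pol" for Polish.
        command += ["-l", language]
    return command


cmd = tesseract_command("input.tif", "output", language="pol")
print(shlex.join(cmd))

# To actually run the command (requires Tesseract to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids quoting problems with file names that contain spaces.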
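CuneiForm is an OCR system developed by Cognitive Technologies. It was created for Windows, and the Windows version can be run under Wine. A Linux version is available on Launchpad, and CuneiForm can be used with OCRFeeder.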
Pdfocr is a script that runs OCR on multi-page PDF files and embeds the result back as a searchable text layer. It can use Tesseract or CuneiForm as the recognition engine. The script itself can be obtained from GitHub or from a PPA. To run it, enter in the terminal: "pdfocr -i input.pdf -o output.pdf".
OCR technology does not stand still: the next stage is intelligent character recognition (ICR), a more advanced approach. Most ICR systems include a self-learning component, a neural network, which automatically updates the recognition database with new handwriting patterns. This extends the usefulness of scanning devices for document processing from printed text recognition (the OCR function) to handwritten materials, and ICR can achieve more than 97% accuracy when reading handwriting in structured forms.