So you have spent many hours analyzing and creating the layouts and definitions for the documents you need processed through ABBYY.  Now you should be almost ready for production, except that you need to tune.  Many samples of the documents in question need to be run through, and the results checked very carefully, to find and fix all the little issues that will be present.

Tuning is less about finding bugs in your definitions than about finding the small differences among the printed documents being processed.  These differences may come from printing offsets: a preprinted form is run through a printer a second time to add the data you actually want to extract, and the two passes do not always line up.  In addition, there can be other cases where the Header or Footer elements are not extracted correctly.  All these differences can add up to ABBYY not detecting the correct document definition to apply to the scanned images.

To correct these issues, the results need to be analyzed very carefully in the Design Studio.  Import the document in question into the Studio and then process it.  Look carefully at what was missed.  Many times the cause is a Search Area that is not large enough to cover all the letters and numbers to be extracted.  Also, within a group, the required and optional flags have a lot to do with whether the group is found.  All it takes is one search element within the group that is not found and the entire group may be marked as not found, so be sure to check the flags carefully.
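The effect of the required and optional flags can be sketched roughly as follows.  This is a simplified Python model for illustration, not ABBYY's implementation: the `Element` class, the two functions, and the element names are all hypothetical, and the real engine weighs its hypotheses in ways this sketch ignores.

```python
from dataclasses import dataclass

@dataclass
class Element:
    name: str
    required: bool   # True if the group cannot match without this element
    found: bool      # whether the search located the element on the page

def group_found(elements):
    """A group counts as found only if every required element was found."""
    return all(e.found for e in elements if e.required)

def missing_required(elements):
    """List the required elements that caused the group to fail."""
    return [e.name for e in elements if e.required and not e.found]

# Hypothetical group: one required element was missed, one optional one too.
header = [
    Element("InvoiceLabel", required=True, found=True),
    Element("InvoiceNumber", required=True, found=False),
    Element("PurchaseOrder", required=False, found=False),
]
print(group_found(header))        # False
print(missing_required(header))   # ['InvoiceNumber']
```

The missing optional element does no harm, while the single missing required element sinks the whole group.  That is why a flag set to required by mistake is worth hunting for.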

With multiple Document Definitions, there will be times when a specific document matches some other definition instead of the one it should.  This can be caused by the error percentage on the wrong document definition being set too high when both document definitions share a similar field to extract.  To fix this, take the error percentage down a few points and try the recognition again.
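A toy model may help show why lowering the error percentage fixes the mismatch.  The Python below assumes that a definition accepts a page whenever the measured mismatch is within its allowed error; how FlexiCapture actually scores candidates is not covered here, and the function, definition names, and numbers are hypothetical.

```python
def matching_definitions(mismatch_pct, allowed_error_pct):
    """Return names of definitions whose mismatch is within their allowed error."""
    return [
        name
        for name, mismatch in mismatch_pct.items()
        if mismatch <= allowed_error_pct[name]
    ]

# Hypothetical mismatch percentages measured for one scanned page.
mismatch = {"TranscriptA": 12.0, "TranscriptB": 4.0}

# TranscriptA tolerates too much error, so both definitions accept the page.
loose = {"TranscriptA": 15.0, "TranscriptB": 10.0}
print(matching_definitions(mismatch, loose))   # ['TranscriptA', 'TranscriptB']

# Lowering TranscriptA's allowed error leaves only the intended definition.
tight = {"TranscriptA": 10.0, "TranscriptB": 10.0}
print(matching_definitions(mismatch, tight))   # ['TranscriptB']
```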

It takes a lot more effort to tune a document definition when dealing with multiple document definitions and with paper documents that are difficult to scan with enough clarity for the OCR engine to work properly.  This is especially true for transcript-type documents, where each transcript has its own copy protection mechanism that the scan software must try to compensate for.  It does work out in the end, so be prepared to spend the time and effort to get the document definitions to the point where they work most of the time.

Christopher J. Hillenburg
Senior System Engineer
ImageSource, Inc.

When researching Enterprise Content Management capture projects, the question of handwriting recognition comes up again and again — and many people aren’t sure what to expect.  More commonly, their expectations are unrealistic. They think there is no hope at all, ever. On the other end of the spectrum, some think that tiny fevered cursive scribblings from a rushed meeting can be scanned (or even faxed) and read with accuracy. In helping people think about their forms and the viability of capturing handwriting, I have a few simple guidelines to consider which seem to apply in a majority of cases.

  • Are handwritten forms really the only option?  If the form is available online, can it be made “fillable” and then submitted directly to your database tables?  Can you let the user fill the form online and print, thus producing machine print and eliminating handwriting?  How about taking the data that a user entered and bar coding it (if the form must be printed rather than submitted)?  Also helpful and sometimes overlooked:  prefilling form data from your database through a merge process, with a bar code index for retrieval of that same data.
  • Does your Capture software support ICR?  Intelligent Character Recognition (ICR) is what you need to read handwriting.  Optical Character Recognition (OCR) is much more common and is designed to read machine print.  Please don’t try to make it read handwriting – you won’t like the results!
  • Make sure the handwriting is constrained. Annoying? Perhaps. But making the person filling the form write in boxes sets you up for the most successful ICR results.  The catch phrase here could be “Curse the cursive”.  When a character is joined to another character it is faster to write.  However,  the ICR software really struggles to figure out where one character starts and another stops.  And here’s where recognition tanks.   With the real world example below, we can generally expect 100% recognition.

  • Ask for all caps handwriting. You can often tell your ICR engine to look for upper case characters only. This really increases accuracy. And when the form filler forgets to write AS IF SHOUTING, you can often get OK results anyway.
  • Show them how! I know it may seem condescending, but consider this a helpful reminder to those who would blow through the blocks in a mad dash. Show users an example of the way to write in constrained print fields.  And here’s where you can tell them to use all-caps, and show it in your example.

  • Use key index values and database lookups! If there is an employee number, unique phone number, SSN/TaxID, or other unique ID for the person filling the form, use it whenever you can. Then perform a database lookup to confirm identity and optionally populate any other fields that you may need that happen to exist already in your database.
  • Less is More. People burn out on filling lengthy forms using constrained print fields.  Try to minimize the amount they need to write, and careless handwriting will decrease.
  • Comb fields can work too. If you think all those constrained print boxes are just too hideous looking, try using comb fields instead. But remember, as soon as people ignore the combs and write cursively or sloppily, ICR results plummet.

  • Use Drop Out Colors for the boxes. If your scanner and ICR software support color dropout technology, you make the ICR engine’s job easier. The boxes aren’t recognized by the scanner, but the handwriting is. So now the constrained print box lines (which make sure each handwritten character is isolated in a target area) don’t have to be considered during ICR.
  • Use OMR bubbles if you really, really need a perfect index value from handwriting. OMR stands for Optical Mark Recognition.  Remember filling in page one of a standardized test?  This painful process might be worth it.  Since the engine just needs to confirm whether a bubble is filled or not, this is easier and more accurate than OCR or ICR.

  • Faxing? Well, OK. But recognition levels will go down.
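The key index and database lookup guideline above can be sketched as follows.  This is a minimal Python illustration using an in-memory SQLite database; the `employees` table, its columns, the sample row, and the `lookup_person` function are all hypothetical stand-ins for whatever your own system holds.

```python
import sqlite3

def lookup_person(conn, employee_id):
    """Return known fields for an ID read from the form, or None if not found."""
    row = conn.execute(
        "SELECT full_name, department FROM employees WHERE employee_id = ?",
        (employee_id,),
    ).fetchone()
    if row is None:
        return None
    return {"full_name": row[0], "department": row[1]}

# Hypothetical table and data, built in memory so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(employee_id TEXT PRIMARY KEY, full_name TEXT, department TEXT)"
)
conn.execute("INSERT INTO employees VALUES ('10482', 'PAT EXAMPLE', 'FINANCE')")

print(lookup_person(conn, "10482"))  # {'full_name': 'PAT EXAMPLE', 'department': 'FINANCE'}
print(lookup_person(conn, "99999"))  # None
```

Only the short ID has to survive ICR; the name and department come from the database.  A lookup that returns nothing is a useful signal that the ID was misread and the form should go to a person for review.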

With these hints in mind, you can look forward to results that are perhaps short of miraculous; that is, less accurate than OCR.  Even so, the results are still worthwhile and produce great time savings when properly implemented.  There are more tricks to describe, which I may save for a later blog.  Please contact ImageSource if you have any questions about capturing handwriting in forms.

First off, ABBYY means “keen eye”, an apt name for a product that dynamically and automatically captures and processes widely disparate documents.  Powerful document recognition separates and classifies documents, and state-of-the-art optical character recognition rips the data from the images.  I like the motto that pops up on screen: “take the data, leave the paper”.  I love doing just that, sending paper briskly off to start its next recycled life.  It’s the greenest thing to do, especially when compared to filling endless cabinets and long-term off-site storage facilities.

When you want to recommend, sell, support, and solve major customer problems with ECM software at ImageSource, due diligence mandates a thorough feature review and testing.  I’ll describe some of the steps I was involved with in this process for ABBYY FlexiCapture – but mine is but a single slice of the vet team pie.  Development teams and other engineering teams performed specific examinations to answer questions about integration, APIs, and more narrow capabilities to solve unique problems faced by eager customers.  Also, ImageSource staff with a variety of titles took a week-long training course with intensive labs.  Unfortunately I missed the class but was given the opportunity to spin up for a pre-sales demo last year, which was a lot of fun.

So here’s a peek at our process:

 Laptop Install

First things first!  I like to be able to run new software on my laptop whenever possible.  This frees me from all bandwidth and location constraints.  I can easily focus on the vet effort on a plane, down by the river, wherever and whenever.  ABBYY FlexiCapture has a convenient ‘Standalone Installation’ which gives you access to all the key components on one box.

 Obtain Sample Images from Client

In this case we gathered dozens of hardcopy invoices from a large international corporation.  The images were not pretty and included originals, copies, printed faxes, you name it.

 Ascertain Server Needs

After reviewing the ABBYY documentation we set the requirements for our labs – memory per server, disk space, software required, scan station requirements, scanner requirements, and required operating systems.

 Spin Up VMs

Thanks to Mike Peterson we had three servers up in no time.

Convening the Team, Locking Down the ‘War Room’

Gene Eckhardt, Jeff Doyle and I met in our Olympia office for a week.  Gene secured the war room where we periodically met with developers, project managers, engineers, and principals.  Most of the time it was the three of us banging away.

 Lab Software Install

Next we installed ILINX Capture on one server, ABBYY’s ‘Distributed Installation’ on another server, and SQL Server on the last.  This architecture would mimic what we’d encounter in the field; the standalone install wouldn’t cut it, as it doesn’t scale and it uses SQL Express as a support database.  As installed, we could easily add more servers for high-volume stress testing.  By running a WebEx all week we were able to record every moment of each day’s work, easily pass the focus from machine to machine, and give remote colleagues a view of what we were doing.  We involved ABBYY tech support when we had a question and felt we could speed up an installation process.  Turns out we could, and it was great to have the technician join our session without delay and see what was up.  Also, as we installed we meticulously kept a running log of any issues we encountered, however minor.  At the end of each day Gene led a review session where we discussed and polished the invaluable ‘Lessons’ doc.

 End-To-End Test

This was our ‘Hello World’ moment – we set up communication between ILINX Capture and ABBYY, and created an appropriate ILINX Capture workflow.  Then we created a simple FlexiLayout, exported it, imported it into FlexiCapture, and created a document definition and an export.  We configured the scanner and the scan station and established we had end-to-end connectivity.

 Building Generic FlexiLayouts

One of the many goals of our week was to share baseline knowledge as well as advanced techniques for capturing documents.  We identified  two forms that were relatively easy to identify  and constituted a large amount of the total paper volume.  In short order we had FlexiLayouts and document definitions configured.  Then it was time to tweak and refine.  The ability to chain elements together worked outstandingly – find a keyword, then find the nearest zip code with the help of regular expressions.  Then using out-of-the-box settings we could  find the state, city, address, and addressee.   Wow, powerful.
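The chaining idea (find a keyword, then find the ZIP code closest to it) can be illustrated outside of FlexiLayout Studio with a few lines of Python.  This is only a sketch of the technique on plain text, assuming US ZIP codes and taking "nearest" to mean the first ZIP code after the keyword; the function name and sample page are hypothetical, and it does not use the FlexiLayout language.

```python
import re

# Five-digit ZIP code with an optional four-digit extension.
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")

def zip_after_keyword(text, keyword):
    """Find the keyword, then return the first ZIP code that follows it."""
    start = text.lower().find(keyword.lower())
    if start == -1:
        return None
    match = ZIP_RE.search(text, start + len(keyword))
    return match.group(0) if match else None

page = (
    "Invoice 00123 dated 2011-03-04\n"
    "Remit To: Example Supply Co\n"
    "100 Main St\n"
    "Olympia, WA 98501-1234\n"
)
print(zip_after_keyword(page, "Remit To"))  # 98501-1234
```

Searching the whole page for five digits would have picked up the invoice number `00123` instead.  Anchoring the search to the keyword is what makes the regular expression reliable.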

Building an Uber FlexiLayout

Now it was time to roll up our sleeves and build a smarter FlexiLayout that could capture invoices from a variety of sources.  We used advanced features such as FlexiLayout alternatives, element groups, object collection elements, and other settings to start recognizing semi-structured forms from a wide variety of sources.  Then we added a little bit of FlexiLayout language code to help us “crawl” around the identified forms to find dates and monetary amounts that could sometimes be below keywords, or to the right, etc.  We didn’t need to script any validation rules for our purposes, but I showed some script I had created prior to our meeting.  A quick unit test showed great results: we had now stepped away from a model where each form had to have its own FlexiLayout.
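The “crawl” can be pictured as trying one direction and then falling back to another.  The Python below is a sketch under stated assumptions, not the FlexiLayout code we wrote: it handles monetary amounts only, the `Word` class, the pixel tolerances, and the sample coordinates are hypothetical, and it assumes the recognized words arrive with page positions.

```python
import re
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: int  # left edge, in pixels
    y: int  # top edge, in pixels

# Amounts such as 87.50 or $1,250.00 (two decimal places required).
AMOUNT_RE = re.compile(r"^\$?\d{1,3}(?:,\d{3})*\.\d{2}$")

def find_amount(words, keyword, line_tol=10, col_tol=40):
    """Look for a monetary amount to the right of the keyword, then below it."""
    anchor = next((w for w in words if w.text.lower() == keyword.lower()), None)
    if anchor is None:
        return None
    amounts = [w for w in words if AMOUNT_RE.match(w.text)]
    # First choice: same line, to the right of the keyword.
    right = [w for w in amounts if abs(w.y - anchor.y) <= line_tol and w.x > anchor.x]
    if right:
        return min(right, key=lambda w: w.x - anchor.x).text
    # Fallback: roughly the same column, below the keyword.
    below = [w for w in amounts if abs(w.x - anchor.x) <= col_tol and w.y > anchor.y]
    if below:
        return min(below, key=lambda w: w.y - anchor.y).text
    return None

side_by_side = [Word("Total", 400, 900), Word("$1,250.00", 520, 902)]
stacked = [Word("Total", 400, 900), Word("87.50", 405, 940), Word("12.00", 405, 990)]
print(find_amount(side_by_side, "Total"))  # $1,250.00
print(find_amount(stacked, "Total"))       # 87.50
```

One routine covers both layouts, which is the point of moving away from one FlexiLayout per form.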

 Running Recognition Tests

We traded our lab coats for testing hazmat suits and ran many batches of documents, both ones we had used in development and ones we had never looked at before.

 Recording Results

While never a thrill, recording results was made easier by a spreadsheet created by Jeff Martin, Gene Eckhardt and Brandon Konen that allowed easy entry of recognition results.  This is known as our “Advanced Capture Analysis and Comparison Tool”, highly regarded in our ranks.  The data was automatically crunched, allowing us to very quickly establish baselines, compare our scan results with other products, share our results with coworkers and principals, etc.
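The internals of the spreadsheet are not described here, but the kind of number crunching involved can be sketched in Python.  The function name, field names, and sample values below are hypothetical; the calculation is simply the share of fields read correctly, grouped by field.

```python
def field_accuracy(results):
    """Return per-field accuracy (0.0 to 1.0) from (field, expected, recognized) rows."""
    totals, correct = {}, {}
    for field, expected, recognized in results:
        totals[field] = totals.get(field, 0) + 1
        if expected == recognized:
            correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / totals[field] for field in totals}

# Hypothetical results: one invoice number was misread (letter O for zero).
rows = [
    ("InvoiceNumber", "A1001", "A1001"),
    ("InvoiceNumber", "A1002", "A1O02"),
    ("Total", "125.00", "125.00"),
    ("Total", "87.50", "87.50"),
]
print(field_accuracy(rows))  # {'InvoiceNumber': 0.5, 'Total': 1.0}
```

Keeping the figures per field rather than per document makes it easier to see which elements of a layout need more tuning.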

Lessons Learned Doc Revisited

It’s a privilege to be able to work with industry veterans such as Jeff Doyle and Gene Eckhardt on a project such as this.  They brought years of experience with them to improve every process we covered.  While evaluating the Lessons Learned doc, they were able to extrapolate possible impacts in environments and scenarios they have seen in the field.  They also added fresh mitigation alternatives to work through problems encountered.  Our Lessons Learned docs are part of a large and valuable knowledge base that has been added to at ImageSource year after year.

Findings and Conclusions Write-Up

After a demonstration to some coworkers who needed to ramp up on our configuration, we collaborated to create a summary document, and here Gene took the lead.  We were able to draw on the Lessons Learned doc, the Advanced Capture Analysis and Comparison Tool, and meeting notes to piece together our findings and quantify our conclusions.  The summary outlined the scope of our efforts, including excluded activities, our environment and products tested, results, conclusions, general observations, and Best Practice recommendations.

It’s one thing to kick the tires on a car before purchase.  But a methodical, thorough and thoughtful approach is the norm for analogous software tasks at ImageSource.

College Transcript Processing refers to converting a paper-based transcript into an electronic transcript via software that OCRs the scanned paper version, locates specific data within the transcript, and saves that data for later use.  The reason for processing a transcript via software is to move the data into another system for storage and retrieval faster than manual entry by a data entry specialist can.  This is a somewhat difficult task for the following reasons:

  1. Each and every College presents similar data in a very different format.
  2. Almost all colleges attempt to prevent the copying of the paper transcript through various copy protection methods.  Most of these methods render the data on the transcript almost unreadable.

The data that is similar on a transcript falls into several main areas:

  1.  College Identifying Information
  2. Student Identifying Information
  3. Session/Course Information
  4. Previous Colleges Attended Information
  5. Degrees Awarded Information

The data is similar but not the same on each college transcript.  In addition, the layout of a transcript varies greatly between the various colleges.  Session/Course data could take up the entire width of the paper for one college, but be formatted as multiple columns of data for another college.  There are many, many variations that need to be taken into consideration when attempting to OCR to find and extract the data.

So far the ABBYY FlexiCapture 9.x software has been able to handle most of these issues out of the box.  One of its most powerful features, I am finding, is the scripting support for writing rule scripts, custom scripts, and export scripts that can correct OCR issues and assist the Verification Operator, improving efficiency and throughput.

The scripts for rules, custom scripts, or export can be written in VBScript or JScript.  There is some documentation on the ABBYY classes and objects, but not a whole lot.  Most of what I have done has been through trial and error or, in specific cases, from examples provided by Tech Support.  However, the scripts that have been developed work well for correcting OCR issues and providing automated checks of extracted field data.  Through custom scripts there is even the option to run a database lookup on extracted data and return other fields from the database, to assist in providing a complete set of validated information.
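To give a feel for what such a rule does, here is the logic of a typical correction and check, sketched in Python rather than in the VBScript or JScript that FlexiCapture runs.  It does not use the ABBYY object model; the GPA field, the look-alike character table, and the function names are hypothetical.

```python
import re

# Letters that OCR commonly confuses with digits (an assumed, partial list).
DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)

def clean_numeric(raw):
    """Replace look-alike letters with digits in a field expected to be numeric."""
    return raw.strip().translate(DIGIT_FIXES)

def check_gpa(raw):
    """Return (value, ok): the cleaned text and whether it is a plausible GPA."""
    value = clean_numeric(raw)
    ok = bool(re.fullmatch(r"[0-4]\.\d{1,2}", value))
    return value, ok

print(check_gpa("3.S"))   # ('3.5', True)
print(check_gpa("B.75"))  # ('8.75', False)
```

The first value is repaired silently.  The second still fails the range check after cleanup, so it is the kind of field a rule would flag for the Verification Operator instead of passing through.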

This has been a learning experience, but it is proving to be well worth the effort.  Getting the data off the paper and into the system used to evaluate a student for enrollment now takes far fewer man hours than the old manual data entry process required.
