KTM TDS Model Building

October 13, 2011

Are you tired of separator sheets? Tired of wasted paper and countless hours spent flipping through pages to insert a barcode sheet at the start of each new document, only to pull it out after the batch is scanned or leave it in and have more paper to store? Why not have the computer do the work for you? That’s the idea behind the Project Planner module in KTM. KTM has standard separation functionality built in that works very well on structured and semi-structured documents, but when you have more complex separation rules, the Project Planner component is what you need. It is designed to create a template, or “model,” for automatic separation, which the KTM Server then uses during the normal batch workflow. This is why you might also hear the process referred to as “model building.” I want to give you a brief look at setting up a TDS (Trainable Document Separation) model and integrating it with a KTM project.

The first thing you need to do is collect lots and lots of samples. The program requires at least 50 samples for each class or document type. Each document needs to be in TIFF format and have its own folder. Furthermore, multi-page documents should be split into single-page TIFF images, with all of a document’s pages placed in that document’s folder. The next step is to take your collection of document folders and group them into folders by document type (these will become the classes in your KTM project). This is a very time-consuming process, but it pays off when you import the documents, as you will see.
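Before importing, it can help to check that every class folder actually meets the 50-sample minimum. The following is a minimal sketch, not part of KTM or Project Planner. It assumes a hypothetical layout of `root/<class>/<document>/<page>.tif`, and the function names are made up for illustration.

```python
from pathlib import Path

MIN_SAMPLES = 50  # minimum documents per class, per the requirement above
TIFF_SUFFIXES = {".tif", ".tiff"}


def count_samples(root):
    """Return {class_name: number_of_document_folders} for a sample tree.

    Assumed (hypothetical) layout: root/<class>/<document>/<page>.tif
    """
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue
        # A document folder counts only if it holds at least one TIFF page.
        docs = [
            d for d in class_dir.iterdir()
            if d.is_dir()
            and any(p.suffix.lower() in TIFF_SUFFIXES for p in d.iterdir())
        ]
        counts[class_dir.name] = len(docs)
    return counts


def undersized_classes(counts, minimum=MIN_SAMPLES):
    """Return the classes that still need more samples, sorted by name."""
    return sorted(name for name, n in counts.items() if n < minimum)
```

Calling `undersized_classes(count_samples("samples"))` on your own sample folder would list the classes that still need more documents collected.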

Another thing you should be aware of is that Project Planner requires an additional license and is not installed by the normal KTM install. After the standard KTM install, you can find the Project Planner setup.exe on the install media under “Kofax Transformation Modules” and then “Project Planner.”

After you have Project Planner installed and have created a new project, you need to import the documents that you just sorted. Project Planner has a handy tool that lets you select where the separations for each class, document, and page are. This allows the system to automatically create the classes and import the appropriate documents into each class.

After the files are imported you will see the classes automatically created for this model.  The next step is to run all of the documents through the OCR engine in order for the system to be able to read the documents.  This process can take hours for larger sample sets so there have been times that I just let it run overnight.

When all the documents have finished running through the OCR engine, you can begin the cleanup process. This is simply a matter of confirming whether each document belongs to a class by checking the checkbox on that document. You only need to confirm enough documents for the system to be confident in the classification based on the samples provided. The bar across the bottom is color-coded to show the confidence for a particular class.

Blue marks the documents that you have confirmed, green means confident, and red means not confident. As documents are confirmed, the red bar will get smaller and eventually go away. The cleanup process is complete when enough documents have been confirmed for each document class that all of the red is gone.

The next step is to compile the information into a TDS model which the KTM project can use for separation.  This is done by creating two files in Project Planner.  The first is a classification file, or the mod file, that the system will use to distinguish what class each page belongs to.  You can either use a text classification or image classification.  The second file that needs to be compiled is the document separation file, or the ads file.  This allows the KTM Server to use the training provided in the cleanup step to know where to separate each document.

The final step is to link the model to your KTM project. Open your project in Project Builder and go to the project settings. On the Document Separation tab, one of the options is “Trainable Document Separation (TDS)”. Select it and browse to the folder containing the mod and ads files.

When you click OK you should get a message that tells you that “The TDS project was successfully imported.  New classes were created according to the definition of the document separation model.”  If not already there, classes will be automatically created in your project.  You’re now ready to synchronize the project within Kofax Administration and publish the batch class.

In summary, by combining KTM with a TDS model you will save time and money by reducing the amount of document preparation required when scanning. For example, in a recent install I worked with a company that had a whole room of employees (about 20) doing manual separation. We installed KTM and used the TDS model for separation, and now they only have 4-5 people doing the same volume of documents in less time. This is a very powerful tool that I would suggest to anyone who needs automatic separation of semi-structured and unstructured documents.

 

Brandon Konen
Systems Engineer
ImageSource Inc.

In KTM there is a nifty feature to search the entire document for a date field. It will recognize all dates on the form, and with some other snazzy logic you can find the date you are looking for. If it is near the word “received”, then you probably have a received date. Easy, right?

Okay, sometimes dates get a little trickier. “3/17/2011”, “3-17-2011” and “MAR 17, 2001” are all valid date strings. Any of those formats could be found on your document. In KTM there is a nifty feature to search for the string “MAR” and replace it with a “3” when searching for dates. You use it in your locator’s regular expression. You can set up your own dictionary of months to look for “March” or “Mar” (or “Marzo” if you need internationalization).
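To make the dictionary idea concrete, here is a rough Python sketch of the same kind of whole-word month replacement. This is not how KTM implements it. The `MONTHS` table and the `replace_months` function are hypothetical, and Python’s `\b` word boundary is only an approximation of KTM’s whole-word matching.

```python
import re

# Hypothetical month dictionary: every spelling maps to its month number.
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
    "march": 3, "marzo": 3, "november": 11,
}

# Longest spellings first so "march" is preferred over "mar".
_ALTERNATIVES = "|".join(sorted(MONTHS, key=len, reverse=True))
_MONTH_WORD = re.compile(r"\b(" + _ALTERNATIVES + r")\b", re.IGNORECASE)


def replace_months(text):
    """Replace whole-word month names with their month number."""
    return _MONTH_WORD.sub(lambda m: str(MONTHS[m.group(1).lower()]), text)
```

With this sketch, `replace_months("MAR 17, 2001")` gives `"3 17, 2001"`, which a purely numeric date expression can then pick up.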

Here’s the gotcha. I recently found text in an OCR’d document like this: “19 NOV2008”. It’s a bit of an odd string. The OCR engine didn’t think there was enough space between the “NOV” and the “2008” to put an actual space character in the OCR’d text. So, I can read it, but KTM can’t. The nifty feature to search for the string “NOV” fails because it only looks at whole words, those with whitespace on either side. Unfortunately, there is no option in the KTM dictionary setup to change this.

Here’s the fix. Modify the default KTM regular expression from this:

[0-3]?\d§English_Months_Abr§([12]\d{3}|\d{2})

to this:

[0-3]?\d\s*(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\s*([12]\d{3}|\d{2})

You can now make the space character optional in your regular expression search. You are no longer using the month dictionary, but that’s okay. This logic is only to locate a date, not translate it from a string to a date when found.
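If you want to try the idea outside of KTM, the sketch below reproduces it with Python’s `re` module, which stands in for KTM’s regular expression engine here. It makes a few assumptions: matching is case-insensitive, the whole-word dictionary lookup is approximated by requiring whitespace around the month, and the year group `([12]\d{3}|\d{2})` from the default expression is kept so the full year is matched.

```python
import re

# Month abbreviations inlined in place of the month dictionary.
MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"
YEAR = r"([12]\d{3}|\d{2})"  # year group from the default expression

# Approximation of the whole-word lookup: whitespace is required around the month.
STRICT = re.compile(rf"[0-3]?\d\s+({MONTHS})\s+{YEAR}", re.IGNORECASE)

# Relaxed version: whitespace around the month is optional.
RELAXED = re.compile(rf"[0-3]?\d\s*({MONTHS})\s*{YEAR}", re.IGNORECASE)


def find_date(pattern, text):
    """Return the first date-like substring found by `pattern`, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None
```

On the text “Filed 19 NOV2008 by clerk”, `find_date(STRICT, ...)` finds nothing, while `find_date(RELAXED, ...)` returns “19 NOV2008”.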

Problem solved.

I recently was asked to help with a client’s KTM (Kofax Transformation Modules) project, because they were not pleased with the percentages of valid and/or correct extraction fields. My first question was, “Are you using subclasses?”  The answer was, “No.”  Subclassifying your top forms is an easy way to greatly improve your extraction results.

What I mean by that is instead of trying to use a single locator to find data from all of your documents with a “one size fits all” approach, you can use subclasses to first classify the document and then tune your locators specific to that form to look in a precise location for the information. For example, let’s say you need to find a “Case Number” off of all of your forms. Some forms might have the word “Case Number” above the text you need to extract. Others might have the word “Number” to the left of the data. Another might not have any text around the data to key off of at all. It’s difficult to add enough rules in one locator to catch all the possible scenarios. Furthermore, there are times when adding rules to help find data on one form will actually give you negative results from another. Subclasses can help by allowing you to create a specific locator to zero in on the information that you are looking for.

How many subclasses are enough? I like to use the 80/20 rule. When listing all of your documents in relation to volume, 80% of your volume should come from 20% of your forms. I know that there are exceptions to the rule, but this is a good place to start. I have done projects here at ImageSource where we subclassified the top 10, 20 or 50 forms. When forms are subclassified, the extraction averages go way up by using locators like the Advanced Zone Locator on structured forms. This locator is very helpful because once you draw a box around the data, you can set it to run its own cleanup and OCR of that zone rather than taking the original full-text OCR results. However, this is only really useful on forms that have been subclassified, since you know exactly where the data is on the page. Format Locators are also very helpful because you know how the data is structured in relation to the form, and you can create a regular expression to look for text. This helps reduce the number of incorrect alternative results. For the rest of the forms that are not subclassified, you still need to create the miscellaneous locators, but the idea is that the majority of your documents are being subclassified and coming through with very high extraction rates.
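One simple way to apply the 80/20 rule is to sort your forms by volume and keep adding them until the target share is covered. The sketch below does that in Python. The function name and the form names in the usage note are made up for illustration, and the 80% target is just the starting point described above.

```python
def forms_to_subclassify(volumes, target=0.80):
    """Pick the highest-volume forms until `target` share of volume is covered.

    `volumes` maps a form name to its document count.
    """
    total = sum(volumes.values())
    if total == 0:
        return []
    chosen, covered = [], 0
    by_volume = sorted(volumes.items(), key=lambda item: item[1], reverse=True)
    for name, count in by_volume:
        if covered >= target * total:
            break  # the chosen forms already cover the target share
        chosen.append(name)
        covered += count
    return chosen
```

For example, with volumes of 500, 350, 100, 30 and 20 documents for five hypothetical forms A through E, the function returns only A and B, because those two forms already account for 85% of the volume.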

The other nice thing about KTM is that you can define locators at the parent class, and each of the subclasses will inherit them unless you specifically change them. An example of where this is helpful is for fields that use a database locator with a fuzzy lookup that applies to all the forms, where you don’t want to create a specific locator for every subclass. In addition, you can still use the incredible training power that KTM provides. When using the specific learning, it will apply the training to the particular subclass. I have found that with KTM there are many different ways to “skin a cat,” and this is just one of the methods that can dramatically improve your extraction results.
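The inheritance behavior can be pictured with a small Python sketch. This is only a mental model, not KTM’s object model or API. The `DocClass` type, the field names, and the locator descriptions are all hypothetical.

```python
class DocClass:
    """A document class with optional parent and per-field locators."""

    def __init__(self, name, parent=None, locators=None):
        self.name = name
        self.parent = parent
        self.locators = dict(locators or {})

    def resolve(self, field):
        """Return the locator for `field`, falling back to parent classes."""
        node = self
        while node is not None:
            if field in node.locators:
                return node.locators[field]
            node = node.parent
        raise KeyError(field)


# Parent class holds the "one size fits all" locators.
parent = DocClass("Forms", locators={
    "CaseNumber": "format locator (generic)",
    "Vendor": "database locator (fuzzy lookup)",
})

# Subclass overrides only the field that needs form-specific tuning.
form_a = DocClass("FormA", parent=parent, locators={
    "CaseNumber": "advanced zone locator (Form A)",
})
```

Here `form_a.resolve("CaseNumber")` returns the form-specific zone locator, while `form_a.resolve("Vendor")` falls back to the database locator defined on the parent.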

Brandon Konen
Systems Engineer
ImageSource, Inc.
