Kofax KTM Dictionary Gotcha on Dates
March 18, 2011
In KTM there is a nifty feature to search the entire document for a date field. It will recognize all dates existing on the form, and with some other snazzy logic you can find the date you are looking for. If it is near the word “received”, then you probably have a received date. Easy, right?
Okay, sometimes dates get a little more tricky. “3/17/2011”, “3-17-2011” and “MAR 17, 2001” are all valid date strings. Any of those formats could be found on your document. In KTM there is a nifty feature to search for the string “MAR” and replace it with a “3” when searching for dates. You use it in your locator’s regular expression. You can set up your own dictionary of months to look for “March” or “Mar” (or “Marzo” if you need internationalization).
Here’s the gotcha. I recently found text in an OCR’d document like this: “19 NOV2008”. It’s a bit of an odd string. The OCR engine didn’t think there was enough space between the “NOV” and the “2008” to put an actual space character in the OCR’d text. So, I can read it, but KTM can’t. The nifty feature to search for the string “NOV” fails because it is only looking at whole words, those with whitespace on either side. Unfortunately, there is no option in the KTM dictionary setup to change this.
Here’s the fix. Modify the default KTM regular expression from this:
[0-3]?\d§English_Months_Abr§([12]\d{3}|\d{2})
to this:
[0-3]?\d\s*(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\s*([12]\d{3}|\d{2})
The \s* on either side of the month list makes the whitespace optional, so “19 NOV2008” is now found. You are no longer using the month dictionary, but that’s okay. This expression only has to locate a date, not translate it from a string to a date when found.
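If you want to try the idea outside of KTM, here is a minimal sketch in Python. It is only an approximation: Python's re module is not KTM's regular expression engine, the "strict" pattern is my stand-in for the whole-word dictionary behavior, and the sample text is made up. Both patterns keep the year group from the default expression.

```python
import re

# Month abbreviations, same list as in the locator expression.
MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"
YEAR = r"([12]\d{3}|\d{2})"  # year group taken from the default expression

# Assumption: stand-in for the whole-word behavior, where the month
# must have whitespace on both sides.
strict = re.compile(r"[0-3]?\d\s+(" + MONTHS + r")\s+" + YEAR, re.IGNORECASE)

# Relaxed version: whitespace around the month is optional.
relaxed = re.compile(r"[0-3]?\d\s*(" + MONTHS + r")\s*" + YEAR, re.IGNORECASE)

text = "Received 19 NOV2008 by clerk"  # made-up OCR sample

strict_match = strict.search(text)    # None, no space after "NOV"
relaxed_match = relaxed.search(text)  # matches "19 NOV2008"

print(strict_match)
print(relaxed_match.group(0) if relaxed_match else None)
```

The IGNORECASE flag is there because the month list is lowercase while the OCR text is uppercase. How case is handled inside KTM depends on your locator settings, so check that separately.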
Problem solved.
To Classify or Not to Classify
March 11, 2011
I recently was asked to help with a client’s KTM (Kofax Transformation Modules) project, because they were not pleased with the percentages of valid and/or correct extraction fields. My first question was, “Are you using subclasses?” The answer was, “No.” Subclassifying your top forms is an easy way to greatly improve your extraction results.
What I mean by that is instead of trying to use a single locator to find data from all of your documents with a “one size fits all” approach, you can use subclasses to first classify the document and then tune your locators specific to that form to look in a precise location for the information. For example, let’s say you need to find a “Case Number” off of all of your forms. Some forms might have the word “Case Number” above the text you need to extract. Others might have the word “Number” to the left of the data. Another might not have any text around the data to key off of at all. It’s difficult to add enough rules in one locator to catch all the possible scenarios. Furthermore, there are times when adding rules to help find data on one form will actually give you negative results from another. Subclasses can help by allowing you to create a specific locator to zero in on the information that you are looking for.
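To make the “Case Number” example concrete, here is a rough sketch in Python of what classifying first and then using a form-specific rule looks like. Nothing here is KTM API. The class names, patterns and sample text are all hypothetical, and in KTM you would build this with subclasses and locators in the project, not in code.

```python
import re

# Hypothetical per-subclass rules, one tuned pattern per known form.
SUBCLASS_LOCATORS = {
    # Label sits above the value on this form.
    "FormA": re.compile(r"Case Number\s+([\w-]+)"),
    # Label sits to the left of the value on this form.
    "FormB": re.compile(r"Number:?\s*([\w-]+)"),
}

# Hypothetical catch-all for forms that were not subclassified:
# no label to key off, so it can only guess by format.
FALLBACK_LOCATOR = re.compile(r"\b(\d{2}-\d{5})\b")


def extract_case_number(form_class, text):
    """Pick the locator for the classified form, else use the fallback."""
    pattern = SUBCLASS_LOCATORS.get(form_class, FALLBACK_LOCATOR)
    match = pattern.search(text)
    return match.group(1) if match else None


print(extract_case_number("FormA", "Case Number\n08-12345\nName: Smith"))
print(extract_case_number("FormB", "Number: 08-12345"))
print(extract_case_number("Unknown", "ref 08-12345 filed"))
```

Each rule stays small because it only has to be right for one form, which is the whole point of subclassifying.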
How many subclasses are enough? I like to use the 80/20 rule. When listing all of your documents in relation to volume, 80% of your volume should come from 20% of your forms. I know that there are exceptions to the rule, but this is a good place to start. I have done projects here at ImageSource where we subclassified the top 10, 20 or 50 forms. When forms are subclassified, the extraction averages go way up by using locators like the Advanced Zone Locator on structured forms. This locator is very helpful because once you draw a box around the data, you can set it to run its own cleanup and OCR of that zone rather than taking the original full-text OCR results. However, this is only really useful on forms that have been subclassified, since that is when you know exactly where the data is on the page. Format Locators are also very helpful because you know how the data is structured in relation to the form, and you can create a regular expression to look for text. This helps reduce the number of incorrect alternative results. For the rest of the forms that are not subclassified, you still need to create the miscellaneous locators, but the idea is that the majority of your documents are being subclassified and coming through with very high extraction rates.
The other nice thing about KTM is that you can use locators at the parent class, and each of the subclasses will inherit the locator unless you specifically change them. An example of where this is helpful is for fields that use a database locator with a fuzzy lookup that applies to all the forms, but you don’t want to create a specific locator for all the subclasses. In addition, you can still use the incredible training power that KTM provides. When using the specific learning, it will apply the training to the particular subclass. I have found that with KTM there are many different ways to “skin a cat,” and this is just one of the methods that can dramatically improve your extraction results.
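If you have volume counts per form, the 80/20 cut can be worked out quickly. Here is a small Python sketch, assuming you can export a count of documents per form type. The function name and the numbers are made up for illustration.

```python
def forms_to_subclass(volumes, target_percent=80):
    """Return the highest-volume forms that together cover the target
    percentage of all documents."""
    total = sum(volumes.values())
    chosen = []
    covered = 0
    # Walk the forms from highest volume to lowest.
    for form, count in sorted(volumes.items(), key=lambda kv: kv[1], reverse=True):
        # Integer comparison avoids floating point rounding surprises.
        if covered * 100 >= target_percent * total:
            break
        chosen.append(form)
        covered += count
    return chosen


# Made-up monthly volumes per form type.
volumes = {"A": 500, "B": 300, "C": 100, "D": 60, "E": 40}
print(forms_to_subclass(volumes))
```

With these sample numbers, two of the five forms cover 80% of the volume, so those two are the first candidates for subclasses. Your real counts may be less lopsided, which is one of the exceptions to the rule mentioned above.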
Brandon Konen
Systems Engineer
ImageSource, Inc.



