Kofax KTM Dictionary Gotcha on Dates

In KTM there is a nifty feature to search the entire document for a date field. It will recognize all dates existing on the form, and with some other snazzy logic you can find the date you are looking for. If it is near the word “received”, then you probably have a receive date. Easy, right?

Okay, sometimes dates get a little more tricky. “3/17/2011”, “3-17-2011” and “MAR 17, 2001” are all valid date strings. Any of those formats could be found on your document. In KTM there is a nifty feature to search for the string “MAR” and replace it with a “3” when searching for dates. You use it in your locator’s regular expression. You can set up your own dictionary of months to look for “March” or “Mar” (or “Marzo” if you need internationalization).

Here’s the gotcha. I recently found text in an OCR’d document like this: “19 NOV2008”. It’s a bit of an odd string. The OCR engine didn’t think there was enough space between the “NOV” and the “2008” to put an actual space character in the OCR’d text. So, I can read it, but KTM can’t. The nifty feature to search for the string “NOV” fails because it only looks at whole words, those with whitespace on either side. Unfortunately, there is no option in the KTM dictionary setup to change this.
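To make the whole-word behavior concrete, here is a minimal Python sketch. It is not KTM’s implementation. The dictionary contents, the `replace_months` helper, and the use of `\b` word boundaries are all assumptions chosen to approximate the behavior described above.

```python
import re

# Hypothetical month dictionary: spellings mapped to month numbers.
MONTH_DICTIONARY = {
    "march": 3, "mar": 3, "marzo": 3,
    "november": 11, "nov": 11,
}

# Whole-word matching: \b requires a non-word character (or the string
# edge) on both sides of the month name. This is an approximation of the
# dictionary lookup, not KTM's actual matching logic.
_alternation = "|".join(sorted(MONTH_DICTIONARY, key=len, reverse=True))
_MONTH_WORD = re.compile(r"\b(" + _alternation + r")\b", re.IGNORECASE)


def replace_months(text):
    """Replace whole-word month names with their month numbers."""
    return _MONTH_WORD.sub(
        lambda m: str(MONTH_DICTIONARY[m.group(1).lower()]), text
    )


print(replace_months("MAR 17, 2011"))  # 3 17, 2011
print(replace_months("19 NOV 2008"))   # 19 11 2008
print(replace_months("19 NOV2008"))    # 19 NOV2008 (unchanged)
```

The last call is the gotcha in miniature: “NOV” is glued to “2008”, so it never counts as a whole word and the lookup leaves it alone.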

Here’s the fix. Modify the default KTM regular expression from this:

[0-3]?\d§English_Months_Abr§([12]\d{3}|\d{2})

to this:

[0-3]?\d\s*(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\s*([12]\d{3}|\d{2})

You can now make the space character optional in your regular expression search. You are no longer using the month dictionary, but that’s okay. This logic is only to locate a date, not translate it from a string to a date when found.
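As a quick sanity check outside of KTM, the idea can be tried in Python. This is only a sketch under a few assumptions: Python’s `re` syntax stands in for KTM’s regular expression engine, matching is made case-insensitive, the pattern ends with the year group `([12]\d{3}|\d{2})` taken from the default expression, and `find_dates` is a hypothetical helper name.

```python
import re

# Day, optional whitespace, abbreviated month, optional whitespace, year.
# The year group is borrowed from the default expression (assumption).
DATE_PATTERN = re.compile(
    r"[0-3]?\d\s*"
    r"(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)"
    r"\s*([12]\d{3}|\d{2})",
    re.IGNORECASE,
)


def find_dates(text):
    """Return every substring of text that looks like a day-month-year date."""
    return [m.group(0) for m in DATE_PATTERN.finditer(text)]


print(find_dates("Received 19 NOV2008, shipped 3 dec 08"))
# ['19 NOV2008', '3 dec 08']
```

Because `\s*` matches zero or more whitespace characters, the same pattern covers “19 NOV 2008”, “19 NOV2008” and “19NOV2008”.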

Problem solved.

 

The True Nature Of Beautiful Perfection

I’m always pleased when I can build a nice clean system for a customer.  I like to be able to look back and say, “That is beautiful.  I’m proud of that.”  Antoine de Saint-Exupéry said, “A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away.”  I like that quote and try to build systems with that in mind.  I like to be able to simplify a customer’s business process so it makes more sense to them than it did before.

But there is a danger here that needs to be avoided.  We are not dealing with art or literature.  We are dealing with business systems, systems that receive input from untrusted sources.  This data needs to be checked.  Joel Spolsky puts it best when describing things you should never do.  He talks about old code that “has grown little hairs and stuff on it and nobody knows why.”  He’s not describing something beautiful; he’s describing something that works.  Something that’s gone through the pain of having exceptions found and dealt with.

Exception Processing

We help customers automate their business.  The software products we sell all have a type of workflow built in.  Oracle IPM and Liquid Office have a true workflow component where you build your processes graphically.  ILINX Capture and Kofax Capture have the idea of queues and the ordering of queues.  Systems with combined software are generally designed to be used in stages such as scan, store and retrieve.  In each stage the data is moved from one queue to another or one piece of software to another, and the data needs to arrive correctly.  These are the types of system interactions I’m focusing on.  Unfortunately, these systems aren’t always configured perfectly and something will happen.

Your scanned document will be unreadable.  Your form won’t be filled out completely.  You’ll have a power outage.  The database you rely on will have bad or missing data.  Your network connection will drop.  You’ll need a strategy for handling these.