Kofax KTM Dictionary Gotcha on Dates

In KTM there is a nifty feature to search the entire document for a date field. It recognizes every date on the form, and with some other snazzy logic you can find the date you are looking for. If it is near the word “received”, then you probably have a receive date. Easy, right?

Okay, sometimes dates get a little trickier. “3/17/2011”, “3-17-2011” and “MAR 17, 2001” are all valid date strings. Any of those formats could be found on your document. In KTM there is a nifty feature to search for the string “MAR” and replace it with a “3” when searching for dates. You use it in your locator’s regular expression. You can set up your own dictionary of months to look for “March” or “Mar” (or “Marzo” if you need internationalization).
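To make the idea of a month dictionary concrete, here is a minimal Python sketch of the same kind of lookup. It is not KTM's dictionary mechanism; the `MONTH_ALIASES` table and the `replace_months` helper are hypothetical names made up for illustration, and the table only lists a few entries.

```python
import re

# Hypothetical month dictionary; a real one would list every month and alias.
MONTH_ALIASES = {"mar": 3, "march": 3, "marzo": 3, "nov": 11, "november": 11}

# Try longer names before their abbreviations, and only match whole words.
_ALIAS_RE = re.compile(
    r"\b(" + "|".join(sorted(MONTH_ALIASES, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def replace_months(text):
    """Replace each whole-word month alias with its month number."""
    return _ALIAS_RE.sub(lambda m: str(MONTH_ALIASES[m.group(1).lower()]), text)

print(replace_months("MAR 17, 2011"))   # 3 17, 2011
print(replace_months("17 Marzo 2011"))  # 17 3 2011
```

Adding a language is then just a matter of adding rows to the table, which is roughly what a custom dictionary buys you.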

Here’s the gotcha. I recently found text in an OCR’d document like this: “19 NOV2008”. It’s a bit of an odd string. The OCR engine didn’t think there was enough space between the “NOV” and the “2008” to put an actual space character in the OCR’d text. So, I can read it, but KTM can’t. The nifty feature to search for the string “NOV” fails because it only looks at whole words, those with whitespace on either side. Unfortunately, there is no option in the KTM dictionary setup to change this.

Here’s the fix. Modify the default KTM regular expression from this:

[0-3]?\d§English_Months_Abr§([12]\d{3}|\d{2})

to this:

[0-3]?\d\s*(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\s*([12]\d{3}|\d{2})

The \s* on either side of the month list makes the space character optional in your regular expression search. You are no longer using the month dictionary, but that’s okay. This logic is only to locate a date, not to translate it from a string to a date once found.
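If you want to sanity-check the idea outside KTM, here is a rough Python sketch. It is only an approximation: Python's `re` module is not KTM's regex engine, `STRICT` is my stand-in for the whole-word dictionary behavior (whitespace required around the month), and `RELAXED` uses the inline month list with optional whitespace and the year group from the default expression. The `find_dates` helper and the sample text are made up for the test.

```python
import re

MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"
YEAR = r"(?:[12]\d{3}|\d{2})"

# Stand-in for the dictionary lookup: the month must have whitespace on both sides.
STRICT = re.compile(rf"[0-3]?\d\s+(?:{MONTHS})\s+{YEAR}", re.IGNORECASE)

# Inline month list with optional whitespace on both sides.
RELAXED = re.compile(rf"[0-3]?\d\s*(?:{MONTHS})\s*{YEAR}", re.IGNORECASE)

def find_dates(pattern, text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in pattern.finditer(text)]

text = "Received 19 NOV2008 and shipped 03 Dec 2008"
print(find_dates(STRICT, text))   # ['03 Dec 2008']
print(find_dates(RELAXED, text))  # ['19 NOV2008', '03 Dec 2008']
```

The strict pattern skips the run-together date, while the relaxed one picks up both.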

Problem solved.

 

It’s All About the Database

The key component of any content management/archive/workflow system is the database. Many times this key component is overlooked when a new imaging system is put in place. Often the database is placed on an existing server that already has many critical duties, or it is a new install placed on the same server as the imaging system software. These choices might be fine for a proof of concept, but as a system takes on more data and more users, the database becomes the performance bottleneck.

As with any database, issues rarely arise within the first year of usage. The data is slow to grow, and queries are quick and responsive because the load on the system is minimal. During this time the number of active users in the system at any one time is low as well, and there are more inserts taking place than data retrievals. But nobody puts in one of these systems expecting the amount of data and the number of users to stay small for long. Most organizations bought their software and went through the installation and configuration process with the goal of creating a vast repository that would make access to their content easy and quick for a large community within their organization.

It is when the system has finally been fully adopted and has enough content to be useful that the undersized database server starts to become a problem for returning data to the users. I have seen systems where the input of new data is scheduled for off hours to compensate for the database server’s inability to handle inserts and retrievals at the same time during business hours. By this time the money for the original project is long gone, and often the stakeholders have forgotten about the decision to use a less-than-desirable database server configuration as a short-term solution during the software’s “pilot phase”.