A combination of existing software tools was investigated for analyzing the content of Department of Defense (DoD) documents and classifying them according to official file plans, such as the Modern Army Recordkeeping System (MARKS). Two tools were used: the Propeller package, which performs n-gram lexical matching among documents, and the FILAS package, which classifies documents according to a built-in semantic knowledge base. The approach was tested against a corpus of paper reports and electronic mail messages, and the results showed that integrating the two tools was an improvement over using either one alone.
Text processing, n-gram, document analysis
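
As an illustration of the general idea behind n-gram lexical matching, the following minimal Python sketch compares documents by the cosine similarity of their character n-gram counts and assigns a new document the label of the closest example. It is not the Propeller algorithm, whose internals are not described here; the function names, the trigram default, and the category labels in the usage note are all assumptions made for illustration.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (hypothetical helper).

    Text is lowercased and runs of whitespace are collapsed so that
    formatting differences do not affect the counts.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count profiles (0.0 to 1.0)."""
    # Counter returns 0 for missing keys, so unshared n-grams add nothing.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(document, labeled_examples, n=3):
    """Return the label whose example text is lexically closest.

    labeled_examples maps a category label to one example text.
    The labels are assumed placeholders, not actual file plan entries.
    """
    profile = char_ngrams(document, n)
    return max(
        labeled_examples,
        key=lambda label: cosine_similarity(
            profile, char_ngrams(labeled_examples[label], n)
        ),
    )
```

A call such as `classify("leave request for personnel", {"personnel": "request for leave and personnel assignment orders", "supply": "requisition of supplies and equipment inventory"})` would select the label whose example shares the most trigrams with the input. A purely lexical matcher of this kind has no notion of meaning, which is consistent with the motivation for pairing it with a semantic knowledge base.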