Section: New Results
Keywords : mapping, document analysis, ontologies.
Comparison of textual documents
Participants : Nicolas Faure, Brigitte Trousse.
This research is conducted within the Calfat project, which is funded by a grant from the DRAST (Direction de la Recherche et des Affaires Scientifiques et Techniques), a department of the Ministère de l'Ecologie, de l'Energie, du Développement durable et de l'Aménagement du territoire (the French Ministry of Environment). It falls within the general context of text classification and text mining.
This work aims to let domain experts compare textual documents automatically with a simple tool. The documents are proposals submitted in response to a generic request for proposals; automatically allocating each proposal to a subtheme of the request would considerably reduce the cost of this task for domain experts.
The chosen methodology is related to the keyword technique introduced in (Scott, M., 1997. PC analysis of key words - and key key words. System 25 (2), Elsevier, pp. 233-245).
This technique is used here to produce a list of characteristic words for each document. Each document is tagged and lemmatised, reduced to a set of lemmatised nominal phrases, and then compared to a reference corpus using the log-likelihood measure. The results of these comparisons, called lexical specificities, are then compared with one another in order to establish lexical similarities.
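The comparison of a document against a reference corpus can be illustrated with a minimal Python sketch. It is not the project's Perl prototype: it assumes the documents have already been reduced to lists of lemmatised nominal phrases, it uses the common two-corpus form of the log-likelihood statistic, and the names log_likelihood and key_terms are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) of a term occurring a times in a document of
    c terms and b times in a reference corpus of d terms.
    Assumes c > 0 and d > 0."""
    # Expected frequencies if the term were equally frequent in both.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2.0 * g2


def key_terms(doc_terms, ref_terms, top_n=10):
    """Return the top_n terms over-represented in the document,
    ranked by decreasing log-likelihood (illustrative helper)."""
    doc_counts = Counter(doc_terms)
    ref_counts = Counter(ref_terms)
    c = sum(doc_counts.values())
    d = sum(ref_counts.values())
    scored = {}
    for term, a in doc_counts.items():
        b = ref_counts.get(term, 0)
        # Keep only terms relatively more frequent in the document.
        if a / c > b / d:
            scored[term] = log_likelihood(a, b, c, d)
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```

A term whose relative frequency is the same in the document and in the reference corpus scores zero; the further the observed frequency departs from the expected one, the higher the score.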
Documents are then compared according to the result of this last comparison.
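How two lists of lexical specificities might be compared can be sketched as follows. The report does not state which similarity measure is used, so cosine similarity over specificity scores is an assumption made purely for illustration, and the function names are hypothetical. Each document is represented as a dictionary mapping a term to its specificity score.

```python
import math


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two {term: score} dictionaries.
    Assumed measure; the original comparison method is not specified."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[t] * spec_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(target, others):
    """Rank the documents in others ({name: specificities}) by
    decreasing similarity to the target document."""
    scores = [(name, cosine_similarity(target, spec))
              for name, spec in others.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Under this assumption, two documents sharing no characteristic term have a similarity of zero, and documents whose characteristic terms carry proportional scores have a similarity of one.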
This approach differs from the original technique mainly in its application to lemmatised nominal phrases and its systematic use of log-likelihood instead of the chi-square statistic.
A component-based prototype, written in Perl, was developed as a means to test the underlying technique intensively and extensively. The prototype also has a web-based (PHP) front-end, which allows it to be used over a network, and comes in two versions, one of which is a simple "blackbox" version.
Further developments of this approach include the use of a terminological resource (ontology, thesaurus, semantic network) to allow classification and to better account for lexical heterogeneity, as well as various validation tests.