Section: Scientific Foundations
From KDD to KDDK
Knowledge discovery in databases (KDD) is a process for extracting knowledge units from large databases, units that can be interpreted and reused within knowledge-based systems.
From an operational point of view, the KDD process is performed within a KDD system including databases, data mining modules, and interfaces for interactions, e.g. editing and visualization. The KDD process is based on three main operations: selection and preparation of the data, data mining, and finally interpretation of the extracted units.
The KDDK process, as implemented in the research work of the Orpailleur team, is based on data mining methods that are either symbolic or numerical. The methods used in the Orpailleur team are the following:
- Symbolic methods are based on lattice-based classification (also called concept lattice design or formal concept analysis [80]), frequent itemset search, and association rule extraction [87].
- Numerical methods are based on second-order Hidden Markov Models (HMM2, designed for pattern recognition [86]). Hidden Markov Models have good capabilities for locating stationary segments, and are mainly used for mining temporal and spatial data.
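To make the symbolic side more concrete, the following is a minimal sketch of frequent itemset search and association rule extraction, written as a plain level-wise (Apriori-style) procedure. It is an illustration under stated assumptions, not the implementation used by the Orpailleur team: the toy transactions, the function names, and the thresholds are all hypothetical.

```python
from itertools import combinations

# Toy transaction database (hypothetical data, for illustration only).
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {itemset: support}."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items]
    result = {}
    while level:
        # Keep only the candidates that reach the support threshold.
        kept = {}
        for candidate in level:
            s = support(candidate, transactions)
            if s >= min_support:
                kept[candidate] = s
        result.update(kept)
        # Join frequent k-itemsets into candidate (k+1)-itemsets.
        level = list(
            {a | b for a, b in combinations(kept, 2) if len(a | b) == len(a) + 1}
        )
    return result


def association_rules(frequent, min_confidence):
    """Rules X -> Y with confidence = support(X u Y) / support(X)."""
    rules = []
    for itemset, supp in frequent.items():
        for k in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), k)):
                # Every subset of a frequent itemset is frequent,
                # so `frequent[lhs]` is always defined here.
                conf = supp / frequent[lhs]
                if conf >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), supp, conf))
    return rules


if __name__ == "__main__":
    freq = frequent_itemsets(transactions, min_support=0.5)
    for lhs, rhs, supp, conf in association_rules(freq, min_confidence=0.7):
        print(sorted(lhs), "->", sorted(rhs), f"support={supp:.2f}", f"confidence={conf:.2f}")
```

The sketch recomputes supports by scanning all transactions, which is acceptable for a toy example but would need the usual candidate pruning and counting optimizations on a large database.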
The principle summarizing KDDK can then be read as follows [84]: going “from complex data units to complex knowledge units guided by domain knowledge” (KDDK), or “knowledge with/for knowledge”. Two original aspects can be underlined: (i) the KDD process is guided by domain knowledge, and (ii) the extracted units are embedded within a knowledge representation formalism so that they can be reused in a knowledge-based system for problem-solving purposes.
In the research work of the Orpailleur team, the various instantiations of the KDDK process are all based on the idea of classification. Classification is a polymorphic process involved in various tasks, e.g. modeling, mining, representing, and reasoning. Accordingly, a knowledge-based system may be designed, fed by the KDDK process, and used for problem solving in application domains, e.g. agronomy, astronomy, biology, chemistry, and medicine, with a special mention for semantic web activities involving text mining, content-based document mining, and intelligent information retrieval [67], [68].
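Since classification is the common thread, lattice-based classification (formal concept analysis) can be illustrated by a small sketch that enumerates the formal concepts of a binary context. This is a naive enumeration under stated assumptions: the context, the object and attribute names, and the function names are hypothetical, and closing every attribute subset is exponential, so it only suits toy examples.

```python
from itertools import combinations

# Toy formal context: object -> set of attributes (hypothetical data).
context = {
    "o1": {"m1", "m2"},
    "o2": {"m1", "m3"},
    "o3": {"m1", "m2", "m3"},
}
attributes = set().union(*context.values())


def extent_of(attribute_set):
    """Objects that have every attribute of `attribute_set`."""
    return frozenset(o for o, attrs in context.items() if attribute_set <= attrs)


def intent_of(objects):
    """Attributes shared by all `objects` (all attributes if `objects` is empty)."""
    shared = set(attributes)
    for o in objects:
        shared &= context[o]
    return frozenset(shared)


def formal_concepts():
    """Enumerate all (extent, intent) pairs by closing every attribute subset."""
    concepts = set()
    attrs = sorted(attributes)
    for k in range(len(attrs) + 1):
        for subset in combinations(attrs, k):
            ext = extent_of(frozenset(subset))
            # (ext, intent_of(ext)) is a formal concept: each part determines the other.
            concepts.add((ext, intent_of(ext)))
    return concepts


if __name__ == "__main__":
    # Print concepts from the largest extent (most general) to the smallest.
    for ext, itt in sorted(formal_concepts(), key=lambda c: -len(c[0])):
        print(sorted(ext), sorted(itt))
```

Ordering the resulting concepts by inclusion of their extents gives the concept lattice, in which each concept can be read as a class described by its shared attributes.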