
[full paper] 
Marco Botta, Ugo Galassi, Attilio Giordana
The Hierarchical Hidden Markov Model (HHMM) is a well formalized tool suitable to model complex patterns in long temporal or spatial sequences. The literature offers effective algorithm derived from the classical EM able to estimate HHMM parameters from sequences. However, up to now, little has been done in order to automatize the construction of the model architecture. The primary focus of this paper is on a multistrategy algorithm for inferring the HHMM structure from sequences where the events to capture are present in at least a relevant portion of the cases. The algorithm follows a bottomup strategy in which elementary facts in the sequences are progressively grouped building the abstraction hierarchy of an HHMM, layer after layer. In this process, clustering algorithms and sequence alignment algorithms, widely used in domains like molecular biology, are exploited. The induction strategy has been designed in order to deal with events characterized by a sparse structure, where gaps filled by irrelevant facts can be intermixed with the relevant ones. Irrelevant facts are modeled by "delays", i.e. HMMs of the noise. Delays are hypothesized when there is no significant statistical evidence for hypothesizing the existence of a specific model. Moreover, delays can be replaced in a second time by a specific model after new facts have been acquired. The induction strategy also allows an expert of the domain to integrate previous knowledge in the form of "chunk of structure" and in the form of temporal constraints. The method is evaluated both on artificial and on real data. Artificial data consist of a progression of datasets each representing a learning task of increasing difficulty. More specifically, each artificial dataset contains a complex event generated by a handcrafted HHMM plus additional spurious events. The goal is to evaluate the influence of the complexity, and of the noise, on the distance between the HHMM discovered by the algorithm and the original one. The real data have been proposed by a group of molecular biologists. The preliminary results generated by the algorithm have been considered interesting by the experts of the domain.
Keywords: Hierarchical Hidden Markov Model, Machine learning, sequence analysis
Citation: Marco Botta, Ugo Galassi, Attilio Giordana: Discovering Complex and Sparse Events in Long Sequences . In R.López de Mántaras and L.Saitta (eds.): ECAI2004, Proceedings of the 16th European Conference on Artificial Intelligence, IOS Press, Amsterdam, 2004, pp.425429.