Rupa (Regression Using Path Aggregation)

rupa is a program for learning the structure and parameters of a hidden Markov model (HMM) to explain the real-valued response associated with each of a number of training sequences.

There are many approaches for mapping fixed-length vectors of numerical features to real-valued responses, such as fitting a linear model by least squares. However, the typical approach for variable-length sequences or other unstructured data is either to turn the task into a classification problem by clustering the responses, or to first learn a structured model of the data and then map features of that model to the responses. The problem is that there are too many potential model features to consider them all, so these approaches must rely on some kind of search bias, which may or may not have the right relationship with the real-valued responses that the model is meant to explain.

What's special about rupa is that it is not a “guess-and-check” style approach. Instead, rupa uses the real-valued response associated with each sequence directly as a training signal. By assigning a likelihood distribution over the number of occurrences of each of a set of features, rupa uses a modified version of the HMM backward procedure to propagate the likelihood information back through the model and find out what those features are.
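
For orientation, the textbook HMM backward procedure can be sketched as follows. This is the standard algorithm only, not rupa's modified version; the two-state model, its probabilities, and all names below are illustrative assumptions and are not taken from rupa's C source.

```python
def backward(init, trans, emit, obs):
    """Textbook HMM backward procedure (not rupa's modified version).

    init[i]     : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][c]  : probability that state i emits character c
    obs         : the observed sequence, a non-empty string

    Returns (beta, likelihood), where beta[t][i] is the probability of
    obs[t+1:] given that the model is in state i at position t.
    """
    n = len(init)
    T = len(obs)
    beta = [[0.0] * n for _ in range(T)]
    beta[T - 1] = [1.0] * n  # nothing left to emit after the last position
    # Work from the end of the sequence back toward the start.
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(
                trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                for j in range(n)
            )
    likelihood = sum(init[i] * emit[i][obs[0]] * beta[0][i] for i in range(n))
    return beta, likelihood


# Hypothetical two-state model: state 0 is background, state 1 is A-rich.
example_init = [0.5, 0.5]
example_trans = [[0.9, 0.1], [0.2, 0.8]]
example_emit = [
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
]

beta, likelihood = backward(example_init, example_trans, example_emit, "ACGA")
print(likelihood)
```

Each beta[t][i] summarizes everything downstream of position t, which is what lets information about the end of a path flow back to earlier states.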

Additionally, since rupa works with hidden Markov models, it can represent uncertainty about the presence of features (e.g. a degenerate binding site motif).
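
As a small illustration of what a degenerate motif looks like in probabilistic terms, the sketch below stores one emission distribution per motif position. The motif, its probabilities, and the function name are hypothetical and are not rupa's internal representation.

```python
# Hypothetical 4-position motif; each column is a distribution over A, C, G, T.
motif = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},  # strongly conserved A
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},  # strongly conserved G
    {"A": 0.45, "C": 0.05, "G": 0.05, "T": 0.45},  # degenerate: A or T
    {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01},  # strongly conserved C
]


def site_probability(pwm, site):
    """Probability that the motif emits `site`, one column per character."""
    if len(site) != len(pwm):
        raise ValueError("site length must match motif width")
    p = 1.0
    for column, ch in zip(pwm, site):
        p *= column[ch]
    return p


# Both variants of the degenerate third position are equally likely,
# while a mismatch at that position is much less likely.
print(site_probability(motif, "AGAC"))
print(site_probability(motif, "AGTC"))
print(site_probability(motif, "AGCC"))
```

Because every site receives a probability rather than a yes/no match, the model can express partial confidence that a feature is present.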

rupa is designed to work with hidden semi-Markov models (HSMMs), also called generalized hidden Markov models, of which HMMs are a special case. This means that it can learn the distance (in number of sequence characters) between pairs of sequence features.
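
The difference matters for distances because a plain HMM state with a self-loop can only produce a geometric length distribution, whereas an HSMM state may carry an explicit one. The sketch below contrasts the two; the spacer table and function names are hypothetical and only illustrate the idea.

```python
def geometric_duration(self_loop, d):
    """P(a plain HMM state with this self-loop probability lasts exactly d characters)."""
    return self_loop ** (d - 1) * (1.0 - self_loop)


# Hypothetical explicit duration table for an HSMM "spacer" state that
# prefers a gap of about 10 characters between two features.
spacer_duration = {8: 0.1, 9: 0.2, 10: 0.4, 11: 0.2, 12: 0.1}


def explicit_duration(table, d):
    """P(an HSMM state with this duration table lasts exactly d characters)."""
    return table.get(d, 0.0)


# A self-loop of 0.9 gives a mean length of 10, but the most likely
# length is still 1; the explicit table can peak at 10 instead.
for d in (1, 10, 20):
    print(d, geometric_duration(0.9, d), explicit_duration(spacer_duration, d))
```

Setting the duration table to a geometric distribution recovers the ordinary HMM behaviour, which is the sense in which HMMs are a special case.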

There are several built-in state types, including silent states, single-character-emitting states, a few varieties of character-emitting states that emit characters from a Markov-chain distribution, and a double-stranded-DNA binding site motif state.
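
To make the Markov-chain emission idea concrete, here is a minimal sketch of a first-order chain in which each character depends on the one before it. The order of the chain, the probabilities, and the function name are assumptions for illustration; rupa's own state types may differ in detail.

```python
import math


def markov_chain_log_prob(first, cond, segment):
    """Log-probability of `segment` under a first-order Markov chain.

    first[c]      : probability of c as the first character
    cond[prev][c] : probability of c given the previous character prev
    """
    if not segment:
        return 0.0
    logp = math.log(first[segment[0]])
    for prev, ch in zip(segment, segment[1:]):
        logp += math.log(cond[prev][ch])
    return logp


# Hypothetical parameters: uniform start, mild preference for repeats.
first_char = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
next_char = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
}

print(markov_chain_log_prob(first_char, next_char, "AACG"))
```

Working in log space avoids underflow when a state emits a long run of characters.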

rupa is written in C and compiles with gcc. The (new) scripts that run rupa multiple times to perform a structure search are written in Python. rupa has been tested on some Linux and Mac OS X (10.4) platforms.

Requirements:

Download:

License:

References:
  1. K. Noto and M. Craven. Learning Hidden Markov Models for Regression using Path Aggregation.
    Proceedings of the 24th Uncertainty in Artificial Intelligence Conference (UAI 2008), 444-451.