The ppm-star package

This package implements the Prediction by Partial Match* statistical model (originally developed for lossless data compression).

Here is a simple use of the package to generate probabilities for each element of each sequence in a list of sequences composed of some alphabet of symbols:

CL-USER> 
(defun simple-ppm-test (sequences alphabet) 
  (let ((model (ppm:make-ppm alphabet :escape :c :mixtures t 
                             :update-exclusion nil :order-bound nil)))
    (ppm:model-dataset model sequences :construct? t :predict? t)))
SIMPLE-PPM-TEST
CL-USER> (simple-ppm-test '((a b a b a b a b)) '(a b))
((0 (A ((A 0.5) (B 0.5))) (B ((A 0.75) (B 0.25))) (A ((A 0.5) (B 0.5)))
  (B ((A 0.36363637) (B 0.6363636))) (A ((A 0.6666667) (B 0.33333334)))
  (B ((A 0.35714284) (B 0.64285713))) (A ((A 0.6666667) (B 0.33333334)))
  (B ((A 0.3529412) (B 0.6470588)))))
CL-USER> 

The output is a list of lists, one for each sequence in the list supplied to the function. Each of these lists is itself a list, composed of the symbol that appeared at that position in the sequence and a probability distribution reflecting the models predictions for that position. The probability of the symbol appearing at that location can be obtained by looking up the symbol in the probability distribution or one can use the distribution to compute the entropy (uncertainty) of the model's prediction at that location.

The code in ppm-ui.lisp shows another example application: the following function returns the n-gram counts of a given order (n) in a list of sequences composed from symbols in the supplied alphabet:

(defun test-model (sequences alphabet order)
  (ngram-frequencies (build-model sequences alphabet) order))

Here are some examples:

CL-USER> (ppm-star::test-model '((a b r a c a d a b r a)) '(a b c d r) 3)           
(((D A B) 1) ((C A D) 1) ((R A C) 1) ((B R A) 2) ((A D A) 1) ((A C A) 1)
 ((A B R) 2))

 
CL-USER> (ppm-star::test-model '((l e t l e t t e r t e l e)) '(e l t r) 1)                     
(((R) 1) ((T) 4) ((E) 5) ((L) 3))

 
CL-USER> (ppm-star::test-model '((a g c g a c g a g)) '(a c g) 2)
(((C G) 2) ((G A) 2) ((G C) 1) ((A C) 1) ((A G) 2))

 
CL-USER> (ppm-star::test-model '((m i s s i s s i p p i)) '(i m p s) 4)
(((S I P P) 1) ((S I S S) 1) ((S S I P) 1) ((S S I S) 1) ((I P P I) 1)
 ((I S S I) 3) ((M I S S) 1))

 
CL-USER> (ppm-star::test-model '((a s s a n i s s i m a s s a)) '(a i m n s) 2)
(((M A) 1) ((I M) 1) ((I S) 1) ((N I) 1) ((S I) 1) ((S A) 2) ((S S) 3)
 ((A N) 1) ((A S) 2))

 
CL-USER> (ppm-star::test-model '((a s s a n i s s i m a s s a)
                                 (m i s s i s s i p p i))
                               '(a s n i m p)
                               2)
(((P I) 1) ((P P) 1) ((M I) 1) ((M A) 1) ((I P) 1) ((I M) 1) ((I S) 3)
 ((N I) 1) ((S I) 3) ((S A) 2) ((S S) 5) ((A N) 1) ((A S) 2))

CL-USER> (ppm-star::test-model '((a b r a c a d a b r a)
                                 (l e t l e t t e r t e l e)
                                 (a s s a n i s s i m a s s a)
                                 (m i s s i s s i p p i)
                                 (w o o l o o b o o l o o))
                               '(a b c d e i l m n o p r s t w)
                               3)
(((O B O) 1) ((O L O) 2) ((O O B) 1) ((O O L) 2) ((W O O) 1) ((P P I) 1)
 ((M I S) 1) ((M A S) 1) ((I P P) 1) ((I M A) 1) ((I S S) 3) ((N I S) 1)
 ((S I P) 1) ((S I S) 1) ((S I M) 1) ((S A N) 1) ((S S I) 3) ((S S A) 2)
 ((T E L) 1) ((T E R) 1) ((T T E) 1) ((T L E) 1) ((E L E) 1) ((E R T) 1)
 ((E T T) 1) ((E T L) 1) ((L O O) 2) ((L E T) 2) ((D A B) 1) ((C A D) 1)
 ((R T E) 1) ((R A C) 1) ((B O O) 1) ((B R A) 2) ((A N I) 1) ((A S S) 2)
 ((A D A) 1) ((A C A) 1) ((A B R) 2))
CL-USER> 

There is also a function to write the model to a postscript representation of a suffix tree:

CL-USER> (ppm-star:write-model-to-postscript 
          (ppm-star::build-model '((a b r a c a d a b r a)) '(a b c d r))
          "/tmp/ppm.ps")
NIL