1 <HEAD>
2 <TITLE>How to use the Bayes Net Toolbox</TITLE>
3 </HEAD>
4
5 <BODY BGCOLOR="#FFFFFF">
6 <!-- white background is better for the pictures and equations -->
7
8 <h1>How to use the Bayes Net Toolbox</h1>
9
10 This documentation was last updated on 29 October 2007.
11 <br>
12 Click
13 <a href="http://bnt.insa-rouen.fr/">here</a>
14 for a French version of this documentation (last updated in 2005).
15 <p>
16
17 <ul>
18 <li> <a href="install.html">Installation</a>
19
20 <li> <a href="#basics">Creating your first Bayes net</a>
21 <ul>
22 <li> <a href="#basics">Creating a model by hand</a>
23 <li> <a href="#file">Loading a model from a file</a>
24 <li> <a href="#GUI">Creating a model using a GUI</a>
25 <li> <a href="graphviz.html">Graph visualization</a>
26 </ul>
27
28 <li> <a href="#inference">Inference</a>
29 <ul>
30 <li> <a href="#marginal">Computing marginal distributions</a>
31 <li> <a href="#joint">Computing joint distributions</a>
32 <li> <a href="#soft">Soft/virtual evidence</a>
33 <li> <a href="#mpe">Most probable explanation</a>
34 </ul>
35
36 <li> <a href="#cpd">Conditional Probability Distributions</a>
37 <ul>
38 <li> <a href="#tabular">Tabular (multinomial) nodes</a>
39 <li> <a href="#noisyor">Noisy-or nodes</a>
40 <li> <a href="#deterministic">Other (noisy) deterministic nodes</a>
41 <li> <a href="#softmax">Softmax (multinomial logit) nodes</a>
42 <li> <a href="#mlp">Neural network nodes</a>
43 <li> <a href="#root">Root nodes</a>
44 <li> <a href="#gaussian">Gaussian nodes</a>
45 <li> <a href="#glm">Generalized linear model nodes</a>
46 <li> <a href="#dtree">Classification/regression tree nodes</a>
47 <li> <a href="#nongauss">Other continuous distributions</a>
48 <li> <a href="#cpd_summary">Summary of CPD types</a>
49 </ul>
50
51 <li> <a href="#examples">Example models</a>
52 <ul>
53 <li> <a
54 href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">
55 Gaussian mixture models</a>
56 <li> <a href="#pca">PCA, ICA, and all that</a>
<li> <a href="#mixexp">Mixtures of experts</a>
58 <li> <a href="#hme">Hierarchical mixtures of experts</a>
59 <li> <a href="#qmr">QMR</a>
60 <li> <a href="#cg_model">Conditional Gaussian models</a>
61 <li> <a href="#hybrid">Other hybrid models</a>
62 </ul>
63
64 <li> <a href="#param_learning">Parameter learning</a>
65 <ul>
66 <li> <a href="#load_data">Loading data from a file</a>
67 <li> <a href="#mle_complete">Maximum likelihood parameter estimation from complete data</a>
68 <li> <a href="#prior">Parameter priors</a>
69 <li> <a href="#bayes_learn">(Sequential) Bayesian parameter updating from complete data</a>
70 <li> <a href="#em">Maximum likelihood parameter estimation with missing values (EM)</a>
71 <li> <a href="#tying">Parameter tying</a>
72 </ul>
73
74 <li> <a href="#structure_learning">Structure learning</a>
75 <ul>
76 <li> <a href="#enumerate">Exhaustive search</a>
77 <li> <a href="#K2">K2</a>
78 <li> <a href="#hill_climb">Hill-climbing</a>
79 <li> <a href="#mcmc">MCMC</a>
80 <li> <a href="#active">Active learning</a>
81 <li> <a href="#struct_em">Structural EM</a>
82 <li> <a href="#graphdraw">Visualizing the learned graph structure</a>
83 <li> <a href="#constraint">Constraint-based methods</a>
84 </ul>
85
86
87 <li> <a href="#engines">Inference engines</a>
88 <ul>
89 <li> <a href="#jtree">Junction tree</a>
90 <li> <a href="#varelim">Variable elimination</a>
91 <li> <a href="#global">Global inference methods</a>
92 <li> <a href="#quickscore">Quickscore</a>
93 <li> <a href="#belprop">Belief propagation</a>
94 <li> <a href="#sampling">Sampling (Monte Carlo)</a>
95 <li> <a href="#engine_summary">Summary of inference engines</a>
96 </ul>
97
98
99 <li> <a href="#influence">Influence diagrams/ decision making</a>
100
101
102 <li> <a href="usage_dbn.html">DBNs, HMMs, Kalman filters and all that</a>
103 </ul>
104
106
107
108
109 <h1><a name="basics">Creating your first Bayes net</h1>
110
111 To define a Bayes net, you must specify the graph structure and then
112 the parameters. We look at each in turn, using a simple example
113 (adapted from Russell and
114 Norvig, "Artificial Intelligence: a Modern Approach", Prentice Hall,
115 1995, p454).
116
117
118 <h2>Graph structure</h2>
119
120
121 Consider the following network.
122
123 <p>
124 <center>
125 <IMG SRC="Figures/sprinkler.gif">
126 </center>
127 <p>
128
129 <P>
130 To specify this directed acyclic graph (dag), we create an adjacency matrix:
131 <PRE>
132 N = 4;
133 dag = zeros(N,N);
134 C = 1; S = 2; R = 3; W = 4;
135 dag(C,[R S]) = 1;
136 dag(R,W) = 1;
137 dag(S,W)=1;
138 </PRE>
139 <P>
140 We have numbered the nodes as follows:
141 Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4.
142 <b>The nodes must always be numbered in topological order, i.e.,
143 ancestors before descendants.</b>
144 For a more complicated graph, this is a little inconvenient: we will
145 see how to get around this <a href="usage_dbn.html#bat">below</a>.
146 <p>
147 In Matlab 6, you can use logical arrays instead of double arrays,
148 which are 4 times smaller:
149 <pre>
150 dag = false(N,N);
151 dag(C,[R S]) = true;
152 ...
153 </pre>
However, <b>some graph functions (e.g., acyclic) do not work on
logical arrays</b>!
156 <p>
157 You can visualize the resulting graph structure using
158 the methods discussed <a href="#graphdraw">below</a>.
159 For details on GUIs,
160 click <a href="#GUI">here</a>.
161
162 <h2>Creating the Bayes net shell</h2>
163
164 In addition to specifying the graph structure,
165 we must specify the size and type of each node.
If a node is discrete, its size is the
number of possible values
it can take on; if a node is continuous,
it can be a vector, and its size is the length of this vector.
170 In this case, we will assume all nodes are discrete and binary.
171 <PRE>
172 discrete_nodes = 1:N;
173 node_sizes = 2*ones(1,N);
174 </pre>
175 If the nodes were not binary, you could type e.g.,
176 <pre>
177 node_sizes = [4 2 3 5];
178 </pre>
179 meaning that Cloudy has 4 possible values,
180 Sprinkler has 2 possible values, etc.
Note that these are categorical values, not ordinal, i.e.,
they are not ordered in any way (unlike, say, 'low', 'medium', 'high').
183 <p>
184 We are now ready to make the Bayes net:
185 <pre>
186 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes);
187 </PRE>
188 By default, all nodes are assumed to be discrete, so we can also just
189 write
190 <pre>
191 bnet = mk_bnet(dag, node_sizes);
192 </PRE>
193 You may also specify which nodes will be observed.
If you don't know, or if this is not fixed in advance,
195 just use the empty list (the default).
196 <pre>
197 onodes = [];
198 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);
199 </PRE>
200 Note that optional arguments are specified using a name/value syntax.
201 This is common for many BNT functions.
202 In general, to find out more about a function (e.g., which optional
203 arguments it takes), please see its
204 documentation string by typing
205 <pre>
206 help mk_bnet
207 </pre>
208 See also other <a href="matlab_tips.html">useful Matlab tips</a>.
209 <p>
210 It is possible to associate names with nodes, as follows:
211 <pre>
212 bnet = mk_bnet(dag, node_sizes, 'names', {'cloudy','S','R','W'}, 'discrete', 1:4);
213 </pre>
214 You can then refer to a node by its name:
215 <pre>
216 C = bnet.names('cloudy'); % bnet.names is an associative array
217 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
218 </pre>
219 This feature uses my own associative array class.
220
221
222 <h2><a name="cpt">Parameters</h2>
223
224 A model consists of the graph structure and the parameters.
225 The parameters are represented by CPD objects (CPD = Conditional
226 Probability Distribution), which define the probability distribution
227 of a node given its parents.
228 (We will use the terms "node" and "random variable" interchangeably.)
229 The simplest kind of CPD is a table (multi-dimensional array), which
230 is suitable when all the nodes are discrete-valued. Note that the discrete
231 values are not assumed to be ordered in any way; that is, they
232 represent categorical quantities, like male and female, rather than
233 ordinal quantities, like low, medium and high.
234 (We will discuss CPDs in more detail <a href="#cpd">below</a>.)
235 <p>
236 Tabular CPDs, also called CPTs (conditional probability tables),
237 are stored as multidimensional arrays, where the dimensions
238 are arranged in the same order as the nodes, e.g., the CPT for node 4
239 (WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself.
240 Hence the child is always the last dimension.
241 If a node has no parents, its CPT is a column vector representing its
242 prior.
243 Note that in Matlab (unlike C), arrays are indexed
from 1, and are laid out in memory such that the first index toggles
245 fastest, e.g., the CPT for node 4 (WetGrass) is as follows
246 <P>
247 <P><IMG ALIGN=BOTTOM SRC="Figures/CPTgrass.gif"><P>
248 <P>
249 where we have used the convention that false==1, true==2.
250 We can create this CPT in Matlab as follows
251 <PRE>
252 CPT = zeros(2,2,2);
253 CPT(1,1,1) = 1.0;
254 CPT(2,1,1) = 0.1;
255 ...
256 </PRE>
257 Here is an easier way:
258 <PRE>
259 CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);
260 </PRE>
261 In fact, we don't need to reshape the array, since the CPD constructor
262 will do that for us. So we can just write
263 <pre>
264 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
265 </pre>
266 The other nodes are created similarly (using the old syntax for
267 optional parameters)
268 <PRE>
269 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
270 bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
271 bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
272 bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
273 </PRE>
274
275
276 <h2><a name="rnd_cpt">Random Parameters</h2>
277
278 If we do not specify the CPT, random parameters will be
279 created, i.e., each "row" of the CPT will be drawn from the uniform distribution.
280 To ensure repeatable results, use
281 <pre>
282 rand('state', seed);
283 randn('state', seed);
284 </pre>
285 To control the degree of randomness (entropy),
286 you can sample each row of the CPT from a Dirichlet(p,p,...) distribution.
287 If p << 1, this encourages "deterministic" CPTs (one entry near 1, the rest near 0).
288 If p = 1, each entry is drawn from U[0,1].
289 If p >> 1, the entries will all be near 1/k, where k is the arity of
290 this node, i.e., each row will be nearly uniform.
291 You can do this as follows, assuming this node
292 is number i, and ns is the node_sizes.
293 <pre>
294 k = ns(i);
295 ps = parents(dag, i);
296 psz = prod(ns(ps));
297 CPT = sample_dirichlet(p*ones(1,k), psz);
298 bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT);
299 </pre>
300
301
302 <h2><a name="file">Loading a network from a file</h2>
303
304 If you already have a Bayes net represented in the XML-based
305 <a href="http://www.cs.cmu.edu/afs/cs/user/fgcozman/www/Research/InterchangeFormat/">
306 Bayes Net Interchange Format (BNIF)</a> (e.g., downloaded from the
307 <a
308 href="http://www.cs.huji.ac.il/labs/compbio/Repository">
309 Bayes Net repository</a>),
310 you can convert it to BNT format using
311 the
312 <a href="http://www.digitas.harvard.edu/~ken/bif2bnt/">BIF-BNT Java
313 program</a> written by Ken Shan.
314 (This is not necessarily up-to-date.)
315 <p>
316 <b>It is currently not possible to save/load a BNT matlab object to
317 file</b>, but this is easily fixed if you modify all the constructors
318 for all the classes (see matlab documentation).
319
320 <h2><a name="GUI">Creating a model using a GUI</h2>
321
322 <ul>
323 <li>Senthil Nachimuthu
324 has started (Oct 07) an open source
325 GUI for BNT called
326 <a href="http://projeny.sourceforge.net">projeny</a>
327 using Java. This is a successor to BNJ.
328
329 <li>
330 Philippe LeRay has written (Sep 05)
331 a
332 <a href="http://banquiseasi.insa-rouen.fr/projects/bnt-editor/">
333 BNT GUI</a> in matlab.
334
335 <li>
336 <a
337 href="http://www.dataonstage.com/BNT/PACKAGES/LinkStrength/index.html">
338 LinkStrength</a>,
339 a package by Imme Ebert-Uphoff for visualizing the strength of
340 dependencies between nodes.
341
342 </ul>
343
344 <h2>Graph visualization</h2>
345
346 Click <a href="graphviz.html">here</a> for more information
347 on graph visualization.
348
349
350 <h1><a name="inference">Inference</h1>
351
352 Having created the BN, we can now use it for inference.
353 There are many different algorithms for doing inference in Bayes nets,
354 that make different tradeoffs between speed,
355 complexity, generality, and accuracy.
356 BNT therefore offers a variety of different inference
357 "engines". We will discuss these
358 in more detail <a href="#engines">below</a>.
359 For now, we will use the junction tree
360 engine, which is the mother of all exact inference algorithms.
361 This can be created as follows.
362 <pre>
363 engine = jtree_inf_engine(bnet);
364 </pre>
365 The other engines have similar constructors, but might take
366 additional, algorithm-specific parameters.
367 All engines are used in the same way, once they have been created.
368 We illustrate this in the following sections.
369
370
371 <h2><a name="marginal">Computing marginal distributions</h2>
372
Suppose we want to compute the probability that the sprinkler was on
374 given that the grass is wet.
375 The evidence consists of the fact that W=2. All the other nodes
376 are hidden (unobserved). We can specify this as follows.
377 <pre>
378 evidence = cell(1,N);
379 evidence{W} = 2;
380 </pre>
381 We use a 1D cell array instead of a vector to
382 cope with the fact that nodes can be vectors of different lengths.
383 In addition, the value [] can be used
384 to denote 'no evidence', instead of having to specify the observation
385 pattern as a separate argument.
386 (Click <a href="cellarray.html">here</a> for a quick tutorial on cell
387 arrays in matlab.)
388 <p>
389 We are now ready to add the evidence to the engine.
390 <pre>
391 [engine, loglik] = enter_evidence(engine, evidence);
392 </pre>
393 The behavior of this function is algorithm-specific, and is discussed
394 in more detail <a href="#engines">below</a>.
395 In the case of the jtree engine,
396 enter_evidence implements a two-pass message-passing scheme.
397 The first return argument contains the modified engine, which
398 incorporates the evidence. The second return argument contains the
399 log-likelihood of the evidence. (Not all engines are capable of
400 computing the log-likelihood.)
401 <p>
402 Finally, we can compute p=P(S=2|W=2) as follows.
403 <PRE>
404 marg = marginal_nodes(engine, S);
405 marg.T
406 ans =
407 0.57024
408 0.42976
409 p = marg.T(2);
410 </PRE>
411 We see that p = 0.4298.
412 <p>
413 Now let us add the evidence that it was raining, and see what
414 difference it makes.
415 <PRE>
416 evidence{R} = 2;
417 [engine, loglik] = enter_evidence(engine, evidence);
418 marg = marginal_nodes(engine, S);
419 p = marg.T(2);
420 </PRE>
421 We find that p = P(S=2|W=2,R=2) = 0.1945,
422 which is lower than
before, because the rain can "explain away" the
424 fact that the grass is wet.
425 <p>
426 You can plot a marginal distribution over a discrete variable
as a bar chart using the built-in 'bar' function:
428 <pre>
429 bar(marg.T)
430 </pre>
431 This is what it looks like
432
433 <p>
434 <center>
435 <IMG SRC="Figures/sprinkler_bar.gif">
436 </center>
437 <p>
438
439 <h2><a name="observed">Observed nodes</h2>
440
441 What happens if we ask for the marginal on an observed node, e.g. P(W|W=2)?
442 An observed discrete node effectively only has 1 value (the observed
443 one) --- all other values would result in 0 probability.
For efficiency, BNT treats observed (discrete) nodes as if their size
were 1, as we see below:
446 <pre>
447 evidence = cell(1,N);
448 evidence{W} = 2;
449 engine = enter_evidence(engine, evidence);
450 m = marginal_nodes(engine, W);
451 m.T
452 ans =
453 1
454 </pre>
455 This can get a little confusing, since we assigned W=2.
456 So we can ask BNT to add the evidence back in by passing in an optional argument:
457 <pre>
458 m = marginal_nodes(engine, W, 1);
459 m.T
460 ans =
461 0
462 1
463 </pre>
464 This shows that P(W=1|W=2) = 0 and P(W=2|W=2) = 1.
465
466
467
468 <h2><a name="joint">Computing joint distributions</h2>
469
470 We can compute the joint probability on a set of nodes as in the
471 following example.
472 <pre>
473 evidence = cell(1,N);
474 [engine, ll] = enter_evidence(engine, evidence);
475 m = marginal_nodes(engine, [S R W]);
476 </pre>
477 m is a structure. The 'T' field is a multi-dimensional array (in
478 this case, 3-dimensional) that contains the joint probability
479 distribution on the specified nodes.
480 <pre>
481 >> m.T
482 ans(:,:,1) =
483 0.2900 0.0410
484 0.0210 0.0009
485 ans(:,:,2) =
486 0 0.3690
487 0.1890 0.0891
488 </pre>
489 We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass
490 to be wet if both the rain and sprinkler are off.
491 <p>
492 Let us now add some evidence to R.
493 <pre>
494 evidence{R} = 2;
495 [engine, ll] = enter_evidence(engine, evidence);
496 m = marginal_nodes(engine, [S R W])
497 m =
498 domain: [2 3 4]
499 T: [2x1x2 double]
500 >> m.T
502 ans(:,:,1) =
503 0.0820
504 0.0018
505 ans(:,:,2) =
506 0.7380
507 0.1782
508 </pre>
509 The joint T(i,j,k) = P(S=i,R=j,W=k|evidence)
510 should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible
511 with the evidence that R=2.
512 Instead of creating large tables with many 0s, BNT sets the effective
513 size of observed (discrete) nodes to 1, as explained above.
514 This is why m.T has size 2x1x2.
515 To get a 2x2x2 table, type
516 <pre>
517 m = marginal_nodes(engine, [S R W], 1)
518 m =
519 domain: [2 3 4]
520 T: [2x2x2 double]
521 >> m.T
523 ans(:,:,1) =
524 0 0.082
525 0 0.0018
526 ans(:,:,2) =
527 0 0.738
528 0 0.1782
529 </pre>
530
531 <p>
532 Note: It is not always possible to compute the joint on arbitrary
533 sets of nodes: it depends on which inference engine you use, as discussed
534 in more detail <a href="#engines">below</a>.
535
536
537 <h2><a name="soft">Soft/virtual evidence</h2>
538
539 Sometimes a node is not observed, but we have some distribution over
540 its possible values; this is often called "soft" or "virtual"
541 evidence.
542 One can use this as follows
543 <pre>
544 [engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
545 </pre>
546 where soft_evidence{i} is either [] (if node i has no soft evidence)
547 or is a vector representing the probability distribution over i's
548 possible values.
549 For example, if we don't know i's exact value, but we know its
550 likelihood ratio is 60/40, we can write evidence{i} = [] and
551 soft_evidence{i} = [0.6 0.4].
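For instance, here is a minimal sketch using the sprinkler network from
above (the 60/40 likelihood on the Sprinkler node is purely illustrative):
<pre>
soft_evidence = cell(1,N);
soft_evidence{S} = [0.6 0.4]; % soft (likelihood) evidence on Sprinkler
evidence = cell(1,N);         % no hard evidence
engine = jtree_inf_engine(bnet);
[engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
marg = marginal_nodes(engine, W);
</pre>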
552 <p>
553 Currently only jtree_inf_engine supports this option.
554 It assumes that all hidden nodes, and all nodes for
555 which we have soft evidence, are discrete.
556 For a longer example, see BNT/examples/static/softev1.m.
557
558
559 <h2><a name="mpe">Most probable explanation</h2>
560
561 To compute the most probable explanation (MPE) of the evidence (i.e.,
562 the most probable assignment, or a mode of the joint), use
563 <pre>
564 [mpe, ll] = calc_mpe(engine, evidence);
565 </pre>
566 mpe{i} is the most likely value of node i.
567 This calls enter_evidence with the 'maximize' flag set to 1, which
568 causes the engine to do max-product instead of sum-product.
569 The resulting max-marginals are then thresholded.
570 If there is more than one maximum probability assignment, we must take
571 care to break ties in a consistent manner (thresholding the
572 max-marginals may give the wrong result). To force this behavior,
573 type
574 <pre>
575 [mpe, ll] = calc_mpe(engine, evidence, 1);
576 </pre>
Note that computing the MPE is sometimes called abductive reasoning.
578
579 <p>
You can also use <tt>calc_mpe_bucket</tt>, written by Ron Zohar,
which does a forward max-product pass followed by a backward traceback
pass; this is how Viterbi is traditionally implemented.
583
584
585
586 <h1><a name="cpd">Conditional Probability Distributions</h1>
587
A Conditional Probability Distribution (CPD)
589 defines P(X(i) | X(Pa(i))), where X(i) is the i'th node, and X(Pa(i))
590 are the parents of node i. There are many ways to represent this
591 distribution, which depend in part on whether X(i) and X(Pa(i)) are
592 discrete, continuous, or a combination.
593 We will discuss various representations below.
594
595
596 <h2><a name="tabular">Tabular nodes</h2>
597
598 If the CPD is represented as a table (i.e., if it is a multinomial
599 distribution), it has a number of parameters that is exponential in
600 the number of parents. See the example <a href="#cpt">above</a>.
601
602
603 <h2><a name="noisyor">Noisy-or nodes</h2>
604
605 A noisy-OR node is like a regular logical OR gate except that
606 sometimes the effects of parents that are on get inhibited.
607 Let the prob. that parent i gets inhibited be q(i).
608 Then a node, C, with 2 parents, A and B, has the following CPD, where
609 we use F and T to represent off and on (1 and 2 in BNT).
610 <pre>
611 A B P(C=off) P(C=on)
612 ---------------------------
613 F F 1.0 0.0
614 T F q(A) 1-q(A)
615 F T q(B) 1-q(B)
616 T T q(A)q(B) 1-q(A)q(B)
617 </pre>
618 Thus we see that the causes get inhibited independently.
619 It is common to associate a "leak" node with a noisy-or CPD, which is
620 like a parent that is always on. This can account for all other unmodelled
621 causes which might turn the node on.
622 <p>
623 The noisy-or distribution is similar to the logistic distribution.
624 To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j)
625 be the prob. that j inhibits i. Then
626 <pre>
627 Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j)
628 </pre>
629 Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then
630 <pre>
631 Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j))
632 </pre>
633 For a sigmoid node, we have
634 <pre>
635 Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j))
636 </pre>
637 where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of
638 the activation function (although both are monotonically increasing).
639 In addition, in the case of a noisy-or, the weights are constrained to be
640 positive, since they derive from probabilities q(i,j).
641 In both cases, the number of parameters is <em>linear</em> in the
642 number of parents, unlike the case of a multinomial distribution,
643 where the number of parameters is exponential in the number of parents.
644 We will see an example of noisy-OR nodes <a href="#qmr">below</a>.
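As a minimal sketch (the inhibition probabilities below are made up, and we
assume a bnet in which the binary node C has binary parents A and B), such a
CPD can be created with the noisyor_CPD constructor, which also appears in
the QMR example below:
<pre>
q_leak = 1.0;           % inhibition prob. of the implicit leak parent (1.0 = no leak effect)
q_parents = [0.1 0.2];  % q(A) and q(B)
bnet.CPD{C} = noisyor_CPD(bnet, C, q_leak, q_parents);
</pre>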
645
646
647 <h2><a name="deterministic">Other (noisy) deterministic nodes</h2>
648
649 Deterministic CPDs for discrete random variables can be created using
650 the deterministic_CPD class. It is also possible to 'flip' the output
651 of the function with some probability, to simulate noise.
652 The boolean_CPD class is just a special case of a
653 deterministic CPD, where the parents and child are all binary.
654 <p>
655 Both of these classes are just "syntactic sugar" for the tabular_CPD
656 class.
657
658
659
660 <h2><a name="softmax">Softmax nodes</h2>
661
662 If we have a discrete node with a continuous parent,
663 we can define its CPD using a softmax function
664 (also known as the multinomial logit function).
665 This acts like a soft thresholding operator, and is defined as follows:
<pre>
Pr(Q=i | X=x) = exp(w(:,i)'*x + b(i)) / sum_j exp(w(:,j)'*x + b(j))
</pre>
672 The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the
673 following interpretation: w(:,i)-w(:,j) is the normal vector to the
674 decision boundary between classes i and j,
675 and b(i)-b(j) is its offset (bias). For example, suppose
676 X is a 2-vector, and Q is binary. Then
677 <pre>
678 w = [1 -1;
679 0 0];
680
681 b = [0 0];
682 </pre>
683 means class 1 are points in the 2D plane with positive x coordinate,
684 and class 2 are points in the 2D plane with negative x coordinate.
685 If w has large magnitude, the decision boundary is sharp, otherwise it
686 is soft.
687 In the special case that Q is binary (0/1), the softmax function reduces to the logistic
688 (sigmoid) function.
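As a minimal sketch (the node sizes here are arbitrary), we can create a
small conditional model in which a 2-dimensional continuous input X feeds
into a 3-valued discrete node Q with a softmax CPD (random w and b):
<pre>
X = 1; Q = 2;
dag = zeros(2,2); dag(X,Q) = 1;
ns = [2 3];   % X is a 2-vector, Q has 3 possible values
bnet = mk_bnet(dag, ns, 'discrete', Q, 'observed', X);
bnet.CPD{X} = root_CPD(bnet, X);     % exogenous input, no parameters
bnet.CPD{Q} = softmax_CPD(bnet, Q);  % random w(:,i), b(i)
</pre>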
689 <p>
690 Fitting a softmax function can be done using the iteratively reweighted
691 least squares (IRLS) algorithm.
692 We use the implementation from
693 <a href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>.
694 Note that since
695 the softmax distribution is not in the exponential family, it does not
696 have finite sufficient statistics, and hence we must store all the
697 training data in uncompressed form.
698 If this takes too much space, one should use online (stochastic) gradient
699 descent (not implemented in BNT).
700 <p>
701 If a softmax node also has discrete parents,
702 we use a different set of w/b parameters for each combination of
703 parent values, as in the <a href="#gaussian">conditional linear
704 Gaussian CPD</a>.
705 This feature was implemented by Pierpaolo Brutti.
706 He is currently extending it so that discrete parents can be treated
707 as if they were continuous, by adding indicator variables to the X
708 vector.
709 <p>
710 We will see an example of softmax nodes <a href="#mixexp">below</a>.
711
712
713 <h2><a name="mlp">Neural network nodes</h2>
714
715 Pierpaolo Brutti has implemented the mlp_CPD class, which uses a multi layer perceptron
716 to implement a mapping from continuous parents to discrete children,
717 similar to the softmax function.
718 (If there are also discrete parents, it creates a mixture of MLPs.)
719 It uses code from <a
720 href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>.
721 This is work in progress.
722
723 <h2><a name="root">Root nodes</h2>
724
725 A root node has no parents and no parameters; it can be used to model
726 an observed, exogeneous input variable, i.e., one which is "outside"
727 the model.
728 This is useful for conditional density models.
729 We will see an example of root nodes <a href="#mixexp">below</a>.
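Since a root node has no parameters, its CPD is created with nothing more
than the node index, as in the mixture of experts example below:
<pre>
bnet.CPD{i} = root_CPD(bnet, i);
</pre>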
730
731
732 <h2><a name="gaussian">Gaussian nodes</h2>
733
734 We now consider a distribution suitable for the continuous-valued nodes.
735 Suppose the node is called Y, its continuous parents (if any) are
736 called X, and its discrete parents (if any) are called Q.
737 The distribution on Y is defined as follows:
738 <pre>
739 - no parents: Y ~ N(mu, Sigma)
740 - cts parents : Y|X=x ~ N(mu + W x, Sigma)
741 - discrete parents: Y|Q=i ~ N(mu(:,i), Sigma(:,:,i))
742 - cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i))
743 </pre>
744 where N(mu, Sigma) denotes a Normal distribution with mean mu and
745 covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q
746 respectively.
747 If there are no discrete parents, |Q|=1; if there is
748 more than one, then |Q| = a vector of the sizes of each discrete parent.
749 If there are no continuous parents, |X|=0; if there is more than one,
750 then |X| = the sum of their sizes.
751 Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive
752 semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight)
753 matrix.
754 <p>
755 We can create a Gaussian node with random parameters as follows.
756 <pre>
757 bnet.CPD{i} = gaussian_CPD(bnet, i);
758 </pre>
759 We can specify the value of one or more of the parameters as in the
760 following example, in which |Y|=2, and |Q|=1.
761 <pre>
762 bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', [0; 0], 'weights', randn(Y,X), 'cov', eye(Y));
763 </pre>
764 <p>
765 We will see an example of conditional linear Gaussian nodes <a
766 href="#cg_model">below</a>.
767 <p>
768 <b>When learning Gaussians from data</b>, it is helpful to ensure the
data has a small magnitude
770 (see e.g., KPMstats/standardize) to prevent numerical problems.
771 Unless you have a lot of data, it is also a very good idea to use
772 diagonal instead of full covariance matrices.
773 (BNT does not currently support spherical covariances, although it
774 would be easy to add, since KPMstats/clg_Mstep supports this option;
775 you would just need to modify gaussian_CPD/update_ess to accumulate
776 weighted inner products.)
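For example, here is a minimal sketch of a Gaussian CPD with random
parameters but a diagonal covariance, using the 'cov_type' option that also
appears in the factor analysis example below:
<pre>
bnet.CPD{i} = gaussian_CPD(bnet, i, 'cov_type', 'diag');
</pre>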
777
778
779
780 <h2><a name="nongauss">Other continuous distributions</h2>
781
782 Currently BNT does not support any CPDs for continuous nodes other
783 than the Gaussian.
784 However, you can use a mixture of Gaussians to
approximate other continuous distributions. We will see an example
786 of this with the IFA model <a href="#pca">below</a>.
787
788
789 <h2><a name="glm">Generalized linear model nodes</h2>
790
791 In the future, we may incorporate some of the functionality of
792 <a href =
793 "http://www.sci.usq.edu.au/staff/dunn/glmlab/glmlab.html">glmlab</a>
794 into BNT.
795
796
797 <h2><a name="dtree">Classification/regression tree nodes</h2>
798
799 We plan to add classification and regression trees to define CPDs for
800 discrete and continuous nodes, respectively.
801 Trees have many advantages: they are easy to interpret, they can do
802 feature selection, they can
803 handle discrete and continuous inputs, they do not make strong
804 assumptions about the form of the distribution, the number of
805 parameters can grow in a data-dependent way (i.e., they are
806 semi-parametric), they can handle missing data, etc.
807 However, they are not yet implemented.
808 <!--
809 Yimin Zhang is currently (Feb '02) implementing this.
810 -->
811
812
813 <h2><a name="cpd_summary">Summary of CPD types</h2>
814
815 We list all the different types of CPDs supported by BNT.
816 For each CPD, we specify if the child and parents can be discrete (D) or
817 continuous (C) (Binary (B) nodes are a special case).
818 We also specify which methods each class supports.
819 If a method is inherited, the name of the parent class is mentioned.
820 If a parent class calls a child method, this is mentioned.
821 <p>
822 The <tt>CPD_to_CPT</tt> method converts a CPD to a table; this
823 requires that the child and all parents are discrete.
824 The CPT might be exponentially big...
825 <tt>convert_to_table</tt> evaluates a CPD with evidence, and
represents the resulting potential as an array.
827 This requires that the child is discrete, and any continuous parents
828 are observed.
829 <tt>convert_to_pot</tt> evaluates a CPD with evidence, and
830 represents the resulting potential as a dpot, gpot, cgpot or upot, as
831 requested. (d=discrete, g=Gaussian, cg = conditional Gaussian, u =
832 utility).
833
834 <p>
835 When we sample a node, all the parents are observed.
836 When we compute the (log) probability of a node, all the parents and
837 the child are observed.
838 <p>
839 We also specify if the parameters are learnable.
840 For learning with EM, we require
841 the methods <tt>reset_ess</tt>, <tt>update_ess</tt> and
842 <tt>maximize_params</tt>.
843 For learning from fully observed data, we require
844 the method <tt>learn_params</tt>.
845 By default, all classes inherit this from generic_CPD, which simply
846 calls <tt>update_ess</tt> N times, once for each data case, followed
847 by <tt>maximize_params</tt>, i.e., it is like EM, without the E step.
848 Some classes implement a batch formula, which is quicker.
849 <p>
850 Bayesian learning means computing a posterior over the parameters
851 given fully observed data.
852 <p>
853 Pearl means we implement the methods <tt>compute_pi</tt> and
854 <tt>compute_lambda_msg</tt>, used by
855 <tt>pearl_inf_engine</tt>, which runs on directed graphs.
<tt>belprop_inf_engine</tt> only needs <tt>convert_to_pot</tt>.
857 The pearl methods can exploit special properties of the CPDs for
858 computing the messages efficiently, whereas belprop does not.
859 <p>
860 The only method implemented by generic_CPD is <tt>adjustable_CPD</tt>,
861 which is not shown, since it is not very interesting.
862
863
864 <p>
865
866
<table border>
<tr>
869 <td align=center>Name
870 <td align=center>Child
871 <td align=center>Parents
872 <td align=center>Comments
873 <td align=center>CPD_to_CPT
874 <td align=center>conv_to_table
875 <td align=center>conv_to_pot
876 <td align=center>sample
877 <td align=center>prob
878 <td align=center>learn
879 <td align=center>Bayes
880 <td align=center>Pearl
881
882
883 <tr>
884 <!-- Name--><td>
885 <!-- Child--><td>
886 <!-- Parents--><td>
887 <!-- Comments--><td>
888 <!-- CPD_to_CPT--><td>
889 <!-- conv_to_table--><td>
890 <!-- conv_to_pot--><td>
891 <!-- sample--><td>
892 <!-- prob--><td>
893 <!-- learn--><td>
894 <!-- Bayes--><td>
895 <!-- Pearl--><td>
896
897 <tr>
898 <!-- Name--><td>boolean
899 <!-- Child--><td>B
900 <!-- Parents--><td>B
901 <!-- Comments--><td>Syntactic sugar for tabular
902 <!-- CPD_to_CPT--><td>-
903 <!-- conv_to_table--><td>-
904 <!-- conv_to_pot--><td>-
905 <!-- sample--><td>-
906 <!-- prob--><td>-
907 <!-- learn--><td>-
908 <!-- Bayes--><td>-
909 <!-- Pearl--><td>-
910
911 <tr>
912 <!-- Name--><td>deterministic
913 <!-- Child--><td>D
914 <!-- Parents--><td>D
915 <!-- Comments--><td>Syntactic sugar for tabular
916 <!-- CPD_to_CPT--><td>-
917 <!-- conv_to_table--><td>-
918 <!-- conv_to_pot--><td>-
919 <!-- sample--><td>-
920 <!-- prob--><td>-
921 <!-- learn--><td>-
922 <!-- Bayes--><td>-
923 <!-- Pearl--><td>-
924
925 <tr>
926 <!-- Name--><td>Discrete
927 <!-- Child--><td>D
928 <!-- Parents--><td>C/D
929 <!-- Comments--><td>Virtual class
930 <!-- CPD_to_CPT--><td>N
931 <!-- conv_to_table--><td>Calls CPD_to_CPT
932 <!-- conv_to_pot--><td>Calls conv_to_table
933 <!-- sample--><td>Calls conv_to_table
934 <!-- prob--><td>Calls conv_to_table
935 <!-- learn--><td>N
936 <!-- Bayes--><td>N
937 <!-- Pearl--><td>N
938
939 <tr>
940 <!-- Name--><td>Gaussian
941 <!-- Child--><td>C
942 <!-- Parents--><td>C/D
943 <!-- Comments--><td>-
944 <!-- CPD_to_CPT--><td>N
945 <!-- conv_to_table--><td>N
946 <!-- conv_to_pot--><td>Y
947 <!-- sample--><td>Y
948 <!-- prob--><td>Y
949 <!-- learn--><td>Y
950 <!-- Bayes--><td>N
951 <!-- Pearl--><td>N
952
953 <tr>
954 <!-- Name--><td>gmux
955 <!-- Child--><td>C
956 <!-- Parents--><td>C/D
957 <!-- Comments--><td>multiplexer
958 <!-- CPD_to_CPT--><td>N
959 <!-- conv_to_table--><td>N
960 <!-- conv_to_pot--><td>Y
961 <!-- sample--><td>N
962 <!-- prob--><td>N
963 <!-- learn--><td>N
964 <!-- Bayes--><td>N
965 <!-- Pearl--><td>Y
966
967
968 <tr>
969 <!-- Name--><td>MLP
970 <!-- Child--><td>D
971 <!-- Parents--><td>C/D
972 <!-- Comments--><td>multi layer perceptron
973 <!-- CPD_to_CPT--><td>N
974 <!-- conv_to_table--><td>Y
975 <!-- conv_to_pot--><td>Inherits from discrete
976 <!-- sample--><td>Inherits from discrete
977 <!-- prob--><td>Inherits from discrete
978 <!-- learn--><td>Y
979 <!-- Bayes--><td>N
980 <!-- Pearl--><td>N
981
982
983 <tr>
984 <!-- Name--><td>noisy-or
985 <!-- Child--><td>B
986 <!-- Parents--><td>B
987 <!-- Comments--><td>-
988 <!-- CPD_to_CPT--><td>Y
989 <!-- conv_to_table--><td>Inherits from discrete
990 <!-- conv_to_pot--><td>Inherits from discrete
991 <!-- sample--><td>Inherits from discrete
992 <!-- prob--><td>Inherits from discrete
993 <!-- learn--><td>N
994 <!-- Bayes--><td>N
995 <!-- Pearl--><td>Y
996
997
998 <tr>
999 <!-- Name--><td>root
1000 <!-- Child--><td>C/D
1001 <!-- Parents--><td>none
1002 <!-- Comments--><td>no params
1003 <!-- CPD_to_CPT--><td>N
1004 <!-- conv_to_table--><td>N
1005 <!-- conv_to_pot--><td>Y
1006 <!-- sample--><td>Y
1007 <!-- prob--><td>Y
1008 <!-- learn--><td>N
1009 <!-- Bayes--><td>N
1010 <!-- Pearl--><td>N
1011
1012
1013 <tr>
1014 <!-- Name--><td>softmax
1015 <!-- Child--><td>D
1016 <!-- Parents--><td>C/D
1017 <!-- Comments--><td>-
1018 <!-- CPD_to_CPT--><td>N
1019 <!-- conv_to_table--><td>Y
1020 <!-- conv_to_pot--><td>Inherits from discrete
1021 <!-- sample--><td>Inherits from discrete
1022 <!-- prob--><td>Inherits from discrete
1023 <!-- learn--><td>Y
1024 <!-- Bayes--><td>N
1025 <!-- Pearl--><td>N
1026
1027
1028 <tr>
1029 <!-- Name--><td>generic
1030 <!-- Child--><td>C/D
1031 <!-- Parents--><td>C/D
1032 <!-- Comments--><td>Virtual class
1033 <!-- CPD_to_CPT--><td>N
1034 <!-- conv_to_table--><td>N
1035 <!-- conv_to_pot--><td>N
1036 <!-- sample--><td>N
1037 <!-- prob--><td>N
1038 <!-- learn--><td>N
1039 <!-- Bayes--><td>N
1040 <!-- Pearl--><td>N
1041
1042
1043 <tr>
1044 <!-- Name--><td>Tabular
1045 <!-- Child--><td>D
1046 <!-- Parents--><td>D
1047 <!-- Comments--><td>-
1048 <!-- CPD_to_CPT--><td>Y
1049 <!-- conv_to_table--><td>Inherits from discrete
1050 <!-- conv_to_pot--><td>Inherits from discrete
1051 <!-- sample--><td>Inherits from discrete
1052 <!-- prob--><td>Inherits from discrete
1053 <!-- learn--><td>Y
1054 <!-- Bayes--><td>Y
1055 <!-- Pearl--><td>Y
1056
1057 </table>
1058
1059
1060
1061 <h1><a name="examples">Example models</h1>
1062
1063
1064 <h2>Gaussian mixture models</h2>
1065
1066 Richard W. DeVaul has made a detailed tutorial on how to fit mixtures
1067 of Gaussians using BNT. Available
1068 <a href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">here</a>.
1069
1070
1071 <h2><a name="pca">PCA, ICA, and all that </h2>
1072
1073 In Figure (a) below, we show how Factor Analysis can be thought of as a
1074 graphical model. Here, X has an N(0,I) prior, and
1075 Y|X=x ~ N(mu + Wx, Psi),
1076 where Psi is diagonal and W is called the "factor loading matrix".
1077 Since the noise on both X and Y is diagonal, the components of these
1078 vectors are uncorrelated, and hence can be represented as individual
1079 scalar nodes, as we show in (b).
1080 (This is useful if parts of the observations on the Y vector are occasionally missing.)
1081 We usually take k=|X| << |Y|=D, so the model tries to explain
1082 many observations using a low-dimensional subspace.
1083
1084
1085 <center>
1086 <table>
1087 <tr>
1088 <td><img src="Figures/fa.gif">
1089 <td><img src="Figures/fa_scalar.gif">
1090 <td><img src="Figures/mfa.gif">
1091 <td><img src="Figures/ifa.gif">
1092 <tr>
1093 <td align=center> (a)
1094 <td align=center> (b)
1095 <td align=center> (c)
1096 <td align=center> (d)
1097 </table>
1098 </center>
1099
1100 <p>
1101 We can create this model in BNT as follows.
1102 <pre>
1103 ns = [k D];
1104 dag = zeros(2,2);
1105 dag(1,2) = 1;
1106 bnet = mk_bnet(dag, ns, 'discrete', []);
1107 bnet.CPD{1} = gaussian_CPD(bnet, 1, 'mean', zeros(k,1), 'cov', eye(k), ...
1108 'cov_type', 'diag', 'clamp_mean', 1, 'clamp_cov', 1);
1109 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(D,1), 'cov', diag(Psi0), 'weights', W0, ...
1110 'cov_type', 'diag', 'clamp_mean', 1);
1111 </pre>
1112
1113 The root node is clamped to the N(0,I) distribution, so that we will
1114 not update these parameters during learning.
1115 The mean of the leaf node is clamped to 0,
1116 since we assume the data has been centered (had its mean subtracted
1117 off); this is just for simplicity.
1118 Finally, the covariance of the leaf node is constrained to be
1119 diagonal. W0 and Psi0 are the initial parameter guesses.
1120
1121 <p>
1122 We can fit this model (i.e., estimate its parameters in a maximum
1123 likelihood (ML) sense) using EM, as we
1124 explain <a href="#em">below</a>.
1125 Not surprisingly, the ML estimates for mu and Psi turn out to be
1126 identical to the
1127 sample mean and variance, which can be computed directly as
1128 <pre>
1129 mu_ML = mean(data);
1130 Psi_ML = diag(cov(data));
1131 </pre>
1132 Note that W can only be identified up to a rotation matrix, because of
1133 the spherical symmetry of the source.
1134
1135 <p>
1136 If we restrict Psi to be spherical, i.e., Psi = sigma*I,
1137 there is a closed-form solution for W as well,
1138 i.e., we do not need to use EM.
1139 In particular, W contains the first |X| eigenvectors of the sample covariance
1140 matrix, with scalings determined by the eigenvalues and sigma.
1141 Classical PCA can be obtained by taking the sigma->0 limit.
1142 For details, see
1143
1144 <ul>
1145 <li> <a href="ftp://hope.caltech.edu/pub/roweis/Empca/empca.ps">
1146 "EM algorithms for PCA and SPCA"</a>, Sam Roweis, NIPS 97.
1147 (<a href="ftp://hope.caltech.edu/pub/roweis/Code/empca.tar.gz">
1148 Matlab software</a>)
1149
1150 <p>
1151 <li>
1152 <a
1153 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003>
1154 "Mixtures of probabilistic principal component analyzers"</a>,
1155 Tipping and Bishop, Neural Computation 11(2):443--482, 1999.
1156 </ul>
1157
1158 <p>
1159 By adding a hidden discrete variable, we can create mixtures of FA
1160 models, as shown in (c).
1161 Now we can explain the data using a set of subspaces.
1162 We can create this model in BNT as follows.
1163 <pre>
1164 ns = [M k D];
1165 dag = zeros(3);
1166 dag(1,3) = 1;
1167 dag(2,3) = 1;
1168 bnet = mk_bnet(dag, ns, 'discrete', 1);
1169 bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0);
1170 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(k, 1), 'cov', eye(k), 'cov_type', 'diag', ...
1171 'clamp_mean', 1, 'clamp_cov', 1);
1172 bnet.CPD{3} = gaussian_CPD(bnet, 3, 'mean', Mu0', 'cov', repmat(diag(Psi0), [1 1 M]), ...
1173 'weights', W0, 'cov_type', 'diag', 'tied_cov', 1);
1174 </pre>
1175 Notice how the covariance matrix for Y is the same for all values of
1176 Q; that is, the noise level in each sub-space is assumed the same.
1177 However, we allow the offset, mu, to vary.
1178 For details, see
1179 <ul>
1180
1181 <LI>
1182 <a HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/tr-96-1.ps.gz"> The EM
1183 Algorithm for Mixtures of Factor Analyzers </A>,
1184 Ghahramani, Z. and Hinton, G.E. (1996),
1185 University of Toronto
1186 Technical Report CRG-TR-96-1.
1187 (<A HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/mfa.tar.gz">Matlab software</A>)
1188
1189 <p>
1190 <li>
1191 <a
1192 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003>
1193 "Mixtures of probabilistic principal component analyzers"</a>,
1194 Tipping and Bishop, Neural Computation 11(2):443--482, 1999.
1195 </ul>
1196
1197 <p>
1198 I have included Zoubin's specialized MFA code (with his permission)
1199 with the toolbox, so you can check that BNT gives the same results:
1200 see 'BNT/examples/static/mfa1.m'.
1201
1202 <p>
1203 Independent Factor Analysis (IFA) generalizes FA by allowing a
1204 non-Gaussian prior on each component of X.
1205 (Note that we can approximate a non-Gaussian prior using a mixture of
1206 Gaussians.)
1207 This means that the likelihood function is no longer rotationally
1208 invariant, so we can uniquely identify W and the hidden
1209 sources X.
1210 IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y).
1211 We recover classical Independent Components Analysis (ICA)
1212 in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the
1213 weight matrix W is square and invertible.
1214 For details, see
1215 <ul>
1216 <li>
1217 <a href="http://www.gatsby.ucl.ac.uk/~hagai/ifa.ps">Independent Factor
1218 Analysis</a>, H. Attias, Neural Computation 11: 803--851, 1998.
1219 </ul>
1220
1221
1222
1223 <h2><a name="mixexp">Mixtures of experts</h2>
1224
1225 As an example of the use of the softmax function,
1226 we introduce the Mixture of Experts model.
1227 <!--
1228 We also show
1229 the Hierarchical Mixture of Experts model, where the hierarchy has two
1230 levels.
1231 (This is essentially a probabilistic decision tree of height two.)
1232 -->
1233 As before,
1234 circles denote continuous-valued nodes,
1235 squares denote discrete nodes, clear
1236 means hidden, and shaded means observed.
1237 <p>
1238 <center>
1239 <table>
1240 <tr>
1241 <td><img src="Figures/mixexp.gif">
1242 <!--
1243 <td><img src="Figures/hme.gif">
1244 -->
1245 </table>
1246 </center>
1247 <p>
1248 X is the observed
1249 input, Y is the output, and
1250 the Q nodes are hidden "gating" nodes, which select the appropriate
1251 set of parameters for Y. During training, Y is assumed observed,
1252 but for testing, the goal is to predict Y given X.
1253 Note that this is a <em>conditional</em> density model, so we don't
1254 associate any parameters with X.
1255 Hence X's CPD will be a root CPD, which is a way of modelling
1256 exogenous nodes.
1257 If the output is a continuous-valued quantity,
1258 we assume the "experts" are linear-regression units,
1259 and set Y's CPD to linear-Gaussian.
1260 If the output is discrete, we set Y's CPD to a softmax function.
1261 The Q CPDs will always be softmax functions.
1262
1263 <p>
1264 As a concrete example, consider the mixture of experts model where X and Y are
1265 scalars, and Q is binary.
1266 This is just piecewise linear regression, where
1267 we have two line segments, i.e.,
1268 <P>
1269 <IMG ALIGN=BOTTOM SRC="Eqns/lin_reg_eqn.gif">
1270 <P>
1271 We can create this model with random parameters as follows.
1272 (This code is bundled in BNT/examples/static/mixexp2.m.)
1273 <PRE>
1274 X = 1;
1275 Q = 2;
1276 Y = 3;
1277 dag = zeros(3,3);
dag(X,[Q Y]) = 1;
1279 dag(Q,Y) = 1;
1280 ns = [1 2 1]; % make X and Y scalars, and have 2 experts
1281 onodes = [1 3];
1282 bnet = mk_bnet(dag, ns, 'discrete', 2, 'observed', onodes);
1283
1284 rand('state', 0);
1285 randn('state', 0);
1286 bnet.CPD{1} = root_CPD(bnet, 1);
1287 bnet.CPD{2} = softmax_CPD(bnet, 2);
1288 bnet.CPD{3} = gaussian_CPD(bnet, 3);
1289 </PRE>
1290 Now let us fit this model using <a href="#em">EM</a>.
1291 First we <a href="#load_data">load the data</a> (1000 training cases) and plot them.
1292 <P>
1293 <PRE>
1294 data = load('/examples/static/Misc/mixexp_data.txt', '-ascii');
1295 plot(data(:,1), data(:,2), '.');
1296 </PRE>
1297 <p>
1298 <center>
1299 <IMG SRC="Figures/mixexp_data.gif">
1300 </center>
1301 <p>
1302 This is what the model looks like before training.
1303 (Thanks to Thomas Hofman for writing this plotting routine.)
1304 <p>
1305 <center>
1306 <IMG SRC="Figures/mixexp_before.gif">
1307 </center>
1308 <p>
1309 Now let's train the model, and plot the final performance.
1310 (We will discuss how to train models in more detail <a href="#param_learning">below</a>.)
1311 <P>
1312 <PRE>
1313 ncases = size(data, 1); % each row of data is a training case
1314 cases = cell(3, ncases);
1315 cases([1 3], :) = num2cell(data'); % each column of cases is a training case
1316 engine = jtree_inf_engine(bnet);
1317 max_iter = 20;
1318 [bnet2, LLtrace] = learn_params_em(engine, cases, max_iter);
1319 </PRE>
1320 (We specify which nodes will be observed when we create the engine.
1321 Hence BNT knows that the hidden nodes are all discrete.
1322 For complex models, this can lead to a significant speedup.)
1323 Below we show what the model looks like after 16 iterations of EM
1324 (with 100 IRLS iterations per M step), when it converged
1325 using the default convergence tolerance (that the
1326 fractional change in the log-likelihood be less than 1e-3).
1327 Before learning, the log-likelihood was
1328 -322.927442; afterwards, it was -13.728778.
1329 <p>
1330 <center>
1331 <IMG SRC="Figures/mixexp_after.gif">
1332 </center>
1333 (See BNT/examples/static/mixexp2.m for details of the code.)
1334
1335
1336
1337 <h2><a name="hme">Hierarchical mixtures of experts</h2>
1338
1339 A hierarchical mixture of experts (HME) extends the mixture of experts
1340 model by having more than one hidden node. A two-level example is shown below, along
1341 with its more traditional representation as a neural network.
1342 This is like a (balanced) probabilistic decision tree of height 2.
1343 <p>
1344 <center>
1345 <IMG SRC="Figures/HMEforMatlab.jpg">
1346 </center>
1347 <p>
1348 <a href="mailto:pbrutti@stat.cmu.edu">Pierpaolo Brutti</a>
1349 has written an extensive set of routines for HMEs,
1350 which are bundled with BNT: see the examples/static/HME directory.
1351 These routines allow you to choose the number of hidden (gating)
1352 layers, and the form of the experts (softmax or MLP).
1353 See the file hmemenu, which provides a demo.
1354 For example, the figure below shows the decision boundaries learned
1355 for a ternary classification problem, using a 2 level HME with softmax
1356 gates and softmax experts; the training set is on the left, the
1357 testing set on the right.
1358 <p>
1359 <center>
1360 <!--<IMG SRC="Figures/hme_dec_boundary.gif">-->
1361 <IMG SRC="Figures/hme_dec_boundary.png">
1362 </center>
1363 <p>
1364
1365
1366 <p>
1367 For more details, see the following:
1368 <ul>
1369
1370 <li> <a href="http://www.cs.berkeley.edu/~jordan/papers/hierarchies.ps.Z">
1371 Hierarchical mixtures of experts and the EM algorithm</a>
1372 M. I. Jordan and R. A. Jacobs. Neural Computation, 6, 181-214, 1994.
1373
1374 <li> <a href =
1375 "http://www.cs.berkeley.edu/~dmartin/software">David Martin's
1376 matlab code for HME</a>
1377
1378 <li> <a
1379 href="http://www.cs.berkeley.edu/~jordan/papers/uai.ps.Z">Why the
1380 logistic function? A tutorial discussion on
1381 probabilities and neural networks.</a> M. I. Jordan. MIT Computational
1382 Cognitive Science Report 9503, August 1995.
1383
1384 <li> "Generalized Linear Models", McCullagh and Nelder, Chapman and
Hall, 1983.
1386
1387 <li>
1388 "Improved learning algorithms for mixtures of experts in multiclass
1389 classification".
1390 K. Chen, L. Xu, H. Chi.
1391 Neural Networks (1999) 12: 1229-1252.
1392
1393 <li> <a href="http://www.oigeeza.com/steve/">
1394 Classification Using Hierarchical Mixtures of Experts</a>
1395 S.R. Waterhouse and A.J. Robinson.
1396 In Proc. IEEE Workshop on Neural Network for Signal Processing IV (1994), pp. 177-186
1397
1398 <li> <a href="http://www.idiap.ch/~perry/">
1399 Localized mixtures of experts</a>,
1400 P. Moerland, 1998.
1401
1402 <li> "Nonlinear gated experts for time series",
1403 A.S. Weigend and M. Mangeas, 1995.
1404
1405 </ul>
1406
1407
1408 <h2><a name="qmr">QMR</h2>
1409
1410 Bayes nets originally arose out of an attempt to add probabilities to
1411 expert systems, and this is still the most common use for BNs.
1412 A famous example is
1413 QMR-DT, a decision-theoretic reformulation of the Quick Medical
1414 Reference (QMR) model.
1415 <p>
1416 <center>
1417 <IMG ALIGN=BOTTOM SRC="Figures/qmr.gif">
1418 </center>
1419 Here, the top layer represents hidden disease nodes, and the bottom
1420 layer represents observed symptom nodes.
1421 The goal is to infer the posterior probability of each disease given
1422 all the symptoms (which can be present, absent or unknown).
1423 Each node in the top layer has a Bernoulli prior (with a low prior
1424 probability that the disease is present).
1425 Since each node in the bottom layer has a high fan-in, we use a
1426 noisy-OR parameterization; each disease has an independent chance of
1427 causing each symptom.
The real QMR-DT model is copyrighted, but
1429 we can create a random QMR-like model as follows.
1430 <pre>
1431 function bnet = mk_qmr_bnet(G, inhibit, leak, prior)
1432 % MK_QMR_BNET Make a QMR model
1433 % bnet = mk_qmr_bnet(G, inhibit, leak, prior)
1434 %
1435 % G(i,j) = 1 iff there is an arc from disease i to finding j
1436 % inhibit(i,j) = inhibition probability on i->j arc
1437 % leak(j) = inhibition prob. on leak->j arc
1438 % prior(i) = prob. disease i is on
1439
1440 [Ndiseases Nfindings] = size(inhibit);
1441 N = Ndiseases + Nfindings;
1442 finding_node = Ndiseases+1:N;
1443 ns = 2*ones(1,N);
1444 dag = zeros(N,N);
1445 dag(1:Ndiseases, finding_node) = G;
1446 bnet = mk_bnet(dag, ns, 'observed', finding_node);
1447
1448 for d=1:Ndiseases
1449 CPT = [1-prior(d) prior(d)];
1450 bnet.CPD{d} = tabular_CPD(bnet, d, CPT');
1451 end
1452
1453 for i=1:Nfindings
1454 fnode = finding_node(i);
1455 ps = parents(G, i);
1456 bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(ps, i));
1457 end
1458 </pre>
1459 In the file BNT/examples/static/qmr1, we create a random bipartite
1460 graph G, with 5 diseases and 10 findings, and random parameters.
1461 (In general, to create a random dag, use 'mk_random_dag'.)
1462 We can visualize the resulting graph structure using
1463 the methods discussed <a href="#graphdraw">below</a>, with the
1464 following results:
1465 <p>
1466 <img src="Figures/qmr.rnd.jpg">
1467
1468 <p>
1469 Now let us put some random evidence on all the leaves except the very
1470 first and very last, and compute the disease posteriors.
1471 <pre>
diseases = 1:Ndiseases; findings = Ndiseases+1:N; % node index ranges for the two layers
pos = 2:floor(Nfindings/2);
1473 neg = (pos(end)+1):(Nfindings-1);
1474 onodes = myunion(pos, neg);
1475 evidence = cell(1, N);
1476 evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos)));
1477 evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg)));
1478
1479 engine = jtree_inf_engine(bnet);
1480 [engine, ll] = enter_evidence(engine, evidence);
1481 post = zeros(1, Ndiseases);
1482 for i=diseases(:)'
1483 m = marginal_nodes(engine, i);
1484 post(i) = m.T(2);
1485 end
1486 </pre>
1487 Junction tree can be quite slow on large QMR models.
1488 Fortunately, it is possible to exploit properties of the noisy-OR
1489 function to speed up exact inference using an algorithm called
1490 <a href="#quickscore">quickscore</a>, discussed below.
1491
1492
1493
1494
1495
1496 <h2><a name="cg_model">Conditional Gaussian models</h2>
1497
1498 A conditional Gaussian model is one in which, conditioned on all the discrete
1499 nodes, the distribution over the remaining (continuous) nodes is
1500 multivariate Gaussian. This means we can have arcs from discrete (D)
1501 to continuous (C) nodes, but not vice versa.
1502 (We <em>are</em> allowed C->D arcs if the continuous nodes are observed,
1503 as in the <a href="#mixexp">mixture of experts</a> model,
1504 since this distribution can be represented with a discrete potential.)
1505 <p>
1506 We now give an example of a CG model, from
the paper "Propagation of Probabilities, Means and
1508 Variances in Mixed Graphical Association Models", Steffen Lauritzen,
1509 JASA 87(420):1098--1108, 1992 (reprinted in the book "Probabilistic Networks and Expert
1510 Systems", R. G. Cowell, A. P. Dawid, S. L. Lauritzen and
1511 D. J. Spiegelhalter, Springer, 1999.)
1512
1513 <h3>Specifying the graph</h3>
1514
1515 Consider the model of waste emissions from an incinerator plant shown below.
1516 We follow the standard convention that shaded nodes are observed,
1517 clear nodes are hidden.
1518 We also use the non-standard convention that
1519 square nodes are discrete (tabular) and round nodes are
1520 Gaussian.
1521
1522 <p>
1523 <center>
1524 <IMG SRC="Figures/cg1.gif">
1525 </center>
1526 <p>
1527
1528 We can create this model as follows.
1529 <pre>
1530 F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9;
1531 n = 9;
1532
1533 dag = zeros(n);
1534 dag(F,E)=1;
1535 dag(W,[E Min D]) = 1;
1536 dag(E,D)=1;
1537 dag(B,[C D])=1;
1538 dag(D,[L Mout])=1;
1539 dag(Min,Mout)=1;
1540
1541 % node sizes - all cts nodes are scalar, all discrete nodes are binary
1542 ns = ones(1, n);
1543 dnodes = [F W B];
1544 cnodes = mysetdiff(1:n, dnodes);
1545 ns(dnodes) = 2;
1546
1547 bnet = mk_bnet(dag, ns, 'discrete', dnodes);
1548 </pre>
1549 'dnodes' is a list of the discrete nodes; 'cnodes' is the continuous
1550 nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'.
1551 <p>
1552
1553
1554 <h3>Specifying the parameters</h3>
1555
1556 The parameters of the discrete nodes can be specified as follows.
1557 <pre>
1558 bnet.CPD{B} = tabular_CPD(bnet, B, 'CPT', [0.85 0.15]); % 1=stable, 2=unstable
1559 bnet.CPD{F} = tabular_CPD(bnet, F, 'CPT', [0.95 0.05]); % 1=intact, 2=defect
1560 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [2/7 5/7]); % 1=industrial, 2=household
1561 </pre>
1562
1563 <p>
1564 The parameters of the continuous nodes can be specified as follows.
1565 <pre>
1566 bnet.CPD{E} = gaussian_CPD(bnet, E, 'mean', [-3.9 -0.4 -3.2 -0.5], ...
1567 'cov', [0.00002 0.0001 0.00002 0.0001]);
1568 bnet.CPD{D} = gaussian_CPD(bnet, D, 'mean', [6.5 6.0 7.5 7.0], ...
1569 'cov', [0.03 0.04 0.1 0.1], 'weights', [1 1 1 1]);
1570 bnet.CPD{C} = gaussian_CPD(bnet, C, 'mean', [-2 -1], 'cov', [0.1 0.3]);
1571 bnet.CPD{L} = gaussian_CPD(bnet, L, 'mean', 3, 'cov', 0.25, 'weights', -0.5);
1572 bnet.CPD{Min} = gaussian_CPD(bnet, Min, 'mean', [0.5 -0.5], 'cov', [0.01 0.005]);
1573 bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 'mean', 0, 'cov', 0.002, 'weights', [1 1]);
1574 </pre>
1575
1576
1577 <h3><a name="cg_infer">Inference</h3>
1578
1579 <!--Let us perform inference in the <a href="#cg_model">waste incinerator example</a>.-->
1580 First we compute the unconditional marginals.
1581 <pre>
1582 engine = jtree_inf_engine(bnet);
1583 evidence = cell(1,n);
1584 [engine, ll] = enter_evidence(engine, evidence);
1585 marg = marginal_nodes(engine, E);
1586 </pre>
1587 <!--(Of course, we could use <tt>cond_gauss_inf_engine</tt> instead of jtree.)-->
1588 'marg' is a structure that contains the fields 'mu' and 'Sigma', which
1589 contain the mean and (co)variance of the marginal on E.
1590 In this case, they are both scalars.
1591 Let us check they match the published figures (to 2 decimal places).
1592 <!--(We can't expect
1593 more precision than this in general because I have implemented the algorithm of
1594 Lauritzen (1992), which can be numerically unstable.)-->
1595 <pre>
1596 tol = 1e-2;
1597 assert(approxeq(marg.mu, -3.25, tol));
1598 assert(approxeq(sqrt(marg.Sigma), 0.709, tol));
1599 </pre>
1600 We can compute the other posteriors similarly.
1601 Now let us add some evidence.
1602 <pre>
1603 evidence = cell(1,n);
1604 evidence{W} = 1; % industrial
1605 evidence{L} = 1.1;
1606 evidence{C} = -0.9;
1607 [engine, ll] = enter_evidence(engine, evidence);
1608 </pre>
1609 Now we find
1610 <pre>
1611 marg = marginal_nodes(engine, E);
1612 assert(approxeq(marg.mu, -3.8983, tol));
1613 assert(approxeq(sqrt(marg.Sigma), 0.0763, tol));
1614 </pre>
1615
1616
1617 We can also compute the joint probability on a set of nodes.
1618 For example, P(D, Mout | evidence) is a 2D Gaussian:
1619 <pre>
1620 marg = marginal_nodes(engine, [D Mout])
1621 marg =
1622 domain: [6 8]
1623 mu: [2x1 double]
1624 Sigma: [2x2 double]
1625 T: 1.0000
1626 </pre>
1627 The mean is
1628 <pre>
1629 marg.mu
1630 ans =
1631 3.6077
1632 4.1077
1633 </pre>
1634 and the covariance matrix is
1635 <pre>
1636 marg.Sigma
1637 ans =
1638 0.1062 0.1062
1639 0.1062 0.1182
1640 </pre>
1641 It is easy to visualize this posterior using standard Matlab plotting
1642 functions, e.g.,
1643 <pre>
1644 gaussplot2d(marg.mu, marg.Sigma);
1645 </pre>
1646 produces the following picture.
1647
1648 <p>
1649 <center>
1650 <IMG SRC="Figures/gaussplot.png">
1651 </center>
1652 <p>
1653
1654
1655 The T field indicates that the mixing weight of this Gaussian
1656 component is 1.0.
1657 If the joint contains discrete and continuous variables, the result
1658 will be a mixture of Gaussians, e.g.,
1659 <pre>
1660 marg = marginal_nodes(engine, [F E])
1661 domain: [1 3]
1662 mu: [-3.9000 -0.4003]
1663 Sigma: [1x1x2 double]
1664 T: [0.9995 4.7373e-04]
1665 </pre>
1666 The interpretation is
1667 Sigma(i,j,k) = Cov[ E(i) E(j) | F=k ].
1668 In this case, E is a scalar, so i=j=1; k specifies the mixture component.
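If all we want is a single posterior mean and variance for E, we can
collapse this mixture by moment matching. This is ordinary post-processing
of the structure returned by marginal_nodes, not a BNT call; a minimal
sketch (assuming E is scalar, as here):
<pre>
w    = marg.T(:)';                      % 1 x K mixing weights, P(F=k | evidence)
mus  = marg.mu(:)';                     % 1 x K component means
vars = squeeze(marg.Sigma)';            % 1 x K component variances
mE = sum(w .* mus);                     % E[E | evidence]
vE = sum(w .* (vars + mus.^2)) - mE^2;  % Var[E | evidence]
</pre>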
1669 <p>
1670 We saw in the sprinkler network that BNT sets the effective size of
1671 observed discrete nodes to 1, since they only have one legal value.
1672 For continuous nodes, BNT sets their length to 0,
1673 since they have been reduced to a point.
1674 For example,
1675 <pre>
1676 marg = marginal_nodes(engine, [B C])
1677 domain: [4 5]
1678 mu: []
1679 Sigma: []
1680 T: [0.0123 0.9877]
1681 </pre>
1682 It is simple to post-process the output of marginal_nodes.
1683 For example, the file BNT/examples/static/cg1 sets the mu term of
1684 observed nodes to their observed value, and the Sigma term to 0 (since
1685 observed nodes have no variance).
1686
1687 <p>
1688 Note that the implemented version of the junction tree is numerically
1689 unstable when using CG potentials
1690 (which is why, in the example above, we only required our answers to agree with
1691 the published ones to 2dp.)
1692 This is why you might want to use <tt>stab_cond_gauss_inf_engine</tt>,
1693 implemented by Shan Huang. This is described in
1694
1695 <ul>
1696 <li> "Stable Local Computation with Conditional Gaussian Distributions",
1697 S. Lauritzen and F. Jensen, Tech Report R-99-2014,
1698 Dept. Math. Sciences, Aalborg Univ., 1999.
1699 </ul>
1700
1701 However, even the numerically stable version
1702 can be computationally intractable if there are many hidden discrete
1703 nodes, because the number of mixture components grows exponentially, e.g., in a
1704 <a href="usage_dbn.html#lds">switching linear dynamical system</a>.
1705 In general, one must resort to approximate inference techniques: see
1706 the discussion on <a href="#engines">inference engines</a> below.
1707
1708
1709 <h2><a name="hybrid">Other hybrid models</h2>
1710
1711 When we have C->D arcs, where C is hidden, we need to use
1712 approximate inference.
1713 One approach (not implemented in BNT) is described in
1714 <ul>
1715 <li> <a
1716 href="http://www.cs.berkeley.edu/~murphyk/Papers/hybrid_uai99.ps.gz">A
1717 Variational Approximation for Bayesian Networks with
1718 Discrete and Continuous Latent Variables</a>,
1719 K. Murphy, UAI 99.
1720 </ul>
1721 Of course, one can always use <a href="#sampling">sampling</a> methods
1722 for approximate inference in such models.
1723
1724
1725
1726 <h1><a name="param_learning">Parameter Learning</h1>
1727
1728 The parameter estimation routines in BNT can be classified into 4
1729 types, depending on whether the goal is to compute
1730 a full (Bayesian) posterior over the parameters or just a point
1731 estimate (e.g., Maximum Likelihood or Maximum A Posteriori),
1732 and whether all the variables are fully observed or there is missing
1733 data/ hidden variables (partial observability).
1734 <p>
1735
1736 <TABLE BORDER>
1737 <tr>
1738 <TH></TH>
1739 <th>Full obs</th>
1740 <th>Partial obs</th>
1741 </tr>
1742 <tr>
1743 <th>Point</th>
1744 <td><tt>learn_params</tt></td>
1745 <td><tt>learn_params_em</tt></td>
1746 </tr>
1747 <tr>
1748 <th>Bayes</th>
1749 <td><tt>bayes_update_params</tt></td>
1750 <td>not yet supported</td>
1751 </tr>
1752 </table>
1753
1754
1755 <h2><a name="load_data">Loading data from a file</h2>
1756
1757 To load numeric data from an ASCII text file called 'dat.txt', where each row is a
1758 case and columns are separated by white-space, such as
1759 <pre>
1760 011979 1626.5 0.0
1761 021979 1367.0 0.0
1762 ...
1763 </pre>
1764 you can use
1765 <pre>
1766 data = load('dat.txt');
1767 </pre>
1768 or
1769 <pre>
1770 load dat.txt -ascii
1771 </pre>
1772 In the latter case, the data is stored in a variable called 'dat' (the
1773 filename minus the extension).
1774 Alternatively, suppose the data is stored in a .csv file (has commas
1775 separating the columns, and contains a header line), such as
1776 <pre>
1777 header info goes here
1778 ORD,011979,1626.5,0.0
1779 DSM,021979,1367.0,0.0
1780 ...
1781 </pre>
1782 You can load this using
1783 <pre>
1784 [a,b,c,d] = textread('dat.csv', '%s %d %f %f', 'delimiter', ',', 'headerlines', 1);
1785 </pre>
1786 If your file is not in either of these formats, you can either use Perl to convert
1787 it to one of them, or use Matlab's file I/O functions (such as fscanf or textscan).
1788 Type
1789 <tt>
1790 help iofun
1791 </tt>
1792 for more information on Matlab's file functions.
1793 <!--
1794 <p>
1795 To load data directly from Excel,
1796 you should buy the
1797 <a href="http://www.mathworks.com/products/excellink/">Excel Link</a>.
1798 To load data directly from a relational database,
1799 you should buy the
1800 <a href="http://www.mathworks.com/products/database">Database
1801 toolbox</a>.
1802 -->
1803 <p>
1804 BNT learning routines require data to be stored in a cell array.
1805 data{i,m} is the value of node i in case (example) m, i.e., each
1806 <em>column</em> is a case.
1807 If node i is not observed in case m (missing value), set
1808 data{i,m} = [].
1809 (Not all the learning routines can cope with such missing values, however.)
1810 In the special case that all the nodes are observed and are
1811 scalar-valued (as opposed to vector-valued), the data can be
1812 stored in a matrix (as opposed to a cell-array).
1813 <p>
1814 Suppose, as in the <a href="#mixexp">mixture of experts example</a>,
1815 that we have 3 nodes in the graph: X(1) is the observed input, X(3) is
1816 the observed output, and X(2) is a hidden (gating) node. We can
1817 create the dataset as follows.
1818 <pre>
1819 data = load('dat.txt');
1820 ncases = size(data, 1);
1821 cases = cell(3, ncases);
1822 cases([1 3], :) = num2cell(data');
1823 </pre>
1824 Notice how we transposed the data, to convert rows into columns.
1825 Also, cases{2,m} = [] for all m, since X(2) is always hidden.
1826
1827
1828 <h2><a name="mle_complete">Maximum likelihood parameter estimation from complete data</h2>
1829
1830 As an example, let's generate some data from the sprinkler network, randomize the parameters,
1831 and then try to recover the original model.
1832 First we create some training data using forwards sampling.
1833 <pre>
1834 nsamples = 50; % e.g., 50 training cases
1834 samples = cell(N, nsamples);
1835 for i=1:nsamples
1836 samples(:,i) = sample_bnet(bnet);
1837 end
1838 </pre>
1839 samples{j,i} contains the value of the j'th node in case i.
1840 sample_bnet returns a cell array because, in general, each node might
1841 be a vector of different length.
1842 In this case, all nodes are discrete (and hence scalars), so we
1843 could have used a regular array instead (which can be quicker):
1844 <pre>
1845 data = cell2num(samples);
1846 </pre>
1847 So now data(j,i) = samples{j,i}.
1848 <p>
1849 Now we create a network with random parameters.
1850 (The initial values of bnet2 don't matter in this case, since we can find the
1851 globally optimal MLE independent of where we start.)
1852 <pre>
1853 % Make a tabula rasa
1854 bnet2 = mk_bnet(dag, node_sizes);
1855 seed = 0;
1856 rand('state', seed);
1857 bnet2.CPD{C} = tabular_CPD(bnet2, C);
1858 bnet2.CPD{R} = tabular_CPD(bnet2, R);
1859 bnet2.CPD{S} = tabular_CPD(bnet2, S);
1860 bnet2.CPD{W} = tabular_CPD(bnet2, W);
1861 </pre>
1862 Finally, we find the maximum likelihood estimates of the parameters.
1863 <pre>
1864 bnet3 = learn_params(bnet2, samples);
1865 </pre>
1866 To view the learned parameters, we use a little Matlab hackery.
1867 <pre>
1868 CPT3 = cell(1,N);
1869 for i=1:N
1870 s=struct(bnet3.CPD{i}); % violate object privacy
1871 CPT3{i}=s.CPT;
1872 end
1873 </pre>
1874 Here are the parameters learned for node 4.
1875 <pre>
1876 dispcpt(CPT3{4})
1877 1 1 : 1.0000 0.0000
1878 2 1 : 0.2000 0.8000
1879 1 2 : 0.2273 0.7727
1880 2 2 : 0.0000 1.0000
1881 </pre>
1882 So we see that the learned parameters are fairly close to the "true"
1883 ones, which we display below.
1884 <pre>
1885 dispcpt(CPT{4})
1886 1 1 : 1.0000 0.0000
1887 2 1 : 0.1000 0.9000
1888 1 2 : 0.1000 0.9000
1889 2 2 : 0.0100 0.9900
1890 </pre>
1891 We can get better results by using a larger training set, or using
1892 informative priors (see <a href="#prior">below</a>).
1893
1894
1895
1896 <h2><a name="prior">Parameter priors</h2>
1897
1898 Currently, only tabular CPDs can have priors on their parameters.
1899 The conjugate prior for a multinomial is the Dirichlet.
1900 (For binary random variables, the multinomial is the same as the
1901 Bernoulli, and the Dirichlet is the same as the Beta.)
1902 <p>
1903 The Dirichlet has a simple interpretation in terms of pseudo counts.
1904 If we let N_ijk = the num. times X_i=k and Pa_i=j occurs in the
1905 training set, where Pa_i are the parents of X_i,
1906 then the maximum likelihood (ML) estimate is
1907 T_ijk = N_ijk / N_ij (where N_ij = sum_k' N_ijk'), which will be 0 if N_ijk=0.
1908 To prevent us from declaring that (X_i=k, Pa_i=j) is impossible just because this
1909 event was not seen in the training set,
1910 we can pretend we saw value k of X_i, for each value j of Pa_i some number (alpha_ijk)
1911 of times in the past.
1912 The MAP (maximum a posteriori) estimate is then
1913 <pre>
1914 T_ijk = (N_ijk + alpha_ijk) / (N_ij + alpha_ij)
1915 </pre>
1916 and is never 0 if all alpha_ijk > 0.
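Concretely, if X_i has 3 values and, for some parent configuration j, the
observed counts are (5, 0, 0), the ML estimate is (1, 0, 0); but with
alpha_ijk = 1 for every k (so alpha_ij = 3), the MAP estimate is
(6, 1, 1)/8 = (0.75, 0.125, 0.125), which rules nothing out.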
1917 For example, consider the network A->B, where A is binary and B has 3
1918 values.
1919 A uniform prior for B has the form
1920 <pre>
1921 B=1 B=2 B=3
1922 A=1 1 1 1
1923 A=2 1 1 1
1924 </pre>
1925 which can be created using
1926 <pre>
1927 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'unif');
1928 </pre>
1929 This prior does not satisfy the likelihood equivalence principle,
1930 which says that <a href="#markov_equiv">Markov equivalent</a> models
1931 should have the same marginal likelihood.
1932 A prior that does satisfy this principle is shown below.
1933 Heckerman (1995) calls this the
1934 BDeu prior (likelihood equivalent uniform Bayesian Dirichlet).
1935 <pre>
1936 B=1 B=2 B=3
1937 A=1 1/6 1/6 1/6
1938 A=2 1/6 1/6 1/6
1939 </pre>
1940 where we put N/(q*r) in each bin; N is the equivalent sample size,
1941 r=|A|, q = |B|.
1942 This can be created as follows
1943 <pre>
1944 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu');
1945 </pre>
1946 Here the equivalent sample size, which controls the strength of the
1947 prior, is 1.
1948 You can change this using
1949 <pre>
1950 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', ...
1951 'BDeu', 'dirichlet_weight', 10);
1952 </pre>
1953 <!--where counts is an array of pseudo-counts of the same size as the
1954 CPT.-->
1955 <!--
1956 <p>
1957 When you specify a prior, you should set row i of the CPT to the
1958 normalized version of row i of the pseudo-count matrix, i.e., to the
1959 expected values of the parameters. This will ensure that computing the
1960 marginal likelihood sequentially (see <a
1961 href="#bayes_learn">below</a>) and in batch form gives the same
1962 results.
1963 To do this, proceed as follows.
1964 <pre>
1965 tabular_CPD(bnet, i, 'prior', counts, 'CPT', mk_stochastic(counts));
1966 </pre>
1967 For a non-informative prior, you can just write
1968 <pre>
1969 tabular_CPD(bnet, i, 'prior', 'unif', 'CPT', 'unif');
1970 </pre>
1971 -->
1972
1973
1974 <h2><a name="bayes_learn">(Sequential) Bayesian parameter updating from complete data</h2>
1975
1976 If we use conjugate priors and have fully observed data, we can
1977 compute the posterior over the parameters in batch form as follows.
1978 <pre>
1979 cases = sample_bnet(bnet, nsamples);
1980 bnet = bayes_update_params(bnet, cases);
1981 LL = log_marg_lik_complete(bnet, cases);
1982 </pre>
1983 bnet.CPD{i}.prior contains the new Dirichlet pseudocounts,
1984 and bnet.CPD{i}.CPT is set to the mean of the posterior (the
1985 normalized counts).
1986 (Hence if the initial pseudo counts are 0,
1987 <tt>bayes_update_params</tt> and <tt>learn_params</tt> will give the
1988 same result.)
1989
1990
1991
1992
1993 <p>
1994 We can compute the same result sequentially (on-line) as follows.
1995 <pre>
1996 LL = 0;
1997 for m=1:nsamples
1998 LL = LL + log_marg_lik_complete(bnet, cases(:,m));
1999 bnet = bayes_update_params(bnet, cases(:,m));
2000 end
2001 </pre>
2002
2003 The file <tt>BNT/examples/static/StructLearn/model_select1</tt> has an example of
2004 sequential model selection which uses the same idea.
2005 We generate data from the model A->B
2006 and compute the posterior probability of all 3 DAGs on 2 nodes:
2007 (1) A  B, (2) A <- B, (3) A -> B.
2008 Models 2 and 3 are <a href="#markov_equiv">Markov equivalent</a>, and therefore indistinguishable from
2009 observational data alone, so we expect their posteriors to be the same
2010 (assuming a prior which satisfies likelihood equivalence).
2011 If we use random parameters, the "true" model only gets a higher posterior after 2000 trials!
2012 However, if we make B a noisy NOT gate, the true model "wins" after 12
2013 trials, as shown below (red = model 1, blue/green (superimposed)
2014 represents models 2/3).
2015 <p>
2016 <img src="Figures/model_select.png">
2017 <p>
2018 The use of marginal likelihood for model selection is discussed in
2019 greater detail in the
2020 section on <a href="structure_learning">structure learning</a>.
2021
2022
2023
2024
2025 <h2><a name="em">Maximum likelihood parameter estimation with missing values</h2>
2026
2027 Now we consider learning when some values are not observed.
2028 Let us randomly hide half the values generated from the water
2029 sprinkler example.
2030 <pre>
2031 samples2 = samples;
2032 hide = rand(N, nsamples) > 0.5;
2033 [I,J]=find(hide);
2034 for k=1:length(I)
2035 samples2{I(k), J(k)} = [];
2036 end
2037 </pre>
2038 samples2{i,l} is the value of node i in training case l, or [] if unobserved.
2039 <p>
2040 Now we will compute the MLEs using the EM algorithm.
2041 We need to use an inference algorithm to compute the expected
2042 sufficient statistics in the E step; the M (maximization) step is as
2043 above.
2044 <pre>
2045 engine2 = jtree_inf_engine(bnet2);
2046 max_iter = 10;
2047 [bnet4, LLtrace] = learn_params_em(engine2, samples2, max_iter);
2048 </pre>
2049 LLtrace(i) is the log-likelihood at iteration i. We can plot this as
2050 follows:
2051 <pre>
2052 plot(LLtrace, 'x-')
2053 </pre>
2054 Let's display the results after 10 iterations of EM.
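To do this, we first extract the CPTs from bnet4, using the same struct
hack we used for CPT3 above:
<pre>
CPT4 = cell(1,N);
for i=1:N
  s = struct(bnet4.CPD{i}); % violate object privacy
  CPT4{i} = s.CPT;
end
</pre>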
2055 <pre>
2056 celldisp(CPT4)
2057 CPT4{1} =
2058 0.6616
2059 0.3384
2060 CPT4{2} =
2061 0.6510 0.3490
2062 0.8751 0.1249
2063 CPT4{3} =
2064 0.8366 0.1634
2065 0.0197 0.9803
2066 CPT4{4} =
2067 (:,:,1) =
2068 0.8276 0.0546
2069 0.5452 0.1658
2070 (:,:,2) =
2071 0.1724 0.9454
2072 0.4548 0.8342
2073 </pre>
2074 We can get improved performance by using one or more of the following
2075 methods:
2076 <ul>
2077 <li> Increasing the size of the training set.
2078 <li> Decreasing the amount of hidden data.
2079 <li> Running EM for longer.
2080 <li> Using informative priors.
2081 <li> Initialising EM from multiple starting points.
2082 </ul>
2083
2084 Click <a href="#gaussian">here</a> for a discussion of learning
2085 Gaussians, which can cause numerical problems.
2086 <p>
2087 For a more complete example of learning with EM,
2088 see the script BNT/examples/static/learn1.m.
2089
2090 <h2><a name="tying">Parameter tying</h2>
2091
2092 In networks with repeated structure (e.g., chains and grids), it is
2093 common to assume that the parameters are the same at every node. This
2094 is called parameter tying, and reduces the amount of data needed for
2095 learning.
2096 <p>
2097 When we have tied parameters, there is no longer a one-to-one
2098 correspondence between nodes and CPDs.
2099 Rather, each CPD specifies the parameters for a whole equivalence class
2100 of nodes.
2101 It is easiest to see this by example.
2102 Consider the following <a href="usage_dbn.html#hmm">hidden Markov
2103 model (HMM)</a>
2104 <p>
2105 <img src="Figures/hmm3.gif">
2106 <p>
2107 <!--
2108 We can create this graph structure, assuming we have T time-slices,
2109 as follows.
2110 (We number the nodes as shown in the figure, but we could equally well
2111 number the hidden nodes 1:T, and the observed nodes T+1:2T.)
2112 <pre>
2113 N = 2*T;
2114 dag = zeros(N);
2115 hnodes = 1:2:2*T;
2116 for i=1:T-1
2117 dag(hnodes(i), hnodes(i+1))=1;
2118 end
2119 onodes = 2:2:2*T;
2120 for i=1:T
2121 dag(hnodes(i), onodes(i)) = 1;
2122 end
2123 </pre>
2124 <p>
2125 The hidden nodes are always discrete, and have Q possible values each,
2126 but the observed nodes can be discrete or continuous, and have O possible values/length.
2127 <pre>
2128 if cts_obs
2129 dnodes = hnodes;
2130 else
2131 dnodes = 1:N;
2132 end
2133 ns = ones(1,N);
2134 ns(hnodes) = Q;
2135 ns(onodes) = O;
2136 </pre>
2137 -->
2138 When HMMs are used for semi-infinite processes like speech recognition,
2139 we assume the transition matrix
2140 P(H(t+1)|H(t)) is the same for all t; this is called a time-invariant
2141 or homogeneous Markov chain.
2142 Hence hidden nodes 2, 3, ..., T
2143 are all in the same equivalence class, say class Hclass.
2144 Similarly, the observation matrix P(O(t)|H(t)) is assumed to be the
2145 same for all t, so the observed nodes are all in the same equivalence
2146 class, say class Oclass.
2147 Finally, the prior term P(H(1)) is in a class all by itself, say class
2148 H1class.
2149 This is illustrated below, where we explicitly represent the
2150 parameters as random variables (dotted nodes).
2151 <p>
2152 <img src="Figures/hmm4_params.gif">
2153 <p>
2154 In BNT, we cannot represent parameters as random variables (nodes).
2155 Instead, we "hide" the
2156 parameters inside one CPD for each equivalence class,
2157 and then specify that the other CPDs should share these parameters, as
2158 follows.
2159 <pre>
2160 hnodes = 1:2:2*T;
2161 onodes = 2:2:2*T;
2162 H1class = 1; Hclass = 2; Oclass = 3;
2163 eclass = ones(1,N);
2164 eclass(hnodes(2:end)) = Hclass;
2165 eclass(hnodes(1)) = H1class;
2166 eclass(onodes) = Oclass;
2167 % create dag and ns in the usual way
2168 bnet = mk_bnet(dag, ns, 'discrete', dnodes, 'equiv_class', eclass);
2169 </pre>
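The comment "create dag and ns in the usual way" can be expanded as
follows (a sketch, assuming T time slices, Q values per hidden node,
observations of size O, and the node numbering shown in the figure):
<pre>
N = 2*T;
dag = zeros(N);
for i=1:T-1
  dag(hnodes(i), hnodes(i+1)) = 1;   % transition arcs H(t) -> H(t+1)
end
for i=1:T
  dag(hnodes(i), onodes(i)) = 1;     % emission arcs H(t) -> O(t)
end
ns = ones(1,N);
ns(hnodes) = Q;
ns(onodes) = O;
if cts_obs
  dnodes = hnodes;                   % only the hidden nodes are discrete
else
  dnodes = 1:N;                      % all nodes are discrete
end
</pre>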
2170 Finally, we define the parameters for each equivalence class:
2171 <pre>
2172 bnet.CPD{H1class} = tabular_CPD(bnet, hnodes(1)); % prior
2173 bnet.CPD{Hclass} = tabular_CPD(bnet, hnodes(2)); % transition matrix
2174 if cts_obs
2175 bnet.CPD{Oclass} = gaussian_CPD(bnet, onodes(1));
2176 else
2177 bnet.CPD{Oclass} = tabular_CPD(bnet, onodes(1));
2178 end
2179 </pre>
2180 In general, if bnet.CPD{e} = xxx_CPD(bnet, j), then j should be a
2181 member of e's equivalence class; that is, it is not always the case
2182 that e == j. You can use bnet.rep_of_eclass(e) to return the
2183 representative of equivalence class e.
2184 BNT will look up the parents of j to determine the size
2185 of the CPT to use. It assumes that this is the same for all members of
2186 the equivalence class.
2187 Click <a href="param_tieing.html">here</a> for
2188 a more complex example of parameter tying.
2189 <p>
2190 Note:
2191 Normally one would define an HMM as a
2192 <a href = "usage_dbn.html">Dynamic Bayes Net</a>
2193 (see the function BNT/examples/dynamic/mk_chmm.m).
2194 However, one can define an HMM as a static BN using the function
2195 BNT/examples/static/Models/mk_hmm_bnet.m.
2196
2197
2198
2199 <h1><a name="structure_learning">Structure learning</h1>
2200
2201 Update (9/29/03):
2202 Philippe Leray is developing some additional structure learning code
2203 on top of BNT. Click
2204 <a href="http://banquiseasi.insa-rouen.fr/projects/bnt-slp/">
2205 here</a>
2206 for details.
2207
2208 <p>
2209
2210 There are two very different approaches to structure learning:
2211 constraint-based and search-and-score.
2212 In the <a href="#constraint">constraint-based approach</a>,
2213 we start with a fully connected graph, and remove edges if certain
2214 conditional independencies are detected in the data.
2215 This has the disadvantage that repeated independence tests lose
2216 statistical power.
2217 <p>
2218 In the more popular search-and-score approach,
2219 we perform a search through the space of possible DAGs, and either
2220 return the best one found (a point estimate), or return a sample of the
2221 models found (an approximation to the Bayesian posterior).
2222 <p>
2223 The number of DAGs as a function of the number of
2224 nodes, G(n), is super-exponential in n,
2225 and is given by the following recurrence
2226 <!--(where R(i)=G(n)):-->
2227 <p>
2228 <center>
2229 <IMG SRC="numDAGsEqn2.png">
2230 </center>
2231 <p>
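(In text form, the recurrence is
G(n) = sum_{k=1}^n (-1)^(k+1) C(n,k) 2^(k(n-k)) G(n-k), with G(0) = G(1) = 1.)
<p>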
2232 The first few values
2233 are shown below.
2234
2235 <table>
2236 <tr> <th>n</th> <th align=left>G(n)</th> </tr>
2237 <tr> <td>1</td> <td>1</td> </tr>
2238 <tr> <td>2</td> <td>3</td> </tr>
2239 <tr> <td>3</td> <td>25</td> </tr>
2240 <tr> <td>4</td> <td>543</td> </tr>
2241 <tr> <td>5</td> <td>29,281</td> </tr>
2242 <tr> <td>6</td> <td>3,781,503</td> </tr>
2243 <tr> <td>7</td> <td>1.1 x 10^9</td> </tr>
2244 <tr> <td>8</td> <td>7.8 x 10^11</td> </tr>
2245 <tr> <td>9</td> <td>1.2 x 10^15</td> </tr>
2246 <tr> <td>10</td> <td>4.2 x 10^18</td> </tr>
2247 </table>
2248
2249 Since the number of DAGs is super-exponential in the number of nodes,
2250 we cannot exhaustively search the space, so we either use a local
2251 search algorithm (e.g., greedy hill climbing, perhaps with multiple
2252 restarts) or a global search algorithm (e.g., Markov Chain Monte
2253 Carlo).
2254 <p>
2255 If we know a total ordering on the nodes,
2256 finding the best structure amounts to picking the best set of parents
2257 for each node independently.
2258 This is what the K2 algorithm does.
2259 If the ordering is unknown, we can search over orderings,
2260 which is more efficient than searching over DAGs (Koller and Friedman, 2000).
2261 <p>
2262 In addition to the search procedure, we must specify the scoring
2263 function. There are two popular choices. The Bayesian score integrates
2264 out the parameters, i.e., it is the marginal likelihood of the model.
2265 The BIC (Bayesian Information Criterion) is defined as
2266 log P(D|theta_hat) - 0.5*d*log(N), where D is the data, theta_hat is
2267 the ML estimate of the parameters, d is the number of parameters, and
2268 N is the number of data cases.
2269 The BIC method has the advantage of not requiring a prior.
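In BNT, the score is selected via the 'scoring_fn' argument to the
structure learning routines; for example (see also the exhaustive search
example below),
<pre>
score = score_dags(data, ns, dags, 'scoring_fn', 'bic');
</pre>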
2270 <p>
2271 BIC can be derived as a large sample
2272 approximation to the marginal likelihood.
2273 (It is also equal to the Minimum Description Length of a model.)
2274 However, in practice, the sample size does not need to be very large
2275 for the approximation to be good.
2276 For example, in the figure below, we plot the ratio between the log marginal likelihood
2277 and the BIC score against data-set size; we see that the ratio rapidly
2278 approaches 1, especially for non-informative priors.
2279 (This plot was generated by the file BNT/examples/static/bic1.m. It
2280 uses the water sprinkler BN with BDeu Dirichlet priors with different
2281 equivalent sample sizes.)
2282
2283 <p>
2284 <center>
2285 <IMG SRC="Figures/bic.png">
2286 </center>
2287 <p>
2288
2289 <p>
2290 As with parameter learning, handling missing data/ hidden variables is
2291 much harder than the fully observed case.
2292 The structure learning routines in BNT can therefore be classified into 4
2293 types, analogously to the parameter learning case.
2294 <p>
2295
2296 <TABLE BORDER>
2297 <tr>
2298 <TH></TH>
2299 <th>Full obs</th>
2300 <th>Partial obs</th>
2301 </tr>
2302 <tr>
2303 <th>Point</th>
2304 <td><tt>learn_struct_K2</tt></td>
2305 <!-- <tt>learn_struct_hill_climb</tt> -->
2306 <td>not yet supported</td>
2307 </tr>
2308 <tr>
2309 <th>Bayes</th>
2310 <td><tt>learn_struct_mcmc</tt></td>
2311 <td>not yet supported</td>
2312 </tr>
2313 </table>
2314
2315
2316 <h2><a name="markov_equiv">Markov equivalence</h2>
2317
2318 If two DAGs encode the same conditional independencies, they are
2319 called Markov equivalent. The set of all DAGs can be partitioned into
2320 Markov equivalence classes. Graphs within the same class can
2321 have
2322 the direction of some of their arcs reversed without changing any of
2323 the CI relationships.
2324 Each class can be represented by a PDAG
2325 (partially directed acyclic graph) called an essential graph or
2326 pattern. This specifies which edges must be oriented in a certain
2327 direction, and which may be reversed.
2328
2329 <p>
2330 When learning graph structure from observational data,
2331 the best one can hope to do is to identify the model up to Markov
2332 equivalence. To distinguish amongst graphs within the same equivalence
2333 class, one needs interventional data: see the discussion on <a
2334 href="#active">active learning</a> below.
2335
2336
2337
2338 <h2><a name="enumerate">Exhaustive search</h2>
2339
2340 The brute-force approach to structure learning is to enumerate all
2341 possible DAGs, and score each one. This provides a "gold standard"
2342 with which to compare other algorithms. We can do this as follows.
2343 <pre>
2344 dags = mk_all_dags(N);
2345 score = score_dags(data, ns, dags);
2346 </pre>
2347 where data(i,m) is the value of node i in case m,
2348 and ns(i) is the size of node i.
2349 If the DAGs have a lot of families in common, we can cache the sufficient statistics,
2350 making this potentially more efficient than scoring the DAGs one at a time.
2351 (Caching is not currently implemented, however.)
2352 <p>
2353 By default, we use the Bayesian scoring metric, and assume CPDs are
2354 represented by tables with BDeu(1) priors.
2355 We can override these defaults as follows.
2356 If we want to use uniform priors, we can say
2357 <pre>
2358 params = cell(1,N);
2359 for i=1:N
2360 params{i} = {'prior', 'unif'};
2361 end
2362 score = score_dags(data, ns, dags, 'params', params);
2363 </pre>
2364 params{i} is a cell-array, containing optional arguments that are
2365 passed to the constructor for CPD i.
2366 <p>
2367 Now suppose we want to use different node types, e.g.,
2368 suppose nodes 1 and 2 are Gaussian, and nodes 3 and 4 are softmax (both of
2369 these CPD types can support discrete and continuous parents, which is
2370 necessary since all other nodes will be considered as parents).
2371 The Bayesian scoring metric currently only works for tabular CPDs, so
2372 we will use BIC:
2373 <pre>
2374 score = score_dags(data, ns, dags, 'discrete', [3 4], 'params', [], ...
2375     'type', {'gaussian', 'gaussian', 'softmax', 'softmax'}, 'scoring_fn', 'bic')
2376 </pre>
2377 In practice, one can't enumerate all possible DAGs for N > 5,
2378 but one can evaluate any reasonably-sized set of hypotheses in this
2379 way (e.g., nearest neighbors of your current best guess).
2380 Think of this as "computer assisted model refinement" as opposed to de
2381 novo learning.
2382
2383
2384 <h2><a name="K2">K2</h2>
2385
2386 The K2 algorithm (Cooper and Herskovits, 1992) is a greedy search algorithm that works as follows.
2387 Initially each node has no parents. It then adds incrementally that parent whose addition most
2388 increases the score of the resulting structure. When the addition of no single
2389 parent can increase the score, it stops adding parents to the node.
2390 Since we are using a fixed ordering, we do not need to check for
2391 cycles, and can choose the parents for each node independently.
2392 <p>
2393 The original paper used the Bayesian scoring
2394 metric with tabular CPDs and Dirichlet priors.
2395 BNT generalizes this to allow any kind of CPD, and either the Bayesian
2396 scoring metric or BIC, as in the example <a href="#enumerate">above</a>.
2397 In addition, you can specify
2398 an optional upper bound on the number of parents for each node.
2399 The file BNT/examples/static/k2demo1.m gives an example of how to use K2.
2400 We use the water sprinkler network and sample 100 cases from it as before.
2401 Then we see how much data it takes to recover the generating structure:
2402 <pre>
2403 order = [C S R W];
2404 max_fan_in = 2;
2405 sz = 5:5:100;
2406 for i=1:length(sz)
2407 dag2 = learn_struct_K2(data(:,1:sz(i)), node_sizes, order, 'max_fan_in', max_fan_in);
2408 correct(i) = isequal(dag, dag2);
2409 end
2410 </pre>
2411 Here are the results.
2412 <pre>
2413 correct =
2414 Columns 1 through 12
2415 0 0 0 0 0 0 0 1 0 1 1 1
2416 Columns 13 through 20
2417 1 1 1 1 1 1 1 1
2418 </pre>
2419 So we see it takes about sz(10)=50 cases. (BIC behaves similarly,
2420 showing that the prior doesn't matter too much.)
2421 In general, we cannot hope to recover the "true" generating structure,
2422 only one that is in its <a href="#markov_equiv">Markov equivalence
2423 class</a>.
2424
2425
2426 <h2><a name="hill_climb">Hill-climbing</h2>
2427
2428 Hill-climbing starts at a specific point in space,
2429 considers all nearest neighbors, and moves to the neighbor
2430 that has the highest score; if no neighbors have higher
2431 score than the current point (i.e., we have reached a local maximum),
2432 the algorithm stops. One can then restart in another part of the space.
2433 <p>
2434 A common definition of "neighbor" is all graphs that can be
2435 generated from the current graph by adding, deleting or reversing a
2436 single arc, subject to the acyclicity constraint.
2437 Other neighborhoods are possible: see
2438 <a href="http://research.microsoft.com/~dmax/publications/jmlr02.pdf">
2439 Optimal Structure Identification with Greedy Search</a>, Max
2440 Chickering, JMLR 2002.
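<p>
The following is an illustrative sketch of greedy hill-climbing built on
top of <tt>score_dags</tt>; it is not a built-in BNT routine, and it
assumes a helper acyclic(G) that returns 1 iff the directed graph G has no
cycles (BNT's graph directory contains related routines, or such a test is
easy to write).
<pre>
% Greedy hill-climbing over DAGs -- an illustrative sketch only.
% dag holds the current graph, e.g., the empty graph zeros(N).
N = length(ns);
done = 0;
while ~done
  nbrs = {};
  for i=1:N
    for j=1:N
      if i==j, continue; end
      G = dag;
      if dag(i,j)               % existing arc i->j: delete it, or reverse it
        G(i,j) = 0; nbrs{end+1} = G;
        G(j,i) = 1; if acyclic(G), nbrs{end+1} = G; end  % acyclic() assumed
      elseif ~dag(j,i)          % no arc between i and j: try adding i->j
        G(i,j) = 1; if acyclic(G), nbrs{end+1} = G; end  % acyclic() assumed
      end
    end
  end
  scores = score_dags(data, ns, [{dag} nbrs]);
  [best, idx] = max(scores);
  if idx==1
    done = 1;                   % no neighbor beats the current graph: local maximum
  else
    dag = nbrs{idx-1};
  end
end
</pre>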
2441
2442 <!--
2443 Note: This algorithm is currently (Feb '02) being implemented by Qian
2444 Diao.
2445 -->
2446
2447
2448 <h2><a name="mcmc">MCMC</h2>
2449
2450 We can use a Markov Chain Monte Carlo (MCMC) algorithm called
2451 Metropolis-Hastings (MH) to search the space of all
2452 DAGs.
2453 The standard proposal distribution is to consider moving to all
2454 nearest neighbors in the sense defined <a href="#hill_climb">above</a>.
2455 <p>
2456 The function can be called
2457 as in the following example.
2458 <pre>
2459 [sampled_graphs, accept_ratio] = learn_struct_mcmc(data, ns, 'nsamples', 100, 'burnin', 10);
2460 </pre>
2461 We can convert our set of sampled graphs to a histogram
2462 (empirical posterior over all the DAGs) thus
2463 <pre>
2464 all_dags = mk_all_dags(N);
2465 mcmc_post = mcmc_sample_to_hist(sampled_graphs, all_dags);
2466 </pre>
2467 To see how well this performs, let us compute the exact posterior exhaustively.
2468 <p>
2469 <pre>
2470 score = score_dags(data, ns, all_dags);
2471 post = normalise(exp(score)); % assuming uniform structural prior
2472 </pre>
2473 We plot the results below.
2474 (The data set was 100 samples drawn from a random 4 node bnet; see the
2475 file BNT/examples/static/mcmc1.)
2476 <pre>
2477 subplot(2,1,1)
2478 bar(post)
2479 subplot(2,1,2)
2480 bar(mcmc_post)
2481 </pre>
2482 <img src="Figures/mcmc_post.jpg" width="800" height="500">
2483 <p>
2484 We can also plot the acceptance ratio versus number of MCMC steps,
2485 as a crude convergence diagnostic.
2486 <pre>
2487 clf
2488 plot(accept_ratio)
2489 </pre>
2490 <img src="Figures/mcmc_accept.jpg" width="800" height="300">
2491 <p>
2492 Even though the number of samples needed by MCMC is theoretically
2493 polynomial (not exponential) in the dimensionality of the search space, in practice it has been
2494 found that MCMC does not converge in reasonable time for graphs with
2495 more than about 10 nodes.
2496
2497
2498
2499
2500 <h2><a name="active">Active structure learning</h2>
2501
2502 As was mentioned <a href="#markov_equiv">above</a>,
2503 one can only learn a DAG up to Markov equivalence, even given infinite data.
2504 If one is interested in learning the structure of a causal network,
2505 one needs interventional data.
2506 (By "intervention" we mean forcing a node to take on a specific value,
2507 thereby effectively severing its incoming arcs.)
2508 <p>
2509 Most of the scoring functions accept an optional argument
2510 that specifies whether a node was observed to have a certain value, or
2511 was forced to have that value: we set clamped(i,m)=1 if node i was
2512 forced in training case m. See, e.g., the file
2513 BNT/examples/static/cooper_yoo.
2514 <p>
2515 An interesting question is to decide which interventions to perform
2516 (c.f., design of experiments). For details, see the following tech
2517 report
2518 <ul>
2519 <li> <a href = "../../Papers/alearn.ps.gz">
2520 Active learning of causal Bayes net structure</a>, Kevin Murphy, March
2521 2001.
2522 </ul>
2523
2524
2525 <h2><a name="struct_em">Structural EM</h2>
2526
2527 Computing the Bayesian score when there is partial observability is
2528 computationally challenging, because the parameter posterior becomes
2529 multimodal (the hidden nodes induce a mixture distribution).
2530 One therefore needs to use approximations such as BIC.
2531 Unfortunately, search algorithms are still expensive, because we need
2532 to run EM at each step to compute the MLE, which is needed to compute
2533 the score of each model. An alternative approach is
2534 to do the local search steps inside of the M step of EM, which is more
2535 efficient since the data has been "filled in" - this is
2536 called the structural EM algorithm (Friedman 1997), and provably
2537 converges to a local maximum of the BIC score.
2538 <p>
2539 Wei Hu has implemented SEM for discrete nodes.
2540 You can download his package from
2541 <a href="../SEM.zip">here</a>.
2542 Please address all questions about this code to
2543 wei.hu@intel.com.
2544 See also <a href="#phl">Phl's implementation of SEM</a>.
2545
2546 <!--
2547 <h2><a name="reveal">REVEAL algorithm</h2>
2548
2549 A simple way to learn the structure of a fully observed, discrete,
2550 factored DBN from a time series is described <a
2551 href="usage_dbn.html#struct_learn">here</a>.
2552 -->
2553
2554
2555 <h2><a name="graphdraw">Visualizing the graph</h2>
2556
2557 Click <a href="graphviz.html">here</a> for more information
2558 on graph visualization.
2559
2560 <h2><a name = "constraint">Constraint-based methods</h2>
2561
2562 The IC algorithm (Pearl and Verma, 1991),
2563 and the faster, but otherwise equivalent, PC algorithm (Spirtes, Glymour, and Scheines 1993),
2564 perform many conditional independence tests,
2565 and combine the resulting constraints into a
2566 PDAG to represent the whole
2567 <a href="#markov_equiv">Markov equivalence class</a>.
2568 <p>
2569 IC*/FCI extend IC/PC to handle latent variables: see <a href="#ic_star">below</a>.
2570 (IC stands for inductive causation; PC stands for Peter and Clark,
2571 the first names of Spirtes and Glymour; FCI stands for fast causal
2572 inference.
2573 What we, following Pearl (2000), call IC* was called
2574 IC in the original Pearl and Verma paper.)
2575 For details, see
2576 <ul>
2577 <li>
2578 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Causation,
2579 Prediction, and Search</a>, Spirtes, Glymour and
2580 Scheines (SGS), 2001 (2nd edition), MIT Press.
2581 <li>
2582 <a href="http://bayes.cs.ucla.edu/BOOK-2K/index.html">Causality: Models, Reasoning and Inference</a>, J. Pearl,
2583 2000, Cambridge University Press.
2584 </ul>
2585
2586 <p>
2587
2588 The PC algorithm takes as arguments a function f, the number of nodes N,
2589 the maximum fan in K, and additional arguments A which are passed to f.
2590 The function f(X,Y,S,A) returns 1 if X is conditionally independent of Y given S, and 0
2591 otherwise.
2592 For example, suppose we cheat by
2593 passing in a CI "oracle" which has access to the true DAG; the oracle
2594 tests for d-separation in this DAG, i.e.,
2595 f(X,Y,S) calls dsep(X,Y,S,dag). We can do this as follows.
2596 <pre>
2597 pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag);
2598 </pre>
2599 pdag(i,j) = -1 if there is definitely an i->j arc,
2600 and pdag(i,j) = 1 if there is either an i->j or an i<-j arc.
2601 <p>
2602 Applied to the sprinkler network, this returns
2603 <pre>
2604 pdag =
2605 0 1 1 0
2606 1 0 0 -1
2607 1 0 0 -1
2608 0 0 0 0
2609 </pre>
2610 So as expected, we see that the V-structure at the W node is uniquely identified,
2611 but the other arcs have ambiguous orientation.
2612 <p>
2613 We now give an example from p141 (1st edn) / p103 (2nd edn) of the SGS
2614 book.
2615 This example concerns the female orgasm.
2616 We are given a correlation matrix C between 7 measured factors (such
2617 as subjective experiences of coital and masturbatory experiences),
2618 derived from 281 samples, and want to learn a causal model of the
2619 data. We will not discuss the merits of this type of work here, but
2620 merely show how to reproduce the results in the SGS book.
2621 Their program,
2622 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Tetrad</a>,
2623 makes use of the Fisher Z-test for conditional
2624 independence, so we do the same:
2625 <pre>
2626 max_fan_in = 4;
2627 nsamples = 281;
2628 alpha = 0.05;
2629 pdag = learn_struct_pdag_pc('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha);
2630 </pre>
2631 In this case, the CI test is
2632 <pre>
2633 f(X,Y,S) = cond_indep_fisher_z(X,Y,S, C,nsamples,alpha)
2634 </pre>
2635 The results match those of Fig 12a of SGS apart from two edge
2636 differences; presumably this is due to rounding error (although it
2637 could be a bug, either in BNT or in Tetrad).
2638 This example can be found in the file BNT/examples/static/pc2.m.
2639
2640 <p>
2641
2642 The IC* algorithm (Pearl and Verma, 1991),
2643 and the faster FCI algorithm (Spirtes, Glymour, and Scheines 1993),
2644 are like the IC/PC algorithm, except that they can detect the presence
2645 of latent variables.
2646 See the file <tt>learn_struct_pdag_ic_star</tt> written by Tamar
2647 Kushnir. The output is a matrix P, defined as follows
2648 (see Pearl (2000), p52 for details):
2649 <pre>
2650 % P(i,j) = -1 if there is either a latent variable L such that i <-L->j OR there is a directed edge from i->j.
2651 % P(i,j) = -2 if there is a marked directed i-*>j edge.
2652 % P(i,j) = P(j,i) = 1 if there is an undirected edge i--j
2653 % P(i,j) = P(j,i) = 2 if there is a latent variable L such that i<-L->j.
2654 </pre>
2655
2656
2657 <h2><a name="phl">Philippe Leray's structure learning package</h2>
2658
2659 Philippe Leray has written a
2660 <a href="http://bnt.insa-rouen.fr/ajouts.html">
2661 structure learning package</a> that uses BNT.
2662
2663 It currently (June 2003) has the following features:
2664 <ul>
2665 <li>PC with Chi2 statistical test
2666 <li> MWST : Maximum weighted Spanning Tree
2667 <li> Hill Climbing
2668 <li> Greedy Search
2669 <li> Structural EM
2670 <li> hist_ic : optimal Histogram based on IC information criterion
2671 <li> cpdag_to_dag
2672 <li> dag_to_cpdag
2673 <li> ...
2674 </ul>
2675
2676
2678
2679
2680 <!--
2681 <h2><a name="read_learning">Further reading on learning</h2>
2682
2683 I recommend the following tutorials for more details on learning.
2684 <ul>
2685 <li> <a
2686 href="http://www.cs.berkeley.edu/~murphyk/Papers/intel.ps.gz">My short
2687 tutorial</a> on graphical models, which contains an overview of learning.
2688
2689 <li>
2690 <A HREF="ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.PS">
2691 A tutorial on learning with Bayesian networks</a>, D. Heckerman,
2692 Microsoft Research Tech Report, 1995.
2693
2694 <li> <A HREF="http://www-cad.eecs.berkeley.edu/~wray/Mirror/lwgmja">
2695 Operations for Learning with Graphical Models</a>,
2696 W. L. Buntine, JAIR'94, 159--225.
2697 </ul>
2698 <p>
2699 -->
2700
2701
2702
2703
2704
2705 <h1><a name="engines">Inference engines</h1>
2706
2707 Up until now, we have used the junction tree algorithm for inference.
2708 However, sometimes this is too slow, or not even applicable.
2709 In general, there are many inference algorithms each of which make
2710 different tradeoffs between speed, accuracy, complexity and
2711 generality. Furthermore, there might be many implementations of the
2712 same algorithm; for instance, a general purpose, readable version,
2713 and a highly-optimized, specialized one.
2714 To cope with this variety, we treat each inference algorithm as an
2715 object, which we call an inference engine.
2716
2717 <p>
2718 An inference engine is an object that contains a bnet and supports the
2719 'enter_evidence' and 'marginal_nodes' methods. The engine constructor
2720 takes the bnet as argument and may do some model-specific processing.
2721 When 'enter_evidence' is called, the engine may do some
2722 evidence-specific processing. Finally, when 'marginal_nodes' is
2723 called, the engine may do some query-specific processing.
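<p>
Schematically, every engine is used in the same three-stage way; for
instance, with the junction tree engine used throughout this document:
<pre>
engine = jtree_inf_engine(bnet);            % model-specific processing
engine = enter_evidence(engine, evidence);  % evidence-specific processing
m = marginal_nodes(engine, query_nodes);    % query-specific processing
</pre>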
2724
2725 <p>
2726 The amount of work done when each stage is specified -- structure,
2727 parameters, evidence, and query -- depends on the engine. The cost of
2728 work done early in this sequence can be amortized. On the other hand,
2729 one can make better optimizations if one waits until later in the
2730 sequence.
2731 For example, the parameters might imply
2732 conditional independencies that are not evident in the graph structure,
2733 but can nevertheless be exploited; the evidence indicates which nodes
2734 are observed and hence can effectively be disconnected from the
2735 graph; and the query might indicate that large parts of the network
2736 are d-separated from the query nodes. (Since it is not the actual
2737 <em>values</em> of the evidence that matters, just which nodes are observed,
2738 many engines allow you to specify which nodes will be observed when they are constructed,
2739 i.e., before calling 'enter_evidence'. Some engines can still cope if
2740 the actual pattern of evidence is different, e.g., if there is missing
2741 data.)
2742 <p>
2743
2744 Although being maximally lazy (i.e., only doing work when a query is
2745 issued) may seem desirable,
2746 this is not always the most efficient.
2747 For example,
2748 when learning using EM, we need to call marginal_nodes N times, where N is the
2749 number of nodes. <a href="varelim">Variable elimination</a> would end
2750 up repeating a lot of work
2751 each time marginal_nodes is called, making it inefficient for
2752 learning. The junction tree algorithm, by contrast, uses dynamic
2753 programming to avoid this redundant computation --- it calculates all
2754 marginals in two passes during 'enter_evidence', so calling
2755 'marginal_nodes' takes constant time.
2756 <p>
2757 We will discuss some of the inference algorithms implemented in BNT
2758 below, and finish with a <a href="#engine_summary">summary</a> of all
2759 of them.
2760
2761
2762
2763
2764
2765
2766
2767 <h2><a name="varelim">Variable elimination</h2>
2768
2769 The variable elimination algorithm, also known as bucket elimination
2770 or peeling, is one of the simplest inference algorithms.
2771 The basic idea is to "push sums inside of products"; this is explained
2772 in more detail
2773 <a
2774 href="http://HTTP.CS.Berkeley.EDU/~murphyk/Bayes/bayes.html#infer">here</a>.
2775 <p>
2776 The principle of distributing sums over products can be generalized
2777 greatly to apply to any commutative semiring.
2778 This forms the basis of many common algorithms, such as Viterbi
2779 decoding and the Fast Fourier Transform. For details, see
2780
2781 <ul>
2782 <li> R. McEliece and S. M. Aji, 2000.
2783 <!--<a href="http://www.systems.caltech.edu/EE/Faculty/rjm/papers/GDL.ps">-->
2784 <a href="GDL.pdf">
2785 The Generalized Distributive Law</a>,
2786 IEEE Trans. Inform. Theory, vol. 46, no. 2 (March 2000),
2787 pp. 325--343.
2788
2789
2790 <li>
2791 F. R. Kschischang, B. J. Frey and H.-A. Loeliger, 2001.
2792 <a href="http://www.cs.toronto.edu/~frey/papers/fgspa.abs.html">
2793 Factor graphs and the sum-product algorithm</a>
2794 IEEE Transactions on Information Theory, February, 2001.
2795
2796 </ul>
2797
2798 <p>
2799 Choosing an order in which to sum out the variables so as to minimize
2800 computational cost is known to be NP-hard.
2801 The implementation of this algorithm in
2802 <tt>var_elim_inf_engine</tt> makes no attempt to optimize this
2803 ordering (in contrast, say, to <tt>jtree_inf_engine</tt>, which uses a
2804 greedy search procedure to find a good ordering).
2805 <p>
2806 Note: unlike most algorithms, var_elim does all its computational work
2807 inside of <tt>marginal_nodes</tt>, not inside of
2808 <tt>enter_evidence</tt>.
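<p>
Usage follows the standard engine interface (a brief sketch):
<pre>
engine = var_elim_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);
</pre>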
2809
2810
2811
2812
2813 <h2><a name="global">Global inference methods</h2>
2814
2815 The simplest inference algorithm of all is to explicitly construct
2816 the joint distribution over all the nodes, and then to marginalize it.
2817 This is implemented in <tt>global_joint_inf_engine</tt>.
2818 Since the size of the joint is exponential in the
2819 number of discrete (hidden) nodes, this is not a very practical algorithm.
2820 It is included merely for pedagogical and debugging purposes.
2821 <p>
2822 Three specialized versions of this algorithm have also been implemented,
2823 corresponding to the cases where all the nodes are discrete (D), all
2824 are Gaussian (G), and some are discrete and some Gaussian (CG).
2825 They are called <tt>enumerative_inf_engine</tt>,
2826 <tt>gaussian_inf_engine</tt>,
2827 and <tt>cond_gauss_inf_engine</tt> respectively.
2828 <p>
2829 Note: unlike most algorithms, these global inference algorithms do all their computational work
2830 inside of <tt>marginal_nodes</tt>, not inside of
2831 <tt>enter_evidence</tt>.
2832
2833
2834 <h2><a name="quickscore">Quickscore</h2>
2835
2836 The junction tree algorithm is quite slow on the <a href="#qmr">QMR</a> network,
2837 since the cliques are so big.
2838 One simple trick we can use is to notice that hidden leaves do not
2839 affect the posteriors on the roots, and hence do not need to be
2840 included in the network.
2841 A second trick is to notice that the negative findings can be
2842 "absorbed" into the prior:
2843 see the file
2844 BNT/examples/static/mk_minimal_qmr_bnet for details.
2845 <p>
2846
2847 A much more significant speedup is obtained by exploiting special
2848 properties of the noisy-or node, as done by the quickscore
2849 algorithm. For details, see
2850 <ul>
2851 <li> Heckerman, "A tractable inference algorithm for diagnosing multiple diseases", UAI 89.
2852 <li> Rish and Dechter, "On the impact of causal independence", UCI
2853 tech report, 1998.
2854 </ul>
2855
2856 This has been implemented in BNT as a special-purpose inference
2857 engine, which can be created and used as follows:
2858 <pre>
2859 engine = quickscore_inf_engine(inhibit, leak, prior);
2860 engine = enter_evidence(engine, pos, neg);
2861 m = marginal_nodes(engine, i);
2862 </pre>
2863
2864
2865 <h2><a name="belprop">Belief propagation</h2>
2866
2867 Even using quickscore, exact inference takes time that is exponential
2868 in the number of positive findings.
2869 Hence for large networks we need to resort to approximate inference techniques.
2870 See for example
2871 <ul>
2872 <li> T. Jaakkola and M. Jordan, "Variational probabilistic inference and the
2873 QMR-DT network", JAIR 10, 1999.
2874
2875 <li> K. Murphy, Y. Weiss and M. Jordan, "Loopy belief propagation for approximate inference: an empirical study",
2876 UAI 99.
2877 </ul>
2878 The latter approximation
2879 entails applying Pearl's belief propagation algorithm to a model even
2880 if it has loops (hence the name loopy belief propagation).
2881 Pearl's algorithm, implemented as <tt>pearl_inf_engine</tt>, gives
2882 exact results when applied to singly-connected graphs
2883 (a.k.a. polytrees, since
2884 the underlying undirected topology is a tree, but a node may have
2885 multiple parents).
2886 To apply this algorithm to a graph with loops,
2887 use <tt>pearl_inf_engine</tt>.
2888 This can use a centralized or distributed message passing protocol.
2889 You can use it as in the following example.
2890 <pre>
2891 engine = pearl_inf_engine(bnet, 'max_iter', 30);
2892 engine = enter_evidence(engine, evidence);
2893 m = marginal_nodes(engine, i);
2894 </pre>
2895 We found that this algorithm often converges, and when it does, it is
2896 often very accurate; however, this depends on the precise setting of the
2897 parameter values of the network.
2898 (See the file BNT/examples/static/qmr1 to repeat the experiment for yourself.)
2899 Understanding when and why belief propagation converges/ works
2900 is a topic of ongoing research.
2901 <p>
2902 <tt>pearl_inf_engine</tt> can exploit special structure in noisy-or
2903 and gmux nodes to compute messages efficiently.
2904 <p>
2905 <tt>belprop_inf_engine</tt> is like pearl, but uses potentials to
2906 represent messages. Hence this is slower.
2907 <p>
2908 <tt>belprop_fg_inf_engine</tt> is like belprop,
2909 but is designed for factor graphs.
2910
2911
2912
2913 <h2><a name="sampling">Sampling</h2>
2914
2915 BNT now (Mar '02) has two sampling (Monte Carlo) inference algorithms:
2916 <ul>
2917 <li> <tt>likelihood_weighting_inf_engine</tt> which does importance
2918 sampling and can handle any node type.
2919 <li> <tt>gibbs_sampling_inf_engine</tt>, written by Bhaskara Marthi.
2920 Currently this can only handle tabular CPDs.
2921 For a much faster and more powerful Gibbs sampling program, see
2922 <a href="http://www.mrc-bsu.cam.ac.uk/bugs">BUGS</a>.
2923 </ul>
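Both engines follow the same interface as the other engines (a sketch;
engine-specific constructor options, such as the number of samples to
draw, are omitted):
<pre>
engine = likelihood_weighting_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);
</pre>
<p>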
2924 Note: To generate samples from a network (which is not the same as inference!),
2925 use <tt>sample_bnet</tt>.
2926
2927
2928
2929 <h2><a name="engine_summary">Summary of inference engines</h2>
2930
2931
2932 The inference engines differ in many ways. Here are
2933 some of the major "axes":
2934 <ul>
2935 <li> Works for all topologies or makes restrictions?
2936 <li> Works for all node types or makes restrictions?
2937 <li> Exact or approximate inference?
2938 </ul>
2939
2940 <p>
2941 In terms of topology, most engines handle any kind of DAG.
2942 <tt>belprop_fg</tt> does approximate inference on factor graphs (FG), which
2943 can be used to represent directed, undirected, and mixed (chain)
2944 graphs.
2945 (In the future, we plan to support exact inference on chain graphs.)
2946 <tt>quickscore</tt> only works on QMR-like models.
2947 <p>
2948 In terms of node types: algorithms that use potentials can handle
2949 discrete (D), Gaussian (G) or conditional Gaussian (CG) models.
2950 Sampling algorithms can essentially handle any kind of node (distribution).
2951 Other algorithms make more restrictive assumptions in exchange for
2952 speed.
2953 <p>
2954 Finally, most algorithms are designed to give the exact answer.
2955 The belief propagation algorithms are exact if applied to trees, and
2956 in some other cases.
2957 Sampling is considered approximate, even though, in the limit of an
2958 infinite number of samples, it gives the exact answer.
2959
2960 <p>
2961
2962 Here is a summary of the properties
2963 of all the engines in BNT which work on static networks.
2964 <p>
2965 <table border>
2966 <tr>
2967 <td align=left width=0>Name
2968 <td align=left width=0>Exact?
2969 <td align=left width=0>Node type?
2970 <td align=left width=0>topology
2971 <tr>
2972 <tr>
2973 <td align=left> belprop
2974 <td align=left> approx
2975 <td align=left> D
2976 <td align=left> DAG
2977 <tr>
2978 <td align=left> belprop_fg
2979 <td align=left> approx
2980 <td align=left> D
2981 <td align=left> factor graph
2982 <tr>
2983 <td align=left> cond_gauss
2984 <td align=left> exact
2985 <td align=left> CG
2986 <td align=left> DAG
2987 <tr>
2988 <td align=left> enumerative
2989 <td align=left> exact
2990 <td align=left> D
2991 <td align=left> DAG
2992 <tr>
2993 <td align=left> gaussian
2994 <td align=left> exact
2995 <td align=left> G
2996 <td align=left> DAG
2997 <tr>
2998 <td align=left> gibbs
2999 <td align=left> approx
3000 <td align=left> D
3001 <td align=left> DAG
3002 <tr>
3003 <td align=left> global_joint
3004 <td align=left> exact
3005 <td align=left> D,G,CG
3006 <td align=left> DAG
3007 <tr>
3008 <td align=left> jtree
3009 <td align=left> exact
3010 <td align=left> D,G,CG
3011 <td align=left> DAG
3012 <tr>
3013 <td align=left> likelihood_weighting
3014 <td align=left> approx
3015 <td align=left> any
3016 <td align=left> DAG
3017 <tr>
3018 <td align=left> pearl
3019 <td align=left> approx
3020 <td align=left> D,G
3021 <td align=left> DAG
3022 <tr>
3023 <td align=left> pearl
3024 <td align=left> exact
3025 <td align=left> D,G
3026 <td align=left> polytree
3027 <tr>
3028 <td align=left> quickscore
3029 <td align=left> exact
3030 <td align=left> noisy-or
3031 <td align=left> QMR
3032 <tr>
3033 <td align=left> stab_cond_gauss
3034 <td align=left> exact
3035 <td align=left> CG
3036 <td align=left> DAG
3037 <tr>
3038 <td align=left> var_elim
3039 <td align=left> exact
3040 <td align=left> D,G,CG
3041 <td align=left> DAG
3042 </table>
3043
3044
3045
3046 <h1><a name="influence">Influence diagrams/ decision making</h1>
3047
3048 BNT implements an exact algorithm for solving LIMIDs (limited memory
3049 influence diagrams), described in
3050 <ul>
3051 <li> S. L. Lauritzen and D. Nilsson.
3052 <a href="http://www.math.auc.dk/~steffen/papers/limids.pdf">
3053 Representing and solving decision problems with limited
3054 information</a>
3055 Management Science, 47, 1238 - 1251. September 2001.
3056 </ul>
3057 LIMIDs explicitly show all information arcs, rather than implicitly
3058 assuming no forgetting. This allows them to model forgetful
3059 controllers.
3060 <p>
3061 See the examples in <tt>BNT/examples/limids</tt> for details.
3062
3063
3064
3065
3066 <h1>DBNs, HMMs, Kalman filters and all that</h1>
3067
3068 Click <a href="usage_dbn.html">here</a> for documentation about how to
3069 use BNT for dynamical systems and sequence data.
3070
3071
3072 </BODY>