1 <HEAD>
2 <TITLE>How to use the Bayes Net Toolbox</TITLE>
3 </HEAD>
4
5 <BODY BGCOLOR="#FFFFFF">
6 <!-- white background is better for the pictures and equations -->
7
8 <h1>How to use the Bayes Net Toolbox</h1>
9
10 This documentation was last updated on 13 November 2002.
11 <br>
12 Click <a href="changelog.html">here</a> for a list of changes made to
13 BNT.
14 <br>
15 Click
16 <a href="http://bnt.insa-rouen.fr/">here</a>
17 for a French version of this documentation (which might not
18 be up-to-date).
19
20
21 <p>
22
23 <ul>
24 <li> <a href="#install">Installation</a>
25 <ul>
26 <li> <a href="#installM">Installing the Matlab code</a>
27 <li> <a href="#installC">Installing the C code</a>
28 <li> <a href="../matlab_tips.html">Useful Matlab tips</a>.
29 </ul>
30
31 <li> <a href="#basics">Creating your first Bayes net</a>
32 <ul>
33 <li> <a href="#basics">Creating a model by hand</a>
34 <li> <a href="#file">Loading a model from a file</a>
35 <li> <a href="http://bnt.insa-rouen.fr/ajouts.html">Creating a model using a GUI</a>
36 </ul>
37
38 <li> <a href="#inference">Inference</a>
39 <ul>
40 <li> <a href="#marginal">Computing marginal distributions</a>
41 <li> <a href="#joint">Computing joint distributions</a>
42 <li> <a href="#soft">Soft/virtual evidence</a>
43 <li> <a href="#mpe">Most probable explanation</a>
44 </ul>
45
46 <li> <a href="#cpd">Conditional Probability Distributions</a>
47 <ul>
48 <li> <a href="#tabular">Tabular (multinomial) nodes</a>
49 <li> <a href="#noisyor">Noisy-or nodes</a>
50 <li> <a href="#deterministic">Other (noisy) deterministic nodes</a>
51 <li> <a href="#softmax">Softmax (multinomial logit) nodes</a>
52 <li> <a href="#mlp">Neural network nodes</a>
53 <li> <a href="#root">Root nodes</a>
54 <li> <a href="#gaussian">Gaussian nodes</a>
55 <li> <a href="#glm">Generalized linear model nodes</a>
56 <li> <a href="#dtree">Classification/regression tree nodes</a>
57 <li> <a href="#nongauss">Other continuous distributions</a>
58 <li> <a href="#cpd_summary">Summary of CPD types</a>
59 </ul>
60
61 <li> <a href="#examples">Example models</a>
62 <ul>
63 <li> <a
64 href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">
65 Gaussian mixture models</a>
66 <li> <a href="#pca">PCA, ICA, and all that</a>
<li> <a href="#mixexp">Mixtures of experts</a>
68 <li> <a href="#hme">Hierarchical mixtures of experts</a>
69 <li> <a href="#qmr">QMR</a>
70 <li> <a href="#cg_model">Conditional Gaussian models</a>
71 <li> <a href="#hybrid">Other hybrid models</a>
72 </ul>
73
74 <li> <a href="#param_learning">Parameter learning</a>
75 <ul>
76 <li> <a href="#load_data">Loading data from a file</a>
77 <li> <a href="#mle_complete">Maximum likelihood parameter estimation from complete data</a>
78 <li> <a href="#prior">Parameter priors</a>
79 <li> <a href="#bayes_learn">(Sequential) Bayesian parameter updating from complete data</a>
80 <li> <a href="#em">Maximum likelihood parameter estimation with missing values (EM)</a>
81 <li> <a href="#tying">Parameter tying</a>
82 </ul>
83
84 <li> <a href="#structure_learning">Structure learning</a>
85 <ul>
86 <li> <a href="#enumerate">Exhaustive search</a>
87 <li> <a href="#K2">K2</a>
88 <li> <a href="#hill_climb">Hill-climbing</a>
89 <li> <a href="#mcmc">MCMC</a>
90 <li> <a href="#active">Active learning</a>
91 <li> <a href="#struct_em">Structural EM</a>
92 <li> <a href="#graphdraw">Visualizing the learned graph structure</a>
93 <li> <a href="#constraint">Constraint-based methods</a>
94 </ul>
95
96
97 <li> <a href="#engines">Inference engines</a>
98 <ul>
99 <li> <a href="#jtree">Junction tree</a>
100 <li> <a href="#varelim">Variable elimination</a>
101 <li> <a href="#global">Global inference methods</a>
102 <li> <a href="#quickscore">Quickscore</a>
103 <li> <a href="#belprop">Belief propagation</a>
104 <li> <a href="#sampling">Sampling (Monte Carlo)</a>
105 <li> <a href="#engine_summary">Summary of inference engines</a>
106 </ul>
107
108
109 <li> <a href="#influence">Influence diagrams/ decision making</a>
110
111
112 <li> <a href="usage_dbn.html">DBNs, HMMs, Kalman filters and all that</a>
113 </ul>
114
116
117
118
119
120 <h1><a name="install">Installation</h1>
121
122 <h2><a name="installM">Installing the Matlab code</h2>
123
124 <ul>
<li> <a href="bnt_download.html">Download</a> the FullBNT.zip file.
126
127 <p>
128 <li> Unpack the file. In Unix, type
129 <!--"tar xvf BNT.tar".-->
130 "unzip FullBNT.zip".
131 In Windows, use
132 a program like <a href="http://www.winzip.com">Winzip</a>. This will
133 create a directory called FullBNT, which contains BNT and other libraries.
134
135 <p>
136 <li> Read the file <tt>BNT/README</tt> to make sure the date
137 matches the one on the top of <a href=bnt.html>the BNT home page</a>.
138 If not, you may need to press 'refresh' on your browser, and download
139 again, to get the most recent version.
140
141 <p>
142 <li> <b>Edit the file "BNT/add_BNT_to_path.m"</b> so it contains the correct
143 pathname.
144 For example, in Windows,
145 I download FullBNT.zip into C:\kpmurphy\matlab, and
146 then comment out the second line (with the % character), and uncomment
147 the third line, which reads
148 <pre>
149 BNT_HOME = 'C:\kpmurphy\matlab\FullBNT';
150 </pre>
151
152 <p>
153 <li> Start up Matlab.
154
155 <p>
156 <li> Type "ver" at the Matlab prompt (">>").
157 <b>You need Matlab version 5.2 or newer to run BNT</b>.
158 (Versions 5.0 and 5.1 have a memory leak which seems to sometimes
159 crash BNT.)
160
161 <p>
162 <li> Move to the BNT directory.
163 For example, in Windows, I type
164 <pre>
165 >> cd C:\kpmurphy\matlab\FullBNT\BNT
166 </pre>
167
168 <p>
169 <li> Type "add_BNT_to_path".
170 This executes the command
171 <tt>addpath(genpath(BNT_HOME))</tt>,
172 which adds all directories below FullBNT to the matlab path.
173
174 <p>
175 <li> Type "test_BNT".
176 <b>If all goes well, this will just produce a bunch of numbers and pictures.</b>
177 It may produce some
178 warning messages (which you can ignore), but should not produce any error messages.
179 (The warnings should only be of the form
180 "Warning: Maximum number of iterations has been exceeded", and are
181 produced by Netlab.)
182
183 <p>
184 <li> <a href="http://groups.yahoo.com/group/BayesNetToolbox/join">
185 Join the BNT email list</a>
186
187 </ul>
188
189
190 If you are new to Matlab, you might like to check out
191 <a href="matlab_tips.html">some useful Matlab tips</a>.
192 For instance, this explains how to create a startup file, which can be
193 used to set your path variable automatically, so you can avoid having
194 to type the above commands every time.
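<p>
For example, a minimal <tt>startup.m</tt> along the following lines (the
pathname is just a placeholder for wherever you unpacked FullBNT) would
set up BNT automatically each time Matlab starts:
<pre>
% startup.m -- put this somewhere on Matlab's default path
BNT_HOME = 'C:\kpmurphy\matlab\FullBNT';  % replace with your own location
addpath(genpath(BNT_HOME));               % add FullBNT and all its subdirectories
</pre>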
195
196
197
198
199 <h2><a name="installC">Installing the C code</h2>
200
201 Some BNT functions also have C implementations.
<b>It is not necessary to install the C code</b>, but it can speed up
certain simple functions by a factor of 2-5.
204 To install all the C code,
205 edit installC_BNT.m so it contains the right path,
206 then type <tt>installC_BNT</tt>.
207 To uninstall all the C code,
208 edit uninstallC_BNT.m so it contains the right path,
209 then type <tt>uninstallC_BNT</tt>.
210 For an up-to-date list of the files which have C implementations, see
211 BNT/installC_BNT.m.
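<p>
For example, assuming the paths used above, the whole procedure might look
like this (a sketch only; adjust the pathname to your own installation):
<pre>
>> cd C:\kpmurphy\matlab\FullBNT\BNT
>> mex -setup        % only needed if your C compiler is not yet configured (see below)
>> installC_BNT      % after editing installC_BNT.m to contain the right path
</pre>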
212
213 <p>
214 mex is a script that lets you call C code from Matlab - it does not compile matlab to
215 C (see mcc below).
216 If your C/C++ compiler is set up correctly, mex should work out of
217 the box. (Matlab 6 now ships with its own minimal C compiler.)
218 If not, you might need to type
219 <p>
220 <tt> mex -setup</tt>
221 <p>
222 before calling installC.
223 <p>
224 To make mex call gcc on Windows,
225 you must install <a
226 href="http://www.mrc-cbu.cam.ac.uk/Imaging/gnumex20.html">gnumex</a>.
227 You can use the <a href="http://www.mingw.org/">minimalist GNU for
228 Windows</a> version of gcc, or
229 the <a href="http://sources.redhat.com/cygwin/">cygwin</a> version.
230 <p>
231 In general, typing
232 'mex foo.c' from inside Matlab creates a file called
233 'foo.mexglx' or 'foo.dll' (the exact file
234 extension is system dependent - on Linux it is 'mexglx', on Windows it is '.dll').
235 The resulting file will hide the original 'foo.m' (if it existed), i.e.,
236 typing 'foo' at the prompt will call the compiled C version.
237 To reveal the original matlab version, just delete foo.mexglx (this is
238 what uninstallC does).
239 <p>
240 Sometimes it takes time for Matlab to realize that the file has
241 changed from matlab to C or vice versa; try typing 'clear all' or
242 restarting Matlab to refresh it.
243 To find out which version of a file you are running, type
244 'which foo'.
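<p>
For instance, a typical compile/check/revert cycle, using a hypothetical
file foo.c that shadows foo.m, might look like this:
<pre>
>> mex foo.c           % creates foo.mexglx (Linux) or foo.dll (Windows)
>> which foo           % shows whether the M-file or the compiled version will be run
>> delete foo.mexglx   % (or foo.dll) revert to the original foo.m
>> clear all           % force Matlab to notice the change
</pre>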
245 <p>
246 <a href="http://www.mathworks.com/products/compiler">mcc</a>, the
247 Matlab to C compiler, is a separate product,
248 and is quite different from mex. It does not yet support
249 objects/classes, which is why we can't compile all of BNT to C automatically.
250 Also, hand-written C code is usually much
251 better than the C code generated by mcc.
252
253
254 <p>
255 Acknowledgements:
256 Although I wrote some of the C code, most of
257 the C code (e.g., for jtree and dpot) was written by Wei Hu;
258 the triangulation C code was written by Ilya Shpitser.
259
260
261 <h1><a name="basics">Creating your first Bayes net</h1>
262
263 To define a Bayes net, you must specify the graph structure and then
264 the parameters. We look at each in turn, using a simple example
265 (adapted from Russell and
266 Norvig, "Artificial Intelligence: a Modern Approach", Prentice Hall,
267 1995, p454).
268
269
270 <h2>Graph structure</h2>
271
272
273 Consider the following network.
274
275 <p>
276 <center>
277 <IMG SRC="Figures/sprinkler.gif">
278 </center>
279 <p>
280
281 <P>
282 To specify this directed acyclic graph (dag), we create an adjacency matrix:
283 <PRE>
284 N = 4;
285 dag = zeros(N,N);
286 C = 1; S = 2; R = 3; W = 4;
287 dag(C,[R S]) = 1;
288 dag(R,W) = 1;
289 dag(S,W)=1;
290 </PRE>
291 <P>
292 We have numbered the nodes as follows:
293 Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4.
294 <b>The nodes must always be numbered in topological order, i.e.,
295 ancestors before descendants.</b>
296 For a more complicated graph, this is a little inconvenient: we will
297 see how to get around this <a href="usage_dbn.html#bat">below</a>.
298 <p>
299 <!--
300 In Matlab 6, you can use logical arrays instead of double arrays,
301 which are 4 times smaller:
302 <pre>
303 dag = false(N,N);
304 dag(C,[R S]) = true;
305 ...
306 </pre>
-->
308 <p>
309 A preliminary attempt to make a <b>GUI</b>
has been written by Philippe LeRay and can be downloaded
311 from <a href="http://bnt.insa-rouen.fr/ajouts.html">here</a>.
312 <p>
313 You can visualize the resulting graph structure using
314 the methods discussed <a href="#graphdraw">below</a>.
315
316 <h2>Creating the Bayes net shell</h2>
317
318 In addition to specifying the graph structure,
319 we must specify the size and type of each node.
If a node is discrete, its size is the
number of possible values it can take on; if a node is continuous,
it can be a vector, and its size is the length of this vector.
324 In this case, we will assume all nodes are discrete and binary.
325 <PRE>
326 discrete_nodes = 1:N;
327 node_sizes = 2*ones(1,N);
328 </pre>
329 If the nodes were not binary, you could type e.g.,
330 <pre>
331 node_sizes = [4 2 3 5];
332 </pre>
333 meaning that Cloudy has 4 possible values,
334 Sprinkler has 2 possible values, etc.
Note that these are categorical values, not ordinal, i.e.,
they are not ordered in any way (unlike 'low', 'medium', 'high').
337 <p>
338 We are now ready to make the Bayes net:
339 <pre>
340 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes);
341 </PRE>
342 By default, all nodes are assumed to be discrete, so we can also just
343 write
344 <pre>
345 bnet = mk_bnet(dag, node_sizes);
346 </PRE>
347 You may also specify which nodes will be observed.
If you don't know, or if this is not fixed in advance,
349 just use the empty list (the default).
350 <pre>
351 onodes = [];
352 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);
353 </PRE>
354 Note that optional arguments are specified using a name/value syntax.
355 This is common for many BNT functions.
356 In general, to find out more about a function (e.g., which optional
357 arguments it takes), please see its
358 documentation string by typing
359 <pre>
360 help mk_bnet
361 </pre>
362 See also other <a href="matlab_tips.html">useful Matlab tips</a>.
363 <p>
364 It is possible to associate names with nodes, as follows:
365 <pre>
366 bnet = mk_bnet(dag, node_sizes, 'names', {'cloudy','S','R','W'}, 'discrete', 1:4);
367 </pre>
368 You can then refer to a node by its name:
369 <pre>
370 C = bnet.names{'cloudy'}; % bnet.names is an associative array
371 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
372 </pre>
373
374
375 <h2><a name="cpt">Parameters</h2>
376
377 A model consists of the graph structure and the parameters.
378 The parameters are represented by CPD objects (CPD = Conditional
379 Probability Distribution), which define the probability distribution
380 of a node given its parents.
381 (We will use the terms "node" and "random variable" interchangeably.)
382 The simplest kind of CPD is a table (multi-dimensional array), which
383 is suitable when all the nodes are discrete-valued. Note that the discrete
384 values are not assumed to be ordered in any way; that is, they
385 represent categorical quantities, like male and female, rather than
386 ordinal quantities, like low, medium and high.
387 (We will discuss CPDs in more detail <a href="#cpd">below</a>.)
388 <p>
389 Tabular CPDs, also called CPTs (conditional probability tables),
390 are stored as multidimensional arrays, where the dimensions
391 are arranged in the same order as the nodes, e.g., the CPT for node 4
392 (WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself.
393 Hence the child is always the last dimension.
394 If a node has no parents, its CPT is a column vector representing its
395 prior.
396 Note that in Matlab (unlike C), arrays are indexed
from 1, and are laid out in memory such that the first index toggles
398 fastest, e.g., the CPT for node 4 (WetGrass) is as follows
399 <P>
400 <P><IMG ALIGN=BOTTOM SRC="Figures/CPTgrass.gif"><P>
401 <P>
402 where we have used the convention that false==1, true==2.
403 We can create this CPT in Matlab as follows
404 <PRE>
405 CPT = zeros(2,2,2);
406 CPT(1,1,1) = 1.0;
407 CPT(2,1,1) = 0.1;
408 ...
409 </PRE>
410 Here is an easier way:
411 <PRE>
412 CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);
413 </PRE>
414 In fact, we don't need to reshape the array, since the CPD constructor
415 will do that for us. So we can just write
416 <pre>
417 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
418 </pre>
419 The other nodes are created similarly (using the old syntax for
420 optional parameters)
421 <PRE>
422 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
423 bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
424 bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
425 bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
426 </PRE>
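<p>
With all the CPDs specified, the model is complete. As an aside, you can
draw a random joint sample from the network (one value per node, returned
as a cell array) using the <tt>sample_bnet</tt> function:
<pre>
sample = sample_bnet(bnet);   % sample{i} is the value of node i in this sample
</pre>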
427
428
429 <h2><a name="rnd_cpt">Random Parameters</h2>
430
431 If we do not specify the CPT, random parameters will be
432 created, i.e., each "row" of the CPT will be drawn from the uniform distribution.
433 To ensure repeatable results, use
434 <pre>
435 rand('state', seed);
436 randn('state', seed);
437 </pre>
438 To control the degree of randomness (entropy),
439 you can sample each row of the CPT from a Dirichlet(p,p,...) distribution.
440 If p << 1, this encourages "deterministic" CPTs (one entry near 1, the rest near 0).
441 If p = 1, each entry is drawn from U[0,1].
442 If p >> 1, the entries will all be near 1/k, where k is the arity of
443 this node, i.e., each row will be nearly uniform.
444 You can do this as follows, assuming this node
445 is number i, and ns is the node_sizes.
446 <pre>
447 k = ns(i);
448 ps = parents(dag, i);
449 psz = prod(ns(ps));
450 CPT = sample_dirichlet(p*ones(1,k), psz);
451 bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT);
452 </pre>
453
454
455 <h2><a name="file">Loading a network from a file</h2>
456
457 If you already have a Bayes net represented in the XML-based
458 <a href="http://www.cs.cmu.edu/afs/cs/user/fgcozman/www/Research/InterchangeFormat/">
459 Bayes Net Interchange Format (BNIF)</a> (e.g., downloaded from the
460 <a
461 href="http://www.cs.huji.ac.il/labs/compbio/Repository">
462 Bayes Net repository</a>),
463 you can convert it to BNT format using
464 the
465 <a href="http://www.digitas.harvard.edu/~ken/bif2bnt/">BIF-BNT Java
466 program</a> written by Ken Shan.
467 (This is not necessarily up-to-date.)
468
469
470 <h2>Creating a model using a GUI</h2>
471
472 Click <a href="http://bnt.insa-rouen.fr/ajouts.html">here</a>.
473
474
475
476 <h1><a name="inference">Inference</h1>
477
478 Having created the BN, we can now use it for inference.
There are many different algorithms for doing inference in Bayes nets,
which make different tradeoffs between speed,
481 complexity, generality, and accuracy.
482 BNT therefore offers a variety of different inference
483 "engines". We will discuss these
484 in more detail <a href="#engines">below</a>.
485 For now, we will use the junction tree
486 engine, which is the mother of all exact inference algorithms.
487 This can be created as follows.
488 <pre>
489 engine = jtree_inf_engine(bnet);
490 </pre>
491 The other engines have similar constructors, but might take
492 additional, algorithm-specific parameters.
493 All engines are used in the same way, once they have been created.
494 We illustrate this in the following sections.
495
496
497 <h2><a name="marginal">Computing marginal distributions</h2>
498
Suppose we want to compute the probability that the sprinkler was on
500 given that the grass is wet.
501 The evidence consists of the fact that W=2. All the other nodes
502 are hidden (unobserved). We can specify this as follows.
503 <pre>
504 evidence = cell(1,N);
505 evidence{W} = 2;
506 </pre>
507 We use a 1D cell array instead of a vector to
508 cope with the fact that nodes can be vectors of different lengths.
509 In addition, the value [] can be used
510 to denote 'no evidence', instead of having to specify the observation
511 pattern as a separate argument.
512 (Click <a href="cellarray.html">here</a> for a quick tutorial on cell
513 arrays in matlab.)
514 <p>
515 We are now ready to add the evidence to the engine.
516 <pre>
517 [engine, loglik] = enter_evidence(engine, evidence);
518 </pre>
519 The behavior of this function is algorithm-specific, and is discussed
520 in more detail <a href="#engines">below</a>.
521 In the case of the jtree engine,
522 enter_evidence implements a two-pass message-passing scheme.
523 The first return argument contains the modified engine, which
524 incorporates the evidence. The second return argument contains the
525 log-likelihood of the evidence. (Not all engines are capable of
526 computing the log-likelihood.)
527 <p>
528 Finally, we can compute p=P(S=2|W=2) as follows.
529 <PRE>
530 marg = marginal_nodes(engine, S);
531 marg.T
532 ans =
533 0.57024
534 0.42976
535 p = marg.T(2);
536 </PRE>
537 We see that p = 0.4298.
538 <p>
539 Now let us add the evidence that it was raining, and see what
540 difference it makes.
541 <PRE>
542 evidence{R} = 2;
543 [engine, loglik] = enter_evidence(engine, evidence);
544 marg = marginal_nodes(engine, S);
545 p = marg.T(2);
546 </PRE>
547 We find that p = P(S=2|W=2,R=2) = 0.1945,
548 which is lower than
549 before, because the rain can ``explain away'' the
550 fact that the grass is wet.
551 <p>
552 You can plot a marginal distribution over a discrete variable
as a barchart using the built-in 'bar' function:
554 <pre>
555 bar(marg.T)
556 </pre>
557 This is what it looks like
558
559 <p>
560 <center>
561 <IMG SRC="Figures/sprinkler_bar.gif">
562 </center>
563 <p>
564
565 <h2><a name="observed">Observed nodes</h2>
566
567 What happens if we ask for the marginal on an observed node, e.g. P(W|W=2)?
568 An observed discrete node effectively only has 1 value (the observed
569 one) --- all other values would result in 0 probability.
For efficiency, BNT treats observed (discrete) nodes as if their
effective size were set to 1, as we see below:
572 <pre>
573 evidence = cell(1,N);
574 evidence{W} = 2;
575 engine = enter_evidence(engine, evidence);
576 m = marginal_nodes(engine, W);
577 m.T
578 ans =
579 1
580 </pre>
581 This can get a little confusing, since we assigned W=2.
582 So we can ask BNT to add the evidence back in by passing in an optional argument:
583 <pre>
584 m = marginal_nodes(engine, W, 1);
585 m.T
586 ans =
587 0
588 1
589 </pre>
590 This shows that P(W=1|W=2) = 0 and P(W=2|W=2) = 1.
591
592
593
594 <h2><a name="joint">Computing joint distributions</h2>
595
596 We can compute the joint probability on a set of nodes as in the
597 following example.
598 <pre>
599 evidence = cell(1,N);
600 [engine, ll] = enter_evidence(engine, evidence);
601 m = marginal_nodes(engine, [S R W]);
602 </pre>
603 m is a structure. The 'T' field is a multi-dimensional array (in
604 this case, 3-dimensional) that contains the joint probability
605 distribution on the specified nodes.
606 <pre>
607 >> m.T
608 ans(:,:,1) =
609 0.2900 0.0410
610 0.0210 0.0009
611 ans(:,:,2) =
612 0 0.3690
613 0.1890 0.0891
614 </pre>
615 We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass
616 to be wet if both the rain and sprinkler are off.
617 <p>
618 Let us now add some evidence to R.
619 <pre>
620 evidence{R} = 2;
621 [engine, ll] = enter_evidence(engine, evidence);
622 m = marginal_nodes(engine, [S R W])
623 m =
624 domain: [2 3 4]
625 T: [2x1x2 double]
626 >> m.T
627 m.T
628 ans(:,:,1) =
629 0.0820
630 0.0018
631 ans(:,:,2) =
632 0.7380
633 0.1782
634 </pre>
635 The joint T(i,j,k) = P(S=i,R=j,W=k|evidence)
636 should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible
637 with the evidence that R=2.
638 Instead of creating large tables with many 0s, BNT sets the effective
639 size of observed (discrete) nodes to 1, as explained above.
640 This is why m.T has size 2x1x2.
641 To get a 2x2x2 table, type
642 <pre>
643 m = marginal_nodes(engine, [S R W], 1)
644 m =
645 domain: [2 3 4]
646 T: [2x2x2 double]
647 >> m.T
648 m.T
649 ans(:,:,1) =
650 0 0.082
651 0 0.0018
652 ans(:,:,2) =
653 0 0.738
654 0 0.1782
655 </pre>
656
657 <p>
658 Note: It is not always possible to compute the joint on arbitrary
659 sets of nodes: it depends on which inference engine you use, as discussed
660 in more detail <a href="#engines">below</a>.
661
662
663 <h2><a name="soft">Soft/virtual evidence</h2>
664
665 Sometimes a node is not observed, but we have some distribution over
666 its possible values; this is often called "soft" or "virtual"
667 evidence.
668 One can use this as follows
669 <pre>
670 [engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
671 </pre>
672 where soft_evidence{i} is either [] (if node i has no soft evidence)
673 or is a vector representing the probability distribution over i's
674 possible values.
675 For example, if we don't know i's exact value, but we know its
676 likelihood ratio is 60/40, we can write evidence{i} = [] and
677 soft_evidence{i} = [0.6 0.4].
678 <p>
679 Currently only jtree_inf_engine supports this option.
680 It assumes that all hidden nodes, and all nodes for
681 which we have soft evidence, are discrete.
682 For a longer example, see BNT/examples/static/softev1.m.
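<p>
For instance, here is a minimal sketch on the sprinkler network above,
combining hard evidence that the grass is wet with 60/40 soft evidence on
the sprinkler node:
<pre>
evidence = cell(1,N);
evidence{W} = 2;                  % hard evidence: the grass is wet
soft_evidence = cell(1,N);
soft_evidence{S} = [0.6 0.4];     % 60/40 likelihood ratio on the sprinkler node
engine = jtree_inf_engine(bnet);
[engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
marg = marginal_nodes(engine, R); % posterior on Rain given both kinds of evidence
</pre>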
683
684
685 <h2><a name="mpe">Most probable explanation</h2>
686
687 To compute the most probable explanation (MPE) of the evidence (i.e.,
688 the most probable assignment, or a mode of the joint), use
689 <pre>
690 [mpe, ll] = calc_mpe(engine, evidence);
691 </pre>
692 mpe{i} is the most likely value of node i.
693 This calls enter_evidence with the 'maximize' flag set to 1, which
694 causes the engine to do max-product instead of sum-product.
695 The resulting max-marginals are then thresholded.
696 If there is more than one maximum probability assignment, we must take
697 care to break ties in a consistent manner (thresholding the
698 max-marginals may give the wrong result). To force this behavior,
699 type
700 <pre>
701 [mpe, ll] = calc_mpe(engine, evidence, 1);
702 </pre>
Note that computing the MPE is sometimes called abductive reasoning.
704
705 <p>
706 You can also use <tt>calc_mpe_bucket</tt> written by Ron Zohar,
which does a forwards max-product pass, and then a backwards traceback
708 pass, which is how Viterbi is traditionally implemented.
709
710
711
712 <h1><a name="cpd">Conditional Probability Distributions</h1>
713
A Conditional Probability Distribution (CPD)
715 defines P(X(i) | X(Pa(i))), where X(i) is the i'th node, and X(Pa(i))
716 are the parents of node i. There are many ways to represent this
717 distribution, which depend in part on whether X(i) and X(Pa(i)) are
718 discrete, continuous, or a combination.
719 We will discuss various representations below.
720
721
722 <h2><a name="tabular">Tabular nodes</h2>
723
724 If the CPD is represented as a table (i.e., if it is a multinomial
725 distribution), it has a number of parameters that is exponential in
726 the number of parents. See the example <a href="#cpt">above</a>.
727
728
729 <h2><a name="noisyor">Noisy-or nodes</h2>
730
731 A noisy-OR node is like a regular logical OR gate except that
732 sometimes the effects of parents that are on get inhibited.
733 Let the prob. that parent i gets inhibited be q(i).
734 Then a node, C, with 2 parents, A and B, has the following CPD, where
735 we use F and T to represent off and on (1 and 2 in BNT).
736 <pre>
A  B   P(C=off)    P(C=on)
---------------------------
F  F   1.0         0.0
T  F   q(A)        1-q(A)
F  T   q(B)        1-q(B)
T  T   q(A)q(B)    1-q(A)q(B)
743 </pre>
744 Thus we see that the causes get inhibited independently.
745 It is common to associate a "leak" node with a noisy-or CPD, which is
746 like a parent that is always on. This can account for all other unmodelled
747 causes which might turn the node on.
748 <p>
749 The noisy-or distribution is similar to the logistic distribution.
750 To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j)
751 be the prob. that j inhibits i. Then
752 <pre>
753 Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j)
754 </pre>
755 Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then
756 <pre>
757 Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j))
758 </pre>
759 For a sigmoid node, we have
760 <pre>
761 Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j))
762 </pre>
763 where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of
764 the activation function (although both are monotonically increasing).
765 In addition, in the case of a noisy-or, the weights are constrained to be
766 positive, since they derive from probabilities q(i,j).
767 In both cases, the number of parameters is <em>linear</em> in the
768 number of parents, unlike the case of a multinomial distribution,
769 where the number of parameters is exponential in the number of parents.
770 We will see an example of noisy-OR nodes <a href="#qmr">below</a>.
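<p>
As a quick concrete sketch of the two-parent example above (the constructor
arguments follow the same pattern as the QMR code below; the numbers are
arbitrary):
<pre>
A = 1; B = 2; C = 3;
dag = zeros(3); dag(A,C) = 1; dag(B,C) = 1;
bnet = mk_bnet(dag, 2*ones(1,3));              % all nodes binary
bnet.CPD{A} = tabular_CPD(bnet, A, [0.9 0.1]);
bnet.CPD{B} = tabular_CPD(bnet, B, [0.8 0.2]);
q = [0.1; 0.2];   % inhibition probabilities q(A), q(B)
leak = 1;         % leak inhibition prob. of 1 means the leak never fires
bnet.CPD{C} = noisyor_CPD(bnet, C, leak, q);
</pre>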
771
772
773 <h2><a name="deterministic">Other (noisy) deterministic nodes</h2>
774
775 Deterministic CPDs for discrete random variables can be created using
776 the deterministic_CPD class. It is also possible to 'flip' the output
777 of the function with some probability, to simulate noise.
778 The boolean_CPD class is just a special case of a
779 deterministic CPD, where the parents and child are all binary.
780 <p>
781 Both of these classes are just "syntactic sugar" for the tabular_CPD
782 class.
783
784
785
786 <h2><a name="softmax">Softmax nodes</h2>
787
788 If we have a discrete node with a continuous parent,
789 we can define its CPD using a softmax function
790 (also known as the multinomial logit function).
791 This acts like a soft thresholding operator, and is defined as follows:
792 <pre>
                    exp(w(:,i)'*x + b(i))
Pr(Q=i | X=x) = -----------------------------
                 sum_j exp(w(:,j)'*x + b(j))
796
797 </pre>
798 The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the
799 following interpretation: w(:,i)-w(:,j) is the normal vector to the
800 decision boundary between classes i and j,
801 and b(i)-b(j) is its offset (bias). For example, suppose
802 X is a 2-vector, and Q is binary. Then
803 <pre>
804 w = [1 -1;
805 0 0];
806
807 b = [0 0];
808 </pre>
809 means class 1 are points in the 2D plane with positive x coordinate,
810 and class 2 are points in the 2D plane with negative x coordinate.
811 If w has large magnitude, the decision boundary is sharp, otherwise it
812 is soft.
813 In the special case that Q is binary (0/1), the softmax function reduces to the logistic
814 (sigmoid) function.
815 <p>
816 Fitting a softmax function can be done using the iteratively reweighted
817 least squares (IRLS) algorithm.
818 We use the implementation from
819 <a href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>.
820 Note that since
821 the softmax distribution is not in the exponential family, it does not
822 have finite sufficient statistics, and hence we must store all the
823 training data in uncompressed form.
824 If this takes too much space, one should use online (stochastic) gradient
825 descent (not implemented in BNT).
826 <p>
827 If a softmax node also has discrete parents,
828 we use a different set of w/b parameters for each combination of
829 parent values, as in the <a href="#gaussian">conditional linear
830 Gaussian CPD</a>.
831 This feature was implemented by Pierpaolo Brutti.
832 He is currently extending it so that discrete parents can be treated
833 as if they were continuous, by adding indicator variables to the X
834 vector.
835 <p>
836 We will see an example of softmax nodes <a href="#mixexp">below</a>.
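<p>
As a minimal sketch (anticipating the mixture of experts example below), a
binary node Q with a continuous parent X can be given a randomly
initialized softmax CPD as follows:
<pre>
X = 1; Q = 2;
dag = zeros(2); dag(X,Q) = 1;
ns = [2 2];                          % X is a 2-dimensional cts node, Q is binary
bnet = mk_bnet(dag, ns, 'discrete', Q, 'observed', X);
bnet.CPD{X} = root_CPD(bnet, X);     % X is an exogenous observed input (see root nodes below)
bnet.CPD{Q} = softmax_CPD(bnet, Q);  % w and b are initialized randomly
</pre>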
837
838
839 <h2><a name="mlp">Neural network nodes</h2>
840
841 Pierpaolo Brutti has implemented the mlp_CPD class, which uses a multi layer perceptron
842 to implement a mapping from continuous parents to discrete children,
843 similar to the softmax function.
844 (If there are also discrete parents, it creates a mixture of MLPs.)
845 It uses code from <a
846 href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>.
847 This is work in progress.
848
849 <h2><a name="root">Root nodes</h2>
850
851 A root node has no parents and no parameters; it can be used to model
an observed, exogenous input variable, i.e., one which is "outside"
853 the model.
854 This is useful for conditional density models.
855 We will see an example of root nodes <a href="#mixexp">below</a>.
856
857
858 <h2><a name="gaussian">Gaussian nodes</h2>
859
860 We now consider a distribution suitable for the continuous-valued nodes.
861 Suppose the node is called Y, its continuous parents (if any) are
862 called X, and its discrete parents (if any) are called Q.
863 The distribution on Y is defined as follows:
864 <pre>
865 - no parents: Y ~ N(mu, Sigma)
866 - cts parents : Y|X=x ~ N(mu + W x, Sigma)
867 - discrete parents: Y|Q=i ~ N(mu(:,i), Sigma(:,:,i))
868 - cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i))
869 </pre>
870 where N(mu, Sigma) denotes a Normal distribution with mean mu and
871 covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q
872 respectively.
873 If there are no discrete parents, |Q|=1; if there is
874 more than one, then |Q| = a vector of the sizes of each discrete parent.
875 If there are no continuous parents, |X|=0; if there is more than one,
876 then |X| = the sum of their sizes.
877 Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive
878 semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight)
879 matrix.
880 <p>
881 We can create a Gaussian node with random parameters as follows.
882 <pre>
883 bnet.CPD{i} = gaussian_CPD(bnet, i);
884 </pre>
885 We can specify the value of one or more of the parameters as in the
886 following example, in which |Y|=2, and |Q|=1.
887 <pre>
888 bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', [0; 0], 'weights', randn(Y,X), 'cov', eye(Y));
889 </pre>
890 <p>
891 We will see an example of conditional linear Gaussian nodes <a
892 href="#cg_model">below</a>.
893 <p>
894 When fitting a Gaussian using EM, you might encounter some numerical
895 problems. See <a
896 href="http://www.ai.mit.edu/~murphyk/Software/HMM/hmm.html#loglikpos">here</a>
897 for a discussion of this in the context of HMMs.
898 (Note: BNT does not currently support spherical covariances, although it
899 would be easy to add, since KPMstats/clg_Mstep supports this option;
900 you would just need to modify gaussian_CPD/update_ess to accumulate
901 weighted inner products.)
902
903
904 <h2><a name="nongauss">Other continuous distributions</h2>
905
906 Currently BNT does not support any CPDs for continuous nodes other
907 than the Gaussian.
908 However, you can use a mixture of Gaussians to
approximate other continuous distributions. We will see an example
910 of this with the IFA model <a href="#pca">below</a>.
911
912
913 <h2><a name="glm">Generalized linear model nodes</h2>
914
915 In the future, we may incorporate some of the functionality of
916 <a href =
917 "http://www.sci.usq.edu.au/staff/dunn/glmlab/glmlab.html">glmlab</a>
918 into BNT.
919
920
921 <h2><a name="dtree">Classification/regression tree nodes</h2>
922
923 We plan to add classification and regression trees to define CPDs for
924 discrete and continuous nodes, respectively.
925 Trees have many advantages: they are easy to interpret, they can do
926 feature selection, they can
927 handle discrete and continuous inputs, they do not make strong
928 assumptions about the form of the distribution, the number of
929 parameters can grow in a data-dependent way (i.e., they are
930 semi-parametric), they can handle missing data, etc.
931 However, they are not yet implemented.
932 <!--
933 Yimin Zhang is currently (Feb '02) implementing this.
934 -->
935
936
937 <h2><a name="cpd_summary">Summary of CPD types</h2>
938
939 We list all the different types of CPDs supported by BNT.
940 For each CPD, we specify if the child and parents can be discrete (D) or
941 continuous (C) (Binary (B) nodes are a special case).
942 We also specify which methods each class supports.
943 If a method is inherited, the name of the parent class is mentioned.
944 If a parent class calls a child method, this is mentioned.
945 <p>
946 The <tt>CPD_to_CPT</tt> method converts a CPD to a table; this
947 requires that the child and all parents are discrete.
948 The CPT might be exponentially big...
949 <tt>convert_to_table</tt> evaluates a CPD with evidence, and
represents the resulting potential as an array.
951 This requires that the child is discrete, and any continuous parents
952 are observed.
953 <tt>convert_to_pot</tt> evaluates a CPD with evidence, and
954 represents the resulting potential as a dpot, gpot, cgpot or upot, as
955 requested. (d=discrete, g=Gaussian, cg = conditional Gaussian, u =
956 utility).
957
958 <p>
959 When we sample a node, all the parents are observed.
960 When we compute the (log) probability of a node, all the parents and
961 the child are observed.
962 <p>
963 We also specify if the parameters are learnable.
964 For learning with EM, we require
965 the methods <tt>reset_ess</tt>, <tt>update_ess</tt> and
966 <tt>maximize_params</tt>.
967 For learning from fully observed data, we require
968 the method <tt>learn_params</tt>.
969 By default, all classes inherit this from generic_CPD, which simply
970 calls <tt>update_ess</tt> N times, once for each data case, followed
971 by <tt>maximize_params</tt>, i.e., it is like EM, without the E step.
972 Some classes implement a batch formula, which is quicker.
973 <p>
974 Bayesian learning means computing a posterior over the parameters
975 given fully observed data.
976 <p>
977 Pearl means we implement the methods <tt>compute_pi</tt> and
978 <tt>compute_lambda_msg</tt>, used by
979 <tt>pearl_inf_engine</tt>, which runs on directed graphs.
<tt>belprop_inf_engine</tt> only needs <tt>convert_to_pot</tt>.
981 The pearl methods can exploit special properties of the CPDs for
982 computing the messages efficiently, whereas belprop does not.
983 <p>
984 The only method implemented by generic_CPD is <tt>adjustable_CPD</tt>,
985 which is not shown, since it is not very interesting.
986
987
988 <p>
989
990
<table border=1>
<tr>
993 <td align=center>Name
994 <td align=center>Child
995 <td align=center>Parents
996 <td align=center>Comments
997 <td align=center>CPD_to_CPT
998 <td align=center>conv_to_table
999 <td align=center>conv_to_pot
1000 <td align=center>sample
1001 <td align=center>prob
1002 <td align=center>learn
1003 <td align=center>Bayes
1004 <td align=center>Pearl
1005
1006
1007 <tr>
1008 <!-- Name--><td>
1009 <!-- Child--><td>
1010 <!-- Parents--><td>
1011 <!-- Comments--><td>
1012 <!-- CPD_to_CPT--><td>
1013 <!-- conv_to_table--><td>
1014 <!-- conv_to_pot--><td>
1015 <!-- sample--><td>
1016 <!-- prob--><td>
1017 <!-- learn--><td>
1018 <!-- Bayes--><td>
1019 <!-- Pearl--><td>
1020
1021 <tr>
1022 <!-- Name--><td>boolean
1023 <!-- Child--><td>B
1024 <!-- Parents--><td>B
1025 <!-- Comments--><td>Syntactic sugar for tabular
1026 <!-- CPD_to_CPT--><td>-
1027 <!-- conv_to_table--><td>-
1028 <!-- conv_to_pot--><td>-
1029 <!-- sample--><td>-
1030 <!-- prob--><td>-
1031 <!-- learn--><td>-
1032 <!-- Bayes--><td>-
1033 <!-- Pearl--><td>-
1034
1035 <tr>
1036 <!-- Name--><td>deterministic
1037 <!-- Child--><td>D
1038 <!-- Parents--><td>D
1039 <!-- Comments--><td>Syntactic sugar for tabular
1040 <!-- CPD_to_CPT--><td>-
1041 <!-- conv_to_table--><td>-
1042 <!-- conv_to_pot--><td>-
1043 <!-- sample--><td>-
1044 <!-- prob--><td>-
1045 <!-- learn--><td>-
1046 <!-- Bayes--><td>-
1047 <!-- Pearl--><td>-
1048
1049 <tr>
1050 <!-- Name--><td>Discrete
1051 <!-- Child--><td>D
1052 <!-- Parents--><td>C/D
1053 <!-- Comments--><td>Virtual class
1054 <!-- CPD_to_CPT--><td>N
1055 <!-- conv_to_table--><td>Calls CPD_to_CPT
1056 <!-- conv_to_pot--><td>Calls conv_to_table
1057 <!-- sample--><td>Calls conv_to_table
1058 <!-- prob--><td>Calls conv_to_table
1059 <!-- learn--><td>N
1060 <!-- Bayes--><td>N
1061 <!-- Pearl--><td>N
1062
1063 <tr>
1064 <!-- Name--><td>Gaussian
1065 <!-- Child--><td>C
1066 <!-- Parents--><td>C/D
1067 <!-- Comments--><td>-
1068 <!-- CPD_to_CPT--><td>N
1069 <!-- conv_to_table--><td>N
1070 <!-- conv_to_pot--><td>Y
1071 <!-- sample--><td>Y
1072 <!-- prob--><td>Y
1073 <!-- learn--><td>Y
1074 <!-- Bayes--><td>N
1075 <!-- Pearl--><td>N
1076
1077 <tr>
1078 <!-- Name--><td>gmux
1079 <!-- Child--><td>C
1080 <!-- Parents--><td>C/D
1081 <!-- Comments--><td>multiplexer
1082 <!-- CPD_to_CPT--><td>N
1083 <!-- conv_to_table--><td>N
1084 <!-- conv_to_pot--><td>Y
1085 <!-- sample--><td>N
1086 <!-- prob--><td>N
1087 <!-- learn--><td>N
1088 <!-- Bayes--><td>N
1089 <!-- Pearl--><td>Y
1090
1091
1092 <tr>
1093 <!-- Name--><td>MLP
1094 <!-- Child--><td>D
1095 <!-- Parents--><td>C/D
1096 <!-- Comments--><td>multi layer perceptron
1097 <!-- CPD_to_CPT--><td>N
1098 <!-- conv_to_table--><td>Y
1099 <!-- conv_to_pot--><td>Inherits from discrete
1100 <!-- sample--><td>Inherits from discrete
1101 <!-- prob--><td>Inherits from discrete
1102 <!-- learn--><td>Y
1103 <!-- Bayes--><td>N
1104 <!-- Pearl--><td>N
1105
1106
1107 <tr>
1108 <!-- Name--><td>noisy-or
1109 <!-- Child--><td>B
1110 <!-- Parents--><td>B
1111 <!-- Comments--><td>-
1112 <!-- CPD_to_CPT--><td>Y
1113 <!-- conv_to_table--><td>Inherits from discrete
1114 <!-- conv_to_pot--><td>Inherits from discrete
1115 <!-- sample--><td>Inherits from discrete
1116 <!-- prob--><td>Inherits from discrete
1117 <!-- learn--><td>N
1118 <!-- Bayes--><td>N
1119 <!-- Pearl--><td>Y
1120
1121
1122 <tr>
1123 <!-- Name--><td>root
1124 <!-- Child--><td>C/D
1125 <!-- Parents--><td>none
1126 <!-- Comments--><td>no params
1127 <!-- CPD_to_CPT--><td>N
1128 <!-- conv_to_table--><td>N
1129 <!-- conv_to_pot--><td>Y
1130 <!-- sample--><td>Y
1131 <!-- prob--><td>Y
1132 <!-- learn--><td>N
1133 <!-- Bayes--><td>N
1134 <!-- Pearl--><td>N
1135
1136
1137 <tr>
1138 <!-- Name--><td>softmax
1139 <!-- Child--><td>D
1140 <!-- Parents--><td>C/D
1141 <!-- Comments--><td>-
1142 <!-- CPD_to_CPT--><td>N
1143 <!-- conv_to_table--><td>Y
1144 <!-- conv_to_pot--><td>Inherits from discrete
1145 <!-- sample--><td>Inherits from discrete
1146 <!-- prob--><td>Inherits from discrete
1147 <!-- learn--><td>Y
1148 <!-- Bayes--><td>N
1149 <!-- Pearl--><td>N
1150
1151
1152 <tr>
1153 <!-- Name--><td>generic
1154 <!-- Child--><td>C/D
1155 <!-- Parents--><td>C/D
1156 <!-- Comments--><td>Virtual class
1157 <!-- CPD_to_CPT--><td>N
1158 <!-- conv_to_table--><td>N
1159 <!-- conv_to_pot--><td>N
1160 <!-- sample--><td>N
1161 <!-- prob--><td>N
1162 <!-- learn--><td>N
1163 <!-- Bayes--><td>N
1164 <!-- Pearl--><td>N
1165
1166
1167 <tr>
1168 <!-- Name--><td>Tabular
1169 <!-- Child--><td>D
1170 <!-- Parents--><td>D
1171 <!-- Comments--><td>-
1172 <!-- CPD_to_CPT--><td>Y
1173 <!-- conv_to_table--><td>Inherits from discrete
1174 <!-- conv_to_pot--><td>Inherits from discrete
1175 <!-- sample--><td>Inherits from discrete
1176 <!-- prob--><td>Inherits from discrete
1177 <!-- learn--><td>Y
1178 <!-- Bayes--><td>Y
1179 <!-- Pearl--><td>Y
1180
1181 </table>
1182
1183
1184
1185 <h1><a name="examples">Example models</h1>
1186
1187
1188 <h2>Gaussian mixture models</h2>
1189
1190 Richard W. DeVaul has made a detailed tutorial on how to fit mixtures
1191 of Gaussians using BNT. Available
1192 <a href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">here</a>.
1193
1194
1195 <h2><a name="pca">PCA, ICA, and all that </h2>
1196
1197 In Figure (a) below, we show how Factor Analysis can be thought of as a
1198 graphical model. Here, X has an N(0,I) prior, and
1199 Y|X=x ~ N(mu + Wx, Psi),
1200 where Psi is diagonal and W is called the "factor loading matrix".
1201 Since the noise on both X and Y is diagonal, the components of these
1202 vectors are uncorrelated, and hence can be represented as individual
1203 scalar nodes, as we show in (b).
1204 (This is useful if parts of the observations on the Y vector are occasionally missing.)
1205 We usually take k=|X| << |Y|=D, so the model tries to explain
1206 many observations using a low-dimensional subspace.
1207
1208
1209 <center>
1210 <table>
1211 <tr>
1212 <td><img src="Figures/fa.gif">
1213 <td><img src="Figures/fa_scalar.gif">
1214 <td><img src="Figures/mfa.gif">
1215 <td><img src="Figures/ifa.gif">
1216 <tr>
1217 <td align=center> (a)
1218 <td align=center> (b)
1219 <td align=center> (c)
1220 <td align=center> (d)
1221 </table>
1222 </center>
1223
1224 <p>
1225 We can create this model in BNT as follows.
1226 <pre>
1227 ns = [k D];
1228 dag = zeros(2,2);
1229 dag(1,2) = 1;
1230 bnet = mk_bnet(dag, ns, 'discrete', []);
1231 bnet.CPD{1} = gaussian_CPD(bnet, 1, 'mean', zeros(k,1), 'cov', eye(k), ...
1232 'cov_type', 'diag', 'clamp_mean', 1, 'clamp_cov', 1);
1233 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(D,1), 'cov', diag(Psi0), 'weights', W0, ...
1234 'cov_type', 'diag', 'clamp_mean', 1);
1235 </pre>
1236
1237 The root node is clamped to the N(0,I) distribution, so that we will
1238 not update these parameters during learning.
1239 The mean of the leaf node is clamped to 0,
1240 since we assume the data has been centered (had its mean subtracted
1241 off); this is just for simplicity.
1242 Finally, the covariance of the leaf node is constrained to be
1243 diagonal. W0 and Psi0 are the initial parameter guesses.
1244
1245 <p>
1246 We can fit this model (i.e., estimate its parameters in a maximum
1247 likelihood (ML) sense) using EM, as we
1248 explain <a href="#em">below</a>.
1249 Not surprisingly, the ML estimates for mu and Psi turn out to be
1250 identical to the
1251 sample mean and variance, which can be computed directly as
1252 <pre>
1253 mu_ML = mean(data);
1254 Psi_ML = diag(cov(data));
1255 </pre>
1256 Note that W can only be identified up to a rotation matrix, because of
1257 the spherical symmetry of the source.
1258
1259 <p>
1260 If we restrict Psi to be spherical, i.e., Psi = sigma*I,
1261 there is a closed-form solution for W as well,
1262 i.e., we do not need to use EM.
1263 In particular, W contains the first |X| eigenvectors of the sample covariance
1264 matrix, with scalings determined by the eigenvalues and sigma.
1265 Classical PCA can be obtained by taking the sigma->0 limit.
1266 For details, see
1267
1268 <ul>
1269 <li> <a href="ftp://hope.caltech.edu/pub/roweis/Empca/empca.ps">
1270 "EM algorithms for PCA and SPCA"</a>, Sam Roweis, NIPS 97.
1271 (<a href="ftp://hope.caltech.edu/pub/roweis/Code/empca.tar.gz">
1272 Matlab software</a>)
1273
1274 <p>
1275 <li>
1276 <a
1277 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003>
1278 "Mixtures of probabilistic principal component analyzers"</a>,
1279 Tipping and Bishop, Neural Computation 11(2):443--482, 1999.
1280 </ul>
1281
1282 <p>
1283 By adding a hidden discrete variable, we can create mixtures of FA
1284 models, as shown in (c).
1285 Now we can explain the data using a set of subspaces.
1286 We can create this model in BNT as follows.
1287 <pre>
1288 ns = [M k D];
1289 dag = zeros(3);
1290 dag(1,3) = 1;
1291 dag(2,3) = 1;
1292 bnet = mk_bnet(dag, ns, 'discrete', 1);
1293 bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0);
1294 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(k, 1), 'cov', eye(k), 'cov_type', 'diag', ...
1295 'clamp_mean', 1, 'clamp_cov', 1);
1296 bnet.CPD{3} = gaussian_CPD(bnet, 3, 'mean', Mu0', 'cov', repmat(diag(Psi0), [1 1 M]), ...
1297 'weights', W0, 'cov_type', 'diag', 'tied_cov', 1);
1298 </pre>
1299 Notice how the covariance matrix for Y is the same for all values of
1300 Q; that is, the noise level in each sub-space is assumed the same.
1301 However, we allow the offset, mu, to vary.
1302 For details, see
1303 <ul>
1304
1305 <LI>
1306 <a HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/tr-96-1.ps.gz"> The EM
1307 Algorithm for Mixtures of Factor Analyzers </A>,
1308 Ghahramani, Z. and Hinton, G.E. (1996),
1309 University of Toronto
1310 Technical Report CRG-TR-96-1.
1311 (<A HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/mfa.tar.gz">Matlab software</A>)
1312
1313 <p>
1314 <li>
1315 <a
1316 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003>
1317 "Mixtures of probabilistic principal component analyzers"</a>,
1318 Tipping and Bishop, Neural Computation 11(2):443--482, 1999.
1319 </ul>
1320
1321 <p>
1322 I have included Zoubin's specialized MFA code (with his permission)
1323 with the toolbox, so you can check that BNT gives the same results:
1324 see 'BNT/examples/static/mfa1.m'.
1325
1326 <p>
1327 Independent Factor Analysis (IFA) generalizes FA by allowing a
1328 non-Gaussian prior on each component of X.
1329 (Note that we can approximate a non-Gaussian prior using a mixture of
1330 Gaussians.)
1331 This means that the likelihood function is no longer rotationally
1332 invariant, so we can uniquely identify W and the hidden
1333 sources X.
1334 IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y).
1335 We recover classical Independent Components Analysis (ICA)
1336 in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the
1337 weight matrix W is square and invertible.
1338 For details, see
1339 <ul>
1340 <li>
1341 <a href="http://www.gatsby.ucl.ac.uk/~hagai/ifa.ps">Independent Factor
1342 Analysis</a>, H. Attias, Neural Computation 11: 803--851, 1998.
1343 </ul>
1344
1345
1346
1347 <h2><a name="mixexp">Mixtures of experts</h2>
1348
1349 As an example of the use of the softmax function,
1350 we introduce the Mixture of Experts model.
1351 <!--
1352 We also show
1353 the Hierarchical Mixture of Experts model, where the hierarchy has two
1354 levels.
1355 (This is essentially a probabilistic decision tree of height two.)
1356 -->
1357 As before,
1358 circles denote continuous-valued nodes,
1359 squares denote discrete nodes, clear
1360 means hidden, and shaded means observed.
1361 <p>
1362 <center>
1363 <table>
1364 <tr>
1365 <td><img src="Figures/mixexp.gif">
1366 <!--
1367 <td><img src="Figures/hme.gif">
1368 -->
1369 </table>
1370 </center>
1371 <p>
1372 X is the observed
1373 input, Y is the output, and
1374 the Q nodes are hidden "gating" nodes, which select the appropriate
1375 set of parameters for Y. During training, Y is assumed observed,
1376 but for testing, the goal is to predict Y given X.
1377 Note that this is a <em>conditional</em> density model, so we don't
1378 associate any parameters with X.
1379 Hence X's CPD will be a root CPD, which is a way of modelling
1380 exogenous nodes.
1381 If the output is a continuous-valued quantity,
1382 we assume the "experts" are linear-regression units,
1383 and set Y's CPD to linear-Gaussian.
1384 If the output is discrete, we set Y's CPD to a softmax function.
1385 The Q CPDs will always be softmax functions.
1386
1387 <p>
1388 As a concrete example, consider the mixture of experts model where X and Y are
1389 scalars, and Q is binary.
1390 This is just piecewise linear regression, where
1391 we have two line segments, i.e.,
1392 <P>
1393 <IMG ALIGN=BOTTOM SRC="Eqns/lin_reg_eqn.gif">
1394 <P>
1395 We can create this model with random parameters as follows.
1396 (This code is bundled in BNT/examples/static/mixexp2.m.)
1397 <PRE>
1398 X = 1;
1399 Q = 2;
1400 Y = 3;
1401 dag = zeros(3,3);
dag(X,[Q Y]) = 1;
1403 dag(Q,Y) = 1;
1404 ns = [1 2 1]; % make X and Y scalars, and have 2 experts
1405 onodes = [1 3];
1406 bnet = mk_bnet(dag, ns, 'discrete', 2, 'observed', onodes);
1407
1408 rand('state', 0);
1409 randn('state', 0);
1410 bnet.CPD{1} = root_CPD(bnet, 1);
1411 bnet.CPD{2} = softmax_CPD(bnet, 2);
1412 bnet.CPD{3} = gaussian_CPD(bnet, 3);
1413 </PRE>
1414 Now let us fit this model using <a href="#em">EM</a>.
1415 First we <a href="#load_data">load the data</a> (1000 training cases) and plot them.
1416 <P>
1417 <PRE>
1418 data = load('/examples/static/Misc/mixexp_data.txt', '-ascii');
1419 plot(data(:,1), data(:,2), '.');
1420 </PRE>
1421 <p>
1422 <center>
1423 <IMG SRC="Figures/mixexp_data.gif">
1424 </center>
1425 <p>
1426 This is what the model looks like before training.
1427 (Thanks to Thomas Hofman for writing this plotting routine.)
1428 <p>
1429 <center>
1430 <IMG SRC="Figures/mixexp_before.gif">
1431 </center>
1432 <p>
1433 Now let's train the model, and plot the final performance.
1434 (We will discuss how to train models in more detail <a href="#param_learning">below</a>.)
1435 <P>
1436 <PRE>
1437 ncases = size(data, 1); % each row of data is a training case
1438 cases = cell(3, ncases);
1439 cases([1 3], :) = num2cell(data'); % each column of cases is a training case
1440 engine = jtree_inf_engine(bnet);
1441 max_iter = 20;
1442 [bnet2, LLtrace] = learn_params_em(engine, cases, max_iter);
1443 </PRE>
(We specified which nodes would be observed when we created the bnet
above, so BNT knows that the hidden nodes are all discrete.
1446 For complex models, this can lead to a significant speedup.)
1447 Below we show what the model looks like after 16 iterations of EM
1448 (with 100 IRLS iterations per M step), when it converged
1449 using the default convergence tolerance (that the
1450 fractional change in the log-likelihood be less than 1e-3).
1451 Before learning, the log-likelihood was
1452 -322.927442; afterwards, it was -13.728778.
1453 <p>
1454 <center>
1455 <IMG SRC="Figures/mixexp_after.gif">
1456 </center>
1457 (See BNT/examples/static/mixexp2.m for details of the code.)
1458
1459
1460
1461 <h2><a name="hme">Hierarchical mixtures of experts</h2>
1462
1463 A hierarchical mixture of experts (HME) extends the mixture of experts
1464 model by having more than one hidden node. A two-level example is shown below, along
1465 with its more traditional representation as a neural network.
1466 This is like a (balanced) probabilistic decision tree of height 2.
1467 <p>
1468 <center>
1469 <IMG SRC="Figures/HMEforMatlab.jpg">
1470 </center>
1471 <p>
1472 <a href="mailto:pbrutti@stat.cmu.edu">Pierpaolo Brutti</a>
1473 has written an extensive set of routines for HMEs,
1474 which are bundled with BNT: see the examples/static/HME directory.
1475 These routines allow you to choose the number of hidden (gating)
1476 layers, and the form of the experts (softmax or MLP).
1477 See the file hmemenu, which provides a demo.
1478 For example, the figure below shows the decision boundaries learned
1479 for a ternary classification problem, using a 2 level HME with softmax
1480 gates and softmax experts; the training set is on the left, the
1481 testing set on the right.
1482 <p>
1483 <center>
1484 <!--<IMG SRC="Figures/hme_dec_boundary.gif">-->
1485 <IMG SRC="Figures/hme_dec_boundary.png">
1486 </center>
1487 <p>
1488
1489
1490 <p>
1491 For more details, see the following:
1492 <ul>
1493
1494 <li> <a href="http://www.cs.berkeley.edu/~jordan/papers/hierarchies.ps.Z">
1495 Hierarchical mixtures of experts and the EM algorithm</a>
1496 M. I. Jordan and R. A. Jacobs. Neural Computation, 6, 181-214, 1994.
1497
1498 <li> <a href =
1499 "http://www.cs.berkeley.edu/~dmartin/software">David Martin's
1500 matlab code for HME</a>
1501
1502 <li> <a
1503 href="http://www.cs.berkeley.edu/~jordan/papers/uai.ps.Z">Why the
1504 logistic function? A tutorial discussion on
1505 probabilities and neural networks.</a> M. I. Jordan. MIT Computational
1506 Cognitive Science Report 9503, August 1995.
1507
1508 <li> "Generalized Linear Models", McCullagh and Nelder, Chapman and
Hall, 1983.
1510
1511 <li>
1512 "Improved learning algorithms for mixtures of experts in multiclass
1513 classification".
1514 K. Chen, L. Xu, H. Chi.
1515 Neural Networks (1999) 12: 1229-1252.
1516
1517 <li> <a href="http://www.oigeeza.com/steve/">
1518 Classification Using Hierarchical Mixtures of Experts</a>
1519 S.R. Waterhouse and A.J. Robinson.
1520 In Proc. IEEE Workshop on Neural Network for Signal Processing IV (1994), pp. 177-186
1521
1522 <li> <a href="http://www.idiap.ch/~perry/">
1523 Localized mixtures of experts</a>,
1524 P. Moerland, 1998.
1525
1526 <li> "Nonlinear gated experts for time series",
1527 A.S. Weigend and M. Mangeas, 1995.
1528
1529 </ul>
1530
1531
1532 <h2><a name="qmr">QMR</h2>
1533
1534 Bayes nets originally arose out of an attempt to add probabilities to
1535 expert systems, and this is still the most common use for BNs.
1536 A famous example is
1537 QMR-DT, a decision-theoretic reformulation of the Quick Medical
1538 Reference (QMR) model.
1539 <p>
1540 <center>
1541 <IMG ALIGN=BOTTOM SRC="Figures/qmr.gif">
1542 </center>
1543 Here, the top layer represents hidden disease nodes, and the bottom
1544 layer represents observed symptom nodes.
1545 The goal is to infer the posterior probability of each disease given
1546 all the symptoms (which can be present, absent or unknown).
1547 Each node in the top layer has a Bernoulli prior (with a low prior
1548 probability that the disease is present).
1549 Since each node in the bottom layer has a high fan-in, we use a
1550 noisy-OR parameterization; each disease has an independent chance of
1551 causing each symptom.
The real QMR-DT model is copyrighted, but
1553 we can create a random QMR-like model as follows.
1554 <pre>
1555 function bnet = mk_qmr_bnet(G, inhibit, leak, prior)
1556 % MK_QMR_BNET Make a QMR model
1557 % bnet = mk_qmr_bnet(G, inhibit, leak, prior)
1558 %
1559 % G(i,j) = 1 iff there is an arc from disease i to finding j
1560 % inhibit(i,j) = inhibition probability on i->j arc
1561 % leak(j) = inhibition prob. on leak->j arc
1562 % prior(i) = prob. disease i is on
1563
1564 [Ndiseases Nfindings] = size(inhibit);
1565 N = Ndiseases + Nfindings;
1566 finding_node = Ndiseases+1:N;
1567 ns = 2*ones(1,N);
1568 dag = zeros(N,N);
1569 dag(1:Ndiseases, finding_node) = G;
1570 bnet = mk_bnet(dag, ns, 'observed', finding_node);
1571
1572 for d=1:Ndiseases
1573 CPT = [1-prior(d) prior(d)];
1574 bnet.CPD{d} = tabular_CPD(bnet, d, CPT');
1575 end
1576
1577 for i=1:Nfindings
1578 fnode = finding_node(i);
1579 ps = parents(G, i);
1580 bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(ps, i));
1581 end
1582 </pre>
1583 In the file BNT/examples/static/qmr1, we create a random bipartite
1584 graph G, with 5 diseases and 10 findings, and random parameters.
1585 (In general, to create a random dag, use 'mk_random_dag'.)
1586 We can visualize the resulting graph structure using
1587 the methods discussed <a href="#graphdraw">below</a>, with the
1588 following results:
1589 <p>
1590 <img src="Figures/qmr.rnd.jpg">
1591
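<p>
For reference, here is a minimal sketch of how such a random model might be
built by hand (the actual qmr1 script may differ in its details; the
parameter values below are arbitrary):
<pre>
Ndiseases = 5; Nfindings = 10;
G = rand(Ndiseases, Nfindings) > 0.5;       % random bipartite graph
inhibit = 0.5*rand(Ndiseases, Nfindings);   % inhibition probs on disease->finding arcs
leak = 0.01*ones(1, Nfindings);             % leak probs
prior = 0.01*ones(1, Ndiseases);            % prior prob. each disease is present
bnet = mk_qmr_bnet(G, inhibit, leak, prior);
</pre>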
1592 <p>
1593 Now let us put some random evidence on all the leaves except the very
1594 first and very last, and compute the disease posteriors.
1595 <pre>
diseases = 1:Ndiseases; findings = Ndiseases+1:N;  % node index lists (disease nodes come first)
1596 pos = 2:floor(Nfindings/2);
1597 neg = (pos(end)+1):(Nfindings-1);
1598 onodes = myunion(pos, neg);
1599 evidence = cell(1, N);
1600 evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos)));
1601 evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg)));
1602
1603 engine = jtree_inf_engine(bnet);
1604 [engine, ll] = enter_evidence(engine, evidence);
1605 post = zeros(1, Ndiseases);
1606 for i=diseases(:)'
1607 m = marginal_nodes(engine, i);
1608 post(i) = m.T(2);
1609 end
1610 </pre>
1611 Junction tree can be quite slow on large QMR models.
1612 Fortunately, it is possible to exploit properties of the noisy-OR
1613 function to speed up exact inference using an algorithm called
1614 <a href="#quickscore">quickscore</a>, discussed below.
1615
1616
1617
1618
1619
1620 <h2><a name="cg_model">Conditional Gaussian models</h2>
1621
1622 A conditional Gaussian model is one in which, conditioned on all the discrete
1623 nodes, the distribution over the remaining (continuous) nodes is
1624 multivariate Gaussian. This means we can have arcs from discrete (D)
1625 to continuous (C) nodes, but not vice versa.
1626 (We <em>are</em> allowed C->D arcs if the continuous nodes are observed,
1627 as in the <a href="#mixexp">mixture of experts</a> model,
1628 since this distribution can be represented with a discrete potential.)
1629 <p>
1630 We now give an example of a CG model, from
1631 the paper "Propagation of Probabilities, Means and
1632 Variances in Mixed Graphical Association Models", Steffen Lauritzen,
1633 JASA 87(420):1098--1108, 1992 (reprinted in the book "Probabilistic Networks and Expert
1634 Systems", R. G. Cowell, A. P. Dawid, S. L. Lauritzen and
1635 D. J. Spiegelhalter, Springer, 1999.)
1636
1637 <h3>Specifying the graph</h3>
1638
1639 Consider the model of waste emissions from an incinerator plant shown below.
1640 We follow the standard convention that shaded nodes are observed,
1641 clear nodes are hidden.
1642 We also use the non-standard convention that
1643 square nodes are discrete (tabular) and round nodes are
1644 Gaussian.
1645
1646 <p>
1647 <center>
1648 <IMG SRC="Figures/cg1.gif">
1649 </center>
1650 <p>
1651
1652 We can create this model as follows.
1653 <pre>
1654 F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9;
1655 n = 9;
1656
1657 dag = zeros(n);
1658 dag(F,E)=1;
1659 dag(W,[E Min D]) = 1;
1660 dag(E,D)=1;
1661 dag(B,[C D])=1;
1662 dag(D,[L Mout])=1;
1663 dag(Min,Mout)=1;
1664
1665 % node sizes - all cts nodes are scalar, all discrete nodes are binary
1666 ns = ones(1, n);
1667 dnodes = [F W B];
1668 cnodes = mysetdiff(1:n, dnodes);
1669 ns(dnodes) = 2;
1670
1671 bnet = mk_bnet(dag, ns, 'discrete', dnodes);
1672 </pre>
1673 'dnodes' is a list of the discrete nodes; 'cnodes' is the continuous
1674 nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'.
1675 <p>
1676
1677
1678 <h3>Specifying the parameters</h3>
1679
1680 The parameters of the discrete nodes can be specified as follows.
1681 <pre>
1682 bnet.CPD{B} = tabular_CPD(bnet, B, 'CPT', [0.85 0.15]); % 1=stable, 2=unstable
1683 bnet.CPD{F} = tabular_CPD(bnet, F, 'CPT', [0.95 0.05]); % 1=intact, 2=defect
1684 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [2/7 5/7]); % 1=industrial, 2=household
1685 </pre>
1686
1687 <p>
1688 The parameters of the continuous nodes can be specified as follows.
1689 <pre>
1690 bnet.CPD{E} = gaussian_CPD(bnet, E, 'mean', [-3.9 -0.4 -3.2 -0.5], ...
1691 'cov', [0.00002 0.0001 0.00002 0.0001]);
1692 bnet.CPD{D} = gaussian_CPD(bnet, D, 'mean', [6.5 6.0 7.5 7.0], ...
1693 'cov', [0.03 0.04 0.1 0.1], 'weights', [1 1 1 1]);
1694 bnet.CPD{C} = gaussian_CPD(bnet, C, 'mean', [-2 -1], 'cov', [0.1 0.3]);
1695 bnet.CPD{L} = gaussian_CPD(bnet, L, 'mean', 3, 'cov', 0.25, 'weights', -0.5);
1696 bnet.CPD{Min} = gaussian_CPD(bnet, Min, 'mean', [0.5 -0.5], 'cov', [0.01 0.005]);
1697 bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 'mean', 0, 'cov', 0.002, 'weights', [1 1]);
1698 </pre>
1699
1700
1701 <h3><a name="cg_infer">Inference</h3>
1702
1703 <!--Let us perform inference in the <a href="#cg_model">waste incinerator example</a>.-->
1704 First we compute the unconditional marginals.
1705 <pre>
1706 engine = jtree_inf_engine(bnet);
1707 evidence = cell(1,n);
1708 [engine, ll] = enter_evidence(engine, evidence);
1709 marg = marginal_nodes(engine, E);
1710 </pre>
1711 <!--(Of course, we could use <tt>cond_gauss_inf_engine</tt> instead of jtree.)-->
1712 'marg' is a structure that contains the fields 'mu' and 'Sigma', which
1713 contain the mean and (co)variance of the marginal on E.
1714 In this case, they are both scalars.
1715 Let us check they match the published figures (to 2 decimal places).
1716 <!--(We can't expect
1717 more precision than this in general because I have implemented the algorithm of
1718 Lauritzen (1992), which can be numerically unstable.)-->
1719 <pre>
1720 tol = 1e-2;
1721 assert(approxeq(marg.mu, -3.25, tol));
1722 assert(approxeq(sqrt(marg.Sigma), 0.709, tol));
1723 </pre>
1724 We can compute the other posteriors similarly.
1725 Now let us add some evidence.
1726 <pre>
1727 evidence = cell(1,n);
1728 evidence{W} = 1; % industrial
1729 evidence{L} = 1.1;
1730 evidence{C} = -0.9;
1731 [engine, ll] = enter_evidence(engine, evidence);
1732 </pre>
1733 Now we find
1734 <pre>
1735 marg = marginal_nodes(engine, E);
1736 assert(approxeq(marg.mu, -3.8983, tol));
1737 assert(approxeq(sqrt(marg.Sigma), 0.0763, tol));
1738 </pre>
1739
1740
1741 We can also compute the joint probability on a set of nodes.
1742 For example, P(D, Mout | evidence) is a 2D Gaussian:
1743 <pre>
1744 marg = marginal_nodes(engine, [D Mout])
1745 marg =
1746 domain: [6 8]
1747 mu: [2x1 double]
1748 Sigma: [2x2 double]
1749 T: 1.0000
1750 </pre>
1751 The mean is
1752 <pre>
1753 marg.mu
1754 ans =
1755 3.6077
1756 4.1077
1757 </pre>
1758 and the covariance matrix is
1759 <pre>
1760 marg.Sigma
1761 ans =
1762 0.1062 0.1062
1763 0.1062 0.1182
1764 </pre>
1765 It is easy to visualize this posterior using standard Matlab plotting
1766 functions, e.g.,
1767 <pre>
1768 gaussplot2d(marg.mu, marg.Sigma);
1769 </pre>
1770 produces the following picture.
1771
1772 <p>
1773 <center>
1774 <IMG SRC="Figures/gaussplot.png">
1775 </center>
1776 <p>
1777
1778
1779 The T field indicates that the mixing weight of this Gaussian
1780 component is 1.0.
1781 If the joint contains discrete and continuous variables, the result
1782 will be a mixture of Gaussians, e.g.,
1783 <pre>
1784 marg = marginal_nodes(engine, [F E])
1785 domain: [1 3]
1786 mu: [-3.9000 -0.4003]
1787 Sigma: [1x1x2 double]
1788 T: [0.9995 4.7373e-04]
1789 </pre>
1790 The interpretation is
1791 Sigma(i,j,k) = Cov[ E(i) E(j) | F=k ].
1792 In this case, E is a scalar, so i=j=1; k specifies the mixture component.
1793 <p>
1794 We saw in the sprinkler network that BNT sets the effective size of
1795 observed discrete nodes to 1, since they only have one legal value.
1796 For continuous nodes, BNT sets their length to 0,
1797 since they have been reduced to a point.
1798 For example,
1799 <pre>
1800 marg = marginal_nodes(engine, [B C])
1801 domain: [4 5]
1802 mu: []
1803 Sigma: []
1804 T: [0.0123 0.9877]
1805 </pre>
1806 It is simple to post-process the output of marginal_nodes.
1807 For example, the file BNT/examples/static/cg1 sets the mu term of
1808 observed nodes to their observed value, and the Sigma term to 0 (since
1809 observed nodes have no variance).
1810
1811 <p>
1812 Note that the implemented version of the junction tree is numerically
1813 unstable when using CG potentials
1814 (which is why, in the example above, we only required our answers to agree with
1815 the published ones to 2 decimal places).
1816 For this reason, you might want to use <tt>stab_cond_gauss_inf_engine</tt>,
1817 implemented by Shan Huang. This is described in
1818
1819 <ul>
1820 <li> "Stable Local Computation with Conditional Gaussian Distributions",
1821 S. Lauritzen and F. Jensen, Tech Report R-99-2014,
1822 Dept. Math. Sciences, Aalborg Univ., 1999.
1823 </ul>
1824
1825 However, even the numerically stable version
1826 can be computationally intractable if there are many hidden discrete
1827 nodes, because the number of mixture components grows exponentially, e.g., in a
1828 <a href="usage_dbn.html#lds">switching linear dynamical system</a>.
1829 In general, one must resort to approximate inference techniques: see
1830 the discussion on <a href="#engines">inference engines</a> below.
1831
1832
1833 <h2><a name="hybrid">Other hybrid models</h2>
1834
1835 When we have C->D arcs, where C is hidden, we need to use
1836 approximate inference.
1837 One approach (not implemented in BNT) is described in
1838 <ul>
1839 <li> <a
1840 href="http://www.cs.berkeley.edu/~murphyk/Papers/hybrid_uai99.ps.gz">A
1841 Variational Approximation for Bayesian Networks with
1842 Discrete and Continuous Latent Variables</a>,
1843 K. Murphy, UAI 99.
1844 </ul>
1845 Of course, one can always use <a href="#sampling">sampling</a> methods
1846 for approximate inference in such models.
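<p>
For example, the likelihood weighting engine (see the section on
<a href="#sampling">sampling</a> below) places no restrictions on the node
types. A minimal sketch, assuming 'bnet' is such a hybrid model and
'evidence' is a cell array of observed values:
<pre>
engine = likelihood_weighting_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);   % approximate marginal on node i
</pre>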
1847
1848
1849
1850 <h1><a name="param_learning">Parameter Learning</h1>
1851
1852 The parameter estimation routines in BNT can be classified into 4
1853 types, depending on whether the goal is to compute
1854 a full (Bayesian) posterior over the parameters or just a point
1855 estimate (e.g., Maximum Likelihood or Maximum A Posteriori),
1856 and whether all the variables are fully observed or there is missing
1857 data/ hidden variables (partial observability).
1858 <p>
1859
1860 <TABLE BORDER>
1861 <tr>
1862 <TH></TH>
1863 <th>Full obs</th>
1864 <th>Partial obs</th>
1865 </tr>
1866 <tr>
1867 <th>Point</th>
1868 <td><tt>learn_params</tt></td>
1869 <td><tt>learn_params_em</tt></td>
1870 </tr>
1871 <tr>
1872 <th>Bayes</th>
1873 <td><tt>bayes_update_params</tt></td>
1874 <td>not yet supported</td>
1875 </tr>
1876 </table>
1877
1878
1879 <h2><a name="load_data">Loading data from a file</h2>
1880
1881 To load numeric data from an ASCII text file called 'dat.txt', where each row is a
1882 case and columns are separated by white-space, such as
1883 <pre>
1884 011979 1626.5 0.0
1885 021979 1367.0 0.0
1886 ...
1887 </pre>
1888 you can use
1889 <pre>
1890 data = load('dat.txt');
1891 </pre>
1892 or
1893 <pre>
1894 load dat.txt -ascii
1895 </pre>
1896 In the latter case, the data is stored in a variable called 'dat' (the
1897 filename minus the extension).
1898 Alternatively, suppose the data is stored in a .csv file (has commas
1899 separating the columns, and contains a header line), such as
1900 <pre>
1901 header info goes here
1902 ORD,011979,1626.5,0.0
1903 DSM,021979,1367.0,0.0
1904 ...
1905 </pre>
1906 You can load this using
1907 <pre>
1908 [a,b,c,d] = textread('dat.csv', '%s %d %f %f', 'delimiter', ',', 'headerlines', 1);
1909 </pre>
1910 If your file is not in either of these formats, you can either use Perl to convert
1911 it to one of these formats, or use Matlab's fscanf command.
1912 Type
1913 <tt>
1914 help iofun
1915 </tt>
1916 for more information on Matlab's file functions.
1917 <!--
1918 <p>
1919 To load data directly from Excel,
1920 you should buy the
1921 <a href="http://www.mathworks.com/products/excellink/">Excel Link</a>.
1922 To load data directly from a relational database,
1923 you should buy the
1924 <a href="http://www.mathworks.com/products/database">Database
1925 toolbox</a>.
1926 -->
1927 <p>
1928 BNT learning routines require data to be stored in a cell array.
1929 data{i,m} is the value of node i in case (example) m, i.e., each
1930 <em>column</em> is a case.
1931 If node i is not observed in case m (missing value), set
1932 data{i,m} = [].
1933 (Not all the learning routines can cope with such missing values, however.)
1934 In the special case that all the nodes are observed and are
1935 scalar-valued (as opposed to vector-valued), the data can be
1936 stored in a matrix (as opposed to a cell-array).
1937 <p>
1938 Suppose, as in the <a href="#mixexp">mixture of experts example</a>,
1939 that we have 3 nodes in the graph: X(1) is the observed input, X(3) is
1940 the observed output, and X(2) is a hidden (gating) node. We can
1941 create the dataset as follows.
1942 <pre>
1943 data = load('dat.txt');
1944 ncases = size(data, 1);
1945 cases = cell(3, ncases);
1946 cases([1 3], :) = num2cell(data');
1947 </pre>
1948 Notice how we transposed the data, to convert rows into columns.
1949 Also, cases{2,m} = [] for all m, since X(2) is always hidden.
1950
1951
1952 <h2><a name="mle_complete">Maximum likelihood parameter estimation from complete data</h2>
1953
1954 As an example, let's generate some data from the sprinkler network, randomize the parameters,
1955 and then try to recover the original model.
1956 First we create some training data using forwards sampling.
1957 <pre>
1958 samples = cell(N, nsamples);
1959 for i=1:nsamples
1960 samples(:,i) = sample_bnet(bnet);
1961 end
1962 </pre>
1963 samples{j,i} contains the value of the j'th node in case i.
1964 sample_bnet returns a cell array because, in general, each node might
1965 be a vector of different length.
1966 In this case, all nodes are discrete (and hence scalars), so we
1967 could have used a regular array instead (which can be quicker):
1968 <pre>
1969 data = cell2num(samples);
1970 </pre>
1971 So now data(j,i) = samples{j,i}.
1972 <p>
1973 Now we create a network with random parameters.
1974 (The initial values of bnet2 don't matter in this case, since we can find the
1975 globally optimal MLE independent of where we start.)
1976 <pre>
1977 % Make a tabula rasa
1978 bnet2 = mk_bnet(dag, node_sizes);
1979 seed = 0;
1980 rand('state', seed);
1981 bnet2.CPD{C} = tabular_CPD(bnet2, C);
1982 bnet2.CPD{R} = tabular_CPD(bnet2, R);
1983 bnet2.CPD{S} = tabular_CPD(bnet2, S);
1984 bnet2.CPD{W} = tabular_CPD(bnet2, W);
1985 </pre>
1986 Finally, we find the maximum likelihood estimates of the parameters.
1987 <pre>
1988 bnet3 = learn_params(bnet2, samples);
1989 </pre>
1990 To view the learned parameters, we use a little Matlab hackery.
1991 <pre>
1992 CPT3 = cell(1,N);
1993 for i=1:N
1994 s=struct(bnet3.CPD{i}); % violate object privacy
1995 CPT3{i}=s.CPT;
1996 end
1997 </pre>
1998 Here are the parameters learned for node 4.
1999 <pre>
2000 dispcpt(CPT3{4})
2001 1 1 : 1.0000 0.0000
2002 2 1 : 0.2000 0.8000
2003 1 2 : 0.2273 0.7727
2004 2 2 : 0.0000 1.0000
2005 </pre>
2006 So we see that the learned parameters are fairly close to the "true"
2007 ones, which we display below.
2008 <pre>
2009 dispcpt(CPT{4})
2010 1 1 : 1.0000 0.0000
2011 2 1 : 0.1000 0.9000
2012 1 2 : 0.1000 0.9000
2013 2 2 : 0.0100 0.9900
2014 </pre>
2015 We can get better results by using a larger training set, or using
2016 informative priors (see <a href="#prior">below</a>).
2017
2018
2019
2020 <h2><a name="prior">Parameter priors</h2>
2021
2022 Currently, only tabular CPDs can have priors on their parameters.
2023 The conjugate prior for a multinomial is the Dirichlet.
2024 (For binary random variables, the multinomial is the same as the
2025 Bernoulli, and the Dirichlet is the same as the Beta.)
2026 <p>
2027 The Dirichlet has a simple interpretation in terms of pseudo counts.
2028 If we let N_ijk = the num. times X_i=k and Pa_i=j occurs in the
2029 training set, where Pa_i are the parents of X_i,
2030 then the maximum likelihood (ML) estimate is
2031 T_ijk = N_ijk / N_ij (where N_ij = sum_k' N_ijk'), which will be 0 if N_ijk=0.
2032 To prevent us from declaring that (X_i=k, Pa_i=j) is impossible just because this
2033 event was not seen in the training set,
2034 we can pretend we saw value k of X_i, for each value j of Pa_i, some number (alpha_ijk)
2035 of times in the past.
2036 The MAP (maximum a posteriori) estimate is then
2037 <pre>
2038 T_ijk = (N_ijk + alpha_ijk) / (N_ij + alpha_ij)
2039 </pre>
2040 and is never 0 if all alpha_ijk > 0.
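For instance, suppose X_i has 3 values and, for a particular parent
configuration j, the observed counts are (N_ij1, N_ij2, N_ij3) = (3, 0, 7),
and we use alpha_ijk = 1 for all k (so N_ij = 10 and alpha_ij = 3). Then
<pre>
ML:   (T_ij1, T_ij2, T_ij3) = (3/10, 0/10, 7/10) = (0.30, 0.00, 0.70)
MAP:  (T_ij1, T_ij2, T_ij3) = (4/13, 1/13, 8/13) = (0.31, 0.08, 0.62)   (to 2dp)
</pre>
so the second value is no longer declared impossible.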
2041 For example, consider the network A->B, where A is binary and B has 3
2042 values.
2043 A uniform prior for B has the form
2044 <pre>
2045 B=1 B=2 B=3
2046 A=1 1 1 1
2047 A=2 1 1 1
2048 </pre>
2049 which can be created using
2050 <pre>
2051 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'unif');
2052 </pre>
2053 This prior does not satisfy the likelihood equivalence principle,
2054 which says that <a href="#markov_equiv">Markov equivalent</a> models
2055 should have the same marginal likelihood.
2056 A prior that does satisfy this principle is shown below.
2057 Heckerman (1995) calls this the
2058 BDeu prior (likelihood equivalent uniform Bayesian Dirichlet).
2059 <pre>
2060 B=1 B=2 B=3
2061 A=1 1/6 1/6 1/6
2062 A=2 1/6 1/6 1/6
2063 </pre>
2064 where we put N/(q*r) in each bin; N is the equivalent sample size,
2065 q = |A| = 2 is the number of parent configurations, and r = |B| = 3 is the number of values of B.
2066 This can be created as follows
2067 <pre>
2068 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu');
2069 </pre>
2070 Here, the equivalent sample size defaults to 1; it determines the strength of the
2071 prior.
2072 You can change this using
2073 <pre>
2074 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', ...
2075 'BDeu', 'dirichlet_weight', 10);
2076 </pre>
2077 <!--where counts is an array of pseudo-counts of the same size as the
2078 CPT.-->
2079 <!--
2080 <p>
2081 When you specify a prior, you should set row i of the CPT to the
2082 normalized version of row i of the pseudo-count matrix, i.e., to the
2083 expected values of the parameters. This will ensure that computing the
2084 marginal likelihood sequentially (see <a
2085 href="#bayes_learn">below</a>) and in batch form gives the same
2086 results.
2087 To do this, proceed as follows.
2088 <pre>
2089 tabular_CPD(bnet, i, 'prior', counts, 'CPT', mk_stochastic(counts));
2090 </pre>
2091 For a non-informative prior, you can just write
2092 <pre>
2093 tabular_CPD(bnet, i, 'prior', 'unif', 'CPT', 'unif');
2094 </pre>
2095 -->
2096
2097
2098 <h2><a name="bayes_learn">(Sequential) Bayesian parameter updating from complete data</h2>
2099
2100 If we use conjugate priors and have fully observed data, we can
2101 compute the posterior over the parameters in batch form as follows.
2102 <pre>
2103 cases = sample_bnet(bnet, nsamples);
2104 bnet = bayes_update_params(bnet, cases);
2105 LL = log_marg_lik_complete(bnet, cases);
2106 </pre>
2107 bnet.CPD{i}.prior contains the new Dirichlet pseudocounts,
2108 and bnet.CPD{i}.CPT is set to the mean of the posterior (the
2109 normalized counts).
2110 (Hence if the initial pseudo counts are 0,
2111 <tt>bayes_update_params</tt> and <tt>learn_params</tt> will give the
2112 same result.)
2113
2114
2115
2116
2117 <p>
2118 We can compute the same result sequentially (on-line) as follows.
2119 <pre>
2120 LL = 0;
2121 for m=1:nsamples
2122 LL = LL + log_marg_lik_complete(bnet, cases(:,m));
2123 bnet = bayes_update_params(bnet, cases(:,m));
2124 end
2125 </pre>
2126
2127 The file <tt>BNT/examples/static/StructLearn/model_select1</tt> has an example of
2128 sequential model selection which uses the same idea.
2129 We generate data from the model A->B
2130 and compute the posterior probability of all 3 dags on 2 nodes:
2131 (1) A B, (2) A <- B, (3) A -> B.
2132 Models 2 and 3 are <a href="#markov_equiv">Markov equivalent</a>, and therefore indistinguishable from
2133 observational data alone, so we expect their posteriors to be the same
2134 (assuming a prior which satisfies likelihood equivalence).
2135 If we use random parameters, the "true" model only gets a higher posterior after 2000 trials!
2136 However, if we make B a noisy NOT gate, the true model "wins" after 12
2137 trials, as shown below (red = model 1, blue/green (superimposed)
2138 represents models 2/3).
2139 <p>
2140 <img src="Figures/model_select.png">
2141 <p>
2142 The use of marginal likelihood for model selection is discussed in
2143 greater detail in the
2144 section on <a href="#structure_learning">structure learning</a>.
2145
2146
2147
2148
2149 <h2><a name="em">Maximum likelihood parameter estimation with missing values</h2>
2150
2151 Now we consider learning when some values are not observed.
2152 Let us randomly hide half the values generated from the water
2153 sprinkler example.
2154 <pre>
2155 samples2 = samples;
2156 hide = rand(N, nsamples) > 0.5;
2157 [I,J]=find(hide);
2158 for k=1:length(I)
2159 samples2{I(k), J(k)} = [];
2160 end
2161 </pre>
2162 samples2{i,l} is the value of node i in training case l, or [] if unobserved.
2163 <p>
2164 Now we will compute the MLEs using the EM algorithm.
2165 We need to use an inference algorithm to compute the expected
2166 sufficient statistics in the E step; the M (maximization) step is as
2167 above.
2168 <pre>
2169 engine2 = jtree_inf_engine(bnet2);
2170 max_iter = 10;
2171 [bnet4, LLtrace] = learn_params_em(engine2, samples2, max_iter);
2172 </pre>
2173 LLtrace(i) is the log-likelihood at iteration i. We can plot this as
2174 follows:
2175 <pre>
2176 plot(LLtrace, 'x-')
2177 </pre>
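To extract the learned CPTs from bnet4, we can use the same hackery as
before (a sketch):
<pre>
CPT4 = cell(1,N);
for i=1:N
  s=struct(bnet4.CPD{i});  % violate object privacy
  CPT4{i}=s.CPT;
end
</pre>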
2178 Let's display the results after 10 iterations of EM.
2179 <pre>
2180 celldisp(CPT4)
2181 CPT4{1} =
2182 0.6616
2183 0.3384
2184 CPT4{2} =
2185 0.6510 0.3490
2186 0.8751 0.1249
2187 CPT4{3} =
2188 0.8366 0.1634
2189 0.0197 0.9803
2190 CPT4{4} =
2191 (:,:,1) =
2192 0.8276 0.0546
2193 0.5452 0.1658
2194 (:,:,2) =
2195 0.1724 0.9454
2196 0.4548 0.8342
2197 </pre>
2198 We can get improved performance by using one or more of the following
2199 methods:
2200 <ul>
2201 <li> Increasing the size of the training set.
2202 <li> Decreasing the amount of hidden data.
2203 <li> Running EM for longer.
2204 <li> Using informative priors.
2205 <li> Initialising EM from multiple starting points.
2206 </ul>
2207
2208 <h2><a name="gaussianNumerical">Numerical problems when fitting
2209 Gaussians using EM</h2>
2210
2211 When fitting a <a href="#gaussian">Gaussian CPD</a> using EM, you
2212 might encounter some numerical
2213 problems. See <a
2214 href="http://www.ai.mit.edu/~murphyk/Software/HMM/hmm.html#loglikpos">here</a>
2215 for a discussion of this in the context of HMMs.
2216 <!--
2217 (Note: BNT does not currently support spherical covariances, although it
2218 would be easy to add, since KPMstats/clg_Mstep supports this option;
2219 you would just need to modify gaussian_CPD/update_ess to accumulate
2220 weighted inner products.)
2221 -->
2222
2223
2224 <h2><a name="tying">Parameter tying</h2>
2225
2226 In networks with repeated structure (e.g., chains and grids), it is
2227 common to assume that the parameters are the same at every node. This
2228 is called parameter tying, and reduces the amount of data needed for
2229 learning.
2230 <p>
2231 When we have tied parameters, there is no longer a one-to-one
2232 correspondence between nodes and CPDs.
2233 Rather, each CPD specifies the parameters for a whole equivalence class
2234 of nodes.
2235 It is easiest to see this by example.
2236 Consider the following <a href="usage_dbn.html#hmm">hidden Markov
2237 model (HMM)</a>
2238 <p>
2239 <img src="Figures/hmm3.gif">
2240 <p>
2241 <!--
2242 We can create this graph structure, assuming we have T time-slices,
2243 as follows.
2244 (We number the nodes as shown in the figure, but we could equally well
2245 number the hidden nodes 1:T, and the observed nodes T+1:2T.)
2246 <pre>
2247 N = 2*T;
2248 dag = zeros(N);
2249 hnodes = 1:2:2*T;
2250 for i=1:T-1
2251 dag(hnodes(i), hnodes(i+1))=1;
2252 end
2253 onodes = 2:2:2*T;
2254 for i=1:T
2255 dag(hnodes(i), onodes(i)) = 1;
2256 end
2257 </pre>
2258 <p>
2259 The hidden nodes are always discrete, and have Q possible values each,
2260 but the observed nodes can be discrete or continuous, and have O possible values/length.
2261 <pre>
2262 if cts_obs
2263 dnodes = hnodes;
2264 else
2265 dnodes = 1:N;
2266 end
2267 ns = ones(1,N);
2268 ns(hnodes) = Q;
2269 ns(onodes) = O;
2270 </pre>
2271 -->
2272 When HMMs are used for semi-infinite processes like speech recognition,
2273 we assume the transition matrix
2274 P(H(t+1)|H(t)) is the same for all t; this is called a time-invariant
2275 or homogeneous Markov chain.
2276 Hence hidden nodes 2, 3, ..., T
2277 are all in the same equivalence class, say class Hclass.
2278 Similarly, the observation matrix P(O(t)|H(t)) is assumed to be the
2279 same for all t, so the observed nodes are all in the same equivalence
2280 class, say class Oclass.
2281 Finally, the prior term P(H(1)) is in a class all by itself, say class
2282 H1class.
2283 This is illustrated below, where we explicitly represent the
2284 parameters as random variables (dotted nodes).
2285 <p>
2286 <img src="Figures/hmm4_params.gif">
2287 <p>
2288 In BNT, we cannot represent parameters as random variables (nodes).
2289 Instead, we "hide" the
2290 parameters inside one CPD for each equivalence class,
2291 and then specify that the other CPDs should share these parameters, as
2292 follows.
2293 <pre>
2294 hnodes = 1:2:2*T;
2295 onodes = 2:2:2*T;
2296 H1class = 1; Hclass = 2; Oclass = 3;
2297 eclass = ones(1,N);
2298 eclass(hnodes(2:end)) = Hclass;
2299 eclass(hnodes(1)) = H1class;
2300 eclass(onodes) = Oclass;
2301 % create dag and ns in the usual way
2302 bnet = mk_bnet(dag, ns, 'discrete', dnodes, 'equiv_class', eclass);
2303 </pre>
2304 Finally, we define the parameters for each equivalence class:
2305 <pre>
2306 bnet.CPD{H1class} = tabular_CPD(bnet, hnodes(1)); % prior
2307 bnet.CPD{Hclass} = tabular_CPD(bnet, hnodes(2)); % transition matrix
2308 if cts_obs
2309 bnet.CPD{Oclass} = gaussian_CPD(bnet, onodes(1));
2310 else
2311 bnet.CPD{Oclass} = tabular_CPD(bnet, onodes(1));
2312 end
2313 </pre>
2314 In general, if bnet.CPD{e} = xxx_CPD(bnet, j), then j should be a
2315 member of e's equivalence class; that is, it is not always the case
2316 that e == j. You can use bnet.rep_of_eclass(e) to return the
2317 representative of equivalence class e.
2318 BNT will look up the parents of j to determine the size
2319 of the CPT to use. It assumes that this is the same for all members of
2320 the equivalence class.
2321 Click <a href="http://www.cs.berkeley.edu/~murphyk/Bayes/param_tieing.html">here</a> for
2322 a more complex example of parameter tying.
2323 <p>
2324 Note:
2325 Normally one would define an HMM as a
2326 <a href = "usage_dbn.html">Dynamic Bayes Net</a>
2327 (see the function BNT/examples/dynamic/mk_chmm.m).
2328 However, one can define an HMM as a static BN using the function
2329 BNT/examples/static/Models/mk_hmm_bnet.m.
2330
2331
2332
2333 <h1><a name="structure_learning">Structure learning</h1>
2334
2335 Update (9/29/03):
2336 Philippe Leray is developing some additional structure learning code
2337 on top of BNT. Click <a
2338 href="http://bnt.insa-rouen.fr/ajouts.html">here</a>
2339 for details.
2340
2341 <p>
2342
2343 There are two very different approaches to structure learning:
2344 constraint-based and search-and-score.
2345 In the <a href="#constraint">constraint-based approach</a>,
2346 we start with a fully connected graph, and remove edges if certain
2347 conditional independencies are detected in the data.
2348 This has the disadvantage that repeated independence tests lose
2349 statistical power.
2350 <p>
2351 In the more popular search-and-score approach,
2352 we perform a search through the space of possible DAGs, and either
2353 return the best one found (a point estimate), or return a sample of the
2354 models found (an approximation to the Bayesian posterior).
2355 <p>
2356 Unfortunately, the number of DAGs as a function of the number of
2357 nodes, G(n), is super-exponential in n.
2358 A closed form formula for G(n) is not known, but the first few values
2359 are shown below (from Cooper, 1999).
2360
2361 <table>
2362 <tr> <th>n</th> <th align=left>G(n)</th> </tr>
2363 <tr> <td>1</td> <td>1</td> </tr>
2364 <tr> <td>2</td> <td>3</td> </tr>
2365 <tr> <td>3</td> <td>25</td> </tr>
2366 <tr> <td>4</td> <td>543</td> </tr>
2367 <tr> <td>5</td> <td>29,281</td> </tr>
2368 <tr> <td>6</td> <td>3,781,503</td> </tr>
2369 <tr> <td>7</td> <td>1.1 x 10^9</td> </tr>
2370 <tr> <td>8</td> <td>7.8 x 10^11</td> </tr>
2371 <tr> <td>9</td> <td>1.2 x 10^15</td> </tr>
2372 <tr> <td>10</td> <td>4.2 x 10^18</td> </tr>
2373 </table>
2374
2375 Since the number of DAGs is super-exponential in the number of nodes,
2376 we cannot exhaustively search the space, so we either use a local
2377 search algorithm (e.g., greedy hill climbing, perhaps with multiple
2378 restarts) or a global search algorithm (e.g., Markov Chain Monte
2379 Carlo).
2380 <p>
2381 If we know a total ordering on the nodes,
2382 finding the best structure amounts to picking the best set of parents
2383 for each node independently.
2384 This is what the K2 algorithm does.
2385 If the ordering is unknown, we can search over orderings,
2386 which is more efficient than searching over DAGs (Koller and Friedman, 2000).
2387 <p>
2388 In addition to the search procedure, we must specify the scoring
2389 function. There are two popular choices. The Bayesian score integrates
2390 out the parameters, i.e., it is the marginal likelihood of the model.
2391 The BIC (Bayesian Information Criterion) is defined as
2392 log P(D|theta_hat) - 0.5*d*log(N), where D is the data, theta_hat is
2393 the ML estimate of the parameters, d is the number of parameters, and
2394 N is the number of data cases.
2395 The BIC method has the advantage of not requiring a prior.
2396 <p>
2397 BIC can be derived as a large sample
2398 approximation to the marginal likelihood.
2399 (It is also equal to the Minimum Description Length of a model.)
2400 However, in practice, the sample size does not need to be very large
2401 for the approximation to be good.
2402 For example, in the figure below, we plot the ratio between the log marginal likelihood
2403 and the BIC score against data-set size; we see that the ratio rapidly
2404 approaches 1, especially for non-informative priors.
2405 (This plot was generated by the file BNT/examples/static/bic1.m. It
2406 uses the water sprinkler BN with BDeu Dirichlet priors with different
2407 equivalent sample sizes.)
2408
2409 <p>
2410 <center>
2411 <IMG SRC="Figures/bic.png">
2412 </center>
2413 <p>
2414
2415 <p>
2416 As with parameter learning, handling missing data/ hidden variables is
2417 much harder than the fully observed case.
2418 The structure learning routines in BNT can therefore be classified into 4
2419 types, analogously to the parameter learning case.
2420 <p>
2421
2422 <TABLE BORDER>
2423 <tr>
2424 <TH></TH>
2425 <th>Full obs</th>
2426 <th>Partial obs</th>
2427 </tr>
2428 <tr>
2429 <th>Point</th>
2430 <td><tt>learn_struct_K2</tt></td>
2431 <!-- <tt>learn_struct_hill_climb</tt> -->
2432 <td>not yet supported</td>
2433 </tr>
2434 <tr>
2435 <th>Bayes</th>
2436 <td><tt>learn_struct_mcmc</tt></td>
2437 <td>not yet supported</td>
2438 </tr>
2439 </table>
2440
2441
2442 <h2><a name="markov_equiv">Markov equivalence</h2>
2443
2444 If two DAGs encode the same conditional independencies, they are
2445 called Markov equivalent. The set of all DAGs can be partitioned into
2446 Markov equivalence classes. Graphs within the same class can
2447 have
2448 the direction of some of their arcs reversed without changing any of
2449 the CI relationships.
2450 Each class can be represented by a PDAG
2451 (partially directed acyclic graph) called an essential graph or
2452 pattern. This specifies which edges must be oriented in a certain
2453 direction, and which may be reversed.
2454
2455 <p>
2456 When learning graph structure from observational data,
2457 the best one can hope to do is to identify the model up to Markov
2458 equivalence. To distinguish amongst graphs within the same equivalence
2459 class, one needs interventional data: see the discussion on <a
2460 href="#active">active learning</a> below.
2461
2462
2463
2464 <h2><a name="enumerate">Exhaustive search</h2>
2465
2466 The brute-force approach to structure learning is to enumerate all
2467 possible DAGs, and score each one. This provides a "gold standard"
2468 with which to compare other algorithms. We can do this as follows.
2469 <pre>
2470 dags = mk_all_dags(N);
2471 score = score_dags(data, ns, dags);
2472 </pre>
2473 where data(i,m) is the value of node i in case m,
2474 and ns(i) is the size of node i.
2475 If the DAGs have a lot of families in common, we can cache the sufficient statistics,
2476 making this potentially more efficient than scoring the DAGs one at a time.
2477 (Caching is not currently implemented, however.)
2478 <p>
2479 By default, we use the Bayesian scoring metric, and assume CPDs are
2480 represented by tables with BDeu(1) priors.
2481 We can override these defaults as follows.
2482 If we want to use uniform priors, we can say
2483 <pre>
2484 params = cell(1,N);
2485 for i=1:N
2486 params{i} = {'prior', 'unif'};
2487 end
2488 score = score_dags(data, ns, dags, 'params', params);
2489 </pre>
2490 params{i} is a cell-array, containing optional arguments that are
2491 passed to the constructor for CPD i.
2492 <p>
2493 Now suppose we want to use different node types, e.g.,
2494 Suppose nodes 1 and 2 are Gaussian, and nodes 3 and 4 softmax (both
2495 these CPDs can support discrete and continuous parents, which is
2496 necessary since all other nodes will be considered as parents).
2497 The Bayesian scoring metric currently only works for tabular CPDs, so
2498 we will use BIC:
2499 <pre>
2500 score = score_dags(data, ns, dags, 'discrete', [3 4], 'params', [], ...
2501        'type', {'gaussian', 'gaussian', 'softmax', 'softmax'}, 'scoring_fn', 'bic')
2502 </pre>
2503 In practice, one can't enumerate all possible DAGs for N > 5,
2504 but one can evaluate any reasonably-sized set of hypotheses in this
2505 way (e.g., nearest neighbors of your current best guess).
2506 Think of this as "computer assisted model refinement" as opposed to de
2507 novo learning.
2508
2509
2510 <h2><a name="K2">K2</h2>
2511
2512 The K2 algorithm (Cooper and Herskovits, 1992) is a greedy search algorithm that works as follows.
2513 Initially each node has no parents. It then adds incrementally that parent whose addition most
2514 increases the score of the resulting structure. When the addition of no single
2515 parent can increase the score, it stops adding parents to the node.
2516 Since we are using a fixed ordering, we do not need to check for
2517 cycles, and can choose the parents for each node independently.
2518 <p>
2519 The original paper used the Bayesian scoring
2520 metric with tabular CPDs and Dirichlet priors.
2521 BNT generalizes this to allow any kind of CPD, and either the Bayesian
2522 scoring metric or BIC, as in the example <a href="#enumerate">above</a>.
2523 In addition, you can specify
2524 an optional upper bound on the number of parents for each node.
2525 The file BNT/examples/static/k2demo1.m gives an example of how to use K2.
2526 We use the water sprinkler network and sample 100 cases from it as before.
2527 Then we see how much data it takes to recover the generating structure:
2528 <pre>
2529 order = [C S R W];
2530 max_fan_in = 2;
2531 sz = 5:5:100;
2532 for i=1:length(sz)
2533 dag2 = learn_struct_K2(data(:,1:sz(i)), node_sizes, order, 'max_fan_in', max_fan_in);
2534 correct(i) = isequal(dag, dag2);
2535 end
2536 </pre>
2537 Here are the results.
2538 <pre>
2539 correct =
2540 Columns 1 through 12
2541 0 0 0 0 0 0 0 1 0 1 1 1
2542 Columns 13 through 20
2543 1 1 1 1 1 1 1 1
2544 </pre>
2545 So we see it takes about sz(10)=50 cases. (BIC behaves similarly,
2546 showing that the prior doesn't matter too much.)
2547 In general, we cannot hope to recover the "true" generating structure,
2548 only one that is in its <a href="#markov_equiv">Markov equivalence
2549 class</a>.
2550
2551
2552 <h2><a name="hill_climb">Hill-climbing</h2>
2553
2554 Hill-climbing starts at a specific point in space,
2555 considers all nearest neighbors, and moves to the neighbor
2556 that has the highest score; if no neighbors have higher
2557 score than the current point (i.e., we have reached a local maximum),
2558 the algorithm stops. One can then restart in another part of the space.
2559 <p>
2560 A common definition of "neighbor" is all graphs that can be
2561 generated from the current graph by adding, deleting or reversing a
2562 single arc, subject to the acyclicity constraint.
2563 Other neighborhoods are possible: see
2564 <a href="http://research.microsoft.com/~dmax/publications/jmlr02.pdf">
2565 Optimal Structure Identification with Greedy Search</a>, Max
2566 Chickering, JMLR 2002.
2567
2568 <!--
2569 Note: This algorithm is currently (Feb '02) being implemented by Qian
2570 Diao.
2571 -->
2572
2573
2574 <h2><a name="mcmc">MCMC</h2>
2575
2576 We can use a Markov Chain Monte Carlo (MCMC) algorithm called
2577 Metropolis-Hastings (MH) to search the space of all
2578 DAGs.
2579 The standard proposal distribution is to consider moving to all
2580 nearest neighbors in the sense defined <a href="#hill_climb">above</a>.
2581 <p>
2582 The function can be called
2583 as in the following example.
2584 <pre>
2585 [sampled_graphs, accept_ratio] = learn_struct_mcmc(data, ns, 'nsamples', 100, 'burnin', 10);
2586 </pre>
2587 We can convert our set of sampled graphs to a histogram
2588 (empirical posterior over all the DAGs) thus
2589 <pre>
2590 all_dags = mk_all_dags(N);
2591 mcmc_post = mcmc_sample_to_hist(sampled_graphs, all_dags);
2592 </pre>
2593 To see how well this performs, let us compute the exact posterior exhaustively.
2594 <p>
2595 <pre>
2596 score = score_dags(data, ns, all_dags);
2597 post = normalise(exp(score)); % assuming uniform structural prior
2598 </pre>
2599 We plot the results below.
2600 (The data set was 100 samples drawn from a random 4 node bnet; see the
2601 file BNT/examples/static/mcmc1.)
2602 <pre>
2603 subplot(2,1,1)
2604 bar(post)
2605 subplot(2,1,2)
2606 bar(mcmc_post)
2607 </pre>
2608 <img src="Figures/mcmc_post.jpg" width="800" height="500">
2609 <p>
2610 We can also plot the acceptance ratio versus number of MCMC steps,
2611 as a crude convergence diagnostic.
2612 <pre>
2613 clf
2614 plot(accept_ratio)
2615 </pre>
2616 <img src="Figures/mcmc_accept.jpg" width="800" height="300">
2617 <p>
2618 Better convergence diagnostics can be found
2619 <a href="<a href="http://www.lce.hut.fi/~ave/code/mcmcdiag/">here</a>
2620 <p>
2621 Even though the number of samples needed by MCMC is theoretically
2622 polynomial (not exponential) in the dimensionality of the search space, in practice it has been
2623 found that MCMC does not converge in reasonable time for graphs with
2624 more than about 10 nodes.
2625 <p>
2626 <a href="http://www.bioss.sari.ac.uk/~dirk/software/DBmcmc/">
2627 Dirk Husmeier has extended MCMC model selection to DBNs</a>.
2628
2629
2630 <h2><a name="active">Active structure learning</h2>
2631
2632 As was mentioned <a href="#markov_equiv">above</a>,
2633 one can only learn a DAG up to Markov equivalence, even given infinite data.
2634 If one is interested in learning the structure of a causal network,
2635 one needs interventional data.
2636 (By "intervention" we mean forcing a node to take on a specific value,
2637 thereby effectively severing its incoming arcs.)
2638 <p>
2639 Most of the scoring functions accept an optional argument
2640 that specifies whether a node was observed to have a certain value, or
2641 was forced to have that value: we set clamped(i,m)=1 if node i was
2642 forced in training case m; see, e.g., the file
2643 BNT/examples/static/cooper_yoo, and the sketch below.
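<p>
A rough sketch of the idea (the cooper_yoo example shows the actual usage;
here we assume the scoring function accepts a 'clamped' matrix of the same
size as the data):
<pre>
clamped = zeros(size(data));   % clamped(i,m)=1 iff node i was forced in case m
clamped(1, 1:10) = 1;          % e.g., suppose node 1 was clamped in the first 10 cases
score = score_dags(data, ns, dags, 'clamped', clamped);
</pre>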
2644 <p>
2645 An interesting question is to decide which interventions to perform
2646 (c.f., design of experiments). For details, see the following tech
2647 report
2648 <ul>
2649 <li> <a href = "../../Papers/alearn.ps.gz">
2650 Active learning of causal Bayes net structure</a>, Kevin Murphy, March
2651 2001.
2652 </ul>
2653
2654
2655 <h2><a name="struct_em">Structural EM</h2>
2656
2657 Computing the Bayesian score when there is partial observability is
2658 computationally challenging, because the parameter posterior becomes
2659 multimodal (the hidden nodes induce a mixture distribution).
2660 One therefore needs to use approximations such as BIC.
2661 Unfortunately, search algorithms are still expensive, because we need
2662 to run EM at each step to compute the MLE, which is needed to compute
2663 the score of each model. An alternative approach is
2664 to do the local search steps inside of the M step of EM, which is more
2665 efficient since the data has been "filled in" - this is
2666 called the structural EM algorithm (Friedman 1997), and provably
2667 converges to a local maximum of the BIC score.
2668 <p>
2669 Wei Hu has implemented SEM for discrete nodes.
2670 You can download his package from
2671 <a href="../SEM.zip">here</a>.
2672 Please address all questions about this code to
2673 wei.hu@intel.com.
2674 See also <a href="#phl">Philippe Leray's implementation of SEM</a>.
2675
2676 <!--
2677 <h2><a name="reveal">REVEAL algorithm</h2>
2678
2679 A simple way to learn the structure of a fully observed, discrete,
2680 factored DBN from a time series is described <a
2681 href="usage_dbn.html#struct_learn">here</a>.
2682 -->
2683
2684
2685 <h2><a name="graphdraw">Visualizing the graph</h2>
2686
2687 You can visualize an arbitrary graph (such as one learned using the
2688 structure learning routines) with Matlab code contributed by
2689 <a href="http://www.mbfys.kun.nl/~cemgil/matlab/layout.html">Ali
2690 Taylan Cemgil</a>
2691 from the University of Nijmegen.
2692 For static BNs, call it as follows:
2693 <pre>
2694 draw_graph(bnet.dag);
2695 </pre>
2696 For example, this is the output produced on a
2697 <a href="#qmr">random QMR-like model</a>:
2698 <p>
2699 <img src="Figures/qmr.rnd.jpg">
2700 <p>
2701 If you install the excellent <a
2702 href="http://www.research.att.com/sw/tools/graphviz">graphhviz</a>, an
2703 open-source graph visualization package from AT&T,
2704 you can create a much better visualization as follows
2705 <pre>
2706 graph_to_dot(bnet.dag)
2707 </pre>
2708 This works by converting the adjacency matrix to a file suitable
2709 for input to graphviz (using the dot format),
2710 then converting the output of graphviz to postscript, and displaying the results using
2711 ghostview.
2712 You can do each of these steps separately for more control, as shown
2713 below.
2714 <pre>
2715 graph_to_dot(bnet.dag, 'filename', 'foo.dot');
2716 dot -Tps foo.dot -o foo.ps
2717 ghostview foo.ps &
2718 </pre>
2719
2720 <h2><a name = "constraint">Constraint-based methods</h2>
2721
2722 The IC algorithm (Pearl and Verma, 1991),
2723 and the faster, but otherwise equivalent, PC algorithm (Spirtes, Glymour, and Scheines 1993),
2724 perform many conditional independence tests,
2725 and combine these constraints into a
2726 PDAG to represent the whole
2727 <a href="#markov_equiv">Markov equivalence class</a>.
2728 <p>
2729 IC*/FCI extend IC/PC to handle latent variables: see <a href="#ic_star">below</a>.
2730 (IC stands for inductive causation; PC stands for Peter and Clark,
2731 the first names of Spirtes and Glymour; FCI stands for fast causal
2732 inference.
2733 What we, following Pearl (2000), call IC* was called
2734 IC in the original Pearl and Verma paper.)
2735 For details, see
2736 <ul>
2737 <li>
2738 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Causation,
2739 Prediction, and Search</a>, Spirtes, Glymour and
2740 Scheines (SGS), 2001 (2nd edition), MIT Press.
2741 <li>
2742 <a href="http://bayes.cs.ucla.edu/BOOK-2K/index.html">Causality: Models, Reasoning and Inference</a>, J. Pearl,
2743 2000, Cambridge University Press.
2744 </ul>
2745
2746 <p>
2747
2748 The PC algorithm takes as arguments a function f, the number of nodes N,
2749 the maximum fan in K, and additional arguments A which are passed to f.
2750 The function f(X,Y,S,A) returns 1 if X is conditionally independent of Y given S, and 0
2751 otherwise.
2752 For example, suppose we cheat by
2753 passing in a CI "oracle" which has access to the true DAG; the oracle
2754 tests for d-separation in this DAG, i.e.,
2755 f(X,Y,S) calls dsep(X,Y,S,dag). We can do this as follows.
2756 <pre>
2757 pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag);
2758 </pre>
2759 pdag(i,j) = -1 if there is definitely an i->j arc,
2760 and pdag(i,j) = 1 if there is either an i->j or an i<-j arc.
2761 <p>
2762 Applied to the sprinkler network, this returns
2763 <pre>
2764 pdag =
2765 0 1 1 0
2766 1 0 0 -1
2767 1 0 0 -1
2768 0 0 0 0
2769 </pre>
2770 So as expected, we see that the V-structure at the W node is uniquely identified,
2771 but the other arcs have ambiguous orientation.
2772 <p>
2773 We now give an example from p141 (1st edn) / p103 (2nd edn) of the SGS
2774 book.
2775 This example concerns the female orgasm.
2776 We are given a correlation matrix C between 7 measured factors (such
2777 as subjective experiences of coital and masturbatory experiences),
2778 derived from 281 samples, and want to learn a causal model of the
2779 data. We will not discuss the merits of this type of work here, but
2780 merely show how to reproduce the results in the SGS book.
2781 Their program,
2782 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Tetrad</a>,
2783 makes use of the Fisher Z-test for conditional
2784 independence, so we do the same:
2785 <pre>
n = 7;             % number of measured variables (the correlation matrix C is 7x7)
2786 max_fan_in = 4;
2787 nsamples = 281;
2788 alpha = 0.05;
2789 pdag = learn_struct_pdag_pc('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha);
2790 </pre>
2791 In this case, the CI test is
2792 <pre>
2793 f(X,Y,S) = cond_indep_fisher_z(X,Y,S, C,nsamples,alpha)
2794 </pre>
2795 The results match those of Fig 12a of SGS apart from two edge
2796 differences; presumably this is due to rounding error (although it
2797 could be a bug, either in BNT or in Tetrad).
2798 This example can be found in the file BNT/examples/static/pc2.m.
2799
2800 <p>
2801
2802 The IC* algorithm (Pearl and Verma, 1991),
2803 and the faster FCI algorithm (Spirtes, Glymour, and Scheines 1993),
2804 are like the IC/PC algorithm, except that they can detect the presence
2805 of latent variables.
2806 See the file <tt>learn_struct_pdag_ic_star</tt> written by Tamar
2807 Kushnir. The output is a matrix P, defined as follows
2808 (see Pearl (2000), p52 for details):
2809 <pre>
2810 % P(i,j) = -1 if there is either a latent variable L such that i <-L->j OR there is a directed edge from i->j.
2811 % P(i,j) = -2 if there is a marked directed i-*>j edge.
2812 % P(i,j) = P(j,i) = 1 if there is an undirected edge i--j
2813 % P(i,j) = P(j,i) = 2 if there is a latent variable L such that i<-L->j.
2814 </pre>
2815
2816
2817 <h2><a name="phl">Philippe Leray's structure learning package</h2>
2818
2819 Philippe Leray has written a
2820 <a href="http://bnt.insa-rouen.fr/ajouts.html">
2821 structure learning package</a> that uses BNT.
2822
2823 It currently (June 2003) has the following features:
2824 <ul>
2825 <li>PC with Chi2 statistical test
2826 <li> MWST : Maximum weighted Spanning Tree
2827 <li> Hill Climbing
2828 <li> Greedy Search
2829 <li> Structural EM
2830 <li> hist_ic : optimal Histogram based on IC information criterion
2831 <li> cpdag_to_dag
2832 <li> dag_to_cpdag
2833 <li> ...
2834 </ul>
2835
2836
2837 </a>
2838
2839
2840 <!--
2841 <h2><a name="read_learning">Further reading on learning</h2>
2842
2843 I recommend the following tutorials for more details on learning.
2844 <ul>
2845 <li> <a
2846 href="http://www.cs.berkeley.edu/~murphyk/Papers/intel.ps.gz">My short
2847 tutorial</a> on graphical models, which contains an overview of learning.
2848
2849 <li>
2850 <A HREF="ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.PS">
2851 A tutorial on learning with Bayesian networks</a>, D. Heckerman,
2852 Microsoft Research Tech Report, 1995.
2853
2854 <li> <A HREF="http://www-cad.eecs.berkeley.edu/~wray/Mirror/lwgmja">
2855 Operations for Learning with Graphical Models</a>,
2856 W. L. Buntine, JAIR'94, 159--225.
2857 </ul>
2858 <p>
2859 -->
2860
2861
2862
2863
2864
2865 <h1><a name="engines">Inference engines</h1>
2866
2867 Up until now, we have used the junction tree algorithm for inference.
2868 However, sometimes this is too slow, or not even applicable.
2869 In general, there are many inference algorithms, each of which makes
2870 different tradeoffs between speed, accuracy, complexity and
2871 generality. Furthermore, there might be many implementations of the
2872 same algorithm; for instance, a general purpose, readable version,
2873 and a highly-optimized, specialized one.
2874 To cope with this variety, we treat each inference algorithm as an
2875 object, which we call an inference engine.
2876
2877 <p>
2878 An inference engine is an object that contains a bnet and supports the
2879 'enter_evidence' and 'marginal_nodes' methods. The engine constructor
2880 takes the bnet as argument and may do some model-specific processing.
2881 When 'enter_evidence' is called, the engine may do some
2882 evidence-specific processing. Finally, when 'marginal_nodes' is
2883 called, the engine may do some query-specific processing.
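<p>
Concretely, every engine follows the same three-step calling pattern we have
used throughout this document; e.g., with the junction tree engine:
<pre>
engine = jtree_inf_engine(bnet);             % model-specific processing
engine = enter_evidence(engine, evidence);   % evidence-specific processing
m = marginal_nodes(engine, query_node);      % query-specific processing
</pre>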
2884
2885 <p>
2886 The amount of work done when each stage is specified -- structure,
2887 parameters, evidence, and query -- depends on the engine. The cost of
2888 work done early in this sequence can be amortized. On the other hand,
2889 one can make better optimizations if one waits until later in the
2890 sequence.
2891 For example, the parameters might imply
2892 conditional independencies that are not evident in the graph structure,
2893 but can nevertheless be exploited; the evidence indicates which nodes
2894 are observed and hence can effectively be disconnected from the
2895 graph; and the query might indicate that large parts of the network
2896 are d-separated from the query nodes. (Since it is not the actual
2897 <em>values</em> of the evidence that matters, just which nodes are observed,
2898 many engines allow you to specify which nodes will be observed when they are constructed,
2899 i.e., before calling 'enter_evidence'. Some engines can still cope if
2900 the actual pattern of evidence is different, e.g., if there is missing
2901 data.)
2902 <p>
2903
2904 Although being maximally lazy (i.e., only doing work when a query is
2905 issued) may seem desirable,
2906 this is not always the most efficient.
2907 For example,
2908 when learning using EM, we need to call marginal_nodes N times, where N is the
2909 number of nodes. <a href="#varelim">Variable elimination</a> would end
2910 up repeating a lot of work
2911 each time marginal_nodes is called, making it inefficient for
2912 learning. The junction tree algorithm, by contrast, uses dynamic
2913 programming to avoid this redundant computation --- it calculates all
2914 marginals in two passes during 'enter_evidence', so calling
2915 'marginal_nodes' takes constant time.
2916 <p>
2917 We will discuss some of the inference algorithms implemented in BNT
2918 below, and finish with a <a href="#engine_summary">summary</a> of all
2919 of them.
2920
2921
2922
2923
2924
2925
2926
2927 <h2><a name="varelim">Variable elimination</h2>
2928
2929 The variable elimination algorithm, also known as bucket elimination
2930 or peeling, is one of the simplest inference algorithms.
2931 The basic idea is to "push sums inside of products"; this is explained
2932 in more detail
2933 <a
2934 href="http://HTTP.CS.Berkeley.EDU/~murphyk/Bayes/bayes.html#infer">here</a>.
2935 <p>
2936 The principle of distributing sums over products can be generalized
2937 greatly to apply to any commutative semiring.
2938 This forms the basis of many common algorithms, such as Viterbi
2939 decoding and the Fast Fourier Transform. For details, see
2940
2941 <ul>
2942 <li> R. McEliece and S. M. Aji, 2000.
2943 <!--<a href="http://www.systems.caltech.edu/EE/Faculty/rjm/papers/GDL.ps">-->
2944 <a href="GDL.pdf">
2945 The Generalized Distributive Law</a>,
2946 IEEE Trans. Inform. Theory, vol. 46, no. 2 (March 2000),
2947 pp. 325--343.
2948
2949
2950 <li>
2951 F. R. Kschischang, B. J. Frey and H.-A. Loeliger, 2001.
2952 <a href="http://www.cs.toronto.edu/~frey/papers/fgspa.abs.html">
2953 Factor graphs and the sum-product algorithm</a>
2954 IEEE Transactions on Information Theory, February, 2001.
2955
2956 </ul>
2957
2958 <p>
2959 Choosing an order in which to sum out the variables so as to minimize
2960 computational cost is known to be NP-hard.
2961 The implementation of this algorithm in
2962 <tt>var_elim_inf_engine</tt> makes no attempt to optimize this
2963 ordering (in contrast, say, to <tt>jtree_inf_engine</tt>, which uses a
2964 greedy search procedure to find a good ordering).
2965 <p>
2966 Note: unlike most algorithms, var_elim does all its computational work
2967 inside of <tt>marginal_nodes</tt>, not inside of
2968 <tt>enter_evidence</tt>.
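<p>
The calling convention is the standard engine interface, e.g.,
<pre>
engine = var_elim_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);   % for var_elim, the summation work happens here
</pre>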
2969
2970
2971
2972
2973 <h2><a name="global">Global inference methods</h2>
2974
2975 The simplest inference algorithm of all is to explicitly construct
2976 the joint distribution over all the nodes, and then to marginalize it.
2977 This is implemented in <tt>global_joint_inf_engine</tt>.
2978 Since the size of the joint is exponential in the
2979 number of discrete (hidden) nodes, this is not a very practical algorithm.
2980 It is included merely for pedagogical and debugging purposes.
2981 <p>
2982 Three specialized versions of this algorithm have also been implemented,
2983 corresponding to the cases where all the nodes are discrete (D), all
2984 are Gaussian (G), and some are discrete and some Gaussian (CG).
2985 They are called <tt>enumerative_inf_engine</tt>,
2986 <tt>gaussian_inf_engine</tt>,
2987 and <tt>cond_gauss_inf_engine</tt> respectively.
2988 <p>
2989 Note: unlike most algorithms, these global inference algorithms do all their computational work
2990 inside of <tt>marginal_nodes</tt>, not inside of
2991 <tt>enter_evidence</tt>.
2992
2993
2994 <h2><a name="quickscore">Quickscore</h2>
2995
2996 The junction tree algorithm is quite slow on the <a href="#qmr">QMR</a> network,
2997 since the cliques are so big.
2998 One simple trick we can use is to notice that hidden leaves do not
2999 affect the posteriors on the roots, and hence do not need to be
3000 included in the network.
3001 A second trick is to notice that the negative findings can be
3002 "absorbed" into the prior:
3003 see the file
3004 BNT/examples/static/mk_minimal_qmr_bnet for details.
3005 <p>
3006
3007 A much more significant speedup is obtained by exploiting special
3008 properties of the noisy-or node, as done by the quickscore
3009 algorithm. For details, see
3010 <ul>
3011 <li> Heckerman, "A tractable inference algorithm for diagnosing multiple diseases", UAI 89.
3012 <li> Rish and Dechter, "On the impact of causal independence", UCI
3013 tech report, 1998.
3014 </ul>
3015
3016 This has been implemented in BNT as a special-purpose inference
3017 engine, which can be created and used as follows:
3018 <pre>
3019 engine = quickscore_inf_engine(inhibit, leak, prior);
3020 engine = enter_evidence(engine, pos, neg);
3021 m = marginal_nodes(engine, i);
3022 </pre>
3023
3024
3025 <h2><a name="belprop">Belief propagation</h2>
3026
3027 Even using quickscore, exact inference takes time that is exponential
3028 in the number of positive findings.
3029 Hence for large networks we need to resort to approximate inference techniques.
3030 See for example
3031 <ul>
3032 <li> T. Jaakkola and M. Jordan, "Variational probabilistic inference and the
3033 QMR-DT network", JAIR 10, 1999.
3034
3035 <li> K. Murphy, Y. Weiss and M. Jordan, "Loopy belief propagation for approximate inference: an empirical study",
3036 UAI 99.
3037 </ul>
3038 The latter approximation
3039 entails applying Pearl's belief propagation algorithm to a model even
3040 if it has loops (hence the name loopy belief propagation).
3041 Pearl's algorithm, implemented as <tt>pearl_inf_engine</tt>, gives
3042 exact results when applied to singly-connected graphs
3043 (a.k.a. polytrees, since
3044 the underlying undirected topology is a tree, but a node may have
3045 multiple parents).
3046 To apply this algorithm to a graph with loops,
3047 use <tt>pearl_inf_engine</tt>.
3048 This can use a centralized or distributed message passing protocol.
3049 You can use it as in the following example.
3050 <pre>
3051 engine = pearl_inf_engine(bnet, 'max_iter', 30);
3052 engine = enter_evidence(engine, evidence);
3053 m = marginal_nodes(engine, i);
3054 </pre>
3055 We found that this algorithm often converges, and when it does, it is often
3056 very accurate, but this depends on the precise setting of the
3057 parameter values of the network.
3058 (See the file BNT/examples/static/qmr1 to repeat the experiment for yourself.)
3059 Understanding when and why belief propagation converges/ works
3060 is a topic of ongoing research.
3061 <p>
3062 <tt>pearl_inf_engine</tt> can exploit special structure in noisy-or
3063 and gmux nodes to compute messages efficiently.
3064 <p>
3065 <tt>belprop_inf_engine</tt> is like pearl, but uses potentials to
3066 represent messages, and is therefore slower.
3067 <p>
3068 <tt>belprop_fg_inf_engine</tt> is like belprop,
3069 but is designed for factor graphs.
3070
3071
3072
3073 <h2><a name="sampling">Sampling</h2>
3074
3075 BNT now (Mar '02) has two sampling (Monte Carlo) inference algorithms:
3076 <ul>
3077 <li> <tt>likelihood_weighting_inf_engine</tt> which does importance
3078 sampling and can handle any node type.
3079 <li> <tt>gibbs_sampling_inf_engine</tt>, written by Bhaskara Marthi.
3080 Currently this can only handle tabular CPDs.
3081 For a much faster and more powerful Gibbs sampling program, see
3082 <a href="http://www.mrc-bsu.cam.ac.uk/bugs">BUGS</a>.
3083 </ul>
3084 Note: To generate samples from a network (which is not the same as inference!),
3085 use <tt>sample_bnet</tt>.
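<p>
Both sampling engines follow the standard interface; the answers they return are
Monte Carlo estimates. A minimal sketch, assuming <tt>bnet</tt>, <tt>evidence</tt>
and the query node <tt>i</tt> are defined as before:
<pre>
engine = likelihood_weighting_inf_engine(bnet); % or gibbs_sampling_inf_engine(bnet), tabular CPDs only
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);                  % estimated posterior marginal on node i

sample = sample_bnet(bnet);                     % draw one sample from the (unconditioned) network
</pre>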
3086
3087
3088
3089 <h2><a name="engine_summary">Summary of inference engines</h2>
3090
3091
3092 The inference engines differ in many ways. Here are
3093 some of the major "axes":
3094 <ul>
3095 <li> Works for all topologies or makes restrictions?
3096 <li> Works for all node types or makes restrictions?
3097 <li> Exact or approximate inference?
3098 </ul>
3099
3100 <p>
3101 In terms of topology, most engines handle any kind of DAG.
3102 <tt>belprop_fg</tt> does approximate inference on factor graphs (FG), which
3103 can be used to represent directed, undirected, and mixed (chain)
3104 graphs.
3105 (In the future, we plan to support exact inference on chain graphs.)
3106 <tt>quickscore</tt> only works on QMR-like models.
3107 <p>
3108 In terms of node types: algorithms that use potentials can handle
3109 discrete (D), Gaussian (G) or conditional Gaussian (CG) models.
3110 Sampling algorithms can essentially handle any kind of node (distribution).
3111 Other algorithms make more restrictive assumptions in exchange for
3112 speed.
3113 <p>
3114 Finally, most algorithms are designed to give the exact answer.
3115 The belief propagation algorithms are exact if applied to trees, and
3116 in some other cases.
3117 Sampling is considered approximate, even though, in the limit of an
3118 infinite number of samples, it gives the exact answer.
3119
3120 <p>
3121
3122 Here is a summary of the properties
3123 of all the engines in BNT which work on static networks.
3124 <p>
3125 <table border=1>
3126 <tr>
3127 <td align=left width=0>Name
3128 <td align=left width=0>Exact?
3129 <td align=left width=0>Node types
3130 <td align=left width=0>Topology
3131 <tr>
3133 <td align=left> belprop
3134 <td align=left> approx
3135 <td align=left> D
3136 <td align=left> DAG
3137 <tr>
3138 <td align=left> belprop_fg
3139 <td align=left> approx
3140 <td align=left> D
3141 <td align=left> factor graph
3142 <tr>
3143 <td align=left> cond_gauss
3144 <td align=left> exact
3145 <td align=left> CG
3146 <td align=left> DAG
3147 <tr>
3148 <td align=left> enumerative
3149 <td align=left> exact
3150 <td align=left> D
3151 <td align=left> DAG
3152 <tr>
3153 <td align=left> gaussian
3154 <td align=left> exact
3155 <td align=left> G
3156 <td align=left> DAG
3157 <tr>
3158 <td align=left> gibbs
3159 <td align=left> approx
3160 <td align=left> D
3161 <td align=left> DAG
3162 <tr>
3163 <td align=left> global_joint
3164 <td align=left> exact
3165 <td align=left> D,G,CG
3166 <td align=left> DAG
3167 <tr>
3168 <td align=left> jtree
3169 <td align=left> exact
3170 <td align=left> D,G,CG
3171 <td align=left> DAG
3172 <tr>
3173 <td align=left> likelihood_weighting
3174 <td align=left> approx
3175 <td align=left> any
3176 <td align=left> DAG
3177 <tr>
3178 <td align=left> pearl
3179 <td align=left> approx
3180 <td align=left> D,G
3181 <td align=left> DAG
3182 <tr>
3183 <td align=left> pearl
3184 <td align=left> exact
3185 <td align=left> D,G
3186 <td align=left> polytree
3187 <tr>
3188 <td align=left> quickscore
3189 <td align=left> exact
3190 <td align=left> noisy-or
3191 <td align=left> QMR
3192 <tr>
3193 <td align=left> stab_cond_gauss
3194 <td align=left> exact
3195 <td align=left> CG
3196 <td align=left> DAG
3197 <tr>
3198 <td align=left> var_elim
3199 <td align=left> exact
3200 <td align=left> D,G,CG
3201 <td align=left> DAG
3202 </table>
3203
3204
3205
3206 <h1><a name="influence">Influence diagrams / decision making</h1>
3207
3208 BNT implements an exact algorithm for solving LIMIDs (limited memory
3209 influence diagrams), described in
3210 <ul>
3211 <li> S. L. Lauritzen and D. Nilsson.
3212 <a href="http://www.math.auc.dk/~steffen/papers/limids.pdf">
3213 Representing and solving decision problems with limited
3214 information</a>.
3215 Management Science, 47:1238-1251, September 2001.
3216 </ul>
3217 LIMIDs explicitly show all information arcs, rather than implicitly
3218 assuming no forgetting. This allows them to model forgetful
3219 controllers.
3220 <p>
3221 See the examples in <tt>BNT/examples/limids</tt> for details.
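<p>
For orientation, here is a rough sketch of how those examples are typically structured
(assuming a LIMID <tt>lim</tt> has already been constructed with <tt>mk_limid</tt>,
with its chance, decision and utility nodes declared; see the example files for the
exact calling conventions):
<pre>
engine = jtree_limid_inf_engine(lim);    % exact engine for LIMIDs
[strategy, MEU] = solve_limid(engine);   % optimal decision policies and max expected utility
</pre>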
3222
3223
3224
3225
3226 <h1>DBNs, HMMs, Kalman filters and all that</h1>
3227
3228 Click <a href="usage_dbn.html">here</a> for documentation about how to
3229 use BNT for dynamical systems and sequence data.
3230
3231
3232 </BODY>