1 <HEAD>
2 <TITLE>How to use the Bayes Net Toolbox</TITLE>
3 </HEAD>
4
5 <BODY BGCOLOR="#FFFFFF">
6 <!-- white background is better for the pictures and equations -->
7
8 <h1>How to use the Bayes Net Toolbox</h1>
9
10 This documentation was last updated on 7 June 2004.
11 <br>
12 Click <a href="changelog.html">here</a> for a list of changes made to
13 BNT.
14 <br>
15 Click
16 <a href="http://bnt.insa-rouen.fr/">here</a>
17 for a French version of this documentation (which might not
18 be up-to-date).
19 <br>
20 Update 23 May 2005:
21 Philippe LeRay has written
22 a
23 <a href="http://banquiseasi.insa-rouen.fr/projects/bnt-editor/">
24 BNT GUI</a>
25 and
26 <a href="http://banquiseasi.insa-rouen.fr/projects/bnt-slp/">
27 BNT Structure Learning Package</a>.
28
29 <p>
30
31 <ul>
32 <li> <a href="#install">Installation</a>
33 <ul>
34 <li> <a href="#install">Installing the Matlab code</a>
35 <li> <a href="#installC">Installing the C code</a>
36 <li> <a href="../matlab_tips.html">Useful Matlab tips</a>.
37 </ul>
38
39 <li> <a href="#basics">Creating your first Bayes net</a>
40 <ul>
41 <li> <a href="#basics">Creating a model by hand</a>
42 <li> <a href="#file">Loading a model from a file</a>
43 <li> <a href="http://bnt.insa-rouen.fr/ajouts.html">Creating a model using a GUI</a>
44 </ul>
45
46 <li> <a href="#inference">Inference</a>
47 <ul>
48 <li> <a href="#marginal">Computing marginal distributions</a>
49 <li> <a href="#joint">Computing joint distributions</a>
50 <li> <a href="#soft">Soft/virtual evidence</a>
51 <li> <a href="#mpe">Most probable explanation</a>
52 </ul>
53
54 <li> <a href="#cpd">Conditional Probability Distributions</a>
55 <ul>
56 <li> <a href="#tabular">Tabular (multinomial) nodes</a>
57 <li> <a href="#noisyor">Noisy-or nodes</a>
58 <li> <a href="#deterministic">Other (noisy) deterministic nodes</a>
59 <li> <a href="#softmax">Softmax (multinomial logit) nodes</a>
60 <li> <a href="#mlp">Neural network nodes</a>
61 <li> <a href="#root">Root nodes</a>
62 <li> <a href="#gaussian">Gaussian nodes</a>
63 <li> <a href="#glm">Generalized linear model nodes</a>
64 <li> <a href="#dtree">Classification/regression tree nodes</a>
65 <li> <a href="#nongauss">Other continuous distributions</a>
66 <li> <a href="#cpd_summary">Summary of CPD types</a>
67 </ul>
68
69 <li> <a href="#examples">Example models</a>
70 <ul>
71 <li> <a
72 href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">
73 Gaussian mixture models</a>
74 <li> <a href="#pca">PCA, ICA, and all that</a>
75 <li> <a href="#mixep">Mixtures of experts</a>
76 <li> <a href="#hme">Hierarchical mixtures of experts</a>
77 <li> <a href="#qmr">QMR</a>
78 <li> <a href="#cg_model">Conditional Gaussian models</a>
79 <li> <a href="#hybrid">Other hybrid models</a>
80 </ul>
81
82 <li> <a href="#param_learning">Parameter learning</a>
83 <ul>
84 <li> <a href="#load_data">Loading data from a file</a>
85 <li> <a href="#mle_complete">Maximum likelihood parameter estimation from complete data</a>
86 <li> <a href="#prior">Parameter priors</a>
87 <li> <a href="#bayes_learn">(Sequential) Bayesian parameter updating from complete data</a>
88 <li> <a href="#em">Maximum likelihood parameter estimation with missing values (EM)</a>
89 <li> <a href="#tying">Parameter tying</a>
90 </ul>
91
92 <li> <a href="#structure_learning">Structure learning</a>
93 <ul>
94 <li> <a href="#enumerate">Exhaustive search</a>
95 <li> <a href="#K2">K2</a>
96 <li> <a href="#hill_climb">Hill-climbing</a>
97 <li> <a href="#mcmc">MCMC</a>
98 <li> <a href="#active">Active learning</a>
99 <li> <a href="#struct_em">Structural EM</a>
100 <li> <a href="#graphdraw">Visualizing the learned graph structure</a>
101 <li> <a href="#constraint">Constraint-based methods</a>
102 </ul>
103
104
105 <li> <a href="#engines">Inference engines</a>
106 <ul>
107 <li> <a href="#jtree">Junction tree</a>
108 <li> <a href="#varelim">Variable elimination</a>
109 <li> <a href="#global">Global inference methods</a>
110 <li> <a href="#quickscore">Quickscore</a>
111 <li> <a href="#belprop">Belief propagation</a>
112 <li> <a href="#sampling">Sampling (Monte Carlo)</a>
113 <li> <a href="#engine_summary">Summary of inference engines</a>
114 </ul>
115
116
117 <li> <a href="#influence">Influence diagrams/ decision making</a>
118
119
120 <li> <a href="usage_dbn.html">DBNs, HMMs, Kalman filters and all that</a>
121 </ul>
122
124
125
126
127
128 <h1><a name="install">Installation</h1>
129
130 <h2><a name="installM">Installing the Matlab code</h2>
131
132 <ul>
133 <li> <a href="bnt_download.html">Download</a> the FullBNT.zip file.
134
135 <p>
136 <li> Unpack the file. In Unix, type
137 <!--"tar xvf BNT.tar".-->
138 "unzip FullBNT.zip".
139 In Windows, use
140 a program like <a href="http://www.winzip.com">Winzip</a>. This will
141 create a directory called FullBNT, which contains BNT and other libraries.
142 (Files ending in ~ or # are emacs backup files, and can be ignored.)
143
144 <p>
145 <li> Read the file <tt>BNT/README.txt</tt> to make sure the date
146 matches the one on the top of <a href=bnt.html>the BNT home page</a>.
147 If not, you may need to press 'refresh' on your browser, and download
148 again, to get the most recent version.
149
150 <p>
151 <li> <b>Edit the file "FullBNT/BNT/add_BNT_to_path.m"</b> so it contains the correct
152 pathname.
153 For example, in Windows,
154 I download FullBNT.zip into C:\kmurphy\matlab, and
156 then ensure the second line reads
156 <pre>
157 BNT_HOME = 'C:\kmurphy\matlab\FullBNT';
158 </pre>
159
160 <p>
161 <li> Start up Matlab.
162
163 <p>
164 <li> Type "ver" at the Matlab prompt (">>").
165 <b>You need Matlab version 5.2 or newer to run BNT</b>.
166 (Versions 5.0 and 5.1 have a memory leak which seems to sometimes
167 crash BNT.)
168 <b>BNT will not run on Octave</b>.
169
170 <p>
171 <li> Move to the BNT directory.
172 For example, in Windows, I type
173 <pre>
174 >> cd C:\kmurphy\matlab\FullBNT\BNT
175 </pre>
176
177 <p>
178 <li> Type "add_BNT_to_path".
179 This executes the command
180 <tt>addpath(genpath(BNT_HOME))</tt>,
181 which adds all directories below FullBNT to the matlab path.
182
183 <p>
184 <li> Type "test_BNT".
185
186 If all goes well, this will produce a bunch of numbers and maybe some
187 warning messages (which you can ignore), but no error messages.
188 (The warnings should only be of the form
189 "Warning: Maximum number of iterations has been exceeded", and are
190 produced by Netlab.)
191
192 <p>
193 <li> Problems? Did you remember to
194 <b>Edit the file "FullBNT/BNT/add_BNT_to_path.m"</b> so it contains
195 the right path??
196
197 <p>
198 <li> <a href="http://groups.yahoo.com/group/BayesNetToolbox/join">
199 Join the BNT email list</a>
200
201 <p>
202 <li>Read
203 <a href="../matlab_tips.html">some useful Matlab tips</a>.
204 <!--
205 For instance, this explains how to create a startup file, which can be
206 used to set your path variable automatically, so you can avoid having
207 to type the above commands every time.
208 -->
209
210 </ul>
211
212
213
214 <h2><a name="installC">Installing the C code</h2>
215
216 Some BNT functions also have C implementations.
217 <b>It is not necessary to install the C code</b>, but it can result in a speedup
218 of a factor of 2-5.
219 To install all the C code,
220 edit installC_BNT.m so it contains the right path,
221 then type <tt>installC_BNT</tt>.
222 (Ignore warnings of the form 'invalid white space character in directive'.)
223 To uninstall all the C code,
224 edit uninstallC_BNT.m so it contains the right path,
225 then type <tt>uninstallC_BNT</tt>.
226 For an up-to-date list of the files which have C implementations, see
227 BNT/installC_BNT.m.
228
229 <p>
230 mex is a script that lets you call C code from Matlab - it does not compile matlab to
231 C (see mcc below).
232 If your C/C++ compiler is set up correctly, mex should work out of
233 the box.
234 If not, you might need to type
235 <p>
236 <tt> mex -setup</tt>
237 <p>
238 before calling installC.
239 <p>
240 To make mex call gcc on Windows,
241 you must install <a
242 href="http://www.mrc-cbu.cam.ac.uk/Imaging/gnumex20.html">gnumex</a>.
243 You can use the <a href="http://www.mingw.org/">minimalist GNU for
244 Windows</a> version of gcc, or
245 the <a href="http://sources.redhat.com/cygwin/">cygwin</a> version.
246 <p>
247 In general, typing
248 'mex foo.c' from inside Matlab creates a file called
249 'foo.mexglx' or 'foo.dll' (the exact file
250 extension is system dependent - on Linux it is 'mexglx', on Windows it is '.dll').
251 The resulting file will hide the original 'foo.m' (if it existed), i.e.,
252 typing 'foo' at the prompt will call the compiled C version.
253 To reveal the original matlab version, just delete foo.mexglx (this is
254 what uninstallC does).
255 <p>
256 Sometimes it takes time for Matlab to realize that the file has
257 changed from matlab to C or vice versa; try typing 'clear all' or
258 restarting Matlab to refresh it.
259 To find out which version of a file you are running, type
260 'which foo'.
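For example (a minimal sketch; 'foo' is just a placeholder name, not a BNT file):
<pre>
>> mex foo.c     % produces foo.mexglx, foo.dll, etc., depending on your platform
>> which foo     % shows whether the compiled or the .m version will be called
>> clear all     % force Matlab to re-scan the path if the answer looks stale
</pre>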
261 <p>
262 <a href="http://www.mathworks.com/products/compiler">mcc</a>, the
263 Matlab to C compiler, is a separate product,
264 and is quite different from mex. It does not yet support
265 objects/classes, which is why we can't compile all of BNT to C automatically.
266 Also, hand-written C code is usually much
267 better than the C code generated by mcc.
268
269
270 <p>
271 Acknowledgements:
272 Most of the C code (e.g., for jtree and dpot) was written by Wei Hu;
273 the triangulation C code was written by Ilya Shpitser;
274 the Gibbs sampling C code (for discrete nodes) was written by Bhaskara
275 Marthi.
276
277
278
279 <h1><a name="basics">Creating your first Bayes net</h1>
280
281 To define a Bayes net, you must specify the graph structure and then
282 the parameters. We look at each in turn, using a simple example
283 (adapted from Russell and
284 Norvig, "Artificial Intelligence: a Modern Approach", Prentice Hall,
285 1995, p454).
286
287
288 <h2>Graph structure</h2>
289
290
291 Consider the following network.
292
293 <p>
294 <center>
295 <IMG SRC="Figures/sprinkler.gif">
296 </center>
297 <p>
298
299 <P>
300 To specify this directed acyclic graph (dag), we create an adjacency matrix:
301 <PRE>
302 N = 4;
303 dag = zeros(N,N);
304 C = 1; S = 2; R = 3; W = 4;
305 dag(C,[R S]) = 1;
306 dag(R,W) = 1;
307 dag(S,W)=1;
308 </PRE>
309 <P>
310 We have numbered the nodes as follows:
311 Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4.
312 <b>The nodes must always be numbered in topological order, i.e.,
313 ancestors before descendants.</b>
314 For a more complicated graph, this is a little inconvenient: we will
315 see how to get around this <a href="usage_dbn.html#bat">below</a>.
316 <p>
317 In Matlab 6, you can use logical arrays instead of double arrays,
318 which are 4 times smaller:
319 <pre>
320 dag = false(N,N);
321 dag(C,[R S]) = true;
322 ...
323 </pre>
324 However, <b>some graph functions (e.g., acyclic) do not work on
325 logical arrays</b>!
326 <p>
327 A preliminary attempt to make a <b>GUI</b>
328 has been written by Philippe LeRay and can be downloaded
329 from <a href="http://bnt.insa-rouen.fr/ajouts.html">here</a>.
330 <p>
331 You can visualize the resulting graph structure using
332 the methods discussed <a href="#graphdraw">below</a>.
333
334 <h2>Creating the Bayes net shell</h2>
335
336 In addition to specifying the graph structure,
337 we must specify the size and type of each node.
338 If a node is discrete, its size is the
339 number of possible values
340 it can take on; if a node is continuous,
341 it can be a vector, and its size is the length of this vector.
342 In this case, we will assume all nodes are discrete and binary.
343 <PRE>
344 discrete_nodes = 1:N;
345 node_sizes = 2*ones(1,N);
346 </pre>
347 If the nodes were not binary, you could type e.g.,
348 <pre>
349 node_sizes = [4 2 3 5];
350 </pre>
351 meaning that Cloudy has 4 possible values,
352 Sprinkler has 2 possible values, etc.
353 Note that these values are categorical, not ordinal, i.e.,
354 they are not ordered in any way (unlike, say, 'low', 'medium', 'high').
355 <p>
356 We are now ready to make the Bayes net:
357 <pre>
358 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes);
359 </PRE>
360 By default, all nodes are assumed to be discrete, so we can also just
361 write
362 <pre>
363 bnet = mk_bnet(dag, node_sizes);
364 </PRE>
365 You may also specify which nodes will be observed.
366 If you don't know, or if this is not fixed in advance,
367 just use the empty list (the default).
368 <pre>
369 onodes = [];
370 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);
371 </PRE>
372 Note that optional arguments are specified using a name/value syntax.
373 This is common for many BNT functions.
374 In general, to find out more about a function (e.g., which optional
375 arguments it takes), please see its
376 documentation string by typing
377 <pre>
378 help mk_bnet
379 </pre>
380 See also other <a href="../matlab_tips.html">useful Matlab tips</a>.
381 <p>
382 It is possible to associate names with nodes, as follows:
383 <pre>
384 bnet = mk_bnet(dag, node_sizes, 'names', {'cloudy','S','R','W'}, 'discrete', 1:4);
385 </pre>
386 You can then refer to a node by its name:
387 <pre>
388 C = bnet.names('cloudy'); % bnet.names is an associative array
389 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
390 </pre>
391 This feature uses my own associative array class.
392
393
394 <h2><a name="cpt">Parameters</h2>
395
396 A model consists of the graph structure and the parameters.
397 The parameters are represented by CPD objects (CPD = Conditional
398 Probability Distribution), which define the probability distribution
399 of a node given its parents.
400 (We will use the terms "node" and "random variable" interchangeably.)
401 The simplest kind of CPD is a table (multi-dimensional array), which
402 is suitable when all the nodes are discrete-valued. Note that the discrete
403 values are not assumed to be ordered in any way; that is, they
404 represent categorical quantities, like male and female, rather than
405 ordinal quantities, like low, medium and high.
406 (We will discuss CPDs in more detail <a href="#cpd">below</a>.)
407 <p>
408 Tabular CPDs, also called CPTs (conditional probability tables),
409 are stored as multidimensional arrays, where the dimensions
410 are arranged in the same order as the nodes, e.g., the CPT for node 4
411 (WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself.
412 Hence the child is always the last dimension.
413 If a node has no parents, its CPT is a column vector representing its
414 prior.
415 Note that in Matlab (unlike C), arrays are indexed
416 from 1, and are laid out in memory such that the first index toggles
417 fastest, e.g., the CPT for node 4 (WetGrass) is as follows
418 <P>
419 <P><IMG ALIGN=BOTTOM SRC="Figures/CPTgrass.gif"><P>
420 <P>
421 where we have used the convention that false==1, true==2.
422 We can create this CPT in Matlab as follows
423 <PRE>
424 CPT = zeros(2,2,2);
425 CPT(1,1,1) = 1.0;
426 CPT(2,1,1) = 0.1;
427 ...
428 </PRE>
429 Here is an easier way:
430 <PRE>
431 CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);
432 </PRE>
433 In fact, we don't need to reshape the array, since the CPD constructor
434 will do that for us. So we can just write
435 <pre>
436 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
437 </pre>
438 The other nodes are created similarly (using the old syntax for
439 optional parameters)
440 <PRE>
441 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
442 bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
443 bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
444 bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
445 </PRE>
446
447
448 <h2><a name="rnd_cpt">Random Parameters</h2>
449
450 If we do not specify the CPT, random parameters will be
451 created, i.e., each "row" of the CPT will be drawn from the uniform distribution.
452 To ensure repeatable results, use
453 <pre>
454 rand('state', seed);
455 randn('state', seed);
456 </pre>
457 To control the degree of randomness (entropy),
458 you can sample each row of the CPT from a Dirichlet(p,p,...) distribution.
459 If p << 1, this encourages "deterministic" CPTs (one entry near 1, the rest near 0).
460 If p = 1, each entry is drawn from U[0,1].
461 If p >> 1, the entries will all be near 1/k, where k is the arity of
462 this node, i.e., each row will be nearly uniform.
463 You can do this as follows, assuming this node
464 is number i, and ns is the node_sizes.
465 <pre>
466 k = ns(i);
467 ps = parents(dag, i);
468 psz = prod(ns(ps));
469 CPT = sample_dirichlet(p*ones(1,k), psz);
470 bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT);
471 </pre>
472
473
474 <h2><a name="file">Loading a network from a file</h2>
475
476 If you already have a Bayes net represented in the XML-based
477 <a href="http://www.cs.cmu.edu/afs/cs/user/fgcozman/www/Research/InterchangeFormat/">
478 Bayes Net Interchange Format (BNIF)</a> (e.g., downloaded from the
479 <a
480 href="http://www.cs.huji.ac.il/labs/compbio/Repository">
481 Bayes Net repository</a>),
482 you can convert it to BNT format using
483 the
484 <a href="http://www.digitas.harvard.edu/~ken/bif2bnt/">BIF-BNT Java
485 program</a> written by Ken Shan.
486 (This is not necessarily up-to-date.)
487 <p>
488 <b>It is currently not possible to save/load a BNT matlab object to
489 file</b>, but this is easily fixed if you modify all the constructors
490 for all the classes (see matlab documentation).
491
492 <h2>Creating a model using a GUI</h2>
493
494 Click <a href="http://bnt.insa-rouen.fr/ajouts.html">here</a>.
495
496
497
498 <h1><a name="inference">Inference</h1>
499
500 Having created the BN, we can now use it for inference.
501 There are many different algorithms for doing inference in Bayes nets,
502 which make different tradeoffs between speed,
503 complexity, generality, and accuracy.
504 BNT therefore offers a variety of different inference
505 "engines". We will discuss these
506 in more detail <a href="#engines">below</a>.
507 For now, we will use the junction tree
508 engine, which is the mother of all exact inference algorithms.
509 This can be created as follows.
510 <pre>
511 engine = jtree_inf_engine(bnet);
512 </pre>
513 The other engines have similar constructors, but might take
514 additional, algorithm-specific parameters.
515 All engines are used in the same way, once they have been created.
516 We illustrate this in the following sections.
517
518
519 <h2><a name="marginal">Computing marginal distributions</h2>
520
521 Suppose we want to compute the probability that the sprinkler was on
522 given that the grass is wet.
523 The evidence consists of the fact that W=2. All the other nodes
524 are hidden (unobserved). We can specify this as follows.
525 <pre>
526 evidence = cell(1,N);
527 evidence{W} = 2;
528 </pre>
529 We use a 1D cell array instead of a vector to
530 cope with the fact that nodes can be vectors of different lengths.
531 In addition, the value [] can be used
532 to denote 'no evidence', instead of having to specify the observation
533 pattern as a separate argument.
534 (Click <a href="cellarray.html">here</a> for a quick tutorial on cell
535 arrays in matlab.)
536 <p>
537 We are now ready to add the evidence to the engine.
538 <pre>
539 [engine, loglik] = enter_evidence(engine, evidence);
540 </pre>
541 The behavior of this function is algorithm-specific, and is discussed
542 in more detail <a href="#engines">below</a>.
543 In the case of the jtree engine,
544 enter_evidence implements a two-pass message-passing scheme.
545 The first return argument contains the modified engine, which
546 incorporates the evidence. The second return argument contains the
547 log-likelihood of the evidence. (Not all engines are capable of
548 computing the log-likelihood.)
549 <p>
550 Finally, we can compute p=P(S=2|W=2) as follows.
551 <PRE>
552 marg = marginal_nodes(engine, S);
553 marg.T
554 ans =
555 0.57024
556 0.42976
557 p = marg.T(2);
558 </PRE>
559 We see that p = 0.4298.
560 <p>
561 Now let us add the evidence that it was raining, and see what
562 difference it makes.
563 <PRE>
564 evidence{R} = 2;
565 [engine, loglik] = enter_evidence(engine, evidence);
566 marg = marginal_nodes(engine, S);
567 p = marg.T(2);
568 </PRE>
569 We find that p = P(S=2|W=2,R=2) = 0.1945,
570 which is lower than
571 before, because the rain can ``explain away'' the
572 fact that the grass is wet.
573 <p>
574 You can plot a marginal distribution over a discrete variable
575 as a bar chart using the built-in 'bar' function:
576 <pre>
577 bar(marg.T)
578 </pre>
579 This is what it looks like
580
581 <p>
582 <center>
583 <IMG SRC="Figures/sprinkler_bar.gif">
584 </center>
585 <p>
586
587 <h2><a name="observed">Observed nodes</h2>
588
589 What happens if we ask for the marginal on an observed node, e.g. P(W|W=2)?
590 An observed discrete node effectively only has 1 value (the observed
591 one) --- all other values would result in 0 probability.
592 For efficiency, BNT treats observed (discrete) nodes as if their effective
593 size were set to 1, as we see below:
594 <pre>
595 evidence = cell(1,N);
596 evidence{W} = 2;
597 engine = enter_evidence(engine, evidence);
598 m = marginal_nodes(engine, W);
599 m.T
600 ans =
601 1
602 </pre>
603 This can get a little confusing, since we assigned W=2.
604 So we can ask BNT to add the evidence back in by passing in an optional argument:
605 <pre>
606 m = marginal_nodes(engine, W, 1);
607 m.T
608 ans =
609 0
610 1
611 </pre>
612 This shows that P(W=1|W=2) = 0 and P(W=2|W=2) = 1.
613
614
615
616 <h2><a name="joint">Computing joint distributions</h2>
617
618 We can compute the joint probability on a set of nodes as in the
619 following example.
620 <pre>
621 evidence = cell(1,N);
622 [engine, ll] = enter_evidence(engine, evidence);
623 m = marginal_nodes(engine, [S R W]);
624 </pre>
625 m is a structure. The 'T' field is a multi-dimensional array (in
626 this case, 3-dimensional) that contains the joint probability
627 distribution on the specified nodes.
628 <pre>
629 >> m.T
630 ans(:,:,1) =
631 0.2900 0.0410
632 0.0210 0.0009
633 ans(:,:,2) =
634 0 0.3690
635 0.1890 0.0891
636 </pre>
637 We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass
638 to be wet if both the rain and sprinkler are off.
639 <p>
640 Let us now add some evidence to R.
641 <pre>
642 evidence{R} = 2;
643 [engine, ll] = enter_evidence(engine, evidence);
644 m = marginal_nodes(engine, [S R W])
645 m =
646 domain: [2 3 4]
647 T: [2x1x2 double]
648 >> m.T
649 m.T
650 ans(:,:,1) =
651 0.0820
652 0.0018
653 ans(:,:,2) =
654 0.7380
655 0.1782
656 </pre>
657 The joint T(i,j,k) = P(S=i,R=j,W=k|evidence)
658 should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible
659 with the evidence that R=2.
660 Instead of creating large tables with many 0s, BNT sets the effective
661 size of observed (discrete) nodes to 1, as explained above.
662 This is why m.T has size 2x1x2.
663 To get a 2x2x2 table, type
664 <pre>
665 m = marginal_nodes(engine, [S R W], 1)
666 m =
667 domain: [2 3 4]
668 T: [2x2x2 double]
669 >> m.T
670 m.T
671 ans(:,:,1) =
672 0 0.082
673 0 0.0018
674 ans(:,:,2) =
675 0 0.738
676 0 0.1782
677 </pre>
678
679 <p>
680 Note: It is not always possible to compute the joint on arbitrary
681 sets of nodes: it depends on which inference engine you use, as discussed
682 in more detail <a href="#engines">below</a>.
683
684
685 <h2><a name="soft">Soft/virtual evidence</h2>
686
687 Sometimes a node is not observed, but we have some distribution over
688 its possible values; this is often called "soft" or "virtual"
689 evidence.
690 One can use this as follows
691 <pre>
692 [engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
693 </pre>
694 where soft_evidence{i} is either [] (if node i has no soft evidence)
695 or is a vector representing the probability distribution over i's
696 possible values.
697 For example, if we don't know i's exact value, but we know its
698 likelihood ratio is 60/40, we can write evidence{i} = [] and
699 soft_evidence{i} = [0.6 0.4].
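In the sprinkler network above, this looks as follows (a minimal sketch,
assuming bnet, N and S are defined as in the earlier examples):
<pre>
evidence = cell(1,N);            % no hard evidence
soft_evidence = cell(1,N);
soft_evidence{S} = [0.6 0.4];    % likelihoods for S=1 (off) and S=2 (on)
engine = jtree_inf_engine(bnet);
[engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
marg = marginal_nodes(engine, S);
</pre>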
700 <p>
701 Currently only jtree_inf_engine supports this option.
702 It assumes that all hidden nodes, and all nodes for
703 which we have soft evidence, are discrete.
704 For a longer example, see BNT/examples/static/softev1.m.
705
706
707 <h2><a name="mpe">Most probable explanation</h2>
708
709 To compute the most probable explanation (MPE) of the evidence (i.e.,
710 the most probable assignment, or a mode of the joint), use
711 <pre>
712 [mpe, ll] = calc_mpe(engine, evidence);
713 </pre>
714 mpe{i} is the most likely value of node i.
715 This calls enter_evidence with the 'maximize' flag set to 1, which
716 causes the engine to do max-product instead of sum-product.
717 The resulting max-marginals are then thresholded.
718 If there is more than one maximum probability assignment, we must take
719 care to break ties in a consistent manner (thresholding the
720 max-marginals may give the wrong result). To force this behavior,
721 type
722 <pre>
723 [mpe, ll] = calc_mpe(engine, evidence, 1);
724 </pre>
725 Note that computing the MPE is sometimes called abductive reasoning.
726
727 <p>
728 You can also use <tt>calc_mpe_bucket</tt> written by Ron Zohar,
729 which does a forwards max-product pass, and then a backwards traceback
730 pass, which is how Viterbi is traditionally implemented.
731
732
733
734 <h1><a name="cpd">Conditional Probability Distributions</h1>
735
736 A Conditional Probability Distribution (CPD)
737 defines P(X(i) | X(Pa(i))), where X(i) is the i'th node, and X(Pa(i))
738 are the parents of node i. There are many ways to represent this
739 distribution, which depend in part on whether X(i) and X(Pa(i)) are
740 discrete, continuous, or a combination.
741 We will discuss various representations below.
742
743
744 <h2><a name="tabular">Tabular nodes</h2>
745
746 If the CPD is represented as a table (i.e., if it is a multinomial
747 distribution), it has a number of parameters that is exponential in
748 the number of parents. See the example <a href="#cpt">above</a>.
749
750
751 <h2><a name="noisyor">Noisy-or nodes</h2>
752
753 A noisy-OR node is like a regular logical OR gate except that
754 sometimes the effects of parents that are on get inhibited.
755 Let the prob. that parent i gets inhibited be q(i).
756 Then a node, C, with 2 parents, A and B, has the following CPD, where
757 we use F and T to represent off and on (1 and 2 in BNT).
758 <pre>
759 A  B    P(C=off)     P(C=on)
760 ----------------------------------
761 F  F    1.0          0.0
762 T  F    q(A)         1-q(A)
763 F  T    q(B)         1-q(B)
764 T  T    q(A)q(B)     1-q(A)q(B)
765 </pre>
766 Thus we see that the causes get inhibited independently.
767 It is common to associate a "leak" node with a noisy-or CPD, which is
768 like a parent that is always on. This can account for all other unmodelled
769 causes which might turn the node on.
770 <p>
771 The noisy-or distribution is similar to the logistic distribution.
772 To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j)
773 be the prob. that j inhibits i. Then
774 <pre>
775 Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j)
776 </pre>
777 Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then
778 <pre>
779 Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j))
780 </pre>
781 For a sigmoid node, we have
782 <pre>
783 Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j))
784 </pre>
785 where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of
786 the activation function (although both are monotonically increasing).
787 In addition, in the case of a noisy-or, the weights are constrained to be
788 positive, since they derive from probabilities q(i,j).
789 In both cases, the number of parameters is <em>linear</em> in the
790 number of parents, unlike the case of a multinomial distribution,
791 where the number of parameters is exponential in the number of parents.
792 We will see an example of noisy-OR nodes <a href="#qmr">below</a>.
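<p>
As a preview, a noisy-OR CPD is created with the noisyor_CPD constructor,
using the same argument convention as the QMR code below
(a minimal sketch: A, B and C are assumed to be binary nodes of an existing
bnet, and the inhibition probabilities are arbitrary):
<pre>
q_leak = 1;         % inhibition prob. on the leak->C arc (the leak never turns C on)
q = [0.1 0.2];      % inhibition probs q(A) and q(B)
bnet.CPD{C} = noisyor_CPD(bnet, C, q_leak, q);
</pre>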
793
794
795 <h2><a name="deterministic">Other (noisy) deterministic nodes</h2>
796
797 Deterministic CPDs for discrete random variables can be created using
798 the deterministic_CPD class. It is also possible to 'flip' the output
799 of the function with some probability, to simulate noise.
800 The boolean_CPD class is just a special case of a
801 deterministic CPD, where the parents and child are all binary.
802 <p>
803 Both of these classes are just "syntactic sugar" for the tabular_CPD
804 class.
805
806
807
808 <h2><a name="softmax">Softmax nodes</h2>
809
810 If we have a discrete node with a continuous parent,
811 we can define its CPD using a softmax function
812 (also known as the multinomial logit function).
813 This acts like a soft thresholding operator, and is defined as follows:
814 <pre>
815                      exp(w(:,i)'*x + b(i))
816 Pr(Q=i | X=x) = -----------------------------
817                   sum_j exp(w(:,j)'*x + b(j))
818
819 </pre>
820 The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the
821 following interpretation: w(:,i)-w(:,j) is the normal vector to the
822 decision boundary between classes i and j,
823 and b(i)-b(j) is its offset (bias). For example, suppose
824 X is a 2-vector, and Q is binary. Then
825 <pre>
826 w = [1 -1;
827 0 0];
828
829 b = [0 0];
830 </pre>
831 means class 1 are points in the 2D plane with positive x coordinate,
832 and class 2 are points in the 2D plane with negative x coordinate.
833 If w has large magnitude, the decision boundary is sharp, otherwise it
834 is soft.
835 In the special case that Q is binary (0/1), the softmax function reduces to the logistic
836 (sigmoid) function.
837 <p>
838 Fitting a softmax function can be done using the iteratively reweighted
839 least squares (IRLS) algorithm.
840 We use the implementation from
841 <a href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>.
842 Note that since
843 the softmax distribution is not in the exponential family, it does not
844 have finite sufficient statistics, and hence we must store all the
845 training data in uncompressed form.
846 If this takes too much space, one should use online (stochastic) gradient
847 descent (not implemented in BNT).
848 <p>
849 If a softmax node also has discrete parents,
850 we use a different set of w/b parameters for each combination of
851 parent values, as in the <a href="#gaussian">conditional linear
852 Gaussian CPD</a>.
853 This feature was implemented by Pierpaolo Brutti.
854 He is currently extending it so that discrete parents can be treated
855 as if they were continuous, by adding indicator variables to the X
856 vector.
857 <p>
858 We will see an example of softmax nodes <a href="#mixexp">below</a>.
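<p>
As with the other CPD types, a softmax CPD with random w and b parameters is
created by passing just the bnet and the node index (a minimal sketch; Q is
assumed to be a discrete node with at least one continuous parent):
<pre>
bnet.CPD{Q} = softmax_CPD(bnet, Q);   % random weights w and offsets b
</pre>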
859
860
861 <h2><a name="mlp">Neural network nodes</h2>
862
863 Pierpaolo Brutti has implemented the mlp_CPD class, which uses a multi-layer perceptron
864 to implement a mapping from continuous parents to discrete children,
865 similar to the softmax function.
866 (If there are also discrete parents, it creates a mixture of MLPs.)
867 It uses code from <a
868 href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>.
869 This is work in progress.
870
871 <h2><a name="root">Root nodes</h2>
872
873 A root node has no parents and no parameters; it can be used to model
874 an observed, exogenous input variable, i.e., one which is "outside"
875 the model.
876 This is useful for conditional density models.
877 We will see an example of root nodes <a href="#mixexp">below</a>.
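<p>
A root CPD has no parameters, so its constructor takes only the bnet and the
node index (a minimal sketch; X denotes the observed, exogenous input node):
<pre>
bnet.CPD{X} = root_CPD(bnet, X);
</pre>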
878
879
880 <h2><a name="gaussian">Gaussian nodes</h2>
881
882 We now consider a distribution suitable for the continuous-valued nodes.
883 Suppose the node is called Y, its continuous parents (if any) are
884 called X, and its discrete parents (if any) are called Q.
885 The distribution on Y is defined as follows:
886 <pre>
887 - no parents: Y ~ N(mu, Sigma)
888 - cts parents : Y|X=x ~ N(mu + W x, Sigma)
889 - discrete parents: Y|Q=i ~ N(mu(:,i), Sigma(:,:,i))
890 - cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i))
891 </pre>
892 where N(mu, Sigma) denotes a Normal distribution with mean mu and
893 covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q
894 respectively.
895 If there are no discrete parents, |Q|=1; otherwise,
896 |Q| = the number of joint settings of the discrete parents (the product of their sizes).
897 If there are no continuous parents, |X|=0; otherwise,
898 |X| = the sum of their sizes.
899 Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive
900 semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight)
901 matrix.
902 <p>
903 We can create a Gaussian node with random parameters as follows.
904 <pre>
905 bnet.CPD{i} = gaussian_CPD(bnet, i);
906 </pre>
907 We can specify the value of one or more of the parameters as in the
908 following example, in which |Y|=2, and |Q|=1.
909 <pre>
910 bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', [0; 0], 'weights', randn(Y,X), 'cov', eye(Y));
911 </pre>
912 <p>
913 We will see an example of conditional linear Gaussian nodes <a
914 href="#cg_model">below</a>.
915 <p>
916 <b>When learning Gaussians from data</b>, it is helpful to ensure the
917 data has a small magnitude
918 (see e.g., KPMstats/standardize) to prevent numerical problems.
919 Unless you have a lot of data, it is also a very good idea to use
920 diagonal instead of full covariance matrices.
921 (BNT does not currently support spherical covariances, although it
922 would be easy to add, since KPMstats/clg_Mstep supports this option;
923 you would just need to modify gaussian_CPD/update_ess to accumulate
924 weighted inner products.)
925
926
927
928 <h2><a name="nongauss">Other continuous distributions</h2>
929
930 Currently BNT does not support any CPDs for continuous nodes other
931 than the Gaussian.
932 However, you can use a mixture of Gaussians to
933 approximate other continuous distributions. We will see an example
934 of this with the IFA model <a href="#pca">below</a>.
935
936
937 <h2><a name="glm">Generalized linear model nodes</h2>
938
939 In the future, we may incorporate some of the functionality of
940 <a href =
941 "http://www.sci.usq.edu.au/staff/dunn/glmlab/glmlab.html">glmlab</a>
942 into BNT.
943
944
945 <h2><a name="dtree">Classification/regression tree nodes</h2>
946
947 We plan to add classification and regression trees to define CPDs for
948 discrete and continuous nodes, respectively.
949 Trees have many advantages: they are easy to interpret, they can do
950 feature selection, they can
951 handle discrete and continuous inputs, they do not make strong
952 assumptions about the form of the distribution, the number of
953 parameters can grow in a data-dependent way (i.e., they are
954 semi-parametric), they can handle missing data, etc.
955 However, they are not yet implemented.
956 <!--
957 Yimin Zhang is currently (Feb '02) implementing this.
958 -->
959
960
961 <h2><a name="cpd_summary">Summary of CPD types</h2>
962
963 We list all the different types of CPDs supported by BNT.
964 For each CPD, we specify if the child and parents can be discrete (D) or
965 continuous (C) (Binary (B) nodes are a special case).
966 We also specify which methods each class supports.
967 If a method is inherited, the name of the parent class is mentioned.
968 If a parent class calls a child method, this is mentioned.
969 <p>
970 The <tt>CPD_to_CPT</tt> method converts a CPD to a table; this
971 requires that the child and all parents are discrete.
972 The CPT might be exponentially big...
973 <tt>convert_to_table</tt> evaluates a CPD with evidence, and
974 represents the resulting potential as an array.
975 This requires that the child is discrete, and any continuous parents
976 are observed.
977 <tt>convert_to_pot</tt> evaluates a CPD with evidence, and
978 represents the resulting potential as a dpot, gpot, cgpot or upot, as
979 requested. (d=discrete, g=Gaussian, cg = conditional Gaussian, u =
980 utility).
981
982 <p>
983 When we sample a node, all the parents are observed.
984 When we compute the (log) probability of a node, all the parents and
985 the child are observed.
986 <p>
987 We also specify if the parameters are learnable.
988 For learning with EM, we require
989 the methods <tt>reset_ess</tt>, <tt>update_ess</tt> and
990 <tt>maximize_params</tt>.
991 For learning from fully observed data, we require
992 the method <tt>learn_params</tt>.
993 By default, all classes inherit this from generic_CPD, which simply
994 calls <tt>update_ess</tt> N times, once for each data case, followed
995 by <tt>maximize_params</tt>, i.e., it is like EM, without the E step.
996 Some classes implement a batch formula, which is quicker.
997 <p>
998 Bayesian learning means computing a posterior over the parameters
999 given fully observed data.
1000 <p>
1001 Pearl means we implement the methods <tt>compute_pi</tt> and
1002 <tt>compute_lambda_msg</tt>, used by
1003 <tt>pearl_inf_engine</tt>, which runs on directed graphs.
1004 <tt>belprop_inf_engine</tt> only needs <tt>convert_to_pot</tt>.
1005 The pearl methods can exploit special properties of the CPDs for
1006 computing the messages efficiently, whereas belprop does not.
1007 <p>
1008 The only method implemented by generic_CPD is <tt>adjustable_CPD</tt>,
1009 which is not shown, since it is not very interesting.
1010
1011
1012 <p>
1013
1014
1015 <table border=1>
1016 <tr>
1017 <td align=center>Name
1018 <td align=center>Child
1019 <td align=center>Parents
1020 <td align=center>Comments
1021 <td align=center>CPD_to_CPT
1022 <td align=center>conv_to_table
1023 <td align=center>conv_to_pot
1024 <td align=center>sample
1025 <td align=center>prob
1026 <td align=center>learn
1027 <td align=center>Bayes
1028 <td align=center>Pearl
1029
1030
1031 <tr>
1032 <!-- Name--><td>
1033 <!-- Child--><td>
1034 <!-- Parents--><td>
1035 <!-- Comments--><td>
1036 <!-- CPD_to_CPT--><td>
1037 <!-- conv_to_table--><td>
1038 <!-- conv_to_pot--><td>
1039 <!-- sample--><td>
1040 <!-- prob--><td>
1041 <!-- learn--><td>
1042 <!-- Bayes--><td>
1043 <!-- Pearl--><td>
1044
1045 <tr>
1046 <!-- Name--><td>boolean
1047 <!-- Child--><td>B
1048 <!-- Parents--><td>B
1049 <!-- Comments--><td>Syntactic sugar for tabular
1050 <!-- CPD_to_CPT--><td>-
1051 <!-- conv_to_table--><td>-
1052 <!-- conv_to_pot--><td>-
1053 <!-- sample--><td>-
1054 <!-- prob--><td>-
1055 <!-- learn--><td>-
1056 <!-- Bayes--><td>-
1057 <!-- Pearl--><td>-
1058
1059 <tr>
1060 <!-- Name--><td>deterministic
1061 <!-- Child--><td>D
1062 <!-- Parents--><td>D
1063 <!-- Comments--><td>Syntactic sugar for tabular
1064 <!-- CPD_to_CPT--><td>-
1065 <!-- conv_to_table--><td>-
1066 <!-- conv_to_pot--><td>-
1067 <!-- sample--><td>-
1068 <!-- prob--><td>-
1069 <!-- learn--><td>-
1070 <!-- Bayes--><td>-
1071 <!-- Pearl--><td>-
1072
1073 <tr>
1074 <!-- Name--><td>Discrete
1075 <!-- Child--><td>D
1076 <!-- Parents--><td>C/D
1077 <!-- Comments--><td>Virtual class
1078 <!-- CPD_to_CPT--><td>N
1079 <!-- conv_to_table--><td>Calls CPD_to_CPT
1080 <!-- conv_to_pot--><td>Calls conv_to_table
1081 <!-- sample--><td>Calls conv_to_table
1082 <!-- prob--><td>Calls conv_to_table
1083 <!-- learn--><td>N
1084 <!-- Bayes--><td>N
1085 <!-- Pearl--><td>N
1086
1087 <tr>
1088 <!-- Name--><td>Gaussian
1089 <!-- Child--><td>C
1090 <!-- Parents--><td>C/D
1091 <!-- Comments--><td>-
1092 <!-- CPD_to_CPT--><td>N
1093 <!-- conv_to_table--><td>N
1094 <!-- conv_to_pot--><td>Y
1095 <!-- sample--><td>Y
1096 <!-- prob--><td>Y
1097 <!-- learn--><td>Y
1098 <!-- Bayes--><td>N
1099 <!-- Pearl--><td>N
1100
1101 <tr>
1102 <!-- Name--><td>gmux
1103 <!-- Child--><td>C
1104 <!-- Parents--><td>C/D
1105 <!-- Comments--><td>multiplexer
1106 <!-- CPD_to_CPT--><td>N
1107 <!-- conv_to_table--><td>N
1108 <!-- conv_to_pot--><td>Y
1109 <!-- sample--><td>N
1110 <!-- prob--><td>N
1111 <!-- learn--><td>N
1112 <!-- Bayes--><td>N
1113 <!-- Pearl--><td>Y
1114
1115
1116 <tr>
1117 <!-- Name--><td>MLP
1118 <!-- Child--><td>D
1119 <!-- Parents--><td>C/D
1120 <!-- Comments--><td>multi layer perceptron
1121 <!-- CPD_to_CPT--><td>N
1122 <!-- conv_to_table--><td>Y
1123 <!-- conv_to_pot--><td>Inherits from discrete
1124 <!-- sample--><td>Inherits from discrete
1125 <!-- prob--><td>Inherits from discrete
1126 <!-- learn--><td>Y
1127 <!-- Bayes--><td>N
1128 <!-- Pearl--><td>N
1129
1130
1131 <tr>
1132 <!-- Name--><td>noisy-or
1133 <!-- Child--><td>B
1134 <!-- Parents--><td>B
1135 <!-- Comments--><td>-
1136 <!-- CPD_to_CPT--><td>Y
1137 <!-- conv_to_table--><td>Inherits from discrete
1138 <!-- conv_to_pot--><td>Inherits from discrete
1139 <!-- sample--><td>Inherits from discrete
1140 <!-- prob--><td>Inherits from discrete
1141 <!-- learn--><td>N
1142 <!-- Bayes--><td>N
1143 <!-- Pearl--><td>Y
1144
1145
1146 <tr>
1147 <!-- Name--><td>root
1148 <!-- Child--><td>C/D
1149 <!-- Parents--><td>none
1150 <!-- Comments--><td>no params
1151 <!-- CPD_to_CPT--><td>N
1152 <!-- conv_to_table--><td>N
1153 <!-- conv_to_pot--><td>Y
1154 <!-- sample--><td>Y
1155 <!-- prob--><td>Y
1156 <!-- learn--><td>N
1157 <!-- Bayes--><td>N
1158 <!-- Pearl--><td>N
1159
1160
1161 <tr>
1162 <!-- Name--><td>softmax
1163 <!-- Child--><td>D
1164 <!-- Parents--><td>C/D
1165 <!-- Comments--><td>-
1166 <!-- CPD_to_CPT--><td>N
1167 <!-- conv_to_table--><td>Y
1168 <!-- conv_to_pot--><td>Inherits from discrete
1169 <!-- sample--><td>Inherits from discrete
1170 <!-- prob--><td>Inherits from discrete
1171 <!-- learn--><td>Y
1172 <!-- Bayes--><td>N
1173 <!-- Pearl--><td>N
1174
1175
1176 <tr>
1177 <!-- Name--><td>generic
1178 <!-- Child--><td>C/D
1179 <!-- Parents--><td>C/D
1180 <!-- Comments--><td>Virtual class
1181 <!-- CPD_to_CPT--><td>N
1182 <!-- conv_to_table--><td>N
1183 <!-- conv_to_pot--><td>N
1184 <!-- sample--><td>N
1185 <!-- prob--><td>N
1186 <!-- learn--><td>N
1187 <!-- Bayes--><td>N
1188 <!-- Pearl--><td>N
1189
1190
1191 <tr>
1192 <!-- Name--><td>Tabular
1193 <!-- Child--><td>D
1194 <!-- Parents--><td>D
1195 <!-- Comments--><td>-
1196 <!-- CPD_to_CPT--><td>Y
1197 <!-- conv_to_table--><td>Inherits from discrete
1198 <!-- conv_to_pot--><td>Inherits from discrete
1199 <!-- sample--><td>Inherits from discrete
1200 <!-- prob--><td>Inherits from discrete
1201 <!-- learn--><td>Y
1202 <!-- Bayes--><td>Y
1203 <!-- Pearl--><td>Y
1204
1205 </table>
1206
1207
1208
1209 <h1><a name="examples">Example models</h1>
1210
1211
1212 <h2>Gaussian mixture models</h2>
1213
1214 Richard W. DeVaul has made a detailed tutorial on how to fit mixtures
1215 of Gaussians using BNT. Available
1216 <a href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">here</a>.
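<p>
As a quick sketch (the values of M and D are assumed, and the initial
parameters are random), a mixture of M Gaussians in D dimensions is just a
two-node model, which can then be fitted with EM as described in the
<a href="#param_learning">parameter learning</a> section:
<pre>
M = 3; D = 2;                    % e.g., 3 mixture components in 2 dimensions
dag = zeros(2,2); dag(1,2) = 1;  % discrete class node -> Gaussian observation
bnet = mk_bnet(dag, [M D], 'discrete', 1, 'observed', 2);
bnet.CPD{1} = tabular_CPD(bnet, 1);    % mixing weights (random)
bnet.CPD{2} = gaussian_CPD(bnet, 2);   % means and covariances (random)
</pre>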
1217
1218
1219 <h2><a name="pca">PCA, ICA, and all that </h2>
1220
1221 In Figure (a) below, we show how Factor Analysis can be thought of as a
1222 graphical model. Here, X has an N(0,I) prior, and
1223 Y|X=x ~ N(mu + Wx, Psi),
1224 where Psi is diagonal and W is called the "factor loading matrix".
1225 Since the noise on both X and Y is diagonal, the components of these
1226 vectors are uncorrelated, and hence can be represented as individual
1227 scalar nodes, as we show in (b).
1228 (This is useful if parts of the observations on the Y vector are occasionally missing.)
1229 We usually take k=|X| << |Y|=D, so the model tries to explain
1230 many observations using a low-dimensional subspace.
1231
1232
1233 <center>
1234 <table>
1235 <tr>
1236 <td><img src="Figures/fa.gif">
1237 <td><img src="Figures/fa_scalar.gif">
1238 <td><img src="Figures/mfa.gif">
1239 <td><img src="Figures/ifa.gif">
1240 <tr>
1241 <td align=center> (a)
1242 <td align=center> (b)
1243 <td align=center> (c)
1244 <td align=center> (d)
1245 </table>
1246 </center>
1247
1248 <p>
1249 We can create this model in BNT as follows.
1250 <pre>
1251 ns = [k D];
1252 dag = zeros(2,2);
1253 dag(1,2) = 1;
1254 bnet = mk_bnet(dag, ns, 'discrete', []);
1255 bnet.CPD{1} = gaussian_CPD(bnet, 1, 'mean', zeros(k,1), 'cov', eye(k), ...
1256 'cov_type', 'diag', 'clamp_mean', 1, 'clamp_cov', 1);
1257 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(D,1), 'cov', diag(Psi0), 'weights', W0, ...
1258 'cov_type', 'diag', 'clamp_mean', 1);
1259 </pre>
1260
1261 The root node is clamped to the N(0,I) distribution, so that we will
1262 not update these parameters during learning.
1263 The mean of the leaf node is clamped to 0,
1264 since we assume the data has been centered (had its mean subtracted
1265 off); this is just for simplicity.
1266 Finally, the covariance of the leaf node is constrained to be
1267 diagonal. W0 and Psi0 are the initial parameter guesses.
1268
1269 <p>
1270 We can fit this model (i.e., estimate its parameters in a maximum
1271 likelihood (ML) sense) using EM, as we
1272 explain <a href="#em">below</a>.
1273 Not surprisingly, the ML estimates for mu and Psi turn out to be
1274 identical to the
1275 sample mean and variance, which can be computed directly as
1276 <pre>
1277 mu_ML = mean(data);
1278 Psi_ML = diag(cov(data));
1279 </pre>
1280 Note that W can only be identified up to a rotation matrix, because of
1281 the spherical symmetry of the source.
1282
1283 <p>
1284 If we restrict Psi to be spherical, i.e., Psi = sigma*I,
1285 there is a closed-form solution for W as well,
1286 i.e., we do not need to use EM.
1287 In particular, W contains the first |X| eigenvectors of the sample covariance
1288 matrix, with scalings determined by the eigenvalues and sigma.
1289 Classical PCA can be obtained by taking the sigma->0 limit.
1290 For details, see
1291
1292 <ul>
1293 <li> <a href="ftp://hope.caltech.edu/pub/roweis/Empca/empca.ps">
1294 "EM algorithms for PCA and SPCA"</a>, Sam Roweis, NIPS 97.
1295 (<a href="ftp://hope.caltech.edu/pub/roweis/Code/empca.tar.gz">
1296 Matlab software</a>)
1297
1298 <p>
1299 <li>
1300 <a
1301 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003>
1302 "Mixtures of probabilistic principal component analyzers"</a>,
1303 Tipping and Bishop, Neural Computation 11(2):443--482, 1999.
1304 </ul>
1305
1306 <p>
1307 By adding a hidden discrete variable, we can create mixtures of FA
1308 models, as shown in (c).
1309 Now we can explain the data using a set of subspaces.
1310 We can create this model in BNT as follows.
1311 <pre>
1312 ns = [M k D];
1313 dag = zeros(3);
1314 dag(1,3) = 1;
1315 dag(2,3) = 1;
1316 bnet = mk_bnet(dag, ns, 'discrete', 1);
1317 bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0);
1318 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(k, 1), 'cov', eye(k), 'cov_type', 'diag', ...
1319 'clamp_mean', 1, 'clamp_cov', 1);
1320 bnet.CPD{3} = gaussian_CPD(bnet, 3, 'mean', Mu0', 'cov', repmat(diag(Psi0), [1 1 M]), ...
1321 'weights', W0, 'cov_type', 'diag', 'tied_cov', 1);
1322 </pre>
1323 Notice how the covariance matrix for Y is the same for all values of
1324 Q; that is, the noise level in each sub-space is assumed the same.
1325 However, we allow the offset, mu, to vary.
1326 For details, see
1327 <ul>
1328
1329 <LI>
1330 <a HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/tr-96-1.ps.gz"> The EM
1331 Algorithm for Mixtures of Factor Analyzers </A>,
1332 Ghahramani, Z. and Hinton, G.E. (1996),
1333 University of Toronto
1334 Technical Report CRG-TR-96-1.
1335 (<A HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/mfa.tar.gz">Matlab software</A>)
1336
1337 <p>
1338 <li>
1339 <a
1340 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003>
1341 "Mixtures of probabilistic principal component analyzers"</a>,
1342 Tipping and Bishop, Neural Computation 11(2):443--482, 1999.
1343 </ul>
1344
1345 <p>
1346 I have included Zoubin's specialized MFA code (with his permission)
1347 with the toolbox, so you can check that BNT gives the same results:
1348 see 'BNT/examples/static/mfa1.m'.
1349
1350 <p>
1351 Independent Factor Analysis (IFA) generalizes FA by allowing a
1352 non-Gaussian prior on each component of X.
1353 (Note that we can approximate a non-Gaussian prior using a mixture of
1354 Gaussians.)
1355 This means that the likelihood function is no longer rotationally
1356 invariant, so we can uniquely identify W and the hidden
1357 sources X.
1358 IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y).
1359 We recover classical Independent Components Analysis (ICA)
1360 in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the
1361 weight matrix W is square and invertible.
1362 For details, see
1363 <ul>
1364 <li>
1365 <a href="http://www.gatsby.ucl.ac.uk/~hagai/ifa.ps">Independent Factor
1366 Analysis</a>, H. Attias, Neural Computation 11: 803--851, 1998.
1367 </ul>
1368
1369
1370
1371 <h2><a name="mixexp">Mixtures of experts</h2>
1372
1373 As an example of the use of the softmax function,
1374 we introduce the Mixture of Experts model.
1375 <!--
1376 We also show
1377 the Hierarchical Mixture of Experts model, where the hierarchy has two
1378 levels.
1379 (This is essentially a probabilistic decision tree of height two.)
1380 -->
1381 As before,
1382 circles denote continuous-valued nodes,
1383 squares denote discrete nodes, clear
1384 means hidden, and shaded means observed.
1385 <p>
1386 <center>
1387 <table>
1388 <tr>
1389 <td><img src="Figures/mixexp.gif">
1390 <!--
1391 <td><img src="Figures/hme.gif">
1392 -->
1393 </table>
1394 </center>
1395 <p>
1396 X is the observed
1397 input, Y is the output, and
1398 the Q nodes are hidden "gating" nodes, which select the appropriate
1399 set of parameters for Y. During training, Y is assumed observed,
1400 but for testing, the goal is to predict Y given X.
1401 Note that this is a <em>conditional</em> density model, so we don't
1402 associate any parameters with X.
1403 Hence X's CPD will be a root CPD, which is a way of modelling
1404 exogenous nodes.
1405 If the output is a continuous-valued quantity,
1406 we assume the "experts" are linear-regression units,
1407 and set Y's CPD to linear-Gaussian.
1408 If the output is discrete, we set Y's CPD to a softmax function.
1409 The Q CPDs will always be softmax functions.
1410
1411 <p>
1412 As a concrete example, consider the mixture of experts model where X and Y are
1413 scalars, and Q is binary.
1414 This is just piecewise linear regression, where
1415 we have two line segments, i.e.,
1416 <P>
1417 <IMG ALIGN=BOTTOM SRC="Eqns/lin_reg_eqn.gif">
1418 <P>
1419 We can create this model with random parameters as follows.
1420 (This code is bundled in BNT/examples/static/mixexp2.m.)
1421 <PRE>
1422 X = 1;
1423 Q = 2;
1424 Y = 3;
1425 dag = zeros(3,3);
1426 dag(X,[Q Y]) = 1;
1427 dag(Q,Y) = 1;
1428 ns = [1 2 1]; % make X and Y scalars, and have 2 experts
1429 onodes = [1 3];
1430 bnet = mk_bnet(dag, ns, 'discrete', 2, 'observed', onodes);
1431
1432 rand('state', 0);
1433 randn('state', 0);
1434 bnet.CPD{1} = root_CPD(bnet, 1);
1435 bnet.CPD{2} = softmax_CPD(bnet, 2);
1436 bnet.CPD{3} = gaussian_CPD(bnet, 3);
1437 </PRE>
1438 Now let us fit this model using <a href="#em">EM</a>.
1439 First we <a href="#load_data">load the data</a> (1000 training cases) and plot them.
1440 <P>
1441 <PRE>
1442 data = load('/examples/static/Misc/mixexp_data.txt', '-ascii');
1443 plot(data(:,1), data(:,2), '.');
1444 </PRE>
1445 <p>
1446 <center>
1447 <IMG SRC="Figures/mixexp_data.gif">
1448 </center>
1449 <p>
1450 This is what the model looks like before training.
1451 (Thanks to Thomas Hofman for writing this plotting routine.)
1452 <p>
1453 <center>
1454 <IMG SRC="Figures/mixexp_before.gif">
1455 </center>
1456 <p>
1457 Now let's train the model, and plot the final performance.
1458 (We will discuss how to train models in more detail <a href="#param_learning">below</a>.)
1459 <P>
1460 <PRE>
1461 ncases = size(data, 1); % each row of data is a training case
1462 cases = cell(3, ncases);
1463 cases([1 3], :) = num2cell(data'); % each column of cases is a training case
1464 engine = jtree_inf_engine(bnet);
1465 max_iter = 20;
1466 [bnet2, LLtrace] = learn_params_em(engine, cases, max_iter);
1467 </PRE>
1468 (We specify which nodes will be observed when we create the engine.
1469 Hence BNT knows that the hidden nodes are all discrete.
1470 For complex models, this can lead to a significant speedup.)
1471 Below we show what the model looks like after 16 iterations of EM
1472 (with 100 IRLS iterations per M step), when it converged
1473 using the default convergence tolerance (that the
1474 fractional change in the log-likelihood be less than 1e-3).
1475 Before learning, the log-likelihood was
1476 -322.927442; afterwards, it was -13.728778.
1477 <p>
1478 <center>
1479 <IMG SRC="Figures/mixexp_after.gif">
1480 </center>
1481 (See BNT/examples/static/mixexp2.m for details of the code.)
1482
1483
1484
1485 <h2><a name="hme">Hierarchical mixtures of experts</h2>
1486
1487 A hierarchical mixture of experts (HME) extends the mixture of experts
1488 model by having more than one hidden node. A two-level example is shown below, along
1489 with its more traditional representation as a neural network.
1490 This is like a (balanced) probabilistic decision tree of height 2.
1491 <p>
1492 <center>
1493 <IMG SRC="Figures/HMEforMatlab.jpg">
1494 </center>
1495 <p>
1496 <a href="mailto:pbrutti@stat.cmu.edu">Pierpaolo Brutti</a>
1497 has written an extensive set of routines for HMEs,
1498 which are bundled with BNT: see the examples/static/HME directory.
1499 These routines allow you to choose the number of hidden (gating)
1500 layers, and the form of the experts (softmax or MLP).
1501 See the file hmemenu, which provides a demo.
1502 For example, the figure below shows the decision boundaries learned
1503 for a ternary classification problem, using a 2 level HME with softmax
1504 gates and softmax experts; the training set is on the left, the
1505 testing set on the right.
1506 <p>
1507 <center>
1508 <!--<IMG SRC="Figures/hme_dec_boundary.gif">-->
1509 <IMG SRC="Figures/hme_dec_boundary.png">
1510 </center>
1511 <p>
1512
1513
1514 <p>
1515 For more details, see the following:
1516 <ul>
1517
1518 <li> <a href="http://www.cs.berkeley.edu/~jordan/papers/hierarchies.ps.Z">
1519 Hierarchical mixtures of experts and the EM algorithm</a>
1520 M. I. Jordan and R. A. Jacobs. Neural Computation, 6, 181-214, 1994.
1521
1522 <li> <a href =
1523 "http://www.cs.berkeley.edu/~dmartin/software">David Martin's
1524 matlab code for HME</a>
1525
1526 <li> <a
1527 href="http://www.cs.berkeley.edu/~jordan/papers/uai.ps.Z">Why the
1528 logistic function? A tutorial discussion on
1529 probabilities and neural networks.</a> M. I. Jordan. MIT Computational
1530 Cognitive Science Report 9503, August 1995.
1531
1532 <li> "Generalized Linear Models", McCullagh and Nelder, Chapman and
1533 Hall, 1983.
1534
1535 <li>
1536 "Improved learning algorithms for mixtures of experts in multiclass
1537 classification".
1538 K. Chen, L. Xu, H. Chi.
1539 Neural Networks (1999) 12: 1229-1252.
1540
1541 <li> <a href="http://www.oigeeza.com/steve/">
1542 Classification Using Hierarchical Mixtures of Experts</a>
1543 S.R. Waterhouse and A.J. Robinson.
1544 In Proc. IEEE Workshop on Neural Network for Signal Processing IV (1994), pp. 177-186
1545
1546 <li> <a href="http://www.idiap.ch/~perry/">
1547 Localized mixtures of experts</a>,
1548 P. Moerland, 1998.
1549
1550 <li> "Nonlinear gated experts for time series",
1551 A.S. Weigend and M. Mangeas, 1995.
1552
1553 </ul>
1554
1555
1556 <h2><a name="qmr">QMR</h2>
1557
1558 Bayes nets originally arose out of an attempt to add probabilities to
1559 expert systems, and this is still the most common use for BNs.
1560 A famous example is
1561 QMR-DT, a decision-theoretic reformulation of the Quick Medical
1562 Reference (QMR) model.
1563 <p>
1564 <center>
1565 <IMG ALIGN=BOTTOM SRC="Figures/qmr.gif">
1566 </center>
1567 Here, the top layer represents hidden disease nodes, and the bottom
1568 layer represents observed symptom nodes.
1569 The goal is to infer the posterior probability of each disease given
1570 all the symptoms (which can be present, absent or unknown).
1571 Each node in the top layer has a Bernoulli prior (with a low prior
1572 probability that the disease is present).
1573 Since each node in the bottom layer has a high fan-in, we use a
1574 noisy-OR parameterization; each disease has an independent chance of
1575 causing each symptom.
1576 The real QMR-DT model is copyrighted, but
1577 we can create a random QMR-like model as follows.
1578 <pre>
1579 function bnet = mk_qmr_bnet(G, inhibit, leak, prior)
1580 % MK_QMR_BNET Make a QMR model
1581 % bnet = mk_qmr_bnet(G, inhibit, leak, prior)
1582 %
1583 % G(i,j) = 1 iff there is an arc from disease i to finding j
1584 % inhibit(i,j) = inhibition probability on i->j arc
1585 % leak(j) = inhibition prob. on leak->j arc
1586 % prior(i) = prob. disease i is on
1587
1588 [Ndiseases Nfindings] = size(inhibit);
1589 N = Ndiseases + Nfindings;
1590 finding_node = Ndiseases+1:N;
1591 ns = 2*ones(1,N);
1592 dag = zeros(N,N);
1593 dag(1:Ndiseases, finding_node) = G;
1594 bnet = mk_bnet(dag, ns, 'observed', finding_node);
1595
1596 for d=1:Ndiseases
1597 CPT = [1-prior(d) prior(d)];
1598 bnet.CPD{d} = tabular_CPD(bnet, d, CPT');
1599 end
1600
1601 for i=1:Nfindings
1602 fnode = finding_node(i);
1603 ps = parents(G, i);
1604 bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(ps, i));
1605 end
1606 </pre>
1607 In the file BNT/examples/static/qmr1, we create a random bipartite
1608 graph G, with 5 diseases and 10 findings, and random parameters.
1609 (In general, to create a random dag, use 'mk_random_dag'.)
1610 We can visualize the resulting graph structure using
1611 the methods discussed <a href="#graphdraw">below</a>, with the
1612 following results:
1613 <p>
1614 <img src="Figures/qmr.rnd.jpg">
1615
1616 <p>
1617 Now let us put some random evidence on all the leaves except the very
1618 first and very last, and compute the disease posteriors.
1619 <pre>
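% node index vectors, as defined in mk_qmr_bnet above
diseases = 1:Ndiseases; findings = Ndiseases + (1:Nfindings);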
1620 pos = 2:floor(Nfindings/2);
1621 neg = (pos(end)+1):(Nfindings-1);
1622 onodes = myunion(pos, neg);
1623 evidence = cell(1, N);
1624 evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos)));
1625 evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg)));
1626
1627 engine = jtree_inf_engine(bnet);
1628 [engine, ll] = enter_evidence(engine, evidence);
1629 post = zeros(1, Ndiseases);
1630 for i=diseases(:)'
1631 m = marginal_nodes(engine, i);
1632 post(i) = m.T(2);
1633 end
1634 </pre>
1635 Junction tree can be quite slow on large QMR models.
1636 Fortunately, it is possible to exploit properties of the noisy-OR
1637 function to speed up exact inference using an algorithm called
1638 <a href="#quickscore">quickscore</a>, discussed below.
1639
1640
1641
1642
1643
1644 <h2><a name="cg_model">Conditional Gaussian models</h2>
1645
1646 A conditional Gaussian model is one in which, conditioned on all the discrete
1647 nodes, the distribution over the remaining (continuous) nodes is
1648 multivariate Gaussian. This means we can have arcs from discrete (D)
1649 to continuous (C) nodes, but not vice versa.
1650 (We <em>are</em> allowed C->D arcs if the continuous nodes are observed,
1651 as in the <a href="#mixexp">mixture of experts</a> model,
1652 since this distribution can be represented with a discrete potential.)
1653 <p>
1654 We now give an example of a CG model, from
the paper "Propagation of Probabilities, Means and
1656 Variances in Mixed Graphical Association Models", Steffen Lauritzen,
1657 JASA 87(420):1098--1108, 1992 (reprinted in the book "Probabilistic Networks and Expert
1658 Systems", R. G. Cowell, A. P. Dawid, S. L. Lauritzen and
1659 D. J. Spiegelhalter, Springer, 1999.)
1660
1661 <h3>Specifying the graph</h3>
1662
1663 Consider the model of waste emissions from an incinerator plant shown below.
1664 We follow the standard convention that shaded nodes are observed,
1665 clear nodes are hidden.
1666 We also use the non-standard convention that
1667 square nodes are discrete (tabular) and round nodes are
1668 Gaussian.
1669
1670 <p>
1671 <center>
1672 <IMG SRC="Figures/cg1.gif">
1673 </center>
1674 <p>
1675
1676 We can create this model as follows.
1677 <pre>
1678 F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9;
1679 n = 9;
1680
1681 dag = zeros(n);
1682 dag(F,E)=1;
1683 dag(W,[E Min D]) = 1;
1684 dag(E,D)=1;
1685 dag(B,[C D])=1;
1686 dag(D,[L Mout])=1;
1687 dag(Min,Mout)=1;
1688
1689 % node sizes - all cts nodes are scalar, all discrete nodes are binary
1690 ns = ones(1, n);
1691 dnodes = [F W B];
1692 cnodes = mysetdiff(1:n, dnodes);
1693 ns(dnodes) = 2;
1694
1695 bnet = mk_bnet(dag, ns, 'discrete', dnodes);
1696 </pre>
1697 'dnodes' is a list of the discrete nodes; 'cnodes' is the continuous
1698 nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'.
1699 <p>
1700
1701
1702 <h3>Specifying the parameters</h3>
1703
1704 The parameters of the discrete nodes can be specified as follows.
1705 <pre>
1706 bnet.CPD{B} = tabular_CPD(bnet, B, 'CPT', [0.85 0.15]); % 1=stable, 2=unstable
1707 bnet.CPD{F} = tabular_CPD(bnet, F, 'CPT', [0.95 0.05]); % 1=intact, 2=defect
1708 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [2/7 5/7]); % 1=industrial, 2=household
1709 </pre>
1710
1711 <p>
1712 The parameters of the continuous nodes can be specified as follows.
1713 <pre>
1714 bnet.CPD{E} = gaussian_CPD(bnet, E, 'mean', [-3.9 -0.4 -3.2 -0.5], ...
1715 'cov', [0.00002 0.0001 0.00002 0.0001]);
1716 bnet.CPD{D} = gaussian_CPD(bnet, D, 'mean', [6.5 6.0 7.5 7.0], ...
1717 'cov', [0.03 0.04 0.1 0.1], 'weights', [1 1 1 1]);
1718 bnet.CPD{C} = gaussian_CPD(bnet, C, 'mean', [-2 -1], 'cov', [0.1 0.3]);
1719 bnet.CPD{L} = gaussian_CPD(bnet, L, 'mean', 3, 'cov', 0.25, 'weights', -0.5);
1720 bnet.CPD{Min} = gaussian_CPD(bnet, Min, 'mean', [0.5 -0.5], 'cov', [0.01 0.005]);
1721 bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 'mean', 0, 'cov', 0.002, 'weights', [1 1]);
1722 </pre>
1723
1724
1725 <h3><a name="cg_infer">Inference</h3>
1726
1727 <!--Let us perform inference in the <a href="#cg_model">waste incinerator example</a>.-->
1728 First we compute the unconditional marginals.
1729 <pre>
1730 engine = jtree_inf_engine(bnet);
1731 evidence = cell(1,n);
1732 [engine, ll] = enter_evidence(engine, evidence);
1733 marg = marginal_nodes(engine, E);
1734 </pre>
1735 <!--(Of course, we could use <tt>cond_gauss_inf_engine</tt> instead of jtree.)-->
1736 'marg' is a structure that contains the fields 'mu' and 'Sigma', which
1737 contain the mean and (co)variance of the marginal on E.
1738 In this case, they are both scalars.
1739 Let us check they match the published figures (to 2 decimal places).
1740 <!--(We can't expect
1741 more precision than this in general because I have implemented the algorithm of
1742 Lauritzen (1992), which can be numerically unstable.)-->
1743 <pre>
1744 tol = 1e-2;
1745 assert(approxeq(marg.mu, -3.25, tol));
1746 assert(approxeq(sqrt(marg.Sigma), 0.709, tol));
1747 </pre>
1748 We can compute the other posteriors similarly.
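For example, the following sketch prints the marginal mean and standard
deviation of every continuous node (using the 'cnodes' vector defined
above):
<pre>
for i=cnodes
  m = marginal_nodes(engine, i);
  fprintf('node %d: mean %6.3f, std %6.3f\n', i, m.mu, sqrt(m.Sigma));
end
</pre>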
1749 Now let us add some evidence.
1750 <pre>
1751 evidence = cell(1,n);
1752 evidence{W} = 1; % industrial
1753 evidence{L} = 1.1;
1754 evidence{C} = -0.9;
1755 [engine, ll] = enter_evidence(engine, evidence);
1756 </pre>
1757 Now we find
1758 <pre>
1759 marg = marginal_nodes(engine, E);
1760 assert(approxeq(marg.mu, -3.8983, tol));
1761 assert(approxeq(sqrt(marg.Sigma), 0.0763, tol));
1762 </pre>
1763
1764
1765 We can also compute the joint probability on a set of nodes.
1766 For example, P(D, Mout | evidence) is a 2D Gaussian:
1767 <pre>
1768 marg = marginal_nodes(engine, [D Mout])
1769 marg =
1770 domain: [6 8]
1771 mu: [2x1 double]
1772 Sigma: [2x2 double]
1773 T: 1.0000
1774 </pre>
1775 The mean is
1776 <pre>
1777 marg.mu
1778 ans =
1779 3.6077
1780 4.1077
1781 </pre>
1782 and the covariance matrix is
1783 <pre>
1784 marg.Sigma
1785 ans =
1786 0.1062 0.1062
1787 0.1062 0.1182
1788 </pre>
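From this we can read off, for example, the posterior correlation between D
and Mout, marg.Sigma(1,2)/sqrt(marg.Sigma(1,1)*marg.Sigma(2,2)), which is
about 0.95.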
1789 It is easy to visualize this posterior using standard Matlab plotting
1790 functions, e.g.,
1791 <pre>
1792 gaussplot2d(marg.mu, marg.Sigma);
1793 </pre>
1794 produces the following picture.
1795
1796 <p>
1797 <center>
1798 <IMG SRC="Figures/gaussplot.png">
1799 </center>
1800 <p>
1801
1802
1803 The T field indicates that the mixing weight of this Gaussian
1804 component is 1.0.
1805 If the joint contains discrete and continuous variables, the result
1806 will be a mixture of Gaussians, e.g.,
1807 <pre>
1808 marg = marginal_nodes(engine, [F E])
1809 domain: [1 3]
1810 mu: [-3.9000 -0.4003]
1811 Sigma: [1x1x2 double]
1812 T: [0.9995 4.7373e-04]
1813 </pre>
1814 The interpretation is
1815 Sigma(i,j,k) = Cov[ E(i) E(j) | F=k ].
1816 In this case, E is a scalar, so i=j=1; k specifies the mixture component.
1817 <p>
1818 We saw in the sprinkler network that BNT sets the effective size of
1819 observed discrete nodes to 1, since they only have one legal value.
For observed continuous nodes, BNT sets their length to 0,
since they have been reduced to a point.
1822 For example,
1823 <pre>
1824 marg = marginal_nodes(engine, [B C])
1825 domain: [4 5]
1826 mu: []
1827 Sigma: []
1828 T: [0.0123 0.9877]
1829 </pre>
1830 It is simple to post-process the output of marginal_nodes.
1831 For example, the file BNT/examples/static/cg1 sets the mu term of
1832 observed nodes to their observed value, and the Sigma term to 0 (since
1833 observed nodes have no variance).
1834
1835 <p>
1836 Note that the implemented version of the junction tree is numerically
1837 unstable when using CG potentials
1838 (which is why, in the example above, we only required our answers to agree with
1839 the published ones to 2dp.)
1840 This is why you might want to use <tt>stab_cond_gauss_inf_engine</tt>,
1841 implemented by Shan Huang. This is described in
1842
1843 <ul>
1844 <li> "Stable Local Computation with Conditional Gaussian Distributions",
1845 S. Lauritzen and F. Jensen, Tech Report R-99-2014,
Dept. Math. Sciences, Aalborg Univ., 1999.
1847 </ul>
1848
1849 However, even the numerically stable version
1850 can be computationally intractable if there are many hidden discrete
nodes, because the number of mixture components grows exponentially, e.g., in a
1852 <a href="usage_dbn.html#lds">switching linear dynamical system</a>.
1853 In general, one must resort to approximate inference techniques: see
1854 the discussion on <a href="#engines">inference engines</a> below.
1855
1856
1857 <h2><a name="hybrid">Other hybrid models</h2>
1858
1859 When we have C->D arcs, where C is hidden, we need to use
1860 approximate inference.
1861 One approach (not implemented in BNT) is described in
1862 <ul>
1863 <li> <a
1864 href="http://www.cs.berkeley.edu/~murphyk/Papers/hybrid_uai99.ps.gz">A
1865 Variational Approximation for Bayesian Networks with
1866 Discrete and Continuous Latent Variables</a>,
1867 K. Murphy, UAI 99.
1868 </ul>
1869 Of course, one can always use <a href="#sampling">sampling</a> methods
1870 for approximate inference in such models.
1871
1872
1873
1874 <h1><a name="param_learning">Parameter Learning</h1>
1875
1876 The parameter estimation routines in BNT can be classified into 4
1877 types, depending on whether the goal is to compute
1878 a full (Bayesian) posterior over the parameters or just a point
1879 estimate (e.g., Maximum Likelihood or Maximum A Posteriori),
1880 and whether all the variables are fully observed or there is missing
1881 data/ hidden variables (partial observability).
1882 <p>
1883
1884 <TABLE BORDER>
1885 <tr>
1886 <TH></TH>
1887 <th>Full obs</th>
1888 <th>Partial obs</th>
1889 </tr>
1890 <tr>
1891 <th>Point</th>
1892 <td><tt>learn_params</tt></td>
1893 <td><tt>learn_params_em</tt></td>
1894 </tr>
1895 <tr>
1896 <th>Bayes</th>
1897 <td><tt>bayes_update_params</tt></td>
1898 <td>not yet supported</td>
1899 </tr>
1900 </table>
1901
1902
1903 <h2><a name="load_data">Loading data from a file</h2>
1904
1905 To load numeric data from an ASCII text file called 'dat.txt', where each row is a
1906 case and columns are separated by white-space, such as
1907 <pre>
1908 011979 1626.5 0.0
1909 021979 1367.0 0.0
1910 ...
1911 </pre>
1912 you can use
1913 <pre>
1914 data = load('dat.txt');
1915 </pre>
1916 or
1917 <pre>
1918 load dat.txt -ascii
1919 </pre>
1920 In the latter case, the data is stored in a variable called 'dat' (the
1921 filename minus the extension).
1922 Alternatively, suppose the data is stored in a .csv file (has commas
1923 separating the columns, and contains a header line), such as
1924 <pre>
1925 header info goes here
1926 ORD,011979,1626.5,0.0
1927 DSM,021979,1367.0,0.0
1928 ...
1929 </pre>
1930 You can load this using
1931 <pre>
1932 [a,b,c,d] = textread('dat.txt', '%s %d %f %f', 'delimiter', ',', 'headerlines', 1);
1933 </pre>
If your file is not in either of these formats, you can either use Perl to convert
it to one of these formats, or use Matlab's lower-level file I/O functions (e.g., fscanf).
1936 Type
1937 <tt>
1938 help iofun
1939 </tt>
1940 for more information on Matlab's file functions.
1941 <!--
1942 <p>
1943 To load data directly from Excel,
1944 you should buy the
1945 <a href="http://www.mathworks.com/products/excellink/">Excel Link</a>.
1946 To load data directly from a relational database,
1947 you should buy the
1948 <a href="http://www.mathworks.com/products/database">Database
1949 toolbox</a>.
1950 -->
1951 <p>
1952 BNT learning routines require data to be stored in a cell array.
1953 data{i,m} is the value of node i in case (example) m, i.e., each
1954 <em>column</em> is a case.
1955 If node i is not observed in case m (missing value), set
1956 data{i,m} = [].
1957 (Not all the learning routines can cope with such missing values, however.)
1958 In the special case that all the nodes are observed and are
1959 scalar-valued (as opposed to vector-valued), the data can be
1960 stored in a matrix (as opposed to a cell-array).
1961 <p>
1962 Suppose, as in the <a href="#mixexp">mixture of experts example</a>,
1963 that we have 3 nodes in the graph: X(1) is the observed input, X(3) is
1964 the observed output, and X(2) is a hidden (gating) node. We can
1965 create the dataset as follows.
1966 <pre>
1967 data = load('dat.txt');
1968 ncases = size(data, 1);
1969 cases = cell(3, ncases);
1970 cases([1 3], :) = num2cell(data');
1971 </pre>
1972 Notice how we transposed the data, to convert rows into columns.
1973 Also, cases{2,m} = [] for all m, since X(2) is always hidden.
1974
1975
1976 <h2><a name="mle_complete">Maximum likelihood parameter estimation from complete data</h2>
1977
1978 As an example, let's generate some data from the sprinkler network, randomize the parameters,
1979 and then try to recover the original model.
1980 First we create some training data using forwards sampling.
1981 <pre>
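nsamples = 50;   % illustrative sample size; N is the number of nodes (4 for the sprinkler net)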
1982 samples = cell(N, nsamples);
1983 for i=1:nsamples
1984 samples(:,i) = sample_bnet(bnet);
1985 end
1986 </pre>
1987 samples{j,i} contains the value of the j'th node in case i.
1988 sample_bnet returns a cell array because, in general, each node might
1989 be a vector of different length.
1990 In this case, all nodes are discrete (and hence scalars), so we
1991 could have used a regular array instead (which can be quicker):
1992 <pre>
1993 data = cell2num(samples);
</pre>
1995 So now data(j,i) = samples{j,i}.
1996 <p>
1997 Now we create a network with random parameters.
1998 (The initial values of bnet2 don't matter in this case, since we can find the
1999 globally optimal MLE independent of where we start.)
2000 <pre>
2001 % Make a tabula rasa
2002 bnet2 = mk_bnet(dag, node_sizes);
2003 seed = 0;
2004 rand('state', seed);
2005 bnet2.CPD{C} = tabular_CPD(bnet2, C);
2006 bnet2.CPD{R} = tabular_CPD(bnet2, R);
2007 bnet2.CPD{S} = tabular_CPD(bnet2, S);
2008 bnet2.CPD{W} = tabular_CPD(bnet2, W);
2009 </pre>
2010 Finally, we find the maximum likelihood estimates of the parameters.
2011 <pre>
2012 bnet3 = learn_params(bnet2, samples);
2013 </pre>
2014 To view the learned parameters, we use a little Matlab hackery.
2015 <pre>
2016 CPT3 = cell(1,N);
2017 for i=1:N
2018 s=struct(bnet3.CPD{i}); % violate object privacy
2019 CPT3{i}=s.CPT;
2020 end
2021 </pre>
2022 Here are the parameters learned for node 4.
2023 <pre>
2024 dispcpt(CPT3{4})
2025 1 1 : 1.0000 0.0000
2026 2 1 : 0.2000 0.8000
2027 1 2 : 0.2273 0.7727
2028 2 2 : 0.0000 1.0000
2029 </pre>
2030 So we see that the learned parameters are fairly close to the "true"
2031 ones, which we display below.
2032 <pre>
2033 dispcpt(CPT{4})
2034 1 1 : 1.0000 0.0000
2035 2 1 : 0.1000 0.9000
2036 1 2 : 0.1000 0.9000
2037 2 2 : 0.0100 0.9900
2038 </pre>
2039 We can get better results by using a larger training set, or using
2040 informative priors (see <a href="#prior">below</a>).
2041
2042
2043
2044 <h2><a name="prior">Parameter priors</h2>
2045
2046 Currently, only tabular CPDs can have priors on their parameters.
2047 The conjugate prior for a multinomial is the Dirichlet.
2048 (For binary random variables, the multinomial is the same as the
2049 Bernoulli, and the Dirichlet is the same as the Beta.)
2050 <p>
2051 The Dirichlet has a simple interpretation in terms of pseudo counts.
2052 If we let N_ijk = the num. times X_i=k and Pa_i=j occurs in the
2053 training set, where Pa_i are the parents of X_i,
2054 then the maximum likelihood (ML) estimate is
2055 T_ijk = N_ijk / N_ij (where N_ij = sum_k' N_ijk'), which will be 0 if N_ijk=0.
2056 To prevent us from declaring that (X_i=k, Pa_i=j) is impossible just because this
2057 event was not seen in the training set,
2058 we can pretend we saw value k of X_i, for each value j of Pa_i some number (alpha_ijk)
2059 of times in the past.
The MAP (maximum a posteriori) estimate is then
2061 <pre>
2062 T_ijk = (N_ijk + alpha_ijk) / (N_ij + alpha_ij)
2063 </pre>
2064 and is never 0 if all alpha_ijk > 0.
2065 For example, consider the network A->B, where A is binary and B has 3
2066 values.
2067 A uniform prior for B has the form
2068 <pre>
2069 B=1 B=2 B=3
2070 A=1 1 1 1
2071 A=2 1 1 1
2072 </pre>
2073 which can be created using
2074 <pre>
2075 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'unif');
2076 </pre>
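For example, if for a given parent value A=j we observe B=2 three times and
B=3 seven times (and never B=1), the MLE is (0, 0.3, 0.7), whereas with this
uniform prior the MAP estimate is
((0+1)/(10+3), (3+1)/(10+3), (7+1)/(10+3)) = (0.08, 0.31, 0.62) approximately,
so no value is declared impossible.
<p>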
2077 This prior does not satisfy the likelihood equivalence principle,
2078 which says that <a href="#markov_equiv">Markov equivalent</a> models
2079 should have the same marginal likelihood.
2080 A prior that does satisfy this principle is shown below.
2081 Heckerman (1995) calls this the
2082 BDeu prior (likelihood equivalent uniform Bayesian Dirichlet).
2083 <pre>
2084 B=1 B=2 B=3
2085 A=1 1/6 1/6 1/6
2086 A=2 1/6 1/6 1/6
2087 </pre>
where we put N/(q*r) in each bin; N is the equivalent sample size,
q = |A| is the number of parent configurations, and r = |B| is the number of states of the child.
2090 This can be created as follows
2091 <pre>
2092 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu');
2093 </pre>
Here, the equivalent sample size, which controls the strength of the
prior, defaults to 1.
2096 You can change this using
2097 <pre>
2098 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', ...
2099 'BDeu', 'dirichlet_weight', 10);
2100 </pre>
2101 <!--where counts is an array of pseudo-counts of the same size as the
2102 CPT.-->
2103 <!--
2104 <p>
2105 When you specify a prior, you should set row i of the CPT to the
2106 normalized version of row i of the pseudo-count matrix, i.e., to the
2107 expected values of the parameters. This will ensure that computing the
2108 marginal likelihood sequentially (see <a
2109 href="#bayes_learn">below</a>) and in batch form gives the same
2110 results.
2111 To do this, proceed as follows.
2112 <pre>
2113 tabular_CPD(bnet, i, 'prior', counts, 'CPT', mk_stochastic(counts));
2114 </pre>
2115 For a non-informative prior, you can just write
2116 <pre>
2117 tabular_CPD(bnet, i, 'prior', 'unif', 'CPT', 'unif');
2118 </pre>
2119 -->
2120
2121
2122 <h2><a name="bayes_learn">(Sequential) Bayesian parameter updating from complete data</h2>
2123
2124 If we use conjugate priors and have fully observed data, we can
2125 compute the posterior over the parameters in batch form as follows.
2126 <pre>
2127 cases = sample_bnet(bnet, nsamples);
2128 bnet = bayes_update_params(bnet, cases);
2129 LL = log_marg_lik_complete(bnet, cases);
2130 </pre>
2131 bnet.CPD{i}.prior contains the new Dirichlet pseudocounts,
2132 and bnet.CPD{i}.CPT is set to the mean of the posterior (the
2133 normalized counts).
2134 (Hence if the initial pseudo counts are 0,
2135 <tt>bayes_update_params</tt> and <tt>learn_params</tt> will give the
2136 same result.)
2137
2138
2139
2140
2141 <p>
2142 We can compute the same result sequentially (on-line) as follows.
2143 <pre>
2144 LL = 0;
2145 for m=1:nsamples
2146 LL = LL + log_marg_lik_complete(bnet, cases(:,m));
2147 bnet = bayes_update_params(bnet, cases(:,m));
2148 end
2149 </pre>
2150
2151 The file <tt>BNT/examples/static/StructLearn/model_select1</tt> has an example of
2152 sequential model selection which uses the same idea.
2153 We generate data from the model A->B
2154 and compute the posterior prob of all 3 dags on 2 nodes:
2155 (1) A B, (2) A <- B , (3) A -> B
2156 Models 2 and 3 are <a href="#markov_equiv">Markov equivalent</a>, and therefore indistinguishable from
2157 observational data alone, so we expect their posteriors to be the same
2158 (assuming a prior which satisfies likelihood equivalence).
2159 If we use random parameters, the "true" model only gets a higher posterior after 2000 trials!
2160 However, if we make B a noisy NOT gate, the true model "wins" after 12
2161 trials, as shown below (red = model 1, blue/green (superimposed)
2162 represents models 2/3).
2163 <p>
2164 <img src="Figures/model_select.png">
2165 <p>
2166 The use of marginal likelihood for model selection is discussed in
2167 greater detail in the
section on <a href="#structure_learning">structure learning</a>.
2169
2170
2171
2172
2173 <h2><a name="em">Maximum likelihood parameter estimation with missing values</h2>
2174
2175 Now we consider learning when some values are not observed.
2176 Let us randomly hide half the values generated from the water
2177 sprinkler example.
2178 <pre>
2179 samples2 = samples;
2180 hide = rand(N, nsamples) > 0.5;
2181 [I,J]=find(hide);
2182 for k=1:length(I)
2183 samples2{I(k), J(k)} = [];
2184 end
2185 </pre>
2186 samples2{i,l} is the value of node i in training case l, or [] if unobserved.
2187 <p>
2188 Now we will compute the MLEs using the EM algorithm.
2189 We need to use an inference algorithm to compute the expected
2190 sufficient statistics in the E step; the M (maximization) step is as
2191 above.
2192 <pre>
2193 engine2 = jtree_inf_engine(bnet2);
2194 max_iter = 10;
2195 [bnet4, LLtrace] = learn_params_em(engine2, samples2, max_iter);
2196 </pre>
2197 LLtrace(i) is the log-likelihood at iteration i. We can plot this as
2198 follows:
2199 <pre>
2200 plot(LLtrace, 'x-')
2201 </pre>
2202 Let's display the results after 10 iterations of EM.
2203 <pre>
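% first extract the learned CPTs from bnet4, using the same struct trick as for bnet3 above
CPT4 = cell(1,N);
for i=1:N
  s = struct(bnet4.CPD{i}); CPT4{i} = s.CPT;
end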
2204 celldisp(CPT4)
2205 CPT4{1} =
2206 0.6616
2207 0.3384
2208 CPT4{2} =
2209 0.6510 0.3490
2210 0.8751 0.1249
2211 CPT4{3} =
2212 0.8366 0.1634
2213 0.0197 0.9803
2214 CPT4{4} =
2215 (:,:,1) =
2216 0.8276 0.0546
2217 0.5452 0.1658
2218 (:,:,2) =
2219 0.1724 0.9454
2220 0.4548 0.8342
2221 </pre>
2222 We can get improved performance by using one or more of the following
2223 methods:
2224 <ul>
2225 <li> Increasing the size of the training set.
2226 <li> Decreasing the amount of hidden data.
2227 <li> Running EM for longer.
2228 <li> Using informative priors.
<li> Initialising EM from multiple starting points (see the sketch after this list).
2230 </ul>
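For instance, here is a minimal sketch of the last idea, reusing the
random-initialization trick from above (the number of restarts and the
choice of a junction tree engine are arbitrary):
<pre>
best_LL = -inf;
for restart=1:5
  bnet_init = bnet2;
  for i=1:N, bnet_init.CPD{i} = tabular_CPD(bnet_init, i); end  % fresh random CPTs
  [b, LLtrace] = learn_params_em(jtree_inf_engine(bnet_init), samples2, max_iter);
  if LLtrace(end) > best_LL, best_LL = LLtrace(end); bnet_best = b; end
end
</pre>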
2231
2232 Click <a href="#gaussian">here</a> for a discussion of learning
2233 Gaussians, which can cause numerical problems.
2234 <p>
2235 For a more complete example of learning with EM,
2236 see the script BNT/examples/static/learn1.m.
2237
2238 <h2><a name="tying">Parameter tying</h2>
2239
2240 In networks with repeated structure (e.g., chains and grids), it is
2241 common to assume that the parameters are the same at every node. This
2242 is called parameter tying, and reduces the amount of data needed for
2243 learning.
2244 <p>
2245 When we have tied parameters, there is no longer a one-to-one
2246 correspondence between nodes and CPDs.
Rather, each CPD specifies the parameters for a whole equivalence class
2248 of nodes.
2249 It is easiest to see this by example.
2250 Consider the following <a href="usage_dbn.html#hmm">hidden Markov
2251 model (HMM)</a>
2252 <p>
2253 <img src="Figures/hmm3.gif">
2254 <p>
2255 <!--
2256 We can create this graph structure, assuming we have T time-slices,
2257 as follows.
2258 (We number the nodes as shown in the figure, but we could equally well
2259 number the hidden nodes 1:T, and the observed nodes T+1:2T.)
2260 <pre>
2261 N = 2*T;
2262 dag = zeros(N);
2263 hnodes = 1:2:2*T;
2264 for i=1:T-1
2265 dag(hnodes(i), hnodes(i+1))=1;
2266 end
2267 onodes = 2:2:2*T;
2268 for i=1:T
2269 dag(hnodes(i), onodes(i)) = 1;
2270 end
2271 </pre>
2272 <p>
2273 The hidden nodes are always discrete, and have Q possible values each,
2274 but the observed nodes can be discrete or continuous, and have O possible values/length.
2275 <pre>
2276 if cts_obs
2277 dnodes = hnodes;
2278 else
2279 dnodes = 1:N;
2280 end
2281 ns = ones(1,N);
2282 ns(hnodes) = Q;
2283 ns(onodes) = O;
2284 </pre>
2285 -->
2286 When HMMs are used for semi-infinite processes like speech recognition,
2287 we assume the transition matrix
2288 P(H(t+1)|H(t)) is the same for all t; this is called a time-invariant
or homogeneous Markov chain.
2290 Hence hidden nodes 2, 3, ..., T
2291 are all in the same equivalence class, say class Hclass.
2292 Similarly, the observation matrix P(O(t)|H(t)) is assumed to be the
2293 same for all t, so the observed nodes are all in the same equivalence
2294 class, say class Oclass.
2295 Finally, the prior term P(H(1)) is in a class all by itself, say class
2296 H1class.
2297 This is illustrated below, where we explicitly represent the
2298 parameters as random variables (dotted nodes).
2299 <p>
2300 <img src="Figures/hmm4_params.gif">
2301 <p>
2302 In BNT, we cannot represent parameters as random variables (nodes).
2303 Instead, we "hide" the
2304 parameters inside one CPD for each equivalence class,
2305 and then specify that the other CPDs should share these parameters, as
2306 follows.
2307 <pre>
2308 hnodes = 1:2:2*T;
2309 onodes = 2:2:2*T;
2310 H1class = 1; Hclass = 2; Oclass = 3;
2311 eclass = ones(1,N);
2312 eclass(hnodes(2:end)) = Hclass;
2313 eclass(hnodes(1)) = H1class;
2314 eclass(onodes) = Oclass;
2315 % create dag and ns in the usual way
2316 bnet = mk_bnet(dag, ns, 'discrete', dnodes, 'equiv_class', eclass);
2317 </pre>
2318 Finally, we define the parameters for each equivalence class:
2319 <pre>
2320 bnet.CPD{H1class} = tabular_CPD(bnet, hnodes(1)); % prior
2321 bnet.CPD{Hclass} = tabular_CPD(bnet, hnodes(2)); % transition matrix
2322 if cts_obs
2323 bnet.CPD{Oclass} = gaussian_CPD(bnet, onodes(1));
2324 else
2325 bnet.CPD{Oclass} = tabular_CPD(bnet, onodes(1));
2326 end
2327 </pre>
2328 In general, if bnet.CPD{e} = xxx_CPD(bnet, j), then j should be a
2329 member of e's equivalence class; that is, it is not always the case
2330 that e == j. You can use bnet.rep_of_eclass(e) to return the
2331 representative of equivalence class e.
2332 BNT will look up the parents of j to determine the size
2333 of the CPT to use. It assumes that this is the same for all members of
2334 the equivalence class.
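For example, a safe way to (re)define the CPD of an equivalence class,
without assuming that e == j, is sketched below (using the
discrete-observation case for concreteness):
<pre>
e = Oclass;
j = bnet.rep_of_eclass(e);            % some node belonging to class e
bnet.CPD{e} = tabular_CPD(bnet, j);   % BNT sizes the CPT from j's parents
</pre>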
2335 Click <a href="param_tieing.html">here</a> for
2336 a more complex example of parameter tying.
2337 <p>
2338 Note:
2339 Normally one would define an HMM as a
2340 <a href = "usage_dbn.html">Dynamic Bayes Net</a>
2341 (see the function BNT/examples/dynamic/mk_chmm.m).
2342 However, one can define an HMM as a static BN using the function
2343 BNT/examples/static/Models/mk_hmm_bnet.m.
2344
2345
2346
2347 <h1><a name="structure_learning">Structure learning</h1>
2348
2349 Update (9/29/03):
Philippe Leray is developing some additional structure learning code
2351 on top of BNT. Click <a
2352 href="http://bnt.insa-rouen.fr/ajouts.html">here</a>
2353 for details.
2354
2355 <p>
2356
2357 There are two very different approaches to structure learning:
2358 constraint-based and search-and-score.
2359 In the <a href="#constraint">constraint-based approach</a>,
2360 we start with a fully connected graph, and remove edges if certain
2361 conditional independencies are measured in the data.
2362 This has the disadvantage that repeated independence tests lose
2363 statistical power.
2364 <p>
2365 In the more popular search-and-score approach,
2366 we perform a search through the space of possible DAGs, and either
2367 return the best one found (a point estimate), or return a sample of the
2368 models found (an approximation to the Bayesian posterior).
2369 <p>
2370 Unfortunately, the number of DAGs as a function of the number of
2371 nodes, G(n), is super-exponential in n.
2372 A closed form formula for G(n) is not known, but the first few values
2373 are shown below (from Cooper, 1999).
2374
2375 <table>
2376 <tr> <th>n</th> <th align=left>G(n)</th> </tr>
2377 <tr> <td>1</td> <td>1</td> </tr>
2378 <tr> <td>2</td> <td>3</td> </tr>
2379 <tr> <td>3</td> <td>25</td> </tr>
2380 <tr> <td>4</td> <td>543</td> </tr>
2381 <tr> <td>5</td> <td>29,281</td> </tr>
2382 <tr> <td>6</td> <td>3,781,503</td> </tr>
2383 <tr> <td>7</td> <td>1.1 x 10^9</td> </tr>
2384 <tr> <td>8</td> <td>7.8 x 10^11</td> </tr>
2385 <tr> <td>9</td> <td>1.2 x 10^15</td> </tr>
2386 <tr> <td>10</td> <td>4.2 x 10^18</td> </tr>
2387 </table>
2388
2389 Since the number of DAGs is super-exponential in the number of nodes,
2390 we cannot exhaustively search the space, so we either use a local
search algorithm (e.g., greedy hill climbing, perhaps with multiple
2392 restarts) or a global search algorithm (e.g., Markov Chain Monte
2393 Carlo).
2394 <p>
2395 If we know a total ordering on the nodes,
2396 finding the best structure amounts to picking the best set of parents
2397 for each node independently.
2398 This is what the K2 algorithm does.
2399 If the ordering is unknown, we can search over orderings,
which is more efficient than searching over DAGs (Friedman and Koller, 2000).
2401 <p>
2402 In addition to the search procedure, we must specify the scoring
2403 function. There are two popular choices. The Bayesian score integrates
2404 out the parameters, i.e., it is the marginal likelihood of the model.
2405 The BIC (Bayesian Information Criterion) is defined as
2406 log P(D|theta_hat) - 0.5*d*log(N), where D is the data, theta_hat is
2407 the ML estimate of the parameters, d is the number of parameters, and
2408 N is the number of data cases.
2409 The BIC method has the advantage of not requiring a prior.
2410 <p>
2411 BIC can be derived as a large sample
2412 approximation to the marginal likelihood.
(It is also equivalent, up to a sign change, to the Minimum Description Length (MDL) criterion.)
2414 However, in practice, the sample size does not need to be very large
2415 for the approximation to be good.
2416 For example, in the figure below, we plot the ratio between the log marginal likelihood
2417 and the BIC score against data-set size; we see that the ratio rapidly
2418 approaches 1, especially for non-informative priors.
2419 (This plot was generated by the file BNT/examples/static/bic1.m. It
2420 uses the water sprinkler BN with BDeu Dirichlet priors with different
2421 equivalent sample sizes.)
2422
2423 <p>
2424 <center>
2425 <IMG SRC="Figures/bic.png">
2426 </center>
2427 <p>
2428
2429 <p>
2430 As with parameter learning, handling missing data/ hidden variables is
2431 much harder than the fully observed case.
2432 The structure learning routines in BNT can therefore be classified into 4
2433 types, analogously to the parameter learning case.
2434 <p>
2435
2436 <TABLE BORDER>
2437 <tr>
2438 <TH></TH>
2439 <th>Full obs</th>
2440 <th>Partial obs</th>
2441 </tr>
2442 <tr>
2443 <th>Point</th>
<td><tt>learn_struct_K2</tt></td>
<!-- <tt>learn_struct_hill_climb</tt> -->
<td>not yet supported</td>
2447 </tr>
2448 <tr>
2449 <th>Bayes</th>
2450 <td><tt>learn_struct_mcmc</tt></td>
2451 <td>not yet supported</td>
2452 </tr>
2453 </table>
2454
2455
2456 <h2><a name="markov_equiv">Markov equivalence</h2>
2457
2458 If two DAGs encode the same conditional independencies, they are
called Markov equivalent. The set of all DAGs can be partitioned into
2460 Markov equivalence classes. Graphs within the same class can
2461 have
2462 the direction of some of their arcs reversed without changing any of
2463 the CI relationships.
2464 Each class can be represented by a PDAG
2465 (partially directed acyclic graph) called an essential graph or
2466 pattern. This specifies which edges must be oriented in a certain
2467 direction, and which may be reversed.
2468
2469 <p>
2470 When learning graph structure from observational data,
2471 the best one can hope to do is to identify the model up to Markov
2472 equivalence. To distinguish amongst graphs within the same equivalence
2473 class, one needs interventional data: see the discussion on <a
2474 href="#active">active learning</a> below.
2475
2476
2477
2478 <h2><a name="enumerate">Exhaustive search</h2>
2479
2480 The brute-force approach to structure learning is to enumerate all
2481 possible DAGs, and score each one. This provides a "gold standard"
2482 with which to compare other algorithms. We can do this as follows.
2483 <pre>
2484 dags = mk_all_dags(N);
2485 score = score_dags(data, ns, dags);
2486 </pre>
2487 where data(i,m) is the value of node i in case m,
2488 and ns(i) is the size of node i.
2489 If the DAGs have a lot of families in common, we can cache the sufficient statistics,
2490 making this potentially more efficient than scoring the DAGs one at a time.
2491 (Caching is not currently implemented, however.)
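Having scored every DAG, it is then a one-liner to pick out the best one;
a minimal sketch:
<pre>
[best_score, best_index] = max(score);
best_dag = dags{best_index};   % adjacency matrix of the highest-scoring DAG
</pre>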
2492 <p>
2493 By default, we use the Bayesian scoring metric, and assume CPDs are
2494 represented by tables with BDeu(1) priors.
2495 We can override these defaults as follows.
2496 If we want to use uniform priors, we can say
2497 <pre>
2498 params = cell(1,N);
2499 for i=1:N
2500 params{i} = {'prior', 'unif'};
2501 end
2502 score = score_dags(data, ns, dags, 'params', params);
2503 </pre>
2504 params{i} is a cell-array, containing optional arguments that are
2505 passed to the constructor for CPD i.
2506 <p>
Now suppose we want to use different node types, e.g., suppose nodes 1 and
2 are Gaussian, and nodes 3 and 4 are softmax (both of these CPD types can
support discrete and continuous parents, which is necessary since all
other nodes will be considered as parents).
2511 The Bayesian scoring metric currently only works for tabular CPDs, so
2512 we will use BIC:
2513 <pre>
score = score_dags(data, ns, dags, 'discrete', [3 4], 'params', [], ...
    'type', {'gaussian', 'gaussian', 'softmax', 'softmax'}, 'scoring_fn', 'bic')
2516 </pre>
2517 In practice, one can't enumerate all possible DAGs for N > 5,
2518 but one can evaluate any reasonably-sized set of hypotheses in this
2519 way (e.g., nearest neighbors of your current best guess).
2520 Think of this as "computer assisted model refinement" as opposed to de
2521 novo learning.
2522
2523
2524 <h2><a name="K2">K2</h2>
2525
2526 The K2 algorithm (Cooper and Herskovits, 1992) is a greedy search algorithm that works as follows.
2527 Initially each node has no parents. It then adds incrementally that parent whose addition most
2528 increases the score of the resulting structure. When the addition of no single
2529 parent can increase the score, it stops adding parents to the node.
2530 Since we are using a fixed ordering, we do not need to check for
2531 cycles, and can choose the parents for each node independently.
2532 <p>
2533 The original paper used the Bayesian scoring
2534 metric with tabular CPDs and Dirichlet priors.
2535 BNT generalizes this to allow any kind of CPD, and either the Bayesian
2536 scoring metric or BIC, as in the example <a href="#enumerate">above</a>.
2537 In addition, you can specify
2538 an optional upper bound on the number of parents for each node.
2539 The file BNT/examples/static/k2demo1.m gives an example of how to use K2.
2540 We use the water sprinkler network and sample 100 cases from it as before.
2541 Then we see how much data it takes to recover the generating structure:
2542 <pre>
2543 order = [C S R W];
2544 max_fan_in = 2;
2545 sz = 5:5:100;
2546 for i=1:length(sz)
2547 dag2 = learn_struct_K2(data(:,1:sz(i)), node_sizes, order, 'max_fan_in', max_fan_in);
2548 correct(i) = isequal(dag, dag2);
2549 end
2550 </pre>
2551 Here are the results.
2552 <pre>
2553 correct =
2554 Columns 1 through 12
2555 0 0 0 0 0 0 0 1 0 1 1 1
2556 Columns 13 through 20
2557 1 1 1 1 1 1 1 1
2558 </pre>
2559 So we see it takes about sz(10)=50 cases. (BIC behaves similarly,
2560 showing that the prior doesn't matter too much.)
2561 In general, we cannot hope to recover the "true" generating structure,
2562 only one that is in its <a href="#markov_equiv">Markov equivalence
2563 class</a>.
2564
2565
2566 <h2><a name="hill_climb">Hill-climbing</h2>
2567
2568 Hill-climbing starts at a specific point in space,
2569 considers all nearest neighbors, and moves to the neighbor
2570 that has the highest score; if no neighbors have higher
2571 score than the current point (i.e., we have reached a local maximum),
2572 the algorithm stops. One can then restart in another part of the space.
2573 <p>
2574 A common definition of "neighbor" is all graphs that can be
2575 generated from the current graph by adding, deleting or reversing a
2576 single arc, subject to the acyclicity constraint.
2577 Other neighborhoods are possible: see
2578 <a href="http://research.microsoft.com/~dmax/publications/jmlr02.pdf">
2579 Optimal Structure Identification with Greedy Search</a>, Max
2580 Chickering, JMLR 2002.
2581
2582 <!--
2583 Note: This algorithm is currently (Feb '02) being implemented by Qian
2584 Diao.
2585 -->
2586
2587
2588 <h2><a name="mcmc">MCMC</h2>
2589
2590 We can use a Markov Chain Monte Carlo (MCMC) algorithm called
2591 Metropolis-Hastings (MH) to search the space of all
2592 DAGs.
2593 The standard proposal distribution is to consider moving to all
2594 nearest neighbors in the sense defined <a href="#hill_climb">above</a>.
2595 <p>
2596 The function can be called
2597 as in the following example.
2598 <pre>
2599 [sampled_graphs, accept_ratio] = learn_struct_mcmc(data, ns, 'nsamples', 100, 'burnin', 10);
2600 </pre>
2601 We can convert our set of sampled graphs to a histogram
2602 (empirical posterior over all the DAGs) thus
2603 <pre>
2604 all_dags = mk_all_dags(N);
2605 mcmc_post = mcmc_sample_to_hist(sampled_graphs, all_dags);
2606 </pre>
2607 To see how well this performs, let us compute the exact posterior exhaustively.
2608 <p>
2609 <pre>
2610 score = score_dags(data, ns, all_dags);
2611 post = normalise(exp(score)); % assuming uniform structural prior
2612 </pre>
2613 We plot the results below.
2614 (The data set was 100 samples drawn from a random 4 node bnet; see the
2615 file BNT/examples/static/mcmc1.)
2616 <pre>
2617 subplot(2,1,1)
2618 bar(post)
2619 subplot(2,1,2)
2620 bar(mcmc_post)
2621 </pre>
2622 <img src="Figures/mcmc_post.jpg" width="800" height="500">
2623 <p>
2624 We can also plot the acceptance ratio versus number of MCMC steps,
2625 as a crude convergence diagnostic.
2626 <pre>
2627 clf
2628 plot(accept_ratio)
2629 </pre>
2630 <img src="Figures/mcmc_accept.jpg" width="800" height="300">
2631 <p>
2632 Even though the number of samples needed by MCMC is theoretically
2633 polynomial (not exponential) in the dimensionality of the search space, in practice it has been
2634 found that MCMC does not converge in reasonable time for graphs with
2635 more than about 10 nodes.
2636
2637
2638
2639
2640 <h2><a name="active">Active structure learning</h2>
2641
2642 As was mentioned <a href="#markov_equiv">above</a>,
2643 one can only learn a DAG up to Markov equivalence, even given infinite data.
2644 If one is interested in learning the structure of a causal network,
2645 one needs interventional data.
2646 (By "intervention" we mean forcing a node to take on a specific value,
2647 thereby effectively severing its incoming arcs.)
2648 <p>
2649 Most of the scoring functions accept an optional argument
2650 that specifies whether a node was observed to have a certain value, or
2651 was forced to have that value: we set clamped(i,m)=1 if node i was
2652 forced in training case m. e.g., see the file
2653 BNT/examples/static/cooper_yoo.
2654 <p>
2655 An interesting question is to decide which interventions to perform
2656 (c.f., design of experiments). For details, see the following tech
2657 report
2658 <ul>
2659 <li> <a href = "../../Papers/alearn.ps.gz">
2660 Active learning of causal Bayes net structure</a>, Kevin Murphy, March
2661 2001.
2662 </ul>
2663
2664
2665 <h2><a name="struct_em">Structural EM</h2>
2666
2667 Computing the Bayesian score when there is partial observability is
2668 computationally challenging, because the parameter posterior becomes
2669 multimodal (the hidden nodes induce a mixture distribution).
2670 One therefore needs to use approximations such as BIC.
2671 Unfortunately, search algorithms are still expensive, because we need
2672 to run EM at each step to compute the MLE, which is needed to compute
2673 the score of each model. An alternative approach is
2674 to do the local search steps inside of the M step of EM, which is more
2675 efficient since the data has been "filled in" - this is
2676 called the structural EM algorithm (Friedman 1997), and provably
2677 converges to a local maximum of the BIC score.
2678 <p>
2679 Wei Hu has implemented SEM for discrete nodes.
2680 You can download his package from
2681 <a href="../SEM.zip">here</a>.
2682 Please address all questions about this code to
2683 wei.hu@intel.com.
See also <a href="#phl">Philippe Leray's implementation of SEM</a>.
2685
2686 <!--
2687 <h2><a name="reveal">REVEAL algorithm</h2>
2688
2689 A simple way to learn the structure of a fully observed, discrete,
2690 factored DBN from a time series is described <a
2691 href="usage_dbn.html#struct_learn">here</a>.
2692 -->
2693
2694
2695 <h2><a name="graphdraw">Visualizing the graph</h2>
2696
2697 You can visualize an arbitrary graph (such as one learned using the
2698 structure learning routines) with Matlab code contributed by
2699 <a href="http://www.mbfys.kun.nl/~cemgil/matlab/layout.html">Ali
2700 Taylan Cemgil</a>
2701 from the University of Nijmegen.
2702 For static BNs, call it as follows:
2703 <pre>
2704 draw_graph(bnet.dag);
2705 </pre>
2706 For example, this is the output produced on a
2707 <a href="#qmr">random QMR-like model</a>:
2708 <p>
2709 <img src="Figures/qmr.rnd.jpg">
2710 <p>
2711 If you install the excellent <a
2712 href="http://www.research.att.com/sw/tools/graphviz">graphhviz</a>, an
2713 open-source graph visualization package from AT&T,
2714 you can create a much better visualization as follows
2715 <pre>
2716 graph_to_dot(bnet.dag)
2717 </pre>
2718 This works by converting the adjacency matrix to a file suitable
2719 for input to graphviz (using the dot format),
2720 then converting the output of graphviz to postscript, and displaying the results using
2721 ghostview.
2722 You can do each of these steps separately for more control, as shown
2723 below.
2724 <pre>
2725 graph_to_dot(bnet.dag, 'filename', 'foo.dot');
2726 dot -Tps foo.dot -o foo.ps
2727 ghostview foo.ps &
2728 </pre>
2729
2730 <h2><a name = "constraint">Constraint-based methods</h2>
2731
2732 The IC algorithm (Pearl and Verma, 1991),
2733 and the faster, but otherwise equivalent, PC algorithm (Spirtes, Glymour, and Scheines 1993),
compute many conditional independence tests,
and combine these constraints into a
2736 PDAG to represent the whole
2737 <a href="#markov_equiv">Markov equivalence class</a>.
2738 <p>
2739 IC*/FCI extend IC/PC to handle latent variables: see <a href="#ic_star">below</a>.
2740 (IC stands for inductive causation; PC stands for Peter and Clark,
2741 the first names of Spirtes and Glymour; FCI stands for fast causal
2742 inference.
2743 What we, following Pearl (2000), call IC* was called
2744 IC in the original Pearl and Verma paper.)
2745 For details, see
2746 <ul>
2747 <li>
2748 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Causation,
2749 Prediction, and Search</a>, Spirtes, Glymour and
2750 Scheines (SGS), 2001 (2nd edition), MIT Press.
2751 <li>
2752 <a href="http://bayes.cs.ucla.edu/BOOK-2K/index.html">Causality: Models, Reasoning and Inference</a>, J. Pearl,
2753 2000, Cambridge University Press.
2754 </ul>
2755
2756 <p>
2757
2758 The PC algorithm takes as arguments a function f, the number of nodes N,
2759 the maximum fan in K, and additional arguments A which are passed to f.
2760 The function f(X,Y,S,A) returns 1 if X is conditionally independent of Y given S, and 0
2761 otherwise.
2762 For example, suppose we cheat by
2763 passing in a CI "oracle" which has access to the true DAG; the oracle
2764 tests for d-separation in this DAG, i.e.,
f(X,Y,S) calls dsep(X,Y,S,dag). We can do this as follows.
2766 <pre>
2767 pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag);
2768 </pre>
2769 pdag(i,j) = -1 if there is definitely an i->j arc,
and pdag(i,j) = 1 if there is either an i->j or an i<-j arc.
2771 <p>
2772 Applied to the sprinkler network, this returns
2773 <pre>
2774 pdag =
2775 0 1 1 0
2776 1 0 0 -1
2777 1 0 0 -1
2778 0 0 0 0
2779 </pre>
2780 So as expected, we see that the V-structure at the W node is uniquely identified,
2781 but the other arcs have ambiguous orientation.
2782 <p>
2783 We now give an example from p141 (1st edn) / p103 (2nd end) of the SGS
2784 book.
2785 This example concerns the female orgasm.
2786 We are given a correlation matrix C between 7 measured factors (such
2787 as subjective experiences of coital and masturbatory experiences),
2788 derived from 281 samples, and want to learn a causal model of the
2789 data. We will not discuss the merits of this type of work here, but
2790 merely show how to reproduce the results in the SGS book.
2791 Their program,
2792 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Tetrad</a>,
2793 makes use of the Fisher Z-test for conditional
2794 independence, so we do the same:
2795 <pre>
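n = 7;            % number of measured variables (C is a 7x7 correlation matrix)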
2796 max_fan_in = 4;
2797 nsamples = 281;
2798 alpha = 0.05;
2799 pdag = learn_struct_pdag_pc('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha);
2800 </pre>
2801 In this case, the CI test is
2802 <pre>
2803 f(X,Y,S) = cond_indep_fisher_z(X,Y,S, C,nsamples,alpha)
2804 </pre>
2805 The results match those of Fig 12a of SGS apart from two edge
2806 differences; presumably this is due to rounding error (although it
2807 could be a bug, either in BNT or in Tetrad).
2808 This example can be found in the file BNT/examples/static/pc2.m.
2809
2810 <p>
2811
2812 The IC* algorithm (Pearl and Verma, 1991),
2813 and the faster FCI algorithm (Spirtes, Glymour, and Scheines 1993),
2814 are like the IC/PC algorithm, except that they can detect the presence
2815 of latent variables.
2816 See the file <tt>learn_struct_pdag_ic_star</tt> written by Tamar
2817 Kushnir. The output is a matrix P, defined as follows
2818 (see Pearl (2000), p52 for details):
2819 <pre>
2820 % P(i,j) = -1 if there is either a latent variable L such that i <-L->j OR there is a directed edge from i->j.
2821 % P(i,j) = -2 if there is a marked directed i-*>j edge.
% P(i,j) = P(j,i) = 1 if there is an undirected edge i--j
2823 % P(i,j) = P(j,i) = 2 if there is a latent variable L such that i<-L->j.
2824 </pre>
2825
2826
2827 <h2><a name="phl">Philippe Leray's structure learning package</h2>
2828
2829 Philippe Leray has written a
2830 <a href="http://bnt.insa-rouen.fr/ajouts.html">
2831 structure learning package</a> that uses BNT.
2832
It currently (June 2003) has the following features:
2834 <ul>
2835 <li>PC with Chi2 statistical test
2836 <li> MWST : Maximum weighted Spanning Tree
2837 <li> Hill Climbing
2838 <li> Greedy Search
2839 <li> Structural EM
2840 <li> hist_ic : optimal Histogram based on IC information criterion
2841 <li> cpdag_to_dag
2842 <li> dag_to_cpdag
2843 <li> ...
2844 </ul>
2845
2846
2848
2849
2850 <!--
2851 <h2><a name="read_learning">Further reading on learning</h2>
2852
2853 I recommend the following tutorials for more details on learning.
2854 <ul>
2855 <li> <a
2856 href="http://www.cs.berkeley.edu/~murphyk/Papers/intel.ps.gz">My short
2857 tutorial</a> on graphical models, which contains an overview of learning.
2858
2859 <li>
2860 <A HREF="ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.PS">
2861 A tutorial on learning with Bayesian networks</a>, D. Heckerman,
2862 Microsoft Research Tech Report, 1995.
2863
2864 <li> <A HREF="http://www-cad.eecs.berkeley.edu/~wray/Mirror/lwgmja">
2865 Operations for Learning with Graphical Models</a>,
2866 W. L. Buntine, JAIR'94, 159--225.
2867 </ul>
2868 <p>
2869 -->
2870
2871
2872
2873
2874
2875 <h1><a name="engines">Inference engines</h1>
2876
2877 Up until now, we have used the junction tree algorithm for inference.
2878 However, sometimes this is too slow, or not even applicable.
2879 In general, there are many inference algorithms each of which make
2880 different tradeoffs between speed, accuracy, complexity and
2881 generality. Furthermore, there might be many implementations of the
2882 same algorithm; for instance, a general purpose, readable version,
2883 and a highly-optimized, specialized one.
2884 To cope with this variety, we treat each inference algorithm as an
2885 object, which we call an inference engine.
2886
2887 <p>
2888 An inference engine is an object that contains a bnet and supports the
2889 'enter_evidence' and 'marginal_nodes' methods. The engine constructor
2890 takes the bnet as argument and may do some model-specific processing.
2891 When 'enter_evidence' is called, the engine may do some
2892 evidence-specific processing. Finally, when 'marginal_nodes' is
2893 called, the engine may do some query-specific processing.
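<p>
In code, every engine therefore supports the same three-step calling
pattern (shown here with the junction tree engine, which we have been
using throughout):
<pre>
engine = jtree_inf_engine(bnet);            % constructor: model-specific processing
engine = enter_evidence(engine, evidence);  % evidence-specific processing
m = marginal_nodes(engine, i);              % query-specific processing
</pre>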
2894
2895 <p>
2896 The amount of work done when each stage is specified -- structure,
2897 parameters, evidence, and query -- depends on the engine. The cost of
2898 work done early in this sequence can be amortized. On the other hand,
2899 one can make better optimizations if one waits until later in the
2900 sequence.
2901 For example, the parameters might imply
conditional independencies that are not evident in the graph structure,
2903 but can nevertheless be exploited; the evidence indicates which nodes
2904 are observed and hence can effectively be disconnected from the
2905 graph; and the query might indicate that large parts of the network
2906 are d-separated from the query nodes. (Since it is not the actual
2907 <em>values</em> of the evidence that matters, just which nodes are observed,
2908 many engines allow you to specify which nodes will be observed when they are constructed,
2909 i.e., before calling 'enter_evidence'. Some engines can still cope if
2910 the actual pattern of evidence is different, e.g., if there is missing
2911 data.)
2912 <p>
2913
2914 Although being maximally lazy (i.e., only doing work when a query is
2915 issued) may seem desirable,
2916 this is not always the most efficient.
2917 For example,
2918 when learning using EM, we need to call marginal_nodes N times, where N is the
number of nodes. <a href="#varelim">Variable elimination</a> would end
2920 up repeating a lot of work
2921 each time marginal_nodes is called, making it inefficient for
2922 learning. The junction tree algorithm, by contrast, uses dynamic
2923 programming to avoid this redundant computation --- it calculates all
2924 marginals in two passes during 'enter_evidence', so calling
2925 'marginal_nodes' takes constant time.
2926 <p>
2927 We will discuss some of the inference algorithms implemented in BNT
2928 below, and finish with a <a href="#engine_summary">summary</a> of all
2929 of them.
2930
2931
2932
2933
2934
2935
2936
2937 <h2><a name="varelim">Variable elimination</h2>
2938
2939 The variable elimination algorithm, also known as bucket elimination
2940 or peeling, is one of the simplest inference algorithms.
2941 The basic idea is to "push sums inside of products"; this is explained
2942 in more detail
2943 <a
2944 href="http://HTTP.CS.Berkeley.EDU/~murphyk/Bayes/bayes.html#infer">here</a>.
2945 <p>
2946 The principle of distributing sums over products can be generalized
2947 greatly to apply to any commutative semiring.
2948 This forms the basis of many common algorithms, such as Viterbi
2949 decoding and the Fast Fourier Transform. For details, see
2950
2951 <ul>
2952 <li> R. McEliece and S. M. Aji, 2000.
2953 <!--<a href="http://www.systems.caltech.edu/EE/Faculty/rjm/papers/GDL.ps">-->
2954 <a href="GDL.pdf">
2955 The Generalized Distributive Law</a>,
2956 IEEE Trans. Inform. Theory, vol. 46, no. 2 (March 2000),
2957 pp. 325--343.
2958
2959
2960 <li>
2961 F. R. Kschischang, B. J. Frey and H.-A. Loeliger, 2001.
2962 <a href="http://www.cs.toronto.edu/~frey/papers/fgspa.abs.html">
2963 Factor graphs and the sum-product algorithm</a>
2964 IEEE Transactions on Information Theory, February, 2001.
2965
2966 </ul>
2967
2968 <p>
2969 Choosing an order in which to sum out the variables so as to minimize
2970 computational cost is known to be NP-hard.
2971 The implementation of this algorithm in
2972 <tt>var_elim_inf_engine</tt> makes no attempt to optimize this
2973 ordering (in contrast, say, to <tt>jtree_inf_engine</tt>, which uses a
2974 greedy search procedure to find a good ordering).
2975 <p>
2976 Note: unlike most algorithms, var_elim does all its computational work
2977 inside of <tt>marginal_nodes</tt>, not inside of
2978 <tt>enter_evidence</tt>.
2979
2980
2981
2982
2983 <h2><a name="global">Global inference methods</h2>
2984
The simplest inference algorithm of all is to explicitly construct
2986 the joint distribution over all the nodes, and then to marginalize it.
2987 This is implemented in <tt>global_joint_inf_engine</tt>.
2988 Since the size of the joint is exponential in the
2989 number of discrete (hidden) nodes, this is not a very practical algorithm.
2990 It is included merely for pedagogical and debugging purposes.
2991 <p>
2992 Three specialized versions of this algorithm have also been implemented,
2993 corresponding to the cases where all the nodes are discrete (D), all
2994 are Gaussian (G), and some are discrete and some Gaussian (CG).
2995 They are called <tt>enumerative_inf_engine</tt>,
2996 <tt>gaussian_inf_engine</tt>,
2997 and <tt>cond_gauss_inf_engine</tt> respectively.
2998 <p>
2999 Note: unlike most algorithms, these global inference algorithms do all their computational work
3000 inside of <tt>marginal_nodes</tt>, not inside of
3001 <tt>enter_evidence</tt>.
3002
3003
3004 <h2><a name="quickscore">Quickscore</h2>
3005
3006 The junction tree algorithm is quite slow on the <a href="#qmr">QMR</a> network,
3007 since the cliques are so big.
3008 One simple trick we can use is to notice that hidden leaves do not
3009 affect the posteriors on the roots, and hence do not need to be
3010 included in the network.
3011 A second trick is to notice that the negative findings can be
3012 "absorbed" into the prior:
3013 see the file
3014 BNT/examples/static/mk_minimal_qmr_bnet for details.
3015 <p>
3016
3017 A much more significant speedup is obtained by exploiting special
3018 properties of the noisy-or node, as done by the quickscore
3019 algorithm. For details, see
3020 <ul>
3021 <li> Heckerman, "A tractable inference algorithm for diagnosing multiple diseases", UAI 89.
3022 <li> Rish and Dechter, "On the impact of causal independence", UCI
3023 tech report, 1998.
3024 </ul>
3025
3026 This has been implemented in BNT as a special-purpose inference
3027 engine, which can be created and used as follows:
3028 <pre>
3029 engine = quickscore_inf_engine(inhibit, leak, prior);
3030 engine = enter_evidence(engine, pos, neg);
3031 m = marginal_nodes(engine, i);
3032 </pre>
3033
3034
3035 <h2><a name="belprop">Belief propagation</h2>
3036
3037 Even using quickscore, exact inference takes time that is exponential
3038 in the number of positive findings.
3039 Hence for large networks we need to resort to approximate inference techniques.
3040 See for example
3041 <ul>
3042 <li> T. Jaakkola and M. Jordan, "Variational probabilistic inference and the
3043 QMR-DT network", JAIR 10, 1999.
3044
3045 <li> K. Murphy, Y. Weiss and M. Jordan, "Loopy belief propagation for approximate inference: an empirical study",
3046 UAI 99.
3047 </ul>
3048 The latter approximation
3049 entails applying Pearl's belief propagation algorithm to a model even
3050 if it has loops (hence the name loopy belief propagation).
3051 Pearl's algorithm, implemented as <tt>pearl_inf_engine</tt>, gives
3052 exact results when applied to singly-connected graphs
3053 (a.k.a. polytrees, since
3054 the underlying undirected topology is a tree, but a node may have
3055 multiple parents).
To apply this algorithm to a graph with loops, pass <tt>pearl_inf_engine</tt>
a maximum number of iterations ('max_iter'), as in the example below.
3058 This can use a centralized or distributed message passing protocol.
3059 You can use it as in the following example.
3060 <pre>
3061 engine = pearl_inf_engine(bnet, 'max_iter', 30);
3062 engine = enter_evidence(engine, evidence);
3063 m = marginal_nodes(engine, i);
3064 </pre>
We found that this algorithm often converges, and that when it does, it is
often very accurate; however, this depends on the precise parameter values
of the network.
(See the file BNT/examples/static/qmr1 to repeat the experiment for yourself.)
Understanding when and why belief propagation converges, and how accurate
its answers are, is a topic of ongoing research.
3071 <p>
3072 <tt>pearl_inf_engine</tt> can exploit special structure in noisy-or
3073 and gmux nodes to compute messages efficiently.
3074 <p>
<tt>belprop_inf_engine</tt> is like pearl, but uses potentials to
represent messages, and is therefore slower.
3077 <p>
3078 <tt>belprop_fg_inf_engine</tt> is like belprop,
3079 but is designed for factor graphs.
3080
3081
3082
3083 <h2><a name="sampling">Sampling</h2>
3084
3085 BNT now (Mar '02) has two sampling (Monte Carlo) inference algorithms:
3086 <ul>
3087 <li> <tt>likelihood_weighting_inf_engine</tt> which does importance
3088 sampling and can handle any node type.
3089 <li> <tt>gibbs_sampling_inf_engine</tt>, written by Bhaskara Marthi.
3090 Currently this can only handle tabular CPDs.
3091 For a much faster and more powerful Gibbs sampling program, see
3092 <a href="http://www.mrc-bsu.cam.ac.uk/bugs">BUGS</a>.
3093 </ul>
3094 Note: To generate samples from a network (which is not the same as inference!),
3095 use <tt>sample_bnet</tt>.
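<p>
As a minimal sketch, likelihood weighting follows the same
enter_evidence/marginal_nodes pattern as the other engines
(this assumes <tt>bnet</tt> and an <tt>evidence</tt> cell array already exist):
<pre>
engine = likelihood_weighting_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);

sample = sample_bnet(bnet);   % draws one random instantiation of every node
</pre>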
3096
3097
3098
3099 <h2><a name="engine_summary">Summary of inference engines</h2>
3100
3101
3102 The inference engines differ in many ways. Here are
3103 some of the major "axes":
3104 <ul>
3105 <li> Works for all topologies or makes restrictions?
3106 <li> Works for all node types or makes restrictions?
3107 <li> Exact or approximate inference?
3108 </ul>
3109
3110 <p>
3111 In terms of topology, most engines handle any kind of DAG.
3112 <tt>belprop_fg</tt> does approximate inference on factor graphs (FG), which
3113 can be used to represent directed, undirected, and mixed (chain)
3114 graphs.
3115 (In the future, we plan to support exact inference on chain graphs.)
3116 <tt>quickscore</tt> only works on QMR-like models.
3117 <p>
3118 In terms of node types: algorithms that use potentials can handle
3119 discrete (D), Gaussian (G) or conditional Gaussian (CG) models.
3120 Sampling algorithms can essentially handle any kind of node (distribution).
3121 Other algorithms make more restrictive assumptions in exchange for
3122 speed.
3123 <p>
3124 Finally, most algorithms are designed to give the exact answer.
3125 The belief propagation algorithms are exact if applied to trees, and
3126 in some other cases.
3127 Sampling is considered approximate, even though, in the limit of an
3128 infinite number of samples, it gives the exact answer.
3129
3130 <p>
3131
3132 Here is a summary of the properties
3133 of all the engines in BNT which work on static networks.
3134 <p>
<table border=1>
<tr>
<td align=left width=0>Name
<td align=left width=0>Exact or approximate?
<td align=left width=0>Node types
<td align=left width=0>Topology
<tr>
3143 <td align=left> belprop
3144 <td align=left> approx
3145 <td align=left> D
3146 <td align=left> DAG
3147 <tr>
3148 <td align=left> belprop_fg
3149 <td align=left> approx
3150 <td align=left> D
3151 <td align=left> factor graph
3152 <tr>
3153 <td align=left> cond_gauss
3154 <td align=left> exact
3155 <td align=left> CG
3156 <td align=left> DAG
3157 <tr>
3158 <td align=left> enumerative
3159 <td align=left> exact
3160 <td align=left> D
3161 <td align=left> DAG
3162 <tr>
3163 <td align=left> gaussian
3164 <td align=left> exact
3165 <td align=left> G
3166 <td align=left> DAG
3167 <tr>
3168 <td align=left> gibbs
3169 <td align=left> approx
3170 <td align=left> D
3171 <td align=left> DAG
3172 <tr>
3173 <td align=left> global_joint
3174 <td align=left> exact
3175 <td align=left> D,G,CG
3176 <td align=left> DAG
3177 <tr>
3178 <td align=left> jtree
3179 <td align=left> exact
3180 <td align=left> D,G,CG
3181 <td align=left> DAG
<tr>
3183 <td align=left> likelihood_weighting
3184 <td align=left> approx
3185 <td align=left> any
3186 <td align=left> DAG
3187 <tr>
3188 <td align=left> pearl
3189 <td align=left> approx
3190 <td align=left> D,G
3191 <td align=left> DAG
3192 <tr>
3193 <td align=left> pearl
3194 <td align=left> exact
3195 <td align=left> D,G
3196 <td align=left> polytree
3197 <tr>
3198 <td align=left> quickscore
3199 <td align=left> exact
3200 <td align=left> noisy-or
3201 <td align=left> QMR
3202 <tr>
3203 <td align=left> stab_cond_gauss
3204 <td align=left> exact
3205 <td align=left> CG
3206 <td align=left> DAG
3207 <tr>
3208 <td align=left> var_elim
3209 <td align=left> exact
3210 <td align=left> D,G,CG
3211 <td align=left> DAG
3212 </table>
3213
3214
3215
<h1><a name="influence">Influence diagrams / decision making</h1>
3217
3218 BNT implements an exact algorithm for solving LIMIDs (limited memory
3219 influence diagrams), described in
3220 <ul>
3221 <li> S. L. Lauritzen and D. Nilsson.
3222 <a href="http://www.math.auc.dk/~steffen/papers/limids.pdf">
3223 Representing and solving decision problems with limited
information</a>.
Management Science, 47:1238-1251, September 2001.
3226 </ul>
LIMIDs explicitly show all information arcs, rather than implicitly
assuming no forgetting. This allows them to model forgetful
controllers.
3230 <p>
3231 See the examples in <tt>BNT/examples/limids</tt> for details.
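<p>
As a rough sketch of the calling pattern (this assumes a LIMID called
<tt>limid</tt> has already been constructed; the engine and solver names
below are based on the bundled examples, so check those files for the
exact interface):
<pre>
engine = jtree_limid_inf_engine(limid);
[strategy, MEU] = solve_limid(engine);   % optimal policies and maximum expected utility
</pre>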
3232
3233
3234
3235
3236 <h1>DBNs, HMMs, Kalman filters and all that</h1>
3237
3238 Click <a href="usage_dbn.html">here</a> for documentation about how to
3239 use BNT for dynamical systems and sequence data.
3240
3241
3242 </BODY>