1 <HEAD>
2 <TITLE>How to use the Bayes Net Toolbox</TITLE>
3 </HEAD>
4
5 <BODY BGCOLOR="#FFFFFF">
6 <!-- white background is better for the pictures and equations -->
7
8 <h1>How to use the Bayes Net Toolbox</h1>
9
10 This documentation was last updated on 13 November 2002.
11 <br>
12 Click <a href="changelog.html">here</a> for a list of changes made to
13 BNT.
14 <br>
15 Click
16 <a href="http://bnt.insa-rouen.fr/">here</a>
17 for a French version of this documentation (which might not
18 be up-to-date).
19
20
21 <p>
22
23 <ul>
24 <li> <a href="#install">Installation</a>
25 <ul>
26 <li> <a href="#installM">Installing the Matlab code</a>
27 <li> <a href="#installC">Installing the C code</a>
28 <li> <a href="../matlab_tips.html">Useful Matlab tips</a>.
29 </ul>
30
31 <li> <a href="#basics">Creating your first Bayes net</a>
32 <ul>
33 <li> <a href="#basics">Creating a model by hand</a>
34 <li> <a href="#file">Loading a model from a file</a>
35 <li> <a href="http://bnt.insa-rouen.fr/ajouts.html">Creating a model using a GUI</a>
36 </ul>
37
38 <li> <a href="#inference">Inference</a>
39 <ul>
40 <li> <a href="#marginal">Computing marginal distributions</a>
41 <li> <a href="#joint">Computing joint distributions</a>
42 <li> <a href="#soft">Soft/virtual evidence</a>
43 <li> <a href="#mpe">Most probable explanation</a>
44 </ul>
45
46 <li> <a href="#cpd">Conditional Probability Distributions</a>
47 <ul>
48 <li> <a href="#tabular">Tabular (multinomial) nodes</a>
49 <li> <a href="#noisyor">Noisy-or nodes</a>
50 <li> <a href="#deterministic">Other (noisy) deterministic nodes</a>
51 <li> <a href="#softmax">Softmax (multinomial logit) nodes</a>
52 <li> <a href="#mlp">Neural network nodes</a>
53 <li> <a href="#root">Root nodes</a>
54 <li> <a href="#gaussian">Gaussian nodes</a>
55 <li> <a href="#glm">Generalized linear model nodes</a>
56 <li> <a href="#dtree">Classification/regression tree nodes</a>
57 <li> <a href="#nongauss">Other continuous distributions</a>
58 <li> <a href="#cpd_summary">Summary of CPD types</a>
59 </ul>
60
61 <li> <a href="#examples">Example models</a>
62 <ul>
63 <li> <a
64 href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">
65 Gaussian mixture models</a>
66 <li> <a href="#pca">PCA, ICA, and all that</a>
<li> <a href="#mixexp">Mixtures of experts</a>
68 <li> <a href="#hme">Hierarchical mixtures of experts</a>
69 <li> <a href="#qmr">QMR</a>
70 <li> <a href="#cg_model">Conditional Gaussian models</a>
71 <li> <a href="#hybrid">Other hybrid models</a>
72 </ul>
73
74 <li> <a href="#param_learning">Parameter learning</a>
75 <ul>
76 <li> <a href="#load_data">Loading data from a file</a>
77 <li> <a href="#mle_complete">Maximum likelihood parameter estimation from complete data</a>
78 <li> <a href="#prior">Parameter priors</a>
79 <li> <a href="#bayes_learn">(Sequential) Bayesian parameter updating from complete data</a>
80 <li> <a href="#em">Maximum likelihood parameter estimation with missing values (EM)</a>
81 <li> <a href="#tying">Parameter tying</a>
82 </ul>
83
84 <li> <a href="#structure_learning">Structure learning</a>
85 <ul>
86 <li> <a href="#enumerate">Exhaustive search</a>
87 <li> <a href="#K2">K2</a>
88 <li> <a href="#hill_climb">Hill-climbing</a>
89 <li> <a href="#mcmc">MCMC</a>
90 <li> <a href="#active">Active learning</a>
91 <li> <a href="#struct_em">Structural EM</a>
92 <li> <a href="#graphdraw">Visualizing the learned graph structure</a>
93 <li> <a href="#constraint">Constraint-based methods</a>
94 </ul>
95
96
97 <li> <a href="#engines">Inference engines</a>
98 <ul>
99 <li> <a href="#jtree">Junction tree</a>
100 <li> <a href="#varelim">Variable elimination</a>
101 <li> <a href="#global">Global inference methods</a>
102 <li> <a href="#quickscore">Quickscore</a>
103 <li> <a href="#belprop">Belief propagation</a>
104 <li> <a href="#sampling">Sampling (Monte Carlo)</a>
105 <li> <a href="#engine_summary">Summary of inference engines</a>
106 </ul>
107
108
109 <li> <a href="#influence">Influence diagrams/ decision making</a>
110
111
112 <li> <a href="usage_dbn.html">DBNs, HMMs, Kalman filters and all that</a>
113 </ul>
114
115 </ul>
116
117
118
119
120 <h1><a name="install">Installation</h1>
121
122 <h2><a name="installM">Installing the Matlab code</h2>
123
124 <ul>
125 <li> <a href="bnt_download.html">Download</a> the BNT.zip file.
126
127 <p>
128 <li> Unpack the file. In Unix, type
129 <!--"tar xvf BNT.tar".-->
130 "unzip FullBNT.zip".
131 In Windows, use
132 a program like <a href="http://www.winzip.com">Winzip</a>. This will
133 create a directory called FullBNT, which contains BNT and other libraries.
134
135 <p>
136 <li> Read the file <tt>BNT/README</tt> to make sure the date
137 matches the one on the top of <a href=bnt.html>the BNT home page</a>.
138 If not, you may need to press 'refresh' on your browser, and download
139 again, to get the most recent version.
140
141 <p>
142 <li> <b>Edit the file "BNT/add_BNT_to_path.m"</b> so it contains the correct
143 pathname.
144 For example, in Windows,
145 I download FullBNT.zip into C:\kpmurphy\matlab, and
146 then comment out the second line (with the % character), and uncomment
147 the third line, which reads
148 <pre>
149 BNT_HOME = 'C:\kpmurphy\matlab\FullBNT';
150 </pre>
151
152 <p>
153 <li> Start up Matlab.
154
155 <p>
156 <li> Type "ver" at the Matlab prompt (">>").
157 <b>You need Matlab version 5.2 or newer to run BNT</b>.
158 (Versions 5.0 and 5.1 have a memory leak which seems to sometimes
159 crash BNT.)
160
161 <p>
162 <li> Move to the BNT directory.
163 For example, in Windows, I type
164 <pre>
165 >> cd C:\kpmurphy\matlab\FullBNT\BNT
166 </pre>
167
168 <p>
169 <li> Type "add_BNT_to_path".
170 This executes the command
171 <tt>addpath(genpath(BNT_HOME))</tt>,
172 which adds all directories below FullBNT to the matlab path.
173
174 <p>
175 <li> Type "test_BNT".
176
177 If all goes well, this will produce a bunch of numbers and maybe some
178 warning messages (which you can ignore), but no error messages.
179 (The warnings should only be of the form
180 "Warning: Maximum number of iterations has been exceeded", and are
181 produced by Netlab.)
182
183 <p>
184 <li> <a href="http://groups.yahoo.com/group/BayesNetToolbox/join">
185 Join the BNT email list</a>
186
187 </ul>
188
189
190 If you are new to Matlab, you might like to check out
191 <a href="matlab_tips.html">some useful Matlab tips</a>.
192 For instance, this explains how to create a startup file, which can be
193 used to set your path variable automatically, so you can avoid having
194 to type the above commands every time.
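<p>
For example, a minimal <tt>startup.m</tt> might look like the sketch below;
the path is the example install location used above, so adjust it to your own
system.
<pre>
% startup.m -- executed automatically when Matlab starts
BNT_HOME = 'C:\kpmurphy\matlab\FullBNT';  % where you unzipped FullBNT
addpath(genpath(BNT_HOME));               % add BNT and all its subdirectories to the path
</pre>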
195
196
197
198
199 <h2><a name="installC">Installing the C code</h2>
200
201 Some BNT functions also have C implementations.
202 <b>It is not necessary to install the C code</b>, but it can result in a speedup
203 of a factor of 5-10.
204 To install all the C code,
205 edit installC_BNT.m so it contains the right path,
206 then type <tt>installC_BNT</tt>.
207 To uninstall all the C code,
208 edit uninstallC_BNT.m so it contains the right path,
209 then type <tt>uninstallC_BNT</tt>.
210 For an up-to-date list of the files which have C implementations, see
211 BNT/installC_BNT.m.
212
213 <p>
214 mex is a script that lets you call C code from Matlab - it does not compile matlab to
215 C (see mcc below).
216 If your C/C++ compiler is set up correctly, mex should work out of
217 the box.
218 If not, you might need to type
219 <p>
220 <tt> mex -setup</tt>
221 <p>
222 before calling installC.
223 <p>
224 To make mex call gcc on Windows,
225 you must install <a
226 href="http://www.mrc-cbu.cam.ac.uk/Imaging/gnumex20.html">gnumex</a>.
227 You can use the <a href="http://www.mingw.org/">minimalist GNU for
228 Windows</a> version of gcc, or
229 the <a href="http://sources.redhat.com/cygwin/">cygwin</a> version.
230 <p>
231 In general, typing
232 'mex foo.c' from inside Matlab creates a file called
233 'foo.mexglx' or 'foo.dll' (the exact file
234 extension is system dependent - on Linux it is 'mexglx', on Windows it is '.dll').
235 The resulting file will hide the original 'foo.m' (if it existed), i.e.,
236 typing 'foo' at the prompt will call the compiled C version.
237 To reveal the original matlab version, just delete foo.mexglx (this is
238 what uninstallC does).
239 <p>
240 Sometimes it takes time for Matlab to realize that the file has
241 changed from matlab to C or vice versa; try typing 'clear all' or
242 restarting Matlab to refresh it.
243 To find out which version of a file you are running, type
244 'which foo'.
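<p>
For example, using the hypothetical file name 'foo' from above:
<pre>
mex foo.c      % creates foo.dll (Windows) or foo.mexglx (Linux)
which foo      % shows which version (M-file or compiled) will be called
clear all      % force Matlab to notice that the file has changed
</pre>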
245 <p>
246 <a href="http://www.mathworks.com/products/compiler">mcc</a>, the
247 Matlab to C compiler, is a separate product,
248 and is quite different from mex. It does not yet support
249 objects/classes, which is why we can't compile all of BNT to C automatically.
250 Also, hand-written C code is usually much
251 better than the C code generated by mcc.
252
253
254 <p>
255 Acknowledgements:
256 Although I wrote some of the C code, most of
257 the C code (e.g., for jtree and dpot) was written by Wei Hu;
258 the triangulation C code was written by Ilya Shpitser.
259
260
261 <h1><a name="basics">Creating your first Bayes net</h1>
262
263 To define a Bayes net, you must specify the graph structure and then
264 the parameters. We look at each in turn, using a simple example
265 (adapted from Russell and
266 Norvig, "Artificial Intelligence: a Modern Approach", Prentice Hall,
267 1995, p454).
268
269
270 <h2>Graph structure</h2>
271
272
273 Consider the following network.
274
275 <p>
276 <center>
277 <IMG SRC="Figures/sprinkler.gif">
278 </center>
279 <p>
280
281 <P>
282 To specify this directed acyclic graph (dag), we create an adjacency matrix:
283 <PRE>
284 N = 4;
285 dag = zeros(N,N);
286 C = 1; S = 2; R = 3; W = 4;
287 dag(C,[R S]) = 1;
288 dag(R,W) = 1;
289 dag(S,W)=1;
290 </PRE>
291 <P>
292 We have numbered the nodes as follows:
293 Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4.
294 <b>The nodes must always be numbered in topological order, i.e.,
295 ancestors before descendants.</b>
296 For a more complicated graph, this is a little inconvenient: we will
297 see how to get around this <a href="usage_dbn.html#bat">below</a>.
298 <p>
299 In Matlab 6, you can use logical arrays instead of double arrays,
300 which are 4 times smaller:
301 <pre>
302 dag = false(N,N);
303 dag(C,[R S]) = true;
304 ...
305 </pre>
306 <p>
307 A preliminary attempt to make a <b>GUI</b>
has been written by Philippe LeRay and can be downloaded
309 from <a href="http://bnt.insa-rouen.fr/ajouts.html">here</a>.
310 <p>
311 You can visualize the resulting graph structure using
312 the methods discussed <a href="#graphdraw">below</a>.
313
314 <h2>Creating the Bayes net shell</h2>
315
316 In addition to specifying the graph structure,
317 we must specify the size and type of each node.
318 If a node is discrete, its size is the
319 number of possible values
320 each node can take on; if a node is continuous,
321 it can be a vector, and its size is the length of this vector.
322 In this case, we will assume all nodes are discrete and binary.
323 <PRE>
324 discrete_nodes = 1:N;
325 node_sizes = 2*ones(1,N);
326 </pre>
327 If the nodes were not binary, you could type e.g.,
328 <pre>
329 node_sizes = [4 2 3 5];
330 </pre>
331 meaning that Cloudy has 4 possible values,
332 Sprinkler has 2 possible values, etc.
Note that these values are categorical (nominal), not ordinal, i.e.,
they are not ordered in any way, like 'low', 'medium', 'high'.
335 <p>
336 We are now ready to make the Bayes net:
337 <pre>
338 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes);
339 </PRE>
340 By default, all nodes are assumed to be discrete, so we can also just
341 write
342 <pre>
343 bnet = mk_bnet(dag, node_sizes);
344 </PRE>
345 You may also specify which nodes will be observed.
If you don't know, or if this is not fixed in advance,
347 just use the empty list (the default).
348 <pre>
349 onodes = [];
350 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);
351 </PRE>
352 Note that optional arguments are specified using a name/value syntax.
353 This is common for many BNT functions.
354 In general, to find out more about a function (e.g., which optional
355 arguments it takes), please see its
356 documentation string by typing
357 <pre>
358 help mk_bnet
359 </pre>
360 See also other <a href="matlab_tips.html">useful Matlab tips</a>.
361 <p>
362 It is possible to associate names with nodes, as follows:
363 <pre>
364 bnet = mk_bnet(dag, node_sizes, 'names', {'cloudy','S','R','W'}, 'discrete', 1:4);
365 </pre>
366 You can then refer to a node by its name:
367 <pre>
368 C = bnet.names{'cloudy'}; % bnet.names is an associative array
369 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
370 </pre>
371
372
373 <h2><a name="cpt">Parameters</h2>
374
375 A model consists of the graph structure and the parameters.
376 The parameters are represented by CPD objects (CPD = Conditional
377 Probability Distribution), which define the probability distribution
378 of a node given its parents.
379 (We will use the terms "node" and "random variable" interchangeably.)
380 The simplest kind of CPD is a table (multi-dimensional array), which
381 is suitable when all the nodes are discrete-valued. Note that the discrete
382 values are not assumed to be ordered in any way; that is, they
383 represent categorical quantities, like male and female, rather than
384 ordinal quantities, like low, medium and high.
385 (We will discuss CPDs in more detail <a href="#cpd">below</a>.)
386 <p>
387 Tabular CPDs, also called CPTs (conditional probability tables),
388 are stored as multidimensional arrays, where the dimensions
389 are arranged in the same order as the nodes, e.g., the CPT for node 4
390 (WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself.
391 Hence the child is always the last dimension.
392 If a node has no parents, its CPT is a column vector representing its
393 prior.
394 Note that in Matlab (unlike C), arrays are indexed
from 1, and are laid out in memory such that the first index toggles
396 fastest, e.g., the CPT for node 4 (WetGrass) is as follows
397 <P>
398 <P><IMG ALIGN=BOTTOM SRC="Figures/CPTgrass.gif"><P>
399 <P>
400 where we have used the convention that false==1, true==2.
401 We can create this CPT in Matlab as follows
402 <PRE>
403 CPT = zeros(2,2,2);
404 CPT(1,1,1) = 1.0;
405 CPT(2,1,1) = 0.1;
406 ...
407 </PRE>
408 Here is an easier way:
409 <PRE>
410 CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);
411 </PRE>
412 In fact, we don't need to reshape the array, since the CPD constructor
413 will do that for us. So we can just write
414 <pre>
415 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
416 </pre>
417 The other nodes are created similarly (using the old syntax for
418 optional parameters)
419 <PRE>
420 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
421 bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
422 bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
423 bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
424 </PRE>
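<p>
As a quick sanity check of the indexing convention (first index toggles
fastest, child last), we can verify a couple of entries of this CPT by hand:
<pre>
CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);
CPT(2,1,1)  % P(W=false | S=true,  R=false) = 0.1
CPT(1,2,2)  % P(W=true  | S=false, R=true ) = 0.9
</pre>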
425
426
427 <h2><a name="rnd_cpt">Random Parameters</h2>
428
429 If we do not specify the CPT, random parameters will be
430 created, i.e., each "row" of the CPT will be drawn from the uniform distribution.
431 To ensure repeatable results, use
432 <pre>
433 rand('state', seed);
434 randn('state', seed);
435 </pre>
436 To control the degree of randomness (entropy),
437 you can sample each row of the CPT from a Dirichlet(p,p,...) distribution.
438 If p << 1, this encourages "deterministic" CPTs (one entry near 1, the rest near 0).
439 If p = 1, each entry is drawn from U[0,1].
440 If p >> 1, the entries will all be near 1/k, where k is the arity of
441 this node, i.e., each row will be nearly uniform.
442 You can do this as follows, assuming this node
443 is number i, and ns is the node_sizes.
444 <pre>
445 k = ns(i);
446 ps = parents(dag, i);
447 psz = prod(ns(ps));
448 CPT = sample_dirichlet(p*ones(1,k), psz);
449 bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT);
450 </pre>
451
452
453 <h2><a name="file">Loading a network from a file</h2>
454
455 If you already have a Bayes net represented in the XML-based
456 <a href="http://www.cs.cmu.edu/afs/cs/user/fgcozman/www/Research/InterchangeFormat/">
457 Bayes Net Interchange Format (BNIF)</a> (e.g., downloaded from the
458 <a
459 href="http://www.cs.huji.ac.il/labs/compbio/Repository">
460 Bayes Net repository</a>),
461 you can convert it to BNT format using
462 the
463 <a href="http://www.digitas.harvard.edu/~ken/bif2bnt/">BIF-BNT Java
464 program</a> written by Ken Shan.
465 (This is not necessarily up-to-date.)
466
467
468 <h2>Creating a model using a GUI</h2>
469
470 Click <a href="http://bnt.insa-rouen.fr/ajouts.html">here</a>.
471
472
473
474 <h1><a name="inference">Inference</h1>
475
476 Having created the BN, we can now use it for inference.
477 There are many different algorithms for doing inference in Bayes nets,
which make different tradeoffs between speed,
479 complexity, generality, and accuracy.
480 BNT therefore offers a variety of different inference
481 "engines". We will discuss these
482 in more detail <a href="#engines">below</a>.
483 For now, we will use the junction tree
484 engine, which is the mother of all exact inference algorithms.
485 This can be created as follows.
486 <pre>
487 engine = jtree_inf_engine(bnet);
488 </pre>
489 The other engines have similar constructors, but might take
490 additional, algorithm-specific parameters.
491 All engines are used in the same way, once they have been created.
492 We illustrate this in the following sections.
493
494
495 <h2><a name="marginal">Computing marginal distributions</h2>
496
Suppose we want to compute the probability that the sprinkler was on
498 given that the grass is wet.
499 The evidence consists of the fact that W=2. All the other nodes
500 are hidden (unobserved). We can specify this as follows.
501 <pre>
502 evidence = cell(1,N);
503 evidence{W} = 2;
504 </pre>
505 We use a 1D cell array instead of a vector to
506 cope with the fact that nodes can be vectors of different lengths.
507 In addition, the value [] can be used
508 to denote 'no evidence', instead of having to specify the observation
509 pattern as a separate argument.
510 (Click <a href="cellarray.html">here</a> for a quick tutorial on cell
511 arrays in matlab.)
512 <p>
513 We are now ready to add the evidence to the engine.
514 <pre>
515 [engine, loglik] = enter_evidence(engine, evidence);
516 </pre>
517 The behavior of this function is algorithm-specific, and is discussed
518 in more detail <a href="#engines">below</a>.
519 In the case of the jtree engine,
520 enter_evidence implements a two-pass message-passing scheme.
521 The first return argument contains the modified engine, which
522 incorporates the evidence. The second return argument contains the
523 log-likelihood of the evidence. (Not all engines are capable of
524 computing the log-likelihood.)
525 <p>
526 Finally, we can compute p=P(S=2|W=2) as follows.
527 <PRE>
528 marg = marginal_nodes(engine, S);
529 marg.T
530 ans =
531 0.57024
532 0.42976
533 p = marg.T(2);
534 </PRE>
535 We see that p = 0.4298.
536 <p>
537 Now let us add the evidence that it was raining, and see what
538 difference it makes.
539 <PRE>
540 evidence{R} = 2;
541 [engine, loglik] = enter_evidence(engine, evidence);
542 marg = marginal_nodes(engine, S);
543 p = marg.T(2);
544 </PRE>
545 We find that p = P(S=2|W=2,R=2) = 0.1945,
546 which is lower than
before, because the rain can "explain away" the
548 fact that the grass is wet.
549 <p>
550 You can plot a marginal distribution over a discrete variable
as a bar chart using the built-in 'bar' function:
552 <pre>
553 bar(marg.T)
554 </pre>
This is what it looks like:
556
557 <p>
558 <center>
559 <IMG SRC="Figures/sprinkler_bar.gif">
560 </center>
561 <p>
562
563 <h2><a name="observed">Observed nodes</h2>
564
565 What happens if we ask for the marginal on an observed node, e.g. P(W|W=2)?
566 An observed discrete node effectively only has 1 value (the observed
567 one) --- all other values would result in 0 probability.
For efficiency, BNT treats observed (discrete) nodes as if their
effective size were 1, as we see below:
570 <pre>
571 evidence = cell(1,N);
572 evidence{W} = 2;
573 engine = enter_evidence(engine, evidence);
574 m = marginal_nodes(engine, W);
575 m.T
576 ans =
577 1
578 </pre>
579 This can get a little confusing, since we assigned W=2.
580 So we can ask BNT to add the evidence back in by passing in an optional argument:
581 <pre>
582 m = marginal_nodes(engine, W, 1);
583 m.T
584 ans =
585 0
586 1
587 </pre>
588 This shows that P(W=1|W=2) = 0 and P(W=2|W=2) = 1.
589
590
591
592 <h2><a name="joint">Computing joint distributions</h2>
593
594 We can compute the joint probability on a set of nodes as in the
595 following example.
596 <pre>
597 evidence = cell(1,N);
598 [engine, ll] = enter_evidence(engine, evidence);
599 m = marginal_nodes(engine, [S R W]);
600 </pre>
601 m is a structure. The 'T' field is a multi-dimensional array (in
602 this case, 3-dimensional) that contains the joint probability
603 distribution on the specified nodes.
604 <pre>
605 >> m.T
606 ans(:,:,1) =
607 0.2900 0.0410
608 0.0210 0.0009
609 ans(:,:,2) =
610 0 0.3690
611 0.1890 0.0891
612 </pre>
613 We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass
614 to be wet if both the rain and sprinkler are off.
615 <p>
616 Let us now add some evidence to R.
617 <pre>
618 evidence{R} = 2;
619 [engine, ll] = enter_evidence(engine, evidence);
620 m = marginal_nodes(engine, [S R W])
621 m =
622 domain: [2 3 4]
623 T: [2x1x2 double]
624 >> m.T
625 m.T
626 ans(:,:,1) =
627 0.0820
628 0.0018
629 ans(:,:,2) =
630 0.7380
631 0.1782
632 </pre>
633 The joint T(i,j,k) = P(S=i,R=j,W=k|evidence)
634 should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible
635 with the evidence that R=2.
636 Instead of creating large tables with many 0s, BNT sets the effective
637 size of observed (discrete) nodes to 1, as explained above.
638 This is why m.T has size 2x1x2.
639 To get a 2x2x2 table, type
640 <pre>
641 m = marginal_nodes(engine, [S R W], 1)
642 m =
643 domain: [2 3 4]
644 T: [2x2x2 double]
645 >> m.T
646 m.T
647 ans(:,:,1) =
648 0 0.082
649 0 0.0018
650 ans(:,:,2) =
651 0 0.738
652 0 0.1782
653 </pre>
654
655 <p>
656 Note: It is not always possible to compute the joint on arbitrary
657 sets of nodes: it depends on which inference engine you use, as discussed
658 in more detail <a href="#engines">below</a>.
659
660
661 <h2><a name="soft">Soft/virtual evidence</h2>
662
663 Sometimes a node is not observed, but we have some distribution over
664 its possible values; this is often called "soft" or "virtual"
665 evidence.
666 One can use this as follows
667 <pre>
668 [engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
669 </pre>
670 where soft_evidence{i} is either [] (if node i has no soft evidence)
671 or is a vector representing the probability distribution over i's
672 possible values.
673 For example, if we don't know i's exact value, but we know its
674 likelihood ratio is 60/40, we can write evidence{i} = [] and
675 soft_evidence{i} = [0.6 0.4].
676 <p>
677 Currently only jtree_inf_engine supports this option.
678 It assumes that all hidden nodes, and all nodes for
679 which we have soft evidence, are discrete.
680 For a longer example, see BNT/examples/static/softev1.m.
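<p>
Here is a minimal sketch using the sprinkler network from above: we assert
hard evidence that the grass is wet, and soft evidence with a 60/40
likelihood ratio on the sprinkler node.
<pre>
evidence = cell(1,N);
evidence{W} = 2;                  % hard evidence: the grass is wet
soft_evidence = cell(1,N);
soft_evidence{S} = [0.6 0.4];     % 60/40 likelihood ratio on the sprinkler
engine = jtree_inf_engine(bnet);
[engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
marg = marginal_nodes(engine, R); % posterior on Rain, given both kinds of evidence
</pre>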
681
682
683 <h2><a name="mpe">Most probable explanation</h2>
684
685 To compute the most probable explanation (MPE) of the evidence (i.e.,
686 the most probable assignment, or a mode of the joint), use
687 <pre>
688 [mpe, ll] = calc_mpe(engine, evidence);
689 </pre>
690 mpe{i} is the most likely value of node i.
691 This calls enter_evidence with the 'maximize' flag set to 1, which
692 causes the engine to do max-product instead of sum-product.
693 The resulting max-marginals are then thresholded.
694 If there is more than one maximum probability assignment, we must take
695 care to break ties in a consistent manner (thresholding the
696 max-marginals may give the wrong result). To force this behavior,
697 type
698 <pre>
699 [mpe, ll] = calc_mpe(engine, evidence, 1);
700 </pre>
Note that computing the MPE is sometimes called abductive reasoning.
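<p>
For example, with the sprinkler network from above, we can compute the most
probable joint setting of all the nodes given that the grass is wet:
<pre>
evidence = cell(1,N);
evidence{W} = 2;
engine = jtree_inf_engine(bnet);
[mpe, ll] = calc_mpe(engine, evidence);
mpe{C}   % most likely value of Cloudy in the joint MPE assignment
</pre>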
702
703 <p>
704 You can also use <tt>calc_mpe_bucket</tt> written by Ron Zohar,
which does a forwards max-product pass, and then a backwards traceback
706 pass, which is how Viterbi is traditionally implemented.
707
708
709
710 <h1><a name="cpd">Conditional Probability Distributions</h1>
711
A Conditional Probability Distribution (CPD)
713 defines P(X(i) | X(Pa(i))), where X(i) is the i'th node, and X(Pa(i))
714 are the parents of node i. There are many ways to represent this
715 distribution, which depend in part on whether X(i) and X(Pa(i)) are
716 discrete, continuous, or a combination.
717 We will discuss various representations below.
718
719
720 <h2><a name="tabular">Tabular nodes</h2>
721
722 If the CPD is represented as a table (i.e., if it is a multinomial
723 distribution), it has a number of parameters that is exponential in
724 the number of parents. See the example <a href="#cpt">above</a>.
725
726
727 <h2><a name="noisyor">Noisy-or nodes</h2>
728
729 A noisy-OR node is like a regular logical OR gate except that
730 sometimes the effects of parents that are on get inhibited.
731 Let the prob. that parent i gets inhibited be q(i).
732 Then a node, C, with 2 parents, A and B, has the following CPD, where
733 we use F and T to represent off and on (1 and 2 in BNT).
734 <pre>
A  B  P(C=off)    P(C=on)
---------------------------
F  F  1.0         0.0
T  F  q(A)        1-q(A)
F  T  q(B)        1-q(B)
T  T  q(A)q(B)    1-q(A)q(B)
741 </pre>
742 Thus we see that the causes get inhibited independently.
743 It is common to associate a "leak" node with a noisy-or CPD, which is
744 like a parent that is always on. This can account for all other unmodelled
745 causes which might turn the node on.
746 <p>
747 The noisy-or distribution is similar to the logistic distribution.
748 To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j)
749 be the prob. that j inhibits i. Then
750 <pre>
751 Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j)
752 </pre>
753 Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then
754 <pre>
755 Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j))
756 </pre>
757 For a sigmoid node, we have
758 <pre>
759 Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j))
760 </pre>
761 where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of
762 the activation function (although both are monotonically increasing).
763 In addition, in the case of a noisy-or, the weights are constrained to be
764 positive, since they derive from probabilities q(i,j).
765 In both cases, the number of parameters is <em>linear</em> in the
766 number of parents, unlike the case of a multinomial distribution,
767 where the number of parameters is exponential in the number of parents.
768 We will see an example of noisy-OR nodes <a href="#qmr">below</a>.
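<p>
As a minimal sketch (using the noisyor_CPD constructor in the same way as
the QMR example below), here is the 3-node network from the table above,
with q(A)=0.1, q(B)=0.2, and no leak (i.e., the leak is always inhibited):
<pre>
dag3 = zeros(3); dag3(1,3) = 1; dag3(2,3) = 1;        % A=1, B=2, C=3
bnet3 = mk_bnet(dag3, 2*ones(1,3));
bnet3.CPD{1} = tabular_CPD(bnet3, 1, 'CPT', [0.5 0.5]);
bnet3.CPD{2} = tabular_CPD(bnet3, 2, 'CPT', [0.5 0.5]);
bnet3.CPD{3} = noisyor_CPD(bnet3, 3, 1, [0.1 0.2]);   % leak inhibition prob = 1 means no leak
</pre>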
769
770
771 <h2><a name="deterministic">Other (noisy) deterministic nodes</h2>
772
773 Deterministic CPDs for discrete random variables can be created using
774 the deterministic_CPD class. It is also possible to 'flip' the output
775 of the function with some probability, to simulate noise.
776 The boolean_CPD class is just a special case of a
777 deterministic CPD, where the parents and child are all binary.
778 <p>
779 Both of these classes are just "syntactic sugar" for the tabular_CPD
780 class.
781
782
783
784 <h2><a name="softmax">Softmax nodes</h2>
785
786 If we have a discrete node with a continuous parent,
787 we can define its CPD using a softmax function
788 (also known as the multinomial logit function).
789 This acts like a soft thresholding operator, and is defined as follows:
790 <pre>
                exp(w(:,i)'*x + b(i))
Pr(Q=i | X=x) = -----------------------------
                sum_j exp(w(:,j)'*x + b(j))
794
795 </pre>
796 The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the
797 following interpretation: w(:,i)-w(:,j) is the normal vector to the
798 decision boundary between classes i and j,
799 and b(i)-b(j) is its offset (bias). For example, suppose
800 X is a 2-vector, and Q is binary. Then
801 <pre>
w = [1 -1;
     0  0];
804
805 b = [0 0];
806 </pre>
807 means class 1 are points in the 2D plane with positive x coordinate,
808 and class 2 are points in the 2D plane with negative x coordinate.
809 If w has large magnitude, the decision boundary is sharp, otherwise it
810 is soft.
811 In the special case that Q is binary (0/1), the softmax function reduces to the logistic
812 (sigmoid) function.
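<p>
We can check this interpretation numerically in plain Matlab (no BNT
machinery needed):
<pre>
w = [1 -1;
     0  0];
b = [0 0];
x = [2; 5];               % a point with positive x coordinate
p = exp(w'*x + b');
p = p / sum(p)            % p(1) is close to 1, so class 1 is chosen
</pre>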
813 <p>
814 Fitting a softmax function can be done using the iteratively reweighted
815 least squares (IRLS) algorithm.
816 We use the implementation from
817 <a href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>.
818 Note that since
819 the softmax distribution is not in the exponential family, it does not
820 have finite sufficient statistics, and hence we must store all the
821 training data in uncompressed form.
822 If this takes too much space, one should use online (stochastic) gradient
823 descent (not implemented in BNT).
824 <p>
825 If a softmax node also has discrete parents,
826 we use a different set of w/b parameters for each combination of
827 parent values, as in the <a href="#gaussian">conditional linear
828 Gaussian CPD</a>.
829 This feature was implemented by Pierpaolo Brutti.
830 He is currently extending it so that discrete parents can be treated
831 as if they were continuous, by adding indicator variables to the X
832 vector.
833 <p>
834 We will see an example of softmax nodes <a href="#mixexp">below</a>.
835
836
837 <h2><a name="mlp">Neural network nodes</h2>
838
839 Pierpaolo Brutti has implemented the mlp_CPD class, which uses a multi layer perceptron
840 to implement a mapping from continuous parents to discrete children,
841 similar to the softmax function.
842 (If there are also discrete parents, it creates a mixture of MLPs.)
843 It uses code from <a
844 href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>.
845 This is work in progress.
846
847 <h2><a name="root">Root nodes</h2>
848
849 A root node has no parents and no parameters; it can be used to model
an observed, exogenous input variable, i.e., one which is "outside"
851 the model.
852 This is useful for conditional density models.
853 We will see an example of root nodes <a href="#mixexp">below</a>.
854
855
856 <h2><a name="gaussian">Gaussian nodes</h2>
857
858 We now consider a distribution suitable for the continuous-valued nodes.
859 Suppose the node is called Y, its continuous parents (if any) are
860 called X, and its discrete parents (if any) are called Q.
861 The distribution on Y is defined as follows:
862 <pre>
863 - no parents: Y ~ N(mu, Sigma)
864 - cts parents : Y|X=x ~ N(mu + W x, Sigma)
865 - discrete parents: Y|Q=i ~ N(mu(:,i), Sigma(:,:,i))
866 - cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i))
867 </pre>
868 where N(mu, Sigma) denotes a Normal distribution with mean mu and
869 covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q
870 respectively.
871 If there are no discrete parents, |Q|=1; if there is
872 more than one, then |Q| = a vector of the sizes of each discrete parent.
873 If there are no continuous parents, |X|=0; if there is more than one,
874 then |X| = the sum of their sizes.
Then mu is a |Y|*|Q| matrix, Sigma is a |Y|*|Y|*|Q| positive
876 semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight)
877 matrix.
878 <p>
879 We can create a Gaussian node with random parameters as follows.
880 <pre>
881 bnet.CPD{i} = gaussian_CPD(bnet, i);
882 </pre>
883 We can specify the value of one or more of the parameters as in the
884 following example, in which |Y|=2, and |Q|=1.
885 <pre>
886 bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', [0; 0], 'weights', randn(Y,X), 'cov', eye(Y));
887 </pre>
888 <p>
889 We will see an example of conditional linear Gaussian nodes <a
890 href="#cg_model">below</a>.
891 <p>
892 <b>When learning Gaussians from data</b>, it is helpful to ensure the
data has a small magnitude
894 (see e.g., KPMstats/standardize) to prevent numerical problems.
895 Unless you have a lot of data, it is also a very good idea to use
896 diagonal instead of full covariance matrices.
897 (BNT does not currently support spherical covariances, although it
898 would be easy to add, since KPMstats/clg_Mstep supports this option;
899 you would just need to modify gaussian_CPD/update_ess to accumulate
900 weighted inner products.)
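<p>
For example, a minimal sketch of a Gaussian node with a diagonal covariance
(mean and regression weights left random) is
<pre>
bnet.CPD{i} = gaussian_CPD(bnet, i, 'cov_type', 'diag');
</pre>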
901
902
903
904 <h2><a name="nongauss">Other continuous distributions</h2>
905
906 Currently BNT does not support any CPDs for continuous nodes other
907 than the Gaussian.
908 However, you can use a mixture of Gaussians to
approximate other continuous distributions. We will see an example
910 of this with the IFA model <a href="#pca">below</a>.
911
912
913 <h2><a name="glm">Generalized linear model nodes</h2>
914
915 In the future, we may incorporate some of the functionality of
916 <a href =
917 "http://www.sci.usq.edu.au/staff/dunn/glmlab/glmlab.html">glmlab</a>
918 into BNT.
919
920
921 <h2><a name="dtree">Classification/regression tree nodes</h2>
922
923 We plan to add classification and regression trees to define CPDs for
924 discrete and continuous nodes, respectively.
925 Trees have many advantages: they are easy to interpret, they can do
926 feature selection, they can
927 handle discrete and continuous inputs, they do not make strong
928 assumptions about the form of the distribution, the number of
929 parameters can grow in a data-dependent way (i.e., they are
930 semi-parametric), they can handle missing data, etc.
931 However, they are not yet implemented.
932 <!--
933 Yimin Zhang is currently (Feb '02) implementing this.
934 -->
935
936
937 <h2><a name="cpd_summary">Summary of CPD types</h2>
938
939 We list all the different types of CPDs supported by BNT.
940 For each CPD, we specify if the child and parents can be discrete (D) or
941 continuous (C) (Binary (B) nodes are a special case).
942 We also specify which methods each class supports.
943 If a method is inherited, the name of the parent class is mentioned.
944 If a parent class calls a child method, this is mentioned.
945 <p>
946 The <tt>CPD_to_CPT</tt> method converts a CPD to a table; this
947 requires that the child and all parents are discrete.
948 The CPT might be exponentially big...
949 <tt>convert_to_table</tt> evaluates a CPD with evidence, and
represents the resulting potential as an array.
951 This requires that the child is discrete, and any continuous parents
952 are observed.
953 <tt>convert_to_pot</tt> evaluates a CPD with evidence, and
954 represents the resulting potential as a dpot, gpot, cgpot or upot, as
955 requested. (d=discrete, g=Gaussian, cg = conditional Gaussian, u =
956 utility).
957
958 <p>
959 When we sample a node, all the parents are observed.
960 When we compute the (log) probability of a node, all the parents and
961 the child are observed.
962 <p>
963 We also specify if the parameters are learnable.
964 For learning with EM, we require
965 the methods <tt>reset_ess</tt>, <tt>update_ess</tt> and
966 <tt>maximize_params</tt>.
967 For learning from fully observed data, we require
968 the method <tt>learn_params</tt>.
969 By default, all classes inherit this from generic_CPD, which simply
970 calls <tt>update_ess</tt> N times, once for each data case, followed
971 by <tt>maximize_params</tt>, i.e., it is like EM, without the E step.
972 Some classes implement a batch formula, which is quicker.
973 <p>
974 Bayesian learning means computing a posterior over the parameters
975 given fully observed data.
976 <p>
977 Pearl means we implement the methods <tt>compute_pi</tt> and
978 <tt>compute_lambda_msg</tt>, used by
979 <tt>pearl_inf_engine</tt>, which runs on directed graphs.
<tt>belprop_inf_engine</tt> only needs <tt>convert_to_pot</tt>.
981 The pearl methods can exploit special properties of the CPDs for
982 computing the messages efficiently, whereas belprop does not.
983 <p>
984 The only method implemented by generic_CPD is <tt>adjustable_CPD</tt>,
985 which is not shown, since it is not very interesting.
986
987
988 <p>
989
990
<table border=1>
<tr>
993 <td align=center>Name
994 <td align=center>Child
995 <td align=center>Parents
996 <td align=center>Comments
997 <td align=center>CPD_to_CPT
998 <td align=center>conv_to_table
999 <td align=center>conv_to_pot
1000 <td align=center>sample
1001 <td align=center>prob
1002 <td align=center>learn
1003 <td align=center>Bayes
1004 <td align=center>Pearl
1005
1006
1007 <tr>
1008 <!-- Name--><td>
1009 <!-- Child--><td>
1010 <!-- Parents--><td>
1011 <!-- Comments--><td>
1012 <!-- CPD_to_CPT--><td>
1013 <!-- conv_to_table--><td>
1014 <!-- conv_to_pot--><td>
1015 <!-- sample--><td>
1016 <!-- prob--><td>
1017 <!-- learn--><td>
1018 <!-- Bayes--><td>
1019 <!-- Pearl--><td>
1020
1021 <tr>
1022 <!-- Name--><td>boolean
1023 <!-- Child--><td>B
1024 <!-- Parents--><td>B
1025 <!-- Comments--><td>Syntactic sugar for tabular
1026 <!-- CPD_to_CPT--><td>-
1027 <!-- conv_to_table--><td>-
1028 <!-- conv_to_pot--><td>-
1029 <!-- sample--><td>-
1030 <!-- prob--><td>-
1031 <!-- learn--><td>-
1032 <!-- Bayes--><td>-
1033 <!-- Pearl--><td>-
1034
1035 <tr>
1036 <!-- Name--><td>deterministic
1037 <!-- Child--><td>D
1038 <!-- Parents--><td>D
1039 <!-- Comments--><td>Syntactic sugar for tabular
1040 <!-- CPD_to_CPT--><td>-
1041 <!-- conv_to_table--><td>-
1042 <!-- conv_to_pot--><td>-
1043 <!-- sample--><td>-
1044 <!-- prob--><td>-
1045 <!-- learn--><td>-
1046 <!-- Bayes--><td>-
1047 <!-- Pearl--><td>-
1048
1049 <tr>
1050 <!-- Name--><td>Discrete
1051 <!-- Child--><td>D
1052 <!-- Parents--><td>C/D
1053 <!-- Comments--><td>Virtual class
1054 <!-- CPD_to_CPT--><td>N
1055 <!-- conv_to_table--><td>Calls CPD_to_CPT
1056 <!-- conv_to_pot--><td>Calls conv_to_table
1057 <!-- sample--><td>Calls conv_to_table
1058 <!-- prob--><td>Calls conv_to_table
1059 <!-- learn--><td>N
1060 <!-- Bayes--><td>N
1061 <!-- Pearl--><td>N
1062
1063 <tr>
1064 <!-- Name--><td>Gaussian
1065 <!-- Child--><td>C
1066 <!-- Parents--><td>C/D
1067 <!-- Comments--><td>-
1068 <!-- CPD_to_CPT--><td>N
1069 <!-- conv_to_table--><td>N
1070 <!-- conv_to_pot--><td>Y
1071 <!-- sample--><td>Y
1072 <!-- prob--><td>Y
1073 <!-- learn--><td>Y
1074 <!-- Bayes--><td>N
1075 <!-- Pearl--><td>N
1076
1077 <tr>
1078 <!-- Name--><td>gmux
1079 <!-- Child--><td>C
1080 <!-- Parents--><td>C/D
1081 <!-- Comments--><td>multiplexer
1082 <!-- CPD_to_CPT--><td>N
1083 <!-- conv_to_table--><td>N
1084 <!-- conv_to_pot--><td>Y
1085 <!-- sample--><td>N
1086 <!-- prob--><td>N
1087 <!-- learn--><td>N
1088 <!-- Bayes--><td>N
1089 <!-- Pearl--><td>Y
1090
1091
1092 <tr>
1093 <!-- Name--><td>MLP
1094 <!-- Child--><td>D
1095 <!-- Parents--><td>C/D
1096 <!-- Comments--><td>multi layer perceptron
1097 <!-- CPD_to_CPT--><td>N
1098 <!-- conv_to_table--><td>Y
1099 <!-- conv_to_pot--><td>Inherits from discrete
1100 <!-- sample--><td>Inherits from discrete
1101 <!-- prob--><td>Inherits from discrete
1102 <!-- learn--><td>Y
1103 <!-- Bayes--><td>N
1104 <!-- Pearl--><td>N
1105
1106
1107 <tr>
1108 <!-- Name--><td>noisy-or
1109 <!-- Child--><td>B
1110 <!-- Parents--><td>B
1111 <!-- Comments--><td>-
1112 <!-- CPD_to_CPT--><td>Y
1113 <!-- conv_to_table--><td>Inherits from discrete
1114 <!-- conv_to_pot--><td>Inherits from discrete
1115 <!-- sample--><td>Inherits from discrete
1116 <!-- prob--><td>Inherits from discrete
1117 <!-- learn--><td>N
1118 <!-- Bayes--><td>N
1119 <!-- Pearl--><td>Y
1120
1121
1122 <tr>
1123 <!-- Name--><td>root
1124 <!-- Child--><td>C/D
1125 <!-- Parents--><td>none
1126 <!-- Comments--><td>no params
1127 <!-- CPD_to_CPT--><td>N
1128 <!-- conv_to_table--><td>N
1129 <!-- conv_to_pot--><td>Y
1130 <!-- sample--><td>Y
1131 <!-- prob--><td>Y
1132 <!-- learn--><td>N
1133 <!-- Bayes--><td>N
1134 <!-- Pearl--><td>N
1135
1136
1137 <tr>
1138 <!-- Name--><td>softmax
1139 <!-- Child--><td>D
1140 <!-- Parents--><td>C/D
1141 <!-- Comments--><td>-
1142 <!-- CPD_to_CPT--><td>N
1143 <!-- conv_to_table--><td>Y
1144 <!-- conv_to_pot--><td>Inherits from discrete
1145 <!-- sample--><td>Inherits from discrete
1146 <!-- prob--><td>Inherits from discrete
1147 <!-- learn--><td>Y
1148 <!-- Bayes--><td>N
1149 <!-- Pearl--><td>N
1150
1151
1152 <tr>
1153 <!-- Name--><td>generic
1154 <!-- Child--><td>C/D
1155 <!-- Parents--><td>C/D
1156 <!-- Comments--><td>Virtual class
1157 <!-- CPD_to_CPT--><td>N
1158 <!-- conv_to_table--><td>N
1159 <!-- conv_to_pot--><td>N
1160 <!-- sample--><td>N
1161 <!-- prob--><td>N
1162 <!-- learn--><td>N
1163 <!-- Bayes--><td>N
1164 <!-- Pearl--><td>N
1165
1166
1167 <tr>
1168 <!-- Name--><td>Tabular
1169 <!-- Child--><td>D
1170 <!-- Parents--><td>D
1171 <!-- Comments--><td>-
1172 <!-- CPD_to_CPT--><td>Y
1173 <!-- conv_to_table--><td>Inherits from discrete
1174 <!-- conv_to_pot--><td>Inherits from discrete
1175 <!-- sample--><td>Inherits from discrete
1176 <!-- prob--><td>Inherits from discrete
1177 <!-- learn--><td>Y
1178 <!-- Bayes--><td>Y
1179 <!-- Pearl--><td>Y
1180
1181 </table>
1182
1183
1184
1185 <h1><a name="examples">Example models</h1>
1186
1187
1188 <h2>Gaussian mixture models</h2>
1189
1190 Richard W. DeVaul has made a detailed tutorial on how to fit mixtures
1191 of Gaussians using BNT. Available
1192 <a href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">here</a>.
1193
1194
1195 <h2><a name="pca">PCA, ICA, and all that </h2>
1196
1197 In Figure (a) below, we show how Factor Analysis can be thought of as a
1198 graphical model. Here, X has an N(0,I) prior, and
1199 Y|X=x ~ N(mu + Wx, Psi),
1200 where Psi is diagonal and W is called the "factor loading matrix".
1201 Since the noise on both X and Y is diagonal, the components of these
1202 vectors are uncorrelated, and hence can be represented as individual
1203 scalar nodes, as we show in (b).
1204 (This is useful if parts of the observations on the Y vector are occasionally missing.)
1205 We usually take k=|X| << |Y|=D, so the model tries to explain
1206 many observations using a low-dimensional subspace.
1207
1208
1209 <center>
1210 <table>
1211 <tr>
1212 <td><img src="Figures/fa.gif">
1213 <td><img src="Figures/fa_scalar.gif">
1214 <td><img src="Figures/mfa.gif">
1215 <td><img src="Figures/ifa.gif">
1216 <tr>
1217 <td align=center> (a)
1218 <td align=center> (b)
1219 <td align=center> (c)
1220 <td align=center> (d)
1221 </table>
1222 </center>
1223
1224 <p>
1225 We can create this model in BNT as follows.
1226 <pre>
1227 ns = [k D];
1228 dag = zeros(2,2);
1229 dag(1,2) = 1;
1230 bnet = mk_bnet(dag, ns, 'discrete', []);
1231 bnet.CPD{1} = gaussian_CPD(bnet, 1, 'mean', zeros(k,1), 'cov', eye(k), ...
1232 'cov_type', 'diag', 'clamp_mean', 1, 'clamp_cov', 1);
1233 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(D,1), 'cov', diag(Psi0), 'weights', W0, ...
1234 'cov_type', 'diag', 'clamp_mean', 1);
1235 </pre>
1236
1237 The root node is clamped to the N(0,I) distribution, so that we will
1238 not update these parameters during learning.
1239 The mean of the leaf node is clamped to 0,
1240 since we assume the data has been centered (had its mean subtracted
1241 off); this is just for simplicity.
1242 Finally, the covariance of the leaf node is constrained to be
1243 diagonal. W0 and Psi0 are the initial parameter guesses.
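<p>
For concreteness, one way to choose these initial guesses (a sketch only;
any reasonable random initialization will do) is
<pre>
rand('state', 0); randn('state', 0);
W0 = randn(D, k);      % initial |Y| x |X| factor loading matrix
Psi0 = ones(D, 1);     % initial diagonal observation noise
</pre>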
1244
1245 <p>
1246 We can fit this model (i.e., estimate its parameters in a maximum
1247 likelihood (ML) sense) using EM, as we
1248 explain <a href="#em">below</a>.
1249 Not surprisingly, the ML estimates for mu and Psi turn out to be
1250 identical to the
1251 sample mean and variance, which can be computed directly as
1252 <pre>
1253 mu_ML = mean(data);
1254 Psi_ML = diag(cov(data));
1255 </pre>
1256 Note that W can only be identified up to a rotation matrix, because of
1257 the spherical symmetry of the source.
1258
1259 <p>
1260 If we restrict Psi to be spherical, i.e., Psi = sigma*I,
1261 there is a closed-form solution for W as well,
1262 i.e., we do not need to use EM.
1263 In particular, W contains the first |X| eigenvectors of the sample covariance
1264 matrix, with scalings determined by the eigenvalues and sigma.
1265 Classical PCA can be obtained by taking the sigma->0 limit.
1266 For details, see
1267
1268 <ul>
1269 <li> <a href="ftp://hope.caltech.edu/pub/roweis/Empca/empca.ps">
1270 "EM algorithms for PCA and SPCA"</a>, Sam Roweis, NIPS 97.
1271 (<a href="ftp://hope.caltech.edu/pub/roweis/Code/empca.tar.gz">
1272 Matlab software</a>)
1273
1274 <p>
1275 <li>
1276 <a
1277 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003>
1278 "Mixtures of probabilistic principal component analyzers"</a>,
1279 Tipping and Bishop, Neural Computation 11(2):443--482, 1999.
1280 </ul>
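<p>
For concreteness, here is a sketch of the closed-form (P)PCA solution just
described (following Tipping and Bishop), assuming the data has already been
centered and is stored in an Ncases-by-D matrix called 'data', with latent
dimension k:
<pre>
S = cov(data);                        % sample covariance
[V, D_eig] = eig(S);
[evals, order] = sort(-diag(D_eig));  % sort eigenvalues in decreasing order
evals = -evals;
V = V(:, order);
sigma2 = mean(evals(k+1:end));        % ML estimate of the spherical noise variance
W = V(:, 1:k) * sqrt(diag(evals(1:k)) - sigma2*eye(k));  % defined up to a rotation
</pre>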
1281
1282 <p>
1283 By adding a hidden discrete variable, we can create mixtures of FA
1284 models, as shown in (c).
1285 Now we can explain the data using a set of subspaces.
1286 We can create this model in BNT as follows.
1287 <pre>
1288 ns = [M k D];
1289 dag = zeros(3);
1290 dag(1,3) = 1;
1291 dag(2,3) = 1;
1292 bnet = mk_bnet(dag, ns, 'discrete', 1);
1293 bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0);
1294 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(k, 1), 'cov', eye(k), 'cov_type', 'diag', ...
1295 'clamp_mean', 1, 'clamp_cov', 1);
1296 bnet.CPD{3} = gaussian_CPD(bnet, 3, 'mean', Mu0', 'cov', repmat(diag(Psi0), [1 1 M]), ...
1297 'weights', W0, 'cov_type', 'diag', 'tied_cov', 1);
1298 </pre>
1299 Notice how the covariance matrix for Y is the same for all values of
1300 Q; that is, the noise level in each sub-space is assumed the same.
1301 However, we allow the offset, mu, to vary.
1302 For details, see
1303 <ul>
1304
1305 <LI>
1306 <a HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/tr-96-1.ps.gz"> The EM
1307 Algorithm for Mixtures of Factor Analyzers </A>,
1308 Ghahramani, Z. and Hinton, G.E. (1996),
1309 University of Toronto
1310 Technical Report CRG-TR-96-1.
1311 (<A HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/mfa.tar.gz">Matlab software</A>)
1312
1313 <p>
1314 <li>
1315 <a
1316 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003>
1317 "Mixtures of probabilistic principal component analyzers"</a>,
1318 Tipping and Bishop, Neural Computation 11(2):443--482, 1999.
1319 </ul>
1320
1321 <p>
1322 I have included Zoubin's specialized MFA code (with his permission)
1323 with the toolbox, so you can check that BNT gives the same results:
1324 see 'BNT/examples/static/mfa1.m'.
1325
1326 <p>
1327 Independent Factor Analysis (IFA) generalizes FA by allowing a
1328 non-Gaussian prior on each component of X.
1329 (Note that we can approximate a non-Gaussian prior using a mixture of
1330 Gaussians.)
1331 This means that the likelihood function is no longer rotationally
1332 invariant, so we can uniquely identify W and the hidden
1333 sources X.
1334 IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y).
1335 We recover classical Independent Components Analysis (ICA)
1336 in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the
1337 weight matrix W is square and invertible.
1338 For details, see
1339 <ul>
1340 <li>
1341 <a href="http://www.gatsby.ucl.ac.uk/~hagai/ifa.ps">Independent Factor
1342 Analysis</a>, H. Attias, Neural Computation 11: 803--851, 1998.
1343 </ul>
1344
1345
1346
1347 <h2><a name="mixexp">Mixtures of experts</h2>
1348
1349 As an example of the use of the softmax function,
1350 we introduce the Mixture of Experts model.
1351 <!--
1352 We also show
1353 the Hierarchical Mixture of Experts model, where the hierarchy has two
1354 levels.
1355 (This is essentially a probabilistic decision tree of height two.)
1356 -->
1357 As before,
1358 circles denote continuous-valued nodes,
1359 squares denote discrete nodes, clear
1360 means hidden, and shaded means observed.
1361 <p>
1362 <center>
1363 <table>
1364 <tr>
1365 <td><img src="Figures/mixexp.gif">
1366 <!--
1367 <td><img src="Figures/hme.gif">
1368 -->
1369 </table>
1370 </center>
1371 <p>
1372 X is the observed
1373 input, Y is the output, and
1374 the Q nodes are hidden "gating" nodes, which select the appropriate
1375 set of parameters for Y. During training, Y is assumed observed,
1376 but for testing, the goal is to predict Y given X.
1377 Note that this is a <em>conditional</em> density model, so we don't
1378 associate any parameters with X.
1379 Hence X's CPD will be a root CPD, which is a way of modelling
1380 exogenous nodes.
1381 If the output is a continuous-valued quantity,
1382 we assume the "experts" are linear-regression units,
1383 and set Y's CPD to linear-Gaussian.
1384 If the output is discrete, we set Y's CPD to a softmax function.
1385 The Q CPDs will always be softmax functions.
1386
1387 <p>
1388 As a concrete example, consider the mixture of experts model where X and Y are
1389 scalars, and Q is binary.
1390 This is just piecewise linear regression, where
1391 we have two line segments, i.e.,
1392 <P>
1393 <IMG ALIGN=BOTTOM SRC="Eqns/lin_reg_eqn.gif">
1394 <P>
1395 We can create this model with random parameters as follows.
1396 (This code is bundled in BNT/examples/static/mixexp2.m.)
1397 <PRE>
1398 X = 1;
1399 Q = 2;
1400 Y = 3;
1401 dag = zeros(3,3);
dag(X,[Q Y]) = 1;
1403 dag(Q,Y) = 1;
1404 ns = [1 2 1]; % make X and Y scalars, and have 2 experts
1405 onodes = [1 3];
1406 bnet = mk_bnet(dag, ns, 'discrete', 2, 'observed', onodes);
1407
1408 rand('state', 0);
1409 randn('state', 0);
1410 bnet.CPD{1} = root_CPD(bnet, 1);
1411 bnet.CPD{2} = softmax_CPD(bnet, 2);
1412 bnet.CPD{3} = gaussian_CPD(bnet, 3);
1413 </PRE>
1414 Now let us fit this model using <a href="#em">EM</a>.
1415 First we <a href="#load_data">load the data</a> (1000 training cases) and plot them.
1416 <P>
1417 <PRE>
1418 data = load('/examples/static/Misc/mixexp_data.txt', '-ascii');
1419 plot(data(:,1), data(:,2), '.');
1420 </PRE>
1421 <p>
1422 <center>
1423 <IMG SRC="Figures/mixexp_data.gif">
1424 </center>
1425 <p>
1426 This is what the model looks like before training.
1427 (Thanks to Thomas Hofman for writing this plotting routine.)
1428 <p>
1429 <center>
1430 <IMG SRC="Figures/mixexp_before.gif">
1431 </center>
1432 <p>
1433 Now let's train the model, and plot the final performance.
1434 (We will discuss how to train models in more detail <a href="#param_learning">below</a>.)
1435 <P>
1436 <PRE>
1437 ncases = size(data, 1); % each row of data is a training case
1438 cases = cell(3, ncases);
1439 cases([1 3], :) = num2cell(data'); % each column of cases is a training case
1440 engine = jtree_inf_engine(bnet);
1441 max_iter = 20;
1442 [bnet2, LLtrace] = learn_params_em(engine, cases, max_iter);
1443 </PRE>
(We specified which nodes would be observed when we created the bnet, so
the engine knows that the hidden nodes are all discrete.
1446 For complex models, this can lead to a significant speedup.)
1447 Below we show what the model looks like after 16 iterations of EM
1448 (with 100 IRLS iterations per M step), when it converged
1449 using the default convergence tolerance (that the
1450 fractional change in the log-likelihood be less than 1e-3).
1451 Before learning, the log-likelihood was
1452 -322.927442; afterwards, it was -13.728778.
1453 <p>
1454 <center>
1455 <IMG SRC="Figures/mixexp_after.gif">
1456 </center>
1457 (See BNT/examples/static/mixexp2.m for details of the code.)
1458
1459
1460
1461 <h2><a name="hme">Hierarchical mixtures of experts</h2>
1462
1463 A hierarchical mixture of experts (HME) extends the mixture of experts
1464 model by having more than one hidden node. A two-level example is shown below, along
1465 with its more traditional representation as a neural network.
1466 This is like a (balanced) probabilistic decision tree of height 2.
1467 <p>
1468 <center>
1469 <IMG SRC="Figures/HMEforMatlab.jpg">
1470 </center>
1471 <p>
1472 <a href="mailto:pbrutti@stat.cmu.edu">Pierpaolo Brutti</a>
1473 has written an extensive set of routines for HMEs,
1474 which are bundled with BNT: see the examples/static/HME directory.
1475 These routines allow you to choose the number of hidden (gating)
1476 layers, and the form of the experts (softmax or MLP).
1477 See the file hmemenu, which provides a demo.
1478 For example, the figure below shows the decision boundaries learned
1479 for a ternary classification problem, using a 2 level HME with softmax
1480 gates and softmax experts; the training set is on the left, the
1481 testing set on the right.
1482 <p>
1483 <center>
1484 <!--<IMG SRC="Figures/hme_dec_boundary.gif">-->
1485 <IMG SRC="Figures/hme_dec_boundary.png">
1486 </center>
1487 <p>
1488
1489
1490 <p>
1491 For more details, see the following:
1492 <ul>
1493
1494 <li> <a href="http://www.cs.berkeley.edu/~jordan/papers/hierarchies.ps.Z">
1495 Hierarchical mixtures of experts and the EM algorithm</a>
1496 M. I. Jordan and R. A. Jacobs. Neural Computation, 6, 181-214, 1994.
1497
1498 <li> <a href =
1499 "http://www.cs.berkeley.edu/~dmartin/software">David Martin's
1500 matlab code for HME</a>
1501
1502 <li> <a
1503 href="http://www.cs.berkeley.edu/~jordan/papers/uai.ps.Z">Why the
1504 logistic function? A tutorial discussion on
1505 probabilities and neural networks.</a> M. I. Jordan. MIT Computational
1506 Cognitive Science Report 9503, August 1995.
1507
1508 <li> "Generalized Linear Models", McCullagh and Nelder, Chapman and
Hall, 1983.
1510
1511 <li>
1512 "Improved learning algorithms for mixtures of experts in multiclass
1513 classification".
1514 K. Chen, L. Xu, H. Chi.
1515 Neural Networks (1999) 12: 1229-1252.
1516
1517 <li> <a href="http://www.oigeeza.com/steve/">
1518 Classification Using Hierarchical Mixtures of Experts</a>
1519 S.R. Waterhouse and A.J. Robinson.
1520 In Proc. IEEE Workshop on Neural Network for Signal Processing IV (1994), pp. 177-186
1521
1522 <li> <a href="http://www.idiap.ch/~perry/">
1523 Localized mixtures of experts</a>,
1524 P. Moerland, 1998.
1525
1526 <li> "Nonlinear gated experts for time series",
1527 A.S. Weigend and M. Mangeas, 1995.
1528
1529 </ul>
1530
1531
1532 <h2><a name="qmr">QMR</h2>
1533
1534 Bayes nets originally arose out of an attempt to add probabilities to
1535 expert systems, and this is still the most common use for BNs.
1536 A famous example is
1537 QMR-DT, a decision-theoretic reformulation of the Quick Medical
1538 Reference (QMR) model.
1539 <p>
1540 <center>
1541 <IMG ALIGN=BOTTOM SRC="Figures/qmr.gif">
1542 </center>
1543 Here, the top layer represents hidden disease nodes, and the bottom
1544 layer represents observed symptom nodes.
1545 The goal is to infer the posterior probability of each disease given
1546 all the symptoms (which can be present, absent or unknown).
1547 Each node in the top layer has a Bernoulli prior (with a low prior
1548 probability that the disease is present).
1549 Since each node in the bottom layer has a high fan-in, we use a
1550 noisy-OR parameterization; each disease has an independent chance of
1551 causing each symptom.
The real QMR-DT model is copyrighted, but
1553 we can create a random QMR-like model as follows.
1554 <pre>
1555 function bnet = mk_qmr_bnet(G, inhibit, leak, prior)
1556 % MK_QMR_BNET Make a QMR model
1557 % bnet = mk_qmr_bnet(G, inhibit, leak, prior)
1558 %
1559 % G(i,j) = 1 iff there is an arc from disease i to finding j
1560 % inhibit(i,j) = inhibition probability on i->j arc
1561 % leak(j) = inhibition prob. on leak->j arc
1562 % prior(i) = prob. disease i is on
1563
1564 [Ndiseases Nfindings] = size(inhibit);
1565 N = Ndiseases + Nfindings;
1566 finding_node = Ndiseases+1:N;
1567 ns = 2*ones(1,N);
1568 dag = zeros(N,N);
1569 dag(1:Ndiseases, finding_node) = G;
1570 bnet = mk_bnet(dag, ns, 'observed', finding_node);
1571
1572 for d=1:Ndiseases
1573 CPT = [1-prior(d) prior(d)];
1574 bnet.CPD{d} = tabular_CPD(bnet, d, CPT');
1575 end
1576
1577 for i=1:Nfindings
1578 fnode = finding_node(i);
1579 ps = parents(G, i);
1580 bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(ps, i));
1581 end
1582 </pre>
1583 In the file BNT/examples/static/qmr1, we create a random bipartite
1584 graph G, with 5 diseases and 10 findings, and random parameters.
1585 (In general, to create a random dag, use 'mk_random_dag'.)
1586 We can visualize the resulting graph structure using
1587 the methods discussed <a href="#graphdraw">below</a>, with the
1588 following results:
1589 <p>
1590 <img src="Figures/qmr.rnd.jpg">
1591
1592 <p>
1593 Now let us put some random evidence on all the leaves except the very
1594 first and very last, and compute the disease posteriors. (Here 'diseases' and 'findings' denote the node index vectors 1:Ndiseases and finding_node, as defined in BNT/examples/static/qmr1.)
1595 <pre>
1596 pos = 2:floor(Nfindings/2);
1597 neg = (pos(end)+1):(Nfindings-1);
1598 onodes = myunion(pos, neg);
1599 evidence = cell(1, N);
1600 evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos)));
1601 evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg)));
1602
1603 engine = jtree_inf_engine(bnet);
1604 [engine, ll] = enter_evidence(engine, evidence);
1605 post = zeros(1, Ndiseases);
1606 for i=diseases(:)'
1607 m = marginal_nodes(engine, i);
1608 post(i) = m.T(2);
1609 end
1610 </pre>
1611 The junction tree algorithm can be quite slow on large QMR models.
1612 Fortunately, it is possible to exploit properties of the noisy-OR
1613 function to speed up exact inference using an algorithm called
1614 <a href="#quickscore">quickscore</a>, discussed below.
1615
1616
1617
1618
1619
1620 <h2><a name="cg_model">Conditional Gaussian models</h2>
1621
1622 A conditional Gaussian model is one in which, conditioned on all the discrete
1623 nodes, the distribution over the remaining (continuous) nodes is
1624 multivariate Gaussian. This means we can have arcs from discrete (D)
1625 to continuous (C) nodes, but not vice versa.
1626 (We <em>are</em> allowed C->D arcs if the continuous nodes are observed,
1627 as in the <a href="#mixexp">mixture of experts</a> model,
1628 since this distribution can be represented with a discrete potential.)
1629 <p>
1630 We now give an example of a CG model, from
1631 the paper "Propagation of Probabilities, Means and
1632 Variances in Mixed Graphical Association Models", Steffen Lauritzen,
1633 JASA 87(420):1098--1108, 1992 (reprinted in the book "Probabilistic Networks and Expert
1634 Systems", R. G. Cowell, A. P. Dawid, S. L. Lauritzen and
1635 D. J. Spiegelhalter, Springer, 1999.)
1636
1637 <h3>Specifying the graph</h3>
1638
1639 Consider the model of waste emissions from an incinerator plant shown below.
1640 We follow the standard convention that shaded nodes are observed,
1641 clear nodes are hidden.
1642 We also use the non-standard convention that
1643 square nodes are discrete (tabular) and round nodes are
1644 Gaussian.
1645
1646 <p>
1647 <center>
1648 <IMG SRC="Figures/cg1.gif">
1649 </center>
1650 <p>
1651
1652 We can create this model as follows.
1653 <pre>
1654 F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9;
1655 n = 9;
1656
1657 dag = zeros(n);
1658 dag(F,E)=1;
1659 dag(W,[E Min D]) = 1;
1660 dag(E,D)=1;
1661 dag(B,[C D])=1;
1662 dag(D,[L Mout])=1;
1663 dag(Min,Mout)=1;
1664
1665 % node sizes - all cts nodes are scalar, all discrete nodes are binary
1666 ns = ones(1, n);
1667 dnodes = [F W B];
1668 cnodes = mysetdiff(1:n, dnodes);
1669 ns(dnodes) = 2;
1670
1671 bnet = mk_bnet(dag, ns, 'discrete', dnodes);
1672 </pre>
1673 'dnodes' is a list of the discrete nodes; 'cnodes' is the continuous
1674 nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'.
1675 <p>
1676
1677
1678 <h3>Specifying the parameters</h3>
1679
1680 The parameters of the discrete nodes can be specified as follows.
1681 <pre>
1682 bnet.CPD{B} = tabular_CPD(bnet, B, 'CPT', [0.85 0.15]); % 1=stable, 2=unstable
1683 bnet.CPD{F} = tabular_CPD(bnet, F, 'CPT', [0.95 0.05]); % 1=intact, 2=defect
1684 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [2/7 5/7]); % 1=industrial, 2=household
1685 </pre>
1686
1687 <p>
1688 The parameters of the continuous nodes can be specified as follows.
1689 <pre>
1690 bnet.CPD{E} = gaussian_CPD(bnet, E, 'mean', [-3.9 -0.4 -3.2 -0.5], ...
1691 'cov', [0.00002 0.0001 0.00002 0.0001]);
1692 bnet.CPD{D} = gaussian_CPD(bnet, D, 'mean', [6.5 6.0 7.5 7.0], ...
1693 'cov', [0.03 0.04 0.1 0.1], 'weights', [1 1 1 1]);
1694 bnet.CPD{C} = gaussian_CPD(bnet, C, 'mean', [-2 -1], 'cov', [0.1 0.3]);
1695 bnet.CPD{L} = gaussian_CPD(bnet, L, 'mean', 3, 'cov', 0.25, 'weights', -0.5);
1696 bnet.CPD{Min} = gaussian_CPD(bnet, Min, 'mean', [0.5 -0.5], 'cov', [0.01 0.005]);
1697 bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 'mean', 0, 'cov', 0.002, 'weights', [1 1]);
1698 </pre>
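For a Gaussian node with discrete parents, the k'th entries of 'mean' and 'cov'
correspond to the k'th joint configuration of the discrete parents, with the
lower-numbered parent varying fastest. As a reading guide for node E above
(an interpretation sketch based on the published Lauritzen model; see
BNT/examples/static/cg1 for the authoritative version):
<pre>
% k=1: F=1 (intact), W=1 (industrial):  E ~ N(-3.9, 0.00002)
% k=2: F=2 (defect), W=1 (industrial):  E ~ N(-0.4, 0.0001)
% k=3: F=1 (intact), W=2 (household):   E ~ N(-3.2, 0.00002)
% k=4: F=2 (defect), W=2 (household):   E ~ N(-0.5, 0.0001)
</pre>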
1699
1700
1701 <h3><a name="cg_infer">Inference</h3>
1702
1703 <!--Let us perform inference in the <a href="#cg_model">waste incinerator example</a>.-->
1704 First we compute the unconditional marginals.
1705 <pre>
1706 engine = jtree_inf_engine(bnet);
1707 evidence = cell(1,n);
1708 [engine, ll] = enter_evidence(engine, evidence);
1709 marg = marginal_nodes(engine, E);
1710 </pre>
1711 <!--(Of course, we could use <tt>cond_gauss_inf_engine</tt> instead of jtree.)-->
1712 'marg' is a structure that contains the fields 'mu' and 'Sigma', which
1713 contain the mean and (co)variance of the marginal on E.
1714 In this case, they are both scalars.
1715 Let us check they match the published figures (to 2 decimal places).
1716 <!--(We can't expect
1717 more precision than this in general because I have implemented the algorithm of
1718 Lauritzen (1992), which can be numerically unstable.)-->
1719 <pre>
1720 tol = 1e-2;
1721 assert(approxeq(marg.mu, -3.25, tol));
1722 assert(approxeq(sqrt(marg.Sigma), 0.709, tol));
1723 </pre>
1724 We can compute the other posteriors similarly.
1725 Now let us add some evidence.
1726 <pre>
1727 evidence = cell(1,n);
1728 evidence{W} = 1; % industrial
1729 evidence{L} = 1.1;
1730 evidence{C} = -0.9;
1731 [engine, ll] = enter_evidence(engine, evidence);
1732 </pre>
1733 Now we find
1734 <pre>
1735 marg = marginal_nodes(engine, E);
1736 assert(approxeq(marg.mu, -3.8983, tol));
1737 assert(approxeq(sqrt(marg.Sigma), 0.0763, tol));
1738 </pre>
1739
1740
1741 We can also compute the joint probability on a set of nodes.
1742 For example, P(D, Mout | evidence) is a 2D Gaussian:
1743 <pre>
1744 marg = marginal_nodes(engine, [D Mout])
1745 marg =
1746 domain: [6 8]
1747 mu: [2x1 double]
1748 Sigma: [2x2 double]
1749 T: 1.0000
1750 </pre>
1751 The mean is
1752 <pre>
1753 marg.mu
1754 ans =
1755 3.6077
1756 4.1077
1757 </pre>
1758 and the covariance matrix is
1759 <pre>
1760 marg.Sigma
1761 ans =
1762 0.1062 0.1062
1763 0.1062 0.1182
1764 </pre>
1765 It is easy to visualize this posterior using standard Matlab plotting
1766 functions, e.g.,
1767 <pre>
1768 gaussplot2d(marg.mu, marg.Sigma);
1769 </pre>
1770 produces the following picture.
1771
1772 <p>
1773 <center>
1774 <IMG SRC="Figures/gaussplot.png">
1775 </center>
1776 <p>
1777
1778
1779 The T field indicates that the mixing weight of this Gaussian
1780 component is 1.0.
1781 If the joint contains discrete and continuous variables, the result
1782 will be a mixture of Gaussians, e.g.,
1783 <pre>
1784 marg = marginal_nodes(engine, [F E])
1785 domain: [1 3]
1786 mu: [-3.9000 -0.4003]
1787 Sigma: [1x1x2 double]
1788 T: [0.9995 4.7373e-04]
1789 </pre>
1790 The interpretation is
1791 Sigma(i,j,k) = Cov[ E(i) E(j) | F=k ].
1792 In this case, E is a scalar, so i=j=1; k specifies the mixture component.
1793 <p>
1794 We saw in the sprinkler network that BNT sets the effective size of
1795 observed discrete nodes to 1, since they only have one legal value.
1796 For continuous nodes, BNT sets their length to 0,
1797 since they have been reduced to a point.
1798 For example,
1799 <pre>
1800 marg = marginal_nodes(engine, [B C])
1801 domain: [4 5]
1802 mu: []
1803 Sigma: []
1804 T: [0.0123 0.9877]
1805 </pre>
1806 It is simple to post-process the output of marginal_nodes.
1807 For example, the file BNT/examples/static/cg1 sets the mu term of
1808 observed nodes to their observed value, and the Sigma term to 0 (since
1809 observed nodes have no variance).
1810
1811 <p>
1812 Note that the implemented version of the junction tree is numerically
1813 unstable when using CG potentials
1814 (which is why, in the example above, we only required our answers to agree with
1815 the published ones to 2dp).
1816 For this reason, you might want to use <tt>stab_cond_gauss_inf_engine</tt>,
1817 implemented by Shan Huang. This is described in
1818
1819 <ul>
1820 <li> "Stable Local Computation with Conditional Gaussian Distributions",
1821 S. Lauritzen and F. Jensen, Tech Report R-99-2014,
1822 Dept. Math. Sciences, Aalborg Univ., 1999.
1823 </ul>
1824
1825 However, even the numerically stable version
1826 can be computationally intractable if there are many hidden discrete
1827 nodes, because the number of mixture components grows exponentially, e.g., in a
1828 <a href="usage_dbn.html#lds">switching linear dynamical system</a>.
1829 In general, one must resort to approximate inference techniques: see
1830 the discussion on <a href="#engines">inference engines</a> below.
1831
1832
1833 <h2><a name="hybrid">Other hybrid models</h2>
1834
1835 When we have C->D arcs, where C is hidden, we need to use
1836 approximate inference.
1837 One approach (not implemented in BNT) is described in
1838 <ul>
1839 <li> <a
1840 href="http://www.cs.berkeley.edu/~murphyk/Papers/hybrid_uai99.ps.gz">A
1841 Variational Approximation for Bayesian Networks with
1842 Discrete and Continuous Latent Variables</a>,
1843 K. Murphy, UAI 99.
1844 </ul>
1845 Of course, one can always use <a href="#sampling">sampling</a> methods
1846 for approximate inference in such models.
1847
1848
1849
1850 <h1><a name="param_learning">Parameter Learning</h1>
1851
1852 The parameter estimation routines in BNT can be classified into 4
1853 types, depending on whether the goal is to compute
1854 a full (Bayesian) posterior over the parameters or just a point
1855 estimate (e.g., Maximum Likelihood or Maximum A Posteriori),
1856 and whether all the variables are fully observed or there are missing
1857 data / hidden variables (partial observability).
1858 <p>
1859
1860 <TABLE BORDER>
1861 <tr>
1862 <TH></TH>
1863 <th>Full obs</th>
1864 <th>Partial obs</th>
1865 </tr>
1866 <tr>
1867 <th>Point</th>
1868 <td><tt>learn_params</tt></td>
1869 <td><tt>learn_params_em</tt></td>
1870 </tr>
1871 <tr>
1872 <th>Bayes</th>
1873 <td><tt>bayes_update_params</tt></td>
1874 <td>not yet supported</td>
1875 </tr>
1876 </table>
1877
1878
1879 <h2><a name="load_data">Loading data from a file</h2>
1880
1881 To load numeric data from an ASCII text file called 'dat.txt', where each row is a
1882 case and columns are separated by white-space, such as
1883 <pre>
1884 011979 1626.5 0.0
1885 021979 1367.0 0.0
1886 ...
1887 </pre>
1888 you can use
1889 <pre>
1890 data = load('dat.txt');
1891 </pre>
1892 or
1893 <pre>
1894 load dat.txt -ascii
1895 </pre>
1896 In the latter case, the data is stored in a variable called 'dat' (the
1897 filename minus the extension).
1898 Alternatively, suppose the data is stored in a .csv file (has commas
1899 separating the columns, and contains a header line), such as
1900 <pre>
1901 header info goes here
1902 ORD,011979,1626.5,0.0
1903 DSM,021979,1367.0,0.0
1904 ...
1905 </pre>
1906 You can load this using
1907 <pre>
1908 [a,b,c,d] = textread('dat.csv', '%s %d %f %f', 'delimiter', ',', 'headerlines', 1);
1909 </pre>
1910 If your file is not in either of these formats, you can either use Perl to convert
1911 it to this format, or use Matlab's fscanf command.
1912 Type
1913 <tt>
1914 help iofun
1915 </tt>
1916 for more information on Matlab's file functions.
1917 <!--
1918 <p>
1919 To load data directly from Excel,
1920 you should buy the
1921 <a href="http://www.mathworks.com/products/excellink/">Excel Link</a>.
1922 To load data directly from a relational database,
1923 you should buy the
1924 <a href="http://www.mathworks.com/products/database">Database
1925 toolbox</a>.
1926 -->
1927 <p>
1928 BNT learning routines require data to be stored in a cell array.
1929 data{i,m} is the value of node i in case (example) m, i.e., each
1930 <em>column</em> is a case.
1931 If node i is not observed in case m (missing value), set
1932 data{i,m} = [].
1933 (Not all the learning routines can cope with such missing values, however.)
1934 In the special case that all the nodes are observed and are
1935 scalar-valued (as opposed to vector-valued), the data can be
1936 stored in a matrix (as opposed to a cell-array).
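For example, here is one way (a sketch, not a BNT routine) to convert a numeric
matrix in which missing entries are marked with NaN into the required cell-array format:
<pre>
raw = load('dat.txt');         % rows = cases, columns = nodes
data = num2cell(raw');         % transpose, so each column becomes a case
data(isnan(raw')) = {[]};      % blank out the missing values
</pre>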
1937 <p>
1938 Suppose, as in the <a href="#mixexp">mixture of experts example</a>,
1939 that we have 3 nodes in the graph: X(1) is the observed input, X(3) is
1940 the observed output, and X(2) is a hidden (gating) node. We can
1941 create the dataset as follows.
1942 <pre>
1943 data = load('dat.txt');
1944 ncases = size(data, 1);
1945 cases = cell(3, ncases);
1946 cases([1 3], :) = num2cell(data');
1947 </pre>
1948 Notice how we transposed the data, to convert rows into columns.
1949 Also, cases{2,m} = [] for all m, since X(2) is always hidden.
1950
1951
1952 <h2><a name="mle_complete">Maximum likelihood parameter estimation from complete data</h2>
1953
1954 As an example, let's generate some data from the sprinkler network, randomize the parameters,
1955 and then try to recover the original model.
1956 First we create some training data using forwards sampling.
1957 <pre>
1958 samples = cell(N, nsamples);
1959 for i=1:nsamples
1960 samples(:,i) = sample_bnet(bnet);
1961 end
1962 </pre>
1963 samples{j,i} contains the value of the j'th node in case i.
1964 sample_bnet returns a cell array because, in general, each node might
1965 be a vector of different length.
1966 In this case, all nodes are discrete (and hence scalars), so we
1967 could have used a regular array instead (which can be quicker):
1968 <pre>
1969 data = cell2num(samples);
1970 </pre>
1971 So now data(j,i) = samples{j,i}.
1972 <p>
1973 Now we create a network with random parameters.
1974 (The initial values of bnet2 don't matter in this case, since we can find the
1975 globally optimal MLE independent of where we start.)
1976 <pre>
1977 % Make a tabula rasa
1978 bnet2 = mk_bnet(dag, node_sizes);
1979 seed = 0;
1980 rand('state', seed);
1981 bnet2.CPD{C} = tabular_CPD(bnet2, C);
1982 bnet2.CPD{R} = tabular_CPD(bnet2, R);
1983 bnet2.CPD{S} = tabular_CPD(bnet2, S);
1984 bnet2.CPD{W} = tabular_CPD(bnet2, W);
1985 </pre>
1986 Finally, we find the maximum likelihood estimates of the parameters.
1987 <pre>
1988 bnet3 = learn_params(bnet2, samples);
1989 </pre>
1990 To view the learned parameters, we use a little Matlab hackery.
1991 <pre>
1992 CPT3 = cell(1,N);
1993 for i=1:N
1994 s=struct(bnet3.CPD{i}); % violate object privacy
1995 CPT3{i}=s.CPT;
1996 end
1997 </pre>
1998 Here are the parameters learned for node 4.
1999 <pre>
2000 dispcpt(CPT3{4})
2001 1 1 : 1.0000 0.0000
2002 2 1 : 0.2000 0.8000
2003 1 2 : 0.2273 0.7727
2004 2 2 : 0.0000 1.0000
2005 </pre>
2006 So we see that the learned parameters are fairly close to the "true"
2007 ones, which we display below.
2008 <pre>
2009 dispcpt(CPT{4})
2010 1 1 : 1.0000 0.0000
2011 2 1 : 0.1000 0.9000
2012 1 2 : 0.1000 0.9000
2013 2 2 : 0.0100 0.9900
2014 </pre>
2015 We can get better results by using a larger training set, or using
2016 informative priors (see <a href="#prior">below</a>).
2017
2018
2019
2020 <h2><a name="prior">Parameter priors</h2>
2021
2022 Currently, only tabular CPDs can have priors on their parameters.
2023 The conjugate prior for a multinomial is the Dirichlet.
2024 (For binary random variables, the multinomial is the same as the
2025 Bernoulli, and the Dirichlet is the same as the Beta.)
2026 <p>
2027 The Dirichlet has a simple interpretation in terms of pseudo counts.
2028 If we let N_ijk = the num. times X_i=k and Pa_i=j occurs in the
2029 training set, where Pa_i are the parents of X_i,
2030 then the maximum likelihood (ML) estimate is
2031 T_ijk = N_ijk / N_ij (where N_ij = sum_k' N_ijk'), which will be 0 if N_ijk=0.
2032 To prevent us from declaring that (X_i=k, Pa_i=j) is impossible just because this
2033 event was not seen in the training set,
2034 we can pretend we saw value k of X_i, for each value j of Pa_i, some number (alpha_ijk)
2035 of times in the past.
2036 The MAP (maximum a posteriori) estimate is then
2037 <pre>
2038 T_ijk = (N_ijk + alpha_ijk) / (N_ij + alpha_ij)
2039 </pre>
2040 and is never 0 if all alpha_ijk > 0.
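As a quick numerical illustration (with made-up counts): suppose X_i has 3 values,
and for a particular parent configuration j we observe N_ij1=2, N_ij2=8, N_ij3=0,
and use alpha_ijk=1 for each k. Then
<pre>
T_ij1 = (2+1)/(10+3) = 3/13
T_ij2 = (8+1)/(10+3) = 9/13
T_ij3 = (0+1)/(10+3) = 1/13
</pre>
so the unseen event gets a small but non-zero probability.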
2041 For example, consider the network A->B, where A is binary and B has 3
2042 values.
2043 A uniform prior for B has the form
2044 <pre>
2045 B=1 B=2 B=3
2046 A=1 1 1 1
2047 A=2 1 1 1
2048 </pre>
2049 which can be created using
2050 <pre>
2051 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'unif');
2052 </pre>
2053 This prior does not satisfy the likelihood equivalence principle,
2054 which says that <a href="#markov_equiv">Markov equivalent</a> models
2055 should have the same marginal likelihood.
2056 A prior that does satisfy this principle is shown below.
2057 Heckerman (1995) calls this the
2058 BDeu prior (likelihood equivalent uniform Bayesian Dirichlet).
2059 <pre>
2060 B=1 B=2 B=3
2061 A=1 1/6 1/6 1/6
2062 A=2 1/6 1/6 1/6
2063 </pre>
2064 where we put N/(q*r) in each bin; N is the equivalent sample size,
2065 r=|A|, q = |B|.
2066 This can be created as follows
2067 <pre>
2068 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu');
2069 </pre>
2070 Here, the equivalent sample size is 1; this is the strength of the
2071 prior.
2072 You can change this using
2073 <pre>
2074 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', ...
2075 'BDeu', 'dirichlet_weight', 10);
2076 </pre>
2077 <!--where counts is an array of pseudo-counts of the same size as the
2078 CPT.-->
2079 <!--
2080 <p>
2081 When you specify a prior, you should set row i of the CPT to the
2082 normalized version of row i of the pseudo-count matrix, i.e., to the
2083 expected values of the parameters. This will ensure that computing the
2084 marginal likelihood sequentially (see <a
2085 href="#bayes_learn">below</a>) and in batch form gives the same
2086 results.
2087 To do this, proceed as follows.
2088 <pre>
2089 tabular_CPD(bnet, i, 'prior', counts, 'CPT', mk_stochastic(counts));
2090 </pre>
2091 For a non-informative prior, you can just write
2092 <pre>
2093 tabular_CPD(bnet, i, 'prior', 'unif', 'CPT', 'unif');
2094 </pre>
2095 -->
2096
2097
2098 <h2><a name="bayes_learn">(Sequential) Bayesian parameter updating from complete data</h2>
2099
2100 If we use conjugate priors and have fully observed data, we can
2101 compute the posterior over the parameters in batch form as follows.
2102 <pre>
2103 cases = sample_bnet(bnet, nsamples);
2104 bnet = bayes_update_params(bnet, cases);
2105 LL = log_marg_lik_complete(bnet, cases);
2106 </pre>
2107 bnet.CPD{i}.prior contains the new Dirichlet pseudocounts,
2108 and bnet.CPD{i}.CPT is set to the mean of the posterior (the
2109 normalized counts).
2110 (Hence if the initial pseudo counts are 0,
2111 <tt>bayes_update_params</tt> and <tt>learn_params</tt> will give the
2112 same result.)
2113
2114
2115
2116
2117 <p>
2118 We can compute the same result sequentially (on-line) as follows.
2119 <pre>
2120 LL = 0;
2121 for m=1:nsamples
2122 LL = LL + log_marg_lik_complete(bnet, cases(:,m));
2123 bnet = bayes_update_params(bnet, cases(:,m));
2124 end
2125 </pre>
2126
2127 The file <tt>BNT/examples/static/StructLearn/model_select1</tt> has an example of
2128 sequential model selection which uses the same idea.
2129 We generate data from the model A->B
2130 and compute the posterior prob of all 3 dags on 2 nodes:
2131 (1) A  B (no edge), (2) A <- B, (3) A -> B.
2132 Models 2 and 3 are <a href="#markov_equiv">Markov equivalent</a>, and therefore indistinguishable from
2133 observational data alone, so we expect their posteriors to be the same
2134 (assuming a prior which satisfies likelihood equivalence).
2135 If we use random parameters, the "true" model only gets a higher posterior after 2000 trials!
2136 However, if we make B a noisy NOT gate, the true model "wins" after 12
2137 trials, as shown below (red = model 1, blue/green (superimposed)
2138 represents models 2/3).
2139 <p>
2140 <img src="Figures/model_select.png">
2141 <p>
2142 The use of marginal likelihood for model selection is discussed in
2143 greater detail in the
2144 section on <a href="#structure_learning">structure learning</a>.
2145
2146
2147
2148
2149 <h2><a name="em">Maximum likelihood parameter estimation with missing values</h2>
2150
2151 Now we consider learning when some values are not observed.
2152 Let us randomly hide half the values generated from the water
2153 sprinkler example.
2154 <pre>
2155 samples2 = samples;
2156 hide = rand(N, nsamples) > 0.5;
2157 [I,J]=find(hide);
2158 for k=1:length(I)
2159 samples2{I(k), J(k)} = [];
2160 end
2161 </pre>
2162 samples2{i,l} is the value of node i in training case l, or [] if unobserved.
2163 <p>
2164 Now we will compute the MLEs using the EM algorithm.
2165 We need to use an inference algorithm to compute the expected
2166 sufficient statistics in the E step; the M (maximization) step is as
2167 above.
2168 <pre>
2169 engine2 = jtree_inf_engine(bnet2);
2170 max_iter = 10;
2171 [bnet4, LLtrace] = learn_params_em(engine2, samples2, max_iter);
2172 </pre>
2173 LLtrace(i) is the log-likelihood at iteration i. We can plot this as
2174 follows:
2175 <pre>
2176 plot(LLtrace, 'x-')
2177 </pre>
2178 Let's display the results after 10 iterations of EM.
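(CPT4 below is extracted from bnet4 using the same object-privacy hack we used for
CPT3 above; a minimal sketch:)
<pre>
CPT4 = cell(1,N);
for i=1:N
  s = struct(bnet4.CPD{i});   % violate object privacy, as before
  CPT4{i} = s.CPT;
end
</pre>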
2179 <pre>
2180 celldisp(CPT4)
2181 CPT4{1} =
2182 0.6616
2183 0.3384
2184 CPT4{2} =
2185 0.6510 0.3490
2186 0.8751 0.1249
2187 CPT4{3} =
2188 0.8366 0.1634
2189 0.0197 0.9803
2190 CPT4{4} =
2191 (:,:,1) =
2192 0.8276 0.0546
2193 0.5452 0.1658
2194 (:,:,2) =
2195 0.1724 0.9454
2196 0.4548 0.8342
2197 </pre>
2198 We can get improved performance by using one or more of the following
2199 methods:
2200 <ul>
2201 <li> Increasing the size of the training set.
2202 <li> Decreasing the amount of hidden data.
2203 <li> Running EM for longer.
2204 <li> Using informative priors.
2205 <li> Initialising EM from multiple starting points.
2206 </ul>
2207
2208 Click <a href="#gaussian">here</a> for a discussion of learning
2209 Gaussians, which can cause numerical problems.
2210
2211 <h2><a name="tying">Parameter tying</h2>
2212
2213 In networks with repeated structure (e.g., chains and grids), it is
2214 common to assume that the parameters are the same at every node. This
2215 is called parameter tying, and reduces the amount of data needed for
2216 learning.
2217 <p>
2218 When we have tied parameters, there is no longer a one-to-one
2219 correspondence between nodes and CPDs.
2220 Rather, each CPD specifies the parameters for a whole equivalence class
2221 of nodes.
2222 It is easiest to see this by example.
2223 Consider the following <a href="usage_dbn.html#hmm">hidden Markov
2224 model (HMM)</a>
2225 <p>
2226 <img src="Figures/hmm3.gif">
2227 <p>
2228 <!--
2229 We can create this graph structure, assuming we have T time-slices,
2230 as follows.
2231 (We number the nodes as shown in the figure, but we could equally well
2232 number the hidden nodes 1:T, and the observed nodes T+1:2T.)
2233 <pre>
2234 N = 2*T;
2235 dag = zeros(N);
2236 hnodes = 1:2:2*T;
2237 for i=1:T-1
2238 dag(hnodes(i), hnodes(i+1))=1;
2239 end
2240 onodes = 2:2:2*T;
2241 for i=1:T
2242 dag(hnodes(i), onodes(i)) = 1;
2243 end
2244 </pre>
2245 <p>
2246 The hidden nodes are always discrete, and have Q possible values each,
2247 but the observed nodes can be discrete or continuous, and have O possible values/length.
2248 <pre>
2249 if cts_obs
2250 dnodes = hnodes;
2251 else
2252 dnodes = 1:N;
2253 end
2254 ns = ones(1,N);
2255 ns(hnodes) = Q;
2256 ns(onodes) = O;
2257 </pre>
2258 -->
2259 When HMMs are used for semi-infinite processes like speech recognition,
2260 we assume the transition matrix
2261 P(H(t+1)|H(t)) is the same for all t; this is called a time-invariant
2262 or homogeneous Markov chain.
2263 Hence hidden nodes 2, 3, ..., T
2264 are all in the same equivalence class, say class Hclass.
2265 Similarly, the observation matrix P(O(t)|H(t)) is assumed to be the
2266 same for all t, so the observed nodes are all in the same equivalence
2267 class, say class Oclass.
2268 Finally, the prior term P(H(1)) is in a class all by itself, say class
2269 H1class.
2270 This is illustrated below, where we explicitly represent the
2271 parameters as random variables (dotted nodes).
2272 <p>
2273 <img src="Figures/hmm4_params.gif">
2274 <p>
2275 In BNT, we cannot represent parameters as random variables (nodes).
2276 Instead, we "hide" the
2277 parameters inside one CPD for each equivalence class,
2278 and then specify that the other CPDs should share these parameters, as
2279 follows.
2280 <pre>
2281 hnodes = 1:2:2*T;
2282 onodes = 2:2:2*T;
2283 H1class = 1; Hclass = 2; Oclass = 3;
2284 eclass = ones(1,N);
2285 eclass(hnodes(2:end)) = Hclass;
2286 eclass(hnodes(1)) = H1class;
2287 eclass(onodes) = Oclass;
2288 % create dag and ns in the usual way
2289 bnet = mk_bnet(dag, ns, 'discrete', dnodes, 'equiv_class', eclass);
2290 </pre>
2291 Finally, we define the parameters for each equivalence class:
2292 <pre>
2293 bnet.CPD{H1class} = tabular_CPD(bnet, hnodes(1)); % prior
2294 bnet.CPD{Hclass} = tabular_CPD(bnet, hnodes(2)); % transition matrix
2295 if cts_obs
2296 bnet.CPD{Oclass} = gaussian_CPD(bnet, onodes(1));
2297 else
2298 bnet.CPD{Oclass} = tabular_CPD(bnet, onodes(1));
2299 end
2300 </pre>
2301 In general, if bnet.CPD{e} = xxx_CPD(bnet, j), then j should be a
2302 member of e's equivalence class; that is, it is not always the case
2303 that e == j. You can use bnet.rep_of_eclass(e) to return the
2304 representative of equivalence class e.
2305 BNT will look up the parents of j to determine the size
2306 of the CPT to use. It assumes that this is the same for all members of
2307 the equivalence class.
2308 Click <a href="http://www.cs.berkeley.edu/~murphyk/Bayes/param_tieing.html">here</a> for
2309 a more complex example of parameter tying.
2310 <p>
2311 Note:
2312 Normally one would define an HMM as a
2313 <a href = "usage_dbn.html">Dynamic Bayes Net</a>
2314 (see the function BNT/examples/dynamic/mk_chmm.m).
2315 However, one can define an HMM as a static BN using the function
2316 BNT/examples/static/Models/mk_hmm_bnet.m.
2317
2318
2319
2320 <h1><a name="structure_learning">Structure learning</h1>
2321
2322 Update (9/29/03):
2323 Philippe Leray is developing some additional structure learning code
2324 on top of BNT. Click <a
2325 href="http://bnt.insa-rouen.fr/ajouts.html">here</a>
2326 for details.
2327
2328 <p>
2329
2330 There are two very different approaches to structure learning:
2331 constraint-based and search-and-score.
2332 In the <a href="#constraint">constraint-based approach</a>,
2333 we start with a fully connected graph, and remove edges if certain
2334 conditional independencies are measured in the data.
2335 This has the disadvantage that repeated independence tests lose
2336 statistical power.
2337 <p>
2338 In the more popular search-and-score approach,
2339 we perform a search through the space of possible DAGs, and either
2340 return the best one found (a point estimate), or return a sample of the
2341 models found (an approximation to the Bayesian posterior).
2342 <p>
2343 Unfortunately, the number of DAGs as a function of the number of
2344 nodes, G(n), is super-exponential in n.
2345 A closed form formula for G(n) is not known, but the first few values
2346 are shown below (from Cooper, 1999).
2347
2348 <table>
2349 <tr> <th>n</th> <th align=left>G(n)</th> </tr>
2350 <tr> <td>1</td> <td>1</td> </tr>
2351 <tr> <td>2</td> <td>3</td> </tr>
2352 <tr> <td>3</td> <td>25</td> </tr>
2353 <tr> <td>4</td> <td>543</td> </tr>
2354 <tr> <td>5</td> <td>29,281</td> </tr>
2355 <tr> <td>6</td> <td>3,781,503</td> </tr>
2356 <tr> <td>7</td> <td>1.1 x 10^9</td> </tr>
2357 <tr> <td>8</td> <td>7.8 x 10^11</td> </tr>
2358 <tr> <td>9</td> <td>1.2 x 10^15</td> </tr>
2359 <tr> <td>10</td> <td>4.2 x 10^18</td> </tr>
2360 </table>
2361
2362 Since the number of DAGs is super-exponential in the number of nodes,
2363 we cannot exhaustively search the space, so we either use a local
2364 search algorithm (e.g., greedy hill climbing, perhaps with multiple
2365 restarts) or a global search algorithm (e.g., Markov Chain Monte
2366 Carlo).
2367 <p>
2368 If we know a total ordering on the nodes,
2369 finding the best structure amounts to picking the best set of parents
2370 for each node independently.
2371 This is what the K2 algorithm does.
2372 If the ordering is unknown, we can search over orderings,
2373 which is more efficient than searching over DAGs (Koller and Friedman, 2000).
2374 <p>
2375 In addition to the search procedure, we must specify the scoring
2376 function. There are two popular choices. The Bayesian score integrates
2377 out the parameters, i.e., it is the marginal likelihood of the model.
2378 The BIC (Bayesian Information Criterion) is defined as
2379 log P(D|theta_hat) - 0.5*d*log(N), where D is the data, theta_hat is
2380 the ML estimate of the parameters, d is the number of parameters, and
2381 N is the number of data cases.
2382 The BIC method has the advantage of not requiring a prior.
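In BNT, the scoring function is selected via the 'scoring_fn' argument of
score_dags, e.g. (a sketch; score_dags itself is described in the section on
exhaustive search below):
<pre>
score = score_dags(data, ns, dags, 'scoring_fn', 'bic', 'params', []);
</pre>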
2383 <p>
2384 BIC can be derived as a large sample
2385 approximation to the marginal likelihood.
2386 (It is also equal to the Minimum Description Length of a model.)
2387 However, in practice, the sample size does not need to be very large
2388 for the approximation to be good.
2389 For example, in the figure below, we plot the ratio between the log marginal likelihood
2390 and the BIC score against data-set size; we see that the ratio rapidly
2391 approaches 1, especially for non-informative priors.
2392 (This plot was generated by the file BNT/examples/static/bic1.m. It
2393 uses the water sprinkler BN with BDeu Dirichlet priors with different
2394 equivalent sample sizes.)
2395
2396 <p>
2397 <center>
2398 <IMG SRC="Figures/bic.png">
2399 </center>
2400 <p>
2401
2402 <p>
2403 As with parameter learning, handling missing data / hidden variables is
2404 much harder than the fully observed case.
2405 The structure learning routines in BNT can therefore be classified into 4
2406 types, analogously to the parameter learning case.
2407 <p>
2408
2409 <TABLE BORDER>
2410 <tr>
2411 <TH></TH>
2412 <th>Full obs</th>
2413 <th>Partial obs</th>
2414 </tr>
2415 <tr>
2416 <th>Point</th>
2417 <td><tt>learn_struct_K2</tt></td>
2418 <!-- <br> <tt>learn_struct_hill_climb</tt> -->
2419 <td>not yet supported</td>
2420 </tr>
2421 <tr>
2422 <th>Bayes</th>
2423 <td><tt>learn_struct_mcmc</tt></td>
2424 <td>not yet supported</td>
2425 </tr>
2426 </table>
2427
2428
2429 <h2><a name="markov_equiv">Markov equivalence</h2>
2430
2431 If two DAGs encode the same conditional independencies, they are
2432 called Markov equivalent. The set of all DAGs can be partitioned into
2433 Markov equivalence classes. Graphs within the same class can
2434 have
2435 the direction of some of their arcs reversed without changing any of
2436 the CI relationships.
2437 Each class can be represented by a PDAG
2438 (partially directed acyclic graph) called an essential graph or
2439 pattern. This specifies which edges must be oriented in a certain
2440 direction, and which may be reversed.
2441
2442 <p>
2443 When learning graph structure from observational data,
2444 the best one can hope to do is to identify the model up to Markov
2445 equivalence. To distinguish amongst graphs within the same equivalence
2446 class, one needs interventional data: see the discussion on <a
2447 href="#active">active learning</a> below.
2448
2449
2450
2451 <h2><a name="enumerate">Exhaustive search</h2>
2452
2453 The brute-force approach to structure learning is to enumerate all
2454 possible DAGs, and score each one. This provides a "gold standard"
2455 with which to compare other algorithms. We can do this as follows.
2456 <pre>
2457 dags = mk_all_dags(N);
2458 score = score_dags(data, ns, dags);
2459 </pre>
2460 where data(i,m) is the value of node i in case m,
2461 and ns(i) is the size of node i.
2462 If the DAGs have a lot of families in common, we can cache the sufficient statistics,
2463 making this potentially more efficient than scoring the DAGs one at a time.
2464 (Caching is not currently implemented, however.)
2465 <p>
2466 By default, we use the Bayesian scoring metric, and assume CPDs are
2467 represented by tables with BDeu(1) priors.
2468 We can override these defaults as follows.
2469 If we want to use uniform priors, we can say
2470 <pre>
2471 params = cell(1,N);
2472 for i=1:N
2473 params{i} = {'prior', 'unif'};
2474 end
2475 score = score_dags(data, ns, dags, 'params', params);
2476 </pre>
2477 params{i} is a cell-array, containing optional arguments that are
2478 passed to the constructor for CPD i.
2479 <p>
2480 Now suppose we want to use different node types: say,
2481 nodes 1 and 2 are Gaussian, and nodes 3 and 4 softmax (both
2482 these CPDs can support discrete and continuous parents, which is
2483 necessary since all other nodes will be considered as parents).
2484 The Bayesian scoring metric currently only works for tabular CPDs, so
2485 we will use BIC:
2486 <pre>
2487 score = score_dags(data, ns, dags, 'discrete', [3 4], 'params', [], ...
2488 'type', {'gaussian', 'gaussian', 'softmax', 'softmax'}, 'scoring_fn', 'bic')
2489 </pre>
2490 In practice, one can't enumerate all possible DAGs for N > 5,
2491 but one can evaluate any reasonably-sized set of hypotheses in this
2492 way (e.g., nearest neighbors of your current best guess).
2493 Think of this as "computer assisted model refinement" as opposed to de
2494 novo learning.
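For example, to pull out the highest-scoring hypothesis from such an enumeration
(a sketch, assuming as above that mk_all_dags returns a cell array of adjacency
matrices):
<pre>
[best_score, best] = max(score);
best_dag = dags{best};
</pre>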
2495
2496
2497 <h2><a name="K2">K2</h2>
2498
2499 The K2 algorithm (Cooper and Herskovits, 1992) is a greedy search algorithm that works as follows.
2500 Initially each node has no parents. It then adds incrementally that parent whose addition most
2501 increases the score of the resulting structure. When the addition of no single
2502 parent can increase the score, it stops adding parents to the node.
2503 Since we are using a fixed ordering, we do not need to check for
2504 cycles, and can choose the parents for each node independently.
2505 <p>
2506 The original paper used the Bayesian scoring
2507 metric with tabular CPDs and Dirichlet priors.
2508 BNT generalizes this to allow any kind of CPD, and either the Bayesian
2509 scoring metric or BIC, as in the example <a href="#enumerate">above</a>.
2510 In addition, you can specify
2511 an optional upper bound on the number of parents for each node.
2512 The file BNT/examples/static/k2demo1.m gives an example of how to use K2.
2513 We use the water sprinkler network and sample 100 cases from it as before.
2514 Then we see how much data it takes to recover the generating structure:
2515 <pre>
2516 order = [C S R W];
2517 max_fan_in = 2;
2518 sz = 5:5:100;
2519 for i=1:length(sz)
2520 dag2 = learn_struct_K2(data(:,1:sz(i)), node_sizes, order, 'max_fan_in', max_fan_in);
2521 correct(i) = isequal(dag, dag2);
2522 end
2523 </pre>
2524 Here are the results.
2525 <pre>
2526 correct =
2527 Columns 1 through 12
2528 0 0 0 0 0 0 0 1 0 1 1 1
2529 Columns 13 through 20
2530 1 1 1 1 1 1 1 1
2531 </pre>
2532 So we see it takes about sz(10)=50 cases. (BIC behaves similarly,
2533 showing that the prior doesn't matter too much.)
2534 In general, we cannot hope to recover the "true" generating structure,
2535 only one that is in its <a href="#markov_equiv">Markov equivalence
2536 class</a>.
2537
2538
2539 <h2><a name="hill_climb">Hill-climbing</h2>
2540
2541 Hill-climbing starts at a specific point in space,
2542 considers all nearest neighbors, and moves to the neighbor
2543 that has the highest score; if no neighbors have higher
2544 score than the current point (i.e., we have reached a local maximum),
2545 the algorithm stops. One can then restart in another part of the space.
2546 <p>
2547 A common definition of "neighbor" is all graphs that can be
2548 generated from the current graph by adding, deleting or reversing a
2549 single arc, subject to the acyclicity constraint.
2550 Other neighborhoods are possible: see
2551 <a href="http://research.microsoft.com/~dmax/publications/jmlr02.pdf">
2552 Optimal Structure Identification with Greedy Search</a>, Max
2553 Chickering, JMLR 2002.
2554
2555 <!--
2556 Note: This algorithm is currently (Feb '02) being implemented by Qian
2557 Diao.
2558 -->
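Below is a minimal greedy hill-climbing sketch written in terms of the score_dags
function introduced above. It is <em>not</em> the built-in BNT routine; in particular,
get_neighbors is a hypothetical helper that you would have to supply (it should
return a cell array of the acyclic graphs reachable from 'dag' by one arc addition,
deletion or reversal).
<pre>
dag = zeros(N,N);                      % start from the empty graph
best = score_dags(data, ns, {dag});
done = 0;
while ~done
  nbrs = get_neighbors(dag);           % hypothetical helper: cell array of DAGs
  scores = score_dags(data, ns, nbrs);
  [s, k] = max(scores);
  if s > best
    dag = nbrs{k}; best = s;           % move uphill to the best neighbor
  else
    done = 1;                          % local maximum reached
  end
end
</pre>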
2559
2560
2561 <h2><a name="mcmc">MCMC</h2>
2562
2563 We can use a Markov Chain Monte Carlo (MCMC) algorithm called
2564 Metropolis-Hastings (MH) to search the space of all
2565 DAGs.
2566 The standard proposal distribution is to consider moving to all
2567 nearest neighbors in the sense defined <a href="#hill_climb">above</a>.
2568 <p>
2569 The function can be called
2570 as in the following example.
2571 <pre>
2572 [sampled_graphs, accept_ratio] = learn_struct_mcmc(data, ns, 'nsamples', 100, 'burnin', 10);
2573 </pre>
2574 We can convert our set of sampled graphs to a histogram
2575 (empirical posterior over all the DAGs) thus
2576 <pre>
2577 all_dags = mk_all_dags(N);
2578 mcmc_post = mcmc_sample_to_hist(sampled_graphs, all_dags);
2579 </pre>
2580 To see how well this performs, let us compute the exact posterior exhaustively.
2581 <p>
2582 <pre>
2583 score = score_dags(data, ns, all_dags);
2584 post = normalise(exp(score)); % assuming uniform structural prior
2585 </pre>
2586 We plot the results below.
2587 (The data set was 100 samples drawn from a random 4 node bnet; see the
2588 file BNT/examples/static/mcmc1.)
2589 <pre>
2590 subplot(2,1,1)
2591 bar(post)
2592 subplot(2,1,2)
2593 bar(mcmc_post)
2594 </pre>
2595 <img src="Figures/mcmc_post.jpg" width="800" height="500">
2596 <p>
2597 We can also plot the acceptance ratio versus number of MCMC steps,
2598 as a crude convergence diagnostic.
2599 <pre>
2600 clf
2601 plot(accept_ratio)
2602 </pre>
2603 <img src="Figures/mcmc_accept.jpg" width="800" height="300">
2604 <p>
2605 Even though the number of samples needed by MCMC is theoretically
2606 polynomial (not exponential) in the dimensionality of the search space, in practice it has been
2607 found that MCMC does not converge in reasonable time for graphs with
2608 more than about 10 nodes.
2609
2610
2611
2612
2613 <h2><a name="active">Active structure learning</h2>
2614
2615 As was mentioned <a href="#markov_equiv">above</a>,
2616 one can only learn a DAG up to Markov equivalence, even given infinite data.
2617 If one is interested in learning the structure of a causal network,
2618 one needs interventional data.
2619 (By "intervention" we mean forcing a node to take on a specific value,
2620 thereby effectively severing its incoming arcs.)
2621 <p>
2622 Most of the scoring functions accept an optional argument
2623 that specifies whether a node was observed to have a certain value, or
2624 was forced to have that value: we set clamped(i,m)=1 if node i was
2625 forced in training case m. e.g., see the file
2626 BNT/examples/static/cooper_yoo.
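For instance, a clamped matrix might be passed to the scoring routine as follows
(a sketch; we assume here that score_dags accepts an optional 'clamped' argument,
and BNT/examples/static/cooper_yoo shows the actual usage):
<pre>
ncases = size(data, 2);
clamped = zeros(N, ncases);
clamped(2, 1:10) = 1;   % suppose node 2 was set by intervention in the first 10 cases
score = score_dags(data, ns, dags, 'clamped', clamped);
</pre>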
2627 <p>
2628 An interesting question is to decide which interventions to perform
2629 (c.f., design of experiments). For details, see the following tech
2630 report
2631 <ul>
2632 <li> <a href = "../../Papers/alearn.ps.gz">
2633 Active learning of causal Bayes net structure</a>, Kevin Murphy, March
2634 2001.
2635 </ul>
2636
2637
2638 <h2><a name="struct_em">Structural EM</h2>
2639
2640 Computing the Bayesian score when there is partial observability is
2641 computationally challenging, because the parameter posterior becomes
2642 multimodal (the hidden nodes induce a mixture distribution).
2643 One therefore needs to use approximations such as BIC.
2644 Unfortunately, search algorithms are still expensive, because we need
2645 to run EM at each step to compute the MLE, which is needed to compute
2646 the score of each model. An alternative approach is
2647 to do the local search steps inside of the M step of EM, which is more
2648 efficient since the data has been "filled in" - this is
2649 called the structural EM algorithm (Friedman 1997), and provably
2650 converges to a local maximum of the BIC score.
2651 <p>
2652 Wei Hu has implemented SEM for discrete nodes.
2653 You can download his package from
2654 <a href="../SEM.zip">here</a>.
2655 Please address all questions about this code to
2656 wei.hu@intel.com.
2657 See also <a href="#phl">Philippe Leray's implementation of SEM</a>.
2658
2659 <!--
2660 <h2><a name="reveal">REVEAL algorithm</h2>
2661
2662 A simple way to learn the structure of a fully observed, discrete,
2663 factored DBN from a time series is described <a
2664 href="usage_dbn.html#struct_learn">here</a>.
2665 -->
2666
2667
2668 <h2><a name="graphdraw">Visualizing the graph</h2>
2669
2670 You can visualize an arbitrary graph (such as one learned using the
2671 structure learning routines) with Matlab code contributed by
2672 <a href="http://www.mbfys.kun.nl/~cemgil/matlab/layout.html">Ali
2673 Taylan Cemgil</a>
2674 from the University of Nijmegen.
2675 For static BNs, call it as follows:
2676 <pre>
2677 draw_graph(bnet.dag);
2678 </pre>
2679 For example, this is the output produced on a
2680 <a href="#qmr">random QMR-like model</a>:
2681 <p>
2682 <img src="Figures/qmr.rnd.jpg">
2683 <p>
2684 If you install the excellent <a
2685 href="http://www.research.att.com/sw/tools/graphviz">graphhviz</a>, an
2686 open-source graph visualization package from AT&T,
2687 you can create a much better visualization as follows
2688 <pre>
2689 graph_to_dot(bnet.dag)
2690 </pre>
2691 This works by converting the adjacency matrix to a file suitable
2692 for input to graphviz (using the dot format),
2693 then converting the output of graphviz to postscript, and displaying the results using
2694 ghostview.
2695 You can do each of these steps separately for more control, as shown
2696 below.
2697 <pre>
2698 graph_to_dot(bnet.dag, 'filename', 'foo.dot');
2699 dot -Tps foo.dot -o foo.ps
2700 ghostview foo.ps &
2701 </pre>
2702
2703 <h2><a name = "constraint">Constraint-based methods</h2>
2704
2705 The IC algorithm (Pearl and Verma, 1991),
2706 and the faster, but otherwise equivalent, PC algorithm (Spirtes, Glymour, and Scheines 1993),
2707 compute many conditional independence tests,
2708 and combine these constraints into a
2709 PDAG to represent the whole
2710 <a href="#markov_equiv">Markov equivalence class</a>.
2711 <p>
2712 IC*/FCI extend IC/PC to handle latent variables: see <a href="#ic_star">below</a>.
2713 (IC stands for inductive causation; PC stands for Peter and Clark,
2714 the first names of Spirtes and Glymour; FCI stands for fast causal
2715 inference.
2716 What we, following Pearl (2000), call IC* was called
2717 IC in the original Pearl and Verma paper.)
2718 For details, see
2719 <ul>
2720 <li>
2721 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Causation,
2722 Prediction, and Search</a>, Spirtes, Glymour and
2723 Scheines (SGS), 2001 (2nd edition), MIT Press.
2724 <li>
2725 <a href="http://bayes.cs.ucla.edu/BOOK-2K/index.html">Causality: Models, Reasoning and Inference</a>, J. Pearl,
2726 2000, Cambridge University Press.
2727 </ul>
2728
2729 <p>
2730
2731 The PC algorithm takes as arguments a function f, the number of nodes N,
2732 the maximum fan in K, and additional arguments A which are passed to f.
2733 The function f(X,Y,S,A) returns 1 if X is conditionally independent of Y given S, and 0
2734 otherwise.
2735 For example, suppose we cheat by
2736 passing in a CI "oracle" which has access to the true DAG; the oracle
2737 tests for d-separation in this DAG, i.e.,
2738 f(X,Y,S) calls dsep(X,Y,S,dag). We can do this as follows.
2739 <pre>
2740 pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag);
2741 </pre>
2742 pdag(i,j) = -1 if there is definitely an i->j arc,
2743 and pdag(i,j) = 1 if there is either an i->j or an i<-j arc.
2744 <p>
2745 Applied to the sprinkler network, this returns
2746 <pre>
2747 pdag =
2748 0 1 1 0
2749 1 0 0 -1
2750 1 0 0 -1
2751 0 0 0 0
2752 </pre>
2753 So as expected, we see that the V-structure at the W node is uniquely identified,
2754 but the other arcs have ambiguous orientation.
2755 <p>
2756 We now give an example from p141 (1st edn) / p103 (2nd edn) of the SGS
2757 book.
2758 This example concerns the female orgasm.
2759 We are given a correlation matrix C between 7 measured factors (such
2760 as subjective experiences of coital and masturbatory experiences),
2761 derived from 281 samples, and want to learn a causal model of the
2762 data. We will not discuss the merits of this type of work here, but
2763 merely show how to reproduce the results in the SGS book.
2764 Their program,
2765 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Tetrad</a>,
2766 makes use of the Fisher Z-test for conditional
2767 independence, so we do the same:
2768 <pre>
2769 max_fan_in = 4;
2770 nsamples = 281;
2771 alpha = 0.05;
2772 pdag = learn_struct_pdag_pc('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha);
2773 </pre>
2774 In this case, the CI test is
2775 <pre>
2776 f(X,Y,S) = cond_indep_fisher_z(X,Y,S, C,nsamples,alpha)
2777 </pre>
2778 The results match those of Fig 12a of SGS apart from two edge
2779 differences; presumably this is due to rounding error (although it
2780 could be a bug, either in BNT or in Tetrad).
2781 This example can be found in the file BNT/examples/static/pc2.m.
2782
2783 <p>
2784
2785 The IC* algorithm (Pearl and Verma, 1991),
2786 and the faster FCI algorithm (Spirtes, Glymour, and Scheines 1993),
2787 are like the IC/PC algorithm, except that they can detect the presence
2788 of latent variables.
2789 See the file <tt>learn_struct_pdag_ic_star</tt> written by Tamar
2790 Kushnir. The output is a matrix P, defined as follows
2791 (see Pearl (2000), p52 for details):
2792 <pre>
2793 % P(i,j) = -1 if there is either a latent variable L such that i <-L->j OR there is a directed edge from i->j.
2794 % P(i,j) = -2 if there is a marked directed i-*>j edge.
2795 % P(i,j) = P(j,i) = 1 if there is an undirected edge i--j
2796 % P(i,j) = P(j,i) = 2 if there is a latent variable L such that i<-L->j.
2797 </pre>
2798
2799
2800 <h2><a name="phl">Philippe Leray's structure learning package</h2>
2801
2802 Philippe Leray has written a
2803 <a href="http://bnt.insa-rouen.fr/ajouts.html">
2804 structure learning package</a> that uses BNT.
2805
2806 It currently (June 2003) has the following features:
2807 <ul>
2808 <li>PC with Chi2 statistical test
2809 <li> MWST : Maximum weighted Spanning Tree
2810 <li> Hill Climbing
2811 <li> Greedy Search
2812 <li> Structural EM
2813 <li> hist_ic : optimal Histogram based on IC information criterion
2814 <li> cpdag_to_dag
2815 <li> dag_to_cpdag
2816 <li> ...
2817 </ul>
2818
2819
2821
2822
2823 <!--
2824 <h2><a name="read_learning">Further reading on learning</h2>
2825
2826 I recommend the following tutorials for more details on learning.
2827 <ul>
2828 <li> <a
2829 href="http://www.cs.berkeley.edu/~murphyk/Papers/intel.ps.gz">My short
2830 tutorial</a> on graphical models, which contains an overview of learning.
2831
2832 <li>
2833 <A HREF="ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.PS">
2834 A tutorial on learning with Bayesian networks</a>, D. Heckerman,
2835 Microsoft Research Tech Report, 1995.
2836
2837 <li> <A HREF="http://www-cad.eecs.berkeley.edu/~wray/Mirror/lwgmja">
2838 Operations for Learning with Graphical Models</a>,
2839 W. L. Buntine, JAIR'94, 159--225.
2840 </ul>
2841 <p>
2842 -->
2843
2844
2845
2846
2847
2848 <h1><a name="engines">Inference engines</h1>
2849
2850 Up until now, we have used the junction tree algorithm for inference.
2851 However, sometimes this is too slow, or not even applicable.
2852 In general, there are many inference algorithms, each of which makes
2853 different tradeoffs between speed, accuracy, complexity and
2854 generality. Furthermore, there might be many implementations of the
2855 same algorithm; for instance, a general purpose, readable version,
2856 and a highly-optimized, specialized one.
2857 To cope with this variety, we treat each inference algorithm as an
2858 object, which we call an inference engine.
2859
2860 <p>
2861 An inference engine is an object that contains a bnet and supports the
2862 'enter_evidence' and 'marginal_nodes' methods. The engine constructor
2863 takes the bnet as argument and may do some model-specific processing.
2864 When 'enter_evidence' is called, the engine may do some
2865 evidence-specific processing. Finally, when 'marginal_nodes' is
2866 called, the engine may do some query-specific processing.
2867
2868 <p>
2869 The amount of work done when each stage is specified -- structure,
2870 parameters, evidence, and query -- depends on the engine. The cost of
2871 work done early in this sequence can be amortized. On the other hand,
2872 one can make better optimizations if one waits until later in the
2873 sequence.
2874 For example, the parameters might imply
2875 conditional independencies that are not evident in the graph structure,
2876 but can nevertheless be exploited; the evidence indicates which nodes
2877 are observed and hence can effectively be disconnected from the
2878 graph; and the query might indicate that large parts of the network
2879 are d-separated from the query nodes. (Since it is not the actual
2880 <em>values</em> of the evidence that matters, just which nodes are observed,
2881 many engines allow you to specify which nodes will be observed when they are constructed,
2882 i.e., before calling 'enter_evidence'. Some engines can still cope if
2883 the actual pattern of evidence is different, e.g., if there is missing
2884 data.)
2885 <p>
2886
2887 Although being maximally lazy (i.e., only doing work when a query is
2888 issued) may seem desirable,
2889 this is not always the most efficient strategy.
2890 For example,
2891 when learning using EM, we need to call marginal_nodes N times, where N is the
2892 number of nodes. <a href="#varelim">Variable elimination</a> would end
2893 up repeating a lot of work
2894 each time marginal_nodes is called, making it inefficient for
2895 learning. The junction tree algorithm, by contrast, uses dynamic
2896 programming to avoid this redundant computation --- it calculates all
2897 marginals in two passes during 'enter_evidence', so calling
2898 'marginal_nodes' takes constant time.
2899 <p>
2900 We will discuss some of the inference algorithms implemented in BNT
2901 below, and finish with a <a href="#engine_summary">summary</a> of all
2902 of them.
2903
2904
2905
2906
2907
2908
2909
2910 <h2><a name="varelim">Variable elimination</h2>
2911
2912 The variable elimination algorithm, also known as bucket elimination
2913 or peeling, is one of the simplest inference algorithms.
2914 The basic idea is to "push sums inside of products"; this is explained
2915 in more detail
2916 <a
2917 href="http://HTTP.CS.Berkeley.EDU/~murphyk/Bayes/bayes.html#infer">here</a>.
2918 <p>
2919 The principle of distributing sums over products can be generalized
2920 greatly to apply to any commutative semiring.
2921 This forms the basis of many common algorithms, such as Viterbi
2922 decoding and the Fast Fourier Transform. For details, see
2923
2924 <ul>
2925 <li> R. McEliece and S. M. Aji, 2000.
2926 <!--<a href="http://www.systems.caltech.edu/EE/Faculty/rjm/papers/GDL.ps">-->
2927 <a href="GDL.pdf">
2928 The Generalized Distributive Law</a>,
2929 IEEE Trans. Inform. Theory, vol. 46, no. 2 (March 2000),
2930 pp. 325--343.
2931
2932
2933 <li>
2934 F. R. Kschischang, B. J. Frey and H.-A. Loeliger, 2001.
2935 <a href="http://www.cs.toronto.edu/~frey/papers/fgspa.abs.html">
2936 Factor graphs and the sum-product algorithm</a>
2937 IEEE Transactions on Information Theory, February, 2001.
2938
2939 </ul>
2940
2941 <p>
2942 Choosing an order in which to sum out the variables so as to minimize
2943 computational cost is known to be NP-hard.
2944 The implementation of this algorithm in
2945 <tt>var_elim_inf_engine</tt> makes no attempt to optimize this
2946 ordering (in contrast, say, to <tt>jtree_inf_engine</tt>, which uses a
2947 greedy search procedure to find a good ordering).
2948 <p>
2949 Note: unlike most algorithms, var_elim does all its computational work
2950 inside of <tt>marginal_nodes</tt>, not inside of
2951 <tt>enter_evidence</tt>.
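Since it supports the standard engine interface, it can be used just like the
junction tree engine, e.g. (a minimal sketch, where 'evidence' is a cell array as
usual):
<pre>
engine = var_elim_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);
</pre>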
2952
2953
2954
2955
2956 <h2><a name="global">Global inference methods</h2>
2957
2958 The simplest inference algorithm of all is to explicitly construct
2959 the joint distribution over all the nodes, and then to marginalize it.
2960 This is implemented in <tt>global_joint_inf_engine</tt>.
2961 Since the size of the joint is exponential in the
2962 number of discrete (hidden) nodes, this is not a very practical algorithm.
2963 It is included merely for pedagogical and debugging purposes.
2964 <p>
2965 Three specialized versions of this algorithm have also been implemented,
2966 corresponding to the cases where all the nodes are discrete (D), all
2967 are Gaussian (G), and some are discrete and some Gaussian (CG).
2968 They are called <tt>enumerative_inf_engine</tt>,
2969 <tt>gaussian_inf_engine</tt>,
2970 and <tt>cond_gauss_inf_engine</tt> respectively.
2971 <p>
2972 Note: unlike most algorithms, these global inference algorithms do all their computational work
2973 inside of <tt>marginal_nodes</tt>, not inside of
2974 <tt>enter_evidence</tt>.
2975
2976
2977 <h2><a name="quickscore">Quickscore</h2>
2978
2979 The junction tree algorithm is quite slow on the <a href="#qmr">QMR</a> network,
2980 since the cliques are so big.
2981 One simple trick we can use is to notice that hidden leaves do not
2982 affect the posteriors on the roots, and hence do not need to be
2983 included in the network.
2984 A second trick is to notice that the negative findings can be
2985 "absorbed" into the prior:
2986 see the file
2987 BNT/examples/static/mk_minimal_qmr_bnet for details.
2988 <p>
2989
2990 A much more significant speedup is obtained by exploiting special
2991 properties of the noisy-or node, as done by the quickscore
2992 algorithm. For details, see
2993 <ul>
2994 <li> Heckerman, "A tractable inference algorithm for diagnosing multiple diseases", UAI 89.
2995 <li> Rish and Dechter, "On the impact of causal independence", UCI
2996 tech report, 1998.
2997 </ul>
2998
2999 This has been implemented in BNT as a special-purpose inference
3000 engine, which can be created and used as follows:
3001 <pre>
3002 engine = quickscore_inf_engine(inhibit, leak, prior);
3003 engine = enter_evidence(engine, pos, neg);
3004 m = marginal_nodes(engine, i);
3005 </pre>
3006
3007
3008 <h2><a name="belprop">Belief propagation</h2>
3009
3010 Even using quickscore, exact inference takes time that is exponential
3011 in the number of positive findings.
3012 Hence for large networks we need to resort to approximate inference techniques.
3013 See for example
3014 <ul>
3015 <li> T. Jaakkola and M. Jordan, "Variational probabilistic inference and the
3016 QMR-DT network", JAIR 10, 1999.
3017
3018 <li> K. Murphy, Y. Weiss and M. Jordan, "Loopy belief propagation for approximate inference: an empirical study",
3019 UAI 99.
3020 </ul>
3021 The latter approximation
3022 entails applying Pearl's belief propagation algorithm to a model even
3023 if it has loops (hence the name loopy belief propagation).
3024 Pearl's algorithm, implemented as <tt>pearl_inf_engine</tt>, gives
3025 exact results when applied to singly-connected graphs
3026 (a.k.a. polytrees, since
3027 the underlying undirected topology is a tree, but a node may have
3028 multiple parents).
3029 To apply this algorithm to a graph with loops,
3030 use <tt>pearl_inf_engine</tt>.
3031 This can use a centralized or distributed message passing protocol.
3032 You can use it as in the following example.
3033 <pre>
3034 engine = pearl_inf_engine(bnet, 'max_iter', 30);
3035 engine = enter_evidence(engine, evidence);
3036 m = marginal_nodes(engine, i);
3037 </pre>
3038 We found that this algorithm often converges, and when it does, it is
3039 often very accurate, but this depends on the precise setting of the
3040 parameter values of the network.
3041 (See the file BNT/examples/static/qmr1 to repeat the experiment for yourself.)
3042 Understanding when and why belief propagation converges/ works
3043 is a topic of ongoing research.
3044 <p>
3045 <tt>pearl_inf_engine</tt> can exploit special structure in noisy-or
3046 and gmux nodes to compute messages efficiently.
3047 <p>
3048 <tt>belprop_inf_engine</tt> is like pearl, but uses potentials to
3049 represent messages. Hence this is slower.
3050 <p>
3051 <tt>belprop_fg_inf_engine</tt> is like belprop,
3052 but is designed for factor graphs.
3053
3054
3055
3056 <h2><a name="sampling">Sampling</h2>
3057
3058 BNT now (Mar '02) has two sampling (Monte Carlo) inference algorithms:
3059 <ul>
3060 <li> <tt>likelihood_weighting_inf_engine</tt> which does importance
3061 sampling and can handle any node type.
3062 <li> <tt>gibbs_sampling_inf_engine</tt>, written by Bhaskara Marthi.
3063 Currently this can only handle tabular CPDs.
3064 For a much faster and more powerful Gibbs sampling program, see
3065 <a href="http://www.mrc-bsu.cam.ac.uk/bugs">BUGS</a>.
3066 </ul>
3067 Note: To generate samples from a network (which is not the same as inference!),
3068 use <tt>sample_bnet</tt>.
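<p>
As a hedged sketch (using the generic engine interface described earlier,
with default options; the variable names are placeholders), the sampling
engines are invoked in the usual way, and <tt>sample_bnet</tt> is called
directly on the network:
<pre>
% Approximate inference by importance sampling (default settings).
engine = likelihood_weighting_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);

% Drawing one (unconditioned) sample from the network:
% sample{j} contains the sampled value of node j.
sample = sample_bnet(bnet);
</pre>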
3069
3070
3071
3072 <h2><a name="engine_summary">Summary of inference engines</h2>
3073
3074
3075 The inference engines differ in many ways. Here are
3076 some of the major "axes":
3077 <ul>
3078 <li> Works for all topologies or makes restrictions?
3079 <li> Works for all node types or makes restrictions?
3080 <li> Exact or approximate inference?
3081 </ul>
3082
3083 <p>
3084 In terms of topology, most engines handle any kind of DAG.
3085 <tt>belprop_fg</tt> does approximate inference on factor graphs (FG), which
3086 can be used to represent directed, undirected, and mixed (chain)
3087 graphs.
3088 (In the future, we plan to support exact inference on chain graphs.)
3089 <tt>quickscore</tt> only works on QMR-like models.
3090 <p>
3091 In terms of node types: algorithms that use potentials can handle
3092 discrete (D), Gaussian (G) or conditional Gaussian (CG) models.
3093 Sampling algorithms can essentially handle any kind of node (distribution).
3094 Other algorithms make more restrictive assumptions in exchange for
3095 speed.
3096 <p>
3097 Finally, most algorithms are designed to give the exact answer.
3098 The belief propagation algorithms are exact if applied to trees, and
3099 in some other cases.
3100 Sampling is considered approximate, even though, in the limit of an
3101 infinite number of samples, it gives the exact answer.
3102
3103 <p>
3104
3105 Here is a summary of the properties
3106 of all the engines in BNT which work on static networks.
3107 <p>
3108 <table border>
3109 <tr>
3110 <td align=left width=0>Name
3111 <td align=left width=0>Exact?
3112 <td align=left width=0>Node type?
3113 <td align=left width=0>Topology
3114 <tr>
3116 <td align=left> belprop
3117 <td align=left> approx
3118 <td align=left> D
3119 <td align=left> DAG
3120 <tr>
3121 <td align=left> belprop_fg
3122 <td align=left> approx
3123 <td align=left> D
3124 <td align=left> factor graph
3125 <tr>
3126 <td align=left> cond_gauss
3127 <td align=left> exact
3128 <td align=left> CG
3129 <td align=left> DAG
3130 <tr>
3131 <td align=left> enumerative
3132 <td align=left> exact
3133 <td align=left> D
3134 <td align=left> DAG
3135 <tr>
3136 <td align=left> gaussian
3137 <td align=left> exact
3138 <td align=left> G
3139 <td align=left> DAG
3140 <tr>
3141 <td align=left> gibbs
3142 <td align=left> approx
3143 <td align=left> D
3144 <td align=left> DAG
3145 <tr>
3146 <td align=left> global_joint
3147 <td align=left> exact
3148 <td align=left> D,G,CG
3149 <td align=left> DAG
3150 <tr>
3151 <td align=left> jtree
3152 <td align=left> exact
3153 <td align=left> D,G,CG
3154 <td align=left> DAG
3155 <tr>
3156 <td align=left> likelihood_weighting
3157 <td align=left> approx
3158 <td align=left> any
3159 <td align=left> DAG
3160 <tr>
3161 <td align=left> pearl
3162 <td align=left> approx
3163 <td align=left> D,G
3164 <td align=left> DAG
3165 <tr>
3166 <td align=left> pearl
3167 <td align=left> exact
3168 <td align=left> D,G
3169 <td align=left> polytree
3170 <tr>
3171 <td align=left> quickscore
3172 <td align=left> exact
3173 <td align=left> noisy-or
3174 <td align=left> QMR
3175 <tr>
3176 <td align=left> stab_cond_gauss
3177 <td align=left> exact
3178 <td align=left> CG
3179 <td align=left> DAG
3180 <tr>
3181 <td align=left> var_elim
3182 <td align=left> exact
3183 <td align=left> D,G,CG
3184 <td align=left> DAG
3185 </table>
3186
3187
3188
3189 <h1><a name="influence">Influence diagrams/ decision making</h1>
3190
3191 BNT implements an exact algorithm for solving LIMIDs (limited memory
3192 influence diagrams), described in
3193 <ul>
3194 <li> S. L. Lauritzen and D. Nilsson.
3195 <a href="http://www.math.auc.dk/~steffen/papers/limids.pdf">
3196 Representing and solving decision problems with limited
3197 information</a>
3198 Management Science, 47, 1238 - 1251. September 2001.
3199 </ul>
3200 LIMIDs explicitly show all information arcs, rather than implicitly
3201 assuming no forgetting. This allows them to model forgetful
3202 controllers.
3203 <p>
3204 See the examples in <tt>BNT/examples/limids</tt> for details.
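<p>
For orientation, here is a hedged sketch of the kind of code those examples
contain; the names used below (<tt>mk_limid</tt>,
<tt>jtree_limid_inf_engine</tt>, <tt>solve_limid</tt>) and their exact
arguments should be checked against the files in <tt>BNT/examples/limids</tt>.
<pre>
% Hedged sketch (names assumed; see BNT/examples/limids for working code).
% dag, ns = graph and node sizes; cnodes, dnodes, unodes = the sets of
% chance, decision and utility nodes respectively.
limid = mk_limid(dag, ns, 'chance', cnodes, 'decision', dnodes, 'utility', unodes);
% ... assign CPDs to the chance, decision and utility nodes here ...
engine = jtree_limid_inf_engine(limid);
[strategy, MEU] = solve_limid(engine); % optimal strategy and its expected utility
</pre>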
3205
3206
3207
3208
3209 <h1>DBNs, HMMs, Kalman filters and all that</h1>
3210
3211 Click <a href="usage_dbn.html">here</a> for documentation about how to
3212 use BNT for dynamical systems and sequence data.
3213
3214
3215 </BODY>