1 <HEAD> | |
2 <TITLE>How to use the Bayes Net Toolbox</TITLE> | |
3 </HEAD> | |
4 | |
5 <BODY BGCOLOR="#FFFFFF"> | |
6 <!-- white background is better for the pictures and equations --> | |
7 | |
8 <h1>How to use the Bayes Net Toolbox</h1> | |
9 | |
10 This documentation was last updated on 13 November 2002. | |
11 <br> | |
12 Click <a href="changelog.html">here</a> for a list of changes made to | |
13 BNT. | |
14 <br> | |
15 Click | |
16 <a href="http://bnt.insa-rouen.fr/">here</a> | |
17 for a French version of this documentation (which might not | |
18 be up-to-date). | |
19 | |
20 | |
21 <p> | |
22 | |
23 <ul> | |
24 <li> <a href="#install">Installation</a> | |
25 <ul> | |
26 <li> <a href="#installM">Installing the Matlab code</a> | |
27 <li> <a href="#installC">Installing the C code</a> | |
28 <li> <a href="../matlab_tips.html">Useful Matlab tips</a>. | |
29 </ul> | |
30 | |
31 <li> <a href="#basics">Creating your first Bayes net</a> | |
32 <ul> | |
33 <li> <a href="#basics">Creating a model by hand</a> | |
34 <li> <a href="#file">Loading a model from a file</a> | |
35 <li> <a href="http://bnt.insa-rouen.fr/ajouts.html">Creating a model using a GUI</a> | |
36 </ul> | |
37 | |
38 <li> <a href="#inference">Inference</a> | |
39 <ul> | |
40 <li> <a href="#marginal">Computing marginal distributions</a> | |
41 <li> <a href="#joint">Computing joint distributions</a> | |
42 <li> <a href="#soft">Soft/virtual evidence</a> | |
43 <li> <a href="#mpe">Most probable explanation</a> | |
44 </ul> | |
45 | |
46 <li> <a href="#cpd">Conditional Probability Distributions</a> | |
47 <ul> | |
48 <li> <a href="#tabular">Tabular (multinomial) nodes</a> | |
49 <li> <a href="#noisyor">Noisy-or nodes</a> | |
50 <li> <a href="#deterministic">Other (noisy) deterministic nodes</a> | |
51 <li> <a href="#softmax">Softmax (multinomial logit) nodes</a> | |
52 <li> <a href="#mlp">Neural network nodes</a> | |
53 <li> <a href="#root">Root nodes</a> | |
54 <li> <a href="#gaussian">Gaussian nodes</a> | |
55 <li> <a href="#glm">Generalized linear model nodes</a> | |
56 <li> <a href="#dtree">Classification/regression tree nodes</a> | |
57 <li> <a href="#nongauss">Other continuous distributions</a> | |
58 <li> <a href="#cpd_summary">Summary of CPD types</a> | |
59 </ul> | |
60 | |
61 <li> <a href="#examples">Example models</a> | |
62 <ul> | |
63 <li> <a | |
64 href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt"> | |
65 Gaussian mixture models</a> | |
66 <li> <a href="#pca">PCA, ICA, and all that</a> | |
<li> <a href="#mixexp">Mixtures of experts</a>
68 <li> <a href="#hme">Hierarchical mixtures of experts</a> | |
69 <li> <a href="#qmr">QMR</a> | |
70 <li> <a href="#cg_model">Conditional Gaussian models</a> | |
71 <li> <a href="#hybrid">Other hybrid models</a> | |
72 </ul> | |
73 | |
74 <li> <a href="#param_learning">Parameter learning</a> | |
75 <ul> | |
76 <li> <a href="#load_data">Loading data from a file</a> | |
77 <li> <a href="#mle_complete">Maximum likelihood parameter estimation from complete data</a> | |
78 <li> <a href="#prior">Parameter priors</a> | |
79 <li> <a href="#bayes_learn">(Sequential) Bayesian parameter updating from complete data</a> | |
80 <li> <a href="#em">Maximum likelihood parameter estimation with missing values (EM)</a> | |
81 <li> <a href="#tying">Parameter tying</a> | |
82 </ul> | |
83 | |
84 <li> <a href="#structure_learning">Structure learning</a> | |
85 <ul> | |
86 <li> <a href="#enumerate">Exhaustive search</a> | |
87 <li> <a href="#K2">K2</a> | |
88 <li> <a href="#hill_climb">Hill-climbing</a> | |
89 <li> <a href="#mcmc">MCMC</a> | |
90 <li> <a href="#active">Active learning</a> | |
91 <li> <a href="#struct_em">Structural EM</a> | |
92 <li> <a href="#graphdraw">Visualizing the learned graph structure</a> | |
93 <li> <a href="#constraint">Constraint-based methods</a> | |
94 </ul> | |
95 | |
96 | |
97 <li> <a href="#engines">Inference engines</a> | |
98 <ul> | |
99 <li> <a href="#jtree">Junction tree</a> | |
100 <li> <a href="#varelim">Variable elimination</a> | |
101 <li> <a href="#global">Global inference methods</a> | |
102 <li> <a href="#quickscore">Quickscore</a> | |
103 <li> <a href="#belprop">Belief propagation</a> | |
104 <li> <a href="#sampling">Sampling (Monte Carlo)</a> | |
105 <li> <a href="#engine_summary">Summary of inference engines</a> | |
106 </ul> | |
107 | |
108 | |
109 <li> <a href="#influence">Influence diagrams/ decision making</a> | |
110 | |
111 | |
112 <li> <a href="usage_dbn.html">DBNs, HMMs, Kalman filters and all that</a> | |
113 </ul> | |
114 | |
116 | |
117 | |
118 | |
119 | |
120 <h1><a name="install">Installation</h1> | |
121 | |
122 <h2><a name="installM">Installing the Matlab code</h2> | |
123 | |
124 <ul> | |
<li> <a href="bnt_download.html">Download</a> the FullBNT.zip file.
126 | |
127 <p> | |
128 <li> Unpack the file. In Unix, type | |
129 <!--"tar xvf BNT.tar".--> | |
130 "unzip FullBNT.zip". | |
131 In Windows, use | |
132 a program like <a href="http://www.winzip.com">Winzip</a>. This will | |
133 create a directory called FullBNT, which contains BNT and other libraries. | |
134 | |
135 <p> | |
136 <li> Read the file <tt>BNT/README</tt> to make sure the date | |
137 matches the one on the top of <a href=bnt.html>the BNT home page</a>. | |
138 If not, you may need to press 'refresh' on your browser, and download | |
139 again, to get the most recent version. | |
140 | |
141 <p> | |
142 <li> <b>Edit the file "BNT/add_BNT_to_path.m"</b> so it contains the correct | |
143 pathname. | |
144 For example, in Windows, | |
145 I download FullBNT.zip into C:\kpmurphy\matlab, and | |
146 then comment out the second line (with the % character), and uncomment | |
147 the third line, which reads | |
148 <pre> | |
149 BNT_HOME = 'C:\kpmurphy\matlab\FullBNT'; | |
150 </pre> | |
151 | |
152 <p> | |
153 <li> Start up Matlab. | |
154 | |
155 <p> | |
156 <li> Type "ver" at the Matlab prompt (">>"). | |
157 <b>You need Matlab version 5.2 or newer to run BNT</b>. | |
158 (Versions 5.0 and 5.1 have a memory leak which seems to sometimes | |
159 crash BNT.) | |
160 | |
161 <p> | |
162 <li> Move to the BNT directory. | |
163 For example, in Windows, I type | |
164 <pre> | |
165 >> cd C:\kpmurphy\matlab\FullBNT\BNT | |
166 </pre> | |
167 | |
168 <p> | |
169 <li> Type "add_BNT_to_path". | |
170 This executes the command | |
171 <tt>addpath(genpath(BNT_HOME))</tt>, | |
172 which adds all directories below FullBNT to the matlab path. | |
173 | |
174 <p> | |
175 <li> Type "test_BNT". | |
176 | |
177 If all goes well, this will produce a bunch of numbers and maybe some | |
178 warning messages (which you can ignore), but no error messages. | |
179 (The warnings should only be of the form | |
180 "Warning: Maximum number of iterations has been exceeded", and are | |
181 produced by Netlab.) | |
182 | |
183 <p> | |
184 <li> <a href="http://groups.yahoo.com/group/BayesNetToolbox/join"> | |
185 Join the BNT email list</a> | |
186 | |
187 </ul> | |
188 | |
189 | |
190 If you are new to Matlab, you might like to check out | |
191 <a href="matlab_tips.html">some useful Matlab tips</a>. | |
192 For instance, this explains how to create a startup file, which can be | |
193 used to set your path variable automatically, so you can avoid having | |
194 to type the above commands every time. | |
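<p>
For example, a minimal startup.m might simply repeat the two commands
used above (this is only a sketch; adjust BNT_HOME to wherever you
unpacked FullBNT):
<pre>
% startup.m -- run automatically by Matlab at startup
BNT_HOME = 'C:\kpmurphy\matlab\FullBNT';
addpath(genpath(BNT_HOME));
</pre>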
195 | |
196 | |
197 | |
198 | |
199 <h2><a name="installC">Installing the C code</h2> | |
200 | |
201 Some BNT functions also have C implementations. | |
202 <b>It is not necessary to install the C code</b>, but it can result in a speedup | |
203 of a factor of 5-10. | |
204 To install all the C code, | |
205 edit installC_BNT.m so it contains the right path, | |
206 then type <tt>installC_BNT</tt>. | |
207 To uninstall all the C code, | |
208 edit uninstallC_BNT.m so it contains the right path, | |
209 then type <tt>uninstallC_BNT</tt>. | |
210 For an up-to-date list of the files which have C implementations, see | |
211 BNT/installC_BNT.m. | |
212 | |
213 <p> | |
214 mex is a script that lets you call C code from Matlab - it does not compile matlab to | |
215 C (see mcc below). | |
216 If your C/C++ compiler is set up correctly, mex should work out of | |
217 the box. | |
218 If not, you might need to type | |
219 <p> | |
220 <tt> mex -setup</tt> | |
221 <p> | |
222 before calling installC. | |
223 <p> | |
224 To make mex call gcc on Windows, | |
225 you must install <a | |
226 href="http://www.mrc-cbu.cam.ac.uk/Imaging/gnumex20.html">gnumex</a>. | |
227 You can use the <a href="http://www.mingw.org/">minimalist GNU for | |
228 Windows</a> version of gcc, or | |
229 the <a href="http://sources.redhat.com/cygwin/">cygwin</a> version. | |
230 <p> | |
231 In general, typing | |
232 'mex foo.c' from inside Matlab creates a file called | |
233 'foo.mexglx' or 'foo.dll' (the exact file | |
234 extension is system dependent - on Linux it is 'mexglx', on Windows it is '.dll'). | |
235 The resulting file will hide the original 'foo.m' (if it existed), i.e., | |
236 typing 'foo' at the prompt will call the compiled C version. | |
237 To reveal the original matlab version, just delete foo.mexglx (this is | |
238 what uninstallC does). | |
239 <p> | |
240 Sometimes it takes time for Matlab to realize that the file has | |
241 changed from matlab to C or vice versa; try typing 'clear all' or | |
242 restarting Matlab to refresh it. | |
243 To find out which version of a file you are running, type | |
244 'which foo'. | |
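<p>
For example, a typical session might look like the following (a sketch;
the exact file extension depends on your platform, as noted above):
<pre>
>> mex foo.c      % produces foo.mexglx (Linux) or foo.dll (Windows)
>> which foo      % reports whether the compiled version or foo.m will be called
>> clear all      % force Matlab to notice that the file has changed
</pre>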
245 <p> | |
246 <a href="http://www.mathworks.com/products/compiler">mcc</a>, the | |
247 Matlab to C compiler, is a separate product, | |
248 and is quite different from mex. It does not yet support | |
249 objects/classes, which is why we can't compile all of BNT to C automatically. | |
250 Also, hand-written C code is usually much | |
251 better than the C code generated by mcc. | |
252 | |
253 | |
254 <p> | |
255 Acknowledgements: | |
256 Although I wrote some of the C code, most of | |
257 the C code (e.g., for jtree and dpot) was written by Wei Hu; | |
258 the triangulation C code was written by Ilya Shpitser. | |
259 | |
260 | |
261 <h1><a name="basics">Creating your first Bayes net</h1> | |
262 | |
263 To define a Bayes net, you must specify the graph structure and then | |
264 the parameters. We look at each in turn, using a simple example | |
265 (adapted from Russell and | |
266 Norvig, "Artificial Intelligence: a Modern Approach", Prentice Hall, | |
267 1995, p454). | |
268 | |
269 | |
270 <h2>Graph structure</h2> | |
271 | |
272 | |
273 Consider the following network. | |
274 | |
275 <p> | |
276 <center> | |
277 <IMG SRC="Figures/sprinkler.gif"> | |
278 </center> | |
279 <p> | |
280 | |
281 <P> | |
282 To specify this directed acyclic graph (dag), we create an adjacency matrix: | |
283 <PRE> | |
284 N = 4; | |
285 dag = zeros(N,N); | |
286 C = 1; S = 2; R = 3; W = 4; | |
287 dag(C,[R S]) = 1; | |
288 dag(R,W) = 1; | |
289 dag(S,W)=1; | |
290 </PRE> | |
291 <P> | |
292 We have numbered the nodes as follows: | |
293 Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4. | |
294 <b>The nodes must always be numbered in topological order, i.e., | |
295 ancestors before descendants.</b> | |
296 For a more complicated graph, this is a little inconvenient: we will | |
297 see how to get around this <a href="usage_dbn.html#bat">below</a>. | |
298 <p> | |
299 In Matlab 6, you can use logical arrays instead of double arrays, | |
300 which are 4 times smaller: | |
301 <pre> | |
302 dag = false(N,N); | |
303 dag(C,[R S]) = true; | |
304 ... | |
305 </pre> | |
306 <p> | |
A preliminary attempt to make a <b>GUI</b>
has been written by Philippe LeRay and can be downloaded
from <a href="http://bnt.insa-rouen.fr/ajouts.html">here</a>.
310 <p> | |
311 You can visualize the resulting graph structure using | |
312 the methods discussed <a href="#graphdraw">below</a>. | |
313 | |
314 <h2>Creating the Bayes net shell</h2> | |
315 | |
316 In addition to specifying the graph structure, | |
317 we must specify the size and type of each node. | |
318 If a node is discrete, its size is the | |
number of possible values
it can take on; if a node is continuous,
321 it can be a vector, and its size is the length of this vector. | |
322 In this case, we will assume all nodes are discrete and binary. | |
323 <PRE> | |
324 discrete_nodes = 1:N; | |
325 node_sizes = 2*ones(1,N); | |
326 </pre> | |
327 If the nodes were not binary, you could type e.g., | |
328 <pre> | |
329 node_sizes = [4 2 3 5]; | |
330 </pre> | |
331 meaning that Cloudy has 4 possible values, | |
332 Sprinkler has 2 possible values, etc. | |
Note that these are categorical values, not ordinal, i.e.,
they are not ordered in any way, unlike, say, 'low', 'medium', 'high'.
335 <p> | |
336 We are now ready to make the Bayes net: | |
337 <pre> | |
338 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes); | |
339 </PRE> | |
340 By default, all nodes are assumed to be discrete, so we can also just | |
341 write | |
342 <pre> | |
343 bnet = mk_bnet(dag, node_sizes); | |
344 </PRE> | |
345 You may also specify which nodes will be observed. | |
If you don't know, or if this is not fixed in advance,
347 just use the empty list (the default). | |
348 <pre> | |
349 onodes = []; | |
350 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes); | |
351 </PRE> | |
352 Note that optional arguments are specified using a name/value syntax. | |
353 This is common for many BNT functions. | |
354 In general, to find out more about a function (e.g., which optional | |
355 arguments it takes), please see its | |
356 documentation string by typing | |
357 <pre> | |
358 help mk_bnet | |
359 </pre> | |
360 See also other <a href="matlab_tips.html">useful Matlab tips</a>. | |
361 <p> | |
362 It is possible to associate names with nodes, as follows: | |
363 <pre> | |
364 bnet = mk_bnet(dag, node_sizes, 'names', {'cloudy','S','R','W'}, 'discrete', 1:4); | |
365 </pre> | |
366 You can then refer to a node by its name: | |
367 <pre> | |
368 C = bnet.names{'cloudy'}; % bnet.names is an associative array | |
369 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]); | |
370 </pre> | |
371 | |
372 | |
373 <h2><a name="cpt">Parameters</h2> | |
374 | |
375 A model consists of the graph structure and the parameters. | |
376 The parameters are represented by CPD objects (CPD = Conditional | |
377 Probability Distribution), which define the probability distribution | |
378 of a node given its parents. | |
379 (We will use the terms "node" and "random variable" interchangeably.) | |
380 The simplest kind of CPD is a table (multi-dimensional array), which | |
381 is suitable when all the nodes are discrete-valued. Note that the discrete | |
382 values are not assumed to be ordered in any way; that is, they | |
383 represent categorical quantities, like male and female, rather than | |
384 ordinal quantities, like low, medium and high. | |
385 (We will discuss CPDs in more detail <a href="#cpd">below</a>.) | |
386 <p> | |
387 Tabular CPDs, also called CPTs (conditional probability tables), | |
388 are stored as multidimensional arrays, where the dimensions | |
389 are arranged in the same order as the nodes, e.g., the CPT for node 4 | |
390 (WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself. | |
391 Hence the child is always the last dimension. | |
392 If a node has no parents, its CPT is a column vector representing its | |
393 prior. | |
394 Note that in Matlab (unlike C), arrays are indexed | |
from 1, and are laid out in memory such that the first index toggles
396 fastest, e.g., the CPT for node 4 (WetGrass) is as follows | |
397 <P> | |
398 <P><IMG ALIGN=BOTTOM SRC="Figures/CPTgrass.gif"><P> | |
399 <P> | |
400 where we have used the convention that false==1, true==2. | |
401 We can create this CPT in Matlab as follows | |
402 <PRE> | |
403 CPT = zeros(2,2,2); | |
404 CPT(1,1,1) = 1.0; | |
405 CPT(2,1,1) = 0.1; | |
406 ... | |
407 </PRE> | |
408 Here is an easier way: | |
409 <PRE> | |
410 CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]); | |
411 </PRE> | |
412 In fact, we don't need to reshape the array, since the CPD constructor | |
413 will do that for us. So we can just write | |
414 <pre> | |
415 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]); | |
416 </pre> | |
417 The other nodes are created similarly (using the old syntax for | |
418 optional parameters) | |
419 <PRE> | |
420 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]); | |
421 bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]); | |
422 bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]); | |
423 bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]); | |
424 </PRE> | |
425 | |
426 | |
427 <h2><a name="rnd_cpt">Random Parameters</h2> | |
428 | |
429 If we do not specify the CPT, random parameters will be | |
430 created, i.e., each "row" of the CPT will be drawn from the uniform distribution. | |
431 To ensure repeatable results, use | |
432 <pre> | |
433 rand('state', seed); | |
434 randn('state', seed); | |
435 </pre> | |
436 To control the degree of randomness (entropy), | |
437 you can sample each row of the CPT from a Dirichlet(p,p,...) distribution. | |
438 If p << 1, this encourages "deterministic" CPTs (one entry near 1, the rest near 0). | |
439 If p = 1, each entry is drawn from U[0,1]. | |
440 If p >> 1, the entries will all be near 1/k, where k is the arity of | |
441 this node, i.e., each row will be nearly uniform. | |
442 You can do this as follows, assuming this node | |
443 is number i, and ns is the node_sizes. | |
444 <pre> | |
445 k = ns(i); | |
446 ps = parents(dag, i); | |
447 psz = prod(ns(ps)); | |
448 CPT = sample_dirichlet(p*ones(1,k), psz); | |
449 bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT); | |
450 </pre> | |
451 | |
452 | |
453 <h2><a name="file">Loading a network from a file</h2> | |
454 | |
455 If you already have a Bayes net represented in the XML-based | |
456 <a href="http://www.cs.cmu.edu/afs/cs/user/fgcozman/www/Research/InterchangeFormat/"> | |
457 Bayes Net Interchange Format (BNIF)</a> (e.g., downloaded from the | |
458 <a | |
459 href="http://www.cs.huji.ac.il/labs/compbio/Repository"> | |
460 Bayes Net repository</a>), | |
461 you can convert it to BNT format using | |
462 the | |
463 <a href="http://www.digitas.harvard.edu/~ken/bif2bnt/">BIF-BNT Java | |
464 program</a> written by Ken Shan. | |
465 (This is not necessarily up-to-date.) | |
466 | |
467 | |
468 <h2>Creating a model using a GUI</h2> | |
469 | |
470 Click <a href="http://bnt.insa-rouen.fr/ajouts.html">here</a>. | |
471 | |
472 | |
473 | |
474 <h1><a name="inference">Inference</h1> | |
475 | |
476 Having created the BN, we can now use it for inference. | |
477 There are many different algorithms for doing inference in Bayes nets, | |
478 that make different tradeoffs between speed, | |
479 complexity, generality, and accuracy. | |
480 BNT therefore offers a variety of different inference | |
481 "engines". We will discuss these | |
482 in more detail <a href="#engines">below</a>. | |
483 For now, we will use the junction tree | |
484 engine, which is the mother of all exact inference algorithms. | |
485 This can be created as follows. | |
486 <pre> | |
487 engine = jtree_inf_engine(bnet); | |
488 </pre> | |
489 The other engines have similar constructors, but might take | |
490 additional, algorithm-specific parameters. | |
491 All engines are used in the same way, once they have been created. | |
492 We illustrate this in the following sections. | |
493 | |
494 | |
495 <h2><a name="marginal">Computing marginal distributions</h2> | |
496 | |
Suppose we want to compute the probability that the sprinkler was on
498 given that the grass is wet. | |
499 The evidence consists of the fact that W=2. All the other nodes | |
500 are hidden (unobserved). We can specify this as follows. | |
501 <pre> | |
502 evidence = cell(1,N); | |
503 evidence{W} = 2; | |
504 </pre> | |
505 We use a 1D cell array instead of a vector to | |
506 cope with the fact that nodes can be vectors of different lengths. | |
507 In addition, the value [] can be used | |
508 to denote 'no evidence', instead of having to specify the observation | |
509 pattern as a separate argument. | |
510 (Click <a href="cellarray.html">here</a> for a quick tutorial on cell | |
511 arrays in matlab.) | |
512 <p> | |
513 We are now ready to add the evidence to the engine. | |
514 <pre> | |
515 [engine, loglik] = enter_evidence(engine, evidence); | |
516 </pre> | |
517 The behavior of this function is algorithm-specific, and is discussed | |
518 in more detail <a href="#engines">below</a>. | |
519 In the case of the jtree engine, | |
520 enter_evidence implements a two-pass message-passing scheme. | |
521 The first return argument contains the modified engine, which | |
522 incorporates the evidence. The second return argument contains the | |
523 log-likelihood of the evidence. (Not all engines are capable of | |
524 computing the log-likelihood.) | |
525 <p> | |
526 Finally, we can compute p=P(S=2|W=2) as follows. | |
527 <PRE> | |
528 marg = marginal_nodes(engine, S); | |
529 marg.T | |
530 ans = | |
531 0.57024 | |
532 0.42976 | |
533 p = marg.T(2); | |
534 </PRE> | |
535 We see that p = 0.4298. | |
536 <p> | |
537 Now let us add the evidence that it was raining, and see what | |
538 difference it makes. | |
539 <PRE> | |
540 evidence{R} = 2; | |
541 [engine, loglik] = enter_evidence(engine, evidence); | |
542 marg = marginal_nodes(engine, S); | |
543 p = marg.T(2); | |
544 </PRE> | |
545 We find that p = P(S=2|W=2,R=2) = 0.1945, | |
546 which is lower than | |
before, because the rain can "explain away" the
548 fact that the grass is wet. | |
549 <p> | |
550 You can plot a marginal distribution over a discrete variable | |
as a bar chart using the built-in 'bar' function:
552 <pre> | |
553 bar(marg.T) | |
554 </pre> | |
555 This is what it looks like | |
556 | |
557 <p> | |
558 <center> | |
559 <IMG SRC="Figures/sprinkler_bar.gif"> | |
560 </center> | |
561 <p> | |
562 | |
563 <h2><a name="observed">Observed nodes</h2> | |
564 | |
565 What happens if we ask for the marginal on an observed node, e.g. P(W|W=2)? | |
566 An observed discrete node effectively only has 1 value (the observed | |
567 one) --- all other values would result in 0 probability. | |
568 For efficiency, BNT treats observed (discrete) nodes as if they were | |
569 set to 1, as we see below: | |
570 <pre> | |
571 evidence = cell(1,N); | |
572 evidence{W} = 2; | |
573 engine = enter_evidence(engine, evidence); | |
574 m = marginal_nodes(engine, W); | |
575 m.T | |
576 ans = | |
577 1 | |
578 </pre> | |
579 This can get a little confusing, since we assigned W=2. | |
580 So we can ask BNT to add the evidence back in by passing in an optional argument: | |
581 <pre> | |
582 m = marginal_nodes(engine, W, 1); | |
583 m.T | |
584 ans = | |
585 0 | |
586 1 | |
587 </pre> | |
588 This shows that P(W=1|W=2) = 0 and P(W=2|W=2) = 1. | |
589 | |
590 | |
591 | |
592 <h2><a name="joint">Computing joint distributions</h2> | |
593 | |
594 We can compute the joint probability on a set of nodes as in the | |
595 following example. | |
596 <pre> | |
597 evidence = cell(1,N); | |
598 [engine, ll] = enter_evidence(engine, evidence); | |
599 m = marginal_nodes(engine, [S R W]); | |
600 </pre> | |
601 m is a structure. The 'T' field is a multi-dimensional array (in | |
602 this case, 3-dimensional) that contains the joint probability | |
603 distribution on the specified nodes. | |
604 <pre> | |
605 >> m.T | |
606 ans(:,:,1) = | |
607 0.2900 0.0410 | |
608 0.0210 0.0009 | |
609 ans(:,:,2) = | |
610 0 0.3690 | |
611 0.1890 0.0891 | |
612 </pre> | |
613 We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass | |
614 to be wet if both the rain and sprinkler are off. | |
615 <p> | |
616 Let us now add some evidence to R. | |
617 <pre> | |
618 evidence{R} = 2; | |
619 [engine, ll] = enter_evidence(engine, evidence); | |
620 m = marginal_nodes(engine, [S R W]) | |
621 m = | |
622 domain: [2 3 4] | |
623 T: [2x1x2 double] | |
624 >> m.T | |
625 m.T | |
626 ans(:,:,1) = | |
627 0.0820 | |
628 0.0018 | |
629 ans(:,:,2) = | |
630 0.7380 | |
631 0.1782 | |
632 </pre> | |
633 The joint T(i,j,k) = P(S=i,R=j,W=k|evidence) | |
634 should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible | |
635 with the evidence that R=2. | |
636 Instead of creating large tables with many 0s, BNT sets the effective | |
637 size of observed (discrete) nodes to 1, as explained above. | |
638 This is why m.T has size 2x1x2. | |
639 To get a 2x2x2 table, type | |
640 <pre> | |
641 m = marginal_nodes(engine, [S R W], 1) | |
642 m = | |
643 domain: [2 3 4] | |
644 T: [2x2x2 double] | |
645 >> m.T | |
646 m.T | |
647 ans(:,:,1) = | |
648 0 0.082 | |
649 0 0.0018 | |
650 ans(:,:,2) = | |
651 0 0.738 | |
652 0 0.1782 | |
653 </pre> | |
654 | |
655 <p> | |
656 Note: It is not always possible to compute the joint on arbitrary | |
657 sets of nodes: it depends on which inference engine you use, as discussed | |
658 in more detail <a href="#engines">below</a>. | |
659 | |
660 | |
661 <h2><a name="soft">Soft/virtual evidence</h2> | |
662 | |
663 Sometimes a node is not observed, but we have some distribution over | |
664 its possible values; this is often called "soft" or "virtual" | |
665 evidence. | |
666 One can use this as follows | |
667 <pre> | |
668 [engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence); | |
669 </pre> | |
670 where soft_evidence{i} is either [] (if node i has no soft evidence) | |
671 or is a vector representing the probability distribution over i's | |
672 possible values. | |
673 For example, if we don't know i's exact value, but we know its | |
674 likelihood ratio is 60/40, we can write evidence{i} = [] and | |
675 soft_evidence{i} = [0.6 0.4]. | |
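<p>
Continuing the sprinkler example, a minimal sketch might look like this
(assuming R is the node we have soft evidence about):
<pre>
evidence = cell(1,N);
evidence{W} = 2;                 % hard evidence: the grass is wet
soft_evidence = cell(1,N);
soft_evidence{R} = [0.6 0.4];    % soft evidence on Rain (likelihood ratio 60/40)
[engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
marg = marginal_nodes(engine, S);
</pre>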
676 <p> | |
677 Currently only jtree_inf_engine supports this option. | |
678 It assumes that all hidden nodes, and all nodes for | |
679 which we have soft evidence, are discrete. | |
680 For a longer example, see BNT/examples/static/softev1.m. | |
681 | |
682 | |
683 <h2><a name="mpe">Most probable explanation</h2> | |
684 | |
685 To compute the most probable explanation (MPE) of the evidence (i.e., | |
686 the most probable assignment, or a mode of the joint), use | |
687 <pre> | |
688 [mpe, ll] = calc_mpe(engine, evidence); | |
689 </pre> | |
690 mpe{i} is the most likely value of node i. | |
691 This calls enter_evidence with the 'maximize' flag set to 1, which | |
692 causes the engine to do max-product instead of sum-product. | |
693 The resulting max-marginals are then thresholded. | |
694 If there is more than one maximum probability assignment, we must take | |
695 care to break ties in a consistent manner (thresholding the | |
696 max-marginals may give the wrong result). To force this behavior, | |
697 type | |
698 <pre> | |
699 [mpe, ll] = calc_mpe(engine, evidence, 1); | |
700 </pre> | |
Note that computing the MPE is sometimes called abductive reasoning.
702 | |
703 <p> | |
You can also use <tt>calc_mpe_bucket</tt>, written by Ron Zohar,
which does a forwards max-product pass and then a backwards traceback
pass, which is how Viterbi is traditionally implemented.
707 | |
708 | |
709 | |
710 <h1><a name="cpd">Conditional Probability Distributions</h1> | |
711 | |
A Conditional Probability Distribution (CPD)
713 defines P(X(i) | X(Pa(i))), where X(i) is the i'th node, and X(Pa(i)) | |
714 are the parents of node i. There are many ways to represent this | |
715 distribution, which depend in part on whether X(i) and X(Pa(i)) are | |
716 discrete, continuous, or a combination. | |
717 We will discuss various representations below. | |
718 | |
719 | |
720 <h2><a name="tabular">Tabular nodes</h2> | |
721 | |
722 If the CPD is represented as a table (i.e., if it is a multinomial | |
723 distribution), it has a number of parameters that is exponential in | |
724 the number of parents. See the example <a href="#cpt">above</a>. | |
725 | |
726 | |
727 <h2><a name="noisyor">Noisy-or nodes</h2> | |
728 | |
729 A noisy-OR node is like a regular logical OR gate except that | |
730 sometimes the effects of parents that are on get inhibited. | |
731 Let the prob. that parent i gets inhibited be q(i). | |
732 Then a node, C, with 2 parents, A and B, has the following CPD, where | |
733 we use F and T to represent off and on (1 and 2 in BNT). | |
734 <pre> | |
A  B    P(C=off)    P(C=on)
----------------------------
F  F    1.0         0.0
T  F    q(A)        1-q(A)
F  T    q(B)        1-q(B)
T  T    q(A)q(B)    1-q(A)q(B)
741 </pre> | |
742 Thus we see that the causes get inhibited independently. | |
743 It is common to associate a "leak" node with a noisy-or CPD, which is | |
744 like a parent that is always on. This can account for all other unmodelled | |
745 causes which might turn the node on. | |
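<p>
As a quick numerical check of the table above, with hypothetical
inhibition probabilities q(A)=0.1 and q(B)=0.2 (and no leak term):
<pre>
qA = 0.1; qB = 0.2;
P_off = [1; qA; qB; qA*qB];   % rows: (F,F), (T,F), (F,T), (T,T)
P_on  = 1 - P_off;            % e.g., P(C=on | A=T, B=T) = 1 - 0.02 = 0.98
</pre>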
746 <p> | |
747 The noisy-or distribution is similar to the logistic distribution. | |
748 To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j) | |
749 be the prob. that j inhibits i. Then | |
750 <pre> | |
751 Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j) | |
752 </pre> | |
753 Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then | |
754 <pre> | |
755 Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j)) | |
756 </pre> | |
757 For a sigmoid node, we have | |
758 <pre> | |
759 Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j)) | |
760 </pre> | |
761 where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of | |
762 the activation function (although both are monotonically increasing). | |
763 In addition, in the case of a noisy-or, the weights are constrained to be | |
764 positive, since they derive from probabilities q(i,j). | |
765 In both cases, the number of parameters is <em>linear</em> in the | |
766 number of parents, unlike the case of a multinomial distribution, | |
767 where the number of parameters is exponential in the number of parents. | |
768 We will see an example of noisy-OR nodes <a href="#qmr">below</a>. | |
769 | |
770 | |
771 <h2><a name="deterministic">Other (noisy) deterministic nodes</h2> | |
772 | |
773 Deterministic CPDs for discrete random variables can be created using | |
774 the deterministic_CPD class. It is also possible to 'flip' the output | |
775 of the function with some probability, to simulate noise. | |
776 The boolean_CPD class is just a special case of a | |
777 deterministic CPD, where the parents and child are all binary. | |
778 <p> | |
779 Both of these classes are just "syntactic sugar" for the tabular_CPD | |
780 class. | |
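<p>
For instance, a noise-free OR gate with two binary parents could equally
well be written out directly as a tabular CPD (a sketch, using the
false==1, true==2 convention from above, for some hypothetical node k):
<pre>
CPT = zeros(2,2,2);     % dimensions: parent 1, parent 2, child k
CPT(1,1,1) = 1;         % both parents off -> child off
CPT(2,1,2) = 1;         % otherwise        -> child on
CPT(1,2,2) = 1;
CPT(2,2,2) = 1;
bnet.CPD{k} = tabular_CPD(bnet, k, 'CPT', CPT(:)');
</pre>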
781 | |
782 | |
783 | |
784 <h2><a name="softmax">Softmax nodes</h2> | |
785 | |
786 If we have a discrete node with a continuous parent, | |
787 we can define its CPD using a softmax function | |
788 (also known as the multinomial logit function). | |
789 This acts like a soft thresholding operator, and is defined as follows: | |
790 <pre> | |
                      exp(w(:,i)'*x + b(i))
Pr(Q=i | X=x) = -----------------------------
                 sum_j exp(w(:,j)'*x + b(j))
794 | |
795 </pre> | |
796 The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the | |
797 following interpretation: w(:,i)-w(:,j) is the normal vector to the | |
798 decision boundary between classes i and j, | |
799 and b(i)-b(j) is its offset (bias). For example, suppose | |
800 X is a 2-vector, and Q is binary. Then | |
801 <pre> | |
802 w = [1 -1; | |
803 0 0]; | |
804 | |
805 b = [0 0]; | |
806 </pre> | |
807 means class 1 are points in the 2D plane with positive x coordinate, | |
808 and class 2 are points in the 2D plane with negative x coordinate. | |
809 If w has large magnitude, the decision boundary is sharp, otherwise it | |
810 is soft. | |
811 In the special case that Q is binary (0/1), the softmax function reduces to the logistic | |
812 (sigmoid) function. | |
813 <p> | |
814 Fitting a softmax function can be done using the iteratively reweighted | |
815 least squares (IRLS) algorithm. | |
816 We use the implementation from | |
817 <a href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>. | |
818 Note that since | |
819 the softmax distribution is not in the exponential family, it does not | |
820 have finite sufficient statistics, and hence we must store all the | |
821 training data in uncompressed form. | |
822 If this takes too much space, one should use online (stochastic) gradient | |
823 descent (not implemented in BNT). | |
824 <p> | |
825 If a softmax node also has discrete parents, | |
826 we use a different set of w/b parameters for each combination of | |
827 parent values, as in the <a href="#gaussian">conditional linear | |
828 Gaussian CPD</a>. | |
829 This feature was implemented by Pierpaolo Brutti. | |
830 He is currently extending it so that discrete parents can be treated | |
831 as if they were continuous, by adding indicator variables to the X | |
832 vector. | |
833 <p> | |
834 We will see an example of softmax nodes <a href="#mixexp">below</a>. | |
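<p>
A softmax CPD is created like any other CPD; the mixture-of-experts
example below does exactly this:
<pre>
bnet.CPD{Q} = softmax_CPD(bnet, Q);   % Q is a discrete node with a continuous parent
</pre>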
835 | |
836 | |
837 <h2><a name="mlp">Neural network nodes</h2> | |
838 | |
839 Pierpaolo Brutti has implemented the mlp_CPD class, which uses a multi layer perceptron | |
840 to implement a mapping from continuous parents to discrete children, | |
841 similar to the softmax function. | |
842 (If there are also discrete parents, it creates a mixture of MLPs.) | |
843 It uses code from <a | |
844 href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>. | |
845 This is work in progress. | |
846 | |
847 <h2><a name="root">Root nodes</h2> | |
848 | |
849 A root node has no parents and no parameters; it can be used to model | |
an observed, exogenous input variable, i.e., one which is "outside"
851 the model. | |
852 This is useful for conditional density models. | |
853 We will see an example of root nodes <a href="#mixexp">below</a>. | |
854 | |
855 | |
856 <h2><a name="gaussian">Gaussian nodes</h2> | |
857 | |
858 We now consider a distribution suitable for the continuous-valued nodes. | |
859 Suppose the node is called Y, its continuous parents (if any) are | |
860 called X, and its discrete parents (if any) are called Q. | |
861 The distribution on Y is defined as follows: | |
862 <pre> | |
863 - no parents: Y ~ N(mu, Sigma) | |
864 - cts parents : Y|X=x ~ N(mu + W x, Sigma) | |
865 - discrete parents: Y|Q=i ~ N(mu(:,i), Sigma(:,:,i)) | |
866 - cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i)) | |
867 </pre> | |
868 where N(mu, Sigma) denotes a Normal distribution with mean mu and | |
869 covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q | |
870 respectively. | |
871 If there are no discrete parents, |Q|=1; if there is | |
872 more than one, then |Q| = a vector of the sizes of each discrete parent. | |
873 If there are no continuous parents, |X|=0; if there is more than one, | |
874 then |X| = the sum of their sizes. | |
875 Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive | |
876 semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight) | |
877 matrix. | |
878 <p> | |
879 We can create a Gaussian node with random parameters as follows. | |
880 <pre> | |
881 bnet.CPD{i} = gaussian_CPD(bnet, i); | |
882 </pre> | |
883 We can specify the value of one or more of the parameters as in the | |
884 following example, in which |Y|=2, and |Q|=1. | |
885 <pre> | |
886 bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', [0; 0], 'weights', randn(Y,X), 'cov', eye(Y)); | |
887 </pre> | |
888 <p> | |
889 We will see an example of conditional linear Gaussian nodes <a | |
890 href="#cg_model">below</a>. | |
891 <p> | |
892 <b>When learning Gaussians from data</b>, it is helpful to ensure the | |
data has a small magnitude
894 (see e.g., KPMstats/standardize) to prevent numerical problems. | |
895 Unless you have a lot of data, it is also a very good idea to use | |
896 diagonal instead of full covariance matrices. | |
897 (BNT does not currently support spherical covariances, although it | |
898 would be easy to add, since KPMstats/clg_Mstep supports this option; | |
899 you would just need to modify gaussian_CPD/update_ess to accumulate | |
900 weighted inner products.) | |
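<p>
In code, that advice amounts to something like the following sketch
(check KPMstats/standardize for its exact calling convention; the
'cov_type' argument is used in the same way in the factor analysis
example below):
<pre>
data = standardize(data);                          % rescale to modest magnitude
bnet.CPD{i} = gaussian_CPD(bnet, i, 'cov_type', 'diag');
</pre>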
901 | |
902 | |
903 | |
904 <h2><a name="nongauss">Other continuous distributions</h2> | |
905 | |
906 Currently BNT does not support any CPDs for continuous nodes other | |
907 than the Gaussian. | |
908 However, you can use a mixture of Gaussians to | |
approximate other continuous distributions. We will see an example
910 of this with the IFA model <a href="#pca">below</a>. | |
911 | |
912 | |
913 <h2><a name="glm">Generalized linear model nodes</h2> | |
914 | |
915 In the future, we may incorporate some of the functionality of | |
916 <a href = | |
917 "http://www.sci.usq.edu.au/staff/dunn/glmlab/glmlab.html">glmlab</a> | |
918 into BNT. | |
919 | |
920 | |
921 <h2><a name="dtree">Classification/regression tree nodes</h2> | |
922 | |
923 We plan to add classification and regression trees to define CPDs for | |
924 discrete and continuous nodes, respectively. | |
925 Trees have many advantages: they are easy to interpret, they can do | |
926 feature selection, they can | |
927 handle discrete and continuous inputs, they do not make strong | |
928 assumptions about the form of the distribution, the number of | |
929 parameters can grow in a data-dependent way (i.e., they are | |
930 semi-parametric), they can handle missing data, etc. | |
931 However, they are not yet implemented. | |
932 <!-- | |
933 Yimin Zhang is currently (Feb '02) implementing this. | |
934 --> | |
935 | |
936 | |
937 <h2><a name="cpd_summary">Summary of CPD types</h2> | |
938 | |
939 We list all the different types of CPDs supported by BNT. | |
940 For each CPD, we specify if the child and parents can be discrete (D) or | |
941 continuous (C) (Binary (B) nodes are a special case). | |
942 We also specify which methods each class supports. | |
943 If a method is inherited, the name of the parent class is mentioned. | |
944 If a parent class calls a child method, this is mentioned. | |
945 <p> | |
946 The <tt>CPD_to_CPT</tt> method converts a CPD to a table; this | |
947 requires that the child and all parents are discrete. | |
948 The CPT might be exponentially big... | |
949 <tt>convert_to_table</tt> evaluates a CPD with evidence, and | |
represents the resulting potential as an array.
951 This requires that the child is discrete, and any continuous parents | |
952 are observed. | |
953 <tt>convert_to_pot</tt> evaluates a CPD with evidence, and | |
954 represents the resulting potential as a dpot, gpot, cgpot or upot, as | |
955 requested. (d=discrete, g=Gaussian, cg = conditional Gaussian, u = | |
956 utility). | |
957 | |
958 <p> | |
959 When we sample a node, all the parents are observed. | |
960 When we compute the (log) probability of a node, all the parents and | |
961 the child are observed. | |
962 <p> | |
963 We also specify if the parameters are learnable. | |
964 For learning with EM, we require | |
965 the methods <tt>reset_ess</tt>, <tt>update_ess</tt> and | |
966 <tt>maximize_params</tt>. | |
967 For learning from fully observed data, we require | |
968 the method <tt>learn_params</tt>. | |
969 By default, all classes inherit this from generic_CPD, which simply | |
970 calls <tt>update_ess</tt> N times, once for each data case, followed | |
971 by <tt>maximize_params</tt>, i.e., it is like EM, without the E step. | |
972 Some classes implement a batch formula, which is quicker. | |
973 <p> | |
974 Bayesian learning means computing a posterior over the parameters | |
975 given fully observed data. | |
976 <p> | |
977 Pearl means we implement the methods <tt>compute_pi</tt> and | |
978 <tt>compute_lambda_msg</tt>, used by | |
979 <tt>pearl_inf_engine</tt>, which runs on directed graphs. | |
<tt>belprop_inf_engine</tt> only needs <tt>convert_to_pot</tt>.
981 The pearl methods can exploit special properties of the CPDs for | |
982 computing the messages efficiently, whereas belprop does not. | |
983 <p> | |
984 The only method implemented by generic_CPD is <tt>adjustable_CPD</tt>, | |
985 which is not shown, since it is not very interesting. | |
986 | |
987 | |
988 <p> | |
989 | |
990 | |
<table border=1><tr>
993 <td align=center>Name | |
994 <td align=center>Child | |
995 <td align=center>Parents | |
996 <td align=center>Comments | |
997 <td align=center>CPD_to_CPT | |
998 <td align=center>conv_to_table | |
999 <td align=center>conv_to_pot | |
1000 <td align=center>sample | |
1001 <td align=center>prob | |
1002 <td align=center>learn | |
1003 <td align=center>Bayes | |
1004 <td align=center>Pearl | |
1005 | |
1006 | |
1007 <tr> | |
1008 <!-- Name--><td> | |
1009 <!-- Child--><td> | |
1010 <!-- Parents--><td> | |
1011 <!-- Comments--><td> | |
1012 <!-- CPD_to_CPT--><td> | |
1013 <!-- conv_to_table--><td> | |
1014 <!-- conv_to_pot--><td> | |
1015 <!-- sample--><td> | |
1016 <!-- prob--><td> | |
1017 <!-- learn--><td> | |
1018 <!-- Bayes--><td> | |
1019 <!-- Pearl--><td> | |
1020 | |
1021 <tr> | |
1022 <!-- Name--><td>boolean | |
1023 <!-- Child--><td>B | |
1024 <!-- Parents--><td>B | |
1025 <!-- Comments--><td>Syntactic sugar for tabular | |
1026 <!-- CPD_to_CPT--><td>- | |
1027 <!-- conv_to_table--><td>- | |
1028 <!-- conv_to_pot--><td>- | |
1029 <!-- sample--><td>- | |
1030 <!-- prob--><td>- | |
1031 <!-- learn--><td>- | |
1032 <!-- Bayes--><td>- | |
1033 <!-- Pearl--><td>- | |
1034 | |
1035 <tr> | |
1036 <!-- Name--><td>deterministic | |
1037 <!-- Child--><td>D | |
1038 <!-- Parents--><td>D | |
1039 <!-- Comments--><td>Syntactic sugar for tabular | |
1040 <!-- CPD_to_CPT--><td>- | |
1041 <!-- conv_to_table--><td>- | |
1042 <!-- conv_to_pot--><td>- | |
1043 <!-- sample--><td>- | |
1044 <!-- prob--><td>- | |
1045 <!-- learn--><td>- | |
1046 <!-- Bayes--><td>- | |
1047 <!-- Pearl--><td>- | |
1048 | |
1049 <tr> | |
1050 <!-- Name--><td>Discrete | |
1051 <!-- Child--><td>D | |
1052 <!-- Parents--><td>C/D | |
1053 <!-- Comments--><td>Virtual class | |
1054 <!-- CPD_to_CPT--><td>N | |
1055 <!-- conv_to_table--><td>Calls CPD_to_CPT | |
1056 <!-- conv_to_pot--><td>Calls conv_to_table | |
1057 <!-- sample--><td>Calls conv_to_table | |
1058 <!-- prob--><td>Calls conv_to_table | |
1059 <!-- learn--><td>N | |
1060 <!-- Bayes--><td>N | |
1061 <!-- Pearl--><td>N | |
1062 | |
1063 <tr> | |
1064 <!-- Name--><td>Gaussian | |
1065 <!-- Child--><td>C | |
1066 <!-- Parents--><td>C/D | |
1067 <!-- Comments--><td>- | |
1068 <!-- CPD_to_CPT--><td>N | |
1069 <!-- conv_to_table--><td>N | |
1070 <!-- conv_to_pot--><td>Y | |
1071 <!-- sample--><td>Y | |
1072 <!-- prob--><td>Y | |
1073 <!-- learn--><td>Y | |
1074 <!-- Bayes--><td>N | |
1075 <!-- Pearl--><td>N | |
1076 | |
1077 <tr> | |
1078 <!-- Name--><td>gmux | |
1079 <!-- Child--><td>C | |
1080 <!-- Parents--><td>C/D | |
1081 <!-- Comments--><td>multiplexer | |
1082 <!-- CPD_to_CPT--><td>N | |
1083 <!-- conv_to_table--><td>N | |
1084 <!-- conv_to_pot--><td>Y | |
1085 <!-- sample--><td>N | |
1086 <!-- prob--><td>N | |
1087 <!-- learn--><td>N | |
1088 <!-- Bayes--><td>N | |
1089 <!-- Pearl--><td>Y | |
1090 | |
1091 | |
1092 <tr> | |
1093 <!-- Name--><td>MLP | |
1094 <!-- Child--><td>D | |
1095 <!-- Parents--><td>C/D | |
1096 <!-- Comments--><td>multi layer perceptron | |
1097 <!-- CPD_to_CPT--><td>N | |
1098 <!-- conv_to_table--><td>Y | |
1099 <!-- conv_to_pot--><td>Inherits from discrete | |
1100 <!-- sample--><td>Inherits from discrete | |
1101 <!-- prob--><td>Inherits from discrete | |
1102 <!-- learn--><td>Y | |
1103 <!-- Bayes--><td>N | |
1104 <!-- Pearl--><td>N | |
1105 | |
1106 | |
1107 <tr> | |
1108 <!-- Name--><td>noisy-or | |
1109 <!-- Child--><td>B | |
1110 <!-- Parents--><td>B | |
1111 <!-- Comments--><td>- | |
1112 <!-- CPD_to_CPT--><td>Y | |
1113 <!-- conv_to_table--><td>Inherits from discrete | |
1114 <!-- conv_to_pot--><td>Inherits from discrete | |
1115 <!-- sample--><td>Inherits from discrete | |
1116 <!-- prob--><td>Inherits from discrete | |
1117 <!-- learn--><td>N | |
1118 <!-- Bayes--><td>N | |
1119 <!-- Pearl--><td>Y | |
1120 | |
1121 | |
1122 <tr> | |
1123 <!-- Name--><td>root | |
1124 <!-- Child--><td>C/D | |
1125 <!-- Parents--><td>none | |
1126 <!-- Comments--><td>no params | |
1127 <!-- CPD_to_CPT--><td>N | |
1128 <!-- conv_to_table--><td>N | |
1129 <!-- conv_to_pot--><td>Y | |
1130 <!-- sample--><td>Y | |
1131 <!-- prob--><td>Y | |
1132 <!-- learn--><td>N | |
1133 <!-- Bayes--><td>N | |
1134 <!-- Pearl--><td>N | |
1135 | |
1136 | |
1137 <tr> | |
1138 <!-- Name--><td>softmax | |
1139 <!-- Child--><td>D | |
1140 <!-- Parents--><td>C/D | |
1141 <!-- Comments--><td>- | |
1142 <!-- CPD_to_CPT--><td>N | |
1143 <!-- conv_to_table--><td>Y | |
1144 <!-- conv_to_pot--><td>Inherits from discrete | |
1145 <!-- sample--><td>Inherits from discrete | |
1146 <!-- prob--><td>Inherits from discrete | |
1147 <!-- learn--><td>Y | |
1148 <!-- Bayes--><td>N | |
1149 <!-- Pearl--><td>N | |
1150 | |
1151 | |
1152 <tr> | |
1153 <!-- Name--><td>generic | |
1154 <!-- Child--><td>C/D | |
1155 <!-- Parents--><td>C/D | |
1156 <!-- Comments--><td>Virtual class | |
1157 <!-- CPD_to_CPT--><td>N | |
1158 <!-- conv_to_table--><td>N | |
1159 <!-- conv_to_pot--><td>N | |
1160 <!-- sample--><td>N | |
1161 <!-- prob--><td>N | |
1162 <!-- learn--><td>N | |
1163 <!-- Bayes--><td>N | |
1164 <!-- Pearl--><td>N | |
1165 | |
1166 | |
1167 <tr> | |
1168 <!-- Name--><td>Tabular | |
1169 <!-- Child--><td>D | |
1170 <!-- Parents--><td>D | |
1171 <!-- Comments--><td>- | |
1172 <!-- CPD_to_CPT--><td>Y | |
1173 <!-- conv_to_table--><td>Inherits from discrete | |
1174 <!-- conv_to_pot--><td>Inherits from discrete | |
1175 <!-- sample--><td>Inherits from discrete | |
1176 <!-- prob--><td>Inherits from discrete | |
1177 <!-- learn--><td>Y | |
1178 <!-- Bayes--><td>Y | |
1179 <!-- Pearl--><td>Y | |
1180 | |
1181 </table> | |
1182 | |
1183 | |
1184 | |
1185 <h1><a name="examples">Example models</h1> | |
1186 | |
1187 | |
1188 <h2>Gaussian mixture models</h2> | |
1189 | |
1190 Richard W. DeVaul has made a detailed tutorial on how to fit mixtures | |
1191 of Gaussians using BNT. Available | |
1192 <a href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">here</a>. | |
1193 | |
1194 | |
1195 <h2><a name="pca">PCA, ICA, and all that </h2> | |
1196 | |
1197 In Figure (a) below, we show how Factor Analysis can be thought of as a | |
1198 graphical model. Here, X has an N(0,I) prior, and | |
1199 Y|X=x ~ N(mu + Wx, Psi), | |
1200 where Psi is diagonal and W is called the "factor loading matrix". | |
1201 Since the noise on both X and Y is diagonal, the components of these | |
1202 vectors are uncorrelated, and hence can be represented as individual | |
1203 scalar nodes, as we show in (b). | |
1204 (This is useful if parts of the observations on the Y vector are occasionally missing.) | |
1205 We usually take k=|X| << |Y|=D, so the model tries to explain | |
1206 many observations using a low-dimensional subspace. | |
1207 | |
1208 | |
1209 <center> | |
1210 <table> | |
1211 <tr> | |
1212 <td><img src="Figures/fa.gif"> | |
1213 <td><img src="Figures/fa_scalar.gif"> | |
1214 <td><img src="Figures/mfa.gif"> | |
1215 <td><img src="Figures/ifa.gif"> | |
1216 <tr> | |
1217 <td align=center> (a) | |
1218 <td align=center> (b) | |
1219 <td align=center> (c) | |
1220 <td align=center> (d) | |
1221 </table> | |
1222 </center> | |
1223 | |
1224 <p> | |
1225 We can create this model in BNT as follows. | |
1226 <pre> | |
1227 ns = [k D]; | |
1228 dag = zeros(2,2); | |
1229 dag(1,2) = 1; | |
1230 bnet = mk_bnet(dag, ns, 'discrete', []); | |
1231 bnet.CPD{1} = gaussian_CPD(bnet, 1, 'mean', zeros(k,1), 'cov', eye(k), ... | |
1232 'cov_type', 'diag', 'clamp_mean', 1, 'clamp_cov', 1); | |
1233 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(D,1), 'cov', diag(Psi0), 'weights', W0, ... | |
1234 'cov_type', 'diag', 'clamp_mean', 1); | |
1235 </pre> | |
1236 | |
1237 The root node is clamped to the N(0,I) distribution, so that we will | |
1238 not update these parameters during learning. | |
1239 The mean of the leaf node is clamped to 0, | |
1240 since we assume the data has been centered (had its mean subtracted | |
1241 off); this is just for simplicity. | |
1242 Finally, the covariance of the leaf node is constrained to be | |
1243 diagonal. W0 and Psi0 are the initial parameter guesses. | |
1244 | |
1245 <p> | |
1246 We can fit this model (i.e., estimate its parameters in a maximum | |
1247 likelihood (ML) sense) using EM, as we | |
1248 explain <a href="#em">below</a>. | |
1249 Not surprisingly, the ML estimates for mu and Psi turn out to be | |
1250 identical to the | |
1251 sample mean and variance, which can be computed directly as | |
1252 <pre> | |
1253 mu_ML = mean(data); | |
1254 Psi_ML = diag(cov(data)); | |
1255 </pre> | |
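<p>
For completeness, here is a sketch of the EM fit itself, assuming
<tt>data</tt> is an ncases-by-D matrix (learn_params_em is discussed
in the <a href="#em">EM</a> section below):
<pre>
engine = jtree_inf_engine(bnet);
ncases = size(data, 1);
cases = cell(2, ncases);
cases(2,:) = num2cell(data', 1);   % node 2 (Y) is observed; node 1 (X) stays hidden
max_iter = 10;                     % arbitrary cap on the number of EM iterations
[bnet2, LLtrace] = learn_params_em(engine, cases, max_iter);
</pre>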
1256 Note that W can only be identified up to a rotation matrix, because of | |
1257 the spherical symmetry of the source. | |
1258 | |
1259 <p> | |
1260 If we restrict Psi to be spherical, i.e., Psi = sigma*I, | |
1261 there is a closed-form solution for W as well, | |
1262 i.e., we do not need to use EM. | |
1263 In particular, W contains the first |X| eigenvectors of the sample covariance | |
1264 matrix, with scalings determined by the eigenvalues and sigma. | |
1265 Classical PCA can be obtained by taking the sigma->0 limit. | |
1266 For details, see | |
1267 | |
1268 <ul> | |
1269 <li> <a href="ftp://hope.caltech.edu/pub/roweis/Empca/empca.ps"> | |
1270 "EM algorithms for PCA and SPCA"</a>, Sam Roweis, NIPS 97. | |
1271 (<a href="ftp://hope.caltech.edu/pub/roweis/Code/empca.tar.gz"> | |
1272 Matlab software</a>) | |
1273 | |
1274 <p> | |
1275 <li> | |
1276 <a | |
1277 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003> | |
1278 "Mixtures of probabilistic principal component analyzers"</a>, | |
1279 Tipping and Bishop, Neural Computation 11(2):443--482, 1999. | |
1280 </ul> | |
1281 | |
1282 <p> | |
1283 By adding a hidden discrete variable, we can create mixtures of FA | |
1284 models, as shown in (c). | |
1285 Now we can explain the data using a set of subspaces. | |
1286 We can create this model in BNT as follows. | |
1287 <pre> | |
1288 ns = [M k D]; | |
1289 dag = zeros(3); | |
1290 dag(1,3) = 1; | |
1291 dag(2,3) = 1; | |
1292 bnet = mk_bnet(dag, ns, 'discrete', 1); | |
1293 bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0); | |
1294 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(k, 1), 'cov', eye(k), 'cov_type', 'diag', ... | |
1295 'clamp_mean', 1, 'clamp_cov', 1); | |
1296 bnet.CPD{3} = gaussian_CPD(bnet, 3, 'mean', Mu0', 'cov', repmat(diag(Psi0), [1 1 M]), ... | |
1297 'weights', W0, 'cov_type', 'diag', 'tied_cov', 1); | |
1298 </pre> | |
1299 Notice how the covariance matrix for Y is the same for all values of | |
1300 Q; that is, the noise level in each sub-space is assumed the same. | |
1301 However, we allow the offset, mu, to vary. | |
1302 For details, see | |
1303 <ul> | |
1304 | |
1305 <LI> | |
1306 <a HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/tr-96-1.ps.gz"> The EM | |
1307 Algorithm for Mixtures of Factor Analyzers </A>, | |
1308 Ghahramani, Z. and Hinton, G.E. (1996), | |
1309 University of Toronto | |
1310 Technical Report CRG-TR-96-1. | |
1311 (<A HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/mfa.tar.gz">Matlab software</A>) | |
1312 | |
1313 <p> | |
1314 <li> | |
1315 <a | |
1316 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003> | |
1317 "Mixtures of probabilistic principal component analyzers"</a>, | |
1318 Tipping and Bishop, Neural Computation 11(2):443--482, 1999. | |
1319 </ul> | |
1320 | |
1321 <p> | |
1322 I have included Zoubin's specialized MFA code (with his permission) | |
1323 with the toolbox, so you can check that BNT gives the same results: | |
1324 see 'BNT/examples/static/mfa1.m'. | |
1325 | |
1326 <p> | |
1327 Independent Factor Analysis (IFA) generalizes FA by allowing a | |
1328 non-Gaussian prior on each component of X. | |
1329 (Note that we can approximate a non-Gaussian prior using a mixture of | |
1330 Gaussians.) | |
1331 This means that the likelihood function is no longer rotationally | |
1332 invariant, so we can uniquely identify W and the hidden | |
1333 sources X. | |
1334 IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y). | |
1335 We recover classical Independent Components Analysis (ICA) | |
1336 in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the | |
1337 weight matrix W is square and invertible. | |
1338 For details, see | |
1339 <ul> | |
1340 <li> | |
1341 <a href="http://www.gatsby.ucl.ac.uk/~hagai/ifa.ps">Independent Factor | |
1342 Analysis</a>, H. Attias, Neural Computation 11: 803--851, 1998. | |
1343 </ul> | |
1344 | |
1345 | |
1346 | |
1347 <h2><a name="mixexp">Mixtures of experts</h2> | |
1348 | |
1349 As an example of the use of the softmax function, | |
1350 we introduce the Mixture of Experts model. | |
1351 <!-- | |
1352 We also show | |
1353 the Hierarchical Mixture of Experts model, where the hierarchy has two | |
1354 levels. | |
1355 (This is essentially a probabilistic decision tree of height two.) | |
1356 --> | |
1357 As before, | |
1358 circles denote continuous-valued nodes, | |
1359 squares denote discrete nodes, clear | |
1360 means hidden, and shaded means observed. | |
1361 <p> | |
1362 <center> | |
1363 <table> | |
1364 <tr> | |
1365 <td><img src="Figures/mixexp.gif"> | |
1366 <!-- | |
1367 <td><img src="Figures/hme.gif"> | |
1368 --> | |
1369 </table> | |
1370 </center> | |
1371 <p> | |
1372 X is the observed | |
1373 input, Y is the output, and | |
1374 the Q nodes are hidden "gating" nodes, which select the appropriate | |
1375 set of parameters for Y. During training, Y is assumed observed, | |
1376 but for testing, the goal is to predict Y given X. | |
1377 Note that this is a <em>conditional</em> density model, so we don't | |
1378 associate any parameters with X. | |
1379 Hence X's CPD will be a root CPD, which is a way of modelling | |
1380 exogenous nodes. | |
1381 If the output is a continuous-valued quantity, | |
1382 we assume the "experts" are linear-regression units, | |
1383 and set Y's CPD to linear-Gaussian. | |
1384 If the output is discrete, we set Y's CPD to a softmax function. | |
1385 The Q CPDs will always be softmax functions. | |
1386 | |
1387 <p> | |
1388 As a concrete example, consider the mixture of experts model where X and Y are | |
1389 scalars, and Q is binary. | |
1390 This is just piecewise linear regression, where | |
1391 we have two line segments, i.e., | |
1392 <P> | |
1393 <IMG ALIGN=BOTTOM SRC="Eqns/lin_reg_eqn.gif"> | |
1394 <P> | |
1395 We can create this model with random parameters as follows. | |
1396 (This code is bundled in BNT/examples/static/mixexp2.m.) | |
1397 <PRE> | |
1398 X = 1; | |
1399 Q = 2; | |
1400 Y = 3; | |
1401 dag = zeros(3,3); | |
dag(X,[Q Y]) = 1;
1403 dag(Q,Y) = 1; | |
1404 ns = [1 2 1]; % make X and Y scalars, and have 2 experts | |
1405 onodes = [1 3]; | |
1406 bnet = mk_bnet(dag, ns, 'discrete', 2, 'observed', onodes); | |
1407 | |
1408 rand('state', 0); | |
1409 randn('state', 0); | |
1410 bnet.CPD{1} = root_CPD(bnet, 1); | |
1411 bnet.CPD{2} = softmax_CPD(bnet, 2); | |
1412 bnet.CPD{3} = gaussian_CPD(bnet, 3); | |
1413 </PRE> | |
1414 Now let us fit this model using <a href="#em">EM</a>. | |
1415 First we <a href="#load_data">load the data</a> (1000 training cases) and plot them. | |
1416 <P> | |
1417 <PRE> | |
1418 data = load('/examples/static/Misc/mixexp_data.txt', '-ascii'); | |
1419 plot(data(:,1), data(:,2), '.'); | |
1420 </PRE> | |
1421 <p> | |
1422 <center> | |
1423 <IMG SRC="Figures/mixexp_data.gif"> | |
1424 </center> | |
1425 <p> | |
1426 This is what the model looks like before training. | |
1427 (Thanks to Thomas Hofman for writing this plotting routine.) | |
1428 <p> | |
1429 <center> | |
1430 <IMG SRC="Figures/mixexp_before.gif"> | |
1431 </center> | |
1432 <p> | |
1433 Now let's train the model, and plot the final performance. | |
1434 (We will discuss how to train models in more detail <a href="#param_learning">below</a>.) | |
1435 <P> | |
1436 <PRE> | |
1437 ncases = size(data, 1); % each row of data is a training case | |
1438 cases = cell(3, ncases); | |
1439 cases([1 3], :) = num2cell(data'); % each column of cases is a training case | |
1440 engine = jtree_inf_engine(bnet); | |
1441 max_iter = 20; | |
1442 [bnet2, LLtrace] = learn_params_em(engine, cases, max_iter); | |
1443 </PRE> | |
1444 (We specify which nodes will be observed when we create the engine. | |
1445 Hence BNT knows that the hidden nodes are all discrete. | |
1446 For complex models, this can lead to a significant speedup.) | |
1447 Below we show what the model looks like after 16 iterations of EM | |
1448 (with 100 IRLS iterations per M step), when it converged | |
1449 using the default convergence tolerance (that the | |
1450 fractional change in the log-likelihood be less than 1e-3). | |
1451 Before learning, the log-likelihood was | |
1452 -322.927442; afterwards, it was -13.728778. | |
1453 <p> | |
1454 <center> | |
1455 <IMG SRC="Figures/mixexp_after.gif"> | |
1456 </center> | |
1457 (See BNT/examples/static/mixexp2.m for details of the code.) | |
1458 | |
1459 | |
1460 | |
1461 <h2><a name="hme">Hierarchical mixtures of experts</h2> | |
1462 | |
1463 A hierarchical mixture of experts (HME) extends the mixture of experts | |
1464 model by using a hierarchy of hidden gating nodes. A two-level example is shown below, along | |
1465 with its more traditional representation as a neural network. | |
1466 This is like a (balanced) probabilistic decision tree of height 2. | |
1467 <p> | |
1468 <center> | |
1469 <IMG SRC="Figures/HMEforMatlab.jpg"> | |
1470 </center> | |
1471 <p> | |
1472 <a href="mailto:pbrutti@stat.cmu.edu">Pierpaolo Brutti</a> | |
1473 has written an extensive set of routines for HMEs, | |
1474 which are bundled with BNT: see the examples/static/HME directory. | |
1475 These routines allow you to choose the number of hidden (gating) | |
1476 layers, and the form of the experts (softmax or MLP). | |
1477 See the file hmemenu, which provides a demo. | |
1478 For example, the figure below shows the decision boundaries learned | |
1479 for a ternary classification problem, using a 2 level HME with softmax | |
1480 gates and softmax experts; the training set is on the left, the | |
1481 testing set on the right. | |
1482 <p> | |
1483 <center> | |
1484 <!--<IMG SRC="Figures/hme_dec_boundary.gif">--> | |
1485 <IMG SRC="Figures/hme_dec_boundary.png"> | |
1486 </center> | |
1487 <p> | |
1488 | |
1489 | |
1490 <p> | |
1491 For more details, see the following: | |
1492 <ul> | |
1493 | |
1494 <li> <a href="http://www.cs.berkeley.edu/~jordan/papers/hierarchies.ps.Z"> | |
1495 Hierarchical mixtures of experts and the EM algorithm</a> | |
1496 M. I. Jordan and R. A. Jacobs. Neural Computation, 6, 181-214, 1994. | |
1497 | |
1498 <li> <a href = | |
1499 "http://www.cs.berkeley.edu/~dmartin/software">David Martin's | |
1500 matlab code for HME</a> | |
1501 | |
1502 <li> <a | |
1503 href="http://www.cs.berkeley.edu/~jordan/papers/uai.ps.Z">Why the | |
1504 logistic function? A tutorial discussion on | |
1505 probabilities and neural networks.</a> M. I. Jordan. MIT Computational | |
1506 Cognitive Science Report 9503, August 1995. | |
1507 | |
1508 <li> "Generalized Linear Models", McCullagh and Nelder, Chapman and | |
1509 Hall, 1983. | |
1510 | |
1511 <li> | |
1512 "Improved learning algorithms for mixtures of experts in multiclass | |
1513 classification". | |
1514 K. Chen, L. Xu, H. Chi. | |
1515 Neural Networks (1999) 12: 1229-1252. | |
1516 | |
1517 <li> <a href="http://www.oigeeza.com/steve/"> | |
1518 Classification Using Hierarchical Mixtures of Experts</a> | |
1519 S.R. Waterhouse and A.J. Robinson. | |
1520 In Proc. IEEE Workshop on Neural Networks for Signal Processing IV (1994), pp. 177-186 | |
1521 | |
1522 <li> <a href="http://www.idiap.ch/~perry/"> | |
1523 Localized mixtures of experts</a>, | |
1524 P. Moerland, 1998. | |
1525 | |
1526 <li> "Nonlinear gated experts for time series", | |
1527 A.S. Weigend and M. Mangeas, 1995. | |
1528 | |
1529 </ul> | |
1530 | |
1531 | |
1532 <h2><a name="qmr">QMR</h2> | |
1533 | |
1534 Bayes nets originally arose out of an attempt to add probabilities to | |
1535 expert systems, and this is still the most common use for BNs. | |
1536 A famous example is | |
1537 QMR-DT, a decision-theoretic reformulation of the Quick Medical | |
1538 Reference (QMR) model. | |
1539 <p> | |
1540 <center> | |
1541 <IMG ALIGN=BOTTOM SRC="Figures/qmr.gif"> | |
1542 </center> | |
1543 Here, the top layer represents hidden disease nodes, and the bottom | |
1544 layer represents observed symptom nodes. | |
1545 The goal is to infer the posterior probability of each disease given | |
1546 all the symptoms (which can be present, absent or unknown). | |
1547 Each node in the top layer has a Bernoulli prior (with a low prior | |
1548 probability that the disease is present). | |
1549 Since each node in the bottom layer has a high fan-in, we use a | |
1550 noisy-OR parameterization; each disease has an independent chance of | |
1551 causing each symptom. | |
1552 The real QMR-DT model is copyrighted, but | |
1553 we can create a random QMR-like model as follows. | |
1554 <pre> | |
1555 function bnet = mk_qmr_bnet(G, inhibit, leak, prior) | |
1556 % MK_QMR_BNET Make a QMR model | |
1557 % bnet = mk_qmr_bnet(G, inhibit, leak, prior) | |
1558 % | |
1559 % G(i,j) = 1 iff there is an arc from disease i to finding j | |
1560 % inhibit(i,j) = inhibition probability on i->j arc | |
1561 % leak(j) = inhibition prob. on leak->j arc | |
1562 % prior(i) = prob. disease i is on | |
1563 | |
1564 [Ndiseases Nfindings] = size(inhibit); | |
1565 N = Ndiseases + Nfindings; | |
1566 finding_node = Ndiseases+1:N; | |
1567 ns = 2*ones(1,N); | |
1568 dag = zeros(N,N); | |
1569 dag(1:Ndiseases, finding_node) = G; | |
1570 bnet = mk_bnet(dag, ns, 'observed', finding_node); | |
1571 | |
1572 for d=1:Ndiseases | |
1573 CPT = [1-prior(d) prior(d)]; | |
1574 bnet.CPD{d} = tabular_CPD(bnet, d, CPT'); | |
1575 end | |
1576 | |
1577 for i=1:Nfindings | |
1578 fnode = finding_node(i); | |
1579 ps = parents(G, i); | |
1580 bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(ps, i)); | |
1581 end | |
1582 </pre> | |
1583 In the file BNT/examples/static/qmr1, we create a random bipartite | |
1584 graph G, with 5 diseases and 10 findings, and random parameters. | |
1585 (In general, to create a random dag, use 'mk_random_dag'.) | |
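<p>
For concreteness, such a construction might look like the following
(a rough sketch with made-up random parameters, not the actual qmr1 code):
<pre>
Ndiseases = 5; Nfindings = 10;
G = rand(Ndiseases, Nfindings) > 0.5;    % random bipartite graph
inhibit = rand(Ndiseases, Nfindings);    % random inhibition probabilities
leak = 0.5*rand(1, Nfindings);           % random leak probabilities
prior = 0.01*rand(1, Ndiseases);         % low prior prob. of each disease
bnet = mk_qmr_bnet(G, inhibit, leak, prior);
</pre>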
1586 We can visualize the resulting graph structure using | |
1587 the methods discussed <a href="#graphdraw">below</a>, with the | |
1588 following results: | |
1589 <p> | |
1590 <img src="Figures/qmr.rnd.jpg"> | |
1591 | |
1592 <p> | |
1593 Now let us put some random evidence on all the leaves except the very | |
1594 first and very last, and compute the disease posteriors. | |
1595 <pre> | |
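% (assumption, for clarity: 'diseases' = 1:Ndiseases and 'findings' = the
%  finding node indices Ndiseases+1:N, as in mk_qmr_bnet above)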
1596 pos = 2:floor(Nfindings/2); | |
1597 neg = (pos(end)+1):(Nfindings-1); | |
1598 onodes = myunion(pos, neg); | |
1599 evidence = cell(1, N); | |
1600 evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos))); | |
1601 evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg))); | |
1602 | |
1603 engine = jtree_inf_engine(bnet); | |
1604 [engine, ll] = enter_evidence(engine, evidence); | |
1605 post = zeros(1, Ndiseases); | |
1606 for i=diseases(:)' | |
1607 m = marginal_nodes(engine, i); | |
1608 post(i) = m.T(2); | |
1609 end | |
1610 </pre> | |
1611 Junction tree can be quite slow on large QMR models. | |
1612 Fortunately, it is possible to exploit properties of the noisy-OR | |
1613 function to speed up exact inference using an algorithm called | |
1614 <a href="#quickscore">quickscore</a>, discussed below. | |
1615 | |
1616 | |
1617 | |
1618 | |
1619 | |
1620 <h2><a name="cg_model">Conditional Gaussian models</h2> | |
1621 | |
1622 A conditional Gaussian model is one in which, conditioned on all the discrete | |
1623 nodes, the distribution over the remaining (continuous) nodes is | |
1624 multivariate Gaussian. This means we can have arcs from discrete (D) | |
1625 to continuous (C) nodes, but not vice versa. | |
1626 (We <em>are</em> allowed C->D arcs if the continuous nodes are observed, | |
1627 as in the <a href="#mixexp">mixture of experts</a> model, | |
1628 since this distribution can be represented with a discrete potential.) | |
1629 <p> | |
1630 We now give an example of a CG model, from | |
1631 the paper "Propagation of Probabilities, Means and | |
1632 Variances in Mixed Graphical Association Models", Steffen Lauritzen, | |
1633 JASA 87(420):1098--1108, 1992 (reprinted in the book "Probabilistic Networks and Expert | |
1634 Systems", R. G. Cowell, A. P. Dawid, S. L. Lauritzen and | |
1635 D. J. Spiegelhalter, Springer, 1999.) | |
1636 | |
1637 <h3>Specifying the graph</h3> | |
1638 | |
1639 Consider the model of waste emissions from an incinerator plant shown below. | |
1640 We follow the standard convention that shaded nodes are observed, | |
1641 clear nodes are hidden. | |
1642 We also use the non-standard convention that | |
1643 square nodes are discrete (tabular) and round nodes are | |
1644 Gaussian. | |
1645 | |
1646 <p> | |
1647 <center> | |
1648 <IMG SRC="Figures/cg1.gif"> | |
1649 </center> | |
1650 <p> | |
1651 | |
1652 We can create this model as follows. | |
1653 <pre> | |
1654 F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9; | |
1655 n = 9; | |
1656 | |
1657 dag = zeros(n); | |
1658 dag(F,E)=1; | |
1659 dag(W,[E Min D]) = 1; | |
1660 dag(E,D)=1; | |
1661 dag(B,[C D])=1; | |
1662 dag(D,[L Mout])=1; | |
1663 dag(Min,Mout)=1; | |
1664 | |
1665 % node sizes - all cts nodes are scalar, all discrete nodes are binary | |
1666 ns = ones(1, n); | |
1667 dnodes = [F W B]; | |
1668 cnodes = mysetdiff(1:n, dnodes); | |
1669 ns(dnodes) = 2; | |
1670 | |
1671 bnet = mk_bnet(dag, ns, 'discrete', dnodes); | |
1672 </pre> | |
1673 'dnodes' is a list of the discrete nodes; 'cnodes' is the continuous | |
1674 nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'. | |
1675 <p> | |
1676 | |
1677 | |
1678 <h3>Specifying the parameters</h3> | |
1679 | |
1680 The parameters of the discrete nodes can be specified as follows. | |
1681 <pre> | |
1682 bnet.CPD{B} = tabular_CPD(bnet, B, 'CPT', [0.85 0.15]); % 1=stable, 2=unstable | |
1683 bnet.CPD{F} = tabular_CPD(bnet, F, 'CPT', [0.95 0.05]); % 1=intact, 2=defect | |
1684 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [2/7 5/7]); % 1=industrial, 2=household | |
1685 </pre> | |
1686 | |
1687 <p> | |
1688 The parameters of the continuous nodes can be specified as follows. | |
1689 <pre> | |
1690 bnet.CPD{E} = gaussian_CPD(bnet, E, 'mean', [-3.9 -0.4 -3.2 -0.5], ... | |
1691 'cov', [0.00002 0.0001 0.00002 0.0001]); | |
1692 bnet.CPD{D} = gaussian_CPD(bnet, D, 'mean', [6.5 6.0 7.5 7.0], ... | |
1693 'cov', [0.03 0.04 0.1 0.1], 'weights', [1 1 1 1]); | |
1694 bnet.CPD{C} = gaussian_CPD(bnet, C, 'mean', [-2 -1], 'cov', [0.1 0.3]); | |
1695 bnet.CPD{L} = gaussian_CPD(bnet, L, 'mean', 3, 'cov', 0.25, 'weights', -0.5); | |
1696 bnet.CPD{Min} = gaussian_CPD(bnet, Min, 'mean', [0.5 -0.5], 'cov', [0.01 0.005]); | |
1697 bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 'mean', 0, 'cov', 0.002, 'weights', [1 1]); | |
1698 </pre> | |
1699 | |
1700 | |
1701 <h3><a name="cg_infer">Inference</h3> | |
1702 | |
1703 <!--Let us perform inference in the <a href="#cg_model">waste incinerator example</a>.--> | |
1704 First we compute the unconditional marginals. | |
1705 <pre> | |
1706 engine = jtree_inf_engine(bnet); | |
1707 evidence = cell(1,n); | |
1708 [engine, ll] = enter_evidence(engine, evidence); | |
1709 marg = marginal_nodes(engine, E); | |
1710 </pre> | |
1711 <!--(Of course, we could use <tt>cond_gauss_inf_engine</tt> instead of jtree.)--> | |
1712 'marg' is a structure that contains the fields 'mu' and 'Sigma', which | |
1713 contain the mean and (co)variance of the marginal on E. | |
1714 In this case, they are both scalars. | |
1715 Let us check they match the published figures (to 2 decimal places). | |
1716 <!--(We can't expect | |
1717 more precision than this in general because I have implemented the algorithm of | |
1718 Lauritzen (1992), which can be numerically unstable.)--> | |
1719 <pre> | |
1720 tol = 1e-2; | |
1721 assert(approxeq(marg.mu, -3.25, tol)); | |
1722 assert(approxeq(sqrt(marg.Sigma), 0.709, tol)); | |
1723 </pre> | |
1724 We can compute the other posteriors similarly. | |
1725 Now let us add some evidence. | |
1726 <pre> | |
1727 evidence = cell(1,n); | |
1728 evidence{W} = 1; % industrial | |
1729 evidence{L} = 1.1; | |
1730 evidence{C} = -0.9; | |
1731 [engine, ll] = enter_evidence(engine, evidence); | |
1732 </pre> | |
1733 Now we find | |
1734 <pre> | |
1735 marg = marginal_nodes(engine, E); | |
1736 assert(approxeq(marg.mu, -3.8983, tol)); | |
1737 assert(approxeq(sqrt(marg.Sigma), 0.0763, tol)); | |
1738 </pre> | |
1739 | |
1740 | |
1741 We can also compute the joint probability on a set of nodes. | |
1742 For example, P(D, Mout | evidence) is a 2D Gaussian: | |
1743 <pre> | |
1744 marg = marginal_nodes(engine, [D Mout]) | |
1745 marg = | |
1746 domain: [6 8] | |
1747 mu: [2x1 double] | |
1748 Sigma: [2x2 double] | |
1749 T: 1.0000 | |
1750 </pre> | |
1751 The mean is | |
1752 <pre> | |
1753 marg.mu | |
1754 ans = | |
1755 3.6077 | |
1756 4.1077 | |
1757 </pre> | |
1758 and the covariance matrix is | |
1759 <pre> | |
1760 marg.Sigma | |
1761 ans = | |
1762 0.1062 0.1062 | |
1763 0.1062 0.1182 | |
1764 </pre> | |
1765 It is easy to visualize this posterior using standard Matlab plotting | |
1766 functions, e.g., | |
1767 <pre> | |
1768 gaussplot2d(marg.mu, marg.Sigma); | |
1769 </pre> | |
1770 produces the following picture. | |
1771 | |
1772 <p> | |
1773 <center> | |
1774 <IMG SRC="Figures/gaussplot.png"> | |
1775 </center> | |
1776 <p> | |
1777 | |
1778 | |
1779 The T field indicates that the mixing weight of this Gaussian | |
1780 component is 1.0. | |
1781 If the joint contains discrete and continuous variables, the result | |
1782 will be a mixture of Gaussians, e.g., | |
1783 <pre> | |
1784 marg = marginal_nodes(engine, [F E]) | |
1785 domain: [1 3] | |
1786 mu: [-3.9000 -0.4003] | |
1787 Sigma: [1x1x2 double] | |
1788 T: [0.9995 4.7373e-04] | |
1789 </pre> | |
1790 The interpretation is | |
1791 Sigma(i,j,k) = Cov[ E(i) E(j) | F=k ]. | |
1792 In this case, E is a scalar, so i=j=1; k specifies the mixture component. | |
1793 <p> | |
1794 We saw in the sprinkler network that BNT sets the effective size of | |
1795 observed discrete nodes to 1, since they only have one legal value. | |
1796 For continuous nodes, BNT sets their length to 0, | |
1797 since they have been reduced to a point. | |
1798 For example, | |
1799 <pre> | |
1800 marg = marginal_nodes(engine, [B C]) | |
1801 domain: [4 5] | |
1802 mu: [] | |
1803 Sigma: [] | |
1804 T: [0.0123 0.9877] | |
1805 </pre> | |
1806 It is simple to post-process the output of marginal_nodes. | |
1807 For example, the file BNT/examples/static/cg1 sets the mu term of | |
1808 observed nodes to their observed value, and the Sigma term to 0 (since | |
1809 observed nodes have no variance). | |
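A minimal sketch of this kind of post-processing (an illustration, not the actual cg1 code):
<pre>
m = marginal_nodes(engine, L);
if ~isempty(evidence{L})   % L was observed in this case
  m.mu = evidence{L};      % point mass at the observed value
  m.Sigma = 0;             % observed nodes have no variance
end
</pre>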
1810 | |
1811 <p> | |
1812 Note that the implemented version of the junction tree is numerically | |
1813 unstable when using CG potentials | |
1814 (which is why, in the example above, we only required our answers to agree with | |
1815 the published ones to 2dp.) | |
1816 This is why you might want to use <tt>stab_cond_gauss_inf_engine</tt>, | |
1817 implemented by Shan Huang. This is described in | |
1818 | |
1819 <ul> | |
1820 <li> "Stable Local Computation with Conditional Gaussian Distributions", | |
1821 S. Lauritzen and F. Jensen, Tech Report R-99-2014, | |
1822 Dept. Math. Sciences, Aalborg Univ., 1999. | |
1823 </ul> | |
1824 | |
1825 However, even the numerically stable version | |
1826 can be computationally intractable if there are many hidden discrete | |
1827 nodes, because the number of mixture components grows exponentially, e.g., in a | |
1828 <a href="usage_dbn.html#lds">switching linear dynamical system</a>. | |
1829 In general, one must resort to approximate inference techniques: see | |
1830 the discussion on <a href="#engines">inference engines</a> below. | |
1831 | |
1832 | |
1833 <h2><a name="hybrid">Other hybrid models</h2> | |
1834 | |
1835 When we have C->D arcs, where C is hidden, we need to use | |
1836 approximate inference. | |
1837 One approach (not implemented in BNT) is described in | |
1838 <ul> | |
1839 <li> <a | |
1840 href="http://www.cs.berkeley.edu/~murphyk/Papers/hybrid_uai99.ps.gz">A | |
1841 Variational Approximation for Bayesian Networks with | |
1842 Discrete and Continuous Latent Variables</a>, | |
1843 K. Murphy, UAI 99. | |
1844 </ul> | |
1845 Of course, one can always use <a href="#sampling">sampling</a> methods | |
1846 for approximate inference in such models. | |
1847 | |
1848 | |
1849 | |
1850 <h1><a name="param_learning">Parameter Learning</h1> | |
1851 | |
1852 The parameter estimation routines in BNT can be classified into 4 | |
1853 types, depending on whether the goal is to compute | |
1854 a full (Bayesian) posterior over the parameters or just a point | |
1855 estimate (e.g., Maximum Likelihood or Maximum A Posteriori), | |
1856 and whether all the variables are fully observed or there is missing | |
1857 data/hidden variables (partial observability). | |
1858 <p> | |
1859 | |
1860 <TABLE BORDER> | |
1861 <tr> | |
1862 <TH></TH> | |
1863 <th>Full obs</th> | |
1864 <th>Partial obs</th> | |
1865 </tr> | |
1866 <tr> | |
1867 <th>Point</th> | |
1868 <td><tt>learn_params</tt></td> | |
1869 <td><tt>learn_params_em</tt></td> | |
1870 </tr> | |
1871 <tr> | |
1872 <th>Bayes</th> | |
1873 <td><tt>bayes_update_params</tt></td> | |
1874 <td>not yet supported</td> | |
1875 </tr> | |
1876 </table> | |
1877 | |
1878 | |
1879 <h2><a name="load_data">Loading data from a file</h2> | |
1880 | |
1881 To load numeric data from an ASCII text file called 'dat.txt', where each row is a | |
1882 case and columns are separated by white-space, such as | |
1883 <pre> | |
1884 011979 1626.5 0.0 | |
1885 021979 1367.0 0.0 | |
1886 ... | |
1887 </pre> | |
1888 you can use | |
1889 <pre> | |
1890 data = load('dat.txt'); | |
1891 </pre> | |
1892 or | |
1893 <pre> | |
1894 load dat.txt -ascii | |
1895 </pre> | |
1896 In the latter case, the data is stored in a variable called 'dat' (the | |
1897 filename minus the extension). | |
1898 Alternatively, suppose the data is stored in a .csv file (has commas | |
1899 separating the columns, and contains a header line), such as | |
1900 <pre> | |
1901 header info goes here | |
1902 ORD,011979,1626.5,0.0 | |
1903 DSM,021979,1367.0,0.0 | |
1904 ... | |
1905 </pre> | |
1906 You can load this using | |
1907 <pre> | |
1908 [a,b,c,d] = textread('dat.txt', '%s %d %f %f', 'delimiter', ',', 'headerlines', 1); | |
1909 </pre> | |
1910 If your file is not in either of these formats, you can either use Perl to convert | |
1911 it to this format, or use Matlab's lower-level file I/O commands such as fscanf. | |
1912 Type | |
1913 <tt> | |
1914 help iofun | |
1915 </tt> | |
1916 for more information on Matlab's file functions. | |
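<p>
For instance, here is a minimal sketch that uses fscanf to read a whitespace-separated
numeric file, assuming (for illustration) that each line contains 3 numbers:
<pre>
fid = fopen('dat.txt', 'r');
x = fscanf(fid, '%f');       % read every number in the file into one long vector
fclose(fid);
data = reshape(x, 3, [])';   % 3 numbers per line => 3 columns, one row per case
</pre>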
1917 <!-- | |
1918 <p> | |
1919 To load data directly from Excel, | |
1920 you should buy the | |
1921 <a href="http://www.mathworks.com/products/excellink/">Excel Link</a>. | |
1922 To load data directly from a relational database, | |
1923 you should buy the | |
1924 <a href="http://www.mathworks.com/products/database">Database | |
1925 toolbox</a>. | |
1926 --> | |
1927 <p> | |
1928 BNT learning routines require data to be stored in a cell array. | |
1929 data{i,m} is the value of node i in case (example) m, i.e., each | |
1930 <em>column</em> is a case. | |
1931 If node i is not observed in case m (missing value), set | |
1932 data{i,m} = []. | |
1933 (Not all the learning routines can cope with such missing values, however.) | |
1934 In the special case that all the nodes are observed and are | |
1935 scalar-valued (as opposed to vector-valued), the data can be | |
1936 stored in a matrix (as opposed to a cell-array). | |
1937 <p> | |
1938 Suppose, as in the <a href="#mixexp">mixture of experts example</a>, | |
1939 that we have 3 nodes in the graph: X(1) is the observed input, X(3) is | |
1940 the observed output, and X(2) is a hidden (gating) node. We can | |
1941 create the dataset as follows. | |
1942 <pre> | |
1943 data = load('dat.txt'); | |
1944 ncases = size(data, 1); | |
1945 cases = cell(3, ncases); | |
1946 cases([1 3], :) = num2cell(data'); | |
1947 </pre> | |
1948 Notice how we transposed the data, to convert rows into columns. | |
1949 Also, cases{2,m} = [] for all m, since X(2) is always hidden. | |
1950 | |
1951 | |
1952 <h2><a name="mle_complete">Maximum likelihood parameter estimation from complete data</h2> | |
1953 | |
1954 As an example, let's generate some data from the sprinkler network, randomize the parameters, | |
1955 and then try to recover the original model. | |
1956 First we create some training data using forwards sampling. | |
1957 <pre> | |
1958 samples = cell(N, nsamples); | |
1959 for i=1:nsamples | |
1960 samples(:,i) = sample_bnet(bnet); | |
1961 end | |
1962 </pre> | |
1963 samples{j,i} contains the value of the j'th node in case i. | |
1964 sample_bnet returns a cell array because, in general, each node might | |
1965 be a vector of different length. | |
1966 In this case, all nodes are discrete (and hence scalars), so we | |
1967 could have used a regular array instead (which can be quicker): | |
1968 <pre> | |
1969 data = cell2num(samples); | |
1970 </pre> | |
1971 So now data(j,i) = samples{j,i}. | |
1972 <p> | |
1973 Now we create a network with random parameters. | |
1974 (The initial values of bnet2 don't matter in this case, since we can find the | |
1975 globally optimal MLE independent of where we start.) | |
1976 <pre> | |
1977 % Make a tabula rasa | |
1978 bnet2 = mk_bnet(dag, node_sizes); | |
1979 seed = 0; | |
1980 rand('state', seed); | |
1981 bnet2.CPD{C} = tabular_CPD(bnet2, C); | |
1982 bnet2.CPD{R} = tabular_CPD(bnet2, R); | |
1983 bnet2.CPD{S} = tabular_CPD(bnet2, S); | |
1984 bnet2.CPD{W} = tabular_CPD(bnet2, W); | |
1985 </pre> | |
1986 Finally, we find the maximum likelihood estimates of the parameters. | |
1987 <pre> | |
1988 bnet3 = learn_params(bnet2, samples); | |
1989 </pre> | |
1990 To view the learned parameters, we use a little Matlab hackery. | |
1991 <pre> | |
1992 CPT3 = cell(1,N); | |
1993 for i=1:N | |
1994 s=struct(bnet3.CPD{i}); % violate object privacy | |
1995 CPT3{i}=s.CPT; | |
1996 end | |
1997 </pre> | |
1998 Here are the parameters learned for node 4. | |
1999 <pre> | |
2000 dispcpt(CPT3{4}) | |
2001 1 1 : 1.0000 0.0000 | |
2002 2 1 : 0.2000 0.8000 | |
2003 1 2 : 0.2273 0.7727 | |
2004 2 2 : 0.0000 1.0000 | |
2005 </pre> | |
2006 So we see that the learned parameters are fairly close to the "true" | |
2007 ones, which we display below. | |
2008 <pre> | |
2009 dispcpt(CPT{4}) | |
2010 1 1 : 1.0000 0.0000 | |
2011 2 1 : 0.1000 0.9000 | |
2012 1 2 : 0.1000 0.9000 | |
2013 2 2 : 0.0100 0.9900 | |
2014 </pre> | |
2015 We can get better results by using a larger training set, or using | |
2016 informative priors (see <a href="#prior">below</a>). | |
2017 | |
2018 | |
2019 | |
2020 <h2><a name="prior">Parameter priors</h2> | |
2021 | |
2022 Currently, only tabular CPDs can have priors on their parameters. | |
2023 The conjugate prior for a multinomial is the Dirichlet. | |
2024 (For binary random variables, the multinomial is the same as the | |
2025 Bernoulli, and the Dirichlet is the same as the Beta.) | |
2026 <p> | |
2027 The Dirichlet has a simple interpretation in terms of pseudo counts. | |
2028 If we let N_ijk = the num. times X_i=k and Pa_i=j occurs in the | |
2029 training set, where Pa_i are the parents of X_i, | |
2030 then the maximum likelihood (ML) estimate is | |
2031 T_ijk = N_ijk / N_ij (where N_ij = sum_k' N_ijk'), which will be 0 if N_ijk=0. | |
2032 To prevent us from declaring that (X_i=k, Pa_i=j) is impossible just because this | |
2033 event was not seen in the training set, | |
2034 we can pretend we saw value k of X_i, for each value j of Pa_i some number (alpha_ijk) | |
2035 of times in the past. | |
2036 The MAP (maximum a posterior) estimate is then | |
2037 <pre> | |
2038 T_ijk = (N_ijk + alpha_ijk) / (N_ij + alpha_ij) | |
2039 </pre> | |
2040 and is never 0 if all alpha_ijk > 0. | |
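As a tiny numerical illustration: if X_i is binary and, for some parent configuration j,
the data contains N_ij1 = 0 and N_ij2 = 3, then the ML estimate is T_ij1 = 0, T_ij2 = 1;
with pseudo counts alpha_ij1 = alpha_ij2 = 1, the MAP estimate becomes
T_ij1 = (0+1)/(3+2) = 0.2 and T_ij2 = (3+1)/(3+2) = 0.8.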
2041 For example, consider the network A->B, where A is binary and B has 3 | |
2042 values. | |
2043 A uniform prior for B has the form | |
2044 <pre> | |
2045 B=1 B=2 B=3 | |
2046 A=1 1 1 1 | |
2047 A=2 1 1 1 | |
2048 </pre> | |
2049 which can be created using | |
2050 <pre> | |
2051 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'unif'); | |
2052 </pre> | |
2053 This prior does not satisfy the likelihood equivalence principle, | |
2054 which says that <a href="#markov_equiv">Markov equivalent</a> models | |
2055 should have the same marginal likelihood. | |
2056 A prior that does satisfy this principle is shown below. | |
2057 Heckerman (1995) calls this the | |
2058 BDeu prior (likelihood equivalent uniform Bayesian Dirichlet). | |
2059 <pre> | |
2060 B=1 B=2 B=3 | |
2061 A=1 1/6 1/6 1/6 | |
2062 A=2 1/6 1/6 1/6 | |
2063 </pre> | |
2064 where we put N/(q*r) in each bin; N is the equivalent sample size, | |
2065 q = |A| is the number of parent configurations, and r = |B| is the number of states of B. | |
2066 This can be created as follows | |
2067 <pre> | |
2068 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu'); | |
2069 </pre> | |
2070 Here, 1 is the equivalent sample size, and is the strength of the | |
2071 prior. | |
2072 You can change this using | |
2073 <pre> | |
2074 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', ... | |
2075 'BDeu', 'dirichlet_weight', 10); | |
2076 </pre> | |
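For instance, to attach a BDeu prior to every node of the sprinkler network before
learning (a minimal sketch, reusing bnet2 from the example above), one could write
<pre>
for i=1:N
  bnet2.CPD{i} = tabular_CPD(bnet2, i, 'prior_type', 'dirichlet', ...
                             'dirichlet_type', 'BDeu', 'dirichlet_weight', 1);
end
</pre>
These pseudo counts are then used by the Bayesian updating routines described in the
next section.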
2077 <!--where counts is an array of pseudo-counts of the same size as the | |
2078 CPT.--> | |
2079 <!-- | |
2080 <p> | |
2081 When you specify a prior, you should set row i of the CPT to the | |
2082 normalized version of row i of the pseudo-count matrix, i.e., to the | |
2083 expected values of the parameters. This will ensure that computing the | |
2084 marginal likelihood sequentially (see <a | |
2085 href="#bayes_learn">below</a>) and in batch form gives the same | |
2086 results. | |
2087 To do this, proceed as follows. | |
2088 <pre> | |
2089 tabular_CPD(bnet, i, 'prior', counts, 'CPT', mk_stochastic(counts)); | |
2090 </pre> | |
2091 For a non-informative prior, you can just write | |
2092 <pre> | |
2093 tabular_CPD(bnet, i, 'prior', 'unif', 'CPT', 'unif'); | |
2094 </pre> | |
2095 --> | |
2096 | |
2097 | |
2098 <h2><a name="bayes_learn">(Sequential) Bayesian parameter updating from complete data</h2> | |
2099 | |
2100 If we use conjugate priors and have fully observed data, we can | |
2101 compute the posterior over the parameters in batch form as follows. | |
2102 <pre> | |
2103 cases = sample_bnet(bnet, nsamples); | |
2104 bnet = bayes_update_params(bnet, cases); | |
2105 LL = log_marg_lik_complete(bnet, cases); | |
2106 </pre> | |
2107 bnet.CPD{i}.prior contains the new Dirichlet pseudocounts, | |
2108 and bnet.CPD{i}.CPT is set to the mean of the posterior (the | |
2109 normalized counts). | |
2110 (Hence if the initial pseudo counts are 0, | |
2111 <tt>bayes_update_params</tt> and <tt>learn_params</tt> will give the | |
2112 same result.) | |
2113 | |
2114 | |
2115 | |
2116 | |
2117 <p> | |
2118 We can compute the same result sequentially (on-line) as follows. | |
2119 <pre> | |
2120 LL = 0; | |
2121 for m=1:nsamples | |
2122 LL = LL + log_marg_lik_complete(bnet, cases(:,m)); | |
2123 bnet = bayes_update_params(bnet, cases(:,m)); | |
2124 end | |
2125 </pre> | |
2126 | |
2127 The file <tt>BNT/examples/static/StructLearn/model_select1</tt> has an example of | |
2128 sequential model selection which uses the same idea. | |
2129 We generate data from the model A->B | |
2130 and compute the posterior prob of all 3 dags on 2 nodes: | |
2131 (1) A B, (2) A <- B , (3) A -> B | |
2132 Models 2 and 3 are <a href="#markov_equiv">Markov equivalent</a>, and therefore indistinguishable from | |
2133 observational data alone, so we expect their posteriors to be the same | |
2134 (assuming a prior which satisfies likelihood equivalence). | |
2135 If we use random parameters, the "true" model only gets a higher posterior after 2000 trials! | |
2136 However, if we make B a noisy NOT gate, the true model "wins" after 12 | |
2137 trials, as shown below (red = model 1, blue/green (superimposed) | |
2138 represents models 2/3). | |
2139 <p> | |
2140 <img src="Figures/model_select.png"> | |
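<p>
A rough sketch of how such a sequential comparison could be coded is shown below.
(This is not the actual model_select1 code; it assumes 'cases' is a 2 x nsamples cell
array of complete samples drawn from the true model A->B.)
<pre>
dags = {zeros(2,2), [0 0;1 0], [0 1;0 0]};  % (1) A B, (2) B->A, (3) A->B
ns = [2 2];
LL = zeros(1,3);
for g=1:3
  bn = mk_bnet(dags{g}, ns);
  for i=1:2
    bn.CPD{i} = tabular_CPD(bn, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu');
  end
  for m=1:nsamples
    LL(g) = LL(g) + log_marg_lik_complete(bn, cases(:,m));
    bn = bayes_update_params(bn, cases(:,m));
  end
end
post = normalise(exp(LL));   % posterior over the 3 structures after nsamples cases
</pre>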
2141 <p> | |
2142 The use of marginal likelihood for model selection is discussed in | |
2143 greater detail in the | |
2144 section on <a href="#structure_learning">structure learning</a>. | |
2145 | |
2146 | |
2147 | |
2148 | |
2149 <h2><a name="em">Maximum likelihood parameter estimation with missing values</h2> | |
2150 | |
2151 Now we consider learning when some values are not observed. | |
2152 Let us randomly hide half the values generated from the water | |
2153 sprinkler example. | |
2154 <pre> | |
2155 samples2 = samples; | |
2156 hide = rand(N, nsamples) > 0.5; | |
2157 [I,J]=find(hide); | |
2158 for k=1:length(I) | |
2159 samples2{I(k), J(k)} = []; | |
2160 end | |
2161 </pre> | |
2162 samples2{i,l} is the value of node i in training case l, or [] if unobserved. | |
2163 <p> | |
2164 Now we will compute the MLEs using the EM algorithm. | |
2165 We need to use an inference algorithm to compute the expected | |
2166 sufficient statistics in the E step; the M (maximization) step is as | |
2167 above. | |
2168 <pre> | |
2169 engine2 = jtree_inf_engine(bnet2); | |
2170 max_iter = 10; | |
2171 [bnet4, LLtrace] = learn_params_em(engine2, samples2, max_iter); | |
2172 </pre> | |
2173 LLtrace(i) is the log-likelihood at iteration i. We can plot this as | |
2174 follows: | |
2175 <pre> | |
2176 plot(LLtrace, 'x-') | |
2177 </pre> | |
2178 Let's display the results after 10 iterations of EM. (Here CPT4 is extracted from bnet4 using the same struct trick we used above for CPT3.) | |
2179 <pre> | |
2180 celldisp(CPT4) | |
2181 CPT4{1} = | |
2182 0.6616 | |
2183 0.3384 | |
2184 CPT4{2} = | |
2185 0.6510 0.3490 | |
2186 0.8751 0.1249 | |
2187 CPT4{3} = | |
2188 0.8366 0.1634 | |
2189 0.0197 0.9803 | |
2190 CPT4{4} = | |
2191 (:,:,1) = | |
2192 0.8276 0.0546 | |
2193 0.5452 0.1658 | |
2194 (:,:,2) = | |
2195 0.1724 0.9454 | |
2196 0.4548 0.8342 | |
2197 </pre> | |
2198 We can get improved performance by using one or more of the following | |
2199 methods: | |
2200 <ul> | |
2201 <li> Increasing the size of the training set. | |
2202 <li> Decreasing the amount of hidden data. | |
2203 <li> Running EM for longer. | |
2204 <li> Using informative priors. | |
2205 <li> Initialising EM from multiple starting points (see the sketch below). | |
2206 </ul> | |
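As an illustration of the last point, here is a minimal sketch of EM with random
restarts (assuming dag, node_sizes, N, samples2 and max_iter as defined above),
which keeps the run with the highest final log-likelihood:
<pre>
best_LL = -inf;
for r=1:5
  rand('state', r);                      % different random initialisation each run
  bnet_r = mk_bnet(dag, node_sizes);
  for i=1:N, bnet_r.CPD{i} = tabular_CPD(bnet_r, i); end
  engine_r = jtree_inf_engine(bnet_r);
  [bnet_r, LLtrace_r] = learn_params_em(engine_r, samples2, max_iter);
  if LLtrace_r(end) > best_LL
    best_LL = LLtrace_r(end);
    bnet_best = bnet_r;
  end
end
</pre>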
2207 | |
2208 Click <a href="#gaussian">here</a> for a discussion of learning | |
2209 Gaussians, which can cause numerical problems. | |
2210 | |
2211 <h2><a name="tying">Parameter tying</h2> | |
2212 | |
2213 In networks with repeated structure (e.g., chains and grids), it is | |
2214 common to assume that the parameters are the same at every node. This | |
2215 is called parameter tying, and reduces the amount of data needed for | |
2216 learning. | |
2217 <p> | |
2218 When we have tied parameters, there is no longer a one-to-one | |
2219 correspondence between nodes and CPDs. | |
2220 Rather, each CPD specifies the parameters for a whole equivalence class | |
2221 of nodes. | |
2222 It is easiest to see this by example. | |
2223 Consider the following <a href="usage_dbn.html#hmm">hidden Markov | |
2224 model (HMM)</a> | |
2225 <p> | |
2226 <img src="Figures/hmm3.gif"> | |
2227 <p> | |
2228 <!-- | |
2229 We can create this graph structure, assuming we have T time-slices, | |
2230 as follows. | |
2231 (We number the nodes as shown in the figure, but we could equally well | |
2232 number the hidden nodes 1:T, and the observed nodes T+1:2T.) | |
2233 <pre> | |
2234 N = 2*T; | |
2235 dag = zeros(N); | |
2236 hnodes = 1:2:2*T; | |
2237 for i=1:T-1 | |
2238 dag(hnodes(i), hnodes(i+1))=1; | |
2239 end | |
2240 onodes = 2:2:2*T; | |
2241 for i=1:T | |
2242 dag(hnodes(i), onodes(i)) = 1; | |
2243 end | |
2244 </pre> | |
2245 <p> | |
2246 The hidden nodes are always discrete, and have Q possible values each, | |
2247 but the observed nodes can be discrete or continuous, and have O possible values/length. | |
2248 <pre> | |
2249 if cts_obs | |
2250 dnodes = hnodes; | |
2251 else | |
2252 dnodes = 1:N; | |
2253 end | |
2254 ns = ones(1,N); | |
2255 ns(hnodes) = Q; | |
2256 ns(onodes) = O; | |
2257 </pre> | |
2258 --> | |
2259 When HMMs are used for semi-infinite processes like speech recognition, | |
2260 we assume the transition matrix | |
2261 P(H(t+1)|H(t)) is the same for all t; this is called a time-invariant | |
2262 or homogeneous Markov chain. | |
2263 Hence hidden nodes 2, 3, ..., T | |
2264 are all in the same equivalence class, say class Hclass. | |
2265 Similarly, the observation matrix P(O(t)|H(t)) is assumed to be the | |
2266 same for all t, so the observed nodes are all in the same equivalence | |
2267 class, say class Oclass. | |
2268 Finally, the prior term P(H(1)) is in a class all by itself, say class | |
2269 H1class. | |
2270 This is illustrated below, where we explicitly represent the | |
2271 parameters as random variables (dotted nodes). | |
2272 <p> | |
2273 <img src="Figures/hmm4_params.gif"> | |
2274 <p> | |
2275 In BNT, we cannot represent parameters as random variables (nodes). | |
2276 Instead, we "hide" the | |
2277 parameters inside one CPD for each equivalence class, | |
2278 and then specify that the other CPDs should share these parameters, as | |
2279 follows. | |
2280 <pre> | |
2281 hnodes = 1:2:2*T; | |
2282 onodes = 2:2:2*T; | |
2283 H1class = 1; Hclass = 2; Oclass = 3; | |
2284 eclass = ones(1,N); | |
2285 eclass(hnodes(2:end)) = Hclass; | |
2286 eclass(hnodes(1)) = H1class; | |
2287 eclass(onodes) = Oclass; | |
2288 % create dag and ns in the usual way | |
2289 bnet = mk_bnet(dag, ns, 'discrete', dnodes, 'equiv_class', eclass); | |
2290 </pre> | |
2291 Finally, we define the parameters for each equivalence class: | |
2292 <pre> | |
2293 bnet.CPD{H1class} = tabular_CPD(bnet, hnodes(1)); % prior | |
2294 bnet.CPD{Hclass} = tabular_CPD(bnet, hnodes(2)); % transition matrix | |
2295 if cts_obs | |
2296 bnet.CPD{Oclass} = gaussian_CPD(bnet, onodes(1)); | |
2297 else | |
2298 bnet.CPD{Oclass} = tabular_CPD(bnet, onodes(1)); | |
2299 end | |
2300 </pre> | |
2301 In general, if bnet.CPD{e} = xxx_CPD(bnet, j), then j should be a | |
2302 member of e's equivalence class; that is, it is not always the case | |
2303 that e == j. You can use bnet.rep_of_eclass(e) to return the | |
2304 representative of equivalence class e. | |
2305 BNT will look up the parents of j to determine the size | |
2306 of the CPT to use. It assumes that this is the same for all members of | |
2307 the equivalence class. | |
2308 Click <a href="http://www.cs.berkeley.edu/~murphyk/Bayes/param_tieing.html">here</a> for | |
2309 a more complex example of parameter tying. | |
2310 <p> | |
2311 Note: | |
2312 Normally one would define an HMM as a | |
2313 <a href = "usage_dbn.html">Dynamic Bayes Net</a> | |
2314 (see the function BNT/examples/dynamic/mk_chmm.m). | |
2315 However, one can define an HMM as a static BN using the function | |
2316 BNT/examples/static/Models/mk_hmm_bnet.m. | |
2317 | |
2318 | |
2319 | |
2320 <h1><a name="structure_learning">Structure learning</h1> | |
2321 | |
2322 Update (9/29/03): | |
2323 Philippe Leray is developing some additional structure learning code | |
2324 on top of BNT. Click <a | |
2325 href="http://bnt.insa-rouen.fr/ajouts.html">here</a> | |
2326 for details. | |
2327 | |
2328 <p> | |
2329 | |
2330 There are two very different approaches to structure learning: | |
2331 constraint-based and search-and-score. | |
2332 In the <a href="#constraint">constraint-based approach</a>, | |
2333 we start with a fully connected graph, and remove edges if certain | |
2334 conditional independencies are measured in the data. | |
2335 This has the disadvantage that repeated independence tests lose | |
2336 statistical power. | |
2337 <p> | |
2338 In the more popular search-and-score approach, | |
2339 we perform a search through the space of possible DAGs, and either | |
2340 return the best one found (a point estimate), or return a sample of the | |
2341 models found (an approximation to the Bayesian posterior). | |
2342 <p> | |
2343 Unfortunately, the number of DAGs as a function of the number of | |
2344 nodes, G(n), is super-exponential in n. | |
2345 A closed form formula for G(n) is not known, but the first few values | |
2346 are shown below (from Cooper, 1999). | |
2347 | |
2348 <table> | |
2349 <tr> <th>n</th> <th align=left>G(n)</th> </tr> | |
2350 <tr> <td>1</td> <td>1</td> </tr> | |
2351 <tr> <td>2</td> <td>3</td> </tr> | |
2352 <tr> <td>3</td> <td>25</td> </tr> | |
2353 <tr> <td>4</td> <td>543</td> </tr> | |
2354 <tr> <td>5</td> <td>29,281</td> </tr> | |
2355 <tr> <td>6</td> <td>3,781,503</td> </tr> | |
2356 <tr> <td>7</td> <td>1.1 x 10^9</td> </tr> | |
2357 <tr> <td>8</td> <td>7.8 x 10^11</td> </tr> | |
2358 <tr> <td>9</td> <td>1.2 x 10^15</td> </tr> | |
2359 <tr> <td>10</td> <td>4.2 x 10^18</td> </tr> | |
2360 </table> | |
2361 | |
2362 Since the number of DAGs is super-exponential in the number of nodes, | |
2363 we cannot exhaustively search the space, so we either use a local | |
2364 search algorithm (e.g., greedy hill climbing, perhaps with multiple | |
2365 restarts) or a global search algorithm (e.g., Markov Chain Monte | |
2366 Carlo). | |
2367 <p> | |
2368 If we know a total ordering on the nodes, | |
2369 finding the best structure amounts to picking the best set of parents | |
2370 for each node independently. | |
2371 This is what the K2 algorithm does. | |
2372 If the ordering is unknown, we can search over orderings, | |
2373 which is more efficient than searching over DAGs (Koller and Friedman, 2000). | |
2374 <p> | |
2375 In addition to the search procedure, we must specify the scoring | |
2376 function. There are two popular choices. The Bayesian score integrates | |
2377 out the parameters, i.e., it is the marginal likelihood of the model. | |
2378 The BIC (Bayesian Information Criterion) is defined as | |
2379 log P(D|theta_hat) - 0.5*d*log(N), where D is the data, theta_hat is | |
2380 the ML estimate of the parameters, d is the number of parameters, and | |
2381 N is the number of data cases. | |
2382 The BIC method has the advantage of not requiring a prior. | |
2383 <p> | |
2384 BIC can be derived as a large sample | |
2385 approximation to the marginal likelihood. | |
2386 (It is also equal to the Minimum Description Length of a model.) | |
2387 However, in practice, the sample size does not need to be very large | |
2388 for the approximation to be good. | |
2389 For example, in the figure below, we plot the ratio between the log marginal likelihood | |
2390 and the BIC score against data-set size; we see that the ratio rapidly | |
2391 approaches 1, especially for non-informative priors. | |
2392 (This plot was generated by the file BNT/examples/static/bic1.m. It | |
2393 uses the water sprinkler BN with BDeu Dirichlet priors with different | |
2394 equivalent sample sizes.) | |
2395 | |
2396 <p> | |
2397 <center> | |
2398 <IMG SRC="Figures/bic.png"> | |
2399 </center> | |
2400 <p> | |
2401 | |
2402 <p> | |
2403 As with parameter learning, handling missing data/ hidden variables is | |
2404 much harder than the fully observed case. | |
2405 The structure learning routines in BNT can therefore be classified into 4 | |
2406 types, analogously to the parameter learning case. | |
2407 <p> | |
2408 | |
2409 <TABLE BORDER> | |
2410 <tr> | |
2411 <TH></TH> | |
2412 <th>Full obs</th> | |
2413 <th>Partial obs</th> | |
2414 </tr> | |
2415 <tr> | |
2416 <th>Point</th> | |
2417 <td><tt>learn_struct_K2</tt> <br> | |
2418 <!-- <tt>learn_struct_hill_climb</tt></td> --> | |
2419 <td><tt>not yet supported</tt></td> | |
2420 </tr> | |
2421 <tr> | |
2422 <th>Bayes</th> | |
2423 <td><tt>learn_struct_mcmc</tt></td> | |
2424 <td>not yet supported</td> | |
2425 </tr> | |
2426 </table> | |
2427 | |
2428 | |
2429 <h2><a name="markov_equiv">Markov equivalence</h2> | |
2430 | |
2431 If two DAGs encode the same conditional independencies, they are | |
2432 called Markov equivalent. The set of all DAGs can be partitioned into | |
2433 Markov equivalence classes. Graphs within the same class can | |
2434 have | |
2435 the direction of some of their arcs reversed without changing any of | |
2436 the CI relationships. | |
2437 Each class can be represented by a PDAG | |
2438 (partially directed acyclic graph) called an essential graph or | |
2439 pattern. This specifies which edges must be oriented in a certain | |
2440 direction, and which may be reversed. | |
2441 | |
2442 <p> | |
2443 When learning graph structure from observational data, | |
2444 the best one can hope to do is to identify the model up to Markov | |
2445 equivalence. To distinguish amongst graphs within the same equivalence | |
2446 class, one needs interventional data: see the discussion on <a | |
2447 href="#active">active learning</a> below. | |
2448 | |
2449 | |
2450 | |
2451 <h2><a name="enumerate">Exhaustive search</h2> | |
2452 | |
2453 The brute-force approach to structure learning is to enumerate all | |
2454 possible DAGs, and score each one. This provides a "gold standard" | |
2455 with which to compare other algorithms. We can do this as follows. | |
2456 <pre> | |
2457 dags = mk_all_dags(N); | |
2458 score = score_dags(data, ns, dags); | |
2459 </pre> | |
2460 where data(i,m) is the value of node i in case m, | |
2461 and ns(i) is the size of node i. | |
2462 If the DAGs have a lot of families in common, we can cache the sufficient statistics, | |
2463 making this potentially more efficient than scoring the DAGs one at a time. | |
2464 (Caching is not currently implemented, however.) | |
2465 <p> | |
2466 By default, we use the Bayesian scoring metric, and assume CPDs are | |
2467 represented by tables with BDeu(1) priors. | |
2468 We can override these defaults as follows. | |
2469 If we want to use uniform priors, we can say | |
2470 <pre> | |
2471 params = cell(1,N); | |
2472 for i=1:N | |
2473 params{i} = {'prior', 'unif'}; | |
2474 end | |
2475 score = score_dags(data, ns, dags, 'params', params); | |
2476 </pre> | |
2477 params{i} is a cell-array, containing optional arguments that are | |
2478 passed to the constructor for CPD i. | |
2479 <p> | |
2480 Now suppose we want to use different node types: say nodes 1 and 2 | |
2481 are Gaussian, and nodes 3 and 4 softmax (both | |
2482 these CPDs can support discrete and continuous parents, which is | |
2483 necessary since all other nodes will be considered as parents). | |
2484 The Bayesian scoring metric currently only works for tabular CPDs, so | |
2485 we will use BIC: | |
2486 <pre> | |
2487 score = score_dags(data, ns, dags, 'discrete', [3 4], 'params', [], ... | |
2488 'type', {'gaussian', 'gaussian', 'softmax', 'softmax'}, 'scoring_fn', 'bic') | |
2489 </pre> | |
2490 In practice, one can't enumerate all possible DAGs for N > 5, | |
2491 but one can evaluate any reasonably-sized set of hypotheses in this | |
2492 way (e.g., nearest neighbors of your current best guess). | |
2493 Think of this as "computer assisted model refinement" as opposed to de | |
2494 novo learning. | |
2495 | |
2496 | |
2497 <h2><a name="K2">K2</h2> | |
2498 | |
2499 The K2 algorithm (Cooper and Herskovits, 1992) is a greedy search algorithm that works as follows. | |
2500 Initially each node has no parents. It then adds incrementally that parent whose addition most | |
2501 increases the score of the resulting structure. When the addition of no single | |
2502 parent can increase the score, it stops adding parents to the node. | |
2503 Since we are using a fixed ordering, we do not need to check for | |
2504 cycles, and can choose the parents for each node independently. | |
2505 <p> | |
2506 The original paper used the Bayesian scoring | |
2507 metric with tabular CPDs and Dirichlet priors. | |
2508 BNT generalizes this to allow any kind of CPD, and either the Bayesian | |
2509 scoring metric or BIC, as in the example <a href="#enumerate">above</a>. | |
2510 In addition, you can specify | |
2511 an optional upper bound on the number of parents for each node. | |
2512 The file BNT/examples/static/k2demo1.m gives an example of how to use K2. | |
2513 We use the water sprinkler network and sample 100 cases from it as before. | |
2514 Then we see how much data it takes to recover the generating structure: | |
2515 <pre> | |
2516 order = [C S R W]; | |
2517 max_fan_in = 2; | |
2518 sz = 5:5:100; | |
2519 for i=1:length(sz) | |
2520 dag2 = learn_struct_K2(data(:,1:sz(i)), node_sizes, order, 'max_fan_in', max_fan_in); | |
2521 correct(i) = isequal(dag, dag2); | |
2522 end | |
2523 </pre> | |
2524 Here are the results. | |
2525 <pre> | |
2526 correct = | |
2527 Columns 1 through 12 | |
2528 0 0 0 0 0 0 0 1 0 1 1 1 | |
2529 Columns 13 through 20 | |
2530 1 1 1 1 1 1 1 1 | |
2531 </pre> | |
2532 So we see it takes about sz(10)=50 cases. (BIC behaves similarly, | |
2533 showing that the prior doesn't matter too much.) | |
2534 In general, we cannot hope to recover the "true" generating structure, | |
2535 only one that is in its <a href="#markov_equiv">Markov equivalence | |
2536 class</a>. | |
2537 | |
2538 | |
2539 <h2><a name="hill_climb">Hill-climbing</h2> | |
2540 | |
2541 Hill-climbing starts at a specific point in space, | |
2542 considers all nearest neighbors, and moves to the neighbor | |
2543 that has the highest score; if no neighbors have higher | |
2544 score than the current point (i.e., we have reached a local maximum), | |
2545 the algorithm stops. One can then restart in another part of the space. | |
2546 <p> | |
2547 A common definition of "neighbor" is all graphs that can be | |
2548 generated from the current graph by adding, deleting or reversing a | |
2549 single arc, subject to the acyclicity constraint. | |
2550 Other neighborhoods are possible: see | |
2551 <a href="http://research.microsoft.com/~dmax/publications/jmlr02.pdf"> | |
2552 Optimal Structure Identification with Greedy Search</a>, Max | |
2553 Chickering, JMLR 2002. | |
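<p>
The table above does not list a general hill-climbing routine, but the idea is easy to
sketch in terms of score_dags (reusing data and ns from the exhaustive-search example),
assuming a hypothetical helper mk_nbrs_of_dag(dag) that returns a cell array of all
acyclic graphs reachable by one arc addition, deletion or reversal:
<pre>
dag = zeros(N,N);                        % start from the empty graph
best = score_dags(data, ns, {dag});
done = 0;
while ~done
  nbrs = mk_nbrs_of_dag(dag);            % hypothetical neighbor generator
  scores = score_dags(data, ns, nbrs);
  [m, k] = max(scores);
  if m > best
    best = m; dag = nbrs{k};             % move to the best-scoring neighbor
  else
    done = 1;                            % local maximum reached
  end
end
</pre>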
2554 | |
2555 <!-- | |
2556 Note: This algorithm is currently (Feb '02) being implemented by Qian | |
2557 Diao. | |
2558 --> | |
2559 | |
2560 | |
2561 <h2><a name="mcmc">MCMC</h2> | |
2562 | |
2563 We can use a Markov Chain Monte Carlo (MCMC) algorithm called | |
2564 Metropolis-Hastings (MH) to search the space of all | |
2565 DAGs. | |
2566 The standard proposal distribution is to consider moving to all | |
2567 nearest neighbors in the sense defined <a href="#hill_climb">above</a>. | |
2568 <p> | |
2569 The function can be called | |
2570 as in the following example. | |
2571 <pre> | |
2572 [sampled_graphs, accept_ratio] = learn_struct_mcmc(data, ns, 'nsamples', 100, 'burnin', 10); | |
2573 </pre> | |
2574 We can convert our set of sampled graphs to a histogram | |
2575 (empirical posterior over all the DAGs) thus | |
2576 <pre> | |
2577 all_dags = mk_all_dags(N); | |
2578 mcmc_post = mcmc_sample_to_hist(sampled_graphs, all_dags); | |
2579 </pre> | |
2580 To see how well this performs, let us compute the exact posterior exhaustively. | |
2581 <p> | |
2582 <pre> | |
2583 score = score_dags(data, ns, all_dags); | |
2584 post = normalise(exp(score)); % assuming uniform structural prior | |
2585 </pre> | |
2586 We plot the results below. | |
2587 (The data set was 100 samples drawn from a random 4 node bnet; see the | |
2588 file BNT/examples/static/mcmc1.) | |
2589 <pre> | |
2590 subplot(2,1,1) | |
2591 bar(post) | |
2592 subplot(2,1,2) | |
2593 bar(mcmc_post) | |
2594 </pre> | |
2595 <img src="Figures/mcmc_post.jpg" width="800" height="500"> | |
2596 <p> | |
2597 We can also plot the acceptance ratio versus number of MCMC steps, | |
2598 as a crude convergence diagnostic. | |
2599 <pre> | |
2600 clf | |
2601 plot(accept_ratio) | |
2602 </pre> | |
2603 <img src="Figures/mcmc_accept.jpg" width="800" height="300"> | |
2604 <p> | |
2605 Even though the number of samples needed by MCMC is theoretically | |
2606 polynomial (not exponential) in the dimensionality of the search space, in practice it has been | |
2607 found that MCMC does not converge in reasonable time for graphs with | |
2608 more than about 10 nodes. | |
2609 | |
2610 | |
2611 | |
2612 | |
2613 <h2><a name="active">Active structure learning</h2> | |
2614 | |
2615 As was mentioned <a href="#markov_equiv">above</a>, | |
2616 one can only learn a DAG up to Markov equivalence, even given infinite data. | |
2617 If one is interested in learning the structure of a causal network, | |
2618 one needs interventional data. | |
2619 (By "intervention" we mean forcing a node to take on a specific value, | |
2620 thereby effectively severing its incoming arcs.) | |
2621 <p> | |
2622 Most of the scoring functions accept an optional argument | |
2623 that specifies whether a node was observed to have a certain value, or | |
2624 was forced to have that value: we set clamped(i,m)=1 if node i was | |
2625 forced in training case m. e.g., see the file | |
2626 BNT/examples/static/cooper_yoo. | |
2627 <p> | |
2628 An interesting question is to decide which interventions to perform | |
2629 (c.f., design of experiments). For details, see the following tech | |
2630 report | |
2631 <ul> | |
2632 <li> <a href = "../../Papers/alearn.ps.gz"> | |
2633 Active learning of causal Bayes net structure</a>, Kevin Murphy, March | |
2634 2001. | |
2635 </ul> | |
2636 | |
2637 | |
2638 <h2><a name="struct_em">Structural EM</h2> | |
2639 | |
2640 Computing the Bayesian score when there is partial observability is | |
2641 computationally challenging, because the parameter posterior becomes | |
2642 multimodal (the hidden nodes induce a mixture distribution). | |
2643 One therefore needs to use approximations such as BIC. | |
2644 Unfortunately, search algorithms are still expensive, because we need | |
2645 to run EM at each step to compute the MLE, which is needed to compute | |
2646 the score of each model. An alternative approach is | |
2647 to do the local search steps inside of the M step of EM, which is more | |
2648 efficient since the data has been "filled in" - this is | |
2649 called the structural EM algorithm (Friedman 1997), and provably | |
2650 converges to a local maximum of the BIC score. | |
2651 <p> | |
2652 Wei Hu has implemented SEM for discrete nodes. | |
2653 You can download his package from | |
2654 <a href="../SEM.zip">here</a>. | |
2655 Please address all questions about this code to | |
2656 wei.hu@intel.com. | |
2657 See also <a href="#phl">Phl's implementation of SEM</a>. | |
2658 | |
2659 <!-- | |
2660 <h2><a name="reveal">REVEAL algorithm</h2> | |
2661 | |
2662 A simple way to learn the structure of a fully observed, discrete, | |
2663 factored DBN from a time series is described <a | |
2664 href="usage_dbn.html#struct_learn">here</a>. | |
2665 --> | |
2666 | |
2667 | |
2668 <h2><a name="graphdraw">Visualizing the graph</h2> | |
2669 | |
2670 You can visualize an arbitrary graph (such as one learned using the | |
2671 structure learning routines) with Matlab code contributed by | |
2672 <a href="http://www.mbfys.kun.nl/~cemgil/matlab/layout.html">Ali | |
2673 Taylan Cemgil</a> | |
2674 from the University of Nijmegen. | |
2675 For static BNs, call it as follows: | |
2676 <pre> | |
2677 draw_graph(bnet.dag); | |
2678 </pre> | |
2679 For example, this is the output produced on a | |
2680 <a href="#qmr">random QMR-like model</a>: | |
2681 <p> | |
2682 <img src="Figures/qmr.rnd.jpg"> | |
2683 <p> | |
2684 If you install the excellent <a | |
2685 href="http://www.research.att.com/sw/tools/graphviz">graphviz</a>, an | |
2686 open-source graph visualization package from AT&T, | |
2687 you can create a much better visualization as follows | |
2688 <pre> | |
2689 graph_to_dot(bnet.dag) | |
2690 </pre> | |
2691 This works by converting the adjacency matrix to a file suitable | |
2692 for input to graphviz (using the dot format), | |
2693 then converting the output of graphviz to postscript, and displaying the results using | |
2694 ghostview. | |
2695 You can do each of these steps separately for more control, as shown | |
2696 below. | |
2697 <pre> | |
2698 graph_to_dot(bnet.dag, 'filename', 'foo.dot'); | |
2699 dot -Tps foo.dot -o foo.ps | |
2700 ghostview foo.ps & | |
2701 </pre> | |
2702 | |
2703 <h2><a name = "constraint">Constraint-based methods</h2> | |
2704 | |
2705 The IC algorithm (Pearl and Verma, 1991), | |
2706 and the faster, but otherwise equivalent, PC algorithm (Spirtes, Glymour, and Scheines 1993), | |
2707 compute many conditional independence tests, | |
2708 and combine these constraints into a | |
2709 PDAG to represent the whole | |
2710 <a href="#markov_equiv">Markov equivalence class</a>. | |
2711 <p> | |
2712 IC*/FCI extend IC/PC to handle latent variables: see <a href="#ic_star">below</a>. | |
2713 (IC stands for inductive causation; PC stands for Peter and Clark, | |
2714 the first names of Spirtes and Glymour; FCI stands for fast causal | |
2715 inference. | |
2716 What we, following Pearl (2000), call IC* was called | |
2717 IC in the original Pearl and Verma paper.) | |
2718 For details, see | |
2719 <ul> | |
2720 <li> | |
2721 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Causation, | |
2722 Prediction, and Search</a>, Spirtes, Glymour and | |
2723 Scheines (SGS), 2001 (2nd edition), MIT Press. | |
2724 <li> | |
2725 <a href="http://bayes.cs.ucla.edu/BOOK-2K/index.html">Causality: Models, Reasoning and Inference</a>, J. Pearl, | |
2726 2000, Cambridge University Press. | |
2727 </ul> | |
2728 | |
2729 <p> | |
2730 | |
2731 The PC algorithm takes as arguments a function f, the number of nodes N, | |
2732 the maximum fan in K, and additional arguments A which are passed to f. | |
2733 The function f(X,Y,S,A) returns 1 if X is conditionally independent of Y given S, and 0 | |
2734 otherwise. | |
2735 For example, suppose we cheat by | |
2736 passing in a CI "oracle" which has access to the true DAG; the oracle | |
2737 tests for d-separation in this DAG, i.e., | |
2738 f(X,Y,S) calls dsep(X,Y,S,dag). We can do this as follows. | |
2739 <pre> | |
2740 pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag); | |
2741 </pre> | |
2742 pdag(i,j) = -1 if there is definitely an i->j arc, | |
2743 and pdag(i,j) = 1 if there is either an i->j or an i<-j arc. | |
2744 <p> | |
2745 Applied to the sprinkler network, this returns | |
2746 <pre> | |
2747 pdag = | |
2748 0 1 1 0 | |
2749 1 0 0 -1 | |
2750 1 0 0 -1 | |
2751 0 0 0 0 | |
2752 </pre> | |
2753 So as expected, we see that the V-structure at the W node is uniquely identified, | |
2754 but the other arcs have ambiguous orientation. | |
2755 <p> | |
2756 We now give an example from p141 (1st edn) / p103 (2nd edn) of the SGS | |
2757 book. | |
2758 This example concerns the female orgasm. | |
2759 We are given a correlation matrix C between 7 measured factors (such | |
2760 as subjective experiences of coital and masturbatory experiences), | |
2761 derived from 281 samples, and want to learn a causal model of the | |
2762 data. We will not discuss the merits of this type of work here, but | |
2763 merely show how to reproduce the results in the SGS book. | |
2764 Their program, | |
2765 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Tetrad</a>, | |
2766 makes use of the Fisher Z-test for conditional | |
2767 independence, so we do the same: | |
2768 <pre> | |
2769 max_fan_in = 4; | |
2770 nsamples = 281; | |
2771 alpha = 0.05; | |
2772 pdag = learn_struct_pdag_pc('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha); | |
2773 </pre> | |
2774 In this case, the CI test is | |
2775 <pre> | |
2776 f(X,Y,S) = cond_indep_fisher_z(X,Y,S, C,nsamples,alpha) | |
2777 </pre> | |
2778 The results match those of Fig 12a of SGS apart from two edge | |
2779 differences; presumably this is due to rounding error (although it | |
2780 could be a bug, either in BNT or in Tetrad). | |
2781 This example can be found in the file BNT/examples/static/pc2.m. | |
2782 | |
2783 <p> | |
2784 | |
2785 The IC* algorithm (Pearl and Verma, 1991), | |
2786 and the faster FCI algorithm (Spirtes, Glymour, and Scheines 1993), | |
2787 are like the IC/PC algorithm, except that they can detect the presence | |
2788 of latent variables. | |
2789 See the file <tt>learn_struct_pdag_ic_star</tt> written by Tamar | |
2790 Kushnir. The output is a matrix P, defined as follows | |
2791 (see Pearl (2000), p52 for details): | |
2792 <pre> | |
2793 % P(i,j) = -1 if there is either a latent variable L such that i <-L->j OR there is a directed edge from i->j. | |
2794 % P(i,j) = -2 if there is a marked directed i-*>j edge. | |
% P(i,j) = P(j,i) = 1 if there is an undirected edge i--j
2796 % P(i,j) = P(j,i) = 2 if there is a latent variable L such that i<-L->j. | |
2797 </pre> | |
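<p>
As a rough sketch, and assuming the same calling convention as
<tt>learn_struct_pdag_pc</tt> (the name of a CI test followed by its extra
arguments), you could run IC* with a d-separation oracle as follows:
<pre>
% Sketch only: run IC* using d-separation in the true DAG as the CI "oracle".
P = learn_struct_pdag_ic_star('dsep', N, max_fan_in, dag);
</pre>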
2798 | |
2799 | |
2800 <h2><a name="phl">Philippe Leray's structure learning package</h2> | |
2801 | |
2802 Philippe Leray has written a | |
2803 <a href="http://bnt.insa-rouen.fr/ajouts.html"> | |
2804 structure learning package</a> that uses BNT. | |
2805 | |
It currently (June 2003) has the following features:
2807 <ul> | |
2808 <li>PC with Chi2 statistical test | |
<li> MWST : Maximum Weight Spanning Tree
2810 <li> Hill Climbing | |
2811 <li> Greedy Search | |
2812 <li> Structural EM | |
2813 <li> hist_ic : optimal Histogram based on IC information criterion | |
2814 <li> cpdag_to_dag | |
2815 <li> dag_to_cpdag | |
2816 <li> ... | |
2817 </ul> | |
2818 | |
2819 | |
2821 | |
2822 | |
2823 <!-- | |
2824 <h2><a name="read_learning">Further reading on learning</h2> | |
2825 | |
2826 I recommend the following tutorials for more details on learning. | |
2827 <ul> | |
2828 <li> <a | |
2829 href="http://www.cs.berkeley.edu/~murphyk/Papers/intel.ps.gz">My short | |
2830 tutorial</a> on graphical models, which contains an overview of learning. | |
2831 | |
2832 <li> | |
2833 <A HREF="ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.PS"> | |
2834 A tutorial on learning with Bayesian networks</a>, D. Heckerman, | |
2835 Microsoft Research Tech Report, 1995. | |
2836 | |
2837 <li> <A HREF="http://www-cad.eecs.berkeley.edu/~wray/Mirror/lwgmja"> | |
2838 Operations for Learning with Graphical Models</a>, | |
2839 W. L. Buntine, JAIR'94, 159--225. | |
2840 </ul> | |
2841 <p> | |
2842 --> | |
2843 | |
2844 | |
2845 | |
2846 | |
2847 | |
2848 <h1><a name="engines">Inference engines</h1> | |
2849 | |
2850 Up until now, we have used the junction tree algorithm for inference. | |
2851 However, sometimes this is too slow, or not even applicable. | |
In general, there are many inference algorithms, each of which makes
2853 different tradeoffs between speed, accuracy, complexity and | |
2854 generality. Furthermore, there might be many implementations of the | |
2855 same algorithm; for instance, a general purpose, readable version, | |
2856 and a highly-optimized, specialized one. | |
2857 To cope with this variety, we treat each inference algorithm as an | |
2858 object, which we call an inference engine. | |
2859 | |
2860 <p> | |
2861 An inference engine is an object that contains a bnet and supports the | |
2862 'enter_evidence' and 'marginal_nodes' methods. The engine constructor | |
2863 takes the bnet as argument and may do some model-specific processing. | |
2864 When 'enter_evidence' is called, the engine may do some | |
2865 evidence-specific processing. Finally, when 'marginal_nodes' is | |
2866 called, the engine may do some query-specific processing. | |
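<p>
For example, with the junction tree engine used so far, the three stages look
like this (the other engines discussed below follow the same pattern):
<pre>
engine = jtree_inf_engine(bnet);            % model-specific processing
engine = enter_evidence(engine, evidence);  % evidence-specific processing
m = marginal_nodes(engine, i);              % query-specific processing
</pre>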
2867 | |
2868 <p> | |
How much work is done at each stage -- specifying the structure,
the parameters, the evidence, and the query -- depends on the engine. Work
done early in this sequence can be amortized over many subsequent calls.
On the other hand, an engine that waits until later in the sequence has
more information available, and can therefore optimize more aggressively.
2874 For example, the parameters might imply | |
conditional independencies that are not evident in the graph structure,
2876 but can nevertheless be exploited; the evidence indicates which nodes | |
2877 are observed and hence can effectively be disconnected from the | |
2878 graph; and the query might indicate that large parts of the network | |
2879 are d-separated from the query nodes. (Since it is not the actual | |
2880 <em>values</em> of the evidence that matters, just which nodes are observed, | |
2881 many engines allow you to specify which nodes will be observed when they are constructed, | |
2882 i.e., before calling 'enter_evidence'. Some engines can still cope if | |
2883 the actual pattern of evidence is different, e.g., if there is missing | |
2884 data.) | |
2885 <p> | |
2886 | |
2887 Although being maximally lazy (i.e., only doing work when a query is | |
2888 issued) may seem desirable, | |
this is not always the most efficient strategy.
2890 For example, | |
2891 when learning using EM, we need to call marginal_nodes N times, where N is the | |
number of nodes. <a href="#varelim">Variable elimination</a> would end
2893 up repeating a lot of work | |
2894 each time marginal_nodes is called, making it inefficient for | |
2895 learning. The junction tree algorithm, by contrast, uses dynamic | |
2896 programming to avoid this redundant computation --- it calculates all | |
2897 marginals in two passes during 'enter_evidence', so calling | |
2898 'marginal_nodes' takes constant time. | |
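<p>
For instance, the following sketch queries every node's marginal after a single
call to 'enter_evidence'; with the junction tree engine each query is then cheap,
whereas variable elimination would redo most of the work on every iteration.
<pre>
engine = enter_evidence(engine, evidence);
for i=1:N
  m{i} = marginal_nodes(engine, i);
end
</pre>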
2899 <p> | |
2900 We will discuss some of the inference algorithms implemented in BNT | |
2901 below, and finish with a <a href="#engine_summary">summary</a> of all | |
2902 of them. | |
2903 | |
2904 | |
2905 | |
2906 | |
2907 | |
2908 | |
2909 | |
2910 <h2><a name="varelim">Variable elimination</h2> | |
2911 | |
2912 The variable elimination algorithm, also known as bucket elimination | |
2913 or peeling, is one of the simplest inference algorithms. | |
2914 The basic idea is to "push sums inside of products"; this is explained | |
2915 in more detail | |
2916 <a | |
2917 href="http://HTTP.CS.Berkeley.EDU/~murphyk/Bayes/bayes.html#infer">here</a>. | |
2918 <p> | |
2919 The principle of distributing sums over products can be generalized | |
2920 greatly to apply to any commutative semiring. | |
2921 This forms the basis of many common algorithms, such as Viterbi | |
2922 decoding and the Fast Fourier Transform. For details, see | |
2923 | |
2924 <ul> | |
<li> S. M. Aji and R. J. McEliece, 2000.
2926 <!--<a href="http://www.systems.caltech.edu/EE/Faculty/rjm/papers/GDL.ps">--> | |
2927 <a href="GDL.pdf"> | |
2928 The Generalized Distributive Law</a>, | |
2929 IEEE Trans. Inform. Theory, vol. 46, no. 2 (March 2000), | |
2930 pp. 325--343. | |
2931 | |
2932 | |
2933 <li> | |
2934 F. R. Kschischang, B. J. Frey and H.-A. Loeliger, 2001. | |
2935 <a href="http://www.cs.toronto.edu/~frey/papers/fgspa.abs.html"> | |
Factor graphs and the sum-product algorithm</a>,
IEEE Transactions on Information Theory, February 2001.
2938 | |
2939 </ul> | |
2940 | |
2941 <p> | |
2942 Choosing an order in which to sum out the variables so as to minimize | |
2943 computational cost is known to be NP-hard. | |
2944 The implementation of this algorithm in | |
2945 <tt>var_elim_inf_engine</tt> makes no attempt to optimize this | |
2946 ordering (in contrast, say, to <tt>jtree_inf_engine</tt>, which uses a | |
2947 greedy search procedure to find a good ordering). | |
2948 <p> | |
2949 Note: unlike most algorithms, var_elim does all its computational work | |
2950 inside of <tt>marginal_nodes</tt>, not inside of | |
2951 <tt>enter_evidence</tt>. | |
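<p>
Usage nevertheless follows the standard engine interface, for example:
<pre>
engine = var_elim_inf_engine(bnet);
engine = enter_evidence(engine, evidence);  % little work is done here
m = marginal_nodes(engine, i);              % the summing out happens here
</pre>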
2952 | |
2953 | |
2954 | |
2955 | |
2956 <h2><a name="global">Global inference methods</h2> | |
2957 | |
The simplest inference algorithm of all is to explicitly construct
2959 the joint distribution over all the nodes, and then to marginalize it. | |
2960 This is implemented in <tt>global_joint_inf_engine</tt>. | |
2961 Since the size of the joint is exponential in the | |
2962 number of discrete (hidden) nodes, this is not a very practical algorithm. | |
2963 It is included merely for pedagogical and debugging purposes. | |
2964 <p> | |
2965 Three specialized versions of this algorithm have also been implemented, | |
2966 corresponding to the cases where all the nodes are discrete (D), all | |
2967 are Gaussian (G), and some are discrete and some Gaussian (CG). | |
2968 They are called <tt>enumerative_inf_engine</tt>, | |
2969 <tt>gaussian_inf_engine</tt>, | |
2970 and <tt>cond_gauss_inf_engine</tt> respectively. | |
2971 <p> | |
2972 Note: unlike most algorithms, these global inference algorithms do all their computational work | |
2973 inside of <tt>marginal_nodes</tt>, not inside of | |
2974 <tt>enter_evidence</tt>. | |
2975 | |
2976 | |
2977 <h2><a name="quickscore">Quickscore</h2> | |
2978 | |
2979 The junction tree algorithm is quite slow on the <a href="#qmr">QMR</a> network, | |
2980 since the cliques are so big. | |
2981 One simple trick we can use is to notice that hidden leaves do not | |
2982 affect the posteriors on the roots, and hence do not need to be | |
2983 included in the network. | |
2984 A second trick is to notice that the negative findings can be | |
2985 "absorbed" into the prior: | |
2986 see the file | |
2987 BNT/examples/static/mk_minimal_qmr_bnet for details. | |
2988 <p> | |
2989 | |
2990 A much more significant speedup is obtained by exploiting special | |
2991 properties of the noisy-or node, as done by the quickscore | |
2992 algorithm. For details, see | |
2993 <ul> | |
2994 <li> Heckerman, "A tractable inference algorithm for diagnosing multiple diseases", UAI 89. | |
2995 <li> Rish and Dechter, "On the impact of causal independence", UCI | |
2996 tech report, 1998. | |
2997 </ul> | |
2998 | |
2999 This has been implemented in BNT as a special-purpose inference | |
3000 engine, which can be created and used as follows: | |
<pre>
engine = quickscore_inf_engine(inhibit, leak, prior); % noisy-or inhibition probs, leak probs, disease priors
engine = enter_evidence(engine, pos, neg);            % indices of the positive and negative findings
m = marginal_nodes(engine, i);                        % posterior marginal on disease i
</pre>
3006 | |
3007 | |
3008 <h2><a name="belprop">Belief propagation</h2> | |
3009 | |
3010 Even using quickscore, exact inference takes time that is exponential | |
3011 in the number of positive findings. | |
3012 Hence for large networks we need to resort to approximate inference techniques. | |
3013 See for example | |
3014 <ul> | |
3015 <li> T. Jaakkola and M. Jordan, "Variational probabilistic inference and the | |
3016 QMR-DT network", JAIR 10, 1999. | |
3017 | |
3018 <li> K. Murphy, Y. Weiss and M. Jordan, "Loopy belief propagation for approximate inference: an empirical study", | |
3019 UAI 99. | |
3020 </ul> | |
3021 The latter approximation | |
3022 entails applying Pearl's belief propagation algorithm to a model even | |
3023 if it has loops (hence the name loopy belief propagation). | |
3024 Pearl's algorithm, implemented as <tt>pearl_inf_engine</tt>, gives | |
3025 exact results when applied to singly-connected graphs | |
3026 (a.k.a. polytrees, since | |
3027 the underlying undirected topology is a tree, but a node may have | |
3028 multiple parents). | |
To apply it to a graph with loops, use the same engine, specifying a
maximum number of iterations; messages can be passed using either a
centralized or a distributed protocol. For example:
3033 <pre> | |
3034 engine = pearl_inf_engine(bnet, 'max_iter', 30); | |
3035 engine = enter_evidence(engine, evidence); | |
3036 m = marginal_nodes(engine, i); | |
3037 </pre> | |
We found that this algorithm often converges, and when it does, it is
often very accurate; however, this depends on the precise parameter
values of the network.
3041 (See the file BNT/examples/static/qmr1 to repeat the experiment for yourself.) | |
Understanding when and why belief propagation converges/works
3043 is a topic of ongoing research. | |
3044 <p> | |
3045 <tt>pearl_inf_engine</tt> can exploit special structure in noisy-or | |
3046 and gmux nodes to compute messages efficiently. | |
3047 <p> | |
3048 <tt>belprop_inf_engine</tt> is like pearl, but uses potentials to | |
3049 represent messages. Hence this is slower. | |
3050 <p> | |
3051 <tt>belprop_fg_inf_engine</tt> is like belprop, | |
3052 but is designed for factor graphs. | |
3053 | |
3054 | |
3055 | |
3056 <h2><a name="sampling">Sampling</h2> | |
3057 | |
3058 BNT now (Mar '02) has two sampling (Monte Carlo) inference algorithms: | |
3059 <ul> | |
3060 <li> <tt>likelihood_weighting_inf_engine</tt> which does importance | |
3061 sampling and can handle any node type. | |
3062 <li> <tt>gibbs_sampling_inf_engine</tt>, written by Bhaskara Marthi. | |
3063 Currently this can only handle tabular CPDs. | |
3064 For a much faster and more powerful Gibbs sampling program, see | |
3065 <a href="http://www.mrc-bsu.cam.ac.uk/bugs">BUGS</a>. | |
3066 </ul> | |
3067 Note: To generate samples from a network (which is not the same as inference!), | |
3068 use <tt>sample_bnet</tt>. | |
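<p>
For example, likelihood weighting follows the standard engine interface, and a
single joint sample can be drawn with <tt>sample_bnet</tt>:
<pre>
% Approximate inference by likelihood weighting (importance sampling).
engine = likelihood_weighting_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);

% Generating data from the model (not inference): one joint sample,
% returned with one entry per node.
sample = sample_bnet(bnet);
</pre>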
3069 | |
3070 | |
3071 | |
3072 <h2><a name="engine_summary">Summary of inference engines</h2> | |
3073 | |
3074 | |
3075 The inference engines differ in many ways. Here are | |
3076 some of the major "axes": | |
3077 <ul> | |
3078 <li> Works for all topologies or makes restrictions? | |
3079 <li> Works for all node types or makes restrictions? | |
3080 <li> Exact or approximate inference? | |
3081 </ul> | |
3082 | |
3083 <p> | |
3084 In terms of topology, most engines handle any kind of DAG. | |
3085 <tt>belprop_fg</tt> does approximate inference on factor graphs (FG), which | |
3086 can be used to represent directed, undirected, and mixed (chain) | |
3087 graphs. | |
3088 (In the future, we plan to support exact inference on chain graphs.) | |
3089 <tt>quickscore</tt> only works on QMR-like models. | |
3090 <p> | |
3091 In terms of node types: algorithms that use potentials can handle | |
3092 discrete (D), Gaussian (G) or conditional Gaussian (CG) models. | |
3093 Sampling algorithms can essentially handle any kind of node (distribution). | |
3094 Other algorithms make more restrictive assumptions in exchange for | |
3095 speed. | |
3096 <p> | |
3097 Finally, most algorithms are designed to give the exact answer. | |
3098 The belief propagation algorithms are exact if applied to trees, and | |
3099 in some other cases. | |
3100 Sampling is considered approximate, even though, in the limit of an | |
3101 infinite number of samples, it gives the exact answer. | |
3102 | |
3103 <p> | |
3104 | |
3105 Here is a summary of the properties | |
3106 of all the engines in BNT which work on static networks. | |
3107 <p> | |
<table border=1>
<tr>
<td align=left width=0>Name
<td align=left width=0>Exact?
<td align=left width=0>Node type?
<td align=left width=0>Topology
<tr>
3116 <td align=left> belprop | |
3117 <td align=left> approx | |
3118 <td align=left> D | |
3119 <td align=left> DAG | |
3120 <tr> | |
3121 <td align=left> belprop_fg | |
3122 <td align=left> approx | |
3123 <td align=left> D | |
3124 <td align=left> factor graph | |
3125 <tr> | |
3126 <td align=left> cond_gauss | |
3127 <td align=left> exact | |
3128 <td align=left> CG | |
3129 <td align=left> DAG | |
3130 <tr> | |
3131 <td align=left> enumerative | |
3132 <td align=left> exact | |
3133 <td align=left> D | |
3134 <td align=left> DAG | |
3135 <tr> | |
3136 <td align=left> gaussian | |
3137 <td align=left> exact | |
3138 <td align=left> G | |
3139 <td align=left> DAG | |
3140 <tr> | |
3141 <td align=left> gibbs | |
3142 <td align=left> approx | |
3143 <td align=left> D | |
3144 <td align=left> DAG | |
3145 <tr> | |
3146 <td align=left> global_joint | |
3147 <td align=left> exact | |
3148 <td align=left> D,G,CG | |
3149 <td align=left> DAG | |
3150 <tr> | |
3151 <td align=left> jtree | |
3152 <td align=left> exact | |
3153 <td align=left> D,G,CG | |
3154 <td align=left> DAG | |
<tr>
3156 <td align=left> likelihood_weighting | |
3157 <td align=left> approx | |
3158 <td align=left> any | |
3159 <td align=left> DAG | |
3160 <tr> | |
3161 <td align=left> pearl | |
3162 <td align=left> approx | |
3163 <td align=left> D,G | |
3164 <td align=left> DAG | |
3165 <tr> | |
3166 <td align=left> pearl | |
3167 <td align=left> exact | |
3168 <td align=left> D,G | |
3169 <td align=left> polytree | |
3170 <tr> | |
3171 <td align=left> quickscore | |
3172 <td align=left> exact | |
3173 <td align=left> noisy-or | |
3174 <td align=left> QMR | |
3175 <tr> | |
3176 <td align=left> stab_cond_gauss | |
3177 <td align=left> exact | |
3178 <td align=left> CG | |
3179 <td align=left> DAG | |
3180 <tr> | |
3181 <td align=left> var_elim | |
3182 <td align=left> exact | |
3183 <td align=left> D,G,CG | |
3184 <td align=left> DAG | |
3185 </table> | |
3186 | |
3187 | |
3188 | |
<h1><a name="influence">Influence diagrams / decision making</h1>
3190 | |
3191 BNT implements an exact algorithm for solving LIMIDs (limited memory | |
3192 influence diagrams), described in | |
3193 <ul> | |
3194 <li> S. L. Lauritzen and D. Nilsson. | |
3195 <a href="http://www.math.auc.dk/~steffen/papers/limids.pdf"> | |
3196 Representing and solving decision problems with limited | |
3197 information</a> | |
3198 Management Science, 47, 1238 - 1251. September 2001. | |
3199 </ul> | |
LIMIDs explicitly show all information arcs, rather than implicitly
3201 assuming no forgetting. This allows them to model forgetful | |
3202 controllers. | |
3203 <p> | |
3204 See the examples in <tt>BNT/examples/limids</tt> for details. | |
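<p>
As a rough sketch of what those examples do (consult the example files for the
exact calls), solving a LIMID looks something like this:
<pre>
% Sketch only: limid is a bnet augmented with decision and utility nodes.
engine = global_joint_inf_engine(limid);
[strategy, MEU] = solve_limid(engine);
</pre>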
3205 | |
3206 | |
3207 | |
3208 | |
3209 <h1>DBNs, HMMs, Kalman filters and all that</h1> | |
3210 | |
3211 Click <a href="usage_dbn.html">here</a> for documentation about how to | |
3212 use BNT for dynamical systems and sequence data. | |
3213 | |
3214 | |
3215 </BODY> |