wolffd@0 1 <HEAD>
wolffd@0 2 <TITLE>How to use the Bayes Net Toolbox</TITLE>
wolffd@0 3 </HEAD>
wolffd@0 4
wolffd@0 5 <BODY BGCOLOR="#FFFFFF">
wolffd@0 6 <!-- white background is better for the pictures and equations -->
wolffd@0 7
wolffd@0 8 <h1>How to use the Bayes Net Toolbox</h1>
wolffd@0 9
wolffd@0 10 This documentation was last updated on 13 November 2002.
wolffd@0 11 <br>
wolffd@0 12 Click <a href="changelog.html">here</a> for a list of changes made to
wolffd@0 13 BNT.
wolffd@0 14 <br>
wolffd@0 15 Click
wolffd@0 16 <a href="http://bnt.insa-rouen.fr/">here</a>
wolffd@0 17 for a French version of this documentation (which might not
wolffd@0 18 be up-to-date).
wolffd@0 19
wolffd@0 20
wolffd@0 21 <p>
wolffd@0 22
wolffd@0 23 <ul>
wolffd@0 24 <li> <a href="#install">Installation</a>
wolffd@0 25 <ul>
wolffd@0 26 <li> <a href="#installM">Installing the Matlab code</a>
wolffd@0 27 <li> <a href="#installC">Installing the C code</a>
wolffd@0 28 <li> <a href="../matlab_tips.html">Useful Matlab tips</a>.
wolffd@0 29 </ul>
wolffd@0 30
wolffd@0 31 <li> <a href="#basics">Creating your first Bayes net</a>
wolffd@0 32 <ul>
wolffd@0 33 <li> <a href="#basics">Creating a model by hand</a>
wolffd@0 34 <li> <a href="#file">Loading a model from a file</a>
wolffd@0 35 <li> <a href="http://bnt.insa-rouen.fr/ajouts.html">Creating a model using a GUI</a>
wolffd@0 36 </ul>
wolffd@0 37
wolffd@0 38 <li> <a href="#inference">Inference</a>
wolffd@0 39 <ul>
wolffd@0 40 <li> <a href="#marginal">Computing marginal distributions</a>
wolffd@0 41 <li> <a href="#joint">Computing joint distributions</a>
wolffd@0 42 <li> <a href="#soft">Soft/virtual evidence</a>
wolffd@0 43 <li> <a href="#mpe">Most probable explanation</a>
wolffd@0 44 </ul>
wolffd@0 45
wolffd@0 46 <li> <a href="#cpd">Conditional Probability Distributions</a>
wolffd@0 47 <ul>
wolffd@0 48 <li> <a href="#tabular">Tabular (multinomial) nodes</a>
wolffd@0 49 <li> <a href="#noisyor">Noisy-or nodes</a>
wolffd@0 50 <li> <a href="#deterministic">Other (noisy) deterministic nodes</a>
wolffd@0 51 <li> <a href="#softmax">Softmax (multinomial logit) nodes</a>
wolffd@0 52 <li> <a href="#mlp">Neural network nodes</a>
wolffd@0 53 <li> <a href="#root">Root nodes</a>
wolffd@0 54 <li> <a href="#gaussian">Gaussian nodes</a>
wolffd@0 55 <li> <a href="#glm">Generalized linear model nodes</a>
wolffd@0 56 <li> <a href="#dtree">Classification/regression tree nodes</a>
wolffd@0 57 <li> <a href="#nongauss">Other continuous distributions</a>
wolffd@0 58 <li> <a href="#cpd_summary">Summary of CPD types</a>
wolffd@0 59 </ul>
wolffd@0 60
wolffd@0 61 <li> <a href="#examples">Example models</a>
wolffd@0 62 <ul>
wolffd@0 63 <li> <a
wolffd@0 64 href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">
wolffd@0 65 Gaussian mixture models</a>
wolffd@0 66 <li> <a href="#pca">PCA, ICA, and all that</a>
<li> <a href="#mixexp">Mixtures of experts</a>
wolffd@0 68 <li> <a href="#hme">Hierarchical mixtures of experts</a>
wolffd@0 69 <li> <a href="#qmr">QMR</a>
wolffd@0 70 <li> <a href="#cg_model">Conditional Gaussian models</a>
wolffd@0 71 <li> <a href="#hybrid">Other hybrid models</a>
wolffd@0 72 </ul>
wolffd@0 73
wolffd@0 74 <li> <a href="#param_learning">Parameter learning</a>
wolffd@0 75 <ul>
wolffd@0 76 <li> <a href="#load_data">Loading data from a file</a>
wolffd@0 77 <li> <a href="#mle_complete">Maximum likelihood parameter estimation from complete data</a>
wolffd@0 78 <li> <a href="#prior">Parameter priors</a>
wolffd@0 79 <li> <a href="#bayes_learn">(Sequential) Bayesian parameter updating from complete data</a>
wolffd@0 80 <li> <a href="#em">Maximum likelihood parameter estimation with missing values (EM)</a>
wolffd@0 81 <li> <a href="#tying">Parameter tying</a>
wolffd@0 82 </ul>
wolffd@0 83
wolffd@0 84 <li> <a href="#structure_learning">Structure learning</a>
wolffd@0 85 <ul>
wolffd@0 86 <li> <a href="#enumerate">Exhaustive search</a>
wolffd@0 87 <li> <a href="#K2">K2</a>
wolffd@0 88 <li> <a href="#hill_climb">Hill-climbing</a>
wolffd@0 89 <li> <a href="#mcmc">MCMC</a>
wolffd@0 90 <li> <a href="#active">Active learning</a>
wolffd@0 91 <li> <a href="#struct_em">Structural EM</a>
wolffd@0 92 <li> <a href="#graphdraw">Visualizing the learned graph structure</a>
wolffd@0 93 <li> <a href="#constraint">Constraint-based methods</a>
wolffd@0 94 </ul>
wolffd@0 95
wolffd@0 96
wolffd@0 97 <li> <a href="#engines">Inference engines</a>
wolffd@0 98 <ul>
wolffd@0 99 <li> <a href="#jtree">Junction tree</a>
wolffd@0 100 <li> <a href="#varelim">Variable elimination</a>
wolffd@0 101 <li> <a href="#global">Global inference methods</a>
wolffd@0 102 <li> <a href="#quickscore">Quickscore</a>
wolffd@0 103 <li> <a href="#belprop">Belief propagation</a>
wolffd@0 104 <li> <a href="#sampling">Sampling (Monte Carlo)</a>
wolffd@0 105 <li> <a href="#engine_summary">Summary of inference engines</a>
wolffd@0 106 </ul>
wolffd@0 107
wolffd@0 108
wolffd@0 109 <li> <a href="#influence">Influence diagrams/ decision making</a>
wolffd@0 110
wolffd@0 111
wolffd@0 112 <li> <a href="usage_dbn.html">DBNs, HMMs, Kalman filters and all that</a>
wolffd@0 113 </ul>
wolffd@0 114
wolffd@0 115 </ul>
wolffd@0 116
wolffd@0 117
wolffd@0 118
wolffd@0 119
wolffd@0 120 <h1><a name="install">Installation</h1>
wolffd@0 121
wolffd@0 122 <h2><a name="installM">Installing the Matlab code</h2>
wolffd@0 123
wolffd@0 124 <ul>
<li> <a href="bnt_download.html">Download</a> the FullBNT.zip file.
wolffd@0 126
wolffd@0 127 <p>
wolffd@0 128 <li> Unpack the file. In Unix, type
wolffd@0 129 <!--"tar xvf BNT.tar".-->
wolffd@0 130 "unzip FullBNT.zip".
wolffd@0 131 In Windows, use
wolffd@0 132 a program like <a href="http://www.winzip.com">Winzip</a>. This will
wolffd@0 133 create a directory called FullBNT, which contains BNT and other libraries.
wolffd@0 134
wolffd@0 135 <p>
wolffd@0 136 <li> Read the file <tt>BNT/README</tt> to make sure the date
wolffd@0 137 matches the one on the top of <a href=bnt.html>the BNT home page</a>.
wolffd@0 138 If not, you may need to press 'refresh' on your browser, and download
wolffd@0 139 again, to get the most recent version.
wolffd@0 140
wolffd@0 141 <p>
wolffd@0 142 <li> <b>Edit the file "BNT/add_BNT_to_path.m"</b> so it contains the correct
wolffd@0 143 pathname.
wolffd@0 144 For example, in Windows,
wolffd@0 145 I download FullBNT.zip into C:\kpmurphy\matlab, and
wolffd@0 146 then comment out the second line (with the % character), and uncomment
wolffd@0 147 the third line, which reads
wolffd@0 148 <pre>
wolffd@0 149 BNT_HOME = 'C:\kpmurphy\matlab\FullBNT';
wolffd@0 150 </pre>
wolffd@0 151
wolffd@0 152 <p>
wolffd@0 153 <li> Start up Matlab.
wolffd@0 154
wolffd@0 155 <p>
wolffd@0 156 <li> Type "ver" at the Matlab prompt (">>").
wolffd@0 157 <b>You need Matlab version 5.2 or newer to run BNT</b>.
wolffd@0 158 (Versions 5.0 and 5.1 have a memory leak which seems to sometimes
wolffd@0 159 crash BNT.)
wolffd@0 160
wolffd@0 161 <p>
wolffd@0 162 <li> Move to the BNT directory.
wolffd@0 163 For example, in Windows, I type
wolffd@0 164 <pre>
wolffd@0 165 >> cd C:\kpmurphy\matlab\FullBNT\BNT
wolffd@0 166 </pre>
wolffd@0 167
wolffd@0 168 <p>
wolffd@0 169 <li> Type "add_BNT_to_path".
wolffd@0 170 This executes the command
wolffd@0 171 <tt>addpath(genpath(BNT_HOME))</tt>,
wolffd@0 172 which adds all directories below FullBNT to the matlab path.
wolffd@0 173
wolffd@0 174 <p>
wolffd@0 175 <li> Type "test_BNT".
wolffd@0 176
wolffd@0 177 If all goes well, this will produce a bunch of numbers and maybe some
wolffd@0 178 warning messages (which you can ignore), but no error messages.
wolffd@0 179 (The warnings should only be of the form
wolffd@0 180 "Warning: Maximum number of iterations has been exceeded", and are
wolffd@0 181 produced by Netlab.)
wolffd@0 182
wolffd@0 183 <p>
wolffd@0 184 <li> <a href="http://groups.yahoo.com/group/BayesNetToolbox/join">
wolffd@0 185 Join the BNT email list</a>
wolffd@0 186
wolffd@0 187 </ul>
wolffd@0 188
wolffd@0 189
wolffd@0 190 If you are new to Matlab, you might like to check out
wolffd@0 191 <a href="matlab_tips.html">some useful Matlab tips</a>.
wolffd@0 192 For instance, this explains how to create a startup file, which can be
wolffd@0 193 used to set your path variable automatically, so you can avoid having
wolffd@0 194 to type the above commands every time.
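<p>
For example, a minimal startup.m along the following lines (the
pathname is just an example; change it to wherever you unpacked
FullBNT) will add BNT to your path every time Matlab starts:
<pre>
% startup.m -- place this somewhere Matlab looks for it at startup
BNT_HOME = 'C:\kpmurphy\matlab\FullBNT';  % example location; edit to match your install
addpath(genpath(BNT_HOME));               % add FullBNT and all its subdirectories
</pre>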
wolffd@0 195
wolffd@0 196
wolffd@0 197
wolffd@0 198
wolffd@0 199 <h2><a name="installC">Installing the C code</h2>
wolffd@0 200
wolffd@0 201 Some BNT functions also have C implementations.
wolffd@0 202 <b>It is not necessary to install the C code</b>, but it can result in a speedup
wolffd@0 203 of a factor of 5-10.
wolffd@0 204 To install all the C code,
wolffd@0 205 edit installC_BNT.m so it contains the right path,
wolffd@0 206 then type <tt>installC_BNT</tt>.
wolffd@0 207 To uninstall all the C code,
wolffd@0 208 edit uninstallC_BNT.m so it contains the right path,
wolffd@0 209 then type <tt>uninstallC_BNT</tt>.
wolffd@0 210 For an up-to-date list of the files which have C implementations, see
wolffd@0 211 BNT/installC_BNT.m.
wolffd@0 212
wolffd@0 213 <p>
wolffd@0 214 mex is a script that lets you call C code from Matlab - it does not compile matlab to
wolffd@0 215 C (see mcc below).
wolffd@0 216 If your C/C++ compiler is set up correctly, mex should work out of
wolffd@0 217 the box.
wolffd@0 218 If not, you might need to type
wolffd@0 219 <p>
wolffd@0 220 <tt> mex -setup</tt>
wolffd@0 221 <p>
wolffd@0 222 before calling installC.
wolffd@0 223 <p>
wolffd@0 224 To make mex call gcc on Windows,
wolffd@0 225 you must install <a
wolffd@0 226 href="http://www.mrc-cbu.cam.ac.uk/Imaging/gnumex20.html">gnumex</a>.
wolffd@0 227 You can use the <a href="http://www.mingw.org/">minimalist GNU for
wolffd@0 228 Windows</a> version of gcc, or
wolffd@0 229 the <a href="http://sources.redhat.com/cygwin/">cygwin</a> version.
wolffd@0 230 <p>
wolffd@0 231 In general, typing
wolffd@0 232 'mex foo.c' from inside Matlab creates a file called
wolffd@0 233 'foo.mexglx' or 'foo.dll' (the exact file
wolffd@0 234 extension is system dependent - on Linux it is 'mexglx', on Windows it is '.dll').
wolffd@0 235 The resulting file will hide the original 'foo.m' (if it existed), i.e.,
wolffd@0 236 typing 'foo' at the prompt will call the compiled C version.
wolffd@0 237 To reveal the original matlab version, just delete foo.mexglx (this is
wolffd@0 238 what uninstallC does).
wolffd@0 239 <p>
wolffd@0 240 Sometimes it takes time for Matlab to realize that the file has
wolffd@0 241 changed from matlab to C or vice versa; try typing 'clear all' or
wolffd@0 242 restarting Matlab to refresh it.
wolffd@0 243 To find out which version of a file you are running, type
wolffd@0 244 'which foo'.
wolffd@0 245 <p>
wolffd@0 246 <a href="http://www.mathworks.com/products/compiler">mcc</a>, the
wolffd@0 247 Matlab to C compiler, is a separate product,
wolffd@0 248 and is quite different from mex. It does not yet support
wolffd@0 249 objects/classes, which is why we can't compile all of BNT to C automatically.
wolffd@0 250 Also, hand-written C code is usually much
wolffd@0 251 better than the C code generated by mcc.
wolffd@0 252
wolffd@0 253
wolffd@0 254 <p>
wolffd@0 255 Acknowledgements:
wolffd@0 256 Although I wrote some of the C code, most of
wolffd@0 257 the C code (e.g., for jtree and dpot) was written by Wei Hu;
wolffd@0 258 the triangulation C code was written by Ilya Shpitser.
wolffd@0 259
wolffd@0 260
wolffd@0 261 <h1><a name="basics">Creating your first Bayes net</h1>
wolffd@0 262
wolffd@0 263 To define a Bayes net, you must specify the graph structure and then
wolffd@0 264 the parameters. We look at each in turn, using a simple example
wolffd@0 265 (adapted from Russell and
wolffd@0 266 Norvig, "Artificial Intelligence: a Modern Approach", Prentice Hall,
wolffd@0 267 1995, p454).
wolffd@0 268
wolffd@0 269
wolffd@0 270 <h2>Graph structure</h2>
wolffd@0 271
wolffd@0 272
wolffd@0 273 Consider the following network.
wolffd@0 274
wolffd@0 275 <p>
wolffd@0 276 <center>
wolffd@0 277 <IMG SRC="Figures/sprinkler.gif">
wolffd@0 278 </center>
wolffd@0 279 <p>
wolffd@0 280
wolffd@0 281 <P>
wolffd@0 282 To specify this directed acyclic graph (dag), we create an adjacency matrix:
wolffd@0 283 <PRE>
wolffd@0 284 N = 4;
wolffd@0 285 dag = zeros(N,N);
wolffd@0 286 C = 1; S = 2; R = 3; W = 4;
wolffd@0 287 dag(C,[R S]) = 1;
wolffd@0 288 dag(R,W) = 1;
wolffd@0 289 dag(S,W)=1;
wolffd@0 290 </PRE>
wolffd@0 291 <P>
wolffd@0 292 We have numbered the nodes as follows:
wolffd@0 293 Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4.
wolffd@0 294 <b>The nodes must always be numbered in topological order, i.e.,
wolffd@0 295 ancestors before descendants.</b>
wolffd@0 296 For a more complicated graph, this is a little inconvenient: we will
wolffd@0 297 see how to get around this <a href="usage_dbn.html#bat">below</a>.
wolffd@0 298 <p>
wolffd@0 299 In Matlab 6, you can use logical arrays instead of double arrays,
wolffd@0 300 which are 4 times smaller:
wolffd@0 301 <pre>
wolffd@0 302 dag = false(N,N);
wolffd@0 303 dag(C,[R S]) = true;
wolffd@0 304 ...
wolffd@0 305 </pre>
wolffd@0 306 <p>
wolffd@0 307 A preliminary attempt to make a <b>GUI</b>
has been written by Philippe LeRay and can be downloaded
wolffd@0 309 from <a href="http://bnt.insa-rouen.fr/ajouts.html">here</a>.
wolffd@0 310 <p>
wolffd@0 311 You can visualize the resulting graph structure using
wolffd@0 312 the methods discussed <a href="#graphdraw">below</a>.
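<p>
For example, assuming the draw_graph function included in the FullBNT
distribution is on your path, something like the following sketch
should plot the sprinkler DAG (the label strings are just
illustrative):
<pre>
labels = {'Cloudy', 'Sprinkler', 'Rain', 'WetGrass'};
draw_graph(dag, labels);  % draw_graph ships with FullBNT
</pre>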
wolffd@0 313
wolffd@0 314 <h2>Creating the Bayes net shell</h2>
wolffd@0 315
wolffd@0 316 In addition to specifying the graph structure,
wolffd@0 317 we must specify the size and type of each node.
wolffd@0 318 If a node is discrete, its size is the
wolffd@0 319 number of possible values
wolffd@0 320 each node can take on; if a node is continuous,
wolffd@0 321 it can be a vector, and its size is the length of this vector.
wolffd@0 322 In this case, we will assume all nodes are discrete and binary.
wolffd@0 323 <PRE>
wolffd@0 324 discrete_nodes = 1:N;
wolffd@0 325 node_sizes = 2*ones(1,N);
wolffd@0 326 </pre>
wolffd@0 327 If the nodes were not binary, you could type e.g.,
wolffd@0 328 <pre>
wolffd@0 329 node_sizes = [4 2 3 5];
wolffd@0 330 </pre>
wolffd@0 331 meaning that Cloudy has 4 possible values,
wolffd@0 332 Sprinkler has 2 possible values, etc.
wolffd@0 333 Note that these are cardinal values, not ordinal, i.e.,
wolffd@0 334 they are not ordered in any way, like 'low', 'medium', 'high'.
wolffd@0 335 <p>
wolffd@0 336 We are now ready to make the Bayes net:
wolffd@0 337 <pre>
wolffd@0 338 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes);
wolffd@0 339 </PRE>
wolffd@0 340 By default, all nodes are assumed to be discrete, so we can also just
wolffd@0 341 write
wolffd@0 342 <pre>
wolffd@0 343 bnet = mk_bnet(dag, node_sizes);
wolffd@0 344 </PRE>
wolffd@0 345 You may also specify which nodes will be observed.
If you don't know, or if this is not fixed in advance,
wolffd@0 347 just use the empty list (the default).
wolffd@0 348 <pre>
wolffd@0 349 onodes = [];
wolffd@0 350 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);
wolffd@0 351 </PRE>
wolffd@0 352 Note that optional arguments are specified using a name/value syntax.
wolffd@0 353 This is common for many BNT functions.
wolffd@0 354 In general, to find out more about a function (e.g., which optional
wolffd@0 355 arguments it takes), please see its
wolffd@0 356 documentation string by typing
wolffd@0 357 <pre>
wolffd@0 358 help mk_bnet
wolffd@0 359 </pre>
wolffd@0 360 See also other <a href="matlab_tips.html">useful Matlab tips</a>.
wolffd@0 361 <p>
wolffd@0 362 It is possible to associate names with nodes, as follows:
wolffd@0 363 <pre>
wolffd@0 364 bnet = mk_bnet(dag, node_sizes, 'names', {'cloudy','S','R','W'}, 'discrete', 1:4);
wolffd@0 365 </pre>
wolffd@0 366 You can then refer to a node by its name:
wolffd@0 367 <pre>
wolffd@0 368 C = bnet.names{'cloudy'}; % bnet.names is an associative array
wolffd@0 369 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
wolffd@0 370 </pre>
wolffd@0 371
wolffd@0 372
wolffd@0 373 <h2><a name="cpt">Parameters</h2>
wolffd@0 374
wolffd@0 375 A model consists of the graph structure and the parameters.
wolffd@0 376 The parameters are represented by CPD objects (CPD = Conditional
wolffd@0 377 Probability Distribution), which define the probability distribution
wolffd@0 378 of a node given its parents.
wolffd@0 379 (We will use the terms "node" and "random variable" interchangeably.)
wolffd@0 380 The simplest kind of CPD is a table (multi-dimensional array), which
wolffd@0 381 is suitable when all the nodes are discrete-valued. Note that the discrete
wolffd@0 382 values are not assumed to be ordered in any way; that is, they
wolffd@0 383 represent categorical quantities, like male and female, rather than
wolffd@0 384 ordinal quantities, like low, medium and high.
wolffd@0 385 (We will discuss CPDs in more detail <a href="#cpd">below</a>.)
wolffd@0 386 <p>
wolffd@0 387 Tabular CPDs, also called CPTs (conditional probability tables),
wolffd@0 388 are stored as multidimensional arrays, where the dimensions
wolffd@0 389 are arranged in the same order as the nodes, e.g., the CPT for node 4
wolffd@0 390 (WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself.
wolffd@0 391 Hence the child is always the last dimension.
wolffd@0 392 If a node has no parents, its CPT is a column vector representing its
wolffd@0 393 prior.
wolffd@0 394 Note that in Matlab (unlike C), arrays are indexed
from 1, and are laid out in memory such that the first index toggles
wolffd@0 396 fastest, e.g., the CPT for node 4 (WetGrass) is as follows
wolffd@0 397 <P>
wolffd@0 398 <P><IMG ALIGN=BOTTOM SRC="Figures/CPTgrass.gif"><P>
wolffd@0 399 <P>
wolffd@0 400 where we have used the convention that false==1, true==2.
wolffd@0 401 We can create this CPT in Matlab as follows
wolffd@0 402 <PRE>
wolffd@0 403 CPT = zeros(2,2,2);
wolffd@0 404 CPT(1,1,1) = 1.0;
wolffd@0 405 CPT(2,1,1) = 0.1;
wolffd@0 406 ...
wolffd@0 407 </PRE>
wolffd@0 408 Here is an easier way:
wolffd@0 409 <PRE>
wolffd@0 410 CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);
wolffd@0 411 </PRE>
wolffd@0 412 In fact, we don't need to reshape the array, since the CPD constructor
wolffd@0 413 will do that for us. So we can just write
wolffd@0 414 <pre>
wolffd@0 415 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
wolffd@0 416 </pre>
wolffd@0 417 The other nodes are created similarly (using the old syntax for
wolffd@0 418 optional parameters)
wolffd@0 419 <PRE>
wolffd@0 420 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
wolffd@0 421 bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
wolffd@0 422 bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
wolffd@0 423 bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
wolffd@0 424 </PRE>
wolffd@0 425
wolffd@0 426
wolffd@0 427 <h2><a name="rnd_cpt">Random Parameters</h2>
wolffd@0 428
wolffd@0 429 If we do not specify the CPT, random parameters will be
wolffd@0 430 created, i.e., each "row" of the CPT will be drawn from the uniform distribution.
wolffd@0 431 To ensure repeatable results, use
wolffd@0 432 <pre>
wolffd@0 433 rand('state', seed);
wolffd@0 434 randn('state', seed);
wolffd@0 435 </pre>
wolffd@0 436 To control the degree of randomness (entropy),
wolffd@0 437 you can sample each row of the CPT from a Dirichlet(p,p,...) distribution.
wolffd@0 438 If p << 1, this encourages "deterministic" CPTs (one entry near 1, the rest near 0).
wolffd@0 439 If p = 1, each entry is drawn from U[0,1].
wolffd@0 440 If p >> 1, the entries will all be near 1/k, where k is the arity of
wolffd@0 441 this node, i.e., each row will be nearly uniform.
wolffd@0 442 You can do this as follows, assuming this node
wolffd@0 443 is number i, and ns is the node_sizes.
wolffd@0 444 <pre>
wolffd@0 445 k = ns(i);
wolffd@0 446 ps = parents(dag, i);
wolffd@0 447 psz = prod(ns(ps));
wolffd@0 448 CPT = sample_dirichlet(p*ones(1,k), psz);
wolffd@0 449 bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT);
wolffd@0 450 </pre>
wolffd@0 451
wolffd@0 452
wolffd@0 453 <h2><a name="file">Loading a network from a file</h2>
wolffd@0 454
wolffd@0 455 If you already have a Bayes net represented in the XML-based
wolffd@0 456 <a href="http://www.cs.cmu.edu/afs/cs/user/fgcozman/www/Research/InterchangeFormat/">
wolffd@0 457 Bayes Net Interchange Format (BNIF)</a> (e.g., downloaded from the
wolffd@0 458 <a
wolffd@0 459 href="http://www.cs.huji.ac.il/labs/compbio/Repository">
wolffd@0 460 Bayes Net repository</a>),
wolffd@0 461 you can convert it to BNT format using
wolffd@0 462 the
wolffd@0 463 <a href="http://www.digitas.harvard.edu/~ken/bif2bnt/">BIF-BNT Java
wolffd@0 464 program</a> written by Ken Shan.
wolffd@0 465 (This is not necessarily up-to-date.)
wolffd@0 466
wolffd@0 467
wolffd@0 468 <h2>Creating a model using a GUI</h2>
wolffd@0 469
wolffd@0 470 Click <a href="http://bnt.insa-rouen.fr/ajouts.html">here</a>.
wolffd@0 471
wolffd@0 472
wolffd@0 473
wolffd@0 474 <h1><a name="inference">Inference</h1>
wolffd@0 475
wolffd@0 476 Having created the BN, we can now use it for inference.
There are many different algorithms for doing inference in Bayes nets
wolffd@0 478 that make different tradeoffs between speed,
wolffd@0 479 complexity, generality, and accuracy.
wolffd@0 480 BNT therefore offers a variety of different inference
wolffd@0 481 "engines". We will discuss these
wolffd@0 482 in more detail <a href="#engines">below</a>.
wolffd@0 483 For now, we will use the junction tree
wolffd@0 484 engine, which is the mother of all exact inference algorithms.
wolffd@0 485 This can be created as follows.
wolffd@0 486 <pre>
wolffd@0 487 engine = jtree_inf_engine(bnet);
wolffd@0 488 </pre>
wolffd@0 489 The other engines have similar constructors, but might take
wolffd@0 490 additional, algorithm-specific parameters.
wolffd@0 491 All engines are used in the same way, once they have been created.
wolffd@0 492 We illustrate this in the following sections.
wolffd@0 493
wolffd@0 494
wolffd@0 495 <h2><a name="marginal">Computing marginal distributions</h2>
wolffd@0 496
Suppose we want to compute the probability that the sprinkler was on
wolffd@0 498 given that the grass is wet.
wolffd@0 499 The evidence consists of the fact that W=2. All the other nodes
wolffd@0 500 are hidden (unobserved). We can specify this as follows.
wolffd@0 501 <pre>
wolffd@0 502 evidence = cell(1,N);
wolffd@0 503 evidence{W} = 2;
wolffd@0 504 </pre>
wolffd@0 505 We use a 1D cell array instead of a vector to
wolffd@0 506 cope with the fact that nodes can be vectors of different lengths.
wolffd@0 507 In addition, the value [] can be used
wolffd@0 508 to denote 'no evidence', instead of having to specify the observation
wolffd@0 509 pattern as a separate argument.
wolffd@0 510 (Click <a href="cellarray.html">here</a> for a quick tutorial on cell
wolffd@0 511 arrays in matlab.)
wolffd@0 512 <p>
wolffd@0 513 We are now ready to add the evidence to the engine.
wolffd@0 514 <pre>
wolffd@0 515 [engine, loglik] = enter_evidence(engine, evidence);
wolffd@0 516 </pre>
wolffd@0 517 The behavior of this function is algorithm-specific, and is discussed
wolffd@0 518 in more detail <a href="#engines">below</a>.
wolffd@0 519 In the case of the jtree engine,
wolffd@0 520 enter_evidence implements a two-pass message-passing scheme.
wolffd@0 521 The first return argument contains the modified engine, which
wolffd@0 522 incorporates the evidence. The second return argument contains the
wolffd@0 523 log-likelihood of the evidence. (Not all engines are capable of
wolffd@0 524 computing the log-likelihood.)
wolffd@0 525 <p>
wolffd@0 526 Finally, we can compute p=P(S=2|W=2) as follows.
wolffd@0 527 <PRE>
wolffd@0 528 marg = marginal_nodes(engine, S);
wolffd@0 529 marg.T
wolffd@0 530 ans =
wolffd@0 531 0.57024
wolffd@0 532 0.42976
wolffd@0 533 p = marg.T(2);
wolffd@0 534 </PRE>
wolffd@0 535 We see that p = 0.4298.
wolffd@0 536 <p>
wolffd@0 537 Now let us add the evidence that it was raining, and see what
wolffd@0 538 difference it makes.
wolffd@0 539 <PRE>
wolffd@0 540 evidence{R} = 2;
wolffd@0 541 [engine, loglik] = enter_evidence(engine, evidence);
wolffd@0 542 marg = marginal_nodes(engine, S);
wolffd@0 543 p = marg.T(2);
wolffd@0 544 </PRE>
wolffd@0 545 We find that p = P(S=2|W=2,R=2) = 0.1945,
wolffd@0 546 which is lower than
wolffd@0 547 before, because the rain can ``explain away'' the
wolffd@0 548 fact that the grass is wet.
wolffd@0 549 <p>
wolffd@0 550 You can plot a marginal distribution over a discrete variable
as a bar chart using the built-in 'bar' function:
wolffd@0 552 <pre>
wolffd@0 553 bar(marg.T)
wolffd@0 554 </pre>
wolffd@0 555 This is what it looks like
wolffd@0 556
wolffd@0 557 <p>
wolffd@0 558 <center>
wolffd@0 559 <IMG SRC="Figures/sprinkler_bar.gif">
wolffd@0 560 </center>
wolffd@0 561 <p>
wolffd@0 562
wolffd@0 563 <h2><a name="observed">Observed nodes</h2>
wolffd@0 564
wolffd@0 565 What happens if we ask for the marginal on an observed node, e.g. P(W|W=2)?
wolffd@0 566 An observed discrete node effectively only has 1 value (the observed
wolffd@0 567 one) --- all other values would result in 0 probability.
For efficiency, BNT treats observed (discrete) nodes as if their
effective size were 1, as we see below:
wolffd@0 570 <pre>
wolffd@0 571 evidence = cell(1,N);
wolffd@0 572 evidence{W} = 2;
wolffd@0 573 engine = enter_evidence(engine, evidence);
wolffd@0 574 m = marginal_nodes(engine, W);
wolffd@0 575 m.T
wolffd@0 576 ans =
wolffd@0 577 1
wolffd@0 578 </pre>
wolffd@0 579 This can get a little confusing, since we assigned W=2.
wolffd@0 580 So we can ask BNT to add the evidence back in by passing in an optional argument:
wolffd@0 581 <pre>
wolffd@0 582 m = marginal_nodes(engine, W, 1);
wolffd@0 583 m.T
wolffd@0 584 ans =
wolffd@0 585 0
wolffd@0 586 1
wolffd@0 587 </pre>
wolffd@0 588 This shows that P(W=1|W=2) = 0 and P(W=2|W=2) = 1.
wolffd@0 589
wolffd@0 590
wolffd@0 591
wolffd@0 592 <h2><a name="joint">Computing joint distributions</h2>
wolffd@0 593
wolffd@0 594 We can compute the joint probability on a set of nodes as in the
wolffd@0 595 following example.
wolffd@0 596 <pre>
wolffd@0 597 evidence = cell(1,N);
wolffd@0 598 [engine, ll] = enter_evidence(engine, evidence);
wolffd@0 599 m = marginal_nodes(engine, [S R W]);
wolffd@0 600 </pre>
wolffd@0 601 m is a structure. The 'T' field is a multi-dimensional array (in
wolffd@0 602 this case, 3-dimensional) that contains the joint probability
wolffd@0 603 distribution on the specified nodes.
wolffd@0 604 <pre>
wolffd@0 605 >> m.T
wolffd@0 606 ans(:,:,1) =
wolffd@0 607 0.2900 0.0410
wolffd@0 608 0.0210 0.0009
wolffd@0 609 ans(:,:,2) =
wolffd@0 610 0 0.3690
wolffd@0 611 0.1890 0.0891
wolffd@0 612 </pre>
wolffd@0 613 We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass
wolffd@0 614 to be wet if both the rain and sprinkler are off.
wolffd@0 615 <p>
wolffd@0 616 Let us now add some evidence to R.
wolffd@0 617 <pre>
wolffd@0 618 evidence{R} = 2;
wolffd@0 619 [engine, ll] = enter_evidence(engine, evidence);
wolffd@0 620 m = marginal_nodes(engine, [S R W])
wolffd@0 621 m =
wolffd@0 622 domain: [2 3 4]
wolffd@0 623 T: [2x1x2 double]
wolffd@0 624 >> m.T
wolffd@0 625 m.T
wolffd@0 626 ans(:,:,1) =
wolffd@0 627 0.0820
wolffd@0 628 0.0018
wolffd@0 629 ans(:,:,2) =
wolffd@0 630 0.7380
wolffd@0 631 0.1782
wolffd@0 632 </pre>
wolffd@0 633 The joint T(i,j,k) = P(S=i,R=j,W=k|evidence)
wolffd@0 634 should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible
wolffd@0 635 with the evidence that R=2.
wolffd@0 636 Instead of creating large tables with many 0s, BNT sets the effective
wolffd@0 637 size of observed (discrete) nodes to 1, as explained above.
wolffd@0 638 This is why m.T has size 2x1x2.
wolffd@0 639 To get a 2x2x2 table, type
wolffd@0 640 <pre>
wolffd@0 641 m = marginal_nodes(engine, [S R W], 1)
wolffd@0 642 m =
wolffd@0 643 domain: [2 3 4]
wolffd@0 644 T: [2x2x2 double]
wolffd@0 645 >> m.T
wolffd@0 646 m.T
wolffd@0 647 ans(:,:,1) =
wolffd@0 648 0 0.082
wolffd@0 649 0 0.0018
wolffd@0 650 ans(:,:,2) =
wolffd@0 651 0 0.738
wolffd@0 652 0 0.1782
wolffd@0 653 </pre>
wolffd@0 654
wolffd@0 655 <p>
wolffd@0 656 Note: It is not always possible to compute the joint on arbitrary
wolffd@0 657 sets of nodes: it depends on which inference engine you use, as discussed
wolffd@0 658 in more detail <a href="#engines">below</a>.
wolffd@0 659
wolffd@0 660
wolffd@0 661 <h2><a name="soft">Soft/virtual evidence</h2>
wolffd@0 662
wolffd@0 663 Sometimes a node is not observed, but we have some distribution over
wolffd@0 664 its possible values; this is often called "soft" or "virtual"
wolffd@0 665 evidence.
wolffd@0 666 One can use this as follows
wolffd@0 667 <pre>
wolffd@0 668 [engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
wolffd@0 669 </pre>
wolffd@0 670 where soft_evidence{i} is either [] (if node i has no soft evidence)
wolffd@0 671 or is a vector representing the probability distribution over i's
wolffd@0 672 possible values.
wolffd@0 673 For example, if we don't know i's exact value, but we know its
wolffd@0 674 likelihood ratio is 60/40, we can write evidence{i} = [] and
wolffd@0 675 soft_evidence{i} = [0.6 0.4].
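<p>
For instance, on the sprinkler network above, a minimal sketch of
combining hard evidence on WetGrass with 60/40 soft evidence on the
Sprinkler node might look like this:
<pre>
evidence = cell(1,N);
evidence{W} = 2;               % hard evidence: the grass is wet
soft_evidence = cell(1,N);
soft_evidence{S} = [0.6 0.4];  % soft evidence on the sprinkler node
[engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
marg = marginal_nodes(engine, R);
</pre>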
wolffd@0 676 <p>
wolffd@0 677 Currently only jtree_inf_engine supports this option.
wolffd@0 678 It assumes that all hidden nodes, and all nodes for
wolffd@0 679 which we have soft evidence, are discrete.
wolffd@0 680 For a longer example, see BNT/examples/static/softev1.m.
wolffd@0 681
wolffd@0 682
wolffd@0 683 <h2><a name="mpe">Most probable explanation</h2>
wolffd@0 684
wolffd@0 685 To compute the most probable explanation (MPE) of the evidence (i.e.,
wolffd@0 686 the most probable assignment, or a mode of the joint), use
wolffd@0 687 <pre>
wolffd@0 688 [mpe, ll] = calc_mpe(engine, evidence);
wolffd@0 689 </pre>
wolffd@0 690 mpe{i} is the most likely value of node i.
wolffd@0 691 This calls enter_evidence with the 'maximize' flag set to 1, which
wolffd@0 692 causes the engine to do max-product instead of sum-product.
wolffd@0 693 The resulting max-marginals are then thresholded.
wolffd@0 694 If there is more than one maximum probability assignment, we must take
wolffd@0 695 care to break ties in a consistent manner (thresholding the
wolffd@0 696 max-marginals may give the wrong result). To force this behavior,
wolffd@0 697 type
wolffd@0 698 <pre>
wolffd@0 699 [mpe, ll] = calc_mpe(engine, evidence, 1);
wolffd@0 700 </pre>
Note that computing the MPE is sometimes called abductive reasoning.
wolffd@0 702
wolffd@0 703 <p>
You can also use <tt>calc_mpe_bucket</tt>, written by Ron Zohar,
which does a forwards max-product pass and then a backwards traceback
pass (this is how Viterbi is traditionally implemented).
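<p>
To make this concrete, here is a minimal sketch of computing the MPE
for the sprinkler network, given only that the grass is wet:
<pre>
evidence = cell(1,N);
evidence{W} = 2;                  % observe wet grass
[mpe, ll] = calc_mpe(engine, evidence);
mpe{S}                            % most likely value of the Sprinkler node
mpe{R}                            % most likely value of the Rain node
</pre>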
wolffd@0 707
wolffd@0 708
wolffd@0 709
wolffd@0 710 <h1><a name="cpd">Conditional Probability Distributions</h1>
wolffd@0 711
A Conditional Probability Distribution (CPD)
defines P(X(i) | X(Pa(i))), where X(i) is the i'th node, and X(Pa(i))
wolffd@0 714 are the parents of node i. There are many ways to represent this
wolffd@0 715 distribution, which depend in part on whether X(i) and X(Pa(i)) are
wolffd@0 716 discrete, continuous, or a combination.
wolffd@0 717 We will discuss various representations below.
wolffd@0 718
wolffd@0 719
wolffd@0 720 <h2><a name="tabular">Tabular nodes</h2>
wolffd@0 721
wolffd@0 722 If the CPD is represented as a table (i.e., if it is a multinomial
wolffd@0 723 distribution), it has a number of parameters that is exponential in
wolffd@0 724 the number of parents. See the example <a href="#cpt">above</a>.
wolffd@0 725
wolffd@0 726
wolffd@0 727 <h2><a name="noisyor">Noisy-or nodes</h2>
wolffd@0 728
wolffd@0 729 A noisy-OR node is like a regular logical OR gate except that
wolffd@0 730 sometimes the effects of parents that are on get inhibited.
wolffd@0 731 Let the prob. that parent i gets inhibited be q(i).
wolffd@0 732 Then a node, C, with 2 parents, A and B, has the following CPD, where
wolffd@0 733 we use F and T to represent off and on (1 and 2 in BNT).
wolffd@0 734 <pre>
wolffd@0 735 A B P(C=off) P(C=on)
wolffd@0 736 ---------------------------
wolffd@0 737 F F 1.0 0.0
wolffd@0 738 T F q(A) 1-q(A)
wolffd@0 739 F T q(B) 1-q(B)
T T q(A)q(B) 1-q(A)q(B)
wolffd@0 741 </pre>
wolffd@0 742 Thus we see that the causes get inhibited independently.
wolffd@0 743 It is common to associate a "leak" node with a noisy-or CPD, which is
wolffd@0 744 like a parent that is always on. This can account for all other unmodelled
wolffd@0 745 causes which might turn the node on.
wolffd@0 746 <p>
wolffd@0 747 The noisy-or distribution is similar to the logistic distribution.
wolffd@0 748 To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j)
wolffd@0 749 be the prob. that j inhibits i. Then
wolffd@0 750 <pre>
wolffd@0 751 Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j)
wolffd@0 752 </pre>
wolffd@0 753 Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then
wolffd@0 754 <pre>
wolffd@0 755 Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j))
wolffd@0 756 </pre>
wolffd@0 757 For a sigmoid node, we have
wolffd@0 758 <pre>
wolffd@0 759 Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j))
wolffd@0 760 </pre>
wolffd@0 761 where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of
wolffd@0 762 the activation function (although both are monotonically increasing).
wolffd@0 763 In addition, in the case of a noisy-or, the weights are constrained to be
wolffd@0 764 positive, since they derive from probabilities q(i,j).
wolffd@0 765 In both cases, the number of parameters is <em>linear</em> in the
wolffd@0 766 number of parents, unlike the case of a multinomial distribution,
wolffd@0 767 where the number of parameters is exponential in the number of parents.
wolffd@0 768 We will see an example of noisy-OR nodes <a href="#qmr">below</a>.
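<p>
As a sanity check on the table above, here is a small sketch in plain
Matlab (no BNT classes) that builds the CPT of a 2-parent noisy-OR
node from hypothetical inhibition probabilities qA and qB:
<pre>
qA = 0.1; qB = 0.2;            % hypothetical inhibition probabilities
CPT = zeros(2,2,2);            % indexed by (A,B,C), with off=1, on=2
CPT(:,:,1) = [1.0  qB;         % P(C=off | A,B): rows index A, columns index B
              qA   qA*qB];
CPT(:,:,2) = 1 - CPT(:,:,1);   % P(C=on | A,B)
</pre>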
wolffd@0 769
wolffd@0 770
wolffd@0 771 <h2><a name="deterministic">Other (noisy) deterministic nodes</h2>
wolffd@0 772
wolffd@0 773 Deterministic CPDs for discrete random variables can be created using
wolffd@0 774 the deterministic_CPD class. It is also possible to 'flip' the output
wolffd@0 775 of the function with some probability, to simulate noise.
wolffd@0 776 The boolean_CPD class is just a special case of a
wolffd@0 777 deterministic CPD, where the parents and child are all binary.
wolffd@0 778 <p>
wolffd@0 779 Both of these classes are just "syntactic sugar" for the tabular_CPD
wolffd@0 780 class.
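<p>
Since these classes are just shorthand for tabular_CPD, you can also
build a deterministic CPD directly as a 0/1 table. For example, here
is a minimal sketch for a hypothetical node i which is the logical AND
of its two binary parents (with false=1, true=2):
<pre>
% P(child=off | parents) followed by P(child=on | parents);
% only the (true,true) parent combination makes the child true.
CPT = reshape([1 1 1 0  0 0 0 1], [2 2 2]);
bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT);
</pre>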
wolffd@0 781
wolffd@0 782
wolffd@0 783
wolffd@0 784 <h2><a name="softmax">Softmax nodes</h2>
wolffd@0 785
wolffd@0 786 If we have a discrete node with a continuous parent,
wolffd@0 787 we can define its CPD using a softmax function
wolffd@0 788 (also known as the multinomial logit function).
wolffd@0 789 This acts like a soft thresholding operator, and is defined as follows:
wolffd@0 790 <pre>
wolffd@0 791 exp(w(:,i)'*x + b(i))
wolffd@0 792 Pr(Q=i | X=x) = -----------------------------
wolffd@0 793 sum_j exp(w(:,j)'*x + b(j))
wolffd@0 794
wolffd@0 795 </pre>
wolffd@0 796 The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the
wolffd@0 797 following interpretation: w(:,i)-w(:,j) is the normal vector to the
wolffd@0 798 decision boundary between classes i and j,
wolffd@0 799 and b(i)-b(j) is its offset (bias). For example, suppose
wolffd@0 800 X is a 2-vector, and Q is binary. Then
wolffd@0 801 <pre>
wolffd@0 802 w = [1 -1;
wolffd@0 803 0 0];
wolffd@0 804
wolffd@0 805 b = [0 0];
wolffd@0 806 </pre>
wolffd@0 807 means class 1 are points in the 2D plane with positive x coordinate,
wolffd@0 808 and class 2 are points in the 2D plane with negative x coordinate.
wolffd@0 809 If w has large magnitude, the decision boundary is sharp, otherwise it
wolffd@0 810 is soft.
wolffd@0 811 In the special case that Q is binary (0/1), the softmax function reduces to the logistic
wolffd@0 812 (sigmoid) function.
wolffd@0 813 <p>
wolffd@0 814 Fitting a softmax function can be done using the iteratively reweighted
wolffd@0 815 least squares (IRLS) algorithm.
wolffd@0 816 We use the implementation from
wolffd@0 817 <a href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>.
wolffd@0 818 Note that since
wolffd@0 819 the softmax distribution is not in the exponential family, it does not
wolffd@0 820 have finite sufficient statistics, and hence we must store all the
wolffd@0 821 training data in uncompressed form.
wolffd@0 822 If this takes too much space, one should use online (stochastic) gradient
wolffd@0 823 descent (not implemented in BNT).
wolffd@0 824 <p>
wolffd@0 825 If a softmax node also has discrete parents,
wolffd@0 826 we use a different set of w/b parameters for each combination of
wolffd@0 827 parent values, as in the <a href="#gaussian">conditional linear
wolffd@0 828 Gaussian CPD</a>.
wolffd@0 829 This feature was implemented by Pierpaolo Brutti.
wolffd@0 830 He is currently extending it so that discrete parents can be treated
wolffd@0 831 as if they were continuous, by adding indicator variables to the X
wolffd@0 832 vector.
wolffd@0 833 <p>
wolffd@0 834 We will see an example of softmax nodes <a href="#mixexp">below</a>.
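<p>
To make the decision-boundary interpretation concrete, here is a small
sketch in plain Matlab that evaluates the softmax for the 2-class
example above at a hypothetical input x:
<pre>
w = [1 -1;
     0  0];                  % w(:,i) is the weight vector for class i
b = [0 0];
x = [2; 0.5];                % hypothetical 2D input with positive x coordinate
a = w'*x + b';               % a(i) = w(:,i)'*x + b(i)
p = exp(a) / sum(exp(a));    % Pr(Q=i | X=x); here class 1 gets most of the mass
</pre>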
wolffd@0 835
wolffd@0 836
wolffd@0 837 <h2><a name="mlp">Neural network nodes</h2>
wolffd@0 838
wolffd@0 839 Pierpaolo Brutti has implemented the mlp_CPD class, which uses a multi layer perceptron
wolffd@0 840 to implement a mapping from continuous parents to discrete children,
wolffd@0 841 similar to the softmax function.
wolffd@0 842 (If there are also discrete parents, it creates a mixture of MLPs.)
wolffd@0 843 It uses code from <a
wolffd@0 844 href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>.
wolffd@0 845 This is work in progress.
wolffd@0 846
wolffd@0 847 <h2><a name="root">Root nodes</h2>
wolffd@0 848
wolffd@0 849 A root node has no parents and no parameters; it can be used to model
wolffd@0 850 an observed, exogeneous input variable, i.e., one which is "outside"
wolffd@0 851 the model.
wolffd@0 852 This is useful for conditional density models.
wolffd@0 853 We will see an example of root nodes <a href="#mixexp">below</a>.
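<p>
A minimal sketch, assuming node i is the observed exogenous input:
<pre>
bnet.CPD{i} = root_CPD(bnet, i);  % root nodes carry no parameters
</pre>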
wolffd@0 854
wolffd@0 855
wolffd@0 856 <h2><a name="gaussian">Gaussian nodes</h2>
wolffd@0 857
wolffd@0 858 We now consider a distribution suitable for the continuous-valued nodes.
wolffd@0 859 Suppose the node is called Y, its continuous parents (if any) are
wolffd@0 860 called X, and its discrete parents (if any) are called Q.
wolffd@0 861 The distribution on Y is defined as follows:
wolffd@0 862 <pre>
wolffd@0 863 - no parents: Y ~ N(mu, Sigma)
wolffd@0 864 - cts parents : Y|X=x ~ N(mu + W x, Sigma)
wolffd@0 865 - discrete parents: Y|Q=i ~ N(mu(:,i), Sigma(:,:,i))
wolffd@0 866 - cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i))
wolffd@0 867 </pre>
wolffd@0 868 where N(mu, Sigma) denotes a Normal distribution with mean mu and
wolffd@0 869 covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q
wolffd@0 870 respectively.
wolffd@0 871 If there are no discrete parents, |Q|=1; if there is
wolffd@0 872 more than one, then |Q| = a vector of the sizes of each discrete parent.
wolffd@0 873 If there are no continuous parents, |X|=0; if there is more than one,
wolffd@0 874 then |X| = the sum of their sizes.
wolffd@0 875 Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive
wolffd@0 876 semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight)
wolffd@0 877 matrix.
wolffd@0 878 <p>
wolffd@0 879 We can create a Gaussian node with random parameters as follows.
wolffd@0 880 <pre>
wolffd@0 881 bnet.CPD{i} = gaussian_CPD(bnet, i);
wolffd@0 882 </pre>
wolffd@0 883 We can specify the value of one or more of the parameters as in the
wolffd@0 884 following example, in which |Y|=2, and |Q|=1.
wolffd@0 885 <pre>
wolffd@0 886 bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', [0; 0], 'weights', randn(Y,X), 'cov', eye(Y));
wolffd@0 887 </pre>
wolffd@0 888 <p>
wolffd@0 889 We will see an example of conditional linear Gaussian nodes <a
wolffd@0 890 href="#cg_model">below</a>.
wolffd@0 891 <p>
wolffd@0 892 <b>When learning Gaussians from data</b>, it is helpful to ensure the
data has a small magnitude
wolffd@0 894 (see e.g., KPMstats/standardize) to prevent numerical problems.
wolffd@0 895 Unless you have a lot of data, it is also a very good idea to use
wolffd@0 896 diagonal instead of full covariance matrices.
wolffd@0 897 (BNT does not currently support spherical covariances, although it
wolffd@0 898 would be easy to add, since KPMstats/clg_Mstep supports this option;
wolffd@0 899 you would just need to modify gaussian_CPD/update_ess to accumulate
wolffd@0 900 weighted inner products.)
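<p>
For example, here is a small sketch of rescaling a training set by
hand before fitting (plain Matlab, shown instead of the
KPMstats/standardize helper; the data matrix is assumed to have one
case per column):
<pre>
mu = mean(data, 2);
sigma = std(data, 0, 2);
sigma(sigma==0) = 1;   % guard against constant components
ncases = size(data, 2);
data = (data - repmat(mu, 1, ncases)) ./ repmat(sigma, 1, ncases);
</pre>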
wolffd@0 901
wolffd@0 902
wolffd@0 903
wolffd@0 904 <h2><a name="nongauss">Other continuous distributions</h2>
wolffd@0 905
wolffd@0 906 Currently BNT does not support any CPDs for continuous nodes other
wolffd@0 907 than the Gaussian.
wolffd@0 908 However, you can use a mixture of Gaussians to
approximate other continuous distributions. We will see an example
wolffd@0 910 of this with the IFA model <a href="#pca">below</a>.
wolffd@0 911
wolffd@0 912
wolffd@0 913 <h2><a name="glm">Generalized linear model nodes</h2>
wolffd@0 914
wolffd@0 915 In the future, we may incorporate some of the functionality of
wolffd@0 916 <a href =
wolffd@0 917 "http://www.sci.usq.edu.au/staff/dunn/glmlab/glmlab.html">glmlab</a>
wolffd@0 918 into BNT.
wolffd@0 919
wolffd@0 920
wolffd@0 921 <h2><a name="dtree">Classification/regression tree nodes</h2>
wolffd@0 922
wolffd@0 923 We plan to add classification and regression trees to define CPDs for
wolffd@0 924 discrete and continuous nodes, respectively.
wolffd@0 925 Trees have many advantages: they are easy to interpret, they can do
wolffd@0 926 feature selection, they can
wolffd@0 927 handle discrete and continuous inputs, they do not make strong
wolffd@0 928 assumptions about the form of the distribution, the number of
wolffd@0 929 parameters can grow in a data-dependent way (i.e., they are
wolffd@0 930 semi-parametric), they can handle missing data, etc.
wolffd@0 931 However, they are not yet implemented.
wolffd@0 932 <!--
wolffd@0 933 Yimin Zhang is currently (Feb '02) implementing this.
wolffd@0 934 -->
wolffd@0 935
wolffd@0 936
wolffd@0 937 <h2><a name="cpd_summary">Summary of CPD types</h2>
wolffd@0 938
wolffd@0 939 We list all the different types of CPDs supported by BNT.
wolffd@0 940 For each CPD, we specify if the child and parents can be discrete (D) or
wolffd@0 941 continuous (C) (Binary (B) nodes are a special case).
wolffd@0 942 We also specify which methods each class supports.
wolffd@0 943 If a method is inherited, the name of the parent class is mentioned.
wolffd@0 944 If a parent class calls a child method, this is mentioned.
wolffd@0 945 <p>
wolffd@0 946 The <tt>CPD_to_CPT</tt> method converts a CPD to a table; this
wolffd@0 947 requires that the child and all parents are discrete.
wolffd@0 948 The CPT might be exponentially big...
wolffd@0 949 <tt>convert_to_table</tt> evaluates a CPD with evidence, and
represents the resulting potential as an array.
wolffd@0 951 This requires that the child is discrete, and any continuous parents
wolffd@0 952 are observed.
wolffd@0 953 <tt>convert_to_pot</tt> evaluates a CPD with evidence, and
wolffd@0 954 represents the resulting potential as a dpot, gpot, cgpot or upot, as
wolffd@0 955 requested. (d=discrete, g=Gaussian, cg = conditional Gaussian, u =
wolffd@0 956 utility).
wolffd@0 957
wolffd@0 958 <p>
wolffd@0 959 When we sample a node, all the parents are observed.
wolffd@0 960 When we compute the (log) probability of a node, all the parents and
wolffd@0 961 the child are observed.
wolffd@0 962 <p>
wolffd@0 963 We also specify if the parameters are learnable.
wolffd@0 964 For learning with EM, we require
wolffd@0 965 the methods <tt>reset_ess</tt>, <tt>update_ess</tt> and
wolffd@0 966 <tt>maximize_params</tt>.
wolffd@0 967 For learning from fully observed data, we require
wolffd@0 968 the method <tt>learn_params</tt>.
wolffd@0 969 By default, all classes inherit this from generic_CPD, which simply
wolffd@0 970 calls <tt>update_ess</tt> N times, once for each data case, followed
wolffd@0 971 by <tt>maximize_params</tt>, i.e., it is like EM, without the E step.
wolffd@0 972 Some classes implement a batch formula, which is quicker.
wolffd@0 973 <p>
wolffd@0 974 Bayesian learning means computing a posterior over the parameters
wolffd@0 975 given fully observed data.
wolffd@0 976 <p>
wolffd@0 977 Pearl means we implement the methods <tt>compute_pi</tt> and
wolffd@0 978 <tt>compute_lambda_msg</tt>, used by
wolffd@0 979 <tt>pearl_inf_engine</tt>, which runs on directed graphs.
<tt>belprop_inf_engine</tt> only needs <tt>convert_to_pot</tt>.
wolffd@0 981 The pearl methods can exploit special properties of the CPDs for
wolffd@0 982 computing the messages efficiently, whereas belprop does not.
wolffd@0 983 <p>
wolffd@0 984 The only method implemented by generic_CPD is <tt>adjustable_CPD</tt>,
wolffd@0 985 which is not shown, since it is not very interesting.
wolffd@0 986
wolffd@0 987
wolffd@0 988 <p>
wolffd@0 989
wolffd@0 990
<table border><tr>
wolffd@0 993 <td align=center>Name
wolffd@0 994 <td align=center>Child
wolffd@0 995 <td align=center>Parents
wolffd@0 996 <td align=center>Comments
wolffd@0 997 <td align=center>CPD_to_CPT
wolffd@0 998 <td align=center>conv_to_table
wolffd@0 999 <td align=center>conv_to_pot
wolffd@0 1000 <td align=center>sample
wolffd@0 1001 <td align=center>prob
wolffd@0 1002 <td align=center>learn
wolffd@0 1003 <td align=center>Bayes
wolffd@0 1004 <td align=center>Pearl
wolffd@0 1005
wolffd@0 1006
wolffd@0 1007 <tr>
wolffd@0 1008 <!-- Name--><td>
wolffd@0 1009 <!-- Child--><td>
wolffd@0 1010 <!-- Parents--><td>
wolffd@0 1011 <!-- Comments--><td>
wolffd@0 1012 <!-- CPD_to_CPT--><td>
wolffd@0 1013 <!-- conv_to_table--><td>
wolffd@0 1014 <!-- conv_to_pot--><td>
wolffd@0 1015 <!-- sample--><td>
wolffd@0 1016 <!-- prob--><td>
wolffd@0 1017 <!-- learn--><td>
wolffd@0 1018 <!-- Bayes--><td>
wolffd@0 1019 <!-- Pearl--><td>
wolffd@0 1020
wolffd@0 1021 <tr>
wolffd@0 1022 <!-- Name--><td>boolean
wolffd@0 1023 <!-- Child--><td>B
wolffd@0 1024 <!-- Parents--><td>B
wolffd@0 1025 <!-- Comments--><td>Syntactic sugar for tabular
wolffd@0 1026 <!-- CPD_to_CPT--><td>-
wolffd@0 1027 <!-- conv_to_table--><td>-
wolffd@0 1028 <!-- conv_to_pot--><td>-
wolffd@0 1029 <!-- sample--><td>-
wolffd@0 1030 <!-- prob--><td>-
wolffd@0 1031 <!-- learn--><td>-
wolffd@0 1032 <!-- Bayes--><td>-
wolffd@0 1033 <!-- Pearl--><td>-
wolffd@0 1034
wolffd@0 1035 <tr>
wolffd@0 1036 <!-- Name--><td>deterministic
wolffd@0 1037 <!-- Child--><td>D
wolffd@0 1038 <!-- Parents--><td>D
wolffd@0 1039 <!-- Comments--><td>Syntactic sugar for tabular
wolffd@0 1040 <!-- CPD_to_CPT--><td>-
wolffd@0 1041 <!-- conv_to_table--><td>-
wolffd@0 1042 <!-- conv_to_pot--><td>-
wolffd@0 1043 <!-- sample--><td>-
wolffd@0 1044 <!-- prob--><td>-
wolffd@0 1045 <!-- learn--><td>-
wolffd@0 1046 <!-- Bayes--><td>-
wolffd@0 1047 <!-- Pearl--><td>-
wolffd@0 1048
wolffd@0 1049 <tr>
wolffd@0 1050 <!-- Name--><td>Discrete
wolffd@0 1051 <!-- Child--><td>D
wolffd@0 1052 <!-- Parents--><td>C/D
wolffd@0 1053 <!-- Comments--><td>Virtual class
wolffd@0 1054 <!-- CPD_to_CPT--><td>N
wolffd@0 1055 <!-- conv_to_table--><td>Calls CPD_to_CPT
wolffd@0 1056 <!-- conv_to_pot--><td>Calls conv_to_table
wolffd@0 1057 <!-- sample--><td>Calls conv_to_table
wolffd@0 1058 <!-- prob--><td>Calls conv_to_table
wolffd@0 1059 <!-- learn--><td>N
wolffd@0 1060 <!-- Bayes--><td>N
wolffd@0 1061 <!-- Pearl--><td>N
wolffd@0 1062
wolffd@0 1063 <tr>
wolffd@0 1064 <!-- Name--><td>Gaussian
wolffd@0 1065 <!-- Child--><td>C
wolffd@0 1066 <!-- Parents--><td>C/D
wolffd@0 1067 <!-- Comments--><td>-
wolffd@0 1068 <!-- CPD_to_CPT--><td>N
wolffd@0 1069 <!-- conv_to_table--><td>N
wolffd@0 1070 <!-- conv_to_pot--><td>Y
wolffd@0 1071 <!-- sample--><td>Y
wolffd@0 1072 <!-- prob--><td>Y
wolffd@0 1073 <!-- learn--><td>Y
wolffd@0 1074 <!-- Bayes--><td>N
wolffd@0 1075 <!-- Pearl--><td>N
wolffd@0 1076
wolffd@0 1077 <tr>
wolffd@0 1078 <!-- Name--><td>gmux
wolffd@0 1079 <!-- Child--><td>C
wolffd@0 1080 <!-- Parents--><td>C/D
wolffd@0 1081 <!-- Comments--><td>multiplexer
wolffd@0 1082 <!-- CPD_to_CPT--><td>N
wolffd@0 1083 <!-- conv_to_table--><td>N
wolffd@0 1084 <!-- conv_to_pot--><td>Y
wolffd@0 1085 <!-- sample--><td>N
wolffd@0 1086 <!-- prob--><td>N
wolffd@0 1087 <!-- learn--><td>N
wolffd@0 1088 <!-- Bayes--><td>N
wolffd@0 1089 <!-- Pearl--><td>Y
wolffd@0 1090
wolffd@0 1091
wolffd@0 1092 <tr>
wolffd@0 1093 <!-- Name--><td>MLP
wolffd@0 1094 <!-- Child--><td>D
wolffd@0 1095 <!-- Parents--><td>C/D
wolffd@0 1096 <!-- Comments--><td>multi layer perceptron
wolffd@0 1097 <!-- CPD_to_CPT--><td>N
wolffd@0 1098 <!-- conv_to_table--><td>Y
wolffd@0 1099 <!-- conv_to_pot--><td>Inherits from discrete
wolffd@0 1100 <!-- sample--><td>Inherits from discrete
wolffd@0 1101 <!-- prob--><td>Inherits from discrete
wolffd@0 1102 <!-- learn--><td>Y
wolffd@0 1103 <!-- Bayes--><td>N
wolffd@0 1104 <!-- Pearl--><td>N
wolffd@0 1105
wolffd@0 1106
wolffd@0 1107 <tr>
wolffd@0 1108 <!-- Name--><td>noisy-or
wolffd@0 1109 <!-- Child--><td>B
wolffd@0 1110 <!-- Parents--><td>B
wolffd@0 1111 <!-- Comments--><td>-
wolffd@0 1112 <!-- CPD_to_CPT--><td>Y
wolffd@0 1113 <!-- conv_to_table--><td>Inherits from discrete
wolffd@0 1114 <!-- conv_to_pot--><td>Inherits from discrete
wolffd@0 1115 <!-- sample--><td>Inherits from discrete
wolffd@0 1116 <!-- prob--><td>Inherits from discrete
wolffd@0 1117 <!-- learn--><td>N
wolffd@0 1118 <!-- Bayes--><td>N
wolffd@0 1119 <!-- Pearl--><td>Y
wolffd@0 1120
wolffd@0 1121
wolffd@0 1122 <tr>
wolffd@0 1123 <!-- Name--><td>root
wolffd@0 1124 <!-- Child--><td>C/D
wolffd@0 1125 <!-- Parents--><td>none
wolffd@0 1126 <!-- Comments--><td>no params
wolffd@0 1127 <!-- CPD_to_CPT--><td>N
wolffd@0 1128 <!-- conv_to_table--><td>N
wolffd@0 1129 <!-- conv_to_pot--><td>Y
wolffd@0 1130 <!-- sample--><td>Y
wolffd@0 1131 <!-- prob--><td>Y
wolffd@0 1132 <!-- learn--><td>N
wolffd@0 1133 <!-- Bayes--><td>N
wolffd@0 1134 <!-- Pearl--><td>N
wolffd@0 1135
wolffd@0 1136
wolffd@0 1137 <tr>
wolffd@0 1138 <!-- Name--><td>softmax
wolffd@0 1139 <!-- Child--><td>D
wolffd@0 1140 <!-- Parents--><td>C/D
wolffd@0 1141 <!-- Comments--><td>-
wolffd@0 1142 <!-- CPD_to_CPT--><td>N
wolffd@0 1143 <!-- conv_to_table--><td>Y
wolffd@0 1144 <!-- conv_to_pot--><td>Inherits from discrete
wolffd@0 1145 <!-- sample--><td>Inherits from discrete
wolffd@0 1146 <!-- prob--><td>Inherits from discrete
wolffd@0 1147 <!-- learn--><td>Y
wolffd@0 1148 <!-- Bayes--><td>N
wolffd@0 1149 <!-- Pearl--><td>N
wolffd@0 1150
wolffd@0 1151
wolffd@0 1152 <tr>
wolffd@0 1153 <!-- Name--><td>generic
wolffd@0 1154 <!-- Child--><td>C/D
wolffd@0 1155 <!-- Parents--><td>C/D
wolffd@0 1156 <!-- Comments--><td>Virtual class
wolffd@0 1157 <!-- CPD_to_CPT--><td>N
wolffd@0 1158 <!-- conv_to_table--><td>N
wolffd@0 1159 <!-- conv_to_pot--><td>N
wolffd@0 1160 <!-- sample--><td>N
wolffd@0 1161 <!-- prob--><td>N
wolffd@0 1162 <!-- learn--><td>N
wolffd@0 1163 <!-- Bayes--><td>N
wolffd@0 1164 <!-- Pearl--><td>N
wolffd@0 1165
wolffd@0 1166
wolffd@0 1167 <tr>
wolffd@0 1168 <!-- Name--><td>Tabular
wolffd@0 1169 <!-- Child--><td>D
wolffd@0 1170 <!-- Parents--><td>D
wolffd@0 1171 <!-- Comments--><td>-
wolffd@0 1172 <!-- CPD_to_CPT--><td>Y
wolffd@0 1173 <!-- conv_to_table--><td>Inherits from discrete
wolffd@0 1174 <!-- conv_to_pot--><td>Inherits from discrete
wolffd@0 1175 <!-- sample--><td>Inherits from discrete
wolffd@0 1176 <!-- prob--><td>Inherits from discrete
wolffd@0 1177 <!-- learn--><td>Y
wolffd@0 1178 <!-- Bayes--><td>Y
wolffd@0 1179 <!-- Pearl--><td>Y
wolffd@0 1180
wolffd@0 1181 </table>
wolffd@0 1182
wolffd@0 1183
wolffd@0 1184
wolffd@0 1185 <h1><a name="examples">Example models</h1>
wolffd@0 1186
wolffd@0 1187
wolffd@0 1188 <h2>Gaussian mixture models</h2>
wolffd@0 1189
wolffd@0 1190 Richard W. DeVaul has made a detailed tutorial on how to fit mixtures
wolffd@0 1191 of Gaussians using BNT. Available
wolffd@0 1192 <a href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">here</a>.
wolffd@0 1193
wolffd@0 1194
wolffd@0 1195 <h2><a name="pca">PCA, ICA, and all that </h2>
wolffd@0 1196
wolffd@0 1197 In Figure (a) below, we show how Factor Analysis can be thought of as a
wolffd@0 1198 graphical model. Here, X has an N(0,I) prior, and
wolffd@0 1199 Y|X=x ~ N(mu + Wx, Psi),
wolffd@0 1200 where Psi is diagonal and W is called the "factor loading matrix".
wolffd@0 1201 Since the noise on both X and Y is diagonal, the components of these
wolffd@0 1202 vectors are uncorrelated, and hence can be represented as individual
wolffd@0 1203 scalar nodes, as we show in (b).
wolffd@0 1204 (This is useful if parts of the observations on the Y vector are occasionally missing.)
wolffd@0 1205 We usually take k=|X| << |Y|=D, so the model tries to explain
wolffd@0 1206 many observations using a low-dimensional subspace.
wolffd@0 1207
wolffd@0 1208
wolffd@0 1209 <center>
wolffd@0 1210 <table>
wolffd@0 1211 <tr>
wolffd@0 1212 <td><img src="Figures/fa.gif">
wolffd@0 1213 <td><img src="Figures/fa_scalar.gif">
wolffd@0 1214 <td><img src="Figures/mfa.gif">
wolffd@0 1215 <td><img src="Figures/ifa.gif">
wolffd@0 1216 <tr>
wolffd@0 1217 <td align=center> (a)
wolffd@0 1218 <td align=center> (b)
wolffd@0 1219 <td align=center> (c)
wolffd@0 1220 <td align=center> (d)
wolffd@0 1221 </table>
wolffd@0 1222 </center>
wolffd@0 1223
wolffd@0 1224 <p>
wolffd@0 1225 We can create this model in BNT as follows.
wolffd@0 1226 <pre>
wolffd@0 1227 ns = [k D];
wolffd@0 1228 dag = zeros(2,2);
wolffd@0 1229 dag(1,2) = 1;
wolffd@0 1230 bnet = mk_bnet(dag, ns, 'discrete', []);
wolffd@0 1231 bnet.CPD{1} = gaussian_CPD(bnet, 1, 'mean', zeros(k,1), 'cov', eye(k), ...
wolffd@0 1232 'cov_type', 'diag', 'clamp_mean', 1, 'clamp_cov', 1);
wolffd@0 1233 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(D,1), 'cov', diag(Psi0), 'weights', W0, ...
wolffd@0 1234 'cov_type', 'diag', 'clamp_mean', 1);
wolffd@0 1235 </pre>
wolffd@0 1236
wolffd@0 1237 The root node is clamped to the N(0,I) distribution, so that we will
wolffd@0 1238 not update these parameters during learning.
wolffd@0 1239 The mean of the leaf node is clamped to 0,
wolffd@0 1240 since we assume the data has been centered (had its mean subtracted
wolffd@0 1241 off); this is just for simplicity.
wolffd@0 1242 Finally, the covariance of the leaf node is constrained to be
wolffd@0 1243 diagonal. W0 and Psi0 are the initial parameter guesses.
wolffd@0 1244
wolffd@0 1245 <p>
wolffd@0 1246 We can fit this model (i.e., estimate its parameters in a maximum
wolffd@0 1247 likelihood (ML) sense) using EM, as we
wolffd@0 1248 explain <a href="#em">below</a>.
wolffd@0 1249 Not surprisingly, the ML estimates for mu and Psi turn out to be
wolffd@0 1250 identical to the
wolffd@0 1251 sample mean and variance, which can be computed directly as
wolffd@0 1252 <pre>
wolffd@0 1253 mu_ML = mean(data);
wolffd@0 1254 Psi_ML = diag(cov(data));
wolffd@0 1255 </pre>
wolffd@0 1256 Note that W can only be identified up to a rotation matrix, because of
wolffd@0 1257 the spherical symmetry of the source.
wolffd@0 1258
wolffd@0 1259 <p>
wolffd@0 1260 If we restrict Psi to be spherical, i.e., Psi = sigma*I,
wolffd@0 1261 there is a closed-form solution for W as well,
wolffd@0 1262 i.e., we do not need to use EM.
wolffd@0 1263 In particular, W contains the first |X| eigenvectors of the sample covariance
wolffd@0 1264 matrix, with scalings determined by the eigenvalues and sigma.
wolffd@0 1265 Classical PCA can be obtained by taking the sigma->0 limit.
wolffd@0 1266 For details, see
wolffd@0 1267
wolffd@0 1268 <ul>
wolffd@0 1269 <li> <a href="ftp://hope.caltech.edu/pub/roweis/Empca/empca.ps">
wolffd@0 1270 "EM algorithms for PCA and SPCA"</a>, Sam Roweis, NIPS 97.
wolffd@0 1271 (<a href="ftp://hope.caltech.edu/pub/roweis/Code/empca.tar.gz">
wolffd@0 1272 Matlab software</a>)
wolffd@0 1273
wolffd@0 1274 <p>
wolffd@0 1275 <li>
wolffd@0 1276 <a
wolffd@0 1277 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003>
wolffd@0 1278 "Mixtures of probabilistic principal component analyzers"</a>,
wolffd@0 1279 Tipping and Bishop, Neural Computation 11(2):443--482, 1999.
wolffd@0 1280 </ul>
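<p>
To make the closed-form solution concrete, here is a minimal sketch (not part of BNT) of the probabilistic PCA estimate of W, assuming 'data' is an ncases x D matrix that has already been centered, and reusing k and D from above:
<pre>
S = cov(data);                          % sample covariance matrix (D x D)
[V, Lam] = eig(S);
[lambda, idx] = sort(diag(Lam), 'descend');
V = V(:, idx);
sigma2 = mean(lambda(k+1:end));         % ML noise variance = avg. of the discarded eigenvalues
W_ppca = V(:,1:k) * sqrt(diag(lambda(1:k)) - sigma2*eye(k));
% Classical PCA corresponds to setting sigma2 = 0 above.
</pre>
Any rotation of W_ppca is an equally good solution, which is the non-identifiability of W mentioned above.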
wolffd@0 1281
wolffd@0 1282 <p>
wolffd@0 1283 By adding a hidden discrete variable, we can create mixtures of FA
wolffd@0 1284 models, as shown in (c).
wolffd@0 1285 Now we can explain the data using a set of subspaces.
wolffd@0 1286 We can create this model in BNT as follows.
wolffd@0 1287 <pre>
wolffd@0 1288 ns = [M k D];
wolffd@0 1289 dag = zeros(3);
wolffd@0 1290 dag(1,3) = 1;
wolffd@0 1291 dag(2,3) = 1;
wolffd@0 1292 bnet = mk_bnet(dag, ns, 'discrete', 1);
wolffd@0 1293 bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0);
wolffd@0 1294 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(k, 1), 'cov', eye(k), 'cov_type', 'diag', ...
wolffd@0 1295 'clamp_mean', 1, 'clamp_cov', 1);
wolffd@0 1296 bnet.CPD{3} = gaussian_CPD(bnet, 3, 'mean', Mu0', 'cov', repmat(diag(Psi0), [1 1 M]), ...
wolffd@0 1297 'weights', W0, 'cov_type', 'diag', 'tied_cov', 1);
wolffd@0 1298 </pre>
wolffd@0 1299 Notice how the covariance matrix for Y is the same for all values of
wolffd@0 1300 Q; that is, the noise level in each sub-space is assumed the same.
wolffd@0 1301 However, we allow the offset, mu, to vary.
wolffd@0 1302 For details, see
wolffd@0 1303 <ul>
wolffd@0 1304
wolffd@0 1305 <LI>
wolffd@0 1306 <a HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/tr-96-1.ps.gz"> The EM
wolffd@0 1307 Algorithm for Mixtures of Factor Analyzers </A>,
wolffd@0 1308 Ghahramani, Z. and Hinton, G.E. (1996),
wolffd@0 1309 University of Toronto
wolffd@0 1310 Technical Report CRG-TR-96-1.
wolffd@0 1311 (<A HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/mfa.tar.gz">Matlab software</A>)
wolffd@0 1312
wolffd@0 1313 <p>
wolffd@0 1314 <li>
wolffd@0 1315 <a
wolffd@0 1316 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003>
wolffd@0 1317 "Mixtures of probabilistic principal component analyzers"</a>,
wolffd@0 1318 Tipping and Bishop, Neural Computation 11(2):443--482, 1999.
wolffd@0 1319 </ul>
wolffd@0 1320
wolffd@0 1321 <p>
wolffd@0 1322 I have included Zoubin's specialized MFA code (with his permission)
wolffd@0 1323 with the toolbox, so you can check that BNT gives the same results:
wolffd@0 1324 see 'BNT/examples/static/mfa1.m'.
wolffd@0 1325
wolffd@0 1326 <p>
wolffd@0 1327 Independent Factor Analysis (IFA) generalizes FA by allowing a
wolffd@0 1328 non-Gaussian prior on each component of X.
wolffd@0 1329 (Note that we can approximate a non-Gaussian prior using a mixture of
wolffd@0 1330 Gaussians.)
wolffd@0 1331 This means that the likelihood function is no longer rotationally
wolffd@0 1332 invariant, so we can uniquely identify W and the hidden
wolffd@0 1333 sources X.
wolffd@0 1334 IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y).
wolffd@0 1335 We recover classical Independent Components Analysis (ICA)
wolffd@0 1336 in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the
wolffd@0 1337 weight matrix W is square and invertible.
wolffd@0 1338 For details, see
wolffd@0 1339 <ul>
wolffd@0 1340 <li>
wolffd@0 1341 <a href="http://www.gatsby.ucl.ac.uk/~hagai/ifa.ps">Independent Factor
wolffd@0 1342 Analysis</a>, H. Attias, Neural Computation 11: 803--851, 1998.
wolffd@0 1343 </ul>
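<p>
BNT does not include a dedicated IFA model, but the remark above about approximating a non-Gaussian source prior with a mixture of Gaussians can be illustrated directly. The following is a hedged sketch (the mixture means and variances are made up, and D, W0, Psi0 are reused from the FA example above) for a single source X with a 2-component mixture prior, using a hidden indicator S:
<pre>
S = 1; X = 2; Y = 3;
ns = [2 1 D];
dag = zeros(3);
dag(S,X) = 1;
dag(X,Y) = 1;
bnet = mk_bnet(dag, ns, 'discrete', S);
bnet.CPD{S} = tabular_CPD(bnet, S, 'CPT', [0.5 0.5]);
bnet.CPD{X} = gaussian_CPD(bnet, X, 'mean', [-1 1], 'cov', [1 1]); % one mean/variance per value of S
bnet.CPD{Y} = gaussian_CPD(bnet, Y, 'mean', zeros(D,1), 'cov', diag(Psi0), 'weights', W0, ...
                           'cov_type', 'diag', 'clamp_mean', 1);
</pre>
With several such sources, each with its own indicator, this gives a mixture-of-Gaussians approximation to the IFA source priors shown in (d).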
wolffd@0 1344
wolffd@0 1345
wolffd@0 1346
wolffd@0 1347 <h2><a name="mixexp">Mixtures of experts</h2>
wolffd@0 1348
wolffd@0 1349 As an example of the use of the softmax function,
wolffd@0 1350 we introduce the Mixture of Experts model.
wolffd@0 1351 <!--
wolffd@0 1352 We also show
wolffd@0 1353 the Hierarchical Mixture of Experts model, where the hierarchy has two
wolffd@0 1354 levels.
wolffd@0 1355 (This is essentially a probabilistic decision tree of height two.)
wolffd@0 1356 -->
wolffd@0 1357 As before,
wolffd@0 1358 circles denote continuous-valued nodes,
wolffd@0 1359 squares denote discrete nodes, clear
wolffd@0 1360 means hidden, and shaded means observed.
wolffd@0 1361 <p>
wolffd@0 1362 <center>
wolffd@0 1363 <table>
wolffd@0 1364 <tr>
wolffd@0 1365 <td><img src="Figures/mixexp.gif">
wolffd@0 1366 <!--
wolffd@0 1367 <td><img src="Figures/hme.gif">
wolffd@0 1368 -->
wolffd@0 1369 </table>
wolffd@0 1370 </center>
wolffd@0 1371 <p>
wolffd@0 1372 X is the observed
wolffd@0 1373 input, Y is the output, and
wolffd@0 1374 the Q nodes are hidden "gating" nodes, which select the appropriate
wolffd@0 1375 set of parameters for Y. During training, Y is assumed observed,
wolffd@0 1376 but for testing, the goal is to predict Y given X.
wolffd@0 1377 Note that this is a <em>conditional</em> density model, so we don't
wolffd@0 1378 associate any parameters with X.
wolffd@0 1379 Hence X's CPD will be a root CPD, which is a way of modelling
wolffd@0 1380 exogenous nodes.
wolffd@0 1381 If the output is a continuous-valued quantity,
wolffd@0 1382 we assume the "experts" are linear-regression units,
wolffd@0 1383 and set Y's CPD to linear-Gaussian.
wolffd@0 1384 If the output is discrete, we set Y's CPD to a softmax function.
wolffd@0 1385 The Q CPDs will always be softmax functions.
wolffd@0 1386
wolffd@0 1387 <p>
wolffd@0 1388 As a concrete example, consider the mixture of experts model where X and Y are
wolffd@0 1389 scalars, and Q is binary.
wolffd@0 1390 This is just piecewise linear regression, where
wolffd@0 1391 we have two line segments, i.e.,
wolffd@0 1392 <P>
wolffd@0 1393 <IMG ALIGN=BOTTOM SRC="Eqns/lin_reg_eqn.gif">
wolffd@0 1394 <P>
wolffd@0 1395 We can create this model with random parameters as follows.
wolffd@0 1396 (This code is bundled in BNT/examples/static/mixexp2.m.)
wolffd@0 1397 <PRE>
wolffd@0 1398 X = 1;
wolffd@0 1399 Q = 2;
wolffd@0 1400 Y = 3;
wolffd@0 1401 dag = zeros(3,3);
dag(X,[Q Y]) = 1;
wolffd@0 1403 dag(Q,Y) = 1;
wolffd@0 1404 ns = [1 2 1]; % make X and Y scalars, and have 2 experts
wolffd@0 1405 onodes = [1 3];
wolffd@0 1406 bnet = mk_bnet(dag, ns, 'discrete', 2, 'observed', onodes);
wolffd@0 1407
wolffd@0 1408 rand('state', 0);
wolffd@0 1409 randn('state', 0);
wolffd@0 1410 bnet.CPD{1} = root_CPD(bnet, 1);
wolffd@0 1411 bnet.CPD{2} = softmax_CPD(bnet, 2);
wolffd@0 1412 bnet.CPD{3} = gaussian_CPD(bnet, 3);
wolffd@0 1413 </PRE>
wolffd@0 1414 Now let us fit this model using <a href="#em">EM</a>.
wolffd@0 1415 First we <a href="#load_data">load the data</a> (1000 training cases) and plot them.
wolffd@0 1416 <P>
wolffd@0 1417 <PRE>
wolffd@0 1418 data = load('/examples/static/Misc/mixexp_data.txt', '-ascii');
wolffd@0 1419 plot(data(:,1), data(:,2), '.');
wolffd@0 1420 </PRE>
wolffd@0 1421 <p>
wolffd@0 1422 <center>
wolffd@0 1423 <IMG SRC="Figures/mixexp_data.gif">
wolffd@0 1424 </center>
wolffd@0 1425 <p>
wolffd@0 1426 This is what the model looks like before training.
wolffd@0 1427 (Thanks to Thomas Hofman for writing this plotting routine.)
wolffd@0 1428 <p>
wolffd@0 1429 <center>
wolffd@0 1430 <IMG SRC="Figures/mixexp_before.gif">
wolffd@0 1431 </center>
wolffd@0 1432 <p>
wolffd@0 1433 Now let's train the model, and plot the final performance.
wolffd@0 1434 (We will discuss how to train models in more detail <a href="#param_learning">below</a>.)
wolffd@0 1435 <P>
wolffd@0 1436 <PRE>
wolffd@0 1437 ncases = size(data, 1); % each row of data is a training case
wolffd@0 1438 cases = cell(3, ncases);
wolffd@0 1439 cases([1 3], :) = num2cell(data'); % each column of cases is a training case
wolffd@0 1440 engine = jtree_inf_engine(bnet);
wolffd@0 1441 max_iter = 20;
wolffd@0 1442 [bnet2, LLtrace] = learn_params_em(engine, cases, max_iter);
wolffd@0 1443 </PRE>
(We specified which nodes will be observed when we created the network, so
BNT knows that the hidden nodes are all discrete.
For complex models, this can lead to a significant speedup.)
wolffd@0 1447 Below we show what the model looks like after 16 iterations of EM
wolffd@0 1448 (with 100 IRLS iterations per M step), when it converged
wolffd@0 1449 using the default convergence tolerance (that the
wolffd@0 1450 fractional change in the log-likelihood be less than 1e-3).
wolffd@0 1451 Before learning, the log-likelihood was
wolffd@0 1452 -322.927442; afterwards, it was -13.728778.
wolffd@0 1453 <p>
wolffd@0 1454 <center>
wolffd@0 1455 <IMG SRC="Figures/mixexp_after.gif">
wolffd@0 1456 </center>
wolffd@0 1457 (See BNT/examples/static/mixexp2.m for details of the code.)
wolffd@0 1458
wolffd@0 1459
wolffd@0 1460
wolffd@0 1461 <h2><a name="hme">Hierarchical mixtures of experts</h2>
wolffd@0 1462
wolffd@0 1463 A hierarchical mixture of experts (HME) extends the mixture of experts
wolffd@0 1464 model by having more than one hidden node. A two-level example is shown below, along
wolffd@0 1465 with its more traditional representation as a neural network.
wolffd@0 1466 This is like a (balanced) probabilistic decision tree of height 2.
wolffd@0 1467 <p>
wolffd@0 1468 <center>
wolffd@0 1469 <IMG SRC="Figures/HMEforMatlab.jpg">
wolffd@0 1470 </center>
wolffd@0 1471 <p>
wolffd@0 1472 <a href="mailto:pbrutti@stat.cmu.edu">Pierpaolo Brutti</a>
wolffd@0 1473 has written an extensive set of routines for HMEs,
wolffd@0 1474 which are bundled with BNT: see the examples/static/HME directory.
wolffd@0 1475 These routines allow you to choose the number of hidden (gating)
wolffd@0 1476 layers, and the form of the experts (softmax or MLP).
wolffd@0 1477 See the file hmemenu, which provides a demo.
wolffd@0 1478 For example, the figure below shows the decision boundaries learned
wolffd@0 1479 for a ternary classification problem, using a 2 level HME with softmax
wolffd@0 1480 gates and softmax experts; the training set is on the left, the
wolffd@0 1481 testing set on the right.
wolffd@0 1482 <p>
wolffd@0 1483 <center>
wolffd@0 1484 <!--<IMG SRC="Figures/hme_dec_boundary.gif">-->
wolffd@0 1485 <IMG SRC="Figures/hme_dec_boundary.png">
wolffd@0 1486 </center>
wolffd@0 1487 <p>
wolffd@0 1488
wolffd@0 1489
wolffd@0 1490 <p>
wolffd@0 1491 For more details, see the following:
wolffd@0 1492 <ul>
wolffd@0 1493
wolffd@0 1494 <li> <a href="http://www.cs.berkeley.edu/~jordan/papers/hierarchies.ps.Z">
wolffd@0 1495 Hierarchical mixtures of experts and the EM algorithm</a>
wolffd@0 1496 M. I. Jordan and R. A. Jacobs. Neural Computation, 6, 181-214, 1994.
wolffd@0 1497
wolffd@0 1498 <li> <a href =
wolffd@0 1499 "http://www.cs.berkeley.edu/~dmartin/software">David Martin's
wolffd@0 1500 matlab code for HME</a>
wolffd@0 1501
wolffd@0 1502 <li> <a
wolffd@0 1503 href="http://www.cs.berkeley.edu/~jordan/papers/uai.ps.Z">Why the
wolffd@0 1504 logistic function? A tutorial discussion on
wolffd@0 1505 probabilities and neural networks.</a> M. I. Jordan. MIT Computational
wolffd@0 1506 Cognitive Science Report 9503, August 1995.
wolffd@0 1507
wolffd@0 1508 <li> "Generalized Linear Models", McCullagh and Nelder, Chapman and
wolffd@0 1509 Halll, 1983.
wolffd@0 1510
wolffd@0 1511 <li>
wolffd@0 1512 "Improved learning algorithms for mixtures of experts in multiclass
wolffd@0 1513 classification".
wolffd@0 1514 K. Chen, L. Xu, H. Chi.
wolffd@0 1515 Neural Networks (1999) 12: 1229-1252.
wolffd@0 1516
wolffd@0 1517 <li> <a href="http://www.oigeeza.com/steve/">
wolffd@0 1518 Classification Using Hierarchical Mixtures of Experts</a>
wolffd@0 1519 S.R. Waterhouse and A.J. Robinson.
wolffd@0 1520 In Proc. IEEE Workshop on Neural Network for Signal Processing IV (1994), pp. 177-186
wolffd@0 1521
wolffd@0 1522 <li> <a href="http://www.idiap.ch/~perry/">
wolffd@0 1523 Localized mixtures of experts</a>,
wolffd@0 1524 P. Moerland, 1998.
wolffd@0 1525
wolffd@0 1526 <li> "Nonlinear gated experts for time series",
wolffd@0 1527 A.S. Weigend and M. Mangeas, 1995.
wolffd@0 1528
wolffd@0 1529 </ul>
wolffd@0 1530
wolffd@0 1531
wolffd@0 1532 <h2><a name="qmr">QMR</h2>
wolffd@0 1533
wolffd@0 1534 Bayes nets originally arose out of an attempt to add probabilities to
wolffd@0 1535 expert systems, and this is still the most common use for BNs.
wolffd@0 1536 A famous example is
wolffd@0 1537 QMR-DT, a decision-theoretic reformulation of the Quick Medical
wolffd@0 1538 Reference (QMR) model.
wolffd@0 1539 <p>
wolffd@0 1540 <center>
wolffd@0 1541 <IMG ALIGN=BOTTOM SRC="Figures/qmr.gif">
wolffd@0 1542 </center>
wolffd@0 1543 Here, the top layer represents hidden disease nodes, and the bottom
wolffd@0 1544 layer represents observed symptom nodes.
wolffd@0 1545 The goal is to infer the posterior probability of each disease given
wolffd@0 1546 all the symptoms (which can be present, absent or unknown).
wolffd@0 1547 Each node in the top layer has a Bernoulli prior (with a low prior
wolffd@0 1548 probability that the disease is present).
wolffd@0 1549 Since each node in the bottom layer has a high fan-in, we use a
wolffd@0 1550 noisy-OR parameterization; each disease has an independent chance of
wolffd@0 1551 causing each symptom.
The real QMR-DT model is copyrighted, but
wolffd@0 1553 we can create a random QMR-like model as follows.
wolffd@0 1554 <pre>
wolffd@0 1555 function bnet = mk_qmr_bnet(G, inhibit, leak, prior)
wolffd@0 1556 % MK_QMR_BNET Make a QMR model
wolffd@0 1557 % bnet = mk_qmr_bnet(G, inhibit, leak, prior)
wolffd@0 1558 %
wolffd@0 1559 % G(i,j) = 1 iff there is an arc from disease i to finding j
wolffd@0 1560 % inhibit(i,j) = inhibition probability on i->j arc
wolffd@0 1561 % leak(j) = inhibition prob. on leak->j arc
wolffd@0 1562 % prior(i) = prob. disease i is on
wolffd@0 1563
wolffd@0 1564 [Ndiseases Nfindings] = size(inhibit);
wolffd@0 1565 N = Ndiseases + Nfindings;
wolffd@0 1566 finding_node = Ndiseases+1:N;
wolffd@0 1567 ns = 2*ones(1,N);
wolffd@0 1568 dag = zeros(N,N);
wolffd@0 1569 dag(1:Ndiseases, finding_node) = G;
wolffd@0 1570 bnet = mk_bnet(dag, ns, 'observed', finding_node);
wolffd@0 1571
wolffd@0 1572 for d=1:Ndiseases
wolffd@0 1573 CPT = [1-prior(d) prior(d)];
wolffd@0 1574 bnet.CPD{d} = tabular_CPD(bnet, d, CPT');
wolffd@0 1575 end
wolffd@0 1576
wolffd@0 1577 for i=1:Nfindings
wolffd@0 1578 fnode = finding_node(i);
wolffd@0 1579 ps = parents(G, i);
wolffd@0 1580 bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(ps, i));
wolffd@0 1581 end
wolffd@0 1582 </pre>
wolffd@0 1583 In the file BNT/examples/static/qmr1, we create a random bipartite
wolffd@0 1584 graph G, with 5 diseases and 10 findings, and random parameters.
wolffd@0 1585 (In general, to create a random dag, use 'mk_random_dag'.)
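A hedged sketch of such a construction (the exact sizes and random choices in qmr1 differ) is:
<pre>
Ndiseases = 5; Nfindings = 10;
G = rand(Ndiseases, Nfindings) > 0.5;     % random bipartite structure
inhibit = rand(Ndiseases, Nfindings);     % random inhibition probabilities
leak = 0.01*rand(1, Nfindings);           % small leak probabilities
prior = 0.01*rand(1, Ndiseases);          % low disease priors
bnet = mk_qmr_bnet(G, inhibit, leak, prior);
</pre>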
wolffd@0 1586 We can visualize the resulting graph structure using
wolffd@0 1587 the methods discussed <a href="#graphdraw">below</a>, with the
wolffd@0 1588 following results:
wolffd@0 1589 <p>
wolffd@0 1590 <img src="Figures/qmr.rnd.jpg">
wolffd@0 1591
wolffd@0 1592 <p>
wolffd@0 1593 Now let us put some random evidence on all the leaves except the very
wolffd@0 1594 first and very last, and compute the disease posteriors.
wolffd@0 1595 <pre>
wolffd@0 1596 pos = 2:floor(Nfindings/2);
wolffd@0 1597 neg = (pos(end)+1):(Nfindings-1);
wolffd@0 1598 onodes = myunion(pos, neg);
wolffd@0 1599 evidence = cell(1, N);
wolffd@0 1600 evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos)));
wolffd@0 1601 evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg)));
wolffd@0 1602
wolffd@0 1603 engine = jtree_inf_engine(bnet);
wolffd@0 1604 [engine, ll] = enter_evidence(engine, evidence);
wolffd@0 1605 post = zeros(1, Ndiseases);
wolffd@0 1606 for i=diseases(:)'
wolffd@0 1607 m = marginal_nodes(engine, i);
wolffd@0 1608 post(i) = m.T(2);
wolffd@0 1609 end
wolffd@0 1610 </pre>
wolffd@0 1611 Junction tree can be quite slow on large QMR models.
wolffd@0 1612 Fortunately, it is possible to exploit properties of the noisy-OR
wolffd@0 1613 function to speed up exact inference using an algorithm called
wolffd@0 1614 <a href="#quickscore">quickscore</a>, discussed below.
wolffd@0 1615
wolffd@0 1616
wolffd@0 1617
wolffd@0 1618
wolffd@0 1619
wolffd@0 1620 <h2><a name="cg_model">Conditional Gaussian models</h2>
wolffd@0 1621
wolffd@0 1622 A conditional Gaussian model is one in which, conditioned on all the discrete
wolffd@0 1623 nodes, the distribution over the remaining (continuous) nodes is
wolffd@0 1624 multivariate Gaussian. This means we can have arcs from discrete (D)
wolffd@0 1625 to continuous (C) nodes, but not vice versa.
wolffd@0 1626 (We <em>are</em> allowed C->D arcs if the continuous nodes are observed,
wolffd@0 1627 as in the <a href="#mixexp">mixture of experts</a> model,
wolffd@0 1628 since this distribution can be represented with a discrete potential.)
wolffd@0 1629 <p>
wolffd@0 1630 We now give an example of a CG model, from
the paper "Propagation of Probabilities, Means and
wolffd@0 1632 Variances in Mixed Graphical Association Models", Steffen Lauritzen,
wolffd@0 1633 JASA 87(420):1098--1108, 1992 (reprinted in the book "Probabilistic Networks and Expert
wolffd@0 1634 Systems", R. G. Cowell, A. P. Dawid, S. L. Lauritzen and
wolffd@0 1635 D. J. Spiegelhalter, Springer, 1999.)
wolffd@0 1636
wolffd@0 1637 <h3>Specifying the graph</h3>
wolffd@0 1638
wolffd@0 1639 Consider the model of waste emissions from an incinerator plant shown below.
wolffd@0 1640 We follow the standard convention that shaded nodes are observed,
wolffd@0 1641 clear nodes are hidden.
wolffd@0 1642 We also use the non-standard convention that
wolffd@0 1643 square nodes are discrete (tabular) and round nodes are
wolffd@0 1644 Gaussian.
wolffd@0 1645
wolffd@0 1646 <p>
wolffd@0 1647 <center>
wolffd@0 1648 <IMG SRC="Figures/cg1.gif">
wolffd@0 1649 </center>
wolffd@0 1650 <p>
wolffd@0 1651
wolffd@0 1652 We can create this model as follows.
wolffd@0 1653 <pre>
wolffd@0 1654 F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9;
wolffd@0 1655 n = 9;
wolffd@0 1656
wolffd@0 1657 dag = zeros(n);
wolffd@0 1658 dag(F,E)=1;
wolffd@0 1659 dag(W,[E Min D]) = 1;
wolffd@0 1660 dag(E,D)=1;
wolffd@0 1661 dag(B,[C D])=1;
wolffd@0 1662 dag(D,[L Mout])=1;
wolffd@0 1663 dag(Min,Mout)=1;
wolffd@0 1664
wolffd@0 1665 % node sizes - all cts nodes are scalar, all discrete nodes are binary
wolffd@0 1666 ns = ones(1, n);
wolffd@0 1667 dnodes = [F W B];
wolffd@0 1668 cnodes = mysetdiff(1:n, dnodes);
wolffd@0 1669 ns(dnodes) = 2;
wolffd@0 1670
wolffd@0 1671 bnet = mk_bnet(dag, ns, 'discrete', dnodes);
wolffd@0 1672 </pre>
wolffd@0 1673 'dnodes' is a list of the discrete nodes; 'cnodes' is the continuous
wolffd@0 1674 nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'.
wolffd@0 1675 <p>
wolffd@0 1676
wolffd@0 1677
wolffd@0 1678 <h3>Specifying the parameters</h3>
wolffd@0 1679
wolffd@0 1680 The parameters of the discrete nodes can be specified as follows.
wolffd@0 1681 <pre>
wolffd@0 1682 bnet.CPD{B} = tabular_CPD(bnet, B, 'CPT', [0.85 0.15]); % 1=stable, 2=unstable
wolffd@0 1683 bnet.CPD{F} = tabular_CPD(bnet, F, 'CPT', [0.95 0.05]); % 1=intact, 2=defect
wolffd@0 1684 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [2/7 5/7]); % 1=industrial, 2=household
wolffd@0 1685 </pre>
wolffd@0 1686
wolffd@0 1687 <p>
wolffd@0 1688 The parameters of the continuous nodes can be specified as follows.
wolffd@0 1689 <pre>
wolffd@0 1690 bnet.CPD{E} = gaussian_CPD(bnet, E, 'mean', [-3.9 -0.4 -3.2 -0.5], ...
wolffd@0 1691 'cov', [0.00002 0.0001 0.00002 0.0001]);
wolffd@0 1692 bnet.CPD{D} = gaussian_CPD(bnet, D, 'mean', [6.5 6.0 7.5 7.0], ...
wolffd@0 1693 'cov', [0.03 0.04 0.1 0.1], 'weights', [1 1 1 1]);
wolffd@0 1694 bnet.CPD{C} = gaussian_CPD(bnet, C, 'mean', [-2 -1], 'cov', [0.1 0.3]);
wolffd@0 1695 bnet.CPD{L} = gaussian_CPD(bnet, L, 'mean', 3, 'cov', 0.25, 'weights', -0.5);
wolffd@0 1696 bnet.CPD{Min} = gaussian_CPD(bnet, Min, 'mean', [0.5 -0.5], 'cov', [0.01 0.005]);
wolffd@0 1697 bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 'mean', 0, 'cov', 0.002, 'weights', [1 1]);
wolffd@0 1698 </pre>
wolffd@0 1699
wolffd@0 1700
wolffd@0 1701 <h3><a name="cg_infer">Inference</h3>
wolffd@0 1702
wolffd@0 1703 <!--Let us perform inference in the <a href="#cg_model">waste incinerator example</a>.-->
wolffd@0 1704 First we compute the unconditional marginals.
wolffd@0 1705 <pre>
wolffd@0 1706 engine = jtree_inf_engine(bnet);
wolffd@0 1707 evidence = cell(1,n);
wolffd@0 1708 [engine, ll] = enter_evidence(engine, evidence);
wolffd@0 1709 marg = marginal_nodes(engine, E);
wolffd@0 1710 </pre>
wolffd@0 1711 <!--(Of course, we could use <tt>cond_gauss_inf_engine</tt> instead of jtree.)-->
wolffd@0 1712 'marg' is a structure that contains the fields 'mu' and 'Sigma', which
wolffd@0 1713 contain the mean and (co)variance of the marginal on E.
wolffd@0 1714 In this case, they are both scalars.
wolffd@0 1715 Let us check they match the published figures (to 2 decimal places).
wolffd@0 1716 <!--(We can't expect
wolffd@0 1717 more precision than this in general because I have implemented the algorithm of
wolffd@0 1718 Lauritzen (1992), which can be numerically unstable.)-->
wolffd@0 1719 <pre>
wolffd@0 1720 tol = 1e-2;
wolffd@0 1721 assert(approxeq(marg.mu, -3.25, tol));
wolffd@0 1722 assert(approxeq(sqrt(marg.Sigma), 0.709, tol));
wolffd@0 1723 </pre>
wolffd@0 1724 We can compute the other posteriors similarly.
wolffd@0 1725 Now let us add some evidence.
wolffd@0 1726 <pre>
wolffd@0 1727 evidence = cell(1,n);
wolffd@0 1728 evidence{W} = 1; % industrial
wolffd@0 1729 evidence{L} = 1.1;
wolffd@0 1730 evidence{C} = -0.9;
wolffd@0 1731 [engine, ll] = enter_evidence(engine, evidence);
wolffd@0 1732 </pre>
wolffd@0 1733 Now we find
wolffd@0 1734 <pre>
wolffd@0 1735 marg = marginal_nodes(engine, E);
wolffd@0 1736 assert(approxeq(marg.mu, -3.8983, tol));
wolffd@0 1737 assert(approxeq(sqrt(marg.Sigma), 0.0763, tol));
wolffd@0 1738 </pre>
wolffd@0 1739
wolffd@0 1740
wolffd@0 1741 We can also compute the joint probability on a set of nodes.
wolffd@0 1742 For example, P(D, Mout | evidence) is a 2D Gaussian:
wolffd@0 1743 <pre>
wolffd@0 1744 marg = marginal_nodes(engine, [D Mout])
wolffd@0 1745 marg =
wolffd@0 1746 domain: [6 8]
wolffd@0 1747 mu: [2x1 double]
wolffd@0 1748 Sigma: [2x2 double]
wolffd@0 1749 T: 1.0000
wolffd@0 1750 </pre>
wolffd@0 1751 The mean is
wolffd@0 1752 <pre>
wolffd@0 1753 marg.mu
wolffd@0 1754 ans =
wolffd@0 1755 3.6077
wolffd@0 1756 4.1077
wolffd@0 1757 </pre>
wolffd@0 1758 and the covariance matrix is
wolffd@0 1759 <pre>
wolffd@0 1760 marg.Sigma
wolffd@0 1761 ans =
wolffd@0 1762 0.1062 0.1062
wolffd@0 1763 0.1062 0.1182
wolffd@0 1764 </pre>
wolffd@0 1765 It is easy to visualize this posterior using standard Matlab plotting
wolffd@0 1766 functions, e.g.,
wolffd@0 1767 <pre>
wolffd@0 1768 gaussplot2d(marg.mu, marg.Sigma);
wolffd@0 1769 </pre>
wolffd@0 1770 produces the following picture.
wolffd@0 1771
wolffd@0 1772 <p>
wolffd@0 1773 <center>
wolffd@0 1774 <IMG SRC="Figures/gaussplot.png">
wolffd@0 1775 </center>
wolffd@0 1776 <p>
wolffd@0 1777
wolffd@0 1778
wolffd@0 1779 The T field indicates that the mixing weight of this Gaussian
wolffd@0 1780 component is 1.0.
wolffd@0 1781 If the joint contains discrete and continuous variables, the result
wolffd@0 1782 will be a mixture of Gaussians, e.g.,
wolffd@0 1783 <pre>
wolffd@0 1784 marg = marginal_nodes(engine, [F E])
wolffd@0 1785 domain: [1 3]
wolffd@0 1786 mu: [-3.9000 -0.4003]
wolffd@0 1787 Sigma: [1x1x2 double]
wolffd@0 1788 T: [0.9995 4.7373e-04]
wolffd@0 1789 </pre>
wolffd@0 1790 The interpretation is
wolffd@0 1791 Sigma(i,j,k) = Cov[ E(i) E(j) | F=k ].
wolffd@0 1792 In this case, E is a scalar, so i=j=1; k specifies the mixture component.
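<p>
If you want a single Gaussian summary of such a mixture (e.g., for plotting), you can moment-match it yourself. A minimal sketch for the scalar case above:
<pre>
w = marg.T(:);                    % mixing weights P(F=k | evidence)
mus = marg.mu(:);                 % component means of E
sigs = squeeze(marg.Sigma);       % component variances of E
m = sum(w .* mus);                % overall mean of E
v = sum(w .* (sigs(:) + mus.^2)) - m^2;   % overall variance of E
</pre>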
wolffd@0 1793 <p>
wolffd@0 1794 We saw in the sprinkler network that BNT sets the effective size of
wolffd@0 1795 observed discrete nodes to 1, since they only have one legal value.
wolffd@0 1796 For continuous nodes, BNT sets their length to 0,
wolffd@0 1797 since they have been reduced to a point.
wolffd@0 1798 For example,
wolffd@0 1799 <pre>
wolffd@0 1800 marg = marginal_nodes(engine, [B C])
wolffd@0 1801 domain: [4 5]
wolffd@0 1802 mu: []
wolffd@0 1803 Sigma: []
wolffd@0 1804 T: [0.0123 0.9877]
wolffd@0 1805 </pre>
wolffd@0 1806 It is simple to post-process the output of marginal_nodes.
wolffd@0 1807 For example, the file BNT/examples/static/cg1 sets the mu term of
wolffd@0 1808 observed nodes to their observed value, and the Sigma term to 0 (since
wolffd@0 1809 observed nodes have no variance).
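A hedged sketch of this kind of post-processing (not the exact code in cg1), applied to the [B C] marginal computed above:
<pre>
m = marginal_nodes(engine, [B C]);
m.mu = evidence{C};               % C was observed above (evidence{C} = -0.9)
m.Sigma = 0;                      % observed nodes have no variance
</pre>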
wolffd@0 1810
wolffd@0 1811 <p>
wolffd@0 1812 Note that the implemented version of the junction tree is numerically
wolffd@0 1813 unstable when using CG potentials
wolffd@0 1814 (which is why, in the example above, we only required our answers to agree with
wolffd@0 1815 the published ones to 2dp.)
wolffd@0 1816 This is why you might want to use <tt>stab_cond_gauss_inf_engine</tt>,
wolffd@0 1817 implemented by Shan Huang. This is described in
wolffd@0 1818
wolffd@0 1819 <ul>
wolffd@0 1820 <li> "Stable Local Computation with Conditional Gaussian Distributions",
wolffd@0 1821 S. Lauritzen and F. Jensen, Tech Report R-99-2014,
Dept. Math. Sciences, Aalborg Univ., 1999.
wolffd@0 1823 </ul>
wolffd@0 1824
wolffd@0 1825 However, even the numerically stable version
wolffd@0 1826 can be computationally intractable if there are many hidden discrete
wolffd@0 1827 nodes, because the number of mixture components grows exponentially e.g., in a
wolffd@0 1828 <a href="usage_dbn.html#lds">switching linear dynamical system</a>.
wolffd@0 1829 In general, one must resort to approximate inference techniques: see
wolffd@0 1830 the discussion on <a href="#engines">inference engines</a> below.
wolffd@0 1831
wolffd@0 1832
wolffd@0 1833 <h2><a name="hybrid">Other hybrid models</h2>
wolffd@0 1834
wolffd@0 1835 When we have C->D arcs, where C is hidden, we need to use
wolffd@0 1836 approximate inference.
wolffd@0 1837 One approach (not implemented in BNT) is described in
wolffd@0 1838 <ul>
wolffd@0 1839 <li> <a
wolffd@0 1840 href="http://www.cs.berkeley.edu/~murphyk/Papers/hybrid_uai99.ps.gz">A
wolffd@0 1841 Variational Approximation for Bayesian Networks with
wolffd@0 1842 Discrete and Continuous Latent Variables</a>,
wolffd@0 1843 K. Murphy, UAI 99.
wolffd@0 1844 </ul>
wolffd@0 1845 Of course, one can always use <a href="#sampling">sampling</a> methods
wolffd@0 1846 for approximate inference in such models.
wolffd@0 1847
wolffd@0 1848
wolffd@0 1849
wolffd@0 1850 <h1><a name="param_learning">Parameter Learning</h1>
wolffd@0 1851
wolffd@0 1852 The parameter estimation routines in BNT can be classified into 4
wolffd@0 1853 types, depending on whether the goal is to compute
wolffd@0 1854 a full (Bayesian) posterior over the parameters or just a point
wolffd@0 1855 estimate (e.g., Maximum Likelihood or Maximum A Posteriori),
wolffd@0 1856 and whether all the variables are fully observed or there is missing
wolffd@0 1857 data/ hidden variables (partial observability).
wolffd@0 1858 <p>
wolffd@0 1859
wolffd@0 1860 <TABLE BORDER>
wolffd@0 1861 <tr>
wolffd@0 1862 <TH></TH>
wolffd@0 1863 <th>Full obs</th>
wolffd@0 1864 <th>Partial obs</th>
wolffd@0 1865 </tr>
wolffd@0 1866 <tr>
wolffd@0 1867 <th>Point</th>
wolffd@0 1868 <td><tt>learn_params</tt></td>
wolffd@0 1869 <td><tt>learn_params_em</tt></td>
wolffd@0 1870 </tr>
wolffd@0 1871 <tr>
wolffd@0 1872 <th>Bayes</th>
wolffd@0 1873 <td><tt>bayes_update_params</tt></td>
wolffd@0 1874 <td>not yet supported</td>
wolffd@0 1875 </tr>
wolffd@0 1876 </table>
wolffd@0 1877
wolffd@0 1878
wolffd@0 1879 <h2><a name="load_data">Loading data from a file</h2>
wolffd@0 1880
wolffd@0 1881 To load numeric data from an ASCII text file called 'dat.txt', where each row is a
wolffd@0 1882 case and columns are separated by white-space, such as
wolffd@0 1883 <pre>
wolffd@0 1884 011979 1626.5 0.0
wolffd@0 1885 021979 1367.0 0.0
wolffd@0 1886 ...
wolffd@0 1887 </pre>
wolffd@0 1888 you can use
wolffd@0 1889 <pre>
wolffd@0 1890 data = load('dat.txt');
wolffd@0 1891 </pre>
wolffd@0 1892 or
wolffd@0 1893 <pre>
wolffd@0 1894 load dat.txt -ascii
wolffd@0 1895 </pre>
wolffd@0 1896 In the latter case, the data is stored in a variable called 'dat' (the
wolffd@0 1897 filename minus the extension).
wolffd@0 1898 Alternatively, suppose the data is stored in a .csv file (has commas
wolffd@0 1899 separating the columns, and contains a header line), such as
wolffd@0 1900 <pre>
wolffd@0 1901 header info goes here
wolffd@0 1902 ORD,011979,1626.5,0.0
wolffd@0 1903 DSM,021979,1367.0,0.0
wolffd@0 1904 ...
wolffd@0 1905 </pre>
wolffd@0 1906 You can load this using
wolffd@0 1907 <pre>
wolffd@0 1908 [a,b,c,d] = textread('dat.txt', '%s %d %f %f', 'delimiter', ',', 'headerlines', 1);
wolffd@0 1909 </pre>
If your file is not in either of these formats, you can either use Perl to convert
it to this format, or use Matlab's lower-level file I/O functions, such as fscanf.
wolffd@0 1912 Type
wolffd@0 1913 <tt>
wolffd@0 1914 help iofun
wolffd@0 1915 </tt>
wolffd@0 1916 for more information on Matlab's file functions.
wolffd@0 1917 <!--
wolffd@0 1918 <p>
wolffd@0 1919 To load data directly from Excel,
wolffd@0 1920 you should buy the
wolffd@0 1921 <a href="http://www.mathworks.com/products/excellink/">Excel Link</a>.
wolffd@0 1922 To load data directly from a relational database,
wolffd@0 1923 you should buy the
wolffd@0 1924 <a href="http://www.mathworks.com/products/database">Database
wolffd@0 1925 toolbox</a>.
wolffd@0 1926 -->
wolffd@0 1927 <p>
wolffd@0 1928 BNT learning routines require data to be stored in a cell array.
wolffd@0 1929 data{i,m} is the value of node i in case (example) m, i.e., each
wolffd@0 1930 <em>column</em> is a case.
wolffd@0 1931 If node i is not observed in case m (missing value), set
wolffd@0 1932 data{i,m} = [].
wolffd@0 1933 (Not all the learning routines can cope with such missing values, however.)
wolffd@0 1934 In the special case that all the nodes are observed and are
wolffd@0 1935 scalar-valued (as opposed to vector-valued), the data can be
wolffd@0 1936 stored in a matrix (as opposed to a cell-array).
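<p>
If some values in a numeric data file are missing, you must convert back to the cell-array format. Here is a hedged sketch (not a BNT routine), assuming missing entries are coded as NaN:
<pre>
rawdata = load('dat.txt');          % ncases x nnodes, NaN = missing
cases = num2cell(rawdata');         % transpose: each column becomes a case
[I,J] = find(isnan(rawdata'));
for k=1:length(I)
  cases{I(k), J(k)} = [];           % missing entries become []
end
</pre>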
wolffd@0 1937 <p>
wolffd@0 1938 Suppose, as in the <a href="#mixexp">mixture of experts example</a>,
wolffd@0 1939 that we have 3 nodes in the graph: X(1) is the observed input, X(3) is
wolffd@0 1940 the observed output, and X(2) is a hidden (gating) node. We can
wolffd@0 1941 create the dataset as follows.
wolffd@0 1942 <pre>
wolffd@0 1943 data = load('dat.txt');
wolffd@0 1944 ncases = size(data, 1);
wolffd@0 1945 cases = cell(3, ncases);
wolffd@0 1946 cases([1 3], :) = num2cell(data');
wolffd@0 1947 </pre>
wolffd@0 1948 Notice how we transposed the data, to convert rows into columns.
wolffd@0 1949 Also, cases{2,m} = [] for all m, since X(2) is always hidden.
wolffd@0 1950
wolffd@0 1951
wolffd@0 1952 <h2><a name="mle_complete">Maximum likelihood parameter estimation from complete data</h2>
wolffd@0 1953
wolffd@0 1954 As an example, let's generate some data from the sprinkler network, randomize the parameters,
wolffd@0 1955 and then try to recover the original model.
wolffd@0 1956 First we create some training data using forwards sampling.
wolffd@0 1957 <pre>
wolffd@0 1958 samples = cell(N, nsamples);
wolffd@0 1959 for i=1:nsamples
wolffd@0 1960 samples(:,i) = sample_bnet(bnet);
wolffd@0 1961 end
wolffd@0 1962 </pre>
wolffd@0 1963 samples{j,i} contains the value of the j'th node in case i.
wolffd@0 1964 sample_bnet returns a cell array because, in general, each node might
wolffd@0 1965 be a vector of different length.
wolffd@0 1966 In this case, all nodes are discrete (and hence scalars), so we
wolffd@0 1967 could have used a regular array instead (which can be quicker):
wolffd@0 1968 <pre>
wolffd@0 1969 data = cell2num(samples);
</pre>
wolffd@0 1971 So now data(j,i) = samples{j,i}.
wolffd@0 1972 <p>
wolffd@0 1973 Now we create a network with random parameters.
wolffd@0 1974 (The initial values of bnet2 don't matter in this case, since we can find the
wolffd@0 1975 globally optimal MLE independent of where we start.)
wolffd@0 1976 <pre>
wolffd@0 1977 % Make a tabula rasa
wolffd@0 1978 bnet2 = mk_bnet(dag, node_sizes);
wolffd@0 1979 seed = 0;
wolffd@0 1980 rand('state', seed);
wolffd@0 1981 bnet2.CPD{C} = tabular_CPD(bnet2, C);
wolffd@0 1982 bnet2.CPD{R} = tabular_CPD(bnet2, R);
wolffd@0 1983 bnet2.CPD{S} = tabular_CPD(bnet2, S);
wolffd@0 1984 bnet2.CPD{W} = tabular_CPD(bnet2, W);
wolffd@0 1985 </pre>
wolffd@0 1986 Finally, we find the maximum likelihood estimates of the parameters.
wolffd@0 1987 <pre>
wolffd@0 1988 bnet3 = learn_params(bnet2, samples);
wolffd@0 1989 </pre>
wolffd@0 1990 To view the learned parameters, we use a little Matlab hackery.
wolffd@0 1991 <pre>
wolffd@0 1992 CPT3 = cell(1,N);
wolffd@0 1993 for i=1:N
wolffd@0 1994 s=struct(bnet3.CPD{i}); % violate object privacy
wolffd@0 1995 CPT3{i}=s.CPT;
wolffd@0 1996 end
wolffd@0 1997 </pre>
wolffd@0 1998 Here are the parameters learned for node 4.
wolffd@0 1999 <pre>
wolffd@0 2000 dispcpt(CPT3{4})
wolffd@0 2001 1 1 : 1.0000 0.0000
wolffd@0 2002 2 1 : 0.2000 0.8000
wolffd@0 2003 1 2 : 0.2273 0.7727
wolffd@0 2004 2 2 : 0.0000 1.0000
wolffd@0 2005 </pre>
wolffd@0 2006 So we see that the learned parameters are fairly close to the "true"
wolffd@0 2007 ones, which we display below.
wolffd@0 2008 <pre>
wolffd@0 2009 dispcpt(CPT{4})
wolffd@0 2010 1 1 : 1.0000 0.0000
wolffd@0 2011 2 1 : 0.1000 0.9000
wolffd@0 2012 1 2 : 0.1000 0.9000
wolffd@0 2013 2 2 : 0.0100 0.9900
wolffd@0 2014 </pre>
wolffd@0 2015 We can get better results by using a larger training set, or using
wolffd@0 2016 informative priors (see <a href="#prior">below</a>).
wolffd@0 2017
wolffd@0 2018
wolffd@0 2019
wolffd@0 2020 <h2><a name="prior">Parameter priors</h2>
wolffd@0 2021
wolffd@0 2022 Currently, only tabular CPDs can have priors on their parameters.
wolffd@0 2023 The conjugate prior for a multinomial is the Dirichlet.
wolffd@0 2024 (For binary random variables, the multinomial is the same as the
wolffd@0 2025 Bernoulli, and the Dirichlet is the same as the Beta.)
wolffd@0 2026 <p>
wolffd@0 2027 The Dirichlet has a simple interpretation in terms of pseudo counts.
wolffd@0 2028 If we let N_ijk = the num. times X_i=k and Pa_i=j occurs in the
wolffd@0 2029 training set, where Pa_i are the parents of X_i,
wolffd@0 2030 then the maximum likelihood (ML) estimate is
wolffd@0 2031 T_ijk = N_ijk / N_ij (where N_ij = sum_k' N_ijk'), which will be 0 if N_ijk=0.
wolffd@0 2032 To prevent us from declaring that (X_i=k, Pa_i=j) is impossible just because this
wolffd@0 2033 event was not seen in the training set,
we can pretend that, for each value j of Pa_i, we have already seen value k of X_i
some number (alpha_ijk) of times.
The MAP (maximum a posteriori) estimate is then
wolffd@0 2037 <pre>
wolffd@0 2038 T_ijk = (N_ijk + alpha_ijk) / (N_ij + alpha_ij)
wolffd@0 2039 </pre>
wolffd@0 2040 and is never 0 if all alpha_ijk > 0.
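To make the formula concrete, here is a small numerical sketch (the counts are made up):
<pre>
N_ijk = [2 0 1];              % num. times X_i = k, for one parent config j
alpha_ijk = [1 1 1];          % uniform Dirichlet pseudo counts
T_ML  = N_ijk / sum(N_ijk)                           % = [0.667 0 0.333]
T_MAP = (N_ijk + alpha_ijk) / sum(N_ijk + alpha_ijk) % = [0.5 0.167 0.333]
</pre>
Note how the MAP estimate avoids declaring the unseen event impossible.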
wolffd@0 2041 For example, consider the network A->B, where A is binary and B has 3
wolffd@0 2042 values.
wolffd@0 2043 A uniform prior for B has the form
wolffd@0 2044 <pre>
wolffd@0 2045 B=1 B=2 B=3
wolffd@0 2046 A=1 1 1 1
wolffd@0 2047 A=2 1 1 1
wolffd@0 2048 </pre>
wolffd@0 2049 which can be created using
wolffd@0 2050 <pre>
wolffd@0 2051 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'unif');
wolffd@0 2052 </pre>
wolffd@0 2053 This prior does not satisfy the likelihood equivalence principle,
wolffd@0 2054 which says that <a href="#markov_equiv">Markov equivalent</a> models
wolffd@0 2055 should have the same marginal likelihood.
wolffd@0 2056 A prior that does satisfy this principle is shown below.
wolffd@0 2057 Heckerman (1995) calls this the
wolffd@0 2058 BDeu prior (likelihood equivalent uniform Bayesian Dirichlet).
wolffd@0 2059 <pre>
wolffd@0 2060 B=1 B=2 B=3
wolffd@0 2061 A=1 1/6 1/6 1/6
wolffd@0 2062 A=2 1/6 1/6 1/6
wolffd@0 2063 </pre>
where we put N/(q*r) in each bin; N is the equivalent sample size,
q = |A| is the number of parent configurations, and r = |B| is the number of states of the child.
wolffd@0 2066 This can be created as follows
wolffd@0 2067 <pre>
wolffd@0 2068 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu');
wolffd@0 2069 </pre>
wolffd@0 2070 Here, 1 is the equivalent sample size, and is the strength of the
wolffd@0 2071 prior.
wolffd@0 2072 You can change this using
wolffd@0 2073 <pre>
wolffd@0 2074 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', ...
wolffd@0 2075 'BDeu', 'dirichlet_weight', 10);
wolffd@0 2076 </pre>
wolffd@0 2077 <!--where counts is an array of pseudo-counts of the same size as the
wolffd@0 2078 CPT.-->
wolffd@0 2079 <!--
wolffd@0 2080 <p>
wolffd@0 2081 When you specify a prior, you should set row i of the CPT to the
wolffd@0 2082 normalized version of row i of the pseudo-count matrix, i.e., to the
wolffd@0 2083 expected values of the parameters. This will ensure that computing the
wolffd@0 2084 marginal likelihood sequentially (see <a
wolffd@0 2085 href="#bayes_learn">below</a>) and in batch form gives the same
wolffd@0 2086 results.
wolffd@0 2087 To do this, proceed as follows.
wolffd@0 2088 <pre>
wolffd@0 2089 tabular_CPD(bnet, i, 'prior', counts, 'CPT', mk_stochastic(counts));
wolffd@0 2090 </pre>
wolffd@0 2091 For a non-informative prior, you can just write
wolffd@0 2092 <pre>
wolffd@0 2093 tabular_CPD(bnet, i, 'prior', 'unif', 'CPT', 'unif');
wolffd@0 2094 </pre>
wolffd@0 2095 -->
wolffd@0 2096
wolffd@0 2097
wolffd@0 2098 <h2><a name="bayes_learn">(Sequential) Bayesian parameter updating from complete data</h2>
wolffd@0 2099
wolffd@0 2100 If we use conjugate priors and have fully observed data, we can
wolffd@0 2101 compute the posterior over the parameters in batch form as follows.
wolffd@0 2102 <pre>
wolffd@0 2103 cases = sample_bnet(bnet, nsamples);
wolffd@0 2104 bnet = bayes_update_params(bnet, cases);
wolffd@0 2105 LL = log_marg_lik_complete(bnet, cases);
wolffd@0 2106 </pre>
wolffd@0 2107 bnet.CPD{i}.prior contains the new Dirichlet pseudocounts,
wolffd@0 2108 and bnet.CPD{i}.CPT is set to the mean of the posterior (the
wolffd@0 2109 normalized counts).
wolffd@0 2110 (Hence if the initial pseudo counts are 0,
wolffd@0 2111 <tt>bayes_update_params</tt> and <tt>learn_params</tt> will give the
wolffd@0 2112 same result.)
wolffd@0 2113
wolffd@0 2114
wolffd@0 2115
wolffd@0 2116
wolffd@0 2117 <p>
wolffd@0 2118 We can compute the same result sequentially (on-line) as follows.
wolffd@0 2119 <pre>
wolffd@0 2120 LL = 0;
wolffd@0 2121 for m=1:nsamples
wolffd@0 2122 LL = LL + log_marg_lik_complete(bnet, cases(:,m));
wolffd@0 2123 bnet = bayes_update_params(bnet, cases(:,m));
wolffd@0 2124 end
wolffd@0 2125 </pre>
wolffd@0 2126
wolffd@0 2127 The file <tt>BNT/examples/static/StructLearn/model_select1</tt> has an example of
wolffd@0 2128 sequential model selection which uses the same idea.
wolffd@0 2129 We generate data from the model A->B
wolffd@0 2130 and compute the posterior prob of all 3 dags on 2 nodes:
wolffd@0 2131 (1) A B, (2) A <- B , (3) A -> B
wolffd@0 2132 Models 2 and 3 are <a href="#markov_equiv">Markov equivalent</a>, and therefore indistinguishable from
wolffd@0 2133 observational data alone, so we expect their posteriors to be the same
wolffd@0 2134 (assuming a prior which satisfies likelihood equivalence).
wolffd@0 2135 If we use random parameters, the "true" model only gets a higher posterior after 2000 trials!
wolffd@0 2136 However, if we make B a noisy NOT gate, the true model "wins" after 12
wolffd@0 2137 trials, as shown below (red = model 1, blue/green (superimposed)
wolffd@0 2138 represents models 2/3).
wolffd@0 2139 <p>
wolffd@0 2140 <img src="Figures/model_select.png">
wolffd@0 2141 <p>
wolffd@0 2142 The use of marginal likelihood for model selection is discussed in
wolffd@0 2143 greater detail in the
section on <a href="#structure_learning">structure learning</a>.
wolffd@0 2145
wolffd@0 2146
wolffd@0 2147
wolffd@0 2148
wolffd@0 2149 <h2><a name="em">Maximum likelihood parameter estimation with missing values</h2>
wolffd@0 2150
wolffd@0 2151 Now we consider learning when some values are not observed.
wolffd@0 2152 Let us randomly hide half the values generated from the water
wolffd@0 2153 sprinkler example.
wolffd@0 2154 <pre>
wolffd@0 2155 samples2 = samples;
wolffd@0 2156 hide = rand(N, nsamples) > 0.5;
wolffd@0 2157 [I,J]=find(hide);
wolffd@0 2158 for k=1:length(I)
wolffd@0 2159 samples2{I(k), J(k)} = [];
wolffd@0 2160 end
wolffd@0 2161 </pre>
wolffd@0 2162 samples2{i,l} is the value of node i in training case l, or [] if unobserved.
wolffd@0 2163 <p>
wolffd@0 2164 Now we will compute the MLEs using the EM algorithm.
wolffd@0 2165 We need to use an inference algorithm to compute the expected
wolffd@0 2166 sufficient statistics in the E step; the M (maximization) step is as
wolffd@0 2167 above.
wolffd@0 2168 <pre>
wolffd@0 2169 engine2 = jtree_inf_engine(bnet2);
wolffd@0 2170 max_iter = 10;
wolffd@0 2171 [bnet4, LLtrace] = learn_params_em(engine2, samples2, max_iter);
wolffd@0 2172 </pre>
wolffd@0 2173 LLtrace(i) is the log-likelihood at iteration i. We can plot this as
wolffd@0 2174 follows:
wolffd@0 2175 <pre>
wolffd@0 2176 plot(LLtrace, 'x-')
wolffd@0 2177 </pre>
Let's display the results after 10 iterations of EM. (Here CPT4 is extracted from bnet4 using the same struct trick we used for CPT3 above.)
wolffd@0 2179 <pre>
wolffd@0 2180 celldisp(CPT4)
wolffd@0 2181 CPT4{1} =
wolffd@0 2182 0.6616
wolffd@0 2183 0.3384
wolffd@0 2184 CPT4{2} =
wolffd@0 2185 0.6510 0.3490
wolffd@0 2186 0.8751 0.1249
wolffd@0 2187 CPT4{3} =
wolffd@0 2188 0.8366 0.1634
wolffd@0 2189 0.0197 0.9803
wolffd@0 2190 CPT4{4} =
wolffd@0 2191 (:,:,1) =
wolffd@0 2192 0.8276 0.0546
wolffd@0 2193 0.5452 0.1658
wolffd@0 2194 (:,:,2) =
wolffd@0 2195 0.1724 0.9454
wolffd@0 2196 0.4548 0.8342
wolffd@0 2197 </pre>
wolffd@0 2198 We can get improved performance by using one or more of the following
wolffd@0 2199 methods:
wolffd@0 2200 <ul>
wolffd@0 2201 <li> Increasing the size of the training set.
wolffd@0 2202 <li> Decreasing the amount of hidden data.
wolffd@0 2203 <li> Running EM for longer.
wolffd@0 2204 <li> Using informative priors.
<li> Initialising EM from multiple starting points (a sketch of this is given below).
wolffd@0 2206 </ul>
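<p>
For the last of these, here is a hedged sketch (not bundled with BNT) of running EM from several random restarts and keeping the model with the highest final log-likelihood:
<pre>
nrestarts = 5;
best_LL = -inf;
for r=1:nrestarts
  rand('state', r);                       % different random CPTs on each restart
  bnet_r = mk_bnet(dag, node_sizes);
  for i=1:N
    bnet_r.CPD{i} = tabular_CPD(bnet_r, i);
  end
  engine_r = jtree_inf_engine(bnet_r);
  [bnet_r, LLtrace_r] = learn_params_em(engine_r, samples2, max_iter);
  if LLtrace_r(end) > best_LL
    best_LL = LLtrace_r(end);
    bnet_best = bnet_r;
  end
end
</pre>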
wolffd@0 2207
wolffd@0 2208 Click <a href="#gaussian">here</a> for a discussion of learning
wolffd@0 2209 Gaussians, which can cause numerical problems.
wolffd@0 2210
wolffd@0 2211 <h2><a name="tying">Parameter tying</h2>
wolffd@0 2212
wolffd@0 2213 In networks with repeated structure (e.g., chains and grids), it is
wolffd@0 2214 common to assume that the parameters are the same at every node. This
wolffd@0 2215 is called parameter tying, and reduces the amount of data needed for
wolffd@0 2216 learning.
wolffd@0 2217 <p>
wolffd@0 2218 When we have tied parameters, there is no longer a one-to-one
wolffd@0 2219 correspondence between nodes and CPDs.
Rather, each CPD specifies the parameters for a whole equivalence class
wolffd@0 2221 of nodes.
wolffd@0 2222 It is easiest to see this by example.
wolffd@0 2223 Consider the following <a href="usage_dbn.html#hmm">hidden Markov
wolffd@0 2224 model (HMM)</a>
wolffd@0 2225 <p>
wolffd@0 2226 <img src="Figures/hmm3.gif">
wolffd@0 2227 <p>
wolffd@0 2228 <!--
wolffd@0 2229 We can create this graph structure, assuming we have T time-slices,
wolffd@0 2230 as follows.
wolffd@0 2231 (We number the nodes as shown in the figure, but we could equally well
wolffd@0 2232 number the hidden nodes 1:T, and the observed nodes T+1:2T.)
wolffd@0 2233 <pre>
wolffd@0 2234 N = 2*T;
wolffd@0 2235 dag = zeros(N);
wolffd@0 2236 hnodes = 1:2:2*T;
wolffd@0 2237 for i=1:T-1
wolffd@0 2238 dag(hnodes(i), hnodes(i+1))=1;
wolffd@0 2239 end
wolffd@0 2240 onodes = 2:2:2*T;
wolffd@0 2241 for i=1:T
wolffd@0 2242 dag(hnodes(i), onodes(i)) = 1;
wolffd@0 2243 end
wolffd@0 2244 </pre>
wolffd@0 2245 <p>
wolffd@0 2246 The hidden nodes are always discrete, and have Q possible values each,
wolffd@0 2247 but the observed nodes can be discrete or continuous, and have O possible values/length.
wolffd@0 2248 <pre>
wolffd@0 2249 if cts_obs
wolffd@0 2250 dnodes = hnodes;
wolffd@0 2251 else
wolffd@0 2252 dnodes = 1:N;
wolffd@0 2253 end
wolffd@0 2254 ns = ones(1,N);
wolffd@0 2255 ns(hnodes) = Q;
wolffd@0 2256 ns(onodes) = O;
wolffd@0 2257 </pre>
wolffd@0 2258 -->
wolffd@0 2259 When HMMs are used for semi-infinite processes like speech recognition,
wolffd@0 2260 we assume the transition matrix
wolffd@0 2261 P(H(t+1)|H(t)) is the same for all t; this is called a time-invariant
or homogeneous Markov chain.
wolffd@0 2263 Hence hidden nodes 2, 3, ..., T
wolffd@0 2264 are all in the same equivalence class, say class Hclass.
wolffd@0 2265 Similarly, the observation matrix P(O(t)|H(t)) is assumed to be the
wolffd@0 2266 same for all t, so the observed nodes are all in the same equivalence
wolffd@0 2267 class, say class Oclass.
wolffd@0 2268 Finally, the prior term P(H(1)) is in a class all by itself, say class
wolffd@0 2269 H1class.
wolffd@0 2270 This is illustrated below, where we explicitly represent the
wolffd@0 2271 parameters as random variables (dotted nodes).
wolffd@0 2272 <p>
wolffd@0 2273 <img src="Figures/hmm4_params.gif">
wolffd@0 2274 <p>
wolffd@0 2275 In BNT, we cannot represent parameters as random variables (nodes).
wolffd@0 2276 Instead, we "hide" the
wolffd@0 2277 parameters inside one CPD for each equivalence class,
wolffd@0 2278 and then specify that the other CPDs should share these parameters, as
wolffd@0 2279 follows.
wolffd@0 2280 <pre>
wolffd@0 2281 hnodes = 1:2:2*T;
wolffd@0 2282 onodes = 2:2:2*T;
wolffd@0 2283 H1class = 1; Hclass = 2; Oclass = 3;
wolffd@0 2284 eclass = ones(1,N);
wolffd@0 2285 eclass(hnodes(2:end)) = Hclass;
wolffd@0 2286 eclass(hnodes(1)) = H1class;
wolffd@0 2287 eclass(onodes) = Oclass;
wolffd@0 2288 % create dag and ns in the usual way
wolffd@0 2289 bnet = mk_bnet(dag, ns, 'discrete', dnodes, 'equiv_class', eclass);
wolffd@0 2290 </pre>
wolffd@0 2291 Finally, we define the parameters for each equivalence class:
wolffd@0 2292 <pre>
wolffd@0 2293 bnet.CPD{H1class} = tabular_CPD(bnet, hnodes(1)); % prior
wolffd@0 2294 bnet.CPD{Hclass} = tabular_CPD(bnet, hnodes(2)); % transition matrix
wolffd@0 2295 if cts_obs
wolffd@0 2296 bnet.CPD{Oclass} = gaussian_CPD(bnet, onodes(1));
wolffd@0 2297 else
wolffd@0 2298 bnet.CPD{Oclass} = tabular_CPD(bnet, onodes(1));
wolffd@0 2299 end
wolffd@0 2300 </pre>
wolffd@0 2301 In general, if bnet.CPD{e} = xxx_CPD(bnet, j), then j should be a
wolffd@0 2302 member of e's equivalence class; that is, it is not always the case
wolffd@0 2303 that e == j. You can use bnet.rep_of_eclass(e) to return the
wolffd@0 2304 representative of equivalence class e.
wolffd@0 2305 BNT will look up the parents of j to determine the size
wolffd@0 2306 of the CPT to use. It assumes that this is the same for all members of
wolffd@0 2307 the equivalence class.
wolffd@0 2308 Click <a href="http://www.cs.berkeley.edu/~murphyk/Bayes/param_tieing.html">here</a> for
wolffd@0 2309 a more complex example of parameter tying.
wolffd@0 2310 <p>
wolffd@0 2311 Note:
wolffd@0 2312 Normally one would define an HMM as a
wolffd@0 2313 <a href = "usage_dbn.html">Dynamic Bayes Net</a>
wolffd@0 2314 (see the function BNT/examples/dynamic/mk_chmm.m).
wolffd@0 2315 However, one can define an HMM as a static BN using the function
wolffd@0 2316 BNT/examples/static/Models/mk_hmm_bnet.m.
wolffd@0 2317
wolffd@0 2318
wolffd@0 2319
wolffd@0 2320 <h1><a name="structure_learning">Structure learning</h1>
wolffd@0 2321
wolffd@0 2322 Update (9/29/03):
Philippe Leray is developing some additional structure learning code
wolffd@0 2324 on top of BNT. Click <a
wolffd@0 2325 href="http://bnt.insa-rouen.fr/ajouts.html">here</a>
wolffd@0 2326 for details.
wolffd@0 2327
wolffd@0 2328 <p>
wolffd@0 2329
wolffd@0 2330 There are two very different approaches to structure learning:
wolffd@0 2331 constraint-based and search-and-score.
wolffd@0 2332 In the <a href="#constraint">constraint-based approach</a>,
wolffd@0 2333 we start with a fully connected graph, and remove edges if certain
wolffd@0 2334 conditional independencies are measured in the data.
wolffd@0 2335 This has the disadvantage that repeated independence tests lose
wolffd@0 2336 statistical power.
wolffd@0 2337 <p>
wolffd@0 2338 In the more popular search-and-score approach,
wolffd@0 2339 we perform a search through the space of possible DAGs, and either
wolffd@0 2340 return the best one found (a point estimate), or return a sample of the
wolffd@0 2341 models found (an approximation to the Bayesian posterior).
wolffd@0 2342 <p>
wolffd@0 2343 Unfortunately, the number of DAGs as a function of the number of
wolffd@0 2344 nodes, G(n), is super-exponential in n.
wolffd@0 2345 A closed form formula for G(n) is not known, but the first few values
wolffd@0 2346 are shown below (from Cooper, 1999).
wolffd@0 2347
wolffd@0 2348 <table>
wolffd@0 2349 <tr> <th>n</th> <th align=left>G(n)</th> </tr>
wolffd@0 2350 <tr> <td>1</td> <td>1</td> </tr>
wolffd@0 2351 <tr> <td>2</td> <td>3</td> </tr>
wolffd@0 2352 <tr> <td>3</td> <td>25</td> </tr>
wolffd@0 2353 <tr> <td>4</td> <td>543</td> </tr>
wolffd@0 2354 <tr> <td>5</td> <td>29,281</td> </tr>
wolffd@0 2355 <tr> <td>6</td> <td>3,781,503</td> </tr>
wolffd@0 2356 <tr> <td>7</td> <td>1.1 x 10^9</td> </tr>
wolffd@0 2357 <tr> <td>8</td> <td>7.8 x 10^11</td> </tr>
wolffd@0 2358 <tr> <td>9</td> <td>1.2 x 10^15</td> </tr>
wolffd@0 2359 <tr> <td>10</td> <td>4.2 x 10^18</td> </tr>
wolffd@0 2360 </table>
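<p>
(As a hedged aside, the values in this table can be reproduced with a well-known recursion due to Robinson, G(n) = sum_{k=1..n} (-1)^(k+1) C(n,k) 2^(k(n-k)) G(n-k), with G(0)=1; note that the largest entries slightly exceed exact integer precision in Matlab doubles.)
<pre>
G = zeros(1,10);
for n=1:10
  s = 0;
  for k=1:n
    if k < n
      prev = G(n-k);
    else
      prev = 1;                % G(0) = 1
    end
    s = s + (-1)^(k+1) * nchoosek(n,k) * 2^(k*(n-k)) * prev;
  end
  G(n) = s;
end
G(1:6)                         % 1 3 25 543 29281 3781503
</pre>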
wolffd@0 2361
wolffd@0 2362 Since the number of DAGs is super-exponential in the number of nodes,
wolffd@0 2363 we cannot exhaustively search the space, so we either use a local
search algorithm (e.g., greedy hill climbing, perhaps with multiple
wolffd@0 2365 restarts) or a global search algorithm (e.g., Markov Chain Monte
wolffd@0 2366 Carlo).
wolffd@0 2367 <p>
wolffd@0 2368 If we know a total ordering on the nodes,
wolffd@0 2369 finding the best structure amounts to picking the best set of parents
wolffd@0 2370 for each node independently.
wolffd@0 2371 This is what the K2 algorithm does.
wolffd@0 2372 If the ordering is unknown, we can search over orderings,
wolffd@0 2373 which is more efficient than searching over DAGs (Koller and Friedman, 2000).
wolffd@0 2374 <p>
wolffd@0 2375 In addition to the search procedure, we must specify the scoring
wolffd@0 2376 function. There are two popular choices. The Bayesian score integrates
wolffd@0 2377 out the parameters, i.e., it is the marginal likelihood of the model.
wolffd@0 2378 The BIC (Bayesian Information Criterion) is defined as
wolffd@0 2379 log P(D|theta_hat) - 0.5*d*log(N), where D is the data, theta_hat is
wolffd@0 2380 the ML estimate of the parameters, d is the number of parameters, and
wolffd@0 2381 N is the number of data cases.
wolffd@0 2382 The BIC method has the advantage of not requiring a prior.
wolffd@0 2383 <p>
wolffd@0 2384 BIC can be derived as a large sample
wolffd@0 2385 approximation to the marginal likelihood.
wolffd@0 2386 (It is also equal to the Minimum Description Length of a model.)
wolffd@0 2387 However, in practice, the sample size does not need to be very large
wolffd@0 2388 for the approximation to be good.
wolffd@0 2389 For example, in the figure below, we plot the ratio between the log marginal likelihood
wolffd@0 2390 and the BIC score against data-set size; we see that the ratio rapidly
wolffd@0 2391 approaches 1, especially for non-informative priors.
wolffd@0 2392 (This plot was generated by the file BNT/examples/static/bic1.m. It
wolffd@0 2393 uses the water sprinkler BN with BDeu Dirichlet priors with different
wolffd@0 2394 equivalent sample sizes.)
wolffd@0 2395
wolffd@0 2396 <p>
wolffd@0 2397 <center>
wolffd@0 2398 <IMG SRC="Figures/bic.png">
wolffd@0 2399 </center>
wolffd@0 2400 <p>
wolffd@0 2401
wolffd@0 2402 <p>
wolffd@0 2403 As with parameter learning, handling missing data/ hidden variables is
wolffd@0 2404 much harder than the fully observed case.
wolffd@0 2405 The structure learning routines in BNT can therefore be classified into 4
wolffd@0 2406 types, analogously to the parameter learning case.
wolffd@0 2407 <p>
wolffd@0 2408
wolffd@0 2409 <TABLE BORDER>
wolffd@0 2410 <tr>
wolffd@0 2411 <TH></TH>
wolffd@0 2412 <th>Full obs</th>
wolffd@0 2413 <th>Partial obs</th>
wolffd@0 2414 </tr>
wolffd@0 2415 <tr>
wolffd@0 2416 <th>Point</th>
wolffd@0 2417 <td><tt>learn_struct_K2</tt> <br>
wolffd@0 2418 <!-- <tt>learn_struct_hill_climb</tt></td> -->
wolffd@0 2419 <td><tt>not yet supported</tt></td>
wolffd@0 2420 </tr>
wolffd@0 2421 <tr>
wolffd@0 2422 <th>Bayes</th>
wolffd@0 2423 <td><tt>learn_struct_mcmc</tt></td>
wolffd@0 2424 <td>not yet supported</td>
wolffd@0 2425 </tr>
wolffd@0 2426 </table>
wolffd@0 2427
wolffd@0 2428
wolffd@0 2429 <h2><a name="markov_equiv">Markov equivalence</h2>
wolffd@0 2430
wolffd@0 2431 If two DAGs encode the same conditional independencies, they are
called Markov equivalent. The set of all DAGs can be partitioned into
wolffd@0 2433 Markov equivalence classes. Graphs within the same class can
wolffd@0 2434 have
wolffd@0 2435 the direction of some of their arcs reversed without changing any of
wolffd@0 2436 the CI relationships.
wolffd@0 2437 Each class can be represented by a PDAG
wolffd@0 2438 (partially directed acyclic graph) called an essential graph or
wolffd@0 2439 pattern. This specifies which edges must be oriented in a certain
wolffd@0 2440 direction, and which may be reversed.
wolffd@0 2441
wolffd@0 2442 <p>
wolffd@0 2443 When learning graph structure from observational data,
wolffd@0 2444 the best one can hope to do is to identify the model up to Markov
wolffd@0 2445 equivalence. To distinguish amongst graphs within the same equivalence
wolffd@0 2446 class, one needs interventional data: see the discussion on <a
wolffd@0 2447 href="#active">active learning</a> below.
wolffd@0 2448
wolffd@0 2449
wolffd@0 2450
wolffd@0 2451 <h2><a name="enumerate">Exhaustive search</h2>
wolffd@0 2452
wolffd@0 2453 The brute-force approach to structure learning is to enumerate all
wolffd@0 2454 possible DAGs, and score each one. This provides a "gold standard"
wolffd@0 2455 with which to compare other algorithms. We can do this as follows.
wolffd@0 2456 <pre>
wolffd@0 2457 dags = mk_all_dags(N);
wolffd@0 2458 score = score_dags(data, ns, dags);
wolffd@0 2459 </pre>
wolffd@0 2460 where data(i,m) is the value of node i in case m,
wolffd@0 2461 and ns(i) is the size of node i.
wolffd@0 2462 If the DAGs have a lot of families in common, we can cache the sufficient statistics,
wolffd@0 2463 making this potentially more efficient than scoring the DAGs one at a time.
wolffd@0 2464 (Caching is not currently implemented, however.)
wolffd@0 2465 <p>
wolffd@0 2466 By default, we use the Bayesian scoring metric, and assume CPDs are
wolffd@0 2467 represented by tables with BDeu(1) priors.
wolffd@0 2468 We can override these defaults as follows.
wolffd@0 2469 If we want to use uniform priors, we can say
wolffd@0 2470 <pre>
wolffd@0 2471 params = cell(1,N);
wolffd@0 2472 for i=1:N
wolffd@0 2473 params{i} = {'prior', 'unif'};
wolffd@0 2474 end
wolffd@0 2475 score = score_dags(data, ns, dags, 'params', params);
wolffd@0 2476 </pre>
wolffd@0 2477 params{i} is a cell-array, containing optional arguments that are
wolffd@0 2478 passed to the constructor for CPD i.
wolffd@0 2479 <p>
wolffd@0 2480 Now suppose we want to use different node types, e.g.,
wolffd@0 2481 Suppose nodes 1 and 2 are Gaussian, and nodes 3 and 4 softmax (both
wolffd@0 2482 these CPDs can support discrete and continuous parents, which is
wolffd@0 2483 necessary since all other nodes will be considered as parents).
wolffd@0 2484 The Bayesian scoring metric currently only works for tabular CPDs, so
wolffd@0 2485 we will use BIC:
wolffd@0 2486 <pre>
score = score_dags(data, ns, dags, 'discrete', [3 4], 'params', [], ...
        'type', {'gaussian', 'gaussian', 'softmax', 'softmax'}, 'scoring_fn', 'bic')
wolffd@0 2489 </pre>
wolffd@0 2490 In practice, one can't enumerate all possible DAGs for N > 5,
wolffd@0 2491 but one can evaluate any reasonably-sized set of hypotheses in this
wolffd@0 2492 way (e.g., nearest neighbors of your current best guess).
wolffd@0 2493 Think of this as "computer assisted model refinement" as opposed to de
wolffd@0 2494 novo learning.
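<p>
Having scored a set of candidate DAGs, picking out the best one is a one-liner (a minimal sketch):
<pre>
[best_score, best] = max(score);
best_dag = dags{best};            % dags is the cell array passed to score_dags
</pre>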
wolffd@0 2495
wolffd@0 2496
wolffd@0 2497 <h2><a name="K2">K2</h2>
wolffd@0 2498
wolffd@0 2499 The K2 algorithm (Cooper and Herskovits, 1992) is a greedy search algorithm that works as follows.
wolffd@0 2500 Initially each node has no parents. It then adds incrementally that parent whose addition most
wolffd@0 2501 increases the score of the resulting structure. When the addition of no single
wolffd@0 2502 parent can increase the score, it stops adding parents to the node.
wolffd@0 2503 Since we are using a fixed ordering, we do not need to check for
wolffd@0 2504 cycles, and can choose the parents for each node independently.
wolffd@0 2505 <p>
wolffd@0 2506 The original paper used the Bayesian scoring
wolffd@0 2507 metric with tabular CPDs and Dirichlet priors.
wolffd@0 2508 BNT generalizes this to allow any kind of CPD, and either the Bayesian
wolffd@0 2509 scoring metric or BIC, as in the example <a href="#enumerate">above</a>.
wolffd@0 2510 In addition, you can specify
wolffd@0 2511 an optional upper bound on the number of parents for each node.
wolffd@0 2512 The file BNT/examples/static/k2demo1.m gives an example of how to use K2.
wolffd@0 2513 We use the water sprinkler network and sample 100 cases from it as before.
wolffd@0 2514 Then we see how much data it takes to recover the generating structure:
<pre>
order = [C S R W];
max_fan_in = 2;
sz = 5:5:100;
for i=1:length(sz)
  dag2 = learn_struct_K2(data(:,1:sz(i)), node_sizes, order, 'max_fan_in', max_fan_in);
  correct(i) = isequal(dag, dag2);
end
</pre>
wolffd@0 2524 Here are the results.
wolffd@0 2525 <pre>
wolffd@0 2526 correct =
wolffd@0 2527 Columns 1 through 12
wolffd@0 2528 0 0 0 0 0 0 0 1 0 1 1 1
wolffd@0 2529 Columns 13 through 20
wolffd@0 2530 1 1 1 1 1 1 1 1
wolffd@0 2531 </pre>
wolffd@0 2532 So we see it takes about sz(10)=50 cases. (BIC behaves similarly,
wolffd@0 2533 showing that the prior doesn't matter too much.)
wolffd@0 2534 In general, we cannot hope to recover the "true" generating structure,
wolffd@0 2535 only one that is in its <a href="#markov_equiv">Markov equivalence
wolffd@0 2536 class</a>.
wolffd@0 2537
wolffd@0 2538
wolffd@0 2539 <h2><a name="hill_climb">Hill-climbing</h2>
wolffd@0 2540
Hill-climbing starts at a specific point in the space of DAGs,
considers all nearest neighbors, and moves to the neighbor
that has the highest score; if no neighbor has a higher
score than the current point (i.e., we have reached a local maximum),
the algorithm stops. One can then restart in another part of the space.
wolffd@0 2546 <p>
wolffd@0 2547 A common definition of "neighbor" is all graphs that can be
wolffd@0 2548 generated from the current graph by adding, deleting or reversing a
wolffd@0 2549 single arc, subject to the acyclicity constraint.
wolffd@0 2550 Other neighborhoods are possible: see
wolffd@0 2551 <a href="http://research.microsoft.com/~dmax/publications/jmlr02.pdf">
wolffd@0 2552 Optimal Structure Identification with Greedy Search</a>, Max
wolffd@0 2553 Chickering, JMLR 2002.
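<p>
A minimal hill-climbing loop can be built from the scoring routine and the
neighbor-generation helper used above. This is only a sketch, not an
official BNT routine; it assumes <tt>mk_nbrs_of_dag</tt> returns a cell
array of single-arc neighbors.
<pre>
dag = zeros(N,N);                          % start from the empty graph
best_score = score_dags(data, ns, {dag});
done = 0;
while ~done
  nbrs = mk_nbrs_of_dag(dag);              % single-arc additions/deletions/reversals
  scores = score_dags(data, ns, nbrs);
  [best_nbr_score, best_nbr] = max(scores);
  if best_nbr_score > best_score
    dag = nbrs{best_nbr};                  % move uphill
    best_score = best_nbr_score;
  else
    done = 1;                              % local maximum reached
  end
end
</pre>
To reduce the risk of getting stuck in a poor local maximum, rerun the loop
from several random starting graphs and keep the best result.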
wolffd@0 2554
wolffd@0 2555 <!--
wolffd@0 2556 Note: This algorithm is currently (Feb '02) being implemented by Qian
wolffd@0 2557 Diao.
wolffd@0 2558 -->
wolffd@0 2559
wolffd@0 2560
wolffd@0 2561 <h2><a name="mcmc">MCMC</h2>
wolffd@0 2562
wolffd@0 2563 We can use a Markov Chain Monte Carlo (MCMC) algorithm called
wolffd@0 2564 Metropolis-Hastings (MH) to search the space of all
wolffd@0 2565 DAGs.
wolffd@0 2566 The standard proposal distribution is to consider moving to all
wolffd@0 2567 nearest neighbors in the sense defined <a href="#hill_climb">above</a>.
wolffd@0 2568 <p>
wolffd@0 2569 The function can be called
wolffd@0 2570 as in the following example.
wolffd@0 2571 <pre>
wolffd@0 2572 [sampled_graphs, accept_ratio] = learn_struct_mcmc(data, ns, 'nsamples', 100, 'burnin', 10);
wolffd@0 2573 </pre>
wolffd@0 2574 We can convert our set of sampled graphs to a histogram
wolffd@0 2575 (empirical posterior over all the DAGs) thus
wolffd@0 2576 <pre>
wolffd@0 2577 all_dags = mk_all_dags(N);
wolffd@0 2578 mcmc_post = mcmc_sample_to_hist(sampled_graphs, all_dags);
wolffd@0 2579 </pre>
wolffd@0 2580 To see how well this performs, let us compute the exact posterior exhaustively.
wolffd@0 2581 <p>
wolffd@0 2582 <pre>
wolffd@0 2583 score = score_dags(data, ns, all_dags);
wolffd@0 2584 post = normalise(exp(score)); % assuming uniform structural prior
wolffd@0 2585 </pre>
wolffd@0 2586 We plot the results below.
wolffd@0 2587 (The data set was 100 samples drawn from a random 4 node bnet; see the
wolffd@0 2588 file BNT/examples/static/mcmc1.)
wolffd@0 2589 <pre>
wolffd@0 2590 subplot(2,1,1)
wolffd@0 2591 bar(post)
wolffd@0 2592 subplot(2,1,2)
wolffd@0 2593 bar(mcmc_post)
wolffd@0 2594 </pre>
wolffd@0 2595 <img src="Figures/mcmc_post.jpg" width="800" height="500">
wolffd@0 2596 <p>
wolffd@0 2597 We can also plot the acceptance ratio versus number of MCMC steps,
wolffd@0 2598 as a crude convergence diagnostic.
wolffd@0 2599 <pre>
wolffd@0 2600 clf
wolffd@0 2601 plot(accept_ratio)
wolffd@0 2602 </pre>
wolffd@0 2603 <img src="Figures/mcmc_accept.jpg" width="800" height="300">
wolffd@0 2604 <p>
wolffd@0 2605 Even though the number of samples needed by MCMC is theoretically
wolffd@0 2606 polynomial (not exponential) in the dimensionality of the search space, in practice it has been
wolffd@0 2607 found that MCMC does not converge in reasonable time for graphs with
wolffd@0 2608 more than about 10 nodes.
wolffd@0 2609
wolffd@0 2610
wolffd@0 2611
wolffd@0 2612
wolffd@0 2613 <h2><a name="active">Active structure learning</h2>
wolffd@0 2614
wolffd@0 2615 As was mentioned <a href="#markov_equiv">above</a>,
wolffd@0 2616 one can only learn a DAG up to Markov equivalence, even given infinite data.
wolffd@0 2617 If one is interested in learning the structure of a causal network,
wolffd@0 2618 one needs interventional data.
wolffd@0 2619 (By "intervention" we mean forcing a node to take on a specific value,
wolffd@0 2620 thereby effectively severing its incoming arcs.)
wolffd@0 2621 <p>
Most of the scoring functions accept an optional argument
that specifies whether a node was observed to have a certain value, or
was forced to have that value: we set clamped(i,m)=1 if node i was
forced (intervened on) in training case m. See, e.g., the file
BNT/examples/static/cooper_yoo.
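<p>
For example, a sketch of scoring DAGs using a mixture of observational and
interventional data might look as follows. (This assumes the clamping
matrix is passed to score_dags via an optional 'clamped' argument; check
the header comments of score_dags for the exact name in your version.)
<pre>
clamped = zeros(N, ncases);   % clamped(i,m)=1 means node i was set by intervention in case m
clamped(2, 1:10) = 1;         % e.g., node 2 was forced in the first 10 cases
score = score_dags(data, ns, dags, 'clamped', clamped);
</pre>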
wolffd@0 2627 <p>
wolffd@0 2628 An interesting question is to decide which interventions to perform
wolffd@0 2629 (c.f., design of experiments). For details, see the following tech
wolffd@0 2630 report
wolffd@0 2631 <ul>
wolffd@0 2632 <li> <a href = "../../Papers/alearn.ps.gz">
wolffd@0 2633 Active learning of causal Bayes net structure</a>, Kevin Murphy, March
wolffd@0 2634 2001.
wolffd@0 2635 </ul>
wolffd@0 2636
wolffd@0 2637
wolffd@0 2638 <h2><a name="struct_em">Structural EM</h2>
wolffd@0 2639
wolffd@0 2640 Computing the Bayesian score when there is partial observability is
wolffd@0 2641 computationally challenging, because the parameter posterior becomes
wolffd@0 2642 multimodal (the hidden nodes induce a mixture distribution).
wolffd@0 2643 One therefore needs to use approximations such as BIC.
wolffd@0 2644 Unfortunately, search algorithms are still expensive, because we need
wolffd@0 2645 to run EM at each step to compute the MLE, which is needed to compute
wolffd@0 2646 the score of each model. An alternative approach is
wolffd@0 2647 to do the local search steps inside of the M step of EM, which is more
efficient since the data has been "filled in"; this is
called the structural EM algorithm (Friedman 1997), and it provably
converges to a local maximum of the BIC score.
wolffd@0 2651 <p>
wolffd@0 2652 Wei Hu has implemented SEM for discrete nodes.
wolffd@0 2653 You can download his package from
wolffd@0 2654 <a href="../SEM.zip">here</a>.
wolffd@0 2655 Please address all questions about this code to
wolffd@0 2656 wei.hu@intel.com.
See also <a href="#phl">Philippe Leray's implementation of SEM</a>.
wolffd@0 2658
wolffd@0 2659 <!--
wolffd@0 2660 <h2><a name="reveal">REVEAL algorithm</h2>
wolffd@0 2661
wolffd@0 2662 A simple way to learn the structure of a fully observed, discrete,
wolffd@0 2663 factored DBN from a time series is described <a
wolffd@0 2664 href="usage_dbn.html#struct_learn">here</a>.
wolffd@0 2665 -->
wolffd@0 2666
wolffd@0 2667
wolffd@0 2668 <h2><a name="graphdraw">Visualizing the graph</h2>
wolffd@0 2669
wolffd@0 2670 You can visualize an arbitrary graph (such as one learned using the
wolffd@0 2671 structure learning routines) with Matlab code contributed by
wolffd@0 2672 <a href="http://www.mbfys.kun.nl/~cemgil/matlab/layout.html">Ali
wolffd@0 2673 Taylan Cemgil</a>
wolffd@0 2674 from the University of Nijmegen.
wolffd@0 2675 For static BNs, call it as follows:
wolffd@0 2676 <pre>
wolffd@0 2677 draw_graph(bnet.dag);
wolffd@0 2678 </pre>
wolffd@0 2679 For example, this is the output produced on a
wolffd@0 2680 <a href="#qmr">random QMR-like model</a>:
wolffd@0 2681 <p>
wolffd@0 2682 <img src="Figures/qmr.rnd.jpg">
wolffd@0 2683 <p>
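If you want the nodes labelled with names rather than numbers, you can pass
a cell array of strings. This is a sketch; it assumes the contributed
draw_graph accepts the labels as its second argument, which may vary
between versions:
<pre>
names = {'Cloudy', 'Sprinkler', 'Rain', 'WetGrass'};  % e.g., for the sprinkler network
draw_graph(bnet.dag, names);
</pre>
<p>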
wolffd@0 2684 If you install the excellent <a
href="http://www.research.att.com/sw/tools/graphviz">graphviz</a>, an
wolffd@0 2686 open-source graph visualization package from AT&T,
wolffd@0 2687 you can create a much better visualization as follows
wolffd@0 2688 <pre>
wolffd@0 2689 graph_to_dot(bnet.dag)
wolffd@0 2690 </pre>
wolffd@0 2691 This works by converting the adjacency matrix to a file suitable
wolffd@0 2692 for input to graphviz (using the dot format),
wolffd@0 2693 then converting the output of graphviz to postscript, and displaying the results using
wolffd@0 2694 ghostview.
wolffd@0 2695 You can do each of these steps separately for more control, as shown
wolffd@0 2696 below.
wolffd@0 2697 <pre>
wolffd@0 2698 graph_to_dot(bnet.dag, 'filename', 'foo.dot');
wolffd@0 2699 dot -Tps foo.dot -o foo.ps
wolffd@0 2700 ghostview foo.ps &
wolffd@0 2701 </pre>
wolffd@0 2702
wolffd@0 2703 <h2><a name = "constraint">Constraint-based methods</h2>
wolffd@0 2704
The IC algorithm (Pearl and Verma, 1991)
and the faster, but otherwise equivalent, PC algorithm (Spirtes, Glymour, and Scheines 1993)
perform many conditional independence tests,
and combine these constraints into a
PDAG that represents the whole
<a href="#markov_equiv">Markov equivalence class</a>.
wolffd@0 2711 <p>
wolffd@0 2712 IC*/FCI extend IC/PC to handle latent variables: see <a href="#ic_star">below</a>.
wolffd@0 2713 (IC stands for inductive causation; PC stands for Peter and Clark,
wolffd@0 2714 the first names of Spirtes and Glymour; FCI stands for fast causal
wolffd@0 2715 inference.
wolffd@0 2716 What we, following Pearl (2000), call IC* was called
wolffd@0 2717 IC in the original Pearl and Verma paper.)
wolffd@0 2718 For details, see
wolffd@0 2719 <ul>
wolffd@0 2720 <li>
wolffd@0 2721 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Causation,
wolffd@0 2722 Prediction, and Search</a>, Spirtes, Glymour and
wolffd@0 2723 Scheines (SGS), 2001 (2nd edition), MIT Press.
wolffd@0 2724 <li>
wolffd@0 2725 <a href="http://bayes.cs.ucla.edu/BOOK-2K/index.html">Causality: Models, Reasoning and Inference</a>, J. Pearl,
wolffd@0 2726 2000, Cambridge University Press.
wolffd@0 2727 </ul>
wolffd@0 2728
wolffd@0 2729 <p>
wolffd@0 2730
The PC algorithm takes as arguments a function f, the number of nodes N,
the maximum fan-in K, and additional arguments A which are passed to f.
The function f(X,Y,S,A) returns 1 if X is conditionally independent of Y given S, and 0
otherwise.
For example, suppose we cheat by
passing in a CI "oracle" which has access to the true DAG; the oracle
tests for d-separation in this DAG, i.e.,
f(X,Y,S) calls dsep(X,Y,S,dag). We can do this as follows.
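First, reconstruct the water sprinkler DAG used earlier in this document
(C, S, R and W denote Cloudy, Sprinkler, Rain and WetGrass, as in the
earlier sections) to serve as the oracle's true graph:
<pre>
N = 4;
C = 1; S = 2; R = 3; W = 4;
dag = zeros(N,N);
dag(C,[R S]) = 1;      % Cloudy -> Rain, Cloudy -> Sprinkler
dag(R,W) = 1;          % Rain -> WetGrass
dag(S,W) = 1;          % Sprinkler -> WetGrass
max_fan_in = 2;        % maximum fan-in, as described above
</pre>
Then run PC with the d-separation oracle: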
wolffd@0 2739 <pre>
wolffd@0 2740 pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag);
wolffd@0 2741 </pre>
wolffd@0 2742 pdag(i,j) = -1 if there is definitely an i->j arc,
wolffd@0 2743 and pdag(i,j) = 1 if there is either an i->j or and i<-j arc.
wolffd@0 2744 <p>
wolffd@0 2745 Applied to the sprinkler network, this returns
wolffd@0 2746 <pre>
wolffd@0 2747 pdag =
wolffd@0 2748 0 1 1 0
wolffd@0 2749 1 0 0 -1
wolffd@0 2750 1 0 0 -1
wolffd@0 2751 0 0 0 0
wolffd@0 2752 </pre>
wolffd@0 2753 So as expected, we see that the V-structure at the W node is uniquely identified,
wolffd@0 2754 but the other arcs have ambiguous orientation.
wolffd@0 2755 <p>
We now give an example from p141 (1st edn) / p103 (2nd edn) of the SGS
wolffd@0 2757 book.
wolffd@0 2758 This example concerns the female orgasm.
wolffd@0 2759 We are given a correlation matrix C between 7 measured factors (such
as subjective ratings of coital and masturbatory experience),
wolffd@0 2761 derived from 281 samples, and want to learn a causal model of the
wolffd@0 2762 data. We will not discuss the merits of this type of work here, but
wolffd@0 2763 merely show how to reproduce the results in the SGS book.
wolffd@0 2764 Their program,
wolffd@0 2765 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Tetrad</a>,
wolffd@0 2766 makes use of the Fisher Z-test for conditional
wolffd@0 2767 independence, so we do the same:
<pre>
n = 7;            % number of measured variables
max_fan_in = 4;
nsamples = 281;   % number of samples the correlation matrix was computed from
alpha = 0.05;     % significance level for the Fisher Z test
pdag = learn_struct_pdag_pc('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha);
</pre>
wolffd@0 2774 In this case, the CI test is
wolffd@0 2775 <pre>
wolffd@0 2776 f(X,Y,S) = cond_indep_fisher_z(X,Y,S, C,nsamples,alpha)
wolffd@0 2777 </pre>
wolffd@0 2778 The results match those of Fig 12a of SGS apart from two edge
wolffd@0 2779 differences; presumably this is due to rounding error (although it
wolffd@0 2780 could be a bug, either in BNT or in Tetrad).
wolffd@0 2781 This example can be found in the file BNT/examples/static/pc2.m.
wolffd@0 2782
wolffd@0 2783 <p>
wolffd@0 2784
wolffd@0 2785 The IC* algorithm (Pearl and Verma, 1991),
wolffd@0 2786 and the faster FCI algorithm (Spirtes, Glymour, and Scheines 1993),
wolffd@0 2787 are like the IC/PC algorithm, except that they can detect the presence
wolffd@0 2788 of latent variables.
wolffd@0 2789 See the file <tt>learn_struct_pdag_ic_star</tt> written by Tamar
wolffd@0 2790 Kushnir. The output is a matrix P, defined as follows
wolffd@0 2791 (see Pearl (2000), p52 for details):
wolffd@0 2792 <pre>
wolffd@0 2793 % P(i,j) = -1 if there is either a latent variable L such that i <-L->j OR there is a directed edge from i->j.
wolffd@0 2794 % P(i,j) = -2 if there is a marked directed i-*>j edge.
% P(i,j) = P(j,i) = 1 if there is an undirected edge i--j
wolffd@0 2796 % P(i,j) = P(j,i) = 2 if there is a latent variable L such that i<-L->j.
wolffd@0 2797 </pre>
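<p>
A sketch of calling it with the same d-separation oracle as in the PC
example above (this assumes learn_struct_pdag_ic_star follows the same
calling convention as learn_struct_pdag_pc; check the file header for the
exact signature):
<pre>
P = learn_struct_pdag_ic_star('dsep', N, max_fan_in, dag);
</pre>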
wolffd@0 2798
wolffd@0 2799
wolffd@0 2800 <h2><a name="phl">Philippe Leray's structure learning package</h2>
wolffd@0 2801
wolffd@0 2802 Philippe Leray has written a
wolffd@0 2803 <a href="http://bnt.insa-rouen.fr/ajouts.html">
wolffd@0 2804 structure learning package</a> that uses BNT.
wolffd@0 2805
It currently (June 2003) has the following features:
wolffd@0 2807 <ul>
<li> PC with a Chi2 statistical test
<li> MWST: maximum weight spanning tree
<li> Hill climbing
<li> Greedy search
<li> Structural EM
<li> hist_ic: optimal histogram based on the IC information criterion
<li> cpdag_to_dag
<li> dag_to_cpdag
<li> ...
wolffd@0 2817 </ul>
wolffd@0 2818
wolffd@0 2819
wolffd@0 2821
wolffd@0 2822
wolffd@0 2823 <!--
wolffd@0 2824 <h2><a name="read_learning">Further reading on learning</h2>
wolffd@0 2825
wolffd@0 2826 I recommend the following tutorials for more details on learning.
wolffd@0 2827 <ul>
wolffd@0 2828 <li> <a
wolffd@0 2829 href="http://www.cs.berkeley.edu/~murphyk/Papers/intel.ps.gz">My short
wolffd@0 2830 tutorial</a> on graphical models, which contains an overview of learning.
wolffd@0 2831
wolffd@0 2832 <li>
wolffd@0 2833 <A HREF="ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.PS">
wolffd@0 2834 A tutorial on learning with Bayesian networks</a>, D. Heckerman,
wolffd@0 2835 Microsoft Research Tech Report, 1995.
wolffd@0 2836
wolffd@0 2837 <li> <A HREF="http://www-cad.eecs.berkeley.edu/~wray/Mirror/lwgmja">
wolffd@0 2838 Operations for Learning with Graphical Models</a>,
wolffd@0 2839 W. L. Buntine, JAIR'94, 159--225.
wolffd@0 2840 </ul>
wolffd@0 2841 <p>
wolffd@0 2842 -->
wolffd@0 2843
wolffd@0 2844
wolffd@0 2845
wolffd@0 2846
wolffd@0 2847
wolffd@0 2848 <h1><a name="engines">Inference engines</h1>
wolffd@0 2849
wolffd@0 2850 Up until now, we have used the junction tree algorithm for inference.
wolffd@0 2851 However, sometimes this is too slow, or not even applicable.
In general, there are many inference algorithms, each of which makes
wolffd@0 2853 different tradeoffs between speed, accuracy, complexity and
wolffd@0 2854 generality. Furthermore, there might be many implementations of the
wolffd@0 2855 same algorithm; for instance, a general purpose, readable version,
wolffd@0 2856 and a highly-optimized, specialized one.
wolffd@0 2857 To cope with this variety, we treat each inference algorithm as an
wolffd@0 2858 object, which we call an inference engine.
wolffd@0 2859
wolffd@0 2860 <p>
wolffd@0 2861 An inference engine is an object that contains a bnet and supports the
wolffd@0 2862 'enter_evidence' and 'marginal_nodes' methods. The engine constructor
wolffd@0 2863 takes the bnet as argument and may do some model-specific processing.
wolffd@0 2864 When 'enter_evidence' is called, the engine may do some
wolffd@0 2865 evidence-specific processing. Finally, when 'marginal_nodes' is
wolffd@0 2866 called, the engine may do some query-specific processing.
wolffd@0 2867
wolffd@0 2868 <p>
wolffd@0 2869 The amount of work done when each stage is specified -- structure,
wolffd@0 2870 parameters, evidence, and query -- depends on the engine. The cost of
wolffd@0 2871 work done early in this sequence can be amortized. On the other hand,
wolffd@0 2872 one can make better optimizations if one waits until later in the
wolffd@0 2873 sequence.
wolffd@0 2874 For example, the parameters might imply
conditional independencies that are not evident in the graph structure,
wolffd@0 2876 but can nevertheless be exploited; the evidence indicates which nodes
wolffd@0 2877 are observed and hence can effectively be disconnected from the
wolffd@0 2878 graph; and the query might indicate that large parts of the network
wolffd@0 2879 are d-separated from the query nodes. (Since it is not the actual
wolffd@0 2880 <em>values</em> of the evidence that matters, just which nodes are observed,
wolffd@0 2881 many engines allow you to specify which nodes will be observed when they are constructed,
wolffd@0 2882 i.e., before calling 'enter_evidence'. Some engines can still cope if
wolffd@0 2883 the actual pattern of evidence is different, e.g., if there is missing
wolffd@0 2884 data.)
wolffd@0 2885 <p>
wolffd@0 2886
wolffd@0 2887 Although being maximally lazy (i.e., only doing work when a query is
wolffd@0 2888 issued) may seem desirable,
this is not always the most efficient strategy.
wolffd@0 2890 For example,
wolffd@0 2891 when learning using EM, we need to call marginal_nodes N times, where N is the
number of nodes. <a href="#varelim">Variable elimination</a> would end
wolffd@0 2893 up repeating a lot of work
wolffd@0 2894 each time marginal_nodes is called, making it inefficient for
wolffd@0 2895 learning. The junction tree algorithm, by contrast, uses dynamic
wolffd@0 2896 programming to avoid this redundant computation --- it calculates all
wolffd@0 2897 marginals in two passes during 'enter_evidence', so calling
wolffd@0 2898 'marginal_nodes' takes constant time.
wolffd@0 2899 <p>
wolffd@0 2900 We will discuss some of the inference algorithms implemented in BNT
wolffd@0 2901 below, and finish with a <a href="#engine_summary">summary</a> of all
wolffd@0 2902 of them.
wolffd@0 2903
wolffd@0 2904
wolffd@0 2905
wolffd@0 2906
wolffd@0 2907
wolffd@0 2908
wolffd@0 2909
wolffd@0 2910 <h2><a name="varelim">Variable elimination</h2>
wolffd@0 2911
wolffd@0 2912 The variable elimination algorithm, also known as bucket elimination
wolffd@0 2913 or peeling, is one of the simplest inference algorithms.
wolffd@0 2914 The basic idea is to "push sums inside of products"; this is explained
wolffd@0 2915 in more detail
wolffd@0 2916 <a
wolffd@0 2917 href="http://HTTP.CS.Berkeley.EDU/~murphyk/Bayes/bayes.html#infer">here</a>.
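<p>
As a tiny illustration, consider a chain A -> B -> C with (hypothetical)
tables pA(a), pB(a,b) = P(B=b|A=a) and pC(b,c) = P(C=c|B=b). Computing P(C)
by summing the full joint costs |A|*|B| terms per entry of C; eliminating A
first ("peeling") reduces this to two small matrix-vector products:
<pre>
% P(c) = sum_b pC(c|b) * [ sum_a pA(a) * pB(b|a) ]
msgB = pB' * pA;        % eliminate A: msgB(b) = sum_a pA(a) pB(b|a)
pC_marg = pC' * msgB;   % eliminate B: pC_marg(c) = sum_b msgB(b) pC(c|b)
</pre>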
wolffd@0 2918 <p>
wolffd@0 2919 The principle of distributing sums over products can be generalized
wolffd@0 2920 greatly to apply to any commutative semiring.
wolffd@0 2921 This forms the basis of many common algorithms, such as Viterbi
wolffd@0 2922 decoding and the Fast Fourier Transform. For details, see
wolffd@0 2923
wolffd@0 2924 <ul>
wolffd@0 2925 <li> R. McEliece and S. M. Aji, 2000.
wolffd@0 2926 <!--<a href="http://www.systems.caltech.edu/EE/Faculty/rjm/papers/GDL.ps">-->
wolffd@0 2927 <a href="GDL.pdf">
wolffd@0 2928 The Generalized Distributive Law</a>,
wolffd@0 2929 IEEE Trans. Inform. Theory, vol. 46, no. 2 (March 2000),
wolffd@0 2930 pp. 325--343.
wolffd@0 2931
wolffd@0 2932
wolffd@0 2933 <li>
wolffd@0 2934 F. R. Kschischang, B. J. Frey and H.-A. Loeliger, 2001.
wolffd@0 2935 <a href="http://www.cs.toronto.edu/~frey/papers/fgspa.abs.html">
wolffd@0 2936 Factor graphs and the sum-product algorithm</a>
wolffd@0 2937 IEEE Transactions on Information Theory, February, 2001.
wolffd@0 2938
wolffd@0 2939 </ul>
wolffd@0 2940
wolffd@0 2941 <p>
wolffd@0 2942 Choosing an order in which to sum out the variables so as to minimize
wolffd@0 2943 computational cost is known to be NP-hard.
wolffd@0 2944 The implementation of this algorithm in
wolffd@0 2945 <tt>var_elim_inf_engine</tt> makes no attempt to optimize this
wolffd@0 2946 ordering (in contrast, say, to <tt>jtree_inf_engine</tt>, which uses a
wolffd@0 2947 greedy search procedure to find a good ordering).
wolffd@0 2948 <p>
wolffd@0 2949 Note: unlike most algorithms, var_elim does all its computational work
wolffd@0 2950 inside of <tt>marginal_nodes</tt>, not inside of
wolffd@0 2951 <tt>enter_evidence</tt>.
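<p>
It follows the standard engine interface described above, e.g. (assuming
bnet, evidence and a query node i have been defined as in the earlier
examples):
<pre>
engine = var_elim_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);
</pre>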
wolffd@0 2952
wolffd@0 2953
wolffd@0 2954
wolffd@0 2955
wolffd@0 2956 <h2><a name="global">Global inference methods</h2>
wolffd@0 2957
The simplest inference algorithm of all is to explicitly construct
wolffd@0 2959 the joint distribution over all the nodes, and then to marginalize it.
wolffd@0 2960 This is implemented in <tt>global_joint_inf_engine</tt>.
wolffd@0 2961 Since the size of the joint is exponential in the
wolffd@0 2962 number of discrete (hidden) nodes, this is not a very practical algorithm.
wolffd@0 2963 It is included merely for pedagogical and debugging purposes.
wolffd@0 2964 <p>
wolffd@0 2965 Three specialized versions of this algorithm have also been implemented,
wolffd@0 2966 corresponding to the cases where all the nodes are discrete (D), all
wolffd@0 2967 are Gaussian (G), and some are discrete and some Gaussian (CG).
wolffd@0 2968 They are called <tt>enumerative_inf_engine</tt>,
wolffd@0 2969 <tt>gaussian_inf_engine</tt>,
wolffd@0 2970 and <tt>cond_gauss_inf_engine</tt> respectively.
wolffd@0 2971 <p>
wolffd@0 2972 Note: unlike most algorithms, these global inference algorithms do all their computational work
wolffd@0 2973 inside of <tt>marginal_nodes</tt>, not inside of
wolffd@0 2974 <tt>enter_evidence</tt>.
wolffd@0 2975
wolffd@0 2976
wolffd@0 2977 <h2><a name="quickscore">Quickscore</h2>
wolffd@0 2978
wolffd@0 2979 The junction tree algorithm is quite slow on the <a href="#qmr">QMR</a> network,
wolffd@0 2980 since the cliques are so big.
wolffd@0 2981 One simple trick we can use is to notice that hidden leaves do not
wolffd@0 2982 affect the posteriors on the roots, and hence do not need to be
wolffd@0 2983 included in the network.
wolffd@0 2984 A second trick is to notice that the negative findings can be
wolffd@0 2985 "absorbed" into the prior:
wolffd@0 2986 see the file
wolffd@0 2987 BNT/examples/static/mk_minimal_qmr_bnet for details.
wolffd@0 2988 <p>
wolffd@0 2989
wolffd@0 2990 A much more significant speedup is obtained by exploiting special
wolffd@0 2991 properties of the noisy-or node, as done by the quickscore
wolffd@0 2992 algorithm. For details, see
wolffd@0 2993 <ul>
wolffd@0 2994 <li> Heckerman, "A tractable inference algorithm for diagnosing multiple diseases", UAI 89.
wolffd@0 2995 <li> Rish and Dechter, "On the impact of causal independence", UCI
wolffd@0 2996 tech report, 1998.
wolffd@0 2997 </ul>
wolffd@0 2998
wolffd@0 2999 This has been implemented in BNT as a special-purpose inference
wolffd@0 3000 engine, which can be created and used as follows:
wolffd@0 3001 <pre>
wolffd@0 3002 engine = quickscore_inf_engine(inhibit, leak, prior);
wolffd@0 3003 engine = enter_evidence(engine, pos, neg);
wolffd@0 3004 m = marginal_nodes(engine, i);
wolffd@0 3005 </pre>
wolffd@0 3006
wolffd@0 3007
wolffd@0 3008 <h2><a name="belprop">Belief propagation</h2>
wolffd@0 3009
wolffd@0 3010 Even using quickscore, exact inference takes time that is exponential
wolffd@0 3011 in the number of positive findings.
wolffd@0 3012 Hence for large networks we need to resort to approximate inference techniques.
wolffd@0 3013 See for example
wolffd@0 3014 <ul>
wolffd@0 3015 <li> T. Jaakkola and M. Jordan, "Variational probabilistic inference and the
wolffd@0 3016 QMR-DT network", JAIR 10, 1999.
wolffd@0 3017
wolffd@0 3018 <li> K. Murphy, Y. Weiss and M. Jordan, "Loopy belief propagation for approximate inference: an empirical study",
wolffd@0 3019 UAI 99.
wolffd@0 3020 </ul>
wolffd@0 3021 The latter approximation
wolffd@0 3022 entails applying Pearl's belief propagation algorithm to a model even
wolffd@0 3023 if it has loops (hence the name loopy belief propagation).
wolffd@0 3024 Pearl's algorithm, implemented as <tt>pearl_inf_engine</tt>, gives
wolffd@0 3025 exact results when applied to singly-connected graphs
wolffd@0 3026 (a.k.a. polytrees, since
wolffd@0 3027 the underlying undirected topology is a tree, but a node may have
wolffd@0 3028 multiple parents).
To apply this algorithm to a graph with loops, use the same
<tt>pearl_inf_engine</tt>, but specify a maximum number of iterations.
It can use a centralized or distributed message-passing protocol.
You can use it as in the following example.
wolffd@0 3033 <pre>
wolffd@0 3034 engine = pearl_inf_engine(bnet, 'max_iter', 30);
wolffd@0 3035 engine = enter_evidence(engine, evidence);
wolffd@0 3036 m = marginal_nodes(engine, i);
wolffd@0 3037 </pre>
We found that this algorithm often converges, and when it does, it is
often very accurate; however, this depends on the precise settings of
the parameter values of the network.
wolffd@0 3041 (See the file BNT/examples/static/qmr1 to repeat the experiment for yourself.)
Understanding when and why belief propagation converges/works
wolffd@0 3043 is a topic of ongoing research.
wolffd@0 3044 <p>
wolffd@0 3045 <tt>pearl_inf_engine</tt> can exploit special structure in noisy-or
wolffd@0 3046 and gmux nodes to compute messages efficiently.
wolffd@0 3047 <p>
wolffd@0 3048 <tt>belprop_inf_engine</tt> is like pearl, but uses potentials to
wolffd@0 3049 represent messages. Hence this is slower.
wolffd@0 3050 <p>
wolffd@0 3051 <tt>belprop_fg_inf_engine</tt> is like belprop,
wolffd@0 3052 but is designed for factor graphs.
wolffd@0 3053
wolffd@0 3054
wolffd@0 3055
wolffd@0 3056 <h2><a name="sampling">Sampling</h2>
wolffd@0 3057
wolffd@0 3058 BNT now (Mar '02) has two sampling (Monte Carlo) inference algorithms:
wolffd@0 3059 <ul>
<li> <tt>likelihood_weighting_inf_engine</tt>, which does importance
sampling and can handle any node type; see the example below.
wolffd@0 3062 <li> <tt>gibbs_sampling_inf_engine</tt>, written by Bhaskara Marthi.
wolffd@0 3063 Currently this can only handle tabular CPDs.
wolffd@0 3064 For a much faster and more powerful Gibbs sampling program, see
wolffd@0 3065 <a href="http://www.mrc-bsu.cam.ac.uk/bugs">BUGS</a>.
wolffd@0 3066 </ul>
wolffd@0 3067 Note: To generate samples from a network (which is not the same as inference!),
wolffd@0 3068 use <tt>sample_bnet</tt>.
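<p>
The sampling engines follow the standard interface. For example, here is a
sketch for likelihood weighting (again assuming bnet, evidence and a query
node i are defined as in the earlier examples):
<pre>
engine = likelihood_weighting_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);
</pre>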
wolffd@0 3069
wolffd@0 3070
wolffd@0 3071
wolffd@0 3072 <h2><a name="engine_summary">Summary of inference engines</h2>
wolffd@0 3073
wolffd@0 3074
wolffd@0 3075 The inference engines differ in many ways. Here are
wolffd@0 3076 some of the major "axes":
wolffd@0 3077 <ul>
wolffd@0 3078 <li> Works for all topologies or makes restrictions?
wolffd@0 3079 <li> Works for all node types or makes restrictions?
wolffd@0 3080 <li> Exact or approximate inference?
wolffd@0 3081 </ul>
wolffd@0 3082
wolffd@0 3083 <p>
wolffd@0 3084 In terms of topology, most engines handle any kind of DAG.
wolffd@0 3085 <tt>belprop_fg</tt> does approximate inference on factor graphs (FG), which
wolffd@0 3086 can be used to represent directed, undirected, and mixed (chain)
wolffd@0 3087 graphs.
wolffd@0 3088 (In the future, we plan to support exact inference on chain graphs.)
wolffd@0 3089 <tt>quickscore</tt> only works on QMR-like models.
wolffd@0 3090 <p>
wolffd@0 3091 In terms of node types: algorithms that use potentials can handle
wolffd@0 3092 discrete (D), Gaussian (G) or conditional Gaussian (CG) models.
wolffd@0 3093 Sampling algorithms can essentially handle any kind of node (distribution).
wolffd@0 3094 Other algorithms make more restrictive assumptions in exchange for
wolffd@0 3095 speed.
wolffd@0 3096 <p>
wolffd@0 3097 Finally, most algorithms are designed to give the exact answer.
wolffd@0 3098 The belief propagation algorithms are exact if applied to trees, and
wolffd@0 3099 in some other cases.
wolffd@0 3100 Sampling is considered approximate, even though, in the limit of an
wolffd@0 3101 infinite number of samples, it gives the exact answer.
wolffd@0 3102
wolffd@0 3103 <p>
wolffd@0 3104
wolffd@0 3105 Here is a summary of the properties
wolffd@0 3106 of all the engines in BNT which work on static networks.
wolffd@0 3107 <p>
<table border>
<tr>
wolffd@0 3110 <td align=left width=0>Name
wolffd@0 3111 <td align=left width=0>Exact?
wolffd@0 3112 <td align=left width=0>Node type?
<td align=left width=0>Topology
<tr>
wolffd@0 3116 <td align=left> belprop
wolffd@0 3117 <td align=left> approx
wolffd@0 3118 <td align=left> D
wolffd@0 3119 <td align=left> DAG
wolffd@0 3120 <tr>
wolffd@0 3121 <td align=left> belprop_fg
wolffd@0 3122 <td align=left> approx
wolffd@0 3123 <td align=left> D
wolffd@0 3124 <td align=left> factor graph
wolffd@0 3125 <tr>
wolffd@0 3126 <td align=left> cond_gauss
wolffd@0 3127 <td align=left> exact
wolffd@0 3128 <td align=left> CG
wolffd@0 3129 <td align=left> DAG
wolffd@0 3130 <tr>
wolffd@0 3131 <td align=left> enumerative
wolffd@0 3132 <td align=left> exact
wolffd@0 3133 <td align=left> D
wolffd@0 3134 <td align=left> DAG
wolffd@0 3135 <tr>
wolffd@0 3136 <td align=left> gaussian
wolffd@0 3137 <td align=left> exact
wolffd@0 3138 <td align=left> G
wolffd@0 3139 <td align=left> DAG
wolffd@0 3140 <tr>
wolffd@0 3141 <td align=left> gibbs
wolffd@0 3142 <td align=left> approx
wolffd@0 3143 <td align=left> D
wolffd@0 3144 <td align=left> DAG
wolffd@0 3145 <tr>
wolffd@0 3146 <td align=left> global_joint
wolffd@0 3147 <td align=left> exact
wolffd@0 3148 <td align=left> D,G,CG
wolffd@0 3149 <td align=left> DAG
wolffd@0 3150 <tr>
wolffd@0 3151 <td align=left> jtree
wolffd@0 3152 <td align=left> exact
wolffd@0 3153 <td align=left> D,G,CG
wolffd@0 3154 <td align=left> DAG
<tr>
wolffd@0 3156 <td align=left> likelihood_weighting
wolffd@0 3157 <td align=left> approx
wolffd@0 3158 <td align=left> any
wolffd@0 3159 <td align=left> DAG
wolffd@0 3160 <tr>
wolffd@0 3161 <td align=left> pearl
wolffd@0 3162 <td align=left> approx
wolffd@0 3163 <td align=left> D,G
wolffd@0 3164 <td align=left> DAG
wolffd@0 3165 <tr>
wolffd@0 3166 <td align=left> pearl
wolffd@0 3167 <td align=left> exact
wolffd@0 3168 <td align=left> D,G
wolffd@0 3169 <td align=left> polytree
wolffd@0 3170 <tr>
wolffd@0 3171 <td align=left> quickscore
wolffd@0 3172 <td align=left> exact
wolffd@0 3173 <td align=left> noisy-or
wolffd@0 3174 <td align=left> QMR
wolffd@0 3175 <tr>
wolffd@0 3176 <td align=left> stab_cond_gauss
wolffd@0 3177 <td align=left> exact
wolffd@0 3178 <td align=left> CG
wolffd@0 3179 <td align=left> DAG
wolffd@0 3180 <tr>
wolffd@0 3181 <td align=left> var_elim
wolffd@0 3182 <td align=left> exact
wolffd@0 3183 <td align=left> D,G,CG
wolffd@0 3184 <td align=left> DAG
wolffd@0 3185 </table>
wolffd@0 3186
wolffd@0 3187
wolffd@0 3188
wolffd@0 3189 <h1><a name="influence">Influence diagrams/ decision making</h1>
wolffd@0 3190
wolffd@0 3191 BNT implements an exact algorithm for solving LIMIDs (limited memory
wolffd@0 3192 influence diagrams), described in
wolffd@0 3193 <ul>
wolffd@0 3194 <li> S. L. Lauritzen and D. Nilsson.
wolffd@0 3195 <a href="http://www.math.auc.dk/~steffen/papers/limids.pdf">
wolffd@0 3196 Representing and solving decision problems with limited
wolffd@0 3197 information</a>
wolffd@0 3198 Management Science, 47, 1238 - 1251. September 2001.
wolffd@0 3199 </ul>
LIMIDs explicitly show all information arcs, rather than implicitly
wolffd@0 3201 assuming no forgetting. This allows them to model forgetful
wolffd@0 3202 controllers.
wolffd@0 3203 <p>
wolffd@0 3204 See the examples in <tt>BNT/examples/limids</tt> for details.
wolffd@0 3205
wolffd@0 3206
wolffd@0 3207
wolffd@0 3208
wolffd@0 3209 <h1>DBNs, HMMs, Kalman filters and all that</h1>
wolffd@0 3210
wolffd@0 3211 Click <a href="usage_dbn.html">here</a> for documentation about how to
wolffd@0 3212 use BNT for dynamical systems and sequence data.
wolffd@0 3213
wolffd@0 3214
wolffd@0 3215 </BODY>