comparison toolboxes/FullBNT-1.0.7/docs/usage_sf.html @ 0:e9a9cd732c1e tip
first hg version after svn
| author | wolffd |
|---|---|
| date | Tue, 10 Feb 2015 15:05:51 +0000 |
| parents | |
| children | |
| 1 <HEAD> | |
| 2 <TITLE>How to use the Bayes Net Toolbox</TITLE> | |
| 3 </HEAD> | |
| 4 | |
| 5 <BODY BGCOLOR="#FFFFFF"> | |
| 6 <!-- white background is better for the pictures and equations --> | |
| 7 | |
| 8 <h1>How to use the Bayes Net Toolbox</h1> | |
| 9 | |
| 10 This documentation was last updated on 7 June 2004. | |
| 11 <br> | |
| 12 Click <a href="changelog.html">here</a> for a list of changes made to | |
| 13 BNT. | |
| 14 <br> | |
| 15 Click | |
| 16 <a href="http://bnt.insa-rouen.fr/">here</a> | |
| 17 for a French version of this documentation (which might not | |
| 18 be up-to-date). | |
| 19 <br> | |
| 20 Update 23 May 2005: | |
| 21 Philippe LeRay has written | |
| 22 a | |
| 23 <a href="http://banquiseasi.insa-rouen.fr/projects/bnt-editor/"> | |
| 24 BNT GUI</a> | |
| 25 and | |
| 26 <a href="http://banquiseasi.insa-rouen.fr/projects/bnt-slp/"> | |
| 27 BNT Structure Learning Package</a>. | |
| 28 | |
| 29 <p> | |
| 30 | |
| 31 <ul> | |
| 32 <li> <a href="#install">Installation</a> | |
| 33 <ul> | |
| 34 <li> <a href="#installM">Installing the Matlab code</a> | |
| 35 <li> <a href="#installC">Installing the C code</a> | |
| 36 <li> <a href="../matlab_tips.html">Useful Matlab tips</a>. | |
| 37 </ul> | |
| 38 | |
| 39 <li> <a href="#basics">Creating your first Bayes net</a> | |
| 40 <ul> | |
| 41 <li> <a href="#basics">Creating a model by hand</a> | |
| 42 <li> <a href="#file">Loading a model from a file</a> | |
| 43 <li> <a href="http://bnt.insa-rouen.fr/ajouts.html">Creating a model using a GUI</a> | |
| 44 </ul> | |
| 45 | |
| 46 <li> <a href="#inference">Inference</a> | |
| 47 <ul> | |
| 48 <li> <a href="#marginal">Computing marginal distributions</a> | |
| 49 <li> <a href="#joint">Computing joint distributions</a> | |
| 50 <li> <a href="#soft">Soft/virtual evidence</a> | |
| 51 <li> <a href="#mpe">Most probable explanation</a> | |
| 52 </ul> | |
| 53 | |
| 54 <li> <a href="#cpd">Conditional Probability Distributions</a> | |
| 55 <ul> | |
| 56 <li> <a href="#tabular">Tabular (multinomial) nodes</a> | |
| 57 <li> <a href="#noisyor">Noisy-or nodes</a> | |
| 58 <li> <a href="#deterministic">Other (noisy) deterministic nodes</a> | |
| 59 <li> <a href="#softmax">Softmax (multinomial logit) nodes</a> | |
| 60 <li> <a href="#mlp">Neural network nodes</a> | |
| 61 <li> <a href="#root">Root nodes</a> | |
| 62 <li> <a href="#gaussian">Gaussian nodes</a> | |
| 63 <li> <a href="#glm">Generalized linear model nodes</a> | |
| 64 <li> <a href="#dtree">Classification/regression tree nodes</a> | |
| 65 <li> <a href="#nongauss">Other continuous distributions</a> | |
| 66 <li> <a href="#cpd_summary">Summary of CPD types</a> | |
| 67 </ul> | |
| 68 | |
| 69 <li> <a href="#examples">Example models</a> | |
| 70 <ul> | |
| 71 <li> <a | |
| 72 href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt"> | |
| 73 Gaussian mixture models</a> | |
| 74 <li> <a href="#pca">PCA, ICA, and all that</a> | |
| 75 <li> <a href="#mixexp">Mixtures of experts</a> | |
| 76 <li> <a href="#hme">Hierarchical mixtures of experts</a> | |
| 77 <li> <a href="#qmr">QMR</a> | |
| 78 <li> <a href="#cg_model">Conditional Gaussian models</a> | |
| 79 <li> <a href="#hybrid">Other hybrid models</a> | |
| 80 </ul> | |
| 81 | |
| 82 <li> <a href="#param_learning">Parameter learning</a> | |
| 83 <ul> | |
| 84 <li> <a href="#load_data">Loading data from a file</a> | |
| 85 <li> <a href="#mle_complete">Maximum likelihood parameter estimation from complete data</a> | |
| 86 <li> <a href="#prior">Parameter priors</a> | |
| 87 <li> <a href="#bayes_learn">(Sequential) Bayesian parameter updating from complete data</a> | |
| 88 <li> <a href="#em">Maximum likelihood parameter estimation with missing values (EM)</a> | |
| 89 <li> <a href="#tying">Parameter tying</a> | |
| 90 </ul> | |
| 91 | |
| 92 <li> <a href="#structure_learning">Structure learning</a> | |
| 93 <ul> | |
| 94 <li> <a href="#enumerate">Exhaustive search</a> | |
| 95 <li> <a href="#K2">K2</a> | |
| 96 <li> <a href="#hill_climb">Hill-climbing</a> | |
| 97 <li> <a href="#mcmc">MCMC</a> | |
| 98 <li> <a href="#active">Active learning</a> | |
| 99 <li> <a href="#struct_em">Structural EM</a> | |
| 100 <li> <a href="#graphdraw">Visualizing the learned graph structure</a> | |
| 101 <li> <a href="#constraint">Constraint-based methods</a> | |
| 102 </ul> | |
| 103 | |
| 104 | |
| 105 <li> <a href="#engines">Inference engines</a> | |
| 106 <ul> | |
| 107 <li> <a href="#jtree">Junction tree</a> | |
| 108 <li> <a href="#varelim">Variable elimination</a> | |
| 109 <li> <a href="#global">Global inference methods</a> | |
| 110 <li> <a href="#quickscore">Quickscore</a> | |
| 111 <li> <a href="#belprop">Belief propagation</a> | |
| 112 <li> <a href="#sampling">Sampling (Monte Carlo)</a> | |
| 113 <li> <a href="#engine_summary">Summary of inference engines</a> | |
| 114 </ul> | |
| 115 | |
| 116 | |
| 117 <li> <a href="#influence">Influence diagrams/ decision making</a> | |
| 118 | |
| 119 | |
| 120 <li> <a href="usage_dbn.html">DBNs, HMMs, Kalman filters and all that</a> | |
| 121 </ul> | |
| 122 | |
| 123  | |
| 124 | |
| 125 | |
| 126 | |
| 127 | |
| 128 <h1><a name="install">Installation</h1> | |
| 129 | |
| 130 <h2><a name="installM">Installing the Matlab code</h2> | |
| 131 | |
| 132 <ul> | |
| 133 <li> <a href="bnt_download.html">Download</a> the FullBNT.zip file. | |
| 134 | |
| 135 <p> | |
| 136 <li> Unpack the file. In Unix, type | |
| 137 <!--"tar xvf BNT.tar".--> | |
| 138 "unzip FullBNT.zip". | |
| 139 In Windows, use | |
| 140 a program like <a href="http://www.winzip.com">Winzip</a>. This will | |
| 141 create a directory called FullBNT, which contains BNT and other libraries. | |
| 142 (Files ending in ~ or # are emacs backup files, and can be ignored.) | |
| 143 | |
| 144 <p> | |
| 145 <li> Read the file <tt>BNT/README.txt</tt> to make sure the date | |
| 146 matches the one on the top of <a href=bnt.html>the BNT home page</a>. | |
| 147 If not, you may need to press 'refresh' on your browser, and download | |
| 148 again, to get the most recent version. | |
| 149 | |
| 150 <p> | |
| 151 <li> <b>Edit the file "FullBNT/BNT/add_BNT_to_path.m"</b> so it contains the correct | |
| 152 pathname. | |
| 153 For example, in Windows, | |
| 154 I download FullBNT.zip into C:\kmurphy\matlab, and | |
| 155 then ensure the second line reads | |
| 156 <pre> | |
| 157 BNT_HOME = 'C:\kmurphy\matlab\FullBNT'; | |
| 158 </pre> | |
| 159 | |
| 160 <p> | |
| 161 <li> Start up Matlab. | |
| 162 | |
| 163 <p> | |
| 164 <li> Type "ver" at the Matlab prompt (">>"). | |
| 165 <b>You need Matlab version 5.2 or newer to run BNT</b>. | |
| 166 (Versions 5.0 and 5.1 have a memory leak which seems to sometimes | |
| 167 crash BNT.) | |
| 168 <b>BNT will not run on Octave</b>. | |
| 169 | |
| 170 <p> | |
| 171 <li> Move to the BNT directory. | |
| 172 For example, in Windows, I type | |
| 173 <pre> | |
| 174 >> cd C:\kmurphy\matlab\FullBNT\BNT | |
| 175 </pre> | |
| 176 | |
| 177 <p> | |
| 178 <li> Type "add_BNT_to_path". | |
| 179 This executes the command | |
| 180 <tt>addpath(genpath(BNT_HOME))</tt>, | |
| 181 which adds all directories below FullBNT to the matlab path. | |
| 182 | |
| 183 <p> | |
| 184 <li> Type "test_BNT". | |
| 185 | |
| 186 If all goes well, this will produce a bunch of numbers and maybe some | |
| 187 warning messages (which you can ignore), but no error messages. | |
| 188 (The warnings should only be of the form | |
| 189 "Warning: Maximum number of iterations has been exceeded", and are | |
| 190 produced by Netlab.) | |
| 191 | |
| 192 <p> | |
| 193 <li> Problems? Did you remember to | |
| 194 <b>Edit the file "FullBNT/BNT/add_BNT_to_path.m"</b> so it contains | |
| 195 the right path?? | |
| 196 | |
| 197 <p> | |
| 198 <li> <a href="http://groups.yahoo.com/group/BayesNetToolbox/join"> | |
| 199 Join the BNT email list</a> | |
| 200 | |
| 201 <p> | |
| 202 <li>Read | |
| 203 <a href="../matlab_tips.html">some useful Matlab tips</a>. | |
| 204 <!-- | |
| 205 For instance, this explains how to create a startup file, which can be | |
| 206 used to set your path variable automatically, so you can avoid having | |
| 207 to type the above commands every time. | |
| 208 --> | |
| 209 | |
| 210 </ul> | |
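| <p> | |
| If you would rather not retype these commands in every Matlab session, one | |
| option (see the Matlab tips page linked above) is to put them in a | |
| startup.m file on your Matlab path. A minimal sketch, assuming the same | |
| install location as in the example above: | |
| <pre> | |
| % startup.m -- runs automatically when Matlab starts | |
| BNT_HOME = 'C:\kmurphy\matlab\FullBNT';   % edit to match your install location | |
| addpath(genpath(BNT_HOME));               % the same command add_BNT_to_path runs | |
| </pre> | |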
| 211 | |
| 212 | |
| 213 | |
| 214 <h2><a name="installC">Installing the C code</h2> | |
| 215 | |
| 216 Some BNT functions also have C implementations. | |
| 217 <b>It is not necessary to install the C code</b>, but it can result in a speedup | |
| 218 of a factor of 2-5. | |
| 219 To install all the C code, | |
| 220 edit installC_BNT.m so it contains the right path, | |
| 221 then type <tt>installC_BNT</tt>. | |
| 222 (Ignore warnings of the form 'invalid white space character in directive'.) | |
| 223 To uninstall all the C code, | |
| 224 edit uninstallC_BNT.m so it contains the right path, | |
| 225 then type <tt>uninstallC_BNT</tt>. | |
| 226 For an up-to-date list of the files which have C implementations, see | |
| 227 BNT/installC_BNT.m. | |
| 228 | |
| 229 <p> | |
| 230 mex is a script that lets you call C code from Matlab - it does not compile matlab to | |
| 231 C (see mcc below). | |
| 232 If your C/C++ compiler is set up correctly, mex should work out of | |
| 233 the box. | |
| 234 If not, you might need to type | |
| 235 <p> | |
| 236 <tt> mex -setup</tt> | |
| 237 <p> | |
| 238 before calling installC. | |
| 239 <p> | |
| 240 To make mex call gcc on Windows, | |
| 241 you must install <a | |
| 242 href="http://www.mrc-cbu.cam.ac.uk/Imaging/gnumex20.html">gnumex</a>. | |
| 243 You can use the <a href="http://www.mingw.org/">minimalist GNU for | |
| 244 Windows</a> version of gcc, or | |
| 245 the <a href="http://sources.redhat.com/cygwin/">cygwin</a> version. | |
| 246 <p> | |
| 247 In general, typing | |
| 248 'mex foo.c' from inside Matlab creates a file called | |
| 249 'foo.mexglx' or 'foo.dll' (the exact file | |
| 250 extension is system dependent - on Linux it is 'mexglx', on Windows it is '.dll'). | |
| 251 The resulting file will hide the original 'foo.m' (if it existed), i.e., | |
| 252 typing 'foo' at the prompt will call the compiled C version. | |
| 253 To reveal the original matlab version, just delete foo.mexglx (this is | |
| 254 what uninstallC does). | |
| 255 <p> | |
| 256 Sometimes it takes time for Matlab to realize that the file has | |
| 257 changed from matlab to C or vice versa; try typing 'clear all' or | |
| 258 restarting Matlab to refresh it. | |
| 259 To find out which version of a file you are running, type | |
| 260 'which foo'. | |
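| <p> | |
| Putting these pieces together, a typical install-and-check session looks | |
| something like the following sketch (using the same placeholder name 'foo' | |
| as above): | |
| <pre> | |
| >> mex -setup        % choose a C/C++ compiler (only needed once) | |
| >> installC_BNT      % compile the C code (after editing installC_BNT.m as above) | |
| >> which foo         % shows whether foo.m or the compiled version will be run | |
| >> clear all         % make Matlab notice files that changed from .m to mex | |
| </pre> | |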
| 261 <p> | |
| 262 <a href="http://www.mathworks.com/products/compiler">mcc</a>, the | |
| 263 Matlab to C compiler, is a separate product, | |
| 264 and is quite different from mex. It does not yet support | |
| 265 objects/classes, which is why we can't compile all of BNT to C automatically. | |
| 266 Also, hand-written C code is usually much | |
| 267 better than the C code generated by mcc. | |
| 268 | |
| 269 | |
| 270 <p> | |
| 271 Acknowledgements: | |
| 272 Most of the C code (e.g., for jtree and dpot) was written by Wei Hu; | |
| 273 the triangulation C code was written by Ilya Shpitser; | |
| 274 the Gibbs sampling C code (for discrete nodes) was written by Bhaskara | |
| 275 Marthi. | |
| 276 | |
| 277 | |
| 278 | |
| 279 <h1><a name="basics">Creating your first Bayes net</h1> | |
| 280 | |
| 281 To define a Bayes net, you must specify the graph structure and then | |
| 282 the parameters. We look at each in turn, using a simple example | |
| 283 (adapted from Russell and | |
| 284 Norvig, "Artificial Intelligence: a Modern Approach", Prentice Hall, | |
| 285 1995, p454). | |
| 286 | |
| 287 | |
| 288 <h2>Graph structure</h2> | |
| 289 | |
| 290 | |
| 291 Consider the following network. | |
| 292 | |
| 293 <p> | |
| 294 <center> | |
| 295 <IMG SRC="Figures/sprinkler.gif"> | |
| 296 </center> | |
| 297 <p> | |
| 298 | |
| 299 <P> | |
| 300 To specify this directed acyclic graph (dag), we create an adjacency matrix: | |
| 301 <PRE> | |
| 302 N = 4; | |
| 303 dag = zeros(N,N); | |
| 304 C = 1; S = 2; R = 3; W = 4; | |
| 305 dag(C,[R S]) = 1; | |
| 306 dag(R,W) = 1; | |
| 307 dag(S,W)=1; | |
| 308 </PRE> | |
| 309 <P> | |
| 310 We have numbered the nodes as follows: | |
| 311 Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4. | |
| 312 <b>The nodes must always be numbered in topological order, i.e., | |
| 313 ancestors before descendants.</b> | |
| 314 For a more complicated graph, this is a little inconvenient: we will | |
| 315 see how to get around this <a href="usage_dbn.html#bat">below</a>. | |
| 316 <p> | |
| 317 In Matlab 6, you can use logical arrays instead of double arrays, | |
| 318 which are 4 times smaller: | |
| 319 <pre> | |
| 320 dag = false(N,N); | |
| 321 dag(C,[R S]) = true; | |
| 322 ... | |
| 323 </pre> | |
| 324 However, <b>some graph functions (eg acyclic) do not work on | |
| 325 logical arrays</b>! | |
| 326 <p> | |
| 327 A preliminary attempt to make a <b>GUI</b> | |
| 328 has been written by Philippe LeRay and can be downloaded | |
| 329 from <a href="http://bnt.insa-rouen.fr/ajouts.html">here</a>. | |
| 330 <p> | |
| 331 You can visualize the resulting graph structure using | |
| 332 the methods discussed <a href="#graphdraw">below</a>. | |
| 333 | |
| 334 <h2>Creating the Bayes net shell</h2> | |
| 335 | |
| 336 In addition to specifying the graph structure, | |
| 337 we must specify the size and type of each node. | |
| 338 If a node is discrete, its size is the | |
| 339 number of possible values | |
| 340 it can take on; if a node is continuous, | |
| 341 it can be a vector, and its size is the length of this vector. | |
| 342 In this case, we will assume all nodes are discrete and binary. | |
| 343 <PRE> | |
| 344 discrete_nodes = 1:N; | |
| 345 node_sizes = 2*ones(1,N); | |
| 346 </pre> | |
| 347 If the nodes were not binary, you could type e.g., | |
| 348 <pre> | |
| 349 node_sizes = [4 2 3 5]; | |
| 350 </pre> | |
| 351 meaning that Cloudy has 4 possible values, | |
| 352 Sprinkler has 2 possible values, etc. | |
| 353 Note that these are categorical values, not ordinal, i.e., | |
| 354 they are not ordered in any way, unlike, say, 'low', 'medium', 'high'. | |
| 355 <p> | |
| 356 We are now ready to make the Bayes net: | |
| 357 <pre> | |
| 358 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes); | |
| 359 </PRE> | |
| 360 By default, all nodes are assumed to be discrete, so we can also just | |
| 361 write | |
| 362 <pre> | |
| 363 bnet = mk_bnet(dag, node_sizes); | |
| 364 </PRE> | |
| 365 You may also specify which nodes will be observed. | |
| 366 If you don't know, or if this is not fixed in advance, | |
| 367 just use the empty list (the default). | |
| 368 <pre> | |
| 369 onodes = []; | |
| 370 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes); | |
| 371 </PRE> | |
| 372 Note that optional arguments are specified using a name/value syntax. | |
| 373 This is common for many BNT functions. | |
| 374 In general, to find out more about a function (e.g., which optional | |
| 375 arguments it takes), please see its | |
| 376 documentation string by typing | |
| 377 <pre> | |
| 378 help mk_bnet | |
| 379 </pre> | |
| 380 See also other <a href="../matlab_tips.html">useful Matlab tips</a>. | |
| 381 <p> | |
| 382 It is possible to associate names with nodes, as follows: | |
| 383 <pre> | |
| 384 bnet = mk_bnet(dag, node_sizes, 'names', {'cloudy','S','R','W'}, 'discrete', 1:4); | |
| 385 </pre> | |
| 386 You can then refer to a node by its name: | |
| 387 <pre> | |
| 388 C = bnet.names('cloudy'); % bnet.names is an associative array | |
| 389 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]); | |
| 390 </pre> | |
| 391 This feature uses my own associative array class. | |
| 392 | |
| 393 | |
| 394 <h2><a name="cpt">Parameters</h2> | |
| 395 | |
| 396 A model consists of the graph structure and the parameters. | |
| 397 The parameters are represented by CPD objects (CPD = Conditional | |
| 398 Probability Distribution), which define the probability distribution | |
| 399 of a node given its parents. | |
| 400 (We will use the terms "node" and "random variable" interchangeably.) | |
| 401 The simplest kind of CPD is a table (multi-dimensional array), which | |
| 402 is suitable when all the nodes are discrete-valued. Note that the discrete | |
| 403 values are not assumed to be ordered in any way; that is, they | |
| 404 represent categorical quantities, like male and female, rather than | |
| 405 ordinal quantities, like low, medium and high. | |
| 406 (We will discuss CPDs in more detail <a href="#cpd">below</a>.) | |
| 407 <p> | |
| 408 Tabular CPDs, also called CPTs (conditional probability tables), | |
| 409 are stored as multidimensional arrays, where the dimensions | |
| 410 are arranged in the same order as the nodes, e.g., the CPT for node 4 | |
| 411 (WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself. | |
| 412 Hence the child is always the last dimension. | |
| 413 If a node has no parents, its CPT is a column vector representing its | |
| 414 prior. | |
| 415 Note that in Matlab (unlike C), arrays are indexed | |
| 416 from 1, and are laid out in memory such that the first index toggles | |
| 417 fastest, e.g., the CPT for node 4 (WetGrass) is as follows | |
| 418 <P> | |
| 419 <P><IMG ALIGN=BOTTOM SRC="Figures/CPTgrass.gif"><P> | |
| 420 <P> | |
| 421 where we have used the convention that false==1, true==2. | |
| 422 We can create this CPT in Matlab as follows | |
| 423 <PRE> | |
| 424 CPT = zeros(2,2,2); | |
| 425 CPT(1,1,1) = 1.0; | |
| 426 CPT(2,1,1) = 0.1; | |
| 427 ... | |
| 428 </PRE> | |
| 429 Here is an easier way: | |
| 430 <PRE> | |
| 431 CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]); | |
| 432 </PRE> | |
| 433 In fact, we don't need to reshape the array, since the CPD constructor | |
| 434 will do that for us. So we can just write | |
| 435 <pre> | |
| 436 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]); | |
| 437 </pre> | |
| 438 The other nodes are created similarly (using the old syntax for | |
| 439 optional parameters) | |
| 440 <PRE> | |
| 441 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]); | |
| 442 bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]); | |
| 443 bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]); | |
| 444 bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]); | |
| 445 </PRE> | |
| 446 | |
| 447 | |
| 448 <h2><a name="rnd_cpt">Random Parameters</h2> | |
| 449 | |
| 450 If we do not specify the CPT, random parameters will be | |
| 451 created, i.e., each "row" of the CPT will be drawn from the uniform distribution. | |
| 452 To ensure repeatable results, use | |
| 453 <pre> | |
| 454 rand('state', seed); | |
| 455 randn('state', seed); | |
| 456 </pre> | |
| 457 To control the degree of randomness (entropy), | |
| 458 you can sample each row of the CPT from a Dirichlet(p,p,...) distribution. | |
| 459 If p << 1, this encourages "deterministic" CPTs (one entry near 1, the rest near 0). | |
| 460 If p = 1, each entry is drawn from U[0,1]. | |
| 461 If p >> 1, the entries will all be near 1/k, where k is the arity of | |
| 462 this node, i.e., each row will be nearly uniform. | |
| 463 You can do this as follows, assuming this node | |
| 464 is number i, and ns is the node_sizes. | |
| 465 <pre> | |
| 466 k = ns(i); | |
| 467 ps = parents(dag, i); | |
| 468 psz = prod(ns(ps)); | |
| 469 CPT = sample_dirichlet(p*ones(1,k), psz); | |
| 470 bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT); | |
| 471 </pre> | |
| 472 | |
| 473 | |
| 474 <h2><a name="file">Loading a network from a file</h2> | |
| 475 | |
| 476 If you already have a Bayes net represented in the XML-based | |
| 477 <a href="http://www.cs.cmu.edu/afs/cs/user/fgcozman/www/Research/InterchangeFormat/"> | |
| 478 Bayes Net Interchange Format (BNIF)</a> (e.g., downloaded from the | |
| 479 <a | |
| 480 href="http://www.cs.huji.ac.il/labs/compbio/Repository"> | |
| 481 Bayes Net repository</a>), | |
| 482 you can convert it to BNT format using | |
| 483 the | |
| 484 <a href="http://www.digitas.harvard.edu/~ken/bif2bnt/">BIF-BNT Java | |
| 485 program</a> written by Ken Shan. | |
| 486 (This is not necessarily up-to-date.) | |
| 487 <p> | |
| 488 <b>It is currently not possible to save/load a BNT matlab object to | |
| 489 file</b>, but this is easily fixed if you modify all the constructors | |
| 490 for all the classes (see matlab documentation). | |
| 491 | |
| 492 <h2>Creating a model using a GUI</h2> | |
| 493 | |
| 494 Click <a href="http://bnt.insa-rouen.fr/ajouts.html">here</a>. | |
| 495 | |
| 496 | |
| 497 | |
| 498 <h1><a name="inference">Inference</h1> | |
| 499 | |
| 500 Having created the BN, we can now use it for inference. | |
| 501 There are many different algorithms for doing inference in Bayes nets, | |
| 502 which make different tradeoffs between speed, | |
| 503 complexity, generality, and accuracy. | |
| 504 BNT therefore offers a variety of different inference | |
| 505 "engines". We will discuss these | |
| 506 in more detail <a href="#engines">below</a>. | |
| 507 For now, we will use the junction tree | |
| 508 engine, which is the mother of all exact inference algorithms. | |
| 509 This can be created as follows. | |
| 510 <pre> | |
| 511 engine = jtree_inf_engine(bnet); | |
| 512 </pre> | |
| 513 The other engines have similar constructors, but might take | |
| 514 additional, algorithm-specific parameters. | |
| 515 All engines are used in the same way, once they have been created. | |
| 516 We illustrate this in the following sections. | |
| 517 | |
| 518 | |
| 519 <h2><a name="marginal">Computing marginal distributions</h2> | |
| 520 | |
| 521 Suppose we want to compute the probability that the sprinkler was on | |
| 522 given that the grass is wet. | |
| 523 The evidence consists of the fact that W=2. All the other nodes | |
| 524 are hidden (unobserved). We can specify this as follows. | |
| 525 <pre> | |
| 526 evidence = cell(1,N); | |
| 527 evidence{W} = 2; | |
| 528 </pre> | |
| 529 We use a 1D cell array instead of a vector to | |
| 530 cope with the fact that nodes can be vectors of different lengths. | |
| 531 In addition, the value [] can be used | |
| 532 to denote 'no evidence', instead of having to specify the observation | |
| 533 pattern as a separate argument. | |
| 534 (Click <a href="cellarray.html">here</a> for a quick tutorial on cell | |
| 535 arrays in matlab.) | |
| 536 <p> | |
| 537 We are now ready to add the evidence to the engine. | |
| 538 <pre> | |
| 539 [engine, loglik] = enter_evidence(engine, evidence); | |
| 540 </pre> | |
| 541 The behavior of this function is algorithm-specific, and is discussed | |
| 542 in more detail <a href="#engines">below</a>. | |
| 543 In the case of the jtree engine, | |
| 544 enter_evidence implements a two-pass message-passing scheme. | |
| 545 The first return argument contains the modified engine, which | |
| 546 incorporates the evidence. The second return argument contains the | |
| 547 log-likelihood of the evidence. (Not all engines are capable of | |
| 548 computing the log-likelihood.) | |
| 549 <p> | |
| 550 Finally, we can compute p=P(S=2|W=2) as follows. | |
| 551 <PRE> | |
| 552 marg = marginal_nodes(engine, S); | |
| 553 marg.T | |
| 554 ans = | |
| 555 0.57024 | |
| 556 0.42976 | |
| 557 p = marg.T(2); | |
| 558 </PRE> | |
| 559 We see that p = 0.4298. | |
| 560 <p> | |
| 561 Now let us add the evidence that it was raining, and see what | |
| 562 difference it makes. | |
| 563 <PRE> | |
| 564 evidence{R} = 2; | |
| 565 [engine, loglik] = enter_evidence(engine, evidence); | |
| 566 marg = marginal_nodes(engine, S); | |
| 567 p = marg.T(2); | |
| 568 </PRE> | |
| 569 We find that p = P(S=2|W=2,R=2) = 0.1945, | |
| 570 which is lower than | |
| 571 before, because the rain can "explain away" the | |
| 572 fact that the grass is wet. | |
| 573 <p> | |
| 574 You can plot a marginal distribution over a discrete variable | |
| 575 as a bar chart using the built-in 'bar' function: | |
| 576 <pre> | |
| 577 bar(marg.T) | |
| 578 </pre> | |
| 579 This is what it looks like: | |
| 580 | |
| 581 <p> | |
| 582 <center> | |
| 583 <IMG SRC="Figures/sprinkler_bar.gif"> | |
| 584 </center> | |
| 585 <p> | |
| 586 | |
| 587 <h2><a name="observed">Observed nodes</h2> | |
| 588 | |
| 589 What happens if we ask for the marginal on an observed node, e.g. P(W|W=2)? | |
| 590 An observed discrete node effectively only has 1 value (the observed | |
| 591 one) --- all other values would result in 0 probability. | |
| 592 For efficiency, BNT treats observed (discrete) nodes as if their effective | |
| 593 size were set to 1, as we see below: | |
| 594 <pre> | |
| 595 evidence = cell(1,N); | |
| 596 evidence{W} = 2; | |
| 597 engine = enter_evidence(engine, evidence); | |
| 598 m = marginal_nodes(engine, W); | |
| 599 m.T | |
| 600 ans = | |
| 601 1 | |
| 602 </pre> | |
| 603 This can get a little confusing, since we assigned W=2. | |
| 604 So we can ask BNT to add the evidence back in by passing in an optional argument: | |
| 605 <pre> | |
| 606 m = marginal_nodes(engine, W, 1); | |
| 607 m.T | |
| 608 ans = | |
| 609 0 | |
| 610 1 | |
| 611 </pre> | |
| 612 This shows that P(W=1|W=2) = 0 and P(W=2|W=2) = 1. | |
| 613 | |
| 614 | |
| 615 | |
| 616 <h2><a name="joint">Computing joint distributions</h2> | |
| 617 | |
| 618 We can compute the joint probability on a set of nodes as in the | |
| 619 following example. | |
| 620 <pre> | |
| 621 evidence = cell(1,N); | |
| 622 [engine, ll] = enter_evidence(engine, evidence); | |
| 623 m = marginal_nodes(engine, [S R W]); | |
| 624 </pre> | |
| 625 m is a structure. The 'T' field is a multi-dimensional array (in | |
| 626 this case, 3-dimensional) that contains the joint probability | |
| 627 distribution on the specified nodes. | |
| 628 <pre> | |
| 629 >> m.T | |
| 630 ans(:,:,1) = | |
| 631 0.2900 0.0410 | |
| 632 0.0210 0.0009 | |
| 633 ans(:,:,2) = | |
| 634 0 0.3690 | |
| 635 0.1890 0.0891 | |
| 636 </pre> | |
| 637 We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass | |
| 638 to be wet if both the rain and sprinkler are off. | |
| 639 <p> | |
| 640 Let us now add some evidence to R. | |
| 641 <pre> | |
| 642 evidence{R} = 2; | |
| 643 [engine, ll] = enter_evidence(engine, evidence); | |
| 644 m = marginal_nodes(engine, [S R W]) | |
| 645 m = | |
| 646 domain: [2 3 4] | |
| 647 T: [2x1x2 double] | |
| 648 >> m.T | |
| 649 m.T | |
| 650 ans(:,:,1) = | |
| 651 0.0820 | |
| 652 0.0018 | |
| 653 ans(:,:,2) = | |
| 654 0.7380 | |
| 655 0.1782 | |
| 656 </pre> | |
| 657 The joint T(i,j,k) = P(S=i,R=j,W=k|evidence) | |
| 658 should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible | |
| 659 with the evidence that R=2. | |
| 660 Instead of creating large tables with many 0s, BNT sets the effective | |
| 661 size of observed (discrete) nodes to 1, as explained above. | |
| 662 This is why m.T has size 2x1x2. | |
| 663 To get a 2x2x2 table, type | |
| 664 <pre> | |
| 665 m = marginal_nodes(engine, [S R W], 1) | |
| 666 m = | |
| 667 domain: [2 3 4] | |
| 668 T: [2x2x2 double] | |
| 669 >> m.T | |
| 670 m.T | |
| 671 ans(:,:,1) = | |
| 672 0 0.082 | |
| 673 0 0.0018 | |
| 674 ans(:,:,2) = | |
| 675 0 0.738 | |
| 676 0 0.1782 | |
| 677 </pre> | |
| 678 | |
| 679 <p> | |
| 680 Note: It is not always possible to compute the joint on arbitrary | |
| 681 sets of nodes: it depends on which inference engine you use, as discussed | |
| 682 in more detail <a href="#engines">below</a>. | |
| 683 | |
| 684 | |
| 685 <h2><a name="soft">Soft/virtual evidence</h2> | |
| 686 | |
| 687 Sometimes a node is not observed, but we have some distribution over | |
| 688 its possible values; this is often called "soft" or "virtual" | |
| 689 evidence. | |
| 690 One can use this as follows | |
| 691 <pre> | |
| 692 [engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence); | |
| 693 </pre> | |
| 694 where soft_evidence{i} is either [] (if node i has no soft evidence) | |
| 695 or is a vector representing the probability distribution over i's | |
| 696 possible values. | |
| 697 For example, if we don't know i's exact value, but we know its | |
| 698 likelihood ratio is 60/40, we can write evidence{i} = [] and | |
| 699 soft_evidence{i} = [0.6 0.4]. | |
| 700 <p> | |
| 701 Currently only jtree_inf_engine supports this option. | |
| 702 It assumes that all hidden nodes, and all nodes for | |
| 703 which we have soft evidence, are discrete. | |
| 704 For a longer example, see BNT/examples/static/softev1.m. | |
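| <p> | |
| For instance, in the sprinkler network above, a sketch of entering a 60/40 | |
| belief that the sprinkler was on (rather than observing it directly) might | |
| look like this (the numbers are purely illustrative): | |
| <pre> | |
| evidence = cell(1,N);                % no hard evidence | |
| soft_evidence = cell(1,N); | |
| soft_evidence{S} = [0.4 0.6];        % weight 0.4 on S=off, 0.6 on S=on | |
| [engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence); | |
| marg = marginal_nodes(engine, R);    % then query other nodes as usual | |
| </pre> | |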
| 705 | |
| 706 | |
| 707 <h2><a name="mpe">Most probable explanation</h2> | |
| 708 | |
| 709 To compute the most probable explanation (MPE) of the evidence (i.e., | |
| 710 the most probable assignment, or a mode of the joint), use | |
| 711 <pre> | |
| 712 [mpe, ll] = calc_mpe(engine, evidence); | |
| 713 </pre> | |
| 714 mpe{i} is the most likely value of node i. | |
| 715 This calls enter_evidence with the 'maximize' flag set to 1, which | |
| 716 causes the engine to do max-product instead of sum-product. | |
| 717 The resulting max-marginals are then thresholded. | |
| 718 If there is more than one maximum probability assignment, we must take | |
| 719 care to break ties in a consistent manner (thresholding the | |
| 720 max-marginals may give the wrong result). To force consistent | |
| 721 tie-breaking, type | |
| 722 <pre> | |
| 723 [mpe, ll] = calc_mpe(engine, evidence, 1); | |
| 724 </pre> | |
| 725 Note that computing the MPE is sometimes called abductive reasoning. | |
| 726 | |
| 727 <p> | |
| 728 You can also use <tt>calc_mpe_bucket</tt> written by Ron Zohar, | |
| 729 that does a forwards max-product pass, and then a backwards traceback | |
| 730 pass, which is how Viterbi is traditionally implemented. | |
| 731 | |
| 732 | |
| 733 | |
| 734 <h1><a name="cpd">Conditional Probability Distributions</h1> | |
| 735 | |
| 736 A Conditional Probability Distribution (CPD) | |
| 737 defines P(X(i) | X(Pa(i))), where X(i) is the i'th node, and X(Pa(i)) | |
| 738 are the parents of node i. There are many ways to represent this | |
| 739 distribution, which depend in part on whether X(i) and X(Pa(i)) are | |
| 740 discrete, continuous, or a combination. | |
| 741 We will discuss various representations below. | |
| 742 | |
| 743 | |
| 744 <h2><a name="tabular">Tabular nodes</h2> | |
| 745 | |
| 746 If the CPD is represented as a table (i.e., if it is a multinomial | |
| 747 distribution), it has a number of parameters that is exponential in | |
| 748 the number of parents. See the example <a href="#cpt">above</a>. | |
| 749 | |
| 750 | |
| 751 <h2><a name="noisyor">Noisy-or nodes</h2> | |
| 752 | |
| 753 A noisy-OR node is like a regular logical OR gate except that | |
| 754 sometimes the effects of parents that are on get inhibited. | |
| 755 Let the prob. that parent i gets inhibited be q(i). | |
| 756 Then a node, C, with 2 parents, A and B, has the following CPD, where | |
| 757 we use F and T to represent off and on (1 and 2 in BNT). | |
| 758 <pre> | |
| 759 A B P(C=off) P(C=on) | |
| 760 --------------------------- | |
| 761 F F 1.0 0.0 | |
| 762 T F q(A) 1-q(A) | |
| 763 F T q(B) 1-q(B) | |
| 764 T T q(A)q(B) 1-q(A)q(B) | |
| 765 </pre> | |
| 766 Thus we see that the causes get inhibited independently. | |
| 767 It is common to associate a "leak" node with a noisy-or CPD, which is | |
| 768 like a parent that is always on. This can account for all other unmodelled | |
| 769 causes which might turn the node on. | |
| 770 <p> | |
| 771 The noisy-or distribution is similar to the logistic distribution. | |
| 772 To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j) | |
| 773 be the prob. that j inhibits i. Then | |
| 774 <pre> | |
| 775 Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j) | |
| 776 </pre> | |
| 777 Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then | |
| 778 <pre> | |
| 779 Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j)) | |
| 780 </pre> | |
| 781 For a sigmoid node, we have | |
| 782 <pre> | |
| 783 Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j)) | |
| 784 </pre> | |
| 785 where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of | |
| 786 the activation function (although both are monotonically increasing). | |
| 787 In addition, in the case of a noisy-or, the weights are constrained to be | |
| 788 positive, since they derive from probabilities q(i,j). | |
| 789 In both cases, the number of parameters is <em>linear</em> in the | |
| 790 number of parents, unlike the case of a multinomial distribution, | |
| 791 where the number of parameters is exponential in the number of parents. | |
| 792 We will see an example of noisy-OR nodes <a href="#qmr">below</a>. | |
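| <p> | |
| BNT has a dedicated noisy-OR CPD class (see the <a href="#cpd_summary">summary | |
| table</a> below and the QMR example). Purely to illustrate the table above, | |
| here is a sketch that builds the 2-parent noisy-OR CPT by hand from chosen | |
| inhibition probabilities and wraps it in an ordinary tabular CPD: | |
| <pre> | |
| qA = 0.1; qB = 0.2;             % illustrative inhibition probabilities | |
| off = [1 qB; qA qA*qB];         % P(C=off given A,B); rows: A=F,T; columns: B=F,T | |
| CPT = zeros(2,2,2);             % dimensions ordered A, B, C (child last, as usual) | |
| CPT(:,:,1) = off; | |
| CPT(:,:,2) = 1 - off;           % P(C=on given A,B) | |
| % bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT);  % for a node i whose parents are A, then B | |
| </pre> | |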
| 793 | |
| 794 | |
| 795 <h2><a name="deterministic">Other (noisy) deterministic nodes</h2> | |
| 796 | |
| 797 Deterministic CPDs for discrete random variables can be created using | |
| 798 the deterministic_CPD class. It is also possible to 'flip' the output | |
| 799 of the function with some probability, to simulate noise. | |
| 800 The boolean_CPD class is just a special case of a | |
| 801 deterministic CPD, where the parents and child are all binary. | |
| 802 <p> | |
| 803 Both of these classes are just "syntactic sugar" for the tabular_CPD | |
| 804 class. | |
| 805 | |
| 806 | |
| 807 | |
| 808 <h2><a name="softmax">Softmax nodes</h2> | |
| 809 | |
| 810 If we have a discrete node with a continuous parent, | |
| 811 we can define its CPD using a softmax function | |
| 812 (also known as the multinomial logit function). | |
| 813 This acts like a soft thresholding operator, and is defined as follows: | |
| 814 <pre> | |
| 815 exp(w(:,i)'*x + b(i)) | |
| 816 Pr(Q=i | X=x) = ----------------------------- | |
| 817 sum_j exp(w(:,j)'*x + b(j)) | |
| 818 | |
| 819 </pre> | |
| 820 The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the | |
| 821 following interpretation: w(:,i)-w(:,j) is the normal vector to the | |
| 822 decision boundary between classes i and j, | |
| 823 and b(i)-b(j) is its offset (bias). For example, suppose | |
| 824 X is a 2-vector, and Q is binary. Then | |
| 825 <pre> | |
| 826 w = [1 -1; | |
| 827 0 0]; | |
| 828 | |
| 829 b = [0 0]; | |
| 830 </pre> | |
| 831 means class 1 are points in the 2D plane with positive x coordinate, | |
| 832 and class 2 are points in the 2D plane with negative x coordinate. | |
| 833 If w has large magnitude, the decision boundary is sharp, otherwise it | |
| 834 is soft. | |
| 835 In the special case that Q is binary (0/1), the softmax function reduces to the logistic | |
| 836 (sigmoid) function. | |
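| <p> | |
| As a quick sanity check on this interpretation, here is a sketch that | |
| evaluates the softmax formula above at one arbitrarily chosen point, in | |
| plain Matlab, independent of BNT: | |
| <pre> | |
| w = [1 -1; 0 0];  b = [0 0];    % the example parameters from above | |
| x = [2; 0.5];                   % a point with positive x coordinate | |
| a = w'*x + b';                  % one activation per class | |
| p = exp(a) / sum(exp(a));       % Pr(Q=i given X=x); here p(1) is about 0.98 | |
| </pre> | |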
| 837 <p> | |
| 838 Fitting a softmax function can be done using the iteratively reweighted | |
| 839 least squares (IRLS) algorithm. | |
| 840 We use the implementation from | |
| 841 <a href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>. | |
| 842 Note that since | |
| 843 the softmax distribution is not in the exponential family, it does not | |
| 844 have finite sufficient statistics, and hence we must store all the | |
| 845 training data in uncompressed form. | |
| 846 If this takes too much space, one should use online (stochastic) gradient | |
| 847 descent (not implemented in BNT). | |
| 848 <p> | |
| 849 If a softmax node also has discrete parents, | |
| 850 we use a different set of w/b parameters for each combination of | |
| 851 parent values, as in the <a href="#gaussian">conditional linear | |
| 852 Gaussian CPD</a>. | |
| 853 This feature was implemented by Pierpaolo Brutti. | |
| 854 He is currently extending it so that discrete parents can be treated | |
| 855 as if they were continuous, by adding indicator variables to the X | |
| 856 vector. | |
| 857 <p> | |
| 858 We will see an example of softmax nodes <a href="#mixexp">below</a>. | |
| 859 | |
| 860 | |
| 861 <h2><a name="mlp">Neural network nodes</h2> | |
| 862 | |
| 863 Pierpaolo Brutti has implemented the mlp_CPD class, which uses a multi layer perceptron | |
| 864 to implement a mapping from continuous parents to discrete children, | |
| 865 similar to the softmax function. | |
| 866 (If there are also discrete parents, it creates a mixture of MLPs.) | |
| 867 It uses code from <a | |
| 868 href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>. | |
| 869 This is work in progress. | |
| 870 | |
| 871 <h2><a name="root">Root nodes</h2> | |
| 872 | |
| 873 A root node has no parents and no parameters; it can be used to model | |
| 874 an observed, exogenous input variable, i.e., one which is "outside" | |
| 875 the model. | |
| 876 This is useful for conditional density models. | |
| 877 We will see an example of root nodes <a href="#mixexp">below</a>. | |
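| <p> | |
| Creating one is a one-line call (the mixture of experts example below uses | |
| this for its input node X): | |
| <pre> | |
| bnet.CPD{i} = root_CPD(bnet, i);   % for a parentless, observed input node i | |
| </pre> | |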
| 878 | |
| 879 | |
| 880 <h2><a name="gaussian">Gaussian nodes</h2> | |
| 881 | |
| 882 We now consider a distribution suitable for the continuous-valued nodes. | |
| 883 Suppose the node is called Y, its continuous parents (if any) are | |
| 884 called X, and its discrete parents (if any) are called Q. | |
| 885 The distribution on Y is defined as follows: | |
| 886 <pre> | |
| 887 - no parents: Y ~ N(mu, Sigma) | |
| 888 - cts parents : Y|X=x ~ N(mu + W x, Sigma) | |
| 889 - discrete parents: Y|Q=i ~ N(mu(:,i), Sigma(:,:,i)) | |
| 890 - cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i)) | |
| 891 </pre> | |
| 892 where N(mu, Sigma) denotes a Normal distribution with mean mu and | |
| 893 covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q | |
| 894 respectively. | |
| 895 If there are no discrete parents, |Q|=1; if there is | |
| 896 more than one, then |Q| = a vector of the sizes of each discrete parent. | |
| 897 If there are no continuous parents, |X|=0; if there is more than one, | |
| 898 then |X| = the sum of their sizes. | |
| 899 Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive | |
| 900 semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight) | |
| 901 matrix. | |
| 902 <p> | |
| 903 We can create a Gaussian node with random parameters as follows. | |
| 904 <pre> | |
| 905 bnet.CPD{i} = gaussian_CPD(bnet, i); | |
| 906 </pre> | |
| 907 We can specify the value of one or more of the parameters as in the | |
| 908 following example, in which |Y|=2, and |Q|=1. | |
| 909 <pre> | |
| 910 bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', [0; 0], 'weights', randn(Y,X), 'cov', eye(Y)); | |
| 911 </pre> | |
| 912 <p> | |
| 913 We will see an example of conditional linear Gaussian nodes <a | |
| 914 href="#cg_model">below</a>. | |
| 915 <p> | |
| 916 <b>When learning Gaussians from data</b>, it is helpful to ensure the | |
| 917 data has a small magnitude | |
| 918 (see e.g., KPMstats/standardize) to prevent numerical problems. | |
| 919 Unless you have a lot of data, it is also a very good idea to use | |
| 920 diagonal instead of full covariance matrices. | |
| 921 (BNT does not currently support spherical covariances, although it | |
| 922 would be easy to add, since KPMstats/clg_Mstep supports this option; | |
| 923 you would just need to modify gaussian_CPD/update_ess to accumulate | |
| 924 weighted inner products.) | |
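| <p> | |
| For example, a sketch of standardizing a data matrix by hand, in the spirit | |
| of the KPMstats/standardize function mentioned above (assuming rows are | |
| cases and columns are dimensions): | |
| <pre> | |
| mu = mean(data);                 % per-dimension means (data is ncases-by-ndims) | |
| sd = std(data);                  % per-dimension standard deviations | |
| sd(sd==0) = 1;                   % avoid dividing by zero for constant dimensions | |
| data = (data - repmat(mu, size(data,1), 1)) ./ repmat(sd, size(data,1), 1); | |
| </pre> | |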
| 925 | |
| 926 | |
| 927 | |
| 928 <h2><a name="nongauss">Other continuous distributions</h2> | |
| 929 | |
| 930 Currently BNT does not support any CPDs for continuous nodes other | |
| 931 than the Gaussian. | |
| 932 However, you can use a mixture of Gaussians to | |
| 933 approximate other continuous distributions. We will see an example | |
| 934 of this with the IFA model <a href="#pca">below</a>. | |
| 935 | |
| 936 | |
| 937 <h2><a name="glm">Generalized linear model nodes</h2> | |
| 938 | |
| 939 In the future, we may incorporate some of the functionality of | |
| 940 <a href = | |
| 941 "http://www.sci.usq.edu.au/staff/dunn/glmlab/glmlab.html">glmlab</a> | |
| 942 into BNT. | |
| 943 | |
| 944 | |
| 945 <h2><a name="dtree">Classification/regression tree nodes</h2> | |
| 946 | |
| 947 We plan to add classification and regression trees to define CPDs for | |
| 948 discrete and continuous nodes, respectively. | |
| 949 Trees have many advantages: they are easy to interpret, they can do | |
| 950 feature selection, they can | |
| 951 handle discrete and continuous inputs, they do not make strong | |
| 952 assumptions about the form of the distribution, the number of | |
| 953 parameters can grow in a data-dependent way (i.e., they are | |
| 954 semi-parametric), they can handle missing data, etc. | |
| 955 However, they are not yet implemented. | |
| 956 <!-- | |
| 957 Yimin Zhang is currently (Feb '02) implementing this. | |
| 958 --> | |
| 959 | |
| 960 | |
| 961 <h2><a name="cpd_summary">Summary of CPD types</h2> | |
| 962 | |
| 963 We list all the different types of CPDs supported by BNT. | |
| 964 For each CPD, we specify if the child and parents can be discrete (D) or | |
| 965 continuous (C) (Binary (B) nodes are a special case). | |
| 966 We also specify which methods each class supports. | |
| 967 If a method is inherited, the name of the parent class is mentioned. | |
| 968 If a parent class calls a child method, this is mentioned. | |
| 969 <p> | |
| 970 The <tt>CPD_to_CPT</tt> method converts a CPD to a table; this | |
| 971 requires that the child and all parents are discrete. | |
| 972 The CPT might be exponentially big... | |
| 973 <tt>convert_to_table</tt> evaluates a CPD with evidence, and | |
| 974 represents the resulting potential as an array. | |
| 975 This requires that the child is discrete, and any continuous parents | |
| 976 are observed. | |
| 977 <tt>convert_to_pot</tt> evaluates a CPD with evidence, and | |
| 978 represents the resulting potential as a dpot, gpot, cgpot or upot, as | |
| 979 requested. (d=discrete, g=Gaussian, cg = conditional Gaussian, u = | |
| 980 utility). | |
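| <p> | |
| For example, to inspect the table stored inside one of the sprinkler | |
| network's tabular CPDs, you can convert it back to a plain Matlab array | |
| (a sketch, assuming the method is called directly on the CPD object): | |
| <pre> | |
| CPT = CPD_to_CPT(bnet.CPD{W});   % 2x2x2 array for the WetGrass node | |
| squeeze(CPT(2,2,:))              % P(W given S=on, R=on) | |
| </pre> | |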
| 981 | |
| 982 <p> | |
| 983 When we sample a node, all the parents are observed. | |
| 984 When we compute the (log) probability of a node, all the parents and | |
| 985 the child are observed. | |
| 986 <p> | |
| 987 We also specify if the parameters are learnable. | |
| 988 For learning with EM, we require | |
| 989 the methods <tt>reset_ess</tt>, <tt>update_ess</tt> and | |
| 990 <tt>maximize_params</tt>. | |
| 991 For learning from fully observed data, we require | |
| 992 the method <tt>learn_params</tt>. | |
| 993 By default, all classes inherit this from generic_CPD, which simply | |
| 994 calls <tt>update_ess</tt> N times, once for each data case, followed | |
| 995 by <tt>maximize_params</tt>, i.e., it is like EM, without the E step. | |
| 996 Some classes implement a batch formula, which is quicker. | |
| 997 <p> | |
| 998 Bayesian learning means computing a posterior over the parameters | |
| 999 given fully observed data. | |
| 1000 <p> | |
| 1001 Pearl means we implement the methods <tt>compute_pi</tt> and | |
| 1002 <tt>compute_lambda_msg</tt>, used by | |
| 1003 <tt>pearl_inf_engine</tt>, which runs on directed graphs. | |
| 1004 <tt>belprop_inf_engine</tt> only needs <tt>convert_to_pot</tt>. | |
| 1005 The pearl methods can exploit special properties of the CPDs for | |
| 1006 computing the messages efficiently, whereas belprop does not. | |
| 1007 <p> | |
| 1008 The only method implemented by generic_CPD is <tt>adjustable_CPD</tt>, | |
| 1009 which is not shown, since it is not very interesting. | |
| 1010 | |
| 1011 | |
| 1012 <p> | |
| 1013 | |
| 1014 | |
| 1015 <table border> | |
| 1016 <tr> | |
| 1017 <td align=center>Name | |
| 1018 <td align=center>Child | |
| 1019 <td align=center>Parents | |
| 1020 <td align=center>Comments | |
| 1021 <td align=center>CPD_to_CPT | |
| 1022 <td align=center>conv_to_table | |
| 1023 <td align=center>conv_to_pot | |
| 1024 <td align=center>sample | |
| 1025 <td align=center>prob | |
| 1026 <td align=center>learn | |
| 1027 <td align=center>Bayes | |
| 1028 <td align=center>Pearl | |
| 1029 | |
| 1030 | |
| 1031 <tr> | |
| 1032 <!-- Name--><td> | |
| 1033 <!-- Child--><td> | |
| 1034 <!-- Parents--><td> | |
| 1035 <!-- Comments--><td> | |
| 1036 <!-- CPD_to_CPT--><td> | |
| 1037 <!-- conv_to_table--><td> | |
| 1038 <!-- conv_to_pot--><td> | |
| 1039 <!-- sample--><td> | |
| 1040 <!-- prob--><td> | |
| 1041 <!-- learn--><td> | |
| 1042 <!-- Bayes--><td> | |
| 1043 <!-- Pearl--><td> | |
| 1044 | |
| 1045 <tr> | |
| 1046 <!-- Name--><td>boolean | |
| 1047 <!-- Child--><td>B | |
| 1048 <!-- Parents--><td>B | |
| 1049 <!-- Comments--><td>Syntactic sugar for tabular | |
| 1050 <!-- CPD_to_CPT--><td>- | |
| 1051 <!-- conv_to_table--><td>- | |
| 1052 <!-- conv_to_pot--><td>- | |
| 1053 <!-- sample--><td>- | |
| 1054 <!-- prob--><td>- | |
| 1055 <!-- learn--><td>- | |
| 1056 <!-- Bayes--><td>- | |
| 1057 <!-- Pearl--><td>- | |
| 1058 | |
| 1059 <tr> | |
| 1060 <!-- Name--><td>deterministic | |
| 1061 <!-- Child--><td>D | |
| 1062 <!-- Parents--><td>D | |
| 1063 <!-- Comments--><td>Syntactic sugar for tabular | |
| 1064 <!-- CPD_to_CPT--><td>- | |
| 1065 <!-- conv_to_table--><td>- | |
| 1066 <!-- conv_to_pot--><td>- | |
| 1067 <!-- sample--><td>- | |
| 1068 <!-- prob--><td>- | |
| 1069 <!-- learn--><td>- | |
| 1070 <!-- Bayes--><td>- | |
| 1071 <!-- Pearl--><td>- | |
| 1072 | |
| 1073 <tr> | |
| 1074 <!-- Name--><td>Discrete | |
| 1075 <!-- Child--><td>D | |
| 1076 <!-- Parents--><td>C/D | |
| 1077 <!-- Comments--><td>Virtual class | |
| 1078 <!-- CPD_to_CPT--><td>N | |
| 1079 <!-- conv_to_table--><td>Calls CPD_to_CPT | |
| 1080 <!-- conv_to_pot--><td>Calls conv_to_table | |
| 1081 <!-- sample--><td>Calls conv_to_table | |
| 1082 <!-- prob--><td>Calls conv_to_table | |
| 1083 <!-- learn--><td>N | |
| 1084 <!-- Bayes--><td>N | |
| 1085 <!-- Pearl--><td>N | |
| 1086 | |
| 1087 <tr> | |
| 1088 <!-- Name--><td>Gaussian | |
| 1089 <!-- Child--><td>C | |
| 1090 <!-- Parents--><td>C/D | |
| 1091 <!-- Comments--><td>- | |
| 1092 <!-- CPD_to_CPT--><td>N | |
| 1093 <!-- conv_to_table--><td>N | |
| 1094 <!-- conv_to_pot--><td>Y | |
| 1095 <!-- sample--><td>Y | |
| 1096 <!-- prob--><td>Y | |
| 1097 <!-- learn--><td>Y | |
| 1098 <!-- Bayes--><td>N | |
| 1099 <!-- Pearl--><td>N | |
| 1100 | |
| 1101 <tr> | |
| 1102 <!-- Name--><td>gmux | |
| 1103 <!-- Child--><td>C | |
| 1104 <!-- Parents--><td>C/D | |
| 1105 <!-- Comments--><td>multiplexer | |
| 1106 <!-- CPD_to_CPT--><td>N | |
| 1107 <!-- conv_to_table--><td>N | |
| 1108 <!-- conv_to_pot--><td>Y | |
| 1109 <!-- sample--><td>N | |
| 1110 <!-- prob--><td>N | |
| 1111 <!-- learn--><td>N | |
| 1112 <!-- Bayes--><td>N | |
| 1113 <!-- Pearl--><td>Y | |
| 1114 | |
| 1115 | |
| 1116 <tr> | |
| 1117 <!-- Name--><td>MLP | |
| 1118 <!-- Child--><td>D | |
| 1119 <!-- Parents--><td>C/D | |
| 1120 <!-- Comments--><td>multi layer perceptron | |
| 1121 <!-- CPD_to_CPT--><td>N | |
| 1122 <!-- conv_to_table--><td>Y | |
| 1123 <!-- conv_to_pot--><td>Inherits from discrete | |
| 1124 <!-- sample--><td>Inherits from discrete | |
| 1125 <!-- prob--><td>Inherits from discrete | |
| 1126 <!-- learn--><td>Y | |
| 1127 <!-- Bayes--><td>N | |
| 1128 <!-- Pearl--><td>N | |
| 1129 | |
| 1130 | |
| 1131 <tr> | |
| 1132 <!-- Name--><td>noisy-or | |
| 1133 <!-- Child--><td>B | |
| 1134 <!-- Parents--><td>B | |
| 1135 <!-- Comments--><td>- | |
| 1136 <!-- CPD_to_CPT--><td>Y | |
| 1137 <!-- conv_to_table--><td>Inherits from discrete | |
| 1138 <!-- conv_to_pot--><td>Inherits from discrete | |
| 1139 <!-- sample--><td>Inherits from discrete | |
| 1140 <!-- prob--><td>Inherits from discrete | |
| 1141 <!-- learn--><td>N | |
| 1142 <!-- Bayes--><td>N | |
| 1143 <!-- Pearl--><td>Y | |
| 1144 | |
| 1145 | |
| 1146 <tr> | |
| 1147 <!-- Name--><td>root | |
| 1148 <!-- Child--><td>C/D | |
| 1149 <!-- Parents--><td>none | |
| 1150 <!-- Comments--><td>no params | |
| 1151 <!-- CPD_to_CPT--><td>N | |
| 1152 <!-- conv_to_table--><td>N | |
| 1153 <!-- conv_to_pot--><td>Y | |
| 1154 <!-- sample--><td>Y | |
| 1155 <!-- prob--><td>Y | |
| 1156 <!-- learn--><td>N | |
| 1157 <!-- Bayes--><td>N | |
| 1158 <!-- Pearl--><td>N | |
| 1159 | |
| 1160 | |
| 1161 <tr> | |
| 1162 <!-- Name--><td>softmax | |
| 1163 <!-- Child--><td>D | |
| 1164 <!-- Parents--><td>C/D | |
| 1165 <!-- Comments--><td>- | |
| 1166 <!-- CPD_to_CPT--><td>N | |
| 1167 <!-- conv_to_table--><td>Y | |
| 1168 <!-- conv_to_pot--><td>Inherits from discrete | |
| 1169 <!-- sample--><td>Inherits from discrete | |
| 1170 <!-- prob--><td>Inherits from discrete | |
| 1171 <!-- learn--><td>Y | |
| 1172 <!-- Bayes--><td>N | |
| 1173 <!-- Pearl--><td>N | |
| 1174 | |
| 1175 | |
| 1176 <tr> | |
| 1177 <!-- Name--><td>generic | |
| 1178 <!-- Child--><td>C/D | |
| 1179 <!-- Parents--><td>C/D | |
| 1180 <!-- Comments--><td>Virtual class | |
| 1181 <!-- CPD_to_CPT--><td>N | |
| 1182 <!-- conv_to_table--><td>N | |
| 1183 <!-- conv_to_pot--><td>N | |
| 1184 <!-- sample--><td>N | |
| 1185 <!-- prob--><td>N | |
| 1186 <!-- learn--><td>N | |
| 1187 <!-- Bayes--><td>N | |
| 1188 <!-- Pearl--><td>N | |
| 1189 | |
| 1190 | |
| 1191 <tr> | |
| 1192 <!-- Name--><td>Tabular | |
| 1193 <!-- Child--><td>D | |
| 1194 <!-- Parents--><td>D | |
| 1195 <!-- Comments--><td>- | |
| 1196 <!-- CPD_to_CPT--><td>Y | |
| 1197 <!-- conv_to_table--><td>Inherits from discrete | |
| 1198 <!-- conv_to_pot--><td>Inherits from discrete | |
| 1199 <!-- sample--><td>Inherits from discrete | |
| 1200 <!-- prob--><td>Inherits from discrete | |
| 1201 <!-- learn--><td>Y | |
| 1202 <!-- Bayes--><td>Y | |
| 1203 <!-- Pearl--><td>Y | |
| 1204 | |
| 1205 </table> | |
| 1206 | |
| 1207 | |
| 1208 | |
| 1209 <h1><a name="examples">Example models</h1> | |
| 1210 | |
| 1211 | |
| 1212 <h2>Gaussian mixture models</h2> | |
| 1213 | |
| 1214 Richard W. DeVaul has made a detailed tutorial on how to fit mixtures | |
| 1215 of Gaussians using BNT. Available | |
| 1216 <a href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">here</a>. | |
| 1217 | |
| 1218 | |
| 1219 <h2><a name="pca">PCA, ICA, and all that </h2> | |
| 1220 | |
| 1221 In Figure (a) below, we show how Factor Analysis can be thought of as a | |
| 1222 graphical model. Here, X has an N(0,I) prior, and | |
| 1223 Y|X=x ~ N(mu + Wx, Psi), | |
| 1224 where Psi is diagonal and W is called the "factor loading matrix". | |
| 1225 Since the noise on both X and Y is diagonal, the components of these | |
| 1226 vectors are uncorrelated, and hence can be represented as individual | |
| 1227 scalar nodes, as we show in (b). | |
| 1228 (This is useful if parts of the observations on the Y vector are occasionally missing.) | |
| 1229 We usually take k=|X| << |Y|=D, so the model tries to explain | |
| 1230 many observations using a low-dimensional subspace. | |
| 1231 | |
| 1232 | |
| 1233 <center> | |
| 1234 <table> | |
| 1235 <tr> | |
| 1236 <td><img src="Figures/fa.gif"> | |
| 1237 <td><img src="Figures/fa_scalar.gif"> | |
| 1238 <td><img src="Figures/mfa.gif"> | |
| 1239 <td><img src="Figures/ifa.gif"> | |
| 1240 <tr> | |
| 1241 <td align=center> (a) | |
| 1242 <td align=center> (b) | |
| 1243 <td align=center> (c) | |
| 1244 <td align=center> (d) | |
| 1245 </table> | |
| 1246 </center> | |
| 1247 | |
| 1248 <p> | |
| 1249 We can create this model in BNT as follows. | |
| 1250 <pre> | |
| 1251 ns = [k D]; | |
| 1252 dag = zeros(2,2); | |
| 1253 dag(1,2) = 1; | |
| 1254 bnet = mk_bnet(dag, ns, 'discrete', []); | |
| 1255 bnet.CPD{1} = gaussian_CPD(bnet, 1, 'mean', zeros(k,1), 'cov', eye(k), ... | |
| 1256 'cov_type', 'diag', 'clamp_mean', 1, 'clamp_cov', 1); | |
| 1257 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(D,1), 'cov', diag(Psi0), 'weights', W0, ... | |
| 1258 'cov_type', 'diag', 'clamp_mean', 1); | |
| 1259 </pre> | |
| 1260 | |
| 1261 The root node is clamped to the N(0,I) distribution, so that we will | |
| 1262 not update these parameters during learning. | |
| 1263 The mean of the leaf node is clamped to 0, | |
| 1264 since we assume the data has been centered (had its mean subtracted | |
| 1265 off); this is just for simplicity. | |
| 1266 Finally, the covariance of the leaf node is constrained to be | |
| 1267 diagonal. W0 and Psi0 are the initial parameter guesses. | |
| 1268 | |
| 1269 <p> | |
| 1270 We can fit this model (i.e., estimate its parameters in a maximum | |
| 1271 likelihood (ML) sense) using EM, as we | |
| 1272 explain <a href="#em">below</a>. | |
| 1273 Not surprisingly, the ML estimates for mu and Psi turn out to be | |
| 1274 identical to the | |
| 1275 sample mean and variance, which can be computed directly as | |
| 1276 <pre> | |
| 1277 mu_ML = mean(data); | |
| 1278 Psi_ML = diag(cov(data)); | |
| 1279 </pre> | |
| 1280 Note that W can only be identified up to a rotation matrix, because of | |
| 1281 the spherical symmetry of the source. | |
| 1282 | |
| 1283 <p> | |
| 1284 If we restrict Psi to be spherical, i.e., Psi = sigma*I, | |
| 1285 there is a closed-form solution for W as well, | |
| 1286 i.e., we do not need to use EM. | |
| 1287 In particular, W contains the first |X| eigenvectors of the sample covariance | |
| 1288 matrix, with scalings determined by the eigenvalues and sigma. | |
| 1289 Classical PCA can be obtained by taking the sigma->0 limit. | |
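| A sketch of this limiting case in plain Matlab (assuming <tt>data</tt> is a | |
| cases-by-dimensions matrix and k=|X| as above): | |
| <pre> | |
| [V, D] = eig(cov(data));            % eigenvectors/values of the sample covariance | |
| [negevals, order] = sort(-diag(D)); % indices of eigenvalues in descending order | |
| W_pca = V(:, order(1:k));           % the first k principal directions | |
| </pre> | |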
| 1290 For details, see | |
| 1291 | |
| 1292 <ul> | |
| 1293 <li> <a href="ftp://hope.caltech.edu/pub/roweis/Empca/empca.ps"> | |
| 1294 "EM algorithms for PCA and SPCA"</a>, Sam Roweis, NIPS 97. | |
| 1295 (<a href="ftp://hope.caltech.edu/pub/roweis/Code/empca.tar.gz"> | |
| 1296 Matlab software</a>) | |
| 1297 | |
| 1298 <p> | |
| 1299 <li> | |
| 1300 <a | |
| 1301 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003> | |
| 1302 "Mixtures of probabilistic principal component analyzers"</a>, | |
| 1303 Tipping and Bishop, Neural Computation 11(2):443--482, 1999. | |
| 1304 </ul> | |
| 1305 | |
| 1306 <p> | |
| 1307 By adding a hidden discrete variable, we can create mixtures of FA | |
| 1308 models, as shown in (c). | |
| 1309 Now we can explain the data using a set of subspaces. | |
| 1310 We can create this model in BNT as follows. | |
| 1311 <pre> | |
| 1312 ns = [M k D]; | |
| 1313 dag = zeros(3); | |
| 1314 dag(1,3) = 1; | |
| 1315 dag(2,3) = 1; | |
| 1316 bnet = mk_bnet(dag, ns, 'discrete', 1); | |
| 1317 bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0); | |
| 1318 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(k, 1), 'cov', eye(k), 'cov_type', 'diag', ... | |
| 1319 'clamp_mean', 1, 'clamp_cov', 1); | |
| 1320 bnet.CPD{3} = gaussian_CPD(bnet, 3, 'mean', Mu0', 'cov', repmat(diag(Psi0), [1 1 M]), ... | |
| 1321 'weights', W0, 'cov_type', 'diag', 'tied_cov', 1); | |
| 1322 </pre> | |
| 1323 Notice how the covariance matrix for Y is the same for all values of | |
| 1324 Q; that is, the noise level in each sub-space is assumed the same. | |
| 1325 However, we allow the offset, mu, to vary. | |
| 1326 For details, see | |
| 1327 <ul> | |
| 1328 | |
| 1329 <LI> | |
| 1330 <a HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/tr-96-1.ps.gz"> The EM | |
| 1331 Algorithm for Mixtures of Factor Analyzers </A>, | |
| 1332 Ghahramani, Z. and Hinton, G.E. (1996), | |
| 1333 University of Toronto | |
| 1334 Technical Report CRG-TR-96-1. | |
| 1335 (<A HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/mfa.tar.gz">Matlab software</A>) | |
| 1336 | |
| 1337 <p> | |
| 1338 <li> | |
| 1339 <a | |
| 1340 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003> | |
| 1341 "Mixtures of probabilistic principal component analyzers"</a>, | |
| 1342 Tipping and Bishop, Neural Computation 11(2):443--482, 1999. | |
| 1343 </ul> | |
| 1344 | |
| 1345 <p> | |
| 1346 I have included Zoubin's specialized MFA code (with his permission) | |
| 1347 with the toolbox, so you can check that BNT gives the same results: | |
| 1348 see 'BNT/examples/static/mfa1.m'. | |
| 1349 | |
| 1350 <p> | |
| 1351 Independent Factor Analysis (IFA) generalizes FA by allowing a | |
| 1352 non-Gaussian prior on each component of X. | |
| 1353 (Note that we can approximate a non-Gaussian prior using a mixture of | |
| 1354 Gaussians.) | |
| 1355 This means that the likelihood function is no longer rotationally | |
| 1356 invariant, so we can uniquely identify W and the hidden | |
| 1357 sources X. | |
| 1358 IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y). | |
| 1359 We recover classical Independent Components Analysis (ICA) | |
| 1360 in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the | |
| 1361 weight matrix W is square and invertible. | |
| 1362 For details, see | |
| 1363 <ul> | |
| 1364 <li> | |
| 1365 <a href="http://www.gatsby.ucl.ac.uk/~hagai/ifa.ps">Independent Factor | |
| 1366 Analysis</a>, H. Attias, Neural Computation 11: 803--851, 1998. | |
| 1367 </ul> | |
| 1368 | |
| 1369 | |
| 1370 | |
| 1371 <h2><a name="mixexp">Mixtures of experts</h2> | |
| 1372 | |
| 1373 As an example of the use of the softmax function, | |
| 1374 we introduce the Mixture of Experts model. | |
| 1375 <!-- | |
| 1376 We also show | |
| 1377 the Hierarchical Mixture of Experts model, where the hierarchy has two | |
| 1378 levels. | |
| 1379 (This is essentially a probabilistic decision tree of height two.) | |
| 1380 --> | |
| 1381 As before, | |
| 1382 circles denote continuous-valued nodes, | |
| 1383 squares denote discrete nodes, clear | |
| 1384 means hidden, and shaded means observed. | |
| 1385 <p> | |
| 1386 <center> | |
| 1387 <table> | |
| 1388 <tr> | |
| 1389 <td><img src="Figures/mixexp.gif"> | |
| 1390 <!-- | |
| 1391 <td><img src="Figures/hme.gif"> | |
| 1392 --> | |
| 1393 </table> | |
| 1394 </center> | |
| 1395 <p> | |
| 1396 X is the observed | |
| 1397 input, Y is the output, and | |
| 1398 the Q nodes are hidden "gating" nodes, which select the appropriate | |
| 1399 set of parameters for Y. During training, Y is assumed observed, | |
| 1400 but for testing, the goal is to predict Y given X. | |
| 1401 Note that this is a <em>conditional</em> density model, so we don't | |
| 1402 associate any parameters with X. | |
| 1403 Hence X's CPD will be a root CPD, which is a way of modelling | |
| 1404 exogenous nodes. | |
| 1405 If the output is a continuous-valued quantity, | |
| 1406 we assume the "experts" are linear-regression units, | |
| 1407 and set Y's CPD to linear-Gaussian. | |
| 1408 If the output is discrete, we set Y's CPD to a softmax function. | |
| 1409 The Q CPDs will always be softmax functions. | |
| 1410 | |
| 1411 <p> | |
| 1412 As a concrete example, consider the mixture of experts model where X and Y are | |
| 1413 scalars, and Q is binary. | |
| 1414 This is just piecewise linear regression, where | |
| 1415 we have two line segments, i.e., | |
| 1416 <P> | |
| 1417 <IMG ALIGN=BOTTOM SRC="Eqns/lin_reg_eqn.gif"> | |
| 1418 <P> | |
| 1419 We can create this model with random parameters as follows. | |
| 1420 (This code is bundled in BNT/examples/static/mixexp2.m.) | |
| 1421 <PRE> | |
| 1422 X = 1; | |
| 1423 Q = 2; | |
| 1424 Y = 3; | |
| 1425 dag = zeros(3,3); | |
dag(X,[Q Y]) = 1;
| 1427 dag(Q,Y) = 1; | |
| 1428 ns = [1 2 1]; % make X and Y scalars, and have 2 experts | |
| 1429 onodes = [1 3]; | |
| 1430 bnet = mk_bnet(dag, ns, 'discrete', 2, 'observed', onodes); | |
| 1431 | |
| 1432 rand('state', 0); | |
| 1433 randn('state', 0); | |
| 1434 bnet.CPD{1} = root_CPD(bnet, 1); | |
| 1435 bnet.CPD{2} = softmax_CPD(bnet, 2); | |
| 1436 bnet.CPD{3} = gaussian_CPD(bnet, 3); | |
| 1437 </PRE> | |
| 1438 Now let us fit this model using <a href="#em">EM</a>. | |
| 1439 First we <a href="#load_data">load the data</a> (1000 training cases) and plot them. | |
| 1440 <P> | |
| 1441 <PRE> | |
| 1442 data = load('/examples/static/Misc/mixexp_data.txt', '-ascii'); | |
| 1443 plot(data(:,1), data(:,2), '.'); | |
| 1444 </PRE> | |
| 1445 <p> | |
| 1446 <center> | |
| 1447 <IMG SRC="Figures/mixexp_data.gif"> | |
| 1448 </center> | |
| 1449 <p> | |
| 1450 This is what the model looks like before training. | |
| 1451 (Thanks to Thomas Hofman for writing this plotting routine.) | |
| 1452 <p> | |
| 1453 <center> | |
| 1454 <IMG SRC="Figures/mixexp_before.gif"> | |
| 1455 </center> | |
| 1456 <p> | |
| 1457 Now let's train the model, and plot the final performance. | |
| 1458 (We will discuss how to train models in more detail <a href="#param_learning">below</a>.) | |
| 1459 <P> | |
| 1460 <PRE> | |
| 1461 ncases = size(data, 1); % each row of data is a training case | |
| 1462 cases = cell(3, ncases); | |
| 1463 cases([1 3], :) = num2cell(data'); % each column of cases is a training case | |
| 1464 engine = jtree_inf_engine(bnet); | |
| 1465 max_iter = 20; | |
| 1466 [bnet2, LLtrace] = learn_params_em(engine, cases, max_iter); | |
| 1467 </PRE> | |
| 1468 (We specify which nodes will be observed when we create the engine. | |
| 1469 Hence BNT knows that the hidden nodes are all discrete. | |
| 1470 For complex models, this can lead to a significant speedup.) | |
| 1471 Below we show what the model looks like after 16 iterations of EM | |
| 1472 (with 100 IRLS iterations per M step), when it converged | |
| 1473 using the default convergence tolerance (that the | |
| 1474 fractional change in the log-likelihood be less than 1e-3). | |
| 1475 Before learning, the log-likelihood was | |
| 1476 -322.927442; afterwards, it was -13.728778. | |
| 1477 <p> | |
| 1478 <center> | |
| 1479 <IMG SRC="Figures/mixexp_after.gif"> | |
| 1480 </center> | |
| 1481 (See BNT/examples/static/mixexp2.m for details of the code.) | |
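<p>
As a sanity check, here is a hedged sketch (not part of mixexp2.m) of how one
might use the fitted model to predict Y for a new input: rebuild the network
with only X declared as observed, copy over the CPDs learned by EM, enter the
input as evidence, and read off the posterior over Y, which is a mixture of
Gaussians with one component per expert.
<pre>
bnet_test = mk_bnet(dag, ns, 'discrete', 2, 'observed', 1);
for i=1:3
  bnet_test.CPD{i} = bnet2.CPD{i};     % copy the CPDs learned by EM
end
engine_test = jtree_inf_engine(bnet_test);
evidence = cell(1,3);
evidence{1} = 0.5;                     % a hypothetical test input x
engine_test = enter_evidence(engine_test, evidence);
margY = marginal_nodes(engine_test, 3) % posterior over Y: fields mu, Sigma, T
</pre>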
| 1482 | |
| 1483 | |
| 1484 | |
| 1485 <h2><a name="hme">Hierarchical mixtures of experts</h2> | |
| 1486 | |
| 1487 A hierarchical mixture of experts (HME) extends the mixture of experts | |
| 1488 model by having more than one hidden node. A two-level example is shown below, along | |
| 1489 with its more traditional representation as a neural network. | |
| 1490 This is like a (balanced) probabilistic decision tree of height 2. | |
| 1491 <p> | |
| 1492 <center> | |
| 1493 <IMG SRC="Figures/HMEforMatlab.jpg"> | |
| 1494 </center> | |
| 1495 <p> | |
| 1496 <a href="mailto:pbrutti@stat.cmu.edu">Pierpaolo Brutti</a> | |
| 1497 has written an extensive set of routines for HMEs, | |
| 1498 which are bundled with BNT: see the examples/static/HME directory. | |
| 1499 These routines allow you to choose the number of hidden (gating) | |
| 1500 layers, and the form of the experts (softmax or MLP). | |
| 1501 See the file hmemenu, which provides a demo. | |
| 1502 For example, the figure below shows the decision boundaries learned | |
| 1503 for a ternary classification problem, using a 2 level HME with softmax | |
| 1504 gates and softmax experts; the training set is on the left, the | |
| 1505 testing set on the right. | |
| 1506 <p> | |
| 1507 <center> | |
| 1508 <!--<IMG SRC="Figures/hme_dec_boundary.gif">--> | |
| 1509 <IMG SRC="Figures/hme_dec_boundary.png"> | |
| 1510 </center> | |
| 1511 <p> | |
| 1512 | |
| 1513 | |
| 1514 <p> | |
| 1515 For more details, see the following: | |
| 1516 <ul> | |
| 1517 | |
| 1518 <li> <a href="http://www.cs.berkeley.edu/~jordan/papers/hierarchies.ps.Z"> | |
| 1519 Hierarchical mixtures of experts and the EM algorithm</a> | |
| 1520 M. I. Jordan and R. A. Jacobs. Neural Computation, 6, 181-214, 1994. | |
| 1521 | |
| 1522 <li> <a href = | |
| 1523 "http://www.cs.berkeley.edu/~dmartin/software">David Martin's | |
| 1524 matlab code for HME</a> | |
| 1525 | |
| 1526 <li> <a | |
| 1527 href="http://www.cs.berkeley.edu/~jordan/papers/uai.ps.Z">Why the | |
| 1528 logistic function? A tutorial discussion on | |
| 1529 probabilities and neural networks.</a> M. I. Jordan. MIT Computational | |
| 1530 Cognitive Science Report 9503, August 1995. | |
| 1531 | |
| 1532 <li> "Generalized Linear Models", McCullagh and Nelder, Chapman and | |
Hall, 1983.
| 1534 | |
| 1535 <li> | |
| 1536 "Improved learning algorithms for mixtures of experts in multiclass | |
| 1537 classification". | |
| 1538 K. Chen, L. Xu, H. Chi. | |
| 1539 Neural Networks (1999) 12: 1229-1252. | |
| 1540 | |
| 1541 <li> <a href="http://www.oigeeza.com/steve/"> | |
| 1542 Classification Using Hierarchical Mixtures of Experts</a> | |
| 1543 S.R. Waterhouse and A.J. Robinson. | |
| 1544 In Proc. IEEE Workshop on Neural Network for Signal Processing IV (1994), pp. 177-186 | |
| 1545 | |
| 1546 <li> <a href="http://www.idiap.ch/~perry/"> | |
| 1547 Localized mixtures of experts</a>, | |
| 1548 P. Moerland, 1998. | |
| 1549 | |
| 1550 <li> "Nonlinear gated experts for time series", | |
| 1551 A.S. Weigend and M. Mangeas, 1995. | |
| 1552 | |
| 1553 </ul> | |
| 1554 | |
| 1555 | |
| 1556 <h2><a name="qmr">QMR</h2> | |
| 1557 | |
| 1558 Bayes nets originally arose out of an attempt to add probabilities to | |
| 1559 expert systems, and this is still the most common use for BNs. | |
| 1560 A famous example is | |
| 1561 QMR-DT, a decision-theoretic reformulation of the Quick Medical | |
| 1562 Reference (QMR) model. | |
| 1563 <p> | |
| 1564 <center> | |
| 1565 <IMG ALIGN=BOTTOM SRC="Figures/qmr.gif"> | |
| 1566 </center> | |
| 1567 Here, the top layer represents hidden disease nodes, and the bottom | |
| 1568 layer represents observed symptom nodes. | |
| 1569 The goal is to infer the posterior probability of each disease given | |
| 1570 all the symptoms (which can be present, absent or unknown). | |
| 1571 Each node in the top layer has a Bernoulli prior (with a low prior | |
| 1572 probability that the disease is present). | |
| 1573 Since each node in the bottom layer has a high fan-in, we use a | |
| 1574 noisy-OR parameterization; each disease has an independent chance of | |
| 1575 causing each symptom. | |
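Under the noisy-OR assumption, the probability that finding j is
<em>absent</em> is the product of the inhibition probabilities of its
"on" parents and of an always-on leak term, i.e. (using the notation of the
function below),
<pre>
P(f_j = absent | d_1, ..., d_n) = leak(j) * prod_{i: d_i = present} inhibit(i,j)
</pre>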
| 1576 The real QMR-DT model is copyright, but | |
| 1577 we can create a random QMR-like model as follows. | |
| 1578 <pre> | |
| 1579 function bnet = mk_qmr_bnet(G, inhibit, leak, prior) | |
| 1580 % MK_QMR_BNET Make a QMR model | |
| 1581 % bnet = mk_qmr_bnet(G, inhibit, leak, prior) | |
| 1582 % | |
| 1583 % G(i,j) = 1 iff there is an arc from disease i to finding j | |
| 1584 % inhibit(i,j) = inhibition probability on i->j arc | |
| 1585 % leak(j) = inhibition prob. on leak->j arc | |
| 1586 % prior(i) = prob. disease i is on | |
| 1587 | |
| 1588 [Ndiseases Nfindings] = size(inhibit); | |
| 1589 N = Ndiseases + Nfindings; | |
| 1590 finding_node = Ndiseases+1:N; | |
| 1591 ns = 2*ones(1,N); | |
| 1592 dag = zeros(N,N); | |
| 1593 dag(1:Ndiseases, finding_node) = G; | |
| 1594 bnet = mk_bnet(dag, ns, 'observed', finding_node); | |
| 1595 | |
| 1596 for d=1:Ndiseases | |
| 1597 CPT = [1-prior(d) prior(d)]; | |
| 1598 bnet.CPD{d} = tabular_CPD(bnet, d, CPT'); | |
| 1599 end | |
| 1600 | |
| 1601 for i=1:Nfindings | |
| 1602 fnode = finding_node(i); | |
| 1603 ps = parents(G, i); | |
| 1604 bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(ps, i)); | |
| 1605 end | |
| 1606 </pre> | |
| 1607 In the file BNT/examples/static/qmr1, we create a random bipartite | |
| 1608 graph G, with 5 diseases and 10 findings, and random parameters. | |
| 1609 (In general, to create a random dag, use 'mk_random_dag'.) | |
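For instance, a hedged sketch of building such a random model (the exact
values used in qmr1 may differ) is
<pre>
Ndiseases = 5; Nfindings = 10;
G = rand(Ndiseases, Nfindings) > 0.5;   % random bipartite graph
inhibit = rand(Ndiseases, Nfindings);   % random inhibition probabilities
leak = 0.5*ones(1, Nfindings);          % inhibition prob. of the leak terms
prior = 0.01*ones(1, Ndiseases);        % low prior prob. of each disease
bnet = mk_qmr_bnet(G, inhibit, leak, prior);
</pre>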
| 1610 We can visualize the resulting graph structure using | |
| 1611 the methods discussed <a href="#graphdraw">below</a>, with the | |
| 1612 following results: | |
| 1613 <p> | |
| 1614 <img src="Figures/qmr.rnd.jpg"> | |
| 1615 | |
| 1616 <p> | |
| 1617 Now let us put some random evidence on all the leaves except the very | |
| 1618 first and very last, and compute the disease posteriors. | |
| 1619 <pre> | |
N = Ndiseases + Nfindings;
diseases = 1:Ndiseases; findings = Ndiseases+1:N;  % node numbers of the two layers
pos = 2:floor(Nfindings/2);
| 1621 neg = (pos(end)+1):(Nfindings-1); | |
| 1622 onodes = myunion(pos, neg); | |
| 1623 evidence = cell(1, N); | |
| 1624 evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos))); | |
| 1625 evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg))); | |
| 1626 | |
| 1627 engine = jtree_inf_engine(bnet); | |
| 1628 [engine, ll] = enter_evidence(engine, evidence); | |
| 1629 post = zeros(1, Ndiseases); | |
| 1630 for i=diseases(:)' | |
| 1631 m = marginal_nodes(engine, i); | |
| 1632 post(i) = m.T(2); | |
| 1633 end | |
| 1634 </pre> | |
| 1635 Junction tree can be quite slow on large QMR models. | |
| 1636 Fortunately, it is possible to exploit properties of the noisy-OR | |
| 1637 function to speed up exact inference using an algorithm called | |
| 1638 <a href="#quickscore">quickscore</a>, discussed below. | |
| 1639 | |
| 1640 | |
| 1641 | |
| 1642 | |
| 1643 | |
| 1644 <h2><a name="cg_model">Conditional Gaussian models</h2> | |
| 1645 | |
| 1646 A conditional Gaussian model is one in which, conditioned on all the discrete | |
| 1647 nodes, the distribution over the remaining (continuous) nodes is | |
| 1648 multivariate Gaussian. This means we can have arcs from discrete (D) | |
| 1649 to continuous (C) nodes, but not vice versa. | |
| 1650 (We <em>are</em> allowed C->D arcs if the continuous nodes are observed, | |
| 1651 as in the <a href="#mixexp">mixture of experts</a> model, | |
| 1652 since this distribution can be represented with a discrete potential.) | |
| 1653 <p> | |
| 1654 We now give an example of a CG model, from | |
the paper "Propagation of Probabilities, Means and
| 1656 Variances in Mixed Graphical Association Models", Steffen Lauritzen, | |
| 1657 JASA 87(420):1098--1108, 1992 (reprinted in the book "Probabilistic Networks and Expert | |
| 1658 Systems", R. G. Cowell, A. P. Dawid, S. L. Lauritzen and | |
| 1659 D. J. Spiegelhalter, Springer, 1999.) | |
| 1660 | |
| 1661 <h3>Specifying the graph</h3> | |
| 1662 | |
| 1663 Consider the model of waste emissions from an incinerator plant shown below. | |
| 1664 We follow the standard convention that shaded nodes are observed, | |
| 1665 clear nodes are hidden. | |
| 1666 We also use the non-standard convention that | |
| 1667 square nodes are discrete (tabular) and round nodes are | |
| 1668 Gaussian. | |
| 1669 | |
| 1670 <p> | |
| 1671 <center> | |
| 1672 <IMG SRC="Figures/cg1.gif"> | |
| 1673 </center> | |
| 1674 <p> | |
| 1675 | |
| 1676 We can create this model as follows. | |
| 1677 <pre> | |
| 1678 F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9; | |
| 1679 n = 9; | |
| 1680 | |
| 1681 dag = zeros(n); | |
| 1682 dag(F,E)=1; | |
| 1683 dag(W,[E Min D]) = 1; | |
| 1684 dag(E,D)=1; | |
| 1685 dag(B,[C D])=1; | |
| 1686 dag(D,[L Mout])=1; | |
| 1687 dag(Min,Mout)=1; | |
| 1688 | |
| 1689 % node sizes - all cts nodes are scalar, all discrete nodes are binary | |
| 1690 ns = ones(1, n); | |
| 1691 dnodes = [F W B]; | |
| 1692 cnodes = mysetdiff(1:n, dnodes); | |
| 1693 ns(dnodes) = 2; | |
| 1694 | |
| 1695 bnet = mk_bnet(dag, ns, 'discrete', dnodes); | |
| 1696 </pre> | |
| 1697 'dnodes' is a list of the discrete nodes; 'cnodes' is the continuous | |
| 1698 nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'. | |
| 1699 <p> | |
| 1700 | |
| 1701 | |
| 1702 <h3>Specifying the parameters</h3> | |
| 1703 | |
| 1704 The parameters of the discrete nodes can be specified as follows. | |
| 1705 <pre> | |
| 1706 bnet.CPD{B} = tabular_CPD(bnet, B, 'CPT', [0.85 0.15]); % 1=stable, 2=unstable | |
| 1707 bnet.CPD{F} = tabular_CPD(bnet, F, 'CPT', [0.95 0.05]); % 1=intact, 2=defect | |
| 1708 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [2/7 5/7]); % 1=industrial, 2=household | |
| 1709 </pre> | |
| 1710 | |
| 1711 <p> | |
| 1712 The parameters of the continuous nodes can be specified as follows. | |
| 1713 <pre> | |
| 1714 bnet.CPD{E} = gaussian_CPD(bnet, E, 'mean', [-3.9 -0.4 -3.2 -0.5], ... | |
| 1715 'cov', [0.00002 0.0001 0.00002 0.0001]); | |
| 1716 bnet.CPD{D} = gaussian_CPD(bnet, D, 'mean', [6.5 6.0 7.5 7.0], ... | |
| 1717 'cov', [0.03 0.04 0.1 0.1], 'weights', [1 1 1 1]); | |
| 1718 bnet.CPD{C} = gaussian_CPD(bnet, C, 'mean', [-2 -1], 'cov', [0.1 0.3]); | |
| 1719 bnet.CPD{L} = gaussian_CPD(bnet, L, 'mean', 3, 'cov', 0.25, 'weights', -0.5); | |
| 1720 bnet.CPD{Min} = gaussian_CPD(bnet, Min, 'mean', [0.5 -0.5], 'cov', [0.01 0.005]); | |
| 1721 bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 'mean', 0, 'cov', 0.002, 'weights', [1 1]); | |
| 1722 </pre> | |
| 1723 | |
| 1724 | |
| 1725 <h3><a name="cg_infer">Inference</h3> | |
| 1726 | |
| 1727 <!--Let us perform inference in the <a href="#cg_model">waste incinerator example</a>.--> | |
| 1728 First we compute the unconditional marginals. | |
| 1729 <pre> | |
| 1730 engine = jtree_inf_engine(bnet); | |
| 1731 evidence = cell(1,n); | |
| 1732 [engine, ll] = enter_evidence(engine, evidence); | |
| 1733 marg = marginal_nodes(engine, E); | |
| 1734 </pre> | |
| 1735 <!--(Of course, we could use <tt>cond_gauss_inf_engine</tt> instead of jtree.)--> | |
| 1736 'marg' is a structure that contains the fields 'mu' and 'Sigma', which | |
| 1737 contain the mean and (co)variance of the marginal on E. | |
| 1738 In this case, they are both scalars. | |
| 1739 Let us check they match the published figures (to 2 decimal places). | |
| 1740 <!--(We can't expect | |
| 1741 more precision than this in general because I have implemented the algorithm of | |
| 1742 Lauritzen (1992), which can be numerically unstable.)--> | |
| 1743 <pre> | |
| 1744 tol = 1e-2; | |
| 1745 assert(approxeq(marg.mu, -3.25, tol)); | |
| 1746 assert(approxeq(sqrt(marg.Sigma), 0.709, tol)); | |
| 1747 </pre> | |
| 1748 We can compute the other posteriors similarly. | |
| 1749 Now let us add some evidence. | |
| 1750 <pre> | |
| 1751 evidence = cell(1,n); | |
| 1752 evidence{W} = 1; % industrial | |
| 1753 evidence{L} = 1.1; | |
| 1754 evidence{C} = -0.9; | |
| 1755 [engine, ll] = enter_evidence(engine, evidence); | |
| 1756 </pre> | |
| 1757 Now we find | |
| 1758 <pre> | |
| 1759 marg = marginal_nodes(engine, E); | |
| 1760 assert(approxeq(marg.mu, -3.8983, tol)); | |
| 1761 assert(approxeq(sqrt(marg.Sigma), 0.0763, tol)); | |
| 1762 </pre> | |
| 1763 | |
| 1764 | |
| 1765 We can also compute the joint probability on a set of nodes. | |
| 1766 For example, P(D, Mout | evidence) is a 2D Gaussian: | |
| 1767 <pre> | |
| 1768 marg = marginal_nodes(engine, [D Mout]) | |
| 1769 marg = | |
| 1770 domain: [6 8] | |
| 1771 mu: [2x1 double] | |
| 1772 Sigma: [2x2 double] | |
| 1773 T: 1.0000 | |
| 1774 </pre> | |
| 1775 The mean is | |
| 1776 <pre> | |
| 1777 marg.mu | |
| 1778 ans = | |
| 1779 3.6077 | |
| 1780 4.1077 | |
| 1781 </pre> | |
| 1782 and the covariance matrix is | |
| 1783 <pre> | |
| 1784 marg.Sigma | |
| 1785 ans = | |
| 1786 0.1062 0.1062 | |
| 1787 0.1062 0.1182 | |
| 1788 </pre> | |
| 1789 It is easy to visualize this posterior using standard Matlab plotting | |
| 1790 functions, e.g., | |
| 1791 <pre> | |
| 1792 gaussplot2d(marg.mu, marg.Sigma); | |
| 1793 </pre> | |
| 1794 produces the following picture. | |
| 1795 | |
| 1796 <p> | |
| 1797 <center> | |
| 1798 <IMG SRC="Figures/gaussplot.png"> | |
| 1799 </center> | |
| 1800 <p> | |
| 1801 | |
| 1802 | |
| 1803 The T field indicates that the mixing weight of this Gaussian | |
| 1804 component is 1.0. | |
| 1805 If the joint contains discrete and continuous variables, the result | |
| 1806 will be a mixture of Gaussians, e.g., | |
| 1807 <pre> | |
| 1808 marg = marginal_nodes(engine, [F E]) | |
| 1809 domain: [1 3] | |
| 1810 mu: [-3.9000 -0.4003] | |
| 1811 Sigma: [1x1x2 double] | |
| 1812 T: [0.9995 4.7373e-04] | |
| 1813 </pre> | |
| 1814 The interpretation is | |
| 1815 Sigma(i,j,k) = Cov[ E(i) E(j) | F=k ]. | |
| 1816 In this case, E is a scalar, so i=j=1; k specifies the mixture component. | |
| 1817 <p> | |
| 1818 We saw in the sprinkler network that BNT sets the effective size of | |
| 1819 observed discrete nodes to 1, since they only have one legal value. | |
For observed continuous nodes, BNT sets their effective length to 0,
| 1821 since they have been reduced to a point. | |
| 1822 For example, | |
| 1823 <pre> | |
| 1824 marg = marginal_nodes(engine, [B C]) | |
| 1825 domain: [4 5] | |
| 1826 mu: [] | |
| 1827 Sigma: [] | |
| 1828 T: [0.0123 0.9877] | |
| 1829 </pre> | |
| 1830 It is simple to post-process the output of marginal_nodes. | |
| 1831 For example, the file BNT/examples/static/cg1 sets the mu term of | |
| 1832 observed nodes to their observed value, and the Sigma term to 0 (since | |
| 1833 observed nodes have no variance). | |
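A minimal sketch of this kind of post-processing (following the idea in cg1,
though not its exact code) is
<pre>
marg = marginal_nodes(engine, L);
if isempty(marg.mu)              % L was observed, so BNT returns an empty marginal
  marg.mu = evidence{L};         % plug in the observed value
  marg.Sigma = 0;                % an observed node has no variance
end
</pre>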
| 1834 | |
| 1835 <p> | |
| 1836 Note that the implemented version of the junction tree is numerically | |
| 1837 unstable when using CG potentials | |
| 1838 (which is why, in the example above, we only required our answers to agree with | |
| 1839 the published ones to 2dp.) | |
| 1840 This is why you might want to use <tt>stab_cond_gauss_inf_engine</tt>, | |
| 1841 implemented by Shan Huang. This is described in | |
| 1842 | |
| 1843 <ul> | |
| 1844 <li> "Stable Local Computation with Conditional Gaussian Distributions", | |
| 1845 S. Lauritzen and F. Jensen, Tech Report R-99-2014, | |
Dept. Math. Sciences, Aalborg Univ., 1999.
| 1847 </ul> | |
| 1848 | |
| 1849 However, even the numerically stable version | |
| 1850 can be computationally intractable if there are many hidden discrete | |
| 1851 nodes, because the number of mixture components grows exponentially e.g., in a | |
| 1852 <a href="usage_dbn.html#lds">switching linear dynamical system</a>. | |
| 1853 In general, one must resort to approximate inference techniques: see | |
| 1854 the discussion on <a href="#engines">inference engines</a> below. | |
| 1855 | |
| 1856 | |
| 1857 <h2><a name="hybrid">Other hybrid models</h2> | |
| 1858 | |
| 1859 When we have C->D arcs, where C is hidden, we need to use | |
| 1860 approximate inference. | |
| 1861 One approach (not implemented in BNT) is described in | |
| 1862 <ul> | |
| 1863 <li> <a | |
| 1864 href="http://www.cs.berkeley.edu/~murphyk/Papers/hybrid_uai99.ps.gz">A | |
| 1865 Variational Approximation for Bayesian Networks with | |
| 1866 Discrete and Continuous Latent Variables</a>, | |
| 1867 K. Murphy, UAI 99. | |
| 1868 </ul> | |
| 1869 Of course, one can always use <a href="#sampling">sampling</a> methods | |
| 1870 for approximate inference in such models. | |
| 1871 | |
| 1872 | |
| 1873 | |
| 1874 <h1><a name="param_learning">Parameter Learning</h1> | |
| 1875 | |
| 1876 The parameter estimation routines in BNT can be classified into 4 | |
| 1877 types, depending on whether the goal is to compute | |
| 1878 a full (Bayesian) posterior over the parameters or just a point | |
| 1879 estimate (e.g., Maximum Likelihood or Maximum A Posteriori), | |
| 1880 and whether all the variables are fully observed or there is missing | |
| 1881 data/ hidden variables (partial observability). | |
| 1882 <p> | |
| 1883 | |
| 1884 <TABLE BORDER> | |
| 1885 <tr> | |
| 1886 <TH></TH> | |
| 1887 <th>Full obs</th> | |
| 1888 <th>Partial obs</th> | |
| 1889 </tr> | |
| 1890 <tr> | |
| 1891 <th>Point</th> | |
| 1892 <td><tt>learn_params</tt></td> | |
| 1893 <td><tt>learn_params_em</tt></td> | |
| 1894 </tr> | |
| 1895 <tr> | |
| 1896 <th>Bayes</th> | |
| 1897 <td><tt>bayes_update_params</tt></td> | |
| 1898 <td>not yet supported</td> | |
| 1899 </tr> | |
| 1900 </table> | |
| 1901 | |
| 1902 | |
| 1903 <h2><a name="load_data">Loading data from a file</h2> | |
| 1904 | |
| 1905 To load numeric data from an ASCII text file called 'dat.txt', where each row is a | |
| 1906 case and columns are separated by white-space, such as | |
| 1907 <pre> | |
| 1908 011979 1626.5 0.0 | |
| 1909 021979 1367.0 0.0 | |
| 1910 ... | |
| 1911 </pre> | |
| 1912 you can use | |
| 1913 <pre> | |
| 1914 data = load('dat.txt'); | |
| 1915 </pre> | |
| 1916 or | |
| 1917 <pre> | |
| 1918 load dat.txt -ascii | |
| 1919 </pre> | |
| 1920 In the latter case, the data is stored in a variable called 'dat' (the | |
| 1921 filename minus the extension). | |
| 1922 Alternatively, suppose the data is stored in a .csv file (has commas | |
| 1923 separating the columns, and contains a header line), such as | |
| 1924 <pre> | |
| 1925 header info goes here | |
| 1926 ORD,011979,1626.5,0.0 | |
| 1927 DSM,021979,1367.0,0.0 | |
| 1928 ... | |
| 1929 </pre> | |
| 1930 You can load this using | |
| 1931 <pre> | |
[a,b,c,d] = textread('dat.csv', '%s %d %f %f', 'delimiter', ',', 'headerlines', 1);
| 1933 </pre> | |
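The numeric columns returned above can then be assembled into a single data
matrix, one row per case (a small sketch; the string column 'a' is simply dropped):
<pre>
data = [b c d];
</pre>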
| 1934 If your file is not in either of these formats, you can either use Perl to convert | |
it to this format, or use Matlab's fscanf function.
| 1936 Type | |
| 1937 <tt> | |
| 1938 help iofun | |
| 1939 </tt> | |
| 1940 for more information on Matlab's file functions. | |
| 1941 <!-- | |
| 1942 <p> | |
| 1943 To load data directly from Excel, | |
| 1944 you should buy the | |
| 1945 <a href="http://www.mathworks.com/products/excellink/">Excel Link</a>. | |
| 1946 To load data directly from a relational database, | |
| 1947 you should buy the | |
| 1948 <a href="http://www.mathworks.com/products/database">Database | |
| 1949 toolbox</a>. | |
| 1950 --> | |
| 1951 <p> | |
| 1952 BNT learning routines require data to be stored in a cell array. | |
| 1953 data{i,m} is the value of node i in case (example) m, i.e., each | |
| 1954 <em>column</em> is a case. | |
| 1955 If node i is not observed in case m (missing value), set | |
| 1956 data{i,m} = []. | |
| 1957 (Not all the learning routines can cope with such missing values, however.) | |
| 1958 In the special case that all the nodes are observed and are | |
| 1959 scalar-valued (as opposed to vector-valued), the data can be | |
| 1960 stored in a matrix (as opposed to a cell-array). | |
| 1961 <p> | |
| 1962 Suppose, as in the <a href="#mixexp">mixture of experts example</a>, | |
| 1963 that we have 3 nodes in the graph: X(1) is the observed input, X(3) is | |
| 1964 the observed output, and X(2) is a hidden (gating) node. We can | |
| 1965 create the dataset as follows. | |
| 1966 <pre> | |
| 1967 data = load('dat.txt'); | |
| 1968 ncases = size(data, 1); | |
| 1969 cases = cell(3, ncases); | |
| 1970 cases([1 3], :) = num2cell(data'); | |
| 1971 </pre> | |
| 1972 Notice how we transposed the data, to convert rows into columns. | |
| 1973 Also, cases{2,m} = [] for all m, since X(2) is always hidden. | |
| 1974 | |
| 1975 | |
| 1976 <h2><a name="mle_complete">Maximum likelihood parameter estimation from complete data</h2> | |
| 1977 | |
| 1978 As an example, let's generate some data from the sprinkler network, randomize the parameters, | |
| 1979 and then try to recover the original model. | |
| 1980 First we create some training data using forwards sampling. | |
| 1981 <pre> | |
| 1982 samples = cell(N, nsamples); | |
| 1983 for i=1:nsamples | |
| 1984 samples(:,i) = sample_bnet(bnet); | |
| 1985 end | |
| 1986 </pre> | |
| 1987 samples{j,i} contains the value of the j'th node in case i. | |
| 1988 sample_bnet returns a cell array because, in general, each node might | |
| 1989 be a vector of different length. | |
| 1990 In this case, all nodes are discrete (and hence scalars), so we | |
| 1991 could have used a regular array instead (which can be quicker): | |
| 1992 <pre> | |
| 1993 data = cell2num(samples); | |
</pre>
| 1995 So now data(j,i) = samples{j,i}. | |
| 1996 <p> | |
| 1997 Now we create a network with random parameters. | |
| 1998 (The initial values of bnet2 don't matter in this case, since we can find the | |
| 1999 globally optimal MLE independent of where we start.) | |
| 2000 <pre> | |
| 2001 % Make a tabula rasa | |
| 2002 bnet2 = mk_bnet(dag, node_sizes); | |
| 2003 seed = 0; | |
| 2004 rand('state', seed); | |
| 2005 bnet2.CPD{C} = tabular_CPD(bnet2, C); | |
| 2006 bnet2.CPD{R} = tabular_CPD(bnet2, R); | |
| 2007 bnet2.CPD{S} = tabular_CPD(bnet2, S); | |
| 2008 bnet2.CPD{W} = tabular_CPD(bnet2, W); | |
| 2009 </pre> | |
| 2010 Finally, we find the maximum likelihood estimates of the parameters. | |
| 2011 <pre> | |
| 2012 bnet3 = learn_params(bnet2, samples); | |
| 2013 </pre> | |
| 2014 To view the learned parameters, we use a little Matlab hackery. | |
| 2015 <pre> | |
| 2016 CPT3 = cell(1,N); | |
| 2017 for i=1:N | |
| 2018 s=struct(bnet3.CPD{i}); % violate object privacy | |
| 2019 CPT3{i}=s.CPT; | |
| 2020 end | |
| 2021 </pre> | |
| 2022 Here are the parameters learned for node 4. | |
| 2023 <pre> | |
| 2024 dispcpt(CPT3{4}) | |
| 2025 1 1 : 1.0000 0.0000 | |
| 2026 2 1 : 0.2000 0.8000 | |
| 2027 1 2 : 0.2273 0.7727 | |
| 2028 2 2 : 0.0000 1.0000 | |
| 2029 </pre> | |
| 2030 So we see that the learned parameters are fairly close to the "true" | |
| 2031 ones, which we display below. | |
| 2032 <pre> | |
| 2033 dispcpt(CPT{4}) | |
| 2034 1 1 : 1.0000 0.0000 | |
| 2035 2 1 : 0.1000 0.9000 | |
| 2036 1 2 : 0.1000 0.9000 | |
| 2037 2 2 : 0.0100 0.9900 | |
| 2038 </pre> | |
| 2039 We can get better results by using a larger training set, or using | |
| 2040 informative priors (see <a href="#prior">below</a>). | |
| 2041 | |
| 2042 | |
| 2043 | |
| 2044 <h2><a name="prior">Parameter priors</h2> | |
| 2045 | |
| 2046 Currently, only tabular CPDs can have priors on their parameters. | |
| 2047 The conjugate prior for a multinomial is the Dirichlet. | |
| 2048 (For binary random variables, the multinomial is the same as the | |
| 2049 Bernoulli, and the Dirichlet is the same as the Beta.) | |
| 2050 <p> | |
| 2051 The Dirichlet has a simple interpretation in terms of pseudo counts. | |
| 2052 If we let N_ijk = the num. times X_i=k and Pa_i=j occurs in the | |
| 2053 training set, where Pa_i are the parents of X_i, | |
| 2054 then the maximum likelihood (ML) estimate is | |
| 2055 T_ijk = N_ijk / N_ij (where N_ij = sum_k' N_ijk'), which will be 0 if N_ijk=0. | |
| 2056 To prevent us from declaring that (X_i=k, Pa_i=j) is impossible just because this | |
| 2057 event was not seen in the training set, | |
| 2058 we can pretend we saw value k of X_i, for each value j of Pa_i some number (alpha_ijk) | |
| 2059 of times in the past. | |
| 2060 The MAP (maximum a posterior) estimate is then | |
| 2061 <pre> | |
| 2062 T_ijk = (N_ijk + alpha_ijk) / (N_ij + alpha_ij) | |
| 2063 </pre> | |
| 2064 and is never 0 if all alpha_ijk > 0. | |
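As a concrete numerical illustration, suppose X_i is binary and a particular
parent configuration j occurs N_ij = 10 times in the training set, always with
X_i = 2 (so N_ij1 = 0, N_ij2 = 10). The ML estimate gives T_ij1 = 0, whereas
with pseudo counts alpha_ij1 = alpha_ij2 = 1 the MAP estimate is
<pre>
T_ij1 = (0 + 1) / (10 + 2) = 1/12,    T_ij2 = (10 + 1) / (10 + 2) = 11/12
</pre>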
| 2065 For example, consider the network A->B, where A is binary and B has 3 | |
| 2066 values. | |
| 2067 A uniform prior for B has the form | |
| 2068 <pre> | |
| 2069 B=1 B=2 B=3 | |
| 2070 A=1 1 1 1 | |
| 2071 A=2 1 1 1 | |
| 2072 </pre> | |
| 2073 which can be created using | |
| 2074 <pre> | |
| 2075 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'unif'); | |
| 2076 </pre> | |
| 2077 This prior does not satisfy the likelihood equivalence principle, | |
| 2078 which says that <a href="#markov_equiv">Markov equivalent</a> models | |
| 2079 should have the same marginal likelihood. | |
| 2080 A prior that does satisfy this principle is shown below. | |
| 2081 Heckerman (1995) calls this the | |
| 2082 BDeu prior (likelihood equivalent uniform Bayesian Dirichlet). | |
| 2083 <pre> | |
| 2084 B=1 B=2 B=3 | |
| 2085 A=1 1/6 1/6 1/6 | |
| 2086 A=2 1/6 1/6 1/6 | |
| 2087 </pre> | |
| 2088 where we put N/(q*r) in each bin; N is the equivalent sample size, | |
| 2089 r=|A|, q = |B|. | |
| 2090 This can be created as follows | |
| 2091 <pre> | |
| 2092 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu'); | |
| 2093 </pre> | |
Here, the equivalent sample size is 1; this determines the strength of the
prior.
| 2096 You can change this using | |
| 2097 <pre> | |
| 2098 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', ... | |
| 2099 'BDeu', 'dirichlet_weight', 10); | |
| 2100 </pre> | |
| 2101 <!--where counts is an array of pseudo-counts of the same size as the | |
| 2102 CPT.--> | |
| 2103 <!-- | |
| 2104 <p> | |
| 2105 When you specify a prior, you should set row i of the CPT to the | |
| 2106 normalized version of row i of the pseudo-count matrix, i.e., to the | |
| 2107 expected values of the parameters. This will ensure that computing the | |
| 2108 marginal likelihood sequentially (see <a | |
| 2109 href="#bayes_learn">below</a>) and in batch form gives the same | |
| 2110 results. | |
| 2111 To do this, proceed as follows. | |
| 2112 <pre> | |
| 2113 tabular_CPD(bnet, i, 'prior', counts, 'CPT', mk_stochastic(counts)); | |
| 2114 </pre> | |
| 2115 For a non-informative prior, you can just write | |
| 2116 <pre> | |
| 2117 tabular_CPD(bnet, i, 'prior', 'unif', 'CPT', 'unif'); | |
| 2118 </pre> | |
| 2119 --> | |
| 2120 | |
| 2121 | |
| 2122 <h2><a name="bayes_learn">(Sequential) Bayesian parameter updating from complete data</h2> | |
| 2123 | |
| 2124 If we use conjugate priors and have fully observed data, we can | |
| 2125 compute the posterior over the parameters in batch form as follows. | |
| 2126 <pre> | |
| 2127 cases = sample_bnet(bnet, nsamples); | |
| 2128 bnet = bayes_update_params(bnet, cases); | |
| 2129 LL = log_marg_lik_complete(bnet, cases); | |
| 2130 </pre> | |
| 2131 bnet.CPD{i}.prior contains the new Dirichlet pseudocounts, | |
| 2132 and bnet.CPD{i}.CPT is set to the mean of the posterior (the | |
| 2133 normalized counts). | |
| 2134 (Hence if the initial pseudo counts are 0, | |
| 2135 <tt>bayes_update_params</tt> and <tt>learn_params</tt> will give the | |
| 2136 same result.) | |
| 2137 | |
| 2138 | |
| 2139 | |
| 2140 | |
| 2141 <p> | |
| 2142 We can compute the same result sequentially (on-line) as follows. | |
| 2143 <pre> | |
| 2144 LL = 0; | |
| 2145 for m=1:nsamples | |
| 2146 LL = LL + log_marg_lik_complete(bnet, cases(:,m)); | |
| 2147 bnet = bayes_update_params(bnet, cases(:,m)); | |
| 2148 end | |
| 2149 </pre> | |
| 2150 | |
| 2151 The file <tt>BNT/examples/static/StructLearn/model_select1</tt> has an example of | |
| 2152 sequential model selection which uses the same idea. | |
| 2153 We generate data from the model A->B | |
| 2154 and compute the posterior prob of all 3 dags on 2 nodes: | |
| 2155 (1) A B, (2) A <- B , (3) A -> B | |
| 2156 Models 2 and 3 are <a href="#markov_equiv">Markov equivalent</a>, and therefore indistinguishable from | |
| 2157 observational data alone, so we expect their posteriors to be the same | |
| 2158 (assuming a prior which satisfies likelihood equivalence). | |
| 2159 If we use random parameters, the "true" model only gets a higher posterior after 2000 trials! | |
| 2160 However, if we make B a noisy NOT gate, the true model "wins" after 12 | |
| 2161 trials, as shown below (red = model 1, blue/green (superimposed) | |
| 2162 represents models 2/3). | |
| 2163 <p> | |
| 2164 <img src="Figures/model_select.png"> | |
| 2165 <p> | |
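The core of the sequential comparison can be sketched as follows. (This is a
hedged sketch, loosely following model_select1; the variable names are
illustrative, and 'cases' is assumed to hold nsamples cases sampled from the
true model A->B, one column per case.)
<pre>
ns = [2 2];
dag1 = zeros(2);                    % (1) A  B   (disconnected)
dag2 = zeros(2); dag2(2,1) = 1;     % (2) A <- B
dag3 = zeros(2); dag3(1,2) = 1;     % (3) A -> B
dags = {dag1, dag2, dag3};
LL = zeros(1,3);
for d=1:3
  bnets{d} = mk_bnet(dags{d}, ns);
  for i=1:2
    bnets{d}.CPD{i} = tabular_CPD(bnets{d}, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu');
  end
end
for m=1:nsamples
  for d=1:3
    LL(d) = LL(d) + log_marg_lik_complete(bnets{d}, cases(:,m));
    bnets{d} = bayes_update_params(bnets{d}, cases(:,m));
  end
  post(m,:) = normalise(exp(LL - max(LL)));  % posterior over the 3 models after m cases
end
</pre>
<p>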
| 2166 The use of marginal likelihood for model selection is discussed in | |
| 2167 greater detail in the | |
section on <a href="#structure_learning">structure learning</a>.
| 2169 | |
| 2170 | |
| 2171 | |
| 2172 | |
| 2173 <h2><a name="em">Maximum likelihood parameter estimation with missing values</h2> | |
| 2174 | |
| 2175 Now we consider learning when some values are not observed. | |
| 2176 Let us randomly hide half the values generated from the water | |
| 2177 sprinkler example. | |
| 2178 <pre> | |
| 2179 samples2 = samples; | |
| 2180 hide = rand(N, nsamples) > 0.5; | |
| 2181 [I,J]=find(hide); | |
| 2182 for k=1:length(I) | |
| 2183 samples2{I(k), J(k)} = []; | |
| 2184 end | |
| 2185 </pre> | |
| 2186 samples2{i,l} is the value of node i in training case l, or [] if unobserved. | |
| 2187 <p> | |
| 2188 Now we will compute the MLEs using the EM algorithm. | |
| 2189 We need to use an inference algorithm to compute the expected | |
| 2190 sufficient statistics in the E step; the M (maximization) step is as | |
| 2191 above. | |
| 2192 <pre> | |
| 2193 engine2 = jtree_inf_engine(bnet2); | |
| 2194 max_iter = 10; | |
| 2195 [bnet4, LLtrace] = learn_params_em(engine2, samples2, max_iter); | |
| 2196 </pre> | |
| 2197 LLtrace(i) is the log-likelihood at iteration i. We can plot this as | |
| 2198 follows: | |
| 2199 <pre> | |
| 2200 plot(LLtrace, 'x-') | |
| 2201 </pre> | |
| 2202 Let's display the results after 10 iterations of EM. | |
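(First we pull the learned CPTs out of bnet4, using the same object-privacy
trick as before.)
<pre>
CPT4 = cell(1,N);
for i=1:N
  s = struct(bnet4.CPD{i});   % violate object privacy, as above
  CPT4{i} = s.CPT;
end
</pre>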
| 2203 <pre> | |
| 2204 celldisp(CPT4) | |
| 2205 CPT4{1} = | |
| 2206 0.6616 | |
| 2207 0.3384 | |
| 2208 CPT4{2} = | |
| 2209 0.6510 0.3490 | |
| 2210 0.8751 0.1249 | |
| 2211 CPT4{3} = | |
| 2212 0.8366 0.1634 | |
| 2213 0.0197 0.9803 | |
| 2214 CPT4{4} = | |
| 2215 (:,:,1) = | |
| 2216 0.8276 0.0546 | |
| 2217 0.5452 0.1658 | |
| 2218 (:,:,2) = | |
| 2219 0.1724 0.9454 | |
| 2220 0.4548 0.8342 | |
| 2221 </pre> | |
| 2222 We can get improved performance by using one or more of the following | |
| 2223 methods: | |
| 2224 <ul> | |
| 2225 <li> Increasing the size of the training set. | |
| 2226 <li> Decreasing the amount of hidden data. | |
| 2227 <li> Running EM for longer. | |
| 2228 <li> Using informative priors. | |
| 2229 <li> Initialising EM from multiple starting points. | |
| 2230 </ul> | |
| 2231 | |
| 2232 Click <a href="#gaussian">here</a> for a discussion of learning | |
| 2233 Gaussians, which can cause numerical problems. | |
| 2234 <p> | |
| 2235 For a more complete example of learning with EM, | |
| 2236 see the script BNT/examples/static/learn1.m. | |
| 2237 | |
| 2238 <h2><a name="tying">Parameter tying</h2> | |
| 2239 | |
| 2240 In networks with repeated structure (e.g., chains and grids), it is | |
| 2241 common to assume that the parameters are the same at every node. This | |
| 2242 is called parameter tying, and reduces the amount of data needed for | |
| 2243 learning. | |
| 2244 <p> | |
| 2245 When we have tied parameters, there is no longer a one-to-one | |
| 2246 correspondence between nodes and CPDs. | |
Rather, each CPD specifies the parameters for a whole equivalence class
| 2248 of nodes. | |
| 2249 It is easiest to see this by example. | |
| 2250 Consider the following <a href="usage_dbn.html#hmm">hidden Markov | |
| 2251 model (HMM)</a> | |
| 2252 <p> | |
| 2253 <img src="Figures/hmm3.gif"> | |
| 2254 <p> | |
| 2255 <!-- | |
| 2256 We can create this graph structure, assuming we have T time-slices, | |
| 2257 as follows. | |
| 2258 (We number the nodes as shown in the figure, but we could equally well | |
| 2259 number the hidden nodes 1:T, and the observed nodes T+1:2T.) | |
| 2260 <pre> | |
| 2261 N = 2*T; | |
| 2262 dag = zeros(N); | |
| 2263 hnodes = 1:2:2*T; | |
| 2264 for i=1:T-1 | |
| 2265 dag(hnodes(i), hnodes(i+1))=1; | |
| 2266 end | |
| 2267 onodes = 2:2:2*T; | |
| 2268 for i=1:T | |
| 2269 dag(hnodes(i), onodes(i)) = 1; | |
| 2270 end | |
| 2271 </pre> | |
| 2272 <p> | |
| 2273 The hidden nodes are always discrete, and have Q possible values each, | |
| 2274 but the observed nodes can be discrete or continuous, and have O possible values/length. | |
| 2275 <pre> | |
| 2276 if cts_obs | |
| 2277 dnodes = hnodes; | |
| 2278 else | |
| 2279 dnodes = 1:N; | |
| 2280 end | |
| 2281 ns = ones(1,N); | |
| 2282 ns(hnodes) = Q; | |
| 2283 ns(onodes) = O; | |
| 2284 </pre> | |
| 2285 --> | |
| 2286 When HMMs are used for semi-infinite processes like speech recognition, | |
| 2287 we assume the transition matrix | |
| 2288 P(H(t+1)|H(t)) is the same for all t; this is called a time-invariant | |
or homogeneous Markov chain.
| 2290 Hence hidden nodes 2, 3, ..., T | |
| 2291 are all in the same equivalence class, say class Hclass. | |
| 2292 Similarly, the observation matrix P(O(t)|H(t)) is assumed to be the | |
| 2293 same for all t, so the observed nodes are all in the same equivalence | |
| 2294 class, say class Oclass. | |
| 2295 Finally, the prior term P(H(1)) is in a class all by itself, say class | |
| 2296 H1class. | |
| 2297 This is illustrated below, where we explicitly represent the | |
| 2298 parameters as random variables (dotted nodes). | |
| 2299 <p> | |
| 2300 <img src="Figures/hmm4_params.gif"> | |
| 2301 <p> | |
| 2302 In BNT, we cannot represent parameters as random variables (nodes). | |
| 2303 Instead, we "hide" the | |
| 2304 parameters inside one CPD for each equivalence class, | |
| 2305 and then specify that the other CPDs should share these parameters, as | |
| 2306 follows. | |
| 2307 <pre> | |
| 2308 hnodes = 1:2:2*T; | |
| 2309 onodes = 2:2:2*T; | |
| 2310 H1class = 1; Hclass = 2; Oclass = 3; | |
| 2311 eclass = ones(1,N); | |
| 2312 eclass(hnodes(2:end)) = Hclass; | |
| 2313 eclass(hnodes(1)) = H1class; | |
| 2314 eclass(onodes) = Oclass; | |
| 2315 % create dag and ns in the usual way | |
| 2316 bnet = mk_bnet(dag, ns, 'discrete', dnodes, 'equiv_class', eclass); | |
| 2317 </pre> | |
| 2318 Finally, we define the parameters for each equivalence class: | |
| 2319 <pre> | |
| 2320 bnet.CPD{H1class} = tabular_CPD(bnet, hnodes(1)); % prior | |
| 2321 bnet.CPD{Hclass} = tabular_CPD(bnet, hnodes(2)); % transition matrix | |
| 2322 if cts_obs | |
| 2323 bnet.CPD{Oclass} = gaussian_CPD(bnet, onodes(1)); | |
| 2324 else | |
| 2325 bnet.CPD{Oclass} = tabular_CPD(bnet, onodes(1)); | |
| 2326 end | |
| 2327 </pre> | |
| 2328 In general, if bnet.CPD{e} = xxx_CPD(bnet, j), then j should be a | |
| 2329 member of e's equivalence class; that is, it is not always the case | |
| 2330 that e == j. You can use bnet.rep_of_eclass(e) to return the | |
| 2331 representative of equivalence class e. | |
| 2332 BNT will look up the parents of j to determine the size | |
| 2333 of the CPT to use. It assumes that this is the same for all members of | |
| 2334 the equivalence class. | |
| 2335 Click <a href="param_tieing.html">here</a> for | |
| 2336 a more complex example of parameter tying. | |
| 2337 <p> | |
| 2338 Note: | |
| 2339 Normally one would define an HMM as a | |
| 2340 <a href = "usage_dbn.html">Dynamic Bayes Net</a> | |
| 2341 (see the function BNT/examples/dynamic/mk_chmm.m). | |
| 2342 However, one can define an HMM as a static BN using the function | |
| 2343 BNT/examples/static/Models/mk_hmm_bnet.m. | |
| 2344 | |
| 2345 | |
| 2346 | |
| 2347 <h1><a name="structure_learning">Structure learning</h1> | |
| 2348 | |
| 2349 Update (9/29/03): | |
Philippe LeRay is developing some additional structure learning code
| 2351 on top of BNT. Click <a | |
| 2352 href="http://bnt.insa-rouen.fr/ajouts.html">here</a> | |
| 2353 for details. | |
| 2354 | |
| 2355 <p> | |
| 2356 | |
| 2357 There are two very different approaches to structure learning: | |
| 2358 constraint-based and search-and-score. | |
| 2359 In the <a href="#constraint">constraint-based approach</a>, | |
| 2360 we start with a fully connected graph, and remove edges if certain | |
| 2361 conditional independencies are measured in the data. | |
| 2362 This has the disadvantage that repeated independence tests lose | |
| 2363 statistical power. | |
| 2364 <p> | |
| 2365 In the more popular search-and-score approach, | |
| 2366 we perform a search through the space of possible DAGs, and either | |
| 2367 return the best one found (a point estimate), or return a sample of the | |
| 2368 models found (an approximation to the Bayesian posterior). | |
| 2369 <p> | |
| 2370 Unfortunately, the number of DAGs as a function of the number of | |
| 2371 nodes, G(n), is super-exponential in n. | |
| 2372 A closed form formula for G(n) is not known, but the first few values | |
| 2373 are shown below (from Cooper, 1999). | |
| 2374 | |
| 2375 <table> | |
| 2376 <tr> <th>n</th> <th align=left>G(n)</th> </tr> | |
| 2377 <tr> <td>1</td> <td>1</td> </tr> | |
| 2378 <tr> <td>2</td> <td>3</td> </tr> | |
| 2379 <tr> <td>3</td> <td>25</td> </tr> | |
| 2380 <tr> <td>4</td> <td>543</td> </tr> | |
| 2381 <tr> <td>5</td> <td>29,281</td> </tr> | |
| 2382 <tr> <td>6</td> <td>3,781,503</td> </tr> | |
| 2383 <tr> <td>7</td> <td>1.1 x 10^9</td> </tr> | |
| 2384 <tr> <td>8</td> <td>7.8 x 10^11</td> </tr> | |
| 2385 <tr> <td>9</td> <td>1.2 x 10^15</td> </tr> | |
| 2386 <tr> <td>10</td> <td>4.2 x 10^18</td> </tr> | |
| 2387 </table> | |
| 2388 | |
| 2389 Since the number of DAGs is super-exponential in the number of nodes, | |
| 2390 we cannot exhaustively search the space, so we either use a local | |
search algorithm (e.g., greedy hill climbing, perhaps with multiple
| 2392 restarts) or a global search algorithm (e.g., Markov Chain Monte | |
| 2393 Carlo). | |
| 2394 <p> | |
| 2395 If we know a total ordering on the nodes, | |
| 2396 finding the best structure amounts to picking the best set of parents | |
| 2397 for each node independently. | |
| 2398 This is what the K2 algorithm does. | |
| 2399 If the ordering is unknown, we can search over orderings, | |
| 2400 which is more efficient than searching over DAGs (Koller and Friedman, 2000). | |
| 2401 <p> | |
| 2402 In addition to the search procedure, we must specify the scoring | |
| 2403 function. There are two popular choices. The Bayesian score integrates | |
| 2404 out the parameters, i.e., it is the marginal likelihood of the model. | |
| 2405 The BIC (Bayesian Information Criterion) is defined as | |
| 2406 log P(D|theta_hat) - 0.5*d*log(N), where D is the data, theta_hat is | |
| 2407 the ML estimate of the parameters, d is the number of parameters, and | |
| 2408 N is the number of data cases. | |
| 2409 The BIC method has the advantage of not requiring a prior. | |
| 2410 <p> | |
| 2411 BIC can be derived as a large sample | |
| 2412 approximation to the marginal likelihood. | |
| 2413 (It is also equal to the Minimum Description Length of a model.) | |
| 2414 However, in practice, the sample size does not need to be very large | |
| 2415 for the approximation to be good. | |
| 2416 For example, in the figure below, we plot the ratio between the log marginal likelihood | |
| 2417 and the BIC score against data-set size; we see that the ratio rapidly | |
| 2418 approaches 1, especially for non-informative priors. | |
| 2419 (This plot was generated by the file BNT/examples/static/bic1.m. It | |
| 2420 uses the water sprinkler BN with BDeu Dirichlet priors with different | |
| 2421 equivalent sample sizes.) | |
| 2422 | |
| 2423 <p> | |
| 2424 <center> | |
| 2425 <IMG SRC="Figures/bic.png"> | |
| 2426 </center> | |
| 2427 <p> | |
| 2428 | |
| 2429 <p> | |
| 2430 As with parameter learning, handling missing data/ hidden variables is | |
| 2431 much harder than the fully observed case. | |
| 2432 The structure learning routines in BNT can therefore be classified into 4 | |
| 2433 types, analogously to the parameter learning case. | |
| 2434 <p> | |
| 2435 | |
| 2436 <TABLE BORDER> | |
| 2437 <tr> | |
| 2438 <TH></TH> | |
| 2439 <th>Full obs</th> | |
| 2440 <th>Partial obs</th> | |
| 2441 </tr> | |
| 2442 <tr> | |
| 2443 <th>Point</th> | |
| 2444 <td><tt>learn_struct_K2</tt> <br> | |
| 2445 <!-- <tt>learn_struct_hill_climb</tt></td> --> | |
| 2446 <td><tt>not yet supported</tt></td> | |
| 2447 </tr> | |
| 2448 <tr> | |
| 2449 <th>Bayes</th> | |
| 2450 <td><tt>learn_struct_mcmc</tt></td> | |
| 2451 <td>not yet supported</td> | |
| 2452 </tr> | |
| 2453 </table> | |
| 2454 | |
| 2455 | |
| 2456 <h2><a name="markov_equiv">Markov equivalence</h2> | |
| 2457 | |
If two DAGs encode the same conditional independencies, they are
called Markov equivalent. The set of all DAGs can be partitioned into
Markov equivalence classes. Graphs within the same class can have
the direction of some of their arcs reversed without changing any of
the CI relationships.
For example, the chains X -> Y -> Z and X <- Y <- Z and the fork
X <- Y -> Z are all Markov equivalent (each encodes only "X is independent
of Z given Y"), but the v-structure X -> Y <- Z is in a class by itself.
| 2464 Each class can be represented by a PDAG | |
| 2465 (partially directed acyclic graph) called an essential graph or | |
| 2466 pattern. This specifies which edges must be oriented in a certain | |
| 2467 direction, and which may be reversed. | |
| 2468 | |
| 2469 <p> | |
| 2470 When learning graph structure from observational data, | |
| 2471 the best one can hope to do is to identify the model up to Markov | |
| 2472 equivalence. To distinguish amongst graphs within the same equivalence | |
| 2473 class, one needs interventional data: see the discussion on <a | |
| 2474 href="#active">active learning</a> below. | |
| 2475 | |
| 2476 | |
| 2477 | |
| 2478 <h2><a name="enumerate">Exhaustive search</h2> | |
| 2479 | |
| 2480 The brute-force approach to structure learning is to enumerate all | |
| 2481 possible DAGs, and score each one. This provides a "gold standard" | |
| 2482 with which to compare other algorithms. We can do this as follows. | |
| 2483 <pre> | |
| 2484 dags = mk_all_dags(N); | |
| 2485 score = score_dags(data, ns, dags); | |
| 2486 </pre> | |
| 2487 where data(i,m) is the value of node i in case m, | |
| 2488 and ns(i) is the size of node i. | |
| 2489 If the DAGs have a lot of families in common, we can cache the sufficient statistics, | |
| 2490 making this potentially more efficient than scoring the DAGs one at a time. | |
| 2491 (Caching is not currently implemented, however.) | |
| 2492 <p> | |
| 2493 By default, we use the Bayesian scoring metric, and assume CPDs are | |
| 2494 represented by tables with BDeu(1) priors. | |
| 2495 We can override these defaults as follows. | |
| 2496 If we want to use uniform priors, we can say | |
| 2497 <pre> | |
| 2498 params = cell(1,N); | |
| 2499 for i=1:N | |
| 2500 params{i} = {'prior', 'unif'}; | |
| 2501 end | |
| 2502 score = score_dags(data, ns, dags, 'params', params); | |
| 2503 </pre> | |
| 2504 params{i} is a cell-array, containing optional arguments that are | |
| 2505 passed to the constructor for CPD i. | |
| 2506 <p> | |
Now suppose we want to use different node types. For example, suppose
nodes 1 and 2 are Gaussian, and nodes 3 and 4 are softmax (both of these
CPD types can support discrete and continuous parents, which is
necessary since all other nodes will be considered as potential parents).
| 2511 The Bayesian scoring metric currently only works for tabular CPDs, so | |
| 2512 we will use BIC: | |
| 2513 <pre> | |
score = score_dags(data, ns, dags, 'discrete', [3 4], 'params', [], ...
       'type', {'gaussian', 'gaussian', 'softmax', 'softmax'}, 'scoring_fn', 'bic');
| 2516 </pre> | |
| 2517 In practice, one can't enumerate all possible DAGs for N > 5, | |
| 2518 but one can evaluate any reasonably-sized set of hypotheses in this | |
| 2519 way (e.g., nearest neighbors of your current best guess). | |
| 2520 Think of this as "computer assisted model refinement" as opposed to de | |
| 2521 novo learning. | |
| 2522 | |
| 2523 | |
| 2524 <h2><a name="K2">K2</h2> | |
| 2525 | |
| 2526 The K2 algorithm (Cooper and Herskovits, 1992) is a greedy search algorithm that works as follows. | |
| 2527 Initially each node has no parents. It then adds incrementally that parent whose addition most | |
| 2528 increases the score of the resulting structure. When the addition of no single | |
| 2529 parent can increase the score, it stops adding parents to the node. | |
| 2530 Since we are using a fixed ordering, we do not need to check for | |
| 2531 cycles, and can choose the parents for each node independently. | |
| 2532 <p> | |
| 2533 The original paper used the Bayesian scoring | |
| 2534 metric with tabular CPDs and Dirichlet priors. | |
| 2535 BNT generalizes this to allow any kind of CPD, and either the Bayesian | |
| 2536 scoring metric or BIC, as in the example <a href="#enumerate">above</a>. | |
| 2537 In addition, you can specify | |
| 2538 an optional upper bound on the number of parents for each node. | |
| 2539 The file BNT/examples/static/k2demo1.m gives an example of how to use K2. | |
| 2540 We use the water sprinkler network and sample 100 cases from it as before. | |
| 2541 Then we see how much data it takes to recover the generating structure: | |
| 2542 <pre> | |
| 2543 order = [C S R W]; | |
| 2544 max_fan_in = 2; | |
| 2545 sz = 5:5:100; | |
| 2546 for i=1:length(sz) | |
| 2547 dag2 = learn_struct_K2(data(:,1:sz(i)), node_sizes, order, 'max_fan_in', max_fan_in); | |
| 2548 correct(i) = isequal(dag, dag2); | |
| 2549 end | |
| 2550 </pre> | |
| 2551 Here are the results. | |
| 2552 <pre> | |
| 2553 correct = | |
| 2554 Columns 1 through 12 | |
| 2555 0 0 0 0 0 0 0 1 0 1 1 1 | |
| 2556 Columns 13 through 20 | |
| 2557 1 1 1 1 1 1 1 1 | |
| 2558 </pre> | |
| 2559 So we see it takes about sz(10)=50 cases. (BIC behaves similarly, | |
| 2560 showing that the prior doesn't matter too much.) | |
| 2561 In general, we cannot hope to recover the "true" generating structure, | |
| 2562 only one that is in its <a href="#markov_equiv">Markov equivalence | |
| 2563 class</a>. | |
| 2564 | |
| 2565 | |
| 2566 <h2><a name="hill_climb">Hill-climbing</h2> | |
| 2567 | |
| 2568 Hill-climbing starts at a specific point in space, | |
| 2569 considers all nearest neighbors, and moves to the neighbor | |
| 2570 that has the highest score; if no neighbors have higher | |
| 2571 score than the current point (i.e., we have reached a local maximum), | |
| 2572 the algorithm stops. One can then restart in another part of the space. | |
| 2573 <p> | |
| 2574 A common definition of "neighbor" is all graphs that can be | |
| 2575 generated from the current graph by adding, deleting or reversing a | |
| 2576 single arc, subject to the acyclicity constraint. | |
| 2577 Other neighborhoods are possible: see | |
| 2578 <a href="http://research.microsoft.com/~dmax/publications/jmlr02.pdf"> | |
| 2579 Optimal Structure Identification with Greedy Search</a>, Max | |
| 2580 Chickering, JMLR 2002. | |
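<p>
Here is a hedged sketch of the basic procedure, which scores all neighbors
with score_dags; it relies on a helper such as mk_nbrs_of_dag (used by the
MCMC code below) to enumerate all DAGs reachable by a single arc addition,
deletion or reversal.
<pre>
dag = zeros(N);                          % start from the empty graph
best_score = score_dags(data, ns, {dag});
improved = 1;
while improved
  improved = 0;
  nbrs = mk_nbrs_of_dag(dag);            % all single-arc changes that keep a DAG
  scores = score_dags(data, ns, nbrs);
  [best_nbr_score, j] = max(scores);
  if best_nbr_score > best_score
    dag = nbrs{j};
    best_score = best_nbr_score;
    improved = 1;
  end
end
</pre>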
| 2581 | |
| 2582 <!-- | |
| 2583 Note: This algorithm is currently (Feb '02) being implemented by Qian | |
| 2584 Diao. | |
| 2585 --> | |
| 2586 | |
| 2587 | |
| 2588 <h2><a name="mcmc">MCMC</h2> | |
| 2589 | |
| 2590 We can use a Markov Chain Monte Carlo (MCMC) algorithm called | |
| 2591 Metropolis-Hastings (MH) to search the space of all | |
| 2592 DAGs. | |
| 2593 The standard proposal distribution is to consider moving to all | |
| 2594 nearest neighbors in the sense defined <a href="#hill_climb">above</a>. | |
| 2595 <p> | |
| 2596 The function can be called | |
| 2597 as in the following example. | |
| 2598 <pre> | |
| 2599 [sampled_graphs, accept_ratio] = learn_struct_mcmc(data, ns, 'nsamples', 100, 'burnin', 10); | |
| 2600 </pre> | |
| 2601 We can convert our set of sampled graphs to a histogram | |
| 2602 (empirical posterior over all the DAGs) thus | |
| 2603 <pre> | |
| 2604 all_dags = mk_all_dags(N); | |
| 2605 mcmc_post = mcmc_sample_to_hist(sampled_graphs, all_dags); | |
| 2606 </pre> | |
| 2607 To see how well this performs, let us compute the exact posterior exhaustively. | |
| 2608 <p> | |
| 2609 <pre> | |
| 2610 score = score_dags(data, ns, all_dags); | |
| 2611 post = normalise(exp(score)); % assuming uniform structural prior | |
| 2612 </pre> | |
| 2613 We plot the results below. | |
| 2614 (The data set was 100 samples drawn from a random 4 node bnet; see the | |
| 2615 file BNT/examples/static/mcmc1.) | |
| 2616 <pre> | |
| 2617 subplot(2,1,1) | |
| 2618 bar(post) | |
| 2619 subplot(2,1,2) | |
| 2620 bar(mcmc_post) | |
| 2621 </pre> | |
| 2622 <img src="Figures/mcmc_post.jpg" width="800" height="500"> | |
| 2623 <p> | |
| 2624 We can also plot the acceptance ratio versus number of MCMC steps, | |
| 2625 as a crude convergence diagnostic. | |
| 2626 <pre> | |
| 2627 clf | |
| 2628 plot(accept_ratio) | |
| 2629 </pre> | |
| 2630 <img src="Figures/mcmc_accept.jpg" width="800" height="300"> | |
| 2631 <p> | |
| 2632 Even though the number of samples needed by MCMC is theoretically | |
| 2633 polynomial (not exponential) in the dimensionality of the search space, in practice it has been | |
| 2634 found that MCMC does not converge in reasonable time for graphs with | |
| 2635 more than about 10 nodes. | |
| 2636 | |
| 2637 | |
| 2638 | |
| 2639 | |
| 2640 <h2><a name="active">Active structure learning</h2> | |
| 2641 | |
| 2642 As was mentioned <a href="#markov_equiv">above</a>, | |
| 2643 one can only learn a DAG up to Markov equivalence, even given infinite data. | |
| 2644 If one is interested in learning the structure of a causal network, | |
| 2645 one needs interventional data. | |
| 2646 (By "intervention" we mean forcing a node to take on a specific value, | |
| 2647 thereby effectively severing its incoming arcs.) | |
| 2648 <p> | |
| 2649 Most of the scoring functions accept an optional argument | |
| 2650 that specifies whether a node was observed to have a certain value, or | |
| 2651 was forced to have that value: we set clamped(i,m)=1 if node i was | |
| 2652 forced in training case m. e.g., see the file | |
| 2653 BNT/examples/static/cooper_yoo. | |
| 2654 <p> | |
| 2655 An interesting question is to decide which interventions to perform | |
| 2656 (c.f., design of experiments). For details, see the following tech | |
| 2657 report | |
| 2658 <ul> | |
| 2659 <li> <a href = "../../Papers/alearn.ps.gz"> | |
| 2660 Active learning of causal Bayes net structure</a>, Kevin Murphy, March | |
| 2661 2001. | |
| 2662 </ul> | |
| 2663 | |
| 2664 | |
| 2665 <h2><a name="struct_em">Structural EM</h2> | |
| 2666 | |
| 2667 Computing the Bayesian score when there is partial observability is | |
| 2668 computationally challenging, because the parameter posterior becomes | |
| 2669 multimodal (the hidden nodes induce a mixture distribution). | |
| 2670 One therefore needs to use approximations such as BIC. | |
| 2671 Unfortunately, search algorithms are still expensive, because we need | |
| 2672 to run EM at each step to compute the MLE, which is needed to compute | |
| 2673 the score of each model. An alternative approach is | |
| 2674 to do the local search steps inside of the M step of EM, which is more | |
| 2675 efficient since the data has been "filled in" - this is | |
| 2676 called the structural EM algorithm (Friedman 1997), and provably | |
| 2677 converges to a local maximum of the BIC score. | |
| 2678 <p> | |
| 2679 Wei Hu has implemented SEM for discrete nodes. | |
| 2680 You can download his package from | |
| 2681 <a href="../SEM.zip">here</a>. | |
| 2682 Please address all questions about this code to | |
| 2683 wei.hu@intel.com. | |
See also <a href="#phl">Philippe Leray's implementation of SEM</a>.
| 2685 | |
| 2686 <!-- | |
| 2687 <h2><a name="reveal">REVEAL algorithm</h2> | |
| 2688 | |
| 2689 A simple way to learn the structure of a fully observed, discrete, | |
| 2690 factored DBN from a time series is described <a | |
| 2691 href="usage_dbn.html#struct_learn">here</a>. | |
| 2692 --> | |
| 2693 | |
| 2694 | |
| 2695 <h2><a name="graphdraw">Visualizing the graph</h2> | |
| 2696 | |
| 2697 You can visualize an arbitrary graph (such as one learned using the | |
| 2698 structure learning routines) with Matlab code contributed by | |
| 2699 <a href="http://www.mbfys.kun.nl/~cemgil/matlab/layout.html">Ali | |
| 2700 Taylan Cemgil</a> | |
| 2701 from the University of Nijmegen. | |
| 2702 For static BNs, call it as follows: | |
| 2703 <pre> | |
| 2704 draw_graph(bnet.dag); | |
| 2705 </pre> | |
| 2706 For example, this is the output produced on a | |
| 2707 <a href="#qmr">random QMR-like model</a>: | |
| 2708 <p> | |
| 2709 <img src="Figures/qmr.rnd.jpg"> | |
| 2710 <p> | |
If you install the excellent <a
href="http://www.research.att.com/sw/tools/graphviz">graphviz</a>, an
open-source graph visualization package from AT&amp;T,
you can create a much better visualization as follows:
| 2715 <pre> | |
| 2716 graph_to_dot(bnet.dag) | |
| 2717 </pre> | |
| 2718 This works by converting the adjacency matrix to a file suitable | |
| 2719 for input to graphviz (using the dot format), | |
| 2720 then converting the output of graphviz to postscript, and displaying the results using | |
| 2721 ghostview. | |
You can do each of these steps separately for more control, as shown
below (the first line is Matlab; the last two are shell commands).
| 2724 <pre> | |
| 2725 graph_to_dot(bnet.dag, 'filename', 'foo.dot'); | |
| 2726 dot -Tps foo.dot -o foo.ps | |
| 2727 ghostview foo.ps & | |
| 2728 </pre> | |
| 2729 | |
| 2730 <h2><a name = "constraint">Constraint-based methods</h2> | |
| 2731 | |
The IC algorithm (Pearl and Verma, 1991)
and the faster, but otherwise equivalent, PC algorithm (Spirtes, Glymour, and Scheines 1993)
perform many conditional independence tests
and combine the resulting constraints into a
PDAG that represents the whole
| 2737 <a href="#markov_equiv">Markov equivalence class</a>. | |
| 2738 <p> | |
| 2739 IC*/FCI extend IC/PC to handle latent variables: see <a href="#ic_star">below</a>. | |
| 2740 (IC stands for inductive causation; PC stands for Peter and Clark, | |
| 2741 the first names of Spirtes and Glymour; FCI stands for fast causal | |
| 2742 inference. | |
| 2743 What we, following Pearl (2000), call IC* was called | |
| 2744 IC in the original Pearl and Verma paper.) | |
| 2745 For details, see | |
| 2746 <ul> | |
| 2747 <li> | |
| 2748 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Causation, | |
| 2749 Prediction, and Search</a>, Spirtes, Glymour and | |
| 2750 Scheines (SGS), 2001 (2nd edition), MIT Press. | |
| 2751 <li> | |
| 2752 <a href="http://bayes.cs.ucla.edu/BOOK-2K/index.html">Causality: Models, Reasoning and Inference</a>, J. Pearl, | |
| 2753 2000, Cambridge University Press. | |
| 2754 </ul> | |
| 2755 | |
| 2756 <p> | |
| 2757 | |
| 2758 The PC algorithm takes as arguments a function f, the number of nodes N, | |
| 2759 the maximum fan in K, and additional arguments A which are passed to f. | |
| 2760 The function f(X,Y,S,A) returns 1 if X is conditionally independent of Y given S, and 0 | |
| 2761 otherwise. | |
| 2762 For example, suppose we cheat by | |
| 2763 passing in a CI "oracle" which has access to the true DAG; the oracle | |
| 2764 tests for d-separation in this DAG, i.e., | |
f(X,Y,S) calls dsep(X,Y,S,dag). We can do this as follows.
| 2766 <pre> | |
| 2767 pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag); | |
| 2768 </pre> | |
| 2769 pdag(i,j) = -1 if there is definitely an i->j arc, | |
and pdag(i,j) = 1 if there is either an i->j or an i<-j arc.
| 2771 <p> | |
| 2772 Applied to the sprinkler network, this returns | |
| 2773 <pre> | |
| 2774 pdag = | |
| 2775 0 1 1 0 | |
| 2776 1 0 0 -1 | |
| 2777 1 0 0 -1 | |
| 2778 0 0 0 0 | |
| 2779 </pre> | |
| 2780 So as expected, we see that the V-structure at the W node is uniquely identified, | |
| 2781 but the other arcs have ambiguous orientation. | |
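<p>
For reference, here is a self-contained version of this example,
assuming the standard 4-node sprinkler network (nodes ordered C, S, R, W)
used earlier in this documentation. max_fan_in bounds the size of the
conditioning sets; 2 suffices for this small network.
<pre>
N = 4;
C = 1; S = 2; R = 3; W = 4;
dag = zeros(N,N);
dag(C,[S R]) = 1;
dag(S,W) = 1;
dag(R,W) = 1;
max_fan_in = 2;
pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag);
</pre>
This should reproduce the pdag shown above.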
| 2782 <p> | |
We now give an example from p141 (1st edn) / p103 (2nd edn) of the SGS
| 2784 book. | |
| 2785 This example concerns the female orgasm. | |
| 2786 We are given a correlation matrix C between 7 measured factors (such | |
| 2787 as subjective experiences of coital and masturbatory experiences), | |
| 2788 derived from 281 samples, and want to learn a causal model of the | |
| 2789 data. We will not discuss the merits of this type of work here, but | |
| 2790 merely show how to reproduce the results in the SGS book. | |
| 2791 Their program, | |
| 2792 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Tetrad</a>, | |
| 2793 makes use of the Fisher Z-test for conditional | |
| 2794 independence, so we do the same: | |
| 2795 <pre> | |
n = 7;            % number of measured variables
max_fan_in = 4;
nsamples = 281;
alpha = 0.05;
pdag = learn_struct_pdag_pc('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha);
| 2800 </pre> | |
| 2801 In this case, the CI test is | |
| 2802 <pre> | |
| 2803 f(X,Y,S) = cond_indep_fisher_z(X,Y,S, C,nsamples,alpha) | |
| 2804 </pre> | |
| 2805 The results match those of Fig 12a of SGS apart from two edge | |
| 2806 differences; presumably this is due to rounding error (although it | |
| 2807 could be a bug, either in BNT or in Tetrad). | |
| 2808 This example can be found in the file BNT/examples/static/pc2.m. | |
| 2809 | |
| 2810 <p> | |
| 2811 | |
| 2812 The IC* algorithm (Pearl and Verma, 1991), | |
| 2813 and the faster FCI algorithm (Spirtes, Glymour, and Scheines 1993), | |
| 2814 are like the IC/PC algorithm, except that they can detect the presence | |
| 2815 of latent variables. | |
| 2816 See the file <tt>learn_struct_pdag_ic_star</tt> written by Tamar | |
| 2817 Kushnir. The output is a matrix P, defined as follows | |
| 2818 (see Pearl (2000), p52 for details): | |
| 2819 <pre> | |
% P(i,j) = -1 if there is either a latent variable L such that i <-L-> j OR there is a directed edge i->j.
% P(i,j) = -2 if there is a marked directed i-*>j edge.
% P(i,j) = P(j,i) = 1 if there is an undirected edge i--j.
% P(i,j) = P(j,i) = 2 if there is a latent variable L such that i <-L-> j.
| 2824 </pre> | |
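<p>
Presumably it is called in the same way as <tt>learn_struct_pdag_pc</tt>,
i.e., by passing in a conditional independence test; for example (an
unverified sketch, reusing the Fisher Z test and variables from above):
<pre>
P = learn_struct_pdag_ic_star('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha);
</pre>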
| 2825 | |
| 2826 | |
| 2827 <h2><a name="phl">Philippe Leray's structure learning package</h2> | |
| 2828 | |
| 2829 Philippe Leray has written a | |
| 2830 <a href="http://bnt.insa-rouen.fr/ajouts.html"> | |
| 2831 structure learning package</a> that uses BNT. | |
| 2832 | |
It currently (June 2003) has the following features:
| 2834 <ul> | |
| 2835 <li>PC with Chi2 statistical test | |
| 2836 <li> MWST : Maximum weighted Spanning Tree | |
| 2837 <li> Hill Climbing | |
| 2838 <li> Greedy Search | |
| 2839 <li> Structural EM | |
| 2840 <li> hist_ic : optimal Histogram based on IC information criterion | |
| 2841 <li> cpdag_to_dag | |
| 2842 <li> dag_to_cpdag | |
| 2843 <li> ... | |
| 2844 </ul> | |
| 2845 | |
| 2846 | |
| 2848 | |
| 2849 | |
| 2850 <!-- | |
| 2851 <h2><a name="read_learning">Further reading on learning</h2> | |
| 2852 | |
| 2853 I recommend the following tutorials for more details on learning. | |
| 2854 <ul> | |
| 2855 <li> <a | |
| 2856 href="http://www.cs.berkeley.edu/~murphyk/Papers/intel.ps.gz">My short | |
| 2857 tutorial</a> on graphical models, which contains an overview of learning. | |
| 2858 | |
| 2859 <li> | |
| 2860 <A HREF="ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.PS"> | |
| 2861 A tutorial on learning with Bayesian networks</a>, D. Heckerman, | |
| 2862 Microsoft Research Tech Report, 1995. | |
| 2863 | |
| 2864 <li> <A HREF="http://www-cad.eecs.berkeley.edu/~wray/Mirror/lwgmja"> | |
| 2865 Operations for Learning with Graphical Models</a>, | |
| 2866 W. L. Buntine, JAIR'94, 159--225. | |
| 2867 </ul> | |
| 2868 <p> | |
| 2869 --> | |
| 2870 | |
| 2871 | |
| 2872 | |
| 2873 | |
| 2874 | |
| 2875 <h1><a name="engines">Inference engines</h1> | |
| 2876 | |
| 2877 Up until now, we have used the junction tree algorithm for inference. | |
| 2878 However, sometimes this is too slow, or not even applicable. | |
| 2879 In general, there are many inference algorithms each of which make | |
| 2880 different tradeoffs between speed, accuracy, complexity and | |
| 2881 generality. Furthermore, there might be many implementations of the | |
| 2882 same algorithm; for instance, a general purpose, readable version, | |
| 2883 and a highly-optimized, specialized one. | |
| 2884 To cope with this variety, we treat each inference algorithm as an | |
| 2885 object, which we call an inference engine. | |
| 2886 | |
| 2887 <p> | |
| 2888 An inference engine is an object that contains a bnet and supports the | |
| 2889 'enter_evidence' and 'marginal_nodes' methods. The engine constructor | |
| 2890 takes the bnet as argument and may do some model-specific processing. | |
| 2891 When 'enter_evidence' is called, the engine may do some | |
| 2892 evidence-specific processing. Finally, when 'marginal_nodes' is | |
| 2893 called, the engine may do some query-specific processing. | |
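<p>
For example, with the junction tree engine used earlier in this
documentation, the three stages look like this (the node indices and the
observed value below are purely illustrative):
<pre>
engine = jtree_inf_engine(bnet);            % model-specific processing
evidence = cell(1, N);                      % N = number of nodes in bnet
evidence{4} = 2;                            % say node 4 is observed to have value 2
engine = enter_evidence(engine, evidence);  % evidence-specific processing
marg = marginal_nodes(engine, 2);           % query-specific processing
marg.T                                      % the resulting marginal distribution
</pre>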
| 2894 | |
| 2895 <p> | |
| 2896 The amount of work done when each stage is specified -- structure, | |
| 2897 parameters, evidence, and query -- depends on the engine. The cost of | |
| 2898 work done early in this sequence can be amortized. On the other hand, | |
| 2899 one can make better optimizations if one waits until later in the | |
| 2900 sequence. | |
| 2901 For example, the parameters might imply | |
conditional independencies that are not evident in the graph structure,
| 2903 but can nevertheless be exploited; the evidence indicates which nodes | |
| 2904 are observed and hence can effectively be disconnected from the | |
| 2905 graph; and the query might indicate that large parts of the network | |
| 2906 are d-separated from the query nodes. (Since it is not the actual | |
| 2907 <em>values</em> of the evidence that matters, just which nodes are observed, | |
| 2908 many engines allow you to specify which nodes will be observed when they are constructed, | |
| 2909 i.e., before calling 'enter_evidence'. Some engines can still cope if | |
| 2910 the actual pattern of evidence is different, e.g., if there is missing | |
| 2911 data.) | |
| 2912 <p> | |
| 2913 | |
| 2914 Although being maximally lazy (i.e., only doing work when a query is | |
| 2915 issued) may seem desirable, | |
it is not always the most efficient strategy.
| 2917 For example, | |
| 2918 when learning using EM, we need to call marginal_nodes N times, where N is the | |
number of nodes. <a href="#varelim">Variable elimination</a> would end
| 2920 up repeating a lot of work | |
| 2921 each time marginal_nodes is called, making it inefficient for | |
| 2922 learning. The junction tree algorithm, by contrast, uses dynamic | |
| 2923 programming to avoid this redundant computation --- it calculates all | |
| 2924 marginals in two passes during 'enter_evidence', so calling | |
| 2925 'marginal_nodes' takes constant time. | |
| 2926 <p> | |
| 2927 We will discuss some of the inference algorithms implemented in BNT | |
| 2928 below, and finish with a <a href="#engine_summary">summary</a> of all | |
| 2929 of them. | |
| 2930 | |
| 2931 | |
| 2932 | |
| 2933 | |
| 2934 | |
| 2935 | |
| 2936 | |
| 2937 <h2><a name="varelim">Variable elimination</h2> | |
| 2938 | |
| 2939 The variable elimination algorithm, also known as bucket elimination | |
| 2940 or peeling, is one of the simplest inference algorithms. | |
| 2941 The basic idea is to "push sums inside of products"; this is explained | |
| 2942 in more detail | |
| 2943 <a | |
| 2944 href="http://HTTP.CS.Berkeley.EDU/~murphyk/Bayes/bayes.html#infer">here</a>. | |
| 2945 <p> | |
| 2946 The principle of distributing sums over products can be generalized | |
| 2947 greatly to apply to any commutative semiring. | |
| 2948 This forms the basis of many common algorithms, such as Viterbi | |
| 2949 decoding and the Fast Fourier Transform. For details, see | |
| 2950 | |
| 2951 <ul> | |
| 2952 <li> R. McEliece and S. M. Aji, 2000. | |
| 2953 <!--<a href="http://www.systems.caltech.edu/EE/Faculty/rjm/papers/GDL.ps">--> | |
| 2954 <a href="GDL.pdf"> | |
| 2955 The Generalized Distributive Law</a>, | |
| 2956 IEEE Trans. Inform. Theory, vol. 46, no. 2 (March 2000), | |
| 2957 pp. 325--343. | |
| 2958 | |
| 2959 | |
| 2960 <li> | |
| 2961 F. R. Kschischang, B. J. Frey and H.-A. Loeliger, 2001. | |
| 2962 <a href="http://www.cs.toronto.edu/~frey/papers/fgspa.abs.html"> | |
| 2963 Factor graphs and the sum-product algorithm</a> | |
| 2964 IEEE Transactions on Information Theory, February, 2001. | |
| 2965 | |
| 2966 </ul> | |
| 2967 | |
| 2968 <p> | |
| 2969 Choosing an order in which to sum out the variables so as to minimize | |
| 2970 computational cost is known to be NP-hard. | |
| 2971 The implementation of this algorithm in | |
| 2972 <tt>var_elim_inf_engine</tt> makes no attempt to optimize this | |
| 2973 ordering (in contrast, say, to <tt>jtree_inf_engine</tt>, which uses a | |
| 2974 greedy search procedure to find a good ordering). | |
| 2975 <p> | |
| 2976 Note: unlike most algorithms, var_elim does all its computational work | |
| 2977 inside of <tt>marginal_nodes</tt>, not inside of | |
| 2978 <tt>enter_evidence</tt>. | |
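<p>
Usage follows the standard engine interface; a minimal sketch:
<pre>
engine = var_elim_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);   % for var_elim, this is where the work is done
</pre>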
| 2979 | |
| 2980 | |
| 2981 | |
| 2982 | |
| 2983 <h2><a name="global">Global inference methods</h2> | |
| 2984 | |
The simplest inference algorithm of all is to explicitly construct
| 2986 the joint distribution over all the nodes, and then to marginalize it. | |
| 2987 This is implemented in <tt>global_joint_inf_engine</tt>. | |
| 2988 Since the size of the joint is exponential in the | |
| 2989 number of discrete (hidden) nodes, this is not a very practical algorithm. | |
| 2990 It is included merely for pedagogical and debugging purposes. | |
| 2991 <p> | |
| 2992 Three specialized versions of this algorithm have also been implemented, | |
| 2993 corresponding to the cases where all the nodes are discrete (D), all | |
| 2994 are Gaussian (G), and some are discrete and some Gaussian (CG). | |
| 2995 They are called <tt>enumerative_inf_engine</tt>, | |
| 2996 <tt>gaussian_inf_engine</tt>, | |
| 2997 and <tt>cond_gauss_inf_engine</tt> respectively. | |
| 2998 <p> | |
| 2999 Note: unlike most algorithms, these global inference algorithms do all their computational work | |
| 3000 inside of <tt>marginal_nodes</tt>, not inside of | |
| 3001 <tt>enter_evidence</tt>. | |
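<p>
These engines also follow the standard interface; for example (a minimal
sketch for an all-discrete network):
<pre>
engine = enumerative_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, [1 3]);   % joint marginal on nodes 1 and 3
                                     % (single-node queries work the same way)
</pre>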
| 3002 | |
| 3003 | |
| 3004 <h2><a name="quickscore">Quickscore</h2> | |
| 3005 | |
| 3006 The junction tree algorithm is quite slow on the <a href="#qmr">QMR</a> network, | |
| 3007 since the cliques are so big. | |
| 3008 One simple trick we can use is to notice that hidden leaves do not | |
| 3009 affect the posteriors on the roots, and hence do not need to be | |
| 3010 included in the network. | |
| 3011 A second trick is to notice that the negative findings can be | |
| 3012 "absorbed" into the prior: | |
| 3013 see the file | |
| 3014 BNT/examples/static/mk_minimal_qmr_bnet for details. | |
| 3015 <p> | |
| 3016 | |
| 3017 A much more significant speedup is obtained by exploiting special | |
| 3018 properties of the noisy-or node, as done by the quickscore | |
| 3019 algorithm. For details, see | |
| 3020 <ul> | |
| 3021 <li> Heckerman, "A tractable inference algorithm for diagnosing multiple diseases", UAI 89. | |
| 3022 <li> Rish and Dechter, "On the impact of causal independence", UCI | |
| 3023 tech report, 1998. | |
| 3024 </ul> | |
| 3025 | |
| 3026 This has been implemented in BNT as a special-purpose inference | |
| 3027 engine, which can be created and used as follows: | |
| 3028 <pre> | |
| 3029 engine = quickscore_inf_engine(inhibit, leak, prior); | |
| 3030 engine = enter_evidence(engine, pos, neg); | |
| 3031 m = marginal_nodes(engine, i); | |
| 3032 </pre> | |
| 3033 | |
| 3034 | |
| 3035 <h2><a name="belprop">Belief propagation</h2> | |
| 3036 | |
| 3037 Even using quickscore, exact inference takes time that is exponential | |
| 3038 in the number of positive findings. | |
| 3039 Hence for large networks we need to resort to approximate inference techniques. | |
| 3040 See for example | |
| 3041 <ul> | |
| 3042 <li> T. Jaakkola and M. Jordan, "Variational probabilistic inference and the | |
| 3043 QMR-DT network", JAIR 10, 1999. | |
| 3044 | |
| 3045 <li> K. Murphy, Y. Weiss and M. Jordan, "Loopy belief propagation for approximate inference: an empirical study", | |
| 3046 UAI 99. | |
| 3047 </ul> | |
| 3048 The latter approximation | |
| 3049 entails applying Pearl's belief propagation algorithm to a model even | |
| 3050 if it has loops (hence the name loopy belief propagation). | |
| 3051 Pearl's algorithm, implemented as <tt>pearl_inf_engine</tt>, gives | |
| 3052 exact results when applied to singly-connected graphs | |
| 3053 (a.k.a. polytrees, since | |
| 3054 the underlying undirected topology is a tree, but a node may have | |
| 3055 multiple parents). | |
| 3056 To apply this algorithm to a graph with loops, | |
| 3057 use <tt>pearl_inf_engine</tt>. | |
| 3058 This can use a centralized or distributed message passing protocol. | |
| 3059 You can use it as in the following example. | |
| 3060 <pre> | |
| 3061 engine = pearl_inf_engine(bnet, 'max_iter', 30); | |
| 3062 engine = enter_evidence(engine, evidence); | |
| 3063 m = marginal_nodes(engine, i); | |
| 3064 </pre> | |
We found that this algorithm often converges, and when it does, it is
often very accurate, but this depends on the precise settings of the
parameter values of the network.
| 3068 (See the file BNT/examples/static/qmr1 to repeat the experiment for yourself.) | |
Understanding when and why belief propagation converges/works
| 3070 is a topic of ongoing research. | |
| 3071 <p> | |
| 3072 <tt>pearl_inf_engine</tt> can exploit special structure in noisy-or | |
| 3073 and gmux nodes to compute messages efficiently. | |
| 3074 <p> | |
| 3075 <tt>belprop_inf_engine</tt> is like pearl, but uses potentials to | |
| 3076 represent messages. Hence this is slower. | |
| 3077 <p> | |
| 3078 <tt>belprop_fg_inf_engine</tt> is like belprop, | |
| 3079 but is designed for factor graphs. | |
| 3080 | |
| 3081 | |
| 3082 | |
| 3083 <h2><a name="sampling">Sampling</h2> | |
| 3084 | |
| 3085 BNT now (Mar '02) has two sampling (Monte Carlo) inference algorithms: | |
| 3086 <ul> | |
| 3087 <li> <tt>likelihood_weighting_inf_engine</tt> which does importance | |
| 3088 sampling and can handle any node type. | |
| 3089 <li> <tt>gibbs_sampling_inf_engine</tt>, written by Bhaskara Marthi. | |
| 3090 Currently this can only handle tabular CPDs. | |
| 3091 For a much faster and more powerful Gibbs sampling program, see | |
| 3092 <a href="http://www.mrc-bsu.cam.ac.uk/bugs">BUGS</a>. | |
| 3093 </ul> | |
| 3094 Note: To generate samples from a network (which is not the same as inference!), | |
| 3095 use <tt>sample_bnet</tt>. | |
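<p>
Both sampling engines follow the standard interface; for example (a
minimal sketch):
<pre>
engine = likelihood_weighting_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);
</pre>
To draw samples from the model (which, again, is not inference), a sketch:
<pre>
nsamples = 50;
samples = cell(N, nsamples);       % samples(:,s) holds one case, one cell per node
for s=1:nsamples
  samples(:,s) = sample_bnet(bnet);
end
</pre>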
| 3096 | |
| 3097 | |
| 3098 | |
| 3099 <h2><a name="engine_summary">Summary of inference engines</h2> | |
| 3100 | |
| 3101 | |
| 3102 The inference engines differ in many ways. Here are | |
| 3103 some of the major "axes": | |
| 3104 <ul> | |
| 3105 <li> Works for all topologies or makes restrictions? | |
| 3106 <li> Works for all node types or makes restrictions? | |
| 3107 <li> Exact or approximate inference? | |
| 3108 </ul> | |
| 3109 | |
| 3110 <p> | |
| 3111 In terms of topology, most engines handle any kind of DAG. | |
| 3112 <tt>belprop_fg</tt> does approximate inference on factor graphs (FG), which | |
| 3113 can be used to represent directed, undirected, and mixed (chain) | |
| 3114 graphs. | |
| 3115 (In the future, we plan to support exact inference on chain graphs.) | |
| 3116 <tt>quickscore</tt> only works on QMR-like models. | |
| 3117 <p> | |
| 3118 In terms of node types: algorithms that use potentials can handle | |
| 3119 discrete (D), Gaussian (G) or conditional Gaussian (CG) models. | |
| 3120 Sampling algorithms can essentially handle any kind of node (distribution). | |
| 3121 Other algorithms make more restrictive assumptions in exchange for | |
| 3122 speed. | |
| 3123 <p> | |
| 3124 Finally, most algorithms are designed to give the exact answer. | |
| 3125 The belief propagation algorithms are exact if applied to trees, and | |
| 3126 in some other cases. | |
| 3127 Sampling is considered approximate, even though, in the limit of an | |
| 3128 infinite number of samples, it gives the exact answer. | |
| 3129 | |
| 3130 <p> | |
| 3131 | |
| 3132 Here is a summary of the properties | |
| 3133 of all the engines in BNT which work on static networks. | |
| 3134 <p> | |
<table border=1>
<tr>
| 3137 <td align=left width=0>Name | |
| 3138 <td align=left width=0>Exact? | |
| 3139 <td align=left width=0>Node type? | |
| 3140 <td align=left width=0>topology | |
<tr>
| 3143 <td align=left> belprop | |
| 3144 <td align=left> approx | |
| 3145 <td align=left> D | |
| 3146 <td align=left> DAG | |
| 3147 <tr> | |
| 3148 <td align=left> belprop_fg | |
| 3149 <td align=left> approx | |
| 3150 <td align=left> D | |
| 3151 <td align=left> factor graph | |
| 3152 <tr> | |
| 3153 <td align=left> cond_gauss | |
| 3154 <td align=left> exact | |
| 3155 <td align=left> CG | |
| 3156 <td align=left> DAG | |
| 3157 <tr> | |
| 3158 <td align=left> enumerative | |
| 3159 <td align=left> exact | |
| 3160 <td align=left> D | |
| 3161 <td align=left> DAG | |
| 3162 <tr> | |
| 3163 <td align=left> gaussian | |
| 3164 <td align=left> exact | |
| 3165 <td align=left> G | |
| 3166 <td align=left> DAG | |
| 3167 <tr> | |
| 3168 <td align=left> gibbs | |
| 3169 <td align=left> approx | |
| 3170 <td align=left> D | |
| 3171 <td align=left> DAG | |
| 3172 <tr> | |
| 3173 <td align=left> global_joint | |
| 3174 <td align=left> exact | |
| 3175 <td align=left> D,G,CG | |
| 3176 <td align=left> DAG | |
| 3177 <tr> | |
| 3178 <td align=left> jtree | |
| 3179 <td align=left> exact | |
| 3180 <td align=left> D,G,CG | |
| 3181 <td align=left> DAG | |
<tr>
| 3183 <td align=left> likelihood_weighting | |
| 3184 <td align=left> approx | |
| 3185 <td align=left> any | |
| 3186 <td align=left> DAG | |
| 3187 <tr> | |
| 3188 <td align=left> pearl | |
| 3189 <td align=left> approx | |
| 3190 <td align=left> D,G | |
| 3191 <td align=left> DAG | |
| 3192 <tr> | |
| 3193 <td align=left> pearl | |
| 3194 <td align=left> exact | |
| 3195 <td align=left> D,G | |
| 3196 <td align=left> polytree | |
| 3197 <tr> | |
| 3198 <td align=left> quickscore | |
| 3199 <td align=left> exact | |
| 3200 <td align=left> noisy-or | |
| 3201 <td align=left> QMR | |
| 3202 <tr> | |
| 3203 <td align=left> stab_cond_gauss | |
| 3204 <td align=left> exact | |
| 3205 <td align=left> CG | |
| 3206 <td align=left> DAG | |
| 3207 <tr> | |
| 3208 <td align=left> var_elim | |
| 3209 <td align=left> exact | |
| 3210 <td align=left> D,G,CG | |
| 3211 <td align=left> DAG | |
| 3212 </table> | |
| 3213 | |
| 3214 | |
| 3215 | |
| 3216 <h1><a name="influence">Influence diagrams/ decision making</h1> | |
| 3217 | |
| 3218 BNT implements an exact algorithm for solving LIMIDs (limited memory | |
| 3219 influence diagrams), described in | |
| 3220 <ul> | |
| 3221 <li> S. L. Lauritzen and D. Nilsson. | |
| 3222 <a href="http://www.math.auc.dk/~steffen/papers/limids.pdf"> | |
| 3223 Representing and solving decision problems with limited | |
| 3224 information</a> | |
| 3225 Management Science, 47, 1238 - 1251. September 2001. | |
| 3226 </ul> | |
LIMIDs explicitly show all information arcs, rather than implicitly
| 3228 assuming no forgetting. This allows them to model forgetful | |
| 3229 controllers. | |
| 3230 <p> | |
| 3231 See the examples in <tt>BNT/examples/limids</tt> for details. | |
| 3232 | |
| 3233 | |
| 3234 | |
| 3235 | |
| 3236 <h1>DBNs, HMMs, Kalman filters and all that</h1> | |
| 3237 | |
| 3238 Click <a href="usage_dbn.html">here</a> for documentation about how to | |
| 3239 use BNT for dynamical systems and sequence data. | |
| 3240 | |
| 3241 | |
| 3242 </BODY> |
