annotate toolboxes/FullBNT-1.0.7/docs/usage_sf.html @ 0:cc4b1211e677 tip

initial commit to HG from Changeset: 646 (e263d8a21543) added further path and more save "camirversion.m"

author	Daniel Wolff
date	Fri, 19 Aug 2016 13:07:06 +0200
Daniel@0	1 <HEAD>
Daniel@0	2 <TITLE>How to use the Bayes Net Toolbox</TITLE>
Daniel@0	3 </HEAD>
Daniel@0	4
Daniel@0	5 <BODY BGCOLOR="#FFFFFF">
Daniel@0	6 <!-- white background is better for the pictures and equations -->
Daniel@0	7
Daniel@0	8 <h1>How to use the Bayes Net Toolbox</h1>
Daniel@0	9
Daniel@0	10 This documentation was last updated on 7 June 2004.
Daniel@0	11 <br>
Daniel@0	12 Click <a href="changelog.html">here</a> for a list of changes made to
Daniel@0	13 BNT.
Daniel@0	14 <br>
Daniel@0	15 Click
Daniel@0	16 <a href="http://bnt.insa-rouen.fr/">here</a>
Daniel@0	17 for a French version of this documentation (which might not
Daniel@0	18 be up-to-date).
Daniel@0	19 <br>
Daniel@0	20 Update 23 May 2005:
Daniel@0	21 Philippe LeRay has written
Daniel@0	22 a
Daniel@0	23 <a href="http://banquiseasi.insa-rouen.fr/projects/bnt-editor/">
Daniel@0	24 BNT GUI</a>
Daniel@0	25 and
Daniel@0	26 <a href="http://banquiseasi.insa-rouen.fr/projects/bnt-slp/">
Daniel@0	27 BNT Structure Learning Package</a>.
Daniel@0	28
Daniel@0	29 <p>
Daniel@0	30
Daniel@0	31 <ul>
Daniel@0	32 <li> <a href="#install">Installation</a>
Daniel@0	33 <ul>
Daniel@0	34 <li> <a href="#install">Installing the Matlab code</a>
Daniel@0	35 <li> <a href="#installC">Installing the C code</a>
Daniel@0	36 <li> <a href="../matlab_tips.html">Useful Matlab tips</a>.
Daniel@0	37 </ul>
Daniel@0	38
Daniel@0	39 <li> <a href="#basics">Creating your first Bayes net</a>
Daniel@0	40 <ul>
Daniel@0	41 <li> <a href="#basics">Creating a model by hand</a>
Daniel@0	42 <li> <a href="#file">Loading a model from a file</a>
Daniel@0	43 <li> <a href="http://bnt.insa-rouen.fr/ajouts.html">Creating a model using a GUI</a>
Daniel@0	44 </ul>
Daniel@0	45
Daniel@0	46 <li> <a href="#inference">Inference</a>
Daniel@0	47 <ul>
Daniel@0	48 <li> <a href="#marginal">Computing marginal distributions</a>
Daniel@0	49 <li> <a href="#joint">Computing joint distributions</a>
Daniel@0	50 <li> <a href="#soft">Soft/virtual evidence</a>
Daniel@0	51 <li> <a href="#mpe">Most probable explanation</a>
Daniel@0	52 </ul>
Daniel@0	53
Daniel@0	54 <li> <a href="#cpd">Conditional Probability Distributions</a>
Daniel@0	55 <ul>
Daniel@0	56 <li> <a href="#tabular">Tabular (multinomial) nodes</a>
Daniel@0	57 <li> <a href="#noisyor">Noisy-or nodes</a>
Daniel@0	58 <li> <a href="#deterministic">Other (noisy) deterministic nodes</a>
Daniel@0	59 <li> <a href="#softmax">Softmax (multinomial logit) nodes</a>
Daniel@0	60 <li> <a href="#mlp">Neural network nodes</a>
Daniel@0	61 <li> <a href="#root">Root nodes</a>
Daniel@0	62 <li> <a href="#gaussian">Gaussian nodes</a>
Daniel@0	63 <li> <a href="#glm">Generalized linear model nodes</a>
Daniel@0	64 <li> <a href="#dtree">Classification/regression tree nodes</a>
Daniel@0	65 <li> <a href="#nongauss">Other continuous distributions</a>
Daniel@0	66 <li> <a href="#cpd_summary">Summary of CPD types</a>
Daniel@0	67 </ul>
Daniel@0	68
Daniel@0	69 <li> <a href="#examples">Example models</a>
Daniel@0	70 <ul>
Daniel@0	71 <li> <a
Daniel@0	72 href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">
Daniel@0	73 Gaussian mixture models</a>
Daniel@0	74 <li> <a href="#pca">PCA, ICA, and all that</a>
Daniel@0	75 <li> <a href="#mixep">Mixtures of experts</a>
Daniel@0	76 <li> <a href="#hme">Hierarchical mixtures of experts</a>
Daniel@0	77 <li> <a href="#qmr">QMR</a>
Daniel@0	78 <li> <a href="#cg_model">Conditional Gaussian models</a>
Daniel@0	79 <li> <a href="#hybrid">Other hybrid models</a>
Daniel@0	80 </ul>
Daniel@0	81
Daniel@0	82 <li> <a href="#param_learning">Parameter learning</a>
Daniel@0	83 <ul>
Daniel@0	84 <li> <a href="#load_data">Loading data from a file</a>
Daniel@0	85 <li> <a href="#mle_complete">Maximum likelihood parameter estimation from complete data</a>
Daniel@0	86 <li> <a href="#prior">Parameter priors</a>
Daniel@0	87 <li> <a href="#bayes_learn">(Sequential) Bayesian parameter updating from complete data</a>
Daniel@0	88 <li> <a href="#em">Maximum likelihood parameter estimation with missing values (EM)</a>
Daniel@0	89 <li> <a href="#tying">Parameter tying</a>
Daniel@0	90 </ul>
Daniel@0	91
Daniel@0	92 <li> <a href="#structure_learning">Structure learning</a>
Daniel@0	93 <ul>
Daniel@0	94 <li> <a href="#enumerate">Exhaustive search</a>
Daniel@0	95 <li> <a href="#K2">K2</a>
Daniel@0	96 <li> <a href="#hill_climb">Hill-climbing</a>
Daniel@0	97 <li> <a href="#mcmc">MCMC</a>
Daniel@0	98 <li> <a href="#active">Active learning</a>
Daniel@0	99 <li> <a href="#struct_em">Structural EM</a>
Daniel@0	100 <li> <a href="#graphdraw">Visualizing the learned graph structure</a>
Daniel@0	101 <li> <a href="#constraint">Constraint-based methods</a>
Daniel@0	102 </ul>
Daniel@0	103
Daniel@0	104
Daniel@0	105 <li> <a href="#engines">Inference engines</a>
Daniel@0	106 <ul>
Daniel@0	107 <li> <a href="#jtree">Junction tree</a>
Daniel@0	108 <li> <a href="#varelim">Variable elimination</a>
Daniel@0	109 <li> <a href="#global">Global inference methods</a>
Daniel@0	110 <li> <a href="#quickscore">Quickscore</a>
Daniel@0	111 <li> <a href="#belprop">Belief propagation</a>
Daniel@0	112 <li> <a href="#sampling">Sampling (Monte Carlo)</a>
Daniel@0	113 <li> <a href="#engine_summary">Summary of inference engines</a>
Daniel@0	114 </ul>
Daniel@0	115
Daniel@0	116
Daniel@0	117 <li> <a href="#influence">Influence diagrams/ decision making</a>
Daniel@0	118
Daniel@0	119
Daniel@0	120 <li> <a href="usage_dbn.html">DBNs, HMMs, Kalman filters and all that</a>
Daniel@0	121 </ul>
Daniel@0	122
Daniel@0	123
Daniel@0	124
Daniel@0	125
Daniel@0	126
Daniel@0	127
Daniel@0	128 <h1><a name="install">Installation</a></h1>
Daniel@0	129
Daniel@0	130 <h2><a name="installM">Installing the Matlab code</a></h2>
Daniel@0	131
Daniel@0	132 <ul>
Daniel@0	133 <li> <a href="bnt_download.html">Download</a> the FullBNT.zip file.
Daniel@0	134
Daniel@0	135 <p>
Daniel@0	136 <li> Unpack the file. In Unix, type
Daniel@0	137 <!--"tar xvf BNT.tar".-->
Daniel@0	138 "unzip FullBNT.zip".
Daniel@0	139 In Windows, use
Daniel@0	140 a program like <a href="http://www.winzip.com">Winzip</a>. This will
Daniel@0	141 create a directory called FullBNT, which contains BNT and other libraries.
Daniel@0	142 (Files ending in ~ or # are emacs backup files, and can be ignored.)
Daniel@0	143
Daniel@0	144 <p>
Daniel@0	145 <li> Read the file <tt>BNT/README.txt</tt> to make sure the date
Daniel@0	146 matches the one on the top of <a href=bnt.html>the BNT home page</a>.
Daniel@0	147 If not, you may need to press 'refresh' on your browser, and download
Daniel@0	148 again, to get the most recent version.
Daniel@0	149
Daniel@0	150 <p>
Daniel@0	151 <li> <b>Edit the file "FullBNT/BNT/add_BNT_to_path.m"</b> so it contains the correct
Daniel@0	152 pathname.
Daniel@0	153 For example, in Windows,
Daniel@0	154 I download FullBNT.zip into C:\kmurphy\matlab, and
Daniel@0	155 then ensure the second line reads
Daniel@0	156 <pre>
Daniel@0	157 BNT_HOME = 'C:\kmurphy\matlab\FullBNT';
Daniel@0	158 </pre>
Daniel@0	159
Daniel@0	160 <p>
Daniel@0	161 <li> Start up Matlab.
Daniel@0	162
Daniel@0	163 <p>
Daniel@0	164 <li> Type "ver" at the Matlab prompt (">>").
Daniel@0	165 <b>You need Matlab version 5.2 or newer to run BNT</b>.
Daniel@0	166 (Versions 5.0 and 5.1 have a memory leak which seems to sometimes
Daniel@0	167 crash BNT.)
Daniel@0	168 <b>BNT will not run on Octave</b>.
Daniel@0	169
Daniel@0	170 <p>
Daniel@0	171 <li> Move to the BNT directory.
Daniel@0	172 For example, in Windows, I type
Daniel@0	173 <pre>
Daniel@0	174 >> cd C:\kmurphy\matlab\FullBNT\BNT
Daniel@0	175 </pre>
Daniel@0	176
Daniel@0	177 <p>
Daniel@0	178 <li> Type "add_BNT_to_path".
Daniel@0	179 This executes the command
Daniel@0	180 <tt>addpath(genpath(BNT_HOME))</tt>,
Daniel@0	181 which adds all directories below FullBNT to the matlab path.
Daniel@0	182
Daniel@0	183 <p>
Daniel@0	184 <li> Type "test_BNT".
Daniel@0	185
Daniel@0	186 If all goes well, this will produce a bunch of numbers and maybe some
Daniel@0	187 warning messages (which you can ignore), but no error messages.
Daniel@0	188 (The warnings should only be of the form
Daniel@0	189 "Warning: Maximum number of iterations has been exceeded", and are
Daniel@0	190 produced by Netlab.)
Daniel@0	191
Daniel@0	192 <p>
Daniel@0	193 <li> Problems? Did you remember to
Daniel@0	194 <b>edit the file "FullBNT/BNT/add_BNT_to_path.m"</b> so it contains
Daniel@0	195 the right path?
Daniel@0	196
Daniel@0	197 <p>
Daniel@0	198 <li> <a href="http://groups.yahoo.com/group/BayesNetToolbox/join">
Daniel@0	199 Join the BNT email list</a>
Daniel@0	200
Daniel@0	201 <p>
Daniel@0	202 <li>Read
Daniel@0	203 <a href="../matlab_tips.html">some useful Matlab tips</a>.
Daniel@0	204 <!--
Daniel@0	205 For instance, this explains how to create a startup file, which can be
Daniel@0	206 used to set your path variable automatically, so you can avoid having
Daniel@0	207 to type the above commands every time.
Daniel@0	208 -->
Daniel@0	209
Daniel@0	210 </ul>
Daniel@0	211
Daniel@0	212
Daniel@0	213
Daniel@0	214 <h2><a name="installC">Installing the C code</a></h2>
Daniel@0	215
Daniel@0	216 Some BNT functions also have C implementations.
Daniel@0	217 <b>It is not necessary to install the C code</b>, but it can result in a speedup
Daniel@0	218 of a factor of 2-5.
Daniel@0	219 To install all the C code,
Daniel@0	220 edit installC_BNT.m so it contains the right path,
Daniel@0	221 then type <tt>installC_BNT</tt>.
Daniel@0	222 (Ignore warnings of the form 'invalid white space character in directive'.)
Daniel@0	223 To uninstall all the C code,
Daniel@0	224 edit uninstallC_BNT.m so it contains the right path,
Daniel@0	225 then type <tt>uninstallC_BNT</tt>.
Daniel@0	226 For an up-to-date list of the files which have C implementations, see
Daniel@0	227 BNT/installC_BNT.m.
Daniel@0	228
Daniel@0	229 <p>
Daniel@0	230 mex is a script that lets you call C code from Matlab - it does not compile matlab to
Daniel@0	231 C (see mcc below).
Daniel@0	232 If your C/C++ compiler is set up correctly, mex should work out of
Daniel@0	233 the box.
Daniel@0	234 If not, you might need to type
Daniel@0	235 <p>
Daniel@0	236 <tt> mex -setup</tt>
Daniel@0	237 <p>
Daniel@0	238 before calling installC_BNT.
Daniel@0	239 <p>
Daniel@0	240 To make mex call gcc on Windows,
Daniel@0	241 you must install <a
Daniel@0	242 href="http://www.mrc-cbu.cam.ac.uk/Imaging/gnumex20.html">gnumex</a>.
Daniel@0	243 You can use the <a href="http://www.mingw.org/">minimalist GNU for
Daniel@0	244 Windows</a> version of gcc, or
Daniel@0	245 the <a href="http://sources.redhat.com/cygwin/">cygwin</a> version.
Daniel@0	246 <p>
Daniel@0	247 In general, typing
Daniel@0	248 'mex foo.c' from inside Matlab creates a file called
Daniel@0	249 'foo.mexglx' or 'foo.dll' (the exact file
Daniel@0	250 extension is system dependent: on Linux it is '.mexglx', on Windows it is '.dll').
Daniel@0	251 The resulting file will hide the original 'foo.m' (if it existed), i.e.,
Daniel@0	252 typing 'foo' at the prompt will call the compiled C version.
Daniel@0	253 To reveal the original Matlab version, just delete the compiled file (this is
Daniel@0	254 what uninstallC_BNT does).
Daniel@0	255 <p>
Daniel@0	256 Sometimes it takes time for Matlab to realize that the file has
Daniel@0	257 changed from matlab to C or vice versa; try typing 'clear all' or
Daniel@0	258 restarting Matlab to refresh it.
Daniel@0	259 To find out which version of a file you are running, type
Daniel@0	260 'which foo'.
Daniel@0	261 <p>
Daniel@0	262 <a href="http://www.mathworks.com/products/compiler">mcc</a>, the
Daniel@0	263 Matlab to C compiler, is a separate product,
Daniel@0	264 and is quite different from mex. It does not yet support
Daniel@0	265 objects/classes, which is why we can't compile all of BNT to C automatically.
Daniel@0	266 Also, hand-written C code is usually much
Daniel@0	267 better than the C code generated by mcc.
Daniel@0	268
Daniel@0	269
Daniel@0	270 <p>
Daniel@0	271 Acknowledgements:
Daniel@0	272 Most of the C code (e.g., for jtree and dpot) was written by Wei Hu;
Daniel@0	273 the triangulation C code was written by Ilya Shpitser;
Daniel@0	274 the Gibbs sampling C code (for discrete nodes) was written by Bhaskara
Daniel@0	275 Marthi.
Daniel@0	276
Daniel@0	277
Daniel@0	278
Daniel@0	279 <h1><a name="basics">Creating your first Bayes net</a></h1>
Daniel@0	280
Daniel@0	281 To define a Bayes net, you must specify the graph structure and then
Daniel@0	282 the parameters. We look at each in turn, using a simple example
Daniel@0	283 (adapted from Russell and
Daniel@0	284 Norvig, "Artificial Intelligence: a Modern Approach", Prentice Hall,
Daniel@0	285 1995, p454).
Daniel@0	286
Daniel@0	287
Daniel@0	288 <h2>Graph structure</h2>
Daniel@0	289
Daniel@0	290
Daniel@0	291 Consider the following network.
Daniel@0	292
Daniel@0	293 <p>
Daniel@0	294 <center>
Daniel@0	295 <IMG SRC="Figures/sprinkler.gif">
Daniel@0	296 </center>
Daniel@0	297 <p>
Daniel@0	298
Daniel@0	299 <P>
Daniel@0	300 To specify this directed acyclic graph (dag), we create an adjacency matrix:
Daniel@0	301 <PRE>
Daniel@0	302 N = 4;
Daniel@0	303 dag = zeros(N,N);
Daniel@0	304 C = 1; S = 2; R = 3; W = 4;
Daniel@0	305 dag(C,[R S]) = 1;
Daniel@0	306 dag(R,W) = 1;
Daniel@0	307 dag(S,W)=1;
Daniel@0	308 </PRE>
Daniel@0	309 <P>
Daniel@0	310 We have numbered the nodes as follows:
Daniel@0	311 Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4.
Daniel@0	312 <b>The nodes must always be numbered in topological order, i.e.,
Daniel@0	313 ancestors before descendants.</b>
Daniel@0	314 For a more complicated graph, this is a little inconvenient: we will
Daniel@0	315 see how to get around this <a href="usage_dbn.html#bat">below</a>.
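<p>
As a quick sanity check on the numbering rule above: if the nodes are numbered in topological order, every edge goes from a lower to a higher index, so the adjacency matrix is strictly upper triangular. The following sketch uses only built-in Matlab functions (it is not a BNT routine) and rebuilds the sprinkler dag so that it runs on its own. The variable names <tt>is_topological</tt> and <tt>bad</tt> are illustrative.

```matlab
% Rebuild the sprinkler network's adjacency matrix.
N = 4;
dag = zeros(N,N);
C = 1; S = 2; R = 3; W = 4;
dag(C,[R S]) = 1;
dag(R,W) = 1;
dag(S,W) = 1;

% Ancestors-before-descendants numbering means dag(i,j) can be
% nonzero only when i < j, i.e., dag equals its strictly upper triangle.
is_topological = isequal(dag, triu(dag, 1));

% Counter-example: an edge from W back to C breaks the ordering
% (and, here, also creates a cycle).
bad = dag;
bad(W,C) = 1;
bad_is_topological = isequal(bad, triu(bad, 1));
```

A matrix that passes this check is necessarily acyclic, so the test is a cheap guard before calling mk_bnet.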
Daniel@0	316 <p>
Daniel@0	317 In Matlab 6, you can use logical arrays instead of double arrays;
Daniel@0	318 they are smaller (a logical entry takes 1 byte, a double takes 8):
Daniel@0	319 <pre>
Daniel@0	320 dag = false(N,N);
Daniel@0	321 dag(C,[R S]) = true;
Daniel@0	322 ...
Daniel@0	323 </pre>
Daniel@0	324 However, <b>some graph functions (e.g., acyclic) do not work on
Daniel@0	325 logical arrays</b>!
Daniel@0	326 <p>
Daniel@0	327 A preliminary attempt to make a <b>GUI</b>
Daniel@0	328 has been written by Philippe LeRay and can be downloaded
Daniel@0	329 from <a href="http://bnt.insa-rouen.fr/ajouts.html">here</a>.
Daniel@0	330 <p>
Daniel@0	331 You can visualize the resulting graph structure using
Daniel@0	332 the methods discussed <a href="#graphdraw">below</a>.
Daniel@0	333
Daniel@0	334 <h2>Creating the Bayes net shell</h2>
Daniel@0	335
Daniel@0	336 In addition to specifying the graph structure,
Daniel@0	337 we must specify the size and type of each node.
Daniel@0	338 If a node is discrete, its size is the
Daniel@0	339 number of possible values
Daniel@0	340 it can take on; if a node is continuous,
Daniel@0	341 it can be a vector, and its size is the length of this vector.
Daniel@0	342 In this case, we will assume all nodes are discrete and binary.
Daniel@0	343 <PRE>
Daniel@0	344 discrete_nodes = 1:N;
Daniel@0	345 node_sizes = 2*ones(1,N);
Daniel@0	346 </pre>
Daniel@0	347 If the nodes were not binary, you could type e.g.,
Daniel@0	348 <pre>
Daniel@0	349 node_sizes = [4 2 3 5];
Daniel@0	350 </pre>
Daniel@0	351 meaning that Cloudy has 4 possible values,
Daniel@0	352 Sprinkler has 2 possible values, etc.
Daniel@0	353 Note that these values are categorical, not ordinal, i.e.,
Daniel@0	354 they are not ordered in any way (unlike 'low', 'medium', 'high').
Daniel@0	355 <p>
Daniel@0	356 We are now ready to make the Bayes net:
Daniel@0	357 <pre>
Daniel@0	358 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes);
Daniel@0	359 </PRE>
Daniel@0	360 By default, all nodes are assumed to be discrete, so we can also just
Daniel@0	361 write
Daniel@0	362 <pre>
Daniel@0	363 bnet = mk_bnet(dag, node_sizes);
Daniel@0	364 </PRE>
Daniel@0	365 You may also specify which nodes will be observed.
Daniel@0	366 If you don't know, or if this is not fixed in advance,
Daniel@0	367 just use the empty list (the default).
Daniel@0	368 <pre>
Daniel@0	369 onodes = [];
Daniel@0	370 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);
Daniel@0	371 </PRE>
Daniel@0	372 Note that optional arguments are specified using a name/value syntax.
Daniel@0	373 This is common for many BNT functions.
Daniel@0	374 In general, to find out more about a function (e.g., which optional
Daniel@0	375 arguments it takes), please see its
Daniel@0	376 documentation string by typing
Daniel@0	377 <pre>
Daniel@0	378 help mk_bnet
Daniel@0	379 </pre>
Daniel@0	380 See also other <a href="matlab_tips.html">useful Matlab tips</a>.
Daniel@0	381 <p>
Daniel@0	382 It is possible to associate names with nodes, as follows:
Daniel@0	383 <pre>
Daniel@0	384 bnet = mk_bnet(dag, node_sizes, 'names', {'cloudy','S','R','W'}, 'discrete', 1:4);
Daniel@0	385 </pre>
Daniel@0	386 You can then refer to a node by its name:
Daniel@0	387 <pre>
Daniel@0	388 C = bnet.names('cloudy'); % bnet.names is an associative array
Daniel@0	389 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
Daniel@0	390 </pre>
Daniel@0	391 This feature uses my own associative array class.
Daniel@0	392
Daniel@0	393
Daniel@0	394 <h2><a name="cpt">Parameters</a></h2>
Daniel@0	395
Daniel@0	396 A model consists of the graph structure and the parameters.
Daniel@0	397 The parameters are represented by CPD objects (CPD = Conditional
Daniel@0	398 Probability Distribution), which define the probability distribution
Daniel@0	399 of a node given its parents.
Daniel@0	400 (We will use the terms "node" and "random variable" interchangeably.)
Daniel@0	401 The simplest kind of CPD is a table (multi-dimensional array), which
Daniel@0	402 is suitable when all the nodes are discrete-valued. Note that the discrete
Daniel@0	403 values are not assumed to be ordered in any way; that is, they
Daniel@0	404 represent categorical quantities, like male and female, rather than
Daniel@0	405 ordinal quantities, like low, medium and high.
Daniel@0	406 (We will discuss CPDs in more detail <a href="#cpd">below</a>.)
Daniel@0	407 <p>
Daniel@0	408 Tabular CPDs, also called CPTs (conditional probability tables),
Daniel@0	409 are stored as multidimensional arrays, where the dimensions
Daniel@0	410 are arranged in the same order as the nodes, e.g., the CPT for node 4
Daniel@0	411 (WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself.
Daniel@0	412 Hence the child is always the last dimension.
Daniel@0	413 If a node has no parents, its CPT is a column vector representing its
Daniel@0	414 prior.
Daniel@0	415 Note that in Matlab (unlike C), arrays are indexed
Daniel@0	416 from 1, and are laid out in memory such that the first index toggles
Daniel@0	417 fastest, e.g., the CPT for node 4 (WetGrass) is as follows
Daniel@0	418 <P>
Daniel@0	419 <P><IMG ALIGN=BOTTOM SRC="Figures/CPTgrass.gif"><P>
Daniel@0	420 <P>
Daniel@0	421 where we have used the convention that false==1, true==2.
Daniel@0	422 We can create this CPT in Matlab as follows
Daniel@0	423 <PRE>
Daniel@0	424 CPT = zeros(2,2,2);
Daniel@0	425 CPT(1,1,1) = 1.0;
Daniel@0	426 CPT(2,1,1) = 0.1;
Daniel@0	427 ...
Daniel@0	428 </PRE>
Daniel@0	429 Here is an easier way:
Daniel@0	430 <PRE>
Daniel@0	431 CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);
Daniel@0	432 </PRE>
Daniel@0	433 In fact, we don't need to reshape the array, since the CPD constructor
Daniel@0	434 will do that for us. So we can just write
Daniel@0	435 <pre>
Daniel@0	436 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
Daniel@0	437 </pre>
Daniel@0	438 The other nodes are created similarly (using the old syntax for
Daniel@0	439 optional parameters)
Daniel@0	440 <PRE>
Daniel@0	441 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
Daniel@0	442 bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
Daniel@0	443 bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
Daniel@0	444 bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
Daniel@0	445 </PRE>
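<p>
To see how the flat vector maps onto the table: Matlab fills arrays in column-major order, so the first index (Sprinkler) varies fastest, then Rain, then WetGrass. The check below is plain Matlab and needs no BNT functions; the variable names are illustrative.

```matlab
% Dimensions are ordered (Sprinkler, Rain, WetGrass); 1 = false, 2 = true.
CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);

p_dry_given_neither   = CPT(1,1,1);   % P(W=1 | S=1, R=1)
p_wet_given_sprinkler = CPT(2,1,2);   % P(W=2 | S=2, R=1)
p_wet_given_both      = CPT(2,2,2);   % P(W=2 | S=2, R=2)

% For every parent configuration, the probabilities of the child's
% values (the last dimension) must sum to one.
row_sums = sum(CPT, 3);
is_normalized = all(abs(row_sums(:) - 1) < 1e-12);
```

Checking the sum over the last dimension is a simple way to catch a vector that was typed in the wrong order.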
Daniel@0	446
Daniel@0	447
Daniel@0	448 <h2><a name="rnd_cpt">Random Parameters</a></h2>
Daniel@0	449
Daniel@0	450 If we do not specify the CPT, random parameters will be
Daniel@0	451 created, i.e., each "row" of the CPT will be drawn from the uniform distribution.
Daniel@0	452 To ensure repeatable results, use
Daniel@0	453 <pre>
Daniel@0	454 rand('state', seed);
Daniel@0	455 randn('state', seed);
Daniel@0	456 </pre>
Daniel@0	457 To control the degree of randomness (entropy),
Daniel@0	458 you can sample each row of the CPT from a Dirichlet(p,p,...) distribution.
Daniel@0	459 If p << 1, this encourages "deterministic" CPTs (one entry near 1, the rest near 0).
Daniel@0	460 If p = 1, each row is drawn uniformly from the probability simplex (for a binary node, each entry is then U[0,1]).
Daniel@0	461 If p >> 1, the entries will all be near 1/k, where k is the arity of
Daniel@0	462 this node, i.e., each row will be nearly uniform.
Daniel@0	463 You can do this as follows, assuming this node
Daniel@0	464 is number i, and ns is the node_sizes.
Daniel@0	465 <pre>
Daniel@0	466 k = ns(i);
Daniel@0	467 ps = parents(dag, i);
Daniel@0	468 psz = prod(ns(ps));
Daniel@0	469 CPT = sample_dirichlet(p*ones(1,k), psz);
Daniel@0	470 bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT);
Daniel@0	471 </pre>
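<p>
To get a feel for the effect of p without BNT, the sketch below draws Dirichlet(p,...,p) rows using only <tt>rand</tt>. It relies on two standard facts: a Gamma(p,1) variate with positive integer p is the sum of p Exp(1) variates, and normalizing independent Gamma variates yields a Dirichlet sample. This is not the implementation of sample_dirichlet, and it only handles integer p; the sizes k and psz, and the row-per-parent-configuration layout, are hypothetical choices for illustration.

```matlab
rand('state', 0);            % legacy seeding, as used in this documentation
k = 3;                       % arity of the child node (hypothetical)
psz = 4;                     % number of parent configurations (hypothetical)

% p = 1: each row is uniform over the probability simplex.
p = 1;
G = reshape(sum(-log(rand(p, psz*k)), 1), [psz k]);   % Gamma(p,1) draws
sums = sum(G, 2);
CPT_flat = G ./ sums(:, ones(1,k));                   % normalize each row

% p >> 1: every entry of every row is close to 1/k.
p = 200;
G = reshape(sum(-log(rand(p, psz*k)), 1), [psz k]);
sums = sum(G, 2);
CPT_peaked = G ./ sums(:, ones(1,k));
```

Each row of the result is one distribution over the child's k values.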
Daniel@0	472
Daniel@0	473
Daniel@0	474 <h2><a name="file">Loading a network from a file</a></h2>
Daniel@0	475
Daniel@0	476 If you already have a Bayes net represented in the XML-based
Daniel@0	477 <a href="http://www.cs.cmu.edu/afs/cs/user/fgcozman/www/Research/InterchangeFormat/">
Daniel@0	478 Bayes Net Interchange Format (BNIF)</a> (e.g., downloaded from the
Daniel@0	479 <a
Daniel@0	480 href="http://www.cs.huji.ac.il/labs/compbio/Repository">
Daniel@0	481 Bayes Net repository</a>),
Daniel@0	482 you can convert it to BNT format using
Daniel@0	483 the
Daniel@0	484 <a href="http://www.digitas.harvard.edu/~ken/bif2bnt/">BIF-BNT Java
Daniel@0	485 program</a> written by Ken Shan.
Daniel@0	486 (This is not necessarily up-to-date.)
Daniel@0	487 <p>
Daniel@0	488 <b>It is currently not possible to save/load a BNT matlab object to
Daniel@0	489 file</b>, but this is easily fixed if you modify all the constructors
Daniel@0	490 for all the classes (see matlab documentation).
Daniel@0	491
Daniel@0	492 <h2>Creating a model using a GUI</h2>
Daniel@0	493
Daniel@0	494 Click <a href="http://bnt.insa-rouen.fr/ajouts.html">here</a>.
Daniel@0	495
Daniel@0	496
Daniel@0	497
Daniel@0	498 <h1><a name="inference">Inference</a></h1>
Daniel@0	499
Daniel@0	500 Having created the BN, we can now use it for inference.
Daniel@0	501 There are many different algorithms for doing inference in Bayes nets,
Daniel@0	502 that make different tradeoffs between speed,
Daniel@0	503 complexity, generality, and accuracy.
Daniel@0	504 BNT therefore offers a variety of different inference
Daniel@0	505 "engines". We will discuss these
Daniel@0	506 in more detail <a href="#engines">below</a>.
Daniel@0	507 For now, we will use the junction tree
Daniel@0	508 engine, which is the mother of all exact inference algorithms.
Daniel@0	509 This can be created as follows.
Daniel@0	510 <pre>
Daniel@0	511 engine = jtree_inf_engine(bnet);
Daniel@0	512 </pre>
Daniel@0	513 The other engines have similar constructors, but might take
Daniel@0	514 additional, algorithm-specific parameters.
Daniel@0	515 All engines are used in the same way, once they have been created.
Daniel@0	516 We illustrate this in the following sections.
Daniel@0	517
Daniel@0	518
Daniel@0	519 <h2><a name="marginal">Computing marginal distributions</a></h2>
Daniel@0	520
Daniel@0	521 Suppose we want to compute the probability that the sprinkler was on
Daniel@0	522 given that the grass is wet.
Daniel@0	523 The evidence consists of the fact that W=2. All the other nodes
Daniel@0	524 are hidden (unobserved). We can specify this as follows.
Daniel@0	525 <pre>
Daniel@0	526 evidence = cell(1,N);
Daniel@0	527 evidence{W} = 2;
Daniel@0	528 </pre>
Daniel@0	529 We use a 1D cell array instead of a vector to
Daniel@0	530 cope with the fact that nodes can be vectors of different lengths.
Daniel@0	531 In addition, the value [] can be used
Daniel@0	532 to denote 'no evidence', instead of having to specify the observation
Daniel@0	533 pattern as a separate argument.
Daniel@0	534 (Click <a href="cellarray.html">here</a> for a quick tutorial on cell
Daniel@0	535 arrays in matlab.)
Daniel@0	536 <p>
Daniel@0	537 We are now ready to add the evidence to the engine.
Daniel@0	538 <pre>
Daniel@0	539 [engine, loglik] = enter_evidence(engine, evidence);
Daniel@0	540 </pre>
Daniel@0	541 The behavior of this function is algorithm-specific, and is discussed
Daniel@0	542 in more detail <a href="#engines">below</a>.
Daniel@0	543 In the case of the jtree engine,
Daniel@0	544 enter_evidence implements a two-pass message-passing scheme.
Daniel@0	545 The first return argument contains the modified engine, which
Daniel@0	546 incorporates the evidence. The second return argument contains the
Daniel@0	547 log-likelihood of the evidence. (Not all engines are capable of
Daniel@0	548 computing the log-likelihood.)
Daniel@0	549 <p>
Daniel@0	550 Finally, we can compute p=P(S=2|W=2) as follows.
Daniel@0	551 <PRE>
Daniel@0	552 marg = marginal_nodes(engine, S);
Daniel@0	553 marg.T
Daniel@0	554 ans =
Daniel@0	555 0.57024
Daniel@0	556 0.42976
Daniel@0	557 p = marg.T(2);
Daniel@0	558 </PRE>
Daniel@0	559 We see that p = 0.4298.
Daniel@0	560 <p>
Daniel@0	561 Now let us add the evidence that it was raining, and see what
Daniel@0	562 difference it makes.
Daniel@0	563 <PRE>
Daniel@0	564 evidence{R} = 2;
Daniel@0	565 [engine, loglik] = enter_evidence(engine, evidence);
Daniel@0	566 marg = marginal_nodes(engine, S);
Daniel@0	567 p = marg.T(2);
Daniel@0	568 </PRE>
Daniel@0	569 We find that p = P(S=2|W=2,R=2) = 0.1945,
Daniel@0	570 which is lower than
Daniel@0	571 before, because the rain can "explain away" the
Daniel@0	572 fact that the grass is wet.
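<p>
Because the sprinkler network has only 16 joint states, both posteriors can be checked by brute-force enumeration. The sketch below is plain Matlab and does not call BNT; it assumes the CPT values given in the Parameters section and uses illustrative variable names.

```matlab
% CPTs of the sprinkler network (1 = false, 2 = true).
pC = [0.5 0.5];                                          % P(C)
pS = reshape([0.5 0.9 0.5 0.1], [2 2]);                  % pS(c,s) = P(S=s | C=c)
pR = reshape([0.8 0.2 0.2 0.8], [2 2]);                  % pR(c,r) = P(R=r | C=c)
pW = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);  % pW(s,r,w) = P(W=w | S=s, R=r)

% Full joint via the chain rule: joint(c,s,r,w).
joint = zeros(2,2,2,2);
for c = 1:2
  for s = 1:2
    for r = 1:2
      for w = 1:2
        joint(c,s,r,w) = pC(c) * pS(c,s) * pR(c,r) * pW(s,r,w);
      end
    end
  end
end

% P(S=2 | W=2) = P(S=2, W=2) / P(W=2)
num = joint(:,2,:,2);  den = joint(:,:,:,2);
p_given_wet = sum(num(:)) / sum(den(:));

% P(S=2 | W=2, R=2) = P(S=2, R=2, W=2) / P(R=2, W=2)
num = joint(:,2,2,2);  den = joint(:,:,2,2);
p_given_wet_and_rain = sum(num(:)) / sum(den(:));
```

Enumeration scales exponentially with the number of nodes, so it is useful only as a check on small models.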
Daniel@0	573 <p>
Daniel@0	574 You can plot a marginal distribution over a discrete variable
Daniel@0	575 as a barchart using the built-in 'bar' function:
Daniel@0	576 <pre>
Daniel@0	577 bar(marg.T)
Daniel@0	578 </pre>
Daniel@0	579 This is what it looks like
Daniel@0	580
Daniel@0	581 <p>
Daniel@0	582 <center>
Daniel@0	583 <IMG SRC="Figures/sprinkler_bar.gif">
Daniel@0	584 </center>
Daniel@0	585 <p>
Daniel@0	586
Daniel@0	587 <h2><a name="observed">Observed nodes</a></h2>
Daniel@0	588
Daniel@0	589 What happens if we ask for the marginal on an observed node, e.g. P(W|W=2)?
Daniel@0	590 An observed discrete node effectively only has 1 value (the observed
Daniel@0	591 one) --- all other values would result in 0 probability.
Daniel@0	592 For efficiency, BNT treats observed (discrete) nodes as if their
Daniel@0	593 size were 1, as we see below:
Daniel@0	594 <pre>
Daniel@0	595 evidence = cell(1,N);
Daniel@0	596 evidence{W} = 2;
Daniel@0	597 engine = enter_evidence(engine, evidence);
Daniel@0	598 m = marginal_nodes(engine, W);
Daniel@0	599 m.T
Daniel@0	600 ans =
Daniel@0	601 1
Daniel@0	602 </pre>
Daniel@0	603 This can get a little confusing, since we assigned W=2.
Daniel@0	604 So we can ask BNT to add the evidence back in by passing in an optional argument:
Daniel@0	605 <pre>
Daniel@0	606 m = marginal_nodes(engine, W, 1);
Daniel@0	607 m.T
Daniel@0	608 ans =
Daniel@0	609 0
Daniel@0	610 1
Daniel@0	611 </pre>
Daniel@0	612 This shows that P(W=1|W=2) = 0 and P(W=2|W=2) = 1.
Daniel@0	613
Daniel@0	614
Daniel@0	615
Daniel@0	616 <h2><a name="joint">Computing joint distributions</a></h2>
Daniel@0	617
Daniel@0	618 We can compute the joint probability on a set of nodes as in the
Daniel@0	619 following example.
Daniel@0	620 <pre>
Daniel@0	621 evidence = cell(1,N);
Daniel@0	622 [engine, ll] = enter_evidence(engine, evidence);
Daniel@0	623 m = marginal_nodes(engine, [S R W]);
Daniel@0	624 </pre>
Daniel@0	625 m is a structure. The 'T' field is a multi-dimensional array (in
Daniel@0	626 this case, 3-dimensional) that contains the joint probability
Daniel@0	627 distribution on the specified nodes.
Daniel@0	628 <pre>
Daniel@0	629 >> m.T
Daniel@0	630 ans(:,:,1) =
Daniel@0	631 0.2900 0.0410
Daniel@0	632 0.0210 0.0009
Daniel@0	633 ans(:,:,2) =
Daniel@0	634 0 0.3690
Daniel@0	635 0.1890 0.0891
Daniel@0	636 </pre>
Daniel@0	637 We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass
Daniel@0	638 to be wet if both the rain and sprinkler are off.
Daniel@0	639 <p>
Daniel@0	640 Let us now add some evidence to R.
Daniel@0	641 <pre>
Daniel@0	642 evidence{R} = 2;
Daniel@0	643 [engine, ll] = enter_evidence(engine, evidence);
Daniel@0	644 m = marginal_nodes(engine, [S R W])
Daniel@0	645 m =
Daniel@0	646 domain: [2 3 4]
Daniel@0	647 T: [2x1x2 double]
Daniel@0	648 >> m.T
Daniel@0	650 ans(:,:,1) =
Daniel@0	651 0.0820
Daniel@0	652 0.0018
Daniel@0	653 ans(:,:,2) =
Daniel@0	654 0.7380
Daniel@0	655 0.1782
Daniel@0	656 </pre>
Daniel@0	657 The joint T(i,j,k) = P(S=i,R=j,W=k|evidence)
Daniel@0	658 should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible
Daniel@0	659 with the evidence that R=2.
Daniel@0	660 Instead of creating large tables with many 0s, BNT sets the effective
Daniel@0	661 size of observed (discrete) nodes to 1, as explained above.
Daniel@0	662 This is why m.T has size 2x1x2.
Daniel@0	663 To get a 2x2x2 table, type
Daniel@0	664 <pre>
Daniel@0	665 m = marginal_nodes(engine, [S R W], 1)
Daniel@0	666 m =
Daniel@0	667 domain: [2 3 4]
Daniel@0	668 T: [2x2x2 double]
Daniel@0	669 >> m.T
Daniel@0	671 ans(:,:,1) =
Daniel@0	672 0 0.082
Daniel@0	673 0 0.0018
Daniel@0	674 ans(:,:,2) =
Daniel@0	675 0 0.738
Daniel@0	676 0 0.1782
Daniel@0	677 </pre>
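<p>
The relationship between the compact 2x1x2 table and the padded 2x2x2 table can be reproduced by hand. The sketch below is plain Matlab (no BNT calls): it computes the joint on S, R and W with Cloudy summed out, conditions on R=2 by slicing and renormalizing, and then pads the excluded value R=1 with zeros. The names <tt>slim</tt> and <tt>padded</tt> are illustrative, and the code mirrors the printed tables rather than BNT's internal representation.

```matlab
% CPTs of the sprinkler network (1 = false, 2 = true).
pC = [0.5 0.5];
pS = reshape([0.5 0.9 0.5 0.1], [2 2]);                  % pS(c,s) = P(S=s | C=c)
pR = reshape([0.8 0.2 0.2 0.8], [2 2]);                  % pR(c,r) = P(R=r | C=c)
pW = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);  % pW(s,r,w) = P(W=w | S=s, R=r)

% Unconditional joint T(s,r,w) = P(S=s, R=r, W=w), summing out Cloudy.
T = zeros(2,2,2);
for s = 1:2
  for r = 1:2
    for w = 1:2
      for c = 1:2
        T(s,r,w) = T(s,r,w) + pC(c) * pS(c,s) * pR(c,r) * pW(s,r,w);
      end
    end
  end
end

% Condition on R=2: keep that slice and renormalize.
slice = T(:,2,:);                 % size 2x1x2, like the compact m.T
slim = slice / sum(slice(:));

% Pad with zeros for the excluded value R=1, like the 2x2x2 m.T.
padded = zeros(2,2,2);
padded(:,2,:) = slim;
```

Keeping only the observed slice halves the table for each observed binary node, which is why the compact form is the default.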
Daniel@0	678
Daniel@0	679 <p>
Daniel@0	680 Note: It is not always possible to compute the joint on arbitrary
Daniel@0	681 sets of nodes: it depends on which inference engine you use, as discussed
Daniel@0	682 in more detail <a href="#engines">below</a>.
Daniel@0	683
Daniel@0	684
Daniel@0	685 <h2><a name="soft">Soft/virtual evidence</a></h2>
Daniel@0	686
Daniel@0	687 Sometimes a node is not observed, but we have some distribution over
Daniel@0	688 its possible values; this is often called "soft" or "virtual"
Daniel@0	689 evidence.
Daniel@0	690 One can use this as follows
Daniel@0	691 <pre>
Daniel@0	692 [engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
Daniel@0	693 </pre>
Daniel@0	694 where soft_evidence{i} is either [] (if node i has no soft evidence)
Daniel@0	695 or is a vector representing the probability distribution over i's
Daniel@0	696 possible values.
Daniel@0	697 For example, if we don't know i's exact value, but we know its
Daniel@0	698 likelihood ratio is 60/40, we can write evidence{i} = [] and
Daniel@0	699 soft_evidence{i} = [0.6 0.4].
Daniel@0	700 <p>
Daniel@0	701 Currently only jtree_inf_engine supports this option.
Daniel@0	702 It assumes that all hidden nodes, and all nodes for
Daniel@0	703 which we have soft evidence, are discrete.
Daniel@0	704 For a longer example, see BNT/examples/static/softev1.m.
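<p>
One common reading of virtual evidence is that the vector acts as a likelihood: it multiplies the node's entries in the joint, and the result is renormalized. The sketch below applies that reading to the sprinkler network, with soft evidence [0.6 0.4] on WetGrass, using plain Matlab rather than BNT. That jtree_inf_engine computes exactly this quantity is an assumption here, not something checked against its code; the variable names are illustrative.

```matlab
% CPTs of the sprinkler network (1 = false, 2 = true).
pC = [0.5 0.5];
pS = reshape([0.5 0.9 0.5 0.1], [2 2]);                  % pS(c,s) = P(S=s | C=c)
pR = reshape([0.8 0.2 0.2 0.8], [2 2]);                  % pR(c,r) = P(R=r | C=c)
pW = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);  % pW(s,r,w) = P(W=w | S=s, R=r)

% PSW(s,w) = P(S=s, W=w), summing out Cloudy and Rain.
PSW = zeros(2,2);
for c = 1:2
  for s = 1:2
    for r = 1:2
      for w = 1:2
        PSW(s,w) = PSW(s,w) + pC(c) * pS(c,s) * pR(c,r) * pW(s,r,w);
      end
    end
  end
end

% Soft evidence on W: weight each value of W by its likelihood, renormalize.
soft = PSW * [0.6; 0.4];
soft = soft / sum(soft);          % posterior over S under soft evidence

% Hard evidence W=2 is the special case with likelihood [0 1].
hard = PSW * [0; 1];
hard = hard / sum(hard);          % equals P(S | W=2)
```

Under this reading, soft evidence moves the posterior on S only part of the way towards the hard-evidence answer.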
Daniel@0	705
Daniel@0	706
Daniel@0	707 <h2><a name="mpe">Most probable explanation</a></h2>
Daniel@0	708
Daniel@0	709 To compute the most probable explanation (MPE) of the evidence (i.e.,
Daniel@0	710 the most probable assignment, or a mode of the joint), use
Daniel@0	711 <pre>
Daniel@0	712 [mpe, ll] = calc_mpe(engine, evidence);
Daniel@0	713 </pre>
Daniel@0	714 mpe{i} is the most likely value of node i.
Daniel@0	715 This calls enter_evidence with the 'maximize' flag set to 1, which
Daniel@0	716 causes the engine to do max-product instead of sum-product.
Daniel@0	717 The resulting max-marginals are then thresholded.
Daniel@0	718 If there is more than one maximum probability assignment, we must take
Daniel@0	719 care to break ties in a consistent manner (thresholding the
Daniel@0	720 max-marginals may give the wrong result). To force this behavior,
Daniel@0	721 type
Daniel@0	722 <pre>
Daniel@0	723 [mpe, ll] = calc_mpe(engine, evidence, 1);
Daniel@0	724 </pre>
Daniel@0	725 Note that computing the MPE is sometimes called abductive reasoning.
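<p>
To see why ties matter, here is a small sketch in plain Matlab (no BNT
calls), using a hypothetical joint distribution over two binary nodes A
and B. Both max-marginals are tied, so thresholding each node
independently could pick an assignment with zero probability, whereas
maximizing over the joint cannot.
<pre>
joint = [0.5 0;
         0   0.5];                 % joint(a,b) = P(A=a, B=b), hypothetical
mmA = max(joint, [], 2)';          % max-marginal of A: [0.5 0.5]
mmB = max(joint, [], 1);           % max-marginal of B: [0.5 0.5]
% Every value is tied, so choosing A=1 from mmA and B=2 from mmB would be
% a legal "thresholding" answer, yet joint(1,2) is 0.
[pbest, idx] = max(joint(:));      % maximize over the joint instead
[a, b] = ind2sub(size(joint), idx);
</pre>
Here a and b always agree, because only the assignments on the diagonal
have non-zero probability.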
Daniel@0	726
Daniel@0	727 <p>
Daniel@0	728 You can also use <tt>calc_mpe_bucket</tt>, written by Ron Zohar,
Daniel@0	729 which does a forwards max-product pass, and then a backwards traceback
Daniel@0	730 pass; this is how Viterbi is traditionally implemented.
Daniel@0	731
Daniel@0	732
Daniel@0	733
Daniel@0	734 <h1><a name="cpd">Conditional Probability Distributions</a></h1>
Daniel@0	735
Daniel@0	736 A Conditional Probability Distribution (CPD)
Daniel@0	737 defines P(X(i) | X(Pa(i))), where X(i) is the i'th node, and X(Pa(i))
Daniel@0	738 are the parents of node i. There are many ways to represent this
Daniel@0	739 distribution, which depend in part on whether X(i) and X(Pa(i)) are
Daniel@0	740 discrete, continuous, or a combination.
Daniel@0	741 We will discuss various representations below.
Daniel@0	742
Daniel@0	743
Daniel@0	744 <h2><a name="tabular">Tabular nodes</a></h2>
Daniel@0	745
Daniel@0	746 If the CPD is represented as a table (i.e., if it is a multinomial
Daniel@0	747 distribution), it has a number of parameters that is exponential in
Daniel@0	748 the number of parents. See the example <a href="#cpt">above</a>.
Daniel@0	749
Daniel@0	750
Daniel@0	751 <h2><a name="noisyor">Noisy-or nodes</a></h2>
Daniel@0	752
Daniel@0	753 A noisy-OR node is like a regular logical OR gate except that
Daniel@0	754 sometimes the effects of parents that are on get inhibited.
Daniel@0	755 Let the prob. that parent i gets inhibited be q(i).
Daniel@0	756 Then a node, C, with 2 parents, A and B, has the following CPD, where
Daniel@0	757 we use F and T to represent off and on (1 and 2 in BNT).
Daniel@0	758 <pre>
Daniel@0	759 A  B  P(C=off)   P(C=on)
Daniel@0	760 ---------------------------
Daniel@0	761 F  F  1.0        0.0
Daniel@0	762 T  F  q(A)       1-q(A)
Daniel@0	763 F  T  q(B)       1-q(B)
Daniel@0	764 T  T  q(A)q(B)   1-q(A)q(B)
Daniel@0	765 </pre>
Daniel@0	766 Thus we see that the causes get inhibited independently.
Daniel@0	767 It is common to associate a "leak" node with a noisy-or CPD, which is
Daniel@0	768 like a parent that is always on. This can account for all other unmodelled
Daniel@0	769 causes which might turn the node on.
Daniel@0	770 <p>
Daniel@0	771 The noisy-or distribution is similar to the logistic distribution.
Daniel@0	772 To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j)
Daniel@0	773 be the prob. that j inhibits i. Then
Daniel@0	774 <pre>
Daniel@0	775 Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j)
Daniel@0	776 </pre>
Daniel@0	777 Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then
Daniel@0	778 <pre>
Daniel@0	779 Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j))
Daniel@0	780 </pre>
Daniel@0	781 For a sigmoid node, we have
Daniel@0	782 <pre>
Daniel@0	783 Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j))
Daniel@0	784 </pre>
Daniel@0	785 where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of
Daniel@0	786 the activation function (although both are monotonically increasing).
Daniel@0	787 In addition, in the case of a noisy-or, the weights are constrained to be
Daniel@0	788 positive, since they derive from probabilities q(i,j).
Daniel@0	789 In both cases, the number of parameters is <em>linear</em> in the
Daniel@0	790 number of parents, unlike the case of a multinomial distribution,
Daniel@0	791 where the number of parameters is exponential in the number of parents.
Daniel@0	792 We will see an example of noisy-OR nodes <a href="#qmr">below</a>.
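<p>
The following sketch builds the noisy-OR table for two parents in plain
Matlab (it does not use BNT's own noisy-or class), and checks that the
product form and the rho form give the same numbers. The inhibition
probabilities are made-up values for illustration.
<pre>
q = [0.1 0.2];                     % hypothetical q(A), q(B)
n = numel(q);
w = -log(q);                       % weights w(j) = -ln q(j)
CPT = zeros(2^n, 2);               % columns: P(C=off), P(C=on)
rho_on = zeros(2^n, 1);
for row = 1:2^n
    on = bitget(row-1, 1:n);       % on(j)=1 if parent j is on; A toggles fastest
    p_off = prod(q .^ on);         % every active cause must be inhibited
    CPT(row,:) = [p_off, 1 - p_off];
    rho_on(row) = 1 - exp(-sum(w .* on));   % same P(C=on) via rho(sum_j w(j) S(j))
end
disp(CPT)
</pre>
The rows come out in the same order as the table above (FF, TF, FT,
TT), and the table needs only n numbers to specify rather than 2^n.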
Daniel@0	793
Daniel@0	794
Daniel@0	795 <h2><a name="deterministic">Other (noisy) deterministic nodes</a></h2>
Daniel@0	796
Daniel@0	797 Deterministic CPDs for discrete random variables can be created using
Daniel@0	798 the deterministic_CPD class. It is also possible to 'flip' the output
Daniel@0	799 of the function with some probability, to simulate noise.
Daniel@0	800 The boolean_CPD class is just a special case of a
Daniel@0	801 deterministic CPD, where the parents and child are all binary.
Daniel@0	802 <p>
Daniel@0	803 Both of these classes are just "syntactic sugar" for the tabular_CPD
Daniel@0	804 class.
Daniel@0	805
Daniel@0	806
Daniel@0	807
Daniel@0	808 <h2><a name="softmax">Softmax nodes</a></h2>
Daniel@0	809
Daniel@0	810 If we have a discrete node with a continuous parent,
Daniel@0	811 we can define its CPD using a softmax function
Daniel@0	812 (also known as the multinomial logit function).
Daniel@0	813 This acts like a soft thresholding operator, and is defined as follows:
Daniel@0	814 <pre>
Daniel@0	815                   exp(w(:,i)'*x + b(i))
Daniel@0	816 Pr(Q=i | X=x) = -----------------------------
Daniel@0	817                  sum_j exp(w(:,j)'*x + b(j))
Daniel@0	818
Daniel@0	819 </pre>
Daniel@0	820 The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the
Daniel@0	821 following interpretation: w(:,i)-w(:,j) is the normal vector to the
Daniel@0	822 decision boundary between classes i and j,
Daniel@0	823 and b(i)-b(j) is its offset (bias). For example, suppose
Daniel@0	824 X is a 2-vector, and Q is binary. Then
Daniel@0	825 <pre>
Daniel@0	826 w = [1 -1;
Daniel@0	827 0 0];
Daniel@0	828
Daniel@0	829 b = [0 0];
Daniel@0	830 </pre>
Daniel@0	831 means class 1 are points in the 2D plane with positive x coordinate,
Daniel@0	832 and class 2 are points in the 2D plane with negative x coordinate.
Daniel@0	833 If w has large magnitude, the decision boundary is sharp, otherwise it
Daniel@0	834 is soft.
Daniel@0	835 In the special case that Q is binary (0/1), the softmax function reduces to the logistic
Daniel@0	836 (sigmoid) function.
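<p>
As a quick check of the example above, this plain-Matlab sketch
evaluates the softmax formula directly (it does not call softmax_CPD).
The two test points are arbitrary choices; the helper names pr_q and
sigma are introduced here for illustration.
<pre>
w = [1 -1;
     0  0];                        % column i holds w(:,i)
b = [0 0];
pr_q = @(x) exp(w'*x + b') / sum(exp(w'*x + b'));   % vector of Pr(Q=i | X=x)
p_right = pr_q([ 2; 5]);           % positive x coordinate: class 1 dominates
p_left  = pr_q([-2; 5]);           % negative x coordinate: class 2 dominates
% Binary case: Pr(Q=1|x) is the logistic function of the difference of scores.
sigma = @(a) 1 / (1 + exp(-a));
p_logistic = sigma((w(:,1) - w(:,2))' * [2; 5] + b(1) - b(2));
</pre>
Scaling w by a large constant pushes both probabilities towards 0 or 1,
which is the "sharp boundary" case described above.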
Daniel@0	837 <p>
Daniel@0	838 Fitting a softmax function can be done using the iteratively reweighted
Daniel@0	839 least squares (IRLS) algorithm.
Daniel@0	840 We use the implementation from
Daniel@0	841 <a href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>.
Daniel@0	842 Note that since
Daniel@0	843 the softmax distribution is not in the exponential family, it does not
Daniel@0	844 have finite sufficient statistics, and hence we must store all the
Daniel@0	845 training data in uncompressed form.
Daniel@0	846 If this takes too much space, one should use online (stochastic) gradient
Daniel@0	847 descent (not implemented in BNT).
Daniel@0	848 <p>
Daniel@0	849 If a softmax node also has discrete parents,
Daniel@0	850 we use a different set of w/b parameters for each combination of
Daniel@0	851 parent values, as in the <a href="#gaussian">conditional linear
Daniel@0	852 Gaussian CPD</a>.
Daniel@0	853 This feature was implemented by Pierpaolo Brutti.
Daniel@0	854 He is currently extending it so that discrete parents can be treated
Daniel@0	855 as if they were continuous, by adding indicator variables to the X
Daniel@0	856 vector.
Daniel@0	857 <p>
Daniel@0	858 We will see an example of softmax nodes <a href="#mixexp">below</a>.
Daniel@0	859
Daniel@0	860
Daniel@0	861 <h2><a name="mlp">Neural network nodes</a></h2>
Daniel@0	862
Daniel@0	863 Pierpaolo Brutti has implemented the mlp_CPD class, which uses a multi layer perceptron
Daniel@0	864 to implement a mapping from continuous parents to discrete children,
Daniel@0	865 similar to the softmax function.
Daniel@0	866 (If there are also discrete parents, it creates a mixture of MLPs.)
Daniel@0	867 It uses code from <a
Daniel@0	868 href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>.
Daniel@0	869 This is work in progress.
Daniel@0	870
Daniel@0	871 <h2><a name="root">Root nodes</a></h2>
Daniel@0	872
Daniel@0	873 A root node has no parents and no parameters; it can be used to model
Daniel@0	874 an observed, exogenous input variable, i.e., one which is "outside"
Daniel@0	875 the model.
Daniel@0	876 This is useful for conditional density models.
Daniel@0	877 We will see an example of root nodes <a href="#mixexp">below</a>.
Daniel@0	878
Daniel@0	879
Daniel@0	880 <h2><a name="gaussian">Gaussian nodes</a></h2>
Daniel@0	881
Daniel@0	882 We now consider a distribution suitable for the continuous-valued nodes.
Daniel@0	883 Suppose the node is called Y, its continuous parents (if any) are
Daniel@0	884 called X, and its discrete parents (if any) are called Q.
Daniel@0	885 The distribution on Y is defined as follows:
Daniel@0	886 <pre>
Daniel@0	887 - no parents:               Y ~ N(mu, Sigma)
Daniel@0	888 - cts parents:              Y|X=x ~ N(mu + W x, Sigma)
Daniel@0	889 - discrete parents:         Y|Q=i ~ N(mu(:,i), Sigma(:,:,i))
Daniel@0	890 - cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i))
Daniel@0	891 </pre>
Daniel@0	892 where N(mu, Sigma) denotes a Normal distribution with mean mu and
Daniel@0	893 covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q
Daniel@0	894 respectively.
Daniel@0	895 If there are no discrete parents, |Q|=1; if there is
Daniel@0	896 more than one, then |Q| = a vector of the sizes of each discrete parent.
Daniel@0	897 If there are no continuous parents, |X|=0; if there is more than one,
Daniel@0	898 then |X| = the sum of their sizes.
Daniel@0	899 Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive
Daniel@0	900 semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight)
Daniel@0	901 matrix.
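<p>
The shapes are easier to see in code. The sketch below uses plain
Matlab arrays with hypothetical sizes (it does not construct a
gaussian_CPD object) and evaluates the conditional mean and covariance
for one parent configuration, then draws a sample.
<pre>
Ysz = 2; Xsz = 3; Qsz = 2;                 % hypothetical sizes |Y|, |X|, |Q|
mu    = zeros(Ysz, Qsz);                   % one mean column per discrete state
Sigma = repmat(eye(Ysz), [1 1 Qsz]);       % one covariance matrix per state
W     = randn(Ysz, Xsz, Qsz);              % one regression matrix per state
x = [1; 0; -1];  i = 2;                    % a parent configuration X=x, Q=i
cond_mean = mu(:,i) + W(:,:,i) * x;        % mean of Y given X=x, Q=i
cond_cov  = Sigma(:,:,i);                  % covariance of Y given Q=i
y = cond_mean + chol(cond_cov)' * randn(Ysz, 1);   % one sample of Y
</pre>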
Daniel@0	902 <p>
Daniel@0	903 We can create a Gaussian node with random parameters as follows.
Daniel@0	904 <pre>
Daniel@0	905 bnet.CPD{i} = gaussian_CPD(bnet, i);
Daniel@0	906 </pre>
Daniel@0	907 We can specify the value of one or more of the parameters as in the
Daniel@0	908 following example, in which |Y|=2, and |Q|=1.
Daniel@0	909 <pre>
Daniel@0	910 bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', [0; 0], 'weights', randn(Y,X), 'cov', eye(Y));
Daniel@0	911 </pre>
Daniel@0	912 <p>
Daniel@0	913 We will see an example of conditional linear Gaussian nodes <a
Daniel@0	914 href="#cg_model">below</a>.
Daniel@0	915 <p>
Daniel@0	916 <b>When learning Gaussians from data</b>, it is helpful to ensure the
Daniel@0	917 data has a small magnitude
Daniel@0	918 (see e.g., KPMstats/standardize) to prevent numerical problems.
Daniel@0	919 Unless you have a lot of data, it is also a very good idea to use
Daniel@0	920 diagonal instead of full covariance matrices.
Daniel@0	921 (BNT does not currently support spherical covariances, although it
Daniel@0	922 would be easy to add, since KPMstats/clg_Mstep supports this option;
Daniel@0	923 you would just need to modify gaussian_CPD/update_ess to accumulate
Daniel@0	924 weighted inner products.)
Daniel@0	925
Daniel@0	926
Daniel@0	927
Daniel@0	928 <h2><a name="nongauss">Other continuous distributions</a></h2>
Daniel@0	929
Daniel@0	930 Currently BNT does not support any CPDs for continuous nodes other
Daniel@0	931 than the Gaussian.
Daniel@0	932 However, you can use a mixture of Gaussians to
Daniel@0	933 approximate other continuous distributions. We will see an example
Daniel@0	934 of this with the IFA model <a href="#pca">below</a>.
Daniel@0	935
Daniel@0	936
Daniel@0	937 <h2><a name="glm">Generalized linear model nodes</a></h2>
Daniel@0	938
Daniel@0	939 In the future, we may incorporate some of the functionality of
Daniel@0	940 <a href =
Daniel@0	941 "http://www.sci.usq.edu.au/staff/dunn/glmlab/glmlab.html">glmlab</a>
Daniel@0	942 into BNT.
Daniel@0	943
Daniel@0	944
Daniel@0	945 <h2><a name="dtree">Classification/regression tree nodes</a></h2>
Daniel@0	946
Daniel@0	947 We plan to add classification and regression trees to define CPDs for
Daniel@0	948 discrete and continuous nodes, respectively.
Daniel@0	949 Trees have many advantages: they are easy to interpret, they can do
Daniel@0	950 feature selection, they can
Daniel@0	951 handle discrete and continuous inputs, they do not make strong
Daniel@0	952 assumptions about the form of the distribution, the number of
Daniel@0	953 parameters can grow in a data-dependent way (i.e., they are
Daniel@0	954 semi-parametric), they can handle missing data, etc.
Daniel@0	955 However, they are not yet implemented.
Daniel@0	956 <!--
Daniel@0	957 Yimin Zhang is currently (Feb '02) implementing this.
Daniel@0	958 -->
Daniel@0	959
Daniel@0	960
Daniel@0	961 <h2><a name="cpd_summary">Summary of CPD types</a></h2>
Daniel@0	962
Daniel@0	963 We list all the different types of CPDs supported by BNT.
Daniel@0	964 For each CPD, we specify if the child and parents can be discrete (D) or
Daniel@0	965 continuous (C) (Binary (B) nodes are a special case).
Daniel@0	966 We also specify which methods each class supports.
Daniel@0	967 If a method is inherited, the name of the parent class is mentioned.
Daniel@0	968 If a parent class calls a child method, this is mentioned.
Daniel@0	969 <p>
Daniel@0	970 The <tt>CPD_to_CPT</tt> method converts a CPD to a table; this
Daniel@0	971 requires that the child and all parents are discrete.
Daniel@0	972 The CPT might be exponentially big...
Daniel@0	973 <tt>convert_to_table</tt> evaluates a CPD with evidence, and
Daniel@0	974 represents the resulting potential as an array.
Daniel@0	975 This requires that the child is discrete, and any continuous parents
Daniel@0	976 are observed.
Daniel@0	977 <tt>convert_to_pot</tt> evaluates a CPD with evidence, and
Daniel@0	978 represents the resulting potential as a dpot, gpot, cgpot or upot, as
Daniel@0	979 requested. (d=discrete, g=Gaussian, cg = conditional Gaussian, u =
Daniel@0	980 utility).
Daniel@0	981
Daniel@0	982 <p>
Daniel@0	983 When we sample a node, all the parents are observed.
Daniel@0	984 When we compute the (log) probability of a node, all the parents and
Daniel@0	985 the child are observed.
Daniel@0	986 <p>
Daniel@0	987 We also specify if the parameters are learnable.
Daniel@0	988 For learning with EM, we require
Daniel@0	989 the methods <tt>reset_ess</tt>, <tt>update_ess</tt> and
Daniel@0	990 <tt>maximize_params</tt>.
Daniel@0	991 For learning from fully observed data, we require
Daniel@0	992 the method <tt>learn_params</tt>.
Daniel@0	993 By default, all classes inherit this from generic_CPD, which simply
Daniel@0	994 calls <tt>update_ess</tt> N times, once for each data case, followed
Daniel@0	995 by <tt>maximize_params</tt>, i.e., it is like EM, without the E step.
Daniel@0	996 Some classes implement a batch formula, which is quicker.
Daniel@0	997 <p>
Daniel@0	998 Bayesian learning means computing a posterior over the parameters
Daniel@0	999 given fully observed data.
Daniel@0	1000 <p>
Daniel@0	1001 Pearl means we implement the methods <tt>compute_pi</tt> and
Daniel@0	1002 <tt>compute_lambda_msg</tt>, used by
Daniel@0	1003 <tt>pearl_inf_engine</tt>, which runs on directed graphs.
Daniel@0	1004 <tt>belprop_inf_engine</tt> only needs <tt>convert_to_pot</tt>.
Daniel@0	1005 The pearl methods can exploit special properties of the CPDs for
Daniel@0	1006 computing the messages efficiently, whereas belprop does not.
Daniel@0	1007 <p>
Daniel@0	1008 The only method implemented by generic_CPD is <tt>adjustable_CPD</tt>,
Daniel@0	1009 which is not shown, since it is not very interesting.
Daniel@0	1010
Daniel@0	1011
Daniel@0	1012 <p>
Daniel@0	1013
Daniel@0	1014
Daniel@0	1016 <table border units = pixels><tr>
Daniel@0	1017 <td align=center>Name
Daniel@0	1018 <td align=center>Child
Daniel@0	1019 <td align=center>Parents
Daniel@0	1020 <td align=center>Comments
Daniel@0	1021 <td align=center>CPD_to_CPT
Daniel@0	1022 <td align=center>conv_to_table
Daniel@0	1023 <td align=center>conv_to_pot
Daniel@0	1024 <td align=center>sample
Daniel@0	1025 <td align=center>prob
Daniel@0	1026 <td align=center>learn
Daniel@0	1027 <td align=center>Bayes
Daniel@0	1028 <td align=center>Pearl
Daniel@0	1029
Daniel@0	1030
Daniel@0	1031 <tr>
Daniel@0	1032 <!-- Name--><td>
Daniel@0	1033 <!-- Child--><td>
Daniel@0	1034 <!-- Parents--><td>
Daniel@0	1035 <!-- Comments--><td>
Daniel@0	1036 <!-- CPD_to_CPT--><td>
Daniel@0	1037 <!-- conv_to_table--><td>
Daniel@0	1038 <!-- conv_to_pot--><td>
Daniel@0	1039 <!-- sample--><td>
Daniel@0	1040 <!-- prob--><td>
Daniel@0	1041 <!-- learn--><td>
Daniel@0	1042 <!-- Bayes--><td>
Daniel@0	1043 <!-- Pearl--><td>
Daniel@0	1044
Daniel@0	1045 <tr>
Daniel@0	1046 <!-- Name--><td>boolean
Daniel@0	1047 <!-- Child--><td>B
Daniel@0	1048 <!-- Parents--><td>B
Daniel@0	1049 <!-- Comments--><td>Syntactic sugar for tabular
Daniel@0	1050 <!-- CPD_to_CPT--><td>-
Daniel@0	1051 <!-- conv_to_table--><td>-
Daniel@0	1052 <!-- conv_to_pot--><td>-
Daniel@0	1053 <!-- sample--><td>-
Daniel@0	1054 <!-- prob--><td>-
Daniel@0	1055 <!-- learn--><td>-
Daniel@0	1056 <!-- Bayes--><td>-
Daniel@0	1057 <!-- Pearl--><td>-
Daniel@0	1058
Daniel@0	1059 <tr>
Daniel@0	1060 <!-- Name--><td>deterministic
Daniel@0	1061 <!-- Child--><td>D
Daniel@0	1062 <!-- Parents--><td>D
Daniel@0	1063 <!-- Comments--><td>Syntactic sugar for tabular
Daniel@0	1064 <!-- CPD_to_CPT--><td>-
Daniel@0	1065 <!-- conv_to_table--><td>-
Daniel@0	1066 <!-- conv_to_pot--><td>-
Daniel@0	1067 <!-- sample--><td>-
Daniel@0	1068 <!-- prob--><td>-
Daniel@0	1069 <!-- learn--><td>-
Daniel@0	1070 <!-- Bayes--><td>-
Daniel@0	1071 <!-- Pearl--><td>-
Daniel@0	1072
Daniel@0	1073 <tr>
Daniel@0	1074 <!-- Name--><td>Discrete
Daniel@0	1075 <!-- Child--><td>D
Daniel@0	1076 <!-- Parents--><td>C/D
Daniel@0	1077 <!-- Comments--><td>Virtual class
Daniel@0	1078 <!-- CPD_to_CPT--><td>N
Daniel@0	1079 <!-- conv_to_table--><td>Calls CPD_to_CPT
Daniel@0	1080 <!-- conv_to_pot--><td>Calls conv_to_table
Daniel@0	1081 <!-- sample--><td>Calls conv_to_table
Daniel@0	1082 <!-- prob--><td>Calls conv_to_table
Daniel@0	1083 <!-- learn--><td>N
Daniel@0	1084 <!-- Bayes--><td>N
Daniel@0	1085 <!-- Pearl--><td>N
Daniel@0	1086
Daniel@0	1087 <tr>
Daniel@0	1088 <!-- Name--><td>Gaussian
Daniel@0	1089 <!-- Child--><td>C
Daniel@0	1090 <!-- Parents--><td>C/D
Daniel@0	1091 <!-- Comments--><td>-
Daniel@0	1092 <!-- CPD_to_CPT--><td>N
Daniel@0	1093 <!-- conv_to_table--><td>N
Daniel@0	1094 <!-- conv_to_pot--><td>Y
Daniel@0	1095 <!-- sample--><td>Y
Daniel@0	1096 <!-- prob--><td>Y
Daniel@0	1097 <!-- learn--><td>Y
Daniel@0	1098 <!-- Bayes--><td>N
Daniel@0	1099 <!-- Pearl--><td>N
Daniel@0	1100
Daniel@0	1101 <tr>
Daniel@0	1102 <!-- Name--><td>gmux
Daniel@0	1103 <!-- Child--><td>C
Daniel@0	1104 <!-- Parents--><td>C/D
Daniel@0	1105 <!-- Comments--><td>multiplexer
Daniel@0	1106 <!-- CPD_to_CPT--><td>N
Daniel@0	1107 <!-- conv_to_table--><td>N
Daniel@0	1108 <!-- conv_to_pot--><td>Y
Daniel@0	1109 <!-- sample--><td>N
Daniel@0	1110 <!-- prob--><td>N
Daniel@0	1111 <!-- learn--><td>N
Daniel@0	1112 <!-- Bayes--><td>N
Daniel@0	1113 <!-- Pearl--><td>Y
Daniel@0	1114
Daniel@0	1115
Daniel@0	1116 <tr>
Daniel@0	1117 <!-- Name--><td>MLP
Daniel@0	1118 <!-- Child--><td>D
Daniel@0	1119 <!-- Parents--><td>C/D
Daniel@0	1120 <!-- Comments--><td>multi layer perceptron
Daniel@0	1121 <!-- CPD_to_CPT--><td>N
Daniel@0	1122 <!-- conv_to_table--><td>Y
Daniel@0	1123 <!-- conv_to_pot--><td>Inherits from discrete
Daniel@0	1124 <!-- sample--><td>Inherits from discrete
Daniel@0	1125 <!-- prob--><td>Inherits from discrete
Daniel@0	1126 <!-- learn--><td>Y
Daniel@0	1127 <!-- Bayes--><td>N
Daniel@0	1128 <!-- Pearl--><td>N
Daniel@0	1129
Daniel@0	1130
Daniel@0	1131 <tr>
Daniel@0	1132 <!-- Name--><td>noisy-or
Daniel@0	1133 <!-- Child--><td>B
Daniel@0	1134 <!-- Parents--><td>B
Daniel@0	1135 <!-- Comments--><td>-
Daniel@0	1136 <!-- CPD_to_CPT--><td>Y
Daniel@0	1137 <!-- conv_to_table--><td>Inherits from discrete
Daniel@0	1138 <!-- conv_to_pot--><td>Inherits from discrete
Daniel@0	1139 <!-- sample--><td>Inherits from discrete
Daniel@0	1140 <!-- prob--><td>Inherits from discrete
Daniel@0	1141 <!-- learn--><td>N
Daniel@0	1142 <!-- Bayes--><td>N
Daniel@0	1143 <!-- Pearl--><td>Y
Daniel@0	1144
Daniel@0	1145
Daniel@0	1146 <tr>
Daniel@0	1147 <!-- Name--><td>root
Daniel@0	1148 <!-- Child--><td>C/D
Daniel@0	1149 <!-- Parents--><td>none
Daniel@0	1150 <!-- Comments--><td>no params
Daniel@0	1151 <!-- CPD_to_CPT--><td>N
Daniel@0	1152 <!-- conv_to_table--><td>N
Daniel@0	1153 <!-- conv_to_pot--><td>Y
Daniel@0	1154 <!-- sample--><td>Y
Daniel@0	1155 <!-- prob--><td>Y
Daniel@0	1156 <!-- learn--><td>N
Daniel@0	1157 <!-- Bayes--><td>N
Daniel@0	1158 <!-- Pearl--><td>N
Daniel@0	1159
Daniel@0	1160
Daniel@0	1161 <tr>
Daniel@0	1162 <!-- Name--><td>softmax
Daniel@0	1163 <!-- Child--><td>D
Daniel@0	1164 <!-- Parents--><td>C/D
Daniel@0	1165 <!-- Comments--><td>-
Daniel@0	1166 <!-- CPD_to_CPT--><td>N
Daniel@0	1167 <!-- conv_to_table--><td>Y
Daniel@0	1168 <!-- conv_to_pot--><td>Inherits from discrete
Daniel@0	1169 <!-- sample--><td>Inherits from discrete
Daniel@0	1170 <!-- prob--><td>Inherits from discrete
Daniel@0	1171 <!-- learn--><td>Y
Daniel@0	1172 <!-- Bayes--><td>N
Daniel@0	1173 <!-- Pearl--><td>N
Daniel@0	1174
Daniel@0	1175
Daniel@0	1176 <tr>
Daniel@0	1177 <!-- Name--><td>generic
Daniel@0	1178 <!-- Child--><td>C/D
Daniel@0	1179 <!-- Parents--><td>C/D
Daniel@0	1180 <!-- Comments--><td>Virtual class
Daniel@0	1181 <!-- CPD_to_CPT--><td>N
Daniel@0	1182 <!-- conv_to_table--><td>N
Daniel@0	1183 <!-- conv_to_pot--><td>N
Daniel@0	1184 <!-- sample--><td>N
Daniel@0	1185 <!-- prob--><td>N
Daniel@0	1186 <!-- learn--><td>N
Daniel@0	1187 <!-- Bayes--><td>N
Daniel@0	1188 <!-- Pearl--><td>N
Daniel@0	1189
Daniel@0	1190
Daniel@0	1191 <tr>
Daniel@0	1192 <!-- Name--><td>Tabular
Daniel@0	1193 <!-- Child--><td>D
Daniel@0	1194 <!-- Parents--><td>D
Daniel@0	1195 <!-- Comments--><td>-
Daniel@0	1196 <!-- CPD_to_CPT--><td>Y
Daniel@0	1197 <!-- conv_to_table--><td>Inherits from discrete
Daniel@0	1198 <!-- conv_to_pot--><td>Inherits from discrete
Daniel@0	1199 <!-- sample--><td>Inherits from discrete
Daniel@0	1200 <!-- prob--><td>Inherits from discrete
Daniel@0	1201 <!-- learn--><td>Y
Daniel@0	1202 <!-- Bayes--><td>Y
Daniel@0	1203 <!-- Pearl--><td>Y
Daniel@0	1204
Daniel@0	1205 </table>
Daniel@0	1206
Daniel@0	1207
Daniel@0	1208
Daniel@0	1209 <h1><a name="examples">Example models</a></h1>
Daniel@0	1210
Daniel@0	1211
Daniel@0	1212 <h2>Gaussian mixture models</h2>
Daniel@0	1213
Daniel@0	1214 Richard W. DeVaul has made a detailed tutorial on how to fit mixtures
Daniel@0	1215 of Gaussians using BNT. Available
Daniel@0	1216 <a href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">here</a>.
Daniel@0	1217
Daniel@0	1218
Daniel@0	1219 <h2><a name="pca">PCA, ICA, and all that</a></h2>
Daniel@0	1220
Daniel@0	1221 In Figure (a) below, we show how Factor Analysis can be thought of as a
Daniel@0	1222 graphical model. Here, X has an N(0,I) prior, and
Daniel@0	1223 Y|X=x ~ N(mu + Wx, Psi),
Daniel@0	1224 where Psi is diagonal and W is called the "factor loading matrix".
Daniel@0	1225 Since the noise on both X and Y is diagonal, the components of these
Daniel@0	1226 vectors are uncorrelated, and hence can be represented as individual
Daniel@0	1227 scalar nodes, as we show in (b).
Daniel@0	1228 (This is useful if parts of the observations on the Y vector are occasionally missing.)
Daniel@0	1229 We usually take k=|X| &lt;&lt; |Y|=D, so the model tries to explain
Daniel@0	1230 many observations using a low-dimensional subspace.
Daniel@0	1231
Daniel@0	1232
Daniel@0	1233 <center>
Daniel@0	1234 <table>
Daniel@0	1235 <tr>
Daniel@0	1236 <td><img src="Figures/fa.gif">
Daniel@0	1237 <td><img src="Figures/fa_scalar.gif">
Daniel@0	1238 <td><img src="Figures/mfa.gif">
Daniel@0	1239 <td><img src="Figures/ifa.gif">
Daniel@0	1240 <tr>
Daniel@0	1241 <td align=center> (a)
Daniel@0	1242 <td align=center> (b)
Daniel@0	1243 <td align=center> (c)
Daniel@0	1244 <td align=center> (d)
Daniel@0	1245 </table>
Daniel@0	1246 </center>
Daniel@0	1247
Daniel@0	1248 <p>
Daniel@0	1249 We can create this model in BNT as follows.
Daniel@0	1250 <pre>
Daniel@0	1251 ns = [k D];
Daniel@0	1252 dag = zeros(2,2);
Daniel@0	1253 dag(1,2) = 1;
Daniel@0	1254 bnet = mk_bnet(dag, ns, 'discrete', []);
Daniel@0	1255 bnet.CPD{1} = gaussian_CPD(bnet, 1, 'mean', zeros(k,1), 'cov', eye(k), ...
Daniel@0	1256 'cov_type', 'diag', 'clamp_mean', 1, 'clamp_cov', 1);
Daniel@0	1257 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(D,1), 'cov', diag(Psi0), 'weights', W0, ...
Daniel@0	1258 'cov_type', 'diag', 'clamp_mean', 1);
Daniel@0	1259 </pre>
Daniel@0	1260
Daniel@0	1261 The root node is clamped to the N(0,I) distribution, so that we will
Daniel@0	1262 not update these parameters during learning.
Daniel@0	1263 The mean of the leaf node is clamped to 0,
Daniel@0	1264 since we assume the data has been centered (had its mean subtracted
Daniel@0	1265 off); this is just for simplicity.
Daniel@0	1266 Finally, the covariance of the leaf node is constrained to be
Daniel@0	1267 diagonal. W0 and Psi0 are the initial parameter guesses.
Daniel@0	1268
Daniel@0	1269 <p>
Daniel@0	1270 We can fit this model (i.e., estimate its parameters in a maximum
Daniel@0	1271 likelihood (ML) sense) using EM, as we
Daniel@0	1272 explain <a href="#em">below</a>.
Daniel@0	1273 Not surprisingly, the ML estimates for mu and Psi turn out to be
Daniel@0	1274 identical to the
Daniel@0	1275 sample mean and variance, which can be computed directly as
Daniel@0	1276 <pre>
Daniel@0	1277 mu_ML = mean(data);
Daniel@0	1278 Psi_ML = diag(cov(data));
Daniel@0	1279 </pre>
Daniel@0	1280 Note that W can only be identified up to a rotation matrix, because of
Daniel@0	1281 the spherical symmetry of the source.
Daniel@0	1282
Daniel@0	1283 <p>
Daniel@0	1284 If we restrict Psi to be spherical, i.e., Psi = sigma*I,
Daniel@0	1285 there is a closed-form solution for W as well,
Daniel@0	1286 i.e., we do not need to use EM.
Daniel@0	1287 In particular, W contains the first |X| eigenvectors of the sample covariance
Daniel@0	1288 matrix, with scalings determined by the eigenvalues and sigma.
Daniel@0	1289 Classical PCA can be obtained by taking the sigma->0 limit.
Daniel@0	1290 For details, see
Daniel@0	1291
Daniel@0	1292 <ul>
Daniel@0	1293 <li> <a href="ftp://hope.caltech.edu/pub/roweis/Empca/empca.ps">
Daniel@0	1294 "EM algorithms for PCA and SPCA"</a>, Sam Roweis, NIPS 97.
Daniel@0	1295 (<a href="ftp://hope.caltech.edu/pub/roweis/Code/empca.tar.gz">
Daniel@0	1296 Matlab software</a>)
Daniel@0	1297
Daniel@0	1298 <p>
Daniel@0	1299 <li>
Daniel@0	1300 <a
Daniel@0	1301 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003>
Daniel@0	1302 "Mixtures of probabilistic principal component analyzers"</a>,
Daniel@0	1303 Tipping and Bishop, Neural Computation 11(2):443--482, 1999.
Daniel@0	1304 </ul>
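<p>
A minimal sketch of the closed-form solution for the spherical case,
following the formulas in Tipping and Bishop and using plain Matlab (no
BNT calls). The data here are synthetic, and the variable names are
illustrative. As in the text, sigma denotes the noise variance, i.e.,
Psi = sigma*I. Note that cov normalizes by N-1, so this matches the ML
estimate only up to a factor of (N-1)/N.
<pre>
data = randn(500, 5) * randn(5, 5);        % synthetic data: 500 cases, D=5
k = 2;                                     % number of latent dimensions |X|
S = cov(data);                             % sample covariance, D x D
[U, L] = eig(S);
[lambda, order] = sort(diag(L), 'descend');
U = U(:, order);                           % eigenvectors, largest eigenvalue first
sigma = mean(lambda(k+1:end));             % noise variance: mean discarded eigenvalue
W = U(:, 1:k) * diag(sqrt(lambda(1:k) - sigma));   % scaled leading eigenvectors
</pre>
W is determined only up to a rotation of its columns, as noted above;
this sketch picks the solution with no extra rotation.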
Daniel@0	1305
Daniel@0	1306 <p>
Daniel@0	1307 By adding a hidden discrete variable, we can create mixtures of FA
Daniel@0	1308 models, as shown in (c).
Daniel@0	1309 Now we can explain the data using a set of subspaces.
Daniel@0	1310 We can create this model in BNT as follows.
Daniel@0	1311 <pre>
Daniel@0	1312 ns = [M k D];
Daniel@0	1313 dag = zeros(3);
Daniel@0	1314 dag(1,3) = 1;
Daniel@0	1315 dag(2,3) = 1;
Daniel@0	1316 bnet = mk_bnet(dag, ns, 'discrete', 1);
Daniel@0	1317 bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0);
Daniel@0	1318 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(k, 1), 'cov', eye(k), 'cov_type', 'diag', ...
Daniel@0	1319 'clamp_mean', 1, 'clamp_cov', 1);
Daniel@0	1320 bnet.CPD{3} = gaussian_CPD(bnet, 3, 'mean', Mu0', 'cov', repmat(diag(Psi0), [1 1 M]), ...
Daniel@0	1321 'weights', W0, 'cov_type', 'diag', 'tied_cov', 1);
Daniel@0	1322 </pre>
Daniel@0	1323 Notice how the covariance matrix for Y is the same for all values of
Daniel@0	1324 Q; that is, the noise level in each sub-space is assumed the same.
Daniel@0	1325 However, we allow the offset, mu, to vary.
Daniel@0	1326 For details, see
Daniel@0	1327 <ul>
Daniel@0	1328
Daniel@0	1329 <LI>
Daniel@0	1330 <a HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/tr-96-1.ps.gz"> The EM
Daniel@0	1331 Algorithm for Mixtures of Factor Analyzers </A>,
Daniel@0	1332 Ghahramani, Z. and Hinton, G.E. (1996),
Daniel@0	1333 University of Toronto
Daniel@0	1334 Technical Report CRG-TR-96-1.
Daniel@0	1335 (<A HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/mfa.tar.gz">Matlab software</A>)
Daniel@0	1336
Daniel@0	1337 <p>
Daniel@0	1338 <li>
Daniel@0	1339 <a
Daniel@0	1340 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003>
Daniel@0	1341 "Mixtures of probabilistic principal component analyzers"</a>,
Daniel@0	1342 Tipping and Bishop, Neural Computation 11(2):443--482, 1999.
Daniel@0	1343 </ul>
Daniel@0	1344
Daniel@0	1345 <p>
Daniel@0	1346 I have included Zoubin's specialized MFA code (with his permission)
Daniel@0	1347 with the toolbox, so you can check that BNT gives the same results:
Daniel@0	1348 see 'BNT/examples/static/mfa1.m'.
Daniel@0	1349
Daniel@0	1350 <p>
Daniel@0	1351 Independent Factor Analysis (IFA) generalizes FA by allowing a
Daniel@0	1352 non-Gaussian prior on each component of X.
Daniel@0	1353 (Note that we can approximate a non-Gaussian prior using a mixture of
Daniel@0	1354 Gaussians.)
Daniel@0	1355 This means that the likelihood function is no longer rotationally
Daniel@0	1356 invariant, so we can uniquely identify W and the hidden
Daniel@0	1357 sources X.
Daniel@0	1358 IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y).
Daniel@0	1359 We recover classical Independent Components Analysis (ICA)
Daniel@0	1360 in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the
Daniel@0	1361 weight matrix W is square and invertible.
Daniel@0	1362 For details, see
Daniel@0	1363 <ul>
Daniel@0	1364 <li>
Daniel@0	1365 <a href="http://www.gatsby.ucl.ac.uk/~hagai/ifa.ps">Independent Factor
Daniel@0	1366 Analysis</a>, H. Attias, Neural Computation 11: 803--851, 1999.
Daniel@0	1367 </ul>
Daniel@0	1368
Daniel@0	1369
Daniel@0	1370
Daniel@0	1371 <h2><a name="mixexp">Mixtures of experts</a></h2>
Daniel@0	1372
Daniel@0	1373 As an example of the use of the softmax function,
Daniel@0	1374 we introduce the Mixture of Experts model.
Daniel@0	1375 <!--
Daniel@0	1376 We also show
Daniel@0	1377 the Hierarchical Mixture of Experts model, where the hierarchy has two
Daniel@0	1378 levels.
Daniel@0	1379 (This is essentially a probabilistic decision tree of height two.)
Daniel@0	1380 -->
Daniel@0	1381 As before,
Daniel@0	1382 circles denote continuous-valued nodes,
Daniel@0	1383 squares denote discrete nodes, clear
Daniel@0	1384 means hidden, and shaded means observed.
Daniel@0	1385 <p>
Daniel@0	1386 <center>
Daniel@0	1387 <table>
Daniel@0	1388 <tr>
Daniel@0	1389 <td><img src="Figures/mixexp.gif">
Daniel@0	1390 <!--
Daniel@0	1391 <td><img src="Figures/hme.gif">
Daniel@0	1392 -->
Daniel@0	1393 </table>
Daniel@0	1394 </center>
Daniel@0	1395 <p>
Daniel@0	1396 X is the observed
Daniel@0	1397 input, Y is the output, and
Daniel@0	1398 the Q nodes are hidden "gating" nodes, which select the appropriate
Daniel@0	1399 set of parameters for Y. During training, Y is assumed observed,
Daniel@0	1400 but for testing, the goal is to predict Y given X.
Daniel@0	1401 Note that this is a <em>conditional</em> density model, so we don't
Daniel@0	1402 associate any parameters with X.
Daniel@0	1403 Hence X's CPD will be a root CPD, which is a way of modelling
Daniel@0	1404 exogenous nodes.
Daniel@0	1405 If the output is a continuous-valued quantity,
Daniel@0	1406 we assume the "experts" are linear-regression units,
Daniel@0	1407 and set Y's CPD to linear-Gaussian.
Daniel@0	1408 If the output is discrete, we set Y's CPD to a softmax function.
Daniel@0	1409 The Q CPDs will always be softmax functions.
Daniel@0	1410
Daniel@0	1411 <p>
Daniel@0	1412 As a concrete example, consider the mixture of experts model where X and Y are
Daniel@0	1413 scalars, and Q is binary.
Daniel@0	1414 This is just piecewise linear regression, where
Daniel@0	1415 we have two line segments, i.e.,
Daniel@0	1416 <P>
Daniel@0	1417 <IMG ALIGN=BOTTOM SRC="Eqns/lin_reg_eqn.gif">
Daniel@0	1418 <P>
Daniel@0	1419 We can create this model with random parameters as follows.
Daniel@0	1420 (This code is bundled in BNT/examples/static/mixexp2.m.)
Daniel@0	1421 <PRE>
Daniel@0	1422 X = 1;
Daniel@0	1423 Q = 2;
Daniel@0	1424 Y = 3;
Daniel@0	1425 dag = zeros(3,3);
Daniel@0	1426 dag(X,[Q Y]) = 1;
Daniel@0	1427 dag(Q,Y) = 1;
Daniel@0	1428 ns = [1 2 1]; % make X and Y scalars, and have 2 experts
Daniel@0	1429 onodes = [1 3];
Daniel@0	1430 bnet = mk_bnet(dag, ns, 'discrete', 2, 'observed', onodes);
Daniel@0	1431
Daniel@0	1432 rand('state', 0);
Daniel@0	1433 randn('state', 0);
Daniel@0	1434 bnet.CPD{1} = root_CPD(bnet, 1);
Daniel@0	1435 bnet.CPD{2} = softmax_CPD(bnet, 2);
Daniel@0	1436 bnet.CPD{3} = gaussian_CPD(bnet, 3);
Daniel@0	1437 </PRE>
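<p>
To see why this model amounts to piecewise linear regression, it can help to evaluate the predictive mean E[Y | x] by hand.
The following is a minimal sketch in plain Matlab that does not use BNT.
The parameter names (<tt>gate_w</tt>, <tt>gate_b</tt>, <tt>exp_w</tt>, <tt>exp_b</tt>) and their values are hypothetical, chosen only for illustration.
<pre>
% Hypothetical parameters (not taken from BNT): 2 experts, scalar X and Y.
gate_w = [ 4 -4];   % softmax gating slopes, one per expert
gate_b = [ 0  0];   % softmax gating offsets
exp_w  = [ 1 -1];   % regression slope of each expert
exp_b  = [ 0  2];   % regression intercept of each expert

x = linspace(-2, 2, 5)';                          % column of test inputs
nx = numel(x);
eta = x * gate_w + repmat(gate_b, nx, 1);         % nx-by-2 gating activations
eta = eta - repmat(max(eta, [], 2), 1, 2);        % subtract row maximum for numerical stability
g = exp(eta);
g = g ./ repmat(sum(g, 2), 1, 2);                 % g(m,k) = P(Q=k | x(m)); each row sums to 1
mu = x * exp_w + repmat(exp_b, nx, 1);            % mu(m,k) = mean of expert k at x(m)
yhat = sum(g .* mu, 2);                           % E[Y | x] = sum_k P(Q=k | x) * mu_k(x)
</pre>
With these made-up values, the gate favours expert 1 for x well above 0 and expert 2 for x well below 0; at x = 0 both experts receive weight 0.5.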
Daniel@0	1438 Now let us fit this model using <a href="#em">EM</a>.
Daniel@0	1439 First we <a href="#load_data">load the data</a> (1000 training cases) and plot them.
Daniel@0	1440 <P>
Daniel@0	1441 <PRE>
Daniel@0	1442 data = load('/examples/static/Misc/mixexp_data.txt', '-ascii');
Daniel@0	1443 plot(data(:,1), data(:,2), '.');
Daniel@0	1444 </PRE>
Daniel@0	1445 <p>
Daniel@0	1446 <center>
Daniel@0	1447 <IMG SRC="Figures/mixexp_data.gif">
Daniel@0	1448 </center>
Daniel@0	1449 <p>
Daniel@0	1450 This is what the model looks like before training.
Daniel@0	1451 (Thanks to Thomas Hofmann for writing this plotting routine.)
Daniel@0	1452 <p>
Daniel@0	1453 <center>
Daniel@0	1454 <IMG SRC="Figures/mixexp_before.gif">
Daniel@0	1455 </center>
Daniel@0	1456 <p>
Daniel@0	1457 Now let's train the model, and plot the final performance.
Daniel@0	1458 (We will discuss how to train models in more detail <a href="#param_learning">below</a>.)
Daniel@0	1459 <P>
Daniel@0	1460 <PRE>
Daniel@0	1461 ncases = size(data, 1); % each row of data is a training case
Daniel@0	1462 cases = cell(3, ncases);
Daniel@0	1463 cases([1 3], :) = num2cell(data'); % each column of cases is a training case
Daniel@0	1464 engine = jtree_inf_engine(bnet);
Daniel@0	1465 max_iter = 20;
Daniel@0	1466 [bnet2, LLtrace] = learn_params_em(engine, cases, max_iter);
Daniel@0	1467 </PRE>
Daniel@0	1468 (We specified which nodes are observed when we created the bnet.
Daniel@0	1469 Hence BNT knows that the hidden nodes are all discrete.
Daniel@0	1470 For complex models, this can lead to a significant speedup.)
Daniel@0	1471 Below we show what the model looks like after 16 iterations of EM
Daniel@0	1472 (with 100 IRLS iterations per M step), when it converged
Daniel@0	1473 using the default convergence tolerance (that the
Daniel@0	1474 fractional change in the log-likelihood be less than 1e-3).
Daniel@0	1475 Before learning, the log-likelihood was
Daniel@0	1476 -322.927442; afterwards, it was -13.728778.
Daniel@0	1477 <p>
Daniel@0	1478 <center>
Daniel@0	1479 <IMG SRC="Figures/mixexp_after.gif">
Daniel@0	1480 </center>
Daniel@0	1481 (See BNT/examples/static/mixexp2.m for details of the code.)
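<p>
The convergence test mentioned above (fractional change in the log-likelihood below 1e-3) can be sketched as follows.
The exact formula used inside <tt>learn_params_em</tt> is not stated here, so the definition of the fractional change below (change divided by the average magnitude of the two values) is an assumption, and the trace is made-up data for illustration only.
<pre>
% Hypothetical log-likelihood trace (made-up numbers, not output from BNT).
LLtrace = [-300 -120 -40 -20 -15 -14.99];
thresh = 1e-3;                       % tolerance quoted in the text
converged_at = [];                   % stays empty if the test never fires
for t = 2:numel(LLtrace)
  delta = abs(LLtrace(t) - LLtrace(t-1));
  avg = (abs(LLtrace(t)) + abs(LLtrace(t-1)) + eps) / 2;   % eps guards against 0/0
  if delta / avg < thresh            % assumed definition of "fractional change"
    converged_at = t;
    break;
  end
end
</pre>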
Daniel@0	1482
Daniel@0	1483
Daniel@0	1484
Daniel@0	1485 <h2><a name="hme">Hierarchical mixtures of experts</a></h2>
Daniel@0	1486
Daniel@0	1487 A hierarchical mixture of experts (HME) extends the mixture of experts
Daniel@0	1488 model by having more than one hidden node. A two-level example is shown below, along
Daniel@0	1489 with its more traditional representation as a neural network.
Daniel@0	1490 This is like a (balanced) probabilistic decision tree of height 2.
Daniel@0	1491 <p>
Daniel@0	1492 <center>
Daniel@0	1493 <IMG SRC="Figures/HMEforMatlab.jpg">
Daniel@0	1494 </center>
Daniel@0	1495 <p>
Daniel@0	1496 <a href="mailto:pbrutti@stat.cmu.edu">Pierpaolo Brutti</a>
Daniel@0	1497 has written an extensive set of routines for HMEs,
Daniel@0	1498 which are bundled with BNT: see the examples/static/HME directory.
Daniel@0	1499 These routines allow you to choose the number of hidden (gating)
Daniel@0	1500 layers, and the form of the experts (softmax or MLP).
Daniel@0	1501 See the file hmemenu, which provides a demo.
Daniel@0	1502 For example, the figure below shows the decision boundaries learned
Daniel@0	1503 for a ternary classification problem, using a 2 level HME with softmax
Daniel@0	1504 gates and softmax experts; the training set is on the left, the
Daniel@0	1505 testing set on the right.
Daniel@0	1506 <p>
Daniel@0	1507 <center>
Daniel@0	1508 <!--<IMG SRC="Figures/hme_dec_boundary.gif">-->
Daniel@0	1509 <IMG SRC="Figures/hme_dec_boundary.png">
Daniel@0	1510 </center>
Daniel@0	1511 <p>
Daniel@0	1512
Daniel@0	1513
Daniel@0	1514 <p>
Daniel@0	1515 For more details, see the following:
Daniel@0	1516 <ul>
Daniel@0	1517
Daniel@0	1518 <li> <a href="http://www.cs.berkeley.edu/~jordan/papers/hierarchies.ps.Z">
Daniel@0	1519 Hierarchical mixtures of experts and the EM algorithm</a>
Daniel@0	1520 M. I. Jordan and R. A. Jacobs. Neural Computation, 6, 181-214, 1994.
Daniel@0	1521
Daniel@0	1522 <li> <a href =
Daniel@0	1523 "http://www.cs.berkeley.edu/~dmartin/software">David Martin's
Daniel@0	1524 matlab code for HME</a>
Daniel@0	1525
Daniel@0	1526 <li> <a
Daniel@0	1527 href="http://www.cs.berkeley.edu/~jordan/papers/uai.ps.Z">Why the
Daniel@0	1528 logistic function? A tutorial discussion on
Daniel@0	1529 probabilities and neural networks.</a> M. I. Jordan. MIT Computational
Daniel@0	1530 Cognitive Science Report 9503, August 1995.
Daniel@0	1531
Daniel@0	1532 <li> "Generalized Linear Models", McCullagh and Nelder, Chapman and
Daniel@0	1533 Hall, 1983.
Daniel@0	1534
Daniel@0	1535 <li>
Daniel@0	1536 "Improved learning algorithms for mixtures of experts in multiclass
Daniel@0	1537 classification".
Daniel@0	1538 K. Chen, L. Xu, H. Chi.
Daniel@0	1539 Neural Networks (1999) 12: 1229-1252.
Daniel@0	1540
Daniel@0	1541 <li> <a href="http://www.oigeeza.com/steve/">
Daniel@0	1542 Classification Using Hierarchical Mixtures of Experts</a>
Daniel@0	1543 S.R. Waterhouse and A.J. Robinson.
Daniel@0	1544 In Proc. IEEE Workshop on Neural Networks for Signal Processing IV (1994), pp. 177-186.
Daniel@0	1545
Daniel@0	1546 <li> <a href="http://www.idiap.ch/~perry/">
Daniel@0	1547 Localized mixtures of experts</a>,
Daniel@0	1548 P. Moerland, 1998.
Daniel@0	1549
Daniel@0	1550 <li> "Nonlinear gated experts for time series",
Daniel@0	1551 A.S. Weigend and M. Mangeas, 1995.
Daniel@0	1552
Daniel@0	1553 </ul>
Daniel@0	1554
Daniel@0	1555
Daniel@0	1556 <h2><a name="qmr">QMR</a></h2>
Daniel@0	1557
Daniel@0	1558 Bayes nets originally arose out of an attempt to add probabilities to
Daniel@0	1559 expert systems, and this is still the most common use for BNs.
Daniel@0	1560 A famous example is
Daniel@0	1561 QMR-DT, a decision-theoretic reformulation of the Quick Medical
Daniel@0	1562 Reference (QMR) model.
Daniel@0	1563 <p>
Daniel@0	1564 <center>
Daniel@0	1565 <IMG ALIGN=BOTTOM SRC="Figures/qmr.gif">
Daniel@0	1566 </center>
Daniel@0	1567 Here, the top layer represents hidden disease nodes, and the bottom
Daniel@0	1568 layer represents observed symptom nodes.
Daniel@0	1569 The goal is to infer the posterior probability of each disease given
Daniel@0	1570 all the symptoms (which can be present, absent or unknown).
Daniel@0	1571 Each node in the top layer has a Bernoulli prior (with a low prior
Daniel@0	1572 probability that the disease is present).
Daniel@0	1573 Since each node in the bottom layer has a high fan-in, we use a
Daniel@0	1574 noisy-OR parameterization; each disease has an independent chance of
Daniel@0	1575 causing each symptom.
Daniel@0	1576 The real QMR-DT model is copyrighted, but
Daniel@0	1577 we can create a random QMR-like model as follows.
Daniel@0	1578 <pre>
Daniel@0	1579 function bnet = mk_qmr_bnet(G, inhibit, leak, prior)
Daniel@0	1580 % MK_QMR_BNET Make a QMR model
Daniel@0	1581 % bnet = mk_qmr_bnet(G, inhibit, leak, prior)
Daniel@0	1582 %
Daniel@0	1583 % G(i,j) = 1 iff there is an arc from disease i to finding j
Daniel@0	1584 % inhibit(i,j) = inhibition probability on i->j arc
Daniel@0	1585 % leak(j) = inhibition prob. on leak->j arc
Daniel@0	1586 % prior(i) = prob. disease i is on
Daniel@0	1587
Daniel@0	1588 [Ndiseases Nfindings] = size(inhibit);
Daniel@0	1589 N = Ndiseases + Nfindings;
Daniel@0	1590 finding_node = Ndiseases+1:N;
Daniel@0	1591 ns = 2*ones(1,N);
Daniel@0	1592 dag = zeros(N,N);
Daniel@0	1593 dag(1:Ndiseases, finding_node) = G;
Daniel@0	1594 bnet = mk_bnet(dag, ns, 'observed', finding_node);
Daniel@0	1595
Daniel@0	1596 for d=1:Ndiseases
Daniel@0	1597 CPT = [1-prior(d) prior(d)];
Daniel@0	1598 bnet.CPD{d} = tabular_CPD(bnet, d, CPT');
Daniel@0	1599 end
Daniel@0	1600
Daniel@0	1601 for i=1:Nfindings
Daniel@0	1602 fnode = finding_node(i);
Daniel@0	1603 ps = parents(G, i);
Daniel@0	1604 bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(ps, i));
Daniel@0	1605 end
Daniel@0	1606 </pre>
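<p>
The noisy-OR parameterization itself is easy to evaluate by hand: a finding is off only if the leak and every active parent disease are all inhibited.
A minimal sketch for a single finding, in plain Matlab without BNT; the numbers and the variable names <tt>d</tt>, <tt>p_off</tt> and <tt>p_on</tt> are hypothetical.
<pre>
% Hypothetical numbers for one finding with 3 parent diseases.
inhibit = [0.2 0.5 0.9];   % inhibit(i) = prob. that disease i, when present, fails to cause the finding
leak    = 0.99;            % inhibition prob. on the leak arc
d       = [1 0 1];         % disease states: 1 = present, 0 = absent

p_off = leak * prod(inhibit .^ d);   % absent diseases contribute a factor of 1
p_on  = 1 - p_off;                   % prob. that the finding is present
</pre>
Here only diseases 1 and 3 are present, so p_off = 0.99 * 0.2 * 0.9.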
Daniel@0	1607 In the file BNT/examples/static/qmr1, we create a random bipartite
Daniel@0	1608 graph G, with 5 diseases and 10 findings, and random parameters.
Daniel@0	1609 (In general, to create a random dag, use 'mk_random_dag'.)
Daniel@0	1610 We can visualize the resulting graph structure using
Daniel@0	1611 the methods discussed <a href="#graphdraw">below</a>, with the
Daniel@0	1612 following results:
Daniel@0	1613 <p>
Daniel@0	1614 <img src="Figures/qmr.rnd.jpg">
Daniel@0	1615
Daniel@0	1616 <p>
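Daniel@0	1617 Now let us put some arbitrary evidence on all the leaves except the very
Daniel@0	1618 first and very last, and compute the disease posteriors. (Here diseases = 1:Ndiseases and findings = Ndiseases+1:N are the node numbers of the two layers.)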
Daniel@0	1619 <pre>
Daniel@0	1620 pos = 2:floor(Nfindings/2);
Daniel@0	1621 neg = (pos(end)+1):(Nfindings-1);
Daniel@0	1622 onodes = myunion(pos, neg);
Daniel@0	1623 evidence = cell(1, N);
Daniel@0	1624 evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos)));
Daniel@0	1625 evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg)));
Daniel@0	1626
Daniel@0	1627 engine = jtree_inf_engine(bnet);
Daniel@0	1628 [engine, ll] = enter_evidence(engine, evidence);
Daniel@0	1629 post = zeros(1, Ndiseases);
Daniel@0	1630 for i=diseases(:)'
Daniel@0	1631 m = marginal_nodes(engine, i);
Daniel@0	1632 post(i) = m.T(2);
Daniel@0	1633 end
Daniel@0	1634 </pre>
Daniel@0	1635 Junction tree can be quite slow on large QMR models.
Daniel@0	1636 Fortunately, it is possible to exploit properties of the noisy-OR
Daniel@0	1637 function to speed up exact inference using an algorithm called
Daniel@0	1638 <a href="#quickscore">quickscore</a>, discussed below.
Daniel@0	1639
Daniel@0	1640
Daniel@0	1641
Daniel@0	1642
Daniel@0	1643
Daniel@0	1644 <h2><a name="cg_model">Conditional Gaussian models</a></h2>
Daniel@0	1645
Daniel@0	1646 A conditional Gaussian model is one in which, conditioned on all the discrete
Daniel@0	1647 nodes, the distribution over the remaining (continuous) nodes is
Daniel@0	1648 multivariate Gaussian. This means we can have arcs from discrete (D)
Daniel@0	1649 to continuous (C) nodes, but not vice versa.
Daniel@0	1650 (We <em>are</em> allowed C->D arcs if the continuous nodes are observed,
Daniel@0	1651 as in the <a href="#mixexp">mixture of experts</a> model,
Daniel@0	1652 since this distribution can be represented with a discrete potential.)
Daniel@0	1653 <p>
Daniel@0	1654 We now give an example of a CG model, from
Daniel@0	1655 the paper "Propagation of Probabilities, Means and
Daniel@0	1656 Variances in Mixed Graphical Association Models", Steffen Lauritzen,
Daniel@0	1657 JASA 87(420):1098-1108, 1992 (reprinted in the book "Probabilistic Networks and Expert
Daniel@0	1658 Systems", R. G. Cowell, A. P. Dawid, S. L. Lauritzen and
Daniel@0	1659 D. J. Spiegelhalter, Springer, 1999.)
Daniel@0	1660
Daniel@0	1661 <h3>Specifying the graph</h3>
Daniel@0	1662
Daniel@0	1663 Consider the model of waste emissions from an incinerator plant shown below.
Daniel@0	1664 We follow the standard convention that shaded nodes are observed,
Daniel@0	1665 clear nodes are hidden.
Daniel@0	1666 We also use the non-standard convention that
Daniel@0	1667 square nodes are discrete (tabular) and round nodes are
Daniel@0	1668 Gaussian.
Daniel@0	1669
Daniel@0	1670 <p>
Daniel@0	1671 <center>
Daniel@0	1672 <IMG SRC="Figures/cg1.gif">
Daniel@0	1673 </center>
Daniel@0	1674 <p>
Daniel@0	1675
Daniel@0	1676 We can create this model as follows.
Daniel@0	1677 <pre>
Daniel@0	1678 F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9;
Daniel@0	1679 n = 9;
Daniel@0	1680
Daniel@0	1681 dag = zeros(n);
Daniel@0	1682 dag(F,E)=1;
Daniel@0	1683 dag(W,[E Min D]) = 1;
Daniel@0	1684 dag(E,D)=1;
Daniel@0	1685 dag(B,[C D])=1;
Daniel@0	1686 dag(D,[L Mout])=1;
Daniel@0	1687 dag(Min,Mout)=1;
Daniel@0	1688
Daniel@0	1689 % node sizes - all cts nodes are scalar, all discrete nodes are binary
Daniel@0	1690 ns = ones(1, n);
Daniel@0	1691 dnodes = [F W B];
Daniel@0	1692 cnodes = mysetdiff(1:n, dnodes);
Daniel@0	1693 ns(dnodes) = 2;
Daniel@0	1694
Daniel@0	1695 bnet = mk_bnet(dag, ns, 'discrete', dnodes);
Daniel@0	1696 </pre>
Daniel@0	1697 'dnodes' is a list of the discrete nodes; 'cnodes' is a list of the continuous
Daniel@0	1698 nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'.
Daniel@0	1699 <p>
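The restriction stated earlier (arcs may go from discrete to continuous nodes, but not vice versa) can be checked directly on the adjacency matrix.
The following is a minimal sketch that rebuilds the graph with built-in functions only; the names <tt>c_to_d</tt> and <tt>is_cg</tt> are hypothetical, not part of BNT.
<pre>
F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9;
n = 9;
dag = zeros(n);
dag(F,E) = 1;
dag(W,[E Min D]) = 1;
dag(E,D) = 1;
dag(B,[C D]) = 1;
dag(D,[L Mout]) = 1;
dag(Min,Mout) = 1;

dnodes = [F W B];
cnodes = setdiff(1:n, dnodes);   % built-in setdiff, so the sketch does not depend on BNT

c_to_d = dag(cnodes, dnodes);    % arcs from continuous parents to discrete children
is_cg = ~any(c_to_d(:));         % true when there is no continuous-to-discrete arc
</pre>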
Daniel@0	1700
Daniel@0	1701
Daniel@0	1702 <h3>Specifying the parameters</h3>
Daniel@0	1703
Daniel@0	1704 The parameters of the discrete nodes can be specified as follows.
Daniel@0	1705 <pre>
Daniel@0	1706 bnet.CPD{B} = tabular_CPD(bnet, B, 'CPT', [0.85 0.15]); % 1=stable, 2=unstable
Daniel@0	1707 bnet.CPD{F} = tabular_CPD(bnet, F, 'CPT', [0.95 0.05]); % 1=intact, 2=defect
Daniel@0	1708 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [2/7 5/7]); % 1=industrial, 2=household
Daniel@0	1709 </pre>
Daniel@0	1710
Daniel@0	1711 <p>
Daniel@0	1712 The parameters of the continuous nodes can be specified as follows.
Daniel@0	1713 <pre>
Daniel@0	1714 bnet.CPD{E} = gaussian_CPD(bnet, E, 'mean', [-3.9 -0.4 -3.2 -0.5], ...
Daniel@0	1715 'cov', [0.00002 0.0001 0.00002 0.0001]);
Daniel@0	1716 bnet.CPD{D} = gaussian_CPD(bnet, D, 'mean', [6.5 6.0 7.5 7.0], ...
Daniel@0	1717 'cov', [0.03 0.04 0.1 0.1], 'weights', [1 1 1 1]);
Daniel@0	1718 bnet.CPD{C} = gaussian_CPD(bnet, C, 'mean', [-2 -1], 'cov', [0.1 0.3]);
Daniel@0	1719 bnet.CPD{L} = gaussian_CPD(bnet, L, 'mean', 3, 'cov', 0.25, 'weights', -0.5);
Daniel@0	1720 bnet.CPD{Min} = gaussian_CPD(bnet, Min, 'mean', [0.5 -0.5], 'cov', [0.01 0.005]);
Daniel@0	1721 bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 'mean', 0, 'cov', 0.002, 'weights', [1 1]);
Daniel@0	1722 </pre>
Daniel@0	1723
Daniel@0	1724
Daniel@0	1725 <h3><a name="cg_infer">Inference</a></h3>
Daniel@0	1726
Daniel@0	1727 <!--Let us perform inference in the <a href="#cg_model">waste incinerator example</a>.-->
Daniel@0	1728 First we compute the unconditional marginals.
Daniel@0	1729 <pre>
Daniel@0	1730 engine = jtree_inf_engine(bnet);
Daniel@0	1731 evidence = cell(1,n);
Daniel@0	1732 [engine, ll] = enter_evidence(engine, evidence);
Daniel@0	1733 marg = marginal_nodes(engine, E);
Daniel@0	1734 </pre>
Daniel@0	1735 <!--(Of course, we could use <tt>cond_gauss_inf_engine</tt> instead of jtree.)-->
Daniel@0	1736 'marg' is a structure that contains the fields 'mu' and 'Sigma', which
Daniel@0	1737 contain the mean and (co)variance of the marginal on E.
Daniel@0	1738 In this case, they are both scalars.
Daniel@0	1739 Let us check they match the published figures (to 2 decimal places).
Daniel@0	1740 <!--(We can't expect
Daniel@0	1741 more precision than this in general because I have implemented the algorithm of
Daniel@0	1742 Lauritzen (1992), which can be numerically unstable.)-->
Daniel@0	1743 <pre>
Daniel@0	1744 tol = 1e-2;
Daniel@0	1745 assert(approxeq(marg.mu, -3.25, tol));
Daniel@0	1746 assert(approxeq(sqrt(marg.Sigma), 0.709, tol));
Daniel@0	1747 </pre>
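<p>
These figures can also be reproduced without an inference engine, because E has only the two discrete root parents F and W, so its prior marginal is a mixture of four Gaussians.
The following is a minimal sketch, assuming that the entries of 'mean' and 'cov' are ordered with F toggling fastest (Matlab's column-major convention); the variable names are illustrative.
<pre>
pF = [0.95 0.05];                          % P(F): 1=intact, 2=defect
pW = [2/7 5/7];                            % P(W): 1=industrial, 2=household
mu_E  = [-3.9 -0.4 -3.2 -0.5];             % component means of E
cov_E = [0.00002 0.0001 0.00002 0.0001];   % component variances of E

w = pF' * pW;    % w(f,k) = P(F=f) * P(W=k); F and W are independent roots
w = w(:)';       % flatten column-wise, i.e. F toggles fastest (assumed ordering)

m = sum(w .* mu_E);                        % mixture mean
v = sum(w .* (cov_E + mu_E.^2)) - m^2;     % mixture variance = E[X^2] - (E[X])^2
s = sqrt(v);                               % standard deviation
</pre>
Under that ordering assumption, m and s agree with the published values -3.25 and 0.709 to 2 decimal places.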
Daniel@0	1748 We can compute the other posteriors similarly.
Daniel@0	1749 Now let us add some evidence.
Daniel@0	1750 <pre>
Daniel@0	1751 evidence = cell(1,n);
Daniel@0	1752 evidence{W} = 1; % industrial
Daniel@0	1753 evidence{L} = 1.1;
Daniel@0	1754 evidence{C} = -0.9;
Daniel@0	1755 [engine, ll] = enter_evidence(engine, evidence);
Daniel@0	1756 </pre>
Daniel@0	1757 Now we find
Daniel@0	1758 <pre>
Daniel@0	1759 marg = marginal_nodes(engine, E);
Daniel@0	1760 assert(approxeq(marg.mu, -3.8983, tol));
Daniel@0	1761 assert(approxeq(sqrt(marg.Sigma), 0.0763, tol));
Daniel@0	1762 </pre>
Daniel@0	1763
Daniel@0	1764
Daniel@0	1765 We can also compute the joint probability on a set of nodes.
Daniel@0	1766 For example, P(D, Mout | evidence) is a 2D Gaussian:
Daniel@0	1767 <pre>
Daniel@0	1768 marg = marginal_nodes(engine, [D Mout])
Daniel@0	1769 marg =
Daniel@0	1770 domain: [6 8]
Daniel@0	1771 mu: [2x1 double]
Daniel@0	1772 Sigma: [2x2 double]
Daniel@0	1773 T: 1.0000
Daniel@0	1774 </pre>
Daniel@0	1775 The mean is
Daniel@0	1776 <pre>
Daniel@0	1777 marg.mu
Daniel@0	1778 ans =
Daniel@0	1779 3.6077
Daniel@0	1780 4.1077
Daniel@0	1781 </pre>
Daniel@0	1782 and the covariance matrix is
Daniel@0	1783 <pre>
Daniel@0	1784 marg.Sigma
Daniel@0	1785 ans =
Daniel@0	1786 0.1062 0.1062
Daniel@0	1787 0.1062 0.1182
Daniel@0	1788 </pre>
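<p>
As a sanity check on these numbers, we can condition the 2D Gaussian on D using the standard formulas for a bivariate Gaussian.
The following is a minimal sketch that uses only the rounded mean and covariance printed above; the names <tt>slope</tt>, <tt>offset</tt> and <tt>cond_var</tt> are illustrative.
<pre>
mu    = [3.6077; 4.1077];                 % posterior mean of [D; Mout], as printed above
Sigma = [0.1062 0.1062; 0.1062 0.1182];   % posterior covariance, as printed above

slope    = Sigma(2,1) / Sigma(1,1);                  % regression coefficient of Mout on D
offset   = mu(2) - slope * mu(1);                    % E[Mout | D=d] = offset + slope*d
cond_var = Sigma(2,2) - Sigma(2,1)^2 / Sigma(1,1);   % Var[Mout | D]
</pre>
The result (slope 1, offset 0.5, conditional variance 0.012) is what one would expect from the CPDs specified earlier: Mout has weight 1 on both D and Min and noise variance 0.002, and for industrial waste Min has mean 0.5 and variance 0.01.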
Daniel@0	1789 It is easy to visualize this posterior using standard Matlab plotting
Daniel@0	1790 functions, e.g.,
Daniel@0	1791 <pre>
Daniel@0	1792 gaussplot2d(marg.mu, marg.Sigma);
Daniel@0	1793 </pre>
Daniel@0	1794 produces the following picture.
Daniel@0	1795
Daniel@0	1796 <p>
Daniel@0	1797 <center>
Daniel@0	1798 <IMG SRC="Figures/gaussplot.png">
Daniel@0	1799 </center>
Daniel@0	1800 <p>
Daniel@0	1801
Daniel@0	1802
Daniel@0	1803 The T field indicates that the mixing weight of this Gaussian
Daniel@0	1804 component is 1.0.
Daniel@0	1805 If the joint contains discrete and continuous variables, the result
Daniel@0	1806 will be a mixture of Gaussians, e.g.,
Daniel@0	1807 <pre>
Daniel@0	1808 marg = marginal_nodes(engine, [F E])
Daniel@0	1809 domain: [1 3]
Daniel@0	1810 mu: [-3.9000 -0.4003]
Daniel@0	1811 Sigma: [1x1x2 double]
Daniel@0	1812 T: [0.9995 4.7373e-04]
Daniel@0	1813 </pre>
Daniel@0	1814 The interpretation is
Daniel@0	1815 Sigma(i,j,k) = Cov[ E(i) E(j) | F=k ].
Daniel@0	1816 In this case, E is a scalar, so i=j=1; k specifies the mixture component.
Daniel@0	1817 <p>
Daniel@0	1818 We saw in the sprinkler network that BNT sets the effective size of
Daniel@0	1819 observed discrete nodes to 1, since they only have one legal value.
Daniel@0	1820 For continuous nodes, BNT sets their length to 0,
Daniel@0	1821 since they have been reduced to a point.
Daniel@0	1822 For example,
Daniel@0	1823 <pre>
Daniel@0	1824 marg = marginal_nodes(engine, [B C])
Daniel@0	1825 domain: [4 5]
Daniel@0	1826 mu: []
Daniel@0	1827 Sigma: []
Daniel@0	1828 T: [0.0123 0.9877]
Daniel@0	1829 </pre>
Daniel@0	1830 It is simple to post-process the output of marginal_nodes.
Daniel@0	1831 For example, the file BNT/examples/static/cg1 sets the mu term of
Daniel@0	1832 observed nodes to their observed value, and the Sigma term to 0 (since
Daniel@0	1833 observed nodes have no variance).
Daniel@0	1834
Daniel@0	1835 <p>
Daniel@0	1836 Note that the implemented version of the junction tree is numerically
Daniel@0	1837 unstable when using CG potentials
Daniel@0	1838 (which is why, in the example above, we only required our answers to agree with
Daniel@0	1839 the published ones to 2 decimal places).
Daniel@0	1840 For this reason, you might want to use <tt>stab_cond_gauss_inf_engine</tt>,
Daniel@0	1841 implemented by Shan Huang. The underlying algorithm is described in
Daniel@0	1842
Daniel@0	1843 <ul>
Daniel@0	1844 <li> "Stable Local Computation with Conditional Gaussian Distributions",
Daniel@0	1845 S. Lauritzen and F. Jensen, Tech Report R-99-2014,
Daniel@0	1846 Dept. Math. Sciences, Aalborg Univ., 1999.
Daniel@0	1847 </ul>
Daniel@0	1848
Daniel@0	1849 However, even the numerically stable version
Daniel@0	1850 can be computationally intractable if there are many hidden discrete
Daniel@0	1851 nodes, because the number of mixture components grows exponentially, e.g., in a
Daniel@0	1852 <a href="usage_dbn.html#lds">switching linear dynamical system</a>.
Daniel@0	1853 In general, one must resort to approximate inference techniques: see
Daniel@0	1854 the discussion on <a href="#engines">inference engines</a> below.
Daniel@0	1855
Daniel@0	1856
Daniel@0	1857 <h2><a name="hybrid">Other hybrid models</a></h2>
Daniel@0	1858
Daniel@0	1859 When we have C->D arcs, where C is hidden, we need to use
Daniel@0	1860 approximate inference.
Daniel@0	1861 One approach (not implemented in BNT) is described in
Daniel@0	1862 <ul>
Daniel@0	1863 <li> <a
Daniel@0	1864 href="http://www.cs.berkeley.edu/~murphyk/Papers/hybrid_uai99.ps.gz">A
Daniel@0	1865 Variational Approximation for Bayesian Networks with
Daniel@0	1866 Discrete and Continuous Latent Variables</a>,
Daniel@0	1867 K. Murphy, UAI 99.
Daniel@0	1868 </ul>
Daniel@0	1869 Of course, one can always use <a href="#sampling">sampling</a> methods
Daniel@0	1870 for approximate inference in such models.
Daniel@0	1871
Daniel@0	1872
Daniel@0	1873
Daniel@0	1874 <h1><a name="param_learning">Parameter Learning</a></h1>
Daniel@0	1875
Daniel@0	1876 The parameter estimation routines in BNT can be classified into 4
Daniel@0	1877 types, depending on whether the goal is to compute
Daniel@0	1878 a full (Bayesian) posterior over the parameters or just a point
Daniel@0	1879 estimate (e.g., Maximum Likelihood or Maximum A Posteriori),
Daniel@0	1880 and whether all the variables are fully observed or there is missing
Daniel@0	1881 data/hidden variables (partial observability).
Daniel@0	1882 <p>
Daniel@0	1883
Daniel@0	1884 <TABLE BORDER>
Daniel@0	1885 <tr>
Daniel@0	1886 <TH></TH>
Daniel@0	1887 <th>Full obs</th>
Daniel@0	1888 <th>Partial obs</th>
Daniel@0	1889 </tr>
Daniel@0	1890 <tr>
Daniel@0	1891 <th>Point</th>
Daniel@0	1892 <td><tt>learn_params</tt></td>
Daniel@0	1893 <td><tt>learn_params_em</tt></td>
Daniel@0	1894 </tr>
Daniel@0	1895 <tr>
Daniel@0	1896 <th>Bayes</th>
Daniel@0	1897 <td><tt>bayes_update_params</tt></td>
Daniel@0	1898 <td>not yet supported</td>
Daniel@0	1899 </tr>
Daniel@0	1900 </table>
Daniel@0	1901
Daniel@0	1902
Daniel@0	1903 <h2><a name="load_data">Loading data from a file</a></h2>
Daniel@0	1904
Daniel@0	1905 To load numeric data from an ASCII text file called 'dat.txt', where each row is a
Daniel@0	1906 case and columns are separated by white-space, such as
Daniel@0	1907 <pre>
Daniel@0	1908 011979 1626.5 0.0
Daniel@0	1909 021979 1367.0 0.0
Daniel@0	1910 ...
Daniel@0	1911 </pre>
Daniel@0	1912 you can use
Daniel@0	1913 <pre>
Daniel@0	1914 data = load('dat.txt');
Daniel@0	1915 </pre>
Daniel@0	1916 or
Daniel@0	1917 <pre>
Daniel@0	1918 load dat.txt -ascii
Daniel@0	1919 </pre>
Daniel@0	1920 In the latter case, the data is stored in a variable called 'dat' (the
Daniel@0	1921 filename minus the extension).
Daniel@0	1922 Alternatively, suppose the data is stored in a .csv file (has commas
Daniel@0	1923 separating the columns, and contains a header line), such as
Daniel@0	1924 <pre>
Daniel@0	1925 header info goes here
Daniel@0	1926 ORD,011979,1626.5,0.0
Daniel@0	1927 DSM,021979,1367.0,0.0
Daniel@0	1928 ...
Daniel@0	1929 </pre>
Daniel@0	1930 You can load this using
Daniel@0	1931 <pre>
Daniel@0	1932 [a,b,c,d] = textread('dat.txt', '%s %d %f %f', 'delimiter', ',', 'headerlines', 1);
Daniel@0	1933 </pre>
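<p>
In recent Matlab releases, <tt>textread</tt> is documented as not recommended in favour of <tt>textscan</tt>.
The following is a minimal sketch of the same read using <tt>textscan</tt>; it first writes a small temporary file so that it is self-contained, and the file contents and variable names are illustrative only.
<pre>
% Write a small comma-separated file with one header line (toy data).
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'header info goes here\n');
fprintf(fid, 'ORD,011979,1626.5,0.0\n');
fprintf(fid, 'DSM,021979,1367.0,0.0\n');
fclose(fid);

% Read it back: one string column, one integer column, two floating-point columns.
fid = fopen(fname, 'r');
C = textscan(fid, '%s %d %f %f', 'Delimiter', ',', 'HeaderLines', 1);
fclose(fid);
delete(fname);

codes  = C{1};   % cell array of strings, one per row
values = C{3};   % column vector of doubles
</pre>
Unlike <tt>textread</tt>, <tt>textscan</tt> returns all columns in a single cell array, one cell per format specifier.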
Daniel@0	1934 If your file is not in either of these formats, you can either use Perl to convert
Daniel@0	1935 it to this format, or use the Matlab fscanf command.
Daniel@0	1936 Type
Daniel@0	1937 <tt>
Daniel@0	1938 help iofun
Daniel@0	1939 </tt>
Daniel@0	1940 for more information on Matlab's file functions.
Daniel@0	1941 <!--
Daniel@0	1942 <p>
Daniel@0	1943 To load data directly from Excel,
Daniel@0	1944 you should buy the
Daniel@0	1945 <a href="http://www.mathworks.com/products/excellink/">Excel Link</a>.
Daniel@0	1946 To load data directly from a relational database,
Daniel@0	1947 you should buy the
Daniel@0	1948 <a href="http://www.mathworks.com/products/database">Database
Daniel@0	1949 toolbox</a>.
Daniel@0	1950 -->
Daniel@0	1951 <p>
Daniel@0	1952 BNT learning routines require data to be stored in a cell array.
Daniel@0	1953 data{i,m} is the value of node i in case (example) m, i.e., each
Daniel@0	1954 <em>column</em> is a case.
Daniel@0	1955 If node i is not observed in case m (missing value), set
Daniel@0	1956 data{i,m} = [].
Daniel@0	1957 (Not all the learning routines can cope with such missing values, however.)
Daniel@0	1958 In the special case that all the nodes are observed and are
Daniel@0	1959 scalar-valued (as opposed to vector-valued), the data can be
Daniel@0	1960 stored in a matrix (as opposed to a cell-array).
Daniel@0	1961 <p>
Daniel@0	1962 Suppose, as in the <a href="#mixexp">mixture of experts example</a>,
Daniel@0	1963 that we have 3 nodes in the graph: X(1) is the observed input, X(3) is
Daniel@0	1964 the observed output, and X(2) is a hidden (gating) node. We can
Daniel@0	1965 create the dataset as follows.
Daniel@0	1966 <pre>
Daniel@0	1967 data = load('dat.txt');
Daniel@0	1968 ncases = size(data, 1);
Daniel@0	1969 cases = cell(3, ncases);
Daniel@0	1970 cases([1 3], :) = num2cell(data');
Daniel@0	1971 </pre>
Daniel@0	1972 Notice how we transposed the data, to convert rows into columns.
Daniel@0	1973 Also, cases{2,m} = [] for all m, since X(2) is always hidden.
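<p>
The same construction can be tried without a data file.
The following is a minimal sketch in which a small literal matrix stands in for the contents of 'dat.txt'; the numbers and the name <tt>hidden</tt> are made up for illustration.
<pre>
data = [0.1 1.2; 0.5 0.9; 0.9 0.3];   % toy stand-in for load('dat.txt'): 3 cases, columns = [X(1) X(3)]
ncases = size(data, 1);               % each row of data is a case
cases = cell(3, ncases);              % every entry starts out as [] (unobserved)
cases([1 3], :) = num2cell(data');    % fill in the observed nodes; each column is a case
hidden = cellfun(@isempty, cases);    % hidden(i,m) is true where node i is unobserved in case m
</pre>
Row 2 of <tt>hidden</tt> is all true, since the gating node is never observed.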
Daniel@0	1974
Daniel@0	1975
Daniel@0	1976 <h2><a name="mle_complete">Maximum likelihood parameter estimation from complete data</a></h2>
Daniel@0	1977
Daniel@0	1978 As an example, let's generate some data from the sprinkler network, randomize the parameters,
Daniel@0	1979 and then try to recover the original model.
Daniel@0	1980 First we create some training data using forwards sampling.
Daniel@0	1981 <pre>
Daniel@0	1982 samples = cell(N, nsamples);
Daniel@0	1983 for i=1:nsamples
Daniel@0	1984 samples(:,i) = sample_bnet(bnet);
Daniel@0	1985 end
Daniel@0	1986 </pre>
Daniel@0	1987 samples{j,i} contains the value of the j'th node in case i.
Daniel@0	1988 sample_bnet returns a cell array because, in general, each node might
Daniel@0	1989 be a vector of different length.
Daniel@0	1990 In this case, all nodes are discrete (and hence scalars), so we
Daniel@0	1991 could have used a regular array instead (which can be quicker):
Daniel@0	1992 <pre>
Daniel@0	1993 data = cell2num(samples);
Daniel@0	1994 </pre>
Daniel@0	1995 So now data(j,i) = samples{j,i}.
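<p>
Conceptually, forwards sampling draws each node after its parents, using the parents' sampled values to select the relevant row of the CPT.
The following is a minimal sketch for a hypothetical two-node network A->B in plain Matlab; it does not show how <tt>sample_bnet</tt> is implemented, and the helper <tt>draw</tt> and all numbers are made up for illustration.
<pre>
% Hypothetical network A->B with two binary nodes.
pA = [0.3 0.7];                    % P(A)
pB_given_A = [0.9 0.1; 0.2 0.8];   % row j holds P(B | A=j)

% Draw an index from a discrete distribution p (entries sum to 1).
draw = @(p) 1 + sum(rand > cumsum(p(1:end-1)));

nsamples = 5;
samples = cell(2, nsamples);
for i = 1:nsamples
  a = draw(pA);                    % sample the parent first
  b = draw(pB_given_A(a, :));      % then the child, given the parent's value
  samples(:, i) = {a; b};          % column i is one case
end
data = cell2mat(samples);          % 2-by-nsamples matrix; valid because every cell holds a scalar
</pre>
Here the built-in <tt>cell2mat</tt> is used for the conversion, so the sketch runs without BNT on the path.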
Daniel@0	1996 <p>
Daniel@0	1997 Now we create a network with random parameters.
Daniel@0	1998 (The initial values of bnet2 don't matter in this case, since we can find the
Daniel@0	1999 globally optimal MLE independent of where we start.)
Daniel@0	2000 <pre>
Daniel@0	2001 % Make a tabula rasa
Daniel@0	2002 bnet2 = mk_bnet(dag, node_sizes);
Daniel@0	2003 seed = 0;
Daniel@0	2004 rand('state', seed);
Daniel@0	2005 bnet2.CPD{C} = tabular_CPD(bnet2, C);
Daniel@0	2006 bnet2.CPD{R} = tabular_CPD(bnet2, R);
Daniel@0	2007 bnet2.CPD{S} = tabular_CPD(bnet2, S);
Daniel@0	2008 bnet2.CPD{W} = tabular_CPD(bnet2, W);
Daniel@0	2009 </pre>
Daniel@0	2010 Finally, we find the maximum likelihood estimates of the parameters.
Daniel@0	2011 <pre>
Daniel@0	2012 bnet3 = learn_params(bnet2, samples);
Daniel@0	2013 </pre>
Daniel@0	2014 To view the learned parameters, we use a little Matlab hackery.
Daniel@0	2015 <pre>
Daniel@0	2016 CPT3 = cell(1,N);
Daniel@0	2017 for i=1:N
Daniel@0	2018 s=struct(bnet3.CPD{i}); % violate object privacy
Daniel@0	2019 CPT3{i}=s.CPT;
Daniel@0	2020 end
Daniel@0	2021 </pre>
Daniel@0	2022 Here are the parameters learned for node 4.
Daniel@0	2023 <pre>
Daniel@0	2024 dispcpt(CPT3{4})
Daniel@0	2025 1 1 : 1.0000 0.0000
Daniel@0	2026 2 1 : 0.2000 0.8000
Daniel@0	2027 1 2 : 0.2273 0.7727
Daniel@0	2028 2 2 : 0.0000 1.0000
Daniel@0	2029 </pre>
Daniel@0	2030 So we see that the learned parameters are fairly close to the "true"
Daniel@0	2031 ones, which we display below.
Daniel@0	2032 <pre>
Daniel@0	2033 dispcpt(CPT{4})
Daniel@0	2034 1 1 : 1.0000 0.0000
Daniel@0	2035 2 1 : 0.1000 0.9000
Daniel@0	2036 1 2 : 0.1000 0.9000
Daniel@0	2037 2 2 : 0.0100 0.9900
Daniel@0	2038 </pre>
Daniel@0	2039 We can get better results by using a larger training set, or using
Daniel@0	2040 informative priors (see <a href="#prior">below</a>).
Daniel@0	2041
Daniel@0	2042
Daniel@0	2043
Daniel@0	2044 <h2><a name="prior">Parameter priors</a></h2>
Daniel@0	2045
Daniel@0	2046 Currently, only tabular CPDs can have priors on their parameters.
Daniel@0	2047 The conjugate prior for a multinomial is the Dirichlet.
Daniel@0	2048 (For binary random variables, the multinomial is the same as the
Daniel@0	2049 Bernoulli, and the Dirichlet is the same as the Beta.)
Daniel@0	2050 <p>
Daniel@0	2051 The Dirichlet has a simple interpretation in terms of pseudo counts.
Daniel@0	2052 If we let N_ijk = the num. times X_i=k and Pa_i=j occurs in the
Daniel@0	2053 training set, where Pa_i are the parents of X_i,
Daniel@0	2054 then the maximum likelihood (ML) estimate is
Daniel@0	2055 T_ijk = N_ijk / N_ij (where N_ij = sum_k' N_ijk'), which will be 0 if N_ijk=0.
Daniel@0	2056 To prevent us from declaring that (X_i=k, Pa_i=j) is impossible just because this
Daniel@0	2057 event was not seen in the training set,
Daniel@0	2058 we can pretend that, for each value j of Pa_i, we saw value k of X_i some number (alpha_ijk)
Daniel@0	2059 of times in the past.
Daniel@0	2060 The MAP (maximum a posteriori) estimate is then
Daniel@0	2061 <pre>
Daniel@0	2062 T_ijk = (N_ijk + alpha_ijk) / (N_ij + alpha_ij)
Daniel@0	2063 </pre>
Daniel@0	2064 (where alpha_ij = sum_k' alpha_ijk'), and is never 0 if all alpha_ijk > 0.
Daniel@0	2065 For example, consider the network A->B, where A is binary and B has 3
Daniel@0	2066 values.
Daniel@0	2067 A uniform prior for B has the form
Daniel@0	2068 <pre>
Daniel@0	2069 B=1 B=2 B=3
Daniel@0	2070 A=1 1 1 1
Daniel@0	2071 A=2 1 1 1
Daniel@0	2072 </pre>
Daniel@0	2073 which can be created using
Daniel@0	2074 <pre>
Daniel@0	2075 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'unif');
Daniel@0	2076 </pre>
Daniel@0	2077 This prior does not satisfy the likelihood equivalence principle,
Daniel@0	2078 which says that <a href="#markov_equiv">Markov equivalent</a> models
Daniel@0	2079 should have the same marginal likelihood.
Daniel@0	2080 A prior that does satisfy this principle is shown below.
Daniel@0	2081 Heckerman (1995) calls this the
Daniel@0	2082 BDeu prior (likelihood equivalent uniform Bayesian Dirichlet).
Daniel@0	2083 <pre>
Daniel@0	2084 B=1 B=2 B=3
Daniel@0	2085 A=1 1/6 1/6 1/6
Daniel@0	2086 A=2 1/6 1/6 1/6
Daniel@0	2087 </pre>
Daniel@0	2088 where we put N/(q*r) in each bin; N is the equivalent sample size,
Daniel@0	2089 r=|A|, q = |B|.
Daniel@0	2090 This can be created as follows:
Daniel@0	2091 <pre>
Daniel@0	2092 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu');
Daniel@0	2093 </pre>
Daniel@0	2094 Here, the equivalent sample size is 1; it controls the strength of the
Daniel@0	2095 prior.
Daniel@0	2096 You can change this using
Daniel@0	2097 <pre>
Daniel@0	2098 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', ...
Daniel@0	2099 'BDeu', 'dirichlet_weight', 10);
Daniel@0	2100 </pre>
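<p>
The effect of the pseudo counts can be seen on a small table of counts.
The following is a minimal sketch in plain Matlab for the network A->B above, using BDeu pseudo counts with equivalent sample size 1; the counts are made up, and the helper <tt>row_norm</tt> and the other variable names are hypothetical, not part of BNT.
<pre>
% Hypothetical counts for A->B (A binary, B ternary):
% counts(j,k) = number of training cases with A=j and B=k.
counts = [3 0 7; 5 5 0];
[nA, nB] = size(counts);
ess = 1;                                    % equivalent sample size
alpha = (ess / (nA*nB)) * ones(nA, nB);     % BDeu pseudo counts: 1/6 in each bin

row_norm = @(M) M ./ repmat(sum(M, 2), 1, size(M, 2));   % make each row sum to 1
T_ml  = row_norm(counts);            % ML:  N_ijk / N_ij
T_map = row_norm(counts + alpha);    % MAP: (N_ijk + alpha_ijk) / (N_ij + alpha_ij)
</pre>
With these counts, the ML estimate of P(B=2 | A=1) is exactly 0, whereas the MAP estimate is small but positive.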
Daniel@0	2101 <!--where counts is an array of pseudo-counts of the same size as the
Daniel@0	2102 CPT.-->
Daniel@0	2103 <!--
Daniel@0	2104 <p>
Daniel@0	2105 When you specify a prior, you should set row i of the CPT to the
Daniel@0	2106 normalized version of row i of the pseudo-count matrix, i.e., to the
Daniel@0	2107 expected values of the parameters. This will ensure that computing the
Daniel@0	2108 marginal likelihood sequentially (see <a
Daniel@0	2109 href="#bayes_learn">below</a>) and in batch form gives the same
Daniel@0	2110 results.
Daniel@0	2111 To do this, proceed as follows.
Daniel@0	2112 <pre>
Daniel@0	2113 tabular_CPD(bnet, i, 'prior', counts, 'CPT', mk_stochastic(counts));
Daniel@0	2114 </pre>
Daniel@0	2115 For a non-informative prior, you can just write
Daniel@0	2116 <pre>
Daniel@0	2117 tabular_CPD(bnet, i, 'prior', 'unif', 'CPT', 'unif');
Daniel@0	2118 </pre>
Daniel@0	2119 -->
Daniel@0	2120
Daniel@0	2121
Daniel@0	2122 <h2><a name="bayes_learn">(Sequential) Bayesian parameter updating from complete data</a></h2>
Daniel@0	2123
Daniel@0	2124 If we use conjugate priors and have fully observed data, we can
Daniel@0	2125 compute the posterior over the parameters in batch form as follows.
Daniel@0	2126 <pre>
Daniel@0	2127 cases = sample_bnet(bnet, nsamples);
Daniel@0	2128 bnet = bayes_update_params(bnet, cases);
Daniel@0	2129 LL = log_marg_lik_complete(bnet, cases);
Daniel@0	2130 </pre>
Daniel@0	2131 bnet.CPD{i}.prior contains the new Dirichlet pseudocounts,
Daniel@0	2132 and bnet.CPD{i}.CPT is set to the mean of the posterior (the
Daniel@0	2133 normalized counts).
Daniel@0	2134 (Hence if the initial pseudo counts are 0,
Daniel@0	2135 <tt>bayes_update_params</tt> and <tt>learn_params</tt> will give the
Daniel@0	2136 same result.)
Daniel@0	2137
Daniel@0	2138
Daniel@0	2139
Daniel@0	2140
Daniel@0	2141 <p>
Daniel@0	2142 We can compute the same result sequentially (on-line) as follows.
Daniel@0	2143 <pre>
Daniel@0	2144 LL = 0;
Daniel@0	2145 for m=1:nsamples
Daniel@0	2146 LL = LL + log_marg_lik_complete(bnet, cases(:,m));
Daniel@0	2147 bnet = bayes_update_params(bnet, cases(:,m));
Daniel@0	2148 end
Daniel@0	2149 </pre>
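<p>
The agreement between the batch and sequential computations can be checked by hand for a single discrete node with a Dirichlet prior. The following plain-Python sketch (independent of BNT; both function names are made up for this illustration) computes the log marginal likelihood both ways. In the sequential version each case is scored with the current posterior predictive probability before the pseudo-counts are updated, mirroring the loop above.

```python
import math


def log_marg_lik_batch(alpha, data):
    """Closed-form Dirichlet-multinomial log marginal likelihood."""
    counts = [0] * len(alpha)
    for x in data:
        counts[x] += 1
    a0 = sum(alpha)
    ll = math.lgamma(a0) - math.lgamma(a0 + len(data))
    for a, n in zip(alpha, counts):
        ll += math.lgamma(a + n) - math.lgamma(a)
    return ll


def log_marg_lik_sequential(alpha, data):
    """Score each case under the current posterior, then update the counts."""
    alpha = list(alpha)          # do not modify the caller's prior
    ll = 0.0
    for x in data:
        ll += math.log(alpha[x] / sum(alpha))   # predictive probability of x
        alpha[x] += 1                           # Bayesian update of the pseudo-counts
    return ll, alpha


data = [0, 1, 1, 2, 1, 0]        # values of a 3-state node, 0-based
prior = [0.5, 0.5, 0.5]
print(log_marg_lik_batch(prior, data))
print(log_marg_lik_sequential(prior, data))
```

The two log-likelihoods agree up to floating-point error, and the returned pseudo-counts are the prior plus the observed counts.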
Daniel@0	2150
Daniel@0	2151 The file <tt>BNT/examples/static/StructLearn/model_select1</tt> has an example of
Daniel@0	2152 sequential model selection which uses the same idea.
Daniel@0	2153 We generate data from the model A->B
Daniel@0	2154 and compute the posterior probability of all 3 DAGs on 2 nodes:
Daniel@0	2155 (1) A B, (2) A <- B, (3) A -> B.
Daniel@0	2156 Models 2 and 3 are <a href="#markov_equiv">Markov equivalent</a>, and therefore indistinguishable from
Daniel@0	2157 observational data alone, so we expect their posteriors to be the same
Daniel@0	2158 (assuming a prior which satisfies likelihood equivalence).
Daniel@0	2159 If we use random parameters, the "true" model only gets a higher posterior after 2000 trials!
Daniel@0	2160 However, if we make B a noisy NOT gate, the true model "wins" after 12
Daniel@0	2161 trials, as shown below (red = model 1, blue/green (superimposed)
Daniel@0	2162 represents models 2/3).
Daniel@0	2163 <p>
Daniel@0	2164 <img src="Figures/model_select.png">
Daniel@0	2165 <p>
Daniel@0	2166 The use of marginal likelihood for model selection is discussed in
Daniel@0	2167 greater detail in the
Daniel@0	2168 section on <a href="#structure_learning">structure learning</a>.
Daniel@0	2169
Daniel@0	2170
Daniel@0	2171
Daniel@0	2172
Daniel@0	2173 <h2><a name="em">Maximum likelihood parameter estimation with missing values</a></h2>
Daniel@0	2174
Daniel@0	2175 Now we consider learning when some values are not observed.
Daniel@0	2176 Let us randomly hide half the values generated from the water
Daniel@0	2177 sprinkler example.
Daniel@0	2178 <pre>
Daniel@0	2179 samples2 = samples;
Daniel@0	2180 hide = rand(N, nsamples) > 0.5;
Daniel@0	2181 [I,J]=find(hide);
Daniel@0	2182 for k=1:length(I)
Daniel@0	2183 samples2{I(k), J(k)} = [];
Daniel@0	2184 end
Daniel@0	2185 </pre>
Daniel@0	2186 samples2{i,l} is the value of node i in training case l, or [] if unobserved.
Daniel@0	2187 <p>
Daniel@0	2188 Now we will compute the MLEs using the EM algorithm.
Daniel@0	2189 We need to use an inference algorithm to compute the expected
Daniel@0	2190 sufficient statistics in the E step; the M (maximization) step is as
Daniel@0	2191 above.
Daniel@0	2192 <pre>
Daniel@0	2193 engine2 = jtree_inf_engine(bnet2);
Daniel@0	2194 max_iter = 10;
Daniel@0	2195 [bnet4, LLtrace] = learn_params_em(engine2, samples2, max_iter);
Daniel@0	2196 </pre>
Daniel@0	2197 LLtrace(i) is the log-likelihood at iteration i. We can plot this as
Daniel@0	2198 follows:
Daniel@0	2199 <pre>
Daniel@0	2200 plot(LLtrace, 'x-')
Daniel@0	2201 </pre>
Daniel@0	2202 Let's display the results after 10 iterations of EM.
Daniel@0	2203 <pre>
Daniel@0	2204 celldisp(CPT4)
Daniel@0	2205 CPT4{1} =
Daniel@0	2206 0.6616
Daniel@0	2207 0.3384
Daniel@0	2208 CPT4{2} =
Daniel@0	2209 0.6510 0.3490
Daniel@0	2210 0.8751 0.1249
Daniel@0	2211 CPT4{3} =
Daniel@0	2212 0.8366 0.1634
Daniel@0	2213 0.0197 0.9803
Daniel@0	2214 CPT4{4} =
Daniel@0	2215 (:,:,1) =
Daniel@0	2216 0.8276 0.0546
Daniel@0	2217 0.5452 0.1658
Daniel@0	2218 (:,:,2) =
Daniel@0	2219 0.1724 0.9454
Daniel@0	2220 0.4548 0.8342
Daniel@0	2221 </pre>
Daniel@0	2222 We can get improved performance by using one or more of the following
Daniel@0	2223 methods:
Daniel@0	2224 <ul>
Daniel@0	2225 <li> Increasing the size of the training set.
Daniel@0	2226 <li> Decreasing the amount of hidden data.
Daniel@0	2227 <li> Running EM for longer.
Daniel@0	2228 <li> Using informative priors.
Daniel@0	2229 <li> Initialising EM from multiple starting points.
Daniel@0	2230 </ul>
Daniel@0	2231
Daniel@0	2232 Click <a href="#gaussian">here</a> for a discussion of learning
Daniel@0	2233 Gaussians, which can cause numerical problems.
Daniel@0	2234 <p>
Daniel@0	2235 For a more complete example of learning with EM,
Daniel@0	2236 see the script BNT/examples/static/learn1.m.
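<p>
For readers who want to see the E and M steps spelled out, here is a minimal plain-Python sketch of EM for a two-node network A -> B with binary nodes, where A is sometimes hidden. It is not how <tt>learn_params_em</tt> is implemented (BNT uses a general inference engine for the E step); the function name <tt>em_a_to_b</tt>, the data layout, and the starting values are assumptions chosen for illustration.

```python
import math


def em_a_to_b(data, pa, pb, max_iter=10):
    """EM for A -> B, both binary. data holds (a, b) pairs; a may be None.

    pa is P(A=1); pb[a] is P(B=1 | A=a). Returns the final parameters and
    the log-likelihood of the observed data at each iteration.
    """
    trace = []
    for _ in range(max_iter):
        na = [0.0, 0.0]     # expected number of cases with A=a
        nab = [0.0, 0.0]    # expected number of cases with A=a and B=1
        ll = 0.0
        # E step: expected sufficient statistics under the current parameters
        for a, b in data:
            joint = []
            for v in (0, 1):
                p_a = pa if v == 1 else 1 - pa
                p_b = pb[v] if b == 1 else 1 - pb[v]
                joint.append(p_a * p_b)
            if a is None:                       # hidden: use the posterior over A
                total = joint[0] + joint[1]
                w = [joint[0] / total, joint[1] / total]
            else:                               # observed: all weight on the value seen
                total = joint[a]
                w = [1.0 - a, float(a)]
            ll += math.log(total)
            for v in (0, 1):
                na[v] += w[v]
                nab[v] += w[v] * b
        trace.append(ll)
        # M step: normalise the expected counts, as in the fully observed case
        pa = na[1] / (na[0] + na[1])
        pb = [nab[0] / na[0], nab[1] / na[1]]
    return pa, pb, trace


data = [(0, 0), (0, 0), (1, 1), (1, 1), (0, 1),
        (None, 1), (None, 0), (None, 1), (1, 0)]
pa, pb, trace = em_a_to_b(data, pa=0.5, pb=[0.3, 0.6])
print(trace)    # the log-likelihood never decreases from one iteration to the next
```

The sketch assumes the parameters stay strictly between 0 and 1; a real implementation would guard against zero counts, for example by using a prior.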
Daniel@0	2237
Daniel@0	2238 <h2><a name="tying">Parameter tying</a></h2>
Daniel@0	2239
Daniel@0	2240 In networks with repeated structure (e.g., chains and grids), it is
Daniel@0	2241 common to assume that the parameters are the same at every node. This
Daniel@0	2242 is called parameter tying, and reduces the amount of data needed for
Daniel@0	2243 learning.
Daniel@0	2244 <p>
Daniel@0	2245 When we have tied parameters, there is no longer a one-to-one
Daniel@0	2246 correspondence between nodes and CPDs.
Daniel@0	2247 Rather, each CPD specifies the parameters for a whole equivalence class
Daniel@0	2248 of nodes.
Daniel@0	2249 It is easiest to see this by example.
Daniel@0	2250 Consider the following <a href="usage_dbn.html#hmm">hidden Markov
Daniel@0	2251 model (HMM)</a>
Daniel@0	2252 <p>
Daniel@0	2253 <img src="Figures/hmm3.gif">
Daniel@0	2254 <p>
Daniel@0	2255 <!--
Daniel@0	2256 We can create this graph structure, assuming we have T time-slices,
Daniel@0	2257 as follows.
Daniel@0	2258 (We number the nodes as shown in the figure, but we could equally well
Daniel@0	2259 number the hidden nodes 1:T, and the observed nodes T+1:2T.)
Daniel@0	2260 <pre>
Daniel@0	2261 N = 2*T;
Daniel@0	2262 dag = zeros(N);
Daniel@0	2263 hnodes = 1:2:2*T;
Daniel@0	2264 for i=1:T-1
Daniel@0	2265 dag(hnodes(i), hnodes(i+1))=1;
Daniel@0	2266 end
Daniel@0	2267 onodes = 2:2:2*T;
Daniel@0	2268 for i=1:T
Daniel@0	2269 dag(hnodes(i), onodes(i)) = 1;
Daniel@0	2270 end
Daniel@0	2271 </pre>
Daniel@0	2272 <p>
Daniel@0	2273 The hidden nodes are always discrete, and have Q possible values each,
Daniel@0	2274 but the observed nodes can be discrete or continuous, and have O possible values/length.
Daniel@0	2275 <pre>
Daniel@0	2276 if cts_obs
Daniel@0	2277 dnodes = hnodes;
Daniel@0	2278 else
Daniel@0	2279 dnodes = 1:N;
Daniel@0	2280 end
Daniel@0	2281 ns = ones(1,N);
Daniel@0	2282 ns(hnodes) = Q;
Daniel@0	2283 ns(onodes) = O;
Daniel@0	2284 </pre>
Daniel@0	2285 -->
Daniel@0	2286 When HMMs are used for semi-infinite processes like speech recognition,
Daniel@0	2287 we assume the transition matrix
Daniel@0	2288 P(H(t+1)|H(t)) is the same for all t; this is called a time-invariant
Daniel@0	2289 or homogeneous Markov chain.
Daniel@0	2290 Hence hidden nodes 2, 3, ..., T
Daniel@0	2291 are all in the same equivalence class, say class Hclass.
Daniel@0	2292 Similarly, the observation matrix P(O(t)|H(t)) is assumed to be the
Daniel@0	2293 same for all t, so the observed nodes are all in the same equivalence
Daniel@0	2294 class, say class Oclass.
Daniel@0	2295 Finally, the prior term P(H(1)) is in a class all by itself, say class
Daniel@0	2296 H1class.
Daniel@0	2297 This is illustrated below, where we explicitly represent the
Daniel@0	2298 parameters as random variables (dotted nodes).
Daniel@0	2299 <p>
Daniel@0	2300 <img src="Figures/hmm4_params.gif">
Daniel@0	2301 <p>
Daniel@0	2302 In BNT, we cannot represent parameters as random variables (nodes).
Daniel@0	2303 Instead, we "hide" the
Daniel@0	2304 parameters inside one CPD for each equivalence class,
Daniel@0	2305 and then specify that the other CPDs should share these parameters, as
Daniel@0	2306 follows.
Daniel@0	2307 <pre>
Daniel@0	2308 hnodes = 1:2:2*T;
Daniel@0	2309 onodes = 2:2:2*T;
Daniel@0	2310 H1class = 1; Hclass = 2; Oclass = 3;
Daniel@0	2311 N = 2*T; eclass = ones(1,N);
Daniel@0	2312 eclass(hnodes(2:end)) = Hclass;
Daniel@0	2313 eclass(hnodes(1)) = H1class;
Daniel@0	2314 eclass(onodes) = Oclass;
Daniel@0	2315 % create dag and ns in the usual way
Daniel@0	2316 bnet = mk_bnet(dag, ns, 'discrete', dnodes, 'equiv_class', eclass);
Daniel@0	2317 </pre>
Daniel@0	2318 Finally, we define the parameters for each equivalence class:
Daniel@0	2319 <pre>
Daniel@0	2320 bnet.CPD{H1class} = tabular_CPD(bnet, hnodes(1)); % prior
Daniel@0	2321 bnet.CPD{Hclass} = tabular_CPD(bnet, hnodes(2)); % transition matrix
Daniel@0	2322 if cts_obs
Daniel@0	2323 bnet.CPD{Oclass} = gaussian_CPD(bnet, onodes(1));
Daniel@0	2324 else
Daniel@0	2325 bnet.CPD{Oclass} = tabular_CPD(bnet, onodes(1));
Daniel@0	2326 end
Daniel@0	2327 </pre>
Daniel@0	2328 In general, if bnet.CPD{e} = xxx_CPD(bnet, j), then j should be a
Daniel@0	2329 member of e's equivalence class; that is, it is not always the case
Daniel@0	2330 that e == j. You can use bnet.rep_of_eclass(e) to return the
Daniel@0	2331 representative of equivalence class e.
Daniel@0	2332 BNT will look up the parents of j to determine the size
Daniel@0	2333 of the CPT to use. It assumes that this is the same for all members of
Daniel@0	2334 the equivalence class.
Daniel@0	2335 Click <a href="param_tieing.html">here</a> for
Daniel@0	2336 a more complex example of parameter tying.
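<p>
The effect of tying can be sketched outside BNT in a few lines of plain Python. The first function reproduces the equivalence-class vector built above for an HMM unrolled over T slices; the second shows why tying reduces the amount of data needed: all transitions, whatever their time step, contribute to one shared count table. Both function names are hypothetical, and states are numbered from 0 here.

```python
def hmm_equiv_classes(T):
    """Equivalence class of each node of an HMM unrolled for T slices.

    Nodes are numbered 1..2T as in the Matlab code: hidden nodes are odd,
    observed nodes are even. Classes: 1 = prior, 2 = transition, 3 = observation.
    """
    H1CLASS, HCLASS, OCLASS = 1, 2, 3
    eclass = {}
    for t in range(1, T + 1):
        hidden, observed = 2 * t - 1, 2 * t
        eclass[hidden] = H1CLASS if t == 1 else HCLASS
        eclass[observed] = OCLASS
    return [eclass[i] for i in range(1, 2 * T + 1)]


def pooled_transition_counts(hidden_path, Q):
    """Transition counts pooled over all time steps (one table for the class)."""
    counts = [[0] * Q for _ in range(Q)]
    for prev, nxt in zip(hidden_path, hidden_path[1:]):
        counts[prev][nxt] += 1
    return counts


print(hmm_equiv_classes(3))
print(pooled_transition_counts([0, 0, 1, 0], Q=2))
```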
Daniel@0	2337 <p>
Daniel@0	2338 Note:
Daniel@0	2339 Normally one would define an HMM as a
Daniel@0	2340 <a href = "usage_dbn.html">Dynamic Bayes Net</a>
Daniel@0	2341 (see the function BNT/examples/dynamic/mk_chmm.m).
Daniel@0	2342 However, one can define an HMM as a static BN using the function
Daniel@0	2343 BNT/examples/static/Models/mk_hmm_bnet.m.
Daniel@0	2344
Daniel@0	2345
Daniel@0	2346
Daniel@0	2347 <h1><a name="structure_learning">Structure learning</a></h1>
Daniel@0	2348
Daniel@0	2349 Update (9/29/03):
Daniel@0	2350 Philippe Leray is developing some additional structure learning code
Daniel@0	2351 on top of BNT. Click <a
Daniel@0	2352 href="http://bnt.insa-rouen.fr/ajouts.html">here</a>
Daniel@0	2353 for details.
Daniel@0	2354
Daniel@0	2355 <p>
Daniel@0	2356
Daniel@0	2357 There are two very different approaches to structure learning:
Daniel@0	2358 constraint-based and search-and-score.
Daniel@0	2359 In the <a href="#constraint">constraint-based approach</a>,
Daniel@0	2360 we start with a fully connected graph, and remove edges if certain
Daniel@0	2361 conditional independencies are measured in the data.
Daniel@0	2362 This has the disadvantage that repeated independence tests lose
Daniel@0	2363 statistical power.
Daniel@0	2364 <p>
Daniel@0	2365 In the more popular search-and-score approach,
Daniel@0	2366 we perform a search through the space of possible DAGs, and either
Daniel@0	2367 return the best one found (a point estimate), or return a sample of the
Daniel@0	2368 models found (an approximation to the Bayesian posterior).
Daniel@0	2369 <p>
Daniel@0	2370 Unfortunately, the number of DAGs as a function of the number of
Daniel@0	2371 nodes, G(n), is super-exponential in n.
Daniel@0	2372 A closed form formula for G(n) is not known, but the first few values
Daniel@0	2373 are shown below (from Cooper, 1999).
Daniel@0	2374
Daniel@0	2375 <table>
Daniel@0	2376 <tr> <th>n</th> <th align=left>G(n)</th> </tr>
Daniel@0	2377 <tr> <td>1</td> <td>1</td> </tr>
Daniel@0	2378 <tr> <td>2</td> <td>3</td> </tr>
Daniel@0	2379 <tr> <td>3</td> <td>25</td> </tr>
Daniel@0	2380 <tr> <td>4</td> <td>543</td> </tr>
Daniel@0	2381 <tr> <td>5</td> <td>29,281</td> </tr>
Daniel@0	2382 <tr> <td>6</td> <td>3,781,503</td> </tr>
Daniel@0	2383 <tr> <td>7</td> <td>1.1 x 10^9</td> </tr>
Daniel@0	2384 <tr> <td>8</td> <td>7.8 x 10^11</td> </tr>
Daniel@0	2385 <tr> <td>9</td> <td>1.2 x 10^15</td> </tr>
Daniel@0	2386 <tr> <td>10</td> <td>4.2 x 10^18</td> </tr>
Daniel@0	2387 </table>
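<p>
Although no closed form is known, G(n) can be computed from a recurrence usually attributed to Robinson (1973), which counts labelled DAGs by inclusion-exclusion over the set of nodes that have no incoming arcs. The plain-Python sketch below (the function name <tt>count_dags</tt> is ours) reproduces the exact entries of the table.

```python
from math import comb


def count_dags(n_max):
    """Return [G(0), G(1), ..., G(n_max)], the number of labelled DAGs on n nodes."""
    g = [1]                                  # G(0) = 1: the empty graph
    for n in range(1, n_max + 1):
        total = 0
        for k in range(1, n + 1):
            # choose k nodes with no incoming arcs; each may point to any
            # subset of the other n - k nodes, which themselves form a DAG
            sign = (-1) ** (k + 1)
            total += sign * comb(n, k) * 2 ** (k * (n - k)) * g[n - k]
        g.append(total)
    return g


for n, value in enumerate(count_dags(10)):
    if n > 0:
        print(n, value)
```

Python integers have arbitrary precision, so the larger entries are computed exactly rather than rounded as in the table.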
Daniel@0	2388
Daniel@0	2389 Since the number of DAGs is super-exponential in the number of nodes,
Daniel@0	2390 we cannot exhaustively search the space, so we either use a local
Daniel@0	2391 search algorithm (e.g., greedy hill climbing, perhaps with multiple
Daniel@0	2392 restarts) or a global search algorithm (e.g., Markov Chain Monte
Daniel@0	2393 Carlo).
Daniel@0	2394 <p>
Daniel@0	2395 If we know a total ordering on the nodes,
Daniel@0	2396 finding the best structure amounts to picking the best set of parents
Daniel@0	2397 for each node independently.
Daniel@0	2398 This is what the K2 algorithm does.
Daniel@0	2399 If the ordering is unknown, we can search over orderings,
Daniel@0	2400 which is more efficient than searching over DAGs (Koller and Friedman, 2000).
Daniel@0	2401 <p>
Daniel@0	2402 In addition to the search procedure, we must specify the scoring
Daniel@0	2403 function. There are two popular choices. The Bayesian score integrates
Daniel@0	2404 out the parameters, i.e., it is the marginal likelihood of the model.
Daniel@0	2405 The BIC (Bayesian Information Criterion) is defined as
Daniel@0	2406 log P(D|theta_hat) - 0.5*d*log(N), where D is the data, theta_hat is
Daniel@0	2407 the ML estimate of the parameters, d is the number of parameters, and
Daniel@0	2408 N is the number of data cases.
Daniel@0	2409 The BIC method has the advantage of not requiring a prior.
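<p>
As a worked instance of the BIC formula, the plain-Python sketch below scores a single discrete node with no parents. The maximum-likelihood parameters are the normalised counts, and a node with r states has d = r - 1 free parameters. The helper names are illustrative and are not BNT functions.

```python
import math


def multinomial_max_log_lik(counts):
    """Log-likelihood of the data at the ML estimate theta_hat = counts / N."""
    n = sum(counts)
    return sum(c * math.log(c / n) for c in counts if c > 0)


def bic_score(log_lik, num_params, num_cases):
    """BIC = log P(D | theta_hat) - 0.5 * d * log(N)."""
    return log_lik - 0.5 * num_params * math.log(num_cases)


counts = [3, 1]                      # a binary node observed in N = 4 cases
log_lik = multinomial_max_log_lik(counts)
print(bic_score(log_lik, num_params=len(counts) - 1, num_cases=sum(counts)))
```

For a node with parents, the same idea applies per parent configuration: the log-likelihood terms are summed and d becomes (r - 1) times the number of parent configurations.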
Daniel@0	2410 <p>
Daniel@0	2411 BIC can be derived as a large sample
Daniel@0	2412 approximation to the marginal likelihood.
Daniel@0	2413 (It is also equal to the Minimum Description Length of a model.)
Daniel@0	2414 However, in practice, the sample size does not need to be very large
Daniel@0	2415 for the approximation to be good.
Daniel@0	2416 For example, in the figure below, we plot the ratio between the log marginal likelihood
Daniel@0	2417 and the BIC score against data-set size; we see that the ratio rapidly
Daniel@0	2418 approaches 1, especially for non-informative priors.
Daniel@0	2419 (This plot was generated by the file BNT/examples/static/bic1.m. It
Daniel@0	2420 uses the water sprinkler BN with BDeu Dirichlet priors with different
Daniel@0	2421 equivalent sample sizes.)
Daniel@0	2422
Daniel@0	2423 <p>
Daniel@0	2424 <center>
Daniel@0	2425 <IMG SRC="Figures/bic.png">
Daniel@0	2426 </center>
Daniel@0	2427 <p>
Daniel@0	2428
Daniel@0	2429 <p>
Daniel@0	2430 As with parameter learning, handling missing data/hidden variables is
Daniel@0	2431 much harder than the fully observed case.
Daniel@0	2432 The structure learning routines in BNT can therefore be classified into 4
Daniel@0	2433 types, analogously to the parameter learning case.
Daniel@0	2434 <p>
Daniel@0	2435
Daniel@0	2436 <TABLE BORDER>
Daniel@0	2437 <tr>
Daniel@0	2438 <TH></TH>
Daniel@0	2439 <th>Full obs</th>
Daniel@0	2440 <th>Partial obs</th>
Daniel@0	2441 </tr>
Daniel@0	2442 <tr>
Daniel@0	2443 <th>Point</th>
Daniel@0	2444 <td><tt>learn_struct_K2</tt>
Daniel@0	2445 <!-- <br> <tt>learn_struct_hill_climb</tt> --></td>
Daniel@0	2446 <td>not yet supported</td>
Daniel@0	2447 </tr>
Daniel@0	2448 <tr>
Daniel@0	2449 <th>Bayes</th>
Daniel@0	2450 <td><tt>learn_struct_mcmc</tt></td>
Daniel@0	2451 <td>not yet supported</td>
Daniel@0	2452 </tr>
Daniel@0	2453 </table>
Daniel@0	2454
Daniel@0	2455
Daniel@0	2456 <h2><a name="markov_equiv">Markov equivalence</a></h2>
Daniel@0	2457
Daniel@0	2458 If two DAGs encode the same conditional independencies, they are
Daniel@0	2459 called Markov equivalent. The set of all DAGs can be partitioned into
Daniel@0	2460 Markov equivalence classes. Graphs within the same class can
Daniel@0	2461 have
Daniel@0	2462 the direction of some of their arcs reversed without changing any of
Daniel@0	2463 the CI relationships.
Daniel@0	2464 Each class can be represented by a PDAG
Daniel@0	2465 (partially directed acyclic graph) called an essential graph or
Daniel@0	2466 pattern. This specifies which edges must be oriented in a certain
Daniel@0	2467 direction, and which may be reversed.
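<p>
A standard way to test Markov equivalence, due to Verma and Pearl, is to check that two DAGs have the same skeleton (the same adjacencies, ignoring direction) and the same v-structures (pairs of non-adjacent parents with a common child). The plain-Python sketch below implements that check; it is not a BNT routine. It assumes both graphs are defined over the same node set, and represents a DAG as a set of (parent, child) pairs.

```python
def skeleton(edges):
    """Undirected version of the graph: a set of unordered node pairs."""
    return {frozenset(edge) for edge in edges}


def v_structures(edges):
    """All triples (a, child, b) with a -> child <- b and a, b not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, pa in parents.items():
        for a in pa:
            for b in pa:
                if a < b and frozenset((a, b)) not in skel:
                    found.add((a, child, b))
    return found


def markov_equivalent(edges1, edges2):
    """True if the two DAGs share a skeleton and a set of v-structures."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))


chain = {("A", "B"), ("B", "C")}
reversed_chain = {("B", "A"), ("C", "B")}
collider = {("A", "B"), ("C", "B")}
print(markov_equivalent(chain, reversed_chain))   # same class
print(markov_equivalent(chain, collider))         # different classes
```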
Daniel@0	2468
Daniel@0	2469 <p>
Daniel@0	2470 When learning graph structure from observational data,
Daniel@0	2471 the best one can hope to do is to identify the model up to Markov
Daniel@0	2472 equivalence. To distinguish amongst graphs within the same equivalence
Daniel@0	2473 class, one needs interventional data: see the discussion on <a
Daniel@0	2474 href="#active">active learning</a> below.
Daniel@0	2475
Daniel@0	2476
Daniel@0	2477
Daniel@0	2478 <h2><a name="enumerate">Exhaustive search</a></h2>
Daniel@0	2479
Daniel@0	2480 The brute-force approach to structure learning is to enumerate all
Daniel@0	2481 possible DAGs, and score each one. This provides a "gold standard"
Daniel@0	2482 with which to compare other algorithms. We can do this as follows.
Daniel@0	2483 <pre>
Daniel@0	2484 dags = mk_all_dags(N);
Daniel@0	2485 score = score_dags(data, ns, dags);
Daniel@0	2486 </pre>
Daniel@0	2487 where data(i,m) is the value of node i in case m,
Daniel@0	2488 and ns(i) is the size of node i.
Daniel@0	2489 If the DAGs have a lot of families in common, we can cache the sufficient statistics,
Daniel@0	2490 making this potentially more efficient than scoring the DAGs one at a time.
Daniel@0	2491 (Caching is not currently implemented, however.)
Daniel@0	2492 <p>
Daniel@0	2493 By default, we use the Bayesian scoring metric, and assume CPDs are
Daniel@0	2494 represented by tables with BDeu(1) priors.
Daniel@0	2495 We can override these defaults as follows.
Daniel@0	2496 If we want to use uniform priors, we can say
Daniel@0	2497 <pre>
Daniel@0	2498 params = cell(1,N);
Daniel@0	2499 for i=1:N
Daniel@0	2500 params{i} = {'prior', 'unif'};
Daniel@0	2501 end
Daniel@0	2502 score = score_dags(data, ns, dags, 'params', params);
Daniel@0	2503 </pre>
Daniel@0	2504 params{i} is a cell-array, containing optional arguments that are
Daniel@0	2505 passed to the constructor for CPD i.
Daniel@0	2506 <p>
Daniel@0	2507 Now suppose we want to use different node types. For example,
Daniel@0	2508 suppose nodes 1 and 2 are Gaussian, and nodes 3 and 4 softmax (both
Daniel@0	2509 these CPDs can support discrete and continuous parents, which is
Daniel@0	2510 necessary since all other nodes will be considered as parents).
Daniel@0	2511 The Bayesian scoring metric currently only works for tabular CPDs, so
Daniel@0	2512 we will use BIC:
Daniel@0	2513 <pre>
Daniel@0	2514 score = score_dags(data, ns, dags, 'discrete', [3 4], 'params', [], ...
Daniel@0	2515 'type', {'gaussian', 'gaussian', 'softmax', 'softmax'}, 'scoring_fn', 'bic')
Daniel@0	2516 </pre>
Daniel@0	2517 In practice, one can't enumerate all possible DAGs for N > 5,
Daniel@0	2518 but one can evaluate any reasonably-sized set of hypotheses in this
Daniel@0	2519 way (e.g., nearest neighbors of your current best guess).
Daniel@0	2520 Think of this as "computer assisted model refinement" as opposed to de
Daniel@0	2521 novo learning.
Daniel@0	2522
Daniel@0	2523
Daniel@0	2524 <h2><a name="K2">K2</a></h2>
Daniel@0	2525
Daniel@0	2526 The K2 algorithm (Cooper and Herskovits, 1992) is a greedy search algorithm that takes a fixed ordering of the nodes as input.
Daniel@0	2527 Initially each node has no parents. For each node, it then incrementally adds the parent (from among the node's predecessors in the ordering) whose addition most
Daniel@0	2528 increases the score of the resulting structure. When the addition of no single
Daniel@0	2529 parent can increase the score, it stops adding parents to the node.
Daniel@0	2530 Since we are using a fixed ordering, we do not need to check for
Daniel@0	2531 cycles, and can choose the parents for each node independently.
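<p>
The greedy loop just described can be sketched in plain Python as follows. This is an illustration of the idea, not the code of <tt>learn_struct_K2</tt>: it handles only discrete nodes, uses the Cooper-Herskovits score (a Dirichlet prior with all pseudo-counts equal to 1) for each family, and the names <tt>family_score</tt> and <tt>k2</tt> are ours. Each case is a tuple of 0-based state indices, and <tt>arity[i]</tt> is the number of states of node i.

```python
import math


def family_score(data, child, parents, arity):
    """Log Cooper-Herskovits score of `child` given the candidate parent list."""
    r = arity[child]
    counts = {}                       # parent configuration -> counts of child states
    for case in data:
        key = tuple(case[p] for p in parents)
        counts.setdefault(key, [0] * r)[case[child]] += 1
    score = 0.0
    for n_ijk in counts.values():     # unseen configurations contribute 0
        score += math.lgamma(r) - math.lgamma(sum(n_ijk) + r)
        score += sum(math.lgamma(n + 1) for n in n_ijk)
    return score


def k2(data, order, arity, max_fan_in):
    """Greedy parent selection for each node, given a fixed node ordering."""
    parents = {node: [] for node in order}
    for pos, node in enumerate(order):
        best = family_score(data, node, parents[node], arity)
        candidates = list(order[:pos])            # only predecessors: no cycles possible
        while candidates and len(parents[node]) < max_fan_in:
            scored = [(family_score(data, node, parents[node] + [c], arity), c)
                      for c in candidates]
            new_best, choice = max(scored)
            if new_best <= best:                  # no single parent improves the score
                break
            best = new_best
            parents[node].append(choice)
            candidates.remove(choice)
    return parents


# Node 1 copies node 0; node 2 is empirically independent of both.
data = [(a, a, b) for a, b in zip([0, 1] * 10, [0, 0, 1, 1] * 5)]
print(k2(data, order=[0, 1, 2], arity=[2, 2, 2], max_fan_in=2))
```

Because each node's parents are chosen only from its predecessors, the families can be scored independently and no acyclicity check is needed.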
Daniel@0	2532 <p>
Daniel@0	2533 The original paper used the Bayesian scoring
Daniel@0	2534 metric with tabular CPDs and Dirichlet priors.
Daniel@0	2535 BNT generalizes this to allow any kind of CPD, and either the Bayesian
Daniel@0	2536 scoring metric or BIC, as in the example <a href="#enumerate">above</a>.
Daniel@0	2537 In addition, you can specify
Daniel@0	2538 an optional upper bound on the number of parents for each node.
Daniel@0	2539 The file BNT/examples/static/k2demo1.m gives an example of how to use K2.
Daniel@0	2540 We use the water sprinkler network and sample 100 cases from it as before.
Daniel@0	2541 Then we see how much data it takes to recover the generating structure:
Daniel@0	2542 <pre>
Daniel@0	2543 order = [C S R W];
Daniel@0	2544 max_fan_in = 2;
Daniel@0	2545 sz = 5:5:100;
Daniel@0	2546 for i=1:length(sz)
Daniel@0	2547 dag2 = learn_struct_K2(data(:,1:sz(i)), node_sizes, order, 'max_fan_in', max_fan_in);
Daniel@0	2548 correct(i) = isequal(dag, dag2);
Daniel@0	2549 end
Daniel@0	2550 </pre>
Daniel@0	2551 Here are the results.
Daniel@0	2552 <pre>
Daniel@0	2553 correct =
Daniel@0	2554 Columns 1 through 12
Daniel@0	2555 0 0 0 0 0 0 0 1 0 1 1 1
Daniel@0	2556 Columns 13 through 20
Daniel@0	2557 1 1 1 1 1 1 1 1
Daniel@0	2558 </pre>
Daniel@0	2559 So we see it takes about sz(10)=50 cases. (BIC behaves similarly,
Daniel@0	2560 showing that the prior doesn't matter too much.)
Daniel@0	2561 In general, we cannot hope to recover the "true" generating structure,
Daniel@0	2562 only one that is in its <a href="#markov_equiv">Markov equivalence
Daniel@0	2563 class</a>.
Daniel@0	2564
Daniel@0	2565
Daniel@0	2566 <h2><a name="hill_climb">Hill-climbing</a></h2>
Daniel@0	2567
Daniel@0	2568 Hill-climbing starts at a specific point in space,
Daniel@0	2569 considers all nearest neighbors, and moves to the neighbor
Daniel@0	2570 that has the highest score; if no neighbors have higher
Daniel@0	2571 score than the current point (i.e., we have reached a local maximum),
Daniel@0	2572 the algorithm stops. One can then restart in another part of the space.
Daniel@0	2573 <p>
Daniel@0	2574 A common definition of "neighbor" is all graphs that can be
Daniel@0	2575 generated from the current graph by adding, deleting or reversing a
Daniel@0	2576 single arc, subject to the acyclicity constraint.
Daniel@0	2577 Other neighborhoods are possible: see
Daniel@0	2578 <a href="http://research.microsoft.com/~dmax/publications/jmlr02.pdf">
Daniel@0	2579 Optimal Structure Identification with Greedy Search</a>, Max
Daniel@0	2580 Chickering, JMLR 2002.
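<p>
The add/delete/reverse neighborhood can be enumerated with a short plain-Python sketch such as the one below. It is not BNT code; the function names and the representation of a DAG as a set of (parent, child) pairs over nodes 0..n-1 are assumptions made for the example.

```python
def is_acyclic(n, edges):
    """Repeatedly strip nodes with no incoming arcs; a cycle leaves some behind."""
    remaining = set(range(n))
    edges = set(edges)
    while remaining:
        sources = [v for v in remaining
                   if not any((u, v) in edges for u in remaining)]
        if not sources:
            return False
        remaining -= set(sources)
    return True


def neighbors(n, edges):
    """All DAGs reachable by adding, deleting or reversing a single arc."""
    edges = frozenset(edges)
    result = set()
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if (i, j) in edges:
                result.add(edges - {(i, j)})                    # delete i -> j
                flipped = (edges - {(i, j)}) | {(j, i)}         # reverse i -> j
                if is_acyclic(n, flipped):
                    result.add(flipped)
            elif (j, i) not in edges:
                added = edges | {(i, j)}                        # add i -> j
                if is_acyclic(n, added):
                    result.add(added)
    return result


print(len(neighbors(3, [])))                  # every single-arc graph on 3 nodes
print(len(neighbors(3, [(0, 1), (1, 2)])))    # neighbors of the chain 0 -> 1 -> 2
```

A hill-climbing step would score every graph returned by <tt>neighbors</tt> and move to the best one if it improves on the current score.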
Daniel@0	2581
Daniel@0	2582 <!--
Daniel@0	2583 Note: This algorithm is currently (Feb '02) being implemented by Qian
Daniel@0	2584 Diao.
Daniel@0	2585 -->
Daniel@0	2586
Daniel@0	2587
Daniel@0	2588 <h2><a name="mcmc">MCMC</a></h2>
Daniel@0	2589
Daniel@0	2590 We can use a Markov Chain Monte Carlo (MCMC) algorithm called
Daniel@0	2591 Metropolis-Hastings (MH) to search the space of all
Daniel@0	2592 DAGs.
Daniel@0	2593 The standard proposal distribution is to consider moving to all
Daniel@0	2594 nearest neighbors in the sense defined <a href="#hill_climb">above</a>.
Daniel@0	2595 <p>
Daniel@0	2596 The function can be called
Daniel@0	2597 as in the following example.
Daniel@0	2598 <pre>
Daniel@0	2599 [sampled_graphs, accept_ratio] = learn_struct_mcmc(data, ns, 'nsamples', 100, 'burnin', 10);
Daniel@0	2600 </pre>
Daniel@0	2601 We can convert our set of sampled graphs to a histogram
Daniel@0	2602 (empirical posterior over all the DAGs) thus
Daniel@0	2603 <pre>
Daniel@0	2604 all_dags = mk_all_dags(N);
Daniel@0	2605 mcmc_post = mcmc_sample_to_hist(sampled_graphs, all_dags);
Daniel@0	2606 </pre>
Daniel@0	2607 To see how well this performs, let us compute the exact posterior exhaustively.
Daniel@0	2608 <p>
Daniel@0	2609 <pre>
Daniel@0	2610 score = score_dags(data, ns, all_dags);
Daniel@0	2611 post = normalise(exp(score)); % assuming uniform structural prior
Daniel@0	2612 </pre>
Daniel@0	2613 We plot the results below.
Daniel@0	2614 (The data set was 100 samples drawn from a random 4 node bnet; see the
Daniel@0	2615 file BNT/examples/static/mcmc1.)
Daniel@0	2616 <pre>
Daniel@0	2617 subplot(2,1,1)
Daniel@0	2618 bar(post)
Daniel@0	2619 subplot(2,1,2)
Daniel@0	2620 bar(mcmc_post)
Daniel@0	2621 </pre>
Daniel@0	2622 <img src="Figures/mcmc_post.jpg" width="800" height="500">
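<p>
A practical note on the step that turns log scores into a posterior: with larger data sets the log marginal likelihoods can be so negative that exponentiating them directly underflows to zero. A common remedy, sketched here in plain Python (the function name is ours and this is not a BNT routine), is to subtract the maximum log score before exponentiating, which leaves the normalised result unchanged.

```python
import math


def posterior_from_log_scores(log_scores):
    """Normalise exp(log_scores) stably, assuming a uniform structural prior."""
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]   # largest weight is exactly 1
    total = sum(weights)
    return [w / total for w in weights]


# exp(-1000) underflows to 0.0 in double precision, but the ratio is well defined.
print(posterior_from_log_scores([-1000.0, -1001.0]))
```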
Daniel@0	2623 <p>
Daniel@0	2624 We can also plot the acceptance ratio versus number of MCMC steps,
Daniel@0	2625 as a crude convergence diagnostic.
Daniel@0	2626 <pre>
Daniel@0	2627 clf
Daniel@0	2628 plot(accept_ratio)
Daniel@0	2629 </pre>
Daniel@0	2630 <img src="Figures/mcmc_accept.jpg" width="800" height="300">
Daniel@0	2631 <p>
Daniel@0	2632 Even though the number of samples needed by MCMC is theoretically
Daniel@0	2633 polynomial (not exponential) in the dimensionality of the search space, in practice it has been
Daniel@0	2634 found that MCMC does not converge in reasonable time for graphs with
Daniel@0	2635 more than about 10 nodes.
Daniel@0	2636
Daniel@0	2637
Daniel@0	2638
Daniel@0	2639
Daniel@0	2640 <h2><a name="active">Active structure learning</a></h2>
Daniel@0	2641
Daniel@0	2642 As was mentioned <a href="#markov_equiv">above</a>,
Daniel@0	2643 one can only learn a DAG up to Markov equivalence, even given infinite data.
Daniel@0	2644 If one is interested in learning the structure of a causal network,
Daniel@0	2645 one needs interventional data.
Daniel@0	2646 (By "intervention" we mean forcing a node to take on a specific value,
Daniel@0	2647 thereby effectively severing its incoming arcs.)
Daniel@0	2648 <p>
Daniel@0	2649 Most of the scoring functions accept an optional argument
Daniel@0	2650 that specifies whether a node was observed to have a certain value, or
Daniel@0	2651 was forced to have that value: we set clamped(i,m)=1 if node i was
Daniel@0	2652 forced in training case m. For an example, see the file
Daniel@0	2653 BNT/examples/static/cooper_yoo.
Daniel@0	2654 <p>
Daniel@0	2655 An interesting question is to decide which interventions to perform
Daniel@0	2656 (cf. design of experiments). For details, see the following tech
Daniel@0	2657 report:
Daniel@0	2658 <ul>
Daniel@0	2659 <li> <a href = "../../Papers/alearn.ps.gz">
Daniel@0	2660 Active learning of causal Bayes net structure</a>, Kevin Murphy, March
Daniel@0	2661 2001.
Daniel@0	2662 </ul>
Daniel@0	2663
Daniel@0	2664
Daniel@0	2665 <h2><a name="struct_em">Structural EM</a></h2>
Daniel@0	2666
Daniel@0	2667 Computing the Bayesian score when there is partial observability is
Daniel@0	2668 computationally challenging, because the parameter posterior becomes
Daniel@0	2669 multimodal (the hidden nodes induce a mixture distribution).
Daniel@0	2670 One therefore needs to use approximations such as BIC.
Daniel@0	2671 Unfortunately, search algorithms are still expensive, because we need
Daniel@0	2672 to run EM at each step to compute the MLE, which is needed to compute
Daniel@0	2673 the score of each model. An alternative approach is
Daniel@0	2674 to do the local search steps inside of the M step of EM, which is more
Daniel@0	2675 efficient since the data has been "filled in" - this is
Daniel@0	2676 called the structural EM algorithm (Friedman 1997), and provably
Daniel@0	2677 converges to a local maximum of the BIC score.
Daniel@0	2678 <p>
Daniel@0	2679 Wei Hu has implemented SEM for discrete nodes.
Daniel@0	2680 You can download his package from
Daniel@0	2681 <a href="../SEM.zip">here</a>.
Daniel@0	2682 Please address all questions about this code to
Daniel@0	2683 wei.hu@intel.com.
Daniel@0	2684 See also <a href="#phl">Philippe Leray's implementation of SEM</a>.
Daniel@0	2685
Daniel@0	2686 <!--
Daniel@0	2687 <h2><a name="reveal">REVEAL algorithm</h2>
Daniel@0	2688
Daniel@0	2689 A simple way to learn the structure of a fully observed, discrete,
Daniel@0	2690 factored DBN from a time series is described <a
Daniel@0	2691 href="usage_dbn.html#struct_learn">here</a>.
Daniel@0	2692 -->
Daniel@0	2693
Daniel@0	2694
Daniel@0	2695 <h2><a name="graphdraw">Visualizing the graph</a></h2>
Daniel@0	2696
Daniel@0	2697 You can visualize an arbitrary graph (such as one learned using the
Daniel@0	2698 structure learning routines) with Matlab code contributed by
Daniel@0	2699 <a href="http://www.mbfys.kun.nl/~cemgil/matlab/layout.html">Ali
Daniel@0	2700 Taylan Cemgil</a>
Daniel@0	2701 from the University of Nijmegen.
Daniel@0	2702 For static BNs, call it as follows:
Daniel@0	2703 <pre>
Daniel@0	2704 draw_graph(bnet.dag);
Daniel@0	2705 </pre>
Daniel@0	2706 For example, this is the output produced on a
Daniel@0	2707 <a href="#qmr">random QMR-like model</a>:
Daniel@0	2708 <p>
Daniel@0	2709 <img src="Figures/qmr.rnd.jpg">
Daniel@0	2710 <p>
Daniel@0	2711 If you install the excellent <a
Daniel@0	2712 href="http://www.research.att.com/sw/tools/graphviz">graphviz</a>, an
Daniel@0	2713 open-source graph visualization package from AT&T,
Daniel@0	2714 you can create a much better visualization as follows:
Daniel@0	2715 <pre>
Daniel@0	2716 graph_to_dot(bnet.dag)
Daniel@0	2717 </pre>
Daniel@0	2718 This works by converting the adjacency matrix to a file suitable
Daniel@0	2719 for input to graphviz (using the dot format),
Daniel@0	2720 then converting the output of graphviz to postscript, and displaying the results using
Daniel@0	2721 ghostview.
Daniel@0	2722 You can do each of these steps separately for more control, as shown
Daniel@0	2723 below (the first line is a Matlab command; the other two are run from a shell).
Daniel@0	2724 <pre>
Daniel@0	2725 graph_to_dot(bnet.dag, 'filename', 'foo.dot');
Daniel@0	2726 dot -Tps foo.dot -o foo.ps
Daniel@0	2727 ghostview foo.ps &
Daniel@0	2728 </pre>
Daniel@0	2729
Daniel@0	2730 <h2><a name = "constraint">Constraint-based methods</a></h2>
Daniel@0	2731
Daniel@0	2732 The IC algorithm (Pearl and Verma, 1991),
Daniel@0	2733 and the faster, but otherwise equivalent, PC algorithm (Spirtes, Glymour, and Scheines 1993),
Daniel@0	2734 perform many conditional independence tests,
Daniel@0	2735 and combine these constraints into a
Daniel@0	2736 PDAG to represent the whole
Daniel@0	2737 <a href="#markov_equiv">Markov equivalence class</a>.
Daniel@0	2738 <p>
Daniel@0	2739 IC*/FCI extend IC/PC to handle latent variables: see <a href="#ic_star">below</a>.
Daniel@0	2740 (IC stands for inductive causation; PC stands for Peter and Clark,
Daniel@0	2741 the first names of Spirtes and Glymour; FCI stands for fast causal
Daniel@0	2742 inference.
Daniel@0	2743 What we, following Pearl (2000), call IC* was called
Daniel@0	2744 IC in the original Pearl and Verma paper.)
Daniel@0	2745 For details, see
Daniel@0	2746 <ul>
Daniel@0	2747 <li>
Daniel@0	2748 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Causation,
Daniel@0	2749 Prediction, and Search</a>, Spirtes, Glymour and
Daniel@0	2750 Scheines (SGS), 2001 (2nd edition), MIT Press.
Daniel@0	2751 <li>
Daniel@0	2752 <a href="http://bayes.cs.ucla.edu/BOOK-2K/index.html">Causality: Models, Reasoning and Inference</a>, J. Pearl,
Daniel@0	2753 2000, Cambridge University Press.
Daniel@0	2754 </ul>
Daniel@0	2755
Daniel@0	2756 <p>
Daniel@0	2757
Daniel@0	2758 The PC algorithm takes as arguments a function f, the number of nodes N,
Daniel@0	2759 the maximum fan in K, and additional arguments A which are passed to f.
Daniel@0	2760 The function f(X,Y,S,A) returns 1 if X is conditionally independent of Y given S, and 0
Daniel@0	2761 otherwise.
Daniel@0	2762 For example, suppose we cheat by
Daniel@0	2763 passing in a CI "oracle" which has access to the true DAG; the oracle
Daniel@0	2764 tests for d-separation in this DAG, i.e.,
Daniel@0	2765 f(X,Y,S) calls dsep(X,Y,S,dag). We can do this as follows.
Daniel@0	2766 <pre>
Daniel@0	2767 pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag);
Daniel@0	2768 </pre>
Daniel@0	2769 pdag(i,j) = -1 if there is definitely an i->j arc,
Daniel@0	2770 and pdag(i,j) = 1 if there is either an i->j or an i<-j arc.
Daniel@0	2771 <p>
Daniel@0	2772 Applied to the sprinkler network, this returns
Daniel@0	2773 <pre>
Daniel@0	2774 pdag =
Daniel@0	2775 0 1 1 0
Daniel@0	2776 1 0 0 -1
Daniel@0	2777 1 0 0 -1
Daniel@0	2778 0 0 0 0
Daniel@0	2779 </pre>
Daniel@0	2780 So as expected, we see that the V-structure at the W node is uniquely identified,
Daniel@0	2781 but the other arcs have ambiguous orientation.
Daniel@0	2782 <p>
Daniel@0	2783 We now give an example from p141 (1st edn) / p103 (2nd edn) of the SGS
Daniel@0	2784 book.
Daniel@0	2785 This example concerns the female orgasm.
Daniel@0	2786 We are given a correlation matrix C between 7 measured factors (such
Daniel@0	2787 as subjective experiences of coital and masturbatory experiences),
Daniel@0	2788 derived from 281 samples, and want to learn a causal model of the
Daniel@0	2789 data. We will not discuss the merits of this type of work here, but
Daniel@0	2790 merely show how to reproduce the results in the SGS book.
Daniel@0	2791 Their program,
Daniel@0	2792 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Tetrad</a>,
Daniel@0	2793 makes use of the Fisher Z-test for conditional
Daniel@0	2794 independence, so we do the same:
Daniel@0	2795 <pre>
Daniel@0	2796 max_fan_in = 4;
Daniel@0	2797 nsamples = 281;
Daniel@0	2798 alpha = 0.05;
Daniel@0	2799 pdag = learn_struct_pdag_pc('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha);
Daniel@0	2800 </pre>
Daniel@0	2801 In this case, the CI test is
Daniel@0	2802 <pre>
Daniel@0	2803 f(X,Y,S) = cond_indep_fisher_z(X,Y,S, C,nsamples,alpha)
Daniel@0	2804 </pre>
Daniel@0	2805 The results match those of Fig 12a of SGS apart from two edge
Daniel@0	2806 differences; presumably this is due to rounding error (although it
Daniel@0	2807 could be a bug, either in BNT or in Tetrad).
Daniel@0	2808 This example can be found in the file BNT/examples/static/pc2.m.
Daniel@0	2809
Daniel@0	2810 <p>
Daniel@0	2811
Daniel@0	2812 The IC* algorithm (Pearl and Verma, 1991),
Daniel@0	2813 and the faster FCI algorithm (Spirtes, Glymour, and Scheines 1993),
Daniel@0	2814 are like the IC/PC algorithm, except that they can detect the presence
Daniel@0	2815 of latent variables.
Daniel@0	2816 See the file <tt>learn_struct_pdag_ic_star</tt> written by Tamar
Daniel@0	2817 Kushnir. The output is a matrix P, defined as follows
Daniel@0	2818 (see Pearl (2000), p52 for details):
Daniel@0	2819 <pre>
Daniel@0	2820 % P(i,j) = -1 if there is either a latent variable L such that i <-L->j OR there is a directed edge from i->j.
Daniel@0	2821 % P(i,j) = -2 if there is a marked directed i-*>j edge.
Daniel@0	2822 % P(i,j) = P(j,i) = 1 if there is an undirected edge i--j
Daniel@0	2823 % P(i,j) = P(j,i) = 2 if there is a latent variable L such that i<-L->j.
Daniel@0	2824 </pre>
Daniel@0	2825
Daniel@0	2826
Daniel@0	2827 <h2><a name="phl">Philippe Leray's structure learning package</h2>
Daniel@0	2828
Daniel@0	2829 Philippe Leray has written a
Daniel@0	2830 <a href="http://bnt.insa-rouen.fr/ajouts.html">
Daniel@0	2831 structure learning package</a> that uses BNT.
Daniel@0	2832
Daniel@0	2833 It currently (June 2003) has the following features:
Daniel@0	2834 <ul>
Daniel@0	2835 <li>PC with Chi2 statistical test
Daniel@0	2836 <li> MWST : Maximum weighted Spanning Tree
Daniel@0	2837 <li> Hill Climbing
Daniel@0	2838 <li> Greedy Search
Daniel@0	2839 <li> Structural EM
Daniel@0	2840 <li> hist_ic : optimal Histogram based on IC information criterion
Daniel@0	2841 <li> cpdag_to_dag
Daniel@0	2842 <li> dag_to_cpdag
Daniel@0	2843 <li> ...
Daniel@0	2844 </ul>
Daniel@0	2845
Daniel@0	2846
Daniel@0	2847 </a>
Daniel@0	2848
Daniel@0	2849
Daniel@0	2850 <!--
Daniel@0	2851 <h2><a name="read_learning">Further reading on learning</h2>
Daniel@0	2852
Daniel@0	2853 I recommend the following tutorials for more details on learning.
Daniel@0	2854 <ul>
Daniel@0	2855 <li> <a
Daniel@0	2856 href="http://www.cs.berkeley.edu/~murphyk/Papers/intel.ps.gz">My short
Daniel@0	2857 tutorial</a> on graphical models, which contains an overview of learning.
Daniel@0	2858
Daniel@0	2859 <li>
Daniel@0	2860 <A HREF="ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.PS">
Daniel@0	2861 A tutorial on learning with Bayesian networks</a>, D. Heckerman,
Daniel@0	2862 Microsoft Research Tech Report, 1995.
Daniel@0	2863
Daniel@0	2864 <li> <A HREF="http://www-cad.eecs.berkeley.edu/~wray/Mirror/lwgmja">
Daniel@0	2865 Operations for Learning with Graphical Models</a>,
Daniel@0	2866 W. L. Buntine, JAIR'94, 159--225.
Daniel@0	2867 </ul>
Daniel@0	2868 <p>
Daniel@0	2869 -->
Daniel@0	2870
Daniel@0	2871
Daniel@0	2872
Daniel@0	2873
Daniel@0	2874
Daniel@0	2875 <h1><a name="engines">Inference engines</h1>
Daniel@0	2876
Daniel@0	2877 Up until now, we have used the junction tree algorithm for inference.
Daniel@0	2878 However, sometimes this is too slow, or not even applicable.
Daniel@0	2879 In general, there are many inference algorithms, each of which makes
Daniel@0	2880 different tradeoffs between speed, accuracy, complexity and
Daniel@0	2881 generality. Furthermore, there might be many implementations of the
Daniel@0	2882 same algorithm; for instance, a general purpose, readable version,
Daniel@0	2883 and a highly-optimized, specialized one.
Daniel@0	2884 To cope with this variety, we treat each inference algorithm as an
Daniel@0	2885 object, which we call an inference engine.
Daniel@0	2886
Daniel@0	2887 <p>
Daniel@0	2888 An inference engine is an object that contains a bnet and supports the
Daniel@0	2889 'enter_evidence' and 'marginal_nodes' methods. The engine constructor
Daniel@0	2890 takes the bnet as argument and may do some model-specific processing.
Daniel@0	2891 When 'enter_evidence' is called, the engine may do some
Daniel@0	2892 evidence-specific processing. Finally, when 'marginal_nodes' is
Daniel@0	2893 called, the engine may do some query-specific processing.
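<p>
The three stages can be sketched as follows. This is only an illustration: it
assumes BNT is on your Matlab path, and it rebuilds the sprinkler network with
the same CPTs as in the "Creating your first Bayes net" section. The cell
array <tt>engines</tt> and the vector <tt>p</tt> are names we made up for this
example.
<pre>
% Sprinkler network: C=1, S=2, R=3, W=4, all nodes binary
N = 4; C = 1; S = 2; R = 3; W = 4;
dag = zeros(N,N);
dag(C,[R S]) = 1; dag(R,W) = 1; dag(S,W) = 1;
bnet = mk_bnet(dag, 2*ones(1,N), 'discrete', 1:N);
bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);

% Stage 1: construct the engines (model-specific processing)
engines = {jtree_inf_engine(bnet), var_elim_inf_engine(bnet)};

evidence = cell(1,N);
evidence{W} = 2;              % we observe W = true

p = zeros(1, 2);
for e = 1:2
  % Stage 2: enter the evidence (evidence-specific processing)
  eng = enter_evidence(engines{e}, evidence);
  % Stage 3: ask for a marginal (query-specific processing)
  marg = marginal_nodes(eng, S);
  p(e) = marg.T(2);           % P(S = true | W = true)
end
</pre>
Both engines are exact, so the two entries of <tt>p</tt> should agree up to
rounding; only the stage at which the work is done differs.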
Daniel@0	2894
Daniel@0	2895 <p>
Daniel@0	2896 The amount of work done when each stage is specified -- structure,
Daniel@0	2897 parameters, evidence, and query -- depends on the engine. The cost of
Daniel@0	2898 work done early in this sequence can be amortized. On the other hand,
Daniel@0	2899 one can make better optimizations if one waits until later in the
Daniel@0	2900 sequence.
Daniel@0	2901 For example, the parameters might imply
Daniel@0	2902 conditional independencies that are not evident in the graph structure,
Daniel@0	2903 but can nevertheless be exploited; the evidence indicates which nodes
Daniel@0	2904 are observed and hence can effectively be disconnected from the
Daniel@0	2905 graph; and the query might indicate that large parts of the network
Daniel@0	2906 are d-separated from the query nodes. (Since it is not the actual
Daniel@0	2907 <em>values</em> of the evidence that matter, just which nodes are observed,
Daniel@0	2908 many engines allow you to specify which nodes will be observed when they are constructed,
Daniel@0	2909 i.e., before calling 'enter_evidence'. Some engines can still cope if
Daniel@0	2910 the actual pattern of evidence is different, e.g., if there is missing
Daniel@0	2911 data.)
Daniel@0	2912 <p>
Daniel@0	2913
Daniel@0	2914 Although being maximally lazy (i.e., only doing work when a query is
Daniel@0	2915 issued) may seem desirable,
Daniel@0	2916 this is not always the most efficient strategy.
Daniel@0	2917 For example,
Daniel@0	2918 when learning using EM, we need to call marginal_nodes N times, where N is the
Daniel@0	2919 number of nodes. <a href="#varelim">Variable elimination</a> would end
Daniel@0	2920 up repeating a lot of work
Daniel@0	2921 each time marginal_nodes is called, making it inefficient for
Daniel@0	2922 learning. The junction tree algorithm, by contrast, uses dynamic
Daniel@0	2923 programming to avoid this redundant computation --- it calculates all
Daniel@0	2924 marginals in two passes during 'enter_evidence', so calling
Daniel@0	2925 'marginal_nodes' takes constant time.
Daniel@0	2926 <p>
Daniel@0	2927 We will discuss some of the inference algorithms implemented in BNT
Daniel@0	2928 below, and finish with a <a href="#engine_summary">summary</a> of all
Daniel@0	2929 of them.
Daniel@0	2930
Daniel@0	2931
Daniel@0	2932
Daniel@0	2933
Daniel@0	2934
Daniel@0	2935
Daniel@0	2936
Daniel@0	2937 <h2><a name="varelim">Variable elimination</h2>
Daniel@0	2938
Daniel@0	2939 The variable elimination algorithm, also known as bucket elimination
Daniel@0	2940 or peeling, is one of the simplest inference algorithms.
Daniel@0	2941 The basic idea is to "push sums inside of products"; this is explained
Daniel@0	2942 in more detail
Daniel@0	2943 <a
Daniel@0	2944 href="http://HTTP.CS.Berkeley.EDU/~murphyk/Bayes/bayes.html#infer">here</a>.
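<p>
As a toy illustration of pushing sums inside products, consider a chain
A -> B -> C of binary nodes. The sketch below uses plain Matlab (no BNT calls);
the CPT values and variable names are made up for this example. It computes
P(C) twice: once by building the full joint and marginalizing, and once by
summing out A and then B.
<pre>
pA   = [0.6; 0.4];            % P(A)
pBgA = [0.7 0.3; 0.2 0.8];    % P(B|A), rows index A, columns index B
pCgB = [0.9 0.1; 0.5 0.5];    % P(C|B), rows index B, columns index C

% Naive method: build the joint P(A,B,C), then sum over A and B
joint = zeros(2,2,2);
for a = 1:2
  for b = 1:2
    for c = 1:2
      joint(a,b,c) = pA(a) * pBgA(a,b) * pCgB(b,c);
    end
  end
end
pC_naive = squeeze(sum(sum(joint, 1), 2))';

% Elimination: P(C) = sum_B P(C|B) * (sum_A P(B|A) P(A)).
% Each inner sum is a small matrix-vector product.
pB      = pA' * pBgA;         % sum out A
pC_elim = pB * pCgB;          % sum out B

max_diff = max(abs(pC_naive - pC_elim));
</pre>
The two answers agree, but the second method never forms a table over all
three variables at once; on longer chains this is the difference between
exponential and linear cost.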
Daniel@0	2945 <p>
Daniel@0	2946 The principle of distributing sums over products can be generalized
Daniel@0	2947 greatly to apply to any commutative semiring.
Daniel@0	2948 This forms the basis of many common algorithms, such as Viterbi
Daniel@0	2949 decoding and the Fast Fourier Transform. For details, see
Daniel@0	2950
Daniel@0	2951 <ul>
Daniel@0	2952 <li> R. McEliece and S. M. Aji, 2000.
Daniel@0	2953 <!--<a href="http://www.systems.caltech.edu/EE/Faculty/rjm/papers/GDL.ps">-->
Daniel@0	2954 <a href="GDL.pdf">
Daniel@0	2955 The Generalized Distributive Law</a>,
Daniel@0	2956 IEEE Trans. Inform. Theory, vol. 46, no. 2 (March 2000),
Daniel@0	2957 pp. 325--343.
Daniel@0	2958
Daniel@0	2959
Daniel@0	2960 <li>
Daniel@0	2961 F. R. Kschischang, B. J. Frey and H.-A. Loeliger, 2001.
Daniel@0	2962 <a href="http://www.cs.toronto.edu/~frey/papers/fgspa.abs.html">
Daniel@0	2963 Factor graphs and the sum-product algorithm</a>,
Daniel@0	2964 IEEE Transactions on Information Theory, February, 2001.
Daniel@0	2965
Daniel@0	2966 </ul>
Daniel@0	2967
Daniel@0	2968 <p>
Daniel@0	2969 Choosing an order in which to sum out the variables so as to minimize
Daniel@0	2970 computational cost is known to be NP-hard.
Daniel@0	2971 The implementation of this algorithm in
Daniel@0	2972 <tt>var_elim_inf_engine</tt> makes no attempt to optimize this
Daniel@0	2973 ordering (in contrast, say, to <tt>jtree_inf_engine</tt>, which uses a
Daniel@0	2974 greedy search procedure to find a good ordering).
Daniel@0	2975 <p>
Daniel@0	2976 Note: unlike most algorithms, var_elim does all its computational work
Daniel@0	2977 inside of <tt>marginal_nodes</tt>, not inside of
Daniel@0	2978 <tt>enter_evidence</tt>.
Daniel@0	2979
Daniel@0	2980
Daniel@0	2981
Daniel@0	2982
Daniel@0	2983 <h2><a name="global">Global inference methods</h2>
Daniel@0	2984
Daniel@0	2985 The simplest inference algorithm of all is to explicitly construct
Daniel@0	2986 the joint distribution over all the nodes, and then to marginalize it.
Daniel@0	2987 This is implemented in <tt>global_joint_inf_engine</tt>.
Daniel@0	2988 Since the size of the joint is exponential in the
Daniel@0	2989 number of discrete (hidden) nodes, this is not a very practical algorithm.
Daniel@0	2990 It is included merely for pedagogical and debugging purposes.
Daniel@0	2991 <p>
Daniel@0	2992 Three specialized versions of this algorithm have also been implemented,
Daniel@0	2993 corresponding to the cases where all the nodes are discrete (D), all
Daniel@0	2994 are Gaussian (G), and some are discrete and some Gaussian (CG).
Daniel@0	2995 They are called <tt>enumerative_inf_engine</tt>,
Daniel@0	2996 <tt>gaussian_inf_engine</tt>,
Daniel@0	2997 and <tt>cond_gauss_inf_engine</tt> respectively.
Daniel@0	2998 <p>
Daniel@0	2999 Note: unlike most algorithms, these global inference algorithms do all their computational work
Daniel@0	3000 inside of <tt>marginal_nodes</tt>, not inside of
Daniel@0	3001 <tt>enter_evidence</tt>.
Daniel@0	3002
Daniel@0	3003
Daniel@0	3004 <h2><a name="quickscore">Quickscore</h2>
Daniel@0	3005
Daniel@0	3006 The junction tree algorithm is quite slow on the <a href="#qmr">QMR</a> network,
Daniel@0	3007 since the cliques are so big.
Daniel@0	3008 One simple trick we can use is to notice that hidden leaves do not
Daniel@0	3009 affect the posteriors on the roots, and hence do not need to be
Daniel@0	3010 included in the network.
Daniel@0	3011 A second trick is to notice that the negative findings can be
Daniel@0	3012 "absorbed" into the prior:
Daniel@0	3013 see the file
Daniel@0	3014 BNT/examples/static/mk_minimal_qmr_bnet for details.
Daniel@0	3015 <p>
Daniel@0	3016
Daniel@0	3017 A much more significant speedup is obtained by exploiting special
Daniel@0	3018 properties of the noisy-or node, as done by the quickscore
Daniel@0	3019 algorithm. For details, see
Daniel@0	3020 <ul>
Daniel@0	3021 <li> Heckerman, "A tractable inference algorithm for diagnosing multiple diseases", UAI 89.
Daniel@0	3022 <li> Rish and Dechter, "On the impact of causal independence", UCI
Daniel@0	3023 tech report, 1998.
Daniel@0	3024 </ul>
Daniel@0	3025
Daniel@0	3026 This has been implemented in BNT as a special-purpose inference
Daniel@0	3027 engine, which can be created and used as follows:
Daniel@0	3028 <pre>
Daniel@0	3029 engine = quickscore_inf_engine(inhibit, leak, prior);
Daniel@0	3030 engine = enter_evidence(engine, pos, neg);
Daniel@0	3031 m = marginal_nodes(engine, i);
Daniel@0	3032 </pre>
Daniel@0	3033
Daniel@0	3034
Daniel@0	3035 <h2><a name="belprop">Belief propagation</h2>
Daniel@0	3036
Daniel@0	3037 Even using quickscore, exact inference takes time that is exponential
Daniel@0	3038 in the number of positive findings.
Daniel@0	3039 Hence for large networks we need to resort to approximate inference techniques.
Daniel@0	3040 See for example
Daniel@0	3041 <ul>
Daniel@0	3042 <li> T. Jaakkola and M. Jordan, "Variational probabilistic inference and the
Daniel@0	3043 QMR-DT network", JAIR 10, 1999.
Daniel@0	3044
Daniel@0	3045 <li> K. Murphy, Y. Weiss and M. Jordan, "Loopy belief propagation for approximate inference: an empirical study",
Daniel@0	3046 UAI 99.
Daniel@0	3047 </ul>
Daniel@0	3048 The latter approximation
Daniel@0	3049 entails applying Pearl's belief propagation algorithm to a model even
Daniel@0	3050 if it has loops (hence the name loopy belief propagation).
Daniel@0	3051 Pearl's algorithm, implemented as <tt>pearl_inf_engine</tt>, gives
Daniel@0	3052 exact results when applied to singly-connected graphs
Daniel@0	3053 (a.k.a. polytrees, since
Daniel@0	3054 the underlying undirected topology is a tree, but a node may have
Daniel@0	3055 multiple parents).
Daniel@0	3056 To apply this algorithm to a graph with loops,
Daniel@0	3057 use the same <tt>pearl_inf_engine</tt>; the results are then approximate.
Daniel@0	3058 This can use a centralized or distributed message passing protocol.
Daniel@0	3059 You can use it as in the following example.
Daniel@0	3060 <pre>
Daniel@0	3061 engine = pearl_inf_engine(bnet, 'max_iter', 30);
Daniel@0	3062 engine = enter_evidence(engine, evidence);
Daniel@0	3063 m = marginal_nodes(engine, i);
Daniel@0	3064 </pre>
Daniel@0	3065 We found that this algorithm often converges and, when it does, is often
Daniel@0	3066 very accurate, but this depends on the precise setting of the
Daniel@0	3067 parameter values of the network.
Daniel@0	3068 (See the file BNT/examples/static/qmr1 to repeat the experiment for yourself.)
Daniel@0	3069 Understanding when and why belief propagation converges and works
Daniel@0	3070 is a topic of ongoing research.
Daniel@0	3071 <p>
Daniel@0	3072 <tt>pearl_inf_engine</tt> can exploit special structure in noisy-or
Daniel@0	3073 and gmux nodes to compute messages efficiently.
Daniel@0	3074 <p>
Daniel@0	3075 <tt>belprop_inf_engine</tt> is like pearl, but uses potentials to
Daniel@0	3076 represent messages. Hence this is slower.
Daniel@0	3077 <p>
Daniel@0	3078 <tt>belprop_fg_inf_engine</tt> is like belprop,
Daniel@0	3079 but is designed for factor graphs.
Daniel@0	3080
Daniel@0	3081
Daniel@0	3082
Daniel@0	3083 <h2><a name="sampling">Sampling</h2>
Daniel@0	3084
Daniel@0	3085 BNT now (Mar '02) has two sampling (Monte Carlo) inference algorithms:
Daniel@0	3086 <ul>
Daniel@0	3087 <li> <tt>likelihood_weighting_inf_engine</tt> which does importance
Daniel@0	3088 sampling and can handle any node type.
Daniel@0	3089 <li> <tt>gibbs_sampling_inf_engine</tt>, written by Bhaskara Marthi.
Daniel@0	3090 Currently this can only handle tabular CPDs.
Daniel@0	3091 For a much faster and more powerful Gibbs sampling program, see
Daniel@0	3092 <a href="http://www.mrc-bsu.cam.ac.uk/bugs">BUGS</a>.
Daniel@0	3093 </ul>
Daniel@0	3094 Note: To generate samples from a network (which is not the same as inference!),
Daniel@0	3095 use <tt>sample_bnet</tt>.
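<p>
To show what likelihood weighting does, here is a sketch in plain Matlab for a
two-node network A -> B with evidence B = true. It does not call
<tt>likelihood_weighting_inf_engine</tt>; the CPT values, the sample count and
the variable names are made up for this example.
<pre>
nsamples = 20000;
pA2   = 0.3;                  % P(A = 2)
pB2gA = [0.2 0.9];            % P(B = 2 | A = 1) and P(B = 2 | A = 2)

% Sample the unobserved node A from its prior (value 1 or 2)
a = 1 + (rand(nsamples,1) > 1 - pA2);

% The observed node B is not sampled; instead each sample is
% weighted by the likelihood of the evidence B = 2 given its A
w = pB2gA(a);
w = w(:);

% Weighted estimate of P(A = 2 | B = 2)
est = sum(w .* (a == 2)) / sum(w);

% Exact posterior by Bayes' rule, for comparison
exact = pA2*pB2gA(2) / (pA2*pB2gA(2) + (1 - pA2)*pB2gA(1));
</pre>
With this many samples <tt>est</tt> should be close to <tt>exact</tt>, but it
varies from run to run, which is why sampling is classed as approximate.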
Daniel@0	3096
Daniel@0	3097
Daniel@0	3098
Daniel@0	3099 <h2><a name="engine_summary">Summary of inference engines</h2>
Daniel@0	3100
Daniel@0	3101
Daniel@0	3102 The inference engines differ in many ways. Here are
Daniel@0	3103 some of the major "axes":
Daniel@0	3104 <ul>
Daniel@0	3105 <li> Works for all topologies or makes restrictions?
Daniel@0	3106 <li> Works for all node types or makes restrictions?
Daniel@0	3107 <li> Exact or approximate inference?
Daniel@0	3108 </ul>
Daniel@0	3109
Daniel@0	3110 <p>
Daniel@0	3111 In terms of topology, most engines handle any kind of DAG.
Daniel@0	3112 <tt>belprop_fg</tt> does approximate inference on factor graphs (FG), which
Daniel@0	3113 can be used to represent directed, undirected, and mixed (chain)
Daniel@0	3114 graphs.
Daniel@0	3115 (In the future, we plan to support exact inference on chain graphs.)
Daniel@0	3116 <tt>quickscore</tt> only works on QMR-like models.
Daniel@0	3117 <p>
Daniel@0	3118 In terms of node types: algorithms that use potentials can handle
Daniel@0	3119 discrete (D), Gaussian (G) or conditional Gaussian (CG) models.
Daniel@0	3120 Sampling algorithms can essentially handle any kind of node (distribution).
Daniel@0	3121 Other algorithms make more restrictive assumptions in exchange for
Daniel@0	3122 speed.
Daniel@0	3123 <p>
Daniel@0	3124 Finally, most algorithms are designed to give the exact answer.
Daniel@0	3125 The belief propagation algorithms are exact if applied to trees, and
Daniel@0	3126 in some other cases.
Daniel@0	3127 Sampling is considered approximate, even though, in the limit of an
Daniel@0	3128 infinite number of samples, it gives the exact answer.
Daniel@0	3129
Daniel@0	3130 <p>
Daniel@0	3131
Daniel@0	3132 Here is a summary of the properties
Daniel@0	3133 of all the engines in BNT which work on static networks.
Daniel@0	3134 <p>
Daniel@0	3136 <table border units = pixels><tr>
Daniel@0	3137 <td align=left width=0>Name
Daniel@0	3138 <td align=left width=0>Exact?
Daniel@0	3139 <td align=left width=0>Node type?
Daniel@0	3140 <td align=left width=0>Topology?
Daniel@0	3141 <tr>
Daniel@0	3142 <tr>
Daniel@0	3143 <td align=left> belprop
Daniel@0	3144 <td align=left> approx
Daniel@0	3145 <td align=left> D
Daniel@0	3146 <td align=left> DAG
Daniel@0	3147 <tr>
Daniel@0	3148 <td align=left> belprop_fg
Daniel@0	3149 <td align=left> approx
Daniel@0	3150 <td align=left> D
Daniel@0	3151 <td align=left> factor graph
Daniel@0	3152 <tr>
Daniel@0	3153 <td align=left> cond_gauss
Daniel@0	3154 <td align=left> exact
Daniel@0	3155 <td align=left> CG
Daniel@0	3156 <td align=left> DAG
Daniel@0	3157 <tr>
Daniel@0	3158 <td align=left> enumerative
Daniel@0	3159 <td align=left> exact
Daniel@0	3160 <td align=left> D
Daniel@0	3161 <td align=left> DAG
Daniel@0	3162 <tr>
Daniel@0	3163 <td align=left> gaussian
Daniel@0	3164 <td align=left> exact
Daniel@0	3165 <td align=left> G
Daniel@0	3166 <td align=left> DAG
Daniel@0	3167 <tr>
Daniel@0	3168 <td align=left> gibbs
Daniel@0	3169 <td align=left> approx
Daniel@0	3170 <td align=left> D
Daniel@0	3171 <td align=left> DAG
Daniel@0	3172 <tr>
Daniel@0	3173 <td align=left> global_joint
Daniel@0	3174 <td align=left> exact
Daniel@0	3175 <td align=left> D,G,CG
Daniel@0	3176 <td align=left> DAG
Daniel@0	3177 <tr>
Daniel@0	3178 <td align=left> jtree
Daniel@0	3179 <td align=left> exact
Daniel@0	3180 <td align=left> D,G,CG
Daniel@0	3181 <td align=left> DAG
Daniel@0	3182 <tr>
Daniel@0	3183 <td align=left> likelihood_weighting
Daniel@0	3184 <td align=left> approx
Daniel@0	3185 <td align=left> any
Daniel@0	3186 <td align=left> DAG
Daniel@0	3187 <tr>
Daniel@0	3188 <td align=left> pearl
Daniel@0	3189 <td align=left> approx
Daniel@0	3190 <td align=left> D,G
Daniel@0	3191 <td align=left> DAG
Daniel@0	3192 <tr>
Daniel@0	3193 <td align=left> pearl
Daniel@0	3194 <td align=left> exact
Daniel@0	3195 <td align=left> D,G
Daniel@0	3196 <td align=left> polytree
Daniel@0	3197 <tr>
Daniel@0	3198 <td align=left> quickscore
Daniel@0	3199 <td align=left> exact
Daniel@0	3200 <td align=left> noisy-or
Daniel@0	3201 <td align=left> QMR
Daniel@0	3202 <tr>
Daniel@0	3203 <td align=left> stab_cond_gauss
Daniel@0	3204 <td align=left> exact
Daniel@0	3205 <td align=left> CG
Daniel@0	3206 <td align=left> DAG
Daniel@0	3207 <tr>
Daniel@0	3208 <td align=left> var_elim
Daniel@0	3209 <td align=left> exact
Daniel@0	3210 <td align=left> D,G,CG
Daniel@0	3211 <td align=left> DAG
Daniel@0	3212 </table>
Daniel@0	3213
Daniel@0	3214
Daniel@0	3215
Daniel@0	3216 <h1><a name="influence">Influence diagrams/ decision making</h1>
Daniel@0	3217
Daniel@0	3218 BNT implements an exact algorithm for solving LIMIDs (limited memory
Daniel@0	3219 influence diagrams), described in
Daniel@0	3220 <ul>
Daniel@0	3221 <li> S. L. Lauritzen and D. Nilsson.
Daniel@0	3222 <a href="http://www.math.auc.dk/~steffen/papers/limids.pdf">
Daniel@0	3223 Representing and solving decision problems with limited
Daniel@0	3224 information</a>,
Daniel@0	3225 Management Science, 47, 1238 - 1251. September 2001.
Daniel@0	3226 </ul>
Daniel@0	3227 LIMIDs explicitly show all information arcs, rather than implicitly
Daniel@0	3228 assuming no forgetting. This allows them to model forgetful
Daniel@0	3229 controllers.
Daniel@0	3230 <p>
Daniel@0	3231 See the examples in <tt>BNT/examples/limids</tt> for details.
Daniel@0	3232
Daniel@0	3233
Daniel@0	3234
Daniel@0	3235
Daniel@0	3236 <h1>DBNs, HMMs, Kalman filters and all that</h1>
Daniel@0	3237
Daniel@0	3238 Click <a href="usage_dbn.html">here</a> for documentation about how to
Daniel@0	3239 use BNT for dynamical systems and sequence data.
Daniel@0	3240
Daniel@0	3241
Daniel@0	3242 </BODY>
