1 <HEAD> | |
2 <TITLE>How to use the Bayes Net Toolbox</TITLE> | |
3 </HEAD> | |
4 | |
5 <BODY BGCOLOR="#FFFFFF"> | |
6 <!-- white background is better for the pictures and equations --> | |
7 | |
8 <h1>How to use the Bayes Net Toolbox</h1> | |
9 | |
10 This documentation was last updated on 29 October 2007. | |
11 <br> | |
12 Click | |
13 <a href="http://bnt.insa-rouen.fr/">here</a> | |
14 for a French version of this documentation (last updated in 2005). | |
15 <p> | |
16 | |
17 <ul> | |
18 <li> <a href="install.html">Installation</a> | |
19 | |
20 <li> <a href="#basics">Creating your first Bayes net</a> | |
21 <ul> | |
22 <li> <a href="#basics">Creating a model by hand</a> | |
23 <li> <a href="#file">Loading a model from a file</a> | |
24 <li> <a href="#GUI">Creating a model using a GUI</a> | |
25 <li> <a href="graphviz.html">Graph visualization</a> | |
26 </ul> | |
27 | |
28 <li> <a href="#inference">Inference</a> | |
29 <ul> | |
30 <li> <a href="#marginal">Computing marginal distributions</a> | |
31 <li> <a href="#joint">Computing joint distributions</a> | |
32 <li> <a href="#soft">Soft/virtual evidence</a> | |
33 <li> <a href="#mpe">Most probable explanation</a> | |
34 </ul> | |
35 | |
36 <li> <a href="#cpd">Conditional Probability Distributions</a> | |
37 <ul> | |
38 <li> <a href="#tabular">Tabular (multinomial) nodes</a> | |
39 <li> <a href="#noisyor">Noisy-or nodes</a> | |
40 <li> <a href="#deterministic">Other (noisy) deterministic nodes</a> | |
41 <li> <a href="#softmax">Softmax (multinomial logit) nodes</a> | |
42 <li> <a href="#mlp">Neural network nodes</a> | |
43 <li> <a href="#root">Root nodes</a> | |
44 <li> <a href="#gaussian">Gaussian nodes</a> | |
45 <li> <a href="#glm">Generalized linear model nodes</a> | |
46 <li> <a href="#dtree">Classification/regression tree nodes</a> | |
47 <li> <a href="#nongauss">Other continuous distributions</a> | |
48 <li> <a href="#cpd_summary">Summary of CPD types</a> | |
49 </ul> | |
50 | |
51 <li> <a href="#examples">Example models</a> | |
52 <ul> | |
53 <li> <a | |
54 href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt"> | |
55 Gaussian mixture models</a> | |
56 <li> <a href="#pca">PCA, ICA, and all that</a> | |
<li> <a href="#mixexp">Mixtures of experts</a>
58 <li> <a href="#hme">Hierarchical mixtures of experts</a> | |
59 <li> <a href="#qmr">QMR</a> | |
60 <li> <a href="#cg_model">Conditional Gaussian models</a> | |
61 <li> <a href="#hybrid">Other hybrid models</a> | |
62 </ul> | |
63 | |
64 <li> <a href="#param_learning">Parameter learning</a> | |
65 <ul> | |
66 <li> <a href="#load_data">Loading data from a file</a> | |
67 <li> <a href="#mle_complete">Maximum likelihood parameter estimation from complete data</a> | |
68 <li> <a href="#prior">Parameter priors</a> | |
69 <li> <a href="#bayes_learn">(Sequential) Bayesian parameter updating from complete data</a> | |
70 <li> <a href="#em">Maximum likelihood parameter estimation with missing values (EM)</a> | |
71 <li> <a href="#tying">Parameter tying</a> | |
72 </ul> | |
73 | |
74 <li> <a href="#structure_learning">Structure learning</a> | |
75 <ul> | |
76 <li> <a href="#enumerate">Exhaustive search</a> | |
77 <li> <a href="#K2">K2</a> | |
78 <li> <a href="#hill_climb">Hill-climbing</a> | |
79 <li> <a href="#mcmc">MCMC</a> | |
80 <li> <a href="#active">Active learning</a> | |
81 <li> <a href="#struct_em">Structural EM</a> | |
82 <li> <a href="#graphdraw">Visualizing the learned graph structure</a> | |
83 <li> <a href="#constraint">Constraint-based methods</a> | |
84 </ul> | |
85 | |
86 | |
87 <li> <a href="#engines">Inference engines</a> | |
88 <ul> | |
89 <li> <a href="#jtree">Junction tree</a> | |
90 <li> <a href="#varelim">Variable elimination</a> | |
91 <li> <a href="#global">Global inference methods</a> | |
92 <li> <a href="#quickscore">Quickscore</a> | |
93 <li> <a href="#belprop">Belief propagation</a> | |
94 <li> <a href="#sampling">Sampling (Monte Carlo)</a> | |
95 <li> <a href="#engine_summary">Summary of inference engines</a> | |
96 </ul> | |
97 | |
98 | |
99 <li> <a href="#influence">Influence diagrams/ decision making</a> | |
100 | |
101 | |
102 <li> <a href="usage_dbn.html">DBNs, HMMs, Kalman filters and all that</a> | |
103 </ul> | |
104 | |
106 | |
107 | |
108 | |
109 <h1><a name="basics">Creating your first Bayes net</h1> | |
110 | |
111 To define a Bayes net, you must specify the graph structure and then | |
112 the parameters. We look at each in turn, using a simple example | |
113 (adapted from Russell and | |
114 Norvig, "Artificial Intelligence: a Modern Approach", Prentice Hall, | |
115 1995, p454). | |
116 | |
117 | |
118 <h2>Graph structure</h2> | |
119 | |
120 | |
121 Consider the following network. | |
122 | |
123 <p> | |
124 <center> | |
125 <IMG SRC="Figures/sprinkler.gif"> | |
126 </center> | |
127 <p> | |
128 | |
129 <P> | |
130 To specify this directed acyclic graph (dag), we create an adjacency matrix: | |
131 <PRE> | |
132 N = 4; | |
133 dag = zeros(N,N); | |
134 C = 1; S = 2; R = 3; W = 4; | |
135 dag(C,[R S]) = 1; | |
136 dag(R,W) = 1; | |
137 dag(S,W)=1; | |
138 </PRE> | |
139 <P> | |
140 We have numbered the nodes as follows: | |
141 Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4. | |
142 <b>The nodes must always be numbered in topological order, i.e., | |
143 ancestors before descendants.</b> | |
144 For a more complicated graph, this is a little inconvenient: we will | |
145 see how to get around this <a href="usage_dbn.html#bat">below</a>. | |
146 <p> | |
147 In Matlab 6, you can use logical arrays instead of double arrays, | |
148 which are 4 times smaller: | |
149 <pre> | |
150 dag = false(N,N); | |
151 dag(C,[R S]) = true; | |
152 ... | |
153 </pre> | |
However, <b>some graph functions (e.g., acyclic) do not work on
logical arrays</b>!
156 <p> | |
157 You can visualize the resulting graph structure using | |
158 the methods discussed <a href="#graphdraw">below</a>. | |
159 For details on GUIs, | |
160 click <a href="#GUI">here</a>. | |
161 | |
162 <h2>Creating the Bayes net shell</h2> | |
163 | |
164 In addition to specifying the graph structure, | |
165 we must specify the size and type of each node. | |
166 If a node is discrete, its size is the | |
167 number of possible values | |
168 each node can take on; if a node is continuous, | |
169 it can be a vector, and its size is the length of this vector. | |
170 In this case, we will assume all nodes are discrete and binary. | |
171 <PRE> | |
172 discrete_nodes = 1:N; | |
173 node_sizes = 2*ones(1,N); | |
174 </pre> | |
175 If the nodes were not binary, you could type e.g., | |
176 <pre> | |
177 node_sizes = [4 2 3 5]; | |
178 </pre> | |
179 meaning that Cloudy has 4 possible values, | |
180 Sprinkler has 2 possible values, etc. | |
Note that these are categorical (nominal) values, not ordinal values, i.e.,
they are not ordered in any way, unlike 'low', 'medium', 'high'.
183 <p> | |
184 We are now ready to make the Bayes net: | |
185 <pre> | |
186 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes); | |
187 </PRE> | |
188 By default, all nodes are assumed to be discrete, so we can also just | |
189 write | |
190 <pre> | |
191 bnet = mk_bnet(dag, node_sizes); | |
192 </PRE> | |
193 You may also specify which nodes will be observed. | |
If you don't know, or if this is not fixed in advance,
195 just use the empty list (the default). | |
196 <pre> | |
197 onodes = []; | |
198 bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes); | |
199 </PRE> | |
200 Note that optional arguments are specified using a name/value syntax. | |
201 This is common for many BNT functions. | |
202 In general, to find out more about a function (e.g., which optional | |
203 arguments it takes), please see its | |
204 documentation string by typing | |
205 <pre> | |
206 help mk_bnet | |
207 </pre> | |
208 See also other <a href="matlab_tips.html">useful Matlab tips</a>. | |
209 <p> | |
210 It is possible to associate names with nodes, as follows: | |
211 <pre> | |
212 bnet = mk_bnet(dag, node_sizes, 'names', {'cloudy','S','R','W'}, 'discrete', 1:4); | |
213 </pre> | |
214 You can then refer to a node by its name: | |
215 <pre> | |
216 C = bnet.names('cloudy'); % bnet.names is an associative array | |
217 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]); | |
218 </pre> | |
219 This feature uses my own associative array class. | |
220 | |
221 | |
222 <h2><a name="cpt">Parameters</h2> | |
223 | |
224 A model consists of the graph structure and the parameters. | |
225 The parameters are represented by CPD objects (CPD = Conditional | |
226 Probability Distribution), which define the probability distribution | |
227 of a node given its parents. | |
228 (We will use the terms "node" and "random variable" interchangeably.) | |
229 The simplest kind of CPD is a table (multi-dimensional array), which | |
230 is suitable when all the nodes are discrete-valued. Note that the discrete | |
231 values are not assumed to be ordered in any way; that is, they | |
232 represent categorical quantities, like male and female, rather than | |
233 ordinal quantities, like low, medium and high. | |
234 (We will discuss CPDs in more detail <a href="#cpd">below</a>.) | |
235 <p> | |
236 Tabular CPDs, also called CPTs (conditional probability tables), | |
237 are stored as multidimensional arrays, where the dimensions | |
238 are arranged in the same order as the nodes, e.g., the CPT for node 4 | |
239 (WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself. | |
240 Hence the child is always the last dimension. | |
241 If a node has no parents, its CPT is a column vector representing its | |
242 prior. | |
243 Note that in Matlab (unlike C), arrays are indexed | |
from 1, and are laid out in memory such that the first index toggles
245 fastest, e.g., the CPT for node 4 (WetGrass) is as follows | |
246 <P> | |
247 <P><IMG ALIGN=BOTTOM SRC="Figures/CPTgrass.gif"><P> | |
248 <P> | |
249 where we have used the convention that false==1, true==2. | |
250 We can create this CPT in Matlab as follows | |
251 <PRE> | |
252 CPT = zeros(2,2,2); | |
253 CPT(1,1,1) = 1.0; | |
254 CPT(2,1,1) = 0.1; | |
255 ... | |
256 </PRE> | |
257 Here is an easier way: | |
258 <PRE> | |
259 CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]); | |
260 </PRE> | |
261 In fact, we don't need to reshape the array, since the CPD constructor | |
262 will do that for us. So we can just write | |
263 <pre> | |
264 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]); | |
265 </pre> | |
266 The other nodes are created similarly (using the old syntax for | |
267 optional parameters) | |
268 <PRE> | |
269 bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]); | |
270 bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]); | |
271 bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]); | |
272 bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]); | |
273 </PRE> | |
274 | |
275 | |
276 <h2><a name="rnd_cpt">Random Parameters</h2> | |
277 | |
278 If we do not specify the CPT, random parameters will be | |
279 created, i.e., each "row" of the CPT will be drawn from the uniform distribution. | |
280 To ensure repeatable results, use | |
281 <pre> | |
282 rand('state', seed); | |
283 randn('state', seed); | |
284 </pre> | |
285 To control the degree of randomness (entropy), | |
286 you can sample each row of the CPT from a Dirichlet(p,p,...) distribution. | |
287 If p << 1, this encourages "deterministic" CPTs (one entry near 1, the rest near 0). | |
If p = 1, each row is drawn uniformly from the probability simplex.
289 If p >> 1, the entries will all be near 1/k, where k is the arity of | |
290 this node, i.e., each row will be nearly uniform. | |
291 You can do this as follows, assuming this node | |
292 is number i, and ns is the node_sizes. | |
293 <pre> | |
294 k = ns(i); | |
295 ps = parents(dag, i); | |
296 psz = prod(ns(ps)); | |
297 CPT = sample_dirichlet(p*ones(1,k), psz); | |
298 bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT); | |
299 </pre> | |
300 | |
301 | |
302 <h2><a name="file">Loading a network from a file</h2> | |
303 | |
304 If you already have a Bayes net represented in the XML-based | |
305 <a href="http://www.cs.cmu.edu/afs/cs/user/fgcozman/www/Research/InterchangeFormat/"> | |
306 Bayes Net Interchange Format (BNIF)</a> (e.g., downloaded from the | |
307 <a | |
308 href="http://www.cs.huji.ac.il/labs/compbio/Repository"> | |
309 Bayes Net repository</a>), | |
310 you can convert it to BNT format using | |
311 the | |
312 <a href="http://www.digitas.harvard.edu/~ken/bif2bnt/">BIF-BNT Java | |
313 program</a> written by Ken Shan. | |
314 (This is not necessarily up-to-date.) | |
315 <p> | |
316 <b>It is currently not possible to save/load a BNT matlab object to | |
317 file</b>, but this is easily fixed if you modify all the constructors | |
318 for all the classes (see matlab documentation). | |
319 | |
320 <h2><a name="GUI">Creating a model using a GUI</h2> | |
321 | |
322 <ul> | |
323 <li>Senthil Nachimuthu | |
324 has started (Oct 07) an open source | |
325 GUI for BNT called | |
326 <a href="http://projeny.sourceforge.net">projeny</a> | |
327 using Java. This is a successor to BNJ. | |
328 | |
329 <li> | |
330 Philippe LeRay has written (Sep 05) | |
331 a | |
332 <a href="http://banquiseasi.insa-rouen.fr/projects/bnt-editor/"> | |
333 BNT GUI</a> in matlab. | |
334 | |
335 <li> | |
336 <a | |
337 href="http://www.dataonstage.com/BNT/PACKAGES/LinkStrength/index.html"> | |
338 LinkStrength</a>, | |
339 a package by Imme Ebert-Uphoff for visualizing the strength of | |
340 dependencies between nodes. | |
341 | |
342 </ul> | |
343 | |
344 <h2>Graph visualization</h2> | |
345 | |
346 Click <a href="graphviz.html">here</a> for more information | |
347 on graph visualization. | |
348 | |
349 | |
350 <h1><a name="inference">Inference</h1> | |
351 | |
352 Having created the BN, we can now use it for inference. | |
353 There are many different algorithms for doing inference in Bayes nets, | |
354 that make different tradeoffs between speed, | |
355 complexity, generality, and accuracy. | |
356 BNT therefore offers a variety of different inference | |
357 "engines". We will discuss these | |
358 in more detail <a href="#engines">below</a>. | |
359 For now, we will use the junction tree | |
360 engine, which is the mother of all exact inference algorithms. | |
361 This can be created as follows. | |
362 <pre> | |
363 engine = jtree_inf_engine(bnet); | |
364 </pre> | |
365 The other engines have similar constructors, but might take | |
366 additional, algorithm-specific parameters. | |
367 All engines are used in the same way, once they have been created. | |
368 We illustrate this in the following sections. | |
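<p>
For example, the (loopy) belief propagation engine mentioned
<a href="#belprop">below</a> is created in the same way; any
algorithm-specific options are passed as name/value pairs, as with mk_bnet
(type "help belprop_inf_engine" to see which options it actually accepts):
<pre>
engine = belprop_inf_engine(bnet);   % same calling convention as jtree_inf_engine
</pre>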
369 | |
370 | |
371 <h2><a name="marginal">Computing marginal distributions</h2> | |
372 | |
Suppose we want to compute the probability that the sprinkler was on
374 given that the grass is wet. | |
375 The evidence consists of the fact that W=2. All the other nodes | |
376 are hidden (unobserved). We can specify this as follows. | |
377 <pre> | |
378 evidence = cell(1,N); | |
379 evidence{W} = 2; | |
380 </pre> | |
381 We use a 1D cell array instead of a vector to | |
382 cope with the fact that nodes can be vectors of different lengths. | |
383 In addition, the value [] can be used | |
384 to denote 'no evidence', instead of having to specify the observation | |
385 pattern as a separate argument. | |
386 (Click <a href="cellarray.html">here</a> for a quick tutorial on cell | |
387 arrays in matlab.) | |
388 <p> | |
389 We are now ready to add the evidence to the engine. | |
390 <pre> | |
391 [engine, loglik] = enter_evidence(engine, evidence); | |
392 </pre> | |
393 The behavior of this function is algorithm-specific, and is discussed | |
394 in more detail <a href="#engines">below</a>. | |
395 In the case of the jtree engine, | |
396 enter_evidence implements a two-pass message-passing scheme. | |
397 The first return argument contains the modified engine, which | |
398 incorporates the evidence. The second return argument contains the | |
399 log-likelihood of the evidence. (Not all engines are capable of | |
400 computing the log-likelihood.) | |
401 <p> | |
402 Finally, we can compute p=P(S=2|W=2) as follows. | |
403 <PRE> | |
404 marg = marginal_nodes(engine, S); | |
405 marg.T | |
406 ans = | |
407 0.57024 | |
408 0.42976 | |
409 p = marg.T(2); | |
410 </PRE> | |
411 We see that p = 0.4298. | |
412 <p> | |
413 Now let us add the evidence that it was raining, and see what | |
414 difference it makes. | |
415 <PRE> | |
416 evidence{R} = 2; | |
417 [engine, loglik] = enter_evidence(engine, evidence); | |
418 marg = marginal_nodes(engine, S); | |
419 p = marg.T(2); | |
420 </PRE> | |
421 We find that p = P(S=2|W=2,R=2) = 0.1945, | |
422 which is lower than | |
before, because the rain can "explain away" the
424 fact that the grass is wet. | |
425 <p> | |
426 You can plot a marginal distribution over a discrete variable | |
as a bar chart using the built-in 'bar' function:
428 <pre> | |
429 bar(marg.T) | |
430 </pre> | |
431 This is what it looks like | |
432 | |
433 <p> | |
434 <center> | |
435 <IMG SRC="Figures/sprinkler_bar.gif"> | |
436 </center> | |
437 <p> | |
438 | |
439 <h2><a name="observed">Observed nodes</h2> | |
440 | |
441 What happens if we ask for the marginal on an observed node, e.g. P(W|W=2)? | |
442 An observed discrete node effectively only has 1 value (the observed | |
443 one) --- all other values would result in 0 probability. | |
For efficiency, BNT treats observed (discrete) nodes as if their
effective size were 1, as we see below:
446 <pre> | |
447 evidence = cell(1,N); | |
448 evidence{W} = 2; | |
449 engine = enter_evidence(engine, evidence); | |
450 m = marginal_nodes(engine, W); | |
451 m.T | |
452 ans = | |
453 1 | |
454 </pre> | |
455 This can get a little confusing, since we assigned W=2. | |
456 So we can ask BNT to add the evidence back in by passing in an optional argument: | |
457 <pre> | |
458 m = marginal_nodes(engine, W, 1); | |
459 m.T | |
460 ans = | |
461 0 | |
462 1 | |
463 </pre> | |
464 This shows that P(W=1|W=2) = 0 and P(W=2|W=2) = 1. | |
465 | |
466 | |
467 | |
468 <h2><a name="joint">Computing joint distributions</h2> | |
469 | |
470 We can compute the joint probability on a set of nodes as in the | |
471 following example. | |
472 <pre> | |
473 evidence = cell(1,N); | |
474 [engine, ll] = enter_evidence(engine, evidence); | |
475 m = marginal_nodes(engine, [S R W]); | |
476 </pre> | |
477 m is a structure. The 'T' field is a multi-dimensional array (in | |
478 this case, 3-dimensional) that contains the joint probability | |
479 distribution on the specified nodes. | |
480 <pre> | |
481 >> m.T | |
482 ans(:,:,1) = | |
483 0.2900 0.0410 | |
484 0.0210 0.0009 | |
485 ans(:,:,2) = | |
486 0 0.3690 | |
487 0.1890 0.0891 | |
488 </pre> | |
489 We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass | |
490 to be wet if both the rain and sprinkler are off. | |
491 <p> | |
492 Let us now add some evidence to R. | |
493 <pre> | |
494 evidence{R} = 2; | |
495 [engine, ll] = enter_evidence(engine, evidence); | |
496 m = marginal_nodes(engine, [S R W]) | |
497 m = | |
498 domain: [2 3 4] | |
499 T: [2x1x2 double] | |
500 >> m.T | |
501 m.T | |
502 ans(:,:,1) = | |
503 0.0820 | |
504 0.0018 | |
505 ans(:,:,2) = | |
506 0.7380 | |
507 0.1782 | |
508 </pre> | |
509 The joint T(i,j,k) = P(S=i,R=j,W=k|evidence) | |
510 should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible | |
511 with the evidence that R=2. | |
512 Instead of creating large tables with many 0s, BNT sets the effective | |
513 size of observed (discrete) nodes to 1, as explained above. | |
514 This is why m.T has size 2x1x2. | |
515 To get a 2x2x2 table, type | |
516 <pre> | |
517 m = marginal_nodes(engine, [S R W], 1) | |
518 m = | |
519 domain: [2 3 4] | |
520 T: [2x2x2 double] | |
521 >> m.T | |
522 m.T | |
523 ans(:,:,1) = | |
524 0 0.082 | |
525 0 0.0018 | |
526 ans(:,:,2) = | |
527 0 0.738 | |
528 0 0.1782 | |
529 </pre> | |
530 | |
531 <p> | |
532 Note: It is not always possible to compute the joint on arbitrary | |
533 sets of nodes: it depends on which inference engine you use, as discussed | |
534 in more detail <a href="#engines">below</a>. | |
535 | |
536 | |
537 <h2><a name="soft">Soft/virtual evidence</h2> | |
538 | |
539 Sometimes a node is not observed, but we have some distribution over | |
540 its possible values; this is often called "soft" or "virtual" | |
541 evidence. | |
542 One can use this as follows | |
543 <pre> | |
544 [engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence); | |
545 </pre> | |
546 where soft_evidence{i} is either [] (if node i has no soft evidence) | |
547 or is a vector representing the probability distribution over i's | |
548 possible values. | |
549 For example, if we don't know i's exact value, but we know its | |
550 likelihood ratio is 60/40, we can write evidence{i} = [] and | |
551 soft_evidence{i} = [0.6 0.4]. | |
552 <p> | |
553 Currently only jtree_inf_engine supports this option. | |
554 It assumes that all hidden nodes, and all nodes for | |
555 which we have soft evidence, are discrete. | |
556 For a longer example, see BNT/examples/static/softev1.m. | |
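<p>
For instance, here is a minimal sketch of the 60/40 example above, using the
sprinkler network and the jtree engine from earlier (S, R, N and engine are
assumed to be defined as in the inference section):
<pre>
soft_evidence = cell(1,N);
soft_evidence{S} = [0.6 0.4];      % likelihood ratio 60/40 on the sprinkler node
evidence = cell(1,N);              % no hard evidence on any node
[engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
marg = marginal_nodes(engine, R);  % e.g., marginal on Rain given the soft evidence
</pre>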
557 | |
558 | |
559 <h2><a name="mpe">Most probable explanation</h2> | |
560 | |
561 To compute the most probable explanation (MPE) of the evidence (i.e., | |
562 the most probable assignment, or a mode of the joint), use | |
563 <pre> | |
564 [mpe, ll] = calc_mpe(engine, evidence); | |
565 </pre> | |
566 mpe{i} is the most likely value of node i. | |
567 This calls enter_evidence with the 'maximize' flag set to 1, which | |
568 causes the engine to do max-product instead of sum-product. | |
569 The resulting max-marginals are then thresholded. | |
570 If there is more than one maximum probability assignment, we must take | |
571 care to break ties in a consistent manner (thresholding the | |
572 max-marginals may give the wrong result). To force this behavior, | |
573 type | |
574 <pre> | |
575 [mpe, ll] = calc_mpe(engine, evidence, 1); | |
576 </pre> | |
Note that computing the MPE is sometimes called abductive reasoning.
578 | |
579 <p> | |
You can also use <tt>calc_mpe_bucket</tt>, written by Ron Zohar,
which does a forwards max-product pass, and then a backwards traceback
582 pass, which is how Viterbi is traditionally implemented. | |
583 | |
584 | |
585 | |
586 <h1><a name="cpd">Conditional Probability Distributions</h1> | |
587 | |
A Conditional Probability Distribution (CPD)
589 defines P(X(i) | X(Pa(i))), where X(i) is the i'th node, and X(Pa(i)) | |
590 are the parents of node i. There are many ways to represent this | |
591 distribution, which depend in part on whether X(i) and X(Pa(i)) are | |
592 discrete, continuous, or a combination. | |
593 We will discuss various representations below. | |
594 | |
595 | |
596 <h2><a name="tabular">Tabular nodes</h2> | |
597 | |
598 If the CPD is represented as a table (i.e., if it is a multinomial | |
599 distribution), it has a number of parameters that is exponential in | |
600 the number of parents. See the example <a href="#cpt">above</a>. | |
601 | |
602 | |
603 <h2><a name="noisyor">Noisy-or nodes</h2> | |
604 | |
605 A noisy-OR node is like a regular logical OR gate except that | |
606 sometimes the effects of parents that are on get inhibited. | |
607 Let the prob. that parent i gets inhibited be q(i). | |
608 Then a node, C, with 2 parents, A and B, has the following CPD, where | |
609 we use F and T to represent off and on (1 and 2 in BNT). | |
610 <pre> | |
611 A B P(C=off) P(C=on) | |
612 --------------------------- | |
613 F F 1.0 0.0 | |
614 T F q(A) 1-q(A) | |
615 F T q(B) 1-q(B) | |
616 T T q(A)q(B) 1-q(A)q(B) | |
617 </pre> | |
618 Thus we see that the causes get inhibited independently. | |
619 It is common to associate a "leak" node with a noisy-or CPD, which is | |
620 like a parent that is always on. This can account for all other unmodelled | |
621 causes which might turn the node on. | |
622 <p> | |
623 The noisy-or distribution is similar to the logistic distribution. | |
624 To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j) | |
625 be the prob. that j inhibits i. Then | |
626 <pre> | |
627 Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j) | |
628 </pre> | |
629 Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then | |
630 <pre> | |
631 Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j)) | |
632 </pre> | |
633 For a sigmoid node, we have | |
634 <pre> | |
635 Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j)) | |
636 </pre> | |
637 where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of | |
638 the activation function (although both are monotonically increasing). | |
639 In addition, in the case of a noisy-or, the weights are constrained to be | |
640 positive, since they derive from probabilities q(i,j). | |
641 In both cases, the number of parameters is <em>linear</em> in the | |
642 number of parents, unlike the case of a multinomial distribution, | |
643 where the number of parameters is exponential in the number of parents. | |
644 We will see an example of noisy-OR nodes <a href="#qmr">below</a>. | |
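<p>
To create such a node in BNT, use the noisyor_CPD constructor, whose
arguments (as in the QMR example below) are the leak inhibition probability
followed by the vector of inhibition probabilities for the parents.
Here is a small sketch for the node C above, with illustrative values,
assuming a network bnet in which binary node C has binary parents A and B:
<pre>
q_leak = 1.0;          % leak always inhibited, i.e., effectively no leak cause
q = [0.2; 0.3];        % illustrative inhibition probabilities q(A), q(B)
bnet.CPD{C} = noisyor_CPD(bnet, C, q_leak, q);
</pre>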
645 | |
646 | |
647 <h2><a name="deterministic">Other (noisy) deterministic nodes</h2> | |
648 | |
649 Deterministic CPDs for discrete random variables can be created using | |
650 the deterministic_CPD class. It is also possible to 'flip' the output | |
651 of the function with some probability, to simulate noise. | |
652 The boolean_CPD class is just a special case of a | |
653 deterministic CPD, where the parents and child are all binary. | |
654 <p> | |
655 Both of these classes are just "syntactic sugar" for the tabular_CPD | |
656 class. | |
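<p>
Since these classes are syntactic sugar for tabular_CPD, a deterministic node
can equally well be written as an ordinary table containing only 0s and 1s.
For example, here is a sketch of a node C that is the logical OR of its binary
parents A and B (using the convention false==1, true==2, and the same CPT
layout as the WetGrass example above; A, B, C are assumed to be node indices
in some bnet, with A and B numbered before C):
<pre>
% P(C=off|A,B) for (A,B) = (F,F),(T,F),(F,T),(T,T), then P(C=on|A,B)
bnet.CPD{C} = tabular_CPD(bnet, C, 'CPT', [1 0 0 0  0 1 1 1]);
</pre>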
657 | |
658 | |
659 | |
660 <h2><a name="softmax">Softmax nodes</h2> | |
661 | |
662 If we have a discrete node with a continuous parent, | |
663 we can define its CPD using a softmax function | |
664 (also known as the multinomial logit function). | |
665 This acts like a soft thresholding operator, and is defined as follows: | |
666 <pre> | |
667 exp(w(:,i)'*x + b(i)) | |
668 Pr(Q=i | X=x) = ----------------------------- | |
669 sum_j exp(w(:,j)'*x + b(j)) | |
670 | |
671 </pre> | |
672 The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the | |
673 following interpretation: w(:,i)-w(:,j) is the normal vector to the | |
674 decision boundary between classes i and j, | |
675 and b(i)-b(j) is its offset (bias). For example, suppose | |
676 X is a 2-vector, and Q is binary. Then | |
677 <pre> | |
678 w = [1 -1; | |
679 0 0]; | |
680 | |
681 b = [0 0]; | |
682 </pre> | |
683 means class 1 are points in the 2D plane with positive x coordinate, | |
684 and class 2 are points in the 2D plane with negative x coordinate. | |
685 If w has large magnitude, the decision boundary is sharp, otherwise it | |
686 is soft. | |
687 In the special case that Q is binary (0/1), the softmax function reduces to the logistic | |
688 (sigmoid) function. | |
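<p>
As a quick numerical check of the decision-boundary interpretation above
(plain Matlab, not a BNT call), we can evaluate the softmax formula at a point
with positive x coordinate and see that class 1 gets most of the mass:
<pre>
w = [1 -1; 0 0];  b = [0 0];
x = [2; 5];                          % a point with positive x coordinate
p = exp(w'*x + b');  p = p/sum(p)    % p(1) is about 0.98, so class 1 wins
</pre>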
689 <p> | |
690 Fitting a softmax function can be done using the iteratively reweighted | |
691 least squares (IRLS) algorithm. | |
692 We use the implementation from | |
693 <a href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>. | |
694 Note that since | |
695 the softmax distribution is not in the exponential family, it does not | |
696 have finite sufficient statistics, and hence we must store all the | |
697 training data in uncompressed form. | |
698 If this takes too much space, one should use online (stochastic) gradient | |
699 descent (not implemented in BNT). | |
700 <p> | |
701 If a softmax node also has discrete parents, | |
702 we use a different set of w/b parameters for each combination of | |
703 parent values, as in the <a href="#gaussian">conditional linear | |
704 Gaussian CPD</a>. | |
705 This feature was implemented by Pierpaolo Brutti. | |
706 He is currently extending it so that discrete parents can be treated | |
707 as if they were continuous, by adding indicator variables to the X | |
708 vector. | |
709 <p> | |
710 We will see an example of softmax nodes <a href="#mixexp">below</a>. | |
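<p>
In BNT, such a node is created with the softmax_CPD constructor; with no extra
arguments it gets random parameters, as in the mixture of experts example
below (here Q is assumed to be the index of the discrete child node):
<pre>
bnet.CPD{Q} = softmax_CPD(bnet, Q);   % random w and b
</pre>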
711 | |
712 | |
713 <h2><a name="mlp">Neural network nodes</h2> | |
714 | |
Pierpaolo Brutti has implemented the mlp_CPD class, which uses a multi-layer perceptron
716 to implement a mapping from continuous parents to discrete children, | |
717 similar to the softmax function. | |
718 (If there are also discrete parents, it creates a mixture of MLPs.) | |
719 It uses code from <a | |
720 href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>. | |
721 This is work in progress. | |
722 | |
723 <h2><a name="root">Root nodes</h2> | |
724 | |
725 A root node has no parents and no parameters; it can be used to model | |
an observed, exogenous input variable, i.e., one which is "outside"
727 the model. | |
728 This is useful for conditional density models. | |
729 We will see an example of root nodes <a href="#mixexp">below</a>. | |
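<p>
Creating one is trivial, since there are no parameters to specify; this is the
same call used for the input node in the mixture of experts example below
(here X is assumed to be the index of the exogenous input node):
<pre>
bnet.CPD{X} = root_CPD(bnet, X);
</pre>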
730 | |
731 | |
732 <h2><a name="gaussian">Gaussian nodes</h2> | |
733 | |
734 We now consider a distribution suitable for the continuous-valued nodes. | |
735 Suppose the node is called Y, its continuous parents (if any) are | |
736 called X, and its discrete parents (if any) are called Q. | |
737 The distribution on Y is defined as follows: | |
738 <pre> | |
739 - no parents: Y ~ N(mu, Sigma) | |
740 - cts parents : Y|X=x ~ N(mu + W x, Sigma) | |
741 - discrete parents: Y|Q=i ~ N(mu(:,i), Sigma(:,:,i)) | |
742 - cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i)) | |
743 </pre> | |
744 where N(mu, Sigma) denotes a Normal distribution with mean mu and | |
745 covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q | |
746 respectively. | |
If there are no discrete parents, |Q|=1; if there is
more than one, then |Q| = the product of their sizes.
749 If there are no continuous parents, |X|=0; if there is more than one, | |
750 then |X| = the sum of their sizes. | |
751 Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive | |
752 semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight) | |
753 matrix. | |
754 <p> | |
755 We can create a Gaussian node with random parameters as follows. | |
756 <pre> | |
757 bnet.CPD{i} = gaussian_CPD(bnet, i); | |
758 </pre> | |
759 We can specify the value of one or more of the parameters as in the | |
760 following example, in which |Y|=2, and |Q|=1. | |
761 <pre> | |
762 bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', [0; 0], 'weights', randn(Y,X), 'cov', eye(Y)); | |
763 </pre> | |
764 <p> | |
765 We will see an example of conditional linear Gaussian nodes <a | |
766 href="#cg_model">below</a>. | |
767 <p> | |
768 <b>When learning Gaussians from data</b>, it is helpful to ensure the | |
data has a small magnitude
770 (see e.g., KPMstats/standardize) to prevent numerical problems. | |
771 Unless you have a lot of data, it is also a very good idea to use | |
772 diagonal instead of full covariance matrices. | |
773 (BNT does not currently support spherical covariances, although it | |
774 would be easy to add, since KPMstats/clg_Mstep supports this option; | |
775 you would just need to modify gaussian_CPD/update_ess to accumulate | |
776 weighted inner products.) | |
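<p>
For example, here is a minimal sketch of creating a Gaussian node with a
diagonal covariance, using the same 'cov_type' option that appears in the
factor analysis example below:
<pre>
% see KPMstats/standardize for rescaling the data first, as recommended above
bnet.CPD{i} = gaussian_CPD(bnet, i, 'cov_type', 'diag');
</pre>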
777 | |
778 | |
779 | |
780 <h2><a name="nongauss">Other continuous distributions</h2> | |
781 | |
782 Currently BNT does not support any CPDs for continuous nodes other | |
783 than the Gaussian. | |
784 However, you can use a mixture of Gaussians to | |
approximate other continuous distributions. We will see an example
786 of this with the IFA model <a href="#pca">below</a>. | |
787 | |
788 | |
789 <h2><a name="glm">Generalized linear model nodes</h2> | |
790 | |
791 In the future, we may incorporate some of the functionality of | |
792 <a href = | |
793 "http://www.sci.usq.edu.au/staff/dunn/glmlab/glmlab.html">glmlab</a> | |
794 into BNT. | |
795 | |
796 | |
797 <h2><a name="dtree">Classification/regression tree nodes</h2> | |
798 | |
799 We plan to add classification and regression trees to define CPDs for | |
800 discrete and continuous nodes, respectively. | |
801 Trees have many advantages: they are easy to interpret, they can do | |
802 feature selection, they can | |
803 handle discrete and continuous inputs, they do not make strong | |
804 assumptions about the form of the distribution, the number of | |
805 parameters can grow in a data-dependent way (i.e., they are | |
806 semi-parametric), they can handle missing data, etc. | |
807 However, they are not yet implemented. | |
808 <!-- | |
809 Yimin Zhang is currently (Feb '02) implementing this. | |
810 --> | |
811 | |
812 | |
813 <h2><a name="cpd_summary">Summary of CPD types</h2> | |
814 | |
815 We list all the different types of CPDs supported by BNT. | |
816 For each CPD, we specify if the child and parents can be discrete (D) or | |
817 continuous (C) (Binary (B) nodes are a special case). | |
818 We also specify which methods each class supports. | |
819 If a method is inherited, the name of the parent class is mentioned. | |
820 If a parent class calls a child method, this is mentioned. | |
821 <p> | |
822 The <tt>CPD_to_CPT</tt> method converts a CPD to a table; this | |
823 requires that the child and all parents are discrete. | |
824 The CPT might be exponentially big... | |
825 <tt>convert_to_table</tt> evaluates a CPD with evidence, and | |
represents the resulting potential as an array.
827 This requires that the child is discrete, and any continuous parents | |
828 are observed. | |
829 <tt>convert_to_pot</tt> evaluates a CPD with evidence, and | |
830 represents the resulting potential as a dpot, gpot, cgpot or upot, as | |
831 requested. (d=discrete, g=Gaussian, cg = conditional Gaussian, u = | |
832 utility). | |
833 | |
834 <p> | |
835 When we sample a node, all the parents are observed. | |
836 When we compute the (log) probability of a node, all the parents and | |
837 the child are observed. | |
838 <p> | |
839 We also specify if the parameters are learnable. | |
840 For learning with EM, we require | |
841 the methods <tt>reset_ess</tt>, <tt>update_ess</tt> and | |
842 <tt>maximize_params</tt>. | |
843 For learning from fully observed data, we require | |
844 the method <tt>learn_params</tt>. | |
845 By default, all classes inherit this from generic_CPD, which simply | |
846 calls <tt>update_ess</tt> N times, once for each data case, followed | |
847 by <tt>maximize_params</tt>, i.e., it is like EM, without the E step. | |
848 Some classes implement a batch formula, which is quicker. | |
849 <p> | |
850 Bayesian learning means computing a posterior over the parameters | |
851 given fully observed data. | |
852 <p> | |
853 Pearl means we implement the methods <tt>compute_pi</tt> and | |
854 <tt>compute_lambda_msg</tt>, used by | |
855 <tt>pearl_inf_engine</tt>, which runs on directed graphs. | |
<tt>belprop_inf_engine</tt> only needs <tt>convert_to_pot</tt>.
857 The pearl methods can exploit special properties of the CPDs for | |
858 computing the messages efficiently, whereas belprop does not. | |
859 <p> | |
860 The only method implemented by generic_CPD is <tt>adjustable_CPD</tt>, | |
861 which is not shown, since it is not very interesting. | |
862 | |
863 | |
864 <p> | |
865 | |
866 | |
<table border=1><tr>
869 <td align=center>Name | |
870 <td align=center>Child | |
871 <td align=center>Parents | |
872 <td align=center>Comments | |
873 <td align=center>CPD_to_CPT | |
874 <td align=center>conv_to_table | |
875 <td align=center>conv_to_pot | |
876 <td align=center>sample | |
877 <td align=center>prob | |
878 <td align=center>learn | |
879 <td align=center>Bayes | |
880 <td align=center>Pearl | |
881 | |
882 | |
883 <tr> | |
884 <!-- Name--><td> | |
885 <!-- Child--><td> | |
886 <!-- Parents--><td> | |
887 <!-- Comments--><td> | |
888 <!-- CPD_to_CPT--><td> | |
889 <!-- conv_to_table--><td> | |
890 <!-- conv_to_pot--><td> | |
891 <!-- sample--><td> | |
892 <!-- prob--><td> | |
893 <!-- learn--><td> | |
894 <!-- Bayes--><td> | |
895 <!-- Pearl--><td> | |
896 | |
897 <tr> | |
898 <!-- Name--><td>boolean | |
899 <!-- Child--><td>B | |
900 <!-- Parents--><td>B | |
901 <!-- Comments--><td>Syntactic sugar for tabular | |
902 <!-- CPD_to_CPT--><td>- | |
903 <!-- conv_to_table--><td>- | |
904 <!-- conv_to_pot--><td>- | |
905 <!-- sample--><td>- | |
906 <!-- prob--><td>- | |
907 <!-- learn--><td>- | |
908 <!-- Bayes--><td>- | |
909 <!-- Pearl--><td>- | |
910 | |
911 <tr> | |
912 <!-- Name--><td>deterministic | |
913 <!-- Child--><td>D | |
914 <!-- Parents--><td>D | |
915 <!-- Comments--><td>Syntactic sugar for tabular | |
916 <!-- CPD_to_CPT--><td>- | |
917 <!-- conv_to_table--><td>- | |
918 <!-- conv_to_pot--><td>- | |
919 <!-- sample--><td>- | |
920 <!-- prob--><td>- | |
921 <!-- learn--><td>- | |
922 <!-- Bayes--><td>- | |
923 <!-- Pearl--><td>- | |
924 | |
925 <tr> | |
926 <!-- Name--><td>Discrete | |
927 <!-- Child--><td>D | |
928 <!-- Parents--><td>C/D | |
929 <!-- Comments--><td>Virtual class | |
930 <!-- CPD_to_CPT--><td>N | |
931 <!-- conv_to_table--><td>Calls CPD_to_CPT | |
932 <!-- conv_to_pot--><td>Calls conv_to_table | |
933 <!-- sample--><td>Calls conv_to_table | |
934 <!-- prob--><td>Calls conv_to_table | |
935 <!-- learn--><td>N | |
936 <!-- Bayes--><td>N | |
937 <!-- Pearl--><td>N | |
938 | |
939 <tr> | |
940 <!-- Name--><td>Gaussian | |
941 <!-- Child--><td>C | |
942 <!-- Parents--><td>C/D | |
943 <!-- Comments--><td>- | |
944 <!-- CPD_to_CPT--><td>N | |
945 <!-- conv_to_table--><td>N | |
946 <!-- conv_to_pot--><td>Y | |
947 <!-- sample--><td>Y | |
948 <!-- prob--><td>Y | |
949 <!-- learn--><td>Y | |
950 <!-- Bayes--><td>N | |
951 <!-- Pearl--><td>N | |
952 | |
953 <tr> | |
954 <!-- Name--><td>gmux | |
955 <!-- Child--><td>C | |
956 <!-- Parents--><td>C/D | |
957 <!-- Comments--><td>multiplexer | |
958 <!-- CPD_to_CPT--><td>N | |
959 <!-- conv_to_table--><td>N | |
960 <!-- conv_to_pot--><td>Y | |
961 <!-- sample--><td>N | |
962 <!-- prob--><td>N | |
963 <!-- learn--><td>N | |
964 <!-- Bayes--><td>N | |
965 <!-- Pearl--><td>Y | |
966 | |
967 | |
968 <tr> | |
969 <!-- Name--><td>MLP | |
970 <!-- Child--><td>D | |
971 <!-- Parents--><td>C/D | |
972 <!-- Comments--><td>multi layer perceptron | |
973 <!-- CPD_to_CPT--><td>N | |
974 <!-- conv_to_table--><td>Y | |
975 <!-- conv_to_pot--><td>Inherits from discrete | |
976 <!-- sample--><td>Inherits from discrete | |
977 <!-- prob--><td>Inherits from discrete | |
978 <!-- learn--><td>Y | |
979 <!-- Bayes--><td>N | |
980 <!-- Pearl--><td>N | |
981 | |
982 | |
983 <tr> | |
984 <!-- Name--><td>noisy-or | |
985 <!-- Child--><td>B | |
986 <!-- Parents--><td>B | |
987 <!-- Comments--><td>- | |
988 <!-- CPD_to_CPT--><td>Y | |
989 <!-- conv_to_table--><td>Inherits from discrete | |
990 <!-- conv_to_pot--><td>Inherits from discrete | |
991 <!-- sample--><td>Inherits from discrete | |
992 <!-- prob--><td>Inherits from discrete | |
993 <!-- learn--><td>N | |
994 <!-- Bayes--><td>N | |
995 <!-- Pearl--><td>Y | |
996 | |
997 | |
998 <tr> | |
999 <!-- Name--><td>root | |
1000 <!-- Child--><td>C/D | |
1001 <!-- Parents--><td>none | |
1002 <!-- Comments--><td>no params | |
1003 <!-- CPD_to_CPT--><td>N | |
1004 <!-- conv_to_table--><td>N | |
1005 <!-- conv_to_pot--><td>Y | |
1006 <!-- sample--><td>Y | |
1007 <!-- prob--><td>Y | |
1008 <!-- learn--><td>N | |
1009 <!-- Bayes--><td>N | |
1010 <!-- Pearl--><td>N | |
1011 | |
1012 | |
1013 <tr> | |
1014 <!-- Name--><td>softmax | |
1015 <!-- Child--><td>D | |
1016 <!-- Parents--><td>C/D | |
1017 <!-- Comments--><td>- | |
1018 <!-- CPD_to_CPT--><td>N | |
1019 <!-- conv_to_table--><td>Y | |
1020 <!-- conv_to_pot--><td>Inherits from discrete | |
1021 <!-- sample--><td>Inherits from discrete | |
1022 <!-- prob--><td>Inherits from discrete | |
1023 <!-- learn--><td>Y | |
1024 <!-- Bayes--><td>N | |
1025 <!-- Pearl--><td>N | |
1026 | |
1027 | |
1028 <tr> | |
1029 <!-- Name--><td>generic | |
1030 <!-- Child--><td>C/D | |
1031 <!-- Parents--><td>C/D | |
1032 <!-- Comments--><td>Virtual class | |
1033 <!-- CPD_to_CPT--><td>N | |
1034 <!-- conv_to_table--><td>N | |
1035 <!-- conv_to_pot--><td>N | |
1036 <!-- sample--><td>N | |
1037 <!-- prob--><td>N | |
1038 <!-- learn--><td>N | |
1039 <!-- Bayes--><td>N | |
1040 <!-- Pearl--><td>N | |
1041 | |
1042 | |
1043 <tr> | |
1044 <!-- Name--><td>Tabular | |
1045 <!-- Child--><td>D | |
1046 <!-- Parents--><td>D | |
1047 <!-- Comments--><td>- | |
1048 <!-- CPD_to_CPT--><td>Y | |
1049 <!-- conv_to_table--><td>Inherits from discrete | |
1050 <!-- conv_to_pot--><td>Inherits from discrete | |
1051 <!-- sample--><td>Inherits from discrete | |
1052 <!-- prob--><td>Inherits from discrete | |
1053 <!-- learn--><td>Y | |
1054 <!-- Bayes--><td>Y | |
1055 <!-- Pearl--><td>Y | |
1056 | |
1057 </table> | |
1058 | |
1059 | |
1060 | |
1061 <h1><a name="examples">Example models</h1> | |
1062 | |
1063 | |
1064 <h2>Gaussian mixture models</h2> | |
1065 | |
1066 Richard W. DeVaul has made a detailed tutorial on how to fit mixtures | |
1067 of Gaussians using BNT. Available | |
1068 <a href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">here</a>. | |
1069 | |
1070 | |
1071 <h2><a name="pca">PCA, ICA, and all that </h2> | |
1072 | |
1073 In Figure (a) below, we show how Factor Analysis can be thought of as a | |
1074 graphical model. Here, X has an N(0,I) prior, and | |
1075 Y|X=x ~ N(mu + Wx, Psi), | |
1076 where Psi is diagonal and W is called the "factor loading matrix". | |
1077 Since the noise on both X and Y is diagonal, the components of these | |
1078 vectors are uncorrelated, and hence can be represented as individual | |
1079 scalar nodes, as we show in (b). | |
1080 (This is useful if parts of the observations on the Y vector are occasionally missing.) | |
1081 We usually take k=|X| << |Y|=D, so the model tries to explain | |
1082 many observations using a low-dimensional subspace. | |
1083 | |
1084 | |
1085 <center> | |
1086 <table> | |
1087 <tr> | |
1088 <td><img src="Figures/fa.gif"> | |
1089 <td><img src="Figures/fa_scalar.gif"> | |
1090 <td><img src="Figures/mfa.gif"> | |
1091 <td><img src="Figures/ifa.gif"> | |
1092 <tr> | |
1093 <td align=center> (a) | |
1094 <td align=center> (b) | |
1095 <td align=center> (c) | |
1096 <td align=center> (d) | |
1097 </table> | |
1098 </center> | |
1099 | |
1100 <p> | |
1101 We can create this model in BNT as follows. | |
1102 <pre> | |
1103 ns = [k D]; | |
1104 dag = zeros(2,2); | |
1105 dag(1,2) = 1; | |
1106 bnet = mk_bnet(dag, ns, 'discrete', []); | |
1107 bnet.CPD{1} = gaussian_CPD(bnet, 1, 'mean', zeros(k,1), 'cov', eye(k), ... | |
1108 'cov_type', 'diag', 'clamp_mean', 1, 'clamp_cov', 1); | |
1109 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(D,1), 'cov', diag(Psi0), 'weights', W0, ... | |
1110 'cov_type', 'diag', 'clamp_mean', 1); | |
1111 </pre> | |
1112 | |
1113 The root node is clamped to the N(0,I) distribution, so that we will | |
1114 not update these parameters during learning. | |
1115 The mean of the leaf node is clamped to 0, | |
1116 since we assume the data has been centered (had its mean subtracted | |
1117 off); this is just for simplicity. | |
1118 Finally, the covariance of the leaf node is constrained to be | |
1119 diagonal. W0 and Psi0 are the initial parameter guesses. | |
1120 | |
1121 <p> | |
1122 We can fit this model (i.e., estimate its parameters in a maximum | |
1123 likelihood (ML) sense) using EM, as we | |
1124 explain <a href="#em">below</a>. | |
1125 Not surprisingly, the ML estimates for mu and Psi turn out to be | |
1126 identical to the | |
1127 sample mean and variance, which can be computed directly as | |
1128 <pre> | |
1129 mu_ML = mean(data); | |
1130 Psi_ML = diag(cov(data)); | |
1131 </pre> | |
1132 Note that W can only be identified up to a rotation matrix, because of | |
1133 the spherical symmetry of the source. | |
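<p>
Concretely, here is a sketch of the EM fit (assuming data is an ncases-by-D
matrix; the evidence cell array is laid out as in the mixture of experts
example below, with only the leaf node Y = node 2 observed):
<pre>
ncases = size(data, 1);
cases = cell(2, ncases);
cases(2,:) = num2cell(data', 1);       % each cell holds one observed D-vector y
engine = jtree_inf_engine(bnet);
[bnet2, LLtrace] = learn_params_em(engine, cases, 10);  % at most 10 EM iterations
</pre>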
1134 | |
1135 <p> | |
1136 If we restrict Psi to be spherical, i.e., Psi = sigma*I, | |
1137 there is a closed-form solution for W as well, | |
1138 i.e., we do not need to use EM. | |
1139 In particular, W contains the first |X| eigenvectors of the sample covariance | |
1140 matrix, with scalings determined by the eigenvalues and sigma. | |
1141 Classical PCA can be obtained by taking the sigma->0 limit. | |
1142 For details, see | |
1143 | |
1144 <ul> | |
1145 <li> <a href="ftp://hope.caltech.edu/pub/roweis/Empca/empca.ps"> | |
1146 "EM algorithms for PCA and SPCA"</a>, Sam Roweis, NIPS 97. | |
1147 (<a href="ftp://hope.caltech.edu/pub/roweis/Code/empca.tar.gz"> | |
1148 Matlab software</a>) | |
1149 | |
1150 <p> | |
1151 <li> | |
1152 <a | |
1153 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003> | |
1154 "Mixtures of probabilistic principal component analyzers"</a>, | |
1155 Tipping and Bishop, Neural Computation 11(2):443--482, 1999. | |
1156 </ul> | |
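<p>
As a rough sanity check of the eigenvector claim above, here is the classical
PCA computation in plain Matlab (nothing BNT-specific; data and k are as
defined earlier):
<pre>
[V, D] = eig(cov(data));         % eigendecomposition of the sample covariance
[negevals, order] = sort(-diag(D));   % sort eigenvalues in decreasing order
W_pca = V(:, order(1:k));        % columns span the first k principal directions
</pre>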
1157 | |
1158 <p> | |
1159 By adding a hidden discrete variable, we can create mixtures of FA | |
1160 models, as shown in (c). | |
1161 Now we can explain the data using a set of subspaces. | |
1162 We can create this model in BNT as follows. | |
1163 <pre> | |
1164 ns = [M k D]; | |
1165 dag = zeros(3); | |
1166 dag(1,3) = 1; | |
1167 dag(2,3) = 1; | |
1168 bnet = mk_bnet(dag, ns, 'discrete', 1); | |
1169 bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0); | |
1170 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(k, 1), 'cov', eye(k), 'cov_type', 'diag', ... | |
1171 'clamp_mean', 1, 'clamp_cov', 1); | |
1172 bnet.CPD{3} = gaussian_CPD(bnet, 3, 'mean', Mu0', 'cov', repmat(diag(Psi0), [1 1 M]), ... | |
1173 'weights', W0, 'cov_type', 'diag', 'tied_cov', 1); | |
1174 </pre> | |
1175 Notice how the covariance matrix for Y is the same for all values of | |
1176 Q; that is, the noise level in each sub-space is assumed the same. | |
1177 However, we allow the offset, mu, to vary. | |
1178 For details, see | |
1179 <ul> | |
1180 | |
1181 <LI> | |
1182 <a HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/tr-96-1.ps.gz"> The EM | |
1183 Algorithm for Mixtures of Factor Analyzers </A>, | |
1184 Ghahramani, Z. and Hinton, G.E. (1996), | |
1185 University of Toronto | |
1186 Technical Report CRG-TR-96-1. | |
1187 (<A HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/mfa.tar.gz">Matlab software</A>) | |
1188 | |
1189 <p> | |
1190 <li> | |
1191 <a | |
1192 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003> | |
1193 "Mixtures of probabilistic principal component analyzers"</a>, | |
1194 Tipping and Bishop, Neural Computation 11(2):443--482, 1999. | |
1195 </ul> | |
1196 | |
1197 <p> | |
1198 I have included Zoubin's specialized MFA code (with his permission) | |
1199 with the toolbox, so you can check that BNT gives the same results: | |
1200 see 'BNT/examples/static/mfa1.m'. | |
1201 | |
1202 <p> | |
1203 Independent Factor Analysis (IFA) generalizes FA by allowing a | |
1204 non-Gaussian prior on each component of X. | |
1205 (Note that we can approximate a non-Gaussian prior using a mixture of | |
1206 Gaussians.) | |
1207 This means that the likelihood function is no longer rotationally | |
1208 invariant, so we can uniquely identify W and the hidden | |
1209 sources X. | |
1210 IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y). | |
1211 We recover classical Independent Components Analysis (ICA) | |
1212 in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the | |
1213 weight matrix W is square and invertible. | |
1214 For details, see | |
1215 <ul> | |
1216 <li> | |
1217 <a href="http://www.gatsby.ucl.ac.uk/~hagai/ifa.ps">Independent Factor | |
1218 Analysis</a>, H. Attias, Neural Computation 11: 803--851, 1998. | |
1219 </ul> | |
1220 | |
1221 | |
1222 | |
1223 <h2><a name="mixexp">Mixtures of experts</h2> | |
1224 | |
1225 As an example of the use of the softmax function, | |
1226 we introduce the Mixture of Experts model. | |
1227 <!-- | |
1228 We also show | |
1229 the Hierarchical Mixture of Experts model, where the hierarchy has two | |
1230 levels. | |
1231 (This is essentially a probabilistic decision tree of height two.) | |
1232 --> | |
1233 As before, | |
1234 circles denote continuous-valued nodes, | |
1235 squares denote discrete nodes, clear | |
1236 means hidden, and shaded means observed. | |
1237 <p> | |
1238 <center> | |
1239 <table> | |
1240 <tr> | |
1241 <td><img src="Figures/mixexp.gif"> | |
1242 <!-- | |
1243 <td><img src="Figures/hme.gif"> | |
1244 --> | |
1245 </table> | |
1246 </center> | |
1247 <p> | |
1248 X is the observed | |
1249 input, Y is the output, and | |
1250 the Q nodes are hidden "gating" nodes, which select the appropriate | |
1251 set of parameters for Y. During training, Y is assumed observed, | |
1252 but for testing, the goal is to predict Y given X. | |
1253 Note that this is a <em>conditional</em> density model, so we don't | |
1254 associate any parameters with X. | |
1255 Hence X's CPD will be a root CPD, which is a way of modelling | |
1256 exogenous nodes. | |
1257 If the output is a continuous-valued quantity, | |
1258 we assume the "experts" are linear-regression units, | |
1259 and set Y's CPD to linear-Gaussian. | |
1260 If the output is discrete, we set Y's CPD to a softmax function. | |
1261 The Q CPDs will always be softmax functions. | |
1262 | |
1263 <p> | |
1264 As a concrete example, consider the mixture of experts model where X and Y are | |
1265 scalars, and Q is binary. | |
1266 This is just piecewise linear regression, where | |
1267 we have two line segments, i.e., | |
1268 <P> | |
1269 <IMG ALIGN=BOTTOM SRC="Eqns/lin_reg_eqn.gif"> | |
1270 <P> | |
1271 We can create this model with random parameters as follows. | |
1272 (This code is bundled in BNT/examples/static/mixexp2.m.) | |
1273 <PRE> | |
1274 X = 1; | |
1275 Q = 2; | |
1276 Y = 3; | |
1277 dag = zeros(3,3); | |
dag(X,[Q Y]) = 1;
1279 dag(Q,Y) = 1; | |
1280 ns = [1 2 1]; % make X and Y scalars, and have 2 experts | |
1281 onodes = [1 3]; | |
1282 bnet = mk_bnet(dag, ns, 'discrete', 2, 'observed', onodes); | |
1283 | |
1284 rand('state', 0); | |
1285 randn('state', 0); | |
1286 bnet.CPD{1} = root_CPD(bnet, 1); | |
1287 bnet.CPD{2} = softmax_CPD(bnet, 2); | |
1288 bnet.CPD{3} = gaussian_CPD(bnet, 3); | |
1289 </PRE> | |
1290 Now let us fit this model using <a href="#em">EM</a>. | |
1291 First we <a href="#load_data">load the data</a> (1000 training cases) and plot them. | |
1292 <P> | |
1293 <PRE> | |
1294 data = load('/examples/static/Misc/mixexp_data.txt', '-ascii'); | |
1295 plot(data(:,1), data(:,2), '.'); | |
1296 </PRE> | |
1297 <p> | |
1298 <center> | |
1299 <IMG SRC="Figures/mixexp_data.gif"> | |
1300 </center> | |
1301 <p> | |
1302 This is what the model looks like before training. | |
1303 (Thanks to Thomas Hofman for writing this plotting routine.) | |
1304 <p> | |
1305 <center> | |
1306 <IMG SRC="Figures/mixexp_before.gif"> | |
1307 </center> | |
1308 <p> | |
1309 Now let's train the model, and plot the final performance. | |
1310 (We will discuss how to train models in more detail <a href="#param_learning">below</a>.) | |
1311 <P> | |
1312 <PRE> | |
1313 ncases = size(data, 1); % each row of data is a training case | |
1314 cases = cell(3, ncases); | |
1315 cases([1 3], :) = num2cell(data'); % each column of cases is a training case | |
1316 engine = jtree_inf_engine(bnet); | |
1317 max_iter = 20; | |
1318 [bnet2, LLtrace] = learn_params_em(engine, cases, max_iter); | |
1319 </PRE> | |
1320 (We specify which nodes will be observed when we create the engine. | |
1321 Hence BNT knows that the hidden nodes are all discrete. | |
1322 For complex models, this can lead to a significant speedup.) | |
1323 Below we show what the model looks like after 16 iterations of EM | |
1324 (with 100 IRLS iterations per M step), when it converged | |
1325 using the default convergence tolerance (that the | |
1326 fractional change in the log-likelihood be less than 1e-3). | |
1327 Before learning, the log-likelihood was | |
1328 -322.927442; afterwards, it was -13.728778. | |
1329 <p> | |
1330 <center> | |
1331 <IMG SRC="Figures/mixexp_after.gif"> | |
1332 </center> | |
1333 (See BNT/examples/static/mixexp2.m for details of the code.) | |
1334 | |
1335 | |
1336 | |
1337 <h2><a name="hme">Hierarchical mixtures of experts</h2> | |
1338 | |
1339 A hierarchical mixture of experts (HME) extends the mixture of experts | |
1340 model by having more than one hidden node. A two-level example is shown below, along | |
1341 with its more traditional representation as a neural network. | |
1342 This is like a (balanced) probabilistic decision tree of height 2. | |
1343 <p> | |
1344 <center> | |
1345 <IMG SRC="Figures/HMEforMatlab.jpg"> | |
1346 </center> | |
1347 <p> | |
1348 <a href="mailto:pbrutti@stat.cmu.edu">Pierpaolo Brutti</a> | |
1349 has written an extensive set of routines for HMEs, | |
1350 which are bundled with BNT: see the examples/static/HME directory. | |
1351 These routines allow you to choose the number of hidden (gating) | |
1352 layers, and the form of the experts (softmax or MLP). | |
1353 See the file hmemenu, which provides a demo. | |
1354 For example, the figure below shows the decision boundaries learned | |
1355 for a ternary classification problem, using a 2 level HME with softmax | |
1356 gates and softmax experts; the training set is on the left, the | |
1357 testing set on the right. | |
1358 <p> | |
1359 <center> | |
1360 <!--<IMG SRC="Figures/hme_dec_boundary.gif">--> | |
1361 <IMG SRC="Figures/hme_dec_boundary.png"> | |
1362 </center> | |
1363 <p> | |
1364 | |
1365 | |
1366 <p> | |
1367 For more details, see the following: | |
1368 <ul> | |
1369 | |
1370 <li> <a href="http://www.cs.berkeley.edu/~jordan/papers/hierarchies.ps.Z"> | |
1371 Hierarchical mixtures of experts and the EM algorithm</a> | |
1372 M. I. Jordan and R. A. Jacobs. Neural Computation, 6, 181-214, 1994. | |
1373 | |
1374 <li> <a href = | |
1375 "http://www.cs.berkeley.edu/~dmartin/software">David Martin's | |
1376 matlab code for HME</a> | |
1377 | |
1378 <li> <a | |
1379 href="http://www.cs.berkeley.edu/~jordan/papers/uai.ps.Z">Why the | |
1380 logistic function? A tutorial discussion on | |
1381 probabilities and neural networks.</a> M. I. Jordan. MIT Computational | |
1382 Cognitive Science Report 9503, August 1995. | |
1383 | |
1384 <li> "Generalized Linear Models", McCullagh and Nelder, Chapman and | |
Hall, 1983.
1386 | |
1387 <li> | |
1388 "Improved learning algorithms for mixtures of experts in multiclass | |
1389 classification". | |
1390 K. Chen, L. Xu, H. Chi. | |
1391 Neural Networks (1999) 12: 1229-1252. | |
1392 | |
1393 <li> <a href="http://www.oigeeza.com/steve/"> | |
1394 Classification Using Hierarchical Mixtures of Experts</a> | |
1395 S.R. Waterhouse and A.J. Robinson. | |
1396 In Proc. IEEE Workshop on Neural Network for Signal Processing IV (1994), pp. 177-186 | |
1397 | |
1398 <li> <a href="http://www.idiap.ch/~perry/"> | |
1399 Localized mixtures of experts</a>, | |
1400 P. Moerland, 1998. | |
1401 | |
1402 <li> "Nonlinear gated experts for time series", | |
1403 A.S. Weigend and M. Mangeas, 1995. | |
1404 | |
1405 </ul> | |
1406 | |
1407 | |
1408 <h2><a name="qmr">QMR</h2> | |
1409 | |
1410 Bayes nets originally arose out of an attempt to add probabilities to | |
1411 expert systems, and this is still the most common use for BNs. | |
1412 A famous example is | |
1413 QMR-DT, a decision-theoretic reformulation of the Quick Medical | |
1414 Reference (QMR) model. | |
1415 <p> | |
1416 <center> | |
1417 <IMG ALIGN=BOTTOM SRC="Figures/qmr.gif"> | |
1418 </center> | |
1419 Here, the top layer represents hidden disease nodes, and the bottom | |
1420 layer represents observed symptom nodes. | |
1421 The goal is to infer the posterior probability of each disease given | |
1422 all the symptoms (which can be present, absent or unknown). | |
1423 Each node in the top layer has a Bernoulli prior (with a low prior | |
1424 probability that the disease is present). | |
1425 Since each node in the bottom layer has a high fan-in, we use a | |
1426 noisy-OR parameterization; each disease has an independent chance of | |
1427 causing each symptom. | |
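(Concretely, with the inhibition-probability convention used in the code below,
P(finding j is absent | diseases) = leak(j) * prod over the "on" diseases i of inhibit(i,j),
so switching on more diseases can only make the finding more likely to be present.)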
The real QMR-DT model is copyrighted, but
1429 we can create a random QMR-like model as follows. | |
1430 <pre> | |
1431 function bnet = mk_qmr_bnet(G, inhibit, leak, prior) | |
1432 % MK_QMR_BNET Make a QMR model | |
1433 % bnet = mk_qmr_bnet(G, inhibit, leak, prior) | |
1434 % | |
1435 % G(i,j) = 1 iff there is an arc from disease i to finding j | |
1436 % inhibit(i,j) = inhibition probability on i->j arc | |
1437 % leak(j) = inhibition prob. on leak->j arc | |
1438 % prior(i) = prob. disease i is on | |
1439 | |
1440 [Ndiseases Nfindings] = size(inhibit); | |
1441 N = Ndiseases + Nfindings; | |
1442 finding_node = Ndiseases+1:N; | |
1443 ns = 2*ones(1,N); | |
1444 dag = zeros(N,N); | |
1445 dag(1:Ndiseases, finding_node) = G; | |
1446 bnet = mk_bnet(dag, ns, 'observed', finding_node); | |
1447 | |
1448 for d=1:Ndiseases | |
1449 CPT = [1-prior(d) prior(d)]; | |
1450 bnet.CPD{d} = tabular_CPD(bnet, d, CPT'); | |
1451 end | |
1452 | |
1453 for i=1:Nfindings | |
1454 fnode = finding_node(i); | |
1455 ps = parents(G, i); | |
1456 bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(ps, i)); | |
1457 end | |
1458 </pre> | |
1459 In the file BNT/examples/static/qmr1, we create a random bipartite | |
1460 graph G, with 5 diseases and 10 findings, and random parameters. | |
1461 (In general, to create a random dag, use 'mk_random_dag'.) | |
1462 We can visualize the resulting graph structure using | |
1463 the methods discussed <a href="#graphdraw">below</a>, with the | |
1464 following results: | |
1465 <p> | |
1466 <img src="Figures/qmr.rnd.jpg"> | |
1467 | |
1468 <p> | |
1469 Now let us put some random evidence on all the leaves except the very | |
1470 first and very last, and compute the disease posteriors. | |
1471 <pre> | |
diseases = 1:Ndiseases; findings = Ndiseases+1:N;  % disease nodes come first, then finding nodes
pos = 2:floor(Nfindings/2);
1473 neg = (pos(end)+1):(Nfindings-1); | |
1474 onodes = myunion(pos, neg); | |
1475 evidence = cell(1, N); | |
1476 evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos))); | |
1477 evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg))); | |
1478 | |
1479 engine = jtree_inf_engine(bnet); | |
1480 [engine, ll] = enter_evidence(engine, evidence); | |
1481 post = zeros(1, Ndiseases); | |
1482 for i=diseases(:)' | |
1483 m = marginal_nodes(engine, i); | |
1484 post(i) = m.T(2); | |
1485 end | |
1486 </pre> | |
1487 Junction tree can be quite slow on large QMR models. | |
1488 Fortunately, it is possible to exploit properties of the noisy-OR | |
1489 function to speed up exact inference using an algorithm called | |
1490 <a href="#quickscore">quickscore</a>, discussed below. | |
1491 | |
1492 | |
1493 | |
1494 | |
1495 | |
1496 <h2><a name="cg_model">Conditional Gaussian models</h2> | |
1497 | |
1498 A conditional Gaussian model is one in which, conditioned on all the discrete | |
1499 nodes, the distribution over the remaining (continuous) nodes is | |
1500 multivariate Gaussian. This means we can have arcs from discrete (D) | |
1501 to continuous (C) nodes, but not vice versa. | |
1502 (We <em>are</em> allowed C->D arcs if the continuous nodes are observed, | |
1503 as in the <a href="#mixexp">mixture of experts</a> model, | |
1504 since this distribution can be represented with a discrete potential.) | |
1505 <p> | |
1506 We now give an example of a CG model, from | |
the paper "Propagation of Probabilities, Means and
1508 Variances in Mixed Graphical Association Models", Steffen Lauritzen, | |
1509 JASA 87(420):1098--1108, 1992 (reprinted in the book "Probabilistic Networks and Expert | |
1510 Systems", R. G. Cowell, A. P. Dawid, S. L. Lauritzen and | |
1511 D. J. Spiegelhalter, Springer, 1999.) | |
1512 | |
1513 <h3>Specifying the graph</h3> | |
1514 | |
1515 Consider the model of waste emissions from an incinerator plant shown below. | |
1516 We follow the standard convention that shaded nodes are observed, | |
1517 clear nodes are hidden. | |
1518 We also use the non-standard convention that | |
1519 square nodes are discrete (tabular) and round nodes are | |
1520 Gaussian. | |
1521 | |
1522 <p> | |
1523 <center> | |
1524 <IMG SRC="Figures/cg1.gif"> | |
1525 </center> | |
1526 <p> | |
1527 | |
1528 We can create this model as follows. | |
1529 <pre> | |
1530 F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9; | |
1531 n = 9; | |
1532 | |
1533 dag = zeros(n); | |
1534 dag(F,E)=1; | |
1535 dag(W,[E Min D]) = 1; | |
1536 dag(E,D)=1; | |
1537 dag(B,[C D])=1; | |
1538 dag(D,[L Mout])=1; | |
1539 dag(Min,Mout)=1; | |
1540 | |
1541 % node sizes - all cts nodes are scalar, all discrete nodes are binary | |
1542 ns = ones(1, n); | |
1543 dnodes = [F W B]; | |
1544 cnodes = mysetdiff(1:n, dnodes); | |
1545 ns(dnodes) = 2; | |
1546 | |
1547 bnet = mk_bnet(dag, ns, 'discrete', dnodes); | |
1548 </pre> | |
1549 'dnodes' is a list of the discrete nodes; 'cnodes' is the continuous | |
1550 nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'. | |
1551 <p> | |
1552 | |
1553 | |
1554 <h3>Specifying the parameters</h3> | |
1555 | |
1556 The parameters of the discrete nodes can be specified as follows. | |
1557 <pre> | |
1558 bnet.CPD{B} = tabular_CPD(bnet, B, 'CPT', [0.85 0.15]); % 1=stable, 2=unstable | |
1559 bnet.CPD{F} = tabular_CPD(bnet, F, 'CPT', [0.95 0.05]); % 1=intact, 2=defect | |
1560 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [2/7 5/7]); % 1=industrial, 2=household | |
1561 </pre> | |
1562 | |
1563 <p> | |
1564 The parameters of the continuous nodes can be specified as follows. | |
1565 <pre> | |
1566 bnet.CPD{E} = gaussian_CPD(bnet, E, 'mean', [-3.9 -0.4 -3.2 -0.5], ... | |
1567 'cov', [0.00002 0.0001 0.00002 0.0001]); | |
1568 bnet.CPD{D} = gaussian_CPD(bnet, D, 'mean', [6.5 6.0 7.5 7.0], ... | |
1569 'cov', [0.03 0.04 0.1 0.1], 'weights', [1 1 1 1]); | |
1570 bnet.CPD{C} = gaussian_CPD(bnet, C, 'mean', [-2 -1], 'cov', [0.1 0.3]); | |
1571 bnet.CPD{L} = gaussian_CPD(bnet, L, 'mean', 3, 'cov', 0.25, 'weights', -0.5); | |
1572 bnet.CPD{Min} = gaussian_CPD(bnet, Min, 'mean', [0.5 -0.5], 'cov', [0.01 0.005]); | |
1573 bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 'mean', 0, 'cov', 0.002, 'weights', [1 1]); | |
1574 </pre> | |
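<p>
A note on the layout of these arguments (check gaussian_CPD if in doubt): for a Gaussian
node with discrete parents, such as E (whose parents are F and W), 'mean' and 'cov'
contain one entry per joint configuration of the discrete parents, which we believe are
enumerated with the lowest-numbered parent toggling fastest; so the four entries for E
correspond to (F=1,W=1), (F=2,W=1), (F=1,W=2), (F=2,W=2).
The 'weights' argument holds the regression coefficients on the continuous parents,
again with one set per discrete parent configuration.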
1575 | |
1576 | |
1577 <h3><a name="cg_infer">Inference</h3> | |
1578 | |
1579 <!--Let us perform inference in the <a href="#cg_model">waste incinerator example</a>.--> | |
1580 First we compute the unconditional marginals. | |
1581 <pre> | |
1582 engine = jtree_inf_engine(bnet); | |
1583 evidence = cell(1,n); | |
1584 [engine, ll] = enter_evidence(engine, evidence); | |
1585 marg = marginal_nodes(engine, E); | |
1586 </pre> | |
1587 <!--(Of course, we could use <tt>cond_gauss_inf_engine</tt> instead of jtree.)--> | |
1588 'marg' is a structure that contains the fields 'mu' and 'Sigma', which | |
1589 contain the mean and (co)variance of the marginal on E. | |
1590 In this case, they are both scalars. | |
1591 Let us check they match the published figures (to 2 decimal places). | |
1592 <!--(We can't expect | |
1593 more precision than this in general because I have implemented the algorithm of | |
1594 Lauritzen (1992), which can be numerically unstable.)--> | |
1595 <pre> | |
1596 tol = 1e-2; | |
1597 assert(approxeq(marg.mu, -3.25, tol)); | |
1598 assert(approxeq(sqrt(marg.Sigma), 0.709, tol)); | |
1599 </pre> | |
1600 We can compute the other posteriors similarly. | |
1601 Now let us add some evidence. | |
1602 <pre> | |
1603 evidence = cell(1,n); | |
1604 evidence{W} = 1; % industrial | |
1605 evidence{L} = 1.1; | |
1606 evidence{C} = -0.9; | |
1607 [engine, ll] = enter_evidence(engine, evidence); | |
1608 </pre> | |
1609 Now we find | |
1610 <pre> | |
1611 marg = marginal_nodes(engine, E); | |
1612 assert(approxeq(marg.mu, -3.8983, tol)); | |
1613 assert(approxeq(sqrt(marg.Sigma), 0.0763, tol)); | |
1614 </pre> | |
1615 | |
1616 | |
1617 We can also compute the joint probability on a set of nodes. | |
1618 For example, P(D, Mout | evidence) is a 2D Gaussian: | |
1619 <pre> | |
1620 marg = marginal_nodes(engine, [D Mout]) | |
1621 marg = | |
1622 domain: [6 8] | |
1623 mu: [2x1 double] | |
1624 Sigma: [2x2 double] | |
1625 T: 1.0000 | |
1626 </pre> | |
1627 The mean is | |
1628 <pre> | |
1629 marg.mu | |
1630 ans = | |
1631 3.6077 | |
1632 4.1077 | |
1633 </pre> | |
1634 and the covariance matrix is | |
1635 <pre> | |
1636 marg.Sigma | |
1637 ans = | |
1638 0.1062 0.1062 | |
1639 0.1062 0.1182 | |
1640 </pre> | |
1641 It is easy to visualize this posterior using standard Matlab plotting | |
1642 functions, e.g., | |
1643 <pre> | |
1644 gaussplot2d(marg.mu, marg.Sigma); | |
1645 </pre> | |
1646 produces the following picture. | |
1647 | |
1648 <p> | |
1649 <center> | |
1650 <IMG SRC="Figures/gaussplot.png"> | |
1651 </center> | |
1652 <p> | |
1653 | |
1654 | |
1655 The T field indicates that the mixing weight of this Gaussian | |
1656 component is 1.0. | |
1657 If the joint contains discrete and continuous variables, the result | |
1658 will be a mixture of Gaussians, e.g., | |
1659 <pre> | |
1660 marg = marginal_nodes(engine, [F E]) | |
1661 domain: [1 3] | |
1662 mu: [-3.9000 -0.4003] | |
1663 Sigma: [1x1x2 double] | |
1664 T: [0.9995 4.7373e-04] | |
1665 </pre> | |
1666 The interpretation is | |
1667 Sigma(i,j,k) = Cov[ E(i) E(j) | F=k ]. | |
1668 In this case, E is a scalar, so i=j=1; k specifies the mixture component. | |
1669 <p> | |
1670 We saw in the sprinkler network that BNT sets the effective size of | |
1671 observed discrete nodes to 1, since they only have one legal value. | |
1672 For continuous nodes, BNT sets their length to 0, | |
1673 since they have been reduced to a point. | |
1674 For example, | |
1675 <pre> | |
1676 marg = marginal_nodes(engine, [B C]) | |
1677 domain: [4 5] | |
1678 mu: [] | |
1679 Sigma: [] | |
1680 T: [0.0123 0.9877] | |
1681 </pre> | |
1682 It is simple to post-process the output of marginal_nodes. | |
1683 For example, the file BNT/examples/static/cg1 sets the mu term of | |
1684 observed nodes to their observed value, and the Sigma term to 0 (since | |
1685 observed nodes have no variance). | |
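A minimal sketch of that kind of post-processing (not the exact code in cg1) is:
<pre>
margs = cell(1,n);
for i=cnodes(:)'
  margs{i} = marginal_nodes(engine, i);
  if ~isempty(evidence{i})
    margs{i}.mu = evidence{i};   % an observed node's mean is its observed value
    margs{i}.Sigma = 0;          % and it has no variance
  end
end
</pre>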
1686 | |
1687 <p> | |
1688 Note that the implemented version of the junction tree is numerically | |
1689 unstable when using CG potentials | |
1690 (which is why, in the example above, we only required our answers to agree with | |
1691 the published ones to 2dp.) | |
1692 This is why you might want to use <tt>stab_cond_gauss_inf_engine</tt>, | |
1693 implemented by Shan Huang. This is described in | |
1694 | |
1695 <ul> | |
1696 <li> "Stable Local Computation with Conditional Gaussian Distributions", | |
1697 S. Lauritzen and F. Jensen, Tech Report R-99-2014, | |
Dept. Math. Sciences, Aalborg Univ., 1999.
1699 </ul> | |
1700 | |
1701 However, even the numerically stable version | |
1702 can be computationally intractable if there are many hidden discrete | |
nodes, because the number of mixture components grows exponentially, e.g., in a
1704 <a href="usage_dbn.html#lds">switching linear dynamical system</a>. | |
1705 In general, one must resort to approximate inference techniques: see | |
1706 the discussion on <a href="#engines">inference engines</a> below. | |
1707 | |
1708 | |
1709 <h2><a name="hybrid">Other hybrid models</h2> | |
1710 | |
1711 When we have C->D arcs, where C is hidden, we need to use | |
1712 approximate inference. | |
1713 One approach (not implemented in BNT) is described in | |
1714 <ul> | |
1715 <li> <a | |
1716 href="http://www.cs.berkeley.edu/~murphyk/Papers/hybrid_uai99.ps.gz">A | |
1717 Variational Approximation for Bayesian Networks with | |
1718 Discrete and Continuous Latent Variables</a>, | |
1719 K. Murphy, UAI 99. | |
1720 </ul> | |
1721 Of course, one can always use <a href="#sampling">sampling</a> methods | |
1722 for approximate inference in such models. | |
1723 | |
1724 | |
1725 | |
1726 <h1><a name="param_learning">Parameter Learning</h1> | |
1727 | |
1728 The parameter estimation routines in BNT can be classified into 4 | |
1729 types, depending on whether the goal is to compute | |
1730 a full (Bayesian) posterior over the parameters or just a point | |
1731 estimate (e.g., Maximum Likelihood or Maximum A Posteriori), | |
1732 and whether all the variables are fully observed or there is missing | |
1733 data/ hidden variables (partial observability). | |
1734 <p> | |
1735 | |
1736 <TABLE BORDER> | |
1737 <tr> | |
1738 <TH></TH> | |
1739 <th>Full obs</th> | |
1740 <th>Partial obs</th> | |
1741 </tr> | |
1742 <tr> | |
1743 <th>Point</th> | |
1744 <td><tt>learn_params</tt></td> | |
1745 <td><tt>learn_params_em</tt></td> | |
1746 </tr> | |
1747 <tr> | |
1748 <th>Bayes</th> | |
1749 <td><tt>bayes_update_params</tt></td> | |
1750 <td>not yet supported</td> | |
1751 </tr> | |
1752 </table> | |
1753 | |
1754 | |
1755 <h2><a name="load_data">Loading data from a file</h2> | |
1756 | |
1757 To load numeric data from an ASCII text file called 'dat.txt', where each row is a | |
1758 case and columns are separated by white-space, such as | |
1759 <pre> | |
1760 011979 1626.5 0.0 | |
1761 021979 1367.0 0.0 | |
1762 ... | |
1763 </pre> | |
1764 you can use | |
1765 <pre> | |
1766 data = load('dat.txt'); | |
1767 </pre> | |
1768 or | |
1769 <pre> | |
1770 load dat.txt -ascii | |
1771 </pre> | |
1772 In the latter case, the data is stored in a variable called 'dat' (the | |
1773 filename minus the extension). | |
1774 Alternatively, suppose the data is stored in a .csv file (has commas | |
1775 separating the columns, and contains a header line), such as | |
1776 <pre> | |
1777 header info goes here | |
1778 ORD,011979,1626.5,0.0 | |
1779 DSM,021979,1367.0,0.0 | |
1780 ... | |
1781 </pre> | |
1782 You can load this using | |
1783 <pre> | |
[a,b,c,d] = textread('dat.csv', '%s %d %f %f', 'delimiter', ',', 'headerlines', 1);
1785 </pre> | |
1786 If your file is not in either of these formats, you can either use Perl to convert | |
it to this format, or use Matlab's fscanf function.
1788 Type | |
1789 <tt> | |
1790 help iofun | |
1791 </tt> | |
1792 for more information on Matlab's file functions. | |
1793 <!-- | |
1794 <p> | |
1795 To load data directly from Excel, | |
1796 you should buy the | |
1797 <a href="http://www.mathworks.com/products/excellink/">Excel Link</a>. | |
1798 To load data directly from a relational database, | |
1799 you should buy the | |
1800 <a href="http://www.mathworks.com/products/database">Database | |
1801 toolbox</a>. | |
1802 --> | |
1803 <p> | |
1804 BNT learning routines require data to be stored in a cell array. | |
1805 data{i,m} is the value of node i in case (example) m, i.e., each | |
1806 <em>column</em> is a case. | |
1807 If node i is not observed in case m (missing value), set | |
1808 data{i,m} = []. | |
1809 (Not all the learning routines can cope with such missing values, however.) | |
1810 In the special case that all the nodes are observed and are | |
1811 scalar-valued (as opposed to vector-valued), the data can be | |
1812 stored in a matrix (as opposed to a cell-array). | |
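In that special case you can convert between the two representations with num2cell and
BNT's cell2num utility, e.g.,
<pre>
cases = num2cell(data);    % numeric matrix -> cell array
data2 = cell2num(cases);   % cell array -> numeric matrix
</pre>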
1813 <p> | |
1814 Suppose, as in the <a href="#mixexp">mixture of experts example</a>, | |
1815 that we have 3 nodes in the graph: X(1) is the observed input, X(3) is | |
1816 the observed output, and X(2) is a hidden (gating) node. We can | |
1817 create the dataset as follows. | |
1818 <pre> | |
1819 data = load('dat.txt'); | |
1820 ncases = size(data, 1); | |
1821 cases = cell(3, ncases); | |
1822 cases([1 3], :) = num2cell(data'); | |
1823 </pre> | |
1824 Notice how we transposed the data, to convert rows into columns. | |
1825 Also, cases{2,m} = [] for all m, since X(2) is always hidden. | |
1826 | |
1827 | |
1828 <h2><a name="mle_complete">Maximum likelihood parameter estimation from complete data</h2> | |
1829 | |
1830 As an example, let's generate some data from the sprinkler network, randomize the parameters, | |
1831 and then try to recover the original model. | |
1832 First we create some training data using forwards sampling. | |
1833 <pre> | |
1834 samples = cell(N, nsamples); | |
1835 for i=1:nsamples | |
1836 samples(:,i) = sample_bnet(bnet); | |
1837 end | |
1838 </pre> | |
1839 samples{j,i} contains the value of the j'th node in case i. | |
1840 sample_bnet returns a cell array because, in general, each node might | |
1841 be a vector of different length. | |
1842 In this case, all nodes are discrete (and hence scalars), so we | |
1843 could have used a regular array instead (which can be quicker): | |
1844 <pre> | |
1845 data = cell2num(samples); | |
</pre>
1847 So now data(j,i) = samples{j,i}. | |
1848 <p> | |
1849 Now we create a network with random parameters. | |
1850 (The initial values of bnet2 don't matter in this case, since we can find the | |
1851 globally optimal MLE independent of where we start.) | |
1852 <pre> | |
1853 % Make a tabula rasa | |
1854 bnet2 = mk_bnet(dag, node_sizes); | |
1855 seed = 0; | |
1856 rand('state', seed); | |
1857 bnet2.CPD{C} = tabular_CPD(bnet2, C); | |
1858 bnet2.CPD{R} = tabular_CPD(bnet2, R); | |
1859 bnet2.CPD{S} = tabular_CPD(bnet2, S); | |
1860 bnet2.CPD{W} = tabular_CPD(bnet2, W); | |
1861 </pre> | |
1862 Finally, we find the maximum likelihood estimates of the parameters. | |
1863 <pre> | |
1864 bnet3 = learn_params(bnet2, samples); | |
1865 </pre> | |
1866 To view the learned parameters, we use a little Matlab hackery. | |
1867 <pre> | |
1868 CPT3 = cell(1,N); | |
1869 for i=1:N | |
1870 s=struct(bnet3.CPD{i}); % violate object privacy | |
1871 CPT3{i}=s.CPT; | |
1872 end | |
1873 </pre> | |
1874 Here are the parameters learned for node 4. | |
1875 <pre> | |
1876 dispcpt(CPT3{4}) | |
1877 1 1 : 1.0000 0.0000 | |
1878 2 1 : 0.2000 0.8000 | |
1879 1 2 : 0.2273 0.7727 | |
1880 2 2 : 0.0000 1.0000 | |
1881 </pre> | |
1882 So we see that the learned parameters are fairly close to the "true" | |
1883 ones, which we display below. | |
1884 <pre> | |
1885 dispcpt(CPT{4}) | |
1886 1 1 : 1.0000 0.0000 | |
1887 2 1 : 0.1000 0.9000 | |
1888 1 2 : 0.1000 0.9000 | |
1889 2 2 : 0.0100 0.9900 | |
1890 </pre> | |
1891 We can get better results by using a larger training set, or using | |
1892 informative priors (see <a href="#prior">below</a>). | |
1893 | |
1894 | |
1895 | |
1896 <h2><a name="prior">Parameter priors</h2> | |
1897 | |
1898 Currently, only tabular CPDs can have priors on their parameters. | |
1899 The conjugate prior for a multinomial is the Dirichlet. | |
1900 (For binary random variables, the multinomial is the same as the | |
1901 Bernoulli, and the Dirichlet is the same as the Beta.) | |
1902 <p> | |
1903 The Dirichlet has a simple interpretation in terms of pseudo counts. | |
1904 If we let N_ijk = the num. times X_i=k and Pa_i=j occurs in the | |
1905 training set, where Pa_i are the parents of X_i, | |
1906 then the maximum likelihood (ML) estimate is | |
1907 T_ijk = N_ijk / N_ij (where N_ij = sum_k' N_ijk'), which will be 0 if N_ijk=0. | |
1908 To prevent us from declaring that (X_i=k, Pa_i=j) is impossible just because this | |
1909 event was not seen in the training set, | |
1910 we can pretend we saw value k of X_i, for each value j of Pa_i some number (alpha_ijk) | |
1911 of times in the past. | |
The MAP (maximum a posteriori) estimate is then
1913 <pre> | |
1914 T_ijk = (N_ijk + alpha_ijk) / (N_ij + alpha_ij) | |
1915 </pre> | |
1916 and is never 0 if all alpha_ijk > 0. | |
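To see the effect numerically, suppose X_i is binary and, for some parent configuration j,
N_ij1=0 and N_ij2=3: the ML estimates are 0 and 1, but with alpha_ij1=alpha_ij2=1 the MAP
estimates are (0+1)/(3+2)=0.2 and (3+1)/(3+2)=0.8.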
1917 For example, consider the network A->B, where A is binary and B has 3 | |
1918 values. | |
1919 A uniform prior for B has the form | |
1920 <pre> | |
1921 B=1 B=2 B=3 | |
1922 A=1 1 1 1 | |
1923 A=2 1 1 1 | |
1924 </pre> | |
1925 which can be created using | |
1926 <pre> | |
1927 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'unif'); | |
1928 </pre> | |
1929 This prior does not satisfy the likelihood equivalence principle, | |
1930 which says that <a href="#markov_equiv">Markov equivalent</a> models | |
1931 should have the same marginal likelihood. | |
1932 A prior that does satisfy this principle is shown below. | |
1933 Heckerman (1995) calls this the | |
1934 BDeu prior (likelihood equivalent uniform Bayesian Dirichlet). | |
1935 <pre> | |
1936 B=1 B=2 B=3 | |
1937 A=1 1/6 1/6 1/6 | |
1938 A=2 1/6 1/6 1/6 | |
1939 </pre> | |
where we put N/(q*r) in each bin; N is the equivalent sample size,
r is the number of states of the child (here |B|=3), and q is the number of parent configurations (here |A|=2).
1942 This can be created as follows | |
1943 <pre> | |
1944 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu'); | |
1945 </pre> | |
1946 Here, 1 is the equivalent sample size, and is the strength of the | |
1947 prior. | |
1948 You can change this using | |
1949 <pre> | |
1950 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', ... | |
1951 'BDeu', 'dirichlet_weight', 10); | |
1952 </pre> | |
1953 <!--where counts is an array of pseudo-counts of the same size as the | |
1954 CPT.--> | |
1955 <!-- | |
1956 <p> | |
1957 When you specify a prior, you should set row i of the CPT to the | |
1958 normalized version of row i of the pseudo-count matrix, i.e., to the | |
1959 expected values of the parameters. This will ensure that computing the | |
1960 marginal likelihood sequentially (see <a | |
1961 href="#bayes_learn">below</a>) and in batch form gives the same | |
1962 results. | |
1963 To do this, proceed as follows. | |
1964 <pre> | |
1965 tabular_CPD(bnet, i, 'prior', counts, 'CPT', mk_stochastic(counts)); | |
1966 </pre> | |
1967 For a non-informative prior, you can just write | |
1968 <pre> | |
1969 tabular_CPD(bnet, i, 'prior', 'unif', 'CPT', 'unif'); | |
1970 </pre> | |
1971 --> | |
1972 | |
1973 | |
1974 <h2><a name="bayes_learn">(Sequential) Bayesian parameter updating from complete data</h2> | |
1975 | |
1976 If we use conjugate priors and have fully observed data, we can | |
1977 compute the posterior over the parameters in batch form as follows. | |
1978 <pre> | |
1979 cases = sample_bnet(bnet, nsamples); | |
1980 bnet = bayes_update_params(bnet, cases); | |
1981 LL = log_marg_lik_complete(bnet, cases); | |
1982 </pre> | |
1983 bnet.CPD{i}.prior contains the new Dirichlet pseudocounts, | |
1984 and bnet.CPD{i}.CPT is set to the mean of the posterior (the | |
1985 normalized counts). | |
1986 (Hence if the initial pseudo counts are 0, | |
1987 <tt>bayes_update_params</tt> and <tt>learn_params</tt> will give the | |
1988 same result.) | |
1989 | |
1990 | |
1991 | |
1992 | |
1993 <p> | |
1994 We can compute the same result sequentially (on-line) as follows. | |
1995 <pre> | |
1996 LL = 0; | |
1997 for m=1:nsamples | |
1998 LL = LL + log_marg_lik_complete(bnet, cases(:,m)); | |
1999 bnet = bayes_update_params(bnet, cases(:,m)); | |
2000 end | |
2001 </pre> | |
2002 | |
2003 The file <tt>BNT/examples/static/StructLearn/model_select1</tt> has an example of | |
2004 sequential model selection which uses the same idea. | |
2005 We generate data from the model A->B | |
2006 and compute the posterior prob of all 3 dags on 2 nodes: | |
(1) A B (no arc), (2) A <- B, (3) A -> B.
2008 Models 2 and 3 are <a href="#markov_equiv">Markov equivalent</a>, and therefore indistinguishable from | |
2009 observational data alone, so we expect their posteriors to be the same | |
2010 (assuming a prior which satisfies likelihood equivalence). | |
2011 If we use random parameters, the "true" model only gets a higher posterior after 2000 trials! | |
2012 However, if we make B a noisy NOT gate, the true model "wins" after 12 | |
2013 trials, as shown below (red = model 1, blue/green (superimposed) | |
2014 represents models 2/3). | |
2015 <p> | |
2016 <img src="Figures/model_select.png"> | |
2017 <p> | |
2018 The use of marginal likelihood for model selection is discussed in | |
2019 greater detail in the | |
section on <a href="#structure_learning">structure learning</a>.
2021 | |
2022 | |
2023 | |
2024 | |
2025 <h2><a name="em">Maximum likelihood parameter estimation with missing values</h2> | |
2026 | |
2027 Now we consider learning when some values are not observed. | |
2028 Let us randomly hide half the values generated from the water | |
2029 sprinkler example. | |
2030 <pre> | |
2031 samples2 = samples; | |
2032 hide = rand(N, nsamples) > 0.5; | |
2033 [I,J]=find(hide); | |
2034 for k=1:length(I) | |
2035 samples2{I(k), J(k)} = []; | |
2036 end | |
2037 </pre> | |
2038 samples2{i,l} is the value of node i in training case l, or [] if unobserved. | |
2039 <p> | |
2040 Now we will compute the MLEs using the EM algorithm. | |
2041 We need to use an inference algorithm to compute the expected | |
2042 sufficient statistics in the E step; the M (maximization) step is as | |
2043 above. | |
2044 <pre> | |
2045 engine2 = jtree_inf_engine(bnet2); | |
2046 max_iter = 10; | |
2047 [bnet4, LLtrace] = learn_params_em(engine2, samples2, max_iter); | |
2048 </pre> | |
2049 LLtrace(i) is the log-likelihood at iteration i. We can plot this as | |
2050 follows: | |
2051 <pre> | |
2052 plot(LLtrace, 'x-') | |
2053 </pre> | |
2054 Let's display the results after 10 iterations of EM. | |
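(CPT4 is extracted from bnet4 using the same object-privacy trick we used for CPT3 above.)
<pre>
CPT4 = cell(1,N);
for i=1:N
  s=struct(bnet4.CPD{i});  % violate object privacy
  CPT4{i}=s.CPT;
end
</pre>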
2055 <pre> | |
2056 celldisp(CPT4) | |
2057 CPT4{1} = | |
2058 0.6616 | |
2059 0.3384 | |
2060 CPT4{2} = | |
2061 0.6510 0.3490 | |
2062 0.8751 0.1249 | |
2063 CPT4{3} = | |
2064 0.8366 0.1634 | |
2065 0.0197 0.9803 | |
2066 CPT4{4} = | |
2067 (:,:,1) = | |
2068 0.8276 0.0546 | |
2069 0.5452 0.1658 | |
2070 (:,:,2) = | |
2071 0.1724 0.9454 | |
2072 0.4548 0.8342 | |
2073 </pre> | |
2074 We can get improved performance by using one or more of the following | |
2075 methods: | |
2076 <ul> | |
2077 <li> Increasing the size of the training set. | |
2078 <li> Decreasing the amount of hidden data. | |
2079 <li> Running EM for longer. | |
2080 <li> Using informative priors. | |
2081 <li> Initialising EM from multiple starting points. | |
2082 </ul> | |
2083 | |
2084 Click <a href="#gaussian">here</a> for a discussion of learning | |
2085 Gaussians, which can cause numerical problems. | |
2086 <p> | |
2087 For a more complete example of learning with EM, | |
2088 see the script BNT/examples/static/learn1.m. | |
2089 | |
2090 <h2><a name="tying">Parameter tying</h2> | |
2091 | |
2092 In networks with repeated structure (e.g., chains and grids), it is | |
2093 common to assume that the parameters are the same at every node. This | |
2094 is called parameter tying, and reduces the amount of data needed for | |
2095 learning. | |
2096 <p> | |
2097 When we have tied parameters, there is no longer a one-to-one | |
2098 correspondence between nodes and CPDs. | |
Rather, each CPD specifies the parameters for a whole equivalence class
2100 of nodes. | |
2101 It is easiest to see this by example. | |
2102 Consider the following <a href="usage_dbn.html#hmm">hidden Markov | |
2103 model (HMM)</a> | |
2104 <p> | |
2105 <img src="Figures/hmm3.gif"> | |
2106 <p> | |
2107 <!-- | |
2108 We can create this graph structure, assuming we have T time-slices, | |
2109 as follows. | |
2110 (We number the nodes as shown in the figure, but we could equally well | |
2111 number the hidden nodes 1:T, and the observed nodes T+1:2T.) | |
2112 <pre> | |
2113 N = 2*T; | |
2114 dag = zeros(N); | |
2115 hnodes = 1:2:2*T; | |
2116 for i=1:T-1 | |
2117 dag(hnodes(i), hnodes(i+1))=1; | |
2118 end | |
2119 onodes = 2:2:2*T; | |
2120 for i=1:T | |
2121 dag(hnodes(i), onodes(i)) = 1; | |
2122 end | |
2123 </pre> | |
2124 <p> | |
2125 The hidden nodes are always discrete, and have Q possible values each, | |
2126 but the observed nodes can be discrete or continuous, and have O possible values/length. | |
2127 <pre> | |
2128 if cts_obs | |
2129 dnodes = hnodes; | |
2130 else | |
2131 dnodes = 1:N; | |
2132 end | |
2133 ns = ones(1,N); | |
2134 ns(hnodes) = Q; | |
2135 ns(onodes) = O; | |
2136 </pre> | |
2137 --> | |
2138 When HMMs are used for semi-infinite processes like speech recognition, | |
2139 we assume the transition matrix | |
2140 P(H(t+1)|H(t)) is the same for all t; this is called a time-invariant | |
or homogeneous Markov chain.
2142 Hence hidden nodes 2, 3, ..., T | |
2143 are all in the same equivalence class, say class Hclass. | |
2144 Similarly, the observation matrix P(O(t)|H(t)) is assumed to be the | |
2145 same for all t, so the observed nodes are all in the same equivalence | |
2146 class, say class Oclass. | |
2147 Finally, the prior term P(H(1)) is in a class all by itself, say class | |
2148 H1class. | |
2149 This is illustrated below, where we explicitly represent the | |
2150 parameters as random variables (dotted nodes). | |
2151 <p> | |
2152 <img src="Figures/hmm4_params.gif"> | |
2153 <p> | |
2154 In BNT, we cannot represent parameters as random variables (nodes). | |
2155 Instead, we "hide" the | |
2156 parameters inside one CPD for each equivalence class, | |
2157 and then specify that the other CPDs should share these parameters, as | |
2158 follows. | |
2159 <pre> | |
2160 hnodes = 1:2:2*T; | |
2161 onodes = 2:2:2*T; | |
2162 H1class = 1; Hclass = 2; Oclass = 3; | |
2163 eclass = ones(1,N); | |
2164 eclass(hnodes(2:end)) = Hclass; | |
2165 eclass(hnodes(1)) = H1class; | |
2166 eclass(onodes) = Oclass; | |
% create dag, ns and dnodes in the usual way
2168 bnet = mk_bnet(dag, ns, 'discrete', dnodes, 'equiv_class', eclass); | |
2169 </pre> | |
2170 Finally, we define the parameters for each equivalence class: | |
2171 <pre> | |
2172 bnet.CPD{H1class} = tabular_CPD(bnet, hnodes(1)); % prior | |
2173 bnet.CPD{Hclass} = tabular_CPD(bnet, hnodes(2)); % transition matrix | |
2174 if cts_obs | |
2175 bnet.CPD{Oclass} = gaussian_CPD(bnet, onodes(1)); | |
2176 else | |
2177 bnet.CPD{Oclass} = tabular_CPD(bnet, onodes(1)); | |
2178 end | |
2179 </pre> | |
2180 In general, if bnet.CPD{e} = xxx_CPD(bnet, j), then j should be a | |
2181 member of e's equivalence class; that is, it is not always the case | |
2182 that e == j. You can use bnet.rep_of_eclass(e) to return the | |
2183 representative of equivalence class e. | |
2184 BNT will look up the parents of j to determine the size | |
2185 of the CPT to use. It assumes that this is the same for all members of | |
2186 the equivalence class. | |
2187 Click <a href="param_tieing.html">here</a> for | |
2188 a more complex example of parameter tying. | |
2189 <p> | |
2190 Note: | |
2191 Normally one would define an HMM as a | |
2192 <a href = "usage_dbn.html">Dynamic Bayes Net</a> | |
2193 (see the function BNT/examples/dynamic/mk_chmm.m). | |
2194 However, one can define an HMM as a static BN using the function | |
2195 BNT/examples/static/Models/mk_hmm_bnet.m. | |
2196 | |
2197 | |
2198 | |
2199 <h1><a name="structure_learning">Structure learning</h1> | |
2200 | |
2201 Update (9/29/03): | |
Philippe Leray is developing some additional structure learning code
2203 on top of BNT. Click | |
2204 <a href="http://banquiseasi.insa-rouen.fr/projects/bnt-slp/"> | |
2205 here</a> | |
2206 for details. | |
2207 | |
2208 <p> | |
2209 | |
2210 There are two very different approaches to structure learning: | |
2211 constraint-based and search-and-score. | |
2212 In the <a href="#constraint">constraint-based approach</a>, | |
2213 we start with a fully connected graph, and remove edges if certain | |
2214 conditional independencies are measured in the data. | |
2215 This has the disadvantage that repeated independence tests lose | |
2216 statistical power. | |
2217 <p> | |
2218 In the more popular search-and-score approach, | |
2219 we perform a search through the space of possible DAGs, and either | |
2220 return the best one found (a point estimate), or return a sample of the | |
2221 models found (an approximation to the Bayesian posterior). | |
2222 <p> | |
2223 The number of DAGs as a function of the number of | |
2224 nodes, G(n), is super-exponential in n, | |
2225 and is given by the following recurrence | |
2226 <!--(where R(i)=G(n)):--> | |
2227 <p> | |
2228 <center> | |
2229 <IMG SRC="numDAGsEqn2.png"> | |
2230 </center> | |
2231 <p> | |
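(In text form, the recurrence is G(n) = sum_{k=1}^n (-1)^(k+1) C(n,k) 2^(k(n-k)) G(n-k), with G(0) = 1.)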
2232 The first few values | |
2233 are shown below. | |
2234 | |
2235 <table> | |
2236 <tr> <th>n</th> <th align=left>G(n)</th> </tr> | |
2237 <tr> <td>1</td> <td>1</td> </tr> | |
2238 <tr> <td>2</td> <td>3</td> </tr> | |
2239 <tr> <td>3</td> <td>25</td> </tr> | |
2240 <tr> <td>4</td> <td>543</td> </tr> | |
2241 <tr> <td>5</td> <td>29,281</td> </tr> | |
2242 <tr> <td>6</td> <td>3,781,503</td> </tr> | |
2243 <tr> <td>7</td> <td>1.1 x 10^9</td> </tr> | |
2244 <tr> <td>8</td> <td>7.8 x 10^11</td> </tr> | |
2245 <tr> <td>9</td> <td>1.2 x 10^15</td> </tr> | |
2246 <tr> <td>10</td> <td>4.2 x 10^18</td> </tr> | |
2247 </table> | |
2248 | |
2249 Since the number of DAGs is super-exponential in the number of nodes, | |
2250 we cannot exhaustively search the space, so we either use a local | |
search algorithm (e.g., greedy hill climbing, perhaps with multiple
2252 restarts) or a global search algorithm (e.g., Markov Chain Monte | |
2253 Carlo). | |
2254 <p> | |
2255 If we know a total ordering on the nodes, | |
2256 finding the best structure amounts to picking the best set of parents | |
2257 for each node independently. | |
2258 This is what the K2 algorithm does. | |
2259 If the ordering is unknown, we can search over orderings, | |
2260 which is more efficient than searching over DAGs (Koller and Friedman, 2000). | |
2261 <p> | |
2262 In addition to the search procedure, we must specify the scoring | |
2263 function. There are two popular choices. The Bayesian score integrates | |
2264 out the parameters, i.e., it is the marginal likelihood of the model. | |
2265 The BIC (Bayesian Information Criterion) is defined as | |
2266 log P(D|theta_hat) - 0.5*d*log(N), where D is the data, theta_hat is | |
2267 the ML estimate of the parameters, d is the number of parameters, and | |
2268 N is the number of data cases. | |
2269 The BIC method has the advantage of not requiring a prior. | |
2270 <p> | |
2271 BIC can be derived as a large sample | |
2272 approximation to the marginal likelihood. | |
2273 (It is also equal to the Minimum Description Length of a model.) | |
2274 However, in practice, the sample size does not need to be very large | |
2275 for the approximation to be good. | |
2276 For example, in the figure below, we plot the ratio between the log marginal likelihood | |
2277 and the BIC score against data-set size; we see that the ratio rapidly | |
2278 approaches 1, especially for non-informative priors. | |
2279 (This plot was generated by the file BNT/examples/static/bic1.m. It | |
2280 uses the water sprinkler BN with BDeu Dirichlet priors with different | |
2281 equivalent sample sizes.) | |
2282 | |
2283 <p> | |
2284 <center> | |
2285 <IMG SRC="Figures/bic.png"> | |
2286 </center> | |
2287 <p> | |
2288 | |
2289 <p> | |
2290 As with parameter learning, handling missing data/ hidden variables is | |
2291 much harder than the fully observed case. | |
2292 The structure learning routines in BNT can therefore be classified into 4 | |
2293 types, analogously to the parameter learning case. | |
2294 <p> | |
2295 | |
2296 <TABLE BORDER> | |
2297 <tr> | |
2298 <TH></TH> | |
2299 <th>Full obs</th> | |
2300 <th>Partial obs</th> | |
2301 </tr> | |
2302 <tr> | |
2303 <th>Point</th> | |
<td><tt>learn_struct_K2</tt></td>
<!-- <tt>learn_struct_hill_climb</tt> -->
<td>not yet supported</td>
2307 </tr> | |
2308 <tr> | |
2309 <th>Bayes</th> | |
2310 <td><tt>learn_struct_mcmc</tt></td> | |
2311 <td>not yet supported</td> | |
2312 </tr> | |
2313 </table> | |
2314 | |
2315 | |
2316 <h2><a name="markov_equiv">Markov equivalence</h2> | |
2317 | |
2318 If two DAGs encode the same conditional independencies, they are | |
called Markov equivalent. The set of all DAGs can be partitioned into
2320 Markov equivalence classes. Graphs within the same class can | |
2321 have | |
2322 the direction of some of their arcs reversed without changing any of | |
2323 the CI relationships. | |
2324 Each class can be represented by a PDAG | |
2325 (partially directed acyclic graph) called an essential graph or | |
2326 pattern. This specifies which edges must be oriented in a certain | |
2327 direction, and which may be reversed. | |
2328 | |
2329 <p> | |
2330 When learning graph structure from observational data, | |
2331 the best one can hope to do is to identify the model up to Markov | |
2332 equivalence. To distinguish amongst graphs within the same equivalence | |
2333 class, one needs interventional data: see the discussion on <a | |
2334 href="#active">active learning</a> below. | |
2335 | |
2336 | |
2337 | |
2338 <h2><a name="enumerate">Exhaustive search</h2> | |
2339 | |
2340 The brute-force approach to structure learning is to enumerate all | |
2341 possible DAGs, and score each one. This provides a "gold standard" | |
2342 with which to compare other algorithms. We can do this as follows. | |
2343 <pre> | |
2344 dags = mk_all_dags(N); | |
2345 score = score_dags(data, ns, dags); | |
2346 </pre> | |
2347 where data(i,m) is the value of node i in case m, | |
2348 and ns(i) is the size of node i. | |
2349 If the DAGs have a lot of families in common, we can cache the sufficient statistics, | |
2350 making this potentially more efficient than scoring the DAGs one at a time. | |
2351 (Caching is not currently implemented, however.) | |
2352 <p> | |
2353 By default, we use the Bayesian scoring metric, and assume CPDs are | |
2354 represented by tables with BDeu(1) priors. | |
2355 We can override these defaults as follows. | |
2356 If we want to use uniform priors, we can say | |
2357 <pre> | |
2358 params = cell(1,N); | |
2359 for i=1:N | |
2360 params{i} = {'prior', 'unif'}; | |
2361 end | |
2362 score = score_dags(data, ns, dags, 'params', params); | |
2363 </pre> | |
2364 params{i} is a cell-array, containing optional arguments that are | |
2365 passed to the constructor for CPD i. | |
2366 <p> | |
2367 Now suppose we want to use different node types, e.g., | |
2368 Suppose nodes 1 and 2 are Gaussian, and nodes 3 and 4 softmax (both | |
2369 these CPDs can support discrete and continuous parents, which is | |
2370 necessary since all other nodes will be considered as parents). | |
2371 The Bayesian scoring metric currently only works for tabular CPDs, so | |
2372 we will use BIC: | |
2373 <pre> | |
score = score_dags(data, ns, dags, 'discrete', [3 4], 'params', [], ...
        'type', {'gaussian', 'gaussian', 'softmax', 'softmax'}, 'scoring_fn', 'bic')
2376 </pre> | |
2377 In practice, one can't enumerate all possible DAGs for N > 5, | |
2378 but one can evaluate any reasonably-sized set of hypotheses in this | |
2379 way (e.g., nearest neighbors of your current best guess). | |
2380 Think of this as "computer assisted model refinement" as opposed to de | |
2381 novo learning. | |
2382 | |
2383 | |
2384 <h2><a name="K2">K2</h2> | |
2385 | |
2386 The K2 algorithm (Cooper and Herskovits, 1992) is a greedy search algorithm that works as follows. | |
2387 Initially each node has no parents. It then adds incrementally that parent whose addition most | |
2388 increases the score of the resulting structure. When the addition of no single | |
2389 parent can increase the score, it stops adding parents to the node. | |
2390 Since we are using a fixed ordering, we do not need to check for | |
2391 cycles, and can choose the parents for each node independently. | |
2392 <p> | |
2393 The original paper used the Bayesian scoring | |
2394 metric with tabular CPDs and Dirichlet priors. | |
2395 BNT generalizes this to allow any kind of CPD, and either the Bayesian | |
2396 scoring metric or BIC, as in the example <a href="#enumerate">above</a>. | |
2397 In addition, you can specify | |
2398 an optional upper bound on the number of parents for each node. | |
2399 The file BNT/examples/static/k2demo1.m gives an example of how to use K2. | |
2400 We use the water sprinkler network and sample 100 cases from it as before. | |
2401 Then we see how much data it takes to recover the generating structure: | |
2402 <pre> | |
2403 order = [C S R W]; | |
2404 max_fan_in = 2; | |
2405 sz = 5:5:100; | |
2406 for i=1:length(sz) | |
2407 dag2 = learn_struct_K2(data(:,1:sz(i)), node_sizes, order, 'max_fan_in', max_fan_in); | |
2408 correct(i) = isequal(dag, dag2); | |
2409 end | |
2410 </pre> | |
2411 Here are the results. | |
2412 <pre> | |
2413 correct = | |
2414 Columns 1 through 12 | |
2415 0 0 0 0 0 0 0 1 0 1 1 1 | |
2416 Columns 13 through 20 | |
2417 1 1 1 1 1 1 1 1 | |
2418 </pre> | |
2419 So we see it takes about sz(10)=50 cases. (BIC behaves similarly, | |
2420 showing that the prior doesn't matter too much.) | |
2421 In general, we cannot hope to recover the "true" generating structure, | |
2422 only one that is in its <a href="#markov_equiv">Markov equivalence | |
2423 class</a>. | |
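<p>
To run the BIC comparison just mentioned, one can pass a different scoring function to K2.
A sketch, assuming learn_struct_K2 accepts the same 'scoring_fn' and 'params' options as
score_dags (check the m-file header to confirm):
<pre>
dag2 = learn_struct_K2(data(:,1:50), node_sizes, order, 'max_fan_in', max_fan_in, ...
                       'scoring_fn', 'bic', 'params', []);
</pre>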
2424 | |
2425 | |
2426 <h2><a name="hill_climb">Hill-climbing</h2> | |
2427 | |
2428 Hill-climbing starts at a specific point in space, | |
2429 considers all nearest neighbors, and moves to the neighbor | |
2430 that has the highest score; if no neighbors have higher | |
2431 score than the current point (i.e., we have reached a local maximum), | |
2432 the algorithm stops. One can then restart in another part of the space. | |
2433 <p> | |
2434 A common definition of "neighbor" is all graphs that can be | |
2435 generated from the current graph by adding, deleting or reversing a | |
2436 single arc, subject to the acyclicity constraint. | |
2437 Other neighborhoods are possible: see | |
2438 <a href="http://research.microsoft.com/~dmax/publications/jmlr02.pdf"> | |
2439 Optimal Structure Identification with Greedy Search</a>, Max | |
2440 Chickering, JMLR 2002. | |
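<p>
The loop itself is easy to sketch on top of score_dags. The following is only a sketch,
not a built-in BNT routine; 'neighbors_of_dag' is a hypothetical helper that returns a
cell array of all acyclic graphs obtained by adding, deleting or reversing one arc.
<pre>
dag = zeros(N,N);                        % start from the empty graph
best_score = score_dags(data, ns, {dag});
done = 0;
while ~done
  nbrs = neighbors_of_dag(dag);          % hypothetical helper (see above)
  scores = score_dags(data, ns, nbrs);
  [best_nbr_score, j] = max(scores);
  if best_nbr_score > best_score
    dag = nbrs{j}; best_score = best_nbr_score;
  else
    done = 1;                            % local maximum reached
  end
end
</pre>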
2441 | |
2442 <!-- | |
2443 Note: This algorithm is currently (Feb '02) being implemented by Qian | |
2444 Diao. | |
2445 --> | |
2446 | |
2447 | |
2448 <h2><a name="mcmc">MCMC</h2> | |
2449 | |
2450 We can use a Markov Chain Monte Carlo (MCMC) algorithm called | |
2451 Metropolis-Hastings (MH) to search the space of all | |
2452 DAGs. | |
2453 The standard proposal distribution is to consider moving to all | |
2454 nearest neighbors in the sense defined <a href="#hill_climb">above</a>. | |
2455 <p> | |
2456 The function can be called | |
2457 as in the following example. | |
2458 <pre> | |
2459 [sampled_graphs, accept_ratio] = learn_struct_mcmc(data, ns, 'nsamples', 100, 'burnin', 10); | |
2460 </pre> | |
2461 We can convert our set of sampled graphs to a histogram | |
2462 (empirical posterior over all the DAGs) thus | |
2463 <pre> | |
2464 all_dags = mk_all_dags(N); | |
2465 mcmc_post = mcmc_sample_to_hist(sampled_graphs, all_dags); | |
2466 </pre> | |
2467 To see how well this performs, let us compute the exact posterior exhaustively. | |
2468 <p> | |
2469 <pre> | |
2470 score = score_dags(data, ns, all_dags); | |
2471 post = normalise(exp(score)); % assuming uniform structural prior | |
2472 </pre> | |
2473 We plot the results below. | |
2474 (The data set was 100 samples drawn from a random 4 node bnet; see the | |
2475 file BNT/examples/static/mcmc1.) | |
2476 <pre> | |
2477 subplot(2,1,1) | |
2478 bar(post) | |
2479 subplot(2,1,2) | |
2480 bar(mcmc_post) | |
2481 </pre> | |
2482 <img src="Figures/mcmc_post.jpg" width="800" height="500"> | |
2483 <p> | |
2484 We can also plot the acceptance ratio versus number of MCMC steps, | |
2485 as a crude convergence diagnostic. | |
2486 <pre> | |
2487 clf | |
2488 plot(accept_ratio) | |
2489 </pre> | |
2490 <img src="Figures/mcmc_accept.jpg" width="800" height="300"> | |
2491 <p> | |
2492 Even though the number of samples needed by MCMC is theoretically | |
2493 polynomial (not exponential) in the dimensionality of the search space, in practice it has been | |
2494 found that MCMC does not converge in reasonable time for graphs with | |
2495 more than about 10 nodes. | |
2496 | |
2497 | |
2498 | |
2499 | |
2500 <h2><a name="active">Active structure learning</h2> | |
2501 | |
2502 As was mentioned <a href="#markov_equiv">above</a>, | |
2503 one can only learn a DAG up to Markov equivalence, even given infinite data. | |
2504 If one is interested in learning the structure of a causal network, | |
2505 one needs interventional data. | |
2506 (By "intervention" we mean forcing a node to take on a specific value, | |
2507 thereby effectively severing its incoming arcs.) | |
2508 <p> | |
2509 Most of the scoring functions accept an optional argument | |
2510 that specifies whether a node was observed to have a certain value, or | |
2511 was forced to have that value: we set clamped(i,m)=1 if node i was | |
2512 forced in training case m. e.g., see the file | |
2513 BNT/examples/static/cooper_yoo. | |
2514 <p> | |
2515 An interesting question is to decide which interventions to perform | |
2516 (c.f., design of experiments). For details, see the following tech | |
2517 report | |
2518 <ul> | |
2519 <li> <a href = "../../Papers/alearn.ps.gz"> | |
2520 Active learning of causal Bayes net structure</a>, Kevin Murphy, March | |
2521 2001. | |
2522 </ul> | |
2523 | |
2524 | |
2525 <h2><a name="struct_em">Structural EM</h2> | |
2526 | |
2527 Computing the Bayesian score when there is partial observability is | |
2528 computationally challenging, because the parameter posterior becomes | |
2529 multimodal (the hidden nodes induce a mixture distribution). | |
2530 One therefore needs to use approximations such as BIC. | |
2531 Unfortunately, search algorithms are still expensive, because we need | |
2532 to run EM at each step to compute the MLE, which is needed to compute | |
2533 the score of each model. An alternative approach is | |
2534 to do the local search steps inside of the M step of EM, which is more | |
2535 efficient since the data has been "filled in" - this is | |
2536 called the structural EM algorithm (Friedman 1997), and provably | |
2537 converges to a local maximum of the BIC score. | |
2538 <p> | |
2539 Wei Hu has implemented SEM for discrete nodes. | |
2540 You can download his package from | |
2541 <a href="../SEM.zip">here</a>. | |
2542 Please address all questions about this code to | |
2543 wei.hu@intel.com. | |
2544 See also <a href="#phl">Phl's implementation of SEM</a>. | |
2545 | |
2546 <!-- | |
2547 <h2><a name="reveal">REVEAL algorithm</h2> | |
2548 | |
2549 A simple way to learn the structure of a fully observed, discrete, | |
2550 factored DBN from a time series is described <a | |
2551 href="usage_dbn.html#struct_learn">here</a>. | |
2552 --> | |
2553 | |
2554 | |
2555 <h2><a name="graphdraw">Visualizing the graph</h2> | |
2556 | |
2557 Click <a href="graphviz.html">here</a> for more information | |
2558 on graph visualization. | |
2559 | |
2560 <h2><a name = "constraint">Constraint-based methods</h2> | |
2561 | |
2562 The IC algorithm (Pearl and Verma, 1991), | |
2563 and the faster, but otherwise equivalent, PC algorithm (Spirtes, Glymour, and Scheines 1993), | |
compute many conditional independence tests,
and combine these constraints into a
2566 PDAG to represent the whole | |
2567 <a href="#markov_equiv">Markov equivalence class</a>. | |
2568 <p> | |
2569 IC*/FCI extend IC/PC to handle latent variables: see <a href="#ic_star">below</a>. | |
2570 (IC stands for inductive causation; PC stands for Peter and Clark, | |
2571 the first names of Spirtes and Glymour; FCI stands for fast causal | |
2572 inference. | |
2573 What we, following Pearl (2000), call IC* was called | |
2574 IC in the original Pearl and Verma paper.) | |
2575 For details, see | |
2576 <ul> | |
2577 <li> | |
2578 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Causation, | |
2579 Prediction, and Search</a>, Spirtes, Glymour and | |
2580 Scheines (SGS), 2001 (2nd edition), MIT Press. | |
2581 <li> | |
2582 <a href="http://bayes.cs.ucla.edu/BOOK-2K/index.html">Causality: Models, Reasoning and Inference</a>, J. Pearl, | |
2583 2000, Cambridge University Press. | |
2584 </ul> | |
2585 | |
2586 <p> | |
2587 | |
2588 The PC algorithm takes as arguments a function f, the number of nodes N, | |
2589 the maximum fan in K, and additional arguments A which are passed to f. | |
2590 The function f(X,Y,S,A) returns 1 if X is conditionally independent of Y given S, and 0 | |
2591 otherwise. | |
2592 For example, suppose we cheat by | |
2593 passing in a CI "oracle" which has access to the true DAG; the oracle | |
2594 tests for d-separation in this DAG, i.e., | |
f(X,Y,S) calls dsep(X,Y,S,dag). We can do this as follows.
2596 <pre> | |
2597 pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag); | |
2598 </pre> | |
2599 pdag(i,j) = -1 if there is definitely an i->j arc, | |
and pdag(i,j) = 1 if there is either an i->j or an i<-j arc.
2601 <p> | |
2602 Applied to the sprinkler network, this returns | |
2603 <pre> | |
2604 pdag = | |
2605 0 1 1 0 | |
2606 1 0 0 -1 | |
2607 1 0 0 -1 | |
2608 0 0 0 0 | |
2609 </pre> | |
2610 So as expected, we see that the V-structure at the W node is uniquely identified, | |
2611 but the other arcs have ambiguous orientation. | |
2612 <p> | |
We now give an example from p141 (1st edn) / p103 (2nd edn) of the SGS
2614 book. | |
2615 This example concerns the female orgasm. | |
2616 We are given a correlation matrix C between 7 measured factors (such | |
2617 as subjective experiences of coital and masturbatory experiences), | |
2618 derived from 281 samples, and want to learn a causal model of the | |
2619 data. We will not discuss the merits of this type of work here, but | |
2620 merely show how to reproduce the results in the SGS book. | |
2621 Their program, | |
2622 <a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Tetrad</a>, | |
2623 makes use of the Fisher Z-test for conditional | |
2624 independence, so we do the same: | |
2625 <pre> | |
2626 max_fan_in = 4; | |
2627 nsamples = 281; | |
2628 alpha = 0.05; | |
2629 pdag = learn_struct_pdag_pc('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha); | |
2630 </pre> | |
2631 In this case, the CI test is | |
2632 <pre> | |
2633 f(X,Y,S) = cond_indep_fisher_z(X,Y,S, C,nsamples,alpha) | |
2634 </pre> | |
2635 The results match those of Fig 12a of SGS apart from two edge | |
2636 differences; presumably this is due to rounding error (although it | |
2637 could be a bug, either in BNT or in Tetrad). | |
2638 This example can be found in the file BNT/examples/static/pc2.m. | |
2639 | |
2640 <p> | |
2641 | |
2642 The IC* algorithm (Pearl and Verma, 1991), | |
2643 and the faster FCI algorithm (Spirtes, Glymour, and Scheines 1993), | |
2644 are like the IC/PC algorithm, except that they can detect the presence | |
2645 of latent variables. | |
2646 See the file <tt>learn_struct_pdag_ic_star</tt> written by Tamar | |
2647 Kushnir. The output is a matrix P, defined as follows | |
2648 (see Pearl (2000), p52 for details): | |
2649 <pre> | |
2650 % P(i,j) = -1 if there is either a latent variable L such that i <-L->j OR there is a directed edge from i->j. | |
2651 % P(i,j) = -2 if there is a marked directed i-*>j edge. | |
% P(i,j) = P(j,i) = 1 if there is an undirected edge i--j
2653 % P(i,j) = P(j,i) = 2 if there is a latent variable L such that i<-L->j. | |
2654 </pre> | |
2655 | |
2656 | |
2657 <h2><a name="phl">Philippe Leray's structure learning package</h2> | |
2658 | |
2659 Philippe Leray has written a | |
2660 <a href="http://bnt.insa-rouen.fr/ajouts.html"> | |
2661 structure learning package</a> that uses BNT. | |
2662 | |
It currently (June 2003) has the following features:
2664 <ul> | |
2665 <li>PC with Chi2 statistical test | |
2666 <li> MWST : Maximum weighted Spanning Tree | |
2667 <li> Hill Climbing | |
2668 <li> Greedy Search | |
2669 <li> Structural EM | |
2670 <li> hist_ic : optimal Histogram based on IC information criterion | |
2671 <li> cpdag_to_dag | |
2672 <li> dag_to_cpdag | |
2673 <li> ... | |
2674 </ul> | |
2675 | |
2676 | |
2677 </a> | |
2678 | |
2679 | |
2680 <!-- | |
2681 <h2><a name="read_learning">Further reading on learning</h2> | |
2682 | |
2683 I recommend the following tutorials for more details on learning. | |
2684 <ul> | |
2685 <li> <a | |
2686 href="http://www.cs.berkeley.edu/~murphyk/Papers/intel.ps.gz">My short | |
2687 tutorial</a> on graphical models, which contains an overview of learning. | |
2688 | |
2689 <li> | |
2690 <A HREF="ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.PS"> | |
2691 A tutorial on learning with Bayesian networks</a>, D. Heckerman, | |
2692 Microsoft Research Tech Report, 1995. | |
2693 | |
2694 <li> <A HREF="http://www-cad.eecs.berkeley.edu/~wray/Mirror/lwgmja"> | |
2695 Operations for Learning with Graphical Models</a>, | |
2696 W. L. Buntine, JAIR'94, 159--225. | |
2697 </ul> | |
2698 <p> | |
2699 --> | |
2700 | |
2701 | |
2702 | |
2703 | |
2704 | |
2705 <h1><a name="engines">Inference engines</h1> | |
2706 | |
2707 Up until now, we have used the junction tree algorithm for inference. | |
2708 However, sometimes this is too slow, or not even applicable. | |
In general, there are many inference algorithms, each of which makes
2710 different tradeoffs between speed, accuracy, complexity and | |
2711 generality. Furthermore, there might be many implementations of the | |
2712 same algorithm; for instance, a general purpose, readable version, | |
2713 and a highly-optimized, specialized one. | |
2714 To cope with this variety, we treat each inference algorithm as an | |
2715 object, which we call an inference engine. | |
2716 | |
2717 <p> | |
2718 An inference engine is an object that contains a bnet and supports the | |
2719 'enter_evidence' and 'marginal_nodes' methods. The engine constructor | |
2720 takes the bnet as argument and may do some model-specific processing. | |
2721 When 'enter_evidence' is called, the engine may do some | |
2722 evidence-specific processing. Finally, when 'marginal_nodes' is | |
2723 called, the engine may do some query-specific processing. | |
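In code, every engine therefore follows the same three-step pattern we have been using
throughout this document, e.g., with the junction tree engine:
<pre>
engine = jtree_inf_engine(bnet);                  % model-specific processing
[engine, ll] = enter_evidence(engine, evidence);  % evidence-specific processing
m = marginal_nodes(engine, query_nodes);          % query-specific processing
</pre>
(Here 'evidence' and 'query_nodes' stand for whatever your application supplies.)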
2724 | |
2725 <p> | |
2726 The amount of work done when each stage is specified -- structure, | |
2727 parameters, evidence, and query -- depends on the engine. The cost of | |
2728 work done early in this sequence can be amortized. On the other hand, | |
2729 one can make better optimizations if one waits until later in the | |
2730 sequence. | |
2731 For example, the parameters might imply | |
conditional independencies that are not evident in the graph structure,
2733 but can nevertheless be exploited; the evidence indicates which nodes | |
2734 are observed and hence can effectively be disconnected from the | |
2735 graph; and the query might indicate that large parts of the network | |
2736 are d-separated from the query nodes. (Since it is not the actual | |
2737 <em>values</em> of the evidence that matters, just which nodes are observed, | |
2738 many engines allow you to specify which nodes will be observed when they are constructed, | |
2739 i.e., before calling 'enter_evidence'. Some engines can still cope if | |
2740 the actual pattern of evidence is different, e.g., if there is missing | |
2741 data.) | |
2742 <p> | |
2743 | |
2744 Although being maximally lazy (i.e., only doing work when a query is | |
2745 issued) may seem desirable, | |
2746 this is not always the most efficient. | |
2747 For example, | |
2748 when learning using EM, we need to call marginal_nodes N times, where N is the | |
number of nodes. <a href="#varelim">Variable elimination</a> would end
2750 up repeating a lot of work | |
2751 each time marginal_nodes is called, making it inefficient for | |
2752 learning. The junction tree algorithm, by contrast, uses dynamic | |
2753 programming to avoid this redundant computation --- it calculates all | |
2754 marginals in two passes during 'enter_evidence', so calling | |
2755 'marginal_nodes' takes constant time. | |
2756 <p> | |
2757 We will discuss some of the inference algorithms implemented in BNT | |
2758 below, and finish with a <a href="#engine_summary">summary</a> of all | |
2759 of them. | |
2760 | |
2761 | |
2762 | |
2763 | |
2764 | |
2765 | |
2766 | |
2767 <h2><a name="varelim">Variable elimination</h2> | |
2768 | |
2769 The variable elimination algorithm, also known as bucket elimination | |
2770 or peeling, is one of the simplest inference algorithms. | |
2771 The basic idea is to "push sums inside of products"; this is explained | |
2772 in more detail | |
2773 <a | |
2774 href="http://HTTP.CS.Berkeley.EDU/~murphyk/Bayes/bayes.html#infer">here</a>. | |
2775 <p> | |
2776 The principle of distributing sums over products can be generalized | |
2777 greatly to apply to any commutative semiring. | |
2778 This forms the basis of many common algorithms, such as Viterbi | |
2779 decoding and the Fast Fourier Transform. For details, see | |
2780 | |
2781 <ul> | |
2782 <li> R. McEliece and S. M. Aji, 2000. | |
2783 <!--<a href="http://www.systems.caltech.edu/EE/Faculty/rjm/papers/GDL.ps">--> | |
2784 <a href="GDL.pdf"> | |
2785 The Generalized Distributive Law</a>, | |
2786 IEEE Trans. Inform. Theory, vol. 46, no. 2 (March 2000), | |
2787 pp. 325--343. | |
2788 | |
2789 | |
2790 <li> | |
2791 F. R. Kschischang, B. J. Frey and H.-A. Loeliger, 2001. | |
2792 <a href="http://www.cs.toronto.edu/~frey/papers/fgspa.abs.html"> | |
2793 Factor graphs and the sum-product algorithm</a> | |
2794 IEEE Transactions on Information Theory, February, 2001. | |
2795 | |
2796 </ul> | |
2797 | |
2798 <p> | |
2799 Choosing an order in which to sum out the variables so as to minimize | |
2800 computational cost is known to be NP-hard. | |
2801 The implementation of this algorithm in | |
2802 <tt>var_elim_inf_engine</tt> makes no attempt to optimize this | |
2803 ordering (in contrast, say, to <tt>jtree_inf_engine</tt>, which uses a | |
2804 greedy search procedure to find a good ordering). | |
2805 <p> | |
2806 Note: unlike most algorithms, var_elim does all its computational work | |
2807 inside of <tt>marginal_nodes</tt>, not inside of | |
2808 <tt>enter_evidence</tt>. | |
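<p>
The calling convention is the same as for the other engines. A minimal sketch,
assuming <tt>bnet</tt> and <tt>evidence</tt> have been constructed as above:
<pre>
engine = var_elim_inf_engine(bnet);
engine = enter_evidence(engine, evidence);   % cheap: the evidence is simply stored
m = marginal_nodes(engine, i);               % all the summing out is done here
</pre>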
2809 | |
2810 | |
2811 | |
2812 | |
2813 <h2><a name="global">Global inference methods</h2> | |
2814 | |
The simplest inference algorithm of all is to explicitly construct
the joint distribution over all the nodes, and then to marginalize it.
2817 This is implemented in <tt>global_joint_inf_engine</tt>. | |
2818 Since the size of the joint is exponential in the | |
2819 number of discrete (hidden) nodes, this is not a very practical algorithm. | |
2820 It is included merely for pedagogical and debugging purposes. | |
2821 <p> | |
2822 Three specialized versions of this algorithm have also been implemented, | |
2823 corresponding to the cases where all the nodes are discrete (D), all | |
2824 are Gaussian (G), and some are discrete and some Gaussian (CG). | |
2825 They are called <tt>enumerative_inf_engine</tt>, | |
2826 <tt>gaussian_inf_engine</tt>, | |
2827 and <tt>cond_gauss_inf_engine</tt> respectively. | |
2828 <p> | |
2829 Note: unlike most algorithms, these global inference algorithms do all their computational work | |
2830 inside of <tt>marginal_nodes</tt>, not inside of | |
2831 <tt>enter_evidence</tt>. | |
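<p>
These engines follow the same calling convention as the others. For example, a
sketch using the enumerative engine on an all-discrete network (assuming
<tt>bnet</tt> and <tt>evidence</tt> as before):
<pre>
engine = enumerative_inf_engine(bnet);       % or gaussian_inf_engine, cond_gauss_inf_engine
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);               % the joint is constructed and marginalized here
</pre>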
2832 | |
2833 | |
2834 <h2><a name="quickscore">Quickscore</h2> | |
2835 | |
2836 The junction tree algorithm is quite slow on the <a href="#qmr">QMR</a> network, | |
2837 since the cliques are so big. | |
2838 One simple trick we can use is to notice that hidden leaves do not | |
2839 affect the posteriors on the roots, and hence do not need to be | |
2840 included in the network. | |
2841 A second trick is to notice that the negative findings can be | |
2842 "absorbed" into the prior: | |
2843 see the file | |
2844 BNT/examples/static/mk_minimal_qmr_bnet for details. | |
2845 <p> | |
2846 | |
2847 A much more significant speedup is obtained by exploiting special | |
2848 properties of the noisy-or node, as done by the quickscore | |
2849 algorithm. For details, see | |
2850 <ul> | |
2851 <li> Heckerman, "A tractable inference algorithm for diagnosing multiple diseases", UAI 89. | |
2852 <li> Rish and Dechter, "On the impact of causal independence", UCI | |
2853 tech report, 1998. | |
2854 </ul> | |
2855 | |
2856 This has been implemented in BNT as a special-purpose inference | |
2857 engine, which can be created and used as follows: | |
<pre>
% inhibit, leak and prior specify the noisy-or parameters of the QMR model
% (see the QMR section above); pos and neg are the lists of positive and
% negative findings
engine = quickscore_inf_engine(inhibit, leak, prior);
engine = enter_evidence(engine, pos, neg);
m = marginal_nodes(engine, i);   % posterior marginal on disease (root) node i
</pre>
2863 | |
2864 | |
2865 <h2><a name="belprop">Belief propagation</h2> | |
2866 | |
2867 Even using quickscore, exact inference takes time that is exponential | |
2868 in the number of positive findings. | |
2869 Hence for large networks we need to resort to approximate inference techniques. | |
2870 See for example | |
2871 <ul> | |
2872 <li> T. Jaakkola and M. Jordan, "Variational probabilistic inference and the | |
2873 QMR-DT network", JAIR 10, 1999. | |
2874 | |
2875 <li> K. Murphy, Y. Weiss and M. Jordan, "Loopy belief propagation for approximate inference: an empirical study", | |
2876 UAI 99. | |
2877 </ul> | |
2878 The latter approximation | |
2879 entails applying Pearl's belief propagation algorithm to a model even | |
2880 if it has loops (hence the name loopy belief propagation). | |
2881 Pearl's algorithm, implemented as <tt>pearl_inf_engine</tt>, gives | |
2882 exact results when applied to singly-connected graphs | |
2883 (a.k.a. polytrees, since | |
2884 the underlying undirected topology is a tree, but a node may have | |
2885 multiple parents). | |
The same <tt>pearl_inf_engine</tt> can also be applied to a graph with loops;
it can use either a centralized or a distributed message-passing protocol.
You can use it as in the following example.
<pre>
engine = pearl_inf_engine(bnet, 'max_iter', 30);   % run at most 30 iterations of message passing
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);                     % (approximate) marginal on node i
</pre>
We found that this algorithm often converges, and when it does, it is often
very accurate; however, this depends on the precise parameter values of the network.
(See the file BNT/examples/static/qmr1 to repeat the experiment for yourself.)
Understanding when and why belief propagation converges and gives accurate
answers is a topic of ongoing research.
2901 <p> | |
2902 <tt>pearl_inf_engine</tt> can exploit special structure in noisy-or | |
2903 and gmux nodes to compute messages efficiently. | |
2904 <p> | |
2905 <tt>belprop_inf_engine</tt> is like pearl, but uses potentials to | |
2906 represent messages. Hence this is slower. | |
2907 <p> | |
2908 <tt>belprop_fg_inf_engine</tt> is like belprop, | |
2909 but is designed for factor graphs. | |
2910 | |
2911 | |
2912 | |
2913 <h2><a name="sampling">Sampling</h2> | |
2914 | |
2915 BNT now (Mar '02) has two sampling (Monte Carlo) inference algorithms: | |
2916 <ul> | |
2917 <li> <tt>likelihood_weighting_inf_engine</tt> which does importance | |
2918 sampling and can handle any node type. | |
2919 <li> <tt>gibbs_sampling_inf_engine</tt>, written by Bhaskara Marthi. | |
2920 Currently this can only handle tabular CPDs. | |
2921 For a much faster and more powerful Gibbs sampling program, see | |
2922 <a href="http://www.mrc-bsu.cam.ac.uk/bugs">BUGS</a>. | |
2923 </ul> | |
2924 Note: To generate samples from a network (which is not the same as inference!), | |
2925 use <tt>sample_bnet</tt>. | |
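<p>
Both sampling engines follow the usual engine protocol; their constructors also
accept various options (e.g., controlling the number of samples), which we omit
here. A minimal sketch, assuming <tt>bnet</tt> and <tt>evidence</tt> as before:
<pre>
engine = likelihood_weighting_inf_engine(bnet);   % or gibbs_sampling_inf_engine(bnet)
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);      % Monte Carlo estimate of the marginal on node i

s = sample_bnet(bnet);              % s{j} contains the sampled value of node j (no evidence)
</pre>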
2926 | |
2927 | |
2928 | |
2929 <h2><a name="engine_summary">Summary of inference engines</h2> | |
2930 | |
2931 | |
2932 The inference engines differ in many ways. Here are | |
2933 some of the major "axes": | |
2934 <ul> | |
2935 <li> Works for all topologies or makes restrictions? | |
2936 <li> Works for all node types or makes restrictions? | |
2937 <li> Exact or approximate inference? | |
2938 </ul> | |
2939 | |
2940 <p> | |
2941 In terms of topology, most engines handle any kind of DAG. | |
2942 <tt>belprop_fg</tt> does approximate inference on factor graphs (FG), which | |
2943 can be used to represent directed, undirected, and mixed (chain) | |
2944 graphs. | |
2945 (In the future, we plan to support exact inference on chain graphs.) | |
2946 <tt>quickscore</tt> only works on QMR-like models. | |
2947 <p> | |
2948 In terms of node types: algorithms that use potentials can handle | |
2949 discrete (D), Gaussian (G) or conditional Gaussian (CG) models. | |
2950 Sampling algorithms can essentially handle any kind of node (distribution). | |
2951 Other algorithms make more restrictive assumptions in exchange for | |
2952 speed. | |
2953 <p> | |
2954 Finally, most algorithms are designed to give the exact answer. | |
2955 The belief propagation algorithms are exact if applied to trees, and | |
2956 in some other cases. | |
2957 Sampling is considered approximate, even though, in the limit of an | |
2958 infinite number of samples, it gives the exact answer. | |
2959 | |
2960 <p> | |
2961 | |
2962 Here is a summary of the properties | |
2963 of all the engines in BNT which work on static networks. | |
2964 <p> | |
<table border>
<tr>
<td align=left>Name
<td align=left>Exact?
<td align=left>Node type?
<td align=left>Topology
<tr>
2973 <td align=left> belprop | |
2974 <td align=left> approx | |
2975 <td align=left> D | |
2976 <td align=left> DAG | |
2977 <tr> | |
2978 <td align=left> belprop_fg | |
2979 <td align=left> approx | |
2980 <td align=left> D | |
2981 <td align=left> factor graph | |
2982 <tr> | |
2983 <td align=left> cond_gauss | |
2984 <td align=left> exact | |
2985 <td align=left> CG | |
2986 <td align=left> DAG | |
2987 <tr> | |
2988 <td align=left> enumerative | |
2989 <td align=left> exact | |
2990 <td align=left> D | |
2991 <td align=left> DAG | |
2992 <tr> | |
2993 <td align=left> gaussian | |
2994 <td align=left> exact | |
2995 <td align=left> G | |
2996 <td align=left> DAG | |
2997 <tr> | |
2998 <td align=left> gibbs | |
2999 <td align=left> approx | |
3000 <td align=left> D | |
3001 <td align=left> DAG | |
3002 <tr> | |
3003 <td align=left> global_joint | |
3004 <td align=left> exact | |
3005 <td align=left> D,G,CG | |
3006 <td align=left> DAG | |
3007 <tr> | |
3008 <td align=left> jtree | |
3009 <td align=left> exact | |
3010 <td align=left> D,G,CG | |
3011 <td align=left> DAG | |
<tr>
3013 <td align=left> likelihood_weighting | |
3014 <td align=left> approx | |
3015 <td align=left> any | |
3016 <td align=left> DAG | |
3017 <tr> | |
3018 <td align=left> pearl | |
3019 <td align=left> approx | |
3020 <td align=left> D,G | |
3021 <td align=left> DAG | |
3022 <tr> | |
3023 <td align=left> pearl | |
3024 <td align=left> exact | |
3025 <td align=left> D,G | |
3026 <td align=left> polytree | |
3027 <tr> | |
3028 <td align=left> quickscore | |
3029 <td align=left> exact | |
3030 <td align=left> noisy-or | |
3031 <td align=left> QMR | |
3032 <tr> | |
3033 <td align=left> stab_cond_gauss | |
3034 <td align=left> exact | |
3035 <td align=left> CG | |
3036 <td align=left> DAG | |
3037 <tr> | |
3038 <td align=left> var_elim | |
3039 <td align=left> exact | |
3040 <td align=left> D,G,CG | |
3041 <td align=left> DAG | |
3042 </table> | |
3043 | |
3044 | |
3045 | |
3046 <h1><a name="influence">Influence diagrams/ decision making</h1> | |
3047 | |
3048 BNT implements an exact algorithm for solving LIMIDs (limited memory | |
3049 influence diagrams), described in | |
3050 <ul> | |
3051 <li> S. L. Lauritzen and D. Nilsson. | |
3052 <a href="http://www.math.auc.dk/~steffen/papers/limids.pdf"> | |
3053 Representing and solving decision problems with limited | |
3054 information</a> | |
3055 Management Science, 47, 1238 - 1251. September 2001. | |
3056 </ul> | |
LIMIDs explicitly show all information arcs, rather than implicitly
assuming no forgetting. This allows them to model forgetful
controllers.
3060 <p> | |
3061 See the examples in <tt>BNT/examples/limids</tt> for details. | |
3062 | |
3063 | |
3064 | |
3065 | |
3066 <h1>DBNs, HMMs, Kalman filters and all that</h1> | |
3067 | |
3068 Click <a href="usage_dbn.html">here</a> for documentation about how to | |
3069 use BNT for dynamical systems and sequence data. | |
3070 | |
3071 | |
3072 </BODY> |