To specify this directed acyclic graph (dag), we create an adjacency matrix:

  N = 4;
  dag = zeros(N,N);
  C = 1; S = 2; R = 3; W = 4;
  dag(C,[R S]) = 1;
  dag(R,W) = 1;
  dag(S,W) = 1;
We have numbered the nodes as follows: Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4. The nodes must always be numbered in topological order, i.e., ancestors before descendants. For a more complicated graph, this is a little inconvenient: we will see how to get around this below.

In Matlab 6, you can use logical arrays instead of double arrays, which are 4 times smaller:

  dag = false(N,N);
  dag(C,[R S]) = true;
  ...

However, some graph functions (e.g., acyclic) do not work on logical arrays!

You can visualize the resulting graph structure using the methods discussed below. For details on GUIs, click here.
In addition to the graph structure, we must specify the size and type of each node. Here all nodes are discrete and binary:

  discrete_nodes = 1:N;
  node_sizes = 2*ones(1,N);

If the nodes were not binary, you could type e.g.,

  node_sizes = [4 2 3 5];

meaning that Cloudy has 4 possible values, Sprinkler has 2 possible values, etc. Note that these are cardinal values, not ordinal, i.e., they are not ordered in any way, like 'low', 'medium', 'high'.
We are now ready to make the Bayes net:

  bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes);

By default, all nodes are assumed to be discrete, so we can also just write

  bnet = mk_bnet(dag, node_sizes);

You may also specify which nodes will be observed. If you don't know, or if this is not fixed in advance, just use the empty list (the default).

  onodes = [];
  bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);

Note that optional arguments are specified using a name/value syntax. This is common for many BNT functions. In general, to find out more about a function (e.g., which optional arguments it takes), please see its documentation string by typing

  help mk_bnet

See also other useful Matlab tips.
It is possible to associate names with nodes, as follows:

  bnet = mk_bnet(dag, node_sizes, 'names', {'cloudy','S','R','W'}, 'discrete', 1:4);

You can then refer to a node by its name:

  C = bnet.names('cloudy'); % bnet.names is an associative array
  bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);

This feature uses my own associative array class.
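For instance, here is a minimal sketch (using the names chosen above) that looks up the remaining nodes by name before assigning their CPDs; the CPT values are the ones used elsewhere in this example.

  S = bnet.names('S');
  R = bnet.names('R');
  W = bnet.names('W');
  % assign a CPT to the WetGrass node found by name
  bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);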
Tabular CPDs, also called CPTs (conditional probability tables), are stored as multidimensional arrays, where the dimensions are arranged in the same order as the nodes, e.g., the CPT for node 4 (WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself. Hence the child is always the last dimension. If a node has no parents, its CPT is a column vector representing its prior. Note that in Matlab (unlike C), arrays are indexed from 1, and are laid out in memory such that the first index toggles fastest, e.g., the CPT for node 4 (WetGrass) is as follows
  S R | P(W=F)  P(W=T)
  ---------------------
  F F |  1.0     0.0
  T F |  0.1     0.9
  F T |  0.1     0.9
  T T |  0.01    0.99
where we have used the convention that false==1, true==2. We can create this CPT in Matlab as follows

  CPT = zeros(2,2,2);
  CPT(1,1,1) = 1.0;
  CPT(2,1,1) = 0.1;
  ...

Here is an easier way:

  CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);

In fact, we don't need to reshape the array, since the CPD constructor will do that for us. So we can just write

  bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);

The other nodes are created similarly (using the old syntax for optional parameters)

  bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
  bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
  bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
  bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
If we do not specify the CPT, random parameters will be created. To ensure repeatable results, set the seeds of the random number generators:

  rand('state', seed);
  randn('state', seed);

To control the degree of randomness (entropy), you can sample each row of the CPT from a Dirichlet(p,p,...) distribution. If p << 1, this encourages "deterministic" CPTs (one entry near 1, the rest near 0). If p = 1, each entry is drawn from U[0,1]. If p >> 1, the entries will all be near 1/k, where k is the arity of this node, i.e., each row will be nearly uniform. You can do this as follows, assuming this node is number i, and ns is the node_sizes.

  k = ns(i);
  ps = parents(dag, i);
  psz = prod(ns(ps));
  CPT = sample_dirichlet(p*ones(1,k), psz);
  bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT);
It is currently not possible to save/load a BNT matlab object to file, but this is easily fixed if you modify all the constructors for all the classes (see matlab documentation).

Having created the Bayes net, we can use it for inference by creating an inference engine, e.g., a junction tree engine:

  engine = jtree_inf_engine(bnet);

The other engines have similar constructors, but might take additional, algorithm-specific parameters. All engines are used in the same way, once they have been created. We illustrate this in the following sections.
Suppose we observe that the grass is wet. We specify this evidence as follows:

  evidence = cell(1,N);
  evidence{W} = 2;

We use a 1D cell array instead of a vector to cope with the fact that nodes can be vectors of different lengths. In addition, the value [] can be used to denote 'no evidence', instead of having to specify the observation pattern as a separate argument. (Click here for a quick tutorial on cell arrays in matlab.)

We are now ready to add the evidence to the engine.

  [engine, loglik] = enter_evidence(engine, evidence);

The behavior of this function is algorithm-specific, and is discussed in more detail below. In the case of the jtree engine, enter_evidence implements a two-pass message-passing scheme. The first return argument contains the modified engine, which incorporates the evidence. The second return argument contains the log-likelihood of the evidence. (Not all engines are capable of computing the log-likelihood.)

Finally, we can compute p = P(S=2|W=2) as follows.

  marg = marginal_nodes(engine, S);
  marg.T
  ans =
      0.57024
      0.42976
  p = marg.T(2);

We see that p = 0.4298.
Now let us add the evidence that it was raining, and see what difference it makes.

  evidence{R} = 2;
  [engine, loglik] = enter_evidence(engine, evidence);
  marg = marginal_nodes(engine, S);
  p = marg.T(2);

We find that p = P(S=2|W=2,R=2) = 0.1945, which is lower than before, because the rain can "explain away" the fact that the grass is wet.

You can plot a marginal distribution over a discrete variable as a barchart using the built-in 'bar' function:

  bar(marg.T)
What happens if we ask for the marginal on an observed node, e.g., P(W|W=2)?

  evidence = cell(1,N);
  evidence{W} = 2;
  engine = enter_evidence(engine, evidence);
  m = marginal_nodes(engine, W);
  m.T
  ans =
      1

This can get a little confusing, since we assigned W=2. So we can ask BNT to add the evidence back in by passing in an optional argument:

  m = marginal_nodes(engine, W, 1);
  m.T
  ans =
      0
      1

This shows that P(W=1|W=2) = 0 and P(W=2|W=2) = 1.
We can also compute the joint distribution on a set of nodes, as in the following example.

  evidence = cell(1,N);
  [engine, ll] = enter_evidence(engine, evidence);
  m = marginal_nodes(engine, [S R W]);

m is a structure. The 'T' field is a multi-dimensional array (in this case, 3-dimensional) that contains the joint probability distribution on the specified nodes.

  >> m.T
  ans(:,:,1) =
      0.2900    0.0410
      0.0210    0.0009
  ans(:,:,2) =
           0    0.3690
      0.1890    0.0891

We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass to be wet if both the rain and sprinkler are off.

Let us now add some evidence to R.

  evidence{R} = 2;
  [engine, ll] = enter_evidence(engine, evidence);
  m = marginal_nodes(engine, [S R W])
  m =
      domain: [2 3 4]
           T: [2x1x2 double]
  >> m.T
  ans(:,:,1) =
      0.0820
      0.0018
  ans(:,:,2) =
      0.7380
      0.1782

The joint T(i,j,k) = P(S=i,R=j,W=k|evidence) should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible with the evidence that R=2. Instead of creating large tables with many 0s, BNT sets the effective size of observed (discrete) nodes to 1, as explained above. This is why m.T has size 2x1x2. To get a 2x2x2 table, type

  m = marginal_nodes(engine, [S R W], 1)
  m =
      domain: [2 3 4]
           T: [2x2x2 double]
  >> m.T
  ans(:,:,1) =
      0    0.082
      0    0.0018
  ans(:,:,2) =
      0    0.738
      0    0.1782

Note: It is not always possible to compute the joint on arbitrary sets of nodes: it depends on which inference engine you use, as discussed in more detail below.
Sometimes a node is not observed exactly, but we have a likelihood over its possible values; this is called soft (or virtual) evidence. It can be entered as follows:

  [engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);

where soft_evidence{i} is either [] (if node i has no soft evidence) or is a vector representing the probability distribution over i's possible values. For example, if we don't know i's exact value, but we know its likelihood ratio is 60/40, we can write evidence{i} = [] and soft_evidence{i} = [0.6 0.4].

Currently only jtree_inf_engine supports this option. It assumes that all hidden nodes, and all nodes for which we have soft evidence, are discrete. For a longer example, see BNT/examples/static/softev1.m.
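As a rough sketch (not taken from the bundled examples), here is how soft evidence on the Rain node might be combined with hard evidence on WetGrass in the sprinkler network above:

  % hard evidence: the grass is wet
  evidence = cell(1,N);
  evidence{W} = 2;
  % soft evidence: Rain is unobserved, but its likelihood ratio is 60/40
  soft_evidence = cell(1,N);
  soft_evidence{R} = [0.6 0.4];
  engine = jtree_inf_engine(bnet);
  [engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
  marg = marginal_nodes(engine, S);   % P(S | W=2, soft evidence on R)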
To compute the most probable explanation (MPE) of the evidence, i.e., the jointly most likely assignment to the nodes, type

  [mpe, ll] = calc_mpe(engine, evidence);

mpe{i} is the most likely value of node i. This calls enter_evidence with the 'maximize' flag set to 1, which causes the engine to do max-product instead of sum-product. The resulting max-marginals are then thresholded. If there is more than one maximum probability assignment, we must take care to break ties in a consistent manner (thresholding the max-marginals may give the wrong result). To force this behavior, type

  [mpe, ll] = calc_mpe(engine, evidence, 1);

Note that computing the MPE is sometimes called abductive reasoning.

You can also use calc_mpe_bucket, written by Ron Zohar, which does a forwards max-product pass and then a backwards traceback pass, which is how Viterbi is traditionally implemented.
A noisy-OR node is like a logical OR gate, except that each parent that is "on" can be independently inhibited. Let q(A) and q(B) be the probabilities that parents A and B are inhibited. Then the CPD of the child C is

  A B | P(C=off)  P(C=on)
  ------------------------
  F F |  1.0       0.0
  T F |  q(A)      1-q(A)
  F T |  q(B)      1-q(B)
  T T |  q(A)q(B)  1-q(A)q(B)

Thus we see that the causes get inhibited independently. It is common to associate a "leak" node with a noisy-or CPD, which is like a parent that is always on. This can account for all other unmodelled causes which might turn the node on.
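In BNT such a CPD is created with noisyor_CPD, the same constructor used in the QMR example later in this document. A minimal sketch for a binary node i with the two binary parents above, using made-up inhibition values:

  leak_inhibit = 1.0;          % inhibition prob on the leak->child arc (1.0 = no leak effect)
  inhibit = [0.1 0.2];         % q(A), q(B): inhibition probs of the two parents (hypothetical)
  bnet.CPD{i} = noisyor_CPD(bnet, i, leak_inhibit, inhibit);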
The noisy-or distribution is similar to the logistic distribution. To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j) be the prob. that j inhibits i. Then

  Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j)

Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then

  Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j))

For a sigmoid node, we have

  Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j))

where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of the activation function (although both are monotonically increasing). In addition, in the case of a noisy-or, the weights are constrained to be positive, since they derive from probabilities q(i,j). In both cases, the number of parameters is linear in the number of parents, unlike the case of a multinomial distribution, where the number of parameters is exponential in the number of parents. We will see an example of noisy-OR nodes below.
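As a quick numerical check of the equivalence between the two noisy-OR forms above (with made-up inhibition probabilities):

  q = [0.2 0.3];                 % q(i,j) for two parents (hypothetical values)
  S = [1 1];                     % both parents are on
  w = -log(q);                   % w(i,j) = -ln q(i,j)
  p1 = 1 - prod(q.^S);           % 1 - prod_j q(i,j)^S(j)
  p2 = 1 - exp(-sum(w.*S));      % rho(sum_j w(i,j) S(j)), where rho(x) = 1 - exp(-x)
  % p1 and p2 are both 0.94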
Both of these classes are just "syntactic sugar" for the tabular_CPD class.
A softmax (multinomial logit) node has continuous parents X and defines

                     exp(w(:,i)'*x + b(i))
  Pr(Q=i | X=x) = -----------------------------
                  sum_j exp(w(:,j)'*x + b(j))

The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the following interpretation: w(:,i)-w(:,j) is the normal vector to the decision boundary between classes i and j, and b(i)-b(j) is its offset (bias). For example, suppose X is a 2-vector, and Q is binary. Then

  w = [1 -1;
       0  0];

  b = [0 0];

means class 1 are points in the 2D plane with positive x coordinate, and class 2 are points in the 2D plane with negative x coordinate. If w has large magnitude, the decision boundary is sharp, otherwise it is soft. In the special case that Q is binary (0/1), the softmax function reduces to the logistic (sigmoid) function.
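Here is a small sketch (with a hypothetical input point) that evaluates the softmax formula above for the w and b just given:

  x = [2; 3];                      % a test point with positive x coordinate (hypothetical)
  w = [1 -1; 0 0];
  b = [0 0];
  a = w'*x + b';                   % activations, one per class
  p = exp(a) / sum(exp(a));        % p(1) is close to 1, so x is assigned to class 1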
Fitting a softmax function can be done using the iteratively reweighted least squares (IRLS) algorithm. We use the implementation from Netlab. Note that since the softmax distribution is not in the exponential family, it does not have finite sufficient statistics, and hence we must store all the training data in uncompressed form. If this takes too much space, one should use online (stochastic) gradient descent (not implemented in BNT).

If a softmax node also has discrete parents, we use a different set of w/b parameters for each combination of parent values, as in the conditional linear Gaussian CPD. This feature was implemented by Pierpaolo Brutti. He is currently extending it so that discrete parents can be treated as if they were continuous, by adding indicator variables to the X vector.

We will see an example of softmax nodes below.
A Gaussian node Y may have continuous parents X and/or discrete parents Q. The distribution has the following form:

  - no parents:               Y ~ N(mu, Sigma)
  - cts parents:              Y|X=x ~ N(mu + W x, Sigma)
  - discrete parents:         Y|Q=i ~ N(mu(:,i), Sigma(:,:,i))
  - cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i))

where N(mu, Sigma) denotes a Normal distribution with mean mu and covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q respectively. If there are no discrete parents, |Q|=1; if there is more than one, then |Q| = a vector of the sizes of each discrete parent. If there are no continuous parents, |X|=0; if there is more than one, then |X| = the sum of their sizes. Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight) matrix.

We can create a Gaussian node with random parameters as follows.

  bnet.CPD{i} = gaussian_CPD(bnet, i);

We can specify the value of one or more of the parameters as in the following example, in which |Y|=2, and |Q|=1.

  bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', [0; 0], 'weights', randn(Y,X), 'cov', eye(Y));
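For the conditional linear Gaussian case (both kinds of parents), the parameter shapes listed above suggest the following sketch, with hypothetical sizes |Y|=2, |X|=3, |Q|=2:

  Ysz = 2; Xsz = 3; Qsz = 2;                    % hypothetical sizes
  mu    = zeros(Ysz, Qsz);                      % one mean column per discrete parent value
  Sigma = repmat(eye(Ysz), [1 1 Qsz]);          % one covariance per discrete parent value
  W     = randn(Ysz, Xsz, Qsz);                 % one regression matrix per discrete parent value
  bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', mu, 'cov', Sigma, 'weights', W);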
We will see an example of conditional linear Gaussian nodes below.

When learning Gaussians from data, it is helpful to ensure the data has a small magnitude (see e.g., KPMstats/standardize) to prevent numerical problems. Unless you have a lot of data, it is also a very good idea to use diagonal instead of full covariance matrices. (BNT does not currently support spherical covariances, although it would be easy to add, since KPMstats/clg_Mstep supports this option; you would just need to modify gaussian_CPD/update_ess to accumulate weighted inner products.)
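For example, a diagonal covariance can be requested via the 'cov_type' option, the same option used in the factor analysis example later in this document; a minimal sketch:

  % restrict node i to a diagonal covariance matrix
  bnet.CPD{i} = gaussian_CPD(bnet, i, 'cov_type', 'diag');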
The CPD_to_CPT method converts a CPD to a table; this requires that the child and all parents are discrete. The CPT might be exponentially big... convert_to_table evaluates a CPD with evidence, and represents the resulting potential as an array. This requires that the child is discrete, and any continuous parents are observed. convert_to_pot evaluates a CPD with evidence, and represents the resulting potential as a dpot, gpot, cgpot or upot, as requested. (d=discrete, g=Gaussian, cg = conditional Gaussian, u = utility).

When we sample a node, all the parents are observed. When we compute the (log) probability of a node, all the parents and the child are observed.

We also specify if the parameters are learnable. For learning with EM, we require the methods reset_ess, update_ess and maximize_params. For learning from fully observed data, we require the method learn_params. By default, all classes inherit this from generic_CPD, which simply calls update_ess N times, once for each data case, followed by maximize_params, i.e., it is like EM, without the E step. Some classes implement a batch formula, which is quicker.

Bayesian learning means computing a posterior over the parameters given fully observed data.

Pearl means we implement the methods compute_pi and compute_lambda_msg, used by pearl_inf_engine, which runs on directed graphs. belprop_inf_engine only needs convert_to_pot. The pearl methods can exploit special properties of the CPDs for computing the messages efficiently, whereas belprop does not.

The only method implemented by generic_CPD is adjustable_CPD, which is not shown, since it is not very interesting.
The following table summarizes the properties of the CPD classes (B = boolean, D = discrete, C = continuous):

  Name          | Child | Parents | Comments                    | CPD_to_CPT | conv_to_table          | conv_to_pot            | sample                 | prob                   | learn | Bayes | Pearl
  --------------|-------|---------|-----------------------------|------------|------------------------|------------------------|------------------------|------------------------|-------|-------|------
  boolean       | B     | B       | Syntactic sugar for tabular | -          | -                      | -                      | -                      | -                      | -     | -     | -
  deterministic | D     | D       | Syntactic sugar for tabular | -          | -                      | -                      | -                      | -                      | -     | -     | -
  Discrete      | D     | C/D     | Virtual class               | N          | Calls CPD_to_CPT       | Calls conv_to_table    | Calls conv_to_table    | Calls conv_to_table    | N     | N     | N
  Gaussian      | C     | C/D     | -                           | N          | N                      | Y                      | Y                      | Y                      | Y     | N     | N
  gmux          | C     | C/D     | multiplexer                 | N          | N                      | Y                      | N                      | N                      | N     | N     | Y
  MLP           | D     | C/D     | multi layer perceptron      | N          | Y                      | Inherits from discrete | Inherits from discrete | Inherits from discrete | Y     | N     | N
  noisy-or      | B     | B       | -                           | Y          | Inherits from discrete | Inherits from discrete | Inherits from discrete | Inherits from discrete | N     | N     | Y
  root          | C/D   | none    | no params                   | N          | N                      | Y                      | Y                      | Y                      | N     | N     | N
  softmax       | D     | C/D     | -                           | N          | Y                      | Inherits from discrete | Inherits from discrete | Inherits from discrete | Y     | N     | N
  generic       | C/D   | C/D     | Virtual class               | N          | N                      | N                      | N                      | N                      | N     | N     | N
  Tabular       | D     | D       | -                           | Y          | Inherits from discrete | Inherits from discrete | Inherits from discrete | Inherits from discrete | Y     | Y     | Y
[Figure: four related graphical models, labeled (a)-(d); panel (c) is the mixture of factor analyzers referred to below.]
In a factor analysis model, a low-dimensional hidden Gaussian node X (of size k) generates a D-dimensional observed Gaussian node Y. We can create this model in BNT as follows.

  ns = [k D];
  dag = zeros(2,2);
  dag(1,2) = 1;
  bnet = mk_bnet(dag, ns, 'discrete', []);
  bnet.CPD{1} = gaussian_CPD(bnet, 1, 'mean', zeros(k,1), 'cov', eye(k), ...
      'cov_type', 'diag', 'clamp_mean', 1, 'clamp_cov', 1);
  bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(D,1), 'cov', diag(Psi0), 'weights', W0, ...
      'cov_type', 'diag', 'clamp_mean', 1);

The root node is clamped to the N(0,I) distribution, so that we will not update these parameters during learning. The mean of the leaf node is clamped to 0, since we assume the data has been centered (had its mean subtracted off); this is just for simplicity. Finally, the covariance of the leaf node is constrained to be diagonal. W0 and Psi0 are the initial parameter guesses.
We can fit this model (i.e., estimate its parameters in a maximum likelihood (ML) sense) using EM, as we explain below. Not surprisingly, the ML estimates for mu and Psi turn out to be identical to the sample mean and variance, which can be computed directly as

  mu_ML = mean(data);
  Psi_ML = diag(cov(data));

Note that W can only be identified up to a rotation matrix, because of the spherical symmetry of the source.

If we restrict Psi to be spherical, i.e., Psi = sigma*I, there is a closed-form solution for W as well, i.e., we do not need to use EM. In particular, W contains the first |X| eigenvectors of the sample covariance matrix, with scalings determined by the eigenvalues and sigma. Classical PCA can be obtained by taking the sigma->0 limit.
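As a rough sketch of this closed-form (probabilistic PCA) solution, assuming the data matrix has one case per row and we want a k-dimensional subspace (plain Matlab, not BNT code):

  S = cov(data);                                  % sample covariance
  [V, D] = eig(S);
  [evals, order] = sort(diag(D), 'descend');
  V = V(:, order);
  sigma2 = mean(evals(k+1:end));                  % noise variance = mean of discarded eigenvalues
  W = V(:,1:k) * sqrt(diag(evals(1:k)) - sigma2*eye(k));   % defined up to an arbitrary rotation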
By adding a hidden discrete variable, we can create mixtures of FA models, as shown in (c). Now we can explain the data using a set of subspaces. We can create this model in BNT as follows.

  ns = [M k D];
  dag = zeros(3);
  dag(1,3) = 1;
  dag(2,3) = 1;
  bnet = mk_bnet(dag, ns, 'discrete', 1);
  bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0);
  bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(k, 1), 'cov', eye(k), 'cov_type', 'diag', ...
      'clamp_mean', 1, 'clamp_cov', 1);
  bnet.CPD{3} = gaussian_CPD(bnet, 3, 'mean', Mu0', 'cov', repmat(diag(Psi0), [1 1 M]), ...
      'weights', W0, 'cov_type', 'diag', 'tied_cov', 1);

Notice how the covariance matrix for Y is the same for all values of Q; that is, the noise level in each sub-space is assumed the same. However, we allow the offset, mu, to vary.

I have included Zoubin's specialized MFA code (with his permission) with the toolbox, so you can check that BNT gives the same results: see 'BNT/examples/static/mfa1.m'.
Independent Factor Analysis (IFA) generalizes FA by allowing a non-Gaussian prior on each component of X. (Note that we can approximate a non-Gaussian prior using a mixture of Gaussians.) This means that the likelihood function is no longer rotationally invariant, so we can uniquely identify W and the hidden sources X. IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y). We recover classical Independent Components Analysis (ICA) in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the weight matrix W is square and invertible.
We now turn to the mixture of experts model, shown below.

[Figure: the mixture of experts graph, with observed input X, hidden gating node(s) Q, and output Y.]
X is the observed input, Y is the output, and the Q nodes are hidden "gating" nodes, which select the appropriate set of parameters for Y. During training, Y is assumed observed, but for testing, the goal is to predict Y given X. Note that this is a conditional density model, so we don't associate any parameters with X. Hence X's CPD will be a root CPD, which is a way of modelling exogenous nodes. If the output is a continuous-valued quantity, we assume the "experts" are linear-regression units, and set Y's CPD to linear-Gaussian. If the output is discrete, we set Y's CPD to a softmax function. The Q CPDs will always be softmax functions.

As a concrete example, consider the mixture of experts model where X and Y are scalars, and Q is binary. This is just piecewise linear regression, where we have two line segments, i.e.,
  Pr(y | x) = sum_i Pr(Q=i | x) * N(y; w(i)*x + b(i), sigma(i)^2)

where the gate Pr(Q=i|x) is a softmax in x, and each expert i is a linear-Gaussian regression with its own slope, offset and noise variance.
We can create this model with random parameters as follows. (This code is bundled in BNT/examples/static/mixexp2.m.)

  X = 1;
  Q = 2;
  Y = 3;
  dag = zeros(3,3);
  dag(X,[Q Y]) = 1;
  dag(Q,Y) = 1;
  ns = [1 2 1]; % make X and Y scalars, and have 2 experts
  onodes = [1 3];
  bnet = mk_bnet(dag, ns, 'discrete', 2, 'observed', onodes);

  rand('state', 0);
  randn('state', 0);
  bnet.CPD{1} = root_CPD(bnet, 1);
  bnet.CPD{2} = softmax_CPD(bnet, 2);
  bnet.CPD{3} = gaussian_CPD(bnet, 3);

Now let us fit this model using EM. First we load the data (1000 training cases) and plot them.
  data = load('/examples/static/Misc/mixexp_data.txt', '-ascii');
  plot(data(:,1), data(:,2), '.');
This is what the model looks like before training. (Thanks to Thomas Hofman for writing this plotting routine.)
Now let's train the model, and plot the final performance. (We will discuss how to train models in more detail below.)

  ncases = size(data, 1); % each row of data is a training case
  cases = cell(3, ncases);
  cases([1 3], :) = num2cell(data'); % each column of cases is a training case
  engine = jtree_inf_engine(bnet);
  max_iter = 20;
  [bnet2, LLtrace] = learn_params_em(engine, cases, max_iter);

(We specify which nodes will be observed when we create the engine. Hence BNT knows that the hidden nodes are all discrete. For complex models, this can lead to a significant speedup.) Below we show what the model looks like after 16 iterations of EM (with 100 IRLS iterations per M step), when it converged using the default convergence tolerance (that the fractional change in the log-likelihood be less than 1e-3). Before learning, the log-likelihood was -322.927442; afterwards, it was -13.728778.
Pierpaolo Brutti has written an extensive set of routines for HMEs, which are bundled with BNT: see the examples/static/HME directory. These routines allow you to choose the number of hidden (gating) layers, and the form of the experts (softmax or MLP). See the file hmemenu, which provides a demo. For example, the figure below shows the decision boundaries learned for a ternary classification problem, using a 2 level HME with softmax gates and softmax experts; the training set is on the left, the testing set on the right.

[Figure: decision boundaries learned by the HME on a ternary classification problem; training set on the left, test set on the right.]
The QMR (Quick Medical Reference) model is a bipartite graph, with diseases on top and findings at the bottom; each finding has a noisy-OR CPD. We can make such a model as follows.

  function bnet = mk_qmr_bnet(G, inhibit, leak, prior)
  % MK_QMR_BNET Make a QMR model
  % bnet = mk_qmr_bnet(G, inhibit, leak, prior)
  %
  % G(i,j) = 1 iff there is an arc from disease i to finding j
  % inhibit(i,j) = inhibition probability on i->j arc
  % leak(j) = inhibition prob. on leak->j arc
  % prior(i) = prob. disease i is on

  [Ndiseases Nfindings] = size(inhibit);
  N = Ndiseases + Nfindings;
  finding_node = Ndiseases+1:N;
  ns = 2*ones(1,N);
  dag = zeros(N,N);
  dag(1:Ndiseases, finding_node) = G;
  bnet = mk_bnet(dag, ns, 'observed', finding_node);

  for d=1:Ndiseases
    CPT = [1-prior(d) prior(d)];
    bnet.CPD{d} = tabular_CPD(bnet, d, CPT');
  end

  for i=1:Nfindings
    fnode = finding_node(i);
    ps = parents(G, i);
    bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(ps, i));
  end

In the file BNT/examples/static/qmr1, we create a random bipartite graph G, with 5 diseases and 10 findings, and random parameters. (In general, to create a random dag, use 'mk_random_dag'.) We can visualize the resulting graph structure using the methods discussed below.
Now let us put some random evidence on all the leaves except the very first and very last, and compute the disease posteriors. (Here 'diseases' and 'findings' denote the vectors of disease and finding node indices, as defined in qmr1.)

  pos = 2:floor(Nfindings/2);
  neg = (pos(end)+1):(Nfindings-1);
  onodes = myunion(pos, neg);
  evidence = cell(1, N);
  evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos)));
  evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg)));

  engine = jtree_inf_engine(bnet);
  [engine, ll] = enter_evidence(engine, evidence);
  post = zeros(1, Ndiseases);
  for i=diseases(:)'
    m = marginal_nodes(engine, i);
    post(i) = m.T(2);
  end

Junction tree can be quite slow on large QMR models. Fortunately, it is possible to exploit properties of the noisy-OR function to speed up exact inference using an algorithm called quickscore, discussed below.
We now give an example of a CG model, from the paper "Propagation of Probabilities, Means and Variances in Mixed Graphical Association Models", Steffen Lauritzen, JASA 87(420):1098--1108, 1992 (reprinted in the book "Probabilistic Networks and Expert Systems", R. G. Cowell, A. P. Dawid, S. L. Lauritzen and D. J. Spiegelhalter, Springer, 1999.)
We can create this model as follows.

  F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9;
  n = 9;

  dag = zeros(n);
  dag(F,E)=1;
  dag(W,[E Min D]) = 1;
  dag(E,D)=1;
  dag(B,[C D])=1;
  dag(D,[L Mout])=1;
  dag(Min,Mout)=1;

  % node sizes - all cts nodes are scalar, all discrete nodes are binary
  ns = ones(1, n);
  dnodes = [F W B];
  cnodes = mysetdiff(1:n, dnodes);
  ns(dnodes) = 2;

  bnet = mk_bnet(dag, ns, 'discrete', dnodes);

'dnodes' is a list of the discrete nodes; 'cnodes' is the continuous nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'.

The parameters of the discrete nodes can be specified as follows.

  bnet.CPD{B} = tabular_CPD(bnet, B, 'CPT', [0.85 0.15]); % 1=stable, 2=unstable
  bnet.CPD{F} = tabular_CPD(bnet, F, 'CPT', [0.95 0.05]); % 1=intact, 2=defect
  bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [2/7 5/7]);   % 1=industrial, 2=household
The parameters of the continuous nodes can be specified as follows.

  bnet.CPD{E} = gaussian_CPD(bnet, E, 'mean', [-3.9 -0.4 -3.2 -0.5], ...
      'cov', [0.00002 0.0001 0.00002 0.0001]);
  bnet.CPD{D} = gaussian_CPD(bnet, D, 'mean', [6.5 6.0 7.5 7.0], ...
      'cov', [0.03 0.04 0.1 0.1], 'weights', [1 1 1 1]);
  bnet.CPD{C} = gaussian_CPD(bnet, C, 'mean', [-2 -1], 'cov', [0.1 0.3]);
  bnet.CPD{L} = gaussian_CPD(bnet, L, 'mean', 3, 'cov', 0.25, 'weights', -0.5);
  bnet.CPD{Min} = gaussian_CPD(bnet, Min, 'mean', [0.5 -0.5], 'cov', [0.01 0.005]);
  bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 'mean', 0, 'cov', 0.002, 'weights', [1 1]);
First we compute the unconditional marginals (no evidence).

  engine = jtree_inf_engine(bnet);
  evidence = cell(1,n);
  [engine, ll] = enter_evidence(engine, evidence);
  marg = marginal_nodes(engine, E);

'marg' is a structure that contains the fields 'mu' and 'Sigma', which contain the mean and (co)variance of the marginal on E. In this case, they are both scalars. Let us check they match the published figures (to 2 decimal places).

  tol = 1e-2;
  assert(approxeq(marg.mu, -3.25, tol));
  assert(approxeq(sqrt(marg.Sigma), 0.709, tol));

We can compute the other posteriors similarly. Now let us add some evidence.

  evidence = cell(1,n);
  evidence{W} = 1; % industrial
  evidence{L} = 1.1;
  evidence{C} = -0.9;
  [engine, ll] = enter_evidence(engine, evidence);

Now we find

  marg = marginal_nodes(engine, E);
  assert(approxeq(marg.mu, -3.8983, tol));
  assert(approxeq(sqrt(marg.Sigma), 0.0763, tol));

We can also compute the joint probability on a set of nodes. For example, P(D, Mout | evidence) is a 2D Gaussian:
  marg = marginal_nodes(engine, [D Mout])
  marg =
      domain: [6 8]
          mu: [2x1 double]
       Sigma: [2x2 double]
           T: 1.0000

The mean is

  marg.mu
  ans =
      3.6077
      4.1077

and the covariance matrix is

  marg.Sigma
  ans =
      0.1062    0.1062
      0.1062    0.1182

It is easy to visualize this posterior using standard Matlab plotting functions, e.g.,

  gaussplot2d(marg.mu, marg.Sigma);
The T field indicates that the mixing weight of this Gaussian component is 1.0. If the joint contains discrete and continuous variables, the result will be a mixture of Gaussians, e.g.,

  marg = marginal_nodes(engine, [F E])
      domain: [1 3]
          mu: [-3.9000 -0.4003]
       Sigma: [1x1x2 double]
           T: [0.9995 4.7373e-04]

The interpretation is Sigma(i,j,k) = Cov[ E(i) E(j) | F=k ]. In this case, E is a scalar, so i=j=1; k specifies the mixture component.

We saw in the sprinkler network that BNT sets the effective size of observed discrete nodes to 1, since they only have one legal value. For continuous nodes, BNT sets their length to 0, since they have been reduced to a point. For example,

  marg = marginal_nodes(engine, [B C])
      domain: [4 5]
          mu: []
       Sigma: []
           T: [0.0123 0.9877]

It is simple to post-process the output of marginal_nodes. For example, the file BNT/examples/static/cg1 sets the mu term of observed nodes to their observed value, and the Sigma term to 0 (since observed nodes have no variance).
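A minimal sketch of that kind of post-processing, for the observed continuous node C above (not the exact code in cg1):

  marg = marginal_nodes(engine, C);
  if isempty(marg.mu)          % C was observed, so BNT reduced it to a point
    marg.mu = evidence{C};     % the observed value
    marg.Sigma = 0;            % observed nodes have no variance
  end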
Note that the implemented version of the junction tree is numerically unstable when using CG potentials (which is why, in the example above, we only required our answers to agree with the published ones to 2dp.) This is why you might want to use stab_cond_gauss_inf_engine, implemented by Shan Huang.
The parameter learning routines in BNT can be classified into four types, depending on whether we seek a point estimate or a Bayesian posterior, and whether the data is fully or partially observed:

        | Full obs            | Partial obs
  ------|---------------------|------------------
  Point | learn_params        | learn_params_em
  Bayes | bayes_update_params | not yet supported
To load numeric data from an ASCII text file called 'dat.txt', where each row is a case and the columns are separated by white space, such as

  011979 1626.5 0.0
  021979 1367.0 0.0
  ...

you can use

  data = load('dat.txt');

or

  load dat.txt -ascii

In the latter case, the data is stored in a variable called 'dat' (the filename minus the extension). Alternatively, suppose the data is stored in a .csv file (has commas separating the columns, and contains a header line), such as

  header info goes here
  ORD,011979,1626.5,0.0
  DSM,021979,1367.0,0.0
  ...

You can load this using

  [a,b,c,d] = textread('dat.txt', '%s %d %f %f', 'delimiter', ',', 'headerlines', 1);

If your file is not in either of these formats, you can either use Perl to convert it to this format, or use the Matlab scanf command. Type

  help iofun

for more information on Matlab's file functions.
BNT learning routines require data to be stored in a cell array. data{i,m} is the value of node i in case (example) m, i.e., each column is a case. If node i is not observed in case m (missing value), set data{i,m} = []. (Not all the learning routines can cope with such missing values, however.) In the special case that all the nodes are observed and are scalar-valued (as opposed to vector-valued), the data can be stored in a matrix (as opposed to a cell-array).

Suppose, as in the mixture of experts example, that we have 3 nodes in the graph: X(1) is the observed input, X(3) is the observed output, and X(2) is a hidden (gating) node. We can create the dataset as follows.

  data = load('dat.txt');
  ncases = size(data, 1);
  cases = cell(3, ncases);
  cases([1 3], :) = num2cell(data');

Notice how we transposed the data, to convert rows into columns. Also, cases{2,m} = [] for all m, since X(2) is always hidden.
As an example, let us generate some data from the sprinkler network, and then try to recover the original parameters.

  samples = cell(N, nsamples);
  for i=1:nsamples
    samples(:,i) = sample_bnet(bnet);
  end

samples{j,i} contains the value of the j'th node in case i. sample_bnet returns a cell array because, in general, each node might be a vector of different length. In this case, all nodes are discrete (and hence scalars), so we could have used a regular array instead (which can be quicker):

  data = cell2num(samples);

Now we create a network with random parameters. (The initial values of bnet2 don't matter in this case, since we can find the globally optimal MLE independent of where we start.)

  % Make a tabula rasa
  bnet2 = mk_bnet(dag, node_sizes);
  seed = 0;
  rand('state', seed);
  bnet2.CPD{C} = tabular_CPD(bnet2, C);
  bnet2.CPD{R} = tabular_CPD(bnet2, R);
  bnet2.CPD{S} = tabular_CPD(bnet2, S);
  bnet2.CPD{W} = tabular_CPD(bnet2, W);

Finally, we find the maximum likelihood estimates of the parameters.

  bnet3 = learn_params(bnet2, samples);

To view the learned parameters, we use a little Matlab hackery.

  CPT3 = cell(1,N);
  for i=1:N
    s=struct(bnet3.CPD{i}); % violate object privacy
    CPT3{i}=s.CPT;
  end

Here are the parameters learned for node 4.

  dispcpt(CPT3{4})
  1 1 : 1.0000 0.0000
  2 1 : 0.2000 0.8000
  1 2 : 0.2273 0.7727
  2 2 : 0.0000 1.0000

So we see that the learned parameters are fairly close to the "true" ones, which we display below.

  dispcpt(CPT{4})
  1 1 : 1.0000 0.0000
  2 1 : 0.1000 0.9000
  1 2 : 0.1000 0.9000
  2 2 : 0.0100 0.9900

We can get better results by using a larger training set, or using informative priors (see below).
For tabular CPDs, we can specify a Dirichlet prior on the parameters. The Dirichlet has a simple interpretation in terms of pseudo counts. If we let N_ijk = the num. times X_i=k and Pa_i=j occurs in the training set, where Pa_i are the parents of X_i, then the maximum likelihood (ML) estimate is T_ijk = N_ijk / N_ij (where N_ij = sum_k' N_ijk'), which will be 0 if N_ijk=0. To prevent us from declaring that (X_i=k, Pa_i=j) is impossible just because this event was not seen in the training set, we can pretend we saw value k of X_i, for each value j of Pa_i, some number (alpha_ijk) of times in the past. The MAP (maximum a posteriori) estimate is then

  T_ijk = (N_ijk + alpha_ijk) / (N_ij + alpha_ij)

and is never 0 if all alpha_ijk > 0.
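As a small numerical illustration of this formula (with made-up counts for one row of a CPT over a 3-valued child):

  N_ijk = [0 4 6];                                       % counts of X_i=k for a fixed parent config j
  alpha = [1 1 1];                                       % Dirichlet pseudo counts
  T_ML  = N_ijk / sum(N_ijk)                             % ML estimate: [0 0.4 0.6], first entry is 0
  T_MAP = (N_ijk + alpha) / (sum(N_ijk) + sum(alpha))    % MAP estimate: [1 5 7]/13, never 0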
For example, consider the network A->B, where A is binary and B has 3 values. A uniform prior for B has the form

        B=1  B=2  B=3
  A=1    1    1    1
  A=2    1    1    1

which can be created using

  tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'unif');

This prior does not satisfy the likelihood equivalence principle, which says that Markov equivalent models should have the same marginal likelihood. A prior that does satisfy this principle is shown below. Heckerman (1995) calls this the BDeu prior (likelihood equivalent uniform Bayesian Dirichlet).
        B=1  B=2  B=3
  A=1   1/6  1/6  1/6
  A=2   1/6  1/6  1/6

where we put N/(q*r) in each bin; N is the equivalent sample size, q = |A| is the number of parent configurations, and r = |B| is the number of values of the child. This can be created as follows

  tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu');

Here, 1 is the equivalent sample size, and is the strength of the prior. You can change this using

  tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', ...
      'BDeu', 'dirichlet_weight', 10);
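To make the N/(q*r) rule concrete, here is the pseudo-count table for the A->B example above, computed in a couple of lines (a sketch, not BNT code):

  N = 1;  q = 2;  r = 3;               % equivalent sample size, |A| parent configs, |B| values
  alpha = (N/(q*r)) * ones(q, r)       % every entry is 1/6, as in the BDeu table above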
We can update the parameter posterior from fully observed data as follows.

  cases = sample_bnet(bnet, nsamples);
  bnet = bayes_update_params(bnet, cases);
  LL = log_marg_lik_complete(bnet, cases);

bnet.CPD{i}.prior contains the new Dirichlet pseudocounts, and bnet.CPD{i}.CPT is set to the mean of the posterior (the normalized counts). (Hence if the initial pseudo counts are 0, bayes_update_params and learn_params will give the same result.)

We can compute the same result sequentially (on-line) as follows.

  LL = 0;
  for m=1:nsamples
    LL = LL + log_marg_lik_complete(bnet, cases(:,m));
    bnet = bayes_update_params(bnet, cases(:,m));
  end

The file BNT/examples/static/StructLearn/model_select1 has an example of sequential model selection which uses the same idea. We generate data from the model A->B and compute the posterior prob of all 3 dags on 2 nodes: (1) A B, (2) A <- B, (3) A -> B. Models 2 and 3 are Markov equivalent, and therefore indistinguishable from observational data alone, so we expect their posteriors to be the same (assuming a prior which satisfies likelihood equivalence). If we use random parameters, the "true" model only gets a higher posterior after 2000 trials! However, if we make B a noisy NOT gate, the true model "wins" after 12 trials, as shown below (red = model 1, blue/green (superimposed) represents models 2/3).
The use of marginal likelihood for model selection is discussed in greater detail in the section on structure learning.
To simulate partial observability, let us hide half of the node values at random.

  samples2 = samples;
  hide = rand(N, nsamples) > 0.5;
  [I,J]=find(hide);
  for k=1:length(I)
    samples2{I(k), J(k)} = [];
  end

samples2{i,l} is the value of node i in training case l, or [] if unobserved.

Now we will compute the MLEs using the EM algorithm. We need to use an inference algorithm to compute the expected sufficient statistics in the E step; the M (maximization) step is as above.

  engine2 = jtree_inf_engine(bnet2);
  max_iter = 10;
  [bnet4, LLtrace] = learn_params_em(engine2, samples2, max_iter);

LLtrace(i) is the log-likelihood at iteration i. We can plot this as follows:

  plot(LLtrace, 'x-')

Let's display the results after 10 iterations of EM (CPT4 is extracted from bnet4 in the same way we extracted CPT3 from bnet3 above).

  celldisp(CPT4)
  CPT4{1} =
      0.6616
      0.3384
  CPT4{2} =
      0.6510    0.3490
      0.8751    0.1249
  CPT4{3} =
      0.8366    0.1634
      0.0197    0.9803
  CPT4{4} =
  (:,:,1) =
      0.8276    0.0546
      0.5452    0.1658
  (:,:,2) =
      0.1724    0.9454
      0.4548    0.8342

We can get improved performance by using a larger training set, running EM for more iterations, or using informative priors (see above).

For a more complete example of learning with EM, see the script BNT/examples/static/learn1.m.
When we have tied parameters, there is no longer a one-to-one correspondence between nodes and CPDs. Rather, each CPD specifies the parameters for a whole equivalence class of nodes. It is easiest to see this by example. Consider the following hidden Markov model (HMM).
When HMMs are used for semi-infinite processes like speech recognition, we assume the transition matrix P(H(t+1)|H(t)) is the same for all t; this is called a time-invariant or homogeneous Markov chain. Hence hidden nodes 2, 3, ..., T are all in the same equivalence class, say class Hclass. Similarly, the observation matrix P(O(t)|H(t)) is assumed to be the same for all t, so the observed nodes are all in the same equivalence class, say class Oclass. Finally, the prior term P(H(1)) is in a class all by itself, say class H1class. This is illustrated below, where we explicitly represent the parameters as random variables (dotted nodes).
In BNT, we cannot represent parameters as random variables (nodes). Instead, we "hide" the parameters inside one CPD for each equivalence class, and then specify that the other CPDs should share these parameters, as follows.

  hnodes = 1:2:2*T;
  onodes = 2:2:2*T;
  H1class = 1; Hclass = 2; Oclass = 3;
  eclass = ones(1,N);
  eclass(hnodes(2:end)) = Hclass;
  eclass(hnodes(1)) = H1class;
  eclass(onodes) = Oclass;
  % create dag and ns in the usual way
  bnet = mk_bnet(dag, ns, 'discrete', dnodes, 'equiv_class', eclass);

Finally, we define the parameters for each equivalence class:

  bnet.CPD{H1class} = tabular_CPD(bnet, hnodes(1)); % prior
  bnet.CPD{Hclass} = tabular_CPD(bnet, hnodes(2)); % transition matrix
  if cts_obs
    bnet.CPD{Oclass} = gaussian_CPD(bnet, onodes(1));
  else
    bnet.CPD{Oclass} = tabular_CPD(bnet, onodes(1));
  end

In general, if bnet.CPD{e} = xxx_CPD(bnet, j), then j should be a member of e's equivalence class; that is, it is not always the case that e == j. You can use bnet.rep_of_eclass(e) to return the representative of equivalence class e. BNT will look up the parents of j to determine the size of the CPT to use. It assumes that this is the same for all members of the equivalence class. Click here for a more complex example of parameter tying.
Note: Normally one would define an HMM as a Dynamic Bayes Net (see the function BNT/examples/dynamic/mk_chmm.m). However, one can define an HMM as a static BN using the function BNT/examples/static/Models/mk_hmm_bnet.m.
There are two very different approaches to structure learning: constraint-based and search-and-score. In the constraint-based approach, we start with a fully connected graph, and remove edges if certain conditional independencies are measured in the data. This has the disadvantage that repeated independence tests lose statistical power.

In the more popular search-and-score approach, we perform a search through the space of possible DAGs, and either return the best one found (a point estimate), or return a sample of the models found (an approximation to the Bayesian posterior).

The number of DAGs as a function of the number of nodes, G(n), is super-exponential in n, and is given by the following recurrence
  G(n) = sum_{k=1}^{n} (-1)^(k+1) * C(n,k) * 2^(k(n-k)) * G(n-k),   with G(0) = 1

where C(n,k) is the binomial coefficient. The first few values are shown below.
n | G(n) |
---|---|
1 | 1 |
2 | 3 |
3 | 25 |
4 | 543 |
5 | 29,281 |
6 | 3,781,503 |
7 | 1.1 x 10^9 |
8 | 7.8 x 10^11 |
9 | 1.2 x 10^15 |
10 | 4.2 x 10^18 |
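These values can be reproduced from the recurrence above with a few lines of Matlab (a standalone sketch, not part of BNT):

  G = zeros(1, 11);                % G(i+1) will hold the number of DAGs on i nodes
  G(1) = 1;                        % G(0) = 1 by convention
  for n = 1:10
    s = 0;
    for k = 1:n
      s = s + (-1)^(k+1) * nchoosek(n,k) * 2^(k*(n-k)) * G(n-k+1);
    end
    G(n+1) = s;
  end
  G(2:11)                          % 1, 3, 25, 543, 29281, ...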
If we know a total ordering on the nodes, finding the best structure amounts to picking the best set of parents for each node independently. This is what the K2 algorithm does. If the ordering is unknown, we can search over orderings, which is more efficient than searching over DAGs (Koller and Friedman, 2000).

In addition to the search procedure, we must specify the scoring function. There are two popular choices. The Bayesian score integrates out the parameters, i.e., it is the marginal likelihood of the model. The BIC (Bayesian Information Criterion) is defined as log P(D|theta_hat) - 0.5*d*log(N), where D is the data, theta_hat is the ML estimate of the parameters, d is the number of parameters, and N is the number of data cases. The BIC method has the advantage of not requiring a prior.

BIC can be derived as a large sample approximation to the marginal likelihood. (It is also equal to the Minimum Description Length of a model.) However, in practice, the sample size does not need to be very large for the approximation to be good. For example, in the figure below, we plot the ratio between the log marginal likelihood and the BIC score against data-set size; we see that the ratio rapidly approaches 1, especially for non-informative priors. (This plot was generated by the file BNT/examples/static/bic1.m. It uses the water sprinkler BN with BDeu Dirichlet priors with different equivalent sample sizes.)
As with parameter learning, handling missing data / hidden variables is much harder than the fully observed case. The structure learning routines in BNT can therefore be classified into 4 types, analogously to the parameter learning case.

        | Full obs          | Partial obs
  ------|-------------------|------------------
  Point | learn_struct_K2   | not yet supported
  Bayes | learn_struct_mcmc | not yet supported
When learning graph structure from observational data, the best one can hope to do is to identify the model up to Markov equivalence. To distinguish amongst graphs within the same equivalence class, one needs interventional data: see the discussion on active learning below.
The brute-force approach to structure learning is to enumerate all possible DAGs and score each one:

dags = mk_all_dags(N);
score = score_dags(data, ns, dags);

where data(i,m) is the value of node i in case m, and ns(i) is the size of node i.
If the DAGs have a lot of families in common, we can cache the sufficient statistics, making this
potentially more efficient than scoring the DAGs one at a time.
(Caching is not currently implemented, however.)
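To extract a point estimate from the exhaustive scores, one can simply take the highest-scoring
DAG. This assumes, as in the snippet above, that mk_all_dags returns a cell array of adjacency
matrices and score_dags returns one score per DAG.

[best_score, best_index] = max(score);
best_dag = dags{best_index};   % the MAP structure under a uniform structural prior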
By default, we use the Bayesian scoring metric, and assume CPDs are represented by tables with
BDeu(1) priors. We can override these defaults as follows. If we want to use uniform priors, we can say

params = cell(1,N);
for i=1:N
  params{i} = {'prior', 'unif'};
end
score = score_dags(data, ns, dags, 'params', params);

params{i} is a cell array containing optional arguments that are passed to the constructor for CPD i.
Now suppose we want to use different node types, e.g., suppose nodes 1 and 2 are Gaussian, and
nodes 3 and 4 softmax (both these CPDs can support discrete and continuous parents, which is
necessary since all other nodes will be considered as parents).
The Bayesian scoring metric currently only works for tabular CPDs, so we will use BIC:

score = score_dags(data, ns, dags, 'discrete', [3 4], 'params', [], ...
    'type', {'gaussian', 'gaussian', 'softmax', 'softmax'}, 'scoring_fn', 'bic')

In practice, one can't enumerate all possible DAGs for N > 5, but one can evaluate any
reasonably-sized set of hypotheses in this way (e.g., nearest neighbors of your current best guess).
Think of this as "computer assisted model refinement" as opposed to de novo learning.
The original K2 paper (Cooper and Herskovits, 1992) used the Bayesian scoring metric with tabular
CPDs and Dirichlet priors. BNT generalizes this to allow any kind of CPD, and either the Bayesian
scoring metric or BIC, as in the example above. In addition, you can specify an optional upper
bound on the number of parents for each node.
The file BNT/examples/static/k2demo1.m gives an example of how to use K2.
We use the water sprinkler network and sample 100 cases from it as before.
Then we see how much data it takes to recover the generating structure:
order = [C S R W];
max_fan_in = 2;
sz = 5:5:100;
for i=1:length(sz)
  dag2 = learn_struct_K2(data(:,1:sz(i)), node_sizes, order, 'max_fan_in', max_fan_in);
  correct(i) = isequal(dag, dag2);
end

Here are the results.

correct =
  Columns 1 through 12
     0     0     0     0     0     0     0     1     0     1     1     1
  Columns 13 through 20
     1     1     1     1     1     1     1     1

So we see it takes about sz(10)=50 cases. (BIC behaves similarly, showing that the prior doesn't
matter too much.) In general, we cannot hope to recover the "true" generating structure, only one
that is in its Markov equivalence class.
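To visualize how recovery improves with sample size, one could add a line like the following to
k2demo1.m (a hypothetical addition, not in the shipped example; sz and correct are as above):

plot(sz, correct, 'o-');
xlabel('number of training cases');
ylabel('recovered the generating DAG (1 = yes)');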
Hill-climbing searches the space of DAGs by repeatedly moving to the highest-scoring neighbor of
the current graph, stopping when no neighbor scores higher.
A common definition of "neighbor" is all graphs that can be generated from the current graph by
adding, deleting or reversing a single arc, subject to the acyclicity constraint.
Other neighborhoods are possible: see Optimal Structure Identification with Greedy Search,
Max Chickering, JMLR 2002. A sketch of this procedure is given below.
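The following is a minimal hill-climbing sketch, not a built-in BNT routine (the function name
hill_climb_sketch is hypothetical). It assumes only the score_dags and acyclic functions used
elsewhere in this document, and generates the add/delete/reverse neighborhood by hand. It is meant
to illustrate the neighborhood definition, not to be efficient: it re-scores whole DAGs rather
than just the changed families.

function dag = hill_climb_sketch(data, ns, dag)
% HILL_CLIMB_SKETCH Greedy structure search over the add/delete/reverse neighborhood.
% (Hypothetical helper, not part of BNT.)
N = length(ns);
best = score_dags(data, ns, {dag});
improved = true;
while improved
  improved = false;
  for i=1:N
    for j=1:N
      if i==j, continue; end
      nbrs = {};
      cand = dag;
      if dag(i,j)                 % existing arc i->j: try deleting it ...
        cand(i,j) = 0;
        nbrs{end+1} = cand;
        cand(j,i) = 1;            % ... or reversing it
        nbrs{end+1} = cand;
      else                        % no arc i->j: try adding one
        cand(i,j) = 1;
        nbrs{end+1} = cand;
      end
      for c=1:length(nbrs)
        if ~acyclic(nbrs{c}), continue; end   % reject candidates that create a cycle
        sc = score_dags(data, ns, nbrs(c));
        if sc > best
          best = sc; dag = nbrs{c}; improved = true;
        end
      end
    end
  end
end

For example, dag = hill_climb_sketch(data, ns, zeros(N,N)) starts the search from the empty graph.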
The MCMC structure-search routine, learn_struct_mcmc, can be called as in the following example.
[sampled_graphs, accept_ratio] = learn_struct_mcmc(data, ns, 'nsamples', 100, 'burnin', 10);

We can convert our set of sampled graphs to a histogram (empirical posterior over all the DAGs) thus:

all_dags = mk_all_dags(N);
mcmc_post = mcmc_sample_to_hist(sampled_graphs, all_dags);

To see how well this performs, let us compute the exact posterior exhaustively.
score = score_dags(data, ns, all_dags);
post = normalise(exp(score)); % assuming uniform structural prior
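If the log scores are large in magnitude, exp(score) can underflow to zero before normalization.
A standard trick (a suggested modification, not what the example file does) is to subtract the
maximum log score first, which leaves the normalized posterior unchanged:

post = normalise(exp(score - max(score))); % numerically safer, same result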
We plot the results below. (The data set was 100 samples drawn from a random 4 node bnet; see the
file BNT/examples/static/mcmc1.)

subplot(2,1,1)
bar(post)
subplot(2,1,2)
bar(mcmc_post)

We can also plot the acceptance ratio versus number of MCMC steps, as a crude convergence diagnostic.

clf
plot(accept_ratio)
Even though the number of samples needed by MCMC is theoretically polynomial (not exponential) in
the dimensionality of the search space, in practice it has been found that MCMC does not converge
in reasonable time for graphs with more than about 10 nodes.
Most of the scoring functions accept an optional argument that specifies whether a node was
observed to have a certain value, or was forced to have that value: we set clamped(i,m)=1 if node
i was forced in training case m. See, e.g., the file BNT/examples/static/cooper_yoo.
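For concreteness, here is how one might build such a matrix for the sprinkler network, supposing
(hypothetically) that in training case 3 the Sprinkler node was switched on by intervention. How
the matrix is then passed to a particular scoring function is described in that function's help
string and in the cooper_yoo example.

ncases = size(data, 2);       % data(i,m) as above
clamped = zeros(N, ncases);   % clamped(i,m)=1 means node i was set by force in case m
clamped(S, 3) = 1;            % e.g., the sprinkler was switched on in case 3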
An interesting question is how to decide which interventions to perform (cf. the design of
experiments literature). For details, see the following tech report.
Wei Hu has implemented structural EM (SEM) for discrete nodes. You can download his package from
here. Please address all questions about this code to wei.hu@intel.com.
See also Phl's implementation of SEM.
IC*/FCI extend IC/PC to handle latent variables: see below.
(IC stands for inductive causation; PC stands for Peter and Clark, the first names of Spirtes and
Glymour; FCI stands for fast causal inference. What we, following Pearl (2000), call IC* was
called IC in the original Pearl and Verma paper.)
For details, see Pearl (2000) and Spirtes, Glymour and Scheines (1993).
The PC algorithm takes as arguments a function f, the number of nodes N, the maximum fan-in K, and
additional arguments A which are passed to f. The function f(X,Y,S,A) returns 1 if X is
conditionally independent of Y given S, and 0 otherwise.
For example, suppose we cheat by passing in a CI "oracle" which has access to the true DAG; the
oracle tests for d-separation in this DAG, i.e., f(X,Y,S) calls dsep(X,Y,S,dag).
We can do this as follows.
pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag);

pdag(i,j) = -1 if there is definitely an i->j arc, and pdag(i,j) = 1 if there is either an i->j or
an i<-j arc.
Applied to the sprinkler network, this returns

pdag =
     0     1     1     0
     1     0     0    -1
     1     0     0    -1
     0     0     0     0

So as expected, we see that the V-structure at the W node is uniquely identified, but the other
arcs have ambiguous orientation.
We now give an example from p141 (1st edn) / p103 (2nd edn) of the SGS book.
This example concerns the female orgasm.
We are given a correlation matrix C between 7 measured factors (such as subjective experiences of
coital and masturbatory experiences), derived from 281 samples, and want to learn a causal model
of the data. We will not discuss the merits of this type of work here, but merely show how to
reproduce the results in the SGS book.
Their program, Tetrad, makes use of the Fisher Z-test for conditional independence, so we do the same:
n = 7; % number of measured variables
max_fan_in = 4;
nsamples = 281;
alpha = 0.05;
pdag = learn_struct_pdag_pc('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha);

In this case, the CI test is

f(X,Y,S) = cond_indep_fisher_z(X,Y,S, C, nsamples, alpha)

The results match those of Fig 12a of SGS apart from two edge differences; presumably this is due
to rounding error (although it could be a bug, either in BNT or in Tetrad).
This example can be found in the file BNT/examples/static/pc2.m.
The IC* algorithm (Pearl and Verma, 1991), and the faster FCI algorithm (Spirtes, Glymour, and
Scheines 1993), are like the IC/PC algorithm, except that they can detect the presence of latent
variables. See the file learn_struct_pdag_ic_star written by Tamar Kushnir.
The output is a matrix P, defined as follows (see Pearl (2000), p52 for details):

% P(i,j) = -1 if there is either a latent variable L such that i <-L-> j, OR there is a directed edge from i->j.
% P(i,j) = -2 if there is a marked directed i-*>j edge.
% P(i,j) = P(j,i) = 1 if there is an undirected edge i--j.
% P(i,j) = P(j,i) = 2 if there is a latent variable L such that i<-L->j.
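For example, one might call it with the same Fisher-Z test as above. This assumes
learn_struct_pdag_ic_star takes the same (f, n, max_fan_in, ...) argument pattern as
learn_struct_pdag_pc; check its help string before relying on this.

P = learn_struct_pdag_ic_star('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha);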
An inference engine is an object that contains a bnet and supports the 'enter_evidence' and
'marginal_nodes' methods. The engine constructor takes the bnet as argument and may do some
model-specific processing. When 'enter_evidence' is called, the engine may do some
evidence-specific processing. Finally, when 'marginal_nodes' is called, the engine may do some
query-specific processing.
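For instance, a typical use of the junction tree engine (jtree_inf_engine, listed in the summary
table below) on the sprinkler network defined earlier looks like this. Recall that in BNT discrete
values are numbered from 1, so for binary nodes false=1 and true=2.

engine = jtree_inf_engine(bnet);            % model-specific processing
evidence = cell(1,N);
evidence{W} = 2;                            % observe WetGrass = true
engine = enter_evidence(engine, evidence);  % evidence-specific processing
m = marginal_nodes(engine, S);              % query-specific processing
m.T                                         % posterior P(Sprinkler | WetGrass = true)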
The amount of work done when each stage is specified -- structure, parameters, evidence, and
query -- depends on the engine. The cost of work done early in this sequence can be amortized.
On the other hand, one can make better optimizations if one waits until later in the sequence.
For example, the parameters might imply conditional independencies that are not evident in the
graph structure, but can nevertheless be exploited; the evidence indicates which nodes are
observed and hence can effectively be disconnected from the graph; and the query might indicate
that large parts of the network are d-separated from the query nodes.
(Since it is not the actual values of the evidence that matter, just which nodes are observed,
many engines allow you to specify which nodes will be observed when they are constructed, i.e.,
before calling 'enter_evidence'. Some engines can still cope if the actual pattern of evidence is
different, e.g., if there is missing data.)
Although being maximally lazy (i.e., only doing work when a query is issued) may seem desirable,
this is not always the most efficient approach. For example, when learning using EM, we need to
call marginal_nodes N times, where N is the number of nodes. Variable elimination would end up
repeating a lot of work each time marginal_nodes is called, making it inefficient for learning.
The junction tree algorithm, by contrast, uses dynamic programming to avoid this redundant
computation --- it calculates all marginals in two passes during 'enter_evidence', so calling
'marginal_nodes' takes constant time.
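The EM use case mentioned above amounts to a loop of this shape (a sketch, reusing the evidence
cell array from the junction tree example), which is cheap with jtree but repeats work with
variable elimination:

engine = enter_evidence(engine, evidence);
for i=1:N
  m{i} = marginal_nodes(engine, i);   % one query per node
end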
We will discuss some of the inference algorithms implemented in BNT below, and finish with a
summary of all of them.
The principle of distributing sums over products can be generalized greatly to apply to any
commutative semiring. This forms the basis of many common algorithms, such as Viterbi decoding and
the Fast Fourier Transform. For details, see the literature on the generalized distributive law.
Choosing an order in which to sum out the variables so as to minimize computational cost is known
to be NP-hard. The implementation of this algorithm in var_elim_inf_engine makes no attempt to
optimize this ordering (in contrast, say, to jtree_inf_engine, which uses a greedy search
procedure to find a good ordering).

Note: unlike most algorithms, var_elim does all its computational work inside of marginal_nodes,
not inside of enter_evidence.
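Usage follows the same engine protocol as before; only the constructor changes.

engine = var_elim_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, W);   % the summing-out happens here, not in enter_evidence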
Three specialized versions of this global inference algorithm (which works by computing the joint
distribution over all the nodes and then marginalizing it) have also been implemented,
corresponding to the cases where all the nodes are discrete (D), all are Gaussian (G), and some
are discrete and some Gaussian (CG). They are called enumerative_inf_engine,
gaussian_inf_engine, and cond_gauss_inf_engine respectively.

Note: unlike most algorithms, these global inference algorithms do all their computational work
inside of marginal_nodes, not inside of enter_evidence.
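For the all-discrete sprinkler network, for example, the enumerative version can serve as a slow
but simple reference implementation, again with the same calling pattern:

engine = enumerative_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, S);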
A much more significant speedup is obtained by exploiting special properties of the noisy-or node,
as done by the quickscore algorithm. The corresponding engine is used as follows:

engine = quickscore_inf_engine(inhibit, leak, prior);
engine = enter_evidence(engine, pos, neg);
m = marginal_nodes(engine, i);
Loopy belief propagation (Pearl's algorithm applied to a graph with cycles) is invoked as follows:

engine = pearl_inf_engine(bnet, 'max_iter', 30);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);

We found that this algorithm often converges, and when it does, it is often very accurate, but
this depends on the precise setting of the parameter values of the network.
(See the file BNT/examples/static/qmr1 to repeat the experiment for yourself.)
Understanding when and why belief propagation converges/works is a topic of ongoing research.
pearl_inf_engine can exploit special structure in noisy-or and gmux nodes to compute messages
efficiently.

belprop_inf_engine is like pearl, but uses potentials to represent messages. Hence this is slower.

belprop_fg_inf_engine is like belprop, but is designed for factor graphs.
In terms of topology, most engines handle any kind of DAG. belprop_fg does approximate inference
on factor graphs (FG), which can be used to represent directed, undirected, and mixed (chain)
graphs. (In the future, we plan to support exact inference on chain graphs.) quickscore only works
on QMR-like models.

In terms of node types: algorithms that use potentials can handle discrete (D), Gaussian (G) or
conditional Gaussian (CG) models. Sampling algorithms can essentially handle any kind of node
(distribution). Other algorithms make more restrictive assumptions in exchange for speed.

Finally, most algorithms are designed to give the exact answer. The belief propagation algorithms
are exact if applied to trees, and in some other cases. Sampling is considered approximate, even
though, in the limit of an infinite number of samples, it gives the exact answer.
Here is a summary of the properties of all the engines in BNT which work on static networks.
Name                 | Exact? | Node type | Topology |
---|---|---|---|
belprop              | approx | D         | DAG |
belprop_fg           | approx | D         | factor graph |
cond_gauss           | exact  | CG        | DAG |
enumerative          | exact  | D         | DAG |
gaussian             | exact  | G         | DAG |
gibbs                | approx | D         | DAG |
global_joint         | exact  | D,G,CG    | DAG |
jtree                | exact  | D,G,CG    | DAG |
likelihood_weighting | approx | any       | DAG |
pearl                | approx | D,G       | DAG |
pearl                | exact  | D,G       | polytree |
quickscore           | exact  | noisy-or  | QMR |
stab_cond_gauss      | exact  | CG        | DAG |
var_elim             | exact  | D,G,CG    | DAG |
See the examples in BNT/examples/limids (limited memory influence diagrams) for details.