How to use the Bayes Net Toolbox

wolffd@0: To specify this directed acyclic graph (dag), we create an adjacency matrix: wolffd@0:

wolffd@0: N = 4; 
wolffd@0: dag = zeros(N,N);
wolffd@0: C = 1; S = 2; R = 3; W = 4;
wolffd@0: dag(C,[R S]) = 1;
wolffd@0: dag(R,W) = 1;
wolffd@0: dag(S,W)=1;
wolffd@0:

wolffd@0: We have numbered the nodes as follows: wolffd@0: Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4. wolffd@0: The nodes must always be numbered in topological order, i.e., wolffd@0: ancestors before descendants. wolffd@0: For a more complicated graph, this is a little inconvenient: we will wolffd@0: see how to get around this below. wolffd@0:

wolffd@0: In Matlab 6, you can use logical arrays instead of double arrays, wolffd@0: which are 4 times smaller: wolffd@0:

wolffd@0: dag = false(N,N);
wolffd@0: dag(C,[R S]) = true;
wolffd@0: ...
wolffd@0:

wolffd@0: However, some graph functions (eg acyclic) do not work on wolffd@0: logical arrays! wolffd@0:

wolffd@0: You can visualize the resulting graph structure using wolffd@0: the methods discussed below. wolffd@0: For details on GUIs, wolffd@0: click here. wolffd@0: wolffd@0:

Creating the Bayes net shell

wolffd@0: wolffd@0: In addition to specifying the graph structure, wolffd@0: we must specify the size and type of each node. wolffd@0: If a node is discrete, its size is the wolffd@0: number of possible values wolffd@0: each node can take on; if a node is continuous, wolffd@0: it can be a vector, and its size is the length of this vector. wolffd@0: In this case, we will assume all nodes are discrete and binary. wolffd@0:

wolffd@0: discrete_nodes = 1:N;
wolffd@0: node_sizes = 2*ones(1,N); 
wolffd@0:

wolffd@0: If the nodes were not binary, you could type e.g., wolffd@0:

wolffd@0: node_sizes = [4 2 3 5];
wolffd@0:

wolffd@0: meaning that Cloudy has 4 possible values, wolffd@0: Sprinkler has 2 possible values, etc. wolffd@0: Note that these are cardinal values, not ordinal, i.e., wolffd@0: they are not ordered in any way, like 'low', 'medium', 'high'. wolffd@0:

wolffd@0: We are now ready to make the Bayes net: wolffd@0:

wolffd@0: bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes);
wolffd@0:

wolffd@0: By default, all nodes are assumed to be discrete, so we can also just wolffd@0: write wolffd@0:

wolffd@0: bnet = mk_bnet(dag, node_sizes);
wolffd@0:

wolffd@0: You may also specify which nodes will be observed. wolffd@0: If you don't know, or if this not fixed in advance, wolffd@0: just use the empty list (the default). wolffd@0:

wolffd@0: onodes = [];
wolffd@0: bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);
wolffd@0:

wolffd@0: Note that optional arguments are specified using a name/value syntax. wolffd@0: This is common for many BNT functions. wolffd@0: In general, to find out more about a function (e.g., which optional wolffd@0: arguments it takes), please see its wolffd@0: documentation string by typing wolffd@0:

wolffd@0: help mk_bnet
wolffd@0:

wolffd@0: See also other useful Matlab tips. wolffd@0:

wolffd@0: It is possible to associate names with nodes, as follows: wolffd@0:

wolffd@0: bnet = mk_bnet(dag, node_sizes, 'names', {'cloudy','S','R','W'}, 'discrete', 1:4);
wolffd@0:

wolffd@0: You can then refer to a node by its name: wolffd@0:

wolffd@0: C = bnet.names('cloudy'); % bnet.names is an associative array
wolffd@0: bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
wolffd@0:

wolffd@0: This feature uses my own associative array class. wolffd@0: wolffd@0: wolffd@0:

Parameters

wolffd@0: wolffd@0: A model consists of the graph structure and the parameters. wolffd@0: The parameters are represented by CPD objects (CPD = Conditional wolffd@0: Probability Distribution), which define the probability distribution wolffd@0: of a node given its parents. wolffd@0: (We will use the terms "node" and "random variable" interchangeably.) wolffd@0: The simplest kind of CPD is a table (multi-dimensional array), which wolffd@0: is suitable when all the nodes are discrete-valued. Note that the discrete wolffd@0: values are not assumed to be ordered in any way; that is, they wolffd@0: represent categorical quantities, like male and female, rather than wolffd@0: ordinal quantities, like low, medium and high. wolffd@0: (We will discuss CPDs in more detail below.) wolffd@0:

wolffd@0: Tabular CPDs, also called CPTs (conditional probability tables), wolffd@0: are stored as multidimensional arrays, where the dimensions wolffd@0: are arranged in the same order as the nodes, e.g., the CPT for node 4 wolffd@0: (WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself. wolffd@0: Hence the child is always the last dimension. wolffd@0: If a node has no parents, its CPT is a column vector representing its wolffd@0: prior. wolffd@0: Note that in Matlab (unlike C), arrays are indexed wolffd@0: from 1, and are layed out in memory such that the first index toggles wolffd@0: fastest, e.g., the CPT for node 4 (WetGrass) is as follows wolffd@0:

wolffd@0:

wolffd@0: where we have used the convention that false==1, true==2. wolffd@0: We can create this CPT in Matlab as follows wolffd@0:

wolffd@0: CPT = zeros(2,2,2);
wolffd@0: CPT(1,1,1) = 1.0;
wolffd@0: CPT(2,1,1) = 0.1;
wolffd@0: ...
wolffd@0:

wolffd@0: Here is an easier way: wolffd@0:

wolffd@0: CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);
wolffd@0:

wolffd@0: In fact, we don't need to reshape the array, since the CPD constructor wolffd@0: will do that for us. So we can just write wolffd@0:

wolffd@0: bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
wolffd@0:

wolffd@0: The other nodes are created similarly (using the old syntax for wolffd@0: optional parameters) wolffd@0:

wolffd@0: bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
wolffd@0: bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
wolffd@0: bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
wolffd@0: bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
wolffd@0:

wolffd@0: wolffd@0: wolffd@0:

Random Parameters

wolffd@0: wolffd@0: If we do not specify the CPT, random parameters will be wolffd@0: created, i.e., each "row" of the CPT will be drawn from the uniform distribution. wolffd@0: To ensure repeatable results, use wolffd@0:

wolffd@0: rand('state', seed);
wolffd@0: randn('state', seed);
wolffd@0:

wolffd@0: To control the degree of randomness (entropy), wolffd@0: you can sample each row of the CPT from a Dirichlet(p,p,...) distribution. wolffd@0: If p << 1, this encourages "deterministic" CPTs (one entry near 1, the rest near 0). wolffd@0: If p = 1, each entry is drawn from U[0,1]. wolffd@0: If p >> 1, the entries will all be near 1/k, where k is the arity of wolffd@0: this node, i.e., each row will be nearly uniform. wolffd@0: You can do this as follows, assuming this node wolffd@0: is number i, and ns is the node_sizes. wolffd@0:

wolffd@0: k = ns(i);
wolffd@0: ps = parents(dag, i);
wolffd@0: psz = prod(ns(ps));
wolffd@0: CPT = sample_dirichlet(p*ones(1,k), psz);
wolffd@0: bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT);
wolffd@0:

Loading a network from a file

wolffd@0: wolffd@0: If you already have a Bayes net represented in the XML-based wolffd@0: wolffd@0: Bayes Net Interchange Format (BNIF) (e.g., downloaded from the wolffd@0: wolffd@0: Bayes Net repository), wolffd@0: you can convert it to BNT format using wolffd@0: the wolffd@0: BIF-BNT Java wolffd@0: program written by Ken Shan. wolffd@0: (This is not necessarily up-to-date.) wolffd@0:

wolffd@0: It is currently not possible to save/load a BNT matlab object to wolffd@0: file, but this is easily fixed if you modify all the constructors wolffd@0: for all the classes (see matlab documentation). wolffd@0: wolffd@0:

Creating a model using a GUI

wolffd@0: wolffd@0: Suppose we want to compute the probability that the sprinker was on wolffd@0: given that the grass is wet. wolffd@0: The evidence consists of the fact that W=2. All the other nodes wolffd@0: are hidden (unobserved). We can specify this as follows. wolffd@0:

Senthil Nachimuthu wolffd@0: has started (Oct 07) an open source wolffd@0: GUI for BNT called wolffd@0: projeny wolffd@0: using Java. This is a successor to BNJ. wolffd@0: wolffd@0:
wolffd@0: Philippe LeRay has written (Sep 05) wolffd@0: a wolffd@0: wolffd@0: BNT GUI in matlab. wolffd@0: wolffd@0:
wolffd@0: wolffd@0: LinkStrength, wolffd@0: a package by Imme Ebert-Uphoff for visualizing the strength of wolffd@0: dependencies between nodes. wolffd@0: wolffd@0:

wolffd@0: wolffd@0:

Graph visualization

wolffd@0: wolffd@0: Click here for more information wolffd@0: on graph visualization. wolffd@0: wolffd@0: wolffd@0:

Inference

wolffd@0: wolffd@0: Having created the BN, we can now use it for inference. wolffd@0: There are many different algorithms for doing inference in Bayes nets, wolffd@0: that make different tradeoffs between speed, wolffd@0: complexity, generality, and accuracy. wolffd@0: BNT therefore offers a variety of different inference wolffd@0: "engines". We will discuss these wolffd@0: in more detail below. wolffd@0: For now, we will use the junction tree wolffd@0: engine, which is the mother of all exact inference algorithms. wolffd@0: This can be created as follows. wolffd@0:

wolffd@0: engine = jtree_inf_engine(bnet);
wolffd@0:

wolffd@0: The other engines have similar constructors, but might take wolffd@0: additional, algorithm-specific parameters. wolffd@0: All engines are used in the same way, once they have been created. wolffd@0: We illustrate this in the following sections. wolffd@0: wolffd@0: wolffd@0:

Computing marginal distributions

wolffd@0: evidence = cell(1,N);
wolffd@0: evidence{W} = 2;
wolffd@0:

wolffd@0: We use a 1D cell array instead of a vector to wolffd@0: cope with the fact that nodes can be vectors of different lengths. wolffd@0: In addition, the value [] can be used wolffd@0: to denote 'no evidence', instead of having to specify the observation wolffd@0: pattern as a separate argument. wolffd@0: (Click here for a quick tutorial on cell wolffd@0: arrays in matlab.) wolffd@0:

wolffd@0: We are now ready to add the evidence to the engine. wolffd@0:

wolffd@0: [engine, loglik] = enter_evidence(engine, evidence);
wolffd@0:

wolffd@0: The behavior of this function is algorithm-specific, and is discussed wolffd@0: in more detail below. wolffd@0: In the case of the jtree engine, wolffd@0: enter_evidence implements a two-pass message-passing scheme. wolffd@0: The first return argument contains the modified engine, which wolffd@0: incorporates the evidence. The second return argument contains the wolffd@0: log-likelihood of the evidence. (Not all engines are capable of wolffd@0: computing the log-likelihood.) wolffd@0:

wolffd@0: Finally, we can compute p=P(S=2|W=2) as follows. wolffd@0:

wolffd@0: marg = marginal_nodes(engine, S);
wolffd@0: marg.T
wolffd@0: ans =
wolffd@0:       0.57024
wolffd@0:       0.42976
wolffd@0: p = marg.T(2);
wolffd@0:

wolffd@0: We see that p = 0.4298. wolffd@0:

wolffd@0: Now let us add the evidence that it was raining, and see what wolffd@0: difference it makes. wolffd@0:

wolffd@0: evidence{R} = 2;
wolffd@0: [engine, loglik] = enter_evidence(engine, evidence);
wolffd@0: marg = marginal_nodes(engine, S);
wolffd@0: p = marg.T(2);
wolffd@0:

wolffd@0: We find that p = P(S=2|W=2,R=2) = 0.1945, wolffd@0: which is lower than wolffd@0: before, because the rain can ``explain away'' the wolffd@0: fact that the grass is wet. wolffd@0:

wolffd@0: You can plot a marginal distribution over a discrete variable wolffd@0: as a barchart using the built 'bar' function: wolffd@0:

wolffd@0: bar(marg.T)
wolffd@0:

wolffd@0: This is what it looks like wolffd@0: wolffd@0:

wolffd@0:

wolffd@0: wolffd@0:

Observed nodes

wolffd@0: wolffd@0: What happens if we ask for the marginal on an observed node, e.g. P(W|W=2)? wolffd@0: An observed discrete node effectively only has 1 value (the observed wolffd@0: one) --- all other values would result in 0 probability. wolffd@0: For efficiency, BNT treats observed (discrete) nodes as if they were wolffd@0: set to 1, as we see below: wolffd@0:

wolffd@0: evidence = cell(1,N);
wolffd@0: evidence{W} = 2;
wolffd@0: engine = enter_evidence(engine, evidence);
wolffd@0: m = marginal_nodes(engine, W);
wolffd@0: m.T
wolffd@0: ans =
wolffd@0:      1
wolffd@0:

wolffd@0: This can get a little confusing, since we assigned W=2. wolffd@0: So we can ask BNT to add the evidence back in by passing in an optional argument: wolffd@0:

wolffd@0: m = marginal_nodes(engine, W, 1);
wolffd@0: m.T
wolffd@0: ans =
wolffd@0:      0
wolffd@0:      1
wolffd@0:

wolffd@0: This shows that P(W=1|W=2) = 0 and P(W=2|W=2) = 1. wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Computing joint distributions

wolffd@0: wolffd@0: We can compute the joint probability on a set of nodes as in the wolffd@0: following example. wolffd@0:

wolffd@0: evidence = cell(1,N);
wolffd@0: [engine, ll] = enter_evidence(engine, evidence);
wolffd@0: m = marginal_nodes(engine, [S R W]);
wolffd@0:

wolffd@0: m is a structure. The 'T' field is a multi-dimensional array (in wolffd@0: this case, 3-dimensional) that contains the joint probability wolffd@0: distribution on the specified nodes. wolffd@0:

wolffd@0: >> m.T
wolffd@0: ans(:,:,1) =
wolffd@0:     0.2900    0.0410
wolffd@0:     0.0210    0.0009
wolffd@0: ans(:,:,2) =
wolffd@0:          0    0.3690
wolffd@0:     0.1890    0.0891
wolffd@0:

wolffd@0: We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass wolffd@0: to be wet if both the rain and sprinkler are off. wolffd@0:

wolffd@0: Let us now add some evidence to R. wolffd@0:

wolffd@0: evidence{R} = 2;
wolffd@0: [engine, ll] = enter_evidence(engine, evidence);
wolffd@0: m = marginal_nodes(engine, [S R W])
wolffd@0: m = 
wolffd@0:     domain: [2 3 4]
wolffd@0:          T: [2x1x2 double]
wolffd@0: >> m.T
wolffd@0: m.T
wolffd@0: ans(:,:,1) =
wolffd@0:     0.0820
wolffd@0:     0.0018
wolffd@0: ans(:,:,2) =
wolffd@0:     0.7380
wolffd@0:     0.1782
wolffd@0:

wolffd@0: The joint T(i,j,k) = P(S=i,R=j,W=k|evidence) wolffd@0: should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible wolffd@0: with the evidence that R=2. wolffd@0: Instead of creating large tables with many 0s, BNT sets the effective wolffd@0: size of observed (discrete) nodes to 1, as explained above. wolffd@0: This is why m.T has size 2x1x2. wolffd@0: To get a 2x2x2 table, type wolffd@0:

wolffd@0: m = marginal_nodes(engine, [S R W], 1)
wolffd@0: m = 
wolffd@0:     domain: [2 3 4]
wolffd@0:          T: [2x2x2 double]
wolffd@0: >> m.T
wolffd@0: m.T
wolffd@0: ans(:,:,1) =
wolffd@0:             0        0.082
wolffd@0:             0       0.0018
wolffd@0: ans(:,:,2) =
wolffd@0:             0        0.738
wolffd@0:             0       0.1782
wolffd@0:

wolffd@0: wolffd@0: Sometimes a node is not observed, but we have some distribution over wolffd@0: its possible values; this is often called "soft" or "virtual" wolffd@0: evidence. wolffd@0: One can use this as follows wolffd@0:

wolffd@0: Note: It is not always possible to compute the joint on arbitrary wolffd@0: sets of nodes: it depends on which inference engine you use, as discussed wolffd@0: in more detail below. wolffd@0: wolffd@0: wolffd@0:

Soft/virtual evidence

wolffd@0: [engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
wolffd@0:

wolffd@0: where soft_evidence{i} is either [] (if node i has no soft evidence) wolffd@0: or is a vector representing the probability distribution over i's wolffd@0: possible values. wolffd@0: For example, if we don't know i's exact value, but we know its wolffd@0: likelihood ratio is 60/40, we can write evidence{i} = [] and wolffd@0: soft_evidence{i} = [0.6 0.4]. wolffd@0:

wolffd@0: Currently only jtree_inf_engine supports this option. wolffd@0: It assumes that all hidden nodes, and all nodes for wolffd@0: which we have soft evidence, are discrete. wolffd@0: For a longer example, see BNT/examples/static/softev1.m. wolffd@0: wolffd@0: wolffd@0:

Most probable explanation

wolffd@0: wolffd@0: To compute the most probable explanation (MPE) of the evidence (i.e., wolffd@0: the most probable assignment, or a mode of the joint), use wolffd@0:

wolffd@0: [mpe, ll] = calc_mpe(engine, evidence);     
wolffd@0:

wolffd@0: mpe{i} is the most likely value of node i. wolffd@0: This calls enter_evidence with the 'maximize' flag set to 1, which wolffd@0: causes the engine to do max-product instead of sum-product. wolffd@0: The resulting max-marginals are then thresholded. wolffd@0: If there is more than one maximum probability assignment, we must take wolffd@0: care to break ties in a consistent manner (thresholding the wolffd@0: max-marginals may give the wrong result). To force this behavior, wolffd@0: type wolffd@0:

wolffd@0: [mpe, ll] = calc_mpe(engine, evidence, 1);     
wolffd@0:

wolffd@0: Note that computing the MPE is someties called abductive reasoning. wolffd@0: wolffd@0:

wolffd@0: You can also use calc_mpe_bucket written by Ron Zohar, wolffd@0: that does a forwards max-product pass, and then a backwards traceback wolffd@0: pass, which is how Viterbi is traditionally implemented. wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Conditional Probability Distributions

wolffd@0: wolffd@0: A Conditional Probability Distributions (CPD) wolffd@0: defines P(X(i) | X(Pa(i))), where X(i) is the i'th node, and X(Pa(i)) wolffd@0: are the parents of node i. There are many ways to represent this wolffd@0: distribution, which depend in part on whether X(i) and X(Pa(i)) are wolffd@0: discrete, continuous, or a combination. wolffd@0: We will discuss various representations below. wolffd@0: wolffd@0: wolffd@0:

Tabular nodes

wolffd@0: wolffd@0: If the CPD is represented as a table (i.e., if it is a multinomial wolffd@0: distribution), it has a number of parameters that is exponential in wolffd@0: the number of parents. See the example above. wolffd@0: wolffd@0: wolffd@0:

Noisy-or nodes

wolffd@0: wolffd@0: A noisy-OR node is like a regular logical OR gate except that wolffd@0: sometimes the effects of parents that are on get inhibited. wolffd@0: Let the prob. that parent i gets inhibited be q(i). wolffd@0: Then a node, C, with 2 parents, A and B, has the following CPD, where wolffd@0: we use F and T to represent off and on (1 and 2 in BNT). wolffd@0:

wolffd@0: A  B  P(C=off)      P(C=on)
wolffd@0: ---------------------------
wolffd@0: F  F  1.0           0.0
wolffd@0: T  F  q(A)          1-q(A)
wolffd@0: F  T  q(B)          1-q(B)
wolffd@0: T  T  q(A)q(B)      1-q(A)q(B)
wolffd@0:

wolffd@0: Thus we see that the causes get inhibited independently. wolffd@0: It is common to associate a "leak" node with a noisy-or CPD, which is wolffd@0: like a parent that is always on. This can account for all other unmodelled wolffd@0: causes which might turn the node on. wolffd@0:

wolffd@0: The noisy-or distribution is similar to the logistic distribution. wolffd@0: To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j) wolffd@0: be the prob. that j inhibits i. Then wolffd@0:

wolffd@0: Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j)
wolffd@0:

wolffd@0: Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then wolffd@0:

wolffd@0: Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j))
wolffd@0:

wolffd@0: For a sigmoid node, we have wolffd@0:

wolffd@0: Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j))
wolffd@0:

wolffd@0: where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of wolffd@0: the activation function (although both are monotonically increasing). wolffd@0: In addition, in the case of a noisy-or, the weights are constrained to be wolffd@0: positive, since they derive from probabilities q(i,j). wolffd@0: In both cases, the number of parameters is linear in the wolffd@0: number of parents, unlike the case of a multinomial distribution, wolffd@0: where the number of parameters is exponential in the number of parents. wolffd@0: We will see an example of noisy-OR nodes below. wolffd@0: wolffd@0: wolffd@0:

Other (noisy) deterministic nodes

wolffd@0: wolffd@0: Deterministic CPDs for discrete random variables can be created using wolffd@0: the deterministic_CPD class. It is also possible to 'flip' the output wolffd@0: of the function with some probability, to simulate noise. wolffd@0: The boolean_CPD class is just a special case of a wolffd@0: deterministic CPD, where the parents and child are all binary. wolffd@0:

wolffd@0: Both of these classes are just "syntactic sugar" for the tabular_CPD wolffd@0: class. wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Softmax nodes

wolffd@0: wolffd@0: If we have a discrete node with a continuous parent, wolffd@0: we can define its CPD using a softmax function wolffd@0: (also known as the multinomial logit function). wolffd@0: This acts like a soft thresholding operator, and is defined as follows: wolffd@0:

wolffd@0:                     exp(w(:,i)'*x + b(i)) 
wolffd@0: Pr(Q=i | X=x)  =  -----------------------------
wolffd@0:                   sum_j   exp(w(:,j)'*x + b(j))
wolffd@0: 
wolffd@0:

wolffd@0: The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the wolffd@0: following interpretation: w(:,i)-w(:,j) is the normal vector to the wolffd@0: decision boundary between classes i and j, wolffd@0: and b(i)-b(j) is its offset (bias). For example, suppose wolffd@0: X is a 2-vector, and Q is binary. Then wolffd@0:

wolffd@0: w = [1 -1;
wolffd@0:      0 0];
wolffd@0: 
wolffd@0: b = [0 0];
wolffd@0:

wolffd@0: means class 1 are points in the 2D plane with positive x coordinate, wolffd@0: and class 2 are points in the 2D plane with negative x coordinate. wolffd@0: If w has large magnitude, the decision boundary is sharp, otherwise it wolffd@0: is soft. wolffd@0: In the special case that Q is binary (0/1), the softmax function reduces to the logistic wolffd@0: (sigmoid) function. wolffd@0:

wolffd@0: Fitting a softmax function can be done using the iteratively reweighted wolffd@0: least squares (IRLS) algorithm. wolffd@0: We use the implementation from wolffd@0: Netlab. wolffd@0: Note that since wolffd@0: the softmax distribution is not in the exponential family, it does not wolffd@0: have finite sufficient statistics, and hence we must store all the wolffd@0: training data in uncompressed form. wolffd@0: If this takes too much space, one should use online (stochastic) gradient wolffd@0: descent (not implemented in BNT). wolffd@0:

wolffd@0: If a softmax node also has discrete parents, wolffd@0: we use a different set of w/b parameters for each combination of wolffd@0: parent values, as in the conditional linear wolffd@0: Gaussian CPD. wolffd@0: This feature was implemented by Pierpaolo Brutti. wolffd@0: He is currently extending it so that discrete parents can be treated wolffd@0: as if they were continuous, by adding indicator variables to the X wolffd@0: vector. wolffd@0:

wolffd@0: We will see an example of softmax nodes below. wolffd@0: wolffd@0: wolffd@0:

Neural network nodes

wolffd@0: wolffd@0: Pierpaolo Brutti has implemented the mlp_CPD class, which uses a multi layer perceptron wolffd@0: to implement a mapping from continuous parents to discrete children, wolffd@0: similar to the softmax function. wolffd@0: (If there are also discrete parents, it creates a mixture of MLPs.) wolffd@0: It uses code from Netlab. wolffd@0: This is work in progress. wolffd@0: wolffd@0:

Root nodes

wolffd@0: wolffd@0: A root node has no parents and no parameters; it can be used to model wolffd@0: an observed, exogeneous input variable, i.e., one which is "outside" wolffd@0: the model. wolffd@0: This is useful for conditional density models. wolffd@0: We will see an example of root nodes below. wolffd@0: wolffd@0: wolffd@0:

Gaussian nodes

wolffd@0: wolffd@0: We now consider a distribution suitable for the continuous-valued nodes. wolffd@0: Suppose the node is called Y, its continuous parents (if any) are wolffd@0: called X, and its discrete parents (if any) are called Q. wolffd@0: The distribution on Y is defined as follows: wolffd@0:

wolffd@0: - no parents: Y ~ N(mu, Sigma)
wolffd@0: - cts parents : Y|X=x ~ N(mu + W x, Sigma)
wolffd@0: - discrete parents: Y|Q=i ~ N(mu(:,i), Sigma(:,:,i))
wolffd@0: - cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i))
wolffd@0:

wolffd@0: where N(mu, Sigma) denotes a Normal distribution with mean mu and wolffd@0: covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q wolffd@0: respectively. wolffd@0: If there are no discrete parents, |Q|=1; if there is wolffd@0: more than one, then |Q| = a vector of the sizes of each discrete parent. wolffd@0: If there are no continuous parents, |X|=0; if there is more than one, wolffd@0: then |X| = the sum of their sizes. wolffd@0: Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive wolffd@0: semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight) wolffd@0: matrix. wolffd@0:

wolffd@0: We can create a Gaussian node with random parameters as follows. wolffd@0:

wolffd@0: bnet.CPD{i} = gaussian_CPD(bnet, i);
wolffd@0:

wolffd@0: We can specify the value of one or more of the parameters as in the wolffd@0: following example, in which |Y|=2, and |Q|=1. wolffd@0:

wolffd@0: bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', [0; 0], 'weights', randn(Y,X), 'cov', eye(Y));
wolffd@0:

wolffd@0: wolffd@0: We plan to add classification and regression trees to define CPDs for wolffd@0: discrete and continuous nodes, respectively. wolffd@0: Trees have many advantages: they are easy to interpret, they can do wolffd@0: feature selection, they can wolffd@0: handle discrete and continuous inputs, they do not make strong wolffd@0: assumptions about the form of the distribution, the number of wolffd@0: parameters can grow in a data-dependent way (i.e., they are wolffd@0: semi-parametric), they can handle missing data, etc. wolffd@0: However, they are not yet implemented. wolffd@0: wolffd@0: wolffd@0: wolffd@0:

wolffd@0: We will see an example of conditional linear Gaussian nodes below. wolffd@0:

wolffd@0: When learning Gaussians from data, it is helpful to ensure the wolffd@0: data has a small magnitde wolffd@0: (see e.g., KPMstats/standardize) to prevent numerical problems. wolffd@0: Unless you have a lot of data, it is also a very good idea to use wolffd@0: diagonal instead of full covariance matrices. wolffd@0: (BNT does not currently support spherical covariances, although it wolffd@0: would be easy to add, since KPMstats/clg_Mstep supports this option; wolffd@0: you would just need to modify gaussian_CPD/update_ess to accumulate wolffd@0: weighted inner products.) wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Other continuous distributions

wolffd@0: wolffd@0: Currently BNT does not support any CPDs for continuous nodes other wolffd@0: than the Gaussian. wolffd@0: However, you can use a mixture of Gaussians to wolffd@0: approximate other continuous distributions. We will see some an example wolffd@0: of this with the IFA model below. wolffd@0: wolffd@0: wolffd@0:

Generalized linear model nodes

wolffd@0: wolffd@0: In the future, we may incorporate some of the functionality of wolffd@0: glmlab wolffd@0: into BNT. wolffd@0: wolffd@0: wolffd@0:

Classification/regression tree nodes

Summary of CPD types

wolffd@0: wolffd@0: We list all the different types of CPDs supported by BNT. wolffd@0: For each CPD, we specify if the child and parents can be discrete (D) or wolffd@0: continuous (C) (Binary (B) nodes are a special case). wolffd@0: We also specify which methods each class supports. wolffd@0: If a method is inherited, the name of the parent class is mentioned. wolffd@0: If a parent class calls a child method, this is mentioned. wolffd@0:

wolffd@0: The CPD_to_CPT method converts a CPD to a table; this wolffd@0: requires that the child and all parents are discrete. wolffd@0: The CPT might be exponentially big... wolffd@0: convert_to_table evaluates a CPD with evidence, and wolffd@0: represents the the resulting potential as an array. wolffd@0: This requires that the child is discrete, and any continuous parents wolffd@0: are observed. wolffd@0: convert_to_pot evaluates a CPD with evidence, and wolffd@0: represents the resulting potential as a dpot, gpot, cgpot or upot, as wolffd@0: requested. (d=discrete, g=Gaussian, cg = conditional Gaussian, u = wolffd@0: utility). wolffd@0: wolffd@0:

wolffd@0: When we sample a node, all the parents are observed. wolffd@0: When we compute the (log) probability of a node, all the parents and wolffd@0: the child are observed. wolffd@0:

wolffd@0: We also specify if the parameters are learnable. wolffd@0: For learning with EM, we require wolffd@0: the methods reset_ess, update_ess and wolffd@0: maximize_params. wolffd@0: For learning from fully observed data, we require wolffd@0: the method learn_params. wolffd@0: By default, all classes inherit this from generic_CPD, which simply wolffd@0: calls update_ess N times, once for each data case, followed wolffd@0: by maximize_params, i.e., it is like EM, without the E step. wolffd@0: Some classes implement a batch formula, which is quicker. wolffd@0:

wolffd@0: Bayesian learning means computing a posterior over the parameters wolffd@0: given fully observed data. wolffd@0:

wolffd@0: Pearl means we implement the methods compute_pi and wolffd@0: compute_lambda_msg, used by wolffd@0: pearl_inf_engine, which runs on directed graphs. wolffd@0: belprop_inf_engine only needs convert_to_pot.H wolffd@0: The pearl methods can exploit special properties of the CPDs for wolffd@0: computing the messages efficiently, whereas belprop does not. wolffd@0:

wolffd@0: The only method implemented by generic_CPD is adjustable_CPD, wolffd@0: which is not shown, since it is not very interesting. wolffd@0: wolffd@0: wolffd@0:

wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:
Name wolffd@0: Child wolffd@0: Parents wolffd@0: Comments wolffd@0: CPD_to_CPT wolffd@0: conv_to_table wolffd@0: conv_to_pot wolffd@0: sample wolffd@0: prob wolffd@0: learn wolffd@0: Bayes wolffd@0: Pearl wolffd@0: wolffd@0: wolffd@0:
wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:
boolean wolffd@0: B wolffd@0: B wolffd@0: Syntactic sugar for tabular wolffd@0: - wolffd@0: - wolffd@0: - wolffd@0: - wolffd@0: - wolffd@0: - wolffd@0: - wolffd@0: - wolffd@0: wolffd@0:
deterministic wolffd@0: D wolffd@0: D wolffd@0: Syntactic sugar for tabular wolffd@0: - wolffd@0: - wolffd@0: - wolffd@0: - wolffd@0: - wolffd@0: - wolffd@0: - wolffd@0: - wolffd@0: wolffd@0:
Discrete wolffd@0: D wolffd@0: C/D wolffd@0: Virtual class wolffd@0: N wolffd@0: Calls CPD_to_CPT wolffd@0: Calls conv_to_table wolffd@0: Calls conv_to_table wolffd@0: Calls conv_to_table wolffd@0: N wolffd@0: N wolffd@0: N wolffd@0: wolffd@0:
Gaussian wolffd@0: C wolffd@0: C/D wolffd@0: - wolffd@0: N wolffd@0: N wolffd@0: Y wolffd@0: Y wolffd@0: Y wolffd@0: Y wolffd@0: N wolffd@0: N wolffd@0: wolffd@0:
gmux wolffd@0: C wolffd@0: C/D wolffd@0: multiplexer wolffd@0: N wolffd@0: N wolffd@0: Y wolffd@0: N wolffd@0: N wolffd@0: N wolffd@0: N wolffd@0: Y wolffd@0: wolffd@0: wolffd@0:
MLP wolffd@0: D wolffd@0: C/D wolffd@0: multi layer perceptron wolffd@0: N wolffd@0: Y wolffd@0: Inherits from discrete wolffd@0: Inherits from discrete wolffd@0: Inherits from discrete wolffd@0: Y wolffd@0: N wolffd@0: N wolffd@0: wolffd@0: wolffd@0:
noisy-or wolffd@0: B wolffd@0: B wolffd@0: - wolffd@0: Y wolffd@0: Inherits from discrete wolffd@0: Inherits from discrete wolffd@0: Inherits from discrete wolffd@0: Inherits from discrete wolffd@0: N wolffd@0: N wolffd@0: Y wolffd@0: wolffd@0: wolffd@0:
root wolffd@0: C/D wolffd@0: none wolffd@0: no params wolffd@0: N wolffd@0: N wolffd@0: Y wolffd@0: Y wolffd@0: Y wolffd@0: N wolffd@0: N wolffd@0: N wolffd@0: wolffd@0: wolffd@0:
softmax wolffd@0: D wolffd@0: C/D wolffd@0: - wolffd@0: N wolffd@0: Y wolffd@0: Inherits from discrete wolffd@0: Inherits from discrete wolffd@0: Inherits from discrete wolffd@0: Y wolffd@0: N wolffd@0: N wolffd@0: wolffd@0: wolffd@0:
generic wolffd@0: C/D wolffd@0: C/D wolffd@0: Virtual class wolffd@0: N wolffd@0: N wolffd@0: N wolffd@0: N wolffd@0: N wolffd@0: N wolffd@0: N wolffd@0: N wolffd@0: wolffd@0: wolffd@0:
Tabular wolffd@0: D wolffd@0: D wolffd@0: - wolffd@0: Y wolffd@0: Inherits from discrete wolffd@0: Inherits from discrete wolffd@0: Inherits from discrete wolffd@0: Inherits from discrete wolffd@0: Y wolffd@0: Y wolffd@0: Y wolffd@0: wolffd@0:
wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Example models

wolffd@0: wolffd@0: In Figure (a) below, we show how Factor Analysis can be thought of as a wolffd@0: graphical model. Here, X has an N(0,I) prior, and wolffd@0: Y|X=x ~ N(mu + Wx, Psi), wolffd@0: where Psi is diagonal and W is called the "factor loading matrix". wolffd@0: Since the noise on both X and Y is diagonal, the components of these wolffd@0: vectors are uncorrelated, and hence can be represented as individual wolffd@0: scalar nodes, as we show in (b). wolffd@0: (This is useful if parts of the observations on the Y vector are occasionally missing.) wolffd@0: We usually take k=|X| << |Y|=D, so the model tries to explain wolffd@0: many observations using a low-dimensional subspace. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Gaussian mixture models

wolffd@0: wolffd@0: Richard W. DeVaul has made a detailed tutorial on how to fit mixtures wolffd@0: of Gaussians using BNT. Available wolffd@0: here. wolffd@0: wolffd@0: wolffd@0:

PCA, ICA, and all that

wolffd@0:	wolffd@0:	wolffd@0:	wolffd@0:
(a) wolffd@0:	(b) wolffd@0:	(c) wolffd@0:	(d) wolffd@0:

wolffd@0: wolffd@0: The root node is clamped to the N(0,I) distribution, so that we will wolffd@0: not update these parameters during learning. wolffd@0: The mean of the leaf node is clamped to 0, wolffd@0: since we assume the data has been centered (had its mean subtracted wolffd@0: off); this is just for simplicity. wolffd@0: Finally, the covariance of the leaf node is constrained to be wolffd@0: diagonal. W0 and Psi0 are the initial parameter guesses. wolffd@0: wolffd@0:

wolffd@0: We can create this model in BNT as follows. wolffd@0:

wolffd@0: ns = [k D];
wolffd@0: dag = zeros(2,2);
wolffd@0: dag(1,2) = 1;
wolffd@0: bnet = mk_bnet(dag, ns, 'discrete', []);
wolffd@0: bnet.CPD{1} = gaussian_CPD(bnet, 1, 'mean', zeros(k,1), 'cov', eye(k), ...
wolffd@0:    'cov_type', 'diag', 'clamp_mean', 1, 'clamp_cov', 1);
wolffd@0: bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(D,1), 'cov', diag(Psi0), 'weights', W0, ...
wolffd@0:    'cov_type', 'diag', 'clamp_mean', 1);
wolffd@0:

wolffd@0: We can fit this model (i.e., estimate its parameters in a maximum wolffd@0: likelihood (ML) sense) using EM, as we wolffd@0: explain below. wolffd@0: Not surprisingly, the ML estimates for mu and Psi turn out to be wolffd@0: identical to the wolffd@0: sample mean and variance, which can be computed directly as wolffd@0:

wolffd@0: mu_ML = mean(data);
wolffd@0: Psi_ML = diag(cov(data));
wolffd@0:

wolffd@0: Note that W can only be identified up to a rotation matrix, because of wolffd@0: the spherical symmetry of the source. wolffd@0: wolffd@0:

wolffd@0: If we restrict Psi to be spherical, i.e., Psi = sigma*I, wolffd@0: there is a closed-form solution for W as well, wolffd@0: i.e., we do not need to use EM. wolffd@0: In particular, W contains the first |X| eigenvectors of the sample covariance wolffd@0: matrix, with scalings determined by the eigenvalues and sigma. wolffd@0: Classical PCA can be obtained by taking the sigma->0 limit. wolffd@0: For details, see wolffd@0: wolffd@0:

wolffd@0: "EM algorithms for PCA and SPCA", Sam Roweis, NIPS 97. wolffd@0: ( wolffd@0: Matlab software) wolffd@0: wolffd@0:
wolffd@0:
wolffd@0: wolffd@0: "Mixtures of probabilistic principal component analyzers", wolffd@0: Tipping and Bishop, Neural Computation 11(2):443--482, 1999. wolffd@0:

wolffd@0: wolffd@0:

wolffd@0: By adding a hidden discrete variable, we can create mixtures of FA wolffd@0: models, as shown in (c). wolffd@0: Now we can explain the data using a set of subspaces. wolffd@0: We can create this model in BNT as follows. wolffd@0:

wolffd@0: ns = [M k D];
wolffd@0: dag = zeros(3);
wolffd@0: dag(1,3) = 1;
wolffd@0: dag(2,3) = 1;
wolffd@0: bnet = mk_bnet(dag, ns, 'discrete', 1);
wolffd@0: bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0);
wolffd@0: bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(k, 1), 'cov', eye(k), 'cov_type', 'diag', ...
wolffd@0: 			   'clamp_mean', 1, 'clamp_cov', 1);
wolffd@0: bnet.CPD{3} = gaussian_CPD(bnet, 3, 'mean', Mu0', 'cov', repmat(diag(Psi0), [1 1 M]), ...
wolffd@0: 			   'weights', W0, 'cov_type', 'diag', 'tied_cov', 1);
wolffd@0:

wolffd@0: Notice how the covariance matrix for Y is the same for all values of wolffd@0: Q; that is, the noise level in each sub-space is assumed the same. wolffd@0: However, we allow the offset, mu, to vary. wolffd@0: For details, see wolffd@0:

wolffd@0: The EM wolffd@0: Algorithm for Mixtures of Factor Analyzers , wolffd@0: Ghahramani, Z. and Hinton, G.E. (1996), wolffd@0: University of Toronto wolffd@0: Technical Report CRG-TR-96-1. wolffd@0: (Matlab software) wolffd@0: wolffd@0:
wolffd@0:
wolffd@0: wolffd@0: "Mixtures of probabilistic principal component analyzers", wolffd@0: Tipping and Bishop, Neural Computation 11(2):443--482, 1999. wolffd@0:

wolffd@0: wolffd@0:

wolffd@0: I have included Zoubin's specialized MFA code (with his permission) wolffd@0: with the toolbox, so you can check that BNT gives the same results: wolffd@0: see 'BNT/examples/static/mfa1.m'. wolffd@0: wolffd@0:

wolffd@0: Independent Factor Analysis (IFA) generalizes FA by allowing a wolffd@0: non-Gaussian prior on each component of X. wolffd@0: (Note that we can approximate a non-Gaussian prior using a mixture of wolffd@0: Gaussians.) wolffd@0: This means that the likelihood function is no longer rotationally wolffd@0: invariant, so we can uniquely identify W and the hidden wolffd@0: sources X. wolffd@0: IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y). wolffd@0: We recover classical Independent Components Analysis (ICA) wolffd@0: in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the wolffd@0: weight matrix W is square and invertible. wolffd@0: For details, see wolffd@0:

wolffd@0: Independent Factor wolffd@0: Analysis, H. Attias, Neural Computation 11: 803--851, 1998. wolffd@0:

wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Mixtures of experts

wolffd@0: wolffd@0: As an example of the use of the softmax function, wolffd@0: we introduce the Mixture of Experts model. wolffd@0: wolffd@0: As before, wolffd@0: circles denote continuous-valued nodes, wolffd@0: squares denote discrete nodes, clear wolffd@0: means hidden, and shaded means observed. wolffd@0:

wolffd@0:

wolffd@0: wolffd@0: A hierarchical mixture of experts (HME) extends the mixture of experts wolffd@0: model by having more than one hidden node. A two-level example is shown below, along wolffd@0: with its more traditional representation as a neural network. wolffd@0: This is like a (balanced) probabilistic decision tree of height 2. wolffd@0:

wolffd@0: X is the observed wolffd@0: input, Y is the output, and wolffd@0: the Q nodes are hidden "gating" nodes, which select the appropriate wolffd@0: set of parameters for Y. During training, Y is assumed observed, wolffd@0: but for testing, the goal is to predict Y given X. wolffd@0: Note that this is a conditional density model, so we don't wolffd@0: associate any parameters with X. wolffd@0: Hence X's CPD will be a root CPD, which is a way of modelling wolffd@0: exogenous nodes. wolffd@0: If the output is a continuous-valued quantity, wolffd@0: we assume the "experts" are linear-regression units, wolffd@0: and set Y's CPD to linear-Gaussian. wolffd@0: If the output is discrete, we set Y's CPD to a softmax function. wolffd@0: The Q CPDs will always be softmax functions. wolffd@0: wolffd@0:

wolffd@0: As a concrete example, consider the mixture of experts model where X and Y are wolffd@0: scalars, and Q is binary. wolffd@0: This is just piecewise linear regression, where wolffd@0: we have two line segments, i.e., wolffd@0:

wolffd@0: wolffd@0:

wolffd@0: We can create this model with random parameters as follows. wolffd@0: (This code is bundled in BNT/examples/static/mixexp2.m.) wolffd@0:

wolffd@0: X = 1;
wolffd@0: Q = 2;
wolffd@0: Y = 3;
wolffd@0: dag = zeros(3,3);
wolffd@0: dag(X,[Q Y]) = 1
wolffd@0: dag(Q,Y) = 1;
wolffd@0: ns = [1 2 1]; % make X and Y scalars, and have 2 experts
wolffd@0: onodes = [1 3];
wolffd@0: bnet = mk_bnet(dag, ns, 'discrete', 2, 'observed', onodes);
wolffd@0: 
wolffd@0: rand('state', 0);
wolffd@0: randn('state', 0);
wolffd@0: bnet.CPD{1} = root_CPD(bnet, 1);
wolffd@0: bnet.CPD{2} = softmax_CPD(bnet, 2);
wolffd@0: bnet.CPD{3} = gaussian_CPD(bnet, 3);
wolffd@0:

wolffd@0: Now let us fit this model using EM. wolffd@0: First we load the data (1000 training cases) and plot them. wolffd@0:

wolffd@0:

wolffd@0: data = load('/examples/static/Misc/mixexp_data.txt', '-ascii');        
wolffd@0: plot(data(:,1), data(:,2), '.');
wolffd@0:

wolffd@0:

wolffd@0: wolffd@0:

wolffd@0: This is what the model looks like before training. wolffd@0: (Thanks to Thomas Hofman for writing this plotting routine.) wolffd@0:

wolffd@0:

wolffd@0: wolffd@0:

wolffd@0: Now let's train the model, and plot the final performance. wolffd@0: (We will discuss how to train models in more detail below.) wolffd@0:

wolffd@0:

wolffd@0: ncases = size(data, 1); % each row of data is a training case
wolffd@0: cases = cell(3, ncases);
wolffd@0: cases([1 3], :) = num2cell(data'); % each column of cases is a training case
wolffd@0: engine = jtree_inf_engine(bnet);
wolffd@0: max_iter = 20;
wolffd@0: [bnet2, LLtrace] = learn_params_em(engine, cases, max_iter);
wolffd@0:

wolffd@0: (We specify which nodes will be observed when we create the engine. wolffd@0: Hence BNT knows that the hidden nodes are all discrete. wolffd@0: For complex models, this can lead to a significant speedup.) wolffd@0: Below we show what the model looks like after 16 iterations of EM wolffd@0: (with 100 IRLS iterations per M step), when it converged wolffd@0: using the default convergence tolerance (that the wolffd@0: fractional change in the log-likelihood be less than 1e-3). wolffd@0: Before learning, the log-likelihood was wolffd@0: -322.927442; afterwards, it was -13.728778. wolffd@0:

wolffd@0:

wolffd@0: wolffd@0: (See BNT/examples/static/mixexp2.m for details of the code.) wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Hierarchical mixtures of experts

wolffd@0:

wolffd@0: wolffd@0: Bayes nets originally arose out of an attempt to add probabilities to wolffd@0: expert systems, and this is still the most common use for BNs. wolffd@0: A famous example is wolffd@0: QMR-DT, a decision-theoretic reformulation of the Quick Medical wolffd@0: Reference (QMR) model. wolffd@0:

wolffd@0: Pierpaolo Brutti wolffd@0: has written an extensive set of routines for HMEs, wolffd@0: which are bundled with BNT: see the examples/static/HME directory. wolffd@0: These routines allow you to choose the number of hidden (gating) wolffd@0: layers, and the form of the experts (softmax or MLP). wolffd@0: See the file hmemenu, which provides a demo. wolffd@0: For example, the figure below shows the decision boundaries learned wolffd@0: for a ternary classification problem, using a 2 level HME with softmax wolffd@0: gates and softmax experts; the training set is on the left, the wolffd@0: testing set on the right. wolffd@0:

wolffd@0:

wolffd@0: wolffd@0:

wolffd@0: wolffd@0: wolffd@0:

wolffd@0: For more details, see the following: wolffd@0:

wolffd@0: Hierarchical mixtures of experts and the EM algorithm wolffd@0: M. I. Jordan and R. A. Jacobs. Neural Computation, 6, 181-214, 1994. wolffd@0: wolffd@0:
David Martin's wolffd@0: matlab code for HME wolffd@0: wolffd@0:
Why the wolffd@0: logistic function? A tutorial discussion on wolffd@0: probabilities and neural networks. M. I. Jordan. MIT Computational wolffd@0: Cognitive Science Report 9503, August 1995. wolffd@0: wolffd@0:
"Generalized Linear Models", McCullagh and Nelder, Chapman and wolffd@0: Halll, 1983. wolffd@0: wolffd@0:
wolffd@0: "Improved learning algorithms for mixtures of experts in multiclass wolffd@0: classification". wolffd@0: K. Chen, L. Xu, H. Chi. wolffd@0: Neural Networks (1999) 12: 1229-1252. wolffd@0: wolffd@0:
wolffd@0: Classification Using Hierarchical Mixtures of Experts wolffd@0: S.R. Waterhouse and A.J. Robinson. wolffd@0: In Proc. IEEE Workshop on Neural Network for Signal Processing IV (1994), pp. 177-186 wolffd@0: wolffd@0:
wolffd@0: Localized mixtures of experts, wolffd@0: P. Moerland, 1998. wolffd@0: wolffd@0:
"Nonlinear gated experts for time series", wolffd@0: A.S. Weigend and M. Mangeas, 1995. wolffd@0: wolffd@0:

wolffd@0: wolffd@0: wolffd@0:

QMR

wolffd@0:

wolffd@0: wolffd@0: Here, the top layer represents hidden disease nodes, and the bottom wolffd@0: layer represents observed symptom nodes. wolffd@0: The goal is to infer the posterior probability of each disease given wolffd@0: all the symptoms (which can be present, absent or unknown). wolffd@0: Each node in the top layer has a Bernoulli prior (with a low prior wolffd@0: probability that the disease is present). wolffd@0: Since each node in the bottom layer has a high fan-in, we use a wolffd@0: noisy-OR parameterization; each disease has an independent chance of wolffd@0: causing each symptom. wolffd@0: The real QMR-DT model is copyright, but wolffd@0: we can create a random QMR-like model as follows. wolffd@0:

wolffd@0: function bnet = mk_qmr_bnet(G, inhibit, leak, prior)
wolffd@0: % MK_QMR_BNET Make a QMR model
wolffd@0: % bnet = mk_qmr_bnet(G, inhibit, leak, prior)
wolffd@0: %
wolffd@0: % G(i,j) = 1 iff there is an arc from disease i to finding j
wolffd@0: % inhibit(i,j) = inhibition probability on i->j arc
wolffd@0: % leak(j) = inhibition prob. on leak->j arc
wolffd@0: % prior(i) = prob. disease i is on
wolffd@0: 
wolffd@0: [Ndiseases Nfindings] = size(inhibit);
wolffd@0: N = Ndiseases + Nfindings;
wolffd@0: finding_node = Ndiseases+1:N;
wolffd@0: ns = 2*ones(1,N);
wolffd@0: dag = zeros(N,N);
wolffd@0: dag(1:Ndiseases, finding_node) = G;
wolffd@0: bnet = mk_bnet(dag, ns, 'observed', finding_node);
wolffd@0: 
wolffd@0: for d=1:Ndiseases
wolffd@0:   CPT = [1-prior(d) prior(d)];
wolffd@0:   bnet.CPD{d} = tabular_CPD(bnet, d, CPT');
wolffd@0: end
wolffd@0: 
wolffd@0: for i=1:Nfindings
wolffd@0:   fnode = finding_node(i);
wolffd@0:   ps = parents(G, i);
wolffd@0:   bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(ps, i));
wolffd@0: end
wolffd@0:

wolffd@0: In the file BNT/examples/static/qmr1, we create a random bipartite wolffd@0: graph G, with 5 diseases and 10 findings, and random parameters. wolffd@0: (In general, to create a random dag, use 'mk_random_dag'.) wolffd@0: We can visualize the resulting graph structure using wolffd@0: the methods discussed below, with the wolffd@0: following results: wolffd@0:

wolffd@0: wolffd@0: wolffd@0:

wolffd@0: Now let us put some random evidence on all the leaves except the very wolffd@0: first and very last, and compute the disease posteriors. wolffd@0:

wolffd@0: pos = 2:floor(Nfindings/2);
wolffd@0: neg = (pos(end)+1):(Nfindings-1);
wolffd@0: onodes = myunion(pos, neg);
wolffd@0: evidence = cell(1, N);
wolffd@0: evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos)));
wolffd@0: evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg)));
wolffd@0: 
wolffd@0: engine = jtree_inf_engine(bnet);
wolffd@0: [engine, ll] = enter_evidence(engine, evidence);
wolffd@0: post = zeros(1, Ndiseases);
wolffd@0: for i=diseases(:)'
wolffd@0:   m = marginal_nodes(engine, i);
wolffd@0:   post(i) = m.T(2);
wolffd@0: end
wolffd@0:

wolffd@0: Junction tree can be quite slow on large QMR models. wolffd@0: Fortunately, it is possible to exploit properties of the noisy-OR wolffd@0: function to speed up exact inference using an algorithm called wolffd@0: quickscore, discussed below. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Conditional Gaussian models

wolffd@0: wolffd@0: A conditional Gaussian model is one in which, conditioned on all the discrete wolffd@0: nodes, the distribution over the remaining (continuous) nodes is wolffd@0: multivariate Gaussian. This means we can have arcs from discrete (D) wolffd@0: to continuous (C) nodes, but not vice versa. wolffd@0: (We are allowed C->D arcs if the continuous nodes are observed, wolffd@0: as in the mixture of experts model, wolffd@0: since this distribution can be represented with a discrete potential.) wolffd@0:

wolffd@0: We now give an example of a CG model, from wolffd@0: the paper "Propagation of Probabilities, Means amd wolffd@0: Variances in Mixed Graphical Association Models", Steffen Lauritzen, wolffd@0: JASA 87(420):1098--1108, 1992 (reprinted in the book "Probabilistic Networks and Expert wolffd@0: Systems", R. G. Cowell, A. P. Dawid, S. L. Lauritzen and wolffd@0: D. J. Spiegelhalter, Springer, 1999.) wolffd@0: wolffd@0:

Specifying the graph

wolffd@0: wolffd@0: Consider the model of waste emissions from an incinerator plant shown below. wolffd@0: We follow the standard convention that shaded nodes are observed, wolffd@0: clear nodes are hidden. wolffd@0: We also use the non-standard convention that wolffd@0: square nodes are discrete (tabular) and round nodes are wolffd@0: Gaussian. wolffd@0: wolffd@0:

wolffd@0:

wolffd@0: wolffd@0:

wolffd@0: wolffd@0: We can create this model as follows. wolffd@0:

wolffd@0: F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9;
wolffd@0: n = 9;
wolffd@0: 
wolffd@0: dag = zeros(n);
wolffd@0: dag(F,E)=1;
wolffd@0: dag(W,[E Min D]) = 1;
wolffd@0: dag(E,D)=1;
wolffd@0: dag(B,[C D])=1;
wolffd@0: dag(D,[L Mout])=1;
wolffd@0: dag(Min,Mout)=1;
wolffd@0: 
wolffd@0: % node sizes - all cts nodes are scalar, all discrete nodes are binary
wolffd@0: ns = ones(1, n);
wolffd@0: dnodes = [F W B];
wolffd@0: cnodes = mysetdiff(1:n, dnodes);
wolffd@0: ns(dnodes) = 2;
wolffd@0: 
wolffd@0: bnet = mk_bnet(dag, ns, 'discrete', dnodes);
wolffd@0:

wolffd@0: 'dnodes' is a list of the discrete nodes; 'cnodes' is the continuous wolffd@0: nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'. wolffd@0:

wolffd@0: wolffd@0: wolffd@0:

Specifying the parameters

wolffd@0: wolffd@0: The parameters of the discrete nodes can be specified as follows. wolffd@0:

wolffd@0: bnet.CPD{B} = tabular_CPD(bnet, B, 'CPT', [0.85 0.15]); % 1=stable, 2=unstable
wolffd@0: bnet.CPD{F} = tabular_CPD(bnet, F, 'CPT', [0.95 0.05]); % 1=intact, 2=defect
wolffd@0: bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [2/7 5/7]); % 1=industrial, 2=household
wolffd@0:

wolffd@0: wolffd@0:

wolffd@0: The parameters of the continuous nodes can be specified as follows. wolffd@0:

wolffd@0: bnet.CPD{E} = gaussian_CPD(bnet, E, 'mean', [-3.9 -0.4 -3.2 -0.5], ...
wolffd@0: 			   'cov', [0.00002 0.0001 0.00002 0.0001]);
wolffd@0: bnet.CPD{D} = gaussian_CPD(bnet, D, 'mean', [6.5 6.0 7.5 7.0], ...
wolffd@0: 			   'cov', [0.03 0.04 0.1 0.1], 'weights', [1 1 1 1]);
wolffd@0: bnet.CPD{C} = gaussian_CPD(bnet, C, 'mean', [-2 -1], 'cov', [0.1 0.3]);
wolffd@0: bnet.CPD{L} = gaussian_CPD(bnet, L, 'mean', 3, 'cov', 0.25, 'weights', -0.5);
wolffd@0: bnet.CPD{Min} = gaussian_CPD(bnet, Min, 'mean', [0.5 -0.5], 'cov', [0.01 0.005]);
wolffd@0: bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 'mean', 0, 'cov', 0.002, 'weights', [1 1]);
wolffd@0:

wolffd@0: wolffd@0: wolffd@0:

Inference

wolffd@0: wolffd@0: wolffd@0: First we compute the unconditional marginals. wolffd@0:

wolffd@0: engine = jtree_inf_engine(bnet);
wolffd@0: evidence = cell(1,n);
wolffd@0: [engine, ll] = enter_evidence(engine, evidence);
wolffd@0: marg = marginal_nodes(engine, E);
wolffd@0:

wolffd@0: wolffd@0: 'marg' is a structure that contains the fields 'mu' and 'Sigma', which wolffd@0: contain the mean and (co)variance of the marginal on E. wolffd@0: In this case, they are both scalars. wolffd@0: Let us check they match the published figures (to 2 decimal places). wolffd@0: wolffd@0:

wolffd@0: tol = 1e-2;
wolffd@0: assert(approxeq(marg.mu, -3.25, tol));
wolffd@0: assert(approxeq(sqrt(marg.Sigma), 0.709, tol));
wolffd@0:

wolffd@0: We can compute the other posteriors similarly. wolffd@0: Now let us add some evidence. wolffd@0:

wolffd@0: evidence = cell(1,n);
wolffd@0: evidence{W} = 1; % industrial
wolffd@0: evidence{L} = 1.1;
wolffd@0: evidence{C} = -0.9;
wolffd@0: [engine, ll] = enter_evidence(engine, evidence);
wolffd@0:

wolffd@0: Now we find wolffd@0:

wolffd@0: marg = marginal_nodes(engine, E);
wolffd@0: assert(approxeq(marg.mu, -3.8983, tol));
wolffd@0: assert(approxeq(sqrt(marg.Sigma), 0.0763, tol));
wolffd@0:

wolffd@0: wolffd@0: wolffd@0: We can also compute the joint probability on a set of nodes. wolffd@0: For example, P(D, Mout | evidence) is a 2D Gaussian: wolffd@0:

wolffd@0: marg = marginal_nodes(engine, [D Mout])
wolffd@0: marg = 
wolffd@0:     domain: [6 8]
wolffd@0:         mu: [2x1 double]
wolffd@0:      Sigma: [2x2 double]
wolffd@0:          T: 1.0000
wolffd@0:

wolffd@0: The mean is wolffd@0:

wolffd@0: marg.mu
wolffd@0: ans =
wolffd@0:     3.6077
wolffd@0:     4.1077
wolffd@0:

wolffd@0: and the covariance matrix is wolffd@0:

wolffd@0: marg.Sigma
wolffd@0: ans =
wolffd@0:     0.1062    0.1062
wolffd@0:     0.1062    0.1182
wolffd@0:

wolffd@0: It is easy to visualize this posterior using standard Matlab plotting wolffd@0: functions, e.g., wolffd@0:

wolffd@0: gaussplot2d(marg.mu, marg.Sigma);
wolffd@0:

wolffd@0: produces the following picture. wolffd@0: wolffd@0:

wolffd@0:

wolffd@0: The interpretation is wolffd@0: Sigma(i,j,k) = Cov[ E(i) E(j) | F=k ]. wolffd@0: In this case, E is a scalar, so i=j=1; k specifies the mixture component. wolffd@0:

wolffd@0: wolffd@0: wolffd@0: The T field indicates that the mixing weight of this Gaussian wolffd@0: component is 1.0. wolffd@0: If the joint contains discrete and continuous variables, the result wolffd@0: will be a mixture of Gaussians, e.g., wolffd@0:

wolffd@0: marg = marginal_nodes(engine, [F E])
wolffd@0:     domain: [1 3]
wolffd@0:         mu: [-3.9000 -0.4003]
wolffd@0:      Sigma: [1x1x2 double]
wolffd@0:          T: [0.9995 4.7373e-04]
wolffd@0:

wolffd@0: We saw in the sprinkler network that BNT sets the effective size of wolffd@0: observed discrete nodes to 1, since they only have one legal value. wolffd@0: For continuous nodes, BNT sets their length to 0, wolffd@0: since they have been reduced to a point. wolffd@0: For example, wolffd@0:

wolffd@0: marg = marginal_nodes(engine, [B C])
wolffd@0:     domain: [4 5]
wolffd@0:         mu: []
wolffd@0:      Sigma: []
wolffd@0:          T: [0.0123 0.9877]
wolffd@0:

wolffd@0: It is simple to post-process the output of marginal_nodes. wolffd@0: For example, the file BNT/examples/static/cg1 sets the mu term of wolffd@0: observed nodes to their observed value, and the Sigma term to 0 (since wolffd@0: observed nodes have no variance). wolffd@0: wolffd@0:

wolffd@0: Note that the implemented version of the junction tree is numerically wolffd@0: unstable when using CG potentials wolffd@0: (which is why, in the example above, we only required our answers to agree with wolffd@0: the published ones to 2dp.) wolffd@0: This is why you might want to use stab_cond_gauss_inf_engine, wolffd@0: implemented by Shan Huang. This is described in wolffd@0: wolffd@0:

"Stable Local Computation with Conditional Gaussian Distributions", wolffd@0: S. Lauritzen and F. Jensen, Tech Report R-99-2014, wolffd@0: Dept. Math. Sciences, Allborg Univ., 1999. wolffd@0:

wolffd@0: wolffd@0: However, even the numerically stable version wolffd@0: can be computationally intractable if there are many hidden discrete wolffd@0: nodes, because the number of mixture components grows exponentially e.g., in a wolffd@0: switching linear dynamical system. wolffd@0: In general, one must resort to approximate inference techniques: see wolffd@0: the discussion on inference engines below. wolffd@0: wolffd@0: wolffd@0:

Other hybrid models

wolffd@0: wolffd@0: When we have C->D arcs, where C is hidden, we need to use wolffd@0: approximate inference. wolffd@0: One approach (not implemented in BNT) is described in wolffd@0:

wolffd@0: wolffd@0: The parameter estimation routines in BNT can be classified into 4 wolffd@0: types, depending on whether the goal is to compute wolffd@0: a full (Bayesian) posterior over the parameters or just a point wolffd@0: estimate (e.g., Maximum Likelihood or Maximum A Posteriori), wolffd@0: and whether all the variables are fully observed or there is missing wolffd@0: data/ hidden variables (partial observability). wolffd@0:

A wolffd@0: Variational Approximation for Bayesian Networks with wolffd@0: Discrete and Continuous Latent Variables, wolffd@0: K. Murphy, UAI 99. wolffd@0:

wolffd@0: Of course, one can always use sampling methods wolffd@0: for approximate inference in such models. wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Parameter Learning

wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:
Full obs Partial obs
Point learn_params learn_params_em
Bayes bayes_update_params not yet supported
wolffd@0: wolffd@0: wolffd@0:

	Full obs	Partial obs
Point	`learn_params`	`learn_params_em`
Bayes	`bayes_update_params`	not yet supported

Loading data from a file

wolffd@0: wolffd@0: To load numeric data from an ASCII text file called 'dat.txt', where each row is a wolffd@0: case and columns are separated by white-space, such as wolffd@0:

wolffd@0: 011979 1626.5 0.0
wolffd@0: 021979 1367.0 0.0
wolffd@0: ...
wolffd@0:

wolffd@0: you can use wolffd@0:

wolffd@0: data = load('dat.txt');
wolffd@0:

wolffd@0: or wolffd@0:

wolffd@0: load dat.txt -ascii
wolffd@0:

wolffd@0: In the latter case, the data is stored in a variable called 'dat' (the wolffd@0: filename minus the extension). wolffd@0: Alternatively, suppose the data is stored in a .csv file (has commas wolffd@0: separating the columns, and contains a header line), such as wolffd@0:

wolffd@0: header info goes here
wolffd@0: ORD,011979,1626.5,0.0
wolffd@0: DSM,021979,1367.0,0.0
wolffd@0: ...
wolffd@0:

wolffd@0: You can load this using wolffd@0:

wolffd@0: [a,b,c,d] = textread('dat.txt', '%s %d %f %f', 'delimiter', ',', 'headerlines', 1);
wolffd@0:

wolffd@0: If your file is not in either of these formats, you can either use Perl to convert wolffd@0: it to this format, or use the Matlab scanf command. wolffd@0: Type wolffd@0: wolffd@0: help iofun wolffd@0: wolffd@0: for more information on Matlab's file functions. wolffd@0: wolffd@0:

wolffd@0: BNT learning routines require data to be stored in a cell array. wolffd@0: data{i,m} is the value of node i in case (example) m, i.e., each wolffd@0: column is a case. wolffd@0: If node i is not observed in case m (missing value), set wolffd@0: data{i,m} = []. wolffd@0: (Not all the learning routines can cope with such missing values, however.) wolffd@0: In the special case that all the nodes are observed and are wolffd@0: scalar-valued (as opposed to vector-valued), the data can be wolffd@0: stored in a matrix (as opposed to a cell-array). wolffd@0:

wolffd@0: Suppose, as in the mixture of experts example, wolffd@0: that we have 3 nodes in the graph: X(1) is the observed input, X(3) is wolffd@0: the observed output, and X(2) is a hidden (gating) node. We can wolffd@0: create the dataset as follows. wolffd@0:

wolffd@0: data = load('dat.txt');
wolffd@0: ncases = size(data, 1);
wolffd@0: cases = cell(3, ncases);
wolffd@0: cases([1 3], :) = num2cell(data');
wolffd@0:

wolffd@0: Notice how we transposed the data, to convert rows into columns. wolffd@0: Also, cases{2,m} = [] for all m, since X(2) is always hidden. wolffd@0: wolffd@0: wolffd@0:

Maximum likelihood parameter estimation from complete data

wolffd@0: wolffd@0: As an example, let's generate some data from the sprinkler network, randomize the parameters, wolffd@0: and then try to recover the original model. wolffd@0: First we create some training data using forwards sampling. wolffd@0:

wolffd@0: samples = cell(N, nsamples);
wolffd@0: for i=1:nsamples
wolffd@0:   samples(:,i) = sample_bnet(bnet);
wolffd@0: end
wolffd@0:

wolffd@0: samples{j,i} contains the value of the j'th node in case i. wolffd@0: sample_bnet returns a cell array because, in general, each node might wolffd@0: be a vector of different length. wolffd@0: In this case, all nodes are discrete (and hence scalars), so we wolffd@0: could have used a regular array instead (which can be quicker): wolffd@0:

wolffd@0: data = cell2num(samples);
wolffd@0:

wolffd@0: Now we create a network with random parameters. wolffd@0: (The initial values of bnet2 don't matter in this case, since we can find the wolffd@0: globally optimal MLE independent of where we start.) wolffd@0:

wolffd@0: % Make a tabula rasa
wolffd@0: bnet2 = mk_bnet(dag, node_sizes);
wolffd@0: seed = 0;
wolffd@0: rand('state', seed);
wolffd@0: bnet2.CPD{C} = tabular_CPD(bnet2, C);
wolffd@0: bnet2.CPD{R} = tabular_CPD(bnet2, R);
wolffd@0: bnet2.CPD{S} = tabular_CPD(bnet2, S);
wolffd@0: bnet2.CPD{W} = tabular_CPD(bnet2, W);
wolffd@0:

wolffd@0: Finally, we find the maximum likelihood estimates of the parameters. wolffd@0:

wolffd@0: bnet3 = learn_params(bnet2, samples);
wolffd@0:

wolffd@0: To view the learned parameters, we use a little Matlab hackery. wolffd@0:

wolffd@0: CPT3 = cell(1,N);
wolffd@0: for i=1:N
wolffd@0:   s=struct(bnet3.CPD{i});  % violate object privacy
wolffd@0:   CPT3{i}=s.CPT;
wolffd@0: end
wolffd@0:

wolffd@0: Here are the parameters learned for node 4. wolffd@0:

wolffd@0: dispcpt(CPT3{4})
wolffd@0: 1 1 : 1.0000 0.0000 
wolffd@0: 2 1 : 0.2000 0.8000 
wolffd@0: 1 2 : 0.2273 0.7727 
wolffd@0: 2 2 : 0.0000 1.0000 
wolffd@0:

wolffd@0: So we see that the learned parameters are fairly close to the "true" wolffd@0: ones, which we display below. wolffd@0:

wolffd@0: dispcpt(CPT{4})
wolffd@0: 1 1 : 1.0000 0.0000 
wolffd@0: 2 1 : 0.1000 0.9000 
wolffd@0: 1 2 : 0.1000 0.9000 
wolffd@0: 2 2 : 0.0100 0.9900 
wolffd@0:

wolffd@0: We can get better results by using a larger training set, or using wolffd@0: informative priors (see below). wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Parameter priors

wolffd@0: wolffd@0: Currently, only tabular CPDs can have priors on their parameters. wolffd@0: The conjugate prior for a multinomial is the Dirichlet. wolffd@0: (For binary random variables, the multinomial is the same as the wolffd@0: Bernoulli, and the Dirichlet is the same as the Beta.) wolffd@0:

wolffd@0: The Dirichlet has a simple interpretation in terms of pseudo counts. wolffd@0: If we let N_ijk = the num. times X_i=k and Pa_i=j occurs in the wolffd@0: training set, where Pa_i are the parents of X_i, wolffd@0: then the maximum likelihood (ML) estimate is wolffd@0: T_ijk = N_ijk / N_ij (where N_ij = sum_k' N_ijk'), which will be 0 if N_ijk=0. wolffd@0: To prevent us from declaring that (X_i=k, Pa_i=j) is impossible just because this wolffd@0: event was not seen in the training set, wolffd@0: we can pretend we saw value k of X_i, for each value j of Pa_i some number (alpha_ijk) wolffd@0: of times in the past. wolffd@0: The MAP (maximum a posterior) estimate is then wolffd@0:

wolffd@0: T_ijk = (N_ijk + alpha_ijk) / (N_ij + alpha_ij)
wolffd@0:

wolffd@0: and is never 0 if all alpha_ijk > 0. wolffd@0: For example, consider the network A->B, where A is binary and B has 3 wolffd@0: values. wolffd@0: A uniform prior for B has the form wolffd@0:

wolffd@0:     B=1 B=2 B=3
wolffd@0: A=1 1   1   1
wolffd@0: A=2 1   1   1
wolffd@0:

wolffd@0: which can be created using wolffd@0:

wolffd@0: tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'unif');
wolffd@0:

wolffd@0: This prior does not satisfy the likelihood equivalence principle, wolffd@0: which says that Markov equivalent models wolffd@0: should have the same marginal likelihood. wolffd@0: A prior that does satisfy this principle is shown below. wolffd@0: Heckerman (1995) calls this the wolffd@0: BDeu prior (likelihood equivalent uniform Bayesian Dirichlet). wolffd@0:

wolffd@0:     B=1 B=2 B=3
wolffd@0: A=1 1/6 1/6 1/6
wolffd@0: A=2 1/6 1/6 1/6
wolffd@0:

wolffd@0: where we put N/(q*r) in each bin; N is the equivalent sample size, wolffd@0: r=|A|, q = |B|. wolffd@0: This can be created as follows wolffd@0:

wolffd@0: tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu');
wolffd@0:

wolffd@0: Here, 1 is the equivalent sample size, and is the strength of the wolffd@0: prior. wolffd@0: You can change this using wolffd@0:

wolffd@0: tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', ...
wolffd@0:    'BDeu', 'dirichlet_weight', 10);
wolffd@0:

wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:

(Sequential) Bayesian parameter updating from complete data

wolffd@0: wolffd@0: If we use conjugate priors and have fully observed data, we can wolffd@0: compute the posterior over the parameters in batch form as follows. wolffd@0:

wolffd@0: cases = sample_bnet(bnet, nsamples);
wolffd@0: bnet = bayes_update_params(bnet, cases);  
wolffd@0: LL = log_marg_lik_complete(bnet, cases);   
wolffd@0:

wolffd@0: bnet.CPD{i}.prior contains the new Dirichlet pseudocounts, wolffd@0: and bnet.CPD{i}.CPT is set to the mean of the posterior (the wolffd@0: normalized counts). wolffd@0: (Hence if the initial pseudo counts are 0, wolffd@0: bayes_update_params and learn_params will give the wolffd@0: same result.) wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:

wolffd@0: We can compute the same result sequentially (on-line) as follows. wolffd@0:

wolffd@0: LL = 0;
wolffd@0: for m=1:nsamples
wolffd@0:   LL = LL + log_marg_lik_complete(bnet, cases(:,m));
wolffd@0:   bnet = bayes_update_params(bnet, cases(:,m));
wolffd@0: end
wolffd@0:

wolffd@0: wolffd@0: The file BNT/examples/static/StructLearn/model_select1 has an example of wolffd@0: sequential model selection which uses the same idea. wolffd@0: We generate data from the model A->B wolffd@0: and compute the posterior prob of all 3 dags on 2 nodes: wolffd@0: (1) A B, (2) A <- B , (3) A -> B wolffd@0: Models 2 and 3 are Markov equivalent, and therefore indistinguishable from wolffd@0: observational data alone, so we expect their posteriors to be the same wolffd@0: (assuming a prior which satisfies likelihood equivalence). wolffd@0: If we use random parameters, the "true" model only gets a higher posterior after 2000 trials! wolffd@0: However, if we make B a noisy NOT gate, the true model "wins" after 12 wolffd@0: trials, as shown below (red = model 1, blue/green (superimposed) wolffd@0: represents models 2/3). wolffd@0:

wolffd@0: wolffd@0:

wolffd@0: The use of marginal likelihood for model selection is discussed in wolffd@0: greater detail in the wolffd@0: section on structure learning. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Maximum likelihood parameter estimation with missing values

wolffd@0: wolffd@0: Now we consider learning when some values are not observed. wolffd@0: Let us randomly hide half the values generated from the water wolffd@0: sprinkler example. wolffd@0:

wolffd@0: samples2 = samples;
wolffd@0: hide = rand(N, nsamples) > 0.5;
wolffd@0: [I,J]=find(hide);
wolffd@0: for k=1:length(I)
wolffd@0:   samples2{I(k), J(k)} = [];
wolffd@0: end
wolffd@0:

wolffd@0: samples2{i,l} is the value of node i in training case l, or [] if unobserved. wolffd@0:

wolffd@0: Now we will compute the MLEs using the EM algorithm. wolffd@0: We need to use an inference algorithm to compute the expected wolffd@0: sufficient statistics in the E step; the M (maximization) step is as wolffd@0: above. wolffd@0:

wolffd@0: engine2 = jtree_inf_engine(bnet2);
wolffd@0: max_iter = 10;
wolffd@0: [bnet4, LLtrace] = learn_params_em(engine2, samples2, max_iter);
wolffd@0:

wolffd@0: LLtrace(i) is the log-likelihood at iteration i. We can plot this as wolffd@0: follows: wolffd@0:

wolffd@0: plot(LLtrace, 'x-')
wolffd@0:

wolffd@0: Let's display the results after 10 iterations of EM. wolffd@0:

wolffd@0: celldisp(CPT4)
wolffd@0: CPT4{1} =
wolffd@0:     0.6616
wolffd@0:     0.3384
wolffd@0: CPT4{2} =
wolffd@0:     0.6510    0.3490
wolffd@0:     0.8751    0.1249
wolffd@0: CPT4{3} =
wolffd@0:     0.8366    0.1634
wolffd@0:     0.0197    0.9803
wolffd@0: CPT4{4} =
wolffd@0: (:,:,1) =
wolffd@0:     0.8276    0.0546
wolffd@0:     0.5452    0.1658
wolffd@0: (:,:,2) =
wolffd@0:     0.1724    0.9454
wolffd@0:     0.4548    0.8342
wolffd@0:

wolffd@0: We can get improved performance by using one or more of the following wolffd@0: methods: wolffd@0:

Increasing the size of the training set. wolffd@0:
Decreasing the amount of hidden data. wolffd@0:
Running EM for longer. wolffd@0:
Using informative priors. wolffd@0:
Initialising EM from multiple starting points. wolffd@0:

wolffd@0: wolffd@0: Click here for a discussion of learning wolffd@0: Gaussians, which can cause numerical problems. wolffd@0:

wolffd@0: For a more complete example of learning with EM, wolffd@0: see the script BNT/examples/static/learn1.m. wolffd@0: wolffd@0:

Parameter tying

wolffd@0: wolffd@0: In networks with repeated structure (e.g., chains and grids), it is wolffd@0: common to assume that the parameters are the same at every node. This wolffd@0: is called parameter tying, and reduces the amount of data needed for wolffd@0: learning. wolffd@0:

wolffd@0: When we have tied parameters, there is no longer a one-to-one wolffd@0: correspondence between nodes and CPDs. wolffd@0: Rather, each CPD species the parameters for a whole equivalence class wolffd@0: of nodes. wolffd@0: It is easiest to see this by example. wolffd@0: Consider the following hidden Markov wolffd@0: model (HMM) wolffd@0:

wolffd@0: wolffd@0:

wolffd@0: wolffd@0: When HMMs are used for semi-infinite processes like speech recognition, wolffd@0: we assume the transition matrix wolffd@0: P(H(t+1)|H(t)) is the same for all t; this is called a time-invariant wolffd@0: or homogenous Markov chain. wolffd@0: Hence hidden nodes 2, 3, ..., T wolffd@0: are all in the same equivalence class, say class Hclass. wolffd@0: Similarly, the observation matrix P(O(t)|H(t)) is assumed to be the wolffd@0: same for all t, so the observed nodes are all in the same equivalence wolffd@0: class, say class Oclass. wolffd@0: Finally, the prior term P(H(1)) is in a class all by itself, say class wolffd@0: H1class. wolffd@0: This is illustrated below, where we explicitly represent the wolffd@0: parameters as random variables (dotted nodes). wolffd@0:

wolffd@0: wolffd@0:

wolffd@0: In BNT, we cannot represent parameters as random variables (nodes). wolffd@0: Instead, we "hide" the wolffd@0: parameters inside one CPD for each equivalence class, wolffd@0: and then specify that the other CPDs should share these parameters, as wolffd@0: follows. wolffd@0:

wolffd@0: hnodes = 1:2:2*T;
wolffd@0: onodes = 2:2:2*T;
wolffd@0: H1class = 1; Hclass = 2; Oclass = 3;
wolffd@0: eclass = ones(1,N);
wolffd@0: eclass(hnodes(2:end)) = Hclass;
wolffd@0: eclass(hnodes(1)) = H1class;
wolffd@0: eclass(onodes) = Oclass;
wolffd@0: % create dag and ns in the usual way
wolffd@0: bnet = mk_bnet(dag, ns, 'discrete', dnodes, 'equiv_class', eclass);
wolffd@0:

wolffd@0: Finally, we define the parameters for each equivalence class: wolffd@0:

wolffd@0: bnet.CPD{H1class} = tabular_CPD(bnet, hnodes(1)); % prior
wolffd@0: bnet.CPD{Hclass} = tabular_CPD(bnet, hnodes(2)); % transition matrix
wolffd@0: if cts_obs
wolffd@0:   bnet.CPD{Oclass} = gaussian_CPD(bnet, onodes(1));
wolffd@0: else
wolffd@0:   bnet.CPD{Oclass} = tabular_CPD(bnet, onodes(1));
wolffd@0: end
wolffd@0:

wolffd@0: In general, if bnet.CPD{e} = xxx_CPD(bnet, j), then j should be a wolffd@0: member of e's equivalence class; that is, it is not always the case wolffd@0: that e == j. You can use bnet.rep_of_eclass(e) to return the wolffd@0: representative of equivalence class e. wolffd@0: BNT will look up the parents of j to determine the size wolffd@0: of the CPT to use. It assumes that this is the same for all members of wolffd@0: the equivalence class. wolffd@0: Click here for wolffd@0: a more complex example of parameter tying. wolffd@0:

wolffd@0: Note: wolffd@0: Normally one would define an HMM as a wolffd@0: Dynamic Bayes Net wolffd@0: (see the function BNT/examples/dynamic/mk_chmm.m). wolffd@0: However, one can define an HMM as a static BN using the function wolffd@0: BNT/examples/static/Models/mk_hmm_bnet.m. wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Structure learning

wolffd@0: wolffd@0: Update (9/29/03): wolffd@0: Phillipe LeRay is developing some additional structure learning code wolffd@0: on top of BNT. Click wolffd@0: wolffd@0: here wolffd@0: for details. wolffd@0: wolffd@0:

wolffd@0: wolffd@0: There are two very different approaches to structure learning: wolffd@0: constraint-based and search-and-score. wolffd@0: In the constraint-based approach, wolffd@0: we start with a fully connected graph, and remove edges if certain wolffd@0: conditional independencies are measured in the data. wolffd@0: This has the disadvantage that repeated independence tests lose wolffd@0: statistical power. wolffd@0:

wolffd@0: In the more popular search-and-score approach, wolffd@0: we perform a search through the space of possible DAGs, and either wolffd@0: return the best one found (a point estimate), or return a sample of the wolffd@0: models found (an approximation to the Bayesian posterior). wolffd@0:

wolffd@0: The number of DAGs as a function of the number of wolffd@0: nodes, G(n), is super-exponential in n, wolffd@0: and is given by the following recurrence wolffd@0: wolffd@0:

wolffd@0:

wolffd@0: wolffd@0:

wolffd@0: The first few values wolffd@0: are shown below. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:
n G(n)
1 1
2 3
3 25
4 543
5 29,281
6 3,781,503
7 1.1 x 10^9
8 7.8 x 10^11
9 1.2 x 10^15
10 4.2 x 10^18
wolffd@0: wolffd@0: Since the number of DAGs is super-exponential in the number of nodes, wolffd@0: we cannot exhaustively search the space, so we either use a local wolffd@0: search algorithm (e.g., greedy hill climbining, perhaps with multiple wolffd@0: restarts) or a global search algorithm (e.g., Markov Chain Monte wolffd@0: Carlo). wolffd@0:

n	G(n)
1	1
2	3
3	25
4	543
5	29,281
6	3,781,503
7	1.1 x 10^9
8	7.8 x 10^11
9	1.2 x 10^15
10	4.2 x 10^18

wolffd@0: If we know a total ordering on the nodes, wolffd@0: finding the best structure amounts to picking the best set of parents wolffd@0: for each node independently. wolffd@0: This is what the K2 algorithm does. wolffd@0: If the ordering is unknown, we can search over orderings, wolffd@0: which is more efficient than searching over DAGs (Koller and Friedman, 2000). wolffd@0:

wolffd@0: In addition to the search procedure, we must specify the scoring wolffd@0: function. There are two popular choices. The Bayesian score integrates wolffd@0: out the parameters, i.e., it is the marginal likelihood of the model. wolffd@0: The BIC (Bayesian Information Criterion) is defined as wolffd@0: log P(D|theta_hat) - 0.5*d*log(N), where D is the data, theta_hat is wolffd@0: the ML estimate of the parameters, d is the number of parameters, and wolffd@0: N is the number of data cases. wolffd@0: The BIC method has the advantage of not requiring a prior. wolffd@0:

wolffd@0: BIC can be derived as a large sample wolffd@0: approximation to the marginal likelihood. wolffd@0: (It is also equal to the Minimum Description Length of a model.) wolffd@0: However, in practice, the sample size does not need to be very large wolffd@0: for the approximation to be good. wolffd@0: For example, in the figure below, we plot the ratio between the log marginal likelihood wolffd@0: and the BIC score against data-set size; we see that the ratio rapidly wolffd@0: approaches 1, especially for non-informative priors. wolffd@0: (This plot was generated by the file BNT/examples/static/bic1.m. It wolffd@0: uses the water sprinkler BN with BDeu Dirichlet priors with different wolffd@0: equivalent sample sizes.) wolffd@0: wolffd@0:

wolffd@0:

wolffd@0: wolffd@0:

wolffd@0: As with parameter learning, handling missing data/ hidden variables is wolffd@0: much harder than the fully observed case. wolffd@0: The structure learning routines in BNT can therefore be classified into 4 wolffd@0: types, analogously to the parameter learning case. wolffd@0:

wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:
Full obs Partial obs
Point learn_struct_K2
wolffd@0: wolffd@0: not yet supported
Bayes learn_struct_mcmc not yet supported
wolffd@0: wolffd@0: wolffd@0:

	Full obs	Partial obs
Point	`learn_struct_K2` wolffd@0: wolffd@0:	`not yet supported`
Bayes	`learn_struct_mcmc`	not yet supported

Markov equivalence

wolffd@0: wolffd@0: If two DAGs encode the same conditional independencies, they are wolffd@0: called Markov equivalent. The set of all DAGs can be paritioned into wolffd@0: Markov equivalence classes. Graphs within the same class can wolffd@0: have wolffd@0: the direction of some of their arcs reversed without changing any of wolffd@0: the CI relationships. wolffd@0: Each class can be represented by a PDAG wolffd@0: (partially directed acyclic graph) called an essential graph or wolffd@0: pattern. This specifies which edges must be oriented in a certain wolffd@0: direction, and which may be reversed. wolffd@0: wolffd@0:

wolffd@0: When learning graph structure from observational data, wolffd@0: the best one can hope to do is to identify the model up to Markov wolffd@0: equivalence. To distinguish amongst graphs within the same equivalence wolffd@0: class, one needs interventional data: see the discussion on active learning below. wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Exhaustive search

wolffd@0: wolffd@0: The brute-force approach to structure learning is to enumerate all wolffd@0: possible DAGs, and score each one. This provides a "gold standard" wolffd@0: with which to compare other algorithms. We can do this as follows. wolffd@0:

wolffd@0: dags = mk_all_dags(N);
wolffd@0: score = score_dags(data, ns, dags);
wolffd@0:

wolffd@0: where data(i,m) is the value of node i in case m, wolffd@0: and ns(i) is the size of node i. wolffd@0: If the DAGs have a lot of families in common, we can cache the sufficient statistics, wolffd@0: making this potentially more efficient than scoring the DAGs one at a time. wolffd@0: (Caching is not currently implemented, however.) wolffd@0:

wolffd@0: By default, we use the Bayesian scoring metric, and assume CPDs are wolffd@0: represented by tables with BDeu(1) priors. wolffd@0: We can override these defaults as follows. wolffd@0: If we want to use uniform priors, we can say wolffd@0:

wolffd@0: params = cell(1,N);
wolffd@0: for i=1:N
wolffd@0:   params{i} = {'prior', 'unif'};
wolffd@0: end
wolffd@0: score = score_dags(data, ns, dags, 'params', params);
wolffd@0:

wolffd@0: params{i} is a cell-array, containing optional arguments that are wolffd@0: passed to the constructor for CPD i. wolffd@0:

wolffd@0: Now suppose we want to use different node types, e.g., wolffd@0: Suppose nodes 1 and 2 are Gaussian, and nodes 3 and 4 softmax (both wolffd@0: these CPDs can support discrete and continuous parents, which is wolffd@0: necessary since all other nodes will be considered as parents). wolffd@0: The Bayesian scoring metric currently only works for tabular CPDs, so wolffd@0: we will use BIC: wolffd@0:

wolffd@0: score = score_dags(data, ns, dags, 'discrete', [3 4], 'params', [], 
wolffd@0:     'type', {'gaussian', 'gaussian', 'softmax', softmax'}, 'scoring_fn', 'bic')
wolffd@0:

wolffd@0: In practice, one can't enumerate all possible DAGs for N > 5, wolffd@0: but one can evaluate any reasonably-sized set of hypotheses in this wolffd@0: way (e.g., nearest neighbors of your current best guess). wolffd@0: Think of this as "computer assisted model refinement" as opposed to de wolffd@0: novo learning. wolffd@0: wolffd@0: wolffd@0:

K2

wolffd@0: wolffd@0: The K2 algorithm (Cooper and Herskovits, 1992) is a greedy search algorithm that works as follows. wolffd@0: Initially each node has no parents. It then adds incrementally that parent whose addition most wolffd@0: increases the score of the resulting structure. When the addition of no single wolffd@0: parent can increase the score, it stops adding parents to the node. wolffd@0: Since we are using a fixed ordering, we do not need to check for wolffd@0: cycles, and can choose the parents for each node independently. wolffd@0:

wolffd@0: The original paper used the Bayesian scoring wolffd@0: metric with tabular CPDs and Dirichlet priors. wolffd@0: BNT generalizes this to allow any kind of CPD, and either the Bayesian wolffd@0: scoring metric or BIC, as in the example above. wolffd@0: In addition, you can specify wolffd@0: an optional upper bound on the number of parents for each node. wolffd@0: The file BNT/examples/static/k2demo1.m gives an example of how to use K2. wolffd@0: We use the water sprinkler network and sample 100 cases from it as before. wolffd@0: Then we see how much data it takes to recover the generating structure: wolffd@0:

wolffd@0: order = [C S R W];
wolffd@0: max_fan_in = 2;
wolffd@0: sz = 5:5:100;
wolffd@0: for i=1:length(sz)
wolffd@0:   dag2 = learn_struct_K2(data(:,1:sz(i)), node_sizes, order, 'max_fan_in', max_fan_in);
wolffd@0:   correct(i) = isequal(dag, dag2);
wolffd@0: end
wolffd@0:

wolffd@0: Here are the results. wolffd@0:

wolffd@0: correct =
wolffd@0:   Columns 1 through 12 
wolffd@0:      0     0     0     0     0     0     0     1     0     1     1     1
wolffd@0:   Columns 13 through 20 
wolffd@0:      1     1     1     1     1     1     1     1
wolffd@0:

wolffd@0: So we see it takes about sz(10)=50 cases. (BIC behaves similarly, wolffd@0: showing that the prior doesn't matter too much.) wolffd@0: In general, we cannot hope to recover the "true" generating structure, wolffd@0: only one that is in its Markov equivalence wolffd@0: class. wolffd@0: wolffd@0: wolffd@0:

Hill-climbing

wolffd@0: wolffd@0: Hill-climbing starts at a specific point in space, wolffd@0: considers all nearest neighbors, and moves to the neighbor wolffd@0: that has the highest score; if no neighbors have higher wolffd@0: score than the current point (i.e., we have reached a local maximum), wolffd@0: the algorithm stops. One can then restart in another part of the space. wolffd@0:

wolffd@0: A common definition of "neighbor" is all graphs that can be wolffd@0: generated from the current graph by adding, deleting or reversing a wolffd@0: single arc, subject to the acyclicity constraint. wolffd@0: Other neighborhoods are possible: see wolffd@0: wolffd@0: Optimal Structure Identification with Greedy Search, Max wolffd@0: Chickering, JMLR 2002. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:

MCMC

wolffd@0: wolffd@0: We can use a Markov Chain Monte Carlo (MCMC) algorithm called wolffd@0: Metropolis-Hastings (MH) to search the space of all wolffd@0: DAGs. wolffd@0: The standard proposal distribution is to consider moving to all wolffd@0: nearest neighbors in the sense defined above. wolffd@0:

wolffd@0: The function can be called wolffd@0: as in the following example. wolffd@0:

wolffd@0: [sampled_graphs, accept_ratio] = learn_struct_mcmc(data, ns, 'nsamples', 100, 'burnin', 10);
wolffd@0:

wolffd@0: We can convert our set of sampled graphs to a histogram wolffd@0: (empirical posterior over all the DAGs) thus wolffd@0:

wolffd@0: all_dags = mk_all_dags(N);
wolffd@0: mcmc_post = mcmc_sample_to_hist(sampled_graphs, all_dags);
wolffd@0:

wolffd@0: To see how well this performs, let us compute the exact posterior exhaustively. wolffd@0:

wolffd@0:

wolffd@0: score = score_dags(data, ns, all_dags);
wolffd@0: post = normalise(exp(score)); % assuming uniform structural prior
wolffd@0:

wolffd@0: We plot the results below. wolffd@0: (The data set was 100 samples drawn from a random 4 node bnet; see the wolffd@0: file BNT/examples/static/mcmc1.) wolffd@0:

wolffd@0: subplot(2,1,1)
wolffd@0: bar(post)
wolffd@0: subplot(2,1,2)
wolffd@0: bar(mcmc_post)
wolffd@0:

wolffd@0:

wolffd@0:

wolffd@0: We can also plot the acceptance ratio versus number of MCMC steps, wolffd@0: as a crude convergence diagnostic. wolffd@0:

wolffd@0: clf
wolffd@0: plot(accept_ratio)
wolffd@0:

wolffd@0:

wolffd@0:

wolffd@0: Even though the number of samples needed by MCMC is theoretically wolffd@0: polynomial (not exponential) in the dimensionality of the search space, in practice it has been wolffd@0: found that MCMC does not converge in reasonable time for graphs with wolffd@0: more than about 10 nodes. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Active structure learning

wolffd@0: wolffd@0: As was mentioned above, wolffd@0: one can only learn a DAG up to Markov equivalence, even given infinite data. wolffd@0: If one is interested in learning the structure of a causal network, wolffd@0: one needs interventional data. wolffd@0: (By "intervention" we mean forcing a node to take on a specific value, wolffd@0: thereby effectively severing its incoming arcs.) wolffd@0:

wolffd@0: Most of the scoring functions accept an optional argument wolffd@0: that specifies whether a node was observed to have a certain value, or wolffd@0: was forced to have that value: we set clamped(i,m)=1 if node i was wolffd@0: forced in training case m. e.g., see the file wolffd@0: BNT/examples/static/cooper_yoo. wolffd@0:

wolffd@0: An interesting question is to decide which interventions to perform wolffd@0: (c.f., design of experiments). For details, see the following tech wolffd@0: report wolffd@0:

wolffd@0: Active learning of causal Bayes net structure, Kevin Murphy, March wolffd@0: 2001. wolffd@0:

wolffd@0: wolffd@0: wolffd@0:

Structural EM

wolffd@0: wolffd@0: Computing the Bayesian score when there is partial observability is wolffd@0: computationally challenging, because the parameter posterior becomes wolffd@0: multimodal (the hidden nodes induce a mixture distribution). wolffd@0: One therefore needs to use approximations such as BIC. wolffd@0: Unfortunately, search algorithms are still expensive, because we need wolffd@0: to run EM at each step to compute the MLE, which is needed to compute wolffd@0: the score of each model. An alternative approach is wolffd@0: to do the local search steps inside of the M step of EM, which is more wolffd@0: efficient since the data has been "filled in" - this is wolffd@0: called the structural EM algorithm (Friedman 1997), and provably wolffd@0: converges to a local maximum of the BIC score. wolffd@0:

wolffd@0: Wei Hu has implemented SEM for discrete nodes. wolffd@0: You can download his package from wolffd@0: here. wolffd@0: Please address all questions about this code to wolffd@0: wei.hu@intel.com. wolffd@0: See also Phl's implementation of SEM. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Visualizing the graph

wolffd@0: wolffd@0: Click here for more information wolffd@0: on graph visualization. wolffd@0: wolffd@0:

Constraint-based methods

wolffd@0: wolffd@0: The IC algorithm (Pearl and Verma, 1991), wolffd@0: and the faster, but otherwise equivalent, PC algorithm (Spirtes, Glymour, and Scheines 1993), wolffd@0: computes many conditional independence tests, wolffd@0: and combines these constraints into a wolffd@0: PDAG to represent the whole wolffd@0: Markov equivalence class. wolffd@0:

wolffd@0: IC*/FCI extend IC/PC to handle latent variables: see below. wolffd@0: (IC stands for inductive causation; PC stands for Peter and Clark, wolffd@0: the first names of Spirtes and Glymour; FCI stands for fast causal wolffd@0: inference. wolffd@0: What we, following Pearl (2000), call IC* was called wolffd@0: IC in the original Pearl and Verma paper.) wolffd@0: For details, see wolffd@0:

wolffd@0: Causation, wolffd@0: Prediction, and Search, Spirtes, Glymour and wolffd@0: Scheines (SGS), 2001 (2nd edition), MIT Press. wolffd@0:
wolffd@0: Causality: Models, Reasoning and Inference, J. Pearl, wolffd@0: 2000, Cambridge University Press. wolffd@0:

wolffd@0: wolffd@0:

wolffd@0: wolffd@0: The PC algorithm takes as arguments a function f, the number of nodes N, wolffd@0: the maximum fan in K, and additional arguments A which are passed to f. wolffd@0: The function f(X,Y,S,A) returns 1 if X is conditionally independent of Y given S, and 0 wolffd@0: otherwise. wolffd@0: For example, suppose we cheat by wolffd@0: passing in a CI "oracle" which has access to the true DAG; the oracle wolffd@0: tests for d-separation in this DAG, i.e., wolffd@0: f(X,Y,S) calls dsep(X,Y,S,dag). We can to this as follows. wolffd@0:

wolffd@0: pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag);
wolffd@0:

wolffd@0: pdag(i,j) = -1 if there is definitely an i->j arc, wolffd@0: and pdag(i,j) = 1 if there is either an i->j or and i<-j arc. wolffd@0:

wolffd@0: Applied to the sprinkler network, this returns wolffd@0:

wolffd@0: pdag =
wolffd@0:      0     1     1     0
wolffd@0:      1     0     0    -1
wolffd@0:      1     0     0    -1
wolffd@0:      0     0     0     0
wolffd@0:

wolffd@0: So as expected, we see that the V-structure at the W node is uniquely identified, wolffd@0: but the other arcs have ambiguous orientation. wolffd@0:

wolffd@0: We now give an example from p141 (1st edn) / p103 (2nd end) of the SGS wolffd@0: book. wolffd@0: This example concerns the female orgasm. wolffd@0: We are given a correlation matrix C between 7 measured factors (such wolffd@0: as subjective experiences of coital and masturbatory experiences), wolffd@0: derived from 281 samples, and want to learn a causal model of the wolffd@0: data. We will not discuss the merits of this type of work here, but wolffd@0: merely show how to reproduce the results in the SGS book. wolffd@0: Their program, wolffd@0: Tetrad, wolffd@0: makes use of the Fisher Z-test for conditional wolffd@0: independence, so we do the same: wolffd@0:

wolffd@0: max_fan_in = 4;
wolffd@0: nsamples = 281;
wolffd@0: alpha = 0.05;
wolffd@0: pdag = learn_struct_pdag_pc('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha);
wolffd@0:

wolffd@0: In this case, the CI test is wolffd@0:

wolffd@0: f(X,Y,S) = cond_indep_fisher_z(X,Y,S,  C,nsamples,alpha)
wolffd@0:

wolffd@0: The results match those of Fig 12a of SGS apart from two edge wolffd@0: differences; presumably this is due to rounding error (although it wolffd@0: could be a bug, either in BNT or in Tetrad). wolffd@0: This example can be found in the file BNT/examples/static/pc2.m. wolffd@0: wolffd@0:

wolffd@0: wolffd@0: The IC* algorithm (Pearl and Verma, 1991), wolffd@0: and the faster FCI algorithm (Spirtes, Glymour, and Scheines 1993), wolffd@0: are like the IC/PC algorithm, except that they can detect the presence wolffd@0: of latent variables. wolffd@0: See the file learn_struct_pdag_ic_star written by Tamar wolffd@0: Kushnir. The output is a matrix P, defined as follows wolffd@0: (see Pearl (2000), p52 for details): wolffd@0:

wolffd@0: % P(i,j) = -1 if there is either a latent variable L such that i <-L->j OR there is a directed edge from i->j.
wolffd@0: % P(i,j) = -2 if there is a marked directed i-*>j edge.
wolffd@0: % P(i,j) = P(j,i) = 1 if there is and undirected edge i--j
wolffd@0: % P(i,j) = P(j,i) = 2 if there is a latent variable L such that i<-L->j.
wolffd@0:

wolffd@0: wolffd@0: wolffd@0:

Philippe Leray's structure learning package

wolffd@0: wolffd@0: Philippe Leray has written a wolffd@0: wolffd@0: structure learning package that uses BNT. wolffd@0: wolffd@0: It currently (Juen 2003) has the following features: wolffd@0:

PC with Chi2 statistical test wolffd@0:
MWST : Maximum weighted Spanning Tree wolffd@0:
Hill Climbing wolffd@0:
Greedy Search wolffd@0:
Structural EM wolffd@0:
hist_ic : optimal Histogram based on IC information criterion wolffd@0:
cpdag_to_dag wolffd@0:
dag_to_cpdag wolffd@0:
... wolffd@0:

wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Inference engines

wolffd@0: wolffd@0: Up until now, we have used the junction tree algorithm for inference. wolffd@0: However, sometimes this is too slow, or not even applicable. wolffd@0: In general, there are many inference algorithms each of which make wolffd@0: different tradeoffs between speed, accuracy, complexity and wolffd@0: generality. Furthermore, there might be many implementations of the wolffd@0: same algorithm; for instance, a general purpose, readable version, wolffd@0: and a highly-optimized, specialized one. wolffd@0: To cope with this variety, we treat each inference algorithm as an wolffd@0: object, which we call an inference engine. wolffd@0: wolffd@0:

wolffd@0: An inference engine is an object that contains a bnet and supports the wolffd@0: 'enter_evidence' and 'marginal_nodes' methods. The engine constructor wolffd@0: takes the bnet as argument and may do some model-specific processing. wolffd@0: When 'enter_evidence' is called, the engine may do some wolffd@0: evidence-specific processing. Finally, when 'marginal_nodes' is wolffd@0: called, the engine may do some query-specific processing. wolffd@0: wolffd@0:

wolffd@0: The amount of work done when each stage is specified -- structure, wolffd@0: parameters, evidence, and query -- depends on the engine. The cost of wolffd@0: work done early in this sequence can be amortized. On the other hand, wolffd@0: one can make better optimizations if one waits until later in the wolffd@0: sequence. wolffd@0: For example, the parameters might imply wolffd@0: conditional indpendencies that are not evident in the graph structure, wolffd@0: but can nevertheless be exploited; the evidence indicates which nodes wolffd@0: are observed and hence can effectively be disconnected from the wolffd@0: graph; and the query might indicate that large parts of the network wolffd@0: are d-separated from the query nodes. (Since it is not the actual wolffd@0: values of the evidence that matters, just which nodes are observed, wolffd@0: many engines allow you to specify which nodes will be observed when they are constructed, wolffd@0: i.e., before calling 'enter_evidence'. Some engines can still cope if wolffd@0: the actual pattern of evidence is different, e.g., if there is missing wolffd@0: data.) wolffd@0:

wolffd@0: wolffd@0: Although being maximally lazy (i.e., only doing work when a query is wolffd@0: issued) may seem desirable, wolffd@0: this is not always the most efficient. wolffd@0: For example, wolffd@0: when learning using EM, we need to call marginal_nodes N times, where N is the wolffd@0: number of nodes. Variable elimination would end wolffd@0: up repeating a lot of work wolffd@0: each time marginal_nodes is called, making it inefficient for wolffd@0: learning. The junction tree algorithm, by contrast, uses dynamic wolffd@0: programming to avoid this redundant computation --- it calculates all wolffd@0: marginals in two passes during 'enter_evidence', so calling wolffd@0: 'marginal_nodes' takes constant time. wolffd@0:

wolffd@0: We will discuss some of the inference algorithms implemented in BNT wolffd@0: below, and finish with a summary of all wolffd@0: of them. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Variable elimination

wolffd@0: wolffd@0: The variable elimination algorithm, also known as bucket elimination wolffd@0: or peeling, is one of the simplest inference algorithms. wolffd@0: The basic idea is to "push sums inside of products"; this is explained wolffd@0: in more detail wolffd@0: here. wolffd@0:

wolffd@0: The principle of distributing sums over products can be generalized wolffd@0: greatly to apply to any commutative semiring. wolffd@0: This forms the basis of many common algorithms, such as Viterbi wolffd@0: decoding and the Fast Fourier Transform. For details, see wolffd@0: wolffd@0:

R. McEliece and S. M. Aji, 2000. wolffd@0: wolffd@0: wolffd@0: The Generalized Distributive Law, wolffd@0: IEEE Trans. Inform. Theory, vol. 46, no. 2 (March 2000), wolffd@0: pp. 325--343. wolffd@0: wolffd@0: wolffd@0:
wolffd@0: F. R. Kschischang, B. J. Frey and H.-A. Loeliger, 2001. wolffd@0: wolffd@0: Factor graphs and the sum-product algorithm wolffd@0: IEEE Transactions on Information Theory, February, 2001. wolffd@0: wolffd@0:

wolffd@0: wolffd@0:

wolffd@0: Choosing an order in which to sum out the variables so as to minimize wolffd@0: computational cost is known to be NP-hard. wolffd@0: The implementation of this algorithm in wolffd@0: var_elim_inf_engine makes no attempt to optimize this wolffd@0: ordering (in contrast, say, to jtree_inf_engine, which uses a wolffd@0: greedy search procedure to find a good ordering). wolffd@0:

wolffd@0: Note: unlike most algorithms, var_elim does all its computational work wolffd@0: inside of marginal_nodes, not inside of wolffd@0: enter_evidence. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Global inference methods

wolffd@0: wolffd@0: The simplest inference algorithm of all is to explicitely construct wolffd@0: the joint distribution over all the nodes, and then to marginalize it. wolffd@0: This is implemented in global_joint_inf_engine. wolffd@0: Since the size of the joint is exponential in the wolffd@0: number of discrete (hidden) nodes, this is not a very practical algorithm. wolffd@0: It is included merely for pedagogical and debugging purposes. wolffd@0:

wolffd@0: Three specialized versions of this algorithm have also been implemented, wolffd@0: corresponding to the cases where all the nodes are discrete (D), all wolffd@0: are Gaussian (G), and some are discrete and some Gaussian (CG). wolffd@0: They are called enumerative_inf_engine, wolffd@0: gaussian_inf_engine, wolffd@0: and cond_gauss_inf_engine respectively. wolffd@0:

wolffd@0: Note: unlike most algorithms, these global inference algorithms do all their computational work wolffd@0: inside of marginal_nodes, not inside of wolffd@0: enter_evidence. wolffd@0: wolffd@0: wolffd@0:

Quickscore

wolffd@0: wolffd@0: The junction tree algorithm is quite slow on the QMR network, wolffd@0: since the cliques are so big. wolffd@0: One simple trick we can use is to notice that hidden leaves do not wolffd@0: affect the posteriors on the roots, and hence do not need to be wolffd@0: included in the network. wolffd@0: A second trick is to notice that the negative findings can be wolffd@0: "absorbed" into the prior: wolffd@0: see the file wolffd@0: BNT/examples/static/mk_minimal_qmr_bnet for details. wolffd@0:

wolffd@0: wolffd@0: A much more significant speedup is obtained by exploiting special wolffd@0: properties of the noisy-or node, as done by the quickscore wolffd@0: algorithm. For details, see wolffd@0:

Heckerman, "A tractable inference algorithm for diagnosing multiple diseases", UAI 89. wolffd@0:
Rish and Dechter, "On the impact of causal independence", UCI wolffd@0: tech report, 1998. wolffd@0:

wolffd@0: wolffd@0: This has been implemented in BNT as a special-purpose inference wolffd@0: engine, which can be created and used as follows: wolffd@0:

wolffd@0: engine = quickscore_inf_engine(inhibit, leak, prior);
wolffd@0: engine = enter_evidence(engine, pos, neg);
wolffd@0: m = marginal_nodes(engine, i);
wolffd@0:

wolffd@0: wolffd@0: wolffd@0:

Belief propagation

wolffd@0: wolffd@0: Even using quickscore, exact inference takes time that is exponential wolffd@0: in the number of positive findings. wolffd@0: Hence for large networks we need to resort to approximate inference techniques. wolffd@0: See for example wolffd@0:

T. Jaakkola and M. Jordan, "Variational probabilistic inference and the wolffd@0: QMR-DT network", JAIR 10, 1999. wolffd@0: wolffd@0:
K. Murphy, Y. Weiss and M. Jordan, "Loopy belief propagation for approximate inference: an empirical study", wolffd@0: UAI 99. wolffd@0:

wolffd@0: The latter approximation wolffd@0: entails applying Pearl's belief propagation algorithm to a model even wolffd@0: if it has loops (hence the name loopy belief propagation). wolffd@0: Pearl's algorithm, implemented as pearl_inf_engine, gives wolffd@0: exact results when applied to singly-connected graphs wolffd@0: (a.k.a. polytrees, since wolffd@0: the underlying undirected topology is a tree, but a node may have wolffd@0: multiple parents). wolffd@0: To apply this algorithm to a graph with loops, wolffd@0: use pearl_inf_engine. wolffd@0: This can use a centralized or distributed message passing protocol. wolffd@0: You can use it as in the following example. wolffd@0:

wolffd@0: engine = pearl_inf_engine(bnet, 'max_iter', 30);
wolffd@0: engine = enter_evidence(engine, evidence);
wolffd@0: m = marginal_nodes(engine, i);
wolffd@0:

wolffd@0: We found that this algorithm often converges, and when it does, often wolffd@0: is very accurate, but it depends on the precise setting of the wolffd@0: parameter values of the network. wolffd@0: (See the file BNT/examples/static/qmr1 to repeat the experiment for yourself.) wolffd@0: Understanding when and why belief propagation converges/ works wolffd@0: is a topic of ongoing research. wolffd@0:

wolffd@0: pearl_inf_engine can exploit special structure in noisy-or wolffd@0: and gmux nodes to compute messages efficiently. wolffd@0:

wolffd@0: belprop_inf_engine is like pearl, but uses potentials to wolffd@0: represent messages. Hence this is slower. wolffd@0:

wolffd@0: belprop_fg_inf_engine is like belprop, wolffd@0: but is designed for factor graphs. wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Sampling

wolffd@0: wolffd@0: BNT now (Mar '02) has two sampling (Monte Carlo) inference algorithms: wolffd@0:

likelihood_weighting_inf_engine which does importance wolffd@0: sampling and can handle any node type. wolffd@0:

gibbs_sampling_inf_engine, written by Bhaskara Marthi. wolffd@0: Currently this can only handle tabular CPDs. wolffd@0: For a much faster and more powerful Gibbs sampling program, see wolffd@0: BUGS. wolffd@0:

wolffd@0: Note: To generate samples from a network (which is not the same as inference!), wolffd@0: use sample_bnet. wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Summary of inference engines

wolffd@0: wolffd@0: wolffd@0: The inference engines differ in many ways. Here are wolffd@0: some of the major "axes": wolffd@0:

Works for all topologies or makes restrictions? wolffd@0:
Works for all node types or makes restrictions? wolffd@0:
Exact or approximate inference? wolffd@0:

wolffd@0: wolffd@0: BNT implements an exact algorithm for solving LIMIDs (limited memory wolffd@0: influence diagrams), described in wolffd@0:

wolffd@0: In terms of topology, most engines handle any kind of DAG. wolffd@0: belprop_fg does approximate inference on factor graphs (FG), which wolffd@0: can be used to represent directed, undirected, and mixed (chain) wolffd@0: graphs. wolffd@0: (In the future, we plan to support exact inference on chain graphs.) wolffd@0: quickscore only works on QMR-like models. wolffd@0:

wolffd@0: In terms of node types: algorithms that use potentials can handle wolffd@0: discrete (D), Gaussian (G) or conditional Gaussian (CG) models. wolffd@0: Sampling algorithms can essentially handle any kind of node (distribution). wolffd@0: Other algorithms make more restrictive assumptions in exchange for wolffd@0: speed. wolffd@0:

wolffd@0: Finally, most algorithms are designed to give the exact answer. wolffd@0: The belief propagation algorithms are exact if applied to trees, and wolffd@0: in some other cases. wolffd@0: Sampling is considered approximate, even though, in the limit of an wolffd@0: infinite number of samples, it gives the exact answer. wolffd@0: wolffd@0:

wolffd@0: wolffd@0: Here is a summary of the properties wolffd@0: of all the engines in BNT which work on static networks. wolffd@0:

wolffd@0: wolffd@0:
wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:
Name wolffd@0: Exact? wolffd@0: Node type? wolffd@0: topology wolffd@0:
belprop wolffd@0: approx wolffd@0: D wolffd@0: DAG wolffd@0:
belprop_fg wolffd@0: approx wolffd@0: D wolffd@0: factor graph wolffd@0:
cond_gauss wolffd@0: exact wolffd@0: CG wolffd@0: DAG wolffd@0:
enumerative wolffd@0: exact wolffd@0: D wolffd@0: DAG wolffd@0:
gaussian wolffd@0: exact wolffd@0: G wolffd@0: DAG wolffd@0:
gibbs wolffd@0: approx wolffd@0: D wolffd@0: DAG wolffd@0:
global_joint wolffd@0: exact wolffd@0: D,G,CG wolffd@0: DAG wolffd@0:
jtree wolffd@0: exact wolffd@0: D,G,CG wolffd@0: DAG wolffd@0: b
likelihood_weighting wolffd@0: approx wolffd@0: any wolffd@0: DAG wolffd@0:
pearl wolffd@0: approx wolffd@0: D,G wolffd@0: DAG wolffd@0:
pearl wolffd@0: exact wolffd@0: D,G wolffd@0: polytree wolffd@0:
quickscore wolffd@0: exact wolffd@0: noisy-or wolffd@0: QMR wolffd@0:
stab_cond_gauss wolffd@0: exact wolffd@0: CG wolffd@0: DAG wolffd@0:
var_elim wolffd@0: exact wolffd@0: D,G,CG wolffd@0: DAG wolffd@0:
wolffd@0: wolffd@0: wolffd@0: wolffd@0:

Influence diagrams/ decision making