To specify this directed acyclic graph (dag), we create an adjacency matrix:

  N = 4;
  dag = zeros(N,N);
  C = 1; S = 2; R = 3; W = 4;
  dag(C,[R S]) = 1;
  dag(R,W) = 1;
  dag(S,W) = 1;
We have numbered the nodes as follows: Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4. The nodes must always be numbered in topological order, i.e., ancestors before descendants. For a more complicated graph, this is a little inconvenient: we will see how to get around this below.

In Matlab 6, you can use logical arrays instead of double arrays, which are 4 times smaller:

  dag = false(N,N);
  dag(C,[R S]) = true;
  ...

However, some graph functions (e.g., acyclic) do not work on logical arrays!

You can visualize the resulting graph structure using the methods discussed below. For details on GUIs, click here.
In addition to the graph structure, we must specify the size and type of each node. Here all nodes are discrete and binary:

  discrete_nodes = 1:N;
  node_sizes = 2*ones(1,N);

If the nodes were not binary, you could type e.g.,

  node_sizes = [4 2 3 5];

meaning that Cloudy has 4 possible values, Sprinkler has 2 possible values, etc. Note that these are cardinal values, not ordinal, i.e., they are not ordered in any way, like 'low', 'medium', 'high'.
We are now ready to make the Bayes net:

  bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes);

By default, all nodes are assumed to be discrete, so we can also just write

  bnet = mk_bnet(dag, node_sizes);

You may also specify which nodes will be observed. If you don't know, or if this is not fixed in advance, just use the empty list (the default).

  onodes = [];
  bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);

Note that optional arguments are specified using a name/value syntax. This is common for many BNT functions. In general, to find out more about a function (e.g., which optional arguments it takes), please see its documentation string by typing

  help mk_bnet

See also other useful Matlab tips.
It is possible to associate names with nodes, as follows:

  bnet = mk_bnet(dag, node_sizes, 'names', {'cloudy','S','R','W'}, 'discrete', 1:4);

You can then refer to a node by its name:

  C = bnet.names('cloudy'); % bnet.names is an associative array
  bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);

This feature uses my own associative array class.
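For instance, here is a minimal sketch (using the names chosen above) that looks up the remaining nodes by name before assigning their CPDs; the CPT values are the ones used elsewhere in this example.

  S = bnet.names('S');
  R = bnet.names('R');
  W = bnet.names('W');
  % assign a CPT to the WetGrass node found by name
  bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);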
Tabular CPDs, also called CPTs (conditional probability tables), are stored as multidimensional arrays, where the dimensions are arranged in the same order as the nodes, e.g., the CPT for node 4 (WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself. Hence the child is always the last dimension. If a node has no parents, its CPT is a column vector representing its prior. Note that in Matlab (unlike C), arrays are indexed from 1, and are laid out in memory such that the first index toggles fastest, e.g., the CPT for node 4 (WetGrass) is as follows
  S R | P(W=F)  P(W=T)
  ---------------------
  F F |  1.0     0.0
  T F |  0.1     0.9
  F T |  0.1     0.9
  T T |  0.01    0.99
where we have used the convention that false==1, true==2. We can create this CPT in Matlab as follows

  CPT = zeros(2,2,2);
  CPT(1,1,1) = 1.0;
  CPT(2,1,1) = 0.1;
  ...

Here is an easier way:

  CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);

In fact, we don't need to reshape the array, since the CPD constructor will do that for us. So we can just write

  bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);

The other nodes are created similarly (using the old syntax for optional parameters)

  bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
  bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
  bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
  bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
If we do not specify the CPT, random parameters will be created. To ensure repeatable results, set the seeds of the random number generators:

  rand('state', seed);
  randn('state', seed);

To control the degree of randomness (entropy), you can sample each row of the CPT from a Dirichlet(p,p,...) distribution. If p << 1, this encourages "deterministic" CPTs (one entry near 1, the rest near 0). If p = 1, each entry is drawn from U[0,1]. If p >> 1, the entries will all be near 1/k, where k is the arity of this node, i.e., each row will be nearly uniform. You can do this as follows, assuming this node is number i, and ns is the node_sizes.

  k = ns(i);
  ps = parents(dag, i);
  psz = prod(ns(ps));
  CPT = sample_dirichlet(p*ones(1,k), psz);
  bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT);
It is currently not possible to save/load a BNT matlab object to file, but this is easily fixed if you modify all the constructors for all the classes (see matlab documentation).

Having created the Bayes net, we can use it for inference by creating an inference engine, e.g., a junction tree engine:

  engine = jtree_inf_engine(bnet);

The other engines have similar constructors, but might take additional, algorithm-specific parameters. All engines are used in the same way, once they have been created. We illustrate this in the following sections.
Suppose we observe that the grass is wet. We specify this evidence as follows:

  evidence = cell(1,N);
  evidence{W} = 2;

We use a 1D cell array instead of a vector to cope with the fact that nodes can be vectors of different lengths. In addition, the value [] can be used to denote 'no evidence', instead of having to specify the observation pattern as a separate argument. (Click here for a quick tutorial on cell arrays in matlab.)

We are now ready to add the evidence to the engine.

  [engine, loglik] = enter_evidence(engine, evidence);

The behavior of this function is algorithm-specific, and is discussed in more detail below. In the case of the jtree engine, enter_evidence implements a two-pass message-passing scheme. The first return argument contains the modified engine, which incorporates the evidence. The second return argument contains the log-likelihood of the evidence. (Not all engines are capable of computing the log-likelihood.)

Finally, we can compute p = P(S=2|W=2) as follows.

  marg = marginal_nodes(engine, S);
  marg.T
  ans =
      0.57024
      0.42976
  p = marg.T(2);

We see that p = 0.4298.
Now let us add the evidence that it was raining, and see what difference it makes.

  evidence{R} = 2;
  [engine, loglik] = enter_evidence(engine, evidence);
  marg = marginal_nodes(engine, S);
  p = marg.T(2);

We find that p = P(S=2|W=2,R=2) = 0.1945, which is lower than before, because the rain can "explain away" the fact that the grass is wet.

You can plot a marginal distribution over a discrete variable as a barchart using the built-in 'bar' function:

  bar(marg.T)
What happens if we ask for the marginal on an observed node, e.g., P(W|W=2)?

  evidence = cell(1,N);
  evidence{W} = 2;
  engine = enter_evidence(engine, evidence);
  m = marginal_nodes(engine, W);
  m.T
  ans =
      1

This can get a little confusing, since we assigned W=2. So we can ask BNT to add the evidence back in by passing in an optional argument:

  m = marginal_nodes(engine, W, 1);
  m.T
  ans =
      0
      1

This shows that P(W=1|W=2) = 0 and P(W=2|W=2) = 1.
We can also compute the joint distribution on a set of nodes, as in the following example.

  evidence = cell(1,N);
  [engine, ll] = enter_evidence(engine, evidence);
  m = marginal_nodes(engine, [S R W]);

m is a structure. The 'T' field is a multi-dimensional array (in this case, 3-dimensional) that contains the joint probability distribution on the specified nodes.

  >> m.T
  ans(:,:,1) =
      0.2900    0.0410
      0.0210    0.0009
  ans(:,:,2) =
           0    0.3690
      0.1890    0.0891

We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass to be wet if both the rain and sprinkler are off.

Let us now add some evidence to R.

  evidence{R} = 2;
  [engine, ll] = enter_evidence(engine, evidence);
  m = marginal_nodes(engine, [S R W])
  m =
      domain: [2 3 4]
           T: [2x1x2 double]
  >> m.T
  ans(:,:,1) =
      0.0820
      0.0018
  ans(:,:,2) =
      0.7380
      0.1782

The joint T(i,j,k) = P(S=i,R=j,W=k|evidence) should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible with the evidence that R=2. Instead of creating large tables with many 0s, BNT sets the effective size of observed (discrete) nodes to 1, as explained above. This is why m.T has size 2x1x2. To get a 2x2x2 table, type

  m = marginal_nodes(engine, [S R W], 1)
  m =
      domain: [2 3 4]
           T: [2x2x2 double]
  >> m.T
  ans(:,:,1) =
      0    0.082
      0    0.0018
  ans(:,:,2) =
      0    0.738
      0    0.1782

Note: It is not always possible to compute the joint on arbitrary sets of nodes: it depends on which inference engine you use, as discussed in more detail below.
Sometimes a node is not observed exactly, but we have a likelihood over its possible values; this is called soft (or virtual) evidence. It can be entered as follows:

  [engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);

where soft_evidence{i} is either [] (if node i has no soft evidence) or is a vector representing the probability distribution over i's possible values. For example, if we don't know i's exact value, but we know its likelihood ratio is 60/40, we can write evidence{i} = [] and soft_evidence{i} = [0.6 0.4].

Currently only jtree_inf_engine supports this option. It assumes that all hidden nodes, and all nodes for which we have soft evidence, are discrete. For a longer example, see BNT/examples/static/softev1.m.
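As a rough sketch (not taken from the bundled examples), here is how soft evidence on the Rain node might be combined with hard evidence on WetGrass in the sprinkler network above:

  % hard evidence: the grass is wet
  evidence = cell(1,N);
  evidence{W} = 2;
  % soft evidence: Rain is unobserved, but its likelihood ratio is 60/40
  soft_evidence = cell(1,N);
  soft_evidence{R} = [0.6 0.4];
  engine = jtree_inf_engine(bnet);
  [engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
  marg = marginal_nodes(engine, S);   % P(S | W=2, soft evidence on R)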
To compute the most probable explanation (MPE) of the evidence, i.e., the jointly most likely assignment to the nodes, type

  [mpe, ll] = calc_mpe(engine, evidence);

mpe{i} is the most likely value of node i. This calls enter_evidence with the 'maximize' flag set to 1, which causes the engine to do max-product instead of sum-product. The resulting max-marginals are then thresholded. If there is more than one maximum probability assignment, we must take care to break ties in a consistent manner (thresholding the max-marginals may give the wrong result). To force this behavior, type

  [mpe, ll] = calc_mpe(engine, evidence, 1);

Note that computing the MPE is sometimes called abductive reasoning.

You can also use calc_mpe_bucket, written by Ron Zohar, which does a forwards max-product pass and then a backwards traceback pass, which is how Viterbi is traditionally implemented.
A noisy-OR node is like a logical OR gate, except that each parent that is "on" can be independently inhibited. Let q(A) and q(B) be the probabilities that parents A and B are inhibited. Then the CPD of the child C is

  A B | P(C=off)  P(C=on)
  ------------------------
  F F |  1.0       0.0
  T F |  q(A)      1-q(A)
  F T |  q(B)      1-q(B)
  T T |  q(A)q(B)  1-q(A)q(B)

Thus we see that the causes get inhibited independently. It is common to associate a "leak" node with a noisy-or CPD, which is like a parent that is always on. This can account for all other unmodelled causes which might turn the node on.
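In BNT such a CPD is created with noisyor_CPD, the same constructor used in the QMR example later in this document. A minimal sketch for a binary node i with the two binary parents above, using made-up inhibition values:

  leak_inhibit = 1.0;          % inhibition prob on the leak->child arc (1.0 = no leak effect)
  inhibit = [0.1 0.2];         % q(A), q(B): inhibition probs of the two parents (hypothetical)
  bnet.CPD{i} = noisyor_CPD(bnet, i, leak_inhibit, inhibit);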
The noisy-or distribution is similar to the logistic distribution. To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j) be the prob. that j inhibits i. Then

  Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j)

Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then

  Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j))

For a sigmoid node, we have

  Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j))

where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of the activation function (although both are monotonically increasing). In addition, in the case of a noisy-or, the weights are constrained to be positive, since they derive from probabilities q(i,j). In both cases, the number of parameters is linear in the number of parents, unlike the case of a multinomial distribution, where the number of parameters is exponential in the number of parents. We will see an example of noisy-OR nodes below.
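As a quick numerical check of the equivalence between the two noisy-OR forms above (with made-up inhibition probabilities):

  q = [0.2 0.3];                 % q(i,j) for two parents (hypothetical values)
  S = [1 1];                     % both parents are on
  w = -log(q);                   % w(i,j) = -ln q(i,j)
  p1 = 1 - prod(q.^S);           % 1 - prod_j q(i,j)^S(j)
  p2 = 1 - exp(-sum(w.*S));      % rho(sum_j w(i,j) S(j)), where rho(x) = 1 - exp(-x)
  % p1 and p2 are both 0.94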
Both of these classes are just "syntactic sugar" for the tabular_CPD class.
A softmax (multinomial logit) node has continuous parents X and defines

                     exp(w(:,i)'*x + b(i))
  Pr(Q=i | X=x) = -----------------------------
                  sum_j exp(w(:,j)'*x + b(j))

The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the following interpretation: w(:,i)-w(:,j) is the normal vector to the decision boundary between classes i and j, and b(i)-b(j) is its offset (bias). For example, suppose X is a 2-vector, and Q is binary. Then

  w = [1 -1;
       0  0];

  b = [0 0];

means class 1 are points in the 2D plane with positive x coordinate, and class 2 are points in the 2D plane with negative x coordinate. If w has large magnitude, the decision boundary is sharp, otherwise it is soft. In the special case that Q is binary (0/1), the softmax function reduces to the logistic (sigmoid) function.
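Here is a small sketch (with a hypothetical input point) that evaluates the softmax formula above for the w and b just given:

  x = [2; 3];                      % a test point with positive x coordinate (hypothetical)
  w = [1 -1; 0 0];
  b = [0 0];
  a = w'*x + b';                   % activations, one per class
  p = exp(a) / sum(exp(a));        % p(1) is close to 1, so x is assigned to class 1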
Fitting a softmax function can be done using the iteratively reweighted least squares (IRLS) algorithm. We use the implementation from Netlab. Note that since the softmax distribution is not in the exponential family, it does not have finite sufficient statistics, and hence we must store all the training data in uncompressed form. If this takes too much space, one should use online (stochastic) gradient descent (not implemented in BNT).

If a softmax node also has discrete parents, we use a different set of w/b parameters for each combination of parent values, as in the conditional linear Gaussian CPD. This feature was implemented by Pierpaolo Brutti. He is currently extending it so that discrete parents can be treated as if they were continuous, by adding indicator variables to the X vector.

We will see an example of softmax nodes below.
A Gaussian node Y may have continuous parents X and/or discrete parents Q. The distribution has the following form:

  - no parents:               Y ~ N(mu, Sigma)
  - cts parents:              Y|X=x ~ N(mu + W x, Sigma)
  - discrete parents:         Y|Q=i ~ N(mu(:,i), Sigma(:,:,i))
  - cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i))

where N(mu, Sigma) denotes a Normal distribution with mean mu and covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q respectively. If there are no discrete parents, |Q|=1; if there is more than one, then |Q| = a vector of the sizes of each discrete parent. If there are no continuous parents, |X|=0; if there is more than one, then |X| = the sum of their sizes. Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight) matrix.

We can create a Gaussian node with random parameters as follows.

  bnet.CPD{i} = gaussian_CPD(bnet, i);

We can specify the value of one or more of the parameters as in the following example, in which |Y|=2, and |Q|=1.

  bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', [0; 0], 'weights', randn(Y,X), 'cov', eye(Y));
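For the conditional linear Gaussian case (both kinds of parents), the parameter shapes listed above suggest the following sketch, with hypothetical sizes |Y|=2, |X|=3, |Q|=2:

  Ysz = 2; Xsz = 3; Qsz = 2;                    % hypothetical sizes
  mu    = zeros(Ysz, Qsz);                      % one mean column per discrete parent value
  Sigma = repmat(eye(Ysz), [1 1 Qsz]);          % one covariance per discrete parent value
  W     = randn(Ysz, Xsz, Qsz);                 % one regression matrix per discrete parent value
  bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', mu, 'cov', Sigma, 'weights', W);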
We will see an example of conditional linear Gaussian nodes below.

When learning Gaussians from data, it is helpful to ensure the data has a small magnitude (see e.g., KPMstats/standardize) to prevent numerical problems. Unless you have a lot of data, it is also a very good idea to use diagonal instead of full covariance matrices. (BNT does not currently support spherical covariances, although it would be easy to add, since KPMstats/clg_Mstep supports this option; you would just need to modify gaussian_CPD/update_ess to accumulate weighted inner products.)
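For example, a diagonal covariance can be requested via the 'cov_type' option, the same option used in the factor analysis example later in this document; a minimal sketch:

  % restrict node i to a diagonal covariance matrix
  bnet.CPD{i} = gaussian_CPD(bnet, i, 'cov_type', 'diag');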
The CPD_to_CPT method converts a CPD to a table; this requires that the child and all parents are discrete. The CPT might be exponentially big... convert_to_table evaluates a CPD with evidence, and represents the resulting potential as an array. This requires that the child is discrete, and any continuous parents are observed. convert_to_pot evaluates a CPD with evidence, and represents the resulting potential as a dpot, gpot, cgpot or upot, as requested. (d=discrete, g=Gaussian, cg = conditional Gaussian, u = utility).

When we sample a node, all the parents are observed. When we compute the (log) probability of a node, all the parents and the child are observed.

We also specify if the parameters are learnable. For learning with EM, we require the methods reset_ess, update_ess and maximize_params. For learning from fully observed data, we require the method learn_params. By default, all classes inherit this from generic_CPD, which simply calls update_ess N times, once for each data case, followed by maximize_params, i.e., it is like EM, without the E step. Some classes implement a batch formula, which is quicker.

Bayesian learning means computing a posterior over the parameters given fully observed data.

Pearl means we implement the methods compute_pi and compute_lambda_msg, used by pearl_inf_engine, which runs on directed graphs. belprop_inf_engine only needs convert_to_pot. The pearl methods can exploit special properties of the CPDs for computing the messages efficiently, whereas belprop does not.

The only method implemented by generic_CPD is adjustable_CPD, which is not shown, since it is not very interesting.
The following table summarizes the properties of the CPD classes (B = boolean, D = discrete, C = continuous):

  Name          | Child | Parents | Comments                    | CPD_to_CPT | conv_to_table          | conv_to_pot            | sample                 | prob                   | learn | Bayes | Pearl
  --------------|-------|---------|-----------------------------|------------|------------------------|------------------------|------------------------|------------------------|-------|-------|------
  boolean       | B     | B       | Syntactic sugar for tabular | -          | -                      | -                      | -                      | -                      | -     | -     | -
  deterministic | D     | D       | Syntactic sugar for tabular | -          | -                      | -                      | -                      | -                      | -     | -     | -
  Discrete      | D     | C/D     | Virtual class               | N          | Calls CPD_to_CPT       | Calls conv_to_table    | Calls conv_to_table    | Calls conv_to_table    | N     | N     | N
  Gaussian      | C     | C/D     | -                           | N          | N                      | Y                      | Y                      | Y                      | Y     | N     | N
  gmux          | C     | C/D     | multiplexer                 | N          | N                      | Y                      | N                      | N                      | N     | N     | Y
  MLP           | D     | C/D     | multi layer perceptron      | N          | Y                      | Inherits from discrete | Inherits from discrete | Inherits from discrete | Y     | N     | N
  noisy-or      | B     | B       | -                           | Y          | Inherits from discrete | Inherits from discrete | Inherits from discrete | Inherits from discrete | N     | N     | Y
  root          | C/D   | none    | no params                   | N          | N                      | Y                      | Y                      | Y                      | N     | N     | N
  softmax       | D     | C/D     | -                           | N          | Y                      | Inherits from discrete | Inherits from discrete | Inherits from discrete | Y     | N     | N
  generic       | C/D   | C/D     | Virtual class               | N          | N                      | N                      | N                      | N                      | N     | N     | N
  Tabular       | D     | D       | -                           | Y          | Inherits from discrete | Inherits from discrete | Inherits from discrete | Inherits from discrete | Y     | Y     | Y
[Figure: four related graphical models, labeled (a)-(d); panel (c) is the mixture of factor analyzers referred to below.]
In a factor analysis model, a low-dimensional hidden Gaussian node X (of size k) generates a D-dimensional observed Gaussian node Y. We can create this model in BNT as follows.

  ns = [k D];
  dag = zeros(2,2);
  dag(1,2) = 1;
  bnet = mk_bnet(dag, ns, 'discrete', []);
  bnet.CPD{1} = gaussian_CPD(bnet, 1, 'mean', zeros(k,1), 'cov', eye(k), ...
      'cov_type', 'diag', 'clamp_mean', 1, 'clamp_cov', 1);
  bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(D,1), 'cov', diag(Psi0), 'weights', W0, ...
      'cov_type', 'diag', 'clamp_mean', 1);

The root node is clamped to the N(0,I) distribution, so that we will not update these parameters during learning. The mean of the leaf node is clamped to 0, since we assume the data has been centered (had its mean subtracted off); this is just for simplicity. Finally, the covariance of the leaf node is constrained to be diagonal. W0 and Psi0 are the initial parameter guesses.
We can fit this model (i.e., estimate its parameters in a maximum likelihood (ML) sense) using EM, as we explain below. Not surprisingly, the ML estimates for mu and Psi turn out to be identical to the sample mean and variance, which can be computed directly as

  mu_ML = mean(data);
  Psi_ML = diag(cov(data));

Note that W can only be identified up to a rotation matrix, because of the spherical symmetry of the source.

If we restrict Psi to be spherical, i.e., Psi = sigma*I, there is a closed-form solution for W as well, i.e., we do not need to use EM. In particular, W contains the first |X| eigenvectors of the sample covariance matrix, with scalings determined by the eigenvalues and sigma. Classical PCA can be obtained by taking the sigma->0 limit.
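As a rough sketch of this closed-form (probabilistic PCA) solution, assuming the data matrix has one case per row and we want a k-dimensional subspace (plain Matlab, not BNT code):

  S = cov(data);                                  % sample covariance
  [V, D] = eig(S);
  [evals, order] = sort(diag(D), 'descend');
  V = V(:, order);
  sigma2 = mean(evals(k+1:end));                  % noise variance = mean of discarded eigenvalues
  W = V(:,1:k) * sqrt(diag(evals(1:k)) - sigma2*eye(k));   % defined up to an arbitrary rotation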
By adding a hidden discrete variable, we can create mixtures of FA models, as shown in (c). Now we can explain the data using a set of subspaces. We can create this model in BNT as follows.

  ns = [M k D];
  dag = zeros(3);
  dag(1,3) = 1;
  dag(2,3) = 1;
  bnet = mk_bnet(dag, ns, 'discrete', 1);
  bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0);
  bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(k, 1), 'cov', eye(k), 'cov_type', 'diag', ...
      'clamp_mean', 1, 'clamp_cov', 1);
  bnet.CPD{3} = gaussian_CPD(bnet, 3, 'mean', Mu0', 'cov', repmat(diag(Psi0), [1 1 M]), ...
      'weights', W0, 'cov_type', 'diag', 'tied_cov', 1);

Notice how the covariance matrix for Y is the same for all values of Q; that is, the noise level in each sub-space is assumed the same. However, we allow the offset, mu, to vary.

I have included Zoubin's specialized MFA code (with his permission) with the toolbox, so you can check that BNT gives the same results: see 'BNT/examples/static/mfa1.m'.
Independent Factor Analysis (IFA) generalizes FA by allowing a non-Gaussian prior on each component of X. (Note that we can approximate a non-Gaussian prior using a mixture of Gaussians.) This means that the likelihood function is no longer rotationally invariant, so we can uniquely identify W and the hidden sources X. IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y). We recover classical Independent Components Analysis (ICA) in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the weight matrix W is square and invertible.
We now turn to the mixture of experts model, shown below.

[Figure: the mixture of experts graph, with observed input X, hidden gating node(s) Q, and output Y.]
X is the observed input, Y is the output, and the Q nodes are hidden "gating" nodes, which select the appropriate set of parameters for Y. During training, Y is assumed observed, but for testing, the goal is to predict Y given X. Note that this is a conditional density model, so we don't associate any parameters with X. Hence X's CPD will be a root CPD, which is a way of modelling exogenous nodes. If the output is a continuous-valued quantity, we assume the "experts" are linear-regression units, and set Y's CPD to linear-Gaussian. If the output is discrete, we set Y's CPD to a softmax function. The Q CPDs will always be softmax functions.

As a concrete example, consider the mixture of experts model where X and Y are scalars, and Q is binary. This is just piecewise linear regression, where we have two line segments, i.e.,
  Pr(y | x) = sum_i Pr(Q=i | x) * N(y; w(i)*x + b(i), sigma(i)^2)

where the gate Pr(Q=i|x) is a softmax in x, and each expert i is a linear-Gaussian regression with its own slope, offset and noise variance.
We can create this model with random parameters as follows. (This code is bundled in BNT/examples/static/mixexp2.m.)

  X = 1;
  Q = 2;
  Y = 3;
  dag = zeros(3,3);
  dag(X,[Q Y]) = 1;
  dag(Q,Y) = 1;
  ns = [1 2 1]; % make X and Y scalars, and have 2 experts
  onodes = [1 3];
  bnet = mk_bnet(dag, ns, 'discrete', 2, 'observed', onodes);

  rand('state', 0);
  randn('state', 0);
  bnet.CPD{1} = root_CPD(bnet, 1);
  bnet.CPD{2} = softmax_CPD(bnet, 2);
  bnet.CPD{3} = gaussian_CPD(bnet, 3);

Now let us fit this model using EM. First we load the data (1000 training cases) and plot them.
  data = load('/examples/static/Misc/mixexp_data.txt', '-ascii');
  plot(data(:,1), data(:,2), '.');
This is what the model looks like before training. (Thanks to Thomas Hofman for writing this plotting routine.)
Now let's train the model, and plot the final performance. (We will discuss how to train models in more detail below.)

  ncases = size(data, 1); % each row of data is a training case
  cases = cell(3, ncases);
  cases([1 3], :) = num2cell(data'); % each column of cases is a training case
  engine = jtree_inf_engine(bnet);
  max_iter = 20;
  [bnet2, LLtrace] = learn_params_em(engine, cases, max_iter);

(We specify which nodes will be observed when we create the engine. Hence BNT knows that the hidden nodes are all discrete. For complex models, this can lead to a significant speedup.) Below we show what the model looks like after 16 iterations of EM (with 100 IRLS iterations per M step), when it converged using the default convergence tolerance (that the fractional change in the log-likelihood be less than 1e-3). Before learning, the log-likelihood was -322.927442; afterwards, it was -13.728778.
Pierpaolo Brutti has written an extensive set of routines for HMEs, which are bundled with BNT: see the examples/static/HME directory. These routines allow you to choose the number of hidden (gating) layers, and the form of the experts (softmax or MLP). See the file hmemenu, which provides a demo. For example, the figure below shows the decision boundaries learned for a ternary classification problem, using a 2 level HME with softmax gates and softmax experts; the training set is on the left, the testing set on the right.

[Figure: decision boundaries learned by the HME on a ternary classification problem; training set on the left, test set on the right.]
The QMR (Quick Medical Reference) model is a bipartite graph, with diseases on top and findings at the bottom; each finding has a noisy-OR CPD. We can make such a model as follows.

  function bnet = mk_qmr_bnet(G, inhibit, leak, prior)
  % MK_QMR_BNET Make a QMR model
  % bnet = mk_qmr_bnet(G, inhibit, leak, prior)
  %
  % G(i,j) = 1 iff there is an arc from disease i to finding j
  % inhibit(i,j) = inhibition probability on i->j arc
  % leak(j) = inhibition prob. on leak->j arc
  % prior(i) = prob. disease i is on

  [Ndiseases Nfindings] = size(inhibit);
  N = Ndiseases + Nfindings;
  finding_node = Ndiseases+1:N;
  ns = 2*ones(1,N);
  dag = zeros(N,N);
  dag(1:Ndiseases, finding_node) = G;
  bnet = mk_bnet(dag, ns, 'observed', finding_node);

  for d=1:Ndiseases
    CPT = [1-prior(d) prior(d)];
    bnet.CPD{d} = tabular_CPD(bnet, d, CPT');
  end

  for i=1:Nfindings
    fnode = finding_node(i);
    ps = parents(G, i);
    bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(ps, i));
  end

In the file BNT/examples/static/qmr1, we create a random bipartite graph G, with 5 diseases and 10 findings, and random parameters. (In general, to create a random dag, use 'mk_random_dag'.) We can visualize the resulting graph structure using the methods discussed below.
Now let us put some random evidence on all the leaves except the very first and very last, and compute the disease posteriors. (Here 'diseases' and 'findings' denote the vectors of disease and finding node indices, as defined in qmr1.)

  pos = 2:floor(Nfindings/2);
  neg = (pos(end)+1):(Nfindings-1);
  onodes = myunion(pos, neg);
  evidence = cell(1, N);
  evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos)));
  evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg)));

  engine = jtree_inf_engine(bnet);
  [engine, ll] = enter_evidence(engine, evidence);
  post = zeros(1, Ndiseases);
  for i=diseases(:)'
    m = marginal_nodes(engine, i);
    post(i) = m.T(2);
  end

Junction tree can be quite slow on large QMR models. Fortunately, it is possible to exploit properties of the noisy-OR function to speed up exact inference using an algorithm called quickscore, discussed below.
We now give an example of a CG model, from the paper "Propagation of Probabilities, Means and Variances in Mixed Graphical Association Models", Steffen Lauritzen, JASA 87(420):1098--1108, 1992 (reprinted in the book "Probabilistic Networks and Expert Systems", R. G. Cowell, A. P. Dawid, S. L. Lauritzen and D. J. Spiegelhalter, Springer, 1999.)
We can create this model as follows.

  F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9;
  n = 9;

  dag = zeros(n);
  dag(F,E)=1;
  dag(W,[E Min D]) = 1;
  dag(E,D)=1;
  dag(B,[C D])=1;
  dag(D,[L Mout])=1;
  dag(Min,Mout)=1;

  % node sizes - all cts nodes are scalar, all discrete nodes are binary
  ns = ones(1, n);
  dnodes = [F W B];
  cnodes = mysetdiff(1:n, dnodes);
  ns(dnodes) = 2;

  bnet = mk_bnet(dag, ns, 'discrete', dnodes);

'dnodes' is a list of the discrete nodes; 'cnodes' is the continuous nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'.

The parameters of the discrete nodes can be specified as follows.

  bnet.CPD{B} = tabular_CPD(bnet, B, 'CPT', [0.85 0.15]); % 1=stable, 2=unstable
  bnet.CPD{F} = tabular_CPD(bnet, F, 'CPT', [0.95 0.05]); % 1=intact, 2=defect
  bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [2/7 5/7]);   % 1=industrial, 2=household
The parameters of the continuous nodes can be specified as follows.

  bnet.CPD{E} = gaussian_CPD(bnet, E, 'mean', [-3.9 -0.4 -3.2 -0.5], ...
      'cov', [0.00002 0.0001 0.00002 0.0001]);
  bnet.CPD{D} = gaussian_CPD(bnet, D, 'mean', [6.5 6.0 7.5 7.0], ...
      'cov', [0.03 0.04 0.1 0.1], 'weights', [1 1 1 1]);
  bnet.CPD{C} = gaussian_CPD(bnet, C, 'mean', [-2 -1], 'cov', [0.1 0.3]);
  bnet.CPD{L} = gaussian_CPD(bnet, L, 'mean', 3, 'cov', 0.25, 'weights', -0.5);
  bnet.CPD{Min} = gaussian_CPD(bnet, Min, 'mean', [0.5 -0.5], 'cov', [0.01 0.005]);
  bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 'mean', 0, 'cov', 0.002, 'weights', [1 1]);
First we compute the unconditional marginals (no evidence).

  engine = jtree_inf_engine(bnet);
  evidence = cell(1,n);
  [engine, ll] = enter_evidence(engine, evidence);
  marg = marginal_nodes(engine, E);

'marg' is a structure that contains the fields 'mu' and 'Sigma', which contain the mean and (co)variance of the marginal on E. In this case, they are both scalars. Let us check they match the published figures (to 2 decimal places).

  tol = 1e-2;
  assert(approxeq(marg.mu, -3.25, tol));
  assert(approxeq(sqrt(marg.Sigma), 0.709, tol));

We can compute the other posteriors similarly. Now let us add some evidence.

  evidence = cell(1,n);
  evidence{W} = 1; % industrial
  evidence{L} = 1.1;
  evidence{C} = -0.9;
  [engine, ll] = enter_evidence(engine, evidence);

Now we find

  marg = marginal_nodes(engine, E);
  assert(approxeq(marg.mu, -3.8983, tol));
  assert(approxeq(sqrt(marg.Sigma), 0.0763, tol));

We can also compute the joint probability on a set of nodes. For example, P(D, Mout | evidence) is a 2D Gaussian:
  marg = marginal_nodes(engine, [D Mout])
  marg =
      domain: [6 8]
          mu: [2x1 double]
       Sigma: [2x2 double]
           T: 1.0000

The mean is

  marg.mu
  ans =
      3.6077
      4.1077

and the covariance matrix is

  marg.Sigma
  ans =
      0.1062    0.1062
      0.1062    0.1182

It is easy to visualize this posterior using standard Matlab plotting functions, e.g.,

  gaussplot2d(marg.mu, marg.Sigma);
The T field indicates that the mixing weight of this Gaussian component is 1.0. If the joint contains discrete and continuous variables, the result will be a mixture of Gaussians, e.g.,

  marg = marginal_nodes(engine, [F E])
      domain: [1 3]
          mu: [-3.9000 -0.4003]
       Sigma: [1x1x2 double]
           T: [0.9995 4.7373e-04]

The interpretation is Sigma(i,j,k) = Cov[ E(i) E(j) | F=k ]. In this case, E is a scalar, so i=j=1; k specifies the mixture component.

We saw in the sprinkler network that BNT sets the effective size of observed discrete nodes to 1, since they only have one legal value. For continuous nodes, BNT sets their length to 0, since they have been reduced to a point. For example,

  marg = marginal_nodes(engine, [B C])
      domain: [4 5]
          mu: []
       Sigma: []
           T: [0.0123 0.9877]

It is simple to post-process the output of marginal_nodes. For example, the file BNT/examples/static/cg1 sets the mu term of observed nodes to their observed value, and the Sigma term to 0 (since observed nodes have no variance).
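A minimal sketch of that kind of post-processing, for the observed continuous node C above (not the exact code in cg1):

  marg = marginal_nodes(engine, C);
  if isempty(marg.mu)          % C was observed, so BNT reduced it to a point
    marg.mu = evidence{C};     % the observed value
    marg.Sigma = 0;            % observed nodes have no variance
  end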
Note that the implemented version of the junction tree is numerically unstable when using CG potentials (which is why, in the example above, we only required our answers to agree with the published ones to 2dp.) This is why you might want to use stab_cond_gauss_inf_engine, implemented by Shan Huang.
The parameter learning routines in BNT can be classified into four types, depending on whether we seek a point estimate or a Bayesian posterior, and whether the data is fully or partially observed:

        | Full obs            | Partial obs
  ------|---------------------|------------------
  Point | learn_params        | learn_params_em
  Bayes | bayes_update_params | not yet supported
To load numeric data from an ASCII text file called 'dat.txt', where each row is a case and the columns are separated by white space, such as

  011979 1626.5 0.0
  021979 1367.0 0.0
  ...

you can use

  data = load('dat.txt');

or

  load dat.txt -ascii

In the latter case, the data is stored in a variable called 'dat' (the filename minus the extension). Alternatively, suppose the data is stored in a .csv file (has commas separating the columns, and contains a header line), such as

  header info goes here
  ORD,011979,1626.5,0.0
  DSM,021979,1367.0,0.0
  ...

You can load this using

  [a,b,c,d] = textread('dat.txt', '%s %d %f %f', 'delimiter', ',', 'headerlines', 1);

If your file is not in either of these formats, you can either use Perl to convert it to this format, or use the Matlab scanf command. Type

  help iofun

for more information on Matlab's file functions.
BNT learning routines require data to be stored in a cell array. data{i,m} is the value of node i in case (example) m, i.e., each column is a case. If node i is not observed in case m (missing value), set data{i,m} = []. (Not all the learning routines can cope with such missing values, however.) In the special case that all the nodes are observed and are scalar-valued (as opposed to vector-valued), the data can be stored in a matrix (as opposed to a cell-array).

Suppose, as in the mixture of experts example, that we have 3 nodes in the graph: X(1) is the observed input, X(3) is the observed output, and X(2) is a hidden (gating) node. We can create the dataset as follows.

  data = load('dat.txt');
  ncases = size(data, 1);
  cases = cell(3, ncases);
  cases([1 3], :) = num2cell(data');

Notice how we transposed the data, to convert rows into columns. Also, cases{2,m} = [] for all m, since X(2) is always hidden.
As an example, let us generate some data from the sprinkler network, and then try to recover the original parameters.

  samples = cell(N, nsamples);
  for i=1:nsamples
    samples(:,i) = sample_bnet(bnet);
  end

samples{j,i} contains the value of the j'th node in case i. sample_bnet returns a cell array because, in general, each node might be a vector of different length. In this case, all nodes are discrete (and hence scalars), so we could have used a regular array instead (which can be quicker):

  data = cell2num(samples);

Now we create a network with random parameters. (The initial values of bnet2 don't matter in this case, since we can find the globally optimal MLE independent of where we start.)

  % Make a tabula rasa
  bnet2 = mk_bnet(dag, node_sizes);
  seed = 0;
  rand('state', seed);
  bnet2.CPD{C} = tabular_CPD(bnet2, C);
  bnet2.CPD{R} = tabular_CPD(bnet2, R);
  bnet2.CPD{S} = tabular_CPD(bnet2, S);
  bnet2.CPD{W} = tabular_CPD(bnet2, W);

Finally, we find the maximum likelihood estimates of the parameters.

  bnet3 = learn_params(bnet2, samples);

To view the learned parameters, we use a little Matlab hackery.

  CPT3 = cell(1,N);
  for i=1:N
    s=struct(bnet3.CPD{i}); % violate object privacy
    CPT3{i}=s.CPT;
  end

Here are the parameters learned for node 4.

  dispcpt(CPT3{4})
  1 1 : 1.0000 0.0000
  2 1 : 0.2000 0.8000
  1 2 : 0.2273 0.7727
  2 2 : 0.0000 1.0000

So we see that the learned parameters are fairly close to the "true" ones, which we display below.

  dispcpt(CPT{4})
  1 1 : 1.0000 0.0000
  2 1 : 0.1000 0.9000
  1 2 : 0.1000 0.9000
  2 2 : 0.0100 0.9900

We can get better results by using a larger training set, or using informative priors (see below).
For tabular CPDs, we can specify a Dirichlet prior on the parameters. The Dirichlet has a simple interpretation in terms of pseudo counts. If we let N_ijk = the num. times X_i=k and Pa_i=j occurs in the training set, where Pa_i are the parents of X_i, then the maximum likelihood (ML) estimate is T_ijk = N_ijk / N_ij (where N_ij = sum_k' N_ijk'), which will be 0 if N_ijk=0. To prevent us from declaring that (X_i=k, Pa_i=j) is impossible just because this event was not seen in the training set, we can pretend we saw value k of X_i, for each value j of Pa_i, some number (alpha_ijk) of times in the past. The MAP (maximum a posteriori) estimate is then

  T_ijk = (N_ijk + alpha_ijk) / (N_ij + alpha_ij)

and is never 0 if all alpha_ijk > 0.
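As a small numerical illustration of this formula (with made-up counts for one row of a CPT over a 3-valued child):

  N_ijk = [0 4 6];                                       % counts of X_i=k for a fixed parent config j
  alpha = [1 1 1];                                       % Dirichlet pseudo counts
  T_ML  = N_ijk / sum(N_ijk)                             % ML estimate: [0 0.4 0.6], first entry is 0
  T_MAP = (N_ijk + alpha) / (sum(N_ijk) + sum(alpha))    % MAP estimate: [1 5 7]/13, never 0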
For example, consider the network A->B, where A is binary and B has 3 values. A uniform prior for B has the form

        B=1  B=2  B=3
  A=1    1    1    1
  A=2    1    1    1

which can be created using

  tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'unif');

This prior does not satisfy the likelihood equivalence principle, which says that Markov equivalent models should have the same marginal likelihood. A prior that does satisfy this principle is shown below. Heckerman (1995) calls this the BDeu prior (likelihood equivalent uniform Bayesian Dirichlet).
        B=1  B=2  B=3
  A=1   1/6  1/6  1/6
  A=2   1/6  1/6  1/6

where we put N/(q*r) in each bin; N is the equivalent sample size, q = |A| is the number of parent configurations, and r = |B| is the number of values of the child. This can be created as follows

  tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu');

Here, 1 is the equivalent sample size, and is the strength of the prior. You can change this using

  tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', ...
      'BDeu', 'dirichlet_weight', 10);
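To make the N/(q*r) rule concrete, here is the pseudo-count table for the A->B example above, computed in a couple of lines (a sketch, not BNT code):

  N = 1;  q = 2;  r = 3;               % equivalent sample size, |A| parent configs, |B| values
  alpha = (N/(q*r)) * ones(q, r)       % every entry is 1/6, as in the BDeu table above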
We can update the parameter posterior from fully observed data as follows.

  cases = sample_bnet(bnet, nsamples);
  bnet = bayes_update_params(bnet, cases);
  LL = log_marg_lik_complete(bnet, cases);

bnet.CPD{i}.prior contains the new Dirichlet pseudocounts, and bnet.CPD{i}.CPT is set to the mean of the posterior (the normalized counts). (Hence if the initial pseudo counts are 0, bayes_update_params and learn_params will give the same result.)

We can compute the same result sequentially (on-line) as follows.

  LL = 0;
  for m=1:nsamples
    LL = LL + log_marg_lik_complete(bnet, cases(:,m));
    bnet = bayes_update_params(bnet, cases(:,m));
  end

The file BNT/examples/static/StructLearn/model_select1 has an example of sequential model selection which uses the same idea. We generate data from the model A->B and compute the posterior prob of all 3 dags on 2 nodes: (1) A B, (2) A <- B, (3) A -> B. Models 2 and 3 are Markov equivalent, and therefore indistinguishable from observational data alone, so we expect their posteriors to be the same (assuming a prior which satisfies likelihood equivalence). If we use random parameters, the "true" model only gets a higher posterior after 2000 trials! However, if we make B a noisy NOT gate, the true model "wins" after 12 trials, as shown below (red = model 1, blue/green (superimposed) represents models 2/3).
The use of marginal likelihood for model selection is discussed in greater detail in the section on structure learning.
To simulate partial observability, let us hide half of the node values at random.

  samples2 = samples;
  hide = rand(N, nsamples) > 0.5;
  [I,J]=find(hide);
  for k=1:length(I)
    samples2{I(k), J(k)} = [];
  end

samples2{i,l} is the value of node i in training case l, or [] if unobserved.

Now we will compute the MLEs using the EM algorithm. We need to use an inference algorithm to compute the expected sufficient statistics in the E step; the M (maximization) step is as above.

  engine2 = jtree_inf_engine(bnet2);
  max_iter = 10;
  [bnet4, LLtrace] = learn_params_em(engine2, samples2, max_iter);

LLtrace(i) is the log-likelihood at iteration i. We can plot this as follows:

  plot(LLtrace, 'x-')

Let's display the results after 10 iterations of EM (CPT4 is extracted from bnet4 in the same way we extracted CPT3 from bnet3 above).

  celldisp(CPT4)
  CPT4{1} =
      0.6616
      0.3384
  CPT4{2} =
      0.6510    0.3490
      0.8751    0.1249
  CPT4{3} =
      0.8366    0.1634
      0.0197    0.9803
  CPT4{4} =
  (:,:,1) =
      0.8276    0.0546
      0.5452    0.1658
  (:,:,2) =
      0.1724    0.9454
      0.4548    0.8342

We can get improved performance by using a larger training set, running EM for more iterations, or using informative priors (see above).

For a more complete example of learning with EM, see the script BNT/examples/static/learn1.m.
When we have tied parameters, there is no longer a one-to-one correspondence between nodes and CPDs. Rather, each CPD specifies the parameters for a whole equivalence class of nodes. It is easiest to see this by example. Consider the following hidden Markov model (HMM).
When HMMs are used for semi-infinite processes like speech recognition, we assume the transition matrix P(H(t+1)|H(t)) is the same for all t; this is called a time-invariant or homogeneous Markov chain. Hence hidden nodes 2, 3, ..., T are all in the same equivalence class, say class Hclass. Similarly, the observation matrix P(O(t)|H(t)) is assumed to be the same for all t, so the observed nodes are all in the same equivalence class, say class Oclass. Finally, the prior term P(H(1)) is in a class all by itself, say class H1class. This is illustrated below, where we explicitly represent the parameters as random variables (dotted nodes).
In BNT, we cannot represent parameters as random variables (nodes). Instead, we "hide" the parameters inside one CPD for each equivalence class, and then specify that the other CPDs should share these parameters, as follows.

  hnodes = 1:2:2*T;
  onodes = 2:2:2*T;
  H1class = 1; Hclass = 2; Oclass = 3;
  eclass = ones(1,N);
  eclass(hnodes(2:end)) = Hclass;
  eclass(hnodes(1)) = H1class;
  eclass(onodes) = Oclass;
  % create dag and ns in the usual way
  bnet = mk_bnet(dag, ns, 'discrete', dnodes, 'equiv_class', eclass);

Finally, we define the parameters for each equivalence class:

  bnet.CPD{H1class} = tabular_CPD(bnet, hnodes(1)); % prior
  bnet.CPD{Hclass} = tabular_CPD(bnet, hnodes(2)); % transition matrix
  if cts_obs
    bnet.CPD{Oclass} = gaussian_CPD(bnet, onodes(1));
  else
    bnet.CPD{Oclass} = tabular_CPD(bnet, onodes(1));
  end

In general, if bnet.CPD{e} = xxx_CPD(bnet, j), then j should be a member of e's equivalence class; that is, it is not always the case that e == j. You can use bnet.rep_of_eclass(e) to return the representative of equivalence class e. BNT will look up the parents of j to determine the size of the CPT to use. It assumes that this is the same for all members of the equivalence class. Click here for a more complex example of parameter tying.
Note: Normally one would define an HMM as a Dynamic Bayes Net (see the function BNT/examples/dynamic/mk_chmm.m). However, one can define an HMM as a static BN using the function BNT/examples/static/Models/mk_hmm_bnet.m.
There are two very different approaches to structure learning: constraint-based and search-and-score. In the constraint-based approach, we start with a fully connected graph, and remove edges if certain conditional independencies are measured in the data. This has the disadvantage that repeated independence tests lose statistical power.

In the more popular search-and-score approach, we perform a search through the space of possible DAGs, and either return the best one found (a point estimate), or return a sample of the models found (an approximation to the Bayesian posterior).

The number of DAGs as a function of the number of nodes, G(n), is super-exponential in n, and is given by the following recurrence
  G(n) = sum_{k=1}^{n} (-1)^(k+1) * C(n,k) * 2^(k(n-k)) * G(n-k),   with G(0) = 1

where C(n,k) is the binomial coefficient. The first few values are shown below.
n | G(n) |
---|---|
1 | 1 |
2 | 3 |
3 | 25 |
4 | 543 |
5 | 29,281 |
6 | 3,781,503 |
7 | 1.1 x 10^9 |
8 | 7.8 x 10^11 |
9 | 1.2 x 10^15 |
10 | 4.2 x 10^18 |
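These values can be reproduced from the recurrence above with a few lines of Matlab (a standalone sketch, not part of BNT):

  G = zeros(1, 11);                % G(i+1) will hold the number of DAGs on i nodes
  G(1) = 1;                        % G(0) = 1 by convention
  for n = 1:10
    s = 0;
    for k = 1:n
      s = s + (-1)^(k+1) * nchoosek(n,k) * 2^(k*(n-k)) * G(n-k+1);
    end
    G(n+1) = s;
  end
  G(2:11)                          % 1, 3, 25, 543, 29281, ...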
If we know a total ordering on the nodes, finding the best structure amounts to picking the best set of parents for each node independently. This is what the K2 algorithm does. If the ordering is unknown, we can search over orderings, which is more efficient than searching over DAGs (Koller and Friedman, 2000).

In addition to the search procedure, we must specify the scoring function. There are two popular choices. The Bayesian score integrates out the parameters, i.e., it is the marginal likelihood of the model. The BIC (Bayesian Information Criterion) is defined as log P(D|theta_hat) - 0.5*d*log(N), where D is the data, theta_hat is the ML estimate of the parameters, d is the number of parameters, and N is the number of data cases. The BIC method has the advantage of not requiring a prior.

BIC can be derived as a large sample approximation to the marginal likelihood. (It is also equal to the Minimum Description Length of a model.) However, in practice, the sample size does not need to be very large for the approximation to be good. For example, in the figure below, we plot the ratio between the log marginal likelihood and the BIC score against data-set size; we see that the ratio rapidly approaches 1, especially for non-informative priors. (This plot was generated by the file BNT/examples/static/bic1.m. It uses the water sprinkler BN with BDeu Dirichlet priors with different equivalent sample sizes.)
As with parameter learning, handling missing data / hidden variables is much harder than the fully observed case. The structure learning routines in BNT can therefore be classified into 4 types, analogously to the parameter learning case.

        | Full obs          | Partial obs
  ------|-------------------|------------------
  Point | learn_struct_K2   | not yet supported
  Bayes | learn_struct_mcmc | not yet supported
When learning graph structure from observational data, the best one can hope to do is to identify the model up to Markov equivalence. To distinguish amongst graphs within the same equivalence class, one needs interventional data: see the discussion on active learning below.
The brute-force approach to structure learning is to enumerate all possible DAGs and score each one:

dags = mk_all_dags(N);
score = score_dags(data, ns, dags);

where data(i,m) is the value of node i in case m, and ns(i) is the size of node i.
If the DAGs have a lot of families in common, we can cache the sufficient statistics, making this
potentially more efficient than scoring the DAGs one at a time.
(Caching is not currently implemented, however.)
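To extract a point estimate from the exhaustive scores, one can simply take the highest-scoring
DAG. This assumes, as in the snippet above, that mk_all_dags returns a cell array of adjacency
matrices and score_dags returns one score per DAG.

[best_score, best_index] = max(score);
best_dag = dags{best_index};   % the MAP structure under a uniform structural prior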
By default, we use the Bayesian scoring metric, and assume CPDs are represented by tables with
BDeu(1) priors. We can override these defaults as follows. If we want to use uniform priors, we can say

params = cell(1,N);
for i=1:N
  params{i} = {'prior', 'unif'};
end
score = score_dags(data, ns, dags, 'params', params);

params{i} is a cell array containing optional arguments that are passed to the constructor for CPD i.
Now suppose we want to use different node types, e.g., suppose nodes 1 and 2 are Gaussian, and
nodes 3 and 4 softmax (both these CPDs can support discrete and continuous parents, which is
necessary since all other nodes will be considered as parents).
The Bayesian scoring metric currently only works for tabular CPDs, so we will use BIC:

score = score_dags(data, ns, dags, 'discrete', [3 4], 'params', [], ...
    'type', {'gaussian', 'gaussian', 'softmax', 'softmax'}, 'scoring_fn', 'bic')

In practice, one can't enumerate all possible DAGs for N > 5, but one can evaluate any
reasonably-sized set of hypotheses in this way (e.g., nearest neighbors of your current best guess).
Think of this as "computer assisted model refinement" as opposed to de novo learning.
The original K2 paper (Cooper and Herskovits, 1992) used the Bayesian scoring metric with tabular
CPDs and Dirichlet priors. BNT generalizes this to allow any kind of CPD, and either the Bayesian
scoring metric or BIC, as in the example above. In addition, you can specify an optional upper
bound on the number of parents for each node.
The file BNT/examples/static/k2demo1.m gives an example of how to use K2.
We use the water sprinkler network and sample 100 cases from it as before.
Then we see how much data it takes to recover the generating structure:
order = [C S R W];
max_fan_in = 2;
sz = 5:5:100;
for i=1:length(sz)
  dag2 = learn_struct_K2(data(:,1:sz(i)), node_sizes, order, 'max_fan_in', max_fan_in);
  correct(i) = isequal(dag, dag2);
end

Here are the results.

correct =
  Columns 1 through 12
     0     0     0     0     0     0     0     1     0     1     1     1
  Columns 13 through 20
     1     1     1     1     1     1     1     1

So we see it takes about sz(10)=50 cases. (BIC behaves similarly, showing that the prior doesn't
matter too much.) In general, we cannot hope to recover the "true" generating structure, only one
that is in its Markov equivalence class.
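To visualize how recovery improves with sample size, one could add a line like the following to
k2demo1.m (a hypothetical addition, not in the shipped example; sz and correct are as above):

plot(sz, correct, 'o-');
xlabel('number of training cases');
ylabel('recovered the generating DAG (1 = yes)');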
Hill-climbing searches the space of DAGs by repeatedly moving to the highest-scoring neighbor of
the current graph, stopping when no neighbor scores higher.
A common definition of "neighbor" is all graphs that can be generated from the current graph by
adding, deleting or reversing a single arc, subject to the acyclicity constraint.
Other neighborhoods are possible: see Optimal Structure Identification with Greedy Search,
Max Chickering, JMLR 2002. A sketch of this procedure is given below.
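The following is a minimal hill-climbing sketch, not a built-in BNT routine (the function name
hill_climb_sketch is hypothetical). It assumes only the score_dags and acyclic functions used
elsewhere in this document, and generates the add/delete/reverse neighborhood by hand. It is meant
to illustrate the neighborhood definition, not to be efficient: it re-scores whole DAGs rather
than just the changed families.

function dag = hill_climb_sketch(data, ns, dag)
% HILL_CLIMB_SKETCH Greedy structure search over the add/delete/reverse neighborhood.
% (Hypothetical helper, not part of BNT.)
N = length(ns);
best = score_dags(data, ns, {dag});
improved = true;
while improved
  improved = false;
  for i=1:N
    for j=1:N
      if i==j, continue; end
      nbrs = {};
      cand = dag;
      if dag(i,j)                 % existing arc i->j: try deleting it ...
        cand(i,j) = 0;
        nbrs{end+1} = cand;
        cand(j,i) = 1;            % ... or reversing it
        nbrs{end+1} = cand;
      else                        % no arc i->j: try adding one
        cand(i,j) = 1;
        nbrs{end+1} = cand;
      end
      for c=1:length(nbrs)
        if ~acyclic(nbrs{c}), continue; end   % reject candidates that create a cycle
        sc = score_dags(data, ns, nbrs(c));
        if sc > best
          best = sc; dag = nbrs{c}; improved = true;
        end
      end
    end
  end
end

For example, dag = hill_climb_sketch(data, ns, zeros(N,N)) starts the search from the empty graph.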
The MCMC structure-search routine, learn_struct_mcmc, can be called as in the following example.
[sampled_graphs, accept_ratio] = learn_struct_mcmc(data, ns, 'nsamples', 100, 'burnin', 10);

We can convert our set of sampled graphs to a histogram (empirical posterior over all the DAGs) thus:

all_dags = mk_all_dags(N);
mcmc_post = mcmc_sample_to_hist(sampled_graphs, all_dags);

To see how well this performs, let us compute the exact posterior exhaustively.
score = score_dags(data, ns, all_dags);
post = normalise(exp(score)); % assuming uniform structural prior
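If the log scores are large in magnitude, exp(score) can underflow to zero before normalization.
A standard trick (a suggested modification, not what the example file does) is to subtract the
maximum log score first, which leaves the normalized posterior unchanged:

post = normalise(exp(score - max(score))); % numerically safer, same result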
We plot the results below. (The data set was 100 samples drawn from a random 4 node bnet; see the
file BNT/examples/static/mcmc1.)

subplot(2,1,1)
bar(post)
subplot(2,1,2)
bar(mcmc_post)

We can also plot the acceptance ratio versus number of MCMC steps, as a crude convergence diagnostic.

clf
plot(accept_ratio)
Even though the number of samples needed by MCMC is theoretically polynomial (not exponential) in
the dimensionality of the search space, in practice it has been found that MCMC does not converge
in reasonable time for graphs with more than about 10 nodes.
Most of the scoring functions accept an optional argument that specifies whether a node was
observed to have a certain value, or was forced to have that value: we set clamped(i,m)=1 if node
i was forced in training case m. See, e.g., the file BNT/examples/static/cooper_yoo.
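For concreteness, here is how one might build such a matrix for the sprinkler network, supposing
(hypothetically) that in training case 3 the Sprinkler node was switched on by intervention. How
the matrix is then passed to a particular scoring function is described in that function's help
string and in the cooper_yoo example.

ncases = size(data, 2);       % data(i,m) as above
clamped = zeros(N, ncases);   % clamped(i,m)=1 means node i was set by force in case m
clamped(S, 3) = 1;            % e.g., the sprinkler was switched on in case 3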
An interesting question is how to decide which interventions to perform (cf. the design of
experiments literature). For details, see the following tech report.
Wei Hu has implemented structural EM (SEM) for discrete nodes. You can download his package from
here. Please address all questions about this code to wei.hu@intel.com.
See also Phl's implementation of SEM.
IC*/FCI extend IC/PC to handle latent variables: see below.
(IC stands for inductive causation; PC stands for Peter and Clark, the first names of Spirtes and
Glymour; FCI stands for fast causal inference. What we, following Pearl (2000), call IC* was
called IC in the original Pearl and Verma paper.)
For details, see Pearl (2000) and Spirtes, Glymour and Scheines (1993).
The PC algorithm takes as arguments a function f, the number of nodes N, the maximum fan-in K, and
additional arguments A which are passed to f. The function f(X,Y,S,A) returns 1 if X is
conditionally independent of Y given S, and 0 otherwise.
For example, suppose we cheat by passing in a CI "oracle" which has access to the true DAG; the
oracle tests for d-separation in this DAG, i.e., f(X,Y,S) calls dsep(X,Y,S,dag).
We can do this as follows.
pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag);

pdag(i,j) = -1 if there is definitely an i->j arc, and pdag(i,j) = 1 if there is either an i->j or
an i<-j arc.
Applied to the sprinkler network, this returns

pdag =
     0     1     1     0
     1     0     0    -1
     1     0     0    -1
     0     0     0     0

So as expected, we see that the V-structure at the W node is uniquely identified, but the other
arcs have ambiguous orientation.
We now give an example from p141 (1st edn) / p103 (2nd edn) of the SGS book.
This example concerns the female orgasm.
We are given a correlation matrix C between 7 measured factors (such as subjective experiences of
coital and masturbatory experiences), derived from 281 samples, and want to learn a causal model
of the data. We will not discuss the merits of this type of work here, but merely show how to
reproduce the results in the SGS book.
Their program, Tetrad, makes use of the Fisher Z-test for conditional independence, so we do the same:
n = 7; % number of measured variables
max_fan_in = 4;
nsamples = 281;
alpha = 0.05;
pdag = learn_struct_pdag_pc('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha);

In this case, the CI test is

f(X,Y,S) = cond_indep_fisher_z(X,Y,S, C, nsamples, alpha)

The results match those of Fig 12a of SGS apart from two edge differences; presumably this is due
to rounding error (although it could be a bug, either in BNT or in Tetrad).
This example can be found in the file BNT/examples/static/pc2.m.
The IC* algorithm (Pearl and Verma, 1991), and the faster FCI algorithm (Spirtes, Glymour, and
Scheines 1993), are like the IC/PC algorithm, except that they can detect the presence of latent
variables. See the file learn_struct_pdag_ic_star written by Tamar Kushnir.
The output is a matrix P, defined as follows (see Pearl (2000), p52 for details):

% P(i,j) = -1 if there is either a latent variable L such that i <-L-> j, OR there is a directed edge from i->j.
% P(i,j) = -2 if there is a marked directed i-*>j edge.
% P(i,j) = P(j,i) = 1 if there is an undirected edge i--j.
% P(i,j) = P(j,i) = 2 if there is a latent variable L such that i<-L->j.
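For example, one might call it with the same Fisher-Z test as above. This assumes
learn_struct_pdag_ic_star takes the same (f, n, max_fan_in, ...) argument pattern as
learn_struct_pdag_pc; check its help string before relying on this.

P = learn_struct_pdag_ic_star('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha);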
An inference engine is an object that contains a bnet and supports the 'enter_evidence' and
'marginal_nodes' methods. The engine constructor takes the bnet as argument and may do some
model-specific processing. When 'enter_evidence' is called, the engine may do some
evidence-specific processing. Finally, when 'marginal_nodes' is called, the engine may do some
query-specific processing.
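For instance, a typical use of the junction tree engine (jtree_inf_engine, listed in the summary
table below) on the sprinkler network defined earlier looks like this. Recall that in BNT discrete
values are numbered from 1, so for binary nodes false=1 and true=2.

engine = jtree_inf_engine(bnet);            % model-specific processing
evidence = cell(1,N);
evidence{W} = 2;                            % observe WetGrass = true
engine = enter_evidence(engine, evidence);  % evidence-specific processing
m = marginal_nodes(engine, S);              % query-specific processing
m.T                                         % posterior P(Sprinkler | WetGrass = true)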
The amount of work done when each stage is specified -- structure, parameters, evidence, and
query -- depends on the engine. The cost of work done early in this sequence can be amortized.
On the other hand, one can make better optimizations if one waits until later in the sequence.
For example, the parameters might imply conditional independencies that are not evident in the
graph structure, but can nevertheless be exploited; the evidence indicates which nodes are
observed and hence can effectively be disconnected from the graph; and the query might indicate
that large parts of the network are d-separated from the query nodes.
(Since it is not the actual values of the evidence that matter, just which nodes are observed,
many engines allow you to specify which nodes will be observed when they are constructed, i.e.,
before calling 'enter_evidence'. Some engines can still cope if the actual pattern of evidence is
different, e.g., if there is missing data.)
Although being maximally lazy (i.e., only doing work when a query is issued) may seem desirable,
this is not always the most efficient approach. For example, when learning using EM, we need to
call marginal_nodes N times, where N is the number of nodes. Variable elimination would end up
repeating a lot of work each time marginal_nodes is called, making it inefficient for learning.
The junction tree algorithm, by contrast, uses dynamic programming to avoid this redundant
computation --- it calculates all marginals in two passes during 'enter_evidence', so calling
'marginal_nodes' takes constant time.
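The EM use case mentioned above amounts to a loop of this shape (a sketch, reusing the evidence
cell array from the junction tree example), which is cheap with jtree but repeats work with
variable elimination:

engine = enter_evidence(engine, evidence);
for i=1:N
  m{i} = marginal_nodes(engine, i);   % one query per node
end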
We will discuss some of the inference algorithms implemented in BNT below, and finish with a
summary of all of them.
The principle of distributing sums over products can be generalized greatly to apply to any
commutative semiring. This forms the basis of many common algorithms, such as Viterbi decoding and
the Fast Fourier Transform. For details, see the literature on the generalized distributive law.
Choosing an order in which to sum out the variables so as to minimize computational cost is known
to be NP-hard. The implementation of this algorithm in var_elim_inf_engine makes no attempt to
optimize this ordering (in contrast, say, to jtree_inf_engine, which uses a greedy search
procedure to find a good ordering).

Note: unlike most algorithms, var_elim does all its computational work inside of marginal_nodes,
not inside of enter_evidence.
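Usage follows the same engine protocol as before; only the constructor changes.

engine = var_elim_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, W);   % the summing-out happens here, not in enter_evidence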
Three specialized versions of this global inference algorithm (which works by computing the joint
distribution over all the nodes and then marginalizing it) have also been implemented,
corresponding to the cases where all the nodes are discrete (D), all are Gaussian (G), and some
are discrete and some Gaussian (CG). They are called enumerative_inf_engine,
gaussian_inf_engine, and cond_gauss_inf_engine respectively.

Note: unlike most algorithms, these global inference algorithms do all their computational work
inside of marginal_nodes, not inside of enter_evidence.
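For the all-discrete sprinkler network, for example, the enumerative version can serve as a slow
but simple reference implementation, again with the same calling pattern:

engine = enumerative_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, S);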
A much more significant speedup is obtained by exploiting special properties of the noisy-or node,
as done by the quickscore algorithm. The corresponding engine is used as follows:

engine = quickscore_inf_engine(inhibit, leak, prior);
engine = enter_evidence(engine, pos, neg);
m = marginal_nodes(engine, i);
Loopy belief propagation (Pearl's algorithm applied to a graph with cycles) is invoked as follows:

engine = pearl_inf_engine(bnet, 'max_iter', 30);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);

We found that this algorithm often converges, and when it does, it is often very accurate, but
this depends on the precise setting of the parameter values of the network.
(See the file BNT/examples/static/qmr1 to repeat the experiment for yourself.)
Understanding when and why belief propagation converges/works is a topic of ongoing research.
pearl_inf_engine can exploit special structure in noisy-or and gmux nodes to compute messages
efficiently.

belprop_inf_engine is like pearl, but uses potentials to represent messages. Hence this is slower.

belprop_fg_inf_engine is like belprop, but is designed for factor graphs.
In terms of topology, most engines handle any kind of DAG. belprop_fg does approximate inference
on factor graphs (FG), which can be used to represent directed, undirected, and mixed (chain)
graphs. (In the future, we plan to support exact inference on chain graphs.) quickscore only works
on QMR-like models.

In terms of node types: algorithms that use potentials can handle discrete (D), Gaussian (G) or
conditional Gaussian (CG) models. Sampling algorithms can essentially handle any kind of node
(distribution). Other algorithms make more restrictive assumptions in exchange for speed.

Finally, most algorithms are designed to give the exact answer. The belief propagation algorithms
are exact if applied to trees, and in some other cases. Sampling is considered approximate, even
though, in the limit of an infinite number of samples, it gives the exact answer.
Here is a summary of the properties of all the engines in BNT which work on static networks.
Name                 | Exact? | Node type | Topology |
---|---|---|---|
belprop              | approx | D         | DAG |
belprop_fg           | approx | D         | factor graph |
cond_gauss           | exact  | CG        | DAG |
enumerative          | exact  | D         | DAG |
gaussian             | exact  | G         | DAG |
gibbs                | approx | D         | DAG |
global_joint         | exact  | D,G,CG    | DAG |
jtree                | exact  | D,G,CG    | DAG |
likelihood_weighting | approx | any       | DAG |
pearl                | approx | D,G       | DAG |
pearl                | exact  | D,G       | polytree |
quickscore           | exact  | noisy-or  | QMR |
stab_cond_gauss      | exact  | CG        | DAG |
var_elim             | exact  | D,G,CG    | DAG |
See the examples in BNT/examples/limids (limited memory influence diagrams) for details.