wolffd@0:
wolffd@0:wolffd@0: wolffd@0:
wolffd@0:
wolffd@0:
wolffd@0:
wolffd@0: BNT_HOME = 'C:\kpmurphy\matlab\FullBNT'; wolffd@0:wolffd@0: wolffd@0:
wolffd@0:
wolffd@0:
wolffd@0:
wolffd@0: >> cd C:\kpmurphy\matlab\FullBNT\BNT wolffd@0:wolffd@0: wolffd@0:
wolffd@0:
wolffd@0:
wolffd@0:
wolffd@0: mex is a script that lets you call C code from Matlab - it does not compile matlab to wolffd@0: C (see mcc below). wolffd@0: If your C/C++ compiler is set up correctly, mex should work out of wolffd@0: the box. wolffd@0: If not, you might need to type wolffd@0:
wolffd@0: mex -setup wolffd@0:
wolffd@0: before calling installC. wolffd@0:
wolffd@0: To make mex call gcc on Windows, wolffd@0: you must install gnumex. wolffd@0: You can use the minimalist GNU for wolffd@0: Windows version of gcc, or wolffd@0: the cygwin version. wolffd@0:
wolffd@0: In general, typing wolffd@0: 'mex foo.c' from inside Matlab creates a file called wolffd@0: 'foo.mexglx' or 'foo.dll' (the exact file wolffd@0: extension is system dependent - on Linux it is 'mexglx', on Windows it is '.dll'). wolffd@0: The resulting file will hide the original 'foo.m' (if it existed), i.e., wolffd@0: typing 'foo' at the prompt will call the compiled C version. wolffd@0: To reveal the original matlab version, just delete foo.mexglx (this is wolffd@0: what uninstallC does). wolffd@0:
wolffd@0: Sometimes it takes time for Matlab to realize that the file has wolffd@0: changed from matlab to C or vice versa; try typing 'clear all' or wolffd@0: restarting Matlab to refresh it. wolffd@0: To find out which version of a file you are running, type wolffd@0: 'which foo'. wolffd@0:
wolffd@0: mcc, the wolffd@0: Matlab to C compiler, is a separate product, wolffd@0: and is quite different from mex. It does not yet support wolffd@0: objects/classes, which is why we can't compile all of BNT to C automatically. wolffd@0: Also, hand-written C code is usually much wolffd@0: better than the C code generated by mcc. wolffd@0: wolffd@0: wolffd@0:
wolffd@0: Acknowledgements: wolffd@0: Although I wrote some of the C code, most of wolffd@0: the C code (e.g., for jtree and dpot) was written by Wei Hu; wolffd@0: the triangulation C code was written by Ilya Shpitser. wolffd@0: wolffd@0: wolffd@0:
wolffd@0:
wolffd@0: wolffd@0:
wolffd@0: To specify this directed acyclic graph (dag), we create an adjacency matrix: wolffd@0:
wolffd@0: N = 4; wolffd@0: dag = zeros(N,N); wolffd@0: C = 1; S = 2; R = 3; W = 4; wolffd@0: dag(C,[R S]) = 1; wolffd@0: dag(R,W) = 1; wolffd@0: dag(S,W)=1; wolffd@0:wolffd@0:
wolffd@0: We have numbered the nodes as follows: wolffd@0: Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4. wolffd@0: The nodes must always be numbered in topological order, i.e., wolffd@0: ancestors before descendants. wolffd@0: For a more complicated graph, this is a little inconvenient: we will wolffd@0: see how to get around this below. wolffd@0:
wolffd@0: In Matlab 6, you can use logical arrays instead of double arrays, wolffd@0: which are 4 times smaller: wolffd@0:
wolffd@0: dag = false(N,N); wolffd@0: dag(C,[R S]) = true; wolffd@0: ... wolffd@0:wolffd@0:
wolffd@0: A preliminary attempt to make a GUI wolffd@0: has been writte by Philippe LeRay and can be downloaded wolffd@0: from here. wolffd@0:
wolffd@0: You can visualize the resulting graph structure using wolffd@0: the methods discussed below. wolffd@0: wolffd@0:
wolffd@0: discrete_nodes = 1:N; wolffd@0: node_sizes = 2*ones(1,N); wolffd@0:wolffd@0: If the nodes were not binary, you could type e.g., wolffd@0:
wolffd@0: node_sizes = [4 2 3 5]; wolffd@0:wolffd@0: meaning that Cloudy has 4 possible values, wolffd@0: Sprinkler has 2 possible values, etc. wolffd@0: Note that these are cardinal values, not ordinal, i.e., wolffd@0: they are not ordered in any way, like 'low', 'medium', 'high'. wolffd@0:
wolffd@0: We are now ready to make the Bayes net: wolffd@0:
wolffd@0: bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes); wolffd@0:wolffd@0: By default, all nodes are assumed to be discrete, so we can also just wolffd@0: write wolffd@0:
wolffd@0: bnet = mk_bnet(dag, node_sizes); wolffd@0:wolffd@0: You may also specify which nodes will be observed. wolffd@0: If you don't know, or if this not fixed in advance, wolffd@0: just use the empty list (the default). wolffd@0:
wolffd@0: onodes = []; wolffd@0: bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes); wolffd@0:wolffd@0: Note that optional arguments are specified using a name/value syntax. wolffd@0: This is common for many BNT functions. wolffd@0: In general, to find out more about a function (e.g., which optional wolffd@0: arguments it takes), please see its wolffd@0: documentation string by typing wolffd@0:
wolffd@0: help mk_bnet wolffd@0:wolffd@0: See also other useful Matlab tips. wolffd@0:
wolffd@0: It is possible to associate names with nodes, as follows: wolffd@0:
wolffd@0: bnet = mk_bnet(dag, node_sizes, 'names', {'cloudy','S','R','W'}, 'discrete', 1:4); wolffd@0:wolffd@0: You can then refer to a node by its name: wolffd@0:
wolffd@0: C = bnet.names{'cloudy'}; % bnet.names is an associative array wolffd@0: bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]); wolffd@0:wolffd@0: wolffd@0: wolffd@0:
wolffd@0: Tabular CPDs, also called CPTs (conditional probability tables), wolffd@0: are stored as multidimensional arrays, where the dimensions wolffd@0: are arranged in the same order as the nodes, e.g., the CPT for node 4 wolffd@0: (WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself. wolffd@0: Hence the child is always the last dimension. wolffd@0: If a node has no parents, its CPT is a column vector representing its wolffd@0: prior. wolffd@0: Note that in Matlab (unlike C), arrays are indexed wolffd@0: from 1, and are layed out in memory such that the first index toggles wolffd@0: fastest, e.g., the CPT for node 4 (WetGrass) is as follows wolffd@0:
wolffd@0:
wolffd@0:
wolffd@0: where we have used the convention that false==1, true==2. wolffd@0: We can create this CPT in Matlab as follows wolffd@0:
wolffd@0: CPT = zeros(2,2,2); wolffd@0: CPT(1,1,1) = 1.0; wolffd@0: CPT(2,1,1) = 0.1; wolffd@0: ... wolffd@0:wolffd@0: Here is an easier way: wolffd@0:
wolffd@0: CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]); wolffd@0:wolffd@0: In fact, we don't need to reshape the array, since the CPD constructor wolffd@0: will do that for us. So we can just write wolffd@0:
wolffd@0: bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]); wolffd@0:wolffd@0: The other nodes are created similarly (using the old syntax for wolffd@0: optional parameters) wolffd@0:
wolffd@0: bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]); wolffd@0: bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]); wolffd@0: bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]); wolffd@0: bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]); wolffd@0:wolffd@0: wolffd@0: wolffd@0:
wolffd@0: rand('state', seed); wolffd@0: randn('state', seed); wolffd@0:wolffd@0: To control the degree of randomness (entropy), wolffd@0: you can sample each row of the CPT from a Dirichlet(p,p,...) distribution. wolffd@0: If p << 1, this encourages "deterministic" CPTs (one entry near 1, the rest near 0). wolffd@0: If p = 1, each entry is drawn from U[0,1]. wolffd@0: If p >> 1, the entries will all be near 1/k, where k is the arity of wolffd@0: this node, i.e., each row will be nearly uniform. wolffd@0: You can do this as follows, assuming this node wolffd@0: is number i, and ns is the node_sizes. wolffd@0:
wolffd@0: k = ns(i); wolffd@0: ps = parents(dag, i); wolffd@0: psz = prod(ns(ps)); wolffd@0: CPT = sample_dirichlet(p*ones(1,k), psz); wolffd@0: bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT); wolffd@0:wolffd@0: wolffd@0: wolffd@0:
wolffd@0: engine = jtree_inf_engine(bnet); wolffd@0:wolffd@0: The other engines have similar constructors, but might take wolffd@0: additional, algorithm-specific parameters. wolffd@0: All engines are used in the same way, once they have been created. wolffd@0: We illustrate this in the following sections. wolffd@0: wolffd@0: wolffd@0:
wolffd@0: evidence = cell(1,N); wolffd@0: evidence{W} = 2; wolffd@0:wolffd@0: We use a 1D cell array instead of a vector to wolffd@0: cope with the fact that nodes can be vectors of different lengths. wolffd@0: In addition, the value [] can be used wolffd@0: to denote 'no evidence', instead of having to specify the observation wolffd@0: pattern as a separate argument. wolffd@0: (Click here for a quick tutorial on cell wolffd@0: arrays in matlab.) wolffd@0:
wolffd@0: We are now ready to add the evidence to the engine. wolffd@0:
wolffd@0: [engine, loglik] = enter_evidence(engine, evidence); wolffd@0:wolffd@0: The behavior of this function is algorithm-specific, and is discussed wolffd@0: in more detail below. wolffd@0: In the case of the jtree engine, wolffd@0: enter_evidence implements a two-pass message-passing scheme. wolffd@0: The first return argument contains the modified engine, which wolffd@0: incorporates the evidence. The second return argument contains the wolffd@0: log-likelihood of the evidence. (Not all engines are capable of wolffd@0: computing the log-likelihood.) wolffd@0:
wolffd@0: Finally, we can compute p=P(S=2|W=2) as follows. wolffd@0:
wolffd@0: marg = marginal_nodes(engine, S); wolffd@0: marg.T wolffd@0: ans = wolffd@0: 0.57024 wolffd@0: 0.42976 wolffd@0: p = marg.T(2); wolffd@0:wolffd@0: We see that p = 0.4298. wolffd@0:
wolffd@0: Now let us add the evidence that it was raining, and see what wolffd@0: difference it makes. wolffd@0:
wolffd@0: evidence{R} = 2; wolffd@0: [engine, loglik] = enter_evidence(engine, evidence); wolffd@0: marg = marginal_nodes(engine, S); wolffd@0: p = marg.T(2); wolffd@0:wolffd@0: We find that p = P(S=2|W=2,R=2) = 0.1945, wolffd@0: which is lower than wolffd@0: before, because the rain can ``explain away'' the wolffd@0: fact that the grass is wet. wolffd@0:
wolffd@0: You can plot a marginal distribution over a discrete variable wolffd@0: as a barchart using the built 'bar' function: wolffd@0:
wolffd@0: bar(marg.T) wolffd@0:wolffd@0: This is what it looks like wolffd@0: wolffd@0:
wolffd@0:
wolffd@0: wolffd@0:
wolffd@0: evidence = cell(1,N); wolffd@0: evidence{W} = 2; wolffd@0: engine = enter_evidence(engine, evidence); wolffd@0: m = marginal_nodes(engine, W); wolffd@0: m.T wolffd@0: ans = wolffd@0: 1 wolffd@0:wolffd@0: This can get a little confusing, since we assigned W=2. wolffd@0: So we can ask BNT to add the evidence back in by passing in an optional argument: wolffd@0:
wolffd@0: m = marginal_nodes(engine, W, 1); wolffd@0: m.T wolffd@0: ans = wolffd@0: 0 wolffd@0: 1 wolffd@0:wolffd@0: This shows that P(W=1|W=2) = 0 and P(W=2|W=2) = 1. wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: evidence = cell(1,N); wolffd@0: [engine, ll] = enter_evidence(engine, evidence); wolffd@0: m = marginal_nodes(engine, [S R W]); wolffd@0:wolffd@0: m is a structure. The 'T' field is a multi-dimensional array (in wolffd@0: this case, 3-dimensional) that contains the joint probability wolffd@0: distribution on the specified nodes. wolffd@0:
wolffd@0: >> m.T wolffd@0: ans(:,:,1) = wolffd@0: 0.2900 0.0410 wolffd@0: 0.0210 0.0009 wolffd@0: ans(:,:,2) = wolffd@0: 0 0.3690 wolffd@0: 0.1890 0.0891 wolffd@0:wolffd@0: We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass wolffd@0: to be wet if both the rain and sprinkler are off. wolffd@0:
wolffd@0: Let us now add some evidence to R. wolffd@0:
wolffd@0: evidence{R} = 2; wolffd@0: [engine, ll] = enter_evidence(engine, evidence); wolffd@0: m = marginal_nodes(engine, [S R W]) wolffd@0: m = wolffd@0: domain: [2 3 4] wolffd@0: T: [2x1x2 double] wolffd@0: >> m.T wolffd@0: m.T wolffd@0: ans(:,:,1) = wolffd@0: 0.0820 wolffd@0: 0.0018 wolffd@0: ans(:,:,2) = wolffd@0: 0.7380 wolffd@0: 0.1782 wolffd@0:wolffd@0: The joint T(i,j,k) = P(S=i,R=j,W=k|evidence) wolffd@0: should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible wolffd@0: with the evidence that R=2. wolffd@0: Instead of creating large tables with many 0s, BNT sets the effective wolffd@0: size of observed (discrete) nodes to 1, as explained above. wolffd@0: This is why m.T has size 2x1x2. wolffd@0: To get a 2x2x2 table, type wolffd@0:
wolffd@0: m = marginal_nodes(engine, [S R W], 1) wolffd@0: m = wolffd@0: domain: [2 3 4] wolffd@0: T: [2x2x2 double] wolffd@0: >> m.T wolffd@0: m.T wolffd@0: ans(:,:,1) = wolffd@0: 0 0.082 wolffd@0: 0 0.0018 wolffd@0: ans(:,:,2) = wolffd@0: 0 0.738 wolffd@0: 0 0.1782 wolffd@0:wolffd@0: wolffd@0:
wolffd@0: Note: It is not always possible to compute the joint on arbitrary wolffd@0: sets of nodes: it depends on which inference engine you use, as discussed wolffd@0: in more detail below. wolffd@0: wolffd@0: wolffd@0:
wolffd@0: [engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence); wolffd@0:wolffd@0: where soft_evidence{i} is either [] (if node i has no soft evidence) wolffd@0: or is a vector representing the probability distribution over i's wolffd@0: possible values. wolffd@0: For example, if we don't know i's exact value, but we know its wolffd@0: likelihood ratio is 60/40, we can write evidence{i} = [] and wolffd@0: soft_evidence{i} = [0.6 0.4]. wolffd@0:
wolffd@0: Currently only jtree_inf_engine supports this option. wolffd@0: It assumes that all hidden nodes, and all nodes for wolffd@0: which we have soft evidence, are discrete. wolffd@0: For a longer example, see BNT/examples/static/softev1.m. wolffd@0: wolffd@0: wolffd@0:
wolffd@0: [mpe, ll] = calc_mpe(engine, evidence); wolffd@0:wolffd@0: mpe{i} is the most likely value of node i. wolffd@0: This calls enter_evidence with the 'maximize' flag set to 1, which wolffd@0: causes the engine to do max-product instead of sum-product. wolffd@0: The resulting max-marginals are then thresholded. wolffd@0: If there is more than one maximum probability assignment, we must take wolffd@0: care to break ties in a consistent manner (thresholding the wolffd@0: max-marginals may give the wrong result). To force this behavior, wolffd@0: type wolffd@0:
wolffd@0: [mpe, ll] = calc_mpe(engine, evidence, 1); wolffd@0:wolffd@0: Note that computing the MPE is someties called abductive reasoning. wolffd@0: wolffd@0:
wolffd@0: You can also use calc_mpe_bucket written by Ron Zohar, wolffd@0: that does a forwards max-product pass, and then a backwards traceback wolffd@0: pass, which is how Viterbi is traditionally implemented. wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: A B P(C=off) P(C=on) wolffd@0: --------------------------- wolffd@0: F F 1.0 0.0 wolffd@0: T F q(A) 1-q(A) wolffd@0: F T q(B) 1-q(B) wolffd@0: T T q(A)q(B) q-q(A)q(B) wolffd@0:wolffd@0: Thus we see that the causes get inhibited independently. wolffd@0: It is common to associate a "leak" node with a noisy-or CPD, which is wolffd@0: like a parent that is always on. This can account for all other unmodelled wolffd@0: causes which might turn the node on. wolffd@0:
wolffd@0: The noisy-or distribution is similar to the logistic distribution. wolffd@0: To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j) wolffd@0: be the prob. that j inhibits i. Then wolffd@0:
wolffd@0: Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j) wolffd@0:wolffd@0: Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then wolffd@0:
wolffd@0: Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j)) wolffd@0:wolffd@0: For a sigmoid node, we have wolffd@0:
wolffd@0: Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j)) wolffd@0:wolffd@0: where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of wolffd@0: the activation function (although both are monotonically increasing). wolffd@0: In addition, in the case of a noisy-or, the weights are constrained to be wolffd@0: positive, since they derive from probabilities q(i,j). wolffd@0: In both cases, the number of parameters is linear in the wolffd@0: number of parents, unlike the case of a multinomial distribution, wolffd@0: where the number of parameters is exponential in the number of parents. wolffd@0: We will see an example of noisy-OR nodes below. wolffd@0: wolffd@0: wolffd@0:
wolffd@0: Both of these classes are just "syntactic sugar" for the tabular_CPD wolffd@0: class. wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: exp(w(:,i)'*x + b(i)) wolffd@0: Pr(Q=i | X=x) = ----------------------------- wolffd@0: sum_j exp(w(:,j)'*x + b(j)) wolffd@0: wolffd@0:wolffd@0: The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the wolffd@0: following interpretation: w(:,i)-w(:,j) is the normal vector to the wolffd@0: decision boundary between classes i and j, wolffd@0: and b(i)-b(j) is its offset (bias). For example, suppose wolffd@0: X is a 2-vector, and Q is binary. Then wolffd@0:
wolffd@0: w = [1 -1; wolffd@0: 0 0]; wolffd@0: wolffd@0: b = [0 0]; wolffd@0:wolffd@0: means class 1 are points in the 2D plane with positive x coordinate, wolffd@0: and class 2 are points in the 2D plane with negative x coordinate. wolffd@0: If w has large magnitude, the decision boundary is sharp, otherwise it wolffd@0: is soft. wolffd@0: In the special case that Q is binary (0/1), the softmax function reduces to the logistic wolffd@0: (sigmoid) function. wolffd@0:
wolffd@0: Fitting a softmax function can be done using the iteratively reweighted wolffd@0: least squares (IRLS) algorithm. wolffd@0: We use the implementation from wolffd@0: Netlab. wolffd@0: Note that since wolffd@0: the softmax distribution is not in the exponential family, it does not wolffd@0: have finite sufficient statistics, and hence we must store all the wolffd@0: training data in uncompressed form. wolffd@0: If this takes too much space, one should use online (stochastic) gradient wolffd@0: descent (not implemented in BNT). wolffd@0:
wolffd@0: If a softmax node also has discrete parents, wolffd@0: we use a different set of w/b parameters for each combination of wolffd@0: parent values, as in the conditional linear wolffd@0: Gaussian CPD. wolffd@0: This feature was implemented by Pierpaolo Brutti. wolffd@0: He is currently extending it so that discrete parents can be treated wolffd@0: as if they were continuous, by adding indicator variables to the X wolffd@0: vector. wolffd@0:
wolffd@0: We will see an example of softmax nodes below. wolffd@0: wolffd@0: wolffd@0:
wolffd@0: - no parents: Y ~ N(mu, Sigma) wolffd@0: - cts parents : Y|X=x ~ N(mu + W x, Sigma) wolffd@0: - discrete parents: Y|Q=i ~ N(mu(:,i), Sigma(:,:,i)) wolffd@0: - cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i)) wolffd@0:wolffd@0: where N(mu, Sigma) denotes a Normal distribution with mean mu and wolffd@0: covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q wolffd@0: respectively. wolffd@0: If there are no discrete parents, |Q|=1; if there is wolffd@0: more than one, then |Q| = a vector of the sizes of each discrete parent. wolffd@0: If there are no continuous parents, |X|=0; if there is more than one, wolffd@0: then |X| = the sum of their sizes. wolffd@0: Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive wolffd@0: semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight) wolffd@0: matrix. wolffd@0:
wolffd@0: We can create a Gaussian node with random parameters as follows. wolffd@0:
wolffd@0: bnet.CPD{i} = gaussian_CPD(bnet, i); wolffd@0:wolffd@0: We can specify the value of one or more of the parameters as in the wolffd@0: following example, in which |Y|=2, and |Q|=1. wolffd@0:
wolffd@0: bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', [0; 0], 'weights', randn(Y,X), 'cov', eye(Y)); wolffd@0:wolffd@0:
wolffd@0: We will see an example of conditional linear Gaussian nodes below. wolffd@0:
wolffd@0: When learning Gaussians from data, it is helpful to ensure the wolffd@0: data has a small magnitde wolffd@0: (see e.g., KPMstats/standardize) to prevent numerical problems. wolffd@0: Unless you have a lot of data, it is also a very good idea to use wolffd@0: diagonal instead of full covariance matrices. wolffd@0: (BNT does not currently support spherical covariances, although it wolffd@0: would be easy to add, since KPMstats/clg_Mstep supports this option; wolffd@0: you would just need to modify gaussian_CPD/update_ess to accumulate wolffd@0: weighted inner products.) wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: The CPD_to_CPT method converts a CPD to a table; this wolffd@0: requires that the child and all parents are discrete. wolffd@0: The CPT might be exponentially big... wolffd@0: convert_to_table evaluates a CPD with evidence, and wolffd@0: represents the the resulting potential as an array. wolffd@0: This requires that the child is discrete, and any continuous parents wolffd@0: are observed. wolffd@0: convert_to_pot evaluates a CPD with evidence, and wolffd@0: represents the resulting potential as a dpot, gpot, cgpot or upot, as wolffd@0: requested. (d=discrete, g=Gaussian, cg = conditional Gaussian, u = wolffd@0: utility). wolffd@0: wolffd@0:
wolffd@0: When we sample a node, all the parents are observed. wolffd@0: When we compute the (log) probability of a node, all the parents and wolffd@0: the child are observed. wolffd@0:
wolffd@0: We also specify if the parameters are learnable. wolffd@0: For learning with EM, we require wolffd@0: the methods reset_ess, update_ess and wolffd@0: maximize_params. wolffd@0: For learning from fully observed data, we require wolffd@0: the method learn_params. wolffd@0: By default, all classes inherit this from generic_CPD, which simply wolffd@0: calls update_ess N times, once for each data case, followed wolffd@0: by maximize_params, i.e., it is like EM, without the E step. wolffd@0: Some classes implement a batch formula, which is quicker. wolffd@0:
wolffd@0: Bayesian learning means computing a posterior over the parameters wolffd@0: given fully observed data. wolffd@0:
wolffd@0: Pearl means we implement the methods compute_pi and wolffd@0: compute_lambda_msg, used by wolffd@0: pearl_inf_engine, which runs on directed graphs. wolffd@0: belprop_inf_engine only needs convert_to_pot.H wolffd@0: The pearl methods can exploit special properties of the CPDs for wolffd@0: computing the messages efficiently, whereas belprop does not. wolffd@0:
wolffd@0: The only method implemented by generic_CPD is adjustable_CPD, wolffd@0: which is not shown, since it is not very interesting. wolffd@0: wolffd@0: wolffd@0:
wolffd@0: wolffd@0: wolffd@0:
Name wolffd@0: | Child wolffd@0: | Parents wolffd@0: | Comments wolffd@0: | CPD_to_CPT wolffd@0: | conv_to_table wolffd@0: | conv_to_pot wolffd@0: | sample wolffd@0: | prob wolffd@0: | learn wolffd@0: | Bayes wolffd@0: | Pearl wolffd@0: wolffd@0: wolffd@0: |
wolffd@0: | wolffd@0: | wolffd@0: | wolffd@0: | wolffd@0: | wolffd@0: | wolffd@0: | wolffd@0: | wolffd@0: | wolffd@0: | wolffd@0: | wolffd@0: wolffd@0: |
boolean wolffd@0: | B wolffd@0: | B wolffd@0: | Syntactic sugar for tabular wolffd@0: | - wolffd@0: | - wolffd@0: | - wolffd@0: | - wolffd@0: | - wolffd@0: | - wolffd@0: | - wolffd@0: | - wolffd@0: wolffd@0: |
deterministic wolffd@0: | D wolffd@0: | D wolffd@0: | Syntactic sugar for tabular wolffd@0: | - wolffd@0: | - wolffd@0: | - wolffd@0: | - wolffd@0: | - wolffd@0: | - wolffd@0: | - wolffd@0: | - wolffd@0: wolffd@0: |
Discrete wolffd@0: | D wolffd@0: | C/D wolffd@0: | Virtual class wolffd@0: | N wolffd@0: | Calls CPD_to_CPT wolffd@0: | Calls conv_to_table wolffd@0: | Calls conv_to_table wolffd@0: | Calls conv_to_table wolffd@0: | N wolffd@0: | N wolffd@0: | N wolffd@0: wolffd@0: |
Gaussian wolffd@0: | C wolffd@0: | C/D wolffd@0: | - wolffd@0: | N wolffd@0: | N wolffd@0: | Y wolffd@0: | Y wolffd@0: | Y wolffd@0: | Y wolffd@0: | N wolffd@0: | N wolffd@0: wolffd@0: |
gmux wolffd@0: | C wolffd@0: | C/D wolffd@0: | multiplexer wolffd@0: | N wolffd@0: | N wolffd@0: | Y wolffd@0: | N wolffd@0: | N wolffd@0: | N wolffd@0: | N wolffd@0: | Y wolffd@0: wolffd@0: wolffd@0: |
MLP wolffd@0: | D wolffd@0: | C/D wolffd@0: | multi layer perceptron wolffd@0: | N wolffd@0: | Y wolffd@0: | Inherits from discrete wolffd@0: | Inherits from discrete wolffd@0: | Inherits from discrete wolffd@0: | Y wolffd@0: | N wolffd@0: | N wolffd@0: wolffd@0: wolffd@0: |
noisy-or wolffd@0: | B wolffd@0: | B wolffd@0: | - wolffd@0: | Y wolffd@0: | Inherits from discrete wolffd@0: | Inherits from discrete wolffd@0: | Inherits from discrete wolffd@0: | Inherits from discrete wolffd@0: | N wolffd@0: | N wolffd@0: | Y wolffd@0: wolffd@0: wolffd@0: |
root wolffd@0: | C/D wolffd@0: | none wolffd@0: | no params wolffd@0: | N wolffd@0: | N wolffd@0: | Y wolffd@0: | Y wolffd@0: | Y wolffd@0: | N wolffd@0: | N wolffd@0: | N wolffd@0: wolffd@0: wolffd@0: |
softmax wolffd@0: | D wolffd@0: | C/D wolffd@0: | - wolffd@0: | N wolffd@0: | Y wolffd@0: | Inherits from discrete wolffd@0: | Inherits from discrete wolffd@0: | Inherits from discrete wolffd@0: | Y wolffd@0: | N wolffd@0: | N wolffd@0: wolffd@0: wolffd@0: |
generic wolffd@0: | C/D wolffd@0: | C/D wolffd@0: | Virtual class wolffd@0: | N wolffd@0: | N wolffd@0: | N wolffd@0: | N wolffd@0: | N wolffd@0: | N wolffd@0: | N wolffd@0: | N wolffd@0: wolffd@0: wolffd@0: |
Tabular wolffd@0: | D wolffd@0: | D wolffd@0: | - wolffd@0: | Y wolffd@0: | Inherits from discrete wolffd@0: | Inherits from discrete wolffd@0: | Inherits from discrete wolffd@0: | Inherits from discrete wolffd@0: | Y wolffd@0: | Y wolffd@0: | Y wolffd@0: wolffd@0: |
wolffd@0: | wolffd@0: | wolffd@0: | wolffd@0: |
(a) wolffd@0: | (b) wolffd@0: | (c) wolffd@0: | (d) wolffd@0: |
wolffd@0: We can create this model in BNT as follows. wolffd@0:
wolffd@0: ns = [k D]; wolffd@0: dag = zeros(2,2); wolffd@0: dag(1,2) = 1; wolffd@0: bnet = mk_bnet(dag, ns, 'discrete', []); wolffd@0: bnet.CPD{1} = gaussian_CPD(bnet, 1, 'mean', zeros(k,1), 'cov', eye(k), ... wolffd@0: 'cov_type', 'diag', 'clamp_mean', 1, 'clamp_cov', 1); wolffd@0: bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(D,1), 'cov', diag(Psi0), 'weights', W0, ... wolffd@0: 'cov_type', 'diag', 'clamp_mean', 1); wolffd@0:wolffd@0: wolffd@0: The root node is clamped to the N(0,I) distribution, so that we will wolffd@0: not update these parameters during learning. wolffd@0: The mean of the leaf node is clamped to 0, wolffd@0: since we assume the data has been centered (had its mean subtracted wolffd@0: off); this is just for simplicity. wolffd@0: Finally, the covariance of the leaf node is constrained to be wolffd@0: diagonal. W0 and Psi0 are the initial parameter guesses. wolffd@0: wolffd@0:
wolffd@0: We can fit this model (i.e., estimate its parameters in a maximum wolffd@0: likelihood (ML) sense) using EM, as we wolffd@0: explain below. wolffd@0: Not surprisingly, the ML estimates for mu and Psi turn out to be wolffd@0: identical to the wolffd@0: sample mean and variance, which can be computed directly as wolffd@0:
wolffd@0: mu_ML = mean(data); wolffd@0: Psi_ML = diag(cov(data)); wolffd@0:wolffd@0: Note that W can only be identified up to a rotation matrix, because of wolffd@0: the spherical symmetry of the source. wolffd@0: wolffd@0:
wolffd@0: If we restrict Psi to be spherical, i.e., Psi = sigma*I, wolffd@0: there is a closed-form solution for W as well, wolffd@0: i.e., we do not need to use EM. wolffd@0: In particular, W contains the first |X| eigenvectors of the sample covariance wolffd@0: matrix, with scalings determined by the eigenvalues and sigma. wolffd@0: Classical PCA can be obtained by taking the sigma->0 limit. wolffd@0: For details, see wolffd@0: wolffd@0:
wolffd@0:
wolffd@0: By adding a hidden discrete variable, we can create mixtures of FA wolffd@0: models, as shown in (c). wolffd@0: Now we can explain the data using a set of subspaces. wolffd@0: We can create this model in BNT as follows. wolffd@0:
wolffd@0: ns = [M k D]; wolffd@0: dag = zeros(3); wolffd@0: dag(1,3) = 1; wolffd@0: dag(2,3) = 1; wolffd@0: bnet = mk_bnet(dag, ns, 'discrete', 1); wolffd@0: bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0); wolffd@0: bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(k, 1), 'cov', eye(k), 'cov_type', 'diag', ... wolffd@0: 'clamp_mean', 1, 'clamp_cov', 1); wolffd@0: bnet.CPD{3} = gaussian_CPD(bnet, 3, 'mean', Mu0', 'cov', repmat(diag(Psi0), [1 1 M]), ... wolffd@0: 'weights', W0, 'cov_type', 'diag', 'tied_cov', 1); wolffd@0:wolffd@0: Notice how the covariance matrix for Y is the same for all values of wolffd@0: Q; that is, the noise level in each sub-space is assumed the same. wolffd@0: However, we allow the offset, mu, to vary. wolffd@0: For details, see wolffd@0:
wolffd@0:
wolffd@0: I have included Zoubin's specialized MFA code (with his permission) wolffd@0: with the toolbox, so you can check that BNT gives the same results: wolffd@0: see 'BNT/examples/static/mfa1.m'. wolffd@0: wolffd@0:
wolffd@0: Independent Factor Analysis (IFA) generalizes FA by allowing a wolffd@0: non-Gaussian prior on each component of X. wolffd@0: (Note that we can approximate a non-Gaussian prior using a mixture of wolffd@0: Gaussians.) wolffd@0: This means that the likelihood function is no longer rotationally wolffd@0: invariant, so we can uniquely identify W and the hidden wolffd@0: sources X. wolffd@0: IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y). wolffd@0: We recover classical Independent Components Analysis (ICA) wolffd@0: in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the wolffd@0: weight matrix W is square and invertible. wolffd@0: For details, see wolffd@0:
wolffd@0:
wolffd@0: wolffd@0: |
wolffd@0: X is the observed wolffd@0: input, Y is the output, and wolffd@0: the Q nodes are hidden "gating" nodes, which select the appropriate wolffd@0: set of parameters for Y. During training, Y is assumed observed, wolffd@0: but for testing, the goal is to predict Y given X. wolffd@0: Note that this is a conditional density model, so we don't wolffd@0: associate any parameters with X. wolffd@0: Hence X's CPD will be a root CPD, which is a way of modelling wolffd@0: exogenous nodes. wolffd@0: If the output is a continuous-valued quantity, wolffd@0: we assume the "experts" are linear-regression units, wolffd@0: and set Y's CPD to linear-Gaussian. wolffd@0: If the output is discrete, we set Y's CPD to a softmax function. wolffd@0: The Q CPDs will always be softmax functions. wolffd@0: wolffd@0:
wolffd@0: As a concrete example, consider the mixture of experts model where X and Y are wolffd@0: scalars, and Q is binary. wolffd@0: This is just piecewise linear regression, where wolffd@0: we have two line segments, i.e., wolffd@0:
wolffd@0: wolffd@0:
wolffd@0: We can create this model with random parameters as follows. wolffd@0: (This code is bundled in BNT/examples/static/mixexp2.m.) wolffd@0:
wolffd@0: X = 1; wolffd@0: Q = 2; wolffd@0: Y = 3; wolffd@0: dag = zeros(3,3); wolffd@0: dag(X,[Q Y]) = 1 wolffd@0: dag(Q,Y) = 1; wolffd@0: ns = [1 2 1]; % make X and Y scalars, and have 2 experts wolffd@0: onodes = [1 3]; wolffd@0: bnet = mk_bnet(dag, ns, 'discrete', 2, 'observed', onodes); wolffd@0: wolffd@0: rand('state', 0); wolffd@0: randn('state', 0); wolffd@0: bnet.CPD{1} = root_CPD(bnet, 1); wolffd@0: bnet.CPD{2} = softmax_CPD(bnet, 2); wolffd@0: bnet.CPD{3} = gaussian_CPD(bnet, 3); wolffd@0:wolffd@0: Now let us fit this model using EM. wolffd@0: First we load the data (1000 training cases) and plot them. wolffd@0:
wolffd@0:
wolffd@0: data = load('/examples/static/Misc/mixexp_data.txt', '-ascii'); wolffd@0: plot(data(:,1), data(:,2), '.'); wolffd@0:wolffd@0:
wolffd@0:
wolffd@0: This is what the model looks like before training. wolffd@0: (Thanks to Thomas Hofman for writing this plotting routine.) wolffd@0:
wolffd@0:
wolffd@0: Now let's train the model, and plot the final performance. wolffd@0: (We will discuss how to train models in more detail below.) wolffd@0:
wolffd@0:
wolffd@0: ncases = size(data, 1); % each row of data is a training case wolffd@0: cases = cell(3, ncases); wolffd@0: cases([1 3], :) = num2cell(data'); % each column of cases is a training case wolffd@0: engine = jtree_inf_engine(bnet); wolffd@0: max_iter = 20; wolffd@0: [bnet2, LLtrace] = learn_params_em(engine, cases, max_iter); wolffd@0:wolffd@0: (We specify which nodes will be observed when we create the engine. wolffd@0: Hence BNT knows that the hidden nodes are all discrete. wolffd@0: For complex models, this can lead to a significant speedup.) wolffd@0: Below we show what the model looks like after 16 iterations of EM wolffd@0: (with 100 IRLS iterations per M step), when it converged wolffd@0: using the default convergence tolerance (that the wolffd@0: fractional change in the log-likelihood be less than 1e-3). wolffd@0: Before learning, the log-likelihood was wolffd@0: -322.927442; afterwards, it was -13.728778. wolffd@0:
wolffd@0:
wolffd@0:
wolffd@0: Pierpaolo Brutti wolffd@0: has written an extensive set of routines for HMEs, wolffd@0: which are bundled with BNT: see the examples/static/HME directory. wolffd@0: These routines allow you to choose the number of hidden (gating) wolffd@0: layers, and the form of the experts (softmax or MLP). wolffd@0: See the file hmemenu, which provides a demo. wolffd@0: For example, the figure below shows the decision boundaries learned wolffd@0: for a ternary classification problem, using a 2 level HME with softmax wolffd@0: gates and softmax experts; the training set is on the left, the wolffd@0: testing set on the right. wolffd@0:
wolffd@0:
wolffd@0: wolffd@0: wolffd@0:
wolffd@0: For more details, see the following: wolffd@0:
wolffd@0:
wolffd@0: function bnet = mk_qmr_bnet(G, inhibit, leak, prior) wolffd@0: % MK_QMR_BNET Make a QMR model wolffd@0: % bnet = mk_qmr_bnet(G, inhibit, leak, prior) wolffd@0: % wolffd@0: % G(i,j) = 1 iff there is an arc from disease i to finding j wolffd@0: % inhibit(i,j) = inhibition probability on i->j arc wolffd@0: % leak(j) = inhibition prob. on leak->j arc wolffd@0: % prior(i) = prob. disease i is on wolffd@0: wolffd@0: [Ndiseases Nfindings] = size(inhibit); wolffd@0: N = Ndiseases + Nfindings; wolffd@0: finding_node = Ndiseases+1:N; wolffd@0: ns = 2*ones(1,N); wolffd@0: dag = zeros(N,N); wolffd@0: dag(1:Ndiseases, finding_node) = G; wolffd@0: bnet = mk_bnet(dag, ns, 'observed', finding_node); wolffd@0: wolffd@0: for d=1:Ndiseases wolffd@0: CPT = [1-prior(d) prior(d)]; wolffd@0: bnet.CPD{d} = tabular_CPD(bnet, d, CPT'); wolffd@0: end wolffd@0: wolffd@0: for i=1:Nfindings wolffd@0: fnode = finding_node(i); wolffd@0: ps = parents(G, i); wolffd@0: bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(ps, i)); wolffd@0: end wolffd@0:wolffd@0: In the file BNT/examples/static/qmr1, we create a random bipartite wolffd@0: graph G, with 5 diseases and 10 findings, and random parameters. wolffd@0: (In general, to create a random dag, use 'mk_random_dag'.) wolffd@0: We can visualize the resulting graph structure using wolffd@0: the methods discussed below, with the wolffd@0: following results: wolffd@0:
wolffd@0: wolffd@0: wolffd@0:
wolffd@0: Now let us put some random evidence on all the leaves except the very wolffd@0: first and very last, and compute the disease posteriors. wolffd@0:
wolffd@0: pos = 2:floor(Nfindings/2); wolffd@0: neg = (pos(end)+1):(Nfindings-1); wolffd@0: onodes = myunion(pos, neg); wolffd@0: evidence = cell(1, N); wolffd@0: evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos))); wolffd@0: evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg))); wolffd@0: wolffd@0: engine = jtree_inf_engine(bnet); wolffd@0: [engine, ll] = enter_evidence(engine, evidence); wolffd@0: post = zeros(1, Ndiseases); wolffd@0: for i=diseases(:)' wolffd@0: m = marginal_nodes(engine, i); wolffd@0: post(i) = m.T(2); wolffd@0: end wolffd@0:wolffd@0: Junction tree can be quite slow on large QMR models. wolffd@0: Fortunately, it is possible to exploit properties of the noisy-OR wolffd@0: function to speed up exact inference using an algorithm called wolffd@0: quickscore, discussed below. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: We now give an example of a CG model, from wolffd@0: the paper "Propagation of Probabilities, Means amd wolffd@0: Variances in Mixed Graphical Association Models", Steffen Lauritzen, wolffd@0: JASA 87(420):1098--1108, 1992 (reprinted in the book "Probabilistic Networks and Expert wolffd@0: Systems", R. G. Cowell, A. P. Dawid, S. L. Lauritzen and wolffd@0: D. J. Spiegelhalter, Springer, 1999.) wolffd@0: wolffd@0:
wolffd@0:
wolffd@0: wolffd@0: We can create this model as follows. wolffd@0:
wolffd@0: F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9; wolffd@0: n = 9; wolffd@0: wolffd@0: dag = zeros(n); wolffd@0: dag(F,E)=1; wolffd@0: dag(W,[E Min D]) = 1; wolffd@0: dag(E,D)=1; wolffd@0: dag(B,[C D])=1; wolffd@0: dag(D,[L Mout])=1; wolffd@0: dag(Min,Mout)=1; wolffd@0: wolffd@0: % node sizes - all cts nodes are scalar, all discrete nodes are binary wolffd@0: ns = ones(1, n); wolffd@0: dnodes = [F W B]; wolffd@0: cnodes = mysetdiff(1:n, dnodes); wolffd@0: ns(dnodes) = 2; wolffd@0: wolffd@0: bnet = mk_bnet(dag, ns, 'discrete', dnodes); wolffd@0:wolffd@0: 'dnodes' is a list of the discrete nodes; 'cnodes' is the continuous wolffd@0: nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'. wolffd@0:
wolffd@0: wolffd@0: wolffd@0:
wolffd@0: bnet.CPD{B} = tabular_CPD(bnet, B, 'CPT', [0.85 0.15]); % 1=stable, 2=unstable wolffd@0: bnet.CPD{F} = tabular_CPD(bnet, F, 'CPT', [0.95 0.05]); % 1=intact, 2=defect wolffd@0: bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [2/7 5/7]); % 1=industrial, 2=household wolffd@0:wolffd@0: wolffd@0:
wolffd@0: The parameters of the continuous nodes can be specified as follows. wolffd@0:
wolffd@0: bnet.CPD{E} = gaussian_CPD(bnet, E, 'mean', [-3.9 -0.4 -3.2 -0.5], ... wolffd@0: 'cov', [0.00002 0.0001 0.00002 0.0001]); wolffd@0: bnet.CPD{D} = gaussian_CPD(bnet, D, 'mean', [6.5 6.0 7.5 7.0], ... wolffd@0: 'cov', [0.03 0.04 0.1 0.1], 'weights', [1 1 1 1]); wolffd@0: bnet.CPD{C} = gaussian_CPD(bnet, C, 'mean', [-2 -1], 'cov', [0.1 0.3]); wolffd@0: bnet.CPD{L} = gaussian_CPD(bnet, L, 'mean', 3, 'cov', 0.25, 'weights', -0.5); wolffd@0: bnet.CPD{Min} = gaussian_CPD(bnet, Min, 'mean', [0.5 -0.5], 'cov', [0.01 0.005]); wolffd@0: bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 'mean', 0, 'cov', 0.002, 'weights', [1 1]); wolffd@0:wolffd@0: wolffd@0: wolffd@0:
wolffd@0: engine = jtree_inf_engine(bnet); wolffd@0: evidence = cell(1,n); wolffd@0: [engine, ll] = enter_evidence(engine, evidence); wolffd@0: marg = marginal_nodes(engine, E); wolffd@0:wolffd@0: wolffd@0: 'marg' is a structure that contains the fields 'mu' and 'Sigma', which wolffd@0: contain the mean and (co)variance of the marginal on E. wolffd@0: In this case, they are both scalars. wolffd@0: Let us check they match the published figures (to 2 decimal places). wolffd@0: wolffd@0:
wolffd@0: tol = 1e-2; wolffd@0: assert(approxeq(marg.mu, -3.25, tol)); wolffd@0: assert(approxeq(sqrt(marg.Sigma), 0.709, tol)); wolffd@0:wolffd@0: We can compute the other posteriors similarly. wolffd@0: Now let us add some evidence. wolffd@0:
wolffd@0: evidence = cell(1,n); wolffd@0: evidence{W} = 1; % industrial wolffd@0: evidence{L} = 1.1; wolffd@0: evidence{C} = -0.9; wolffd@0: [engine, ll] = enter_evidence(engine, evidence); wolffd@0:wolffd@0: Now we find wolffd@0:
wolffd@0: marg = marginal_nodes(engine, E); wolffd@0: assert(approxeq(marg.mu, -3.8983, tol)); wolffd@0: assert(approxeq(sqrt(marg.Sigma), 0.0763, tol)); wolffd@0:wolffd@0: wolffd@0: wolffd@0: We can also compute the joint probability on a set of nodes. wolffd@0: For example, P(D, Mout | evidence) is a 2D Gaussian: wolffd@0:
wolffd@0: marg = marginal_nodes(engine, [D Mout]) wolffd@0: marg = wolffd@0: domain: [6 8] wolffd@0: mu: [2x1 double] wolffd@0: Sigma: [2x2 double] wolffd@0: T: 1.0000 wolffd@0:wolffd@0: The mean is wolffd@0:
wolffd@0: marg.mu wolffd@0: ans = wolffd@0: 3.6077 wolffd@0: 4.1077 wolffd@0:wolffd@0: and the covariance matrix is wolffd@0:
wolffd@0: marg.Sigma wolffd@0: ans = wolffd@0: 0.1062 0.1062 wolffd@0: 0.1062 0.1182 wolffd@0:wolffd@0: It is easy to visualize this posterior using standard Matlab plotting wolffd@0: functions, e.g., wolffd@0:
wolffd@0: gaussplot2d(marg.mu, marg.Sigma); wolffd@0:wolffd@0: produces the following picture. wolffd@0: wolffd@0:
wolffd@0:
wolffd@0: wolffd@0: wolffd@0: The T field indicates that the mixing weight of this Gaussian wolffd@0: component is 1.0. wolffd@0: If the joint contains discrete and continuous variables, the result wolffd@0: will be a mixture of Gaussians, e.g., wolffd@0:
wolffd@0: marg = marginal_nodes(engine, [F E]) wolffd@0: domain: [1 3] wolffd@0: mu: [-3.9000 -0.4003] wolffd@0: Sigma: [1x1x2 double] wolffd@0: T: [0.9995 4.7373e-04] wolffd@0:wolffd@0: The interpretation is wolffd@0: Sigma(i,j,k) = Cov[ E(i) E(j) | F=k ]. wolffd@0: In this case, E is a scalar, so i=j=1; k specifies the mixture component. wolffd@0:
wolffd@0: We saw in the sprinkler network that BNT sets the effective size of wolffd@0: observed discrete nodes to 1, since they only have one legal value. wolffd@0: For continuous nodes, BNT sets their length to 0, wolffd@0: since they have been reduced to a point. wolffd@0: For example, wolffd@0:
wolffd@0: marg = marginal_nodes(engine, [B C]) wolffd@0: domain: [4 5] wolffd@0: mu: [] wolffd@0: Sigma: [] wolffd@0: T: [0.0123 0.9877] wolffd@0:wolffd@0: It is simple to post-process the output of marginal_nodes. wolffd@0: For example, the file BNT/examples/static/cg1 sets the mu term of wolffd@0: observed nodes to their observed value, and the Sigma term to 0 (since wolffd@0: observed nodes have no variance). wolffd@0: wolffd@0:
wolffd@0: Note that the implemented version of the junction tree is numerically wolffd@0: unstable when using CG potentials wolffd@0: (which is why, in the example above, we only required our answers to agree with wolffd@0: the published ones to 2dp.) wolffd@0: This is why you might want to use stab_cond_gauss_inf_engine, wolffd@0: implemented by Shan Huang. This is described in wolffd@0: wolffd@0:
wolffd@0: wolffd@0:
wolffd@0: | Full obs | wolffd@0:Partial obs | wolffd@0:
---|---|---|
Point | wolffd@0:learn_params | wolffd@0:learn_params_em | wolffd@0:
Bayes | wolffd@0:bayes_update_params | wolffd@0:not yet supported | wolffd@0:
wolffd@0: 011979 1626.5 0.0 wolffd@0: 021979 1367.0 0.0 wolffd@0: ... wolffd@0:wolffd@0: you can use wolffd@0:
wolffd@0: data = load('dat.txt'); wolffd@0:wolffd@0: or wolffd@0:
wolffd@0: load dat.txt -ascii wolffd@0:wolffd@0: In the latter case, the data is stored in a variable called 'dat' (the wolffd@0: filename minus the extension). wolffd@0: Alternatively, suppose the data is stored in a .csv file (has commas wolffd@0: separating the columns, and contains a header line), such as wolffd@0:
wolffd@0: header info goes here wolffd@0: ORD,011979,1626.5,0.0 wolffd@0: DSM,021979,1367.0,0.0 wolffd@0: ... wolffd@0:wolffd@0: You can load this using wolffd@0:
wolffd@0: [a,b,c,d] = textread('dat.txt', '%s %d %f %f', 'delimiter', ',', 'headerlines', 1); wolffd@0:wolffd@0: If your file is not in either of these formats, you can either use Perl to convert wolffd@0: it to this format, or use the Matlab scanf command. wolffd@0: Type wolffd@0: wolffd@0: help iofun wolffd@0: wolffd@0: for more information on Matlab's file functions. wolffd@0: wolffd@0:
wolffd@0: BNT learning routines require data to be stored in a cell array. wolffd@0: data{i,m} is the value of node i in case (example) m, i.e., each wolffd@0: column is a case. wolffd@0: If node i is not observed in case m (missing value), set wolffd@0: data{i,m} = []. wolffd@0: (Not all the learning routines can cope with such missing values, however.) wolffd@0: In the special case that all the nodes are observed and are wolffd@0: scalar-valued (as opposed to vector-valued), the data can be wolffd@0: stored in a matrix (as opposed to a cell-array). wolffd@0:
wolffd@0: Suppose, as in the mixture of experts example, wolffd@0: that we have 3 nodes in the graph: X(1) is the observed input, X(3) is wolffd@0: the observed output, and X(2) is a hidden (gating) node. We can wolffd@0: create the dataset as follows. wolffd@0:
wolffd@0: data = load('dat.txt'); wolffd@0: ncases = size(data, 1); wolffd@0: cases = cell(3, ncases); wolffd@0: cases([1 3], :) = num2cell(data'); wolffd@0:wolffd@0: Notice how we transposed the data, to convert rows into columns. wolffd@0: Also, cases{2,m} = [] for all m, since X(2) is always hidden. wolffd@0: wolffd@0: wolffd@0:
wolffd@0: samples = cell(N, nsamples); wolffd@0: for i=1:nsamples wolffd@0: samples(:,i) = sample_bnet(bnet); wolffd@0: end wolffd@0:wolffd@0: samples{j,i} contains the value of the j'th node in case i. wolffd@0: sample_bnet returns a cell array because, in general, each node might wolffd@0: be a vector of different length. wolffd@0: In this case, all nodes are discrete (and hence scalars), so we wolffd@0: could have used a regular array instead (which can be quicker): wolffd@0:
wolffd@0: data = cell2num(samples); wolffd@0:wolffd@0: Now we create a network with random parameters. wolffd@0: (The initial values of bnet2 don't matter in this case, since we can find the wolffd@0: globally optimal MLE independent of where we start.) wolffd@0:
wolffd@0: % Make a tabula rasa wolffd@0: bnet2 = mk_bnet(dag, node_sizes); wolffd@0: seed = 0; wolffd@0: rand('state', seed); wolffd@0: bnet2.CPD{C} = tabular_CPD(bnet2, C); wolffd@0: bnet2.CPD{R} = tabular_CPD(bnet2, R); wolffd@0: bnet2.CPD{S} = tabular_CPD(bnet2, S); wolffd@0: bnet2.CPD{W} = tabular_CPD(bnet2, W); wolffd@0:wolffd@0: Finally, we find the maximum likelihood estimates of the parameters. wolffd@0:
wolffd@0: bnet3 = learn_params(bnet2, samples); wolffd@0:wolffd@0: To view the learned parameters, we use a little Matlab hackery. wolffd@0:
wolffd@0: CPT3 = cell(1,N); wolffd@0: for i=1:N wolffd@0: s=struct(bnet3.CPD{i}); % violate object privacy wolffd@0: CPT3{i}=s.CPT; wolffd@0: end wolffd@0:wolffd@0: Here are the parameters learned for node 4. wolffd@0:
wolffd@0: dispcpt(CPT3{4}) wolffd@0: 1 1 : 1.0000 0.0000 wolffd@0: 2 1 : 0.2000 0.8000 wolffd@0: 1 2 : 0.2273 0.7727 wolffd@0: 2 2 : 0.0000 1.0000 wolffd@0:wolffd@0: So we see that the learned parameters are fairly close to the "true" wolffd@0: ones, which we display below. wolffd@0:
wolffd@0: dispcpt(CPT{4}) wolffd@0: 1 1 : 1.0000 0.0000 wolffd@0: 2 1 : 0.1000 0.9000 wolffd@0: 1 2 : 0.1000 0.9000 wolffd@0: 2 2 : 0.0100 0.9900 wolffd@0:wolffd@0: We can get better results by using a larger training set, or using wolffd@0: informative priors (see below). wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: The Dirichlet has a simple interpretation in terms of pseudo counts. wolffd@0: If we let N_ijk = the num. times X_i=k and Pa_i=j occurs in the wolffd@0: training set, where Pa_i are the parents of X_i, wolffd@0: then the maximum likelihood (ML) estimate is wolffd@0: T_ijk = N_ijk / N_ij (where N_ij = sum_k' N_ijk'), which will be 0 if N_ijk=0. wolffd@0: To prevent us from declaring that (X_i=k, Pa_i=j) is impossible just because this wolffd@0: event was not seen in the training set, wolffd@0: we can pretend we saw value k of X_i, for each value j of Pa_i some number (alpha_ijk) wolffd@0: of times in the past. wolffd@0: The MAP (maximum a posterior) estimate is then wolffd@0:
wolffd@0: T_ijk = (N_ijk + alpha_ijk) / (N_ij + alpha_ij) wolffd@0:wolffd@0: and is never 0 if all alpha_ijk > 0. wolffd@0: For example, consider the network A->B, where A is binary and B has 3 wolffd@0: values. wolffd@0: A uniform prior for B has the form wolffd@0:
wolffd@0: B=1 B=2 B=3 wolffd@0: A=1 1 1 1 wolffd@0: A=2 1 1 1 wolffd@0:wolffd@0: which can be created using wolffd@0:
wolffd@0: tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'unif'); wolffd@0:wolffd@0: This prior does not satisfy the likelihood equivalence principle, wolffd@0: which says that Markov equivalent models wolffd@0: should have the same marginal likelihood. wolffd@0: A prior that does satisfy this principle is shown below. wolffd@0: Heckerman (1995) calls this the wolffd@0: BDeu prior (likelihood equivalent uniform Bayesian Dirichlet). wolffd@0:
wolffd@0: B=1 B=2 B=3 wolffd@0: A=1 1/6 1/6 1/6 wolffd@0: A=2 1/6 1/6 1/6 wolffd@0:wolffd@0: where we put N/(q*r) in each bin; N is the equivalent sample size, wolffd@0: r=|A|, q = |B|. wolffd@0: This can be created as follows wolffd@0:
wolffd@0: tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu'); wolffd@0:wolffd@0: Here, 1 is the equivalent sample size, and is the strength of the wolffd@0: prior. wolffd@0: You can change this using wolffd@0:
wolffd@0: tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', ... wolffd@0: 'BDeu', 'dirichlet_weight', 10); wolffd@0:wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: cases = sample_bnet(bnet, nsamples); wolffd@0: bnet = bayes_update_params(bnet, cases); wolffd@0: LL = log_marg_lik_complete(bnet, cases); wolffd@0:wolffd@0: bnet.CPD{i}.prior contains the new Dirichlet pseudocounts, wolffd@0: and bnet.CPD{i}.CPT is set to the mean of the posterior (the wolffd@0: normalized counts). wolffd@0: (Hence if the initial pseudo counts are 0, wolffd@0: bayes_update_params and learn_params will give the wolffd@0: same result.) wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: We can compute the same result sequentially (on-line) as follows. wolffd@0:
wolffd@0: LL = 0; wolffd@0: for m=1:nsamples wolffd@0: LL = LL + log_marg_lik_complete(bnet, cases(:,m)); wolffd@0: bnet = bayes_update_params(bnet, cases(:,m)); wolffd@0: end wolffd@0:wolffd@0: wolffd@0: The file BNT/examples/static/StructLearn/model_select1 has an example of wolffd@0: sequential model selection which uses the same idea. wolffd@0: We generate data from the model A->B wolffd@0: and compute the posterior prob of all 3 dags on 2 nodes: wolffd@0: (1) A B, (2) A <- B , (3) A -> B wolffd@0: Models 2 and 3 are Markov equivalent, and therefore indistinguishable from wolffd@0: observational data alone, so we expect their posteriors to be the same wolffd@0: (assuming a prior which satisfies likelihood equivalence). wolffd@0: If we use random parameters, the "true" model only gets a higher posterior after 2000 trials! wolffd@0: However, if we make B a noisy NOT gate, the true model "wins" after 12 wolffd@0: trials, as shown below (red = model 1, blue/green (superimposed) wolffd@0: represents models 2/3). wolffd@0:
wolffd@0: wolffd@0:
wolffd@0: The use of marginal likelihood for model selection is discussed in wolffd@0: greater detail in the wolffd@0: section on structure learning. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: samples2 = samples; wolffd@0: hide = rand(N, nsamples) > 0.5; wolffd@0: [I,J]=find(hide); wolffd@0: for k=1:length(I) wolffd@0: samples2{I(k), J(k)} = []; wolffd@0: end wolffd@0:wolffd@0: samples2{i,l} is the value of node i in training case l, or [] if unobserved. wolffd@0:
wolffd@0: Now we will compute the MLEs using the EM algorithm. wolffd@0: We need to use an inference algorithm to compute the expected wolffd@0: sufficient statistics in the E step; the M (maximization) step is as wolffd@0: above. wolffd@0:
wolffd@0: engine2 = jtree_inf_engine(bnet2); wolffd@0: max_iter = 10; wolffd@0: [bnet4, LLtrace] = learn_params_em(engine2, samples2, max_iter); wolffd@0:wolffd@0: LLtrace(i) is the log-likelihood at iteration i. We can plot this as wolffd@0: follows: wolffd@0:
wolffd@0: plot(LLtrace, 'x-') wolffd@0:wolffd@0: Let's display the results after 10 iterations of EM. wolffd@0:
wolffd@0: celldisp(CPT4) wolffd@0: CPT4{1} = wolffd@0: 0.6616 wolffd@0: 0.3384 wolffd@0: CPT4{2} = wolffd@0: 0.6510 0.3490 wolffd@0: 0.8751 0.1249 wolffd@0: CPT4{3} = wolffd@0: 0.8366 0.1634 wolffd@0: 0.0197 0.9803 wolffd@0: CPT4{4} = wolffd@0: (:,:,1) = wolffd@0: 0.8276 0.0546 wolffd@0: 0.5452 0.1658 wolffd@0: (:,:,2) = wolffd@0: 0.1724 0.9454 wolffd@0: 0.4548 0.8342 wolffd@0:wolffd@0: We can get improved performance by using one or more of the following wolffd@0: methods: wolffd@0:
wolffd@0: When we have tied parameters, there is no longer a one-to-one wolffd@0: correspondence between nodes and CPDs. wolffd@0: Rather, each CPD species the parameters for a whole equivalence class wolffd@0: of nodes. wolffd@0: It is easiest to see this by example. wolffd@0: Consider the following hidden Markov wolffd@0: model (HMM) wolffd@0:
wolffd@0: wolffd@0:
wolffd@0: wolffd@0: When HMMs are used for semi-infinite processes like speech recognition, wolffd@0: we assume the transition matrix wolffd@0: P(H(t+1)|H(t)) is the same for all t; this is called a time-invariant wolffd@0: or homogenous Markov chain. wolffd@0: Hence hidden nodes 2, 3, ..., T wolffd@0: are all in the same equivalence class, say class Hclass. wolffd@0: Similarly, the observation matrix P(O(t)|H(t)) is assumed to be the wolffd@0: same for all t, so the observed nodes are all in the same equivalence wolffd@0: class, say class Oclass. wolffd@0: Finally, the prior term P(H(1)) is in a class all by itself, say class wolffd@0: H1class. wolffd@0: This is illustrated below, where we explicitly represent the wolffd@0: parameters as random variables (dotted nodes). wolffd@0:
wolffd@0: wolffd@0:
wolffd@0: In BNT, we cannot represent parameters as random variables (nodes). wolffd@0: Instead, we "hide" the wolffd@0: parameters inside one CPD for each equivalence class, wolffd@0: and then specify that the other CPDs should share these parameters, as wolffd@0: follows. wolffd@0:
wolffd@0: hnodes = 1:2:2*T; wolffd@0: onodes = 2:2:2*T; wolffd@0: H1class = 1; Hclass = 2; Oclass = 3; wolffd@0: eclass = ones(1,N); wolffd@0: eclass(hnodes(2:end)) = Hclass; wolffd@0: eclass(hnodes(1)) = H1class; wolffd@0: eclass(onodes) = Oclass; wolffd@0: % create dag and ns in the usual way wolffd@0: bnet = mk_bnet(dag, ns, 'discrete', dnodes, 'equiv_class', eclass); wolffd@0:wolffd@0: Finally, we define the parameters for each equivalence class: wolffd@0:
wolffd@0: bnet.CPD{H1class} = tabular_CPD(bnet, hnodes(1)); % prior wolffd@0: bnet.CPD{Hclass} = tabular_CPD(bnet, hnodes(2)); % transition matrix wolffd@0: if cts_obs wolffd@0: bnet.CPD{Oclass} = gaussian_CPD(bnet, onodes(1)); wolffd@0: else wolffd@0: bnet.CPD{Oclass} = tabular_CPD(bnet, onodes(1)); wolffd@0: end wolffd@0:wolffd@0: In general, if bnet.CPD{e} = xxx_CPD(bnet, j), then j should be a wolffd@0: member of e's equivalence class; that is, it is not always the case wolffd@0: that e == j. You can use bnet.rep_of_eclass(e) to return the wolffd@0: representative of equivalence class e. wolffd@0: BNT will look up the parents of j to determine the size wolffd@0: of the CPT to use. It assumes that this is the same for all members of wolffd@0: the equivalence class. wolffd@0: Click here for wolffd@0: a more complex example of parameter tying. wolffd@0:
wolffd@0: Note: wolffd@0: Normally one would define an HMM as a wolffd@0: Dynamic Bayes Net wolffd@0: (see the function BNT/examples/dynamic/mk_chmm.m). wolffd@0: However, one can define an HMM as a static BN using the function wolffd@0: BNT/examples/static/Models/mk_hmm_bnet.m. wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: wolffd@0: There are two very different approaches to structure learning: wolffd@0: constraint-based and search-and-score. wolffd@0: In the constraint-based approach, wolffd@0: we start with a fully connected graph, and remove edges if certain wolffd@0: conditional independencies are measured in the data. wolffd@0: This has the disadvantage that repeated independence tests lose wolffd@0: statistical power. wolffd@0:
wolffd@0: In the more popular search-and-score approach, wolffd@0: we perform a search through the space of possible DAGs, and either wolffd@0: return the best one found (a point estimate), or return a sample of the wolffd@0: models found (an approximation to the Bayesian posterior). wolffd@0:
wolffd@0: Unfortunately, the number of DAGs as a function of the number of wolffd@0: nodes, G(n), is super-exponential in n. wolffd@0: A closed form formula for G(n) is not known, but the first few values wolffd@0: are shown below (from Cooper, 1999). wolffd@0: wolffd@0:
n | G(n) |
---|---|
1 | 1 |
2 | 3 |
3 | 25 |
4 | 543 |
5 | 29,281 |
6 | 3,781,503 |
7 | 1.1 x 10^9 |
8 | 7.8 x 10^11 |
9 | 1.2 x 10^15 |
10 | 4.2 x 10^18 |
wolffd@0: If we know a total ordering on the nodes, wolffd@0: finding the best structure amounts to picking the best set of parents wolffd@0: for each node independently. wolffd@0: This is what the K2 algorithm does. wolffd@0: If the ordering is unknown, we can search over orderings, wolffd@0: which is more efficient than searching over DAGs (Koller and Friedman, 2000). wolffd@0:
wolffd@0: In addition to the search procedure, we must specify the scoring wolffd@0: function. There are two popular choices. The Bayesian score integrates wolffd@0: out the parameters, i.e., it is the marginal likelihood of the model. wolffd@0: The BIC (Bayesian Information Criterion) is defined as wolffd@0: log P(D|theta_hat) - 0.5*d*log(N), where D is the data, theta_hat is wolffd@0: the ML estimate of the parameters, d is the number of parameters, and wolffd@0: N is the number of data cases. wolffd@0: The BIC method has the advantage of not requiring a prior. wolffd@0:
wolffd@0: BIC can be derived as a large sample wolffd@0: approximation to the marginal likelihood. wolffd@0: (It is also equal to the Minimum Description Length of a model.) wolffd@0: However, in practice, the sample size does not need to be very large wolffd@0: for the approximation to be good. wolffd@0: For example, in the figure below, we plot the ratio between the log marginal likelihood wolffd@0: and the BIC score against data-set size; we see that the ratio rapidly wolffd@0: approaches 1, especially for non-informative priors. wolffd@0: (This plot was generated by the file BNT/examples/static/bic1.m. It wolffd@0: uses the water sprinkler BN with BDeu Dirichlet priors with different wolffd@0: equivalent sample sizes.) wolffd@0: wolffd@0:
wolffd@0:
wolffd@0: wolffd@0:
wolffd@0: As with parameter learning, handling missing data/ hidden variables is wolffd@0: much harder than the fully observed case. wolffd@0: The structure learning routines in BNT can therefore be classified into 4 wolffd@0: types, analogously to the parameter learning case. wolffd@0:
wolffd@0: wolffd@0:
wolffd@0: | Full obs | wolffd@0:Partial obs | wolffd@0:
---|---|---|
Point | wolffd@0:learn_struct_K2 wolffd@0: wolffd@0: | not yet supported | wolffd@0:
Bayes | wolffd@0:learn_struct_mcmc | wolffd@0:not yet supported | wolffd@0:
wolffd@0: When learning graph structure from observational data, wolffd@0: the best one can hope to do is to identify the model up to Markov wolffd@0: equivalence. To distinguish amongst graphs within the same equivalence wolffd@0: class, one needs interventional data: see the discussion on active learning below. wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: dags = mk_all_dags(N); wolffd@0: score = score_dags(data, ns, dags); wolffd@0:wolffd@0: where data(i,m) is the value of node i in case m, wolffd@0: and ns(i) is the size of node i. wolffd@0: If the DAGs have a lot of families in common, we can cache the sufficient statistics, wolffd@0: making this potentially more efficient than scoring the DAGs one at a time. wolffd@0: (Caching is not currently implemented, however.) wolffd@0:
wolffd@0: By default, we use the Bayesian scoring metric, and assume CPDs are wolffd@0: represented by tables with BDeu(1) priors. wolffd@0: We can override these defaults as follows. wolffd@0: If we want to use uniform priors, we can say wolffd@0:
wolffd@0: params = cell(1,N); wolffd@0: for i=1:N wolffd@0: params{i} = {'prior', 'unif'}; wolffd@0: end wolffd@0: score = score_dags(data, ns, dags, 'params', params); wolffd@0:wolffd@0: params{i} is a cell-array, containing optional arguments that are wolffd@0: passed to the constructor for CPD i. wolffd@0:
wolffd@0: Now suppose we want to use different node types, e.g., wolffd@0: Suppose nodes 1 and 2 are Gaussian, and nodes 3 and 4 softmax (both wolffd@0: these CPDs can support discrete and continuous parents, which is wolffd@0: necessary since all other nodes will be considered as parents). wolffd@0: The Bayesian scoring metric currently only works for tabular CPDs, so wolffd@0: we will use BIC: wolffd@0:
wolffd@0: score = score_dags(data, ns, dags, 'discrete', [3 4], 'params', [], wolffd@0: 'type', {'gaussian', 'gaussian', 'softmax', softmax'}, 'scoring_fn', 'bic') wolffd@0:wolffd@0: In practice, one can't enumerate all possible DAGs for N > 5, wolffd@0: but one can evaluate any reasonably-sized set of hypotheses in this wolffd@0: way (e.g., nearest neighbors of your current best guess). wolffd@0: Think of this as "computer assisted model refinement" as opposed to de wolffd@0: novo learning. wolffd@0: wolffd@0: wolffd@0:
wolffd@0: The original paper used the Bayesian scoring wolffd@0: metric with tabular CPDs and Dirichlet priors. wolffd@0: BNT generalizes this to allow any kind of CPD, and either the Bayesian wolffd@0: scoring metric or BIC, as in the example above. wolffd@0: In addition, you can specify wolffd@0: an optional upper bound on the number of parents for each node. wolffd@0: The file BNT/examples/static/k2demo1.m gives an example of how to use K2. wolffd@0: We use the water sprinkler network and sample 100 cases from it as before. wolffd@0: Then we see how much data it takes to recover the generating structure: wolffd@0:
wolffd@0: order = [C S R W]; wolffd@0: max_fan_in = 2; wolffd@0: sz = 5:5:100; wolffd@0: for i=1:length(sz) wolffd@0: dag2 = learn_struct_K2(data(:,1:sz(i)), node_sizes, order, 'max_fan_in', max_fan_in); wolffd@0: correct(i) = isequal(dag, dag2); wolffd@0: end wolffd@0:wolffd@0: Here are the results. wolffd@0:
wolffd@0: correct = wolffd@0: Columns 1 through 12 wolffd@0: 0 0 0 0 0 0 0 1 0 1 1 1 wolffd@0: Columns 13 through 20 wolffd@0: 1 1 1 1 1 1 1 1 wolffd@0:wolffd@0: So we see it takes about sz(10)=50 cases. (BIC behaves similarly, wolffd@0: showing that the prior doesn't matter too much.) wolffd@0: In general, we cannot hope to recover the "true" generating structure, wolffd@0: only one that is in its Markov equivalence wolffd@0: class. wolffd@0: wolffd@0: wolffd@0:
wolffd@0: A common definition of "neighbor" is all graphs that can be wolffd@0: generated from the current graph by adding, deleting or reversing a wolffd@0: single arc, subject to the acyclicity constraint. wolffd@0: Other neighborhoods are possible: see wolffd@0: wolffd@0: Optimal Structure Identification with Greedy Search, Max wolffd@0: Chickering, JMLR 2002. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: The function can be called wolffd@0: as in the following example. wolffd@0:
wolffd@0: [sampled_graphs, accept_ratio] = learn_struct_mcmc(data, ns, 'nsamples', 100, 'burnin', 10); wolffd@0:wolffd@0: We can convert our set of sampled graphs to a histogram wolffd@0: (empirical posterior over all the DAGs) thus wolffd@0:
wolffd@0: all_dags = mk_all_dags(N); wolffd@0: mcmc_post = mcmc_sample_to_hist(sampled_graphs, all_dags); wolffd@0:wolffd@0: To see how well this performs, let us compute the exact posterior exhaustively. wolffd@0:
wolffd@0:
wolffd@0: score = score_dags(data, ns, all_dags); wolffd@0: post = normalise(exp(score)); % assuming uniform structural prior wolffd@0:wolffd@0: We plot the results below. wolffd@0: (The data set was 100 samples drawn from a random 4 node bnet; see the wolffd@0: file BNT/examples/static/mcmc1.) wolffd@0:
wolffd@0: subplot(2,1,1) wolffd@0: bar(post) wolffd@0: subplot(2,1,2) wolffd@0: bar(mcmc_post) wolffd@0:wolffd@0: wolffd@0:
wolffd@0: We can also plot the acceptance ratio versus number of MCMC steps, wolffd@0: as a crude convergence diagnostic. wolffd@0:
wolffd@0: clf wolffd@0: plot(accept_ratio) wolffd@0:wolffd@0: wolffd@0:
wolffd@0: Even though the number of samples needed by MCMC is theoretically wolffd@0: polynomial (not exponential) in the dimensionality of the search space, in practice it has been wolffd@0: found that MCMC does not converge in reasonable time for graphs with wolffd@0: more than about 10 nodes. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: Most of the scoring functions accept an optional argument wolffd@0: that specifies whether a node was observed to have a certain value, or wolffd@0: was forced to have that value: we set clamped(i,m)=1 if node i was wolffd@0: forced in training case m. e.g., see the file wolffd@0: BNT/examples/static/cooper_yoo. wolffd@0:
wolffd@0: An interesting question is to decide which interventions to perform wolffd@0: (c.f., design of experiments). For details, see the following tech wolffd@0: report wolffd@0:
wolffd@0: Wei Hu has implemented SEM for discrete nodes. wolffd@0: You can download his package from wolffd@0: here. wolffd@0: Please address all questions about this code to wolffd@0: wei.hu@intel.com. wolffd@0: See also Phl's implementation of SEM. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: draw_graph(bnet.dag); wolffd@0:wolffd@0: For example, this is the output produced on a wolffd@0: random QMR-like model: wolffd@0:
wolffd@0: wolffd@0:
wolffd@0: If you install the excellent graphhviz, an wolffd@0: open-source graph visualization package from AT&T, wolffd@0: you can create a much better visualization as follows wolffd@0:
wolffd@0: graph_to_dot(bnet.dag) wolffd@0:wolffd@0: This works by converting the adjacency matrix to a file suitable wolffd@0: for input to graphviz (using the dot format), wolffd@0: then converting the output of graphviz to postscript, and displaying the results using wolffd@0: ghostview. wolffd@0: You can do each of these steps separately for more control, as shown wolffd@0: below. wolffd@0:
wolffd@0: graph_to_dot(bnet.dag, 'filename', 'foo.dot'); wolffd@0: dot -Tps foo.dot -o foo.ps wolffd@0: ghostview foo.ps & wolffd@0:wolffd@0: wolffd@0:
wolffd@0: IC*/FCI extend IC/PC to handle latent variables: see below. wolffd@0: (IC stands for inductive causation; PC stands for Peter and Clark, wolffd@0: the first names of Spirtes and Glymour; FCI stands for fast causal wolffd@0: inference. wolffd@0: What we, following Pearl (2000), call IC* was called wolffd@0: IC in the original Pearl and Verma paper.) wolffd@0: For details, see wolffd@0:
wolffd@0: wolffd@0: The PC algorithm takes as arguments a function f, the number of nodes N, wolffd@0: the maximum fan in K, and additional arguments A which are passed to f. wolffd@0: The function f(X,Y,S,A) returns 1 if X is conditionally independent of Y given S, and 0 wolffd@0: otherwise. wolffd@0: For example, suppose we cheat by wolffd@0: passing in a CI "oracle" which has access to the true DAG; the oracle wolffd@0: tests for d-separation in this DAG, i.e., wolffd@0: f(X,Y,S) calls dsep(X,Y,S,dag). We can to this as follows. wolffd@0:
wolffd@0: pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag); wolffd@0:wolffd@0: pdag(i,j) = -1 if there is definitely an i->j arc, wolffd@0: and pdag(i,j) = 1 if there is either an i->j or and i<-j arc. wolffd@0:
wolffd@0: Applied to the sprinkler network, this returns wolffd@0:
wolffd@0: pdag = wolffd@0: 0 1 1 0 wolffd@0: 1 0 0 -1 wolffd@0: 1 0 0 -1 wolffd@0: 0 0 0 0 wolffd@0:wolffd@0: So as expected, we see that the V-structure at the W node is uniquely identified, wolffd@0: but the other arcs have ambiguous orientation. wolffd@0:
wolffd@0: We now give an example from p141 (1st edn) / p103 (2nd end) of the SGS wolffd@0: book. wolffd@0: This example concerns the female orgasm. wolffd@0: We are given a correlation matrix C between 7 measured factors (such wolffd@0: as subjective experiences of coital and masturbatory experiences), wolffd@0: derived from 281 samples, and want to learn a causal model of the wolffd@0: data. We will not discuss the merits of this type of work here, but wolffd@0: merely show how to reproduce the results in the SGS book. wolffd@0: Their program, wolffd@0: Tetrad, wolffd@0: makes use of the Fisher Z-test for conditional wolffd@0: independence, so we do the same: wolffd@0:
wolffd@0: max_fan_in = 4; wolffd@0: nsamples = 281; wolffd@0: alpha = 0.05; wolffd@0: pdag = learn_struct_pdag_pc('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha); wolffd@0:wolffd@0: In this case, the CI test is wolffd@0:
wolffd@0: f(X,Y,S) = cond_indep_fisher_z(X,Y,S, C,nsamples,alpha) wolffd@0:wolffd@0: The results match those of Fig 12a of SGS apart from two edge wolffd@0: differences; presumably this is due to rounding error (although it wolffd@0: could be a bug, either in BNT or in Tetrad). wolffd@0: This example can be found in the file BNT/examples/static/pc2.m. wolffd@0: wolffd@0:
wolffd@0: wolffd@0: The IC* algorithm (Pearl and Verma, 1991), wolffd@0: and the faster FCI algorithm (Spirtes, Glymour, and Scheines 1993), wolffd@0: are like the IC/PC algorithm, except that they can detect the presence wolffd@0: of latent variables. wolffd@0: See the file learn_struct_pdag_ic_star written by Tamar wolffd@0: Kushnir. The output is a matrix P, defined as follows wolffd@0: (see Pearl (2000), p52 for details): wolffd@0:
wolffd@0: % P(i,j) = -1 if there is either a latent variable L such that i <-L->j OR there is a directed edge from i->j. wolffd@0: % P(i,j) = -2 if there is a marked directed i-*>j edge. wolffd@0: % P(i,j) = P(j,i) = 1 if there is and undirected edge i--j wolffd@0: % P(i,j) = P(j,i) = 2 if there is a latent variable L such that i<-L->j. wolffd@0:wolffd@0: wolffd@0: wolffd@0:
wolffd@0: An inference engine is an object that contains a bnet and supports the wolffd@0: 'enter_evidence' and 'marginal_nodes' methods. The engine constructor wolffd@0: takes the bnet as argument and may do some model-specific processing. wolffd@0: When 'enter_evidence' is called, the engine may do some wolffd@0: evidence-specific processing. Finally, when 'marginal_nodes' is wolffd@0: called, the engine may do some query-specific processing. wolffd@0: wolffd@0:
wolffd@0: The amount of work done when each stage is specified -- structure, wolffd@0: parameters, evidence, and query -- depends on the engine. The cost of wolffd@0: work done early in this sequence can be amortized. On the other hand, wolffd@0: one can make better optimizations if one waits until later in the wolffd@0: sequence. wolffd@0: For example, the parameters might imply wolffd@0: conditional indpendencies that are not evident in the graph structure, wolffd@0: but can nevertheless be exploited; the evidence indicates which nodes wolffd@0: are observed and hence can effectively be disconnected from the wolffd@0: graph; and the query might indicate that large parts of the network wolffd@0: are d-separated from the query nodes. (Since it is not the actual wolffd@0: values of the evidence that matters, just which nodes are observed, wolffd@0: many engines allow you to specify which nodes will be observed when they are constructed, wolffd@0: i.e., before calling 'enter_evidence'. Some engines can still cope if wolffd@0: the actual pattern of evidence is different, e.g., if there is missing wolffd@0: data.) wolffd@0:
wolffd@0: wolffd@0: Although being maximally lazy (i.e., only doing work when a query is wolffd@0: issued) may seem desirable, wolffd@0: this is not always the most efficient. wolffd@0: For example, wolffd@0: when learning using EM, we need to call marginal_nodes N times, where N is the wolffd@0: number of nodes. Variable elimination would end wolffd@0: up repeating a lot of work wolffd@0: each time marginal_nodes is called, making it inefficient for wolffd@0: learning. The junction tree algorithm, by contrast, uses dynamic wolffd@0: programming to avoid this redundant computation --- it calculates all wolffd@0: marginals in two passes during 'enter_evidence', so calling wolffd@0: 'marginal_nodes' takes constant time. wolffd@0:
wolffd@0: We will discuss some of the inference algorithms implemented in BNT wolffd@0: below, and finish with a summary of all wolffd@0: of them. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: The principle of distributing sums over products can be generalized wolffd@0: greatly to apply to any commutative semiring. wolffd@0: This forms the basis of many common algorithms, such as Viterbi wolffd@0: decoding and the Fast Fourier Transform. For details, see wolffd@0: wolffd@0:
wolffd@0: Choosing an order in which to sum out the variables so as to minimize wolffd@0: computational cost is known to be NP-hard. wolffd@0: The implementation of this algorithm in wolffd@0: var_elim_inf_engine makes no attempt to optimize this wolffd@0: ordering (in contrast, say, to jtree_inf_engine, which uses a wolffd@0: greedy search procedure to find a good ordering). wolffd@0:
wolffd@0: Note: unlike most algorithms, var_elim does all its computational work wolffd@0: inside of marginal_nodes, not inside of wolffd@0: enter_evidence. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: Three specialized versions of this algorithm have also been implemented, wolffd@0: corresponding to the cases where all the nodes are discrete (D), all wolffd@0: are Gaussian (G), and some are discrete and some Gaussian (CG). wolffd@0: They are called enumerative_inf_engine, wolffd@0: gaussian_inf_engine, wolffd@0: and cond_gauss_inf_engine respectively. wolffd@0:
wolffd@0: Note: unlike most algorithms, these global inference algorithms do all their computational work wolffd@0: inside of marginal_nodes, not inside of wolffd@0: enter_evidence. wolffd@0: wolffd@0: wolffd@0:
wolffd@0: wolffd@0: A much more significant speedup is obtained by exploiting special wolffd@0: properties of the noisy-or node, as done by the quickscore wolffd@0: algorithm. For details, see wolffd@0:
wolffd@0: engine = quickscore_inf_engine(inhibit, leak, prior); wolffd@0: engine = enter_evidence(engine, pos, neg); wolffd@0: m = marginal_nodes(engine, i); wolffd@0:wolffd@0: wolffd@0: wolffd@0:
wolffd@0: engine = pearl_inf_engine(bnet, 'max_iter', 30); wolffd@0: engine = enter_evidence(engine, evidence); wolffd@0: m = marginal_nodes(engine, i); wolffd@0:wolffd@0: We found that this algorithm often converges, and when it does, often wolffd@0: is very accurate, but it depends on the precise setting of the wolffd@0: parameter values of the network. wolffd@0: (See the file BNT/examples/static/qmr1 to repeat the experiment for yourself.) wolffd@0: Understanding when and why belief propagation converges/ works wolffd@0: is a topic of ongoing research. wolffd@0:
wolffd@0: pearl_inf_engine can exploit special structure in noisy-or wolffd@0: and gmux nodes to compute messages efficiently. wolffd@0:
wolffd@0: belprop_inf_engine is like pearl, but uses potentials to wolffd@0: represent messages. Hence this is slower. wolffd@0:
wolffd@0: belprop_fg_inf_engine is like belprop, wolffd@0: but is designed for factor graphs. wolffd@0: wolffd@0: wolffd@0: wolffd@0:
wolffd@0: In terms of topology, most engines handle any kind of DAG. wolffd@0: belprop_fg does approximate inference on factor graphs (FG), which wolffd@0: can be used to represent directed, undirected, and mixed (chain) wolffd@0: graphs. wolffd@0: (In the future, we plan to support exact inference on chain graphs.) wolffd@0: quickscore only works on QMR-like models. wolffd@0:
wolffd@0: In terms of node types: algorithms that use potentials can handle wolffd@0: discrete (D), Gaussian (G) or conditional Gaussian (CG) models. wolffd@0: Sampling algorithms can essentially handle any kind of node (distribution). wolffd@0: Other algorithms make more restrictive assumptions in exchange for wolffd@0: speed. wolffd@0:
wolffd@0: Finally, most algorithms are designed to give the exact answer. wolffd@0: The belief propagation algorithms are exact if applied to trees, and wolffd@0: in some other cases. wolffd@0: Sampling is considered approximate, even though, in the limit of an wolffd@0: infinite number of samples, it gives the exact answer. wolffd@0: wolffd@0:
wolffd@0: wolffd@0: Here is a summary of the properties wolffd@0: of all the engines in BNT which work on static networks. wolffd@0:
wolffd@0:
Name wolffd@0: | Exact? wolffd@0: | Node type? wolffd@0: | topology wolffd@0: |
belprop wolffd@0: | approx wolffd@0: | D wolffd@0: | DAG wolffd@0: |
belprop_fg wolffd@0: | approx wolffd@0: | D wolffd@0: | factor graph wolffd@0: |
cond_gauss wolffd@0: | exact wolffd@0: | CG wolffd@0: | DAG wolffd@0: |
enumerative wolffd@0: | exact wolffd@0: | D wolffd@0: | DAG wolffd@0: |
gaussian wolffd@0: | exact wolffd@0: | G wolffd@0: | DAG wolffd@0: |
gibbs wolffd@0: | approx wolffd@0: | D wolffd@0: | DAG wolffd@0: |
global_joint wolffd@0: | exact wolffd@0: | D,G,CG wolffd@0: | DAG wolffd@0: |
jtree wolffd@0: | exact wolffd@0: | D,G,CG wolffd@0: | DAG wolffd@0: b |
likelihood_weighting wolffd@0: | approx wolffd@0: | any wolffd@0: | DAG wolffd@0: |
pearl wolffd@0: | approx wolffd@0: | D,G wolffd@0: | DAG wolffd@0: |
pearl wolffd@0: | exact wolffd@0: | D,G wolffd@0: | polytree wolffd@0: |
quickscore wolffd@0: | exact wolffd@0: | noisy-or wolffd@0: | QMR wolffd@0: |
stab_cond_gauss wolffd@0: | exact wolffd@0: | CG wolffd@0: | DAG wolffd@0: |
var_elim wolffd@0: | exact wolffd@0: | D,G,CG wolffd@0: | DAG wolffd@0: |
wolffd@0: See the examples in BNT/examples/limids for details. wolffd@0: wolffd@0: wolffd@0: wolffd@0: wolffd@0: