<HEAD>
<TITLE>How to use the Bayes Net Toolbox</TITLE>
</HEAD>

<BODY BGCOLOR="#FFFFFF">
<!-- white background is better for the pictures and equations -->

<h1>How to use the Bayes Net Toolbox</h1>

This documentation was last updated on 13 November 2002.
<br>
Click <a href="changelog.html">here</a> for a list of changes made to
BNT.
<br>
Click
<a href="http://bnt.insa-rouen.fr/">here</a>
for a French version of this documentation (which might not
be up-to-date).


<p>

<ul>
<li> <a href="#install">Installation</a>
<ul>
<li> <a href="#installM">Installing the Matlab code</a>
<li> <a href="#installC">Installing the C code</a>
<li> <a href="../matlab_tips.html">Useful Matlab tips</a>.
</ul>

<li> <a href="#basics">Creating your first Bayes net</a>
<ul>
<li> <a href="#basics">Creating a model by hand</a>
<li> <a href="#file">Loading a model from a file</a>
<li> <a href="http://bnt.insa-rouen.fr/ajouts.html">Creating a model using a GUI</a>
</ul>

<li> <a href="#inference">Inference</a>
<ul>
<li> <a href="#marginal">Computing marginal distributions</a>
<li> <a href="#joint">Computing joint distributions</a>
<li> <a href="#soft">Soft/virtual evidence</a>
<li> <a href="#mpe">Most probable explanation</a>
</ul>

<li> <a href="#cpd">Conditional Probability Distributions</a>
<ul>
<li> <a href="#tabular">Tabular (multinomial) nodes</a>
<li> <a href="#noisyor">Noisy-or nodes</a>
<li> <a href="#deterministic">Other (noisy) deterministic nodes</a>
<li> <a href="#softmax">Softmax (multinomial logit) nodes</a>
<li> <a href="#mlp">Neural network nodes</a>
<li> <a href="#root">Root nodes</a>
<li> <a href="#gaussian">Gaussian nodes</a>
<li> <a href="#glm">Generalized linear model nodes</a>
<li> <a href="#dtree">Classification/regression tree nodes</a>
<li> <a href="#nongauss">Other continuous distributions</a>
<li> <a href="#cpd_summary">Summary of CPD types</a>
</ul>

<li> <a href="#examples">Example models</a>
<ul>
<li> <a href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">
Gaussian mixture models</a>
<li> <a href="#pca">PCA, ICA, and all that</a>
<li> <a href="#mixexp">Mixtures of experts</a>
<li> <a href="#hme">Hierarchical mixtures of experts</a>
<li> <a href="#qmr">QMR</a>
<li> <a href="#cg_model">Conditional Gaussian models</a>
<li> <a href="#hybrid">Other hybrid models</a>
</ul>

<li> <a href="#param_learning">Parameter learning</a>
<ul>
<li> <a href="#load_data">Loading data from a file</a>
<li> <a href="#mle_complete">Maximum likelihood parameter estimation from complete data</a>
<li> <a href="#prior">Parameter priors</a>
<li> <a href="#bayes_learn">(Sequential) Bayesian parameter updating from complete data</a>
<li> <a href="#em">Maximum likelihood parameter estimation with missing values (EM)</a>
<li> <a href="#tying">Parameter tying</a>
</ul>

<li> <a href="#structure_learning">Structure learning</a>
<ul>
<li> <a href="#enumerate">Exhaustive search</a>
<li> <a href="#K2">K2</a>
<li> <a href="#hill_climb">Hill-climbing</a>
<li> <a href="#mcmc">MCMC</a>
<li> <a href="#active">Active learning</a>
<li> <a href="#struct_em">Structural EM</a>
<li> <a href="#graphdraw">Visualizing the learned graph structure</a>
<li> <a href="#constraint">Constraint-based methods</a>
</ul>


<li> <a href="#engines">Inference engines</a>
<ul>
<li> <a href="#jtree">Junction tree</a>
<li> <a href="#varelim">Variable elimination</a>
<li> <a href="#global">Global inference methods</a>
<li> <a href="#quickscore">Quickscore</a>
<li> <a href="#belprop">Belief propagation</a>
<li> <a href="#sampling">Sampling (Monte Carlo)</a>
<li> <a href="#engine_summary">Summary of inference engines</a>
</ul>


<li> <a href="#influence">Influence diagrams / decision making</a>


<li> <a href="usage_dbn.html">DBNs, HMMs, Kalman filters and all that</a>
</ul>




<h1><a name="install">Installation</h1>

<h2><a name="installM">Installing the Matlab code</h2>

<ul>
<li> <a href="bnt_download.html">Download</a> the FullBNT.zip file.

<p>
<li> Unpack the file. In Unix, type
<!--"tar xvf BNT.tar".-->
"unzip FullBNT.zip".
In Windows, use
a program like <a href="http://www.winzip.com">Winzip</a>. This will
create a directory called FullBNT, which contains BNT and other libraries.

<p>
<li> Read the file <tt>BNT/README</tt> to make sure the date
matches the one on the top of <a href=bnt.html>the BNT home page</a>.
If not, you may need to press 'refresh' on your browser, and download
again, to get the most recent version.

<p>
<li> <b>Edit the file "BNT/add_BNT_to_path.m"</b> so it contains the correct
pathname.
For example, in Windows,
I download FullBNT.zip into C:\kpmurphy\matlab, and
then comment out the second line (with the % character), and uncomment
the third line, which reads
<pre>
BNT_HOME = 'C:\kpmurphy\matlab\FullBNT';
</pre>

<p>
<li> Start up Matlab.

<p>
<li> Type "ver" at the Matlab prompt (">>").
<b>You need Matlab version 5.2 or newer to run BNT</b>.
(Versions 5.0 and 5.1 have a memory leak which seems to sometimes
crash BNT.)

<p>
<li> Move to the BNT directory.
For example, in Windows, I type
<pre>
>> cd C:\kpmurphy\matlab\FullBNT\BNT
</pre>

<p>
<li> Type "add_BNT_to_path".
This executes the command
<tt>addpath(genpath(BNT_HOME))</tt>,
which adds all directories below FullBNT to the Matlab path.

<p>
<li> Type "test_BNT".
<b>If all goes well, this will just produce a bunch of numbers and pictures.</b>
It may produce some
warning messages (which you can ignore), but should not produce any error messages.
(The warnings should only be of the form
"Warning: Maximum number of iterations has been exceeded", and are
produced by Netlab.)

<p>
<li> <a href="http://groups.yahoo.com/group/BayesNetToolbox/join">
Join the BNT email list</a>

</ul>


If you are new to Matlab, you might like to check out
<a href="matlab_tips.html">some useful Matlab tips</a>.
For instance, this explains how to create a startup file, which can be
used to set your path variable automatically, so you can avoid having
to type the above commands every time.
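<p>
For example, a minimal startup.m (just a sketch; adjust BNT_HOME to wherever
you unpacked FullBNT) could set up BNT automatically:
<pre>
% startup.m: executed automatically when Matlab starts
BNT_HOME = 'C:\kpmurphy\matlab\FullBNT';
addpath(genpath(BNT_HOME));   % the same command that add_BNT_to_path runs
</pre>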



<h2><a name="installC">Installing the C code</h2>

Some BNT functions also have C implementations.
<b>It is not necessary to install the C code</b>, but it can result in a speedup
of a factor of 2-5 for certain simple functions.
To install all the C code,
edit installC_BNT.m so it contains the right path,
then type <tt>installC_BNT</tt>.
To uninstall all the C code,
edit uninstallC_BNT.m so it contains the right path,
then type <tt>uninstallC_BNT</tt>.
For an up-to-date list of the files which have C implementations, see
BNT/installC_BNT.m.

<p>
mex is a script that lets you call C code from Matlab; it does not compile Matlab to
C (see mcc below).
If your C/C++ compiler is set up correctly, mex should work out of
the box. (Matlab 6 now ships with its own minimal C compiler.)
If not, you might need to type
<p>
<tt> mex -setup</tt>
<p>
before calling installC.
<p>
To make mex call gcc on Windows,
you must install <a
href="http://www.mrc-cbu.cam.ac.uk/Imaging/gnumex20.html">gnumex</a>.
You can use the <a href="http://www.mingw.org/">minimalist GNU for
Windows</a> version of gcc, or
the <a href="http://sources.redhat.com/cygwin/">cygwin</a> version.
<p>
In general, typing
'mex foo.c' from inside Matlab creates a file called
'foo.mexglx' or 'foo.dll' (the exact file
extension is system dependent: on Linux it is '.mexglx', on Windows it is '.dll').
The resulting file will hide the original 'foo.m' (if it existed), i.e.,
typing 'foo' at the prompt will call the compiled C version.
To reveal the original Matlab version, just delete foo.mexglx (this is
what uninstallC does).
<p>
Sometimes it takes time for Matlab to realize that the file has
changed from Matlab to C or vice versa; try typing 'clear all' or
restarting Matlab to refresh it.
To find out which version of a file you are running, type
'which foo'.
<p>
<a href="http://www.mathworks.com/products/compiler">mcc</a>, the
Matlab to C compiler, is a separate product,
and is quite different from mex. It does not yet support
objects/classes, which is why we can't compile all of BNT to C automatically.
Also, hand-written C code is usually much
better than the C code generated by mcc.


<p>
Acknowledgements:
Although I wrote some of the C code, most of
the C code (e.g., for jtree and dpot) was written by Wei Hu;
the triangulation C code was written by Ilya Shpitser.


<h1><a name="basics">Creating your first Bayes net</h1>

To define a Bayes net, you must specify the graph structure and then
the parameters. We look at each in turn, using a simple example
(adapted from Russell and
Norvig, "Artificial Intelligence: a Modern Approach", Prentice Hall,
1995, p454).


<h2>Graph structure</h2>


Consider the following network.

<p>
<center>
<IMG SRC="Figures/sprinkler.gif">
</center>
<p>

<P>
To specify this directed acyclic graph (dag), we create an adjacency matrix:
<PRE>
N = 4;
dag = zeros(N,N);
C = 1; S = 2; R = 3; W = 4;
dag(C,[R S]) = 1;
dag(R,W) = 1;
dag(S,W) = 1;
</PRE>
<P>
We have numbered the nodes as follows:
Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4.
<b>The nodes must always be numbered in topological order, i.e.,
ancestors before descendants.</b>
For a more complicated graph, this is a little inconvenient: we will
see how to get around this <a href="usage_dbn.html#bat">below</a>.
<p>
<!--
In Matlab 6, you can use logical arrays instead of double arrays,
which are 4 times smaller:
<pre>
dag = false(N,N);
dag(C,[R S]) = true;
...
</pre>
-->
<p>
A preliminary attempt to make a <b>GUI</b>
has been written by Philippe LeRay and can be downloaded
from <a href="http://bnt.insa-rouen.fr/ajouts.html">here</a>.
<p>
You can visualize the resulting graph structure using
the methods discussed <a href="#graphdraw">below</a>.

<h2>Creating the Bayes net shell</h2>

In addition to specifying the graph structure,
we must specify the size and type of each node.
If a node is discrete, its size is the
number of possible values
it can take on; if a node is continuous,
it can be a vector, and its size is the length of this vector.
In this case, we will assume all nodes are discrete and binary.
<PRE>
discrete_nodes = 1:N;
node_sizes = 2*ones(1,N);
</pre>
If the nodes were not binary, you could type e.g.,
<pre>
node_sizes = [4 2 3 5];
</pre>
meaning that Cloudy has 4 possible values,
Sprinkler has 2 possible values, etc.
Note that these are cardinal values, not ordinal, i.e.,
they are not ordered in any way, like 'low', 'medium', 'high'.
<p>
We are now ready to make the Bayes net:
<pre>
bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes);
</PRE>
By default, all nodes are assumed to be discrete, so we can also just
write
<pre>
bnet = mk_bnet(dag, node_sizes);
</PRE>
You may also specify which nodes will be observed.
If you don't know, or if this is not fixed in advance,
just use the empty list (the default).
<pre>
onodes = [];
bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);
</PRE>
Note that optional arguments are specified using a name/value syntax.
This is common for many BNT functions.
In general, to find out more about a function (e.g., which optional
arguments it takes), please see its
documentation string by typing
<pre>
help mk_bnet
</pre>
See also other <a href="matlab_tips.html">useful Matlab tips</a>.
<p>
It is possible to associate names with nodes, as follows:
<pre>
bnet = mk_bnet(dag, node_sizes, 'names', {'cloudy','S','R','W'}, 'discrete', 1:4);
</pre>
You can then refer to a node by its name:
<pre>
C = bnet.names{'cloudy'}; % bnet.names is an associative array
bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
</pre>

<h2><a name="cpt">Parameters</h2>

A model consists of the graph structure and the parameters.
The parameters are represented by CPD objects (CPD = Conditional
Probability Distribution), which define the probability distribution
of a node given its parents.
(We will use the terms "node" and "random variable" interchangeably.)
The simplest kind of CPD is a table (multi-dimensional array), which
is suitable when all the nodes are discrete-valued. Note that the discrete
values are not assumed to be ordered in any way; that is, they
represent categorical quantities, like male and female, rather than
ordinal quantities, like low, medium and high.
(We will discuss CPDs in more detail <a href="#cpd">below</a>.)
<p>
Tabular CPDs, also called CPTs (conditional probability tables),
are stored as multidimensional arrays, where the dimensions
are arranged in the same order as the nodes, e.g., the CPT for node 4
(WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself.
Hence the child is always the last dimension.
If a node has no parents, its CPT is a column vector representing its
prior.
Note that in Matlab (unlike C), arrays are indexed
from 1, and are laid out in memory such that the first index toggles
fastest, e.g., the CPT for node 4 (WetGrass) is as follows
<P>
<P><IMG ALIGN=BOTTOM SRC="Figures/CPTgrass.gif"><P>
<P>
where we have used the convention that false==1, true==2.
We can create this CPT in Matlab as follows
<PRE>
CPT = zeros(2,2,2);
CPT(1,1,1) = 1.0;
CPT(2,1,1) = 0.1;
...
</PRE>
Here is an easier way:
<PRE>
CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);
</PRE>
In fact, we don't need to reshape the array, since the CPD constructor
will do that for us. So we can just write
<pre>
bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
</pre>
The other nodes are created similarly (using the old syntax for
optional parameters)
<PRE>
bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
</PRE>


<h2><a name="rnd_cpt">Random Parameters</h2>

If we do not specify the CPT, random parameters will be
created, i.e., each "row" of the CPT will be drawn from the uniform distribution.
To ensure repeatable results, use
<pre>
rand('state', seed);
randn('state', seed);
</pre>
To control the degree of randomness (entropy),
you can sample each row of the CPT from a Dirichlet(p,p,...) distribution.
If p << 1, this encourages "deterministic" CPTs (one entry near 1, the rest near 0).
If p = 1, each entry is drawn from U[0,1].
If p >> 1, the entries will all be near 1/k, where k is the arity of
this node, i.e., each row will be nearly uniform.
You can do this as follows, assuming this node
is number i, and ns is the node_sizes.
<pre>
k = ns(i);
ps = parents(dag, i);
psz = prod(ns(ps));
CPT = sample_dirichlet(p*ones(1,k), psz);
bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT);
</pre>


<h2><a name="file">Loading a network from a file</h2>

If you already have a Bayes net represented in the XML-based
<a href="http://www.cs.cmu.edu/afs/cs/user/fgcozman/www/Research/InterchangeFormat/">
Bayes Net Interchange Format (BNIF)</a> (e.g., downloaded from the
<a href="http://www.cs.huji.ac.il/labs/compbio/Repository">
Bayes Net repository</a>),
you can convert it to BNT format using
the
<a href="http://www.digitas.harvard.edu/~ken/bif2bnt/">BIF-BNT Java
program</a> written by Ken Shan.
(This is not necessarily up-to-date.)


<h2>Creating a model using a GUI</h2>

Click <a href="http://bnt.insa-rouen.fr/ajouts.html">here</a>.



<h1><a name="inference">Inference</h1>

Having created the BN, we can now use it for inference.
There are many different algorithms for doing inference in Bayes nets,
which make different tradeoffs between speed,
complexity, generality, and accuracy.
BNT therefore offers a variety of different inference
"engines". We will discuss these
in more detail <a href="#engines">below</a>.
For now, we will use the junction tree
engine, which is the mother of all exact inference algorithms.
This can be created as follows.
<pre>
engine = jtree_inf_engine(bnet);
</pre>
The other engines have similar constructors, but might take
additional, algorithm-specific parameters.
All engines are used in the same way, once they have been created.
We illustrate this in the following sections.


<h2><a name="marginal">Computing marginal distributions</h2>

Suppose we want to compute the probability that the sprinkler was on
given that the grass is wet.
The evidence consists of the fact that W=2. All the other nodes
are hidden (unobserved). We can specify this as follows.
<pre>
evidence = cell(1,N);
evidence{W} = 2;
</pre>
We use a 1D cell array instead of a vector to
cope with the fact that nodes can be vectors of different lengths.
In addition, the value [] can be used
to denote 'no evidence', instead of having to specify the observation
pattern as a separate argument.
(Click <a href="cellarray.html">here</a> for a quick tutorial on cell
arrays in Matlab.)
<p>
We are now ready to add the evidence to the engine.
<pre>
[engine, loglik] = enter_evidence(engine, evidence);
</pre>
The behavior of this function is algorithm-specific, and is discussed
in more detail <a href="#engines">below</a>.
In the case of the jtree engine,
enter_evidence implements a two-pass message-passing scheme.
The first return argument contains the modified engine, which
incorporates the evidence. The second return argument contains the
log-likelihood of the evidence. (Not all engines are capable of
computing the log-likelihood.)
<p>
Finally, we can compute p=P(S=2|W=2) as follows.
<PRE>
marg = marginal_nodes(engine, S);
marg.T
ans =
      0.57024
      0.42976
p = marg.T(2);
</PRE>
We see that p = 0.4298.
<p>
Now let us add the evidence that it was raining, and see what
difference it makes.
<PRE>
evidence{R} = 2;
[engine, loglik] = enter_evidence(engine, evidence);
marg = marginal_nodes(engine, S);
p = marg.T(2);
</PRE>
We find that p = P(S=2|W=2,R=2) = 0.1945,
which is lower than
before, because the rain can ``explain away'' the
fact that the grass is wet.
<p>
You can plot a marginal distribution over a discrete variable
as a barchart using the built-in 'bar' function:
<pre>
bar(marg.T)
</pre>
This is what it looks like:

<p>
<center>
<IMG SRC="Figures/sprinkler_bar.gif">
</center>
<p>

<h2><a name="observed">Observed nodes</h2>

What happens if we ask for the marginal on an observed node, e.g. P(W|W=2)?
An observed discrete node effectively only has 1 value (the observed
one) --- all other values would result in 0 probability.
For efficiency, BNT treats observed (discrete) nodes as if they had
only a single value, as we see below:
<pre>
evidence = cell(1,N);
evidence{W} = 2;
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, W);
m.T
ans =
     1
</pre>
This can get a little confusing, since we assigned W=2.
So we can ask BNT to add the evidence back in by passing in an optional argument:
<pre>
m = marginal_nodes(engine, W, 1);
m.T
ans =
     0
     1
</pre>
This shows that P(W=1|W=2) = 0 and P(W=2|W=2) = 1.



<h2><a name="joint">Computing joint distributions</h2>

We can compute the joint probability on a set of nodes as in the
following example.
<pre>
evidence = cell(1,N);
[engine, ll] = enter_evidence(engine, evidence);
m = marginal_nodes(engine, [S R W]);
</pre>
m is a structure. The 'T' field is a multi-dimensional array (in
this case, 3-dimensional) that contains the joint probability
distribution on the specified nodes.
<pre>
>> m.T
ans(:,:,1) =
    0.2900    0.0410
    0.0210    0.0009
ans(:,:,2) =
         0    0.3690
    0.1890    0.0891
</pre>
We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass
to be wet if both the rain and sprinkler are off.
<p>
Let us now add some evidence to R.
<pre>
evidence{R} = 2;
[engine, ll] = enter_evidence(engine, evidence);
m = marginal_nodes(engine, [S R W])
m =
    domain: [2 3 4]
         T: [2x1x2 double]
>> m.T
ans(:,:,1) =
    0.0820
    0.0018
ans(:,:,2) =
    0.7380
    0.1782
</pre>
The joint T(i,j,k) = P(S=i,R=j,W=k|evidence)
should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible
with the evidence that R=2.
Instead of creating large tables with many 0s, BNT sets the effective
size of observed (discrete) nodes to 1, as explained above.
This is why m.T has size 2x1x2.
To get a 2x2x2 table, type
<pre>
m = marginal_nodes(engine, [S R W], 1)
m =
    domain: [2 3 4]
         T: [2x2x2 double]
>> m.T
ans(:,:,1) =
         0    0.082
         0    0.0018
ans(:,:,2) =
         0    0.738
         0    0.1782
</pre>

<p>
Note: It is not always possible to compute the joint on arbitrary
sets of nodes: it depends on which inference engine you use, as discussed
in more detail <a href="#engines">below</a>.

<h2><a name="soft">Soft/virtual evidence</h2>

Sometimes a node is not observed, but we have some distribution over
its possible values; this is often called "soft" or "virtual"
evidence.
One can use this as follows
<pre>
[engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
</pre>
where soft_evidence{i} is either [] (if node i has no soft evidence)
or is a vector representing the probability distribution over i's
possible values.
For example, if we don't know i's exact value, but we know its
likelihood ratio is 60/40, we can write evidence{i} = [] and
soft_evidence{i} = [0.6 0.4].
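<p>
For example, in the sprinkler network above, we can combine hard evidence on
WetGrass with a 70/30 belief about Cloudy as follows (a minimal sketch; the
numbers are purely illustrative):
<pre>
evidence = cell(1,N);
evidence{W} = 2;                % hard evidence: the grass is wet
soft_evidence = cell(1,N);
soft_evidence{C} = [0.7 0.3];   % virtual evidence on Cloudy
[engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
marg = marginal_nodes(engine, S);
</pre>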
<p>
Currently only jtree_inf_engine supports this option.
It assumes that all hidden nodes, and all nodes for
which we have soft evidence, are discrete.
For a longer example, see BNT/examples/static/softev1.m.


<h2><a name="mpe">Most probable explanation</h2>

To compute the most probable explanation (MPE) of the evidence (i.e.,
the most probable assignment, or a mode of the joint), use
<pre>
[mpe, ll] = calc_mpe(engine, evidence);
</pre>
mpe{i} is the most likely value of node i.
This calls enter_evidence with the 'maximize' flag set to 1, which
causes the engine to do max-product instead of sum-product.
The resulting max-marginals are then thresholded.
If there is more than one maximum probability assignment, we must take
care to break ties in a consistent manner (thresholding the
max-marginals may give the wrong result). To force this behavior,
type
<pre>
[mpe, ll] = calc_mpe(engine, evidence, 1);
</pre>
Note that computing the MPE is sometimes called abductive reasoning.

<p>
You can also use <tt>calc_mpe_bucket</tt>, written by Ron Zohar,
which does a forwards max-product pass, and then a backwards traceback
pass, which is how Viterbi is traditionally implemented.

<h1><a name="cpd">Conditional Probability Distributions</h1>

A Conditional Probability Distribution (CPD)
defines P(X(i) | X(Pa(i))), where X(i) is the i'th node, and X(Pa(i))
are the parents of node i. There are many ways to represent this
distribution, which depend in part on whether X(i) and X(Pa(i)) are
discrete, continuous, or a combination.
We will discuss various representations below.


<h2><a name="tabular">Tabular nodes</h2>

If the CPD is represented as a table (i.e., if it is a multinomial
distribution), it has a number of parameters that is exponential in
the number of parents. See the example <a href="#cpt">above</a>.


<h2><a name="noisyor">Noisy-or nodes</h2>

A noisy-OR node is like a regular logical OR gate except that
sometimes the effects of parents that are on get inhibited.
Let the prob. that parent i gets inhibited be q(i).
Then a node, C, with 2 parents, A and B, has the following CPD, where
we use F and T to represent off and on (1 and 2 in BNT).
<pre>
A  B  P(C=off)      P(C=on)
---------------------------
F  F  1.0           0.0
T  F  q(A)          1-q(A)
F  T  q(B)          1-q(B)
T  T  q(A)q(B)      1-q(A)q(B)
</pre>
Thus we see that the causes get inhibited independently.
It is common to associate a "leak" node with a noisy-or CPD, which is
like a parent that is always on. This can account for all other unmodelled
causes which might turn the node on.
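<p>
Since every entry in the table above is a function of the inhibition
probabilities, the CPT can also be written out explicitly as a tabular CPD.
Here is a minimal sketch for a hypothetical network in which node C has
exactly the two binary parents A and B, in that order; the values of q(A)
and q(B) are made up for illustration. (BNT also has a dedicated noisy-or
CPD class; see the <a href="#cpd_summary">summary</a> below.)
<pre>
qA = 0.1; qB = 0.2;             % assumed inhibition probabilities
CPT = zeros(2,2,2);             % dims: A, B, C (child last); false==1, true==2
CPT(:,:,1) = [1 qB; qA qA*qB];  % P(C=off | A,B), as in the table above
CPT(:,:,2) = 1 - CPT(:,:,1);    % P(C=on  | A,B)
bnet.CPD{C} = tabular_CPD(bnet, C, 'CPT', CPT);
</pre>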
<p>
The noisy-or distribution is similar to the logistic distribution.
To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j)
be the prob. that j inhibits i. Then
<pre>
Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j)
</pre>
Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then
<pre>
Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j))
</pre>
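<p>
These two expressions are algebraically identical, which is easy to verify
numerically (a minimal sketch with made-up inhibition probabilities, not
BNT-specific):
<pre>
q = [0.1 0.3]; S = [1 1];       % two parents, both on
p1 = 1 - prod(q .^ S)           % noisy-OR form: 0.97
w = -log(q);
p2 = 1 - exp(-sum(w .* S))      % rho(sum_j w(j) S(j)): also 0.97
</pre>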
For a sigmoid node, we have
<pre>
Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j))
</pre>
where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of
the activation function (although both are monotonically increasing).
In addition, in the case of a noisy-or, the weights are constrained to be
positive, since they derive from probabilities q(i,j).
In both cases, the number of parameters is <em>linear</em> in the
number of parents, unlike the case of a multinomial distribution,
where the number of parameters is exponential in the number of parents.
We will see an example of noisy-OR nodes <a href="#qmr">below</a>.


<h2><a name="deterministic">Other (noisy) deterministic nodes</h2>

Deterministic CPDs for discrete random variables can be created using
the deterministic_CPD class. It is also possible to 'flip' the output
of the function with some probability, to simulate noise.
The boolean_CPD class is just a special case of a
deterministic CPD, where the parents and child are all binary.
<p>
Both of these classes are just "syntactic sugar" for the tabular_CPD
class.
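<p>
For instance, a noise-free AND gate with two binary parents can be written
directly as a table (a minimal sketch, assuming node i has exactly two binary
parents; the deterministic classes build such tables for you):
<pre>
CPT = zeros(2,2,2);           % dims: parent1, parent2, child; false==1, true==2
CPT(:,:,2) = [0 0; 0 1];      % P(child=on)=1 only when both parents are on
CPT(:,:,1) = 1 - CPT(:,:,2);
bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT);
</pre>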



<h2><a name="softmax">Softmax nodes</h2>

If we have a discrete node with a continuous parent,
we can define its CPD using a softmax function
(also known as the multinomial logit function).
This acts like a soft thresholding operator, and is defined as follows:
<pre>
                    exp(w(:,i)'*x + b(i))
Pr(Q=i | X=x)  =  -----------------------------
                  sum_j   exp(w(:,j)'*x + b(j))
</pre>
The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the
following interpretation: w(:,i)-w(:,j) is the normal vector to the
decision boundary between classes i and j,
and b(i)-b(j) is its offset (bias). For example, suppose
X is a 2-vector, and Q is binary. Then
<pre>
w = [1 -1;
     0  0];

b = [0 0];
</pre>
means class 1 are points in the 2D plane with positive x coordinate,
and class 2 are points in the 2D plane with negative x coordinate.
If w has large magnitude, the decision boundary is sharp, otherwise it
is soft.
In the special case that Q is binary (0/1), the softmax function reduces to the logistic
(sigmoid) function.
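<p>
As a quick numerical check of this example (a minimal sketch, independent of
BNT; the test point is made up):
<pre>
w = [1 -1; 0 0]; b = [0 0];
x = [2; 5];                   % a point with positive x-coordinate
a = w'*x + b';                % per-class "activations"
p = exp(a) / sum(exp(a))      % p(1) is about 0.98, so x is assigned to class 1
</pre>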
<p>
Fitting a softmax function can be done using the iteratively reweighted
least squares (IRLS) algorithm.
We use the implementation from
<a href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>.
Note that since
the softmax distribution is not in the exponential family, it does not
have finite sufficient statistics, and hence we must store all the
training data in uncompressed form.
If this takes too much space, one should use online (stochastic) gradient
descent (not implemented in BNT).
<p>
If a softmax node also has discrete parents,
we use a different set of w/b parameters for each combination of
parent values, as in the <a href="#gaussian">conditional linear
Gaussian CPD</a>.
This feature was implemented by Pierpaolo Brutti.
He is currently extending it so that discrete parents can be treated
as if they were continuous, by adding indicator variables to the X
vector.
<p>
We will see an example of softmax nodes <a href="#mixexp">below</a>.


<h2><a name="mlp">Neural network nodes</h2>

Pierpaolo Brutti has implemented the mlp_CPD class, which uses a multi-layer perceptron
to implement a mapping from continuous parents to discrete children,
similar to the softmax function.
(If there are also discrete parents, it creates a mixture of MLPs.)
It uses code from <a
href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>.
This is work in progress.

<h2><a name="root">Root nodes</h2>

A root node has no parents and no parameters; it can be used to model
an observed, exogenous input variable, i.e., one which is "outside"
the model.
This is useful for conditional density models.
We will see an example of root nodes <a href="#mixexp">below</a>.


<h2><a name="gaussian">Gaussian nodes</h2>

We now consider a distribution suitable for the continuous-valued nodes.
Suppose the node is called Y, its continuous parents (if any) are
called X, and its discrete parents (if any) are called Q.
The distribution on Y is defined as follows:
<pre>
- no parents: Y ~ N(mu, Sigma)
- cts parents : Y|X=x ~ N(mu + W x, Sigma)
- discrete parents: Y|Q=i ~ N(mu(:,i), Sigma(:,:,i))
- cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i))
</pre>
where N(mu, Sigma) denotes a Normal distribution with mean mu and
covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q
respectively.
If there are no discrete parents, |Q|=1; if there is
more than one, then |Q| = a vector of the sizes of each discrete parent.
If there are no continuous parents, |X|=0; if there is more than one,
then |X| = the sum of their sizes.
Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive
semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight)
matrix.
<p>
We can create a Gaussian node with random parameters as follows.
<pre>
bnet.CPD{i} = gaussian_CPD(bnet, i);
</pre>
We can specify the value of one or more of the parameters as in the
following example, in which |Y|=2, and |Q|=1.
<pre>
bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', [0; 0], 'weights', randn(Y,X), 'cov', eye(Y));
</pre>
<p>
We will see an example of conditional linear Gaussian nodes <a
href="#cg_model">below</a>.
<p>
When fitting a Gaussian using EM, you might encounter some numerical
problems. See <a
href="http://www.ai.mit.edu/~murphyk/Software/HMM/hmm.html#loglikpos">here</a>
for a discussion of this in the context of HMMs.
(Note: BNT does not currently support spherical covariances, although it
would be easy to add, since KPMstats/clg_Mstep supports this option;
you would just need to modify gaussian_CPD/update_ess to accumulate
weighted inner products.)


<h2><a name="nongauss">Other continuous distributions</h2>

Currently BNT does not support any CPDs for continuous nodes other
than the Gaussian.
However, you can use a mixture of Gaussians to
approximate other continuous distributions. We will see an example
of this with the IFA model <a href="#pca">below</a>.


<h2><a name="glm">Generalized linear model nodes</h2>

In the future, we may incorporate some of the functionality of
<a href =
"http://www.sci.usq.edu.au/staff/dunn/glmlab/glmlab.html">glmlab</a>
into BNT.


<h2><a name="dtree">Classification/regression tree nodes</h2>

We plan to add classification and regression trees to define CPDs for
discrete and continuous nodes, respectively.
Trees have many advantages: they are easy to interpret, they can do
feature selection, they can
handle discrete and continuous inputs, they do not make strong
assumptions about the form of the distribution, the number of
parameters can grow in a data-dependent way (i.e., they are
semi-parametric), they can handle missing data, etc.
However, they are not yet implemented.
<!--
Yimin Zhang is currently (Feb '02) implementing this.
-->

<h2><a name="cpd_summary">Summary of CPD types</h2>

We list all the different types of CPDs supported by BNT.
For each CPD, we specify if the child and parents can be discrete (D) or
continuous (C) (Binary (B) nodes are a special case).
We also specify which methods each class supports.
If a method is inherited, the name of the parent class is mentioned.
If a parent class calls a child method, this is mentioned.
<p>
The <tt>CPD_to_CPT</tt> method converts a CPD to a table; this
requires that the child and all parents are discrete.
The CPT might be exponentially big...
<tt>convert_to_table</tt> evaluates a CPD with evidence, and
represents the resulting potential as an array.
This requires that the child is discrete, and any continuous parents
are observed.
<tt>convert_to_pot</tt> evaluates a CPD with evidence, and
represents the resulting potential as a dpot, gpot, cgpot or upot, as
requested. (d=discrete, g=Gaussian, cg = conditional Gaussian, u =
utility).

<p>
When we sample a node, all the parents are observed.
When we compute the (log) probability of a node, all the parents and
the child are observed.
<p>
We also specify if the parameters are learnable.
For learning with EM, we require
the methods <tt>reset_ess</tt>, <tt>update_ess</tt> and
<tt>maximize_params</tt>.
For learning from fully observed data, we require
the method <tt>learn_params</tt>.
By default, all classes inherit this from generic_CPD, which simply
calls <tt>update_ess</tt> N times, once for each data case, followed
by <tt>maximize_params</tt>, i.e., it is like EM, without the E step.
Some classes implement a batch formula, which is quicker.
<p>
Bayesian learning means computing a posterior over the parameters
given fully observed data.
<p>
Pearl means we implement the methods <tt>compute_pi</tt> and
<tt>compute_lambda_msg</tt>, used by
<tt>pearl_inf_engine</tt>, which runs on directed graphs.
<tt>belprop_inf_engine</tt> only needs <tt>convert_to_pot</tt>.
The pearl methods can exploit special properties of the CPDs for
computing the messages efficiently, whereas belprop does not.
<p>
The only method implemented by generic_CPD is <tt>adjustable_CPD</tt>,
which is not shown, since it is not very interesting.


<p>

<table border=1><tr>
<td align=center>Name
<td align=center>Child
<td align=center>Parents
<td align=center>Comments
<td align=center>CPD_to_CPT
<td align=center>conv_to_table
<td align=center>conv_to_pot
<td align=center>sample
<td align=center>prob
<td align=center>learn
<td align=center>Bayes
<td align=center>Pearl

<tr>
<!-- Name--><td>boolean
<!-- Child--><td>B
<!-- Parents--><td>B
<!-- Comments--><td>Syntactic sugar for tabular
<!-- CPD_to_CPT--><td>-
<!-- conv_to_table--><td>-
<!-- conv_to_pot--><td>-
<!-- sample--><td>-
<!-- prob--><td>-
<!-- learn--><td>-
<!-- Bayes--><td>-
<!-- Pearl--><td>-

<tr>
<!-- Name--><td>deterministic
<!-- Child--><td>D
<!-- Parents--><td>D
<!-- Comments--><td>Syntactic sugar for tabular
<!-- CPD_to_CPT--><td>-
<!-- conv_to_table--><td>-
<!-- conv_to_pot--><td>-
<!-- sample--><td>-
<!-- prob--><td>-
<!-- learn--><td>-
<!-- Bayes--><td>-
<!-- Pearl--><td>-

<tr>
<!-- Name--><td>Discrete
<!-- Child--><td>D
<!-- Parents--><td>C/D
<!-- Comments--><td>Virtual class
<!-- CPD_to_CPT--><td>N
<!-- conv_to_table--><td>Calls CPD_to_CPT
<!-- conv_to_pot--><td>Calls conv_to_table
<!-- sample--><td>Calls conv_to_table
<!-- prob--><td>Calls conv_to_table
<!-- learn--><td>N
<!-- Bayes--><td>N
<!-- Pearl--><td>N

<tr>
<!-- Name--><td>Gaussian
<!-- Child--><td>C
<!-- Parents--><td>C/D
<!-- Comments--><td>-
<!-- CPD_to_CPT--><td>N
<!-- conv_to_table--><td>N
<!-- conv_to_pot--><td>Y
<!-- sample--><td>Y
<!-- prob--><td>Y
<!-- learn--><td>Y
<!-- Bayes--><td>N
<!-- Pearl--><td>N

<tr>
<!-- Name--><td>gmux
<!-- Child--><td>C
<!-- Parents--><td>C/D
<!-- Comments--><td>multiplexer
<!-- CPD_to_CPT--><td>N
<!-- conv_to_table--><td>N
<!-- conv_to_pot--><td>Y
<!-- sample--><td>N
<!-- prob--><td>N
<!-- learn--><td>N
<!-- Bayes--><td>N
<!-- Pearl--><td>Y

<tr>
<!-- Name--><td>MLP
<!-- Child--><td>D
<!-- Parents--><td>C/D
<!-- Comments--><td>multi-layer perceptron
<!-- CPD_to_CPT--><td>N
<!-- conv_to_table--><td>Y
<!-- conv_to_pot--><td>Inherits from discrete
<!-- sample--><td>Inherits from discrete
<!-- prob--><td>Inherits from discrete
<!-- learn--><td>Y
<!-- Bayes--><td>N
<!-- Pearl--><td>N

<tr>
<!-- Name--><td>noisy-or
<!-- Child--><td>B
<!-- Parents--><td>B
<!-- Comments--><td>-
<!-- CPD_to_CPT--><td>Y
<!-- conv_to_table--><td>Inherits from discrete
<!-- conv_to_pot--><td>Inherits from discrete
<!-- sample--><td>Inherits from discrete
<!-- prob--><td>Inherits from discrete
<!-- learn--><td>N
<!-- Bayes--><td>N
<!-- Pearl--><td>Y

<tr>
<!-- Name--><td>root
<!-- Child--><td>C/D
<!-- Parents--><td>none
<!-- Comments--><td>no params
<!-- CPD_to_CPT--><td>N
<!-- conv_to_table--><td>N
<!-- conv_to_pot--><td>Y
<!-- sample--><td>Y
<!-- prob--><td>Y
<!-- learn--><td>N
<!-- Bayes--><td>N
<!-- Pearl--><td>N

<tr>
<!-- Name--><td>softmax
<!-- Child--><td>D
<!-- Parents--><td>C/D
<!-- Comments--><td>-
<!-- CPD_to_CPT--><td>N
<!-- conv_to_table--><td>Y
<!-- conv_to_pot--><td>Inherits from discrete
<!-- sample--><td>Inherits from discrete
<!-- prob--><td>Inherits from discrete
<!-- learn--><td>Y
<!-- Bayes--><td>N
<!-- Pearl--><td>N

<tr>
<!-- Name--><td>generic
<!-- Child--><td>C/D
<!-- Parents--><td>C/D
<!-- Comments--><td>Virtual class
<!-- CPD_to_CPT--><td>N
<!-- conv_to_table--><td>N
<!-- conv_to_pot--><td>N
<!-- sample--><td>N
<!-- prob--><td>N
<!-- learn--><td>N
<!-- Bayes--><td>N
<!-- Pearl--><td>N

<tr>
<!-- Name--><td>Tabular
<!-- Child--><td>D
<!-- Parents--><td>D
<!-- Comments--><td>-
<!-- CPD_to_CPT--><td>Y
<!-- conv_to_table--><td>Inherits from discrete
<!-- conv_to_pot--><td>Inherits from discrete
<!-- sample--><td>Inherits from discrete
<!-- prob--><td>Inherits from discrete
<!-- learn--><td>Y
<!-- Bayes--><td>Y
<!-- Pearl--><td>Y

</table>


|
wolffd@0
|
1185 <h1><a name="examples">Example models</h1>
|
wolffd@0
|
1186
|
wolffd@0
|
1187
|
wolffd@0
|
1188 <h2>Gaussian mixture models</h2>
|
wolffd@0
|
1189
|
wolffd@0
|
1190 Richard W. DeVaul has made a detailed tutorial on how to fit mixtures
|
wolffd@0
|
1191 of Gaussians using BNT. Available
|
wolffd@0
|
1192 <a href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">here</a>.
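<p>
If you just want a quick feel for how a mixture of Gaussians looks as a Bayes
net, here is a minimal sketch (not taken from that tutorial; the number of
mixture components K and the observation dimension D are assumed given):
<pre>
% Hypothetical sketch: a mixture of K Gaussians over a D-dimensional vector Y,
% represented as the 2-node network Q -> Y.
dag = zeros(2); dag(1,2) = 1;
ns = [K D];
bnet = mk_bnet(dag, ns, 'discrete', 1, 'observed', 2);
bnet.CPD{1} = tabular_CPD(bnet, 1);    % random mixing weights by default
bnet.CPD{2} = gaussian_CPD(bnet, 2);   % random means and covariances by default
</pre>
Such a model can then be fitted with EM, as described in the section on
<a href="#param_learning">parameter learning</a>.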
|
wolffd@0
|
1193
|
wolffd@0
|
1194
|
wolffd@0
|
1195 <h2><a name="pca">PCA, ICA, and all that </h2>
|
wolffd@0
|
1196
|
wolffd@0
|
1197 In Figure (a) below, we show how Factor Analysis can be thought of as a
|
wolffd@0
|
1198 graphical model. Here, X has an N(0,I) prior, and
|
wolffd@0
|
1199 Y|X=x ~ N(mu + Wx, Psi),
|
wolffd@0
|
1200 where Psi is diagonal and W is called the "factor loading matrix".
|
wolffd@0
|
1201 Since the noise on both X and Y is diagonal, the components of these
|
wolffd@0
|
1202 vectors are uncorrelated, and hence can be represented as individual
|
wolffd@0
|
1203 scalar nodes, as we show in (b).
|
wolffd@0
|
1204 (This is useful if parts of the observations on the Y vector are occasionally missing.)
|
wolffd@0
|
1205 We usually take k=|X| << |Y|=D, so the model tries to explain
|
wolffd@0
|
1206 many observations using a low-dimensional subspace.
|
wolffd@0
|
1207
|
wolffd@0
|
1208
|
wolffd@0
|
1209 <center>
|
wolffd@0
|
1210 <table>
|
wolffd@0
|
1211 <tr>
|
wolffd@0
|
1212 <td><img src="Figures/fa.gif">
|
wolffd@0
|
1213 <td><img src="Figures/fa_scalar.gif">
|
wolffd@0
|
1214 <td><img src="Figures/mfa.gif">
|
wolffd@0
|
1215 <td><img src="Figures/ifa.gif">
|
wolffd@0
|
1216 <tr>
|
wolffd@0
|
1217 <td align=center> (a)
|
wolffd@0
|
1218 <td align=center> (b)
|
wolffd@0
|
1219 <td align=center> (c)
|
wolffd@0
|
1220 <td align=center> (d)
|
wolffd@0
|
1221 </table>
|
wolffd@0
|
1222 </center>
|
wolffd@0
|
1223
|
wolffd@0
|
1224 <p>
|
wolffd@0
|
1225 We can create this model in BNT as follows.
|
wolffd@0
|
1226 <pre>
|
wolffd@0
|
1227 ns = [k D];
|
wolffd@0
|
1228 dag = zeros(2,2);
|
wolffd@0
|
1229 dag(1,2) = 1;
|
wolffd@0
|
1230 bnet = mk_bnet(dag, ns, 'discrete', []);
|
wolffd@0
|
1231 bnet.CPD{1} = gaussian_CPD(bnet, 1, 'mean', zeros(k,1), 'cov', eye(k), ...
|
wolffd@0
|
1232 'cov_type', 'diag', 'clamp_mean', 1, 'clamp_cov', 1);
|
wolffd@0
|
1233 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(D,1), 'cov', diag(Psi0), 'weights', W0, ...
|
wolffd@0
|
1234 'cov_type', 'diag', 'clamp_mean', 1);
|
wolffd@0
|
1235 </pre>
|
wolffd@0
|
1236
|
wolffd@0
|
1237 The root node is clamped to the N(0,I) distribution, so that we will
|
wolffd@0
|
1238 not update these parameters during learning.
|
wolffd@0
|
1239 The mean of the leaf node is clamped to 0,
|
wolffd@0
|
1240 since we assume the data has been centered (had its mean subtracted
|
wolffd@0
|
1241 off); this is just for simplicity.
|
wolffd@0
|
1242 Finally, the covariance of the leaf node is constrained to be
|
wolffd@0
|
1243 diagonal. W0 and Psi0 are the initial parameter guesses.
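<p>
For concreteness, one possible (hypothetical) way to create these initial
guesses is
<pre>
W0 = randn(D, k);     % random initial factor loading matrix
Psi0 = ones(D, 1);    % initial observation noise variance for each component of Y
</pre>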
|
wolffd@0
|
1244
|
wolffd@0
|
1245 <p>
|
wolffd@0
|
1246 We can fit this model (i.e., estimate its parameters in a maximum
|
wolffd@0
|
1247 likelihood (ML) sense) using EM, as we
|
wolffd@0
|
1248 explain <a href="#em">below</a>.
|
wolffd@0
|
1249 Not surprisingly, the ML estimates for mu and Psi turn out to be
|
wolffd@0
|
1250 identical to the
|
wolffd@0
|
1251 sample mean and variance, which can be computed directly as
|
wolffd@0
|
1252 <pre>
|
wolffd@0
|
1253 mu_ML = mean(data);
|
wolffd@0
|
1254 Psi_ML = diag(cov(data));
|
wolffd@0
|
1255 </pre>
|
wolffd@0
|
1256 Note that W can only be identified up to a rotation matrix, because of
|
wolffd@0
|
1257 the spherical symmetry of the source.
|
wolffd@0
|
1258
|
wolffd@0
|
1259 <p>
|
wolffd@0
|
1260 If we restrict Psi to be spherical, i.e., Psi = sigma*I,
|
wolffd@0
|
1261 there is a closed-form solution for W as well,
|
wolffd@0
|
1262 i.e., we do not need to use EM.
|
wolffd@0
|
1263 In particular, W contains the first |X| eigenvectors of the sample covariance
|
wolffd@0
|
1264 matrix, with scalings determined by the eigenvalues and sigma.
|
wolffd@0
|
1265 Classical PCA can be obtained by taking the sigma->0 limit.
|
wolffd@0
|
1266 For details, see
|
wolffd@0
|
1267
|
wolffd@0
|
1268 <ul>
|
wolffd@0
|
1269 <li> <a href="ftp://hope.caltech.edu/pub/roweis/Empca/empca.ps">
|
wolffd@0
|
1270 "EM algorithms for PCA and SPCA"</a>, Sam Roweis, NIPS 97.
|
wolffd@0
|
1271 (<a href="ftp://hope.caltech.edu/pub/roweis/Code/empca.tar.gz">
|
wolffd@0
|
1272 Matlab software</a>)
|
wolffd@0
|
1273
|
wolffd@0
|
1274 <p>
|
wolffd@0
|
1275 <li>
|
wolffd@0
|
1276 <a
|
wolffd@0
|
1277 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003>
|
wolffd@0
|
1278 "Mixtures of probabilistic principal component analyzers"</a>,
|
wolffd@0
|
1279 Tipping and Bishop, Neural Computation 11(2):443--482, 1999.
|
wolffd@0
|
1280 </ul>
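<p>
Here is a minimal sketch of the closed-form solution just described (plain
Matlab, not a BNT routine; 'data' is an ncases-by-D matrix and k is the
latent dimension, both assumed given):
<pre>
S = cov(data);                                % sample covariance
[V, E] = eig(S);
[evals, order] = sort(diag(E), 'descend');    % eigenvalues, largest first
V = V(:, order);
sigma2 = mean(evals(k+1:end));                % ML estimate of the spherical noise
W = V(:,1:k) * sqrt(diag(evals(1:k)) - sigma2*eye(k));   % identified up to a rotation
</pre>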
|
wolffd@0
|
1281
|
wolffd@0
|
1282 <p>
|
wolffd@0
|
1283 By adding a hidden discrete variable, we can create mixtures of FA
|
wolffd@0
|
1284 models, as shown in (c).
|
wolffd@0
|
1285 Now we can explain the data using a set of subspaces.
|
wolffd@0
|
1286 We can create this model in BNT as follows.
|
wolffd@0
|
1287 <pre>
|
wolffd@0
|
1288 ns = [M k D];
|
wolffd@0
|
1289 dag = zeros(3);
|
wolffd@0
|
1290 dag(1,3) = 1;
|
wolffd@0
|
1291 dag(2,3) = 1;
|
wolffd@0
|
1292 bnet = mk_bnet(dag, ns, 'discrete', 1);
|
wolffd@0
|
1293 bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0);
|
wolffd@0
|
1294 bnet.CPD{2} = gaussian_CPD(bnet, 2, 'mean', zeros(k, 1), 'cov', eye(k), 'cov_type', 'diag', ...
|
wolffd@0
|
1295 'clamp_mean', 1, 'clamp_cov', 1);
|
wolffd@0
|
1296 bnet.CPD{3} = gaussian_CPD(bnet, 3, 'mean', Mu0', 'cov', repmat(diag(Psi0), [1 1 M]), ...
|
wolffd@0
|
1297 'weights', W0, 'cov_type', 'diag', 'tied_cov', 1);
|
wolffd@0
|
1298 </pre>
|
wolffd@0
|
1299 Notice how the covariance matrix for Y is the same for all values of
|
wolffd@0
|
1300 Q; that is, the noise level in each sub-space is assumed the same.
|
wolffd@0
|
1301 However, we allow the offset, mu, to vary.
|
wolffd@0
|
1302 For details, see
|
wolffd@0
|
1303 <ul>
|
wolffd@0
|
1304
|
wolffd@0
|
1305 <LI>
|
wolffd@0
|
1306 <a HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/tr-96-1.ps.gz"> The EM
|
wolffd@0
|
1307 Algorithm for Mixtures of Factor Analyzers </A>,
|
wolffd@0
|
1308 Ghahramani, Z. and Hinton, G.E. (1996),
|
wolffd@0
|
1309 University of Toronto
|
wolffd@0
|
1310 Technical Report CRG-TR-96-1.
|
wolffd@0
|
1311 (<A HREF="ftp://ftp.cs.toronto.edu/pub/zoubin/mfa.tar.gz">Matlab software</A>)
|
wolffd@0
|
1312
|
wolffd@0
|
1313 <p>
|
wolffd@0
|
1314 <li>
|
wolffd@0
|
1315 <a
|
wolffd@0
|
1316 href=http://neural-server.aston.ac.uk/cgi-bin/tr_avail.pl?trnumber=NCRG/97/003>
|
wolffd@0
|
1317 "Mixtures of probabilistic principal component analyzers"</a>,
|
wolffd@0
|
1318 Tipping and Bishop, Neural Computation 11(2):443--482, 1999.
|
wolffd@0
|
1319 </ul>
|
wolffd@0
|
1320
|
wolffd@0
|
1321 <p>
|
wolffd@0
|
1322 I have included Zoubin's specialized MFA code (with his permission)
|
wolffd@0
|
1323 with the toolbox, so you can check that BNT gives the same results:
|
wolffd@0
|
1324 see 'BNT/examples/static/mfa1.m'.
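<p>
As an illustrative sketch (assuming the MFA bnet defined above and a single
D-by-1 observation vector y), the posterior "responsibility" of each mixture
component can be computed with the usual inference machinery:
<pre>
% Sketch: P(Q=q | y), where Q is the mixture node (node 1) and Y (node 3) is observed.
engine = jtree_inf_engine(bnet);
evidence = cell(1, 3);
evidence{3} = y;                  % y is a hypothetical D-by-1 data vector
[engine, ll] = enter_evidence(engine, evidence);
m = marginal_nodes(engine, 1);    % m.T(q) = P(Q=q | y)
</pre>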
|
wolffd@0
|
1325
|
wolffd@0
|
1326 <p>
|
wolffd@0
|
1327 Independent Factor Analysis (IFA) generalizes FA by allowing a
|
wolffd@0
|
1328 non-Gaussian prior on each component of X.
|
wolffd@0
|
1329 (Note that we can approximate a non-Gaussian prior using a mixture of
|
wolffd@0
|
1330 Gaussians.)
|
wolffd@0
|
1331 This means that the likelihood function is no longer rotationally
|
wolffd@0
|
1332 invariant, so we can uniquely identify W and the hidden
|
wolffd@0
|
1333 sources X.
|
wolffd@0
|
1334 IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y).
|
wolffd@0
|
1335 We recover classical Independent Components Analysis (ICA)
|
wolffd@0
|
1336 in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the
|
wolffd@0
|
1337 weight matrix W is square and invertible.
|
wolffd@0
|
1338 For details, see
|
wolffd@0
|
1339 <ul>
|
wolffd@0
|
1340 <li>
|
wolffd@0
|
1341 <a href="http://www.gatsby.ucl.ac.uk/~hagai/ifa.ps">Independent Factor
|
wolffd@0
|
1342 Analysis</a>, H. Attias, Neural Computation 11: 803--851, 1998.
|
wolffd@0
|
1343 </ul>
|
wolffd@0
|
1344
|
wolffd@0
|
1345
|
wolffd@0
|
1346
|
wolffd@0
|
1347 <h2><a name="mixexp">Mixtures of experts</h2>
|
wolffd@0
|
1348
|
wolffd@0
|
1349 As an example of the use of the softmax function,
|
wolffd@0
|
1350 we introduce the Mixture of Experts model.
|
wolffd@0
|
1351 <!--
|
wolffd@0
|
1352 We also show
|
wolffd@0
|
1353 the Hierarchical Mixture of Experts model, where the hierarchy has two
|
wolffd@0
|
1354 levels.
|
wolffd@0
|
1355 (This is essentially a probabilistic decision tree of height two.)
|
wolffd@0
|
1356 -->
|
wolffd@0
|
1357 As before,
|
wolffd@0
|
1358 circles denote continuous-valued nodes,
|
wolffd@0
|
1359 squares denote discrete nodes, clear
|
wolffd@0
|
1360 means hidden, and shaded means observed.
|
wolffd@0
|
1361 <p>
|
wolffd@0
|
1362 <center>
|
wolffd@0
|
1363 <table>
|
wolffd@0
|
1364 <tr>
|
wolffd@0
|
1365 <td><img src="Figures/mixexp.gif">
|
wolffd@0
|
1366 <!--
|
wolffd@0
|
1367 <td><img src="Figures/hme.gif">
|
wolffd@0
|
1368 -->
|
wolffd@0
|
1369 </table>
|
wolffd@0
|
1370 </center>
|
wolffd@0
|
1371 <p>
|
wolffd@0
|
1372 X is the observed
|
wolffd@0
|
1373 input, Y is the output, and
|
wolffd@0
|
1374 the Q nodes are hidden "gating" nodes, which select the appropriate
|
wolffd@0
|
1375 set of parameters for Y. During training, Y is assumed observed,
|
wolffd@0
|
1376 but for testing, the goal is to predict Y given X.
|
wolffd@0
|
1377 Note that this is a <em>conditional</em> density model, so we don't
|
wolffd@0
|
1378 associate any parameters with X.
|
wolffd@0
|
1379 Hence X's CPD will be a root CPD, which is a way of modelling
|
wolffd@0
|
1380 exogenous nodes.
|
wolffd@0
|
1381 If the output is a continuous-valued quantity,
|
wolffd@0
|
1382 we assume the "experts" are linear-regression units,
|
wolffd@0
|
1383 and set Y's CPD to linear-Gaussian.
|
wolffd@0
|
1384 If the output is discrete, we set Y's CPD to a softmax function.
|
wolffd@0
|
1385 The Q CPDs will always be softmax functions.
|
wolffd@0
|
1386
|
wolffd@0
|
1387 <p>
|
wolffd@0
|
1388 As a concrete example, consider the mixture of experts model where X and Y are
|
wolffd@0
|
1389 scalars, and Q is binary.
|
wolffd@0
|
1390 This is just piecewise linear regression, where
|
wolffd@0
|
1391 we have two line segments, i.e.,
|
wolffd@0
|
1392 <P>
|
wolffd@0
|
1393 <IMG ALIGN=BOTTOM SRC="Eqns/lin_reg_eqn.gif">
|
wolffd@0
|
1394 <P>
|
wolffd@0
|
1395 We can create this model with random parameters as follows.
|
wolffd@0
|
1396 (This code is bundled in BNT/examples/static/mixexp2.m.)
|
wolffd@0
|
1397 <PRE>
|
wolffd@0
|
1398 X = 1;
|
wolffd@0
|
1399 Q = 2;
|
wolffd@0
|
1400 Y = 3;
|
wolffd@0
|
1401 dag = zeros(3,3);
|
wolffd@0
|
1402 dag(X,[Q Y]) = 1;
|
wolffd@0
|
1403 dag(Q,Y) = 1;
|
wolffd@0
|
1404 ns = [1 2 1]; % make X and Y scalars, and have 2 experts
|
wolffd@0
|
1405 onodes = [1 3];
|
wolffd@0
|
1406 bnet = mk_bnet(dag, ns, 'discrete', 2, 'observed', onodes);
|
wolffd@0
|
1407
|
wolffd@0
|
1408 rand('state', 0);
|
wolffd@0
|
1409 randn('state', 0);
|
wolffd@0
|
1410 bnet.CPD{1} = root_CPD(bnet, 1);
|
wolffd@0
|
1411 bnet.CPD{2} = softmax_CPD(bnet, 2);
|
wolffd@0
|
1412 bnet.CPD{3} = gaussian_CPD(bnet, 3);
|
wolffd@0
|
1413 </PRE>
|
wolffd@0
|
1414 Now let us fit this model using <a href="#em">EM</a>.
|
wolffd@0
|
1415 First we <a href="#load_data">load the data</a> (1000 training cases) and plot them.
|
wolffd@0
|
1416 <P>
|
wolffd@0
|
1417 <PRE>
|
wolffd@0
|
1418 data = load('/examples/static/Misc/mixexp_data.txt', '-ascii');
|
wolffd@0
|
1419 plot(data(:,1), data(:,2), '.');
|
wolffd@0
|
1420 </PRE>
|
wolffd@0
|
1421 <p>
|
wolffd@0
|
1422 <center>
|
wolffd@0
|
1423 <IMG SRC="Figures/mixexp_data.gif">
|
wolffd@0
|
1424 </center>
|
wolffd@0
|
1425 <p>
|
wolffd@0
|
1426 This is what the model looks like before training.
|
wolffd@0
|
1427 (Thanks to Thomas Hofman for writing this plotting routine.)
|
wolffd@0
|
1428 <p>
|
wolffd@0
|
1429 <center>
|
wolffd@0
|
1430 <IMG SRC="Figures/mixexp_before.gif">
|
wolffd@0
|
1431 </center>
|
wolffd@0
|
1432 <p>
|
wolffd@0
|
1433 Now let's train the model, and plot the final performance.
|
wolffd@0
|
1434 (We will discuss how to train models in more detail <a href="#param_learning">below</a>.)
|
wolffd@0
|
1435 <P>
|
wolffd@0
|
1436 <PRE>
|
wolffd@0
|
1437 ncases = size(data, 1); % each row of data is a training case
|
wolffd@0
|
1438 cases = cell(3, ncases);
|
wolffd@0
|
1439 cases([1 3], :) = num2cell(data'); % each column of cases is a training case
|
wolffd@0
|
1440 engine = jtree_inf_engine(bnet);
|
wolffd@0
|
1441 max_iter = 20;
|
wolffd@0
|
1442 [bnet2, LLtrace] = learn_params_em(engine, cases, max_iter);
|
wolffd@0
|
1443 </PRE>
|
wolffd@0
|
1444 (We specify which nodes will be observed when we create the engine.
|
wolffd@0
|
1445 Hence BNT knows that the hidden nodes are all discrete.
|
wolffd@0
|
1446 For complex models, this can lead to a significant speedup.)
|
wolffd@0
|
1447 Below we show what the model looks like after 16 iterations of EM
|
wolffd@0
|
1448 (with 100 IRLS iterations per M step), when it converged
|
wolffd@0
|
1449 using the default convergence tolerance (that the
|
wolffd@0
|
1450 fractional change in the log-likelihood be less than 1e-3).
|
wolffd@0
|
1451 Before learning, the log-likelihood was
|
wolffd@0
|
1452 -322.927442; afterwards, it was -13.728778.
|
wolffd@0
|
1453 <p>
|
wolffd@0
|
1454 <center>
|
wolffd@0
|
1455 <IMG SRC="Figures/mixexp_after.gif">
|
wolffd@0
|
1456 </center>
|
wolffd@0
|
1457 (See BNT/examples/static/mixexp2.m for details of the code.)
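<p>
As a small follow-up sketch, we can ask the trained model for the posterior
over the gating node on any training case, i.e., the "responsibility" of each
expert (this snippet is not part of mixexp2.m):
<pre>
engine3 = jtree_inf_engine(bnet2);
ev = cases(:, 1);                  % first training case (X and Y both observed)
engine3 = enter_evidence(engine3, ev);
m = marginal_nodes(engine3, 2);    % m.T(q) = P(Q=q | x, y)
</pre>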
|
wolffd@0
|
1458
|
wolffd@0
|
1459
|
wolffd@0
|
1460
|
wolffd@0
|
1461 <h2><a name="hme">Hierarchical mixtures of experts</h2>
|
wolffd@0
|
1462
|
wolffd@0
|
1463 A hierarchical mixture of experts (HME) extends the mixture of experts
|
wolffd@0
|
1464 model by having more than one hidden node. A two-level example is shown below, along
|
wolffd@0
|
1465 with its more traditional representation as a neural network.
|
wolffd@0
|
1466 This is like a (balanced) probabilistic decision tree of height 2.
|
wolffd@0
|
1467 <p>
|
wolffd@0
|
1468 <center>
|
wolffd@0
|
1469 <IMG SRC="Figures/HMEforMatlab.jpg">
|
wolffd@0
|
1470 </center>
|
wolffd@0
|
1471 <p>
|
wolffd@0
|
1472 <a href="mailto:pbrutti@stat.cmu.edu">Pierpaolo Brutti</a>
|
wolffd@0
|
1473 has written an extensive set of routines for HMEs,
|
wolffd@0
|
1474 which are bundled with BNT: see the examples/static/HME directory.
|
wolffd@0
|
1475 These routines allow you to choose the number of hidden (gating)
|
wolffd@0
|
1476 layers, and the form of the experts (softmax or MLP).
|
wolffd@0
|
1477 See the file hmemenu, which provides a demo.
|
wolffd@0
|
1478 For example, the figure below shows the decision boundaries learned
|
wolffd@0
|
1479 for a ternary classification problem, using a 2 level HME with softmax
|
wolffd@0
|
1480 gates and softmax experts; the training set is on the left, the
|
wolffd@0
|
1481 testing set on the right.
|
wolffd@0
|
1482 <p>
|
wolffd@0
|
1483 <center>
|
wolffd@0
|
1484 <!--<IMG SRC="Figures/hme_dec_boundary.gif">-->
|
wolffd@0
|
1485 <IMG SRC="Figures/hme_dec_boundary.png">
|
wolffd@0
|
1486 </center>
|
wolffd@0
|
1487 <p>
|
wolffd@0
|
1488
|
wolffd@0
|
1489
|
wolffd@0
|
1490 <p>
|
wolffd@0
|
1491 For more details, see the following:
|
wolffd@0
|
1492 <ul>
|
wolffd@0
|
1493
|
wolffd@0
|
1494 <li> <a href="http://www.cs.berkeley.edu/~jordan/papers/hierarchies.ps.Z">
|
wolffd@0
|
1495 Hierarchical mixtures of experts and the EM algorithm</a>
|
wolffd@0
|
1496 M. I. Jordan and R. A. Jacobs. Neural Computation, 6, 181-214, 1994.
|
wolffd@0
|
1497
|
wolffd@0
|
1498 <li> <a href =
|
wolffd@0
|
1499 "http://www.cs.berkeley.edu/~dmartin/software">David Martin's
|
wolffd@0
|
1500 matlab code for HME</a>
|
wolffd@0
|
1501
|
wolffd@0
|
1502 <li> <a
|
wolffd@0
|
1503 href="http://www.cs.berkeley.edu/~jordan/papers/uai.ps.Z">Why the
|
wolffd@0
|
1504 logistic function? A tutorial discussion on
|
wolffd@0
|
1505 probabilities and neural networks.</a> M. I. Jordan. MIT Computational
|
wolffd@0
|
1506 Cognitive Science Report 9503, August 1995.
|
wolffd@0
|
1507
|
wolffd@0
|
1508 <li> "Generalized Linear Models", McCullagh and Nelder, Chapman and
|
wolffd@0
|
1509 Hall, 1983.
|
wolffd@0
|
1510
|
wolffd@0
|
1511 <li>
|
wolffd@0
|
1512 "Improved learning algorithms for mixtures of experts in multiclass
|
wolffd@0
|
1513 classification".
|
wolffd@0
|
1514 K. Chen, L. Xu, H. Chi.
|
wolffd@0
|
1515 Neural Networks (1999) 12: 1229-1252.
|
wolffd@0
|
1516
|
wolffd@0
|
1517 <li> <a href="http://www.oigeeza.com/steve/">
|
wolffd@0
|
1518 Classification Using Hierarchical Mixtures of Experts</a>
|
wolffd@0
|
1519 S.R. Waterhouse and A.J. Robinson.
|
wolffd@0
|
1520 In Proc. IEEE Workshop on Neural Networks for Signal Processing IV (1994), pp. 177-186.
|
wolffd@0
|
1521
|
wolffd@0
|
1522 <li> <a href="http://www.idiap.ch/~perry/">
|
wolffd@0
|
1523 Localized mixtures of experts</a>,
|
wolffd@0
|
1524 P. Moerland, 1998.
|
wolffd@0
|
1525
|
wolffd@0
|
1526 <li> "Nonlinear gated experts for time series",
|
wolffd@0
|
1527 A.S. Weigend and M. Mangeas, 1995.
|
wolffd@0
|
1528
|
wolffd@0
|
1529 </ul>
|
wolffd@0
|
1530
|
wolffd@0
|
1531
|
wolffd@0
|
1532 <h2><a name="qmr">QMR</h2>
|
wolffd@0
|
1533
|
wolffd@0
|
1534 Bayes nets originally arose out of an attempt to add probabilities to
|
wolffd@0
|
1535 expert systems, and this is still the most common use for BNs.
|
wolffd@0
|
1536 A famous example is
|
wolffd@0
|
1537 QMR-DT, a decision-theoretic reformulation of the Quick Medical
|
wolffd@0
|
1538 Reference (QMR) model.
|
wolffd@0
|
1539 <p>
|
wolffd@0
|
1540 <center>
|
wolffd@0
|
1541 <IMG ALIGN=BOTTOM SRC="Figures/qmr.gif">
|
wolffd@0
|
1542 </center>
|
wolffd@0
|
1543 Here, the top layer represents hidden disease nodes, and the bottom
|
wolffd@0
|
1544 layer represents observed symptom nodes.
|
wolffd@0
|
1545 The goal is to infer the posterior probability of each disease given
|
wolffd@0
|
1546 all the symptoms (which can be present, absent or unknown).
|
wolffd@0
|
1547 Each node in the top layer has a Bernoulli prior (with a low prior
|
wolffd@0
|
1548 probability that the disease is present).
|
wolffd@0
|
1549 Since each node in the bottom layer has a high fan-in, we use a
|
wolffd@0
|
1550 noisy-OR parameterization; each disease has an independent chance of
|
wolffd@0
|
1551 causing each symptom.
|
wolffd@0
|
1552 The real QMR-DT model is copyrighted, but
|
wolffd@0
|
1553 we can create a random QMR-like model as follows.
|
wolffd@0
|
1554 <pre>
|
wolffd@0
|
1555 function bnet = mk_qmr_bnet(G, inhibit, leak, prior)
|
wolffd@0
|
1556 % MK_QMR_BNET Make a QMR model
|
wolffd@0
|
1557 % bnet = mk_qmr_bnet(G, inhibit, leak, prior)
|
wolffd@0
|
1558 %
|
wolffd@0
|
1559 % G(i,j) = 1 iff there is an arc from disease i to finding j
|
wolffd@0
|
1560 % inhibit(i,j) = inhibition probability on i->j arc
|
wolffd@0
|
1561 % leak(j) = inhibition prob. on leak->j arc
|
wolffd@0
|
1562 % prior(i) = prob. disease i is on
|
wolffd@0
|
1563
|
wolffd@0
|
1564 [Ndiseases Nfindings] = size(inhibit);
|
wolffd@0
|
1565 N = Ndiseases + Nfindings;
|
wolffd@0
|
1566 finding_node = Ndiseases+1:N;
|
wolffd@0
|
1567 ns = 2*ones(1,N);
|
wolffd@0
|
1568 dag = zeros(N,N);
|
wolffd@0
|
1569 dag(1:Ndiseases, finding_node) = G;
|
wolffd@0
|
1570 bnet = mk_bnet(dag, ns, 'observed', finding_node);
|
wolffd@0
|
1571
|
wolffd@0
|
1572 for d=1:Ndiseases
|
wolffd@0
|
1573 CPT = [1-prior(d) prior(d)];
|
wolffd@0
|
1574 bnet.CPD{d} = tabular_CPD(bnet, d, CPT');
|
wolffd@0
|
1575 end
|
wolffd@0
|
1576
|
wolffd@0
|
1577 for i=1:Nfindings
|
wolffd@0
|
1578 fnode = finding_node(i);
|
wolffd@0
|
1579 ps = parents(G, i);
|
wolffd@0
|
1580 bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(ps, i));
|
wolffd@0
|
1581 end
|
wolffd@0
|
1582 </pre>
|
wolffd@0
|
1583 In the file BNT/examples/static/qmr1, we create a random bipartite
|
wolffd@0
|
1584 graph G, with 5 diseases and 10 findings, and random parameters.
|
wolffd@0
|
1585 (In general, to create a random dag, use 'mk_random_dag'.)
|
wolffd@0
|
1586 We can visualize the resulting graph structure using
|
wolffd@0
|
1587 the methods discussed <a href="#graphdraw">below</a>, with the
|
wolffd@0
|
1588 following results:
|
wolffd@0
|
1589 <p>
|
wolffd@0
|
1590 <img src="Figures/qmr.rnd.jpg">
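<p>
For concreteness, here is a sketch of the kind of random set-up used in that
file (the actual qmr1 script may differ in its details):
<pre>
Ndiseases = 5; Nfindings = 10;
G = rand(Ndiseases, Nfindings) > 0.5;      % random bipartite disease->finding graph
inhibit = rand(Ndiseases, Nfindings);      % random inhibition probabilities
leak = 0.5 * ones(1, Nfindings);           % leak inhibition probabilities (illustrative values)
prior = 0.01 * ones(1, Ndiseases);         % low prior prob. that each disease is present
bnet = mk_qmr_bnet(G, inhibit, leak, prior);
</pre>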
|
wolffd@0
|
1591
|
wolffd@0
|
1592 <p>
|
wolffd@0
|
1593 Now let us put some random evidence on all the leaves except the very
|
wolffd@0
|
1594 first and very last, and compute the disease posteriors.
|
wolffd@0
|
1595 <pre>
|
wolffd@0
|
1596 pos = 2:floor(Nfindings/2);
|
wolffd@0
|
1597 neg = (pos(end)+1):(Nfindings-1);
|
wolffd@0
|
1598 onodes = myunion(pos, neg);
|
wolffd@0
|
1599 evidence = cell(1, N);
|
wolffd@0
|
1600 evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos)));
|
wolffd@0
|
1601 evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg)));
|
wolffd@0
|
1602
|
wolffd@0
|
1603 engine = jtree_inf_engine(bnet);
|
wolffd@0
|
1604 [engine, ll] = enter_evidence(engine, evidence);
|
wolffd@0
|
1605 post = zeros(1, Ndiseases);
|
wolffd@0
|
1606 for i=diseases(:)'
|
wolffd@0
|
1607 m = marginal_nodes(engine, i);
|
wolffd@0
|
1608 post(i) = m.T(2);
|
wolffd@0
|
1609 end
|
wolffd@0
|
1610 </pre>
|
wolffd@0
|
1611 Junction tree can be quite slow on large QMR models.
|
wolffd@0
|
1612 Fortunately, it is possible to exploit properties of the noisy-OR
|
wolffd@0
|
1613 function to speed up exact inference using an algorithm called
|
wolffd@0
|
1614 <a href="#quickscore">quickscore</a>, discussed below.
|
wolffd@0
|
1615
|
wolffd@0
|
1616
|
wolffd@0
|
1617
|
wolffd@0
|
1618
|
wolffd@0
|
1619
|
wolffd@0
|
1620 <h2><a name="cg_model">Conditional Gaussian models</h2>
|
wolffd@0
|
1621
|
wolffd@0
|
1622 A conditional Gaussian model is one in which, conditioned on all the discrete
|
wolffd@0
|
1623 nodes, the distribution over the remaining (continuous) nodes is
|
wolffd@0
|
1624 multivariate Gaussian. This means we can have arcs from discrete (D)
|
wolffd@0
|
1625 to continuous (C) nodes, but not vice versa.
|
wolffd@0
|
1626 (We <em>are</em> allowed C->D arcs if the continuous nodes are observed,
|
wolffd@0
|
1627 as in the <a href="#mixexp">mixture of experts</a> model,
|
wolffd@0
|
1628 since this distribution can be represented with a discrete potential.)
|
wolffd@0
|
1629 <p>
|
wolffd@0
|
1630 We now give an example of a CG model, from
|
wolffd@0
|
1631 the paper "Propagation of Probabilities, Means and
|
wolffd@0
|
1632 Variances in Mixed Graphical Association Models", Steffen Lauritzen,
|
wolffd@0
|
1633 JASA 87(420):1098--1108, 1992 (reprinted in the book "Probabilistic Networks and Expert
|
wolffd@0
|
1634 Systems", R. G. Cowell, A. P. Dawid, S. L. Lauritzen and
|
wolffd@0
|
1635 D. J. Spiegelhalter, Springer, 1999.)
|
wolffd@0
|
1636
|
wolffd@0
|
1637 <h3>Specifying the graph</h3>
|
wolffd@0
|
1638
|
wolffd@0
|
1639 Consider the model of waste emissions from an incinerator plant shown below.
|
wolffd@0
|
1640 We follow the standard convention that shaded nodes are observed,
|
wolffd@0
|
1641 clear nodes are hidden.
|
wolffd@0
|
1642 We also use the non-standard convention that
|
wolffd@0
|
1643 square nodes are discrete (tabular) and round nodes are
|
wolffd@0
|
1644 Gaussian.
|
wolffd@0
|
1645
|
wolffd@0
|
1646 <p>
|
wolffd@0
|
1647 <center>
|
wolffd@0
|
1648 <IMG SRC="Figures/cg1.gif">
|
wolffd@0
|
1649 </center>
|
wolffd@0
|
1650 <p>
|
wolffd@0
|
1651
|
wolffd@0
|
1652 We can create this model as follows.
|
wolffd@0
|
1653 <pre>
|
wolffd@0
|
1654 F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9;
|
wolffd@0
|
1655 n = 9;
|
wolffd@0
|
1656
|
wolffd@0
|
1657 dag = zeros(n);
|
wolffd@0
|
1658 dag(F,E)=1;
|
wolffd@0
|
1659 dag(W,[E Min D]) = 1;
|
wolffd@0
|
1660 dag(E,D)=1;
|
wolffd@0
|
1661 dag(B,[C D])=1;
|
wolffd@0
|
1662 dag(D,[L Mout])=1;
|
wolffd@0
|
1663 dag(Min,Mout)=1;
|
wolffd@0
|
1664
|
wolffd@0
|
1665 % node sizes - all cts nodes are scalar, all discrete nodes are binary
|
wolffd@0
|
1666 ns = ones(1, n);
|
wolffd@0
|
1667 dnodes = [F W B];
|
wolffd@0
|
1668 cnodes = mysetdiff(1:n, dnodes);
|
wolffd@0
|
1669 ns(dnodes) = 2;
|
wolffd@0
|
1670
|
wolffd@0
|
1671 bnet = mk_bnet(dag, ns, 'discrete', dnodes);
|
wolffd@0
|
1672 </pre>
|
wolffd@0
|
1673 'dnodes' is a list of the discrete nodes; 'cnodes' is the continuous
|
wolffd@0
|
1674 nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'.
|
wolffd@0
|
1675 <p>
|
wolffd@0
|
1676
|
wolffd@0
|
1677
|
wolffd@0
|
1678 <h3>Specifying the parameters</h3>
|
wolffd@0
|
1679
|
wolffd@0
|
1680 The parameters of the discrete nodes can be specified as follows.
|
wolffd@0
|
1681 <pre>
|
wolffd@0
|
1682 bnet.CPD{B} = tabular_CPD(bnet, B, 'CPT', [0.85 0.15]); % 1=stable, 2=unstable
|
wolffd@0
|
1683 bnet.CPD{F} = tabular_CPD(bnet, F, 'CPT', [0.95 0.05]); % 1=intact, 2=defect
|
wolffd@0
|
1684 bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [2/7 5/7]); % 1=industrial, 2=household
|
wolffd@0
|
1685 </pre>
|
wolffd@0
|
1686
|
wolffd@0
|
1687 <p>
|
wolffd@0
|
1688 The parameters of the continuous nodes can be specified as follows.
|
wolffd@0
|
1689 <pre>
|
wolffd@0
|
1690 bnet.CPD{E} = gaussian_CPD(bnet, E, 'mean', [-3.9 -0.4 -3.2 -0.5], ...
|
wolffd@0
|
1691 'cov', [0.00002 0.0001 0.00002 0.0001]);
|
wolffd@0
|
1692 bnet.CPD{D} = gaussian_CPD(bnet, D, 'mean', [6.5 6.0 7.5 7.0], ...
|
wolffd@0
|
1693 'cov', [0.03 0.04 0.1 0.1], 'weights', [1 1 1 1]);
|
wolffd@0
|
1694 bnet.CPD{C} = gaussian_CPD(bnet, C, 'mean', [-2 -1], 'cov', [0.1 0.3]);
|
wolffd@0
|
1695 bnet.CPD{L} = gaussian_CPD(bnet, L, 'mean', 3, 'cov', 0.25, 'weights', -0.5);
|
wolffd@0
|
1696 bnet.CPD{Min} = gaussian_CPD(bnet, Min, 'mean', [0.5 -0.5], 'cov', [0.01 0.005]);
|
wolffd@0
|
1697 bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 'mean', 0, 'cov', 0.002, 'weights', [1 1]);
|
wolffd@0
|
1698 </pre>
|
wolffd@0
|
1699
|
wolffd@0
|
1700
|
wolffd@0
|
1701 <h3><a name="cg_infer">Inference</h3>
|
wolffd@0
|
1702
|
wolffd@0
|
1703 <!--Let us perform inference in the <a href="#cg_model">waste incinerator example</a>.-->
|
wolffd@0
|
1704 First we compute the unconditional marginals.
|
wolffd@0
|
1705 <pre>
|
wolffd@0
|
1706 engine = jtree_inf_engine(bnet);
|
wolffd@0
|
1707 evidence = cell(1,n);
|
wolffd@0
|
1708 [engine, ll] = enter_evidence(engine, evidence);
|
wolffd@0
|
1709 marg = marginal_nodes(engine, E);
|
wolffd@0
|
1710 </pre>
|
wolffd@0
|
1711 <!--(Of course, we could use <tt>cond_gauss_inf_engine</tt> instead of jtree.)-->
|
wolffd@0
|
1712 'marg' is a structure that contains the fields 'mu' and 'Sigma', which
|
wolffd@0
|
1713 contain the mean and (co)variance of the marginal on E.
|
wolffd@0
|
1714 In this case, they are both scalars.
|
wolffd@0
|
1715 Let us check they match the published figures (to 2 decimal places).
|
wolffd@0
|
1716 <!--(We can't expect
|
wolffd@0
|
1717 more precision than this in general because I have implemented the algorithm of
|
wolffd@0
|
1718 Lauritzen (1992), which can be numerically unstable.)-->
|
wolffd@0
|
1719 <pre>
|
wolffd@0
|
1720 tol = 1e-2;
|
wolffd@0
|
1721 assert(approxeq(marg.mu, -3.25, tol));
|
wolffd@0
|
1722 assert(approxeq(sqrt(marg.Sigma), 0.709, tol));
|
wolffd@0
|
1723 </pre>
|
wolffd@0
|
1724 We can compute the other posteriors similarly.
|
wolffd@0
|
1725 Now let us add some evidence.
|
wolffd@0
|
1726 <pre>
|
wolffd@0
|
1727 evidence = cell(1,n);
|
wolffd@0
|
1728 evidence{W} = 1; % industrial
|
wolffd@0
|
1729 evidence{L} = 1.1;
|
wolffd@0
|
1730 evidence{C} = -0.9;
|
wolffd@0
|
1731 [engine, ll] = enter_evidence(engine, evidence);
|
wolffd@0
|
1732 </pre>
|
wolffd@0
|
1733 Now we find
|
wolffd@0
|
1734 <pre>
|
wolffd@0
|
1735 marg = marginal_nodes(engine, E);
|
wolffd@0
|
1736 assert(approxeq(marg.mu, -3.8983, tol));
|
wolffd@0
|
1737 assert(approxeq(sqrt(marg.Sigma), 0.0763, tol));
|
wolffd@0
|
1738 </pre>
|
wolffd@0
|
1739
|
wolffd@0
|
1740
|
wolffd@0
|
1741 We can also compute the joint probability on a set of nodes.
|
wolffd@0
|
1742 For example, P(D, Mout | evidence) is a 2D Gaussian:
|
wolffd@0
|
1743 <pre>
|
wolffd@0
|
1744 marg = marginal_nodes(engine, [D Mout])
|
wolffd@0
|
1745 marg =
|
wolffd@0
|
1746 domain: [6 8]
|
wolffd@0
|
1747 mu: [2x1 double]
|
wolffd@0
|
1748 Sigma: [2x2 double]
|
wolffd@0
|
1749 T: 1.0000
|
wolffd@0
|
1750 </pre>
|
wolffd@0
|
1751 The mean is
|
wolffd@0
|
1752 <pre>
|
wolffd@0
|
1753 marg.mu
|
wolffd@0
|
1754 ans =
|
wolffd@0
|
1755 3.6077
|
wolffd@0
|
1756 4.1077
|
wolffd@0
|
1757 </pre>
|
wolffd@0
|
1758 and the covariance matrix is
|
wolffd@0
|
1759 <pre>
|
wolffd@0
|
1760 marg.Sigma
|
wolffd@0
|
1761 ans =
|
wolffd@0
|
1762 0.1062 0.1062
|
wolffd@0
|
1763 0.1062 0.1182
|
wolffd@0
|
1764 </pre>
|
wolffd@0
|
1765 It is easy to visualize this posterior using standard Matlab plotting
|
wolffd@0
|
1766 functions, e.g.,
|
wolffd@0
|
1767 <pre>
|
wolffd@0
|
1768 gaussplot2d(marg.mu, marg.Sigma);
|
wolffd@0
|
1769 </pre>
|
wolffd@0
|
1770 produces the following picture.
|
wolffd@0
|
1771
|
wolffd@0
|
1772 <p>
|
wolffd@0
|
1773 <center>
|
wolffd@0
|
1774 <IMG SRC="Figures/gaussplot.png">
|
wolffd@0
|
1775 </center>
|
wolffd@0
|
1776 <p>
|
wolffd@0
|
1777
|
wolffd@0
|
1778
|
wolffd@0
|
1779 The T field indicates that the mixing weight of this Gaussian
|
wolffd@0
|
1780 component is 1.0.
|
wolffd@0
|
1781 If the joint contains discrete and continuous variables, the result
|
wolffd@0
|
1782 will be a mixture of Gaussians, e.g.,
|
wolffd@0
|
1783 <pre>
|
wolffd@0
|
1784 marg = marginal_nodes(engine, [F E])
|
wolffd@0
|
1785 domain: [1 3]
|
wolffd@0
|
1786 mu: [-3.9000 -0.4003]
|
wolffd@0
|
1787 Sigma: [1x1x2 double]
|
wolffd@0
|
1788 T: [0.9995 4.7373e-04]
|
wolffd@0
|
1789 </pre>
|
wolffd@0
|
1790 The interpretation is
|
wolffd@0
|
1791 Sigma(i,j,k) = Cov[ E(i) E(j) | F=k ].
|
wolffd@0
|
1792 In this case, E is a scalar, so i=j=1; k specifies the mixture component.
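<p>
For example, a one-line sketch that collapses this mixture into a single
(unconditional) mean for E is
<pre>
Emean = sum(marg.T(:) .* marg.mu(:));   % E[E] = sum_k P(F=k) * E[E | F=k]
</pre>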
|
wolffd@0
|
1793 <p>
|
wolffd@0
|
1794 We saw in the sprinkler network that BNT sets the effective size of
|
wolffd@0
|
1795 observed discrete nodes to 1, since they only have one legal value.
|
wolffd@0
|
1796 For observed continuous nodes, BNT sets their length to 0,
|
wolffd@0
|
1797 since they have been reduced to a point.
|
wolffd@0
|
1798 For example,
|
wolffd@0
|
1799 <pre>
|
wolffd@0
|
1800 marg = marginal_nodes(engine, [B C])
|
wolffd@0
|
1801 domain: [4 5]
|
wolffd@0
|
1802 mu: []
|
wolffd@0
|
1803 Sigma: []
|
wolffd@0
|
1804 T: [0.0123 0.9877]
|
wolffd@0
|
1805 </pre>
|
wolffd@0
|
1806 It is simple to post-process the output of marginal_nodes.
|
wolffd@0
|
1807 For example, the file BNT/examples/static/cg1 sets the mu term of
|
wolffd@0
|
1808 observed nodes to their observed value, and the Sigma term to 0 (since
|
wolffd@0
|
1809 observed nodes have no variance).
|
wolffd@0
|
1810
|
wolffd@0
|
1811 <p>
|
wolffd@0
|
1812 Note that the implemented version of the junction tree is numerically
|
wolffd@0
|
1813 unstable when using CG potentials
|
wolffd@0
|
1814 (which is why, in the example above, we only required our answers to agree with
|
wolffd@0
|
1815 the published ones to 2dp.)
|
wolffd@0
|
1816 This is why you might want to use <tt>stab_cond_gauss_inf_engine</tt>,
|
wolffd@0
|
1817 implemented by Shan Huang. This is described in
|
wolffd@0
|
1818
|
wolffd@0
|
1819 <ul>
|
wolffd@0
|
1820 <li> "Stable Local Computation with Conditional Gaussian Distributions",
|
wolffd@0
|
1821 S. Lauritzen and F. Jensen, Tech Report R-99-2014,
|
wolffd@0
|
1822 Dept. Math. Sciences, Aalborg Univ., 1999.
|
wolffd@0
|
1823 </ul>
|
wolffd@0
|
1824
|
wolffd@0
|
1825 However, even the numerically stable version
|
wolffd@0
|
1826 can be computationally intractable if there are many hidden discrete
|
wolffd@0
|
1827 nodes, because the number of mixture components grows exponentially, e.g., in a
|
wolffd@0
|
1828 <a href="usage_dbn.html#lds">switching linear dynamical system</a>.
|
wolffd@0
|
1829 In general, one must resort to approximate inference techniques: see
|
wolffd@0
|
1830 the discussion on <a href="#engines">inference engines</a> below.
|
wolffd@0
|
1831
|
wolffd@0
|
1832
|
wolffd@0
|
1833 <h2><a name="hybrid">Other hybrid models</h2>
|
wolffd@0
|
1834
|
wolffd@0
|
1835 When we have C->D arcs, where C is hidden, we need to use
|
wolffd@0
|
1836 approximate inference.
|
wolffd@0
|
1837 One approach (not implemented in BNT) is described in
|
wolffd@0
|
1838 <ul>
|
wolffd@0
|
1839 <li> <a
|
wolffd@0
|
1840 href="http://www.cs.berkeley.edu/~murphyk/Papers/hybrid_uai99.ps.gz">A
|
wolffd@0
|
1841 Variational Approximation for Bayesian Networks with
|
wolffd@0
|
1842 Discrete and Continuous Latent Variables</a>,
|
wolffd@0
|
1843 K. Murphy, UAI 99.
|
wolffd@0
|
1844 </ul>
|
wolffd@0
|
1845 Of course, one can always use <a href="#sampling">sampling</a> methods
|
wolffd@0
|
1846 for approximate inference in such models.
|
wolffd@0
|
1847
|
wolffd@0
|
1848
|
wolffd@0
|
1849
|
wolffd@0
|
1850 <h1><a name="param_learning">Parameter Learning</h1>
|
wolffd@0
|
1851
|
wolffd@0
|
1852 The parameter estimation routines in BNT can be classified into 4
|
wolffd@0
|
1853 types, depending on whether the goal is to compute
|
wolffd@0
|
1854 a full (Bayesian) posterior over the parameters or just a point
|
wolffd@0
|
1855 estimate (e.g., Maximum Likelihood or Maximum A Posteriori),
|
wolffd@0
|
1856 and whether all the variables are fully observed or there is missing
|
wolffd@0
|
1857 data/ hidden variables (partial observability).
|
wolffd@0
|
1858 <p>
|
wolffd@0
|
1859
|
wolffd@0
|
1860 <TABLE BORDER>
|
wolffd@0
|
1861 <tr>
|
wolffd@0
|
1862 <TH></TH>
|
wolffd@0
|
1863 <th>Full obs</th>
|
wolffd@0
|
1864 <th>Partial obs</th>
|
wolffd@0
|
1865 </tr>
|
wolffd@0
|
1866 <tr>
|
wolffd@0
|
1867 <th>Point</th>
|
wolffd@0
|
1868 <td><tt>learn_params</tt></td>
|
wolffd@0
|
1869 <td><tt>learn_params_em</tt></td>
|
wolffd@0
|
1870 </tr>
|
wolffd@0
|
1871 <tr>
|
wolffd@0
|
1872 <th>Bayes</th>
|
wolffd@0
|
1873 <td><tt>bayes_update_params</tt></td>
|
wolffd@0
|
1874 <td>not yet supported</td>
|
wolffd@0
|
1875 </tr>
|
wolffd@0
|
1876 </table>
|
wolffd@0
|
1877
|
wolffd@0
|
1878
|
wolffd@0
|
1879 <h2><a name="load_data">Loading data from a file</h2>
|
wolffd@0
|
1880
|
wolffd@0
|
1881 To load numeric data from an ASCII text file called 'dat.txt', where each row is a
|
wolffd@0
|
1882 case and columns are separated by white-space, such as
|
wolffd@0
|
1883 <pre>
|
wolffd@0
|
1884 011979 1626.5 0.0
|
wolffd@0
|
1885 021979 1367.0 0.0
|
wolffd@0
|
1886 ...
|
wolffd@0
|
1887 </pre>
|
wolffd@0
|
1888 you can use
|
wolffd@0
|
1889 <pre>
|
wolffd@0
|
1890 data = load('dat.txt');
|
wolffd@0
|
1891 </pre>
|
wolffd@0
|
1892 or
|
wolffd@0
|
1893 <pre>
|
wolffd@0
|
1894 load dat.txt -ascii
|
wolffd@0
|
1895 </pre>
|
wolffd@0
|
1896 In the latter case, the data is stored in a variable called 'dat' (the
|
wolffd@0
|
1897 filename minus the extension).
|
wolffd@0
|
1898 Alternatively, suppose the data is stored in a .csv file (has commas
|
wolffd@0
|
1899 separating the columns, and contains a header line), such as
|
wolffd@0
|
1900 <pre>
|
wolffd@0
|
1901 header info goes here
|
wolffd@0
|
1902 ORD,011979,1626.5,0.0
|
wolffd@0
|
1903 DSM,021979,1367.0,0.0
|
wolffd@0
|
1904 ...
|
wolffd@0
|
1905 </pre>
|
wolffd@0
|
1906 You can load this using
|
wolffd@0
|
1907 <pre>
|
wolffd@0
|
1908 [a,b,c,d] = textread('dat.csv', '%s %d %f %f', 'delimiter', ',', 'headerlines', 1);
|
wolffd@0
|
1909 </pre>
|
wolffd@0
|
1910 If your file is not in either of these formats, you can either use Perl to convert
|
wolffd@0
|
1911 it to this format, or use the Matlab fscanf command.
|
wolffd@0
|
1912 Type
|
wolffd@0
|
1913 <tt>
|
wolffd@0
|
1914 help iofun
|
wolffd@0
|
1915 </tt>
|
wolffd@0
|
1916 for more information on Matlab's file functions.
|
wolffd@0
|
1917 <!--
|
wolffd@0
|
1918 <p>
|
wolffd@0
|
1919 To load data directly from Excel,
|
wolffd@0
|
1920 you should buy the
|
wolffd@0
|
1921 <a href="http://www.mathworks.com/products/excellink/">Excel Link</a>.
|
wolffd@0
|
1922 To load data directly from a relational database,
|
wolffd@0
|
1923 you should buy the
|
wolffd@0
|
1924 <a href="http://www.mathworks.com/products/database">Database
|
wolffd@0
|
1925 toolbox</a>.
|
wolffd@0
|
1926 -->
|
wolffd@0
|
1927 <p>
|
wolffd@0
|
1928 BNT learning routines require data to be stored in a cell array.
|
wolffd@0
|
1929 data{i,m} is the value of node i in case (example) m, i.e., each
|
wolffd@0
|
1930 <em>column</em> is a case.
|
wolffd@0
|
1931 If node i is not observed in case m (missing value), set
|
wolffd@0
|
1932 data{i,m} = [].
|
wolffd@0
|
1933 (Not all the learning routines can cope with such missing values, however.)
|
wolffd@0
|
1934 In the special case that all the nodes are observed and are
|
wolffd@0
|
1935 scalar-valued (as opposed to vector-valued), the data can be
|
wolffd@0
|
1936 stored in a matrix (as opposed to a cell-array).
|
wolffd@0
|
1937 <p>
|
wolffd@0
|
1938 Suppose, as in the <a href="#mixexp">mixture of experts example</a>,
|
wolffd@0
|
1939 that we have 3 nodes in the graph: X(1) is the observed input, X(3) is
|
wolffd@0
|
1940 the observed output, and X(2) is a hidden (gating) node. We can
|
wolffd@0
|
1941 create the dataset as follows.
|
wolffd@0
|
1942 <pre>
|
wolffd@0
|
1943 data = load('dat.txt');
|
wolffd@0
|
1944 ncases = size(data, 1);
|
wolffd@0
|
1945 cases = cell(3, ncases);
|
wolffd@0
|
1946 cases([1 3], :) = num2cell(data');
|
wolffd@0
|
1947 </pre>
|
wolffd@0
|
1948 Notice how we transposed the data, to convert rows into columns.
|
wolffd@0
|
1949 Also, cases{2,m} = [] for all m, since X(2) is always hidden.
|
wolffd@0
|
1950
|
wolffd@0
|
1951
|
wolffd@0
|
1952 <h2><a name="mle_complete">Maximum likelihood parameter estimation from complete data</h2>
|
wolffd@0
|
1953
|
wolffd@0
|
1954 As an example, let's generate some data from the sprinkler network, randomize the parameters,
|
wolffd@0
|
1955 and then try to recover the original model.
|
wolffd@0
|
1956 First we create some training data using forwards sampling.
|
wolffd@0
|
1957 <pre>
|
wolffd@0
|
1958 samples = cell(N, nsamples);
|
wolffd@0
|
1959 for i=1:nsamples
|
wolffd@0
|
1960 samples(:,i) = sample_bnet(bnet);
|
wolffd@0
|
1961 end
|
wolffd@0
|
1962 </pre>
|
wolffd@0
|
1963 samples{j,i} contains the value of the j'th node in case i.
|
wolffd@0
|
1964 sample_bnet returns a cell array because, in general, each node might
|
wolffd@0
|
1965 be a vector of different length.
|
wolffd@0
|
1966 In this case, all nodes are discrete (and hence scalars), so we
|
wolffd@0
|
1967 could have used a regular array instead (which can be quicker):
|
wolffd@0
|
1968 <pre>
|
wolffd@0
|
1969 data = cell2num(samples);
|
wolffd@0
|
1970 </pre>
|
wolffd@0
|
1971 So now data(j,i) = samples{j,i}.
|
wolffd@0
|
1972 <p>
|
wolffd@0
|
1973 Now we create a network with random parameters.
|
wolffd@0
|
1974 (The initial values of bnet2 don't matter in this case, since we can find the
|
wolffd@0
|
1975 globally optimal MLE independent of where we start.)
|
wolffd@0
|
1976 <pre>
|
wolffd@0
|
1977 % Make a tabula rasa
|
wolffd@0
|
1978 bnet2 = mk_bnet(dag, node_sizes);
|
wolffd@0
|
1979 seed = 0;
|
wolffd@0
|
1980 rand('state', seed);
|
wolffd@0
|
1981 bnet2.CPD{C} = tabular_CPD(bnet2, C);
|
wolffd@0
|
1982 bnet2.CPD{R} = tabular_CPD(bnet2, R);
|
wolffd@0
|
1983 bnet2.CPD{S} = tabular_CPD(bnet2, S);
|
wolffd@0
|
1984 bnet2.CPD{W} = tabular_CPD(bnet2, W);
|
wolffd@0
|
1985 </pre>
|
wolffd@0
|
1986 Finally, we find the maximum likelihood estimates of the parameters.
|
wolffd@0
|
1987 <pre>
|
wolffd@0
|
1988 bnet3 = learn_params(bnet2, samples);
|
wolffd@0
|
1989 </pre>
|
wolffd@0
|
1990 To view the learned parameters, we use a little Matlab hackery.
|
wolffd@0
|
1991 <pre>
|
wolffd@0
|
1992 CPT3 = cell(1,N);
|
wolffd@0
|
1993 for i=1:N
|
wolffd@0
|
1994 s=struct(bnet3.CPD{i}); % violate object privacy
|
wolffd@0
|
1995 CPT3{i}=s.CPT;
|
wolffd@0
|
1996 end
|
wolffd@0
|
1997 </pre>
|
wolffd@0
|
1998 Here are the parameters learned for node 4.
|
wolffd@0
|
1999 <pre>
|
wolffd@0
|
2000 dispcpt(CPT3{4})
|
wolffd@0
|
2001 1 1 : 1.0000 0.0000
|
wolffd@0
|
2002 2 1 : 0.2000 0.8000
|
wolffd@0
|
2003 1 2 : 0.2273 0.7727
|
wolffd@0
|
2004 2 2 : 0.0000 1.0000
|
wolffd@0
|
2005 </pre>
|
wolffd@0
|
2006 So we see that the learned parameters are fairly close to the "true"
|
wolffd@0
|
2007 ones, which we display below.
|
wolffd@0
|
2008 <pre>
|
wolffd@0
|
2009 dispcpt(CPT{4})
|
wolffd@0
|
2010 1 1 : 1.0000 0.0000
|
wolffd@0
|
2011 2 1 : 0.1000 0.9000
|
wolffd@0
|
2012 1 2 : 0.1000 0.9000
|
wolffd@0
|
2013 2 2 : 0.0100 0.9900
|
wolffd@0
|
2014 </pre>
|
wolffd@0
|
2015 We can get better results by using a larger training set, or using
|
wolffd@0
|
2016 informative priors (see <a href="#prior">below</a>).
|
wolffd@0
|
2017
|
wolffd@0
|
2018
|
wolffd@0
|
2019
|
wolffd@0
|
2020 <h2><a name="prior">Parameter priors</h2>
|
wolffd@0
|
2021
|
wolffd@0
|
2022 Currently, only tabular CPDs can have priors on their parameters.
|
wolffd@0
|
2023 The conjugate prior for a multinomial is the Dirichlet.
|
wolffd@0
|
2024 (For binary random variables, the multinomial is the same as the
|
wolffd@0
|
2025 Bernoulli, and the Dirichlet is the same as the Beta.)
|
wolffd@0
|
2026 <p>
|
wolffd@0
|
2027 The Dirichlet has a simple interpretation in terms of pseudo counts.
|
wolffd@0
|
2028 If we let N_ijk = the num. times X_i=k and Pa_i=j occurs in the
|
wolffd@0
|
2029 training set, where Pa_i are the parents of X_i,
|
wolffd@0
|
2030 then the maximum likelihood (ML) estimate is
|
wolffd@0
|
2031 T_ijk = N_ijk / N_ij (where N_ij = sum_k' N_ijk'), which will be 0 if N_ijk=0.
|
wolffd@0
|
2032 To prevent us from declaring that (X_i=k, Pa_i=j) is impossible just because this
|
wolffd@0
|
2033 event was not seen in the training set,
|
wolffd@0
|
2034 we can pretend we saw value k of X_i, for each value j of Pa_i some number (alpha_ijk)
|
wolffd@0
|
2035 of times in the past.
|
wolffd@0
|
2036 The MAP (maximum a posteriori) estimate is then
|
wolffd@0
|
2037 <pre>
|
wolffd@0
|
2038 T_ijk = (N_ijk + alpha_ijk) / (N_ij + alpha_ij)
|
wolffd@0
|
2039 </pre>
|
wolffd@0
|
2040 and is never 0 if all alpha_ijk > 0.
|
wolffd@0
|
2041 For example, consider the network A->B, where A is binary and B has 3
|
wolffd@0
|
2042 values.
|
wolffd@0
|
2043 A uniform prior for B has the form
|
wolffd@0
|
2044 <pre>
|
wolffd@0
|
2045 B=1 B=2 B=3
|
wolffd@0
|
2046 A=1 1 1 1
|
wolffd@0
|
2047 A=2 1 1 1
|
wolffd@0
|
2048 </pre>
|
wolffd@0
|
2049 which can be created using
|
wolffd@0
|
2050 <pre>
|
wolffd@0
|
2051 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'unif');
|
wolffd@0
|
2052 </pre>
|
wolffd@0
|
2053 This prior does not satisfy the likelihood equivalence principle,
|
wolffd@0
|
2054 which says that <a href="#markov_equiv">Markov equivalent</a> models
|
wolffd@0
|
2055 should have the same marginal likelihood.
|
wolffd@0
|
2056 A prior that does satisfy this principle is shown below.
|
wolffd@0
|
2057 Heckerman (1995) calls this the
|
wolffd@0
|
2058 BDeu prior (likelihood equivalent uniform Bayesian Dirichlet).
|
wolffd@0
|
2059 <pre>
|
wolffd@0
|
2060 B=1 B=2 B=3
|
wolffd@0
|
2061 A=1 1/6 1/6 1/6
|
wolffd@0
|
2062 A=2 1/6 1/6 1/6
|
wolffd@0
|
2063 </pre>
|
wolffd@0
|
2064 where we put N/(q*r) in each bin; N is the equivalent sample size,
|
wolffd@0
|
2065 r=|A|, q = |B|.
|
wolffd@0
|
2066 This can be created as follows
|
wolffd@0
|
2067 <pre>
|
wolffd@0
|
2068 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu');
|
wolffd@0
|
2069 </pre>
|
wolffd@0
|
2070 Here, the equivalent sample size is 1; this acts as the strength of the
|
wolffd@0
|
2071 prior.
|
wolffd@0
|
2072 You can change this using
|
wolffd@0
|
2073 <pre>
|
wolffd@0
|
2074 tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', ...
|
wolffd@0
|
2075 'BDeu', 'dirichlet_weight', 10);
|
wolffd@0
|
2076 </pre>
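<p>
To sanity-check the prior you have specified, you can peek inside the CPD
object with the same struct() hack used elsewhere in this document (a sketch;
the internal field holding the pseudo-counts is the one referred to as
'prior' in the next section):
<pre>
CPD = tabular_CPD(bnet, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu', 'dirichlet_weight', 10);
s = struct(CPD);   % violate object privacy
fieldnames(s)      % lists the internal fields, including the pseudo-count table
</pre>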
|
wolffd@0
|
2077 <!--where counts is an array of pseudo-counts of the same size as the
|
wolffd@0
|
2078 CPT.-->
|
wolffd@0
|
2079 <!--
|
wolffd@0
|
2080 <p>
|
wolffd@0
|
2081 When you specify a prior, you should set row i of the CPT to the
|
wolffd@0
|
2082 normalized version of row i of the pseudo-count matrix, i.e., to the
|
wolffd@0
|
2083 expected values of the parameters. This will ensure that computing the
|
wolffd@0
|
2084 marginal likelihood sequentially (see <a
|
wolffd@0
|
2085 href="#bayes_learn">below</a>) and in batch form gives the same
|
wolffd@0
|
2086 results.
|
wolffd@0
|
2087 To do this, proceed as follows.
|
wolffd@0
|
2088 <pre>
|
wolffd@0
|
2089 tabular_CPD(bnet, i, 'prior', counts, 'CPT', mk_stochastic(counts));
|
wolffd@0
|
2090 </pre>
|
wolffd@0
|
2091 For a non-informative prior, you can just write
|
wolffd@0
|
2092 <pre>
|
wolffd@0
|
2093 tabular_CPD(bnet, i, 'prior', 'unif', 'CPT', 'unif');
|
wolffd@0
|
2094 </pre>
|
wolffd@0
|
2095 -->
|
wolffd@0
|
2096
|
wolffd@0
|
2097
|
wolffd@0
|
2098 <h2><a name="bayes_learn">(Sequential) Bayesian parameter updating from complete data</h2>
|
wolffd@0
|
2099
|
wolffd@0
|
2100 If we use conjugate priors and have fully observed data, we can
|
wolffd@0
|
2101 compute the posterior over the parameters in batch form as follows.
|
wolffd@0
|
2102 <pre>
|
wolffd@0
|
2103 cases = sample_bnet(bnet, nsamples);
|
wolffd@0
|
2104 bnet = bayes_update_params(bnet, cases);
|
wolffd@0
|
2105 LL = log_marg_lik_complete(bnet, cases);
|
wolffd@0
|
2106 </pre>
|
wolffd@0
|
2107 bnet.CPD{i}.prior contains the new Dirichlet pseudocounts,
|
wolffd@0
|
2108 and bnet.CPD{i}.CPT is set to the mean of the posterior (the
|
wolffd@0
|
2109 normalized counts).
|
wolffd@0
|
2110 (Hence if the initial pseudo counts are 0,
|
wolffd@0
|
2111 <tt>bayes_update_params</tt> and <tt>learn_params</tt> will give the
|
wolffd@0
|
2112 same result.)
|
wolffd@0
|
2113
|
wolffd@0
|
2114
|
wolffd@0
|
2115
|
wolffd@0
|
2116
|
wolffd@0
|
2117 <p>
|
wolffd@0
|
2118 We can compute the same result sequentially (on-line) as follows.
|
wolffd@0
|
2119 <pre>
|
wolffd@0
|
2120 LL = 0;
|
wolffd@0
|
2121 for m=1:nsamples
|
wolffd@0
|
2122 LL = LL + log_marg_lik_complete(bnet, cases(:,m));
|
wolffd@0
|
2123 bnet = bayes_update_params(bnet, cases(:,m));
|
wolffd@0
|
2124 end
|
wolffd@0
|
2125 </pre>
|
wolffd@0
|
2126
|
wolffd@0
|
2127 The file <tt>BNT/examples/static/StructLearn/model_select1</tt> has an example of
|
wolffd@0
|
2128 sequential model selection which uses the same idea.
|
wolffd@0
|
2129 We generate data from the model A->B
|
wolffd@0
|
2130 and compute the posterior prob of all 3 dags on 2 nodes:
|
wolffd@0
|
2131 (1) A B, (2) A <- B , (3) A -> B
|
wolffd@0
|
2132 Models 2 and 3 are <a href="#markov_equiv">Markov equivalent</a>, and therefore indistinguishable from
|
wolffd@0
|
2133 observational data alone, so we expect their posteriors to be the same
|
wolffd@0
|
2134 (assuming a prior which satisfies likelihood equivalence).
|
wolffd@0
|
2135 If we use random parameters, the "true" model only gets a higher posterior after 2000 trials!
|
wolffd@0
|
2136 However, if we make B a noisy NOT gate, the true model "wins" after 12
|
wolffd@0
|
2137 trials, as shown below (red = model 1, blue/green (superimposed)
|
wolffd@0
|
2138 represents models 2/3).
|
wolffd@0
|
2139 <p>
|
wolffd@0
|
2140 <img src="Figures/model_select.png">
|
wolffd@0
|
2141 <p>
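<p>
The essence of this model-selection computation can be sketched as follows
(the actual model_select1 file differs in its details):
<pre>
% Hypothetical set-up: sample complete data from the "true" model A->B,
% then score each of the 3 candidate dags by its log marginal likelihood.
bnet_true = mk_bnet([0 1; 0 0], [2 2]);
bnet_true.CPD{1} = tabular_CPD(bnet_true, 1);
bnet_true.CPD{2} = tabular_CPD(bnet_true, 2);
ncases = 100;
cases = cell(2, ncases);
for l=1:ncases, cases(:,l) = sample_bnet(bnet_true); end

dags = {zeros(2), [0 0; 1 0], [0 1; 0 0]};   % (1) A B,  (2) A <- B,  (3) A -> B
LL = zeros(1,3);
for m=1:3
  bnetm = mk_bnet(dags{m}, [2 2]);
  for i=1:2
    bnetm.CPD{i} = tabular_CPD(bnetm, i, 'prior_type', 'dirichlet', 'dirichlet_type', 'BDeu');
  end
  LL(m) = log_marg_lik_complete(bnetm, cases);
end
post = exp(LL - max(LL)); post = post/sum(post)   % posterior over the 3 models (uniform model prior)
</pre>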
|
wolffd@0
|
2142 The use of marginal likelihood for model selection is discussed in
|
wolffd@0
|
2143 greater detail in the
|
wolffd@0
|
2144 section on <a href="#structure_learning">structure learning</a>.
|
wolffd@0
|
2145
|
wolffd@0
|
2146
|
wolffd@0
|
2147
|
wolffd@0
|
2148
|
wolffd@0
|
2149 <h2><a name="em">Maximum likelihood parameter estimation with missing values</h2>
|
wolffd@0
|
2150
|
wolffd@0
|
2151 Now we consider learning when some values are not observed.
|
wolffd@0
|
2152 Let us randomly hide half the values generated from the water
|
wolffd@0
|
2153 sprinkler example.
|
wolffd@0
|
2154 <pre>
|
wolffd@0
|
2155 samples2 = samples;
|
wolffd@0
|
2156 hide = rand(N, nsamples) > 0.5;
|
wolffd@0
|
2157 [I,J]=find(hide);
|
wolffd@0
|
2158 for k=1:length(I)
|
wolffd@0
|
2159 samples2{I(k), J(k)} = [];
|
wolffd@0
|
2160 end
|
wolffd@0
|
2161 </pre>
|
wolffd@0
|
2162 samples2{i,l} is the value of node i in training case l, or [] if unobserved.
|
wolffd@0
|
2163 <p>
|
wolffd@0
|
2164 Now we will compute the MLEs using the EM algorithm.
|
wolffd@0
|
2165 We need to use an inference algorithm to compute the expected
|
wolffd@0
|
2166 sufficient statistics in the E step; the M (maximization) step is as
|
wolffd@0
|
2167 above.
|
wolffd@0
|
2168 <pre>
|
wolffd@0
|
2169 engine2 = jtree_inf_engine(bnet2);
|
wolffd@0
|
2170 max_iter = 10;
|
wolffd@0
|
2171 [bnet4, LLtrace] = learn_params_em(engine2, samples2, max_iter);
|
wolffd@0
|
2172 </pre>
|
wolffd@0
|
2173 LLtrace(i) is the log-likelihood at iteration i. We can plot this as
|
wolffd@0
|
2174 follows:
|
wolffd@0
|
2175 <pre>
|
wolffd@0
|
2176 plot(LLtrace, 'x-')
|
wolffd@0
|
2177 </pre>
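<p>
As before, we can pull the learned CPTs out of bnet4 with the same struct()
trick (this snippet mirrors the CPT3 extraction above):
<pre>
CPT4 = cell(1,N);
for i=1:N
  s = struct(bnet4.CPD{i});  % violate object privacy
  CPT4{i} = s.CPT;
end
</pre>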
|
wolffd@0
|
2178 Let's display the results after 10 iterations of EM.
|
wolffd@0
|
2179 <pre>
|
wolffd@0
|
2180 celldisp(CPT4)
|
wolffd@0
|
2181 CPT4{1} =
|
wolffd@0
|
2182 0.6616
|
wolffd@0
|
2183 0.3384
|
wolffd@0
|
2184 CPT4{2} =
|
wolffd@0
|
2185 0.6510 0.3490
|
wolffd@0
|
2186 0.8751 0.1249
|
wolffd@0
|
2187 CPT4{3} =
|
wolffd@0
|
2188 0.8366 0.1634
|
wolffd@0
|
2189 0.0197 0.9803
|
wolffd@0
|
2190 CPT4{4} =
|
wolffd@0
|
2191 (:,:,1) =
|
wolffd@0
|
2192 0.8276 0.0546
|
wolffd@0
|
2193 0.5452 0.1658
|
wolffd@0
|
2194 (:,:,2) =
|
wolffd@0
|
2195 0.1724 0.9454
|
wolffd@0
|
2196 0.4548 0.8342
|
wolffd@0
|
2197 </pre>
|
wolffd@0
|
2198 We can get improved performance by using one or more of the following
|
wolffd@0
|
2199 methods:
|
wolffd@0
|
2200 <ul>
|
wolffd@0
|
2201 <li> Increasing the size of the training set.
|
wolffd@0
|
2202 <li> Decreasing the amount of hidden data.
|
wolffd@0
|
2203 <li> Running EM for longer.
|
wolffd@0
|
2204 <li> Using informative priors.
|
wolffd@0
|
2205 <li> Initialising EM from multiple starting points.
|
wolffd@0
|
2206 </ul>
|
wolffd@0
|
2207
|
wolffd@0
|
2208 <h2><a name="gaussianNumerical">Numerical problems when fitting
|
wolffd@0
|
2209 Gaussians using EM</h2>
|
wolffd@0
|
2210
|
wolffd@0
|
2211 When fitting a <a href="#gaussian">Gaussian CPD</a> using EM, you
|
wolffd@0
|
2212 might encounter some numerical
|
wolffd@0
|
2213 problems. See <a
|
wolffd@0
|
2214 href="http://www.ai.mit.edu/~murphyk/Software/HMM/hmm.html#loglikpos">here</a>
|
wolffd@0
|
2215 for a discussion of this in the context of HMMs.
|
wolffd@0
|
2216 <!--
|
wolffd@0
|
2217 (Note: BNT does not currently support spherical covariances, although it
|
wolffd@0
|
2218 would be easy to add, since KPMstats/clg_Mstep supports this option;
|
wolffd@0
|
2219 you would just need to modify gaussian_CPD/update_ess to accumulate
|
wolffd@0
|
2220 weighted inner products.)
|
wolffd@0
|
2221 -->
|
wolffd@0
|
2222
|
wolffd@0
|
2223
|
wolffd@0
|
2224 <h2><a name="tying">Parameter tying</h2>
|
wolffd@0
|
2225
|
wolffd@0
|
2226 In networks with repeated structure (e.g., chains and grids), it is
|
wolffd@0
|
2227 common to assume that the parameters are the same at every node. This
|
wolffd@0
|
2228 is called parameter tying, and reduces the amount of data needed for
|
wolffd@0
|
2229 learning.
|
wolffd@0
|
2230 <p>
|
wolffd@0
|
2231 When we have tied parameters, there is no longer a one-to-one
|
wolffd@0
|
2232 correspondence between nodes and CPDs.
|
wolffd@0
|
2233 Rather, each CPD specifies the parameters for a whole equivalence class
|
wolffd@0
|
2234 of nodes.
|
wolffd@0
|
2235 It is easiest to see this by example.
|
wolffd@0
|
2236 Consider the following <a href="usage_dbn.html#hmm">hidden Markov
|
wolffd@0
|
2237 model (HMM)</a>
|
wolffd@0
|
2238 <p>
|
wolffd@0
|
2239 <img src="Figures/hmm3.gif">
|
wolffd@0
|
2240 <p>
|
wolffd@0
|
2241 <!--
|
wolffd@0
|
2242 We can create this graph structure, assuming we have T time-slices,
|
wolffd@0
|
2243 as follows.
|
wolffd@0
|
2244 (We number the nodes as shown in the figure, but we could equally well
|
wolffd@0
|
2245 number the hidden nodes 1:T, and the observed nodes T+1:2T.)
|
wolffd@0
|
2246 <pre>
|
wolffd@0
|
2247 N = 2*T;
|
wolffd@0
|
2248 dag = zeros(N);
|
wolffd@0
|
2249 hnodes = 1:2:2*T;
|
wolffd@0
|
2250 for i=1:T-1
|
wolffd@0
|
2251 dag(hnodes(i), hnodes(i+1))=1;
|
wolffd@0
|
2252 end
|
wolffd@0
|
2253 onodes = 2:2:2*T;
|
wolffd@0
|
2254 for i=1:T
|
wolffd@0
|
2255 dag(hnodes(i), onodes(i)) = 1;
|
wolffd@0
|
2256 end
|
wolffd@0
|
2257 </pre>
|
wolffd@0
|
2258 <p>
|
wolffd@0
|
2259 The hidden nodes are always discrete, and have Q possible values each,
|
wolffd@0
|
2260 but the observed nodes can be discrete or continuous, and have O possible values/length.
|
wolffd@0
|
2261 <pre>
|
wolffd@0
|
2262 if cts_obs
|
wolffd@0
|
2263 dnodes = hnodes;
|
wolffd@0
|
2264 else
|
wolffd@0
|
2265 dnodes = 1:N;
|
wolffd@0
|
2266 end
|
wolffd@0
|
2267 ns = ones(1,N);
|
wolffd@0
|
2268 ns(hnodes) = Q;
|
wolffd@0
|
2269 ns(onodes) = O;
|
wolffd@0
|
2270 </pre>
|
wolffd@0
|
2271 -->
|
wolffd@0
|
2272 When HMMs are used for semi-infinite processes like speech recognition,
|
wolffd@0
|
2273 we assume the transition matrix
|
wolffd@0
|
2274 P(H(t+1)|H(t)) is the same for all t; this is called a time-invariant
|
wolffd@0
|
2275 or homogeneous Markov chain.
|
wolffd@0
|
2276 Hence hidden nodes 2, 3, ..., T
|
wolffd@0
|
2277 are all in the same equivalence class, say class Hclass.
|
wolffd@0
|
2278 Similarly, the observation matrix P(O(t)|H(t)) is assumed to be the
|
wolffd@0
|
2279 same for all t, so the observed nodes are all in the same equivalence
|
wolffd@0
|
2280 class, say class Oclass.
|
wolffd@0
|
2281 Finally, the prior term P(H(1)) is in a class all by itself, say class
|
wolffd@0
|
2282 H1class.
|
wolffd@0
|
2283 This is illustrated below, where we explicitly represent the
|
wolffd@0
|
2284 parameters as random variables (dotted nodes).
|
wolffd@0
|
2285 <p>
|
wolffd@0
|
2286 <img src="Figures/hmm4_params.gif">
|
wolffd@0
|
2287 <p>
|
wolffd@0
|
2288 In BNT, we cannot represent parameters as random variables (nodes).
|
wolffd@0
|
2289 Instead, we "hide" the
|
wolffd@0
|
2290 parameters inside one CPD for each equivalence class,
|
wolffd@0
|
2291 and then specify that the other CPDs should share these parameters, as
|
wolffd@0
|
2292 follows.
|
wolffd@0
|
2293 <pre>
|
wolffd@0
|
2294 hnodes = 1:2:2*T;
|
wolffd@0
|
2295 onodes = 2:2:2*T;
|
wolffd@0
|
2296 H1class = 1; Hclass = 2; Oclass = 3;
|
wolffd@0
|
2297 eclass = ones(1,N);
|
wolffd@0
|
2298 eclass(hnodes(2:end)) = Hclass;
|
wolffd@0
|
2299 eclass(hnodes(1)) = H1class;
|
wolffd@0
|
2300 eclass(onodes) = Oclass;
|
wolffd@0
|
2301 % create dag and ns in the usual way
|
wolffd@0
|
2302 bnet = mk_bnet(dag, ns, 'discrete', dnodes, 'equiv_class', eclass);
|
wolffd@0
|
2303 </pre>
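<p>
For completeness, here is a sketch of the "usual way" referred to in the
comment above (T time-slices, Q hidden values, observations of size O, and
the flag cts_obs are all assumed given):
<pre>
N = 2*T;
dag = zeros(N);
for i=1:T-1, dag(hnodes(i), hnodes(i+1)) = 1; end   % H(t) -> H(t+1)
for i=1:T,   dag(hnodes(i), onodes(i))   = 1; end   % H(t) -> O(t)
ns = ones(1,N);  ns(hnodes) = Q;  ns(onodes) = O;
if cts_obs, dnodes = hnodes; else dnodes = 1:N; end
</pre>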
|
wolffd@0
|
2304 Finally, we define the parameters for each equivalence class:
|
wolffd@0
|
2305 <pre>
|
wolffd@0
|
2306 bnet.CPD{H1class} = tabular_CPD(bnet, hnodes(1)); % prior
|
wolffd@0
|
2307 bnet.CPD{Hclass} = tabular_CPD(bnet, hnodes(2)); % transition matrix
|
wolffd@0
|
2308 if cts_obs
|
wolffd@0
|
2309 bnet.CPD{Oclass} = gaussian_CPD(bnet, onodes(1));
|
wolffd@0
|
2310 else
|
wolffd@0
|
2311 bnet.CPD{Oclass} = tabular_CPD(bnet, onodes(1));
|
wolffd@0
|
2312 end
|
wolffd@0
|
2313 </pre>
|
wolffd@0
|
2314 In general, if bnet.CPD{e} = xxx_CPD(bnet, j), then j should be a
|
wolffd@0
|
2315 member of e's equivalence class; that is, it is not always the case
|
wolffd@0
|
2316 that e == j. You can use bnet.rep_of_eclass(e) to return the
|
wolffd@0
|
2317 representative of equivalence class e.
|
wolffd@0
|
2318 BNT will look up the parents of j to determine the size
|
wolffd@0
|
2319 of the CPT to use. It assumes that this is the same for all members of
|
wolffd@0
|
2320 the equivalence class.
|
wolffd@0
|
2321 Click <a href="http://www.cs.berkeley.edu/~murphyk/Bayes/param_tieing.html">here</a> for
|
wolffd@0
|
2322 a more complex example of parameter tying.
|
wolffd@0
|
2323 <p>
|
wolffd@0
|
Note:
Normally one would define an HMM as a
<a href = "usage_dbn.html">Dynamic Bayes Net</a>
(see the function BNT/examples/dynamic/mk_chmm.m).
However, one can define an HMM as a static BN using the function
BNT/examples/static/Models/mk_hmm_bnet.m.



<h1><a name="structure_learning">Structure learning</h1>

Update (9/29/03):
Philippe Leray is developing some additional structure learning code
on top of BNT. Click <a
href="http://bnt.insa-rouen.fr/ajouts.html">here</a>
for details.

<p>

There are two very different approaches to structure learning:
constraint-based and search-and-score.
In the <a href="#constraint">constraint-based approach</a>,
we start with a fully connected graph, and remove edges if certain
conditional independencies are detected in the data.
This has the disadvantage that repeated independence tests lose
statistical power.
<p>
In the more popular search-and-score approach,
we perform a search through the space of possible DAGs, and either
return the best one found (a point estimate), or return a sample of the
models found (an approximation to the Bayesian posterior).
<p>
Unfortunately, the number of DAGs as a function of the number of
nodes, G(n), is super-exponential in n.
A closed form formula for G(n) is not known, but the first few values
are shown below (from Cooper, 1999).

<table>
<tr> <th>n</th> <th align=left>G(n)</th> </tr>
<tr> <td>1</td> <td>1</td> </tr>
<tr> <td>2</td> <td>3</td> </tr>
<tr> <td>3</td> <td>25</td> </tr>
<tr> <td>4</td> <td>543</td> </tr>
<tr> <td>5</td> <td>29,281</td> </tr>
<tr> <td>6</td> <td>3,781,503</td> </tr>
<tr> <td>7</td> <td>1.1 x 10^9</td> </tr>
<tr> <td>8</td> <td>7.8 x 10^11</td> </tr>
<tr> <td>9</td> <td>1.2 x 10^15</td> </tr>
<tr> <td>10</td> <td>4.2 x 10^18</td> </tr>
</table>
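Although no closed form is known, these counts can be computed with a simple recurrence
(usually attributed to Robinson, 1973). The following stand-alone snippet, which is not part
of BNT, reproduces the table above.
<pre>
% G(n) = sum_{k=1}^n (-1)^(k+1) * nchoosek(n,k) * 2^(k*(n-k)) * G(n-k), with G(0) = 1
G = zeros(1,11);   % G(i+1) holds the number of DAGs on i nodes
G(1) = 1;          % the single (empty) DAG on 0 nodes
for n=1:10
  total = 0;
  for k=1:n
    total = total + (-1)^(k+1) * nchoosek(n,k) * 2^(k*(n-k)) * G(n-k+1);
  end
  G(n+1) = total;
end
G(2:11)            % 1, 3, 25, 543, 29281, ...
</pre>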

Since the number of DAGs is super-exponential in the number of nodes,
we cannot exhaustively search the space, so we either use a local
search algorithm (e.g., greedy hill climbing, perhaps with multiple
restarts) or a global search algorithm (e.g., Markov Chain Monte
Carlo).
<p>
If we know a total ordering on the nodes,
finding the best structure amounts to picking the best set of parents
for each node independently.
This is what the K2 algorithm does.
If the ordering is unknown, we can search over orderings,
which is more efficient than searching over DAGs (Koller and Friedman, 2000).
<p>
In addition to the search procedure, we must specify the scoring
function. There are two popular choices. The Bayesian score integrates
out the parameters, i.e., it is the marginal likelihood of the model.
The BIC (Bayesian Information Criterion) is defined as
log P(D|theta_hat) - 0.5*d*log(N), where D is the data, theta_hat is
the ML estimate of the parameters, d is the number of parameters, and
N is the number of data cases.
The BIC method has the advantage of not requiring a prior.
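In Matlab notation, with loglik standing for log P(D|theta_hat) and ncases for the number of
data cases, the score is just (illustrative sketch):
<pre>
bic = loglik - 0.5*d*log(ncases);   % d = number of free parameters
</pre>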
<p>
BIC can be derived as a large sample
approximation to the marginal likelihood.
(It is also equal to the Minimum Description Length of a model.)
However, in practice, the sample size does not need to be very large
for the approximation to be good.
For example, in the figure below, we plot the ratio between the log marginal likelihood
and the BIC score against data-set size; we see that the ratio rapidly
approaches 1, especially for non-informative priors.
(This plot was generated by the file BNT/examples/static/bic1.m. It
uses the water sprinkler BN with BDeu Dirichlet priors with different
equivalent sample sizes.)

<p>
<center>
<IMG SRC="Figures/bic.png">
</center>
<p>

As with parameter learning, handling missing data / hidden variables is
much harder than the fully observed case.
The structure learning routines in BNT can therefore be classified into 4
types, analogously to the parameter learning case.
<p>

<TABLE BORDER>
<tr>
<TH></TH>
<th>Full obs</th>
<th>Partial obs</th>
</tr>
<tr>
<th>Point</th>
<td><tt>learn_struct_K2</tt></td>
<!-- <tt>learn_struct_hill_climb</tt> -->
<td>not yet supported</td>
</tr>
<tr>
<th>Bayes</th>
<td><tt>learn_struct_mcmc</tt></td>
<td>not yet supported</td>
</tr>
</table>

<h2><a name="markov_equiv">Markov equivalence</h2>

If two DAGs encode the same conditional independencies, they are
called Markov equivalent. The set of all DAGs can be partitioned into
Markov equivalence classes. Graphs within the same class can have
the direction of some of their arcs reversed without changing any of
the CI relationships.
Each class can be represented by a PDAG
(partially directed acyclic graph) called an essential graph or
pattern. This specifies which edges must be oriented in a certain
direction, and which may be reversed.

<p>
When learning graph structure from observational data,
the best one can hope to do is to identify the model up to Markov
equivalence. To distinguish amongst graphs within the same equivalence
class, one needs interventional data: see the discussion on <a
href="#active">active learning</a> below.

<h2><a name="enumerate">Exhaustive search</h2>

The brute-force approach to structure learning is to enumerate all
possible DAGs, and score each one. This provides a "gold standard"
with which to compare other algorithms. We can do this as follows.
<pre>
dags = mk_all_dags(N);
score = score_dags(data, ns, dags);
</pre>
where data(i,m) is the value of node i in case m,
and ns(i) is the size of node i.
If the DAGs have a lot of families in common, we can cache the sufficient statistics,
making this potentially more efficient than scoring the DAGs one at a time.
(Caching is not currently implemented, however.)
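score is a vector with one entry per candidate DAG, so picking the single best structure is
just an argmax (a minimal sketch):
<pre>
[best_score, best_index] = max(score);
best_dag = dags{best_index};
</pre>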
<p>
By default, we use the Bayesian scoring metric, and assume CPDs are
represented by tables with BDeu(1) priors.
We can override these defaults as follows.
If we want to use uniform priors, we can say
<pre>
params = cell(1,N);
for i=1:N
  params{i} = {'prior', 'unif'};
end
score = score_dags(data, ns, dags, 'params', params);
</pre>
params{i} is a cell-array, containing optional arguments that are
passed to the constructor for CPD i.
<p>
Now suppose we want to use different node types, e.g.,
suppose nodes 1 and 2 are Gaussian, and nodes 3 and 4 softmax (both of
these CPDs can support discrete and continuous parents, which is
necessary since all other nodes will be considered as parents).
The Bayesian scoring metric currently only works for tabular CPDs, so
we will use BIC:
<pre>
score = score_dags(data, ns, dags, 'discrete', [3 4], 'params', [], ...
    'type', {'gaussian', 'gaussian', 'softmax', 'softmax'}, 'scoring_fn', 'bic')
</pre>
In practice, one can't enumerate all possible DAGs for N > 5,
but one can evaluate any reasonably-sized set of hypotheses in this
way (e.g., nearest neighbors of your current best guess).
Think of this as "computer assisted model refinement" as opposed to de
novo learning.


<h2><a name="K2">K2</h2>

The K2 algorithm (Cooper and Herskovits, 1992) is a greedy search algorithm that works as follows.
Initially each node has no parents. The algorithm then incrementally adds the parent whose
addition most increases the score of the resulting structure. When the addition of no single
parent can increase the score, it stops adding parents to the node.
Since we are using a fixed ordering, we do not need to check for
cycles, and can choose the parents for each node independently.
<p>
The original paper used the Bayesian scoring
metric with tabular CPDs and Dirichlet priors.
BNT generalizes this to allow any kind of CPD, and either the Bayesian
scoring metric or BIC, as in the example <a href="#enumerate">above</a>.
In addition, you can specify
an optional upper bound on the number of parents for each node.
The file BNT/examples/static/k2demo1.m gives an example of how to use K2.
We use the water sprinkler network and sample 100 cases from it as before.
Then we see how much data it takes to recover the generating structure:
<pre>
order = [C S R W];
max_fan_in = 2;
sz = 5:5:100;
for i=1:length(sz)
  dag2 = learn_struct_K2(data(:,1:sz(i)), node_sizes, order, 'max_fan_in', max_fan_in);
  correct(i) = isequal(dag, dag2);
end
</pre>
Here are the results.
<pre>
correct =
  Columns 1 through 12
     0     0     0     0     0     0     0     1     0     1     1     1
  Columns 13 through 20
     1     1     1     1     1     1     1     1
</pre>
So we see it takes about sz(10)=50 cases. (BIC behaves similarly,
showing that the prior doesn't matter too much.)
In general, we cannot hope to recover the "true" generating structure,
only one that is in its <a href="#markov_equiv">Markov equivalence
class</a>.


<h2><a name="hill_climb">Hill-climbing</h2>

Hill-climbing starts at a specific point in space,
considers all nearest neighbors, and moves to the neighbor
that has the highest score; if no neighbors have higher
score than the current point (i.e., we have reached a local maximum),
the algorithm stops. One can then restart in another part of the space.
<p>
A common definition of "neighbor" is all graphs that can be
generated from the current graph by adding, deleting or reversing a
single arc, subject to the acyclicity constraint.
Other neighborhoods are possible: see
<a href="http://research.microsoft.com/~dmax/publications/jmlr02.pdf">
Optimal Structure Identification with Greedy Search</a>, Max
Chickering, JMLR 2002.
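<p>
To make the search loop concrete, here is a minimal sketch of greedy hill-climbing built on top
of score_dags. It assumes a helper, mk_nbrs_of_dag, that returns the add/delete/reverse
neighbors of a DAG as a cell array; if your copy of BNT does not provide such a function, you
would need to write that enumeration yourself.
<pre>
dag = zeros(N,N);                          % start from the empty graph
best_score = score_dags(data, ns, {dag});
done = 0;
while ~done
  nbrs = mk_nbrs_of_dag(dag);              % candidate DAGs one move away (assumed helper)
  scores = score_dags(data, ns, nbrs);
  [best_nbr_score, idx] = max(scores);
  if best_nbr_score > best_score
    dag = nbrs{idx};                       % take the uphill move
    best_score = best_nbr_score;
  else
    done = 1;                              % local maximum reached; optionally restart elsewhere
  end
end
</pre>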

<!--
Note: This algorithm is currently (Feb '02) being implemented by Qian
Diao.
-->


<h2><a name="mcmc">MCMC</h2>

We can use a Markov Chain Monte Carlo (MCMC) algorithm called
Metropolis-Hastings (MH) to search the space of all
DAGs.
The standard proposal distribution is to consider moving to all
nearest neighbors in the sense defined <a href="#hill_climb">above</a>.
<p>
The function can be called
as in the following example.
<pre>
[sampled_graphs, accept_ratio] = learn_struct_mcmc(data, ns, 'nsamples', 100, 'burnin', 10);
</pre>
We can convert our set of sampled graphs to a histogram
(empirical posterior over all the DAGs) thus
<pre>
all_dags = mk_all_dags(N);
mcmc_post = mcmc_sample_to_hist(sampled_graphs, all_dags);
</pre>
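Another useful summary is the posterior probability of each individual edge, which we can
estimate by averaging the sampled graphs (a sketch, assuming each element of sampled_graphs is
a DAG adjacency matrix):
<pre>
nsamples = length(sampled_graphs);
edge_prob = zeros(N,N);
for s=1:nsamples
  edge_prob = edge_prob + sampled_graphs{s};   % accumulate adjacency matrices
end
edge_prob = edge_prob / nsamples;              % edge_prob(i,j) approximates P(i->j | data)
</pre>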
To see how well this performs, let us compute the exact posterior exhaustively.
<p>
<pre>
score = score_dags(data, ns, all_dags);
post = normalise(exp(score)); % assuming uniform structural prior
</pre>
We plot the results below.
(The data set was 100 samples drawn from a random 4 node bnet; see the
file BNT/examples/static/mcmc1.)
<pre>
subplot(2,1,1)
bar(post)
subplot(2,1,2)
bar(mcmc_post)
</pre>
<img src="Figures/mcmc_post.jpg" width="800" height="500">
<p>
We can also plot the acceptance ratio versus number of MCMC steps,
as a crude convergence diagnostic.
<pre>
clf
plot(accept_ratio)
</pre>
<img src="Figures/mcmc_accept.jpg" width="800" height="300">
<p>
Better convergence diagnostics can be found
<a href="http://www.lce.hut.fi/~ave/code/mcmcdiag/">here</a>.
<p>
Even though the number of samples needed by MCMC is theoretically
polynomial (not exponential) in the dimensionality of the search space, in practice it has been
found that MCMC does not converge in reasonable time for graphs with
more than about 10 nodes.
<p>
<a href="http://www.bioss.sari.ac.uk/~dirk/software/DBmcmc/">
Dirk Husmeier has extended MCMC model selection to DBNs</a>.

<h2><a name="active">Active structure learning</h2>

As was mentioned <a href="#markov_equiv">above</a>,
one can only learn a DAG up to Markov equivalence, even given infinite data.
If one is interested in learning the structure of a causal network,
one needs interventional data.
(By "intervention" we mean forcing a node to take on a specific value,
thereby effectively severing its incoming arcs.)
<p>
Most of the scoring functions accept an optional argument
that specifies whether a node was observed to have a certain value, or
was forced to have that value: we set clamped(i,m)=1 if node i was
forced in training case m. See, e.g., the file
BNT/examples/static/cooper_yoo.
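For instance, here is a sketch of how such a matrix might be built; the particular node and
cases are purely illustrative, and the example file above shows how the matrix is actually
passed to the scoring routine.
<pre>
clamped = zeros(N, ncases);   % same size as the data matrix
clamped(3, 1:10) = 1;         % pretend node 3 was set by intervention in cases 1..10
</pre>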
<p>
An interesting question is to decide which interventions to perform
(cf. design of experiments). For details, see the following tech
report
<ul>
<li> <a href = "../../Papers/alearn.ps.gz">
Active learning of causal Bayes net structure</a>, Kevin Murphy, March
2001.
</ul>


<h2><a name="struct_em">Structural EM</h2>

Computing the Bayesian score when there is partial observability is
computationally challenging, because the parameter posterior becomes
multimodal (the hidden nodes induce a mixture distribution).
One therefore needs to use approximations such as BIC.
Unfortunately, search algorithms are still expensive, because we need
to run EM at each step to compute the MLE, which is needed to compute
the score of each model. An alternative approach is
to do the local search steps inside of the M step of EM, which is more
efficient since the data has been "filled in"; this is
called the structural EM algorithm (Friedman 1997), and it provably
converges to a local maximum of the BIC score.
<p>
Wei Hu has implemented SEM for discrete nodes.
You can download his package from
<a href="../SEM.zip">here</a>.
Please address all questions about this code to
wei.hu@intel.com.
See also <a href="#phl">Philippe Leray's implementation of SEM</a>.

<!--
<h2><a name="reveal">REVEAL algorithm</h2>

A simple way to learn the structure of a fully observed, discrete,
factored DBN from a time series is described <a
href="usage_dbn.html#struct_learn">here</a>.
-->


<h2><a name="graphdraw">Visualizing the graph</h2>

You can visualize an arbitrary graph (such as one learned using the
structure learning routines) with Matlab code contributed by
<a href="http://www.mbfys.kun.nl/~cemgil/matlab/layout.html">Ali
Taylan Cemgil</a>
from the University of Nijmegen.
For static BNs, call it as follows:
<pre>
draw_graph(bnet.dag);
</pre>
For example, this is the output produced on a
<a href="#qmr">random QMR-like model</a>:
<p>
<img src="Figures/qmr.rnd.jpg">
<p>
If you install the excellent <a
href="http://www.research.att.com/sw/tools/graphviz">graphviz</a>, an
open-source graph visualization package from AT&T,
you can create a much better visualization as follows:
<pre>
graph_to_dot(bnet.dag)
</pre>
This works by converting the adjacency matrix to a file suitable
for input to graphviz (using the dot format),
then converting the output of graphviz to postscript, and displaying the results using
ghostview.
You can do each of these steps separately for more control, as shown
below.
<pre>
graph_to_dot(bnet.dag, 'filename', 'foo.dot');
dot -Tps foo.dot -o foo.ps
ghostview foo.ps &
</pre>

<h2><a name = "constraint">Constraint-based methods</h2>

The IC algorithm (Pearl and Verma, 1991),
and the faster, but otherwise equivalent, PC algorithm (Spirtes, Glymour, and Scheines 1993),
compute many conditional independence tests,
and combine these constraints into a
PDAG to represent the whole
<a href="#markov_equiv">Markov equivalence class</a>.
<p>
IC*/FCI extend IC/PC to handle latent variables: see <a href="#ic_star">below</a>.
(IC stands for inductive causation; PC stands for Peter and Clark,
the first names of Spirtes and Glymour; FCI stands for fast causal
inference.
What we, following Pearl (2000), call IC* was called
IC in the original Pearl and Verma paper.)
For details, see
<ul>
<li>
<a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Causation,
Prediction, and Search</a>, Spirtes, Glymour and
Scheines (SGS), 2001 (2nd edition), MIT Press.
<li>
<a href="http://bayes.cs.ucla.edu/BOOK-2K/index.html">Causality: Models, Reasoning and Inference</a>, J. Pearl,
2000, Cambridge University Press.
</ul>

<p>

The PC algorithm takes as arguments a function f, the number of nodes N,
the maximum fan in K, and additional arguments A which are passed to f.
The function f(X,Y,S,A) returns 1 if X is conditionally independent of Y given S, and 0
otherwise.
For example, suppose we cheat by
passing in a CI "oracle" which has access to the true DAG; the oracle
tests for d-separation in this DAG, i.e.,
f(X,Y,S) calls dsep(X,Y,S,dag). We can do this as follows.
<pre>
pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag);
</pre>
pdag(i,j) = -1 if there is definitely an i->j arc,
and pdag(i,j) = 1 if there is either an i->j or an i<-j arc.
<p>
Applied to the sprinkler network, this returns
<pre>
pdag =
     0     1     1     0
     1     0     0    -1
     1     0     0    -1
     0     0     0     0
</pre>
So as expected, we see that the V-structure at the W node is uniquely identified,
but the other arcs have ambiguous orientation.
<p>
We now give an example from p141 (1st edn) / p103 (2nd edn) of the SGS
book.
This example concerns the female orgasm.
We are given a correlation matrix C between 7 measured factors (such
as subjective experiences of coital and masturbatory experiences),
derived from 281 samples, and want to learn a causal model of the
data. We will not discuss the merits of this type of work here, but
merely show how to reproduce the results in the SGS book.
Their program,
<a href="http://hss.cmu.edu/html/departments/philosophy/TETRAD/tetrad.html">Tetrad</a>,
makes use of the Fisher Z-test for conditional
independence, so we do the same:
<pre>
max_fan_in = 4;
nsamples = 281;
alpha = 0.05;
pdag = learn_struct_pdag_pc('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha);
</pre>
In this case, the CI test is
<pre>
f(X,Y,S) = cond_indep_fisher_z(X,Y,S, C, nsamples, alpha)
</pre>
The results match those of Fig 12a of SGS apart from two edge
differences; presumably this is due to rounding error (although it
could be a bug, either in BNT or in Tetrad).
This example can be found in the file BNT/examples/static/pc2.m.

<p>
<a name="ic_star"></a>
The IC* algorithm (Pearl and Verma, 1991),
and the faster FCI algorithm (Spirtes, Glymour, and Scheines 1993),
are like the IC/PC algorithm, except that they can detect the presence
of latent variables.
See the file <tt>learn_struct_pdag_ic_star</tt> written by Tamar
Kushnir. The output is a matrix P, defined as follows
(see Pearl (2000), p52 for details):
<pre>
% P(i,j) = -1 if there is either a latent variable L such that i <-L->j OR there is a directed edge from i->j.
% P(i,j) = -2 if there is a marked directed i-*>j edge.
% P(i,j) = P(j,i) = 1 if there is an undirected edge i--j.
% P(i,j) = P(j,i) = 2 if there is a latent variable L such that i<-L->j.
</pre>


<h2><a name="phl">Philippe Leray's structure learning package</h2>

Philippe Leray has written a
<a href="http://bnt.insa-rouen.fr/ajouts.html">
structure learning package</a> that uses BNT.

It currently (June 2003) has the following features:
<ul>
<li> PC with Chi2 statistical test
<li> MWST: Maximum Weighted Spanning Tree
<li> Hill Climbing
<li> Greedy Search
<li> Structural EM
<li> hist_ic: optimal Histogram based on IC information criterion
<li> cpdag_to_dag
<li> dag_to_cpdag
<li> ...
</ul>


<!--
<h2><a name="read_learning">Further reading on learning</h2>

I recommend the following tutorials for more details on learning.
<ul>
<li> <a
href="http://www.cs.berkeley.edu/~murphyk/Papers/intel.ps.gz">My short
tutorial</a> on graphical models, which contains an overview of learning.

<li>
<A HREF="ftp://ftp.research.microsoft.com/pub/tr/TR-95-06.PS">
A tutorial on learning with Bayesian networks</a>, D. Heckerman,
Microsoft Research Tech Report, 1995.

<li> <A HREF="http://www-cad.eecs.berkeley.edu/~wray/Mirror/lwgmja">
Operations for Learning with Graphical Models</a>,
W. L. Buntine, JAIR'94, 159--225.
</ul>
<p>
-->


<h1><a name="engines">Inference engines</h1>

Up until now, we have used the junction tree algorithm for inference.
However, sometimes this is too slow, or not even applicable.
In general, there are many inference algorithms, each of which makes
different tradeoffs between speed, accuracy, complexity and
generality. Furthermore, there might be many implementations of the
same algorithm; for instance, a general purpose, readable version,
and a highly-optimized, specialized one.
To cope with this variety, we treat each inference algorithm as an
object, which we call an inference engine.

<p>
An inference engine is an object that contains a bnet and supports the
'enter_evidence' and 'marginal_nodes' methods. The engine constructor
takes the bnet as argument and may do some model-specific processing.
When 'enter_evidence' is called, the engine may do some
evidence-specific processing. Finally, when 'marginal_nodes' is
called, the engine may do some query-specific processing.
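The generic calling pattern is therefore always the same three steps; here it is with the
junction tree engine (any other engine constructor can be substituted on the first line):
<pre>
engine = jtree_inf_engine(bnet);            % model-specific processing
engine = enter_evidence(engine, evidence);  % evidence-specific processing
m = marginal_nodes(engine, i);              % query-specific processing
</pre>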

<p>
The amount of work done when each stage is specified -- structure,
parameters, evidence, and query -- depends on the engine. The cost of
work done early in this sequence can be amortized. On the other hand,
one can make better optimizations if one waits until later in the
sequence.
For example, the parameters might imply
conditional independencies that are not evident in the graph structure,
but can nevertheless be exploited; the evidence indicates which nodes
are observed and hence can effectively be disconnected from the
graph; and the query might indicate that large parts of the network
are d-separated from the query nodes. (Since it is not the actual
<em>values</em> of the evidence that matter, just which nodes are observed,
many engines allow you to specify which nodes will be observed when they are constructed,
i.e., before calling 'enter_evidence'. Some engines can still cope if
the actual pattern of evidence is different, e.g., if there is missing
data.)
<p>

Although being maximally lazy (i.e., only doing work when a query is
issued) may seem desirable,
this is not always the most efficient approach.
For example,
when learning using EM, we need to call marginal_nodes N times, where N is the
number of nodes. <a href="#varelim">Variable elimination</a> would end
up repeating a lot of work
each time marginal_nodes is called, making it inefficient for
learning. The junction tree algorithm, by contrast, uses dynamic
programming to avoid this redundant computation --- it calculates all
marginals in two passes during 'enter_evidence', so calling
'marginal_nodes' takes constant time.
<p>
We will discuss some of the inference algorithms implemented in BNT
below, and finish with a <a href="#engine_summary">summary</a> of all
of them.




<h2><a name="varelim">Variable elimination</h2>

The variable elimination algorithm, also known as bucket elimination
or peeling, is one of the simplest inference algorithms.
The basic idea is to "push sums inside of products"; this is explained
in more detail
<a
href="http://HTTP.CS.Berkeley.EDU/~murphyk/Bayes/bayes.html#infer">here</a>.
<p>
The principle of distributing sums over products can be generalized
greatly to apply to any commutative semiring.
This forms the basis of many common algorithms, such as Viterbi
decoding and the Fast Fourier Transform. For details, see

<ul>
<li> R. McEliece and S. M. Aji, 2000.
<!--<a href="http://www.systems.caltech.edu/EE/Faculty/rjm/papers/GDL.ps">-->
<a href="GDL.pdf">
The Generalized Distributive Law</a>,
IEEE Trans. Inform. Theory, vol. 46, no. 2 (March 2000),
pp. 325--343.

<li>
F. R. Kschischang, B. J. Frey and H.-A. Loeliger, 2001.
<a href="http://www.cs.toronto.edu/~frey/papers/fgspa.abs.html">
Factor graphs and the sum-product algorithm</a>
IEEE Transactions on Information Theory, February, 2001.
</ul>

<p>
Choosing an order in which to sum out the variables so as to minimize
computational cost is known to be NP-hard.
The implementation of this algorithm in
<tt>var_elim_inf_engine</tt> makes no attempt to optimize this
ordering (in contrast, say, to <tt>jtree_inf_engine</tt>, which uses a
greedy search procedure to find a good ordering).
<p>
Note: unlike most algorithms, var_elim does all its computational work
inside of <tt>marginal_nodes</tt>, not inside of
<tt>enter_evidence</tt>.
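It follows the same calling convention as the other engines (a minimal sketch):
<pre>
engine = var_elim_inf_engine(bnet);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);   % for this engine, the real work happens here
</pre>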



<h2><a name="global">Global inference methods</h2>

The simplest inference algorithm of all is to explicitly construct
the joint distribution over all the nodes, and then to marginalize it.
This is implemented in <tt>global_joint_inf_engine</tt>.
Since the size of the joint is exponential in the
number of discrete (hidden) nodes, this is not a very practical algorithm.
It is included merely for pedagogical and debugging purposes.
<p>
Three specialized versions of this algorithm have also been implemented,
corresponding to the cases where all the nodes are discrete (D), all
are Gaussian (G), and some are discrete and some Gaussian (CG).
They are called <tt>enumerative_inf_engine</tt>,
<tt>gaussian_inf_engine</tt>,
and <tt>cond_gauss_inf_engine</tt> respectively.
<p>
Note: unlike most algorithms, these global inference algorithms do all their computational work
inside of <tt>marginal_nodes</tt>, not inside of
<tt>enter_evidence</tt>.

<h2><a name="quickscore">Quickscore</h2>

The junction tree algorithm is quite slow on the <a href="#qmr">QMR</a> network,
since the cliques are so big.
One simple trick we can use is to notice that hidden leaves do not
affect the posteriors on the roots, and hence do not need to be
included in the network.
A second trick is to notice that the negative findings can be
"absorbed" into the prior:
see the file
BNT/examples/static/mk_minimal_qmr_bnet for details.
<p>

A much more significant speedup is obtained by exploiting special
properties of the noisy-or node, as done by the quickscore
algorithm. For details, see
<ul>
<li> Heckerman, "A tractable inference algorithm for diagnosing multiple diseases", UAI 89.
<li> Rish and Dechter, "On the impact of causal independence", UCI
tech report, 1998.
</ul>

This has been implemented in BNT as a special-purpose inference
engine, which can be created and used as follows:
<pre>
engine = quickscore_inf_engine(inhibit, leak, prior);
engine = enter_evidence(engine, pos, neg);
m = marginal_nodes(engine, i);
</pre>

<h2><a name="belprop">Belief propagation</h2>

Even using quickscore, exact inference takes time that is exponential
in the number of positive findings.
Hence for large networks we need to resort to approximate inference techniques.
See for example
<ul>
<li> T. Jaakkola and M. Jordan, "Variational probabilistic inference and the
QMR-DT network", JAIR 10, 1999.

<li> K. Murphy, Y. Weiss and M. Jordan, "Loopy belief propagation for approximate inference: an empirical study",
UAI 99.
</ul>
The latter approximation
entails applying Pearl's belief propagation algorithm to a model even
if it has loops (hence the name loopy belief propagation).
Pearl's algorithm, implemented as <tt>pearl_inf_engine</tt>, gives
exact results when applied to singly-connected graphs
(a.k.a. polytrees, since
the underlying undirected topology is a tree, but a node may have
multiple parents).
To apply this algorithm to a graph with loops, you use the same
<tt>pearl_inf_engine</tt>, which can use a centralized or distributed
message passing protocol.
You can use it as in the following example.
<pre>
engine = pearl_inf_engine(bnet, 'max_iter', 30);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);
</pre>
We found that this algorithm often converges, and when it does, it is often
very accurate, but this depends on the precise setting of the
parameter values of the network.
(See the file BNT/examples/static/qmr1 to repeat the experiment for yourself.)
Understanding when and why belief propagation converges/works
is a topic of ongoing research.
<p>
<tt>pearl_inf_engine</tt> can exploit special structure in noisy-or
and gmux nodes to compute messages efficiently.
<p>
<tt>belprop_inf_engine</tt> is like pearl, but uses potentials to
represent messages. Hence it is slower.
<p>
<tt>belprop_fg_inf_engine</tt> is like belprop,
but is designed for factor graphs.


<h2><a name="sampling">Sampling</h2>

BNT now (Mar '02) has two sampling (Monte Carlo) inference algorithms:
<ul>
<li> <tt>likelihood_weighting_inf_engine</tt> which does importance
sampling and can handle any node type.
<li> <tt>gibbs_sampling_inf_engine</tt>, written by Bhaskara Marthi.
Currently this can only handle tabular CPDs.
For a much faster and more powerful Gibbs sampling program, see
<a href="http://www.mrc-bsu.cam.ac.uk/bugs">BUGS</a>.
</ul>
Note: To generate samples from a network (which is not the same as inference!),
use <tt>sample_bnet</tt>.
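For example, here is a sketch of drawing a set of complete cases from the model; it assumes
sample_bnet returns a single case as a cell array with one entry per node.
<pre>
nsamples = 100;
samples = cell(N, nsamples);
for m=1:nsamples
  samples(:,m) = sample_bnet(bnet);   % one column per sampled case
end
</pre>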



<h2><a name="engine_summary">Summary of inference engines</h2>

The inference engines differ in many ways. Here are
some of the major "axes":
<ul>
<li> Works for all topologies or makes restrictions?
<li> Works for all node types or makes restrictions?
<li> Exact or approximate inference?
</ul>

<p>
In terms of topology, most engines handle any kind of DAG.
<tt>belprop_fg</tt> does approximate inference on factor graphs (FG), which
can be used to represent directed, undirected, and mixed (chain)
graphs.
(In the future, we plan to support exact inference on chain graphs.)
<tt>quickscore</tt> only works on QMR-like models.
<p>
In terms of node types: algorithms that use potentials can handle
discrete (D), Gaussian (G) or conditional Gaussian (CG) models.
Sampling algorithms can essentially handle any kind of node (distribution).
Other algorithms make more restrictive assumptions in exchange for
speed.
<p>
Finally, most algorithms are designed to give the exact answer.
The belief propagation algorithms are exact if applied to trees, and
in some other cases.
Sampling is considered approximate, even though, in the limit of an
infinite number of samples, it gives the exact answer.

<p>
Here is a summary of the properties
of all the engines in BNT which work on static networks.
<p>
<table border>
<tr>
<th align=left>Name</th>
<th align=left>Exact?</th>
<th align=left>Node type?</th>
<th align=left>Topology</th>
</tr>
<tr>
<td align=left> belprop
<td align=left> approx
<td align=left> D
<td align=left> DAG
<tr>
<td align=left> belprop_fg
<td align=left> approx
<td align=left> D
<td align=left> factor graph
<tr>
<td align=left> cond_gauss
<td align=left> exact
<td align=left> CG
<td align=left> DAG
<tr>
<td align=left> enumerative
<td align=left> exact
<td align=left> D
<td align=left> DAG
<tr>
<td align=left> gaussian
<td align=left> exact
<td align=left> G
<td align=left> DAG
<tr>
<td align=left> gibbs
<td align=left> approx
<td align=left> D
<td align=left> DAG
<tr>
<td align=left> global_joint
<td align=left> exact
<td align=left> D,G,CG
<td align=left> DAG
<tr>
<td align=left> jtree
<td align=left> exact
<td align=left> D,G,CG
<td align=left> DAG
<tr>
<td align=left> likelihood_weighting
<td align=left> approx
<td align=left> any
<td align=left> DAG
<tr>
<td align=left> pearl
<td align=left> approx
<td align=left> D,G
<td align=left> DAG
<tr>
<td align=left> pearl
<td align=left> exact
<td align=left> D,G
<td align=left> polytree
<tr>
<td align=left> quickscore
<td align=left> exact
<td align=left> noisy-or
<td align=left> QMR
<tr>
<td align=left> stab_cond_gauss
<td align=left> exact
<td align=left> CG
<td align=left> DAG
<tr>
<td align=left> var_elim
<td align=left> exact
<td align=left> D,G,CG
<td align=left> DAG
</table>



<h1><a name="influence">Influence diagrams / decision making</h1>

BNT implements an exact algorithm for solving LIMIDs (limited memory
influence diagrams), described in
<ul>
<li> S. L. Lauritzen and D. Nilsson.
<a href="http://www.math.auc.dk/~steffen/papers/limids.pdf">
Representing and solving decision problems with limited
information</a>,
Management Science, 47, 1238 - 1251. September 2001.
</ul>
LIMIDs explicitly show all information arcs, rather than implicitly
assuming no forgetting. This allows them to model forgetful
controllers.
<p>
See the examples in <tt>BNT/examples/limids</tt> for details.



<h1>DBNs, HMMs, Kalman filters and all that</h1>

Click <a href="usage_dbn.html">here</a> for documentation about how to
use BNT for dynamical systems and sequence data.


</BODY>