%DEMARD Automatic relevance determination using the MLP.
%
% Description
% This script demonstrates the technique of automatic relevance
% determination (ARD) using a synthetic problem having three input
% variables: X1 is sampled uniformly from the range (0,1) and has a low
% level of added Gaussian noise, X2 is a copy of X1 with a higher level
% of added noise, and X3 is sampled randomly from a Gaussian
% distribution. The single target variable is determined by
% SIN(2*PI*X1) with additive Gaussian noise. Thus X1 is very relevant
% for determining the target value, X2 is of some relevance, while X3
% is irrelevant. The prior over weights is given by the ARD Gaussian
% prior with a separate hyper-parameter for the group of weights
% associated with each input. A multi-layer perceptron is trained on
% this data, with re-estimation of the hyper-parameters using EVIDENCE.
% The final values for the hyper-parameters reflect the relative
% importance of the three inputs.
%
% See also
% DEMMLP1, DEMEV1, MLP, EVIDENCE
%

% Copyright (c) Ian T Nabney (1996-2001)

clc;
disp('This demonstration illustrates the technique of automatic relevance')
disp('determination (ARD) using a multi-layer perceptron.')
disp(' ');
disp('First, we set up a synthetic data set involving three input variables:')
disp('x1 is sampled uniformly from the range (0,1) and has a low level of')
disp('added Gaussian noise, x2 is a copy of x1 with a higher level of added')
disp('noise, and x3 is sampled randomly from a Gaussian distribution. The')
disp('single target variable is given by t = sin(2*pi*x1) with additive')
disp('Gaussian noise. Thus x1 is very relevant for determining the target')
disp('value, x2 is of some relevance, while x3 should in principle be')
disp('irrelevant.')
disp(' ');
disp('Press any key to see a plot of t against x1.')
pause;

% Generate the data set.
randn('state', 0);
rand('state', 0);
ndata = 100;
noise = 0.05;
x1 = rand(ndata, 1) + 0.002*randn(ndata, 1);
x2 = x1 + 0.02*randn(ndata, 1);
x3 = 0.5 + 0.2*randn(ndata, 1);
x = [x1, x2, x3];
t = sin(2*pi*x1) + noise*randn(ndata, 1);

% Plot the data and the original function.
h = figure;
plotvals = linspace(0, 1, 200)';
plot(x1, t, 'ob')
hold on
axis([0 1 -1.5 1.5])
[fx, fy] = fplot('sin(2*pi*x)', [0 1]);
plot(fx, fy, '-g', 'LineWidth', 2);
legend('data', 'function');

disp(' ');
disp('Press any key to continue')
pause; clc;

disp('The prior over weights is given by the ARD Gaussian prior with a')
disp('separate hyper-parameter for the group of weights associated with each')
disp('input. This prior is set up using the utility MLPPRIOR. The network is')
disp('trained by error minimization using scaled conjugate gradient function')
disp('SCG. There are two cycles of training, and at the end of each cycle')
disp('the hyper-parameters are re-estimated using EVIDENCE.')
disp(' ');
disp('Press any key to create and train the network.')
disp(' ');
pause;

% Set up network parameters.
nin = 3;                   % Number of inputs.
nhidden = 2;               % Number of hidden units.
nout = 1;                  % Number of outputs.
aw1 = 0.01*ones(1, nin);   % First-layer ARD hyperparameters.
ab1 = 0.01;                % Hyperparameter for hidden unit biases.
aw2 = 0.01;                % Hyperparameter for second-layer weights.
ab2 = 0.01;                % Hyperparameter for output unit biases.
beta = 50.0;               % Coefficient of data error.

% Create and initialize network.
prior = mlpprior(nin, nhidden, nout, aw1, ab1, aw2, ab2);
net = mlp(nin, nhidden, nout, 'linear', prior, beta);

% Set up vector of options for the optimiser.
nouter = 2;                % Number of outer loops.
ninner = 10;               % Number of inner loops.
options = zeros(1,18);     % Default options vector.
options(1) = 1;            % This provides display of error values.
options(2) = 1.0e-7;       % This ensures that convergence must occur.
options(3) = 1.0e-7;
options(14) = 300;         % Number of training cycles in inner loop.

% Train using scaled conjugate gradients, re-estimating alpha and beta.
for k = 1:nouter
  net = netopt(net, options, x, t, 'scg');
  [net, gamma] = evidence(net, x, t, ninner);
  fprintf(1, '\n\nRe-estimation cycle %d:\n', k);
  disp('The first three alphas are the hyperparameters for the corresponding');
  disp('input to hidden unit weights. The remainder are the hyperparameters');
  disp('for the hidden unit biases, second layer weights and output unit')
  disp('biases, respectively.')
  fprintf(1, '  alpha = %8.5f\n', net.alpha);
  fprintf(1, '  beta  = %8.5f\n', net.beta);
  fprintf(1, '  gamma = %8.5f\n\n', gamma);
  disp(' ')
  disp('Press any key to continue.')
  pause
end

% Plot the function corresponding to the trained network.
figure(h); hold on;
[y, z] = mlpfwd(net, plotvals*ones(1,3));
plot(plotvals, y, '-r', 'LineWidth', 2)
legend('data', 'function', 'network');

disp('Press any key to continue.');
pause; clc;

disp('We can now read off the hyperparameter values corresponding to the')
disp('three inputs x1, x2 and x3:')
disp(' ');
fprintf(1, '    alpha1: %8.5f\n', net.alpha(1));
fprintf(1, '    alpha2: %8.5f\n', net.alpha(2));
fprintf(1, '    alpha3: %8.5f\n', net.alpha(3));
disp(' ');
disp('Since each alpha corresponds to an inverse prior variance, we see that')
disp('the weights associated with input x1 are allowed a large variance,')
disp('those of x2 an intermediate value, and the weights associated with x3')
disp('only a small variance.')
disp(' ')
disp('Press any key to continue.')
disp(' ')
pause
disp('This is confirmed by looking at the corresponding weight values:')
disp(' ');
fprintf(1, '    %8.5f    %8.5f\n', net.w1');
disp(' ');
disp('where the three rows correspond to weights associated with x1, x2 and')
disp('x3 respectively. We see that the network is giving greatest emphasis')
disp('to x1 and least emphasis to x3, with intermediate emphasis on x2.')
disp('Since the target t is statistically independent of x3 we might expect')
disp('the weights associated with this input to go to zero. However, for any')
disp('finite data set there may be some chance correlation between x3 and t,')
disp('and so the corresponding alpha remains finite.')

disp(' ');
disp('Press any key to end.')
pause; clc; close(h); clear all
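
% The commented-out sketch below is not part of the original demo; it
% assumes it is run before the 'clear all' above, so that the trained
% network 'net' and 'nin' are still in the workspace. Since each alpha is
% the inverse variance of the Gaussian prior over its group of weights,
% 1/sqrt(alpha) gives the prior standard deviation for the weights from
% each input, and a larger value indicates a more relevant input.
%
%   for i = 1:nin
%     fprintf(1, 'Input x%d: alpha = %8.5f, prior std = %8.5f\n', ...
%       i, net.alpha(i), 1/sqrt(net.alpha(i)));
%   end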