Formally, a Bayesian network is an acyclic directed graph where the nodes are random variables. Intuitively, an edge in a Bayesian network goes from X to Y iff Y depends directly on X.
Mathematically, let X be a node in a Bayesian network, let P be its parents, and let S be the set of all nodes except X. Then Pr(X|S) = Pr(X|P).
Consider the following very simple Bayesian network, where edges point downwards:
class
/ | \
A A A
1 2 3
What this says is Pr(A_1|class,A_2,A_3) = Pr(A_1|class) and similarly for A_2 and A_3.
Intuitively, if you know the class of some individual, then you have all useful information for determining its attribute values.
Example: admit?
/ | \
GPA
GRE California?
Here it makes sense that GPA and California? are independent
given the admit? classification, but not that GPA and
GRE are independent.
burglary quake
\ /
alarm
/ \
johncalls marycalls
The conditional probability table for "alarm" has m*n rows, where m is the number of outcomes of "burglary" and n is the number of outcomes of "quake". Each row gives p discrete probabilities that sum to 1.0, if "alarm" has p different outcomes.
Viewed as an undirected graph, this network is a tree. Hence, as a Bayesian network, we call it a polytree.
Given a Bayesian network, the general reasoning problem that we want to solve is:
given fixed values for some "evidence" nodes E,Formally we want to calculate Pr(X = x | E = e). We can generalize easily to:
compute the probability of value x for some node X.
given a fixed pdf for nodes in E, compute the pdf of X.If X is in E then the answer is trivial. Our general algorithm will be recursive, reducing the problem to the same problem on smaller BNs that are disconnected from each other.
If W is above X, it is related to X through a node U that "causes" X. If W is below X, it is related to X through a node Y that "is caused by" X.
Notation: Let E+ and E- be the nodes in E that are above and below X. We'll write X as a subscript on E when necessary.
Because the network is a polytree, Pr(E-|X,E+) = Pr(E-|X). In words, any influence of E+ on E- is through X. In other words, E- is independent of E+ given X. This is a special case of the concept of d-separation explained on page 444 of Russell and Norvig.
Note first that Pr(X = x | E = e) = Pr(X|E-,E+) = Pr(E-|X,E+)Pr(X|E+)/Pr(E-|E+) and that Pr(E-|X,E+) = Pr(E-|X).
The denominator Pr(E-|E+) is the same for all values x of X, so we can compute it later just by making sure that
sum_x Pr(X = x | E = e) = 1.Now Pr(E-|E+) is equal to the same constant for any value x of X, so we can ignore this factor now and compute it later by normalization. We do have to compute the whole pdf for X given E. We can't compute Pr(X=x|E) for just one value of x.
So we only need to evaluate Pr(E-|X) and Pr(X|E+),
separately. We will derive formulas for these probabililties.
The implementation of the formulas involves two mutually recursive functions.
Pr(X|E+) = sum Pr(X|U=u,E+)Pr(U=u|E+)We can again use independence relationships from the BN graph structure: Pr(X|U=u,E+) = Pr(X|U=u) which is exactly the conditional probability table for X that we are given as part of the belief network.
u
Now partition E+ into E1 through En where Ei is the subnetwork containing Ui and all nodes linked to it, except through X. By independence, Pr(Ui=v|E+) = Pr(Ui=v|Ei) for any v so
Pr(X|E+) = sum Pr(X|U=u,E+) product Pr(Ui=ui|Ei)Calculating Pr(Ui=v|Ei) is a smaller, recursive instance of the original inference problem. Hence we are done for computing Pr(X|E+).
u i
Let Yi be the children of X. For each Yi, let Zi be its other parents. Note that Zi is a vector of nodes. Let Fi be the subnetwork containing Yi and Zi and all linked nodes, except those linked through X.
Since the BN is a polytree, the Fi are independent, so Pr(E-|X) = product Pr(Fi|X).
Let yi and zi be values for Yi and Zi. Average over all possible values:
Pr(E-|X) = product sum sum Pr(Fi|X,yi,zi)Pr(yi,zi|X)Divide Fi into "upper" and "lower" parts Fi+ and Fi- relative to Yi. The two parts are independent when Yi = yi is fixed, so
i yi zi
Pr(Fi|X,yi,zi) = Pr(Fi-|X,yi,zi)Pr(Fi+|X,yi,zi) = Pr(Fi-|yi)Pr(Fi+|zi)This uses further independence facts for Fi- and Fi+. In particular, Fi+ is independent of yi because the parents of yi, namely zi, are all fixed.
Now apply Bayes' rule to Pr(Fi+|zi): Pr(Fi+|zi) = Pr(zi|Fi+)Pr(Fi+)/Pr(zi)
Consider the full expression for Pr(E-|X) and note that Pr(yi,zi|X) = Pr(yi|X,zi)Pr(zi|X) and Pr(zi|X) = Pr(zi).
Cancel the Pr(zi) factors in the expression for Pr(E-|X) and write
Pr(E-|X) = product sum sum Pr(Fi-|yi)Pr(zi|Fi+)Pr(Fi+)Pr(yi|X,zi)Moving factors to the left out of sums where possible gives
i yi zi
Pr(E-|X) = product Pr(Fi+) sum Pr(Fi-|yi) sum Pr(zi|Fi+)Pr(yi|X,zi)Write B = product Pr(Fi+) where B is some constant, and note that Pr(zi|Fi+) = product_j Pr(Zij = zij|Fij)
i yi zi
Pr(E-|X) = B product sum Pr(Fi-|yi) sum Pr(yi|X,zi) product Pr(zij|Fij)This is an expression that we can compute directly using recursion:
i yi zi j
(1) Clustering into meganodes. (2) Conditioning on (i.e. fixing values for) a "cutset". (3) Sampling with likelihood weighting: choose values for variables starting at the root and moving "downwards" if the current variable is in E, take its fixed value and remember the probability of this value if the current variable is not in E, choose its value randomly following its conditional probability table weight each complete assignment according to the remembered probabilities of the evidence nodes
(1) Known topology, all variables observable. Learning a simple Bayesian classifier falls into this category.
(2) Known topology, some variables unobservable. Here there are
intermediate influences between the symptoms
and the diagnosis that cannot be measured directly.
For a credit approval example, such a variable might be "will remain employed." This case is analogous to neural network training with hidden nodes.
(3) Unknown topology, all variables observable. This case arises
when trying to generalize the simple Bayesian classifier
to account for non-independence.
(4) Unknown topology, some variables unobservable. This is of
course the hardest and most interesting case.
In general one is trying to infer causal relationships.
For all cases, there is still the difficulty of choosing between alternative networks.
In general a network with more edges will have more parameters and will fit the training data better. But it won't necessarily have better predictive power. How do you trade off increased number of parameters versus better fit?
For learning problem (1), see the Friedman/Goldszmidt paper. Essentially
we just use observed frequencies to fill in the probability tables.
Consider for example a tree:
class
/ \
/ \
X ...
/ \
/ w \
U
V
A training example will be U = u, V = v, ... and class = c but will not give any observation for the hidden node X. For credit scoring, X may be "remains employed" for example.
Let w be a particular weight. Here w = Pr(X=x|U=u,V=v).
We want to maximize the probability Pr(D) of the training dataset, so we need to evaluate the partial derivative (d/dw) Pr(D).
In order to make the contributions from each training example additive,
we maximize instead log Pr(D) = sum_j log Pr(D_j)
so we will evaluate (d/dw) log Pr(D_j).
Let's drop the j subscript in these HTML notes.
First we use the chain rule for derivatives:
(d/dx) f(g(x)) = [f(g(x))]' = f'(g(x)) * g'(x)
so
(d/dw) log Pr(D) = (1/Pr(D)) (d/dw) Pr(D)
Let w = Pr(X=xi|U=ui) for some fixed xi and ui,
where U may be a vector of parent nodes.
We have Pr(D) = sum Pr(D|x,u)Pr(x,u) and Pr(x,u) = Pr(x|u)Pr(u)
x,u
(d/dw) Pr(D) = sum (d/dw) Pr(D|x,u)Pr(x|u)Pr(u)
x,u
Each term of the summation does not vary when w varies,
except the term for xi and ui, which gives Pr(D|xi,ui)*w*Pr(ui)
So (d/dw) Pr(D) = Pr(D|xi,ui)Pr(ui)
Using Bayes' rule
Pr(D|xi,ui) = Pr(xi,ui|D)*Pr(D)/Pr(xi,ui)
= Pr(xi,ui|D)*Pr(D)/(w*Pr(Ui)
For available software as of 1996 see http://bayes.stat.washington.edu/almond/belief.html