CSE 250A LECTURE NOTES

March 14, 2001
 
 

ANNOUNCEMENTS

Here are some excellent notes on Nuances of probability theory by Thomas Minka of MIT.
 
 

BAYESIAN NETWORKS

Independence relationships can be drawn graphically using a so-called Bayesian network.

Formally, a Bayesian network is an acyclic directed graph where the nodes are random variables.  Intuitively, an edge in a Bayesian network goes from X to Y iff Y depends directly on X.

Mathematically, let X be a node in a Bayesian network, let P be its parents, and let S be the set of all nodes except X.  Then   Pr(X|S) = Pr(X|P).

Consider the following very simple Bayesian network, where edges point downwards:

                 class
                /  |  \
               A   A   A
                1   2   3

What this says is Pr(A_1|class,A_2,A_3) = Pr(A_1|class) and similarly for A_2 and A_3.

Intuitively, if you know the class of some individual, then you have all useful information for determining its attribute values.

Example:         admit?
              /  |   \
           GPA  GRE  California?

Here it makes sense that GPA and California? are independent given the admit? classification, but not that GPA and GRE are independent.
 
 

BAYESIAN NETWORKS WITH DEPENDENCE

Let's consider this Bayesian network (BN), where arrows point downwards:

                     burglary   quake
            \       /
              alarm
            /       \
        johncalls  marycalls

The conditional probability table for "alarm" has m*n rows, where m is the number of outcomes of "burglary" and n is the number of outcomes of "quake".  Each row gives p discrete probabilities that sum to 1.0, if "alarm" has p different outcomes.

Viewed as an undirected graph, this network is a tree.  Hence, as a Bayesian network, we call it a polytree.

Given a Bayesian network, the general reasoning problem that we want to solve is:

given fixed values for some "evidence" nodes E,
compute the probability of value x for some node X.
Formally we want to calculate Pr(X = x | E = e).  We can generalize easily to:
given a fixed pdf for nodes in E, compute the pdf of X.
If X is in E then the answer is trivial.  Our general algorithm will be recursive, reducing the problem to the same problem on smaller BNs that are disconnected from each other.
 
 

INFERENCE FOR POLYTREES

Terminology:  Given any node X in a polytree, each other node is either above or below X.

If W is above X, it is related to X through a node U that "causes" X.  If W is below X, it is related to X through a node Y that "is caused by" X.

Notation: Let E+ and E- be the nodes in E that are above and below X.  We'll write X as a subscript on E when necessary.

Because the network is a polytree, Pr(E-|X,E+) = Pr(E-|X).  In words, any influence of E+ on E- is through X.  In other words, E- is independent of E+ given X.  This is a special case of the concept of d-separation explained on page 444 of Russell and Norvig.

Note first that  Pr(X = x | E = e) = Pr(X|E-,E+) = Pr(E-|X,E+)Pr(X|E+)/Pr(E-|E+) and that  Pr(E-|X,E+) = Pr(E-|X).

The denominator Pr(E-|E+) is the same for all values x of X, so we can compute it later just by making sure that

sum_x Pr(X = x | E = e) = 1.
Now Pr(E-|E+) is equal to the same constant for any value x of X, so we can ignore this factor now and compute it later by normalization.  We do have to compute the whole pdf for X given E.  We can't compute Pr(X=x|E) for just one value of x.

So we only need to evaluate Pr(E-|X) and Pr(X|E+), separately.  We will derive formulas for these probabililties.  The implementation of the formulas involves two mutually recursive functions.
 
 

ALGORITHM TO COMPUTE Pr(X|E+)

Start with evaluating Pr(X|E+) and let U be the set of parent nodes of X.  Let u be a vector of possible values of U.  Then
Pr(X|E+) = sum Pr(X|U=u,E+)Pr(U=u|E+)
            u
We can again use independence relationships from the BN graph structure: Pr(X|U=u,E+) = Pr(X|U=u) which is exactly the conditional probability table for X that we are given as part of the belief network.

Now partition E+ into E1 through En where Ei is the subnetwork containing Ui and all nodes linked to it, except through X.  By independence, Pr(Ui=v|E+) = Pr(Ui=v|Ei) for any v so

Pr(X|E+) = sum Pr(X|U=u,E+) product Pr(Ui=ui|Ei)
                    u                  i
Calculating Pr(Ui=v|Ei) is a smaller, recursive instance of the original inference problem.  Hence we are done for computing Pr(X|E+).
 
 

HOW TO COMPUTE Pr(E-|X)

This is a bit more complicated, but the basic idea is the same.

Let Yi be the children of X.  For each Yi, let Zi be its other parents.  Note that Zi is a vector of nodes.  Let Fi be the subnetwork containing Yi and Zi and all linked nodes, except those linked through X.

Since the BN is a polytree, the Fi are independent, so Pr(E-|X) = product Pr(Fi|X).

Let yi and zi be values for Yi and Zi.  Average over all possible values:

Pr(E-|X) = product sum sum Pr(Fi|X,yi,zi)Pr(yi,zi|X)
              i     yi  zi
Divide Fi into "upper" and "lower" parts Fi+ and Fi- relative to Yi.  The two parts are independent when Yi = yi is fixed, so
Pr(Fi|X,yi,zi) = Pr(Fi-|X,yi,zi)Pr(Fi+|X,yi,zi) = Pr(Fi-|yi)Pr(Fi+|zi)
This uses further independence facts for Fi- and Fi+.  In particular, Fi+ is independent of yi because the parents of yi, namely zi, are all fixed.

Now apply Bayes' rule to Pr(Fi+|zi): Pr(Fi+|zi) = Pr(zi|Fi+)Pr(Fi+)/Pr(zi)

Consider the full expression for Pr(E-|X) and note that Pr(yi,zi|X) = Pr(yi|X,zi)Pr(zi|X) and Pr(zi|X) = Pr(zi).

Cancel the Pr(zi) factors in the expression for Pr(E-|X) and write

 Pr(E-|X) = product sum sum Pr(Fi-|yi)Pr(zi|Fi+)Pr(Fi+)Pr(yi|X,zi)
                      i     yi  zi
Moving factors to the left out of sums where possible gives
 Pr(E-|X) = product Pr(Fi+) sum Pr(Fi-|yi) sum Pr(zi|Fi+)Pr(yi|X,zi)
                      i             yi             zi
Write B = product Pr(Fi+) where B is some constant, and note that Pr(zi|Fi+) = product_j Pr(Zij = zij|Fij)
where Fij is the subnetwork reachable from Zij without going through Yi.  Finally we have
Pr(E-|X) = B product sum Pr(Fi-|yi) sum Pr(yi|X,zi) product Pr(zij|Fij)
                i     yi             zi                j
This is an expression that we can compute directly using recursion:

INFERENCE IN GENERAL BELIEF NETWORKS

In general inference is NP-complete.  There are many strategies for overcoming this:
  • (1) Clustering into meganodes.
  • (2) Conditioning on (i.e. fixing values for) a "cutset".
  • (3) Sampling with likelihood weighting:
  • choose values for variables starting at the root and moving "downwards"
  • if the current variable is in E, take its fixed value and remember the probability of this value
  • if the current variable is not in E, choose its value randomly following its conditional probability table
  • weight each complete assignment according to the remembered probabilities of the evidence nodes
  • LEARNING BELIEF NETWORKS

    We can identify four different learning problems:

    (1) Known topology, all variables observable.  Learning a simple Bayesian classifier falls into this category.

    (2) Known topology, some variables unobservable.  Here there are intermediate influences between the symptoms
    and the diagnosis that cannot be measured directly.

    For a credit approval example, such a variable might be "will remain employed."  This case is analogous to neural network training with hidden nodes.

    (3) Unknown topology, all variables observable.  This case arises when trying to generalize the simple Bayesian classifier
    to account for non-independence.

    (4) Unknown topology, some variables unobservable.  This is of course the hardest and most interesting case.
    In general one is trying to infer causal relationships.

    For all cases, there is still the difficulty of choosing between alternative networks.

    In general a network with more edges will have more parameters and will fit the training data better.  But it won't necessarily have better predictive power.  How do you trade off increased number of parameters versus better fit?

    For learning problem (1), see the Friedman/Goldszmidt paper.  Essentially we just use observed frequencies to fill in the probability tables.
     
     

    GRADIENT DESCENT TRAINING

    This is a method for discovering the best pdfs for unobservable (hidden) nodes in a Bayesian network with fixed topology.

    Consider for example a tree:

                    class
                    /   \
                   /     \
                  X      ...
                 / \
                / w \
               U     V

    A training example will be U = u, V = v, ... and class = c but will not give any observation for the hidden node X.  For credit scoring, X may be "remains employed" for example.

    Let w be a particular weight.  Here w = Pr(X=x|U=u,V=v).

    We want to maximize the probability Pr(D) of the training dataset, so we need to evaluate the partial derivative (d/dw) Pr(D).

    In order to make the contributions from each training example additive, we maximize instead log Pr(D) = sum_j log Pr(D_j)
    so we will evaluate (d/dw) log Pr(D_j).

    Let's drop the j subscript in these HTML notes.

    First we use the chain rule for derivatives:
     (d/dx) f(g(x)) = [f(g(x))]' = f'(g(x)) * g'(x)
    so
     (d/dw) log Pr(D) = (1/Pr(D)) (d/dw) Pr(D)

    Let w = Pr(X=xi|U=ui) for some fixed xi and ui,
    where U may be a vector of parent nodes.

    We have Pr(D) = sum Pr(D|x,u)Pr(x,u) and Pr(x,u) = Pr(x|u)Pr(u)
                    x,u

     (d/dw) Pr(D) = sum (d/dw) Pr(D|x,u)Pr(x|u)Pr(u)
                           x,u

    Each term of the summation does not vary when w varies,
    except the term for xi and ui, which gives Pr(D|xi,ui)*w*Pr(ui)

    So  (d/dw) Pr(D) = Pr(D|xi,ui)Pr(ui)

    Using Bayes' rule

     Pr(D|xi,ui) = Pr(xi,ui|D)*Pr(D)/Pr(xi,ui)
                        = Pr(xi,ui|D)*Pr(D)/(w*Pr(Ui)

    For available software as of 1996 see http://bayes.stat.washington.edu/almond/belief.html