Contents
- Introduction
- Bayesian network definition
- Bayesian scoring metric
- Heuristic search methods
- Banjo
- Literature cited
Introduction
Below, we provide an overview of Bayesian networks as used in our study, and of the Bayesian network application Banjo. For a thorough introduction to Bayesian networks, we refer the reader to the book by Pearl (1988). Additionally, Heckerman et al. (1995) provide a more concise tutorial on Bayesian networks and Friedman (2004) reviews their use for cellular networks. Further details on Banjo can be accessed via its webpage (www.cs.duke.edu/~amink/software/banjo/ [accessed 17 Feb 2010]) and its functionality is covered in several publications (Hartemink et al. 2001, 2002, Smith et al. 2002, 2003, Yu et al. 2004).
Bayesian network definition
A Bayesian network (BN) is a graphical representation of a joint probability distribution over a set of variables X = {X1, ..., Xn}, in our case, bird species and habitat characteristics. A BN is specified by a pair <G, Θ>. G represents a graph where the variables X1, ..., Xn are drawn as nodes; an arrow between two variables, Xj → Xi, is known as a link and represents a statistical dependence relationship of Xi, the child, on Xj, its parent (Fig. B1). The joint probability distribution is encoded in G such that a variable Xi is conditionally independent of all its non-descendants (Fig. B1) given its parents, Pa(Xi). Θ represents the collection of parameters of this joint probability distribution. In discrete BNs as we use here, each parameter θ_xi|pa(Xi) = P(Xi = xi | Pa(Xi) = pa(Xi)) specifies the probability of a variable Xi taking on the value xi given its parents Pa(Xi) having a particular set of values pa(Xi). The joint probability distribution specified by <G, Θ> is thus:
P(X1, ..., Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi)).
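As a concrete illustration, the factorization above can be sketched in code for the small network of Fig. B1 (A → B, B → C, B → D). The conditional probability tables here are hypothetical values invented for the example, not taken from our data.

```python
# Hypothetical conditional probability tables for the Fig. B1 network
# A -> B, B -> C, B -> D, with binary variables coded 0/1.
p_A = {0: 0.6, 1: 0.4}                               # P(A)
p_B = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}     # P(B | A): p_B[a][b]
p_C = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}   # P(C | B): p_C[b][c]
p_D = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}     # P(D | B): p_D[b][d]

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) via the BN factorization
    P(A) * P(B|A) * P(C|B) * P(D|B)."""
    return p_A[a] * p_B[a][b] * p_C[b][c] * p_D[b][d]

# The factorization defines a proper distribution: probabilities sum to 1.
total = sum(joint(a, b, c, d)
            for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1))
```

Because C and D depend on B only through their own conditional tables, any query about them conditions only on B, which is exactly the conditional-independence structure the graph encodes.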
FIG. B1. Graph representing a Bayesian network. The variable B (red) has parent A (black) and children C and D. The gray nodes are the descendants of B. B is conditionally independent of the white nodes given A.
This definition leads to a number of features. First, because a BN represents a joint probability distribution, the graph G is restricted to be acyclic, that is, to have no loops. Second, because there might be multiple mathematically equivalent ways to factor a given probability distribution, multiple BNs, which differ only in the direction of some links, can represent the same joint probability distribution; BNs representing the same distribution are known as an equivalence class (Fig. B2). Third, because each parameter θ can be specified independently, discrete BNs can model many types of relationships, including nonlinear, nonadditive, and arbitrary combinatorial ones.
FIG. B2. Equivalence class of Bayesian networks. Three Bayesian networks that form an equivalence class are shown: each represents a mathematically equivalent representation of the same joint probability distribution. BNs are drawn in the same color as their corresponding factoring of the joint probability distribution.
Bayesian scoring metric
In order to evaluate how well a BN represents the statistical relationships between variables in a given data set, a Bayesian Scoring Metric (BSM) is used. The BSM calculates the log of the probability that the graph G encodes the statistical dependencies in a dataset D:
BSM(G:D) = log P(G|D) = log P(D|G) + log P(G) - log P(D).
In practice, P(D) is treated as an uncalculated constant for comparing the BSM of networks on the same data set. Here, we use P(G) as a hard prior: it is set to zero for all networks containing disallowed links (see Banjo, below) and presumed to be an uncalculated equivalent value for all other networks; thus, the BSM calculates only log P(D|G). This is done using the Bayesian Dirichlet equivalent (BDe) scoring metric, which assumes a Dirichlet prior over the parameters P(Θ|G) and integrates over all possible parameter assignments Θ:
BDe(G:D) = log P(D|G) = log ∫ P(D|G, Θ) P(Θ|G) dΘ.
This solves to:
BDe(G:D) = log ∏_{i=1}^{n} ∏_{j=1}^{qi} [ Γ(αij) / Γ(αij + Nij) ] ∏_{k=1}^{ri} [ Γ(αijk + Nijk) / Γ(αijk) ],
where n is the number of variables Xi in X, qi is the number of sets of parent values pa(Xi) of the parents of Xi, ri is the number of values xi of Xi, Γ(·) is the gamma function, αij and αijk are equivalent sample size statistics of the Dirichlet distribution, and Nij and Nijk are counts in D of the number of times the parents of Xi take on value set j, and the number of times the variable Xi takes on value k with parents having value set j, respectively (Heckerman et al. 1995).
To provide an intuitive understanding of how this score relates to the data, we consider the terms inside the final two products. The rightmost term, Γ(αijk + Nijk) / Γ(αijk), is higher when counts are more concentrated in a particular child value. Thus, the score is higher when a particular parent state corresponds more strongly with a particular child state and is thus more useful for predicting the child. The leftmost term, Γ(αij) / Γ(αij + Nij), is higher with (1) fewer parent states and (2) equal examples of each state. The first feature is an inherent penalty for complexity that only allows more complex networks (i.e., more parents) when the predictive value from an additional parent (rightmost term) can counterbalance the cost of adding it (leftmost term; Yu 2005). The second feature indicates that the score is higher when the data are distributed more evenly across their possible values.
Two other factors that we control influence this score: the value used for the Dirichlet prior's equivalent sample size (ess) and the discretization of the variables. The ess is used to calculate the αij and αijk in the score above; it can be thought of as representing how many data points' worth of belief there is that "anything can happen". A higher ess means the score for any relationship is higher, and thus it is easier for any link to be found; we use a low ess of 1 to put a high burden on the data to provide evidence for relationships.
Discretization influences the score in multiple ways. More discrete states can yield more detailed predictive power from parent state to child state (rightmost product term) and retain more of the information present in the original data; fewer discrete states yield more power to find relationships that exist (fewer parent states in the leftmost product term) and also mask noise. Statistical power is also increased when the data points are evenly distributed across states (leftmost product term). Intelligent discretization is required to balance these considerations; the approach we developed for our ecological data is detailed in Appendix A (supplementary methods).
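For illustration only, one simple way to spread data points evenly across states is equal-frequency (quantile) discretization, sketched below with invented function names and hypothetical measurements; our actual procedure is the one described in Appendix A.

```python
def discretize_quantile(values, n_states=3):
    """Assign each observation a state 0..n_states-1 so that the states
    receive approximately equal numbers of data points."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # rank the data
    states = [0] * len(values)
    for rank, i in enumerate(order):
        states[i] = rank * n_states // len(values)  # equal-sized rank bins
    return states

# Nine hypothetical habitat measurements -> three states of three points each
data = [0.1, 2.5, 0.3, 9.9, 4.2, 0.7, 5.5, 3.3, 8.8]
states = discretize_quantile(data)
```

Evenly filled states maximize the leftmost product term discussed above, at the cost of placing the bin boundaries wherever the quantiles happen to fall.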
Heuristic search methods
Finding the highest scoring BN is computationally intractable (NP-complete) under the BDe score (Chickering 1996). Thus, we employ heuristic search: a way to intelligently move through the search space of possible BNs to identify higher-scoring, if not the highest scoring, networks. We investigated both greedy search and simulated annealing for use with our ecological datasets, and eventually developed a method using greedy search but incorporating a novel model-averaging approach (see Appendix A, supplementary methods). Here, we overview the basics behind both greedy search and simulated annealing.
Greedy search begins with a randomly chosen graph. In the full local version of greedy search, all possible single changes to this graph (i.e., addition or deletion of links) are evaluated for their effect on the graph's BSM. The change that most increases the BSM is performed. In the random local version, a single change is chosen at random and performed only if it increases the BSM; otherwise, further changes are chosen at random until one is found that increases the BSM. In both versions, the process is repeated iteratively until no possible change increases the BSM, i.e., the search has reached a peak in the search space. Greedy search thus finds only local maxima; to adequately cover the search space, multiple restarts with new random graphs are performed.
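The random local version can be sketched as follows. Here score() stands in for any BSM, the graph is a set of (parent, child) links, and the stopping rule (give up after many consecutive failed proposals) is a simplification chosen for the example; none of this is Banjo's implementation.

```python
import random

def is_acyclic(links):
    """Depth-first check that the directed graph has no cycles
    (a BN graph must be acyclic)."""
    children = {}
    for parent, child in links:
        children.setdefault(parent, []).append(child)
    done, on_path = set(), set()
    def visit(node):
        if node in on_path:
            return False            # back edge: found a cycle
        if node in done:
            return True
        on_path.add(node)
        ok = all(visit(c) for c in children.get(node, ()))
        on_path.discard(node)
        done.add(node)
        return ok
    return all(visit(parent) for parent, _ in links)

def random_local_greedy(variables, score, max_failures=200, rng=random):
    graph = set()                                   # start from a chosen graph
    current = score(graph)
    failures = 0
    while failures < max_failures:
        link = tuple(rng.sample(variables, 2))      # propose one random change:
        candidate = set(graph)
        if link in candidate:
            candidate.remove(link)                  # ...delete an existing link
        else:
            candidate.add(link)                     # ...or add a new one
        if is_acyclic(candidate) and score(candidate) > current:
            graph, current = candidate, score(candidate)  # accept improvements only
            failures = 0
        else:
            failures += 1
    return graph, current

# Toy score that rewards two "true" links and penalizes all others
TRUE = {("A", "B"), ("B", "C")}
toy_score = lambda g: len(g & TRUE) - len(g - TRUE)
best, best_score = random_local_greedy(["A", "B", "C"], toy_score,
                                       rng=random.Random(0))
```

Because only score-increasing changes are ever accepted, the loop can end on a local maximum; in practice this sketch, like the real search, would be wrapped in multiple restarts from new random graphs.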
Simulated annealing begins with either a randomly chosen graph or, often, the empty graph. A single change in graph structure is chosen at random. If this change increases the BSM, it is performed. If it would decrease the BSM, the probability of accepting the change depends on the amount by which it does so (ΔBSM):
P(accept) = e^(-ΔBSM/T),
where T is a variable controlling the temperature. Again, this process is repeated iteratively. The value of T is lowered throughout simulated annealing such that, as the search progresses, changes that decrease the BSM are less likely to be accepted (Kirkpatrick et al. 1983). When T reaches some pre-defined minimum, it is reset to a maximum value and the search continues until a stopping criterion, such as CPU time or number of structures visited, is reached.
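The acceptance rule above can be sketched directly. This is a generic illustration of the criterion, not Banjo's implementation; ΔBSM here denotes the amount by which a proposed change would decrease the BSM.

```python
import math
import random

def accept_change(delta_bsm, T, rng=random):
    """Accept a change that would DECREASE the BSM by delta_bsm (>= 0)
    with probability exp(-delta_bsm / T); improvements are always accepted."""
    if delta_bsm <= 0:
        return True
    return rng.random() < math.exp(-delta_bsm / T)

# Score-decreasing moves are accepted often at high T, rarely at low T:
rng = random.Random(0)
hot = sum(accept_change(1.0, T=10.0, rng=rng) for _ in range(10000))
cold = sum(accept_change(1.0, T=0.1, rng=rng) for _ in range(10000))
```

Lowering T over the course of the search turns this rule from a near-random walk, which can escape local maxima, into a near-greedy search that settles on a peak.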
Banjo
Banjo (www.cs.duke.edu/~amink/software/banjo/ [accessed 17 Feb 2010]) is a software program, free for academic use, that implements heuristic search for Bayesian networks to describe a provided data set. The user provides data in a tab-delimited file, with variables in columns and observations in rows. Banjo can apply both simulated annealing starting from the empty graph, and full local and random local versions of greedy search. Banjo returns one top graph, representing the highest scoring network visited during the search, and its BSM (BDe value). Banjo can also keep track of a requested number of top scoring graphs, both their structure and score, to provide information on the progress of the search. Banjo infers static Bayesian networks from non-temporal data and dynamic Bayesian networks when provided with time series. The user is allowed to specify a hard prior on P(G) by providing a set of links which must be in the solution (P(G) = 1, mandatory links) or a set of links which must not be in the solution (P(G) = 0, disallowed links).
For the one top graph, Banjo also returns an influence score for each link. This is calculated as in Yu et al. (2004), and represents the magnitude (scaled 0-1) and direction (positive: +; negative: -) of the relationship between linked variables. It returns a value precisely equal to 0.0 when the relationship is non-monotonic, i.e., neither positive nor negative, but a more complicated non-linear relationship.
The user specifies the type of search desired, the number of top graphs to be stored, the ess value, a maximum number of parents allowed per variable (the default maximum is set to 5 for memory-handling reasons), any mandatory and/or disallowed links, and a stopping criterion as either the number of networks visited, the time taken, or the number of restarts for greedy search.
Literature cited
Chickering, D. M. 1996. Learning Bayesian networks is NP-complete. Pages 121-130 in D. Fisher and H.-J. Lenz, editors. Learning from Data: AI and Statistics V. Springer, New York, New York, USA.
Friedman, N. 2004. Inferring cellular networks using probabilistic graphical models. Science 303:799-805.
Hartemink, A. J., D. K. Gifford, T. S. Jaakkola, and R. A. Young. 2001. Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pages 422-433 in R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale, and T. E. D. Klein, editors. Proceedings of the Pacific Symposium on Biocomputing 2001. World Scientific, Singapore.
Hartemink, A. J., D. K. Gifford, T. S. Jaakkola, and R. A. Young. 2002. Combining location and expression data for principled discovery of genetic regulatory network models. Pages 437-449 in R. B. Altman, A. K. Dunker, L. Hunter, and K. Lauderdale, editors. Proceedings of the Pacific Symposium on Biocomputing 2002. World Scientific, Singapore.
Heckerman, D., D. Geiger, and D. M. Chickering. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20:197-243.
Kirkpatrick, S., C. D. Gelatt, and M. P. Vecchi. 1983. Optimization by simulated annealing. Science 220:671-680.
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco, California, USA.
Smith, V. A., E. D. Jarvis, and A. J. Hartemink. 2002. Evaluating functional network inference using simulations of complex biological systems. Bioinformatics 18:S216-S224.
Smith, V. A., E. D. Jarvis, and A. J. Hartemink. 2003. Influence of network topology and data collection on network inference. Pages 164-175 in R. B. Altman, A. K. Dunker, L. Hunter, and T. A. Jung, editors. Proceedings of the Pacific Symposium on Biocomputing 2003. World Scientific, Singapore.
Yu, J. 2005. Developing Bayesian network inference algorithms to predict causal functional pathways in biological systems. PhD Thesis. Duke University, Durham, North Carolina, USA.
Yu, J., V. A. Smith, P. P. Wang, A. J. Hartemink, and E. D. Jarvis. 2004. Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics 20:3594-3603.