- 2020-08-19 02:41

1. Overview

A Bayesian network is a graphical model that represents probabilistic relationships among variables. It provides a natural way to express causal information and is used to discover latent relationships in data. In the network, nodes represent variables and directed edges represent dependencies between variables.

The Bayesian approach has a distinctive way of expressing uncertain knowledge, rich probabilistic expressiveness, and the ability to learn incrementally from prior knowledge, which has made it one of the most attractive directions among data mining methods.

1.1 The history of Bayesian Networks

1.2 Basic viewpoints of Bayesian method

The hallmark of the Bayesian approach is that it uses probability to represent all forms of uncertainty; learning and other forms of reasoning are carried out through the rules of probability.

The result of Bayesian learning is expressed as a probability distribution over random variables, which can be interpreted as our degree of belief in the different possibilities.

The starting point of the Bayesian school is two contributions of Bayes: the Bayes theorem and the Bayes hypothesis.

Bayes' theorem relates the prior probability of an event to its posterior probability.

Supplementary knowledge:

(1) Prior probability: the probability of an event determined from historical data or subjective judgment. It has not been verified by experiment and is assessed before any test, hence the name "prior" probability. Prior probabilities fall into two categories: objective prior probabilities, computed from past historical data, and subjective prior probabilities, which are judged from personal experience when historical data are missing or incomplete.

(2) Posterior probability: the probability obtained by applying the Bayes formula, using new information gathered through investigation to revise the prior probability into one that better reflects reality.

(3) Joint probability: the probability that two events both occur, i.e., the probability of the intersection of the events; it is computed with the multiplication formula.

Assume random vectors x and θ have joint density p(x, θ), with marginal densities p(x) and p(θ). Typically x is the observation vector and θ is an unknown parameter vector, which is estimated from the observations. By Bayes' theorem:

p(θ|x) = π(θ)p(x|θ) / p(x) = π(θ)p(x|θ) / ∫ π(θ)p(x|θ) dθ

where π(θ) is the prior distribution of θ.

The general Bayesian procedure for estimating an unknown parameter vector is as follows:

(1) Treat the unknown parameters as a random vector. This is the biggest difference between the Bayesian method and traditional parameter estimation methods.

(2) Determine the prior distribution π(θ) from previous knowledge about the parameters θ. This is the most controversial step of the Bayesian method and the one most attacked by the classical statistical community.

(3) Compute the posterior density and infer the unknown parameters from it.

In step (2), if there is no prior knowledge to help determine π(θ), Bayes proposed using the uniform distribution as the prior, i.e., the parameter takes every value in its range with equal probability. This assumption is called the Bayes hypothesis.
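The procedure above can be sketched numerically with a grid approximation. This is a minimal illustration, assuming a binomial observation model for x; the function name and the toy data (7 successes in 10 trials) are made up for the example:

```python
# Grid approximation of the posterior p(theta|x) under the Bayes hypothesis.
# Assumed model (illustrative): x ~ Binomial(n, theta), theta in [0, 1].

def grid_posterior(successes, trials, grid_size=101):
    """Return (theta_grid, posterior) using a uniform prior pi(theta)."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 / grid_size] * grid_size          # Bayes hypothesis: uniform
    likelihood = [t**successes * (1 - t)**(trials - successes) for t in thetas]
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    z = sum(unnorm)                                # discrete analogue of the integral
    return thetas, [u / z for u in unnorm]

thetas, post = grid_posterior(successes=7, trials=10)
# The posterior mass concentrates around theta = 0.7, the observed frequency.
```

Because the prior is flat, the posterior here is simply the normalized likelihood, which is exactly what the Bayes hypothesis implies.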

1.3 Application fields of Bayesian Networks

- Assisted intelligent decision making
- Data fusion
- Pattern recognition
- Medical diagnosis
- Text understanding
- Data mining: (1) classification and regression analysis; (2) causal reasoning and representation of uncertain knowledge; (3) clustering and pattern discovery

2. Foundations of Bayesian probability theory

2.1 Fundamentals of probability theory

2.2 Bayesian probability

(1) Prior probability:

(2) Posterior probability:

(3) Joint probability:

(4) Total probability formula: let B1, B2, ..., Bn be pairwise mutually exclusive events with P(Bi) > 0 for i = 1, 2, ..., n and B1 + B2 + ... + Bn = Ω. Then A = AB1 + AB2 + ... + ABn, so

P(A) = ∑_{i=1}^{n} P(Bi) P(A|Bi)

The total probability formula can be read as "inferring the result from its causes": each cause Bi contributes a certain "effect" to the outcome A, and the probability of the outcome depends on the size of each cause's effect. The formula expresses exactly this relationship.
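A small numeric sketch of the formula above. The numbers are made up for illustration: three mutually exclusive causes B1..B3 (say, three machines producing parts) and an outcome A ("part is defective"):

```python
# Law of total probability: P(A) = sum_i P(B_i) * P(A | B_i).
# All numbers below are illustrative toy values.

p_B = [0.5, 0.3, 0.2]             # prior probabilities of each cause, sum to 1
p_A_given_B = [0.01, 0.02, 0.05]  # "effect" of each cause on the outcome A

p_A = sum(pb * pa for pb, pa in zip(p_B, p_A_given_B))
# p_A = 0.5*0.01 + 0.3*0.02 + 0.2*0.05 = 0.021
```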

(5) Bayes formula: the Bayes formula is also called the posterior probability formula or the inverse probability formula, and it has a wide range of uses. Let the prior probabilities be P(Bi), and let the new information obtained from investigation be P(Aj|Bi), for i = 1, 2, ..., n and j = 1, 2, ..., m. Then the posterior probability given by the Bayes formula is:

P(Bi|Aj) = P(Bi) P(Aj|Bi) / ∑_{k=1}^{n} P(Bk) P(Aj|Bk)
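Continuing the toy numbers used for the total probability formula (three illustrative causes and a defect event A), the Bayes formula inverts the computation to find which cause most likely produced an observed defect:

```python
# Bayes formula: P(B_i | A) = P(B_i) P(A | B_i) / sum_k P(B_k) P(A | B_k).
# Toy values, for illustration only.

p_B = [0.5, 0.3, 0.2]
p_A_given_B = [0.01, 0.02, 0.05]

evidence = sum(pb * pa for pb, pa in zip(p_B, p_A_given_B))  # P(A) = 0.021
posterior = [pb * pa / evidence for pb, pa in zip(p_B, p_A_given_B)]
# posterior ~ [0.238, 0.286, 0.476]: the third cause is the most likely
# explanation of a defect, despite having the smallest prior probability.
```

This is the sense in which the Bayes formula "infers the cause from the result", reversing the direction of the total probability formula.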

* Any complete probabilistic model must be able to represent (directly or indirectly) the joint distribution of the variables in its domain. A full enumeration requires space exponential in the number of domain variables.

* A Bayesian network provides a compact representation of this joint distribution by decomposing it into a product of local distributions: P(x1, x2, ..., xn) = ∏_i P(xi | π(xi)), where π(xi) denotes the parents of xi in the network.

* As the formula shows, the number of parameters required grows linearly with the number of nodes in the network, whereas the full joint distribution table grows exponentially.

* Specifying the independence relations among the variables in the network is the key to achieving this compact representation. These independence relations are particularly effective when Bayesian networks are constructed by human experts.
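The factorization above can be made concrete with a tiny hand-built network. The structure and all probabilities are invented for illustration: a three-node binary chain A → B → C, which needs 1 + 2 + 2 = 5 free parameters instead of the 2³ − 1 = 7 of a full joint table:

```python
# Joint distribution of a 3-node chain A -> B -> C as a product of local CPTs.
# All conditional probability tables are illustrative toy values.

p_a = {0: 0.4, 1: 0.6}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}    # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}  # p_c_given_b[b][c]

def joint(a, b, c):
    """P(a, b, c) = P(a) * P(b | a) * P(c | b): the chain-rule factorization."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
# total == 1.0: the local tables jointly define a valid distribution.
```

The gap between 5 and 7 parameters is small here, but it widens exponentially as nodes are added, which is exactly the compactness argument in the bullets above.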

3. The simple Bayesian learning model

The simple Bayesian (naive Bayes) learning model decomposes a training example I into a feature vector X and a decision class variable C. The model assumes that the components of the feature vector are conditionally independent given the decision variable; in other words, each component acts on the decision variable independently. Although this assumption limits the applicability of the simple Bayesian model to some extent, in practice it reduces the complexity of constructing the Bayesian network from exponential to manageable levels, and in many domains where the assumption is violated, simple Bayes still shows considerable robustness and efficiency. It has been successfully applied to classification, clustering, model selection, and other important data mining tasks.

- The structure is simple: there are only two layers

- Inference complexity is linear in the number of network nodes

Suppose a sample A is represented as an attribute vector. If the attributes are independent given the class, then P(A|Ci) can be decomposed into a product of components:

P(A|Ci) = P(a1|Ci) * P(a2|Ci) * ... * P(am|Ci)

where ai is the i-th attribute of sample A. Then:

P(Ci|A) = (P(Ci) / P(A)) ∏_{j=1}^{m} P(aj|Ci)

This process is called simple Bayesian classification (SBC: Simple Bayesian Classifier). It is often said that SBC achieves optimal classification accuracy only when the independence assumption holds, or near-optimal classification when the attribute correlations are small.
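The classifier described above can be sketched from scratch. This is a minimal count-based version assuming categorical attributes; the dataset, attribute values, and function names are invented for illustration (note that P(A) cancels out when comparing classes, so it is omitted):

```python
# A from-scratch sketch of the Simple Bayesian Classifier (SBC) for
# categorical attributes. Toy data; no smoothing of zero counts.
from collections import Counter, defaultdict

def train_sbc(samples, labels):
    """Estimate P(C) and P(a_j | C) from counted frequencies; return a classifier."""
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)      # (class, attr index) -> value counts
    for attrs, c in zip(samples, labels):
        for j, v in enumerate(attrs):
            cond_counts[(c, j)][v] += 1
    n = len(labels)

    def classify(attrs):
        best_c, best_score = None, -1.0
        for c, cc in class_counts.items():
            score = cc / n                              # P(C)
            for j, v in enumerate(attrs):               # independence assumption:
                score *= cond_counts[(c, j)][v] / cc    # multiply the P(a_j | C)
            if score > best_score:
                best_c, best_score = c, score
        return best_c

    return classify

# Toy weather data: (outlook, windy) -> decision
X = [("sunny", "no"), ("sunny", "yes"), ("rain", "yes"), ("rain", "no")]
y = ["play", "stay", "stay", "play"]
classify = train_sbc(X, y)
```

A production implementation would add Laplace smoothing and work in log space to avoid zero probabilities and underflow, but the structure, prior times a product of per-attribute conditionals, is exactly the SBC formula above.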
