1, Basic content

Softmax converts the scores of a linear classifier into probability values and is used for multi-class classification. An SVM outputs raw score values, whereas Softmax outputs probabilities.

2, Sigmoid function

Expression (the output lies in the interval (0, 1)):

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Graph of the function:

The sigmoid function maps any real number to a value in the interval (0, 1), which can be read as a probability. Classification is then decided by the size of that probability.
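The mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the names `sigmoid` and `classify` and the 0.5 threshold are assumptions made for the example.

```python
import math

def sigmoid(x: float) -> float:
    """Map a real number to a value between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def classify(x: float, threshold: float = 0.5) -> int:
    """Return class 1 if the probability reaches the threshold, else class 0.

    The 0.5 threshold is an assumption for this sketch.
    """
    return 1 if sigmoid(x) >= threshold else 0

for score in (-4.0, 0.0, 4.0):  # hypothetical scores
    print(score, round(sigmoid(score), 4), classify(score))
```

Large negative scores land near 0, large positive scores near 1, and a score of 0 maps to exactly 0.5.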

3, Softmax output

Softmax function: its input is a vector of scores, each of which can be any real number. Its output is a vector in which every element lies between 0 and 1 and all elements sum to 1 (normalized class probabilities):

$$P(y = k) = \frac{e^{s_k}}{\sum_j e^{s_j}}$$

where $s_k$ is the score of class $k$.
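A minimal sketch of this normalization follows. The function name `softmax` and the input scores are hypothetical. Subtracting the maximum score before exponentiating is a common numerical-stability trick added here; it does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, -1.0])  # hypothetical scores
print(probs, sum(probs))
```

The ordering of the scores is preserved: the largest score receives the largest probability.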

Loss function: cross-entropy loss


Taking the cat classification above as an example again:
Exponentiation maps relatively large scores to even larger values and maps negative scores to very small positive values. $L_i$ is the loss value, computed from the probability assigned to the correct class: $L_i = -\log p_{y_i}$, where $p_{y_i}$ is the normalized probability of the correct class.
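The calculation can be traced with a short sketch. The three scores and the class order (cat, car, frog) are assumed values for illustration, not figures taken from the original example, and the natural logarithm is assumed.

```python
import math

def cross_entropy_loss(scores, correct_class):
    """L_i = -log(softmax probability of the correct class), natural log."""
    shift = max(scores)  # stabilise exp; does not change the probabilities
    exps = [math.exp(s - shift) for s in scores]
    prob_correct = exps[correct_class] / sum(exps)
    return -math.log(prob_correct)

scores = [3.2, 5.1, -1.7]  # hypothetical scores for (cat, car, frog)
loss = cross_entropy_loss(scores, correct_class=0)
print(round(loss, 4))
```

Because the wrong class "car" has the highest score here, the correct class gets a low probability and the loss is correspondingly large.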

4, Comparing the SVM and Softmax loss functions

With hinge loss, when the score of a wrong class is close to that of the correct class, the loss cannot accurately reflect how well the model performs (the loss is close to 0 even though the classification is poor), so this kind of loss function is not used here.
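The difference can be seen by evaluating both losses on the same scores. This is a sketch under assumptions: the margin of 1, the function names, and both score vectors are chosen for illustration.

```python
import math

def hinge_loss(scores, correct_class, margin=1.0):
    """Multiclass SVM loss: sum of max(0, s_j - s_correct + margin) over wrong classes."""
    correct = scores[correct_class]
    return sum(max(0.0, s - correct + margin)
               for j, s in enumerate(scores) if j != correct_class)

def softmax_loss(scores, correct_class):
    """Cross-entropy loss on softmax probabilities (natural log)."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    return -math.log(exps[correct_class] / sum(exps))

close = [10.0, 9.0, 9.0]  # hypothetical: wrong classes score almost as high
far = [10.0, 2.0, 2.0]    # hypothetical: wrong classes score much lower

for name, s in (("close", close), ("far", far)):
    print(name, hinge_loss(s, 0), round(softmax_loss(s, 0), 4))
```

With these values the hinge loss is 0 in both cases, while the softmax loss is clearly larger for the "close" scores, so it still distinguishes the marginal classification from the confident one.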

5, Optimization

The input data is combined with a set of weight parameters to produce a set of score values, from which the loss value is finally computed. This process is called forward propagation. Updating the weight parameters from the loss value is done by the back propagation algorithm.

5.1 Gradient descent (reach the lowest point as quickly as possible)

Gradient formula:
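A standard form, assumed here, is the limit definition of the derivative applied to each parameter in turn: df/dx = lim (f(x + h) - f(x)) / h as h approaches 0. The sketch below approximates that limit with a small finite h. The function `numerical_gradient` and the toy loss `f` are hypothetical.

```python
def numerical_gradient(f, x, h=1e-5):
    """Approximate the gradient of f at x (a list of floats) by finite differences."""
    base = f(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h  # nudge one coordinate at a time
        grad.append((f(shifted) - base) / h)
    return grad

def f(v):
    """Hypothetical loss: f(v) = v0^2 + 3*v1, whose exact gradient is [2*v0, 3]."""
    return v[0] ** 2 + 3.0 * v[1]

g = numerical_gradient(f, [2.0, -1.0])
print(g)
```

The result should be close to the exact gradient [4, 3]. In practice an analytic gradient is faster and more precise; the numerical one is mostly useful as a check.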

Gradient descent code implementation:
The batch size (the number of samples drawn from the original data for each step) is usually a power of 2 (32, 64, 128). Within the limits of what the computer can handle, larger is generally better. step_size is the learning rate (it should not be too large).
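A minimal sketch of mini-batch gradient descent, under stated assumptions: the model is a one-dimensional linear fit y = w * x + b with mean squared error, the data is synthetic, and the name `minibatch_gradient_descent` is hypothetical. Only `step_size` and the batch size come from the text above.

```python
import random

def minibatch_gradient_descent(data, step_size=0.1, batch_size=32, epochs=100, seed=0):
    """Fit y = w * x + b by mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not shuffled
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the batch mean of (w*x + b - y)^2 with respect to w and b.
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            # Step against the gradient, scaled by the learning rate.
            w -= step_size * grad_w
            b -= step_size * grad_b
    return w, b

xs = [i / 128.0 for i in range(256)]     # hypothetical inputs in [0, 2)
data = [(x, 3.0 * x + 1.0) for x in xs]  # noiseless targets with w = 3, b = 1
w, b = minibatch_gradient_descent(data)
print(round(w, 3), round(b, 3))
```

With a much larger `step_size` the updates overshoot and the loss diverges, which is why the learning rate should not be too large.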

Visualization of the loss value during network training:

The loss fluctuates locally, but the overall trend is downward, which shows that the network is learning. (An epoch means processing the whole dataset once; one iteration means processing a single batch of batch-size samples.)
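The relation between epochs and iterations is simple arithmetic, sketched here with hypothetical numbers for the dataset size, batch size, and epoch count.

```python
import math

def iterations_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once (the last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)

num_samples, batch_size, epochs = 50_000, 128, 10  # hypothetical values
per_epoch = iterations_per_epoch(num_samples, batch_size)
total = per_epoch * epochs
print(per_epoch, total)
```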

5.2 Back propagation
The figure above shows forward propagation. Going the other way, using L to update W, is called back propagation. An example follows:

Suppose there are three sample points x, y, and z, and a series of operations on them yields a loss value f. We now need to work out how much the weight parameter corresponding to each sample point contributes to f (that is, find the partial derivatives).

Chain rule:
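As a small worked instance of the chain rule, assume f(x, y, z) = (x + y) * z with the intermediate value q = x + y. The function and the input values are illustrative assumptions. Then df/dx = df/dq * dq/dx, and likewise for y.

```python
def forward_backward(x, y, z):
    """Forward and backward pass for the assumed function f = (x + y) * z."""
    # Forward pass: compute the intermediate value and the output.
    q = x + y
    f = q * z
    # Backward pass: local derivatives multiplied along the path (chain rule).
    df_dq = z             # f = q * z, so df/dq = z
    df_dz = q             # f = q * z, so df/dz = q
    df_dx = df_dq * 1.0   # q = x + y, so dq/dx = 1
    df_dy = df_dq * 1.0   # q = x + y, so dq/dy = 1
    return f, (df_dx, df_dy, df_dz)

f, grads = forward_backward(-2.0, 5.0, -4.0)  # hypothetical inputs
print(f, grads)
```

Each partial derivative says how much f changes when that input is nudged, which is the "contribution" the text refers to.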

The back propagation process for a more complex function is as follows:

Simplification:
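To illustrate what such a simplification can look like, assume the more complex function is a two-input sigmoid neuron, f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2))). The function, the names, and the input values are assumptions for this sketch. Back propagating through every elementary operation gives the same result as treating the whole sigmoid as one gate with local derivative (1 - f) * f.

```python
import math

def neuron_forward_backward(w, x):
    """Back propagate through an assumed sigmoid neuron in two equivalent ways."""
    a = w[0] * x[0] + w[1] * x[1] + w[2]
    # Forward pass, one elementary operation at a time.
    b = -a
    c = math.exp(b)
    d = 1.0 + c
    f = 1.0 / d
    # Backward pass, step by step through 1/d, +1, exp, and negation.
    df_dd = -1.0 / d ** 2
    df_dc = df_dd * 1.0
    df_db = df_dc * c
    df_da_steps = df_db * -1.0
    # Simplified: the whole sigmoid treated as a single gate.
    df_da_gate = (1.0 - f) * f
    grad_w = [x[0] * df_da_gate, x[1] * df_da_gate, df_da_gate]
    grad_x = [w[0] * df_da_gate, w[1] * df_da_gate]
    return f, df_da_steps, df_da_gate, grad_w, grad_x

# Hypothetical weights and inputs.
f, steps, gate, grad_w, grad_x = neuron_forward_backward([2.0, -3.0, -3.0], [-1.0, -2.0])
print(round(f, 4), round(steps, 4), round(gate, 4))
```

Grouping operations into a larger gate does not change the gradients; it only shortens the computation.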

Meaning of the gate units:
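A common way to read the gate units during back propagation: an add gate passes the upstream gradient to both inputs unchanged, a multiply gate gives each input the other input's value times the upstream gradient, and a max gate routes the whole gradient to the larger input. A minimal sketch, with hypothetical function names and values:

```python
def add_gate_backward(upstream):
    """d(x+y)/dx = d(x+y)/dy = 1: both inputs get the upstream gradient unchanged."""
    return upstream, upstream

def mul_gate_backward(x, y, upstream):
    """d(x*y)/dx = y and d(x*y)/dy = x: each input gets the other's value times upstream."""
    return y * upstream, x * upstream

def max_gate_backward(x, y, upstream):
    """Only the larger input influenced the output, so it receives all of the gradient."""
    return (upstream, 0.0) if x >= y else (0.0, upstream)

print(add_gate_backward(2.0))
print(mul_gate_backward(3.0, -4.0, 2.0))
print(max_gate_backward(4.0, -1.0, 2.0))
```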