I. Forward propagation structure of an RNN

Input at time t: $X_t$, $S_{t-1}$
Output at time t: $h_t$
Intermediate state at time t: $S_t$

The figure above shows an RNN unrolled over time; the network at the middle time step t reveals the internal structure of the RNN. As you can see, the internal structure of a basic RNN is very simple. The state of neuron A at time t is just the hyperbolic tangent of a weighted sum of the neuron state at time (t-1), $S_{t-1}$, and the network input at time t, $X_t$. This value is not only the output of the network at that time step; it is also passed on to the next time step as the network state. This process is called forward propagation of the RNN.

Mathematical formulas for forward propagation (including parameters)

The figure above shows the complete topology of an RNN together with the corresponding parameters. Below, we describe the behavior of the network at time t mathematically. Two kinds of state appear: the linear state and the activated state. The linear state is marked with a superscript $*$.
Neuron state at time t:
$$S_t^* = W S_{t-1} + U X_t$$
$$S_t = \phi(S_t^*)$$
Output state at time t:
$$O_t^* = V S_t$$
$$O_t = \psi(O_t^*)$$
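The forward step can be sketched in a few lines. This is a minimal scalar version, not the author's implementation: it assumes the linear state has the form $W S_{t-1} + U X_t$ (the form that appears in the BPTT derivation), treats `u`, `w`, `v` as scalars, and uses a hypothetical function name `rnn_forward` and made-up input values.

```python
import math

def rnn_forward(xs, u, w, v, s0=0.0):
    """Scalar RNN forward pass; returns activated states and linear outputs.

    Assumes S_t^* = w * S_{t-1} + u * X_t, S_t = tanh(S_t^*), O_t^* = v * S_t.
    """
    s_prev = s0
    states, outputs = [], []
    for x_t in xs:
        s_star = w * s_prev + u * x_t   # linear state S_t^*
        s_t = math.tanh(s_star)         # activated state S_t
        o_star = v * s_t                # linear output O_t^*
        states.append(s_t)
        outputs.append(o_star)
        s_prev = s_t                    # the state is carried to the next time step
    return states, outputs

# Illustrative values only.
states, outputs = rnn_forward([0.5, -1.0, 0.3], u=0.7, w=-0.4, v=1.2)
```

Note that the same `u`, `w`, `v` are reused at every time step, which is what "globally shared parameters" means here.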
How do we obtain the specific values of the three globally shared parameters U, V and W in the RNN model? They are obtained through back propagation, described next.

II. BPTT (back propagation through time)

1. Choice of loss function. Cross entropy is generally chosen for RNNs. Its expression is:
$$Loss = -\sum_{i=0}^{n} y_i \ln y_i^*$$
The formula above is the scalar form of cross entropy, where $y_i$ is the true label and $y_i^*$ is the value predicted by the model. For a multidimensional output, the loss is accumulated over the n dimensions. Applying cross entropy to an RNN requires two adjustments. First, the RNN output is a vector, so there is no need to add up all the dimensions; the loss can be expressed directly as a vector. Second, because the RNN models a sequence, the model loss cannot be the loss at a single time step; it must include the losses at all N time steps.
Therefore, the loss function of the RNN model at time t is:
$$Loss_t = -[y_t \ln(O_t) + (1-y_t)\ln(1-O_t)]$$
The loss function over all N time steps (the global loss) is:
$$Loss = \sum_{t=1}^{N} Loss_t = -\sum_{t=1}^{N}[y_t \ln(O_t) + (1-y_t)\ln(1-O_t)]$$
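As a small illustration, the per-step and global losses can be computed as below. This is a sketch under assumptions: outputs are scalars strictly between 0 and 1, the second term uses the standard binary cross-entropy weight $(1 - y_t)$, and the names `step_loss` and `global_loss` and the sample values are hypothetical.

```python
import math

def step_loss(y_t, o_t):
    """Cross-entropy loss at a single time step (o_t must lie in (0, 1))."""
    return -(y_t * math.log(o_t) + (1.0 - y_t) * math.log(1.0 - o_t))

def global_loss(ys, outs):
    """Global loss: the sum of the per-step losses over all N time steps."""
    return sum(step_loss(y_t, o_t) for y_t, o_t in zip(ys, outs))

# Illustrative labels and predicted outputs for N = 3 time steps.
ys = [1.0, 0.0, 1.0]
outs = [0.9, 0.2, 0.6]
total = global_loss(ys, outs)
```

A confident correct prediction (for example `o_t = 0.9` with `y_t = 1`) contributes a small loss, while an uncertain one (`o_t = 0.6`) contributes more.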

2. Derivative of the softmax function (denoted $\psi$ below):
$$\psi'(x) = \psi(x)(1-\psi(x))$$
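The derivation later uses $\psi'(O_t^*) = O_t(1-O_t)$. The sketch below checks numerically that this matches the derivative of each softmax output with respect to its own input (the diagonal of the softmax Jacobian). The cross terms between different components are not covered by this check. Function names and the test vector are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_diag_derivative(z):
    """Analytic d softmax_i / d z_i = p_i * (1 - p_i)."""
    return [p * (1.0 - p) for p in softmax(z)]

def numeric_diag_derivative(z, h=1e-6):
    """Central finite difference of softmax_i with respect to z_i."""
    result = []
    for i in range(len(z)):
        up = list(z)
        up[i] += h
        down = list(z)
        down[i] -= h
        result.append((softmax(up)[i] - softmax(down)[i]) / (2 * h))
    return result

z = [0.5, -1.0, 2.0]
analytic = softmax_diag_derivative(z)
numeric = numeric_diag_derivative(z)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
```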

3. Derivative of the activation function (with tanh(x) chosen as the activation function):
$$\phi(x) = \tanh(x)$$
$$\phi'(x) = 1 - \phi^2(x)$$
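The derivative of tanh used in the later steps is $1 - \tanh^2(x)$. A quick numerical check, with a hypothetical helper name and arbitrary sample points:

```python
import math

def tanh_prime(x):
    """Analytic derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

h = 1e-6
points = [-3.0, -1.0, 0.0, 0.5, 2.0]
# Largest gap between the analytic derivative and a central finite difference.
max_err = max(
    abs(tanh_prime(x) - (math.tanh(x + h) - math.tanh(x - h)) / (2 * h))
    for x in points
)
```

Because the derivative depends only on the activated value, the backward pass can reuse the stored state $S_t$ instead of recomputing tanh.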

4. BPTT algorithm
Note: because the RNN model depends on the time sequence, it uses Back Propagation Through Time (BPTT), which still follows the chain rule. Although the global loss of the RNN is the sum over N time steps, each derivation below involves only a single time step t.
(1) Derivative of the loss at time t with respect to $O_t^*$:
$$\frac{\partial L_t}{\partial O_t^*} = \frac{\partial L_t}{\partial O_t} * \frac{\partial O_t}{\partial O_t^*} = \frac{\partial L_t}{\partial O_t} * \frac{\partial \psi(O_t^*)}{\partial O_t^*} = \frac{\partial L_t}{\partial O_t} * \psi'(O_t^*)$$
(2) Derivative of the loss at time t with respect to the parameter V (using the result of (1)):
$$\frac{\partial L_t}{\partial V} = \frac{\partial L_t}{\partial (VS_t)} * \frac{\partial (VS_t)}{\partial V} = \frac{\partial L_t}{\partial O_t^*} * S_t = \frac{\partial L_t}{\partial O_t} * \psi'(O_t^*) * S_t$$
Therefore, the global derivative with respect to the parameter V is:
$$\frac{\partial L}{\partial V} = \sum_{t=1}^{N}\frac{\partial L_t}{\partial V} = \sum_{t=1}^{N}\frac{\partial L_t}{\partial O_t} * \psi'(O_t^*) * S_t$$
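The formula for the gradient with respect to V can be checked against a finite difference. The sketch below makes several assumptions that are not stated in the text: a scalar RNN, an elementwise sigmoid standing in for $\psi$ (its derivative is $O_t(1-O_t)$), the standard $(1-y_t)$ cross-entropy term, a zero initial state, and made-up data. All function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(xs, u, w, v):
    """Return activated states S_t and outputs O_t for every time step."""
    s_prev, states, outs = 0.0, [], []
    for x_t in xs:
        s_prev = math.tanh(w * s_prev + u * x_t)
        states.append(s_prev)
        outs.append(sigmoid(v * s_prev))
    return states, outs

def total_loss(xs, ys, u, w, v):
    _, outs = forward(xs, u, w, v)
    return -sum(y * math.log(o) + (1 - y) * math.log(1 - o)
                for y, o in zip(ys, outs))

def grad_v(xs, ys, u, w, v):
    """dL/dV = sum_t dL_t/dO_t * psi'(O_t^*) * S_t."""
    states, outs = forward(xs, u, w, v)
    g = 0.0
    for s_t, o_t, y_t in zip(states, outs, ys):
        dL_dO = -(y_t - o_t) / (o_t * (1 - o_t))   # derivative of cross entropy
        psi_prime = o_t * (1 - o_t)                # derivative of the output function
        g += dL_dO * psi_prime * s_t
    return g

xs, ys = [0.5, -1.0, 0.3, 0.8], [1.0, 0.0, 0.0, 1.0]
u, w, v, h = 0.7, -0.4, 1.2, 1e-6
analytic = grad_v(xs, ys, u, w, v)
numeric = (total_loss(xs, ys, u, w, v + h) - total_loss(xs, ys, u, w, v - h)) / (2 * h)
```

The gradient with respect to V needs no backtracking through time: each time step contributes independently.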
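(3) Derivative of the loss at time t with respect to $S_t^*$:
$$\frac{\partial L_t}{\partial S_t^*} = \frac{\partial L_t}{\partial (VS_t)} * \frac{\partial (VS_t)}{\partial S_t} * \frac{\partial S_t}{\partial S_t^*} = \frac{\partial L_t}{\partial O_t^*} * V * \phi'(S_t^*) = \frac{\partial L_t}{\partial O_t} * \psi'(O_t^*) * V * \phi'(S_t^*)$$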
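(4) Derivative of the loss at time t with respect to $S_{t-1}^*$:
$$\frac{\partial L_t}{\partial S_{t-1}^*} = \frac{\partial L_t}{\partial S_t^*} * \frac{\partial S_t^*}{\partial S_{t-1}^*} = \frac{\partial L_t}{\partial S_t^*} * \frac{\partial [W\phi(S_{t-1}^*) + UX_t]}{\partial S_{t-1}^*} = \frac{\partial L_t}{\partial S_t^*} * W\phi'(S_{t-1}^*)$$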
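Steps (3) and (4) together give a backward recursion over the linear states. The sketch below computes $\partial L_t / \partial S_k^*$ for every $k \le t$ and compares it with a finite difference obtained by nudging the linear state at step k. It assumes a scalar RNN, a sigmoid output, the standard $(1-y_t)$ cross-entropy term, and the simplification $\partial L_t/\partial O_t * \psi'(O_t^*) = O_t - y_t$. Names and data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def final_step_loss(xs, u, w, v, y_t, bump_step=None, bump=0.0):
    """Scalar RNN run over xs; returns (states, O_t, L_t) for the last step t.

    If bump_step = k is given, `bump` is added to the linear state S_k^*,
    which lets us probe dL_t/dS_k^* by finite differences.
    """
    s_prev, states = 0.0, []
    for k, x_k in enumerate(xs, start=1):
        s_star = w * s_prev + u * x_k
        if k == bump_step:
            s_star += bump
        s_prev = math.tanh(s_star)
        states.append(s_prev)
    o_t = sigmoid(v * s_prev)
    loss_t = -(y_t * math.log(o_t) + (1 - y_t) * math.log(1 - o_t))
    return states, o_t, loss_t

def state_deltas(xs, u, w, v, y_t):
    """delta[k-1] = dL_t/dS_k^* for k = 1..t, computed backwards in time."""
    states, o_t, _ = final_step_loss(xs, u, w, v, y_t)
    t = len(xs)
    delta = [0.0] * t
    delta[t - 1] = v * (o_t - y_t) * (1 - states[t - 1] ** 2)    # step (3)
    for k in range(t - 1, 0, -1):                                # time k = t-1 .. 1
        delta[k - 1] = delta[k] * w * (1 - states[k - 1] ** 2)   # step (4)
    return delta

xs, u, w, v, y_t, h = [0.5, -1.0, 0.3, 0.8], 0.7, -0.4, 1.2, 1.0, 1e-6
analytic = state_deltas(xs, u, w, v, y_t)
numeric = [
    (final_step_loss(xs, u, w, v, y_t, k, h)[2]
     - final_step_loss(xs, u, w, v, y_t, k, -h)[2]) / (2 * h)
    for k in range(1, len(xs) + 1)
]
```

Each step back in time multiplies the derivative by `w * (1 - s**2)`, so the influence of distant time steps shrinks or grows geometrically with that factor.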
(5) Partial derivative of the loss at time t with respect to the parameter U
Note: because this is a time-series model, the derivative with respect to U at time t depends on every earlier time step. In practice, backtracking can be limited to the most recent n time steps, but the derivation below includes all of them.
$$\frac{\partial L_t}{\partial U} = \sum_{k=1}^{t}\frac{\partial L_t}{\partial S_k^*}\frac{\partial S_k^*}{\partial U} = \sum_{k=1}^{t}\frac{\partial L_t}{\partial S_k^*}\frac{\partial (WS_{k-1}+UX_k)}{\partial U} = \sum_{k=1}^{t}\frac{\partial L_t}{\partial S_k^*} * X_k$$
Therefore, the partial derivative of the global loss with respect to U is:
$$\frac{\partial L}{\partial U} = \sum_{t=1}^{N}\frac{\partial L_t}{\partial U} = \sum_{t=1}^{N}\sum_{k=1}^{t}\frac{\partial L_t}{\partial S_k^*}\frac{\partial S_k^*}{\partial U} = \sum_{t=1}^{N}\sum_{k=1}^{t}\frac{\partial L_t}{\partial S_k^*} * X_k$$
(6) Partial derivative of the loss at time t with respect to the parameter W (same approach as above)
$$\frac{\partial L_t}{\partial W} = \sum_{k=1}^{t}\frac{\partial L_t}{\partial S_k^*}\frac{\partial S_k^*}{\partial W} = \sum_{k=1}^{t}\frac{\partial L_t}{\partial S_k^*}\frac{\partial (WS_{k-1}+UX_k)}{\partial W} = \sum_{k=1}^{t}\frac{\partial L_t}{\partial S_k^*} * S_{k-1}$$
Therefore, the partial derivative of the global loss with respect to W is:
$$\frac{\partial L}{\partial W} = \sum_{t=1}^{N}\frac{\partial L_t}{\partial W} = \sum_{t=1}^{N}\sum_{k=1}^{t}\frac{\partial L_t}{\partial S_k^*}\frac{\partial S_k^*}{\partial W} = \sum_{t=1}^{N}\sum_{k=1}^{t}\frac{\partial L_t}{\partial S_k^*} * S_{k-1}$$
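The double sums for U and W translate directly into two nested loops. The following is a minimal scalar sketch, not an optimized implementation. It assumes a sigmoid output, the standard $(1-y_t)$ cross-entropy term, $S_0 = 0$, and the simplification $\partial L_t/\partial O_t * \psi'(O_t^*) = O_t - y_t$. The result is compared with finite differences of the global loss. Names and data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(xs, u, w, v):
    s_prev, states, outs = 0.0, [], []
    for x_t in xs:
        s_prev = math.tanh(w * s_prev + u * x_t)
        states.append(s_prev)
        outs.append(sigmoid(v * s_prev))
    return states, outs

def total_loss(xs, ys, u, w, v):
    _, outs = forward(xs, u, w, v)
    return -sum(y * math.log(o) + (1 - y) * math.log(1 - o)
                for y, o in zip(ys, outs))

def bptt_u_w(xs, ys, u, w, v):
    """Global gradients dL/dU and dL/dW as sums over t and over k <= t."""
    states, outs = forward(xs, u, w, v)
    s = [0.0] + states            # s[k] is S_k, with S_0 = 0
    du = dw = 0.0
    for t in range(1, len(xs) + 1):
        # dL_t/dS_t^* from step (3)
        delta = v * (outs[t - 1] - ys[t - 1]) * (1 - s[t] ** 2)
        for k in range(t, 0, -1):
            du += delta * xs[k - 1]                   # dS_k^*/dU = X_k
            dw += delta * s[k - 1]                    # dS_k^*/dW = S_{k-1}
            delta = delta * w * (1 - s[k - 1] ** 2)   # step (4): move to S_{k-1}^*
    return du, dw

xs, ys = [0.5, -1.0, 0.3, 0.8], [1.0, 0.0, 0.0, 1.0]
u, w, v, h = 0.7, -0.4, 1.2, 1e-6
du, dw = bptt_u_w(xs, ys, u, w, v)
du_numeric = (total_loss(xs, ys, u + h, w, v) - total_loss(xs, ys, u - h, w, v)) / (2 * h)
dw_numeric = (total_loss(xs, ys, u, w + h, v) - total_loss(xs, ys, u, w - h, v)) / (2 * h)
```

Limiting the inner loop to the most recent n values of k would give the truncated variant mentioned in the note of step (5).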
(7) Since the output layer is usually a softmax function, $O_t$ is obtained by applying softmax to $O_t^*$. With cross entropy as the loss function, the partial derivative with respect to $O_t$ is:
$$\frac{\partial L_t}{\partial O_t} = -\frac{\partial [y_t\ln(O_t) + (1-y_t)\ln(1-O_t)]}{\partial O_t} = -\left(\frac{y_t}{O_t} - \frac{1-y_t}{1-O_t}\right) = -\frac{y_t-O_t}{O_t(1-O_t)}$$
Substituting this into the earlier results gives:
$$\frac{\partial L_t}{\partial O_t} * \psi'(O_t^*) = -\frac{y_t-O_t}{O_t(1-O_t)} * O_t(1-O_t) = O_t - y_t$$
$$\frac{\partial L_t}{\partial S_t^*} = \frac{\partial L_t}{\partial O_t} * \psi'(O_t^*) * V * \phi'(S_t^*) = [V*(O_t-y_t)]*[1-\phi^2(S_t^*)] = [V*(O_t-y_t)]*[1-S_t^2]$$
$$\frac{\partial L_t}{\partial S_{t-1}^*} = \frac{\partial L_t}{\partial S_t^*} * W\phi'(S_{t-1}^*) = [V*(O_t-y_t)]*[1-S_t^2]*W*[1-S_{t-1}^2]$$
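The simplification to $O_t - y_t$ is easy to confirm numerically. The sketch below assumes a scalar sigmoid output and the standard $(1-y_t)$ cross-entropy term; the helper name and sample values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_from_linear_output(o_star, y):
    """Cross entropy as a function of the linear output O_t^*."""
    o = sigmoid(o_star)
    return -(y * math.log(o) + (1 - y) * math.log(1 - o))

h = 1e-6
cases = [(-2.0, 0.0), (-0.5, 1.0), (0.0, 1.0), (1.5, 0.0)]   # (O_t^*, y_t) pairs
# Analytic result from the text: O_t - y_t
analytic = [sigmoid(o_star) - y for o_star, y in cases]
# Central finite difference of the loss with respect to O_t^*
numeric = [
    (loss_from_linear_output(o_star + h, y) - loss_from_linear_output(o_star - h, y)) / (2 * h)
    for o_star, y in cases
]
```

The factor $O_t(1-O_t)$ cancels exactly, so the backward pass never has to divide by a value that may be close to zero.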

In summary, the remaining derivatives are obtained in a similar way.
(8) We update the three parameters V, U and W step by step until they converge:
$$V := V - \eta * \frac{\partial L}{\partial V}$$
$$U := U - \eta * \frac{\partial L}{\partial U}$$
$$W := W - \eta * \frac{\partial L}{\partial W}$$
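The update rule with a simple stopping test can be sketched as follows. To keep the example short and certain to converge, it uses a quadratic stand-in objective rather than an RNN loss. The function names, the learning rate, the tolerance, and the target values are all assumptions made for illustration.

```python
def train(params, loss_and_grads, eta=0.1, tol=1e-6, max_iters=1000):
    """Apply P := P - eta * dL/dP to every parameter until all gradients are below tol."""
    for iteration in range(max_iters):
        loss, grads = loss_and_grads(params)
        if max(abs(g) for g in grads.values()) < tol:
            break   # treated as converged
        params = {name: params[name] - eta * grads[name] for name in params}
    return params, loss, iteration

# Stand-in objective (NOT an RNN loss): 0.5 * (P - target)^2 for each parameter.
targets = {"V": 1.0, "U": -2.0, "W": 3.0}

def quadratic_loss_and_grads(params):
    grads = {name: params[name] - targets[name] for name in params}
    loss = 0.5 * sum(g * g for g in grads.values())
    return loss, grads

start = {"V": 0.0, "U": 0.0, "W": 0.0}
final, final_loss, iterations = train(start, quadratic_loss_and_grads)
```

For an actual RNN, `loss_and_grads` would run the forward pass and BPTT to return the global loss and the three gradients.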