I. RNN forward propagation structure

Input at time t: X_t, S_{t-1}
Output at time t: h_t
Intermediate state at time t: S_t

The figure above shows the RNN unrolled as a time series; the network at time t reveals the RNN structure. As you can see, the internal structure of the original RNN is very simple: the state of neuron A at time t is just the hyperbolic tangent of the neuron state at time (t-1) combined with the network input X_t at time t. This value is not only the output of the network at that moment; it is also passed, as the network state of that moment, into the network at the next time step. This process is called the forward propagation of the RNN.

Mathematical formulas in forward propagation (including parameters)

The figure above shows the complete topology of the RNN network, along with the corresponding parameters of the network. We now derive the behavior of the network at time t mathematically. In what follows there are two kinds of quantities, the linear (pre-activation) state and the activated state; linear states are marked with an asterisk (*).
State of the neuron at time t:
S_t = \phi(S_t^*)
S_t^* = U X_t + W S_{t-1}
Output state at time t:
O_t = \psi(O_t^*)
O_t^* = V S_t
How do we obtain the concrete values of the three globally shared parameters U, V, and W in the RNN model? They are learned through back propagation, described next; before that, the sketch below illustrates the forward pass just derived.
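A minimal NumPy sketch of one forward step and its unrolling, assuming tanh for \phi and softmax for \psi; the function names (rnn_step, softmax) and the dimensions are illustrative, not taken from the original text.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, s_prev, U, W, V):
    s_star = U @ x_t + W @ s_prev        # linear state S_t^* = U X_t + W S_{t-1}
    s_t = np.tanh(s_star)                # activated state S_t = phi(S_t^*)
    o_t = softmax(V @ s_t)               # output O_t = psi(O_t^*), O_t^* = V S_t
    return s_t, o_t

# Unroll over a short toy sequence: each state feeds the next step.
d_in, d_hid, d_out = 4, 8, 3
rng = np.random.default_rng(0)
U = rng.normal(0.0, 0.1, (d_hid, d_in))
W = rng.normal(0.0, 0.1, (d_hid, d_hid))
V = rng.normal(0.0, 0.1, (d_out, d_hid))
s = np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):     # N = 5 time steps
    s, o = rnn_step(x, s, U, W, V)
```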

II. BPTT (back propagation through time)

1. Choice of the loss function. For RNNs the cross entropy (Cross Entropy) is generally chosen; its expression is as follows:
Loss = -\sum_{i=0}^{n} y_i \ln y_i^*
The formula above is the scalar form of cross entropy, where y_i is the true label and y_i^* is the prediction given by the model; when the output is multidimensional, the loss can be accumulated over the n dimensions. Applying cross entropy to an RNN requires two adjustments: first, the output of the RNN is a vector, so there is no need to accumulate over the dimensions one by one; the loss can be expressed directly as a vector. Second, because the RNN models a sequence, the model loss cannot be the loss of a single time step only; it must cover all N time steps.
Therefore, the loss function of the RNN model at time t is as follows:
Loss_t = -[y_t \ln(O_t) + (1 - y_t)\ln(1 - O_t)]
The loss over all N time steps (the global loss) is expressed as follows:
Loss = \sum_{t=1}^{N} Loss_t = -\sum_{t=1}^{N} [y_t \ln(O_t) + (1 - y_t)\ln(1 - O_t)]

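As a sketch (names illustrative), the per-step loss and the global loss just defined can be computed as:

```python
import numpy as np

def step_loss(y_t, o_t):
    # Loss_t = -[ y_t ln(O_t) + (1 - y_t) ln(1 - O_t) ]
    return -(y_t * np.log(o_t) + (1.0 - y_t) * np.log(1.0 - o_t))

def global_loss(y, o):
    # Loss = sum of Loss_t over the N time steps
    return sum(step_loss(y_t, o_t) for y_t, o_t in zip(y, o))

y = np.array([1.0, 0.0, 1.0])            # true labels over N = 3 steps
o = np.array([0.9, 0.2, 0.7])            # model outputs O_t
print(global_loss(y, o))
```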
2. The derivative formula of the softmax function (denoted \psi below) is:
\psi'(x) = \psi(x)(1 - \psi(x))
(Strictly speaking, this element-wise form is the diagonal of the softmax Jacobian and coincides with the derivative of the logistic sigmoid; the derivation below applies it element by element.)

3. The derivative formula of the activation function (choosing tanh(x) as the activation function) is:
\phi(x) = \tanh(x)
\phi'(x) = 1 - \phi^2(x)

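A quick numerical check of the two derivative formulas above via finite differences, with the logistic sigmoid standing in for the element-wise softmax form; this snippet is illustrative, not part of the original derivation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, eps = 0.3, 1e-6
num_psi = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
num_phi = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
assert np.isclose(num_psi, sigmoid(x) * (1 - sigmoid(x)))  # psi' = psi(1 - psi)
assert np.isclose(num_phi, 1 - np.tanh(x) ** 2)            # phi' = 1 - phi^2
```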
4. The BPTT algorithm
Note: because the RNN model deals with time series, Back Propagation Through Time (time-dependent back propagation) is used, but it still follows the chain rule of differentiation. Although the global loss is the sum over all N time steps, the derivation below involves only a single time t.
(1) Differential of the loss function at time t with respect to O_t^*:
\frac{\partial L_t}{\partial O_t^*} = \frac{\partial L_t}{\partial O_t} * \frac{\partial O_t}{\partial O_t^*} = \frac{\partial L_t}{\partial O_t} * \frac{\partial \psi(O_t^*)}{\partial O_t^*} = \frac{\partial L_t}{\partial O_t} * \psi'(O_t^*)
(2) Differential of the loss function with respect to the parameter V (using the conclusion of (1)):
\frac{\partial L_t}{\partial V} = \frac{\partial L_t}{\partial (VS_t)} * \frac{\partial (VS_t)}{\partial V} = \frac{\partial L_t}{\partial O_t^*} * S_t = \frac{\partial L_t}{\partial O_t} * \psi'(O_t^*) * S_t
Therefore, the differential of the global loss with respect to the parameter V is:
\frac{\partial L}{\partial V} = \sum_{t=1}^{N} \frac{\partial L_t}{\partial V} = \sum_{t=1}^{N} \frac{\partial L_t}{\partial O_t} * \psi'(O_t^*) * S_t
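In vector form this sum is an accumulation of outer products, since O_t^* = V S_t is a matrix-vector product. A minimal sketch, assuming the per-step factor \frac{\partial L_t}{\partial O_t} * \psi'(O_t^*) is available as deltas_o (step (7) below shows it reduces to O_t - y_t for softmax with cross entropy):

```python
import numpy as np

def grad_V(deltas_o, states):
    # deltas_o[t] stands for (dL_t/dO_t) * psi'(O_t^*); states[t] is S_t
    return sum(np.outer(d, s) for d, s in zip(deltas_o, states))
```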
(3) Differential of the loss function at time t with respect to S_t^*:
\frac{\partial L_t}{\partial S_t^*} = \frac{\partial L_t}{\partial (VS_t)} * \frac{\partial (VS_t)}{\partial S_t} * \frac{\partial S_t}{\partial S_t^*} = \frac{\partial L_t}{\partial O_t^*} * V * \phi'(S_t^*) = \frac{\partial L_t}{\partial O_t} * \psi'(O_t^*) * V * \phi'(S_t^*)
(4) Differential of the loss function at time t with respect to S_{t-1}^*:
\frac{\partial L_t}{\partial S_{t-1}^*} = \frac{\partial L_t}{\partial S_t^*} * \frac{\partial S_t^*}{\partial S_{t-1}^*} = \frac{\partial L_t}{\partial S_t^*} * \frac{\partial [W\phi(S_{t-1}^*) + U X_t]}{\partial S_{t-1}^*} = \frac{\partial L_t}{\partial S_t^*} * W * \phi'(S_{t-1}^*)
(5) Partial differential at time t with respect to the parameter U.
Note: because this is a time-series model, the differential at time t with respect to U is related to all of the preceding (t-1) time steps. In a concrete computation the backtracking can be truncated to the most recent n steps, but in the derivation all preceding steps are included.
\frac{\partial L_t}{\partial U} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial S_k^*} \frac{\partial S_k^*}{\partial U} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial S_k^*} \frac{\partial (W S_{k-1} + U X_k)}{\partial U} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial S_k^*} * X_k
Therefore, the partial differential of the global loss with respect to U is:
\frac{\partial L}{\partial U} = \sum_{t=1}^{N} \frac{\partial L_t}{\partial U} = \sum_{t=1}^{N} \sum_{k=1}^{t} \frac{\partial L_t}{\partial S_k^*} \frac{\partial S_k^*}{\partial U} = \sum_{t=1}^{N} \sum_{k=1}^{t} \frac{\partial L_t}{\partial S_k^*} * X_k
(6) Partial differential at time t with respect to the parameter W (as above):
\frac{\partial L_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial S_k^*} \frac{\partial S_k^*}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial S_k^*} \frac{\partial (W S_{k-1} + U X_k)}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial S_k^*} * S_{k-1}
Therefore, the partial differential of the global loss with respect to W is:
\frac{\partial L}{\partial W} = \sum_{t=1}^{N} \frac{\partial L_t}{\partial W} = \sum_{t=1}^{N} \sum_{k=1}^{t} \frac{\partial L_t}{\partial S_k^*} \frac{\partial S_k^*}{\partial W} = \sum_{t=1}^{N} \sum_{k=1}^{t} \frac{\partial L_t}{\partial S_k^*} * S_{k-1}
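The inner sums of steps (5) and (6) for a single time t can be sketched as follows, including the optional truncation to the most recent n steps mentioned in the note to step (5). The backward recursion implements step (4) in vector form (hence W.T), with \phi = tanh so that \phi'(S_{k-1}^*) = 1 - S_{k-1}^2; the function name and argument layout are assumptions made for illustration.

```python
import numpy as np

def grads_U_W_at_t(delta_t, X, S, s0, W, n=None):
    # X[k-1] = X_k and S[k-1] = S_k for k = 1..t; s0 is the initial state;
    # delta_t is dL_t/dS_t^*; n limits the backtracking to the last n steps.
    t = len(X)
    k_min = 1 if n is None else max(1, t - n + 1)
    dU = np.zeros((W.shape[0], X[0].shape[0]))
    dW = np.zeros_like(W)
    delta = delta_t
    for k in range(t, k_min - 1, -1):                # k = t, ..., k_min
        s_prev = S[k - 2] if k >= 2 else s0          # S_{k-1}
        dU += np.outer(delta, X[k - 1])              # + delta_k * X_k
        dW += np.outer(delta, s_prev)                # + delta_k * S_{k-1}
        delta = (W.T @ delta) * (1 - s_prev ** 2)    # step (4) recursion
    return dU, dW
```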
(7) Since the output layer is usually a softmax function, applying softmax to O_t^* and differentiating gives
\psi'(O_t^*) = O_t(1 - O_t)
Differentiating the loss with respect to O_t (with cross entropy as the loss function) then gives the partial derivative:
\frac{\partial L_t}{\partial O_t} = \frac{-\partial [y_t \ln(O_t) + (1 - y_t)\ln(1 - O_t)]}{\partial O_t} = -\left(\frac{y_t}{O_t} - \frac{1 - y_t}{1 - O_t}\right) = -\frac{y_t - O_t}{O_t(1 - O_t)}

\frac{\partial L_t}{\partial O_t} * \psi'(O_t^*) = -\frac{y_t - O_t}{O_t(1 - O_t)} * O_t(1 - O_t) = O_t - y_t
\frac{\partial L_t}{\partial S_t^*} = \frac{\partial L_t}{\partial O_t} * \psi'(O_t^*) * V * \phi'(S_t^*) = [V * (O_t - y_t)] * [1 - \phi^2(S_t^*)] = [V * (O_t - y_t)] * [1 - S_t^2]
\frac{\partial L_t}{\partial S_{t-1}^*} = \frac{\partial L_t}{\partial S_t^*} * W * \phi'(S_{t-1}^*) = \frac{\partial L_t}{\partial S_t^*} * W * [1 - S_{t-1}^2]

In summary:
\frac{\partial L}{\partial V} = \sum_{t=1}^{N} \frac{\partial L_t}{\partial V} = \sum_{t=1}^{N} (O_t - y_t) * S_t
The gradients with respect to U and W are simplified in the same way; a sketch putting all three together follows.
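An end-to-end sketch of BPTT with the simplified gradients, assuming softmax + cross entropy and \phi = tanh, and reusing the softmax helper from the forward-pass sketch above; the vectorized conventions (outer products, transposes) and all names are assumptions, not the original author's code.

```python
import numpy as np

def bptt_grads(X, Y, U, W, V, s0):
    # Forward pass, storing every state S_t and output O_t (0-based time).
    states, outputs, s = [], [], s0
    for x in X:
        s = np.tanh(U @ x + W @ s)           # S_t = tanh(U X_t + W S_{t-1})
        states.append(s)
        outputs.append(softmax(V @ s))       # O_t = softmax(V S_t)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    for t in range(len(X)):
        delta_o = outputs[t] - Y[t]          # dL_t/dO_t^* = O_t - y_t
        dV += np.outer(delta_o, states[t])   # dL/dV = sum (O_t - y_t) * S_t
        delta = (V.T @ delta_o) * (1 - states[t] ** 2)   # dL_t/dS_t^*
        for k in range(t, -1, -1):           # backtrack over k = t..0
            s_prev = states[k - 1] if k > 0 else s0
            dU += np.outer(delta, X[k])                  # + delta_k * X_k
            dW += np.outer(delta, s_prev)                # + delta_k * S_{k-1}
            delta = (W.T @ delta) * (1 - s_prev ** 2)    # step (4) recursion
    return dU, dW, dV
```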
(8) We update the three parameters V, U, and W step by step until they converge (\eta is the learning rate):
V := V - \eta * \frac{\partial L}{\partial V}
U := U - \eta * \frac{\partial L}{\partial U}
W := W - \eta * \frac{\partial L}{\partial W}
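A minimal update loop, reusing bptt_grads, numpy as np, and the toy parameters (rng, d_in, d_hid, d_out, U, W, V) from the sketches above; the data X, Y and the number of iterations are placeholders.

```python
eta = 0.01                                    # learning rate
X = rng.normal(size=(5, d_in))                # toy inputs X_1..X_N
Y = np.eye(d_out)[rng.integers(0, d_out, 5)]  # toy one-hot labels y_1..y_N
s0 = np.zeros(d_hid)
for step in range(100):                       # iterate until convergence
    dU, dW, dV = bptt_grads(X, Y, U, W, V, s0)
    U -= eta * dU
    W -= eta * dW
    V -= eta * dV
```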
