Gated recurrent unit
Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al.[1] The GRU is like a long short-term memory (LSTM) with a gating mechanism to input or forget certain features,[2] but lacks a context vector or output gate, resulting in fewer parameters than LSTM.[3] GRU's performance on certain tasks of polyphonic music modeling, speech signal modeling and natural language processing was found to be similar to that of LSTM.[4][5] GRUs showed that gating is indeed helpful in general, and Bengio's team came to no concrete conclusion on which of the two gating units was better.[6][7]
Architecture
There are several variations on the full gated unit, with gating done using the previous hidden state and the bias in various combinations, and a simplified form called minimal gated unit.[8]
The operator $\odot$ denotes the Hadamard product in the following.
Fully gated unit
![Fully gated GRU, base type](https://upload.wikimedia.org/wikipedia/commons/thumb/3/37/Gated_Recurrent_Unit%2C_base_type.svg/220px-Gated_Recurrent_Unit%2C_base_type.svg.png)
Initially, for $t = 0$, the output vector is $h_0 = 0$.

$$
\begin{aligned}
z_t &= \sigma_g(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma_g(W_r x_t + U_r h_{t-1} + b_r) \\
\hat{h}_t &= \phi_h(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t
\end{aligned}
$$

Variables ($d$ denotes the number of input features and $e$ the number of output features):

- $x_t \in \mathbb{R}^d$: input vector
- $h_t \in \mathbb{R}^e$: output vector
- $\hat{h}_t \in \mathbb{R}^e$: candidate activation vector
- $z_t \in (0,1)^e$: update gate vector
- $r_t \in (0,1)^e$: reset gate vector
- $W \in \mathbb{R}^{e \times d}$, $U \in \mathbb{R}^{e \times e}$, and $b \in \mathbb{R}^e$: parameter matrices and vector which need to be learned during training
- $\sigma_g$: The original is a logistic function.
- $\phi_h$: The original is a hyperbolic tangent.

Alternative activation functions are possible, provided that $\sigma_g(x) \in [0, 1]$.
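As a concrete sketch, one step of the fully gated unit can be written in NumPy as follows. The function name and argument layout are illustrative, not a reference implementation; variable names follow the standard GRU notation ($z$ update gate, $r$ reset gate).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h):
    """One step of a fully gated GRU. Shapes: x_t (d,), h_prev (e,),
    W_* (e, d), U_* (e, e), b_* (e,)."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)             # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)             # reset gate
    h_hat = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)   # candidate activation
    h_t = (1.0 - z_t) * h_prev + z_t * h_hat                  # blend old state and candidate
    return h_t
```

Note how the reset gate `r_t` scales the previous state *inside* the candidate computation, while the update gate `z_t` interpolates between the previous state and the candidate.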
![Gated recurrent unit, type 1](https://upload.wikimedia.org/wikipedia/commons/thumb/a/ac/Gated_Recurrent_Unit%2C_type_1.svg/220px-Gated_Recurrent_Unit%2C_type_1.svg.png)
![Gated recurrent unit, type 2](https://upload.wikimedia.org/wikipedia/commons/thumb/b/bf/Gated_Recurrent_Unit%2C_type_2.svg/220px-Gated_Recurrent_Unit%2C_type_2.svg.png)
![Gated recurrent unit, type 3](https://upload.wikimedia.org/wikipedia/commons/thumb/b/ba/Gated_Recurrent_Unit%2C_type_3.svg/220px-Gated_Recurrent_Unit%2C_type_3.svg.png)
Alternate forms can be created by changing $z_t$ and $r_t$:[9]

- Type 1, each gate depends only on the previous hidden state and the bias: $z_t = \sigma_g(U_z h_{t-1} + b_z)$, $r_t = \sigma_g(U_r h_{t-1} + b_r)$
- Type 2, each gate depends only on the previous hidden state: $z_t = \sigma_g(U_z h_{t-1})$, $r_t = \sigma_g(U_r h_{t-1})$
- Type 3, each gate is computed using only the bias: $z_t = \sigma_g(b_z)$, $r_t = \sigma_g(b_r)$
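To make the reduction concrete, the gate computations for two of these variants might be sketched as below (NumPy; function names are hypothetical). Only the gates change; the candidate and output equations stay as in the fully gated unit.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Type 1: gates drop the input term, using only the previous hidden state and the bias.
def type1_gates(h_prev, U_z, b_z, U_r, b_r):
    z_t = sigmoid(U_z @ h_prev + b_z)
    r_t = sigmoid(U_r @ h_prev + b_r)
    return z_t, r_t

# Type 3: gates are computed from the bias alone, so they are constant over time.
def type3_gates(b_z, b_r):
    return sigmoid(b_z), sigmoid(b_r)
```

Each variant trims parameters relative to the full gates, which use both the input and the previous hidden state.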
Minimal gated unit
The minimal gated unit (MGU) is similar to the fully gated unit, except the update and reset gate vectors are merged into a forget gate. This also implies that the equation for the output vector must be changed:[10]

$$
\begin{aligned}
f_t &= \sigma_g(W_f x_t + U_f h_{t-1} + b_f) \\
\hat{h}_t &= \phi_h(W_h x_t + U_h (f_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - f_t) \odot h_{t-1} + f_t \odot \hat{h}_t
\end{aligned}
$$

Variables

- $x_t$: input vector
- $h_t$: output vector
- $\hat{h}_t$: candidate activation vector
- $f_t$: forget vector
- $W$, $U$, and $b$: parameter matrices and vector
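The merged gate can be sketched as follows (NumPy; an illustration, not a reference implementation). Compared with the full GRU, the single forget gate `f_t` plays both roles: it resets the previous state inside the candidate and interpolates the output.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mgu_cell(x_t, h_prev, W_f, U_f, b_f, W_h, U_h, b_h):
    """One step of a minimal gated unit: a single forget gate
    replaces the GRU's separate update and reset gates."""
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)             # forget gate
    h_hat = np.tanh(W_h @ x_t + U_h @ (f_t * h_prev) + b_h)   # candidate activation
    return (1.0 - f_t) * h_prev + f_t * h_hat                 # output vector
```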
Light gated recurrent unit
The light gated recurrent unit (LiGRU)[4] removes the reset gate altogether, replaces tanh with the ReLU activation, and applies batch normalization (BN):

$$
\begin{aligned}
z_t &= \sigma_g(\operatorname{BN}(W_z x_t) + U_z h_{t-1}) \\
\hat{h}_t &= \operatorname{ReLU}(\operatorname{BN}(W_h x_t) + U_h h_{t-1}) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \hat{h}_t
\end{aligned}
$$
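A minibatch-level sketch of one LiGRU step is below (NumPy). Batch normalization is shown in a simplified form, standardizing over the batch axis without the usual learned scale and shift; names and shapes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_norm(a, eps=1e-5):
    # Simplified BN: standardize each feature over the batch axis
    # (no learned scale/shift, no running statistics).
    return (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)

def ligru_cell(X_t, H_prev, W_z, U_z, W_h, U_h):
    """One LiGRU step for a minibatch. Shapes: X_t (B, d), H_prev (B, e),
    W_* (d, e), U_* (e, e). No reset gate; ReLU candidate; BN on the input term."""
    Z_t = sigmoid(batch_norm(X_t @ W_z) + H_prev @ U_z)           # update gate
    H_hat = np.maximum(0.0, batch_norm(X_t @ W_h) + H_prev @ U_h) # ReLU candidate
    return Z_t * H_prev + (1.0 - Z_t) * H_hat
```

Because the candidate is a ReLU, BN on the input projection helps keep its pre-activations in a well-scaled range.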
LiGRU has been studied from a Bayesian perspective.[11] This analysis yielded a variant called light Bayesian recurrent unit (LiBRU), which showed slight improvements over the LiGRU on speech recognition tasks.
References
- ^ Cho, Kyunghyun; van Merrienboer, Bart; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". Association for Computational Linguistics. arXiv:1406.1078.
- ^ Felix Gers; Jürgen Schmidhuber; Fred Cummins (1999). "Learning to forget: Continual prediction with LSTM". 9th International Conference on Artificial Neural Networks: ICANN '99. Vol. 1999. pp. 850–855. doi:10.1049/cp:19991218. ISBN 0-85296-721-7.
- ^ "Recurrent Neural Network Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python and Theano – WildML". Wildml. 2015-10-27. Archived from the original on 2021-11-10. Retrieved May 18, 2016.
- ^ a b Ravanelli, Mirco; Brakel, Philemon; Omologo, Maurizio; Bengio, Yoshua (2018). "Light Gated Recurrent Units for Speech Recognition". IEEE Transactions on Emerging Topics in Computational Intelligence. 2 (2): 92–102. arXiv:1803.10225. doi:10.1109/TETCI.2017.2762739. S2CID 4402991.
- ^ Su, Yuanhang; Kuo, Jay (2019). "On extended long short-term memory and dependent bidirectional recurrent neural network". Neurocomputing. 356: 151–161. arXiv:1803.01686. doi:10.1016/j.neucom.2019.04.044. S2CID 3675055.
- ^ Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555 [cs.NE].
- ^ Gruber, N.; Jockisch, A. (2020), "Are GRU cells more specific and LSTM cells more sensitive in motive classification of text?", Frontiers in Artificial Intelligence, 3: 40, doi:10.3389/frai.2020.00040, PMC 7861254, PMID 33733157, S2CID 220252321.
- ^ Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555 [cs.NE].
- ^ Dey, Rahul; Salem, Fathi M. (2017-01-20). "Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks". arXiv:1701.05923 [cs.NE].
- ^ Heck, Joel; Salem, Fathi M. (2017-01-12). "Simplified Minimal Gated Unit Variations for Recurrent Neural Networks". arXiv:1701.03452 [cs.NE].
- ^ Bittar, Alexandre; Garner, Philip N. (May 2021). "A Bayesian Interpretation of the Light Gated Recurrent Unit". ICASSP 2021. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, ON, Canada: IEEE. pp. 2965–2969. doi:10.1109/ICASSP39728.2021.9414259.