學術產出-Theses

Article View/Open

Publication Export

Google ScholarTM

政大圖書館

Citation Infomation

題名 消除深度學習目標函數中局部極小值之研究
A Survey on Eliminating Local Minima of the Objective Function in Deep Learning
作者 季佳琪
Chi, Chia-Chi
貢獻者 蔡炎龍
Tsai, Yen-Lung
季佳琪
Chi, Chia-Chi
關鍵詞 深度學習
神經網路
目標函數
損失函數
局部極小值
Deep Learning
Neural Network
Objective Function
Loss Function
Local Minima
日期 2019
上傳時間 5-Sep-2019 16:13:48 (UTC+8)
摘要 在本文中,我們主要研究消除目標函數的非最佳局部極小值的方法和其中的定理。 更具體地說,我們發現,在給定原始神經網絡的情況下,我們可以透過對其添加外加的神經網路層來建構一個修正的神經網絡。在這前提下,如果修正的神經網絡的目標函數達到局部最小值,則原始神經網絡的目標函數將達到絕對最小值。在接下來的內容中,我們首先回顧一些以前的相關文獻、概述何謂深度學習,並證明常見損失函數的凸性以滿足定理的假設。接下來,我們在主要定理中證明了一些細節、討論了此方法的效果,並研究了它的局限性。 最後,我們進行了一系列實驗來顯示此方法可以用於實際工作。
In this paper, we mainly survey the method and theorems of eliminating suboptimal local minima of the objective function. More specifically, we find that: given an original neural network, we can construct a modified network by adding external layers to it. Then if the objective function of the modified network achieve a local minimum, the objective function of the original neural network will reach a global minimum. We first review some previous related literature, give an overview of deep learning and then prove the convexity of common loss functions to satisfy the assumptions of theorems. Next, we prove some details in such theorems, discuss the effects of the method, and investigate its limitations. Finally, we perform a series of experiments to show that the method can be used for practical works.
參考文獻 [1] Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning polynomials with neural networks. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pages II–1908–II–1916. JMLR.org, 2014.
[2] Avrim L. Blum and Ronald L. Rivest. Training a 3-node neural network is np-complete. Neural Networks, 5(1):117 – 127, 1992.
[3] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. 02 2017.
[4] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar, October 2014. Association for Computational Linguistics.
[5] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. Journal of Machine Learning Research, 38:192–204, 2015.
[6] Simon S. Du and Jason D. Lee. On the power of over-parametrization in neural networks with quadratic activation. CoRR, abs/1803.01206, 2018.
[7] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179 – 211, 1990.
[8] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with
landscape design. CoRR, abs/1711.00501, 2017.
[9] Surbhi Goel and Adam R. Klivans. Learning depth-three neural networks in polynomial time. CoRR, abs/1709.06010, 2017.
[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
[11] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:
1308.0850, 2013.
[12] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. CoRR, abs/1611.04231,
2017.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[14] Kenji Kawaguchi. Deep learning without poor local minima. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 586–594. Curran Associates, Inc., 2016.
[15] KenjiKawaguchiandYoshuaBengio.Depthwithnonlinearitycreatesnobadlocalminima in resnets. arXiv preprint arXiv:1810.09038, 2018.
[16] Kenji Kawaguchi, Jiaoyang Huang, and Leslie Pack Kaelbling. Effect of depth and width on local minima in deep learning. CoRR, abs/1811.08150, 2018.
[17] Kenji Kawaguchi and Leslie Pack Kaelbling. Elimination of all bad local minima in deep learning. CoRR, abs/1901.00279, 2019.
[18] AlexKrizhevsky,IlyaSutskever,andGeoffreyEHinton.Imagenetclassificationwithdeep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[19] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
[20] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553): 436, 2015.
[21] YannLeCun,LéonBottou,YoshuaBengio,PatrickHaffner,etal.Gradient-basedlearning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[22] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. CoRR, abs/1705.09886, 2017.
[23] Shiyu Liang, Ruoyu Sun, Jason D. Lee, and Rayadurgam Srikant. Adding one neuron can eliminate all bad local minima. Advances in Neural Information Processing Systems, 2018-December:4350–4360, 1 2018.
[24] Katta G. Murty and Santosh N. Kabadi. Some np-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, Jun 1987.
[25] Quynh Nguyen and Matthias Hein. Optimization landscape and expressivity of deep CNNs. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3730–3739, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
[26] Quynh N. Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. CoRR, abs/1704.08045, 2017.
[27] SebastianRuder.Anoverviewofgradientdescentoptimizationalgorithms.arXivpreprint arXiv:1609.04747, 2016.
[28] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.
[29] Hanie Sedghi and Anima Anandkumar. Provable methods for training neural networks with sparse connectivity. arXiv preprint arXiv:1412.2693, 2014.
[30] Ohad Shamir. Are resnets provably better than linear predictors? CoRR, abs/1804.06739, 2018.
[31] ClaudeElwoodShannon.Amathematicaltheoryofcommunication.Bellsystemtechnical journal, 27(3):379–423, 1948.
[32] Mahdi Soltanolkotabi. Learning relus via gradient descent. CoRR, abs/1705.04591, 2017.
[33] Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks, 2018.
[34] Paul J Werbos et al. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
[35] Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4140–4149, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
描述 碩士
國立政治大學
應用數學系
1057510163
資料來源 http://thesis.lib.nccu.edu.tw/record/#G1057510163
資料類型 thesis
dc.contributor.advisor 蔡炎龍zh_TW
dc.contributor.advisor Tsai, Yen-Lungen_US
dc.contributor.author (Authors) 季佳琪zh_TW
dc.contributor.author (Authors) Chi, Chia-Chien_US
dc.creator (作者) 季佳琪zh_TW
dc.creator (作者) Chi, Chia-Chien_US
dc.date (日期) 2019en_US
dc.date.accessioned 5-Sep-2019 16:13:48 (UTC+8)-
dc.date.available 5-Sep-2019 16:13:48 (UTC+8)-
dc.date.issued (上傳時間) 5-Sep-2019 16:13:48 (UTC+8)-
dc.identifier (Other Identifiers) G1057510163en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/125637-
dc.description (描述) 碩士zh_TW
dc.description (描述) 國立政治大學zh_TW
dc.description (描述) 應用數學系zh_TW
dc.description (描述) 1057510163zh_TW
dc.description.abstract (摘要) 在本文中,我們主要研究消除目標函數的非最佳局部極小值的方法和其中的定理。 更具體地說,我們發現,在給定原始神經網絡的情況下,我們可以透過對其添加外加的神經網路層來建構一個修正的神經網絡。在這前提下,如果修正的神經網絡的目標函數達到局部最小值,則原始神經網絡的目標函數將達到絕對最小值。在接下來的內容中,我們首先回顧一些以前的相關文獻、概述何謂深度學習,並證明常見損失函數的凸性以滿足定理的假設。接下來,我們在主要定理中證明了一些細節、討論了此方法的效果,並研究了它的局限性。 最後,我們進行了一系列實驗來顯示此方法可以用於實際工作。zh_TW
dc.description.abstract (摘要) In this paper, we mainly survey the method and theorems of eliminating suboptimal local minima of the objective function. More specifically, we find that: given an original neural network, we can construct a modified network by adding external layers to it. Then if the objective function of the modified network achieve a local minimum, the objective function of the original neural network will reach a global minimum. We first review some previous related literature, give an overview of deep learning and then prove the convexity of common loss functions to satisfy the assumptions of theorems. Next, we prove some details in such theorems, discuss the effects of the method, and investigate its limitations. Finally, we perform a series of experiments to show that the method can be used for practical works.en_US
dc.description.tableofcontents 1 Introduction 1
2 Deep Learning 3
2.1 Definition of Deep Learning 3
2.2 Standard Neural Network 4
2.2.1 The Structure of The Neural Network 4
2.2.2 The Operation of The Neural Network 5
2.2.3 Activation Function 6
2.3 Optimization for Training Deep Network 7
2.4 Convolutional Neural Network 10
2.4.1 Definition of Convolution 10
2.4.2 The Structure of The Convolutional Neural Network 12
2.5 Recurrent Neural Network 15
2.5.1 The Structure of The Recurrent Neural Network 16
2.5.2 The Operation of The Recurrent Neural Network 16
3 Model Description 19
3.1 The Architecture 19
3.2 Loss and Objective Functions 21
3.2.1 Construction of Objective Functions 21
3.2.2 Convexity of Loss Functions 21
4 Main Theorems 25
4.1 Lemmas 25
4.2 Theorems 34
5 Effects of Eliminating Local Minima 43
5.1 Effects of All Situations 43
5.2 Examples 44
6 Challenges of Eliminating Local Minima 49
6.1 Theorem 49
6.2 Example 51
7 Experiments 52
7.1 Standard Neural Network Results 52
7.2 Convolutional Neural Network Results 54
8 Conclusion 57
Appendix A Code of The Models 58
A.1 The NN Model 58
A.2 The mNN Model 59
A.3 The CNN Model 62
A.4 The nCNN Model 63
A.5 The mnCNN Model 65
Bibliography 69
zh_TW
dc.format.extent 5183578 bytes-
dc.format.mimetype application/pdf-
dc.source.uri (資料來源) http://thesis.lib.nccu.edu.tw/record/#G1057510163en_US
dc.subject (關鍵詞) 深度學習zh_TW
dc.subject (關鍵詞) 神經網路zh_TW
dc.subject (關鍵詞) 目標函數zh_TW
dc.subject (關鍵詞) 損失函數zh_TW
dc.subject (關鍵詞) 局部極小值zh_TW
dc.subject (關鍵詞) Deep Learningen_US
dc.subject (關鍵詞) Neural Networken_US
dc.subject (關鍵詞) Objective Functionen_US
dc.subject (關鍵詞) Loss Functionen_US
dc.subject (關鍵詞) Local Minimaen_US
dc.title (題名) 消除深度學習目標函數中局部極小值之研究zh_TW
dc.title (題名) A Survey on Eliminating Local Minima of the Objective Function in Deep Learningen_US
dc.type (資料類型) thesisen_US
dc.relation.reference (參考文獻) [1] Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning polynomials with neural networks. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pages II–1908–II–1916. JMLR.org, 2014.
[2] Avrim L. Blum and Ronald L. Rivest. Training a 3-node neural network is np-complete. Neural Networks, 5(1):117 – 127, 1992.
[3] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. 02 2017.
[4] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar, October 2014. Association for Computational Linguistics.
[5] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. Journal of Machine Learning Research, 38:192–204, 2015.
[6] Simon S. Du and Jason D. Lee. On the power of over-parametrization in neural networks with quadratic activation. CoRR, abs/1803.01206, 2018.
[7] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179 – 211, 1990.
[8] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with
landscape design. CoRR, abs/1711.00501, 2017.
[9] Surbhi Goel and Adam R. Klivans. Learning depth-three neural networks in polynomial time. CoRR, abs/1709.06010, 2017.
[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
[11] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:
1308.0850, 2013.
[12] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. CoRR, abs/1611.04231,
2017.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[14] Kenji Kawaguchi. Deep learning without poor local minima. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 586–594. Curran Associates, Inc., 2016.
[15] KenjiKawaguchiandYoshuaBengio.Depthwithnonlinearitycreatesnobadlocalminima in resnets. arXiv preprint arXiv:1810.09038, 2018.
[16] Kenji Kawaguchi, Jiaoyang Huang, and Leslie Pack Kaelbling. Effect of depth and width on local minima in deep learning. CoRR, abs/1811.08150, 2018.
[17] Kenji Kawaguchi and Leslie Pack Kaelbling. Elimination of all bad local minima in deep learning. CoRR, abs/1901.00279, 2019.
[18] AlexKrizhevsky,IlyaSutskever,andGeoffreyEHinton.Imagenetclassificationwithdeep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[19] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
[20] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553): 436, 2015.
[21] YannLeCun,LéonBottou,YoshuaBengio,PatrickHaffner,etal.Gradient-basedlearning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[22] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. CoRR, abs/1705.09886, 2017.
[23] Shiyu Liang, Ruoyu Sun, Jason D. Lee, and Rayadurgam Srikant. Adding one neuron can eliminate all bad local minima. Advances in Neural Information Processing Systems, 2018-December:4350–4360, 1 2018.
[24] Katta G. Murty and Santosh N. Kabadi. Some np-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, Jun 1987.
[25] Quynh Nguyen and Matthias Hein. Optimization landscape and expressivity of deep CNNs. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3730–3739, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
[26] Quynh N. Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. CoRR, abs/1704.08045, 2017.
[27] SebastianRuder.Anoverviewofgradientdescentoptimizationalgorithms.arXivpreprint arXiv:1609.04747, 2016.
[28] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.
[29] Hanie Sedghi and Anima Anandkumar. Provable methods for training neural networks with sparse connectivity. arXiv preprint arXiv:1412.2693, 2014.
[30] Ohad Shamir. Are resnets provably better than linear predictors? CoRR, abs/1804.06739, 2018.
[31] ClaudeElwoodShannon.Amathematicaltheoryofcommunication.Bellsystemtechnical journal, 27(3):379–423, 1948.
[32] Mahdi Soltanolkotabi. Learning relus via gradient descent. CoRR, abs/1705.04591, 2017.
[33] Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks, 2018.
[34] Paul J Werbos et al. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
[35] Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4140–4149, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
zh_TW
dc.identifier.doi (DOI) 10.6814/NCCU201900936en_US