Publications-Theses
Title: 消除深度學習目標函數中局部極小值之研究 (A Survey on Eliminating Local Minima of the Objective Function in Deep Learning)
Author: 季佳琪 (Chi, Chia-Chi)
Advisor: 蔡炎龍 (Tsai, Yen-Lung)
Keywords: Deep Learning; Neural Network; Objective Function; Loss Function; Local Minima
Date: 2019
Uploaded: 5-Sep-2019 16:13:48 (UTC+8)
Abstract:
In this thesis, we survey methods and theorems for eliminating suboptimal local minima of the objective function in deep learning. More specifically, given an original neural network, we can construct a modified network by adding external layers to it; if the objective function of the modified network achieves a local minimum, then the objective function of the original network attains a global minimum. We first review related literature, give an overview of deep learning, and prove the convexity of common loss functions so that the assumptions of the theorems are satisfied. Next, we prove some details of these theorems, discuss the effects of the method, and investigate its limitations. Finally, we perform a series of experiments to show that the method can be used in practical work.

References
[1] Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning polynomials with neural networks. In Proceedings of the 31st International Conference on Machine Learning, ICML'14, pages II-1908–II-1916. JMLR.org, 2014.
[2] Avrim L. Blum and Ronald L. Rivest. Training a 3-node neural network is NP-complete. Neural Networks, 5(1):117–127, 1992.
[3] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a ConvNet with Gaussian inputs. February 2017.
[4] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar, October 2014. Association for Computational Linguistics.
[5] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. Journal of Machine Learning Research, 38:192–204, 2015.
[6] Simon S. Du and Jason D. Lee. On the power of over-parametrization in neural networks with quadratic activation. CoRR, abs/1803.01206, 2018.
[7] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
[8] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. CoRR, abs/1711.00501, 2017.
[9] Surbhi Goel and Adam R. Klivans. Learning depth-three neural networks in polynomial time. CoRR, abs/1709.06010, 2017.
[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[11] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[12] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. CoRR, abs/1611.04231, 2017.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[14] Kenji Kawaguchi. Deep learning without poor local minima. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 586–594. Curran Associates, Inc., 2016.
[15] Kenji Kawaguchi and Yoshua Bengio. Depth with nonlinearity creates no bad local minima in ResNets. arXiv preprint arXiv:1810.09038, 2018.
[16] Kenji Kawaguchi, Jiaoyang Huang, and Leslie Pack Kaelbling. Effect of depth and width on local minima in deep learning. CoRR, abs/1811.08150, 2018.
[17] Kenji Kawaguchi and Leslie Pack Kaelbling. Elimination of all bad local minima in deep learning. CoRR, abs/1901.00279, 2019.
[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[19] Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[20] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
[21] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[22] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. CoRR, abs/1705.09886, 2017.
[23] Shiyu Liang, Ruoyu Sun, Jason D. Lee, and Rayadurgam Srikant. Adding one neuron can eliminate all bad local minima. Advances in Neural Information Processing Systems, 2018-December:4350–4360, 2018.
[24] Katta G. Murty and Santosh N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, June 1987.
[25] Quynh Nguyen and Matthias Hein. Optimization landscape and expressivity of deep CNNs. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3730–3739, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.
[26] Quynh N. Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. CoRR, abs/1704.08045, 2017.
[27] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
[28] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[29] Hanie Sedghi and Anima Anandkumar. Provable methods for training neural networks with sparse connectivity. arXiv preprint arXiv:1412.2693, 2014.
[30] Ohad Shamir. Are ResNets provably better than linear predictors? CoRR, abs/1804.06739, 2018.
[31] Claude Elwood Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.
[32] Mahdi Soltanolkotabi. Learning ReLUs via gradient descent. CoRR, abs/1705.04591, 2017.
[33] Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks, 2018.
[34] Paul J. Werbos et al. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
[35] Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4140–4149, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

Description: Master's thesis
National Chengchi University
Department of Applied Mathematics
Student ID: 1057510163
Source: http://thesis.lib.nccu.edu.tw/record/#G1057510163
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/125637
Type: thesis
Table of Contents
1 Introduction 1
2 Deep Learning 3
2.1 Definition of Deep Learning 3
2.2 Standard Neural Network 4
2.2.1 The Structure of The Neural Network 4
2.2.2 The Operation of The Neural Network 5
2.2.3 Activation Function 6
2.3 Optimization for Training Deep Network 7
2.4 Convolutional Neural Network 10
2.4.1 Definition of Convolution 10
2.4.2 The Structure of The Convolutional Neural Network 12
2.5 Recurrent Neural Network 15
2.5.1 The Structure of The Recurrent Neural Network 16
2.5.2 The Operation of The Recurrent Neural Network 16
3 Model Description 19
3.1 The Architecture 19
3.2 Loss and Objective Functions 21
3.2.1 Construction of Objective Functions 21
3.2.2 Convexity of Loss Functions 21
4 Main Theorems 25
4.1 Lemmas 25
4.2 Theorems 34
5 Effects of Eliminating Local Minima 43
5.1 Effects of All Situations 43
5.2 Examples 44
6 Challenges of Eliminating Local Minima 49
6.1 Theorem 49
6.2 Example 51
7 Experiments 52
7.1 Standard Neural Network Results 52
7.2 Convolutional Neural Network Results 54
8 Conclusion 57
Appendix A Code of The Models 58
A.1 The NN Model 58
A.2 The mNN Model 59
A.3 The CNN Model 62
A.4 The nCNN Model 63
A.5 The mnCNN Model 65
Bibliography 69

Format: application/pdf (5,183,578 bytes)
DOI: 10.6814/NCCU201900936
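As a rough, hypothetical illustration of the construction the abstract describes (in the spirit of Kawaguchi and Kaelbling [17] and Liang et al. [23]): the original network's output is augmented with one auxiliary exponential unit whose scale is penalized, so that at any stationary point of the modified objective the auxiliary scale is driven to zero and the original loss is recovered. All function names below are invented for the sketch, which assumes NumPy and a squared loss; it is not the thesis's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

def original_output(X, theta):
    """A toy linear 'original' network: f(x; theta) = x . theta."""
    return X @ theta

def modified_output(X, theta, a, w, b):
    """Original output plus one auxiliary exponential unit a * exp(x . w + b)."""
    return original_output(X, theta) + a * np.exp(X @ w + b)

def modified_objective(X, y, theta, a, w, b, lam=1.0):
    """Squared loss of the modified network plus the regularizer lam * a**2
    on the auxiliary unit's scale."""
    residual = modified_output(X, theta, a, w, b) - y
    return np.mean(residual ** 2) + lam * a ** 2

# Tiny regression problem to exercise the construction.
X = rng.normal(size=(32, 3))
y = X @ np.array([1.0, -2.0, 0.5])
theta = np.zeros(3)

# When the auxiliary scale a is 0, both the extra unit and its penalty
# vanish, so the modified objective equals the original squared loss.
base = np.mean((original_output(X, theta) - y) ** 2)
mod = modified_objective(X, y, theta, a=0.0, w=np.zeros(3), b=0.0)
assert np.isclose(base, mod)
```

This coincidence at a = 0 is the mechanism the theorems exploit: a local minimum of the modified objective forces a = 0, where the modified objective reduces to the original one, letting local optimality of the modified network transfer to global optimality of the original. The precise assumptions (convexity of the loss, the form of the added layers) are those stated in Chapters 3 and 4 of the thesis.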