Title 貸款違約預測:使用Spark平台分析P2P借貸資料
Loan Default Prediction: Analyzing P2P Lending Data on the Spark Platform
Author 林博仁
Lin, Bo Ren
Contributors 胡毓忠
Hu, Yuh Jong
林博仁
Lin, Bo Ren
Keywords Peer-to-peer (P2P)
Lending
Prediction
Date 2017
Upload time 1-Mar-2017 17:38:14 (UTC+8)
Abstract With the rapid rise of FinTech, traditional financial services are gradually being replaced by online digital finance. For a new loan, a bank typically requires the borrower to provide sufficient collateral to reduce the risk of bad debt. Borrowers often cannot meet this requirement, even those with a good credit history, and P2P lending platforms emerged to serve this need. This study investigates how to run an automated machine learning pipeline with the scikit-learn library on the Spark big data analytics platform, and how to carry out feature selection together with parameter and hyper-parameter optimization for a P2P lending model in order to improve repayment prediction. The dataset is the open data published by Lending Club, a publicly listed U.S. company. From the lender's perspective, we analyze borrowers' historical loan data, select features, and apply the Random Forest algorithm within the automated machine learning pipeline to train and test the model. The resulting list of borrowers predicted to have good credit is offered to investors, who can choose suitable borrowers according to their own available funds, with the goal of accurately predicting whether a borrower will repay.
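The abstract mentions hyper-parameter optimization inside an automated pipeline, but this record contains no code. As a rough illustration only, the sketch below shows the general idea of a grid search scored by K-fold cross-validation. Everything in it is an assumption made for the example: the data are synthetic rather than the Lending Club dataset, the two features are hypothetical, and a simple k-nearest-neighbours vote stands in for the Random Forest so that the sketch needs only the Python standard library. It is not the thesis's implementation.

```python
import itertools
import random
import statistics


def make_toy_loans(n=150, seed=0):
    """Synthetic rows ((x1, x2), label); label 1 means 'repaid'. Not real data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()  # two hypothetical scaled features
        label = 1 if x1 - x2 + rng.gauss(0, 0.15) > 0 else 0
        rows.append(((x1, x2), label))
    return rows


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds covering range(n)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


def knn_predict(train_rows, x, n_neighbors):
    """Majority vote among the n_neighbors closest training rows."""
    nearest = sorted(
        train_rows,
        key=lambda row: (row[0][0] - x[0]) ** 2 + (row[0][1] - x[1]) ** 2,
    )[:n_neighbors]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def cv_accuracy(rows, params, k=5):
    """Mean held-out accuracy over k folds for one hyper-parameter setting."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        train_rows = [rows[i] for i in train_idx]
        hits = sum(
            knn_predict(train_rows, rows[i][0], **params) == rows[i][1]
            for i in test_idx
        )
        scores.append(hits / len(test_idx))
    return statistics.mean(scores)


def grid_search(rows, grid, k=5):
    """Try every combination in grid; return (best_params, best_score)."""
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = cv_accuracy(rows, params, k)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


rows = make_toy_loans()
best_params, best_score = grid_search(rows, {"n_neighbors": [1, 5, 15]})
print(best_params, round(best_score, 3))
```

Each candidate setting is scored only on folds it was not fitted on, which is what keeps the selected hyper-parameter from simply memorizing the training rows. With a real library the same loop would normally be delegated to that library's own search and cross-validation utilities.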
References 【1】 Kent D. Lee, et al. (2011). Python Programming Fundamentals, Springer London Dordrecht Heidelberg, New York, 45-190.
【2】 Ian J. Galloway. (2009). Peer-to-Peer Lending and Community Development Finance, Federal Reserve Bank of San Francisco, 3-15.
【3】 Kevin Sheppard. (2014). Introduction to Python for Econometrics, Statistics and Data Analysis, Kevin Sheppard, University of Oxford, 171-201.
【4】 David Donoho. (2015). 50 years of Data Science, Tukey Centennial workshop, Princeton NJ, 4-9, 29-37.
【5】 Andy Liaw and Matthew Wiener. (2002). Classification and Regression by RandomForest, R News ISSN 1609-3631, 19-20.
【6】 Milad Malekipirbazari, Vural Aksakalli. (2015). Risk assessment in social lending via random forests, Expert Systems with Applications 4621–4631, 4-11.
【7】 M. I. Jordan and T. M. Mitchell. (2015). Machine learning: Trends, perspectives, and prospects, SCIENCE VOL 349 ISSUE 6245, 2-7.
【8】 Loren Hansen, et al. (2009). Controlling Feature Selection in Random Forests of Decision Trees Using a Genetic Algorithm: Classification of Class I MHC Peptides, Bentham Science Publishers Ltd, 6-7.
【9】 Amir E. Khandani, et al. (2010). Consumer Credit Risk Models via Machine-Learning Algorithms, Journal of Banking & Finance 34, 47-48.
【10】 JIAN Zhi-gang and JIN Xu. (2004). Research on Data Preprocess in Data Mining and Its Application, Beijing University, 3-4.
【11】 Martin Sewell. (2007). Machine Learning, University College London, 2-4.
【12】 Jehad Ali, et al. (2012). Random Forests and Decision Trees, IJCSI International Journal of Computer Science Issues, 2-6.
【13】 Oleg Okun and Helen Priisalu. (2007). Random Forest for Gene Expression Based Cancer Classification: Overlooked Issues, University of Oulu and Tallinn University of Technology, 2-7.
【14】 Jesse Davis, et al. (2006). The Relationship Between Precision-Recall and ROC Curves, University of Wisconsin-Madison, 2-7.
【15】 Andrew P. Bradley. (1997). The Use of the Area Under the ROC Curve in the Evaluation of Machine Learning Algorithms, Pattern Recognition, 2-9, 16-31.
【16】 Prof. William H. Press. (2008). Computational Statistics with Application to Bioinformatics, The University of Texas at Austin, 2-12.
【17】 Tom Fawcett. (2005). An introduction to ROC analysis, Pattern Recognition, 2-13.
【18】 Xiangrui Meng, et al. (2016). MLlib: Machine Learning in Apache Spark, Journal of Machine Learning Research 17, 4-5.
【19】 Fabian Pedregosa, et al. (2011). Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12, 2-5.
【20】 Shunpo Chang, et al. (2015-2016). Predicting Default Risk of Lending Club Loans, CS229: Machine Learning, 3-5.
【21】 Riza Emekter, et al. (2013). Evaluating the Credit Risk in Online Peer-to-Peer (P2P) Lending, Robert Morris University, 19.
【22】 Riza Emekter, et al. (2015). Evaluating credit risk and loan performance in online Peer-to-Peer (P2P) lending, Robert Morris University, 69.
【23】 Don Carmichael. (2014). Modeling Default for Peer-to-Peer Loans, University of Houston - C.T. Bauer College of Business, 21.
【24】 Freedman S M, Jin G Z. (2010). Learning by Doing with Asymmetric Information: Evidence from Prosper.com, University of Michigan, Maryland & NBER, 28.
【25】 Alexander B, Alexander B, Daniel B. (2011). Online Peer-to-Peer Lending - A Literature Review. Journal of Internet Banking and Commerce, 14.
【26】 Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, Dinani Amorim. (2014). Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?, Journal of Machine Learning Research 15(Oct):3133-3181, 43.
【27】 Determinants of Default in P2P Lending. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0139427
【28】 Matplotlib API. http://matplotlib.org/api/index.html
【29】 Lending Club Statistics - Lending Club. https://www.lendingclub.com/info/download-data.action
【30】 Apache Spark submitting-applications. http://spark.apache.org/docs/latest/submitting-applications.html
【31】 Apache Spark Python API doc. http://spark.apache.org/docs/latest/api/python/index.html
Description Master's
National Chengchi University
In-service Master's Program, Department of Computer Science
103971012
Source http://thesis.lib.nccu.edu.tw/record/#G0103971012
Type thesis
dc.contributor.advisor 胡毓忠 zh_TW
dc.contributor.advisor Hu, Yuh Jong en_US
dc.date.accessioned 1-Mar-2017 17:38:14 (UTC+8)
dc.date.available 1-Mar-2017 17:38:14 (UTC+8)
dc.date.issued (Upload time) 1-Mar-2017 17:38:14 (UTC+8)
dc.identifier (Other Identifiers) G0103971012 en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/107005
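dc.description.tableofcontents Chapter 1 Introduction
1.1 Research Motivation
1.2 Research Objectives
1.3 Chapter Overview
Chapter 2 Background
2.1 Peer-to-Peer Lending Model
2.2 Lending Club
2.3 Spark Platform
2.4 Machine Learning Pipeline
2.5 Dataset
Chapter 3 Related Work
Chapter 4 Research Framework and Methods
4.1 Data Preprocessing
4.2 Research Framework Design
4.3 Feature Selection
4.4 Hyperparameter Optimization
4.5 Model Training Method
Chapter 5 Model Evaluation
5.1 K-Fold Cross-Validation
5.2 Python Plotting Module
5.3 Validation of Model Prediction Results
Chapter 6 Conclusion and Future Work
Chapter 7 References
Appendix
Attachment: Lending Club Dataset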
dc.format.extent 1688165 bytes
dc.format.mimetype application/pdf