Title: 貸款違約預測:使用Spark平台分析P2P借貸資料
Loan Default Prediction: Analyzing P2P Lending Data on the Spark Platform
Author: 林博仁 (Lin, Bo Ren)
Contributors: 胡毓忠 (Hu, Yuh Jong), advisor; 林博仁 (Lin, Bo Ren), author
Keywords: peer-to-peer (P2P); lending; prediction
Date: 2017
Uploaded: 1-Mar-2017 17:38:14 (UTC+8)
Abstract
With the rapid rise of FinTech, traditional financial services are gradually being replaced by online applications. To reduce the risk of bad debt, a bank requires a borrower to provide sufficient collateral for a new loan. Borrowers who cannot meet this requirement, including some with good credit histories, are left without financing, and P2P lending platforms were created to meet this need. This study investigates how to run an automated machine learning pipeline with the Scikit-learn libraries on the Spark big data analytics platform, and how to optimize a P2P lending model through feature selection and parameter and hyper-parameter optimization so as to improve the discriminative power of repayment prediction. The data set is the open data published by Lending Club, a publicly listed US company. We analyze borrowers' historical loan data from the lender's perspective, select features from it, and train and test the model with the Random Forest algorithm inside the automated machine learning pipeline. The resulting list of borrowers predicted to have good credit is offered to investors, who can choose suitable borrowers according to their own funds, with the aim of accurately predicting whether a borrower will repay and achieving a high loan return rate.
References
【1】 Kent D. Lee, et al. (2011). Python Programming Fundamentals, Springer London Dordrecht Heidelberg, New York, 45-190.
【2】 Ian J. Galloway. (2009). Peer-to-Peer Lending and Community Development Finance, Bank of San Francisco, 3-15.
【3】 Kevin Sheppard. (2014). Introduction to Python for Econometrics, Statistics and Data Analysis, Kevin Sheppard, University of Oxford, 171-201.
【4】 David Donoho. (2015). 50 years of Data Science, Tukey Centennial workshop, Princeton NJ, 4-9, 29-37.
【5】 Andy Liaw and Matthew Wiener. (2002). Classification and Regression by RandomForest, R News ISSN 1609-3631, 19-20.
【6】 Milad Malekipirbazari, Vural Aksakalli. (2015). Risk assessment in social lending via random forests, Expert Systems with Applications 4621–4631, 4-11.
【7】 M. I. Jordan and T. M. Mitchell. (2015). Machine learning: Trends, perspectives, and prospects, Science, Vol. 349, Issue 6245, 2-7.
【8】 Loren Hansen, et al. (2009). Controlling Feature Selection in Random Forests of Decision Trees Using a Genetic Algorithm: Classification of Class I MHC Peptides, Bentham Science Publishers Ltd, 6-7.
【9】 Amir E. Khandani, et al. (2010). Consumer Credit Risk Models via Machine-Learning Algorithms, Journal of Banking & Finance 34, 47-48.
【10】 JIAN Zhi-gang and JIN Xu. (2004). Research on Data Preprocess in Data Mining and Its Application, Beijing University, 3-4.
【11】 Martin Sewell. (2007). Machine Learning, University College London, 2-4.
【12】 Jehad Ali, et al. (2012). Random Forests and Decision Trees, IJCSI International Journal of Computer Science Issues, 2-6.
【13】 Oleg Okun and Helen Priisalu. (2007). Random Forest for Gene Expression Based Cancer Classification: Overlooked Issues, University of Oulu and Tallinn University of Technology, 2-7.
【14】 Jesse Davis, et al. (2006). The Relationship Between Precision-Recall and ROC Curves, University of Wisconsin-Madison, 2-7.
【15】 Andrew P. Bradley. (1997). The Use of the Area Under the ROC Curve in the Evaluation of Machine Learning, Pattern Recognition, 2-9, 16-31.
【16】 William H. Press. (2008). Computational Statistics with Application to Bioinformatics, The University of Texas at Austin, 2-12.
【17】 Tom Fawcett. (2005). An introduction to ROC analysis, Pattern Recognition, 2-13.
【18】 Xiangrui Meng, et al. (2016). MLlib: Machine Learning in Apache Spark, Journal of Machine Learning Research 17, 4-5.
【19】 Fabian Pedregosa, et al. (2011). Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12, 2-5.
【20】 Shunpo Chang, et al. (2015-2016). Predicting Default Risk of Lending Club Loans, CS229: Machine Learning, 3-5.
【21】 Riza Emekter, et al. (2013). Evaluating the Credit Risk in Online Peer-to-Peer (P2P) Lending, Robert Morris University, 19.
【22】 Riza Emekter, et al. (2015). Evaluating credit risk and loan performance in online Peer-to-Peer (P2P) lending, Robert Morris University, 69.
【23】 Don Carmichael. (2014). Modeling Default for Peer-to-Peer Loans, University of Houston - C.T. Bauer College of Business, 21.
【24】 Freedman S M, Jin G Z. (2010). Learning by Doing with Asymmetric Information: Evidence from Prosper.com, University of Michigan, Maryland & NBER, 28.
【25】 Alexander B, Alexander B, Daniel B. (2011). Online Peer-to-Peer Lending - A Literature Review, Journal of Internet Banking and Commerce, 14.
【26】 Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, Dinani Amorim. (2014). Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?, Journal of Machine Learning Research 15(Oct):3133-3181, 43.
【27】 Determinants of Default in P2P Lending. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0139427
【28】 Matplotlib API. http://matplotlib.org/api/index.html
【29】 Lending Club Statistics - Lending Club. https://www.lendingclub.com/info/download-data.action
【30】 Apache Spark submitting-applications. http://spark.apache.org/docs/latest/submitting-applications.html
【31】 Apache Spark Python API doc. http://spark.apache.org/docs/latest/api/python/index.html
Description
Master's thesis
National Chengchi University
In-service Master's Program, Department of Computer Science
103971012
Source: http://thesis.lib.nccu.edu.tw/record/#G0103971012
Type: thesis
Identifier: G0103971012
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/107005
Format: application/pdf, 1688165 bytes
Table of contents
Chapter 1 Introduction 9
1.1 Research Motivation 9
1.2 Research Purpose 10
1.3 Chapter Overview 11
Chapter 2 Research Background 12
2.1 Peer-to-Peer Lending Model 12
2.2 Lending Club 13
2.3 Spark Platform 15
2.4 Machine Learning Pipeline 16
2.5 Data Set 17
Chapter 3 Related Work 19
Chapter 4 Research Architecture and Methods 24
4.1 Data Preprocessing 24
4.2 Research Architecture Design 25
4.3 Feature Selection 28
4.4 Hyper-parameter Optimization 30
4.5 Model Training Method 34
Chapter 5 Model Evaluation 37
5.1 K-fold Cross-Validation 37
5.2 Python Plotting Modules 38
5.3 Validation of Model Prediction Results 38
Chapter 6 Conclusion and Future Work 42
Chapter 7 References 44
Appendix 47
Attachment: Lending Club Data Set 47
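Illustrative sketch
The abstract describes an automated machine learning pipeline that combines feature selection, hyper-parameter optimization, and a Random Forest classifier. The Python sketch below shows one way such a pipeline might look with scikit-learn. It is not the thesis's code: it runs locally rather than on Spark, the data are synthetic stand-ins for the Lending Club records, and the choice of `SelectKBest`, the parameter grid, the label encoding (1 = repaid), and all variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for loan records (assumption): 1 = repaid, 0 = defaulted.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=5,
    weights=[0.2, 0.8], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Feature selection and the classifier are chained so that both are
# re-fitted inside every cross-validation fold.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Hypothetical search grid: number of kept features plus two forest settings.
param_grid = {
    "select__k": [4, 8],
    "forest__n_estimators": [25, 50],
    "forest__max_depth": [3, None],
}

# 3-fold cross-validated grid search, scored by area under the ROC curve.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="roc_auc")
search.fit(X_train, y_train)

# Probability of the "repaid" class for each held-out borrower.
repay_probability = search.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, repay_probability)
print("best parameters:", search.best_params_)
print("held-out ROC AUC: %.3f" % auc)
```

Sorting the held-out borrowers by `repay_probability` would give a ranked list of the kind the abstract proposes to offer investors. Putting the selector inside the pipeline keeps the feature choice from seeing the validation folds during the search.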