Title: Loan default prediction: analyzing P2P lending data on the Spark platform (貸款違約預測:使用Spark平台分析P2P借貸資料)
Authors: Lin, Bo Ren (林博仁)
Contributors: Hu, Yuh Jong (胡毓忠)
Lin, Bo Ren (林博仁)
Keywords: peer-to-peer (點對點)
Date: 2017
Issue Date: 2017-03-01 17:38:14 (UTC+8)
Abstract: With the rapid rise of FinTech, financial services are gradually moving from traditional processing to online applications. When issuing a loan, a bank requires the borrower to provide sufficient collateral in order to reduce the risk of bad debt. Borrowers without collateral, including those with an excellent credit history, are often left with nowhere to turn, and P2P lending platforms emerged to meet this need. This study investigates how to run an automated machine learning pipeline with the scikit-learn library on the Spark big data analytics platform, and how to optimize feature selection, parameters, and hyperparameters for a P2P lending model so as to improve the accuracy of repayment prediction. The dataset is the open data published by Lending Club, a publicly listed US company. Taking the lender's perspective, we analyze borrowers' historical loan records, select features from them, and train and test the model with the Random Forest algorithm inside the automated machine learning pipeline. The result is a predicted list of borrowers with good credit that investors can consult, choosing borrowers that suit their own available funds, with the goal of accurately predicting whether a borrower will repay and achieving a high loan return rate.
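The pipeline described in the abstract (feature selection, a Random Forest classifier, and hyperparameter search) can be illustrated with a minimal scikit-learn sketch. This is not the thesis's code: the data is synthetic rather than the Lending Club dataset, the label convention (1 = repaid, 0 = defaulted), the step names, and the parameter grid are all assumptions chosen for illustration, and the Spark integration is left out.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for loan records (assumption): 1 = repaid, 0 = defaulted.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    weights=[0.2, 0.8], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 1 keeps features whose forest importance passes a threshold;
# step 2 fits the final Random Forest on the selected features.
pipeline = Pipeline([
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0))),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Hypothetical search grid over the selection threshold and forest settings.
param_grid = {
    "select__threshold": ["mean", "median"],
    "forest__n_estimators": [50, 100],
    "forest__max_depth": [None, 5],
}

# 3-fold cross-validated grid search, scored by area under the ROC curve.
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(X_train, y_train)

# Predicted repayment probability per held-out borrower, and held-out AUC.
repay_probability = search.predict_proba(X_test)[:, 1]
test_auc = search.score(X_test, y_test)

print(search.best_params_)
print(round(test_auc, 3))
```

Sorting held-out borrowers by `repay_probability` would give a ranked list of the kind the abstract describes for investors. On a cluster, the same grid search would need a Spark-aware backend, which this sketch does not cover.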
Description: Master's thesis
Source URI:
Data Type: thesis
Appears in Collections: [In-service Master's Program, Department of Computer Science] Theses

Files in This Item:

File         Size     Format
101201.pdf   1648Kb   Adobe PDF

All items in 學術集成 are protected by copyright, with all rights reserved.