分散式計算系統及巨量資料處理架構設計-基於YARN, Storm及Spark

Publications-Theses

Article View/Open

pdf(741)

Publication Export

Google Scholar^TM

題名	分散式計算系統及巨量資料處理架構設計-基於YARN, Storm及Spark Distributed computing system and big data real-time processing structure—based on YARN, Storm and Spark
作者	曾柏崴 Tseng, Po Wei
貢獻者	劉文卿<br>張景堯 Liou, Wen Ching<br>Chang, Jiing Yao 曾柏崴 Tseng, Po Wei
關鍵詞	Apache YARN Apache Storm Apache Spark 大數據處理即時預測 Apache YARN Apache Storm Apache Spark Big Data Processing Real-time Forecasting
日期	2015
上傳時間	17-Aug-2015 14:08:24 (UTC+8)
摘要	近年來，隨著大數據時代的來臨，即時資料運算面臨許多挑戰。例如在期貨交易預測方面，為了精準的預測市場狀態，我們需要在海量資料中建立預測模型，且耗時在數十毫秒之內。在本研究中，我們將介紹一套即時巨量資料運算架構，這套架構將解決在實務上需要解決的三大需求：高速處理需求、巨量資料處理以及儲存需求。同時，在整個平行運算系統之下，我們也實作了數種人工智慧演算法，例如SVM (Support Vector Machine)和LR (Logistic Regression)等，做為策略模擬的子系統。本架構包含下列三種主要的雲端運算技術： 1. 使用Apache YARN以整合整體系統資源，使叢集資源運用更具效率。 2. 為滿足高速處理需求，本架構使用Apache Storm以便處理海量且即時之資料流。同時，借助該框架，可在數十毫秒之內，運算上千種市場狀態數值供模型建模之用。 3. 運用Apache Spark，本研究建立了一套分散式運算架構用於模型建模。藉由使用Spark RDD(Resilient Distributed Datasets)，本架構可將SVM和LR之模型建模時間縮短至數百毫秒之內。為解決上述需求，本研究設計了一套n層分散式架構且整合上列數種技術。另外，在該架構中，我們使用Apache Kafka作為整體系統之訊息中介層，並支持系統內各子系統間之非同步訊息溝通。 With the coming of the era of big data, the immediacy and the amount of data computation are facing with many challenges. For example, for Futures market forecasting, we need to accurately forecast the market state with the model built from large data (hundreds of GB to tens of TB) within tens of milliseconds. In this research, we will introduce a real-time big data computing architecture to resolve requests of high speed processing, the immense volume of data and the request of large data processing. In the meantime, several algorithms, such as SVM (Support Vector Machine, SVM) and LR (Logistic Regression, LR), are implemented as a subproject under the parallel distributed computing system. This architecture involves three main cloud computing techniques: 1. Use Apache YARN as a system of integrated resource management in order to apply cluster resources more efficiently. 2. To satisfy the requests of high speed processing, we apply Apache Storm in order to process large real-time data stream and compute thousands of numerical value within tens of milliseconds for following model building. 3. With Apache Spark, we establish a distributed computing architecture for model building. By using Spark RDD (Resilient Distributed Datasets, RDD), this architecture can shorten the execution time to within hundreds of milliseconds for SVM and LR model building. To resolve the requirements of the distributed system, we design an n-tier distributed architecture to integrate the foregoing several techniques. In this architecture, we use the Apache Kafka as the messaging middleware to support asynchronous message-based communication.
參考文獻	[1] Toshniwal, A., Taneja, S., Shukla, A., Ramasamy, K., Patel, J. M., Kulkarni, S., ... & Ryaboy, D. (2014, June). Storm@ twitter. InProceedings of the 2014 ACM SIGMOD international conference on Management of data (pp. 147-156). ACM. [2] Apache Storm. https://storm.apache.org/ [3] Jones, M. T. (2013). Process real-time big data with Twitter Storm. IBM Technical Library. [4] Aarsten, A., Brugali, D., & Menga, G. (1996). Patterns for three-tier client/server applications. Proceedings of Pattern Languages of Programs (PLoP’96), 4-6. [5] Hirschfeld, R. (1996). Three-tier distribution architecture. Pattern Languages of Programs (PloP). [6] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., ... & Stoica, I. (2012, April). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (pp. 2-2). USENIX Association. [7] Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010, June). Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (pp. 10-10). [8] Hansen, C. A. (2012). Optimizing Hadoop for the cluster. Institue for Computer Science, University of Troms0, Norway, http://oss. csie. fju. edu. tw/~ tzu98/Optimizing% 20Hadoop% 20for% 20the% 20cluster. pdf, Retrieved online October. [9] Apach Kafka. http://kafka.apache.org/ [10] Manuel, P. D., & AlGhamdi, J. (2003). A data-centric design for n-tier architecture. Information Sciences, 150(3), 195-206. [11] Ding, Y. S., Hu, Z. H., & Sun, H. B. (2008). An antibody network inspired evolutionary framework for distributed object computing. Information Sciences,178(24), 4619-4631. [12] Sumbaly, R., Kreps, J., & Shah, S. (2013, June). The big data ecosystem at linkedin. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (pp. 1125-1134). ACM. [13] Kreps, J., Narkhede, N., & Rao, J. (2011, June). Kafka: A distributed messaging system for log processing. In Proceedings of 6th International Workshop on Networking Meets Databases (NetDB), Athens, Greece. [14] Joshi, R. (2007). Data-Oriented Architecture: A Loosely-Coupled Real-Time SOA. Real-Time Innovations, Inc, CA, Tech. Rep [15] Netty. http://netty.io/index.html [16] TIBCO. http://www.tibco.com/ [17] Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010, June). Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (pp. 10-10). [18] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., Mccauley, M., ... & Stoica, I. (2012). Fast and interactive analytics over Hadoop data with Spark.USENIX; login, 37(4), 45-51. [19] Zaharia, T. H. T. D. M., Bayen, A., Abbeel, P., & Hunter, T. Large-Scale Online Expectation Maximization with Spark Streaming. eecs. berkeley. edu, 1-5. [20] Buyya, R., Broberg, J., & Goscinski, A. M. (Eds.). (2010). Cloud computing: Principles and paradigms (Vol. 87). John Wiley & Sons. [21] Cloudera. http://www.cloudera.com/content/cloudera/en/home.html [22] Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., ... & Baldeschwieler, E. (2013, October). Apache hadoop yarn: Yet another resource negotiator. In Proceedings of the 4th annual Symposium on Cloud Computing (p. 5). ACM. [23] Node.js. https://nodejs.org
描述	碩士國立政治大學資訊管理研究所 102356040
資料來源	http://thesis.lib.nccu.edu.tw/record/#G0102356040
資料類型	thesis

dc.contributor.advisor	劉文卿<br>張景堯	zh_TW
dc.contributor.advisor	Liou, Wen Ching<br>Chang, Jiing Yao	en_US
dc.contributor.author (Authors)	曾柏崴	zh_TW
dc.contributor.author (Authors)	Tseng, Po Wei	en_US
dc.creator (作者)	曾柏崴	zh_TW
dc.creator (作者)	Tseng, Po Wei	en_US
dc.date (日期)	2015	en_US
dc.date.accessioned	17-Aug-2015 14:08:24 (UTC+8)	-
dc.date.available	17-Aug-2015 14:08:24 (UTC+8)	-
dc.date.issued (上傳時間)	17-Aug-2015 14:08:24 (UTC+8)	-
dc.identifier (Other Identifiers)	G0102356040	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/77556	-
dc.description (描述)	碩士	zh_TW
dc.description (描述)	國立政治大學	zh_TW
dc.description (描述)	資訊管理研究所	zh_TW
dc.description (描述)	102356040	zh_TW
dc.description.abstract (摘要)	近年來，隨著大數據時代的來臨，即時資料運算面臨許多挑戰。例如在期貨交易預測方面，為了精準的預測市場狀態，我們需要在海量資料中建立預測模型，且耗時在數十毫秒之內。在本研究中，我們將介紹一套即時巨量資料運算架構，這套架構將解決在實務上需要解決的三大需求：高速處理需求、巨量資料處理以及儲存需求。同時，在整個平行運算系統之下，我們也實作了數種人工智慧演算法，例如SVM (Support Vector Machine)和LR (Logistic Regression)等，做為策略模擬的子系統。本架構包含下列三種主要的雲端運算技術： 1. 使用Apache YARN以整合整體系統資源，使叢集資源運用更具效率。 2. 為滿足高速處理需求，本架構使用Apache Storm以便處理海量且即時之資料流。同時，借助該框架，可在數十毫秒之內，運算上千種市場狀態數值供模型建模之用。 3. 運用Apache Spark，本研究建立了一套分散式運算架構用於模型建模。藉由使用Spark RDD(Resilient Distributed Datasets)，本架構可將SVM和LR之模型建模時間縮短至數百毫秒之內。為解決上述需求，本研究設計了一套n層分散式架構且整合上列數種技術。另外，在該架構中，我們使用Apache Kafka作為整體系統之訊息中介層，並支持系統內各子系統間之非同步訊息溝通。	zh_TW
dc.description.abstract (摘要)	With the coming of the era of big data, the immediacy and the amount of data computation are facing with many challenges. For example, for Futures market forecasting, we need to accurately forecast the market state with the model built from large data (hundreds of GB to tens of TB) within tens of milliseconds. In this research, we will introduce a real-time big data computing architecture to resolve requests of high speed processing, the immense volume of data and the request of large data processing. In the meantime, several algorithms, such as SVM (Support Vector Machine, SVM) and LR (Logistic Regression, LR), are implemented as a subproject under the parallel distributed computing system. This architecture involves three main cloud computing techniques: 1. Use Apache YARN as a system of integrated resource management in order to apply cluster resources more efficiently. 2. To satisfy the requests of high speed processing, we apply Apache Storm in order to process large real-time data stream and compute thousands of numerical value within tens of milliseconds for following model building. 3. With Apache Spark, we establish a distributed computing architecture for model building. By using Spark RDD (Resilient Distributed Datasets, RDD), this architecture can shorten the execution time to within hundreds of milliseconds for SVM and LR model building. To resolve the requirements of the distributed system, we design an n-tier distributed architecture to integrate the foregoing several techniques. In this architecture, we use the Apache Kafka as the messaging middleware to support asynchronous message-based communication.	en_US
dc.description.tableofcontents	Table of Contents i Table of Figures iii Table of tables iv 【Abstract】 v Introduction 1 The Background of the High-Frequency Trading System 1 Design the n-tier Distributed Computing Architecture 2 Related Work 3 The N-tier Distributed Computing Architecture 3 High-Throughput Distributed Messaging System──Kafka and Netty 4 Real-time Streaming Data Processing──Storm 5 Forecasting Model Building──Spark 7 Resource Management──Cloudera and YARN 8 System Architecture 9 Overview of the Distributed Computing HFT System 9 I. Presentation Tier (PT) 11 II. Front-end Switching Tier (FST) 11 III. Back-end Switching Tier (BST) 12 IV. Real-time Business Logic Tier (RBLT) 12 V. Non-real-time Business Logic Tier (NRBLT) 14 VI. Data Access Tier (DAT) 15 Lightning Calculation for Market States and Low Latency Storage──State Center 16 Build Model Fast Over Big Data Sets──Plan Center 18 Forecast Market Trends Rapidly and Accurately──Trade Center 19 The Integration of State Center and Trade Center 19 Cluster Resources Management 20 Experiments 21 Experimental Environment 21 Implementation of the Experiments 22 Low Latency for Computation 25 Conclusion 26 NOTES 28 REFERENCES 29	zh_TW
dc.format.extent	1110999 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (資料來源)	http://thesis.lib.nccu.edu.tw/record/#G0102356040	en_US
dc.subject (關鍵詞)	Apache YARN	zh_TW
dc.subject (關鍵詞)	Apache Storm	zh_TW
dc.subject (關鍵詞)	Apache Spark	zh_TW
dc.subject (關鍵詞)	大數據處理	zh_TW
dc.subject (關鍵詞)	即時預測	zh_TW
dc.subject (關鍵詞)	Apache YARN	en_US
dc.subject (關鍵詞)	Apache Storm	en_US
dc.subject (關鍵詞)	Apache Spark	en_US
dc.subject (關鍵詞)	Big Data Processing	en_US
dc.subject (關鍵詞)	Real-time Forecasting	en_US
dc.title (題名)	分散式計算系統及巨量資料處理架構設計-基於YARN, Storm及Spark	zh_TW
dc.title (題名)	Distributed computing system and big data real-time processing structure—based on YARN, Storm and Spark	en_US
dc.type (資料類型)	thesis	en
dc.relation.reference (參考文獻)	[1] Toshniwal, A., Taneja, S., Shukla, A., Ramasamy, K., Patel, J. M., Kulkarni, S., ... & Ryaboy, D. (2014, June). Storm@ twitter. InProceedings of the 2014 ACM SIGMOD international conference on Management of data (pp. 147-156). ACM. [2] Apache Storm. https://storm.apache.org/ [3] Jones, M. T. (2013). Process real-time big data with Twitter Storm. IBM Technical Library. [4] Aarsten, A., Brugali, D., & Menga, G. (1996). Patterns for three-tier client/server applications. Proceedings of Pattern Languages of Programs (PLoP’96), 4-6. [5] Hirschfeld, R. (1996). Three-tier distribution architecture. Pattern Languages of Programs (PloP). [6] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., ... & Stoica, I. (2012, April). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (pp. 2-2). USENIX Association. [7] Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010, June). Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (pp. 10-10). [8] Hansen, C. A. (2012). Optimizing Hadoop for the cluster. Institue for Computer Science, University of Troms0, Norway, http://oss. csie. fju. edu. tw/~ tzu98/Optimizing% 20Hadoop% 20for% 20the% 20cluster. pdf, Retrieved online October. [9] Apach Kafka. http://kafka.apache.org/ [10] Manuel, P. D., & AlGhamdi, J. (2003). A data-centric design for n-tier architecture. Information Sciences, 150(3), 195-206. [11] Ding, Y. S., Hu, Z. H., & Sun, H. B. (2008). An antibody network inspired evolutionary framework for distributed object computing. Information Sciences,178(24), 4619-4631. [12] Sumbaly, R., Kreps, J., & Shah, S. (2013, June). The big data ecosystem at linkedin. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (pp. 1125-1134). ACM. [13] Kreps, J., Narkhede, N., & Rao, J. (2011, June). Kafka: A distributed messaging system for log processing. In Proceedings of 6th International Workshop on Networking Meets Databases (NetDB), Athens, Greece. [14] Joshi, R. (2007). Data-Oriented Architecture: A Loosely-Coupled Real-Time SOA. Real-Time Innovations, Inc, CA, Tech. Rep [15] Netty. http://netty.io/index.html [16] TIBCO. http://www.tibco.com/ [17] Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010, June). Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (pp. 10-10). [18] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., Mccauley, M., ... & Stoica, I. (2012). Fast and interactive analytics over Hadoop data with Spark.USENIX; login, 37(4), 45-51. [19] Zaharia, T. H. T. D. M., Bayen, A., Abbeel, P., & Hunter, T. Large-Scale Online Expectation Maximization with Spark Streaming. eecs. berkeley. edu, 1-5. [20] Buyya, R., Broberg, J., & Goscinski, A. M. (Eds.). (2010). Cloud computing: Principles and paradigms (Vol. 87). John Wiley & Sons. [21] Cloudera. http://www.cloudera.com/content/cloudera/en/home.html [22] Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., ... & Baldeschwieler, E. (2013, October). Apache hadoop yarn: Yet another resource negotiator. In Proceedings of the 4th annual Symposium on Cloud Computing (p. 5). ACM. [23] Node.js. https://nodejs.org	zh_TW

Publications-Theses

Article View/Open

Publication Export

Google ScholarTM

NCCU Library

Citation Infomation

Related Publications in TAIR

Google Scholar^TM