Publications-Theses

Article View/Open

Publication Export

Google ScholarTM

NCCU Library

Citation Infomation

Related Publications in TAIR

題名 基於MapReduce之雲端運算下具地域特性之動態排程
Dynamic locality driven scheduler for mapreduce based cloud computing
作者 陳耀宗
Chen, Yao Chung
貢獻者 連耀南
Lien, Yao Nan
陳耀宗
Chen, Yao Chung
關鍵詞 雲端運算
動態排程
Cloud Computing
Dynamic Locality Driven Scheduler
日期 2011
上傳時間 1-Feb-2013 16:53:37 (UTC+8)
摘要 MapReduce 是目前最熱門的雲端技術之一,用來處理大量資料,不論資料探勘、非結構化的紀錄檔、網頁索引處理及其他需要大量資料處理的科學研究,都可透過 MapReduce 得到極佳的執行效率。MapReduce 為一分散式批次資料處理程式框架,將一個工作分解為許多較小的 map 任務以及 reduce 任務,由map 處理每個小問題,再由reduce將問題彙整,得到最終的結果。
     Hadoop 是一個開放原始碼的 MapReduce 架構,並且被廣泛地應用在以大規模資料運算為主的雲端計算。Hadoop有一個非常重要的元件稱為scheduler ,是 hadoop的中樞,負責調度、指派任務和資源分配的優先順序。Scheduler的任務選擇與分配方式,將會影響 MapReduce 工作的執行效率與整個叢集的使用率,目前Hadoop預設的scheduler是將任務以先進先出(FIFO)的方式進行排程。提升MapReduce運算效能的挑戰之一為如何適當的分配Mapper 和 Reducer給雲端裡的每個節點來執行。儘管過去已經有許多改善MapReduce運算效能的研究,但是大部分的方法在實際的運作中,仍存在很多的問題,如工作節點的動態負載、data locality的問題,計算節點的異質性等等。我們發現目前Hadoop對於這些問題並沒有妥善處理,並且在相關的情況下,整體效能仍有改進空間。
     我們提出Data Locality Driven Scheduler(DLDS)的方法,並實踐在 Hadoop上,試圖提高scheduler的效能。我們設計不同的實驗,比較DLDS在不同狀況下和其他的排程演算法的差異。實驗結果顯示,透過提高資料的地域性,平均可提昇10% 至 15% 的效能。
MapReduce is programming model for processing large data set. It is typically used to do distributed computing on clusters of computers such as Cloud computing platform. Examples of bit data set include unstructured logs, web indexing, scientific data, surveillance data, etc.
     MapReduce is a distributed processing program framework, a computing job is broken down into many smaller Map tasks and a Reduce task.Each Map task processes a partition of the given data set and Reduce aggregates the results of Maps to produce final result.
     Hadoop is an open-source MapReduce architecture, and is widely used in many cloud-based services.To best utilize computing resource in a cloud server, a task scheduler is essential to assign tasks to appropriate processors as well as to prioritize resource allocation. The default scheduler of Hadoop is first-in-first-out (FIFO) scheduler which is simple but has a performance inefficiency yet to be improved. Although there have been many researches aiming to improve the performance of MapReduce platform in the past year, there still have many issues hindering the performance improvement, such as dynamic load balance, data locality, and heterogeneity of computing nodes.
     To improve data locality, we propose a new scheduler called Data Locality Driven Scheduler (DLDS) based on Hadoop platform. DLDS improve Hadoop`s performamce by allocating Map tasks as close as possible to the data block they are to process. We evaluated the proposed DLDS against several other schedulers by simulation on an 8 nodes real Hadoop system. Experimental results show that DLDS can improve data locality by 10-15%, which results in a significant performamce improvement.
參考文獻 [1] Apache Software Foundation. FairScheduler, http://hadoop.apache.org/mapreduce/docs/r0.21.0/fair_scheduler.html.
     [2] Apache Software Foundation. CapacityScheduler http://hadoop.apache.org/mapreduce/docs/r0.21.0/capacity_scheduler.html
     [3] Big data,http://wikibon.org/wiki/v/Big_Data_Market_Size_and_Vendor_Revenues
     [4] Jeffrey Dean , Sanjay Ghemawat,"MapReduce: simplified data processing on large clusters, Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation", p.10-10, San Francisco, CA,December 06-08, 2004
     [5] Jinling Du, and Dalian Liu,"Hybrid Genetic Algorithm for the Multi-objective Flexible Schedu ling Problem," IEEE International Conference on Computational Intelligence and Security,Nanning, China,Dec.2010.
     [6] R.C.Eberhart, and J.Kennedy,"New Optimizer Using Particle Swarm Theory," Proc.Sixth International Symposium on Micro Machine and Human Science, Nagoya, Japan, Oct. 1995.
     [7] Garey, M. and Johnson, D, "Computers and Intractability. A Guide to the Theory of NP-Completeness," Freemann, San Francisco, 1979.
     [8] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” 19th ACM Symposium on Operating Systems Principles(SOSP), 2003
     [9] Hadoop,http://hadoop.apache.org/
     [10] Hadoop Distributed File System (HDFS) Architecture. [Online] Available: http://hadoop.apache.org/core/docs/current/hdfs design.html.
     [11] HBase, http://hadoop.apache.org/hbase/
     [12] Bahareh Jalili,and Mehrdad Dianati,"Application of Taboo Search and Genetic Algorithm in planning and optimization of UMTS radio networks," ACM International Wireless Communications and Mobile Computing Conference 6th, New York, USA, June 2010.
     [13] Karp, R.M.,”Reducibility among combinatorial,” in complexity of computer computations, R.E.Miller and J.W. Thatcher(eds),Plenum Press,NY 85-103,1972
     [14] Andréa Matsunaga, Maurício Tsugawa and José Fortes,“Programming Abstractions for Data Intensive Computing on Clouds and Grids,” IEEE Fourth International Conference on eScience, pp.489-493, 2008.
     [15] Peter Mell and Tim Grance, "The NIST Definition of Cloud Computing," National Institute of Standards and Technology, Information Technology Laboratory, Version 15, Oct.2009.
     [16] Chris Miceli, Michael Miceli, Shantenu Jha, Hartmut Kaiser, Andre Merzky, "Programming Abstractions for Data Intensive Computing on Clouds and Grids,” IEEE/ACM International Symposium on Cluster Computing and the Grid,pp.480-483, 2009.
     [17] María Luisa Santamaría, and Sebastià Galmé,"Multi-objective Simulated Annealing Approachfor Optimal Routing in Time-Driven Sensor Networks,"IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems, Singapore, July 2011.
     [18] OpenPBS.org. Torque resource manager. http://www.clusterresources.com/pages/products/torque-resource-manager.php.
     [19] Ren Qing-dao-er-ji, and Yuping Wang, Xiaojing Si, "An Improved Genetic Algorithm For Job Shop Scheduling Problem,"IEEE International Conference on Computational Intelligence and Security, Nanning, China, Dec. 2010.
     [20] Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience" Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April, 2005.
     [21] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments, " In OSDI’08: 8th USENIX Symposium on Operating Systems Design and Implementation, 2008.
     [22] M.Zaharia, D.Borthankur, J. Sarma, K. Elmellegy, S.Shenker, and I. Stoica, "Delay Scheduling:A Simple Technique for Achieving Locality and Fairness in cluster Scheduling," In EuroSys 2010, pp. 265-278. ACM, New York, 2010.
描述 碩士
國立政治大學
資訊科學學系
97971010
100
資料來源 http://thesis.lib.nccu.edu.tw/record/#G0097971010
資料類型 thesis
dc.contributor.advisor 連耀南zh_TW
dc.contributor.advisor Lien, Yao Nanen_US
dc.contributor.author (Authors) 陳耀宗zh_TW
dc.contributor.author (Authors) Chen, Yao Chungen_US
dc.creator (作者) 陳耀宗zh_TW
dc.creator (作者) Chen, Yao Chungen_US
dc.date (日期) 2011en_US
dc.date.accessioned 1-Feb-2013 16:53:37 (UTC+8)-
dc.date.available 1-Feb-2013 16:53:37 (UTC+8)-
dc.date.issued (上傳時間) 1-Feb-2013 16:53:37 (UTC+8)-
dc.identifier (Other Identifiers) G0097971010en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/56888-
dc.description (描述) 碩士zh_TW
dc.description (描述) 國立政治大學zh_TW
dc.description (描述) 資訊科學學系zh_TW
dc.description (描述) 97971010zh_TW
dc.description (描述) 100zh_TW
dc.description.abstract (摘要) MapReduce 是目前最熱門的雲端技術之一,用來處理大量資料,不論資料探勘、非結構化的紀錄檔、網頁索引處理及其他需要大量資料處理的科學研究,都可透過 MapReduce 得到極佳的執行效率。MapReduce 為一分散式批次資料處理程式框架,將一個工作分解為許多較小的 map 任務以及 reduce 任務,由map 處理每個小問題,再由reduce將問題彙整,得到最終的結果。
     Hadoop 是一個開放原始碼的 MapReduce 架構,並且被廣泛地應用在以大規模資料運算為主的雲端計算。Hadoop有一個非常重要的元件稱為scheduler ,是 hadoop的中樞,負責調度、指派任務和資源分配的優先順序。Scheduler的任務選擇與分配方式,將會影響 MapReduce 工作的執行效率與整個叢集的使用率,目前Hadoop預設的scheduler是將任務以先進先出(FIFO)的方式進行排程。提升MapReduce運算效能的挑戰之一為如何適當的分配Mapper 和 Reducer給雲端裡的每個節點來執行。儘管過去已經有許多改善MapReduce運算效能的研究,但是大部分的方法在實際的運作中,仍存在很多的問題,如工作節點的動態負載、data locality的問題,計算節點的異質性等等。我們發現目前Hadoop對於這些問題並沒有妥善處理,並且在相關的情況下,整體效能仍有改進空間。
     我們提出Data Locality Driven Scheduler(DLDS)的方法,並實踐在 Hadoop上,試圖提高scheduler的效能。我們設計不同的實驗,比較DLDS在不同狀況下和其他的排程演算法的差異。實驗結果顯示,透過提高資料的地域性,平均可提昇10% 至 15% 的效能。
zh_TW
dc.description.abstract (摘要) MapReduce is programming model for processing large data set. It is typically used to do distributed computing on clusters of computers such as Cloud computing platform. Examples of bit data set include unstructured logs, web indexing, scientific data, surveillance data, etc.
     MapReduce is a distributed processing program framework, a computing job is broken down into many smaller Map tasks and a Reduce task.Each Map task processes a partition of the given data set and Reduce aggregates the results of Maps to produce final result.
     Hadoop is an open-source MapReduce architecture, and is widely used in many cloud-based services.To best utilize computing resource in a cloud server, a task scheduler is essential to assign tasks to appropriate processors as well as to prioritize resource allocation. The default scheduler of Hadoop is first-in-first-out (FIFO) scheduler which is simple but has a performance inefficiency yet to be improved. Although there have been many researches aiming to improve the performance of MapReduce platform in the past year, there still have many issues hindering the performance improvement, such as dynamic load balance, data locality, and heterogeneity of computing nodes.
     To improve data locality, we propose a new scheduler called Data Locality Driven Scheduler (DLDS) based on Hadoop platform. DLDS improve Hadoop`s performamce by allocating Map tasks as close as possible to the data block they are to process. We evaluated the proposed DLDS against several other schedulers by simulation on an 8 nodes real Hadoop system. Experimental results show that DLDS can improve data locality by 10-15%, which results in a significant performamce improvement.
en_US
dc.description.tableofcontents 第一章、 簡介 1
     1.1、 雲端運算簡介 2
     1.1.1、 美國國家標準局(NTST)對雲端的定義 3
     1.1.2、 大量資料處理面臨的挑戰 6
     1.2. 排程資源管理的問題 8
     1.2.1、 資料地域性的問題 9
     1.2.2、 磁碟I/O的問題 9
     1.2.3、 資料傳輸的問題 10
     1.3. 排程問題 10
     1.4. 論文架構 11
     第二章、 相關研究 11
     2.1、 MapReduce[4] 11
     2.1.1、 MapReduce背景 11
     2.1.2、 MapReduce基本概念[4] 12
     2.1.3、 MapReduce運算模式 13
     2.2、 Hadoop[9] 14
     2.2.1、 Hadoop簡介[9] 15
     2.2.2、 HDFS介紹 16
     2.2.3、 Hadoop中各個角色的介紹 17
     2.2.4、 Hbase 20
     2.2.5、 Hadoop Scheduler 20
     2.2.6、 Scheduler 相關研究 23
     2.3、 排程問題 26
     2.3.1、 概述 26
     2.3.2、 精確解法 (Exact Algorithm) 27
     2.3.3、 後啟發式演算法 (Meta-heuristics) 29
     2.3.4、 評論 35
     第三章、 Dynamic Locality Driven Scheduler 36
     3.1、 設計理念與目標 36
     3.2、 問題定義 37
     3.2.1、 最大完成時間的定義 37
     3.2.2、 問題描述 37
     3.3、 啟發式排程演算法 41
     3.3.1、 演算法設計 41
     3.3.2、 DLDS演算法 42
     3.3.3、 DLDS演算法範例 47
     第四章、 效能評估 57
     4.1、 實驗設計 57
     4.1.1、 實驗環境 58
     4.1.2、 評估指標 59
     4.2、 實驗參數 60
     4.3、 實驗結果 61
     4.3.1、 實驗一 61
     4.3.2、 實驗二:資料量敏感度測試 64
     4.3.3、 實驗三:異質性敏感度測試 66
     4.3.4、 實驗四:資料移動成本敏感度測試 67
     4.4、 實驗總結 70
     第五章、 結論與未來展望方向 72
     參考文獻 73
zh_TW
dc.language.iso en_US-
dc.source.uri (資料來源) http://thesis.lib.nccu.edu.tw/record/#G0097971010en_US
dc.subject (關鍵詞) 雲端運算zh_TW
dc.subject (關鍵詞) 動態排程zh_TW
dc.subject (關鍵詞) Cloud Computingen_US
dc.subject (關鍵詞) Dynamic Locality Driven Scheduleren_US
dc.title (題名) 基於MapReduce之雲端運算下具地域特性之動態排程zh_TW
dc.title (題名) Dynamic locality driven scheduler for mapreduce based cloud computingen_US
dc.type (資料類型) thesisen
dc.relation.reference (參考文獻) [1] Apache Software Foundation. FairScheduler, http://hadoop.apache.org/mapreduce/docs/r0.21.0/fair_scheduler.html.
     [2] Apache Software Foundation. CapacityScheduler http://hadoop.apache.org/mapreduce/docs/r0.21.0/capacity_scheduler.html
     [3] Big data,http://wikibon.org/wiki/v/Big_Data_Market_Size_and_Vendor_Revenues
     [4] Jeffrey Dean , Sanjay Ghemawat,"MapReduce: simplified data processing on large clusters, Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation", p.10-10, San Francisco, CA,December 06-08, 2004
     [5] Jinling Du, and Dalian Liu,"Hybrid Genetic Algorithm for the Multi-objective Flexible Schedu ling Problem," IEEE International Conference on Computational Intelligence and Security,Nanning, China,Dec.2010.
     [6] R.C.Eberhart, and J.Kennedy,"New Optimizer Using Particle Swarm Theory," Proc.Sixth International Symposium on Micro Machine and Human Science, Nagoya, Japan, Oct. 1995.
     [7] Garey, M. and Johnson, D, "Computers and Intractability. A Guide to the Theory of NP-Completeness," Freemann, San Francisco, 1979.
     [8] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” 19th ACM Symposium on Operating Systems Principles(SOSP), 2003
     [9] Hadoop,http://hadoop.apache.org/
     [10] Hadoop Distributed File System (HDFS) Architecture. [Online] Available: http://hadoop.apache.org/core/docs/current/hdfs design.html.
     [11] HBase, http://hadoop.apache.org/hbase/
     [12] Bahareh Jalili,and Mehrdad Dianati,"Application of Taboo Search and Genetic Algorithm in planning and optimization of UMTS radio networks," ACM International Wireless Communications and Mobile Computing Conference 6th, New York, USA, June 2010.
     [13] Karp, R.M.,”Reducibility among combinatorial,” in complexity of computer computations, R.E.Miller and J.W. Thatcher(eds),Plenum Press,NY 85-103,1972
     [14] Andréa Matsunaga, Maurício Tsugawa and José Fortes,“Programming Abstractions for Data Intensive Computing on Clouds and Grids,” IEEE Fourth International Conference on eScience, pp.489-493, 2008.
     [15] Peter Mell and Tim Grance, "The NIST Definition of Cloud Computing," National Institute of Standards and Technology, Information Technology Laboratory, Version 15, Oct.2009.
     [16] Chris Miceli, Michael Miceli, Shantenu Jha, Hartmut Kaiser, Andre Merzky, "Programming Abstractions for Data Intensive Computing on Clouds and Grids,” IEEE/ACM International Symposium on Cluster Computing and the Grid,pp.480-483, 2009.
     [17] María Luisa Santamaría, and Sebastià Galmé,"Multi-objective Simulated Annealing Approachfor Optimal Routing in Time-Driven Sensor Networks,"IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems, Singapore, July 2011.
     [18] OpenPBS.org. Torque resource manager. http://www.clusterresources.com/pages/products/torque-resource-manager.php.
     [19] Ren Qing-dao-er-ji, and Yuping Wang, Xiaojing Si, "An Improved Genetic Algorithm For Job Shop Scheduling Problem,"IEEE International Conference on Computational Intelligence and Security, Nanning, China, Dec. 2010.
     [20] Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience" Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April, 2005.
     [21] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments, " In OSDI’08: 8th USENIX Symposium on Operating Systems Design and Implementation, 2008.
     [22] M.Zaharia, D.Borthankur, J. Sarma, K. Elmellegy, S.Shenker, and I. Stoica, "Delay Scheduling:A Simple Technique for Achieving Locality and Fairness in cluster Scheduling," In EuroSys 2010, pp. 265-278. ACM, New York, 2010.
zh_TW