Title: Using R and Hadoop/MapReduce for FOAF-based Social Network Analytics (整合R與Hadoop/MapReduce來分析FOAF社群網路)
Author: Sun, Jhao Siang (孫肇祥)
Contributors: Hu, Yuh Jong (胡毓忠), advisor; Sun, Jhao Siang (孫肇祥), author
Keywords: RDF(S); R and Hadoop/MapReduce; FOAF; Hadoop; MapReduce; Social network analytics
Date: 2013
Uploaded: 6 August 2014, 11:47:06 (UTC+8)
Abstract:
Decentralized online social networks encode users' personal data and social relationships in the RDF(S)-based FOAF format, and these FOAF datasets are stored on a trusted third-party Hadoop cluster. At this volume, traditional analysis techniques run into numerous processing and storage problems. In this study, we apply three techniques that integrate R with Hadoop/MapReduce to analyze high-volume FOAF data: R + Hadoop Streaming (RHS), R + MySQL (RMS), and R + Hive (RH). We first ingest the FOAF datasets into the Hadoop cluster and pre-process them with the MapReduce distributed programming paradigm, then apply R to the reduced data. This addresses two problems at once: R on a single machine cannot read a very large FOAF dataset into memory, and MapReduce computation alone cannot perform in-depth social network analysis. Because the data is distributed and broken down in advance on the Hadoop platform, the approach scales to larger FOAF datasets, and it applies to both structured and unstructured data. Future work includes integrating R and Hadoop/MapReduce more closely and leveraging HBase or existing parallel R packages for high-volume data analytics.
Description: Master's thesis
National Chengchi University
Department of Computer Science
95971012
102
Source: http://thesis.lib.nccu.edu.tw/record/#G0095971012
Type: thesis
Identifier: G0095971012
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/68266
Format: application/pdf, 9299565 bytes
Language: en_US
Table of contents:
Abstract (Chinese)
Abstract (English)
Acknowledgments
Chapter 1 Introduction
1.1 Research Motivation
1.2 Research Objectives
1.3 Chapter Overview
Chapter 2 Background
2.1 Hadoop
2.2 Hive
2.3 R
Chapter 3 Related Work
3.1 FOAF (Friend of A Friend)
3.2 Social Network Analysis (SNA)
3.3 Integration of R and Hadoop
3.3.1 RHadoop
3.3.2 RHIPE
3.3.3 Hadoop Streaming
Chapter 4 Method and Architecture Design
4.1 Research Framework
4.2 FOAF Analysis
4.2.1 R + Hadoop Streaming Analysis (RHS Analytics)
4.2.2 R + MySQL Analysis (RMS Analytics)
4.2.3 R + Hive Analysis (RH Analytics)
Chapter 5 System Implementation
5.1 System Architecture
5.2 Data Sources
5.3 FOAF Data Analysis
5.3.1 R + Hadoop Streaming Analysis (RHS Analytics)
5.3.2 R + MySQL Analysis (RMS Analytics)
5.3.3 R + Hive Analysis (RH Analytics)
5.3.4 Performance Comparison
Chapter 6 Conclusion and Future Work
References
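
Illustrative sketch (not taken from the thesis): the abstract describes a two-stage workflow in which MapReduce first reduces the raw FOAF data and R then analyzes the much smaller result. The Python code below is a minimal, locally runnable imitation of a Hadoop Streaming style mapper and reducer that counts, for each person, how many `foaf:knows` statements they make. It assumes the input is in N-Triples-like form, one triple per line. The function names (`map_line`, `reduce_pairs`, `run_local`) and the sample URIs are hypothetical, and Python is used here only for illustration; the record does not show the thesis's own scripts.

```python
from itertools import groupby

# Predicate URI from the FOAF vocabulary (http://xmlns.com/foaf/spec/).
FOAF_KNOWS = "<http://xmlns.com/foaf/0.1/knows>"


def map_line(line):
    """Mapper: emit (subject, 1) for every foaf:knows triple.

    Assumption: whitespace-separated N-Triples-like lines of the form
    "<subject> <predicate> <object> ." (an optional fourth context term
    is ignored).
    """
    parts = line.strip().rstrip(".").split()
    if len(parts) >= 3 and parts[1] == FOAF_KNOWS:
        yield parts[0], 1


def reduce_pairs(pairs):
    """Reducer: sum the counts per key; pairs must be sorted by key."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce on one machine."""
    mapped = [kv for line in lines for kv in map_line(line)]
    mapped.sort(key=lambda kv: kv[0])  # stands in for the shuffle/sort phase
    return dict(reduce_pairs(mapped))


# Hypothetical sample data; example.org URIs are placeholders.
SAMPLE = [
    "<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .",
    "<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/carol> .",
    "<http://example.org/bob> <http://xmlns.com/foaf/0.1/knows> <http://example.org/alice> .",
    '<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice Example" .',
]

if __name__ == "__main__":
    # Print a small tab-separated table: person, number of foaf:knows links.
    for person, out_degree in sorted(run_local(SAMPLE).items()):
        print(f"{person}\t{out_degree}")
```

The output is a small tab-separated table of the kind that could then be loaded into R (for example with `read.delim`) for further social network analysis, which is the division of labor the abstract describes.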