Design and Implementation of a Twitter Data Collection and Management Service (實作推特社群媒體的資料蒐集與管理服務)

Publications-Theses

Title	Design and Implementation of a Twitter Data Collection and Management Service (實作推特社群媒體的資料蒐集與管理服務)
Author	周玉駿 Chou, Yu Chun
Contributors	陳恭 Chen, Kung (advisor); 周玉駿 Chou, Yu Chun (author)
Keywords	推特 (Twitter); 社群媒體 (social media); Social networks; NoSQL
Date	2012
Upload time	1-Oct-2013 13:47:07 (UTC+8)
Abstract	The rise of social media such as Twitter has significantly changed how people communicate in modern society. By collecting, storing, and analyzing the massive amount of data that users generate when they interact, researchers can conduct more in-depth work in many areas, including crisis informatics, trend analysis, and social network analysis. To let researchers focus on analyzing data, a robust data collection and management platform is needed. In this thesis, we investigate the issues and restrictions of current tweet collection and large-scale data storage, define a set of basic system design principles, and present a modular design and implementation of a tweet collection and management platform based on Twitter's API. Two salient features of the platform are event-job based data collection tasks and an access token pool. Researchers can launch multiple jobs to collect the tweets related to an event while keeping repeated collection conditions, and therefore duplicate tweets, to a minimum. With the one-job, one-access-token approach, jobs run separately and do not consume each other's rate limit. In addition, because many events produce bursts of tweets and storing large volumes of data tends to run into hardware scalability limits, the platform first stores the collected data in HBase, a popular NoSQL system, and then quickly migrates it to a standard relational database. To evaluate the platform, we conducted several data collection experiments and compared it with two other tweet collection tools used by many researchers. The preliminary results show that our platform has certain advantages over them.
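The event-job model and the one-job, one-access-token rule described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the class names (`AccessTokenPool`, `Collector`, `Event`, `Job`), the keyword-based notion of a "collection condition", and the rule that a new job only takes keywords not already covered by the same event are all assumptions made for illustration. No Twitter API calls are made; tokens are plain strings.

```python
from dataclasses import dataclass, field


class TokenPoolExhausted(Exception):
    """Raised when every access token is already bound to a job."""


@dataclass(frozen=True)
class Job:
    job_id: int
    keywords: frozenset  # collection conditions owned by this job only
    token: str           # the single access token this job runs under


@dataclass
class Event:
    name: str
    jobs: list = field(default_factory=list)


class AccessTokenPool:
    """Hands out each token to at most one job at a time (hypothetical)."""

    def __init__(self, tokens):
        self._free = list(tokens)
        self._in_use = {}  # job_id -> token

    def acquire(self, job_id):
        if not self._free:
            raise TokenPoolExhausted("no free access token")
        token = self._free.pop()
        self._in_use[job_id] = token
        return token

    def release(self, job_id):
        token = self._in_use.pop(job_id, None)
        if token is not None:
            self._free.append(token)


class Collector:
    def __init__(self, pool):
        self._pool = pool
        self._next_id = 1

    def launch_job(self, event, keywords):
        """Start a job for the keywords no other job of this event covers yet."""
        wanted = {k.lower() for k in keywords}
        covered = set().union(*(job.keywords for job in event.jobs))
        new_keywords = frozenset(wanted - covered)
        if not new_keywords:
            return None  # nothing new to collect, so no token is spent
        job_id = self._next_id
        token = self._pool.acquire(job_id)  # may raise TokenPoolExhausted
        self._next_id += 1
        job = Job(job_id, new_keywords, token)
        event.jobs.append(job)
        return job
```

Because each job holds its own token, a job that exhausts its rate limit does not slow down the others; the cost of this design is that the number of concurrent jobs is bounded by the number of tokens in the pool.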
Description	碩士 (Master's degree); 國立政治大學 (National Chengchi University); 資訊科學學系 (Department of Computer Science); 99971018; 101
Source	http://thesis.lib.nccu.edu.tw/record/#G0099971018
Type	thesis

dc.contributor.advisor	陳恭	zh_TW
dc.contributor.advisor	Chen, Kung	en_US
dc.contributor.author (Authors)	周玉駿	zh_TW
dc.contributor.author (Authors)	Chou, Yu Chun	en_US
dc.creator (Author)	周玉駿	zh_TW
dc.creator (Author)	Chou, Yu Chun	en_US
dc.date (Date)	2012	en_US
dc.date.accessioned	1-Oct-2013 13:47:07 (UTC+8)	-
dc.date.available	1-Oct-2013 13:47:07 (UTC+8)	-
dc.date.issued (Upload time)	1-Oct-2013 13:47:07 (UTC+8)	-
dc.identifier (Other Identifiers)	G0099971018	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/61199	-
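dc.description (Description)	碩士 (Master's degree)	zh_TW
dc.description (Description)	國立政治大學 (National Chengchi University)	zh_TW
dc.description (Description)	資訊科學學系 (Department of Computer Science)	zh_TW
dc.description (Description)	99971018	zh_TW
dc.description (Description)	101	zh_TW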
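dc.description.tableofcontents	1. Introduction: 1.1 Research Background and Motivation; 1.2 Research Objectives; 1.3 Research Results; 1.4 Research Limitations. 2. Related Work and Review: 2.1 Tweet Collection and Management Systems Built with the Whitelist Mechanism; 2.2 Tweet Collection and Management Systems Built without the Whitelist Mechanism; 2.3 Project EPIC at the University of Colorado; 2.4 Twitter API; 2.5 Hadoop Ecosystem; 2.6 Spring; 2.7 Spring MVC; 2.8 OpenJPA; 2.9 MySQL Full-Text Search. 3. Research Method and System Architecture: 3.1 Overall System Architecture; 3.2 "Event-Job" User Interface Design and Multi-Token Data Collection; 3.3 NoSQL Data Storage and a Conversion-Friendly Data Model; 3.4 Design of Each Service (3.4.1 Data Collection Service; 3.4.2 Data Access Service; 3.4.3 System Management Service; 3.4.4 Data Synchronization Service and Data Analysis Service); 3.5 Common Tweet Data Analysis. 4. Experiments, Simulation, and Discussion: 4.1 Experimental Design; 4.2 Experimental Data; 4.3 Experimental Discussion. 5. Conclusion: 5.1 Contributions; 5.2 Future Work. 6. Index.	zh_TW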
dc.language.iso	en_US	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0099971018	en_US
dc.subject (Keywords)	推特 (Twitter)	zh_TW
dc.subject (Keywords)	社群媒體 (social media)	zh_TW
dc.subject (Keywords)	Social networks	en_US
dc.subject (Keywords)	NoSQL	en_US
dc.title (Title)	實作推特社群媒體的資料蒐集與管理服務	zh_TW
dc.title (Title)	Design and Implementation of a Twitter Data Collection and Management Service	en_US
dc.type (Type)	thesis	en
