Title 資源感知之社群媒體資料搜集平台:以推特為例
A resource-aware data collection platform for Twitter
Author Shiu, Shih Yung (許矢勇)
Contributors Chen, Kung (陳恭)
Shiu, Shih Yung (許矢勇)
Keywords Twitter
Resource-aware
Social media
Date 2013
Upload time 25-Aug-2014 15:21:49 (UTC+8)
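Abstract (translated from Chinese) In recent years, social media such as Twitter, Facebook, and Sina Weibo have flourished. Their user bases keep growing, and they have become an important channel through which people keep in touch with friends and obtain information in daily life. For communication and sociology scholars, the massive data held by the social media giants is an important resource for research on related topics. Although the major social media platforms all provide application programming interfaces (APIs) for data retrieval to some extent, they also impose restrictions on data collectors that make collection difficult. In short, researchers must try to optimize the quality and quantity of the data sets they can obtain under the limited resources these platforms provide. This study therefore takes Twitter as its target and implements a resource-aware social media data collection platform to help scholars collect tweets.
First, the platform adopts an event-job concept: users choose different keywords for an event they care about, and each keyword corresponds to a job in the system. Second, each job must hold an access token in order to collect tweets, and each token can retrieve only a certain number of tweets within a given period, so tokens are the platform's main resource. To cope with the surge of tweets that commonly accompanies special events, the platform provides a token pool mechanism that lets many jobs share token resources. It also makes use of the access options of the Twitter API so that users can optimize the number of tweets obtainable according to when the data are collected. For the system core, this study proposes the concept of a "Mansion Household Service": through the division of labor among the minions in the service group, the system can run multiple different jobs concurrently even with limited resources, effectively reducing the impact of Twitter's restrictions on tweet collection. We also verify the platform's tweet collection capability empirically.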
With the rapid development of social media such as Twitter, Facebook, and Weibo, people now use social media as a major channel for interpersonal communication and as a daily source of information. For scholars in the social sciences and humanities, the digital footprints that people leave on these platforms are a rich resource for the study of human behavior. However, social media platforms usually impose resource restrictions, such as rate limiting, on how scholars may use their APIs to retrieve data. We therefore design and implement a resource-aware data collection platform for Twitter to help scholars retrieve historical tweets effectively and efficiently.
Our platform employs an event-job approach to help users organize collection tasks and the tweets to be collected. Because each job requires an access token to fetch tweets, the platform provides a pool of tokens that system jobs share, so that access tokens are utilized as fully as possible. In addition, we leverage the tweet-id options in the Twitter API to let users optimize the number of tweets collected depending on the timing of collection. For the core of the tweet collection system, we propose a "Mansion Household System" in which four minions cooperate to launch different jobs simultaneously, alleviating the impact of the restrictions that Twitter imposes via access tokens. To validate our design, we conducted a series of experiments, and the results are satisfactory.
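The token pool described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation (which the record does not show); the class names `Token` and `TokenPool`, the per-window request limit, the window length, and the "pick the token with the most remaining budget" policy are all assumptions made for illustration.

```python
import time


class Token:
    """One access token with a fixed request budget per rate-limit window."""

    def __init__(self, name, limit, window):
        self.name = name
        self.limit = limit                 # requests allowed per window (assumed value)
        self.window = window               # window length in seconds (assumed value)
        self.remaining = limit
        self.reset_at = float("-inf")      # time at which the budget is next restored

    def refresh(self, now):
        # Once the current window has expired, restore the full budget
        # and start a new window.
        if now >= self.reset_at:
            self.remaining = self.limit
            self.reset_at = now + self.window


class TokenPool:
    """Shares a set of tokens among jobs, preferring the least-used token."""

    def __init__(self, tokens, clock=time.monotonic):
        self._tokens = list(tokens)
        if not self._tokens:
            raise ValueError("a token pool needs at least one token")
        self._clock = clock

    def acquire(self):
        """Spend one request from the token with the most budget left.

        Returns the chosen Token, or None when every token is exhausted
        for the current window.
        """
        now = self._clock()
        for token in self._tokens:
            token.refresh(now)
        best = max(self._tokens, key=lambda t: t.remaining)
        if best.remaining == 0:
            return None
        best.remaining -= 1
        return best

    def seconds_until_available(self):
        """How long a job should wait before some token has budget again."""
        now = self._clock()
        for token in self._tokens:
            token.refresh(now)
        if any(t.remaining > 0 for t in self._tokens):
            return 0.0
        return min(t.reset_at for t in self._tokens) - now
```

Under these assumptions, a collection job would call `acquire()` before each API request and, when it returns `None`, sleep for `seconds_until_available()` before retrying. Sharing the pool means a job tracking a sudden burst of tweets can draw on budget that idle jobs are not using.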
References 【1】 Shamanth Kumar, Fred Morstatter, Huan Liu. August 19, 2013. Twitter Data Analytics.
【2】 周玉駿. 2013. 實作推特社群媒體的資料蒐集與管理服務 (Implementing a data collection and management service for Twitter social media).
【3】 Adam Marcus, Michael S. Bernstein, Osama Badar, David R. Karger, Samuel Madden, Robert C. Miller. 2012. Processing and Visualizing the Data in Tweets.
【4】 Lance Reagan Vick, Titus Soporan, Daniel Robert Lewis, Jane Brooks Zurn. 2012. Hybrid Browser/Server Collection of Streaming Social Media Data for Scalable Real-Time Analysis.
【5】 Matko Bosnjak, Eduardo Oliveira, Jose Martins, Eduarda Mendes Rodrigues, Luis Sarmento. 2012. TwitterEcho: A Distributed Focused Crawler to Support Open Research with Twitter Data.
【6】 Axel Bruns, Yuxian Eugene Liang. April 2012. Tools and methods for capturing Twitter data during natural disasters.
【7】 Twitter Application-only authentication: https://dev.twitter.com/docs/auth/application-only-auth
【8】 Twitter Search API: https://dev.twitter.com/docs/using-search
【9】 Aditi Das. January 17, 2008. Understanding JPA, Part 1: The object-oriented paradigm of data persistence. http://www.javaworld.com/article/2077817/java-se/understanding-jpa-part-1-the-object-oriented-paradigm-of-data-persistence.html
【10】 Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides. August 1994. Design Patterns: Elements of Reusable Object-Oriented Software.
【11】 Adam Green. February 15, 2013. Twitter API Engagement Programming with PHP and MySQL.
Description Master's
National Chengchi University
Department of Computer Science
100971001
102
Source http://thesis.lib.nccu.edu.tw/record/#G0100971001
Type thesis
dc.contributor.advisor Chen, Kung (陳恭)
dc.contributor.author (Authors) Shiu, Shih Yung (許矢勇)
dc.identifier (Other Identifiers) G0100971001
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/69229
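dc.description.tableofcontents Chapter 1 Introduction
1.1 Preface
1.2 Research Motivation
1.3 Research Objectives
1.4 Research Results
1.5 Thesis Outline
Chapter 2 Related Concepts and Technical Background
2.1 Model-View-Controller (MVC)
2.2 Spring and MVC
2.3 Object-relational mapping (ORM)
2.3.1 OpenJPA
2.3.2 c3p0 DataSources Pools
2.4 Twitter Data Collection
2.4.1 OAuth
2.4.2 Twitter API
2.4.3 Twitter4j
2.5 Scheduling and Quartz
2.6 Queues and RabbitMQ
2.7 Front-end Technologies
2.7.1 jQuery
2.7.2 jQWidgets
2.7.3 jVectorMap
2.7.4 Java Server Pages Standard Tag Library (JSTL)
2.8 Related Tool: YourTwapperKeeper
Chapter 3 System Design and Architecture
3.1 System Design Philosophy
3.1.1 MVC Architecture
3.1.2 Core System Logic
3.2 Database Access
3.2.1 Service Layer and Data Access Object (DAO) Design Pattern
3.2.2 Table Design
3.3 System Feature Walkthrough
3.3.1 Data Collection
3.3.2 Analysis and Statistics of Collected Data
3.3.3 System Administration
3.4 Twitter Data Collection in Depth
3.4.1 Restrictions Imposed by Twitter
3.4.2 Efficient Tweet Collection Using a Resource-Aware Access Token Pool
3.4.3 Scheduling Collection Jobs
3.5 Implementing the Service Group for Twitter Data Collection
3.5.1 Doorman
3.5.2 Butler
3.5.3 HouseKeeper
3.5.4 Guardian
Chapter 4 System Function Verification
4.1 Case Design
4.2 Case Analysis and Discussion
4.3 Comparing Tweet Collection on This Platform and YourTwapperKeeper
4.3.1 Case Design
4.3.2 Comparison and Analysis
Chapter 5 Conclusions and Suggestions
5.1 Conclusions
5.2 Future Work and Suggestions
References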
dc.format.extent 2734160 bytes
dc.format.mimetype application/pdf
dc.language.iso en_US