Title 運用kNN文字探勘分析智慧型終端App群集之研究
The study of analyzing smart handheld device App's clusters by using kNN text mining
Author 曾國傑
Tseng, Kuo Chieh
Contributors 楊建民
曾國傑
Tseng, Kuo Chieh
Keywords App
kNN
Clustering
Text Mining
Date 2011
Upload time 30-Oct-2012 11:21:18 (UTC+8)
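Abstract As smart handheld devices become increasingly widespread, user demand for Apps has grown steadily, and major companies have accordingly developed a new form of interactive marketing. At the same time, the large business opportunity created by App downloads has drawn many developers into App development, so the number of Apps has grown explosively and users facing such a wide variety of Apps cannot choose efficiently. This study therefore uses text mining and kNN cluster analysis to analyze App recommendation articles posted by internet users and to cluster the Apps; by adjusting parameters and evaluating measurement indicators, it aims to obtain the best-quality clustering as a reference for users selecting Apps.
To cluster a large number of Apps and thereby address users' "information overload", this study takes game Apps in the App Store as its subject. It collected 439 App recommendation articles and, according to whether they recommended the same App, merged them into 357 articles. Text mining techniques then converted the articles into a mutually comparable vector space model, and kNN cluster analysis was used to cluster them. The value of k and the document similarity threshold were adjusted to obtain the best-quality clustering, with quality evaluated through indicators such as mean intra-cluster similarity. To improve clustering quality, the study adopts "multi-stage clustering", using the number of articles in each cluster after clustering to decide whether to re-cluster it or merge it with another cluster.
The results show that first-stage clustering achieves the best quality with k = 10 and a document similarity threshold of 0.025. In later stages, because each cluster contains fewer articles, k is lowered and the document similarity threshold is raised gradually to obtain a clustering effect. After the second stage, keywords can be extracted from the clusters that have met the stopping condition, yielding 6 App types such as "baseball/shooting" and "throwing/flying"; subsequent stages, following the same clustering rules, yield 14 further App types such as "tower defense". When clustering ends, there are 36 clusters and 20 App types in total. Over the course of clustering, mean intra-cluster similarity rises gradually and mean inter-cluster similarity falls gradually; the clustering quality indicator improves from 12.65% after the first stage to 75.81% at the end of the fifth stage.
The study shows that after clustering, highly similar Apps gradually gather into groups, and the names given to the resulting clusters can serve as a reference for users selecting Apps. App developers can also learn from each cluster's keywords which game elements users value and improve their Apps to better meet user needs. Building on these results, future research could construct a specialized lexicon to improve clustering quality, apply document summarization to help users understand each cluster, or build an App recommendation system.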
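The abstract describes converting the articles into a vector space model and clustering them with kNN under a value of k and a document similarity threshold. A minimal sketch of that idea follows, assuming TF-IDF weighting, cosine similarity, and a rule that links each article to its k nearest neighbours whose similarity reaches the threshold, with clusters read off as connected groups. The abstract does not specify the thesis's exact weighting or linking rule, so the function names `tfidf_vectors`, `cosine`, and `knn_clusters` and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn a list of token lists into sparse TF-IDF vectors (dicts)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_clusters(vectors, k, threshold):
    """Assumed rule: link each document to its k most similar documents
    when the similarity is at least `threshold`; clusters are the
    connected groups of linked documents."""
    n = len(vectors)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        sims = sorted(
            ((cosine(vectors[i], vectors[j]), j) for j in range(n) if j != i),
            reverse=True,
        )
        for sim, j in sims[:k]:
            if sim >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy, already-segmented "articles" (hypothetical data).
docs = [
    ["tower", "defense", "enemy"],
    ["tower", "defense", "upgrade"],
    ["baseball", "pitch", "bat"],
    ["baseball", "bat", "home"],
]
print(knn_clusters(tfidf_vectors(docs), k=1, threshold=0.1))  # [[0, 1], [2, 3]]
```

Raising the threshold splits loosely related articles into separate clusters, which is consistent with the abstract's remark that later stages raise the threshold as clusters shrink.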
As smart handheld devices grow in popularity, demand for Apps is spreading. The number of developers pursuing this opportunity is also rising, so the total number of Apps is growing rapidly. Faced with this situation, users cannot efficiently choose the Apps they need. This research uses text mining and kNN clustering to analyze App recommendation reviews written by netizens and then clusters the App recommendation articles. By adjusting parameters and evaluating measurement indicators, we aim to obtain the best-quality clustering as a basis for users selecting Apps.
To address users' information overload, we analyzed Apps in the “Games” category of the App Store and consolidated the collected material into 357 App recommendation articles as our analysis target. We then processed the articles with text mining techniques and grouped them with kNN clustering analysis. At the same time, we tuned the parameters against the measurement indicators to find the optimal clustering. This research uses a multi-phase clustering technique to assure the quality of each cluster.
We identified 36 clusters and 20 categories from the clustering results. During the clustering process, the mean intra-cluster similarity increases gradually, while the mean inter-cluster similarity decreases. The “Cluster Quality” indicator rises significantly, from 12.65% to 75.81%. In conclusion, similar Apps are gradually clustered together, and each cluster's name can serve as a reference for users. App developers can also learn from each cluster's key phrases which game elements users pay most attention to and tailor their content to users' needs. Future work building on these results could construct a specialized App term database to improve clustering quality, use summarization techniques to strengthen users' understanding of each cluster, or build an App recommendation system.
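The abstract tracks clustering quality through the mean intra-cluster and mean inter-cluster similarity. One common way to compute the two means is sketched below, assuming a precomputed symmetric similarity matrix and taking each mean over unordered article pairs. The thesis's own definitions, and how it combines them into the "Cluster Quality" percentage, are not given in the abstract, so the function name `mean_intra_inter` and the toy matrix are illustrative only.

```python
from itertools import combinations


def mean_intra_inter(sim, clusters):
    """Return (mean intra-cluster similarity, mean inter-cluster similarity).

    sim      -- symmetric matrix, sim[i][j] is the similarity of articles i and j
    clusters -- list of clusters, each a list of article indices
    Assumption: both means are taken over unordered pairs of articles.
    """
    label = {}
    for cluster_id, members in enumerate(clusters):
        for i in members:
            label[i] = cluster_id

    intra, inter = [], []
    for i, j in combinations(sorted(label), 2):
        if label[i] == label[j]:
            intra.append(sim[i][j])
        else:
            inter.append(sim[i][j])

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(intra), mean(inter)


# Toy similarity matrix for four articles (hypothetical values).
sim = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.6],
    [0.0, 0.1, 0.6, 1.0],
]
intra, inter = mean_intra_inter(sim, [[0, 1], [2, 3]])
print(round(intra, 3), round(inter, 3))  # 0.7 0.1
```

A clustering in which the first number rises and the second falls from stage to stage matches the trend the abstract reports.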
Description Master's thesis
National Chengchi University
Graduate Institute of Information Management
99356010
100
Source http://thesis.lib.nccu.edu.tw/record/#G0099356010
Type thesis
dc.contributor.advisor 楊建民
dc.identifier (Other Identifiers) G0099356010
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/54564
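dc.description.tableofcontents Chapter 1. Introduction
Section 1. Research Background and Motivation
Section 2. Research Objectives
Chapter 2. Literature Review
Section 1. Smart Handheld Device Applications (Apps)
Section 2. Text Mining
2.2.1. Definition of Text Mining
2.2.2. Framework of Text Mining
2.2.3. Techniques Related to Text Mining
2.2.4. Applying Text Mining to App Recommendation Articles
Section 3. Cluster Analysis
2.3.1. Types of Cluster Analysis
2.3.2. Applying Cluster Analysis to Text Mining
Section 4. k-Nearest Neighbor Algorithm (kNN)
Chapter 3. Research Method and Design
Section 1. Research Framework
Section 2. Data Sources and Processing
3.2.1. Data Sources
3.2.2. Word Segmentation
3.2.3. Document Feature Selection
Section 3. Clustering of App Articles
3.3.1. kNN Clustering
3.3.2. Cluster Merging
3.3.3. Parameter Adjustment
3.3.4. Evaluation of Clustering Results
3.3.5. Clustering Rules
Chapter 4. Results
Section 1. First-Stage Clustering
Section 2. Second-Stage Clustering
Section 3. Third-Stage Clustering
Section 4. Fourth-Stage Clustering
Section 5. Fifth-Stage Clustering
Chapter 5. Conclusions and Future Research Directions
Section 1. Conclusions and Recommendations
Section 2. Future Research Directions
References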
dc.language.iso en_US