學術產出-Theses
Article View/Open
Publication Export
-
題名 適用於雲端分散儲存架構下的kNN平行演算法之研究
The study on paralleling the kNN algorithm suitable to cloud-based distributed architecture作者 張伯辰 貢獻者 楊建民
張伯辰關鍵詞 k-最鄰近演算法
平行運算
雲端運算
資料探勘
kNN
parallel computing
cloud computing
data mining日期 2012 上傳時間 1-Oct-2013 11:59:45 (UTC+8) 摘要 近年來,隨著網際網路與相關設備的快速發展與普及,雲端運算逐漸興起,並產生了大量的雲端服務,讓網路上儲存了大量一般使用者的資料與文件。為了分析這些資料,可使用資料探勘中的資料分群來進行初步探討,而kNN正是廣泛被應用於資料分群的一種方式。不過隨著資料量的成長,使用kNN演算法的分群時間也會大幅度地成長。 為此,本研究試圖以平行化的方式運行kNN演算法,將大量資料分成少量多次處理,並於平行分群後進行群集合併。本研究將Google Spreadsheet作為雲端資料庫,透過PHP語言對資料庫下達指令,並進行多區域的平行kNN資料分群。待各區域完成分群後,再將分群結果傳送至整合端進行群集合併。 本研究以二種不同方式進行群集合併,並比較二種方式的結果。第一種合併方式為給定合併門檻值,直接將質心近似的群集合併;第二種方式為同時比較二群集之群內相似度與群間相似度之關係,來決定是否進行合併。綜合而言,第二種方式的運算時間較第一種方式為快,且合併結果也較第一種方式為佳。 本研究進行三種資料分群測試,分別為:一般性資料(隨機分配的固定維度數值化資料)、實際新聞文章的數值化資料、大量模擬實際文章的數值化資料(高維度且有部分集中群集)。根據結果顯示,三種測試皆顯示平行化後的速度有著極大幅度的提升,表示本研究設計的平行化演算法對kNN的加速有相當大程度的作用,分群品質的變化雖可接受、卻較難以掌控。
With the development and popularization of the Internet, many companies started offering cloud-services, allowed users to save their data on the Internet. To find information in the large amounts of data, kNN is a common method in data mining domain. But with the amount of data increased, the computing time also grew rapidly. In this study, I tried to parallel the kNN algorithm, expecting to promote the efficiency of kNN algorithm by separating the data into small amount and clustering them. In this study, Google Spreadsheet was used as a database, and accessing by PHP. Data stored in different spreadsheets separately, and clustered by different regional servers. After each regional server completed the data clustering, the regional clusters would be integrated into final result. I used two ways to merge clusters in the second step of the algorithm. The first way is given threshold; I merge the clusters if these centroids were similar. The second way is to compare the difference between the summary of the inter-cluster similarity of two clusters, and their intra-cluster similarity. Then decided whether to merge or not. The result showed that the second way is not only have better computing speed but also have better quality. In this study, I tested the parallel algorithm with three kinds of data. The first one is random generated data with fixed dimension. The second data was transformed by reality news. And third data was large amount of data generated by simulating reality data. All of the results showed the paralleled algorithm had a great effect on accelerated KNN method, but also got the unpredictable quality change.參考文獻 英文文獻1. Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.2. Davis, N., Demetriou, G., Gaizauskas, R., Guo, Y., & Roberts, L. (2006). Web Service Architectures for Text Mining: An Exploration of the Issues via an E-Science Demonstrator. International Journal of Web Service Research 3(4): 95–112.3. Dikaiakos, M.D., Katsaros, D., Mehra, P., Pallis, G., Vakali, A. (2009). Cloud Computing: Distributed Internet Computing for IT and Scientific Research. IEEE Internet Computing, 13(5):10–13.4. Han, J., & Kamber, M. (2006). Data mining: Concepts and Techniques (2nd Edition), Morgan Kaufmann.5. Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data Clustering: A Review. ACM Computing Surveys, 31(3), 264–323.6. Lai, C. H., & Liu, D. R. (2009). Integrating knowledge flow mining and collaborative filtering to support document recommendation. Journal of Systems and Software, 82(12), 2023–2037.7. Laudon, J. P., & Laudon, K. C. (2011). Management Information Systems (12th Edition), Pearson.8. National Institute of Standards and Technology. (2011). The NIST Definition of Cloud Computing. NIST Special Publication 800-145.9. Ovum. (2011). 2011 Trends to Watch: Cloud Computing Technology.10. Salton, G., Wong, A., Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.11. Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1), 1–47.12. Song, Y. Huang, J. Zhou, D. Zha, H. Giles, C. L. (2007). IKNN: Informative K-Nearest Neighbor Classification. 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, 248–264, Warsaw, Poland.13. Wu, X, Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., & Steinberg, D. (2008). Top 10 algorithms in data mining. Knowl Inf Syst, 14:1–37.14. Weinberger, K. Q., Saul, L. K. (2009). Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 10:207–244. 中文文獻15. 吳季桓(2010)。自動分類的實作:KNN與SVM。國立中正大學資訊工程研究所碩士論文,未出版,嘉義縣。16. 周宣光(譯)(2010)。管理資訊系統(原作者Laudon, J. P., & Laudon, K. C. (2009))。台北:東華書局。17. 施雅月、賴錦慧(譯)(2008)。資料探勘(原作者Tan, P. N., Steinbach, M., Kumar, V. (2006))。台北:歐亞書局。18. 連偉志(2013)。雲端環境下應用Google電子試算表於分散式儲存架構平台之研究。國立政治大學資訊管理學系碩士論文,未出版,台北市。19. 陳仕斌(2012)。雲端運算之編譯排程系統設計與實作。國立成功大學電機工程學系碩士論文,未出版,台南市。20. 黃孝文(2010)。雲端運算服務環境下運用文字探勘於語意註解網頁文件分析之研究。國立政治大學資訊管理學系碩士論文,未出版,台北市。21. 黃冠中(2007)。應用kNN演算法之文件分類平台實作。第六屆離島資訊技術與應用研討會論文,雲林:國立虎尾科技大學資工系主辦。22. 曾國傑(2012)。運用KNN文字探勘分析智慧型終端App群集之研究. 國立政治大學資訊管理學系碩士論文,未出版,台北市。23. 薛弘業(2013)。應用文字探勘與文件分類分群技術於股價走勢預測之研究─以台灣股票市場為例。國立政治大學資訊管理學系碩士論文,未出版,台北市。 網路資料24. Boss, G., Malladi, P., Quan, D., Legregni, L., & Hall, H. (2007). Cloud Computing. Retrieved April, 2013 from IBM Corporation Web site: http://download.boulder.ibm.com/ibmdl/pub/software/dw/wes/hipods/Cloud_computing_wp_final_8Oct.pdf25. Google Drive. (2006). Overview of Google Sheets. Retrieved February, 2013 fromhttps://support.google.com/drive/bin/answer.py?hl=en&answer=140784&topic=20322&ctx=topic26. Google Drive. (2013). Start Google Drive. Retrieved February, 2013 fromhttps://www.google.com/intl/en/drive/start/index.html27. Law of cosines - Wikipedia, the free encyclopedia Retrieved August, 2013 fromhttp://en.wikipedia.org/wiki/Law_of_cosines28. Parallel Programming for Multicore. Berkeley lecture. Retrieved March, 2013 fromhttp://www.cs.berkeley.edu/~yelick/cs194f07/ 描述 碩士
國立政治大學
資訊管理研究所
100356032
101資料來源 http://thesis.lib.nccu.edu.tw/record/#G1003560321 資料類型 thesis dc.contributor.advisor 楊建民 zh_TW dc.contributor.author (Authors) 張伯辰 zh_TW dc.creator (作者) 張伯辰 zh_TW dc.date (日期) 2012 en_US dc.date.accessioned 1-Oct-2013 11:59:45 (UTC+8) - dc.date.available 1-Oct-2013 11:59:45 (UTC+8) - dc.date.issued (上傳時間) 1-Oct-2013 11:59:45 (UTC+8) - dc.identifier (Other Identifiers) G1003560321 en_US dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/61185 - dc.description (描述) 碩士 zh_TW dc.description (描述) 國立政治大學 zh_TW dc.description (描述) 資訊管理研究所 zh_TW dc.description (描述) 100356032 zh_TW dc.description (描述) 101 zh_TW dc.description.abstract (摘要) 近年來,隨著網際網路與相關設備的快速發展與普及,雲端運算逐漸興起,並產生了大量的雲端服務,讓網路上儲存了大量一般使用者的資料與文件。為了分析這些資料,可使用資料探勘中的資料分群來進行初步探討,而kNN正是廣泛被應用於資料分群的一種方式。不過隨著資料量的成長,使用kNN演算法的分群時間也會大幅度地成長。 為此,本研究試圖以平行化的方式運行kNN演算法,將大量資料分成少量多次處理,並於平行分群後進行群集合併。本研究將Google Spreadsheet作為雲端資料庫,透過PHP語言對資料庫下達指令,並進行多區域的平行kNN資料分群。待各區域完成分群後,再將分群結果傳送至整合端進行群集合併。 本研究以二種不同方式進行群集合併,並比較二種方式的結果。第一種合併方式為給定合併門檻值,直接將質心近似的群集合併;第二種方式為同時比較二群集之群內相似度與群間相似度之關係,來決定是否進行合併。綜合而言,第二種方式的運算時間較第一種方式為快,且合併結果也較第一種方式為佳。 本研究進行三種資料分群測試,分別為:一般性資料(隨機分配的固定維度數值化資料)、實際新聞文章的數值化資料、大量模擬實際文章的數值化資料(高維度且有部分集中群集)。根據結果顯示,三種測試皆顯示平行化後的速度有著極大幅度的提升,表示本研究設計的平行化演算法對kNN的加速有相當大程度的作用,分群品質的變化雖可接受、卻較難以掌控。 zh_TW dc.description.abstract (摘要) With the development and popularization of the Internet, many companies started offering cloud-services, allowed users to save their data on the Internet. To find information in the large amounts of data, kNN is a common method in data mining domain. But with the amount of data increased, the computing time also grew rapidly. In this study, I tried to parallel the kNN algorithm, expecting to promote the efficiency of kNN algorithm by separating the data into small amount and clustering them. In this study, Google Spreadsheet was used as a database, and accessing by PHP. Data stored in different spreadsheets separately, and clustered by different regional servers. After each regional server completed the data clustering, the regional clusters would be integrated into final result. I used two ways to merge clusters in the second step of the algorithm. The first way is given threshold; I merge the clusters if these centroids were similar. The second way is to compare the difference between the summary of the inter-cluster similarity of two clusters, and their intra-cluster similarity. Then decided whether to merge or not. The result showed that the second way is not only have better computing speed but also have better quality. In this study, I tested the parallel algorithm with three kinds of data. The first one is random generated data with fixed dimension. The second data was transformed by reality news. And third data was large amount of data generated by simulating reality data. All of the results showed the paralleled algorithm had a great effect on accelerated KNN method, but also got the unpredictable quality change. en_US dc.description.tableofcontents 第一章、 緒論 1 第一節、 研究背景與動機 1 第二節、 研究目的 2第二章、 文獻探討 3 第一節、 雲端運算及服務 3 2.1.1 雲端運算 3 2.1.2 雲端服務模式 4 2.1.3 雲端部屬模式 5 2.1.4 Google雲端硬碟(Google Drive) 6 第二節、 資料分群(Clustering) 8 2.2.1 k-最鄰近演算法(k-Nearest Neighbor,kNN) 8 2.2.2 相似度計算 10 第三節、 平行運算(Parallel Computing) 13 第四節、 文獻探討總結 14第三章、 研究架構與設計 15 第一節、 研究架構 15 第二節、 研究設計 17 3.2.1 資料處理 17 3.2.2 平行化之kNN演算法 20 3.2.3 分區群集合併 20 3.2.4 整合衡量指標 22 3.2.5 時間複雜度 24 3.2.6 資料結構與平行演算法 25第四章、 研究成果 33 第一節、 一般性資料分群成果 33 第二節、 新聞文章分群結果(給定門檻值) 35 第三節、 新聞文章分群結果(比較相似度) 37 第四節、 模擬文章資料分群結果(給定門檻值) 39 第五節、 模擬文章資料分群結果(比較相似度) 42第五章、 結論 44 第一節、 結論與建議 44 第二節、 未來研究方向 46參考文獻 48 zh_TW dc.format.extent 796953 bytes - dc.format.mimetype application/pdf - dc.language.iso en_US - dc.source.uri (資料來源) http://thesis.lib.nccu.edu.tw/record/#G1003560321 en_US dc.subject (關鍵詞) k-最鄰近演算法 zh_TW dc.subject (關鍵詞) 平行運算 zh_TW dc.subject (關鍵詞) 雲端運算 zh_TW dc.subject (關鍵詞) 資料探勘 zh_TW dc.subject (關鍵詞) kNN en_US dc.subject (關鍵詞) parallel computing en_US dc.subject (關鍵詞) cloud computing en_US dc.subject (關鍵詞) data mining en_US dc.title (題名) 適用於雲端分散儲存架構下的kNN平行演算法之研究 zh_TW dc.title (題名) The study on paralleling the kNN algorithm suitable to cloud-based distributed architecture en_US dc.type (資料類型) thesis en dc.relation.reference (參考文獻) 英文文獻1. Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.2. Davis, N., Demetriou, G., Gaizauskas, R., Guo, Y., & Roberts, L. (2006). Web Service Architectures for Text Mining: An Exploration of the Issues via an E-Science Demonstrator. International Journal of Web Service Research 3(4): 95–112.3. Dikaiakos, M.D., Katsaros, D., Mehra, P., Pallis, G., Vakali, A. (2009). Cloud Computing: Distributed Internet Computing for IT and Scientific Research. IEEE Internet Computing, 13(5):10–13.4. Han, J., & Kamber, M. (2006). Data mining: Concepts and Techniques (2nd Edition), Morgan Kaufmann.5. Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data Clustering: A Review. ACM Computing Surveys, 31(3), 264–323.6. Lai, C. H., & Liu, D. R. (2009). Integrating knowledge flow mining and collaborative filtering to support document recommendation. Journal of Systems and Software, 82(12), 2023–2037.7. Laudon, J. P., & Laudon, K. C. (2011). Management Information Systems (12th Edition), Pearson.8. National Institute of Standards and Technology. (2011). The NIST Definition of Cloud Computing. NIST Special Publication 800-145.9. Ovum. (2011). 2011 Trends to Watch: Cloud Computing Technology.10. Salton, G., Wong, A., Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.11. Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1), 1–47.12. Song, Y. Huang, J. Zhou, D. Zha, H. Giles, C. L. (2007). IKNN: Informative K-Nearest Neighbor Classification. 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, 248–264, Warsaw, Poland.13. Wu, X, Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., & Steinberg, D. (2008). Top 10 algorithms in data mining. Knowl Inf Syst, 14:1–37.14. Weinberger, K. Q., Saul, L. K. (2009). Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 10:207–244. 中文文獻15. 吳季桓(2010)。自動分類的實作:KNN與SVM。國立中正大學資訊工程研究所碩士論文,未出版,嘉義縣。16. 周宣光(譯)(2010)。管理資訊系統(原作者Laudon, J. P., & Laudon, K. C. (2009))。台北:東華書局。17. 施雅月、賴錦慧(譯)(2008)。資料探勘(原作者Tan, P. N., Steinbach, M., Kumar, V. (2006))。台北:歐亞書局。18. 連偉志(2013)。雲端環境下應用Google電子試算表於分散式儲存架構平台之研究。國立政治大學資訊管理學系碩士論文,未出版,台北市。19. 陳仕斌(2012)。雲端運算之編譯排程系統設計與實作。國立成功大學電機工程學系碩士論文,未出版,台南市。20. 黃孝文(2010)。雲端運算服務環境下運用文字探勘於語意註解網頁文件分析之研究。國立政治大學資訊管理學系碩士論文,未出版,台北市。21. 黃冠中(2007)。應用kNN演算法之文件分類平台實作。第六屆離島資訊技術與應用研討會論文,雲林:國立虎尾科技大學資工系主辦。22. 曾國傑(2012)。運用KNN文字探勘分析智慧型終端App群集之研究. 國立政治大學資訊管理學系碩士論文,未出版,台北市。23. 薛弘業(2013)。應用文字探勘與文件分類分群技術於股價走勢預測之研究─以台灣股票市場為例。國立政治大學資訊管理學系碩士論文,未出版,台北市。 網路資料24. Boss, G., Malladi, P., Quan, D., Legregni, L., & Hall, H. (2007). Cloud Computing. Retrieved April, 2013 from IBM Corporation Web site: http://download.boulder.ibm.com/ibmdl/pub/software/dw/wes/hipods/Cloud_computing_wp_final_8Oct.pdf25. Google Drive. (2006). Overview of Google Sheets. Retrieved February, 2013 fromhttps://support.google.com/drive/bin/answer.py?hl=en&answer=140784&topic=20322&ctx=topic26. Google Drive. (2013). Start Google Drive. Retrieved February, 2013 fromhttps://www.google.com/intl/en/drive/start/index.html27. Law of cosines - Wikipedia, the free encyclopedia Retrieved August, 2013 fromhttp://en.wikipedia.org/wiki/Law_of_cosines28. Parallel Programming for Multicore. Berkeley lecture. Retrieved March, 2013 fromhttp://www.cs.berkeley.edu/~yelick/cs194f07/ zh_TW