政大圖書館 (NCCU Library) — 學術產出 Theses
題名 (Title) 適用於雲端分散儲存架構下的kNN平行演算法之研究
The study on paralleling the kNN algorithm suitable to cloud-based distributed architecture
作者 (Author) 張伯辰
貢獻者 (Contributors) 楊建民 (advisor)
張伯辰
關鍵詞 (Keywords) k-最鄰近演算法 (k-nearest neighbor algorithm)
平行運算 (parallel computing)
雲端運算 (cloud computing)
資料探勘 (data mining)
kNN
parallel computing
cloud computing
data mining
日期 (Date) 2012
上傳時間 (Upload time) 1-Oct-2013 11:59:45 (UTC+8)
摘要 (Abstract)  In recent years, with the rapid development and spread of the Internet and related devices, cloud computing has risen and spawned a large number of cloud services, so that vast amounts of ordinary users' data and documents are now stored online. To analyze these data, clustering, a data-mining technique, can serve as a first exploratory step, and kNN is one widely used clustering approach. As the volume of data grows, however, the clustering time of the kNN algorithm also grows substantially.
  This study therefore attempts to run the kNN algorithm in parallel, splitting a large dataset into many small batches, clustering them in parallel, and then merging the resulting clusters. Google Spreadsheet serves as the cloud database, accessed through PHP, and kNN clustering is carried out in parallel across multiple regions. Once each region finishes clustering, its results are sent to an integration node for cluster merging.
  Two different merging strategies are compared. The first merges clusters directly whenever their centroids are similar within a given threshold; the second compares each pair's intra-cluster similarity against their inter-cluster similarity to decide whether to merge. Overall, the second strategy both runs faster and produces better merging results than the first.
  Three clustering experiments were conducted: generic data (randomly distributed numeric data of fixed dimensionality), vectorized real news articles, and a large volume of vectorized data simulating real articles (high-dimensional, with some concentrated clusters). In all three tests the parallelized version was dramatically faster, indicating that the parallel algorithm designed here substantially accelerates kNN; the change in clustering quality, while acceptable, is harder to control.
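The two-stage scheme the abstract describes (split the data into partitions, cluster each partition in parallel, then hand all regional clusters to an integration step) can be sketched as follows. This is a minimal illustration only, not the thesis's implementation: the thesis uses PHP with Google Spreadsheet as storage and a kNN-based clustering procedure, whereas here a simple nearest-seed assignment stands in for the regional clustering step, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist

def cluster_region(points, seeds):
    """Stand-in for the per-region clustering step (the thesis uses a
    kNN-based procedure here): assign each point to its nearest seed."""
    clusters = [[] for _ in seeds]
    for p in points:
        nearest = min(range(len(seeds)), key=lambda i: dist(p, seeds[i]))
        clusters[nearest].append(p)
    return [c for c in clusters if c]  # drop empty clusters

def parallel_cluster(partitions, seeds):
    """Cluster each data partition in parallel, then collect every
    regional cluster for the later merging (integration) step."""
    with ThreadPoolExecutor() as pool:
        per_region = pool.map(lambda part: cluster_region(part, seeds),
                              partitions)
    return [c for clusters in per_region for c in clusters]
```

Each partition is clustered independently, so the regional results may contain clusters that describe the same underlying group; resolving that overlap is exactly the job of the merging step the abstract goes on to compare.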
With the development and popularization of the Internet, many companies have begun offering cloud services that let users store their data online. To find information in these large amounts of data, kNN is a common method in the data-mining domain, but as the amount of data increases, its computing time grows rapidly.
In this study I parallelized the kNN algorithm, aiming to improve its efficiency by splitting the data into small batches and clustering them separately. Google Spreadsheet served as the database, accessed through PHP. The data were stored in separate spreadsheets and clustered by different regional servers; after each regional server finished clustering, the regional clusters were merged into a final result.
Two ways of merging clusters were used in the second step of the algorithm. The first, given a threshold, merges two clusters whenever their centroids are sufficiently similar. The second compares the inter-cluster similarity of two clusters against their intra-cluster similarity and merges them only when the comparison favors it. The results showed that the second way achieves both better computing speed and better quality.
The parallel algorithm was tested on three kinds of data: randomly generated data of fixed dimensionality, data vectorized from real news articles, and a large dataset generated by simulating real articles. All results showed that the parallelized algorithm greatly accelerates kNN, though the resulting change in clustering quality is unpredictable.
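The two merging strategies compared above can be sketched as follows. This is an illustrative reconstruction rather than the thesis's code: Euclidean distance between numeric points stands in for the similarity measure (the thesis works with vectorized documents, likely via cosine similarity), and the exact merge condition of the second strategy is an assumption based on the abstract's description.

```python
from math import dist

def centroid(cluster):
    """Component-wise mean of a cluster's points."""
    dim = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dim))

def mean_dist(xs, ys, same_set=False):
    """Average pairwise distance; with same_set=True, count each
    distinct pair once and skip self-pairs."""
    total, n = 0.0, 0
    for i, p in enumerate(xs):
        for j, q in enumerate(ys):
            if same_set and j <= i:
                continue
            total += dist(p, q)
            n += 1
    return total / n if n else 0.0

def merge_by_threshold(a, b, threshold):
    """Strategy 1: merge whenever the two centroids lie within a given threshold."""
    return dist(centroid(a), centroid(b)) <= threshold

def merge_by_similarity(a, b):
    """Strategy 2 (assumed condition): merge only when the clusters lie as
    close to each other (inter-cluster) as their own members do (intra-cluster)."""
    inter = mean_dist(a, b)
    intra = (mean_dist(a, a, same_set=True) + mean_dist(b, b, same_set=True)) / 2
    return inter <= intra
```

Strategy 1 depends on picking a good threshold per dataset, while strategy 2 adapts to how tight each pair of clusters already is, which is consistent with the abstract's finding that the second strategy gives better merge quality.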
參考文獻 (References)
英文文獻 (English)
1. Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
2. Davis, N., Demetriou, G., Gaizauskas, R., Guo, Y., & Roberts, L. (2006). Web service architectures for text mining: An exploration of the issues via an e-science demonstrator. International Journal of Web Service Research, 3(4), 95–112.
3. Dikaiakos, M. D., Katsaros, D., Mehra, P., Pallis, G., & Vakali, A. (2009). Cloud computing: Distributed Internet computing for IT and scientific research. IEEE Internet Computing, 13(5), 10–13.
4. Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques (2nd ed.). Morgan Kaufmann.
5. Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.
6. Lai, C. H., & Liu, D. R. (2009). Integrating knowledge flow mining and collaborative filtering to support document recommendation. Journal of Systems and Software, 82(12), 2023–2037.
7. Laudon, J. P., & Laudon, K. C. (2011). Management Information Systems (12th ed.). Pearson.
8. National Institute of Standards and Technology. (2011). The NIST Definition of Cloud Computing. NIST Special Publication 800-145.
9. Ovum. (2011). 2011 Trends to Watch: Cloud Computing Technology.
10. Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.
11. Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
12. Song, Y., Huang, J., Zhou, D., Zha, H., & Giles, C. L. (2007). IKNN: Informative k-nearest neighbor classification. In Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 248–264). Warsaw, Poland.
13. Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., & Steinberg, D. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1–37.
14. Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10, 207–244.


中文文獻 (Chinese)
15. 吳季桓 (2010). 自動分類的實作:KNN與SVM. Unpublished master's thesis, National Chung Cheng University, Chiayi County.
16. 周宣光 (Trans.) (2010). 管理資訊系統 (original work by Laudon, J. P., & Laudon, K. C., 2009). Taipei: 東華書局.
17. 施雅月 & 賴錦慧 (Trans.) (2008). 資料探勘 (original work by Tan, P. N., Steinbach, M., & Kumar, V., 2006). Taipei: 歐亞書局.
18. 連偉志 (2013). 雲端環境下應用Google電子試算表於分散式儲存架構平台之研究. Unpublished master's thesis, Department of Management Information Systems, National Chengchi University, Taipei.
19. 陳仕斌 (2012). 雲端運算之編譯排程系統設計與實作. Unpublished master's thesis, Department of Electrical Engineering, National Cheng Kung University, Tainan.
20. 黃孝文 (2010). 雲端運算服務環境下運用文字探勘於語意註解網頁文件分析之研究. Unpublished master's thesis, Department of Management Information Systems, National Chengchi University, Taipei.
21. 黃冠中 (2007). 應用kNN演算法之文件分類平台實作. Paper presented at the 6th Conference on Information Technology and Applications in Outlying Islands, National Formosa University, Yunlin.
22. 曾國傑 (2012). 運用KNN文字探勘分析智慧型終端App群集之研究. Unpublished master's thesis, Department of Management Information Systems, National Chengchi University, Taipei.
23. 薛弘業 (2013). 應用文字探勘與文件分類分群技術於股價走勢預測之研究─以台灣股票市場為例. Unpublished master's thesis, Department of Management Information Systems, National Chengchi University, Taipei.

網路資料 (Online resources)
24. Boss, G., Malladi, P., Quan, D., Legregni, L., & Hall, H. (2007). Cloud Computing. IBM Corporation. Retrieved April 2013 from http://download.boulder.ibm.com/ibmdl/pub/software/dw/wes/hipods/Cloud_computing_wp_final_8Oct.pdf
25. Google Drive. (2006). Overview of Google Sheets. Retrieved February 2013 from https://support.google.com/drive/bin/answer.py?hl=en&answer=140784&topic=20322&ctx=topic
26. Google Drive. (2013). Start Google Drive. Retrieved February 2013 from https://www.google.com/intl/en/drive/start/index.html
27. Wikipedia. Law of cosines. Retrieved August 2013 from http://en.wikipedia.org/wiki/Law_of_cosines
28. Parallel Programming for Multicore (UC Berkeley lecture). Retrieved March 2013 from http://www.cs.berkeley.edu/~yelick/cs194f07/
描述 (Description) 碩士 (Master's degree)
國立政治大學 (National Chengchi University)
資訊管理研究所 (Graduate Institute of Information Management)
100356032
101
資料來源 (Source) http://thesis.lib.nccu.edu.tw/record/#G1003560321
資料類型 (Type) thesis
dc.contributor.advisor 楊建民zh_TW
dc.contributor.author (Authors) 張伯辰zh_TW
dc.creator (作者) 張伯辰zh_TW
dc.date (日期) 2012en_US
dc.date.accessioned 1-Oct-2013 11:59:45 (UTC+8)-
dc.date.available 1-Oct-2013 11:59:45 (UTC+8)-
dc.date.issued (上傳時間) 1-Oct-2013 11:59:45 (UTC+8)-
dc.identifier (Other Identifiers) G1003560321en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/61185-
dc.description (描述) 碩士zh_TW
dc.description (描述) 國立政治大學zh_TW
dc.description (描述) 資訊管理研究所zh_TW
dc.description (描述) 100356032zh_TW
dc.description (描述) 101zh_TW
dc.description.tableofcontents Chapter 1: Introduction 1
  1.1 Research background and motivation 1
  1.2 Research objectives 2
Chapter 2: Literature review 3
  2.1 Cloud computing and services 3
    2.1.1 Cloud computing 3
    2.1.2 Cloud service models 4
    2.1.3 Cloud deployment models 5
    2.1.4 Google Drive 6
  2.2 Clustering 8
    2.2.1 k-nearest neighbor (kNN) algorithm 8
    2.2.2 Similarity computation 10
  2.3 Parallel computing 13
  2.4 Summary of the literature review 14
Chapter 3: Research architecture and design 15
  3.1 Research architecture 15
  3.2 Research design 17
    3.2.1 Data processing 17
    3.2.2 Parallelized kNN algorithm 20
    3.2.3 Merging of regional clusters 20
    3.2.4 Integrated evaluation metrics 22
    3.2.5 Time complexity 24
    3.2.6 Data structures and the parallel algorithm 25
Chapter 4: Results 33
  4.1 Clustering results on generic data 33
  4.2 Clustering results on news articles (given threshold) 35
  4.3 Clustering results on news articles (similarity comparison) 37
  4.4 Clustering results on simulated articles (given threshold) 39
  4.5 Clustering results on simulated articles (similarity comparison) 42
Chapter 5: Conclusions 44
  5.1 Conclusions and suggestions 44
  5.2 Future research directions 46
References 48
zh_TW
dc.format.extent 796953 bytes-
dc.format.mimetype application/pdf-
dc.language.iso en_US-
dc.source.uri (資料來源) http://thesis.lib.nccu.edu.tw/record/#G1003560321en_US
dc.subject (關鍵詞) k-最鄰近演算法zh_TW
dc.subject (關鍵詞) 平行運算zh_TW
dc.subject (關鍵詞) 雲端運算zh_TW
dc.subject (關鍵詞) 資料探勘zh_TW
dc.subject (關鍵詞) kNNen_US
dc.subject (關鍵詞) parallel computingen_US
dc.subject (關鍵詞) cloud computingen_US
dc.subject (關鍵詞) data miningen_US
dc.title (題名) 適用於雲端分散儲存架構下的kNN平行演算法之研究zh_TW
dc.title (題名) The study on paralleling the kNN algorithm suitable to cloud-based distributed architectureen_US
dc.type (資料類型) thesisen