The Classification of the Difficulty of English Articles with Text Mining (應用文字探勘技術於英文文章難易度分類) | Academic Output

Academic Output - Theses and Dissertations

Title	應用文字探勘技術於英文文章難易度分類 (The Classification of the Difficulty of English Articles with Text Mining)
Author	許珀豪 (Hsu, Po Hao)
Contributors	楊建民 (advisor); 許珀豪 (Hsu, Po Hao)
Keywords	text mining (文字探勘); kNN; the difficulty of English articles (英文文章適讀性); the characteristics of the linguistic difficulty of English articles (英文語文難易度特徵); the characteristics of the text (文字特徵)
Date	2012
Upload time	2-Sep-2013 16:01:55 (UTC+8)
Abstract	With online reading material so widely available, how English learners can pick articles whose difficulty matches their own reading ability is a question worth studying. To improve the accuracy of difficulty classification, recent studies have drawn on many difficulty features. This study classifies articles using linguistic difficulty features, text features, and a combination of the two, compares each result against the articles' official levels, and tests whether combining the two feature types improves accuracy. Articles from GEPT practice tests serve as the reference classification. The research design has three parts: classification by linguistic difficulty features, classification by text features, and classification by both combined. First, the linguistic difficulty features form the dimensions of a feature vector; after each feature value is computed, kNN assigns every article to the elementary, intermediate, or high-intermediate level, and this result serves as the baseline for comparing accuracy. Second, the GEPT articles are tokenized, selected feature words become the dimensions of the feature vector, and TF-IDF weights supply the feature values for text-feature classification. Third, the two feature sets are combined as the basis for classification. The first two approaches reach F-measures of 0.61 and 0.47 respectively; the combined approach performs best, with an F-measure of 0.68. The thesis hopes that classification by linguistic difficulty features plus text features will eventually help learners find, among large numbers of English articles, material suited to their level so that they can learn step by step. In future work, the combined features could classify English articles by both topic category and difficulty level, letting learners choose what to read for themselves and thereby improving learning outcomes.
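The text-feature step described in the abstract (tokenize, weight terms by TF-IDF, classify with kNN, score with F-measure) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the whitespace tokenizer, the `idf = log(N / df)` weighting, cosine similarity, majority voting, and all function names are assumptions chosen for illustration, and the record does not say which variants the author used.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed stand-in for real tokenization)."""
    return text.lower().split()


def fit_idf(token_lists):
    """Compute idf(t) = log(N / df(t)) over the training articles (assumed variant)."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per article
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Build a sparse TF-IDF vector; terms unseen in training are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    return {term: tf * idf[term] for term, tf in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(query, train_vectors, train_labels, k=3):
    """Return the majority label among the k most similar training vectors."""
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: cosine(query, train_vectors[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def f_measure(gold, predicted, label):
    """Harmonic mean of precision and recall for one class label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy articles with made-up level labels, purely for demonstration.
    train = [
        ("the cat sat on the mat", "elementary"),
        ("the dog ran to the cat", "elementary"),
        ("economic policy influences international trade agreements", "high-intermediate"),
        ("international institutions regulate economic development", "high-intermediate"),
    ]
    token_lists = [tokenize(text) for text, _ in train]
    idf = fit_idf(token_lists)
    vectors = [tfidf_vector(tokens, idf) for tokens in token_lists]
    labels = [label for _, label in train]
    query = tfidf_vector(tokenize("the cat ran"), idf)
    print(knn_predict(query, vectors, labels, k=3))
```

Because the vectors are plain dictionaries, one way to approximate the combined approach would be to add linguistic feature values under extra keys of the same dictionary before calling `knn_predict`. How the thesis actually scales and merges the two feature sets is not stated in this record.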
Description	Master's; National Chengchi University; Graduate Institute of Information Management; 100356036; 101
Source	http://thesis.lib.nccu.edu.tw/record/#G0100356036
Type	thesis

dc.contributor.advisor	楊建民	zh_TW
dc.contributor.author (Author)	許珀豪	zh_TW
dc.contributor.author (Author)	Hsu, Po Hao	en_US
dc.creator (Author)	許珀豪	zh_TW
dc.creator (Author)	Hsu, Po Hao	en_US
dc.date (Date)	2012	en_US
dc.date.accessioned	2-Sep-2013 16:01:55 (UTC+8)	-
dc.date.available	2-Sep-2013 16:01:55 (UTC+8)	-
dc.date.issued (Upload time)	2-Sep-2013 16:01:55 (UTC+8)	-
dc.identifier (Other identifier)	G0100356036	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/59300	-
dc.description (Description)	Master's	zh_TW
dc.description (Description)	National Chengchi University	zh_TW
dc.description (Description)	Graduate Institute of Information Management	zh_TW
dc.description (Description)	100356036	zh_TW
dc.description (Description)	101	zh_TW
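dc.description.tableofcontents	Chapter 1 Introduction: 1.1 Research background and motivation; 1.2 Research objectives; 1.3 Research framework. Chapter 2 Literature Review: 2.1 English learning in Taiwan and approaches to learning English (2.1.1 The state of English learning in Taiwan; 2.1.2 An approach to learning English: extensive reading; 2.1.3 Summary); 2.2 Readability analysis of English articles (2.2.1 What English readability is; 2.2.2 Readability formulas; 2.2.3 The value of readability formulas; 2.2.4 Other English readability factors and studies; 2.2.5 Summary); 2.3 Data mining and text mining (2.3.1 Data mining; 2.3.2 Text mining; 2.3.3 Comparison of data mining and text mining; 2.3.4 The text mining process; 2.3.5 Evaluating classification performance); 2.4 Summary of the literature review. Chapter 3 Research Method and Design: 3.1 Research framework and scope (3.1.1 Research scope); 3.2 Data sources and feature vectors (3.2.1 Data sources; 3.2.2 Feature vectors); 3.3 kNN classification (3.3.1 Similarity computation; 3.3.2 kNN classification method); 3.4 Methods for evaluating similarity. Chapter 4 Results: 4.1 Test results using linguistic difficulty features; 4.2 Test results using text features; 4.3 Results of testing linguistic difficulty features with text features; 4.4 Results of classification combining linguistic and text features; 4.5 Comparison and analysis of the four results; 4.6 Distribution and selection of k values. Chapter 5 Conclusions and Future Research Directions: 5.1 Conclusions and recommendations; 5.2 Future research directions. References. Appendix 1: Test results using linguistic difficulty features (sorted by F-measure). Appendix 2: Test results using text features (sorted by F-measure). Appendix 3: Results of testing linguistic difficulty features with text features (first 37 records after sorting by F-measure). Appendix 4: Results of classification combining linguistic and text features (sorted by F-measure).	zh_TW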
dc.format.extent	1421190 bytes	-
dc.format.mimetype	application/pdf	-
dc.language.iso	en_US	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0100356036	en_US
dc.subject (Keywords)	文字探勘 (text mining)	zh_TW
dc.subject (Keywords)	kNN	zh_TW
dc.subject (Keywords)	英文文章適讀性 (readability of English articles)	zh_TW
dc.subject (Keywords)	英文語文難易度特徵 (linguistic difficulty features of English)	zh_TW
dc.subject (Keywords)	文字特徵 (text features)	zh_TW
dc.subject (Keywords)	text mining	en_US
dc.subject (Keywords)	kNN	en_US
dc.subject (Keywords)	the difficulty of English articles	en_US
dc.subject (Keywords)	the characteristics of the linguistic difficulty of English articles	en_US
dc.subject (Keywords)	the characteristics of the text	en_US
dc.title (Title)	應用文字探勘技術於英文文章難易度分類	zh_TW
dc.title (Title)	The Classification of the Difficulty of English Articles with Text Mining	en_US
dc.type (Type)	thesis	en
