Title 應用文字探勘技術於英文文章難易度分類
The Classification of the Difficulty of English Articles with Text Mining
Author 許珀豪
Hsu, Po Hao
Contributors 楊建民 (advisor)
許珀豪
Hsu, Po Hao
Keywords text mining (文字探勘)
kNN
the difficulty of English articles (英文文章適讀性)
the characteristics of the linguistic difficulty of English articles (英文語文難易度特徵)
the characteristics of the text (文字特徵)
Date 2012
Uploaded 2-Sep-2013 16:01:55 (UTC+8)
Abstract With English text so widely available online, how learners can choose articles whose difficulty matches their own reading proficiency is a question worth studying. To improve the accuracy of difficulty classification, recent studies have drawn on many difficulty features. This study aims to simplify those earlier, more complicated procedures: it classifies articles by linguistic difficulty features, by text features, and by the two combined, compares each result with the articles' official levels, and tests whether combining linguistic and text features improves accuracy.
Classification is based on articles from official GEPT practice tests. The study has three parts: classification by linguistic difficulty features, classification by text features, and classification by the two combined. First, the linguistic difficulty features form the dimensions of a feature vector; once each feature value is computed, kNN classifies the articles as elementary, intermediate, or high-intermediate, and this result serves as the baseline for comparing accuracy. Second, the GEPT articles are split into words, the selected feature words form the dimensions of the text vector, and TF-IDF weights are its values. Third, the two feature sets are combined as the basis for classification. The first two approaches reach F-measures of 0.61 and 0.47, respectively; the combined approach performs best, with an F-measure of 0.68.
The study hopes its results will help learners find, among a large volume of English articles, ones suited to their level, so that they can learn step by step. In the future, linguistic difficulty features and text features could be combined to classify English articles by both category and difficulty level, letting learners choose what they want to read at a suitable level and thereby improving learning outcomes.
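The pipeline outlined in the abstract (TF-IDF text vectors, kNN classification, optional combination with linguistic features, and F-measure evaluation) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the toy articles, the cosine similarity measure, the `ling:` key prefix, the `weight` parameter, and the macro-averaged F-measure are all assumptions, since the abstract does not specify these details.

```python
import math
from collections import Counter

def idf_table(train_docs):
    """Inverse document frequency for every term seen in the training articles."""
    n = len(train_docs)
    df = Counter(term for doc in train_docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def tfidf(doc, idf):
    """Sparse TF-IDF vector (term -> weight); terms unseen in training are dropped."""
    tf = Counter(doc)
    return {t: (c / len(doc)) * idf[t] for t, c in tf.items() if t in idf}

def combine(text_vec, ling_feats, weight=1.0):
    """Append hypothetical linguistic features to a text vector (assumed scheme)."""
    merged = dict(text_vec)
    merged.update({"ling:" + k: weight * v for k, v in ling_feats.items()})
    return merged

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_predict(query, train_vecs, train_labels, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def macro_f_measure(gold, pred):
    """Per-class F-measure averaged over classes (macro averaging is an assumption)."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)

# Toy stand-ins for levelled articles (invented data, not GEPT material).
train_labels = ["elementary", "elementary", "intermediate", "intermediate"]
train_docs = [
    "the cat sat on the mat".split(),
    "the dog ran to the cat".split(),
    "economic policy affects national markets".split(),
    "markets respond to economic policy changes".split(),
]

idf = idf_table(train_docs)
train_vecs = [tfidf(doc, idf) for doc in train_docs]
predicted = knn_predict(tfidf("the dog sat on the mat".split(), idf),
                        train_vecs, train_labels, k=3)
print(predicted)
```

In a combined run, each article's vector would be passed through `combine` before calling `knn_predict`; the linguistic feature values would likely need scaling so that they do not dominate the TF-IDF weights, which is what the hypothetical `weight` parameter is meant to suggest.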
Description Master's
National Chengchi University
Graduate Institute of Information Management (資訊管理研究所)
100356036
101
Source http://thesis.lib.nccu.edu.tw/record/#G0100356036
Type thesis
dc.contributor.advisor 楊建民 [zh_TW]
dc.contributor.author (Authors) 許珀豪 [zh_TW]
dc.contributor.author (Authors) Hsu, Po Hao [en_US]
dc.creator (Author) 許珀豪 [zh_TW]
dc.creator (Author) Hsu, Po Hao [en_US]
dc.date (Date) 2012 [en_US]
dc.date.accessioned 2-Sep-2013 16:01:55 (UTC+8)
dc.date.available 2-Sep-2013 16:01:55 (UTC+8)
dc.date.issued (Uploaded) 2-Sep-2013 16:01:55 (UTC+8)
dc.identifier (Other Identifiers) G0100356036 [en_US]
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/59300
dc.description (Description) Master's (碩士) [zh_TW]
dc.description (Description) National Chengchi University (國立政治大學) [zh_TW]
dc.description (Description) Graduate Institute of Information Management (資訊管理研究所) [zh_TW]
dc.description (Description) 100356036 [zh_TW]
dc.description (Description) 101 [zh_TW]
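dc.description.tableofcontents Chapter 1 Introduction
Section 1 Research Background and Motivation
Section 2 Research Objectives
Section 3 Research Framework
Chapter 2 Literature Review
Section 1 English Learning in Taiwan and Approaches to Learning English
2.1.1 The State of English Learning in Taiwan
2.1.2 An Approach to Learning English: Extensive Reading
2.1.3 Summary
Section 2 Readability Analysis of English Articles
2.2.1 What Is English Readability
2.2.2 Readability Formulas
2.2.3 The Value of Readability Formulas
2.2.4 Other English Readability Factors and Studies
2.2.5 Summary
Section 3 Data Mining and Text Mining
2.3.1 Data Mining
2.3.2 Text Mining
2.3.3 Comparison of Data Mining and Text Mining
2.3.4 The Text Mining Process
2.3.5 Evaluating Classification Performance
Section 4 Summary of the Literature Review
Chapter 3 Research Method and Design
Section 1 Research Framework and Scope
3.1.1 Research Scope
Section 2 Data Sources and Feature Vectors
3.2.1 Data Sources
3.2.2 Feature Vectors
Section 3 kNN Classification
3.3.1 Similarity Computation
3.3.2 kNN Classification Method
Section 4 Method for Evaluating Similarity
Chapter 4 Results
Section 1 Results Using Linguistic Difficulty Features
Section 2 Results Using Text Features
Section 3 Results of Testing Linguistic Difficulty Features with Text Features
Section 4 Results of Classification Combining Linguistic and Text Features
Section 5 Comparison and Analysis of the Four Results
Section 6 Distribution and Selection of k
Chapter 5 Conclusions and Future Research
Section 1 Conclusions and Recommendations
Section 2 Future Research Directions
References
Appendix 1: Results using linguistic difficulty features (sorted by F-measure)
Appendix 2: Results using text features (sorted by F-measure)
Appendix 3: Results of testing linguistic difficulty features with text features (first 37 records after sorting by F-measure)
Appendix 4: Results of classification combining linguistic and text features (sorted by F-measure)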
dc.format.extent 1421190 bytes
dc.format.mimetype application/pdf
dc.language.iso en_US
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G0100356036 [en_US]
dc.subject (Keywords) text mining (文字探勘) [zh_TW]
dc.subject (Keywords) kNN [zh_TW]
dc.subject (Keywords) readability of English articles (英文文章適讀性) [zh_TW]
dc.subject (Keywords) linguistic difficulty features of English (英文語文難易度特徵) [zh_TW]
dc.subject (Keywords) text features (文字特徵) [zh_TW]
dc.subject (Keywords) text mining [en_US]
dc.subject (Keywords) kNN [en_US]
dc.subject (Keywords) the difficulty of English articles [en_US]
dc.subject (Keywords) the characteristics of the linguistic difficulty of English articles [en_US]
dc.subject (Keywords) the characteristics of the text [en_US]
dc.title (Title) 應用文字探勘技術於英文文章難易度分類 [zh_TW]
dc.title (Title) The Classification of the Difficulty of English Articles with Text Mining [en_US]
dc.type (Type) thesis [en]