Title A pattern approach to keyword extraction for academic writing vocabulary (以型態組合為主的關鍵詞擷取技術在學術寫作字彙上的研究)
Author Shao, Chih Chieh (邵智捷)
Contributors Liu, Jyi Shane (劉吉軒), advisor
Shao, Chih Chieh (邵智捷)
Keywords keyword extraction
English learning
academic vocabulary
academic word list
AWL
PoS tag patterns
Date 2009
Uploaded 11-Oct-2011 16:57:35 (UTC+8)
Abstract Over time, people have come to recognize the importance of recording knowledge and experience in written works and preserving them for future research. Today, research papers written in English are the world's primary medium for scholarly communication. Researchers who are not native speakers of English often choose inappropriate vocabulary or collocations, which prevents them from conveying their results accurately, or they express their ideas with too limited a range of language. Learning and using the vocabulary and collocations of English academic writing is therefore very important.

In this study, we built a corpus of real-world usage from a large collection of academic papers from different countries and research fields. We defined several part-of-speech (PoS) tag patterns and applied a keyword extraction technique based on them to capture candidate words commonly used in academic writing, which serve as a preliminary academic writing vocabulary. The candidates were then passed to an index analysis model for keywords, which selects further candidates that are representative according to their index characteristics.
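The extraction step described above can be illustrated with a minimal sketch. This record does not list the thesis's actual tag patterns, so `PATTERNS`, the coarse tag classes, and the function names below are hypothetical; the input is assumed to be sentences already tagged with Penn Treebank PoS tags.

```python
from collections import Counter

# Hypothetical patterns over coarse tag classes; the thesis's actual
# patterns are not given in this record.
PATTERNS = [
    ("V", "N"),       # verb + noun
    ("V", "D", "N"),  # verb + determiner + noun
    ("J", "N"),       # adjective + noun
]
KEEP = {"V", "N"}     # only verbs and nouns become vocabulary candidates


def coarse(tag):
    """Map a Penn Treebank tag to a coarse class (assumed mapping)."""
    if tag.startswith("VB"):
        return "V"
    if tag.startswith("NN"):
        return "N"
    if tag.startswith("JJ"):
        return "J"
    if tag == "DT":
        return "D"
    return "O"


def extract_candidates(tagged_sentences, patterns=PATTERNS):
    """Count verbs and nouns that occur inside any matching tag pattern."""
    counts = Counter()
    for sent in tagged_sentences:
        classes = [coarse(tag) for _, tag in sent]
        matched = set()  # token positions, so no token is counted twice
        for pat in patterns:
            n = len(pat)
            for i in range(len(classes) - n + 1):
                if tuple(classes[i:i + n]) == pat:
                    matched.update(
                        j for j in range(i, i + n) if classes[j] in KEEP
                    )
        for j in matched:
            counts[(sent[j][0].lower(), classes[j])] += 1
    return counts


tagged = [
    [("We", "PRP"), ("propose", "VBP"), ("a", "DT"), ("method", "NN"), (".", ".")],
    [("Results", "NNS"), ("demonstrate", "VBP"), ("significant", "JJ"),
     ("improvement", "NN")],
]
candidates = extract_candidates(tagged)
```

Matching on tag sequences rather than on raw frequency keeps only words that appear in a chosen grammatical context, which is the general idea of a pattern-based extractor; the specific patterns would need to be taken from the thesis itself.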

In the experiments, we filtered sample data covering different scopes and used statistical methods to analyze and verify how common each word is across fields. With an auxiliary filtering mechanism added, we obtained the nouns and verbs commonly used in academic writing. Based on this vocabulary, we identified frequent collocations in the corpus and offer both the vocabulary and the collocations as a reference for researchers and students who use English as a foreign language, in the hope of helping them write richer and more accurate research papers.
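One way to check whether a word is used evenly across fields is a Pearson chi-square test of homogeneity on its frequency in each domain. The sketch below is an assumption about the kind of statistic involved, not the thesis's exact procedure; the function names, the sample counts, and the default cutoff (5.991, the 0.05 critical value for two degrees of freedom, which fits three domains) are illustrative.

```python
def chi_square_homogeneity(word_counts, domain_sizes):
    """Pearson chi-square for a 2 x k table: (word, other tokens) x domains.

    word_counts[i]  = occurrences of the word in domain i
    domain_sizes[i] = total tokens in domain i
    """
    total_word = sum(word_counts)
    total_tokens = sum(domain_sizes)
    total_other = total_tokens - total_word
    if total_word == 0 or total_other == 0:
        raise ValueError("the word must occur, and must not be the only token")
    chi2 = 0.0
    for count, size in zip(word_counts, domain_sizes):
        expected_word = total_word * size / total_tokens
        expected_other = total_other * size / total_tokens
        chi2 += (count - expected_word) ** 2 / expected_word
        chi2 += ((size - count) - expected_other) ** 2 / expected_other
    return chi2


def is_common_across_domains(word_counts, domain_sizes, critical=5.991):
    """Treat a word as shared across domains if homogeneity is not rejected.

    The default cutoff assumes three domains (df = 2) and alpha = 0.05.
    """
    return chi_square_homogeneity(word_counts, domain_sizes) < critical


sizes = [10000, 10000, 10000]                          # made-up domain sizes
even = chi_square_homogeneity([50, 50, 50], sizes)     # evenly spread word
skewed = chi_square_homogeneity([120, 20, 10], sizes)  # domain-specific word
```

A low value means the word's frequency is similar in every domain, which suits a general academic word; a high value suggests a term tied to one field.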
Description Master's thesis
National Chengchi University
Department of Computer Science
92753025
98
Source http://thesis.lib.nccu.edu.tw/record/#G0927530254
Type thesis
URI http://nccur.lib.nccu.edu.tw/handle/140.119/51591
Table of contents
Chapter 1 Introduction
1.1 Background
1.2 Research Motivation
1.3 Research Objectives and Methods
1.4 Thesis Organization and Contributions

Chapter 2 Literature Review
2.1 Corpus Linguistics
2.1.1 Definition and Characteristics of Corpora and Corpus Linguistics
2.1.2 Corpus Text Preprocessing and Subsequent Applications
2.2 Keyword Extraction Techniques
2.2.1 Definition and Characteristics of Keywords in Academic Works
2.2.2 Keyword Extraction Based on Natural Language Processing Analysis
2.2.3 Keyword Extraction Based on Statistical Analysis
2.2.4 Feature Analysis Models Built on Keywords
2.3 Vocabulary Research in English Teaching
2.3.1 Definition and Characteristics of English Teaching Vocabulary
2.3.2 Combining Vocabulary and Parts of Speech: Collocations
2.4 Chapter Summary

Chapter 3 Experimental Method
3.1 Corpus Design
3.2 PoS Tag Patterns Keyword Extraction Algorithm
3.3 Applying the Pattern Analysis Model
3.4 Chapter Summary

Chapter 4 Experimental Analysis, Discussion, and Results
4.1 Experimental Data and Implementation
4.1.1 Description of Experimental Data
4.1.2 Experimental Method
4.2 Analysis and Discussion of Experimental Results
4.2.1 Differences among Experimental Samples
4.2.2 Results for Different Experimental Samples
4.2.3 Filtering Mechanism for Academic Writing Vocabulary
4.2.4 Academic Writing Vocabulary Based on Regional Language Characteristics
4.3 Extended Application: Academic Collocations
4.4 Chapter Summary

Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work

References

Appendix Table 1 Representative words for each index among CS-domain verb candidates (top 213)
Appendix Table 2 Homogeneity distribution of CS-domain verb candidates at different frequencies
Appendix Table 3 Domain academic word lists for verbs, obtained by intersecting the indices in each domain
Appendix Table 4 Final list of commonly used academic writing vocabulary (combined domains)
Appendix Table 5 Final list of commonly used academic writing vocabulary (combined language characteristics)
Appendix Table 6 Common collocations of academic writing vocabulary (overall)
Appendix Table 7 Common collocations of academic writing vocabulary (by language characteristics)

List of Figures and Tables
Figure 2-1 Overview of preprocessing tasks for corpora and information extraction
Figure 3-1 Flowchart of the research method
Figure 3-2 Analysis of corpus structural characteristics
Figure 3-3 Custom PoS Tag Patterns Algorithm
Figure 4-1 Chi-square value distribution of nouns and verbs in the three-domain intersection
Table 3-1 Academic writing vocabulary candidates per domain extracted by the CPTP algorithm
Table 4-1 Distribution of document and token counts by domain in the AcademicThesisCorpus
Table 4-2 Word frequency distribution in the ATC corpus
Table 4-3 Distribution of verb and noun counts by domain in the ATC corpus
Table 4-4 Candidate word counts and AWL counts by domain
Table 4-5 Distribution of homogeneity values by interval for the three-domain intersection vocabulary
Table 4-6 Characteristics of the trend represented by each index
Table 4-7 Counts of candidate words per domain and of candidates not shared across domains
Table 4-8 Sizes of the two academic word lists and the number of AWL words they contain
Table 4-9 Chi-square value distribution of words in the two academic word lists
Language en_US