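從語料庫建構探討臺灣客語難字、缺字與異體字議題 (Rare characters, missing characters and character variants in Taiwan Hakka: An exploration from corpus construction)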

Publications-Periodical Articles

Title	從語料庫建構探討臺灣客語難字、缺字與異體字議題
Alternative title	Rare characters, missing characters and character variants in Taiwan Hakka: An exploration from corpus construction
Authors	Lai, Huei-ling (賴惠玲); Yeh, Chiou-shing (葉秋杏)
Contributor	Department of English
Keywords	Rare character; Missing character; Character variant; Multiple codes for the same character; Taiwan Hakka Corpus
Date	2023-04
Upload time	13-Dec-2023 13:23:02 (UTC+8)
Abstract	The digitization of Taiwan Hakka data is immensely complicated due to the many rare characters, missing characters, or character variants found in Taiwan Hakka texts, and is further hindered by inconsistency between non-governmental Hakka dictionaries' writing practice and governmental standards for the Hakka writing system. This study describes how the Taiwan Hakka Corpus Project carried out character correction to ensure the Corpus's usefulness and robustness. First, the study demonstrates the various types of character correction that take place in our text cleaning process, including converting Hakka spellings into characters, unifying different forms of the same word, deleting redundant or repeated characters, filling in missing characters, swapping reversed characters, and correcting characters similar in shape but dissimilar in meaning. Second, we investigate situations in which rare characters cannot be shown properly, and we provide solutions to each situation. These situations include rare characters in Hakka texts being substituted with (1) Hakka spellings, (2) phonetic or semantic loan characters, (3) unintended glyphs such as squares or symbols (i.e., missing characters), and (4) character decomposition. Finally, issues related to multiple codes for the same character and character variants in Hakka texts are tackled.
Relation	臺灣語文研究 (Journal of Taiwanese Languages and Literature), Vol. 18, No. 1, pp. 135-183
Type	article
DOI	https://doi.org/10.6710/JTLL.202304_18(1).0003

dc.contributor	Department of English
dc.creator (Author)	賴惠玲;葉秋杏
dc.creator (Author)	Lai, Huei-ling;Yeh, Chiou-shing
dc.date (Date)	2023-04
dc.date.accessioned	13-Dec-2023 13:23:02 (UTC+8)
dc.date.available	13-Dec-2023 13:23:02 (UTC+8)
dc.date.issued (Upload time)	13-Dec-2023 13:23:02 (UTC+8)
dc.identifier.uri (URI)	https://ah.lib.nccu.edu.tw/item?item_id=168137
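dc.description.abstract (Abstract)	Taiwan Hakka texts contain many rare characters, missing characters, and character variants, which make character processing during corpus construction complicated and unwieldy. This article first outlines the current state of character usage in Taiwan Hakka, including representative non-governmental Hakka dictionaries and the official standards. Next, drawing on the experience of building the Taiwan Hakka Corpus, it introduces the corpus's character conventions and, based on text data cleaning, analyzes the types of character correction applied to texts: converting Hakka romanization into Hakka characters, unifying Hakka character usage, deleting superfluous characters, filling in missing characters, swapping reversed character order, and correcting visually similar characters. It then examines four situations that arise when rare characters in Hakka texts cannot be displayed properly, namely romanization, phonetic or semantic loan characters, blanks or symbols (missing characters), and decomposition of characters into components, and shows how each is handled. The article closes by discussing how to overcome inconsistent character codes and character variants.
dc.description.abstract (Abstract)	The digitization of Taiwan Hakka data is immensely complicated due to the many rare characters, missing characters, or character variants found in Taiwan Hakka texts, and is further hindered by inconsistency between non-governmental Hakka dictionaries' writing practice and governmental standards for the Hakka writing system. This study describes how the Taiwan Hakka Corpus Project carried out character correction to ensure the Corpus's usefulness and robustness. First, the study demonstrates the various types of character correction that take place in our text cleaning process, including converting Hakka spellings into characters, unifying different forms of the same word, deleting redundant or repeated characters, filling in missing characters, swapping reversed characters, and correcting characters similar in shape but dissimilar in meaning. Second, we investigate situations in which rare characters cannot be shown properly, and we provide solutions to each situation. These situations include rare characters in Hakka texts being substituted with (1) Hakka spellings, (2) phonetic or semantic loan characters, (3) unintended glyphs such as squares or symbols (i.e., missing characters), and (4) character decomposition. Finally, issues related to multiple codes for the same character and character variants in Hakka texts are tackled.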
dc.format.extent	136 bytes
dc.format.mimetype	text/html
dc.relation (Relation)	臺灣語文研究 (Journal of Taiwanese Languages and Literature), Vol. 18, No. 1, pp. 135-183
dc.subject (Keywords)	Rare character; Missing character; Character variant; Multiple codes for the same character; Taiwan Hakka Corpus
dc.title (Title)	從語料庫建構探討臺灣客語難字、缺字與異體字議題
dc.title.alternative (Alternative title)	Rare characters, missing characters and character variants in Taiwan Hakka: An exploration from corpus construction
dc.type (Type)	article
dc.identifier.doi (DOI)	10.6710/JTLL.202304_18(1).0003
dc.doi.uri (DOI)	https://doi.org/10.6710/JTLL.202304_18(1).0003
