Title 漢字古文書光學字元辨識之文本閱讀順序偵測研究
Reading Order Detection in Optical Character Recognition for Historical Chinese Documents
Author 馬行遠
Ma, Hsing-Yuan
Contributors 劉昭麟; 黃瀚萱
Liu, Chao-Lin; Huang, Hen-Hsen
馬行遠
Ma, Hsing-Yuan
Keywords Reading order
Learning to rank
Multimodal model
Historical text processing
Reading Order Detection
Pairwise Learning-to-Rank
Multimodal Representation
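Archival Document Processing
Date 2023
Uploaded 1-Sep-2023 15:24:26 (UTC+8)
Abstract
Optical character recognition (OCR) and document layout analysis (DLA) have accumulated years of research and development experience, yet reading order detection remains an unsolved problem. Reading order detection plays a crucial role in preserving a document's original structure and in the correction step that follows text detection. At present, most reading order detection tools rely mainly on rule-based algorithms. For modern documents with simple structure, regular alignment, and even spacing, these methods do achieve good results. However, faced with the complex layouts and uneven edges of handwritten or ancient texts, existing methods clearly fall short. A strategy that can accurately detect reading order in ancient Chinese books with complex layouts is therefore urgently needed. Building on the current mainstream OCR framework, this study proposes a model dedicated to reading order detection. The model emphasizes simulating the human reading process, treats image cues as the key to determining reading order, and introduces an original multimodal reading order detection method that simplifies the processing pipeline for the task. It is validated on the MTHv2 dataset of ancient Chinese books. Experimental results show that, compared with previous methods, our model reduces the page error rate by 25%. It also performs well with limited training data and with insufficient text detection information, demonstrating the robustness and practical value of this work.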
Optical character recognition (OCR) and document layout analysis (DLA) have been developed for years. Still, reading order detection (ROD) remains an open problem. ROD plays an important role in preserving the original structure of a document as well as in post-OCR correction. Most modern ROD tools rely on rule-based algorithms to place detected text coordinates in order. These approaches may work well for simple, modern documents because they are well aligned and evenly spaced. However, because of the complex layouts and curved layout edges in handwritten or historical documents, current methods are inadequate. In this paper, we propose a multimodal approach to ROD by formulating the task as pairwise learning-to-rank. We evaluate our approach on the MTHv2 dataset. Experimental results indicate that, compared with previous methods, our model reduces the page error rate by 25%. Furthermore, it performs well even with limited training data and insufficient text detection information, demonstrating the robustness and practical value of this research.
Description Master's
National Chengchi University
Department of Computer Science
110753132
Source http://thesis.lib.nccu.edu.tw/record/#G0110753132
Type thesis
dc.date.accessioned 1-Sep-2023 15:24:26 (UTC+8)
dc.date.available 1-Sep-2023 15:24:26 (UTC+8)
dc.identifier (Other Identifiers) G0110753132
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/147032
dc.description.tableofcontents
Chapter 1 Introduction
  Section 1 Research motivation
  Section 2 Research background
  Section 3 Research structure
Chapter 2 Literature review
  Section 1 Mainstream OCR frameworks and their development
  Section 2 The human reading process
  Section 3 Layout and reading order of ancient Chinese books
  Section 4 Research on reading order
  Section 5 Challenges of reading order detection in ancient Chinese books
  Section 6 Summary
Chapter 3 Methodology
  Section 1 Problem definition
  Section 2 Multimodal reading order detection model
  Section 3 Pairwise relation matrix decoding model
Chapter 4 Experimental procedure
  Section 1 Evaluation methods
  Section 2 Experimental datasets
  Section 3 Experimental models
  Section 4 Experimental parameter design
Chapter 5 Experimental results
  Section 1 Multi-layout data experiment
  Section 2 Simple versus complex layout experiment
  Section 3 Specific-layout experiment
  Section 4 Few-sample training experiment
  Section 5 Experiment with different image features
  Section 6 Conclusion
  Section 7 Limitations
  Section 8 Future work
Chapter 6 Implementation of an OCR system for ancient Chinese books
  Section 1 System framework
  Section 2 Text detection model
  Section 3 Text recognition model
  Section 4 User interface and operation
  Section 5 Example results
References
dc.format.extent 18343293 bytes
dc.format.mimetype application/pdf
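The abstract frames reading order detection as pairwise learning-to-rank, and the table of contents lists a decoding step for a pairwise relation matrix, but this record gives no implementation details. The sketch below is therefore only an illustration of the general idea: a scorer estimates whether one text box is read before another, and a full order is recovered by counting pairwise wins. The box format, `toy_precedes`, `decode_order`, and `example_boxes` are all hypothetical names, the toy scorer is a hand-written rule standing in for a trained multimodal model, and the win-count decoder is a simple substitute rather than the thesis's own decoding model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in image pixels,
# with y growing downward.
Box = Tuple[float, float, float, float]


def toy_precedes(a: Box, b: Box) -> float:
    """Stand-in for a trained pairwise model (an assumption, not the thesis model).

    Returns 1.0 if box a should be read before box b, else 0.0, assuming
    vertical text with columns running from right to left.
    """
    same_column = a[0] < b[2] and b[0] < a[2]  # horizontal extents overlap
    if same_column:
        return 1.0 if a[1] < b[1] else 0.0  # the higher box is read first
    return 1.0 if a[0] > b[0] else 0.0  # the right-hand column is read first


def decode_order(
    boxes: Sequence[Box], precedes: Callable[[Box, Box], float]
) -> List[int]:
    """Turn pairwise scores into a reading order by counting wins.

    A box's score is the sum of its 'read before' scores against every
    other box; boxes are returned as indices, highest score first.
    """
    n = len(boxes)
    wins = [
        sum(precedes(boxes[i], boxes[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    return sorted(range(n), key=lambda i: -wins[i])


# Two columns of two boxes each, listed in shuffled order.
example_boxes: List[Box] = [
    (60, 60, 80, 110),    # left column, bottom
    (100, 0, 120, 50),    # right column, top
    (60, 0, 80, 50),      # left column, top
    (100, 60, 120, 110),  # right column, bottom
]

print(decode_order(example_boxes, toy_precedes))  # [1, 3, 2, 0]
```

In a learned setting, `toy_precedes` would be replaced by a model that scores each pair from image and layout features. Win counting is one of several ways to resolve pairwise predictions that may be mutually inconsistent.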