Academic Output: Degree Thesis

Title 跨領域多教師知識蒸餾的強化學習方法探索
Exploration of Reinforcement Learning Methods for Cross-Domain Multi-Teacher Knowledge Distillation
Author 洪得比 (Hong, De-Bi)
Advisor 謝佩璇 (Hsieh, Pei-Hsuan)
Keywords Natural Language Processing
Knowledge Distillation
Reinforcement Learning
Domain Adaptation
Date 2024
Uploaded 5 August 2024, 12:44:52 (UTC+8)
Abstract With the growing popularity of social media, sentiment analysis plays an increasingly important role in capturing the dynamics of public opinion. However, while large sentiment analysis models deliver excellent performance, their vast parameter counts and computational costs create efficiency and cost challenges, especially in resource-constrained settings; in addition, manually labeling the large datasets these models are trained on is itself very time-consuming. To address this, this thesis proposes a sentiment analysis model compression method based on cross-domain dynamic knowledge distillation. First, it introduces a dynamic teacher selection strategy: whereas traditional knowledge distillation uses a fixed teacher model, this work uses reinforcement learning to select the optimal combination of teacher models according to the student model's state representation, providing more effective knowledge guidance and improving distillation efficiency. Second, it extends knowledge distillation to a cross-domain setting: teacher models are drawn from multiple source domains, and a cross-domain distillation loss aligns the hidden-layer and attention feature representations of the student and teacher models to narrow the student's performance gap in the target domain. Experiments on multiple review datasets show that the proposed method retains performance comparable to the large teacher models while substantially compressing model size. For example, with BERT-base as the teacher, the compressed 6-layer and 3-layer BERT students improve accuracy on binary sentiment classification by 0.2% to 1% over traditional KD while greatly reducing parameter count and computation time. The proposed cross-domain dynamic knowledge distillation method thus offers a new solution for deploying large sentiment analysis models.
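To make the two mechanisms described above more concrete, the sketch below illustrates, in PyTorch, a temperature-scaled distillation loss with hidden-state and attention alignment, and a REINFORCE-style policy that selects a teacher from several source-domain teachers based on the student's state. This is a minimal illustration under assumptions, not the thesis's actual implementation: the names distillation_loss, TeacherSelector, and selector_loss, the loss weights, and the reward definition are all hypothetical.

```python
# Minimal, hypothetical sketch (PyTorch) of the mechanisms described in the
# abstract. Names and weighting choices are illustrative assumptions only.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      student_hidden, teacher_hidden,
                      student_attn, teacher_attn,
                      T=2.0, alpha=0.5, beta=0.1):
    """Temperature-scaled soft-label KD plus hidden-state/attention alignment."""
    # Soft-label term: KL divergence between temperature-softened distributions
    # (Hinton et al., 2015); the T*T factor keeps gradient magnitudes comparable.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * (T * T)

    # Hard-label cross-entropy on the ground-truth sentiment labels.
    ce_term = F.cross_entropy(student_logits, labels)

    # Feature alignment: here a plain MSE between (already projected and
    # layer-mapped) hidden states and attention maps of student and teacher.
    feat_term = (F.mse_loss(student_hidden, teacher_hidden)
                 + F.mse_loss(student_attn, teacher_attn))

    return alpha * kd_term + (1.0 - alpha) * ce_term + beta * feat_term


class TeacherSelector(torch.nn.Module):
    """REINFORCE-style policy that picks one of several source-domain teachers
    conditioned on a vector summarizing the student's current state."""

    def __init__(self, state_dim, num_teachers):
        super().__init__()
        self.policy = torch.nn.Linear(state_dim, num_teachers)

    def forward(self, student_state):
        probs = F.softmax(self.policy(student_state), dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()          # index of the selected teacher
        return action, dist.log_prob(action)


def selector_loss(log_prob, reward):
    # REINFORCE estimator: move probability mass toward teachers whose
    # distilled updates yielded a higher reward.
    return -(reward * log_prob).mean()
```

In a full training loop, the selector would plausibly be rewarded with the student's improvement on a target-domain validation metric after distilling from the chosen teacher; the exact state representation and reward used in the thesis are not specified in this record.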
Description Master's degree
National Chengchi University
Department of Computer Science
111753117
Source http://thesis.lib.nccu.edu.tw/record/#G0111753117
Type thesis
Identifier G0111753117
URI https://nccur.lib.nccu.edu.tw/handle/140.119/152567
Table of Contents Chapter 1 Introduction (Research Background; Research Motivation; Research Objectives). Chapter 2 Literature Review (Model Compression; Knowledge Distillation: Methods, Problems, Dynamic Knowledge Distillation; Sentiment Analysis: Transformer, BERT, Knowledge Distillation Models; Summary). Chapter 3 Research Methods (Knowledge Distillation Process: Distillation Temperature, Loss Functions; Dynamic Knowledge Distillation: Cross-Domain Knowledge Distillation, Dynamic Teacher Selection; Distillation Loss; Experimental Procedure Design). Chapter 4 Experimental Design (Dataset Selection; Experimental Setup; Experimental Results: Source-Domain Fine-Tuning, Target-Domain Distillation, RLCKD Distillation Results, Comparison of Distillation Temperatures; Summary). Chapter 5 Conclusion (Innovations of Cross-Domain Dynamic Knowledge Distillation; An Efficient Sentiment Analysis Model for Practical Applications; Limitations and Future Work). References.
Format application/pdf, 1424172 bytes