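Title Rationality Evaluation and Semantic Analysis of Saliency Maps in Diffusion Models (擴散模型之顯著圖合理性評估及語義分析)
Author Lin, Da-Wei (林大維)
Contributors Chi, Ming-Te (紀明德, advisor)
Lin, Da-Wei (林大維)
Keywords Diffusion Models
Saliency Maps
Text-to-Image Generation Models
Semantic Analysis
Date 2025
Upload time 2-Jun-2025 14:57:39 (UTC+8)
Abstract In recent years, diffusion models have made major advances in image generation; Stable Diffusion in particular has brought text-to-image generation to a new level. However, when the model relates natural language to image generation, feature entanglement can arise and undermine the rationality of the generated results. This study adopts DAAM (Diffusion Attentive Attribution Map) and analyzes the saliency maps derived from the cross-attention maps to examine which regions the model attends to for each prompt word and how that attention affects the generated image.
We propose an automated rationality evaluation method that incorporates Segment Anything (SAM) semantic segmentation to quantify the accuracy of saliency maps, and we compare the generalization ability of different pretrained Stable Diffusion models (e.g., v1.5, v2.1, and SDXL). In addition, through dependency parsing and feature entanglement analysis, we investigate how the words of a prompt influence image generation and verify the extent to which adjectives and scene descriptions affect the result.
Experimental results show that DAAM outperforms traditional gradient-based methods (such as Grad-CAM and Grad-CAM++) in evaluating semantic relevance and reflects the correspondence between text and image more accurately. We also find that some adjectives affect the whole scene rather than only the object they describe, indicating that Stable Diffusion still faces challenges with complex prompts. Future work will further refine DAAM and explore more precise semantic interpretation methods to improve the interpretability and generation quality of diffusion models.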
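The abstract describes quantifying saliency-map accuracy against SAM segmentation, but this record does not state the metric used. As a rough illustration only, the sketch below assumes one plausible choice: threshold a normalized saliency map and compute its intersection-over-union (IoU) with a binary object mask. The function names `binarize` and `iou`, the 0.5 threshold, and the toy 3x3 grids are all hypothetical; in practice the saliency map would come from DAAM and the mask from SAM.

```python
from typing import List

Grid = List[List[float]]
Mask = List[List[bool]]


def binarize(saliency: Grid, threshold: float) -> Mask:
    """Min-max normalize a saliency map and keep pixels at or above threshold.

    Assumption: a fixed threshold on the normalized map; the thesis may
    use a different binarization rule.
    """
    flat = [v for row in saliency for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[(v - lo) / span >= threshold for v in row] for row in saliency]


def iou(pred: Mask, mask: Mask) -> float:
    """Intersection-over-union of two equally sized binary masks."""
    inter = union = 0
    for prow, mrow in zip(pred, mask):
        for p, m in zip(prow, mrow):
            inter += p and m
            union += p or m
    return inter / union if union else 0.0


# Toy stand-in for a per-word saliency map (hypothetical values).
saliency = [
    [0.0, 0.1, 0.2],
    [0.1, 0.8, 0.9],
    [0.0, 0.7, 1.0],
]
# Toy stand-in for a segmentation mask of the object the word names.
object_mask = [
    [False, False, True],
    [False, True, True],
    [False, False, True],
]

pred = binarize(saliency, threshold=0.5)
print(f"IoU = {iou(pred, object_mask):.2f}")  # IoU = 0.60
```

A higher IoU would indicate that the word's attention concentrates on the segmented object; a low value for an adjective, for example, would be consistent with the scene-wide influence the abstract reports.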
Diffusion models have improved image generation, with Stable Diffusion advancing text-to-image synthesis. However, feature entanglement affects coherence. This study employs the Diffusion Attentive Attribution Map (DAAM) to analyze saliency maps from cross-attention layers, examining prompt processing and its impact on generation. We propose an automated evaluation method using the Segment Anything Model (SAM) for semantic segmentation to assess saliency accuracy. DAAM’s generalization is compared across Stable Diffusion versions (v1.5, v2.1, SDXL), with linguistic prompt influence analyzed through dependency parsing and feature entanglement studies. Results show that DAAM outperforms gradient-based methods like Grad-CAM in semantic relevance, revealing how certain adjectives influence entire scenes. Future research will refine DAAM and improve semantic interpretation for better model explainability and generation quality.
Description Master's
National Chengchi University
Department of Computer Science
111753161
Source http://thesis.lib.nccu.edu.tw/record/#G0111753161
Type thesis
Identifier G0111753161
URI https://nccur.lib.nccu.edu.tw/handle/140.119/157243
Format application/pdf, 7874056 bytes
Table of contents
Acknowledgements
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Motivation and Objectives
1.2 Problem Statement
1.3 Thesis Organization
Chapter 2 Related Work
2.1 Generative Models
2.2 Common Visualization Methods
2.3 Common Interpretability Metrics
Chapter 3 Methodology and Framework
3.1 Subject Generator: Stable Diffusion
3.2 Key Terms of the Diffusion Model
3.3 Design of the Semantic Annotation
3.4 Automated Rationality Evaluation Based on Annotation Segmentation
3.5 Semantic Strength Computed from Annotations
3.6 Comparison of DAAM Results across Different Input Data
Chapter 4 Analysis Results
4.1 Quantitative Metrics
4.2 Observing Rationality and Differences in the Corresponding Result Patterns
4.3 Observations on Semantic Relevance
4.4 Variability under the Same Semantics
4.5 Comparison with Commercially Trained Models
4.6 Discussion of Stability
4.7 Limitations of DAAM
Chapter 5 Conclusion and Future Outlook
5.1 Research Conclusions
5.2 Future Work
References
