Title | 基於文字生成圖片之擴散模型的視覺化輔助設計系統 Visualization-Assisted Design System Based on Text-to-Image Diffusion Model |
Creator | 王良文 Wang, Liang-Wen |
Contributor | 紀明德 Chi, Ming-Te (advisor); 王良文 Wang, Liang-Wen
Key Words | 提示工程; 視覺化; 擴散模型; 文字到圖像生成模型; 命名實體識別; Prompt Engineering; Visualization; Diffusion Models; Text-to-Image Generation; Named Entity Recognition
Date | 2025 |
Date Issued | 1-Apr-2025 12:27:33 (UTC+8) |
Summary | 近年來,擴散模型顯著提升了文本生成圖像技術的品質,讓使用者能以提示詞創造出高關聯度且前所未見的圖像。然而,生成結果深受提示詞選擇影響,導致初學者難以掌握有效的提示詞設計。為此,本研究提出一套基於視覺化的輔助設計系統,協助使用者理解提示詞與圖像生成之間的關係,並提供優化的提示詞建議。我們利用 DiffusionDB 數據集並結合自然語言處理技術,分析提示詞語義,並運用 UMAP 將高維度提示詞關聯投影至直觀的二維視覺化空間。透過系統的動態迭代機制,使用者可隨時調整提示詞並即時觀察圖像變化,從而獲得創意啟發並創作出多樣的圖像。為了提供更多元的提示詞選擇,我們比較使用者輸入的提示詞與 DiffusionDB 的語義相似度,並進一步探討在標註實體任務中,GPT 模型在不同提示詞組合下的穩定度,以提升提示詞建議系統的可靠性。 In recent years, diffusion models have greatly improved text-to-image generation, allowing users to produce highly relevant and novel images through prompts. However, prompt design can be challenging for beginners. This study introduces a visualization-based assistive design system that leverages the DiffusionDB dataset and NLP techniques to analyze prompt semantics, using UMAP dimensionality reduction to create an interactive two-dimensional visualization of prompt relationships. Through iterative refinement, users can modify prompts and observe real-time image generation results, gaining creative inspiration for diverse outputs. By comparing user prompts with DiffusionDB via semantic similarity analysis, the system suggests various prompt options. Additionally, we examine GPT model stability under different prompt combinations in named entity annotation tasks to enhance the reliability of the prompt recommendation system. |
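The summary above outlines a concrete pipeline: prompts are embedded with an NLP model, projected into a two-dimensional view with UMAP, and compared against DiffusionDB prompts by semantic similarity to drive prompt suggestions. The following is a minimal sketch of that idea, not the thesis's actual implementation: the embedding model name ("all-MiniLM-L6-v2"), the UMAP settings, and the short prompt list standing in for DiffusionDB are illustrative assumptions.

```python
# Minimal sketch of the pipeline outlined in the abstract: embed prompts,
# project the embeddings to 2D with UMAP for a scatter-plot view, and
# suggest semantically similar prompts from a reference set.
# Assumptions: the model name, UMAP hyperparameters, and the toy prompt list
# are illustrative stand-ins for the thesis's setup and for DiffusionDB.
import numpy as np
import umap                                              # pip install umap-learn
from sentence_transformers import SentenceTransformer   # pip install sentence-transformers

# Toy stand-in for prompts sampled from DiffusionDB.
db_prompts = [
    "a castle on a hill, oil painting, highly detailed",
    "portrait of a cyberpunk samurai, neon lighting, 8k",
    "a cozy cabin in a snowy forest, warm light, photorealistic",
    "isometric voxel art of a small island village",
    "watercolor sketch of a rainy Tokyo street at night",
    "macro photo of a dew-covered spider web at sunrise",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
db_emb = model.encode(db_prompts, normalize_embeddings=True)   # shape: (N, 384)

# Project the high-dimensional prompt embeddings to 2D for visualization.
reducer = umap.UMAP(n_components=2, n_neighbors=3, metric="cosine",
                    init="random", random_state=42)
coords_2d = reducer.fit_transform(db_emb)                      # shape: (N, 2)

def suggest_prompts(user_prompt: str, top_k: int = 3):
    """Rank reference prompts by cosine similarity to the user's prompt."""
    q = model.encode([user_prompt], normalize_embeddings=True)[0]
    sims = db_emb @ q            # cosine similarity (embeddings are unit-normalized)
    order = np.argsort(-sims)[:top_k]
    return [(db_prompts[i], round(float(sims[i]), 3)) for i in order]

print(coords_2d)                 # points to place in the 2D prompt map
print(suggest_prompts("medieval fortress at sunset, oil painting"))
```

In the system described by the thesis, coordinates like these back an interactive view in which users iteratively adjust prompts and observe the generated images; here they are simply printed to show the data the visualization would consume.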
Description | 碩士 (Master's thesis); 國立政治大學 (National Chengchi University); 資訊科學系 (Department of Computer Science); 111753152
Source | http://thesis.lib.nccu.edu.tw/record/#G0111753152
Type | thesis |
URI | https://nccur.lib.nccu.edu.tw/handle/140.119/156487
Table of Contents |
Chapter 1 Introduction: 1.1 Research Motivation and Objectives; 1.2 Problem Statement; 1.3 Research Contributions; 1.4 Design Requirements
Chapter 2 Related Work: 2.1 Visualization Research on Text-to-Image Generation; 2.2 Stable Diffusion; 2.3 Text-to-Image Generation with Prompt Modifiers; 2.4 Divergence and Convergence Models in Creativity Support Systems; 2.5 Transformer-Based Models; 2.6 Semantic Alignment and Aesthetic Evaluation for Text-to-Image Generation
Chapter 3 Dataset: 3.1 DiffusionDB
Chapter 4 Research Methods and Procedure: 4.1 Data Preprocessing; 4.2 System Architecture (4.2.1 Prompt Analysis Module; 4.2.2 Visualization System and Result Analysis)
Chapter 5 Visualization Design: 5.1 Visualization Results Based on Prompt Combinations; 5.2 Visualization of Prompt Categories
Chapter 6 Experiments: 6.1 Keyword Extraction from the DiffusionDB Dataset; 6.2 Comparison of t-SNE and UMAP; 6.3 Prompt Engineering Experiments; 6.4 Iterative Prompt Optimization (6.4.1 Generation with Different Prompt Combinations; 6.4.2 Semantic Analysis and Clustering with UMAP; 6.4.3 Selecting User-Preferred Images; 6.4.4 Selecting Prompts from Different Categories); 6.5 Use Cases (6.5.1 Use Case 1; 6.5.2 Use Case 2)
Chapter 7 Conclusion and Future Outlook: 7.1 Conclusion; 7.2 Future Outlook
References
Format | application/pdf; 14702760 bytes