Title | 基於文字生成圖片之擴散模型的視覺化輔助設計系統 Visualization-Assisted Design System Based on Text-to-Image Diffusion Model |
Creator | 王良文 Wang, Liang-Wen |
Contributor | 紀明德 Chi, Ming-Te (advisor); 王良文 Wang, Liang-Wen
Key Words | 提示工程; 視覺化; 擴散模型; 文字到圖像生成模型; 命名實體識別; Prompt Engineering; Visualization; Diffusion Models; Text-to-Image Generation; Named Entity Recognition
Date | 2025 |
Date Issued | 1-Apr-2025 12:27:33 (UTC+8) |
Summary | 近年來,擴散模型顯著提升了文本生成圖像技術的品質,讓使用者能以提示詞創造出高關聯度且前所未見的圖像。然而,生成結果深受提示詞選擇影響,導致初學者難以掌握有效的提示詞設計。為此,本研究提出一套基於視覺化的輔助設計系統,協助使用者理解提示詞與圖像生成之間的關係,並提供優化的提示詞建議。我們利用 DiffusionDB 數據集並結合自然語言處理技術,分析提示詞語義,並運用 UMAP 將高維度提示詞關聯投影至直觀的二維視覺化空間。透過系統的動態迭代機制,使用者可隨時調整提示詞並即時觀察圖像變化,從而獲得創意啟發並創作出多樣的圖像。為了提供更多元的提示詞選擇,我們比較使用者輸入的提示詞與 DiffusionDB 的語義相似度,並進一步探討在標註實體任務中,GPT 模型在不同提示詞組合下的穩定度,以提升提示詞建議系統的可靠性。 In recent years, diffusion models have greatly improved text-to-image generation, allowing users to produce highly relevant and novel images through prompts. However, prompt design can be challenging for beginners. This study introduces a visualization-based assistive design system that leverages the DiffusionDB dataset and NLP techniques to analyze prompt semantics, using UMAP dimensionality reduction to create an interactive two-dimensional visualization of prompt relationships. Through iterative refinement, users can modify prompts and observe real-time image generation results, gaining creative inspiration for diverse outputs. By comparing user prompts with DiffusionDB via semantic similarity analysis, the system suggests various prompt options. Additionally, we examine GPT model stability under different prompt combinations in named entity annotation tasks to enhance the reliability of the prompt recommendation system. |
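The summary above outlines a concrete pipeline: prompts are embedded with an NLP model, projected into a two-dimensional view with UMAP, and compared against DiffusionDB prompts by semantic similarity to drive prompt suggestions. The following is a minimal sketch of that idea, not the thesis's actual implementation: the embedding model name ("all-MiniLM-L6-v2"), the UMAP settings, and the short prompt list standing in for DiffusionDB are illustrative assumptions.

```python
# Minimal sketch of the pipeline outlined in the abstract: embed prompts,
# project the embeddings to 2D with UMAP for a scatter-plot view, and
# suggest semantically similar prompts from a reference set.
# Assumptions: the model name, UMAP hyperparameters, and the toy prompt list
# are illustrative stand-ins for the thesis's setup and for DiffusionDB.
import numpy as np
import umap                                              # pip install umap-learn
from sentence_transformers import SentenceTransformer   # pip install sentence-transformers

# Toy stand-in for prompts sampled from DiffusionDB.
db_prompts = [
    "a castle on a hill, oil painting, highly detailed",
    "portrait of a cyberpunk samurai, neon lighting, 8k",
    "a cozy cabin in a snowy forest, warm light, photorealistic",
    "isometric voxel art of a small island village",
    "watercolor sketch of a rainy Tokyo street at night",
    "macro photo of a dew-covered spider web at sunrise",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
db_emb = model.encode(db_prompts, normalize_embeddings=True)   # shape: (N, 384)

# Project the high-dimensional prompt embeddings to 2D for visualization.
reducer = umap.UMAP(n_components=2, n_neighbors=3, metric="cosine",
                    init="random", random_state=42)
coords_2d = reducer.fit_transform(db_emb)                      # shape: (N, 2)

def suggest_prompts(user_prompt: str, top_k: int = 3):
    """Rank reference prompts by cosine similarity to the user's prompt."""
    q = model.encode([user_prompt], normalize_embeddings=True)[0]
    sims = db_emb @ q            # cosine similarity (embeddings are unit-normalized)
    order = np.argsort(-sims)[:top_k]
    return [(db_prompts[i], round(float(sims[i]), 3)) for i in order]

print(coords_2d)                 # points to place in the 2D prompt map
print(suggest_prompts("medieval fortress at sunset, oil painting"))
```

In the system described by the thesis, coordinates like these back an interactive view in which users iteratively adjust prompts and observe the generated images; here they are simply printed to show the data the visualization would consume.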
Description | 碩士 (Master's thesis); 國立政治大學 (National Chengchi University); 資訊科學系 (Department of Computer Science); 111753152
Source | http://thesis.lib.nccu.edu.tw/record/#G0111753152
Type | thesis |
URI | https://nccur.lib.nccu.edu.tw/handle/140.119/156487
Table of Contents |
Chapter 1 Introduction: 1.1 Research Motivation and Objectives; 1.2 Problem Statement; 1.3 Research Contributions; 1.4 Design Requirements
Chapter 2 Related Work: 2.1 Visualization Research on Text-to-Image Generation; 2.2 Stable Diffusion; 2.3 Text-to-Image Generation with Prompt Modifiers; 2.4 Divergence and Convergence Models in Creativity Support Systems; 2.5 Transformer-Based Models; 2.6 Semantic Alignment and Aesthetic Evaluation for Text-to-Image Generation
Chapter 3 Dataset: 3.1 DiffusionDB
Chapter 4 Research Methods and Procedure: 4.1 Data Preprocessing; 4.2 System Architecture (4.2.1 Prompt Analysis Module; 4.2.2 Visualization System and Result Analysis)
Chapter 5 Visualization Design: 5.1 Visualization Results Based on Prompt Combinations; 5.2 Visualization of Prompt Categories
Chapter 6 Experiments: 6.1 Keyword Extraction from the DiffusionDB Dataset; 6.2 Comparison of t-SNE and UMAP; 6.3 Prompt Engineering Experiments; 6.4 Iterative Prompt Optimization (6.4.1 Generation with Different Prompt Combinations; 6.4.2 Semantic Analysis and Clustering with UMAP; 6.4.3 Selecting User-Preferred Images; 6.4.4 Selecting Prompts from Different Categories); 6.5 Use Cases (6.5.1 Use Case 1; 6.5.2 Use Case 2)
Chapter 7 Conclusion and Future Outlook: 7.1 Conclusion; 7.2 Future Outlook
References
Format | application/pdf; 14702760 bytes