Title (題名): 基於穩定擴散模型之空拍影像生成互動編輯 / Interactive editing of aerial images generation based on Stable Diffusion Model
Author (作者): Huang, Hung-Chi (黃泓棋)
Advisor (貢獻者): Chi, Ming-Te (紀明德)
Keywords (關鍵詞): UAV imagery (無人機影像); cross-view generation (跨視角生成)
Date (日期): 2025
Uploaded (上傳時間): 1-Sep-2025 16:56:29 (UTC+8)

Abstract (摘要):
We present a complete pipeline that translates satellite imagery across views to the drone (UAV) perspective while supporting interactive local editing, targeting three weaknesses of existing diffusion-based generation: insufficient controllability, inconsistency between edited regions and the overall image, and the smoothing away of high-frequency texture. Methodologically, we integrate DIIF super-resolution, SAM mask segmentation, ControlNet structural conditioning, and SDXL + LoRA generation under a "super-resolve first, then generate" strategy: detail and scale are recovered first, controlled generation is then guided by text and multimodal conditions, and a domain-adapted LoRA maintains aerial style and geometric consistency. The system supports multi-region selection and zero-shot guidance from a single exemplar image. Experiments on xView (WorldView-3, GSD 0.3 m) validate the full pipeline: quantitative results show a marked drop in LPIPS at the cost of only a slight sacrifice in PSNR and SSIM, perceived texture and realism improve, and the global layout remains stable under the structural conditions. Overall, this work simultaneously achieves high realism, repeatable control, and consistency with the source image in both cross-view generation and region-controllable editing, offering a practical and extensible solution for remote-sensing-to-low-altitude image generation.

References (參考文獻):
[1] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
[2] J. Song, C. Meng, and S. Ermon, "Denoising diffusion implicit models," arXiv preprint arXiv:2010.02502, 2020.
[3] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10684–10695.
[4] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or, "Prompt-to-prompt image editing with cross attention control," arXiv preprint arXiv:2208.01626, 2022.
[5] O. Avrahami, O. Fried, and D. Lischinski, "Blended latent diffusion," ACM Trans. Graph. (TOG), vol. 42, no. 4, pp. 1–11, 2023.
[6] L. Zhang, A. Rao, and M. Agrawala, "Adding conditional control to text-to-image diffusion models," in Proc. IEEE/CVF Int. Conf. Computer Vision (ICCV), 2023, pp. 3836–3847.
[7] C. Corneanu, R. Gadde, and A. M. Martinez, "LatentPaint: Image inpainting in latent space with diffusion models," in Proc. IEEE/CVF Winter Conf. Applications of Computer Vision (WACV), Jan. 2024, pp. 4334–4343.
[8] Y. Wang, T. Su, Y. Li, J. Cao, G. Wang, and X. Liu, "DDistill-SR: Reparameterized dynamic distillation network for lightweight image super-resolution," IEEE Trans. Multimedia, vol. 25, pp. 7222–7234, 2022.
[9] Z. He and Z. Jin, "Dynamic implicit image function for efficient arbitrary-scale image representation," arXiv preprint arXiv:2306.12321, 2023.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 27, 2014.
[11] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," in Proc. IEEE/CVF Int. Conf. Computer Vision (ICCV), 2023, pp. 4015–4026.
[12] O. Avrahami, D. Lischinski, and O. Fried, "Blended diffusion for text-driven editing of natural images," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2022, pp. 18208–18218.
[13] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in Proc. Int. Conf. Machine Learning (ICML), PMLR, 2021, pp. 8748–8763.
[14] P. Esser, R. Rombach, and B. Ommer, "Taming transformers for high-resolution image synthesis," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2021, pp. 12873–12883.
[15] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[16] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," arXiv preprint arXiv:2106.09685, 2021.
[17] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, Y. Bulatov, and B. McCord, "xView: Objects in context in overhead imagery," arXiv preprint arXiv:1802.07856, 2018. [Online]. Available: https://arxiv.org/abs/1802.07856
[18] G. Li, "RiverAVSSD: River aerial view semantic segmentation dataset," National Center for High Performance Computing Data Platform, Aug. 2023. [Online]. Available: https://scidm.nchc.org.tw/dataset/riveravssd (accessed Jul. 22, 2025).
[19] quadeer15sh, "Augmented forest segmentation," Kaggle, Jun. 2025. [Online]. Available: https://www.kaggle.com/datasets/quadeer15sh/augmented-forest-segmentation (accessed Jul. 22, 2025).
[20] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4401–4410.
[21] S. Brade, B. Wang, M. Sousa, S. Oore, and T. Grossman, "Promptify: Text-to-image generation through interactive prompt exploration with large language models," in Proc. 36th Annu. ACM Symp. User Interface Software and Technology (UIST), 2023, pp. 1–14.
[22] Y. Feng, X. Wang, K. K. Wong, S. Wang, Y. Lu, M. Zhu, B. Wang, and W. Chen, "PromptMagician: Interactive prompt engineering for text-to-image creation," IEEE Trans. Visualization and Computer Graphics, 2023.
[23] A. Sauer, K. Schwarz, and A. Geiger, "StyleGAN-XL: Scaling StyleGAN to large diverse datasets," in ACM SIGGRAPH Conf. Proc., 2022, pp. 1–10.
[24] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, "Analyzing and improving the image quality of StyleGAN," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8110–8119.
[25] A. Sauer, T. Karras, S. Laine, A. Geiger, and T. Aila, "StyleGAN-T: Unlocking the power of GANs for fast large-scale text-to-image synthesis," in Proc. Int. Conf. Machine Learning (ICML), PMLR, 2023, pp. 30105–30118.
[26] Y. Lyu, T. Lin, F. Li, D. He, J. Dong, and T. Tan, "DeltaEdit: Exploring text-free training for text-driven image manipulation," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2023, pp. 6894–6903.
[27] R. Gal, O. Patashnik, H. Maron, A. H. Bermano, G. Chechik, and D. Cohen-Or, "StyleGAN-NADA: CLIP-guided domain adaptation of image generators," ACM Trans. Graph. (TOG), vol. 41, no. 4, pp. 1–13, 2022.
[28] Y. Nitzan, K. Aberman, Q. He, O. Liba, M. Yarom, Y. Gandelsman, I. Mosseri, Y. Pritch, and D. Cohen-Or, "MyStyle: A personalized generative prior," ACM Trans. Graph. (TOG), vol. 41, no. 6, pp. 1–10, 2022.
[29] K. Song, L. Han, B. Liu, D. Metaxas, and A. Elgammal, "Diffusion guided domain adaptation of image generators," arXiv preprint arXiv:2212.04473, 2022.
[30] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[31] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in Proc. Int. Conf. Machine Learning (ICML), PMLR, 2017, pp. 214–223.
[32] M. Mirza and S. Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.
[33] E. Richardson, Y. Alaluf, O. Patashnik, Y. Nitzan, Y. Azar, S. Shapiro, and D. Cohen-Or, "Encoding in style: A StyleGAN encoder for image-to-image translation," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2021, pp. 2287–2296.
[34] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1125–1134.
[35] A. Bochkovskii, A. Delaunoy, H. Germain, M. Santos, Y. Zhou, S. R. Richter, and V. Koltun, "DepthPro: Sharp monocular metric depth in less than a second," arXiv preprint arXiv:2410.02073, 2024.

Description (描述): Master's thesis (碩士)
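The abstract's "super-resolve first, then generate" pipeline can be sketched as a three-stage control flow. The sketch below is a minimal illustration, not the thesis implementation: every function body is a hypothetical placeholder standing in for the real components (DIIF super-resolution, SAM segmentation, and ControlNet-conditioned SDXL + LoRA generation); only the stage ordering and the mask-restricted editing step reflect the pipeline as described.

```python
# Control-flow sketch of the pipeline in the abstract. All bodies are
# placeholders; only the stage order (super-resolve -> segment -> generate)
# and the mask-restricted blending mirror the described method.
import numpy as np

def super_resolve(img: np.ndarray, scale: int = 4) -> np.ndarray:
    """Placeholder for DIIF arbitrary-scale super-resolution
    (here just nearest-neighbor upsampling)."""
    return img.repeat(scale, axis=0).repeat(scale, axis=1)

def segment_region(img: np.ndarray) -> np.ndarray:
    """Placeholder for a SAM mask; here a fixed central region
    stands in for a user-selected area."""
    h, w = img.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4] = True
    return mask

def generate(img: np.ndarray, mask: np.ndarray, prompt: str) -> np.ndarray:
    """Placeholder for ControlNet-conditioned SDXL + LoRA generation:
    only masked pixels are replaced, so the global layout is preserved."""
    edited = img.copy()
    edited[mask] = 255  # stand-in for generated content
    return edited

def edit_satellite_image(img: np.ndarray, prompt: str) -> np.ndarray:
    sr = super_resolve(img)            # stage 1: recover detail and scale
    mask = segment_region(sr)          # stage 2: region mask for local editing
    return generate(sr, mask, prompt)  # stage 3: controlled generation

out = edit_satellite_image(np.zeros((8, 8, 3), dtype=np.uint8),
                           "riverbank, drone view")
```

A real implementation would swap each placeholder for the corresponding model (e.g. an SDXL inpainting pipeline taking the mask and structural condition), but the key design choice shown here survives the swap: generation happens after super-resolution and is confined to the masked region, which is how the method keeps local edits consistent with the global layout.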
National Chengchi University (國立政治大學)
Department of Computer Science (資訊科學系)
Student ID: 111753159
Source (資料來源): http://thesis.lib.nccu.edu.tw/record/#G0111753159
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/159410
Type (資料類型): thesis
Table of contents (目次): Chapter 1 Introduction; Chapter 2 Related Work; Chapter 3 Research Methods and Procedures; Chapter 4 Experiments and Analysis; Chapter 5 Conclusion; References
Format: application/pdf, 14405280 bytes
