Title (題名): 基於穩定擴散模型之空拍影像生成互動編輯 / Interactive editing of aerial images generation based on Stable Diffusion Model
Author (作者): Huang, Hung-Chi (黃泓棋)
Advisor (貢獻者): Chi, Ming-Te (紀明德)
Keywords (關鍵詞): UAV imagery (無人機影像); cross-view generation (跨視角生成)
Date (日期): 2025
Uploaded (上傳時間): 1-Sep-2025 16:56:29 (UTC+8)

Abstract (摘要):
We present a complete pipeline that translates satellite imagery across views to the drone (UAV) perspective while supporting interactive local editing, targeting three weaknesses of existing diffusion-based generation: insufficient controllability, inconsistency between edited regions and the overall image, and the smoothing away of high-frequency texture. Methodologically, we integrate DIIF super-resolution, SAM mask segmentation, ControlNet structural conditioning, and SDXL + LoRA generation under a "super-resolve first, then generate" strategy: detail and scale are recovered first, controlled generation is then guided by text and multimodal conditions, and a domain-adapted LoRA maintains aerial style and geometric consistency. The system supports multi-region selection and zero-shot guidance from a single exemplar image. Experiments on xView (WorldView-3, GSD 0.3 m) validate the full pipeline: quantitative results show a marked drop in LPIPS at the cost of only a slight sacrifice in PSNR and SSIM, perceived texture and realism improve, and the global layout remains stable under the structural conditions. Overall, this work simultaneously achieves high realism, repeatable control, and consistency with the source image in both cross-view generation and region-controllable editing, offering a practical and extensible solution for remote-sensing-to-low-altitude image generation.

References (參考文獻):
[1] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
[2] J. Song, C. Meng, and S. Ermon, "Denoising diffusion implicit models," arXiv preprint arXiv:2010.02502, 2020.
[3] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10684–10695.
[4] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or, "Prompt-to-prompt image editing with cross attention control," arXiv preprint arXiv:2208.01626, 2022.
[5] O. Avrahami, O. Fried, and D. Lischinski, "Blended latent diffusion," ACM Trans. Graph. (TOG), vol. 42, no. 4, pp. 1–11, 2023.
[6] L. Zhang, A. Rao, and M. Agrawala, "Adding conditional control to text-to-image diffusion models," in Proc. IEEE/CVF Int. Conf. Computer Vision (ICCV), 2023, pp. 3836–3847.
[7] C. Corneanu, R. Gadde, and A. M. Martinez, "LatentPaint: Image inpainting in latent space with diffusion models," in Proc. IEEE/CVF Winter Conf. Applications of Computer Vision (WACV), Jan. 2024, pp. 4334–4343.
[8] Y. Wang, T. Su, Y. Li, J. Cao, G. Wang, and X. Liu, "DDistill-SR: Reparameterized dynamic distillation network for lightweight image super-resolution," IEEE Trans. Multimedia, vol. 25, pp. 7222–7234, 2022.
[9] Z. He and Z. Jin, "Dynamic implicit image function for efficient arbitrary-scale image representation," arXiv preprint arXiv:2306.12321, 2023.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 27, 2014.
[11] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," in Proc. IEEE/CVF Int. Conf. Computer Vision (ICCV), 2023, pp. 4015–4026.
[12] O. Avrahami, D. Lischinski, and O. Fried, "Blended diffusion for text-driven editing of natural images," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2022, pp. 18208–18218.
[13] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in Proc. Int. Conf. Machine Learning (ICML), PMLR, 2021, pp. 8748–8763.
[14] P. Esser, R. Rombach, and B. Ommer, "Taming transformers for high-resolution image synthesis," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2021, pp. 12873–12883.
[15] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[16] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," arXiv preprint arXiv:2106.09685, 2021.
[17] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, Y. Bulatov, and B. McCord, "xView: Objects in context in overhead imagery," arXiv preprint arXiv:1802.07856, 2018. [Online]. Available: https://arxiv.org/abs/1802.07856
[18] G. Li, "RiverAVSSD: River aerial view semantic segmentation dataset," National Center for High Performance Computing Data Platform, Aug. 2023. [Online]. Available: https://scidm.nchc.org.tw/dataset/riveravssd (accessed Jul. 22, 2025).
[19] quadeer15sh, "Augmented forest segmentation," Kaggle, Jun. 2025. [Online]. Available: https://www.kaggle.com/datasets/quadeer15sh/augmented-forest-segmentation (accessed Jul. 22, 2025).
[20] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4401–4410.
[21] S. Brade, B. Wang, M. Sousa, S. Oore, and T. Grossman, "Promptify: Text-to-image generation through interactive prompt exploration with large language models," in Proc. 36th Annu. ACM Symp. User Interface Software and Technology (UIST), 2023, pp. 1–14.
[22] Y. Feng, X. Wang, K. K. Wong, S. Wang, Y. Lu, M. Zhu, B. Wang, and W. Chen, "PromptMagician: Interactive prompt engineering for text-to-image creation," IEEE Trans. Visualization and Computer Graphics, 2023.
[23] A. Sauer, K. Schwarz, and A. Geiger, "StyleGAN-XL: Scaling StyleGAN to large diverse datasets," in ACM SIGGRAPH Conf. Proc., 2022, pp. 1–10.
[24] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, "Analyzing and improving the image quality of StyleGAN," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8110–8119.
[25] A. Sauer, T. Karras, S. Laine, A. Geiger, and T. Aila, "StyleGAN-T: Unlocking the power of GANs for fast large-scale text-to-image synthesis," in Proc. Int. Conf. Machine Learning (ICML), PMLR, 2023, pp. 30105–30118.
[26] Y. Lyu, T. Lin, F. Li, D. He, J. Dong, and T. Tan, "DeltaEdit: Exploring text-free training for text-driven image manipulation," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2023, pp. 6894–6903.
[27] R. Gal, O. Patashnik, H. Maron, A. H. Bermano, G. Chechik, and D. Cohen-Or, "StyleGAN-NADA: CLIP-guided domain adaptation of image generators," ACM Trans. Graph. (TOG), vol. 41, no. 4, pp. 1–13, 2022.
[28] Y. Nitzan, K. Aberman, Q. He, O. Liba, M. Yarom, Y. Gandelsman, I. Mosseri, Y. Pritch, and D. Cohen-Or, "MyStyle: A personalized generative prior," ACM Trans. Graph. (TOG), vol. 41, no. 6, pp. 1–10, 2022.
[29] K. Song, L. Han, B. Liu, D. Metaxas, and A. Elgammal, "Diffusion guided domain adaptation of image generators," arXiv preprint arXiv:2212.04473, 2022.
[30] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[31] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in Proc. Int. Conf. Machine Learning (ICML), PMLR, 2017, pp. 214–223.
[32] M. Mirza and S. Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.
[33] E. Richardson, Y. Alaluf, O. Patashnik, Y. Nitzan, Y. Azar, S. Shapiro, and D. Cohen-Or, "Encoding in style: A StyleGAN encoder for image-to-image translation," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2021, pp. 2287–2296.
[34] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1125–1134.
[35] A. Bochkovskii, A. Delaunoy, H. Germain, M. Santos, Y. Zhou, S. R. Richter, and V. Koltun, "DepthPro: Sharp monocular metric depth in less than a second," arXiv preprint arXiv:2410.02073, 2024.

Description (描述): Master's thesis (碩士)
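The abstract's "super-resolve first, then generate" pipeline can be sketched as a three-stage control flow. The sketch below is a minimal illustration, not the thesis implementation: every function body is a hypothetical placeholder standing in for the real components (DIIF super-resolution, SAM segmentation, and ControlNet-conditioned SDXL + LoRA generation); only the stage ordering and the mask-restricted editing step reflect the pipeline as described.

```python
# Control-flow sketch of the pipeline in the abstract. All bodies are
# placeholders; only the stage order (super-resolve -> segment -> generate)
# and the mask-restricted blending mirror the described method.
import numpy as np

def super_resolve(img: np.ndarray, scale: int = 4) -> np.ndarray:
    """Placeholder for DIIF arbitrary-scale super-resolution
    (here just nearest-neighbor upsampling)."""
    return img.repeat(scale, axis=0).repeat(scale, axis=1)

def segment_region(img: np.ndarray) -> np.ndarray:
    """Placeholder for a SAM mask; here a fixed central region
    stands in for a user-selected area."""
    h, w = img.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4] = True
    return mask

def generate(img: np.ndarray, mask: np.ndarray, prompt: str) -> np.ndarray:
    """Placeholder for ControlNet-conditioned SDXL + LoRA generation:
    only masked pixels are replaced, so the global layout is preserved."""
    edited = img.copy()
    edited[mask] = 255  # stand-in for generated content
    return edited

def edit_satellite_image(img: np.ndarray, prompt: str) -> np.ndarray:
    sr = super_resolve(img)            # stage 1: recover detail and scale
    mask = segment_region(sr)          # stage 2: region mask for local editing
    return generate(sr, mask, prompt)  # stage 3: controlled generation

out = edit_satellite_image(np.zeros((8, 8, 3), dtype=np.uint8),
                           "riverbank, drone view")
```

A real implementation would swap each placeholder for the corresponding model (e.g. an SDXL inpainting pipeline taking the mask and structural condition), but the key design choice shown here survives the swap: generation happens after super-resolution and is confined to the masked region, which is how the method keeps local edits consistent with the global layout.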
National Chengchi University (國立政治大學)
Department of Computer Science (資訊科學系)
Student ID: 111753159
Source (資料來源): http://thesis.lib.nccu.edu.tw/record/#G0111753159
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/159410
Type (資料類型): thesis
Table of contents (目次): Chapter 1 Introduction; Chapter 2 Related Work; Chapter 3 Research Methods and Procedures; Chapter 4 Experiments and Analysis; Chapter 5 Conclusion; References
Format: application/pdf, 14405280 bytes
