Title 結合肢體動作識別及擴散模型的文字生成舞蹈機制
Text-to-dance mechanism using human pose estimation and stable diffusion
Author 洪健庭 (Hung, Chien-Ting)
Advisor 廖文宏 (Liao, Wen-Hung)
Keywords 深度學習 (Deep Learning); 肢體辨識 (Human Pose Recognition); 生成式人工智慧 (Generative AI); 文字生成舞蹈 (Text-to-Dance)
Date 2024
Uploaded 3-Jun-2024 11:42:54 (UTC+8)
Abstract 肢體辨識在機器視覺領域是一個很重要的問題,如何在影像以及圖像中抓取人體骨骼的節點(如肩膀、手肘、手腕等)座標,不僅可以知道人物在圖像中的位置,還可藉由辨識結果去預測該人物在做什麼動作。擴散模型(Diffusion Model)在近年得到廣大的關注,最令人驚豔的是其在AIGC(AI Generated Content)領域的表現,許多文字生成圖片的應用都是基於擴散模型,包含DALL·E、Imagen、Midjourney和Stable Diffusion等。除了在圖片生成任務上表現出色之外,在其他生成任務上的效果也相當卓越。本論文探討使用Stable Diffusion和OpenPose來生成流暢的舞蹈動作:前者利用自定義文字產生人物外觀以及單位舞蹈動作的排序,並使用線性轉換的方式串接整體舞蹈動作;後者在連續舞蹈動作任務中進行肢體辨識,以利自由設定角色外觀以及排序舞蹈動作。結合上述方式,本論文提出的文字產生舞蹈動作方法,不僅為影像製作領域引入一種新的模式,更可以在製作過程中更方便地設定角色、場景以及角色動作;過往需要將每一幀繪畫出來,或由真人根據設定動作去呈現,若再加上需要更換角色的情況,本方法相比傳統方法可節省很多步驟及時間。這個方法不僅擴展了影像生成的研究範疇,同時結合AIGC的方法,為實際應用提供了一種可行的解決方案。
Pose estimation is a fundamental problem in computer vision: capturing the coordinates of skeletal joints (such as shoulders, elbows, and wrists) of a human body in images and videos not only locates the person in the frame but also enables predicting their actions from the recognized joints. In recent years, diffusion models have gained significant attention, particularly for their impressive performance in the field of AI-Generated Content (AIGC). Many text-to-image applications, including DALL·E, Imagen, Midjourney, and Stable Diffusion, are based on diffusion models, and these models perform strongly not only on image generation but also on many other generative tasks. This thesis explores the use of Stable Diffusion and OpenPose to generate fluent dance motion. The former generates custom character appearances and an ordered sequence of unit-level dance movements from custom text inputs; these movements are then concatenated using linear transformations to form a coherent overall dance sequence. The latter, OpenPose, performs pose estimation on continuous dance movements. Together they enable flexible configuration of character appearance and flexible sequencing of dance movements. Combining these components, the proposed text-to-dance method not only introduces a new workflow into video production but also makes it easier to choose characters, scenes, and character actions during production. Previously, every frame had to be drawn by hand, or a performer had to act out the prescribed movements; when the character also needs to be changed, our method saves many steps and much time compared to traditional approaches. In conjunction with AIGC methods, the proposed mechanism provides a viable solution for practical applications.
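The record includes no code, but the abstract describes bridging the last pose of one unit dance move to the first pose of the next via linear interpolation. The sketch below illustrates that step on 2D keypoints; the function names, the three-keypoint toy poses, and the frame count are illustrative assumptions, not the thesis's actual implementation.

```python
def lerp_pose(pose_a, pose_b, t):
    """Linearly interpolate one pose; each entry is an (x, y) keypoint."""
    return [((1 - t) * xa + t * xb, (1 - t) * ya + t * yb)
            for (xa, ya), (xb, yb) in zip(pose_a, pose_b)]

def transition_frames(pose_a, pose_b, num_frames):
    """Generate num_frames interior poses bridging pose_a to pose_b
    (endpoints excluded, so existing unit-move frames are untouched)."""
    return [lerp_pose(pose_a, pose_b, (i + 1) / (num_frames + 1))
            for i in range(num_frames)]

# Two toy "poses" with three keypoints each (e.g. shoulder, elbow, wrist).
end_of_move = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
start_of_next = [(10.0, 10.0), (20.0, 10.0), (20.0, 20.0)]
frames = transition_frames(end_of_move, start_of_next, 4)
```

In a pipeline like the one the abstract outlines, each interpolated skeleton could then be rendered as an OpenPose-style conditioning image for Stable Diffusion (e.g. via ControlNet) to produce the in-between video frames.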
References
[1] Cao, Y., Li, S., Liu, Y., Yan, Z., Dai, Y., Yu, P. S., & Sun, L. (2023). A comprehensive survey of AI-generated content (AIGC): A history of generative AI from GAN to ChatGPT. arXiv preprint arXiv:2303.04226.
[2] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 27.
[3] Cao, Z., Simon, T., Wei, S. E., & Sheikh, Y. (2017). Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7291-7299).
[4] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840-6851.
[5] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10684-10695).
[6] Zhang, L., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543.
[7] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., ... & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
[8] Li, R., Yang, S., Ross, D. A., & Kanazawa, A. (2021). AI Choreographer: Music conditioned 3D dance generation with AIST++. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 13401-13412).
[9] Tseng, J., Castellon, R., & Liu, K. (2023). EDGE: Editable dance generation from music. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 448-458).
[10] Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., & Bermano, A. H. (2022). Human motion diffusion model. arXiv preprint arXiv:2209.14916.
[11] Wang, T., Li, L., Lin, K., Lin, C. C., Yang, Z., Zhang, H., ... & Wang, L. (2023). DisCo: Disentangled control for referring human dance generation in real world. arXiv preprint arXiv:2307.00040.
[12] Zhang, M., Guo, X., Pan, L., Cai, Z., Hong, F., Li, H., ... & Liu, Z. (2023). ReMoDiffuse: Retrieval-augmented motion diffusion model. arXiv preprint arXiv:2304.01116.
[13] Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[14] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 27.
[15] Song, J., Meng, C., & Ermon, S. (2020). Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.
[16] Esser, P., Rombach, R., & Ommer, B. (2021). Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12873-12883).
[17] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Part III (pp. 234-241). Springer International Publishing.
[18] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
[19] Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37-52.
[20] Hung-yi Lee, Machine Learning 2023 (Generative AI) lecture playlist. https://youtube.com/playlist?list=PLJV_el3uVTsOePyfmkfivYZ7Rqr2nMk3W&si=bLQJWEJsVmMG1HL3
[21] Hugging Face – The AI community building the future. https://huggingface.co/
[22] Civitai – Stable Diffusion models, embeddings, LoRAs and more. https://civitai.com/
[23] Wikipedia: Linear interpolation. https://en.wikipedia.org/wiki/Linear_interpolation
Description Master's thesis (碩士)
National Chengchi University (國立政治大學)
In-service Master's Program, Department of Computer Science (資訊科學系碩士在職專班)
110971024
Source http://thesis.lib.nccu.edu.tw/record/#G0110971024
Type thesis
dc.contributor.advisor 廖文宏 [zh_TW]
dc.contributor.advisor Liao, Wen-Hung [en_US]
dc.contributor.author (Authors) 洪健庭 [zh_TW]
dc.contributor.author (Authors) Hung, Chien-Ting [en_US]
dc.creator (Author) 洪健庭 [zh_TW]
dc.creator (Author) Hung, Chien-Ting [en_US]
dc.date (Date) 2024 [en_US]
dc.date.accessioned 3-Jun-2024 11:42:54 (UTC+8)
dc.date.available 3-Jun-2024 11:42:54 (UTC+8)
dc.date.issued (Upload time) 3-Jun-2024 11:42:54 (UTC+8)
dc.identifier (Other Identifiers) G0110971024 [en_US]
dc.identifier.uri (URI) https://nccur.lib.nccu.edu.tw/handle/140.119/151504
dc.description (Description) 碩士 [zh_TW]
dc.description (Description) 國立政治大學 [zh_TW]
dc.description (Description) 資訊科學系碩士在職專班 [zh_TW]
dc.description (Description) 110971024 [zh_TW]
dc.description.tableofcontents (Table of Contents) [zh_TW]
Abstract i
Table of Contents iii
List of Figures vi
List of Tables viii
Chapter 1: Introduction 1
  1.1 Research Background and Motivation 1
  1.2 Research Objectives 1
  1.3 Thesis Structure 3
Chapter 2: Related Work and Technical Background 4
  2.1 Background and Techniques of Human Pose Recognition 4
    2.1.1 OpenPose Theory and Techniques 5
    2.1.2 OpenPose Algorithm 6
  2.2 Techniques and Background of Generative AI 7
    2.2.1 Variational Auto-Encoder 8
    2.2.2 Generative Adversarial Network 10
  2.3 Diffusion Model Theory and Techniques 12
  2.4 Low-Rank Adaptation (LoRA) of Large Language Models 17
  2.5 ControlNet 18
  2.6 Comparison with Related Dance-Generation Work 20
  2.7 Linear Interpolation 23
  2.8 Evaluation Metrics 24
    2.8.1 Object Keypoint Similarity 25
    2.8.2 Average Precision 26
Chapter 3: Methodology 27
  3.1 Basic Concept 27
  3.2 Preliminary Study 27
    3.2.1 Comparison of Generative Models 28
    3.2.2 Comparison of Image-Generation Tools 29
    3.2.3 Stable Diffusion Implementation Overview 30
  3.3 Research Architecture Design 33
    3.3.1 Problem Statement 33
    3.3.2 Research Architecture 33
  3.4 Goal Setting 35
Chapter 4: Research Process and Experimental Results 36
  4.1 Experimental Environment 36
  4.2 Research Results 37
    4.2.1 Usage and Configuration 37
    4.2.2 Results with Different Stable Diffusion Models 45
    4.2.3 Results after Adding ControlNet 47
    4.2.4 Prompt Settings and the Use of LoRA and Latent Upscale 49
    4.2.5 Generation and Concatenation of Unit Dance Movements 54
    4.2.6 Maintaining Visual Consistency 59
  4.3 Analysis of Results 59
Chapter 5: Conclusions and Future Work 61
  5.1 Conclusions 61
  5.2 Future Research Directions 62
References 64
Appendix 67
dc.format.extent 4718935 bytes
dc.format.mimetype application/pdf
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G0110971024 [en_US]
dc.subject (Keywords) 深度學習 [zh_TW]
dc.subject (Keywords) 肢體辨識 [zh_TW]
dc.subject (Keywords) 生成式人工智慧 [zh_TW]
dc.subject (Keywords) 文字生成舞蹈 [zh_TW]
dc.subject (Keywords) Deep Learning [en_US]
dc.subject (Keywords) Human Pose Recognition [en_US]
dc.subject (Keywords) Generative AI [en_US]
dc.subject (Keywords) Text-to-Dance [en_US]
dc.title (Title) 結合肢體動作識別及擴散模型的文字生成舞蹈機制 [zh_TW]
dc.title (Title) Text-to-dance mechanism using human pose estimation and stable diffusion [en_US]
dc.type (Type) thesis [en_US]