Title: 利用卷積式注意力機制語言模型為影片生成鋼琴樂曲
Title (English): InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer
Author: 林鑫彤 (Lin, Chin-Tung)
Advisor: 沈錳坤 (Shan, Man-Kwan)
Keywords: music generation for video; music generation; convolutional attention (Transformer) model; piano score generation; video soundtrack; Video-Music Transformer; VMT; InverseMV; VMT Model; Convolutional Video-Music Transformer
Date: 2019
Uploaded: 5-Sep-2019 16:14:39 (UTC+8)
Abstract (Chinese, translated): In recent years smartphone cameras have matured, and with the rise of social networking sites such as Facebook and Instagram, users can easily shoot high-quality photos and videos on their phones and share them online. A high-traffic video usually comes with well-matched music, but most people are not professional music supervisors; limited by the music material they can collect and by their musical sensitivity, they often have trouble choosing a soundtrack for a video. Using existing music as a soundtrack is further restricted by copyright, so automatic music generation for video soundtracks is becoming a new research direction. With the rapid development of neural networks (NN) in recent years, many studies have tried to generate symbolic music with neural network models, but to the best of our knowledge no one has yet tried to generate music for video. In the absence of a ready-made dataset, we manually collected and annotated a pop music dataset as training data for our model. Motivated by the success of the attention-based Transformer on natural language processing (NLP) problems, and by the similarity between symbolic music generation and language generation, this study proposes VMT (Video-Music Transformer), a model that automatically composes a soundtrack for a video: it takes the video's frame sequence as input and generates the corresponding symbolic piano music. Our experiments also show that the VMT model outperforms a sequence-to-sequence model in musical smoothness and in how well the music matches the video.
Abstract (English): With the wide popularity of social media such as Facebook, Twitter, Instagram, and YouTube, and the maturing of mobile photography, users increasingly watch and share videos rather than text. People want their videos to attract a high click-through rate, but such videos require strong editing skills and well-matched music, which is difficult for most people. In addition, people creating soundtracks usually do not own the rights to existing musical pieces; generating music with a model instead of reusing existing music helps avoid copyright infringement. The rise of deep learning has produced much work on generating symbolic music with neural network models. However, to the best of our knowledge, no prior work attempts to compose music for video, and no dataset pairs video with music. We therefore release a new dataset of over 7 hours of piano scores with fine alignment between pop music videos and MIDI files. We propose VMT (Video-Music Transformer), a model that generates piano scores from video frames, and compare it against a seq2seq baseline; VMT produces music that is smoother and more relevant to the video.
References:
[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[2] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang. MuseGAN: Symbolic-domain music generation and accompaniment with multi-track sequential generative adversarial networks. arXiv preprint arXiv:1709.06298, 2017.
[3] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan. Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, 2017.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[5] G. Hadjeres, F. Pachet, and F. Nielsen. DeepBach: A steerable model for Bach chorales generation. arXiv preprint arXiv:1612.01010, 2016.
[6] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, and T. N. Sainath. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82-97, 2012.
[7] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[9] F.-F. Kuo, M.-F. Chiang, M.-K. Shan, and S.-Y. Lee. Emotion-based music recommendation by association discovery from film music. In Proceedings of the 13th Annual ACM International Conference on Multimedia, 2005.
[10] J.-C. Lin, W.-L. Wei, and H.-M. Wang. EMV-matchmaker: Emotional temporal course modeling and matching for automatic music video generation. In Proceedings of the 23rd ACM International Conference on Multimedia, 2015.
[11] O. Mogren. C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904, 2016.
[12] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[13] S. Oore, I. Simon, S. Dieleman, D. Eck, and K. Simonyan. This time with feeling: Learning expressive musical performance. Neural Computing and Applications, 1-13, 2018.
[14] P. M. Todd. A connectionist approach to algorithmic composition. Computer Music Journal, 13(4), 27-43, 1989.
[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
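To make the architecture described in the abstract more concrete, the following is a minimal, illustrative sketch of a convolutional frame encoder feeding a Transformer decoder that autoregressively predicts symbolic piano-music tokens. It is not the thesis's released code: the framework (PyTorch), the event vocabulary size, the layer sizes, and all module names are assumptions chosen only to show the overall shape of such a video-to-music model.

```python
# Illustrative sketch only, not the thesis's VMT implementation.
# Assumptions: PyTorch, a MIDI-like event vocabulary, small toy layer sizes.
import torch
import torch.nn as nn

VOCAB_SIZE = 388   # assumed event vocabulary (note-on/off, time-shift, velocity bins)
D_MODEL = 256      # assumed model width

class FrameEncoder(nn.Module):
    """Turn each video frame into one D_MODEL-dimensional feature vector."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),             # global average pool per frame
        )
        self.proj = nn.Linear(64, D_MODEL)

    def forward(self, frames):                   # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.conv(frames.flatten(0, 1))  # fold time into the batch dimension
        return self.proj(feats.flatten(1)).view(b, t, D_MODEL)

class VideoMusicTransformer(nn.Module):
    """Decoder attends over frame features and predicts the next music token.
    Positional encodings are omitted here for brevity."""
    def __init__(self):
        super().__init__()
        self.encoder = FrameEncoder()
        self.token_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=8, dim_feedforward=512,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, frames, tokens):           # tokens: (batch, seq) emitted events so far
        memory = self.encoder(frames)             # frame features act as cross-attention memory
        seq = tokens.size(1)
        causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        hidden = self.decoder(self.token_emb(tokens), memory, tgt_mask=causal)
        return self.head(hidden)                  # (batch, seq, VOCAB_SIZE) next-token logits

# Toy usage: 8 frames of a 64x64 clip and a 16-token music prefix.
model = VideoMusicTransformer()
logits = model(torch.randn(2, 8, 3, 64, 64), torch.randint(0, VOCAB_SIZE, (2, 16)))
print(logits.shape)   # torch.Size([2, 16, 388])
```

A sketch like this would be trained with teacher forcing (cross-entropy on the next music token given the frames and the token prefix) and decoded autoregressively at inference time; the actual VMT preprocessing, tokenization, and hyperparameters are described in Chapters 3-5 of the thesis.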
Description: Master's thesis, National Chengchi University (國立政治大學), Department of Computer Science (資訊科學系), student ID 105753023
Source: http://thesis.lib.nccu.edu.tw/record/#G0105753023
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/125641
Type: thesis
Table of Contents: Abstract (Chinese); Abstract (English); Table of Contents; List of Figures; List of Tables; Chapter 1 Introduction (1.1 Background, 1.2 Motivation, 1.3 Research Method, 1.4 Research Contributions); Chapter 2 Related Work (2.1 Background Music Recommendation, 2.2 Automatic Music Composition, 2.3 Deep Learning); Chapter 3 Methodology (3.1 Data Preprocessing, 3.2 Convolutional Video-Music Transformer, 3.3 Seq2seq (Baseline)); Chapter 4 Dataset (4.1 Data Collection and Processing, 4.2 Video-Music Alignment, 4.3 Dataset Description); Chapter 5 Experimental Design (5.1 Model Training, 5.2 Evaluation Method, 5.3 Experimental Results, 5.3.1 User Bias, 5.3.2 Problem of Seq2seq); Chapter 6 Conclusion; References; Appendix
DOI: 10.6814/NCCU201901153