Title: 利用卷積式注意力機制語言模型為影片生成鋼琴樂曲 (Generating Piano Music for Videos with a Convolutional Attention-Based Language Model)
InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer
Author: Lin, Chin-Tung (林鑫彤)
Advisor: Shan, Man-Kwan (沈錳坤)
Keywords: music generation for video
music generation
convolutional attention model
piano score generation
video soundtrack
Video-Music Transformer
VMT
InverseMV
VMT Model
Convolutional Video-Music Transformer
Date: 2019
Uploaded: 5-Sep-2019 16:14:39 (UTC+8)
Abstract (translated from the Chinese): In recent years, smartphone camera technology has matured, and with the rise of social networking sites such as Facebook and Instagram, users can easily shoot high-quality photos and videos with their phones and share them online. A high-traffic video usually comes with well-matched music, yet most people are not professional music supervisors; limited by their music collections and their musical sensitivity, they often struggle to choose a soundtrack for a video. Using existing music as a soundtrack is further restricted by copyright, so automatic music generation for video soundtracks is emerging as a new research direction.
With the recent rapid development of neural networks (NN), many studies have attempted to generate symbolic music with neural network models, but to the best of our knowledge no one has yet tried to generate music for video. In the absence of a ready-made dataset, we manually collected and annotated a pop-music dataset as training data for our model. Motivated by the success of the attention-based Transformer on natural language processing (NLP) problems, and by the close parallel between symbolic music generation and language generation, this study proposes VMT (Video-Music Transformer), a model that automatically composes a soundtrack for a video: it takes the video's frame sequence as input and generates the corresponding symbolic piano music. Our experiments also show that the VMT model outperforms a sequence-to-sequence model in music smoothness and video relevance.
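The record gives no details of the symbolic representation beyond "symbolic piano music" and MIDI files. As a hedged illustration only, the sketch below converts a piano MIDI file into a note-on/note-off/time-shift event sequence in the style of Oore et al. [13]; the function name, the 50 ms time grid, and the pretty_midi dependency are assumptions made for this sketch, not details taken from the thesis.

import pretty_midi  # third-party MIDI parsing library

def midi_to_events(path, time_step=0.05):
    """Flatten a piano MIDI file into (event_type, value) tokens.

    Assumed scheme: note_on/note_off events carry a pitch, and
    time_shift events advance the clock in multiples of time_step
    seconds, mirroring common performance encodings [13].
    """
    midi = pretty_midi.PrettyMIDI(path)
    boundaries = []
    for inst in midi.instruments:
        for note in inst.notes:
            boundaries.append((note.start, "note_on", note.pitch))
            boundaries.append((note.end, "note_off", note.pitch))
    boundaries.sort()  # chronological order

    events, prev_t = [], 0.0
    for t, kind, pitch in boundaries:
        steps = round((t - prev_t) / time_step)
        if steps > 0:
            events.append(("time_shift", steps))  # advance the clock
        events.append((kind, pitch))
        prev_t = t
    return events

A vocabulary built over such events (for example, 128 pitches for each of note_on/note_off plus a bounded range of time shifts) gives a sequence model a discrete target to emit.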
Abstract (English): With the wide popularity of social media, including Facebook, Twitter, Instagram, and YouTube, and the modernization of mobile photography, users on social media tend to watch and share videos rather than text. People want their videos to achieve a high click-through rate, but such videos require strong editing skills and perfectly matched music, both of which are difficult for most people. On top of that, soundtrack creators often do not own the musical pieces they use; generating music with a model, instead of reusing existing recordings, helps avoid copyright infringement.
The rise of deep learning has produced much work that uses neural-network models to generate symbolic music. However, to the best of our knowledge, no prior work has tried to compose music for video, and no dataset of paired video and music exists. We therefore release a new dataset of over 7 hours of piano scores finely aligned between pop music videos and MIDI files. We propose VMT (Video-Music Transformer), a model that generates piano scores from video frames, evaluate it against a seq2seq baseline, and obtain better music smoothness and video relevance.
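The abstract names the components (a convolutional front end over video frames and a Transformer that emits piano-score tokens) but this record contains no architecture details. The following PyTorch sketch is a minimal, assumption-laden rendering of such a convolutional video-music Transformer; every layer size, the vocabulary size, and the use of a decoder with cross-attention to frame features are illustrative guesses, not the thesis's actual VMT design.

import torch
import torch.nn as nn

VOCAB_SIZE = 512   # assumed size of a MIDI-like event vocabulary
D_MODEL = 256      # assumed model width

class FrameEncoder(nn.Module):
    """Embeds each RGB frame into a D_MODEL-dim vector with a small CNN."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, D_MODEL)

    def forward(self, frames):                          # (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        x = self.conv(frames.flatten(0, 1)).flatten(1)  # (b*t, 64)
        return self.proj(x).view(b, t, D_MODEL)         # (b, t, D_MODEL)

class VideoMusicTransformer(nn.Module):
    """Decodes music tokens autoregressively, cross-attending to frame features.

    Positional encodings are omitted for brevity; a working model would
    need them for both the frame and token sequences.
    """
    def __init__(self):
        super().__init__()
        self.frame_encoder = FrameEncoder()
        self.token_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.out = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, frames, tokens):
        memory = self.frame_encoder(frames)  # video features as decoder memory
        tgt = self.token_embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.out(self.decoder(tgt, memory, tgt_mask=mask))

# Toy usage: 2 clips of 8 frames at 64x64, with 16 music tokens so far.
model = VideoMusicTransformer()
logits = model(torch.randn(2, 8, 3, 64, 64),
               torch.randint(0, VOCAB_SIZE, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 512])

Trained with cross-entropy against aligned event tokens and decoded greedily or by sampling, a model of this shape could be compared against a seq2seq baseline in the way the abstract describes.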
References:
[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[2] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, MuseGAN: Symbolic-domain music generation and accompaniment with multi-track sequential generative adversarial networks. arXiv preprint arXiv:1709.06298, 2017.
[3] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, Neural audio synthesis of musical notes with wavenet autoencoders. Proceedings of the 34th International Conference on Machine Learning-Volume 70, 2017.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets. Advances in neural information processing systems, 2014.
[5] G. Hadjeres, F. Pachet, and F. Nielsen, DeepBach: a Steerable Model for Bach chorales generation. arXiv preprint arXiv:1612.01010, 2016.
[6] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, and T. N. Sainath, Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82-97, 2012.
[7] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks. Advances in neural information processing systems, 2012.
[9] F.-F. Kuo, M.-F. Chiang, M.-K. Shan, and S.-Y. Lee, Emotion-based music recommendation by association discovery from film music. Proceedings of the 13th annual ACM international conference on Multimedia, 2005.
[10] J.-C. Lin, W.-L. Wei, and H.-M. Wang, EMV-matchmaker: emotional temporal course modeling and matching for automatic music video generation. Proceedings of the 23rd ACM international conference on Multimedia, 2015.
[11] O. Mogren, C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904, 2016.
[12] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[13] S. Oore, I. Simon, S. Dieleman, D. Eck, and K. Simonyan, This time with feeling: learning expressive musical performance. Neural Computing and Applications, 1-13, 2018.
[14] P. M. Todd, A connectionist approach to algorithmic composition. Computer Music Journal, 13(4), 27-43, 1989.
[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need. Advances in neural information processing systems, 2017.
Description: Master's thesis
National Chengchi University
Department of Computer Science
105753023
Source: http://thesis.lib.nccu.edu.tw/record/#G0105753023
Type: thesis
Identifier: G0105753023
URI: http://nccur.lib.nccu.edu.tw/handle/140.119/125641
Table of Contents:
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Methodology
1.4 Contributions
Chapter 2 Related Work
2.1 Background Music Recommendation
2.2 Automatic Music Composition
2.3 Deep Learning
Chapter 3 Methodology
3.1 Data Preprocessing
3.2 Convolutional Video-Music Transformer
3.3 Seq2seq (Baseline)
Chapter 4 Dataset
4.1 Data Collection and Processing
4.2 Video-Music Alignment
4.3 Dataset Overview
Chapter 5 Experimental Design
5.1 Model Training
5.2 Evaluation Methods
5.3 Experimental Results
5.3.1 User Bias
5.3.2 Problem of Seq2seq
Chapter 6 Conclusion
References
Appendix
File size: 3160912 bytes
Format: application/pdf
DOI: 10.6814/NCCU201901153