題名 (Title) 基於圖像資訊之音樂資訊檢索研究
A study of image-based music information retrieval
作者 (Author) 夏致群
貢獻者 (Contributor) 蔡銘峰
Tsai, Ming-Feng
夏致群
關鍵詞 (Keywords) 音樂資訊檢索
跨多媒體檢索
卷積神經網絡
資訊網路表示法學習
Music information retrieval
Cross-media retrieval
Convolutional neural network
Network embedding
日期 (Date) 2017
上傳時間 (Upload time) 2-Oct-2017 10:16:01 (UTC+8)
摘要 (Abstract) 以往的音樂資訊檢索方法多使用歌詞、曲風、演奏的樂器或一段音頻訊號來當作查詢的媒介,然而,在某些情況下,使用者沒有辦法清楚描述他們想要尋找的歌曲,如:情境式的音樂檢索。本論文提出了一種基於圖像的情境式音樂資訊檢索方法,可以透過輸入圖片來找尋相應的音樂。此方法中我們使用了卷積神經網絡(Convolutional Neural Network)技術來處理圖片,將其轉為低維度的表示法。為了將異質性的多媒體訊息映射到同一個向量空間,資訊網路表示法學習(Network Embedding)技術也被使用,如此一來,可以使用距離計算找回和輸入圖片有關的多媒體訊息。我們相信這樣的方法可以改善異質性資訊間的隔閡(Heterogeneous Gap),也就是指不同種類的多媒體檔案之間無法互相轉換或詮釋。在實驗與評估方面,首先利用從歌詞與歌名得到的關鍵字來搜尋大量圖片當作訓練資料集,接著實作提出的檢索方法,並針對實驗結果做評估。除了對此方法的有效性做測試外,使用者的回饋也顯示此檢索方法和其他方法相比是有效的。同時我們也實作了一個網路原型,使用者可以上傳圖片並得到檢索後的歌曲,實際的使用案例也將在本論文中被展示與介紹。
Listening to music is indispensable to everyone, and music information retrieval systems help users find their favorite music. A common scenario for music information retrieval systems is to search for songs based on a user's query. Most existing methods use descriptions (e.g., genre, instrument, and lyrics) or the audio signal of music as the query; the songs related to the query are then retrieved. The limitation of this scenario is that users may find it difficult to describe what they really want to search for. In this paper, we propose a novel method, called "image2song", which allows users to input an image to retrieve related songs. The proposed method consists of three modules: a convolutional neural network (CNN) module, a network embedding module, and a similarity calculation module. To process the images, a CNN is adopted to learn representations for them. To map each entity (e.g., image, song, and keyword) into the same embedding space, a heterogeneous representation is learned from the information graph by a network embedding algorithm. The method is flexible because other types of multimedia data can easily be added to the information graph. In the similarity calculation module, Euclidean distance and cosine distance are used as the criteria for comparing similarity, and the most relevant songs are then retrieved according to this calculation. The experimental results show that the proposed method performs well. Furthermore, we also build an online image-based music information retrieval prototype system that showcases some examples of our experiments.
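To make the retrieval step concrete, the following is a minimal sketch of the similarity-calculation module described in the abstract, assuming the CNN and network embedding modules have already mapped the query image and the songs into a shared vector space; the names image_vec, song_vecs, and retrieve_songs are illustrative and not taken from the thesis.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    """Euclidean (L2) distance between two vectors."""
    return float(np.linalg.norm(a - b))

def retrieve_songs(image_vec, song_vecs, metric=cosine_distance, top_k=5):
    """Rank songs by their distance to the query image in the shared embedding space.

    image_vec: embedding of the query image (output of the CNN and network embedding steps).
    song_vecs: dict mapping a song identifier to its embedding vector.
    """
    distances = {song_id: metric(image_vec, vec) for song_id, vec in song_vecs.items()}
    # A smaller distance means the song is more relevant to the query image.
    return sorted(distances, key=distances.get)[:top_k]

# Illustrative usage with random vectors standing in for learned embeddings.
rng = np.random.default_rng(0)
image_vec = rng.normal(size=128)
song_vecs = {f"song_{i}": rng.normal(size=128) for i in range(100)}
print(retrieve_songs(image_vec, song_vecs))
```

Either distance function can be passed as the `metric` argument, mirroring the abstract's use of both Euclidean and cosine distance as similarity criteria.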
參考文獻 (References) [1] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, pages 585–591, 2002.
[2] S. Bhagat, G. Cormode, and S. Muthukrishnan. Node classification in social networks. In Social Network Data Analytics, pages 115–148. Springer, 2011.
[3] S. Cao, W. Lu, and Q. Xu. Deep neural networks for learning graph representations. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[4] T. F. Cox and M. A. Cox. Multidimensional scaling. CRC Press, 2000.
[5] S. Dieleman and B. Schrauwen. End-to-end learning for music audio. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 6964–6968. IEEE, 2014.
[6] J. Dong, X. Li, and C. G. Snoek. Word2VisualVec: Cross-media retrieval by visual feature prediction. arXiv preprint arXiv:1604.06838, 2016.
[7] J. Foote. An overview of audio information retrieval. Multimedia Systems, 7(1):2–10, 1999.
[8] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.
[9] J. Jeon, V. Lavrenko, and R. Manmatha. Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 119–126. ACM, 2003.
[10] M. Kaminskas and F. Ricci. Contextual music information retrieval and recommendation: State of the art and challenges. Computer Science Review, 6(2):89–119, 2012.
[11] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[13] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the Association for Information Science and Technology, 58(7):1019–1031, 2007.
[14] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[15] A. Ogino and Y. Yamashita. Emotion-based music information retrieval using lyrics. In IFIP International Conference on Computer Information Systems and Industrial Management, pages 613–622. Springer, 2015.
[16] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.
[17] J. Qi, X. Huang, and Y. Peng. Cross-media retrieval by multimodal representation fusion with deep networks. In International Forum of Digital TV and Wireless Multimedia Communication, pages 218–227. Springer, 2016.
[18] F. Raposo, R. Ribeiro, and D. M. de Matos. Using generic summarization to improve music information retrieval tasks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(6):1119–1128, 2016.
[19] S. Ruger. Multimedia information retrieval. Synthesis Lectures on Information Concepts, Retrieval, and Services, 1(1):1–171, 2009.
[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[22] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. ACM, 2015.
[23] J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[24] R. Typke, F. Wiering, and R. C. Veltkamp. A survey of music information retrieval systems. In Proc. 6th International Conference on Music Information Retrieval, pages 153–160. Queen Mary, University of London, 2005.
[25] F. Wu, X. Lu, J. Song, S. Yan, Z. M. Zhang, Y. Rui, and Y. Zhuang. Learning of multimodal representations with random walks on the click graph. IEEE Transactions on Image Processing, 25(2):630–642, 2016.
[26] X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han. Personalized entity recommendation: A heterogeneous information network approach. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 283–292. ACM, 2014.
描述 (Description) 碩士 (Master's thesis)
國立政治大學 (National Chengchi University)
資訊科學學系 (Department of Computer Science)
104753001
資料來源 (Source) http://thesis.lib.nccu.edu.tw/record/#G0104753001
資料類型 (Type) thesis
dc.identifier (Other Identifiers) G0104753001en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/113292-
dc.description.tableofcontents Chapter 1 Introduction 1
Chapter 2 Related Work 5
Section 1 Music Information Retrieval 5
Section 2 Cross-media Retrieval 6
Section 3 Convolutional Neural Network 6
Section 4 Network Embedding 7
Chapter 3 Methodology 9
Section 1 Terminology 10
Section 2 Convolutional Neural Network Module 11
Section 3 Network Embedding Module 12
Section 4 Similarity Calculation Module 13
Chapter 4 Experimental Results 17
Section 1 The Implementation of Web-based Retrieval 17
Section 2 Experimental Settings 18
Section 3 Experimental Results 19
Section 4 Case Study 22
Chapter 5 Conclusions 25
zh_TW
dc.format.extent 10838226 bytes-
dc.format.mimetype application/pdf-