Title: 應用文字探勘於業配文揭露偵測 (Applying Text Mining to Sponsored Content Disclosure Detection)
Sponsored Content Detection with Text Mining Approach
Author: 洪御哲
Hung, Yu-Jhe
Contributors: 洪為璽
Hung, Wei-Hsi
Hung, Yu-Jhe
Keywords: Sponsored posts (業配文)
Sponsored Content
Content Marketing
Text Mining
Machine Learning
Natural Language Processing
Date: 2021
Uploaded: 1-Oct-2021
Abstract: Sponsored content purposefully integrates a brand or a brand's persuasive message into media content in exchange for compensation from a sponsor. With the spread of the Internet and mobile devices, social media has grown rapidly and given rise to many influencers, or key opinion leaders (KOLs), who hold substantial influence within their social networks. Because their content is highly personalized and controllable, companies invest resources in KOLs to gain product exposure and sales. However, sponsored content is often presented as a genuine personal review without disclosing the sponsorship. Consumers then take it for an authentic personal experience rather than a commercial arrangement, which may mislead them into making purchases. This research therefore examines whether a model can be built to identify articles that are likely undisclosed sponsored content. We first collected data on the top 100 bloggers on Pixnet and compiled a roster of bloggers who disclose sponsorship in their articles. We then collected all articles published by those bloggers and labeled each as sponsored or non-sponsored according to the disclosure wording it contains. The labeled data were used to train and compare a Support Vector Machine (SVM), a Convolutional Neural Network (CNN), and Bidirectional Encoder Representations from Transformers (BERT), the deep language model developed by Google. CNN achieved the highest average accuracy, 83.625%. In addition, on the undisclosed sponsored articles that we labeled ourselves, CNN detected sponsored content with an accuracy of 90.69%. Finally, we applied Layer-wise Relevance Propagation (LRP) to explain the CNN model, observing which words that frequently appear in sponsored content contribute most to a sponsored prediction and comparing the model's view with human judgment. The resulting characteristics of sponsored content are offered to help consumers judge what they read.
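The labeling step described in the abstract, marking an article as sponsored when it contains disclosure wording, can be illustrated with a minimal Python sketch. The phrase list, the function names `split_sentences` and `label_article`, and the choice to drop the disclosure sentence from the returned text are all assumptions made for illustration. The abstract does not give the actual phrases or preprocessing used in the thesis.

```python
import re

# Hypothetical disclosure phrases; the real list used in the thesis is not
# stated in the abstract.
DISCLOSURE_PATTERNS = [
    r"sponsored by",
    r"in collaboration with",
    r"合作文",
]
_DISCLOSURE_RE = re.compile("|".join(DISCLOSURE_PATTERNS), re.IGNORECASE)

# A sentence is a run of non-terminator characters plus an optional terminator
# (ASCII or full-width punctuation).
_SENTENCE_RE = re.compile(r"[^.!?。!?]+[.!?。!?]?")


def split_sentences(text):
    """Split text into trimmed, non-empty sentences."""
    return [s.strip() for s in _SENTENCE_RE.findall(text) if s.strip()]


def label_article(text):
    """Return (label, body): label is 1 if any sentence contains a disclosure
    phrase, else 0. body is the text with disclosure sentences removed, an
    assumed step so a classifier cannot simply memorize the disclosure wording.
    """
    sentences = split_sentences(text)
    kept = [s for s in sentences if not _DISCLOSURE_RE.search(s)]
    label = 1 if len(kept) < len(sentences) else 0
    return label, " ".join(kept)
```

Calling `label_article` on each collected article would yield (label, text) pairs that could then be passed to whichever classifier is being compared. The sentence splitter is deliberately crude and would mis-split abbreviations and decimals.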
