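Using Exploratory Data Analysis and Support Vector Machine to Build Media Classifiers on Sport News (運用資料探勘及支持向量機建立運動新聞媒體分類器)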

Title	Using Exploratory Data Analysis and Support Vector Machine to Build Media Classifiers on Sport News (運用資料探勘及支持向量機建立運動新聞媒體分類器)
Author	褚承威 Chu, Cheng-Wei
Contributors	薛慧敏 (advisor); 褚承威 Chu, Cheng-Wei
Keywords	Sports news; Feature selection; TF-IDF; Support vector machine; Text categorization (體育新聞; 變數選取; TF-IDF; 支持向量機; 文本分類)
Date	2018
Upload time	31-Jul-2018 13:44:58 (UTC+8)
Abstract	News reports describe the state of a problem, event, or process at a given time. Newspapers were once the most common medium for spreading news, but with the rapid growth of the Internet and social media, readers' habits have changed, and a majority of people now prefer digital news to print. Online news combines text, images, and even video, and each outlet has its own conventions; earlier studies compared such differences in content and then identified the outlet manually. This study aims to develop a classifier that predicts which newspaper published a digital news item. More than four thousand sports news articles published in December 2017 by four major Taiwanese newspapers (United Daily News, Apple Daily, China Times, and Liberty Times) were collected as training data. A digital news item typically consists of a title, text content, and photos, so the first and essential step of the analysis is to quantify input variables (features) from the available information. To explore the routines of each newspaper and to improve computational efficiency, an initial exploratory data analysis (EDA) is conducted on the input variables, and the relatively important ones are selected for classifier development. For the text data, term frequency-inverse document frequency (TF-IDF) is applied as a keyword selection method. The selected variables are then used to build newspaper classifiers with a support vector machine (SVM). We find that a simple classifier based on 19 non-text input variables achieves high accuracy, and that the image dimensions are the most critical of these variables. When only text information is considered, a small number of text variables suffices to produce excellent classification results.
Description	Master's degree; National Chengchi University (國立政治大學); Department of Statistics (統計學系); 105354020
Source	http://thesis.lib.nccu.edu.tw/record/#G0105354020
Type	thesis
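
The abstract names TF-IDF as the keyword selection method for the text variables. As a rough illustration only, the following Python sketch computes one common TF-IDF variant (relative term frequency times the log of inverse document frequency) and ranks terms by their highest weight in any document. The record does not say which TF-IDF variant, tokenizer, or ranking rule the thesis uses, so the function names `tf_idf` and `top_keywords`, the ranking rule, and the toy corpus are all assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: TF-IDF weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_keywords(docs, k):
    """Keep the k terms with the highest TF-IDF weight in any document.

    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    best = {}
    for doc_weights in tf_idf(docs):
        for term, weight in doc_weights.items():
            best[term] = max(best.get(term, 0.0), weight)
    return sorted(best, key=lambda term: (-best[term], term))[:k]


# Hypothetical toy corpus; the tokens are invented for illustration.
docs = [
    ["nba", "warriors", "win", "game"],
    ["mlb", "yankees", "win", "game"],
    ["nba", "lakers", "lose", "game"],
]

print(top_keywords(docs, 3))  # ['lakers', 'lose', 'mlb']
```

In this toy corpus, `game` occurs in every document, so its weight is zero and it is never selected, while terms confined to a single document rank highest.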

dc.contributor.advisor	薛慧敏	zh_TW
dc.contributor.author (Authors)	褚承威	zh_TW
dc.contributor.author (Authors)	Chu, Cheng-Wei	en_US
dc.creator (Author)	褚承威	zh_TW
dc.creator (Author)	Chu, Cheng-Wei	en_US
dc.date (Date)	2018	en_US
dc.date.accessioned	31-Jul-2018 13:44:58 (UTC+8)	-
dc.date.available	31-Jul-2018 13:44:58 (UTC+8)	-
dc.date.issued (Upload time)	31-Jul-2018 13:44:58 (UTC+8)	-
dc.identifier (Other Identifiers)	G0105354020	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/119087	-
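dc.description (Description)	碩士 (Master's degree)	zh_TW
dc.description (Description)	國立政治大學 (National Chengchi University)	zh_TW
dc.description (Description)	統計學系 (Department of Statistics)	zh_TW
dc.description (Description)	105354020	zh_TW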
dc.description.abstract (Abstract)	News reports describe the state of a problem, event, or process at a given time. Newspapers were once the most common medium for spreading news, but with the rapid growth of the Internet and social media, readers' habits have changed, and a majority of people now prefer digital news to print. This study aims to develop a classifier that predicts which newspaper published a digital news item. More than four thousand sports news articles published in December 2017 by four major Taiwanese newspapers (United Daily News, Apple Daily, China Times, and Liberty Times) were collected as training data. A digital news item typically consists of a title, text content, and photos, so the first and essential step of the analysis is to quantify input variables (features) from the available information. To explore the routines of each newspaper and to improve computational efficiency, an initial exploratory data analysis (EDA) is conducted on the input variables, and the relatively important ones are selected for classifier development. For the text data, term frequency-inverse document frequency (TF-IDF) is applied as a keyword selection method. The selected variables are then used to build newspaper classifiers with a support vector machine (SVM). We find that a simple classifier based on 19 non-text input variables achieves high accuracy, and that the image dimensions are the most critical of these variables. When only text information is considered, a small number of text variables suffices to produce excellent classification results.	en_US
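dc.description.tableofcontents	Chapter 1 Introduction; Chapter 2 Methodology (Section 1 TF-IDF document features; Section 2 Support vector machine; Section 3 SVM accuracy evaluation); Chapter 3 Empirical data analysis (Section 1 Non-text variable selection; Section 2 Text variable selection); Chapter 4 Media classifiers (Section 1 Building the media classifiers; Section 2 Comparison of non-text variable importance; Section 3 Discussion of text variable importance); Chapter 5 Conclusions and suggestions; References	zh_TW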
dc.format.extent	2061485 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0105354020	en_US
dc.subject (Keywords)	體育新聞	zh_TW
dc.subject (Keywords)	變數選取	zh_TW
dc.subject (Keywords)	TF-IDF	zh_TW
dc.subject (Keywords)	支持向量機	zh_TW
dc.subject (Keywords)	文本分類	zh_TW
dc.subject (Keywords)	Sports news	en_US
dc.subject (Keywords)	Feature selection	en_US
dc.subject (Keywords)	TF-IDF	en_US
dc.subject (Keywords)	Support vector machine	en_US
dc.subject (Keywords)	Text categorization	en_US
dc.title (Title)	運用資料探勘及支持向量機建立運動新聞媒體分類器	zh_TW
dc.title (Title)	Using Exploratory Data Analysis and Support Vector Machine to Build Media Classifiers on Sport News	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/THE.NCCU.STAT.014.2018.B03	-
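
The abstract describes building the newspaper classifiers with a support vector machine. The sketch below is a minimal illustration of a linear SVM trained with a Pegasos-style stochastic sub-gradient method on the regularized hinge loss; it is not the thesis's implementation. It is binary and has no bias term (so it assumes centered features), whereas the thesis distinguishes four newspapers, and the record does not state the kernel, solver, or software used. The function names, hyperparameters, and the two toy features are hypothetical.

```python
import random


def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Train a bias-free linear SVM by stochastic sub-gradient descent.

    X: list of feature lists (assumed centered); y: labels in {-1, +1}.
    Returns the weight vector w.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # Shrink w: sub-gradient step on the L2 penalty.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Hinge loss is active only when the margin is below 1.
            if margin < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def predict(w, x):
    """Classify by the sign of the linear score w . x."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


def accuracy(w, X, y):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(predict(w, x) == yi for x, yi in zip(X, y)) / len(y)


# Hypothetical, already-centered toy features for two publishers.
X = [[2.0, 2.0], [3.0, 1.0], [2.0, 3.0],
     [-2.0, -2.0], [-3.0, -1.0], [-2.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]

w = train_linear_svm(X, y)
print(accuracy(w, X, y))  # 1.0
```

A four-class problem such as the one in the abstract could be handled by training one such binary classifier per newspaper (one-vs-rest), though the record does not say which multi-class scheme the thesis adopts.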
