基於人力銀行之台灣地區薪資預測模型 (Web-Recruitment Data for Salary Prediction in Taiwan) | Publication


Title	基於人力銀行之台灣地區薪資預測模型 (Web-Recruitment Data for Salary Prediction in Taiwan)
Author	廖宜川 (Liao, Yi-Chuan)
Contributors	陳樹衡 (Chen, Shu-Heng); 廖宜川 (Liao, Yi-Chuan)
Keywords	Salary prediction; Machine learning; Convolutional neural network; Natural language processing; Word2Vec; Word vector; High dimension data
Date	2020
Upload time	2-Sep-2020 12:45:28 (UTC+8)
Abstract	This thesis constructs a salary prediction model from web-recruitment data, focusing on positions related to information, software, and systems. The models draw on structured variables, including personal information and job-related skills, together with unstructured text describing job content, and can serve as a reference for job seekers and companies estimating the approximate salary of a given position, which reduces disagreement between the two sides over pay. The coefficients of the regression models also indicate the market value reflected by each variable, for example how familiarity with a particular job skill affects salary level, and the high-paying skills and expertise identified this way can guide job seekers toward areas for self-improvement. The research starts with an exploratory data analysis to understand the basic characteristics of each variable. Next, various machine learning algorithms are applied to the integrated structured and unstructured data to build salary prediction models; the results show that Random Forest, Ridge, and Lasso perform well on the sparse, high-dimensional dataset. Finally, a natural language processing approach is adopted: a convolutional neural network is applied to word vectors transformed from the job content text. The resulting model performs on a par with the models built from the integrated structured and unstructured data, which supports Chinese natural language processing as a viable approach to building salary prediction models from online recruitment data.
Description	Master's degree; National Chengchi University; Department of Economics; 107258007
Source	http://thesis.lib.nccu.edu.tw/record/#G0107258007
Type	thesis
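
The abstract describes fitting regularized regressions such as Ridge to structured variables combined with sparse features derived from job-description text. The sketch below illustrates that general idea only and is not the author's implementation: the postings, tokens, and salary figures are made up, the text is assumed to be already tokenized, and a plain gradient-descent ridge fit stands in for whatever library routine the thesis actually used.

```python
# Toy illustration only: postings, tokens, and salaries are invented, not thesis data.

def build_vocab(token_lists):
    """Return a sorted vocabulary of every token seen in the postings."""
    return sorted({tok for tokens in token_lists for tok in tokens})

def featurize(years, tokens, vocab):
    """Concatenate one structured feature with bag-of-words counts."""
    counts = {tok: 0.0 for tok in vocab}
    for tok in tokens:
        if tok in counts:
            counts[tok] += 1.0
    return [float(years)] + [counts[tok] for tok in vocab]

def predict(weights, bias, x):
    """Linear prediction: bias + w . x."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

def fit_ridge(X, y, alpha=0.1, lr=0.02, steps=20000):
    """Minimise mean squared error + alpha * ||w||^2 by gradient descent.

    The intercept is not penalised. alpha, lr, and steps are illustrative choices.
    """
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(steps):
        errors = [predict(weights, bias, x) - t for x, t in zip(X, y)]
        grad_w = [
            2.0 / n * sum(e * x[j] for e, x in zip(errors, X))
            + 2.0 * alpha * weights[j]
            for j in range(d)
        ]
        grad_b = 2.0 / n * sum(errors)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

def mse(preds, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

# Hypothetical postings: (required years of experience, tokens of the job text).
postings = [
    (1, ["python", "sql"]),
    (3, ["python", "machine_learning"]),
    (2, ["sql", "testing"]),
    (5, ["machine_learning", "python", "sql"]),
    (0, ["testing"]),
    (4, ["python", "testing"]),
]
salaries = [30.0, 50.0, 32.0, 65.0, 25.0, 48.0]  # arbitrary units

vocab = build_vocab(tokens for _, tokens in postings)
X = [featurize(years, tokens, vocab) for years, tokens in postings]
weights, bias = fit_ridge(X, salaries)

model_mse = mse([predict(weights, bias, x) for x in X], salaries)
mean_salary = sum(salaries) / len(salaries)
baseline_mse = mse([mean_salary] * len(salaries), salaries)

for name, w in zip(["years"] + vocab, weights):
    print(f"{name}: {w:.2f}")
print(f"ridge MSE {model_mse:.2f} vs mean-baseline MSE {baseline_mse:.2f}")
```

Each fitted weight can be read the way the abstract reads regression coefficients: as the estimated change in salary associated with one more year of experience or with one mention of a skill term, holding the other features fixed. On real data the fit would be evaluated on held-out postings rather than on the training set as here.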

dc.contributor.advisor	陳樹衡	zh_TW
dc.contributor.advisor	Chen, Shu-Heng	en_US
dc.contributor.author (Authors)	廖宜川	zh_TW
dc.contributor.author (Authors)	Liao, Yi-Chuan	en_US
dc.creator (Author)	廖宜川	zh_TW
dc.creator (Author)	Liao, Yi-Chuan	en_US
dc.date (Date)	2020	en_US
dc.date.accessioned	2-Sep-2020 12:45:28 (UTC+8)	-
dc.date.available	2-Sep-2020 12:45:28 (UTC+8)	-
dc.date.issued (Upload time)	2-Sep-2020 12:45:28 (UTC+8)	-
dc.identifier (Other Identifiers)	G0107258007	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/131783	-
dc.description (Description)	碩士 (Master's degree)	zh_TW
dc.description (Description)	國立政治大學 (National Chengchi University)	zh_TW
dc.description (Description)	經濟學系 (Department of Economics)	zh_TW
dc.description (Description)	107258007	zh_TW
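dc.description.tableofcontents	Chapter 1 Introduction (Section 1 Research background and purpose; Section 2 Research contributions; Section 3 Thesis structure). Chapter 2 Literature review (Section 1 Salary models for Taiwan; Section 2 Salary models for other countries; Section 3 Related research on Chinese natural language processing for prediction-model variables). Chapter 3 Research methods (Section 1 Selenium-WebDriver in Python; Section 2 Statistical tests; Section 3 Regression models; Section 4 Chinese natural language processing). Chapter 4 Data preprocessing and exploratory analysis (Section 1 Data acquisition; Section 2 Data preprocessing and exploratory analysis; Section 3 Text preprocessing of job content; Section 4 Data summary). Chapter 5 Regression models and empirical results (Section 1 Baseline model; Section 2 Generating variable combinations; Section 3 Selecting variable combinations and regression models; Section 4 Variable selection and significance; Section 5 Building the word-vector salary prediction model). Chapter 6 Conclusions and suggestions (Section 1 Conclusions; Section 2 Suggestions). References. Appendix.	zh_TW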
dc.format.extent	2953577 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0107258007	en_US
dc.subject (Keywords)	薪資預測 (Salary prediction)	zh_TW
dc.subject (Keywords)	機器學習 (Machine learning)	zh_TW
dc.subject (Keywords)	卷積神經網路 (Convolutional neural network)	zh_TW
dc.subject (Keywords)	自然語言處理 (Natural language processing)	zh_TW
dc.subject (Keywords)	Word2Vec	zh_TW
dc.subject (Keywords)	詞向量 (Word vector)	zh_TW
dc.subject (Keywords)	高維數據 (High dimension data)	zh_TW
dc.subject (Keywords)	Salary prediction	en_US
dc.subject (Keywords)	Machine learning	en_US
dc.subject (Keywords)	Convolutional neural network	en_US
dc.subject (Keywords)	Natural language processing	en_US
dc.subject (Keywords)	Word2Vec	en_US
dc.subject (Keywords)	Word vector	en_US
dc.subject (Keywords)	High dimension data	en_US
dc.title (Title)	基於人力銀行之台灣地區薪資預測模型	zh_TW
dc.title (Title)	Web-Recruitment Data for Salary Prediction in Taiwan	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/NCCU202001406	en_US
