Title 多層級特徵與不平衡樣本下的預測性迴歸系統
A Predictive Regression System with Multi-Level Input Features and Unbalanced Sample Structures
Author 傅俊益 (Fu, Jun-Yi)
Advisors 莊皓鈞 (Chuang, Hao-Chun); 周彥君 (Chou, Yen-Chun)
Keywords CWPCA
Imbalanced dataset
Regularization method
Date 2022
Uploaded 1-Aug-2022 17:21:12 (UTC+8)
Abstract In data analysis today, multi-dimensional and imbalanced data sets are common, for example in sales forecasting for new products in the retail industry. Both a single regression model and multiple separate regression models have drawbacks when predicting such data. Cohen, Jiao, and Zhang (2020) therefore proposed the DAC (Data Aggregation with Clustering) model, which sits between the two approaches: it jointly estimates the coefficients of certain feature types across different items, reducing the variance of the coefficient estimates and thereby improving model performance.
However, the DAC model's performance is strongly affected by its hyperparameter settings, and the quality of its coefficient hypothesis tests depends on the sample size. This thesis extends the concept of multi-level feature variables but, in contrast to the bottom-up design of the DAC model, takes a top-down approach, combining a regression model with regularization to design a CWPCA (Centralized With Penalized Coefficient Adjustment) model. Using data sets generated by statistical simulation under a variety of scenarios, the thesis compares the CWPCA and DAC models. It finds that the CWPCA model does not require steps such as hypothesis testing or k-means clustering, which can introduce model bias, and that on most data sets it performs comparably to the best-performing DAC variant while outperforming the worst-performing one. We hope the model can be further applied to real-world data sets in the future and deliver greater value for actual business.
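To make the abstract's central idea concrete, below is a minimal, hypothetical numpy sketch of penalizing item-level coefficient adjustments toward a shared ("centralized") coefficient level in a regression with imbalanced sample sizes per item. This is an illustration of the general regularization idea only, not the thesis's actual CWPCA formulation (which the abstract does not fully specify); the data-generating setup, function names, and penalty value are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated multi-item data: a few "large" items with many samples and
# several "small" items with few samples (an imbalanced sample structure).
n_items, p = 6, 3
true_global = np.array([2.0, -1.0, 0.5])      # shared coefficient level
true_dev = rng.normal(0, 0.3, (n_items, p))   # small item-level deviations
sizes = [200, 200, 15, 15, 15, 15]

X_parts, y_parts, item = [], [], []
for j, n in enumerate(sizes):
    Xj = rng.normal(size=(n, p))
    yj = Xj @ (true_global + true_dev[j]) + rng.normal(0, 1.0, n)
    X_parts.append(Xj)
    y_parts.append(yj)
    item += [j] * n
X = np.vstack(X_parts)
y = np.concatenate(y_parts)
item = np.array(item)

def fit_centralized_penalized(X, y, item, n_items, lam):
    """Ridge-style closed form: one shared coefficient vector plus
    item-level adjustments penalized toward zero with strength lam."""
    p = X.shape[1]
    # Design matrix: [shared block | one deviation block per item].
    Z = np.hstack([X] + [X * (item == j)[:, None] for j in range(n_items)])
    # Penalize only the deviation blocks, not the shared coefficients.
    D = np.diag([0.0] * p + [lam] * (p * n_items))
    beta = np.linalg.solve(Z.T @ Z + D, Z.T @ y)
    return beta[:p], beta[p:].reshape(n_items, p)

shared, dev = fit_centralized_penalized(X, y, item, n_items, lam=50.0)
# Small items' estimates are shrunk toward the shared level, lowering the
# variance of their coefficient estimates at the cost of some bias.
```

Increasing `lam` pools all items toward one common regression (low variance, higher bias); `lam = 0` recovers fully separate per-item fits. Tuning this trade-off is the role the abstract attributes to regularization.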
References Bertsimas, D., King, A., & Mazumder, R. (2016). Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2), 813-852.
Chen, Y., Taeb, A., & Bühlmann, P. (2020). A look at robustness and stability of l1- versus l0-regularization: Discussion of papers by Bertsimas et al. and Hastie et al. Statistical Science, 35(4), 614-622.
Cohen, M. C., Jiao, K., & Zhang, R. (2020). Data aggregation and demand prediction. Available at SSRN 3411653.
Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal L1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6), 797-829.
Fu, A., Narasimhan, B., & Boyd, S. (2020). CVXR: An R package for disciplined convex optimization. Journal of Statistical Software, 94(14), 1-34.
Hazimeh, H., & Mazumder, R. (2020). Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. Operations Research, 68(5), 1517-1537.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.
Li, Y., & Wu, H. (2012). A clustering method based on K-means algorithm. Physics Procedia, 25, 1104-1109.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability.
Melkumova, L., & Shatskikh, S. Y. (2017). Comparing Ridge and LASSO estimators for data analysis. Procedia Engineering, 201, 746-755.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418-1429.
Description Master's thesis
National Chengchi University
Department of Management Information Systems
109356009
Source http://thesis.lib.nccu.edu.tw/record/#G0109356009
Type thesis
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/141031
dc.description.tableofcontents Chapter 1 Introduction
Chapter 2 Literature Review
Section 1: The DAC Model
Section 2: Regularization Methods
1. L0 regularization
2. L1 regularization
3. L2 regularization
Chapter 3 Data and Model
Section 1: The CWPCA Model
Section 2: Data Sets
Chapter 4 Model Results and Comparison
Section 1: DAC Model Results
Section 2: CWPCA Model Results
Section 3: Comparison of the CWPCA and DAC Models
Section 4: Effect of High-Noise Data Sets
Chapter 5 Conclusion
References
dc.format.extent 2674197 bytes
dc.format.mimetype application/pdf
dc.identifier.doi (DOI) 10.6814/NCCU202200639