Title 多層級特徵與不平衡樣本下的預測性迴歸系統
A Predictive Regression System with Multi-Level Input Features and Unbalanced Sample Structures
Author 傅俊益 (Fu, Jun-Yi)
Advisors 莊皓鈞 (Chuang, Hao-Chun); 周彥君 (Chou, Yen-Chun)
Keywords CWPCA
Imbalanced dataset
Regularization method
Date 2022
Uploaded 1-Aug-2022 17:21:12 (UTC+8)
Abstract In data analysis today, multi-dimensional and imbalanced data sets are common, for example in sales forecasting for new products in the retail industry. Both a single regression model and multiple separate regression models have drawbacks when predicting such data. Cohen, Jiao, and Zhang (2020) therefore proposed the DAC (Data Aggregation with Clustering) model, which sits between the two approaches: it jointly estimates the coefficients of certain feature types across different items, reducing the variance of the coefficient estimates and thereby improving model performance.
However, the DAC model's performance is strongly affected by its hyperparameter settings, and the quality of its coefficient hypothesis tests depends on the sample size. This thesis extends the concept of multi-level feature variables but, in contrast to the bottom-up design of the DAC model, takes a top-down approach, combining a regression model with regularization to design a CWPCA (Centralized With Penalized Coefficient Adjustment) model. Using data sets generated by statistical simulation under a variety of scenarios, the thesis compares the CWPCA and DAC models. It finds that the CWPCA model does not require steps such as hypothesis testing or k-means clustering, which can introduce model bias, and that on most data sets it performs comparably to the best-performing DAC variant while outperforming the worst-performing one. We hope the model can be further applied to real-world data sets in the future and deliver greater value for actual business.
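To make the abstract's central idea concrete, below is a minimal, hypothetical numpy sketch of penalizing item-level coefficient adjustments toward a shared ("centralized") coefficient level in a regression with imbalanced sample sizes per item. This is an illustration of the general regularization idea only, not the thesis's actual CWPCA formulation (which the abstract does not fully specify); the data-generating setup, function names, and penalty value are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated multi-item data: a few "large" items with many samples and
# several "small" items with few samples (an imbalanced sample structure).
n_items, p = 6, 3
true_global = np.array([2.0, -1.0, 0.5])      # shared coefficient level
true_dev = rng.normal(0, 0.3, (n_items, p))   # small item-level deviations
sizes = [200, 200, 15, 15, 15, 15]

X_parts, y_parts, item = [], [], []
for j, n in enumerate(sizes):
    Xj = rng.normal(size=(n, p))
    yj = Xj @ (true_global + true_dev[j]) + rng.normal(0, 1.0, n)
    X_parts.append(Xj)
    y_parts.append(yj)
    item += [j] * n
X = np.vstack(X_parts)
y = np.concatenate(y_parts)
item = np.array(item)

def fit_centralized_penalized(X, y, item, n_items, lam):
    """Ridge-style closed form: one shared coefficient vector plus
    item-level adjustments penalized toward zero with strength lam."""
    p = X.shape[1]
    # Design matrix: [shared block | one deviation block per item].
    Z = np.hstack([X] + [X * (item == j)[:, None] for j in range(n_items)])
    # Penalize only the deviation blocks, not the shared coefficients.
    D = np.diag([0.0] * p + [lam] * (p * n_items))
    beta = np.linalg.solve(Z.T @ Z + D, Z.T @ y)
    return beta[:p], beta[p:].reshape(n_items, p)

shared, dev = fit_centralized_penalized(X, y, item, n_items, lam=50.0)
# Small items' estimates are shrunk toward the shared level, lowering the
# variance of their coefficient estimates at the cost of some bias.
```

Increasing `lam` pools all items toward one common regression (low variance, higher bias); `lam = 0` recovers fully separate per-item fits. Tuning this trade-off is the role the abstract attributes to regularization.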
References Bertsimas, D., King, A., & Mazumder, R. (2016). Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2), 813-852.
Chen, Y., Taeb, A., & Bühlmann, P. (2020). A look at robustness and stability of l1- versus l0-regularization: Discussion of papers by Bertsimas et al. and Hastie et al. Statistical Science, 35(4), 614-622.
Cohen, M. C., Jiao, K., & Zhang, R. (2020). Data aggregation and demand prediction. Available at SSRN 3411653.
Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal L1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6), 797-829.
Fu, A., Narasimhan, B., & Boyd, S. (2020). CVXR: An R package for disciplined convex optimization. Journal of Statistical Software, 94(14), 1-34.
Hazimeh, H., & Mazumder, R. (2020). Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. Operations Research, 68(5), 1517-1537.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.
Li, Y., & Wu, H. (2012). A clustering method based on K-means algorithm. Physics Procedia, 25, 1104-1109.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability.
Melkumova, L., & Shatskikh, S. Y. (2017). Comparing Ridge and LASSO estimators for data analysis. Procedia Engineering, 201, 746-755.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418-1429.
Description Master's thesis
National Chengchi University
Department of Management Information Systems
109356009
Source http://thesis.lib.nccu.edu.tw/record/#G0109356009
Type thesis
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/141031
dc.description.tableofcontents Chapter 1 Introduction
Chapter 2 Literature Review
Section 1: The DAC Model
Section 2: Regularization Methods
1. L0 regularization
2. L1 regularization
3. L2 regularization
Chapter 3 Data and Model
Section 1: The CWPCA Model
Section 2: Data Sets
Chapter 4 Model Results and Comparison
Section 1: DAC Model Results
Section 2: CWPCA Model Results
Section 3: Comparison of the CWPCA and DAC Models
Section 4: Effect of High-Noise Data Sets
Chapter 5 Conclusion
References
dc.format.extent 2674197 bytes
dc.format.mimetype application/pdf
dc.identifier.doi (DOI) 10.6814/NCCU202200639