dc.contributor.advisor | 莊皓鈞<br>周彥君 | zh_TW |
dc.contributor.advisor | Chuang, Hao-Chun<br>Chou, Yen-Chun | en_US |
dc.contributor.author (Authors) | 傅俊益 | zh_TW |
dc.contributor.author (Authors) | Fu, Jun-Yi | en_US |
dc.creator (作者) | 傅俊益 | zh_TW |
dc.creator (作者) | Fu, Jun-Yi | en_US |
dc.date (日期) | 2022 | en_US |
dc.date.accessioned | 1-Aug-2022 17:21:12 (UTC+8) | - |
dc.date.available | 1-Aug-2022 17:21:12 (UTC+8) | - |
dc.date.issued (上傳時間) | 1-Aug-2022 17:21:12 (UTC+8) | - |
dc.identifier (Other Identifiers) | G0109356009 | en_US |
dc.identifier.uri (URI) | http://nccur.lib.nccu.edu.tw/handle/140.119/141031 | - |
dc.description (描述) | 碩士 | zh_TW |
dc.description (描述) | 國立政治大學 | zh_TW |
dc.description (描述) | 資訊管理學系 | zh_TW |
dc.description (描述) | 109356009 | zh_TW |
dc.description.abstract (摘要) | 現今在數據分析領域中,時常會碰到多維度、不平衡的資料集,像是零售業的新商品的銷量預測,但使用單一迴歸模型或多個迴歸模型去預測這種資料時都有各自的缺點,而 Cohen, Jiao, and Zhang (2020)提出了介於兩者之間的 DAC(Data Aggregation with Clustering)模型,利用將不同品項的部份種類的特徵係數利用共同估計特徵係數的方法,降低特徵係數估計的變異,藉此提高模型的表現。但 DAC 模型表現會大幅受到超參數設定的影響,且特徵係數的檢定品質會受到樣本數大小的影響。因此本研究延伸多層級的特徵變數的概念,但相較 DAC 模型使用 Bottom-up 的設計方法,本研究使用 Top-down 的設計方法,利用迴歸模型和正規化方法設計一個 CWPCA (Centralized With Penalized Coefficient Adjustment)模型,並利用統計模擬多種情境的資料集去比較 CWPCA 模型和 DAC 模型的表現,最後發現 CWPCA 模型不需要經過檢定、k-means 等有可能造成模型偏誤的流程,且在大部分的資料集的模型表現都能和表現最好的 DAC 模型差不多,並優於表現較差的 DAC 模型,我們希望未來能進一步應用在真實世界的資料集,進而對實際的業務產生更大的效益。 | zh_TW |
dc.description.abstract (摘要) | Multi-dimensional, unbalanced data sets are common in data analysis; sales forecasting for new retail products is one example. Fitting either a single regression model or a separate regression model per item to such data has drawbacks. Cohen, Jiao, and Zhang (2020) proposed the DAC (Data Aggregation with Clustering) model as a middle ground between the two: it jointly estimates some of the feature coefficients across items, which reduces the variance of the coefficient estimates and improves model performance. However, the performance of the DAC model depends heavily on its hyperparameter settings, and the quality of its hypothesis tests on the coefficients depends on the sample size. This thesis therefore extends the concept of multi-level feature variables but, in contrast to the bottom-up design of the DAC model, takes a top-down approach. It combines a regression model with regularization methods to design the CWPCA (Centralized With Penalized Coefficient Adjustment) model, and compares the CWPCA and DAC models on data sets simulated under a variety of scenarios. The results show that the CWPCA model does not require steps such as hypothesis testing and k-means clustering, which may introduce model bias, and that on most data sets its performance is close to that of the best-performing DAC model and better than that of the worse-performing DAC model. We hope that the model can be applied to real-world data sets in the future and bring greater benefits to actual business operations. | en_US |
dc.description.tableofcontents | 第一章 緒論 1<br>第二章 文獻探討 4<br>第一節 DAC 模型 4<br>第二節 正規化方法 8<br>一、L0 正規化 8<br>二、L1 正規化 8<br>三、L2 正規化 9<br>第三章 資料與模型 10<br>第一節 CWPCA 模型 10<br>第二節 資料集 13<br>第四章 模型結果與比較 15<br>第一節 DAC 模型結果比較 15<br>第二節 CWPCA 模型結果比較 22<br>第三節 CWPCA 模型和 DAC 模型結果比較 28<br>第四節 高雜訊資料集的影響 32<br>第五章 結論 34<br>參考文獻 36 | zh_TW |
dc.format.extent | 2674197 bytes | - |
dc.format.mimetype | application/pdf | - |
dc.source.uri (資料來源) | http://thesis.lib.nccu.edu.tw/record/#G0109356009 | en_US |
dc.subject (關鍵詞) | CWPCA | zh_TW |
dc.subject (關鍵詞) | 不平衡資料集 | zh_TW |
dc.subject (關鍵詞) | 正規化方法 | zh_TW |
dc.subject (關鍵詞) | CWPCA | en_US |
dc.subject (關鍵詞) | Imbalanced dataset | en_US |
dc.subject (關鍵詞) | Regularization method | en_US |
dc.title (題名) | 多層級特徵與不平衡樣本下的預測性迴歸系統 | zh_TW |
dc.title (題名) | A Predictive Regression System with Multi-Level Input Features and Unbalanced Sample Structures | en_US |
dc.type (資料類型) | thesis | en_US |
dc.relation.reference (參考文獻) | Bertsimas, D., King, A., & Mazumder, R. (2016). Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2), 813-852.<br>Chen, Y., Taeb, A., & Bühlmann, P. (2020). A Look at Robustness and Stability of l1- versus l0-Regularization: Discussion of Papers by Bertsimas et al. and Hastie et al. Statistical Science, 35(4), 614-622.<br>Cohen, M. C., Jiao, K., & Zhang, R. (2020). Data Aggregation and Demand Prediction. Available at SSRN 3411653.<br>Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal L1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(6), 797-829.<br>Fu, A., Narasimhan, B., & Boyd, S. (2020). CVXR: An R Package for Disciplined Convex Optimization. Journal of Statistical Software, 94(14), 1-34.<br>Hazimeh, H., & Mazumder, R. (2020). Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. Operations Research, 68(5), 1517-1537.<br>Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.<br>Li, Y., & Wu, H. (2012). A clustering method based on K-means algorithm. Physics Procedia, 25, 1104-1109.<br>MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Paper presented at the Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability.<br>Melkumova, L., & Shatskikh, S. Y. (2017). Comparing Ridge and LASSO estimators for data analysis. Procedia Engineering, 201, 746-755.<br>Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.<br>Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418-1429. | zh_TW |
dc.identifier.doi (DOI) | 10.6814/NCCU202200639 | en_US |