Title: Variable Selection for Nonlinear Boundary Threshold Regression Model (具變數選擇能力之非線性閾值迴歸模型)
Author: Hsieh, Yu-Yun (謝佑昀)
Contributors: Chang, Chih-Hao (張志浩); Hsieh, Yu-Yun (謝佑昀)
Keywords: Threshold model; Random Forest; Variable selection
Date: 2024
Upload time: 4-Sep-2024 14:55:22 (UTC+8)
Abstract:
This study improves on traditional threshold regression by proposing a threshold regression model that performs variable selection. Traditional threshold regression models rely on pre-selected key covariates, which are often hard to identify in practice, especially in high-dimensional data. To address this, the study combines Random Forest with the least absolute shrinkage and selection operator (Lasso) for variable selection, and the resulting method handles both linear and nonlinear threshold boundaries. Three simulation scenarios evaluate the proposed algorithm's performance, predictive accuracy, and variable selection capability: linear threshold boundaries, nonlinear threshold boundaries, and high-dimensional small samples. In the simulations, K-means clustering provides a preliminary classification, Random Forest then identifies potential threshold functions, and Lasso finally selects the important variables and fits the final regression model. The results show that the linear or nonlinear threshold boundaries constructed by the proposed TBR-VS algorithm clearly improve predictive performance and that, in most cases, the algorithm selects the important threshold and regression variables with high probability. In the empirical analysis, the model is applied to three real-world datasets (Boston housing prices, Los Angeles ozone pollution, and New York stock financial reports), further verifying its applicability across fields. Overall, the study improves the accuracy of threshold regression models and strengthens their variable selection capability and interpretability on practical data.

References:
Altmann, A., Toloşi, L., Sander, O., and Lengauer, T. (2010). Permutation importance: a corrected feature importance measure. Bioinformatics, 26(10):1340–1347.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24:123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45:5–32.
Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. (1984). Classification and Regression Trees. Taylor & Francis.
Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation: Rejoinder. Journal of the American Statistical Association, 80(391):614–619.
Chang, C.-H., Emura, T., and Huang, S.-F. (2023). Estimation of threshold boundary regression models. In The 6th International Conference on Econometrics and Statistics.
Dai, L., Chen, K., Sun, Z., Liu, Z., and Li, G. (2018). Broken adaptive ridge regression and its asymptotic properties. Journal of Multivariate Analysis, 168:334–351.
Granovetter, M. (1978). Threshold models of collective behavior. American Journal of Sociology, 83(6):1420–1443.
Harrison, D. J. and Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81–102.
Ishwaran, H. (2015). The effect of splitting on random forests. Machine Learning, 99:75–118.
Janitza, S., Celik, E., and Boulesteix, A.-L. (2018). A computationally fast variable importance test for random forests for high-dimensional data. Advances in Data Analysis and Classification, 12:885–915.
Jia, L., Zhang, W., and Chen, X. (2017). Common methods of biological age estimation. Clinical Interventions in Aging, 12:759–772.
Lee, Y. and Wang, Y. (2023). Threshold regression with nonparametric sample splitting. Journal of Econometrics, 235(2):816–842.
Nembrini, S., König, I. R., and Wright, M. N. (2018). The revival of the Gini importance? Bioinformatics, 34(21):3711–3718.
Saegusa, T., Ma, T., Li, G., Chen, Y. Q., and Lee, M.-L. T. (2020). Variable selection in threshold regression model with applications to HIV drug adherence data. Statistics in Biosciences, 12:376–398.
Sakoda, J. M. (1971). The checkerboard model of social interaction. The Journal of Mathematical Sociology, 1(1):119–132.
Schelling, T. C. (1971). Dynamic models of segregation. The Journal of Mathematical Sociology, 1(2):143–186.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.
Tong, H. (1978). On a threshold model in pattern recognition and signal processing. In Chen, C., editor, Pattern Recognition and Signal Processing. Sijthoff and Noordhoff.
Whitmore, G. A. and Su, Y. (2007). Modeling low birth weights using threshold regression: results for U.S. birth data. Lifetime Data Analysis, 13:161–190.
Yu, P. (2012). Likelihood estimation and inference in threshold regression. Journal of Econometrics, 167(1):274–294.

Description: Master's degree
National Chengchi University
Department of Statistics
107354011
Source: http://thesis.lib.nccu.edu.tw/record/#G0107354011
Type: thesis
Advisor: Chang, Chih-Hao (張志浩)
Other identifier: G0107354011
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/153359
Table of contents: Chapter 1 Introduction; Chapter 2 Literature Review; Chapter 3 Methodology; Chapter 4 Results and Discussion; Chapter 5 Conclusions and Recommendations; References
Format: application/pdf, 3664517 bytes
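
Illustrative note on the Lasso step named in the abstract: the Lasso selects variables by shrinking small coefficients exactly to zero. The pure-Python sketch below shows that mechanism with cyclic coordinate descent on the objective (1/(2n))·||y − Xb||² + λ·||b||₁. It is not the thesis's TBR-VS implementation; the function names, the toy data, and the penalty value `lam=0.5` are all hypothetical choices made for illustration.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; return 0.0 when |value| <= lam."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # correlation of column j with the partial residual
            z = 0.0    # squared norm of column j
            for i in range(n):
                # Fitted value from every predictor except column j.
                fitted_others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fitted_others)
                z += X[i][j] ** 2
            if z == 0.0:
                beta[j] = 0.0  # an all-zero column carries no information
            else:
                beta[j] = soft_threshold(rho / n, lam) / (z / n)
    return beta


# Hypothetical toy data: y depends strongly on column 0 and only weakly on column 1.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [3.0 * a + 0.2 * b for a, b in X]

beta = lasso_coordinate_descent(X, y, lam=0.5)
print(beta)
```

With these orthogonal toy columns, the weak predictor's coefficient is set exactly to zero while the strong predictor's coefficient is shrunk from 3 to about 2.5, which is the behaviour that makes the Lasso usable as a selection step. In practice a maintained library implementation would normally be preferred over a hand-written loop.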