Variable Selection for Nonlinear Boundary Threshold Regression Model (具變數選擇能力之非線性閾值迴歸模型)

Title	Variable Selection for Nonlinear Boundary Threshold Regression Model (具變數選擇能力之非線性閾值迴歸模型)
Author	Hsieh, Yu-Yun (謝佑昀)
Contributors	Chang, Chih-Hao (張志浩); Hsieh, Yu-Yun (謝佑昀)
Keywords	Threshold model; Random Forest; Variable selection
Date	2024
Upload time	4-Sep-2024 14:55:22 (UTC+8)
Abstract	This study improves on traditional threshold regression by proposing a threshold regression model that performs variable selection. Traditional threshold regression models usually rely on key covariates chosen in advance, but in practice these covariates are often hard to determine, especially with high-dimensional data. To address this, the study combines Random Forest with the least absolute shrinkage and selection operator (Lasso) for variable selection, and the resulting method handles both linear and nonlinear threshold boundaries. Three simulation scenarios are used to evaluate the proposed algorithm's performance, predictive accuracy, and variable selection capability: a linear threshold boundary, a nonlinear threshold boundary, and a high-dimensional, small-sample setting. In the simulations, K-means clustering first produces a preliminary classification, Random Forest is then used to identify candidate threshold functions, and Lasso finally selects the important variables and builds the final regression model. The simulation results show that the linear and nonlinear threshold boundaries constructed by the proposed TBR-VS algorithm clearly improve predictive performance, and that in most cases the algorithm selects the important threshold variables and regression variables with high probability. In the empirical analysis, the model is applied to three real-world datasets (Boston housing prices, Los Angeles ozone pollution, and New York stock financial reports) to further check its applicability across domains. Overall, the study improves the accuracy of threshold regression models and strengthens their variable selection capability and interpretability on practical data.
Description	Master's; National Chengchi University; Department of Statistics; 107354011
Source	http://thesis.lib.nccu.edu.tw/record/#G0107354011
Type	thesis
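
The last step described in the abstract, Lasso selection of regression variables within each regime of a threshold model, can be sketched as follows. This is only an illustration under strong assumptions and is not the thesis's TBR-VS algorithm: the threshold boundary is taken as known (the thesis estimates it with K-means and Random Forest, which is not reproduced here), there is no intercept, and the function names `lasso_cd` and `fit_two_regimes`, the toy factorial data, and the penalty value 0.5 are all hypothetical choices made for this example.

```python
from itertools import product


def soft_threshold(rho, lam):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Minimise (1/(2n))*||y - X b||^2 + lam*||b||_1 by coordinate descent.

    No intercept is fitted; X is a list of rows, y a list of responses.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual that excludes column j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / scale
    return beta


def fit_two_regimes(X, y, z, boundary, lam):
    """Split rows by the sign of boundary(z) and fit one Lasso per regime.

    `boundary` is assumed known here (hypothetical); regime 1 is boundary(z) > 0.
    """
    fits = {}
    for regime in (0, 1):
        rows = [i for i in range(len(y)) if (boundary(z[i]) > 0) == bool(regime)]
        fits[regime] = lasso_cd([X[i] for i in rows], [y[i] for i in rows], lam)
    return fits


# Toy data (hypothetical): a 2^3 factorial design repeated in each regime,
# so the three covariate columns are orthogonal within a regime.
design = [list(row) for row in product((-1.0, 1.0), repeat=3)]
X = design + design
z = [-1.0] * 8 + [1.0] * 8  # threshold variable: negative in regime 0, positive in regime 1
y = [2.0 * x[0] + 0.1 * x[2] for x in design] + [-3.0 * x[1] + 0.1 * x[2] for x in design]

fits = fit_two_regimes(X, y, z, boundary=lambda t: t, lam=0.5)
print(fits)
```

On this toy design the two regimes keep different variables: only the first coefficient is nonzero in regime 0 and only the second in regime 1, while the weak third covariate is dropped in both. The retained coefficients are shrunk toward zero by the penalty (about 1.5 and -2.5 against generating values of 2 and -3), which is the usual Lasso trade-off between selection and bias.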

dc.contributor.advisor	張志浩	zh_TW
dc.contributor.advisor	Chang, Chih-Hao	en_US
dc.contributor.author (Authors)	謝佑昀	zh_TW
dc.contributor.author (Authors)	Hsieh, Yu-Yun	en_US
dc.creator (Author)	謝佑昀	zh_TW
dc.creator (Author)	Hsieh, Yu-Yun	en_US
dc.date (Date)	2024	en_US
dc.date.accessioned	4-Sep-2024 14:55:22 (UTC+8)	-
dc.date.available	4-Sep-2024 14:55:22 (UTC+8)	-
dc.date.issued (Upload time)	4-Sep-2024 14:55:22 (UTC+8)	-
dc.identifier (Other Identifiers)	G0107354011	en_US
dc.identifier.uri (URI)	https://nccur.lib.nccu.edu.tw/handle/140.119/153359	-
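dc.description (Description)	Master's (碩士)	zh_TW
dc.description (Description)	National Chengchi University (國立政治大學)	zh_TW
dc.description (Description)	Department of Statistics (統計學系)	zh_TW
dc.description (Description)	107354011	zh_TW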
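dc.description.tableofcontents	Chapter 1 Introduction; Chapter 2 Literature Review; Chapter 3 Methodology; Chapter 4 Results and Discussion; Chapter 5 Conclusions and Recommendations; References	zh_TW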
dc.format.extent	3664517 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0107354011	en_US
dc.subject (Keywords)	閾值模型 (Threshold model)	zh_TW
dc.subject (Keywords)	隨機森林 (Random Forest)	zh_TW
dc.subject (Keywords)	變數選擇 (Variable selection)	zh_TW
dc.subject (Keywords)	Threshold model	en_US
dc.subject (Keywords)	Random Forest	en_US
dc.subject (Keywords)	Variable selection	en_US
dc.title (Title)	具變數選擇能力之非線性閾值迴歸模型	zh_TW
dc.title (Title)	Variable Selection for Nonlinear Boundary Threshold Regression Model	en_US
dc.type (Type)	thesis	en_US
