Title: 懲罰R平方對線性混合模型中重要變數選取之研究
Variables selection for linear mixed models with Penalized R-squared statistics
Author: Wu, Shu-Heng (吳書恆)
Contributors: Huang, Chia-Hui (黃佳慧), advisor
Wu, Shu-Heng (吳書恆), author
Keywords: Automatic selection methods
Goodness-of-fit
Linear mixed model
Parsimonious model
Penalty coefficient
Date: 2022
Upload time: 1 August 2022 17:14:31 (UTC+8)
Abstract:
For the linear mixed model (LMM) obtained after decomposing time-dependent covariates, several types of R² statistics have been proposed to evaluate the goodness-of-fit (GOF) of the random effects and the fixed effects. However, because of overfitting, automatic selection methods cannot simply choose the model that maximizes these statistics when a parsimonious model is wanted. In this study, we propose an R² statistic with a penalty term that takes the number of parameters into account and so discourages the inflation of R² as extra regressors are added to the model. The penalized R² can therefore be combined with automatic model selection to choose a parsimonious model for both the random effects and the fixed effects. In addition, the penalty term contains a penalty coefficient that can be adjusted flexibly to match the researcher's demand for model simplification. When researchers do not have a specific degree of simplification in mind, we also provide two methods, a grid search and a user-specified tolerance value, to obtain the optimal range of the penalty coefficient and the corresponding parsimonious model. The simulation results showed that the penalized R² finds the parsimonious model better than the AIC-type statistics (cAIC and mAIC), and that applying the penalized R² to random effects and fixed effects simultaneously does not affect the selection result for either. In an empirical analysis of North Carolina crime data, we found that the proposed R² statistic is able to identify significant variables, which were also found in the original study.
References:
Akaike, H. (1998). Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213. Springer.
Arnau, J., Bono, R., and Vallejo, G. (2009). Analyzing small samples of repeated measures data with the mixed-model adjusted F test. Communications in Statistics - Simulation and Computation, 38(5):1083–1103.
Baltagi, B. H. (2006). Estimating an economic model of crime using panel data from North Carolina. Journal of Applied Econometrics, 21(4):543–547.
Brame, R., Bushway, S., and Paternoster, R. (1999). On the use of panel research designs and random effects models to investigate static and dynamic theories of criminal offending. Criminology, 37(3):599–642.
Cornwell, C. and Trumbull, W. N. (1994). Estimating the economic model of crime with panel data. The Review of Economics and Statistics, pages 360–366.
Edwards, L. J., Muller, K. E., Wolfinger, R. D., Qaqish, B. F., and Schabenberger, O. (2008). An R2 statistic for fixed effects in the linear mixed model. Statistics in Medicine, 27(29):6137–6157.
Ezekiel, M. (1930). Methods of Correlation Analysis. Wiley.
Ghidey, W., Lesaffre, E., and Eilers, P. (2004). Smooth random effects distribution in a linear mixed model. Biometrics, 60(4):945–953.
Greven, S. and Kneib, T. (2010). On the behaviour of marginal and conditional AIC in linear mixed models. Biometrika, 97(4):773–789.
Harville, D. A. (1977). Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72(358):320–338.
Hawkins, D. M. (2004). The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1):1–12.
Helland, I. S. (2000). Model reduction for prediction in regression models. Scandinavian Journal of Statistics, 27(1):1–20.
Hurvich, C. M. and Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76(2):297–307.
Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, pages 963–974.
Lalonde, T. L. (2015). Modeling time-dependent covariates in longitudinal data analyses. In Innovative Statistical Methods for Public Health Data, pages 57–79. Springer.
Lalonde, T. L., Nguyen, A. Q., Yin, J., Irimata, K., and Wilson, J. R. (2013). Modeling correlated binary outcomes with time-dependent covariates. Journal of Data Science, 11(4).
Luke, S. G. (2017). Evaluating significance in linear mixed-effects models in R. Behavior Research Methods, 49(4):1494–1502.
McNeish, D. (2017). Small sample methods for multilevel modeling: A colloquial elucidation of REML and the Kenward-Roger correction. Multivariate Behavioral Research, 52(5):661–670.
Molenberghs, G. and Verbeke, G. (2000). A model for longitudinal data. Linear Mixed Models for Longitudinal Data, pages 19–29.
Nakagawa, S. and Schielzeth, H. (2013). A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4(2):133–142.
Neuhaus, J. M. and Kalbfleisch, J. D. (1998). Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics, pages 638–645.
Olkin, I. and Pratt, J. W. (1958). Unbiased estimation of certain correlation coefficients. The Annals of Mathematical Statistics, pages 201–211.
Orelien, J. G. and Edwards, L. J. (2008). Fixed-effect variable selection in linear mixed models using R2 statistics. Computational Statistics & Data Analysis, 52(4):1896–1907.
Rights, J. D. and Sterba, S. K. (2019). Quantifying explained variance in multilevel models: An integrative framework for defining R-squared measures. Psychological Methods, 24(3):309.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, pages 461–464.
Shen, W. and Louis, T. A. (1999). Empirical Bayes estimation via the smoothing by roughening approach. Journal of Computational and Graphical Statistics, 8(4):800–823.
Snijders, T. A. and Bosker, R. J. (2011). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. Sage.
Sundberg, R. (1999). Multivariate calibration - direct and indirect regression methodology. Scandinavian Journal of Statistics, 26(2):161–207.
Vaida, F. and Blanchard, S. (2005). Conditional Akaike information for mixed-effects models. Biometrika, 92(2):351–370.
Verbeke, G. and Lesaffre, E. (1996). A linear mixed-effects model with heterogeneity in the random-effects population. Journal of the American Statistical Association, 91(433):217–221.
Vonesh, E. and Chinchilli, V. M. (1996). Linear and Nonlinear Models for the Analysis of Repeated Measurements. CRC Press.
Welham, S. and Thompson, R. (1997). Likelihood ratio tests for fixed model terms using residual maximum likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(3):701–714.
Xu, R. (2003). Measuring explained variation in linear mixed effects models. Statistics in Medicine, 22(22):3527–3541.
Zhang, D. and Davidian, M. (2001). Linear mixed models with flexible distributions of random effects for longitudinal data. Biometrics, 57(3):795–802.
Zheng, B. (2000). Summarizing the goodness of fit of generalized linear models for longitudinal data. Statistics in Medicine, 19(10):1265–1275.
Description: Master's degree
National Chengchi University
Department of Statistics
109354003
Source: http://thesis.lib.nccu.edu.tw/record/#G0109354003
Type: thesis
dc.contributor.advisor: Huang, Chia-Hui (黃佳慧)
dc.identifier (other identifier): G0109354003
dc.identifier.uri (URI): http://nccur.lib.nccu.edu.tw/handle/140.119/141002
dc.description.tableofcontents:
List of Tables
List of Figures
Chapter 1 Introduction
Section 1 Research Background and Motivation
Section 2 Research Objectives
Chapter 2 Literature Review
Section 1 Linear Mixed Models
Section 2 R² Statistics for Mixed Models
Section 3 Akaike Information Criterion
Chapter 3 Penalized R²
Section 1 Definition
Section 2 Using the Penalized R²
Section 3 Choosing the Penalty Coefficient α
Chapter 4 Data Simulation and Empirical Analysis
Section 1 Simulation Parameters and Model Settings
Section 2 Simulation Results
Section 3 Empirical Analysis
Chapter 5 Conclusion
References
dc.format.extent: 732933 bytes
dc.format.mimetype: application/pdf
dc.identifier.doi (DOI): 10.6814/NCCU202200673
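The abstract describes a penalized R² whose penalty grows with the number of parameters and is scaled by an adjustable penalty coefficient, together with a grid search over that coefficient. This record does not give the thesis's actual formula, so the following is only a minimal sketch under stated assumptions: the penalty form `alpha * n_params / n_obs`, the function names, and the toy R² values are all hypothetical and are not taken from the thesis. It only illustrates how a larger coefficient moves the selected model toward fewer variables.

```python
from typing import Dict, Iterable, Tuple


def penalized_r2(r2: float, n_params: int, n_obs: int, alpha: float) -> float:
    """Hypothetical penalized R^2: plain R^2 minus alpha * (n_params / n_obs).

    This additive penalty is an assumption made for illustration only;
    it is not the definition used in the thesis.
    """
    return r2 - alpha * n_params / n_obs


def select_model(candidates: Dict[str, Tuple[float, int]], n_obs: int, alpha: float) -> str:
    """Return the name of the candidate with the largest penalized R^2."""
    return max(
        candidates,
        key=lambda name: penalized_r2(candidates[name][0], candidates[name][1], n_obs, alpha),
    )


# Made-up toy inputs: model name -> (plain R^2, number of parameters).
# The plain R^2 never decreases as variables are added, which is the
# inflation problem the abstract refers to.
CANDIDATES: Dict[str, Tuple[float, int]] = {
    "x1": (0.50, 1),
    "x1+x2": (0.60, 2),
    "x1+x2+x3": (0.61, 3),
    "x1+x2+x3+x4": (0.615, 4),
}
N_OBS = 100


def grid_search(alphas: Iterable[float]) -> Dict[float, str]:
    """Record which candidate is selected at each penalty coefficient on a grid."""
    return {alpha: select_model(CANDIDATES, N_OBS, alpha) for alpha in alphas}


if __name__ == "__main__":
    for alpha, chosen in grid_search([0.0, 0.75, 2.0, 12.0]).items():
        print(f"alpha={alpha}: {chosen}")
```

With these toy numbers, `alpha = 0` reduces to maximizing the plain R² and picks the largest model, while increasing `alpha` selects progressively smaller models. The interval of coefficients over which the selection stays the same is the kind of quantity a grid search can expose.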