Title 懲罰R平方對線性混合模型中重要變數選取之研究
Variables selection for linear mixed models with Penalized R-squared statistics
Author Wu, Shu-Heng (吳書恆)
Contributor Huang, Chia-Hui (黃佳慧)
Wu, Shu-Heng (吳書恆)
Keywords Automatic selection methods
Goodness-of-fit
Linear mixed model
Parsimonious model
Penalty coefficient
Date 2022
Upload time 1-Aug-2022 17:14:31 (UTC+8)
Abstract For linear mixed models (LMMs) fitted after decomposing time-dependent covariates, several R² statistics have been proposed to evaluate the goodness-of-fit (GOF) of the random effects and the fixed effects. Because of overfitting, however, automatic selection methods cannot simply pick the model that maximizes these statistics and still arrive at a parsimonious model. This study proposes a penalized R² statistic whose penalty term accounts for the number of parameters, which discourages the inflation of R² as extra regressors are added to the model. Combined with automatic model selection, the statistic can select parsimonious models for both the random effects and the fixed effects. The penalty term also contains a penalty coefficient, so the strength of the penalty can be adjusted flexibly to match the researcher's demand for model simplification. When researchers have no specific degree of simplification in mind, two methods are provided, a grid search and a given tolerance value, to obtain the optimal range of the penalty coefficient and the corresponding parsimonious model. The simulation results show that the penalized R² finds the parsimonious model more effectively than the AIC statistics considered (cAIC and mAIC), and that using the penalized R² for random effects and fixed effects simultaneously does not affect the result of either selection. In an empirical analysis of North Carolina crime data, the proposed R² statistic was able to identify important variables in automatic model selection, including significant variables that were also found in the original study.
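The record does not reproduce the thesis's definition of the penalized R², so the following is only a minimal sketch of the general idea described in the abstract. It assumes a hypothetical penalty of the form alpha * (number of parameters / number of observations); the function names, the penalty form, and the toy R² values are all illustrative assumptions, not the author's statistic or data. It shows how a penalty coefficient changes which nested candidate model is selected, and how a grid over that coefficient can be scanned.

```python
from typing import Dict, Sequence


def penalized_r2(r2: float, n_params: int, n_obs: int, alpha: float) -> float:
    """Hypothetical penalized R^2 (an assumption, not the thesis's formula):
    subtract alpha times the ratio of parameters to observations."""
    return r2 - alpha * n_params / n_obs


def select_model(r2_values: Sequence[float], n_params: Sequence[int],
                 n_obs: int, alpha: float) -> int:
    """Return the index of the candidate with the largest penalized R^2."""
    scores = [penalized_r2(r2, p, n_obs, alpha)
              for r2, p in zip(r2_values, n_params)]
    return max(range(len(scores)), key=scores.__getitem__)


def grid_search(r2_values: Sequence[float], n_params: Sequence[int],
                n_obs: int, alphas: Sequence[float]) -> Dict[float, int]:
    """Map each penalty coefficient on the grid to the candidate it selects."""
    return {a: select_model(r2_values, n_params, n_obs, a) for a in alphas}


# Made-up R^2 values for nested candidates with 1 to 5 regressors (n = 100).
# Plain R^2 never decreases as regressors are added, so alpha = 0 always
# favours the largest model.
r2_values = [0.40, 0.55, 0.56, 0.565, 0.566]
n_params = [1, 2, 3, 4, 5]
n_obs = 100

choices = grid_search(r2_values, n_params, n_obs, [0.0, 0.25, 0.75, 2.0])
for alpha, index in choices.items():
    print(f"alpha={alpha}: selects the model with {n_params[index]} regressors")
```

With these toy numbers, increasing the penalty coefficient moves the selection from the five-regressor model toward the two-regressor model, which is the kind of trade-off the penalty coefficient is meant to control.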
References Akaike, H. (1998). Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213. Springer.
Arnau, J., Bono, R., and Vallejo, G. (2009). Analyzing small samples of repeated measures data with the mixed-model adjusted F test. Communications in Statistics-Simulation and Computation, 38(5):1083–1103.
Baltagi, B. H. (2006). Estimating an economic model of crime using panel data from North Carolina. Journal of Applied Econometrics, 21(4):543–547.
Brame, R., Bushway, S., and Paternoster, R. (1999). On the use of panel research designs and random effects models to investigate static and dynamic theories of criminal offending. Criminology, 37(3):599–642.
Cornwell, C. and Trumbull, W. N. (1994). Estimating the economic model of crime with panel data. The Review of Economics and Statistics, pages 360–366.
Edwards, L. J., Muller, K. E., Wolfinger, R. D., Qaqish, B. F., and Schabenberger, O. (2008). An R2 statistic for fixed effects in the linear mixed model. Statistics in Medicine, 27(29):6137–6157.
Ezekiel, M. (1930). Methods of correlation analysis. Wiley.
Ghidey, W., Lesaffre, E., and Eilers, P. (2004). Smooth random effects distribution in a linear mixed model. Biometrics, 60(4):945–953.
Greven, S. and Kneib, T. (2010). On the behaviour of marginal and conditional AIC in linear mixed models. Biometrika, 97(4):773–789.
Harville, D. A. (1977). Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72(358):320–338.
Hawkins, D. M. (2004). The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1):1–12.
Helland, I. S. (2000). Model reduction for prediction in regression models. Scandinavian Journal of Statistics, 27(1):1–20.
Hurvich, C. M. and Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76(2):297–307.
Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, pages 963–974.
Lalonde, T. L. (2015). Modeling time-dependent covariates in longitudinal data analyses. In Innovative Statistical Methods for Public Health Data, pages 57–79. Springer.
Lalonde, T. L., Nguyen, A. Q., Yin, J., Irimata, K., and Wilson, J. R. (2013). Modeling correlated binary outcomes with time-dependent covariates. Journal of Data Science, 11(4).
Luke, S. G. (2017). Evaluating significance in linear mixed-effects models in R. Behavior Research Methods, 49(4):1494–1502.
McNeish, D. (2017). Small sample methods for multilevel modeling: A colloquial elucidation of REML and the Kenward-Roger correction. Multivariate Behavioral Research, 52(5):661–670.
Molenberghs, G. and Verbeke, G. (2000). A model for longitudinal data. Linear Mixed Models for Longitudinal Data, pages 19–29.
Nakagawa, S. and Schielzeth, H. (2013). A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4(2):133–142.
Neuhaus, J. M. and Kalbfleisch, J. D. (1998). Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics, pages 638–645.
Olkin, I. and Pratt, J. W. (1958). Unbiased estimation of certain correlation coefficients. The Annals of Mathematical Statistics, pages 201–211.
Orelien, J. G. and Edwards, L. J. (2008). Fixed-effect variable selection in linear mixed models using R2 statistics. Computational Statistics & Data Analysis, 52(4):1896–1907.
Rights, J. D. and Sterba, S. K. (2019). Quantifying explained variance in multilevel models: An integrative framework for defining R-squared measures. Psychological Methods, 24(3):309.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, pages 461–464.
Shen, W. and Louis, T. A. (1999). Empirical Bayes estimation via the smoothing by roughening approach. Journal of Computational and Graphical Statistics, 8(4):800–823.
Snijders, T. A. and Bosker, R. J. (2011). Multilevel analysis: An introduction to basic and advanced multilevel modeling. Sage.
Sundberg, R. (1999). Multivariate calibration—direct and indirect regression methodology. Scandinavian Journal of Statistics, 26(2):161–207.
Vaida, F. and Blanchard, S. (2005). Conditional Akaike information for mixed-effects models. Biometrika, 92(2):351–370.
Verbeke, G. and Lesaffre, E. (1996). A linear mixed-effects model with heterogeneity in the random-effects population. Journal of the American Statistical Association, 91(433):217–221.
Vonesh, E. and Chinchilli, V. M. (1996). Linear and nonlinear models for the analysis of repeated measurements. CRC Press.
Welham, S. and Thompson, R. (1997). Likelihood ratio tests for fixed model terms using residual maximum likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(3):701–714.
Xu, R. (2003). Measuring explained variation in linear mixed effects models. Statistics in Medicine, 22(22):3527–3541.
Zhang, D. and Davidian, M. (2001). Linear mixed models with flexible distributions of random effects for longitudinal data. Biometrics, 57(3):795–802.
Zheng, B. (2000). Summarizing the goodness of fit of generalized linear models for longitudinal data. Statistics in Medicine, 19(10):1265–1275.
Description Master's
National Chengchi University
Department of Statistics
109354003
Source http://thesis.lib.nccu.edu.tw/record/#G0109354003
Type thesis
dc.contributor.advisor 黃佳慧 zh_TW
dc.contributor.advisor Huang, Chia-Hui en_US
dc.contributor.author (Authors) 吳書恆 zh_TW
dc.contributor.author (Authors) Wu, Shu-Heng en_US
dc.creator (Author) 吳書恆 zh_TW
dc.creator (Author) Wu, Shu-Heng en_US
dc.date (Date) 2022 en_US
dc.date.accessioned 1-Aug-2022 17:14:31 (UTC+8) -
dc.date.available 1-Aug-2022 17:14:31 (UTC+8) -
dc.date.issued (Upload time) 1-Aug-2022 17:14:31 (UTC+8) -
dc.identifier (Other Identifiers) G0109354003 en_US
dc.identifier.uri (URI) http://nccur.lib.nccu.edu.tw/handle/140.119/141002 -
dc.description (Description) Master's zh_TW
dc.description (Description) National Chengchi University zh_TW
dc.description (Description) Department of Statistics zh_TW
dc.description (Description) 109354003 zh_TW
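dc.description.tableofcontents List of Tables iv
List of Figures v
Chapter 1 Introduction 1
Section 1 Research Background and Motivation 1
Section 2 Research Objectives 2
Chapter 2 Literature Review 4
Section 1 Linear Mixed Models 4
Section 2 R² Statistics for Mixed Models 7
Section 3 Akaike Information Criterion 9
Chapter 3 Penalized R² 12
Section 1 Definition 12
Section 2 Using the Penalized R² 15
Section 3 Choosing the Penalty Coefficient α 17
Chapter 4 Simulation and Empirical Analysis 20
Section 1 Simulation Parameters and Model Settings 20
Section 2 Simulation Results 22
Section 3 Empirical Analysis 33
Chapter 5 Conclusion 43
References 45
zh_TW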
dc.format.extent 732933 bytes -
dc.format.mimetype application/pdf -
dc.source.uri (Source) http://thesis.lib.nccu.edu.tw/record/#G0109354003 en_US
dc.subject (Keywords) Automatic model selection zh_TW
dc.subject (Keywords) Goodness-of-fit zh_TW
dc.subject (Keywords) Linear mixed model zh_TW
dc.subject (Keywords) Parsimonious model zh_TW
dc.subject (Keywords) Penalty coefficient zh_TW
dc.subject (Keywords) Automatic selection methods en_US
dc.subject (Keywords) Goodness-of-fit en_US
dc.subject (Keywords) Linear mixed model en_US
dc.subject (Keywords) Parsimonious model en_US
dc.subject (Keywords) Penalty coefficient en_US
dc.title (Title) 懲罰R平方對線性混合模型中重要變數選取之研究 zh_TW
dc.title (Title) Variables selection for linear mixed models with Penalized R-squared statistics en_US
dc.type (Type) thesis en_US
dc.identifier.doi (DOI) 10.6814/NCCU202200673 en_US