懲罰R平方對線性混合模型中重要變數選取之研究 | Publication

Title	懲罰R平方對線性混合模型中重要變數選取之研究 / Variables selection for linear mixed models with Penalized R-squared statistics
Author	吳書恆 (Wu, Shu-Heng)
Contributors	黃佳慧 (Huang, Chia-Hui); 吳書恆 (Wu, Shu-Heng)
Keywords	自動化模型選擇; 適合度; 線性混合模型; 精簡模型; 懲罰係數 / Automatic selection methods; Goodness-of-fit; Linear mixed model; Parsimonious model; Penalty coefficient
Date	2022
Upload time	1-Aug-2022 17:14:31 (UTC+8)
Abstract	For linear mixed models (LMMs) fitted after decomposing time-dependent covariates, several R² statistics have been proposed to assess the goodness-of-fit (GOF) of the random and fixed effects. Because of overfitting, however, automatic selection methods cannot simply pick the model that maximizes these statistics and still arrive at a parsimonious model. This study proposes a penalized R² statistic whose penalty term accounts for the number of parameters. The penalty curbs the inflation of R² as regressors are added, so the statistic can be combined with automatic model selection to identify parsimonious models for both the random and the fixed effects. The penalty term includes a penalty coefficient, which lets researchers adjust the strength of the penalty to the degree of parsimony they require. For researchers with no specific degree of parsimony in mind, the study also provides two approaches, a grid search and a user-specified tolerance, for obtaining the optimal range of the penalty coefficient and the corresponding parsimonious model. Simulation results show that the penalized R² selects the parsimonious model more effectively than the AIC statistics considered (cAIC and mAIC), and that applying the penalized R² to the random and fixed effects simultaneously does not affect the selection result for either. In an empirical analysis of North Carolina crime data, the proposed R² statistic identifies important variables under automatic selection, including significant variables that were also found in the original study.
Description	Master's; National Chengchi University, Department of Statistics (國立政治大學統計學系); 109354003
Source	http://thesis.lib.nccu.edu.tw/record/#G0109354003
Type	thesis
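
The abstract describes a penalized R² whose penalty grows with the number of parameters and is scaled by a penalty coefficient, used inside automatic model selection. This record does not give the thesis's actual formula, so the following is only a minimal sketch under stated assumptions: the penalty form `alpha * p / n`, the use of ordinary least squares in place of a linear mixed model, and the function names `r_squared`, `penalized_r2`, and `forward_select` are all hypothetical and chosen for illustration.

```python
import numpy as np


def r_squared(X, y):
    """Plain R^2 of an ordinary least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    total = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / total


def penalized_r2(X, y, alpha):
    """ASSUMED penalty form (not taken from the thesis):
    R^2 minus alpha * (number of regressors) / (number of observations)."""
    n, p = X.shape
    return r_squared(X, y) - alpha * p / n


def forward_select(X, y, alpha):
    """Greedy forward selection that maximizes the penalized score."""
    selected, remaining = [], list(range(X.shape[1]))
    best = penalized_r2(X[:, selected], y, alpha)  # intercept-only baseline
    while remaining:
        scores = {j: penalized_r2(X[:, selected + [j]], y, alpha)
                  for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break  # no candidate improves the penalized score
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return sorted(selected)
```

With `alpha = 0` the score reduces to plain R², which never decreases when a regressor is added, so the search tends to keep every variable. A positive `alpha` makes each extra regressor pay a fixed cost. Calling `forward_select` over a grid of `alpha` values and recording the selected set for each is one simple way to see over which range of penalty strengths the chosen model stays the same.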

dc.contributor.advisor	黃佳慧	zh_TW
dc.contributor.advisor	Huang, Chia-Hui	en_US
dc.contributor.author (Authors)	吳書恆	zh_TW
dc.contributor.author (Authors)	Wu, Shu-Heng	en_US
dc.creator (Author)	吳書恆	zh_TW
dc.creator (Author)	Wu, Shu-Heng	en_US
dc.date (Date)	2022	en_US
dc.date.accessioned	1-Aug-2022 17:14:31 (UTC+8)	-
dc.date.available	1-Aug-2022 17:14:31 (UTC+8)	-
dc.date.issued (Upload time)	1-Aug-2022 17:14:31 (UTC+8)	-
dc.identifier (Other Identifiers)	G0109354003	en_US
dc.identifier.uri (URI)	http://nccur.lib.nccu.edu.tw/handle/140.119/141002	-
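dc.description (Description)	Master's (碩士)	zh_TW
dc.description (Description)	National Chengchi University (國立政治大學)	zh_TW
dc.description (Description)	Department of Statistics (統計學系)	zh_TW
dc.description (Description)	109354003	zh_TW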
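dc.description.tableofcontents	List of Tables iv; List of Figures v; Chapter 1 Introduction 1 (1.1 Research Background and Motivation 1; 1.2 Research Objectives 2); Chapter 2 Literature Review 4 (2.1 Linear Mixed Models 4; 2.2 R² Statistics for Mixed Models 7; 2.3 Akaike Information Criterion 9); Chapter 3 Penalized R² 12 (3.1 Definition 12; 3.2 Using the Penalized R² 15; 3.3 Choosing the Penalty Coefficient α 17); Chapter 4 Simulation and Empirical Analysis 20 (4.1 Simulation Parameters and Model Settings 20; 4.2 Simulation Results 22; 4.3 Empirical Analysis 33); Chapter 5 Conclusion 43; References 45	zh_TW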
dc.format.extent	732933 bytes	-
dc.format.mimetype	application/pdf	-
dc.source.uri (Source)	http://thesis.lib.nccu.edu.tw/record/#G0109354003	en_US
dc.subject (Keywords)	自動化模型選擇	zh_TW
dc.subject (Keywords)	適合度	zh_TW
dc.subject (Keywords)	線性混合模型	zh_TW
dc.subject (Keywords)	精簡模型	zh_TW
dc.subject (Keywords)	懲罰係數	zh_TW
dc.subject (Keywords)	Automatic selection methods	en_US
dc.subject (Keywords)	Goodness-of-fit	en_US
dc.subject (Keywords)	Linear mixed model	en_US
dc.subject (Keywords)	Parsimonious model	en_US
dc.subject (Keywords)	Penalty coefficient	en_US
dc.title (Title)	懲罰R平方對線性混合模型中重要變數選取之研究	zh_TW
dc.title (Title)	Variables selection for linear mixed models with Penalized R-squared statistics	en_US
dc.type (Type)	thesis	en_US
dc.identifier.doi (DOI)	10.6814/NCCU202200673	en_US
