
Title: 比較交叉適配與p值合併對於特徵重要性檢定之影響
Comparing Cross-Fitting and P-Value Combination Methods in Testing Feature Importances
Creator: 顏立平
Yen, Li-Ping
Contributor: 黃柏僩
Huang, Po-Hsien
顏立平
Yen, Li-Ping
Key Words: 機器學習
特徵重要性
特徵重要性檢定
machine learning
feature importance
feature importance tests
Date: 2025
Date Issued: 4-Feb-2025 16:13:39 (UTC+8)
Summary: 機器學習(machine learning,ML)算則建立之模型,長期以來被認為難以詮釋。而隨著可解釋機器學習(interpretable ML)之發展,研究者已可透過多種特徵重要性(feature importance)檢定,如殘差排序檢定(residual permutation test,RPT)、條件預測影響(conditional predictive impact,CPI)、與逐一變數排除(leave-one-covariate-out,LOCO),以了解哪些特徵具有統計顯著(statistically significant)之預測能力。傳統的特徵重要性檢定仰賴資料拆分(data splitting),即將資料拆為訓練集與測試集,前者用於訓練預測式,後者用於進行檢定。然而,資料拆分伴隨的樣本數減少意味著統計檢定力(statistical power)之喪失,且容許研究者從多次拆分挑選有利之分析結果,即所謂的資料窺探(data snooping),其會造成型一錯誤率(type I error)膨脹。為了解決單次資料拆分所帶來的問題,研究者可考慮透過重複資料拆分獲得多組分析結果,再使用 p 值合併或交叉適配(cross-fit)將多組結果進行整合。本研究試圖透過模擬實驗來評估多種 p 值合併法和有無交叉適配之策略組合,於 RPT、CPI 與 LOCO 之實徵表現。模擬結果顯示資料窺探的確會導致型一錯誤率膨脹,而所有的組合皆可將型一錯誤率控制在顯著水準(α = 0.05)以下,唯一的例外為 RPT 搭配 Cauchy 法會造成型一錯誤率膨脹。在檢定力方面,使用 Bonferroni 法搭配交叉適配,以及單獨使用 Cauchy 法兩種策略組合展現相對較佳的檢定力,且優於單次資料拆分,而其餘的 p 值合併法儘管可控制型一錯誤率,卻展現低於單次資料拆分之檢定力。
Machine learning (ML) models have long been considered difficult to interpret. With the development of interpretable ML, however, researchers can now use various feature importance tests, such as the residual permutation test (RPT), conditional predictive impact (CPI), and leave-one-covariate-out (LOCO), to identify which features have statistically significant predictive power. Traditional feature importance tests rely on data splitting: the dataset is divided into a training set for model fitting and a test set for statistical testing. The reduced sample size entails a loss of statistical power, and splitting also allows researchers to engage in data snooping, selecting favorable analysis results from multiple splits, which inflates the type I error rate. To address the issues caused by a single data split, researchers may instead perform repeated data splitting to obtain multiple analysis results and then integrate them using p-value combination methods or cross-fitting. This study evaluates, through simulation experiments, the empirical performance of various combinations of p-value combination methods and cross-fitting strategies applied to RPT, CPI, and LOCO. The simulations show that data snooping does inflate the type I error rate, and that almost all strategy combinations control the type I error rate at the significance level (α = 0.05), the sole exception being RPT paired with the Cauchy method, which inflates it. In terms of statistical power, Bonferroni combination with cross-fitting and the standalone Cauchy method exhibit relatively better power than single data splitting, whereas the remaining p-value combination methods, although they control the type I error rate, show lower power than single data splitting.
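The p-value combination step described in the abstract can be sketched in a few lines. This is an illustrative sketch, not code from the thesis; the function names and sample p-values are invented for the example. Bonferroni combination scales the smallest of the K per-split p-values by K, while the Cauchy combination test (Liu & Xie, 2020) maps each p-value to a standard Cauchy variate, averages, and inverts:

```python
import math

def combine_bonferroni(pvals):
    """Bonferroni-style combination over K repeated splits:
    combined p = min(1, K * smallest per-split p-value)."""
    k = len(pvals)
    return min(1.0, k * min(pvals))

def combine_cauchy(pvals):
    """Cauchy combination test: transform each p-value to a
    standard Cauchy variate, average, and invert the average."""
    t = sum(math.tan(math.pi * (0.5 - p)) for p in pvals) / len(pvals)
    return 0.5 - math.atan(t) / math.pi

# Hypothetical p-values from five repeated train/test splits
# of the same data (dependent, since the splits overlap).
pvals = [0.04, 0.20, 0.07, 0.12, 0.03]
print(combine_bonferroni(pvals))  # 5 * 0.03, i.e. about 0.15
print(combine_cauchy(pvals))
```

A design point the abstract relies on: the Cauchy combination has an analytic null distribution that remains (approximately) valid under arbitrary dependence among the per-split p-values, which is why it can be applied without cross-fitting, whereas Bonferroni is valid but conservative and, per the simulations, benefits from being paired with cross-fitting.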
References (參考文獻):
Breiman, L., Friedman, J., Stone, C., & Olshen, R. (1984). Classification and regression trees. Taylor & Francis. https://doi.org/10.1201/9781315139470
Breiman, L. (1996). Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6), 2350–2383. https://doi.org/10.1214/aos/1032181158
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Candès, E., Fan, Y., Janson, L., & Lv, J. (2018). Panning for gold: 'Model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(3), 551–577. https://doi.org/10.1111/rssb.12265
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. https://doi.org/10.1145/2939672.2939785
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. https://doi.org/10.1111/ectj.12097
Chilver, M. R., Champaigne-Klassen, E., Schofield, P. R., Williams, L. M., & Gatt, J. M. (2023). Predicting wellbeing over one year using sociodemographic factors, personality, health behaviours, cognition, and life events. Scientific Reports, 13(1), 5565. https://doi.org/10.1038/s41598-023-32588-3
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Colquhoun, D. (2017). The reproducibility of research and the misinterpretation of p-values. Royal Society Open Science, 4(12), 171085. https://doi.org/10.1098/rsos.171085
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. https://doi.org/10.1007/BF00994018
Couronné, R., Probst, P., & Boulesteix, A.-L. (2018). Random forest versus logistic regression: A large-scale benchmark experiment. BMC Bioinformatics, 19(1), 270. https://doi.org/10.1186/s12859-018-2264-5
Covert, I. C., Lundberg, S., & Lee, S.-I. (2021). Explaining by removing: A unified framework for model explanation. Journal of Machine Learning Research, 22(1).
Dai, B., Shen, X., & Pan, W. (2024). Significance tests of feature relevance for a black-box learner. IEEE Transactions on Neural Networks and Learning Systems, 35(2), 1898–1911. https://doi.org/10.1109/tnnls.2022.3185742
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. https://arxiv.org/abs/1810.04805
Dimson, E., & Marsh, P. (1990). Volatility forecasting without data-snooping. Journal of Banking & Finance, 14(2), 399–421.
Fisher, R. A. (1928). Statistical methods for research workers. Oliver & Boyd.
Fix, E. (1985). Discriminatory analysis: Nonparametric discrimination, consistency properties (Vol. 1). USAF School of Aviation Medicine.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. International Conference on Machine Learning. https://api.semanticscholar.org/CorpusID:1836349
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58. https://doi.org/10.1162/neco.1992.4.1.1
Gonzalez, O. (2021). Psychometric and machine learning approaches for diagnostic assessment and tests of individual classification. Psychological Methods, 26(2), 236–254. https://doi.org/10.1037/met0000317
Hastie, T. (2009). The elements of statistical learning: Data mining, inference, and prediction.
Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., & Jennions, M. D. (2015). The extent and consequences of p-hacking in science. PLOS Biology, 13(3), 1–15. https://doi.org/10.1371/journal.pbio.1002106
Hommel, G. (1983). Tests of the overall hypothesis for arbitrary dependence structures. Biometrical Journal, 25(5), 423–430. https://doi.org/10.1002/bimj.19830250502
Huang, P. H. (2025a). Residual permutation tests for feature importance in machine learning [Unpublished manuscript].
Huang, P. H. (2025b). Significance tests for feature importance in machine learning [Unpublished manuscript].
Jocher, G., Chaurasia, A., & Qiu, J. (2023, January). Ultralytics YOLO (Version 8.0.0). https://github.com/ultralytics/ultralytics
Joel, S., Eastwick, P. W., Allison, C. J., Arriaga, X. B., Baker, Z. G., Bar-Kalifa, E., Bergeron, S., Birnbaum, G. E., Brock, R. L., Brumbaugh, C. C., Carmichael, C. L., Chen, S., Clarke, J., Cobb, R. J., Coolsen, M. K., Davis, J., de Jong, D. C., Debrot, A., DeHaas, E. C., … Wolf, S. (2020). Machine learning uncovers the most robust self-report predictors of relationship quality across 43 longitudinal couples studies. Proceedings of the National Academy of Sciences, 117(32), 19061–19071. https://doi.org/10.1073/pnas.1917036117
Shen, K.-Q., Ong, C.-J., Li, X.-P., Zheng, H., & Wilder-Smith, E. (2007). A feature selection method for multilevel mental fatigue EEG classification. IEEE Transactions on Biomedical Engineering, 54(7), 1231–1237. https://doi.org/10.1109/TBME.2007.890733
Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R. J., & Wasserman, L. (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523), 1094–1111. https://doi.org/10.1080/01621459.2017.1307116
Liu, Y., & Xie, J. (2020). Cauchy combination test: A powerful test with analytic p-value calculation under arbitrary dependency structures. Journal of the American Statistical Association, 115(529), 393–402. https://doi.org/10.1080/01621459.2018.1554485
Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. 4768–4777.
Mattner, L. (2011). Combining individually valid and conditionally i.i.d. p-variables.
Mi, X., Zou, B., Zou, F., & Hu, J. (2021). Permutation-based identification of important biomarkers for complex diseases via machine learning models. Nature Communications, 12(1), 3008. https://doi.org/10.1038/s41467-021-22756-2
Moran, M. (2003). Arguments for rejecting the sequential Bonferroni in ecological studies. Oikos, 100(2), 403–405. https://doi.org/10.1034/j.1600-0706.2003.12010.x
Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1(1), 0021. https://doi.org/10.1038/s41562-016-0021
Meinshausen, N., Meier, L., & Bühlmann, P. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association, 104(488), 1671–1681. https://doi.org/10.1198/jasa.2009.tm08647
O'Gorman, T. W. (2005). The performance of randomization tests that use permutations of independent variables. Communications in Statistics - Simulation and Computation, 34(4), 895–908. https://doi.org/10.1080/03610910500308230
Oh, J., Laubach, M., & Luczak, A. (2003). Estimating neuronal variable importance with random forest. 2003 IEEE 29th Annual Proceedings of Bioengineering Conference, 33–34. https://doi.org/10.1109/NEBC.2003.1215978
Ojala, M., & Garriga, G. C. (2009). Permutation tests for studying classifier performance. 2009 Ninth IEEE International Conference on Data Mining, 908–913. https://doi.org/10.1109/ICDM.2009.108
OpenAI. (2024). ChatGPT (December 23 version) [Large language model]. Accessed December 23, 2024. https://chat.openai.com
Paschali, M., Zhao, Q., Adeli, E., & Pohl, K. M. (2022, June). Bridging the gap between deep learning and hypothesis-driven analysis via permutation testing. Springer Nature Switzerland. https://link.springer.com/10.1007/978-3-031-16919-9_2
Pearson, K. (1933). On a method of determining whether a sample of size n supposed to have been drawn from a parent population having a known probability integral has probably been drawn at random. Biometrika, 25, 379–410. https://doi.org/10.1093/biomet/25.3-4.379
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(10), 2825–2830.
Pelt, D. H. M., Habets, P. C., Vinkers, C. H., Ligthart, L., van Beijsterveldt, C. E. M., Pool, R., & Bartels, M. (2024). Building machine learning prediction models for well-being using predictors from the exposome and genome in a population cohort. Nature Mental Health, 2(10), 1217–1230. https://doi.org/10.1038/s44220-024-00294-2
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. https://arxiv.org/abs/2212.04356
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408. https://api.semanticscholar.org/CorpusID:12781225
Rüger, B. (1978). Das maximale Signifikanzniveau des Tests: "Lehne H0 ab, wenn k unter n gegebenen Tests zur Ablehnung führen". Metrika, 25, 171–178.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2013). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534–547.
Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9(1), 307. https://doi.org/10.1186/1471-2105-9-307
Strube, M. J. (2006). SNOOP: A program for demonstrating the consequences of premature and repeated null hypothesis testing. Behavior Research Methods, 38(1), 24–27.
Tansey, W., Veitch, V., Zhang, H., Rabadan, R., & Blei, D. M. (2022). The holdout randomization test for feature selection in black box models. Journal of Computational and Graphical Statistics, 31(1), 151–162. https://doi.org/10.1080/10618600.2021.1923520
Tippett, L. H. C., et al. (1931). The methods of statistics.
Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.
Vovk, V., & Wang, R. (2020). Combining p-values via averaging. Biometrika, 107(4), 791–808. https://doi.org/10.1093/biomet/asaa027
Watson, D. S., & Wright, M. N. (2021). Testing conditional independence in supervised learning algorithms. Machine Learning, 110(8), 2107–2129. https://doi.org/10.1007/s10994-021-06030-6
White, H. (2000). A reality check for data snooping. Econometrica, 68(5), 1097–1126. http://www.jstor.org/stable/2999444
Williamson, B. D., Gilbert, P. B., Simon, N. R., & Carone, M. (2023). A general framework for inference on algorithm-agnostic variable importance. Journal of the American Statistical Association, 118(543), 1645–1658. https://doi.org/10.1080/01621459.2021.2003200
XGBoost Python API reference. (n.d.). Retrieved January 13, 2025, from https://xgboost.readthedocs.io/en/stable/python/python_api.html
Description: 碩士 (Master's thesis)
國立政治大學 (National Chengchi University)
心理學系 (Department of Psychology)
111752001
Source (資料來源): http://thesis.lib.nccu.edu.tw/record/#G0111752001
Type: thesis
URI: https://nccur.lib.nccu.edu.tw/handle/140.119/155518
Table of Contents (目錄):
  Abstract (摘要)
  Chapter 1  Introduction
    1.1 Research background
    1.2 Research objectives
  Chapter 2  Literature review
    2.1 Machine learning frameworks
    2.2 Feature importance
    2.3 Significance tests for feature importance
      2.3.1 Residual permutation test
      2.3.2 Conditional predictive impact
      2.3.3 Leave-one-covariate-out
    2.4 Repeated data splitting
      2.4.1 P-value combination
      2.4.2 Cross-fitting
    2.5 Research questions
  Chapter 3  Methods
    3.1 Data-generating process
    3.2 Modeling workflow and analysis
      3.2.1 Machine learning algorithms
      3.2.2 Feature importance tests
      3.2.3 Number of data splits and use of cross-fitting
      3.2.4 Simulation procedure
  Chapter 4  Results
    4.1 Type I error rate
      4.1.1 Residual permutation test
      4.1.2 Conditional predictive impact
      4.1.3 Leave-one-covariate-out
    4.2 Statistical power
      4.2.1 Residual permutation test
      4.2.3 Leave-one-covariate-out
    4.3 Supplementary analysis with linear regression
  Chapter 5  Conclusion
    5.1 Main findings
    5.2 Recommendations for machine learning practitioners
    5.3 Research limitations
  References (引用文獻)
  Appendix A  Simulation results for the linear regression algorithm
Format: application/pdf, 19355044 bytes