Realistic data synthesis using enhanced generative adversarial networks (以進階生成對抗網路合成擬真資料)

包諾克; Mrinal Kanti Baowaly

Please use this identifier to cite or link to this item: https://ah.lib.nccu.edu.tw/handle/140.119/123696

Title:	Realistic data synthesis using enhanced generative adversarial networks (以進階生成對抗網路合成擬真資料)
Author:	Baowaly, Mrinal Kanti (包諾克)
Contributors:	Chen, Sheng-Wei (陳昇瑋); Liu, Chao-Lin (劉昭麟); Baowaly, Mrinal Kanti (包諾克)
Keywords:	Electronic health records; Synthetic data generation; Data synthesis; Generative adversarial networks; Wasserstein GANs with Gradient Penalty; Boundary-seeking GANs
Date:	2019
Uploaded:	3-Jun-2019
Abstract:	Real data are often unavailable, or too expensive to obtain in both time and money, frequently because of privacy and confidentiality concerns. In such situations, synthetic data are a good alternative. The primary objective of this study is to generate realistic synthetic electronic health records (EHRs) that people can use freely to advance research in healthcare and related fields. We propose two synthetic data generation models, designated medical Wasserstein GAN with gradient penalty (medWGAN) and medical boundary-seeking GAN (medBGAN), and compare their performance with an existing method, medical GAN (medGAN). The proposed models are based on two enhanced variants of generative adversarial networks (GANs), namely Wasserstein GAN with gradient penalty (WGAN-GP) and boundary-seeking GAN (BGAN). We perform data synthesis on three aggregated EHR datasets with discrete features (e.g., binary and count) in the medical domain: MIMIC-III, extended MIMIC-III, and the National Health Insurance Research Database (NHIRD), Taiwan. First, we train the models and use them to generate synthetic EHR data. We then analyze and compare the models' performance by applying statistical methods (dimension-wise average and the Kolmogorov–Smirnov test) and two machine learning tasks (association rule mining and prediction). The comprehensive analysis shows that the proposed models are more effective than medGAN at generating realistic synthetic EHR data. Our models can be applied to generate any realistic synthetic data, even beyond the medical domain. To demonstrate their generality, we also investigate an aggregated crime dataset from the City of Los Angeles Police Department, which confirms the models' capability to work in a wide range of applications. We demonstrate that the proposed models are suitable for producing high-quality synthetic data with discrete features that are statistically sound and good enough for machine learning tasks. We believe the proposed models will be useful in industry and research by providing better services for generating realistic synthetic data. This study will help to eliminate barriers such as limited access to confidential data and thus accelerate the development of medical informatics, healthcare, and related fields.
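The two statistical checks named in the abstract (dimension-wise average and the Kolmogorov–Smirnov test) can be illustrated with a minimal Python sketch. This is not the thesis's code: the toy Bernoulli data, the function names, and the decision to compute the KS statistic separately for each feature are assumptions made only for illustration, and the "synthetic" matrix here is simply a second random draw standing in for GAN output.

```python
import random
from bisect import bisect_right


def dimension_wise_average(rows):
    """Return the mean of each column (feature) across all records."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The supremum is attained at an observed value, so check only those.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def make_toy_records(n_records, probs, rng):
    """Hypothetical stand-in for an aggregated binary EHR matrix:
    each column is an independent Bernoulli feature (an assumption)."""
    return [[1 if rng.random() < p else 0 for p in probs]
            for _ in range(n_records)]


rng = random.Random(0)
probs = [0.05, 0.2, 0.5, 0.8]                     # assumed feature prevalences
real = make_toy_records(2000, probs, rng)         # plays the role of real data
synthetic = make_toy_records(2000, probs, rng)    # plays the role of GAN output

real_avg = dimension_wise_average(real)
synth_avg = dimension_wise_average(synthetic)
max_avg_gap = max(abs(r - s) for r, s in zip(real_avg, synth_avg))

# Per-feature KS statistic between the real and synthetic columns.
ks_per_feature = [ks_statistic(rc, sc)
                  for rc, sc in zip(zip(*real), zip(*synthetic))]

print("dimension-wise averages (real):     ", real_avg)
print("dimension-wise averages (synthetic):", synth_avg)
print("per-feature KS statistics:          ", ks_per_feature)
```

For binary features the per-feature KS statistic reduces to the absolute difference between the two column means, so the two checks carry different information only for count-valued features. In practice a library routine such as SciPy's `ks_2samp` would also give a p-value; the hand-written version is used here only to keep the sketch dependency-free.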
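Description:	Doctoral dissertation, National Chengchi University, Taiwan International Graduate Program (TIGP) in Social Networks and Human-Centered Computing, 104761507
Source:	http://thesis.lib.nccu.edu.tw/record/#G0104761507
Type:	thesis
Appears in Collections:	Theses (學位論文)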

Files in This Item:

File	Size	Format
150701.pdf	3.77 MB	Adobe PDF
