Chinese Journal of Evidence-Based Pediatrics ›› 2025, Vol. 20 ›› Issue (2): 139-145. DOI: 10.3969/j.issn.1673-5501.2025.02.009

• Original Article •

儿童结核病专病数据库的构建方法及其初步验证

刘钊1, 李惠民2, 聂晓璐1, 彭亚光1, 吴小会2, 赵顺英2, 彭晓霞1   

  1. 国家儿童医学中心,首都医科大学附属北京儿童医院 北京,100045;1 临床流行病与循证医学中心,2 呼吸中心 
  • 收稿日期:2025-01-20 修回日期:2025-01-23 出版日期:2025-04-25 发布日期:2025-04-25
  • 通讯作者: 彭晓霞;李惠民

Construction method and preliminary validation for the disease-oriented database of tuberculosis in children

LIU Zhao1, LI Huimin2, NIE Xiaolu2, PENG Yaguang1, WU Xiaohui2, ZHAO Shunying2, PENG Xiaoxia1

  1. Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing 100045, China; 1 Center for Clinical Epidemiology and Evidence-based Medicine, 2 Center for Respiratory Medicine
  • Received:2025-01-20 Revised:2025-01-23 Online:2025-04-25 Published:2025-04-25
  • Contact: PENG Xiaoxia; LI Huimin

摘要: 背景:医疗机构电子病历数据(EMR)用于研究时常受非结构化数据影响,无法直接应用,需要利用自然语言处理技术将其进行结构化转化,以便符合临床研究数据的质量要求。 目的:基于儿童结核病病例的电子病历、住院病案首页等真实世界数据构建专病数据库,从而为其临床特征、诊断策略的效果与效率、预后及预后因素等研究提供数据基础。 设计:横断面调查。 方法:系统检索首都医科大学附属北京儿童医院2007年3月至2024年1月的住院病历,提取ICD-10编码为A15-A19(结核病)的所有患儿信息,以基于多学科专家意见构建的儿童结核病病例报告表为基础,参考医学系统命名法-临床术语、卫生信息基本数据集编制规范等行业标准和编码标准建立儿童结核病标准数据集,利用自然语言处理技术构建专病数据库。完成数据处理后,从数据库中随机抽取10%的病历数据,由两人独立进行与原始病历的比对核查,核查准确率要求>95%。 主要结局指标:准确率=正确识别的实体数/识别出的实体数×100%。 结果:本专病数据库共纳入8 097例(12 957例次)因结核住院诊治的患儿,其中确诊单纯肺结核患儿1 397例,单纯肺外结核患儿554例,肺结核合并肺外结核患儿553例,以上三种诊断中有467例(18.6%)为疑似结核病例;其余5 593例为结核感染病例。8 097例结核病患儿中,57.6%为男性;平均年龄为(7.3±4.7)岁,来自北京地区患儿占18.6%,未接种卡介苗的患儿275例(3.4%),仅有921例(11.4%)患儿有明确结核病例接触史。利用自然语言处理技术抽取字段的准确率均>95%。 结论:儿童结核病专病数据库的建立为儿童结核感染病例的预防性治疗效果评价、儿童抗结核药物性肝损伤风险评价等重要问题开展真实世界研究提供了数据基础。

关键词: 结核病, 儿童, 真实世界数据, 专病数据库

Abstract: Background: Research based on electronic medical record (EMR) data is often hampered by unstructured content, which cannot be used directly. Natural language processing (NLP) is needed to convert EMR data into a structured format that meets the quality requirements of clinical research data. Objective: To build a TuBerculosis Database Of Child (TBDoc) from real-world data (RWD) at Beijing Children's Hospital, Capital Medical University, including EMRs, hospital discharge summary data and other sources, so as to provide a data basis for research on childhood tuberculosis, such as its clinical characteristics, the effectiveness and efficiency of diagnostic strategies, prognosis, and prognostic factors. Design: A cross-sectional survey. Methods: Inpatient records from March 2007 to January 2024 were searched systematically, and information on all children coded as ICD-10 A15-A19 (tuberculosis) was extracted. Based on a case report form (CRF) for childhood tuberculosis built from multidisciplinary expert opinion, a standard dataset for childhood tuberculosis was established with reference to industry and coding standards such as the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) and the specification for compiling basic health information datasets, and NLP was used to construct the disease-oriented database. After data processing, 10% of the records were randomly sampled from the database and independently checked against the original medical records by two researchers; the required verification accuracy was >95%. Main outcome measures: Accuracy rate = number of correctly identified entities / number of identified entities × 100%.
Results: A total of 8,097 children (12,957 admissions) hospitalized for tuberculosis were included in the TBDoc database: 1,397 were diagnosed with pulmonary tuberculosis alone, 554 with extrapulmonary tuberculosis alone, and 553 with both pulmonary and extrapulmonary tuberculosis. Of these 2,504 cases, 467 (18.65%) were suspected tuberculosis. The remaining 5,593 children had tuberculosis infection. Of the 8,097 children, 57.61% were male, and the mean age was (7.3±4.7) years; 18.57% were from Beijing, 275 (3.40%) had not received Bacillus Calmette-Guérin (BCG) vaccination, and only 921 (11.37%) had a documented history of contact with a tuberculosis case. The accuracy of every field extracted by NLP exceeded 95%. Conclusion: The TBDoc database, built from real-world data on childhood tuberculosis, provides a data basis for real-world research on important questions such as the effect of preventive treatment for tuberculosis infection in children and the risk of drug-induced liver injury from anti-tuberculosis drugs in children.

Key words: Tuberculosis, Children, Real world data, Disease-oriented database
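
The verification step and the outcome measure described in the abstract can be illustrated with a short Python sketch. This is not the authors' code: the function names, the fixed random seed, the choice of admission records as the sampling unit, and the example counts passed to `accuracy_rate` are all assumptions made for illustration. Only the 10% sampling fraction, the >95% threshold, the accuracy formula, and the case counts come from the abstract. The accuracy rate as defined there (correct entities divided by identified entities) corresponds to what is usually called precision.

```python
import random


def sample_record_ids(record_ids, fraction=0.10, seed=0):
    """Draw a simple random sample (without replacement) of record IDs.

    The sampling unit and the seed are assumptions; the abstract only
    states that 10% of the records were randomly sampled.
    """
    ids = list(record_ids)
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)


def accuracy_rate(n_correct, n_identified):
    """Accuracy (%) = correctly identified entities / identified entities x 100."""
    if n_identified <= 0:
        raise ValueError("n_identified must be positive")
    if not 0 <= n_correct <= n_identified:
        raise ValueError("n_correct must lie between 0 and n_identified")
    return 100.0 * n_correct / n_identified


def meets_threshold(rate_percent, threshold=95.0):
    """Check the abstract's requirement that accuracy be > 95%."""
    return rate_percent > threshold


# Hypothetical usage: IDs 1..12957 stand in for the 12,957 admissions.
sampled = sample_record_ids(range(1, 12958))

# Hypothetical field-level counts from a manual check of the sample.
rate = accuracy_rate(n_correct=96, n_identified=100)
passed = meets_threshold(rate)

# Cross-check of the counts reported in the abstract.
tb_disease = 1397 + 554 + 553          # pulmonary, extrapulmonary, both
total_children = tb_disease + 5593     # plus tuberculosis infection cases
suspected_pct = round(100 * 467 / tb_disease, 2)
```

In practice each of the two reviewers would produce their own counts per field, and disagreements between them would need to be resolved before computing a final rate; the abstract does not describe how this was done.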