# Existing Medical QA & VQA Datasets
Multimodal Question Answering (QA) in the Medical Domain: A Summary of Existing Datasets and Systems

- I prepared this summary for my CMU/LTI talk on multimodal QA. My slides are available at https://www.slideshare.net/benabacha/multimodal-question-answering-in-the-medical-domain-cmulti-2020 

- This list is not exhaustive. You can email me links and references to relevant medical QA datasets and systems, and I'll update the list as soon as possible.
Also, several challenge-related datasets are no longer publicly available. You can contact the organizers to obtain the data.
 
 *** Two Main Tasks: Medical Question Answering (QA) & Visual Question Answering (VQA) ***

 
I) Medical QA Datasets:
=======================

1) Corpus for Evidence Based Medicine Summarization (Mollá, 2010): https://sourceforge.net/projects/ebmsumcorpus 
2) CLEF QA4MRE Alzheimer’s task (Peñas et al., 2012).
3) BioASQ datasets (2012-2020): http://bioasq.org/participate/challenges
4) TREC LiveQA-Med (Ben Abacha et al., 2017): https://github.com/abachaa/LiveQA_MedicalTask_TREC2017
5) MEDIQA-2019 datasets on NLI, RQE, and QA (Ben Abacha et al., 2019): https://github.com/abachaa/MEDIQA2019 
6) MEDIQA-AnS dataset of question-driven summaries of answers (Savery et al., 2020): https://osf.io/fyg46/ Paper: https://www.nature.com/articles/s41597-020-00667-z 
7) MedQuAD Collection of 47k QA pairs (Ben Abacha and Demner-Fushman, 2019): https://github.com/abachaa/MedQuAD
8) Medication QA Collection (Ben Abacha et al., 2019): https://github.com/abachaa/Medication_QA_MedInfo2019
9) Consumer Health Question Summarization (Ben Abacha and Demner-Fushman, 2019): https://github.com/abachaa/MeQSum
10) emrQA: QA on Electronic Medical Records (Pampari et al., 2018). Scripts to generate emrQA from i2b2 data: https://github.com/panushri25/emrQA 
11) EPIC-QA dataset on COVID-19 (Goodwin et al., 2020): https://bionlp.nlm.nih.gov/epic_qa/ 
12) BiQA Corpus (Lamurias et al., 2020): https://github.com/lasigeBioTM/BiQA Paper: https://ieeexplore.ieee.org/document/9184044
13) HealthQA Dataset (Zhu et al., 2019): https://github.com/mingzhu0527/HAR Paper: https://dmkd.cs.vt.edu/papers/WWW19.pdf 
14) MASH-QA Dataset on Multiple Answer Spans Healthcare Question Answering, with 35k QA pairs (Zhu et al., 2020): https://github.com/mingzhu0527/MASHQA Paper: https://www.aclweb.org/anthology/2020.findings-emnlp.342.pdf
15) MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. (Pal et al., CHIL, PMLR 2022): https://github.com/medmcqa/medmcqa Paper: https://proceedings.mlr.press/v174/pal22a.html

II) Medical VQA Datasets (Radiology): 
=====================================

1) VQA-RAD (Lau et al., 2018): https://osf.io/89kps
2) VQA-Med 2018 (Hasan et al., 2018): https://www.aicrowd.com/challenges/imageclef-2018-vqa-med
3) VQA-Med 2019 (Ben Abacha et al., 2019): https://github.com/abachaa/VQA-Med-2019
4) VQA-Med 2020 (Ben Abacha et al., 2020): https://github.com/abachaa/VQA-Med-2020


III) Online QA Systems:
========================
-- I searched and tested several systems (e.g. AskHERMES, MiPACQ, SimQ). This list includes only the systems that are still maintained.

1) CHiQA (Consumer Health Question Answering System): chiqa.nlm.nih.gov
2) Neural Covidex: covidex.ai


IV) Medical Datasets Relevant to Question Answering: 
=====================================================

1) i2b2 shared tasks (2006-2016): www.i2b2.org/NLP
2) n2c2 NLP clinical challenges (2018-2019): https://n2c2.dbmi.hms.harvard.edu and https://dbmi.hms.harvard.edu/programs/national-nlp-clinical-challenges-n2c2
3) TREC Medical Records Track (2012-2013).
4) TREC Clinical Decision Support Track (2014-2016): http://www.trec-cds.org
5) TREC Precision Medicine Track (2017-2019): http://www.trec-cds.org
6) CLEF eHealth (2013-2020): https://clefehealth.imag.fr
7) COVID dataset (CORD-19): https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge 


V) Medical Datasets Relevant to VQA: 
===========================================

1) ImageCLEF Medical Automatic Image Annotation (2008-2009): https://www.imageclef.org/2008/medaat and https://www.imageclef.org/2009/medanno
2) ImageCLEF Medical User-oriented Image Retrieval Task (2011): https://www.imageclef.org/2011/medicaluseroriented 
3) ImageCLEF Medical Retrieval Task (2008-2012): https://www.imageclef.org/2012/medical
4) ImageCLEF AMIA: Medical task (2013): https://www.imageclef.org/2013/medical
5) ImageCLEFmed: Medical classification (2015): https://www.imageclef.org/2015/medical
6) ImageCLEF Medical Clustering (2015): https://www.imageclef.org/2015/clustering
7) ImageCLEFmed (2016): https://www.imageclef.org/2016/medical
8) ImageCLEFcaption (2017-2020): https://www.imageclef.org/2017/caption
9) ImageCLEFmedical tasks (2019-2020): https://www.imageclef.org/2019/medical and https://www.imageclef.org/2020/medical 
10) MIMIC-CXR Database (2019): https://physionet.org/content/mimic-cxr/2.0.0/  


## Contact

- Asma Ben Abacha (abenabacha at microsoft dot com)
 
