BioASQ Participants Area
BioASQ datasets
The datasets below are organized per task.
+ Datasets for task a
The training datasets for task a contain annotated articles from PubMed, where annotated means that MeSH terms have been assigned to the articles by the human curators in PubMed. Table 1 provides information about the provided datasets. Note that, apart from size, the main difference between the datasets of different years is the version of the MeSH terms used. For example, the 2015 training dataset contains articles to which MeSH 2015 terms have been assigned. For 2014, 2015 and 2016, two versions of the training data are available. The small version (with respect to size) consists of articles from the pool of journals that the BioASQ team used to select the articles for the test data (a subset of the available journals). The bigger version consists of articles from every available journal. Since 2017, articles for the test data are selected from all available journals, so only one corresponding training dataset is available. The evaluation of the results during each year of the challenge is performed using the corresponding version of the MeSH terms, so their use is highly recommended. The training datasets of previous years of the challenge are also available for reference. Note that not every MeSH term is covered in the datasets. Participants are allowed to use unlimited resources to train their systems.
The training set is served as a JSON string with the following format, where each line is a JSON object that represents a single article:

{"articles": [
  {"abstractText":"text..", "journal":"journal..", "meshMajor":["mesh1",...,"meshN"], "pmid":"PMID", "title":"title..", "year":"YYYY"},
  ...,
  {..}
]}
More details about the format of the data and the task are available in the Guidelines for task 9a.
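The format above can be consumed with standard JSON tooling. The sketch below is a minimal, hypothetical example of extracting PMID/MeSH pairs from a string in that layout; the real training files are multi-gigabyte, so in practice you would stream them from disk rather than hold them in memory.

```python
import json

# Miniature sample mirroring the documented format (real files are much larger).
sample = json.dumps({
    "articles": [
        {"abstractText": "Example abstract.", "journal": "J Example",
         "meshMajor": ["Humans", "Neoplasms"], "pmid": "12345678",
         "title": "An example article", "year": "2015"},
    ]
})

def iter_articles(raw):
    """Yield (pmid, mesh_terms) pairs from a task a training JSON string."""
    for article in json.loads(raw)["articles"]:
        yield article["pmid"], article["meshMajor"]

pairs = list(iter_articles(sample))
```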
Dataset version | Number of articles | Avg. MeSH /article | MeSH covered | Size zip/unzip (txt) | Size zip/unzip (Lucene) |
Training v.2022 (txt/Lucene) | 16,218,838 | 12.68 | 29,681 | 8.9Gb/28.9Gb | 20.7Gb/24.6Gb |
Training v.2021 (txt/Lucene) | 15,559,157 | 12.68 | 29,369 | 7.9Gb/25.6Gb | 17.1Gb/20.9Gb |
Training v.2020 (txt/Lucene) | 14,913,939 | 12.68 | 29,102 | 7.51Gb/24.4Gb | 16.3Gb/19.9Gb |
Training v.2019 (txt/Lucene) | 14,200,259 | 12.69 | 28,863 | 7.10Gb/23.1Gb | 15.4Gb/18.8Gb |
Training v.2018 (txt/Lucene) | 13,486,072 | 12.69 | 28,340 | 6.68Gb/21.7Gb | 14.5Gb/17.7Gb |
Training v.2017 (txt/Lucene) | 12,834,585 | 12.66 | 27,773 | 6.29Gb/20.5Gb | 13.7Gb/16.7Gb |
Training v.2016b (txt/Lucene) | 4,917,245 | 13.01 | 27,150 | 2.4Gb/7.92Gb | 5.25Gb/6.42Gb |
Training v.2016 (txt/Lucene) | 12,208,342 | 12.62 | 27,301 | 5.94Gb/19.4Gb | 12.9Gb/15.8Gb |
Training v.2015b (txt/Lucene) | 4,607,922 | 13.08 | 26,866 | 2.2Gb/7.4Gb | 1.6Gb/2.3Gb |
Training v.2015 (txt/Lucene) | 11,804,715 | 12.61 | 27,097 | 5.7Gb/19Gb | 4.0Gb/5.6Gb |
Training v.2014b (txt/Lucene) | 4,458,300 | 13.20 | 26,631 | 1.9Gb/6.4Gb | 1.3Gb/1.9Gb |
Training v.2014 (txt/Lucene) | 12,628,968 | 12.72 | 26,831 | 6.2Gb/20.31Gb | 4.4Gb/6.2Gb |
Training v.2013 (txt/Lucene) | 10,876,004 | 12.55 | 26,563 | 5.1Gb/18Gb | 4.8Gb/6.2Gb |
Attention: Only registered users can download the training set. You can register for the BioASQ challenge here.
+ Datasets for task b
The development dataset consists of biomedical questions in English, along with their gold concepts, articles, snippets, RDF triples, "exact" answers, and "ideal" answers, in JSON format.
More details about the format of the data and the task are available in the Guidelines for task 9b.
Format modifications introduced in each version of the dataset are described in the corresponding README files.
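As a rough illustration of working with the task b data, the sketch below tallies questions by type from a JSON string. The field names used here ("questions", "type", "body", "exact_answer", ...) are assumptions based on the general layout described above and should be checked against the guidelines for the edition you participate in.

```python
import json
from collections import Counter

# Hypothetical miniature training file; field names are assumptions,
# not an authoritative description of the task b schema.
sample = json.dumps({
    "questions": [
        {"id": "q1", "type": "yesno", "body": "Is aspirin an NSAID?",
         "documents": [], "snippets": [],
         "ideal_answer": ["Yes, aspirin is an NSAID."], "exact_answer": "yes"},
        {"id": "q2", "type": "factoid", "body": "Which gene is mutated in ...?",
         "documents": [], "snippets": [],
         "ideal_answer": ["..."], "exact_answer": [["BRCA1"]]},
    ]
})

# Count how many questions of each type the set contains.
by_type = Counter(q["type"] for q in json.loads(sample)["questions"])
```

Grouping by question type is a common first step, since "yesno", "factoid", "list" and "summary" questions are usually handled by different system components.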
Challenge edition | Year | Training dataset | Number of questions | Test data |
BioASQ13 | 2025 | Training 13b | 5389 | TBA |
BioASQ12 | 2024 | Training 12b | 5046 | 12b golden enriched |
BioASQ11 | 2023 | Training 11b | 4719 | 11b golden enriched |
BioASQ10 | 2022 | Training 10b | 4234 | 10b golden enriched |
BioASQ9 | 2021 | Training 9b | 3742 | 9b golden enriched |
BioASQ8 | 2020 | Training 8b | 3243 | 8b golden enriched |
BioASQ7 | 2019 | Training 7b | 2747 | 7b golden enriched |
BioASQ6 | 2018 | Training 6b | 2251 | 6b golden enriched |
BioASQ5 | 2017 | Training 5b | 1799 | 5b golden enriched |
BioASQ4 | 2016 | Training 4b | 1307 | 4b golden enriched |
BioASQ3 | 2015 | Training 3b | 810 | 3b golden enriched |
BioASQ2 | 2014 | Training 2b | 310 | 2b golden enriched |
Attention: Only registered users can download the training set. You can register for the BioASQ challenge here.
+ Datasets for task MESINESP
The training dataset for this task is available for download. It contains annotated articles from IBECS and LILACS, where annotated means that DeCS terms have been assigned to the articles by human curators.
Two different training datasets are distributed:
- A pre-processed training set with 318,658 records that have at least one DeCS code and no qualifiers. Download the Pre-processed Train set from here.
- The original training set with 369,368 records, which also include the qualifiers, as retrieved from VHL. Download the Original Train set from here.
A development dataset consisting of 750 articles manually annotated with DeCS labels for the BioASQ MESINESP task is available here.
A test dataset consisting of 24,780 articles, including 911 manually annotated articles with DeCS labels plus background articles, is available here.
The golden DeCS labels for the 911 manually annotated articles of the test dataset are available here.
+ Datasets for task C
The training dataset for task C contains annotated biomedical articles published in PubMed along with the corresponding full text from PMC. Annotated means that GrantIDs and the corresponding grant agencies have been identified in the full text of the articles.
The training set is served as a JSON string containing a list of articles.
Each article has two identifiers (for PubMed and PMC) and the corresponding list of Grants for each article.
A Grant in the Grant List of an article may contain a GrantID, a Grant Agency, or ideally both, if available.
An archive with the full text for all articles included in the data is also provided.
In this archive, one can find an XML file with the full text of each article, in PMC XML format, as provided by PMC.
More details about the format of the data and the task are available in the Guidelines for task 5c.
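The description above (a list of articles, each with two identifiers and a grant list whose entries may carry a GrantID, an agency, or both) can be sketched as follows. The key names used here ("articles", "pmid", "pmcid", "grantList", "grantID", "agency") are assumptions for illustration and should be verified against the task 5c guidelines.

```python
import json

# Hypothetical miniature task C training JSON; key names are assumptions.
sample = json.dumps({
    "articles": [
        {"pmid": "111", "pmcid": "PMC111",
         "grantList": [
             {"grantID": "R01-XYZ", "agency": "NIH"},
             {"agency": "NSF"},  # a grant entry may lack a GrantID
         ]},
    ]
})

def grant_stats(raw):
    """Count GrantIDs and grant agencies across all articles."""
    n_ids = n_agencies = 0
    for article in json.loads(raw)["articles"]:
        for grant in article.get("grantList", []):
            n_ids += "grantID" in grant
            n_agencies += "agency" in grant
    return n_ids, n_agencies
```

Counts like these explain why, in the tables below, the number of grant agencies exceeds the number of GrantIDs: some grants list an agency without an identifier.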
Dataset | Training 5c v.2017 (JSON / full text archive) |
Number of articles | 62,952 |
Period of article publication | 2005 - 2013 |
Number of GrantIDs | 111,528 |
Number of Grant Agencies | 128,329 |
Size (JSON) | 12.7Mb |
Size zip/unzip (full text archive) | 1.3Gb/6.8Gb |
Dataset | Dry Run 2017 (Test set / Gold set / full text archive) | Test Set 1 2017 (Test set / Gold set / full text archive) |
Number of articles | 15,205 | 22,610 |
Period of article publication | 2013 - 2015 | 2015 - 2017 |
Number of GrantIDs | 26,272 | 42,711 |
Number of Grant Agencies | 30,503 | 47,266 |
Test set Size (JSON) | 950Kb | 1.38Mb |
Gold set Size (JSON) | 4.65Mb | 7.21Mb |
Size zip/unzip (full text archive) | 325Mb/1.7Gb | 512Mb/2.6Gb |
+ Datasets for task Synergy
The training data for the second version of the Synergy task are available here.
The training data for the Synergy 2022 task are available here.
The training data for the Synergy 2023 task are available here.
The training data for the Synergy 2024 task are available here.
+ Datasets for task BioNNE
The training data for the BioNNE task are available here.
The test data for the BioNNE task are available here.
References:
Loukachevitch, N., Manandhar, S., Baral, E., Rozhkov, I., Braslavski, P., Ivanov, V., ... & Tutubalina, E. (2023). NEREL-BIO: a dataset of biomedical abstracts annotated with nested named entities.
Bioinformatics, 39(4), btad161. https://doi.org/10.1093/bioinformatics/btad161
Terms and conditions for BioASQ data
The data resources considered for developing the BioASQ data were accessed courtesy of the U.S. National Library of Medicine, based on the conditions described here.
The BioASQ data are distributed under the CC BY 2.5 license. If you use data obtained from the BioASQ challenges, please support us by mentioning BioASQ in your acknowledgments and citing our papers:
An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition: George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, Yannis Almirantis, John Pavlopoulos, Nicolas Baskiotis, Patrick Gallinari, Thierry Artiéres, Axel Ngonga, Norman Heino, Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos and Georgios Paliouras, in BMC bioinformatics, 2015 (bib).
BioASQ-QA: A manually curated corpus for Biomedical Question Answering: Anastasia Krithara, Anastasios Nentidis, Konstantinos Bougiatiotis and Georgios Paliouras in Sci Data 10, 2023 (bib).
The road from manual to automatic semantic indexing of biomedical literature: a 10 years journey: Anastasia Krithara, James G. Mork, Anastasios Nentidis, and Georgios Paliouras in Frontiers in Research Metrics and Analytics, vol. 8, 2023 (bib).