BioASQ Participants Area
BioASQ datasets
The datasets below are organized per task.
+ Datasets for task a
The training datasets for task a contain annotated articles from PubMed, where annotated means that MeSH terms have been assigned to the articles by the human curators in PubMed. Table 1 provides information about the provided datasets. Note that, apart from size, the main difference between the datasets of different years is the version of the MeSH terms used. For example, the 2015 training dataset contains articles to which MeSH 2015 terms have been assigned. For 2014, 2015 and 2016, two versions of the training data are available. The small version (with respect to size) consists of articles from the pool of journals that the BioASQ team used to select the articles for the test data (a subset of the available journals). The bigger version consists of articles from every available journal. Since 2017, articles for the test data are selected from all available journals, so only one corresponding training dataset is available. The evaluation of the results during each year of the challenge is performed using the corresponding version of the MeSH terms, so their use is highly recommended. The training datasets of previous years of the challenge are also available for reference. Note that not every MeSH term is covered in the datasets. Participants are allowed to use unlimited resources to train their systems.
The training set is served as a JSON string with the following format, where each line is a JSON object that represents a single article:

{"articles": [
  {"abstractText":"text..", "journal":"journal..", "meshMajor":["mesh1",...,"meshN"], "pmid":"PMID", "title":"title..", "year":"YYYY"},
  ...,
  {..}
]}
More details about the format of the data and the task are available in the Guidelines for task 9a.
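The format above can be consumed with standard JSON tooling. The sketch below is a minimal, hypothetical example of extracting PMID/MeSH pairs from a string in that layout; the real training files are multi-gigabyte, so in practice you would stream them from disk rather than hold them in memory.

```python
import json

# Miniature sample mirroring the documented format (real files are much larger).
sample = json.dumps({
    "articles": [
        {"abstractText": "Example abstract.", "journal": "J Example",
         "meshMajor": ["Humans", "Neoplasms"], "pmid": "12345678",
         "title": "An example article", "year": "2015"},
    ]
})

def iter_articles(raw):
    """Yield (pmid, mesh_terms) pairs from a task a training JSON string."""
    for article in json.loads(raw)["articles"]:
        yield article["pmid"], article["meshMajor"]

pairs = list(iter_articles(sample))
```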
Dataset version | Number of articles | Avg. MeSH /article | MeSH covered | Size zip/unzip (txt) | Size zip/unzip (Lucene) |
Training v.2022 (txt/Lucene) | 16,218,838 | 12.68 | 29,681 | 8.9Gb/28.9Gb | 20.7Gb/24.6Gb |
Training v.2021 (txt/Lucene) | 15,559,157 | 12.68 | 29,369 | 7.9Gb/25.6Gb | 17.1Gb/20.9Gb |
Training v.2020 (txt/Lucene) | 14,913,939 | 12.68 | 29,102 | 7.51Gb/24.4Gb | 16.3Gb/19.9Gb |
Training v.2019 (txt/Lucene) | 14,200,259 | 12.69 | 28,863 | 7.10Gb/23.1Gb | 15.4Gb/18.8Gb |
Training v.2018 (txt/Lucene) | 13,486,072 | 12.69 | 28,340 | 6.68Gb/21.7Gb | 14.5Gb/17.7Gb |
Training v.2017 (txt/Lucene) | 12,834,585 | 12.66 | 27,773 | 6.29Gb/20.5Gb | 13.7Gb/16.7Gb |
Training v.2016b (txt/Lucene) | 4,917,245 | 13.01 | 27,150 | 2.4Gb/7.92Gb | 5.25Gb/6.42Gb |
Training v.2016 (txt/Lucene) | 12,208,342 | 12.62 | 27,301 | 5.94Gb/19.4Gb | 12.9Gb/15.8Gb |
Training v.2015b (txt/Lucene) | 4,607,922 | 13.08 | 26,866 | 2.2Gb/7.4Gb | 1.6Gb/2.3Gb |
Training v.2015 (txt/Lucene) | 11,804,715 | 12.61 | 27,097 | 5.7Gb/19Gb | 4.0Gb/5.6Gb |
Training v.2014b (txt/Lucene) | 4,458,300 | 13.20 | 26,631 | 1.9Gb/6.4Gb | 1.3Gb/1.9Gb |
Training v.2014 (txt/Lucene) | 12,628,968 | 12.72 | 26,831 | 6.2Gb/20.31Gb | 4.4Gb/6.2Gb |
Training v.2013 (txt/Lucene) | 10,876,004 | 12.55 | 26,563 | 5.1Gb/18Gb | 4.8Gb/6.2Gb |
Attention: Only registered users can download the training set. You can register for the BioASQ challenge here.
+ Datasets for task b
The development dataset consists of biomedical questions in English, along with their gold concepts, articles, snippets, RDF triples, "exact" answers, and "ideal" answers, in JSON format.
More details about the format of the data and the task are available in the Guidelines for task 9b.
Format modifications introduced in each version of the dataset are described in the corresponding README files.
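As a rough illustration of working with the task b data, the sketch below tallies questions by type from a JSON string. The field names used here ("questions", "type", "body", "exact_answer", ...) are assumptions based on the general layout described above and should be checked against the guidelines for the edition you participate in.

```python
import json
from collections import Counter

# Hypothetical miniature training file; field names are assumptions,
# not an authoritative description of the task b schema.
sample = json.dumps({
    "questions": [
        {"id": "q1", "type": "yesno", "body": "Is aspirin an NSAID?",
         "documents": [], "snippets": [],
         "ideal_answer": ["Yes, aspirin is an NSAID."], "exact_answer": "yes"},
        {"id": "q2", "type": "factoid", "body": "Which gene is mutated in ...?",
         "documents": [], "snippets": [],
         "ideal_answer": ["..."], "exact_answer": [["BRCA1"]]},
    ]
})

# Count how many questions of each type the set contains.
by_type = Counter(q["type"] for q in json.loads(sample)["questions"])
```

Grouping by question type is a common first step, since "yesno", "factoid", "list" and "summary" questions are usually handled by different system components.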
Challenge edition | Year | Training dataset | Number of questions | Test data |
BioASQ13 | 2025 | Training 13b | 5389 | TBA |
BioASQ12 | 2024 | Training 12b | 5046 | 12b golden enriched |
BioASQ11 | 2023 | Training 11b | 4719 | 11b golden enriched |
BioASQ10 | 2022 | Training 10b | 4234 | 10b golden enriched |
BioASQ9 | 2021 | Training 9b | 3742 | 9b golden enriched |
BioASQ8 | 2020 | Training 8b | 3243 | 8b golden enriched |
BioASQ7 | 2019 | Training 7b | 2747 | 7b golden enriched |
BioASQ6 | 2018 | Training 6b | 2251 | 6b golden enriched |
BioASQ5 | 2017 | Training 5b | 1799 | 5b golden enriched |
BioASQ4 | 2016 | Training 4b | 1307 | 4b golden enriched |
BioASQ3 | 2015 | Training 3b | 810 | 3b golden enriched |
BioASQ2 | 2014 | Training 2b | 310 | 2b golden enriched |
Attention: Only registered users can download the training set. You can register for the BioASQ challenge here.
+ Datasets for task MESINESP
The training dataset for this task is available for download. It contains annotated articles from IBECS and LILACS, where annotated means that DeCS terms have been assigned to the articles by human curators.
Two different training datasets are distributed:
- A pre-processed training set with 318,658 records that have at least one DeCS code and no qualifiers. Download the Pre-processed Train set from here.
- The original training set with 369,368 records, which also include the qualifiers, as retrieved from VHL. Download the Original Train set from here.
A development dataset consisting of 750 articles manually annotated with DeCS labels for the BioASQ MESINESP task is available here.
A test dataset consisting of 24,780 articles, including 911 manually annotated articles with DeCS labels plus background articles, is available here.
The golden DeCS labels for the 911 manually annotated articles of the test dataset are available here.
+ Datasets for task C
The training dataset for task C contains annotated biomedical articles published in PubMed along with the corresponding full text from PMC. Annotated means that GrantIDs and the corresponding grant agencies have been identified in the full text of the articles.
The training set is served as a JSON string containing a list of articles.
Each article has two identifiers (for PubMed and PMC) and the corresponding list of Grants for each article.
A Grant in the Grant List of an article may contain a GrantID, a Grant Agency, or ideally both, if available.
An archive with the full text for all articles included in the data is also provided.
In this archive, one can find an XML file with the full text of each article, in PMC XML format, as provided by PMC.
More details about the format of the data and the task are available in the Guidelines for task 5c.
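The description above (a list of articles, each with two identifiers and a grant list whose entries may carry a GrantID, an agency, or both) can be sketched as follows. The key names used here ("articles", "pmid", "pmcid", "grantList", "grantID", "agency") are assumptions for illustration and should be verified against the task 5c guidelines.

```python
import json

# Hypothetical miniature task C training JSON; key names are assumptions.
sample = json.dumps({
    "articles": [
        {"pmid": "111", "pmcid": "PMC111",
         "grantList": [
             {"grantID": "R01-XYZ", "agency": "NIH"},
             {"agency": "NSF"},  # a grant entry may lack a GrantID
         ]},
    ]
})

def grant_stats(raw):
    """Count GrantIDs and grant agencies across all articles."""
    n_ids = n_agencies = 0
    for article in json.loads(raw)["articles"]:
        for grant in article.get("grantList", []):
            n_ids += "grantID" in grant
            n_agencies += "agency" in grant
    return n_ids, n_agencies
```

Counts like these explain why, in the tables below, the number of grant agencies exceeds the number of GrantIDs: some grants list an agency without an identifier.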
Dataset | Training 5c v.2017 (JSON / full text archive) |
Number of articles | 62,952 |
Period of article publication | 2005 - 2013 |
Number of GrantIDs | 111,528 |
Number of Grant Agencies | 128,329 |
Size (JSON) | 12.7Mb |
Size zip/unzip (full text archive) | 1.3Gb/6.8Gb |
Dataset | Dry Run 2017 (Test set / Gold set / full text archive) | Test Set 1 2017 (Test set / Gold set / full text archive) |
Number of articles | 15,205 | 22,610 |
Period of article publication | 2013 - 2015 | 2015 - 2017 |
Number of GrantIDs | 26,272 | 42,711 |
Number of Grant Agencies | 30,503 | 47,266 |
Test set Size (JSON) | 950Kb | 1.38Mb |
Gold set Size (JSON) | 4.65Mb | 7.21Mb |
Size zip/unzip (full text archive) | 325Mb/1.7Gb | 512Mb/2.6Gb |
+ Datasets for task Synergy
The training data for the second version of the Synergy task are available here.
The training data for the Synergy 2022 task are available here.
The training data for the Synergy 2023 task are available here.
The training data for the Synergy 2024 task are available here.
+ Datasets for task BioNNE
The training data for the BioNNE task are available here.
The test data for the BioNNE task are available here.
References:
Loukachevitch, N., Manandhar, S., Baral, E., Rozhkov, I., Braslavski, P., Ivanov, V., ... & Tutubalina, E. (2023). NEREL-BIO: a dataset of biomedical abstracts annotated with nested named entities.
Bioinformatics, 39(4), btad161. https://doi.org/10.1093/bioinformatics/btad161
Terms and conditions for BioASQ data
The data resources considered for developing the BioASQ data were accessed courtesy of the U.S. National Library of Medicine, based on the conditions described here.
The BioASQ data are distributed under the CC BY 2.5 license. If you use data obtained from the BioASQ challenges, please support us by mentioning BioASQ in your acknowledgments and citing our papers:
An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition: George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, Yannis Almirantis, John Pavlopoulos, Nicolas Baskiotis, Patrick Gallinari, Thierry Artiéres, Axel Ngonga, Norman Heino, Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos and Georgios Paliouras, in BMC bioinformatics, 2015 (bib).
BioASQ-QA: A manually curated corpus for Biomedical Question Answering: Anastasia Krithara, Anastasios Nentidis, Konstantinos Bougiatiotis and Georgios Paliouras in Sci Data 10, 2023 (bib).
The road from manual to automatic semantic indexing of biomedical literature: a 10 years journey: Anastasia Krithara, James G. Mork, Anastasios Nentidis, and Georgios Paliouras in Frontiers in Research Metrics and Analytics, vol. 8, 2023 (bib).