BioASQ Participants Area
Task MESINESP begins in March. For the detailed schedule, stay tuned!
Other notes - Task MESINESP
- The MESINESP2 task is the second edition of the new BioASQ task introduced for the first time in 2020.
Guidelines for Task MESINESP
The Task on Medical Semantic Indexing in Spanish (MESINESP) is based on the standard process followed by IBECS and LILACS to index journal abstracts in Spanish. The aim of the task is to improve automatic indexing systems for medical documents in the scientific literature in this language, but also for clinical trial texts and patents. The task is divided into three sub-tracks:
- [Sub-track 1] MESINESP - Scientific Literature: Participants will be asked to classify new IBECS/LILACS documents written in Spanish. The participating systems will be evaluated against a manually annotated data set created specifically for this task.
- [Sub-track 2] MESINESP - Clinical Trials: This track will require automatic indexing with DeCS terms of clinical trials from REEC (Registro Español de Estudios Clínicos).
- [Sub-track 3] MESINESP - Patents: This track will require automatic indexing with DeCS terms of patents written in Spanish.
The rest of the guidelines provide the essential information for participating in Task MESINESP2 of the BioASQ challenge. They are organized in sections; by clicking on the titles you can find the relevant details.
+ Competition roll-out
More details about the schedule of MESINESP are available here.
+ Download Training Data
The training dataset for this task is available for download. We provide a training dataset for each of the sub-tracks:
- [Sub-track 1] MESINESP - Scientific Literature: It contains all Spanish records from the LILACS and IBECS databases at the Virtual Health Library (VHL) that have a non-empty abstract written in Spanish; empty and non-Spanish abstracts have been filtered out. The training dataset was built from data crawled on 01/29/2021. This means that the data is a snapshot of that moment and may change over time, since LILACS and IBECS usually add or modify indexes after a record's first inclusion in the database. We distribute two different datasets:
- Articles training set: This corpus contains the set of 237574 Spanish scientific papers in VHL that have at least one DeCS code assigned to them.
- Full training set: This corpus contains the whole set of 249474 Spanish documents from VHL that have at least one DeCS code assigned to them.
- [Sub-track 2] MESINESP - Clinical Trials: The training dataset contains records from the Registro Español de Estudios Clínicos (REEC). REEC does not provide documents with the title/abstract structure needed in BioASQ, so we have built artificial abstracts based on the content available in the data crawled using the REEC API. Because clinical trials are not indexed with the DeCS terminology, we have used as training data a set of 3592 clinical trials that were automatically annotated in the first edition of MESINESP and published as a Silver Standard outcome. Since the performance of the participants' models varied, we have only selected predictions from runs with a MiF higher than 0.30, which corresponds to the submissions of the three best teams, and we have kept the intersection of all codes assigned by those teams.
- [Sub-track 3] MESINESP - Patents: Details will be announced in the coming days.
More details about the training datasets are available here.
+ DeCS headings
DeCS is a trilingual and structured vocabulary created by BIREME to serve as a unique language for indexing, searching and retrieving material from the biomedical scientific literature. It was developed from MeSH with the aim of serving as a common terminology for consistent searching in three languages, namely Spanish, Portuguese and English.
The MESINESP task dataset annotations are based on DeCS 2020. A DeCS descriptor in the hierarchy can be referred to in two ways: (i) by its human-readable name, e.g. "trombólisis mecánica", and (ii) by its index, e.g. "D061185". In their submissions, participants should use the DeCS index. The DeCS 2020 hierarchy is available in the OBO format here. The mapping between the names and the indexes is available here. All the nodes in the DeCS graph (and not only the leaf nodes) are valid classification answers for the BioASQ challenge.
Note: Thanks to our collaborators at BIREME, we have been able to include new COVID-related descriptors that will be used in future versions of DeCS. Training articles do not use these terms, but they will appear in a future version of the development set that will enable systems to properly classify this type of content.
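As a rough illustration of working with the OBO release mentioned above, the minimal sketch below extracts (index, name) pairs from `[Term]` stanzas. It assumes the standard OBO tag layout (`id:` and `name:` lines) and ignores any DeCS-specific extra tags, so treat it as a starting point rather than a complete parser:

```python
def parse_obo_terms(obo_text):
    """Extract a {index: name} mapping from the [Term] stanzas of an OBO file.

    Minimal sketch: only the "id:" and "name:" tags are read; all other
    OBO tags (synonyms, is_a relations, etc.) are ignored.
    """
    terms = {}
    current_id = current_name = None
    for line in obo_text.splitlines():
        line = line.strip()
        if line == "[Term]":
            # Start of a new stanza: forget any partial state.
            current_id = current_name = None
        elif line.startswith("id:"):
            current_id = line[len("id:"):].strip()
        elif line.startswith("name:"):
            current_name = line[len("name:"):].strip()
        if current_id and current_name:
            terms[current_id] = current_name
            current_id = current_name = None
    return terms

# Tiny stanza in the style of the examples given in these guidelines:
sample = "[Term]\nid: D061185\nname: trombólisis mecánica\n"
mapping = parse_obo_terms(sample)
```

With such a mapping in hand, converting a system's human-readable predictions to the indexes required for submission is a dictionary lookup in the reverse direction.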
More details about the DeCS vocabulary are available here.
+ Evaluation
The participating systems will be evaluated using the following measures:
- Accuracy (Acc.)
- Example Based Precision (EBP)
- Example Based Recall (EBR)
- Example Based F-Measure (EBF)
- Macro Precision (MaP)
- Macro Recall (MaR)
- Macro F-Measure (MaF)
- Micro Precision (MiP)
- Micro Recall (MiR)
- Micro F-Measure (MiF)
More details about evaluation are available here.
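For intuition about how the averaged measures differ, here is an illustrative sketch (not the official evaluation code): micro-averaging pools true/false positive counts across all labels before computing one precision/recall/F score, while macro-averaging computes an F score per label and then averages over labels.

```python
from collections import defaultdict

def micro_macro_f1(gold, pred):
    """Compute micro- and macro-averaged F-measure for multi-label predictions.

    gold, pred: dict mapping document id -> set of DeCS codes.
    Returns (MiF, MaF). Illustrative sketch only.
    """
    labels = set()
    for codes in list(gold.values()) + list(pred.values()):
        labels |= codes

    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for doc in gold:
        g, p = gold[doc], pred.get(doc, set())
        for lab in p & g:
            tp[lab] += 1   # predicted and correct
        for lab in p - g:
            fp[lab] += 1   # predicted but wrong
        for lab in g - p:
            fn[lab] += 1   # missed

    # Micro: pool counts over all labels, then compute P/R/F once.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    mip = TP / (TP + FP) if TP + FP else 0.0
    mir = TP / (TP + FN) if TP + FN else 0.0
    mif = 2 * mip * mir / (mip + mir) if mip + mir else 0.0

    # Macro: compute F per label, then average over labels.
    per_label_f = []
    for lab in labels:
        p = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        r = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        per_label_f.append(2 * p * r / (p + r) if p + r else 0.0)
    maf = sum(per_label_f) / len(per_label_f) if per_label_f else 0.0

    return mif, maf
```

Macro-averaging gives rare labels the same weight as frequent ones, which is why MaF is typically lower than MiF on heavily skewed label distributions such as DeCS indexing.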
+ Download test set
There will be a test set for each of the sub-tracks. Each test set will consist of non-annotated documents in Spanish that have been uploaded to IBECS/LILACS, REEC and Google Patents, as well as other background biomedical texts in Spanish. However, only the labels of the documents manually indexed by experts will be used for the official evaluation, not the background data.
The test set data will be served as JSON (JavaScript Object Notation) strings. JSON is lightweight and easy to parse, and every major programming language offers modules for working with JSON strings.
The format of the test set data in the JSON string will be the following:

{"articles": [
  {"id": "biblio-1000005", "title": "Title", "abstract": "Abstract..", "journal": "Journal.."},
  {"id": "biblio-1000072", "title": "Title", "abstract": "Abstract..", "journal": "Journal.."},
  .
  .
  {"id": "lil-708981", "title": "Title", "abstract": "Abstract..", "journal": "Journal.."}
]}

This JSON string represents an array of document objects. Each object has an id, a title, an abstract and a journal.
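A minimal sketch of parsing this format with Python's standard json module (the field names follow the structure shown above; `load_test_articles` is our own helper name, not part of any BioASQ tooling):

```python
import json

def load_test_articles(json_text):
    """Parse a MESINESP test-set JSON string into (id, title, abstract, journal) tuples."""
    data = json.loads(json_text)
    return [(a["id"], a["title"], a["abstract"], a["journal"])
            for a in data["articles"]]

# Example with the structure shown above:
sample = '''{"articles": [
  {"id": "biblio-1000005", "title": "Title",
   "abstract": "Abstract..", "journal": "Journal.."}
]}'''

for doc_id, title, abstract, journal in load_test_articles(sample):
    # A typical indexing system would run on title + abstract here.
    text = title + " " + abstract
```

To process a downloaded test set, read the JSON file from disk (with UTF-8 encoding, since abstracts contain Spanish accented characters) and pass its contents to the helper.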
Only registered users can download the test sets via the web interface. In the section Submitting/Task MESINESP you can find the available test sets; by clicking on a test set, you can download it as a JSON text file.
More details about the test sets are available here.
+ Submit test results
In the section Submitting/Task MESINESP you can find a form with a "Browse" field and a system dropdown menu. After selecting the file on your computer that contains the JSON string and the name of the system these results correspond to, you can submit them.
The format of the JSON string in this case will be:

{"documents": [
  {"labels": ["label1", "label2", ..., "labelN"], "id": "biblio-1000005"},
  {"labels": ["label1", "label2", ..., "labelM"], "id": "biblio-1000072"},
  .
  .
  {"labels": ["label1", "label2", ..., "labelK"], "id": "lil-708981"}
]}

where "label1", .., "labelN" are the DeCS indexes, e.g. "33540", and not the human-readable names, e.g. "procesos psicoterapéuticos".
ATTENTION:
- Users must upload DeCS labels for every article in the test set.
- The format of the JSON string is case sensitive. Thus, trying to upload a JSON with different key casing (e.g. "ID" instead of "id") will result in a 500 error.
- Users must upload their results before the expiration of the test set.
- Users can upload results multiple times for the same system before the expiration of the test set. Each time that a user uploads new results the old ones are erased.
- Before saving the results, the system checks that:
- The system in the JSON string belongs to the user,
- The IDs in the provided JSON belong to the active test set,
- There are DeCS indexes for every article of the test set,
- The DeCS indexes exist, and
- The test set is still active.
- The system responds with an OK message or an error message, depending on the outcome of the user's request.
- After uploading results, participants can see information about their uploads in the "Submit your results" section.
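Putting the submission format and the server-side checks together, a minimal sketch of preparing a submission might look like this (the document ids, labels and the `predictions` variable below are placeholders for illustration, not real test-set content):

```python
import json

# Hypothetical output of an indexing system: doc id -> list of DeCS indexes.
predictions = {
    "biblio-1000005": ["D061185", "D000001"],
    "biblio-1000072": ["D000001"],
}

# Build the submission object; note the lowercase keys "documents",
# "labels" and "id" -- the format is case sensitive.
submission = {
    "documents": [
        {"labels": labels, "id": doc_id}
        for doc_id, labels in predictions.items()
    ]
}

# Local self-checks mirroring two of the server-side validations:
# every test-set id must be covered, and every document must carry
# at least one label.
test_set_ids = {"biblio-1000005", "biblio-1000072"}  # ids from the test set
assert {d["id"] for d in submission["documents"]} == test_set_ids
assert all(d["labels"] for d in submission["documents"])

# Serialize without escaping non-ASCII characters.
payload = json.dumps(submission, ensure_ascii=False)
```

Saving `payload` to a .json file gives a file ready to upload through the Submitting/Task MESINESP form; the checks the server itself applies (valid DeCS indexes, active test set, system ownership) still run on upload.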
+ Add a system
Each user will have the opportunity to participate in Task MESINESP with a maximum of 5 systems. To register your systems, log in, visit "Edit Profile Settings" and follow the instructions available there.
ATTENTION: Trying to upload results without selecting a system will result in an error, and the results will not be saved.
Data from NLM are distributed based on the conditions described here. License Code: 8283NLM123.
If you used data obtained from the BioASQ challenges, please support us by mentioning BioASQ in your acknowledgements and citing our paper:
George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, Yannis Almirantis, John Pavlopoulos, Nicolas Baskiotis, Patrick Gallinari, Thierry Artières, Axel Ngonga, Norman Heino, Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos and Georgios Paliouras: An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 2015 (bib).