BioNNE-L Shared Task Overview
The BioNNE-L Shared Task invites submissions focusing on Biomedical Nested Named Entity Linking in English and Russian. The train, dev, and test datasets include mentions of disorders, anatomical structures, and chemicals, mapped to biomedical concepts from the Unified Medical Language System (UMLS). Participants are welcome to explore any model architecture and leverage any publicly available data to maximize performance.
Goal: map biomedical entity mentions to their corresponding concept names and unique identifiers (CUIs) within the Unified Medical Language System (UMLS).
Data: Entities from English and Russian scientific abstracts in the biomedical domain. The BioNNE-L task utilizes the MCN annotation of the NEREL-BIO dataset [1], which provides annotated mentions of disorders, anatomical structures, chemicals, diagnostic procedures, and biological functions.
Useful code and competition details are available on our GitHub.
To submit your results, please register for the Codalab Competition.
Evaluation Tracks: Similar to the BioNNE 2024 task [2], the evaluation is structured into Three Subtasks under Two Evaluation Tracks:
- Monolingual Track: requires separate models for English (Subtask 1) and Russian (Subtask 2);
- Bilingual Track: requires a single model trained on a multilingual dataset that combines the English and Russian data (Subtask 3). Please note that predictions from any monolingual model are not allowed in this track.
Shared Task-Specific Challenges:
- Nestedness: Complexity of nested entity mentions. Below, you can find nested entities visualized as graphs with entities as nodes. If an entity is nested into another, the two nodes are connected with an edge:
Here, our assumption is that two or more nested entities can serve as additional context for one another, so entity linking should be conducted jointly for all of them, because the predicted concepts should be consistent with each other.
- Partial terminology: a concept may lack a concept name in the low-resource language (Russian) and thus has to be linked to a vocabulary entry in the high-resource language (English).
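The nestedness graph described above can be sketched in a few lines. This is only an illustration, not the organizers' code: it assumes each entity is given as a single (start, end) pair of character offsets and treats one entity as nested in another when its offsets fall entirely inside the other's. The helper names `is_nested` and `nesting_edges` are hypothetical, and the two inner mentions in the example are made up.

```python
from itertools import permutations

def is_nested(inner, outer):
    """True if span `inner` lies within span `outer` and the two differ."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and inner != outer

def nesting_edges(entities):
    """Return sorted (inner_text, outer_text) edges for one document.

    `entities` maps a mention string to its (start, end) offsets.
    """
    return sorted(
        (a, b)
        for (a, span_a), (b, span_b) in permutations(entities.items(), 2)
        if is_nested(span_a, span_b)
    )

# Assumption: illustrative offsets; only the outer mention mirrors an
# example row from the data, the nested ones are invented.
entities = {
    "chronic heart failure": (198, 219),
    "heart failure": (206, 219),
    "heart": (206, 211),
}
edges = nesting_edges(entities)
for inner, outer in edges:
    print(f"{inner!r} is nested in {outer!r}")
```

Each edge connects a shorter mention to a longer one that contains it, so a joint linking model could use the connected mentions as context for each other.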
Available Data
The data for our competition includes:
- TSV-formatted and Parquet-formatted entities (we provide two formats for your convenience; their content is identical)
- Parquet-formatted vocabulary
- Raw texts
Currently, three train sets (English, Russian, bilingual) and a bilingual normalization vocabulary are available here.
The provided BioNNE-L Shared Task data (annotated entities and normalization vocabulary) is available through HuggingFace:
from datasets import load_dataset

# Loading multilingual data (Track 2)
bilingual_dataset = load_dataset("andorei/BioNNE-L", "Bilingual", split="train")
# Loading monolingual data (Track 1: Russian/English)
ru_dataset = load_dataset("andorei/BioNNE-L", "Russian", split="train")
en_dataset = load_dataset("andorei/BioNNE-L", "English", split="train")
# Loading normalization vocabulary
vocab = load_dataset("andorei/BioNNE-L", "Vocabulary", split="train")
Annotated Data Format
Each line describes a single biomedical entity of one of three types: (i) Disease (DISO), (ii) Chemical (CHEM), (iii) Anatomy (ANATOMY). The fields are:
- document_id is a unique textual identifier of the document the given entity is derived from. Each document contains multiple entities described by their spans in the document;
- text is the textual mention string of the given entity;
- entity_type can take one of three values: DISO, CHEM, ANATOMY. These are high-level semantic types supported by the underlying UMLS knowledge base;
- spans provides a list of comma-separated entity positions within the textual document identified by document_id. Each span entry provides starting and ending positions, e.g., 22-28. An entity with multiple positions (e.g., 472-476,492-500 for lung injuries) is an interrupted entity with non-entity words between its parts;
- UMLS_CUI is the Concept Unique Identifier (CUI) in the UMLS Metathesaurus (UMLS serves as the normalization vocabulary). This field provides the ground-truth CUI for the given entity. Note that the predicted CUI in your submission file must be in the prediction column.
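As a rough sketch of how the spans field could be handled, the helpers below parse a spans string and rebuild a mention from the document text. The function names are hypothetical, and the code assumes character offsets with an exclusive end position. That reading is consistent with the examples on this page (for instance, 198-219 covers the 21 characters of "chronic heart failure"), but it is not stated explicitly by the task description.

```python
def parse_spans(spans):
    """Parse a string such as "472-476,492-500" into [(472, 476), (492, 500)]."""
    pairs = []
    for chunk in spans.split(","):
        start, end = chunk.strip().split("-")
        pairs.append((int(start), int(end)))
    return pairs

def is_interrupted(spans):
    """An entity with more than one span has non-entity words between its parts."""
    return len(parse_spans(spans)) > 1

def extract_mention(document, spans):
    """Join the fragments covered by each span (assumes end-exclusive offsets)."""
    return " ".join(document[start:end] for start, end in parse_spans(spans))

# Toy document, not taken from the dataset.
document = "acute lung and kidney injuries"
mention = extract_mention(document, "6-10,22-30")
print(mention, is_interrupted("6-10,22-30"))
```

Comparing the rebuilt mention with the provided text field is a cheap sanity check that the offsets are being interpreted as intended.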
Entity Data
Here are some entity examples:
Document ID | Text | Entity Type | Spans | UMLS CUI |
24052682_ru | заболеваниями печени | DISO | 1545-1558, 1568-1574 | C0023895 |
25842921_en | chronic heart failure | DISO | 198-219 | C0264716 |
26036067_en | right posterior carpal region | ANATOMY | 1735-1764 | C4240186 |
26027241_en | lymphocyte antigen | CHEM | 580-598 | C0023158 |
Normalization Vocabulary
We collected a bilingual concept vocabulary derived from the English and Russian parts of UMLS. Because the Russian vocabulary is incomplete (the partial terminology challenge), some Russian entities have to be mapped to an English vocabulary entry. The vocabulary file is a TSV file with the following fields:
- CUI: the concept's unique identifier in UMLS;
- Semantic type: the concept's semantic type (DISO/CHEM/ANATOMY);
- Concept name: a textual concept name derived from UMLS. Each concept can have multiple vocabulary entries with different names that share the same CUI.
Here are some vocabulary entry examples:
CUI | Semantic Type | Concept Name |
C0018995 | DISO | Hematochromatosis |
C0018995 | DISO | Bronze diabetes (disorder) |
C0018995 | DISO | Cirrhosis, Pigmentary |
C0018995 | DISO | Гемохроматоз |
C0018995 | DISO | Цирроз пигментный |
C0018995 | DISO | Сидерофилия |
C0018995 | DISO | Диабет бронзовый |
C5399736 | CHEM | Serotonin-4 Receptor Agonist [EPC] |
C5399736 | CHEM | Serotonin 5-Hydroxytryptamine-4 Receptor Agonist |
C5399736 | CHEM | Serotonin-4 Receptor Agonist |
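To make the partial terminology issue concrete, the sketch below groups vocabulary rows by CUI and lists the concepts that have no Russian name. It is an illustration under stated assumptions: the rows are a subset of the examples above written as plain tuples (the real column names may differ), the function names are hypothetical, and "Russian" is approximated by the presence of any Cyrillic character.

```python
from collections import defaultdict

# Subset of the example rows above: (CUI, semantic type, concept name).
rows = [
    ("C0018995", "DISO", "Hematochromatosis"),
    ("C0018995", "DISO", "Гемохроматоз"),
    ("C0018995", "DISO", "Диабет бронзовый"),
    ("C5399736", "CHEM", "Serotonin-4 Receptor Agonist [EPC]"),
    ("C5399736", "CHEM", "Serotonin-4 Receptor Agonist"),
]

def has_cyrillic(text):
    """Heuristic language check: any character in the Cyrillic Unicode block."""
    return any("\u0400" <= ch <= "\u04FF" for ch in text)

def group_by_cui(rows):
    """Map each CUI to the list of its concept names."""
    names = defaultdict(list)
    for cui, _semantic_type, name in rows:
        names[cui].append(name)
    return dict(names)

def cuis_without_russian_name(rows):
    """CUIs whose vocabulary entries contain no Cyrillic name."""
    grouped = group_by_cui(rows)
    return sorted(
        cui for cui, names in grouped.items()
        if not any(has_cyrillic(name) for name in names)
    )

english_only = cuis_without_russian_name(rows)
print(english_only)
```

Mentions of such concepts in Russian text can only be matched against English names, which is why cross-lingual representations are likely to matter here.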
Evaluation Restrictions
- For Track 2 (Multilingual), predictions from any monolingual model are not allowed.
- For Track 1 (Russian/English), participants are required to treat each language as a separate task. Distinct models and prediction files are necessary for English and Russian.
- Prediction files submitted to the two tracks must not be identical.
Submission Format
A prediction file is expected to be a TSV file with four columns: (1) document_id, (2) spans, (3) rank, (4) prediction.
- document_id and spans values should match the ones given in the unlabeled data. The concatenation of these two fields serves as a unique primary key that identifies the underlying entity, so make sure you do not modify the provided document identifiers and entity spans;
- rank is the integer rank of a retrieved vocabulary concept. Note: ranks should be in the range from 1 to 5;
- prediction must contain a valid UMLS CUI that matches a CUI from the normalization vocabulary.
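A prediction file in this layout could be produced along the following lines. This is a minimal sketch rather than an official script: the function name `write_predictions` is hypothetical, the header row is an assumption (check the competition page for whether one is expected), and the second candidate CUI in the example is only a placeholder.

```python
import csv
import io

def write_predictions(candidates, handle, max_rank=5):
    """Write ranked CUI candidates as TSV rows: document_id, spans, rank, prediction.

    `candidates` maps (document_id, spans) to a list of CUIs, best first.
    Only the first `max_rank` candidates are written, with ranks 1..max_rank.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["document_id", "spans", "rank", "prediction"])  # assumed header
    for (document_id, spans), cuis in candidates.items():
        for rank, cui in enumerate(cuis[:max_rank], start=1):
            writer.writerow([document_id, spans, rank, cui])

# Hypothetical model output for one entity; keys are copied unchanged
# from the provided data so that the primary key still matches.
candidates = {("25842921_en", "198-219"): ["C0264716", "C0018995"]}
buffer = io.StringIO()
write_predictions(candidates, buffer)
print(buffer.getvalue())
```

Passing an open file instead of the `io.StringIO` buffer writes the same rows to disk.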
Evaluation metrics
We treat BioNNE-L as a retrieval task: given a mention, a model must retrieve the top-k concepts from the given UMLS vocabulary. We employ two evaluation metrics:
- Accuracy@k: Accuracy@k=1 if the correct UMLS CUI is retrieved at rank ≤ k, otherwise Accuracy@k=0;
- MRR: Mean Reciprocal Rank, the average over mentions of 1/rank of the correct UMLS CUI.
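For local validation, the two metrics can be approximated as below. This is a sketch under assumptions, not the official scorer: scores are averaged over all gold mentions, a mention whose correct CUI is missing from the ranked list contributes 0, and the mention keys and CUIs in the example are toy values.

```python
def accuracy_at_k(ranked, gold, k):
    """Fraction of mentions whose gold CUI appears among the top-k predictions."""
    hits = sum(1 for key, cui in gold.items() if cui in ranked.get(key, [])[:k])
    return hits / len(gold)

def mean_reciprocal_rank(ranked, gold):
    """Average of 1/rank of the gold CUI; 0 when it was not retrieved."""
    total = 0.0
    for key, cui in gold.items():
        predictions = ranked.get(key, [])
        if cui in predictions:
            total += 1.0 / (predictions.index(cui) + 1)
    return total / len(gold)

# Toy data: keys stand for (document_id, spans) pairs, values for CUIs.
gold = {"m1": "C1", "m2": "C2", "m3": "C3"}
ranked = {"m1": ["C1", "C9"], "m2": ["C8", "C2"], "m3": ["C7"]}

acc1 = accuracy_at_k(ranked, gold, k=1)
acc2 = accuracy_at_k(ranked, gold, k=2)
mrr = mean_reciprocal_rank(ranked, gold)
print(acc1, acc2, mrr)
```

In the toy data the correct CUI is ranked first for one mention, second for another, and missing for the third, so Accuracy@1 is 1/3, Accuracy@2 is 2/3, and MRR is (1 + 1/2 + 0) / 3 = 0.5.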
Important Dates:
Phase | Date |
Training Data Release | 5 Feb 2025 |
Dev data release, Development phase start | 19 Feb 2025 |
Test data release, Evaluation phase start | 25 Apr 2025 |
Test set predictions due | 6 May 2025 |
Submission of participant papers | 31 May 2025 |
Acceptance notification for participant papers | 24 June 2025 |
Camera-ready working notes papers | 8 July 2025 |
BioASQ Workshop at CLEF 2025 | 9-12 Sep 2025 |
[1] Loukachevitch, Natalia, Andrey Sakhovskiy, and Elena Tutubalina. Biomedical Concept Normalization over Nested Entities with Partial UMLS Terminology in Russian. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024.
[2] Davydova, Vera, Natalia Loukachevitch, and Elena Tutubalina. Overview of BioNNE task on biomedical nested named entity recognition at BioASQ 2024. CLEF Working Notes 2024.