BioASQ Participants Area
Frequently Asked Questions
Task 10a
- Why does task 10a begin in late January instead of early February this year? In 2018, NLM introduced fully automated indexing with an improved version of MTI for a subset of MEDLINE citations, and in 2021 they announced the scale-up of this policy to all MEDLINE citations by mid-2022 here. In response to this change, the schedule of task 10a was shifted a few weeks earlier in the year.
Task 9b
- Why is the mean F-measure used as the official measure for the evaluation of snippets in Task b, Phase A? The evaluation of snippets (Task b, Phase A) is based on a special definition of precision and recall that considers the case of partial overlaps between snippets, as described in subsection 2.1 in the evaluation measures for task B. However, calculating the average precision (AP) based on this special definition of precision and recall can lead to AP scores greater than one, as different snippets can overlap with the same golden snippet. As AP is not easy to interpret in this context, since BioASQ9 the official measure for snippet retrieval has been the mean F-measure score. The F-measure score is based on character overlaps only and, contrary to AP, is not affected by the number of overlapping golden elements.
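The following is a minimal sketch of how a character-overlap precision, recall and F-measure for the snippets of a single question could be computed. The snippet representation (a document identifier plus character offsets) and the helper names are illustrative assumptions; the authoritative definitions remain those in the evaluation measures for task B.

```python
def snippet_f_measure(returned, golden):
    """Character-overlap precision, recall and F-measure for one question.

    `returned` and `golden` are lists of (doc_id, start, end) triples with
    character offsets; this flat representation is a simplifying assumption
    (real BioASQ snippets also specify the article section).
    """
    def char_positions(snippets):
        # Expand each snippet into its set of (document, character) positions,
        # so that overlapping snippets are never counted twice.
        return {(doc, pos) for doc, start, end in snippets
                for pos in range(start, end)}

    ret, gold = char_positions(returned), char_positions(golden)
    if not ret or not gold:
        return 0.0, 0.0, 0.0
    overlap = len(ret & gold)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision, recall = overlap / len(ret), overlap / len(gold)
    return precision, recall, 2 * precision * recall / (precision + recall)


# A returned snippet that half-overlaps a golden one scores P = R = F1 = 0.5.
print(snippet_f_measure([("doc1", 0, 50)], [("doc1", 25, 75)]))
```

The mean F-measure over a test set would then simply average these per-question F-measure scores.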
Task 8b
- What is different in the calculation of average precision (AP) for Task b, Phase A evaluation? Since BioASQ8, the number of relevant items considered for AP calculation is set to the minimum between the actual number of relevant items (|LR|) and 10, which is the maximum number of elements per question in participant submissions, as described in equation 2.8 in the evaluation measures for task B.
Since BioASQ3, the denominator of equation 2.7 for AP calculation had been set to 10, which is the maximum number of elements per question in participant submissions. However, some questions can have fewer than 10 relevant items in their golden set. In such cases, the maximum achievable AP under this assumption is less than 1, which does not affect the ranking of the systems or the winners, but can still give a misleading impression of the general performance level of the systems.
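For illustration, here is a minimal sketch of the adjusted AP for a single question, assuming binary relevance judgments and at most 10 returned elements; the exact formulation is the one given in equation 2.8 of the evaluation measures for task B.

```python
def average_precision(ranked_items, relevant, max_returned=10):
    """AP for one question with the BioASQ8 adjustment: the denominator is
    min(|LR|, 10) instead of a fixed 10, so a question whose golden set
    contains fewer than 10 relevant items can still reach an AP of 1."""
    relevant = set(relevant)
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_items[:max_returned], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    denominator = min(len(relevant), max_returned)
    return precision_sum / denominator if denominator else 0.0


# Only three relevant documents exist, so a perfect ranking now reaches 1.0;
# with the old fixed denominator of 10 the same ranking would score only 0.3.
print(average_precision(["d1", "d2", "d3", "d4", "d5"], {"d1", "d2", "d3"}))
```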
Task 8a
- Why are the time of test set release and the deadline for submission changed in BioASQ8? During BioASQ7, we noticed that label annotations for a small number of articles were occasionally published in PubMed earlier than expected. In early 2020, we observed that new MeSH annotations are released in PubMed around 08.30 GMT. Therefore, we moved the submission period six hours earlier, that is from 10.00 GMT of each Monday until 07.00 GMT of Tuesday, to comply with this change.
- Why are the submitted MeSH labels per article limited to 200? The average number of MeSH annotations per article is between 12 and 13 labels, and the maximum number of labels for one article is 50. Therefore, the limit of 200 MeSH labels per article is not really restrictive for participant submissions. We introduced this limitation to avoid computational issues with the evaluation measures in the extreme case that a submission assigns unnecessarily many labels. In such cases, only the first 200 labels assigned to each article will be considered.
- What is the "automated indexing" in PubMed and why are labels for corresponding articles ignored during evaluation? Since 2018, MEDLINE/PubMed provides information about the method by which the MeSH annotations were determined for a citation. These annotations used to be exclusively manual, but as MTI is continuously improved, some annotations can be "Automated". To restrict the evaluation of submissions to manually assigned labels, we ignore the annotations for articles with "IndexingMethod"="Automated" during the evaluation of test sets for task a.
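As an illustration only, the sketch below shows how such articles could be excluded before scoring. It assumes each test set article is represented as a dictionary carrying the citation's IndexingMethod value; the data layout and identifiers are hypothetical.

```python
def manually_indexed_only(articles):
    """Drop articles whose MeSH labels were assigned automatically, assuming
    each article is a dict whose "IndexingMethod" entry is missing or
    different from "Automated" for manually indexed citations."""
    return [a for a in articles if a.get("IndexingMethod") != "Automated"]


# Hypothetical test set entries:
articles = [
    {"pmid": "11111111", "IndexingMethod": "Automated"},
    {"pmid": "22222222"},  # no IndexingMethod: manually indexed
]
print([a["pmid"] for a in manually_indexed_only(articles)])  # ['22222222']
```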
Task 6b
- Why use the new F1-score to evaluate performance in Phase B for yes/no questions? We have noticed that the distribution of yes/no questions is imbalanced, greatly favoring "yes" questions. We have consulted the biomedical experts and will try to generate more "no" questions in the future. However, in order to counterbalance this imbalance, as well as to capture the performance of each system in more detail, we will be using the F1-score of the results.
More specifically, we will calculate the F1-score independently for the "yes" and the "no" questions and, finally, the macro-averaged F1-score, which will be the final evaluation metric for each system regarding the yes/no questions. A detailed description of the evaluation measures for Task B is available here.
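The following is a minimal sketch of the macro-averaged F1 described above, assuming the golden and predicted answers of the yes/no questions are given as lists of "yes"/"no" strings; it is meant only to illustrate the calculation.

```python
def macro_f1_yes_no(gold, predicted):
    """Macro-averaged F1 over yes/no questions: F1 is computed twice, first
    with "yes" and then with "no" as the positive class, and averaged."""
    def f1(positive):
        tp = sum(g == p == positive for g, p in zip(gold, predicted))
        fp = sum(p == positive and g != positive for g, p in zip(gold, predicted))
        fn = sum(g == positive and p != positive for g, p in zip(gold, predicted))
        if tp == 0:
            return 0.0
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    return (f1("yes") + f1("no")) / 2


# A system that always answers "yes" gets no credit on the "no" class,
# so its macro-averaged F1 drops to roughly 0.43 on this toy set.
gold = ["yes", "yes", "yes", "no"]
print(macro_f1_yes_no(gold, ["yes", "yes", "yes", "yes"]))
```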
Task 5b
- Why should participants no longer submit synonyms for the exact answers of list and factoid questions? Submitting synonyms is redundant, since a single synonym is enough for an exact answer to be considered correct. Golden exact answers include synonyms, where appropriate, and even if a system finds and submits a valid synonym missing from the initial golden answer, this synonym will be added to the enriched golden answer after the manual inspection of all system submissions by the biomedical experts.
Note: This change does not affect the format of the submissions.
Each unique exact answer is still represented by an array in JSON format, as done so far. However, this array should now contain only one element, instead of multiple synonyms. If an array in your submission contains more than one element/synonym, only the first synonym will be taken into account during evaluation (see the illustrative example below).
For more details, please see the "JSON format of the datasets" section in the Task 5B Guidelines.
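For illustration, here is a minimal, hypothetical example of how the exact answer of a list question could be serialised under this rule. Only the "exact_answer" field is shown in context; the identifier and answer strings are made up, and the authoritative format is the one described in the Task 5B Guidelines.

```python
import json

# Hypothetical entry for a list question: each unique exact answer is still
# an array, but it now holds a single element instead of several synonyms.
question = {
    "id": "example_list_question",  # illustrative identifier
    "exact_answer": [
        ["answer one"],   # one element per unique answer;
        ["answer two"],   # any extra synonyms would be ignored
    ],
}
print(json.dumps(question, indent=2))
```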
Task 5a
- Why are test articles no longer selected from a list of selected journals? Lately, NLM indexers tend to work on a backlog of not-yet-indexed articles. Therefore, test sets created only from recently added articles have long annotation periods. Under these conditions, the best way to obtain test sets with satisfyingly short annotation periods is to use internal information from NLM about the articles that are likely to be indexed soon. As a result, a list of selected journals is redundant and would only limit the size of the test sets without any significant impact on the expected annotation period.
Task 2a
- Which classes are going to be used during classification? MeSH (Medical Subject Headings) is the vocabulary that will be used for classification.
- How many labels does each article usually receive? Usually articles receive around 10-15 labels.
- In what formats are the training and the test data going to be provided during the challenge? Training and test data will be provided both as raw text and as a vectorized representation. The vectorized representation will be distributed as a Lucene index. A detailed description of the fields of the index and of the vectorization tools is also provided.