BioASQ Participants Area

Frequently Asked Questions

Task 6B

  1. Why use the new F1-score to evaluate performance in Phase B for yes/no questions?
  2. We have noticed that the distribution of yes/no questions is imbalanced, greatly favoring "yes" questions. We have consulted with the biomedical experts and we will try to generate more "no" questions in the future. However, in order to counterbalance this imbalance, as well as to capture the performance of each system in more detail, we will be utilizing the F1-score of the results.
    More specifically, we will calculate the F1-score independently for the "yes" and "no" questions, and then the macro-averaged F1-score, which will be the final evaluation metric for each system on the yes/no questions. A detailed description of the evaluation measures for Task B is available here.
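The computation described above can be sketched as follows. This is an illustrative Python sketch, not the official BioASQ evaluation code; the function names and the sample data are made up for the example.

```python
def f1(tp, fp, fn):
    """Standard F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def macro_f1_yes_no(golden, predicted):
    """Compute F1 separately for the "yes" and "no" classes, then average.

    `golden` and `predicted` are parallel lists of "yes"/"no" labels.
    """
    pairs = list(zip(golden, predicted))
    # Counts with "yes" treated as the positive class.
    tp_yes = sum(1 for g, p in pairs if g == "yes" and p == "yes")
    fp_yes = sum(1 for g, p in pairs if g == "no" and p == "yes")
    fn_yes = sum(1 for g, p in pairs if g == "yes" and p == "no")
    # With "no" as the positive class, the error counts swap roles.
    tp_no = sum(1 for g, p in pairs if g == "no" and p == "no")
    f1_yes = f1(tp_yes, fp_yes, fn_yes)
    f1_no = f1(tp_no, fn_yes, fp_yes)
    return (f1_yes + f1_no) / 2

# Example with made-up system output: three golden "yes", one golden "no".
golden = ["yes", "yes", "yes", "no"]
predicted = ["yes", "yes", "no", "no"]
print(macro_f1_yes_no(golden, predicted))
```

Note how the macro average weights both classes equally, so a system that always answers "yes" is penalized on the "no" class even when "yes" questions dominate the test set.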

Task 5B

  1. Why should participants no longer submit synonyms for the exact answers of list and factoid questions?
  2. Submitting synonyms is redundant, since a single synonym is enough for an exact answer to be considered correct. Golden exact answers already include synonyms where appropriate, and even if a system submits a valid synonym missing from the initial golden answer, that synonym will be added to the enriched golden answer after the biomedical experts manually inspect all system submissions.
    Note: This change does not affect the format of the submissions.
    Each unique exact answer is still represented by an array in JSON format, as before. Now, however, this array should contain only one element instead of several synonyms. If an array in your submission contains more than one element, only the first will be taken into account during evaluation.
    For more details, please see "JSON format of the datasets" section in Task 5B Guidelines.
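The points above can be illustrated with a minimal submission fragment. This is a hypothetical sketch: the question id and the answers are placeholders, and only the `exact_answer` field relevant to this FAQ entry is shown; consult the Task 5B Guidelines for the full submission schema.

```python
import json

# Hypothetical fragment of a submission for a list question.
# Each exact answer is still a JSON array, but it now holds a single
# element: synonyms are omitted, since one form of the answer suffices.
submission = {
    "questions": [
        {
            "id": "example-question-id",  # placeholder, not a real BioASQ id
            "exact_answer": [
                ["BRCA1"],  # one element per answer array
                ["BRCA2"],
            ],
        }
    ]
}

print(json.dumps(submission, indent=2))
```

Under the old format an entry might have read `["BRCA1", "breast cancer 1 gene"]`; with the change, only the first element of such an array would be evaluated anyway, so submitting just `["BRCA1"]` is equivalent.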

Task 5A

  1. Why are test articles no longer selected from a list of preselected journals?
  2. Lately, NLM indexers tend to work through a backlog of unindexed articles, so test sets built only from recently added articles have long annotation periods. Under these conditions, the best way to keep annotation periods satisfyingly short is to use internal information from NLM about which articles are likely to be indexed soon. As a result, a list of selected journals is redundant and would only limit the size of the test sets without any significant impact on the expected annotation period.

Task 2A

  1. Which are the classes that are going to be used during classification?
  2. MeSH (Medical Subject Headings) is the vocabulary that is going to be used for classification.

  3. How many labels does each article usually receive?
  4. Articles usually receive around 10-15 labels.

  5. In what formats are the training and the test data going to be provided during the challenge?
  6. Training and test data will be provided both as raw text and in a vectorized representation. The vectorized representation will be distributed as a Lucene index. A detailed description of the index fields and the vectorization tools is provided separately.