BioASQ Participants Area
What's new in BioASQ Task 5C
- Golden answers for test sets available in section "Test Data Set"!
What is BioASQ5-Task C
- This year we introduce a new Task C on funding information extraction from biomedical article full text.
- Data sets will consist of articles from PubMed with full text available in PubMed Central (PMC).
- The evaluation of Task 5C will be based on Grant information from PubMed.
Task 5C Guidelines
For this introductory year, Task C will run in *one batch*, on 18 April 2017. After downloading the released test set, participants will have to submit their results within a limited time window.
The evaluation of Task 5C will be based on manually identified Grant information for biomedical articles available in PubMed.
The rest of the guidelines provide the essential information for participating in Task 5C of the BioASQ challenge. They are organised in sections; by clicking on the titles you can find the relevant details.
+ Competition roll-out
Concerning the task:
- One dry run test batch will be released on 11 April, 17:00 CET.
- One test batch will be released on 18 April, 17:00 CET.
- Participants who have checked "Receive Information" during registration will be sent an e-mail informing them that the test set is available.
- Participant systems should extract Funding Information from the full text of articles included in the test set.
- Participants will have to upload their results within a limited time window, using the corresponding section of the Web Interface.
- After the expiration of the test set, the evaluation measures will be calculated.
- Three properties of participant answers will be evaluated independently for each system submission.
- Full Grant extraction, as a combination of GrantID and corresponding Grant Agency.
- Grant ID extraction, regardless of the corresponding Grant Agency.
- Grant Agency extraction, regardless of the specific Grant ID.
- There will be six winning teams in the batch, two for each of the three evaluated properties.
- The winners of the task will be decided from the rankings of the best attempts of each team for the test batch.
+ Funding Information
Task C is about the extraction of Funding Information of biomedical articles from their full text. By Funding Information of an article, we mean the list of Grants acknowledged in any part of the full text of the article. This list of Grants corresponds to the “GrantList” element of the MEDLINE XML format of the article, as described here. Each element of this list corresponds to a Grant, consisting of a GrantID, as mentioned in the full text, and/or the corresponding Grant Agency or Agencies.
Please note that only Funding Information for grants from a list of selected Grant Agencies will be considered for this task;
the list is available here.
Funding Information from other funding Agencies, not in this list, will not be taken into account,
even if mentioned in the text. The list of selected Agencies comes from the Grant Agencies considered in the indexing procedure followed by NLM,
as listed here, excluding Agencies that are no longer being assigned.
NOTE : Grant Agencies of the US Government have hierarchical relations.
For example, a published work may be funded by the “National Cancer Institute” or by the “Division of Cancer Treatment” of the “National Cancer Institute”.
We are always interested in the most specific Agency when available.
When it can be extracted that the article was funded by some US Government Agency, but it is not possible to identify the specific Agency, the value “Public Health Service” may be used.
The hierarchy of selected Agencies is available here, in a tab delimited parent-child format.
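To make the hierarchy concrete, here is a minimal Python sketch that loads such a parent-child file and walks up from a specific Agency; the filename agency_hierarchy.tsv and the (parent, child) column order are assumptions about the provided file, not confirmed details.

    import csv

    # Load the Agency hierarchy from the tab-delimited parent-child file.
    # ASSUMPTION: the filename "agency_hierarchy.tsv" and the
    # (parent, child) column order; adjust both to the actual file.
    parent_of = {}
    with open("agency_hierarchy.tsv", newline="") as f:
        for parent, child in csv.reader(f, delimiter="\t"):
            parent_of[child] = parent

    def ancestors(agency):
        """Return the chain of ancestor Agencies, most specific first."""
        chain = []
        while agency in parent_of:
            agency = parent_of[agency]
            chain.append(agency)
        return chain

    # Example: walking up from a specific US Government Agency may
    # eventually reach "Public Health Service", the generic fallback value.
    print(ancestors("Division of Cancer Treatment"))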
ATTENTION: The presence of both a GrantID and the corresponding Grant Agency is not necessary.
There are cases where a Grant Agency is acknowledged in the full text, but no specific GrantID is mentioned.
In such cases, only the Grant Agency will be included in the data sets of Task C.
In these cases, the Grant Agency may appear in the full text with its full name,
as listed in the list of Agencies, or with some abbreviation of it.
A list of some abbreviations for the Agencies is available here,
and more details on Agencies and abbreviations here.
For example, for the article stating in its full text “Funding Supported by an RCUK fellowship and a BBSRC New Investigator research grant (to M.A.T).”,
the Agency “Biotechnology and Biological Sciences Research Council” will be included in task C data sets, since its abbreviation “BBSRC” is acknowledged.
GrantIDs in Task C data sets appear exactly as they do in the full text of the corresponding article.
- The only exception to this rule is the case of GrantIDs containing the substring N01/P01/R01 with some combination of the letters "O" and/or "l" in place of the numbers “0” and/or “1” respectively. For more details on this exception, please consult the last paragraph of NLM’s description of the GrantList XML field here. (A normalization sketch follows this list.)
- Unlike GrantIDs, Agencies in Task C data sets are not necessarily as they appear in the article full text. Only valid Agency names, as listed in the list of selected Agencies, are used in the data sets (both for training and testing).
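The following Python sketch illustrates one simplified reading of this exception, mapping the letters back to digits inside an N01/P01/R01-style prefix; it is not NLM's exact algorithm, so consult the linked description before relying on it.

    import re

    def normalize_grant_prefix(grant_id):
        """Replace letter 'O' with '0' and letter 'l' with '1' inside an
        N01/P01/R01-style prefix. Simplified interpretation of the NLM
        exception described above, not NLM's exact algorithm."""
        def fix(match):
            return match.group(0).replace("O", "0").replace("l", "1")
        # Match the three-character prefix with any mix of O/0 and l/1.
        return re.sub(r"\b[NPR][O0][l1]\b", fix, grant_id)

    print(normalize_grant_prefix("NOl AI-51519"))  # -> "N01 AI-51519"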
Each GrantID in Task C datasets is accompanied by the corresponding Grant Agency or Agencies.
- Some of the corresponding Grant Agencies are derived automatically from the GrantID string by NLM’s algorithm. For these Agencies, it is possible that no mention exists in the full text of the article at all.
- When mentioned in the full text, corresponding Grant Agencies may appear in forms different from the one used in Task C data sets, for example due to abbreviations, different expressions or misspellings.
- In some cases, more than one Agency may correspond to one GrantID.
In such cases, all combinations must be included in the data sets.
For example, for the article mentioning that “… is supported by a joint award from the Medical Research Council (UK) and the Wellcome Trust (G001354).”, GrantID G001354 will be included twice in the Grant list of this article: Once accompanied by “Medical Research Council” and once by “Wellcome Trust”.
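In the JSON format described in the "Training Data Set" section below, this joint grant would therefore appear as two entries in the article's grant list (assuming "Medical Research Council" and "Wellcome Trust" are the valid Agency names in the list of selected Agencies):

    "grantList": [
      { "grantID": "G001354", "agency": "Medical Research Council" },
      { "grantID": "G001354", "agency": "Wellcome Trust" }
    ]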
IMPORTANT: *Only* Funding Information extracted from the full text of the article will be included in Task C data sets. PubMed also assigns GrantIDs and Agencies that are not necessarily extracted from the full text of the article (e.g. Grant Information submitted by authors through the NIH Manuscript Submission System). Grant information not present in the full text will be excluded from all the data sets of Task C.
ATTENTION: Data sets will *not* include all Funding Information mentioned in the full text of each article. In other words, only a subset of the GrantIDs and Agencies mentioned in the full text will be included in the data sets (both for training and testing). The main reasons for funding information being mentioned in the full text but not present in the data sets are:
- Information for funding from Agencies which are not included in the list of selected Agencies.
- GrantID assignments in PubMed data that are not as they appear in the corresponding article full text. The only exception, where GrantIDs in Task C data sets are not as they appear in the full text, is the N01/P01/R01 case explained above.
- Grant Agencies not accompanying a GrantID and mentioned in the full-text with their full name or abbreviation.
- Funding information missed by NLM indexers during the manual indexing process.
+ Training Data Set
The training data set for this task contains annotated biomedical articles published in PubMed, together with the corresponding full text from PMC. By annotated, we mean that GrantIDs and the corresponding Grant Agencies have been identified in the full text of the articles. For more details about the funding information included in the training data set, please consult the section “Funding Information” above.
The training set is served as a JSON string containing a list of articles.
Each article has two identifiers (for PubMed and PMC) and the corresponding list of Grants for each article.
A Grant in the Grant List of an article may contain a GrantID, a Grant Agency, or ideally both, if available.
An archive with the full text for all articles included in the data is also provided.
In this archive, one can find an XML file with the full text of each article, in PMC XML format, as provided by PMC.
Data set | Training 2017 (JSON / full text archive)
Number of articles | 62,952 |
Period of article publication | 2005 - 2013 |
Number of GrantIDs | 111,528 |
Number of Grant Agencies | 128,329 |
Size (JSON) | 12.7Mb |
Size zip/unzip (full text archive) | 1.3Gb/6.8Gb |
The training set is served as a JSON string with the following format:
{ "articles": [ { "pmid":"PMID", "pmcid":"PMCID", "grantList":[ {"grantID": "grant ID 1", "agency": "agency 1" }, {"agency": "agency 2" }, ... ] }, ... ] }The JSON string contains the following fields for each article:
pmid : the unique identifier of each article in PubMed,
pmcid : the unique identifier of each article in PubMed Central,
grantList : a list with the grants of the article, extracted from its full text.
Each grant in the grant list, contains the following fields:
grantID : the research grant or contract number (or both) that designates financial support by a granting agency,
agency : the full Institute/Organization Name, as listed in the list of Grant Agencies.
Please note that:
- A Grant element in the grant list may contain a GrantID with corresponding Grant Agency, or a Grant Agency alone.
- The same Grant ID may appear in more than one Grant element in the same Grant List, if more than one Agency contributed to the corresponding funding. For more details, please see the section "Funding Information" above.
- The PMC unique identifier may be used by participants to retrieve the full text of the article in the corresponding file.
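As a starting point, a minimal Python sketch for loading the training set and iterating over its grants; the filename training2017.json is a placeholder for the actual name of the downloaded file.

    import json

    # "training2017.json" is a placeholder for the downloaded training file.
    with open("training2017.json") as f:
        data = json.load(f)

    for article in data["articles"]:
        pmid, pmcid = article["pmid"], article["pmcid"]
        for grant in article["grantList"]:
            # A grant may carry a grantID, an agency, or both.
            grant_id = grant.get("grantID")
            agency = grant.get("agency")
            print(pmid, pmcid, grant_id, agency)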
+ Test Data set
The test data set for this task will contain biomedical articles from PubMed, with full text available from PubMed Central.
The test data set will consist of articles published more recently than the articles of the training set,
corresponding to the intended use of funding information extraction systems: finding funding information in newer articles.
The set will be served as a JSON string containing article identifiers (for PubMed and PMC)
accompanied by an archive with corresponding full text of articles in PMC XML format.
Participants are expected to extract Funding Information from article full text, in the form of Grant lists. For more details about the Information to be extracted, please, consult the section "Funding Information" above.
The test set is served as a JSON string with the following format:
{ "articles": [ { "pmid":"PMID1", "pmcid":"PMCID1" },{ "pmid":"PMID2", "pmcid":"PMCID2" }, ... ] }The JSON string contains the following fields for each article:
pmid : the unique identifier of each article in PubMed,
pmcid : the unique identifier of each article in PubMed Central.
Data set | Dry Run 2017 (Test set / Gold set / full text archive) | Test Set 1 2017 (Test set / Gold set / full text archive) |
Number of articles | 15,205 | 22,610 |
Period of article publication | 2013 - 2015 | 2015 - 2017 |
Number of GrantIDs | 26,272 | 42,711 |
Number of Grant Agencies | 30,503 | 47,266 |
Test set Size (JSON) | 950Kb | 1.38Mb |
Gold set Size (JSON) | 4.65Mb | 7.21Mb |
Size zip/unzip (full text archive) | 325Mb/1.7Gb | 512Mb/2.6Gb |
+ Required Answers
System results for the test data set will be submitted as a JSON string containing the same article identifiers (for PubMed and PMC), accompanied by the corresponding Grant List extracted from each article's full text. Each element of the submitted Grant List for an article may be:
- A Grant ID, exactly as mentioned in the full text of the article.
- An Agency, i.e. one of the agencies in the "list of Selected Agencies".
- A Full Grant, i.e. a Grant ID, as mentioned in the full-text, and the corresponding Agency, as listed in the "list of selected Agencies".
- A Full Grant should contain a Grant ID and one corresponding Agency. For joint grants, where a Grant ID corresponds to more than one Agency, separate Grant elements should be submitted for the same Grant ID, one for each of the corresponding Agencies.
- Participation can be partial. Participants are allowed to submit lists with any of the Grant elements mentioned above; e.g. one system may only submit lists of Grant IDs without corresponding Agencies.
- A grantList must be submitted for every article in the test set.
- The format of the JSON string is case sensitive.
- GrantIDs should always be submitted as mentioned in the full text.
- Grant Agencies can be derived from the GrantID string, if available.
- Agencies should always be submitted as listed in the "list of selected Agencies". For example, for an article mentioning funding by "the Canadian Institutes for Health Research.", *only* the valid value "Canadian Institutes of Health Research" will be considered a correct answer!
The format of system submissions will be the following:
{ "articles": [ { "pmid": "PMID1", "pmcid": "PMCID1", "grantList": [ { "grantID": "grant ID 1", "agency": "agency 1" }, { "agency": "agency 3" }, { "grantID": "grant ID 4" } ... ] }, { "pmid": "PMID2", "pmcid": "PMCID2", "grantList": [ { "agency": "agency 5" } ... ] }, ... ] }
ATTENTION:
- There is a limit of 20 elements on the length of the Grant List per article. We have ensured that this limit of 20 suffices to encompass all the information per article, for all articles in the test data set.
- There is also a limit on unique agencies. More specifically:
- If the Grant Agency is not accompanying a GrantID, then the limit is 4 unique agencies per article.
- If the Grant Agency is accompanying a GrantID, then the limit is 2 unique agencies per unique GrantID (case of joint grants).
{ "articles": [ { "pmid": "PMID1", "pmcid": "PMCID1", "grantList": [ { "grantID": "grant ID 1", "agency": "agency 1" }, { // Up to 2 elements like this, with th same GrantID, in the same grantList "grantID": "grant ID 1", "agency": "agency 1" }, { // Up to 4 elements like this, without a GrantID, in the same grantList "agency": "agency 3" } ... // Up to 20 elements, in total, in a grantList ] } ... // A grantList for every article in the test set ] }
+ Full text from PubMed Central
All data sets for this Task (i.e. for training and testing) will contain the unique PMC identifier for each article and will be accompanied by a compressed file with the full text of all articles included in the data set, in PMC XML format.
For each article in a data set, a corresponding file, named after the PMC identifier of the article, will be available in the compressed file.
For example, given the PMC id "212403", one can access the full text of the article in PMC XML format in the file "212403.xml" in the accompanying compressed file.
In addition, using this PMC identifier, participants will be able to use the Entrez Programming Utilities (E-utilities) of NCBI to access the full text of each article directly from PMC.
For example, given the PMC id "212403", one can access the full text of the article in PMC XML format using the following URL :
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=212403
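For illustration, a short Python sketch fetching that article's full text with the standard library, using the URL shown above:

    from urllib.request import urlopen

    # Fetch the PMC XML full text for PMC id "212403" via NCBI E-utilities.
    url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
           "?db=pmc&id=212403")
    with urlopen(url) as response:
        xml_text = response.read().decode("utf-8")
    print(xml_text[:200])  # first characters of the PMC XML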
+ Submit Results
A web interface is available in the section Submitting/Task 5c. There, participants can submit their results, using the "Browse" field to select the file with the corresponding JSON string of results, and the system dropdown menu to select the name of the system that corresponds to these results.
ATTENTION: Only one JSON file needs to be submitted for the final results. The aforementioned three-way independent evaluation will be done automatically by the evaluation system; no separate file for each evaluation is needed.
NOTE:
- Participants must upload their results before the expiration of the test set.
- Participants can upload results multiple times for the same system before the expiration of the test set. Each time a user uploads new results, the old ones are erased.
- A system must be selected for the submission. Instructions on adding a system can be found in the section "Add a system".
- Submissions should include all articles in the test set, even with empty GrantLists.
+ Add a system
Each user will have the opportunity to participate in the Task with a maximum of 5 systems. To register systems, after logging in, visit "Edit Profile Settings" and follow the instructions available there.
ATTENTION: Trying to upload results without selecting a system will result in an error and the results will not be saved.
+ Evaluation
The official evaluation measure for Task C will be micro-recall.
The results provided by the participants will be evaluated independently in three different ways, for the same submission:
- Full Grant extraction, as a combination of GrantID and the corresponding Grant Agency.
- Grant ID extraction, regardless of the corresponding Agency. In this category, all submitted GrantIDs will be considered for each article:
- GrantIDs accompanied by an Agency (Agency not taken into account).
- GrantIDs not accompanied by an Agency.
- Grant Agency extraction, regardless of the specific Grant ID. In this category, all submitted Agencies will be considered for each article:
- Agencies accompanying a grantID (grantID not taken into account).
- Agencies not accompanying a grantID.
Therefore, three micro recall scores will be calculated for each submission, one for each category of information (Full Grants, Grant IDs, Grant Agencies). For each of these scores, the participants will be ranked independently.
ATTENTION: The ground truth data sets do *not* include all Funding Information mentioned in the full text of each article. In other words, only a subset of the GrantIDs and Agencies mentioned in the full text are included in the ground truth data sets, for the reasons already mentioned in "Funding Information". As a result, participants' answers will be evaluated only against information contained in the ground truth data. If a system provides in the grantList a piece of information (i.e. Full Grant / Grant ID / Grant Agency) present in the text of the article but not in the ground truth data, it will not be penalized for it, as this answer will simply be neglected.
NOTE: In some cases, it may be possible to extract a more specific Grant Agency than the one provided in the ground truth data. In such cases, if you have provided a more specific agency and the ground truth data contain a "parent" agency higher in the hierarchy than the one provided, your answer will still be counted as correct. Test sets for Task 5C will *not* include articles with funding from two agencies, one being an ancestor of the other.
IMPORTANT: Determining the exact range that the grantID string spans (i.e. where each grantID starts and ends) is part of the challenge and should be inferred from the data. However, there are some known inconsistencies in the data. In particular the following:
- There are cases where a grantID is preceded by the agency supporting the grant and the agency is incorrectly included in the corresponding grantList entry. For example, there exists an occurrence of the snippet "NIH AI-51519" in an article where the corresponding grantID is incorrectly "NIH AI-51519", while the correct one should be "AI-51519".
- There also exist articles where the grantID starts with a number of digits which are incorrectly omitted from the grantID in the corresponding grantList entry. In one such occurrence, the snippet "5 P01 AG027734" is found in the text, exactly as it is, and is the correct grantID; however, the corresponding incorrect entry in the grantList is "P01 AG027734".
- Similar cases also exist where the grantID ends with a number of digits (after a dash) which are incorrectly omitted from the grantID in the corresponding grantList entry. In one such occurrence, the snippet "R01 MH078151-03" is found in the text, exactly as it is, and is the correct grantID; however, the corresponding incorrect entry in the grantList is "R01 MH078151".
NOTE: String comparison for GrantIDs during evaluation will be case insensitive.
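To make the measure concrete, here is a hedged Python sketch of micro-recall over the three categories; it uses case-insensitive GrantID comparison, per the note above, but ignores the agency-hierarchy allowance, so it illustrates the measure rather than reproducing the official evaluation code.

    def micro_recall(gold, pred):
        """gold, pred: dicts mapping pmid -> list of grant dicts
        (the format of the data sets and of the submissions).
        Returns one micro-recall score per evaluated category."""
        def keys(grants, mode):
            out = set()
            for g in grants:
                gid = g.get("grantID", "").lower()   # case-insensitive IDs
                agency = g.get("agency", "")
                if mode == "full" and gid and agency:
                    out.add((gid, agency))
                elif mode == "id" and gid:
                    out.add(gid)
                elif mode == "agency" and agency:
                    out.add(agency)
            return out

        scores = {}
        for mode in ("full", "id", "agency"):
            tp = total = 0
            for pmid, gold_grants in gold.items():
                g = keys(gold_grants, mode)
                p = keys(pred.get(pmid, []), mode)
                tp += len(g & p)          # correctly extracted items
                total += len(g)           # all items in the ground truth
            scores[mode] = tp / total if total else 0.0
        return scores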
*Data from NLM are distributed based on the conditions described here. License Code: 8283NLM123.
If you use data obtained from the BioASQ challenges, please support us by mentioning BioASQ in your acknowledgements and citing our paper:
George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R. Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, Yannis Almirantis, John Pavlopoulos, Nicolas Baskiotis, Patrick Gallinari, Thierry Artiéres, Axel Ngonga, Norman Heino, Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos and Georgios Paliouras. “An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition.” BMC Bioinformatics, 2015 (bib).