This is a reference implementation of the commonly used coreference scoring algorithms. It has been verified by a committee for use by the general community. If you use this scorer, please cite the following two papers:
As announced on the CoNLL-2011 and 2012 shared tasks website, there were some issues with the coreference scorer used in the evaluations. While debugging, we found problems that went beyond implementation bugs, so we formed a committee of prominent experts in the field to agree on a standard for evaluating coreference. The committee consists of the following members:
We tried to include the researchers who invented the metrics and who are still active and reachable. We were able to correspond by email with Breck Baldwin, one of the researchers who invented the B-CUBED metric.
Our goal was to fix the technical bugs in the coreference scorer used for the CoNLL-2011 and 2012 shared tasks, and to provide a publicly available, open-source implementation of the coreference scoring algorithms. That implementation is now available on this site. The two papers mentioned above report on the issues we encountered and the justification for the selected implementations. The main points are summarized below.
We have re-scored all the systems that participated in the CoNLL-2011 and 2012 shared tasks using the latest version, v8.01. The re-scored results are available in a spreadsheet for now and will be posted shortly on the shared tasks' website. The spreadsheet covers the official runs as well as the supplementary evaluation settings. The re-scoring does not affect the official rankings; it serves as an update.