Reference Implementation of Coreference Scoring Algorithms

Introduction

This is a reference implementation of the commonly used coreference scoring algorithms, verified by a committee for use by the general community. If you use this scorer, please cite the following two papers:

Scoring Coreference Partitions of Predicted Mentions: A Reference Implementation. Sameer Pradhan, Xiaoqiang Luo, Marta Recasens, Eduard Hovy, Vincent Ng and Michael Strube. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, June 2014.
An Extension of BLANC to System Mentions. Xiaoqiang Luo, Sameer Pradhan, Marta Recasens and Eduard Hovy. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, June 2014.

Release History

v8.01: A small bugfix release. It fixes a crash in BLANC that occurred when the response contained duplicate singleton entities.
v8: Adds the implementation of BLANC for predicted mentions, along with a test suite of about 40 test cases.
v7: Implements MUC, B-CUBED and CEAF (entity-based and mention-based). BLANC was not yet included; it was added in the next release.

Background

As announced on the CoNLL-2011 and 2012 shared tasks website, there were some issues with the coreference scorer used in the evaluations. While debugging it, we found problems that went beyond implementation bugs, so we formed a committee of prominent experts in the field to agree on a standard for evaluating coreference. The committee consists of the following members:

Sameer Pradhan, Harvard University (CoNLL-2011 and 2012 Shared Task Organizer)
Xiaoqiang Luo, Google Inc. (Inventor of the CEAF metric)
Marta Recasens, Google Inc. (Inventor of the BLANC metric)
Michael Strube, HITS gGmbH (Proposer of modifications to B-CUBED and CEAF)
Eduard Hovy, Carnegie Mellon University (Inventor of the BLANC metric)
Vincent Ng, University of Texas at Dallas (Veteran coreference researcher)

We tried to include the researchers who invented the metrics and who are still active and reachable. We were also able to correspond by email with Breck Baldwin, one of the researchers who invented the B-CUBED metric.

Our goal was to fix the technical bugs in the coreference scorer used in the CoNLL-2011 and 2012 shared tasks, and to provide a publicly available, open-source implementation of the coreference scoring algorithms. That implementation is now available on this site. The two papers mentioned above report on the various issues that we encountered and the justification for the selected implementations. The main points are summarized below.

We have re-scored all the systems that participated in the CoNLL-2011 and 2012 shared tasks using the latest version, v8.01. The re-scored results are available in a spreadsheet for now and will be posted shortly on the shared tasks' website. They do not affect the official rankings, but act as an update. The spreadsheet provides results for the official runs as well as the supplementary evaluation settings.

This project is maintained by Sameer Pradhan (email)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.