10

I had a paper accepted to an A* ML conference a year ago. It was for a novel dataset that we made. Before the camera-ready deadline, I found that a significant number of ground truth labels were wrong (roughly 25-30%). When I told the second author of the paper, who was technically my mentor, he told me to leave it if I couldn't find enough time to fix it myself, since he didn't want to re-involve the other individuals. There were mistakes on my end, which I fixed before the camera-ready, but I didn't submit the fixes since there were also other annotations which may have needed a second look, and I wasn't qualified to comment on those. At the time, he told me that all of our experiments are reproducible with our annotations and are open-source, so it's fine to keep updating the dataset + arXiv over time, and we technically did verify the dataset once before running.

For a while now, I have realized that this was misconduct, since we submitted a paper that we knew had mistakes in it, but I didn't want to go against him since he was potentially going to be a reference letter writer for me. It took me a year to find qualified people who could help cross-check the annotations, and I contacted all of the people who used our faulty dataset and made public updates on the mistakes that we found + fixed. The study/conclusions of our paper ended up being the same, but we had to change a large number of annotations.

I still feel really guilty about this and can't stop thinking about it. It was technically my fault for not fixing it since he told me to fix it later, but I didn't have enough time to do it myself, + there were other parts I couldn't do myself. I want to update the proceedings paper, but that's probably far too late at this point.

2
  • 4
    It seems like you already did do a lot. What goal do you want to achieve by doing more? Commented Oct 23 at 10:30
  • Read the whole post + comments, then I think you'll get a full context of the harm. Commented Nov 8 at 5:34

1 Answer

17

Honest mistakes are not misconduct. I realize that you knew about the mistake before the camera-ready deadline, but (1) the camera-ready deadline is pretty late to make significant changes, (2) the mistakes did not affect the conclusions or results of the paper, and (3) it is expected that large datasets will contain minor labeling and other mistakes.

Precisely because of #3, it is quite normal to continually publish new versions of the dataset with errors corrected or other improvements. If you have a "v1" that allows others to reproduce the published results and a "v2" that includes all the bug-fixes, then I really don't see an issue here, and it is probably not necessary to publish a corrigendum.

In future, of course, it would be better to catch the mistakes before publishing and to be fully transparent about any known issues. But I think this is probably not misconduct, and even if it were, you've already come clean to your advisor and published an updated dataset. Time to forgive yourself and move on.

7
  • We did know about the errors though, so it is misconduct. We just didn't have enough time to fix them, mostly because there were like 10+ authors on it, but the burden of fixing everything was just thrown on me and I couldn't fix it in time, so we just submitted the paper as is. After fixing, the conclusion/study of the paper stays the same, but the numbers do change anywhere from 5-15% for some models that we ran, so it's definitely not trivial. At this point, I just want to ask if I should retract/do a corrigendum. Commented Oct 23 at 14:21
  • 1
    @user034 Does the conference publish corrigenda? If not, posting an updated preprint version can serve as an unofficial correction. Given that the conclusions stand, a retraction is probably overkill. Commented Oct 23 at 14:42
  • No, it's final. Even after updating the dataset, people still keep using the wrong one. I don't want to delete it for reproducibility concerns, but people should know that we have fixed it. If the conclusions don't change, would it be that bad to ask the program chairs to update the paper? Part of me does want to admit that we knew about the error beforehand so that we have complete transparency, but that will probably get everyone in a lot of trouble. Commented Oct 23 at 15:05
  • 1
    Two follow-up questions: (1) did you discuss accuracy in the paper? Like did you say "we estimate that 98% of our labels are correct?" If so, did you know when you published that the actual accuracy was lower than what you stated due to the bug? (2) When people download the dataset now, can they download "v1" to reproduce what the paper said and then "v2" to get the post-bug-fix results? Commented Oct 23 at 15:17
  • (1) No, I didn't change anything from the original paper. I left it the same as the submission, and only found the mistakes just before the camera-ready deadline, at which point I couldn't do everything myself. The accuracy was actually higher, but it's a dataset paper, so we want it to be lower to make the dataset challenging. (2) Yeah, I set it as v1 and v2 and explained what the mistakes were. Commented Oct 23 at 15:26
