Anya Belz and I will soon start a project, ReproHum (EPSRC EP/V05645X/1), whose goal is to develop a framework for assessing the reproducibility of human evaluations of NLP systems, and ultimately to make such evaluations more reproducible. I think this is a really exciting and important project; I’m a firm believer in human evaluations, but they need to be replicable! We will shortly be advertising for a research fellow to work on this project at Aberdeen, and I encourage anyone who is interested or wants more information to contact Anya (firstname.lastname@example.org) and me (email@example.com).
Scientific evaluations need to be reproducible. There has been much discussion over the past few years about reproducing metric-based evaluations of NLP systems, but far less about reproducing human evaluations. These pose additional challenges; for example, people differ, so rerunning an experiment with different human subjects is unlikely to give exactly the same outcome.
The goal of ReproHum is to develop
- a methodological framework for assessing reproducibility in human evaluations of NLP systems
- a good understanding of reproducibility of current human evaluations of NLP systems
- a set of recommendations and guidelines for enhancing reproducibility
ReproHum builds on a shared task on reproducing evaluations of NLG systems, which will be presented at INLG 2021, as well as a systematic review of reproducibility which we presented at EACL 2021. In addition to extending these activities, ReproHum will conduct a structured multi-lab study where 20 partner labs reproduce a selected set of existing human evaluations. The multi-lab study will give us solid empirical data on reproducibility, which is the key to achieving our goals.
ReproHum will employ a Research Fellow for 18 months at Aberdeen University. The RF will be at the heart of the project, working with Anya and me on literature reviews, surveys of current practice, the development of a methodological framework, and the running of additional shared tasks on reproducibility (like the one at INLG 2021). The RF will also be in charge of organising the multi-lab study mentioned above.
We are looking for someone with a PhD in NLP, extensive experience with human evaluations, strong links with the research community, and an excellent publication record for their career stage. The RF will need organisational skills as well as research skills.
We expect the project to start in late 2021 or perhaps the beginning of 2022.
This position is currently advertised internally at Aberdeen University; the university requires internal advertisement for two weeks before a position is advertised openly. We expect the open advertisement to go out in late September, and it will include formal instructions for applying for the post. Anya and I are very happy to chat informally with people now, before the position is openly advertised.
The closing date will be in mid-October.
Salary will be on the standard scale for post-doctoral researchers at Aberdeen University, which is between £34,304 and £40,927 per year, depending on experience and background.
I am a strong believer in human evaluations of NLP systems. But human evaluations must be high-quality and replicable; low-quality studies which cannot be replicated are not useful! ReproHum will help the NLP field determine how to carry out reproducible human evaluations, and I expect its findings will have a major influence on future human evaluations in NLP.
So if you are passionate about human evaluation and want it to be done better, consider applying for this position; it is a great opportunity to “make a difference” in how NLP evaluation is done!