For the past 5 years I have taught a course on Evaluation of AI (CS5063) to Aberdeen MSc students. People occasionally ask me about the course, so I thought I’d give a summary here.
CS5063 covers the following topics, which I elaborate below.
- AI engineering and quality assurance
- Human evaluation
- Metric (automatic) evaluation
- Basic statistics for evaluation
- Current research on evaluation
- Commercial perspective on evaluation
I’d also like to say that I’m very grateful to James Forrest for his help in running practical exercises and tutorials in CS5063!
AI Engineering and Quality Assurance
I start CS5063 by talking about software engineering of AI systems, and the challenges this raises, especially in requirements analysis (since clients often have no clue what AI systems can do) and dataset design. I point out that most of the problems I have seen with AI systems are due to getting the requirements wrong, inappropriate datasets, and/or lack of robustness when the world changes (domain shift). *None* of these problems are detected by evaluation on held-out test sets, so we need to go beyond these if we want to evaluate utility in real-world settings.
I then review software quality assurance and, for their first assessment, ask the students to choose an AI system, create 50 test cases (in the software-testing sense) of varying difficulty for that system, run them, and report the results. I’m very pleased with this exercise; most students take it seriously and do a good job. They usually find that the AI systems can do very well on some hard test cases while still failing a few easy ones.
My favourite example this year was from a student who looked at Grammarly. Grammarly did a good job of finding mistakes in complex sentences which used specialist terminology. But then it failed on the following “easy” test case:
- Input: I typed it in correctly.
- Expected output: I typed it in correctly. (no change)
- Grammarly’s suggested revision: I typed it incorrectly. (“in correctly” -> “incorrectly”)
In other words, Grammarly wanted to rewrite a correct sentence into an alternate form which meant the opposite of the original sentence!
Most of the systems exhibited similar behaviour, for example several students investigated object recognition systems and found that they did very well in some difficult cases (complex scenes, suboptimal lighting) and then failed to recognise a clear photo of an object in a simple scene with good lighting. Perhaps this kind of behaviour is intrinsic to neural AI approaches?
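Much of the value of this exercise comes from recording test cases explicitly: input, expected output, difficulty. A minimal sketch in Python of that record-keeping, where `check_grammar` is an invented placeholder standing in for whatever system a student is testing (here it deliberately reproduces the Grammarly failure described above):

```python
# Minimal sketch of the test-case exercise: each case records an input,
# an expected output, and a difficulty label, then we tally pass/fail.

def check_grammar(sentence):
    # Placeholder "system under test": naively joins "in correctly",
    # mimicking the Grammarly failure discussed in the text.
    return sentence.replace("in correctly", "incorrectly")

test_cases = [
    {"input": "I typed it in correctly.",
     "expected": "I typed it in correctly.",  # no change expected
     "difficulty": "easy"},
    {"input": "Their going to the store.",
     "expected": "They're going to the store.",
     "difficulty": "easy"},
]

results = []
for case in test_cases:
    actual = check_grammar(case["input"])
    results.append({**case, "actual": actual,
                    "passed": actual == case["expected"]})

for r in results:
    status = "PASS" if r["passed"] else "FAIL"
    print(f"[{r['difficulty']}] {status}: {r['input']!r}")
```

In the real exercise the system under test is an external tool (Grammarly, an object recogniser, etc.), so the “run” step is manual; the point is the structured record-keeping, not the automation.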
Human Evaluation
We then look at human evaluations of AI systems. The focus is on good experimental design and techniques for doing relatively simple and straightforward human evaluations; we also discuss research ethics and ethical approval. Students do some simple human evaluations in lab sessions, for example running an experiment where they ask their classmates to compare the output of two machine translation systems on a text (and language pair) of their choice.
I keep this simple because human evaluation is new to most of the students; they have all done metric evaluation in a preceding course on ML, and most will have done software engineering and testing as undergrads, but relatively few will have previously done structured evaluations with human subjects.
In their second assessment, students work in groups to run a human evaluation where they compare two AI systems of their choice (comparing virtual assistants such as Alexa and Siri on a particular use case is a popular choice). I think this is a good learning experience and some students do this very well. However, others struggle, which shows that learning how to do even a simple human evaluation is not easy!
Unlike the test case assessment, the results of the human evaluations are usually what the students expect. Perhaps this is because they choose AI systems (and use cases) which they are familiar with, so they have good expectations about how well the systems will perform.
Basic Statistics for Evaluation
I try to teach the students basic statistics using R; this is combined with the human evaluation part of the course. I focus on basic tests: the t-test, chi-square, and Pearson correlation. In all honesty, I find this part of the course frustrating. Even after listening to my lectures and doing my lab exercises, many students still struggle to do basic stats.
I wonder if I should drop R (which most of the students don’t know) and instead do stats in Excel (which the students do know). I am reluctant to drop R because Excel supports only a limited range of statistics, whereas R supports a huge variety of statistical tests; i.e., if students learn R they have much more power available to them. But on the other hand, maybe the students would learn statistical concepts better if they were not simultaneously learning a new scripting language? I don’t know … any advice from others on teaching basic stats is welcome!
Students must compute relevant statistics as part of the human evaluation assessment mentioned above.
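To give a flavour of what the students compute, here is a pure-Python sketch of two of these tests, Welch’s two-sample t statistic and Pearson correlation, worked from first principles on invented ratings (in the course itself we use R’s built-in `t.test` and `cor.test`):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    # Sample variance (n - 1 denominator)
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def t_statistic(a, b):
    # Welch's two-sample t statistic (no equal-variance assumption)
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a)
                                           + variance(b) / len(b))

def pearson_r(x, y):
    # Pearson correlation: covariance scaled by the standard deviations
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Invented 1-5 ratings of two hypothetical systems on the same 8 items
system_a = [4, 5, 3, 4, 4, 5, 3, 4]
system_b = [3, 3, 4, 2, 3, 4, 3, 2]
print(round(t_statistic(system_a, system_b), 3))
print(round(pearson_r(system_a, system_b), 3))
```

Writing the formulas out by hand like this is one way to separate the statistical concepts from the scripting language; the ratings above are made up for illustration, not real evaluation data.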
Metric (Automatic) Evaluation
As mentioned, the students have done recall/precision/accuracy evaluations on held-out test sets in a previous course, so in this portion of CS5063 we look at more complex metrics (including NLP metrics such as BLEU) and discuss issues such as validation against gold-standard human evaluations, average case vs worst case, appropriate baselines, and detecting bias.
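For readers unfamiliar with BLEU, here is a deliberately simplified sentence-level sketch: real BLEU uses n-grams up to 4, multiple references, and smoothing, whereas this toy version uses only unigrams and bigrams against a single reference.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    # Clip each candidate n-gram count by its count in the reference,
    # so repeating a correct word cannot inflate the score.
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

def bleu(candidate, reference, max_n=2):
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions
    geo = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty punishes candidates shorter than the reference
    bp = (1.0 if len(candidate) >= len(reference)
          else math.exp(1 - len(reference) / len(candidate)))
    return bp * geo

ref = "the cat sat on the mat".split()
cand = "the cat sat on a mat".split()
print(round(bleu(cand, ref), 3))
```

Even this stripped-down version illustrates why metric validity matters: BLEU rewards n-gram overlap with a reference, which is only a proxy for translation quality and needs validating against human judgements.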
In some years I have asked students to do an assessment where they pick a published AI paper (often from IJCAI) and write a report assessing the quality of the evaluation in the paper (which was usually based on metrics). I wrote about this and gave an example in a previous blog. It’s a good assessment and I think students learn a lot from it, especially when they assess a well-known paper and find that its evaluation is deeply flawed. However, it is a lot of work to mark because I need to understand the papers the students are analysing as well as their reports, so I’ve stopped doing it.
I think this is a really valuable exercise, and I encourage people interested in evaluation to do it! Just don’t expect me to mark it …
Current Research and Commercial Evaluation
I end CS5063 by discussing current research on evaluation at Aberdeen, and also giving a commercial perspective on evaluation. The first part (current research) changes every year; this year I discussed our research on reproducibility of human evaluations, evaluating real-world utility, and evaluating accuracy.
In the second part (commercial perspective), I discuss things that companies care a lot about but academics tend to ignore, such as risk, profitability, maintenance, change management (e.g., impact on workflow), and user/client experience. This section is a bit frustrating because I have great material which I cannot share with students because it is commercially confidential, but I try to get the basic issues across anyway.
When I first taught CS5063 in 2017, I co-taught it with a fellow academic, Nigel Beacham, who said we should make a public version of the course available in some fashion. We didn’t do this at the time, and Nigel moved on to teach other things, but I’ve recently begun to wonder about this again. There is growing interest in evaluation in the research community, which is great, but much (most?) of it is still focused on metrics for evaluating against held-out test sets, which is a very narrow perspective on evaluation.
In all honesty I don’t know if I have the time/energy to formally teach a public version of this course, but I could try to make some of the course material available to interested students and academics. If anyone reading this is interested, please let me know what material would be most helpful to you.