
Software engineering of prompts

LLMs cannot do most of the things I am interested in “out of the box” at an acceptable standard, so we often try to engineer prompts to enhance performance. Doing this increasingly gives me the feeling that we are doing software engineering in a very unusual programming language. Lots of people have made analogies between prompts and code, but I have not seen much that looks at “prompt engineering” from a software engineering perspective.

Software engineering is about requirements, design, implementation, debugging, testing, and maintenance of software artefacts, and I see analogies to all of these in prompts. For example, we often spend time finding cases where the LLM fails, adding extra rules to the prompt to fix these, and iterating. Unfortunately, sometimes adding a rule leads to the model making new errors. This is very similar to my experience in debugging and testing conventional software artefacts! Below, I will use “system” to refer to both conventional software and prompt+LLM artefacts.
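The find-a-failure, add-a-rule, re-check loop can be sketched as a small regression check. This is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for an LLM (not a real model or API), the test cases are invented, and “passing” is simplified to checking that the output contains an expected phrase.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a "model" is any function from (prompt, input text) to output text.
Model = Callable[[str, str], str]


def run_cases(model: Model, prompt: str, cases: List[Tuple[str, str]]) -> Dict[str, bool]:
    """For each test input, record whether the output contains the expected phrase."""
    return {inp: expected in model(prompt, inp) for inp, expected in cases}


def compare(before: Dict[str, bool], after: Dict[str, bool]) -> Dict[str, List[str]]:
    """List the cases a prompt change fixed and the cases it broke."""
    return {
        "fixed": [c for c in before if not before[c] and after[c]],
        "broken": [c for c in before if before[c] and not after[c]],
    }


def toy_model(prompt: str, text: str) -> str:
    """Hypothetical toy stand-in for an LLM so the sketch runs offline."""
    if "always recommend seeing a doctor" in prompt:
        return "Please see a doctor."
    if "headache" in text:
        return "Rest and drink water."
    return "I am not sure."


# Invented test cases: (input, phrase the output should contain).
cases = [
    ("mild headache", "Rest"),
    ("chest pain", "see a doctor"),
]

prompt_v1 = "You are a health assistant."
# The "extra rule" added to fix the chest-pain failure.
prompt_v2 = prompt_v1 + " If unsure, always recommend seeing a doctor."

report = compare(
    run_cases(toy_model, prompt_v1, cases),
    run_cases(toy_model, prompt_v2, cases),
)
print(report)  # {'fixed': ['chest pain'], 'broken': ['mild headache']}
```

In this toy setup the new rule fixes one case and breaks another, which is the pattern described above. With a real LLM the check would also need repeated runs, since outputs can vary between calls.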

Note that in this blog I am talking about building a system which does important domain tasks (perhaps within a larger workflow) such as giving medical or legal advice. I am not talking about using LLMs as assistants which find information on the web, summarise docs, suggest improvements to writing, etc. I am also focusing on using LLMs to directly do a task, not on asking LLMs to generate code which does a task.

Software lifecycle tasks with prompts

Software engineering distinguishes between several stages and tasks, including those below.

  • Requirements: In many ways this is similar for software and prompt-based systems. However, one difference is that requirements for conventional software need to be explicitly specified, whereas with prompts we can in some cases just provide reference material such as clinical guidelines, or indeed omit details and trust/hope the model fills them in correctly. The problem with the latter is that if the model makes a mistake in inferring requirements, we may build a useless system.
  • Design: Design decisions for prompts (eg, should we use Chain-of-Thought) are very different than for software, and also evolving very rapidly. But this remains a crucial part of both software and prompt engineering!
  • Implementation and debugging: For prompt-based systems, a lot of this seems to involve identifying problems in outputs and updating the prompt (like conventional debugging, I guess). However, modifying a prompt can lead to unpredictable effects, and indeed seemingly trivial changes in prompts can lead to big changes in outputs. Of course this can also happen with software, but I think the problem is bigger with prompts.
  • Testing: Challenging, not least because LLMs are stochastic and give different results on different runs. Even if we set temperature to 0, results may change if we use a closed model which is updated by the vendor. More fundamentally, it is difficult to have confidence that a prompt-based system will robustly handle new scenarios beyond its test/train data, since the underlying LLM is an unpredictable black-box model.
  • Maintenance: Very challenging! A Python or Java program should give the same results regardless of when, where, and how it is run. However, a prompt-based system will give different results when used with different models, and indeed even with the same model at different times (see testing above). Worst of all, models can be deprecated and retired, in which case we will need to re-engineer a prompt for the replacement model.
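The testing and maintenance points above suggest measuring a pass rate over repeated runs, and repeating that measurement for each model the prompt may have to run on. The sketch below is illustrative only: `make_toy_model` builds hypothetical stochastic stand-ins for LLMs, the model names `toy-model-a` and `toy-model-b` are invented, and the substring check is a deliberately crude notion of correctness.

```python
import random
from typing import Callable, Dict, List, Tuple

# Assumption: a "model" is any function from (prompt, input text) to output text.
Model = Callable[[str, str], str]


def pass_rate(model: Model, prompt: str, cases: List[Tuple[str, str]], runs: int = 20) -> float:
    """Fraction of (run, case) pairs whose output contains the expected phrase."""
    passed = 0
    for _ in range(runs):
        for inp, expected in cases:
            if expected in model(prompt, inp):
                passed += 1
    return passed / (runs * len(cases))


def make_toy_model(error_rate: float, seed: int) -> Model:
    """Hypothetical stochastic stand-in for an LLM: fails with probability error_rate."""
    rng = random.Random(seed)

    def model(prompt: str, text: str) -> str:
        if rng.random() < error_rate:
            return "unknown"
        return text.upper()

    return model


# Invented task and cases: the "prompt" asks for upper-casing.
cases = [("abc", "ABC"), ("xyz", "XYZ")]
prompt = "Uppercase the input."

# Two invented "models", e.g. the current one and its replacement.
models: Dict[str, Model] = {
    "toy-model-a": make_toy_model(error_rate=0.0, seed=1),
    "toy-model-b": make_toy_model(error_rate=0.5, seed=1),
}

rates = {name: pass_rate(m, prompt, cases) for name, m in models.items()}
for name, rate in rates.items():
    print(f"{name}: pass rate {rate:.2f}")
```

A single passing run says little about a stochastic system, so a pass rate with an agreed threshold is one possible release criterion. Re-running the same suite whenever the underlying model changes gives at least an early warning that the prompt needs re-engineering, although it cannot show that the system is robust to scenarios outside the test cases.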

Software developer perspective

As a software developer, I am used to feeling “in control” of the system I am developing. I do not get that feeling with prompt-based systems; it seems more like randomly manipulating a super-complex artefact in the hope of getting the right answer. But maybe I am old-fashioned… Certainly I was always taught that as a software dev I should try to build reusable building blocks (eg, software libraries), and then construct systems out of these building blocks. I’m not sure this works with prompts. I do see some talk about “prompt libraries”, but these are usually collections of prompt templates, not building blocks which are assembled into more complex structures.
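To make the “building blocks” idea concrete, here is a minimal sketch of what assembling a prompt from reusable pieces could look like. The `Block` class, the `assemble` function, and the block texts are all hypothetical, invented for illustration rather than taken from any existing prompt library.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    """A hypothetical reusable piece of prompt text with a readable title."""
    title: str
    body: str

    def render(self) -> str:
        return f"## {self.title}\n{self.body.strip()}"


def assemble(blocks: List[Block]) -> str:
    """Join blocks into one prompt, separated by blank lines."""
    return "\n\n".join(b.render() for b in blocks)


# Invented example blocks, shared between two different prompts.
ROLE = Block("Role", "You advise patients on medication questions.")
SAFETY = Block("Safety rules", "If symptoms sound urgent, tell the user to seek emergency care.")
FORMAT = Block("Output format", "Answer in at most three short sentences.")

triage_prompt = assemble([ROLE, SAFETY, FORMAT])
faq_prompt = assemble([ROLE, FORMAT])

print(triage_prompt)
```

The text composes cleanly, but that is the easy part. A function in a software library behaves the same wherever it is called, whereas nothing here guarantees that a block has the same effect on the model once it sits next to other blocks, which may be why this style of reuse is harder to trust with prompts.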

Testing and (especially) maintenance really worry me. Most of the lifecycle cost of a successful software system is maintenance, and I don’t think anyone understands how to maintain prompt-based systems as the underlying models evolve and are deprecated or removed. Certainly the fact that a prompt works well on (retired) GPT4 does not mean it will work well on GPT 5.4.

Of course, I can use AI assistants to help me create software, and this can really help productivity in many cases! But what I am talking about here is using prompted LLMs to directly do a task, not to help me create software to do a task.

Non-developer perspective

Non-developers have a completely different perspective, since prompts allow people who do not know Java, Python, etc to effectively create systems. This is revolutionary! People who in the past had to hire or contract a software dev if they wanted even straightforward functionality can now create it themselves. Which saves a lot of time, hassle, and money.

The above issues would of course be a problem when creating very complex and sophisticated systems which need to be thoroughly tested and then maintained for ten years. But most people I know in this space do not care about maintenance and are happy with limited testing, because even a system which is flawed is still a huge improvement over no system.

Aside: Teaching software engineering of prompts

Speaking as a (soon-to-retire) educator, should we teach prompt engineering from a software engineering perspective? I am not aware of such courses, but I think this could be an interesting and valuable skill to teach both CS and non-CS students. I imagine such a course would

  • Teach the software lifecycle (requirements, etc) from a prompt perspective (as above)
  • Give software-motivated advice about “implementing” prompts, such as carefully structuring and formatting them to make them readable to other humans.
  • Train people to “debug” and fix prompts

Good luck to anyone who wants to deliver such a course!

Summary

Prompts are very different from Python or Java code, but many of the fundamentals of software engineering also apply to creating prompt-based systems!
