What is the most challenging aspect of surface realisation, especially across multiple languages? This question has come up a few times over the past month, in different contexts.
I assume here that “surface realisation” means producing a surface form from a set of content words and grammatical roles and relations (i.e., the SimpleNLG perspective). For example, producing “John loves Mary” from a specification such as {Verb:love Subject:John Object:Mary} (probably represented as a graph, with features as well as content words).
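To make that definition concrete, here is a minimal Python sketch of a toy realiser for the specification above. The `Clause` class and the `third_person_singular` and `realise` functions are hypothetical names invented for illustration, not the SimpleNLG API, and the sketch assumes a third-person singular subject, present tense, and a regular verb.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    """Hypothetical input spec: content words plus grammatical roles."""
    verb: str      # base form, e.g. "love"
    subject: str
    obj: str


def third_person_singular(verb: str) -> str:
    """Naive present-tense inflection; handles regular verbs only."""
    if verb.endswith(("s", "sh", "ch", "x", "z", "o")):
        return verb + "es"
    if len(verb) > 1 and verb.endswith("y") and verb[-2] not in "aeiou":
        return verb[:-1] + "ies"
    return verb + "s"


def realise(clause: Clause) -> str:
    """Order the words (subject, verb, object) and inflect the verb.

    Assumes the subject is third-person singular.
    """
    return f"{clause.subject} {third_person_singular(clause.verb)} {clause.obj}"


print(realise(Clause(verb="love", subject="John", obj="Mary")))  # John loves Mary
```

Everything in this sketch is driven by grammatical roles alone; nothing in it looks at word senses or context, which is where the difficulties discussed below come in.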
I suspect different people will say different things, and I’d love to hear what other people think. But anyways, in my personal opinion, the biggest challenges are:
- where syntax depends on semantics and pragmatics
- idiosyncratic stuff in individual languages
Syntax and Semantics (or Pragmatics)
Sometimes, word order or inflection depends on the meaning we are trying to communicate (semantics), or on context (pragmatics). A few examples (mostly focusing on English, since this is what I know best):
English adjective ordering: The order of adjectives depends on semantics. For example, we say “big red apple“, not “red big apple” (size before colour).
English bare infinitive: We say “I see John eat an apple” but “I see John thinks he is smart“. In other words, the inflection of the subordinate verb (“eat” and “thinks“) depends on the word sense of the main verb “see” (whether it indicates visual perception or non-visual understanding).
Measurements: We say “There are 4 feet in the cupboard” when referring to physical objects (e.g., in response to “Do we have any extra chair feet?”) but “There is 4 feet in the cupboard” when referring to a measurement (e.g., in response to “Do we have any rope?“).
Adjective placement (French): Whether an adjective goes before or after the head noun can depend on the desired word sense. “un ami cher” is not the same as “un cher ami”.
Classifiers (Mandarin): I do not have any knowledge of Mandarin, but I gather from my colleagues that semantic/pragmatic factors impact the choice of classifiers.
Anyways, the above phenomena are difficult to handle if you assume that the task of a surface realiser is to order and inflect a set of content words based on grammatical roles and functions, and insert appropriate function words, without worrying about word senses and other semantic/pragmatic issues.
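The English adjective-ordering example shows why the realiser needs semantic information. Below is a rough Python sketch that orders adjectives by semantic class. The class ranking and the tiny lexicon are assumptions made up for illustration (only "size before colour" comes from the example above), and `order_adjectives` and `noun_phrase` are hypothetical names.

```python
# Assumed ranking of semantic classes; lower rank means earlier in the phrase.
SEMANTIC_RANK = {"opinion": 0, "size": 1, "age": 2, "colour": 3}

# Hypothetical lexicon mapping each adjective to its semantic class.
ADJECTIVE_CLASS = {
    "lovely": "opinion",
    "big": "size",
    "small": "size",
    "old": "age",
    "red": "colour",
    "green": "colour",
}


def order_adjectives(adjectives):
    """Sort adjectives by the rank of their semantic class (stable sort)."""
    return sorted(adjectives, key=lambda adj: SEMANTIC_RANK[ADJECTIVE_CLASS[adj]])


def noun_phrase(adjectives, noun):
    """Realise a simple noun phrase with correctly ordered adjectives."""
    return " ".join(order_adjectives(adjectives) + [noun])


print(noun_phrase(["red", "big"], "apple"))  # big red apple
```

The ordering logic itself is trivial; the hard part is that the realiser can only apply it if every adjective comes with a semantic class, which a purely syntactic input specification does not provide.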
I suspect that statistical corpus-based approaches could get the above right in 90% of cases, which may be good enough for most applications. But we won’t get 100% unless we explicitly model how semantics and pragmatics affect syntax in our surface realiser; e.g., I suspect a statistical approach would struggle with “There is/are 4 feet in the cupboard“.
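As an illustration of what explicit modelling could look like for the “4 feet” case, here is a small Python sketch in which verb agreement consults a semantic flag as well as grammatical number. The `NounPhrase` class and its `is_measurement` feature are hypothetical; a real realiser would need some upstream component to set that feature.

```python
from dataclasses import dataclass


@dataclass
class NounPhrase:
    text: str                     # already-inflected surface string, e.g. "4 feet"
    plural: bool                  # grammatical number
    is_measurement: bool = False  # hypothetical semantic feature


def copula(np: NounPhrase) -> str:
    """Pick "is" or "are"; a measurement is treated as a singular quantity."""
    if np.is_measurement:
        return "is"
    return "are" if np.plural else "is"


def existential(np: NounPhrase, location: str) -> str:
    """Realise a "There is/are ..." sentence."""
    return f"There {copula(np)} {np.text} {location}"


chair_feet = NounPhrase("4 feet", plural=True)
rope = NounPhrase("4 feet", plural=True, is_measurement=True)
print(existential(chair_feet, "in the cupboard"))  # There are 4 feet in the cupboard
print(existential(rope, "in the cupboard"))        # There is 4 feet in the cupboard
```

The two noun phrases have identical surface strings and identical grammatical number; only the semantic flag distinguishes them.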
Idiosyncrasies of Specific Languages
Every language of course is unique, and has its own “idiosyncrasies”. From an engineering perspective, this means that if we have covered N languages and want to cover language N+1, the biggest cost may be the unique aspects of language N+1.
To take one example I was looking at recently, German uses compound nouns, which makes inflection more challenging. In English, it’s easy to pluralise “aircraft door” into “aircraft doors“: we just have to recognise that “door” is the head and replace it with its plural “doors“. But if I want to pluralise “Flugzeugtür” into “Flugzeugtüren”, I need to first parse “Flugzeugtür” into “Flugzeug” and “Tür”. Which is doable, but it requires adding completely new functionality which is not needed in many other languages.
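A minimal Python sketch of that extra step follows, assuming a small plural lexicon is available. The lexicon contents and the `pluralise_german` function are hypothetical, and the longest-suffix match is a simplification of real compound splitting.

```python
# Hypothetical lexicon of simple (non-compound) nouns and their plurals.
PLURAL_LEXICON = {
    "Tür": "Türen",
    "Flugzeug": "Flugzeuge",
    "Haus": "Häuser",
}


def pluralise_german(noun: str) -> str:
    """Pluralise a noun, splitting off the head if the noun is a compound.

    Tries the longest suffix first and pluralises the first suffix found
    in the lexicon, leaving the modifier part unchanged.
    """
    if noun in PLURAL_LEXICON:
        return PLURAL_LEXICON[noun]
    for split in range(1, len(noun)):
        modifier = noun[:split]
        head = noun[split:].capitalize()  # "tür" -> "Tür" for lexicon lookup
        if head in PLURAL_LEXICON:
            return modifier + PLURAL_LEXICON[head].lower()
    raise ValueError(f"no known head noun found in {noun!r}")


print(pluralise_german("Flugzeugtür"))  # Flugzeugtüren
```

Even this toy version needs a lexicon and a search over split points, neither of which the English “aircraft doors” case requires.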
Similarly, I can imagine that someone who had built a realiser in German and wanted to add English to it might end up cursing the a/an “idiosyncrasy” of English. Again, we can deal with “a/an”, but it requires adding functionality to the realiser which is not needed for many other languages.
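For comparison, a rough Python sketch of a/an selection might look like the following. The choice really depends on the initial sound of the next word, so this spelling-based version relies on hand-written exception sets. Those sets and the `indefinite_article` function are illustrative assumptions and far from complete.

```python
# Hypothetical exception sets; a real system would need pronunciation data.
AN_EXCEPTIONS = {"hour", "honest", "heir"}                 # silent "h"
A_EXCEPTIONS = {"university", "european", "one", "user"}   # consonant sound


def indefinite_article(word: str) -> str:
    """Choose "a" or "an" for a non-empty following word."""
    w = word.lower()
    if w in AN_EXCEPTIONS:
        return "an"
    if w in A_EXCEPTIONS:
        return "a"
    return "an" if w[0] in "aeiou" else "a"


for word in ["apple", "door", "hour", "university"]:
    print(indefinite_article(word), word)
# an apple
# a door
# an hour
# a university
```

Note that the article depends on whatever word ends up immediately after it, so this check has to run after word ordering has been decided.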