Version 3 (modified by masc01, 10 years ago)


Standard and pre-standard representation formats in the SEMAINE system

In view of future interoperability and reuse of components, the SEMAINE API aims to use standard representation formats where that seems possible and reasonable. For example, results of analysis components can be represented using EMMA (Extensible Multi-Modal Annotation), a World Wide Web Consortium (W3C) Recommendation. Input to a speech synthesiser can be represented using SSML (Speech Synthesis Markup Language), also a W3C Recommendation. Several other relevant representation formats are not yet standardised, but are in the process of being specified. These include the Emotion Markup Language EmotionML, used for representing emotions and related states in a broad range of contexts, and the Behaviour Markup Language BML, which describes the behaviour to be shown by an Embodied Conversational Agent (ECA). Furthermore, a Functional Markup Language FML is under discussion, intended to represent the planned actions of an ECA on the level of functions and meanings. By implementing draft versions of these specifications, the SEMAINE API can provide hands-on input to the standardisation process, which may contribute to better standard formats.

On the other hand, it seems difficult to define a standard format for representing the concepts inherent in a given application's logic. To be generic, such an endeavour would ultimately require an ontology of the world. In the current SEMAINE system, which does not aim at any sophisticated reasoning over domain knowledge, a simple custom format named SemaineML is used to represent those pieces of information that are required in the system but which cannot be adequately represented in an existing or emerging standard format. It is conceivable that other applications built on top of the SEMAINE API may want to use a more sophisticated representation such as the Resource Description Framework (RDF) to represent domain knowledge, in which case the API could be extended accordingly.

Whereas all of the aforementioned representation formats are based on the Extensible Markup Language (XML), a number of data types are more naturally represented in other formats. This is particularly the case for representations of data close to the input and output components. At the input end, low-level analyses of human behaviour are often represented as feature vectors. At the output end, the input to a player component is likely to include binary audio data or player-specific rendering directives.

The following table gives an overview of the representation formats currently supported in the SEMAINE API.

Type of data                | Representation format             | Standardisation status
----------------------------|-----------------------------------|-----------------------
Low-level input features    | string or binary feature vectors  | ad hoc
Analysis results            | EMMA                              | W3C Recommendation
Emotions and related states | EmotionML                         | W3C Working Draft
Domain knowledge            | SemaineML                         | ad hoc
Speech synthesis input      | SSML                              | W3C Recommendation
Functional action plan      | FML                               | very preliminary
Behavioural action plan     | BML                               | draft specification
Low-level output data       | binary audio, player commands     | player-dependent

The following sub-sections briefly describe the individual representation formats.

3.2.1. Feature vectors

Feature vectors can be represented in an ad hoc format. In text form (see Figure 3), a feature vector consists of straightforward key-value pairs – one feature per line, with the value preceding the feature name.

As feature vectors may be sent very frequently (e.g., every 10 ms in the SEMAINE system 1.0), a compact representation matters. For this reason, a binary representation of feature vectors is also available. In binary form, the feature names are omitted and only the feature values are communicated. The first four bytes represent an integer containing the number of features in the vector; the remaining bytes contain the float values, one after the other.
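As a minimal sketch, the binary layout just described (a four-byte feature count followed by one four-byte float per feature) could be read and written as follows. The byte order is an assumption here, as this section does not specify it; the actual SEMAINE wire format may differ.

```python
import struct

def encode_features(values):
    """Pack a feature vector as a 4-byte integer count followed by
    one 32-bit float per value. Big-endian byte order is assumed."""
    return struct.pack(">i%df" % len(values), len(values), *values)

def decode_features(data):
    """Inverse of encode_features: read the count, then the floats."""
    (n,) = struct.unpack_from(">i", data, 0)
    return list(struct.unpack_from(">%df" % n, data, 4))
```

A vector such as `[0.5, 1.25, -2.0]` round-trips exactly, since these values are representable as 32-bit floats; arbitrary values would round-trip only to single precision.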

3.2.2. EMMA

The Extensible Multimodal Annotation Language EMMA, a W3C Recommendation, is “an XML markup language for containing and annotating the interpretation of user input” [37]. As such, it is a wrapper language that can carry various kinds of payload representing the interpretation of user input. At its core, the EMMA language provides the <emma:interpretation> element, containing all information about a single interpretation of user behaviour. Several such elements can be enclosed within an <emma:one-of> element in cases where more than one interpretation is present. An interpretation can have an emma:confidence attribute, indicating how confident the source of the annotation is that the interpretation is correct; time-related information such as emma:start, emma:end, and emma:duration, indicating the time span for which the interpretation is provided; information about the modality upon which the interpretation is based, through the emma:medium and emma:mode attributes; and many more.

Figure 4 shows an example EMMA document carrying an interpretation of user behaviour represented using EmotionML (see below). The interpretation refers to a start time. It can be seen that the EMMA wrapper elements and the EmotionML content are in different XML namespaces, so that it is unambiguously determined which element belongs to which part of the annotation.
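Since Figure 4 is not reproduced here, the following sketch builds a comparable document with Python's standard XML tooling. The EMMA namespace is the one defined by the W3C Recommendation; the EmotionML namespace, the dimension-set name, and the element names inside the emotion annotation follow the draft specification and are assumptions that may change as the standard evolves.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"        # per the EMMA Recommendation
EMO_NS = "http://www.w3.org/2009/10/emotionml"    # draft namespace, an assumption

def make_interpretation(arousal, valence, start_ms):
    """Wrap a dimensional EmotionML annotation in an EMMA interpretation,
    keeping wrapper and payload in distinct XML namespaces."""
    emma = ET.Element("{%s}emma" % EMMA_NS, {"version": "1.0"})
    interp = ET.SubElement(emma, "{%s}interpretation" % EMMA_NS,
                           {"{%s}start" % EMMA_NS: str(start_ms)})
    emotion = ET.SubElement(interp, "{%s}emotion" % EMO_NS)
    # Dimension-set identifier is hypothetical, modelled on the draft.
    dims = ET.SubElement(emotion, "{%s}dimensions" % EMO_NS,
                         {"set": "valenceArousalPotency"})
    ET.SubElement(dims, "{%s}arousal" % EMO_NS, {"value": str(arousal)})
    ET.SubElement(dims, "{%s}valence" % EMO_NS, {"value": str(valence)})
    return emma
```

Because each element carries its own namespace, a consumer can unambiguously separate the EMMA wrapper from the EmotionML payload, exactly as the text above describes.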

EMMA can also be used to represent Automatic Speech Recognition (ASR) output, either as the single most probable word chain or as a word lattice, using the <emma:lattice> element.

3.2.3. EmotionML

The Emotion Markup Language EmotionML is partially specified, at the time of this writing, by the Final Report of the W3C Emotion Markup Language Incubator Group [39]. The report provides elements of a specification, but leaves a number of issues open. The language is now being developed towards a formal W3C Recommendation.

The SEMAINE API is one of the first pieces of software to implement EmotionML. It is our intention to provide an implementation report as input to the W3C standardisation process in due course, highlighting any problems encountered with the current draft specification in the implementation.

EmotionML aims to make concepts from major emotion theories available in a broad range of technological contexts. Being informed by the affective sciences, EmotionML recognises the fact that there is no single agreed representation of affective states, nor of vocabularies to use. Therefore, an emotional state <emotion> can be characterised using four types of descriptions: <category>, <dimensions>, <appraisals> and <action-tendencies>. Furthermore, the vocabulary used can be identified. The EmotionML markup in Figure 4 uses a dimensional representation of emotions, using the dimension set “valence, arousal, potency”, out of which two dimensions are annotated: arousal and valence.

EmotionML is aimed at three use cases: 1. human annotation of emotion-related data; 2. automatic emotion recognition; and 3. generation of emotional system behaviour. In order to be suitable for all three domains, EmotionML is conceived as a “plug-in” language that can be used in different contexts. In the SEMAINE API, this plug-in nature is applied with respect to recognition, centrally held information, and generation, where EmotionML is used in conjunction with different markup languages. EmotionML can be used for representing the user emotion currently estimated from user behaviour, as payload to an EMMA message. It is also suitable for representing the centrally held information about the user state, the system's “current best guess” of the user state independently of the analysis of current behaviour. Furthermore, the emotion to be expressed by the system can also be represented by EmotionML. In this case, it is necessary to combine EmotionML with the output languages FML, BML and SSML.

3.2.4. SemaineML

A number of custom representations are needed to represent the kinds of information that play a role in the SEMAINE demonstrator systems. Currently, this includes the centrally held beliefs about the user state, the agent state, and the dialogue state. Most of the information represented here is domain-specific and does not lend itself to easy generalisation or reuse. Figure 5 shows an example of a dialogue state representation, focused on the specific situation of an agent-user dialogue targeted in the SEMAINE system 1.0 (see Section 4).

The exact list of phenomena that must be encoded in the custom SemaineML representation is evolving as the system becomes more mature. For example, it remains to be seen whether analysis results in terms of user behaviour (such as a smile) can be represented in BML or whether they need to be represented using custom markup.

3.2.5. SSML

The Speech Synthesis Markup Language SSML [38] is a well-established W3C Recommendation supported by a range of commercial text-to-speech (TTS) systems. It is the most mature of the representation formats described in this section.

The main purpose of SSML is to provide information to a TTS system on how to speak a given text. This includes the possibility to add <emphasis> on certain words, to provide pronunciation hints via a <say-as> tag, to select a <voice> which is to be used for speaking the text, or to request a <break> at a certain point in the text. Furthermore, SSML provides the possibility to set markers via the SSML <mark> tag. Figure 6 shows an example SSML document that could be used as input to a TTS engine. It requests a female US English voice; the word “wanted” should be emphasised, and there should be a pause after “then”.
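A document with the requests just described (a female US English voice, emphasis on one word, a pause) can be sketched as follows. The utterance text is hypothetical, since Figure 6 is not reproduced here; the element and attribute names are those of the SSML 1.0 Recommendation.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"

def build_ssml():
    """Build a small SSML document: female en-US voice, one emphasised
    word, and a break after 'then'. The sentence itself is made up."""
    speak = ET.Element("{%s}speak" % SSML_NS,
                       {"version": "1.0", "{%s}lang" % XML_NS: "en-US"})
    voice = ET.SubElement(speak, "{%s}voice" % SSML_NS, {"gender": "female"})
    voice.text = "And you "
    emph = ET.SubElement(voice, "{%s}emphasis" % SSML_NS)
    emph.text = "wanted"
    emph.tail = " to go, and then "
    brk = ET.SubElement(voice, "{%s}break" % SSML_NS)
    brk.tail = " you changed your mind."
    return speak
```

Serialising the returned element with `ET.tostring()` yields a document that a conforming TTS engine should accept as input.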

3.2.6. FML

The Functional Markup Language FML is still under discussion [41]. Since its functionality is nevertheless needed, a working language, FML-APML, was created [44] as a combination of the ideas of FML with the former Affective Presentation Markup Language APML [45].

Figure 7 shows an example FML-APML document which contains the key elements. An <fml-apml> document contains a <bml> section in which the <speech> content contains <ssml:mark> markers identifying points in time in a symbolic way. An <fml> section then refers to those points in time to represent the fact, in this case, that an announcement is made and that the speaker herself is being referred to between marks s1:tm2 and s1:tm4. This information can be used, for example, to generate relevant gestures when producing behaviour from the functional descriptions.

The representations in the<fml>section are provisional and are likely to change as consensus is formed in the community.
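Because the <fml> content is provisional, the sketch below deliberately shows only the stable part of the structure: a schematic <fml-apml> skeleton (with hypothetical utterance text) and a helper that collects the symbolic mark names from the <speech> section, which is the first step a behaviour generator would take before resolving them to concrete times.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"

# Schematic FML-APML skeleton; only the mark structure is shown,
# and the utterance text is hypothetical.
DOC = """
<fml-apml xmlns:ssml="http://www.w3.org/2001/10/synthesis">
  <bml>
    <speech id="s1">
      <ssml:mark name="s1:tm1"/>Hello, <ssml:mark name="s1:tm2"/>I
      <ssml:mark name="s1:tm3"/>have an announcement.<ssml:mark name="s1:tm4"/>
    </speech>
  </bml>
</fml-apml>
"""

def mark_names(doc_text):
    """Return the symbolic time-mark names, in document order."""
    root = ET.fromstring(doc_text)
    return [m.get("name") for m in root.iter("{%s}mark" % SSML_NS)]
```

A functional annotation spanning marks s1:tm2 to s1:tm4 can then be aligned with the words enclosed by those marks.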

For the conversion from FML to BML, information about pitch accents and boundaries is useful for the prediction of plausible behaviour time-aligned with the macro-structure of speech. In our current implementation, a speech preprocessor computes this information using TTS technology (see Section 4.2). The information is added to the end of the<speech>section as shown in Figure 8. This is an ad hoc solution which should be reconsidered in the process of specifying FML.

3.2.7. BML

The aim of the Behaviour Markup Language BML [40] is to represent the behaviour to be realised by an Embodied Conversational Agent. BML is at a relatively concrete level of specification, but is still in draft status [36].

A standalone BML document is partly similar to the <bml> section of an FML-APML document (see Figure 7); however, whereas the <bml> section of FML-APML contains only a <speech> tag, a BML document can contain elements representing expressive behaviour in the ECA at a broad range of levels, including <head>, <face>, <gaze>, <body>, <speech> and others. Figure 9 shows an example of gaze and head nod behaviour added to the example of Figure 7.

While creating an audio-visual rendition of the BML document, we use TTS to produce the audio and the timing information needed for lip synchronisation. Although BML in principle provides a <lip> element for representing this information, we are uncertain how to represent exact timing information with it in a way that preserves the information about syllable structure and stressed syllables. For this reason, we currently use a custom representation based on the MaryXML format from the MARY TTS system [46] to represent the exact timing of speech sounds. Figure 10 shows the timing information for the word “Poppy”, a two-syllable word whose first syllable is stressed.

The custom format we use for representing timing information for lip synchronisation clearly deserves to be revised towards a general BML syntax, as BML evolves.

3.2.8. Player data

Player data is currently treated as unparsed content. Audio data is binary, whereas player directives are considered plain text. This works well with the current MPEG-4 player we use (see Section 4), but may need to be generalised as other players are integrated into the system.