Changes between Version 5 and Version 6 of RepresentationFormats


=== 3.2.3. EmotionML ===
The Emotion Markup Language EmotionML is partially specified, at the time of this writing, by the Final Report of the W3C Emotion Markup Language Incubator Group [39]. The report provides elements of a specification, but leaves a number of issues open. The language is now being developed towards a formal W3C Recommendation.

The SEMAINE API is one of the first pieces of software to implement EmotionML. It is our intention to provide an implementation report as input to the W3C standardisation process in due course, highlighting any problems encountered with the current draft specification in the implementation.

EmotionML aims to make concepts from major emotion theories available in a broad range of technological contexts. Being informed by the affective sciences, EmotionML recognises the fact that there is no single agreed representation of affective states, nor a single agreed vocabulary for describing them. Therefore, an emotional state <emotion> can be characterised using four types of descriptions: <category>, <dimensions>, <appraisals> and <action-tendencies>. Furthermore, the vocabulary used can be identified. The EmotionML markup in Figure 4 uses a dimensional representation of emotions, using the dimension set “valence, arousal, potency”, out of which two dimensions are annotated: arousal and valence.
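
To give an impression of the kind of dimensional annotation described, a minimal EmotionML fragment might look as follows. This is a sketch rather than the actual markup of Figure 4: the namespace, the value of the set attribute and the numeric values are illustrative and follow our reading of the current draft, which may still change.

{{{
#!xml
<!-- Sketch of a dimensional emotion annotation (illustrative values) -->
<emotion xmlns="http://www.w3.org/2009/10/emotionml">
  <!-- dimension set "valence, arousal, potency"; only two dimensions are annotated -->
  <dimensions set="valenceArousalPotency">
    <arousal value="0.8"/>
    <valence value="0.2"/>
  </dimensions>
</emotion>
}}}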

EmotionML is aimed at three use cases: (1) human annotation of emotion-related data; (2) automatic emotion recognition; and (3) generation of emotional system behaviour. In order to be suitable for all three domains, EmotionML is conceived as a “plug-in” language that can be used in different contexts. In the SEMAINE API, this plug-in nature is applied with respect to recognition, centrally held information, and generation, where EmotionML is used in conjunction with different markups. EmotionML can be used to represent the user emotion currently estimated from user behaviour, as the payload of an EMMA message. It is also suitable for representing the centrally held information about the user state, the system's “current best guess” of the user state independently of the analysis of current behaviour. Furthermore, the emotion to be expressed by the system can also be represented in EmotionML. In this case, it is necessary to combine EmotionML with the output languages FML, BML and SSML.
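
As an illustration of the plug-in use in the recognition direction, an EmotionML fragment can be embedded as the application-specific payload of an EMMA <emma:interpretation> element. The sketch below uses the EMMA 1.0 namespace and annotation attributes; the EmotionML payload and the confidence value are illustrative and not taken from an actual analyser output.

{{{
#!xml
<!-- Sketch: an analysis result carried as EMMA payload (illustrative values) -->
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="analysis1" emma:mode="voice" emma:confidence="0.4">
    <emotion xmlns="http://www.w3.org/2009/10/emotionml">
      <dimensions set="valenceArousalPotency">
        <arousal value="0.7"/>
      </dimensions>
    </emotion>
  </emma:interpretation>
</emma:emma>
}}}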

=== 3.2.4. SemaineML ===
A number of custom representations are needed to represent the kinds of information that play a role in the SEMAINE demonstrator systems. Currently, this includes the centrally held beliefs about the user state, the agent state, and the dialogue state. Most of the information represented here is domain-specific and does not lend itself to easy generalisation or reuse. Figure 5 shows an example of a dialogue state representation, focused on the specific situation of an agent-user dialogue targeted in the SEMAINE system 1.0 (see Section 4).
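
For concreteness, the kind of dialogue-state markup meant here might look like the following sketch. The root element, the namespace and the attribute names are hypothetical stand-ins for the actual SemaineML schema shown in Figure 5, which may differ; the point is only that turn-taking information is captured in a small custom vocabulary.

{{{
#!xml
<!-- Hypothetical sketch of a custom dialogue-state representation -->
<dialog-state xmlns="http://www.semaine-project.eu/semaineml">
  <!-- in the agent-user dialogue of SEMAINE 1.0, the agent currently holds the turn -->
  <speaker who="agent"/>
  <listener who="user"/>
</dialog-state>
}}}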

The exact list of phenomena that must be encoded in the custom SemaineML representation is evolving as the system becomes more mature. For example, it remains to be seen whether analysis results in terms of user behaviour (such as a smile) can be represented in BML or whether they need to be represented using custom markup.

=== 3.2.5. SSML ===
The Speech Synthesis Markup Language SSML [38] is a well-established W3C Recommendation supported by a range of commercial text-to-speech (TTS) systems. It is the most established of the representation formats described in this section.

The main purpose of SSML is to provide information to a TTS system on how to speak a given text. This includes the possibility to add <emphasis> to certain words, to provide pronunciation hints via a <say-as> tag, to select a <voice> which is to be used for speaking the text, or to request a <break> at a certain point in the text. Furthermore, SSML provides the possibility to set markers via the SSML <mark> tag. Figure 6 shows an example SSML document that could be used as input to a TTS engine. It requests a female US English voice; the word “wanted” should be emphasised, and there should be a pause after “then”.
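
An SSML document along the lines described for Figure 6 might look like the following sketch; the sentence text is invented, but the elements and the namespace are those defined by the SSML 1.0 Recommendation.

{{{
#!xml
<?xml version="1.0"?>
<!-- Sketch of an SSML request: female US English voice, emphasis on "wanted",
     pause after "then" (sentence text is illustrative) -->
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice gender="female">
    I <emphasis>wanted</emphasis> to ask you something, and then <break/>
    I forgot what it was.
  </voice>
</speak>
}}}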

=== 3.2.6. FML ===
The functional markup language FML is still under discussion [41]. Since its functionality is needed nevertheless, a working language FML-APML was created [44] as a combination of the ideas of FML with the former Affective Presentation Markup Language APML [45].

Figure 7 shows an example FML-APML document which contains the key elements. An <fml-apml> document contains a <bml> section in which the <speech> content contains <ssml:mark> markers identifying points in time in a symbolic way. An <fml> section then refers to those points in time to represent the fact, in this case, that an announcement is made and that the speaker herself is being referred to between marks s1:tm2 and s1:tm4. This information can be used, for example, to generate relevant gestures when producing behaviour from the functional descriptions.
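
The following sketch illustrates this structure. It is not the actual Figure 7: the utterance, the identifiers and the attribute names (performative, world, ref_type, ref_id) follow our understanding of FML-APML and may differ from the exact schema in use.

{{{
#!xml
<!-- Sketch of an FML-APML document (identifiers and attributes illustrative) -->
<fml-apml version="0.1">
  <bml id="bml1">
    <speech id="s1" language="en-GB" text="Hi, I'm Poppy.">
      <ssml:mark xmlns:ssml="http://www.w3.org/2001/10/synthesis" name="s1:tm1"/>
      Hi,
      <ssml:mark xmlns:ssml="http://www.w3.org/2001/10/synthesis" name="s1:tm2"/>
      I'm
      <ssml:mark xmlns:ssml="http://www.w3.org/2001/10/synthesis" name="s1:tm3"/>
      Poppy.
      <ssml:mark xmlns:ssml="http://www.w3.org/2001/10/synthesis" name="s1:tm4"/>
    </speech>
  </bml>
  <fml id="fml1">
    <!-- an announcement spanning the whole utterance -->
    <performative id="p1" type="announce" start="s1:tm1" end="s1:tm4"/>
    <!-- the speaker refers to herself between marks s1:tm2 and s1:tm4 -->
    <world id="w1" ref_type="person" ref_id="self" start="s1:tm2" end="s1:tm4"/>
  </fml>
</fml-apml>
}}}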

The representations in the <fml> section are provisional and are likely to change as consensus is formed in the community.

For the conversion from FML to BML, information about pitch accents and boundaries is useful for predicting plausible behaviour time-aligned with the macro-structure of speech. In our current implementation, a speech preprocessor computes this information using TTS technology (see Section 4.2). The information is added to the end of the <speech> section as shown in Figure 8. This is an ad hoc solution which should be reconsidered in the process of specifying FML.
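
As a rough illustration of the kind of annotation meant (not the actual Figure 8), the preprocessor output could be appended to the <speech> element roughly as follows; the element and attribute names (pitchaccent, boundary, time) are hypothetical placeholders for the representation actually used.

{{{
#!xml
<!-- Hypothetical sketch: prosody information appended to the <speech> section -->
<speech id="s1" language="en-GB" text="Hi, I'm Poppy.">
  <!-- ... text and ssml:mark markers as before ... -->
  <pitchaccent id="pa1" start="s1:tm1" end="s1:tm2"/>
  <pitchaccent id="pa2" start="s1:tm3" end="s1:tm4"/>
  <boundary id="b1" time="s1:tm4"/>
</speech>
}}}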

=== 3.2.7. BML ===
The aim of the Behaviour Markup Language BML [40] is to represent the behaviour to be realised by an Embodied Conversational Agent (ECA). BML is at a relatively concrete level of specification, but is still in draft status [36].

A standalone BML document is partly similar to the <bml> section of an FML-APML document (see Figure 7); however, whereas the <bml> section of FML-APML contains only a <speech> tag, a BML document can contain elements representing expressive behaviour of the ECA at a broad range of levels, including <head>, <face>, <gaze>, <body>, <speech> and others. Figure 9 shows an example of gaze and head nod behaviour added to the example of Figure 7.
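
A corresponding standalone BML document might look like the following sketch; as with the FML-APML example above, the utterance and the attribute values are illustrative, and the attribute names for <gaze> and <head> follow our reading of the BML draft rather than a finalised specification.

{{{
#!xml
<!-- Sketch of a standalone BML document with gaze and head nod behaviour -->
<bml id="bml1">
  <speech id="s1" language="en-GB" text="Hi, I'm Poppy.">
    <ssml:mark xmlns:ssml="http://www.w3.org/2001/10/synthesis" name="s1:tm1"/>
    Hi, I'm Poppy.
    <ssml:mark xmlns:ssml="http://www.w3.org/2001/10/synthesis" name="s1:tm2"/>
  </speech>
  <!-- look at the user while speaking -->
  <gaze id="g1" start="s1:tm1" end="s1:tm2" target="user"/>
  <!-- nod during the utterance -->
  <head id="h1" start="s1:tm1" end="s1:tm2" type="NOD"/>
</bml>
}}}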

While creating an audio-visual rendition of the BML document, we use TTS to produce the audio and the timing information needed for lip synchronisation. Although BML in principle provides for a <lip> element for representing this information, we are uncertain how to represent exact timing information with it in a way that preserves the information about syllable structure and stressed syllables. For this reason, we currently use a custom representation based on the MaryXML format from the MARY TTS system [46] to represent the exact timing of speech sounds. Figure 10 shows the timing information for the word “Poppy”, a two-syllable word whose first syllable is stressed.
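
The sketch below gives an impression of this custom timing representation; it is not the actual Figure 10. The element and attribute names approximate MaryXML output as we use it (phone durations d in milliseconds, end times in seconds), and the phone symbols and numbers are made up for illustration.

{{{
#!xml
<!-- Sketch of MaryXML-style timing for the word "Poppy" (illustrative values) -->
<t ph="' p O - p i">
  Poppy
  <!-- first syllable, stressed -->
  <syllable ph="p O" stress="1">
    <ph d="62" end="1.462" p="p"/>
    <ph d="98" end="1.560" p="O"/>
  </syllable>
  <!-- second syllable, unstressed -->
  <syllable ph="p i">
    <ph d="71" end="1.631" p="p"/>
    <ph d="133" end="1.764" p="i"/>
  </syllable>
</t>
}}}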

The custom format we use for representing timing information for lip synchronisation clearly deserves to be revised towards a general BML syntax, as BML evolves.

=== 3.2.8. Player data ===
Player data is currently treated as unparsed data. Audio data is binary, whereas player directives are considered to be plain text. This works well with the current MPEG-4 player we use (see Section 4) but may need to be generalised as other players are integrated into the system.