SEMAINE Component Architecture

This page documents how the SEMAINE components are organised into a system. Starting from a conceptual message flow graph, we explain which components exist and via which JMS Topics they communicate.

Conceptually, the message flow graph in SEMAINE looks as follows.

The following subsections describe the three main parts of the system in more detail: analysis of user behaviour, dialogue management, and generation of agent behaviour.

Analysis of user behaviour

User behaviour is first represented in terms of low-level audio and video features, then in terms of individual analysis results, which are finally fused before the dialogue model's “current best guess” of the user state is updated.

More details about the analysis of user behaviour can be found in SEMAINE deliverable reports D2b and D3c.

Feature extractors

Feature extractors are modality-specific. They produce feature vectors as key-value pairs, typically at a fixed frame rate (e.g., every 10 ms for audio features, and once per video frame for video features). The following Topics are currently used.

Audio features:

* F0frequency [0, 600] (fundamental frequency in Hz)
* voiceProb [0, 1] (probability that the current frame is harmonic)
* RMSenergy [0, 1] (energy of the signal frame)
* LOGenergy [-100, 0] (energy of the signal frame, in dB)

(more upon request...!)

Face detection features:

* xPositionTopLeft [0, xCameraResolution] (top left corner of the bounding box of the detected face)
* yPositionTopLeft [0, yCameraResolution] (top left corner of the bounding box of the detected face)
* width [0, xCameraResolution] (width of the bounding box of the detected face)
* height [0, yCameraResolution] (height of the bounding box of the detected face)

(all 0 if no face detected)

Facial feature point coordinates, e.g. yLeftPupil.

Head motion features:

* motionDirection [-π, π] (angle of the motion)
* motionMagnitudeNormalised [0, large number] (pixels per frame)
* motionX [-large number, large number] (pixels per frame)
* motionY [-large number, large number] (pixels per frame)
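As an illustration of the key-value convention above, a hypothetical video feature extractor could assemble one frame of the face-detection feature vector as follows. The function name is ours, for illustration only; the actual Topic API is not shown.

```python
def face_bbox_features(x, y, w, h, face_detected=True):
    """Build one frame's key-value feature vector for the face-detection Topic.

    By the convention stated above, all values are 0 when no face is detected.
    """
    if not face_detected:
        x = y = w = h = 0
    return {
        "xPositionTopLeft": x,
        "yPositionTopLeft": y,
        "width": w,
        "height": h,
    }
```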


Analysers

Analysers produce three types of information: the verbal content recognised, the user's non-verbal behaviour, and the user's emotions. Analysers represent their output as EMMA messages (Johnston et al., 2009), i.e. the specific analysis output is accompanied by a time stamp and a confidence.

The analysis results cover:

* the verbal content recognised

Voice analysis includes:

* voice activity detection
* stylised pitch movements
* non-verbal vocalizations

Face analysis includes:

* face presence
* facial expression encoded in Action Units

Head movement analysis includes gestures such as nods, shakes etc. Emotions are reported separately as recognised from the voice, as recognised from the face, and as recognised from head movements.

Verbal content is represented directly in EMMA. For example:

<emma:emma version="1.0" xmlns:emma="">
    <emma:sequence emma:offset-to-start="12345" emma:duration="110">
        ...
    </emma:sequence>
</emma:emma>

The output of the voice activity detection (VAD), i.e. the speaking detector, looks like this; it needs no confidence. The Speaking Analyser (part of the TumFeatureExtractor) outputs messages when the user starts or stops speaking. These are low-level messages, created directly from the VAD output and smoothed only over 3 frames. Other components must therefore apply thresholds to reliably detect continuous segments where the user is speaking and to avoid false alarms.

<emma:emma version="1.0" xmlns:emma="">
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.3">
        <semaine:speaking xmlns:semaine="" statusChange="start"/>
    </emma:interpretation>
</emma:emma>


Possible values for /emma:emma/emma:interpretation/semaine:speaking/@statusChange : start, stop
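Since the raw speaking messages are smoothed over only 3 frames, a consuming component must apply its own hysteresis to obtain stable speaking segments. A minimal sketch of such a debouncer; the class name and the frame count are illustrative assumptions, not SEMAINE defaults:

```python
class SpeakingDebouncer:
    """Turn raw per-frame start/stop events into a stable speaking state."""

    def __init__(self, min_frames=20):
        self.min_frames = min_frames  # frames a new state must persist
        self.state = "stop"           # current stable state
        self._pending = None          # candidate state being observed
        self._count = 0               # how long the candidate has persisted

    def update(self, status_change):
        """Feed one raw 'start'/'stop' event; return the stable state."""
        if status_change == self._pending:
            self._count += 1
        else:
            self._pending = status_change
            self._count = 1
        # Commit the candidate state only once it has persisted long enough.
        if self._pending != self.state and self._count >= self.min_frames:
            self.state = self._pending
        return self.state
```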

Stylised pitch movements are represented as follows:

<emma:emma version="1.0" xmlns:emma="">
    <emma:interpretation emma:offset-to-start="12345" emma:duration="444" emma:confidence="0.3">
        <semaine:pitch xmlns:semaine="" direction="rise"/>
    </emma:interpretation>
</emma:emma>


Possible values for /emma:emma/emma:interpretation/semaine:pitch/@direction : rise, fall, rise-fall, fall-rise, high, mid, low

User gender is encoded as shown here:

<emma:emma version="1.0" xmlns:emma="">
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.3">
        <semaine:gender name="female" xmlns:semaine=""/>
    </emma:interpretation>
</emma:emma>


Possible values of /emma:emma/emma:interpretation/semaine:gender/@name : male, female, unknown
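As a sketch of how such an EMMA message could be serialised with standard tooling: the namespace URIs below are assumptions (the W3C EMMA namespace plus a placeholder for the SEMAINE namespace), since the snippets on this page leave them empty.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the examples on this page elide them.
EMMA_NS = "http://www.w3.org/2003/04/emma"
SEMAINE_NS = "http://www.semaine-project.eu/semaineml"

def gender_message(name, confidence, offset_ms):
    """Serialise a gender analysis result as an EMMA message string."""
    ET.register_namespace("emma", EMMA_NS)
    ET.register_namespace("semaine", SEMAINE_NS)
    emma = ET.Element(f"{{{EMMA_NS}}}emma", {"version": "1.0"})
    interp = ET.SubElement(emma, f"{{{EMMA_NS}}}interpretation", {
        f"{{{EMMA_NS}}}offset-to-start": str(offset_ms),
        f"{{{EMMA_NS}}}confidence": str(confidence),
    })
    ET.SubElement(interp, f"{{{SEMAINE_NS}}}gender", {"name": name})
    return ET.tostring(emma, encoding="unicode")
```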

Non-verbal user vocalizations such as laugh, sigh etc. are encoded as shown in the following example:

<emma:emma version="1.0" xmlns:emma="">
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.3">
        <semaine:vocalization xmlns:semaine="" name="(laughter)"/>
    </emma:interpretation>
</emma:emma>


Values of /emma:emma/emma:interpretation/semaine:vocalization/@name are currently “(laughter)”, “(sigh)” and “(breath)”.

Whether there is a face present is encoded as follows:

<emma:emma version="1.0" xmlns:emma="">
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.3">
        <semaine:face-present xmlns:semaine="" statusChange="start"/>
    </emma:interpretation>
</emma:emma>


Possible values for /emma:emma/emma:interpretation/semaine:face-present/@statusChange : start, stop

Any action units recognised from the user's face are encoded such that a separate confidence can be given for each action unit. Example:

<emma:emma version="1.0" xmlns:emma="">
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.3">
        <bml:bml xmlns:bml="">
            <bml:face au="1"/>
        </bml:bml>
    </emma:interpretation>
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.4">
        <bml:bml xmlns:bml="">
            <bml:face au="2"/>
        </bml:bml>
    </emma:interpretation>
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.2">
        <bml:bml xmlns:bml="">
            <bml:face au="4"/>
        </bml:bml>
    </emma:interpretation>
</emma:emma>
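Because each Action Unit arrives in its own interpretation, a consumer can read one confidence per AU. A parsing sketch; the namespace URIs are assumptions (the W3C EMMA namespace plus a plausible BML namespace), since the snippets on this page leave them empty:

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the examples on this page elide them.
EMMA_NS = "http://www.w3.org/2003/04/emma"
BML_NS = "http://www.mindmakers.org/projects/BML"

def au_confidences(xml_text):
    """Map each recognised Action Unit number to its own confidence."""
    root = ET.fromstring(xml_text)
    result = {}
    for interp in root.findall(f"{{{EMMA_NS}}}interpretation"):
        conf = float(interp.get(f"{{{EMMA_NS}}}confidence"))
        for face in interp.iter(f"{{{BML_NS}}}face"):
            result[int(face.get("au"))] = conf
    return result

example = (
    f'<emma:emma version="1.0" xmlns:emma="{EMMA_NS}">'
    f'<emma:interpretation emma:confidence="0.3">'
    f'<bml:bml xmlns:bml="{BML_NS}"><bml:face au="1"/></bml:bml>'
    f'</emma:interpretation>'
    f'<emma:interpretation emma:confidence="0.4">'
    f'<bml:bml xmlns:bml="{BML_NS}"><bml:face au="2"/></bml:bml>'
    f'</emma:interpretation>'
    f'</emma:emma>'
)
```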

Head gestures such as nods or shakes are represented as follows:

<emma:emma version="1.0" xmlns:emma="">
    <emma:interpretation emma:offset-to-start="12345" emma:duration="444" emma:confidence="0.3">
        <bml:bml xmlns:bml="">
            <bml:head type="NOD" start="12.345" end="12.789"/>
        </bml:bml>
    </emma:interpretation>
</emma:emma>


Possible values for /emma:emma/emma:interpretation/bml:bml/bml:head/@type: NOD, SHAKE, TILT-LEFT, TILT-RIGHT, APPROACH, RETRACT. Left and right are defined subject-centred, i.e. left is left for the user.

For all emotion messages, the information is encoded using the latest draft of the EmotionML standard (Schröder et al., 2010) as the payload of an EMMA container. For example:

<emma:emma xmlns:emma="" version="1.0">
    <emo:emotion xmlns:emo="">
      <emo:dimension confidence="0.905837" name="arousal" value="0.59999996"/>
      <emo:dimension confidence="0.97505563" name="valence" value="0.42333332"/>
      <emo:dimension confidence="0.9875278" name="unpredictability" value="0.29333335"/>
      <emo:dimension confidence="0.96318215" name="potency" value="0.31333336"/>
      <emo:intensity confidence="0.94343144" value="0.04"/>
    </emo:emotion>
</emma:emma>

Fusion components

All non-verbal analyses are combined by a NonverbalFusion component; all emotion analyses are combined by an EmotionFusion component, which computes the fused positions on emotion dimensions as a sum of individual positions weighted by the respective confidences.
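The confidence-weighted combination used by the EmotionFusion component can be sketched as follows. The function name is ours, and normalising by the total confidence (i.e. a weighted mean) as well as the way the combined confidence is derived are our assumptions; the text above only specifies a sum of positions weighted by confidences.

```python
def fuse_dimension(analyses):
    """Fuse one emotion dimension from several analysers.

    analyses: list of (value, confidence) pairs for the same dimension.
    Returns (fused_value, combined_confidence). The fused value is the
    confidence-weighted mean; the combined confidence is simply the mean
    confidence (an illustrative assumption).
    """
    total_conf = sum(c for _, c in analyses)
    if total_conf == 0:
        return 0.0, 0.0
    fused = sum(v * c for v, c in analyses) / total_conf
    return fused, total_conf / len(analyses)
```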

The fusion Topics carry:

* fused non-verbal behaviour from all available modalities
* fused emotion analysis based on information from all available modalities

Dialogue management

The dialogue management is performed by interpreters and action proposers operating on “state” information, i.e. the system's “current best guess” regarding the state of the user, the agent itself and their dialogue.

More information about the dialogue management can be found in SEMAINE deliverable report D4b.


Interpreters

Interpreters take both the analysis results and the existing state information into account when making various interpretations, leading to various state updates. Analyses are thresholded by confidence: only analyses with a sufficiently high confidence lead to a state update.

  • EmotionInterpreter updates the user's emotion state based on the fused emotion analyses;
  • NonVerbalInterpreter updates the user's nonverbal state based on the fused non-verbal analyses;
  • UtteranceInterpreter updates the user state with respect to the words spoken by the user, taking the current dialogue state into account;
  • TurnTakingInterpreter takes decisions on the agent's intention to take the turn, based on current user, dialogue and agent state, and updates dialog and agent state accordingly;
  • AgentMentalStateInterpreter updates the agent's mental state based on user behaviour.
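The confidence thresholding shared by the interpreters can be sketched as follows; the function name and the threshold value are illustrative assumptions, not SEMAINE defaults:

```python
def state_updates(analyses, threshold=0.2):
    """Keep only analyses confident enough to trigger a state update.

    analyses: iterable of (key, value, confidence) triples.
    Returns a dict of accepted state updates.
    """
    return {k: v for k, v, c in analyses if c >= threshold}
```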

State information is kept in three Topics:

* “current best guess” information about the user
* the current state of the agent
* “current best guess” information regarding the state of the dialogue

The state information is accessed via a short name and encoded in XML according to a stateinfo.config file -- see StateInfo for details.

Action proposers

There are currently two action proposer components.

  • UtteranceActionProposer acts simultaneously as proposer and action selection for verbal utterances. It selects suitable utterances from the set available for the current character, based on the current state information, and triggers the most suitable one when the agent has a turn-taking intention. This component manages the two output cues; see below.
  • ListenerIntentPlanner proposes possible timing and meaning of reactive listener backchannels, as well as non-verbal mimicry by the agent while being a listener. It uses user and agent state information to trigger and select both reactive (meaning-based) and mimicry (behaviour-based) backchannels.

A dedicated (listener) ActionSelection component makes sure that the amount of listener actions stays moderate, and in particular holds back any backchannel intentions while the agent is currently producing a verbal utterance.
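The gating rule of the ActionSelection component might look like the following sketch; the class name and the minimum interval between backchannels are illustrative assumptions.

```python
class ListenerActionGate:
    """Hold back backchannels while the agent speaks; keep their rate moderate."""

    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval       # seconds between backchannels
        self.agent_speaking = False            # set while an utterance plays
        self.last_backchannel = float("-inf")  # time of last passed backchannel

    def allow(self, now):
        """Return True if a backchannel intention may pass through at time `now`."""
        if self.agent_speaking:
            return False                       # never backchannel while speaking
        if now - self.last_backchannel < self.min_interval:
            return False                       # too soon after the previous one
        self.last_backchannel = now
        return True
```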

Candidate actions are produced to the following Topics.

* candidate actions described in terms of the function or meaning of what is to be expressed
* candidate actions described in terms of concrete behaviours
* selected actions described in terms of the function or meaning of what is to be expressed
* selected actions described in terms of concrete behaviours

The format of candidate and selected actions is identical. Actions in Topics *.function are encoded in FML. This includes verbal utterances like the following:

<?xml version="1.0" encoding="UTF-8"?>
<fml-apml version="0.1">
   <bml:bml xmlns:bml="" id="bml_uap_3">
      <bml:speech id="speech_uap_3" language="en-GB" text="Is that so? Tell me about it." voice="activemary">
         <ssml:mark xmlns:ssml="" name="speech_uap_3:tm1"/> Is
         <ssml:mark xmlns:ssml="" name="speech_uap_3:tm2"/> that
         <ssml:mark xmlns:ssml="" name="speech_uap_3:tm3"/> so?
         <ssml:mark xmlns:ssml="" name="speech_uap_3:tm4"/> Tell
         <ssml:mark xmlns:ssml="" name="speech_uap_3:tm5"/> me
         <ssml:mark xmlns:ssml="" name="speech_uap_3:tm6"/> about
         <ssml:mark xmlns:ssml="" name="speech_uap_3:tm7"/> it.
         <ssml:mark xmlns:ssml="" name="speech_uap_3:tm8"/>
      </bml:speech>
   </bml:bml>
   <fml:fml xmlns:fml="" id="fml_uap_3">
      <fml:performative end="speech_uap_3:tm4" id="tag1" importance="1" start="speech_uap_3:tm2" type="like"/>
      <fml:emotion end="speech_uap_3:tm4" id="tag2" importance="1" start="speech_uap_3:tm2" type="small-surprise"/>
      <fml:performative end="speech_uap_3:tm6" id="tag3" importance="1" start="speech_uap_3:tm4" type="agree"/>
   </fml:fml>
</fml-apml>

Reactive backchannels are also encoded in FML:

  <fml xmlns="" id="fml1">
    <backchannel end="1.8" id="b0" importance="1.0" start="0.0" type="understanding"/>
    <backchannel end="1.8" id="b1" importance="1.0" start="0.0" type="disagreement"/>
    <backchannel end="1.8" id="b2" importance="1.0" start="0.0" type="belief"/>
  </fml>

Mimicry backchannels are encoded in BML:

<bml xmlns="">
  <head end="1.8" id="s1" start="0.0" stroke="1.0">
    <description level="1" type="gretabml"/>
  </head>
</bml>

Generation of agent behaviour

Any agent actions must be generated in terms of low-level player data before they can be rendered. In addition to the direct generation branch, the architecture also supports a prepare-and-trigger branch.

More information about the generation of agent behaviour can be found in SEMAINE deliverable report D5b.

Direct branch

In the direct branch, a selected action is converted into player data using the following intermediate steps.

  • SpeechPreprocessor computes the accented syllables and any phrase boundaries as anchors to which gestural behaviour can be attached. It reads selected actions from its input Topics and writes its results to its output Topic. Whereas conceptually this processing step could work with a purely symbolic specification of prosody-based anchor points, the current implementation of the behaviour planner component requires absolute timing information; for this reason, the output of the SpeechPreprocessor already contains the detailed timing information.
  • BehaviourPlanner determines suitable behaviour elements based on the intended function/meaning of the action. It uses character-specific behaviour lexicons to map FML to BML.
  • SpeechBMLRealiser carries out the actual speech synthesis, i.e. the generation of audio data. It writes a BML message including the speech timings to one output Topic, and the binary audio data including a file header to another.
  • BehaviorRealizer produces the low-level video features in its output Topics. In addition, it sends two types of information to the player's command Topic: (1) which modalities form part of a given animation, identified by a unique content ID; and (2) the trigger commands needed to start playing back the animation.
  • PlayerOgre is the audiovisual player component. It reads the low-level player data from its input Topics; a unique content ID is used to match the various parts of a multimodal animation to be rendered. Two types of information are received via the command Topic: which modalities are expected to be part of a given animation / content ID, and the trigger to start playing the animation. The Player sends callback messages to the Topic semaine.callback.output.Animation, to inform about the preparation or playback state of the various animations it receives.

An example of a callback message is the following:

<callback xmlns="">
  <event data="Animation" id="fml_lip_70" time="1116220" type="start" contentType="utterance"/>

Possible event types are “ready”, “start”, “end”, as well as “stopped” and “deleted”.

The contentType parameter describes the type of the animation. Possible content types are "utterance", "listener-vocalisation", "visual-only" and "content-type".
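A sketch of reading such a callback message with standard tooling; it assumes the callback element carries no namespace, since the snippet above leaves the namespace URI empty:

```python
import xml.etree.ElementTree as ET

def parse_callback(xml_text):
    """Extract the fields of a player callback message into a dict."""
    event = ET.fromstring(xml_text).find("event")
    return {
        "type": event.get("type"),   # ready/start/end/stopped/deleted
        "id": event.get("id"),
        "time": int(event.get("time")),
        "contentType": event.get("contentType"),
    }

example = ('<callback><event data="Animation" id="fml_lip_70" '
           'time="1116220" type="start" contentType="utterance"/></callback>')
```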

Prepare-and-trigger branch

The prepare-and-trigger branch replicates the processing pipeline of the direct branch, but uses a separate set of Topics:

  • QueuingSpeechPreprocessor corresponds to SpeechPreprocessor;
  • BehaviorPlannerPrep corresponds to BehaviourPlanner;
  • QueuingSpeechBMLRealiser corresponds to SpeechBMLRealiser;
  • BehaviorRealizerPrep corresponds to BehaviorRealizer.

The difference to the direct branch lies at the two ends of the processing pipeline. At the input end, the UtteranceActionProposer feeds in candidate utterances that the current character may perform in the near future. At the output end, the BehaviorRealizerPrep sends the content-level description to the player but does not send the trigger commands. Instead, when the player has received all necessary parts of a given animation, it sends a “ready” callback message, which is registered by the UtteranceActionProposer. When its utterance selection algorithm determines that the selected utterance already exists in prepared form in the player, all it needs to do is send a trigger command directly to the player, which then starts the playback of the prepared animation without any further delay. If no prepared version of the selected utterance is available, e.g. because the selection of this utterance was not anticipated, or because the preparation has not completed yet, the utterance is generated using the direct branch.
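The trigger-or-fallback decision described above can be sketched as follows; the function and parameter names are ours for illustration, not SEMAINE API names:

```python
def trigger_utterance(utterance_id, prepared_ready, trigger_player, run_direct_branch):
    """Trigger a prepared animation if the player reported it ready,
    otherwise generate it via the direct branch.

    prepared_ready: set of content IDs for which a "ready" callback arrived.
    """
    if utterance_id in prepared_ready:
        trigger_player(utterance_id)    # playback starts without further delay
        return "prepared"
    run_direct_branch(utterance_id)     # fall back to full generation
    return "direct"
```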

The prepare-and-trigger branch is used only for full utterances. Listener actions are so short and fast to generate that they always use the direct branch.

Since both branches are technically completely independent, this architecture scales well to multiple computers: it is easy to run the direct branch on one computer and the prepare-and-trigger branch on a different computer if they jointly would over-stretch the CPU resources of a single PC.

Last modified on 12/16/10 08:33:26