Version 4 (modified by masc01, 10 years ago)


SEMAINE Component Architecture

This page documents how the SEMAINE components are organised into a system. Starting from a conceptual message flow graph, we explain which components exist and via which JMS Topics they communicate.

Conceptually, the message flow graph in SEMAINE is shown in the attached figure.

The following subsections describe the three main parts of the system in more detail: analysis of user behaviour, dialogue management, and generation of agent behaviour.

Analysis of user behaviour

User behaviour is first represented in terms of low-level audio and video features, then in terms of individual analysis results, which are fused before the dialogue model's “current best guess” of the user state is updated.

More details about the analysis of user behaviour can be found in SEMAINE deliverable reports D2b and D3c.

Feature extractors

Feature extractors are modality-specific. They produce feature vectors as key-value pairs, typically at a fixed frame rate (e.g., every 10 ms for audio features, and once per video frame for video features). The following feature Topics are currently used.

Audio features:

* F0frequency [0,600] (fundamental frequency in Hz)
* voiceProb [0,1] (probability that the current frame is harmonic)
* RMSenergy [0,1] (energy of the signal frame)
* LOGenergy [-100,0] (energy of the signal frame, in dB)

(more upon request...!)

Face features (all 0 if no face detected):

* xPositionTopLeft [0,xCameraResolution] (top left corner of the bounding box of the face detected)
* yPositionTopLeft [0,yCameraResolution] (top left corner of the bounding box of the face detected)
* width [0,xCameraResolution] (width of the bounding box of the face detected)
* height [0,yCameraResolution] (height of the bounding box of the face detected)
* yLeftPupil (vertical position of the left pupil)

Motion features:

* motionDirection [-π, π] (angle of the motion)
* motionMagnitudeNormalised [0, large number] (pixels per frame)
* motionX [-large number, large number] (pixels per frame)
* motionY [-large number, large number] (pixels per frame)

Analysers

Analysers produce three types of information: the verbal content recognised, the user's non-verbal behaviour, and the user's emotions. Analysers represent their output as EMMA messages (Johnston et al., 2009), i.e. the specific analysis output is accompanied by a time stamp and a confidence.

One Topic per type of analysis result is currently used:

* the verbal content recognised
* voice analysis, which includes voice activity detection, stylised pitch movements, and non-verbal vocalizations
* face analysis, which includes face presence and facial expression encoded in Action Units
* head gestures such as nods, shakes etc.
* emotion as recognised from the voice
* emotion as recognised from the face
* emotion as recognised from head movements

Verbal content is represented directly in EMMA. For example:

<emma:emma version="1.0" xmlns:emma="">
  <emma:sequence emma:offset-to-start="12345" emma:duration="110">
    ...
  </emma:sequence>
</emma:emma>

The output of the voice activity detection (VAD), i.e. the speaking detector, looks like this; it needs no confidence value. The Speaking Analyser (part of the TumFeatureExtractor) outputs messages when the user starts or stops speaking. These are low-level messages, created directly from the VAD output and smoothed only over 3 frames. Thus, some thresholds must be applied in other components to reliably detect continuous segments where the user is speaking and to avoid false alarms.

<emma:emma version="1.0" xmlns:emma="">
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.3">
        <semaine:speaking xmlns:semaine="" statusChange="start"/>
    </emma:interpretation>
</emma:emma>


Possible values for /emma:emma/emma:interpretation/semaine:speaking/@statusChange : start, stop
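Since the raw start/stop messages are smoothed only over 3 frames, downstream components must apply their own thresholds to obtain continuous speaking segments. A minimal sketch of such a smoothing step (the function name and the 300 ms threshold are illustrative assumptions, not part of SEMAINE):

```python
def debounce(events, min_gap_ms=300):
    """Merge start/stop events into continuous speaking segments.

    events: list of (time_ms, status) tuples with status "start" or "stop",
            as read from the semaine:speaking messages.
    Segments separated by a pause shorter than min_gap_ms are merged,
    suppressing false alarms in the raw 3-frame VAD output.
    """
    segments = []
    start = None
    for time_ms, status in events:
        if status == "start":
            # Merge with the previous segment if the pause was too short
            if segments and time_ms - segments[-1][1] < min_gap_ms:
                start = segments.pop()[0]
            elif start is None:
                start = time_ms
        elif status == "stop" and start is not None:
            segments.append((start, time_ms))
            start = None
    return segments
```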

Stylised pitch movements are represented as follows:

<emma:emma version="1.0" xmlns:emma="">
    <emma:interpretation emma:offset-to-start="12345" emma:duration="444" emma:confidence="0.3">
        <semaine:pitch xmlns:semaine="" direction="rise"/>
    </emma:interpretation>
</emma:emma>


Possible values for /emma:emma/emma:interpretation/semaine:pitch/@direction : rise, fall, rise-fall, fall-rise, high, mid, low

User gender is encoded as shown here:

<emma:emma version="1.0" xmlns:emma="">
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.3">
        <semaine:gender name="female" xmlns:semaine=""/>
    </emma:interpretation>
</emma:emma>


Possible values of /emma:emma/emma:interpretation/semaine:gender/@name : male, female, unknown

Non-verbal user vocalizations such as laugh, sigh etc. are encoded as shown in the following example:

<emma:emma version="1.0" xmlns:emma="">
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.3">
        <semaine:vocalization xmlns:semaine="" name="(laughter)"/>
    </emma:interpretation>
</emma:emma>


Values of /emma:emma/emma:interpretation/semaine:vocalization/@name are currently “(laughter)”, “(sigh)” and “(breath)”.

Whether there is a face present is encoded as follows:

<emma:emma version="1.0" xmlns:emma="">
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.3">
        <semaine:face-present xmlns:semaine="" statusChange="start"/>
    </emma:interpretation>
</emma:emma>


Possible values for /emma:emma/emma:interpretation/semaine:face-present/@statusChange : start, stop

Any action units recognised from the user's face are encoded such that a separate confidence can be given for each action unit. Example:

<emma:emma version="1.0" xmlns:emma="">
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.3">
        <bml:bml xmlns:bml="">
            <bml:face au="1"/>
        </bml:bml>
    </emma:interpretation>
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.4">
        <bml:bml xmlns:bml="">
            <bml:face au="2"/>
        </bml:bml>
    </emma:interpretation>
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.2">
        <bml:bml xmlns:bml="">
            <bml:face au="4"/>
        </bml:bml>
    </emma:interpretation>
</emma:emma>

Head gestures such as nods or shakes are represented as follows:

<emma:emma version="1.0" xmlns:emma="">
    <emma:interpretation emma:offset-to-start="12345" emma:duration="444" emma:confidence="0.3">
        <bml:bml xmlns:bml="">
            <bml:head type="NOD" start="12.345" end="12.789"/>
        </bml:bml>
    </emma:interpretation>
</emma:emma>


Possible values for /emma:emma/emma:interpretation/bml:bml/bml:head/@type: NOD, SHAKE, TILT-LEFT, TILT-RIGHT, APPROACH, RETRACT. Left and right are defined subject-centred (i.e. left is left for the user).

For all emotion messages, the information is encoded using the latest draft of the EmotionML standard (Schröder et al., 2010) as the payload of an EMMA container. For example:

<emma:emma xmlns:emma="" version="1.0">
    <emo:emotion xmlns:emo="">
      <emo:dimension confidence="0.905837" name="arousal" value="0.59999996"/>
      <emo:dimension confidence="0.97505563" name="valence" value="0.42333332"/>
      <emo:dimension confidence="0.9875278" name="unpredictability" value="0.29333335"/>
      <emo:dimension confidence="0.96318215" name="potency" value="0.31333336"/>
      <emo:intensity confidence="0.94343144" value="0.04"/>
    </emo:emotion>
</emma:emma>

Fusion components

All non-verbal analyses are combined by a NonverbalFusion component; all emotion analyses are combined by an EmotionFusion component, which computes the fused positions on emotion dimensions as a sum of individual positions weighted by the respective confidences.
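The confidence-weighted fusion performed by the EmotionFusion component can be illustrated with a small sketch (a simplified illustration; the combined confidence shown here, a plain mean, is an assumption, not the component's documented behaviour):

```python
def fuse_dimension(analyses):
    """Fuse several analyses of one emotion dimension.

    analyses: list of (value, confidence) pairs from individual analysers.
    Returns the fused value -- the sum of individual positions weighted by
    their confidences, normalised by the total confidence -- and a combined
    confidence (here simply the mean confidence, an assumption).
    """
    total_conf = sum(conf for _, conf in analyses)
    if total_conf == 0.0:
        return 0.0, 0.0
    value = sum(v * c for v, c in analyses) / total_conf
    return value, total_conf / len(analyses)
```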

The fusion results are published to two Topics:

* fused non-verbal behaviour from all available modalities
* fused emotion analysis based on information from all available modalities

Dialogue management

The dialogue management is performed by interpreters and action proposers operating on “state” information, i.e. the system's “current best guess” regarding the state of the user, the agent itself and their dialogue.

More information about the dialogue management can be found in SEMAINE deliverable report D4b.


Interpreters

Interpreters take both the analysis results and the existing state information into account when making various interpretations, leading to various state updates. Analyses are thresholded by confidence – only analyses with a sufficiently high confidence lead to a state update.

  • EmotionInterpreter updates the user's emotion state based on the fused emotion analyses;
  • NonVerbalInterpreter updates the user's nonverbal state based on the fused non-verbal analyses;
  • UtteranceInterpreter updates the user state with respect to the words spoken by the user, taking the current dialogue state into account;
  • TurnTakingInterpreter takes decisions on the agent's intention to take the turn, based on current user, dialogue and agent state, and updates dialog and agent state accordingly;
  • AgentMentalStateInterpreter updates the agent's mental state based on user behaviour.
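The confidence thresholding applied by the interpreters can be sketched as follows (the field names and the threshold value are illustrative assumptions):

```python
def update_user_state(state, analyses, threshold=0.2):
    """Apply only sufficiently confident analyses to the state.

    state: dict mapping state-info short names to values;
    analyses: list of dicts with "name", "value" and "confidence" keys.
    Analyses below the confidence threshold are discarded and do not
    lead to a state update.
    """
    for a in analyses:
        if a["confidence"] >= threshold:
            state[a["name"]] = a["value"]
    return state
```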

State information is kept in three Topics:

* “current best guess” information about the user
* current state of the agent
* “current best guess” information regarding the state of the dialogue

The state information is accessed via a short name and encoded in XML according to a stateinfo.config file. The latest version of the stateinfo.config file is enclosed as Appendix A.

Action proposers

There are currently two action proposer components.

  • UtteranceActionProposer acts simultaneously as proposer and action selector for verbal utterances. It selects suitable utterances from the available set of utterances for the current character, based on the current state information, and triggers the most suitable one when the agent has a turn-taking intention. This component manages the two output branches, see below.
  • ListenerIntentPlanner proposes possible timing and meaning of reactive listener backchannels, as well as non-verbal mimicry by the agent while being a listener. It uses user and agent state information to trigger and select both reactive (meaning-based) and mimicry (behaviour-based) backchannels.

A dedicated (listener) ActionSelection component makes sure that the amount of listener actions stays moderate, and in particular holds back any backchannel intentions while the agent is currently producing a verbal utterance.

Candidate actions are produced to the following Topics.

For both candidate and selected actions, there is one Topic for actions described in terms of the function or meaning of what is to be expressed (*.function), and one for actions described in terms of concrete behaviours (*.behaviour).

The format of candidate and selected actions is identical. Actions in Topics *.function are encoded in FML. This includes verbal utterances like the following:

<?xml version="1.0" encoding="UTF-8"?>
<fml-apml version="0.1">
   <bml:bml xmlns:bml="" id="bml_uap_3">
      <bml:speech id="speech_uap_3" language="en-GB" text="Is that so? Tell me about it." voice="activemary">
         <ssml:mark xmlns:ssml="" name="speech_uap_3:tm1"/>Is
         <ssml:mark xmlns:ssml="" name="speech_uap_3:tm2"/>that
         <ssml:mark xmlns:ssml="" name="speech_uap_3:tm3"/>so?
         <ssml:mark xmlns:ssml="" name="speech_uap_3:tm4"/>Tell
         <ssml:mark xmlns:ssml="" name="speech_uap_3:tm5"/>me
         <ssml:mark xmlns:ssml="" name="speech_uap_3:tm6"/>about
         <ssml:mark xmlns:ssml="" name="speech_uap_3:tm7"/>it.
         <ssml:mark xmlns:ssml="" name="speech_uap_3:tm8"/>
      </bml:speech>
   </bml:bml>
   <fml:fml xmlns:fml="" id="fml_uap_3">
      <fml:performative end="speech_uap_3:tm4" id="tag1" importance="1" start="speech_uap_3:tm2" type="like"/>
      <fml:emotion end="speech_uap_3:tm4" id="tag2" importance="1" start="speech_uap_3:tm2" type="small-surprise"/>
      <fml:performative end="speech_uap_3:tm6" id="tag3" importance="1" start="speech_uap_3:tm4" type="agree"/>
   </fml:fml>
</fml-apml>

Reactive backchannels are also encoded in FML:

  <fml xmlns="" id="fml1">
    <backchannel end="1.8" id="b0" importance="1.0" start="0.0" type="understanding"/>
    <backchannel end="1.8" id="b1" importance="1.0" start="0.0" type="disagreement"/>
    <backchannel end="1.8" id="b2" importance="1.0" start="0.0" type="belief"/>
  </fml>

Mimicry backchannels are encoded in BML:

<bml xmlns="">
  <head end="1.8" id="s1" start="0.0" stroke="1.0">
    <description level="1" type="gretabml">
      ...
    </description>
  </head>
</bml>

Generation of agent behaviour

Any agent actions must be generated in terms of low-level player data before they can be rendered. In addition to the direct generation branch, the current architecture now also supports a prepare-and-trigger branch.

More information about the generation of agent behaviour can be found in SEMAINE deliverable report D5b.

Direct branch

In the direct branch, a selected action is converted into player data using the following intermediate steps.

  • SpeechPreprocessor computes the accented syllables and any phrase boundaries as anchors to which any gestural behaviour can be attached. It reads the selected actions from their Topics and writes its results to its output Topic. Whereas conceptually this processing step could work with a purely symbolic specification of prosody-based anchor points, the current implementation of the behaviour planner component requires absolute timing information. For this reason, the output of the SpeechPreprocessor already contains the detailed timing information.
  • BehaviourPlanner determines suitable behaviour elements based on the intended function/meaning of the action. It uses character-specific behaviour lexicons to map FML to BML.
  • SpeechBMLRealiser carries out the actual speech synthesis, i.e. the generation of audio data. It writes a BML message including the speech timings to one Topic, and the binary audio data including a file header to another.
  • BehaviorRealizer produces the low-level video features. In addition, it sends two types of information to the player's command Topic: (1) the information which modalities form part of a given animation, as identified by a unique content ID; and (2) the trigger commands needed to start playing back the animation.
  • PlayerOgre is the audiovisual player component. It reads the low-level player data from the player data Topics. A unique content ID is used to match the various parts of a multimodal animation to be rendered. Two types of information are received via the command Topic: the information which modalities are expected to be part of a given animation / content ID, and the trigger to start playing the animation. The Player sends callback messages to the Topic semaine.callback.output.Animation, to inform about the preparation or playback state of the various animations it receives.

An example of a callback message is the following:

<callback xmlns="">
  <event data="Animation" id="fml_lip_70" time="1116220" type="start"/>
</callback>

Possible event types are “ready”, “start”, “end”, as well as “stopped” and “deleted”.

Prepare-and-trigger branch

The prepare-and-trigger branch replicates the processing pipeline of the direct branch, but using different Topics:

  • QueuingSpeechPreprocessor takes the role of the SpeechPreprocessor;
  • BehaviorPlannerPrep takes the role of the BehaviourPlanner;
  • QueuingSpeechBMLRealiser takes the role of the SpeechBMLRealiser;
  • BehaviorRealizerPrep takes the role of the BehaviorRealizer.

Each component reads from and writes to its own set of Topics, separate from those of the direct branch.

The difference to the direct branch is at the two ends of the processing pipeline. At the input end, the UtteranceActionProposer feeds in candidate utterances that the current character may perform in the near future. At the output end, the BehaviorRealizerPrep sends the content-level description to the player but does not send the trigger commands. Instead, when the player has received all necessary parts of a given animation, it sends a “ready” callback message, which is registered by the UtteranceActionProposer. When its utterance selection algorithm determines that the selected utterance already exists in prepared form in the player, all it needs to do is send a trigger command directly to the player, which then starts the playback of the prepared animation without any further delay. If no prepared version of the selected utterance is available, e.g. because it was unexpected that this utterance would be selected, or because the preparation has not completed yet, the utterance is generated using the direct branch.

The prepare-and-trigger branch is used only for full utterances. Listener actions are so short and fast to generate that they always use the direct branch.
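The trigger-or-fallback decision described above can be summarised in a short sketch (all names are illustrative assumptions; the real UtteranceActionProposer is a Java component):

```python
def play_utterance(utterance_id, prepared_ready, send_trigger, run_direct_branch):
    """Trigger a prepared animation if the player has reported it ready,
    otherwise fall back to the direct generation branch.

    prepared_ready: set of content IDs for which a "ready" callback
    has been registered.
    send_trigger / run_direct_branch: callables taking the content ID.
    """
    if utterance_id in prepared_ready:
        # The animation is fully prepared in the player: a trigger
        # command starts playback without any further delay.
        send_trigger(utterance_id)
        return "triggered"
    # No prepared version exists (unexpected selection, or preparation
    # not yet complete): generate via the direct branch.
    run_direct_branch(utterance_id)
    return "direct"
```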

Since both branches are technically completely independent, this architecture scales well to multiple computers: it is easy to run the direct branch on one computer and the prepare-and-trigger branch on a different computer if they jointly would over-stretch the CPU resources of a single PC.

Protocol for the Player in SEMAINE

Any player component in SEMAINE must follow the following protocol so that it supports ahead-of-time preparation of possible utterances. The player must keep a collection of "Animations" which can be played by a "playCommand".

This protocol is currently implemented by two players: the audio-visual Windows-native PlayerOgre using the Greta agent, and the speech-only player in the Java class eu.semaine.components.mary.QueuingAudioPlayer.

Data flow

Low-level player data is sent to the player via the player data Topics.


Incoming messages have the following properties:

  • a message type specific to the payload format (currently: BytesMessage for audio, TextMessage for FAP and BAP).
  • a data type (obtained by message.getDatatype()) identifying the type of message (current values are "AUDIO", "FAP" and "BAP").
  • a content ID and a content creation time (obtained by message.getContentID() and message.getContentCreationTime()) which are used to assemble an Animation, to match data and command messages, and to identify the content item in callback and log messages.

The idea is that a unit of player data (an "Animation") is assembled in the player from the individual data items that are coming in (currently, AUDIO, FAP and BAP). Certain data types are optional (currently: AUDIO). A message can either contain the complete data of the given type (currently the case for AUDIO) or it can contain a chunk of data (currently the case for FAP and BAP). A chunk contains information about its position in the Animation; it can be dynamically added even if the Animation is already playing.
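The assembly of an Animation from incoming data items might be modelled as follows (a simplified sketch; the actual player implementations differ, and the class and method names here are assumptions):

```python
class Animation:
    """Collects the data parts of one content ID and reports readiness.

    required: set of data types announced by the dataInfo command,
    e.g. {"AUDIO", "FAP", "BAP"}; optional types are simply omitted.
    """
    def __init__(self, content_id, required):
        self.content_id = content_id
        self.required = set(required)
        self.received = set()

    def add_data(self, data_type):
        # AUDIO arrives complete; FAP/BAP arrive as chunks, which may
        # even be added while the Animation is already playing.
        self.received.add(data_type)

    def is_ready(self):
        # Ready once all required data types are present; at this point
        # the player would send a "ready" callback.
        return self.required <= self.received
```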

Command messages

There are two types of command messages: messages with data types dataInfo and playCommand.

  • Data info commands: For a given content ID, define the data types that must be present in the Animation: HASAUDIO, HASFAP and HASBAP. Each can be 0 for "not needed" or 1 for "needed".
  • Player commands: For a given content ID, define the playback conditions. This includes the following aspects:
  • STARTAT: when to start the playback of the Animation (in milliseconds from the moment when the Animation becomes ready);
  • PRIORITY: the priority of the Animation in case of competing Animations;
  • LIFETIME: the lifetime of the Animation, counting from the moment when the animation becomes ready. When the lifetime is exceeded and the animation has not started playing, it will be marked as "dead" and removed.

Commands are sent to the player command Topic, and have the data type dataInfo for data info commands and playCommand for player start trigger commands.

For every content ID, a playCommand is required in order to play that animation. Without a matching playCommand, an animation will never be played.

A command has the following format:

  • its content ID is identical to the content ID of the Animation for which it defines playback conditions;
  • message format is TextMessage; the text consists of space-separated key-value pairs, one pair per line, where the keys are strings and the values are floating point numbers.

The following features are used:

  • for playCommand:
  • STARTAT (0 means start at the moment when all required parts are present, a positive number means milliseconds after that condition is met)
  • LIFETIME (in milliseconds from the moment the animation is triggered; -1 means it will never expire)
  • PRIORITY (a value between 0 and 1, where 0 is the lowest and 1 the highest possible priority)
  • for dataInfo:
  • HASAUDIO (a binary feature, 0 means the Animation does not have audio, 1 means the Animation has audio data)
  • HASFAP (a binary feature, 0 means the Animation does not have FAP data, 1 means the Animation has FAP data)
  • HASBAP (a binary feature, 0 means the Animation does not have BAP data, 1 means the Animation has BAP data)

Every player command must contain all features of its respective type.
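Given the format above (space-separated key-value pairs, one per line, with floating-point values), a command body can be parsed as follows (a minimal sketch):

```python
def parse_command(text):
    """Parse a playCommand or dataInfo TextMessage body, e.g.:

        STARTAT 0
        PRIORITY 0.5
        LIFETIME 5000

    Returns a dict mapping string keys to float values.
    """
    command = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        command[key] = float(value)
    return command
```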

Callback messages

Event-based callback messages are sent when certain conditions are met for a given Animation. The messages go to the callback Topic and have the following format:

<callback xmlns="">
   <event id="CONTENT_ID" time="META_TIME" type="EVENT_TYPE"/>
</callback>

where content ID and meta time are like before, and type is one of the following:

  • ready means the Animation has received all required data, so it is ready for playback. This event is triggered independently of whether a command has been received or not.
  • deleted means the Animation was removed before it started playing, e.g. because it exceeded its lifetime in the output queue.
  • start means the Animation has started playing.
  • stopped means the Animation was stopped while playing, before it was finished, e.g. because a request to change character was received.
  • end means the Animation has finished playing.
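Taken together, these event types imply a simple per-Animation lifecycle: ready, then either deleted or start, then either stopped or end. A sketch that checks this ordering (an illustration derived from the list above; the players are not required to validate sequences this way):

```python
# Allowed lifecycle transitions implied by the callback event types.
TRANSITIONS = {
    None: {"ready"},
    "ready": {"deleted", "start"},
    "start": {"stopped", "end"},
}

def is_valid_sequence(events):
    """Check that a sequence of callback event types follows the lifecycle."""
    state = None
    for event in events:
        if event not in TRANSITIONS.get(state, set()):
            return False
        state = event
    return True
```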

Error conditions

The content ID must be unique for the lifetime of a system. This leads to the following error conditions.

It is an error condition...

  • if a data chunk is received for an Animation that has already been discarded (because it finished playing, or exceeded its lifetime in the queue);
  • if data is received for a data type that does not form chunks;
  • if a playCommand is received for a content ID that has been started already, or that is already discarded;

An error condition should be reported as a WARN log message, and otherwise ignored.

It is not an error condition...

  • if a second playCommand is received after an animation has become ready but before it started playing. In this case, the new priority etc. overwrites the previous values.
