EMMA
The Extensible Multimodal Annotation Language EMMA, a W3C Recommendation, is “an XML markup language for containing and annotating the interpretation of user input”. As such, it is a wrapper language that can carry various kinds of payload representing the interpretation of user input. At its core, the EMMA language provides the <emma:interpretation> element, containing all information about a single interpretation of user behaviour. Several such elements can be enclosed within an <emma:one-of> element in cases where more than one interpretation is present. An interpretation can have an emma:confidence attribute, indicating how confident the source of the annotation is that the interpretation is correct; time-related information such as emma:start, emma:end and emma:duration, indicating the time span for which the interpretation is provided; information about the modality upon which the interpretation is based, through the emma:medium and emma:mode attributes; and many more.
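As an illustration of these annotation attributes, the following minimal sketch shows two alternative interpretations of a spoken input enclosed in an <emma:one-of> element. The element and attribute names are taken from the EMMA specification, but the tokens, times and confidence values are invented.

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- Illustrative sketch only: tokens, times and confidences are invented -->
  <emma:one-of emma:medium="acoustic" emma:mode="voice"
               emma:start="123456789" emma:end="123457289">
    <emma:interpretation emma:tokens="hello" emma:confidence="0.6"/>
    <emma:interpretation emma:tokens="yellow" emma:confidence="0.4"/>
  </emma:one-of>
</emma:emma>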
The following listing shows an example EMMA document carrying an interpretation of user behaviour represented using EmotionML. The interpretation is annotated with a start time. The EMMA wrapper elements and the EmotionML content are in different XML namespaces, so it is unambiguous which element belongs to which part of the annotation.
<emma:emma xmlns:emma="http://www.w3.org/2003/04/emma" version="1.0">
  <emma:interpretation emma:start="123456789">
    <emotion xmlns="http://www.w3.org/2009/10/emotionml"
             dimension-set="http://www.example.com/emotion/dimension/FSRE.xml">
      <dimension name="arousal" value="0.23"/>
      <dimension name="valence" value="0.62"/>
    </emotion>
  </emma:interpretation>
</emma:emma>
EMMA can also be used to represent Automatic Speech Recognition (ASR) output, either as the single most probable word chain or as a word lattice, using the <emma:lattice> element.
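As a rough sketch, a word lattice might be wrapped as follows, using the <emma:lattice> and <emma:arc> markup defined in the EMMA specification; the node numbers and words are invented for illustration.

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation emma:medium="acoustic" emma:mode="voice">
    <!-- Illustrative sketch only: nodes and words are invented -->
    <emma:lattice initial="1" final="4">
      <emma:arc from="1" to="2">flights</emma:arc>
      <emma:arc from="2" to="3">to</emma:arc>
      <emma:arc from="3" to="4">boston</emma:arc>
      <emma:arc from="3" to="4">austin</emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>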
Details
Skeleton for all EMMA documents:
All EMMA documents MUST have a top-level <emma:emma> element, and SHOULD have at least one <emma:interpretation> element. That interpretation SHOULD have a time stamp, given in its attribute "emma:offset-to-start", and MAY have a confidence, given in the attribute "emma:confidence".
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.3">
    my annotation
  </emma:interpretation>
</emma:emma>
If the document contains a single annotation, the <emma:interpretation> element SHOULD be a direct child of <emma:emma>. A sequence of interpretations can be represented using the <emma:sequence> element, as for keywords spotted. A collection of interpretations with different probabilities can be represented using the <emma:one-of> element, as e.g. for interest.
For the individual types of content / payload, we use by default the same representation as for the "current best guess" user state, unless there is a reason to deviate from it.
We distinguish verbal information, emotion-related information, and non-verbal information.
Verbal information
Type of information | Topic |
keywords spotted | state.user.emma.words |
Keywords
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:sequence emma:offset-to-start="12345" emma:duration="110">
    <emma:interpretation emma:offset-to-start="12345" emma:tokens="bla" emma:confidence="0.3"/>
    <emma:interpretation emma:offset-to-start="12390" emma:tokens="bloo" emma:confidence="0.4"/>
  </emma:sequence>
</emma:emma>
Emotion-related information
Type of information | Topic |
emotion | state.user.emma.emotion.(modality) |
interest | state.user.emma.emotion.(modality) |
Emotion
The global user emotion is represented using an intensity value and the four dimensions arousal, valence, unpredictability and potency.
<?xml version="1.0" encoding="UTF-8"?>
<emma:emma xmlns:emma="http://www.w3.org/2003/04/emma" version="1.0">
  <emma:interpretation>
    <emotion xmlns="http://www.w3.org/2009/10/emotionml"
             dimension-set="http://www.example.com/emotion/dimension/FSRE.xml">
      <intensity confidence="0.30086732" value="0.4115755"/>
      <dimension confidence="0.9518124" name="arousal" value="0.1852386"/>
      <dimension confidence="0.2734806" name="valence" value="0.7791835"/>
      <dimension confidence="0.22194415" name="unpredictability" value="0.09359175"/>
      <dimension confidence="0.2912501" name="potency" value="0.050632834"/>
    </emotion>
  </emma:interpretation>
</emma:emma>
Interest
User interest is represented using a custom vocabulary of interest-related category labels: bored, neutral, and interested. The confidence is used to indicate the extent to which each of the three categories is recognised.
<?xml version="1.0" encoding="UTF-8"?>
<emma:emma xmlns:emma="http://www.w3.org/2003/04/emma" version="1.0">
  <emma:interpretation>
    <emotion xmlns="http://www.w3.org/2009/10/emotionml"
             category-set="http://www.semaine-project.eu/emo/category/interest.xml">
      <category confidence="0.6955442" name="bored"/>
    </emotion>
    <emotion xmlns="http://www.w3.org/2009/10/emotionml"
             category-set="http://www.semaine-project.eu/emo/category/interest.xml">
      <category confidence="0.24825269" name="neutral"/>
    </emotion>
    <emotion xmlns="http://www.w3.org/2009/10/emotionml"
             category-set="http://www.semaine-project.eu/emo/category/interest.xml">
      <category confidence="0.6315944" name="interested"/>
    </emotion>
  </emma:interpretation>
</emma:emma>
Non-verbal information
Type of information | Topic |
head movement | state.user.emma.nonverbal.head |
user speaking | state.user.emma.nonverbal.voice |
pitch direction | state.user.emma.nonverbal.voice |
gender | state.user.emma.nonverbal.voice |
nonverbal vocalizations | state.user.emma.nonverbal.voice |
face presence | state.user.emma.nonverbal.face |
action units | state.user.emma.nonverbal.face |
Head movement
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation emma:offset-to-start="12345" emma:duration="444" emma:confidence="0.3">
    <bml:bml xmlns:bml="http://www.mindmakers.org/projects/BML">
      <bml:head type="NOD" start="12.345" end="12.789"/>
    </bml:bml>
  </emma:interpretation>
</emma:emma>
The payload format is the same as for the user state: here it is below emma:interpretation, there it is below semaine:user-state (see the sketch at the end of this section).
Note that the proposal includes the redundant specification of time: start time is in emma:interpretation/@emma:offset-to-start (in milliseconds), and in bml:head/@start (in seconds). End time is given indirectly by emma:interpretation/@emma:duration (in milliseconds), and directly through bml:head/@end (in seconds). Experience will tell whether this double representation is useful.
Possible values for /emma:emma/emma:interpretation/bml:bml/bml:head/@type: NOD, SHAKE, TILT-LEFT, TILT-RIGHT, APPROACH, RETRACT. Left and right are defined subject-centred (i.e. left is left for the user).
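As noted above, the same bml:head payload also appears below semaine:user-state in the corresponding user state message. A minimal sketch, assuming the payload is placed directly below the semaine:user-state element (the attributes of that wrapper are defined by the user state format, not here):

<semaine:user-state xmlns:semaine="http://www.semaine-project.eu/semaineml">
  <!-- Sketch only: the bml:head payload is identical to the EMMA example above -->
  <bml:bml xmlns:bml="http://www.mindmakers.org/projects/BML">
    <bml:head type="NOD" start="12.345" end="12.789"/>
  </bml:bml>
</semaine:user-state>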
User speaking
The output of the voice activity detection (VAD) / speaking detector looks as follows; it needs no confidence. The Speaking Analyser (part of the TumFeatureExtractor) outputs messages when the user starts or stops speaking. These are low-level messages, created directly from the VAD output and smoothed only over 3 frames. Thus, other components must apply some thresholds to reliably detect continuous segments in which the user is really speaking.
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.3">
    <semaine:speaking xmlns:semaine="http://www.semaine-project.eu/semaineml" statusChange="start"/>
  </emma:interpretation>
</emma:emma>
Possible values for /emma:emma/emma:interpretation/semaine:speaking/@statusChange : start, stop
Pitch direction
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation emma:offset-to-start="12345" emma:duration="444" emma:confidence="0.3">
    <semaine:pitch xmlns:semaine="http://www.semaine-project.eu/semaineml" direction="rise"/>
  </emma:interpretation>
</emma:emma>
The core difference from the user state representation is that here we have a start time and a duration.
Possible values for /emma:emma/emma:interpretation/semaine:pitch/@direction : rise, fall, rise-fall, fall-rise, high, mid, low
Gender
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.3">
    <semaine:gender name="female" xmlns:semaine="http://www.semaine-project.eu/semaineml"/>
  </emma:interpretation>
</emma:emma>
Possible values of /emma:emma/emma:interpretation/semaine:gender/@name : male, female, unknown
Nonverbal vocalisations
Any non-verbal vocalizations produced by the user.
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.3">
    <semaine:vocalization xmlns:semaine="http://www.semaine-project.eu/semaineml" name="(laughter)"/>
  </emma:interpretation>
</emma:emma>
Possible values for /emma:emma/emma:interpretation/semaine:vocalization/@name : (laughter), (sigh), (breath)
Face presence
Whether there is a face currently present.
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.3">
    <semaine:face-present xmlns:semaine="http://www.semaine-project.eu/semaineml" statusChange="start"/>
  </emma:interpretation>
</emma:emma>
Possible values for /emma:emma/emma:interpretation/semaine:face-present/@statusChange : start, stop
Action units
Any action units recognised from the user's face.
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:group>
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.3">
      <bml:bml xmlns:bml="http://www.mindmakers.org/projects/BML">
        <bml:face au="1"/>
      </bml:bml>
    </emma:interpretation>
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.4">
      <bml:bml xmlns:bml="http://www.mindmakers.org/projects/BML">
        <bml:face au="2"/>
      </bml:bml>
    </emma:interpretation>
    <emma:interpretation emma:offset-to-start="12345" emma:confidence="0.2">
      <bml:bml xmlns:bml="http://www.mindmakers.org/projects/BML">
        <bml:face au="4"/>
      </bml:bml>
    </emma:interpretation>
  </emma:group>
</emma:emma>
Possible values for /emma:emma/emma:interpretation/bml:bml/bml:face/@au : a single integer number