Original PDF Flash format numediart-Quartely-Progress-Scientific-Report,-Vol.-1-No.-4-...  


Numediart Quartely Progress Scientific Report, Vol. 1 No. 4 ...


Published online by:
Faculté Polytechnique de Mons (FPMs)
Laboratoire de Théorie des Circuits et Traitement du Signal (TCTS)
http://tcts.fpms.ac.be
Université Catholique de Louvain (UCL)
Laboratoire de Télécommunications et Télédétection (TELE)
http://www.tele.ucl.ac.be
Credits:
Editors: Thierry Dutoit (FPMs/TCTS), Benoît Macq (UCL/TELE)
Cover photo: Loïc Reboursière
LATEX editor: Christian Frisson (UCL/TELE), using LATEX’s confproc class (by V. Verfaille)
All copyrights remain with the authors.
numediart homepage: http://numediart.org
Contact: contact@numediart.org

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
Preface
numediart is a long-term research program centered on Digital Media Arts, funded by Région Wallonne,
Belgium (grant N◦716631). Its main goal is to foster the development of new media technologies through
digital performances and installations, in connection with local companies and artists.
numediart is organized around three major R&D themes:
• HyFORGE - Hypermedia Navigation: Information indexing and retrieval rely classically on con-
strained languages to automatically describe contents and allow formulating queries, respectively.
This approach becomes hardly applicable for multimedia contents such as music or video because
of the disparity between computable low-level descriptors and desired high-level semantics - the
so-called semantic gap. Alternatively, HyFORGE investigates human-in-the-loop approaches and
innovative tools for structuring and searching multimedia contents. Along with audio and image
processing, HyFORGE builds up on self-organizing models to derive enhanced views of multimedia
collections and provide users with efficient browsing interfaces.
• COMEDIA - Body & Media: COMEDIA is named from a French contraction between body and
media or stage director and media, which nicely sums up the main objective of this axis: giving to
bodies the means to be their own artistic director! Hence based on position on stage or choreography
between multiple artists for the inter-relationship and gestures or voice for the intra-relationship, CO-
MEDIA aims at creating interactivity between performing artists and the multimedia context around.
Event description, low-level feature analysis, pattern recognition, heterogeneous sensor fusion, ro-
bustness against lighting and real-time are our keywords in 1D, 2D and 3D signal processing to reach
these goals.
• COPI - Digital Instruments Design: COPI aims at developing a software/hardware toolbox for cre-
ating innovative digital musical instruments, from scratch or by augmenting existing instruments
with new interactive channels. The main challenges for this R&D axis are to produce expressive
instruments which maintain a close, embodied relationship with the musician. Our approach is to
produce new sound design architectures using a large database of pre-recorded signals while main-
taining real-time control of the design process. Our scientific work therefore implies three main axes:
the development of expressive production models (audio signal processing), followed by the design
of gestural control systems for their synthesis parameters, coupled with statistical modeling of this
dynamic control.
numediart is the result of collaboration between Polytech’Mons (Information Technology R&D pole)
and UCL (TELE Lab), with a center of gravity in Mons, the cultural capital of Wallonia. It also benefits
from the expertise of the MULTITEL research center on multimedia and telecommunications. As such, it is
the R&D component of MONS 2015, a broader effort towards making Mons the cultural capital of Europe
in 2015.
iii

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
This fourth session of numediart projects was held from October to December 2008.
During a one-week workshop from Dec 1st to Dec 5th , hosted at Multitel, Mons, Belgium, the
participants blended their efforts as a midway boost to the projects.
The session ended with a public presentation of the results (with demonstrations) at Multitel, Mons,
on Friday, January 23th, 2009.
iv

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
Projects
Session #4 (Oct-Dec 2008)
119 Project #4.1: Audio Cycle
Stéphane Dupont, Nicolas d’Alessandro, Thomas Dubuisson, Christian Frisson, Raphaël Sebbe, Jérôme Urbain
129 Project #4.2: Dancing Viola
Todor Todoroff, Frédéric Bettens, Wen-Yang Chu, Loïc Reboursière
147 Project #4.3: Augmented Visual Studio
Matei Mancas, Michel Bagein, Nicolas Guichard, Sullivan Hidot, Caroline Machy, Sidi Mahmoudi, Xavier Siebert
v

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
vi

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
AUDIOCYCLE: BROWSING MUSICAL LOOPS LIBRARIES
Stéphane Dupont 1, Nicolas d’Alessandro 1, Thomas Dubuisson 1, Christian Frisson 2, Raphaël Sebbe 1, Jérôme Urbain 1
1 Laboratoire de Théorie des Circuits et Traitement du Signal (TCTS), Faculté Polytechnique de Mons (FPMs), Belgique
2 Laboratoire de Télécommunications et Télédétection (TELE), Université Catholique de Louvain (UCL), Belgique
ABSTRACT
to navigate through it. The work presented here envision a con-
venient and fast way of exploring large audio and music extract
This paper presents AudioCycle, a prototype application for brows-
libraries, relying on similarity analysis and musically relevant au-
ing through music loop libraries. AudioCycle provides the user
dio analysis and transformation, involving rhythm, harmony and
with a graphical view where the audio extracts are visualized and
timbre [24, 34, 23, 46]. Other features that are specific to the ap-
organized according to their similarity in terms of musical prop-
plication to music content have also been implemented. Loops
erties, such as timbre, harmony, and rhythm. The user is able to
can be played in a beat-synchronous fashion, relying on a phase
navigate in this visual representation, and listen to individual audio
vocoder algorithm to adjust the tempo without affecting the pitch.
extracts, searching for those of interest. AudioCycle draws from a
AudioCycle differs from previously published approches. Com-
range of technologies, including audio analysis from music infor-
pared to SoundTorch [20] for instance, it provides a range of audio
mation retrieval research, 3D visualization, spatial auditory render-
analysis approaches (not only timbral) and enables to weight their
ing, audio time-scaling and pitch modification. The proposed ap-
importance. On the usability and rendering side, AudioCycle en-
proach extends on previously described music and audio browsers.
ables to play synchronously any combination of loops, even if they
Concepts developped here will be of interest to DJs, remixers, mu-
are very distant on the similarity map.
sicians, soundtrack composers, but also sound designers and foley
The paper is organized to describe the different blocks of the
artists. Possible extension to multimedia libraries are also sug-
AudioCycle architecture, schematized in Figure 1. First, an Au-
gested.
dio Analysis (2) is performed on a set of loops (1) loaded by the
user. It consists of extracting features representative of the musi-
cal properties of rhythm, harmony, and timbre. This is presented
1. INTRODUCTION
in Section 2. The specificities of the algorithms that have actu-
ally been implemented are also described. This step generates a
Music production is more and more relying on pre-recorded mate-
database gathering features and meta-data (including tempo and
rial. One of the dominant musical instruments synthesis paradigm
key found in some audio file formats) for each loop (3). Section 3
is based on libraries of high-quality recordings of real instruments,
is interested by Visualization (4) techniques that can help users
covering their different articulations and expressive modes, lead-
to navigate and search inside large digital media libraries. These
ing to very large synthesis libraries, close to the gigabyte for a
techniques rely on the previously created Features Database (3),
single instruments, with physical modeling probably being at the
and approaches for mapping this large dimensional space to a 2D
other extreme in terms of computation/memory tradeoff [36]. Also,
or 3D structured representation of the library content are explored.
the production of many music styles heavily relies on pre-recorded
Users can interact with this representation, influencing the subse-
musical phrases, organized into libraries. This is most apparent for
quent 3D Video Rendering (5) - also addressed in Section 3 - and
instance in styles like hip-hop or remixing, where these phrases,
3D Audio Rendering (6), which spatializes, synchronizes and plays
often referred to as “loops” (for instance drum loops), are indeed
the selected loops - presented in Section 4. Section 5 provides an
looped, sequenced, and mixed together to form full audio tracks [40]
overview on the development started to improve the user interac-
and compositions. Third, granular synthesis techniques have also
tion. A discussion, including improvements to be implemented
been successfully used in music creation, as a way for discovering
in terms of visualisation, and suggestions for future activities, is
new musical textures, or in live settings [38, 53].
finally proposed in Section 6.
However, the tools available today for browsing through large
musical libraries hinders the creative process. Indeed, loops can
be searched through rigid file system hierarchies, or through the
use of symbolic descriptors (music style, instruments used) stored
as meta-data in the library. Also, this processes, being mostly off-
line, can not be used during performances. In the music perfor-
mance area, some DJ products provide assistance to the performer,
for instance through real-time tune synchronization capabilities,
but searching through the library is still being done the usual way
[49]. More generally, with the growing availability of multime-
dia content, there is a still larger demand for more flexible and
efficient tools to access content and search for data. Information
indexing and retrieval can rely on automatic technologies to de-
Figure 1: AudioCycle Architecture
scribe contents on one hand, and on the other hand allow formu-
lating queries, and structure the media database to help the user
119

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
2. MUSIC ANALYSIS
- which can also serve to decompose a musical chunk into bars,
beats, etc. To estimate these onset instants, it is common to com-
To characterize music chunks and organize them according to their
pute the Perceptual Spectral Flux (PSF) [23], which can be seen as
similarities, a set of relevant mathematical descriptors is needed.
an integration of the local derivatives of the spectrogram frequency
They will be introduced hereunder. Then, methods for computing
bins. The algorithm proposed in [24] for computing the PSF, also
an important rhythmic notion, the tempo (expressed in Beats Per
called “onset function” is:
Minutes - BP M), and estimating the downbeat positions in a song
will be described. Localising the downbeats is useful for appropri-
• To compensate for human ear attenuation, weight the power
ately cutting a music file into bars, which are not always similar,
spectrum of each frame with the inverse of the Equal Loud-
and synchronizing musical segments. To conclude this section, re-
ness Curve (EQL) defined at 40 phons [21] (see Eq. 1 where
sults of experiments assessing the relevance of the chosen timbral
f denotes the frequency).
features are presented.
EQL(f) = (
f2
)2 ∗ ( f2 + 1.44 ∗ 106 ) (1)
f2 + 1.6 ∗ 105
f2 + 9.61 ∗ 106
2.1. Feature Extraction
Following the approach of music analysis proposed in [8], fea-
• Compute the Perceptual Spectral Flux:
tures providing information about the timbre, the harmony and the
N
rhythm of music excerpts are extracted from frames whose length
b/2
X
is 30ms and whose hopsize is 10ms.
P SF (n) =
((akn)1/3 − (akn−1)1/3)
(2)
k=1
2.1.1. Timbre
where n stands for the frame index, Nb for the number of
frequency bins and ak
Timbre is defined as the quality of tone distinctive of a particular
n for the compensated power spectrum
at the kth frequency bin of current frame n. The exponent
singing voice or musical instrument [27]. It is linked to the relative
1/3 is introduced to simulate the intensity-loudness power
intensities of a sound harmonics, independantly from its pitch and
law. As in [12], the power spectrum was obtained by ap-
intensity. It has been shown in many studies ([23, 24, 33]) that tim-
plying a Mel filterbank decomposition on top of the usual
bre can be efficiently characterized by the instantaneous spectral
Fourier Transform.
envelope of the sound, using a non-linear frequency scale. A pop-
ular non-linear frequency scale is the Mel scale, built to model the
In [24] and [8], the rhythmic information is computed by cal-
human perceptions of pitches. Based on this scale, the frequency
culating local autocorrelation functions on overlapping frames of
content of a frame can be summarized by the Mel-Frequency Cep-
the PSF.
stral Coefficients (MFCCs) [52]. To characterize the timbre in our
Another approach has been designed here, based on the no-
application, we use a filterbank of 20 filters covering the frequency
tions of bar, beat and tatum, which are the onset events related to
range between 0Hz and the Nyquist frequency (22050Hz for the
the perception of rhythm (see Section 2.2). We have chosen to
music excerpts we used), and keep only the first 12 MFCCs (+ the
save as rhythmic features the onset values in the neighbourhood
energy).
of the beats and tatums positions for one bar. The window length
is set to 11 frames: PSF values are summed from 5 frames before
and until 5 frames after the metrical event position. Negative onset
2.1.2. Harmony and Melody
values are discarded from the sum, since rhythmic events are per-
Harmony can be defined as the combination of simultaneous mu-
ceived when a particular sound is striking, not disappearing. Since
sical notes in a chord [27]. In addition, the melody is defined as
beats are mainly characterized by low-frequency contributions of
a rhythmic succession of single tones organized as an aesthetic
the onset function, it has been decided to split the onset function
whole [27]. In consequence, to characterize the harmony and melody
in three frequency bands: low frequencies (0 to 100 Hz), medium
of a musical signal, features revealing the presence of particular
frequencies (100 to 8000 Hz) and high frequencies (8000 Hz to the
notes, their evolution and combination must be extracted. Since
Nyquist frequency).
notes are defined by frequency values, harmony and melody anal-
A 4/4 rhythmic signature is assumed, meaning a bar contains
ysis is generally performed in the frequency domain. A widely
4 beats - the first one being the downbeat - and 4 tatums, located
used method is to project the power spectrum of each frame on
halfway between two beats. Thus, to capture the beat and tatum
a chromatic scale, which decomposes each octave into 12 equally
information of one 4/4 bar, features around 8 particular instants
broad bands on a base 2 logarithmic scale, called “semitones”. The
must be computed. The vector characterizing the rhythm of each
“chromas” characterizing the harmony and melody of a frame are
music excerpt is in consequence composed of 8 values for each
then the 12 sums of the energetic contributions of each semitone
frequency band. The frequency band decomposition of the onset
on all the (considered) octaves. They represent the combination of
function and mask applied for computing the rhythm features are
notes played at a given time, independently from their octave.
illustrated in Figure 2. It can be seen that the mask - generated
independently from the onset function - will catch the main peaks
2.1.3. Rhythm
of the onset functions, corresponding to beats and tatums. Only the
first 8 steps of the mask are used to compute the rhythm features.
Rhythm is defined as the aspect of music comprising all the ele-
The described method requires the knowledge of the downbeat
ments (as accent, meter, and tempo) that relate to forward move-
position and the interval between two successive beats. The latter
ment [27]. From a signal processing point of view, it is closely
can be easily computed from the Beats Per Minute (BPM) value,
related to the notion of periodicity [8]. Various analyses of a sig-
frequently available as meta-data with professional music files.
nal periodicity can be proposed. For characterizing the rhythm of
Loops used for DJing applications generally start at the downbeat,
musical excerpts, one popular way is to determine onsets instants
solving the problem of localizing this event. When the BPM and
120

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
2.2.1. State of the Art
The onset function (see Section 2.1) is commonly used to extract
the tempo (expressed in BPM). The method proposed in [12] esti-
mates it by computing the autocorrelation of the onset function and
by favoring autocorrelation values located in a range defined by the
user. It is performed by weighting the autocorrelation function by a
Gaussian window centered around the most likely BPM value and
whose spread is set by the user. Once the largest peak is located
in the weighted autocorrelation, a second value of BPM is found
as the largest value of autocorrelation at 0.33, 0.5, 2 or 3 times
the first value of BPM. This system was tested on the database of
MIREX 2006 contest [29] and achieved 77 % of agreement with
the manual BPM.
Among the other beat tracking methods based on the onset
function, the system exposed in [32] computes the onset function
from a reassigned spectrogram, in which the energy of the bins
(at frequency ωk at time ti) are reassigned to the frequency ωr
Figure 2: Top 3 graphs: band frequency decompositions of the
and time tr corresponding to their center of gravity. BPM is then
onset function (normalized amplitudes). Bottom graph: Mask ap-
estimated by combinating periodicity measures of the onset func-
plied for computing the rhythm features.
tion (Discrete Fourier Transform and Autocorrelation). In [15] the
structure of a musical excerpts (in terms of bar, mid-bar and beat)
is estimated by computing an onset function from different fre-
quency bands in the spectrogram. Chord-change possibilities and
downbeat position are a priori unknown, it is still possible to com-
drum patterns are then estimated at beat locations and the structure
pute them and use the proposed method for extracting the rhythm
of the music excerpt is inferred from these informations.
features thanks to the algorithms that will be described in section
Fewer studies focus on the difficult problem of downbeat es-
2.2.
timation altough it has a perceptual sense for us. A method for
downbeat extraction based on beat locations and BPM is presented
in [9]. The signal is cut into segments synchronized at beat loca-
2.1.4. Summarizing the information
tions and whose length is the beat period. The difference between
the spectrum of successive segments is then computed. For each
The targeted application must be able to deal with thousands of
bar, the downbeat is located where value of this difference is high-
musical extracts, fastly compare them and display them in a suit-
est. In [22] Support Vector Machines are trained in order to predict
able way (see Section 3). This objective has two implications on
the downbeat locations by using spectral features synchronized on
the feature sets we provide. First, the characteristic vectors must
the local maxima of the onset function. In the testing phase, the
be of reasonable length, otherwise computation times would dra-
system takes as input the spectral features of several music seg-
matically increase and the application would not be efficient in
ments and provides an estimation of the downbeat locations in the
real-time. Second, characteristic vectors of each musical excerpt
next segments.
should have a constant length, independently from its duration, in
order to facilitate comparisons of the signal properties.
2.2.2. Implementation
Regarding the rhythm characterization, the method aforemen-
As the onset function appears to be very popular for beat period
tioned already provides a constant number of 24 rhythmic features
estimation, it has been chosen in the present study. Onset func-
per music excerpt. However, as the number of timbre and chroma
tions from [12, 10, 15, 32] have been compared and appeared quite
features is proportional to the music excerpt duration, this informa-
similar. Because Ellis’ method provides good performance and be-
tion is summarized into fixed size vectors by storing only the mean
cause the whole source code is provided on his website, it has been
and variance of each timbre and harmony feature. It is shown in
chosen for beat period estimation in our system.
section 2.3 that even if information is lost when summarizing, the
Concerning the downbeat estimation, a method inspired by
mean and the variance of the features still contain enough infor-
previous litterature [22] has been developped. The idea is to find
mation to efficiently recognize the kind of instruments involved in
the best alignment between a template of bar, beat and tatum lo-
the extract, which is the aim of timbre characterization.
cations and the onset function. As in [22], it has been chosen to
consider only templates for 4/4 time signature in this study.
The template is built using the estimated BPM value. An
2.2. Beat and Downbeat Estimation
example is shown in Fig.3 for 144 BPM. The best alignment is
achieved by computing the cross-correlation between the onset
In the Music Information Retrieval domain, the notion of "beat" is
function and a lagged version of the template. This correlation is
often expressed by musicians and listeners through the act of foot
computed for lags from 0 to one bar period. The lag corresponding
tapping along to the music. Downbeat is defined as the first beat
to the highest correlation value is used to set the downbeat.
mark at the beginning of a bar.
The algorithms implemented for computing the beat period
Beat tracking systems form the basis of applications like auto-
and localizing the downbeat position also enable us to perform
matic accompaniment, transcription, computer-assisted audio edit-
“slicing” (i.e. decomposition into bars) of audio files. Not only
ing and evaluation of music similarity.
loops but also full songs can thus be analyzed by the application.
121

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
error). If X represents the normalized features matrix, the new
features matrix Z is obtained by
Z = UT X
(4)
where U is the linear transformation matrix. The matrix U leading
to the best final representation consists of the eigen vectors of the
covariance matrix XXT . The dispersion of features around each
new axis of representation is given by its associated eigen value.
A reduction of features dimensionality is possible by selecting the
axis of representation associated to the highest eigen values.
2.3.3. Results
Figure 3: Template for 144 BPM
Three ways of reducing the dimension of features vector have been
investigated: applying the Fisher criterion alone, applying PCA
alone and applying Fisher criterion followed by PCA. These dif-
The algorithm designed for slicing consists of first cutting the files
ferent approaches have been applied on the mean and variance of
into 15s-windows (with a 10s-overlap), then launch the computa-
the timbre features. For each configuration, a Multi-Layer Per-
tion of the beat period and finally localize the downbeats position
ceptron was trained with 60% of the objects, validated with 10%
in the central 5s of the window. Resulting bars are then processed
and tested with the remaining 30%. The operation was repeated
by the application as if they were loops.
100 times and the objects were always randomly selected. Table 1
shows the average classification performance for each configura-
2.3. Instruments Classification
tion of features.
In order to define which set of features is the most appropriate
for the similarity estimation and visualization in AudioCycle, their
Table 1: Performance of classification (in %)
performance for classifying instruments has been assessed. The
mean and variance of the timbre features have been considered in
Description
Mean
Var
this evaluation.
No selection
73.1
50.8
6 most discriminant after Fisher
66.2
52.1
2.3.1. Database
2 most discriminant after Fisher
54.3
47.2
No selection after PCA
93
70.9
The features are extracted from musical loops of the ZeroG ProPack
6 most discriminant after PCA
85
67.7
ACID database [51]. This database is a product of ZeroG Ltd. and
2 most discriminant after PCA
66
52.6
contains 11428 loops and samples of various instruments in di-
No selection after Fisher and PCA
93
68.1
verse musical styles. We manually tagged the files using 7 instru-
6 most discriminant after Fisher
86.9
59.8
ments classes: Brass, Drums, Vocals, Percussion, Electric Bass,
then PCA
Acoustic Guitar and Electric Guitar. Only the files for which the
6 most discriminant after Fisher
75
56.3
instrument class is clear have been taken into account in this study.
then 2 most discriminant after PCA
2 most discriminant after Fisher
64.1
51.9
2.3.2. Dimension Reduction
then 2 most discriminant after PCA
Before classifying the feature vectors, they have to be manipulated
in order to assess if their dimension can be reduced. For this pur-
pose, the Generalized Fisher criterion and Principal Component
Classification performance for timbre information is very good
Analysis (P CA) have been tested.
(up to 93%), supporting our decision to summarize the timbre
The Generalized Fisher criterion [14] is a class separation cri-
information 2.1.4 and confirming that these vectors still contain
terion based on the features and the class labelling. For a given
enough information to distinguish instruments with a very high
feature k, its discriminant power between C classes is defined as
accuracy. Since performance is significantly better using means
P
than variances and further tests shown that no improvement was
C
D
c=1 p(ωc)(µck − µk)2
achieved by combining means and variances, only timbre means
k =
P
(3)
C
are finally used for the AudioCycle application. Similar tests are
c=1 p(ωc)σ2
ck
under way to assess the opportuneness of summarizing the chroma
where p(ωc) stands for the percentage of representation of class c
information the same way.
in the database, µck for the mean of the feature k in the class c, µk
the mean of feature k for all the classes and σck for the standard
deviation of feature k in the class c. A dimension reduction is
3. VISUALIZATION
possible by selecting the features associated to the highest values
3.1. Visualization Techniques
of discriminant power.
Principal Component Analysis (P CA) [14] prepares the fea-
The representation of large multimedia database has been a subject
tures in a classification system by linearly transforming them in
of research in various cases of media content. As stated in [45], the
order to find the best representation space (in terms of least square
visualization has to meet three requirements: overview (faithful
122

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
overview of the distribution of excerpts in the collection), struc-
3.1.3. Clustering
ture preservation (the relations between music excerpts should be
preserved in the projection in the visualisation space) and visibil-
Clustering is an other way to organize a large amount of data. An
ity (visually understanding of the content of each music excerpts).
original approach has been developed here. It relies on a K-Means
Thus if the similarity between music excerpts is used, the ideal is
algorithm. Clusters of loops are built by iteratively constructing
to build a representation in which the similarity distances are pre-
partitions through association of each loop to the closest centroid.
served, what our choice of the node-link visualization paradigm,
The Euclidian distance has been chosen and based on the extracted
great for emphazising two Gestalt laws (similarity and connected-
features defined earlier. The emphasis of the clustering can be
ness) [48], ensures.
changed by the user by scaling differently the feature space di-
When displaying the whole collection of music excerpts is de-
mensions related to timbre, harmony, and rhythm. The number
sired, the idea is to associate to an object x
of cluster is chosen in advance by the user and, in order to pre-
i a vector yi aiming
at transposing the similarity between x
serve near real-time interaction, the clustering is stopped before
i and its neighbours in a
2D or 3D representation. Three kinds of approaches are described
convergence after a limited number of iterations. The initial clus-
hereafter.
ter centers are randomly chosen among the loops but this random
sequence is made reproducible to provide back and forth naviga-
tion.
3.1.1. Dimensionality Reduction
Clusters can be represented visually is a flower-like scheme,
The goal of this approach is to preserve the distance between points
where the center is the currently selected loop and other loops are
in a high dimensional space when they are projected in a subspace
gathered into groups (clusters) around the selected one. The 2D
of two dimensions. This can be achieved by e.g. PCA or by Linear
radial distance between a loop and the selected centered element
Discriminant Analysis (LDA) [11]. This latter projects the fea-
is proportional to the distance between them (in a feature space
tures vectors from a d-dimensional space to a D-dimensional space
with scaling factors complementary to the one used for clustering)
(with D < d) with the constraint to maximize the ratio between
, while the distance between the loop and the cluster center (in the
the between-class covariances and the within-class covariances of
scaled space used for clustering) is used to compute the 2D angular
the new features. This is done by applying the linear transforma-
coordinate to locate it in the cluster. This scheme allows to explore
tion
organization schemes where the radial and angular coordinates are
Z = W T X
(5)
associated to complementary musical properties, e.g. timbre and
rhythm.
where W is the transformation matrix whose dimension is d × D.
Among the LDA techniques, one can cite the Fisher method
which consists on finding C − 1 linear discriminant functions in a
3.2. 3D Imaging, OSG and OpenGL
problem of C classes. It can be proved that the columns of W that
Visualization of the audio loops is made with Open Scene Graph
maximises the ratio between the between-class covariance and the
(OSG) [31], a scene graph visualization library running on top of
within-class covariance correspond to the highest eigen values of
OpenGL and permiting the definition of high level objects. These
objects can either render geometry or represent 3D transforma-
SBWi = λiSW Wi
(6)
tions, and, when associated as a graph, they define a rendered im-
where W
age.
i stands for a column of W and λi for the ith eigen value.
As only C − 1 eigen values are different of 0, the new system of
The scene graph is built by adding cubes representing each
representation consists of only C − 1 dimensions.
loop. Each of these cubes has a parent node that specifies its po-
sition, which can change over time, when the user changes the
clustering properties for instance. Loops can be activated by click-
3.1.2. Graphs and Multidimensional Scaling
ing on the associated objects (cubes) with the mouse. This process
When using graphs for visualization, each excerpt is represented
is known as picking and is implemented in the usual OSG way
by a vertex in the graph and the arcs between neighbouring ex-
by computing intersection with a ray fired at the mouse location.
cerpts are used for navigation. The objective of the graphs is to
When a loop is activated, the scene graph is enhanced with a wave-
preserve the neighbourhood of excerpts and to enhance data struc-
form display as well as a curser indicating the current playback
tures. For retrieval and analysis purposes, the radial tree structure
position. A screenshot of the Audiocycle OSG view is displayed
and global graph are particularly adapted. The radial tree structure
in Fig. 4. OSG will allow us to extend this work to 3D representa-
provides a focus on a part of a graph, analogous to a hierarchi-
tions very naturally.
cal representation. The global graph aims at representing a large
amount of excerpts with various levels of details, but with a larger
4. AUDIO RENDERING
point of view than radial tree structures.
Multi Dimensional Scaling (MDS) aims at trading with the
In AudioCycle, several audio rendering features have been de-
high dimension of data by using graphs. The objective is to iter-
signed to facilitate the browsing though large libraries of audio
atively optimize a function that measures the difference between
loops. First, the loops are positioned in space in relation with the
the distance of two excerpts in the 2D space and their original
position of their visual representation on the screen. This tight cou-
dissimilarity. in order to build a correct visualization. The orig-
pling between the audio and the image is expected to make it easier
inal method requiring a high computational cost, other approaches
to locate the area of interest in the visual representation [13, 20],
have been developed. Among them, one can cite Isomap, aiming at
when the user is looking for specific sounds while several loops
clustering first the data before applying MDS or Visumap, which
are playing together. This also allows to plays simultaneously sev-
maps the excerpts in a 2D map so that Euclidian distance visually
eral sounds. This has been implemented using OpenAL (described
approaches as much as possible the dissimilarities of the excerpts.
hereafter) with a specific extension providing HRTF-based (Head
123

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
applications [47]. We have inspired ourselves by the techniques
developped in these fields.
5.1.1. Available modalities and controllers
Here is an overview of miscellaneous modalities, or ways to in-
teract with a computer, that have been employed in applications
dedicated to navigation in multimedia databases:
• 3D positioning with remote controllers [20, 43] such as the
cheap Nintendo Wii;
• Glove [19], by wavering fingers;
• Query by beat-boxing [25], by indicating rhythm and gross
Figure 4: AudioCycle OSG View
timbre mimicked by the mouth;
• Haptics and force-feedback [4, 1], controllers that response
to the user movements by applying forces conditionning it;
Related Transfer Functions) rendering. Then, as we are dealing
• Tactile feedback [28], vibration sensed through the skin;
with music content, it makes sense to provide the application with
facilities to allow beat synchronous playback (through audio time-
• Tangibles [2, 5, 16, 17], in other words objects that can be
scaling) of loops that have originally not been played at the same
grasped, moved, manipulated.
tempo, and key normalization (through pitch shifting) for loops
that have originally not been played in the same keys. This is ex-
5.1.2. Bimanual interaction
pected to reduce the cognitive load when several loops are playing
together, hence facilitating the search for specific sounds. Even-
The gestural control for the navigation in an audio database has
tually, this may also provide the basis for extending AudioCycle
mostly been introduced and improved by DJ’s [42, 49] and the
to a performance tool, where live composition and performance is
development of these practises is still ongoing [26], with a tight
made possible though beat and key matching.
relation to the use of the hands.
OpenAL [30] is a cross-platform audio API that has been de-
Now that a vast content of audio files can be stored on the
signed for rendering of multichannel 3D audio. Designed with a
computer, we can define two modes of interaction for the use case
style similar to OpenGL, it has mostly been used for 3D modeling
of AudioCycle:
and game engines. The API allows to define source objects and a
1. navigation: the action of selecting and annotating files to be
listener. Sources can be specified using several properties, includ-
stored in a library and/or to be retrieved from it;
ing position, velocity, sound direction and intensity. The listener
is also characterized by properties including position, velocity, di-
2. performance: the action of listening to one or more files
rection, as well as a global gain. The buffers containing the audio
with an optional addition of real-time temporal and/or spec-
samples in PCM format are associated to the sources by other API
tral effects for both artistic/aesthetic and mining purposes.
calls. The OpenAL engine then performs the rendering and sim-
Bimanual interaction, the use of both hands to perform tasks,
ulates physical effects such as the attenuation caused by distance
is deeply rooted in the history of musical instruments and is be-
between the sources and the listener, as well as the Doppler effect.
ing applied in the computer music field, for instance in the fol-
Additional functionalities can be provided through extensions
lowing recent works [7, 37]. To design the user interaction for
to the API. Among the available plugins, an AU (Audio Unit) plu-
the AudioCycle project, we have decided to focus on and study
gin making use of HRTF when rendering for a stereo output has
this paradigm: the hands can be assigned, according to the CARE
been used in the application. This conveys an enhanced realism to
properties [6], each to one of the two interaction modes (one hand
the sound scene.
selects an audio file, the other performs an audio file); or freely
Coming back to sources, it’s of significant importance to note
used for any mode. In the case of an audio mining task, in order
that they be of two types: static or streaming. As static sources
to obtain the fastest search possible, assigning each hand to one
are not appropriate for applications where expressive control is re-
mode could help to cognitively accelerate the process. In the case
quired, streaming sources have been preferred. It allows to pass
of a musical performance, swapping hands on controllers is left to
buffers that are queued for real-time playback. Since the API does
the performers’ will.
not provide a callback mechanism that requests audio buffer, this
aspect has hence to be managed by the application. In this work,
this is done using a thread with the necessary priority level to pe-
5.2. Our prototype
riodically check wether buffers will be needed by the OpenAL en-
gine.
5.2.1. Chosen controllers
We have decided to try using two categories of gestural controllers
5. USER INTERACTION
to build our hardware interface, as illustrated in figure 5:
1. 3D mice (3Dconnexion Space Navigator) to navigate in the
5.1. Gestural input
library;
Gestural input theory has been long studied, concerning generic
2. “Jog wheels” (Contour Design Shuttle) to alter the time and
computerized applications [3] to more specific computer music
pitch of the listened files.
124

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
For example:
• /audiocycle/1/browse/2/grab 1 selects the clos-
est audio file to the position set by navigation controller
number 2 on the graph,
• /audiocycle/1/skim/3/play plays the audio file pre-
viously selected and assigns to it performance controller
number 3,
• /audiocycle/1/browse/2/recluster reorganizes
the view around the audio file selected previously,
Figure 5: Gestural controllers chosen to be tested in AudioCycle:
• /audiocycle/1/skim/3/time 0.5 repositions the
3D mice on the left and jog wheels on the right
playhead to half of the duration length of the selected audio
file,
For an audio mining task, one of each can be featured on the
• and so on...
setup and assigned to one hand, to reduce the span of the user
At the current stage, AudioCycle supports interaction with a
movements on his desktop. In the case of a musical performance,
single user (<agent_id> is set to one) but the namespace is open
the performer could decide to use as many 3D mouse / jog wheel
to the implementation to collaborative and concurrent use.
couples as necessary in order to be able to perform a certain num-
ber fo loops simultaneously.
Additionnaly, standard sliders available on “MIDI fader boxes”
5.2.3. Prototyping the user interface
can be assigned to set the percentage of similarity of each feature.
To quickly implement and verify gestural control prototypes, mod-
5.2.2. OSC namespace
ular environments such as PureData [35], using [hid] and [hidio]
objects under Linux and OSX [41], or OpenInterface [44] under
In order to prototype the user interface of Audio Cycle, that is test-
Linux and Windows, soon OSX, are being considered for the rapid
ing if the use of the aforementionned hardware controller couple
prototyping. Both environments can act as OSC server and gener-
is relevant, and trying other controllers, we have started imple-
ate OSC messages to be sent to the AudioCycle application. This
menting an OpenSoundControl (OSC) [50] bridge in the Audio-
configuration allows to study various mappings [18], or ways to
Cycle application, so that to allow pre-formated messages to be
interconnect gestural parameters and application events, applied
sent through a network (that can be hosted on the same computer),
to our use case.
from the interaction prototyping application to the AudioCycle ap-
plication that will interpret theses messages.
As it is now done in many applications requiring gestural in-
6. DISCUSSIONS
put [39], we have defined our own OSC namespace to control the
AudioCycle application as depicted in figure 6:
Some aspects of the visualisation require further study [48], for in-
stance, respecting the symetry of the 3D node-link diagram paradigm
and condensing the mapping of the waveforms around each loop
representation.
In order to achieve the fastest possible audio mining, we initi-
ated a user-centered usability approach, featuring context enquiries
with experts (sound designers [36], DJ’s [49], etc...) and usability
tests with software/hardware prototypes, leading to the definition
and support of different user and hardware profiles.
This work is part of an ongoing series of projects intending to
extend our approach to multimedia content (images, videos, text,
etc...) with scalability to very large archives.
7. CONCLUSIONS
A novel approach and prototype implementation of a music loop
library browser has been presented. It allows to display the music
extracts organized on a map according to timbre, harmony, rhythm,
or a combination of these. Spatial sound is also used to enhance
the experience when playing several extracts at the same time.
Features relevant to music production have also been integrated,
Figure 6: Mindmap view of the AudioCycle OSC namespace
including the possibility to play the loops in a beat-synchronous
fashion, as well as alter their pitch. Some avenues for extending
and improving the concept and technologies involved are also pro-
posed.
125

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
8. ACKNOWLEDGMENTS
[16] Kjetil F. Hansen, Marcos Alonso, and Smilen Dimitrov.
“Combining Dj Scratching, Tangible Interfaces And A
This work has been supported by the numediart research project,
Physics-Based Model of Friction Sounds”. In: Proceedings
funded by Région Wallonne, Belgium (grant N◦716631).
of ICMC07. 2007. P.: 124.
[17] Kjetil Falkenberg Hansen and Marcos Alonso. “More DJ
9. REFERENCES
techniques on the reactable”. In: Proceedings of the Inter-
national Conference on New Interfaces for Musical Expres-
9.1. Scientific references
sion (NIME). 2008. P.: 124.
[1] Timothy Beamish, Karon Maclean, and Sidney Fels. “Ma-
[18] Kjetil Falkenberg Hansen and Roberto Bresin. “Mapping
nipulating Music: Multimodal Interaction for DJs”. In: Pro-
strategies in DJ scratching”. In: Proceedings of the 2006 In-
ceedings of CHI’04. 2004. P.: 124.
ternational Conference on New Interfaces for Musical Ex-
[2] Henry Bernard. “Tangible Media Browsing: Exploring And
pression (NIME06). 2006. P.: 125.
Browsing Large Multimedia Collections With A Tabletop”.
[19] Kouki Hayafuchi and Kenji Suzuki. “MusicGlove: A Wear-
MA thesis. Music Technology Group, UPF-IUA, Barcelona
able Musical Controller for Massive Media Library”. In:
and INPG-ICA, Grenoble, 2007. P.: 124.
Proceedings of the International Conference on New Inter-
[3] S. K. Card, J. D. Mackinlay, and G. G. Robertson. “The
faces for Musical Expression (NIME). 2008. P.: 124.
design space of input devices.” In: Proceedings of the
[20] Sebastian Heise, Michael Hlatky, and Jörn Loviscach.
SIGCHI conference on Human factors in computing sys-
“SoundTorch: Quick Browsing in Large Audio Collec-
tems (CHI’90). 1990. Pp. 117–124. P.: 124.
tions”. In: 125th Audio Engineering Society Convention.
[4] Lonny L. Chu. “Haptic Interactions For Audio Navigation”.
7544. 2008. Pp.: 119, 123, 124.
PhD thesis. Stanford University, 2003. P.: 124.
[21] H. Hermansky, N. Morgan, and P. D. Korn. “Auditory
[5] E. Costanza, S. B. Shelley, and J. Robinson. “Introducing
Model for Parametrization of Speech”. Pat. 5,450,522.
Audio D-Touch: A Tangible User Interface For Music Com-
1995. P.: 120.
position And Performance”. In: Proc. of the 6th Int. Confer-
[22] T. Jehan. “Downbeat prediction by listening and learning”.
ence on Digital Audio Effects (DAFX-03). 2003. P.: 124.
In: Proceedings of WASPAA 2005. 2005. P.: 121.
[6] Joëlle Coutaz et al. “Four Easy Pieces for Assessing the Us-
[23] Tristan Jehan. “Creating Music by Listening”. PhD thesis.
ability of Multimodal Interaction: The CARE properties”.
Massuchussets Institute of Technology, 2005. Pp.: 119, 120.
In: Proceedings of the INTERACT’95 conference. 1995.
[24] K. Jensen. “Mutliple Scale Music Segmentation Using
Pp. 115–120. URL: http://iihm.imag.fr/publs/
Rhythm, Timbre and Harmony”. In: EURASIP Journal on
1995/Interact95_CARE.pdf. P.: 124.
Advances in Signal Processing 2007 7 (2007). Pp.: 119,
[7] Jean-Michel Couturier and Magnolya Roy. “Grapholine,
120.
Instrument Audiovisuel De “Dessin Musical””. In: Articles
[25] Ajay Kapur, Manj Benning, and George Tzanetakis.
des Journées d’Informatique Musicale. 2008. P.: 124.
“Query-By-Beat-Boxing: Music Retrieval For The Dj”. In:
[8] Laurent Couvreur et al. “Audio Thumbnailing”. In: QPSR
Proceedings of ISMIR’04. 2004. P.: 124.
of the numediart research program. Ed. by Thierry Du-
[26] Takuro Mizuta Lippit. “Turntable Music in the Digital Era:
toit and Benoît Macq. Vol. 1. 2. http://www.numediart.org.
Designing Alternative Tools for New Turntable Expres-
2008. Pp. 67–85. P.: 120.
sion”. In: Proceedings of the 2006 International Conference
[9] M. Davies and M. Plumbley. “A Spectral Difference Ap-
on New Interfaces for Musical Expression (NIME06). 2006.
proach to Downbeat Extraction in Musical Audio”. In: Pro-
P.: 124.
ceedings of EUSIPCO 2006. 2006. P.: 121.
[28] Roderick Murray-Smith et al. “Stane: Synthesized Surfaces
[10] M. Davies and M. Plumbley. “Beat Tracking with a Two
for Tactile Input”. In: CHI 2008. 2008. P.: 124.
State Model”. In: Proceedings of ICASSP 2005. 2005. P.:
[32] G. Peeters. “Time Variable Tempo Detection and Beat
121.
Marking”. In: Proceedings of ICMC 2005. 2005. P.: 121.
[11] R.O. Duda and P.E. Hart. Pattern Classification and Scene
[33] G. Peeters. “Toward Automatic Music Audio Summary
Analysis. John Wiley and Sons, 1973. P.: 123.
Generation from Signal Analysis”. In: Proceedings of the
[12] D. Ellis. “Beat Tracking with Dynamic Programming”. In:
International Symposium on Music Information Retrieval
Proceedings of MIREX 2006. 2006. Pp.: 120, 121.
(ISMIR). Paris, France 2002. P.: 120.
[13] M. Fernström and E. Brazil. “Sonic browsing: An auditory
[34] G. Poliner et al. “Melody Transcription from Music Au-
tool for multimedia asset management”. In: International
dio: Approaches and Evaluation”. In: IEEE Transactions
Conference on Auditory Display (ICAD2001). Helsinki,
on Audio, Speech and Language Processing 15(4) (2007).
Finland 2001. P.: 123.
Pp. 1247–1256. P.: 119.
[14] K. Fukunaga. Introduction to Statistical Pattern Recogni-
[36] Martin Russ. Sound Synthesis and Sampling. 3rd ed. Mu-
tion. Ed. by Werner Rheinboldt. Academic Press - Second
sic Technology. Focal Press, 2008. ISBN: 9780240521053.
Edition, 1990. P.: 122.
Pp.: 119, 125.
[15] M. Goto. “An Audio-based Real-Time Beat Tracking Sys-
[37] To Yip Sang and Kam Wong. “Multi-User Hand Gesture
tem for Music With or Without Drums-sounds”. In: Journal
Based Musical Element Mapping With Traditional Musical
of New Music Research 30.2 (2001). Pp. 159–171. P.: 121.
Elements”. In: Proceedings of ICMC’08. 2008. P.: 124.
126

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
[38] Diemo Schwartz. “Musical Applications of Real-Time
[40] Sony
Creative
Software.
ACID.
Corpus-Based Concatenative Synthesis”. In: ICMC 2007.
http://www.sonycreativesoftware.com/products/acidfamily.asp.
Copenhagen 2007. P.: 119.
URL: http : / / www . sonycreativesoftware .
[39] Stephen Sinclair and Marcelo M. Wanderley. “Defining a
com/products/acidfamily.asp. P.: 119.
control standard for easily integrating haptic virtual envi-
[44] The OpenInterface Platform. URL: http : / / www .
ronments with existing audio/visual systems”. In: Proceed-
openinterface.org. P.: 125.
ings of NIME’07. 2007. P.: 125.
[51] Zero-G. Pro Sample Library. URL: http : / / www .
[41] Hans-Christoph Steiner, David Merrill, and Olaf Matthes.
zero-g.co.uk/. P.: 122.
“A Unified Toolkit for Accessing Human Interface De-
vices in Pure Data and Max/MSP”. In: Proceedings of the
2007 Conference on New Interfaces for Musical Expression
(NIME07). 2007. P.: 125.
[42] John Steventon. DJing for Dummies. John Wiley & Sons,
2007. ISBN: 978-0470032756. P.: 124.
[43] Rebecca Stewart, Mark Levy, and Mark Sandler. “3D in-
teractive environment for music collection navigation”. In:
Proceedings of the Conference on Digital Audio Effects
(DAFx). 2008. P.: 124.
[45] Anne Veroust-Blondet. VITALAS - State of the Art on Ad-
vanced Visualisation Methods. Tech. rep. 2007. P.: 122.
[46] H. Vinet, P. Herrera, and F. Pachet. “The Cuidado Project:
New Applications Based on Audio and Music Content De-
scription”. In: Proc. of ICMC. Göteborg, Sweden 2002.
Pp. 450–454. P.: 119.
[47] Marcelo Wanderley and Marc Battier, eds. Trends In Ges-
tural Control Of Music. Ircam - Centre Pompidou, 2000.
ISBN: 2-8442-6039-X. P.: 124.
[48] Colin Ware. Information Visualization: Perception for De-
sign. 2nd ed. Interactive Technologies. Morgan Kaufmann,
2004. ISBN: 1-55860-819-2. Pp.: 123, 125.
[49] Stephen Webber. DJ Skills: The essential guide to Mix-
ing and Scratching. Focal Press, 2007. ISBN: 978-
0240520698. Pp.: 119, 124, 125.
[50] Matthew Wright. “Implementation and Performance Issues
with OpenSound Control”. In: Proceedings of ICMC 1998.
1998. URL: http://www.cnmat.berkeley.edu/
ICMC98/papers-pdf/OSC.pdf. P.: 125.
[52] F. Zheng, G. Zhang, and Z. Song. “Comparison of Differ-
ent Implementations of MFCC”. In: Journal of Computer
Science and Technology 16 (2001). Pp. 582–589. P.: 120.
[53] A. Zils and F. Pachet. “Musical Mosaicing”. In: COST G-6
Conference on Digital Audio Effects (DAFX-01). Limerick,
Ireland 2001. P.: 119.
9.2. Software and technologies
[27] Merriam-Webster Dictionary Online. http://www.merriam-
webster.com/. Consulted on January 14, 2009. P.: 120.
[29] Music Information Retrieval Evaluation eXchange. URL:
http : / / www . music - ir . org / mirex / 2006 /
index.php/Main_Page. P.: 121.
[30] OpenAL. URL: http://www.openal.org/. P.: 124.
[31] OpenSceneGraph (OSG). URL: http : / / www .
openscenegraph.org/. P.: 123.
[35] PureData. URL: http://www.puredata.info. P.:
125.
127

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
128

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
DANCING VIOLA
Todor Todoroff 1,3, Frédéric Bettens 1, Wen-Yang Chu 2, Loïc Reboursière 1
1 Laboratoire de Théorie des Circuits et Traitement du Signal (TCTS), Faculté Polytechnique de Mons (FPMs), Belgique
2 Laboratoire de Télécommunications et Télédétection (TELE), Université Catholique de Louvain (UCL), Belgique
3 Art, Recherche, Technologie et Musique asbl (ARTeM) , Bruxelles, Belgique
ABSTRACT
Though inherently less responsive than mapping, gesture recog-
This numediart project covers some of the aspects of a long
nition (see [27] for a recent survey) is a welcome addition to our
time project, initiated by Musiques Nouvelles [29], a contempo-
interactive performance and can be used to trigger events, to adapt
rary music ensemble in Mons. Under the name "Extension du
the response of the virtual instruments according to the detected
corps sonore", it aims at giving instrumental music performers an
gestures or to move through the various steps of a performance.
extended control over the sound of their instrument by extending
The nature of our project makes it difficult for the performer to no-
the understanding of the sound body from the instrument only to
tify the start and end of a movement as dance gestures are usually
the combination of the instrument and the whole body of the per-
chained without pauses.
former. The project concerned by this research is a subset of this
As we also wanted to add more global ways of mapping pa-
wider goal and is linked to viola player Dominica Eyckmans. The
rameters while keeping the responsiveness of the system, we de-
development started at ARTeM and this Numediart project focuses
cided to improve the interpolation scheme, inspired by Allouis [1],
on three axis of research: pre-processing of sensor data, gesture
that we had developed previously on the NeXT platform [43]. We
recognition and mapping through interpolation. The objectives are
propose new tools for interpolation in 1-D, 2-D and 3-D spaces,
the development of methods and flexible Max/MSP externals that
introducing new features specially designed for interactive perfor-
will be integrated in the ARTeM software framework for the con-
mances with sensor data. This includes the definition of areas cen-
certs with viola player Dominica Eyckmans. They could be used
tered on the interpolation points where the weights stay constant
in a variety of other artistic works and will be made available on
and different input messages for mouse actions and sensor inputs
the numediart website [32].
that offer the possibility of moving interpolation points with the
mouse while receiving sensor data. This allows to tune the setup
KEYWORDS
while receiving performer’s data during rehearsals.
This report is divided in four main sections: after a brief
Sensor data pre-processing, mapping, interpolation, gesture recog-
overview of the sensor hardware and the software framework
nition, extension du corps sonore
within Max/MSP, we present our developments on pre-processing
of sensor data in order to extract features with minimum latency,
1. INTRODUCTION
on gesture recognition without the need for segmentation and on
improved mapping tools based on interpolation.
Contrarily to the usual augmented instruments designs that track
the gestures used to play the instrument to expand its possibili-
ties [45, 13], this project focuses on using non-musical gestures to
2. SYSTEM
transform the sound of the instrument. Our approach is dictated
2.1. Sensors
by the nature of Dominica’s project: she is actually dancing while
playing the viola and we track her dancing movements rather than
The sensors system developed at ARTeM for the Quartet [37] project
her hands movements. Research and experiments with Dominica
allowed for the data of a variety of sensors (theremin, flexion, ac-
Eyckmans started in the beginning of 2008 at ARTeM, using a Wi-
celerometers, gyroscopes), placed both on the violin and on the
Fi sensor system developed in 2006 for violist Stevie Wishart for
body of the performer to be transmitted every 8ms wirelessly over
the Quartet project [37]. The software framework, developed in-
Wi-Fi. Figure 1 shows how digital I/O of the PIC [26] micro-
side the Max/MSP environment to map sensor data to parameters
processor is used to communicate to the Theremin electronic cir-
of various sound transformation algorithms, has steadily evolved
cuits, how ADC inputs can be used to connect analogue sensors
since the Quartet project, mostly thanks to the work on the dance
and how digital accelerometers and gyroscopes are connected by
performance De deux points de vue [31] choreographed by Michèle
mean of an I2C interface through PCA9542 level shifters / multi-
Noiret and premiered in December 2007. These two previous
plexers with individual addresses settings.
projects allowed to identify the specific needs related to both in-
Philips originally developed I2C [11] in the beginning of the
strumental and dance performances. And though we found many
eighties to minimise the amount of communication links between
ways to use the tools reliably and creatively, it also demonstrated
digital devices inside its consumers products (TV, HiFi, VCR,...),
the interest of deeper research in order to overcome some limita-
but its use has spread well over those initial goals. Each circuit
tions.
possesses its own address and is addressed by software without
We wanted to better extract features directly from the sensor
the need for the additional chip select lines needed on SPI busses.
data, like hit detection, with very low latency. But we also wanted
the benefit of more complex filtering techniques to obtain addi-
tional mapping inputs.
129

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
The bus consists of only two (three) wires:
inputs and outputs of virtual instruments injected from the top and
• SDA, a bidirectional data link.
redirected with selectable level, thanks to a control object devel-
oped by Nick Rothwell, to the inputs of the virtual instruments
• SCL, a bidirectional clock link.
and the external sound outputs (Figure 2). Some spatialization in-
• GND, a common ground reference which, on a PCB, is not
struments are able to send sound directly to the external outputs,
truly a third line as it is usually readily available on the
effectively bypassing the matrix in order to reduce its size. This
ground plane.
explains the levels shown on outputs 3 to 8 in Figure 2 without
The advantage of this solution for our sensors system is to allow a
corresponding sends within the matrix.
large number of sensor axis to be connected using only the previ-
The sensors data is received as UDP packets through the nor-
ously mentioned three wires plus an additional power supply line
mal Wi-Fi interface. An external decodes the custom protocol,
(as the sensors themselves are not self-powered). Those four wires
scales the raw data and defines a name space depending on config-
in total compare to analogue connections where you also need
uration messages sent to its input and outputs data as messages. It
a ground and a power supply line plus one line per sensor axis,
then outputs named sensor data as messages, available through a
which would reach the amount of 12 wires to connect two 3-axes
send/receive scheme throughout all the patches.
accelerometers and two 2-axes gyroscopes. Besides making con-
nections easier, it also allows for more flexible wires on the body.
Current sensor value
Choice of an external fader scaling
Minimum sensor value
The use of multiplexers is needed as most digital sensor ICs
Attack time, hold, release time
Maximum sensor value
have a factory defined I2C address, which is the case for the ST
Units
Microelectronics accelerometers and the Gyration gyroscopes used
Sensor selection
Learn Function
Visualize total
in this project. And several ICs are not allowed to share the same
input range
I2C address on any given active portion of the I2C Bus. As the
Activate Table
Define user
PCA9542 is also a level shifter, it permits I2C communication be-
Open table
input range
tween our microprocessor running at 5V and the sensors running
Direct/Inverse
Visualize input
at 3V without additional level shifters.
Smoothing
within defined
For this project we decided to downsize the sensor system and
type and value
range
explore how far we could reach with only two sensors, a combina-
Defined user
Turn output On/Off
tion of a 3-axis accelerometer and a 2-axis gyroscope, placed on
output range
Scaled parameter value
each ankle. A similar setup was used for the dance performance
Minimum Parameter value
Visualize output
De deux points de vue [31] , except that the sensors where placed
within range
Maximum parameter value
on each upper arm of the dancer. The placement on the ankles
presents a minimal hinderance even for movements on the ground.
Figure 3: Mapping associated to each parameter
Depending on the results of further experimentation with the new
software tools, we will consider the need, the type and the place-
The scaling of the parameters is quite versatile and happens
ment of additional sensors on Dominica’s body.
through a Max bpatcher (Figure 3) that is used in every place in
the virtual instruments where parameters need to be given values
from the sensors. It offers various stages:
• when the desired sensor is selected from the sensors menu,
its minimum, maximum and current values, and its units,
here m/s2 for an accelerometer, are displayed. The cur-
rent value is both shown as a numeric value and as a slider
motion.
• a portion of the whole range of the sensor data can be de-
fined using the Max slider GUI object while looking at
the movement of the sensor input values continuously dis-
played on the slider above it. The learn function can be
activated to define the range that comprises all the values
reached by a specific sensor during a particular set of move-
ments.
• the values can then be transformed by a table that can be
opened and drawn by the user.
• the value can also be reversed as well as turned on or off.
• a smooth menu offers the choice between a min, mean, me-
Figure 1: Sensors system developed for Quartet
dian or max function followed by the number of data points,
or a slide function, similar to the Max slide object.
• a simple attack, hold and release envelope generator is acti-
2.2. Software framework
vated whenever one of the times are above 1ms.
The ARTeM software is developed in Max/MSP [25] and is orga-
• this output is then mapped to the entire excursion of a slider
nized around a modular concept: the audio paths of the various
that continuously displays the values, clipped at its extrem-
virtual instruments are connected through a matrix, with external
ities.
130

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
Figure 2: The main audio matrix, the main preset window and some windows that recalls all virtual instrument presets and a glimse at
some of the virtual instruments
• an output range selector allows then to map those values
In order to test our algorithms in the absence of the performer,
within a range corresponding to the minimum and maxi-
we recorded several versions of various dance gestures during our
mum acceptable range of the sound parameter. In the Fig-
workshop. Figure 4 shows an example. The stepped aspect of the
ure 3, 0 to 20000 correspond, for instance, to the minimum
gyroscopic data is explained by the refresh rate of those Gyration
and maximum center frequency of a filter; And those values
sensors, around 32ms, in relation to the sampling rate of 8ms.
can also be changed within those limits to allow for a very
precise control of the parameter value.
• finally, an external fader or rotary button value can be in-
serted in the previous stage to allow the sound engineer
to tune the scaling live using normal or motorized MIDI
faders.
Audio analysis tools, like envelope followers on the incoming
audio signals, at the output of virtual instruments or filter banks, or
iana~ [42], generate data that can be used exactly in the same way
as sensor data through all the patches. It allows to use the tools
we describe hereunder undiscriminately on sound analysis data,
on sensor data, or on combinations of both.
2.3. Concert setup
Figure 4: A data recording showing 3-axis accelerometer (black)
and 2-axig gyroscopic data (green) for left and right ankles
The concert setup consists of:
• on the performer side: the wearable wireless sensor system
and a microphone on the viola connected to a wireless mi-
crophone transmitter.
3. PREPROCESSING
• on the sound engineer side: a wireless microphone receiver,
In this section, we use data from an accelerometer attached to a
a MacBook Pro, a sound interface, a mixing desk, and an
performer’s limb so as to detect hits and extract their features:
octophonic speaker configuration.
scale, direction, and other possible ones with a minimum latency.
131

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
Time domain techniques have been a natural choice for hit de-
tection: Marrin Nakra, T. and Paradiso, J.A.’s Digital Baton [24]
, from the MIT Media Library, made use of a three-dimensional
accelerometer to measure directional velocity changes and beats
and provide an indication of the baton’s orientation. Later on, Par-
adiso, J.A. and etc. designed the expressive footwear [34] with
a 3-axes high-G piezoelectric accelerometer to detect heavy hits,
such as kicks and jumps, where an explosion sound, percussive
sounds, and sharp sonic transients were triggered by them. They
were also used for turning off all notes and launching the orchestra
hits. Directly making use of the time domain signal, these two sys-
tems provide simply one-directional responses to rapid hits, kicks,
or jumps. The hit detection is simply done by thresholding abso-
lute values of the time domain signals.
Direct time domain techniques offer a fast algorithm with low
complexity. However, a proper threshold to decide occurrence of
hits cannot be easily defined due to the fact that signals of an ac-
celerometer are the sum of static acceleration due to gravity (linked
to the posture) and dynamic acceleration due to movements of the
Figure 5: Using the median filter to remove static acceleration as
performer. The results of this simple algorithm are degraded by
input to the hit detection algorithm
variation of static acceleration and difficulties to reproducing a mo-
tion locus of a human limb with the same posture.
Peaks obtained by the median filter method
6

To address this problem, we first seek to improve the simple
X axis
4
Y axis
Z axis
scheme by the median filter, which can reduce the influence of the
2
static acceleration caused by gravity. Searching for a more robust
0
Acc.
−2
solution to remove static acceleration, to better estimate scale and
−4
direction of hits, and to acquire other new features, we explore fre-
−6 0
500
1000
1500
2000
2500
3000
3500
quency domain for a solution, especially the wavelet packet anal-
6

ysis method.
X axis
4
Y axis
Z axis
2
0
Peaks −2
3.1. Median filter method
−4
−6
−8
In order to have a robust hit detection algorithm in an environment
0
500
1000
1500
2000
2500
3000
3500
with a varying static acceleration caused by different human limb
postures (we assume that a big change of static acceleration does
Figure 6: Original acceleration signals in three axes on top and
not exist at the moment when a hit is performed), we need a tech-
after removal of static accelerations with a median filter with a 25
nique that removes the static acceleration and keeps the dynamic
samples window below
acceleration.
The median filter is a non-linear digital filtering technique, of-
ten used to remove noise in signals, like spikes or spurious data.
Using a sliding window on the sensor signal with an odd number
Once the static acceleration is removed, a detection algorithm
of samples, the median value is the value of the sample found in
is used to determine the presence of a hit by setting up an ap-
the center of the series of samples after reordering them in numer-
propriate Event Threshold and test it against all three axes. The
ical order. A noise peak can be removed when the window length
algorithm then looks for the absolute maximum value within the
is longer than the length of the peak.
guard interval in all three axes to ensure that the maximum value is
properly located. It is then possible to extract its scale, direction,
The dynamic acceleration of a hit is peaky, lasts for a short du-
as well as find out to which axis it belongs, as shown in Figure 5.
ration, and shows very similar characteristics to noise. Therefore,
The following parameters need to be defined:
choosing an appropriate window length for the median filter, we
can remove the peaks caused by hits and obtain a good estimate of
• Window Length: Length of the median filter window. It has
the static acceleration. If we then subtract this estimate from the
to be a positive odd number and its minimum value is 3.
original accelerometer signal (as shown in Figure 5), we obtain a
signal consisting essentially of those peaks linked to hits.
• Event Threshold: The absolute value of the estimated dy-
namic acceleration above which a hit event is taken into
The selection of the median filter window length can be used
account. It defines the sensitivity of the detection.
to determine the minimum speed of hits (linked to their maximum
duration) that will be detected, as the window length should be
• Guard Interval Length: The length of an interval where we
longer than the maximum duration of hits in order to preserve the
look for the absolute maximum value. It starts from the
features of the hits. With a sampling rate of 100Hz, a window
first sample where any estimated dynamic acceleration is
length of 25 points can extract a reasonable slow hit as can be
larger than Event Threshold and terminates at the end of
seen in Figure 6. If we need to filter out those slow hits, we can
the interval. Hence, it avoids the multiple hits.
use a shorter window. Figure 7 shows that only faster dynamic
accelerations are correctly extracted with a window length of 11
The algorithm outputs the following parameters:
points.
• Event (1 or 0): 1 when a hit event appears, otherwise 0.
132

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
Peaks obtained by the median filter method
In the time domain method described above, we used the win-
6

X axis
4
Y axis
dow length of the median filter to determine the range of hits
Z axis
2
speeds that should be filtered out. The frequency behavior of hits
0
Acc.
−2
can be further investigated in the frequency domain to get more in-
−4
formation than the time domain method could provide. We wanted
−6 0
500
1000
1500
2000
2500
3000
3500
to see if it was possible to differentiate a slow hit from a fast
one in the frequency domain, using Short Time Fourier Transform
6

X axis
4
Y axis
Z axis
(STFT) and Wavelet Packet Decomposition (WPD).
2
0
Peaks −2
−4
3.2. Short Time Fourier Transform
−6
−8 0
500
1000
1500
2000
2500
3000
3500
The Discrete Fourier Transform (DFT) is a well-known tool to de-
compose signal into basic sinusoidal functions. As our system re-
Figure 7: Original acceleration signals in three axes on top and
ceives data sample by sample, we use a frame-based Short Time
after removal of static accelerations with a median filter with a 11
Fourier Transform (STFT). If we analyze the signal displayed in
samples window below
Figure 8 by STFT with the Hamming window, a window length
of 30 points, and 50% overlap we obtain the spectrogram of Fig-
ure 9. That figure shows that DFT cannot concentrate the energy of
• Axis (1 or 2 or 3): 1 represents a hit occurring in x axis, 2 in
hits within only a few frequency bands and that the energy spreads
y axis, and 3 in z axis, depending on in which axis the max-
all over the bands because the sinusoidal bases of DFT cannot ef-
imum absolute value of the estimated dynamic acceleration
fectively represent the sharp transitions in the signal with a small
appears within Guard Interval.
number of bases.
However, the results of DFT can help us distinguish the fre-
• Direction along the Axis (-1 or 1): it is determined by the
quency content of a hit by observing the gravity center of the spec-
relative positions of the maximum and minimum value within
trum, or the spectral centroid, defined as:
the guard interval. If the minimum is in front of the maxi-
mum, the direction is -1, and 1 otherwise.
• Scale: It reflects the total energy of three axes of a mov-
N−1
X
ing accelerometer. It is calculated by the sum of the ab-
f(n) x(n)
solute value of the estimated dynamic accelerations along
Centroid = n=0
(1)
the three axes at the moment where the maximum absolute
N−1
X
value has been detected.
x(n)
In the recording of a series of hits in different axes shown
n=0
in Figure 8, the first six hits do not have posture change, while
where x(n) represents the magnitude of bin number n, and f(n)
the remaining hits do. The detection results, obtained with Event
represents the center frequency of that bin. Figure 10 shows the
Threshold = 2.5, Guard Interval Length = 30, and Window Length
spectral centroid of the signals in Figure 9, with a threshold. It
= 25, show that only the 2 weakest hits have not been detected.
gives a indication of the frequency content of a hit.
This shows that Event Threshold should not set too high in order
to detect weaker hits. And thanks to the median filter, there is less
risk than in the direct method to set a too low threshold. The Guard
Interval allows the algorithm to find the real maximum and not to
be fooled by the different kinds of hits.
Figure 9: The STFT spectrogram of the signals of the hits in three
axes
3.3. Wavelet Packet Decomposition
Figure 8: Hit detection results with the median filter method:
Unlike the Fourier transform, the Wavelet Transform (WT) pos-
Events, Scale, Axis, and Direction from the top to the bottom
sesses the ability to construct a time-frequency representation of a
(Event Threshold = 2.5, Guard Interval Length = 30, Window
signal that offers very good time and frequency localization. We
Length = 25)
chose to use WT because of their better temporal resolution for
133

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
Signals and their Gravity Centers
WPD Absolute Coeffs

5
15
0
5
10
−5
500
1000
1500
2000
2500
3000
10
8
Filter Number
5
6
4
15
2

0
20
40
60
80
100
120
140
160
180
200
0
Absolute Coeffs in X axis
20
40
60
80
100
120
140
160
180
200
5

15
0
5
−5
10
500
1000
1500
2000
2500
3000
10
8
Gravity of Signal and Gravty Center
Filter Number
5
6
4
15
2

0
0
20
40
60
80
100
120
140
160
180
200
20
40
60
80
100
120
140
160
180
200
Absolute Coeffs in Y axis
2
0

15
−2
−4
5
500
1000
1500
2000
2500
3000
10
8
10
6
Filter Number
5
4
2
15
0

0
20
40
60
80
100
120
140
160
180
200
20
40
60
80
100
120
140
160
180
200
Time in Frame
Absolute Coeffs in Z axis
Figure 10: The spectral centroid of the DFT coefficients
Figure 12: The absolute WPD coefficients of the hits (the Haar
Wavelet, hard thresholded by 1.2)
the higher frequency content. We decided to use the Haar wavelet
[23] as it is the best candidate to detect transitions thanks to its
Signals and their Gravity Centers
5
0
short impulse response. Wavelet Packet Decomposition (WPD) is
−5
500
1000
1500
2000
2500
3000
a wavelet transform where the signal is passed through more fil-
8
6
4
ters than WT [46]. In order to analyze the complete coefficients
2
0
20
40
60
80
100
120
140
160
180
200
5
pattern, we choose to use WPD as the analysis tool.
0
−5
The complete WPD coefficients of the signals of Figure 5, ob-
500
1000
1500
2000
2500
3000
8
Gravity of Signal and Gravty Center
6
tained with the Haar wavelet, are shown in Figure 11, where hard
4
2
0
20
40
60
80
100
120
140
160
180
200
thresholding is performed with a value 1.2 to remove less signifi-
2
0
cant coefficients for better observation. We see that the WPD coef-
−2
−4
500
1000
1500
2000
2500
3000
ficients of the Haar wavelet can represent the hit with a reasonable
8
6
4
number of coefficients and preserve amplitude and phase informa-
2
0
20
40
60
80
100
120
140
160
180
200
Time in Sample or in WPD Coeffs.
tion. Figure 12 shows the absolute WPD coefficients. We need to
pay attention to clipping in the accelerometer data (which happens
when the instantaneous acceleration exceeds the range of the sen-
Figure 13: The gravity center of the WPD coefficients
sor) that creates unusual high frequency components. The gravity
center of the WPD coefficients, computed in each axis, is shown
in Figure 13.
lated decimations, and we obtain the undecimated wavelet packet
decomposition(UWPD). These global filters split frequency spec-
trum into equal bands, from which we can extract the dominant
WPD Coeffs

15
10
frequency components of a hit.
5
5
10
0
Filter Number
With coefficients obtained from the global filters, we design a
−5
15

20
40
60
80
100
120
140
160
180
200
detection algorithm to decide whether a hit appears, in which axis,
Coeffs in X axis

15
its direction, scale, and center frequency, as shown in Figure 15.
10
5
5
As the energy of the hits usually spreads over several axes, the
10
0
Filter Number
−5
algorithm has to determine the presence of a hit on the basis of the
15

20
40
60
80
100
120
140
160
180
200
Coeffs in Y axis
energy of all the coefficients on all three axes. We have one matrix

15
10
of weights, composed of W
5
Xn, WY n, WZn, n = 1, 2, ..., N − 1,
5
10
0
that allows us to control the sensitivity on each axis and in each
Filter Number
−5
15

frequency band n. We may then define preferential directions as
20
40
60
80
100
120
140
160
180
200
Coeffs in Z axis
well as target specific frequency content.
EX, EY , and EZ are used to detect the direction of a hit
Figure 11: The WPD coefficients of the hits (the Haar wavelet,
biased towards the targeted direction and desired frequency re-
hard thresholded by 1.2)
sponse. With unity weights, the detection results with a guard
interval of 30 samples are shown in Figure 16.
Though the WPD approach offers a new parameter, the gravity
3.4. Wavelet Packet Decomposition without Decimation
center, we do not think that it can totally replace the time domain
approach. Indeed, with a limited number of filter banks, DC drifts
In order , for filtering purposes, to be able to reconstruct the sig-
and slow hits are combined in the same frequency band. As DC
nal continuously instead of frame per frame we permute the order
drifts could only be removed with a filter with very sharp transition
between filters and decimations. In Figure 14b we upsample the
and low cut-off frequency with a too long latency as a result, all
H and L filters to H2 and L2. We then further combine the filters
the parameter calculation mentioned above are affected while we
in Figure 14c to obtain the equivalent wavelet filter bank D1, D2,
cannot simply drop the lowest band to alleviate the difficulty. But
D3, A4, shown in Figure 14d. Finally we obtain a global filter for
we propose a method that combines the WPD and the MF to solve
each channel, with the same data rate, by discarding the accumu-
the problem.
134

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
Figure 14: (a) Permutation of the filters and decimations (b) Com-
bination of the filters with decimations (c) Global filters with dec-
imations
Undecimated Wavelet Packet Method
Sensor Signal
in X axis
in Y axis
in Z axis
...
...
...
Wavelet
D1 D2 ... DN-1
D1 D2 ... DN-1
D1 D2 ... DN-1 Filter Banks
Wx1
Wy1
Wz1
Wx2
Wy2
Wz2
Weights on each
Band
WxN-1
WyN-1
WzN-1
Energy
Energy
Energy
Figure 16: Detection results of the Wavelet packets method with
Calculation
Calculation
Calculation
Ex
Ey
Ez
Total Coefficients
threshold=1
Energy in each axis
Signals and their Gravity Centers
Event
5
> Event
Threshold
0
−5
Slope
500
1000
1500
2000
2500
3000
Maximum
8
6
4
2
Sign
0
20
40
60
80
100
120
140
160
180
200
Index
5
Direction
0
Axis Scale
−5
500
1000
1500
2000
2500
3000
8
Gravity of Signal and Gravty Center
6
Figure 15: The system flowchart of the undecimated wavelet filter
4
2
0
20
40
60
80
100
120
140
160
180
200
banks method
0
−2
−4
500
1000
1500
2000
2500
3000
8
6
4
2
0
3.5. Wavelet Packet Decomposition and Median Filter
20
40
60
80
100
120
140
160
180
200
Time in Sample or in WPD Coeffs.
We showed that the median filter method described above is able
to isolate the hits from the posture related acceleration. Instead of
Figure 17: The gravity center of the WPD coefficients (after re-
using the WPD approach on the sensor data we use it on the signal
moving the static acceleration with the Median Filter)
pre-processed by the median filter with the following advantages:
4. GESTURE RECOGNITION
• An improved estimation of the gravity center, shown in Fig-
ure 17.
4.1. Introduction
• Improved WPD coefficients of the first filter, shown in Fig-
Whereas usual augmented instruments designs track the gestures
ure 18.
used to play the instrument to expand its possibilities [45, 13],
this project focuses on using non-musical gestures to transform
• An improvement for other parameters, which are affected
the sound of the instrument. Our approach is different due to the
by DC drifts, such as the Event, Scale, and Axis decision.
nature of Dominica’s project: she is actually dancing while play-
ing the viola and we track her dancing movements rather than her
The WPD and MF approach are complementary, since they
hands movements. Gesture recognition (see [27] for a recent sur-
have many output parameters in common. Choosing between them
vey) is a welcome addition to an interactive performance and can
in terms of the performance and computational cost, we can assign
be used to trigger events, to adapt the response of the virtual instru-
the detection of hits and the estimation of Event, Scale, and Axis
ments according to the detected gestures, or to move through the
to the MF approach, while obtaining Gravity Center and Direction
various steps of a performance. As other modules (hit detection,
from the WPD approach, and finally combine them like shown
mapping or interpolation of sound effects, etc.) must be running
in Figure 19. Though the WPD approach with MF pre-filtering is
simultaneously on the same computer, reducing the computational
more computationally heavy than the single MF Approach, we can
load is really essential. While using similar hardware (cf. 2.1), the
significantly reduce the computational load if the WPD approach
atomic gesture recognition algorithm developed by Benbasat and
is triggered only when a hit event occurs. And, since the MF ap-
Paradiso [3] is not suitable in our project: as dance movements are
proach is better at detecting the direction of the high frequency
usually chained without pauses, we have to consider an algorithm
hits and the WPD is better at that of the low frequency hits, we can
that can deal with fluid motions, without the knowledge of the start
assign the directions extraction to one of these two methods ac-
and end of a gesture.
cording to the gravity center of the spectrum. Additional tests are
From a signal processing point of view, there are interesting
needed to determine the best assignment of the two approaches
analogies between the problem of Gesture Recognition (GR) [27]
and to fine-tune the parameters.
and Automatic Speech Recognition (ASR) [7]. "Similar to speech
135

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
WPD coeffs of the first filter(with the Median Filter)
gestures to be recognized are the same).
6
4
In our framework, the aim is to develop a user-dependent recog-
2
0
First Filter −2
nition system with a small gesture vocabulary and a database of
−4
−60
50
100
150
200
250
limited size. As some gestures should be added, removed, enabled,
Coeffs in X axis
10
or disabled easily and quickly, without any training procedure, we
5
chose for the DTW algorithm, which we adapted to make it usable
0
First Filter
−5
in real-time without the need for segmentation.
−100
50
100
150
200
250
Coeffs in Y axis
This section is divided as follows: first we present the gesture
4
recognition module, by describing our "multi-grid" DTW algo-
2
0
rithm and its real-time Max/MSP implementation, then we discuss
−2
First Filter
−4
some preliminary results. We will conclude with future investiga-
−60
50
100
150
200
250
Coeffs in Z axis
tions in the global conclusion.
Figure 18: The WPD coefficients of the first filter (after removing
4.2. DTW algorithm
the static acceleration with the Median Filter)
The classical DTW algorithm uses Dynamic Programming (DP)
principles to determine the best nonlinear mapping (Figure 20) be-
tween the temporal indices of the test sequence (i = 1..I) and
those of the reference sequence (j = 1..J), assuming that both
these sequences have been segmented. We denote by d(i, j) the
(non-negative) "local distance" (or dissimilarity measure) between
the test frame Ti and the reference frame Rj (where a frame is
composed of the data of all sensors and axes at a given time), and
by D(i, j) the "accumulated distance" along the sub-path between
the origin and the current node (i, j). The algorithm aims at min-
imizing these accumulated distance values and/or at extracting the
associated best path (i.e., the sequence of nodes) in the DTW grid
(Figure 20). A classical way of computing the accumulated dis-
tance value D(i, j) along a sequence of nodes (ik, jk) (k = 1..K)
consists in weighting the local distance elements d(ik, jk) with
transition costs that depend on the predecessor (ik−1, jk−1), and
summing up the weighted values:
K
X
Figure 19: The combination of the WPD and MF approach
D(i, j) =
W (ik, jk; ik−1, jk−1) d(ik, jk)
(2)
k=1
The question of the weight of the local distance corresponding to
and handwriting, gestures vary between individuals, and even for
the same individual between different instances." [27] The ques-
tion is: how can we compare different realizations of a same ges-
ture (cf. spoken word or sentence in the context of ASR), whereas
the sensor data will never be exactly the same twice? Also, how
can we manage the fact that two different realizations can vary in
duration while representing the same gesture? Not surprisingly,
the most popular algorithms used for gesture recognition are Dy-
namic Time Warping (DTW) [12, 41, 38, 30, 33, 14, 47, 18, 17, 6,
20, 19, 22, 44] and Hidden Markov Models (HMMs) [9, 47, 6, 5,
20, 27, 36]. Both algorithms need a pre-recorded database and try
to perform a temporal alignment between the tested sequence and
each reference sequence, selecting finally the one with the best
matching score. Main differences between these two algorithms
are the following. The first one concerns the size of the database,
which has to be much larger for the statistical (HMM-based) ap-
proach. The second one is the need of a training procedure for
the HMM method. On the other hand, pre-recorded templates are
fixed in the DTW approach; as a consequence, several templates
Figure 20: Mapping between two time series and DTW grid
(or prototypes) of each gesture are often needed to model the ges-
ture variability. But such variability is somewhat limited in our
the first node is solved by computing the transition cost between
project as the user (i.e. the artist) is clearly identified. Statisti-
a "fictitious" original node (0, 0) and the first node. These tran-
cal methods modelling inter-user variability are thus anticipated to
sition costs also raise the issue of normalization when computing
perform worse with our designated artist [19, 20, 6], though the
paths of different lengths (e.g. when a test gesture is compared
same model would be re-usable with other artists (at least if the
with several reference gestures of unequal duration). Dividing the
136

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
optimal distance by the "path length" (i.e. the sum of all weights
requires that the paths lie within a simple strip around a
along the path) leads to the mathematical expression of an average
purely linear path: |jk − ik| ≤ R, where R denotes the
"cost per node". Sakoe and Chiba [41] proposed following four
"window width" (Figure 21b).
transition cost types (a)-(d), where a fifth type (e) is easily derived,
The use of lower bounding distance measures [15, 16, 39,
which is similar to the third one (c), except that the roles of test
40] or other pruning strategies [2] should also reduce the
and reference are swapped:
computational load. However, those techniques are not im-
plemented in this project, as the former were developed in
(a) Wk = min (ik−m − ik−m−1, jk−m − jk−m−1)
offline conditions (i.e. the lower bounding measure is eval-
(b) Wk = max (ik−m − ik−m−1, jk−m − jk−m−1)
uated on whole pre-segmented signals), and the latter need
(c) Wk = ik−m − ik−m−1
(3)
a training stage.
(d) Wk = (ik−m − ik−m−1) + (jk−m − jk−m−1)
(e) Wk = jk−m − jk−m−1
Some points have to be detailed here.
The first one concerns the issue of normalizing the optimal
distance by the sum of all the transition costs (the so-called path
"length"), leading to the mathematical expression of an average
"cost per node". The problem with this normalization is that such
a normalizing factor is path-dependent when using one of the first
two transition cost types, thus inappropriate in Dynamic Program-
Figure 21: (a) Itakura, (b) Sakoe and Chiba, and (c) relaxed global
ming. In order to circumvent this problem (at least partially), one
constraints
can use a fixed normalization factor (e.g. I), but the DTW al-
gorithm will be biased toward the use of longer (case a) or shorter
• Local path constraints.
(case b) paths. On the other hand, when using one of the next three
The expansion or compression ratio between test and ref-
transition cost types, the normalization factor is path-independent;
erence can also be limited locally, in the neighbourhood of
its respective values are I (case c), I + J (case d) and J (case e).
each node. These local constraints are usually defined by
The second point to be focused on is related to inappropri-
listing the legal transitions, e.g. as a list of productions in a
ate zero transition costs, associated to vertical (cf. case c, when
regular grammar [41, 30]. Equations 4 and 5 show the local
ik−m = ik−m−1) or horizontal transitions (cf. case e, when
path constraint implemented, where each node (ik, jk) can
jk−m = jk−m−1), leading to "cost-free" node contributions in the
be reached from three different sets of predecessors (Figure
accumulated distance, regardless of the local distance. The solu-
22b):
tion proposed by Sakoe and Chiba [41, 30] consists in "smoothing"
D(ik, jk) = min( D1, D2, D3)
(4)
the transition costs along each local path, that is, in replacing each
transition cost by the average transition cost along the given local
with:
path, without affecting the normalization factor. An example will
8
< D1 = D(ik − 1, jk − 2) + 2d(ik, jk − 1) + d(ik, jk)
be shown after the explanations about local constraints (Figure 23).
D
:
2 = D(ik − 2, jk − 1) + 2d(ik − 1, jk) + d(ik, jk)
D3 = D(ik − 1, jk − 1) + 2d(ik, jk)
4.2.1. DTW search constraints
(5)
• Strict endpoint constraint.
Note also that such a local path constraint inherently im-
plies that a global Itakura constraint will be guaranteed (i.e.
In its strictest form, this constraint requires that both end-
λ
points match exactly. In other words, any candidate path
min = 0.5 and λmax = 2). As the local slope lies in the
range between 0.5 and 2, the global slope will be automat-
must begin at (1, 1) and end at (I, J).
ically restricted in the same range after iteration on the test
• Monotonicity.
sequence index (i). However, one can specify more severe
Any candidate path must also be monotonic, meaning that
global constraints (e.g. λmax < 2 or λmin > 0.5) that still
i
make sense. For example, after a hesitation, the performer
k−1 ≤ ik and jk−1 ≤ jk. Moreover, it is not allowed to
have simultaneously i
may want to reduce the delay with respect to the reference
k = ik−1 and jk = jk−1 (a single
node would never appear more than once in a given path).
and temporarily (i.e. "locally") speed up his gesture with a
velocity ratio equal to 2 (local slope), while respecting the
• Global path constraints.
maximum path constraint (e.g. global slope λmax = 1.5).
Each of the following global path constraints can be ap-
Finally, one can see that some other local constraints (Fig-
plied for different reasons. A first reason might be related
ure 22) lead to different slope ranges: e.g. [0 ∞] (a),
to the context of the application (e.g. physical limitations,
[1/3 3] (e), [2/3 3/2] (f), and that most local constraints
deliberate rejection of gestures that are executed too slowly
are symmetric (Figure 22a,b,d,e,f) while other ones are not
or too fast). Another reason might be the reduction of the
(e.g. Figure 22c), that is, the sets of arcs and nodes are dif-
computational load (time and memory), but with the risk of
ferent when the roles of i and j are swapped. Figure 23 il-
reaching a sub-optimal solution.
lustrates the method of "smoothing" the transition costs (e)
Itakura [12] suggests the specification of the maximum al-
calculated with Equation 3e on the example of local con-
lowable compression and expansion factors (λmax ≥ 1 and
straints depicted in Figure 22b, where the zero weight as-
λmin ≤ 1, with e.g. λmin = 1/λmax ), whereby all paths
sociated to the horizontal transition is replaced by the value
must entirely lie within a parallelogram (Figure 21a). An-
1/2 as well as the preceding transition cost (value = 1) from
other global constraint, proposed by Sakoe and Chiba [41],
node (i − 2, j − 1) to node (i − 1, j) on the same path
137

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
Figure 24: Local (left) and Normalized Accumulated (right) Dis-
tance Matrices for similar (top) and different (bottom) gestures
Figure 22: (a)-(f) Examples of local constraints
4.2.2. "Multi-grid" DTW algorithm and real-time Max/MSP
implementation
is replaced by the same average value 1/2. Swapping the
roles of the coordinates i and j, one easily derives similar
Only few implementations of DTW do not require prior segmen-
"smoothed" weights when Equation 3c is used, and so on.
tation. Oka [33] presents a continuous DP algorithm, which is an
efficient real-time method as the different paths originating from
all possible starting points are simultaneously competing in the
same DTW grid (one per reference gesture). However, he does not
explain how to include global constraints. On the other hand, Ko
[18, 19] describes a method including these constraints, but at the
cost of a higher computational load, as whole new paths are cal-
culated from each new starting point (i.e. at every time instant) in
the accumulated distance matrix (for each reference gesture).
Our "multi-grid" DTW algorithm provides a compromise so-
lution. The method uses simultaneously a set of shifted DTW
grids, each one hypothesizing another starting point (or set of con-
Figure 23: (a) unsmoothed and (b) smoothed transition costs
secutive starting point candidates when i1 = 0) for the test se-
quence. The hop size or time shift between two successive DTW
grids will generally be higher than or equal to hop_size = 1+ i1,
because a lower value would lead to an overlap of the hypothesized
• Relaxed endpoint constraint.
starting points across shifted grids, at the price of a higher compu-
To address the issue of locating accurately and in real-time
tational load (in both time and memory). Note that it is possible
the endpoints of a test sequence, the constraints are relaxed
to avoid a monotonic growing amount of shifted DTW grids in
[38, 44] by permitting the path to start from one of the fol-
parallel, by re-using the obsolete ones.
lowing nodes: (1, 1) to (1+ i , 1), or (1, 1) to (1, 1+
),
The time shift between two successive DTW grids will gen-
1
j1
and to end at one of the following nodes: (I − i , J) to
erally be equal to hop_size = 1 +
2
i1. The number of simul-
(I, J), or (I, J − j ) to (I, J) (Figure 21c). Consequently,
taneously active grids can be limited to the following quantity:
2
the different paths associated to each of the candidate ter-
Smax = ceil(Imax/hop_size). As J may vary from one gesture
minal nodes are compared on the basis of their normalized
to the other, Imax and Smax are also depending on the specific
accumulated distances, where the global normalization fac-
reference gesture. At every time instant, one best score (possibly
tor (iK + jK) is determined by the final coordinates only.
"infinite" at the beginning) is computed in each shifted grid, and
the minimum value of all these normalized accumulated distances
When only the starting point is approximately known, lower
is assigned to the given reference gesture. Despite the computa-
and upper bounds of the other endpoint may be found: e.g.
tion of several shifted grids, a low complexity can be achieved via
Imin = J/2 and Imax = 2J, when the expansion/compression an iterative implementation (like in [19]), where only one partial
ratio lies in the range between 1/2 and 2 (if we neglect i1
column D(i, j) is evaluated in each grid at a given time i (for each
and j values).
1
reference gesture), instead of all (partial) preceding columns from
the starting point.
Figure 24 shows an example of local (left) and normalized ac-
The hop size or time shift between two successive DTW grids
cumulated (right) distance matrices for similar (top) and differ-
will generally be higher than or equal to (1+ i1), because a lower
ent (bottom) gestures, with following parameter values: i = 8,
value would lead to an overlap of the hypothesized starting points
1
j = 0, λ
across shifted grids, at the price of a higher computational load
1
min = 0.5, λmax = 2, and R = ∞. Low distance val-
ues (depicted by blue pixels) are obtained when comparing similar
(in both time and memory). Note that it is possible to avoid a
gestures. Conversely, high dissimilarities are observed when the
monotonic growing amount of shifted DTW grids in parallel, by
tested gesture is very different from the reference one, resulting in
re-using the obsolete ones.
a worse matching score.
We also avoid re-computing the same local distances, and store
138

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
in memory only the current and a few past column vectors, depend-
distance definitions can be implemented and compared: Euclidean
ing on the "history" required by the local constraint.
distance, normalized cross-correlation coefficient [10, 18], Maha-
Figure 25 illustrates normalized accumulated distance matri-
lanobis distance [47], etc.
ces for successive shifted DTW grids when test and reference ges-
The current version of our system implements a downsam-
tures are similar. A good matching score is obtained for the low
pling stage (with a factor 4) [6], preceded by a lowpass filtering
shift values, while it becomes worse when the delay increases. On
step, and uses the L1 distance, whose computation is very effi-
the other hand, Figure 26 also illustrates (normalized) accumulated
cient as its expression is made of a (weighted) sum of the absolute
distance matrices for successive shifted DTW grids, but when test
value of differences. During this calculation, the sensors data are
and reference gestures are different. Again, the matching score
weighted, as some of them are varying within completely different
obtained in this latter case is worse than the one obtained in the
ranges of values and expressed in different units (e.g. accelerom-
former case.
eter data ±2g and angular velocity ±500◦/s). The easiest way
consists in normalizing the samples axis per axis (e.g. dividing
them by 2 and 500, respectively).
Another distance (or dissimilarity) measure that has been used
in the project is based on the normalized cross-correlation coef-
ficient [10, 18]. Because of the division and the square roots, its
computation requires more time. Its main advantage is that it is
insensitive to the global intensity. This is an interesting property
when comparing slow and fast versions of a same gesture. As we
make use of angular velocities and accelerations (and not of posi-
tions), the "amplitudes" of these sensors data are also affected by
the speed of the gesture execution: if R denotes the speed ratio be-
tween two executions of the same gesture, the instantaneous values
of the angular velocity profile are multiplied by R, and the gravity-
free accelerometer values are multiplied by R2. This correlation
coefficient can be considered as the cosine of the angle between
Figure 25: Normalized Accumulated Distance Matrices for suc-
two vectors, associated to the test frame and to the reference frame
cessive shifted DTW grids (similar gestures)
respectively. On the other hand, it could also be a drawback when
all components are near zero, as a sign change of the component
values implies a very different correlation coefficient, whereas the
points are in the same neighbourhood. However, using gravity-
biased accelerometer data almost prevents such situation. For con-
sistence with our minimization approach, and as the local distance
values are supposed to be non-negative, the implemented equation
is one minus this normalized correlation coefficient. Consequently,
the local distance will lie within the range [0 2], where the mini-
mum zero value is obtained when both vectors point in the same
direction (even if they have different amplitudes).
4.4. Post-processing
In the current Max/MSP implementation, the post-processing con-
Figure 26: Normalized Accumulated Distance Matrices for suc-
sists in selecting, at each moment, the gesture with the lowest nor-
cessive shifted DTW grids (different gestures)
malized accumulated distance and validating its recognition if this
value is below a user-defined global threshold.
Finally, the overall gesture recognition module has been im-
plemented as a Max/MSP external (Figure 27), which includes
4.5. Results
the "multi-grid" DTW algorithm, as well as the pre- and post-
processing stages described hereafter. It also evaluates and dis-
We first tested our "multi-grid" DTW algorithm offline, on a small
plays the time compression/expansion ratio, providing feedback to
database composed of recordings of 44 isolated dance gestures
the artist (e.g. during rehearsals).
(with a sampling period of 8ms). Each individual unsegmented
test gesture was compared with each segmented reference ges-
4.3. Pre-processing and distance metrics
ture. As a result of all these pair-wise comparisons, we obtained
a "pseudo confusion matrix" (Figure 28), the small amount of
The pre-processing of the sensors data and the calculation of the
recorded data preventing us from deriving actual statistics. How-
local distances are not part of the DTW algorithm itself, but their
ever, one can see that the main diagonal is in blue colour, because
computation is a preliminary stage, briefly explained in this sub-
each gesture is very similar to itself, and the blocks of blue pixels
section.
are explained by the presence of several occurences of the same
Many pre-processing methods can be applied to the sensors
gesture in our database. This representation allowed us to examine
data: lowpass filtering with downsampling [6], incorporating the
the ambiguity between some different pre-defined gestures and to
estimates of the derivatives [14], conversion of the 3-axis accelerom-
get information about an appropriate fixed global threshold or a
eter values to an angular representation, etc. [48] Also, different
series of gesture-based threshold values.
139

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
Figure 27: num.dtw integrated within Max/MSP
Figure 28: Gesture "pseudo confusion matrix"
Figure 29: "Accelerometer-only pseudo confusion matrix"
As two different types of sensors are used simultaneously, we
derivatives of the sensors data to take into account the dynamics
also wanted to explore the separate contribution of each type to the
of these signals, as it is classically done in ASR. Once again, we
ambiguity between different gestures. In practice, we applied the
investigated separately the contribution of each subset. The re-
same DTW-based procedure twice, with two different sets of sen-
sults are somewhat disappointing, but some comments are to be
sor weights. Accelerometer-only results are obtained by zeroing
formulated. First, the sensors data are not positions, but angular
the gyroscope weights, and vice versa. The inspection of Figures
velocities and linear accelerations, i.e. quantities that are related to
29 and 30 clearly reveals that gyroscope-only results are very poor
the first and second order derivatives of the position (respectively).
and lead to much more confusability, while accelerometer-only re-
As a consequence, the obtained results are relative to higher or-
sults are quite similar to the results obtained previously with all
der derivatives of the position, which may give a less appropriate
sensors data. This conclusion is only valid for the particular con-
information description. Second, in the case of movements with
ditions encountered in our project (e.g. the set of sensors and the
slow accelerations, the accelerometer values are strongly affected
database of gestures), but it may be interesting in terms of hard-
by the earth gravity during the entire duration of the gesture. As
ware cost and size.
a matter of fact, the DTW algorithm mostly attempts to match the
Then, we incorporated estimates of the first and second order
tilt profiles of the test and reference gestures, while neglecting the
140

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
task. On the other hand, being able to place points exactly where
desired is extremely efficient. If, for instance, one wants specific
transposition intervals at several orientations of a limb, it is very
straightforward to place points and assign values to them using a
1-D interpolator. And we show a system that makes it equally easy
to define what kind of transition happens in between those points:
steps or glissandi of various lengths. A 3-D interpolator has also
an obvious interest when using a 3-D accelerometer. But those
tools are not limited to sensor data control; they also provide the
necessary flexibility to be controlled by sound analysis.
5.1. Interpolation
Interpolation is an operation whereby each value of the output set
of m values is a weighted sum of the n corresponding values in
the n interpolated sets:
Figure 30: "Gyroscope-only pseudo confusion matrix"
n
X
output_valj =
Wi valij 1 ≤ j ≤ m
(6)
i=1
intentional accelerations of the performer [22].
Improvements might be realized by combining the derivatives
As such, the interpolator can perform the various types of
differently [14], and the same conclusion applies to the combina-
mapping usually described in the literature (direct or one to one,
tion of accelerometer and gyroscope data, left and right leg sen-
divergent or one to many, convergent or many to one and many to
sors, etc. Also, the amount of online computations might be re-
many) depending on the number m of values in each set and on
duced thanks to a training stage, that consists in choosing within
the dimensionality of the interpolation space, which is not to be
each class of gestures a single reliable template (or a small set of
confused with the number n of points placed within that space.
templates) that could be considered as its representative. Instead of
selecting this template randomly, better methods are e.g. minimum
5.1.1. Gravitational field metaphor
selection, average selection or multiple selection [17, 19].
We used the metaphor of a gravity system where each of the n
Our DTW algorithm was also used in a second application.
points can be considered as a planet which exerts an attraction
The sensors were attached to the wrists of the second author and
force F on the cursor depending on their relative cartesian distance
a dozen of left and/or right arm movements were successfully rec-
d, following Newton’s law:
ognized in real-time. The post-processing was slightly modified
into an N-best strategy (N = 3), that is, displaying continuously
the three best matched gestures. However, the correct gesture was
always classified in first position, except when the execution was
Fi = G mcursor mi 1 ≤ i ≤ n
(7)
d2
too fast (e.g. more than two times faster, while a factor 2 was the
if we then consider that the weight associated to each point is pro-
maximum fixed by local and global constraints).
portional to this attraction force and that the mass mi can be rep-
resented by the radius Ri of the point, we obtain the following
5. MAPPING BY INTERPOLATION
equation:
Since interpolation between sets of parameters was proposed by
Allouis [1] for the SYTER at GRM, several authors [43, 28, 4, 21]
Wi =
Ri
1 ≤ i ≤ n
(8)
have developed various implementations. Interpolation has been
max(di, dmin)power
recognized as a very intuitive way of performing various types of
where dmin is a small number that prevents a division by zero
mapping that can very effectively be applied to various kinds of
when the cursor is located exactly at the center of the point i and
sound transformation algorithms. Most implementations are de-
di = 0. When power = 2, we get the Newton’s law of attraction.
signed for mouse control, but we wanted to use data obtained by
But, in order to give maximum flexibility to the user, we decided
a wireless system developed at ARTeM, consisting of accelerom-
to allow the user to modify the power by simply sending the mes-
eters and gyroscopes. We were missing some features that make it
sage power $1, with a float argument. Changing this modifies
possible to cope with less precise control. Indeed, even with proper
the shape of the interpolation between points: softer in between
filters, data from accelerometers may still be a bit jittery when one
points with low values of power, more abrupt with higher values.
wants a responsiveness that prevents the use of too strong median
n
X
and/or low pass filters. We were also missing the absence of 1-D
The weights are normalized so that
WNk = 1:
and 3-D interpolator implementations in Max/MSP. A 1-D inter-
k=1
polator proves to be very useful when one needs to place values
at specific points of a one-dimensional sensor. Though one could
argue that it is possible to achieve that result by splitting the range
WNi =
Wi
n
X
1 ≤ i ≤ n
(9)
of that sensor equally in a certain number of portions and use a
Wk
table to get the desired result, it can be a very time-consuming
k=1
141

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
As the weights are normalized, we also choose to make dmin
available with the message d_min $1. Indeed, by limiting the
highest reachable weight, one can globally define in how much,
Rmin > 0
when the cursor is exactly centered on a point, its weight does or
does not overshadow the other weights. Obviously, dmin cannot
be ≤ 0 and has been limited inside the external to 10( −6
power ). This
insures that the maximum weights keep the same value at 106 Ri,
regardless of the power factor.
Rmin < 0
5.1.2. Rmin
(a)
(b)
(c)
(d)
The interpolation tools are usually controlled with a mouse. As
we wanted to use interpolation with sensors data input, where a
Figure 31: Various ways of displaying positive and negative
precise location is quite difficult to attain by the performer, danc-
Rmin: (a) lighter shade, (b) different color, (c) different pattern
ing on stage without visual feedback from the computer screen, it
and (d) chosen solution.
seemed important to extend the geographical zone where one of
the interpolation sets is given the maximum weight. We therefore
introduced the concept of Rmin, a zone where the weight stays
the points would perturb the field with their associated sets of val-
constant. This zone is a line, a circle or a sphere depending on the
ues. But we also defined a weight depending on the combined
dimensions of the interpolation space.
distance of the cursor to all the other points. We offer a choice
Having introduced Rmin, we felt there could be an additional
between two combined normalized distances:
advantage to allow it to have negative values. The result of a neg-
ative Rmin is to prevent the user from reaching the mathematical
center closer than the absolute value of Rmin. It has a similar ef-
n
X
fect as increasing dmin, except that it is not a global parameter, it
di
n
Y
doesn’t serve the same purpose, and can be individually adjusted
d
i=1
void_s =
and
dvoid_p = (
di)1/n
(11)
for each point. The interest is the following: placing points that
n
i=1
modify the values in their vicinity while never taking over as they
can never reach the maximum weight. They produce deviations
We then compute the weight of the void as the sum of those
in the output values that decrease in influences as Rmin reaches
two distinct contributions.
more negative values. This gives us the final equation implemented
• a constant weight defined by a constant distance Dvoid, in
in the external:
the same way as Wi is defined by the distance from the
cursor to the point i, because a distance is easier to grasp
than a more abstract numerical weight value;
• a weight depending on a combined normalized distance, de-
Wi =
Ri
1 ≤ i ≤ n (10)
(max(d
fined as a minimum combined normalized distance
i − Rmini , dmin))power
Dminvoid from where the influence of the void starts to
We had to define a way to display R and both positive and neg-
take place. The intensity of this contribution is expressed
ative Rmin. For the sake of clarity, we decided to bound the abso-
with the same metaphor of mass as for the other points:
lute value of Rmin by R, so that the outer circle always displays
with a radius, Rvoid and an exponent, power.
R. In order to avoid confusion between R and Rmin when points
are superposed, we needed to represent them differently. We tested
5.2. Max externals
several ways to graphically differentiate the positive from the neg-
ative values: using different shades, colors or patterns (Figure 31
Instead of defining a new GUI object, we decided to make Max
a, b and c). In the end, we found Figure 31d to be the most self-
externals that communicate with existing GUI objects: LCD and
explanatory. The filled area includes the center of the point for a
jit.window.
positive Rmin, because it depicts the wider spot where the math-
Colors can be defined with the RGB output of the swatch ob-
ematical center, and consequently the maximum weight, can be
ject. The different available color modes are shown in Figure 32
reached. On the other hand, it surrounds the center for a negative
for a 2-D interpolator and, in Figure 33, for various vertical or
Rmin, showing that the minimum distance, and hence the maxi-
horizontal 1-D versions, using LCD:
mum weight, can never be reached closer than by the open radius.
• same_colors: all points i share the same Ri color and the
Note that we doubled the (d) solution with different lighting (from
same Rmini color.
the top for positive and from the bottom for negative Rmin) in the
• individual_colors: R
3-D version, see Figure 35.
i and Rmini colors can be defined in-
dividually for each point i, allowing the user to define what-
ever colors seem to fit the character of the sound transfor-
5.1.3. The Void
mation associated to each point.
The idea of perturbation made possible by negative Rmin led to
• resistor_colors: users familiar with the resistor code can lo-
the concept of a constant field over the whole space, the void,
cate the points faster, without having to read the numbers.
linked to its own set of values. Within this constant field, obtained
Additionally, transparency is defined by alpha values for the
simply by attributing a user-defined constant weight to the void,
jit.window version.
142

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
Figure 32: Color modes: (left) same_colors, (center) individ-
ual_colors, (right) resistor_colors. Light grey represents inactive
points and the black dot is the current point.
Figure 33: Several 1-D setups in different color modes, vertical
on the left and horizontal on the right. The horizontal or vertical
black line shows the current point from where distances to each
interpolation point is computed.
As we mentioned before, all the parameters of the interpola-
tion points may also be defined by Max messages sent to
num.interpol.lcd or num.interpol.jit. Figure 34 shows bpatchers
used to modify the coordinates and radiuses for a 2-D interpolator.
The data to be interpolated, i. e. the sets of values for each
Figure 35: 3-D version in Jitter, (top) showing only Rmin in the
interpolation point can be entered in the modules with the help of
horizontal projection, as a clear circle if positive and as a dark
lists of values in the format: point#, value#, value1 [value2, value3,
circle if negative, and (bottom) using a poly_mode representation
...]. This allows both to set up individual values as well as to use
for R. The white sphere and circle projection shows the current
the output of a multislider followed by a simple prepend point# 1
point and the resistor code colors were used.
object. The resulting interpolated output values are available either
as a list directly usable by multislider or as individual values.
Additionally, the distances and the weights are output if acti-
generated by the external and directly fed to the input of LCD.
vated by the messages list_distances and list_weights. This later
Sprites are defined for each point and for the current point at initial-
can easily be used to send interpolate messages directly to pattr in
ization of the external object. They are redefined each time a point
case the user prefers using pattr rather than store data in the object.
is resized or (de)activated. The communication is bi-directional:
the interpolation points are displayed in the GUI which also re-
turns mouse position, mouse state (clicked or idle) and modifiers
in order to move, (de)activate and resize the points. Modifiers are
used to change the parameters graphically. By Default, Ctrl is
used to move the points, Shift ⇑ to (de)activate the points and
the combination Ctrl + Shift ⇑ to resize R and Rmin. We plan
to add messages so that modifiers can be defined by the user.
5.2.2. Jitter - OpenGL
Figure 34: bpatchers to modify parameters of 2-D interpolation
points.
The use of num.interpol.jit for one or two dimensions is very simi-
lar to that of num.interpol.lcd, needing only a jit.gl.render
and a jit.gl.sketch object. All messages needed to initialize the
5.2.1. LCD
points, move and resize them are generated by num.interpol.jit.
And the mouse control allows to modify the points exactly as for
The use of num.interpol.lcd is very straightforward and requires
the LCD version.
hardly no other max objects. All messages to the LCD object are
But num.interpol.jit is a bit more complicated for 3-D spaces
143

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
as it needs not only a global jit.gl.render object but also several
7. ACKNOWLEDGMENTS
jit.gl.gridshape and jit.gl.handle objects for each point. The exter-
nal generates all the messages to define all the parameters needed
numediart is a long-term research program centered on Digital
for those modules but abstractions are needed for each point. Fig-
Media Arts, funded by Région Wallonne, Belgium (grant N◦716631).
ure 35 shows how we added projections on the horizontal plane
and lines linking them to the spheres in order to resolve confu-
8. REFERENCES
sions in perspective views. Though functional when defining the
points parameters with Max messages, we still have to implement
8.1. Scientific references
graphical modification of points parameters in the 3-D version.
[1] J. Allouis and J. Y. Bernier. “The SYTER project: Sound
processor design and software overview”. In: Proceedings
6. CONCLUSIONS AND FUTURE WORK
of the International Computer Music Conference. Venice,
Italy 1982. Pp. 232–240. Pp.: 129, 141.
6.1. Preprocessing
[2] J. Alon, V. Athitsos, and S. Sclaroff. “Accurate and Effi-
The median filter approach we proposed was integrated in the
cient Gesture Spotting via Pruning and Subgesture Rea-
ARTeM framework and gave satisfaction on the real-time
soning”. In: Proceedings of IEEE Workshop on Human-
accelerometer data. The WPD approach has only been tested so
Computer Interaction. Beijing, China: Springer-Verlag,
far in MathLab on recordings made with our sensor system.
2005. Pp. 189–198. P.: 137.
[3] Ari Y. Benbasat and Joseph A. Paradiso. “An Inertial Mea-
surement Framework for Gesture Recognition and Ap-
6.2. Dynamic Time Warping
plications”. In: Gesture and Sign Languages in Human-
A real-time DTW-based gesture recognition tool has been devel-
Computer Intercation (Gesture Workshop, GW 2001). Lon-
oped, with a great flexibility provided by its set of parameters
don, UK 2001. Pp. 9–20. P.: 135.
(minimum and maximum expansion and compression ratios
[4] Ross Bencina. “The metasurface: applying natural neigh-
lambda_min and lambda_max, window width R, tolerances at
bour interpolation to two-to-many mapping”. In: Proceed-
the start point epsilon_i1 and epsilon_j1, sensor/axis weights,
ings of the conference on New interfaces for musical ex-
used-defined global threshold, etc.) and it has been successfully
pression. Vancouver, Canada 2005. Pp. 101–104. P.: 141.
tested on two different small databases. Algorithmic improve-
[5] Frederic Bevilacqua et al. “Wireless sensor interface and
ments would include the addition of other local constraints types
gesture-follower for music pedagogy”. In: Proceedings of
(only equation 5 is implemented now) [41, 30] and the ability to
the 2007 Conference on New Interfaces for Musical Ex-
activate/deactivate specific reference gestures on the fly.
pression (NIME07). New York, New York: ACM, 2007.
Some investigations are worth trying as far as the pre-
Pp. 124–129. P.: 136.
processing is concerned: e.g. removing the gravity component
to derive tilt-invariant features, testing different levels of down-
[6] Sung-Do Choi, A.S. Lee, and Soo-Young Lee. “On-Line
sampling, applying nonlinear quantification [22], etc. Some work
Handwritten Character Recognition with 3D Accelerome-
could also be accomplished to improve post-processing: the single
ter”. In: International Conference on Information Acquisi-
global distance threshold might be replaced by gesture-dependent
tion (2006). Pp. 845–850. Pp.: 136, 139.
threshold values and the measured time expansion/compression ra-
[7] John R. Deller, John G. Proakis, and John H. L. Hansen.
tio could be taken into account. Finally, we plan to port the exter-
Discrete-Time Processing of Speech Signals. New York:
nal to PureData [35].
Macmillan, 1993. P.: 135.
[9] Frank G. Hofmann, Peter Heyer, and Günter Hommel.
6.3. Interpolation
“Velocity Profile Based Recognition of Dynamic Gestures
with Discrete Hidden Markov Models”. In: Proceedings of
We enjoyed the flexibility of this approach of interpolation, more
the International Gesture Workshop on Gesture and Sign
adapted to sensors data. It doesn’t of course replace usual tech-
Language in Human-Computer Interaction. London, UK:
niques of one to one mapping (input and output scaling, filtering,
Springer-Verlag, 1998. Pp. 81–95. ISBN: 3-540-64424-5.
tables, envelopes...), but it complements them well. We found it
P.: 136.
was sometimes also useful to process some interpolation distances
[10] G.A. ten Holt, M.J.T. Reinders, and E.A. Hendriks. “Multi-
and weights with those one to one mapping techniques, so as to
Dimensional Dynamic Time Warping for Gesture Recog-
create additional mappings with different time dynamics. We are
nition”. In: Thirteenth annual conference of the Advanced
currently tweaking the last details and finishing the documentation.
School for Computing and Imaging. Heijen, The Nether-
The externals will be made available in the following weeks on the
lands 2007. P.: 139.
the numediart website [32]. We plan to experiment further with
the possibilities of changing the coordinates and sizes of the in-
[12] Fumitada Itakura. “Minimum prediction residual princi-
terpolation points in real-time in interactive installations, where
ple applied to speech recognition”. In: IEEE Transactions
they could for instance follow the positions and level of activity
on Acoustics, Speech, and Signal Processing 23.1 (1975).
of visitors. Finally, we would like to experiment with several cur-
Pp. 67–72. Pp.: 136, 137.
sors, identified graphically by different shapes or textures, moving
in the same space as we foresee the interest of allowing multiple
users or multiple sensor axes to interpolate in a polyphonic way.
We are also considering a port to PureData-Gem [35, 8].
144

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
[13] Alexander Refsum Jensenius, Rolf Inge Godøy, and
[28] A. Momeni and D. Wessel. “Characterizing and Control-
Marcelo M. Wanderley. “Developing Tools for Studying
ling Musical Material Intuitively with Geometric Models”.
Musical Gestures within the MAX/MSP/JITTER Environ-
In: Proceedings of the conference on New interfaces for mu-
ment”. In: Proceedings of the 2005 International Computer
sical expression. Montreal, Canada 2003. P.: 141.
Music Conference (ICMC 05). Barcelona, Spain 2005. Pp.:
[30] Cory Myers, Lawrence R. Rabiner, and Aaron E. Rosen-
129, 135.
berg. “Performance Tradeoffs in Dynamic Time Warp-
[14] Eamonn J. Keogh and Michael J. Pazzani. “Derivative Dy-
ing Algorithms for Isolated Word Recognition”. In: IEEE
namic Time Warping”. In: First SIAM International Con-
Transactions on Acoustics, Speech, and Signal Processing
ference on Data Mining. Chicago, IL, USA 2001. Pp.: 136,
28.6 (1980). Pp. 623–635. Pp.: 136, 137, 144.
139, 141.
[33] Ryuichi Oka. “Spotting Method for Classification of Real
[15] Eamonn Keogh and Chotirat Ann Ratanamahatana. “Exact
World Data”. In: The Computer Journal 41.8 (1998).
indexing of dynamic time warping”. In: Proceedings of the
Pp. 559–565. Pp.: 136, 138.
28th International Conference on Very Large Data Bases
[34] J. A. Paradiso et al. “Design and implementation of ex-
(VLDB 2002). Hong Kong, China 2002. Pp. 406–417. P.:
pressive footwear”. In: IBMSYSTEMS JOURNAL. Vol. 39.
137.
2000. P.: 132.
[16] Eamonn Keogh and Chotirat Ann Ratanamahatana. “Exact
[36] Timo Pylvänäinen. “Accelerometer Based Gesture Recog-
indexing of dynamic time warping”. In: Knowledge and In-
nition Using Continuous HMMs”. In: Pattern Recognition
formation Systems 7.3 (2005). Pp. 358–386. ISSN: 0219-
and Image Analysis (2005). Pp. 639–646. P.: 136.
1377. P.: 137.
[38] Lawrence R. Rabiner, Aaron E. Rosenberg, and Stephen E.
[17] M.H. Ko et al. “Online context recognition in multisen-
Levinson. “Considerations in Dynamic Time Warping Al-
sor systems using dynamic time warping”. In: Proceedings
gorithms for Discrete Word Recognition”. In: IEEE Trans-
of the Second International Conference on Intelligent Sen-
actions on Acoustics, Speech, and Signal Processing 26.6
sors, Sensor Networks and Information Processing (ISSNIP
(1978). Pp. 575–582. Pp.: 136, 138.
2005). Melbourne, Australia 2005. Pp.: 136, 141.
[39] Chotirat (Ann) Ratanamahatana and Eamonn J. Keogh.
[18] M.H. Ko et al. “Temporal data fusion in multisensor sys-
“Everything you know about dynamic time warping is
tems using dynamic time warping”. In: Workshop on Infor-
wrong”. In: Proceedings of the Third Workshop on Min-
mation Fusion and Dissemination in Wireless Sensor Net-
ing Temporal and Sequential Data, in conjunction with the
works (SENSORFUSION 2005) co-located with WICON
Tenth ACM SIGKDD International Conference on Knowl-
2005. Budapest, Hungary 2005. Pp.: 136, 138, 139.
edge Discovery and Data Mining (KDD-2004). Seattle,
[19] Ming Hsiao Ko et al. “Using dynamic time warping for
WA, USA 2004. P.: 137.
online temporal fusion in multisensor systems”. In: Infor-
[40] Chotirat (Ann) Ratanamahatana and Eamonn J. Keogh.
mation Fusion 9.3 (2008). Pp. 370–388. ISSN: 1566-2535.
“Three Myths about Dynamic Time Warping Data Mining”.
Pp.: 136, 138, 141.
In: Proceedings of the 5th SIAM International Conference
[20] Doo Young Kwon and Markus Gross. “A Framework for
on Data Mining (SDM 2005). Newport Beach, CA, USA
3D Spatial Gesture Design and Modeling Using a Wear-
2005. P.: 137.
able Input Device”. In: Proceedings of the 11th IEEE
[41] H. Sakoe and S. Chiba. “Dynamic Programming Algorithm
International Symposium on Wearable Computers. 2007.
Optimization for Spoken Word Recognition”. In: IEEE
Pp. 23–26. P.: 136.
Transactions on Acoustics, Speech, and Signal Processing
[21] O. Larkin. “INT.LIB - A Graphical Interpolator for
26.1 (1978). Pp. 43–49. Pp.: 136, 137, 144.
Max/MSP”. In: Proceedings of the International Computer
[42] T. Todoroff, E. Daubresse, and J. Fineberg. “Iana : a real-
Music Conference. Copenhagen, Danemark 2007. P.: 141.
time environment for analysis and extraction of frequency
[22] Jiayang Liu et al. “uWave: Accelerometer-based Personal-
components of complex orchestral sounds and its applica-
ized Gesture Recognition”. In: Technical Report TR0630-
tion within a musical realization”. In: Proceedings of the
08, Rice University and Motorola Labs. 2008. Pp.: 136,
International Computer Music Conference. Banff, Canada
141, 144.
1995. Pp. 292–293. P.: 131.
[23] S. G. Mallat. “A theory for multiresolution signal decom-
[43] T. Todoroff, C. Traube, and J. M. Ledent. “NeXTStep
position: the wavelet representation”. In: IEEE Transac-
graphical interfaces to control sound processing and spatial-
tions on Pattern Analysis and Machine Intelligence. Vol. 11.
ization instruments”. In: Proceedings of the International
1989. Pp. 674–693. P.: 134.
Computer Music Conference. Thessaloniki, Greece 1997.
[24] T. Marrin Nakra and J.A. Paradiso. “The Digital Baton: A
Pp. 325–328. Pp.: 129, 141.
Versatile Performance Instrument”. In: Proceedings ICMC.
[44] P. Tormene et al. “Matching incomplete time series with
San Francisco, CA, USA 1997. Pp. 313–316. P.: 132.
dynamic time warping: an algorithm and an application
[27] Sushmita Mitra and Tinku Acharya. “Gesture recognition:
to post-stroke rehabilitation”. In: Artificial Intelligence in
A survey”. In: IEEE Transactions on Systems, Man, and
Medicine (2008). Pp.: 136, 138.
Cybernetics 37.3 (2007). Pp. 311–324. Pp.: 129, 135, 136.
[45] Marcelo M. Wanderley. “Gestural Control of Music”. In:
Human Supervision and Control in Engineering and Music.
Kassel, Germany 2001. Pp.: 129, 135.
145

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
[46] M. V. Wickerhauser. “NRIA lectures on wavelet packet al-
gorithms”. In: Lecture notes, INRIA. 1991. Pp. 31–99. P.:
134.
[47] Daniel H. Wilson and Andy Wilson. “Gesture Recognition
Using the XWand”. In: CMU-RI-TR-04-57. Pittsburgh, PA
2004. Pp.: 136, 139.
[48] Shengli Zhou et al. “Hand-Written Character Recognition
Using MEMS Motion Sensing Technology”. In: Proceed-
ings of the IEEE/ASME International Conference on Ad-
vanced Mechatronics. 2008. P.: 139.
8.2. Artistic references
[29] Musiques
Nouvelles.
URL:
http : / / www .
musiquesnouvelles.com. P.: 129.
[31] Michelle Noiret. De deux points de vue. URL: http:
//www.michele-noiret.be/index.php?page=
de-deux-points-de-vue. Pp.: 129, 130.
[32] numediart. URL: http://www.numediart.org/.
Pp.: 129, 144.
[37] Quartet
Project.
URL:
http : / / www .
quartetproject . net / space / Program. P.:
129.
8.3. Software, hardware and technologies
[8] “Gem”. URL: http://gem.iem.at/. P.: 144.
[11] “I2C Bus”. URL: http://www.i2c-bus.org/. P.:
129.
[25] “Max/MSP”. URL: http://www.cycling74.com.
P.: 130.
[26] “Microchip”. URL: http://www.microchip.com/.
P.: 129.
[35] “Pure Data”. URL: http://puredata.info/. P.:
144.
146

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
AUGMENTED VIRTUAL STUDIO
Matei Mancas 1, Michel Bagein 2, Nicolas Guichard 1, Sullivan Hidot 2, Caroline Machy 4, Sidi Mahmoudi 2, Xavier Siebert 3
1 Laboratoire de Théorie des Circuits et Traitement du Signal (TCTS), Faculté polytechnique de Mons (FPMs), Belgique
2 Service Informatique (INFO), Faculté Polytechnique de Mons (FPMs), Belgique
3 Service Mathématiques et Recherche Opérationnelle (MATHRO), Faculté Polytechnique de Mons (FPMs), Belgique
4 Multitel (ASBL), Belgique
ABSTRACT
2.1.1. Blob detection
The Augmented Virtual Studio (AVS) project aims at acquiring
The present system’s analysis starts from foreground segmentation
the tools of video analysis and visualization needed to achieve
based on the analysis of the infra red video streams [5]. This anal-
advanced interfaces or interaction with virtual avatars and virtual
ysis provides a binary mask of the spatial extension of the region
worlds. Those techniques consist in data visualization, object seg-
of interest through time (blob detection). The signal coming from
mentation, tracking and identification of blobs but also sketch recog-
the IR camera is not affected by illumination variations but might
nition and more generally faces and objects recognition. All these
be affected by light reflections or some static infra red sources. We
tools were used to build three practical applications.
processed the video stream with a background subtraction to elim-
inate static elements and focus only on the moving objects. Then,
KEYWORDS
we binarized the signal with an empirically-tested threshold value
to extract the moving regions of interest (blobs or segments). Fig-
Video tracking, augmented reality, interfaces, object recognition,
ure 1 (a) shows the IR lights located on the two hands used for the
visualization
first application. Figure 1 (b) shows the IR lights located on the six
region of a human (head, two hands, navel, two feet) for the avatar
application.
1. INTRODUCTION
Gestural real-time interaction with multimedia data and virtual
worlds is of a great importance for live digital arts performances.
These topics are part of the COMEDIA research axis within the
Numediart project. There are many existing tools for real time
video analysis and visualization and we tested here some of them.
We also implemented tracking, blob identification and object recog-
nition algorithms in order to build three pilot applications which
will be used as test here and which will be enhanced in further
projects. In the next section the blob tracking and identification is
described for simple (2 blobs) and complex (6 blobs) situations,
then a section is dedicated to object recognition. In a third step,
several softwares which were tested here are described, while the
last section deals with the three pilot applications. Finally a dis-
cussion about the results can be found.
2. TRACKING, BLOB IDENTIFICATION AND
RELATIVE POSITIONS
2.1. Segmentation and tracking in EyesWeb
The first step for the three applications done in this session is the
detection and tracking of IR blocs in real time. We have used Eye-
sweb to perform this step. EyesWeb is a programming tool using
blocks that can work on several kinds of signals, such as video. It
is designed to facilitate the interpretation of these signal streams
[7]. We can divide our system of tracking on three major steps:
1. Blob Detection (segmentation)
Figure 1: Background subtraction with two blobs (above) and six
blobs (below).
2. Blob Tracking
3. Blob Trajectory
147

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
2.1.2. Blob Tracking
Resulting from the blob detection (preprocessing) step, we ob-
tained a binary image where white represents the foreground ob-
jects (Figures 1 and 2). The main goal of this step is to assign a
Figure 3: Blob Trajectory with two blobs.
Figure 2: (a) Blob Tracking with two blobs (a) and six blobs (b).
another since each of them uses the results of tracking IR for a
specific purpose.
label that identifies the different white blobs, and to track them. In
result we have every blob identified by a specific color. The pa-
rameters taken into account in our tracking are:
2.2.1. Blob identification in a simple context
In this application we have only 2 IR blobs tracked representing
• Multi-blob tracking
the two hands. The first detected blob represents the left hand and
We used the multi-blob tracking because we have more than
it is used for navigating within a multimedia wall (images, videos,
one blob to track in our applications.
texts) if the second IR light is not present (confidence indicator
• Distance
=0). When the second IR light is present and the confidence indi-
It represents the distance between the blobs detected, to
cator changes to 1, the first blob is used to move images, texts or
have a good performance of tracking we used the smallest
videos in the multimedia wall.
value possible of distance ’1’, if the distance between blobs
is less than this value we’ll have only one blob detected.
2.2.2. Indicators of zoom and rotation
• Number of blobs
We define as input for the tracking bloc the number of blobs
Using the coordinates of the two blobs detected we have computed
to be tracked. For the first application this number was 2
two indicators used for zoom and rotation.
(two hands) whereas for the avatar application this number
• Zoom Indicator
increases to 6 because we have six objects to track. The
The indicator of zoom is calculated by comparing the dis-
tracking result can be seen on Figure 2.
tance between the two blobs and a threshold, this threshold
represents the initial distance (Th) between the two blobs
From the multi-blob tracking, we obtained some important
(at the first detection of the second IR light). If this differ-
values and indicators:
ence is positive there is a request for a zoom in and if it is
negative, for a zoom out. Figure 4 shows the relation be-
(a) Confidence Indicator:
tween the distance and the zoom request. The formula used
This indicator takes the value ’1’ if the blob is detected and
for distance calculation is:
’0’ otherwise. The number of confidence indicator is the
same as the number of blobs.
p
D =
(x1 − x2)2 + (y1 − y2)2
(1)
(b) Blob2D coordinates:
With (xi, yi) are respectively the coordinates of the first
These coordinates represent the coordinates (x, y) of each
and second blob. If we want to make a request of zoom (in
blob detected, these values changes in real time during the
or out) we have only to change the distance between the IR
blobs displacement.
lights which are located on the user’s hands1.
• Rotation Indicator
2.1.3. Blob Trajectory
The Rotation Indicator is computed by the same method
After tracking we can generate for each blob a trajectory consisting
that the zoom but it uses a different distance in order to
of the temporal sequence of the points provided as input. Figure 3
avoid conflicts. The distance for rotation is calculated using
shows the trajectory of two tracked blobs.
just the coordinate y and not x by this formula (the thresh-
old represents the initial distance with horizontal coordi-
2.2. Blob identification and relative positions
nates y):
Dy = y1 − y2
(2)
After releasing the tracking of different IR blobs, we have to iden-
tify those blobs. This identification differs from an application to
1To make this request of zoom the 2 IR lights must be visible
148

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
Figure 4: (a) Initial position of blobs (no zoom) - (b) Zoom In (D > Threshold) - (c) Zoom out (D < Threshold).
Figure 5: a) Initial position of blobs (no rotation) - (b) Left Rotation (Dy > 0) - (c) Right Rotation (Dy< 0).
If we have a positive value of Dy, there is a request of left
have only one unmarked blob which is obviously the navel.
rotation else we have a request of a right rotation. Figure
Figure 6 shows the result of blob identification by replacing
5 displays the relationship between Dy and the rotation re-
each blob by its corresponding human region.
quest.
2.2.3. Blob identification in a complex context
In this situation we have six blobs to detect. These blobs represent
the IR lights located on the different regions of human (head, two
hands, two feet and navel). Our blob detection provides us with the
coordinates of each blob and its confidence indicator, but we have
to associate each blob to the corresponding human region. In or-
der to achieve this blob identification we have followed these steps:
1. Detection of the 6 blobs
Before starting the tracking we have to detect all the six
blobs (the IR lights have to be lit).
2. Blob Identification
After detecting the six blobs we can start our identification
using the horizontal and vertical coordinates (x and y), this
identification is done for each blob as follows. To iden-
Figure 6: (a) Blob tracking (6 blobs) - (b).Blob Identification (6
tify the head we have simply taken the blob which has the
blobs).
biggest value of y, this blob will be marked, now we have
five blobs to identify. To identify the left hand we have
taken the blob which has biggest value of x. this blob will
be marked, now we have four blobs to identify. To identify
3. Blob Tracking
the right hand we have taken the blob which has the small-
After identification of all the blobs we can start our tracking
est value of x. this blob will be marked, now we have three
for the six different human body areas.
blobs to identify. To identify the first feet we take the blob
which has the less value of y and to know if it is the left or
3. OBJECT DETECTION AND RECOGNITION
the right one we have to identify the second feet which rep-
resent now the blob that has the less value of y between the
3.1. Face detection
two remaining blobs. After the identification of the two feet
we know that the left one is simply the one which has the
We implemented the Viola and Jones [15] face detection algorithm
biggest value of x and the right feet will be automatically
from OpenCV as an EyesWeb XMI block. This algorithm uses
the second feet. The two blobs will be marked too. Now we
files where data concerning a training step on face detection was
149

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
already achieved. This training data is used in a Haar-like classi-
4. each sketch in the database has been previously assigned
fier which will be able to detect faces. It is possible to train the
to a 3D object. The database sketch that most closely re-
algorithm on other training sets of objects (not necessary faces)
sembles the hand-made one will be selected, and the cor-
to use this block in a more general object detection. The results
responding 3D object will replace the hand-made sketch in
of this block are satisfying. There are very few false detections,
the video, at the same position an with the same size.
especially if the EyesWeb tracking is followed by a blob-tracking
5. turning the second LED off and on again triggers a new
technique which will eliminate the false alarms which only appear
mode of interaction, to grab and to move the 3D object.
in some particular frames and which have a low time persistence.
The faces are well detected when seen from the front, but the algo-
It is important to point out that IR tracking was preferred over
rithm performances highly decrease when the face is seen from the
color-tracking, because it is more reliable and because the LEDs
side. Face detection is achieved in real-time only if the larger size
can easily be placed in the hands of the actor without hampering
of the frames is less than 500 pixels, that is why it is interesting
his/her motion.
to decrease the image size to make face detection and than get the
initial size to replace the face mask within the real image (Figure
For all the above mentioned operations, we relied on the EyesWeb
7).
software [4], which contains efficient algorithms for video-based
tracking. Using EyesWeb’s Software Development Kit (SDK),
we added some modules (also known as ’blocks’) tailored to our
needs. For example, we implemented 2D Fourier-based distances
between images. Other functions, such as the image moments (see
below), can also easily be incorporated into EyesWeb, for example
using known libraries such as OpenCV [3].
3.2.2. Sketch recognition system
Existing sketch recognition systems fall into two main categories:
gestural (e.g., [1]) or free-sketch system. Gestural systems require
each sketch to be drawn in a particular style (making it a gesture,
Figure 7: Face detection block implemented in EyesWeb.
rather than a shape). In this application, no particular constrains
were imposed on the actor’s gesture, so that our recognition sys-
tem would fall in the ’free-sketch’ category. We should emphasize
that sketch recognition is a wide field of research, encompassing
3.2. Sketch recognition
among others feature-based, vision-based, geometry-based, and
timing-based recognition algorithms. For the ’Augmented Sketch’
3.2.1. Overview
application, simple global properties were tested:
The aim of this section is to detect a hand-made sketch by an actor
• pixel-per-pixel differences.
and to replace the sketch with a corresponding 3D virtual object
• Fourier-based distances
with which the actor can interact. More specifically, the tasks to
be performed are as follows:
• width/height ratio,
• ratio of filled/unfilled pixels
1. detect when the actor is making a 2D sketch and record it.
• higher order image moments (e.g., Hu moments [8])
2. detect which object has been drawn, by comparison with a
database of known sketches.
From our simple tests, it appears that the width/height ratio and the
ratio of filled/unfilled pixels are not efficient in practice, because
3. display a 3D image corresponding to the object drawn, on
the actor can make drawings of any height, width and size. These
top of the video recording of the actor.
could become useful if some conventions were set, for example
4. allow the actor to manipulate interactively the 3D virtual
houses are wider than tall, trees are taller than wide, . . . However,
object.
simple pixel-per-pixel differences (or, equivalently, between their
Fourier transforms, see Eq.3) gives reasonably good results. Hu
To achieve these tasks, the actor is equipped with 2 IR LEDS (one
moments give some additional information but were not necessary
in each hand) and filmed with two cameras: one conventional dig-
for out limited database (see Fig. 8).
ital camera and one IR camera, so that (the numbering below cor-
responds to the above mentioned tasks):
The quadratic misfit between two images is given by:
1. the first LED is used by the actor to signal when (s)he is
P P
starting or stopping to draw.
Q =
i
j |Fij − Gij |2
P P
(3)
2. when the start signal is triggered, the movement of the sec-
i
j |Fij |2
ond LED is tracked by the IR camera, forming a 2D trajec-
Because this expression is quadratic, it can be evaluated in real-
tory.
or Fourier-space (Parseval’s theorem). Evaluating it in reciprocal
3. when the stop signal is triggered, the 2D trajectory is ex-
space has the advantage that the sums in Eq.3 can be calculated on
tracted and normalized to a 128x128 pixels image (the ’sketch’). a subset of the Fourier coefficients (e.g. the low-resolution ones)
This sketch is then compared with a database of 2D sketches
without extra computation. Low-pass filtering in real space would
using a simple recognition system described below.
require more computations on each frequency is changed.
150

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
Figure 8: Pixel-per-pixel differences between pairs of low-passed images in a small sketch database. Green numbers represent a correct
detection. Orange represents a detection with a very small error ( < 0,01) and Red represents a false negative/positive (error > 0,01).
Fourier-based distances give a similar result.
component k. In the case of Gaussian mixture, each element (here
Fig. 8 shows that simple shapes such as trees and houses are stably
a pixel in the HSV space) of a component is supposed to be a real-
detected. Some problems arise with more ’complex’ shapes such
ization of a particular Gaussian distribution:
as the pine tree, which requires more elaborate criteria in order to
be clearly identified.
x|Gk ∼ N (µk, Σk)
(5)
3.2.3. Perspectives
Knowing the number of components, we can compute an estima-
tion of the parameters by using the maximum likelihood fitting
The global properties used in this work were sufficient to discrim-
based on Expectation-Maximization algorithm [6]. Figure 9 shows
inate between the simple sketches in our database. More elabo-
an illustration of this statistical model. We can see three persons
rated algorithms are required as the size of the database increases,
moving around in a room. The aim is to detect and to track these
and as the sketches to discriminate become more similar. Various
persons when their face information is not directly usable. The
algorithms could also be combined to improve recognition using
first frame is used to perform the estimation step. The background
Artificial Intelligence techniques such as graphical models [14].
of the frames is subtracted with the help of an image binarization
approach (see Fig. 9 below). Thus, one person can be identified
3.3. Gaussian Mixture Model-based
by its blob and a distinctive color in order to verify if GMM is
achieved successfully. The blobs are converted into the HSV space
In the framework of object recognition, providing a fairly model
to run the procedure. We set the initial parameters as g = 3 clus-
using a probabilistic point of view is more effective compared to
ters, a random value for the means, a uniform value for the mixing
a naive approach (like for example an Euclidean distance between
proportions, and finally an unrestricted shape for the covariance
the pixels of a blob and the reference object). Thus, a method
matrices. Indeed, let us point out that GMM can be computed with
based on Gaussian Mixture Model (GMM) has been recently pro-
a constraint shape of the covariance matrices to control the com-
posed [12]. The basic idea is to model the Hue-Saturation-Value
plexity of the algorithm. We can see a perfect result because each
(HSV) information provided by the pixels in order to express a
person is well matching over the first frame. However, one dis-
likelihood function to be maximized. Generally speaking, the para-
advantage of GMM is the computational time in the learning step
metric formulation of GMM is the following [10]:
and we have to perform some tracking methods in a nearest real
time processing.
g
X
p(x; Ψ) =
πk.p(x|Gk)
(4)
k=1
3.4. Scale-Invariant Feature Transform
where Ψ = (π1, . . . , πg−1, Σ1, . . . , Σg) are the unknown param-
Scale-Invariant Feature Transform (SIFT), proposed by Lowe [9],
eters to be estimated. The quantities πk’s are the so-called mix-
is an algorithm capable of extracting and matching some distinc-
ing proportions, with the constraint of sum to one. The number
tive image features. The main advantage of SIFT is to be invari-
p(x|Gk) is the probability of generating the data point x by the
ant to image scale and orientation. Moreover, it has been proven
151

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
Figure 9: Object recognition and tracking using Gaussian Mixture Model.
[11] that SIFT outperforms a lot of descriptors-based algorithms
to identify persons in successive frames even if the face informa-
by comparing for different types of interest regions. SIFT is done
tion is not available. However, and despite of its robustness to
recognize objects among clutter and occlusion, SIFT is not really
adapted for our own database. Figure 11 shows the poorly results
using this approach. In (a), SIFT can detect the local feature of a
person when embedding in a larger image. In (b), because of the
keypoints changing over the frames, SIFT is naturally not able to
track an object subjected to various changes in positions. To cor-
rect this problem, one way to explore is to match between SIFT
and the Log-polar transformation. Log-polar technique [16] can
be viewed as an analogy with the structure of the retina and pro-
vides an increase of the size range of the object to be detected and
tracked. Our idea is then to cross the performances of SIFT with
Log-polar sampled image in a top view. The problem would then
sum up to objects moving around regarding only the scale and ori-
entation.
4. VISUALIZATION
4.1. Processing
Figure 10: Application of SIFT to keypoints selection.
Processing [13] is an open source project initiated by Casey Reas
in four successive steps:
and Benjamin Fry, both formerly of the Aesthetics and Computa-
tion Group at the MIT Media Lab. It is a programming language
and integrated development environment (IDE) built for the elec-
1. Scale-space extrema detection
tronic arts and visual design communities’, which aims to teach the
basics of computer programming in a visual context, and to serve
2. Keypoint localization
as the foundation for electronic sketchbooks. The IDE is licensed
under the GNU General Public License. One of the stated aims of
Processing is to act as a tool to get non-programmers started with
3. Orientation assignment
programming, through the instant gratification of visual feedback.
The language builds on the graphical capabilities of the Java pro-
4. Keypoint descriptor
gramming language, simplifying features and creating a few new
ones.
Figure 10 shows the keypoints selection of an image. The arrows
4.2. Blender
indicate the keypoint vector with their localization, scale and ori-
entation. Because SIFT detects the main local features of an im-
With the blob identification, it is easy to imagine that relative posi-
age, it can be applied to object recognition. Indeed, once the key-
tion of several points located on performer’s body could be used to
points have been localized and computed, a simple Euclidean dis-
control a virtual character or avatar. When blobs are located close
tance between the descriptors in their vector form can be achieved
to the performer’s joins, we can control a virtual 3D skeleton, or
152

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
Figure 11: SIFT applied to our own database.(a) Image local features well detected and matched in a larger image.(b) SIFT is not adapted
to various changes in positions.
more, a full virtual body with flesh. There are several powerful
tools on the shelf to create 3D avatar. Blender [2] is one of those
and its advantage is to be free and open with a large and very ac-
tive user community. The ’mancandy’ character was designed in
this environment with large number of joins (wrist, elbow, shoul-
der, neck ...) which their realistic degree of freedom (DoF). To
make a full control of this avatar, each join must be ’linked’ to
the performer’s join but the set of control is quite large. In or-
der to reduce the size of control set, inverse kinematics (IK) can
be applied to retrieve intermediate location of uncontrolled joins.
This approach, widely used in robotics engineering (and implic-
itly by puppeteers) attempt to determine optimal location of each
uncontrolled join (elbow, knee, etc.) in a bone chain. Usually, the
inverse kinematics provides an infinity of acceptable solutions. To
limit the number of solutions, several physiological constraints can
be added (join with few DoF, coarse limits, forces, stresses, etc.).
In broadcasting or live performance, the interactivity brings a ma-
jor constraint: the kinematics of full movement need to as realistic
as possible but without any a priori knowledge on the performer’s
gestures. Without any knowledge of the instantaneous gesture, it
is hard to define realistic join paths. For example, when one lift a
glass to the mouth, the elbow can be close to the thorax but when
one lift a hairbrush to the front, the elbow is also lifted to initiate
the brushing gesture from head front side to back side. Inverse
kinematics engine featured in Blender is dedicated to generate in-
termediate frames in motion sequences, based on the knowledge
Figure 12: Avatar visualized on Blender.
of initial and the final pose. This engine is activated at edit time
(preview): to help the user to locate intermediate join in pose, by
control the location of one selected bone (hand). When several
4.3. EyesWeb
poses are defined, IK is also used to generate each intermediate
frame between poses in order to build gesture sequences (one for-
The EyesWeb XMI (www.eyesweb.org) is a free software platform
ward step). Gesture sequences can be used to animate an avatar
[7]. It consists of two main components: a kernel and a graphical
(like in video games) in a movie but this is not realistic enough:
user interface (GUI). The GUI manages interaction with the user
sequences can not support fine interaction with its context: walk-
and provides all features needed to design applications (patches).
ing avatar seems to ’skate’, walk speed is constant, etc.
It allows hand fast development of custom interfaces for use in
artistic performances and interactive multimedia installations. The
kernel manages real-time data processing and synchronization of
multimodal signals. It supports the integration of user-developed
plugins: an SDK (Software Development Kit) is provided to sim-
153

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
plify the creation of such plugins by means of Microsoft Visual
C++. The user-developed plugins, together with the ones provided
with EyesWeb are the building blocks that the end user can inter-
connect to create a patch. The main use of EyesWeb is as a rapid
prototyping platform for real time signal analysis. The video side
is very well developed.
5. RESULTS
In practice three main applications were developed during this
project in order to test the different techniques of analysis and vi-
sualization we explored here.
5.1. Gesture-based interface
Figure 15: The user manipulates a piece of text in the multimedia
An intuitive interaction between user’s hands and a multimedia
wall.
wall was achieved by using EyesWeb to analyse the hand motion
as described in section 2.1. The hand’s position and mode (zoom,
rotation, manipulation) were transmitted to the visualization soft-
ware (Processing) through the OSC (Open Sound Control) net-
work protocol. Data from a database (pictures, movies, texts and
5.2. 2D Sketch recognition
sounds) is displayed on a kind of picture board, each picture cor-
responding to one of the data (see Figure 13).
This application provides an augmented reality display. The user
This pictures can be translated or rotated, user can zoom on
draws a sketch by moving his hands. This sketch is automatically
one picture, read a video (Figure 15). Many options allow the
recognized among the sketches from the database (section 5.2) and
simple modification of the multimedia wall. For example, it is
a given synthetic image replaces the sketch at its particular posi-
possible to have one wall by kind of data (see see Figure 14).
tion and size. The user is then able to manipulate the object and
Here the videos are in front, behind are the pictures and texts
change its position through the scene. Figure 17 shows the results
are on the back. In this case, the user’s Z position (depth) is sent
of this application. For this application EyesWeb was used both
from EyesWeb to Processing in order to let the user the access
for analysis but also for the visualization.
to the wall behind he is located in 3D. Figure 15 shows the user
in the bottom-left image who manipulates a piece of text in the
multimedia wall.
Figure 13: Picture’s wall database.
Figure 16: The user draws a new sketch which be recognized as a
house.
5.3. Avatar
The third application detects 6 IR markers located on the user and
label them with different body parts (section 2.2.3). Moreover,
the user’s face is detected and segmented as described in section
3.1 and its face replaces the user head on the final avatar. Two
Figure 14: Wall by kind of data.
steps are needed in this application: in a first step the calibration
of the different body parts is achieved and then real time tracking
is performed and an “avatar” is displayed as in Figure 17.
154

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
lands: ACM, 2000. Pp. 360–367. ISBN: 1-58113-216-6. P.:
150.
[3] G. Bradski and A. Kaehler. Learning OpenCV: Com-
puter Vision with the OpenCV Library. Cambridge, MA:
O’Reilly, 2008. P.: 150.
[4] A. Camurri, M. Ricchetti, and R. Trocca. “EyesWeb - To-
ward Gesture and Affect Recognition in Dance/Music In-
teractive Systems”. In: Multimedia Computing and Sys-
tems, International Conference on 1 (1999). P. 9643. ISSN:
1530-2032. P.: 150.
[5] A. Camurri et al. “Developing multimodal interactive
systems with EyesWeb XMI”. In: Proceedings of the
2007 conference on new interfaces for musical expression
(NIME07). New York, USA. P.: 147.
Figure 17: The user moves his body in real time on the screen.
[6] A. Dempster, N. Laird, and D. Rubin. “Maximum likeli-
hood from incomplete data via the EM algorithm”. In: Jour-
nal of the Royal Statistical Society B 39.1 (1977). Pp. 1–38.
P.: 151.
6. CONCLUSIONS
[8] M.-K. Hu. “Visual pattern recognition by moment invari-
ants”. In: Information Theory, IEEE Transactions on 8.2
All the techniques we developed and tested in this project led to
(1962). Pp. 179–187. P.: 150.
three highly interactive applications. We measured the difficulty
of multi-blob tracking and identification even in simplified video
[9] D.G. Lowe. “Distinctive Image Features from Scale-
acquisition configurations using IR LEDs. Face tracking works
Invariant Keypoints”. In: International Journal of Com-
well but mainly when people look to the camera. The SIFT de-
puter Vision 60.2 (2004). Pp. 91–110. P.: 151.
scriptors can provide good results but only on contrasted objects
[10] G. McLachlan and D. Peel. Finite Mixture Models. Ed. by
which have a lot of edges and if they are viewed from different an-
Wiley Series in Probability and Statistics. 2000. P.: 151.
gles. GMM-based methods seem to work well in blob recognition
by using color information. A problem is in the real time reaction
[11] K. Mikolajczyk and C. Schmid. “A performance eval-
of this algorithm. Hand-made sketches are simple to recognize
uation of local descriptors”. In: IEEE Transactions on
if they are quite different globally and locally. Correlation of fil-
Pattern Analysis & Machine Intelligence 27 (10 2005).
tered images of the sketches seem to be the best simple matching
Pp. 1615–1630. P.: 152.
method for sketch recognition. Blender is an excellent 3D ob-
[12] R. Morros et al. “Event recognition for meaningful human-
ject creation free platform which also contains advanced avatars.
computer interaction in a smart environment”. In: Proceed-
If the edit mode offers inverse kinematic mouse-based avatar ma-
ings of the eNTERFACE’07 Workshop on Multimodal Inter-
nipulation, the real time mode does not provide the possibility to
faces. Istanbul, Turkey 2007. P.: 151.
dynamically change avatar’s position and body parts configuration
[14] T. M. Sezgin and R. Davis. “Sketch recognition in inter-
and thus it is not possible to use it in real time manipulations. Pro-
spersed drawings using time-based graphical models”. In:
cessing is a simple to use (simplified Java) and a practical tool for
Compututers Graphics 32.5 (2008). Pp. 500–510. ISSN:
visualization. The use of OSC protocol let us use another software
0097-8493. P.: 151.
to achieve image analysis. Finally, some of the three applications
achieved here will be used in the next project session (MATRIX
[15] P.A. Viola and M.J. Jones. “Robust Real-Time Face Detec-
project) in order to be enhanced by adding more 3D information
tion”. In: International Journal of Computer Vision 57 (2
and intelligence.
2004). Pp. 137–154. P.: 149.
[16] C.F.R. Weiman and G. Chaikin. “Logarithmic spiral grids
7. ACKNOWLEDGMENTS
for image processing and display”. In: Computer Graphics
and Image Processing 11 (1979). Pp. 197–226. P.: 152.
numediart is a long-term research program centered on Digital
Media Arts, funded by Région Wallonne, Belgium (grant N◦716631).
Special thanks to Johan Decristophoris for his help in the IR LED
8.2. Software and technologies
setup.
[2] “Blender”. URL: http://www.blender.org/. P.:
153.
8. REFERENCES
[7] “EyesWeb XMI platform”. URL: http : / / www .
eyesweb.org. Pp.: 147, 153.
8.1. Scientific references
[13] “Processing”. URL: http : / / www . processing .
[1] Jr. A. Chris Long et al. “Visual similarity of pen gestures”.
org/. P.: 152.
In: CHI ’00: Proceedings of the SIGCHI conference on Hu-
man factors in computing systems. The Hague, The Nether-
155

QPSR of the numediart research program, Vol. 1, No. 4, December 2008
156

Document Outline