IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, VOL. XX, NO. Y, 2006
The BCI Competition III:
Validating Alternative Approaches to Actual BCI Problems
Benjamin Blankertz, Klaus-Robert Müller, Dean Krusienski, Gerwin Schalk, Jonathan R. Wolpaw,
Alois Schlögl, Gert Pfurtscheller, José del R. Millán, Michael Schröder, Niels Birbaumer
Abstract— A Brain-Computer Interface (BCI) is a system that allows its users to control external devices with brain activity. Although the proof-of-concept was given decades ago, the reliable translation of user intent into device control commands is still a major challenge. Success requires the effective interaction of two adaptive controllers: the user’s brain, which produces brain activity that encodes intent, and the BCI system, which translates that activity into device control commands. In order to facilitate this interaction, many laboratories are exploring a variety of signal analysis techniques to improve the adaptation of the BCI system to the user. In the literature, many machine learning and pattern classification algorithms have been reported to give impressive results when applied to BCI data in offline analyses. However, it is more difficult to evaluate their relative value for actual online use. BCI data competitions have been organized to provide objective formal evaluations of alternative methods. Prompted by the great interest in the first two BCI Competitions, we organized the third BCI Competition to address several of the most difficult and important analysis problems in BCI research. This article describes the data sets that were provided to the competitors and gives an overview of the results. In a series of accompanying articles, the winning teams describe their methods in detail.

Index Terms— augmentative communication, beta-rhythm, BCI, brain-computer interface, EEG, ERP, imagined hand movements, mu-rhythm, non-stationarity, P300, rehabilitation, single-trial classification, slow cortical potentials.

Authors BB and KRM were partially supported by grants of the Bundesministerium für Bildung und Forschung (BMBF), FKZ 01IBB02A/B and by the Deutsche Forschungsgemeinschaft (DFG), FOR 375/B1. Authors DK, GS and JRW’s work was supported in part by National Institutes of Health Grants HD30146 (National Center for Medical Rehabilitation Research of the National Institute of Child Health and Human Development) and EB00856 (National Institute of Biomedical Imaging and Bioengineering and National Institute of Neurological Disorders and Stroke) and the James S. McDonnell Foundation. Author JdRM is supported by the Swiss National Science Foundation NCCR “IM2”. Authors BB, KRM and JdRM were partially supported by the PASCAL Network of Excellence, EU # 506778.
BB and KRM are with Fraunhofer FIRST (IDA), Berlin, Germany, E-mail: benjamin.blankertz@first.fraunhofer.de. KRM is also with University of Potsdam, Germany.
DK, GS and JRW are with the Laboratory of Nervous System Disorders, Wadsworth Center, New York State Dept. of Health, Albany, NY, USA. JRW is also with the State University of New York, Albany, NY, USA.
AS and GP are with the Institute for Human-Computer Interfaces, University of Technology Graz, Austria.
JdRM is with the IDIAP Research Institute, CH-1920 Martigny, Switzerland.
MS is with the Dept. of Technical Computer Science, Eberhard-Karls-Universität Tübingen, Germany.
NB is with the Institute of Medical Psychology and Behavioral Neurobiology, University of Tübingen, Germany and also with the University of Trento, Italy.

I. INTRODUCTION

BRAIN-COMPUTER INTERFACES (BCIs) allow users to directly control a computer application or a technical device by intent alone. The system estimates the intent of the human user from her/his brain signals measured at microscopic, mesoscopic, or macroscopic scale, cf. [1], [2], [3], [4] for an overview. The interest in BCI research is strongly increasing, as reflected by the exponentially growing number of published peer-reviewed journal papers on that topic.

BCI Competitions are organized in order to foster the development of improved BCI technology by providing an unbiased validation of a variety of data analysis techniques. In each competition a variety of data sets was made publicly available in a documented format via the Internet ([5], [6], [7]). Each data set is a record of brain signals from BCI experiments of leading laboratories in BCI technology, split into two parts: one part of labeled data (‘training set’) and another part of unlabeled data (‘test set’). Researchers worldwide could tune their methods to the training data and submit the output of their translation algorithms for the test data. The truth about the test data was kept secret until, after the deadline, it was used to evaluate the submissions. This procedure guarantees that the assessment of performance is not biased by overfitting the selection of methods and the choice of their parameters to the data.

The three BCI Competitions were arranged in 2001, 2002 and 2004. The growing interest in such contests is reflected by the number of submissions rising from 10 to 57 to 92. The tasks and results of the first two competitions are summarized in [8], [9]. The first competition was a test for us to see how such an enterprise would work, and how much attention it would attract. In the second competition we provided a broad range of typical fundamental BCI problems. For the third BCI Competition ([7]) presented here we advanced to a diversity of challenging analysis problems that are highly relevant to present BCI research.

More specifically, the competition comprised the problems of session-to-session transfer, non-stationarity, small training sets, subject-to-subject transfer, continuous test data without trial structure, asynchronous paradigms and idle states.

A. Ranking of competition results

The ranking of results from Internet competitions cannot be taken at face value since it may not provide a completely objective assessment of quality, for several reasons:
(1) There is great variance in how much effort contributors put into preparing their submissions.
(2) When test sets (and the number of classes) are relatively small, luck may also play a big role. For example, if there are 15 methods in a binary problem that are able to classify
correctly 60 % of the ideal set of all trials with random output on the remaining 40 %, the expected accuracy of all these methods is 80 %. However, on a fixed test set consisting of 100 trials, the expected difference between the best and the worst result is greater than 10 % (assuming independence between methods and test trials).

In Sec. II–VI of this paper, we will describe the eight data sets comprising the competition and we will report and comment on the submissions. The results of all submissions are completely reported on the web ([10]) where we also list short descriptions of all applied methods. A list of the winning teams for each data set is summarized in table I. The winning labs published individual articles on their approaches, see [11], [12], [13], [14], [15], [16], [17] in this issue.

TABLE I
IN THIS TABLE THE WINNING TEAMS FOR ALL COMPETITION DATA SETS ARE LISTED. REFER TO SEC. V TO SEE WHY THERE IS NO WINNER FOR DATA SET IVB.

data set | research lab | contributor(s)
I | Tsinghua University, Beijing, China | Qingguo Wei, Fei Meng, Yijun Wang, Shangkai Gao
II | PSI CNRS FRE-2645, INSA de Rouen, France | Alain Rakotomamonjy, V. Guigue
IIIa | Neural Signal Processing Lab Institute for Infocomm Research, Singapore | Cuntai Guan, Haihong Zhang, Yuanqing Li
IIIb | Fraunhofer (FIRST) IDA, Berlin, Germany | Steven Lemm
IVa | Tsinghua University, Beijing, China | Yijun Wang, Han Yuan, Dan Zhang, Xiaorong Gao, Zhiguang Zhang, Shangkai Gao
IVc | Tsinghua University, Beijing, China | Dan Zhang, Yijun Wang
V | University of Barcelona | Ferran Galán, Francesc Oliva, Joan Guàrdia

II. DATA SET I

This data set was provided by the Institute of Medical Psychology and Behavioral Neurobiology, University of Tübingen (head: Niels Birbaumer) and Max-Planck-Institute for Biological Cybernetics, Tübingen, (Bernhard Schölkopf), and Universität Bonn, Dept. of Epileptology.

A. Description of the data set

This data set addresses the robustness of a classification approach. A common task in BCI is to apply a classifier that was trained during previous sessions to a later session without retraining it. The challenge of this task is that the electrical patterns of the patient might show somewhat different characteristics in a new session. This kind of non-stationarity can be caused, for example, by changed levels of motivation, arousal, fatigue etc. In addition, the recording system might have undergone slight changes concerning electrode positions and impedances.

Data set I reflects this situation: training and test data were recorded from the same subject and the same experimental task, but on two different days with about one week of delay. As electrocorticography (ECoG) was used and not EEG, the variation of electrode positions and impedances is expected to be rather small. The competitors were asked to set up a classifier based on the labeled training data of the first session and apply it to the unlabeled test data of the second session. The performance criterion used for evaluation was the percentage of correctly classified test trials.

The subject was not a locked-in patient but suffered from epilepsy. For this reason his neural activity was monitored for several days with an ECoG recording. During this interval the subject twice participated in a BCI experiment based on motor imagery. The task of both sessions was the same: to produce imagined movements of either the left small finger or the tongue. The provided data sets consist of 278 trials performed during the first session (training data) and 100 trials from the second session (test data). Electrical brain activity was picked up with an 8 × 8 ECoG platinum electrode grid which was placed on the contralateral (right) motor cortex. The grid was covered by meninges and skull and was not sensitive to muscle artifacts. As the skull and the meninges act as low-pass filters during EEG recordings, ECoG data can contain stronger high-frequency components than EEG. The grid was assumed to cover the right motor cortex completely, but due to its size (approx. 8 × 8 cm) it could in addition record activity from surrounding cortical areas. All recordings were performed with a sampling rate of 1000 Hz. After amplification the recorded potentials were stored as microvolt values.

Trial duration was three seconds. To avoid visually evoked potentials being reflected by the data, the recording intervals started 0.5 seconds after the visual cue had ended. For further information about the experiment, please refer to [18].

B. Outcome of the competition

We received 27 submissions for the test labels. Many submitted results were of high quality: 12 out of 27 submissions managed to achieve more than 80 percent classification accuracy on the test set. Although this includes an outlier of only 22 percent accuracy (probably submitted with accidentally confused class labels), the average accuracy of all submissions was 70 percent. Fig. 1 shows the histogram of the submission accuracy.

[Figure: histogram “Accuracies of Submissions”; x-axis: Accuracy [%]; y-axis: No. of Submissions]
Fig. 1. Histogram of the classification accuracy of 27 submitted solutions. One submission stays clearly below the chance level of 50 percent. A group of 14 submissions reaches more than 78 percent accuracy.
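
The test set of data set I has exactly the size used in the luck argument of Sec. I-A (100 binary trials), so that argument can be checked numerically here. The following Python sketch is only an illustration under stated assumptions: each of 15 methods classifies a trial correctly with probability 0.6 and guesses otherwise, independently across methods and trials. The function names, the number of repetitions and the random seed are illustrative choices and not part of the competition material.

```python
import random
import statistics


def simulate_score(n_trials, p_known, rng):
    """Percent correct for one method on one test set.

    A trial counts as correct if the method "knows" it (probability
    p_known) or, failing that, wins a fair coin flip between two classes.
    """
    correct = 0
    for _ in range(n_trials):
        if rng.random() < p_known or rng.random() < 0.5:
            correct += 1
    return 100.0 * correct / n_trials


def expected_best_worst_gap(n_methods=15, n_trials=100, p_known=0.6,
                            n_repeats=1000, seed=0):
    """Monte Carlo estimate of the mean gap between best and worst score."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_repeats):
        scores = [simulate_score(n_trials, p_known, rng)
                  for _ in range(n_methods)]
        gaps.append(max(scores) - min(scores))
    return statistics.mean(gaps)


if __name__ == "__main__":
    gap = expected_best_worst_gap()
    print(f"mean best-worst gap: {gap:.1f} percentage points")
```

Under these assumptions all 15 methods have the same expected accuracy of 80 %, so any spread between the best and the worst score in a single run is due to chance alone.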
The submissions of rank one to three and their applied methods at a glance:
1) An accuracy of 91 percent was achieved by Qingguo Wei and his co-contributors from the Tsinghua University of Beijing. They used a combination of bandpower features together with CSSD and mean waveforms that were chosen by Fisher discriminant analysis before classification was performed with a linear SVM.
2) An accuracy of 87 percent was achieved by Paul Hammon from the University of California in San Diego. After unmixing with ICA, a combination of AR coefficients, spectral power (0-45 Hz) and wavelet coefficients were used as features. Classification was performed with regularized logistic regression.
3) Marginally less, 86 percent accuracy, was reached by three submissions: by Michal Sapinski from Warsaw University, by Mao Dawei and co-contributors from Zhejiang University, and by Alexander D’yakonov from Moscow State University. The features they used comprise the offset and spectral power of hand selected channels (Michal Sapinski), the standard deviation of the Hilbert-Huang Transform for time frequency windows (window size: 5 Hz and 0.2 s) of seven channels (Mao Dawei), and hand chosen features from seven channels (Alexander D’yakonov). For classification, logistic regression (Michal Sapinski) and Mahalanobis distance (Mao Dawei) were used.

Taking a closer look at solutions above 60 percent accuracy, discriminant analysis (linear, robust etc.) dominates the classification methods with 4 entries, ahead of (linear) support vector machines with 3 entries. Furthermore, logistic regression and Mahalanobis/Fisher distance were used for two submissions each. Successful methods showed a tendency to use a combination of different feature types.

Fig. 2 takes a closer look at the difficulty the contributions had with certain test vectors. Most test vectors were classified correctly but around trial 40 (in chronological order, not in competition order), many misclassifications occurred. One interpretation is non-stationarity in the signals caused by epileptiform patterns in the EEG which did arise frequently for this patient.

[Figure: left panel “Histogram: Difficulty of Test Vectors” (x-axis: No. of Correctly Submitted Labels per Test Vector; y-axis: No. of Test Vectors); right panel “Chronological: Difficulty of Test Vectors” (x-axis: Test Vectors in Chronological Order; y-axis: No. of Correctly Submitted Labels, out of 27 Submissions)]
Fig. 2. Difficulty of test vectors from the contributor’s point of view. The left histogram shows that no vector was misclassified by every submission and that many vectors received correct labels from 20 or more submissions. The right graph provides another view of this distribution. It shows the number of correctly submitted labels for every trial in chronological order (the order was randomized for the competition). Around trial no. 40 many trials were not classified correctly.

III. DATA SET II: P300 SPELLER PARADIGM

This data set was provided by the Wadsworth Center, New York State Department of Health (head: Jonathan R. Wolpaw).

A. Description of the data set

This data set represents a complete record of P300 evoked potentials (five sessions from two subjects) recorded with the BCI2000 software [19], using a paradigm described in [20] and originally by Farwell and Donchin [21]. In these experiments, the user was presented with a 6 by 6 matrix of 36 different alphanumeric characters. The user’s task was to sequentially focus attention on characters from a word that was defined by the investigator. The 6 rows and 6 columns of this matrix were successively and randomly intensified at a rate of 5.7 Hz. Two out of 12 intensifications of rows or columns highlighted the desired character (i.e., one particular row and one particular column). The responses evoked by these infrequent stimuli (i.e., the 2 out of 12 stimuli that did contain the desired character) are different from those evoked by the stimuli that did not contain the desired character and they are similar to the P300 responses previously reported [20], [21]. Signals from the two subjects were collected from 64 ear-referenced channels (bandpass filtered from 0.1–60 Hz and digitized at 240 Hz) using the BCI2000 software. Each session consisted of nine runs, and each run contained a single word. For each character epoch in the run, the user display was as follows: the matrix was displayed for a 2.5 s period, and during this time each character had the same intensity (i.e., the matrix was blank). Subsequently, each row and column in the matrix was randomly intensified for 100 ms. After intensification of a row/column, the matrix was blank for 75 ms. Row/column intensifications were block randomized in blocks of 12. The sets of 12 intensifications were repeated 15 times for each character epoch (i.e., any specific row/column was intensified 15 times and thus there were 180 total intensifications for each character epoch). Each character epoch was followed by a 2.5 s period, and during this time the matrix was blank. This period informed the user that this character was completed and to focus on the next character in the word that was displayed on the top of the screen (the current character was shown in parentheses). The resulting data for each subject was partitioned into character epochs and divided chronologically into two parts, the first 85 characters for training and the remaining 100 characters for testing. The character epochs in each training and test set were then scrambled to avert identification of the character sequences in the test data. The objective in the contest was to use the 85 characters per subject of training data to construct a classifier, and to then predict the 100 characters per subject in the unlabeled test data. Participants were asked to report the classification results using all 15 flash sequences and, additionally, only the first 5 flash sequences.

B. Outcome of the competition
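
Several numbers quoted for the P300 speller of data set II follow directly from the paradigm parameters: 100 ms intensification plus 75 ms blank, 6 rows and 6 columns, 15 repetitions. The Python sketch below recomputes them and adds a minimal decoder that sums classifier scores per row and per column and picks the best-scoring pair. The decoder is a common textbook approach shown for illustration only, not the method of any competitor; the flash coding (0-5 for rows, 6-11 for columns), the character layout and all names are hypothetical.

```python
FLASH_S = 0.100       # intensification duration given in the text
BLANK_S = 0.075       # blank interval after each intensification
N_ROWS = 6
N_COLS = 6
N_SEQUENCES = 15

flash_rate_hz = 1.0 / (FLASH_S + BLANK_S)             # about 5.7 Hz
flashes_per_epoch = (N_ROWS + N_COLS) * N_SEQUENCES   # 180 intensifications
chance_accuracy = 1.0 / (N_ROWS * N_COLS)             # about 2.8 %

# Hypothetical 6 x 6 layout, used only to make the example concrete.
EXAMPLE_MATRIX = ["ABCDEF", "GHIJKL", "MNOPQR", "STUVWX", "YZ1234", "56789_"]


def decode_character(flash_codes, scores, matrix=EXAMPLE_MATRIX):
    """Return the character at the best-scoring row and column.

    flash_codes: one int per flash, 0..5 for rows and 6..11 for columns
                 (illustrative coding, not the coding of the data files).
    scores:      one classifier output per flash, larger = more P300-like.
    """
    totals = [0.0] * (N_ROWS + N_COLS)
    for code, score in zip(flash_codes, scores):
        totals[code] += score
    best_row = max(range(N_ROWS), key=lambda r: totals[r])
    best_col = max(range(N_COLS), key=lambda c: totals[N_ROWS + c])
    return matrix[best_row][best_col]


if __name__ == "__main__":
    # One sequence of 12 flashes; row 2 and column 3 carry the target.
    codes = list(range(12))
    scores = [1.0 if code in (2, N_ROWS + 3) else 0.0 for code in codes]
    print(round(flash_rate_hz, 1), flashes_per_epoch,
          round(100 * chance_accuracy, 1))
    print(decode_character(codes, scores))
```

Summing scores over more flash sequences averages out noise in single-flash responses, which is one plausible reason why results were requested both for all 15 sequences and for only the first 5.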
A total of 10 submissions were received for this data set, incorporating a wide variety of pre-processing and classification
methods. Using all 15 sequences, the majority of submissions (8) predicted the test characters with at least 75 % accuracy (accuracy expected by chance was 2.8 %). Several contestants achieved an accuracy of over 90 %, and the winner achieved an impressive accuracy of 96.5 % (see [12] for algorithm details).

IV. DATA SETS IIIA AND IIIB:

This data set is provided by the Institute for Human-Computer Interfaces, University of Technology Graz – BCI Lab (head: Gert Pfurtscheller).

TABLE II
MAXIMUM KAPPA FOR t ≤ 7 S IN THE THREE SUBJECTS (K3, K6, L1) AND ITS MEAN OBTAINED BY THE THREE COMPETITORS A, B AND C.

# | competitor | mean | K3 | K6 | L1
1. | B | 0.79 | 0.82 | 0.76 | 0.80
2. | C | 0.69 | 0.90 | 0.43 | 0.71
3. | A | 0.63 | 0.95 | 0.41 | 0.52
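
Table II reports the kappa coefficient, which for data set IIIa is derived from a confusion matrix built at each time point, with the predicted class given by the largest of the four output traces. The Python sketch below shows the standard definition of Cohen's kappa and the maximum over time points. It is a minimal illustration, not the official evaluation code; the data layout `traces[t][trial][class]`, the row/column convention of the confusion matrix and all function names are assumptions.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    Rows are assumed to index the true class, columns the predicted class.
    """
    n = sum(sum(row) for row in confusion)
    k = len(confusion)
    observed = sum(confusion[i][i] for i in range(k)) / n
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(k)
    ) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def confusion_at_time(true_labels, outputs_at_t, n_classes=4):
    """Confusion matrix for one time point.

    outputs_at_t[trial] holds n_classes trace values; the largest value
    determines the predicted class.
    """
    confusion = [[0] * n_classes for _ in range(n_classes)]
    for label, outputs in zip(true_labels, outputs_at_t):
        predicted = max(range(n_classes), key=lambda c: outputs[c])
        confusion[label][predicted] += 1
    return confusion


def max_kappa(true_labels, traces, n_classes=4):
    """Maximum kappa over time; traces[t][trial][class] (assumed layout)."""
    return max(
        cohen_kappa(confusion_at_time(true_labels, at_t, n_classes))
        for at_t in traces
    )
```

Kappa is 0 for chance-level agreement and 1 for perfect agreement, which makes it easier to compare across problems with different numbers of classes than raw accuracy.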
A. Description of data set IIIa

The data set consists of recordings from 3 subjects; the subjects performed 4 different motor imagery tasks according to a cue. Sixty EEG channels were recorded and the recording was made with a 64-channel EEG amplifier from Neuroscan, using the left mastoid for reference and the right mastoid as ground. The EEG was sampled with 250 Hz; it was filtered between 1 and 50 Hz with Notchfilter on. The data of all runs was concatenated and converted into the GDF format ([22]).

The subject sat in a relaxing chair with armrests. The task was to perform imagined left hand, right hand, foot or tongue movements according to a cue. The order of cues was random. The experiment consists of several runs (at least 6) with 40 trials each; after trial begin, the first 2s were quiet, at t=2s an acoustic stimulus indicated the beginning of the trial, and a cross ‘+’ was displayed; then from t=3s an arrow to the left, right, up or down was displayed for 1s; at the same time the subject was asked to imagine a left hand, right hand, tongue or foot movement, respectively, until the cross disappeared at t=7s. Each of the 4 cues was displayed 10 times within each run in a randomized order. Participants should provide a continuous classification output (continuous in time as well as magnitude) for all 4 classes. In other words the classifier should provide 4 continuous traces for the whole data set (including labeled trials, and trials marked as artifact). At each point in time, the trace with the largest value determines the corresponding class. Then, a confusion matrix is built from all trials for each time-point 0.0s ≤ t ≤ 7.0s. From these confusion matrices, the time course of the accuracy and the time-course of the kappa coefficient can be obtained. The performance measure of the competition was the maximum kappa value in time, averaged for the three subjects.

B. Outcome of the competition – data set IIIa

We received the following three submissions, whose performance on the competition’s test set is shown in table II.
A Authors: Hill & Schröder (Max Planck Institute for Biological Cybernetics, Tübingen and Tübingen University), Method: resampling 100Hz, detrending, Infomax ICA, Amplitude spectra (Welch), linear PCA, and SVM (remark: scores are constant for each trial)
B Authors: Guan, Zhang & Li (Neural Signal Processing Lab Institute for Infocomm Research, Singapore), Method: Fisher ratios of channel-frequency-time bins, feature selection, designing mu- and beta passband, multi-class CSP, SVM
C Authors: Gao (head), Wu & Wei (Tsinghua University, Beijing, China), Method: surface laplacian, 8-30Hz filter, CSP (one-vs-rest), SVM+kNN+LDA, ‘bagging’
A detailed description of the results is available from [23].

C. Description of data set IIIb

This data set IIIb contained 2-class EEG data from 3 subjects. Each data set contained recordings from consecutive sessions during a BCI experiment. The large amount of data should enable the use of non-stationary classifiers, because it is reasonable to expect that a time-varying classifier performs better than a stationary (static) classifier. Moreover, based on the experience of the second BCI competition [6], [24], [9], the response time of each method has to be evaluated.

The experiment consists of 3 sessions for each subject. Each session consists of 4 to 9 runs. The data of all runs was concatenated and converted into the GDF format [22]. The recordings were made with a bipolar EEG amplifier from g.tec. The EEG was sampled with 125 Hz; it was filtered between 0.5 and 30 Hz with Notchfilter on.

In order to evaluate the time delay, it was required that the submitters (1) provided a continuous classification output, and (2) demonstrated that the algorithms used are causal. The output was validated using the time course of the mutual information [25]. Submissions were ranked by the maximum increase of the mutual information (maximum steepness, calculated as MI(t)/(t − 3s) for t > 3.5s). In order to avoid the involuntary stimulus-response, only time t > 3.5s was evaluated. The ‘steepness’ of the mutual information quantifies the response time. The evaluation algorithm is provided in BIOSIG (see /biosig/t490/criteria2005IIIb.m in [26]).

D. Outcome of the competition – data set IIIb

We received seven submissions for this data set. The following three submissions obtained the best performance on the competition’s test set, see table III.
A Authors: O.Burmeister, M.Reischl, R. Mikut (Forschungszentrum Karlsruhe, Germany), Method: Bandpower (BP), ratios and differences of BP; MANOVA for feature selection; SVM and linear combiner
C Author: S. Lemm (Fraunhofer-FIRST IDA, Berlin, Germany), Method: ERP and ERD (mu and beta), probabilistic classification model, accumulative classifier
G Authors: Xiaomei Pei, Guangyu Bin (Institute of Biomedical Engineering of Xi’an Jiaotong University, Xi’an,
China), Method: FFT with Hanning window of 1s-segments; Fisher Discriminant Analysis

TABLE III
MAXIMUM STEEPNESS (WITH t0 = 3 S) OF THE MUTUAL INFORMATION [BITS/S] IN THE THREE SUBJECTS (O3, S4, X11) AND ITS MEAN OBTAINED BY THE THREE BEST COMPETITORS C, A AND G.

# | competitor | mean | O3 | S4 | X11
1. | C | 0.32 | 0.17 | 0.44 | 0.35
2. | A | 0.25 | 0.16 | 0.42 | 0.17
3. | G | 0.14 | 0.20 | 0.09 | 0.12

TABLE IV
THE TOTAL OF 280 TRIALS WAS SPLIT DIFFERENTLY INTO TRAINING AND TEST FOR EACH SUBJECT. HAVING ONLY A SMALL AMOUNT OF TRAINING SAMPLES POSES A PROBLEM. THIS TABLE SHOWS THE RESPECTIVE NUMBER OF TRAINING (LABELLED) TRIALS (#TRAINING) AND TEST (UNLABELLED) TRIALS (#TEST) FOR EACH SUBJECT.

subject | #training | #test
aa | 168 | 112
al | 224 | 56
av | 84 | 196
aw | 56 | 224
ay | 28 | 252
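
The steepness criterion used for data set IIIb, MI(t)/(t − 3 s) evaluated only for t > 3.5 s, can be illustrated with a few lines of Python. The sketch below covers only this final step and assumes that a mutual information time course in bits is already available; computing that time course follows [25] and is not shown. The official evaluation is the BIOSIG script cited in the text; the function name and the synthetic example are purely illustrative.

```python
def max_steepness(times_s, mi_bits, t0=3.0, t_min=3.5):
    """Largest MI(t) / (t - t0) over all samples with t > t_min, in bits/s.

    times_s: sample times in seconds relative to trial start.
    mi_bits: mutual information at those times, in bits.
    """
    ratios = [mi / (t - t0) for t, mi in zip(times_s, mi_bits) if t > t_min]
    if not ratios:
        raise ValueError("no samples later than t_min")
    return max(ratios)


if __name__ == "__main__":
    fs = 125  # sampling rate of data set IIIb in Hz
    times = [i / fs for i in range(7 * fs + 1)]
    # Synthetic curve: zero until t = 3 s, linear rise of 0.25 bits/s
    # until t = 5 s, constant afterwards.
    mi = [0.25 * (min(t, 5.0) - 3.0) if t > 3.0 else 0.0 for t in times]
    print(f"maximum steepness: {max_steepness(times, mi):.2f} bits/s")
```

Dividing by the time elapsed since t0 rewards outputs that become informative early, so a slow classifier needs a proportionally higher mutual information to reach the same score.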
The main aim was to evaluate causal algorithms that are able to provide continuous feedback as fast and as accurately as possible. To evaluate this aim, the ‘steepness’ of the time course of the mutual information was used as evaluation criterion and the participants were asked to provide the source code to prove causality.

Despite the requirement to provide the software, 7 participants submitted results. All participants provided some software. In several cases the software could not be tested because of some missing components. The software was analyzed by visual inspection. In one case an additional delay of 50 samples (0.4s) had to be added.

The winning algorithm is described in [14]. A detailed description of the results is available from [23].

V. DATA SETS IVA–C: MOTOR IMAGERY

These data sets were provided by Fraunhofer FIRST, Intelligent Data Analysis Group (head: Klaus-Robert Müller), and Charité University Medicine Berlin, Campus Benjamin Franklin, Department of Neurology, Neurophysics Group (head: Gabriel Curio).

A. Description of data set IVa

All three data sets share the same type of training sessions. Visual cues indicated for 3.5 s which of the following 3 motor imageries the subject should perform: (L) left hand, (R) right hand, (F) right foot. (For IVb and IVc (R) was replaced by (Z) tongue (=Zunge in German)). The presentations of target cues were separated by periods of random length, 1.75 to 2.25 s, in which the subject could relax.

There were two types of visual stimulation: (1) where targets were indicated by letters appearing behind a fixation cross (which might nevertheless induce small target-correlated eye movements), and (2) where a randomly moving object indicated targets (inducing target-uncorrelated eye movements).

Data set IVa poses the challenge of getting along with only a small amount of training data. One approach to the problem is to use information from other subjects’ measurements to reduce the amount of training data needed for a new subject. Of course, competitors could also try algorithms that work on small training sets without using the information from other subjects. For this purpose the data sets from five healthy subjects (aa, al, av, aw, ay) have been split differently into training and test sets, see table IV. Only trials of classes right and foot were available to the competitors. The performance measure was the overall accuracy. Note that this is not equal to the average across subjects, due to the differently sized test sets. Rather, the performance on subjects with large test sets (= small training sets) is weighted more strongly.

B. Outcome of the competition – data set IVa

There were 14 submissions for this data set. The winning team is Yijun Wang and colleagues from Tsinghua University, Beijing, China. They achieved accuracies of 96/100/81/100/98 for the five subjects and an overall accuracy of 94.2 %. This is an excellent performance when considering that the second (Yuanqing Li from the Institute for Infocomm Research, Singapore) and the third best (Liu Yang, National University of Defense Technology, Changsha, Hunan) achieved 85.1 and 83.5 %, respectively.

The winning team examined three types of features: (1) ERD-feature extracted by Common Spatial Pattern (CSP) analysis, (2) ERD-feature extracted with an AR model, and (3) ERP-feature extracted by LDA on temporal waves. For subjects aa and aw all three features have been used and combined by a bagging method. For the other 3 subjects only the CSP-based feature was used. To account for the small training sets in subjects aw and ay a special technique was employed in which formerly classified test samples are added to the training samples, cf. [15].

C. Description of data set IVb

Data set IVb poses the problem of classifying in an asynchronous protocol design, i.e., there are no cues indicating that the subject switches to a predefined mental target class. Rather the subject is by default in an idle state and can spontaneously switch into a mental state that is related to BCI control (here left or foot imagery). Also the duration of being in that mental state can arbitrarily be decided by the subject. This is in contrast to most classification analyses, which are performed on cued EEG trials, i.e., windowed EEG signals of fixed length, where each trial corresponds to a specific mental state (synchronous protocol). The training data followed the same experimental setup as in data set IVa. For the competition’s
test data set the target classes (left, foot and relax) were cued by acoustic stimuli so that the true labels were known. The length of those active periods varied between 1.5 and 8 s, separated by periods of 1.75 to 2.25 s. The task of the competitors was to give an output signal for each time point of the continuous signals provided as test data. During the intervals of idle state (relax) the output is supposed to be small in magnitude (ideally 0), while in periods of left and foot imagery it should be (near to) -1 and 1, respectively. Note that there are no sample trials for class relax in the training data. Rather it has to be defined as absence of the mental states that are used for control. Performance was to be measured by mean square distance of submitted classifier outputs and labels.

D. Outcome of the competition – data set IVb

Unfortunately, for this data set we received only one submission. So we cannot give an evaluation and elect a winner for this data set. Nevertheless we would like to thank Han Yuan and Yijun Wang from Tsinghua University very much for their submission. We regret not having received more submissions for this particular data set, since we think that it poses a highly relevant and difficult challenge.

[Figure: two rows of histograms, “best submission” and “2nd best submission”, each with panels left, relax and foot; x-axis: classifier output from −1 to 1]
Fig. 3. Histograms of classifier outputs for the two best submissions on data set IVc. Both methods perform well on the motor imagery samples (left and foot), but only the winning algorithm manages to identify (most of) the idle state samples (relax).
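
The error criterion for the data sets with an idle state, the mean squared distance between the classifier output and the target values -1 (left), 0 (relax) and 1 (foot), can be written down in a few lines of Python. The sketch below is a minimal illustration; the function name, the label strings and the toy outputs are illustrative and do not come from any submission.

```python
TARGETS = {"left": -1.0, "relax": 0.0, "foot": 1.0}


def mean_squared_error(outputs, labels):
    """Mean squared distance between classifier outputs and class targets.

    outputs: one real-valued classifier output per sample (or trial).
    labels:  the true class per sample, one of "left", "relax", "foot".
    """
    if len(outputs) != len(labels):
        raise ValueError("outputs and labels must have the same length")
    if not outputs:
        raise ValueError("at least one sample is required")
    squared = [(out - TARGETS[lab]) ** 2 for out, lab in zip(outputs, labels)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    labels = ["left", "relax", "foot", "relax"]
    ideal = [-1.0, 0.0, 1.0, 0.0]          # recognizes the idle state
    never_idle = [-1.0, 1.0, 1.0, -1.0]    # always commits to a class
    print(mean_squared_error(ideal, labels))
    print(mean_squared_error(never_idle, labels))
```

In this toy case a classifier that is perfect on the imagery samples but never outputs values near 0 is penalized by a full unit of squared error on every idle sample, which illustrates why handling the relax class matters under this criterion.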
E. Description of data set IVc

Data set IVc poses, like IVb, the problem that for a certain amount of test trials the subject was in idle state, i.e., he did not perform motor imagery (class relax). The training data for data set IVc is the same as the one for IVb. The experimental setup for the test data was similar to the training sessions, but the motor imagery had to be performed for 1 second only, compared to 3.5 seconds in the training sessions. The length of the intermitting periods ranged from 1.75 to 2.25 seconds as before. The test data was recorded more than 3 hours after the training data, so the distribution of some EEG features could be affected by long-term non-stationarities. The performance criterion is the mean squared error with respect to the target vector that is -1 for class left, 1 for foot, and 0 for relax, averaged across all trials of the test set.

F. Outcome of the competition – data set IVc

Seven competitors submitted their results to this data set. The winners are Dan Zhang and Yijun Wang from Tsinghua University, Beijing, China. They obtained a mean square error of 0.3, which is much lower than the result of the second best competitor, who achieved 0.59. The difference in performance becomes apparent when turning the attention to the specific challenge of this data set, the trials of idle state in the test data. These should have been mapped to 0 while left hand and foot motor imagery should have been mapped to -1 and 1, respectively. Fig. 3 shows the histograms of classifier outputs of the two best submissions. Ideally, outputs to left and foot events should all be -1 and 1, respectively, and outputs to relax events (idle state) should be zero. The second best submission performs remarkably well on motor imagery trials but absolutely fails to recognize the idle state trials (as do the [...]

good classification of the left and foot imagery events although there are some false negatives. But the particular strength of the method is that it manages to identify more than half of the idle state trials.

The winning team extracted ERD-features by the Common Spatial Subspace Decomposition (CSSD, cf. [27]) method and classified with Fisher Discriminant Analysis. Trials of the relax class were detected in a first-pass classification operating on prolonged windows, while the second pass classified the remaining trials into left vs. foot, cf. [16] for details.

VI. DATA SET V: MULTI-CLASS PROBLEM, CONTINUOUS EEG

This data set was provided by the IDIAP Research Institute.

A. Description of the data set

This data set was recorded from three healthy subjects during four sessions with no feedback. The subject sat in a normal chair, relaxed arms resting on their legs. There are 3 tasks: imagination of repetitive self-paced left hand movements, imagination of repetitive self-paced right hand movements, and generation of words beginning with the same random letter. All 4 sessions of a given subject were acquired on the same day, each lasting 4 minutes with 5-10 minute breaks in between them. The subject performed a given task for about 15 seconds and then switched randomly to another task at the operator’s request. Thus EEG data is not split in trials since the subjects are continuously performing any of the mental tasks. It is worth noting that while operating a brain-actuated application [28], [29], the user does essentially the same as during the recording sessions. The only difference is that in the former case he/she switches to the next mental task
as soon as the desired action has been performed, i.e., typically
other five submissions). The best submission achieves a similar
much faster than the 15 s pace in the training sessions.
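Because the recordings are continuous rather than split into trials, any analysis has to cut its own segments from the running signal. The following minimal sketch enumerates such overlapping windows, assuming the 512 Hz sampling rate and the one-second window advanced every 62.5 ms that are given in the description of this data set. The function names and the rule of tagging a window with the label of its last sample are illustrative assumptions, not part of the competition specification.

```python
def window_bounds(n_samples, fs=512, win_s=1.0, step_s=0.0625):
    """Return (start, end) sample indices of overlapping analysis windows.

    `end` is exclusive, so each window covers samples start .. end - 1.
    """
    win = round(win_s * fs)    # 512 samples in one second
    step = round(step_s * fs)  # 32 samples, i.e. 16 windows per second
    return [(end - win, end) for end in range(win, n_samples + 1, step)]


def window_labels(sample_labels, bounds):
    """Tag each window with the label of its last sample (sketch assumption)."""
    return [sample_labels[end - 1] for _, end in bounds]


# Two seconds of a hypothetical recording: task 0 for 1 s, then task 1.
labels = [0] * 512 + [1] * 512
bounds = window_bounds(len(labels))
tags = window_labels(labels, bounds)
```

With this labelling rule a window that straddles a task switch is attributed to the new task as soon as its last sample falls into it; other conventions (for example a majority vote over the window) are equally conceivable.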
EEG potentials were measured with a Biosemi portable system using a cap with 32 integrated electrodes located at standard positions of the International 10-20 system. The sampling rate was 512 Hz. Signals were acquired at full DC. No artifact rejection or correction was employed.
Data were provided in two ways, namely, the raw EEG potentials from all 32 electrodes and precomputed features (as described in [30]). The precomputed features were obtained as follows. The raw EEG potentials were first spatially filtered by means of a surface Laplacian. Then, every 62.5 ms (i.e., 16 times per second) the power spectral density (PSD) in the band 8–30 Hz was estimated over the last second of data with a frequency resolution of 2 Hz for the 8 centro-parietal channels C3, Cz, C4, CP1, CP2, P3, Pz, and P4. As a result, an EEG sample is a 96-dimensional vector (8 channels times 12 frequency components).
For each subject there are 3 training files and 1 testing file (the last recording session). The algorithm should provide an output every 0.5 seconds using the last second of data. That is, the goal for the competition was to estimate the class labels for every input vector (either derived from overlapping segments of 1 second of raw EEG data or precomputed sample) of the 3 test files (one per subject). The labels should be estimated in the following way:
1) Precomputed features: Since input vectors are computed 16 times per second, provide the average of 8 consecutive samples (so as to get a response every 0.5 seconds).
2) Raw signals: Compute vectors 16 times per second using the last second of data. Then provide the average of 8 consecutive samples (so as to get a response every 0.5 seconds).
In both cases (precomputed features and raw signals), other (i.e. also past) samples must not be used in order to guarantee a fast response time of the system, although for the competition test data set averaging over more samples could be of benefit. The performance measure is the classification accuracy (number of correct classifications divided by the total number of samples) averaged over the 3 subjects.
B. Outcome of the Competition
There were 26 submissions for this data set, 20 using precomputed features and 6 using raw data. Unfortunately, 4 of the entries did not follow the requirement of using only 1 second of data for estimating the labels, and their methods included smoothing consecutive classifier outputs over longer time windows. Since these results are not comparable to the others, we took them out of the regular scoring. Surprisingly, the best methods used precomputed features. The best submission was by Ferran Galán and colleagues (Univ. of Barcelona) with an error of 31.3 %, but the second-best entry by Xiang Liao (Univ. of Electronic Science and Technology of China) was very close with an error of 31.5 %. In addition, there were 9 contributions with errors between 34.1 % and 40.0 %, of which only one was based on raw signals.
VII. CONCLUSION AND OUTLOOK
Looking at all the winning algorithms of the BCI Competition III reveals several very interesting aspects. (1) Almost all classification methods are linear, which contributes to the linear vs. non-linear debate, cf. [31]. The most popular methods are Fisher Discriminant and linear Support Vector Machines, both introduced in [32] to the field of BCI. (2) In all cases but one (data set V) where multi-channel EEG and oscillatory features were available, the winning method used CSSD ([27]) or CSP, which was suggested for use in the BCI context in [33]. (3) Several of the winning algorithms incorporated the concept of combining oscillatory (ERD) and non-oscillatory (ERP) features (data sets I, IIIb, IVa), first proposed in [34], [35].
Regarding the distribution of the top performances for each data set we have been astonished by the fact that in all cases except data set V there was a substantial gap between the best and the second best submission, cf. [10]. This is in contrast to the last BCI Competition, cf. [24], [9], where in most cases the top competitors had a neck-and-neck race. On the other hand it is interesting to compare the performance achieved on data from different subjects (when available) performing the same mental tasks. In data set IIIa, for example, the best submission achieved an across-subject average kappa value of 0.79 while the least successful submission had a kappa value of 0.64. But on the first of three subjects (K3) the latter submission achieved a very good kappa value of 0.95 where the winner only got 0.82. In data set IIIb the third best team obtained the best result for the first subject (O3) but failed for the second subject (S4) with a value of 0.09, which is very low compared to 0.44 of the winner. This observation gives rise to the conjecture that brain signals are so specific and diverse that specific algorithms are needed. The problem is to select the best suited method given only the training data.
There are some highly relevant topics in BCI research that were not addressed by this competition: (1) transfer of methods and paradigms from offline analyses to feedback applications; (2) optimizing learning in the interaction of two mutually adapting systems, human and machine. A complete validation of BCI approaches with regard to those issues within a competition framework would necessitate that all competitors submit real-time versions of their methods which are then tested in a series of online feedback experiments in the hosting BCI laboratories. This could be a new and ambitious objective of a future BCI competition but the effort can be expected to be very high.
The data sets and their descriptions will continue to be available on the competition web page [7]. Other researchers interested in EEG single-trial analysis are welcome to test their algorithms on these data sets and to report their results. To imitate competition conditions, all selections of method, features and model parameters must be confined to the training sets. However, due to the current availability of the labels of the test data and the publication of thorough analyses of these data, future classification results of the competition data cannot fairly be compared to the original submissions.
ACKNOWLEDGMENT
We thank all people who contributed to this competition, either by submitting classification results, or by giving feedback about the competition.
REFERENCES
[1] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan, “Brain-computer interfaces for communication and control,” Clin. Neurophysiol., vol. 113, pp. 767–791, 2002.
[2] E. A. Curran and M. J. Stokes, “Learning to control brain activity: A review of the production and control of EEG components for driving brain-computer interface (BCI) systems,” Brain Cogn., vol. 51, pp. 326–336, 2003.
[3] A. Kübler, B. Kotchoubey, J. Kaiser, J. Wolpaw, and N. Birbaumer, “Brain-computer communication: Unlocking the locked in,” Psychol. Bull., vol. 127, no. 3, pp. 358–375, 2001.
[4] J. del R. Millán, Handbook of Brain Theory and Neural Networks, 2nd ed. Cambridge: MIT Press, 2002, ch. Brain-computer interfaces.
[5] P. Sajda, A. Gerson, K.-R. Müller, B. Blankertz, and L. Parra, “BCI competition iii (web page),” 2001. [Online]. Available: http://liinc.bme.columbia.edu/competition.htm
[6] B. Blankertz, “BCI Competition 2003 (web page),” 2003. [Online]. Available: http://ida.first.fhg.de/projects/bci/competition/
[7] B. Blankertz, “BCI Competition III (web page),” 2004. [Online]. Available: http://ida.first.fhg.de/projects/bci/competition_iii/
[8] P. Sajda, A. Gerson, K.-R. Müller, B. Blankertz, and L. Parra, “A data analysis competition to evaluate machine learning algorithms for use in brain-computer interfaces,” IEEE Trans. Neural Sys. Rehab. Eng., vol. 11, no. 2, pp. 184–185, 2003.
[9] B. Blankertz, K.-R. Müller, G. Curio, T. M. Vaughan, G. Schalk, J. R. Wolpaw, A. Schlögl, C. Neuper, G. Pfurtscheller, T. Hinterberger, M. Schröder, and N. Birbaumer, “The BCI competition 2003: Progress and perspectives in detection and discrimination of EEG single trials,” IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 1044–1051, 2004.
[10] B. Blankertz, “BCI Competition III results (web page),” 2005. [Online]. Available: http://ida.first.fhg.de/projects/bci/competition_iii/results/
[11] Q. Wei, F. Meng, Y. Wang, and S. Gao, “BCI Competition III – data set I,” IEEE Trans. Neural Sys. Rehab. Eng., 2006, submitted.
[12] A. Rakotomamonjy and V. Guigue, “BCI Competition III – data set II,” IEEE Trans. Neural Sys. Rehab. Eng., 2006, submitted.
[13] C. Guan, H. Zhang, and Y. Li, “BCI Competition III – data set IIIa,” IEEE Trans. Neural Sys. Rehab. Eng., 2006, submitted.
[14] S. Lemm, “BCI Competition III – data set IIIb,” IEEE Trans. Neural Sys. Rehab. Eng., 2006, submitted.
[15] Y. Wang, H. Yuan, D. Zhang, X. Gao, Z. Zhang, and S. Gao, “BCI Competition III – data set IVa,” IEEE Trans. Neural Sys. Rehab. Eng., 2006, submitted.
[16] D. Zhang and Y. Wang, “BCI Competition III – data set IVc,” IEEE Trans. Neural Sys. Rehab. Eng., 2006, submitted.
[17] F. Galán, F. Oliva, and J. Guàrdia, “BCI Competition III – data set V,” IEEE Trans. Neural Sys. Rehab. Eng., 2006, submitted.
[18] T. N. Lal, T. Hinterberger, G. Widman, M. Schröder, J. Hill, W. Rosenstiel, C. E. Elger, B. Schölkopf, and N. Birbaumer, “Methods towards invasive human brain computer interfaces,” in Advances in Neural Information Processing Systems (NIPS) 17, L. K. Saul, Y. Weiss, and L. Bottou, Eds. Cambridge, MA: MIT Press, 2005, pp. 737–744.
[19] G. Schalk, D. J. McFarland, T. Hinterberger, N. Birbaumer, and J. Wolpaw, “BCI2000: A general-purpose brain-computer interface (BCI) system,” IEEE Trans. Biomed. Eng., 2004.
[20] E. Donchin, K. M. Spencer, and R. Wijesinghe, “Assessing the speed of a P300-based brain-computer interface,” IEEE Trans. Neural Sys. Rehab. Eng., vol. 8, no. 2, pp. 174–179, 2000.
[21] L. Farwell and E. Donchin, “Talking off the top of your head: toward a mental prosthesis utilizing event-related brain potentials,” Electroencephalogr. Clin. Neurophysiol., vol. 70, pp. 510–523, 1988.
[22] A. Schlögl, O. Filz, H. Ramoser, and G. Pfurtscheller, “GDF - a general dataformat for biosignals,” 2004. [Online]. Available: http://www.dpmi.tu-graz.ac.at/~schloegl/matlab/eeg/gdf4/TR_GDF.pdf
[23] A. Schlögl, “Results of the BCI-competition 2005 for datasets IIIa and IIIb,” 2005. [Online]. Available: http://bci.tugraz.at/schloegl/publications/TR_BCI2005_III.pdf
[24] B. Blankertz, “BCI Competition 2003 results (web page),” 2003. [Online]. Available: http://ida.first.fhg.de/projects/bci/competition/results/
[25] A. Schlögl, C. Neuper, and G. Pfurtscheller, “Estimating the mutual information of an EEG-based Brain-Computer-Interface,” Biomed. Technik, vol. 47, no. 1-2, pp. 3–8, 2002.
[26] A. Schlögl, “BIOSIG - an open source software library for biomedical signal processing,” 2003–2005. [Online]. Available: http://BIOSIG.SF.NET
[27] Y. Wang, P. Berg, and M. Scherg, “Common spatial subspace decomposition applied to analysis of brain responses under multiple task conditions: a simulation study,” Clin. Neurophysiol., vol. 110, pp. 604–614, 1999.
[28] J. del R. Millán, F. Renkens, J. Mouriño, and W. Gerstner, “Non-invasive brain-actuated control of a mobile robot by human EEG,” IEEE Trans. Biomedical Engineering, vol. 51, pp. 1026–1033, 2004.
[29] J. del R. Millán, F. Renkens, J. Mouriño, and W. Gerstner, “Brain-actuated interaction,” Artificial Intelligence, vol. 159, pp. 241–259, 2004.
[30] J. del R. Millán, “On the need for on-line learning in brain-computer interfaces,” in Proc. Int. Joint Conf. Neural Networks, 2004.
[31] K.-R. Müller, C. W. Anderson, and G. E. Birch, “Linear and non-linear methods for brain-computer interfaces,” IEEE Trans. Neural Sys. Rehab. Eng., vol. 11, no. 2, pp. 165–169, 2003.
[32] B. Blankertz, G. Curio, and K.-R. Müller, “Classifying single trial EEG: Towards brain computer interfacing,” in Advances in Neural Inf. Proc. Systems (NIPS 01), T. G. Diettrich, S. Becker, and Z. Ghahramani, Eds., vol. 14, 2002, pp. 157–164.
[33] J. Müller-Gerking, G. Pfurtscheller, and H. Flyvbjerg, “Designing optimal spatial filters for single-trial EEG classification in a movement task,” Clin. Neurophysiol., vol. 110, pp. 787–798, 1999.
[34] G. Dornhege, B. Blankertz, G. Curio, and K.-R. Müller, “Combining features for BCI,” in Advances in Neural Inf. Proc. Systems (NIPS 02), S. Becker, S. Thrun, and K. Obermayer, Eds., vol. 15, 2003, pp. 1115–1122.
[35] G. Dornhege, B. Blankertz, G. Curio, and K.-R. Müller, “Boosting bit rates in non-invasive EEG single-trial classifications by feature combination and multi-class paradigms,” IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 993–1002, June 2004.