
RACCOONS VS DEMONS: MULTICLASS LABELED P300 DATASET

A PREPRINT

Goncharenko V.1, 3, Grigoryan R.1, 2, and Samokhina A.1, 3

1Neiry, 2MSU, 3MIPT

May 12, 2020

ABSTRACT

We publish a dataset of visual P300 BCI performed in the Virtual Reality (VR) game Raccoons versus Demons (RvD). The data contains rich labels incorporating information about the chosen stimulus, enabling us to estimate the model's confidence at each stimulus prediction stage. Data and experiment code are available at https://gitlab.com/impulse-neiry_public/raccoons-vs-demons

Keywords P300, EEG, BCI, Transfer learning

1 Introduction

Initially, Brain-Computer Interfaces (BCI) were developed mostly with disabled patients in mind (e.g. [Bla+04], [Ric+13], [Hof+08]). Nowadays there are numerous attempts to employ BCI solutions for healthy people using non-invasive electrodes (see [Mar+12]).

Moreover, BCI has repeatedly been applied to recreational activities such as games [And+16], [Kap+13].

Another crucial factor for further BCI development is the availability of processing and visualization tools. In the beginning every laboratory used its own programming environment and data format, and some even used proprietary platforms. Fortunately, at the moment we have fully open-source systems based on the Python programming language, like MNE [Gra+13], PyRiemann [CBA13] and MOABB [JB18], as well as the EDF data format [KO03] designed specifically to store EEG/MEG data.

2 Materials and methods

2.1 Participants

61 healthy participants (23 males) naive to BCI, with a mean age of 28 years (range 19 to 45), took part in the study.

All subjects signed informed consent (see appendix) and passed primary screening of their health and condition.

2.2 Stimulation and EEG recording

The EEG was recorded with an NVX-52 encephalograph (MCS, Zelenograd, Russia) at 500 Hz. We used 8 sponge electrodes (Cz, P3, P4, PO3, POz, PO4, O1, O2), see Fig. 1. Stimuli were presented with an HTC Vive Pro VR headset with TTL hardware sync.

2.3 Experimental procedure

Participants were asked to play the P300 BCI game in virtual reality. BCI was embedded into a game plot with the player posing as a forest warden. The player was supposed to feed animals and protect them from demons.


Figure 1: Electrode positions

Figure 2: Scene from VR game showing raccoons and demons

We have used an oddball paradigm. At the learning stage, 5 numbered animated raccoons (see Fig. 2) were presented to the participant. The stimulation consisted of raccoon jumps in a random order to prevent activation anticipation (for the animation timeline see Fig. 2B). Participants were instructed to count the jumps of a specific raccoon for 5 sessions. Each session consisted of all targets being activated 10 times, after which the animal received food. This produced 50 target and 450 non-target evoked potentials that were used for classification.

During the BCI feedback scene, stimuli were presented as animated jumping demons. The instruction for each participant was essentially the same as for the learning session (counting jumps of the target stimulus). Before each session the participants called out the number of the demon they would be counting during the session. This was done to get rid of the usual copy-spell scheme and to preserve interactivity (the participant chooses the target himself). The selected demon was destroyed at the end of the scene (10 activations of all stimuli). Then the demons came closer to the participant, eventually grabbing one of the raccoons and running away. When the target was removed from the screen one way or another, a new demon was spawned at the initial position, so the number of stimuli remained constant.

Data was recorded for later offline analysis. The game was accompanied by spoken instructions and background music.


Figure 3: Data hierarchy scheme

3 Data structure and terms

Essentially in an oddball paradigm we have to solve 2 different but connected classification tasks:

1. Binary - classification of a single epoch as containing a P300 or not (in other words, being target or empty). In this case each epoch is classified independently. It is suitable for online training (calibration, then testing on the same person), because only 250 epochs are enough to train such a model. Another noticeable trait of this type of problem is imbalanced training and testing sets, because each stimulus is activated the same number of times while only one of them is the target. So we end up with a 1 to (s − 1) class ratio, where s is the number of stimuli. For this task the accuracy metric is not sufficient because of the class imbalance, so we track precision, recall, F1 and ROC AUC in all experiments.

2. Multiclass - classification of the stimulus chosen by the user. As long as we have several stimuli (usually 5 to 7), it is a multiclass problem. Here we consider several epochs at a time (usually an equal number from each stimulus) and have the prior knowledge that only one stimulus can be the target, thus the others are empty. In this case classification is more balanced because the participant chooses different stimuli each time (in fact it depends on the visualization setup). Usually this problem is solved by summing the probabilities of the binary problem per stimulus and choosing the stimulus with the highest sum. Regarding metrics, this task is well described by accuracy.
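To make the relation between the two tasks concrete, here is a minimal sketch (ours, not taken from the released code) of aggregating per-epoch binary probabilities into a multiclass decision and of the binary metrics we track; the function names and array shapes are assumptions for illustration:

    import numpy as np
    from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

    def choose_stimulus(epoch_probs, stimulus_ids, n_stimuli):
        """epoch_probs: P300 probability per epoch; stimulus_ids: which stimulus produced each epoch."""
        scores = np.zeros(n_stimuli)
        for p, s in zip(epoch_probs, stimulus_ids):
            scores[s] += p                 # sum binary probabilities per stimulus
        return int(np.argmax(scores))      # stimulus with the highest activation score

    def binary_metrics(y_true, probs, threshold=0.5):
        """Metrics for the imbalanced 1 : (s - 1) binary task."""
        y_pred = (probs >= threshold).astype(int)
        return {
            "precision": precision_score(y_true, y_pred),
            "recall": recall_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred),
            "roc_auc": roc_auc_score(y_true, probs),
        }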

Let us introduce some terminology helpful in the further narration (for an illustration see Fig. 3):

• Round - a sequence of epochs with one target where each stimulus was activated once (and produced one epoch), so this is the quantum of multiclass prediction. Usually stimuli are activated in random order.

• Act - a sequence of rounds with one target, after which the decision on the target stimulus is made. Usually the number of rounds is from 5 to 10 in our experiments.

• (Game) Record - sequence of acts obtained from one person during one recording session (one game).

• Dataset - collection of records sharing the same visual activation and hardware setup.
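The hierarchy above (Fig. 3) can be illustrated with a few hypothetical data classes; the field names are ours and not the schema of the released dataset:

    from dataclasses import dataclass
    from typing import List
    import numpy as np

    @dataclass
    class Round:
        """One activation of every stimulus, one epoch per stimulus."""
        epochs: np.ndarray   # shape (n_stimuli, n_channels, n_samples)
        target: int          # index of the target stimulus

    @dataclass
    class Act:
        """Several rounds with the same target; yields one multiclass decision."""
        rounds: List[Round]

    @dataclass
    class Record:
        """All acts obtained from one person during one game."""
        acts: List[Act]
        subject_id: str

    @dataclass
    class Dataset:
        """Records sharing the same visual activation and hardware setup."""
        records: List[Record]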

4 Preprocessing and Classification

As the main metric we use multiclass accuracy, because in the final gaming environment this is how the user measures performance. Binary task metrics are measured in all experiments as well, and in many cases higher binary metrics mean higher multiclass accuracy.

We divide EEG signal handling into two principal steps:


Figure 4: Average P300 (target) and non-P300 (empty) responses by electrodes

1. Preprocessing - which includes signal processing and some non-statistical transformations, e.g. clipping, epoch slicing, etc. At the end of this step we obtain epochs with labels (either binary or multiclass).

2. Classification - applying statistical classifiers to predict labels from the epoch data.

For the preprocessing step we search for optimal hyperparameter values by cross-validation. The reason for this seemingly trivial step is the lack of motivation in the literature for choosing such parameters. Although many studies report similar filtering (e.g. [Con+11], [BC14], [Gug+09], [RG08], [Kay+18]), they do not describe the motivation for choosing one or another downsampling factor, filtering band, clipping, or epoch duration. Below we report results of the classification pipeline with varying parameters and choose the ones that minimize the final epoch representation without metrics degradation.

For this step we use a predefined set of classifiers.

Binary epoch classification is a common task for BCI, and the main approaches are well described in [Lot+18]. In recent years Deep Learning has emerged in all fields, so EEG analysis is not an exception, and we have thorough overviews in [CHC19] and [Roy+19].
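As an illustration of the binary task (a sketch under our own assumptions, not the pipeline used in the experiments), a simple linear classifier can be cross-validated on flattened preprocessed epochs:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    def binary_cv_score(epochs: np.ndarray, labels: np.ndarray) -> float:
        """epochs: (n_epochs, n_channels, n_samples); labels: 1 for target, 0 for empty."""
        X = epochs.reshape(len(epochs), -1)                    # flatten channels x time
        clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
        # ROC AUC is insensitive to the 1 : (s - 1) class imbalance
        return cross_val_score(clf, X, labels, cv=5, scoring="roc_auc").mean()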

Our preprocessing is quite standard:

1. Decimation with an anti-aliasing filter from the SciPy package [Vir+20]

2. Bandpass filtering with a Butterworth IIR filter

3. Clipping of off-scale samples

4. Channel-wise standard scaling (subtracting the mean and dividing by the standard deviation for all values of each channel)

Each test was conducted with the other parameters fixed at some sane level (such that the qualitative result persisted).
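A minimal sketch of the four steps above (the decimation factor, cutoff frequency and clipping level are placeholders, not the tuned values reported below):

    import numpy as np
    from scipy.signal import butter, decimate, sosfiltfilt

    def preprocess(eeg, fs=500, decim=10, low_cut=0.2, clip_sigma=5.0):
        """eeg: continuous recording of shape (n_channels, n_samples) at rate fs."""
        # 1. Decimation with SciPy's built-in anti-aliasing filter
        x = decimate(eeg, decim, axis=-1, zero_phase=True)
        # 2. Butterworth IIR high-pass (the upper edge is effectively set by decimation)
        sos = butter(4, low_cut, btype="highpass", fs=fs / decim, output="sos")
        x = sosfiltfilt(sos, x, axis=-1)
        # 3. Clip off-scale samples (placeholder threshold in standard deviations)
        std = x.std(axis=-1, keepdims=True)
        x = np.clip(x, -clip_sigma * std, clip_sigma * std)
        # 4. Channel-wise standard scaling
        return (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)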

4.1 Decimation

First we need to determine the minimal signal frequency we will operate on in all later stages. In Fig. 5 we see that classification degrades for decimation factors above 35, which corresponds to a sampling rate of about 15 Hz. So this is the upper bound of the frequency range in which the P300 occurs. Different lines correspond to different final classifiers, and transparent zones of the corresponding colour show the standard deviation across participants.
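The sweep behind Fig. 5 can be sketched as follows; the classifier and the set of factors are placeholders (the actual experiments compare several final classifiers):

    import numpy as np
    from scipy.signal import decimate
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def accuracy_vs_decimation(epochs, labels, factors=(5, 10, 20, 35, 50)):
        """epochs: (n_epochs, n_channels, n_samples) at 500 Hz; labels: target / empty."""
        results = {}
        for q in factors:
            x = decimate(epochs, q, axis=-1, zero_phase=True)   # new rate = 500 / q Hz
            X = x.reshape(len(x), -1)
            clf = LogisticRegression(max_iter=1000)
            results[q] = cross_val_score(clf, X, labels, cv=5).mean()
        return results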

4.2 Filtering

Since we optimized decimation and chose the minimal available frequency, we only need to filter out low frequencies with a Butterworth filter. In Fig. 6 we see that filtering out very low frequencies (below 0.2 Hz) is crucial for classification, while the choice of the cutoff within the 0.2 to 1 Hz band does not affect the final classification quality.


Figure 5: Dependency of final prediction accuracy on decimation factor

Figure 6: Dependency of final prediction accuracy on lower filtering bound

4.3 Epoch duration

The same procedure was applied to the epoch duration. First we tested the right bound and found quality degradation at 600 ms (see Fig. 8).

5 Multiclass confidence level

Having a robust binary target-vs-empty epoch classifier (trained for each person separately), we may proceed to aggregating its predictions into a final multiclass problem solution.

As a standard approach we consider taking epochs grouped by the activated stimulus, predicting their probabilities of being target, and summing these probabilities. This way we obtain an "activation score" for each stimulus. The multiclass prediction then consists in choosing the stimulus with the highest activation score. We call this baseline multiclass classification.

In a gaming environment (which is our case) the misclassification cost is high because the person feels frustrated and cheated. So we decided to introduce a confidence score by which we could reject potentially wrong predictions. In case of rejection a "misfire" animation is played and we ask the user to repeat the choice procedure.


Figure 7: Dependency of final prediction accuracy on left bound of epoch

Figure 8: Dependency of final prediction accuracy on right bound of epoch

Some attempts to fix classification errors have been made before. For example, in [Mar+12] the EEG signal itself was used to predict whether the SVM classifier was right or not.

5.1 Heuristics

This score can be based on the prior knowledge that one act has only one target. Having activation scores for each stimulus, we expect a good binary classifier to give low scores to all but one stimulus. The distribution of sorted activation scores (Fig. 9) confirms this statement.

When there is no single obvious leader, there is reason to suspect a lower probability of getting the right answer. So our task is to find some criterion (which we call a confidence score) that separates wrongly classified acts. During the experiments we considered different confidence scores based on the activation scores:

• sum of activations
• max activation
• mean activation
• median activation
• min activation
• max - mean
• max - median
• second max
• max - second max
• max - mean of all but max
• max - median of all but max
• max - sum of activations

Figure 9: Distribution of sorted activation scores

Figure 10: Confidence score distribution for the 'max-median' heuristic

To estimate the threshold value we plot the criterion distribution for correctly classified acts and for misclassified ones (Fig. 10).

On average, confidence scores are lower for misclassified acts, so we are able to pick those acts among which the rate of correctly classified ones (precision) is high. But considering the game setup, we could not afford to trigger the "misfire" action too frequently, because it would be too disturbing for the user. We decided the rejection rate could not be more than 30%; more precisely, 10% to 20% would be optimal, because the user encounters the misfire at some point but does not get overwhelmed by this mechanic.

Finally, we can see that 'max-mean' and 'max-median' perform best at 5-50% of rejected acts, which is the region of interest for us (see Fig. 11). For the best heuristic, 'max-median', we could increase accuracy by 5% while sacrificing only 12% of all acts on average. However, these misfires are concentrated in a few participants, as evidenced by a median accuracy of 90% in this case.
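A minimal sketch of these heuristics and the rejection rule (the threshold value is a placeholder; in practice it is chosen from distributions such as Fig. 10 so that the rejection rate stays in the 10-20% range):

    import numpy as np

    def confidence_scores(activation_scores):
        """activation_scores: summed binary probabilities, one value per stimulus."""
        a = np.sort(np.asarray(activation_scores))[::-1]   # descending
        return {
            "max": a[0],
            "max_minus_mean": a[0] - a.mean(),
            "max_minus_median": a[0] - np.median(a),
            "max_minus_second": a[0] - a[1],
        }

    def predict_with_rejection(activation_scores, threshold=0.8, heuristic="max_minus_median"):
        """Return the chosen stimulus index, or None to trigger the 'misfire' animation."""
        if confidence_scores(activation_scores)[heuristic] < threshold:
            return None                                    # reject and ask the user to repeat
        return int(np.argmax(activation_scores))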

5.2 Stacking

In fact we perform a kind of stacking, so we may apply a model to predict whether a given act could be misclassified.

The curve of negative rate vs precision in 5-fold cross-validation is shown in Fig. 11 for Logistic Regression, Random Forest and our Heuristics Threshold classifiers.
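A sketch of this stacking idea under our own assumptions (the feature set and model choice are ours): a second-level classifier is trained on per-act confidence features to predict whether the baseline multiclass prediction was correct:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    def fit_rejection_model(act_scores, correct):
        """act_scores: (n_acts, n_stimuli) activation scores; correct: 1 if the act was classified right."""
        a = np.sort(act_scores, axis=1)[:, ::-1]           # sort each act's scores descending
        X = np.column_stack([a[:, 0],
                             a[:, 0] - a.mean(axis=1),
                             a[:, 0] - np.median(a, axis=1),
                             a[:, 0] - a[:, 1]])
        clf = LogisticRegression(max_iter=1000)
        # Out-of-fold probabilities give an unbiased view of the precision vs rejection-rate trade-off
        oof = cross_val_predict(clf, X, correct, cv=5, method="predict_proba")[:, 1]
        clf.fit(X, correct)
        return clf, oof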


Figure 11: Negative rate vs Precision for different models

This plot shows that in the 10 to 25% range our heuristics work as well as the trained classifiers and provide a 4 to 9% increase in the precision of predictions (see Fig. 11).

6 Discussion

The introduction of multiclass labels allows us to incorporate the prior knowledge that only one stimulus is chosen per session. This can efficiently be used in neural network architecture development as well as in approaches based on Bayesian methods.

References

[KO03] Bob Kemp and Jesus Olivan. "European data format 'plus' (EDF+), an EDF alike standard format for the exchange of physiological data". In: Clinical Neurophysiology: official journal of the International Federation of Clinical Neurophysiology 114 (Oct. 2003), pp. 1755–1761. DOI: 10.1016/S1388-2457(03)00123-8.

[Bla+04] Benjamin Blankertz et al. "The BCI competition 2003: Progress and perspectives in detection and discrimination of EEG single trials". In: IEEE Trans. Biomed. Eng. (2004).

[Hof+08] Ulrich Hoffmann et al. "An efficient P300-based brain-computer interface for disabled subjects". In: Journal of Neuroscience Methods 167.1 (2008). Datasets and MATLAB code are available at http://bci.epfl.ch, pp. 115–125. DOI: 10.1016/j.jneumeth.2007.03.005. URL: http://infoscience.epfl.ch/record/101093.

[RG08] Alain Rakotomamonjy and Vincent Guigue. "BCI competition III: dataset II - ensemble of SVMs for BCI P300 speller". In: Biomedical Engineering, IEEE Transactions on 55 (Apr. 2008), pp. 1147–1154. DOI: 10.1109/TBME.2008.915728.

[Gug+09] Christoph Guger et al. "How many people are able to control a P300-based brain-computer interface (BCI)?" In: Neuroscience Letters 462 (Sept. 2009), pp. 94–98. DOI: 10.1016/j.neulet.2009.06.045.

[Con+11] Marco Congedo et al. ""Brain Invaders": a prototype of an open-source P300-based video game working with the OpenViBE platform". In: 5th International Brain-Computer Interface Conference 2011 (BCI 2011). Graz, Austria, Sept. 2011, pp. 280–283. URL: https://hal.archives-ouvertes.fr/hal-00641412.

[Mar+12] Perrin Margaux et al. "Objective and Subjective Evaluation of Online Error Correction during P300-Based Spelling". In: Adv. in Hum.-Comp. Int. 2012 (Jan. 2012). ISSN: 1687-5893. DOI: 10.1155/2012/578295. URL: https://doi.org/10.1155/2012/578295.


[CBA13] Marco Congedo, Alexandre Barachant, and Anton Andreev. "A New Generation of Brain-Computer Interface Based on Riemannian Geometry". In: CoRR abs/1310.8115 (2013). arXiv: 1310.8115. URL: http://arxiv.org/abs/1310.8115.

[Gra+13] Alexandre Gramfort et al. "MEG and EEG data analysis with MNE-Python". In: Frontiers in Neuroscience 7 (2013), p. 267. ISSN: 1662-453X. DOI: 10.3389/fnins.2013.00267. URL: https://www.frontiersin.org/article/10.3389/fnins.2013.00267.

[Kap+13] A. Y. Kaplan et al. "Adapting the P300-Based Brain-Computer Interface for Gaming: A Review". In: IEEE Transactions on Computational Intelligence and AI in Games 5.2 (2013), pp. 141–149.

[Ric+13] Angela Riccio et al. "Attention and P300-based BCI performance in people with amyotrophic lateral sclerosis". In: Frontiers in Human Neuroscience 7 (2013), p. 732. ISSN: 1662-5161. DOI: 10.3389/fnhum.2013.00732. URL: https://www.frontiersin.org/article/10.3389/fnhum.2013.00732.

[BC14] Alexandre Barachant and Marco Congedo. "A Plug&Play P300 BCI Using Information Geometry". In: CoRR abs/1409.0107 (2014). arXiv: 1409.0107. URL: http://arxiv.org/abs/1409.0107.

[And+16] Anton Andreev et al. "Recreational Applications of OpenViBE: Brain Invaders and Use-the-Force". In: Brain-Computer Interfaces 2: Technology and Applications. Ed. by Maureen Clerc, Laurent Bougrain, and Fabien Lotte. Chap. 14. John Wiley, Aug. 2016, pp. 241–257. URL: https://hal.archives-ouvertes.fr/hal-01366873.

[JB18] Vinay Jayaram and Alexandre Barachant. "MOABB: trustworthy algorithm benchmarking for BCIs". In: Journal of Neural Engineering 15.6 (Sept. 2018), p. 066011. DOI: 10.1088/1741-2552/aadea0. URL: https://doi.org/10.1088/1741-2552/aadea0.

[Kay+18] Murat Kaya et al. "A large electroencephalographic motor imagery dataset for electroencephalographic brain computer interfaces". In: Scientific Data 5 (Oct. 2018), p. 180211. DOI: 10.1038/sdata.2018.211.

[Lot+18] F. Lotte et al. "A review of classification algorithms for EEG-based brain-computer interfaces: a 10 year update". In: Journal of Neural Engineering 15.3 (Apr. 2018), p. 031005. DOI: 10.1088/1741-2552/aab2f2. URL: https://doi.org/10.1088/1741-2552/aab2f2.

[CHC19] Alexander Craik, Yongtian He, and Jose L. Contreras-Vidal. "Deep learning for electroencephalogram (EEG) classification tasks: a review". In: Journal of Neural Engineering 16.3 (Apr. 2019), p. 031001. DOI: 10.1088/1741-2552/ab0ab5. URL: https://doi.org/10.1088/1741-2552/ab0ab5.

[Roy+19] Yannick Roy et al. "Deep learning-based electroencephalography analysis: a systematic review". In: Journal of Neural Engineering 16.5 (Aug. 2019), p. 051001. DOI: 10.1088/1741-2552/ab260c. URL: https://doi.org/10.1088/1741-2552/ab260c.

[Vir+20] Pauli Virtanen et al. "SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python". In: Nature Methods 17 (2020), pp. 261–272. DOI: 10.1038/s41592-019-0686-2.
