Wearable Systems I: Summary, D-ITET, HS 08

by Stefan Scheidegger, 06.01.2008

Overview

0. Introduction

1. Context

2. Signal Processing, Signal Properties, Features
- Context and Learning
- Time Series
- Signals
- Feature Selection
- Digital Filters
- Wavelet

3. Evaluation of Classifiers
- Confusion Matrix
- ROC Graph
- Qualification in Continuous Data Streams: the Null Class Challenge

4. Segmentation
- Basic Approach: Piecewise linear representation of time series
- Sliding Window: the straightforward approach
- Top down
- Bottom up
- SWAB: Sliding window and bottom up
- SAX: Symbolic Aggregate approXimation
- Comparing two Time Series, represented by SAX

5. Bayes
- Data Fusion
- Probability Models
- Probability Models, Bayes' Theorem
- Data Fusion with Bayes' Rule
- Bayes' Rule, Recursive
- Estimation, Decision
- Log-Likelihood
- Information Measures
- Sensor Architectures
- Sensor Modelling

6. Density Estimation
- Parametric Methods
- Non-parametric Distributions
- k-Nearest Neighbour (kNN) Rule, Classifier
- NCC (Nearest Class Classifier)
- Discriminant Analysis (Discriminant Analysis & Functions)

7. Dempster-Shafer
- D-S: Theory of Evidence
- Belief (Support), Plausibility, Uncertainty Interval
- Combination Rules

8. Non-metric Methods, Decision Trees, C4.5

9. Classification with Support Vector Machines (SVM)
- Construction of Separating Hyperplanes with a Large Margin
- Kernel Approach
- Multi-class Classification with SVM (Multiclass SVM)

10. Clustering and Self Organizing Maps (SOM)
- Clustering
- Self Organizing Maps
- Emotion SOM

11. Hidden Markov Models (HMM)
- Activity Recognition with Hidden Markov Models
- Model estimation, common/combined HMMs

0. Introduction

- Wearable assistant: extended perception of the environment, situation-dependent support, proactive provision of information
- Stress
- Detection: 5-level approach
Sensor (e.g. ECG) → Observation Features (e.g. HRV) → Status (e.g. physical activity) → Affects (e.g. stress) → Applications (e.g. behavioural science)
- Context:
o Sensors (motion, location, …)
o Preprocessing
o Features (filtering, statistics, FFT, …)
o Classification (Bayes, LDA, kNN, Dempster, Kalman, Particle, SVM, HMM, ROC)

1. Context

- Just-in-time information: any time, any place
- Context: the surrounding situation that gives meaning to the content (user, computer, physical, time, …)
- Context awareness: active (the application automatically changes its behaviour) or passive (the application presents the updated context to the user)
- Useful context: the system performs actions that were not requested by the user. This reduces the communication bandwidth needed between user and system.
- Context for Wearable Systems:
o User context: state (vital parameters), activity
o Extended location: indoor, outdoor, time, …
o Social context

Context Recognition

- Hierarchy: 1. raw sensor information; 2. low-level signal analysis (position, speed); 3. elementary context alphabet (raising hand); 4. complex context alphabet (presenting, eating); 5. simple context (presenting, eating); 6. high-level context (participating in a meeting)
- Data path: Sensors → Signal Processing → Feature Extraction → Classification → Identification
- Mature technologies: context sensors (inertial sensors: accelerometer, gyro, compass)

2. Signal Processing, Signal Properties, Features

Context and Learning

- Complete path from the sensor up to user feedback → methods from several disciplines such as signal processing, pattern recognition, machine learning, …
- Machine learning: supervised (desired output given), unsupervised, and reinforcement (reward or punishment) learning
- Goals of supervised learning: classification, regression
- Goals of unsupervised learning: build a model or a useful representation of the data (finding clusters, dimensionality reduction, good explanations of the data, modelling the data density)

Time Series

- Collection of observations made sequentially in time
- Motif discovery, clustering of data, rule discovery, novelty discovery

Signals

- Sensor signals
o Data acquisition and initial preprocessing (calibration, time resolution, amplification, impedance conversion, anti-alias filtering, sampling, …)
o Signal conditioning (amplification, attenuation, filtering: manipulation of the frequency spectrum)
- Segmentation: (temporal) separation of the important sections; the boundaries are usually not known a priori → estimation:
o Local threshold segmentation
o Edge-oriented segmentation
o Curvature-oriented segmentation
o Frequency-based segmentation
o Correlation with a reference pattern
- Signal properties, features: reduction of the information content to meaningful features (→ feature space)
o Common for activity recognition: mean, RMS, variance / standard deviation, normalized signal, covariance, median, threshold, number of threshold crossings, max and min values, min-max difference, derivatives
o From audio signal processing: zero crossing rate, fluctuation of the signal amplitude
o In the frequency domain: fluctuation in the frequency domain, spectral centre of gravity, bandwidth, energy density spectrum, autocorrelation function, DFT, DWT
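A few of the time-domain features listed above can be computed per signal window as in the following minimal sketch. The function name `window_features`, the use of the population variance and the definition of the zero crossing rate as sign changes per neighbouring sample pair are assumptions made for illustration, not definitions from the lecture.

```python
import math

def window_features(x):
    """Return a few common time-domain features of one signal window."""
    n = len(x)
    mean = sum(x) / n
    rms = math.sqrt(sum(v * v for v in x) / n)
    var = sum((v - mean) ** 2 for v in x) / n   # population variance (assumption)
    # zero crossing rate: fraction of neighbouring sample pairs with a sign change
    crossings = sum(1 for a, b in zip(x, x[1:]) if (a < 0) != (b < 0))
    return {
        "mean": mean,
        "rms": rms,
        "var": var,
        "std": math.sqrt(var),
        "range": max(x) - min(x),
        "zcr": crossings / (n - 1),
    }

window = [1.0, -1.0, 1.0, -1.0]   # toy window, hypothetical data
feats = window_features(window)
```

In practice such a function would be applied to each segment or sliding window of the sensor stream, producing one feature vector per window.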

Feature Selection

- Select the k best features out of a set which maximize the cost function
- Motivation: curse of dimensionality (classifier performance drops above a certain number of features), reduced effort (e.g. saving energy)
- Complete search for feature selection: too many combinations → heuristic methods, in two groups: filter and wrapper methods
- Filter methods: select the features with the best quality measures, independent of the subsequent classification: Max-Dependency, Max-Relevance, Min-Redundancy. Fast, but with a tendency towards large feature sets
- Wrapper methods: selection controlled by the classification results: Genetic Algorithms, Simulated Annealing, Greedy Hill Climbing. Good accuracy, but slow and lacking generality

- Greedy Hill Climbing Methods:
o Sequential Forward Selection (SFS): start with an empty feature set, add the most significant feature until a predefined number is selected; best results for small feature sets
o Sequential Backward Selection (SBS): start with all features, remove the least significant one until a predefined number is selected; best results for large feature sets
Cons: both can get stuck in a local optimum
o Sequential Forward Floating Selection (SFFS): add like SFS, then test whether removing the least significant feature gives better performance than before with the same number of features
o Sequential Backward Floating Selection (SBFS): accordingly
- Stochastic Search
o Simulated Annealing: take a step in the feature space, compute the error change $\Delta E$, accept if $\Delta E < 0$, else accept with probability $\exp(-\Delta E / T)$; progressively cool down (lower T)
o Genetic Algorithms: keep a population of candidates; chromosome = bit vector defining a feature subset, mutation = bit flip, cross-over = cutting two chromosomes and swapping the tails; new feature sets are created by mutation and cross-over
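Sequential Forward Selection can be sketched as follows. The scoring function stands in for whatever cost function or classifier accuracy is used; `toy_score` and the `USEFULNESS` table are hypothetical values invented for this illustration.

```python
def sfs(features, score, k):
    """Sequential Forward Selection: greedily add the feature that
    yields the best score until k features are selected."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical per-feature usefulness; "mean" and "median" are redundant.
USEFULNESS = {"mean": 0.5, "median": 0.45, "zcr": 0.3, "range": 0.1}

def toy_score(subset):
    """Toy cost function: summed usefulness minus a redundancy penalty."""
    s = sum(USEFULNESS[f] for f in subset)
    if "mean" in subset and "median" in subset:
        s -= 0.4
    return s

chosen = sfs(USEFULNESS, toy_score, 2)
```

Because the second-best single feature ("median") is redundant with the first pick, the greedy search prefers "zcr" in the second step. A wrapper method would replace `toy_score` with the measured classification performance.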

Digital Filters

Wavelet

3. Evaluation of Classifiers

- Motivation: an evaluation measure is needed that compares the classification results (predictions) with the actual classes; accuracy alone is not informative enough and does not distinguish between different error types

Confusion Matrix

- $n \times n$ matrix; in the 2-class problem its entries are TP, FP, FN and TN
- Accuracy measures:
o Accuracy: $AC = \frac{TP + TN}{FP + FN + TP + TN}$
o Recall: $REC = \frac{TP}{TP + FN}$ (fraction of actual positives that are correctly identified)
o Precision: $PRE = \frac{TP}{TP + FP}$ (fraction of positive results that are correct)
o Specificity: $SPE = \frac{TN}{TN + FP}$ (fraction of actual negatives that are correctly identified)
o f-measure (f1-measure): harmonic mean of PRE and REC, $F_1 = \frac{2 \cdot PRE \cdot REC}{PRE + REC}$
o Mutual Information: $I(y; t) = \sum_{y} \sum_{t} P(y, t) \log_2 \frac{P(y, t)}{P(y)\, P(t)}$ (most intuitive result for different error types)
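The measures above follow directly from the four counts of a 2-class confusion matrix. A minimal sketch, with a function name and example counts chosen only for illustration:

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, recall, precision, specificity and F1 from confusion counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    rec = tp / (tp + fn)
    pre = tp / (tp + fp)
    spe = tn / (tn + fp)
    f1 = 2 * pre * rec / (pre + rec)   # harmonic mean of precision and recall
    return {"AC": acc, "REC": rec, "PRE": pre, "SPE": spe, "F1": f1}

# Hypothetical counts: a rare positive class among many negatives.
m = binary_metrics(tp=8, fp=2, fn=2, tn=88)
```

With these counts the accuracy is 0.96 although only 80 % of the positives are found, which illustrates why accuracy alone is not informative enough for rare classes.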

ROC Graph

- Receiver Operating Characteristic (from detection theory)
- Combines 1 − specificity (x-axis, false positive rate) and recall (y-axis, true positive rate)
o Point (0, 1): perfect classifier (positives and negatives recognized perfectly)
o Point (0, 0): everything classified as negative
o Point (1, 1): everything classified as positive
o Point (1, 0): completely wrong
o Parametric classifier → curve in the x-y plane; non-parametric classifier → a single point
o The ROC graph contains the same information as the confusion matrix
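For a classifier with a tunable decision threshold, the ROC curve can be traced by sweeping the threshold over the scores. The sketch below assumes that higher scores mean "more positive", that labels are 1 (positive) or 0 (negative), and that both classes occur; the data are made up.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points,
    one per distinct threshold, starting at (0, 0)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.4, 0.3]   # hypothetical classifier outputs
labels = [1, 0, 1, 0]           # corresponding true classes
pts = roc_points(scores, labels)
```

Each point corresponds to one confusion matrix; the closer the curve passes to (0, 1), the better the classifier.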

Qualification in continuous Data Streams: the Null Class Challenge

- Problem: discrimination of relatively rare activities from a default null class
- Multi-class confusion matrix including NULL
- Insertion, deletion
- Overfill, underfill

Confusion matrix of the 2-class problem:

                     Actual positive       Actual negative
Predicted positive   TP (true positive)    FP (false positive)
Predicted negative   FN (false negative)   TN (true negative)

4. Segmentation

- Segmentation (continuous data stream → time-limited data sets) is needed for classification
- Challenging task: it determines detection performance and computational effort

Basic Approach: Piecewise linear representation of time series

- Mainly two approximation procedures:
o Linear interpolation: the approximating line connects two points; more aesthetic
o Linear regression: best fit of the interval in the least-squares sense; tighter approximation
- Quality measures (quality of fit):
o Sum of squares: sum of the squared vertical differences
o Euclidean distance: square root of the sum of squares
o $L_\infty$ norm: vertical distance between the best-fit line and the data point furthest away

Sliding Window: the straightforward approach

- Starting from a time point $t_k$, the next time points $t_{k+1}, t_{k+2}, \dots$ are added step by step as long as the approximation error stays below a threshold. When the threshold is exceeded at $t_j$, $T[k : j-1]$ is a segment and the next segment starts at $t_j$
- Online algorithm, but not able to look ahead, no global view
- Performance depends on the type of data set; noise can be a problem

Top down

- The time series is split at the “best” location; each subset is tested and accepted if its approximation error is below a threshold, else the subset is split again

Bottom up

- The entire time series is split into segments of 2 time steps each; the adjacent segments with the lowest merging cost (e.g. sum of squares) are merged until a stopping criterion is met
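The bottom-up procedure can be sketched as follows. Assumptions for this illustration: the merging cost is the sum of squared residuals of a least-squares line, samples are equally spaced, and merging stops once the cheapest merge exceeds `max_error`. The function names are hypothetical.

```python
def fit_error(y):
    """Sum of squared residuals of the least-squares line through y
    (x positions are 0, 1, 2, ...)."""
    n = len(y)
    if n < 3:
        return 0.0                      # up to two points are fitted exactly
    mx = (n - 1) / 2
    my = sum(y) / n
    sxx = sum((i - mx) ** 2 for i in range(n))
    sxy = sum((i - mx) * (v - my) for i, v in enumerate(y))
    slope = sxy / sxx
    return sum((v - (my + slope * (i - mx))) ** 2 for i, v in enumerate(y))

def bottom_up(y, max_error):
    """Return segments as (start, end) index pairs, end exclusive."""
    # finest segmentation: pieces of 2 samples each
    segs = [(i, min(i + 2, len(y))) for i in range(0, len(y), 2)]
    while len(segs) > 1:
        # cost of merging each pair of adjacent segments
        costs = [fit_error(y[a[0]:b[1]]) for a, b in zip(segs, segs[1:])]
        i = min(range(len(costs)), key=costs.__getitem__)
        if costs[i] > max_error:
            break                       # stopping criterion met
        segs[i:i + 2] = [(segs[i][0], segs[i + 1][1])]
    return segs

series = [0, 1, 2, 3, 4, 3, 2, 1, 0, -1]   # toy signal: ramp up, ramp down
segments = bottom_up(series, max_error=0.01)
```

For the toy signal the result is one rising and one falling segment. Recomputing all merge costs in every iteration keeps the sketch short; an efficient implementation would only update the costs next to the merged segment.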

Comparison of the Segmentation Algorithms

- Sliding window: poor quality
- Top-down and bottom-up: similar performance

SWAB: Sliding window and bottom up

- Initialization: choose a number of data points (sized so that 5 to 6 expected segments are contained). Apply bottom-up to these data points, accept the leftmost segment and remove it from the set. Add new data points on the right using the sliding window, then apply bottom-up again
- Combines the approximation quality of bottom-up with the online capability of sliding window (sliding window: coarse pre-segmentation, bottom-up: refinement)
- Extension: merge adjacent segments that have a similar slope (good results in gesture recognition tasks)

SAX: Symbolic Aggregate approXimation

- Reduces a time series of length n to a string of length w
- Two steps: 1. convert the time series to PAA (Piecewise Aggregate Approximation); 2. map the PAA coefficients to the discrete SAX symbols
- PAA: the time series $C = c_1, c_2, \dots, c_n$ (zero mean, standard deviation 1) is represented by the w-dimensional vector $\bar{C} = \bar{c}_1, \bar{c}_2, \dots, \bar{c}_w$ (intervals of equal length, each coefficient is the mean of the original data in its interval)
- SAX (discrete representation of PAA): it is desirable to have equiprobable symbols; a normalized time series has a Gaussian distribution, so the breakpoints are chosen to produce equal-sized areas under the Gaussian curve
Example: 3 SAX symbols (a, b, c), 2 breakpoints → the time series results in a word like baabccba

Comparing two Time Series, represented by SAX

- Distance measure on SAX: the Euclidean distance formula is adapted for PAA → lower bound; when transforming PAA to SAX the formula is adapted again, using a table of distances between all symbols
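The two SAX steps can be sketched with the Python standard library. Assumptions for this illustration: the series length is divisible by w, the series is not constant, and the breakpoints are taken as quantiles of the standard normal distribution; the function name and the toy data are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev

def sax(series, w, alphabet="abc"):
    """Convert a time series to a SAX word of length w."""
    mu, sigma = mean(series), pstdev(series)
    z = [(v - mu) / sigma for v in series]            # zero mean, unit std
    step = len(z) // w
    # PAA: mean of each of the w equal-length intervals
    paa = [mean(z[i * step:(i + 1) * step]) for i in range(w)]
    a = len(alphabet)
    # breakpoints cut the Gaussian into a equiprobable regions
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    return "".join(alphabet[bisect_right(breakpoints, c)] for c in paa)

word = sax([1, 1, 5, 5, 9, 9, 5, 5], w=4)
```

For three symbols the two breakpoints lie at roughly ±0.43, so low, middle and high PAA coefficients map to a, b and c respectively.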

5. Bayes

Data Fusion

- Combine information from different sources to obtain a robust and complete description of a process
- Observation model: observations z (measurements, sensor signals) are assigned to a state x
- Decision rule $\delta$ maps observations to states
- Estimation: one of the most important data fusion problems

Probability Models

- Probability density function (pdf) $P(y)$
- Joint probability $P_{xy}(x, y)$
- Conditional probability $P(x \mid y) = \frac{P(x, y)}{P(y)}$ → chain rule $P(x, y) = P(x \mid y)\, P(y)$
- Independence: $P(x \mid y) = P(x)$, $P(x, y) = P(x)\, P(y)$
- Conditional independence: $P(x \mid y, z) = P(x \mid z)$ (knowledge of z makes x independent of y); with the chain rule: $P(x, y \mid z) = P(x \mid z)\, P(y \mid z)$

Probability Models, Bayes' Theorem

- Bayes' theorem: $P(x \mid z) = \frac{P(z \mid x)\, P(x)}{P(z)}$
- A-priori knowledge → posterior distribution (based on the observations made)
- Sensor model: a-priori probability $P(x)$ known, conditional probabilities $P(z \mid x)$ → decision boundary with $P(x \mid z)$

Data Fusion with Bayes' Rule

- Combine the observations of several sources $Z^n = \{z_1, z_2, \dots, z_n\}$ with Bayes
- $P(x \mid Z^n) = \frac{P(z_1, z_2, \dots, z_n \mid x)\, P(x)}{P(z_1, z_2, \dots, z_n)}$ becomes, for observations that are conditionally independent given x, $P(x \mid Z^n) = \frac{P(x) \prod_{i=1}^{n} P(z_i \mid x)}{P(Z^n)}$

Bayes' Rule, Recursive

- Add new information recursively to the existing information: $P(x \mid Z^k) = \frac{P(z_k \mid x)\, P(x \mid Z^{k-1})}{P(z_k \mid Z^{k-1})}$
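The recursive form of Bayes' rule can be sketched for a discrete state space as follows. The two states, the observation symbols and the sensor model `P_Z_GIVEN_X` are hypothetical values chosen for illustration.

```python
def bayes_update(prior, likelihood):
    """One recursive step: P(x | Z^k) from P(x | Z^{k-1}) and P(z_k | x)."""
    unnorm = {x: likelihood[x] * prior[x] for x in prior}
    evidence = sum(unnorm.values())            # P(z_k | Z^{k-1})
    return {x: p / evidence for x, p in unnorm.items()}

# Hypothetical sensor model P(z | x): observation symbol -> state -> probability
P_Z_GIVEN_X = {
    "high": {"walk": 0.8, "sit": 0.3},
    "low":  {"walk": 0.2, "sit": 0.7},
}

belief = {"walk": 0.5, "sit": 0.5}             # a-priori distribution P(x)
for z in ["high", "high", "low"]:
    belief = bayes_update(belief, P_Z_GIVEN_X[z])
```

Applying the update once per observation gives the same posterior as the batch formula with the product over all $P(z_i \mid x)$, provided the observations are conditionally independent given x.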

Estimation, Decision

- Optimal estimator: minimizes the error probability according to given criteria
- ML estimator (Maximum Likelihood): among the values x, the one with the highest probability is chosen based on the available information: $\hat{x}_{ML} = \arg\max_x P(Z^n \mid x)$
- MAP estimator (Maximum a posteriori): the observations are taken into account in addition to the existing (a-priori) information: $\hat{x}_{MAP} = \arg\max_x P(x \mid Z^n)$

Log-Likelihood

- Computing with the logarithm of probabilities is simpler (addition/subtraction instead of multiplication/division)
- $I(x) = \log P(x)$, $I(x \mid y) = \log P(x \mid y)$ → for example $I(x \mid z) = I(z \mid x) + I(x) - I(z)$

Information Measures

Shannon information (entropy) $H_p(x)$

- Minimal (= 0) when the pdf is concentrated on a single value of x, maximal for a uniform distribution over the states
- Conditional entropy (analogous, for conditional probabilities)

Mutual Information (MI) $I(x, z)$

- Gain in information, evaluated before a further observation is taken into account
- Class-conditional Mutual Information
- E.g. with two sensors: predict which sensor will bring the larger information gain
- Mutual Information Feature Selection (MIFS)
1. $F$ contains all features, $F_S$ the selected features (empty at the beginning); 2. find the first feature: calculate $I(C, f)$ for each feature f in $F$ and move the feature which maximizes $I(C, f)$ from $F$ to $F_S$; 3. find further features: calculate $I(C, f)$ for all f, find the $f \in F$ which maximizes $I(C, f) - \beta \cdot I(F_S, f)$ and move it from $F$ to $F_S$; 4. repeat step 3 until enough features are selected

Sensor Architectures

Sensor Modelling

6. Density Estimation

Parametric Methods

- A density function is assumed; for new values the parameters are recomputed

Parametric Distribution (Normal Distribution)

Maximum Likelihood (ML)

- Determine the parameters such that the distribution function fits the sample as well as possible
- The density function $P(z)$ depends on a set of parameters $\theta$ (e.g. $\mu$ and $\sigma$ for the normal distribution)
- Dependence on the parameters: $P(z \mid \theta)$
- Likelihood of a data set: $L(\theta) = P(Z^n \mid \theta) = \prod_{i=1}^{n} P(z_i \mid \theta)$; from this the best parameter vector $\hat{\theta}$ can be obtained (maximize $L(\theta)$)

Bayesian Learning, Bayesian Inference

- The uncertainty in the parameter values $\theta$ is expressed by a pdf
- Before the first observation, typically a “broad” a-priori pdf
- Based on the observations, the a-posteriori pdf is computed with the Bayesian approach
- Repeated application “narrows” the pdf for $\theta$
- Maximum likelihood ↔ Bayesian learning: with many observations the pdf becomes ever narrower and approaches the ML value

Non-parametric Distributions

- Often the form of the distribution/function is not known → estimate the pdf directly from the data

Histograms

- Generalizable approach; with an infinite amount of data any pdf can be approximated
- The set of points is divided into “bins” (bars of the histogram), which represent the density within the bins
- The number of bins determines the accuracy of the fit
- Pro: simple construction; the data are no longer needed afterwards
- Contra: discontinuities at the bin boundaries; often too little data for multi-dimensional distributions ($M^d$ bins)

Estimation of the Density

- If K of N data points lie in a region, the probability mass of that region is $P \cong \frac{K}{N}$ (for large N)
- Approximation: $p(z) \cong \frac{K}{N \cdot V}$ (V: volume of the region)
- Two possible approaches:
o K stays fixed, V is adapted to the data → k-Nearest Neighbour
o V stays fixed, K is determined by the given data → kernel


Kernel Methods (Parzen Window, Gaussian Kernel)

- Parzen window: the region R is a d-dimensional hypercube
Unit hypercube around the origin: $H(u) = 1$ if $|u_j| \le \frac{1}{2}$ for all $j = 1 \dots d$, and $H(u) = 0$ otherwise
→ if $z_n$ lies within the hypercube of side length h centred at z, then $H\left(\frac{z - z_n}{h}\right) = 1$, otherwise 0
Total number of points in the hypercube: $K = \sum_{n=1}^{N} H\left(\frac{z - z_n}{h}\right)$
→ density estimate $p(z) \cong \frac{1}{N} \sum_{n=1}^{N} \frac{1}{V} H\left(\frac{z - z_n}{h}\right)$ with $V = h^d$
- Gaussian kernel for $H(u)$: as above, with a Gaussian function
- Disadvantages of kernel methods: all data must be kept, large computational effort for large data sets, over/under-fitting if h is chosen badly
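For one-dimensional data the Parzen-window estimate reduces to counting the points within a window of width h. A minimal sketch, with a hypothetical function name and toy data:

```python
def parzen_density(z, data, h):
    """1-D Parzen estimate with a hypercube (box) kernel of side length h."""
    # K: number of points z_n with |(z - z_n) / h| <= 1/2
    k = sum(1 for zn in data if abs((z - zn) / h) <= 0.5)
    volume = h                       # V = h^d with d = 1
    return k / (len(data) * volume)

data = [0.0, 0.1, 0.2, 1.0, 1.1]     # toy samples
p_dense = parzen_density(0.1, data, h=0.4)
p_empty = parzen_density(0.6, data, h=0.4)
```

Three of the five points fall into the window around 0.1, giving a density of 3 / (5 · 0.4) = 1.5, while the window around 0.6 is empty. A smaller h produces a spikier estimate, a larger h a smoother one.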

k-Nearest Neighbour

- K fixed, V variable
- A hypersphere around the data point z is enlarged until it contains the k nearest neighbouring points → $p(z)$
- Disadvantages: not a true probability distribution (diverges!), all data must be kept; refined methods are known

k-Nearest Neighbour (kNN) Rule, Classifier

- Classifier with kNN: N points in total, K points in the hypersphere; $N_k$ points of class $C_k$ in total, of which $K_k$ lie in the hypersphere
o Class-conditional density: $P(x \mid C_k) = \frac{K_k}{N_k V}$
o Unconditional density: $P(x) = \frac{K}{N V}$
o A-priori distribution: $P(C_k) = \frac{N_k}{N}$
o With Bayes: $P(C_k \mid x) = \frac{K_k}{K}$
- Nearest neighbour classification rule: special case $K = 1$; each point x is assigned to the class of its nearest neighbour
- k-nearest neighbour: point x is assigned to the class with the most elements among its k nearest neighbours
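The posterior $P(C_k \mid x) = K_k / K$ leads directly to a classifier. The sketch below assumes Euclidean distance and uses made-up training points; `math.dist` requires Python 3.8 or newer.

```python
import math
from collections import Counter

def knn_posteriors(x, train, k):
    """train: list of (point, label) pairs.
    Returns {label: K_k / K} over the k nearest training points."""
    nearest = sorted(train, key=lambda pl: math.dist(x, pl[0]))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: c / k for label, c in counts.items()}

# Hypothetical 2-D training data with two classes
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B")]

post = knn_posteriors((3, 4), train, k=3)
label = max(post, key=post.get)      # class with the largest posterior
```

With k = 1 the same function implements the nearest neighbour rule. Note that all training data must be kept and searched for every query.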

NCC (Nearest Class Classifier)

- For each class a centre (mean) vector m is determined from the training data set; the shortest distance to a centre vector determines the class membership


Discriminant Analysis (Discriminant Analysis & Functions)

- PCA and LDA for data classification and dimensionality reduction

PCA (Principal Component Analysis)

- Find a new basis of lower dimensionality for the vectors such that the error stays as small as possible
- Problem: in some cases recognition gets worse

LDA (Linear Discriminant Analysis)

- Better approach
- Goal: separate two classes by finding the basis vectors of a linear transformation that separates the two classes best
- Approach: maximize the distance between the class means and minimize the scatter

PCA vs LDA

- PCA looks for basis vectors for an efficient representation, which are not necessarily suitable for separation
- LDA looks for basis vectors for efficient separation

7. Dempster-Shafer

- Why another detection method; which cases does Bayes not cover?
- Epistemic uncertainty: missing knowledge about the system (a property of the observer, not of the system); little information, or specific, ambiguous or contradictory information for determining the probability → Dempster-Shafer

D-S: Theory of Evidence

- N observations, e.g. 3: $\Theta = \{a_1, a_2, a_3\}$ → all possible combinations $2^\Theta = \{\emptyset, a_1, a_2, a_3, a_1 \cup a_2, a_1 \cup a_3, a_2 \cup a_3, a_1 \cup a_2 \cup a_3\}$
o Each of them is assigned a probability mass m (with $m(\emptyset) = 0$)
o At least two pieces of information are needed: $m(a_i)$ and $m(\bar{a}_i)$

Belief (Support), Plausibility, Uncertainty Interval

- Belief: sum of all probability masses that are directly assigned to a hypothesis, e.g. $Bel(a_1 \cup a_2) = m(a_1) + m(a_2) + m(a_1 \cup a_2)$
- Plausibility: sum of all probability masses that are not assigned to the negation of a hypothesis, e.g. $Pl(a_1) = m(a_1) + m(a_1 \cup a_2) + m(a_1 \cup a_3) + \dots + m(\Theta)$
- Uncertainty interval: defined by $[Bel(a_i), Pl(a_i)]$
Below Bel: minimal commitment; within the interval: plausible assumption; above Pl: certainty of refutation. Examples: $[0, 1]$ means complete ignorance; $[0.7, 0.7]$ means the uncertainty interval is 0 and the probability is 0.7; $[0.25, 0.85]$ means the certainty lies between 0 and 0.25 and the certainty of the negation lies between 0.85 and 1

Combination Rules

- Various methods for condensing data and information: conjunctive operation if all data sources are reliable ($A \cap B$), disjunctive combination if only one source is credible ($A \cup B$)
- Dempster's combination rule: conjunctive combination $m_1(A_i) \cdot m_2(B_j)$ for $A_i \cap B_j$
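Dempster's rule can be sketched with focal elements represented as Python frozensets. The normalisation by 1 minus the conflicting mass is part of the standard formulation of the rule but is not spelled out in the notes above, so it is marked as an addition here; the example masses are hypothetical.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions {frozenset: mass} conjunctively."""
    combined = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b    # mass falling on the empty set
    # normalisation (standard Dempster rule, not stated in the notes);
    # undefined if the sources conflict completely (conflict == 1)
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

A, B = frozenset("a"), frozenset("b")
THETA = A | B                                  # frame of discernment
m1 = {A: 0.6, THETA: 0.4}                      # source 1: some evidence for a
m2 = {B: 0.5, THETA: 0.5}                      # source 2: some evidence for b
m = dempster_combine(m1, m2)
```

Here 0.3 of the product mass is conflicting (a ∩ b is empty), so the remaining masses are rescaled by 1 / 0.7.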

8. Non-metric Methods, Decision Trees, C4.5

- So far only feature vectors were analysed; now: description by properties (e.g. colour, size, …)
- Decision tree:
o Leaf nodes = classes
o Every node: query of one property (one feature) → branches with the possible values
- Construction of a decision tree from a training set

Learning Methods

- Various methods are known (heuristic and iterative): CART, ID3, C4.5
- C4.5 (iterative)
1. try different partitionings (splits) with different attributes and thereby divide the training set into different subsets; 2. evaluate the different splits and select the best one; 3. repeat the procedure for every newly created node until no good splits are possible any more
- On 1: C4.5 uses purity/impurity: a node is the purer, the more strongly one class dominates (goal: subnodes that are as pure as possible)
- Procedure: 1. for every node compute $EntSub = -\sum_i p_i \log p_i$ (approaches zero the purer the node is); 2. compute the weighted entropy $gewEnt = \sum_j p_j\, EntSub_j$; 3. compute the entropy in the parent (root) node $EntSub_{Root} = -\sum_i p_i \log p_i$ and from it the information gain $InfGew = EntSub_{Root} - gewEnt$
→ the partitioning with the largest InfGew is chosen, the others are discarded
- Continuous features: use an iterative procedure to choose the threshold (split point) such that EntSub becomes smallest and hence the information gain largest
Procedure: start with the smallest feature value, step through in n (predefined) steps, compute EntSub and from it gewEnt each time, and choose the threshold with the smallest gewEnt
- Stopping criteria: several are possible (leaf node reached, impurity falls below a given threshold, no further splits if a node contains less than x% of the training data)
- Pruning: very complex trees can result → build the complete tree (impurity = zero for all leaf nodes), then analyse neighbouring nodes and possibly merge them again
- Problems
o Overfitting / underfitting
o Missing data → generate the missing data
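The entropy-based evaluation of a split described in the procedure above can be sketched as follows. The notes write "log" without a base; the base-2 logarithm used here is an assumption, and the class labels are toy data.

```python
import math
from collections import Counter

def entropy(labels):
    """EntSub = -sum_i p_i log2 p_i over the class frequencies of a node."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """InfGew = EntSub_Root - gewEnt for one candidate split."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)   # gewEnt
    return entropy(parent) - weighted

parent = ["walk"] * 4 + ["sit"] * 4
pure_split = [["walk"] * 4, ["sit"] * 4]
poor_split = [["walk", "walk", "sit", "sit"], ["walk", "walk", "sit", "sit"]]

gain_pure = information_gain(parent, pure_split)
gain_poor = information_gain(parent, poor_split)
```

The split that produces pure subnodes gains the full parent entropy of 1 bit, whereas the split that leaves the class mix unchanged gains nothing, so the first one would be chosen.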

Evaluation of the C4.5 Method

- Pros: few computation steps when applied, energy efficient, training data need not be stored, methods for constructing the tree are available
- Cons: quality depends on the choice of the training set, no (infinitely large) null classes can be defined, usually lower accuracy than kNN
- With a sufficiently large tree any decision boundary can be approximated (problem: an oblique boundary → staircase)

9. Classification with Support Vector Machines (SVM)

- Learning-based method (similar to neural networks)
- Approach: combination of several concepts (separating hyperplanes with a large margin, kernel approach)

Construction of Separating Hyperplanes with a Large Margin

- Transformation of the input data: the goal is to separate two classes by means of a learning algorithm, given a set of labelled training data
- Definition of hyperplanes: given by the normal vector w and the offset b; points in the direction of the normal vector are positive, the others negative ($f(x) = \langle w, x \rangle + b$)
- Maximum margin: the hyperplane is to be designed such that the margin is maximal
- Canonical hyperplane, normalization: scale the hyperplane (w, b) such that the nearest point x satisfies exactly $|\langle w, x \rangle + b| = 1$
From the distance measure and the normalization → the margin becomes maximal for minimal $\|w\|$
- Lagrange approach (Kuhn-Tucker theory): solves the optimization problem (minimization of $\|w\|$) with the help of Lagrange multipliers $\alpha_i$ → $w = \sum_i \alpha_i y_i x_i$
Classification of $x_{new}$ with the sign function
- Support vectors: mark the points closest to the hyperplane; they alone determine the position of the hyperplane, points further away have no influence
- Non-separable training data, soft decision: a linear separation is attempted, but errors are allowed. Penalty for an error = distance to the hyperplane × error weight C

Kernel Approach

- Transformation of the input data into a feature space: the idea is to transform the input data into a high-dimensional feature space so that separation with a linear hypersurface becomes possible
- Linear separability: nonlinear transformation, e.g. a linear decision boundary in 3D instead of an elliptical decision boundary in 2D
- Kernels: the input space is transformed into an inner-product space with a nonlinear mapping $\phi$; linear optimization algorithms are carried out in this space through the inner product (kernel) $k(x, z) = \langle \phi(x), \phi(z) \rangle$. In high dimensions computing the inner product can be expensive; in some cases a simple kernel exists (e.g. polynomial kernel)
- Further kernel functions: radial basis functions, sigmoid kernel
- Properties of kernels: $k + k'$ is also a kernel, $c \cdot k$ is a kernel if $c > 0$, etc.
- Kernel matrix (Mercer theorem): the matrix $\left(k(x_i, x_j)\right)_{ij}$ is symmetric positive definite (positive eigenvalues) → every symmetric, positive definite matrix can be regarded as a kernel matrix from which the transformation $\phi$ can be constructed
- Nonlinear classification with SVM: instead of carrying out the transformation $\phi$, use a nonlinear decision function (insert the kernel function into the decision function)
The computations are carried out without the transformation ever being performed explicitly.
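The point that the kernel replaces the explicit transformation can be checked numerically for a simple case. The sketch uses the homogeneous polynomial kernel of degree 2 in two dimensions, $k(x, z) = \langle x, z \rangle^2$, whose feature map is $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$; this particular kernel and the test vectors are chosen for illustration.

```python
import math

def dot(a, b):
    """Inner product of two equally long vectors."""
    return sum(p * q for p, q in zip(a, b))

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """k(x, z) = <x, z>^2, computed in the input space only."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))     # inner product in the 3-D feature space
implicit = poly_kernel(x, z)       # same value without computing phi
```

Both values agree up to rounding, which is why an SVM only needs kernel evaluations between data points and never the coordinates in the feature space.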

Multi-class Classification with SVM (Multiclass SVM)

- A direct implementation of multi-class classification is computationally very expensive → of little practical use. Better: combine several binary SVM classifications → various strategies are known
- WTA-SVM: one-versus-all with winner-takes-all strategy
For each class a separating function $f_i(x)$ against all other classes. The test vector is assigned to the class with the largest $f_i(x)$
- MWV-SVM: one-versus-one; an SVM classification for every pair of classes, with several methods for reaching the decision
o Max wins: the class that the test vector is assigned to most often wins
o Directed acyclic graph (DAG): a tree is built and traversed (1 vs. 2, then 1 vs. 3 or 2 vs. 3, etc.)

Open questions for SVM:

- What is the best hypersurface?
- Overfitting vs. generalization

10. Clustering and Self Organizing Maps (SOM)

- Clustering: discover similarity relations between data objects in a high-dimensional signal space
- Self Organizing Maps: project a high-dimensional signal space onto a 2D grid of nodes while preserving the topological relationships

Clustering

Problem Definition

- A set of data objects needs to be partitioned into k disjoint subsets
- Quality of the partitioning: distances between data objects of the same cluster as small as possible, distances between the clusters as large as possible

Algorithms

- Hierarchical: find successive clusters, agglomerative (bottom-up) or divisive (top-down)
o Hierarchical Agglomerative Clustering Algorithm: 1. each element is a separate cluster; 2. find the closest pair, replace $C_i$ with $C_i \cup C_j$ and remove $C_j$ from the set; 3. repeat step 2 until the desired number of clusters is reached
o Distance measure: the selection of the distance measure is an important step. Requirements are the triangle inequality, symmetry $d_{ij} = d_{ji}$ and positive definiteness $d_{ij} \ge 0$
▪ Euclidean: most common, often inadequate; $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
▪ Pearson: balanced inclusion of all dimensions, no correction for correlation
▪ Mahalanobis: takes correlation into account, corrects the data for different scales
▪ City-Block, Minkowski, Maximum Norm
o Variants of clustering: how the distance between clusters is measured
▪ Single Linkage: minimum distance between members
▪ Complete Linkage: maximum distance between members
▪ Centroid Linkage: distance between the centroids
▪ Average Linkage: average distance between all points of the clusters
▪ Ward's Method: fuse the clusters that produce the smallest possible increase in the sum of squares E
- Partitional: determine all clusters at once
o K-means: 1. choose the number of clusters, randomly generate k clusters and determine the cluster centres; 2. assign each object to the nearest cluster centre; 3. recompute the cluster centres; 4. repeat steps 2 and 3 until the assignment no longer changes
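The k-means steps can be sketched as follows. To keep the example reproducible, the initial centres are passed in explicitly instead of being generated randomly as in step 1; that choice, the function name and the toy points are assumptions of this sketch.

```python
import math

def kmeans(points, initial_centres, max_iter=100):
    """Return (centres, assignment) after alternating assignment and
    centre updates until the assignment no longer changes."""
    centres = list(initial_centres)
    assignment = None
    for _ in range(max_iter):
        # step 2: assign each object to the nearest cluster centre
        new_assignment = [
            min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break                                # step 4: nothing changed
        assignment = new_assignment
        # step 3: recompute each centre as the mean of its members
        for j in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                          # keep old centre if cluster is empty
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centres, assignment

points = [(0, 0), (0, 2), (10, 0), (10, 2)]      # two obvious groups
centres, assignment = kmeans(points, [(0, 0), (1, 0)])
```

The result depends on the initial centres; with random initialisation the algorithm is usually run several times and the best partitioning is kept.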

Self Organizing Maps

Main Aspects

- Visualization: Project high-dimensional space on 2d grid of nodes

- Abstraction: Preserve the topological relationships

Wine Example

- 178 wines from the same region

- Derived from three different sorts

- Chemical analysis determined the quantities of 13 constituents (alcohol, malic acid,

ash, …)


SOM Configuration

- Consists of a set of non-interconnected units

- Units are spatially ordered in a 2d grid

- Each unit in the map is equipped with a weight vector

- Weight vectors are of the same dimension like the input objects

Mapping Single Space to SOM

- Given an input vector x

- Given a SOM where each unit S is equipped with a weight-vector sw

- Image of input-vector x on the SOM array is defined as the array element s that

matches best with x: ( )ii

s argmin d x, w=

SOM Training

- Objective: preserving topological relationships

- Basic Principle: Close SOM units in the grid will activate each other to learn from the

same input x

- Learning: Adapt weight vector sw during learning process to respond similarly to

certain input patterns

- Initialization: Weight vectors are usually initialized by average input vector plus small

random vector or sampling from subspace spanned by the two largest principle

component eigenvectors

- Training: winner unit: same procedure like in the mapping process, unit processing the

most similar input is assigned to be the winner

- Training: update of weight vectors: subsequently the weight vectors of the winner and

its closest neighbours are updated: 1. calculate difference between weight-vector and

input object; 2. add this difference attenuated by a certain factor to the original weight

vector

- Training: size of neighbourhood: initially, the size of the neighbourhood is approximately the size of the map (→ capture global characteristics of the signal space); during training the size is gradually decreased (→ force local clusters); finally, only the weights of winning units are adapted (→ specialize winning unit)
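
The training procedure can be sketched roughly as below. The sketch makes several assumptions that the notes leave open: Euclidean distance, a hard ("bubble") neighbourhood, and a radius and attenuation factor that shrink linearly per epoch. The name `train_som` and all parameter names are hypothetical.

```python
import math
import random

def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols SOM on a list of equal-length tuples."""
    rng = random.Random(seed)
    dim = len(data[0])
    mean = [sum(col) / len(data) for col in zip(*data)]
    # initialization: average input vector plus a small random vector
    weights = {
        (r, c): [m + rng.uniform(-0.01, 0.01) for m in mean]
        for r in range(rows) for c in range(cols)
    }
    radius0 = max(rows, cols)  # start with roughly the size of the map
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        radius = radius0 * frac  # assumption: linear shrinking of the neighbourhood
        lr = lr0 * frac          # assumption: linearly decaying attenuation factor
        for x in data:
            # winner unit: same search as in the mapping step
            win = min(weights, key=lambda s: math.dist(x, weights[s]))
            for s, w in weights.items():
                if math.dist(s, win) <= radius:
                    # add the attenuated difference (x - w) to the weight vector
                    for d in range(dim):
                        w[d] += lr * (x[d] - w[d])
    return weights

data = [(0.0, 0.0), (0.0, 0.2), (1.0, 1.0), (1.0, 0.8)]
som = train_som(data, rows=2, cols=2)
```

Once the radius drops below the grid spacing, only the winner itself is updated, which corresponds to the final "specialize winning unit" phase.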

Prediction

- unit-wise averaging of the outputs associated with the mapped input objects

- problem: holes on which no training objects are mapped



Supervised SOMs

- Supervised Kohonen Network

o Input map Xmap and output map Ymap are glued together to a combined

input-output map XYmap

o Wine-example: 13-dim input and 3-dim output

o After training input and output maps are decoupled

- Bi-Directional Kohonen Network

o Units in the Xmap and Ymap are updated in an alternating way, update is

driven by the topology gradually embedded in the weight vectors located in the

opposite map

- Counter Propagation Network

o Network with two layers: Xmap and associated Ymap, pseudo-supervised

- XY-fused Kohonen Network

o Exploits similarities in Xmap and Ymap

- Prediction: determine the position of the input object in the Xmap, then look up the class membership in the Ymap: the maximum value of the unit's weight vector yields the class membership
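
A minimal sketch of this prediction step, assuming already trained maps stored as dictionaries keyed by grid position and Euclidean distance in the Xmap (the maps, values and the name `predict_class` are hypothetical):

```python
import math

def predict_class(x, xmap, ymap):
    """Locate x on the Xmap, then read the class from the Ymap weight vector
    of the same unit (index of its largest component)."""
    unit = min(xmap, key=lambda s: math.dist(x, xmap[s]))
    y = ymap[unit]
    return max(range(len(y)), key=lambda k: y[k])

# Hypothetical trained maps: 2 units, 2-dimensional inputs, 3 classes
xmap = {(0, 0): (0.0, 0.0), (0, 1): (1.0, 1.0)}
ymap = {(0, 0): (0.8, 0.1, 0.1), (0, 1): (0.2, 0.1, 0.7)}
label = predict_class((0.9, 0.8), xmap, ymap)  # unit (0, 1), class index 2
```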

Emotion SOM


11. Hidden Markov Models (HMM)

Activity Recognition with hidden Markov Models

Activity as a sequence of Observations

- Context can often be characterized by a sequence of states

- States generate typical “observations”

- Observations correspond to sensory readings

- Challenges: Variability (length of phases, omitted phases, …) and complex pdf (steps

not independent, no analytic expression of the problem)

Definition, application and problems

- Procedure: process modelling (define algorithm/function generating same

distribution), compute probability of occurrence (Bayes theorem), classification with

appropriate detector (maximum a posteriori probability)

- Formal representation with HMM

o Markov chain: discrete-time stochastic process; the state transitions (the only parameters) are probabilistic and depend only on the current state; the state is visible to the observer

o HMM: statistical model that assumes the modelled system is a Markov chain with unknown parameters; the state is NOT visible to the observer, but variables influenced by the state are visible; the observations generated by the HMM give information about the state sequence

- HMM Parameters:

o N: number of states, M: number of symbols, X: state space, Z: observations

o $a_{ij}$: state transition probabilities, $b_{ij}$: observation probabilities, $\Pi$: initial state probabilities

o $\lambda = (A, B, \Pi)$: HMM model

- HMM applications: speech recognition, image classification, protein structure

prediction, patterns in DNA sequences, …

- HMM: 3 main questions

o Find the probability of an output sequence $P(Z \mid \lambda)$: model parameters λ known, output sequence Z known → Forward Algorithm

o Find the most likely sequence of states generating Z: model parameters λ known, output sequence Z known → Viterbi Algorithm

o HMM training: find the HMM parameters λ: output sequence(s) known, find observation probabilities and state transition probabilities → Baum-Welch Algorithm

- Trellis representation: x-axis time steps, y-axis states X

Use: Total probability (sum of all paths on the trellis, Forward Algo.), most likely path

(max. of the paths on the trellis, Viterbi Algo.)

Forward algorithm: find the probability of an output sequence

- Auxiliary variable $\alpha_t(i)$: probability of having observed the partial sequence $z_1, \dots, z_t$ and being in state i at time t

- Initialize using $\Pi$, induction computing all $\alpha_t(i)$, termination summing up the resulting $\alpha_T(i)$
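
The three phases (initialization, induction, termination) can be sketched as follows. The sketch assumes discrete observation symbols given as integer indices and nested lists for A, B and Π. The toy model values are made up for illustration, and no scaling is applied, so long sequences would underflow.

```python
def forward(obs, A, B, pi):
    """Return P(Z | lambda) for a sequence of symbol indices.

    A[i][j]: transition i -> j, B[i][k]: state i emits symbol k,
    pi[i]: initial probability of state i.
    """
    n = len(pi)
    # initialization: alpha_1(i) = pi_i * b_i(z_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(z_{t+1})
    for z in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][z]
            for j in range(n)
        ]
    # termination: P(Z | lambda) = sum_i alpha_T(i)
    return sum(alpha)

# Hypothetical 2-state, 2-symbol model
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
p = forward([0, 1], A, B, pi)  # 0.0355 + 0.156 = 0.1915
```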



Viterbi Algorithm: Most likely sequence of states

- similar to the Forward Algorithm, replacing the sum by a max ($\delta_t(i)$); in addition, for each state i at time t remember which state at t-1 led to the highest probability ($\psi_t(i)$)

- initialization, induction and termination as in the Forward Algorithm; at the end, backtrace using $\psi_t(i)$ to find the most likely sequence of states
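
A minimal sketch of the Viterbi recursion with backtracing, under the same assumptions as a plain forward pass (integer symbol indices, nested lists for A, B and Π, no log-scaling). The toy model values are hypothetical.

```python
def viterbi(obs, A, B, pi):
    """Return (most likely state sequence, its joint probability with obs)."""
    n = len(pi)
    # initialization: delta_1(i) = pi_i * b_i(z_1)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    psi = []  # psi[t-1][j]: best predecessor of state j at time t
    for z in obs[1:]:
        back, new_delta = [], []
        for j in range(n):
            # max instead of sum: best predecessor state for j
            best = max(range(n), key=lambda i: delta[i] * A[i][j])
            back.append(best)
            new_delta.append(delta[best] * A[best][j] * B[j][z])
        psi.append(back)
        delta = new_delta
    # termination: pick the best final state
    state = max(range(n), key=lambda i: delta[i])
    best_prob = delta[state]
    # backtracing along the stored predecessors
    path = [state]
    for back in reversed(psi):
        state = back[state]
        path.append(state)
    path.reverse()
    return path, best_prob

# Hypothetical 2-state, 2-symbol model
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
path, prob = viterbi([0, 1], A, B, pi)  # path [0, 1], probability 0.108
```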

Handraise vs. Handshake: classification with separate HMMs

- Create two HMMs for Handraise and Handshake

- Compute the probability of a sequence using the Forward-Algorithm

- Classify the gesture according to the most likely model
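
As an illustration of these three steps, the sketch below scores one observation sequence under two toy models and picks the more likely one. The model parameters and the symbol coding (0 = arm still, 1 = arm up, 2 = arm shaking) are invented for the example. Choosing the highest likelihood corresponds to a maximum a posteriori decision only under the added assumption of equal class priors.

```python
def sequence_likelihood(obs, A, B, pi):
    """Forward algorithm: P(Z | lambda) for a sequence of symbol indices."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for z in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][z]
                 for j in range(n)]
    return sum(alpha)

def classify(obs, models):
    """Return the name of the model that gives obs the highest likelihood."""
    return max(models, key=lambda name: sequence_likelihood(obs, *models[name]))

# Hypothetical toy models, each given as (A, B, pi)
models = {
    "handraise": ([[0.6, 0.4], [0.1, 0.9]],
                  [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
                  [1.0, 0.0]),
    "handshake": ([[0.6, 0.4], [0.1, 0.9]],
                  [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]],
                  [1.0, 0.0]),
}
gesture = classify([0, 1, 1, 1], models)
```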

Model estimation, common/combined HMMs

Model estimation

- Learning: sequence of observations is known (training sequences), rough model is known (N states, M observation symbols) → problem: find the most likely parameters for the training sequences (maximize $P(Z \mid \lambda)$ but avoid overfitting)

o State path known: no hidden state, simple estimation and statistical analysis, normalization to get probabilities

o State path unknown: typical case, hidden states, estimate frequency of occurrence along every path weighted by its probability, numerical optimization → Baum-Welch process

- State path known: straight forward (“counting occurrences”)

- State path unknown: Baum-Welch algorithm (Forward-Backward algorithm)

o Principle: expectation maximization: 1. start with set of plausible parameters;

2. iterate until convergence: a) compute expected occurrences of states,

transitions, observations b) adjust parameters to maximize the likelihood of

expected values; 3. iterate until maximum likelihood to generate the sequence

is reached

o Objective: find probability of producing sequence Z with the t-th symbol

produced by state i (for all z, i and t)

o Definitions:

$\gamma_t^{z,\lambda}(i)$: probability that, when λ generates Z, it is in state i at time t

$\xi_t^{z,\lambda}(i, j)$: probability that, when λ generates Z, it is in state i at time t and in state j at time t+1

$\beta_t(i)$: starting from state i, probability that the remainder of the sequence $Z = \{z_{t+1}, z_{t+2}, \dots, z_T\}$ is observed

o Algorithm: 1. use some estimated HMM as starting point; 2. compute α and β; 3. compute γ and ξ; 4. compute the expected values of the frequencies (transitions, observations); 5. compute the new parameter values; 6. iterate

o Compute α: Forward Algorithm

o Compute β: start with $\beta_T(i) = 1$ at the end and go backward

o Compute γ: $\gamma_t^{z,\lambda}(i) = \dfrac{\alpha_t(i)\,\beta_t(i)}{P(z \mid \lambda)}$

o Compute ξ: $\xi_t^{z,\lambda}(i, j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(z_{t+1})\, \beta_{t+1}(j)}{P(z \mid \lambda)}$



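o Parameter estimation: $\pi_i = \gamma_1^{z,\lambda}(i)$, $a_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t^{z,\lambda}(i, j)}{\sum_{t=1}^{T-1} \gamma_t^{z,\lambda}(i)}$, $b_{ik} = \dfrac{\sum_{t:\, z_t = Z_k} \gamma_t^{z,\lambda}(i)}{\sum_{t=1}^{T} \gamma_t^{z,\lambda}(i)}$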

o Properties of the algorithm: based on the principle of expectation maximization, guaranteed to converge to a local maximum (→ good parameter initialization is important!), usually only a small number of iterations needed
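
One re-estimation step of the Baum-Welch procedure can be sketched as below. This is a simplified illustration under stated assumptions: a single training sequence of integer symbol indices, nested lists for A, B and Π, strictly positive starting parameters, and no scaling (so it is only suitable for short sequences). The function names and the toy model are hypothetical.

```python
def forward_backward(obs, A, B, pi):
    """Compute alpha[t][i] and beta[t][i] for all t and i."""
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]  # beta_T(i) = 1
    for i in range(n):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(n):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
    for t in range(T - 2, -1, -1):  # go backward from the end
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
    return alpha, beta

def baum_welch_step(obs, A, B, pi):
    """One EM step; returns (new A, new B, new pi, P(Z | old lambda))."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward_backward(obs, A, B, pi)
    p_z = sum(alpha[T - 1])
    # gamma_t(i) = alpha_t(i) * beta_t(i) / P(z | lambda)
    gamma = [[alpha[t][i] * beta[t][i] / p_z for i in range(n)] for t in range(T)]
    # xi_t(i, j) = alpha_t(i) * a_ij * b_j(z_{t+1}) * beta_{t+1}(j) / P(z | lambda)
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_z
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_A, new_B, new_pi, p_z

# Hypothetical starting model and training sequence
obs = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
likelihoods = []
for _ in range(10):
    A, B, pi, p_z = baum_welch_step(obs, A, B, pi)
    likelihoods.append(p_z)  # should never decrease from one step to the next
```

Because the procedure only finds a local maximum, the outcome depends on the starting model chosen here.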

Activity classification

- Classification with separate HMMs: one HMM per class, the HMM with the highest probability is chosen; pro: simple, con: may have low accuracy

- Classification with a common HMM: one single HMM that covers all classes (each class defined by a path along particular states) → find the most likely path (Viterbi) or compute the probability along a predefined path

- Classification with combined HMMs: Chained HMM

- Null / garbage class problem: e.g. rest state
