Quantnet Basics: Visualization, Similarity, Text Mining

Lukas Borke
Wolfgang Karl Härdle

Ladislaus von Bortkiewicz Chair of Statistics
C.A.S.E. – Center for Applied Statistics and Economics
Humboldt–Universität zu Berlin
http://lvb.wiwi.hu-berlin.de
http://www.case.hu-berlin.de


Page 1: Quantnet Basics: Visualization, Similarity, Text Miningsfb649.wiwi.hu-berlin.de/fedc/events/Motzen14/QNet_Basics.pdf · Keyword Extracting 5-1 ... Visualization,Similarity,TextMining


Motivation 1-1

Transparency and Reproducibility

• Quantnet – open-access code-sharing platform
  - Quantlets: program codes (R, MATLAB, SAS), various authors
  - QuantNetXploRer


Motivation 1-2

Popularity

Figure 1: Quantlet downloads by year and country

Motivation 1-3

Visualization

Figure 2: Quantlets from Statistics of Financial Markets (SFE) and Applied Multivariate Statistical Analysis (MVA)


Motivation 1-4

Research Goals

• Visualization
  - Quantlets
  - Clusters
  - Relationships

• Data Mining
  - Similarity
  - Semantic structure
  - Text Mining – Keyword Extracting


Outline

1. Motivation ✓
2. Interactive Structure
3. Vector Space Model (VSM)
4. Empirical results
5. Keyword Extracting
6. Conclusion


Interactive Structure 2-1

• Search parameters: Quantlet name, Description, Data file, Author
• Data types: R, MATLAB, SAS


Interactive Structure 2-2

Integrated exploring and navigating


Interactive Structure 2-3

Figure 3: Quantlet MVAreturns containing the search term "time series"

Interactive Structure 2-4

Figure 4: All Quantlets in QuantNetXploRer, search term "time series"

Vector Space Model (VSM) 3-1

Vector Space Model (VSM)

• Model structure
  - Text to Vector: weighting scheme, similarity, distance
  - Basic VSM
  - Generalized VSM
  - LSI – Latent Semantic Indexing


Vector Space Model (VSM) 3-2

Text to Vector

• $D = \{d_1, \ldots, d_n\}$ – set of documents.
• $T = \{t_1, \ldots, t_m\}$ – dictionary, i.e., the set of all different terms occurring in Quantnet.
• $tf(d, t)$ – absolute frequency of term $t \in T$ in document $d \in D$.
• $idf(t) \stackrel{\mathrm{def}}{=} \log(|D| / n_t)$ – inverse document frequency, with $n_t = |\{d \in D \mid t \in d\}|$.
• $w(d) = \{w(d, t_1), \ldots, w(d, t_m)\}$, $d \in D$ – documents as vectors in an $m$-dimensional space.
• $w(d, t_i)$ – calculated by a weighting scheme.
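
As a minimal sketch of the definitions above, the quantities $tf(d, t)$, $n_t$ and $idf(t)$ can be computed from tokenized documents as follows. The toy corpus and the helper names `term_frequencies` and `inverse_document_frequencies` are illustrative assumptions, not part of Quantnet, and the natural logarithm is assumed because the slides do not fix the base.

```python
import math
from collections import Counter

def term_frequencies(docs):
    """Return one Counter per document: tf(d, t) = absolute count of t in d."""
    return [Counter(tokens) for tokens in docs]

def inverse_document_frequencies(docs):
    """Return idf(t) = log(|D| / n_t), n_t = number of documents containing t."""
    n_docs = len(docs)
    doc_counts = Counter()
    for tokens in docs:
        doc_counts.update(set(tokens))  # count each term once per document
    return {t: math.log(n_docs / n_t) for t, n_t in doc_counts.items()}

# Hypothetical toy corpus of three tokenized documents.
docs = [
    ["time", "series", "plot"],
    ["time", "series", "garch", "garch"],
    ["kernel", "density", "plot"],
]
tf = term_frequencies(docs)
idf = inverse_document_frequencies(docs)
dictionary = sorted(idf)  # T, the set of all different terms
```

Terms that occur in every document get $idf(t) = 0$ and therefore carry no weight in any scheme built on idf.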


Vector Space Model (VSM) 3-3

Weighting scheme, Similarity, Distance

• Salton et al. (1994): the tf-idf weighting scheme $w(d, t)$ for $t \in T$ in $d \in D$:

\[
w(d, t) = \frac{tf(d, t)\, idf(t)}{\sqrt{\sum_{j=1}^{m} tf(d, t_j)^2\, idf(t_j)^2}}, \qquad m = |T|
\]

• (normalized tf-idf) Similarity $S$ of two documents:

\[
S(d_1, d_2) = \sum_{k=1}^{m} w(d_1, t_k) \cdot w(d_2, t_k) = w(d_1)^\top w(d_2)
\]

• A frequently used distance measure is the Euclidean distance:

\[
dist_d(d_1, d_2) \stackrel{\mathrm{def}}{=} \sqrt{\sum_{k=1}^{m} \{w(d_1, t_k) - w(d_2, t_k)\}^2}
\]


Vector Space Model (VSM) 3-4

Example 1: German children’s rhymes

Let D = {d1, d2, d3} be the set of documents/rhymes:

Rhyme 1: Hänschen klein ging allein in die weite Welt hinein.
$d_1$ = {hanschen, klein, ging, allein, in, die, weite, welt, hinein}

Rhyme 2: Backe, backe Kuchen, der Bäcker hat gerufen.
$d_2$ = {backe, kuchen, der, backer, hat, gerufen}

Rhyme 3: Die Affen rasen durch den Wald. Der eine macht den andern kalt.
$d_3$ = {die, affen, rasen, durch, den, wald, der, eine, macht, andern, kalt}


Vector Space Model (VSM) 3-5

Example 1: German children’s rhymes

This implies:

$T$ = {hanschen, klein, ging, allein, in, die, weite, welt, hinein,
       backe, kuchen, der, backer, hat, gerufen,
       affen, rasen, durch, den, wald, eine, macht, andern, kalt}
    = $\{t_1, \ldots, t_{24}\}$

Hence, $|D| = 3$, $|T| = 24$.


Vector Space Model (VSM) 3-6

Figure 5: Weighting vectors of the 3 rhymes in a radar chart

Vector Space Model (VSM) 3-7

Example 1: German children’s rhymes

With the weighting vectors above we get the similarity matrix:

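\[
M_S = \begin{pmatrix} 1 & 0 & 0.014 \\ 0 & 1 & 0.014 \\ 0.014 & 0.014 & 1 \end{pmatrix}
\]

And the distance matrix:

\[
M_D = \begin{pmatrix} 0 & \sqrt{2} & 1.405 \\ \sqrt{2} & 0 & 1.405 \\ 1.405 & 1.405 & 0 \end{pmatrix}
\]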
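
The two matrices of this example can be reproduced with a short script. This is a sketch under assumptions: the token lists are typed in by hand (lower case, umlauts dropped, as in the sets $d_1, d_2, d_3$), repeated words such as "backe" and "den" are counted with their absolute frequency, and the helper names are illustrative. The logarithm base does not matter here because it cancels in the normalization.

```python
import math
from collections import Counter

# Tokenized rhymes as on the slides (lower case, umlauts dropped).
rhymes = [
    "hanschen klein ging allein in die weite welt hinein".split(),
    "backe backe kuchen der backer hat gerufen".split(),
    "die affen rasen durch den wald der eine macht den andern kalt".split(),
]

def tfidf_vectors(docs):
    """Return the dictionary T and the normalized tf-idf vector w(d) per document."""
    terms = sorted({t for d in docs for t in d})
    n_t = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(len(docs) / n_t[t]) for t in terms}
    vectors = []
    for d in docs:
        tf = Counter(d)
        raw = [tf[t] * idf[t] for t in terms]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / norm for x in raw])
    return terms, vectors

def similarity(u, v):
    """S(d1, d2) = w(d1)^T w(d2)."""
    return sum(a * b for a, b in zip(u, v))

def distance(u, v):
    """Euclidean distance between two weighting vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

terms, w = tfidf_vectors(rhymes)
M_S = [[similarity(u, v) for v in w] for u in w]
M_D = [[distance(u, v) for v in w] for u in w]
```

Rhymes 1 and 2 share no term, so their similarity is 0 and their distance is $\sqrt{2}$. Rhyme 3 shares "die" with rhyme 1 and "der" with rhyme 2, which gives the small nonzero entries.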


Vector Space Model (VSM) 3-8

Basic VSM

• column vector $d$, indexed by terms – document representation
• matrix $D = [d_1, \ldots, d_n]$ – document corpus representation, also called the "term by document" matrix
• considering linear transformations $P$, we get a general similarity

\[
S(d_1, d_2) = (P d_1)^\top (P d_2) = d_1^\top P^\top P d_2
\]

• every mapping $P$ defines another VSM
• $M_S = D^\top (P^\top P) D$ – similarity matrix


Vector Space Model (VSM) 3-9

Example 2: tf and tf-idf similarities in BVSM

• with $P = I_m$ and $d = \{tf(d, t_1), \ldots, tf(d, t_m)\}^\top$ we get the classical tf-similarity:

\[
M_S^{tf} = D^\top D
\]

• with diagonal $P^{idf}(i, i) = idf(t_i)$ and $d = \{tf(d, t_1), \ldots, tf(d, t_m)\}^\top$ we get the classical tf-idf-similarity:

\[
M_S^{tf\text{-}idf} = D^\top (P^{idf})^\top P^{idf} D
\]
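
Both special cases can be written with one function that takes the mapping $P$ as an argument. The sketch below assumes NumPy and a small made-up term-by-document matrix; `similarity_matrix` is an illustrative name. Note that these similarities are not normalized, unlike the normalized tf-idf weights used earlier.

```python
import numpy as np

# Hypothetical term-by-document matrix D: rows = terms, columns = documents,
# entries = absolute term frequencies tf(d, t).
D = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 2, 1],
    [0, 0, 1],
], dtype=float)

n_docs = D.shape[1]
n_t = np.count_nonzero(D, axis=1)  # number of documents containing each term
idf = np.log(n_docs / n_t)

def similarity_matrix(D, P):
    """General VSM similarity M_S = D^T (P^T P) D for a linear map P."""
    return D.T @ (P.T @ P) @ D

M_tf = similarity_matrix(D, np.eye(D.shape[0]))  # P = I_m
M_tfidf = similarity_matrix(D, np.diag(idf))     # P = diag(idf(t_1), ..., idf(t_m))
```

Any other choice of `P` (for example a term-correlation matrix) plugs into the same function, which is the sense in which every mapping defines another VSM.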


Vector Space Model (VSM) 3-10

Drawbacks of BVSM

• Uncorrelated/orthogonal terms in the feature space
• Documents must have common terms to be similar
• Sparseness of document vectors and similarity matrices

Question
• How to incorporate information about semantics?

Solution
• Using statistical information about term-term correlations
• Semantic smoothing


Vector Space Model (VSM) 3-11

Generalized VSM – term-term correlations

• $S(d_1, d_2) = (D^\top d_1)^\top (D^\top d_2) = d_1^\top D D^\top d_2$ – the GVSM similarity
• $M_S = D^\top (D D^\top) D$ – similarity matrix
• $D D^\top$ – term by term matrix, having a nonzero $ij$ entry if and only if there is a document containing both the $i$-th and the $j$-th terms
• terms become semantically related if they co-occur often in the same documents
• also known as a dual space method (Sheridan and Ballerini, 1996)
• when there are fewer documents than terms – dimensionality reduction


Vector Space Model (VSM) 3-12

Generalized VSM – Semantic smoothing

• A more natural method of incorporating semantics is to use a semantic network directly
• Miller et al. (1993) used the semantic network WordNet
• Term distance in the hierarchical tree provided by WordNet gives an estimate of their semantic proximity
• Siolas and d'Alche-Buc (2000) included the semantics in the similarity matrix by handcrafting the VSM matrix $P$
• $M_S = D^\top (P^\top P) D = D^\top P^2 D$ – similarity matrix (the second equality holds for symmetric $P$)


Vector Space Model (VSM) 3-13

LSI – Latent Semantic Indexing

• LSI measures semantic information through co-occurrence analysis (Deerwester et al., 1990)
• Technique – singular value decomposition (SVD) of the matrix $D = U \Sigma V^\top$
• $P = U_k^\top = I_k U^\top$ – projection operator onto the first $k$ dimensions
• $M_S = D^\top (U I_k U^\top) D$ – similarity matrix
• It can be shown that $M_S = V \Lambda_k V^\top$, with $D^\top D = V \Sigma^\top U^\top U \Sigma V^\top = V \Lambda V^\top$ and $\Lambda_{ii} = \lambda_i = \sigma_i^2$ the eigenvalues of $D^\top D$; $\Lambda_k$ consists of the first $k$ eigenvalues and zeros elsewhere.
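
The identity $M_S = V \Lambda_k V^\top$ can be checked numerically. The sketch below assumes NumPy, a random matrix in place of real Quantlet data, and $k = 2$; it is a sanity check of the algebra, not the implementation used for Quantnet.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.random((6, 4))  # hypothetical matrix: 6 terms x 4 documents
k = 2

U, s, Vt = np.linalg.svd(D, full_matrices=False)
U_k = U[:, :k]  # first k left singular vectors

# Similarity after projecting onto the first k dimensions: D^T (U I_k U^T) D.
M_S = D.T @ (U_k @ U_k.T) @ D

# Equivalent form V Lambda_k V^T, Lambda_k = diag(sigma_1^2, ..., sigma_k^2, 0, ...).
lam_k = np.zeros_like(s)
lam_k[:k] = s[:k] ** 2
M_S_eig = Vt.T @ np.diag(lam_k) @ Vt
```

Both expressions agree up to rounding, and the resulting similarity matrix has rank $k$.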


Empirical results 4-1

3 Models for the QuantNet

• Models – BVSM, GVSM and LSI
• Dataset – the whole Quantnet
• Documents – 1580 Quantlets


Empirical results 4-2

Figure 6: Model characteristics

Empirical results 4-3

Figure 7: Quantiles of similarity values of 3 models

• Blue dots – BVSM; Green dots – GVSM; Red line – LSI

Empirical results 4-4

Sparseness results

                   BVSM       GVSM       LSI
Absolute number    2452444    2214064    2083526
Relative number    0.982      0.887      0.835
Matrix Dim         2496400

Table 1: Model performance regarding the number of zero-values in the similarity matrix.
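
The relative numbers in Table 1 follow from the absolute counts and the $1580^2 = 2496400$ entries of the similarity matrix. The sketch below recomputes them and adds a hypothetical helper `sparseness` for counting zeros in a given matrix; the tolerance parameter `tol` is an assumption, since the slides do not state how a zero entry was detected.

```python
import numpy as np

def sparseness(M, tol=0.0):
    """Return (absolute, relative) number of zero entries of a similarity matrix."""
    zeros = int(np.count_nonzero(np.abs(M) <= tol))
    return zeros, zeros / M.size

# Recompute the relative numbers of Table 1 from the absolute counts.
n_entries = 1580 ** 2
absolute = {"BVSM": 2452444, "GVSM": 2214064, "LSI": 2083526}
relative = {name: round(count / n_entries, 3) for name, count in absolute.items()}
```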


Keyword Extracting 5-1

Index Term Selection I

Goal: decrease the number of words for indexing, so that only the selected keywords describe the documents (Deerwester et al., 1990; Witten et al., 1999)

A simple method for keyword extraction is based on entropy. For all $t \in T$ the entropy is defined as:

\[
W(t) = 1 + \frac{1}{\log_2 |D|} \sum_{d \in D} P(d, t) \log_2 P(d, t),
\]

with $P(d, t) = \dfrac{tf(d, t)}{\sum_{l=1}^{n} tf(d_l, t)}$.
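
A minimal sketch of this formula for a single term, with the usual convention $0 \cdot \log 0 = 0$ (an assumption, since the slides do not spell it out). The function name and the two test vectors are illustrative; the function expects at least two documents and at least one occurrence of the term.

```python
import math

def entropy_weight(tf_per_doc):
    """W(t) = 1 + (1 / log2|D|) * sum_d P(d, t) * log2 P(d, t) for one term t.

    tf_per_doc lists tf(d, t) for every document d in D (zeros included).
    """
    n_docs = len(tf_per_doc)
    total = sum(tf_per_doc)
    acc = 0.0
    for tf in tf_per_doc:
        if tf > 0:  # convention: 0 * log2(0) = 0
            p = tf / total
            acc += p * math.log2(p)
    return 1.0 + acc / math.log2(n_docs)

w_concentrated = entropy_weight([5, 0, 0, 0])  # term occurs in a single document
w_uniform = entropy_weight([2, 2, 2, 2])       # term spread evenly over all documents
```

The two extreme cases bracket the range: a term concentrated in one document gets $W(t) = 1$, a term spread evenly over all documents gets $W(t) = 0$.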


Keyword Extracting 5-2

Index Term Selection II

Entropy serves as a measure of the importance of a word in the given domain context:

$W(t)$ is high ⇒ prefer this $t$ as an index term.

An index term selection method (with a fixed number of index terms) is discussed in "Experiments in Term Weighting and Keyword Extraction in Document Clustering" (Borgelt and Nürnberger, 2004).


Conclusion 6-1

Conclusion I

• Similarity and Distance available for extended Visualization
• Different weighting scheme approaches and Vector Space Models allow adapted Similarity based Text Searching
• Incorporating term-term Correlations and Semantics significantly improves the comparison performance
• More automation and quality through Index Term Selection


Conclusion 6-2

Conclusion II

Text Mining offers more models and methods like:

• Classification
• Clustering
• Latent Dirichlet Allocation (LDA) topic model
• TopicTiling

They are worth being researched and applied to the Quantnet.


References 7-1

References

Borgelt, C. and Nürnberger, A.
Experiments in Term Weighting and Keyword Extraction in Document Clustering
LWA, pp. 123-130, Humboldt-Universität Berlin, 2004

Bostock, M., Heer, J., Ogievetsky, V. and community
D3: Data-Driven Documents
available on d3js.org, 2014

Chen, C., Härdle, W. and Unwin, A.
Handbook of Data Visualization
Springer, 2008


Elsayed, T., Lin, J. and Oard, D. W.
Pairwise Document Similarity in Large Collections with MapReduce
Proceedings of the 46th Annual Meeting of the Association of Computational Linguistics (ACL), pp. 265-268, 2008

Feldman, R. and Dagan, I.
Mining Text Using Keyword Distributions
Journal of Intelligent Information Systems, 10(3), pp. 281-300, DOI: 10.1023/A:1008623632443, 1998

Gentle, J. E., Härdle, W. and Mori, Y.
Handbook of Computational Statistics
Springer, 2nd ed., 2012


Hastie, T., Tibshirani, R. and Friedman, J.
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Springer, 2nd ed., 2009

Härdle, W. and Simar, L.
Applied Multivariate Statistical Analysis
Springer, 3rd ed., 2012

Hotho, A., Nürnberger, A. and Paass, G.
A Brief Survey of Text Mining
LDV Forum, 20(1), pp. 19-62, available on www.jlcl.org, 2005


Salton, G., Allan, J., Buckley, C. and Singhal, A.
Automatic Analysis, Theme Generation, and Summarization of Machine-Readable Texts
Science, 264(5164), pp. 1421-1426, DOI: 10.1126/science.264.5164.1421, 1994

Witten, I., Paynter, G., Frank, E., Gutwin, C. and Nevill-Manning, C.
KEA: Practical Automatic Keyphrase Extraction
DL ’99 Proceedings of the fourth ACM conference on Digital libraries, pp. 254-255, DOI: 10.1145/313238.313437, 1999


Appendix 8-1

Data Mining: DM

DM is the computational process of discovering/representing patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

1. Numerical DM
2. Visual DM
3. Text Mining (applied to considerably less structured text data)


Appendix 8-2

Text Mining

Text Mining or Knowledge Discovery from Text (KDT) deals with the machine supported analysis of text (Feldman et al., 1995).

It uses techniques from:
• Information Retrieval (IR)
• Information extraction
• Natural Language Processing (NLP)

and connects them with the methods of DM.


Appendix 8-3

Similarity, Distance, Data Mining – Overview

1. Find a formal representation of the Quantlets
2. Find a similarity measure on the space of Quantlets
3. Afterwards the construction of a distance measure is simple:

\[
distance(x, y) = \sqrt{sim(x, x) + sim(y, y) - 2 \cdot sim(x, y)}
\]

Having similarity and distance ⇒ a vast number of Data Mining, Text Mining and Visualization techniques.
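
The construction in step 3 can be sketched as a function that accepts any similarity measure. The names `distance_from_similarity` and `dot` are illustrative assumptions; with the dot product as similarity, the result is the ordinary Euclidean distance.

```python
import math

def distance_from_similarity(sim, x, y):
    """Distance induced by a similarity: sqrt(sim(x, x) + sim(y, y) - 2 sim(x, y))."""
    value = sim(x, x) + sim(y, y) - 2.0 * sim(x, y)
    return math.sqrt(max(value, 0.0))  # guard against tiny negative rounding errors

def dot(x, y):
    """Dot product as a simple similarity measure."""
    return sum(a * b for a, b in zip(x, y))

x = [3.0, 0.0]
y = [0.0, 4.0]
d = distance_from_similarity(dot, x, y)  # sqrt(9 + 16 - 0)
```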


Appendix 8-4

Distance measure

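A frequently used distance measure is the Euclidean distance:

\[
dist_d(d_1, d_2) \stackrel{\mathrm{def}}{=} dist\{w(d_1), w(d_2)\} \stackrel{\mathrm{def}}{=} \sqrt{\sum_{k=1}^{m} \{w(d_1, t_k) - w(d_2, t_k)\}^2}
\]

It holds for tf-idf:

\[
\cos\phi = \frac{x^\top y}{|x| \cdot |y|} = 1 - \frac{1}{2}\, dist^2\!\left(\frac{x}{|x|}, \frac{y}{|y|}\right),
\]

where $\frac{x}{|x|}$ means $w(d_1)$, $\frac{y}{|y|}$ means $w(d_2)$ and $\phi$ is the angle between $x$ and $y$.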
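
The identity follows in one step from expanding the squared distance between the two unit vectors; the shorthand $u$ and $v$ is introduced here only for this derivation:

```latex
\begin{align*}
u &= \frac{x}{|x|}, \qquad v = \frac{y}{|y|}, \qquad u^\top u = v^\top v = 1, \\
dist^2(u, v) &= (u - v)^\top (u - v)
              = u^\top u + v^\top v - 2\, u^\top v
              = 2 - 2 \cos\phi, \\
\cos\phi &= 1 - \tfrac{1}{2}\, dist^2(u, v).
\end{align*}
```

For the rhymes example, a similarity of $0.014$ gives $dist = \sqrt{2 - 2 \cdot 0.014} \approx 1.405$, in line with the distance matrix shown there.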


Appendix 8-5

Figure 8: Algorithm for Computing Pairwise Similarity Matrix

Remark: postings(t) denotes the list of documents that contain term t.

The idea: a term contributes to the similarity between two documents only if it has non-zero weights in both.
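
The figure itself is not reproduced in this transcript, so the following is only a sketch of the idea stated in the remark, in the spirit of Elsayed et al. (2008): similarities are accumulated term by term over postings lists, so document pairs without a shared term are never touched. The data structure (a dict of sparse weight dicts) and all names are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarity(weights):
    """Accumulate S(d_i, d_j) term by term via postings lists.

    weights maps a document id to a dict {term: w(d, t)} of non-zero weights.
    """
    postings = defaultdict(list)  # postings[t] = documents that contain term t
    for doc_id, vec in weights.items():
        for term in vec:
            postings[term].append(doc_id)

    sim = defaultdict(float)
    for term, doc_ids in postings.items():
        # Only pairs of documents that both contain the term contribute.
        for di, dj in combinations(sorted(doc_ids), 2):
            sim[(di, dj)] += weights[di][term] * weights[dj][term]
    return dict(sim)

# Hypothetical sparse weights for three documents.
weights = {
    "d1": {"time": 0.6, "series": 0.8},
    "d2": {"series": 0.5, "garch": 0.5},
    "d3": {"kernel": 1.0},
}
sim = pairwise_similarity(weights)
```

Pairs that are absent from the result have similarity zero, which matches the sparse similarity matrices reported for the BVSM.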


Appendix 8-6

3 Models on 3 Datasets

• Models – BVSM, GVSM and LSI
• Datasets – 2 books, 1 project from Quantnet
• Project 1 – TEDAS: Tail Event Driven Asset Allocation (micro size – 4 Qlets)
• Book 1 – BCS: Basic Elements of Computational Statistics (low size – 48 Qlets)
• Book 2 – SFE: Statistics of Financial Markets (medium size – 337 Qlets)


Appendix 8-7

Figure 9: Model characteristics of TEDAS


Appendix 8-8

Figure 10: Quantiles of similarity values of 3 models on TEDAS

• Blue dots – BVSM; Green line – GVSM; Red line – LSI

Appendix 8-9

Figure 11: Model characteristics of BCS


Appendix 8-10

Figure 12: Quantiles of similarity values of 3 models on BCS

• Blue dots – BVSM; Green line – GVSM; Red line – LSI

Appendix 8-11

Figure 13: Model characteristics of SFE


Appendix 8-12

Figure 14: Quantiles of similarity values of 3 models on SFE

• Blue dots – BVSM; Green line – GVSM; Red line – LSI

Appendix 8-13

Sparseness results

              TEDAS   BCS     SFE       MVA*     STF*     SFS*
BVSM          8       504     108668    75424    44576    17146
GVSM          8       0       96940     71464    44204    16612
LSI           8       262     84262     65712    43952    15400
Matrix Dim    16      2304    113569    77841    45369    18225

Table 2: Model performance regarding the number of zero-values in the similarity matrix. MVA*, STF* and SFS* were additionally examined.
