
[IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)]

Lessons from a Jihadi Corpus

D.B. Skillicorn
School of Computing
Queen’s University, Kingston, Canada

[email protected]

Abstract—We analyze the posts in the Islamic Awareness forum, using models for frequent words (content), for Salafist-Jihadist language, and for deception. These last two models each produce a single-factor ranking enabling, in each case, the most useful subset of posts to be selected for further analysis. Posts that rank highly for Salafist-Jihadist language rank low for deception, suggesting that faking extremist websites is probably an ineffective strategy. The process described here is a template for analysis of many kinds of open-source corpora where language models of what makes posts interesting are known.

I. PROBLEM

Intelligence analysts, and others such as attorneys, are often faced with a large collection of documents from which they want to extract and analyze those that are relevant, without having to read them all. They therefore want to rank them in either a model-independent [1] or model-dependent way.

We address this problem by analyzing a real-world Jihadist forum, Islamic Awareness, from three perspectives: its basic content, its Salafist-Jihadist language, and its deceptive language. Investigating the content is primarily an exploratory step, designed to answer the question: what is this set of documents about? The presence of particular topics might then guide later analysis steps. Using models of Salafist-Jihadist language and deceptive language allows the complete set of documents to be ranked in order of how intensively each document displays the associated property. An analyst can then pay attention to a suitable prefix of such a ranking.

The contribution of this paper is as a case study of the analysis of a real-world forum. Those who post in an English-language forum are dealing with second-language issues, are amateur writers, and may tend to be emotional. All of these factors make the documents produced hard to process because, for example, they are ungrammatical, they contain many misspellings, and they interleave words from other languages. Part of the contribution is to show that results can still be achieved. We also provide evidence that the Koppel model of Salafist-Jihadi language and Pennebaker’s model of deceptive language both produce good results on data like this; furthermore, they tend to agree.

II. EXTRACTING DOCUMENT-WORD MATRICES

A set of 129,425 postings to the Islamic Awareness forum was collected by the Dark Web project at the University of Arizona (ai.arizona.edu/research/terror). Three different document-word matrices were extracted from this dataset.

In each case, the QTagger [2] was used to extract the word frequencies. This tool can be asked to extract and count the frequencies of all of the words in documents, or can be provided with a predefined set and extract the frequencies of only the given words. In both cases, words are tagged with part-of-speech information to aid in dealing with polysemy.
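The extraction step can be approximated without the QTagger itself. The sketch below is a minimal stand-in: it lowercases and tokenizes each post and, like the QTagger's predefined-set mode, can optionally restrict counting to a given word list. The tokenizer is an assumption, and the part-of-speech tagging the QTagger performs is omitted.

```python
import re
from collections import Counter

def word_frequencies(posts, vocabulary=None):
    """Build one word-frequency Counter per post.

    vocabulary: optional set of words; when given, only those words are
    counted (mirroring the QTagger's predefined-set mode). POS tagging,
    which the QTagger uses to handle polysemy, is omitted in this sketch.
    """
    rows = []
    for post in posts:
        tokens = re.findall(r"[a-z']+", post.lower())
        if vocabulary is not None:
            tokens = [t for t in tokens if t in vocabulary]
        rows.append(Counter(tokens))
    return rows

posts = ["The brothers posted the story", "quote: originally posted by"]
all_counts = word_frequencies(posts)
restricted = word_frequencies(posts, vocabulary={"posted", "quote"})
```

Stacking the Counters over a fixed word order yields the document-word matrix used in the rest of the paper.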

The Salafist-Jihadi model is based on an empirically determined list of Arabic words selected by Koppel and Akiva [4]. We chose the top 100 words from this model and translated them into English using Google Translate, resulting in 85 English words. This translation is rather rough, but the results are still usable. The QTagger was used to extract the frequencies of these words. Because several of them can occur as different parts of speech, the actual number of words extracted was 433. Some of these could have been discarded as irrelevant uses of model words, but the quality of writing in the posts was poor enough that it seemed better to keep all word forms.

Pennebaker’s deception model [5] is based on changes in frequencies of four classes of words. When a writer is being deceptive:

1) First-person singular pronouns (“I”, “mine”) decrease;
2) Exclusive words (words that signal a refinement of the content such as “but” and “or”) decrease;
3) Negative emotion words (“hate”) increase; and
4) Action verbs (“go”) increase.
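A per-post score following this directionality can be sketched as below. The four word lists are tiny illustrative samples, not the model's actual categories, and summing the rates with these signs is one plausible combination rather than the paper's stated formula; since the model is relative, only the ordering of scores across cognate posts is meaningful.

```python
# Illustrative (not the model's actual) word lists for the four classes.
FIRST_PERSON = {"i", "me", "my", "mine"}
EXCLUSIVE = {"but", "or", "except", "without"}
NEGATIVE_EMOTION = {"hate", "enemy", "worthless"}
ACTION_VERBS = {"go", "carry", "walk", "run"}

def deception_score(tokens):
    """Higher = more deceptive under the Pennebaker directionality:
    negative-emotion and action rates count positively,
    first-person and exclusive rates count negatively."""
    n = len(tokens) or 1
    rate = lambda words: sum(t in words for t in tokens) / n
    return (rate(NEGATIVE_EMOTION) + rate(ACTION_VERBS)
            - rate(FIRST_PERSON) - rate(EXCLUSIVE))

sincere = "i think my plan is good but i may be wrong".split()
shifty = "they go and carry the hate forward".split()
```

Ranking a set of cognate posts by this score, rather than interpreting any single value, matches the way the model is used here.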

This model was developed empirically, but has been validated in a large number of settings, and is robust despite its simplicity. The reason that deception can be so easily detected algorithmically, but only with great difficulty by humans directly, is that we lack the resources to maintain counts of words used in real time.

Because the meaning of “increases” and “decreases” requires a baseline, this model cannot be used to determine how deceptive a single document is, but rather to rank a set of cognate documents from most to least deceptive. The absolute deceptiveness of the set cannot be determined. Again, the QTagger was used to extract the frequency of deception-model words; because of variable parts-of-speech labelling, a total of 307 different words were extracted.

For the models with a restricted set of words, some postings did not contain any of the words, and such posts were discarded.

III. NORMALIZATION AND SINGULAR VALUE DECOMPOSITION

If a document-word matrix has n rows and m columns, then each document can be represented as a point in m-dimensional space. However, since m is large, it is natural to project the set of points into a lower-dimensional space, as long as similarities can be preserved.

978-0-7695-4799-2/12 $26.00 © 2012 IEEE DOI 10.1109/ASONAM.2012.239

We do this using singular value decomposition (SVD) [3]. If A is the data matrix then its singular value decomposition is

A = U S V′

where U is n × m, S is an m × m diagonal matrix whose entries, called the singular values, are non-increasing, the superscript dash indicates transposition, and V is m × m.

An SVD can be regarded as a change of basis from the geometry implied by the original attributes to a new basis whose axes are aligned in directions where the ‘data cloud’ has large variability. This is done in a way that introduces order: the first of the new axes is aligned in the direction of maximal variation in the data, the second new axis in the direction of (orthogonal, and so uncorrelated) next greatest variation, and so on. The singular values indicate the amount of variation captured in each of these new directions. Each row of U gives the coordinates of the corresponding posting in this new space.

A singular value decomposition can be truncated after any k dimensions, and the resulting approximation is the most faithful with that dimensionality. Such an approximation is naturally regarded as an embedding of the data in a way that captures its structure to the greatest possible extent.
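As a sketch of this truncation, using NumPy on a toy 4×3 document-word matrix invented for illustration:

```python
import numpy as np

# Toy document-word matrix: 4 posts x 3 words.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 4.0],
              [2.0, 2.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U S V'

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation of A
coords = U[:, :k] * s[:k]                    # post coordinates in the truncated space
```

The rows of `coords` are what the plots in Section IV display when k = 2 or 3.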

A truncated SVD has two properties that are especially useful for understanding data. First, posts that resemble many other posts (in the sense of having similar word usage patterns) are positioned as close as possible to these other posts in the embedding – in the fixed point that the SVD represents, this means that they tend to be placed close to the origin. Posts that have unusual word usage patterns occupy positions that few other posts do, and these dimensions tend to be projected away. As a result, such unusual posts also tend to be placed close to the origin in the truncated representation. The posts that are placed far from the origin, therefore, are those without either of these properties – they have modestly overlapping word usage patterns with other posts. This property is a reasonable surrogate for being the most interesting posts. Using outlier detection would tend to focus attention on posts that are simply unusual or one-of-a-kind, and these are not usually the most profitable targets for investigation. For example, we have observed that forums often contain posts that are simply lists of URLs or news stories copied from multiple sources. Such posts have highly variable word usage patterns, but this makes them less interesting, rather than more interesting.

We will choose k = 2 or 3 so that we can plot the truncated representation. In such plots, distance from the origin is a surrogate for amount of variability, and directions correspond to different kinds of word usage.
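The distance-from-origin heuristic then reduces to sorting posts by the Euclidean norm of their truncated coordinates; the post labels and coordinates below are invented for illustration.

```python
from math import dist

# Hypothetical truncated (k = 2) coordinates for three posts.
post_coords = {"post_a": (0.1, 0.0), "post_b": (2.3, -1.1), "post_c": (0.9, 0.4)}

# Rank posts by distance from the origin, farthest (most interesting) first.
ranked = sorted(post_coords,
                key=lambda p: dist(post_coords[p], (0.0, 0.0)),
                reverse=True)
```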

The singular value decomposition is symmetric with respect to the role of posts and words, since if A = U S V′ then A′ = V S U′. Therefore, plots can be created in which the points represent words, with the same properties – words whose usage across posts is interesting will tend to be placed further from the origin.

The frequency of a word in a document almost always depends, to some extent, on the length of the document. To compensate for this, we divide each entry in a row by the total count for that row, converting frequencies from counts to the fraction of each document they represent. For models using a specific set of words, this normalization uses only the frequencies of the selected words. Unfortunately, normalizing for document length also means that very short documents tend to appear more significant than they are, because the fraction of each document that a single word occurrence represents becomes large. There is no generic solution to this problem, but post length is easy to check and could be used as a secondary criterion for ranking.
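The row normalization just described, dividing each entry by its row total, can be sketched as:

```python
def normalize_rows(matrix):
    """Convert each row of counts to fractions of that document's total.
    Rows whose total is zero are left unchanged."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([x / total for x in row] if total else row[:])
    return normalized

counts = [[2, 1, 1], [10, 0, 10]]
fractions = normalize_rows(counts)
```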

SVD requires the ‘data cloud’ to be centered around the origin before the decomposition is computed – for otherwise the direction of greatest variation will be from the origin to the center of the cloud, and subsequent axes will be distorted by the requirement that they be orthogonal to the first. This centering is conventionally achieved by converting each column’s entries to z-scores, that is, subtracting the column mean from each entry and dividing by the column standard deviation.

However, document-word matrices are often sparse, so applying this standard technique has several unfortunate effects. The denominators depend on all of the column entries, not just the non-zero ones, so the available information is blurred; the (large number of) zero values are transformed to small negative values, which does not reflect the semantics of the absence of a word in a document; and the matrix is no longer sparse, which introduces computational penalties.

Instead, we transform each column by computing non-zero z-scores, that is, subtracting the column mean of the non-zero entries from the non-zero entries, and dividing them by the non-zero entries’ standard deviation. Zero entries remain as zero entries. This introduces an artifact: it conflates the absence of a word in a document with the median frequency of a word in a document. This is unfortunate in some ways, but it can be justified to a certain extent. In any case, there is little alternative, and it behaves much better than conventional z-scoring.
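A sketch of this non-zero z-scoring; the paper does not say whether sample or population standard deviation is used, so population standard deviation is assumed here, and columns with fewer than two non-zero entries are left untouched.

```python
from statistics import mean, pstdev

def nonzero_zscore_columns(matrix):
    """Column-wise z-scoring using only the non-zero entries;
    zero entries stay zero, preserving sparsity."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [row[:] for row in matrix]
    for j in range(n_cols):
        nz = [matrix[i][j] for i in range(n_rows) if matrix[i][j] != 0]
        if len(nz) < 2:
            continue  # too few non-zero entries to standardize
        mu, sigma = mean(nz), pstdev(nz)
        if sigma == 0:
            continue  # all non-zero entries identical
        for i in range(n_rows):
            if matrix[i][j] != 0:
                result[i][j] = (matrix[i][j] - mu) / sigma
    return result

matrix = [[2, 0], [4, 0], [0, 5]]
standardized = nonzero_zscore_columns(matrix)
```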

IV. RESULTS

A. Basic content – the most common words

The top 1000 most-frequent words were retained and the SVD of the resulting posts-words matrix computed. Figure 1 shows the resulting structure. It is shaped roughly like a caltrop (i.e. like the skeleton of a tetrahedron). Figure 2 shows the most important structure of the words, explaining three of the arms of the structure of the posts. The most important variation is shown vertically (for legibility), with words labelled only when they are more than twice the median distance from the origin.

Fig. 1. Distribution of posts by content. The axis U1 is the direction of maximal variation across the corpus, U2 the next-greatest uncorrelated variation, and so on.

Fig. 2. Distribution of content words

The lower part of the figure shows the most significant words associated with the strongest variational structure. These are words such as “by”, “is” and “to”, rather than content-filled nouns. This indicates that the variation in content among the postings is not primarily what the documents are talking about, but differences in style, revealed by variations in the usage of these ‘little’ words.

The right-hand side of the figure shows that extremely simple words such as “the”, “of”, and “and” account for large amounts of the variability among the posts (shown in another view by the peak in Figure 3). This is not an effect of the natural high frequency of these words, which has been removed by normalization. For example, even after normalization, “the” remains the most variable word in the entire corpus. The differential role of these words is probably a second-language effect.

Fig. 3. Distribution of posts and words overlaid. Because of symmetry, variation in both posts and words is necessarily aligned.

Figure 4 shows the remaining structure hidden in the third dimension. This peak is almost entirely associated with the words “quote”, “originally”, “posted”, and “by”. In other words, these posts are those that repost content from other places. Thus such posts have been neatly separated from original content.

Fig. 4. Distribution of posts and words showing third dimension

Given that the style of all of the posts in the forum is homogeneous, in the sense that they tend to form a single-factor structure, we now turn to extracting the top 1000 most-frequent nouns to provide a richer picture of what is being discussed.

The overall structure of the posts based on nouns is not dissimilar to that based on the most frequent words. Again, there is a strong branch associated with the word “quote”; this branch is also associated more weakly with the words “abu”, “ibn” and “brother”, which is not unexpected. The structure of the words is shown in Figure 6. The words whose use varies strongly among the postings are, not surprisingly, “Allah” and “people”. The other discriminating words are “time”, “Muslims”, “Islam”, “may”, “way” and “prophet”.

Fig. 5. Distribution of posts based on most-frequent nouns

Fig. 6. Distribution of nouns (“quote” is well outside the displayed area, at the upper left)

B. Jihadi language

Figure 7 shows the distribution of posts according to the Salafist-Jihadi language model. The structure is almost single-factored (that is, it looks like a cigar that is much longer than its extent in either of the next two dimensions). There is some branching at the left-hand end. Figure 8 shows the positions of the associated words in the model, and shows that the branching is the result of differential use of three sets of words: “you”, “they”, and “who” and “was” (which are as differentiated as the others but in the dimension that goes into the page). The first two represent an obvious difference based on the objects that are being discussed: in one case, the readers, and in the other, outsiders. In Figure 7, the position of the posts can be considered as a ranking from most-jihadist (at the left) to least-jihadist (at the right). Points that correspond to highly Jihadist documents and whose distance from the origin is large enough are labelled with black squares. We will return to these documents in the next subsection.

Fig. 7. Distribution of posts by Jihadi language – squares indicate those with high scores

Fig. 8. Distribution of Jihadi words

The striking aspect of Figure 8 is that, although the word list itself contains many meaningful content words (Jihad, Platform, Monotheism, Mujahideen, Way, Unbelievers, Infidelity, Faithful, Tyrants, They, Fighting, God, Themselves, Faith are the highest on the list), the words associated with the most variability are actually common words. This is consistent with Koppel and Akiva’s results [4] where, using only function words, they were able to classify documents as Salafist-Jihadi with an accuracy of 75%. In other words, what distinguishes Salafist-Jihadist content is not what, as humans, we might consider content: particular ‘dangerous’ words that might be looked for on a watch-list, but rather unusual habits of using words that, in themselves, are commonplace.

C. Deception

Figure 9 shows the structure of posts based on the ways in which they use the words of the deception model. The most deceptive documents are to the left of the figure, and the least deceptive to the right. The documents that were ranked highly on Jihadi word use are labelled in this figure with blue squares. It is clear that highly Jihadi posts tend also to be sincere (i.e. low in deception) posts.

Figure 10 shows the variation in word use across all of the posts. The figure shows that the ‘tail’ structure on the right of Figure 9 is primarily the result of variation between ways of using “I” and ways of using the two most significant exclusive words, “but” and “or”.

Fig. 9. Distribution of posts by deceptiveness – posts with high Jihadi language scores are labelled with squares

Fig. 10. Distribution of deception-model words

V. DISCUSSION AND CONCLUSIONS

The analysis of content shows that the greatest variation comes not from different topics, but from different patterns of usage of ordinary words. To some extent this may be the result of a single, large topic on which this forum is focused. Previous analysis of the al Ansar forum [6] showed a pattern that separated religious content from news stories, but the words associated with the religious dimension agree with those found to be significant here.

The Jihadi language model is further supported by the analysis here since it produces a single-factor structure with, at one extreme, posts with high rates of Jihadi language and, at the other, posts with low rates; and hardly any other variability. Again, the words that have the most interesting variability between the high and low ends of this spectrum are relatively simple and common words. The al Ansar forum analysis using this same language model produced almost exactly the same list of most significant words.

Both of these results suggest that attempts to look for Jihadist communication using some kind of watch-list mechanism scanning for ‘suspicious’ words are unlikely to perform well – there appear to be few differences between Jihadist and ordinary religious discussions with respect to such content-filled words. Rather, the differentiator is the rate of use of much more innocuous words.

The deception results further support the Pennebaker model as a detector of deception since it also produces a single-factor structure with little variability in any other dimensions. Interestingly, posts that rank high for the Jihadist model rank low for the deception model. This suggests that attempts to fake Jihadist content, for example on honeytrap websites, are ultimately doomed to fail, since the deception involved is probably detectable by software analysis.

This paper has laid out a template by which an analyst might approach a large corpus with the goal of reading only those documents most likely to be interesting. Looking at content provides a way to discover how many topics a corpus is about (in this case, apparently only one), and so potentially eliminate documents about irrelevant topics. The Koppel model then provides a way to rank documents by their Jihadist intensity, and so to select whatever prefix of this ranking there are the resources to analyze. The deception model here plays the role of detecting sincerity. Using the two models together, that is, looking for documents with high Salafist-Jihadist intensity and low deceptiveness, has the potential to do better than either model by itself.
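One way the two rankings might be combined can be sketched as follows; the paper does not specify a combination rule, so summing rank positions is an assumption here, and the per-post scores are invented for illustration.

```python
# Hypothetical per-post scores: higher jihadi = more Jihadist language,
# higher deception = more deceptive.
jihadi = {"p1": 0.9, "p2": 0.2, "p3": 0.7}
deception = {"p1": 0.1, "p2": 0.8, "p3": 0.9}

def combined_ranking(jihadi, deception):
    """Order posts so that high Jihadist intensity and low deceptiveness
    come first, by summing the two rank positions (rank 0 = best)."""
    j_rank = {p: r for r, p in
              enumerate(sorted(jihadi, key=jihadi.get, reverse=True))}
    d_rank = {p: r for r, p in
              enumerate(sorted(deception, key=deception.get))}
    return sorted(jihadi, key=lambda p: j_rank[p] + d_rank[p])

order = combined_ranking(jihadi, deception)
```

An analyst would then read a prefix of `order` as long as resources allow.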

In open-source intelligence analysis it is common to encounter sets of documents so large that it is practically impossible to examine each individually. The work presented here provides a way to first rank, and then select a prefix of the ranking, with some confidence that the documents selected will be the most relevant with respect to the model used for the ranking.

REFERENCES

[1] P.K. Chandrasekaran and D.B. Skillicorn. Finding interesting documents in a corpus. In Text Mining Workshop, SIAM International Conference on Data Mining, 2012.

[2] J.L. Creasor and D.B. Skillicorn. QTagger: Extracting word usage from large corpora. Technical Report 2012-587, Queen’s University, School of Computing, 2012.

[3] G.H. Golub and C.F. van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.

[4] M. Koppel, N. Akiva, E. Alshech, and K. Bar. Automatically classifying documents by ideological and organizational affiliation. In Proceedings of the IEEE International Conference on Intelligence and Security Informatics (ISI 2009), pages 176–178, 2009.

[5] M.L. Newman, J.W. Pennebaker, D.S. Berry, and J.M. Richards. Lying words: Predicting deception from linguistic style. Personality and Social Psychology Bulletin, 29:665–675, 2003.

[6] D.B. Skillicorn. Applying interestingness measures to Ansar forum texts. In Proceedings of KDD 2010, Workshop on Intelligence and Security Informatics, pages 1–9, 2010.
