An Alternative Method for Understanding User-Chosen Passwords

Research Article

Zhixiong Zheng,1 Haibo Cheng,2 Zijian Zhang,1 Yiming Zhao,3 and Ping Wang3,4,5

1 School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen 518055, China
2 School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
3 School of Software and Microelectronics, Peking University, Beijing 102600, China
4 National Engineering Research Center for Software Engineering, Peking University, Beijing 100871, China
5 Key Laboratory of High Confidence Software Technologies (PKU), Ministry of Education, Beijing 100871, China

Correspondence should be addressed to Ping Wang; pwang@pku.edu.cn

Received 30 August 2017; Revised 2 December 2017; Accepted 27 December 2017; Published 28 January 2018

Academic Editor: Qi Jiang

Hindawi, Security and Communication Networks, Volume 2018, Article ID 6160125, 12 pages, https://doi.org/10.1155/2018/6160125

Copyright © 2018 Zhixiong Zheng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We present in this paper an alternative method for understanding user-chosen passwords. In password research, much attention has been given to increasing the security and usability of individual passwords for common users. Few studies focus on the relationships between passwords; we therefore explore three kinds of relationships between passwords: modification-based, similarity-based, and probability-based. By regarding passwords as vertices, we shed light on how to transform a dataset of passwords into a password graph. Subsequently, we introduce some notions from graph theory and report on a number of inner properties of passwords from the perspective of graphs. With the assistance of Python Graph-tool, we are able to visualize our password graph to deliver an intuitive grasp of user-chosen passwords. Five real-world password datasets are used in our experiments. We discover that (1) some passwords in a dataset are tightly connected with each other; (2) they have the tendency to gather together as a cluster, as they would in a social network; (3) the password graph has a logarithmic distribution for its degrees. Top clusters in the password graph could be exploited to obtain effective mangling rules for cracking passwords. Also, the password graph can be utilized for a new kind of password strength meter.



1. Introduction

The invention of computers ushered in a new era of digital lives, and text-based passwords have dominated human-computer authentication almost ever since. Passwords continue to prevail on the web as the primary method for user authentication despite consensus among researchers that people deserve something more secure and user-friendly. Much research has focused on the parts of the user authentication problem that involve no user interaction, even though user interaction is the weakest point of password authentication. Researchers have continually proposed alternative authentication mechanisms that aim to replace password-based authentication outright: graphical passwords [1], biometric authentication [2], authentication based on user behaviors [3], single sign-on systems, and so forth. However, the growing adoption of password-based authentication [4], irrespective of its well-known security and usability drawbacks [5–7], reveals that each of the alternatives has its own shortcomings compared to passwords [8]. The inertia of user habits, uncertain transition costs, and constant improvements in passwords mean that incumbent passwords will continue as a stubborn signal for identity authentication in the foreseeable future, where the goal is not impregnable defense but a balance between usability and security [8].

Many studies [9–11] have conducted surveys to reveal users' real-world behaviors in managing passwords. The conflict between users' limited memory and the growing number of passwords they need to organize is the main challenge users currently face. Users reuse passwords or embed mnemonic information into passwords [12] to alleviate the burden of memorizing them. This implies the unpleasant fact that users unconsciously choose usability over security. Security experts have suggested the use of password vaults (managers/wallets) to assist password-based authentication, with which users need only remember one master password to encrypt all their online account passwords in the vault [13, 14].

Before 2009, password researchers proposed some heuristic methods to study password-based authentication, whose main purpose was to reveal the weaknesses of passwords and ultimately to replace passwords outright [5, 15]. After 2009, large-scale password datasets were breached by hackers and intruders and afterwards made publicly available; for example, the leak of 32M passwords from the gaming website RockYou in 2009 is currently the biggest corpus available to the public. These breaches of large-scale password datasets have guided password studies toward more scientific and rigorous methods. Password corpora have typically been used to analyze the distribution of passwords [16], names used in passwords, and character distribution [9] or, most importantly, as a learning set to train probabilistic models such as PCFG-based [17], Markov-based [18], and NLP-based (Natural Language Processing) [19] password cracking algorithms in order to simulate the adversarial password cracking process, leading to sophisticated methods for evaluating the strength of password datasets [20]. Researchers later adapted these probabilistic password cracking models to evaluate the strength of a single password.

1.1. Motivations. The analysis of password security dates back at least to Morris and Thompson's seminal 1979 analysis of 3000 passwords [21]. The analysis methods employed since then can be classified into two categories: password cracking and semantic evaluation. However, these two methodologies focus solely on individual passwords and implicitly neglect the relationships between passwords. The neglected relationships are, on the contrary, remarkably important properties of a password dataset that could facilitate both password cracking and semantic evaluation. Because traditional semantic and cracking methodologies are intrinsically incomplete, we advocate an alternative method for understanding user-chosen passwords.

In mathematics, a graph is a mathematical structure used to model pairwise relations between objects. A graph in this context is made up of vertices (also called nodes or points) connected by edges, which can be either directed or undirected. Graph-theoretic methods, in various forms, have proven particularly useful in many fields such as linguistics, chemistry, physics, and sociology. Graph theory, however, has still not been applied in the password research literature; we therefore apply graph theory to password datasets to dig deeper into the inner properties of passwords and to assess and compare the inherent security behaviors of users. We call this analysis method relationship evaluation, as an extension to semantic evaluation. Our paper provides an alternative view of password relationships through the password graph; we also provide a visualization method for the generated password graph so that it can be observed and studied intuitively.

1.2. Contributions. In this work, we make the following key contributions:

(i) An alternative view of password relationships: we explore the relationships between passwords: modification-based, similarity-based, and probability-based. The modification-based relationship builds on the observation that a user usually modifies an existing password to obtain a new one. Passwords are basically strings; thus we borrow the idea of string similarity to develop the similarity-based relationship. The probability-based relationship is derived from the password distribution, where each password has a probability associated with it.

(ii) Visualization of the password graph: by regarding passwords as vertices and leveraging one of the relationships we explored, we are able to transform a password dataset into a graph, which we call a password graph. With the assistance of Python Graph-tool, we visualize the generated password graph; our visualization method can intuitively convey deeper characteristics of a password dataset that otherwise lie under the hood and remain undiscovered.

(iii) Some insights from graph theory: the resulting password graph provides a fresh perspective on a password dataset. We revisit some key terminology in graph theory to find out what our new password graph brings about.

1.3. Organization. In Section 2, we review prior research on password cracking and semantic evaluations. Section 3 provides some preliminaries. Section 4 details our exploration of password relationships in a password dataset. Section 5 elaborates on our construction of the password graph. Section 6 applies some key insights from graph theory to our password graph. Finally, Section 7 concludes our paper.

2. Related Work

In this section, we briefly review prior pivotal research on password cracking and semantic evaluations to assist our follow-up discussions and explorations.

In 1979, Morris and Thompson analyzed a database of 3000 passwords and reported some basic statistics: 71.12% of the passwords in their sample were 6 characters or fewer, and 86% fell into one of the dictionaries, name lists, and the like. In 1990, Klein [22] collected /etc/passwd files, in which passwords were stored in hashed form, from his friends and acquaintances in the United States and Great Britain. 21% of the passwords were cracked in the first week; in total, approximately 25% of the passwords were guessed, and 51.70% of the cracked passwords were no longer than 6 characters. Dedicated cracking software tools like John the Ripper [23] and hashcat [24] have appeared since and are armed with numerous cracking modes (e.g., brute-force attack and dictionary attack).

Mangling rules in dictionary attack mode continue to evolve beyond heuristic rules. Weir et al. [17] built a machine learning technique based on context-free grammar to automatically derive mangling rules from a large training set of cleartext passwords. Houshmand and Aggrawal [18] derived a Markov-based password cracking algorithm from the Markov chain, the model from which the PageRank algorithm notably originated. Originally, the Markov-based algorithm was not a probabilistic model. Ma et al. [25] investigated the length and structure characteristics of passwords in 6 datasets, 3 of which were from Chinese websites, and improved the algorithm by using different normalization and smoothing methods. They found that, when done correctly, the Markov-based cracking model performed better than the PCFG-based password cracking model. In 2012, Veras et al. [26] did work quite similar to ours: they examined the 32M RockYou dataset by employing visualization techniques. They observed that 15.26% of passwords contained sequences of 5–8 consecutive digits, 38% of which could be further classified as dates. Their research, however, mainly focused on password patterns, which differs from ours.

The guessing resistance of a single user-chosen password was previously estimated by entropy, with reference to Claude Shannon's famous measure $H_1$. Borrowing the idea of Shannon entropy, a variation was proposed in the NIST Electronic Authentication Guideline. It calculated the password entropy mainly based on the length of the password and added partial points if some special heuristic checks were passed. Unfortunately, Shannon entropy and its variations characterize the strength of a distribution for an attacker who wants just to crack a certain proportion of all passwords; Shannon entropy has no direct correlation to the guessing difficulty. Ad hoc metrics (password strength meters) had already been demonstrated to be far from accurate by Weir et al. [27]; they advocated that a cracking-based password strength meter was more compelling. Markov-based PSM [28] and PCFG-based PSM [18] were subsequently proposed, based on the Markov model and the PCFG model, respectively. Wang et al. [12] created a novel PSM on the solid foundation that a user usually reuses one of his/her passwords rather than creating a new one. By using two training sets (one as the base dictionary and the other as the rule-learning dictionary), their PSM was able to empirically derive users' mangling rules on passwords and was thus more accurate.

In 2012, Bonneau conducted a large-scale analysis of 70 million private Yahoo passwords and proposed guesswork (G) as a more direct password strength metric [29]. Yet in 2014, Li et al. [30] concluded that Bonneau's 70M passwords were not representative of all users, especially Chinese users who are not familiar with English. Chinese users prefer digits to letters. They also showed that Chinese users are inclined to insert Pinyin and dates into their passwords.

Semantic patterns, including personal information (e.g., birth dates, personal names, and nicknames), are prevalently embedded in user-chosen passwords [31]. Additionally, a few basic data characteristics like average length, length distribution, and types of characters used were typically reported. Further, structure patterns were also studied by some researchers [9]: passwords containing digits constituted more than 50% of Chinese web passwords, while the value for their English counterparts was only 11.30%, reinforcing the hypothesis that user-generated passwords are greatly influenced by users' native languages. More systematic methodologies were proposed by Shay et al. [32] for creating a new password policy. One inconspicuous thing that semantic evaluations fail to capture is the relationships between passwords in the same dataset. Therefore, we augment this kind of evaluation for the first time by introducing password relationships and graph theory into password evaluation. We hope this evaluation method helps to study passwords more thoroughly.

In 2010, Zhang et al. [33] found that modifications to one's old passwords tend to be predictable, and they utilized this observation to facilitate password cracking. Their work actually concerns password reuse by a single user, independent of other users. Our work focuses on password relationships between different users and classifies them into three categories.

As far as we know, the work by Guo et al. [34] may be the closest to what we have done in this work. They visualized several password datasets, including Yahoo, PhpBB, MySpace, Honeynet, hotmail, and 12306. Their discovery provides an explanation of the attacking curve that has been observed for decades. We find that one of their conclusions is wrong: degrees of passwords follow a logarithmic law, not a power law as they claimed; we detail this in Section 6. Overall, though the two works are somewhat similar, there are still critical differences between them: (a) we explore the relationships between passwords, while they do not; (b) we explore the effects of different thresholds $t$, while they only experimented with threshold 3; (c) the insights we obtain from the visualized graph are essentially distinct from theirs.

3. Preliminaries

In this section, we give formal definitions of graph and linear regression and then introduce the metric for evaluating how well the regression line approximates the real data points. Finally, we provide some basic information on our experimental datasets.

Mathematical Notation. We denote a password dataset by a calligraphic letter $\mathcal{S}$ and the dataset after duplicate removal by another calligraphic letter $\mathcal{P}$. Let $N$ denote the total number of passwords in $\mathcal{P}$.

3.1. Graph

Formal Definition of Graph. A graph $G$ can be defined as a pair $\{V, E\}$, where $V$ is a set of vertices and $E$ is a set of edges between the vertices, $E \subseteq \{(u, v) \mid u, v \in V\}$:

$$ G = \{V, E\} \tag{1} $$

Generally, graphs can be classified into two types: (a) an undirected graph, in which the adjacency relation defined by the edges is symmetric; (b) a directed graph, in which edges have orientations. A simple graph is an undirected graph in which both parallel edges and loops are disallowed, whereas a multigraph allows them.

3.2. Linear Regression. In statistics, linear regression is an approach for modeling the correlation between two variables (one as a scalar dependent variable, denoted by $y$, and the other as an explanatory variable, denoted by $x$) by fitting a linear equation to the experimental data. The most common method for linear regression is least squares. Usually in linear regression, given the value of the explanatory variable $x$, the value of the dependent variable $y$ is an affine function of $x$: $y = k \cdot x + b$, where $k$ is the slope of the line and $b$ is the intercept.

In linear regression, the statistical measure of how well the regression line approximates the real experimental data points is often calculated and compared through the coefficient of determination. The coefficient of determination is usually denoted by $R^2$ and ranges from 0 to 1; the closer to 1, the better. An $R^2$ value of 1 indicates that all experimental data points lie exactly on the regression line, whereas an $R^2$ of 0 indicates that the regression line explains none of the variation in the data.
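As a minimal sketch of the two quantities just described, the following pure-Python function fits $y = k \cdot x + b$ by least squares and computes $R^2$. The function name `fit_line` and the sample points are illustrative assumptions, not part of the paper's experiments.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = k*x + b; returns (k, b, r_squared).

    Assumes at least two distinct x values and that the y values are
    not all identical (otherwise a denominator below is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of (x, y) divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    b = mean_y - k * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return k, b, 1.0 - ss_res / ss_tot


# Points lying exactly on y = 2x + 1 give a perfect fit.
print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))  # (2.0, 1.0, 1.0)
```

Fitting a logarithmic law such as the degree distribution mentioned in the abstract could reuse the same routine after transforming one axis, but which transformation applies is an assumption to check against the data.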

3.3. Spearman's Coefficient. In statistics, Spearman's coefficient is a nonparametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function. The Spearman coefficient $\rho$ is defined as the Pearson correlation coefficient between the two ranked vectors $X$, $Y$:

$$ \rho = \frac{\operatorname{cov}(rg_X, rg_Y)}{\sigma_{rg_X} \sigma_{rg_Y}} \tag{2} $$

where $\operatorname{cov}(rg_X, rg_Y)$ is the covariance of the rank variables and $\sigma_{rg_X}$ and $\sigma_{rg_Y}$ are the standard deviations of the rank variables. By definition, $\rho \in [-1, 1]$, where 0 indicates no monotonic association between the vectors. A perfect Spearman correlation of 1 or $-1$ occurs when one vector is a perfectly monotone function of the other.
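The definition above can be sketched directly: rank both vectors, then take the Pearson correlation of the ranks. The helper names (`ranks`, `pearson`, `spearman`) and the choice of average ranks for ties are assumptions made for this illustration.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))


# A monotone increasing (but nonlinear) relation has coefficient 1.
rho = spearman([1, 2, 3, 4], [10, 20, 40, 80])
```

The common factor $1/n$ in the covariance and in both standard deviations cancels, so the sketch omits it.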

3.4. Password Guess Number. The password guess number characterizes the time complexity required for a password cracking algorithm (PCFG-based or Markov-based) to recover a password. This is generally achieved by measuring the number of guesses required to crack the password. Dell'Amico and Filippone [20] detail a Monte Carlo sampling method that converts the probability of a password pw, as computed by a PCFG or Markov model, into an estimate of the cracker's guess number:

$$ G(\mathrm{pw}) = \frac{1}{k} \sum_{i=1}^{k} \frac{\delta(p_i)}{\Pr(p_i)} + 1, \qquad \delta(p_i) = \begin{cases} 1 & \text{if } \Pr(p_i) > \Pr(\mathrm{pw}), \\ 0 & \text{otherwise,} \end{cases} \tag{3} $$

where $k$ is the sample size and $p_i$ is a password drawn randomly from the corresponding distribution.
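To make the estimator concrete, here is a small sketch under stated assumptions: the "model" is a hypothetical four-password distribution written out by hand rather than a trained PCFG or Markov model, and `estimate_guess_number` is an illustrative name. Only the sampled probabilities are needed, since each sample contributes $\delta(p_i)/\Pr(p_i)$.

```python
import random


def estimate_guess_number(pw_prob, sampled_probs):
    """Monte Carlo estimate: (1/k) * sum(delta(p_i) / Pr(p_i)) + 1.

    pw_prob       -- Pr(pw) under the model
    sampled_probs -- Pr(p_i) for k passwords sampled from the same model
    """
    k = len(sampled_probs)
    # delta(p_i) is 1 only for samples strictly more probable than pw.
    total = sum(1.0 / p for p in sampled_probs if p > pw_prob)
    return total / k + 1


# Hypothetical toy distribution standing in for a trained cracking model.
model = {"123456": 0.5, "password": 0.3, "qwerty": 0.15, "zxcvbn": 0.05}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = rng.choices(list(model), weights=list(model.values()), k=10000)
sampled_probs = [model[p] for p in sample]

# "qwerty" is the third most probable password, so the estimate
# should come out close to 3.
estimate = estimate_guess_number(model["qwerty"], sampled_probs)
```

Each password more probable than pw is sampled with probability $\Pr(p_i)$ and contributes $1/\Pr(p_i)$, so the average counts, in expectation, how many passwords precede pw in a probability-ordered attack.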

Table 1: Basic information about our 5 password datasets.

Dataset   Web service        When leaked   Total PWs
MySpace   Social networking  Oct 2006      49,623
PhpBB     Programmer forum   Jan 2009      255,373
Rootkit   Hacker forum       Feb 2011      69,324
Yahoo     Web portal         July 2012     442,834
12306     Train ticketing    Dec 2014      129,303

3.5. Datasets. For completeness, we give a brief description of the five datasets used in our experiments (see Table 1). They were breached either by hackers or by intruders and later disclosed publicly on the Internet; some of them have already been used in password cracking models [17, 18].

The MySpace dataset was originally published in October 2006. It was obtained by a phishing attack and thus might contain weak as well as fake passwords. Our collection of the MySpace list has 49,623 passwords in total, but a recent report said that over 360 million accounts were actually involved. Each record of the MySpace list contained an email address, a password, and in some cases a second password. Because some accounts had multiple passwords, there were in fact over 427 million passwords in total available for sale [35]. Rootkit's entire MySQL database backup was released by an anonymous group using HBGary's CEO's Twitter account; it initially contained 71,228 passwords cryptographically scrambled using the MD5 hashing algorithm, and we eventually managed to recover 97.33% of them through trawling attacks. In 2012, nearly 443K email addresses and passwords for a Yahoo site were exposed after a server breach; hackers got into Yahoo's Contributor Network database by using a rudimentary attack called SQL injection. The PhpBB dataset includes about 250K passwords leaked from Phpbb.com in January 2009; the hacker even described the whole attack in detail on his blog. At the end of 2014, a password dataset from the Chinese train ticketing website 12306 was leaked to the public by anonymous attackers; it includes nearly 130K passwords. User information registered on the ticket booking website included real names, ID card numbers, and phone and email contacts and may also include information about family members and friends, posing a serious threat to their information privacy.

4. Password Relationships

We divide our findings on password relationships into three categories: modification-based, similarity-based, and probability-based. These three categories are detailed in the following three subsections.

4.1. Modification-Based. The relationship between passwords is, to some extent, tied to users' modifications of passwords, so it can be modeled well by edit distance, which features a set of operations on passwords. Edit distance quantifies how dissimilar two passwords are by counting the minimum number of operations required to transform one password into the other. However, there are various definitions of edit distance featuring different sets of editing operations. Levenshtein distance [36] (hereafter LD) allows three operations: removal, insertion, and substitution of a character in the password. The longest common subsequence distance permits a subset of LD's operations: insertion and deletion. Hamming distance applies only to passwords of the same length; that is, it does not allow insertion or removal. Additional primitive operations like transposition are used by Jaro-Winkler distance, based on the observation that a common typing mistake is the transposition of two adjacent characters in a string.

The LD between two passwords is formally defined as the minimum number of single-character edits one must perform to change one password into the other. Mathematically, the LD between two passwords $a$ and $b$ is given by $\mathrm{lev}_{a,b}(|a|, |b|)$, with

$$ \mathrm{lev}_{a,b}(i, j) = \min \begin{cases} \mathrm{lev}_{a,b}(i - 1, j) + 1 \\ \mathrm{lev}_{a,b}(i, j - 1) + 1 \\ \mathrm{lev}_{a,b}(i - 1, j - 1) + f(a_i, b_j) \end{cases} \tag{4} $$

Here $|a|$ and $|b|$ denote the lengths of passwords $a$ and $b$, respectively, and $f(a_i, b_j)$ is an indicator function that is equal to 0 when $a_i = b_j$ and equal to 1 otherwise. The recurrence starts from the boundary condition $\mathrm{lev}_{a,b}(i, j) = \max(i, j)$ when $\min(i, j) = 0$. The calculation of LD is a dynamic programming problem; the time complexity of the algorithm is $O(|a| \cdot |b|)$.
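The recurrence for LD can be sketched as a standard dynamic programming routine. This is an illustrative implementation, not the authors' code; it keeps only two rows of the table, so it uses $O(|b|)$ memory while the running time remains $O(|a| \cdot |b|)$. The function name `levenshtein` is an assumption.

```python
def levenshtein(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    # Row 0: turning the empty prefix of a into b[:j] needs j insertions.
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Column 0: turning a[:i] into the empty string needs i removals.
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1  # indicator f(a_i, b_j)
            curr.append(min(
                prev[j] + 1,         # removal
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("password", "password1"))  # 1
print(levenshtein("kitten", "sitting"))      # 3
```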

4.2. Similarity-Based. Passwords are essentially strings, so measures of string similarity provide us with substantial methods to evaluate the similarity (relationship) between passwords. Dice's coefficient is a statistic used for comparing the similarity between two sets and can be used to evaluate the similarity between two passwords. Assume $X$ and $Y$ are the sets (each element is unique) of characters that passwords $a$ and $b$ include, respectively. Dice's coefficient of $a$, $b$ is defined as follows:

$$ QS = \frac{2 |X \cap Y|}{|X| + |Y|} \tag{5} $$

where $|X|$ and $|Y|$ are the numbers of elements of the two sets. $QS$ is the quotient of similarity and ranges between 0 and 1.

The Jaccard index measures similarity between two finite sets; it is defined as the size of the intersection divided by the size of the union of the sets $X$ and $Y$ ($X$ and $Y$ have the same meaning as described above):

$$ J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|} \tag{6} $$

If $X$ and $Y$ are both empty, define $J(X, Y) = 1$, so $J(X, Y)$ is also a quotient of similarity and ranges between 0 and 1.
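Both set-based measures are short enough to sketch with Python's built-in `set`. The function names are illustrative, and returning 1 for Dice's coefficient when both passwords are empty is an assumption made here to mirror the Jaccard convention; the text only defines that case for the Jaccard index.

```python
def dice(a, b):
    """Dice's coefficient over the sets of characters of a and b."""
    x, y = set(a), set(b)
    if not x and not y:
        return 1.0  # assumption: treat two empty passwords as identical
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(a, b):
    """Jaccard index over the sets of characters of a and b."""
    x, y = set(a), set(b)
    if not x and not y:
        return 1.0  # defined as 1 when both sets are empty
    return len(x & y) / len(x | y)


# "password" and "passw0rd" each have 7 distinct characters and share 6.
print(jaccard("password", "passw0rd"))  # 0.75
```

Because both measures work on character sets, they ignore character order and repetition; for instance, any two anagrams score 1.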

4.3. Probability-Based. From a PCFG-based or Markov-based password cracking model, we can assign a probability to every password in a dataset. In principle, two passwords are similar in a sense (they are equally likely to be picked by a human user) if their corresponding probabilities are close enough; the closer, the more similar. One workable measure of the relationship between passwords $a$ and $b$ may be

$$ S(a, b) = \frac{(\Pr(a) - \Pr(b))^2}{(\Pr(a) + \Pr(b))^2} \tag{7} $$

where smaller values mean more similar passwords. When $a$ and $b$ have the same probability, $S(a, b)$ equals 0. An extreme situation happens when there is only one password in the distribution: the probability of that password is 1, and the probability of every other password is 0. The value of $S(a, b)$ (where $\Pr(a) = 1$, $\Pr(b) = 0$) then equals 1.
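A quick sketch of this measure, assuming the probabilities have already been obtained from some cracking model. The function name `probability_gap` and the handling of two zero probabilities are assumptions for illustration.

```python
def probability_gap(prob_a, prob_b):
    """S(a, b) = (Pr(a) - Pr(b))^2 / (Pr(a) + Pr(b))^2, in [0, 1]."""
    if prob_a == 0 and prob_b == 0:
        return 0.0  # assumption: two impossible passwords count as equal
    return (prob_a - prob_b) ** 2 / (prob_a + prob_b) ** 2


print(probability_gap(0.2, 0.2))  # 0.0
print(probability_gap(1.0, 0.0))  # 1.0
```

Note that the value depends only on the ratio of the two probabilities, so pairs such as (0.3, 0.1) and (0.03, 0.01) receive the same score.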

In summary, the relationships between passwords we describe above are only a small portion of the many relationships that exist. Moreover, some of them are not directly comparable: for example, the edit distance between passwords is not a measure of similarity but a number of edits. One can, of course, divide it by the longer length of the two passwords, which yields the relative LD:

$$ \mathrm{lev}^{\mathrm{rel}}_{a,b}(|a|, |b|) = \frac{\mathrm{lev}_{a,b}(|a|, |b|)}{\max\{|a|, |b|\}} \tag{8} $$

The value of the relative LD for any two passwords falls into the interval $[0, 1]$, in which the left boundary 0 means exactly the same and the right boundary 1 means altogether different.
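As a small worked example of the relative LD, take $a$ = password and $b$ = password1 (a pair chosen here purely for illustration). One insertion turns $a$ into $b$, and the longer password has 9 characters, so:

```latex
\[
\mathrm{lev}^{\mathrm{rel}}_{a,b}(|a|, |b|)
  = \frac{\mathrm{lev}_{a,b}(|a|, |b|)}{\max\{|a|, |b|\}}
  = \frac{1}{\max\{8, 9\}}
  = \frac{1}{9} \approx 0.11 .
\]
```

By contrast, two passwords of equal length with no character in common at any position, such as abc and xyz, need one substitution per character and therefore reach the right boundary 1.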

4.4. Rationale. The most practical and accurate measure of password relationships depends on the particular application scenario. There is no single best measure for generating a password graph, only the one most appropriate to the concrete scenario. For example, if we want to study the modification habits of users, we may choose the modification-based relationship, but if we only want to learn the password cracking property, we may choose the probability-based relationship. We develop our work using the modification-based relationship between passwords (the other relationships explored above are still worth deeper research, and we leave them for future work). Prior surveys have already provided plenty of information on how users modify their passwords: adding digits or symbols at the beginning/end of a password, capitalizing a letter, Leet transformation, and so forth [9, 12, 32, 37]. Taking the permitted operations of each type of edit distance into consideration, we infer that LD is currently the most appropriate measure of password relationship among all types of edit distance, since it models real-world users' mangling behaviors (insertion, deletion, and substitution are popular, while transposition is not). In Section 5, we utilize LD to formally define the edges in a graph.

5 Construction of Password Graph

Generally speaking, any entity that can be described by objects with relationships between them can be regarded as a graph. Since our principal purpose is to characterize the relationship between passwords, we shed light on this notion by (1) regarding each unique password in the dataset as a vertex, $V = \{p \mid p \in \mathcal{P}\}$, and (2) letting two passwords whose edit distance is not more than a threshold $t$ form an edge, $E = \{(u, v) \mid \mathrm{lev}_{u,v}(|u|, |v|) \le t \text{ and } u, v \in V\}$. We call the resulting graph, transformed from the password dataset, the password graph $G$. This is a very intuitive method to turn a password dataset into a graph, though other approaches might fulfill the same goal (e.g., one can consider unique printable ASCII characters as vertices and connect the characters that form a password, which seems reasonable, but its practicability and efficiency remain unknown).

Figure 1: Construction phases. Raw dataset (password, password, 123456, 123456, 1234567, password1) → Phase 1 → deduplicated dataset (password, 123456, 1234567, password1) → Phase 2 → password graph $G$.

5.1. Construction Procedures. We describe our construction procedures for the password graph in more detail in this subsection. The construction procedures are organized into two main phases and are illustrated in Figure 1.

Phase 1. Starting with the raw password dataset $\mathcal{S}$, we measure the frequency of every unique password and associate the frequency values with the corresponding passwords. Subsequently, we remove the duplicated passwords to obtain the password dataset $\mathcal{P}$, so that any two passwords picked from $\mathcal{P}$ are invariably different. Every password in $\mathcal{P}$ stands for a vertex in graph $G$; in other words, the vertex set $V$ for $G$ is generated after this phase.

Phase 2. The LD for every pair of passwords in the dataset $\mathcal{P}$ is calculated. An edge is formed if the LD value of its end vertices is not more than the threshold $t$. We provide experiment results for 3 thresholds in this paper (the selection of the threshold $t$ is an interesting open question that needs further research). The edge set $E$ for our graph $G$ is generated after this phase. It is worth noting that the maximum possible number of edges is $N(N - 1)/2$, where $N$ is the size of the set $\mathcal{P}$; when this maximum is reached, the resulting password graph is a complete graph. The other extreme case happens if there are no edges at all in $G$, which means that the resulting password graph is an empty graph.
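The two phases can be sketched as follows. This is a naive illustration under stated assumptions, not the code behind the reported experiments: `build_password_graph` is a hypothetical name, every pair of passwords is compared (quadratic in the number of unique passwords), and the length-difference shortcut is an added optimization that relies on LD never being smaller than the difference of the two lengths.

```python
from collections import Counter
from itertools import combinations


def levenshtein(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # delete ca
                               current[j - 1] + 1,             # insert cb
                               previous[j - 1] + (ca != cb)))  # substitute
        previous = current
    return previous[-1]


def build_password_graph(raw_passwords, t):
    """Return (vertices, edges, frequency) for threshold t."""
    # Phase 1: count frequencies, then keep each unique password once.
    frequency = Counter(raw_passwords)
    vertices = sorted(frequency)
    # Phase 2: connect every pair whose LD is not more than t.
    edges = set()
    for u, v in combinations(vertices, 2):
        if abs(len(u) - len(v)) > t:
            continue  # LD is at least the length difference, so skip early
        if levenshtein(u, v) <= t:
            edges.add((u, v))
    return vertices, edges, frequency


# The toy dataset shown in Figure 1, with threshold 1.
raw = ["password", "password", "123456", "123456", "1234567", "password1"]
vertices, edges, frequency = build_password_graph(raw, t=1)
print(vertices)
print(sorted(edges))
```

For datasets of the size reported in Table 2, an all-pairs comparison like this would be slow; an indexing or pruning strategy would be needed, which this sketch does not attempt.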

In order to observe the password graph of our construction results intuitively, we have visualized it with the Python module Graph-tool [38]. Graph-tool is an efficient Python module for manipulation and statistical analysis of graphs (a.k.a. networks). We export the resulting graph to a file format supported by Graph-tool (the GraphML file format, without isolated vertices) and visualize it with some deliberate tuning of the parameters, which yields an intricate graph image.
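As an illustration of the export step, the sketch below writes an edge list to GraphML using only the Python standard library. It is an assumption about how such an export could be done, not the authors' tooling: the function name `edges_to_graphml` and the node attribute name `password` are hypothetical, and passwords containing characters that are not valid in XML would need extra handling.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def edges_to_graphml(edges):
    """Return a GraphML document (as a string) for an undirected edge list.

    Only passwords that occur in at least one edge become nodes, so
    isolated vertices are left out of the file.
    """
    root = ET.Element("graphml", {"xmlns": GRAPHML_NS})
    # Declare a string attribute that stores the password text on each node.
    ET.SubElement(root, "key", {"id": "d0", "for": "node",
                                "attr.name": "password",
                                "attr.type": "string"})
    graph = ET.SubElement(root, "graph",
                          {"id": "G", "edgedefault": "undirected"})
    node_ids = {}
    for password in sorted({p for edge in edges for p in edge}):
        node_ids[password] = f"n{len(node_ids)}"
        node = ET.SubElement(graph, "node", {"id": node_ids[password]})
        ET.SubElement(node, "data", {"key": "d0"}).text = password
    for u, v in sorted(edges):
        ET.SubElement(graph, "edge",
                      {"source": node_ids[u], "target": node_ids[v]})
    return ET.tostring(root, encoding="unicode")


print(edges_to_graphml({("password", "password1"), ("123456", "1234567")}))
```

A file written this way should be readable by tools that accept GraphML; Graph-tool documents a `load_graph` function for this format, but the exact drawing parameters used for the figures are not given in the text and are not guessed here.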

Figure 2: Visualization result of 12306.

The visualization result of 12306 is shown in Figure 2 and that of MySpace in Figure 3; the visualization results of the remaining datasets are collected in Figure 7. The internal structure and characteristics of the password dataset $\mathcal{P}$ are represented, to some extent, by its visualization, and some qualitative properties can be drawn from these images: a vertex in the graph $G$ may have intensive connections with other vertices; password vertices evidently have the tendency to gather together as clusters; 12306 has two apparently big clusters, while MySpace has numerous small clusters; and vertices spreading all over the canvas reflect, to some degree, that passwords are spread all over the possible password space. A more in-depth quantitative analysis is performed in Section 6 to show more inner properties of the password dataset.

It must be acknowledged that we have not yet exploited the frequencies of passwords in the password dataset $\mathcal{P}$. Our final password graph is a simple undirected graph without any loop or parallel edge, so one feasible exploitation of password frequency may be that each repetition of a password attaches one loop edge to the vertex of that password. Besides, taking password frequency into consideration could establish more than one edge between two vertices, which is denoted as a parallel edge in graph theory. In a word, the aforementioned operations would turn our simple password graph into a multigraph. Such an augmented construction of the password graph would probably embed more original and subtle properties into the final graph but would also introduce higher time and space complexity. We leave these feasible extensions of the implementation for further study.

Figure 3: Visualization result of MySpace.

6. Graph Concepts in Password Graph

Interestingly, the transformation from password dataset to password graph allows concepts from graph theory to be carried over to the password graph. In this section, we revisit the terminology of graph theory. Extending those concepts to our password graph helps to analyze it more concretely and quantitatively.

Degree. The degree of a vertex $v$, written as $d(v)$, is the number of edges with $v$ as an end vertex. Note that directed graphs have the concepts of in-degree and out-degree, but because our password graph is a simple undirected graph, there is no need to differentiate them. The degree of a password in the password graph indicates how many unique passwords in the password dataset $\mathcal{P}$ are similar to it; we denote the degree of password $p$ as $d(p)$.

Statistics of degree for our experimental password datasets are summarized in Table 2. The definitions of the notations used in the table are described in Notations; we also explain them in the main body of our paper. Each dataset has experiment results for 3 thresholds $t$ (from 1 to 3) exhibited in the table. One can easily notice that the column titled $\delta(G)$ is filled with zeroes; that is, the minimum degree in every password dataset is 0, meaning that there always exist isolated vertices in every password dataset. The number of passwords whose degree is zero decreases as the threshold increases. Some studies have attempted to determine the number of users who appear to be choosing passwords that are random or meaningless to human beings. We think, on the other hand, that the number of isolated password vertices (passwords with no other passwords similar to them) is a more meaningful figure for the same goal and is easier to evaluate. Table 3 is a segment of the list of passwords whose degree is zero in 12306 under threshold 3. Although some of them share a common substring, they are not actually similar to each other, because the number of edits between them is beyond the threshold, and they should be considered individually. The frequencies of these passwords are typically 1.

Empirically, a threshold value of 3 is reasonable for analyzing the experiment results, so the following analyses are all based on threshold 3; one reaches the same conclusions when the threshold is 1 or 2 or another viable value. Many people might hold the idea that a bigger set $\mathcal{P}$ would definitely have more edges. In our experiment results, however, PhpBB has the largest number of edges of all password graphs, while Yahoo has the largest number of vertices. Therefore, more password vertices do not necessarily mean more edges. To our surprise, a password can be similar to as many as 3380 passwords (in the PhpBB dataset). The maximum degree increases rapidly as the threshold increases; for example, with threshold 1, PhpBB's maximum degree is only 65, but it explodes to 3380 when the threshold is 3. More strikingly, a password is similar, on average, to around 150 passwords in PhpBB, and even the smallest average number of similar passwords among the 5 datasets is nearly 30.
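The degree statistics reported in Table 2 can be computed from a vertex list and an edge list as sketched below. The function name `degree_statistics` and the dictionary keys are hypothetical; the sketch assumes a simple undirected graph (no loops, no parallel edges) with at least one vertex.

```python
from collections import Counter


def degree_statistics(vertices, edges):
    """Maximum degree, minimum degree, number of isolated vertices,
    and average degree of a simple undirected graph."""
    degree = Counter({p: 0 for p in vertices})  # isolated vertices keep 0
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    values = list(degree.values())
    return {
        "max_degree": max(values),                        # Delta(G)
        "min_degree": min(values),                        # delta(G)
        "isolated": sum(1 for d in values if d == 0),     # Z(G)
        "avg_degree": 2 * len(edges) / len(vertices),     # each edge counts twice
    }


stats = degree_statistics(["a", "b", "c", "d"], [("a", "b"), ("a", "c")])
print(stats)
```

As a consistency check on the table, the PhpBB row for threshold 3 gives an average degree of $2 \times 13836682 / 184341 \approx 150.12$, which agrees with the "around 150" quoted above.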

One has to know that, in principle, for a specific vertex $v$, its maximum possible degree $\phi(t)$ is fixed once the threshold $t$ is determined:

$$\phi(t) = |S_{\mathrm{dis}1}(t)| + |S_{\mathrm{dis}2}(t)| + |S_{\mathrm{dis}3}(t)| \tag{9}$$

where $S_{\mathrm{dis}1}(t)$, $S_{\mathrm{dis}2}(t)$, and $S_{\mathrm{dis}3}(t)$ are the sets containing passwords whose LD from $v$ is 1, 2, and 3, respectively, under threshold $t$. The results of our experiment therefore show only a lower bound of the number of similar passwords (the threshold we apply is smaller than the actual or typical one); the real situation may be worse than we imagine.

The distribution of password graph degree is another thought-provoking characteristic, like the length distribution in password datasets. Our study has shown that the relation between a password's degree and its rank, sorted in descending order by frequency, is more complicated than a simple distribution. Some passwords that occur more frequently may have fewer passwords similar to them, and vice versa. This is particularly striking in 12306: the password “1qaz2wsx” occurs 80 times but its degree is only 6, whereas the password “h123456” has 962 passwords similar to it but its frequency is only 4. Hence, the correlation between the frequency of a password and its degree in the password graph has many big fluctuations (there would be many breaks in the distribution plot). After analyzing our five password datasets, it has turned out that the degree distribution of a password graph can be approximated by the following equation:

$$\mathrm{PGDD}(r) = k \cdot \log(r) + b \tag{10}$$

where $\mathrm{PGDD}(r)$ is the $r$th largest degree and the base of the logarithm is 10.


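Table 2: Statistics of experiment results.

| Password dataset | $\lvert V \rvert$ | $t$ | $\lvert E \rvert$ | $\Delta(G)$ | $\delta(G)$ | $Z(G)$ | Avg. degree | $D(G)$ ($\times 10^{-5}$) | $K(G)$ | $d(G)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| MySpace | 41514 | 1 | 20119 | 58 | 0 | 28119 | 0.97 | 2.33 | 0 | $+\infty$ |
| MySpace | 41514 | 2 | 111884 | 141 | 0 | 16748 | 5.39 | 12.98 | 0 | $+\infty$ |
| MySpace | 41514 | 3 | 622087 | 477 | 0 | 7290 | 29.97 | 72.19 | 0 | $+\infty$ |
| PhpBB | 184341 | 1 | 81135 | 65 | 0 | 130767 | 0.88 | 0.48 | 0 | $+\infty$ |
| PhpBB | 184341 | 2 | 1204816 | 460 | 0 | 70187 | 13.07 | 7.09 | 0 | $+\infty$ |
| PhpBB | 184341 | 3 | 13836682 | 3380 | 0 | 29453 | 150.12 | 81.44 | 0 | $+\infty$ |
| Rootkit | 56859 | 1 | 9281 | 41 | 0 | 47093 | 0.33 | 0.57 | 0 | $+\infty$ |
| Rootkit | 56859 | 2 | 105275 | 133 | 0 | 30564 | 3.70 | 6.51 | 0 | $+\infty$ |
| Rootkit | 56859 | 3 | 1062502 | 953 | 0 | 15792 | 37.37 | 65.73 | 0 | $+\infty$ |
| Yahoo | 342510 | 1 | 140092 | 66 | 0 | 247445 | 0.84 | 0.24 | 0 | $+\infty$ |
| Yahoo | 342510 | 2 | 1394200 | 327 | 0 | 147126 | 8.62 | 2.38 | 0 | $+\infty$ |
| Yahoo | 342510 | 3 | 13686843 | 2127 | 0 | 77122 | 79.92 | 23.33 | 0 | $+\infty$ |
| 12306 | 117808 | 1 | 51299 | 63 | 0 | 95057 | 0.87 | 0.74 | 0 | $+\infty$ |
| 12306 | 117808 | 2 | 676011 | 506 | 0 | 56162 | 11.48 | 9.74 | 0 | $+\infty$ |
| 12306 | 117808 | 3 | 5311460 | 2301 | 0 | 20640 | 90.17 | 76.54 | 0 | $+\infty$ |

Note. (1) Bold figures are the maximum value of each category for threshold 3. (2) The definitions of the notations are summarized in Notations.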

Table 3: A segment of passwords with degree 0 in 12306.

| Number | Password | Frequency |
|---|---|---|
| 1125 | 04768130093 | 1 |
| 1128 | 048398531 | 1 |
| 1138 | 0501154315 | 1 |
| 1139 | 0501170228 | 1 |
| 36909 | 0155402zj | 1 |
| 69622 | 013307aini | 1 |
| 120788 | 015210755s | 1 |

As explained in Section 3, we employ the least-squares fitting method of linear regression to approximate the distribution of the password graph's degree. The fitting results of the 12306 and MySpace password datasets are shown in Figures 5 and 6. The horizontal axis is the rank and the vertical axis is the degree of the password. The other three fitting results are shown together in Figure 7. The values of $k$, $b$, and the coefficient of determination $R^2$ of the regression results are displayed in their respective plots. Every linear regression (except for 12306) has an $R^2$ not smaller than 0.90, which is close to 1 and thus indicates a sound fit. As for the 12306 dataset, its $R^2$ is about 0.84, which is acceptable but not as good as that of the other datasets. A plausible reason may be that the 12306 dataset was collected by trying usernames and passwords from other leaked datasets online and probably cannot represent the real distribution of the entire password dataset of 12306. Guo et al. [34] plot the same picture as ours, in which the relation between log(rank) and degree is linear; this indicates that the degree distribution follows a logarithmic law rather than a power law. Under a power law, the relation between log(rank) and log(degree) would be linear.
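The fit of equation (10) can be sketched as an ordinary least-squares regression of degree on $\log_{10}(\text{rank})$. The function name `fit_pgdd` is hypothetical, and the sketch assumes at least two vertices whose degrees are not all equal; whether the authors fitted every vertex or a subset is not stated, so fitting all ranks is an assumption.

```python
import math


def fit_pgdd(degrees):
    """Fit degree = k * log10(rank) + b by ordinary least squares.

    Ranks are assigned by sorting degrees in descending order, so rank 1
    is the largest degree. Returns (k, b, r_squared).
    """
    ys = sorted(degrees, reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(ys) + 1)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                      # slope
    b = mean_y - k * mean_x            # intercept
    ss_res = sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return k, b, 1 - ss_res / ss_tot   # coefficient of determination


# Synthetic degrees that follow the logarithmic law exactly.
synthetic = [400 - 100 * math.log10(rank) for rank in range(1, 1001)]
k, b, r_squared = fit_pgdd(synthetic)
print(k, b, r_squared)
```

On data that follow the law exactly, the recovered slope and intercept match the generating values and $R^2$ is essentially 1; real degree sequences would give the smaller $R^2$ values discussed above.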

To get the relation between password degree and crackability, we calculate each password's degree and its corresponding guess number based on [20] with sampling size 100000. Subsequently, we plot the password's average guess number against its degree, as shown in Figure 4. We can observe that, overall, the average guess number increases as the degree decreases, or, equivalently, decreases as the degree increases. The Spearman coefficient for each dataset is evaluated too: it is $-0.94$ for Rootkit, $-0.88$ for MySpace, $-0.95$ for Yahoo, $-0.78$ for 12306, and $-0.93$ for PhpBB. The Spearman coefficients further confirm our deduction: they are $-0.88$ or lower for all datasets except 12306 (which is probably caused by its password data not being complete). Thus there is a strong relation between a password's degree and its corresponding guess number; overall, the bigger the password degree is, the more easily the password is cracked.
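For reference, the Spearman coefficient is the Pearson correlation of the ranks of the two variables. The sketch below implements it with average ranks for ties; the function names are hypothetical, the input values are made up for illustration, and it is not known whether the reported coefficients were computed on raw pairs or on the averaged points of Figure 4.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither sequence is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman rank correlation of two equally long sequences."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up degrees and guess numbers: larger degree, smaller guess number.
degrees = [1, 5, 20, 100]
guess_numbers = [1e9, 1e7, 1e5, 1e3]
print(spearman(degrees, guess_numbers))
```

A perfectly decreasing relation, as in the made-up data, gives a coefficient of $-1$; the values reported above are close to that extreme for four of the five datasets.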

Connectivity. The connectivity of a graph is the minimum number of elements that need to be removed to disconnect the remaining nodes from each other. The elements are vertices or edges, so there are vertex connectivity $K(G)$ and edge connectivity $\lambda(G)$. Nevertheless, neither the vertex connectivity nor the edge connectivity of any of our selected password datasets is more than 0, because of the isolated vertices in the password graph. In other words, the password graph is never connected. We list only the value of vertex connectivity in Table 2. We can further remove the isolated vertices and then evaluate the connectivity of the filtered graph, but there are still isolated clusters in the graph, so both edge and vertex connectivity are zero for the filtered password graph.
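Whether a graph is connected at all can be checked with a breadth-first search over its connected components, as sketched below. The function names are hypothetical. A graph with more than one component has vertex and edge connectivity 0, which is the situation described above; the sketch does not compute connectivity for connected graphs.

```python
from collections import defaultdict, deque


def connected_components(vertices, edges):
    """Return the connected components of an undirected graph,
    largest first, each as a list of vertices."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        component = [start]
        queue = deque([start])
        while queue:  # breadth-first search from start
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.append(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def is_connected(vertices, edges):
    """True if the graph has at most one connected component."""
    return len(connected_components(vertices, edges)) <= 1


parts = connected_components(["a", "b", "c", "d", "e"],
                             [("a", "b"), ("b", "c")])
print([len(part) for part in parts])
```

Removing isolated vertices corresponds to dropping the components of size 1; if more than one component remains, the filtered graph is still disconnected.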

Figure 4: Average guess number against degree for all datasets (Rootkit, MySpace, Yahoo, 12306, and PhpBB). Horizontal axis: degree; vertical axis: average guess number (logarithmic scale).

Figure 5: Fitting result of 12306 ($R^2 = 0.84$). Horizontal axis: rank of password (logarithmic scale); vertical axis: degree of password; fitted line: $-561.99 \cdot \log(x) + 2665.34$.

Density. Among simple graphs, a dense graph is a graph in which the number of edges is close to the maximal number of edges. The opposite, a graph with only a few edges, is a sparse graph. For undirected simple graphs, the graph density $D(G)$ is defined as follows:

$$D(G) = \frac{2|E|}{|V|(|V| - 1)} \tag{11}$$

where $|E|$ and $|V|$ are the numbers of edges and vertices, respectively, in the graph. The value of $D(G)$ lies between 0 and 1: an empty password graph has density 0 (no edge in the graph) and a complete password graph has density 1 (any two vertices in the graph have an edge between them). The density of a password graph can be used as a rough approximation of the password dataset's strength.
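Equation (11) is simple enough to check directly against Table 2. The helper below is a hypothetical sketch; rejecting graphs with fewer than two vertices is a choice made here because the formula is undefined in that case.

```python
def density(num_vertices: int, num_edges: int) -> float:
    """Density of a simple undirected graph: 2|E| / (|V| (|V| - 1))."""
    if num_vertices < 2:
        raise ValueError("density needs at least two vertices")
    return 2 * num_edges / (num_vertices * (num_vertices - 1))


print(density(4, 6))  # complete graph on 4 vertices
print(density(4, 0))  # empty graph on 4 vertices
# MySpace at threshold 3, using |V| and |E| from Table 2, in units of 1e-5.
print(density(41514, 622087) * 1e5)
```

With the MySpace values for threshold 3, the result is about $72.19 \times 10^{-5}$, in agreement with the $D(G)$ column of Table 2.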

As shown in Table 2, each password graph is far from being a dense graph. From the definition of density we can derive two extreme cases: a complete password graph has the minimum password dataset strength, because cracking one password successfully facilitates cracking all the remaining passwords; an empty password graph has the maximum password dataset strength, because cracking a password successfully provides no further information about the remaining passwords. Note that the exact definition of this information depends on the selection of the threshold; for example, a threshold of 3 implies that a cracked password $p$ indicates that there are passwords whose LD from $p$ is not more than 3.

Figure 6: Fitting result of MySpace ($R^2 = 0.97$). Horizontal axis: rank of password (logarithmic scale); vertical axis: degree of password; fitted line: $-121.20 \cdot \log(x) + 531.95$.

Distance. The distance between two vertices $u$ and $v$ of a graph is the minimum length of the paths connecting them. If no such path exists, then the distance is set equal to $+\infty$.

Diameter. The diameter of a graph is the longest shortest path between any two vertices. Due to the existence of isolated vertices, there is always at least one pair of vertices at infinite distance in the password graph. Even if we remove the vertices with degree 0 and calculate again, the diameter of the filtered password graph is still $+\infty$, for the same reason as for connectivity: the existence of isolated clusters. The result is also listed in Table 2. In addition to the diameter, the average path length in the graph is unavoidably $+\infty$ too. Further study can be performed to evaluate the diameter and average path length for top clusters (their results will not be infinite because they are connected); we leave it for our future work.
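Distance and diameter as defined above can be computed with breadth-first search, returning infinity as soon as some vertex cannot be reached. The sketch below uses hypothetical function names, assumes `vertices` contains each vertex once, and runs one search per vertex, so it is only practical for small graphs or single clusters.

```python
import math
from collections import defaultdict, deque


def bfs_distances(adjacency, source):
    """Shortest path lengths (in edges) from source to reachable vertices."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def diameter(vertices, edges):
    """Longest shortest path; math.inf if the graph is disconnected."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    longest = 0
    for source in vertices:
        distance = bfs_distances(adjacency, source)
        if len(distance) < len(vertices):
            return math.inf  # some vertex is unreachable from source
        longest = max(longest, max(distance.values()))
    return longest


print(diameter(["a", "b", "c"], [("a", "b"), ("b", "c")]))
print(diameter(["a", "b", "c", "d"], [("a", "b"), ("b", "c")]))
```

Applied to a single connected cluster, the same function returns a finite value, which is the quantity left for future work in the text.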

One appealing outcome of the visualization of our password graph, as we can see intuitively from Figure 2, is that password vertices evidently have the tendency to gather together as clusters, a phenomenon similar to that seen in social networks. Although users choose/create passwords independently from others, they have a tendency to choose similar passwords in the mass. Apparently, there is a minority of top clusters and a vast majority of nontop clusters, and the properties of the top clusters are of particular significance to us. Cluster structure in the context of networks refers to the occurrence of groups of nodes in a network that are more densely connected internally than with the rest of the network. Note the distinction between cluster and clique: in graph theory, a clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent, that is, its induced subgraph is complete, whereas a cluster does not require the completeness property.


Figure 7: Fitting and visualization results of Rootkit, PhpBB, and Yahoo. (a) Visualization result of Rootkit. (b) Visualization result of PhpBB. (c) Visualization result of Yahoo. (d) Fitting result of Rootkit ($R^2 = 0.93$), fitted line $-212.22 \cdot \log(x) + 935.31$. (e) Fitting result of PhpBB ($R^2 = 0.90$), fitted line $-830.95 \cdot \log(x) + 4135.65$. (f) Fitting result of Yahoo ($R^2 = 0.92$), fitted line $-444.70 \cdot \log(x) + 2315.86$. In panels (d) to (f), the horizontal axis is the rank of password (logarithmic scale) and the vertical axis is the degree of password.

In some cases, one vertex may belong to more than one cluster; this might happen in a social network where each vertex represents a person and the clusters represent different groups of people: one cluster for family, another cluster for coworkers, one for friends in the same sports club, and so on. Nonetheless, in our password graph we postulate that one vertex can be part of only one cluster, provided that it does belong to some cluster. Clusters in the password graph are important because of their metaproperties, that is, the characteristics of a cluster as a whole. We therefore propose the idea that a cluster in the password graph acts like a metanode in the graph: intercluster properties differ more significantly than internode properties.

Top password lists [39, 40] are generally used to constitute a banned list that prevents users from choosing common passwords and circumventing the security policies. Top password structures [9] are used specifically to facilitate an efficient password cracking process. We, for the first time, propose the concept of top clusters to replace the top password list and top password structures, for the reason that top clusters are a more accurate and precise description of weak passwords in the context of a password dataset: not only are the passwords that occur in the top password list, or whose structures appear in the top password structures, weak, but so are all the passwords belonging to the same cluster. Table 4 lists a segment of four clusters in Rootkit. Clusters 1, 2, and 4 are common clusters, which can be found in other password graphs too; cluster 3 is present only in Rootkit and is a site-specific cluster. Undoubtedly, they are all weak passwords and should be added to the banned list of an Internet service provider. Combining Table 4 with Figures 2, 3, and 7, we can conclude that blocking top clusters enhances the ability of the password dataset to resist trawling attacks.

Table 4: A segment of four clusters in Rootkit.

| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|
| password | 123456 | rootkit | qwerty |
| passwordx | 123456x | irootkit | qwerty |
| password0 | 1234569 | 1rootkit | qwertyu |
| Password | 12345r | rootkit2 | qwertyz |
| Password | 123456 | rootkit3 | qwerty |
| password0 | 12345s | rootkit | qwertyr |
| passw0rk | 1234566 | rootkit | qwerta |
| password | 1234565 | 4rootkit | qwerty7 |
| p4s5w0rd | 1234569 | rrootkit | qwerty9 |
| pssword | 1234561 | rootKit | qwerty111 |
| Password | 12345678 | RootKit | qwerty12 |

One of the many practical applications of top clusters is the building of effective mangling rules; those rules are a bit different from those used by hashcat [24] and JTR [23]. To get effective mangling rules from a dataset of passwords, one can compare every two passwords in the same top cluster. Our derived rules are actually the operations allowed in Levenshtein distance: insertion, deletion, and substitution. Substitution is a superset of the Leet operation; insertion is a superset of prepending and appending. Our visualization method could also be used for a password strength meter. After a user finishes entering his/her new password, we can evaluate whether it belongs to one of the existing top clusters; if it does, we reject this password and ask for a new one; otherwise, we accept it.
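One way to read "compare every two passwords in the same top cluster" is to recover an optimal edit script for each pair and count the operations that occur. The sketch below does this by backtracking through the Levenshtein table. It is an interpretation, not the authors' procedure: the function names and the rule representation are hypothetical, only one of possibly several optimal scripts is returned, and the direction of each rule depends on the (sorted) order in which a pair is compared.

```python
from collections import Counter
from itertools import combinations


def edit_operations(a, b):
    """One optimal list of Levenshtein operations turning a into b.

    Operations are ("insert", position, char), ("delete", position, char),
    or ("substitute", position, old_char, new_char); positions index a.
    """
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    ops = []
    i, j = len(a), len(b)
    while i > 0 or j > 0:  # walk back from the bottom-right cell
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1                      # characters match
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("substitute", i - 1, a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ops.append(("insert", i, b[j - 1]))
            j -= 1
        else:
            ops.append(("delete", i - 1, a[i - 1]))
            i -= 1
    ops.reverse()
    return ops


def mangling_rules(cluster):
    """Count position-independent operations over all pairs in a cluster."""
    rules = Counter()
    for a, b in combinations(sorted(cluster), 2):
        for kind, _position, *chars in edit_operations(a, b):
            rules[(kind, *chars)] += 1
    return rules


print(edit_operations("password", "password1"))
print(mangling_rules(["password", "password1", "passw0rd"]).most_common())
```

A strength meter along the lines described above could reuse the same distance computation to test whether a candidate password lies within the threshold of any member of a top cluster; that check is not shown here.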

7. Conclusions

In this paper, we have explored possible relationships between passwords (modification-based, similarity-based, and probability-based) to aid the construction of the password graph. On the basis of graph theory and previous surveys, regarding passwords as vertices, we select the modification-based relationship and fix a threshold value of 3 to define the edges in the graph. Our password graph enables the introduction of concepts from graph theory, including degree, density, distance, diameter, and cluster.

The password graph has some novel built-in implications for password research; we use it as an alternative analysis method for password datasets. We find, surprisingly, that for threshold 3 a password can be similar to as many as 3380 passwords in the PhpBB dataset, and that passwords evidently have the tendency to gather together as clusters, just as in a social network. Moreover, by applying linear regression, we discover that the password graph has a logarithmic distribution for its degrees. Our findings suggest that the top password list in adaptive password checking should be replaced with top clusters to reach a higher level of security. Ultimately, the password graph could also enable the creation of effective mangling rules applied universally in dictionary attacks, which needs only a large training dataset without requiring a priori knowledge of user habits or personal information. Also, the password graph can be used for a new kind of password strength meter.

Notations

$V$: Vertex set
$E$: Edge set
$\Delta(G)$: $\max\{d(p) \mid p \in V\}$, the maximum degree of all vertices
$\delta(G)$: $\min\{d(p) \mid p \in V\}$, the minimum degree of all vertices
$Z(G)$: $|\{p \mid p \in V, d(p) = 0\}|$, the number of vertices whose degree is 0
$D(G)$: $2|E| / (|V|(|V| - 1))$, density of the password graph
$K(G)$: Vertex connectivity
$\lambda(G)$: Edge connectivity

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors thank Dr. Ding Wang for his help and support. This research was supported by the National Key R&D Program of China under no. 2016YFB0800603 and no. 2017YFB1200700 and the National Natural Science Foundation of China (NSFC) under Grant no. 61472016.

References

[1] Z. Zhao, G.-J. Ahn, and H. Hu, “Picture gesture authentication: empirical analysis, automated attacks, and scheme evaluation,” ACM Transactions on Information and System Security, vol. 17, no. 4, article 14, 37 pages, 2015.

[2] A. K. Jain, A. Ross, and S. Pankanti, “Biometrics: a tool for information security,” IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 125–143, 2006.

[3] N. Zheng, A. Paloski, and H. Wang, “An efficient user verification system via mouse movements,” in Proceedings of the 18th ACM Conference on Computer and Communications Security, pp. 139–150, ACM, October 2011.

[4] D. Florencio and C. Herley, “A large-scale study of web password habits,” in Proceedings of the 16th International World Wide Web Conference (WWW ’07), pp. 657–666, ACM, May 2007.

[5] J. Yan, A. F. Blackwell, R. J. Anderson, and A. Grant, “Password memorability and security: empirical results,” IEEE Security & Privacy, vol. 2, no. 5, pp. 25–31, 2004.

[6] W. Cheswick, “Rethinking passwords,” Communications of the ACM, vol. 56, no. 2, pp. 40–44, 2013.

[7] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, “Passwords and the evolution of imperfect authentication,” Communications of the ACM, vol. 58, no. 7, pp. 78–87, 2015.

[8] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, “The quest to replace passwords: a framework for comparative evaluation of web authentication schemes,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP ’12), pp. 553–567, IEEE, San Francisco, Calif, USA, May 2012.

[9] D. Wang, H. Cheng, P. Wang, and X. Huang, “Understanding passwords of Chinese users: characteristics, security and implications, CACR report,” in Proceedings of the ChinaCrypt 2015, 2015, http://t.cn/RG8RacH.

[10] A. Adams and M. A. Sasse, “Users are not the enemy,” Communications of the ACM, vol. 42, no. 12, pp. 40–46, 1999.

[11] A. Beautement, M. A. Sasse, and M. Wonham, “The compliance budget: managing security behaviour in organisations,” in Proceedings of the Workshop on New Security Paradigms (NSPW ’08), pp. 47–58, ACM, Lake Tahoe, Calif, USA, September 2008.

[12] D. Wang, D. He, H. Cheng, and P. Wang, “FuzzyPSM: a new password strength meter using fuzzy probabilistic context-free grammars,” in Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’16), pp. 595–606, IEEE, July 2016.

[13] L. Wang, Y. Li, and K. Sun, “Amnesia: a bilateral generative password manager,” in Proceedings of the 36th IEEE International Conference on Distributed Computing Systems (ICDCS ’16), pp. 313–322, IEEE, June 2016.

[14] D. McCarney, D. Barrera, J. Clark, S. Chiasson, and P. C. Van Oorschot, “Tapas: design, implementation, and usability evaluation of a password manager,” in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 89–98, ACM, December 2012.

[15] B. Ives, K. R. Walsh, and H. Schneider, “The domino effect of password reuse,” Communications of the ACM, vol. 47, no. 4, pp. 75–78, 2004.

[16] D. Wang, H. Cheng, P. Wang, X. Huang, and G. Jian, “Zipf’s law in passwords,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 11, pp. 2776–2791, 2017.

[17] M. Weir, S. Aggarwal, B. De Medeiros, and B. Glodek, “Password cracking using probabilistic context-free grammars,” in Proceedings of the 30th IEEE Symposium on Security and Privacy, pp. 391–405, IEEE, May 2009.

[18] S. Houshmand and S. Aggarwal, “Building better passwords using probabilistic techniques,” in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 109–118, ACM, December 2012.

[19] R. Veras, C. Collins, and J. Thorpe, “On the semantic patterns of passwords and their security impact,” in Proceedings of the Network and Distributed System Security Symposium (NDSS ’14), San Diego, Calif, USA, 2014.

[20] M. Dell’Amico and M. Filippone, “Monte Carlo strength evaluation: fast and reliable password checking,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 158–169, ACM, October 2015.

[21] R. Morris and K. Thompson, “Password security: a case history,” Communications of the ACM, vol. 22, no. 11, pp. 594–597, 1979.

[22] D. V. Klein, “Foiling the cracker: a survey of, and improvements to, password security,” in Proceedings of the 2nd USENIX Security Workshop, pp. 5–14, 1990.

[23] John the Ripper password cracker, Openwall, http://www.openwall.com/john.

[24] hashcat, advanced password recovery, https://hashcat.net/hashcat.

[25] J. Ma, W. Yang, M. Luo, and N. Li, “A study of probabilistic password models,” in Proceedings of the 35th IEEE Symposium on Security and Privacy (SP ’14), pp. 689–704, IEEE, May 2014.

[26] R. Veras, J. Thorpe, and C. Collins, “Visualizing semantics in passwords: the role of dates,” in Proceedings of the 9th International Symposium on Visualization for Cyber Security, pp. 88–95, ACM, October 2012.

[27] M. Weir, S. Aggarwal, M. Collins, and H. Stern, “Testing metrics for password creation policies by attacking large sets of revealed passwords,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 162–175, ACM, October 2010.

[28] C. Castelluccia, M. Durmuth, and D. Perito, “Adaptive password-strength meters from Markov models,” in Proceedings of the 19th Annual Network & Distributed System Security Symposium (NDSS ’12), February 2012.

[29] J. Bonneau, “The science of guessing: analyzing an anonymized corpus of 70 million passwords,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP ’12), pp. 538–552, IEEE, San Francisco, Calif, USA, May 2012.

[30] Z. Li, W. Han, and W. Xu, “A large-scale empirical analysis of Chinese web passwords,” in Proceedings of the USENIX Security Symposium, pp. 559–574, 2014.

[31] B. L. Riddle, M. S. Miron, and J. A. Semo, “Passwords in use in a university timesharing environment,” Computers & Security, vol. 8, no. 7, pp. 569–579, 1989.

[32] R. Shay, S. Komanduri, P. G. Kelley et al., “Encountering stronger password requirements: user attitudes and behaviors,” in Proceedings of the 6th Symposium on Usable Privacy and Security, vol. 2, ACM, July 2010.

[33] Y. Zhang, F. Monrose, and M. K. Reiter, “The security of modern password expiration: an algorithmic framework and empirical analysis,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 176–186, October 2010.

[34] X. Guo, H. Chen, X. Liu, X. Xu, and Z. Chen, “The scale-free network of passwords: visualization and estimation of empirical passwords,” https://arxiv.org/abs/1511.08324.

[35] S. Perez, “Recently confirmed Myspace hack could be the largest yet,” May 2016.

[36] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, 1966.

[37] A. Das, J. Bonneau, M. Caesar, N. Borisov, and X. Wang, “The tangled web of password reuse,” in Proceedings of the Network and Distributed System Security Symposium (NDSS ’14), vol. 14, pp. 23–26, 2014.

[38] graph-tool: efficient network analysis with Python, https://graph-tool.skewed.de.

[39] A. Vance, “If your password is 123456, just make it hackme,” The New York Times, vol. 20, p. A1, 2010.

[40] S. Schechter, C. Herley, and M. Mitzenmacher, “Popularity is everything: a new approach to protecting passwords from statistical-guessing attacks,” in Proceedings of the 5th USENIX Conference on Hot Topics in Security, pp. 1–8, USENIX Association, 2010.


Page 2: An Alternative Method for Understanding User-Chosen Passwords

2 Security and Communication Networks

or embed mnemonic information into passwords [12] to alleviate the burden of memorizing them. This implies the unpleasant fact that users unconsciously choose usability over security. Security experts have suggested the use of password vaults (managers/wallets) to assist password-based authentication, with which users need only remember one master password that encrypts all their online account passwords in the vault [13, 14].

Before 2009, password researchers proposed heuristic methods to study password-based authentication, whose main purpose was to reveal the weaknesses of passwords and, in the end, to replace passwords outright [5, 15]. After 2009, large-scale password datasets were breached by hackers or intruders and afterwards became publicly available; for example, the leak of 32M passwords from the gaming website RockYou in 2009 is currently the biggest corpus available to the public. These large-scale breaches have guided the study of passwords toward more scientific and rigorous methods. Password corpora have typically been used to analyze the distribution of passwords [16], names used in passwords, and character distribution [9], or, most importantly, as learning sets to train probabilistic models such as PCFG-based [17], Markov-based [18], and NLP-based (Natural Language Processing) [19] password cracking algorithms in order to simulate the adversarial password cracking process, leading to sophisticated methods for evaluating the strength of a password dataset [20]. The probabilistic password cracking models were afterwards carefully modified by researchers to evaluate the strength of a single password.

1.1. Motivations. The analysis of password security dates back at least to Morris and Thompson's seminal 1979 analysis of 3000 passwords [21]. The analysis methods employed can be classified into two categories: password cracking and semantic evaluation. However, these two methodologies focus solely on individual passwords and implicitly neglect the relationships between passwords. The neglected relationships, on the contrary, are remarkably important properties of a password dataset that could facilitate both password cracking and semantic evaluation. Because traditional semantic and cracking methodologies are intrinsically incomplete, we advocate an alternative method for understanding user-chosen passwords.

In mathematics, graphs are structures used to model pairwise relations between objects. A graph in this context is made up of vertices (also called nodes or points) connected by edges, which can be either directed or undirected. Graph-theoretic methods in various forms have proven particularly useful in many fields such as linguistics, chemistry, physics, and sociology. However, graph theory has not yet been applied in the password research literature; we therefore apply graph theory to password datasets to dig deeper into the inner properties of passwords and to assess and compare the inherent security behaviors of users. We call this analysis method relationship evaluation, as an extension of semantic evaluation. Our paper provides an alternative view of password relationships through the password graph; we also provide a method for visualizing the generated password graph so that it can be observed and studied intuitively.

1.2. Contributions. In this work, we make the following key contributions:

(i) An alternative view of password relationships: we explore three relationships between passwords: modification-based, similarity-based, and probability-based. The modification-based relationship builds on the observation that a user usually modifies an existing password to obtain a new one. Passwords are basically strings; thus we borrow the idea of string similarity to develop the similarity-based relationship. The probability-based relationship derives from the password distribution, in which each password has a probability associated with it.

(ii) Visualization of the password graph: by regarding passwords as vertices and leveraging one of the relationships we explored, we are able to transform a password dataset into a graph, which we call a password graph. With the assistance of Python Graph-tool, we visualize the generated password graph; the visualization intuitively conveys deeper characteristics of a password dataset that otherwise lie under the hood and remain undiscovered.

(iii) Some insights from graph theory: the resulting password graph provides a fresh perspective on a password dataset. We revisit some key terminology in graph theory to find out what the password graph brings about.

1.3. Organization. In Section 2, we review prior research on password cracking and semantic evaluation. Section 3 provides some preliminaries. Section 4 details our exploration of password relationships in a password dataset. Section 5 elaborates on our construction of the password graph. Section 6 applies key concepts from graph theory to our password graph. Finally, Section 7 concludes the paper.

2. Related Work

In this section, we briefly review prior pivotal research on password cracking and semantic evaluation to support our subsequent discussion and exploration.

In 1979, Morris and Thompson analyzed a database of 3000 passwords and reported some basic statistics: 71.12% of the passwords in their sample were 6 characters or fewer, and 86% fell into one of the dictionaries, name lists, and the like. In 1990, Klein [22] collected /etc/passwd files, in which passwords were stored in hashed form, from his friends and acquaintances in the United States and Great Britain; 21% of the passwords were cracked in the first week, approximately 25% of the passwords were guessed in total, and 51.70% of the cracked passwords were no longer than 6 characters. Dedicated cracking software tools such as John the Ripper [23] and hashcat [24] have appeared since and are armed with numerous cracking modes (e.g., brute-force attack and dictionary attack).

Mangling rules in dictionary attack mode continue to evolve beyond heuristic rules. Weir et al. [17] built a machine learning technique based on a context-free grammar to automatically derive mangling rules from a large training set of cleartext passwords. Houshmand and Aggrawal [18] derived a Markov-based password cracking algorithm from the Markov chain, the model from which the PageRank algorithm also notably originated. Originally, the Markov-based algorithm was not a probabilistic model. Ma et al. [25] investigated the length and structure characteristics of passwords in 6 datasets, 3 of which were from Chinese websites, and improved the Markov-based algorithm by using different normalization and smoothing methods. They found that, when done correctly, the Markov-based cracking model performed better than the PCFG-based password cracking model. In 2012, Veras et al. [26] did work quite similar to ours: they examined the 32M RockYou dataset by employing visualization techniques. They observed that 15.26% of passwords contained sequences of 5–8 consecutive digits, 38% of which could be further classified as dates; however, their research mainly focused on password patterns, which differs from ours.

The guessing resistance of a single user-chosen password was previously estimated by entropy, with reference to Claude Shannon's famous measure $H_1$. Borrowing the idea of Shannon entropy, a variation was proposed in the NIST Electronic Authentication Guideline. It calculated password entropy mainly from the length of the password and added bonus points if the password passed certain heuristic checks intended to make it more secure. Unfortunately, Shannon entropy and its variations characterize the strength of a distribution only for an attacker who wants to crack a certain proportion of all passwords; Shannon entropy has no direct correlation with guessing difficulty. Ad hoc metrics (password strength meters, PSMs) had already been demonstrated to be far from accurate by Weir et al. [27], who advocated that cracking-based password strength meters were more compelling. A Markov-based PSM [28] and a PCFG-based PSM [18] were subsequently proposed, based on the Markov model and the PCFG model, respectively. Wang et al. [12] built a novel PSM on the solid foundation that a user usually reuses one of his or her passwords rather than creating a new one. By using two training sets (one as the base dictionary and the other as the rule-learning dictionary), their PSM was able to empirically derive users' mangling rules on passwords and was thus more accurate.

In 2012, Bonneau conducted a large-scale analysis of 70 million private Yahoo passwords and proposed guesswork (G) as a more direct password strength metric [29]. Yet in 2014, Li et al. [30] concluded that Bonneau's 70M passwords were not representative of all users, especially Chinese users who are not familiar with English: Chinese users prefer digits to letters. They also showed that Chinese users were inclined to insert Pinyin and dates into their passwords.

Semantic patterns, including personal information (e.g., birth dates, personal names, and nicknames), are prevalently embedded in user-chosen passwords [31]. Additionally, a few basic data characteristics, such as average length, length distribution, and types of characters used, were typically reported.

Structure patterns were also studied by some researchers [9]: passwords containing digits constituted more than 50% of Chinese web passwords, while the value for their English counterparts was only 11.30%, reinforcing the hypothesis that user-generated passwords are greatly influenced by users' native languages. More systematic methodologies were proposed by Shay et al. [32] for creating a new password policy. One inconspicuous thing that semantic evaluations fail to capture is the relationships between passwords in the same dataset. Therefore, we augment this kind of evaluation method, for the first time, by introducing password relationships and graph theory into password evaluation. We hope this method helps the study of passwords become more thorough.

In 2010, Zhang et al. [33] found that modifications to one's old passwords tend to be predictable, and they utilized this observation to facilitate password cracking. Their work concerns password reuse by a single user, independent of other users. Our work focuses on password relationships between different users and classifies them into three categories.

As far as we know, the work by Guo et al. [34] may be the closest to what we have done in this work. They visualized several password datasets, including Yahoo, PhpBB, MySpace, Honeynet, Hotmail, and 12306. Their discovery provides an explanation of the attacking curve that has been observed for decades. We find that one of their conclusions is wrong: the degrees of passwords follow a logarithmic law, not a power law as they claimed; we detail this in Section 6. Overall, though our works are somewhat similar, there are still critical differences between our work and theirs: (a) we explore the relationships between passwords, while they do not; (b) we explore the effects of different thresholds $t$, while they only experimented with threshold 3; (c) the insights we obtain from the visualized graph are essentially distinct from theirs.

3. Preliminaries

In this section, we give formal definitions of graphs and linear regression and then introduce the metric for evaluating how well the regression line approximates the real data points. Finally, we provide some basic information on our experimental datasets.

Mathematical Notation. We denote a password dataset by a calligraphic letter $\mathcal{S}$ and the dataset after removing duplicates by another calligraphic letter $\mathcal{P}$. Let $N$ denote the total number of passwords in $\mathcal{P}$.

3.1. Graph

Formal Definition of Graph. A graph $G$ can be defined as a pair $\langle V, E \rangle$, where $V$ is a set of vertices and $E$ is a set of edges between the vertices, $E \subseteq \{(u, v) \mid u, v \in V\}$:

$$G = \langle V, E \rangle. \qquad (1)$$

Generally, graphs can be classified into two types: (a) an undirected graph is a graph in which the adjacency relation defined by the edges is symmetric; (b) a directed graph is a graph in which edges have orientations. A simple graph is an undirected graph in which both parallel edges and loops are disallowed, while a multigraph allows them.

3.2. Linear Regression. In statistics, linear regression is an approach for modeling the correlation between two variables (a scalar dependent variable, denoted by $y$, and an explanatory variable, denoted by $x$) by fitting a linear equation to the experimental data. The most common method for linear regression is least squares. In linear regression, given the value of the explanatory variable $x$, the value of the dependent variable $y$ is an affine function of $x$: $y = k \cdot x + b$, where $k$ is the slope of the line and $b$ is the intercept.

The statistical measure of how well the regression line approximates the real experimental data points is usually the coefficient of determination, denoted by $R^2$, which ranges from 0 to 1; the closer to 1, the better. An $R^2$ value of 1 indicates that all experimental data points lie exactly on the regression line, whereas an $R^2$ of 0 indicates that the line explains none of the variation in the data.
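The least-squares fit and the coefficient of determination described above can be sketched as follows. This is a minimal illustration in plain Python; the function names `fit_line` and `r_squared` are ours and do not come from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = k * x + b; returns (k, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    b = mean_y - k * mean_x
    return k, b


def r_squared(xs, ys, k, b):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot
```

For points that lie exactly on a line, `r_squared` evaluates to 1; the sketch assumes the inputs are not all identical, since the denominators would otherwise be zero.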

3.3. Spearman's Coefficient. In statistics, Spearman's coefficient is a nonparametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function. The Spearman coefficient $\rho$ is defined as the Pearson correlation coefficient between the two ranked vectors $X$, $Y$:

$$\rho = \frac{\operatorname{cov}(rg_X, rg_Y)}{\sigma_{rg_X}\,\sigma_{rg_Y}}, \qquad (2)$$

where $\operatorname{cov}(rg_X, rg_Y)$ is the covariance of the rank variables and $\sigma_{rg_X}$ and $\sigma_{rg_Y}$ are the standard deviations of the rank variables. By definition, $\rho \in [-1, 1]$, where 0 indicates no monotonic association between the vectors. A perfect Spearman correlation of 1 or −1 occurs when one vector is a perfect monotone function of the other.
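As a rough sketch of this definition, the coefficient can be computed by ranking both vectors and taking the Pearson correlation of the ranks. Giving tied values the average of their ranks is a common convention and an assumption here; the paper does not state how ties are handled. All function names are illustrative.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of std devs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Any strictly increasing relationship, linear or not, yields a coefficient of 1 up to floating-point rounding.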

3.4. Password Guess Number. The password guess number characterizes the time complexity required for a password cracking algorithm (PCFG-based or Markov-based) to recover a password, measured as the number of guesses required to crack the password. Dell'Amico and Filippone [20] detail a Monte Carlo sampling method that converts the probability of a password pw, as computed by a PCFG or Markov model, into an estimate of the cracker's guess number:

$$G(\mathrm{pw}) = \frac{1}{k} \sum_{i=1}^{k} \frac{\delta(p_i)}{\Pr(p_i)} + 1, \qquad \delta(p_i) = \begin{cases} 1 & \text{if } \Pr(p_i) > \Pr(\mathrm{pw}), \\ 0 & \text{otherwise,} \end{cases} \qquad (3)$$

where $k$ is the sample size and $p_i$ is a password drawn randomly from the corresponding distribution.
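The estimator in (3) can be illustrated with a small sketch. Here the "model" is a toy dictionary from passwords to probabilities, which is purely an assumption for demonstration; a real PCFG or Markov model would supply both the sampling and the probabilities. The function names are hypothetical.

```python
import random


def estimate_guess_number(pw_prob, sample_probs):
    """Monte Carlo estimate of the guess number, following equation (3).

    pw_prob      -- model probability of the target password
    sample_probs -- model probabilities of k passwords sampled from the model
    """
    k = len(sample_probs)
    # Only samples strictly more probable than the target contribute 1 / Pr(p_i).
    total = sum(1.0 / p for p in sample_probs if p > pw_prob)
    return total / k + 1


def sample_probabilities(model, k, rng):
    """Draw k passwords from a toy model (dict: password -> probability)
    and return their probabilities."""
    passwords = list(model)
    weights = [model[p] for p in passwords]
    drawn = rng.choices(passwords, weights=weights, k=k)
    return [model[p] for p in drawn]
```

With a toy model in which exactly two passwords are more probable than the target, the estimate converges to 3 as the sample grows, matching the intuition that the target would be the third guess.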

Table 1: Basic information about our 5 password datasets.

Dataset    Web service         When leaked   Total PWs
MySpace    Social networking   Oct. 2006     49,623
PhpBB      Programmer forum    Jan. 2009     255,373
Rootkit    Hacker forum        Feb. 2011     69,324
Yahoo      Web portal          July 2012     442,834
12306      Train ticketing     Dec. 2014     129,303

3.5. Datasets. For completeness, we give a brief description of the five datasets used in our experiments (see Table 1). They were breached either by hackers or by intruders and later disclosed publicly on the Internet; some of them have already been used in password cracking models [17, 18].

The MySpace list was originally published in October 2006; it was obtained by a phishing attack and thus might contain weak as well as fake passwords. Our collection of the MySpace list has 49,623 passwords in total, but a recent report said that over 360 million accounts were actually involved. Each record of the MySpace list contained an email address, a password, and, in some cases, a second password. Because some accounts had multiple passwords, there were in fact over 427 million passwords available for sale [35]. Rootkit's entire MySQL database backup was released by the Anonymous group using the Twitter account of HBGary's CEO; it initially contained 71,228 passwords cryptographically scrambled using the MD5 hashing algorithm, of which we eventually managed to recover 97.33% through trawling attacks. In 2012, nearly 443K email addresses and passwords for a Yahoo site were exposed after a server breach: hackers got into Yahoo's Contributor Network database by using a rudimentary attack called SQL injection. The PhpBB dataset includes about 250K passwords leaked from Phpbb.com in January 2009; the hacker even described the whole attack in detail on his blog. At the end of 2014, a password dataset from the Chinese train ticketing website 12306 was leaked to the public by anonymous attackers; it includes nearly 130K passwords. User information registered on the ticket booking website included real names, ID card numbers, and phone and email contacts and may also include information about family members and friends, posing a serious threat to their information privacy.

4. Password Relationships

We divide our findings on password relationships into three categories: modification-based, similarity-based, and probability-based. These three categories are detailed in the following three subsections.

4.1. Modification-Based. The relationship between passwords is naturally tied, to some extent, to users' modifications of passwords, so it can be modeled well by edit distance, which features a set of operations on passwords. Edit distance quantifies how dissimilar two passwords are by counting the minimum number of operations required to transform one password into the other. However, there are various definitions of edit distance featuring different sets of editing operations. Levenshtein distance [36] (hereafter LD) allows three operations: removal, insertion, and substitution of a character in the password. The operations permitted by the longest common subsequence distance are a subset of LD's: insertion and deletion. Hamming distance applies only to passwords of the same length; that is, it does not allow insertion or removal. Additional primitive operations such as transposition are used by the Jaro-Winkler distance, based on the observation that a common typing mistake is the transposition of two adjacent characters in a string.

The LD between two passwords is formally defined as the minimum number of single-character edits one must perform to change one password into the other. Mathematically, the LD between two passwords $a$ and $b$ is given by $\mathrm{lev}_{a,b}(|a|, |b|)$, where

$$\mathrm{lev}_{a,b}(i, j) = \min \begin{cases} \mathrm{lev}_{a,b}(i-1, j) + 1, \\ \mathrm{lev}_{a,b}(i, j-1) + 1, \\ \mathrm{lev}_{a,b}(i-1, j-1) + f(a_i, b_j), \end{cases} \qquad (4)$$

$|a|$ and $|b|$ denote the lengths of passwords $a$ and $b$, respectively, and $f(a_i, b_j)$ is an indicator function that equals 0 when $a_i = b_j$ and 1 otherwise. The recurrence applies for $i, j \geq 1$; when $\min(i, j) = 0$, $\mathrm{lev}_{a,b}(i, j) = \max(i, j)$. LD is computed by dynamic programming, and the time complexity of the algorithm is $O(|a| \cdot |b|)$.
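The recurrence in (4) translates directly into a row-by-row dynamic program. The sketch below is a standard textbook formulation, not code from the paper; it keeps only two rows of the table, so memory is proportional to the length of the second password while the running time stays proportional to the product of the two lengths.

```python
def levenshtein(a, b):
    """Levenshtein distance between strings a and b (two-row dynamic program)."""
    # prev[j] holds lev(i-1, j); the base row is lev(0, j) = j.
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # base column: lev(i, 0) = i
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1  # indicator f(a_i, b_j)
            curr.append(min(prev[j] + 1,          # removal
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[len(b)]
```

For instance, appending one digit to a password gives a distance of 1, which is the kind of small modification this relationship is meant to capture.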

4.2. Similarity-Based. Passwords are essentially strings, so measures of string similarity provide substantial methods for evaluating the similarity (relationship) between passwords. Dice's coefficient is a statistic used for comparing the similarity between two sets and can be used to evaluate the similarity between two passwords. Assume that $X$ and $Y$ are the sets of unique characters contained in passwords $a$ and $b$, respectively. Dice's coefficient of $a$, $b$ is defined as follows:

$$\mathrm{QS} = \frac{2\,|X \cap Y|}{|X| + |Y|}, \qquad (5)$$

where $|X|$ and $|Y|$ are the numbers of elements of the two sets. QS is the quotient of similarity and ranges between 0 and 1.

The Jaccard index measures the similarity between two finite sets; it is defined as the size of the intersection divided by the size of the union of the sets $X$ and $Y$ (with the same meaning as above):

$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|}. \qquad (6)$$

If $X$ and $Y$ are both empty, we define $J(X, Y) = 1$; $J(X, Y)$ is thus also a quotient of similarity and ranges between 0 and 1.
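Both set-based measures are short to express over the character sets of two passwords. In the sketch below, returning 1 for Dice's coefficient when both passwords are empty is our assumption, chosen to mirror the convention stated for the Jaccard index; the function names are illustrative.

```python
def dice_coefficient(a, b):
    """Dice's coefficient over the sets of unique characters of a and b."""
    x, y = set(a), set(b)
    if not x and not y:
        return 1.0  # assumption: same convention as the Jaccard index
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard_index(a, b):
    """Jaccard index over the sets of unique characters of a and b."""
    x, y = set(a), set(b)
    if not x and not y:
        return 1.0  # defined as 1 when both sets are empty
    return len(x & y) / len(x | y)
```

Because both measures ignore character order and repetition, two passwords that are anagrams of each other score 1 under either measure, which is one reason the choice of relationship depends on the application.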

4.3. Probability-Based. From a PCFG-based or Markov-based password cracking model, we can assign a probability to every password in a dataset. In principle, two passwords are similar in a sense (they are equally likely to be picked by a human user) if their corresponding probabilities are close enough: the closer, the more similar. One workable measure of similarity between passwords $a$ and $b$ may be

$$S(a, b) = \frac{(\Pr(a) - \Pr(b))^2}{(\Pr(a) + \Pr(b))^2}. \qquad (7)$$

When $a$ and $b$ have the same probability, $S(a, b)$ equals 0, so smaller values indicate more similar passwords. An extreme situation occurs when there is only one password in the distribution: the probability of that password is 1 and the probability of every other password is 0, so $S(a, b)$ (where $\Pr(a) = 1$, $\Pr(b) = 0$) equals 1.
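A minimal sketch of the measure in (7) follows, taking the two model probabilities as inputs. The function name is ours, and raising an error when both probabilities are zero is an assumption, since the formula is undefined in that case.

```python
def probability_measure(pr_a, pr_b):
    """Measure S(a, b) from equation (7): 0 for equal probabilities,
    1 when one probability is 1 and the other is 0."""
    total = pr_a + pr_b
    if total == 0:
        # Assumption: the formula is undefined when both probabilities are 0.
        raise ValueError("at least one probability must be positive")
    return (pr_a - pr_b) ** 2 / total ** 2
```

The measure is symmetric in its two arguments and depends only on the ratio of the two probabilities, not on their absolute size.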

In summary, the relationships between passwords described above are only a small portion of the countless relationships that exist. Note, for example, that the edit distance between passwords is not a normalized measure of similarity but a number of edits. One can, of course, divide it by the length of the longer of the two passwords, which turns LD into a relative LD:

$$\mathrm{lev}^{\mathrm{rel}}_{a,b}(|a|, |b|) = \frac{\mathrm{lev}_{a,b}(|a|, |b|)}{\max(|a|, |b|)}. \qquad (8)$$

The value of the relative LD for two passwords thus falls into the interval $[0, 1]$, in which the left boundary 0 means exactly the same and the right boundary 1 means altogether different.

4.4. Rationale. The practical and accurate way of depicting password relationships depends on the particular application scenario. There is no single best depiction method for generating a password graph, only the most appropriate one for the concrete scenario. For example, if we want to study the modification habits of users, we may choose the modification-based relationship, but if we only want to learn about password cracking properties, we may choose the probability-based relationship. In this work, we proceed with the modification-based relationship between passwords (the other relationships explored above are still worth deeper research, and we leave them for future work). Prior surveys have already provided plenty of information on how users modify their passwords: adding digits or symbols at the beginning or end of a password, capitalizing a letter, Leet transformation, and so forth [9, 12, 32, 37]. Taking the permitted operations of each type of edit distance into consideration, we infer that LD is currently the most appropriate measure of password relationship among all types of edit distance, since it actually models real-world users' mangling behaviors (insertion, deletion, and substitution are popular, while transposition is not). In Section 5, we utilize LD to formally define the edges in a graph.

5. Construction of Password Graph

Generally speaking, any entity that can be described by objects with relationships between them can be regarded as a graph. Since our principal purpose is to characterize the relationships between passwords, we realize this notion by (1) regarding each unique password in the dataset as a vertex, $V = \{p \mid p \in \mathcal{P}\}$, and (2) forming an edge between any two passwords whose edit distance is not more than a threshold $t$, $E = \{(u, v) \mid \mathrm{lev}_{u,v}(|u|, |v|) \leq t \text{ and } u, v \in V\}$. We call the graph obtained from the password dataset the password graph $G$. This is a very intuitive method for turning a password dataset into a graph, though other approaches might fulfill the same goal (e.g., one can consider unique printable ASCII characters as vertices and connect the characters that form a password, which seems reasonable, but its practicability and efficiency remain unknown).

Figure 1: Construction phases (raw dataset: password, password, 123456, 123456, 1234567, password1 → Phase 1 → deduplicated set: password, 123456, 1234567, password1 → Phase 2 → $G$).

5.1. Construction Procedures. We describe our construction procedures for the password graph in more detail in this subsection. The construction is organized into two main phases, illustrated in Figure 1.

Phase 1. Starting with the raw password dataset, we measure the frequency of every unique password in the password dataset $\mathcal{S}$ and associate the frequency values with the corresponding passwords; subsequently, we remove the duplicated passwords to obtain the password dataset $\mathcal{P}$, so that any two passwords picked from $\mathcal{P}$ are always different. Every password in $\mathcal{P}$ stands for a vertex in graph $G$; in other words, the vertex set $V$ of $G$ is generated after this phase.

Phase 2. The LD for every pair of passwords in the dataset $\mathcal{P}$ is calculated. An edge is formed if the LD between its end vertices is not more than the threshold $t$. We provide experimental results for 3 thresholds in this paper (the selection of the threshold $t$ is an interesting open question that needs further research). The edge set $E$ of our graph $G$ is generated after this phase. It is worth noting that the maximum possible number of edges is $N(N-1)/2$, where $N$ is the size of the set $\mathcal{P}$, in which case the resulting password graph is a complete graph. The other extreme case occurs when there are no edges at all in $G$, which means that the resulting password graph is an empty graph.
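The two phases can be sketched in plain Python as below. This is a naive all-pairs illustration, not the authors' implementation: the adjacency-dictionary representation, the function names, and the length-difference shortcut (valid because the LD is never smaller than the difference in lengths) are our assumptions. At the scale of the datasets in Table 1, a quadratic all-pairs scan would be slow, and the paper does not say how the computation was organized.

```python
from collections import Counter
from itertools import combinations


def levenshtein(a, b):
    """Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[len(b)]


def build_password_graph(raw_passwords, t):
    """Return (frequencies, adjacency) for the password graph with threshold t."""
    # Phase 1: count frequencies, then deduplicate to obtain the vertex set.
    frequencies = Counter(raw_passwords)
    vertices = sorted(frequencies)
    adjacency = {p: set() for p in vertices}
    # Phase 2: connect every pair of distinct passwords with LD <= t.
    for u, v in combinations(vertices, 2):
        if abs(len(u) - len(v)) > t:
            continue  # LD is at least the length difference, so no edge
        if levenshtein(u, v) <= t:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return frequencies, adjacency
```

Applied to the six raw passwords of Figure 1 with threshold 1, this yields four vertices and two edges (password to password1, and 123456 to 1234567), out of a maximum of six possible edges.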

In order to observe the constructed password graph intuitively, we have visualized it with Python Graph-tool [38]. Graph-tool is an efficient Python module for manipulation and statistical analysis of graphs (a.k.a. networks). We export the resulting graph to a Graph-tool supported file format (GraphML, without isolated vertices) and then visualize it; with some deliberate tuning of the parameters, this yields an intricate graph image.
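The export step might look like the following sketch, which writes a GraphML document using only the Python standard library and drops isolated vertices as described. The adjacency-dictionary input, the `pw` attribute key, and the function name are assumptions for illustration; the paper does not show its export code, and the subsequent loading and drawing in Graph-tool are not shown here.

```python
import xml.etree.ElementTree as ET


def to_graphml(adjacency):
    """Serialize an undirected graph (dict: password -> set of neighbours)
    to a GraphML string, omitting isolated vertices."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    # Declare a string attribute on nodes that stores the password itself.
    ET.SubElement(root, "key", {"id": "pw", "for": "node",
                                "attr.name": "password",
                                "attr.type": "string"})
    graph = ET.SubElement(root, "graph", id="G", edgedefault="undirected")
    kept = sorted(p for p, neighbours in adjacency.items() if neighbours)
    node_id = {p: "n%d" % i for i, p in enumerate(kept)}
    for p in kept:
        node = ET.SubElement(graph, "node", id=node_id[p])
        data = ET.SubElement(node, "data", key="pw")
        data.text = p
    for u in kept:
        for v in sorted(adjacency[u]):
            if u < v:  # write each undirected edge only once
                ET.SubElement(graph, "edge",
                              source=node_id[u], target=node_id[v])
    return ET.tostring(root, encoding="unicode")
```

The returned string can be written to a `.graphml` file and opened by tools that read GraphML. Passwords containing characters that are not legal in XML (such as control characters) would need extra handling that this sketch omits.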

Figure 2: Visualization result of 12306.

The visualization result of 12306 is shown in Figure 2 and that of MySpace in Figure 3; the visualization results of the remaining datasets are collected in Figure 7. The internal structure and characteristics of the password dataset $\mathcal{P}$ are represented, to some extent, by its visualization, and some qualitative properties can be read from these images: a vertex in the graph $G$ may have dense connections with other vertices; password vertices evidently tend to gather together in clusters; 12306 has two apparently big clusters, while MySpace has countless small clusters; and vertices spread over the entire canvas, reflecting to some degree that passwords are spread over the whole possible password space. A more in-depth quantitative analysis, showing more inner properties of the password dataset, is performed in Section 6.

Admittedly, the frequencies associated with the passwords in $\mathcal{P}$ are left unused by our construction. Our final password graph is a simple undirected graph without any loops or parallel edges. One feasible way to exploit password frequency would be to attach one loop edge to a password's vertex for each repetition of that password. Alternatively, password frequency could be used to establish more than one edge between two vertices, which is called a parallel edge in graph theory. Either operation would turn our simple password graph into a multigraph. Such an augmented construction would probably embed more original and subtle properties into the final password graph but would also incur higher time and space complexity; we leave these feasible extensions for further study.

Figure 3: Visualization result of MySpace.

6. Graph Concepts in Password Graph

Interestingly, the transformation from password dataset to password graph allows concepts from graph theory to be carried over to the password graph. In this section, we revisit the terminology of graph theory. Extending these concepts to our password graph helps us analyze it more concretely and quantitatively.

Degree. The degree of a vertex $v$, written as $d(v)$, is the number of edges with $v$ as an end vertex. Note that directed graphs distinguish in-degree and out-degree, but because our password graph is a simple undirected graph, there is no need to differentiate them. In a password graph, the degree indicates how many unique passwords in the password dataset $\mathcal{P}$ are similar to a given password; we denote the degree of password $p$ by $d(p)$.

Statistics of degree for our experimental password datasets are summarized in Table 2. The definitions of the notations used in the table are given in the Notations section; we also explain them in the main body of the paper. For each dataset, the table shows experimental results for 3 thresholds $t$ (from 1 to 3). One can easily notice that the column titled $\delta(G)$ is filled with zeroes; that is, the minimum degree of every password dataset is 0, meaning that there always exist isolated vertices in every password dataset. The number of passwords whose degree is zero decreases as the threshold increases. Some studies have attempted to determine the number of users who appear to choose passwords that are random or meaningless to human beings. We think, on the other hand, that the number of isolated password vertices (passwords with no other password similar to them) is a more meaningful figure for the same purpose and is easier to evaluate. Table 3 lists a sample of the passwords whose degree is zero in 12306 under threshold 3; although some of them share a common substring, they are not actually similar to each other, because the number of edits between them exceeds the threshold, and they should be considered individually. The frequencies of these passwords are typically 1.

Empirically, a threshold value of 3 is reasonable for analyzing the experimental results, so the following analyses are all based on threshold 3; one reaches the same conclusions when the threshold is 1 or 2 or another viable value. Many people might expect that a bigger set $\mathcal{P}$ would definitely have more edges. In our experimental results, however, PhpBB has the largest number of edges of all password graphs, while Yahoo has the largest number of vertices. Therefore, more password vertices do not necessarily mean more edges. To our surprise, a password can be similar to up to 3380 passwords (in the PhpBB dataset). The maximum degree increases rapidly as the threshold increases: for example, with threshold 1, PhpBB's maximum degree is only 65, but it explodes to 3380 when the threshold is 3. More strikingly, a password in PhpBB is similar, on average, to around 150 passwords; even the smallest average number of similar passwords among the 5 datasets is nearly 30.
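The degree statistics discussed above (number of edges, maximum and minimum degree, number of isolated vertices, and average degree) are straightforward to compute from an adjacency structure. The sketch below assumes a dictionary mapping each password to the set of its neighbours; the dictionary keys of the result are illustrative names, not the notation of Table 2.

```python
def degree_statistics(adjacency):
    """Summarize vertex degrees of an undirected simple graph
    given as a dict: vertex -> set of neighbouring vertices."""
    degrees = {v: len(neighbours) for v, neighbours in adjacency.items()}
    n = len(degrees)
    return {
        "num_vertices": n,
        # Each undirected edge contributes to the degree of both end vertices.
        "num_edges": sum(degrees.values()) // 2,
        "max_degree": max(degrees.values()),
        "min_degree": min(degrees.values()),
        "isolated_vertices": sum(1 for d in degrees.values() if d == 0),
        "avg_degree": sum(degrees.values()) / n,
    }
```

Since every edge is counted twice in the degree sum, the average degree equals twice the number of edges divided by the number of vertices, which is why a dataset with fewer vertices can still have the higher average.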

Note that, in principle, for a specific vertex $v$, its maximum possible degree $\phi(t)$ is fixed once the threshold $t$ is determined:

$$\phi(t) = |S_{\mathrm{dis}1}(t)| + |S_{\mathrm{dis}2}(t)| + |S_{\mathrm{dis}3}(t)|, \tag{9}$$

where $S_{\mathrm{dis}1}(t)$, $S_{\mathrm{dis}2}(t)$, and $S_{\mathrm{dis}3}(t)$ are the sets of passwords whose LD distance from $v$ is 1, 2, and 3, respectively, under threshold $t$. The results of our experiment therefore show only a lower bound on the number of similar passwords (the threshold we apply is smaller than the actual or typical one); the real situation may be worse than we imagine.

The degree distribution of the password graph is another thought-provoking characteristic, like the length distribution in password datasets. Our study shows that the degree of a password, plotted against its rank when passwords are sorted in descending order of frequency, is more complicated than a simple distribution. Some passwords that occur more frequently have fewer passwords similar to them, and vice versa. This is especially visible in 12306: the password “1qaz2wsx” occurs 80 times but its degree is only 6, whereas the password “h123456” has 962 passwords similar to it but a frequency of only 4. Hence, the correlation between the frequency of a password and its degree in the password graph fluctuates strongly (there would be many breaks in the distribution plot). After analyzing our five password datasets, we found that the degree distribution of a password graph can be approximated by the following equation:

$$\mathrm{PGDD}(r) = k \cdot \log(r) + b, \tag{10}$$

where $\mathrm{PGDD}(r)$ is the $r$th largest degree and the base of the logarithm is 10.
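The fit in (10) can be illustrated with a minimal sketch of ordinary least squares on the pairs (log10(rank), degree). The function name `fit_log_rank` is hypothetical, and the sketch assumes that rank 1 is assigned to the largest degree, as in the definition of PGDD above; it is not the authors' code.

```python
import math


def fit_log_rank(degrees):
    """Least-squares fit of degree = k * log10(rank) + b.

    Assumption: rank 1 is the largest degree, so the input is sorted in
    descending order before fitting. Needs at least two degrees.
    Returns (k, b, r_squared).
    """
    ys = sorted(degrees, reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(ys) + 1)]
    n = len(ys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form slope and intercept of simple linear regression.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    b = mean_y - k * mean_x
    # Coefficient of determination: 1 - residual sum / total sum of squares.
    ss_res = sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else 1.0
    return k, b, r_squared
```

Calling `fit_log_rank` on the list of vertex degrees of one password graph returns the slope, the intercept, and the coefficient of determination of the log-linear fit.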


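Table 2: Statistics of experiment results.

| Password dataset | $\lvert V \rvert$ | $t$ | $\lvert E \rvert$ | $\Delta(G)$ | $\delta(G)$ | $Z(G)$ | Avg degree | $D(G)$ ($\times 10^{-5}$) | $K(G)$ | $d(G)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| MySpace | 41514 | 1 | 20119 | 58 | 0 | 28119 | 0.97 | 2.33 | 0 | $+\infty$ |
| MySpace | 41514 | 2 | 111884 | 141 | 0 | 16748 | 5.39 | 12.98 | 0 | $+\infty$ |
| MySpace | 41514 | 3 | 622087 | 477 | 0 | 7290 | 29.97 | 72.19 | 0 | $+\infty$ |
| PhpBB | 184341 | 1 | 81135 | 65 | 0 | 130767 | 0.88 | 0.48 | 0 | $+\infty$ |
| PhpBB | 184341 | 2 | 1204816 | 460 | 0 | 70187 | 13.07 | 7.09 | 0 | $+\infty$ |
| PhpBB | 184341 | 3 | 13836682 | 3380 | 0 | 29453 | 150.12 | 81.44 | 0 | $+\infty$ |
| Rootkit | 56859 | 1 | 9281 | 41 | 0 | 47093 | 0.33 | 0.57 | 0 | $+\infty$ |
| Rootkit | 56859 | 2 | 105275 | 133 | 0 | 30564 | 3.70 | 6.51 | 0 | $+\infty$ |
| Rootkit | 56859 | 3 | 1062502 | 953 | 0 | 15792 | 37.37 | 65.73 | 0 | $+\infty$ |
| Yahoo | 342510 | 1 | 140092 | 66 | 0 | 247445 | 0.84 | 0.24 | 0 | $+\infty$ |
| Yahoo | 342510 | 2 | 1394200 | 327 | 0 | 147126 | 8.62 | 2.38 | 0 | $+\infty$ |
| Yahoo | 342510 | 3 | 13686843 | 2127 | 0 | 77122 | 79.92 | 23.33 | 0 | $+\infty$ |
| 12306 | 117808 | 1 | 51299 | 63 | 0 | 95057 | 0.87 | 0.74 | 0 | $+\infty$ |
| 12306 | 117808 | 2 | 676011 | 506 | 0 | 56162 | 11.48 | 9.74 | 0 | $+\infty$ |
| 12306 | 117808 | 3 | 5311460 | 2301 | 0 | 20640 | 90.17 | 76.54 | 0 | $+\infty$ |

Note. (1) Bold figures are the maximum value of each category for threshold 3. (2) The definitions of the notations are summarized in Notations.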

Table 3: A segment of passwords with degree 0 in 12306.

| Number | Password | Frequency |
|---|---|---|
| 1125 | 04768130093 | 1 |
| 1128 | 048398531 | 1 |
| 1138 | 0501154315 | 1 |
| 1139 | 0501170228 | 1 |
| 36909 | 0155402zj | 1 |
| 69622 | 013307aini | 1 |
| 120788 | 015210755s | 1 |

As explained in Section 3, we employ the least-squares fitting method of linear regression to approximate the degree distribution of the password graph. The fitting results for the 12306 and MySpace password datasets are shown in Figures 5 and 6. The horizontal axis is the rank and the vertical axis is the degree of the password. The other three fitting results are shown together in Figure 7. The values of $k$, $b$, and the coefficient of determination of each regression are displayed in the corresponding plots. Every linear regression (except for 12306) has an $R^2$ of at least 0.90, which is close to 1 and thus indicates a sound fit. For the 12306 dataset, $R^2$ is about 0.84, which is not ideal but acceptable, though not as good as for the other datasets. A plausible reason is that the 12306 dataset was collected by trying usernames and passwords from other leaked datasets online and probably cannot represent the real distribution of the entire password dataset of 12306. Guo et al. [34] plot the same picture as ours, in which the relation between log(rank) and degree is linear; this indicates that the degree distribution follows a logarithmic law rather than a power law. Under a power law, the relation between log(rank) and log(degree) would be linear.

To relate password degree to crackability, we calculate each password's degree and its corresponding guess number based on [20] with a sample size of 100000. We then plot the average guess number of passwords against their degree, as shown in Figure 4. The overall average guess number increases as the degree decreases; equivalently, it decreases as the degree increases. We also evaluate the Spearman coefficient for each dataset: −0.94 for Rootkit, −0.88 for MySpace, −0.95 for Yahoo, −0.78 for 12306, and −0.93 for PhpBB. The Spearman coefficients further support our deduction: they are −0.88 or lower for all datasets except 12306 (which is probably caused by the incompleteness of its password data). Thus there is a strong relation between a password's degree and its guess number: overall, the larger the degree of a password, the more easily it is cracked.

Connectivity. The connectivity of a graph is the minimum number of elements that need to be removed to disconnect the remaining nodes from each other. The elements can be vertices or edges, so there are a vertex connectivity $K(G)$ and an edge connectivity $\lambda(G)$. However, neither the vertex connectivity nor the edge connectivity of any of our selected password datasets is greater than 0, because of the isolated vertices in the password graph. In other words, the password graph is never connected. We list only the vertex connectivity in Table 2. We can further remove the isolated vertices and evaluate the connectivity of the filtered graph, but the graph still contains isolated clusters, so both the edge and the vertex connectivity of the filtered password graph are zero.

Figure 4: Average guess number against degree for all datasets (Rootkit, MySpace, Yahoo, 12306, and PhpBB). Horizontal axis: degree; vertical axis: average guess number (logarithmic scale).

Figure 5: Fitting result of 12306 ($R^2 = 0.84$). Horizontal axis: rank of password (logarithmic scale); vertical axis: degree of password. Fitted line: $-561.99 \cdot \log(x) + 2665.34$.

Density. In simple graphs, a dense graph is a graph in which the number of edges is close to the maximal number of edges. The opposite, a graph with only a few edges, is a sparse graph. For undirected simple graphs, the graph density $D(G)$ is defined as follows:

$$D(G) = \frac{2|E|}{|V|(|V| - 1)}, \tag{11}$$

where $|E|$ and $|V|$ are the numbers of edges and vertices in the graph, respectively. The value of $D(G)$ lies between 0 and 1: an empty password graph has density 0 (no edge in the graph) and a complete password graph has density 1 (any two vertices in the graph have an edge between them). The density of a password graph can be used as a rough approximation of the strength of the password dataset.
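The quantities reported in Table 2 can be reproduced on a toy dataset with the following sketch. It assumes that two distinct passwords are joined by an edge when their Levenshtein distance is at most the threshold $t$; the function names are hypothetical, and the all-pairs comparison is quadratic, so it is only suitable for small examples, not for the full datasets.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance with insertion, deletion, and substitution (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def graph_stats(passwords, t):
    """Degree statistics and density of the password graph under threshold t.

    Assumption: an edge joins two distinct passwords whose distance is <= t.
    Needs at least two distinct passwords.
    """
    vertices = sorted(set(passwords))  # duplicates are removed first
    degree = {p: 0 for p in vertices}
    edges = 0
    for u, v in combinations(vertices, 2):
        if levenshtein(u, v) <= t:
            edges += 1
            degree[u] += 1
            degree[v] += 1
    n = len(vertices)
    return {
        "V": n,
        "E": edges,
        "max_degree": max(degree.values()),
        "min_degree": min(degree.values()),
        "isolated": sum(1 for d in degree.values() if d == 0),
        "avg_degree": 2 * edges / n,
        "density": 2 * edges / (n * (n - 1)),
    }
```

The average degree is computed as $2|E|/|V|$, which is consistent with the values in Table 2 (for example, $2 \cdot 622087 / 41514 \approx 29.97$ for MySpace at threshold 3).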

As shown in Table 2, each refined password graph is far from a dense graph. From the aforementioned observation, we can conclude that a complete password graph has the minimum password dataset strength, because successfully cracking one password facilitates cracking all the remaining passwords, whereas an empty password graph has the maximum password dataset strength, because successfully cracking one password provides no further information about the remaining passwords. Note that the exact definition of this information depends on the selection of the threshold; for example, a threshold of 3 implies that a cracked password $p$ reveals information about the passwords whose LD distance from $p$ is at most 3.

Figure 6: Fitting result of MySpace ($R^2 = 0.97$). Horizontal axis: rank of password (logarithmic scale); vertical axis: degree of password. Fitted line: $-121.20 \cdot \log(x) + 531.95$.

Distance. The distance between two vertices $u$ and $v$ of a graph is the minimum length of the paths connecting them. If no such path exists, the distance is set to $+\infty$.

Diameter. The diameter of a graph is the longest shortest path between any two vertices. Because of the isolated vertices, there is always at least one pair of vertices at infinite distance in the password graph. Even if we remove the vertices with degree 0 and calculate again, the diameter of the filtered password graph is still $+\infty$, for the same reason as for connectivity: the existence of isolated clusters. The result is also listed in Table 2. In addition to the diameter, the average path length in the graph is unavoidably $+\infty$ too. Further study could evaluate the diameter and average path length of the top clusters (their results will not be infinite because they are connected); we leave this for future work.
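The distinction between the diameter of the whole graph and the diameters of its connected parts can be sketched with breadth-first search. The adjacency-dictionary representation and the function names below are assumptions made for illustration; connected components are used here as a stand-in for clusters.

```python
from collections import deque


def components(adj):
    """Connected components of an undirected graph given as {vertex: neighbours}."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        comp, queue = [], deque([start])
        while queue:
            u = queue.popleft()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        comps.append(comp)
    return comps


def eccentricity(adj, source):
    """Largest shortest-path distance from source to any vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())


def diameter(adj):
    """Diameter of the whole graph; +inf unless the graph is connected."""
    if len(components(adj)) != 1:
        return float("inf")
    return max(eccentricity(adj, v) for v in adj)


def component_diameters(adj):
    """(size, diameter) of every connected component, largest component first."""
    result = []
    for comp in components(adj):
        result.append((len(comp), max(eccentricity(adj, v) for v in comp)))
    return sorted(result, reverse=True)
```

A single isolated vertex is enough to make `diameter` return infinity, whereas `component_diameters` always yields finite values, which is the quantity the text proposes to study for the top clusters.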

One appealing outcome of visualizing our password graph, as can be seen intuitively from Figure 2, is that password vertices evidently tend to gather together in clusters, a phenomenon similar to that observed in social networks. Although users choose or create passwords independently of one another, in the mass they tend to choose similar passwords. There is a minority of top clusters and a vast majority of nontop clusters, and the properties of the top clusters are of particular significance to us. Cluster structure, in the context of networks, refers to the occurrence of groups of nodes in a network that are more densely connected internally than with the rest of the network. Note the distinction between a cluster and a clique in graph theory: a clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent, that is, its induced subgraph is complete; a cluster does not require this completeness property.


Figure 7: Fitting and visualization results of Rootkit, PhpBB, and Yahoo. (a) Visualization result of Rootkit. (b) Visualization result of PhpBB. (c) Visualization result of Yahoo. (d) Fitting result of Rootkit ($R^2 = 0.93$), fitted line $-212.22 \cdot \log(x) + 935.31$. (e) Fitting result of PhpBB ($R^2 = 0.90$), fitted line $-830.95 \cdot \log(x) + 4135.65$. (f) Fitting result of Yahoo ($R^2 = 0.92$), fitted line $-444.70 \cdot \log(x) + 2315.86$. In panels (d) to (f), the horizontal axis is the rank of password (logarithmic scale) and the vertical axis is the degree of password.

In some cases, one vertex may belong to more than one cluster. This might happen in a social network where each vertex represents a person and the clusters represent different groups of people: one cluster for family, another for coworkers, one for friends in the same sports club, and so on. In our password graph, however, we postulate that a vertex can be part of only one cluster, provided that it belongs to some cluster. Clusters in the password graph are important because of their metaproperties, that is, the characteristics of the clusters themselves. We propose the idea that a cluster in the password graph acts like a metanode in the graph: intercluster properties differ more significantly than internode properties.

Top password lists [39, 40] are generally used to constitute banned lists that prevent users from choosing common passwords and circumventing security policies. Top password structures [9] are specifically used to make password cracking more efficient. We propose, for the first time, the concept of top clusters to replace top password lists and top password structures outright, because top clusters are a more accurate and precise description of weak passwords in the context of a password dataset: not only are the passwords that occur in the top password list, or whose structures appear among the top password structures, weak, but so are all the passwords belonging to the same cluster. Table 4 lists a segment of four clusters in Rootkit. Clusters 1, 2, and 4 are common clusters that can be found in other password graphs too; cluster 3 appears only in Rootkit and is a site-specific cluster. They are all weak passwords and should be added to the banned list of an Internet service provider. Combining Table 4 with Figures 2, 3, and 7, we conclude that blocking top clusters enhances the ability of a password dataset to resist trawling attacks.

Table 4: A segment of four clusters in Rootkit.

| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|
| password | 123456 | rootkit | qwerty |
| passwordx | 123456x | irootkit | qwerty |
| password0 | 1234569 | 1rootkit | qwertyu |
| Password | 12345r | rootkit2 | qwertyz |
| Password | 123456 | rootkit3 | qwerty |
| password0 | 12345s | rootkit | qwertyr |
| passw0rk | 1234566 | rootkit | qwerta |
| password | 1234565 | 4rootkit | qwerty7 |
| p4s5w0rd | 1234569 | rrootkit | qwerty9 |
| pssword | 1234561 | rootKit | qwerty111 |
| Password | 12345678 | RootKit | qwerty12 |

One of the many practical applications of top clusters is the building of effective mangling rules; these rules are slightly different from those used by hashcat [24] and JTR [23]. To obtain effective mangling rules from a dataset of passwords, one can compare every two passwords in the same top cluster. The derived rules are exactly the operations allowed in the Levenshtein distance: insertion, deletion, and substitution. Substitution is a superset of the Leet operation; insertion is a superset of prepending and appending. Our visualization method could also be used for a password strength meter: after a user finishes entering his/her new password, we can evaluate whether it belongs to one of the existing top clusters; if it does, we reject the password and ask for a new one; otherwise we accept it.
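The pairwise comparison described above can be sketched as follows: for every two passwords in a cluster, one minimal sequence of Levenshtein operations is recovered by backtracking through the dynamic-programming table, and the operations are then counted. The function names and the tuple format of the operations are hypothetical choices made for this illustration; when several minimal edit scripts exist, the sketch returns only one of them.

```python
from collections import Counter
from itertools import combinations


def edit_script(src, dst):
    """One minimal list of (operation, position, old_char, new_char) from src to dst."""
    m, n = len(src), len(dst)
    # dist[i][j] is the edit distance between src[:i] and dst[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == dst[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    # Walk back from the bottom-right corner, recording the edits taken.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and src[i - 1] == dst[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1                      # characters match
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("substitute", i - 1, src[i - 1], dst[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ops.append(("insert", i, "", dst[j - 1]))
            j -= 1
        else:
            ops.append(("delete", i - 1, src[i - 1], ""))
            i -= 1
    ops.reverse()
    return ops


def rule_counts(cluster):
    """Count position-independent edit operations over all pairs in a cluster."""
    counts = Counter()
    for a, b in combinations(cluster, 2):
        for op, _pos, old, new in edit_script(a, b):
            counts[(op, old, new)] += 1
    return counts
```

The most frequent entries of `rule_counts` for a top cluster are candidate mangling rules; a substitution such as "o" to "0" corresponds to a Leet rule, and an insertion at position 0 or at the end corresponds to prepending or appending.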

7. Conclusions

In this paper, we have explored three possible relationships between passwords (modification-based, similarity-based, and probability-based) to aid the construction of a password graph. On the basis of graph theory and previous surveys, regarding passwords as vertices, we select the modification-based relationship and fix a threshold value of 3 to define the edges in the graph. Our password graph enables the introduction of concepts from graph theory, including degree, density, distance, diameter, and cluster.

The password graph has some novel built-in implications for password research; we use them as an alternative analysis method for password datasets. We find, surprisingly, that for threshold 3 a password can be similar to as many as 3380 passwords in the PhpBB dataset, and that passwords evidently tend to gather together in clusters, just as vertices do in a social network. Moreover, by applying linear regression, we discover that the degrees of the password graph follow a logarithmic distribution. Our findings suggest that the top password list used in adaptive password checking should be replaced with top clusters to reach a higher level of security. Furthermore, the password graph could enable the creation of effective mangling rules that apply universally in dictionary attacks, which requires only a large training dataset and no a priori knowledge of user habits or personal information. The password graph can also be used for a new kind of password strength meter.

Notations

$V$: Vertex set
$E$: Edge set
$\Delta(G)$: $\max\{d(p) \mid p \in V\}$, the maximum degree of all vertices
$\delta(G)$: $\min\{d(p) \mid p \in V\}$, the minimum degree of all vertices
$Z(G)$: $|\{p \in V \mid d(p) = 0\}|$, the number of vertices whose degree is 0
$D(G)$: $2|E| \div (|V|(|V| - 1))$, density of the password graph
$K(G)$: Vertex connectivity
$\lambda(G)$: Edge connectivity

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors thank Dr. Ding Wang for his help and support. This research was supported by the National Key R&D Program of China under no. 2016YFB0800603 and no. 2017YFB1200700 and the National Natural Science Foundation of China (NSFC) under Grant no. 61472016.

References

[1] Z. Zhao, G.-J. Ahn, and H. Hu, “Picture gesture authentication: empirical analysis, automated attacks, and scheme evaluation,” ACM Transactions on Information and System Security, vol. 17, no. 4, article 14, 37 pages, 2015.

[2] A. K. Jain, A. Ross, and S. Pankanti, “Biometrics: a tool for information security,” IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 125–143, 2006.

[3] N. Zheng, A. Paloski, and H. Wang, “An efficient user verification system via mouse movements,” in Proceedings of the 18th ACM Conference on Computer and Communications Security, pp. 139–150, ACM, October 2011.

[4] D. Florencio and C. Herley, “A large-scale study of web password habits,” in Proceedings of the 16th International World Wide Web Conference (WWW ’07), pp. 657–666, ACM, May 2007.

[5] J. Yan, A. F. Blackwell, R. J. Anderson, and A. Grant, “Password memorability and security: empirical results,” IEEE Security & Privacy, vol. 2, no. 5, pp. 25–31, 2004.

[6] W. Cheswick, “Rethinking passwords,” Communications of the ACM, vol. 56, no. 2, pp. 40–44, 2013.

[7] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, “Passwords and the evolution of imperfect authentication,” Communications of the ACM, vol. 58, no. 7, pp. 78–87, 2015.

[8] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, “The quest to replace passwords: a framework for comparative evaluation of web authentication schemes,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP ’12), pp. 553–567, IEEE, San Francisco, Calif, USA, May 2012.

[9] D. Wang, H. Cheng, P. Wang, and X. Huang, “Understanding passwords of Chinese users: characteristics, security and implications,” CACR report, in Proceedings of the ChinaCrypt 2015, 2015, http://t.cn/RG8RacH.

[10] A. Adams and M. A. Sasse, “Users are not the enemy,” Communications of the ACM, vol. 42, no. 12, pp. 40–46, 1999.

[11] A. Beautement, M. A. Sasse, and M. Wonham, “The compliance budget: managing security behaviour in organisations,” in Proceedings of the Workshop on New Security Paradigms (NSPW ’08), pp. 47–58, ACM, Lake Tahoe, Calif, USA, September 2008.

[12] D. Wang, D. He, H. Cheng, and P. Wang, “FuzzyPSM: a new password strength meter using fuzzy probabilistic context-free grammars,” in Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’16), pp. 595–606, IEEE, July 2016.

[13] L. Wang, Y. Li, and K. Sun, “Amnesia: a bilateral generative password manager,” in Proceedings of the 36th IEEE International Conference on Distributed Computing Systems (ICDCS ’16), pp. 313–322, IEEE, June 2016.

[14] D. McCarney, D. Barrera, J. Clark, S. Chiasson, and P. C. Van Oorschot, “Tapas: design, implementation, and usability evaluation of a password manager,” in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 89–98, ACM, December 2012.


[15] B. Ives, K. R. Walsh, and H. Schneider, “The domino effect of password reuse,” Communications of the ACM, vol. 47, no. 4, pp. 75–78, 2004.

[16] D. Wang, H. Cheng, P. Wang, X. Huang, and G. Jian, “Zipf’s law in passwords,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 11, pp. 2776–2791, 2017.

[17] M. Weir, S. Aggarwal, B. De Medeiros, and B. Glodek, “Password cracking using probabilistic context-free grammars,” in Proceedings of the 30th IEEE Symposium on Security and Privacy, pp. 391–405, IEEE, May 2009.

[18] S. Houshmand and S. Aggarwal, “Building better passwords using probabilistic techniques,” in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 109–118, ACM, December 2012.

[19] R. Veras, C. Collins, and J. Thorpe, “On the semantic patterns of passwords and their security impact,” in Proceedings of the Network and Distributed System Security Symposium (NDSS ’14), San Diego, Calif, USA, 2014.

[20] M. Dell’Amico and M. Filippone, “Monte Carlo strength evaluation: fast and reliable password checking,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 158–169, ACM, October 2015.

[21] R. Morris and K. Thompson, “Password security: a case history,” Communications of the ACM, vol. 22, no. 11, pp. 594–597, 1979.

[22] D. V. Klein, “Foiling the cracker: a survey of, and improvements to, password security,” in Proceedings of the 2nd USENIX Security Workshop, pp. 5–14, 1990.

[23] John the Ripper password cracker, Openwall, http://www.openwall.com/john.

[24] hashcat, advanced password recovery, https://hashcat.net/hashcat.

[25] J. Ma, W. Yang, M. Luo, and N. Li, “A study of probabilistic password models,” in Proceedings of the 35th IEEE Symposium on Security and Privacy (SP ’14), pp. 689–704, IEEE, May 2014.

[26] R. Veras, J. Thorpe, and C. Collins, “Visualizing semantics in passwords: the role of dates,” in Proceedings of the 9th International Symposium on Visualization for Cyber Security, pp. 88–95, ACM, October 2012.

[27] M. Weir, S. Aggarwal, M. Collins, and H. Stern, “Testing metrics for password creation policies by attacking large sets of revealed passwords,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 162–175, ACM, October 2010.

[28] C. Castelluccia, M. Durmuth, and D. Perito, “Adaptive password-strength meters from Markov models,” in Proceedings of the 19th Annual Network & Distributed System Security Symposium (NDSS ’12), February 2012.

[29] J. Bonneau, “The science of guessing: analyzing an anonymized corpus of 70 million passwords,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP ’12), pp. 538–552, IEEE, San Francisco, Calif, USA, May 2012.

[30] Z. Li, W. Han, and W. Xu, “A large-scale empirical analysis of Chinese web passwords,” in Proceedings of the USENIX Security Symposium, pp. 559–574, 2014.

[31] B. L. Riddle, M. S. Miron, and J. A. Semo, “Passwords in use in a university timesharing environment,” Computers & Security, vol. 8, no. 7, pp. 569–579, 1989.

[32] R. Shay, S. Komanduri, P. G. Kelley et al., “Encountering stronger password requirements: user attitudes and behaviors,” in Proceedings of the 6th Symposium on Usable Privacy and Security, vol. 2, ACM, July 2010.

[33] Y. Zhang, F. Monrose, and M. K. Reiter, “The security of modern password expiration: an algorithmic framework and empirical analysis,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 176–186, October 2010.

[34] X. Guo, H. Chen, X. Liu, X. Xu, and Z. Chen, “The scale-free network of passwords: visualization and estimation of empirical passwords,” https://arxiv.org/abs/1511.08324.

[35] S. Perez, “Recently confirmed Myspace hack could be the largest yet,” May 2016.

[36] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, 1966.

[37] A. Das, J. Bonneau, M. Caesar, N. Borisov, and X. Wang, “The tangled web of password reuse,” in Proceedings of the Network and Distributed System Security Symposium (NDSS ’14), vol. 14, pp. 23–26, 2014.

[38] graph-tool: efficient network analysis with Python, https://graph-tool.skewed.de.

[39] A. Vance, “If your password is 123456, just make it hackme,” The New York Times, vol. 20, p. A1, 2010.

[40] S. Schechter, C. Herley, and M. Mitzenmacher, “Popularity is everything: a new approach to protecting passwords from statistical-guessing attacks,” in Proceedings of the 5th USENIX Conference on Hot Topics in Security, pp. 1–8, USENIX Association, 2010.


Page 3: An Alternative Method for Understanding User-Chosen Passwords


with numerous cracking modes (e.g., brute-force attack and dictionary attack).

Mangling rules in dictionary attack mode continue to evolve beyond heuristic rules. Weir et al. [17] built a machine learning technique based on context-free grammars to automatically derive mangling rules from a large training set of cleartext passwords. Houshmand and Aggarwal [18] derived a Markov-based password cracking algorithm from the Markov chain, a representative application of which is the PageRank algorithm. Originally, the Markov-based algorithm was not a probabilistic model. Ma et al. [25] investigated the length and structure characteristics of passwords in 6 datasets, 3 of which were from Chinese websites, and improved the Markov-based model by using different normalization and smoothing methods. They found that, when done correctly, the Markov-based cracking model performed better than the PCFG-based password cracking model. In 2012, Veras et al. [26] did work quite similar to ours: they examined the 32M RockYou dataset by employing visualization techniques. They observed that 15.26% of passwords contained sequences of 5–8 consecutive digits, 38% of which could be further classified as dates. Their research, however, mainly focused on password patterns, which differs from ours.

The guessing resistance of a single user-chosen password was previously estimated by entropy, with reference to Claude Shannon's famous measure $H_1$. A variation of Shannon entropy was proposed in the NIST Electronic Authentication Guideline. It calculated the password entropy mainly from the length of the password and added bonus points when certain heuristic checks intended to make the password more secure were passed. Unfortunately, Shannon entropy and its variations characterize the strength of a distribution for an attacker who wants to crack only a certain proportion of all passwords; Shannon entropy has no direct correlation with guessing difficulty. Ad hoc metrics (password strength meters) were demonstrated to be far from accurate by Weir et al. [27], who advocated that cracking-based password strength meters were more compelling. A Markov-based PSM [28] and a PCFG-based PSM [18] were subsequently proposed, based on the Markov model and the PCFG model, respectively. Wang et al. [12] created a novel PSM on the solid foundation that users usually reuse one of their passwords rather than creating a new one. By using two training sets (one as the base dictionary and the other as the rule-learning dictionary), their PSM was able to derive users' mangling rules empirically and was thus more accurate.

In 2012, Bonneau conducted a large-scale analysis of 70 million private Yahoo passwords and proposed guesswork ($G$) as a more direct password strength metric [29]. Yet in 2014, Li et al. [30] concluded that Bonneau's 70M passwords were not representative enough of all users, especially Chinese users who are not familiar with English: Chinese users prefer digits to letters. They also showed that Chinese users are inclined to insert Pinyin and dates into their passwords.

Semantic patterns, including personal information (e.g., birth dates, personal names, and nicknames), are prevalently embedded in user-chosen passwords [31]. Additionally, a few basic data characteristics, such as average length, length distribution, and types of characters used, were typically reported. Structure patterns were also studied by some researchers [9]: passwords containing digits constituted more than 50% of Chinese web passwords, while the figure for their English counterparts was only 11.30%, reinforcing the hypothesis that user-generated passwords are greatly influenced by users' native languages. More systematic methodologies for creating a new password policy were proposed by Shay et al. [32]. One inconspicuous thing that semantic evaluations fail to capture is the relationships between passwords in the same dataset. Therefore, we augment this kind of evaluation, for the first time, by introducing password relationships and graph theory into password evaluation. We hope that this method helps to study passwords more thoroughly.

In 2010, Zhang et al. [33] found that modifications to one's old passwords tend to be predictable, and they utilized this observation to facilitate password cracking. Their work actually concerns password reuse by a single user, independently of other users. Our work focuses on password relationships between different users and classifies them into three categories.

As far as we know, the work by Guo et al. [34] may be the closest to ours. They visualized several password datasets, including Yahoo, PhpBB, MySpace, Honeynet, hotmail, and 12306. Their discovery provides an explanation of the attacking curve that has been observed for decades. We find that one of their conclusions is wrong: degrees of passwords follow a logarithmic law, not a power law as they claimed; we detail this in Section 6. Overall, although the two works are somewhat similar, there are critical differences between them: (a) we explore the relationships between passwords, while they do not; (b) we explore the effects of different thresholds $t$, while they experimented only with threshold 3; (c) the insights we obtain from the visualized graph are essentially distinct from theirs.

3. Preliminaries

In this section, we give formal definitions of graph and linear regression and then introduce the metric for evaluating how well the regression line approximates the real data points. Finally, we provide some basic information on our experimental datasets.

Mathematical Notation. We denote a password dataset by a calligraphic letter $\mathcal{S}$ and the same dataset after duplicate removal by another calligraphic letter $\mathcal{P}$. Let $N$ denote the total number of passwords in $\mathcal{P}$.

3.1. Graph

Formal Definition of Graph. A graph $G$ can be defined as a pair $(V, E)$, where $V$ is a set of vertices and $E$ is a set of edges between the vertices, $E \subseteq \{(u, v) \mid u, v \in V\}$:

$$G = (V, E). \tag{1}$$

Generally, graphs can be classified into two types: (a) an undirected graph is a graph in which the adjacency relation defined by the edges is symmetric; (b) a directed graph is a graph in which edges have orientations. A simple graph is an undirected graph in which both parallel edges and loops are disallowed, whereas a multigraph allows them.

3.2. Linear Regression. In statistics, linear regression is an approach for modeling the correlation between two variables (a scalar dependent variable denoted by $y$ and an explanatory variable denoted by $x$) by fitting a linear equation to the experimental data. The most common method for linear regression is least squares. In linear regression, given the value of the explanatory variable $x$, the value of the dependent variable $y$ is an affine function of $x$: $y = k \cdot x + b$, where $k$ is the slope of the line and $b$ is the intercept.

In linear regression, how well the regression line approximates the real experimental data points is usually measured by the coefficient of determination, denoted by $R^2$. It ranges from 0 to 1; the closer to 1, the better. An $R^2$ value of 1 indicates that all experimental data points lie exactly on the regression line, whereas an $R^2$ of 0 indicates that the line explains none of the variation in the data.
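The least-squares fit and the coefficient of determination described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in our experiments; the function name `fit_line` and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = k * x + b; returns (k, b, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    b = mean_y - k * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return k, b, 1.0 - ss_res / ss_tot


# Hypothetical points lying exactly on y = 2x + 1, so R^2 is 1.
k, b, r2 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

The sketch assumes that the $x$ values are not all equal and that the $y$ values are not all equal; otherwise one of the divisions is undefined.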

3.3. Spearman's Coefficient. In statistics, Spearman's coefficient is a nonparametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function. The Spearman coefficient $\rho$ is defined as the Pearson correlation coefficient between the two ranked vectors $X$, $Y$:

$$\rho = \frac{\operatorname{cov}(rg_X, rg_Y)}{\rho_{rg_X}\,\rho_{rg_Y}}, \tag{2}$$

where $\operatorname{cov}(rg_X, rg_Y)$ is the covariance of the rank variables and $\rho_{rg_X}$ and $\rho_{rg_Y}$ are the standard deviations of the rank variables. By definition $\rho \in [-1, 1]$, where 0 indicates no monotonic association between the vectors. A perfect Spearman correlation of 1 or $-1$ occurs when one vector is a perfectly monotone function of the other.
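To make the definition concrete, the following sketch computes the Spearman coefficient as the Pearson correlation of the rank vectors. It is an illustration only: the helper names are hypothetical, and giving tied values the mean of their ranks is a common convention that we assume here.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman coefficient as the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Because only ranks enter the computation, any strictly increasing relation (for example, $y = 10^x$) gives a coefficient of 1 even though the relation is far from linear.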

3.4. Password Guess Number. The password guess number characterizes the time complexity required for a password cracking algorithm (PCFG-based or Markov-based) to recover a password, measured as the number of guesses required to crack the password. Dell'Amico and Filippone [20] detail a Monte Carlo sampling method that converts the probability of a password pw, as computed by a PCFG or Markov model, into an estimate of the cracker's guess number:

$$G(\mathrm{pw}) = \frac{1}{k}\sum_{i=1}^{k}\frac{\delta(p_i)}{\Pr(p_i)} + 1, \qquad \delta(p_i) = \begin{cases} 1 & \text{if } \Pr(p_i) > \Pr(\mathrm{pw}), \\ 0 & \text{otherwise,} \end{cases} \tag{3}$$

where $k$ is the sample size and $p_i$ is a password drawn randomly from the corresponding distribution.
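The estimator in (3) can be sketched as follows. The three-password distribution `toy_model` is a purely hypothetical stand-in for a PCFG or Markov model, chosen so that the true guess numbers (1, 2, and 3) are known; the function name is likewise an assumption of this sketch.

```python
import random


def estimate_guess_number(pw_prob, sample_probs):
    """Equation (3): 1 + (1/k) * sum of 1/Pr(p_i) over the sampled
    passwords p_i that are strictly more probable than pw."""
    k = len(sample_probs)
    return 1.0 + sum(1.0 / p for p in sample_probs if p > pw_prob) / k


# Hypothetical toy distribution standing in for a trained cracking model.
toy_model = {"123456": 0.5, "password": 0.3, "qwerty": 0.2}

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(list(toy_model), weights=list(toy_model.values()), k=20000)
sample_probs = [toy_model[p] for p in samples]

# "qwerty" is the third most probable password, so the estimate should be near 3.
estimate = estimate_guess_number(toy_model["qwerty"], sample_probs)
```

Each sampled password that is more probable than pw contributes $1/\Pr(p_i)$, so in expectation the sum counts the passwords ranked ahead of pw; this is why a sample much smaller than the password space can still give a usable estimate.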

Table 1: Basic information about our 5 password datasets.

| Dataset | Web service | When leaked | Total PWs |
|---|---|---|---|
| MySpace | Social networking | Oct. 2006 | 49623 |
| PhpBB | Programmer forum | Jan. 2009 | 255373 |
| Rootkit | Hacker forum | Feb. 2011 | 69324 |
| Yahoo | Web portal | July 2012 | 442834 |
| 12306 | Train ticketing | Dec. 2014 | 129303 |

3.5. Datasets. For completeness, we give a brief description of the five datasets used in our experiments (see Table 1). They were breached either by hackers or by intruders and later disclosed publicly on the Internet; some of them have already been used in password cracking models [17, 18].

MySpace was originally published in October 2006; it was obtained by a phishing attack and thus might contain weak as well as fake passwords. Our copy of the MySpace list has 49623 passwords in total, but a recent report said that over 360 million accounts were actually involved. Each record of the MySpace list contained an email address, a password, and in some cases a second password. Because some accounts had multiple passwords, over 427 million passwords in total were available for sale [35]. Rootkit's entire MySQL database backup was released by the Anonymous group using the Twitter account of HBGary's CEO; it initially contained 71228 passwords scrambled with the MD5 hashing algorithm, of which we eventually managed to recover 97.33% through trawling attacks. In 2012, nearly 443K email addresses and passwords for a Yahoo site were exposed after a server breach: hackers got into Yahoo's Contributor Network database by using a rudimentary attack called SQL injection. The PhpBB dataset includes about 250K passwords leaked from Phpbb.com in January 2009; the hacker even described the whole attack in detail on his blog. At the end of 2014, a password dataset from a Chinese train ticketing website, 12306, was leaked to the public by anonymous attackers; it includes nearly 130K passwords. User information registered on the ticket booking website included real names, ID card numbers, and phone and email contacts, and may also include information about family members and friends, posing a serious threat to their information privacy.

4. Password Relationships

We divide our findings on password relationships into three categories: modification-based, similarity-based, and probability-based. These three categories are detailed in the following three subsections.

4.1. Modification-Based. The relationship between passwords is, to some extent, tied to users' modifications of passwords, so it can be modeled well by edit distance, which is defined over a set of operations on passwords. Edit distance quantifies how dissimilar two passwords are by counting the minimum number of operations required to transform one password into the other. However, there are various definitions of edit distance, featuring different sets of editing operations. Levenshtein distance [36] (hereafter LD) allows three operations: removal, insertion, and substitution of a character in the password. The operations permitted by the longest common subsequence distance are a subset of those of LD: insertion and deletion. Hamming distance applies only to passwords of the same length; that is, it does not allow insertion or removal. Additional primitive operations such as transposition are used by Jaro-Winkler distance, based on the observation that a common typing mistake is the transposition of two adjacent characters in a string.

The LD between two passwords is formally defined as the minimum number of single-character edits one must perform to change one password into the other. Mathematically, the LD between two passwords $a$ and $b$ is given by $\mathrm{lev}_{a,b}(|a|, |b|)$, where

$$\mathrm{lev}_{a,b}(i, j) = \min \begin{cases} \mathrm{lev}_{a,b}(i-1, j) + 1, \\ \mathrm{lev}_{a,b}(i, j-1) + 1, \\ \mathrm{lev}_{a,b}(i-1, j-1) + f(a_i, b_j), \end{cases} \tag{4}$$

$|a|$ and $|b|$ denote the lengths of passwords $a$ and $b$, respectively, and $f(a_i, b_j)$ is an indicator function that is equal to 0 when $a_i = b_j$ and equal to 1 otherwise. The recursion bottoms out at $\mathrm{lev}_{a,b}(i, 0) = i$ and $\mathrm{lev}_{a,b}(0, j) = j$. LD is computed by dynamic programming; the time complexity of the algorithm is $O(|a| \cdot |b|)$.
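The recurrence in (4) translates directly into a dynamic program. The sketch below keeps only two rows of the table at a time; it is a standard textbook formulation, offered as an illustration rather than as the code used in our experiments.

```python
def levenshtein(a, b):
    """Levenshtein distance between strings a and b in O(|a| * |b|) time."""
    # prev[j] holds lev(i - 1, j); the row for i = 0 is 0, 1, ..., |b|.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # lev(i, 0) = i: delete the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1  # the indicator f(a_i, b_j)
            curr.append(min(prev[j] + 1,          # removal
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

For example, `levenshtein("password", "password1")` is 1 (one insertion) and `levenshtein("kitten", "sitting")` is 3.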

4.2. Similarity-Based. Passwords are essentially strings, so measures of string similarity give us substantial methods for evaluating the similarity (relationship) between passwords. Dice's coefficient is a statistic used for comparing the similarity between two sets, and it can be used to evaluate the similarity between two passwords. Assume that $X$ and $Y$ are the sets of (unique) characters that passwords $a$ and $b$ contain, respectively. Dice's coefficient of $a$ and $b$ is defined as follows:

$$QS = \frac{2\,|X \cap Y|}{|X| + |Y|}, \tag{5}$$

where $|X|$ and $|Y|$ are the numbers of elements of the two sets. $QS$ is the quotient of similarity and ranges between 0 and 1.

The Jaccard index measures similarity between two finite sets; it is defined as the size of the intersection divided by the size of the union of the sets $X$ and $Y$ ($X$ and $Y$ have the same meaning as described above):

$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|}. \tag{6}$$

If $X$ and $Y$ are both empty, define $J(X, Y) = 1$. Thus $J(X, Y)$ is also a quotient of similarity and ranges between 0 and 1.
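Both set-based measures are short enough to sketch directly from (5) and (6). The function names are hypothetical, and returning 1 for Dice's coefficient of two empty passwords is our assumption, chosen to mirror the convention stated for the Jaccard index.

```python
def dice(a, b):
    """Dice's coefficient over the sets of unique characters of a and b."""
    x, y = set(a), set(b)
    if not x and not y:
        return 1.0  # assumption: same convention as the Jaccard index
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(a, b):
    """Jaccard index over the sets of unique characters of a and b."""
    x, y = set(a), set(b)
    if not x and not y:
        return 1.0  # J(X, Y) = 1 when both sets are empty
    return len(x & y) / len(x | y)
```

For "password" and "passw0rd" the two character sets have 7 elements each and share 6, giving a Dice coefficient of 12/14 and a Jaccard index of 6/8. Because character order is ignored, two anagrams such as "abc" and "cba" score 1 under both measures.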

4.3. Probability-Based. From a PCFG-based or Markov-based password cracking model, we can assign a probability to every password in a dataset. In principle, two passwords are similar in a sense (they are equally likely to be picked by a human user) if their corresponding probabilities are close enough; the closer, the more similar. One workable measure of similarity between passwords $a$ and $b$ may be

$$S(a, b) = \frac{(\Pr(a) - \Pr(b))^2}{(\Pr(a) + \Pr(b))^2}. \tag{7}$$

When $a$ and $b$ have the same probability, $S(a, b)$ equals 0. The other extreme occurs when there is only one password in the distribution: the probability of that password is 1 and the probability of every other password is 0, so $S(a, b)$ (where $\Pr(a) = 1$, $\Pr(b) = 0$) equals 1. Smaller values of $S$ therefore indicate more similar passwords.
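As a quick worked instance of this measure, with purely illustrative probabilities (not taken from any model or dataset in this paper), suppose $\Pr(a) = 0.03$ and $\Pr(b) = 0.01$:

$$S(a, b) = \frac{(0.03 - 0.01)^2}{(0.03 + 0.01)^2} = \frac{0.0004}{0.0016} = 0.25.$$

The measure depends only on the ratio of the two probabilities, so scaling both by the same factor leaves $S$ unchanged.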

In summary, the relationships between passwords described above are only a fraction of the countless relationships that exist. Note, for example, that the edit distance between passwords is not a measure of similarity but a number of edits. One can, of course, divide it by the length of the longer of the two passwords, which turns LD into a relative LD:

$$\mathrm{lev}^{\mathrm{rel}}_{a,b}(|a|, |b|) = \frac{\mathrm{lev}_{a,b}(|a|, |b|)}{\max(|a|, |b|)}. \tag{8}$$

The value of the relative LD of any two passwords falls into the interval $[0, 1]$, in which the left boundary 0 means exactly the same and the right boundary 1 means altogether different.
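For instance, take two illustrative pairs: "password" and "passw0rd" (one substitution, both of length 8), and "123456" and "qwerty" (no character in common, both of length 6, so every position must be substituted):

$$\frac{1}{\max(8, 8)} = 0.125, \qquad \frac{6}{\max(6, 6)} = 1.$$

The first pair is close to the "exactly the same" end of the interval, while the second sits at the "altogether different" end.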

4.4. Rationale. The most practical and accurate measure of password relationships depends on the particular application scenario. There is no single best method for generating a password graph, only the most appropriate one for the concrete scenario. For example, if we want to study the modification habits of users, we may choose the modification-based relationship, but if we only want to learn the password cracking property, we may choose the probability-based relationship. In the rest of this work, we use the modification-based relationship between passwords (the other relationships explored above are still worth deeper research, and we leave them for future work). Prior surveys have already provided plenty of information on how users modify their passwords: adding digits or symbols at the beginning or end of a password, capitalizing a letter, Leet transformation, and so forth [9, 12, 32, 37]. Taking the permitted operations of each type of edit distance into consideration, we infer that LD is currently the most appropriate measure of password relationship among all types of edit distance, since it models real-world users' mangling behaviors (insertion, deletion, and substitution are popular, while transposition is not). In Section 5, we use LD to formally define the edges in a graph.

5. Construction of Password Graph

Generally speaking, any entity that can be described by objects with relationships between them can be regarded as a graph. Since our principal purpose is to characterize the relationship between passwords, we apply this notion by (1) regarding each unique password in the dataset as a vertex, $V = \{p \mid p \in \mathcal{P}\}$, and (2) forming an edge between two passwords whose edit distance is not more than a threshold $t$, $E = \{(u, v) \mid \mathrm{lev}_{u,v}(|u|, |v|) \le t \text{ and } u, v \in V\}$. We call the graph obtained from the password dataset the password graph $G$. This is a very intuitive method for turning a password dataset into a graph, though other approaches might fulfill the same goal (e.g., one can consider unique printable ASCII characters as vertices and connect the characters that form a password, which seems reasonable, but its practicability and efficiency remain unknown).

Figure 1: Construction phases. [Diagram: Phase 1 deduplicates the raw list password, password, 123456, 123456, 1234567, password1 into password, 123456, 1234567, password1; Phase 2 turns the deduplicated list into the graph $G$.]

5.1. Construction Procedures. We describe the construction procedures of the password graph in more detail in this subsection. The procedures are organized into two main phases, illustrated in Figure 1.

Phase 1. Starting with the raw password dataset $\mathcal{S}$, we measure the frequency of every unique password and associate the frequency values with the corresponding passwords. Subsequently, we remove the duplicated passwords to obtain the password dataset $\mathcal{P}$, so that any two passwords picked from $\mathcal{P}$ are different. Every password in $\mathcal{P}$ stands for a vertex in graph $G$; in other words, the vertex set $V$ of $G$ is generated after this phase.

Phase 2. The LD between every two passwords in the dataset $\mathcal{P}$ is calculated. An edge is formed if the LD between its end vertices is not more than the threshold $t$. We provide experiment results for 3 thresholds in this paper (the selection of the threshold $t$ is an interesting open question that needs further research). The edge set $E$ of our graph $G$ is generated after this phase. It is worth noting that the maximum possible number of edges is $N(N-1)/2$, where $N$ is the size of the set $\mathcal{P}$; in that case the resulting password graph is a complete graph. The other extreme occurs when there are no edges at all in $G$, which means that the resulting password graph is an empty graph.
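The two phases can be sketched as follows, using the small example list from Figure 1. This is a minimal illustration under our own naming (`build_password_graph` is hypothetical); it compares all $N(N-1)/2$ pairs directly, which is only practical for small inputs.

```python
from collections import Counter
from itertools import combinations


def levenshtein(a, b):
    """Levenshtein distance by dynamic programming (two rows at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_password_graph(raw_passwords, t):
    """Return (vertices, edges, frequencies) for threshold t."""
    # Phase 1: count frequencies, then keep each unique password once.
    freq = Counter(raw_passwords)
    vertices = list(freq)  # unique passwords in order of first appearance
    # Phase 2: connect every pair whose LD is not more than t.
    edges = {(u, v) for u, v in combinations(vertices, 2)
             if levenshtein(u, v) <= t}
    return vertices, edges, freq


raw = ["password", "password", "123456", "123456", "1234567", "password1"]
vertices, edges, freq = build_password_graph(raw, t=1)
```

With this input, the vertex set has four passwords and the edge set contains ("password", "password1") and ("123456", "1234567"); each pair differs by a single insertion.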

To observe the constructed password graph intuitively, we visualized it with the Python module Graph-tool [38]. Graph-tool is an efficient Python module for manipulation and statistical analysis of graphs (a.k.a. networks). We export the resulting graph to a file format supported by Graph-tool (GraphML, without isolated vertices) and visualize it; with some deliberate tuning of the parameters, this yields an intricate image of the graph.
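The export step might look like the sketch below, which writes an edge list as GraphML using only the Python standard library. The function name `to_graphml` and the attribute key `pw` are hypothetical choices of this sketch, not part of Graph-tool; since only vertices that appear in some edge are written, isolated vertices are omitted automatically.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(edges):
    """Serialise an undirected edge list as a GraphML string."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare a string attribute on nodes that stores the password text.
    ET.SubElement(root, "key", {"id": "pw", "for": "node",
                                "attr.name": "password",
                                "attr.type": "string"})
    graph = ET.SubElement(root, "graph", id="G", edgedefault="undirected")
    ids = {}
    # First pass: one <node> per password that occurs in at least one edge.
    for u, v in edges:
        for p in (u, v):
            if p not in ids:
                ids[p] = f"n{len(ids)}"
                node = ET.SubElement(graph, "node", id=ids[p])
                ET.SubElement(node, "data", key="pw").text = p
    # Second pass: one <edge> per pair.
    for u, v in edges:
        ET.SubElement(graph, "edge", source=ids[u], target=ids[v])
    return ET.tostring(root, encoding="unicode")


xml_text = to_graphml([("password", "password1"), ("123456", "1234567")])
```

A file with this content should then be readable by tools that accept GraphML; in Graph-tool we would expect to load it with `load_graph` and render it with `graph_draw`, although the layout parameters used for our figures are not reproduced here.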

Figure 2: Visualization result of 12306.

The visualization result of 12306 is shown in Figure 2 and that of MySpace in Figure 3; the visualization results of the remaining datasets are collected in Figure 7. The internal structure and characteristics of the password dataset $\mathcal{P}$ are, to some extent, represented by its visualization, and some qualitative properties can be read from these images: a vertex in the graph $G$ may have dense connections with other vertices; password vertices evidently tend to gather together in clusters; 12306 has two apparently big clusters, while MySpace has a great many small clusters; and vertices spreading over the entire canvas reflect that passwords spread over the whole possible password space. A more in-depth quantitative analysis, showing more inner properties of the password dataset, is performed in Section 6.

Admittedly, we leave the frequencies of passwords in the password dataset unused. Our final password graph is a simple undirected graph without any loop or parallel edge. One feasible way to exploit password frequency is to attach one loop edge to the vertex of a password for each repetition of it. Alternatively, password frequency could be used to establish more than one edge between two vertices, which is called a parallel edge in graph theory. Either operation would turn our simple password graph into a multigraph. Such an augmented construction would probably embed more original and subtle properties into the final password graph but would also increase time and space complexity; we leave these feasible extensions for further study.

Figure 3: Visualization result of MySpace.

6. Graph Concepts in Password Graph

Interestingly, the transformation from password dataset to password graph allows concepts from graph theory to be carried over to the password graph. In this section, we revisit the terminology of graph theory. Extending these concepts to our password graph helps to analyze it more concretely and quantitatively.

Degree. The degree of a vertex $v$, written as $d(v)$, is the number of edges with $v$ as an end vertex. Directed graphs distinguish in-degree from out-degree, but since our password graph is a simple undirected graph, there is no need to differentiate them. In the password graph, the degree of a password indicates how many unique passwords in the password dataset $\mathcal{P}$ are similar to it; we denote the degree of password $p$ as $d(p)$.

Statistics of degree for our experimental password datasets are summarized in Table 2. The notations used in the table are defined in Notations; we also explain them in the main body of the paper. For each dataset, the table shows experiment results for 3 thresholds $t$ (from 1 to 3). One can easily notice that the column titled $\delta(G)$ is filled with zeroes; that is, the minimum degree in every password dataset is 0, meaning that every password dataset contains isolated vertices. The number of passwords whose degree is zero decreases as the threshold increases. Some studies have attempted to determine the number of users who appear to choose passwords that are random or meaningless to human beings. We think, on the other hand, that the number of isolated password vertices (passwords with no other passwords similar to them) is a more meaningful figure for the same purpose and is easier to evaluate. Table 3 lists a segment of the passwords whose degree is zero in 12306 under threshold 3. Although some of them share a common substring, they are not actually similar to each other, because the number of edits between them exceeds the threshold, so they should be considered individually; the frequencies of these passwords are typically 1.

Empirically, a threshold value of 3 is reasonable for analyzing the experiment results, so the following analyses are all based on threshold 3; one reaches the same conclusions when the threshold is 1, 2, or another viable value. Many people might expect that a bigger set $\mathcal{P}$ definitely yields more edges. In our experiment results, however, PhpBB has the largest number of edges of all password graphs, while Yahoo has the largest number of vertices. Therefore, more password vertices do not necessarily mean more edges. To our surprise, a password can be similar to as many as 3380 passwords (in the PhpBB dataset). The maximum degree increases rapidly with the threshold: for example, with threshold 1 PhpBB's maximum degree is only 65, but it explodes to 3380 when the threshold is 3. More strikingly, a password in PhpBB is similar to around 150 passwords on average; even the smallest average number of similar passwords among the 5 datasets is nearly 30.
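The degree statistics reported in Table 2 can be derived from a vertex list and an edge list as in the sketch below. The function and key names are hypothetical, and the four-password graph is a toy input rather than one of our datasets.

```python
def degree_statistics(vertices, edges):
    """Maximum, minimum, and average degree plus the isolated-vertex count."""
    degree = {v: 0 for v in vertices}
    for u, v in edges:
        # Each undirected edge adds one to the degree of both end vertices.
        degree[u] += 1
        degree[v] += 1
    values = list(degree.values())
    return {
        "max_degree": max(values),                      # Delta(G)
        "min_degree": min(values),                      # delta(G)
        "isolated": sum(1 for d in values if d == 0),   # Z(G)
        "avg_degree": 2 * len(edges) / len(vertices),
    }


stats = degree_statistics(
    ["password", "password1", "Password", "x9$Lq"],
    [("password", "password1"), ("password", "Password")],
)
```

Here "password" has degree 2, "x9$Lq" is isolated, and the average degree is $2 \cdot 2 / 4 = 1$. The same identity, average degree $= 2|E|/|V|$, links the columns of Table 2.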

In principle, for a specific vertex $v$, the possible maximum degree $\phi(t)$ is fixed once the threshold $t$ is determined:

$$\phi(t) = |S_{\mathrm{dis}1}(t)| + |S_{\mathrm{dis}2}(t)| + |S_{\mathrm{dis}3}(t)|, \tag{9}$$

where $S_{\mathrm{dis}1}(t)$, $S_{\mathrm{dis}2}(t)$, and $S_{\mathrm{dis}3}(t)$ are the sets containing the passwords whose LD from $v$ is 1, 2, and 3, respectively, under threshold $t$. The results of our experiment therefore show only a lower bound on the number of similar passwords (the threshold we apply is smaller than the actual or typical one); the real situation may be worse than we imagine.

The degree distribution of the password graph is another thought-provoking characteristic, like the length distribution in password datasets. Our study shows that the degree of a password, plotted against its rank when passwords are sorted in descending order of frequency, does not follow a simple distribution. Some passwords that occur more frequently have fewer passwords similar to them, and vice versa. This is especially evident in 12306: the password "1qaz2wsx" occurs 80 times, but its degree is only 6, whereas the password "h123456" has 962 passwords similar to it, but its frequency is only 4. Hence the relation between the frequency of a password and its degree in the password graph fluctuates strongly (there would be many breaks in the distribution plot). After analyzing our five password datasets, it turns out that the degree distribution of a password graph can be approximated by the following equation:

$$\mathrm{PGDD}(r) = k \cdot \log(r) + b, \tag{10}$$

where $\mathrm{PGDD}(r)$ is the $r$th largest degree and the base of the log is 10.


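Table 2: Statistics of experiment results.

| Password dataset | $\lvert V \rvert$ | $t$ | $\lvert E \rvert$ | $\Delta(G)$ | $\delta(G)$ | $Z(G)$ | Avg. degree | $D(G)$ ($\times 10^{-5}$) | $K(G)$ | $d(G)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| MySpace | 41514 | 1 | 20119 | 58 | 0 | 28119 | 0.97 | 2.33 | 0 | $+\infty$ |
| | | 2 | 111884 | 141 | 0 | 16748 | 5.39 | 12.98 | 0 | $+\infty$ |
| | | 3 | 622087 | 477 | 0 | 7290 | 29.97 | 72.19 | 0 | $+\infty$ |
| PhpBB | 184341 | 1 | 81135 | 65 | 0 | 130767 | 0.88 | 0.48 | 0 | $+\infty$ |
| | | 2 | 1204816 | 460 | 0 | 70187 | 13.07 | 7.09 | 0 | $+\infty$ |
| | | 3 | **13836682** | **3380** | 0 | 29453 | **150.12** | **81.44** | 0 | $+\infty$ |
| Rootkit | 56859 | 1 | 9281 | 41 | 0 | 47093 | 0.33 | 0.57 | 0 | $+\infty$ |
| | | 2 | 105275 | 133 | 0 | 30564 | 3.70 | 6.51 | 0 | $+\infty$ |
| | | 3 | 1062502 | 953 | 0 | 15792 | 37.37 | 65.73 | 0 | $+\infty$ |
| Yahoo | **342510** | 1 | 140092 | 66 | 0 | 247445 | 0.84 | 0.24 | 0 | $+\infty$ |
| | | 2 | 1394200 | 327 | 0 | 147126 | 8.62 | 2.38 | 0 | $+\infty$ |
| | | 3 | 13686843 | 2127 | 0 | **77122** | 79.92 | 23.33 | 0 | $+\infty$ |
| 12306 | 117808 | 1 | 51299 | 63 | 0 | 95057 | 0.87 | 0.74 | 0 | $+\infty$ |
| | | 2 | 676011 | 506 | 0 | 56162 | 11.48 | 9.74 | 0 | $+\infty$ |
| | | 3 | 5311460 | 2301 | 0 | 20640 | 90.17 | 76.54 | 0 | $+\infty$ |

Note. (1) Bold figures are the maximum value of each category for threshold 3. (2) The definitions of the notations are summarized in Notations.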

Table 3: A segment of passwords with 0 degrees in 12306.

| Number | Password | Frequency |
|---|---|---|
| 1125 | 04768130093 | 1 |
| 1128 | 048398531 | 1 |
| 1138 | 0501154315 | 1 |
| 1139 | 0501170228 | 1 |
| 36909 | 0155402zj | 1 |
| 69622 | 013307aini | 1 |
| 120788 | 015210755s | 1 |

As explained in Section 3, we employ least-squares fitting in linear regression to approximate the degree distribution of the password graph. The fitting results for the 12306 and MySpace password datasets are shown in Figures 5 and 6. The horizontal axis is the rank, and the vertical axis is the degree of the password. The other three fitting results are collected in Figure 7. The values of $k$, $b$, and the coefficient of determination of each regression are displayed in the respective plots. Every linear regression (except that for 12306) has an $R^2$ of at least 0.90, which is close to 1 and thus indicates a remarkably sound fit. For the 12306 dataset, $R^2$ is about 0.84, which is not ideal but acceptable. A plausible reason may be that the 12306 dataset was collected by trying usernames and passwords from other leaked datasets online and probably cannot represent the real distribution of the entire password dataset of 12306. Guo et al. [34] plot the same picture as ours, in which the relation between log(rank) and degree is linear; this indicates that the degree distribution follows a logarithmic law, not a power law. Under a power law, the relation between log(rank) and log(degree) would be linear.
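The distinction between the two laws can be checked numerically by comparing the $R^2$ of two straight-line fits: degree against log(rank) for the logarithmic law, and log(degree) against log(rank) for the power law. The sketch below is an illustration with hypothetical function names; dropping zero-degree vertices is our assumption, made because the logarithm of 0 is undefined.

```python
import math


def r_squared(xs, ys):
    """R^2 of the least-squares line through (xs, ys), i.e. squared Pearson r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def compare_laws(degrees):
    """R^2 of the logarithmic-law fit and of the power-law fit."""
    # Assumption: isolated vertices (degree 0) are left out of both fits.
    ranked = sorted((d for d in degrees if d > 0), reverse=True)
    log_rank = [math.log10(r) for r in range(1, len(ranked) + 1)]
    log_degree = [math.log10(d) for d in ranked]
    return {
        "logarithmic": r_squared(log_rank, ranked),    # degree vs log(rank)
        "power": r_squared(log_rank, log_degree),      # log(degree) vs log(rank)
    }


# Synthetic degrees that follow a logarithmic law exactly.
synthetic = [1000 - 300 * math.log10(r) for r in range(1, 1001)]
result = compare_laws(synthetic)
```

On the synthetic input the logarithmic fit is essentially perfect while the power-law fit is visibly worse; on real degree sequences the two values would simply indicate which description fits better.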

To relate password degree to crackability, we calculate each password's degree and its corresponding guess number based on [20] with a sample size of 100000. We then plot the average guess number of passwords against their degree in Figure 4. We can observe that, overall, the average guess number increases as the degree decreases and decreases as the degree increases. We also evaluate the Spearman coefficient for each dataset: $-0.94$ for Rootkit, $-0.88$ for MySpace, $-0.95$ for Yahoo, $-0.78$ for 12306, and $-0.93$ for PhpBB. The Spearman coefficients further confirm our deduction: they are $-0.88$ or lower for all datasets except 12306 (probably because its password data are incomplete). Thus there is a strong relation between the degree of a password and its guess number: overall, the bigger the degree of a password, the easier it is to crack.

Connectivity. The connectivity of a graph is the minimum number of elements that need to be removed to disconnect the remaining nodes from each other. The elements are either vertices or edges, so there are vertex connectivity $K(G)$ and edge connectivity $\lambda(G)$. However, neither the vertex connectivity nor the edge connectivity of any of our password graphs is more than 0, because of the isolated vertices in the password graph. In other words, the password graph is never connected. We list only the value of vertex connectivity in Table 2. We can further remove the isolated vertices and evaluate the connectivity of the filtered graph, but there are still isolated clusters in the graph, so both the edge and vertex connectivity of the filtered password graph are zero.

Figure 4: Average guess number against degree for all datasets. [Plot: x-axis, degree; y-axis, average guess number (logarithmic scale); one series each for Rootkit, MySpace, Yahoo, 12306, and PhpBB.]

Figure 5: Fitting result of 12306 ($R^2 = 0.84$). [Plot: x-axis, rank of password (logarithmic scale); y-axis, degree of password; fitted line $-561.99 \cdot \log(x) + 2665.34$.]

Density. In simple graphs, a dense graph is a graph in which the number of edges is close to the maximal number of edges. The opposite, a graph with only a few edges, is a sparse graph. For undirected simple graphs, the graph density $D(G)$ is defined as follows:

$$D(G) = \frac{2\,|E|}{|V|\,(|V| - 1)}, \tag{11}$$

where $|E|$ and $|V|$ are the numbers of edges and vertices in the graph, respectively. The value of $D(G)$ lies between 0 and 1: an empty password graph has density 0 (no edge in the graph), and a complete password graph has density 1 (any two vertices in the graph have an edge between them). The density of a password graph can be used as a rough approximation of the strength of the password dataset.
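Equation (11) is simple enough to check against Table 2. The sketch below (with a hypothetical function name) reproduces the density reported for MySpace under threshold 3 from its vertex and edge counts.

```python
def density(num_vertices, num_edges):
    """Density of an undirected simple graph: 2|E| / (|V| (|V| - 1))."""
    return 2 * num_edges / (num_vertices * (num_vertices - 1))


# MySpace at threshold 3 (Table 2): |V| = 41514, |E| = 622087.
myspace_density = density(41514, 622087)
```

The result is about $72.19 \times 10^{-5}$, matching the table, so even the graph built with the largest threshold is very sparse. A complete graph on 4 vertices has 6 edges and density 1.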

As shown in Table 2, each password graph is far from a dense graph. From the definition of density, we can conclude that a complete password graph has the minimum password dataset strength, because successfully cracking one password facilitates cracking all the remaining passwords, whereas an empty password graph has the maximum password dataset strength, because successfully cracking a password provides no further information about the remaining passwords. Note that the exact meaning of this information depends on the selection of the threshold; for example, with a threshold of 3, a cracked password $p$ with nonzero degree indicates that there are passwords whose LD from $p$ is not more than 3.

Figure 6: Fitting result of MySpace ($R^2 = 0.97$). [Plot: x-axis, rank of password (logarithmic scale); y-axis, degree of password; fitted line $-121.20 \cdot \log(x) + 531.95$.]

Distance. The distance between two vertices $u$ and $v$ of a graph is the minimum length of the paths connecting them. If no such path exists, the distance is set to $+\infty$.

Diameter. The diameter of a graph is the longest shortest path between any two vertices. Because of the isolated vertices, there is always at least one pair of vertices at infinite distance in the password graph. Even if we remove the vertices with degree 0 and calculate again, the diameter of the filtered password graph is still $+\infty$, for the same reason as for connectivity: the existence of isolated clusters. The result is also listed in Table 2. In addition to the diameter, the average path length in the graph is unavoidably $+\infty$ too. Further study can evaluate the diameter and average path length of the top clusters (these will not be infinite, because the clusters are connected); we leave this for future work.
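Distance and diameter in an unweighted graph can be computed with breadth-first search, as in the following sketch. The adjacency-dictionary representation and the function names are assumptions of this illustration; running one search per vertex is fine for a single cluster but would be slow on a whole password graph.

```python
import math
from collections import deque


def distances_from(adj, source):
    """Shortest path length from source to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(adj):
    """Longest shortest path; +infinity if some pair is not connected."""
    longest = 0
    for source in adj:
        dist = distances_from(adj, source)
        if len(dist) < len(adj):
            return math.inf  # at least one vertex is unreachable
        longest = max(longest, max(dist.values()))
    return longest


# A path a - b - c has diameter 2; adding an isolated vertex makes it infinite.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
with_isolated = dict(path, d=set())
```

The second toy graph shows why every entry in the $d(G)$ column of Table 2 is $+\infty$: a single isolated vertex is enough.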

One appealing outcome of the visualization of our password graph, as can be seen intuitively in Figure 2, is that password vertices evidently tend to gather together in clusters, a phenomenon similar to that in social networks. Although users choose or create passwords independently of others, in the mass they tend to choose similar passwords. Apparently, there is a minority of top clusters and a vast majority of nontop clusters, and the properties of the top clusters are clearly significant to us. Cluster structure in the context of networks refers to the occurrence of groups of nodes in a network that are more densely connected internally than with the rest of the network. Note the distinction between cluster and clique in graph theory: a clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent, that is, its induced subgraph is complete, whereas a cluster does not require the completeness property.


Figure 7: Fitting and visualization results of Rootkit, PhpBB, and Yahoo. (a) Visualization result of Rootkit. (b) Visualization result of PhpBB. (c) Visualization result of Yahoo. (d) Fitting result of Rootkit ($R^2 = 0.93$; fitted line $-212.22 \cdot \log(x) + 935.31$). (e) Fitting result of PhpBB ($R^2 = 0.90$; fitted line $-830.95 \cdot \log(x) + 4135.65$). (f) Fitting result of Yahoo ($R^2 = 0.92$; fitted line $-444.70 \cdot \log(x) + 2315.86$). [Plots (d) to (f): x-axis, rank of password (logarithmic scale); y-axis, degree of password.]

In some networks, one vertex may belong to more than one cluster. This can happen in a social network in which each vertex represents a person and the clusters represent different groups of people: one cluster for family, another for coworkers, one for friends in the same sports club, and so on. In our password graph, however, we postulate that a vertex can be part of only one cluster, provided that it belongs to some cluster. Clusters in the password graph are important because of their metaproperties: we propose that a cluster in the password graph acts like a metanode in the graph, and intercluster properties differ more significantly than internode properties.
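One simple way to obtain clusters in which each vertex belongs to at most one cluster is to take the connected components of the password graph. This is a simplifying assumption of the sketch below, not necessarily the procedure behind our figures; the function names and the toy adjacency dictionary are hypothetical.

```python
def connected_components(adj):
    """Partition the vertices of an undirected graph into components."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            u = stack.pop()
            component.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(component)
    return components


def top_clusters(adj, n):
    """The n largest components with more than one vertex (assumed clusters)."""
    clusters = [c for c in connected_components(adj) if len(c) > 1]
    return sorted(clusters, key=len, reverse=True)[:n]


toy = {
    "password": {"password1", "Password"},
    "password1": {"password"},
    "Password": {"password"},
    "123456": {"1234567"},
    "1234567": {"123456"},
    "x9$Lq": set(),
}
largest = top_clusters(toy, 1)[0]
```

In the toy graph, the largest cluster contains the three variants of "password", and the isolated vertex "x9$Lq" belongs to no cluster.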

Top password lists [39, 40] are generally used as banned lists to prevent users from choosing common passwords and circumventing security policies. Top password structures [9] are used specifically to make password cracking more efficient. We propose, for the first time, the concept of top clusters to replace top password lists and top password structures, because top clusters are a more accurate and precise description of weak passwords in the context of a password dataset: not only are the passwords that occur in the top password list, or whose structures appear among the top password structures, weak, but so are all the passwords belonging to the same cluster. Table 4 lists a segment of four clusters in Rootkit. Clusters 1, 2, and 4 are common clusters, which can be found in other password graphs too; cluster 3 is present only in Rootkit and is a site-specific cluster. Undoubtedly, they are all weak passwords and should be added to the banned list of an Internet service provider. Combining Table 4 with Figures 2, 3, and 7, we conclude that blocking top clusters enhances the ability of a password dataset to resist trawling attacks.

Table 4: A segment of four clusters in Rootkit.

| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|
| password | 123456 | rootkit | qwerty |
| passwordx | 123456x | irootkit | qwerty |
| password0 | 1234569 | 1rootkit | qwertyu |
| Password | 12345r | rootkit2 | qwertyz |
| Password | 123456 | rootkit3 | qwerty |
| password0 | 12345s | rootkit | qwertyr |
| passw0rk | 1234566 | rootkit | qwerta |
| password | 1234565 | 4rootkit | qwerty7 |
| p4s5w0rd | 1234569 | rrootkit | qwerty9 |
| pssword | 1234561 | rootKit | qwerty111 |
| Password | 12345678 | RootKit | qwerty12 |

One of the many practical applications of top clusters is the building of effective mangling rules; these rules are a bit different from those used by hashcat [24] and JTR [23]. To obtain effective mangling rules from a dataset of passwords, one can compare every two passwords in the same top cluster. The derived rules are the operations allowed in Levenshtein distance: insertion, deletion, and substitution. Substitution is a superset of the Leet operation, and insertion is a superset of prepending and appending. Our method could also be used for a password strength meter: after a user finishes entering a new password, we can evaluate whether it belongs to one of the existing top clusters; if it does, we reject the password and ask for a new one; otherwise we accept it.

7. Conclusions

In this paper, we have explored possible relationships between passwords (modification-based, similarity-based, and probability-based) to aid the construction of the password graph. On the basis of graph theory and previous surveys, and regarding passwords as vertices, we select the modification-based relationship and fix a threshold value of 3 to define the edges in the graph. Our password graph enables the introduction of concepts from graph theory, including degree, density, distance, diameter, and cluster.

The password graph has some novel built-in implications for password research; we use it as an alternative analysis method for password datasets. We find, surprisingly, that for threshold 3 a password can be similar to up to 3380 passwords in the PhpBB dataset, and that passwords evidently tend to gather together in clusters, just as in a social network. Moreover, by applying linear regression, we discover that the degrees of the password graph follow a logarithmic distribution. Our findings suggest that the top password list in adaptive password checking should be replaced with top clusters to reach a higher level of security. The password graph could also enable the creation of effective mangling rules, universally applicable in dictionary attacks, which need only a large training dataset and no a priori knowledge of user habits or personal information. Finally, the password graph can be used for a new kind of password strength meter.

Notations

$V$: Vertex set
$E$: Edge set
$\Delta(G)$: $\max\{d(p) \mid p \in V\}$, the maximum degree of all vertices
$\delta(G)$: $\min\{d(p) \mid p \in V\}$, the minimum degree of all vertices
$Z(G)$: $|\{p \in V \mid d(p) = 0\}|$, the number of vertices whose degree is 0
$D(G)$: $2|E| \div (|V|(|V| - 1))$, density of the password graph
$K(G)$: Vertex connectivity
$\lambda(G)$: Edge connectivity

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors thank Dr. Ding Wang for his help and support. This research was supported by the National Key R&D Program of China under no. 2016YFB0800603 and no. 2017YFB1200700 and the National Natural Science Foundation of China (NSFC) under Grant no. 61472016.

References

[1] Z. Zhao, G.-J. Ahn, and H. Hu, "Picture gesture authentication: empirical analysis, automated attacks, and scheme evaluation," ACM Transactions on Information and System Security, vol. 17, no. 4, article 14, 37 pages, 2015.

[2] A. K. Jain, A. Ross, and S. Pankanti, "Biometrics: a tool for information security," IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 125–143, 2006.

[3] N. Zheng, A. Paloski, and H. Wang, "An efficient user verification system via mouse movements," in Proceedings of the 18th ACM Conference on Computer and Communications Security, pp. 139–150, ACM, October 2011.

[4] D. Florencio and C. Herley, "A large-scale study of web password habits," in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 657–666, ACM, May 2007.

[5] J. Yan, A. F. Blackwell, R. J. Anderson, and A. Grant, "Password memorability and security: empirical results," IEEE Security & Privacy, vol. 2, no. 5, pp. 25–31, 2004.

[6] W. Cheswick, "Rethinking passwords," Communications of the ACM, vol. 56, no. 2, pp. 40–44, 2013.

[7] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, "Passwords and the evolution of imperfect authentication," Communications of the ACM, vol. 58, no. 7, pp. 78–87, 2015.

[8] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, "The quest to replace passwords: a framework for comparative evaluation of web authentication schemes," in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP '12), pp. 553–567, IEEE, San Francisco, Calif, USA, May 2012.

[9] D. Wang, H. Cheng, P. Wang, and X. Huang, "Understanding passwords of Chinese users: characteristics, security and implications. CACR report," in Proceedings of the ChinaCrypt 2015, 2015, http://t.cn/RG8RacH.

[10] A. Adams and M. A. Sasse, "Users are not the enemy," Communications of the ACM, vol. 42, no. 12, pp. 40–46, 1999.

[11] A. Beautement, M. A. Sasse, and M. Wonham, "The compliance budget: managing security behaviour in organisations," in Proceedings of the Workshop on New Security Paradigms (NSPW '08), pp. 47–58, ACM, Lake Tahoe, Calif, USA, September 2008.

[12] D. Wang, D. He, H. Cheng, and P. Wang, "FuzzyPSM: a new password strength meter using fuzzy probabilistic context-free grammars," in Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '16), pp. 595–606, IEEE, July 2016.

[13] L. Wang, Y. Li, and K. Sun, "Amnesia: a bilateral generative password manager," in Proceedings of the 36th IEEE International Conference on Distributed Computing Systems (ICDCS '16), pp. 313–322, IEEE, June 2016.

[14] D. McCarney, D. Barrera, J. Clark, S. Chiasson, and P. C. Van Oorschot, "Tapas: design, implementation, and usability evaluation of a password manager," in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 89–98, ACM, December 2012.

[15] B. Ives, K. R. Walsh, and H. Schneider, "The domino effect of password reuse," Communications of the ACM, vol. 47, no. 4, pp. 75–78, 2004.

[16] D. Wang, H. Cheng, P. Wang, X. Huang, and G. Jian, "Zipf's law in passwords," IEEE Transactions on Information Forensics and Security, vol. 12, no. 11, pp. 2776–2791, 2017.

[17] M. Weir, S. Aggarwal, B. De Medeiros, and B. Glodek, "Password cracking using probabilistic context-free grammars," in Proceedings of the 30th IEEE Symposium on Security and Privacy, pp. 391–405, IEEE, May 2009.

[18] S. Houshmand and S. Aggarwal, "Building better passwords using probabilistic techniques," in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 109–118, ACM, December 2012.

[19] R. Veras, C. Collins, and J. Thorpe, "On the semantic patterns of passwords and their security impact," in Proceedings of the Network and Distributed System Security Symposium (NDSS '14), San Diego, Calif, USA, 2014.

[20] M. Dell'Amico and M. Filippone, "Monte Carlo strength evaluation: fast and reliable password checking," in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 158–169, ACM, October 2015.

[21] R. Morris and K. Thompson, "Password security: a case history," Communications of the ACM, vol. 22, no. 11, pp. 594–597, 1979.

[22] D. V. Klein, "Foiling the cracker: a survey of, and improvements to, password security," in Proceedings of the 2nd USENIX Security Workshop, pp. 5–14, 1990.

[23] John the ripper password cracker, Openwall, http://www.openwall.com/john/.

[24] hashcat, advanced password recovery, https://hashcat.net/hashcat/.

[25] J. Ma, W. Yang, M. Luo, and N. Li, "A study of probabilistic password models," in Proceedings of the 35th IEEE Symposium on Security and Privacy (SP '14), pp. 689–704, IEEE, May 2014.

[26] R. Veras, J. Thorpe, and C. Collins, "Visualizing semantics in passwords: the role of dates," in Proceedings of the 9th International Symposium on Visualization for Cyber Security, pp. 88–95, ACM, October 2012.

[27] M. Weir, S. Aggarwal, M. Collins, and H. Stern, "Testing metrics for password creation policies by attacking large sets of revealed passwords," in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 162–175, ACM, October 2010.

[28] C. Castelluccia, M. Durmuth, and D. Perito, "Adaptive password-strength meters from Markov models," in Proceedings of the 19th Annual Network & Distributed System Security Symposium (NDSS '12), February 2012.

[29] J. Bonneau, "The science of guessing: analyzing an anonymized corpus of 70 million passwords," in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP '12), pp. 538–552, IEEE, San Francisco, Calif, USA, May 2012.

[30] Z. Li, W. Han, and W. Xu, "A large-scale empirical analysis of Chinese web passwords," in Proceedings of the USENIX Security Symposium, pp. 559–574, 2014.

[31] B. L. Riddle, M. S. Miron, and J. A. Semo, "Passwords in use in a university timesharing environment," Computers & Security, vol. 8, no. 7, pp. 569–579, 1989.

[32] R. Shay, S. Komanduri, P. G. Kelley et al., "Encountering stronger password requirements: user attitudes and behaviors," in Proceedings of the 6th Symposium on Usable Privacy and Security, vol. 2, ACM, July 2010.

[33] Y. Zhang, F. Monrose, and M. K. Reiter, "The security of modern password expiration: an algorithmic framework and empirical analysis," in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 176–186, October 2010.

[34] X. Guo, H. Chen, X. Liu, X. Xu, and Z. Chen, "The scale-free network of passwords: visualization and estimation of empirical passwords," https://arxiv.org/abs/1511.08324.

[35] S. Perez, "Recently confirmed Myspace hack could be the largest yet," May 2016.

[36] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, 1966.

[37] A. Das, J. Bonneau, M. Caesar, N. Borisov, and X. Wang, "The tangled web of password reuse," in Proceedings of the Network and Distributed System Security Symposium (NDSS '14), vol. 14, pp. 23–26, 2014.

[38] graph-tool: efficient network analysis with Python, https://graph-tool.skewed.de/.

[39] A. Vance, "If your password is 123456, just make it hackme," The New York Times, vol. 20, p. A1, 2010.

[40] S. Schechter, C. Herley, and M. Mitzenmacher, "Popularity is everything: a new approach to protecting passwords from statistical-guessing attacks," in Proceedings of the 5th USENIX Conference on Hot Topics in Security, pp. 1–8, USENIX Association, 2010.

is symmetric. (b) A directed graph is a graph in which edges have orientations. A simple graph is an undirected graph in which both parallel edges and loops are disallowed, while a multigraph allows them.

3.2. Linear Regression. In statistics, linear regression is an approach for modeling the relationship between two variables (a scalar dependent variable, denoted by $y$, and an explanatory variable, denoted by $x$) by fitting a linear equation to the experimental data. The most common method for linear regression is least squares. In linear regression, given the value of the explanatory variable $x$, the value of the dependent variable $y$ is an affine function of $x$: $y = k \cdot x + b$, where $k$ is the slope of the line and $b$ is the intercept.

How well the regression line approximates the experimental data points is usually measured by the coefficient of determination, denoted by $R^2$, which ranges from 0 to 1; the closer to 1, the better. An $R^2$ value of 1 indicates that all experimental data points lie exactly on the regression line, whereas an $R^2$ of 0 indicates that the line explains none of the variation in the data.
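For reference, the least-squares estimates and the coefficient of determination have the standard closed forms below. The sample notation is introduced here only for illustration and does not appear in the original text: $(x_i, y_i)$, $i = 1, \dots, n$, are the data points, $\bar{x}$ and $\bar{y}$ their means, and $\hat{y}_i$ the fitted values.

```latex
\[
k = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},
\qquad
b = \bar{y} - k\,\bar{x},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2},
\quad \hat{y}_i = k\,x_i + b .
\]
```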

3.3. Spearman's Coefficient. In statistics, Spearman's coefficient is a nonparametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function. The Spearman coefficient $\rho$ is defined as the Pearson correlation coefficient between the two ranked vectors $X$, $Y$:

$$\rho = \frac{\operatorname{cov}(rg_X, rg_Y)}{\sigma_{rg_X}\,\sigma_{rg_Y}}, \tag{2}$$

where $\operatorname{cov}(rg_X, rg_Y)$ is the covariance of the rank variables and $\sigma_{rg_X}$ and $\sigma_{rg_Y}$ are the standard deviations of the rank variables. By definition, $\rho \in [-1, 1]$, where 0 indicates no rank correlation between the vectors. A perfect Spearman correlation of 1 or $-1$ occurs when each vector is a perfectly monotone function of the other.
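A minimal sketch of this definition, ranking both vectors and then taking the Pearson correlation of the ranks, is given below. The helper names `average_ranks`, `pearson`, and `spearman` are hypothetical, and assigning tied values the mean of their positions is an assumption (a common convention), since the text does not say how ties are ranked.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter the computation, any strictly increasing transformation of one vector leaves the coefficient unchanged.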

3.4. Password Guess Number. The password guess number characterizes the time complexity required for a password cracking algorithm (PCFG-based or Markov-based) to recover a password, measured as the number of guesses required to crack it. Dell'Amico and Filippone [20] detail a Monte Carlo sampling method that converts the probability of a password pw, as computed by a PCFG or Markov model, into an estimate of the cracker's guess number:

$$G(\mathrm{pw}) = \frac{1}{k}\sum_{i=1}^{k}\frac{\delta(p_i)}{\Pr(p_i)} + 1, \qquad \delta(p_i) = \begin{cases} 1 & \text{if } \Pr(p_i) > \Pr(\mathrm{pw}), \\ 0 & \text{otherwise}, \end{cases} \tag{3}$$

where $k$ is the sample size and $p_i$ is a password drawn randomly from the corresponding distribution.
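Equation (3) translates almost directly into code. The sketch below assumes that the caller has already drawn $k$ passwords from the model and looked up their model probabilities; the function name `estimate_guess_number` and the argument `sample_probs` are hypothetical.

```python
def estimate_guess_number(pr_pw, sample_probs):
    """Monte Carlo estimate of the guess number of a password, following (3).

    pr_pw        -- model probability Pr(pw) of the password being evaluated
    sample_probs -- model probabilities Pr(p_i) of k passwords sampled from the model
    """
    k = len(sample_probs)
    # Only samples more probable than pw contribute, each weighted by 1 / Pr(p_i).
    total = sum(1.0 / p for p in sample_probs if p > pr_pw)
    return total / k + 1
```

The weighting by $1/\Pr(p_i)$ compensates for the fact that probable passwords are sampled more often: in expectation, every password that is more probable than pw contributes exactly 1 to the average, so the estimate approximates the rank of pw in descending order of probability.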

Table 1: Basic information about our 5 password datasets.

Dataset   Web service        When leaked   Total PWs
MySpace   Social networking  Oct. 2006     49623
PhpBB     Programmer forum   Jan. 2009     255373
Rootkit   Hacker forum       Feb. 2011     69324
Yahoo     Web portal         July 2012     442834
12306     Train ticketing    Dec. 2014     129303

3.5. Datasets. For completeness, we give a brief description of the five datasets used in our experiments (see Table 1). They were breached by hackers or intruders and later disclosed publicly on the Internet; some of them have already been used in password cracking models [17, 18].

The MySpace list was originally published in October 2006. It was obtained by a phishing attack and thus might contain weak as well as fake passwords. Our copy of the MySpace list has 49623 passwords in total, but a recent report said that over 360 million accounts were actually involved. Each record of the MySpace list contained an email address, a password, and, in some cases, a second password. Because some accounts had multiple passwords, there were in fact over 427 million passwords available for sale [35]. Rootkit's entire MySQL database backup was released by the Anonymous group using the Twitter account of HBGary's CEO; it initially contained 71228 passwords cryptographically scrambled using the MD5 hashing algorithm, and we eventually managed to recover 97.33% of them through trawling attacks. In 2012, nearly 443K email addresses and passwords for a Yahoo site were exposed after a server breach: hackers got into Yahoo's Contributor Network database by using a rudimentary attack called SQL injection. The PhpBB dataset includes about 250K passwords leaked from Phpbb.com in January 2009; the hacker even described the whole attack in detail on his blog. At the end of 2014, a password dataset from the Chinese train ticketing website 12306 was leaked to the public by anonymous attackers; it includes nearly 130K passwords. User information registered on the ticket booking website included real names, ID card numbers, and phone and email contacts, and may also include information about family members and friends, posing a serious threat to their information privacy.

4. Password Relationships

We divide our findings on password relationships into three categories: modification-based, similarity-based, and probability-based. These categories are detailed in the following three subsections.

4.1. Modification-Based. The relationship between passwords is, to some extent, tied to users' modifications to passwords, so it can be modeled well by edit distance, which features a set of operations on passwords. Edit distance quantifies how dissimilar two passwords are by counting the minimum number of operations required to transform one password into the other. However, there are various definitions of edit distance featuring different sets of editing operations. Levenshtein distance [36] (hereafter LD) allows three operations: removal, insertion, and substitution of a character in the password. The operations permitted by the longest common subsequence distance are a subset of LD's: insertion and deletion. Hamming distance applies only to passwords of the same length; that is, it does not allow insertion or removal. Additional primitive operations such as transposition are used by the Jaro-Winkler distance, based on the observation that a common typing mistake is the transposition of two adjacent characters in a string.

The LD between two passwords is formally defined as the minimum number of single-character edits one must perform to change one password into the other. Mathematically, the LD between two passwords $a$ and $b$ is given by $\mathrm{lev}_{a,b}(|a|, |b|)$:

$$\mathrm{lev}_{a,b}(i, j) = \min \begin{cases} \mathrm{lev}_{a,b}(i-1, j) + 1, \\ \mathrm{lev}_{a,b}(i, j-1) + 1, \\ \mathrm{lev}_{a,b}(i-1, j-1) + f(a_i, b_j), \end{cases} \tag{4}$$

where $|a|$ and $|b|$ denote the lengths of passwords $a$ and $b$, respectively, and $f(a_i, b_j)$ is an indicator function that is equal to 0 when $a_i = b_j$ and equal to 1 otherwise. The calculation of LD is a dynamic programming problem; the time complexity of the algorithm is $O(|a| \cdot |b|)$.
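The recurrence in (4) can be evaluated row by row, as in the sketch below. The boundary conditions $\mathrm{lev}_{a,b}(i, 0) = i$ and $\mathrm{lev}_{a,b}(0, j) = j$ are the standard ones and are assumed here, since (4) does not spell them out; the function name `levenshtein` is likewise only illustrative.

```python
def levenshtein(a, b):
    """Levenshtein distance between strings a and b in O(|a| * |b|) time."""
    # prev[j] is the distance between the first i-1 characters of a and b[:j].
    prev = list(range(len(b) + 1))           # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1      # the indicator f(a_i, b_j)
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # match or substitution
        prev = curr
    return prev[-1]
```

Keeping only two rows reduces the memory use to $O(|b|)$ without changing the result; for example, `levenshtein("password", "passw0rd")` is 1 (one substitution).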

4.2. Similarity-Based. Passwords are essentially strings, so string similarity measures provide substantial methods for evaluating the similarity (relationship) between passwords. Dice's coefficient is a statistic used for comparing the similarity between two sets and can be used to evaluate the similarity between two passwords. Assume $X$ and $Y$ are the sets of (unique) characters that passwords $a$ and $b$ include, respectively. Dice's coefficient of $a$, $b$ is defined as follows:

$$\mathrm{QS} = \frac{2\,|X \cap Y|}{|X| + |Y|}, \tag{5}$$

where $|X|$ and $|Y|$ are the numbers of elements of the two sets. QS is the quotient of similarity and ranges between 0 and 1.

The Jaccard index measures the similarity between two finite sets; it is defined as the size of the intersection divided by the size of the union of the sets $X$ and $Y$ ($X$ and $Y$ have the same meaning as above):

$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|}. \tag{6}$$

If $X$ and $Y$ are both empty, define $J(X, Y) = 1$. $J(X, Y)$ is thus also a quotient of similarity and ranges between 0 and 1.
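Both coefficients operate on character sets and take only a few lines, as the sketch below shows. The function names are hypothetical, and returning 1 for Dice's coefficient when both passwords are empty is an assumption made here by analogy with the convention stated for the Jaccard index.

```python
def dice(a, b):
    """Dice's coefficient (5) of the character sets of passwords a and b."""
    x, y = set(a), set(b)
    if not x and not y:
        return 1.0  # assumption: same convention as for the Jaccard index
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(a, b):
    """Jaccard index (6) of the character sets of passwords a and b."""
    x, y = set(a), set(b)
    if not x and not y:
        return 1.0  # convention stated in the text
    return len(x & y) / len(x | y)
```

Since sets discard order and repetition, `jaccard("abc", "cba")` is 1 although the two strings differ; this is one way in which set-based similarity departs from the modification-based view.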

4.3. Probability-Based. Using a PCFG-based or Markov-based password cracking model, we can assign a probability to every password in a dataset. In principle, two passwords are similar in a sense (they are equally likely to be picked by a human user) if their corresponding probabilities are close enough; the closer, the more similar. One workable measure of similarity between passwords $a$ and $b$ may be

$$S(a, b) = \frac{(\Pr(a) - \Pr(b))^2}{(\Pr(a) + \Pr(b))^2}. \tag{7}$$

When $a$ and $b$ have the same probability, $S(a, b)$ equals 0. An extreme situation occurs when there is only one password in the distribution: the probability of that password is 1 and the probability of every other password is 0. The value of $S(a, b)$ (where $\Pr(a) = 1$, $\Pr(b) = 0$) then equals 1.

In summary, the relationships between passwords described above are only a fraction of the countless relationships that exist. Some require adaptation: for example, the edit distance between passwords is not a measure of similarity but a number of edits. One can, of course, divide it by the length of the longer of the two passwords, which turns LD into the relative LD:

$$\mathrm{lev}^{\mathrm{rel}}_{a,b}(|a|, |b|) = \frac{\mathrm{lev}_{a,b}(|a|, |b|)}{\max\{|a|, |b|\}}. \tag{8}$$

The value of the relative LD for two chosen passwords falls into the interval $[0, 1]$, in which the left boundary 0 means exactly the same and the right boundary 1 means altogether different.
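As a worked example (the two passwords are chosen here for illustration), take $a$ = "password" and $b$ = "password1". One insertion turns $a$ into $b$, so the LD is 1, the longer length is 9, and the relative LD is:

```latex
\[
\mathrm{lev}^{\mathrm{rel}}_{a,b}(8, 9)
  = \frac{\mathrm{lev}_{a,b}(8, 9)}{\max\{8, 9\}}
  = \frac{1}{9} \approx 0.11 .
\]
```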

4.4. Rationale. The practical and accurate measure of password relationships depends on the particular application scenario. There is no single best method for generating a password graph, only the most appropriate one for the concrete scenario. For example, if we want to study the modification habits of users, we may choose the modification-based classification, but if we only want to learn about password cracking properties, we may choose the probability-based classification. We expand our work using the modification-based relationship between passwords (the other relationships explored above are still worth deeper research, which we leave for future work). Prior surveys have already provided plenty of information on how users modify their passwords: adding digits or symbols at the beginning/end of a password, capitalizing a letter, Leet transformation, and so forth [9, 12, 32, 37]. Taking the permitted operations of each type of edit distance into consideration, we infer that LD is currently the most appropriate measure of password relationship among all types of edit distance, since it models real-world users' mangling behaviors (insertion, deletion, and substitution are popular, while transposition is not). In Section 5, we utilize LD to formally define the edges in a graph.

5. Construction of Password Graph

Generally speaking, any entity that can be described by objects with relationships between them can be regarded as a graph. Since our principal purpose is to characterize the relationship between passwords, we apply this notion by (1) regarding each unique password in the dataset as a vertex, $V = \{p \mid p \in \mathcal{P}\}$, and (2) letting two passwords whose edit distance is not more than a threshold $t$ form an edge, $E = \{(u, v) \mid \mathrm{lev}_{u,v}(|u|, |v|) \le t \text{ and } u, v \in V\}$. We call the graph obtained from the password dataset the password graph $G$. This is a very intuitive method of turning a password dataset into a graph, though other approaches to the same goal might exist (e.g., one can consider unique printable ASCII characters as vertices and connect the characters forming a password, which seems reasonable, but its practicability and efficiency remain unknown).

Figure 1: Construction phases. Raw dataset (password, password, 123456, 123456, 1234567, password1) -> Phase 1 -> unique passwords (password, 123456, 1234567, password1) -> Phase 2 -> password graph G.

5.1. Construction Procedures. We describe our construction procedures for the password graph in more detail in this subsection. The procedures are organized into two main phases and are illustrated in Figure 1.

Phase 1. Starting with the raw password dataset $\mathcal{S}$, we measure the frequency of every unique password and associate the frequency values with the corresponding passwords; subsequently, we remove the duplicated passwords to obtain the password dataset $\mathcal{P}$, so that any two passwords picked from $\mathcal{P}$ are different. Every password in $\mathcal{P}$ stands for a vertex in graph $G$; in other words, the vertex set $V$ of $G$ is complete after this phase.

Phase 2. The LD for every two passwords in the dataset $\mathcal{P}$ is calculated. An edge is formed if the LD of its end vertices is not more than the threshold $t$. We report experimental results for three thresholds in this paper (the selection of the threshold $t$ is an interesting open question that needs further research). The edge set $E$ of our graph $G$ is complete after this phase. It is worth noting that the maximum possible number of edges is $N(N-1) \div 2$, where $N$ is the size of set $\mathcal{P}$; in that case, the resulting password graph is a complete graph. The other extreme case occurs if there are no edges at all in $G$, which means the resulting password graph is an empty graph.
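The two phases can be sketched in a few lines of Python. This is a naive illustration, not the implementation used for the experiments: the function names are hypothetical, all $N(N-1) \div 2$ pairs are examined, and the only shortcut is to skip pairs whose lengths differ by more than $t$, which is safe because the LD is never smaller than the length difference.

```python
from collections import Counter
from itertools import combinations


def levenshtein(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_password_graph(raw_passwords, t):
    """Return (frequencies, vertices, edges) of the password graph for threshold t."""
    # Phase 1: count frequencies, then keep each unique password once as a vertex.
    freq = Counter(raw_passwords)
    vertices = sorted(freq)
    # Phase 2: connect every pair of vertices whose LD is at most t.
    edges = []
    for u, v in combinations(vertices, 2):
        if abs(len(u) - len(v)) > t:
            continue  # the LD is at least the length difference
        if levenshtein(u, v) <= t:
            edges.append((u, v))
    return freq, vertices, edges
```

Applied to the six raw passwords of Figure 1 with $t = 1$, the sketch produces four vertices and two edges, one joining "123456" with "1234567" and one joining "password" with "password1". For datasets of the sizes in Table 1, the quadratic number of comparisons is the dominant cost, so a practical implementation would need a faster candidate search.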

To observe the constructed password graph intuitively, we have visualized it with the Python module Graph-tool [38]. Graph-tool is an efficient Python module for manipulation and statistical analysis of graphs (a.k.a. networks). We export the resulting graph to a file format supported by Graph-tool (the GraphML file format, without isolated vertices) and visualize it with some deliberate tuning of the parameters, which yields an intricate graph image.
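The export step can be approximated with the Python standard library alone. The sketch below is an assumption about what such an export might look like, not the authors' script: the function name `to_graphml`, the node identifiers, and the `label` attribute are all hypothetical. Isolated vertices are left out automatically because nodes are created only for passwords that appear in some edge.

```python
import xml.etree.ElementTree as ET


def to_graphml(edges):
    """Serialize an undirected edge list to a GraphML string (no isolated vertices)."""
    root = ET.Element("graphml", {"xmlns": "http://graphml.graphdrawing.org/xmlns"})
    # Declare a string attribute on nodes that stores the password itself.
    ET.SubElement(root, "key", {"id": "label", "for": "node",
                                "attr.name": "label", "attr.type": "string"})
    graph = ET.SubElement(root, "graph", {"id": "G", "edgedefault": "undirected"})
    ids = {}
    for u, v in edges:
        for password in (u, v):
            if password not in ids:
                ids[password] = "n%d" % len(ids)
                node = ET.SubElement(graph, "node", {"id": ids[password]})
                data = ET.SubElement(node, "data", {"key": "label"})
                # ElementTree escapes &, < and >; passwords containing control
                # characters would need extra handling to remain valid XML.
                data.text = password
    for u, v in edges:
        ET.SubElement(graph, "edge", {"source": ids[u], "target": ids[v]})
    return ET.tostring(root, encoding="unicode")
```

The returned string can be written to a file ending in `.graphml`. Loading that file with Graph-tool's `load_graph` and drawing it with `graph_draw` is the intended next step, but the layout parameters used for the figures are not given in the text and are therefore not reproduced here.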

Figure 2: Visualization result of 12306.

The visualization result of 12306 is shown in Figure 2 and that of MySpace in Figure 3; the visualization results of the remaining datasets are collected in Figure 7. To some extent, the internal structure and characteristics of the password dataset $\mathcal{P}$ are represented by its graph image, and some qualitative properties can be read from these images: a vertex in the graph $G$ may have dense connections with other vertices; password vertices evidently tend to gather together in clusters; 12306 has two apparently big clusters, while MySpace has countless small clusters; and vertices spreading all over the canvas reflect, to a degree, that passwords are spread all over the possible password space. A more in-depth quantitative analysis, which shows more inner properties of the password dataset, is performed in Section 6.

Admittedly, we leave the frequencies of passwords in the password dataset $\mathcal{P}$ unused. Our final password graph is a simple undirected graph without any loops or parallel edges, so one feasible way to exploit password frequency may be to attach one loop edge to the vertex of a password for each repetition of that password. Besides, taking password frequency into consideration could establish more than one edge between two vertices, which is called a parallel edge in graph theory. In short, the aforementioned operations would turn our simple password graph into a multigraph. Such an augmented construction would probably embed more original and subtle properties into the final password graph but would also introduce higher time and space complexities; we leave these feasible extensions for further study.

Figure 3: Visualization result of MySpace.

6. Graph Concepts in Password Graph

Interestingly, the transformation from password dataset to password graph carries concepts from graph theory over to the password graph. In this section, we rigorously revisit the terminology of graph theory. Extending these concepts to our refined password graph helps to analyze the password graph more concretely and quantitatively.

Degree. The degree of the vertex $v$, written as $d(v)$, is the number of edges with $v$ as an end vertex. Note that there are concepts of in-degree and out-degree in directed graphs, but because our refined password graph is a simple undirected graph, there is no need to differentiate them. The degree of a password in the password graph indicates how many unique passwords in the password dataset $\mathcal{P}$ are similar to this password; we denote the degree of password $p$ as $d(p)$.

Statistics of degree for our experimental password datasets are summarized in Table 2. The notations used in the table are defined in Notations; we also explain them in the main body of the paper. For each dataset, the table shows the experimental results for three thresholds $t$ (from 1 to 3). One can easily notice that the column titled $\delta(G)$ is filled with zeroes; that is, the minimum degree of all password datasets is 0, meaning that there always exist isolated vertices in every password dataset. The number of passwords whose degree is zero decreases as the threshold increases. Some studies have attempted to determine the number of users who appear to choose passwords that are random or meaningless to human beings. We think, instead, that the number of isolated password vertices (passwords with no other passwords similar to them) is a more meaningful figure for the same goal and is easier to evaluate. Table 3 lists a segment of the passwords whose degree is zero in 12306 under threshold 3. Though some of them have a common substring, they are not actually similar to each other, because the number of edits between them exceeds the threshold, and they should be considered individually; the frequencies of these passwords are typically 1.
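The degree-related columns of Table 2 can be computed from a vertex list and an edge list as sketched below. The function name and the dictionary keys are hypothetical; the average degree is taken to be $2|E| \div |V|$, which is the usual definition but an assumption here, and the sketch presumes a simple graph with at least two vertices.

```python
def degree_statistics(vertices, edges):
    """Degree statistics of a simple undirected graph given as vertex and edge lists."""
    degree = {v: 0 for v in vertices}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n, m = len(vertices), len(edges)
    return {
        "max_degree": max(degree.values()),                      # Delta(G)
        "min_degree": min(degree.values()),                      # delta(G)
        "isolated": sum(1 for d in degree.values() if d == 0),   # Z(G)
        "avg_degree": 2 * m / n,                                 # assumed: 2|E| / |V|
        "density": 2 * m / (n * (n - 1)),                        # D(G)
    }
```

For a toy graph with vertices a, b, c, d and edges (a, b) and (a, c), the maximum degree is 2, the minimum degree is 0, and vertex d is the single isolated vertex.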

Empirically, a threshold value of 3 is reasonable for analyzing the experimental results, so the following analyses are all based on threshold 3; one reaches the same conclusions when the threshold is 1 or 2 or another viable value. Many people might expect that a bigger set $\mathcal{P}$ would definitely yield more edges. In our experimental results, however, PhpBB has the largest number of edges of all password graphs, while Yahoo has the largest number of vertices. Therefore, more password vertices do not necessarily mean more edges. To our surprise, a password can be similar to up to 3380 passwords (in the PhpBB dataset). The maximum degree increases rapidly as the threshold increases: for example, with threshold 1, PhpBB's maximum degree is only 65, but it explodes to 3380 when the threshold is 3. More strikingly, a password is similar on average to around 150 passwords in PhpBB; even the minimum average number of similar passwords among the 5 datasets is nearly 30.

In principle, for a specific vertex $v$, the maximum possible degree $\phi(t)$ is fixed once the threshold $t$ is determined:

$$\phi(t) = |S_{\mathrm{dis}1}(t)| + |S_{\mathrm{dis}2}(t)| + |S_{\mathrm{dis}3}(t)|, \tag{9}$$

where $S_{\mathrm{dis}1}(t)$, $S_{\mathrm{dis}2}(t)$, and $S_{\mathrm{dis}3}(t)$ are the sets containing passwords whose LD from $v$ is 1, 2, and 3, respectively, under threshold $t$. Our experimental results therefore show only a lower bound on the number of similar passwords (the threshold we apply is smaller than the actual or typical one); the real situation may be worse than we imagine.

The distribution of password graph degree is anotherthought-provoking characteristic like length distribution inpassword datasets Our study has shown that distributionof password graph degree correlated with its ranks sortedin descending order by its corresponding frequency is morecomplicated than a simple distribution Some passwordsoccurring more frequently might have less number of pass-words similar to them and vice versa this is a specificallyoutstanding fact in 12306 that password ldquo1qaz2wsxrdquo hasoccurred 80 times but its degree is only 6 password ldquoh123456rdquohas 962 passwords similar to it but its frequency is only4 Hence the correlation between frequency of passwordand degree in password graph has many big fluctuations(there would be many breaks in the distribution plot) Afteranalyzing our five password datasets of it has turned outthat the degree distribution of a password graph can beapproximated by the following equation

$$\mathrm{PGDD}(r) = k \cdot \log(r) + b \qquad (10)$$

where $\mathrm{PGDD}(r)$ is the $r$th largest degree and the base of the logarithm is 10.


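Table 2: Statistics of experiment results.

| Password dataset | $\lvert V\rvert$ | $t$ | $\lvert E\rvert$ | $\Delta(G)$ | $\delta(G)$ | $Z(G)$ | Avg. degree | $D(G)$ ($\times 10^{-5}$) | $K(G)$ | $d(G)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| MySpace | 41514 | 1 | 20119 | 58 | 0 | 28119 | 0.97 | 2.33 | 0 | $+\infty$ |
| | | 2 | 111884 | 141 | 0 | 16748 | 5.39 | 12.98 | 0 | $+\infty$ |
| | | 3 | 622087 | 477 | 0 | 7290 | 29.97 | 72.19 | 0 | $+\infty$ |
| PhpBB | 184341 | 1 | 81135 | 65 | 0 | 130767 | 0.88 | 0.48 | 0 | $+\infty$ |
| | | 2 | 1204816 | 460 | 0 | 70187 | 13.07 | 7.09 | 0 | $+\infty$ |
| | | 3 | **13836682** | **3380** | 0 | 29453 | **150.12** | **81.44** | 0 | $+\infty$ |
| Rootkit | 56859 | 1 | 9281 | 41 | 0 | 47093 | 0.33 | 0.57 | 0 | $+\infty$ |
| | | 2 | 105275 | 133 | 0 | 30564 | 3.70 | 6.51 | 0 | $+\infty$ |
| | | 3 | 1062502 | 953 | 0 | 15792 | 37.37 | 65.73 | 0 | $+\infty$ |
| Yahoo | 342510 | 1 | 140092 | 66 | 0 | 247445 | 0.84 | 0.24 | 0 | $+\infty$ |
| | | 2 | 1394200 | 327 | 0 | 147126 | 8.62 | 2.38 | 0 | $+\infty$ |
| | | 3 | 13686843 | 2127 | 0 | **77122** | 79.92 | 23.33 | 0 | $+\infty$ |
| 12306 | 117808 | 1 | 51299 | 63 | 0 | 95057 | 0.87 | 0.74 | 0 | $+\infty$ |
| | | 2 | 676011 | 506 | 0 | 56162 | 11.48 | 9.74 | 0 | $+\infty$ |
| | | 3 | 5311460 | 2301 | 0 | 20640 | 90.17 | 76.54 | 0 | $+\infty$ |

Note: (1) Bold figures are the maximum value of each category for threshold 3. (2) The definitions of the notations are summarized in Notations.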

Table 3: A segment of passwords with 0 degrees in 12306.

| Number | Password | Frequency |
|---|---|---|
| 1125 | 04768130093 | 1 |
| 1128 | 048398531 | 1 |
| 1138 | 0501154315 | 1 |
| 1139 | 0501170228 | 1 |
| 36909 | 0155402zj | 1 |
| 69622 | 013307aini | 1 |
| 120788 | 015210755s | 1 |

As explained in Section 3, we employ the least-squares fitting method of linear regression to approximate the degree distribution of the password graph. The fitting results of the 12306 and MySpace password datasets are shown in Figures 5 and 6. The horizontal axis is the rank value and the vertical axis is the degree of the password. The other three fitting results are shown together in Figure 7. The values of $k$, $b$, and the coefficient of determination of each regression are displayed in the respective plots. Every linear regression (except for 12306) has an $R^2$ not smaller than 0.90, which is close to 1 and thus indicates a remarkably sound fit. For the 12306 dataset, $R^2$ is about 0.84, which is not ideal but acceptable, though not as good as that of the other datasets. A plausible reason is that the 12306 dataset was collected by trying usernames and passwords from other leaked datasets online and probably cannot represent the real distribution of the entire 12306 password dataset. Guo et al. [34] plot the same picture as ours, in which the relation between log(rank) and degree is linear; this indicates that the degree distribution follows a logarithmic law rather than a power law. Under a power law, the relation between log(rank) and log(degree) would be linear.
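The fit in (10) can be sketched in a few lines of Python. This is a minimal illustration, assuming ordinary least squares on the pairs (log10(rank), degree); the function name `fit_log_rank` and the synthetic degrees are hypothetical and are not taken from the paper's experiments.

```python
import math

def fit_log_rank(degrees):
    """Fit degree = k * log10(rank) + b by ordinary least squares.

    Ranks are assigned after sorting the degrees in descending order
    (rank 1 is the largest degree). Returns (k, b, r_squared).
    """
    ys = sorted(degrees, reverse=True)
    xs = [math.log10(r) for r in range(1, len(ys) + 1)]
    n = len(ys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                      # slope of the fitted line
    b = mean_y - k * mean_x            # intercept of the fitted line
    ss_res = sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return k, b, 1.0 - ss_res / ss_tot

# Synthetic degrees that follow the logarithmic law exactly (illustration only).
synthetic = [300.0 - 100.0 * math.log10(r) for r in range(1, 1001)]
k, b, r2 = fit_log_rank(synthetic)
```

On real degree sequences the returned coefficient of determination would be below 1; the synthetic input only confirms that the routine recovers a known $k$ and $b$.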

To study the relation between password degree and crackability, we calculate each password's degree and its corresponding guess number based on [20], with a sampling size of 100000. We then plot the average guess number of passwords against their degree, as shown in Figure 4. We observe that, overall, the average guess number decreases as the degree increases. We also evaluate the Spearman coefficient for each dataset: it is −0.94 for Rootkit, −0.88 for MySpace, −0.95 for Yahoo, −0.78 for 12306, and −0.93 for PhpBB. The Spearman coefficients further confirm our deduction: they are no greater than −0.88 for all datasets except 12306 (which is probably caused by incomplete password data). Thus there is a strong relation between a password's degree and its corresponding guess number: overall, the bigger the password degree is, the easier the password is to crack.
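For readers who want to reproduce this kind of check, the Spearman coefficient is the Pearson correlation of the two rank sequences. The sketch below is a plain-Python version under that standard definition, with tied values given their average rank; the helper names and the sample numbers are made up for illustration and are not the paper's data.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical (degree, average guess number) pairs: larger degree, fewer guesses.
degrees = [1, 5, 20, 100, 400]
avg_guess_numbers = [9e8, 4e7, 6e5, 3e4, 2e3]
rho = spearman(degrees, avg_guess_numbers)
```

A value of `rho` close to −1 corresponds to the monotonically decreasing trend described above.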

Connectivity. The connectivity of a graph is the minimum number of elements that need to be removed to disconnect the remaining nodes from each other. The elements are either vertices or edges, so there are vertex connectivity $K(G)$ and edge connectivity $\lambda(G)$. Nevertheless, neither the vertex connectivity nor the edge connectivity of any of our selected password datasets is greater than 0, because of the isolated vertices in the password graph. In other words, the password graph is never connected. We list only the value of vertex connectivity in Table 2. We can further remove the isolated vertices and then evaluate the connectivity of the filtered graph, but there are still isolated clusters in the graph, so both the edge and the vertex connectivity are zero for the filtered password graph as well.
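The observation that connectivity is 0 whenever the graph splits into more than one component can be checked with a simple traversal. The following is a small sketch, assuming the graph is given as a vertex list and an undirected edge list; the function name and the toy data are hypothetical.

```python
def connected_components(vertices, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:                      # depth-first traversal from `start`
            u = stack.pop()
            component.add(u)
            for w in adjacency[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(component)
    return components

# Toy graph: two small clusters and one isolated vertex.
vertices = ["123456", "1234567", "password", "password1", "0155402zj"]
edges = [("123456", "1234567"), ("password", "password1")]
components = connected_components(vertices, edges)
isolated = [c for c in components if len(c) == 1]
is_connected = len(components) == 1   # False here, so K(G) = lambda(G) = 0
```

Dropping the isolated vertex still leaves two separate clusters, which mirrors why the filtered password graph also has connectivity 0.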

Figure 4: Average guess number against degree for all datasets (Rootkit, MySpace, Yahoo, 12306, and PhpBB). Horizontal axis: degree; vertical axis: average guess number.

Figure 5: Fitting result of 12306 ($R^2 = 0.84$). Horizontal axis: rank of password (logarithmic scale); vertical axis: degree of password. Fitted line: $-561.99 \cdot \log(x) + 2665.34$.

Density. Among simple graphs, a dense graph is a graph in which the number of edges is close to the maximal number of edges. The opposite, a graph with only a few edges, is a sparse graph. For undirected simple graphs, the graph density $D(G)$ is defined as follows:

$$D(G) = \frac{2|E|}{|V|(|V| - 1)} \qquad (11)$$

where $|E|$ and $|V|$ are the number of edges and vertices in the graph, respectively. The value of $D(G)$ lies between 0 and 1: an empty password graph has density 0 (there is no edge in the graph) and a complete password graph has density 1 (any two vertices in the graph have an edge between them). The density of a password graph can be used as a rough approximation of the strength of the password dataset.
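As a quick arithmetic check of (11), the sketch below recomputes the density of the PhpBB graph at threshold 3 from the $|V|$ and $|E|$ values reported in Table 2. The function name `graph_density` is an illustrative choice.

```python
def graph_density(num_vertices, num_edges):
    """Density of an undirected simple graph: 2|E| / (|V| (|V| - 1))."""
    if num_vertices < 2:
        raise ValueError("density is undefined for fewer than two vertices")
    return 2 * num_edges / (num_vertices * (num_vertices - 1))

# |V| and |E| of PhpBB at threshold 3, as reported in Table 2.
phpbb_density = graph_density(184341, 13836682)
phpbb_density_scaled = phpbb_density / 1e-5   # expressed in units of 10^-5

# Boundary cases: a complete graph on 4 vertices and an empty graph on 5 vertices.
complete_density = graph_density(4, 6)
empty_density = graph_density(5, 0)
```

The scaled value agrees with the 81.44 listed for PhpBB in Table 2 up to rounding, and the two boundary cases give 1 and 0 as stated above.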

As shown in Table 2, each refined password graph is far from a dense graph. From this observation we can conclude that a complete password graph has the minimum password dataset strength, because cracking one password successfully facilitates cracking all the remaining passwords, whereas an empty password graph has the maximum password dataset strength, because cracking one password successfully provides no further information about the remaining passwords. Note that the exact definition of this information depends on the selection of the threshold: for example, a threshold of 3 implies that cracking a password $p$ indicates that there are passwords whose LD distance from $p$ is no greater than 3.

Figure 6: Fitting result of MySpace ($R^2 = 0.97$). Horizontal axis: rank of password (logarithmic scale); vertical axis: degree of password. Fitted line: $-121.20 \cdot \log(x) + 531.95$.

Distance. The distance between two vertices $u$ and $v$ of a graph is the minimum length of the paths connecting them. If no such path exists, the distance is set equal to $+\infty$.

Diameter. The diameter of a graph is the longest shortest path between any two vertices. Due to the existence of isolated vertices, there is always at least one path of infinite length in the password graph. Even if we remove the vertices with degree 0 and calculate again, the diameter of the filtered password graph is still $+\infty$, for the same reason as for connectivity: isolated clusters remain. The result is also listed in Table 2. Like the diameter, the average path length in the graph is unavoidably $+\infty$ too. Further study can be performed to evaluate the diameter and average path length of the top clusters (these will not be infinite because each cluster is connected); we leave it for future work.
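The two definitions above translate directly into breadth-first search. The sketch below is one straightforward way to compute them for a small graph stored as an adjacency dictionary; it is quadratic in the worst case and intended only for illustration, and the names and the toy graph are hypothetical.

```python
import math
from collections import deque

def bfs_distances(adjacency, source):
    """Shortest path lengths (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adjacency):
    """Longest shortest path; +infinity if some vertex pair is disconnected."""
    longest = 0
    for source in adjacency:
        dist = bfs_distances(adjacency, source)
        if len(dist) < len(adjacency):    # some vertex is unreachable
            return math.inf
        longest = max(longest, max(dist.values()))
    return longest

# Toy password graph with one isolated vertex.
graph = {
    "password": {"password1", "passw0rd"},
    "password1": {"password"},
    "passw0rd": {"password"},
    "x9#kQ": set(),
}
whole_graph_diameter = diameter(graph)

# Restricting to the single connected cluster gives a finite diameter.
cluster = {v: nbrs for v, nbrs in graph.items() if nbrs}
cluster_diameter = diameter(cluster)
```

The same routine applied to a single top cluster would return a finite value, which is the quantity the text leaves for future work.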

One appealing outcome of the visualization of our password graph, as we can see intuitively from Figure 2, is that password vertices evidently tend to gather together into clusters, a phenomenon similar to that observed in social networks. Although users choose/create passwords independently of one another, in the mass they tend to choose similar passwords. Apparently there is a minority of top clusters and a vast majority of nontop clusters, and the properties of the top clusters are of obvious significance to us. Cluster structure in the context of networks refers to the occurrence of groups of nodes that are more densely connected internally than with the rest of the network. Note the distinction between a cluster and a clique in graph theory: a clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent, that is, its induced subgraph is complete, whereas a cluster does not require the completeness property.

Figure 7: Fitting and visualization results of Rootkit, PhpBB, and Yahoo. (a) Visualization result of Rootkit. (b) Visualization result of PhpBB. (c) Visualization result of Yahoo. (d) Fitting result of Rootkit ($R^2 = 0.93$), fitted line $-212.22 \cdot \log(x) + 935.31$. (e) Fitting result of PhpBB ($R^2 = 0.90$), fitted line $-830.95 \cdot \log(x) + 4135.65$. (f) Fitting result of Yahoo ($R^2 = 0.92$), fitted line $-444.70 \cdot \log(x) + 2315.86$. In panels (d) to (f), the horizontal axis is the rank of password (logarithmic scale) and the vertical axis is the degree of password.

In some cases, one vertex may belong to more than one cluster. This might happen in a social network in which each vertex represents a person and the clusters represent different groups of people: one cluster for family, another for coworkers, one for friends in the same sports club, and so on. Nonetheless, in our password graph we postulate that a vertex can be part of only one cluster, provided that it belongs to some cluster at all. Clusters in the password graph are important because of their metaproperties. We therefore propose the idea that a cluster in the password graph acts like a metanode in the graph: intercluster properties differ more significantly than internode properties.

Top password lists [39, 40] are generally used to build banned lists that prevent users from choosing common passwords and circumventing security policies. Top password structures [9] are used specifically to make password cracking more efficient. We propose, for the first time, the concept of top clusters to replace top password lists and top password structures outright, because top clusters describe weak passwords in a password dataset more accurately and precisely: not only are the passwords that occur in the top password list, or whose structures appear among the top password structures, weak, but so are all the passwords belonging to the same cluster. Table 4 lists a segment of four clusters in Rootkit. Clusters 1, 2, and 4 are common clusters, which can be found in other password graphs too; cluster 3 is present only in Rootkit and is a site-specific cluster. Undoubtedly they are all weak passwords and should be added to the banned lists of Internet service providers. Combining Table 4 with Figures 2, 3, and 7, we conclude that blocking top clusters enhances the ability of the password dataset to resist trawling attacks.

Table 4: A segment of four clusters in Rootkit.

| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|
| password | 123456 | rootkit | qwerty |
| passwordx | 123456x | irootkit | qwerty |
| password0 | 1234569 | 1rootkit | qwertyu |
| Password | 12345r | rootkit2 | qwertyz |
| Password | 123456 | rootkit3 | qwerty |
| password0 | 12345s | rootkit | qwertyr |
| passw0rk | 1234566 | rootkit | qwerta |
| password | 1234565 | 4rootkit | qwerty7 |
| p4s5w0rd | 1234569 | rrootkit | qwerty9 |
| pssword | 1234561 | rootKit | qwerty111 |
| Password | 12345678 | RootKit | qwerty12 |

One of the many practical applications of top clusters is the building of effective mangling rules; these rules differ slightly from those used by hashcat [24] and JTR [23]. To obtain effective mangling rules from a dataset of passwords, one can compare every two passwords in the same top cluster. The derived rules are exactly the operations allowed in Levenshtein distance: insertion, deletion, and substitution. Substitution is a superset of the Leet operation, and insertion is a superset of prepending and appending. Our visualization method could also be used for a password strength meter: after a user finishes entering his/her new password, we evaluate whether it belongs to one of the existing top clusters; if it does, we reject the password and ask for a new one; otherwise we accept it.
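One way to compare two passwords from the same cluster is to backtrack through the Levenshtein table and read off the edits. The sketch below is an illustration under that assumption; the tuple format `(operation, position, old_char, new_char)` and the function name are hypothetical and are not a hashcat or JTR rule syntax.

```python
def edit_operations(source, target):
    """Return one minimal list of Levenshtein edits turning source into target."""
    m, n = len(source), len(target)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    # Walk back from the bottom-right cell, recording the edit taken at each step.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                      # characters match, no edit
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            ops.append(("substitute", i - 1, source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            ops.append(("insert", i, "", target[j - 1]))
            j -= 1
        else:
            ops.append(("delete", i - 1, source[i - 1], ""))
            i -= 1
    ops.reverse()
    return ops

leet_rule = edit_operations("password", "passw0rd")
append_rule = edit_operations("rootkit", "rootkit2")
strip_rule = edit_operations("1rootkit", "rootkit")
```

Counting how often each recorded edit occurs across all pairs in a top cluster would then give a frequency-ranked set of candidate rules.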

7. Conclusions

In this paper, we have explored possible relationships between passwords (modification-based, similarity-based, and probability-based) to aid the construction of the password graph. On the basis of graph theory and previous surveys, and regarding passwords as vertices, we select the modification-based relationship and fix a threshold value of 3 to define the edges in the graph. Our password graph enables the introduction of concepts from graph theory, including degree, density, distance, diameter, and cluster.

The password graph has some novel built-in implications for password research; we use it as an alternative analysis method for password datasets. We find, surprisingly, that for threshold 3 a password can be similar to as many as 3380 passwords in the PhpBB dataset, and that passwords evidently tend to gather together into clusters, just as in a social network. Moreover, by applying linear regression, we discover that the degrees of the password graph follow a logarithmic distribution. Our findings suggest that the top passwords list in adaptive password checking should be replaced with top clusters to reach a higher level of security. Ultimately, the password graph could also enable the creation of effective, universally applicable mangling rules for dictionary attacks, which needs only a large training dataset and requires no a priori knowledge of user habits or personal information. The password graph can also be used for a new kind of password strength meter.

Notations

$V$: Vertex set
$E$: Edge set
$\Delta(G)$: $\max\{d(p) \mid p \in V\}$, the maximum degree of all vertices
$\delta(G)$: $\min\{d(p) \mid p \in V\}$, the minimum degree of all vertices
$Z(G)$: $|\{p \mid p \in V, d(p) = 0\}|$, the number of vertices whose degree is 0
$D(G)$: $2|E| \div (|V|(|V| - 1))$, density of the password graph
$K(G)$: Vertex connectivity
$\lambda(G)$: Edge connectivity

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors thank Dr. Ding Wang for his help and support. This research was supported by the National Key R&D Program of China under no. 2016YFB0800603 and no. 2017YFB1200700 and the National Natural Science Foundation of China (NSFC) under Grant no. 61472016.

References

[1] Z. Zhao, G.-J. Ahn, and H. Hu, "Picture gesture authentication: empirical analysis, automated attacks, and scheme evaluation," ACM Transactions on Information and System Security, vol. 17, no. 4, article 14, 37 pages, 2015.

[2] A. K. Jain, A. Ross, and S. Pankanti, "Biometrics: a tool for information security," IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 125–143, 2006.

[3] N. Zheng, A. Paloski, and H. Wang, "An efficient user verification system via mouse movements," in Proceedings of the 18th ACM Conference on Computer and Communications Security, pp. 139–150, ACM, October 2011.

[4] D. Florencio and C. Herley, "A large-scale study of web password habits," in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 657–666, ACM, May 2007.

[5] J. Yan, A. F. Blackwell, R. J. Anderson, and A. Grant, "Password memorability and security: empirical results," IEEE Security & Privacy, vol. 2, no. 5, pp. 25–31, 2004.

[6] W. Cheswick, "Rethinking passwords," Communications of the ACM, vol. 56, no. 2, pp. 40–44, 2013.

[7] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, "Passwords and the evolution of imperfect authentication," Communications of the ACM, vol. 58, no. 7, pp. 78–87, 2015.

[8] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, "The quest to replace passwords: a framework for comparative evaluation of web authentication schemes," in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP '12), pp. 553–567, IEEE, San Francisco, Calif, USA, May 2012.

[9] D. Wang, H. Cheng, P. Wang, and X. Huang, "Understanding passwords of Chinese users: characteristics, security and implications, CACR report," in Proceedings of the ChinaCrypt 2015, 2015, http://t.cn/RG8RacH.

[10] A. Adams and M. A. Sasse, "Users are not the enemy," Communications of the ACM, vol. 42, no. 12, pp. 40–46, 1999.

[11] A. Beautement, M. A. Sasse, and M. Wonham, "The compliance budget: managing security behaviour in organisations," in Proceedings of the Workshop on New Security Paradigms (NSPW '08), pp. 47–58, ACM, Lake Tahoe, Calif, USA, September 2008.

[12] D. Wang, D. He, H. Cheng, and P. Wang, "FuzzyPSM: a new password strength meter using fuzzy probabilistic context-free grammars," in Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '16), pp. 595–606, IEEE, July 2016.

[13] L. Wang, Y. Li, and K. Sun, "Amnesia: a bilateral generative password manager," in Proceedings of the 36th IEEE International Conference on Distributed Computing Systems (ICDCS '16), pp. 313–322, IEEE, June 2016.

[14] D. McCarney, D. Barrera, J. Clark, S. Chiasson, and P. C. Van Oorschot, "Tapas: design, implementation, and usability evaluation of a password manager," in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 89–98, ACM, December 2012.


[15] B. Ives, K. R. Walsh, and H. Schneider, "The domino effect of password reuse," Communications of the ACM, vol. 47, no. 4, pp. 75–78, 2004.

[16] D. Wang, H. Cheng, P. Wang, X. Huang, and G. Jian, "Zipf's law in passwords," IEEE Transactions on Information Forensics and Security, vol. 12, no. 11, pp. 2776–2791, 2017.

[17] M. Weir, S. Aggarwal, B. De Medeiros, and B. Glodek, "Password cracking using probabilistic context-free grammars," in Proceedings of the 30th IEEE Symposium on Security and Privacy, pp. 391–405, IEEE, May 2009.

[18] S. Houshmand and S. Aggarwal, "Building better passwords using probabilistic techniques," in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 109–118, ACM, December 2012.

[19] R. Veras, C. Collins, and J. Thorpe, "On the semantic patterns of passwords and their security impact," in Proceedings of the Network and Distributed System Security Symposium (NDSS '14), San Diego, Calif, USA, 2014.

[20] M. Dell'Amico and M. Filippone, "Monte Carlo strength evaluation: fast and reliable password checking," in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 158–169, ACM, October 2015.

[21] R. Morris and K. Thompson, "Password security: a case history," Communications of the ACM, vol. 22, no. 11, pp. 594–597, 1979.

[22] D. V. Klein, "Foiling the cracker: a survey of, and improvements to, password security," in Proceedings of the 2nd USENIX Security Workshop, pp. 5–14, 1990.

[23] John the Ripper password cracker, Openwall, http://www.openwall.com/john/.

[24] hashcat, advanced password recovery, https://hashcat.net/hashcat/.

[25] J. Ma, W. Yang, M. Luo, and N. Li, "A study of probabilistic password models," in Proceedings of the 35th IEEE Symposium on Security and Privacy (SP '14), pp. 689–704, IEEE, May 2014.

[26] R. Veras, J. Thorpe, and C. Collins, "Visualizing semantics in passwords: the role of dates," in Proceedings of the 9th International Symposium on Visualization for Cyber Security, pp. 88–95, ACM, October 2012.

[27] M. Weir, S. Aggarwal, M. Collins, and H. Stern, "Testing metrics for password creation policies by attacking large sets of revealed passwords," in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 162–175, ACM, October 2010.

[28] C. Castelluccia, M. Durmuth, and D. Perito, "Adaptive password-strength meters from Markov models," in Proceedings of the 19th Annual Network & Distributed System Security Symposium (NDSS '12), February 2012.

[29] J. Bonneau, "The science of guessing: analyzing an anonymized corpus of 70 million passwords," in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP '12), pp. 538–552, IEEE, San Francisco, Calif, USA, May 2012.

[30] Z. Li, W. Han, and W. Xu, "A large-scale empirical analysis of Chinese web passwords," in Proceedings of the USENIX Security Symposium, pp. 559–574, 2014.

[31] B. L. Riddle, M. S. Miron, and J. A. Semo, "Passwords in use in a university timesharing environment," Computers & Security, vol. 8, no. 7, pp. 569–579, 1989.

[32] R. Shay, S. Komanduri, P. G. Kelley et al., "Encountering stronger password requirements: user attitudes and behaviors," in Proceedings of the 6th Symposium on Usable Privacy and Security, vol. 2, ACM, July 2010.

[33] Y. Zhang, F. Monrose, and M. K. Reiter, "The security of modern password expiration: an algorithmic framework and empirical analysis," in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 176–186, October 2010.

[34] X. Guo, H. Chen, X. Liu, X. Xu, and Z. Chen, "The scale-free network of passwords: visualization and estimation of empirical passwords," https://arxiv.org/abs/1511.08324.

[35] S. Perez, "Recently confirmed Myspace hack could be the largest yet," May 2016.

[36] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, 1966.

[37] A. Das, J. Bonneau, M. Caesar, N. Borisov, and X. Wang, "The tangled web of password reuse," in Proceedings of the Network and Distributed System Security Symposium (NDSS '14), vol. 14, pp. 23–26, 2014.

[38] graph-tool: efficient network analysis with Python, https://graph-tool.skewed.de/.

[39] A. Vance, "If your password is 123456, just make it hackme," The New York Times, vol. 20, p. A1, 2010.

[40] S. Schechter, C. Herley, and M. Mitzenmacher, "Popularity is everything: a new approach to protecting passwords from statistical-guessing attacks," in Proceedings of the 5th USENIX Conference on Hot Topics in Security, pp. 1–8, USENIX Association, 2010.



other. However, there are various definitions of edit distance featuring different sets of editing operations. Levenshtein distance [36] (hereafter LD) allows three operations: removal, insertion, and substitution of a character in the password. The operations permitted by the longest common subsequence distance are a subset of LD's: insertion and deletion. Hamming distance applies only to passwords of the same length; that is, it does not allow insertion or removal. Additional primitive operations such as transposition are used by Jaro-Winkler distance, based on the observation that a common typing mistake is the transposition of two adjacent characters in a string.

The LD between two passwords is formally defined as the minimum number of single-character edits one must perform to change one password into the other. Mathematically, the LD between two passwords $a$ and $b$ is given by $\operatorname{lev}_{a,b}(|a|, |b|)$:

$$\operatorname{lev}_{a,b}(i, j) = \min \begin{cases} \operatorname{lev}_{a,b}(i - 1, j) + 1 \\ \operatorname{lev}_{a,b}(i, j - 1) + 1 \\ \operatorname{lev}_{a,b}(i - 1, j - 1) + f(a_i, b_j) \end{cases} \qquad (4)$$

where $|a|$ and $|b|$ denote the lengths of passwords $a$ and $b$, respectively, and $f(a_i, b_j)$ is an indicator function that is equal to 0 when $a_i = b_j$ and equal to 1 otherwise; the recursion bottoms out at $\operatorname{lev}_{a,b}(i, 0) = i$ and $\operatorname{lev}_{a,b}(0, j) = j$. The calculation of LD is a dynamic programming problem; the time complexity of this algorithm is $O(|a| \cdot |b|)$.
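The recurrence in (4) can be evaluated bottom-up with a table that is filled row by row. The following is a minimal sketch of that standard dynamic programming scheme; keeping only the previous row (rather than the full table) is an implementation choice made here, not something prescribed by the paper.

```python
def levenshtein(a, b):
    """Levenshtein distance between strings a and b in O(|a| * |b|) time."""
    prev = list(range(len(b) + 1))            # lev(0, j) = j
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # lev(i, 0) = i
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1       # indicator f(a_i, b_j)
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

d_leet = levenshtein("password", "passw0rd")
d_append = levenshtein("123456", "1234567")
d_classic = levenshtein("kitten", "sitting")
```

A Leet substitution and a single appended digit both cost one edit, so such pairs are already adjacent in the password graph at threshold 1.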

4.2. Similarity-Based. Passwords are essentially strings, so measures of string similarity provide us with substantial methods to evaluate the similarity (relationship) between passwords. Dice's coefficient is a statistic used for comparing the similarity of two sets and can be used to evaluate the similarity between two passwords. Assume that $X$ and $Y$ are the sets of distinct characters that passwords $a$ and $b$ contain, respectively. Dice's coefficient of $a$ and $b$ is defined as follows:

$$\mathrm{QS} = \frac{2|X \cap Y|}{|X| + |Y|} \qquad (5)$$

where $|X|$ and $|Y|$ are the numbers of elements of the two sets. QS is the quotient of similarity and ranges between 0 and 1.

The Jaccard index measures the similarity between two finite sets; it is defined as the size of the intersection divided by the size of the union of the sets $X$ and $Y$ ($X$ and $Y$ have the same meaning as described above):

$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|} \qquad (6)$$

If $X$ and $Y$ are both empty, define $J(X, Y) = 1$, so $J(X, Y)$ is also a quotient of similarity and ranges between 0 and 1.
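Both set-based measures are short to compute once each password is reduced to its set of distinct characters. The sketch below follows (5) and (6) directly; returning 1.0 for Dice's coefficient when both sets are empty is an assumption made here to mirror the Jaccard convention, since the text does not define that case.

```python
def dice_coefficient(a, b):
    """Dice's coefficient over the sets of distinct characters of a and b."""
    x, y = set(a), set(b)
    if not x and not y:
        return 1.0        # assumption: same convention as the Jaccard index
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard_index(a, b):
    """Jaccard index over the sets of distinct characters of a and b."""
    x, y = set(a), set(b)
    if not x and not y:
        return 1.0        # defined as 1 when both sets are empty
    return len(x & y) / len(x | y)

# "password" and "passw0rd" each have 7 distinct characters and share 6 of them.
qs = dice_coefficient("password", "passw0rd")
j = jaccard_index("password", "passw0rd")
```

Because only character sets are compared, anagrams such as "abc" and "cba" score 1 under both measures, which is one reason these measures behave differently from edit distance.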

4.3. Probability-Based. From a PCFG-based or Markov-based password cracking model, we can assign a probability to every password in a dataset. In principle, two passwords are similar in a sense (they have the same probability of being picked by a human user) if their corresponding probabilities are close enough: the closer, the more similar. One workable measure of similarity between passwords $a$ and $b$ may be

$$S(a, b) = \frac{(\Pr(a) - \Pr(b))^2}{(\Pr(a) + \Pr(b))^2} \qquad (7)$$

When $a$ and $b$ have the same probability, $S(a, b)$ equals 0. An extreme situation happens when there is only one password in the distribution: the probability of that password is 1 and the probability of every other password is 0. The value of $S(a, b)$ (where $\Pr(a) = 1$ and $\Pr(b) = 0$) then equals 1.
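Equation (7) can be evaluated directly once a cracking model has assigned probabilities. In the sketch below the probabilities are passed in as plain numbers, and the case where both are 0 (which leaves (7) undefined) is handled by returning 0.0; that choice and the function name are assumptions made for illustration.

```python
def probability_measure(pr_a, pr_b):
    """S(a, b) = (Pr(a) - Pr(b))^2 / (Pr(a) + Pr(b))^2 for non-negative inputs."""
    if pr_a < 0 or pr_b < 0:
        raise ValueError("probabilities must be non-negative")
    total = pr_a + pr_b
    if total == 0:
        return 0.0        # assumption: two zero-probability passwords count as equal
    return (pr_a - pr_b) ** 2 / total ** 2

s_equal = probability_measure(0.01, 0.01)
s_extreme = probability_measure(1.0, 0.0)
s_mixed = probability_measure(0.3, 0.1)     # (0.2 / 0.4)^2
```

Note that smaller values of $S$ mean closer probabilities, so a threshold on $S$ would connect passwords whose value falls below it.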

In summary, the relationships between passwords that we describe above are only a portion of the countless relationships that exist. For example, the edit distance between passwords is not itself a measure of similarity but a number of edits. One can, of course, divide it by the longer length of the two passwords, which turns LD into a relative LD:

$$\operatorname{lev}^{\mathrm{rel}}_{a,b}(|a|, |b|) = \frac{\operatorname{lev}_{a,b}(|a|, |b|)}{\max\{|a|, |b|\}} \qquad (8)$$

The value of relative LD for two chosen passwords thus falls into the interval $[0, 1]$, in which the left boundary 0 means exactly the same and the right boundary 1 means altogether different.
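A short sketch of (8) follows. It bundles its own copy of a standard Levenshtein routine so that it runs on its own, and it returns 0.0 for two empty strings, a case that (8) leaves undefined; both the helper names and that convention are assumptions.

```python
def levenshtein(a, b):
    """Levenshtein distance via the usual row-by-row dynamic programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def relative_levenshtein(a, b):
    """LD divided by the longer of the two lengths, giving a value in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0        # assumption: two empty strings are identical
    return levenshtein(a, b) / longest

rel_close = relative_levenshtein("password", "passw0rd")   # 1 edit over length 8
rel_far = relative_levenshtein("abc", "xyz")               # every character differs
```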

4.4. Rationale. Which measure of password relationship is practical and accurate depends on the particular application scenario. There is no single best method for generating a password graph, only the one most appropriate to the concrete scenario. For example, if we want to study the modification habits of users, we may choose modification-based classification, but if we only want to learn about password cracking properties, we may choose probability-based classification. We expand our work by utilizing the modification-based relationship between passwords (the other relationships explored above are still worth deeper research, and we leave them for future work). Prior surveys have already provided us with plenty of information on how users modify their passwords: adding digits or symbols at the beginning/end of the password, capitalizing a letter, Leet transformation, and so forth [9, 12, 32, 37]. Taking the permitted operations of each type of edit distance into consideration, we infer that LD is currently the most appropriate measure of password relationship among all types of edit distance, because it actually models real-world users' mangling behaviors (insertion, deletion, and substitution are popular, while transposition is not). In Section 5, we utilize LD to formally define the edges in a graph.

5. Construction of Password Graph

Generally speaking, any entity that can be described by objects with relationships between them can be regarded as a graph. Our principal purpose is to characterize the relationships between passwords, so we apply this notion by (1) regarding each unique password in the dataset as a vertex, $V = \{p \mid p \in \mathcal{P}\}$, and (2) forming an edge whenever the edit distance between two passwords is not more than a threshold $t$, $E = \{(u, v) \mid \operatorname{lev}_{u,v}(|u|, |v|) \le t \text{ and } u, v \in V\}$. We call the graph transformed from the password dataset the password graph $G$. This is a very intuitive method of turning a password dataset into a graph, though other approaches to the same goal might exist (e.g., one can consider unique printable ASCII characters as vertices and connect the characters that form a password, which seems reasonable, but its practicability and efficiency remain unknown).

Figure 1: Construction phases. The raw dataset (password, password, 123456, 123456, 1234567, password1) is reduced in Phase 1 to the unique passwords (password, 123456, 1234567, password1), which Phase 2 turns into the graph $G$.

5.1. Construction Procedures. We describe our construction procedures for the password graph in more detail in this subsection. The procedures are organized into two main phases and are illustrated in Figure 1.

Phase 1. Starting with the raw password dataset $\mathcal{S}$, we measure the frequency of every unique password and associate the frequency values with the corresponding passwords. We then remove the duplicated passwords to obtain the password dataset $\mathcal{P}$, so that any two passwords picked from $\mathcal{P}$ are invariably different. Every password in $\mathcal{P}$ stands for a vertex in graph $G$; in other words, the vertex set $V$ of $G$ is complete after this phase.

Phase 2. The LD distance for every two passwords in the dataset $\mathcal{P}$ is calculated. An edge is formed if the LD value of its end vertices is not more than the threshold $t$. We report experiment results for three thresholds in this paper (the selection of the threshold $t$ is an interesting open question that needs further research). The edge set $E$ of our graph $G$ is complete after this phase. It is worth noting that the maximum possible number of edges is $N(N - 1) \div 2$, where $N$ is the size of the set $\mathcal{P}$; in that case the resulting password graph is a complete graph. The other extreme case is that there are no edges at all in $G$, which means that the resulting password graph is an empty graph.
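The two phases can be sketched as follows. This is a naive all-pairs illustration, assuming a plain list of strings as the raw dataset; the function names are hypothetical, and the length-difference shortcut (LD can never be smaller than the difference in length) is an optimization added here, not a step described in the paper. The all-pairs loop is quadratic in the number of unique passwords, so it is meant for toy inputs only.

```python
from collections import Counter
from itertools import combinations

def levenshtein(a, b):
    """Levenshtein distance via the usual row-by-row dynamic programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def build_password_graph(raw_passwords, threshold):
    """Return (vertices, edges, frequency) for the password graph."""
    # Phase 1: count frequencies, then keep each unique password as one vertex.
    frequency = Counter(raw_passwords)
    vertices = sorted(frequency)
    # Phase 2: connect every pair whose LD is not more than the threshold.
    edges = set()
    for u, v in combinations(vertices, 2):
        if abs(len(u) - len(v)) > threshold:
            continue      # the length gap alone already exceeds the threshold
        if levenshtein(u, v) <= threshold:
            edges.add((u, v))
    return vertices, edges, frequency

# The raw dataset shown in Figure 1, with threshold 1.
raw = ["password", "password", "123456", "123456", "1234567", "password1"]
vertices, edges, frequency = build_password_graph(raw, threshold=1)
```

The frequencies are returned alongside the graph so that they remain available for later analysis, even though the edges themselves do not depend on them.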

In order to observe the password graph intuitively, we have visualized it with Python Graph-tool [38]. Graph-tool is an efficient Python module for the manipulation and statistical analysis of graphs (a.k.a. networks). We export the resulting graph to a file format supported by Graph-tool (the GraphML file format, without isolated vertices) and visualize it with some deliberate tuning of the parameters, which produces an intricate graph image.

Figure 2: Visualization result of 12306.

The visualization result of 12306 is shown in Figure 2 and that of MySpace in Figure 3; the visualization results of the remaining datasets are collected in Figure 7. To some extent, the visualization represents the internal structure and characteristics of the password dataset $\mathcal{P}$, and some qualitative properties can be read from these images: a vertex in the graph $G$ may be densely connected with other vertices; password vertices evidently tend to gather into clusters; 12306 has two apparently big clusters, while MySpace has a great many small clusters; and vertices spread over the entire canvas, which reflects to some degree that passwords spread over the whole possible password space. A more in-depth quantitative analysis, which shows more inner properties of the password dataset, is performed in Section 6.

The frequencies of passwords in the password dataset $\mathcal{P}$ are not used in our construction. Our final password graph is a simple undirected graph without any loop or parallel edge. One feasible way to exploit password frequency is to attach one loop edge to the vertex of a password for each repetition of that password. Alternatively, taking password frequency into consideration could establish more than one edge between two vertices, which is called a parallel edge in graph theory. In short, these operations would turn our simple password graph into a multigraph. Such an augmented construction would probably embed more of the original, subtle properties into the final password graph, but it would also increase time and space complexity; we leave these feasible extensions for further study.

Figure 3: Visualization result of MySpace.

6. Graph Concepts in Password Graph

Interestingly, the transformation from password dataset to password graph allows concepts from graph theory to be carried over to passwords. In this section, we revisit the terminology of graph theory. Extending those concepts to our refined password graph helps to analyze the password graph more concretely and quantitatively.

Degree. The degree of a vertex $v$, written as $d(v)$, is the number of edges with $v$ as an end vertex. Directed graphs distinguish in-degree from out-degree, but because our refined password graph is a simple undirected graph, there is no need to differentiate them. In the password graph, the degree of a password tells how many unique passwords in the dataset $\mathcal{P}$ are similar to it; we denote the degree of password $p$ as $d(p)$.

Statistics of degree for our experimental password datasets are summarized in Table 2. The notations used in the table are defined in the Notations section; we also explain them in the main body of the paper. For each dataset, the table gives experiment results for 3 thresholds $t$ (from 1 to 3). One can easily notice that the column titled $\delta(G)$ is filled with zeroes; that is, the minimum degree in every password dataset is 0, meaning that every password dataset contains isolated vertices. The number of passwords whose degree is zero decreases as the threshold increases. Some studies have attempted to determine the number of users who appear to choose passwords that are random or meaningless to human beings. We think instead that the number of isolated password vertices (passwords with no other password similar to them) is a more meaningful figure for the same purpose and is easier to evaluate. Table 3 lists a segment of the passwords whose degree is zero in 12306 under threshold 3. Although some of them share a common substring, they are not actually similar to each other, because the number of edits between them is beyond the threshold, and they should be considered individually; the frequencies of these passwords are typically 1.
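The degree columns of Table 2 can be derived from a vertex list and an edge list as in the following sketch. The helper name `degree_statistics` and the toy data (including the made-up isolated password `zq8#Lm`) are assumptions for illustration only.

```python
def degree_statistics(vertices, edges):
    """Return (max degree, min degree, number of isolated vertices, average degree)."""
    degree = {v: 0 for v in vertices}
    for u, v in edges:
        # Each undirected edge contributes one to both of its end vertices.
        degree[u] += 1
        degree[v] += 1
    values = list(degree.values())
    isolated = sum(1 for d in values if d == 0)
    return max(values), min(values), isolated, sum(values) / len(values)


vertices = ["123456", "1234567", "password", "password1", "zq8#Lm"]
edges = [("123456", "1234567"), ("password", "password1")]
print(degree_statistics(vertices, edges))  # (1, 0, 1, 0.8)
```

The four returned values correspond to $\Delta(G)$, $\delta(G)$, $Z(G)$, and the average degree, respectively.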

Empirically, a threshold value of 3 is reasonable for analyzing the experiment results, so the following analyses are all based on threshold 3; one reaches the same conclusions when the threshold is 1, 2, or another viable value. Many people might expect that a bigger set $\mathcal{P}$ would definitely produce more edges. In our experiment results, however, PhpBB has the largest number of edges of all password graphs, while Yahoo has the largest number of vertices. Therefore, more password vertices do not necessarily mean more edges. To our surprise, a password can be similar to as many as 3380 passwords (in the PhpBB dataset). The maximum degree increases rapidly as the threshold increases; for example, with threshold 1, PhpBB's maximum degree is only 65, but it explodes to 3380 when the threshold is 3. More strikingly, a password in PhpBB is similar to around 150 passwords on average, and even the smallest average number of similar passwords among the 5 datasets is nearly 30.

In principle, for a specific vertex $v$, the maximum possible degree $\phi(t)$ is fixed once the threshold $t$ is determined:

$$\phi(t) = |S_{\mathrm{dis}1}(t)| + |S_{\mathrm{dis}2}(t)| + |S_{\mathrm{dis}3}(t)|, \tag{9}$$

where $S_{\mathrm{dis}1}(t)$, $S_{\mathrm{dis}2}(t)$, and $S_{\mathrm{dis}3}(t)$ are the sets containing passwords whose LD distance from $v$ is 1, 2, and 3, respectively, under threshold $t$. The results of our experiment therefore show only a lower bound on the number of similar passwords (the threshold we apply is smaller than the actual or typical one); the real situation may be worse than we imagine.

The distribution of password graph degree is another thought-provoking characteristic, like the length distribution in password datasets. Our study shows that the relationship between a password's degree and its rank, when passwords are sorted in descending order of frequency, is more complicated than a simple distribution. Some passwords that occur more frequently have fewer passwords similar to them, and vice versa. This is particularly evident in 12306: the password “1qaz2wsx” occurs 80 times but its degree is only 6, whereas the password “h123456” has 962 passwords similar to it but its frequency is only 4. Hence the correlation between password frequency and degree in the password graph fluctuates strongly (there would be many breaks in the distribution plot). After analyzing our five password datasets, it turned out that the degree distribution of a password graph can be approximated by the following equation:

$$\mathrm{PGDD}(r) = k \cdot \log(r) + b, \tag{10}$$

where $\mathrm{PGDD}(r)$ is the $r$th largest degree and the base of the logarithm is 10.


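Table 2: Statistics of experiment results.

| Password dataset | $\lvert V\rvert$ | $t$ | $\lvert E\rvert$ | $\Delta(G)$ | $\delta(G)$ | $Z(G)$ | Avg. degree | $D(G)$ ($\times 10^{-5}$) | $K(G)$ | $d(G)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| MySpace | 41514 | 1 | 20119 | 58 | 0 | 28119 | 0.97 | 2.33 | 0 | $+\infty$ |
| MySpace | 41514 | 2 | 111884 | 141 | 0 | 16748 | 5.39 | 12.98 | 0 | $+\infty$ |
| MySpace | 41514 | 3 | 622087 | 477 | 0 | 7290 | 29.97 | 72.19 | 0 | $+\infty$ |
| PhpBB | 184341 | 1 | 81135 | 65 | 0 | 130767 | 0.88 | 0.48 | 0 | $+\infty$ |
| PhpBB | 184341 | 2 | 1204816 | 460 | 0 | 70187 | 13.07 | 7.09 | 0 | $+\infty$ |
| PhpBB | 184341 | 3 | 13836682 | 3380 | 0 | 29453 | 150.12 | 81.44 | 0 | $+\infty$ |
| Rootkit | 56859 | 1 | 9281 | 41 | 0 | 47093 | 0.33 | 0.57 | 0 | $+\infty$ |
| Rootkit | 56859 | 2 | 105275 | 133 | 0 | 30564 | 3.70 | 6.51 | 0 | $+\infty$ |
| Rootkit | 56859 | 3 | 1062502 | 953 | 0 | 15792 | 37.37 | 65.73 | 0 | $+\infty$ |
| Yahoo | 342510 | 1 | 140092 | 66 | 0 | 247445 | 0.84 | 0.24 | 0 | $+\infty$ |
| Yahoo | 342510 | 2 | 1394200 | 327 | 0 | 147126 | 8.62 | 2.38 | 0 | $+\infty$ |
| Yahoo | 342510 | 3 | 13686843 | 2127 | 0 | 77122 | 79.92 | 23.33 | 0 | $+\infty$ |
| 12306 | 117808 | 1 | 51299 | 63 | 0 | 95057 | 0.87 | 0.74 | 0 | $+\infty$ |
| 12306 | 117808 | 2 | 676011 | 506 | 0 | 56162 | 11.48 | 9.74 | 0 | $+\infty$ |
| 12306 | 117808 | 3 | 5311460 | 2301 | 0 | 20640 | 90.17 | 76.54 | 0 | $+\infty$ |

Note. (1) Bold figures are the maximum value of each category for threshold 3. (2) The definitions of the notations are summarized in Notations.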

Table 3: A segment of passwords with 0 degrees in 12306.

| Number | Password | Frequency |
|---|---|---|
| 1125 | 04768130093 | 1 |
| 1128 | 048398531 | 1 |
| 1138 | 0501154315 | 1 |
| 1139 | 0501170228 | 1 |
| 36909 | 0155402zj | 1 |
| 69622 | 013307aini | 1 |
| 120788 | 015210755s | 1 |

As explained in Section 3, we employ the least-squares method in linear regression to approximate the distribution of the password graph's degree. The fitting results of the 12306 and MySpace password datasets are shown in Figures 5 and 6, where the horizontal axis is the rank and the vertical axis is the degree of the password. The other three fitting results are collected in Figure 7. The values of $k$, $b$, and the coefficient of determination $R^2$ of each regression are displayed in the respective plots. Every linear regression except that of 12306 has an $R^2$ of at least 0.90, which is close to 1 and thus indicates a sound fit. For the 12306 dataset, $R^2$ is about 0.84, which is acceptable but not as good as for the other datasets. A plausible reason is that the 12306 dataset was collected by trying usernames and passwords from other leaked datasets online and probably cannot represent the real distribution of the entire password dataset of 12306. Guo et al. [34] plot the same picture as we do, in which the relation between log(rank) and degree is linear; this indicates that the degree distribution follows a logarithmic law rather than a power law, under which the relation between log(rank) and log(degree) would be linear.
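The regression described above can be reproduced with an ordinary least-squares fit of degree against $\log_{10}(\text{rank})$. The sketch below is an illustration on synthetic data; the function name `fit_log_rank` is hypothetical, and whether the authors included zero-degree vertices in their fit is not stated, so no assumption about that is built in.

```python
import math


def fit_log_rank(degrees):
    """Least-squares fit of PGDD(r) = k * log10(r) + b; returns (k, b, r_squared)."""
    ys = sorted(degrees, reverse=True)                  # ys[r - 1] is the r-th largest degree
    xs = [math.log10(r) for r in range(1, len(ys) + 1)]
    n = len(ys)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                                       # slope
    b = mean_y - k * mean_x                             # intercept
    ss_res = sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return k, b, 1.0 - ss_res / ss_tot                  # coefficient of determination


# Synthetic degrees lying exactly on PGDD(r) = -100 * log10(r) + 300.
synthetic = [-100 * math.log10(r) + 300 for r in range(1, 1001)]
k, b, r2 = fit_log_rank(synthetic)
print(round(k, 6), round(b, 6), round(r2, 6))
```

On real degree sequences the points will not lie exactly on a line, and $R^2$ then measures how much of the variance in degree the logarithmic model explains.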

To relate password degree to crackability, we calculate each password's degree and its corresponding guess number based on [20] with a sample size of 100000. We then plot the average guess number of passwords against their degree, as shown in Figure 4. Overall, the average guess number increases as the degree decreases and decreases as the degree increases. We also evaluate the Spearman coefficient for each dataset: it is −0.94 for Rootkit, −0.88 for MySpace, −0.95 for Yahoo, −0.78 for 12306, and −0.93 for PhpBB. The Spearman coefficients further confirm our deduction: they are −0.88 or lower for all datasets except 12306 (probably because its password data are incomplete). Thus there is a strong relation between a password's degree and its guess number; overall, the bigger the degree of a password, the more easily it will be cracked.
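For readers who want to recompute such a coefficient, the Spearman coefficient is the Pearson correlation of the rank vectors. The sketch below uses average ranks for ties; the function names and the toy degree and guess-number values are hypothetical and are not taken from the paper's data.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


degrees = [1, 5, 20, 300, 900]            # toy degrees
avg_guess = [1e12, 1e10, 1e9, 1e6, 1e3]   # toy average guess numbers
print(spearman(degrees, avg_guess))       # -1.0
```

A value near −1 means that the guess number falls almost monotonically as the degree grows, which is the pattern reported above.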

Connectivity. The connectivity of a graph is the minimum number of elements that need to be removed to disconnect the remaining nodes from each other. The elements are either vertices or edges, so there are a vertex connectivity $K(G)$ and an edge connectivity $\lambda(G)$. However, neither the vertex connectivity nor the edge connectivity of any of our selected password datasets exceeds 0, because of the isolated vertices in the password graph. In other words, the password graph is never connected. We list only the value of the vertex connectivity in Table 2. We can further remove the isolated vertices and evaluate the connectivity of the filtered graph, but it still contains isolated clusters, so both the edge and the vertex connectivity of the filtered password graph are zero.
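Both connectivities are zero exactly when the graph is disconnected, which can be checked with a single breadth-first search. The sketch below is illustrative; the function name `is_connected` and the toy graphs are assumptions.

```python
from collections import deque


def is_connected(vertices, edges):
    """True if every vertex is reachable from an arbitrary start vertex."""
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    if not adjacency:
        return True
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        for neighbour in adjacency[queue.popleft()]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(adjacency)


edges = [("password", "password1"), ("123456", "1234567")]
with_isolated = ["password", "password1", "123456", "1234567", "zq8#Lm"]
filtered = ["password", "password1", "123456", "1234567"]
print(is_connected(with_isolated, edges))  # False: an isolated vertex exists
print(is_connected(filtered, edges))       # False: two separate clusters remain
```

The second call mirrors the filtered graph discussed above: removing isolated vertices is not enough as long as separate clusters remain.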

Figure 4: Average guess number against degree for all datasets (Rootkit, MySpace, Yahoo, 12306, and PhpBB). Horizontal axis: degree; vertical axis: average guess number (logarithmic scale).

Figure 5: Fitting result of 12306 ($R^2 = 0.84$). Horizontal axis: rank of password (logarithmic scale); vertical axis: degree of password. Fitted line: $-561.99 \cdot \log(x) + 2665.34$.

Density. In simple graphs, a dense graph is a graph in which the number of edges is close to the maximal number of edges. The opposite, a graph with only a few edges, is a sparse graph. For undirected simple graphs, the graph density $D(G)$ is defined as follows:

$$D(G) = \frac{2|E|}{|V|(|V| - 1)}, \tag{11}$$

where $|E|$ and $|V|$ are the numbers of edges and vertices in the graph, respectively. The value of $D(G)$ ranges from 0 to 1: an empty password graph (no edge in the graph) has density 0, and a complete password graph (an edge between any two vertices) has density 1. The density of a password graph can be used as a rough approximation of the password dataset's strength.
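As a quick check of the density formula, the sketch below recomputes the density of the MySpace graph at threshold 3 from the vertex and edge counts reported in Table 2. The function name `density` is hypothetical.

```python
def density(num_vertices: int, num_edges: int) -> float:
    """Density of a simple undirected graph: 2|E| / (|V| (|V| - 1))."""
    if num_vertices < 2:
        raise ValueError("density is undefined for fewer than two vertices")
    return 2 * num_edges / (num_vertices * (num_vertices - 1))


# MySpace at threshold 3, using |V| = 41514 and |E| = 622087 from Table 2.
d = density(41514, 622087)
print(round(d * 1e5, 2))  # 72.19, in units of 10^-5
```

Even the densest graphs in Table 2 stay below $10^{-3}$, which is why the text describes them as far from dense.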

As shown in Table 2, each refined password graph is far from dense. From this observation we can conclude that a complete password graph has the minimum password dataset strength, because cracking one password successfully facilitates cracking all the remaining passwords, and that an empty password graph has the maximum password dataset strength, because cracking one password successfully provides no further information about the remaining passwords. Note that the exact meaning of this information depends on the selection of the threshold; for example, under a threshold of 3, an edge at a cracked password $p$ indicates that there is another password whose LD distance from $p$ is not more than 3.

Figure 6: Fitting result of MySpace ($R^2 = 0.97$). Horizontal axis: rank of password (logarithmic scale); vertical axis: degree of password. Fitted line: $-121.20 \cdot \log(x) + 531.95$.

Distance. The distance between two vertices $u$ and $v$ of a graph is the minimum length of the paths connecting them. If no such path exists, the distance is set to $+\infty$.

Diameter. The diameter of a graph is the longest shortest path between any two vertices. Because of the isolated vertices, there is always at least one pair of vertices at infinite distance in the password graph. Even if we remove the vertices with degree 0 and calculate again, the diameter of the filtered password graph is still $+\infty$, for the same reason as for connectivity: isolated clusters remain. The result is also listed in Table 2. Besides the diameter, the average path length in the graph is unavoidably $+\infty$ too. Further study could evaluate the diameter and the average path length of the top clusters (which will be finite because each cluster is connected); we leave this for future work.
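Distance and diameter as defined above can be computed with breadth-first search, as in the following sketch. The function names and the toy adjacency maps are assumptions; running one search per vertex is quadratic in the worst case, so this is only meant for a single cluster or a small sample.

```python
import math
from collections import deque


def distances_from(adjacency, source):
    """Breadth-first search: shortest path length from source to every vertex."""
    dist = {v: math.inf for v in adjacency}
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if dist[w] == math.inf:   # not visited yet
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def diameter(adjacency):
    """Largest shortest-path length over all vertex pairs (inf if disconnected)."""
    return max(max(distances_from(adjacency, v).values()) for v in adjacency)


cluster = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}   # a path a - b - c
print(diameter(cluster))                              # 2
with_isolated = dict(cluster, d=set())                # add an isolated vertex d
print(diameter(with_isolated))                        # inf
```

The second call shows why a single isolated vertex is enough to make the diameter of the whole password graph infinite.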

One appealing outcome of visualizing our password graph, as Figure 2 shows intuitively, is that password vertices evidently tend to gather into clusters, a phenomenon also seen in social networks. Although users choose/create passwords independently of one another, in the mass they tend to choose similar passwords. Apparently there is a minority of top clusters and a vast majority of nontop clusters, and the properties of the top clusters are of obvious significance to us. Cluster structure, in the context of networks, refers to the occurrence of groups of nodes that are more densely connected internally than with the rest of the network. Note the distinction between a cluster and a clique in graph theory: a clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent, that is, its induced subgraph is complete, whereas a cluster does not require this completeness property.

Figure 7: Fitting and visualization results of Rootkit, PhpBB, and Yahoo. (a) Visualization result of Rootkit. (b) Visualization result of PhpBB. (c) Visualization result of Yahoo. (d) Fitting result of Rootkit ($R^2 = 0.93$), fitted line $-212.22 \cdot \log(x) + 935.31$. (e) Fitting result of PhpBB ($R^2 = 0.90$), fitted line $-830.95 \cdot \log(x) + 4135.65$. (f) Fitting result of Yahoo ($R^2 = 0.92$), fitted line $-444.70 \cdot \log(x) + 2315.86$. In panels (d) to (f), the horizontal axis is the rank of password (logarithmic scale) and the vertical axis is the degree of password.

In some cases, one vertex may belong to more than one cluster. This can happen in a social network where each vertex represents a person and the clusters represent different groups of people: one cluster for family, another for coworkers, one for friends in the same sports club, and so on. In our password graph, however, we postulate that a vertex can be part of only one cluster, provided that it belongs to any cluster at all. Clusters in the password graph are important because of their metaproperties, that is, the characteristics of clusters as wholes. We propose that a cluster in the password graph acts like a metanode in the graph: intercluster properties differ more significantly than internode properties.
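The text does not specify how clusters are extracted. One simple stand-in that respects the postulate that each vertex belongs to at most one cluster is to treat every connected component with more than one vertex as a cluster and to rank the components by size. The sketch below does this with a union-find structure; the function name `top_clusters` and the toy data are hypothetical, and the authors' actual procedure may differ.

```python
from collections import defaultdict


def top_clusters(vertices, edges, k):
    """Return the k largest connected components (size > 1), largest first."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps trees shallow
            v = parent[v]
        return v

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v        # merge the two components

    groups = defaultdict(list)
    for v in vertices:
        groups[find(v)].append(v)
    # Components of size 1 are isolated vertices, not clusters.
    clusters = [sorted(g) for g in groups.values() if len(g) > 1]
    return sorted(clusters, key=len, reverse=True)[:k]


vertices = ["password", "password1", "passw0rd", "123456", "1234567", "x9$Kq"]
edges = [("password", "password1"), ("password", "passw0rd"),
         ("123456", "1234567")]
print(top_clusters(vertices, edges, k=1))
```

Community-detection methods that split a large component into several dense groups would be an alternative reading of "cluster"; connected components are simply the coarsest choice.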

Top password lists [39, 40] are generally used as banned lists to prevent users from choosing common passwords and circumventing security policies. Top password structures [9] are used specifically to make password cracking more efficient. We propose, for the first time, the concept of top clusters to replace top password lists and top password structures outright, because top clusters describe weak passwords in a password dataset more accurately and precisely: not only are the passwords that occur in the top password list, or whose structures appear among the top password structures, weak, but so are all the passwords belonging to the same cluster. Table 4 lists a segment of four clusters in Rootkit. Clusters 1, 2, and 4 are common clusters, which can be found in other password graphs too; cluster 3 appears only in Rootkit and is a site-specific cluster. Undoubtedly, they are all weak passwords and should be added to the banned list of an Internet service provider. Combining Table 4 with Figures 2, 3, and 7, we conclude that blocking top clusters enhances the ability of a password dataset to resist trawling attacks.

Table 4: A segment of four clusters in Rootkit.

| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|
| password | 123456 | rootkit | qwerty |
| passwordx | 123456x | irootkit | qwerty |
| password0 | 1234569 | 1rootkit | qwertyu |
| Password | 12345r | rootkit2 | qwertyz |
| Password | 123456 | rootkit3 | qwerty |
| password0 | 12345s | rootkit | qwertyr |
| passw0rk | 1234566 | rootkit | qwerta |
| password | 1234565 | 4rootkit | qwerty7 |
| p4s5w0rd | 1234569 | rrootkit | qwerty9 |
| pssword | 1234561 | rootKit | qwerty111 |
| Password | 12345678 | RootKit | qwerty12 |

One of the many practical applications of top clusters is the building of effective mangling rules; these rules are a bit different from those used by hashcat [24] and JTR [23]. To obtain effective mangling rules from a dataset of passwords, one can compare every two passwords in the same top cluster. Our derived rules are exactly the operations allowed in the Levenshtein distance: insertion, deletion, and substitution. Substitution is a superset of the Leet operation, and insertion is a superset of prepending and appending. Our visualization method could also be used as a password strength meter: after a user finishes entering his or her new password, we evaluate whether it belongs to one of the existing top clusters; if it does, we reject the password and ask for a new one; otherwise, we accept it.
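The strength-meter idea can be sketched as follows. How membership in a top cluster is tested is not spelled out in the text; this sketch assumes that a candidate "belongs" to a cluster when its Levenshtein distance to at least one member is not more than the threshold $t$. The function names and the sample clusters are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def accept_password(candidate, top_clusters, t=3):
    """Accept only if the candidate is farther than t from every cluster member."""
    return all(levenshtein(candidate, member) > t
               for cluster in top_clusters
               for member in cluster)


clusters = [["password", "passwordx", "password0"],
            ["rootkit", "irootkit", "rootkit2"]]
print(accept_password("passw0rd1", clusters))   # False: 2 edits from "password"
print(accept_password("vT7#qLm2zR", clusters))  # True: far from every member
```

A production meter would index the cluster members so that a candidate need not be compared with every member, and it would still have to be combined with other checks, since distance from known clusters alone does not make a password strong.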

7. Conclusions

In this paper, we have explored three possible relationships between passwords (modification-based, similarity-based, and probability-based) to aid the construction of the password graph. On the basis of graph theory and previous surveys, and regarding passwords as vertices, we select the modification-based relationship and fix a threshold value of 3 to define the edges of the graph. Our password graph enables the introduction of concepts from graph theory, including degree, density, distance, diameter, and cluster.

The password graph has some novel built-in implications for password research; we use it as an alternative analysis method for password datasets. We find, surprisingly, that for threshold 3 a password can be similar to as many as 3380 passwords in the PhpBB dataset and that passwords evidently tend to gather into clusters, just as vertices do in a social network. Moreover, by applying linear regression, we discover that the degrees of the password graph follow a logarithmic distribution. Our findings suggest that the top password list in adaptive password checking should be replaced with top clusters to reach a higher level of security. The password graph could also enable the creation of effective mangling rules that apply universally in dictionary attacks; this needs only a large training dataset and no a priori knowledge of user habits or personal information. Finally, the password graph can be used for a new kind of password strength meter.

Notations

$V$: Vertex set
$E$: Edge set
$\Delta(G)$: $\max\{d(p) \mid p \in V\}$, the maximum degree of all vertices
$\delta(G)$: $\min\{d(p) \mid p \in V\}$, the minimum degree of all vertices
$Z(G)$: $|\{p \in V \mid d(p) = 0\}|$, the number of vertices whose degree is 0
$D(G)$: $2|E| / (|V|(|V| - 1))$, the density of the password graph
$K(G)$: Vertex connectivity
$\lambda(G)$: Edge connectivity

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors thank Dr. Ding Wang for his help and support. This research was supported by the National Key R&D Program of China under no. 2016YFB0800603 and no. 2017YFB1200700 and the National Natural Science Foundation of China (NSFC) under Grant no. 61472016.

References

[1] Z. Zhao, G.-J. Ahn, and H. Hu, “Picture gesture authentication: empirical analysis, automated attacks, and scheme evaluation,” ACM Transactions on Information and System Security, vol. 17, no. 4, article 14, 37 pages, 2015.

[2] A. K. Jain, A. Ross, and S. Pankanti, “Biometrics: a tool for information security,” IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 125–143, 2006.

[3] N. Zheng, A. Paloski, and H. Wang, “An efficient user verification system via mouse movements,” in Proceedings of the 18th ACM Conference on Computer and Communications Security, pp. 139–150, ACM, October 2011.

[4] D. Florencio and C. Herley, “A large-scale study of web password habits,” in Proceedings of the 16th International World Wide Web Conference (WWW ’07), pp. 657–666, ACM, May 2007.

[5] J. Yan, A. F. Blackwell, R. J. Anderson, and A. Grant, “Password memorability and security: empirical results,” IEEE Security & Privacy, vol. 2, no. 5, pp. 25–31, 2004.

[6] W. Cheswick, “Rethinking passwords,” Communications of the ACM, vol. 56, no. 2, pp. 40–44, 2013.

[7] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, “Passwords and the evolution of imperfect authentication,” Communications of the ACM, vol. 58, no. 7, pp. 78–87, 2015.

[8] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, “The quest to replace passwords: a framework for comparative evaluation of web authentication schemes,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP ’12), pp. 553–567, IEEE, San Francisco, Calif, USA, May 2012.

[9] D. Wang, H. Cheng, P. Wang, and X. Huang, “Understanding passwords of Chinese users: characteristics, security and implications, CACR report,” in Proceedings of the ChinaCrypt 2015, 2015, http://t.cn/RG8RacH.

[10] A. Adams and M. A. Sasse, “Users are not the enemy,” Communications of the ACM, vol. 42, no. 12, pp. 40–46, 1999.

[11] A. Beautement, M. A. Sasse, and M. Wonham, “The compliance budget: managing security behaviour in organisations,” in Proceedings of the Workshop on New Security Paradigms (NSPW ’08), pp. 47–58, ACM, Lake Tahoe, Calif, USA, September 2008.

[12] D. Wang, D. He, H. Cheng, and P. Wang, “FuzzyPSM: a new password strength meter using fuzzy probabilistic context-free grammars,” in Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’16), pp. 595–606, IEEE, July 2016.

[13] L. Wang, Y. Li, and K. Sun, “Amnesia: a bilateral generative password manager,” in Proceedings of the 36th IEEE International Conference on Distributed Computing Systems (ICDCS ’16), pp. 313–322, IEEE, June 2016.

[14] D. McCarney, D. Barrera, J. Clark, S. Chiasson, and P. C. Van Oorschot, “Tapas: design, implementation, and usability evaluation of a password manager,” in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 89–98, ACM, December 2012.


[15] B. Ives, K. R. Walsh, and H. Schneider, “The domino effect of password reuse,” Communications of the ACM, vol. 47, no. 4, pp. 75–78, 2004.

[16] D. Wang, H. Cheng, P. Wang, X. Huang, and G. Jian, “Zipf’s law in passwords,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 11, pp. 2776–2791, 2017.

[17] M. Weir, S. Aggarwal, B. De Medeiros, and B. Glodek, “Password cracking using probabilistic context-free grammars,” in Proceedings of the 30th IEEE Symposium on Security and Privacy, pp. 391–405, IEEE, May 2009.

[18] S. Houshmand and S. Aggarwal, “Building better passwords using probabilistic techniques,” in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 109–118, ACM, December 2012.

[19] R. Veras, C. Collins, and J. Thorpe, “On the semantic patterns of passwords and their security impact,” in Proceedings of the Network and Distributed System Security Symposium (NDSS ’14), San Diego, Calif, USA, 2014.

[20] M. Dell’Amico and M. Filippone, “Monte Carlo strength evaluation: fast and reliable password checking,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 158–169, ACM, October 2015.

[21] R. Morris and K. Thompson, “Password security: a case history,” Communications of the ACM, vol. 22, no. 11, pp. 594–597, 1979.

[22] D. V. Klein, “Foiling the cracker: a survey of, and improvements to, password security,” in Proceedings of the 2nd USENIX Security Workshop, pp. 5–14, 1990.

[23] John the ripper password cracker, openwall, http://www.openwall.com/john/.

[24] hashcat, advanced password recovery, https://hashcat.net/hashcat/.

[25] J. Ma, W. Yang, M. Luo, and N. Li, “A study of probabilistic password models,” in Proceedings of the 35th IEEE Symposium on Security and Privacy (SP ’14), pp. 689–704, IEEE, May 2014.

[26] R. Veras, J. Thorpe, and C. Collins, “Visualizing semantics in passwords: the role of dates,” in Proceedings of the 9th International Symposium on Visualization for Cyber Security, pp. 88–95, ACM, October 2012.

[27] M. Weir, S. Aggarwal, M. Collins, and H. Stern, “Testing metrics for password creation policies by attacking large sets of revealed passwords,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 162–175, ACM, October 2010.

[28] C. Castelluccia, M. Durmuth, and D. Perito, “Adaptive password-strength meters from Markov models,” in Proceedings of the 19th Annual Network & Distributed System Security Symposium (NDSS ’12), February 2012.

[29] J. Bonneau, “The science of guessing: analyzing an anonymized corpus of 70 million passwords,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP ’12), pp. 538–552, IEEE, San Francisco, Calif, USA, May 2012.

[30] Z. Li, W. Han, and W. Xu, “A large-scale empirical analysis of Chinese web passwords,” in Proceedings of the USENIX Security Symposium, pp. 559–574, 2014.

[31] B. L. Riddle, M. S. Miron, and J. A. Semo, “Passwords in use in a university timesharing environment,” Computers & Security, vol. 8, no. 7, pp. 569–579, 1989.

[32] R. Shay, S. Komanduri, P. G. Kelley et al., “Encountering stronger password requirements: user attitudes and behaviors,” in Proceedings of the 6th Symposium on Usable Privacy and Security, vol. 2, ACM, July 2010.

[33] Y. Zhang, F. Monrose, and M. K. Reiter, “The security of modern password expiration: an algorithmic framework and empirical analysis,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 176–186, October 2010.

[34] X. Guo, H. Chen, X. Liu, X. Xu, and Z. Chen, “The scale-free network of passwords: visualization and estimation of empirical passwords,” https://arxiv.org/abs/1511.08324.

[35] S. Perez, “Recently confirmed Myspace hack could be the largest yet,” May 2016.

[36] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, 1966.

[37] A. Das, J. Bonneau, M. Caesar, N. Borisov, and X. Wang, “The tangled web of password reuse,” in Proceedings of the Network and Distributed System Security Symposium (NDSS ’14), vol. 14, pp. 23–26, 2014.

[38] graph-tool: efficient network analysis with python, https://graph-tool.skewed.de/.

[39] A. Vance, “If your password is 123456, just make it hackme,” The New York Times, vol. 20, p. A1, 2010.

[40] S. Schechter, C. Herley, and M. Mitzenmacher, “Popularity is everything: a new approach to protecting passwords from statistical-guessing attacks,” in Proceedings of the 5th USENIX Conference on Hot Topics in Security, pp. 1–8, USENIX Association, 2010.


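Figure 1: Construction phases. In the example, the raw dataset (password, password, 123456, 123456, 1234567, password1) is deduplicated in Phase 1 to (password, 123456, 1234567, password1), from which Phase 2 builds the graph $G$.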

a vertex 119881 = 119901 | 119901 isin P and (2) editing distancebetween passwords not more than a threshold 119905 forms anedge 119864 = (119906 V) | lev119906V(|119906| |V|) le 119905 and 119906 V isin 119881We call the resulting graph transformed from the passworddataset password graph 119866 This is a very intuitive method toturn a password dataset into graph though there might existsome other approaches to fulfill the same goal (eg one canconsider unique printable ASCII characters as vertices andcharacters forming a password are connected together whichseems reasonable but the practicability and efficiency of itremain unknown)

51 Construction Procedures We describe our constructionprocedures of password graph with more details in this sub-section The construction procedures are organized orderlyinto two main phases and are illustrated in Figure 1

Phase1 Starting with raw password dataset we measure thefrequencies of every unique password in the password datasetS and associate the frequency values with correspondingpasswords subsequentlywe remove the duplicated passwordsto obtain the password dataset P so that two passwordspicked randomly from P are invariably different Everypassword inP stands for a vertex in graph119866 in other wordsthe vertex set 119881 for 119866 is generated finally after this phase

Phase2 The LD distance for every two passwords in thedataset P is calculated An edge is formed if the LD valueof its end vertices is not more than threshold 119905 We provide3 thresholds experiment results in this paper (the selectionof threshold 119905 is an interestingly open question that needsfurther research)The edge set 119864 for our graph119866 is generatedafter this phase It is worth noting that themaximum possiblenumber of edges is119873(119873minus1)divide2where119873 is the size of setP itmeans that the resulting password graph is a complete graphAnother extreme case happens if there are no edges at all in119866which means the resulting password graph is an empty graph

To observe the constructed password graph intuitively, we have visualized it with the Python module graph-tool [38]. Graph-tool is an efficient Python module for manipulation and statistical analysis of graphs (a.k.a. networks). We export the resulting graph to a file format supported by graph-tool (the GraphML file format, without isolated vertices) and then visualize it; with some deliberate tuning of the parameters, this produces an intricate graph image.
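The export step can be illustrated with the Python standard library alone. The sketch below is an assumption about how such a file might be produced, not the authors' export code: the function name `to_graphml` and the node attribute `label` are hypothetical choices. Because only vertices that appear in at least one edge are written, isolated vertices are left out automatically.

```python
import xml.etree.ElementTree as ET


def to_graphml(edges):
    """Serialize undirected edges as GraphML text; isolated vertices are never written."""
    root = ET.Element("graphml", {"xmlns": "http://graphml.graphdrawing.org/xmlns"})
    # Declare a string attribute that stores the password text on each node.
    ET.SubElement(root, "key", {"id": "label", "for": "node",
                                "attr.name": "label", "attr.type": "string"})
    graph = ET.SubElement(root, "graph", {"id": "G", "edgedefault": "undirected"})
    # Collect only the passwords that take part in some edge.
    names = sorted({name for edge in edges for name in edge})
    node_id = {name: f"n{i}" for i, name in enumerate(names)}
    for name in names:
        node = ET.SubElement(graph, "node", {"id": node_id[name]})
        ET.SubElement(node, "data", {"key": "label"}).text = name
    for u, v in edges:
        ET.SubElement(graph, "edge", {"source": node_id[u], "target": node_id[v]})
    return ET.tostring(root, encoding="unicode")


xml_text = to_graphml([("password", "passw0rd"), ("password", "password1")])
```

The resulting text can be written to a `.graphml` file and opened with a GraphML-aware tool for layout and drawing.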

Figure 2: Visualization result of 12306.

The visualization result of 12306 is shown in Figure 2 and that of MySpace in Figure 3; the visualization results of the remaining datasets are collected in Figure 7. The internal structure and characteristics of the password dataset $\mathcal{P}$ are represented, to some extent, by its visualization, and some qualitative properties can be read from these images: a vertex in the graph $G$ may be densely connected to other vertices; password vertices evidently tend to gather together into clusters (12306 has two apparently big clusters, while MySpace has a very large number of small clusters); and vertices spread over the entire canvas, which reflects to some degree that passwords are spread over the whole possible password space. A more in-depth quantitative analysis showing further inner properties of the password dataset is performed in Section 6.

The frequencies of passwords in the password dataset $\mathcal{P}$ are not used in our construction. Our final password graph is a simple undirected graph without any loops or parallel edges. One feasible way to exploit password frequency is to attach one loop edge to the vertex of a password for each repetition of that password. Alternatively, taking password frequency into consideration could establish more than one edge between two vertices, which is called a parallel edge in graph theory. In short, these operations would turn our simple password graph into a multigraph. Such an augmented construction of the password graph would probably embed more original and subtle properties into the final password graph but would also introduce higher time and space complexity; we leave these feasible extensions of the implementation for further study.

Security and Communication Networks 7

Figure 3: Visualization result of MySpace.

6. Graph Concepts in Password Graph

The transformation from password dataset to password graph allows concepts from graph theory to be carried over to the password graph. In this section we revisit the relevant graph-theoretic terminology; extending these concepts to our password graph helps us analyze it more concretely and quantitatively.

Degree. The degree of a vertex $v$, written as $d(v)$, is the number of edges with $v$ as an end vertex. Directed graphs distinguish in-degree from out-degree, but because our password graph is a simple undirected graph, there is no need to differentiate them. In the password graph, the degree of a password indicates how many unique passwords in the password dataset $\mathcal{P}$ are similar to it; we denote the degree of password $p$ as $d(p)$.

Statistics of degree for the password datasets in our experiments are summarized in Table 2. The definitions of the notations used in the table are given in Notations; we also explain them in the main body of the paper. For each dataset, the table shows experiment results for 3 thresholds $t$ (from 1 to 3). The column titled $\delta(G)$ is filled with zeroes; that is, the minimum degree in every password dataset is 0, meaning that every password dataset contains isolated vertices. The number of passwords whose degree is zero decreases as the threshold increases. Some studies have attempted to determine the number of users who appear to be choosing passwords that are random or meaningless to human beings. We think, on the other hand, that the number of isolated password vertices (passwords with no other password similar to them) is a more meaningful figure for the same purpose and is easier to evaluate. Table 3 lists a segment of the passwords whose degree is zero in 12306 under threshold 3. Although some of them share a common substring, they are not similar to each other, because the number of edits between them exceeds the threshold, so they should be considered individually; the frequencies of these passwords are typically 1.
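The degree-related columns of Table 2, namely $\Delta(G)$, $\delta(G)$, $Z(G)$, and the average degree, can be computed from a vertex list and an edge list roughly as follows. The function name `degree_statistics` and the toy graph are illustrative assumptions only.

```python
def degree_statistics(vertices, edges):
    """Return max degree, min degree, number of isolated vertices, and average degree."""
    degree = {v: 0 for v in vertices}
    for u, v in edges:
        # Each undirected edge contributes one to the degree of both end vertices.
        degree[u] += 1
        degree[v] += 1
    values = list(degree.values())
    return {
        "max_degree": max(values),                      # Delta(G)
        "min_degree": min(values),                      # delta(G)
        "isolated": sum(1 for d in values if d == 0),   # Z(G)
        "avg_degree": 2 * len(edges) / len(values),     # 2|E| / |V|
    }


# Hypothetical toy graph: "d" has no similar password and is therefore isolated.
stats = degree_statistics(["a", "b", "c", "d"], [("a", "b"), ("a", "c")])
```

A single isolated vertex is enough to force the minimum degree to 0, which is why the $\delta(G)$ column of Table 2 is zero whenever $Z(G)$ is positive.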

Empirically, a threshold value of 3 is reasonable for analyzing the experiment results, so the following analyses are all based on threshold 3; one reaches the same conclusions when the threshold is 1 or 2 or other viable values. Many people might assume that a larger set $\mathcal{P}$ necessarily yields more edges. In our experiment results, however, PhpBB has the largest number of edges of all password graphs, while Yahoo has the largest number of vertices. Therefore, more password vertices do not necessarily mean more edges. To our surprise, a password can be similar to up to 3380 passwords (in the PhpBB dataset). The maximum degree increases rapidly as the threshold increases; for example, with threshold 1 PhpBB's maximum degree is only 65, but it jumps to 3380 when the threshold is 3. More strikingly, a password in PhpBB is similar to around 150 passwords on average, and even the smallest average number of similar passwords among the 5 datasets is nearly 30.

In principle, for a specific vertex $v$, the maximum possible degree $\phi(t)$ is fixed once the threshold $t$ is determined:

$$\phi(t) = |S_{\mathrm{dis}1}(t)| + |S_{\mathrm{dis}2}(t)| + |S_{\mathrm{dis}3}(t)|, \tag{9}$$

where $S_{\mathrm{dis}1}(t)$, $S_{\mathrm{dis}2}(t)$, and $S_{\mathrm{dis}3}(t)$ are the sets containing the passwords whose LD distance from $v$ is 1, 2, and 3, respectively, under threshold $t$. Accordingly, the results of our experiments show only a lower bound on the number of similar passwords (the threshold we apply is smaller than the actual or typical one); the real situation may be worse than we imagine.

The degree distribution of the password graph is another thought-provoking characteristic, like the length distribution in password datasets. Our study shows that the relation between a password's degree and its rank, when passwords are sorted in descending order of frequency, is more complicated than a simple distribution. Some passwords that occur more frequently have fewer passwords similar to them, and vice versa. This is particularly evident in 12306: the password “1qaz2wsx” occurs 80 times but its degree is only 6, whereas the password “h123456” has 962 passwords similar to it but its frequency is only 4. Hence the relation between password frequency and degree in the password graph fluctuates widely (there would be many breaks in the distribution plot). After analyzing our five password datasets, we find that the degree distribution of a password graph can be approximated by the following equation:

$$\mathrm{PGDD}(r) = k \cdot \log(r) + b, \tag{10}$$

where $\mathrm{PGDD}(r)$ is the $r$th largest degree and the base of the log is 10.


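Table 2: Statistics of experiment results.

| Password dataset | $\lvert V \rvert$ | $t$ | $\lvert E \rvert$ | $\Delta(G)$ | $\delta(G)$ | $Z(G)$ | Avg. degree | $D(G)$ ($\times 10^{-5}$) | $K(G)$ | $d(G)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| MySpace | 41514 | 1 | 20119 | 58 | 0 | 28119 | 0.97 | 2.33 | 0 | $+\infty$ |
| | | 2 | 111884 | 141 | 0 | 16748 | 5.39 | 12.98 | 0 | $+\infty$ |
| | | 3 | 622087 | 477 | 0 | 7290 | 29.97 | 72.19 | 0 | $+\infty$ |
| PhpBB | 184341 | 1 | 81135 | 65 | 0 | 130767 | 0.88 | 0.48 | 0 | $+\infty$ |
| | | 2 | 1204816 | 460 | 0 | 70187 | 13.07 | 7.09 | 0 | $+\infty$ |
| | | 3 | **13836682** | **3380** | 0 | 29453 | **150.12** | **81.44** | 0 | $+\infty$ |
| Rootkit | 56859 | 1 | 9281 | 41 | 0 | 47093 | 0.33 | 0.57 | 0 | $+\infty$ |
| | | 2 | 105275 | 133 | 0 | 30564 | 3.70 | 6.51 | 0 | $+\infty$ |
| | | 3 | 1062502 | 953 | 0 | 15792 | 37.37 | 65.73 | 0 | $+\infty$ |
| Yahoo | 342510 | 1 | 140092 | 66 | 0 | 247445 | 0.84 | 0.24 | 0 | $+\infty$ |
| | | 2 | 1394200 | 327 | 0 | 147126 | 8.62 | 2.38 | 0 | $+\infty$ |
| | | 3 | 13686843 | 2127 | 0 | **77122** | 79.92 | 23.33 | 0 | $+\infty$ |
| 12306 | 117808 | 1 | 51299 | 63 | 0 | 95057 | 0.87 | 0.74 | 0 | $+\infty$ |
| | | 2 | 676011 | 506 | 0 | 56162 | 11.48 | 9.74 | 0 | $+\infty$ |
| | | 3 | 5311460 | 2301 | 0 | 20640 | 90.17 | 76.54 | 0 | $+\infty$ |

Note. (1) Bold figures are the maximum value of each category for threshold 3. (2) The definitions of the notations are summarized in Notations.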

Table 3: A segment of passwords with 0 degrees in 12306.

| Number | Password | Frequency |
|---|---|---|
| 1125 | 04768130093 | 1 |
| 1128 | 048398531 | 1 |
| 1138 | 0501154315 | 1 |
| 1139 | 0501170228 | 1 |
| 36909 | 0155402zj | 1 |
| 69622 | 013307aini | 1 |
| 120788 | 015210755s | 1 |

As explained in Section 3, we employ the least-squares fitting method in linear regression to approximate the degree distribution of the password graph. The fitting results of the 12306 and MySpace password datasets are shown in Figures 5 and 6. The horizontal axis is the rank and the vertical axis is the degree of the password. The other three fitting results are shown together in Figure 7. The values of $k$, $b$, and the coefficient of determination of each regression are displayed in the respective plots. Every linear regression except that for 12306 has an $R^2$ of at least 0.90, which is close to 1 and thus indicates a sound fit. For the 12306 dataset, $R^2$ is about 0.84, which is not ideal but acceptable, though not as good as that of the other datasets. A plausible reason is that the 12306 dataset was collected by trying usernames and passwords from other leaked datasets online and probably cannot represent the real distribution of the entire password dataset of 12306. Guo et al. [34] plot the same picture as we do, in which the relation between log(rank) and degree is linear; this indicates that the degree distribution follows a logarithmic law rather than a power law. Under a power law, the relation between log(rank) and log(degree) would be linear.
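To make the fitting step concrete, the following sketch fits degree against $\log_{10}(\text{rank})$ by ordinary least squares and reports the coefficient of determination. It is an illustration under our own assumptions rather than the code behind Figures 5 to 7: the function name `fit_log_rank` is hypothetical, and the input is synthetic data generated to follow the logarithmic law exactly.

```python
import math


def fit_log_rank(degrees):
    """Least-squares fit of degree = k * log10(rank) + b; returns (k, b, r_squared)."""
    ranked = sorted(degrees, reverse=True)  # rank 1 is the largest degree
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = ranked
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = s_xy / s_xx
    b = mean_y - k * mean_x
    # R^2 compares the residual error of the fit with the total variance.
    ss_res = sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return k, b, 1 - ss_res / ss_tot


# Synthetic degrees that follow 100 - 20 * log10(rank) exactly (not real data).
synthetic = [100 - 20 * math.log10(rank) for rank in range(1, 1001)]
k, b, r_squared = fit_log_rank(synthetic)
```

On real data the fit is not exact; an $R^2$ close to 1 indicates that the straight line in the semi-logarithmic plot explains most of the variation in degree.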

To relate password degree to crackability, we calculate each password's degree and its corresponding guess number based on [20] with a sampling size of 100000. We then plot the average guess number of passwords against their degree, as shown in Figure 4. The overall average guess number increases as the degree decreases; equivalently, it decreases as the degree increases. We also evaluate the Spearman coefficient for each dataset: −0.94 for Rootkit, −0.88 for MySpace, −0.95 for Yahoo, −0.78 for 12306, and −0.93 for PhpBB. The Spearman coefficients further confirm our deduction: the coefficients for all datasets are −0.88 or lower except for 12306 (which is probably caused by its incomplete password data). Thus there is a strong relation between a password's degree and its corresponding guess number: overall, the larger the degree of a password, the more easily it is cracked.
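The Spearman coefficient is the Pearson correlation of the ranks of the two variables. The sketch below shows one way to compute it without external libraries; the function names and the sample numbers are hypothetical and do not reproduce the guess numbers of [20].

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda index: values[index])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rank_x) / n, sum(rank_y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rank_x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in rank_y))
    return covariance / (spread_x * spread_y)


# Made-up example: the average guess number falls as the degree rises.
degrees = [1, 5, 20, 100, 400]
avg_guess_numbers = [1e12, 1e10, 1e9, 1e6, 1e3]
rho = spearman(degrees, avg_guess_numbers)
```

A value near −1, as reported for most of the datasets, means that the ordering by degree is almost exactly the reverse of the ordering by average guess number.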

Connectivity. The connectivity of a graph is the minimum number of elements that need to be removed to disconnect the remaining nodes from each other. The elements may be vertices or edges, so there are vertex connectivity $K(G)$ and edge connectivity $\lambda(G)$. However, neither the vertex connectivity nor the edge connectivity of any of our password graphs is greater than 0, because of the isolated vertices in the password graph. In other words, the password graph is never connected. We list only the vertex connectivity in Table 2. We can further remove the isolated vertices and then evaluate the connectivity of the filtered graph, but there are still isolated clusters in the graph, so both the edge and the vertex connectivity are zero for the filtered password graph as well.
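Whether a password graph is connected can be checked by counting its connected components: more than one component means that both vertex and edge connectivity are 0. The following union-find sketch is an illustrative assumption (the function name `connected_components` and the toy graph are hypothetical), not the procedure used to fill Table 2.

```python
def connected_components(vertices, edges):
    """Group vertices into connected components, largest component first."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the representative, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components
    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical toy graph with two clusters and one isolated vertex.
components = connected_components(
    ["a", "b", "c", "d", "e", "f"],
    [("a", "b"), ("b", "c"), ("d", "e")],
)
is_connected = len(components) == 1
```

Removing the isolated vertex "f" still leaves two separate components here, which mirrors the observation that the filtered password graph remains disconnected.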

Figure 4: Average guess number against degree for all datasets (Rootkit, MySpace, Yahoo, 12306, and PhpBB). Horizontal axis: degree; vertical axis: average guess number.

Figure 5: Fitting result of 12306 ($R^2 = 0.84$). Horizontal axis: rank of password (logarithmic scale); vertical axis: degree of password; fitted line: $-561.99 \cdot \log(x) + 2665.34$.

Density. In simple graphs, a dense graph is a graph in which the number of edges is close to the maximal number of edges. The opposite, a graph with only a few edges, is a sparse graph. For undirected simple graphs, the graph density $D(G)$ is defined as follows:

$$D(G) = \frac{2|E|}{|V|(|V| - 1)}, \tag{11}$$

where $|E|$ and $|V|$ are the numbers of edges and vertices, respectively, in the graph. The value of $D(G)$ lies between 0 and 1: an empty password graph has density 0 (no edge in the graph), and a complete password graph has density 1 (any two vertices in the graph have an edge between them). The density of a password graph can be used as a rough approximation of the password dataset's strength.
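As a quick check of the density formula, the sketch below evaluates it for the two extreme cases and for the MySpace graph at threshold 3, taking $|V| = 41514$ and $|E| = 622087$ as read from Table 2. The function name `density` is our own choice for illustration.

```python
def density(num_vertices, num_edges):
    """Density of a simple undirected graph: 2|E| / (|V| (|V| - 1))."""
    if num_vertices < 2:
        raise ValueError("density needs at least two vertices")
    return 2 * num_edges / (num_vertices * (num_vertices - 1))


empty_graph = density(4, 0)     # no edges at all
complete_graph = density(4, 6)  # all 4 * 3 / 2 = 6 possible edges present

# MySpace at threshold 3, with |V| and |E| as read from Table 2.
myspace = density(41514, 622087)
myspace_scaled = myspace * 1e5  # same x10^-5 unit as the D(G) column
```

The scaled value agrees with the 72.19 listed for MySpace in the $D(G)$ column, which shows how sparse even the threshold-3 graph is.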

As shown in Table 2, each password graph is far from being a dense graph. The two extreme cases illustrate why density reflects strength: a complete password graph has the minimum password dataset strength, because cracking one password successfully facilitates cracking all of the remaining passwords, whereas an empty password graph has the maximum password dataset strength, because cracking one password successfully provides no further information about the remaining passwords. Note that the exact meaning of this information depends on the choice of threshold; for example, with a threshold of 3, a cracked password $p$ indicates that there are passwords whose LD distance from $p$ is not more than 3.

Figure 6: Fitting result of MySpace ($R^2 = 0.97$). Horizontal axis: rank of password (logarithmic scale); vertical axis: degree of password; fitted line: $-121.20 \cdot \log(x) + 531.95$.

Distance. The distance between two vertices $u$ and $v$ of a graph is the minimum length of the paths connecting them. If no such path exists, the distance is set equal to $+\infty$.

Diameter. The diameter of a graph is the longest shortest path between any two vertices. Because of the isolated vertices, there is always at least one pair of vertices at infinite distance in the password graph. Even if we remove the vertices with degree 0 and calculate again, the diameter of the filtered password graph is still $+\infty$, for the same reason as for connectivity, namely the existence of isolated clusters; the result is also listed in Table 2. Like the diameter, the average path length in the graph is unavoidably $+\infty$. Further study could evaluate the diameter and average path length of the top clusters (these will not be infinite, because each cluster is connected); we leave this for future work.
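Distance and diameter can be illustrated with a breadth-first search from every vertex. This is a small sketch under our own assumptions (the function name `diameter` and the toy graphs are hypothetical); running one search per vertex is far too slow for graphs of the size studied here and is shown only to make the definitions concrete.

```python
import math
from collections import deque


def diameter(vertices, edges):
    """Longest shortest path in an undirected graph; math.inf if it is disconnected."""
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    longest = 0
    for source in adjacency:
        # Breadth-first search yields shortest path lengths in an unweighted graph.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor not in distance:
                    distance[neighbor] = distance[current] + 1
                    queue.append(neighbor)
        if len(distance) < len(adjacency):
            return math.inf  # some vertex cannot be reached from this source
        longest = max(longest, max(distance.values()))
    return longest


path_edges = [("a", "b"), ("b", "c"), ("c", "d")]
connected_case = diameter(["a", "b", "c", "d"], path_edges)
with_isolated_vertex = diameter(["a", "b", "c", "d", "e"], path_edges)
```

Applied to a single connected cluster, the same function returns a finite value, which is the quantity proposed above for the top clusters.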

One appealing outcome of visualizing our password graph, as we can see intuitively from Figure 2, is that password vertices evidently tend to gather together into clusters, a phenomenon also observed in social networks. Although users choose or create passwords independently of one another, in the mass they tend to choose similar passwords. There is a minority of top clusters and a vast majority of nontop clusters, and the properties of the top clusters are of particular significance to us. Cluster structure in the context of networks refers to the occurrence of groups of nodes that are more densely connected internally than with the rest of the network. Note the distinction between a cluster and a clique in graph theory: a clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent, that is, its induced subgraph is complete, whereas a cluster does not require completeness.


Figure 7: Fitting and visualization results of Rootkit, PhpBB, and Yahoo. (a) Visualization result of Rootkit. (b) Visualization result of PhpBB. (c) Visualization result of Yahoo. (d) Fitting result of Rootkit ($R^2 = 0.93$); fitted line: $-212.22 \cdot \log(x) + 935.31$. (e) Fitting result of PhpBB ($R^2 = 0.90$); fitted line: $-830.95 \cdot \log(x) + 4135.65$. (f) Fitting result of Yahoo ($R^2 = 0.92$); fitted line: $-444.70 \cdot \log(x) + 2315.86$. In panels (d) to (f), the horizontal axis is the rank of password (logarithmic scale) and the vertical axis is the degree of password.

In some cases one vertex may belong to more than one cluster. This can happen in a social network where each vertex represents a person and the clusters represent different groups of people: one cluster for family, another for coworkers, one for friends in the same sports club, and so on. In our password graph, however, we postulate that a vertex can be part of only one cluster, provided that it belongs to any cluster at all. Clusters in the password graph are important because of their metaproperties, that is, the characteristics of clusters as a whole. We therefore propose that a cluster in the password graph acts like a metanode in the graph: intercluster properties differ more significantly than internode properties.

Top password lists [39, 40] are generally used to constitute banned lists that prevent users from choosing common passwords and circumventing security policies. Top password structures [9] are specifically used to make the password cracking process efficient. We propose, for the first time, the concept of top clusters to replace the top password list and top password structures, because top clusters are a more accurate and precise description of weak passwords in the context of a password dataset: not only are the passwords that occur in the top password list, or whose structures appear among the top password structures, weak, but so are all the passwords belonging to the same cluster. Table 4 lists a segment of four clusters in Rootkit. Clusters 1, 2, and 4 are common clusters, which can also be found in other password graphs; cluster 3 appears only in Rootkit and is a site-specific cluster. Undoubtedly, they are all weak passwords and should be added to the banned list of an Internet service provider. Combining Table 4 with Figures 2, 3, and 7, we conclude that blocking top clusters enhances the ability of a password dataset to resist trawling attacks.

Table 4: A segment of four clusters in Rootkit.

| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|
| password | 123456 | rootkit | qwerty |
| passwordx | 123456x | irootkit | qwerty |
| password0 | 1234569 | 1rootkit | qwertyu |
| Password | 12345r | rootkit2 | qwertyz |
| Password | 123456 | rootkit3 | qwerty |
| password0 | 12345s | rootkit | qwertyr |
| passw0rk | 1234566 | rootkit | qwerta |
| password | 1234565 | 4rootkit | qwerty7 |
| p4s5w0rd | 1234569 | rrootkit | qwerty9 |
| pssword | 1234561 | rootKit | qwerty11 |
| 1Password | 12345678 | RootKit | qwerty12 |

One of the many practical applications of top clusters is the building of effective mangling rules; those rules are a bit different from those used by hashcat [24] and JTR [23]. To obtain effective mangling rules from a dataset of passwords, one can compare every two passwords in the same top cluster. The rules derived this way are exactly the operations allowed in the Levenshtein distance: insertion, deletion, and substitution. Substitution is a superset of the Leet operation, and insertion is a superset of prepending and appending. Our visualization method could also be used for a password strength meter: after a user finishes entering a new password, we evaluate whether it belongs to one of the existing top clusters; if it does, we reject the password and ask for a new one; otherwise we accept it.
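Comparing two passwords from the same cluster to obtain the underlying edit operations can be sketched by backtracking through the Levenshtein table. The helper `edit_operations` and its tuple format `(operation, position, old_char, new_char)` are hypothetical choices made for this illustration; when several minimal edit scripts exist, the sketch returns just one of them.

```python
def edit_operations(source, target):
    """Return one minimal list of edits that turns source into target."""
    rows, cols = len(source), len(target)
    # dist[i][j] is the Levenshtein distance between source[:i] and target[:j].
    dist = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows + 1):
        dist[i][0] = i
    for j in range(cols + 1):
        dist[0][j] = j
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    # Walk back from the bottom-right corner, recording each non-matching step.
    operations = []
    i, j = rows, cols
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and source[i - 1] == target[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1  # characters match, nothing to record
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            operations.append(("substitute", i - 1, source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            operations.append(("insert", i, "", target[j - 1]))
            j -= 1
        else:
            operations.append(("delete", i - 1, source[i - 1], ""))
            i -= 1
    operations.reverse()
    return operations


leet_rule = edit_operations("password", "passw0rd")
append_rule = edit_operations("rootkit", "rootkit2")
```

Here the first pair yields a single substitution of "o" by "0" (a Leet-style rule) and the second a single insertion of "2" at the end (an appending rule); counting such operations over all pairs in a top cluster would indicate which rules are most common.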

7. Conclusions

In this paper, we have explored possible relationships between passwords (modification-based, similarity-based, and probability-based) to aid the construction of the password graph. On the basis of graph theory and previous surveys, and regarding passwords as vertices, we select the modification-based relationship and fix a threshold value of 3 to define the edges in the graph. Our password graph enables the introduction of concepts from graph theory, including degree, density, distance, diameter, and cluster.

The password graph has some novel implications for password research; we use it as an alternative analysis method for password datasets. We find, surprisingly, that for threshold 3 a password can be similar to up to 3380 passwords in the PhpBB dataset, and that passwords evidently tend to gather together into clusters, just as in a social network. Moreover, by applying linear regression, we discover that the degrees of the password graph follow a logarithmic distribution. Our findings suggest that the top password list in adaptive password checking should be replaced with top clusters to reach a higher level of security. The password graph could also enable the creation of effective mangling rules, applicable universally in dictionary attacks, that need only a large training dataset and no a priori knowledge of user habits or personal information. Finally, the password graph can be used for a new kind of password strength meter.

Notations

$V$: Vertex set
$E$: Edge set
$\Delta(G)$: $\max\{d(p) \mid p \in V\}$, the maximum degree of all vertices
$\delta(G)$: $\min\{d(p) \mid p \in V\}$, the minimum degree of all vertices
$Z(G)$: $|\{p \mid p \in V, d(p) = 0\}|$, the number of vertices whose degree is 0
$D(G)$: $2|E| \div (|V|(|V| - 1))$, density of the password graph
$K(G)$: Vertex connectivity
$\lambda(G)$: Edge connectivity

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors thank Dr. Ding Wang for his help and support. This research was supported by the National Key R&D Program of China under no. 2016YFB0800603 and no. 2017YFB1200700 and the National Natural Science Foundation of China (NSFC) under Grant no. 61472016.

References

[1] Z. Zhao, G.-J. Ahn, and H. Hu, “Picture gesture authentication: empirical analysis, automated attacks, and scheme evaluation,” ACM Transactions on Information and System Security, vol. 17, no. 4, article 14, 37 pages, 2015.

[2] A. K. Jain, A. Ross, and S. Pankanti, “Biometrics: a tool for information security,” IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 125–143, 2006.

[3] N. Zheng, A. Paloski, and H. Wang, “An efficient user verification system via mouse movements,” in Proceedings of the 18th ACM Conference on Computer and Communications Security, pp. 139–150, ACM, October 2011.

[4] D. Florencio and C. Herley, “A large-scale study of web password habits,” in Proceedings of the 16th International World Wide Web Conference (WWW ’07), pp. 657–666, ACM, May 2007.

[5] J. Yan, A. F. Blackwell, R. J. Anderson, and A. Grant, “Password memorability and security: empirical results,” IEEE Security & Privacy, vol. 2, no. 5, pp. 25–31, 2004.

[6] W. Cheswick, “Rethinking passwords,” Communications of the ACM, vol. 56, no. 2, pp. 40–44, 2013.

[7] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, “Passwords and the evolution of imperfect authentication,” Communications of the ACM, vol. 58, no. 7, pp. 78–87, 2015.

[8] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, “The quest to replace passwords: a framework for comparative evaluation of web authentication schemes,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP ’12), pp. 553–567, IEEE, San Francisco, Calif, USA, May 2012.

[9] D. Wang, H. Cheng, P. Wang, and X. Huang, “Understanding passwords of Chinese users: characteristics, security and implications, CACR report,” in Proceedings of the ChinaCrypt 2015, 2015, http://t.cn/RG8RacH.

[10] A. Adams and M. A. Sasse, “Users are not the enemy,” Communications of the ACM, vol. 42, no. 12, pp. 40–46, 1999.

[11] A. Beautement, M. A. Sasse, and M. Wonham, “The compliance budget: managing security behaviour in organisations,” in Proceedings of the Workshop on New Security Paradigms (NSPW ’08), pp. 47–58, ACM, Lake Tahoe, Calif, USA, September 2008.

[12] D. Wang, D. He, H. Cheng, and P. Wang, “FuzzyPSM: a new password strength meter using fuzzy probabilistic context-free grammars,” in Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’16), pp. 595–606, IEEE, July 2016.

[13] L. Wang, Y. Li, and K. Sun, “Amnesia: a bilateral generative password manager,” in Proceedings of the 36th IEEE International Conference on Distributed Computing Systems (ICDCS ’16), pp. 313–322, IEEE, June 2016.

[14] D. McCarney, D. Barrera, J. Clark, S. Chiasson, and P. C. Van Oorschot, “Tapas: design, implementation, and usability evaluation of a password manager,” in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 89–98, ACM, December 2012.


[15] B. Ives, K. R. Walsh, and H. Schneider, “The domino effect of password reuse,” Communications of the ACM, vol. 47, no. 4, pp. 75–78, 2004.

[16] D. Wang, H. Cheng, P. Wang, X. Huang, and G. Jian, “Zipf’s law in passwords,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 11, pp. 2776–2791, 2017.

[17] M. Weir, S. Aggarwal, B. De Medeiros, and B. Glodek, “Password cracking using probabilistic context-free grammars,” in Proceedings of the 30th IEEE Symposium on Security and Privacy, pp. 391–405, IEEE, May 2009.

[18] S. Houshmand and S. Aggarwal, “Building better passwords using probabilistic techniques,” in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 109–118, ACM, December 2012.

[19] R. Veras, C. Collins, and J. Thorpe, “On the semantic patterns of passwords and their security impact,” in Proceedings of the Network and Distributed System Security Symposium (NDSS ’14), San Diego, Calif, USA, 2014.

[20] M. Dell’Amico and M. Filippone, “Monte Carlo strength evaluation: fast and reliable password checking,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 158–169, ACM, October 2015.

[21] R. Morris and K. Thompson, “Password security: a case history,” Communications of the ACM, vol. 22, no. 11, pp. 594–597, 1979.

[22] D. V. Klein, “Foiling the cracker: a survey of, and improvements to, password security,” in Proceedings of the 2nd USENIX Security Workshop, pp. 5–14, 1990.

[23] John the Ripper password cracker, Openwall, http://www.openwall.com/john/.

[24] hashcat: advanced password recovery, https://hashcat.net/hashcat/.

[25] J. Ma, W. Yang, M. Luo, and N. Li, “A study of probabilistic password models,” in Proceedings of the 35th IEEE Symposium on Security and Privacy (SP ’14), pp. 689–704, IEEE, May 2014.

[26] R. Veras, J. Thorpe, and C. Collins, “Visualizing semantics in passwords: the role of dates,” in Proceedings of the 9th International Symposium on Visualization for Cyber Security, pp. 88–95, ACM, October 2012.

[27] M. Weir, S. Aggarwal, M. Collins, and H. Stern, “Testing metrics for password creation policies by attacking large sets of revealed passwords,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 162–175, ACM, October 2010.

[28] C. Castelluccia, M. Durmuth, and D. Perito, “Adaptive password-strength meters from Markov models,” in Proceedings of the 19th Annual Network & Distributed System Security Symposium (NDSS ’12), February 2012.

[29] J. Bonneau, “The science of guessing: analyzing an anonymized corpus of 70 million passwords,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP ’12), pp. 538–552, IEEE, San Francisco, Calif, USA, May 2012.

[30] Z. Li, W. Han, and W. Xu, “A large-scale empirical analysis of Chinese web passwords,” in Proceedings of the USENIX Security Symposium, pp. 559–574, 2014.

[31] B. L. Riddle, M. S. Miron, and J. A. Semo, “Passwords in use in a university timesharing environment,” Computers & Security, vol. 8, no. 7, pp. 569–579, 1989.

[32] R. Shay, S. Komanduri, P. G. Kelley et al., “Encountering stronger password requirements: user attitudes and behaviors,” in Proceedings of the 6th Symposium on Usable Privacy and Security, vol. 2, ACM, July 2010.

[33] Y. Zhang, F. Monrose, and M. K. Reiter, “The security of modern password expiration: an algorithmic framework and empirical analysis,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 176–186, October 2010.

[34] X. Guo, H. Chen, X. Liu, X. Xu, and Z. Chen, “The scale-free network of passwords: visualization and estimation of empirical passwords,” https://arxiv.org/abs/1511.08324.

[35] S. Perez, “Recently confirmed Myspace hack could be the largest yet,” May 2016.

[36] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, 1966.

[37] A. Das, J. Bonneau, M. Caesar, N. Borisov, and X. Wang, “The tangled web of password reuse,” in Proceedings of the Network and Distributed System Security Symposium (NDSS ’14), vol. 14, pp. 23–26, 2014.

[38] graph-tool: efficient network analysis with Python, https://graph-tool.skewed.de.

[39] A. Vance, “If your password is 123456, just make it hackme,” The New York Times, vol. 20, p. A1, 2010.

[40] S. Schechter, C. Herley, and M. Mitzenmacher, “Popularity is everything: a new approach to protecting passwords from statistical-guessing attacks,” in Proceedings of the 5th USENIX Conference on Hot Topics in Security, pp. 1–8, USENIX Association, 2010.

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 7: An Alternative Method for Understanding User-Chosen Passwords

Security and Communication Networks 7

Figure 3 Visualization result of MySpace

construction of password graph would probably embedmoreoriginal and subtle properties into our final password graphbut would also introduce longer time and bigger space com-plexities we leave these feasible extension of implementationfor further study

6 Graph Concepts in Password Graph

Interestingly the transformation from password dataset topassword graph leads to the magical expansion of conceptfromgraph theory into password graph In this sectionwe aregoing to rigorously revisit the terminology in graph theoryExpanding those concepts of terminology into our refinedpassword graph helps to analyze the password graph moreconcretely and quantitatively

Degree The degree of the vertex V written as 119889(V) is thenumber of edges with V as an end vertex Note that thereare concepts of in-degree and out-degree in directed graphbut due to the fact that our refined password graph is asimple undirected graph so there is no need to differentiatethem The concept of degree in password graph implies howmany unique vertices of passwords in password datasetP aresimilar to this password we denote degree of password 119901 as119889(119901)

Statistics of degree for our experimental password datasets are summarized in Table 2. The definitions of the notations used in the table are given in Notations; we also explain them in the main body of the paper. For each dataset, the table reports results for three thresholds $t$ (from 1 to 3). One can easily notice that the column titled $\delta(G)$ is filled with zeroes; that is, the minimum degree in every password dataset is 0, meaning that every password graph contains isolated vertices. The number of passwords whose degree is zero decreases as the threshold increases. Some studies have attempted to determine the number of users who appear to choose passwords that are random or meaningless to human beings. We think, on the other hand, that the number of isolated vertices (passwords with no other password similar to them) is a more meaningful figure for the same goal and is easier to evaluate. Table 3 lists a segment of the passwords whose degree is zero in 12306 under threshold 3. Although some of them share a common substring, they are not actually similar to each other, because the number of edits between them exceeds the threshold, so they should be considered individually; the frequencies of these passwords are typically 1.
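The degree-related columns of Table 2 can be derived from an adjacency structure in a few lines. The sketch below is illustrative only: the dictionary keys and the hand-written toy graph are assumptions, and the graph is taken to be a simple undirected graph stored as a dict of neighbor sets.

```python
def degree_statistics(adjacency):
    """Summarize a simple undirected graph given as {vertex: set_of_neighbors}."""
    degrees = [len(neighbors) for neighbors in adjacency.values()]
    num_vertices = len(degrees)
    num_edges = sum(degrees) // 2  # every edge is counted at both endpoints
    return {
        "V": num_vertices,
        "E": num_edges,
        "max_degree": max(degrees),                       # Delta(G)
        "min_degree": min(degrees),                       # delta(G)
        "isolated": sum(1 for d in degrees if d == 0),    # Z(G)
        "avg_degree": 2 * num_edges / num_vertices,
    }


# Hypothetical toy graph with one isolated vertex.
toy_graph = {
    "password": {"passw0rd", "password1"},
    "passw0rd": {"password"},
    "password1": {"password"},
    "123456": {"1234567"},
    "1234567": {"123456"},
    "zq8#Lm2!vX": set(),
}
stats = degree_statistics(toy_graph)
```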

Empirically, a threshold value of 3 is reasonable for analyzing the experiment results, so the following analyses are all based on threshold 3; one reaches the same conclusions when the threshold is 1, 2, or another viable value. Many people might expect that a larger set $\mathcal{P}$ would definitely yield more edges. In our experiment results, however, PhpBB has the largest number of edges of all password graphs, while Yahoo has the largest number of vertices. Therefore, more password vertices do not necessarily mean more edges. To our surprise, a password can be similar to as many as 3380 passwords (in the PhpBB dataset). The maximum degree increases rapidly as the threshold increases: for example, with threshold 1, PhpBB's maximum degree is only 65, but it explodes to 3380 when the threshold is 3. More strikingly, a password in PhpBB is similar on average to around 150 passwords; even the smallest average number of similar passwords among the 5 datasets is nearly 30.

Note that, in principle, for a specific vertex $v$, the possible maximum degree $\phi(t)$ is fixed once the threshold $t$ is determined:

$$\phi(t) = |S_{\mathrm{dis}1}(t)| + |S_{\mathrm{dis}2}(t)| + |S_{\mathrm{dis}3}(t)|, \tag{9}$$

where $S_{\mathrm{dis}1}(t)$, $S_{\mathrm{dis}2}(t)$, and $S_{\mathrm{dis}3}(t)$ are the sets containing passwords whose LD distance from $v$ is 1, 2, and 3, respectively, under threshold $t$. Accordingly, the results of our experiment show only a lower bound on the number of similar passwords (the threshold we apply is smaller than the actual or typical one); the real situation may be worse than we imagine.

The distribution of password graph degree is another thought-provoking characteristic, like the length distribution in password datasets. Our study shows that the relation between a password's degree and its rank (sorted in descending order of frequency) is more complicated than a simple distribution. Some passwords that occur more frequently have fewer passwords similar to them, and vice versa. This is particularly noticeable in 12306: password "1qaz2wsx" occurs 80 times but its degree is only 6, whereas password "h123456" has 962 passwords similar to it but its frequency is only 4. Hence the correlation between the frequency of a password and its degree in the password graph fluctuates strongly (there would be many breaks in the distribution plot). After analyzing our five password datasets, we found that the degree distribution of a password graph can be approximated by the following equation:

$$\mathrm{PGDD}(r) = k \cdot \log(r) + b, \tag{10}$$

where $\mathrm{PGDD}(r)$ is the $r$th largest degree and the base of the log is 10.
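To make the form of Eq. (10) concrete, the sketch below fits $k$ and $b$ by ordinary least squares on the pairs ($\log_{10} r$, $r$th largest degree) and reports the coefficient of determination. The function name and the synthetic input are assumptions for illustration; the input is generated from an exact logarithmic law, so it says nothing about the real datasets.

```python
import math


def fit_log_rank(degrees):
    """Least-squares fit of degree = k * log10(rank) + b; returns (k, b, r2)."""
    ys = sorted(degrees, reverse=True)                   # r-th largest degree
    xs = [math.log10(r) for r in range(1, len(ys) + 1)]  # log10 of the rank
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    b = mean_y - k * mean_x
    ss_res = sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    return k, b, r2


# Synthetic degrees following an exact logarithmic law (illustrative values only).
synthetic = [500.0 - 100.0 * math.log10(r) for r in range(1, 1001)]
k, b, r2 = fit_log_rank(synthetic)
```

On such noise-free input the fit recovers the generating slope and intercept and $R^2$ is essentially 1; on real degree sequences $R^2$ measures how well the logarithmic form describes the data.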


Table 2: Statistics of experiment results.

Password dataset   $|V|$    $t$   $|E|$      $\Delta(G)$   $\delta(G)$   $Z(G)$   Avg. degree   $D(G)$ ($\times 10^{-5}$)   $K(G)$   $d(G)$
MySpace            41514    1     20119      58            0             28119    0.97          2.33                        0        $+\infty$
                            2     111884     141           0             16748    5.39          12.98                       0        $+\infty$
                            3     622087     477           0             7290     29.97         72.19                       0        $+\infty$
PhpBB              184341   1     81135      65            0             130767   0.88          0.48                        0        $+\infty$
                            2     1204816    460           0             70187    13.07         7.09                        0        $+\infty$
                            3     13836682   3380          0             29453    150.12        81.44                       0        $+\infty$
Rootkit            56859    1     9281       41            0             47093    0.33          0.57                        0        $+\infty$
                            2     105275     133           0             30564    3.70          6.51                        0        $+\infty$
                            3     1062502    953           0             15792    37.37         65.73                       0        $+\infty$
Yahoo              342510   1     140092     66            0             247445   0.84          0.24                        0        $+\infty$
                            2     1394200    327           0             147126   8.62          2.38                        0        $+\infty$
                            3     13686843   2127          0             77122    79.92         23.33                       0        $+\infty$
12306              117808   1     51299      63            0             95057    0.87          0.74                        0        $+\infty$
                            2     676011     506           0             56162    11.48         9.74                        0        $+\infty$
                            3     5311460    2301          0             20640    90.17         76.54                       0        $+\infty$

Note: (1) Bold figures are the maximum value of each category for threshold 3. (2) The definitions of the notations are summarized in Notations.

Table 3: A segment of passwords with degree 0 in 12306.

Number   Password       Frequency
1125     04768130093    1
1128     048398531      1
1138     0501154315     1
1139     0501170228     1
36909    0155402zj      1
69622    013307aini     1
120788   015210755s     1

As explained in Section 3, we employ the least-squares fitting method in linear regression to approximate the distribution of the password graph's degree. The fitting results for the 12306 and MySpace password datasets are shown in Figures 5 and 6; the horizontal axis is the rank and the vertical axis is the degree of the password. The other three fitting results are shown together in Figure 7. The values of $k$, $b$, and the coefficient of determination $R^2$ are displayed in the respective plots. Every linear regression except that of 12306 has an $R^2$ of at least 0.90, which is close to 1 and thus indicates a sound fit. For the 12306 dataset, $R^2$ is about 0.84, which is not ideal but acceptable. A plausible reason is that the 12306 dataset was collected by trying usernames and passwords from other leaked datasets online, so it probably does not represent the real distribution of the entire 12306 password dataset. Guo et al. [34] plot the same picture as ours, in which the relation between log(rank) and degree is linear; this indicates that their degree distribution follows a logarithmic law rather than a power law. Under a power law, the relation between log(rank) and log(degree) would be linear.

To study the relation between password degree and crackability, we calculate each password's degree and its corresponding guess number based on [20] with a sampling size of 100000. We then plot the average guess number against the degree, as shown in Figure 4. Overall, the average guess number increases as the degree decreases and, equivalently, decreases as the degree increases. We also evaluate the Spearman coefficient for each dataset: −0.94 for Rootkit, −0.88 for MySpace, −0.95 for Yahoo, −0.78 for 12306, and −0.93 for PhpBB. The Spearman coefficients further confirm our deduction: they are −0.88 or lower for all datasets except 12306 (which is probably caused by its incomplete password data). Thus there is a strong relation between a password's degree and its guess number: overall, the larger the degree of a password, the more easily it is cracked.
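For readers who wish to reproduce this kind of check, the sketch below computes the Spearman coefficient as the Pearson correlation of average ranks, which is the standard definition. The helper names and the (degree, average guess number) pairs are made up for illustration and are not measurements from the datasets.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # zero-based positions i..j map to ranks i+1..j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up (degree, average guess number) pairs, for illustration only.
toy_degrees = [0, 5, 20, 100, 400, 1500]
toy_guess_numbers = [1e20, 1e15, 1e16, 1e9, 1e6, 1e3]
rho = spearman(toy_degrees, toy_guess_numbers)
```

A coefficient close to −1 means that the guess number falls almost monotonically as the degree grows, which is the pattern described above.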

Connectivity. The connectivity of a graph is the minimum number of elements that need to be removed to disconnect the remaining nodes from each other. The elements are either vertices or edges, so there are vertex connectivity $K(G)$ and edge connectivity $\lambda(G)$. However, neither the vertex connectivity nor the edge connectivity of any of our selected password datasets is greater than 0, because of the isolated vertices in the password graph. In other words, the password graph is never connected. We list only the vertex connectivity in Table 2. We can further remove the isolated vertices and evaluate the connectivity of the filtered graph, but there are still isolated clusters in the graph, so both edge and vertex connectivity are zero for the filtered password graph as well.
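The argument above only needs the number of connected components: a graph with more than one component is already disconnected, so both connectivities are 0. A minimal sketch with breadth-first search follows; the function names and the toy graph are hypothetical.

```python
from collections import deque


def connected_components(adjacency):
    """Return the connected components of {vertex: set_of_neighbors} as lists."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(component)
    return components


def is_connected(adjacency):
    """A graph with at most one component is connected."""
    return len(connected_components(adjacency)) <= 1


# Hypothetical toy graph: two clusters plus one isolated vertex.
toy_graph = {
    "password": {"passw0rd", "password1"},
    "passw0rd": {"password"},
    "password1": {"password"},
    "123456": {"1234567"},
    "1234567": {"123456"},
    "zq8#Lm2!vX": set(),
}
components = connected_components(toy_graph)

# Dropping isolated vertices still leaves two separate clusters.
filtered = {p: nbrs for p, nbrs in toy_graph.items() if nbrs}
filtered_components = connected_components(filtered)
```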

Density. In simple graphs, a dense graph is a graph in which the number of edges is close to the maximal number of edges. The opposite, a graph with only a few edges, is a sparse graph. For undirected simple graphs, the graph density $D(G)$ is defined as follows:

$$D(G) = \frac{2|E|}{|V|(|V| - 1)}, \tag{11}$$

where $|E|$ and $|V|$ are the numbers of edges and vertices in the graph, respectively. The value of $D(G)$ ranges from 0 to 1: an empty password graph has density 0 (no edge in the graph) and a complete password graph has density 1 (any two vertices in the graph have an edge between them). The density of a password graph can be used as a rough approximation of the password dataset's strength.

Figure 4: Average guess number against degree for all datasets (Rootkit, MySpace, Yahoo, 12306, and PhpBB). Horizontal axis: degree; vertical axis: average guess number (logarithmic scale).

Figure 5: Fitting result of 12306 ($R^2 = 0.84$). Horizontal axis: rank of password (logarithmic scale); vertical axis: degree of password; fitted line: $-561.99 \cdot \log(x) + 2665.34$.
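As a quick worked check of Eq. (11), the sketch below evaluates the density for MySpace at threshold 3, using $|V| = 41514$ and $|E| = 622087$ as read from Table 2. The function name is an assumption; the result should agree with the tabulated $D(G)$ of about $72.19 \times 10^{-5}$ up to rounding.

```python
def graph_density(num_vertices: int, num_edges: int) -> float:
    """D(G) = 2|E| / (|V| (|V| - 1)) for an undirected simple graph."""
    if num_vertices < 2:
        return 0.0  # convention chosen here: no vertex pair, hence no possible edge
    return 2 * num_edges / (num_vertices * (num_vertices - 1))


# MySpace at threshold 3, with |V| and |E| as read from Table 2.
myspace_density = graph_density(41514, 622087)

# Boundary cases: a complete graph on 4 vertices has 6 edges; an empty graph has none.
complete_density = graph_density(4, 6)
empty_density = graph_density(5, 0)
```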

As shown in Table 2, each refined password graph is far from dense. From the aforementioned observation, we can conclude that a complete password graph has the minimum password dataset strength, because successfully cracking one password facilitates cracking all the remaining passwords, whereas an empty password graph has the maximum password dataset strength, because successfully cracking one password provides no further information about the remaining passwords. Note that the exact meaning of this information depends on the choice of threshold; for example, under a threshold of 3, a cracked password $p$ indicates that the passwords adjacent to it have an LD distance from $p$ of at most 3.

Figure 6: Fitting result of MySpace ($R^2 = 0.97$). Horizontal axis: rank of password (logarithmic scale); vertical axis: degree of password; fitted line: $-121.20 \cdot \log(x) + 531.95$.

Distance. The distance between two vertices $u$ and $v$ of a graph is the minimum length of the paths connecting them. If no such path exists, the distance is set to $+\infty$.

Diameter. The diameter of a graph is the longest shortest path between any two vertices. Because of the isolated vertices, there is always at least one pair of vertices at infinite distance in the password graph. Even if we remove the vertices with degree 0 and calculate again, for the same reason as for connectivity (the existence of isolated clusters), the diameter of the filtered password graph is still $+\infty$; this result is also listed in Table 2. In addition to the diameter, the average path length in the graph is unavoidably $+\infty$ too. Further study can evaluate the diameter and average path length of the top clusters (these will be finite because the clusters are connected); we leave this for future work.
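The two definitions can be sketched with breadth-first search on an unweighted graph, returning infinity whenever some vertex is unreachable. The function names and the toy graph (a path of four vertices plus one isolated vertex) are assumptions for illustration; running one search per vertex is only practical for small graphs or single clusters.

```python
import math
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest path lengths from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def distance(adjacency, u, v):
    """Distance between u and v, or +infinity if no path connects them."""
    return bfs_distances(adjacency, u).get(v, math.inf)


def diameter(adjacency):
    """Longest shortest path; +infinity if the graph is disconnected."""
    longest = 0
    for source in adjacency:
        dist = bfs_distances(adjacency, source)
        if len(dist) < len(adjacency):
            return math.inf  # some vertex is unreachable from source
        longest = max(longest, max(dist.values()))
    return longest


# Hypothetical toy graph: a path a-b-c-d and an isolated vertex e.
toy_graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}, "e": set()}
whole_diameter = diameter(toy_graph)

# Restricting to one connected cluster gives a finite diameter.
cluster = {v: nbrs for v, nbrs in toy_graph.items() if v != "e"}
cluster_diameter = diameter(cluster)
```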

One appealing outcome of visualizing our password graph, as can be seen intuitively from Figure 2, is that password vertices evidently tend to gather together in clusters, a phenomenon similar to that in social networks. Although users choose or create passwords independently of one another, en masse they tend to choose similar passwords. There is a minority of top clusters and a vast majority of non-top clusters, and the properties of the top clusters are of particular significance to us. Cluster structure in the context of networks refers to the occurrence of groups of nodes that are more densely connected internally than with the rest of the network. Note the distinction between a cluster and a clique in graph theory: a clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent, that is, its induced subgraph is complete; a cluster does not require this completeness property.
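The difference between the two notions can be checked mechanically. In the sketch below, a group of passwords is connected through a hub but is not a clique, because two of its members are not adjacent to each other. The function name and the toy adjacency (which assumes threshold 1) are hypothetical.

```python
from itertools import combinations


def is_clique(adjacency, vertices):
    """True if every two distinct vertices in the given group are adjacent."""
    return all(q in adjacency[p] for p, q in combinations(vertices, 2))


# Hypothetical cluster at threshold 1: "passw0rd" and "password1" are both
# adjacent to "password" but not to each other (their LD distance is 2).
toy_graph = {
    "password": {"passw0rd", "password1"},
    "passw0rd": {"password"},
    "password1": {"password"},
}
cluster_is_clique = is_clique(toy_graph, ["password", "passw0rd", "password1"])
pair_is_clique = is_clique(toy_graph, ["password", "passw0rd"])
```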


Figure 7: Fitting and visualization results of Rootkit, PhpBB, and Yahoo. (a) Visualization result of Rootkit. (b) Visualization result of PhpBB. (c) Visualization result of Yahoo. (d) Fitting result of Rootkit ($R^2 = 0.93$); fitted line: $-212.22 \cdot \log(x) + 935.31$. (e) Fitting result of PhpBB ($R^2 = 0.90$); fitted line: $-830.95 \cdot \log(x) + 4135.65$. (f) Fitting result of Yahoo ($R^2 = 0.92$); fitted line: $-444.70 \cdot \log(x) + 2315.86$. In panels (d) to (f), the horizontal axis is the rank of password (logarithmic scale) and the vertical axis is the degree of password.

In some cases, one vertex may belong to more than one cluster. This can happen in a social network where each vertex represents a person and the clusters represent different groups of people: one cluster for family, another for coworkers, one for friends in the same sports club, and so on. In our password graph, however, we postulate that a vertex can be part of only one cluster, provided that it belongs to a cluster at all. Clusters in the password graph are important because of their metaproperties, that is, the characteristics of a cluster as a whole. We propose that a cluster in the password graph acts like a metanode: intercluster properties differ more significantly than internode properties.

Top password lists [39, 40] are generally used as banned lists to prevent users from choosing common passwords and circumventing security policies. Top password structures [9] are used to make password cracking more efficient. We propose, for the first time, the concept of top clusters to replace both the top password list and the top password structures outright, because top clusters describe weak passwords in a password dataset more accurately and precisely: not only are the passwords that occur in the top password list, or whose structures appear in the top password structures, weak, but so are all the passwords belonging to the same cluster. Table 4 lists a segment of four clusters in Rootkit. Clusters 1, 2, and 4 are common clusters that can also be found in other password graphs; cluster 3 is present only in Rootkit and is a site-specific cluster. Undoubtedly, these are all weak passwords and should be added to the banned list of an Internet service provider. Combining Table 4 with Figures 2, 3, and 7, we conclude that blocking top clusters enhances the ability of a password dataset to resist trawling attacks.

Table 4: A segment of four clusters in Rootkit.

Cluster 1   Cluster 2   Cluster 3   Cluster 4
password    123456      rootkit     qwerty
passwordx   123456x     irootkit    qwerty
password0   1234569     1rootkit    qwertyu
Password    12345r      rootkit2    qwertyz
Password    123456      rootkit3    qwerty
password0   12345s      rootkit     qwertyr
passw0rk    1234566     rootkit     qwerta
password    1234565     4rootkit    qwerty7
p4s5w0rd    1234569     rrootkit    qwerty9
pssword     1234561     rootKit     qwerty111
Password    12345678    RootKit     qwerty12

One of the many practical applications of top clusters is building effective mangling rules; these rules are a bit different from those used by hashcat [24] and JTR [23]. To obtain effective mangling rules from a dataset of passwords, one can compare every two passwords in the same top cluster. The derived rules are exactly the operations allowed in Levenshtein distance: insertion, deletion, and substitution. Substitution is a superset of the Leet operation; insertion is a superset of prepending and appending. Our visualization method could also be used for a password strength meter: after a user finishes entering his/her new password, we evaluate whether it belongs to one of the existing top clusters; if it does, we reject the password and ask for a new one; otherwise, we accept it.
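One way to read rules off a pair of passwords from the same cluster is to backtrack through the Levenshtein dynamic-programming table and record one minimal sequence of edits. The sketch below does this; the function name and the tuple layout (operation, position in the source, old character, new character) are assumptions made for illustration and do not correspond to the rule syntax of hashcat or JTR.

```python
def edit_operations(source: str, target: str):
    """Return one minimal list of edits that turns source into target.

    Each edit is (operation, position_in_source, old_char, new_char), with an
    empty string standing in for the character that does not apply.
    """
    m, n = len(source), len(target)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match

    # Walk back from the bottom-right corner, collecting the edits.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and source[i - 1] == target[j - 1]
        if same and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                      # characters match, no edit
        elif i > 0 and j > 0 and not same and dp[i][j] == dp[i - 1][j - 1] + 1:
            ops.append(("substitute", i - 1, source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            ops.append(("insert", i, "", target[j - 1]))
            j -= 1
        else:
            ops.append(("delete", i - 1, source[i - 1], ""))
            i -= 1
    ops.reverse()
    return ops


leet_rule = edit_operations("password", "passw0rd")
append_rule = edit_operations("rootkit", "rootkit2")
prepend_rule = edit_operations("rootkit", "1rootkit")
```

The three examples correspond to a Leet-style substitution, an append, and a prepend, matching the observation that substitution and insertion subsume these familiar mangling operations.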

7. Conclusions

In this paper, we have explored possible relationships between passwords (modification-based, similarity-based, and probability-based) to aid the construction of the password graph. On the basis of graph theory and previous surveys, regarding passwords as vertices, we select the modification-based relationship and fix a threshold value of 3 to define the edges in the graph. Our password graph enables the introduction of concepts from graph theory, including degree, density, distance, diameter, and cluster.

The password graph has some novel built-in implications for password research; we use them as an alternative analysis method for password datasets. We find, surprisingly, that for threshold 3 a password can be similar to as many as 3380 passwords in the PhpBB dataset, and that passwords evidently tend to gather together in clusters, just as in a social network. Moreover, by applying linear regression, we discover that the degrees of the password graph follow a logarithmic distribution. Our findings suggest that the top password list in adaptive password checking should be replaced with top clusters to reach a higher level of security. The password graph could also enable the creation of effective mangling rules that apply universally in dictionary attacks, which requires only a large training dataset and no a priori knowledge of user habits or personal information. Finally, the password graph can be used for a new kind of password strength meter.

Notations

$V$: Vertex set
$E$: Edge set
$\Delta(G)$: $\max\{d(p) \mid p \in V\}$, the maximum degree of all vertices
$\delta(G)$: $\min\{d(p) \mid p \in V\}$, the minimum degree of all vertices
$Z(G)$: $|\{p \in V \mid d(p) = 0\}|$, the number of vertices whose degree is 0
$D(G)$: $2|E| \div (|V|(|V| - 1))$, density of the password graph
$K(G)$: Vertex connectivity
$\lambda(G)$: Edge connectivity

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors thank Dr. Ding Wang for his help and support. This research was supported by the National Key R&D Program of China under no. 2016YFB0800603 and no. 2017YFB1200700 and the National Natural Science Foundation of China (NSFC) under Grant no. 61472016.

References

[1] Z. Zhao, G.-J. Ahn, and H. Hu, "Picture gesture authentication: empirical analysis, automated attacks, and scheme evaluation," ACM Transactions on Information and System Security, vol. 17, no. 4, article 14, 37 pages, 2015.

[2] A. K. Jain, A. Ross, and S. Pankanti, "Biometrics: a tool for information security," IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 125–143, 2006.

[3] N. Zheng, A. Paloski, and H. Wang, "An efficient user verification system via mouse movements," in Proceedings of the 18th ACM Conference on Computer and Communications Security, pp. 139–150, ACM, October 2011.

[4] D. Florencio and C. Herley, "A large-scale study of web password habits," in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 657–666, ACM, May 2007.

[5] J. Yan, A. F. Blackwell, R. J. Anderson, and A. Grant, "Password memorability and security: empirical results," IEEE Security & Privacy, vol. 2, no. 5, pp. 25–31, 2004.

[6] W. Cheswick, "Rethinking passwords," Communications of the ACM, vol. 56, no. 2, pp. 40–44, 2013.

[7] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, "Passwords and the evolution of imperfect authentication," Communications of the ACM, vol. 58, no. 7, pp. 78–87, 2015.

[8] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, "The quest to replace passwords: a framework for comparative evaluation of web authentication schemes," in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP '12), pp. 553–567, IEEE, San Francisco, Calif, USA, May 2012.

[9] D. Wang, H. Cheng, P. Wang, and X. Huang, "Understanding passwords of Chinese users: characteristics, security and implications, CACR report," in Proceedings of the ChinaCrypt 2015, 2015, http://t.cn/RG8RacH.

[10] A. Adams and M. A. Sasse, "Users are not the enemy," Communications of the ACM, vol. 42, no. 12, pp. 40–46, 1999.

[11] A. Beautement, M. A. Sasse, and M. Wonham, "The compliance budget: managing security behaviour in organisations," in Proceedings of the Workshop on New Security Paradigms (NSPW '08), pp. 47–58, ACM, Lake Tahoe, Calif, USA, September 2008.

[12] D. Wang, D. He, H. Cheng, and P. Wang, "FuzzyPSM: a new password strength meter using fuzzy probabilistic context-free grammars," in Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '16), pp. 595–606, IEEE, July 2016.

[13] L. Wang, Y. Li, and K. Sun, "Amnesia: a bilateral generative password manager," in Proceedings of the 36th IEEE International Conference on Distributed Computing Systems (ICDCS '16), pp. 313–322, IEEE, June 2016.

[14] D. McCarney, D. Barrera, J. Clark, S. Chiasson, and P. C. Van Oorschot, "Tapas: design, implementation, and usability evaluation of a password manager," in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 89–98, ACM, December 2012.


[15] B. Ives, K. R. Walsh, and H. Schneider, "The domino effect of password reuse," Communications of the ACM, vol. 47, no. 4, pp. 75–78, 2004.

[16] D. Wang, H. Cheng, P. Wang, X. Huang, and G. Jian, "Zipf's law in passwords," IEEE Transactions on Information Forensics and Security, vol. 12, no. 11, pp. 2776–2791, 2017.

[17] M. Weir, S. Aggarwal, B. De Medeiros, and B. Glodek, "Password cracking using probabilistic context-free grammars," in Proceedings of the 30th IEEE Symposium on Security and Privacy, pp. 391–405, IEEE, May 2009.

[18] S. Houshmand and S. Aggarwal, "Building better passwords using probabilistic techniques," in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 109–118, ACM, December 2012.

[19] R. Veras, C. Collins, and J. Thorpe, "On the semantic patterns of passwords and their security impact," in Proceedings of the Network and Distributed System Security Symposium (NDSS '14), San Diego, Calif, USA, 2014.

[20] M. Dell'Amico and M. Filippone, "Monte Carlo strength evaluation: fast and reliable password checking," in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 158–169, ACM, October 2015.

[21] R. Morris and K. Thompson, "Password security: a case history," Communications of the ACM, vol. 22, no. 11, pp. 594–597, 1979.

[22] D. V. Klein, "Foiling the cracker: a survey of, and improvements to, password security," in Proceedings of the 2nd USENIX Security Workshop, pp. 5–14, 1990.

[23] John the Ripper password cracker, Openwall, http://www.openwall.com/john/.

[24] hashcat: advanced password recovery, https://hashcat.net/hashcat/.

[25] J. Ma, W. Yang, M. Luo, and N. Li, "A study of probabilistic password models," in Proceedings of the 35th IEEE Symposium on Security and Privacy (SP '14), pp. 689–704, IEEE, May 2014.

[26] R. Veras, J. Thorpe, and C. Collins, "Visualizing semantics in passwords: the role of dates," in Proceedings of the 9th International Symposium on Visualization for Cyber Security, pp. 88–95, ACM, October 2012.

[27] M. Weir, S. Aggarwal, M. Collins, and H. Stern, "Testing metrics for password creation policies by attacking large sets of revealed passwords," in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 162–175, ACM, October 2010.

[28] C. Castelluccia, M. Durmuth, and D. Perito, "Adaptive password-strength meters from Markov models," in Proceedings of the 19th Annual Network & Distributed System Security Symposium (NDSS '12), February 2012.

[29] J. Bonneau, "The science of guessing: analyzing an anonymized corpus of 70 million passwords," in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP '12), pp. 538–552, IEEE, San Francisco, Calif, USA, May 2012.

[30] Z. Li, W. Han, and W. Xu, "A large-scale empirical analysis of Chinese web passwords," in Proceedings of the USENIX Security Symposium, pp. 559–574, 2014.

[31] B. L. Riddle, M. S. Miron, and J. A. Semo, "Passwords in use in a university timesharing environment," Computers & Security, vol. 8, no. 7, pp. 569–579, 1989.

[32] R. Shay, S. Komanduri, P. G. Kelley et al., "Encountering stronger password requirements: user attitudes and behaviors," in Proceedings of the 6th Symposium on Usable Privacy and Security, vol. 2, ACM, July 2010.

[33] Y. Zhang, F. Monrose, and M. K. Reiter, "The security of modern password expiration: an algorithmic framework and empirical analysis," in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 176–186, October 2010.

[34] X. Guo, H. Chen, X. Liu, X. Xu, and Z. Chen, "The scale-free network of passwords: visualization and estimation of empirical passwords," https://arxiv.org/abs/1511.08324.

[35] S. Perez, "Recently confirmed Myspace hack could be the largest yet," May 2016.

[36] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, 1966.

[37] A. Das, J. Bonneau, M. Caesar, N. Borisov, and X. Wang, "The tangled web of password reuse," in Proceedings of the Network and Distributed System Security Symposium (NDSS '14), vol. 14, pp. 23–26, 2014.

[38] graph-tool: efficient network analysis with Python, https://graph-tool.skewed.de/.

[39] A. Vance, "If your password is 123456, just make it hackme," The New York Times, vol. 20, p. A1, 2010.

[40] S. Schechter, C. Herley, and M. Mitzenmacher, "Popularity is everything: a new approach to protecting passwords from statistical-guessing attacks," in Proceedings of the 5th USENIX Conference on Hot Topics in Security, pp. 1–8, USENIX Association, 2010.


Page 8: An Alternative Method for Understanding User-Chosen Passwords

8 Security and Communication Networks

Table 2 Statistic of experiments results

Password dataset |119881| 119905 |119864| Δ(119866) 120575(119866) 119885(119866) Avg degree 119863(119866)(times10minus5) 119870(119866) 119889(119866)

MySpace 415141 20119 58 0 28119 097 233 0 +infin2 111884 141 0 16748 539 1298 0 +infin3 622087 477 0 7290 2997 7219 0 +infin

PhpBB 1843411 81135 65 0 130767 088 048 0 +infin2 1204816 460 0 70187 1307 709 0 +infin3 13836682 3380 0 29453 15012 8144 0 +infin

Rootkit 568591 9281 41 0 47093 033 057 0 +infin2 105275 133 0 30564 370 651 0 +infin3 1062502 953 0 15792 3737 6573 0 +infin

Yahoo 3425101 140092 66 0 247445 084 024 0 +infin2 1394200 327 0 147126 862 238 0 +infin3 13686843 2127 0 77122 7992 2333 0 +infin

12306 1178081 51299 63 0 95057 087 074 0 +infin2 676011 506 0 56162 1148 974 0 +infin3 5311460 2301 0 20640 9017 7654 0 +infin

Note (1) Bold figures are the maximum value of each category for threshold 3 (2) The definitions of the notations are summarized in Notations

Table 3 A segment of passwords with 0 degrees in 12306

Number Password Frequency1125 04768130093 11128 048398531 11138 0501154315 11139 0501170228 136909 0155402zj 169622 013307aini 1120788 015210755s 1

As explained in Section 3 we employ least-squares fittingmethod in linear regression to approximate the distributionof password graphrsquos degree The fitting results of 12306 andMySpace password datasets are shown in Figures 5 and 6The horizontal axis is the rank value and vertical axis isdegree of password The other three fitting results are showncollectedly in Figure 7 The statistics of 119896 119887 and coefficientof determination of regression results are displayed in theirrespective plotting figures every linear regression (except for12306) is with its 1198772 not smaller than 090 which closelyapproaches to 1 and thus indicates a remarkably sound fittingAs for 12306 dataset its 1198772 is about 084 which is not an idealone but acceptable not as good as that of other datasets Aplausible reason may be that the 12306 dataset is collected bytrying usernames and passwords from other leaked datasetsonline and probably cannot represent the real distributionof the entire passwords dataset of 12306 Guo et al [34] plotthe same picture as us where the relation between log(rank)and degree is linear which indicates the degree distributionof them follows logarithmic law but not power law In powerlaw the relation between log(rank) and log(degree) should belinear

To get the relation between password degree and cracka-bility we calculate each passwordrsquos degree and its correspond-ing guess number based on [20] with sampling size 100000Subsequently we plot the passwordrsquos average guess numberagainst its degree and it is shown in Figure 4 we can observethat the overall average guess number is increasing whenthe degree decreases or the overall average guess number isdecreasing when the degree increases Spearman coefficientfor each dataset is evaluated too the Spearman coefficientfor Rootkit is minus094 MySpace minus088 Yahoo minus095 12306minus078 and PhpBB minus093 The Spearman coefficient furtherconfirms our deduction the coefficients for all datasets areall below minus088 except for 12306 (which might probably becaused by password data not complete) thus there is a strongrelation between a password degree and its correspondingguess number overall speaking the bigger the passworddegree is the easier it will be cracked

Connectivity The connectivity for a graph is the minimumnumber of elements that need to be removed to disconnectthe remaining nodes from each other The elements referto vertices or edges so there are vertex connectivity 119870(119866)and edge connectivity 120582(119866) Nevertheless neither vertexconnectivity nor edge connectivity of any of our selectedpassword datasets has the value more than 0 because of theisolated vertex in the password graph In other words thepassword graph is never connected We only list the valueof vertex connectivity in Table 2 We can further removethe isolated vertices and then evaluated the connectivity forthe filtered graph but there are still isolated clusters in thegraph so either edge or vertex connectivity is zero for filteredpassword graph

Density In simple graphs a dense graph is a graph where thenumber of edges is close to themaximal number of edgesTheopposite a graph with only a few edges is a sparse graph Forundirected simple graphs the graph density 119863(119866) is defined

Security and Communication Networks 9

RootkitMySpaceYahoo

12306PhpBB

100

1012

1022

1032

Aver

age g

uess

num

ber

0 500 1000 1500 2500 30002000Degree

Figure 4 Average guess number against degree for all datasets

minus56199 middot FIA(x) + 266534

0

500

1000

1500

2000

2500

Deg

ree o

f pas

swor

d

104 105100010 1001

Rank of password

12306

Figure 5 Fitting result of 12306 (1198772 = 084)

as follows

119863 (119866) = 2 |119864||119881| (|119881| minus 1) (11)

where |119864| and |119881| mean the number of edges and verticesrespectively in the graph The value of119863(119866) resides betweenrange from 0 to 1 an empty password graph has the density 0(no edge in the graph empty graph) and a complete passwordgraph has the density 1 (any two vertices in the graph has anedge between them complete graph)The density of passwordgraph can be used as a rough approximation of passworddatasetrsquos strength

As shown in Table 2, each refined password graph is far from dense. We can conclude that a complete password graph has the minimum password dataset strength, because successfully cracking one password facilitates cracking all the remaining passwords, whereas an empty password graph has the maximum password dataset strength, because successfully cracking a password provides no further information about

Figure 6: Fitting result of MySpace ($R^2 = 0.97$). Horizontal axis: rank of password (logarithmic scale); vertical axis: degree of password. Fitted curve: $-121.20 \cdot \log(x) + 531.95$.

the remaining passwords. Note that the exact definition of this information depends on the selected threshold: for example, with a threshold of 3, a cracked password $p$ indicates that there are passwords whose Levenshtein distance (LD) from $p$ is smaller than 3.
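The threshold rule can be made concrete with a short sketch. It assumes, following the sentence above, that two distinct passwords are joined by an edge when their Levenshtein distance is strictly smaller than the threshold. The function names are hypothetical, and the quadratic pairwise loop is only suitable for toy inputs, not for the full datasets.

```python
def levenshtein(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def build_password_graph(passwords, threshold=3):
    """Join two distinct passwords when their distance is below the threshold."""
    vertices = sorted(set(passwords))
    edges = [(p, q)
             for i, p in enumerate(vertices)
             for q in vertices[i + 1:]
             if levenshtein(p, q) < threshold]
    return vertices, edges

# Hypothetical toy dataset.
vertices, edges = build_password_graph(
    ["password", "passw0rd", "p4s5w0rd", "123456"])
```

In this toy graph, "p4s5w0rd" is at distance 3 from "password" and is therefore not its neighbor, yet both are adjacent to "passw0rd", so all three end up in the same connected group.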

Distance. The distance between two vertices $u$ and $v$ of a graph is the minimum length of the paths connecting them. If no such path exists, the distance is set to $+\infty$.

Diameter. The diameter of a graph is the longest shortest path between any two vertices. Because of the isolated vertices, there is always at least one pair of vertices at infinite distance in the password graph. Even if we remove the vertices of degree 0 and calculate again, the diameter of the filtered password graph is still $+\infty$ because of the isolated clusters, for the same reason as for connectivity; the result is also listed in Table 2. Like the diameter, the average path length of the graph is unavoidably $+\infty$. The diameter and average path length of the top clusters can be studied further (their values are finite because each cluster is connected); we leave this for future work.
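Both definitions can be illustrated with breadth-first search. The sketch below is a plain-Python illustration under the definitions given here, not the routine used to produce Table 2; the adjacency dictionaries are hypothetical.

```python
from collections import deque
import math

def bfs_distances(adj, source):
    """Hop distances from source; unreachable vertices are absent from the result."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Longest shortest path; +inf as soon as some pair of vertices is disconnected."""
    longest = 0
    for source in adj:
        dist = bfs_distances(adj, source)
        if len(dist) < len(adj):
            return math.inf
        longest = max(longest, max(dist.values()))
    return longest

# Hypothetical connected cluster shaped as a path a - b - c - d.
cluster = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
cluster_diameter = diameter(cluster)

# Adding one isolated vertex makes the diameter infinite.
with_isolated = dict(cluster, z=set())
graph_diameter = diameter(with_isolated)
```

Restricting the computation to a single connected cluster gives a finite value, which is why the diameter of the top clusters remains a meaningful quantity.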

One appealing outcome of visualizing our password graph, as can be seen from Figure 2, is that password vertices evidently tend to gather into clusters, a phenomenon also observed in social networks. Although users choose or create passwords independently of one another, in the mass they tend to choose similar passwords. There is a minority of top clusters and a vast majority of non-top clusters, and the properties of the top clusters are of particular significance to us. Cluster structure in the context of networks refers to the occurrence of groups of nodes that are more densely connected internally than with the rest of the network. Note the distinction between a cluster and a clique: in graph theory, a clique is a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent, that is, its induced subgraph is complete; a cluster does not require this completeness property.


Figure 7: Fitting and visualization results of Rootkit, PhpBB, and Yahoo. (a) Visualization result of Rootkit. (b) Visualization result of PhpBB. (c) Visualization result of Yahoo. (d) Fitting result of Rootkit ($R^2 = 0.93$), fitted curve $-212.22 \cdot \log(x) + 935.31$. (e) Fitting result of PhpBB ($R^2 = 0.90$), fitted curve $-830.95 \cdot \log(x) + 4135.65$. (f) Fitting result of Yahoo ($R^2 = 0.92$), fitted curve $-444.70 \cdot \log(x) + 2315.86$. In panels (d) to (f), the horizontal axis is the rank of password (logarithmic scale) and the vertical axis is the degree of password.
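The fitting results in Figures 5 to 7 relate the degree of a password to the logarithm of its rank. A minimal sketch of such a fit by ordinary least squares is given below. The base of the logarithm cannot be recovered from the text, so base 10 is an assumption here, and the input data are synthetic rather than taken from any dataset.

```python
import math

def fit_log_curve(degrees_by_rank):
    """Least-squares fit of degree = a * log10(rank) + b; returns (a, b, r_squared)."""
    xs = [math.log10(rank) for rank in range(1, len(degrees_by_rank) + 1)]
    ys = list(degrees_by_rank)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot

# Synthetic degrees that follow a logarithmic curve exactly (hypothetical values).
synthetic = [-100 * math.log10(rank) + 500 for rank in range(1, 1001)]
a, b, r_squared = fit_log_curve(synthetic)
```

The input list is expected to hold the degrees sorted in decreasing order, so that position 1 is the highest-degree password. On real data the coefficient of determination falls below 1, as the reported $R^2$ values show.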

In some cases one vertex may belong to more than one cluster. This might happen in a social network where each vertex represents a person and the clusters represent different groups of people: one cluster for family, another for coworkers, one for friends in the same sports club, and so on. In our password graph, however, we postulate that a vertex can be part of only one cluster, provided that it belongs to some cluster. Clusters in the password graph are important because of their metaproperties: we propose that a cluster acts like a metanode in the graph, and that intercluster properties differ more significantly than internode properties.
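How clusters are extracted is not spelled out in this section. One reading that is consistent with the postulate that each vertex belongs to at most one cluster is to treat every connected component with more than one vertex as a cluster and to rank the clusters by size. The sketch below follows that assumption; the function name and the toy adjacency dictionary are hypothetical.

```python
def top_clusters(adj, k):
    """Return the k largest connected components, each as a sorted list of vertices."""
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            component.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        if len(component) > 1:  # isolated vertices do not form a cluster
            clusters.append(sorted(component))
    clusters.sort(key=len, reverse=True)
    return clusters[:k]

# Hypothetical toy password graph.
adj = {
    "password": {"password1", "passw0rd"},
    "password1": {"password"},
    "passw0rd": {"password"},
    "123456": {"1234567"},
    "1234567": {"123456"},
    "zq8#Lm2x": set(),
}
largest = top_clusters(adj, 1)
all_clusters = top_clusters(adj, 5)
```

Under this reading every vertex falls into exactly one component, so the single-membership postulate holds by construction. A community-detection method would be an alternative if overlapping or denser groupings were wanted.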

Top password lists [39, 40] are generally used to build banned lists that prevent users from choosing common passwords and circumventing security policies. Top password structures [9] are used specifically to make password cracking more efficient. We propose, for the first time, the concept of top clusters to replace both the top password list and top password structures, because top clusters describe weak passwords in a password dataset more accurately and precisely: not only are the passwords that occur in the top password list or match the top password structures weak, but so are all the passwords belonging to the same cluster. Table 4 lists a segment of four clusters in Rootkit; clusters 1, 2, and

Table 4: A segment of four clusters in Rootkit.

Cluster 1    Cluster 2    Cluster 3    Cluster 4
password     123456       rootkit      qwerty
passwordx    123456x      irootkit     qwerty
password0    1234569      1rootkit     qwertyu
Password     12345r       rootkit2     qwertyz
Password     123456       rootkit3     qwerty
password0    12345s       rootkit      qwertyr
passw0rk     1234566      rootkit      qwerta
password     1234565      4rootkit     qwerty7
p4s5w0rd     1234569      rrootkit     qwerty9
pssword      1234561      rootKit      qwerty111
Password     12345678     RootKit      qwerty12

4 are common clusters, which can also be found in other password graphs; cluster 3 appears only in Rootkit and is a site-specific cluster. All of them are weak passwords and should be added to the banned list of an Internet service provider. Combining Table 4 with Figures 2, 3, and 7, we conclude that blocking top clusters enhances the ability of a password dataset to resist trawling attacks.
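A banned list built from top clusters can be checked as sketched below. The text does not define when a previously unseen password "belongs" to a cluster; the sketch assumes that it does when its Levenshtein distance to some member is below the edge threshold of 3. The cluster contents and function names are hypothetical.

```python
def levenshtein(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def in_banned_cluster(candidate, clusters, threshold=3):
    """True if the candidate is within the threshold of any member of any cluster."""
    return any(levenshtein(candidate, member) < threshold
               for cluster in clusters
               for member in cluster)

# Hypothetical top clusters used as the banned list.
banned = [["password", "passw0rd", "Password"],
          ["rootkit", "rootkit2", "rootKit"]]

rejected_variant = in_banned_cluster("rootkit9", banned)   # one insertion away
rejected_leet = in_banned_cluster("p@ssword!", banned)     # two edits away
accepted = not in_banned_cluster("correct horse battery staple", banned)
```

Unlike an exact-match banned list, this check also rejects variants that never occurred in the training dataset, at the cost of one distance computation per cluster member.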

One of the many practical applications of top clusters is building effective mangling rules; these rules differ slightly from those used by hashcat [24] and JTR [23]. To derive mangling rules from a dataset of passwords, one can compare every two passwords in the same top cluster. The derived rules are exactly the operations allowed in the Levenshtein distance: insertion, deletion, and substitution. Substitution is a superset of the Leet operation, and insertion is a superset of prepending and appending. Our visualization method could also be used as a password strength meter: after a user enters a new password, we evaluate whether it belongs to one of the existing top clusters; if it does, we reject the password and ask for a new one; otherwise we accept it.
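The pairwise comparison can be sketched with the usual Levenshtein dynamic program followed by a backtrace that recovers one minimal sequence of operations. This is an illustration under assumptions: the rule encoding (operation name plus the characters involved) and the function names are hypothetical, and when several minimal edit scripts exist the backtrace returns just one of them.

```python
from collections import Counter
from itertools import permutations

def edit_script(src, dst):
    """Return one minimal list of Levenshtein operations turning src into dst."""
    n, m = len(src), len(dst)
    # dp[i][j] is the distance between src[:i] and dst[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (src[i - 1] != dst[j - 1]))
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (src[i - 1] != dst[j - 1]):
            if src[i - 1] != dst[j - 1]:
                ops.append(("substitute", i - 1, src[i - 1], dst[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("delete", i - 1, src[i - 1]))
            i -= 1
        else:
            ops.append(("insert", i, dst[j - 1]))  # insert before position i of src
            j -= 1
    return ops[::-1]

def mangling_rules(cluster):
    """Count position-independent rules over every ordered pair in the cluster."""
    rules = Counter()
    for src, dst in permutations(cluster, 2):
        for op in edit_script(src, dst):
            rules[(op[0],) + op[2:]] += 1
    return rules

# Hypothetical two-member cluster.
rules = mangling_rules(["password", "passw0rd"])
```

Counting the rules over all pairs of a top cluster gives a frequency ranking, so the most common operations (for example a Leet substitution or an appended digit) can be tried first in a dictionary attack.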

7. Conclusions

In this paper, we have explored possible relationships between passwords (modification-based, similarity-based, and probability-based) to aid the construction of a password graph. On the basis of graph theory and previous surveys, regarding passwords as vertices, we select the modification-based relationship and fix a threshold value of 3 to define the edges in the graph. Our password graph enables the introduction of concepts from graph theory, including degree, density, distance, diameter, and cluster.

The password graph has some novel built-in implications for password research; we use them as an alternative analysis method for password datasets. Surprisingly, we find that, for the threshold 3, a password can be similar to as many as 3380 passwords in the PhpBB dataset, and that passwords evidently tend to gather into clusters, just as in a social network. Moreover, by applying linear regression, we discover that the degrees of the password graph follow a logarithmic distribution. Our findings suggest that the top password list in adaptive password checking should be replaced with top clusters to reach a higher level of security. The password graph could also enable the creation of effective mangling rules that apply universally in dictionary attacks; this needs only a large training dataset and no a priori knowledge of user habits or personal information. Finally, the password graph can be used for a new kind of password strength meter.

Notations

$V$: Vertex set
$E$: Edge set
$\Delta(G)$: $\max\{d(p) \mid p \in V\}$, the maximum degree of all vertices
$\delta(G)$: $\min\{d(p) \mid p \in V\}$, the minimum degree of all vertices
$Z(G)$: $|\{p \in V \mid d(p) = 0\}|$, the number of vertices whose degree is 0
$D(G)$: $2|E| / (|V|(|V| - 1))$, the density of the password graph
$K(G)$: Vertex connectivity
$\lambda(G)$: Edge connectivity

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors thank Dr. Ding Wang for his help and support. This research was supported by the National Key R&D Program of China under nos. 2016YFB0800603 and 2017YFB1200700 and the National Natural Science Foundation of China (NSFC) under Grant no. 61472016.

References

[1] Z. Zhao, G.-J. Ahn, and H. Hu, “Picture gesture authentication: empirical analysis, automated attacks, and scheme evaluation,” ACM Transactions on Information and System Security, vol. 17, no. 4, article 14, 37 pages, 2015.

[2] A. K. Jain, A. Ross, and S. Pankanti, “Biometrics: a tool for information security,” IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 125–143, 2006.

[3] N. Zheng, A. Paloski, and H. Wang, “An efficient user verification system via mouse movements,” in Proceedings of the 18th ACM Conference on Computer and Communications Security, pp. 139–150, ACM, October 2011.

[4] D. Florencio and C. Herley, “A large-scale study of web password habits,” in Proceedings of the 16th International World Wide Web Conference (WWW ’07), pp. 657–666, ACM, May 2007.

[5] J. Yan, A. F. Blackwell, R. J. Anderson, and A. Grant, “Password memorability and security: empirical results,” IEEE Security & Privacy, vol. 2, no. 5, pp. 25–31, 2004.

[6] W. Cheswick, “Rethinking passwords,” Communications of the ACM, vol. 56, no. 2, pp. 40–44, 2013.

[7] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, “Passwords and the evolution of imperfect authentication,” Communications of the ACM, vol. 58, no. 7, pp. 78–87, 2015.

[8] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, “The quest to replace passwords: a framework for comparative evaluation of web authentication schemes,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP ’12), pp. 553–567, IEEE, San Francisco, Calif, USA, May 2012.

[9] D. Wang, H. Cheng, P. Wang, and X. Huang, “Understanding passwords of Chinese users: characteristics, security and implications. CACR report,” in Proceedings of the ChinaCrypt 2015, 2015, http://t.cn/RG8RacH.

[10] A. Adams and M. A. Sasse, “Users are not the enemy,” Communications of the ACM, vol. 42, no. 12, pp. 40–46, 1999.

[11] A. Beautement, M. A. Sasse, and M. Wonham, “The compliance budget: managing security behaviour in organisations,” in Proceedings of the Workshop on New Security Paradigms (NSPW ’08), pp. 47–58, ACM, Lake Tahoe, Calif, USA, September 2008.

[12] D. Wang, D. He, H. Cheng, and P. Wang, “FuzzyPSM: a new password strength meter using fuzzy probabilistic context-free grammars,” in Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’16), pp. 595–606, IEEE, July 2016.

[13] L. Wang, Y. Li, and K. Sun, “Amnesia: a bilateral generative password manager,” in Proceedings of the 36th IEEE International Conference on Distributed Computing Systems (ICDCS ’16), pp. 313–322, IEEE, June 2016.

[14] D. McCarney, D. Barrera, J. Clark, S. Chiasson, and P. C. Van Oorschot, “Tapas: design, implementation, and usability evaluation of a password manager,” in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 89–98, ACM, December 2012.


[15] B. Ives, K. R. Walsh, and H. Schneider, “The domino effect of password reuse,” Communications of the ACM, vol. 47, no. 4, pp. 75–78, 2004.

[16] D. Wang, H. Cheng, P. Wang, X. Huang, and G. Jian, “Zipf’s law in passwords,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 11, pp. 2776–2791, 2017.

[17] M. Weir, S. Aggarwal, B. De Medeiros, and B. Glodek, “Password cracking using probabilistic context-free grammars,” in Proceedings of the 30th IEEE Symposium on Security and Privacy, pp. 391–405, IEEE, May 2009.

[18] S. Houshmand and S. Aggarwal, “Building better passwords using probabilistic techniques,” in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 109–118, ACM, December 2012.

[19] R. Veras, C. Collins, and J. Thorpe, “On the semantic patterns of passwords and their security impact,” in Proceedings of the Network and Distributed System Security Symposium (NDSS ’14), San Diego, Calif, USA, 2014.

[20] M. Dell’Amico and M. Filippone, “Monte Carlo strength evaluation: fast and reliable password checking,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 158–169, ACM, October 2015.

[21] R. Morris and K. Thompson, “Password security: a case history,” Communications of the ACM, vol. 22, no. 11, pp. 594–597, 1979.

[22] D. V. Klein, “Foiling the cracker: a survey of, and improvements to, password security,” in Proceedings of the 2nd USENIX Security Workshop, pp. 5–14, 1990.

[23] John the Ripper password cracker, Openwall, http://www.openwall.com/john/.

[24] hashcat, advanced password recovery, https://hashcat.net/hashcat/.

[25] J. Ma, W. Yang, M. Luo, and N. Li, “A study of probabilistic password models,” in Proceedings of the 35th IEEE Symposium on Security and Privacy (SP ’14), pp. 689–704, IEEE, May 2014.

[26] R. Veras, J. Thorpe, and C. Collins, “Visualizing semantics in passwords: the role of dates,” in Proceedings of the 9th International Symposium on Visualization for Cyber Security, pp. 88–95, ACM, October 2012.

[27] M. Weir, S. Aggarwal, M. Collins, and H. Stern, “Testing metrics for password creation policies by attacking large sets of revealed passwords,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 162–175, ACM, October 2010.

[28] C. Castelluccia, M. Durmuth, and D. Perito, “Adaptive password-strength meters from Markov models,” in Proceedings of the 19th Annual Network & Distributed System Security Symposium (NDSS ’12), February 2012.

[29] J. Bonneau, “The science of guessing: analyzing an anonymized corpus of 70 million passwords,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP ’12), pp. 538–552, IEEE, San Francisco, Calif, USA, May 2012.

[30] Z. Li, W. Han, and W. Xu, “A large-scale empirical analysis of Chinese web passwords,” in Proceedings of the USENIX Security Symposium, pp. 559–574, 2014.

[31] B. L. Riddle, M. S. Miron, and J. A. Semo, “Passwords in use in a university timesharing environment,” Computers & Security, vol. 8, no. 7, pp. 569–579, 1989.

[32] R. Shay, S. Komanduri, P. G. Kelley et al., “Encountering stronger password requirements: user attitudes and behaviors,” in Proceedings of the 6th Symposium on Usable Privacy and Security, vol. 2, ACM, July 2010.

[33] Y. Zhang, F. Monrose, and M. K. Reiter, “The security of modern password expiration: an algorithmic framework and empirical analysis,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 176–186, October 2010.

[34] X. Guo, H. Chen, X. Liu, X. Xu, and Z. Chen, “The scale-free network of passwords: visualization and estimation of empirical passwords,” https://arxiv.org/abs/1511.08324.

[35] S. Perez, “Recently confirmed Myspace hack could be the largest yet,” May 2016.

[36] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, 1966.

[37] A. Das, J. Bonneau, M. Caesar, N. Borisov, and X. Wang, “The tangled web of password reuse,” in Proceedings of the Network and Distributed System Security Symposium (NDSS ’14), vol. 14, pp. 23–26, 2014.

[38] graph-tool: efficient network analysis with Python, https://graph-tool.skewed.de/.

[39] A. Vance, “If your password is 123456, just make it hackme,” The New York Times, vol. 20, p. A1, 2010.

[40] S. Schechter, C. Herley, and M. Mitzenmacher, “Popularity is everything: a new approach to protecting passwords from statistical-guessing attacks,” in Proceedings of the 5th USENIX Conference on Hot Topics in Security, pp. 1–8, USENIX Association, 2010.



[34] X Guo H Chen X Liu X Xu and Z Chen ldquoThe scale-freenetwork of passwords visualization and estimation of empiricalpasswordsrdquo httpsarxivorgabs151108324

[35] S Perez ldquoRecently confirmedMyspace hack could be the largestyetrdquo May 2016

[36] V I Levenshtein ldquoBinary codes capable of correcting deletionsinsertions and reversalsrdquo Soviet PhysicsmdashDoklady vol 10 no8 pp 707ndash710 1966

[37] A Das J Bonneau M Caesar N Borisov and X Wang ldquoThetangled web of password reuserdquo in Proceedings of the Networkand Distributed System Security Symposium (NDSS rsquo14) vol 14pp 23ndash26 2014

[38] graph-tool efficent network analysis with python httpsgraph-toolskewedde

[39] A Vance ldquoIf your password is 123456 just make it hackmerdquoTheNew York Times vol 20 p A1 2010

[40] S Schechter C Herley and M Mitzenmacher ldquoPopularityis everything a new approach to protecting passwords fromstatistical-guessing attacksrdquo in Proceedings of the 5th USENIXConference on Hot Topics in Security pp 1ndash8 USENIX Associa-tion 2010

Figure 7: Fitting and visualization results of Rootkit, PhpBB, and Yahoo. Panels (a), (b), and (c) show the visualization results of Rootkit, PhpBB, and Yahoo. Panels (d), (e), and (f) plot the degree of password against the rank of password (logarithmic axis) together with the fitted lines: $-212.22 \cdot \log(x) + 935.31$ for Rootkit ($R^2 = 0.93$), $-830.95 \cdot \log(x) + 4135.65$ for PhpBB ($R^2 = 0.90$), and $-444.70 \cdot \log(x) + 2315.86$ for Yahoo ($R^2 = 0.92$).
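The fitting step behind panels (d) to (f) can be sketched as an ordinary least-squares fit of degree against the logarithm of rank. The sketch below is only an illustration under stated assumptions: the function name `fit_log_degree` is hypothetical, the base-10 logarithm is assumed, and the paper does not state which tool produced its fits.

```python
import math


def fit_log_degree(degrees):
    """Fit degree ~ slope * log10(rank) + intercept by least squares.

    `degrees` holds one degree per vertex (at least two, not all equal).
    Rank 1 is the vertex with the highest degree. The base-10 logarithm
    is an assumption made for this sketch.
    """
    ranked = sorted(degrees, reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    n = len(ranked)
    mean_x = sum(xs) / n
    mean_y = sum(ranked) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ranked))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination of the fitted line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ranked))
    ss_tot = sum((y - mean_y) ** 2 for y in ranked)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared
```

A negative slope with $R^2$ close to 1 would indicate that the degree falls off roughly linearly in the logarithm of the rank.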

In some cases, one vertex may belong to more than one cluster. This might happen in a social network, where each vertex represents a person and the clusters represent different groups of people: one cluster for family, another for coworkers, one for friends in the same sports club, and so on. Nonetheless, in our password graph we postulate that a vertex can be part of only one cluster, provided that it belongs to some cluster at all. Clusters in the password graph are important because of their metaproperties. We propose that a cluster in the password graph acts like a metanode in the graph: intercluster properties differ more significantly than internode properties.

A top password list [39, 40] is generally used as a banned list to prevent users from choosing common passwords and circumventing security policies. Top password structures [9] are used specifically to make password cracking more efficient. We propose, for the first time, the concept of top clusters to replace both the top password list and top password structures, because top clusters describe the weak passwords of a password dataset more accurately: not only are the passwords that occur in the top password list, or whose structures appear among the top password structures, weak, but so are all the passwords belonging to the same cluster. Table 4 lists a segment of four clusters in Rootkit. Clusters 1, 2, and 4 are common clusters, which can also be found in other password graphs; cluster 3 appears only in Rootkit and is a site-specific cluster. All of them are weak passwords and should be added to the banned list of an Internet service provider. Combining Table 4 with Figures 2, 3, and 7, we conclude that blocking top clusters enhances the ability of the password dataset to resist trawling attacks.

Table 4: A segment of four clusters in Rootkit.

Cluster 1    Cluster 2    Cluster 3    Cluster 4
password     123456       rootkit      qwerty
passwordx    123456x      irootkit     qwerty
password0    1234569      1rootkit     qwertyu
Password     12345r       rootkit2     qwertyz
Password     123456       rootkit3     qwerty
password0    12345s       rootkit      qwertyr
passw0rk     1234566      rootkit      qwerta
password     1234565      4rootkit     qwerty7
p4s5w0rd     1234569      rrootkit     qwerty9
pssword      1234561      rootKit      qwerty111
Password     12345678     RootKit      qwerty12
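One way to picture top clusters is sketched below. It assumes, purely for illustration, that a cluster is a connected component of the password graph with at least two vertices, which satisfies the postulate that a vertex belongs to at most one cluster. The clustering procedure actually used for the password graph may differ, and the name `top_clusters` is hypothetical.

```python
from collections import defaultdict


def top_clusters(vertices, edges, k):
    """Return the k largest clusters, largest first.

    Assumption for this sketch: a cluster is a connected component with
    more than one vertex, so isolated passwords belong to no cluster.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for v in vertices:
        groups[find(v)].add(v)

    clusters = [members for members in groups.values() if len(members) > 1]
    clusters.sort(key=len, reverse=True)
    return clusters[:k]
```

The union of the returned sets could then serve as a banned list, in place of a fixed top password list.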

One of the many practical applications of top clusters is building effective mangling rules; these rules differ slightly from those used by hashcat [24] and JTR [23]. To obtain effective mangling rules from a dataset of passwords, one can compare every pair of passwords in the same top cluster. The derived rules are the operations allowed in the Levenshtein distance: insertion, deletion, and substitution. Substitution is a superset of the Leet operation; insertion is a superset of prepending and appending. Our visualization method could also be used as a password strength meter. After a user finishes entering his/her new password, we evaluate whether it belongs to one of the existing top clusters; if it does, we reject the password and ask for a new one; otherwise, we accept it.

7. Conclusions

In this paper, we have explored possible relationships between passwords (modification-based, similarity-based, and probability-based) to aid the construction of a password graph. On the basis of graph theory and previous surveys, and regarding passwords as vertices, we select the modification-based relationship and fix a threshold value of 3 to define the edges in the graph. Our password graph enables the introduction of concepts from graph theory, including degree, density, distance, diameter, and cluster.

The password graph has some novel implications for password research, and we use it as an alternative analysis method for password datasets. We find, surprisingly, that for the threshold 3 a password can be similar to as many as 3380 passwords in the PhpBB dataset, and that passwords evidently tend to gather into clusters, just as people do in a social network. Moreover, by applying linear regression, we discover that the degrees of the password graph follow a logarithmic distribution. Our findings suggest that the top password list in adaptive password checking should be replaced with top clusters to reach a higher level of security. Ultimately, the password graph could also enable the creation of effective mangling rules that apply universally in dictionary attacks and that need only a large training dataset, without requiring a priori knowledge of user habits or personal information. The password graph can also be used for a new kind of password strength meter.

Notations

$V$: Vertex set
$E$: Edge set
$\Delta(G)$: $\max\{d(p) \mid p \in V\}$, the maximum degree of all vertices
$\delta(G)$: $\min\{d(p) \mid p \in V\}$, the minimum degree of all vertices
$Z(G)$: $|\{p \in V \mid d(p) = 0\}|$, the number of vertices whose degree is 0
$D(G)$: $2|E| \div (|V|(|V| - 1))$, density of the password graph
$K(G)$: Vertex connectivity
$\lambda(G)$: Edge connectivity
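As a small worked illustration of these notations, the sketch below builds a modification-based graph over a handful of passwords and computes $\Delta(G)$, $\delta(G)$, $Z(G)$, and $D(G)$. It rests on assumptions that are not fixed by the notation itself: vertices are distinct passwords, and an edge joins two passwords whose Levenshtein distance is at most the threshold (whether the comparison is strict is assumed here). The names `levenshtein` and `graph_metrics` are hypothetical, and the pairwise loop is quadratic, so it only suits small samples.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # match or substitution
        prev = curr
    return prev[-1]


def graph_metrics(passwords, threshold=3):
    """Compute max degree, min degree, isolated count, and density.

    Requires at least two distinct passwords. The edge rule
    (distance <= threshold) is an assumption of this sketch.
    """
    vertices = sorted(set(passwords))
    degree = {p: 0 for p in vertices}
    edge_count = 0
    for p, q in combinations(vertices, 2):
        if levenshtein(p, q) <= threshold:
            degree[p] += 1
            degree[q] += 1
            edge_count += 1
    n = len(vertices)
    return {
        "max_degree": max(degree.values()),                      # Delta(G)
        "min_degree": min(degree.values()),                      # delta(G)
        "isolated": sum(1 for d in degree.values() if d == 0),   # Z(G)
        "density": 2 * edge_count / (n * (n - 1)),               # D(G)
    }
```

For a real dataset, a graph library such as graph-tool [38] would be the more practical choice; the pure-Python version is only meant to make the definitions concrete.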

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors thank Dr. Ding Wang for his help and support. This research was supported by the National Key R&D Program of China under no. 2016YFB0800603 and no. 2017YFB1200700 and the National Natural Science Foundation of China (NSFC) under Grant no. 61472016.

References

[1] Z. Zhao, G.-J. Ahn, and H. Hu, “Picture gesture authentication: empirical analysis, automated attacks, and scheme evaluation,” ACM Transactions on Information and System Security, vol. 17, no. 4, article 14, 37 pages, 2015.

[2] A. K. Jain, A. Ross, and S. Pankanti, “Biometrics: a tool for information security,” IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 125–143, 2006.

[3] N. Zheng, A. Paloski, and H. Wang, “An efficient user verification system via mouse movements,” in Proceedings of the 18th ACM Conference on Computer and Communications Security, pp. 139–150, ACM, October 2011.

[4] D. Florencio and C. Herley, “A large-scale study of web password habits,” in Proceedings of the 16th International World Wide Web Conference (WWW ’07), pp. 657–666, ACM, May 2007.

[5] J. Yan, A. F. Blackwell, R. J. Anderson, and A. Grant, “Password memorability and security: empirical results,” IEEE Security & Privacy, vol. 2, no. 5, pp. 25–31, 2004.

[6] W. Cheswick, “Rethinking passwords,” Communications of the ACM, vol. 56, no. 2, pp. 40–44, 2013.

[7] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, “Passwords and the evolution of imperfect authentication,” Communications of the ACM, vol. 58, no. 7, pp. 78–87, 2015.

[8] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, “The quest to replace passwords: a framework for comparative evaluation of web authentication schemes,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP ’12), pp. 553–567, IEEE, San Francisco, Calif, USA, May 2012.

[9] D. Wang, H. Cheng, P. Wang, and X. Huang, “Understanding passwords of Chinese users: characteristics, security and implications, CACR report,” in Proceedings of the ChinaCrypt 2015, 2015, http://t.cn/RG8RacH.

[10] A. Adams and M. A. Sasse, “Users are not the enemy,” Communications of the ACM, vol. 42, no. 12, pp. 40–46, 1999.

[11] A. Beautement, M. A. Sasse, and M. Wonham, “The compliance budget: managing security behaviour in organisations,” in Proceedings of the Workshop on New Security Paradigms (NSPW ’08), pp. 47–58, ACM, Lake Tahoe, Calif, USA, September 2008.

[12] D. Wang, D. He, H. Cheng, and P. Wang, “FuzzyPSM: a new password strength meter using fuzzy probabilistic context-free grammars,” in Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’16), pp. 595–606, IEEE, July 2016.

[13] L. Wang, Y. Li, and K. Sun, “Amnesia: a bilateral generative password manager,” in Proceedings of the 36th IEEE International Conference on Distributed Computing Systems (ICDCS ’16), pp. 313–322, IEEE, June 2016.

[14] D. McCarney, D. Barrera, J. Clark, S. Chiasson, and P. C. Van Oorschot, “Tapas: design, implementation, and usability evaluation of a password manager,” in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 89–98, ACM, December 2012.

[15] B. Ives, K. R. Walsh, and H. Schneider, “The domino effect of password reuse,” Communications of the ACM, vol. 47, no. 4, pp. 75–78, 2004.

[16] D. Wang, H. Cheng, P. Wang, X. Huang, and G. Jian, “Zipf’s law in passwords,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 11, pp. 2776–2791, 2017.

[17] M. Weir, S. Aggarwal, B. De Medeiros, and B. Glodek, “Password cracking using probabilistic context-free grammars,” in Proceedings of the 30th IEEE Symposium on Security and Privacy, pp. 391–405, IEEE, May 2009.

[18] S. Houshmand and S. Aggarwal, “Building better passwords using probabilistic techniques,” in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 109–118, ACM, December 2012.

[19] R. Veras, C. Collins, and J. Thorpe, “On the semantic patterns of passwords and their security impact,” in Proceedings of the Network and Distributed System Security Symposium (NDSS ’14), San Diego, Calif, USA, 2014.

[20] M. Dell’Amico and M. Filippone, “Monte Carlo strength evaluation: fast and reliable password checking,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 158–169, ACM, October 2015.

[21] R. Morris and K. Thompson, “Password security: a case history,” Communications of the ACM, vol. 22, no. 11, pp. 594–597, 1979.

[22] D. V. Klein, “Foiling the cracker: a survey of, and improvements to, password security,” in Proceedings of the 2nd USENIX Security Workshop, pp. 5–14, 1990.

[23] John the Ripper password cracker, Openwall, http://www.openwall.com/john/.

[24] hashcat, advanced password recovery, https://hashcat.net/hashcat/.

[25] J. Ma, W. Yang, M. Luo, and N. Li, “A study of probabilistic password models,” in Proceedings of the 35th IEEE Symposium on Security and Privacy (SP ’14), pp. 689–704, IEEE, May 2014.

[26] R. Veras, J. Thorpe, and C. Collins, “Visualizing semantics in passwords: the role of dates,” in Proceedings of the 9th International Symposium on Visualization for Cyber Security, pp. 88–95, ACM, October 2012.

[27] M. Weir, S. Aggarwal, M. Collins, and H. Stern, “Testing metrics for password creation policies by attacking large sets of revealed passwords,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 162–175, ACM, October 2010.

[28] C. Castelluccia, M. Durmuth, and D. Perito, “Adaptive password-strength meters from Markov models,” in Proceedings of the 19th Annual Network & Distributed System Security Symposium (NDSS ’12), February 2012.

[29] J. Bonneau, “The science of guessing: analyzing an anonymized corpus of 70 million passwords,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP ’12), pp. 538–552, IEEE, San Francisco, Calif, USA, May 2012.

[30] Z. Li, W. Han, and W. Xu, “A large-scale empirical analysis of Chinese web passwords,” in Proceedings of the USENIX Security Symposium, pp. 559–574, 2014.

[31] B. L. Riddle, M. S. Miron, and J. A. Semo, “Passwords in use in a university timesharing environment,” Computers & Security, vol. 8, no. 7, pp. 569–579, 1989.

[32] R. Shay, S. Komanduri, P. G. Kelley et al., “Encountering stronger password requirements: user attitudes and behaviors,” in Proceedings of the 6th Symposium on Usable Privacy and Security, vol. 2, ACM, July 2010.

[33] Y. Zhang, F. Monrose, and M. K. Reiter, “The security of modern password expiration: an algorithmic framework and empirical analysis,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 176–186, October 2010.

[34] X. Guo, H. Chen, X. Liu, X. Xu, and Z. Chen, “The scale-free network of passwords: visualization and estimation of empirical passwords,” https://arxiv.org/abs/1511.08324.

[35] S. Perez, “Recently confirmed Myspace hack could be the largest yet,” May 2016.

[36] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, 1966.

[37] A. Das, J. Bonneau, M. Caesar, N. Borisov, and X. Wang, “The tangled web of password reuse,” in Proceedings of the Network and Distributed System Security Symposium (NDSS ’14), vol. 14, pp. 23–26, 2014.

[38] graph-tool: efficient network analysis with Python, https://graph-tool.skewed.de/.

[39] A. Vance, “If your password is 123456, just make it hackme,” The New York Times, vol. 20, p. A1, 2010.

[40] S. Schechter, C. Herley, and M. Mitzenmacher, “Popularity is everything: a new approach to protecting passwords from statistical-guessing attacks,” in Proceedings of the 5th USENIX Conference on Hot Topics in Security, pp. 1–8, USENIX Association, 2010.

Page 11: An Alternative Method for Understanding User-Chosen Passwords

Security and Communication Networks 11

[23] To get the effective mangling rules from a dataset ofpasswords one can compare every two passwords in thesame top cluster to obtain the mangling rules Our derivedrules are actually the operations allowed in Levenshteindistance insertion deletion and substitution Substitution isthe superset of Leet operation insertion is the superset ofprepending and appending Our visualization method couldalso be used for password strengthmeter After a user finishesentering hisher new password we can evaluate whether itbelongs to one of the existing top clusters if it does then wereject this password asking for a new one else we accept it

7 Conclusions

In this paper we have explored possible relationshipsbetween passwords modification-based similarity-basedand probability-based to aid the construction of passwordgraph On the basis of graph theory and previous surveysregarding passwords as vertices we selectmodification-basedrelationship and fix a threshold value of 3 to define the edgesin the graph Our password graph enables the introductionof concepts from graph theory including degree densitydistance diameter and cluster

Password graph has some novel implications built-in forpassword research we use them as an alternative analysismethod for password dataset We find surprisingly that forthe threshold 3 a password could be similar up to 3380 pass-words in the PhpBB dataset and have evidently the tendencyto gather together as a cluster just like it is in a social networkMoreover by applying linear regressionwediscover that pass-word graph has logarithmic distribution for its degrees Ourfindings suggest that top passwords list in adaptive passwordchecking should be replaced with top clusters to reach ahigher level of security Ultimately password graph couldalso enable the creation of effective mangling rules applieduniversally in dictionary attack which only needs a largedataset of training sets without requiring a priori knowledgeof user habits or personal information Also password graphcan be used for a new kind of password strength meter

Notations

119881 Vertex set119864 Edge setΔ(119866) max119889(119901) | 119901 isin 119881 the maximumdegree

of all vertices120575(119866) min119889(119901) | 119901 isin 119881 the minimumdegree

of all vertices119885(119866) |119889(119901) | 119901 isin 119881 119889(119901) = 0| the number of

vertices whose degree is 0119863(119866) 2|119864| divide (|119881|(|119881| minus 1)) density of the

password graph119870(119866) Vertex connectivity120582(119866) Edge connectivity

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this article

Acknowledgments

The authors thank Dr Ding Wang for his help and supportThis research was supported by the National Key RampDProgram of China under no 2016YFB0800603 and no2017YFB1200700 and the National Natural Science Founda-tion of China (NSFC) under Grant no 61472016

References

[1] Z Zhao G-J Ahn and H Hu ldquoPicture gesture authenticationempirical analysis automated attacks and scheme evaluationrdquoACM Transactions on Information and System Security vol 17no 4 article 14 37 pages 2015

[2] A K Jain A Ross and S Pankanti ldquoBiometrics a toolfor information securityrdquo IEEE Transactions on InformationForensics and Security vol 1 no 2 pp 125ndash143 2006

[3] N Zheng A Paloski and H Wang ldquoAn efficient user verifica-tion system via mouse movementsrdquo in Proceedings of the 18thACM Conference on Computer and Communications Securitypp 139ndash150 ACM October 2011

[4] D Florencio andCHerley ldquoA large-scale study ofweb passwordhabitsrdquo in Proceedings of the 16th International WorldWideWebConference (WWW rsquo07) pp 657ndash666 ACM May 2007

[5] J Yan A F Blackwell R J Anderson and A Grant ldquoPasswordmemorability and security empirical resultsrdquo IEEE Security ampPrivacy vol 2 no 5 pp 25ndash31 2004

[6] W Cheswick ldquoRethinking passwordsrdquo Communications of theACM vol 56 no 2 pp 40ndash44 2013

[7] J Bonneau C Herley P C Van Oorschot and F StajanoldquoPasswords and the evolution of imperfect authenticationrdquoCommunications of the ACM vol 58 no 7 pp 78ndash87 2015

[8] J Bonneau C Herley P C Van Oorschot and F StajanoldquoThe quest to replace passwords a framework for comparativeevaluation of web authentication schemesrdquo in Proceedings of the33rd IEEE Symposium on Security and Privacy (SP rsquo12) pp 553ndash567 IEEE San Francisco Calif USA May 2012

[9] D Wang H Cheng P Wang and X Huang ldquoUnderstandingpasswords of Chinese users characteristics security and impli-cations CACR reportrdquo in Proceedings of the ChinaCrypt 20152015 httptcnRG8RacH

[10] A Adams and M A Sasse ldquoUsers are not the enemyrdquo Commu-nications of the ACM vol 42 no 12 pp 40ndash46 1999

[11] A Beautement M A Sasse andMWonham ldquoThe compliancebudget managing security behaviour in organisationsrdquo inProceedings of theWorkshop on New Security Paradigms (NSPWrsquo08) pp 47ndash58 ACM Lake Tahoe Calif USA September 2008

[12] D Wang D He H Cheng and P Wang ldquoFuzzyPSM a newpassword strength meter using fuzzy probabilistic context-freegrammarsrdquo in Proceedings of the 46th IEEEIFIP InternationalConference on Dependable Systems and Networks (DSN rsquo16) pp595ndash606 IEEE July 2016

[13] LWang Y Li andK Sun ldquoAmnesia a bilateral generative pass-word managerrdquo in Proceedings of the 36th IEEE InternationalConference on Distributed Computing Systems (ICDCS rsquo16) pp313ndash322 IEEE June 2016

[14] D McCarney D Barrera J Clark S Chiasson and P CVan Oorschot ldquoTapas design implementation and usabilityevaluation of a password managerrdquo in Proceedings of the 28thAnnual Computer Security Applications Conference pp 89ndash98ACM December 2012

12 Security and Communication Networks

[15] B Ives K R Walsh and H Schneider ldquoThe domino effect ofpassword reuserdquo Communications of the ACM vol 47 no 4 pp75ndash78 2004

[16] DWang H Cheng PWang X Huang and G Jian ldquoZipf rsquos lawin passwordsrdquo IEEE Transactions on Information Forensics andSecurity vol 12 no 11 pp 2776ndash2791 2017

[17] M Weir S Aggarwal B De Medeiros and B Glodek ldquoPass-word cracking using probabilistic context-free grammarsrdquo inProceedings of the 30th IEEE Symposiumon Security andPrivacypp 391ndash405 IEEE May 2009

[18] S Houshmand and S Aggarwal ldquoBuilding better passwordsusing probabilistic techniquesrdquo in Proceedings of the 28thAnnual Computer Security Applications Conference pp 109ndash118ACM December 2012

[19] R Veras C Collins and J Thorpe ldquoOn the semantic patternsof passwords and their security impactrdquo in Proceedings of theNetwork andDistributed SystemSecurity Symposium (NDSS rsquo14)San Diego Calif USA 2014

[20] M DellrsquoAmico and M Filippone ldquoMonte Carlo strength evalu-ation fast and reliable password checkingrdquo in Proceedings of the22nd ACM SIGSAC Conference on Computer and Communica-tions Security pp 158ndash169 ACM October 2015

[21] RMorris andKThompson ldquoPassword security a case historyrdquoCommunications of the ACM vol 22 no 11 pp 594ndash597 1979

[22] D V Klein ldquoFoiling the cracker a survey of and improvementsto password securityrdquo in Proceedings of the 2nd USENIXSecurity Workshop pp 5ndash14 1990

[23] John the ripper password crackermdashopenwall httpwwwopenwallcomjohn

[24] hashcatmdashadvanced password recovery httpshashcatnethashcat

[25] J Ma W Yang M Luo and N Li ldquoA study of probabilisticpassword modelsrdquo in Proceedings of the 35th IEEE Symposiumon Security and Privacy (SP rsquo14) pp 689ndash704 IEEE May 2014

[26] R Veras J Thorpe and C Collins ldquoVisualizing semanticsin passwords the role of datesrdquo in Proceedings of the 9thInternational Symposium onVisualization for Cyber Security pp88ndash95 ACM October 2012

[27] MWeir S AggarwalM Collins andH Stern ldquoTestingmetricsfor password creation policies by attacking large sets of revealedpasswordsrdquo in Proceedings of the 17th ACM Conference onComputer and Communications Security pp 162ndash175 ACMOctober 2010

[28] C Castelluccia M Durmuth and D Perito ldquoAdaptivepassword-strengthmeters frommarkovmodelsrdquo in Proceedingsof the 19th Annual Network amp Distributed System SecuritySymposium (NDSS rsquo12) February 2012

[29] J Bonneau ldquoThe science of guessing analyzing an anonymizedcorpus of 70 million passwordsrdquo in Proceedings of the 33rd IEEESymposium on Security and Privacy (SP rsquo12) pp 538ndash552 IEEESan Francisco Calif USA May 2012

[30] Z Li W Han and W Xu ldquoA large-scale empirical analysis ofchinese web passwordsrdquo in Proceedings of the USENIX SecuritySymposium pp 559ndash574 2014

[31] B L Riddle M S Miron and J A Semo ldquoPasswords in use ina university timesharing environmentrdquo Computers amp Securityvol 8 no 7 pp 569ndash579 1989

[32] R Shay S Komanduri P G Kelley et al ldquoEncounteringstronger password requirements user attitudes and behaviorsrdquoin Proceedings of the 6th Symposium on Usable Privacy andSecurity vol 2 ACM July 2010

[33] Y Zhang FMonrose andMK Reiter ldquoThe security ofmodernpassword expiration an algorithmic framework and empiricalanalysisrdquo in Proceedings of the 17th ACM Conference on Com-puter and Communications Security pp 176ndash186 October 2010

[34] X Guo H Chen X Liu X Xu and Z Chen ldquoThe scale-freenetwork of passwords visualization and estimation of empiricalpasswordsrdquo httpsarxivorgabs151108324

[35] S Perez ldquoRecently confirmedMyspace hack could be the largestyetrdquo May 2016

[36] V I Levenshtein ldquoBinary codes capable of correcting deletionsinsertions and reversalsrdquo Soviet PhysicsmdashDoklady vol 10 no8 pp 707ndash710 1966

[37] A Das J Bonneau M Caesar N Borisov and X Wang ldquoThetangled web of password reuserdquo in Proceedings of the Networkand Distributed System Security Symposium (NDSS rsquo14) vol 14pp 23ndash26 2014

[38] graph-tool efficent network analysis with python httpsgraph-toolskewedde

[39] A Vance ldquoIf your password is 123456 just make it hackmerdquoTheNew York Times vol 20 p A1 2010

[40] S Schechter C Herley and M Mitzenmacher ldquoPopularityis everything a new approach to protecting passwords fromstatistical-guessing attacksrdquo in Proceedings of the 5th USENIXConference on Hot Topics in Security pp 1ndash8 USENIX Associa-tion 2010

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 12: An Alternative Method for Understanding User-Chosen Passwords

12 Security and Communication Networks

[15] B Ives K R Walsh and H Schneider ldquoThe domino effect ofpassword reuserdquo Communications of the ACM vol 47 no 4 pp75ndash78 2004

[16] DWang H Cheng PWang X Huang and G Jian ldquoZipf rsquos lawin passwordsrdquo IEEE Transactions on Information Forensics andSecurity vol 12 no 11 pp 2776ndash2791 2017

[17] M. Weir, S. Aggarwal, B. De Medeiros, and B. Glodek, “Password cracking using probabilistic context-free grammars,” in Proceedings of the 30th IEEE Symposium on Security and Privacy, pp. 391–405, IEEE, May 2009.

[18] S. Houshmand and S. Aggarwal, “Building better passwords using probabilistic techniques,” in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 109–118, ACM, December 2012.

[19] R. Veras, C. Collins, and J. Thorpe, “On the semantic patterns of passwords and their security impact,” in Proceedings of the Network and Distributed System Security Symposium (NDSS ’14), San Diego, Calif, USA, 2014.

[20] M. Dell’Amico and M. Filippone, “Monte Carlo strength evaluation: fast and reliable password checking,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 158–169, ACM, October 2015.

[21] R. Morris and K. Thompson, “Password security: a case history,” Communications of the ACM, vol. 22, no. 11, pp. 594–597, 1979.

[22] D. V. Klein, “Foiling the cracker: a survey of, and improvements to, password security,” in Proceedings of the 2nd USENIX Security Workshop, pp. 5–14, 1990.

[23] John the Ripper password cracker, Openwall, http://www.openwall.com/john.

[24] hashcat: advanced password recovery, https://hashcat.net/hashcat.

[25] J. Ma, W. Yang, M. Luo, and N. Li, “A study of probabilistic password models,” in Proceedings of the 35th IEEE Symposium on Security and Privacy (SP ’14), pp. 689–704, IEEE, May 2014.

[26] R. Veras, J. Thorpe, and C. Collins, “Visualizing semantics in passwords: the role of dates,” in Proceedings of the 9th International Symposium on Visualization for Cyber Security, pp. 88–95, ACM, October 2012.

[27] M. Weir, S. Aggarwal, M. Collins, and H. Stern, “Testing metrics for password creation policies by attacking large sets of revealed passwords,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 162–175, ACM, October 2010.

[28] C. Castelluccia, M. Durmuth, and D. Perito, “Adaptive password-strength meters from Markov models,” in Proceedings of the 19th Annual Network & Distributed System Security Symposium (NDSS ’12), February 2012.

[29] J. Bonneau, “The science of guessing: analyzing an anonymized corpus of 70 million passwords,” in Proceedings of the 33rd IEEE Symposium on Security and Privacy (SP ’12), pp. 538–552, IEEE, San Francisco, Calif, USA, May 2012.

[30] Z. Li, W. Han, and W. Xu, “A large-scale empirical analysis of Chinese web passwords,” in Proceedings of the USENIX Security Symposium, pp. 559–574, 2014.

[31] B. L. Riddle, M. S. Miron, and J. A. Semo, “Passwords in use in a university timesharing environment,” Computers & Security, vol. 8, no. 7, pp. 569–579, 1989.

[32] R. Shay, S. Komanduri, P. G. Kelley et al., “Encountering stronger password requirements: user attitudes and behaviors,” in Proceedings of the 6th Symposium on Usable Privacy and Security, vol. 2, ACM, July 2010.

[33] Y. Zhang, F. Monrose, and M. K. Reiter, “The security of modern password expiration: an algorithmic framework and empirical analysis,” in Proceedings of the 17th ACM Conference on Computer and Communications Security, pp. 176–186, October 2010.

[34] X. Guo, H. Chen, X. Liu, X. Xu, and Z. Chen, “The scale-free network of passwords: visualization and estimation of empirical passwords,” https://arxiv.org/abs/1511.08324.

[35] S. Perez, “Recently confirmed Myspace hack could be the largest yet,” May 2016.

[36] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, 1966.

[37] A. Das, J. Bonneau, M. Caesar, N. Borisov, and X. Wang, “The tangled web of password reuse,” in Proceedings of the Network and Distributed System Security Symposium (NDSS ’14), vol. 14, pp. 23–26, 2014.

[38] graph-tool: efficient network analysis with Python, https://graph-tool.skewed.de.

[39] A. Vance, “If your password is 123456, just make it hackme,” The New York Times, vol. 20, p. A1, 2010.


