string patterns: searching for interesting words and numbers
Post on 07-Apr-2018
-
8/3/2019 String Patterns: Searching for Interesting Words and Numbers
1/53
String Patterns: Searching for Interesting Words and Numbers
Roger Bilisoly, PhD
Associate Professor of Statistics
Central Connecticut State University
Department of Mathematics
Amherst College, Amherst, Massachusetts
Thursday, October 6, 2011
-
Overview of Talk
String Patterns and Examples
Unusual words, squares, and primes
Anagrams of Words and Numbers, including square anagrams
Birthday Problem and Pangrams
Analyzing Dickens's A Christmas Carol
-
1. String Patterns
Regular expressions (also called regexes) are used to find string patterns. A variety of software packages implement them, e.g., Mathematica, Perl, SAS, Emacs, and so forth.
We'll use them to find interesting words (from wordlists available on the web) and interesting numbers (e.g., squares with unusual digit patterns).
-
Perl Regexes
For a wordlist, one word per line:
/cat/ would match "cat", "cats", and "scatter" but NOT "Cat"
/[cC]at/ would match "cat", "Cat", or "Catcher"
/cat/i would match "cat", "CaT", or "sCaTtEr" (i stands for case insensitive)
/cat|dog/ would match either "cat" or "dog"
We'll see examples of more complex string patterns.
See Chapter 2 of Bilisoly (2008b).
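The four patterns above can be tried out directly. The following is a minimal sketch in Python rather than Perl; the sample word list and the helper name `matches` are made up for illustration.

```python
import re

# Hypothetical sample list standing in for a one-word-per-line wordlist.
words = ["cat", "cats", "scatter", "Cat", "Catcher", "sCaTtEr", "dog"]

def matches(pattern, flags=0):
    """Return the words in which the pattern occurs anywhere."""
    regex = re.compile(pattern, flags)
    return [w for w in words if regex.search(w)]

print(matches(r"cat"))          # case-sensitive search
print(matches(r"[cC]at"))       # either case for the first letter
print(matches(r"cat", re.I))    # re.I plays the role of Perl's /i
print(matches(r"cat|dog"))      # alternation
```

`re.search` finds the pattern anywhere in the word, which mirrors how an unanchored Perl regex behaves.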
-
Example: Some Unusual Words
Are there other words like "bookkeeper" with three double letters in a row?
Essentially no: {bookkeeper, bookkeepers, bookkeeping, bookkeepings}
Are there words with "mile" in their interior?
{besmiled, besmiles, camomiles, facsimiles, homiletic, outsmiled, outsmiles, similes, smiled, smiler, smilers, smiles}
(* Mathematica: CROSSWD.TXT holds one word per line *)
wordlist = Import["c:\\CROSSWD.TXT", "Lines"];
threepair = Pick[wordlist, StringMatchQ[wordlist,
    RegularExpression[".*(.)\\1(.)\\2(.)\\3.*"]]]
milewords = Pick[wordlist, StringMatchQ[wordlist,
    RegularExpression[".+mile.+"]]]
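For readers without Mathematica, the same two searches can be sketched in Python. This is an illustration on a small made-up sample, not the CROSSWD.TXT list; `fullmatch` is used because `StringMatchQ` tests the whole string.

```python
import re

# Hypothetical sample; the talk uses the full CROSSWD.TXT wordlist instead.
sample = ["bookkeeper", "bookkeeping", "balloon", "committee", "smiled",
          "similes", "mile", "miles", "homiletic"]

# Three doubled letters in a row: each (.) is followed by a backreference to itself.
three_pairs = re.compile(r".*(.)\1(.)\2(.)\3.*")
# "mile" with at least one character before and after it.
inner_mile = re.compile(r".+mile.+")

threepair = [w for w in sample if three_pairs.fullmatch(w)]
milewords = [w for w in sample if inner_mile.fullmatch(w)]

print(threepair)
print(milewords)
```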
-
Word Graphs
[Figure: word graphs drawn for "before", "state", "within", "prepared", "eagle", "concern", "decency", "rather", and "prospers".]
Each node is a distinct letter, and each edge connects letters that are adjacent in the word. Graphs are directed: the arrows are understood.
From Section 24 of Eckler (1996).
-
Searching for word graphs
Linear words have no branches, e.g., A-M-H-E-R-S-T.
o What is the longest such word?
o Answer: "ambidextrously" has 14 letters.
o "lycanthropies", "metalworkings", "multibranched", and "unpredictably" are the only examples with 13 letters.
How many square cyclic words are there (like EAGLE)?
o Need to match the regular expression /(.)...\1/ and have 4 distinct letters. The latter can be checked by taking an intersection.
o 417 such words, including "dazed" (and other 4-letter weak verbs starting with d, in the past tense), "sails" (and other 4-letter nouns starting with s), etc.
Longest cyclic words are 12 letters long: spaceflights, speculations, subharmonics, subordinates, switchblades, switchboards, sympathizers.
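The two tests described above are easy to sketch in Python. The function names are hypothetical, and `is_linear` rests on an assumption: it reads "no branches" as "no repeated letter", which fits the examples given but is not spelled out on the slide.

```python
import re

def is_square_cyclic(word):
    """5-letter word matching (.)...\\1 in full, with exactly 4 distinct letters."""
    return bool(re.fullmatch(r"(.)...\1", word)) and len(set(word)) == 4

def is_cyclic(word):
    """Generalization: first letter equals last, and no other letter repeats."""
    return (len(word) >= 4 and word[0] == word[-1]
            and len(set(word)) == len(word) - 1)

def is_linear(word):
    """Assumed reading of 'no branches': every letter appears once."""
    return len(set(word)) == len(word)

for w in ["eagle", "dazed", "sails", "state", "spaceflights", "ambidextrously"]:
    print(w, is_square_cyclic(w), is_cyclic(w), is_linear(w))
```

Counting distinct letters with `set` replaces the intersection step mentioned on the slide.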
-
C(3) through C(8) Alphabets
(One cyclic word per initial letter; "-" marks a letter with no example.)
Letter  C(3)  C(4)   C(5)    C(6)     C(7)      C(8)
a       area  aroma  asthma  amnesia  asphyxia  angostura
b       bomb  blurb  benumb  brewpub  -         -
c       chic  comic  cosmic  chronic  catholic  chromatic
d       dead  dread  demand  dogsled  disabled  diagnosed
e       ease  eagle  excuse  eclipse  earphone  enjoyable
f       fief  -      -       -        -         -
g       gong  going  gaming  glowing  gambling  gathering
h       high  hatch  health  hawkish  hyacinth  haircloth
i       impi  iambi  incubi  intagli  -         -
j       -     -      -       -        -         -
k       kick  knock  kopeck  kinfolk  kinsfolk  -
l       leal  local  lawful  logical  lightful  -
m       maim  maxim  medium  midterm  mealworm  mechanism
n       noun  nylon  napkin  newborn  nitrogen  -
o       ouzo  outdo  overdo  oregano  obligato  -
p       pump  plump  pickup  parsnip  pawnshop  playgroup
q       -     -      -       -        -         -
r       roar  razor  rather  regular  roadster  regulator
s       saws  stars  shirts  sailors  sardines  seafronts
t       text  theft  throat  tourist  tolerant  therapist
u       unau  -      -       -        -         -
v       -     -      -       -        -         -
w       whew  widow  window  whipsaw  withdraw  worldview
x       -     xerox  -       -        -         -
y       -     yolky  yearly  -        yeastily  -
z       -     -      -       -        -         -
-
Cyclic Word Mathematica Code
-
Applying String Patterns to Square Integers (Squares)
Numbers are strings, too, so they are amenable to regexes.
We'll apply regexes to find some unusual squares.
We will also investigate the randomness of the digits of
squares.
-
Squares without Doubled Adjacent Digits
Suppose the digits in a square (in base 10) are random.
Then P(adjacent digits are unequal) = 9/10.
So P(100-digit square has no adjacent digits equal) = 0.9^99 = 0.0000295127, or about 30 in a million.
We can check this estimate by stochastic computer searches.
51737187749414391248343906418265954222307660356158 ^2 =
2676736596218154562635394063470527614352180564931651707698543214192825076469507142492903267408520964
92247034353159048872216907430912506658722672391337 ^2 =
8509515346952905702167423230637367421917540367430367521058492897148214941569637843428692738072647569
46684912798986257941402247358809860358456884854826 ^2 =
2179481083048950920786370528301507294579140187693485325858417852926259858127813068545985375095490276
72400570183600937943662908692744405525154232178778 ^2 =
5241842562910525153020949368218058035430952649615191319089815789036984216473806049461870608953573284
etc.
-
Code to Compute Counts of Such Squares
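(* 50 repetitions of searching a million random 100-digit squares *)
Do[
  total = 0;
  nsteps = 1000000;
  ndigits = 100;
  start = Ceiling[Sqrt[10^(ndigits - 1)]];  (* smallest root of an ndigits-digit square *)
  stop = Floor[Sqrt[10^ndigits - 1]];       (* largest root: 10^ndigits itself has ndigits + 1 digits *)
  Do[
    square = Random[Integer, {start, stop}]^2;
    match = StringCases[ToString[square], RegularExpression["(.)\\1"]];
    If[Length[match] == 0, ++total],
    {i, 1, nsteps}
  ];
  Print[total],
  {nreps, 1, 50}
]
50 repetitions of searching a million squares.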
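The same experiment can be sketched in Python, which handles 100-digit integers natively. The function names and the seed are illustrative assumptions; a full run of a million squares takes a while, so the demo below uses far fewer steps.

```python
import math
import random
import re

def random_square(ndigits, rng):
    """Square of a random integer chosen so that the square has exactly ndigits digits."""
    start = math.isqrt(10 ** (ndigits - 1) - 1) + 1  # smallest admissible root
    stop = math.isqrt(10 ** ndigits - 1)             # largest admissible root
    return rng.randint(start, stop) ** 2

def has_doubled_digit(n):
    """True if some digit of n is immediately repeated."""
    return re.search(r"(.)\1", str(n)) is not None

def count_without_doubles(ndigits, nsteps, seed=0):
    """Count sampled squares that have no doubled adjacent digit."""
    rng = random.Random(seed)
    return sum(not has_doubled_digit(random_square(ndigits, rng))
               for _ in range(nsteps))

estimate = 0.9 ** 99   # the independence-based estimate from the previous slide
print(f"estimate per square: {estimate:.10f}")
print("hits in 20000 samples:", count_without_doubles(100, 20000))
```

With about 30 expected hits per million, a small sample will usually report zero or one hit; the slide's 50 runs of a million squares are what make the comparison meaningful.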
-
Are the Digits of Squares Random?
The initial digits are not.
Leading digits  Lower  Upper  Lower^2  Upper^2
1               100    141    10000    19881
2               142    173    20164    29929
3               174    199    30276    39601
4               200    223    40000    49729
5               224    244    50176    59536
6               245    264    60025    69696
7               265    282    70225    79524
8               283    299    80089    89401
9               300    316    90000    99856
10              317    447    100489   199809
20              448    547    200704   299209
30              548    632    300304   399424
40              633    707    400689   499849
50              708    774    501264   599076
60              775    836    600625   698896
70              837    894    700569   799236
80              895    948    801025   898704
90              949    999    900601   998001
The limiting proportion of squares with leading digit k, for 1 ≤ k ≤ 9, is given by:
(Sqrt[k+1] - Sqrt[k] + Sqrt[10(k+1)] - Sqrt[10k])/9.
Digit  Prob.
1      19.16%
2      14.70%
3      12.39%
4      10.92%
5      9.87%
6      9.08%
7      8.45%
8      7.93%
9      7.50%
Total  100.00%
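The limiting proportions can be checked numerically. This Python sketch evaluates the formula and compares it with a direct count over small squares; the function names and the cutoff of 10000 are arbitrary choices for illustration.

```python
from math import sqrt

def leading_digit_share(k):
    """Limiting share of squares whose leading digit is k (1 <= k <= 9)."""
    return (sqrt(k + 1) - sqrt(k) + sqrt(10 * (k + 1)) - sqrt(10 * k)) / 9

def empirical_share(k, limit):
    """Share of n in [1, limit) whose square starts with digit k."""
    hits = sum(str(n * n)[0] == str(k) for n in range(1, limit))
    return hits / (limit - 1)

for k in range(1, 10):
    print(k, f"{leading_digit_share(k):.2%}", f"{empirical_share(k, 10000):.2%}")
```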
-
Aside: Benford's Law
Benford's Law says that initial digits often follow the following probability distribution:
Log[10, (k + 1)/k] for 1 ≤ k ≤ 9.
{a_n} satisfies Benford's law iff Log[10, a_n] (mod 1) is uniformly distributed. See http://en.wikipedia.org/wiki/Benford's_law.
Benford's Law does not fit the distribution of initial digits of squares.
Digit  Prob.
1      30.10%
2      17.61%
3      12.49%
4      9.69%
5      7.92%
6      6.69%
7      5.80%
8      5.12%
9      4.58%
Total  100.00%
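To see the mismatch side by side, the sketch below tabulates Benford's probabilities next to the limiting proportions for squares given earlier in the talk. It is a plain Python illustration; the dictionary names are arbitrary.

```python
from math import log10, sqrt

# Benford's law: P(leading digit = k) = log10((k + 1) / k)
benford = {k: log10((k + 1) / k) for k in range(1, 10)}

# Limiting proportions for leading digits of squares (formula from the earlier slide)
squares = {k: (sqrt(k + 1) - sqrt(k) + sqrt(10 * (k + 1)) - sqrt(10 * k)) / 9
           for k in range(1, 10)}

for k in range(1, 10):
    print(f"{k}  Benford {benford[k]:6.2%}  squares {squares[k]:6.2%}")
```

Benford's distribution puts more weight on the digit 1 and less on the digit 9 than the squares do.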
-
The final digits of squares are not random.
Final digit can only be: {0,1,4,5,6,9} (only 60% of possible one-digit endings are allowed)
Final two digits can only be: {00,01,04,09,16,21,24,25,29,36,41,44,49,56,61,64,69,76,81,84,89,96} (only 22% of possible two-digit endings allowed)
Final three digits can only be: {000,001,004,009,016,024,025,036,041,044,049,056,064,076,081,084,089,096,100,104,116,121,124,129,136,144,156,161,164,169,176,184,196,201,204,209,216,224,225,236,241,244,249,256,264,276,281,284,289,296,304,316,321,324,329,336,344,356,361,364,369,376,384,396,400,401,404,409,416,424,436,441,444,449,456,464,476,481,484,489,496,500,504,516,521,524,529,536,544,556,561,564,569,576,584,596,600,601,604,609,616,624,625,636,641,644,649,656,664,676,681,684,689,696,704,716,721,724,729,736,744,756,761,764,769,776,784,796,801,804,809,816,824,836,841,844,849,856,864,876,881,884,889,896,900,904,916,921,924,929,936,944,956,961,964,969,976,984,996} (only 15.9% allowed)
The percentages are decreasing, but to what value?
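These lists can be regenerated by brute force: the last n digits of m^2 depend only on m mod 10^n. A small Python sketch (the function name is illustrative):

```python
def square_endings(n):
    """Sorted list of the distinct values of m*m mod 10**n."""
    modulus = 10 ** n
    return sorted({(m * m) % modulus for m in range(modulus)})

for n in range(1, 5):
    endings = square_endings(n)
    print(n, len(endings), len(endings) / 10 ** n)
```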
-
Walter Penney's Theorem
Pick n digits at random, and let P(n) = probability that these n digits are the final digits of a square.
Theorem, Penney (1960): As n → ∞, P(n) → 5/72 ≈ 6.94%.
Penney shows that P(n) = (2^(n-1) + 4)(5^(n+1) + 7)/(36*10^n) for n even
and P(n) = (2^(n-1) + 5)(5^(n+1) + 11)/(36*10^n) for n odd.
For a proof see Walter Penney (1960), On the Final Digits of Squares.
Also see Walter Stangl (1996), Counting Squares in Z_n.
n  P(n)      P(n)/P(n-1)
1  .6000000  .6000
2  .2200000  .3667
3  .1590000  .7227
4  .1044000  .6566
5  .0912100  .8736
6  .0781320  .8564
7  .0748719  .9590
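Penney's closed forms can be checked against a brute-force count for small n. The sketch below uses exact fractions; the function names are illustrative.

```python
from fractions import Fraction

def penney(n):
    """Penney's closed form for P(n), as an exact fraction."""
    if n % 2 == 0:
        numerator = (2 ** (n - 1) + 4) * (5 ** (n + 1) + 7)
    else:
        numerator = (2 ** (n - 1) + 5) * (5 ** (n + 1) + 11)
    return Fraction(numerator, 36 * 10 ** n)

def brute_force(n):
    """Fraction of n-digit endings that occur as endings of squares."""
    modulus = 10 ** n
    return Fraction(len({(m * m) % modulus for m in range(modulus)}), modulus)

for n in range(1, 8):
    print(n, float(penney(n)))
print("limit 5/72 =", 5 / 72)
```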
-
Equi-Pandigital Primes
An equi-pandigital number in base b contains each digit from 0 through (b - 1) exactly the same number of times.
Theorem. For b > 3, there are no equi-pandigital primes.
Proof. Let n be an equi-pandigital number in base b. Then, mod (b - 1), n is congruent to the sum of its digits because b^k ≡ 1^k = 1. Let r be the number of repetitions of each of the digits 0, 1, 2, …, b - 1, which sum to b(b - 1)/2. So we have:
n ≡ r b(b - 1)/2 (mod b - 1).
If b is even, then n ≡ 0 (mod b - 1) since b/2 is an integer, so (b - 1) divides n.
If b is odd, then (b - 1)/2 is an integer, so either n ≡ 0 or n ≡ (b - 1)/2 (mod b - 1). In both cases, (b - 1)/2 divides n.
For b > 3, (b - 1) and (b - 1)/2 are nontrivial, so n is not prime. QED
Remark 1: Finding a base 10 equi-pandigital prime will take some trickery.
Remark 2: 10_2, 100101_2, 101001_2, 10001011_2, etc. are prime, as are 102_3, 201_3, 100012212_3, 100022112_3, etc. (subscripts give the base).
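The divisibility argument can be spot-checked by enumerating small cases. This Python sketch lists every equi-pandigital number for a given base and repetition count and tests the divisor the proof predicts; all names are illustrative, and the enumeration is only practical for small bases.

```python
from itertools import permutations

def is_prime(n):
    """Trial division; adequate for the small values used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def equi_pandigital_values(base, reps=1):
    """Values of all numbers using each digit 0..base-1 exactly reps times (no leading zero)."""
    digits = list(range(base)) * reps
    values = []
    for perm in set(permutations(digits)):
        if perm[0] != 0:
            value = 0
            for d in perm:
                value = value * base + d
            values.append(value)
    return sorted(values)

def forced_divisor(base):
    """Divisor guaranteed by the proof: b - 1 for even b, (b - 1)/2 for odd b."""
    return base - 1 if base % 2 == 0 else (base - 1) // 2

for base in (4, 5, 6):
    vals = equi_pandigital_values(base)
    print(base, all(v % forced_divisor(base) == 0 for v in vals))
print([v for v in equi_pandigital_values(3) if is_prime(v)])
```

For base 3 the forced divisor is 1, which is why primes such as 102_3 and 201_3 can occur there.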
-
Equi-Pandigital Gaussian Primes
Theorem 9.15 of Deskins (1964): Let p be a prime in Z, then all Gaussian primes in Z[i] fall into one of the following three cases, up to units.
(a) If p ≡ 3 (mod 4), then p is a Gaussian prime.
(b) If p ≡ 1 (mod 4), then p = (a + b i)(a - b i) is the Gaussian prime factorization, where a, b is the unique (up to order) solution to p = a^2 + b^2.
(c) If p = 2, then 2 = (1 + i)(1 - i) is the factorization into Gaussian primes.
Proof: See Deskins (1964).
Remark: Brillhart's algorithm can find a and b for p ≡ 1 (mod 4). See Williams (1995).
Let's make an estimate of the number of (5,5)-digit pandigital primes.
Since P(m ∈ Z is prime) ≈ 1/log(m) (just differentiate the logarithmic integral), we need to multiply this by the number of (a + b i)'s satisfying a > b and condition (b) above. Hence 10^5 > a > b > 10^4.5, which implies that 2(10^10) > a^2 + b^2. Pick two digits from 1 through 9 to use as the leading digits of a and b, and the rest of the digits can be put in any order, so the total number of expected Gaussian primes is approximately:
-
Search Results
By computer search, there are 69774 equi-pandigital Gaussian primes of the form a + b i, a > b > 0. Here are some interesting ones:
Pandigital Gaussian prime   Distinguishing property
96530 + 87421i              Max norm
20468 + 13597i              Min norm
98765 + 10234i              Max real - imaginary parts
60143 + 59872i              Min real - imaginary parts
86420 + 79513i              Largest real part with all even digits
20864 + 13579i              Smallest imaginary part with all odd digits
97531 + 82604i              Largest real part with all odd digits
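A single candidate can be tested with two small checks, sketched here in Python. The sketch relies on the standard fact that a + bi with both parts nonzero is a Gaussian prime exactly when a^2 + b^2 is prime; the function names are illustrative, and this is not the search code used for the talk.

```python
def is_prime(n):
    """Trial division by 2, 3 and numbers of the form 6k +/- 1."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    if n % 3 == 0:
        return n == 3
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

def is_gaussian_prime(a, b):
    """Test a + bi for primality in Z[i]."""
    if a == 0 or b == 0:
        m = abs(a + b)               # the one nonzero part
        return is_prime(m) and m % 4 == 3
    return is_prime(a * a + b * b)   # norm must be an ordinary prime

def is_pandigital_pair(a, b):
    """True if a and b together use each digit 0-9 exactly once."""
    return sorted(str(a) + str(b)) == list("0123456789")

a, b = 96530, 87421   # first entry of the table above
print(is_pandigital_pair(a, b), is_gaussian_prime(a, b))
```

Enumerating all ten-digit arrangements with a > b > 0 and applying both checks would reproduce the kind of search described above, at the cost of a much longer run.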
-
All the Equi-Pandigital Gaussian Primes
-
2. Anagrams of Words and Numbers
An anagram of a word is a non-identity permutation of that word's letters.
E.g., "Amherst" is an anagram of "hamster".
One-word anagrams are sometimes called transpositions in wordplay. In wordplay, some require an anagram to have a related meaning to the original word.
We'll also consider anagrams of numbers. In what follows, initial zeros are forbidden. E.g., 13^2 = 169, 14^2 = 196, and 31^2 = 961 are anagrams of each other (in base 10).
-
How to find English word anagrams
Step 1: Obtain a wordlist.
There are now a variety of sources available:
American Cryptogram Association at http://cryptogram.org/cdb/words/words.html
National Puzzlers' League at http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:about:start
Grady Ward's Moby word lists (in public domain) at http://icon.shef.ac.uk/Moby/
The above wordlists include all the inflected forms of words: nouns with both singular and plural forms, adjectives with comparative forms, verbs with all conjugated forms, etc.
-
Step 2: Read in the wordlist and store it in a hash.
Sample of the wordlist (one word per line in the file):
aa, aah, aahed, aahing, aahs, aal, aalii, aaliis, aals, aardvark, aardvarks, aardwolf, aardwolves, aas, aasvogel, aasvogels, aba, abaca, abacas, abaci, aback, abacus, abacuses
Perl program from Section 3.7.2 of Bilisoly (2008b):
open(WORDS, "CROSSWD.TXT") or die;
while (<WORDS>) {
  chomp;
  @letters = split(//);             # letters of the current word ($_)
  $key = join('', sort(@letters));  # hash key: the letters in alphabetical order
  if ( exists($dictionary{$key}) ) {
    $dictionary{$key} .= ",$_";     # key seen before: an anagram has been found
  } else {
    $dictionary{$key} = $_;
  }
}
foreach $key (sort keys %dictionary) {
  print "$key, $dictionary{$key}\n";
}
The hash key equals the letters of the word sorted in alphabetical order.
Examples:
aah -> aah
aahed -> aadeh
aahing -> aaghin
aardvark -> aaadkrrv
If the key already exists, then an anagram has been discovered. Example: evil, live, vile, veil all have the key eilv.
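The same hash-based idea carries over to other languages. Here is a minimal Python sketch with a made-up sample list; the function name and the sample are illustrative, and the output format imitates the Perl program's "key, word,word" lines.

```python
from collections import defaultdict

def anagram_dictionary(words):
    """Map each sorted-letter key to the list of words sharing that key."""
    groups = defaultdict(list)
    for word in words:
        key = "".join(sorted(word))   # same key as in the Perl program
        groups[key].append(word)
    return dict(sorted(groups.items()))

# Hypothetical sample; a real run would read the wordlist file line by line.
sample = ["evil", "live", "vile", "veil", "aah", "aahed", "hamster", "amherst"]
for key, group in anagram_dictionary(sample).items():
    print(f"{key}, {','.join(group)}")
```

Any key whose list holds two or more words marks a set of anagrams.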
-
Step 3: Print out the hash with the keys sorted in alphabetical order.
The result (see the sample below) is an anagram dictionary.
Invaluable for word games such as Scrabble and Jumble: just sort the letters at hand and check if they form a word.
Looking for entries with two or more commas reveals word anagrams.
Most words do not have anagrams.
aa, aa
aaaaabbcdrr, abracadabra
aaaabcceelrstu, baccalaureates
aaaabcceelrtu, baccalaureate
aaaabdilmorss, ambassadorial
aaaabenn, anabaena
aaaabenns, anabaenas
aaaaccdiiklllsy, lackadaisically
aaaaccdiiklls, lackadaisical
aaaaccrr, caracara
aaaaccrrs, caracaras
aaaacgnr, caragana
aaaacgnrs, caraganas
aaaacmnrst, catamarans
aaaacmnrt, catamaran
-
Numbers are formed from the alphabet {0,1,2,3,4,5,6,7,8,9}.
The program above can easily be modified to find anagrams of a set of numbers.
In recreational mathematics, it is well known that 12^2 = 144, 21^2 = 441; and 13^2 = 169, 31^2 = 961, 14^2 = 196.
Unlike words, it turns out that it is easy to find two or more squares that are anagrams. For example, the following 87 squares are anagrams of each other:
1026753849, 1042385796, 1098524736, 1237069584, 1248703569, 1278563049, 1285437609, 1382054976, 1436789025, 1503267984, 1532487609, 1547320896, 1643897025, 1827049536, 1927385604, 1937408256, 2076351489, 2081549376, 2170348569, 2386517904, 2431870596, 2435718609, 2571098436, 2913408576, 3015986724, 3074258916, 3082914576, 3089247561, 3094251876, 3195867024, 3285697041, 3412078569, 3416987025, 3428570916, 3528716409, 3719048256, 3791480625, 3827401956, 3928657041, 3964087521, 3975428601, 3985270641, 4307821956, 4308215769, 4369871025, 4392508176, 4580176329, 4728350169, 4730825961, 4832057169, 5102673489, 5273809641, 5739426081, 5783146209, 5803697124, 5982403716, 6095237184, 6154873209, 6457890321, 6471398025, 6597013284, 6714983025, 7042398561, 7165283904, 7285134609, 7351862049, 7362154809, 7408561329, 7680594321, 7854036129, 7935068241, 7946831025, 7984316025, 8014367529, 8125940736, 8127563409, 8135679204, 8326197504, 8391476025, 8503421796, 8967143025, 9054283716, 9351276804, 9560732841, 9614783025, 9761835204, 9814072356.
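One way to make that modification is sketched below in Python: squares of a given length are grouped by their sorted digits, exactly as words were grouped by sorted letters. The function name is illustrative.

```python
from collections import defaultdict
from math import isqrt

def square_groups(ndigits):
    """Group the ndigits-digit squares by their sorted digit string."""
    lo = isqrt(10 ** (ndigits - 1) - 1) + 1   # smallest root giving ndigits digits
    hi = isqrt(10 ** ndigits - 1)             # largest root giving ndigits digits
    groups = defaultdict(list)
    for m in range(lo, hi + 1):
        sq = m * m
        groups["".join(sorted(str(sq)))].append(sq)
    return groups

three = square_groups(3)
print(three["169"], three["144"])
pandigital = square_groups(10)["0123456789"]
print(len(pandigital), pandigital[0], pandigital[-1])
```

The key "0123456789" collects the ten-digit squares that use every digit once, which is the family listed above.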
-
A Pattern Emerges
# Digits # Squares # Anasquares Proportion
1 3 0 0.00%
2 6 0 0.00%
3 22 7 31.82%
4 68 13 19.12%
5 217 86 39.63%
6 683 293 42.90%
7 2163 1212 56.03%
8 6837 4699 68.73%
9 21623 17380 80.38%
10 68377 60623 88.66%
In fact, looking at n-digit squares, it seems that as n increases, the proportion of
squares with square anagrams (let's call these anasquares) keeps increasing.
What is the limit?
The above table is Table 1 from Bilisoly (2008a).
Also see http://oeis.org/A177952.
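Rows of the table can be recomputed with a short script. The Python sketch below counts, for each length, the squares whose sorted-digit key is shared with at least one other square; the function name is illustrative, and lengths beyond 8 or so take noticeably longer.

```python
from collections import Counter
from math import isqrt

def anasquare_counts(ndigits):
    """Return (number of ndigits-digit squares, number of those that are anasquares)."""
    lo = isqrt(10 ** (ndigits - 1) - 1) + 1
    hi = isqrt(10 ** ndigits - 1)
    keys = ["".join(sorted(str(m * m))) for m in range(lo, hi + 1)]
    tally = Counter(keys)
    n_ana = sum(1 for k in keys if tally[k] > 1)   # key shared with another square
    return len(keys), n_ana

for d in range(1, 8):
    n_squares, n_ana = anasquare_counts(d)
    print(d, n_squares, n_ana, f"{n_ana / n_squares:.2%}")
```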
-
The limit is 100%!
Let S_{d,b} be the set of squares with exactly d digits when written in base b.
Define a pattern of a number n to be the digits of n in base b sorted from least to greatest. Note that a pattern is a hash key.
Theorem (Bilisoly, 2008a): The proportion of anasquares in S_{d,b} → 1 as d → ∞ and for b fixed.
Proof: A lower bound to the number of anasquares occurs when as many as possible patterns correspond to exactly 1 square. To find this lower bound, we count the number of patterns and d-digit squares.
First, thinking back to the Perl program, the hash key of a number is obtained by sorting its digits. Let d_i be the number of times the digit i appears in a square. Then this hash key can be represented by:
$$\underbrace{{*}{*}\cdots{*}}_{d_0} \;\Big|\; \underbrace{{*}{*}\cdots{*}}_{d_1} \;\Big|\; \cdots \;\Big|\; \underbrace{{*}{*}\cdots{*}}_{d_{b-1}}$$
-
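End of Proof
The number of distinct hash keys is:
$$\binom{d+b-1}{d} = \frac{(d+b-1)(d+b-2)\cdots(d+1)}{(b-1)!}.$$
Second, the number of d-digit squares is:
$$\left\lceil b^{d/2} \right\rceil - \left\lceil b^{(d-1)/2} \right\rceil \;\ge\; b^{d/2} - b^{(d-1)/2} - 1 = b^{d/2}\left(1 - 1/\sqrt{b}\right) - 1.$$
Hence the number of d-digit squares is exponential (in d), but the number of patterns is a polynomial (in d), so the proportion of anasquares is bounded below by the following (for d large enough that the bracket is positive), which → 1 as d → ∞ (and b is fixed).
$$\max\left\{0,\; 1 - \frac{(d+b-1)(d+b-2)\cdots(d+1)}{(b-1)!\,\left[\,b^{d/2}\left(1 - 1/\sqrt{b}\right) - 1\right]}\right\}$$
QED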
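The counting argument can be made concrete with a few numbers. This Python sketch computes the number of patterns and the exact number of d-digit squares, then forms the bound 1 - patterns/squares, floored at zero. It is an illustration of the argument rather than the paper's own computation, and the function names are assumptions.

```python
from math import comb, isqrt

def pattern_count(d, b=10):
    """Number of multisets of d digits drawn from b symbols (stars and bars)."""
    return comb(d + b - 1, b - 1)

def square_count(d, b=10):
    """Exact number of squares with exactly d digits in base b."""
    return isqrt(b ** d - 1) - isqrt(b ** (d - 1) - 1)

def lower_bound(d, b=10):
    """Each pattern can account for at most one square that has no square anagram."""
    return max(0.0, 1 - pattern_count(d, b) / square_count(d, b))

for d in (5, 10, 20, 30, 40):
    print(d, pattern_count(d), square_count(d), lower_bound(d))
```

For small d the bound is zero because patterns outnumber squares, so it says nothing there; the polynomial versus exponential growth only takes over for larger d.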
-
3. Birthday Problem and Pangrams
The basic birthday problem is famous: For n people, assuming all days are equally likely, what is the probability that at least two people share the same birthday?
The following are related:
Let N_shared = number of people needed so that 1 birthday appears at least 2 times.
Let N_all = number of people needed so that all 365 birthdays appear at least once.
Note E(N_shared) = 24.6166 > the usual # of people quoted. Why?
P(22 people, 2 share) = 0.475695
P(23 people, 2 share) = 0.507297
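Both kinds of number can be computed exactly for equally likely days. The Python sketch below uses the identity E(N) = sum over n ≥ 0 of P(N > n), where P(N_shared > n) is the probability that the first n birthdays are all distinct. The function names are illustrative.

```python
def p_shared(n, days=365):
    """Probability that at least two of n people share a birthday (uniform days)."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (days - i) / days
    return 1 - p_distinct

def expected_n_shared(days=365):
    """E(N_shared) = sum over n of P(first n birthdays are all distinct)."""
    return sum(1 - p_shared(n, days) for n in range(days + 1))

print(p_shared(22), p_shared(23))
print(expected_n_shared())
```

The familiar figure of 23 is where the probability first passes one half, so it is a median, whereas 24.6166 is a mean of a right-skewed distribution.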
-
Results from: Flajolet, Gardy, and Thimonier (1992)
Suppose the days are not all equally likely, and let p_i = P(day i is a birthday).
Corollary (The Birthday Problem) We need j = 1 day to appear at least k = 2 times.
$$E(N_{\text{shared}}) = \int_0^\infty \prod_{i=1}^{365} (1 + p_i t)\, \exp(-t)\, dt$$
Corollary (The Coupon Problem) We need all letters to appear at least once.
$$E(N_{\text{all}}) = \int_0^\infty \left(1 - \prod_{i=1}^{365} \left(1 - \exp(-p_i t)\right)\right) dt$$
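Both integrals can be evaluated numerically for any probability vector. The following Python sketch uses a hand-rolled composite Simpson rule and a truncated upper limit, both of which are simplifying assumptions rather than what Mathematica does; the function names and default parameters are illustrative. It is checked here on a six-sided die, where the exact answers are easy to obtain.

```python
from math import exp, prod

def simpson(f, a, b, n):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def expected_shared(p, upper=200.0, steps=20000):
    """Integral of prod(1 + p_i t) exp(-t) over [0, upper]."""
    return simpson(lambda t: prod(1 + pi * t for pi in p) * exp(-t),
                   0.0, upper, steps)

def expected_all(p, upper=None, steps=20000):
    """Integral of 1 - prod(1 - exp(-p_i t)) over [0, upper]."""
    if upper is None:
        upper = 60.0 / min(p)   # the integrand is negligible beyond this point
    return simpson(lambda t: 1 - prod(1 - exp(-pi * t) for pi in p),
                   0.0, upper, steps)

die = [1 / 6] * 6
print(expected_shared(die))   # rolls until some face repeats
print(expected_all(die))      # rolls until every face has appeared
```

Passing a list of 365 empirical day frequencies instead of `die` gives the non-uniform birthday versions, at the cost of a slower integration.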
-
Application to Birthdays
What is the expected number of people needed so that 2 people share a birthday?
Mathematica gives E(N_shared) = 24.6166, which assumes each day is equally likely. Note E(N_all) = 2364.65.
What is the expected number of people born in 1978 needed so that 2 people share a birthday?
Mathematica gives E(N_shared) = 24.5262 and note E(N_all) = 2435.14.
[Figure: plot of Julian day vs. proportion of births on that day for 1978.]
Which days does the lower band represent?
Data Source: Todd Swanson's Home Page: http://www.math.hope.edu/swanson/data/birthdays.txt
-
Pangrammatic Windows
The Spirit dropped beneath it, so that the extinguisher
covered its whole form; but though Scrooge pressed it down
with all his force, he could not hide the light: which streamed
from under it, in an unbroken flood upon the ground.
He was conscious of being exhausted, and overcome by an
irresistible drowsiness; and, further, of being in his own
bedroom. He gave the cap a parting squeeze, in which his hand
relaxed; and had barely time to reel to bed, before he sank
into a heavy sleep.
AWAKING in the middle of a prodigiously tough snore, and
sitting up in bed to get his thoughts together, Scrooge had
no occasion to be told that the bell was again upon the
stroke of One. He felt that he was restored to consciousness
in the right nick of time, for the especial purpose of holding
a conference with the second messenger dispatched to him
through Jacob Marley's intervention.
This text is from Charles Dickens's A Christmas Carol.
The blue portion (colored on the original slide) is a pangrammatic window, i.e., it contains each letter of the alphabet at least once.
There are 679 letters in color.
The search started with "The Spirit", and the window could be shortened by dropping letters from the beginning.
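A window of this kind can be found with a single left-to-right scan. The Python sketch below starts at a given position and stops as soon as all 26 letters have been seen, counting letters only; the function name and return convention are illustrative assumptions.

```python
import string

def pangram_window(text, start=0):
    """Scan from `start` until all 26 letters have appeared.

    Returns (index of the last character used, number of letters scanned),
    or None if the text ends before the alphabet is complete.
    """
    missing = set(string.ascii_lowercase)
    n_letters = 0
    for i in range(start, len(text)):
        ch = text[i].lower()
        if ch in missing or ch in string.ascii_lowercase:
            n_letters += 1          # spaces and punctuation are not counted
            missing.discard(ch)
            if not missing:
                return i, n_letters
    return None

sentence = "The quick brown fox jumps over the lazy dog"
print(pangram_window(sentence))
print(pangram_window(sentence, 4))
print(pangram_window("hello world"))
```

As noted above, a window found this way is not necessarily minimal, since leading letters that recur later could be dropped.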
-
Pangrams in A Christmas Carol
We'll search A Christmas Carol for pangrams by selecting random starting positions. Then we compare this to independently generated letters using the letter frequencies of this novel. The counts and the proportions are listed below.
Of course, letters are not independent, but the question is this: How do the actual pangram lengths differ from the simulated independent pangram lengths?
Letter Frequencies in A Christmas Carol
Letter  Count  Proportion
a  9308  0.076892
b  1943  0.016051
c  3035  0.025072
d  5674  0.046872
e  14850  0.122674
f  2433  0.020099
g  2979  0.024609
h  8368  0.069127
i  8294  0.068515
j  113  0.00093
k  1031  0.008517
l  4553  0.037612
m  2840  0.023461
n  7960  0.065756
o  9690  0.080048
p  2119  0.017505
q  97  0.000801
r  7031  0.058082
s  7900  0.065261
t  10869  0.089787
u  3335  0.02755
v  1022  0.008443
w  3096  0.025576
x  131  0.001082
y  2298  0.018983
z  84  0.000694
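The independent-letter comparison can be sketched as a small simulation: draw letters with probabilities proportional to given counts until every letter has appeared, and record how many draws that took. The Python below is an illustrative sketch; the function name, batch size, and the six-letter demo alphabet are assumptions.

```python
import random

def simulated_pangram_length(counts, rng):
    """Draw independent letters (weights from `counts`) until all have appeared."""
    letters = list(counts)
    weights = [counts[c] for c in letters]
    missing = set(letters)
    draws = 0
    while missing:
        # draw in batches to keep the number of calls to rng.choices small
        for ch in rng.choices(letters, weights=weights, k=256):
            draws += 1
            missing.discard(ch)
            if not missing:
                break
    return draws

rng = random.Random(0)
uniform6 = {c: 1 for c in "abcdef"}   # toy alphabet with equal weights
lengths = [simulated_pangram_length(uniform6, rng) for _ in range(2000)]
print(sum(lengths) / len(lengths))    # should be near 14.7 for six equal weights
```

Passing the 26 letter counts from the table as `counts` gives simulated pangram lengths for the novel's letter frequencies.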
-
Pangram Lengths
The left histogram shows lengths of pangrams found in A Christmas Carol using random starting points (N = 1000).
The right histogram shows lengths of pangrams found in a simulated string of independent letters using the proportions found in A Christmas Carol (N = 1000).
Note the long right tail.
Theoretical mean = 2473.8
Figures from Bilisoly (2009)
-
Concordancing words with the letter z reveals
"Why, it's old Fezziwig! Bless his heart; i
ess his heart; it's Fezziwig alive again!"
Old Fezziwig laid down his pen,
"Yo ho, there! Ebenezer! Dick!"
ho, my boys!" said Fezziwig. "No more work to-n
e, Dick. Christmas, Ebenezer! Let's have the shu
ters up," cried old Fezziwig, with a sharp clap
illi-ho!" cried old Fezziwig, skipping down from
-ho, Dick! Chirrup, Ebenezer!"
ared away, with old Fezziwig looking on. It was
aches. In came Mrs. Fezziwig, one vast substanti
came the three Miss Fezziwigs, beaming and lovabl
brought about, old Fezziwig, clapping his hands
overley." Then old Fezziwig stood out to dance
to dance with Mrs. Fezziwig. Top couple, too; w
ah, four times--old Fezziwig would have been a m
, and so would Mrs. Fezziwig. As to her, she was
eared to issue from Fezziwig's calves. They shon
next. And when old Fezziwig and Mrs. Fezziwig h
d Fezziwig and Mrs. Fezziwig had gone all throug
gain to your place; Fezziwig "cut"--cut so deftl
ke up. Mr. and Mrs. Fezziwig took their stations
hearts in praise of Fezziwig: and when he had do
luence over him, he seized the extinguisher-ca
e the cap a parting squeeze, in which his hand
ore and centre of a blaze of ruddy light, whi
ore alarming than a dozen ghosts, as he was p
; and such a mighty blaze went roaring up the
, half thawed, half frozen, whose heavier part
ught fire, and were blazing away to their dear
anding his gigantic size, he could accommoda
chit, kissing her a dozen times, and taking o
erness and flavour, size and cheapness, were
, so hard and firm, blazing in half of half-a-q
e flickering of the blaze showed preparations
g grew but moss and furze, and coarse rank gr
of endeavouring to seize you, which would ha
relents," she said, amazed, "there is! Nothing
grave his own name, EBENEZER SCROOGE.
er they've sold the prize Turkey that was han
re?--Not the little prize Turkey: the big one
it. It's twice the size of Tiny Tim. Joe Mi
e passed the door a dozen times, before he ha
out after dark in a breezy spot--say Saint Pau
d-stone, Scrooge! a squeezing, wrenching, graspin
The cold within him froze his old features, n
e court outside, go wheezing up and down, beatin
'em through a round dozen of months presented
e chattering in its frozen head up there. The
d a great fire in a brazier, round which a part
eir eyes before the blaze in rapture. The wat
Scrooge seized the ruler with such
n the gloom. Half-a-dozen gas-lamps out of th
her-beds, Abrahams, Belshazzars, Apostles putting o
ring at those fixed glazed eyes, in silence fo
the vision's stony gaze from himself.
from other regions, Ebenezer Scrooge, and is con
pe of my procuring, Ebenezer."
Exchange pay to Mr. Ebenezer Scrooge or his orde
st have sunk into a doze unconsciously, and
er a long way below freezing; that he was clad b
The Spirit gazed upon him mildly. It
The middle section has 43 of the 84 z's, but represents only 3 of 83 pages of Dickens (1986).
-
Pangram Lengths: Fezziwig Effect
[Figure: two histograms of pangram lengths, N = 1000 each. One panel shows simulated pangrams (theoretical mean = 3620.5); the other shows actual pangrams, with the annotation "Endpoint with Fezziwig".]
-
100,000 Simulated Pangram Lengths
Best fit lognormal distribution shown.
-
References
Roger Bilisoly (2008a). Anasquares: Square Anagrams of Squares. Mathematical Gazette, 92, 58-63.
Roger Bilisoly (2008b). Practical Text Mining with Perl, Wiley.
Roger Bilisoly (2009). Two Language-based Examples for Use in the Statistics Classroom. American Statistical Association Proceedings of the Joint Statistical Meetings, Section on Statistical Education.
Gunnar Blom, Lars Holst, and Dennis Sandell (1993). Problems and Snapshots from the World of Probability, Springer.
W. E. Deskins (1964). Abstract Algebra, MacMillan.
Charles Dickens (1986). A Christmas Carol, Bantam.
Philippe Flajolet, Daniele Gardy, and Loys Thimonier (1992). Birthday Paradox, Coupon Collectors, Caching Algorithms and Self-Organizing Search. Discrete Applied Mathematics, 39, 207-229.
Walter Penney (1960). On the Final Digits of Squares. The American Mathematical Monthly, Vol. 67, No. 10, pp. 1000-1002.
Walter Stangl (1996). Counting Squares in Z_n. Mathematics Magazine, Vol. 69, No. 4, pp. 285-289.
Kenneth Williams (1995). "Some Refinements of an Algorithm of Brillhart," Canadian Mathematical Society Conference Proceedings, Volume 15, 409-416. Available at http://www.math.carleton.ca/~williams/papers/pdf/202.pdf .
-
Web References
Benford's Law: http://mathworld.wolfram.com/BenfordsLaw.html and http://en.wikipedia.org/wiki/Benford's_law
Squares with 3 distinct digits: http://www.asahi-net.or.jp/~KC2H-MSM/mathland/math02/math0203.htm
Counterexample to Ed Pegg: http://mathworld.wolfram.com/Baxter-HickersonFunction.html
American Cryptogram Association: http://cryptogram.org/
National Puzzlers' League: http://www.puzzlers.org/
Moby Word Lists: http://icon.shef.ac.uk/Moby/
Anasquare counts: http://oeis.org/A177952
1978 birthday data: http://www.math.hope.edu/swanson/data/birthdays.txt
Word Ways: http://wordways.com/
-
Wordplay References
Tony Augarde (1994). The Oxford A to Z of Word Games, Oxford.
Tony Augarde (2003). The Oxford Guide to Word Games, Oxford.
o Has historical information.
Dmitri Borgmann (1967). Beyond Language, Scribners.
Ross Eckler (1979). Word Recreations, Dover.
o Most examples originally appeared in Word Ways.
Ross Eckler (1996). Making the Alphabet Dance, St. Martin's.
o Most examples originally appeared in Word Ways.
Dave Morice (1997). Alphabet Avenue, Chicago Review Press.
Dave Morice (2001). The Dictionary of Word Play, Teachers and Writers Collaborative.
Warren F. Motte, Jr. (1998). Oulipo: A Primer of Potential Literature, Dalkey Archive.
o Oulipo stands for Ouvroir de Littérature Potentielle, which is a group of writers, mathematicians, and other people interested in literary structures.
-
The Key Wordplay Resource: Word Ways: A Journal of Recreational Linguistics
Established by Dmitri Borgmann in 1968
o He is author of Language on Vacation and Beyond Language
Bought by A. Ross Eckler, Jr. in 1968. He was editor and publisher from 1968-2006.
o PhD in mathematics from Princeton, 1954
o Worked at Bell Labs, 1954-84
o Published Word Recreations (1979), Names and Games: Onomastics and Recreational Linguistics (1986), Making the Alphabet Dance (1996)
Current editor is Jeremiah Farrell, professor emeritus of mathematics at Butler University
Online at http://wordways.com/
-
Open question: What are the upper and lower bounds of this plot?
[Figure: Figure 2 of Bilisoly (2008a). Points are squares in base 10 with 12 or fewer digits.]
-
43/53
Brillhart Algorithm (See Slide 18)
-
44/53
Let us generalize the birthday problem.
Let $\Sigma$ represent an alphabet of size $n_a$.
For birthdays let $\Sigma = \{d_1, d_2, \ldots, d_{365}\}$, so $n_a$ equals 365.
Let $p_i = P(d_i \text{ occurs})$, so that each day need not be equally likely.
Define $N_{jk}$ = number of letters drawn from $\Sigma$ (with replacement) so that there are $j$ distinct letters that each appear at least $k$ times.
Let $e_k(t)$ = $k$th order Taylor series expansion of $\exp(t)$, that is, $e_k(t) = \sum_{m=0}^{k} t^m/m!$.
Theorem 1 of Flajolet, Gardy and Thimonier (1992) states:
$$E(N_{jk}) = \sum_{l=0}^{j-1} \int_0^\infty [x^l] \prod_{i=1}^{n_a} \Big( e_{k-1}(p_i t) + x \big( \exp(p_i t) - e_{k-1}(p_i t) \big) \Big) \exp(-t) \, dt$$
Here $[x^l]$ picks out the coefficient of $x^l$ in the product, which is a product of 1st degree polynomials in $x$.
Corollary (The Birthday Problem): We need $j = 1$ day to appear at least $k = 2$ times. Note that the sum has only one term, and $N_{12} = N_{\text{shared}}$.
$$E(N_{12}) = \int_0^\infty \prod_{i=1}^{365} (1 + p_i t) \exp(-t) \, dt$$
See Corollary 1 of Flajolet et al. (1992)
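The birthday corollary can be evaluated without numerical integration. The following is a minimal Python sketch, not the talk's Mathematica code. It assumes the corollary's integrand is the product of (1 + p_i t) times exp(-t), expands that product in powers of t, and uses the fact that the integral of t^k exp(-t) over [0, ∞) equals k!. The function name `expected_draws_until_repeat` is made up for illustration.

```python
def expected_draws_until_repeat(probs):
    """Evaluate the integral of prod_i (1 + p_i t) * exp(-t) over t in [0, inf).

    Expanding the product gives sum_k e_k(p) t^k, where e_k is the k-th
    elementary symmetric polynomial, and each t^k integrates to k!.
    """
    # a[k] stores k! * e_k of the probabilities processed so far.
    # Because the probabilities sum to at most 1, every a[k] stays in [0, 1].
    a = [1.0] + [0.0] * len(probs)
    for p in probs:
        # Go downwards so that a[k - 1] is still the value from before p was added.
        for k in range(len(a) - 1, 0, -1):
            a[k] += p * k * a[k - 1]
    return sum(a)


# Uniform birthdays: each of the 365 days has probability 1/365.
uniform = [1 / 365] * 365
print(round(expected_draws_until_repeat(uniform), 4))
```

For equally likely days this gives roughly 24.6 people on average before the first shared birthday. Passing an unequal list of daily probabilities uses the same recurrence.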
-
45/53
Generalized Coupon Collector's Problem
Theorem 2 of Flajolet, Gardy and Thimonier (1992): The expected number of letters drawn to get the complete alphabet, $\Sigma$, is given below. Their proof follows fairly easily from Theorem 1.
$$E(N_{\text{all}}) = \int_0^\infty \Big( 1 - \prod_{i=1}^{n_a} \big( 1 - \exp(-p_i t) \big) \Big) \, dt$$
For uniformly likely birthdays, 2364.65 people are needed on average to get all 365 days to appear. For 1978, we expect to need 2435.14 people.
Pangrams have $\Sigma = \{a, b, c, \ldots, z\}$, $n_a = 26$, and $p_i$ determined by frequencies found in a text sample.
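The coupon collector integral is easy to check numerically. The following is a minimal Python sketch, not the talk's code. It assumes the Theorem 2 integrand is 1 minus the product of (1 - exp(-p_i t)). The Simpson's-rule integrator, the cutoff of 40 / min(p), and both function names are illustrative choices.

```python
import math


def expected_draws_for_full_alphabet(probs, steps=20000):
    """Simpson's-rule estimate of the integral of 1 - prod_i (1 - exp(-p_i t))."""
    def integrand(t):
        prod = 1.0
        for p in probs:
            prod *= 1.0 - math.exp(-p * t)
        return 1.0 - prod

    # Assumed cutoff: past 40 / min(p) the integrand is below len(probs) * exp(-40).
    upper = 40.0 / min(probs)
    h = upper / steps  # steps must be even for Simpson's rule
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3.0


def uniform_closed_form(n):
    """For equally likely letters the integral reduces to n * (1 + 1/2 + ... + 1/n)."""
    return n * sum(1.0 / k for k in range(1, n + 1))


print(expected_draws_for_full_alphabet([1 / 6] * 6))  # a fair die
print(uniform_closed_form(365))                       # uniform birthdays
```

The closed form for 365 equally likely days agrees with the 2364.65 quoted on the slide. For pangrams, one would pass 26 letter frequencies estimated from a text sample as `probs`.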
-
46/53
Example of Mathematica 8 code to find 14-letter words with no multiple edges and diameter = 2.
-
47/53
Squares Having Only Three Distinct Digits: Investigated by Hisanori Mishima
Largest known sporadic example:
81401637345465395512991484^2 = 6626226562522666562566262626266252566552622656522256
However, there are infinitely many patterned squares with only three distinct digits.
97 9409
997 994009
9997 99940009
99997 9999400009
999997 999994000009
1235 1525225
12335 152152225
123335 15211522225
1233335 1521115222225
12333335 152111152222225
See http://www.asahi-net.or.jp/~KC2H-MSM/mathland/math02/math0203.htm
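The two patterned families in the table can be generated and checked directly. The following is a minimal Python sketch, not the talk's code. The closed forms 10^k - 3 and (37·10^k + 5)/3 are read off the listed roots and are assumptions for illustration, as are the function names.

```python
def distinct_digits(n):
    """Return the set of decimal digits that appear in n."""
    return set(str(n))


def nines_then_seven(k):
    """97, 997, 9997, ... for k = 2, 3, 4, ... (assumed closed form 10^k - 3)."""
    return 10**k - 3


def one_two_threes_five(k):
    """1235, 12335, 123335, ... for k = 2, 3, 4, ... (assumed closed form)."""
    return (37 * 10**k + 5) // 3


for k in range(2, 7):
    for root in (nines_then_seven(k), one_two_threes_five(k)):
        square = root * root
        print(root, square, sorted(distinct_digits(square)))

# The slide's sporadic root, squared with exact integer arithmetic.
sporadic = 81401637345465395512991484
print(sorted(distinct_digits(sporadic * sporadic)))
```

For the first family, (10^k - 3)^2 = 10^(2k) - 6·10^k + 9, so the square is a run of 9s, a 4, a run of 0s, and a final 9. That explains why the pattern continues for every k ≥ 2.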
-
48/53
How many sporadic solutions?
Assume that digits of squares are independent.
Binomial[10,3] (3/10)^n ≈ P(n-digit square with 3 distinct digits)
10^(n/2) (1 - 1/Sqrt[10]) ≈ number of n-digit squares
Expected # of n-digit squares with 3 distinct digits ≈ Binomial[10,3] (3/10)^n * 10^(n/2) (1 - 1/Sqrt[10]) = constant * (3/Sqrt[10])^n
But the sum over n of (3/Sqrt[10])^n converges, because 3/Sqrt[10] ≈ 0.949 < 1, which suggests a finite # of solutions.
Sum = 360 (1 - 1/Sqrt[10]) (3 + Sqrt[10]) ≈ 1517.
However, the analogous argument for squares with 4 distinct digits results in the sum over n of (4/Sqrt[10])^n, which diverges because 4/Sqrt[10] ≈ 1.265 > 1.
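The heuristic count is a geometric series, so the quoted total is easy to reproduce. The following is a minimal Python sketch, not the talk's code. It assumes the per-length estimate is Binomial[10,3] (1 - 1/Sqrt[10]) (3/Sqrt[10])^n summed from n = 1. The names are illustrative.

```python
import math

SQRT10 = math.sqrt(10)
CONSTANT = math.comb(10, 3) * (1 - 1 / SQRT10)  # 120 * (1 - 1/sqrt(10))


def expected_count(n, allowed_digits=3):
    """Heuristic expected number of n-digit squares using only `allowed_digits` digits.

    The binomial prefactor is kept at C(10, 3) as on the slide; only the ratio
    changes with `allowed_digits`, which is what decides convergence.
    """
    return CONSTANT * (allowed_digits / SQRT10) ** n


# 2000 terms are more than enough, since (3/sqrt(10))^2000 is negligible.
partial_sum = sum(expected_count(n) for n in range(1, 2001))
closed_form = 360 * (1 - 1 / SQRT10) * (3 + SQRT10)

print(partial_sum, closed_form)
print(3 / SQRT10, 4 / SQRT10)  # ratio below 1 converges, ratio above 1 diverges
```

The closed form follows from r / (1 - r) with r = 3/Sqrt[10], which simplifies to 3 (3 + Sqrt[10]).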
-
49/53
Three Examples of 150-digit Squares with exactly 9 distinct digits.
This square has no 1s.
590286760507408218847058025821601275020644462041449539546951025992081988403 ^2 =
348438459630330307258664735742002640854590982059075240585330803623545707923
270640840935682702648690975443382535993405344806539300344597863650226490409
This square has no 8s.
705635480731670264258949343062158505097813112879657762505544377338770578675 ^2 =
497921431667415396779522503575462537630442531656110964327039001763590201921
651000116765556629356041206321334237073176796236102099130225925794364755625
This square has no 7s.
624228579548317386188320909320329013264935555613585034560249848356296289668 ^2 =
389661319524910006943311531238425925016482192128500480080280569984852381981
266389000443336616268805696010466681530390083280245809868986959183363550224
-
50/53
Squares with 9 Distinct Digits
How common is an n-digit square with only 9 distinct digits?
Binomial[10,9] (9/10)^n ≈ probability of a square with 9 distinct digits
For n = 150, this gives 1.36891 E-6, or about 1 in a million.
Hence a computer program checking 5,000,000 random 150-digit squares should find about 5,000,000 * 1.36891 E-6 = 6.84 such squares.
This was done 30 times, and the counts are given in the histogram.
Mean = 7.27
SD = 2.65
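The estimate and the sampling experiment can be sketched as follows. This is a minimal Python illustration, not the program used for the talk. It assumes a "random 150-digit square" means squaring a root drawn uniformly from the range whose squares have exactly 150 digits. The function names, the seed, and the small trial count are illustrative.

```python
import math
import random


def approx_probability(n_digits, allowed=9):
    """Independence heuristic: C(10, allowed) * (allowed / 10) ** n_digits."""
    return math.comb(10, allowed) * (allowed / 10) ** n_digits


def count_hits(trials, n_digits=150, allowed=9, seed=0):
    """Count random n_digits-digit squares with exactly `allowed` distinct digits."""
    rng = random.Random(seed)
    lo = math.isqrt(10 ** (n_digits - 1) - 1) + 1  # smallest root with an n_digits-digit square
    hi = math.isqrt(10 ** n_digits - 1)            # largest such root
    hits = 0
    for _ in range(trials):
        root = rng.randint(lo, hi)
        if len(set(str(root * root))) == allowed:
            hits += 1
    return hits


p = approx_probability(150)
print(p, 5_000_000 * p)    # heuristic probability and expected count per batch
print(count_hits(20_000))  # a much smaller batch than the talk's 5,000,000
```

With only 20,000 trials a hit is unlikely. A full batch of 5,000,000 trials is needed to see counts near the slide's mean.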
-
51/53
Ed Pegg's Failed Conjecture
In April of 1999, on sci.math, Ed Pegg conjectured that there are only finitely many cubes without the digit 0.
D. Hickerson found a counterexample and a few days later Lew Baxter found the example given below.
baxter[n_] := (2 10^(5 n)-10^(4 n)+2 10^(3 n)+10^(2 n)+10^n+1)/3
Do[Print[{baxter[i],baxter[i]^3}],{i,1,5}]
{64037, 262598918898653}
{6634003367, 291962492648791178822648631863}
{666334000333667, 295852962482593148779111778815593148629851963}
{66663334000033336667,296251862962481592598148777911117778814892598148629651852963}
{6666633334000003333366667,
296291851962962481492592648148777791111177778814822592648148629631851862963}
Function given at http://mathworld.wolfram.com/Baxter-HickersonFunction.html
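The same function is easy to reproduce with exact integer arithmetic. The following is a minimal Python sketch mirroring the Mathematica definition of baxter[n] above. The zero-digit check in the loop is an illustrative addition.

```python
def baxter(n):
    """Python version of the slide's baxter[n], using exact integers."""
    numerator = (2 * 10 ** (5 * n) - 10 ** (4 * n) + 2 * 10 ** (3 * n)
                 + 10 ** (2 * n) + 10 ** n + 1)
    # Every power of 10 is 1 mod 3, so the numerator is 2 - 1 + 2 + 1 + 1 + 1 = 6 = 0 mod 3.
    assert numerator % 3 == 0
    return numerator // 3


for n in range(1, 6):
    value = baxter(n)
    cube = value ** 3
    # The last field reports whether the cube contains the digit 0.
    print(value, cube, "0" in str(cube))
```

Integer division is used instead of floating-point division so that the cubes stay exact for large n.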
-
52/53
Miscellaneous Example: Number Words and Numbers Graph
Let A = 1, B = 2, C = 3, ..., Z = 26.
Let s(word) = sum of its alphabetic values
o Example: s(bad) = 2 + 1 + 4 = 7
Let nn(number) = its number name
o Example: nn(3) = three
Consider the dynamical system of composing s and nn
o That is, iterate n -> nn(n) -> s(nn(n)) -> nn(s(nn(n))), etc.
o Example: 1, 34, 160, 205, 174, 278, 291, 253, 254, 258, 247, 281, 240, 216, 228, 288, 255, 240
Starting from 1, the sequence falls into the 5-cycle 240, 216, 228, 288, 255, so what else can happen?
o Answer first published by Dmitri Borgmann in 1967 in Beyond Language.
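The iteration can be sketched in a few lines. The following is a minimal Python illustration, not the talk's code. It assumes American-style number names without "and" (for example "one hundred sixty"), which reproduces the example sequence, and it only handles 0 to 999. The helper `orbit` is a made-up name.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def nn(n):
    """Number name of n for 0 <= n <= 999, without 'and' (assumed convention)."""
    if not 0 <= n < 1000:
        raise ValueError("this sketch only names 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (" " + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + nn(rest) if rest else "")


def s(word):
    """Sum of alphabetic values with a = 1, ..., z = 26; other characters are ignored."""
    return sum(ord(c) - ord("a") + 1 for c in word.lower() if "a" <= c <= "z")


def orbit(start):
    """Iterate n -> s(nn(n)) until a value repeats; return (tail, cycle)."""
    first_seen = {}
    sequence = []
    n = start
    while n not in first_seen:
        first_seen[n] = len(sequence)
        sequence.append(n)
        n = s(nn(n))
    split = first_seen[n]
    return sequence[:split], sequence[split:]


tail, cycle = orbit(1)
print(tail)
print(cycle)
```

Calling `orbit` for every start from 1 to 999 and collecting the distinct cycles is one way to explore what else can happen.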
-
53/53