p1q5

7/24/2019 p1q5

Solution notes

5

COMPUTER SCIENCE TRIPOS Part IA 2013 Paper 1

Algorithms I (FMS)

One of several ways to perform string matching efficiently is with a finite state automaton (FSA). [string matching, CLRS3 32.3]

(a) Give a brief but clear explanation of the FSA string matching algorithm, its complexity and any associated data structures. [Note: pseudocode of up to 10 lines is allowed, but not required.] [4 marks]

Answer:

To find all matches of pattern P (of length m) in string T (of length n) we build a finite state automaton with m + 1 states numbered from 0 to m.

We build the FSA so that, having read up to T[:j], it is in state k iff k is the largest value for which the last k characters of T[:j] match the first k characters of P.

Building the FSA consists of building a lookup table for its next-state function. This will have one row per state (m + 1 of them) and one column per next-character ($|\Sigma|$ of them, where $\Sigma$ is the alphabet over which the strings are defined). Thus $(m+1)|\Sigma|$ cells. The precomputation time for the FSA string matching algorithm is $O(m^3 |\Sigma|)$ because the computation of each cell takes $O(m^2)$: up to m + 1 candidate overlaps are tried, each compared in O(m) time.

Every time the FSA is in state m, by the definition given above this indicates a full match, where P is identical to the last m characters that were just read from T. Precomputations aside, the running time of the FSA string matching algorithm is O(n) because, after having built the FSA, for each character of T all that is required is one constant-time lookup in the next-state function table in order to proceed to the next state.
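The scanning phase described above can be sketched in Python as follows. This is a minimal illustration, not code from the original notes: the function name `fsa_matches`, the list-of-dicts table layout and the small hand-built table for the hypothetical pattern `ab` over {a, b} are all assumptions made for the example.

```python
def fsa_matches(table, m, text):
    """Scan text once and return the start index of every match.

    table[state][ch] is the next state; m is the pattern length, so
    reaching state m means the last m characters read equal the pattern.
    Assumes every character of text is a column of the table.
    """
    state = 0
    starts = []
    for j, ch in enumerate(text):
        state = table[state][ch]      # one constant-time lookup per character
        if state == m:
            starts.append(j - m + 1)  # the match ends at index j
    return starts


# Hand-built next-state table for the illustrative pattern "ab" over {a, b}.
AB_TABLE = [
    {"a": 1, "b": 0},  # state 0: nothing matched yet
    {"a": 1, "b": 2},  # state 1: "a" matched
    {"a": 1, "b": 0},  # state 2: "ab" matched (full match)
]

print(fsa_matches(AB_TABLE, 2, "abaabb"))  # prints [0, 3]
```

The loop body does a fixed amount of work per character of the text, which is where the O(n) matching time comes from.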

(b) Build the FSA that will find matches of the pattern P = pepep in an arbitrary string T over the alphabet {e, o, p}, explaining what you do and why. [6 marks]

Answer: The pattern P has length m = 5 and the alphabet has size 3, so I'll build a next-state table with 6 rows (states from 0 to 5) and 3 columns. Each cell (r, c) will contain the number of the state in which the FSA arrives if it receives character c while in state r.

Each cell of the table is defined by $\delta(r, c) = \sigma_P(P[:r] + c)$. This equation can be informally justified as follows. We are in state r, meaning that the last r characters of T we have seen are a match for P[:r]. If we read another character from T, and it turns out to be c, the last few characters read are going to be P[:r] + c. How long a match do we get between that and our pattern? Answer: $\sigma_P(P[:r] + c)$, so that's the next state we land in.

For each cell we thus look for this value $\sigma_P(P[:r] + c)$, as follows. For example, for r = 3, c = o, we have P[:r] + c = P[:3] + o = pep + o = pepo. How do we compute $\sigma_P(\text{pepo})$? We start with the shift for the longest possible match, by left-aligning P[:r] + c and P; if it's actually a match, then $\sigma_P$ is its length; otherwise we shift P to the right by one (making the overlapping region one character shorter) and try again. We are guaranteed to terminate after at most r + 2 tries because it can't get any worse than the shift that gives no overlap between the two strings, with $\sigma_P = 0$. So we have (indicating y and n for single-character matches and non-matches respectively):

    pepo
    pepep
    yyyn      no good

    pepo
     pepep
     nnn      no good

    pepo
      pepep
      yn      no good

    pepo
       pepep
       n      no good

    pepo
        pepep
              OK, a 0-length match

This was a worst case, where we had to go all the way to the end. Others may be luckier. And so we proceed for each of the 18 cells, finally yielding this table. The string before the number is P[:r] + c and shouldn't really be part of the table, but it's what we computed the $\sigma_P$ of, so we leave it in for clarity and to allow cross-checking.

 r | e           | o           | p
---+-------------+-------------+-------------
 0 | e       0   | o       0   | p       1
 1 | pe      2   | po      0   | pp      1
 2 | pee     0   | peo     0   | pep     3
 3 | pepe    4   | pepo    0   | pepp    1
 4 | pepee   0   | pepeo   0   | pepep   5
 5 | pepepe  4   | pepepo  0   | pepepp  1
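As a cross-check, the table can be recomputed mechanically. The sketch below follows the shifting procedure from the text (try the longest overlap first, then shorten it by one character at a time). The names `sigma_by_shifting` and `build_table` are illustrative assumptions, not part of the original notes, and the columns are taken in the order e, o, p.

```python
def sigma_by_shifting(pattern, x):
    """Return (k, tries): k is the length of the longest suffix of x that is
    a prefix of pattern, found by trying overlaps from longest to shortest."""
    tries = 0
    for k in range(min(len(pattern), len(x)), -1, -1):
        tries += 1
        if x[len(x) - k:] == pattern[:k]:
            return k, tries
    raise AssertionError("unreachable: the empty overlap always matches")


def build_table(pattern, alphabet):
    """Row r, column c holds the next state reached from state r on c."""
    return [
        [sigma_by_shifting(pattern, pattern[:r] + c)[0] for c in alphabet]
        for r in range(len(pattern) + 1)
    ]


table = build_table("pepep", "eop")
for r, row in enumerate(table):
    print(r, row)

# The worked example: "pepo" needs r + 2 = 5 tries and ends with overlap 0.
print(sigma_by_shifting("pepep", "pepo"))  # prints (0, 5)
```

Each printed row should agree with the numbers in the corresponding row of the table above.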

(c) The correctness proof of the FSA string matching algorithm involves the function $\sigma_P(x)$, which is parametric in the pattern P and takes as input a string x. Define $\sigma_P(x)$, explaining what it returns. [1 mark]

Answer: $\sigma_P(x)$ returns an integer, specifically the length of the longest possible suffix of x that is also a prefix of P.
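The same definition can be written as a formula. This is only a restatement of the sentence above, with m = |P| and P[:k] denoting the first k characters of P, as elsewhere in these notes:

```latex
\sigma_P(x) \;=\; \max \{\, k \;:\; 0 \le k \le \min(m, |x|) \ \text{and}\ P[:k] \text{ is a suffix of } x \,\}
```

The set is never empty, because k = 0 (the empty string) always qualifies, so the maximum is well defined and is at least 0.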

(d) Let A, B, C, D be character strings; let |A| be the length of string A; let + denote integer addition or string concatenation depending on its operands. Let D be the longest suffix of A that is a prefix of B.

For each of the following claims: either prove the claim correct, or give a counterexample that proves it is incorrect. You may draw an explanatory picture if it helps clarity.

(i) $\sigma_B(A) = D$ [3 marks]

Answer: Conceptually almost correct but notationally wrong. $\sigma_B(A)$, by definition (cf. subquestion (c)), is the length of the longest suffix of A that is a prefix of B; whereas D, by its own definition, is simply that suffix. The claim would become correct by adding vertical bars around D, giving $\sigma_B(A) = |D|$.

(ii) $\sigma_B(A + C) = |D| + |C|$ [3 marks]

Answer: Correct in some cases but incorrect in general. Example of special case where it works: A = abcdda; B = dacba; C = cb; D = da; A + C = abcddacb; D + C = dacb. Here it is the case that $\sigma_B(A + C) = |D| + |C| = 2 + 2 = 4$.

If, however, we change just one character in B, from dacba to dacca, then we get a counterexample where it no longer works: A = abcdda; B = dacca; C = cb; D = da (unchanged, since da is still both a suffix of A and a prefix of B); A + C = abcddacb; $\sigma_B(A + C) = 0$; $|D| + |C| = 2 + 2 = 4 \neq 0$. No easy fix here.

(iii) $|C| = 1 \Rightarrow \sigma_B(A + C) = \sigma_B(A) + 1$ [3 marks]

Answer: Again, true in some special cases but incorrect in general. For a counterexample, take A = aa; B = aab; C = a; then $\sigma_B(A + C) = \sigma_{\text{aab}}(\text{aaa}) = |\text{aa}| = 2$; $\sigma_B(A) = \sigma_{\text{aab}}(\text{aa}) = |\text{aa}| = 2$ and so the right hand side would come to $\sigma_B(A) + 1 = 2 + 1 = 3$, greater than the left hand side. Here too, there seems to be no easy fix.
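The examples used for claims (ii) and (iii) can be checked mechanically. The sketch below is an illustration only: `sigma` and `check_claims` are hypothetical helper names, and `sigma` is a direct, unoptimised transcription of the definition from subquestion (c).

```python
def sigma(pattern, x):
    """Length of the longest suffix of x that is also a prefix of pattern."""
    return max(k for k in range(min(len(pattern), len(x)) + 1)
               if x.endswith(pattern[:k]))


def check_claims(a, b, c):
    """Evaluate both sides of claims (ii) and (iii) for strings a, b, c."""
    d = a[len(a) - sigma(b, a):]     # D: longest suffix of A that is a prefix of B
    return {
        "D": d,
        "lhs": sigma(b, a + c),      # sigma_B(A + C)
        "rhs_ii": len(d) + len(c),   # |D| + |C|
        "rhs_iii": sigma(b, a) + 1,  # sigma_B(A) + 1, relevant when |C| = 1
    }


print(check_claims("abcdda", "dacba", "cb"))  # special case where (ii) holds
print(check_claims("abcdda", "dacca", "cb"))  # counterexample to (ii)
print(check_claims("aa", "aab", "a"))         # counterexample to (iii)
```

A claim fails on a given input whenever `lhs` differs from the corresponding right-hand side in the printed dictionary.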
