An Abstract Generic Framework for Web Site Verification ∗

M. Alpuente, P. Ojeda, D. Romero
Universidad Politécnica de Valencia
Camino de Vera s/n, Apdo. 22012, 46071 Valencia, Spain.
{alpuente,pojeda,dromero}@dsic.upv.es

D. Ballis
Dip. Matematica e Informatica
Via delle Scienze 206, 33100 Udine, Italy.
[email protected]

M. Falaschi
Dip. di Scienze Matematiche e Informatiche
Pian dei Mantellini 44, 53100 Siena, Italy.
[email protected]

Abstract

In this paper, we present an abstract framework for Web site verification which improves the performance of a previous, rewriting-based Web verification methodology. The approximated framework is formalized as a source-to-source transformation which is parametric w.r.t. the chosen abstraction. This transformation significantly reduces the size of the Web documents by dropping or merging contents that do not influence the properties to be checked. This allows us to reuse all verification facilities of the previous system WebVerdi-M to efficiently analyze Web sites. In order to ensure that the verified properties are not affected by the abstraction, we develop a methodology which derives the abstraction of Web sites from their Web specification. We believe that this methodology can be successfully embedded into several frameworks. In particular, we have developed a prototypical implementation which shows a huge speedup w.r.t. a previous methodology which did not use this transformation.

1. Introduction

Despite the exponential WWW growth and the success of the Semantic Web, there is limited support today for specifying, verifying, and repairing Web sites at a semantic level. Some sophisticated Web site management tools have been recently proposed that provide helpful facilities, including active rules that automatically fire repair actions throughout the rich navigational structure of Web sites. Unfortunately, these tools are mostly focused on syntactic checking and Web site restructuring (see [4, 7, 5] for a wider discussion).

∗This work has been partially supported by the EU (FEDER) and the Spanish MEC, under grants TIN2004-7943-C04-01 and TIN2007-68093-C02-02, the Generalitat Valenciana under grant GV06/285, the Integrated Action Hispano-Alemana HA2006-0007, and progetto FIRB, Internazionalizzazione 2004, n. RBIN04M8S8. Pedro Ojeda is also supported by the Generalitat Valenciana under FPI grant BFPI/2007/076.

A rewriting-based approach to Web site verification and repair was developed in [4, 8]. The methodology applies to static HTML/XML Web sites and can discover flaws in Web sites that are not addressed by classical tools [3, 20, 28], as these mainly focus on modeling navigational aspects and user interaction; see [2]. In a nutshell, our framework comes with a Web specification language for defining correctness and completeness conditions on Web sites. Then, a rewriting-based verification technique is applied to recognize forbidden/incorrect patterns and incomplete/missing Web pages. This is done by means of a novel technique, called partial rewriting, in which the traditional pattern matching mechanism is replaced by a suitable technique based on a homeomorphic embedding relation for recognizing patterns inside semistructured documents.

The verification methodology of [4] is implemented in the prototype WebVerdi-M (Web Verification and Rewriting for Debugging Internet sites with Maude) [5], written in Maude [14]. For correctness checking, it shows impressive performance thanks to the Associativity-Commutativity (AC) pattern matching and metalevel features supported by Maude (for instance, verifying correctness over a 10Mb XML document with 302000 nodes takes less than 13 seconds). In fact, both resource allocation and elapsed time scale linearly. Unfortunately, for the verification of completeness, a (finite) fixpoint computation is typically needed, which leads to unsatisfactory performance, and the verification tool is only able to efficiently process XML documents smaller than 1Mb.

In this paper, we develop an abstract approach to Web site verification which makes use of an approximation technique based on abstract interpretation [16, 17] that greatly improves on previous performance. We also ascertain the conditions which ensure the correctness of the approximation, so that the resulting abstract rewriting engine safely supports accurate Web site verification. Since our framework is parametric w.r.t. the considered abstraction, we precisely characterize the conditions which allow us to ensure the correctness of the abstraction, which is implemented by a source-to-source transformation of concrete Web sites and Web specifications into abstract ones. Thanks to this source-to-source approximation scheme, all facilities supported by our previous verification system are straightforwardly adapted and reused with very little effort. Our abstract verification methodology can be seen as a step forward towards more sophisticated management of dynamic components of Web sites, which can generate a potentially infinite number of Web pages and thus cannot be handled by the more standard methodology in [4].

Related Work In the literature, abstract interpretation frameworks have been scarcely applied to analyse Web sites. In [25], an abstract approach is developed which allows one to analyse the communication protocols of a particular distributed system with the aim of enforcing a correct global behavior of the system. [23] uses abstract interpretation for secret property verification: the methodology applies Input/Output abstract set descriptions to finite state machines in order to validate cryptographic protocols implementing secure Web transactions.

To the best of our knowledge, this work develops the first methodology based on abstract interpretation techniques which is general enough to support the verification of static as well as dynamic aspects of Web sites. Our inspiration comes from the area of approximate (XML) query answering [12, 29], where XML queries are executed on compressed versions of XML data (i.e., document synopses) in order to obtain fast, albeit approximate, answers.

In our methodology, both the XML documents (Web pages) and the constraints (Web specification rules) are approximated via an abstraction function. Then, the verification process is carried out using the abstract descriptions of the considered XML data. This approach results in a powerful abstract verification methodology which pays off in practice. Actually, the experiments reported in Section 5 verify the viability of the proposed techniques in terms of both accuracy and processing time on large datasets.

Plan of the Paper The paper is organized as follows. Section 2 recalls some standard notions and introduces Web site descriptions. In Section 3, we briefly recall the Web verification methodology of [4]. Section 4 formalizes the notion of abstract Web specification and introduces the key idea behind our method, which is given by an approximation algorithm that derives an abstraction of Web sites from their Web specifications and drastically reduces the size of the Web documents. Then we show how the original rewriting-based Web site verification technique of [4] can be safely approximated by the abstract framework. Experiments with a prototypical implementation of our method are described in Section 5. Finally, Section 6 concludes.

2. Preliminaries

In this section, we briefly recall the essential notions and terminology used in this paper. We call a finite set of symbols an alphabet. By V we denote a countably infinite set of variables, and Σ denotes a set of function symbols (also called operators), or signature. We consider varyadic signatures as in [19] (i.e., signatures in which symbols do not have a fixed arity).

Terms are viewed as labelled trees in the usual way. Positions are represented by sequences of natural numbers denoting an access path in a term. The empty sequence Λ denotes the root position. Given S ⊆ Σ ∪ V, OS(t) denotes the set of positions of a term t that are rooted by symbols in S. t|u is the subterm at the position u of t. t[r]u is the term t with the subterm rooted at the position u replaced by r. Given a term t, we say that t is ground if no variable occurs in t. τ(Σ,V) and τ(Σ) denote the non-ground term algebra built on Σ ∪ V and the term algebra built on Σ, respectively.

Syntactic equality between objects is represented by ≡. Given a set S, sequences of elements of S are built with the constructors ε :: S∗ (empty sequence) and . :: S × S∗ → S∗.

A substitution σ ≡ {X1/t1, . . . , Xn/tn} is a mapping from the set of variables V into the set of terms τ(Σ,V) satisfying the following conditions: (i) Xi ≠ Xj whenever i ≠ j; (ii) Xiσ = ti, for i = 1, . . . , n; and (iii) Xσ = X for all X ∈ V \ {X1, . . . , Xn}. An instance of a term t is defined as tσ, where σ is a substitution. By Var(s) we denote the set of variables occurring in the syntactic object s.

2.1. Web Site Description

Let us consider two alphabets T and Tag. We denote the set T∗ by Text. An object t ∈ Tag is called a tag element, while an element w ∈ Text is called a text element. Since Web pages are provided with a tree-like structure, they can be straightforwardly translated into ordinary terms of the term algebra τ(Text ∪ Tag) [4], as shown in Figure 1.

Note that XML/XHTML tag attributes can be considered as common tagged elements, and hence translated in the same way. In the following, we will also consider terms of the non-ground term algebra τ(Text ∪ Tag, V), which may contain variables. Elements of this set are called Web page templates. In our methodology, Web page templates are used for specifying erroneous/incorrect patterns which may be recognized in the Web pages.

In order to describe a Web site, we use the formulation given in [27]. We use an alphabet P to give names to Web pages and to express the different transitions between pages.

<people>                people(
  <person>                person(
    <id>per0</id>           id(per0),
    <name>Conte</name>      name(Conte)
  </person>               )
</people>               )

Figure 1. A Web page and its corresponding encoding as a ground term
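The translation sketched in Figure 1 can be prototyped with the Python standard library. The following is a minimal sketch under our own encoding (the function name xml_to_term is ours; it is not part of WebVerdi-M, which is written in Maude):

```python
# Hypothetical sketch: rendering an XML Web page as a ground term of
# the algebra tau(Text U Tag), as in Figure 1. Standard library only.
import xml.etree.ElementTree as ET

def xml_to_term(elem):
    """Render an XML element as a term: tags become operators,
    text content becomes a text-element argument."""
    children = list(elem)
    if not children:
        # A leaf tag wraps its text, e.g. <id>per0</id> -> id(per0)
        text = (elem.text or "").strip()
        return f"{elem.tag}({text})" if text else elem.tag
    args = ", ".join(xml_to_term(c) for c in children)
    return f"{elem.tag}({args})"

page = ET.fromstring(
    "<people><person><id>per0</id><name>Conte</name></person></people>"
)
print(xml_to_term(page))
# people(person(id(per0), name(Conte)))
```

As the paper notes, attributes could be handled the same way by treating each attribute as an extra tagged child.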

Definition 2.1 (immediate successors) The immediate successors relation for a given Web page p is defined by →P = {(p, p′) ∈ P × P | p′ is directly accessible from p}.

Definition 2.1 establishes a relationship between the page p and its immediate successors (i.e., the pages p1, . . . , pn which are commonly pointed to from p by means of hyperlinks).

We will use the associated computational relations →P, →+P, →∗P, etc., to describe the dynamic behavior of a Web site. For instance, the reachability of a given Web page p′ from another page p can be expressed as p →∗P p′.
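The reachability relation →∗P is just the reflexive-transitive closure of the immediate-successor relation, and can be sketched as a breadth-first traversal (the dict encoding of the successor relation below is our own illustration):

```python
# Hypothetical sketch: deciding p ->*P p' over the immediate-successor
# relation of Definition 2.1, encoded as a dict from each page name to
# the list of pages it links to.
from collections import deque

def reachable(succ, start):
    """Return the set of pages reachable from `start` (itself included)."""
    seen, queue = {start}, deque([start])
    while queue:
        p = queue.popleft()
        for q in succ.get(p, ()):
            if q not in seen:
                seen.add(q)
                queue.append(q)
    return seen

# A set of pages forms a Web site in the sense of Definition 2.2 when
# some (initial) page reaches all the others.
succ = {"p1": ["p2", "p3"], "p2": ["p4"], "p3": ["p5"], "p4": [], "p5": []}
print(reachable(succ, "p1"))  # p1 reaches all five pages
```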

Definition 2.2 (Web site) A Web site is defined as a set of Web pages reachable from an initial Web page, and is denoted by

W = {p1, . . . , pn}, s.t. ∃i, 1 ≤ i ≤ n, ∀j, 1 ≤ j ≤ n, pi →∗W pj

Definition 2.2 formalizes the idea that a Web site has an initial Web page which allows one to visit the whole Web site. Note that there may exist several initial Web pages of a given Web site.

Example 2.3 The algebraic description of a simple Web site modeling an on-line auction system is shown in Figure 2. It contains information regarding open and closed auctions, auctioned items, and registered users.

3. Rewriting-based Web Verification

In this section, we briefly recall the formal verification methodology proposed in [4], which allows us to detect forbidden/erroneous contents as well as missing information in a Web site. This methodology is able to recognize and exactly locate the source of a possible discrepancy between the Web site and the properties required in the Web specification.

3.1. The Web specification language

A Web specification is a triple (R, IN, IM), where R, IN, and IM are finite sets of rules. The set R contains the definition of some auxiliary functions which the user would like to provide, such as string processing, arithmetic, boolean operators, etc. R is formalized as a term rewriting system, which is handled by standard rewriting [18, 24, 30].

The second set IN describes constraints for detecting erroneous Web pages (correctNess rules). A correctness rule has the following syntax:

l ⇀ error | C

where l is a term, error is a reserved constant, and C is a (possibly empty) finite sequence which may contain regular expressions and/or equations/inequalities over terms. For instance, the rule blink(X) ⇀ error states that blinking text is forbidden in the Web site.

The third set of rules IM specifies some properties for discovering incomplete/missing Web pages (coMpleteness rules). A completeness rule is defined as

l ⇀ r 〈q〉

where l and r are terms and q ∈ {E, A}. Completeness rules of a Web specification formalize the requirement that some information must be included in all or some pages of the Web site. We use the attributes 〈A〉 and 〈E〉 to distinguish “universal” from “existential” rules.

Example 3.1 A Web specification for the on-line auction system of Example 2.3 is shown in Figure 2. The first two rules are correctness rules, while the other four rules are completeness rules. The first rule is intended for auditing Web site contents and requires that, in an open auction, the reserve price (or the lowest price at which a seller is willing to sell an item) is greater than the initial one. The second rule requires email addresses to contain the symbol @, by checking whether they belong to the regular language given by the expression [:Text:]+ @ [:Text:]+. The third (existential) rule formalizes the following property: if there is a Web page with the information about the seller of an open auction, then such a seller must be registered. The fourth rule also states an existential property: if there is an auctioned item that is listed in two or more categories, then at least one of these categories must be “unit” and the other one “pack”. The fifth rule formalizes the following universal property: for each client registered in the Web site, there must exist a page containing his name. Finally, the last rule states that, for every item that is sold, a closed auction associated with the item must exist.

3.2. Homeomorphic embedding and partial rewriting

Partial rewriting extracts “some pieces of information” from a page, pieces them together, and then rewrites the glued term. The assembling is done by means of a single embedding relation, which allows us to verify whether a given Web page template is somehow “enclosed” within another one and thus replaces the traditional pattern matching.

Web site W = {p1, p2, p3, p4, p5}, where

p1) list-items(
      item(id(ite0), name(racket), state(sold),
           description(Wilson tennis racket),
           incategories(category(cat1))),
      item(id(ite1), name(shirt), state(available),
           description(men's t-shirts),
           incategories(category(cat1), category(cat2))),
      item(id(ite2), name(shoes), state(sold),
           description(women's shoes),
           incategories(category(cat0), category(cat2))) )

p2) list-categories(
      pack(category(cat0)),
      unit(category(cat1), category(cat2)) )

p3) people(
      person(id(per0), name(Eliyahu), email([email protected])),
      person(id(per1), name(Melski), email([email protected])),
      person(id(per2), name(Conte), email([email protected])),
      person(id(per3)) )

p4) open-auctions(
      open-auction(id(open-auction0), item(ite0),
        initial(48.51), reserve(77.5),
        bidder(person(per0)), seller(person(per1))) )

p5) closed-auctions(
      closed-auction(seller(person(per1)), buyer(person(per0)),
        item(ite0), price(77.5)),
      closed-auction(seller(person(per5)), buyer(person(per0)),
        item(ite2), price(45.2)) )

Web specification (IN, IM, R), where IN = {r1, r2} and IM = {r3, r4, r5, r6}:

r1) open-auction(initial(X), reserve(Y)) ⇀ error | X > Y
r2) person(email(X)) ⇀ error | X not in [:Text:]+ @ [:Text:]+
r3) open-auction(seller(person(X))) ⇀ ♯person(♯id(X)) 〈E〉
r4) list-items(item(incategories(X, Y))) ⇀ ♯list-categories(pack(X), unit(Y)) 〈E〉
r5) people(person(id(X))) ⇀ ♯people(♯person(♯id(X)), name) 〈A〉
r6) list-items(item(id(X), state(sold))) ⇀ ♯closed-auction(item(X)) 〈E〉

Figure 2. Web site and Web specification for an on-line auction system.

This relation closely resembles the notion of simulation [22], which has been widely used in a number of works about querying, transformation, and verification of semistructured data (cf. [1, 9, 10, 11, 21]).

We give a definition of homeomorphic embedding, E, which is an adaptation of the one proposed in [26], where (i) a simpler treatment of the variables is considered, (ii) function symbols with variable arities are allowed, (iii) the relative positions of the arguments of terms are ignored (i.e., f(a, b) is not distinguishable from f(b, a)), and (iv) we ignore the usual diving rule¹ [26].

Definition 3.2 (homeomorphic embedding) The homeomorphic embedding relation

E ⊆ τ(Text ∪ Tag, V) × τ(Text ∪ Tag, V)

on Web page templates is the least relation satisfying:

1. X E t, for all X ∈ V and t ∈ τ(Text ∪ Tag, V).

2. f(t1, . . . , tm) E g(s1, . . . , sn) iff f ≡ g and ti E sπ(i), for i = 1, . . . , m, for some injective function π : {1, . . . , m} → {1, . . . , n}.

Whenever s E t, we say that t embeds s (or s is embedded or “recognized” in t).

The intuition behind the above definition is that s E t iff s can be obtained from t by striking out certain parts. In other words, the structure of the template s appears within the specific Web data term t.

¹The diving rule allows one to “strike out” a part of the term at the right-hand side of the relation E. Formally, s E f(t1, . . . , tn) if s E ti, for some i.

Let us illustrate Definition 3.2 by means of a rather intuitive example.

Example 3.3 Consider the following Web page templates (called s1 and s2, respectively):

people(person(email(X), id(Y)))
people(person(id(per0), name(Eliyahu), email([email protected])))

Note that the structure of s1 can be recognized inside the structure of s2, hence s1 E s2, while s2 E s1 does not hold. Also note that the order of the arguments is irrelevant.
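The embedding test of Definition 3.2 can be sketched as follows, under the assumption that terms are encoded as (symbol, children) pairs, with children = None marking a variable; the injective argument mapping π is searched by backtracking (this encoding and the function names are ours):

```python
# Hypothetical sketch of the homeomorphic embedding of Definition 3.2.
# A term is (symbol, [children]); a variable is (name, None).

def is_var(t):
    return t[1] is None

def embeds(s, t):
    """True iff s E t. Roots must coincide (no diving rule), and the
    argument order is irrelevant (any injective mapping pi will do)."""
    if is_var(s):
        return True                      # rule 1: X E t for every t
    if is_var(t) or s[0] != t[0]:
        return False
    return _match(s[1], list(t[1]))      # rule 2: injective argument map

def _match(args, targets):
    # Backtracking search: map each argument of s into a distinct
    # argument of t (a greedy choice alone would not be complete).
    if not args:
        return True
    first, rest = args[0], args[1:]
    for i, cand in enumerate(targets):
        if embeds(first, cand) and _match(rest, targets[:i] + targets[i+1:]):
            return True
    return False

# Example 3.3:
V = None
s1 = ("people", [("person", [("email", [("X", V)]), ("id", [("Y", V)])])])
s2 = ("people", [("person", [("id", [("per0", [])]),
                             ("name", [("Eliyahu", [])]),
                             ("email", [("[email protected]", [])])])])
print(embeds(s1, s2), embeds(s2, s1))  # True False
```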

Now we are ready to introduce the partial rewrite relation between Web page templates. W.l.o.g., we disregard conditions and/or quantifiers from the Web specification rules. Roughly speaking, given a Web specification rule l ⇀ r, partial rewriting allows us to extract from a given Web page s a subpart of s which is simulated by a ground instance of l, and to replace s by a reduced, ground instance of r. Let s, t ∈ τ(Text ∪ Tag, V). Then, s partially rewrites to t via rule l ⇀ r and substitution σ iff there exists a position u ∈ OTag(s) such that (i) lσ E s|u, and (ii) t = Reduce(rσ, R), where the function Reduce(x, R) computes, by standard term rewriting, the irreducible form of x in R. Note that the context of the selected reducible expression s|u is disregarded after each partial rewrite step. By the notation s ⇀I t, we denote that s is partially rewritten to t using a rule belonging to the set I.

3.3. Web verification methodology

Roughly speaking, our verification methodology works as follows. First, by using the homeomorphic embedding relation of Definition 3.2, we check whether the left-hand side l of some Web specification rule is embedded into a given page p of the considered Web site. When the embedding test l E p succeeds, by extending the proof, we construct the biggest substitution² σ for the variables in Var(l) such that lσ E p. Then, depending on the nature of the Web specification rule (correctness or completeness rule), we proceed as follows:

Correctness rule. Given a correctness rule l ⇀ error | C and a Web page p ∈ W, a correctness error is defined by the tuple (p, w, l, σ), where w ∈ OTag(p) and σ is a substitution. Basically, (p, w, l, σ) formalizes the fact that a wrong instance lσ of a Web page template l is detected in the subterm p|w of p.

Completeness rule. We apply a partial rewriting step to p, that is, we replace the Web page p embedding lσ with the instantiated right-hand side rσ, in symbols p ⇀ rσ. Each term rσ generated by partial rewriting is considered a requirement. From the requirements computed so far, new requirements are generated by means of a fixpoint computation by applying further partial rewriting steps. Then, by a new homeomorphic embedding test, we check whether the given requirement rσ is recognized within some page of the considered Web site. This way, a completeness error (missing Web page, universal error, or existential error) is signalled. The triple (r, {p1, . . . , pn}, q), with q = A (resp. q = E), is a universal (resp. existential) completeness error.

When a missing Web page error is detected, the evidence (r, W) signals that the expression r does not appear in the whole Web site W.

Example 3.4 Consider again the Web specification and Web site W of Example 3.1. Then, the set of completeness requirements computed in the fixpoint by our verification methodology is

Sreq = {rq1, rq2, rq3, rq4, rq5, rq6, rq7, rq8, rq9, rq10}, where

rq1) 〈♯person(♯id(per1)), E〉
rq2) 〈♯person(♯id(per5)), E〉
rq3) 〈♯list-categories(pack(category(cat1)), unit(category(cat2))), E〉
rq4) 〈♯list-categories(pack(category(cat0)), unit(category(cat2))), E〉
rq5) 〈♯closed-auction(item(ite0)), E〉
rq6) 〈♯closed-auction(item(ite2)), E〉
rq7) 〈♯people(♯person(♯id(per0), name)), A〉
rq8) 〈♯people(♯person(♯id(per1), name)), A〉
rq9) 〈♯people(♯person(♯id(per2), name)), A〉
rq10) 〈♯people(♯person(♯id(per3), name)), A〉

²The substitution σ is easily obtained by composing the bindings X/t, which can be recursively gathered during the homeomorphic embedding test X E t, for X ∈ l and t ∈ p.

Now, by a new homeomorphic embedding test, we check whether each requirement in the set Sreq is recognized inside some page of the Web site W. This produces the following set of completeness errors

EM = {e1, e2, e3}, where

e1) 〈rq2, {p3}, E〉
e2) 〈rq3, {p2}, E〉
e3) 〈rq10, {p3}, A〉

Note that no correctness error exists in W w.r.t. the considered Web specification.

4. Abstract Web site verification

Let us first introduce the notion of abstract domain. Our first definition is motivated by the fact that we want to formalize the abstraction as a source-to-source transformation which translates Web documents and Web specification rules into constructions of the very same languages; hence our concrete and abstract domains do coincide.

Definition 4.1 (abstract non-ground term algebra, poset) Let D = (τ(Text ∪ Tag, V), ≤) be the standard domain of (equivalence classes of) terms, ordered by the standard partial order ≤ induced by the preorder on terms given by the relation of being “more general”.

Then the domain of abstract terms Dα is equal to D.

We define the abstraction tα of a term t as tα = α(t). Our framework is parametric w.r.t. the abstraction function α, which can be used to tune the accuracy of the approximation. For example, for on-line auctioning, a convenient abstraction function should be defined so as to distinguish registered bidders and sellers, and auctioned items.

Definition 4.2 (term abstraction α) Let atext :: Tag∗ × Text → Text be a text abstraction function.

α :: τ(Text ∪ Tag, V) → τ(Text ∪ Tag, V)
α(t) = α(ε, t)

where the auxiliary function α is given by

α :: Tag∗ × τ(Text ∪ Tag, V) → τ(Text ∪ Tag, V)
α(_, x) = x, if x ∈ V
α(c, f(t1, . . . , tn)) = f(α(c.f, t1), . . . , α(c.f, tn)), if f ∈ Tag
α(c, w) = atext(c, w), if w ∈ Text

In particular, the reader may notice that elements of Text are abstracted by taking into account the chain of tags under which a particular piece of text appears. This is formalized by means of the text abstraction function

atext :: Tag∗ × Text → Text

which is left undefined and is actually the formal parameter of the definition.

The text abstraction function atext should be conveniently fixed in order to tune the abstraction for each particular domain. For instance, in the case where no tag distinction is needed, each element in Text could be simply replaced by some abstract fresh constant symbol d.

Example 4.3 Consider again the Web page p2 of Example 2.3. By fixing

atext(_, w) = first(w), where first(x.xs) = x,

the resulting abstraction of p2 is

α(p2) = list-categories(pack(category(c)), unit(category(c), category(c)))

Note that this abstraction function produces indistinctness among leaves of a term that influence the properties to be verified. As a consequence of this lack of precision, by fixing this abstraction we would not be able to detect the error e2 of Example 3.4 anymore.
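The abstraction of Definition 4.2, applied to Example 4.3, can be sketched as follows; the (symbol, children) term encoding is our own, and atext is passed in as an ordinary function parameter:

```python
# Hypothetical sketch of the term abstraction alpha of Definition 4.2,
# parametric in atext(c, w), where c is the chain of tags above the
# text w. A term is (symbol, children): children is None for a
# variable, [] for a text element, non-empty for a tag element.

def alpha(t, atext, chain=()):
    symbol, children = t
    if children is None:           # variables are left untouched
        return t
    if not children:               # text element: abstract it in context
        return (atext(chain, symbol), [])
    return (symbol,                # tag element: recurse under chain c.f
            [alpha(c, atext, chain + (symbol,)) for c in children])

# Example 4.3: atext(_, w) = first(w) keeps only the first character.
first = lambda chain, w: w[0]

p2 = ("list-categories", [
    ("pack", [("category", [("cat0", [])])]),
    ("unit", [("category", [("cat1", [])]),
              ("category", [("cat2", [])])])])

# All three category texts collapse to the single abstract text "c",
# which is exactly the loss of precision discussed above.
print(alpha(p2, first))
```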

In order to achieve correctness of the abstraction, we restrict our interest to text abstraction functions atext which distinguish those pieces of text that are observed by the Web specification rules and thus potentially affect the result of the verification.

The following auxiliary functions allow us to know whether a sequence of tags is recognized within some rule of the Web specification. This allows us to determine whether the term occurring at the corresponding position in the Web page needs to be carefully considered. Let J be a Web specification and s ∈ τ(Text ∪ Tag, V). Let c ∈ Tag∗ and t ∈ Text. The function text term(c, t) returns the term that is obtained by composing as a term the tags of the sequence c together with the piece of text t. For instance, text term(f.g, a) = f(g(a)). We define the boolean function gen embJ(s) that returns True if there is a (sub)term t occurring in the left-hand or right-hand side of a rule in J such that some instance t′ of t embeds s, in symbols s E t′; otherwise it returns False. For instance, for the Web specification J = {f(h(Y), g(X)) ⇀ m(X)} and the term f(g(a)), gen embJ(f(g(a))) = True.

When no confusion can arise, we simply write gen emb ttJ(c, s) for gen embJ(text term(c, s)).

Now, text abstraction functions are required to obey the following correctness condition w.r.t. W.

Definition 4.4 (correctness condition w.r.t. W) Let W be a Web site and J be a Web specification. Let s, t ∈ Text be any two pieces of text in W. For every c ∈ Tag∗ such that gen emb ttJ(c, s) ≡ True and gen emb ttJ(c, t) ≡ True, the text abstraction function atext satisfies

s ≢ t ⇒ atext(c, s) ≢ atext(c, t)

The condition above formalizes the idea that, whenever two pieces of text are indistinguishable in the abstract domain, then they are also indistinguishable in the concrete domain.
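Operationally, Definition 4.4 amounts to requiring that atext(c, ·) be injective on the texts observed under each relevant tag chain c. A minimal sketch of such a check (assuming the observed texts per chain have already been collected via gen emb tt, which we do not reproduce here):

```python
# Hypothetical sketch of checking the correctness condition of
# Definition 4.4. `observed` maps each tag chain c for which
# gen_emb_tt holds to the set of texts of W occurring under c.

def atext_correct(observed, atext):
    """True iff atext never merges two distinct observed texts
    appearing under the same tag chain."""
    for chain, texts in observed.items():
        images = {atext(chain, w) for w in texts}
        if len(images) != len(texts):
            return False       # two distinct texts were identified
    return True

first = lambda chain, w: w[0]
observed = {("list-categories", "pack", "category"): {"cat0", "cat1"}}
print(atext_correct(observed, first))
# False: cat0 and cat1 both abstract to "c", as in Example 4.3
```

This is precisely why the first-letter abstraction of Example 4.3 loses the error e2: it violates the condition on the texts observed by rule r4.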

4.1. Abstract Web specification

The abstraction of the correctness and completeness Web specification rules is simply based on abstracting the terms occurring in the left-hand and right-hand sides of the rules. In particular, the conditional parts of the correctness rules are not abstracted; hence we let concrete conditions be applied to concrete as well as abstract data.

Definition 4.5 (abstract specification rule) Let

α :: τ(Text ∪ Tag, V) → τ(Text ∪ Tag, V)

be a term abstraction function with text abstraction function atext :: Tag∗ × Text → Text.

Let rlM ≡ l ⇀ r 〈q〉 be a completeness rule and rlN ≡ l ⇀ error | C be a correctness rule. We denote by rlαM (resp. rlαN) the abstraction of rlM (resp. rlN), where rlαM ≡ α(l) ⇀ α(r) 〈q〉 (resp. rlαN ≡ α(l) ⇀ error | C).

Example 4.6 Consider the completeness rule r6 of Example 3.1. By fixing the text abstraction function atext(c, x) = last(x), where last(w) returns the last element of the sequence w, the computed abstract completeness rule α(r6) is list-items(item(id(X), state(d))) ⇀ ♯closed-auction(item(X)) 〈E〉.

When no confusion can arise, we just write rlαM ≡ lα ⇀ rα 〈q〉 (resp. rlαN ≡ lα ⇀ error | C). The Web specification (IN, IM, R) is lifted to (IαN, IαM, R) element-wise.

To ensure the soundness of the abstract framework, we need to precisely relate the satisfiability of the conditions over abstract and concrete data. Specifically, we require that the fulfillment of an abstract condition implies the fulfillment of the corresponding concrete condition.

Definition 4.7 (correctness condition w.r.t. IN) Let α :: τ(Text ∪ Tag, V) → τ(Text ∪ Tag, V) be a term abstraction function with text abstraction function atext :: Tag∗ × Text → Text. Let rl ≡ l ⇀ error | C be a correctness rule. Then, α is correct w.r.t. rl iff for each substitution σ ≡ {X1/t1, . . . , Xn/tn}

Cσα holds ⇒ Cσ holds

where σα ≡ {X1/α(t1), . . . , Xn/α(tn)}.

Moreover, α is correct w.r.t. IN if it is correct w.r.t. every correctness rule of IN.

4.2. Abstract Web site

When navigating a Web site, it is common to find a number of pages that have a similar structure but different contents. This happens very often when pages are dynamically generated by some script which extracts contents from a database (e.g. in Amazon's Web site). This can make our simple analysis impracticable unless we provide a mechanism to drastically reduce the Web site size. In order to ensure that the verified properties are not affected by the abstraction, in this section we develop an abstract methodology which derives an approximation of Web sites from the considered Web specifications.

Let us introduce a compression function for terms which reduces the size of each single Web page by dropping some arguments from the term that represents the page, possibly reducing the number of branches of the tree. This is used as a preprocessing step prior to the abstraction of a given Web site.

4.2.1 Web compression pre-processing

Let (IN, IM, R) be a Web specification and s, t ∈ τ(Text ∪ Tag). We define three auxiliary functions root, join, and max_ar. We denote by root(s) the function symbol at the top of s, in symbols root(f(s1, . . . , sn)) = f. The function join(s, t) returns the term that is obtained by concatenating the arguments of terms s and t (if they exist), whenever root(s) = root(t). For instance, join(f(a, b, c), f(b, e)) = f(a, b, c, b, e). Finally, function max_ar(f, IM) returns the maximal arity of f in IM.

Definition 4.8 (term compression) Let (IN, IM, R) be a Web specification. Then, the term f(t1, . . . , tn) ∈ τ(Text ∪ Tag) is compressed by using function COMPRESS given in Algorithm 1, which packs together those subterms which are rooted by the same root symbol, while ensuring that the arity of f after the transformation is not smaller than max_ar(f, IM).

The idea behind Definition 4.8 is as follows. Roughly speaking, all arguments with the same root symbol f occurring at level i are joined together. Then, compression recursively proceeds to level (i + 1). The condition that the maximal arity of f in IM must be respected is essential for the correctness of our method, as it ensures that a partial rewrite step on an abstract term is always enabled whenever the corresponding partial rewrite step can be executed in the concrete domain. Let us see an example.

Example 4.9 Consider the Web page p1 and the completeness rule r4 of Example 3.1. The left-hand side of rule r4 is embedded in p1, in symbols

list-items(item(incategories(X, Y))) E p1

Algorithm 1 Term Compression Transformation.
Input:
  Term t = f(t1, . . . , tn)
  IM a set of completeness rules
Output:
  Term f(t′1, . . . , t′m), with m ≤ n

 1: function COMPRESS(t, IM)
 2:   if n = 0 then
 3:     return f
 4:   else if max_ar(f, IM) < n and ∃ i, j s.t. root(ti) = root(tj) then
 5:     t′ ← join(ti, tj)
 6:     return COMPRESS(f(t1, . . . , ti−1, t′, ti+1, . . . , tj−1, tj+1, . . . , tn), IM)
 7:   else
 8:     return f(COMPRESS(t1, IM), . . . , COMPRESS(tn, IM))
 9:   end if
10: end function

If we naïvely compressed p1 without respecting the maximal arity in IM of the function symbol “incategories”, we would get

p′1 = list-items(item(id(ite0), name(racket),
                      description(Wilson tennis racket),
                      incategories(category(cat1))),
                 item(id(ite0), name(shirt),
                      description(men’s t-shirts),
                      incategories(category(cat1, cat2))),
                 item(id(ite2), name(shoes),
                      description(women’s shoes),
                      incategories(category(cat0, cat2))))

Unfortunately,

list-items(item(incategories(X, Y))) 6E p′1

since in p′1 the arity of the symbol “incategories” is lower than in the lhs of r4.
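The compression step of Algorithm 1 can be sketched in Python. The encoding of a term f(t1, . . . , tn) as a nested tuple (symbol, (children, . . .)) and the representation of max_ar as a dictionary precomputed from IM are our own assumptions:

```python
def compress(t, max_ar):
    """COMPRESS of Algorithm 1 (sketch): repeatedly join sibling subterms
    with the same root symbol, but only while the current arity of f still
    exceeds the maximal arity of f in the completeness rules I_M."""
    f, args = t
    n = len(args)
    if n == 0:
        return t
    if max_ar.get(f, 0) < n:
        for i in range(n):
            for j in range(i + 1, n):
                if args[i][0] == args[j][0]:        # same root symbol
                    joined = (args[i][0], args[i][1] + args[j][1])
                    rest = args[:i] + (joined,) + args[i + 1:j] + args[j + 1:]
                    return compress((f, rest), max_ar)
    # no join enabled: compress the arguments recursively
    return (f, tuple(compress(a, max_ar) for a in args))
```

For instance, with max_ar = {'f': 1, 'g': 2}, the term f(g(a), g(b)) is compressed to f(g(a, b)); raising max_ar['f'] to 2 disables the join, mirroring the arity side condition that Example 4.9 shows to be essential.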

Now we are ready to formalize our notion of Web site approximation.

4.2.2 Web site abstraction

In order to approximate a Web site, we start from an initial Web page and recursively apply the successor relation (→), while implementing a simple depth-first search (DFS) [15].

Definition 4.10 (abstract Web site) Let W be a Web site, p be an initial page of W, and (IN, IM, R) be a Web specification. Then, the abstraction of W is defined by:

α(W) = DFS(p, ∅, IM)

where function DFS is given in Algorithm 2.

Note that, after applying the transformation above, the information in the Web pages as well as the number of pages in the Web site can be significantly reduced.

Algorithm 2 Web site abstraction.
Input:
  p :: τ(Text ∪ Tag, V)
  Wα :: set(τ(Text ∪ Tag, V))
  IM a set of completeness rules
Output:
  Wα :: set(τ(Text ∪ Tag, V))

1: function DFS(p, Wα, IM)
2:   pα ← COMPRESS(α(p), IM)
3:   Wα ← Wα ∪ {pα}
4:   for all i s.t. (p, pi) ∈ → and COMPRESS(α(pi), IM) 6∈ Wα do
5:     Wα ← DFS(pi, Wα, IM)
6:   end for
7:   return Wα
8: end function
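Algorithm 2 can be sketched as the following Python traversal; here `successors` and `abstract` are assumed parameters of our own, the latter standing for the composition COMPRESS(α(·), IM):

```python
def abstract_site(p0, successors, abstract):
    """DFS of Algorithm 2 (sketch): collect the compressed abstraction of
    every page reachable from p0, skipping any page whose abstraction has
    already been collected."""
    W_alpha = set()

    def dfs(p):
        W_alpha.add(abstract(p))
        for q in successors(p):
            if abstract(q) not in W_alpha:
                dfs(q)

    dfs(p0)
    return W_alpha
```

If several concrete pages share one abstraction, only the first one reached is expanded; this is what shrinks the number of pages in the abstract Web site.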

4.3. Abstract verification soundness

The abstraction function given in Definition 4.2 defines an abstraction by a source-to-source transformation. Due to this source-to-source approximation scheme, all facilities supported by our previous verification system can be straightforwardly adapted and reused with very little effort.

Informally, our abstract verification methodology applies to the considered abstract descriptions of the Web site and Web specification. Given a Web specification (IN, IM, R) and a Web site W, we first generate the corresponding abstractions (IαN, IαM, R) and Wα. Then, since we consider a source-to-source transformation, we apply our original verification algorithm [4] to analyse Wα w.r.t. (IαN, IαM, R). We call abstract correctness (resp., completeness) error each correctness (resp., completeness) error which is detected in Wα using (IαN, IαM, R) by the verification methodology.

In order to guarantee the soundness of the abstract diagnosis, we have to ensure that, when fed with the abstracted data, the partial rewriting relation, ⇀, correctly approximates the behavior of the partial rewriting relation over the corresponding concrete representation. In the following, we present some results which state the soundness of our abstract representation. The extended version of this work (see [6]) contains the proofs of these results. First of all we introduce the notion of abstract embedding, which is used to establish a relation between concrete and abstract terms.

[Figure 3. Abstract embedding: the trees t1 = f(g(a), g(b)) and t2 = f(g(c, d)) of Example 4.12, with dashed arrows mapping both g-branches of t1 into the single g-branch of t2.]

Definition 4.11 (abstract embedding) The abstract embedding relation

E] ⊆ τ(Text ∪ Tag, V) × τ(Text ∪ Tag, V)

w.r.t. a function atext :: Tag∗ × Text → Text on Web page templates is the least relation satisfying the rules:

1. X E] t, for all X ∈ V and t ∈ τ(Text ∪ Tag, V).

2. f(t1, . . . , tm) E] g(s1, . . . , sn) iff f ≡ g and ti E] sπ(i), for i = 1, . . . , m, and some total function π :: {1, . . . , m} → {1, . . . , n}.

3. c E] c′ iff c′ ≡ atext(x, c) for some x ∈ Tag∗.

Given t1, t2 ∈ τ(Text ∪ Tag, V) such that t1 E] t2 w.r.t. atext :: Tag∗ × Text → Text, we say that t2 safely approximates t1.

Basically, Definition 4.11 slightly modifies Definition 3.2 by allowing the detection of noninjective embeddings and the renaming of some constants. In other words, if t1 E] t2, two distinct paths in t1 may be mimicked (i.e. simulated) by a single path appearing in t2 modulo a (possible) renaming of some leaves of t1.

Example 4.12 Consider the terms t1 ≡ f(g(a), g(b)) and t2 ≡ f(g(c, d)), which are depicted in Figure 3. Then, t1 6E t2 and t2 6E t1, but we have t1 E] t2 w.r.t. the function {(tags, a) 7→ c, (tags′, b) 7→ d | tags, tags′ ∈ Tag∗}, as shown in Figure 3 by means of dashed arrows. Moreover, note that the two distinct edges of t1 from f to g are represented in t2 by a single edge from f to g.
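The three rules of Definition 4.11 can be checked recursively. The following Python sketch uses our own term encoding (('var', name) for variables, ('text', c) for text leaves, and (tag, (children, . . .)) otherwise) and receives, in place of atext, an assumed function images(c) returning the set of admissible abstractions of a text constant c over all tag paths:

```python
def embeds(t, s, images):
    """Abstract embedding t E] s (Definition 4.11, sketch):
    rule 1 - a variable embeds into anything;
    rule 3 - a text constant embeds into any admissible renaming of it;
    rule 2 - equal symbols, with every child of t embedded into *some*
             child of s (possibly non-injectively)."""
    if t[0] == 'var':                                   # rule 1
        return True
    if t[0] == 'text':                                  # rule 3
        return s[0] == 'text' and s[1] in images(t[1])
    if s[0] != t[0]:                                    # rule 2: f must equal g
        return False
    return all(any(embeds(ti, sj, images) for sj in s[1]) for ti in t[1])
```

On the terms of Example 4.12, with images mapping a to {c} and b to {d} (and the identity elsewhere), embeds reports that t1 is abstractly embedded in t2 but not vice versa.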

By using Definition 4.11, we are able to map any path of a concrete term t into a path of an abstract term: the structure and the labeling of t are represented in a compressed and suitably relabeled version of t, in which many paths of t are mapped to one shared path of the abstract description, as stated by the following proposition.

Proposition 4.13 Let t ∈ τ(Text ∪ Tag, V). Let α :: τ(Text ∪ Tag, V) → τ(Text ∪ Tag, V) be a term abstraction function whose text abstraction function is atext :: Tag∗ × Text → Text. Then, t E] α(t) w.r.t. atext.

Roughly speaking, Proposition 4.13 says that α(t) safely approximates the (concrete) term t.

Example 4.14 Consider again the terms t1 and t2 of Figure 3. Let atext :: Tag∗ × Text → Text be defined as {(f.g, a) 7→ c, (f.g, b) 7→ d}. Assume that there exists a Web specification in which the maximal arity of g equals 1, so that the compression of the nodes of t1 labeled with g is enabled. Then, α(t1) ≡ t2 and t1 E] α(t1) w.r.t. function atext.

Thus, Proposition 4.13 states that abstract terms simulate the concrete terms through an abstract embedding (i.e., the E] relation w.r.t. a suitable renaming function). In the following, we demonstrate that such a property is preserved by partial rewriting. In other words, whenever a partial rewrite step t1 ⇀ t2 is executed in the concrete domain, a partial rewrite step over the abstract counterpart is enabled, in symbols α(t1) ⇀ t′2, such that t′2 still safely approximates t2 w.r.t. E]. The following property holds.

Proposition 4.15 Let s, t ∈ τ(Text ∪ Tag, V). Let α :: τ(Text ∪ Tag, V) → τ(Text ∪ Tag, V) be a term abstraction function with text abstraction function atext :: Tag∗ × Text → Text. Let J ≡ (IN, IM, R) be a Web specification, and J α ≡ (IαN, IαM, R) be the abstract version of J. If s ⇀IM t, then

• α(s) ⇀IαM t′

• t E] t′ w.r.t. atext

Proposition 4.15 can be generalized to partial rewrite sequences using a simple inductive argument. More formally,

Proposition 4.16 Let α :: τ(Text ∪ Tag, V) → τ(Text ∪ Tag, V) be a term abstraction function with text abstraction function atext :: Tag∗ × Text → Text. Let J ≡ (IN, IM, R) be a Web specification, and J α ≡ (IαN, IαM, R) be the abstract version of J. If t0 ⇀IM t1 ⇀IM . . . ⇀IM tn, n ≥ 0, then

1. α(t0) ⇀IαM t′1 ⇀IαM . . . ⇀IαM t′n;

2. tn E] t′n w.r.t. atext.

Given an (abstract) partial rewrite sequence Sα ≡ α(t1) ⇀IαM t′2 ⇀IαM . . . ⇀IαM t′n, we call abstract completeness requirement any term appearing in Sα. Now, our verification methodology works as follows: first, we compute the concrete completeness requirements and, then, we check whether such requirements are fulfilled in the considered Web site. As explained in Section 3.1, a completeness requirement r is a term which occurs in a partial rewrite sequence of the form p ≡ t0 ⇀ t1 ⇀ t2 . . . ⇀ tn ≡ r, where p ∈ τ(Text ∪ Tag) is a Web page of the Web site W and r ∈ τ(Text ∪ Tag) is the computed (completeness) requirement. Therefore, by Proposition 4.16, we can directly conclude that each concrete requirement is safely approximated by its abstract description.

Corollary 4.1 Let α :: τ(Text ∪ Tag, V) → τ(Text ∪ Tag, V) be a term abstraction function with text abstraction function atext :: Tag∗ × Text → Text. Let W be a Web site, and Wα be the abstract version of W. Let (IN, IM, R) be a Web specification, and (IαN, IαM, R) be the abstract version of (IN, IM, R). If r is a completeness requirement for W computed by (IN, IM, R), then there exists an abstract completeness requirement rα for Wα computed by (IαN, IαM, R) such that r E] rα w.r.t. atext.

The fact that any concrete completeness requirement is safely approximated by an abstract completeness requirement ensures that the abstract verification is safe, that is, whenever an abstract requirement is fulfilled in the abstract Web site, the corresponding concrete requirement is fulfilled in the concrete domain. This allows us to conclude the absence of concrete errors whenever no abstract errors are detected. Formally, the following theorem holds.

Theorem 4.1 Let α :: τ(Text ∪ Tag, V) → τ(Text ∪ Tag, V) be a term abstraction function with text abstraction function atext :: Tag∗ × Text → Text. Let W be a Web site and Wα be the abstract version of W. Let J ≡ (IN, IM, R) be a Web specification, and J α ≡ (IαN, IαM, R) be the abstract version of J. Then, if Wα does not contain any abstract (universal/existential) completeness error w.r.t. J α, W does not contain any concrete (universal/existential) completeness error w.r.t. J.

Note that, whenever we detect an abstract completeness error, we are not able to guarantee the presence of a concrete completeness error. This is mainly due to the fact that the abstraction can enable partial rewriting steps over abstract descriptions which are not possible in the concrete domain. Thus, there might be an abstract requirement which does not correspond to any concrete requirement, as illustrated by the following example.

Example 4.17 Consider the following set IM of completeness rules of a Web specification

r1) f(X, Y) ⇀ m 〈E〉
r2) f(a, b) ⇀ m′ 〈E〉

and the Web site W = {f(f(a), f(b), h(c)), m}. Assume that the text abstraction function atext :: Tag∗ × Text → Text is defined as atext(_, t) = t for each t ∈ Text. Then,

Wα = {f(f(a, b), h(c)), m}.

Moreover, the abstract description of IM is equal to IM (i.e. IM ≡ IαM). The set of concrete requirements of W w.r.t. IM is {m}, while the set of abstract requirements of Wα w.r.t. IαM is {m, m′}. Note that m′ is not fulfilled in Wα, as it is not embedded in any abstract page of Wα; that is, m′ represents an abstract completeness error. On the other hand, requirement m′ cannot be computed in the concrete domain, since rule r2 cannot be applied to Web pages in W. Consequently, m′ is not responsible for any concrete completeness error in W.

The approximation we considered allows us to establish a safe connection between abstract and concrete correctness errors as well. In particular, we are able to ensure that whenever an abstract correctness error is detected, a corresponding correctness error must exist in the concrete counterpart.

Given a concrete correctness error e ≡ (p, w, l, σ), we define α(e) ≡ (α(p), wα, α(l), α(σ)), wα ∈ OTag(α(p)).

Theorem 4.2 Let α :: τ(Text ∪ Tag, V) → τ(Text ∪ Tag, V) be a term abstraction function with text abstraction function atext :: Tag∗ × Text → Text. Let W be a Web site and Wα be the abstract version of W. Let J ≡ (IN, IM, R) be a Web specification such that α is correct w.r.t. IN, and J α ≡ (IαN, IαM, R) be the abstract version of J. If Wα contains an abstract correctness error

eα ≡ (pα, wα, lα, σα)

w.r.t. J α, then W contains a concrete correctness error e ≡ (p, w, l, σ) w.r.t. J such that α(e) ≡ eα.

To conclude, by means of the approximation scheme formalized so far, we are able to apply the original verification framework to abstract data, providing an extremely efficient analysis which quickly locates correctness errors and ensures the absence of completeness errors in the concrete descriptions, saving the user's time.

In order to improve the efficiency of the analysis, the following optimization has been developed: whenever an abstract completeness error is found, the abstract verification process is stopped. Unfortunately, if we want to check the exact location of the errors, the concrete methodology must be executed from scratch. Nevertheless, if we divide a Web site into modules, we can perform the analysis modularly and execute the (heavier) concrete methodology only on those modules where the abstract analysis detects potential mistakes.

5. Implementation

An experimental implementation αVerdi of the abstract framework proposed in this paper has been developed and compared to the previous Verdi implementation on the realistic test cases given in [5], which are randomly generated by using the XML document generator xmlgen [13]. The tool xmlgen is able to produce a set of XML data, each of which is intended to challenge a particular primitive of XML processors or storage engines.

Table 1 shows some of the results we obtained for the simulation of two different Web specifications WS1 and WS2 for the on-line auction system on five different, randomly generated XML documents. Specifically, we tuned the generator for scaling factors from 0.01 to 0.1 to produce XML documents whose size ranges from 1Mb (corresponding to an XML tree of about 31000 nodes) to 10Mb (corresponding to an XML tree of about 302000 nodes).

Web specification WS1 aims at checking the verification power of our tool regarding data correctness, and thus includes only correctness rules. The specification rules of WS1 contain complex and demanding constraints which involve conditional rules with a number of membership tests and function evaluations. The Web specification WS2 aims at checking the completeness of the randomly generated XML documents. In this case, some critical completeness rules have been formalized which involve the processing of a significant amount of completeness requirements.

Nodes    Mb   App            WS1                     WS2
                        Verdi     αVerdi       Verdi        αVerdi
30 th     1     11 s    0.96 s    0.14 s      165.34 s      0.92 s
90 th     3    154 s    2.84 s    0.42 s    1,768.65 s      3.01 s
150 th    5    732 s    5.94 s    0.75 s    4,712.39 s     52.45 s
241 th    8  5,330 s    9.42 s    1.23 s   12,503.85 s    186.22 s
301 th   10  8,132 s   12.64 s    1.52 s   21,208.28 s    285.51 s

Table 1. Verdi-M Benchmarks

The results shown in Table 1 were obtained on a personal computer equipped with 1Gb of RAM, a 40Gb hard disk, and a Pentium Centrino CPU clocked at 1.75 GHz, running Ubuntu Linux 5.10.

For each Web specification WS1 and WS2, column Verdi shows the runtime of the original WebVerdi-M tool. Column App shows the time used for the approximation of the Web site w.r.t. the corresponding abstract Web specification. Finally, column αVerdi shows the execution time of the abstract verification tool αVerdi.

The preliminary results that we have obtained demonstrate a huge speedup w.r.t. our previous methodology. At the same time, the abstraction times are affordable given the complexity and size of the involved data sets: less than 5 minutes for the largest benchmark (10 Mb), with a very reduced space budget. We note that the original WebVerdi-M implementation was only able to process efficiently XML documents whose size was not bigger than 1Mb.

6. Conclusion

Web development tends to create data that exhibit many commonalities, which often results in extremely inefficient Web site verification models and methodologies. This paper describes a novel abstract methodology for Web site analysis and verification which offsets the high execution costs of analyzing complex Web documents. The framework is formalized as a source-to-source transformation which is parametric w.r.t. the abstraction and translates the Web documents and their specifications into constructions of the very same languages, so that an efficient implementation can be easily derived with very little effort. The key idea for the abstraction is to exploit the sub-structure similarity that is commonly found in HTML/XML documents. Note that no automatic abstraction refinement is needed because no relevant parts are lost due to abstraction.

References

[1] S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web. From Relations to Semistructured Data and XML. Morgan Kaufmann, 2000.

[2] M. H. Alalfi, J. R. Cordy, and T. R. Dean. A Survey of Analysis Models and Methods in Website Verification and Testing. In Proc. 7th Int'l Conf. on Web Engineering (ICWE 2007), volume 4607 of LNCS, pages 306–311. Springer, 2007.

[3] L. de Alfaro. Model Checking the World Wide Web. In Proc. 13th Int'l Conf. on Computer Aided Verification (CAV 2001), Paris, France, July 18–22, 2001, volume 2102 of LNCS, pages 337–349, 2001.

[4] M. Alpuente, D. Ballis, and M. Falaschi. Rule-based Verification of Web Sites. Software Tools for Technology Transfer, 8:565–585, 2006.

[5] M. Alpuente, D. Ballis, M. Falaschi, P. Ojeda, and D. Romero. A Fast Algebraic Web Verification Service. In Proc. of First Int'l Conf. on Web Reasoning and Rule Systems (RR 2007), volume 4524 of LNCS, pages 239–248, 2007.

[6] M. Alpuente, D. Ballis, M. Falaschi, P. Ojeda, and D. Romero. Abstract Web Site Verification in WebVerdi-M. Technical Report DSIC-II/19/07, DSIC-UPV, 2007.

[7] M. Alpuente, D. Ballis, M. Falaschi, and D. Romero. A Semi-automatic Methodology for Repairing Faulty Web Sites. In Proc. of the 4th IEEE Int'l Conference on Software Engineering and Formal Methods (SEFM'06), pages 31–40. IEEE Computer Society Press, 2006.

[8] D. Ballis and J. García-Vivó. A Rule-based System for Web Site Verification. In Proc. of 1st Int'l Workshop on Automated Specification and Verification of Web Sites (WWV'05), volume 157(2). ENTCS, Elsevier, 2005.

[9] E. Bertino, M. Mesiti, and G. Guerrini. A Matching Algorithm for Measuring the Structural Similarity between an XML Document and a DTD and its Applications. Information Systems, 29(1):23–46, 2004.

[10] F. Bry and S. Schaffert. Towards a Declarative Query and Transformation Language for XML and Semistructured Data: Simulation Unification. In Proc. of the Int'l Conference on Logic Programming (ICLP'02), volume 2401 of LNCS. Springer-Verlag, 2002.

[11] F. Bry and S. Schaffert. The XML Query Language Xcerpt: Design Principles, Examples, and Semantics. Technical report, 2002. Available at: http://www.xcerpt.org.

[12] P. Buneman, M. Grohe, and C. Koch. Path Queries on Compressed XML. In Proceedings of the 29th Int'l Conference on Very Large Data Bases (VLDB'03), pages 141–152. Morgan Kaufmann, 2003.

[13] Centrum voor Wiskunde en Informatica. XMark – an XML Benchmark Project, 2001. Available at: http://monetdb.cwi.nl/xml/.

[14] M. Clavel, F. Durán, S. Eker, P. Lincoln, N. Martí-Oliet, J. Meseguer, and C. Talcott. The Maude 2.0 System. In R. Nieuwenhuis, editor, Rewriting Techniques and Applications (RTA 2003), number 2706 in LNCS, pages 76–87. Springer-Verlag, 2003.

[15] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, Cambridge, MA, USA, 2nd edition, 2001.

[16] P. Cousot and R. Cousot. Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints. In POPL, pages 238–252, 1977.

[17] P. Cousot and R. Cousot. Systematic Design of Program Analysis Frameworks. In POPL, pages 269–282, 1979.

[18] N. Dershowitz and J.-P. Jouannaud. Rewrite Systems. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, volume B: Formal Methods and Semantics, chapter 6, pages 244–320. Elsevier Science, 1990.

[19] N. Dershowitz and D. Plaisted. Rewriting. Handbook of Automated Reasoning, 1:535–610, 2001.

[20] M. Fernandez, D. Florescu, A. Levy, and D. Suciu. Verifying Integrity Constraints on Web Sites. In Proc. of Sixteenth International Joint Conference on Artificial Intelligence (IJCAI'99), volume 2, pages 614–619. Morgan Kaufmann, 1999.

[21] M. F. Fernandez and D. Suciu. Optimizing Regular Path Expressions Using Graph Schemas. In Proc. of Int'l Conf. on Data Engineering (ICDE'98), pages 14–23, 1998.

[22] M. R. Henzinger, T. A. Henzinger, and P. W. Kopke. Computing Simulations on Finite and Infinite Graphs. In IEEE Symposium on Foundations of Computer Science, pages 453–462, 1995.

[23] N. E. Kadhi and H. El-Gendy. Advanced Method for Cryptographic Protocol Verification. Journal of Computational Methods in Science and Engineering, 6:109–119, 2006.

[24] J. Klop. Term Rewriting Systems. In S. Abramsky, D. Gabbay, and T. Maibaum, editors, Handbook of Logic in Computer Science, volume I, pages 1–112. Oxford University Press, 1992.

[25] T. Le Gall, B. Jeannet, and T. Jéron. Verification of Communication Protocols Using Abstract Interpretation of FIFO Queues. In Proceedings of Algebraic Methodology and Software Technology, 11th International Conference (AMAST'06), volume 4019 of LNCS, pages 263–274. Springer, 2006.

[26] M. Leuschel. Homeomorphic Embedding for Online Termination of Symbolic Methods. In T. Æ. Mogensen, D. A. Schmidt, and I. H. Sudborough, editors, The Essence of Computation, volume 2566 of LNCS, pages 379–403. Springer, 2002.

[27] S. Lucas. Rewriting-Based Navigation of Web Sites: Looking for Models and Logics. In Proc. of 1st Int'l Workshop on Automated Specification and Verification of Web Sites (WWV'05), volume 157(2). ENTCS, Elsevier, 2005.

[28] M. Benedikt, J. Freire, and P. Godefroid. VeriWeb: Automatically Testing Dynamic Web Sites. In Proc. of 11th Int'l WWW Conference. ENTCS, Elsevier, 2002.

[29] N. Polyzotis, M. Garofalakis, and Y. E. Ioannidis. Approximate XML Query Answers. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'04), pages 263–274. ACM, 2004.

[30] TeReSe, editor. Term Rewriting Systems. Cambridge University Press, Cambridge, UK, 2003.