A WSAD-Based Fact Extractor for J2EE Web Projects

Holger M. Kienle, University of Victoria, Victoria, Canada
Hausi A. Müller, University of Victoria, Victoria, Canada

1-4244-1450-4/07/$25.00 © 2007 IEEE

Abstract

This paper describes our implementation of a fact extractor for J2EE Web applications. Fact extractors are part of each reverse engineering toolset; their output is used by reverse engineering analyzers and visualizers. Our fact extractor has been implemented on top of IBM's Websphere Application Developer (WSAD). The extractor's schema has been defined with the Eclipse Modeling Framework (EMF) using a graphical modeling approach. The extractor extensively reuses functionality provided by WSAD, EMF, and Eclipse, and is an example of component-based development. In this paper, we show how we used this development approach to accomplish the construction of our fact extractor, which, as a result, could be realized with significantly less code and in shorter time compared to a homegrown extractor implemented from scratch. We have assessed our extractor and the produced facts with a table-based and a graph-based visualizer. Both visualizers are integrated with Eclipse.

1. Introduction

Besides static Web sites that are exclusively realized with HTML, there are increasingly dynamic sites that are realized with advanced technologies such as AJAX, JSP, .NET, and J2EE Web applications. As a result, advanced Web sites are highly complex software systems, combining diverse functionality such as found in databases, distributed systems, and hypermedia [25]. There is now a broad range of development tools that support the construction of Web sites. Examples range from relatively simple tools such as Microsoft FrontPage and Adobe GoLive to complex ones such as IBM Websphere Application Developer (WSAD) and Vignette StoryServer.

With the increase in complexity of dynamic Web sites, understanding and maintenance of these sites is becoming more and more difficult. Web site reverse engineering (WSRE) retargets reverse engineering approaches (such as software analyses and visualizations [23]) to Web sites. In order to offer reverse engineering functionality for Web applications it is necessary to first extract suitable facts from the target Web site.

This paper describes the realization of a fact extractor for Web sites that leverages the WSAD 5.1.2 tool. WSAD is implemented as a set of proprietary plug-ins that extend the open-source Eclipse IDE. Our fact extractor is realized as an Eclipse plug-in that extracts information from J2EE Web projects. The extractor's domain model is defined with the Eclipse Modeling Framework (EMF) [5]. EMF generates a Java API, which can be used by clients of the extractor (e.g., analyses implemented as Eclipse plug-ins) to access instances of the domain model. The extracted facts can also be exported as Rigi Standard Format (RSF), which is a popular exchange format in the reverse engineering community [19]. This makes it possible to loosely integrate the fact extractor with external tools.

An important goal for realizing the fact extractor was the attempt to reuse existing functionality as much as possible. Instead of hand-crafting an extractor, which is a tedious and error-prone endeavor, we wanted to leverage functionality that is already provided. Our WSAD-based extractor reuses functionality offered by the following components:

WSAD: WSAD has a parsing framework that can handle J2EE Web applications, which includes diverse sources such as HTML, JSP pages, and JavaBeans. Dependencies between sources are also identified and kept in a link repository.

EMF: The schema of the extractor is defined as an EMF model. From the model, EMF can automatically generate code to create model instances and to persist them as XML files. Furthermore, EMF can generate (rudimentary) Eclipse-based editors to manipulate model instances.

Eclipse: The Eclipse IDE makes it possible to integrate our extractor with WSAD and other Eclipse plug-ins seamlessly.

This approach of reusing existing functionality greatly simplifies the development effort and leads to a more stable,
predictable, and maintainable extractor. This paper is organized as follows. The next section gives more background about fact extraction in the reverse engineering domain in general and the WSRE domain in particular. It also explains why we believe that component-based extractors are desirable. Section 3 briefly introduces WSAD and EMF and explains how our extractor leverages these components. We also discuss the extractor's EMF model for representing J2EE Web applications. Section 4 describes two simple Web site visualizations (an Eclipse-based tabular view and a SHriMP-based graph view), which use the extractor to obtain facts from WSAD Web projects. Section 5 summarizes experiences that we made with our component-based approach to extractor development. Section 6 concludes the paper.

2. Fact Extraction for Reverse Engineering

In this paper, we denote extractors for WSRE purposes as Web site extractors. Web site extractors are used, for instance, to maintain sites, to study their evolution, to recover their architecture, to migrate legacy Web technology, to obtain metrics, to check conformity to Web standards, and to assure accessibility for the disabled.

For traditional reverse engineering, the source base is typically homogeneous, written in a single high-level programming language. Similarly, simple Web sites rely exclusively on HTML as a source language and there are a number of Web site extractors that parse HTML only (e.g., [29] [20]). However, most Web sites now employ a number of technologies besides HTML and hence have a highly heterogeneous source base. Assisting the reverse engineer in the comprehension of such a Web site requires the interrelation of facts from multiple sources. As a consequence, Web site extractors increasingly parse heterogeneous sources and extract relationships between them; this is the case for our extractor. There are other examples of Web site extractors that are multi-language. In order to provide suitable information to reverse engineers that have to understand ASP-based sites, Hassan and Holt extract information from HTML, VBScript, COM source code, and COM binaries [11]. Synytskyy et al. describe how island parsing can be used to effectively parse ASPs that intermingle HTML, Visual Basic, and JavaScript [28].

To construct an extractor, two different approaches can be distinguished:

homegrown: Such an extractor is developed from scratch, possibly using scanner or parser generators, or scripting languages.

component-based: A component-based extractor leverages an existing compiler, IDE, or other suitable tool that processes the sources and provides a suitable interface to access information about the sources.

Many extractors that target programming languages are component-based; a popular approach is to leverage the Gnu Compiler Collection (GCC). This can be explained with the (parsing) complexity of languages such as C++ [2]. In contrast, most Web site extractors are homegrown, implemented with Java [20], Perl [11], Lex/Yacc [29], and TXL [28]. Besides our extractor, we are only aware of two other component-based extractors. The REGoLive extractor (which has also been developed within our research group at the University of Victoria) leverages the Adobe GoLive product [9]. It is written in JavaScript and uses GoLive's API to access the DOM of parsed HTML and XML files. De Lucia et al. use HTML Parser (htmlparser.sourceforge.net) to extract information from HTML pages for Web site clustering [18]. HTML Parser is open source and offers a Java API. Our extractor is written in Java and uses WSAD's interfaces to access relevant information about parsed sources. It is similar to REGoLive in the sense that both leverage functionality provided by commercial-off-the-shelf (COTS) products. The lack of component-based Web site extractors might be explained with the relative immaturity of the WSRE domain and the fact that most extractors focus on the parsing of HTML, which is comparably easy to accomplish.

The decision to follow a homegrown vs. component-based approach to extractor construction has a significant impact on the development effort, structure, maintenance, and testing of the extractor. Generally, leveraging components to build an extractor has similar implications as component-based development of software systems [2...]. We discuss experiences that we made when developing our extractor in Section 5.

3. Our WSAD-Based Fact Extractor

Our extractor is based on WSAD, which can be classified as a COTS product. The extractor accesses WSAD APIs to extract facts from J2EE Web applications. Besides WSAD, it also uses EMF to model the extractor's schema and Eclipse for integration. The following sections discuss the extractor and how it leverages functionalities from WSAD, EMF, and Eclipse in detail.

WSAD and Eclipse. From the developer's point of view, WSAD is an IDE for the development of J2EE applications. This includes J2EE Web applications, which are developed as Web projects. Our Web site extractor uses WSAD 5.1.2. (Footnote 1: More recent versions of WSAD have been re-branded by IBM as Rational Application Developer (RAD) [14]. When we started development on our extractor, WSAD 5.1.2 was the most recent version available.) A Web project consists of a variety of sources: HTML pages, Cascading Style Sheets (CSS) files, graphics files, Java Server Pages (JSPs), Java servlets, JavaBeans, Java code, Web deployment descriptor (web.xml), tag libraries
(.tld), etc. WSAD's project navigator view shows a hierarchical listing of all resources that are part of a project (see top left view in Figure 3).

WSAD distinguishes between dynamic and static Web projects. The latter contain primarily HTML pages. Web projects can be packaged and then published to a server in the form of a Web archive (WAR) file. Existing Web sites can be imported by crawling a URL or by providing a WAR file. WSAD is a full-fledged Web site authoring tool. It has wizards to create new Web sites, an interactive Web page designer for HTML and JSP pages, a Web site designer to specify high-level structure, and specialized editors for CSS files and deployment descriptors. WSAD also has a links view that gives a simple graphical rendering of the relationships in a Web site (see bottom left view in Figure 3). For instance, this view shows hyperlinks between HTML pages and include-relationships of JSPs.

Our extractor compiles diverse information from Web projects. Most Web site extractors focus on the extraction of hyperlinks between static HTML pages. However, J2EE Web projects have additional relationships between diverse resources that are important for reverse engineers to comprehend a dynamic Web site. For example, the construction of a Web page for a given URL can involve complex call sequences that involve lookups of the Web deployment descriptor and execution of JSPs, servlets and Java libraries [10]. WSAD maintains a dedicated data structure, the link repository, that keeps track of (hyperlink) relationships, which our extractor leverages. Examples of relationships are hyperlinks in HTML pages, references to tag libraries in JSP files, and Servlet mappings. For each project resource, WSAD's link repository can be queried for relationships that originate from it. Depending on the actual kind of the queried resource, different parsers for HTML, XML, Java, etc. are invoked transparently that extract the dependencies. The link repository uses an underlying cache to avoid repeated parsing of resources.

Figure 1 gives a Java code snippet that shows how the link repository can be queried. Our extractor iterates over all available resources in the Web project and requests the links for each one (line 4). Thus, the source of a dependency-relationship is always a Web application resource in the current project such as an HTML file, JSP, Java file, or Web deployment descriptor. The extractor then iterates over the returned links, obtaining relevant information to determine the type of link (lines 7-8). For hyperlinks, the extractor checks whether the link is internal (i.e., points to another resource in the project), or external (i.e., it cannot be resolved). Examples of external links are URLs that point to another Web site (using the http protocol), or links that contain email addresses (using the mailto protocol).

     1  IFile file;               // Web project resource (e.g., an HTML file)
     2  ILinkCollector collector; // WSAD link repository
     3
     4  ILinkTag linktags[] = collector.getLinks(file, null);
     5  for( int i = 0 ; i < linktags.length ; i++ ) { // iterate over links
     6    ILinkTag linktag = linktags[i];
     7    IPath linkpath = linktag.getTargetResourceFullPath(); // e.g., another HTML file
     8    String link = ((IGeneralLinkTag) linktag).getAbsoluteLink(); // e.g., a URL
     9    String enclosingTag = linktag.getTagName(); // e.g., <a>...</a>
    10
    11    System.out.print("Link in " + file.toString() +
    12                     " to " + linkpath.toString() );
    13  }

Figure 1. Querying the WSAD link repository

The Eclipse IDE can be run without GUI (i.e., headless) and consequently our extractor can be directly invoked from the command line. However, since WSAD is a GUI-based application it seems desirable to integrate our extractor with the GUI as well. This integration is accomplished by implementing Eclipse extension points [8]. For example, the extractor is registered as a project builder [1] via the org.eclipse.core.resources.builders extension point, which means that it is automatically invoked whenever the underlying fact base changes (e.g., the user modifies and saves a file that is part of the Web project). In order to use the extractor on a certain Web project, the user first has to associate the extractor with the project. This is accomplished by right-clicking a Web project and selecting a certain menu entry. (This menu entry is added by the extractor via implementing the org.eclipse.ui.popupMenus extension point.) Once the extractor has been associated with the Web project, it automatically extracts facts from the project's sources and stores them as an EMF model file (with .webmodel extension). The extractor also generates an RSF file from the EMF model, which is stored in a newly created rsf folder (see Package Explorer view on the left in Figure 4).

EMF. Extractors have to represent facts about the parsed sources in some form and make the facts available to clients. Facts have to adhere to a certain data model (or schema);
[...] EMF is a modeling framework for Eclipse. It supports [...] class diagrams. Standard UML constructs such as classes and associations are used to specify the data model. EMF models can be graphically defined and manipulated with Rational Rose or Omondo's EclipseUML [...]. EclipseUML has the benefit that it is an Eclipse plug-in that seamlessly integrates with EMF. We use EMF 1.1.1.1 and EclipseUML to maintain the extractor's schema.

Figure 2 depicts a screenshot of EclipseUML with the data model of our extractor, which we call the Web model. The representation of the Web model in UML is intuitive. UML classes, associations, and attributes represent EMF [...]

Figure 2. EMF Web model (shown with Omondo EclipseUML)

Our model targets the developer view [16] of J2EE [...] researchers (e.g., [7]). A WebSite in our model (cf. Figure 2) consists of a number of Pages. A Page can be [...] Descriptor. Depending on the kind of Page, it can contain WebObjects. A WebObject can be an Anchor (in HTML files), Form (in HTML files), embedded JavaScript code (in HTML files or JSPs), or Hyperlink. (Our model currently does not restrict what kinds of WebObjects are permissible within which Pages. A more restrictive model would be beneficial to prevent errors when creating model instances.) [...] Hyperlinks are classified
into ExternalHyperlinks and InternalHyperlinks. The latter can be links to HTML files, JSPs, tag libraries, or servlets. For servlets, the extractor identifies the corresponding Java class (using the API provided by the Eclipse Java Development Tools (JDT)).

EMF transforms the graphical UML models into XML-encoded EMF model files (using the .ecore extension). (Our Web model is kept in the file webmodel.ecore.) From ecore files, EMF generates Java code: an EMF class is mapped to a Java interface and a corresponding implementation class; attributes and references are translated to accessor methods; EString and EInt are mapped to java.lang.String and int; and so on. The generated code is "natural and simple for Java programmers to understand" [5]. The generated code for our model is about 13,000 LOC. This code implements the model and provides a factory to conveniently create model instances. The EMF framework provides functionality for persistency of model instances. As a result, EMF has spared us tedious code writing that would have been otherwise required to realize a persistent data model for the extractor.

Model instances (i.e., extracted facts in our case) are also stored in XML files and reference the EMF model that they adhere to. This approach makes it easy for other reverse engineering tools to parse and transform the extractor's output. For instance, one could write an XSLT script that converts the output to the GXL exchange format [13].

EMF can also generate a simple table-based editor that makes it possible to interactively view and edit model instances. The top right view in Figure 4 shows the editor. The editor can be easily customized to provide different images and text for different kinds of model entities. The generated model editor was useful as a browser to inspect and debug the generated facts. We also used the editor to construct small test cases for the RSF export.

4. Web Site Visualization Case Studies

To utilize our extractor, we have developed two sample visualizations. These visualizations are clients that use the extractor's services and can be seen as simple case studies to assess the suitability of our extractor for the implementation of more sophisticated WSRE tools.

The first visualization uses the facts obtained from the extractor in an Eclipse view. The second visualization exports the extracted facts to RSF and uses the SHriMP visualizer plug-in to render a graph-based visualization of the facts.

Both visualizations provide views for the ACSE home page, [...] project with WSAD's built-in Web site crawler. The extracted model has 26 Page instances and 197 Hyperlink instances. Besides the ACSE home page, we have also started to use a larger site, Cassis, that has been developed with WSAD by IBM Toronto CAS [17]. The functionality of this site is only accessible to members and researchers of CAS. The extracted model of this site has 105 JspPages, 12 Images, 141 ClassFiles, 10 JarFiles, 400 InternalHyperlinks, and 26 ExternalHyperlinks. Because of confidentiality concerns we only show screenshots of the ACSE home page.

Figure 3. Broken links in the ACSE home page (shown in the Links-Table View)

Eclipse Links-Table View. The links view provides a list of all the links that are contained in a certain file. The bottom right view in Figure 3 shows a screenshot. The view is implemented as an Eclipse ViewPart that automatically refreshes its contents when the user clicks on a file in the Project Navigator (top left view). It shows the link's type in the first column ("0" means project-relative links), the link's target in the second column (relative to the project root for project-relative links), and whether WSAD's link repository has identified the link as broken or valid (4th column). The view also implements a filter that can suppress links by their type and target. If the user clicks on a link in the view, the corresponding editor is opened and positioned at the link's location. In Figure 3, the links view shows the links for the acre.html page. The first two links in the view are broken and WSAD's HTML editor highlights the location of the first broken link (dashed box around the [...]
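The filtering behavior of the links view described above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the `LinkRow` class, its field names, the `visibleRows` helper, and the type code "external" are hypothetical and are not part of WSAD's API; only the three pieces of information per row (type, target, broken or valid) and the type code "0" come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** One row of a links table (hypothetical model, not a WSAD class). */
final class LinkRow {
    final String type;     // e.g. "0" for a project-relative link
    final String target;   // link target, e.g. a path relative to the project root
    final boolean broken;  // true if the link was reported as broken

    LinkRow(String type, String target, boolean broken) {
        this.type = type;
        this.target = target;
        this.broken = broken;
    }
}

final class LinkRowFilter {
    /**
     * Returns the rows that remain visible after suppressing rows whose type
     * equals hiddenType or whose target contains hiddenTargetPart.
     * A null argument disables the corresponding criterion.
     */
    static List<LinkRow> visibleRows(List<LinkRow> rows, String hiddenType, String hiddenTargetPart) {
        List<LinkRow> visible = new ArrayList<>();
        for (LinkRow row : rows) {
            boolean typeHidden = hiddenType != null && hiddenType.equals(row.type);
            boolean targetHidden = hiddenTargetPart != null && row.target.contains(hiddenTargetPart);
            if (!typeHidden && !targetHidden) {
                visible.add(row);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<LinkRow> rows = new ArrayList<>();
        rows.add(new LinkRow("0", "people/index.html", false));
        rows.add(new LinkRow("0", "images/logo.gif", true));
        rows.add(new LinkRow("external", "http://www.example.org/", false)); // assumed type code
        // Suppress external links and image targets, then print what is left.
        for (LinkRow row : visibleRows(rows, "external", ".gif")) {
            System.out.println(row.type + "\t" + row.target + "\t" + (row.broken ? "broken" : "valid"));
        }
    }
}
```

In an Eclipse view such a predicate would typically be attached to the table viewer rather than applied to a copied list; the list-based form is used here only to keep the sketch free of Eclipse dependencies.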

Figure 4. Link dependencies of the ACSE home page (rendered with SHriMP)

[...] as a graph [6]. SHriMP is an interactive editor that visualizes nested graphs with node containment and animates the graph when the user navigates or opens up nodes. Graphs can be filtered by node and arc types. This kind of visualization is of specific interest to reverse engineers because WSAD does not offer a sophisticated graphical rendering of the dependencies in a Web project.

Figure 4 shows SHriMP's visualization of the ACSE home page. Pages are visualized as nodes; Hyperlinks between Pages are visualized as arcs. The view is filtered to show only HtmlPages and InternalHyperlinks between the HtmlPages. The nodes have nested sub-nodes that contain WebObjects. Unfiltered graphs and graphs that expose many Hyperlinks are often too cluttered to understand. To obtain meaningful views it is necessary to apply filters and to introduce hierarchical decompositions. This is especially the case with the Cassis site that is an order of magnitude more complex than the ACSE site.

SHriMP supports the RSF and GXL exchange formats to import visualization data. The easiest way to integrate our extractor with SHriMP is to export Web model instances [...] and easy to accomplish, it exhibits some drawbacks. Although both the extractor and SHriMP are visually integrated into Eclipse, the tools are weakly coupled. As a result, the user of SHriMP has to manually reload the RSF file whenever it changes.

Discussion and Future Work. Both visualizations provide evidence that our extractor is suitable to serve as a front-end for other reverse engineering tools that want to analyze or visualize Web sites. The extractor produces a Web model instance without perceptible delay. WSAD parses and caches the Web project when the user works with it and as a result our extractor introduces little computational overhead. Whenever the Web project changes, our extractor has to regenerate the whole model instance (and the imported RSF file). In practice this has not caused performance problems yet. Still, we are considering moving to an incremental builder [1] that "patches" up model instances. However, this would complicate the extractor's implementation and future maintenance.

Currently SHriMP obtains the facts from the exported RSF file. Since SHriMP's core architecture consists of a number of JavaBean components [3], it is possible to pro[...] instance. This would also allow SHriMP to load facts incrementally.
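To make the file-based hand-off to SHriMP more concrete, the following sketch writes facts as RSF text. It assumes the common unstructured form of RSF, in which each line holds one whitespace-separated triple (relation, source, target). The `Fact` and `RsfWriter` classes and the relation names `type` and `link` are hypothetical; the paper does not list the relation names that the extractor actually emits.

```java
import java.util.ArrayList;
import java.util.List;

/** A binary fact relation(source, target); hypothetical stand-in for an extracted fact. */
final class Fact {
    final String relation;
    final String source;
    final String target;

    Fact(String relation, String source, String target) {
        this.relation = relation;
        this.source = source;
        this.target = target;
    }
}

final class RsfWriter {
    /** Renders facts as RSF text: one "relation source target" triple per line. */
    static String toRsf(List<Fact> facts) {
        StringBuilder out = new StringBuilder();
        for (Fact f : facts) {
            out.append(token(f.relation)).append(' ')
               .append(token(f.source)).append(' ')
               .append(token(f.target)).append('\n');
        }
        return out.toString();
    }

    /** Replaces whitespace runs with "_" so each field stays a single token (a simplification). */
    private static String token(String s) {
        return s.replaceAll("\\s+", "_");
    }

    public static void main(String[] args) {
        List<Fact> facts = new ArrayList<>();
        facts.add(new Fact("type", "index.html", "HtmlPage"));   // node typing (assumed relation name)
        facts.add(new Fact("type", "people.jsp", "JspPage"));
        facts.add(new Fact("link", "index.html", "people.jsp")); // hyperlink arc (assumed relation name)
        System.out.print(toRsf(facts));
    }
}
```

Because the whole text is produced from the current facts in one pass, this style of export fits the full-regeneration behavior discussed above, at the cost of the manual reload on the SHriMP side.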

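The trade-off between regenerating the whole model instance and "patching" it can be illustrated with a toy fact base. This is a sketch under assumptions, not the extractor's design: `FactBase`, `rebuild`, and `patch` are hypothetical names, and the `project` map stands in for whatever a real extraction pass would return for each resource.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical fact base: maps each project resource to the link targets found in it. */
final class FactBase {
    private final Map<String, List<String>> linksByResource = new TreeMap<>();

    /** Full rebuild: discard all facts and re-extract every resource. */
    void rebuild(Map<String, List<String>> project) {
        linksByResource.clear();
        for (Map.Entry<String, List<String>> e : project.entrySet()) {
            linksByResource.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
    }

    /** Incremental patch: re-extract only the changed resources and drop deleted ones. */
    void patch(Map<String, List<String>> project, Set<String> changed) {
        for (String resource : changed) {
            List<String> links = project.get(resource);
            if (links == null) {
                linksByResource.remove(resource); // resource no longer exists
            } else {
                linksByResource.put(resource, new ArrayList<>(links));
            }
        }
    }

    /** Total number of link facts currently stored. */
    int linkCount() {
        int n = 0;
        for (List<String> links : linksByResource.values()) {
            n += links.size();
        }
        return n;
    }
}
```

Even in this toy form the patch path has to distinguish changed from deleted resources; a real incremental extractor would also have to revisit facts in unchanged resources that point at a changed one, which hints at the extra implementation and maintenance effort mentioned above.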

5. Experiences

In this section we report on the experiences that we made with developing our fact extractor using a component-based approach.

The development of an extractor is a significant investment in terms of time and effort. The source languages that an extractor has to support are typically non-trivial. Consequently, building an extractor from scratch requires a fair amount of expertise in parsing, compiler construction, and schema development. In contrast, reusing the functionality of an existing component or tool can greatly reduce the effort to obtain the necessary facts for reverse engineering. Our extractor is realized in about 2,000 lines of Java code. (Footnote 3: Similarly, Martin has realized a coarse-grained C++ extractor with [...]) Realizing an extractor that collects facts from such diverse sources as needed for J2EE Web applications with comparably little code was achieved by leveraging WSAD's functionalities to the greatest extent possible. The resulting code is quite dense and we believe that it will be comparably easy to maintain.

Once an extractor has been developed, it also needs to be maintained. The evolution of a programming language can result in a steady stream of language enhancements as exemplified by C++ and Java. This poses a problem for developers of research tools because they typically have limited resources to support extractor evolution. For Web site extractors the situation is even worse because new Web technologies are constantly introduced, and existing technologies such as XML-based standards evolve rapidly. Since our extractor relies on a component, it can take advantage of new component versions that support the most recent standards. On the other hand, newer versions of components may need to be obtained and integrated (e.g., because older versions are no longer maintained or are not available for new platforms) even if there is no immediate benefit for the extractor. Furthermore, since the extractor relies on the specific functionality offered by a certain component, there is the danger of vendor lock-in [4].

Component-based extractors often do not have access to the leveraged component's source code. This means that it is often not possible to debug a component and to understand its inner workings. Worse, once bugs are detected in a black-box component, they cannot be immediately fixed. One has to wait for an improved release that fixes the problem or modify the parsed sources to work around the bug [24]. However, even if the source code is available such as for GCC, it may be impractical to make use of it: the source code's "size and complexity (and often poor documentation) may deny the system integrator the advantages usually taken for granted when the source code is available" [27]. Lastly, it can happen that missing functionality of a component is detected relatively late after significant effort has been spent already to utilize the component. For example, Moise and Wong have used Source Navigator (SN) to extract cross-language dependencies between Java and C, but found that certain dependencies could not be recovered because "SN did not provide enough information for a deep static analysis" [22]. WSAD is a commercial product that has matured through several versions and has a large user base. As a result, it is a stable component and we did not encounter bugs in the link repository when we implemented the extractor. WSAD is a black-box component, which means that its source code is not available to the public. However, the first author was able to get access to WSAD's sources during a 3-month stay at the IBM Toronto Lab. Even though all Eclipse plug-ins expose extension points in so-called manifest files, source code access was crucial for debugging our extractor, comprehending the functionality and API of the link repository, and understanding how the link repository gets populated.

The implementation of the extractor was different from our previous experiences with component-based extractor development because we could not rely on an official, documented API. For instance, when developing REGoLive, we could leverage 600 pages of documentation that was provided with the SDK [9]. With WSAD, the functionality that we needed was buried in the sources and spread over several plug-ins. It took us significant time to first locate the code for the link repository and to understand its workings. Since there were few comments in the code, we had to use the Java debugger to understand complex invocation sequences. We also had to experiment with the API (using small test cases) to understand its functionality and meaning. For example, what constitutes a "self link" in the link repository was not evident. Thus, significantly more time was spent understanding code than writing code. This confirms Beck and Gamma, who say "when you learn Eclipse you'll spend much more time reading code than writing code" [8].

Our decision to base the extractor on WSAD has ramifications for its users. In order to use our extractor they first have to obtain and install WSAD, which is a large commercial product. Since our extractor is realized as an Eclipse plug-in, the installation is easily accomplished via unpacking a file in the appropriate directory. To use our extractor, the Web application has to be a WSAD Web project. In the best case, the user is already using WSAD to develop the Web application. If this is not the case, the Web application can be imported (as a WAR file or by crawling it). However, importing the Web application may lose information (e.g., if it has been developed with another tool). While it can be beneficial for users to have WSAD's functionality accessible when working with our extractor, it also represents an [...]

VisualAge in less than 1,000 LOC [19]. extra constraint that may deter them from using it.


6. Conclusions

This paper focuses on one important step in the reverse engineering process, fact extraction. When building an extractor, one can choose between two fundamentally different approaches: a homegrown extractor is (mostly) built from scratch, whereas a component-based extractor is (mostly) constructed by reusing existing functionality from components and COTS products. To our knowledge, there are very few experiences with component-based Web site extractors that leverage commercial Web authoring tools. This paper addresses this gap.

We have described the construction of a component-based extractor for J2EE Web applications that leverages the functionalities offered by WSAD, Eclipse, and EMF. As a result, it was possible to rapidly construct an extractor with little code (2,000 LOC) that obtains facts from highly heterogeneous Web applications, which use technologies such as HTML, JSP, and servlets. We have also implemented two basic Web site visualizations, which use the facts provided by our extractor. These visualizations indicate that the extractor seems sufficiently efficient, stable, and complete to be used by more sophisticated reverse engineering analyses and visualizations.

Our experiences with building a component-based Web site extractor using WSAD have been mostly encouraging. We hope that other researchers will follow this approach to tool building and report on their experiences.

References

[1] J. Arthorne. Project builders and natures. Eclipse Corner Articles, Jan. 2003. http://www.eclipse.org/articles/Article-Builders/builders.html.

[2] I. D. Baxter, C. Pidgeon, and M. Mehlich. DMS: Program transformations for practical scalable software evolution. 26th ACM/IEEE International Conference on Software Engineering (ICSE'04), pages 625-634, May 2004.

[3] C. Best, M. Storey, and J. Michaud. Designing a component-based framework for visualization in software engineering and knowledge engineering. 14th ACM/IEEE International Conference on Software Engineering and Knowledge Engineering (SEKE'02), pages 323-326, July 2002.

[4] W. J. Brown, R. C. Malveau, H. W. McCormick, and T. J. Mowbray. AntiPatterns: Refactoring Software, Architectures, and Projects in Crisis. John Wiley & Sons, 1998.

[5] F. Budinsky, D. Steinberg, E. Merks, R. Ellersick, and T. J. Grose. Eclipse Modeling Framework. Addison-Wesley, 2003.

[6] Chisel Group. Creole. http://www.thechiselgroup.org/?q=creole.

[7] G. A. Di Lucca, M. Di Penta, G. Antoniol, and G. Casazza. An approach for reverse engineering of web-based applications. 8th IEEE Working Conference on Reverse Engineering (WCRE'01), pages 231-240, Oct. 2001.

[8] E. Gamma and K. Beck. Contributing to Eclipse. Addison-Wesley, 2004.

[9] G. Gui, H. M. Kienle, and H. A. Müller. REGoLive: Building a web site comprehension tool by extending GoLive. 7th IEEE International Symposium on Web Site Evolution (WSE'05), pages 46-53, Sept. 2005.

[10] M. Han and C. Hofmeister. Modeling request routing in web applications. 8th IEEE International Symposium on Web Site Evolution (WSE'06), Sept. 2006.

[11] A. E. Hassan and R. C. Holt. Architecture recovery of Web applications. 24th ACM/IEEE International Conference on Software Engineering (ICSE'02), pages 349-359, May 2002.

[12] G. T. Heineman and W. T. Councill. Component-Based Software Engineering: Putting the Pieces Together. Addison-Wesley, 2001.

[13] R. C. Holt, A. Winter, and A. Schürr. GXL: Towards a standard exchange format. 7th IEEE Working Conference on Reverse Engineering (WCRE'00), pages 162-171, Nov. 2000.

[14] IBM. Rational Application Developer for WebSphere Software. http://www.ibm.com/software/awdtools/developer/application/.

[15] D. Jin, J. R. Cordy, and T. R. Dean. Where's the schema? A taxonomy of patterns for software exchange. 10th IEEE International Workshop on Program Comprehension (IWPC'02), pages 65-74, June 2002.

[16] H. M. Kienle, A. Weber, J. Martin, and H. A. Müller. Development and maintenance of a web site for a bachelor program. 5th IEEE International Workshop on Web Site Evolution (WSE'03), pages 20-29, Sept. 2003.

[17] P. Kolari, T. Finin, Y. Yesha, K. Lyons, J. Hawkins, and S. Perelgut. Policy management of enterprise systems: A requirements study. 7th International Workshop on Policies for Distributed Systems and Networks (POLICY'06), pages 231-234, June 2006.

[18] A. D. Lucia, G. Scanniello, and G. Tortora. Using a competitive clustering algorithm to comprehend web applications. 8th IEEE International Symposium on Web Site Evolution (WSE'06), Sept. 2006.

[19] J. Martin. Leveraging IBM VisualAge for C++ for reverse engineering tasks. Conference of the Centre for Advanced Studies on Collaborative Research (CASCON'99), pages 83-95, Nov. 1999.

[20] J. Martin and L. Martin. Web site maintenance with software-engineering tools. 3rd IEEE International Workshop on Web Site Evolution (WSE'01), pages 126-131, Nov. 2001.

[21] J. Michaud, M. Storey, and H. Müller. Integrating information sources for visualizing Java programs. 17th IEEE International Conference on Software Maintenance (ICSM'01), pages 250-258, Nov. 2001.

[22] D. L. Moise and K. Wong. Extracting and representing cross-language dependencies in diverse software systems. 12th IEEE Working Conference on Reverse Engineering (WCRE'05), pages 209-218, Nov. 2005.

[23] H. Müller, J. Jahnke, D. Smith, M. Storey, S. Tilley, and K. Wong. Reverse engineering: A roadmap. Conference on The Future of Software Engineering, pages [...], June 2000.

[24] L. O'Brien. Architecture reconstruction to support a product line effort: Case study. Technical Note CMU/SEI-2001-TN-015, Software Engineering Institute, Carnegie Mellon University. http://www.sei.cmu.edu/pub/documents/01.reports/pdf/01tn015.pdf.

[25] J. Offutt. Quality attributes of Web software applications. IEEE Software, 19(2):25-32, Mar./Apr. 2002.

[26] Omondo. EclipseUML Free Edition. http://www.eclipsedownload.com/download/free/eclipse-3x/index.html.

[27] P. Popov, L. Strigini, A. Kostov, V. Mollov, and D. Selensky. Software fault-tolerance with off-the-shelf SQL servers. In R. Kazman and D. Port, editors, 3rd International Conference on COTS-Based Software Systems (ICCBSS'04), volume 2959 of Lecture Notes in Computer Science, pages 117-126. Springer-Verlag, 2004.

[28] N. Synytskyy, J. R. Cordy, and T. R. Dean. Robust multilingual parsing using island grammars. Conference of the Centre for Advanced Studies on Collaborative Research (CASCON'03), pages 266-278, Oct. 2003.

[29] P. Warren, C. Boldyreff, and M. Munro. The evolution of websites. 7th IEEE International Workshop on Program Comprehension (IWPC'99), pages 178-185, May 1999.
