
Browsing Semi-structured Texts on the Web using Formal Concept Analysis

Richard Cole, Peter Eklund and Florence Amardeilh
School of Information Technology and Electrical Engineering

The University of Queensland
St. Lucia, Queensland 4072, Australia

rcole@itee.uq.edu.au, peklund@itee.uq.edu.au, florence@dstc.com

Abstract. Browsing unstructured Web-texts using Formal Concept Analysis (FCA) confronts two problems. Firstly, on-line Web-data is sometimes unstructured and any FCA-system must include additional mechanisms to discover the structure of input sources. Secondly, many on-line collections are large and dynamic, so a Web-robot must be used to automatically extract data when it is required. These issues are addressed in this paper, which reports a case-study involving the construction of a Web-based FCA system used for browsing classified advertisements for real-estate properties¹. Real-estate advertisements were chosen because they represent a typical semi-structured information source accessible on the Web. Further, the data is only relevant for a short period of time. Moreover, the analysis of real-estate data is a classic example used in introductory courses on FCA. However, unlike the classic FCA real-estate example, whose input is a structured relational database, we mine Web-based texts for their implicit structure. The issues in mining these texts and their subsequent presentation to the FCA-system are examined in this paper. Our method uses a hand-crafted parser for extracting structured information from real-estate advertisements, which are then browsed via a Web-based front-end employing rudimentary FCA-system features. The user is able to quickly determine the trade-offs between different attributes of real-estate properties and alter the constraints of the search to locate good candidate properties. Interaction with the system is characterized as a mixed initiative process in which the user guides the computer in the satisfaction of constraints. These constraints are not specified a priori, but rather drawn from the data exploration process. Further, the paper shows how the Conceptual Email Manager, a prototype FCA text information retrieval tool, can be adapted to the problem.

1 Information Extraction and the Web — Overview

Since the creation of the DARPA Message Understanding Conferences (MUC) in 1987, Information Extraction (IE) has become an independent new field of research at the crossroads of Natural Language Processing (NLP), Text Mining and Knowledge and Data Discovery (KDD). For this reason the methods and techniques of IE are strongly influenced by developments in these related research topics. Moreover, IE can be useful for any collection of documents from which one would want to extract facts, and the World Wide Web is such a collection.

¹ In Formal Concept Analysis the term property has a special meaning similar to attribute. In this paper property is only used with the meaning of real-estate property, e.g. a house or apartment.

1.1 Definitions - Information Extraction

The objective of IE [13] is to locate and identify specific information from a natural language document. The key element of IE systems is the set of extraction rules, or extraction patterns, that identify the target information according to a scenario. Once an extraction pattern is identified, the IE system reduces extracted information to a more structured form such as a database table. Each record in the table must also have a link back to the original document [26]. As a result, tools for visual representation, fact comparison and automatic pattern analysis play an important role in the resulting presentation of data derived from IE systems.

Through the case study example presented in this paper, "rental accommodation" classifieds, we define the various terms used in the IE field. First, the "rental classified" scenario represents a way to format the target information, e.g. "the location, the renting price, the number of bedrooms and the phone number". Second, each scenario is defined by a list of patterns that describes possible ways to talk about one of its facets, such as the pattern "for rent". Third, each pattern includes a set of extraction rules defining how to retrieve this pattern in the text. Fourth, each rule is composed of several fields, either constants or variables, representing a particular element of the information to extract. For example, concerning the pattern "for rent", we might employ (or learn) a rule such as the following:

for rent <cr>
<variable location> - phone <variable phone_number> <cr>
<constant $><variable amount_rent> <cr>
<variable bedroom_number> Bedrm.
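
To make the rule concrete, it can be approximated with a regular expression. The sketch below (Python) is only an illustration of the idea: the field names and the sample advertisement are invented, and a regular expression stands in for whatever rule formalism an IE system actually uses.

    import re

    # A sketch of one extraction rule for the "for rent" pattern above.
    # Field names (location, phone_number, ...) are illustrative only.
    RULE = re.compile(
        r"for rent\s*\n"                       # constant introducing the scenario
        r"(?P<location>[A-Za-z .']+) - phone (?P<phone_number>\d+)\s*\n"
        r"\$(?P<amount_rent>\d+)\s*\n"         # constant '$' followed by the rent
        r"(?P<bedroom_number>\d+) Bedrm",      # number of bedrooms
        re.IGNORECASE,
    )

    ad = "For rent\nArundel - phone 55948184\n$300\n4 Bedrm, dble garage"
    m = RULE.search(ad)
    if m:
        print(m.groupdict())
        # {'location': 'Arundel', 'phone_number': '55948184',
        #  'amount_rent': '300', 'bedroom_number': '4'}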

1.2 Web documents and Text diversity

Some approaches to information extraction on the Web assume that all Web pages are semi-structured, since they contain HTML-tags; however, Hsu [15] provides a finer-grained categorization of Web documents as follows: structured Web pages provide itemized information and each element can be correctly extracted based on some uniform syntactic clues, such as delimiters or the orders of elements. Semi-structured Web pages may contain missing elements, multiple value elements, permutations and exceptions. Finally, unstructured Web pages require linguistic knowledge to correctly extract elements. It seems therefore that when it comes to extracting information from Web pages, the same sorts of problems and features facing information extraction on natural language documents also apply to the Web domain, namely that IE systems for structured text perform well because the information can be easily extracted using format descriptions.

However, IE systems for unstructured text need several additional processing steps in conjunction with constructing extraction rules. These are typically based upon patterns that involve syntactic relations between words or semantic classes of words. They generally use NLP techniques and cannot be compared to the work of a human expert (although they also provide useful results). Likewise, IE systems for semi-structured text cannot limit themselves to rigid extraction rules, more suited to structured text, but must be able to switch context to apply NLP techniques for free text. Nevertheless, systems for semi-structured texts do use delimiters, such as HTML-tags, in order to construct extraction rules and patterns. Thus, a profitable approach to IE on semi-structured texts is a hybrid of the two.

Moreover, on the Web, information is also highly dynamic. Web-site structure and the presence of hyperlinks are also important facets not present in traditional natural language documents. It may, for instance, be necessary to follow hyperlinks to obtain all the pertinent information from online databases. Web documents are both stylistically different from natural language texts and may be globally distributed over multiple sites and platforms. Hence, the Web IE problem represents a special challenge for the field because of the nature of the medium.

1.3 Architecture and components

The first step of the basic IE process is to extract each relevant element of the text through a local analysis, i.e. the system examines each field to decide if it is a new element to add to the pattern or if it relates to an existing element. Secondly, the system interlinks those elements and produces larger and/or new elements. Finally, only the pertinent elements regarding the patterns are translated into the output format, e.g. the scenario. Moreover, the information to extract can be in any part of the document, and this is the situation with many unstructured texts. In these cases, the elements will be extracted as above and a second process will then be necessary to link all the elements dealing with the same scenario.

This IE process is implemented slightly differently if the system is based either on a knowledge engineering approach combined with natural language processing methods, or on a statistical and automatic training approach. In the first, experts examine sample texts and manually construct the extraction rules with an appropriate level of generality to produce the best performance. This "training" is effective but time consuming. In the second approach, once a training corpus has been annotated, a classifier is run so that the system learns how to analyze new texts. This is faster but requires a sufficient volume of training data to achieve reasonable outcomes [2]. Most IE systems compromise by using rules that are manually created and classifier components that are automatically generated.

To elaborate, IE systems use part or all of the following components. Firstly, Segmentation divides the document into segments, e.g. sentences, and other components such as images and tables. Secondly, Lexical Analysis tags parts of speech, disambiguates words and identifies regular expressions such as names, numbers and dates. This gives some information about the words, their position in the text and/or sentence, their type and sometimes their meaning. Lexical analysis generally uses dictionaries (and/or ontologies). Thirdly, Syntactic Analysis identifies and tags nominal phrases, verbal phrases and other relevant structures as a partial analysis; or alternatively each of the individual elements, i.e. nouns, verbs, determinants, prepositions, conjunctions, etc., as a complete analysis. Fourthly, Information Extraction creates rules to identify pertinent elements, retrieve suitable patterns, and store them according to a predefined format corresponding to the information extraction scenario. This last phase also examines co-reference relations, often implicit, such as the use of pronouns to qualify a person, a company or even an event. It is the only component specific to the domain [1].
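
The division of labour between these components can be pictured as a simple pipeline. The following sketch shows only the control flow under drastically simplified assumptions: each stage is a toy stand-in (syntactic analysis is omitted), whereas a real system would use taggers, dictionaries and hand-written or learned rules.

    import re

    # Illustrative pipeline only: each stage is a drastic simplification of the
    # components described above (segmentation, lexical analysis, extraction).

    def segment(document: str) -> list[str]:
        # Segmentation: split the document into sentence-like segments.
        return [s.strip() for s in re.split(r"[.\n]+", document) if s.strip()]

    def lexical_analysis(segment: str) -> list[tuple[str, str]]:
        # Lexical analysis: tag prices, numbers and plain words.
        tokens = []
        for tok in segment.split():
            if tok.startswith("$"):
                tokens.append((tok, "PRICE"))
            elif tok.isdigit():
                tokens.append((tok, "NUMBER"))
            else:
                tokens.append((tok, "WORD"))
        return tokens

    def extract(tokens: list[tuple[str, str]]) -> dict:
        # Information extraction: keep only elements pertinent to the scenario.
        record = {}
        for i, (tok, tag) in enumerate(tokens):
            if tag == "PRICE":
                record["rent"] = tok
            if tag == "NUMBER" and i + 1 < len(tokens) and tokens[i + 1][0].lower().startswith("bedrm"):
                record["bedrooms"] = tok
        return record

    doc = "FOR RENT Arundel $300. 4 Bedrm dble garage near shops"
    for seg in segment(doc):
        print(extract(lexical_analysis(seg)))
    # {'rent': '$300'} then {'bedrooms': '4'}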

Finally, Lawrence and Giles [20] claim that 80% of the Web is stored in the hidden Web, e.g. pages generated on the fly from some database, using XML/XSL to generate pages based on specific user requests to a database. This implies a special need for tools that can extract information from such pages. Thus, Information Extraction from Web sites is often performed using wrappers. Wrapper generation has evolved independently of the traditional IE field, deploying techniques that are less dependent on grammatical sentences than NLP-based techniques. A wrapper, in the Web environment, converts information implicitly stored as an HTML document into information explicitly stored as a data-structure for further processing. Wrappers can be constructed manually by writing the code to extract information, or automatically by specifying the Web page structure through a grammar and translating it into code. In either case, wrapper creation is somewhat tedious and as Web pages change or new pages appear, new wrappers must be created. Consequently, Web Information Extraction often involves the study of semi-automatic and automatic techniques for wrapper creation. Wrapper induction [19] is a method for automatic wrapper generation using inductive machine-learning techniques. In wrapper induction, the task is to compute, from a set of examples, a generalization that explains the observations as an inductive learning problem.
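
A delimiter-based wrapper can be pictured as a pair of left/right delimiters per field. The sketch below is a hand-written illustration of that idea; the delimiters and the page fragment are invented, and wrapper induction would learn such delimiters from labelled example pages rather than have them coded by hand.

    # A hand-written, delimiter-based wrapper: each field is located by a pair
    # of left/right delimiters. Wrapper induction learns these delimiters from
    # labelled example pages instead of having them written by hand.
    WRAPPER = {
        "suburb": ("<td class=suburb>", "</td>"),   # invented delimiters
        "price":  ("<td class=price>$", "</td>"),
    }

    def apply_wrapper(html: str, wrapper: dict) -> dict:
        record = {}
        for field, (left, right) in wrapper.items():
            start = html.find(left)
            if start == -1:
                continue                      # missing element: leave field out
            start += len(left)
            end = html.find(right, start)
            record[field] = html[start:end]
        return record

    page = "<tr><td class=suburb>Arundel</td><td class=price>$300</td></tr>"
    print(apply_wrapper(page, WRAPPER))       # {'suburb': 'Arundel', 'price': '300'}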

1.4 Related work

The IE field has developed over the last decade due to two factors: firstly, the exponential growth of digital document collections and secondly, the organization of the Message Understanding Conferences (MUC), held from 1987 to 1998 and sponsored by DARPA. The MUC conferences coordinated multiple research groups in order to stimulate research by evaluating various IE systems. Each participating site had six months to build an IE system for a pre-determined domain. The IE systems were evaluated on the same domain and corpus, allowing direct comparison. The results were scored by an official scoring program using the standard information retrieval measures. The MUC conferences demonstrated that fully automatic IE systems can be built with state-of-the-art technology and that, for some selected tasks, their performance is as good as the performance of human experts [27]. Despite these outcomes, building IE systems still requires a substantial investment in time and expertise and remains somewhat of a craft.

Some of the systems developed during the MUC period were applied, or can be applied, to the Web Information Extraction problem. On the one hand, both FASTUS [3, 14] and HASTEN [17] based their approaches on NLP techniques and developed the entire architecture mentioned above. They are operational systems but are still time and resource consuming in their scenario set-up. On the other hand, automatic training systems are based on either unsupervised algorithms, combined with a bottom-up approach when extracting the rules, such as CRYSTAL [21]; or a supervised algorithm along with a top-down approach, such as WHISK [22] and SRV [11]. Interestingly, with respect to this paper, WHISK used real-estate classified ads as its document collection. Finally, another system named PROTEUS [13] used dictionaries along with a set of regular expressions to mine documents in a top-down approach.

Simultaneously with these developments, the wrapper generation community also developed some IE systems using machine-learning algorithms to generate extraction patterns for online information sources. SHOPBOT, WIEN, SOFTMEALY and STALKER belong to a group of systems that generate wrappers for fairly structured Web pages using delimiter-based extraction patterns [9, 19].

To conclude, at the time of writing search engines are not powerful enough for all the tasks associated with IE systems. They return a collection of documents, but they cannot extract relevant information from these documents. Thus, the Web Information Extraction field will continue to be an active area of research. As information systems will need to automate the process as far as possible to cope with the large amount of dynamic data found on the Web, IE systems will keep using machine-learning techniques, rendering them beyond the scope of generalist search indexes. Nevertheless, a combination of different approaches, achieving hybrid and domain specific search indexes, is believed to be a promising direction for IE [22, 18].

Figure 1: The Homes On-line home-page. The site acts as the source of unstructured texts for our experiment.

1.5 The Interaction Paradigm and Learning Context

Mixed initiative [16] is a process from human-computer interaction involving humans and machines sharing tasks best suited to their individual abilities. The computer performs computationally intensive tasks and prompts human clients to intervene when either the machine is unable to make a decision or resource limitations demand intervention. Mixed initiative requires that the client determine trade-offs between different attributes and alter search constraints to locate objects that satisfy an information requirement. This process is well suited to data analysis using an unsupervised symbolic machine learning technique called Formal Concept Analysis (FCA), an approach demonstrated in our previous work [5, 6, 7, 10] and inspired by the work of Carpineto and Romano [4].

This paper reinforces these ideas by re-using the real-estate browsing domain, a tutorial exercise in the introductory FCA literature. The browsing program for real-estate advertisements (RFCA) is more primitive than the Conceptual Email Manager CEM [5, 6, 7, 10], which uses concept lattices to browse Email and other text documents. Unlike CEM, RFCA is a Web-based program, creating a different set of engineering and technical issues in its implementation. However, RFCA is limited, and when the analysis of the rental advertising requires nested-line diagrams (and other more sophisticated FCA-system features) we re-use CEM to show how that program can be re-used to produce nested-line diagrams for the real-estate data imported from the Web. Other related work demonstrates mixed initiative extensions by using concept lattice animation, notably the algorithms used in CERNATO and joint work in the open-source GODA collaboration².

This article is structured as follows. Section 2 describes practical FCA systems and their coupling to relational database management systems (RDBMS). This highlights the necessity of structured input when using FCA and therefore the nature of the structure discovery problem. Section 3 describes the Web-robot used to mine structure from real-estate advertisements. This section details the methods required to extract structured data from unstructured Web-collections and measures their success in terms of precision and recall. Section 4 shows the Web-based interface for browsing structured real-estate advertisements. Section 5 demonstrates how real-estate data can be exported and the CEM program re-used to deploy nested line diagrams and zooming [23].

² See http://toscanaj.sf.net and the framework project http://tockit.sf.net

2 Formal Concept Analysis and RDBMSs

FCA [12] has a long history as a technique for data analysis. Two software tools, TOSCANA [24] and ANACONDA, embody a standard methodology for data-analysis based on FCA. Following this methodology, data is organized as a table in a RDBMS (see Figure 2) and is modeled mathematically as a many-valued context (G, M, W, I), where G is a set of objects, M is a set of attributes, W is a set of attribute values and I is a relation between G, M and W such that if (g, m, w_1) ∈ I and (g, m, w_2) ∈ I then w_1 = w_2. We define the set of values taken by an attribute m ∈ M as W_m = {w ∈ W | ∃ g ∈ G : (g, m, w) ∈ I}. An interpretation of this definition is that in the RDBMS table there is one row for each object, one column for each attribute, and each cell contains at most one attribute value.

Organization over the data is achieved via conceptual scales that map attribute values to new attributes and are represented by a mathematical entity called a formal context. A formal context is a triple (G, M, I) where G is a set of objects, M is a set of attributes and I is a relation between objects and attributes. A conceptual scale is defined for a particular attribute of the many-valued context: if S_m = (G_m, M_m, I_m) is a conceptual scale of m ∈ M then we require W_m ⊆ G_m. The conceptual scale can be used to produce a summary of data in the many-valued context as a derived context. The context derived by S_m = (G_m, M_m, I_m) w.r.t. plain scaling from data stored in the many-valued context (G, M, W, I) is the context (G, M_m, J_m) where for g ∈ G and n ∈ M_m

(g, n) ∈ J_m :⇔ ∃ w ∈ W : (g, m, w) ∈ I and (w, n) ∈ I_m
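
As an illustration of plain scaling, the sketch below derives the context (G, M_m, J_m) from a toy many-valued context and a conceptual scale for a single attribute. The data is a small invented fragment in the spirit of Figure 2, not the paper's actual data.

    # Toy many-valued context: one value per (object, attribute) pair.
    # Objects are property ids; the single attribute scaled here is "views".
    mv_context = {1: "beach", 2: "city", 3: "beach+hills", 4: "city"}

    # Conceptual scale S_views = (G_views, M_views, I_views): scale objects are
    # the attribute values, scale attributes are b (beach), h (hills), c (city).
    scale = {
        "beach":       {"b"},
        "hills":       {"h"},
        "city":        {"c"},
        "beach+hills": {"b", "h"},
    }

    # Derived context: (g, n) in J_m iff the value w of g is related to n in the scale.
    derived = {g: scale[w] for g, w in mv_context.items()}
    print(derived)   # {1: {'b'}, 2: {'c'}, 3: {'b', 'h'}, 4: {'c'}}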

Scales for two or more attributes can be combined together in a derived context. Consider a set of scales S_m, where each m ∈ M gives rise to a different scale. The new attributes supplied by each scale can be combined together using a special type of union:

N := ⋃_{m ∈ M} {m} × M_m

Then the formal context derived from combining all these scales together is (G, N, J) with

(g, (m, n)) ∈ J :⇔ ∃ w ∈ W : (g, m, w) ∈ I and (w, n) ∈ I_m

Figure 2: Example showing the process of generating a derived concept lattice from a many-valued context and a conceptual scale for the attribute Views.

A concept of a formal context (G, M, I) is a pair (A, B) where A ⊆ G, B ⊆ M, A = {g ∈ G | ∀ m ∈ B : (g, m) ∈ I} and B = {m ∈ M | ∀ g ∈ A : (g, m) ∈ I}. For a concept (A, B), A is called the extent and is the set of all objects that have all of the attributes in B. Similarly, B is called the intent and is the set of all attributes possessed in common by all the objects in A. As the number of attributes in B increases, the concept becomes more specific, i.e. a specialization ordering is defined over the concepts of a formal context by:

(A_1, B_1) ≤ (A_2, B_2) :⇔ B_2 ⊆ B_1

More specific concepts have larger intents and are considered "less than" (<) concepts with smaller intents. The same partial ordering is achieved by considering extents, in which case more specific concepts have smaller extents. The partial ordering over concepts is always a lattice and is commonly drawn using a Hasse diagram. Concept lattices can be exploited to achieve an efficient labeling: each attribute has a single maximal concept (w.r.t. the specialization ordering) possessing that attribute. If attribute labels are only attached to their maximal concepts then the intent of a concept can be determined by collecting labels from all greater concepts. A similar situation is achieved for objects. Each object has a minimal concept to which its label is attached, and the extent of a concept can be determined by collecting labels from all lesser concepts. Attribute and object labels are disambiguated by attaching object labels from below and attribute labels from above.
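
The derivation operators and the concepts of a small formal context can be computed directly from these definitions. The brute-force enumeration below is meant only to make the definitions concrete; practical FCA tools use much better algorithms (e.g. next-closure), and the context is the toy data from the previous sketch.

    from itertools import chain, combinations

    # A small formal context (G, M, I): incidence given as object -> attribute set.
    I = {1: {"b"}, 2: {"c"}, 3: {"b", "h"}, 4: {"c"}}
    G, M = set(I), set().union(*I.values())

    def extent(B):   # A = {g in G | for all m in B: (g, m) in I}
        return {g for g in G if B <= I[g]}

    def intent(A):   # B = {m in M | for all g in A: (g, m) in I}
        return set(M) if not A else set.intersection(*(I[g] for g in A))

    # Brute force: a pair (A, B) is a concept iff A = extent(B) and B = intent(A).
    concepts = []
    for B in map(set, chain.from_iterable(combinations(M, r) for r in range(len(M) + 1))):
        A = extent(B)
        if intent(A) == B:
            concepts.append((A, B))

    for A, B in concepts:
        print(sorted(A), sorted(B))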

Consider Figure 2. A RDBMS table contains a list of real-estate properties (objects 1–6), the number of bedrooms and the type of views the properties afford. The many-valued context has two attributes: #Bedrooms and Views. Views is organized by the scale context shown on the top-right of Figure 2. The scale context has all possible combinations of beach, hills and city views as objects and introduces three new attributes: b, h and c. The set of scale objects must contain all the attribute values taken on by objects for the attribute being scaled. The scale is applied to the many-valued context to produce a derived context, giving rise to the derived concept lattice shown in Figure 2 (lower-left). This lattice reveals that there are no objects having both views of the hills and city since the most specific concept (the concept at the bottom of the lattice) has an empty extent. Furthermore, any object in the data set that has a view of the hills (there is only one, object 3) will also have a view of the beach. With large data sets (small numbers of attributes of interest, large number of objects) concept lattices are vastly superior to tables in their ability to communicate such information.

Figure 3: Interface to the Web-robot that "extracts" the advertisements from the Newclassifieds Web-site.

price        low   mid   high
<$150        ×
$150-$200    ×     ×
$200-$250          ×
$250-$300          ×     ×
>$300                    ×

Table 1: Example scale for price. The objects are expressions partitioning the attribute values for price rather than being values themselves.

In practice it is easier to define a scale context by attaching expressions to objects rather than attribute values, as shown in Table 1. The expressions denote a range of attribute values all having the same scale attributes. To represent these expressions in the mathematical description of conceptual scaling we introduce a function called the composition operator for attribute m, α_m : W_m → G_m, where W_m = {w ∈ W | ∃ g ∈ G : (g, m, w) ∈ I}. This maps attribute values to scale objects. The derived context then becomes (G, N, J) with:

(g, (m, n)) ∈ J :⇔ ∃ w ∈ W : (g, m, w) ∈ I and (α_m(w), n) ∈ I_m

The main purpose of this summary of FCA is to reinforce that in practice FCA works with structured object-attribute data in RDBMS form, in conjunction with a collection of conceptual scales. Furthermore, this section describes the mechanism by which FCA is applied to data.
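
Restating Table 1, the composition operator α_price can be implemented as a small function mapping a rent value to the price range (scale object) it falls into; the boundary handling below is one arbitrary choice, since Table 1 leaves it open.

    # Scale context from Table 1: scale objects are price ranges, scale
    # attributes are the linguistic variables low / mid / high.
    price_scale = {
        "<$150":     {"low"},
        "$150-$200": {"low", "mid"},
        "$200-$250": {"mid"},
        "$250-$300": {"mid", "high"},
        ">$300":     {"high"},
    }

    def alpha_price(rent: int) -> str:
        # Composition operator for "price": attribute value -> scale object.
        # Boundary values are assigned to the lower range here (a choice).
        if rent < 150:  return "<$150"
        if rent <= 200: return "$150-$200"
        if rent <= 250: return "$200-$250"
        if rent <= 300: return "$250-$300"
        return ">$300"

    # Derived attributes for a rent value w: {n | (alpha_price(w), n) in I_price}.
    for rent in (140, 250, 320):
        print(rent, price_scale[alpha_price(rent)])
    # 140 {'low'}   250 {'mid'}   320 {'high'}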

3 Web-robot for extracting structured data from unstructured sources

The initial purpose of selecting the real-estate classified domain was that it conforms to a classic introductory example in FCA. However, Newslimited, the owner of copyright on the real-estate advertisements concerned, would not cooperate with our research effort. Our initial intention was to obtain access to their structured classifieds database for student experiments introducing FCA [8]. Undeterred, we built a script and interface to determine the query parameters from the Newslimited Web-site, shown in Figure 3.

Figure 4: Extraction DataFlow System Diagram.

We therefore began with a sequence of real-estate advertisements in an HTML file rather than with the ideal format, an RDBMS export format. The first task is to separate the advertisements from the surrounding HTML mark-up and segment the advertisements into self-contained objects, one for each property. This was done using a string processing algorithm. An example advertisement is shown in Figure 5. The text refers to six different properties, three with a rental price of $250 per week and three with a price of $300. All properties are located in the suburb of Arundel. The format of the advertisements presents three main challenges: (i) the information about properties overlaps, i.e. the single instance of the word Arundel indicates that all six properties are in Arundel, (ii) there are many aliases for the same basic attribute, e.g. double garage and dble garage, (iii) some information is very specific, e.g. l.up garage or near golf course.

An LL(1) parser was constructed using the Metamata Java Compiler Compiler (JavaCC³) to parse advertisements of this type. The parser is able to handle the first two of these challenges with reasonable success. The parser recognizes pre-defined attributes and discards all unrecognized information.
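
Although the real system uses a JavaCC-generated LL(1) parser, the treatment of aliases can be illustrated with a simple lookup from surface forms to canonical attributes, discarding everything unrecognised. The alias table below is a small invented sample, not the parser's actual vocabulary.

    # Invented sample of an alias table: surface forms -> canonical attribute.
    ALIASES = {
        "dble garage":   "double garage",
        "double garage": "double garage",
        "l.up garage":   "lock-up garage",
        "bedrm":         "bedroom",
        "in-grnd pool":  "pool",
        "pool":          "pool",
    }

    def recognise(ad_line: str) -> set[str]:
        # Recognise pre-defined attributes and discard everything unrecognised.
        found = set()
        text = ad_line.lower()
        for surface, canonical in ALIASES.items():
            if surface in text:
                found.add(canonical)
        return found

    print(recognise("4 Bedrm, in-grnd pool, dble garage, near shops and school"))
    # e.g. {'bedroom', 'pool', 'double garage'} (set order may vary)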

The initial segmentation of the advertisements was able to extract 89% of advertisements. The remaining 11% were of low quality and were omitted; they did not include a rental price and were therefore not meaningful. The parser recognized 64 attributes of which 53 were single valued, i.e. true or false. The remaining 11 attributes, including rental price, number of bedrooms and car park type, were multi-valued. To assess the accuracy of the parser, precision and recall were measured for each attribute and then aggregated. A summary of the most common and most important attributes for 53 rental properties is given in Table 2.

³ See http://www.metamata.com/JavaCC/

1 FOR RENT - ARUNDEL - Phone 55948184
2 $300
3 4 Bedrm, in-grnd pool, dble garage, near shops and school
4 3 bedrm, tripple garage, immac. presented, close to transport
5 Exec. 3 Bedrm + study, pool, dble garage, all ammen. close to school
6 $250
7 Leafy 3 bedrm, double garage, avail. Aug.
8 3 bedrm townhouse, resort fac. l.up garage, 2 bathroom and on-suite.
9 Townhouse, 2 bedroom, resort fac. garage, near golf course and transport.

Figure 5: A rental classified advertisement illustrating multiple aliases for attributes (as in abbreviations such as Bedrm = bedroom) and multiple objects (the rental properties described on lines 3, 4, 5, 7, 8 and 9) in a single advert (all lines), clustered on a primary key attribute: in this case the two prices $300 and $250.

            Location   Price   Bedroom   Furnished   Car Park   Other
Frequency   100%       100%    100%      26.4%       50.9%      88.7%
Precision   100%       100%    100%      100%        100%       100%
Recall      94.3%      100%    98.1%     71.4%       96.3%      68.1%

Table 2: Recall and Precision for 53 unseen real-estate adverts.

Here N_A is the set of identified attribute values and N_B the set of correct attribute values. The precision of multi-valued attributes is calculated as the number of correctly identified attribute values (|N_A ∩ N_B|) as a proportion of the number of identified attribute values (|N_A|). The recall is the number of correctly identified attribute values (|N_A ∩ N_B|) as a proportion of the number of correct attribute values (|N_B|).
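
For a single attribute the two measures reduce to set operations over the identified and the correct attribute values; a minimal sketch with invented values:

    def precision_recall(identified: set, correct: set) -> tuple[float, float]:
        # precision = |N_A ∩ N_B| / |N_A|,  recall = |N_A ∩ N_B| / |N_B|
        hit = len(identified & correct)
        return hit / len(identified), hit / len(correct)

    # Example: the parser finds 3 attribute values, 2 of which are correct,
    # while the advert actually contains 4 correct values.
    print(precision_recall({"pool", "double garage", "shops"},
                           {"pool", "double garage", "school", "transport"}))
    # (0.666..., 0.5)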

Averaging the most important attributes (Location, Price, Bedroom, Furnished and Car Park), weighted by their frequency, yields a precision of 100% and a recall of 95%, while the inclusion of the Other attribute reduces the recall to below 70%. All real-estate advertisements leave out some information about the property they advertise, presumably because of the per-word cost of advertising space. As a result we would expect the recall of actual information about the property being advertised to be much lower with respect to the actual property.

One of the strengths of FCA is that it allows the user to compose views of the data that separate objects at different levels of detail. For example the user may have a coarse distinction based on price, but a fine-grained distinction based on proximity to facilities. Table 2 shows poor recall for attributes in the group Other. When combined with the knowledge that the adverts contain only partial descriptions of the data, this places a practical limit on the level of detail that can be usefully explored. This limit could be extended if the initial data source was a database or XML file containing more extensive information about the features of properties for rent.

The LL(1) parser was very fast, building the relational database and storing the multi-valued context in under 8 seconds on a Pentium-III 300 MHz for an entire week's worth of adverts, approximately 3,400 properties listed in the local newspaper.

4 RFCA - the Web-based FCA Interface

The Web-based user interface presents a Web page with a scale selector as shown in Figure 6. The client selects a suitable scale to browse through the newsclassified advertisements. The scales are pre-defined. The newsclassifieds are now in a structured database form after the parsing described in the previous section. When the user selects a scale, a new Web page is loaded containing the scale image. This image now contains all the resulting extent numbers from the scale's interrogation of the database. The number of objects in each extent is displayed over the corresponding vertex in the usual way. The same scale selector is also available on the Web page displaying the scale image. This allows the user to select a new scale without having to go back to the previous page. In other words, the same scale selection is present on each of the pages displaying a selected scale.

A process that reproduces the Web page dynamically with different scales and extent numbers was implemented. This program creates the scale images after each selection by the user. A database connection and support for reading the scale files from the server are also provided.

The Web pages with extent numbers do not exist as files but are generated on demand. When a scale is selected, the script calls the graph drawing program as a system command with the new scale name as parameter. This drawing program draws a concept lattice corresponding to the context. The result is stored as a PNG file representing the scale image and an image map representing the coordinates for the vertices in the graph. The image map also contains SQL queries extracted from the current context file. Queries in the image map corresponding to vertices in the PNG file are used to interrogate the database. After executing the graph drawing program the script starts to build the client-side image map. All the vertex coordinates are read in sequence from the image map and transformed to "hot" regions in the clickable image. Each hot region is linked to a CGI script with the SQL queries also read from the image map. When the user clicks a vertex in the scale image the browser loads another database extraction script which produces a new Web page displaying the selected data. Such a Web page with a scale presenting classified data is displayed in Figure 6 and Figure 7.
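
The coupling between the lattice image and the database can be sketched as the generation of an HTML client-side image map in which each vertex becomes a hot region whose link carries that vertex's SQL query to a data-extraction script. The vertex coordinates, query strings and script name below are invented for illustration; in the system described here this information comes from the image map emitted by the graph drawing program.

    from urllib.parse import quote

    # Invented vertex data: (x, y, radius) circle coordinates and the SQL query
    # attached to that vertex of the scale.
    vertices = [
        ((120, 40, 12),  "SELECT * FROM adverts WHERE price < 150"),
        ((200, 110, 12), "SELECT * FROM adverts WHERE price BETWEEN 150 AND 200"),
    ]

    def image_map(name: str, cgi_script: str) -> str:
        # Build a client-side image map: one hot region per lattice vertex,
        # each linked to the extraction script with its SQL query as parameter.
        areas = []
        for (x, y, r), sql in vertices:
            href = f"{cgi_script}?query={quote(sql)}"
            areas.append(f'  <area shape="circle" coords="{x},{y},{r}" href="{href}">')
        return f'<map name="{name}">\n' + "\n".join(areas) + "\n</map>"

    print(image_map("scale", "/cgi-bin/extract.pl"))   # script name is hypothetical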

Results must be displayed in the form of a table with the data extracted from the structured database. The Win32::ODBC module provided a secure way to establish a connection between the data extraction script and ODBC under Windows NT. An HTML table is built using the adverts received as rows from the database. A Web page with the resulting HTML table of adverts is shown in Figure 8. All attributes are listed for each advert, with boolean attributes rendered as a check-mark image and abbreviated attributes replaced with full descriptions. Background colors for each advert row are alternated so the user can follow an advert when scrolling sideways.

Sometimes the original advert contains attributes not included in the database that can be of additional interest. The first column in the table is a running number that uniquely identifies the advert. This number is inserted when parsing the adverts. The table also contains a column named Id. Id contains the number of the section from which the advert was parsed. So, if the advert was originally located in the third section in the free text of the rental classifieds file, the column has the value 3. Using this number, we can create a link from the database adverts to the originals in the downloaded text file. An example of a resulting Web page displaying the original adverts is shown in Figure 9.

Figure 6: The RENTAL-FCA prototype: Scales are pre-defined and selected from the "Context Name" menu (top-left). This figure shows the conceptual scales for various geographic regions (top) and price and resource scales (lower).

5 Reusing CEM for Nesting and Zooming

Nesting and zooming [24, 23] are two well established techniques used in FCA. Together these techniques allow a user to wander around in a conceptual landscape [25] attempting to find concepts that satisfy their constraints. When searching for a real-estate property, there will obviously be compromises between location, price and other factors. By using concept lattices to show how constraints can be satisfied, users are able to adapt their search to areas more likely to bear fruit. We re-used the CEM program to reinforce these ideas, although the same approach could be implemented with some effort in the RENTAL-FCA prototype.

Figure 7: This figure shows the conceptual scales for car parking and fixtures (top) and facilities and views (lower).

This contrasts with current on-line real-estate systems, which ask the user to provide a specification of the type of property they are interested in and then (in most cases) provide either an empty list or a very long list of candidate properties. Using nesting and zooming in FCA allows questions like, "What are the possibilities for a mid-range house close to the city with a view, maybe close to a park, shops or transport?" as opposed to a question like, "List all mid-range houses that are close to the city, have a view, are close to a park, close to shops, and close to transport."

Consider a person who is new to a city and looking for accommodation. A good place to start is a decision about price. Figure 10 shows a conceptual scale defined for price. The scale shows that most properties are either mid-range or expensive and that roughly three-fifths of each of the mid-range and cheap houses are in the intersection of mid-range and cheap. Consider that without more information the user is uncertain of what price range they are interested in. They decide to add more information to the lattice by combining it with a scale specifying whether or not a property is furnished.

Figure 8: From the scales view shown in Figure 6 the user can navigate to the objects, which are displayed in the structured extracted form as a database table.

Figure 9: By navigating via the Id in Figure 8 the user can recover the original unstructured text. This text can be dynamically generated by a query against the Newclassified Web-site if copyright is a concern.

Figure 10 combines the price scale with a scale for furnished, using a nested line-diagram. The rules for reading a nested line diagram are similar to reading a normal lattice. Thick lines connect ovals containing small lattices. The small grey circles show a location for a potential concept which is not instantiated by the data. The first thing to notice about the diagram is the large number of times that the middle concept of the inner diagram (the small lattices) is grey. The grey vertex indicates there are no mid-range or expensive partially furnished properties, so the user needn't spend time looking for such a feature. Furthermore, looking at the small lattice inside the top oval we see that most properties are unfurnished: 104 furnished as compared with 752 properties unfurnished.

The user may have an interest in investigating fully-furnished mid-range properties and is able to zoom into this concept by selecting it and selecting the zoom operation. He/she could have been more specific and selected a property in the intersection of mid-range, cheap and fully-furnished, since there is such a concept, but for now consider that he/she selected mid-range and fully-furnished. The zoom operation restricts the objects shown in the lattice to only those in the extent of the selected concept.
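
Zooming itself is a simple operation on the data: the object set is restricted to the extent of the selected concept before further scales are applied. A minimal sketch, reusing the toy object-to-attributes representation from the earlier examples:

    # Toy derived context: object -> set of scale attributes possessed.
    derived = {
        1: {"mid-range", "fully-furnished"},
        2: {"mid-range"},
        3: {"expensive", "fully-furnished"},
        4: {"mid-range", "fully-furnished", "close-shops"},
    }

    def zoom(context: dict, selected_intent: set) -> dict:
        # Restrict the objects to the extent of the selected concept, i.e. to
        # the objects possessing every attribute in the selected intent.
        return {g: attrs for g, attrs in context.items() if selected_intent <= attrs}

    print(zoom(derived, {"mid-range", "fully-furnished"}))
    # only objects 1 and 4 remain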

Figure 10: (Left) The derived concept lattice showing how the properties are distributed with respect to three linguistic variables (scale attributes): cheap, mid-range and expensive. (Right) A combination of the scales for price and furniture using a nested line-diagram.

Figure 11: A concept lattice showing access to resources such as water, shops, sports etc. The set of objects has been restricted to fully-furnished, mid-range properties, evident from the zooming history in the lower left-hand corner.

Figure 11 shows a scale that has been zoomed. A small panel in the lower left-hand corner shows the zooming history. The two arrow buttons in the tool-bar allow moving backwards and forwards with respect to the zooming operation, in a manner similar to forwards and backwards in Web browsers. The concept lattice now contains only 69 real-estate properties since the zooming operation has restricted the object set to the extent of the concept for fully-furnished and mid-range. The conceptual structure in this lattice is different from the general picture without zooming. In the 69 properties shown, proximity to shops implies proximity to water (Close Water), and it is impossible to satisfy a desire to be close to University and close to shops in this restricted set of properties.

The user is now able to make a decision between different criteria, perhaps zooming further into the concept labeled Close Water, or alternatively retrieving all four properties that are close to shops. Similarly, the user is free to go back and make different zooming choices, or to include another scale, adding still more criteria to the current scale.

6 Conclusion

The paper demonstrates how FCA can be used to search for rental properties on the Web even when the structure of the source data is unknown or unavailable. We believe the same technique is of use in browsing other unstructured legacy data on the Web.

A number of problems remain to be solved. The current browsing system is implemented as a stand-alone application and can only browse real-estate adverts with pre-defined scales. In order to be widely available it would have to be extended to a distributed framework. A good candidate would be a Java Applet implementation of the graphical user interface communicating with a server. A Web-based FCA implementation of this sort is presently being engineered as part of the GODA collaboration.

Another difficulty is that many users are unfamiliar or uncomfortable with concept lattice diagrams and require a form-based interface. In this way, the process and interpretation of the diagrams can be taught to the user while using the tool for the first time. The advantage of FCA, even without the concept lattice, is that feedback can be given on the volume of data satisfying search constraints.

The system we implemented obtained its data by parsing small textual descriptions of objects. The increasing use of the Internet is encouraging the storage of more structured information, and thus in the future we expect the difficult task of constructing one-off IE parsers to suit specific textual descriptions will disappear as data is directly entered with structure. In other words, browsing XML data using FCA on the Web is significantly simpler than what has been described here, although the techniques for mining structure from unstructured textual sources will be of value in various intelligence applications.

In addition to the interface described in this paper, a prototype Web-based interface allowing the construction of derived concept lattices and retrieval of concept extents is available at: http://www.kvocentral.com/software/rentalfca.html

Acknowledgment

The GODA project is supported by the Australian Research Council (ARC) and the DFG. This research also benefits from the support of the Distributed Systems Technology Research Centre (DSTC Pty Ltd), which operates as part of the Australian Government's CRC program. The authors acknowledge the input of Age Strand and Peter Becker.

References

[1] Chai, J.Y., Learning and Generalisation in the Creation of Information Extraction Systems, PhD Thesis, Department of Computer Science, Duke University, 1998.

[2] Appelt, D.E. and Israel, D.J., Introduction to Information Extraction Technology, Tutorial for IJCAI-99, Stockholm, August 1999.

[3] Appelt, D. et al., SRI International: Description of the FASTUS System used for MUC-6, In Proceedings of the Sixth Message Understanding Conference (MUC-6), pp. 237-248, November 1995.

[4] Carpineto, C. and Romano, G., A Lattice Conceptual Clustering System and its Application to Browsing Retrieval, Machine Learning, 24, pp. 95-122, 1996.

[5] Cole, R. and P. Eklund, Analyzing an Email Collection using Formal Concept Analysis, European Conf. on Knowledge and Data Discovery, PKDD'99, pp. 309-315, LNAI 1704, Springer Verlag, 1999.

[6] Cole, R. and G. Stumme, CEM: A Conceptual Email Manager, Proceedings of the 8th International Conf. on Conceptual Structures, ICCS'00, LNAI 1867, pp. 438-452, Springer Verlag, 2000.

[7] Cole, R., P. Eklund and G. Stumme, CEM — A Program for Visualization and Discovery in Email, In D.A. Zighed, J. Komorowski, J. Zytkow (Eds), Proc. of PKDD'00, LNAI 1910, pp. 367-374, Springer Verlag, 2000.

[8] Cole, R. and P. Eklund, Browsing Semi-Structured Web Texts using Formal Concept Analysis, 9th International Conference on Conceptual Structures, ICCS'2001, LNAI 2120, pp. 319-332, Springer Verlag, August 2001.

[9] Eikvil, L., Information Extraction from the World Wide Web, Norwegian Computing Center, Oslo, pp. 12-22, July 1999.

[10] Eklund, P. and R. Cole, Structured Ontology and Information Retrieval for Email Search and Discovery, In M. Hacid, Z. Ras, D.A. Zighed, Y. Kodratoff (Eds), Foundations of Intelligent Systems, 13th International Symposium ISMIS 2002, LNAI 2366, pp. 75-84, Springer Verlag, 2002.

[11] Freitag, D., Information Extraction from HTML: Application of a General Machine Learning Approach, Carnegie Mellon University, 1998.

[12] Ganter, B. and R. Wille, Formal Concept Analysis: Mathematical Foundations, Springer Verlag, 1999.

[13] Grishman, R., Information Extraction: Techniques and Challenges, New York University, 18 pp., 1997.

[14] Hobbs, J.R. et al., FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text, In Proceedings of the DARPA Workshop on Human Language Technology, pp. 25-35, 1993.

[15] Hsu, C.H. and Dung, M.T., Generating Finite-State Transducers for Semi-structured Data Extraction from the Web, Information Systems, 23(8), pp. 521-538, 1998.

[16] Horvitz, E., Uncertainty, Action and Interaction: In Pursuit of Mixed-Initiative Computing, IEEE Intelligent Systems, pp. 17-20, September 1999. http://research.microsoft.com/~horvitz/mixedinit.HTM

[17] Krupka, G., Description of the SRA System as used for MUC-6, In Proceedings of the Sixth Message Understanding Conference (MUC-6), pp. 221-235, 1995.

[18] Kushmerick, N., Gleaning the Web, IEEE Intelligent Systems, 14(2), March/April 1999.

[19] Kushmerick, N., Weld, D.S. and Doorenbos, R., Wrapper Induction for Information Extraction, In International Joint Conference on Artificial Intelligence (IJCAI-97), 7 pp., 1997.

[20] Lawrence, S. and Giles, C.L., Searching the World Wide Web, Science, 280, pp. 98-100, April 1998.

[21] Soderland, S., Learning to Extract Text-based Information from the World Wide Web, In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97), 4 pp., 1997.

[22] Soderland, S., Learning Information Extraction Rules for Semi-structured and Free Text, Machine Learning, University of Washington, 44 pp., 1999.

[23] Vogt, F., C. Wachter and R. Wille, Data Analysis based on a Conceptual File, Classification, Data Analysis and Knowledge Organization, Hans-Hermann Bock, W. Lenski and P. Ihm (Eds), pp. 131-140, Springer Verlag, Berlin, 1991.

[24] Vogt, F. and R. Wille, TOSCANA: A Graphical Tool for Analyzing and Exploring Data, In R. Tamassia, I.G. Tollis (Eds), Graph Drawing '94, LNCS 894, pp. 226-233, 1995.

[25] Wille, R., Conceptual Landscapes of Knowledge: A Pragmatic Paradigm for Knowledge Processing, In W. Gaul, H. Locarek-Junge (Eds), Classification in the Information Age, Springer, Heidelberg, 1999.

[26] Yangarber, R., Grishman, R., Tapanainen, P. and Huttunen, S., Unsupervised Discovery of Scenario-Level Patterns for Information Extraction, New York University & Helsinki University, 8 pp., 2000.

[27] Zechner, K., A Literature Survey on Information Extraction and Text Summarization, Term paper, Carnegie Mellon University, 1997.
