
Efficient schema extraction from a large collection of XML documents

Guangming Xing
Dept. of Mathematics and Computer Science
Western Kentucky University
Bowling Green, KY 42101
[email protected]

Vijayeandra Parthepan
Dept. of Mathematics and Computer Science
Western Kentucky University
Bowling Green, KY 42101
[email protected]

ABSTRACT
XML is becoming the standard format for data exchange on the Internet. In this paper, we present a system that is effective in extracting schema information from a large collection of XML documents. Based on Xtract, we propose using the cost of an NFA simulation to compute the Minimum Description Length (MDL). We also study using the frequencies of the sample inputs to improve the effectiveness of the schema extraction. Experimental studies were conducted on synthesized XML data sets, suggesting the efficiency and effectiveness of our approach as a solution for schema inference.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Linguistic processing; I.5.2 [Design Methodology]: Feature evaluation and selection

General Terms
Algorithms, Performance, Design, Experimentation

Keywords
eXtensible Markup Language, schema inference, regular expression, nondeterministic finite automata

1. INTRODUCTION
The eXtensible Markup Language (XML) [3] has become the standard format for data exchange on the Internet, providing interoperability between different business applications. Such wide use results in large volumes of heterogeneous XML data, i.e., XML documents conforming to different schemas. XML documents are naturally tree structured, and incorporate both structural information and content [1]. Dealing with structural information is important for XML data storage and management [9], and the presence of a schema can significantly simplify the processing of XML documents. For example, elements from different documents with similar structures can be stored together, resulting in reduced storage and faster query processing. However, processing XML documents based on their schemas may not be feasible in practice, since most XML documents are schema-less.

A few papers have been published to address the problem of schema extraction from XML documents. The problem of inferring regular grammars from examples has been studied in various contexts: theory [7], information extraction [10], and XML management [6]. Xtract [6] is a very effective system, which also allows new improvements to be plugged into the existing implementation. In this paper, we present an effective approach to extract schema information from a large collection of XML documents. The methods proposed in this paper improve the algorithms presented in Xtract [6].

The remainder of this paper is organized as follows. Section 2 introduces the background of this work and formally defines the problem. Section 3 presents the basic components of the original Xtract system and presents our improvements. Section 4 presents the results of our experimental studies, demonstrating the efficiency and effectiveness of our algorithms. The conclusion is given in Section 5.

2. BACKGROUND
2.1. Motivations
Extracting a schema from a collection of XML documents has many real-world applications. Examples follow:

1. XML Data Storage: As explained in [9], schema information can be used to group XML elements, which helps improve the efficiency of storing XML documents in relational databases.


2. XML Document Classification and Clustering: Schema-based classification/clustering of XML documents is important to many web applications [2, 12], and an important prerequisite for using schema-based algorithms is the availability of schemas.


2.2. XML Trees and DTDs
An XML document can be viewed as an ordered labeled tree. Ordered labeled trees are trees whose nodes are labeled with the names of the tags, and the children of each node are ordered. The DTD is the most widely used schema language for specifying the structure of XML documents. A DTD restricts how a document can be constructed by specifying the pattern to which each sequence of sub-elements must conform. A segment of XML data (indented to reflect the tree structure) that represents a catalog of books, and the corresponding DTD, are shown in Figure 1.

Figure 1. Catalog data and corresponding DTD
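In the tree view, what a schema extractor generalizes over is the ordered sequence of child tags under each element. A minimal Java sketch (class and method names are our own, not part of any system described here) that prints these sequences using the standard DOM API:

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

// Prints, for each element, the ordered sequence of its child tag names.
// On the catalog of Figure 1 it prints lines such as:
//   catalog -> book book
//   book -> title author year price
public class TagSequences {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(args[0]);
        walk(doc.getDocumentElement());
    }

    static void walk(Element e) {
        StringBuilder seq = new StringBuilder();
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                seq.append(c.getNodeName()).append(' ');
                walk((Element) c);
            }
        }
        if (seq.length() > 0)
            System.out.println(e.getNodeName() + " -> " + seq.toString().trim());
    }
}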

Inferring regular expressions from a set of strings is an important topic and has been well studied in different contexts: sequence profiling [7], information extraction [10], and Web structure mining [6]. Inferring schema information from XML documents has become important as more and more XML documents are exchanged on the Internet, and several publications address this problem. Among these efforts, Xtract by Garofalakis et al. [6] provides a practical solution. One novel contribution of Xtract is the use of the Minimum Description Length (MDL) principle to rank candidate expressions. We concentrate on improving the efficiency of computing the MDL cost, and we adapt the framework for schema extraction from a large collection of XML documents.

3. SCHEMA EXTRACTION
To make the paper self-contained, we briefly outline the schema extraction procedure of Xtract [6]. There are three major components in Xtract.

Generalization. As a first step, the Xtract system finds patterns in each input sequence and replaces them with appropriate regular expressions to produce more general candidate DTDs. The generalization heuristics are based on the discovery of frequent, neighboring occurrences of subsequences and symbols within each input sequence. Although simple in most cases, the heuristics are effective for real data sets.

Factoring. As a second step, the Xtract system factors common sub-expressions out of the generalized candidate DTDs obtained from the generalization step, in order to make them more concise.

Ranking. The final and most important contribution of Xtract is the introduction of the Minimum Description Length (MDL) principle to rank the candidate expressions. In the original Xtract system, a target function L = Ls + Ld is used to deduce the optimal schema, where Ls denotes the length of the schema and Ld denotes the length of the documents encoded using the corresponding schema.

The catalog data of Figure 1:

<catalog>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>1998</year>
    <price>16.99</price>
  </book>
  <book>
    <title lang="en">A book by many authors</title>
    <author>Author A</author>
    <author>Author A</author>
    <author>Author A</author>
    <author>Author A</author>
    <year>2011</year>
    <price>10</price>
  </book>
</catalog>

Take the string ababab as an example: we may have the following three patterns:

1) ababab 2) (ab)* 3) (a|b)*

It is easy to see that pattern 1) is the longest, but when used for encoding the input string it yields the shortest encoding, as the input is just one occurrence of the pattern. Pattern 3) is the simplest, but when used to encode the string it yields the longest encoding, as there are six steps of choices.
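To make the trade-off concrete, the following toy Java sketch scores the three candidates with simplified stand-ins for Xtract's costs (Ls = pattern length, Ld = number of decisions needed to reproduce the input from the pattern); the numbers are illustrative assumptions, not Xtract's exact encoding:

class MdlToy {
    public static void main(String[] args) {
        // For the input "ababab": {Ls, Ld} per candidate pattern.
        String[] patterns = {"ababab", "(ab)*", "(a|b)*"};
        int[][] costs = {
            {6, 0},   // the input is the pattern itself: nothing left to encode
            {5, 1},   // encode only the repetition count, 3
            {6, 6}    // encode one a-or-b choice per input symbol
        };
        for (int i = 0; i < patterns.length; i++)
            System.out.printf("%-8s L = %d + %d = %d%n",
                patterns[i], costs[i][0], costs[i][1], costs[i][0] + costs[i][1]);
        // MDL favors ababab and (ab)* over the overly general (a|b)*.
    }
}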

The corresponding DTD of Figure 1:

<!ELEMENT catalog (book+)>
<!ELEMENT book (title, author+, year, price)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT price (#PCDATA)>

Although Xtract is very effective at inferring a schema from a single XML document, inferring a schema from a large collection of XML documents is challenging because:

1. XML documents on the Internet tend to include noisy documents, which can significantly degrade the accuracy of the inferred schema (even though there is nothing wrong with the inference algorithms).

2. The procedure to infer the schema is time consuming when the number of documents is large and each individual document is large in size.

3. There are no good criteria for evaluating the validity of the extracted schema.

Based on the above observations, we propose the following three methods to address these problems. The first heuristic we propose is the introduction of a balance factor α in the MDL target function. Whenever a pattern is sought for a collection of samples, the following two factors are important:

Precision: The extracted pattern should precisely describe the sample documents while not covering documents (with different structures) outside of the sample set.


Generality: The extracted pattern should be general enough to cover those documents that share similar structure with those in the sample set.

Instead of using L = Ls + Ld, L = Ls + α·Ld is used for the overall cost minimization. The balance factor α is used to control the ranking of different patterns. Since the same sequence may appear in the input documents multiple times, frequency can be used to give inputs with higher frequency more weight in the overall MDL cost. The optimization of L = Ls + α·Ld can thus be written as

L = Ls + α·Σi wi·Ldi

where wi is the weight of sequence i. This not only improves the accuracy of the inferred schema, but also reduces the time needed for inference.

The second improvement is the introduction of nondeterministic finite automata for computing the MDL cost. Instead of computing the cost of encoding a sequence using a regular expression directly, we use the cost of a nondeterministic finite automaton (NFA) simulation as the cost of encoding. As shown in [6], the cost of encoding a string using a regular expression can be computed recursively. For an input string of length l, it takes O(l·s) time to compute the cost of encoding, where s is the size of the regular expression. When the size of the input document is large, this can be very time consuming. We will show that this can be reduced to O(l) in practice when NFA simulation is used.
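Combining the two ideas, a candidate pattern can be scored as in the sketch below (names are ours, not Xtract's; the supplied cost function could be, for example, the NFA-simulation cost of Figure 4):

import java.util.Map;
import java.util.function.ToDoubleBiFunction;

class WeightedMdl {
    // Scores a candidate pattern by L = Ls + alpha * sum_i(w_i * Ld_i),
    // where the weight w_i of a sequence is its relative frequency.
    static double score(String pattern,
                        Map<String, Integer> seqFreqs,   // input sequence -> frequency
                        double alpha,
                        ToDoubleBiFunction<String, String> cost) {  // (pattern, seq) -> Ld_i
        double ls = pattern.length();                    // schema length Ls
        int total = seqFreqs.values().stream().mapToInt(Integer::intValue).sum();
        double ld = 0;
        for (Map.Entry<String, Integer> e : seqFreqs.entrySet()) {
            double wi = (double) e.getValue() / total;   // weight w_i from frequency
            ld += wi * cost.applyAsDouble(pattern, e.getKey());
        }
        return ls + alpha * ld;                          // rank candidates; smaller is better
    }
}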

An NFA [5] is a 5-tuple A = (Q, Σ, δ, Q0, F), where Q is a finite set of states, Q0 ⊆ Q is the set of starting states, F ⊆ Q is the set of final states, and δ is the transition function defined from Q × (Σ ∪ {ε}) to pow(Q). An NFA can be constructed from a regular expression using Thompson's method, which is illustrated by Figure 2.

Figure 2. Thompson's method for NFA construction
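A compact Java sketch of Thompson's construction (our own illustrative names; states are integers, and every fragment exposes one start and one accepting state):

import java.util.*;

class Thompson {
    static final char EPS = 0;                 // marker for epsilon-transitions
    static int stateCount = 0;
    // transitions.get(q): symbol -> set of successor states
    static final List<Map<Character, Set<Integer>>> transitions = new ArrayList<>();

    static int newState() {
        transitions.add(new HashMap<>());
        return stateCount++;
    }

    static void addEdge(int from, char symbol, int to) {
        transitions.get(from).computeIfAbsent(symbol, k -> new HashSet<>()).add(to);
    }

    // Every fragment is {start, accept}.
    static int[] literal(char c) {
        int s = newState(), f = newState();
        addEdge(s, c, f);
        return new int[]{s, f};
    }

    static int[] concat(int[] a, int[] b) {
        addEdge(a[1], EPS, b[0]);              // glue a's accept to b's start
        return new int[]{a[0], b[1]};
    }

    static int[] union(int[] a, int[] b) {
        int s = newState(), f = newState();
        addEdge(s, EPS, a[0]); addEdge(s, EPS, b[0]);
        addEdge(a[1], EPS, f); addEdge(b[1], EPS, f);
        return new int[]{s, f};
    }

    static int[] star(int[] a) {
        int s = newState(), f = newState();
        addEdge(s, EPS, a[0]); addEdge(s, EPS, f);       // zero occurrences
        addEdge(a[1], EPS, a[0]); addEdge(a[1], EPS, f); // repeat or stop
        return new int[]{s, f};
    }

    public static void main(String[] args) {
        int[] abStar = star(concat(literal('a'), literal('b'))); // pattern 2): (ab)*
        System.out.println("(ab)* built with " + stateCount + " states");
    }
}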

Based on Thonpson’s method, the patterns 1), 2) and 3) can be converted to NFAs as shown in Figure 3.

Figure 3. NFAs by Thompson's method

Given an NFA and an input string, we can determine whether or not the NFA accepts the input by using the NFA simulation algorithm, which is illustrated by Figure 4.

// Computes the cost of the NFA simulation.
// Returns -1 if the input is not accepted by the NFA.
int accept(NFA nfa, String input) {
    int cost = 0;
    boolean[] current = new boolean[nfa.numOfStates()];
    current[nfa.start()] = true;
    for (char literal : input.toCharArray()) {
        current = transit(nfa, current, literal);
        cost += countTrue(current);          // active states = non-determinism at this step
    }
    if (!current[nfa.finalState()])          // input rejected
        cost = -1;
    return cost;
}

boolean[] transit(NFA nfa, boolean[] current, char literal) {
    boolean[] next = new boolean[current.length];
    for (int i = 0; i < current.length; i++)
        if (current[i])
            for (int j : nfa.delta(i, literal))  // all j in delta(i, literal)
                next[j] = true;
    return next;
}

int countTrue(boolean[] states) {
    int n = 0;
    for (boolean b : states) if (b) n++;
    return n;
}

Figure 4. NFA Simulation Algorithm

Although the non-determinism of an NFA is not desirable in many applications, it provides a measure similar to encoding an input string with a regular expression: the more non-determinism in a simulation (the higher the cost), the longer the corresponding encoding using the regular expression. Our experimental studies in the next section also validate this conjecture. It takes O(l·s) time to determine by NFA simulation whether or not an input string is accepted by an NFA, where l is the length of the input string and s is the number of states in the NFA.
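As the next paragraph explains, this cost can be reduced by determinizing lazily: cache each set of simultaneously active NFA states as a DFA state, and build transitions only on first use. A minimal sketch of the idea, assuming an NFA interface in the spirit of Figure 4:

import java.util.*;

class LazyDfa {
    interface Nfa {
        Set<Integer> delta(int state, char literal);
        int start();
    }

    private final Nfa nfa;
    // DFA state (a set of NFA states) -> cached transition row
    private final Map<Set<Integer>, Map<Character, Set<Integer>>> cache = new HashMap<>();

    LazyDfa(Nfa nfa) { this.nfa = nfa; }

    Set<Integer> step(Set<Integer> current, char literal) {
        Map<Character, Set<Integer>> row = cache.computeIfAbsent(current, k -> new HashMap<>());
        Set<Integer> next = row.get(literal);
        if (next == null) {                      // first visit: do the NFA move once
            next = new HashSet<>();
            for (int q : current)
                next.addAll(nfa.delta(q, literal));
            row.put(literal, next);              // reused on all later inputs
        }
        return next;
    }

    Set<Integer> run(String input) {
        Set<Integer> current = Set.of(nfa.start());
        for (char c : input.toCharArray())
            current = step(current, c);
        return current;
    }
}

Once the cache is warm, each input symbol costs a single table lookup, which is why the per-symbol cost becomes effectively constant in practice.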


The performance of NFA simulation can be significantly improved by using a lazy construction of the DFA. Although the size of a DFA can be exponential in the size of the corresponding NFA, in practice the DFA tends to be small. In fact, in many cases the DFA has fewer states than the corresponding NFA, as will be shown in the next section.

The third contribution of this paper is the introduction of edit distance to quantify the quality of the schema. We use the edit distance between an XML document and a schema, denoted δ(d, s), where d is the document and s is the schema [11]. It is defined as the minimum cost of the edit scripts (insert a node, delete a node, or replace a node) that transform the XML document so that it conforms to the schema (DTD). As the edit distance is related to the size of the document, we use the normalized distance

δ̂(d, s) = δ(d, s) / (|d| + |s|)

as a measure of how well document d conforms to schema s. Given a collection of XML documents with similar structures, a DTD can be inferred from training documents drawn from the collection. The quality of the inferred DTD can then be measured by the normalized edit distances between the testing XML documents and the schema.

4. EXPERIMENTAL EVALUATION
In this section, we investigate the efficiency, effectiveness, and scalability of the algorithms presented in this paper on various data sets.

Figure 5. Evaluation framework

We have conducted the following studies to show the effectiveness and efficiency of our algorithms. The evaluation framework used in the experimental studies is shown in Figure 5. The data sets were generated by IBM's XML Generator [13]. The first experiment evaluates the space usage of the lazy DFA construction, to verify that the method is feasible for large documents in real applications. The result is shown in Table 1. Tag names in the documents are represented as integers in our program, and appear as integers in the result table as well. Patterns P4 and P5 are too long to fit in the table.

Pattern                 NFA size   DFA size
22 (21 13)* 22               7          6
(21|22)* 13                 13         10
6 13 22 21* (13|22)*        13          7
P4                          25          9
P5                         298         77

Table 1. Comparison of the number of states of the NFA and the DFA

It is clear that using a DFA in place of an NFA does not introduce significantly more space usage, and using the lazy construction to obtain a DFA for computing the matching cost is feasible for large XML documents. The second group of studies evaluates the improvements in running time. Note that the DFA method only works better when the number of sample inputs and the size of the inputs are large enough. The comparison of the time for schema inference from XML documents of different sizes, using the original regular expression matching and the DFA method, is presented in Table 2. Note that the file sizes are approximate, and the unit for the times in the table is milliseconds.

Size of XML Document   Regular Ex (ms)   DFA (ms)
1M                          1321            974
3M                          2724           1133
9M                          5213           1561
25M                         9676           2760
80M                        17615           4779

Table 2. Comparison of time for schema inference

The above comparison clearly shows that the DFA method reduces the time needed to infer schemas. The third group of studies evaluates the validity of the methods proposed in this paper using edit distance. The experiment is set up as follows. About 500 documents were generated using XML Generator, with file sizes ranging from 50K to 50M. The first 200 documents were used as samples, and we also inserted about 20 documents with different schemas (noise data) into the samples. The remaining 300 documents were used to validate the schema inferred from the samples. Our experiments indicate that the average normalized edit distance using the original Xtract is 20%, and that using our proposed methods reduces the average to 6%.


5. CONCLUSIONS
Inferring a concise and accurate schema from a large collection of XML documents is important for XML data management. Based on Xtract, we proposed using NFA simulation in place of regular expression matching to improve the efficiency of XML schema inference. We have also proposed using a weighted target function to balance the generality and preciseness of the inferred schema, which can be important for noise reduction in XML classification/clustering. The experimental studies on various benchmark data sets suggest that our approach provides some important heuristics to the Xtract framework.

REFERENCES
1. Abiteboul, S. (1997). Querying semi-structured data. In F. N. Afrati and P. G. Kolaitis (Eds.), Proceedings of the ICDT Conference (pp. 1-18). Berlin Heidelberg New York: Springer.
2. Bertino, E., Guerrini, G., & Mesiti, M. (2004). A matching algorithm for measuring the structural similarity between an XML document and a DTD and its applications. Information Systems, 29(1), 23-46.
3. Bray, T., Paoli, J., Sperberg-McQueen, C. M., Maler, E., & Yergeau, F. (2004). Extensible Markup Language (XML) 1.0 (Third Edition). W3C, from http://www.w3.org/TR/2004/REC-xml-20040204/.
4. Canfield, R., & Xing, G. (2005). Approximate matching of XML document with regular hedge grammar. International Journal of Computer Mathematics, 82(10), 1191-1198.
5. Denoyer, L., Gallinari, P., & Vercoustre, A. (2006). Report on the XML Mining Track at INEX 2005 and INEX 2006. In N. Fuhr, M. Lalmas, & A. Trotman (Eds.), INEX 2006, Lecture Notes in Computer Science (4518) (pp. 432-443). Berlin Heidelberg New York: Springer.
6. Garofalakis, M. N., Gionis, A., Rastogi, R., Seshadri, S., & Shim, K. (2000). XTRACT: A system for extracting document type descriptors from XML documents. In W. Chen, J. F. Naughton, and P. A. Bernstein (Eds.), Proceedings of the International Conference on Management of Data (pp. 165-176). New York: ACM.
7. Galassi, U., & Giordana, A. (2005). Learning regular expressions from noisy sequences. In SARA (pp. 92-106).
8. Hopcroft, J. E., Motwani, R., & Ullman, J. D. (2006). Introduction to Automata Theory, Languages, and Computation (3rd Edition). Addison-Wesley.
9. Jagadish, H. V., Al-Khalifa, S., Chapman, A., Lakshmanan, L., Nierman, A., Paparizos, S., Patel, J. M., Srivastava, D., Wiwatwattana, N., Wu, Y., & Yu, C. (2002). Timber: A native XML database. VLDB Journal, 11(4), 274-291.
10. Li, Y., Krishnamurthy, R., Raghavan, S., Vaithyanathan, S., & Jagadish, H. V. (2008). Regular expression learning for information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08) (pp. 21-30). Morristown, NJ, USA: Association for Computational Linguistics.
11. Xing, G. (2006). Fast approximate matching between XML documents and schemata. In X. Zhou, J. Li, H. Shen, M. Kitsuregawa, & Y. Zhang (Eds.), APWeb 2006, Lecture Notes in Computer Science (3841) (pp. 425-436). Berlin Heidelberg New York: Springer.
12. Xing, G., Guo, J., & Xia, Z. (2006). Classifying XML documents based on structure/content similarity. In N. Fuhr, M. Lalmas, & A. Trotman (Eds.), INEX 2006, Lecture Notes in Computer Science (4518) (pp. 444-457). Berlin Heidelberg New York: Springer.
13. XML Generator. IBM alphaWorks. Retrieved from http://www.alphaworks.ibm.com/tech/xmlgenetor.
