[IEEE 2010 International Conference on Artificial Intelligence and Computational Intelligence (AICI) - Sanya, China (2010.10.23-2010.10.24)] 978-0-7695-4225-6/10 © 2010 IEEE. DOI 10.1109/AICI.2010.123


XML Structure Extraction from Plain Texts with Hidden Markov Model

PIAO Yong, EI / Software School, Dalian University of Technology, Dalian, China, [email protected]
ZOU Sha-sha, Software School, Dalian University of Technology, Dalian, China, [email protected]
WANG Xiu-Kun, EI School, Dalian University of Technology, Dalian, China, [email protected]

Abstract—Information extraction is one of the ways to convert unstructured text into structured records. Most previous work in this field is devoted to adding semantic tags to specific textual content, so the resulting structures are flat and cannot illustrate the relationships among semantic features. A novel approach for extracting structure from plain texts, the Structure Information Extraction System based on Hidden Markov Model (SIEHMM), is proposed in this paper; it utilizes path information for HMM training and automatically generates XML. Experiments on a real-life dataset show that SIEHMM achieves high precision and recall, and can not only help solve the problems of structured storage and text information retrieval, but also exploit the advantages of XML to meet future trends.

Keywords—Structure Information Extraction; Hidden Markov Model; XML

I. INTRODUCTION

Facing the tremendous amount of text data generated in the information explosion, the problem of how to obtain useful information more quickly and accurately has become one of the focuses of researchers. In addition, how to transform that information into structured forms that facilitate computer storage and processing has attracted wide concern. At the same time, XML (eXtensible Markup Language) [1], as a kind of semi-structured data model, is becoming the de facto standard for Web data representation and exchange because of its self-description, platform independence, scalability and ease of use. More and more information processing systems use XML documents as the carrier of information storage, exchange and release. Hence, extracting structure information from unstructured text files and converting them into XML documents with different structures can not only integrate information from various domains, but also provide a uniform platform for improving the efficiency of information retrieval.

Information Extraction (IE) is an effective way of converting information in unstructured or semi-structured formats into structured ones [2]. At present, information extraction methods can be roughly categorized into three classes: dictionary-based methods, rule-based methods [3,4] and statistics-based methods [5-7]. Compared with other statistics-based techniques, Hidden Markov Models have a strong statistical foundation for processing natural language documents without massive dictionaries or rule sets, offer good portability, and are more robust, and thus have been widely applied. Papers [5-7] employ HMMs to successfully extract specific theme information from computer bibliographies, literature references and HTML respectively, achieving acceptable results.

In paper [8], an information extraction method based on clustering HMMs is put forward for training data from different resources. In paper [9], the maximum entropy principle is introduced into HMMs to solve problems of knowledge representation, allowing new language knowledge to be added to the models at any time. However, previously proposed HMM-based algorithms are not flexible enough to fulfill the requirements of text processing in a variety of domains. Additionally, these algorithms are all devoted to recognizing and annotating partial text with semantic tags: the extraction results can only present the correspondence between semantic tags and text content, and cannot provide sufficient structure information for generating XML documents. Therefore, these existing HMM models can only derive a flat, unstructured format, losing heterogeneity information among text files.

In this paper, a Structure Information Extraction system based on HMM (SIEHMM) is proposed to extract sufficient structure information from text files and construct XML documents. It treats an XML document as a DOM tree and chooses the paths from the root node to the leaf nodes, instead of simple semantic tags, as the states of the HMM. Furthermore, the values of the leaf nodes, which are the text content of the XML documents, are treated as observation sequences. Accordingly, when the given text sequences are input, the output state sequences are no longer composed of semantic tags, but of paths, each corresponding to a word or sentence in an XML document. Multiple HMMs can be trained against XML documents described by various XML DTD files, using a universal emission probability matrix. Experiments show that SIEHMM achieves high precision and recall, and can not only solve the problem of text heterogeneity but also readily present text information in XML format.

Section 2 briefly introduces HMM theory and the basic idea of the SIEHMM algorithm; Section 3 describes the main steps of SIEHMM in detail; Section 4 presents experiments and results demonstrating the performance of SIEHMM. Finally, we summarize the whole work and outline future applications and research directions.

II. STRUCTURE EXTRACTION BASED ON HMM

A. Hidden Markov Models

An HMM is a statistical model that contains two stochastic processes. One is a Markov chain, which is not directly observed and is described by the state



transition matrix; the other is the observation sequence, defined by the emission probability matrix. An HMM can be represented by a 5-tuple (S, O, A, B, Π) [10]:

• State set S = {s_1, s_2, …, s_N}
• Observation alphabet O = {o_1, o_2, …, o_M}
• A is an N×N state transition matrix, where a_{ij} is the probability of making a transition from state s_i to state s_j: A = [a_{ij}], a_{ij} = P(q_{t+1} = s_j | q_t = s_i), 1 ≤ i, j ≤ N
• B is an N×M emission probability matrix, where b_j(o_k) is the probability of observation o_k being produced from state s_j: B = [b_j(o_k)], b_j(o_k) = P(x_t = o_k | q_t = s_j), 1 ≤ k ≤ M, 1 ≤ j ≤ N
• Π is the initial probability distribution, where π_i is the probability for s_i to be the start state: Π = [π_i], π_i = P(q_1 = s_i), 1 ≤ i ≤ N

An HMM can solve three basic problems:

1) Estimation problem: Given an HMM λ = (A, B, Π) and an observation sequence O = {o_1, o_2, …, o_T}, how to efficiently estimate P(O|λ), the probability of the observation sequence.

2) Decoding problem: Given an HMM λ = (A, B, Π) and an observation sequence O = {o_1, o_2, …, o_T}, how to choose a corresponding state sequence Q = {q_1, q_2, …, q_T} that is optimal for generating the observation sequence.

3) Learning problem: How to obtain the parameters of an HMM from a group of observation sequences so as to maximize P(O|λ).

In this paper, we mainly solve the last two problems for XML structure extraction.
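As a purely illustrative aside (not from the paper), the 5-tuple can be held in a small container; in SIEHMM the states are root-to-leaf paths rather than plain tags. All names below are our assumptions:

from dataclasses import dataclass

@dataclass
class HMM:
    """Container for the 5-tuple (S, O, A, B, Pi) defined above."""
    states: list   # S: in SIEHMM, root-to-leaf paths such as "Contact/Name/Firstname"
    vocab: list    # O: observation alphabet (the words)
    trans: dict    # A: {(s_i, s_j): P(q_{t+1} = s_j | q_t = s_i)}
    emit: dict     # B: {(o_k, s_j): P(x_t = o_k | q_t = s_j)}
    init: dict     # Pi: {s_i: P(q_1 = s_i)}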

B. Basic idea of SIEHMM

In this section, the algorithm SIEHMM, used to extract structural information for generating XML documents, is described. Take a simple piece of textual address information for instance:

Sara Lee 105 ZhongShan Road LiaoNing DaLian 116600

Traditional IE methods mostly aim at extracting the semantic content and assigning the input text to the corresponding semantic tags. A DOM tree representing the XML document generated this way contains nodes in only one layer, as shown in Figure 1.

Figure 1. The result sample of traditional information extraction

In this case, we select XML documents belonging to a certain field as training data; they may conform to different DTDs. As illustrated in Figure 2, in the first case the text and its corresponding path information are parsed from XML documents with the same DTD by an XML parser, and serve as the input for training the proposed SIEHMM. Next, the initial probability distribution and the state transition matrix are estimated using the Maximum Likelihood algorithm, while a universal emission probability matrix is obtained from the whole training data. If the training XML documents are from different DTDs, multiple HMM models can be trained. When the text sequences from a given text file are input, the Viterbi algorithm is used to find the hidden state sequences most likely to have produced them, carrying not only the sequence of semantic tags but also the paths from root node to leaf node in a DOM tree. Finally, it is a trivial step to integrate the resulting path-word pairs into an XML document described by the corresponding DTD, through an XML Generator.

Figure 2. The XML generation process of SIEHMM

III. ALGORITHM IMPLEMENTATION

Data preprocessing is first performed to extract the word sequences and their corresponding path information by parsing the XML documents belonging to a given field. The extraction result consists of a sequence of word-path pairs, such as Sara: <Contact>→<Name>→<Firstname>. A sketch of this preprocessing is given below, followed by the corresponding learning, decoding and XML generation steps.
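The paper gives no code for this step; the sketch below is one plausible reading of the preprocessing, assuming Python's xml.etree for parsing and '/' as the path separator (the paper writes paths with '→'). The function name and file name are illustrative:

import xml.etree.ElementTree as ET

def word_path_pairs(xml_file):
    """Yield (word, root-to-leaf path) pairs, e.g. ('Sara', 'Contact/Name/Firstname')."""
    tree = ET.parse(xml_file)

    def walk(node, prefix):
        path = prefix + [node.tag]
        children = list(node)
        if not children:                       # leaf node: its text is the content
            for word in (node.text or "").split():
                yield word, "/".join(path)
        for child in children:
            yield from walk(child, path)

    yield from walk(tree.getroot(), [])

# Example: pairs = list(word_path_pairs("contact.xml"))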

Learning. The aim of learning is to obtain the HMM parameters from the labeled training data. The state set comprises all distinct states appearing in the training data. In this paper, the sequence of words from each XML document and the corresponding path sequence are considered, respectively, an observation sequence O = {o_1, o_2, …, o_M} and a state transition sequence Q = {path_1, path_2, …, path_M}. With the training data from each DTD file, an HMM can be established using the Maximum Likelihood algorithm to obtain the initial probability distribution, the state transition matrix, and the emission probability matrix. They are calculated by the following three formulas.

For the initial probability distribution:

\pi_i = P(q_1 = s_i) = \frac{\mathrm{Init}(q_1 = s_i)}{\sum_{j=1}^{N} \mathrm{Init}(q_1 = s_j)}, \quad 1 \le i \le N \qquad (1)

where Init(q_1 = s_i) is the number of observation sequences that start from state s_i.

For the transition matrix we use:

a_{ij} = P(s_j \mid s_i) = \frac{\mathrm{Count}(s_i, s_j)}{\sum_{k=1}^{N} \mathrm{Count}(s_i, s_k)}, \quad 1 \le i, j \le N \qquad (2)


where Count(s_i, s_j) is the number of times s_j followed s_i in the training data.

For the observation matrix:

b_j(o_k) = P(o_k \mid s_j) = \frac{E(o_k, s_j)}{\sum_{t=1}^{M} E(o_t, s_j)}, \quad 1 \le j \le N, \ 1 \le k \le M \qquad (3)

where E(o_k, s_j) is the number of times word o_k is tagged with state s_j in the training data.

When the training data is insufficient, words that do not appear in the training set cause the formula above to assign them a zero probability through E(o_k, s_j). Thus, the Laplace smoothing method is used here to adjust the emission probability distribution as follows:

P(o_k \mid q_i) = \frac{E(o_k, q_i) + 1}{E(q_i) + M + 1}, \quad 1 \le i \le N

P(\text{``new word''} \mid q_i) = \frac{1}{E(q_i) + M + 1}, \quad 1 \le i \le N

where M is the size of the training vocabulary and E(q_i) is the total number of words tagged with state q_i. One can see that

\sum_{k=1}^{M} P(o_k \mid q_i) + P(\text{``unknown word''} \mid q_i) = 1,

satisfying the normalization condition. In addition, for the efficiency and performance of training, we limit the number of DTDs in an area to at most 6 [8]; multiple HMMs can then be trained according to the different DTDs, represented by λ_n, 1 ≤ n ≤ 6.
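To make the estimation concrete, here is a minimal single-HMM training sketch applying formulas (1)-(3) with the Laplace smoothing above; it is our illustration, not the authors' implementation. `sequences` holds the [(word, path), ...] lists produced by the preprocessing step:

from collections import Counter

def train_hmm(sequences):
    init, trans, emit, state_total = Counter(), Counter(), Counter(), Counter()
    for seq in sequences:
        init[seq[0][1]] += 1                                   # Init(q1 = si)
        for word, path in seq:
            emit[(word, path)] += 1                            # E(ok, sj)
            state_total[path] += 1                             # E(qi)
        for (_, p1), (_, p2) in zip(seq, seq[1:]):
            trans[(p1, p2)] += 1                               # Count(si, sj)

    n_seq = sum(init.values())
    M = len({w for seq in sequences for w, _ in seq})          # vocabulary size

    pi = {s: c / n_seq for s, c in init.items()}               # formula (1)
    out = Counter()
    for (s, _), c in trans.items():
        out[s] += c
    A = {(s, t): c / out[s] for (s, t), c in trans.items()}    # formula (2)

    def B(word, state):                                        # formula (3), smoothed
        return (emit[(word, state)] + 1) / (state_total[state] + M + 1)

    return pi, A, B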

Decoding. At this stage, structure information extraction with the trained HMMs is performed to discover the best path sequence for a given sequence of words. First, the text files that act as the extraction objects and belong to a specific area should be preprocessed. Given the large number of words in a text file, it is clearly unreasonable to treat all of them as one observation sequence, since the large number of iterations could cause the Viterbi variable to underflow numerically. Therefore, one way to solve this problem is to take the sequence of words from each sentence as an observation sequence O = {o_1, o_2, …, o_T}. The path transition sequence Q = {path_1, path_2, …, path_T} most likely to have produced the sequence O can then be obtained using the Viterbi algorithm [11] for each HMM λ_n respectively. All P(O, Q_n | λ_n) values are recorded, and the path sequence corresponding to the maximum value is chosen as the final result.

The Viterbi variable is defined as follows:

\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1, q_2, \ldots, q_{t-1}, q_t = s_i, o_1, o_2, \ldots, o_t \mid \lambda), \quad 1 \le i \le N

It represents the probability of the most probable path for (o_1, o_2, …, o_t) (t ≤ T) ending at state s_i for the given HMM λ. The Viterbi variable of the next state s_j can be defined recursively:

\delta_{t+1}(j) = \max_i [\delta_t(i) \, a_{ij}] \, b_j(o_{t+1})

\psi_t(i) is used to record the previous state during the iteration. The steps of the algorithm are as follows:

1) For each HMM λ_n:

a) If the word is the start of the text, initialize

\delta_1(i) = \pi_i \, b_i(o_1), \quad 1 \le i \le N

Otherwise (for a sequence that does not start the text):

\delta_1(i) = \frac{E(o_1, s_i)}{M}, \quad 1 \le i \le N

The previous state of the initial state is \psi_1(i) = 0.

b) Iterative computation:

\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i) \, a_{ij}] \, b_j(o_t)

\psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i) \, a_{ij}]

c) Termination probability:

P_n^* = \max_{1 \le i \le N} [\delta_T(i)]

Termination state:

q_T^* = \arg\max_{1 \le i \le N} [\delta_T(i)]

2) Obtain \max_{1 \le n \le 6} P_n^*.

3) Read out the best state sequence Q_n from the recorded previous states:

q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, T-2, \ldots, 1
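A minimal Python sketch of this decoding is given below; it is our illustration rather than the paper's code, and it substitutes log-probabilities for the raw products as a further guard against the numerical problems mentioned above. `pi`, `A` and `B` are the outputs of the training sketch, and `states` lists all path states:

import math

def viterbi(words, states, pi, A, B):
    # delta[t][s]: log-probability of the best path ending in state s at step t
    delta = [{s: math.log(pi.get(s, 1e-12)) + math.log(B(words[0], s)) for s in states}]
    psi = [{}]
    for t in range(1, len(words)):
        delta.append({}); psi.append({})
        for j in states:
            best_i = max(states, key=lambda i: delta[t - 1][i] + math.log(A.get((i, j), 1e-12)))
            psi[t][j] = best_i
            delta[t][j] = (delta[t - 1][best_i] + math.log(A.get((best_i, j), 1e-12))
                           + math.log(B(words[t], j)))
    q_T = max(delta[-1], key=delta[-1].get)        # termination (step c)
    path = [q_T]
    for t in range(len(words) - 1, 0, -1):         # backtracking (step 3)
        path.append(psi[t][path[-1]])
    return list(reversed(path)), delta[-1][q_T]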

When the textual address mentioned previously is input, each word is assigned to a specific path, for example the path marked by the thick line in Figure 3. Hence, a set of word-path pairs is obtained, which includes enough structure information to generate XML documents.

Figure 3. An extraction result sample of SIEHMM

XML document generation. Finally, the words assigned to the corresponding paths need to be integrated into an XML document conforming to the specified DTD, through the JDOM and XPath techniques. For example, a generated XML document represented as a DOM tree is shown in Figure 3.
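The paper performs this step with JDOM and XPath in Java; the following Python sketch is merely our equivalent illustration of the same idea, folding the (word, path) pairs back into a DOM tree (repeated sibling elements are merged here for brevity):

import xml.etree.ElementTree as ET

def build_xml(pairs):
    """pairs: [('Sara', 'Contact/Name/Firstname'), ...] in document order."""
    root = None
    for word, path in pairs:
        tags = path.split("/")
        if root is None:
            root = ET.Element(tags[0])
        node = root
        for tag in tags[1:]:
            child = node.find(tag)          # reuse the branch if it already exists
            if child is None:
                child = ET.SubElement(node, tag)
            node = child
        node.text = ((node.text or "") + " " + word).strip()
    return ET.ElementTree(root)

# Example: build_xml(pairs).write("contact.xml")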

IV. EXPERIMENTS AND RESULTS

A. Datasets

The experimental data comes from the XMLSigmodRecordMar1999 packet in the Sigmod XML datasets [13], containing XML documents in different formats that hold bibliographic information for computer journals. A subset of the XML documents generated by three DTD files, describing ordinary issue pages, proceedings pages and index term pages, is used as training data to learn the corresponding HMMs. The free text files serving as test data were created from the content of untrained XML documents. Details of the datasets are shown in Table 1:


TABLE I. DATASETS USED FOR THE EXPERIMENTS

                     Path     Trained     Trained   Test
                     Number   Documents   Words     Files
OrdinaryIssuePage    9        41          6997      10
ProceedingsPage      15       11          15581     5
IndexTermsPage       13       800         30021     80

B. Performance Evaluation for SIEHMM

At present, precision and recall are widely used to evaluate the performance of information extraction systems [12]. For the proposed SIEHMM, precision and recall are evaluated for each state of a trained HMM. They are defined as follows:

s-precision = (number of words correctly labeled with the state (SCE)) / (number of words labeled with the state (ECO)) × 100%

s-recall = (number of words correctly labeled with the state (SCE)) / (number of true words belonging to the state (SCO)) × 100%

Three HMMs, HMM1, HMM2 and HMM3, are trained on the training XML documents belonging to the computer ordinary issue, proceedings and index term classes respectively. Text files belonging to these three classes are used to test the performance of SIEHMM. A complete XML DOM tree of the ordinary issue is shown in Figure 4.

Figure 4. XML DOM tree of the ordinary issue

The trained HMM1 has learnt 9 path states. Performance measures of HMM1, evaluated on the test data, are shown in Table 2.

Because of space limitations, only the overall performances of HMM2 and HMM3 are given, in Table 3. The performance of SIEHMM has been tested with text files in three formats, and the results show that the proposed SIEHMM achieves high precision and recall, owing to the strong statistical properties of HMMs. At the same time, the text labeling depends not only on word identification but also on the transition probabilities between the paths, which are essential for extracting more structure information.

Table 3 also shows that HMM3 has lower precision and recall than the other models. This is because there are clear differences among the XML document structures of index term pages, even though they are described by the same DTD.

Accordingly, in this paper the training data about index terms has been clustered into two groups, which are used to train HMMs respectively. Experiments show that better results can be achieved by clustering isomorphic XML documents that have considerable structural differences.

TABLE II. PERFORMANCE RESULT OF SIEHMM FOR ORDINARY ISSUE

State                     SCE    ECO    SCO    S-Precision   S-Recall
1→2                       9      9      10     0.900         0.900
1→3                       9      9      10     1.000         0.900
1→5                       9      10     10     0.900         0.900
1→6                       9      12     10     0.750         0.900
1→4→7→8                   10     10     29     1.000         0.345
1→4→7→9→10→11→15          997    1114   1015   0.895         0.982
1→4→7→9→10→12             121    124    129    0.976         0.938
1→4→7→9→10→13             121    125    129    0.968         0.938
1→4→7→9→10→14→16          579    594    665    0.975         0.871
Overall                   1864   2007   2007   0.929         0.853

We can also derive the rule that the higher the degree of structure in the training data, the greater the accuracy SIEHMM obtains. The results show that the improved HMM3 (HMM3’) achieves significantly better precision and recall.

TABLE III. PERFORMANCE RESULT OF SIEHMM FOR PROCEEDINGS AND INDEX TERMS

         Labeled   Correctly        Average-    Average-
         Words     Labeled Words    Precision   Recall
HMM2     5902      5479             0.928       0.787
HMM3     11220     9288             0.828       0.467
HMM3’    11696     10142            0.904       0.809

V. CONCLUSION

With the continued growth of the Internet and the huge amount of available text data, we have described a novel structure information extraction method based on HMM, which extracts sufficient structure information from text files with high precision and recall and converts them into XML documents. Given the encouraging experimental results, we believe that SIEHMM can not only be flexibly applied to information extraction in different fields, but also be an excellent choice for the text processing problem in XML information retrieval systems. However, some problems remain: labeling text files with XML tags costs considerable labor, and it is difficult to obtain training data that describes all structural characteristics of the text in a specified field. Therefore, future work will address this lack of structural characteristics in the training data in order to further improve the performance of SIEHMM.

REFERENCES

[1] P. Prescod and C. F. Goldfarb, The XML Handbook. Upper Saddle River, NJ: Prentice-Hall, 2000.
[2] S. Sarawagi, "Information Extraction," Foundations and Trends in Databases, 2(1), 2008.
[3] J. Aitken, "Learning information extraction rules: An inductive logic programming approach," in Proceedings of the 15th European Conference on Artificial Intelligence, pp. 355–359, 2002.
[4] F. Ciravegna, "Adaptive information extraction from text by rule induction and generalisation," in Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI 2001), 2001.
[5] E. Agichtein and V. Ganti, "Mining reference tables for automatic text segmentation," in Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (SIGKDD), 2004.
[6] V. Borkar, K. Deshmukh, and S. Sarawagi, "Automatically extracting structure from free text addresses," Bulletin of the Technical Committee on Data Engineering, 23(4), 2000.
[7] D.-C. Park, Vu Thi Lan Huong, D.-M. Woo, Duong Ngoc Hieu, and Sai Thi Hien Ninh, "Information extraction system based on Hidden Markov Model," ISNN (1) 2009, pp. 52–59.
[8] ZHOU Shun-xian and LIN Ya-ping, "Text information extraction based on clustering Hidden Markov Model," Journal of System Simulation, Nov. 2007.
[9] A. McCallum, D. Freitag, and F. Pereira, "Maximum entropy Markov models for information extraction and segmentation," in Proceedings of the International Conference on Machine Learning (ICML-2000), pp. 591–598, Palo Alto, CA, 2000.
[10] E. Charniak, Statistical Language Learning. Cambridge, MA: MIT Press, 1999.
[11] D. Freitag and A. K. McCallum, "Information extraction with HMM structure learned by stochastic optimization," in Proceedings of the 18th Conference on Artificial Intelligence, 2001.
[12] V. V. Raghavan, G. S. Wang, and P. Bollmann, "A critical investigation of recall and precision as measures of retrieval system performance," ACM Trans. Information Systems, 7(3), 205–229, 1989.
[13] Sigmod XML DataSet. Available at: http://www.acm.org/sigs/sigmod/record/xml, March 1999.
