text mining techniques for patent analysis

Text Mining Techniques for Patent Analysis

Yuen-Hsien Tseng, National Taiwan Normal University,

[email protected]

Yuen-Hsien Tseng, Yeong-Ming Wang, Yu-I Lin, Chi-Jen Lin and Dai-Wei Juang, "Patent Surrogate Extraction and Evaluation in the Context of Patent Mapping", accepted for publication in Journal of Information Science, 2007 (SSCI, SCI)

Yuen-Hsien Tseng, Chi-Jen Lin, and Yu-I Lin, "Text Mining Techniques for Patent Analysis", to appear in Information Processing and Management, 2007 (SSCI, SCI, EI)

mailto:[email protected]

Outline

• Introduction

• A General Methodology

• Technique Details

• Technique Evaluation

• Application Example

• Discussions

• Conclusions

Introduction – Why Patent Analysis?

• Patent documents contain 90% research results– valuable to the following communities:

• Industry

• Business

• Law

• Policy-making

• If carefully analyzed, they can:– reduce 60% and 40% R&D time and cost, respectively

– show technological details and relations

– reveal business trends

– inspire novel industrial solutions

– help make investment policy

Introduction – Gov. Efforts• PA has received much attention since 2001

– Korea: to develop 120 patent maps in 5 years– Japan: patent mapping competition in 2004– Taiwan: more and more PM were created

• Example: “carbon nanotube” (CNT)• 5 experts dedicated more than 1 month

• Asian countries, such as, China, Japan, Korean, Singapore, and Taiwan have invested various resources in patent analysis

• PA requires a lot of human efforts– Assisting tools are in great need

Typical Patent Analysis Scenario1. Task identification: define the scope, concepts, and purposes for

the analysis task.

2. Searching: iteratively search, filter, and download related patents.

3. Segmentation: segment, clean, and normalize structured and unstructured parts.

4. Abstracting: analyze the patent content to summarize their claims, topics, functions, or technologies.

5. Clustering: group or classify analyzed patents based on some extracted attributes.

6. Visualization: create technology-effect matrices or topic maps.

7. Interpretation: predict technology or business trends and relations.

Technology-Effect Matrix• To make decisions about future technology development

– seeking chances in those sparse cells• To inspire novel solutions

– by understanding how patents are related so as to learn how novel solutions were invented in the past and can be invented in the future

• To predict business trends – by showing the trend distribution of major competitors in this map

Material Performance Product Effect (Function) Technology Carbon nanotube Purity Electricity FED

Gas reaction 5346683 6129901

…

6181055 6190634

6221489 6232706

Catalyst 5424054 5780101

… 6333016 6339281

Manufacture

Arc discharging 5424054 6190634 6331262

5916642 5916642

Application Display 6346775 5889372 5967873

…

Part of the T-E matrix (from STIC) for “Carbon Nanotube”

Topic Map of Carbon Nanotube

25 docs. : 0.228054 (emission:180.1, field:177.2, emitter:157.1, cathode:108.4, field emission: 88.0) + 23 docs. : 0.424787 (emitter:187.0, emission:141.9, field:141.4, cathode:129.0, field emission:104.7) + 19 docs. : 0.693770 (emitter:139.7, field emission:132.0, cathode: 96.0, electron: 67.1, display: 61.9) + ID=2 : 7 docs.,0.09(cathode:0.58, source:0.56, display:0.50, field emission:0.45, vacuum:0.43) + ID=1 : 12 docs.,0.07(emitter:0.67, emission:0.60, field:0.57, display:0.40, cathode:0.38) + ID=11 : 4 docs.,0.13(chemic vapor deposition:0.86, sic:0.56, grow:0.44, plate:0.42, thicknes:0.42) + ID=19 : 2 docs.,0.21(electron-emissive:1.00, carbon film:0.70, compromise:0.70, emissive material ... 13 docs. : 0.240830 (energy: 46.8, circuit: 34.0, junction: 33.3, device: 26.0, element: 24.9) + 9 docs. : 0.329811 (antenna: 31.0, energy: 29.5, system: 29.4, electromagnetic: 25.0, granular: 20.6) + ID=4 : 5 docs.,0.07(wave:0.77, induc:0.58, pattern:0.45, nanoscale:0.44, molecule:0.35) + ID=15 : 4 docs.,0.12(linear:0.86, antenna:0.86, frequency:0.74, optic antenna:0.70, …) + ID=10 : 4 docs.,0.06(cool:0.70, sub-ambient:0.70, thermoelectric cool apparatuse:0.70, nucleate:0.70, ...

Text Mining - Definition• Knowledge discovery is often regarded as a process to

find implicit, previously unknown, and potentially useful patterns – Data mining: from structured databases

– Text mining: from a large text repository

• In practice, TM involves a series of user interactions with the text mining tools to explore the repository to find such patterns.

• After supplemented with additional information and interpreted by experienced experts, these patterns can become important intelligence for decision-making.

Text Mining Process for Patent AnalysisA General Methodology

• Document preprocessing– Collection Creation– Document Parsing and Segmentation– Text Summarization– Document Surrogate Selection

• Indexing– Keyword/Phrase extraction– morphological analysis– Stop word filtering– Term association and clustering

• Topic Clustering– Term selection– Document clustering/categorization– Cluster title generation– Category mapping

• Topic Mapping– Trend map -- Aggregation map– Query map -- Zooming map

Example: An US Patent Doc.

• See Example or this URL:– http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO

1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=5,695,734.PN.&OS=PN/5,695,734&RS=PN/5,695,734

Download and Parsing into DBMS

NSC Patents• 612 US patents with assignee contains “National

Science Council” downloaded on 2005/06/15

Distribution of NSC Patents

0

20

40

60

80

100

120

App

ly_Y

ear

1987

1988

1989

1990

1991

1992

1993

1994

1995

1996

1997

1998

1999

2000

2001

2002

2003

Pate

nts

H

G

F

E

D

C

B

A

7

2

1

Document Parsing and Segmentation

• Data conversion– Parsing unstructured texts and citations into

structured fields in DBMS

• Document segmentation– Partition the full patent texts into 6 segments

• Abstract, application, task, summary, feature, claim

– Only 9 empty segments in 6*92=552 CNT patent segments =>1.63%

– Only 79 empty segments in 6*612=3672 NSC patent segments => 2.15%

NPR Parsing forMost-Frequently Cited Journalsand Citation Age Distribution

Citation Age Distribution

0

5

10

15

20

25

0 1 2 3 4 5 6 7 8 9 10 11 12 13

Citation Age

Num

. of Pa

tent

s

1988

1989

1990

1991

1992

1993

1994

1995

1996

1997

1998

1999

2000

2001

2002

JouTitle CitedCount

Appl. Phys. Lett. 23

Plant Molecular Biology 22

IEEE Electron Device Letters 20

Nature 17

Bio/Technology 16

Journal of Virology 15

J. Biological Chemistry 11

IEEE 11

Plant Physiol. 10

IEEE Journal of Solid-State Circuits 10

J. Appl. Phys. 9

Macromolecules 8

Mol. Gen. Genet. 8

J. Electroanal. Chem. 8

Proc. Nat'l Acad. Sci. 8

J. Chem. Soc. 8

Science 8

Applied Optics 8

Electronics Letters 7

Jpn. J. Appl. Phys. 7

Data are for 612 NSC patents

Automatic Summarization• Segment the doc. into paragraphs and sentences• Assess sentences, consider their

– Positions– Clue words– Title words– keywords

• Select sentences– Sort by the weights and select the top-k sentences.

• Assembly the selected sentences– Concatenate the sentences in their original order

PFSavgtftfSweightcluewordswtitlewordsorkeywordsw

w

__

)(

advantage difficult improved overhead shorten avoid effectiveness increase performance simplify cost efficiency issue problem suffer costly goal limit reduced superior decrease important needed resolve weakness

Example: Auto-summarizationMS Word (blue) Vs Ours (red)

TITLE (Patent No.: 6,862,710) Internet navigation using soft hyperlinks

BACKGROUND OF THE INVENTION

Many existing systems have been developed to enable a user to navigate through a set of documents , in order to find one or more of those documents which are particularly relevant to that user's immediate needs . For example , HyperText Mark-Up Language ( HTML ) permits web page designers to construct a web page that includes one or more " hyperlinks " ( also sometimes referred to as " hot links " ) , which allow a user to " click-through " from a first web page to other , different web pages . Each hyperlink is associated with a portion of the web page , which is typically displayed in some predetermined fashion indicating that it is associated with a hyperlink . While hyperlinks do provide users with some limited number of links to other web pages , their associations to the other web pages are fixed , and cannot dynamically reflect the state of the overall web with regard to the terms that they are associated with . Moreover , because the number of hyperlinks within a given web page is limited , when a user desires to obtain information regarding a term , phrase or paragraph that is not associated with a hyperlink , the user must employ another technique . One such existing technique is the search engine . Search engines enable a user to search the World Wide Web ( " Web " ) for documents related to a search query provided by the user . Typical search engines operate through a Web Browser interface . Search engines generally require the user to enter a search query , which is then compared with entries in an " index " describing the occurrence of terms in a set of documents that have been previously analyzed , for example by a program referred to sometimes as a " web spider " . Entry of such a search query requires the user to provide terms that have the highest relevance to the user as part of the search query . However , a user generally must refine his or her search query multiple times using ordinary search engines , responding to the search results from each successive search . Such repeated searching is time consuming , and the format of the terms within each submitted query may also require the user to provide logical operators in a non-natural language format to express his or her search . For the above reasons , it would be desirable to have a system for navigating through a document set , such as the Web , which allows a user to freely search for documents related to terms , phrases or paragraphs within a web page without relying on hyperlinks within the web page . The system should further provide a more convenient technique for internet navigation than is currently provided by existing search engine interfaces .

Evaluation of Each Segment• abs: the ‘Abstract’ section of each patent• app: FIELD OF THE INVENTION• task: BACKGROUND OF THE INVENTION• sum: SUMMARY OF THE INVENTION• fea: DETAILED DESCRIPTION OF THE

PREFERRED EMBODIMENT• cla: Claims section of each patent• seg_ext: summaries from each of the sets: abs, app,

task, sum, and fea • full: full texts from each of the sets: abs, app, task,

sum, and fea

Evaluation Goal

• Analyze a human-crafted patent map to see which segments have more important terms

• Purposes (so as to):– allow analysts to spot the relevant segments more

quickly for classifying patents in the map – provide insights to possibly improve automated

clustering and/or categorization in creating the map

Evaluation Method• In the manual creation of a technology-effect matrix, it

is helpful to be able to quickly spot the keywords that can be used for classifying the patents in the map.

• Once the keywords or category features are found, patents can usually be classified without reading all the texts.

• Thus a segment or summary that retains as many important category features as possible is preferable.

• Our evaluation design therefore is to reveal which segments contains most such features compared to the others.

Patent Maps for Evaluation

Abbr. Patent Map Num. of Doc. Num. of Cat. in the Effect Taxonomy

Num. of Cat. in the Tech.

Taxonomy

CNT Carbon Nanotube 92 21 9

QDF Quantum-Dot Fluorescein Detection 11 5 6

QDL Quantum-Dot LED 27 10 5

QDO Quantum-Dot Optical Sensor 19 10 3

NTD Nano Titanium Dioxide 417 17 22

MCM Molecular Motors 79 21 9

All patent maps are from STPI

Empty segments in the six patent maps

Maps abs app task sum fea cla Total empty segments Total segments Empty rate

CNT 0 1 2 5 1 0 9 552 1.63%

QDF 0 0 0 0 0 0 0 66 0.00%

QDL 0 0 0 1 1 0 2 162 1.23%

QDO 0 1 1 2 0 0 4 114 3.51%

NTD 0 62 74 85 103 0 324 2502 12.95%

MCM 0 1 2 1 1 0 5 474 1.05%

Feature Selection• Well studied in machine learning• Best feature selection algorithms

– Chi-square, information gain, …

• But to select only a few features, correlation coefficient is better than chi-square

• co=1 if FN=FP=0 and TP <>0 and TN <>0

TN)+FP)(FN+TN)(TP+FN)(FP+(TP

)FP FN-TN TP(),(

22 CT

TN)+FP)(FN+TN)(TP+FN)(FP+(TP

)FP FN-TN TP(),(

CTCo

Term T

Category C

Yes No

Yes TP FN

No FP TN

Best and worst terms by Chi-square and correlation coefficient

Chi-square Correlation coefficient

Construction Non-Construction Construction Non-Construction

engineering 0.6210 engineering 0.6210 engineering 0.7880 equipment 0.2854

improvement 0.1004 improvement 0.1004 improvement 0.3169 procurement 0.2231

… …

kitchen 0.0009 kitchen 0.0009 communiqué -0.2062 improvement -0.3169

update 0.0006 update 0.0006 equipment -0.2854 engineering -0.7880

Data are from a small real-world collection of 116 documents with only two exclusive categories, construction vs. non-construction in civil engineering tasks

Some feature terms and their distribution in each set for the category FED in CNT

rel term sc ss abs app Task sum fea cla seg_ext full

emit 8 4.86 12 0.62 13 0.58 21 0.55 17 0.59 22 0.70 14 0.61 20 0.63 27 0.59

yes emission 8 5.07 20 0.69 17 0.59 31 0.62 21 0.73 34 0.63 20 0.63 33 0.64 40 0.54

yes display 8 5.06 9 0.50 12 0.62 22 0.64 14 0.61 24 0.71 10 0.62 23 0.68 34 0.68

cathode 8 3.86 12 0.39 9 0.42 27 0.48 14 0.54 30 0.53 15 0.51 25 0.52 41 0.47

pixel 7 3.14 3 0.33 8 0.46 3 0.33 12 0.62 2 0.27 5 0.43 17 0.72

screen 5 1.74 2 0.27 2 0.27 8 0.37 18 0.43 19 0.41

yes electron 5 1.71 27 0.31 25 0.40 36 0.28 27 0.37 61 0.35

yes voltage 4 1.48 20 0.45 45 0.37 16 0.28 52 0.39

Segmentterm

termsc 1)(

Segmentterm

termcotermss )()(

Note: The correlation coefficients in each segment correlate to the set counts of the ordered features: the larger the set count, the larger the correlation coefficient in each segment.

Occurrence distribution of 30 top-ranked terms in each set for some categories in CNT

category T_No abs App Task sum fea cla seg_ext full

Carbon nanotube 30 15/50.0% 12/40.0% 14/46.7% 20/66.7% 13/43.3% 19/63.3% 20/66.7% 13/43.3%

FED 30 16/53.3% 14/46.7% 22/73.3% 19/63.3% 21/70.0% 19/63.3% 21/70.0% 22/73.3%

device 30 21/70.0% 17/56.7% 9/30.0% 16/53.3% 7/23.3% 19/63.3% 17/56.7% 8/26.7%

Derivation 30 14/46.7% 6/20.0% 7/23.3% 11/36.7% 8/26.7% 13/43.3% 13/43.3% 11/36.7%

electricity 30 12/40.0% 10/33.3% 10/33.3% 10/33.3% 8/26.7% 8/26.7% 13/43.3% 12/40.0%

purity 30 12/40.0% 12/40.0% 7/23.3% 20/66.7% 9/30.0% 17/56.7% 18/60.0% 14/46.7%

High surface area 30 19/63.3% 13/43.3% 13/43.3% 17/56.7% 8/26.7% 9/30.0% 16/53.3% 8/26.7%

magnetic 30 18/60.0% 11/36.7% 6/20.0% 14/46.7% 14/46.7% 13/43.3% 15/50.0% 13/43.3%

energy storage 30 16/53.3% 17/56.7% 13/43.3% 17/56.7% 6/20.0% 10/33.3% 21/70.0% 12/40.0%

M_Best_Term_Coverage(Segment, Category)=

cstermM

csMBTC 11

),(

Occurrence distribution of manually ranked terms in each set for some categories in CNT

category T_No abs app task sum fea cla seg_ext full

Carbon nanotube 4 3/75.0% 2/50.0% 2/50.0% 3/75.0% 2/50.0% 2/50.0% 3/75.0% 2/50.0%

FED 7 6/85.7% 6/85.7% 6/85.7% 4/57.1% 6/85.7% 4/57.1% 6/85.7% 5/71.4%

device 2 2/100.0% 1/50.0% 0/0.0% 1/50.0% 1/50.0% 2/100.0% 1/50.0% 0/0.0%

electricity 2 2/100.0% 2/100.0% 2/100.0% 2/100.0% 1/50.0% 2/100.0% 0/0.0% 0/0.0%

purity 2 2/100.0% 2/100.0% 0/0.0% 1/50.0% 1/50.0% 1/50.0% 0/0.0% 1/50.0%

High surface area 8 6/75.0% 2/25.0% 3/37.5% 5/62.5% 1/12.5% 2/25.0% 4/50.0% 1/12.5%

magnetic 5 3/60.0% 1/20.0% 2/40.0% 1/20.0% 3/60.0% 0/0.0% 4/80.0% 3/60.0%

energy storage 2 2/100.0% 2/100.0% 1/50.0% 2/100.0% 1/50.0% 1/50.0% 2/100.0% 0/0.0%

R_Best_Term_Covertage(Segment, Category)=

relcstermR

csRBTC 11

),(

Occurrence distribution of terms in each segment averaged over all categories in CNT

Taxonomy

Setabs app task sum fea Cla seg_ext full

nc nt

Effect 9 M=30 52.96% 41.48% 37.41% 53.33% 34.81% 47.04% 57.04% 41.85%

Effect* 8 4 86.96% 66.34% 45.40% 64.33% 51.03% 54.02% 55.09% 30.49%

Tech 21 M=30 49.37% 25.56% 26.51% 56.51% 34.44% 46.51% 56.03% 40.95%

Tech* 17 4.5 59.28% 29.77% 23.66% 49.43% 34.46% 60.87% 44.64% 32.17%

M_Best_Term_Coverage(Segment)=

R_Best_Term_Coverage(Segment)=

Catc relcstermRCat

sRBTC 111

)(

Catc cstermMCat

sMBTC 111

)(

Maximum correlation coefficients in each set averaged over all categories in CNT

Taxonomy

Setabs app task sum fea cla seg_ext full

nc nt

Effect 9 M=30 0.58 0.49 0.54 0.55 0.55 0.57 0.56 0.55

Effect* 8 4.0 0.52 0.43 0.39 0.52 0.48 0.47 0.44 0.33

Tech 21 M=30 0.64 0.58 0.65 0.66 0.66 0.67 0.68 0.68

Tech* 17 4.5 0.47 0.29 0.34 0.44 0.35 0.51 0.43 0.42

*: denoted those calculated from human judged relevant terms

Term-covering rates for M best termsfor the effect taxonomy in CNT

0.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

80.00%

90.00%

abs app task sum fea cla seg_ext full

10

30

50

Term-covering rates for M best termsfor the technology taxonomy in CNT

0.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

80.00%

90.00%


10

30

50

Term-covering rates for M best terms

0%

10%

20%

30%

40%

50%

60%

70%

80%


10

30

50

0%

10%

20%

30%

40%

50%

60%

70%

80%


10

30

50

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%


10

30

50

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%


10

30

50

QDF: Quantum Dot Fluorescein Detection

QDL: Quantum Dot LED

Term-covering rates for M best terms

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%


10

30

50

0%

10%

20%30%

40%

50%

60%

70%80%

90%

100%


10

30

50

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%


10

30

50

0%10%20%

30%40%50%60%70%

80%90%

100%


10

30

50

QDO: Quantum-Dot Optical Sensor

NTD: Nano Titanium Dioxide

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%


10

30

50

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%


10

30

50

MCM: Molecular Motors

Findings• Most ICFs ranked by correlation coefficient occur in th

e “segment extracts”, the Abstract section, and the SUMMARY OF THE INVENTION section.

• Most ICFs selected by humans occur in the Abstract section or the Claims section.

• The “segment extracts” lead to more top-ranked ICFs than the “full texts”, regardless whether the category features are selected manually or automatically.

• The ICFs selected automatically have higher capability in discriminating a document’s categories than those selected manually according to the correlation coefficient.

Implications

• Text summarization techniques help in patent analysis and organization, either automatically or manually.

• If one would determine a patent’s category based on only a few terms in a quick pace, one should first read the Abstract section and the SUMMARY OF THE INVENTION section

• Or alternatively, one should first read the “segment extracts” prepared by a computer

Text Mining Process for Patent Analysis

• Document preprocessing– Collection Creation– Document Parsing and Segmentation– Text Summarization– Document Surrogate Selection

• Indexing– Keyword/Phrase extraction– morphological analysis– Stop word filtering– Term association and clustering

• Topic Clustering– Term selection– Document clustering/categorization– Cluster title generation– Category mapping

• Topic Mapping– Trend map -- Aggregation map– Query map -- Zooming map

Ideal Indexing for Topic Identification

Unit Syntatic processing Semantic

processing

Resource

(human-prepared)

Alphabet Byte isolation Character

identification Language knowledge

Word Word segmentation, Word disambiguity Lexicon

Phrase Phrase extraction POS tagger Tagged corpus

Term Morphological

processing: stemmer

Morphological

analyzer Linguistic knowledge

Concept Clustering,

feature extraction Grammar analyzer Thesaurus, WordNet

Category classification Understanding Training data,

Existing DB

No processing may result in low recall; More processing may have false drops.

Example: Extracted Keywords and Their Associated Terms

1. Yuen-Hsien Tseng, Chi-Jen Lin, and Yu-I Lin, "Text Mining Techniques for Patent Analysis", to appear in Information Processing and Management, 2007 (SSCI and SCI)

2. Yuen-Hsien Tseng, "Automatic Cataloguing and Searching for Retrospective Data by Use of OCR Text", Journal of the American Society for Information Science and Technology, Vol. 52, No. 5, April 2001, pp. 378-390. (SSCI and SCI)

Clustering Methods• Clustering is a powerful technique to detect topics and

their relations in a collection.• Clustering techniques:

– HAC : Hierarchical Agglomerative Clustering– K-means– MDS: Multi-Dimensional Scaling– SOM: Self-organization Map

• Many open source packages are available– Need to define the similarity to use them

• Similarities– Co-words: common words used between items– Co-citations: common citations between items

Document Clustering• Effectiveness of clustering relies on

– how terms are selected• Affect effectiveness most• Automatic, manual, or hybrid• Users have more confidence on the clustering results if terms are selecte

d by themselves, but this is costly• Manual verification of selected terms is recommended whenever it is po

ssible• Recent trend:

– Text clustering with extended user feedback, SIGIR 2006– Near-duplicate detection by instance-level constrained clustering, SIGIR06

– how they are weighted• Boolean or TFxIDF

– how similarities are measured• Cosine, Dice, Jaccard, etc, ..

• Direct HAC document clustering may be prohibited due to its complexity

Term Clustering• Single terms are often ambiguous, a group of near-synon

ym terms can be more specific in topic• Goal: reduce number of terms for ease of topic detection,

concept identification, generation of classification hierarchy, or trend analysis

• Term clustering followed by document categorization– Allow large collections to be clustered

• Methods:– Keywords: maximally repeated words or phrases, extracted by

patented algorithm (Tseng, 2002)– Related terms: keywords which often co-occur with other keyw

ords, extracted by association mining (Tseng, 2002)– Simset: a set of keywords having common related terms, extrac

ted by term clustering

Multi-Stage Clustering• Single-stage clustering is easy to get skewed distribution• Ideally, in multi-stage clustering, terms or documents

can be clustered into concepts, which in turn can be clustered into topics or domains.

• In practice, we need to browse the whole topic tree to found desired concepts or topics.

Terms or docs.

Concepts

Topics

Cluster Descriptors Generation• One important step to help analysts interpret the

clustering results is to generate a summary title or cluster descriptors for each cluster.

• CC (correlation Coefficient) is used• But CC0.5 or CCxTFC yield better results• See

– Yuen-Hsien Tseng, Chi-Jen Lin, Hsiu-Han Chen and Yu-I Lin, "Toward Generic Title Generation for Clustered Documents," Proceedings of Asia Information Retrieval Symposium, Oct. 16-18, Singapore, pp. 145-157, 2006. (Lecture Notes in Computer Science, Vol. 4182, SCI)

Mapping Cluster Descriptors to Categories

• More generic title words can not be generated automatically– ‘Furniture’ is a generic term for beds, chairs, tables,

etc. But if there is no ‘furniture’ in the documents, there is no way to yield furniture as a title word, unless additional knowledge resources were used, such as thesauri

• See also Tseng et al, AIRS 2006

Search WordNet for Cluster Class• Using external resource to get cluster categories

– For each of 352 (0.005) or 328 (0.001) simsets generated from 2714 terms

– Submit the sinset heads to WordNet to get their hypernyms (upper-level hypernyms as categories)

– Accumulate occurrence of each of these categories

– Rank these categories by occurrence

– Select the top-ranked categories as candidates for topic analysis

– These top-ranked categories still need manual filtering

– Current results are not satisfying• Need to try to search scientific literature databases which support top

ic-based search capability and which have needed categories

Mapping Cluster Titles to Categories• Search Stanford’s InfoMap

– http://infomap.stanford.edu/cgi-bin/semlab/infomap/classes/print_class.pl?args=$term1+$term2

• Search WordNet directly– Results similar to InfoMap– Higher recall, lower Precision than InfoMap– Yield meaningful results only when terms are in high quality

• Search google directory: http://directory.google.com/ – Often yield: your search did not match any documents.– Or wrong category:

• Ex1: submit: “'CMOS dynamic logics‘”– get: ‘Computers > Programming > Languages > Directories’

• Ex2: submit: “laser, wavelength, beam, optic, light”, get:– ‘Business > Electronics and Electrical > Optoelectronics and Fiber‘, – ‘Health > Occupational Health and Safety > Lasers’

• Searching WordNet yield better results but still unacceptableD:\demo\File>perl -s wntool.pl=>0.1816 : device%1=>0.1433 : actinic_radiation%1 actinic_ray%1=>0.1211 : signal%1 signaling%1 sign%3=>0.0980 : orientation%2=>0.0924 : vitality%1 verve%1

http://infomap.stanford.edu/cgi-bin/semlab/infomap/classes/print_class.pl?args=$term1+$term2

http://infomap.stanford.edu/cgi-bin/semlab/infomap/classes/print_class.pl?args=$term1+$term2

NSC Patents

• 612 US patents whose assignees are NSC• NSC sponsors most academic researches

– Own the patents resulted from the researches

• Documents in the collection are– knowledge-diversified (cover many fields)– long (2000 words in average)– full of advanced technical details

• Hard for any single analyst to analyze them• Motivate the need to generate generic titles

Text Mining from NSC Patents• Download NSC patents from USPTO with assignee=N

ational Science Council• Automatic key-phrase extraction

– Terms occurs more than once can be extracted• Automatic segmentation and summarization

– 20072 keywords from full texts vs 19343 keywords from 5 segment summarization

– The 5 segment abstracts contain more category-specific terms then full texts (Tseng, 2005)

• Automatic index compilation– Occurring frequency of each term in each document was reco

rded – Record more than 500,000 terms (words, phrases, digits) amo

ng 612 documents in 72 seconds

Text Mining from NSC Patents: Clustering Methods

• Term similarity is based on common co-occurrence terms– Phrases and co-occurrence terms are extracted based on Tseng’s

algorithm [JASIST 2002]

• Document similarity is based on common terms

• Complete-link method is used to group similar items

Cluster Info. ID=180, sim=0.19， descriptors: standard:0.77, mpeg:0.73, audio:0.54

Term DF Co-occurrence Terms

AUDIO 9standard, high-fidelity, MPEG, technique, compression, signal, Multi-Channel.

MPEG 4 standard, algorithm, AUDIO, layer, audio decoding, architecture.

audio decoding 3 MPEG, architecture.

standard 31 AUDIO,MPEG.

compression 29apparatus, AUDIO, high-fidelity, Images, technique, TDAC, high-fidelity audio, signal, arithmetic coding.

Term Clustering of NSC Patents• Results:

– From 19343 keywords, remove those whose df>200 (36) and df=1 (12330), and those that have no related terms (4263), resulting in 2714 terms

• Number of terms whose df>5 is 2800– 352 (0.005) or 328 (0.001) simsets were generated from 2714 terms

• Good cluster:– 180 : 5筆 ,0.19(standard:0.77, mpeg:0.73, audio:0.54)– AUDIO : 9 : standard, high-fidelity, MPEG, technique, compression, signal, Multi-Channel. – MPEG : 4 : standard, algorithm, AUDIO, layer, audio decoding, architecture. – audio decoding : 3 : MPEG, architecture. – standard : 31 : AUDIO,MPEG. – compression : 29 : apparatus, AUDIO, high-fidelity, Images, technique, TDAC, high-fidelity audio,

signal, arithmetic coding.

• Wrong cluster:– 89 : 6筆 ,0.17(satellite:0.71, communicate:0.54, system:0.13)– satellite : 8 : nucleotides, express, RNAs, vector, communication system, phase, plant,

communism, foreign gene. – RNAs : 5 : cDNA, Amy8, nucleotides, alpha-amylase gene, Satellite RNA, sBaMV, anal

ysis, quinoa, genomic, PAT1, satellite, Lane, BaMV, transcription. – foreign gene : 4 : express, vector, Satellite RNA, plant, satellite, ORF. – electrical power : 4 : satellite communication system. – satellite communication system : 2 : electrical power, microwave. – communication system : 23 : satellite.

Topic Map for NSC Patents• Third-stage document clustering result

6. Biomedicine

1.Chemistry

5. Material

3. Generality

2. Electronics and Semi-conductors

4. Communication and computers

Topic Tree for NSC Patents1: 122 docs. : 0.201343 (acid:174.2, polymer:166.8, catalyst:155.5, ether:142.0, formula:135.9) * 108 docs. : 0.420259 (polymer:226.9, acid:135.7, alkyl:125.2, ether:115.2, formula:110.7) o 69 docs. : 0.511594 (resin:221.0, polymer:177.0, epoxy:175.3, epoxy resin:162.9, acid: 96.7) + ID=131 : 26 docs. : 0.221130(polymer: 86.1, polyimide: 81.1, aromatic: 45.9, bis: 45.1, ether: 44.8) + ID=240 : 43 docs. : 0.189561(resin:329.8, acid: 69.9, group: 57.5, polymer: 55.8, monomer: 44.0) o ID=495 : 39 docs. : 0.138487(compound: 38.1, alkyl: 37.5, agent: 36.9, derivative: 33.6, formula: 24.6) * ID=650 : 14 docs. : 0.123005(catalyst: 88.3, sulfide: 53.6, iron: 21.2, magnesium: 13.7, selective: 13.1) 2: 140 docs. : 0.406841 (silicon:521.4, layer:452.1, transistor:301.2, gate:250.1, substrate:248.5) * 123 docs. : 0.597062 (silicon:402.8, layer:343.4, transistor:224.6, gate:194.8, schottky:186.0) o ID=412 : 77 docs. : 0.150265(layer:327.6, silicon:271.5, substrate:178.8, oxide:164.5, gate:153.1) o ID=90 : 46 docs. : 0.2556(layer:147.1, schottky:125.7, barrier: 89.6, heterojunction: 89.0, transistor: … * ID=883 : 17 docs. : 0.103526(film: 73.1, ferroelectric: 69.3, thin film: 48.5, sensor: 27.0, capacitor: 26.1) 3: 66 docs. : 0.220373 (plastic:107.1, mechanism: 83.5, plate: 79.4, rotate: 74.9, force: 73.0) * 54 docs. : 0.308607 (plastic:142.0, rotate:104.7, rod: 91.0, screw: 85.0, roller: 80.8) o ID=631 : 19 docs. : 0.125293(electromagnetic: 32.0, inclin: 20.0, fuel: 17.0, molten: 14.8, side: 14.8) o ID=603 : 35 docs. : 0.127451(rotate:100.0, gear: 95.1, bear: 80.0, member: 77.4, shaft: 75.4) * ID=727 : 12 docs. : 0.115536(plasma: 26.6, wave: 22.3, measur: 13.3, pid: 13.0, frequency: 11.8) 4: 126 docs. : 0.457206 (output:438.7, signal:415.5, circuit:357.9, input:336.0, frequency:277.0) * 113 docs. : 0.488623 (signal:314.0, output:286.8, circuit:259.7, input:225.5, frequency:187.9) o ID=853 : 92 docs. : 0.105213(signal:386.8, output:290.8, circuit:249.8, input:224.7, light:209.7) o ID=219 : 21 docs. : 0.193448(finite: 41.3, data: 40.7, architecture: 38.8, computation: 37.9, algorithm: … * ID=388 : 13 docs. : 0.153112(register: 38.9, output: 37.1, logic: 32.2, addres: 28.4, input: 26.2) 5: 64 docs. : 0.313064 (powder:152.3, nickel: 78.7, electrolyte: 74.7, steel: 68.6, composite: 64.7) * ID=355 : 12 docs. : 0.1586(polymeric electrolyte: 41.5, electroconductive: 36.5, battery: 36.1, electrode: ... * ID=492 : 52 docs. : 0.138822(powder:233.3, ceramic:137.8, sinter: 98.8, aluminum: 88.7, alloy: 63.2) 6: 40 docs. : 0.250131 (gene:134.9, protein: 77.0, cell: 70.3, acid: 65.1, expression: 60.9) * ID=12 : 11 docs. : 0.391875(vessel: 30.0, blood: 25.8, platelet: 25.4, dicentrine: 17.6, inhibit: 16.1) * ID=712 : 29 docs. : 0.116279(gene:148.3, dna: 66.5, cell: 65.5, sequence: 65.1, acid: 62.5)

Total: 558 docs.

Major IPC Categories for NSC patentsA: 87 docs. : Human Necessities + A61: 71 docs. : Medical Or Veterinary Science; Hygiene + A*: 16 docs. : A01(7), A21(2), A23(2), A42(2), A03(1), A62(1), A63(1) B: 120 docs. : Performing Operations; Transporting + B01: 25 docs. : Physical Or Chemical Processes Or Apparatus In General + B05: 28 docs. : Spraying Or Atomising In General; Applying Liquids Or Other Fluent Materials To Surfaces + B22: 17 docs. : Casting; Powder Metallurgy + B*: 50 docs. : B32(12), B29(11), B62(6), B23(4), B24(4), B60(4), B02(2), B21(2), B06(1), B25(1), … C: 314 docs. : Chemistry; Metallurgy + C07: 62 docs. : Organic Chemistry + C08: 78 docs. : Organic Macromolecular Compounds; Their Preparation Or Chemical Working-Up; … + C12: 76 docs. : Biochemistry; Beer; Wine; Vinegar; Microbiology; Mutation Or Genetic Engineering; … + C* : 98 docs. : C23(22), C25(20), C01(19), C04(10), C09(10), C22(8), C03(5), C30(3), C21(1) D: 6 docs. : Textiles; Paper E: 8 docs. : Fixed Constructions F: 30 docs. : Mechanical Engineering; Lighting; Heating; G: 134 docs. : Physics + G01: 49 docs. : Measuring; Testing + G02: 28 docs. : Optics + G06: 29 docs. : Computing; Calculating; Counting + G*: 28 docs. : G10(7), G11(7), G05(6), G03(5), G08(2), G09(1) H: 305 docs. : Electricity + H01: 216 docs. : Basic Electric Elements o H01L021: 92 docs. : Processes or apparatus adapted for the manufacture or treatment of semiconductor o H01L029: 35 docs. : Semiconductor devices adapted for rectifying, amplifying, oscillating, or switching; o H01L*: 89 docs. : others. o H01*: 53 docs. : H03K(23), H03M(11), H04B(10), H04L(7), H04N(7), H01B(5), H03H(4), H04J(4), … + H03: 51 docs. : Basic Electronic Circuitry + H04: 30 docs. : Electric Communication Technique + H*: 8 docs. : H05(5), H02(3)

Division Distributions of NSC Patents

Abbrev. Division Percentage Ele Electrical Engineering 28.63% Che Chemical Engineering 14.70% Mat Material Engineering 14.12% Opt Optio-Electronics 13.15% Med Medical Engineering 10.44% Mec Mechanical Engineering 6.58% Bio Biotechnology Engineering 5.03% Com Communication Engineering 2.90% Inf Information Engineering 2.90% Civ Civil Engineering 1.16% Others 0.39% Total 100.00%

Distribution of Major IPC Categories in Each Cluster

C08

C07A61

B01C*

B* H*others

Cluster 1

H

C*

F

G01H01L021

H01L029

others

B05

Cluster 2

B*

FG01E

A*H*B22

others

Cluster 3

H03

H04G06H*

G02

G01G*

others

Cluster 4

C*

B22H*

B01

others

Cluster 5

C12A61

C07

others

Cluster 6

Distribution of Academic Divisions in Each Cluster

Che

Med

Mat

OptEleCivBio

Cluster 1

Ele

Opt

Che

Mat

Cluster 2

Mec

Mat

Ele

CivInfCheOpt

Cluster 3

Ele

Opt

Com

MecMedCheInf

Cluster 4

MatChe

Ele

Cluster 5

MedBio

Cluster 6

Comparison among the Three Methods• The three classification systems provide different facets to

understand the topic distribution of the patent set.

• Each may reveal some insights if we can interpret it.

• The IPC system results in divergent and skewed distributions which make it hard for further analysis (such as trend analysis).

• The division classification is the most familiar one to the NSC analysts, but it lacks inter-disciplinary information.

• As to the text mining approach, it dynamically glues related IPC categories together based on the patent contents to disambiguate their vagueness.

• This makes future analysis possible even when the division information is absent, as may be the case in later published patents to which NSC no longer claims their right.

Other Methods: SOM

The 16x16 SOM for the NSC patents obtained by the tool from Peter Kleiweg

Other Methods: Citation Analysis

• Among 612 NSC patents:

• Only 123 patents are co-cited by others – resulting in 99 co-cited pairs.

• Only 175 patents co-cite others – resulting in 143 co-citing pairs.

• Such sparseness may lead to biased analysis.

• Citation analysis is not suitable in this case.

Conclusions• Used text mining techniques:

– text segmentation– summary extraction– keyword identification– topic detection (taxonomy generation, term clustering)

• Achievement: – better classification than IPC– As more interactions are involved in nowadays researches,

inter-disciplinary relations are interesting to monitor.– Provide this information that NSC Divisions lack

• “Problems to be solved” is likely to be extracted from the “Background of the Invention”

• However, “Solutions” is hard to extract

Types of Patent Maps• Trend maps : two kinds for showing the trends:

– Growth mode: accumulate patents over time – Evolution mode: divide patents over time– Both are made by fixing the clusters obtained from clustering all patents

and then divide the patents in each cluster in a timely fashion and recalculate the similarities among these clusters.

• Query maps:– Showing only those patents satisfying some conditions in each cluster

• Aggregation maps :– Aggregated results based on some specified attributes are show in each

cluster

• Zooming maps:– Some clusters can be selected and zoomed in or out to show the details or

the overviews