text mining, term mining, and visualization - improving the impact of scholarly publishing

43
TEXT MINING, TERM MINING, AND VISUALIZATION IMPROVING THE IMPACT OF SCHOLARLY PUBLISHING MONDAY 16 APRIL 2012 NICE, FRANCE Marjorie M.K. Hlava, President Jay Ven Eman, CEO Access Innovations, Inc. [email protected] [email protected] 1

Upload: accessinnovations

Post on 11-May-2015

559 views

Category:

Education


2 download

DESCRIPTION

A detailed look at the graphic representation of text and term mining data. Originally presented by Marjorie M.K. Hlava and Dr. Jay Ven Eman at the 2012 International Information Conference on Search, Data Mining and Visualization in Nice, France.

TRANSCRIPT

Page 1: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

TEXT MINING, TERM MINING, AND

VISUALIZATION 

IMPROVING THE IMPACT OF SCHOLARLY PUBLISHING

MONDAY 16 APRIL 2012NICE, FRANCE

Marjorie M.K. Hlava, President Jay Ven Eman, CEO

Access Innovations, Inc. [email protected]

[email protected]

1

Page 2: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

What we will cover today• Term and Text Mining

• The basics of visualization

• Case studies

• Using subject terms as metrics

• Applications

• Visualizing the results

Page 3: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Definitions• Term Mining - a systematic

comparison processing algorithmic method to find patterns in text

• Text Mining – using controlled vocabulary tags in text to find patterns and directions

• Term & text mining Many similarities Can be complimentary; not mutually

exclusive

Page 4: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Term mining• Precise

Meaningful semantic relationships; contextual

Replicable; repeatable; consistent

Vetted; controlled

Based on a controlled vocabulary

Trends; gaps; relationship analysis; visualizations

Less data processing load

Page 5: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Text mining Algorithmic; formulaic

Neural nets, statistical, latent semantic, co - occurrence

Serendipitous relationships

Sentiment; hot topics; trends

False drops; noise;

Misleading semantic relationships

Heavy processing load

Page 6: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Why take a visual look?

• Humans can process information 17 times faster in visual presentations

• Now data can be analyzed, manipulated and presented as visual displays.

• To see the trends effectively we need to make the data into rich graph-able formats

6

Page 7: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Visualization of data• Needs

−Measurement−Metrics−Numbers

• Shows−Adjacency−Relationships−Trends−Co – occurrence−Conceptual

distance

• Is richer with−Linking−Semantic

enrichment−Classification

• Supports−Forecasting−Trend analysis−Segmentation−Distribution

7

Page 8: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Man’s attention to visual display to convey

knowledge is ancient

8

Page 9: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

The art in mapsis a

longstanding tradition

9

Page 10: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Super imposing data is now commonA mash up example

10

Traffic Injury MapUK Data ArchiveUS National Highway

Safety AdministrationGoogle Maps Base

Accident categories includechildrenautomobilebicycleetc.

Datatimeplacetype

Source:JISC TechWatch: Data Mash-ups September 2010

Page 11: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Mash up of bird flight migrations and weather patterns

http://www.youtube.com/watch?v=uPff1t4pXiI&feature=youtu.be

11

Page 12: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

http://www.youtube.com/watch?v=nokQBjk1s_8&feature=player_embedded

Page 13: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

How does it work?

Develop controlled vocabulary» Prefer one with hierarchy

Apply to full text» Or to the “heads”

Decide on data points to convey information Divide the XML into graphable sections

Page 14: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Start with data – like this XML file

14

Page 15: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Index or tag using subject terms from thesaurus or taxonomy

date, category, taxonomy term, frequency

15

Page 16: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Many views of one set of data

16

Page 17: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Load to a visualization program Like Prefuse

17

Page 18: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Or Pajek

18

Page 19: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony 19

Page 20: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

National Information Center for Educational Media

Albuquerque’s own» Sandia developed VxInsight» Access Innovations = NICEM

Same data – several views

Primary and Secondary Education in US

Shows the US Valley of Science

Little Science taught in elementary years

20

Page 21: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Page 22: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Page 23: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Using visualization to show

From a society / publisher perspective» Identify Core, Boundary and Cross Border » Provides Indicators

Activity Growth Relatedness Centrality

» Locates Journal domains

From a thesaurus perspective» Identifies terms that are too broadly defined» Potential Improvements in thesaurus structure using topic

structures

23

Page 24: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Case Study: Mapping IEEE thesaurus space

We are interested in an expanded map that includes adjacencies to the IEEE data» Expanded term set shows adjacent white space;

opportunities for expansion Overlaps and edges of the science

» We need comparison data Learn the directions in the field

» Low occurrence rate in IEEE documents?» Linkage to terms in IEEE documents?

Where do we find these terms? How can we add them?

24

Page 25: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

The process Built a rule base to auto index IEEE content

» “90 % accuracy out of the box on journal data”*» “80% out of the box on proceedings data”*

The overlapping data sets» Auto indexed 1.2 million Xplore records» Auto indexed 10 years of US Patent data» Auto indexed 10 years of Medline

Term sets used» IEEE thesaurus terms rule base» Medical Subject Headings (MeSH) (and simple rule base)» Defense Technical Information Center (DTIC) Thesaurus

( and simple rule base)» Similar level of detail to current IEEE thesaurus terms

25

Page 26: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Defining expanded term space

26

IEEE

2k t

erm

s

1.2M documents

1. The data - Select related corpus

14k

DT

IC

475k patents

24k

Me

SH

PubMed525k docs

Page 27: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Defining expanded term space

27

IEEE

2k t

erm

s

1.2M documents

2. Identify related termsUse the IEEE Thesaurus to index the three collections

Page 28: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Defining expanded term space

28

IEEE

2k t

erm

s

1.2M documents

2. Identify related termsUse MESH and DTIC to also index the three collections

Page 29: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony 29

IEEE

2k t

erm

s

1.2M documents

3. Resulting term setThe co-indexed items from the three collections

Defining expanded term space

Page 30: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony 30

4. Term:Term MatrixWhere do the articles and their indexing intersect?

Defining expanded term space

Page 31: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony 31

Visualization Strategies

MatrixVisualization

Software

Page 32: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony 32

All data up-posted to the top level

Page 33: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony 33

Many map options

IEEE ExperiencePrevious Experience

Page 34: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony 34

SensorsCouncil

Nucl PlasmaSci Soc

NanotechCouncil

Ultrason,Ferro …

Prod SafEngng Soc

OceanicEngng Soc

Geosci RemSens Soc

CouncilSupercond

Compon,Packag …

InstrMeasur Soc

MagneticsSoc

Dielectr ElInsul Soc

ElectromagCompat Soc

AntennasPropag Soc

PowerElectron Soc

ElectronDev Soc

Circuits &Systems

Power &Energy Soc

IndustryAppl Soc

Solid StCircuits Soc

IndustrElectr Soc

MicrowaveTheory Soc

AerospElectr

Sys Soc

Sys ManCyber

Society

ComputerIntelligence

Society

SystemsCouncil

ReliabilitySociety Education

Society

ProfCommunSociety

ComputerSociety

RobotAutom Soc

SocialImpl Techn

Council ElectrDesign Auto

SignalProc Soc

Intell TranspSys Soc

CommunSoc

Info TheorySoc

VehicularTechn Soc

ConsumerElectr Soc

BroadcastTechn Soc

PhotonicsSoc

Eng MedBiol Sci

IEEE Portfolio

Page 35: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony 35

Radial Visualization

Page 36: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony 36

Publication Strategy

JASIST reference

Page 37: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony 37

Conference Strategy

Page 38: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony 38

Turbines

Measurement

Circuits

Amplifiers

Displays

Games

Toys

Flow

Cooling

Heating

Components

Gearing

Brakes

Dynamics

Vehicles,Parts

Disk

Optics

Photochem

Molding

Conductors

Coatings

Lasers

Lamps

Motors

Plants,Micro-orgs

Control

Boats

OilfieldServices

Med Instruments

Welding

Conveyers

Rubber

Acyclic Comp

Footwear

Lubricants

Radiology

Catalysis

Macromolecules

Sprayers

Electrochem

Fitness

Hygiene

Cleaning

Printing

Paper

IC Engines

Magn/ElectMagnets

TextilesLayers

MedicalDevices

Clocks

Pipes

Valves

Blasting

Cables

Appliances

Outerwear

ExhaustPumps

Tools

ToolsFurniture

Packaging

Aircraft

Semiconductors

Use a Thesaurus to Label Maps

Agriculture

Food

ConsumerProducts

Construction

Automotive +Defense

IndustrialProducts

Leisure

Energy

Telecom ComputerHW/SW

Electronics

Chemicals

Pharma

Metals

Health Care

Page 39: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Questions Answered

Is there a way, using our own information, to forecast our direction?

Where is the industry headed? What about by technology sector? 

Does our coverage match our mission and vision? Can we become smarter about our data and potential

markets using our collection in new ways?Are the societies publishing and talking about what their charter indicates they cover?

What are the trends – are topics emerging/cooling? Can we use technology and our own data to explore these

questions while enhancing our data?

39

Page 40: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

The research team

Access Innovations / Data Harmony» Founded in 1978» Data enrichment and normalization» Suite of Semantic Enrichment tools

SciTechStrategies» Understanding data through visualization

IEEE Indexing & Abstracting Group

40

Page 41: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

We looked at visualization of data

Finding the Metrics» Measurement» Numbers» Terms as indicators

Ways to show» Adjacency» Relationships» Trends» Co – occurrence» Conceptual distance

How to enrich with» Linking» Semantic enrichment» Classification

Maps supporting» Forecasting» Trend analysis» Segmentation» Distribution

41

Page 42: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

Well Formed Data • Semantic Enrichment • Taxonomies • Access Innovations • Data Harmony

Effective maps require

Contextual data Detailed data Classification methods At least two directions in the matrix A little art for fun

42

Page 43: Text Mining, Term Mining, and Visualization - Improving the Impact of Scholarly Publishing

43

It just takes a little imagination

Thank you

Marjorie M.K. HlavaPresident

[email protected]

Jay Ven Eman, [email protected]

, Access Innovations505-998-0800