DLLS 2003 1

Ontologically-based Searching for Jobs in Linguistics

Deryle Lonsdale
[email protected]

Funded by:

DLLS 2003 2

The BYU Data Extraction Group
- Group of faculty (5) and students (15) from CS, Linguistics, SOAIS
- Goal: ontology-based data extraction
- NSF funding: CISE/IIS/IDM TIDIE
- Website: www.deg.byu.edu/
  - Papers, presentations
  - Tools
  - Demos

DLLS 2003 3

The BYU Data Extraction Group

DLLS 2003 4

Overview
- Ontology-based extraction
- Building knowledge sources
- Jobs in linguistics (Sproat)
- Putting it all together
- Some sample results

DLLS 2003 5

Ontologies and IE
[Figure: diagram with the labels Source and Target]

DLLS 2003 6

Document-based IE

DLLS 2003 7

Conceptual modeling (OSM)
[Figure: OSM diagram of the car-ads model. Car has Year, Price, Make, Mileage, Model and Feature; PhoneNr is for Car; PhoneNr has Extension. Each relationship end carries a participation constraint (0..1, 0..* or 1..*).]

DLLS 2003 8

Recognition and Extraction

Car   Year  Make    Model      Mileage  Price  PhoneNr
0001  1989  Subaru  SW                  $1900  (336)835-8597
0002  1998          Elantra                    (336)526-5444
0003  1994  HONDA   ACCORD EX  100K            (336)526-1081

Car   Feature
0001  Auto
0001  AC
0002  Black
0002  4 door
0002  tinted windows
0002  Auto
0002  pb
0002  ps
0002  cruise
0002  am/fm
0002  cassette stereo
0002  a/c
0003  Auto
0003  jade green
0003  gold

DLLS 2003 9

Car-Ads Ontology (textual)

Car [->object];
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
Year matches [4]
  constant { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d[^\d]"; substitute "^" -> "19"; },
  … …
End;
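The Year rule can be read as three regular-expression steps. The following is a minimal Python sketch, not the group's implementation. It assumes that `context` locates a candidate span in the ad, `extract` pulls the value out of that span, and `substitute` normalizes the extracted value. That reading of the rule semantics and the helper name `extract_years` are assumptions made for illustration.

```python
import re

# Patterns copied from the Year rule on the slide.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Return normalized four-digit years for two-digit years found in text.

    Assumed semantics: match `context` first, then `extract` inside the
    matched span, then apply the substitution "^" -> "19".
    """
    years = []
    for ctx in CONTEXT.finditer(text):
        value = EXTRACT.search(ctx.group(0))
        if value is None:
            continue
        # Substituting at the start anchor prefixes the century.
        years.append(re.sub(r"^", "19", value.group(0)))
    return years


print(extract_years("'89 Subaru SW, $1900"))
```

Under this reading the price `$1900` is not picked up, because the context pattern rejects digits preceded by `$` or by another digit. A four-digit year such as 1994 is also skipped by this particular rule and would need one of the rules elided on the slide.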

DLLS 2003 10

The data-frame library
- Low-level patterns implemented as regular expressions
- Match items such as email addresses, phone numbers, names, etc.

Mileage matches [8]
  constant
    { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; },
    { extract "[1-9]\d{0,2}?,\d{3}"; context "[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]"; substitute "," -> ""; },
    { extract "[1-9]\d{0,2}?,\d{3}"; context "(mileage\:\s*)[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]"; substitute "," -> ""; },
    { extract "[1-9]\d{3,6}"; context "[^\$\d][1-9]\d{3,6}\s*mi(\.|\b\les\b)"; },
    { extract "[1-9]\d{3,6}"; context "(mileage\:\s*)[^\$\d][1-9]\d{3,6}\b"; };
  keyword "\bmiles\b", "\bmi\.", "\bmi\b", "\bmileage\b";
end;
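A data frame with several rules can be sketched as a list of (extract, context, substitute) records tried in order. The Python below is an illustrative sketch only. It encodes just the first two Mileage rules from the slide, and it assumes case-insensitive matching (so that "100K" matches the lowercase `k` in the rule) and that a rule with no `context` uses its `extract` pattern as the context. The names `Rule` and `apply_rules` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Rule:
    extract: str                                  # pattern for the value itself
    context: Optional[str] = None                 # pattern for the surrounding text
    substitute: Optional[Tuple[str, str]] = None  # (pattern, replacement)


# First two Mileage rules from the slide; the remaining rules are omitted.
MILEAGE_RULES = [
    Rule(r"\b[1-9]\d{0,2}k", substitute=(r"[kK]", "000")),
    Rule(r"[1-9]\d{0,2}?,\d{3}",
         context=r"[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]",
         substitute=(",", "")),
]


def apply_rules(text, rules):
    """Collect normalized values from every rule that fires on text."""
    values = []
    for rule in rules:
        # Assumption: without an explicit context, the extract pattern is used.
        context = re.compile(rule.context or rule.extract, re.IGNORECASE)
        extract = re.compile(rule.extract, re.IGNORECASE)
        for ctx in context.finditer(text):
            match = extract.search(ctx.group(0))
            if match is None:
                continue
            value = match.group(0)
            if rule.substitute is not None:
                pattern, replacement = rule.substitute
                value = re.sub(pattern, replacement, value)
            values.append(value)
    return values


print(apply_rules("1994 HONDA ACCORD EX 100K miles", MILEAGE_RULES))
print(apply_rules("only 72,500 mi. on it", MILEAGE_RULES))
```

Keeping the rules as data rather than code mirrors the declarative style of the slide: adding a rule means adding a record, not changing the matcher.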

DLLS 2003 11

Lexicons
- Repositories of enumerable classes of lexical information
- FirstNames, LastNames, USstates, ProvoOremApts, CarMakes, Drugs, CampGroundFeats, etc.
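Where a data frame describes a pattern, a lexicon simply enumerates its members. One common way to use such a list, shown here as a sketch under assumptions rather than as the group's method, is to compile the entries into a single case-insensitive alternation. The function names and the three sample entries are illustrative; the actual CarMakes lexicon is not reproduced on the slide.

```python
import re


def compile_lexicon(entries):
    """Compile lexicon entries into one whole-word, case-insensitive pattern."""
    # Longest entries first, so multi-word entries win over their prefixes.
    ordered = sorted(set(entries), key=len, reverse=True)
    alternation = "|".join(re.escape(entry) for entry in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def find_entries(pattern, text):
    """Return every lexicon hit in text, in order of appearance."""
    return [match.group(0) for match in pattern.finditer(text)]


# Illustrative entries only.
CAR_MAKES = compile_lexicon(["Subaru", "Honda", "Hyundai"])

print(find_entries(CAR_MAKES, "1994 HONDA ACCORD EX, 1989 Subaru SW"))
```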

DLLS 2003 12

Accessing the output
- Extracted information is stored in a relational database
- Results can be queried using SQL
- Wide range of views is possible
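As an illustration of querying extracted records with SQL, the sketch below loads the car-ad rows from the Recognition and Extraction slide into an in-memory SQLite database and joins cars to their features. The schema, the table and column names, and the assignment of values to columns where the slide leaves cells blank are all assumptions; the slides do not show the actual database layout.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with an assumed two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Car (
            id TEXT PRIMARY KEY, year INTEGER, make TEXT, model TEXT,
            mileage TEXT, price TEXT, phone TEXT
        );
        CREATE TABLE CarFeature (car_id TEXT, feature TEXT);
    """)
    conn.executemany("INSERT INTO Car VALUES (?, ?, ?, ?, ?, ?, ?)", [
        ("0001", 1989, "Subaru", "SW", None, "$1900", "(336)835-8597"),
        ("0002", 1998, None, "Elantra", None, None, "(336)526-5444"),
        ("0003", 1994, "HONDA", "ACCORD EX", "100K", None, "(336)526-1081"),
    ])
    conn.executemany("INSERT INTO CarFeature VALUES (?, ?)", [
        ("0001", "Auto"), ("0001", "AC"),
        ("0002", "Black"), ("0002", "4 door"), ("0002", "tinted windows"),
        ("0002", "Auto"), ("0002", "pb"), ("0002", "ps"), ("0002", "cruise"),
        ("0002", "am/fm"), ("0002", "cassette stereo"), ("0002", "a/c"),
        ("0003", "Auto"), ("0003", "jade green"), ("0003", "gold"),
    ])
    return conn


def cars_with_feature(conn, feature):
    """Return (id, year) for every car that lists the given feature."""
    query = """
        SELECT c.id, c.year
        FROM Car AS c
        JOIN CarFeature AS f ON f.car_id = c.id
        WHERE lower(f.feature) = lower(?)
        ORDER BY c.id
    """
    return conn.execute(query, (feature,)).fetchall()


conn = build_demo_db()
print(cars_with_feature(conn, "Auto"))
```

Because features live in their own table, a car can carry any number of them, which matches the `Car [0..*] has Feature [1..*]` line of the ontology.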

DLLS 2003 13

Finding jobs in linguistics
- Linguistlist.org, LSA
- Email distribution lists (corpora, langage naturelle, CAAL/ACLA, etc.)
- Usual commercial sites (monster.com, flipdog.com, dice.com)
- Word-of-mouth sources

DLLS 2003 14

Sproat’s analysis
- Random sample (224/2250) of LinguistList postings, 1994-2001
- Development vs. research, academic vs. industrial
- Linguists are most often (approx. 80% of the time) offered development jobs
- Linguists hired more for specific tasks (e.g. grammar, lexicon development) rather than for more general research-oriented tasks (e.g. creating new technological approaches)

DLLS 2003 15

The banner years

Year        Academia  Industry  % Industry
1994        27        2         7%
1995        45        5         10%
1996        52        3         5%
1997        48        3         6%
1998        57        3         5%
1999        56        14        20%
2000        55        43        39%
2001 (mid)  22        10        31%

Dramatic rise in 1999, 2000

Steep drop-off since 2001

Rising demand for technical, computational skills
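The % Industry column can be recomputed from the two count columns. The sketch below assumes the percentage is industry postings divided by all postings for that year, rounded to a whole number; the slide does not state the formula. Under that assumption most rows reproduce the listed value, while the 2000 row comes out at 44% rather than the listed 39%, so one of the figures in that row may have been altered in transcription.

```python
def industry_share(academia, industry):
    """Industry postings as a whole-number percentage of all postings."""
    return round(100 * industry / (academia + industry))


# (year, academia, industry) as listed in the table.
COUNTS = [
    ("1994", 27, 2), ("1995", 45, 5), ("1996", 52, 3), ("1997", 48, 3),
    ("1998", 57, 3), ("1999", 56, 14), ("2000", 55, 43), ("2001 (mid)", 22, 10),
]

for year, academia, industry in COUNTS:
    print(year, f"{industry_share(academia, industry)}%")
```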

DLLS 2003 16

Linguistic jobs ontology
- Why?
  - user-specifiable constraints
- Somewhat closely follows existing ontologies (e.g. jobs, software)

DLLS 2003 17

Data frames and lexicons
- Language names
  - Ethnologue
- (Sub)fields of linguistics
  - Linguistlist.org
- Tools, toolkits
- Software components, programming languages
- Linguistics-related job titles
- Activities
- Responsibilities
- Country names

DLLS 2003 18

The corpus
- 3237 postings (LinguistList, Corpora, LN, WoM):
  - 1998: 541
  - 1999: 575
  - 2000: 871
  - 2001: 952
  - 2002: 788
- Some noise (non-English, factored, program descriptions, attachments, etc.)
- Semi-automatic edits (boilerplate, publicity blurbs about institutions, etc.)

DLLS 2003 19

Sample output Here

DLLS 2003 20

Observations
- 270 don’t have linguist* (!)
- Demand for knowledge of English equals that for all other languages combined (G, F, S, J, C)
- Computer/computational background required for almost 1/3 (1116)
- Noticeable amount of headhunting, particularly in Seattle, DC areas
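The first observation rests on a wildcard search for linguist*. A count of that kind could be produced along the following lines. This is only a sketch: the regular expression is one plausible reading of the wildcard, and the three sample postings are invented for illustration and are not drawn from the corpus.

```python
import re

# "linguist*" read as: the stem "linguist" plus any word characters.
LINGUIST = re.compile(r"\blinguist\w*", re.IGNORECASE)


def count_without(postings, pattern=LINGUIST):
    """Count postings in which the pattern never occurs."""
    return sum(1 for text in postings if pattern.search(text) is None)


# Invented sample postings, for illustration only.
sample = [
    "Computational Linguist needed for grammar development",
    "Speech recognition engineer, Seattle",
    "Lecturer in linguistics",
]

print(count_without(sample))
```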

DLLS 2003 21

Programming languages
[Bar chart, vertical axis 0 to 700. Categories: C/C++, CGI, HTML/SGML, Java/Jscript, Lisp/Python, Perl, Prolog, SQL, Tcl, VB, XML/XSLT]

DLLS 2003 22

Popular subfields
[Bar chart, vertical axis 0 to 700. Categories: IE/IR, Morpho, NLP, Phonetics, Phonology, Pragmatics, Speech, Syntax, Semantics, MT, TESOL/EFL, Translation]

DLLS 2003 23

Subfields (another perspective)
[Bar chart, vertical axis 0 to 800. Categories: Psycho, Neuro, Historical, Typological, Acquisition, Cognition, Socioling, Lexicography, Philology, Philosophy, Anthropo]

DLLS 2003 24

An engineering discipline?
- 160 linguistics jobs ending in “engineer”
- Software development cycle
  - research e., software design e.
  - development e., software e.
  - software quality e., linguistic test e., linguistic quality e.
  - linguistic support e., user experience e.
  - presales e., technical sales e.
- Specific subfields
  - web site e.
  - speech e., voice recognition e., speech recognition application e., speech e., ASR tuning e., audio e.
  - dialog e.
  - tools e.
  - AI e., NLP e.
  - knowledge e.
  - linguist e., natural language e.
  - staff e.
  - human factors e., user interface e.

DLLS 2003 25

Paradigms
[Bar chart, vertical axis 0 to 300. Categories: Machine learning, Finite-state, Statistical, Stoch/Prob, Math, Generative, Field Methods]

DLLS 2003 26

Other observations
- Often a job title is not even listed (!)
- More i18n of data frames (e.g. email, ph. #)
- Great need for (preferably hierarchical) lexical repositories related to
  - linguistics job titles
  - theoretical frameworks, subfields
  - typical linguist job activities
  - linguistic research/development venues