The Skills System

SKILL: A System for Skill Identification and Normalization

Meng Zhao, Faizan Javed, Ferosh Jacob, Matt McNair

Upload: meng-zhao

Posted on 18-Jul-2015

Category: Data & Analytics


TRANSCRIPT

Page 1: The Skills System

SKILL: A System for Skill Identification and Normalization

Meng Zhao, Faizan Javed, Ferosh Jacob, Matt McNair

Page 2: The Skills System

© 2014 CareerBuilder

Page 4: The Skills System

Diagram labels: Taxonomy, Surface Forms, Normalized Entity Name, Noises, Selected Sections, Deduplication

Page 5: The Skills System

Diagram labels: Blacklist; Wiki Category Tags; BLS SOC System; Capability, Knowledgeability, Technology, Terminology; Surface Forms; categories; keywords like school, company, person, etc.

Page 6: The Skills System

Most Likely Sense

Skills Sense

(BI -> Business Intelligence)

Google Search (SVM -> Support Vector Machine)

Page 7: The Skills System

Tokenize input text and assemble n-grams

Match n-grams directly with Taxonomy
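
The two steps above (tokenize and assemble n-grams, then match them directly against the taxonomy) can be sketched as follows. This is a minimal illustration, not the system's actual code: the tokenizer regex, the `max_n` limit, and the toy taxonomy entries are all assumptions made for the example. It also shows why direct matching alone is noisy: "Date of Birth" yields a match on "birth".

```python
import re

def ngrams(tokens, max_n=3):
    """Yield every contiguous n-gram (joined by spaces) for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def match_taxonomy(text, taxonomy, max_n=3):
    """Return {surface form found in text: normalized entity name}."""
    # Assumed tokenizer: lowercase runs of letters, digits, '+' and '#'.
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    return {g: taxonomy[g] for g in ngrams(tokens, max_n) if g in taxonomy}

# Hypothetical toy taxonomy: surface form -> normalized entity name.
TOY_TAXONOMY = {
    "java": "Java",
    "machine learning": "Machine Learning",
    "ml": "Machine Learning",
    "birth": "Birth",
}

matches = match_taxonomy(
    "Date of Birth: 1980. Skills: Java, machine learning", TOY_TAXONOMY
)
print(matches)
```

The spurious "birth" match is the kind of noise that a later relevance-filtering step would need to remove.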

Date of Birth

Birth Childbirth Doomed

Page 8: The Skills System

• Neural Network Language Model

• Input is a corpus and output is a Huffman tree

• Given a word, predicts its context (or, conversely, predicts a word from its context)

• Mikolov, T. et al., ICLR 2013 [1]

• Don’t count, predict! (Baroni et al., 2014 [2])

[1] Mikolov, T., Chen, K., Corrado, G., and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[2] Baroni, M., Dinu, G., and Kruszewski, G. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL.

Page 9: The Skills System

• Training data: surface forms ONLY

• Replace each whitespace run (regex ‘\s+’) with ‘_’

• Vector size: 200

• Skip-gram model with hierarchical softmax (Mikolov et al., ASRU 2011*)

• Min-count: 1
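
The whitespace substitution in the list above turns each multi-word surface form into a single training token. A minimal sketch follows; the function name, the `strip()` call, and the key names in the settings dictionary are assumptions for illustration, and only the values (200, skip-gram, hierarchical softmax, min-count 1) come from the slide.

```python
import re

# Settings as listed on the slide; the key names are illustrative only and
# do not correspond to any particular word2vec implementation's parameters.
TRAINING_SETTINGS = {
    "vector_size": 200,
    "model": "skip-gram",
    "hierarchical_softmax": True,
    "min_count": 1,
}

def to_token(surface_form):
    """Collapse each whitespace run to '_' so the phrase becomes one token."""
    # strip() is an added assumption so stray edge whitespace does not
    # produce leading or trailing underscores.
    return re.sub(r"\s+", "_", surface_form.strip())

tokens = [to_token(s) for s in ["business intelligence", "support  vector machine", "java"]]
print(tokens)
```

Keeping min-count at 1 means even surface forms seen only once in the training data still receive a vector.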

* Mikolov, T., Deoras, A., Povey, D., Burget, L., and Černocký, J. 2011. Strategies for Training Large Scale Neural Network Language Models. ASRU.

Diagram labels: Taxonomy, Surface Forms, Normalized Entity Name, Word2vec Vectors

Page 10: The Skills System

• Collect seed skill surface forms by direct matching

• For each seed surface form x_i, count the number of other seed surface forms that appear in its vector

• Choose skills by a user-defined cutoff on confidence scores (default: 0.5)

• If the number of words is below 150, return all skills
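
The selection steps above can be sketched as below. The slide does not say how the count is turned into a confidence score, so this sketch assumes the score is the fraction of the other seeds found among a seed's related terms; it also assumes "its vector" can be represented as a set of related surface forms. The function name, the `neighbors` mapping, and the toy data are all hypothetical.

```python
def select_skills(seeds, neighbors, n_words, cutoff=0.5, short_text_words=150):
    """Filter seed surface forms by how many other seeds relate to them.

    neighbors maps a surface form to the set of surface forms assumed to be
    related to it (for example, its nearest word2vec neighbours).
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicates, keep order
    # Short inputs (and single-seed inputs) skip filtering entirely.
    if n_words < short_text_words or len(seeds) < 2:
        return seeds
    selected = []
    for seed in seeds:
        related = neighbors.get(seed, set())
        hits = sum(1 for other in seeds if other != seed and other in related)
        score = hits / (len(seeds) - 1)  # assumed normalization
        if score >= cutoff:
            selected.append(seed)
    return selected

# Hypothetical toy data: "birth" was matched directly but is unrelated
# to the other seeds, so it scores 0.
SEEDS = ["java", "python", "sql", "birth"]
NEIGHBORS = {
    "java": {"python", "sql"},
    "python": {"java", "sql"},
    "sql": {"java", "python"},
    "birth": {"childbirth"},
}

long_doc = select_skills(SEEDS, NEIGHBORS, n_words=300)
short_doc = select_skills(SEEDS, NEIGHBORS, n_words=100)
print(long_doc, short_doc)
```

Under these assumptions, a seed that co-occurs with few related seeds falls below the cutoff in long texts, while short texts return every directly matched skill.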

Page 11: The Skills System

• Taxonomy precision: 90%.

• Taxonomy recall: 70%, based on CB Taxonomy ∩ ESCO Taxonomy (50K vs. 5K).

• ESCO is a systematic EU government initiative for complete workforce analytics.

• Tagging: precision 82%; recall 70%.

% of Approved Skills    # of Responses    Cumulative %
100%                    902 (28%)         28%
90% - 99%               661 (21%)         49%
80% - 89%               618 (19%)         68%
70% - 79%               432 (13%)         81%
60% - 69%               251 (8%)          89%
50% or less             352 (11%)         100%
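
As a quick arithmetic check, the per-row and cumulative percentages in the table can be recomputed from the response counts alone. The snippet below is only a sketch of that check; the variable names are illustrative, and rounding to whole percentages is assumed to match the table.

```python
# Response counts per approval band, in table order (from the slide).
counts = [902, 661, 618, 432, 251, 352]

total = sum(counts)  # total number of responses
running = 0
rows = []
for count in counts:
    running += count
    share = round(100 * count / total)         # share of responses in this band
    cumulative = round(100 * running / total)  # share in this band or above
    rows.append((share, cumulative))

print(total, rows)
```

The cumulative column is computed from the running count rather than by summing rounded shares, which avoids accumulating rounding error.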

Page 12: The Skills System

Web service
