the skills system

SKILL: A System for Skill Identification and Normalization

Meng Zhao, Faizan Javed, Ferosh Jacob, Matt McNair


Taxonomy

Surface Forms

Normalized Entity Name

Noises

Selected Sections

Deduplication

© 2014 CareerBuilder

Blacklist Wiki

Category Tags

BLS SOC

System

Capability, Knowledgeability, Technology, Terminology


Surface Forms

categories

keywords like school, company, person, etc.


Most Likely Sense

Skills Sense

(BI -> Business Intelligence)

Google Search (SVM -> Support Vector Machine)


Tokenize input text and assemble n-grams

Match n-grams directly with Taxonomy
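The two steps above can be sketched as follows. This is a minimal illustration, not the system's actual code: the tokenizer regex, the `max_n` limit, and the toy `taxonomy` dictionary (surface form to normalized entity name) are all assumptions made for the example.

```python
import re

def ngrams(tokens, max_n=3):
    """Yield every n-gram of 1..max_n consecutive tokens, joined by spaces."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def match_taxonomy(text, taxonomy, max_n=3):
    """Return (surface form, normalized name) pairs found by exact n-gram lookup.

    `taxonomy` is a hypothetical dict: lowercase surface form -> normalized name.
    """
    # Assumed tokenizer: lowercase, keep letters, digits, '+' and '#'.
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    return [(g, taxonomy[g]) for g in ngrams(tokens, max_n) if g in taxonomy]

# Toy taxonomy, for illustration only.
taxonomy = {
    "java": "Java",
    "machine learning": "Machine Learning",
}
matches = match_taxonomy("Experienced in Java and machine learning.", taxonomy)
```

Direct matching of this kind ignores context, so it returns every n-gram that happens to be a surface form, whether or not it is used as a skill in the text.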

Date of Birth

Birth Childbirth Doomed


• Neural Network Language Model

• Input is a corpus and output is a Huffman tree

• Given a word, predicts its context (or vice versa)

• Mikolov, T. et al., ICLR 2013¹

• Don’t count, predict! (Baroni et al., 2014²)


1 Mikolov, T., Chen, K., Corrado, G., and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

2 Baroni, M., Dinu, G., and Kruszewski, G. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL.


• Training data: surface forms ONLY

• Replace each whitespace run (regex '\s+') with '_'

• Vector size: 200

• Skip-gram model with hierarchical softmax (Mikolov et al., ASRU 2011*)

• Min-count: 1
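The settings above could be expressed roughly as below. This is a sketch under assumptions: the slides do not name the word2vec implementation, so the keyword names follow gensim 4.x, `negative=0` is added so that only hierarchical softmax is used, and the shape of `sentences` (one list of surface-form tokens per document) is a guess.

```python
import re

def to_token(surface_form):
    """Collapse each whitespace run to '_' so a multi-word surface form is one token."""
    return re.sub(r"\s+", "_", surface_form.strip())

# Settings from the slide, written with gensim 4.x keyword names (assumed library).
W2V_PARAMS = {
    "vector_size": 200,  # vector size: 200
    "sg": 1,             # skip-gram model
    "hs": 1,             # hierarchical softmax
    "negative": 0,       # assumption: turn off negative sampling
    "min_count": 1,      # min-count: 1
}

def train(sentences):
    """Train on lists of surface-form tokens; needs gensim >= 4.0 installed."""
    from gensim.models import Word2Vec  # imported lazily so the sketch loads without gensim
    return Word2Vec(sentences=sentences, **W2V_PARAMS)

example = [to_token(s) for s in ["machine  learning", "java", "support vector machine"]]
```

With a min-count of 1, even a surface form seen only once gets a vector.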


* Mikolov, T., Deoras, A., Povey, D., Burget, L., and Černocký, J. 2011. Strategies for Training Large Scale Neural Network Language Models. ASRU.

Taxonomy

Surface Forms

Normalized Entity Name

Word2vec Vectors


• Collect seed skills surface forms by direct matching

• For each seed surface form x_i, count how many of the other seed surface forms appear in its vector

• Choose skills by a user-defined cutoff on confidence scores. The default is 0.5.

• If the input has fewer than 150 words, return all skills.
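The selection steps above might look like the sketch below. The slides do not give the formula for the confidence score, so this example assumes it is the fraction of the other seeds that appear in a seed's vector, and it treats that "vector" as a precomputed set of related surface forms. The names `neighbors`, `score_seeds`, and `select_skills` are hypothetical.

```python
def score_seeds(seeds, neighbors):
    """Score each seed by the share of the other seeds found in its vector.

    `neighbors` is a hypothetical dict: surface form -> set of related surface forms.
    The normalization (hits / number of other seeds) is an assumption.
    """
    seeds = set(seeds)
    scores = {}
    for seed in seeds:
        others = seeds - {seed}
        if not others:
            scores[seed] = 1.0  # a lone seed has nothing to be compared against
            continue
        hits = len(others & neighbors.get(seed, set()))
        scores[seed] = hits / len(others)
    return scores

def select_skills(seeds, neighbors, n_words, cutoff=0.5, min_words=150):
    """Keep seeds scoring at or above the cutoff; short inputs keep every seed."""
    if n_words < min_words:
        return sorted(set(seeds))
    scores = score_seeds(seeds, neighbors)
    return sorted(s for s, c in scores.items() if c >= cutoff)

# Toy data, for illustration only.
neighbors = {
    "java": {"python", "sql"},
    "python": {"java", "sql"},
    "sql": {"java", "python"},
    "birth": set(),
}
seeds = ["java", "python", "sql", "birth"]
long_doc = select_skills(seeds, neighbors, n_words=200)
short_doc = select_skills(seeds, neighbors, n_words=100)
```

In the toy data, "birth" shares no neighbors with the other seeds, so it falls below the cutoff for a long document but is kept for a short one.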



• Taxonomy Precision: 90%.

• Taxonomy Recall: 70%. CB Taxonomy ∩ ESCO Taxonomy (50K vs 5K).

• ESCO is a systematic EU government initiative for complete workforce analytics.

• Tagging: Precision 82%; Recall: 70%.
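For reference, precision and recall over sets of skills can be computed as in the sketch below. It assumes the standard set-based definitions; the slides do not spell out the exact evaluation protocol, and the sample sets are invented for illustration.

```python
def precision_recall(predicted, gold):
    """Standard set-based precision and recall (assumed definitions)."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall

# Invented example: three tagged skills, four reference skills.
p, r = precision_recall({"java", "python", "birth"}, {"java", "python", "sql", "r"})
```

For the taxonomy recall figure, `gold` would correspond to the reference taxonomy and the overlap to the intersection of the two taxonomies, under the same assumption.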


% of Approved Skills    # of Responses    Cumulative %
100%                    902 (28%)         28%
90% - 99%               661 (21%)         49%
80% - 89%               618 (19%)         68%
70% - 79%               432 (13%)         81%
60% - 69%               251 (8%)          89%
50% or less             352 (11%)         100%
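The percentage columns follow from the response counts. A quick check, using only the counts in the table and rounding to whole percentages:

```python
# Response counts per approval bucket, in table order (100% down to 50% or less).
counts = [902, 661, 618, 432, 251, 352]
total = sum(counts)

# Share of responses in each bucket, as a whole-number percentage.
shares = [round(100 * c / total) for c in counts]

# Cumulative percentage, computed from running totals to avoid compounding rounding error.
cumulative = []
running = 0
for c in counts:
    running += c
    cumulative.append(round(100 * running / total))
```

Read cumulatively, 68% of the responses approved at least 80% of the skills shown.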


Web service


http://ec2-54-167-175-244.compute-1.amazonaws.com/SkillsServices/



