SKILL: A System for Skill Identification and Normalization
Meng Zhao, Faizan Javed, Ferosh Jacob, Matt McNair
© 2014 CareerBuilder
Taxonomy
Surface Forms
Normalized Entity Name
Noises
Selected Sections
Deduplication
Blacklist
Wiki Category Tags
BLS SOC System
Capability, Knowledgeability, Technology, Terminology
Surface
Forms
Categories
Keywords like school, company, person, etc.
Most Likely Sense
Skills Sense
(BI -> Business Intelligence)
Google Search (SVM -> Support Vector Machine)
Tokenize input text and assemble n-grams
Match n-grams directly with Taxonomy
Date of Birth
Birth Childbirth Doomed
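The tokenize-and-match step can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: the `surface_to_entity` dictionary, the tokenizer regex, the `max_n` limit, and the longest-match-first rule are all assumptions made for the example.

```python
import re

def ngrams(tokens, max_n):
    """Yield (start, n, phrase) for all n-grams, longest n first."""
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            yield i, n, " ".join(tokens[i:i + n])

def match_taxonomy(text, surface_to_entity, max_n=3):
    """Return normalized entity names whose surface forms occur in text.

    surface_to_entity is a hypothetical dict mapping a lowercased
    surface form to its normalized entity name.
    """
    # Assumed tokenizer: lowercase runs of letters, digits, '+' and '#'.
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    covered = set()   # token positions already claimed by a longer match
    found = []
    for start, n, phrase in ngrams(tokens, max_n):
        span = range(start, start + n)
        if phrase in surface_to_entity and not covered.intersection(span):
            covered.update(span)
            found.append((start, surface_to_entity[phrase]))
    # Report matches in the order they appear in the text.
    return [entity for _, entity in sorted(found)]

taxonomy = {
    "bi": "Business Intelligence",
    "business intelligence": "Business Intelligence",
    "support vector machine": "Support Vector Machine",
    "java": "Java",
}
skills = match_taxonomy(
    "Built BI dashboards in Java using a support vector machine.", taxonomy
)
```

Here `skills` holds the normalized names in text order. Direct matching of this kind only collects candidates; it does not decide whether a matched surface form is really used as a skill in context.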
• Neural Network Language Model
• Input is a corpus and output is a Huffman tree
• Given a word, predicts its context (or vice versa)
• Mikolov, T. et al., ICLR 2013 [1]
• Don’t count, Predict! (Baroni et al. 2014 [2])
[1] Mikolov, T., Chen, K., Corrado, G., and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
[2] Baroni, M., Dinu, G., and Kruszewski, G. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL.
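To make "given a word, predict the context" concrete, here is a small sketch of how skip-gram training pairs are typically formed from a token sequence. The window size and the function name `skipgram_pairs` are illustrative assumptions; the slides do not state the window used.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs within `window` tokens of each center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["java", "sql", "python"], window=1)
```

A skip-gram model is trained to predict the second element of each pair from the first; the CBOW variant goes the other way and predicts the center word from its context.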
• Training data: surface forms ONLY
• Substitute '\\s+' with '_'
• Vector size: 200
• Skip-gram model with hierarchical softmax (Mikolov et al., ASRU 2011*)
• Min-count: 1
* Mikolov, T., Deoras, A., Povey, D., Burget, L., and Černocký, J. 2011. Strategies for Training Large Scale Neural Network Language Models. ASRU.
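The training-data preparation above might look like the sketch below. It assumes each document has already been reduced to its list of matched surface forms; the names `to_token`, `build_corpus`, and `TRAINING_SETTINGS` are hypothetical, and the settings dictionary only restates the values on the slide in generic terms that would need mapping to whichever word2vec implementation is used.

```python
import re

def to_token(surface_form):
    """Collapse each whitespace run to '_' so a phrase becomes one token."""
    return re.sub(r"\s+", "_", surface_form.strip())

def build_corpus(docs_surface_forms):
    """Turn per-document surface-form lists into token sequences."""
    return [[to_token(sf) for sf in doc] for doc in docs_surface_forms]

# Values taken from the slide; the key names are illustrative only.
TRAINING_SETTINGS = {
    "vector_size": 200,
    "architecture": "skip-gram",
    "output_layer": "hierarchical softmax",
    "min_count": 1,
}

corpus = build_corpus([
    ["support vector machine", "java"],
    ["business  intelligence", "sql"],
])
```

Joining multi-word surface forms with underscores means each skill phrase receives a single vector, and a min-count of 1 keeps rare surface forms in the vocabulary.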
Taxonomy
Surface Forms
Normalized Entity Name
Word2vec Vectors
• Collect seed skills surface forms by direct matching
• For each seed surface form 𝑥𝑖, count the other seed surface forms that appear in its vector
• Choose skills by a user-defined cutoff on confidence scores. The default is 0.5.
• If # of words < 150, return all skills.
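The selection steps above can be sketched as follows. The slides do not define how the count becomes a confidence score, so this example assumes the score is the fraction of the other seeds found among a seed's word2vec neighbors; the `neighbors` dictionary, the function names, and the handling of a lone seed are likewise assumptions.

```python
def score_seeds(seeds, neighbors):
    """Score each seed by the share of other seeds among its neighbors.

    neighbors is a hypothetical dict: surface form -> list of related
    surface forms taken from its word2vec vector.
    """
    seed_set = set(seeds)
    scores = {}
    for seed in seed_set:
        others = seed_set - {seed}
        if not others:
            scores[seed] = 1.0  # assumption: a lone seed is kept
            continue
        hits = len(others & set(neighbors.get(seed, ())))
        scores[seed] = hits / len(others)
    return scores

def select_skills(seeds, neighbors, n_words, cutoff=0.5, min_words=150):
    """Apply the cutoff, or return every seed for short inputs."""
    if n_words < min_words:
        return sorted(set(seeds))
    scores = score_seeds(seeds, neighbors)
    return sorted(s for s, c in scores.items() if c >= cutoff)

seeds = ["java", "sql", "python", "bi"]
neighbors = {
    "java": ["sql", "python", "spring"],
    "sql": ["java", "python"],
    "python": ["java", "sql"],
    "bi": ["birth"],
}
long_doc = select_skills(seeds, neighbors, n_words=200)
short_doc = select_skills(seeds, neighbors, n_words=100)
```

In this toy input, "bi" shares no neighbors with the other seeds and falls below the cutoff for the long document, while the short document keeps every seed because there is too little context to score against.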
• Taxonomy Precision: 90%.
• Taxonomy Recall: 70%. CB Taxonomy ∩ ESCO Taxonomy (50K vs 5K).
• ESCO is a systematic EU government initiative for complete workforce analytics.
• Tagging: Precision: 82%; Recall: 70%.
% of Approved Skills    # of Responses    Cumulative %
100%                    902 (28%)         28%
90% - 99%               661 (21%)         49%
80% - 89%               618 (19%)         68%
70% - 79%               432 (13%)         81%
60% - 69%               251 (8%)          89%
50% or less             352 (11%)         100%
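The share and cumulative columns in the table follow from the response counts alone, as this short check illustrates. The counts are copied from the table; the function name `distribution` and rounding to whole percentages are assumptions.

```python
rows = [
    ("100%", 902),
    ("90% - 99%", 661),
    ("80% - 89%", 618),
    ("70% - 79%", 432),
    ("60% - 69%", 251),
    ("50% or less", 352),
]

def distribution(rows):
    """Return (label, count, share %, cumulative %) for each bucket."""
    total = sum(count for _, count in rows)
    running = 0
    out = []
    for label, count in rows:
        running += count
        share = round(100 * count / total)
        cumulative = round(100 * running / total)
        out.append((label, count, share, cumulative))
    return out

table = distribution(rows)
```

The six buckets sum to 3,216 responses, and the rounded shares and running totals agree with the percentages listed in the table.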
Web service