Indexing UMLS concepts with Apache Lucene
Julien Thibault
[email protected]
University of Utah
Department of Biomedical Informatics
Outline
• Goals
• Unified Medical Language System (UMLS)
• Apache Lucene
• Get to work!
Goals
• Build a dictionary lookup module for NLP pipelines
– Input: string (e.g. “diabetes”, “breast cancer”, “warfarin”)
– Output: list of concepts (e.g. “C083562”)
• Application examples:
– Unstructured clinical document coding
– (Semi)automated literature indexing
• Pre-processing necessary for free text (not covered today):
– Tokenization
– Sentence detection
– Part-of-speech tagging (e.g. to look up only noun phrases)
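The input/output contract of the lookup module can be illustrated with a minimal in-memory sketch. This is not the Lucene-based approach built later in the deck; the class name `ConceptDictionary` and the trim-plus-lower-case normalization are assumptions made for illustration, and the sample entries are taken from the MRCONSO example shown later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory dictionary: maps a normalized string to concept IDs (CUIs).
final class ConceptDictionary {
    private final Map<String, List<String>> cuisByTerm = new HashMap<>();

    // Assumed normalization: trim and lower-case. A real pipeline would do more.
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    // Register one name (synonym) for a concept.
    void add(String term, String cui) {
        List<String> cuis = cuisByTerm.computeIfAbsent(normalize(term), k -> new ArrayList<>());
        if (!cuis.contains(cui)) {
            cuis.add(cui);
        }
    }

    // Input: a string. Output: the list of matching CUIs (empty if none).
    List<String> lookup(String input) {
        List<String> cuis = cuisByTerm.get(normalize(input));
        return cuis == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(cuis);
    }

    public static void main(String[] args) {
        ConceptDictionary dict = new ConceptDictionary();
        dict.add("AIDS", "C0001175");
        dict.add("SIDA", "C0001175");
        System.out.println(dict.lookup("aids"));
        System.out.println(dict.lookup("flu"));
    }
}
```

An exact-match map like this cannot handle partial matches or ranking, which is the gap a search library is meant to fill.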
UMLS
• Unified Medical Language System (NLM)
– Millions of organized biomedical concepts
– Over 150 sources (e.g. SNOMED-CT, LOINC, NCI, MeSH)
– Good source for indexing biomedical concepts!
– UMLS Terminology Services: https://uts.nlm.nih.gov/home.html
• Content
– Concepts, synonymous names, relationships
– Semantic network (high-level classification): organism, anatomical structure, biologic function, chemical, …
• Distribution
– Files with concept and relationship description data
– Loadable into a database for querying
– Files/columns: http://www.ncbi.nlm.nih.gov/books/NBK9685/
UMLS schema
• 19 files to describe:
– Concepts
– Relationships
– The files (columns and content)
• MRCONSO
– Concept names and sources
• MRSTY
– Concept semantic types
• Terminology (source) codes
– http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html
Concept table (MRCONSO)
CUI: concept unique ID; LAT: language of term; LUI: term unique ID; SAB: source; STR: string
• MySQL database
– mysql -u [user] -h [host] -D [database] -p
– Replace with provided info (thanks Kristina!!)
• Query example:
select * from MRCONSO where STR like 'my favorite disease';

CUI       LAT  LUI       SAB       STR                                  …
C0001175  ENG  L0001175  MSH       Acquired Immunodeficiency Syndromes  …
C0001175  ENG  L0001842  SNOMEDCT  AIDS                                 …
C0001175  FRE  L0162173  SNOMEDCT  SIDA                                 …
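The same kind of query can be issued from Java through JDBC. The following is a sketch under assumptions: the class and method names are hypothetical, the `Connection` is expected to come from the MySQL connector with the credentials provided for the tutorial, and substring matching (`%text%`) is a choice made here, whereas the query above matches the whole string.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that looks up concept names in MRCONSO.
final class MrconsoQuery {
    // Column names as shown on the slide (CUI, LAT, STR).
    static final String SQL =
            "SELECT CUI, STR FROM MRCONSO WHERE LAT = ? AND STR LIKE ?";

    // Escape LIKE wildcards so the input matches literally, then wrap in %...%.
    // Assumes backslash is the LIKE escape character (the MySQL default).
    static String containsPattern(String text) {
        String escaped = text.replace("\\", "\\\\")
                             .replace("%", "\\%")
                             .replace("_", "\\_");
        return "%" + escaped + "%";
    }

    // Returns {CUI, STR} pairs for names in the given language containing the text.
    static List<String[]> find(Connection conn, String language, String text)
            throws SQLException {
        List<String[]> rows = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            ps.setString(1, language);               // e.g. "ENG"
            ps.setString(2, containsPattern(text));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(new String[] { rs.getString("CUI"), rs.getString("STR") });
                }
            }
        }
        return rows;
    }
}
```

Using a `PreparedStatement` with placeholders keeps user input out of the SQL text itself.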
Apache Lucene
• Relational databases are not optimized for string search (e.g. partial matches, phrases)
• Apache Lucene
– http://lucene.apache.org/
– High-performance text search engine library: ranked searching (score); phrase queries, wildcard queries, proximity queries…
– Java API to build indexes and perform lookups
– Integrates nicely into UIMA
Apache Lucene index
• Indexes stored on disk and loaded at runtime
• Documents
– Index entries with indexable fields
– The set of fields does not need to be the same for each document
– Searches target one field at a time and return the whole matching document
• Default match scoring
– Higher scores = good overlap with the query, infrequent words, short fields
CUI       LAT  SAB       STR                                  EXTRA
C0001175  -    MSH       Acquired Immunodeficiency Syndromes  -
C0001175  ENG  SNOMEDCT  AIDS                                 genial
C0001175  FRE  SNOMEDCT  SIDA                                 -
(Each row is a document; each column is a field.)
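The three factors named under default match scoring (overlap, infrequent words, short fields) can be made concrete with a toy calculation. The sketch below is plain Java, not Lucene code; it is loosely modeled on the classic TF-IDF formula and leaves out several factors of the real scoring function (such as query normalization and boosts). All names are hypothetical.

```java
import java.util.List;

// Toy TF-IDF scorer over pre-tokenized fields (illustration only, not Lucene's API).
final class ToyTfIdf {
    // Term frequency: occurrences of the term in the field, dampened by a square root.
    static double tf(String term, List<String> fieldTokens) {
        int freq = 0;
        for (String t : fieldTokens) {
            if (t.equals(term)) {
                freq++;
            }
        }
        return Math.sqrt(freq);
    }

    // Inverse document frequency: terms found in fewer documents weigh more.
    static double idf(int numDocs, int docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Length normalization: a match in a short field counts more.
    static double lengthNorm(List<String> fieldTokens) {
        return 1.0 / Math.sqrt(fieldTokens.size());
    }

    // Sum over query terms of tf * idf^2 * lengthNorm for one field.
    static double score(List<String> queryTokens, List<String> fieldTokens,
                        List<List<String>> allFields) {
        if (fieldTokens.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (String q : queryTokens) {
            int docFreq = 0;
            for (List<String> f : allFields) {
                if (f.contains(q)) {
                    docFreq++;
                }
            }
            double idf = idf(allFields.size(), docFreq);
            sum += tf(q, fieldTokens) * idf * idf * lengthNorm(fieldTokens);
        }
        return sum;
    }
}
```

With this sketch, a one-token field that matches the query outranks a three-token field containing the same token, and a field with no overlap scores zero.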
Apache Lucene Analyzer
• Defines the pre-processing step applied to
– Strings indexed by Lucene
– Strings that are looked up in the index
• Components
– Tokenizer: creates token stream (e.g. based on white spaces)
– Filter: applied to token stream (e.g. lower case, stop words)
• This is a good place to customize the matching algorithm, but see also:
– Language-specific analyzers (e.g. Arabic, Chinese, Catalan)
– CustomScoreQuery (to customize scoring function)
– WildcardQuery, FuzzyQuery, RegexpQuery
– KeywordAnalyzer (no tokenization)
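The tokenizer-plus-filters idea can be sketched in plain Java. This is not Lucene's Analyzer API, only an illustration of the pipeline it describes; the class name and the tiny stop-word list are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain: whitespace tokenizer -> lower-case filter -> stop-word filter.
final class ToyAnalyzer {
    // Tiny illustrative stop list (an assumption, not Lucene's default list).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "of", "the"));

    // Tokenizer: split the text on runs of whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Filter 1: lower-case every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Filter 2: drop stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // The same chain must be applied to indexed strings and to query strings.
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }
}
```

Because both the indexed term and the query pass through the same chain, "Cancer of the Breast" and "cancer breast" end up as the same token sequence.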
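Building an index
import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
…
// create reference to the Lucene index stored on disk
Directory dir = FSDirectory.open(new File(indexPath));
// StandardAnalyzer supplies the tokenizer and filters
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter writer = new IndexWriter(dir, iwc); // get index writer
…
Document doc = new Document(); // create new entry (i.e. document)
Field myfield = new TextField("term", term, Field.Store.YES); // create field
doc.add(myfield); // add field to document
…
writer.addDocument(doc); // add document to index
…
writer.close(); // commit changes and close the index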
http://lucene.apache.org/core/4_6_0/demo/src-html/org/apache/lucene/demo/IndexFiles.html
StandardAnalyzer = StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words. Other analyzer examples: WhitespaceAnalyzer, KeywordAnalyzer.
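Field.Store.YES = the original field value is stored in the index, so it can be retrieved from a matching document (a TextField is indexed and tokenized either way)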
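Creating index queries
import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
…
// create reference to existing Lucene index stored on disk
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(indexPath)));
// prepare search (use the same analyzer as at indexing time)
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
// create query on the "term" field
QueryParser parser = new QueryParser(Version.LUCENE_40, "term", analyzer);
Query query = parser.parse("hello*"); // search for terms that start with 'hello'
// search
TopDocs results = searcher.search(query, 5); // search for top 5 matches
// collect results
ScoreDoc[] hits = results.scoreDocs; // collect matches
int numTotalHits = results.totalHits; // count number of results
…
Document doc = searcher.doc(hits[0].doc); // retrieve first matching entry (assumes at least one hit)
float score = hits[0].score; // retrieve score of first matching entry
String term = doc.get("term"); // retrieve value of field "term"
http://lucene.apache.org/core/4_6_0/demo/src-html/org/apache/lucene/demo/SearchFiles.html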
Let's get to work!
• Download necessary files
– Apache Lucene Core API: http://lucene.apache.org/core/mirrors-core-latest-redir.html?
– MySQL Java connector: http://dev.mysql.com/downloads/connector/j/
– Files for this tutorial
• Create Eclipse project
– Add necessary JAR files to build path
– Copy source files to project src folder
• Complete code to:
– Build index from MySQL query (don’t use all concepts!!)
– Create search function that returns the CUIs of matching terms
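One detail of the search function is worth sketching: several indexed terms can share a CUI (as the MRCONSO rows for C0001175 show), so the CUIs read from the ranked hits need de-duplicating. The helper below is a plain-Java sketch with hypothetical names; in a Lucene-based version, the input list would presumably be built by reading a stored CUI field from each hit in rank order.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: collapse the CUIs of ranked hits into a distinct, ordered list.
final class CuiCollector {
    // Keeps the first (best-ranked) occurrence of each CUI and drops later repeats.
    static List<String> distinctCuis(List<String> cuisInRankOrder) {
        Set<String> seen = new LinkedHashSet<>(cuisInRankOrder);
        return new ArrayList<>(seen);
    }
}
```

A `LinkedHashSet` is used because it removes duplicates while preserving insertion order, so the best-ranked concept stays first.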
Merci!
[C2986674] Thank you (NCI)