Transcript
Page 1: Build a Searchable Knowledge Base

Build a Searchable Knowledge Base

Jimmy Lai Yahoo! Search Engineer

r97922028 [at] ntu.edu.tw 2014/05/18

http://www.slideshare.net/jimmy_lai/build-a-searchable-knowledge-base

Page 2: Build a Searchable Knowledge Base

Outline

• Introduction to Knowledge Base

• Construct a Knowledge Base

• Search the Knowledge Base
  • string match
  • synonym search
  • full text search
  • geo search
  • put all together

• More Applications


Page 3: Build a Searchable Knowledge Base

Knowledge

• Knowledge is power. - Francis Bacon, 1597

• Knowledge is boundless and connected. So, an efficient interface to search and browse the knowledge base is essential.

• Let’s try to build a searchable knowledge base.


Page 4: Build a Searchable Knowledge Base

Application of Knowledge Base

Personal assistant: Siri, Google Now


Search engine: Google’s knowledge graph


Page 5: Build a Searchable Knowledge Base

Construct a Knowledge Base

1. Find good data sources.

2. Aggregate the data into knowledge entities.

3. Construct structured data for each knowledge entity.

4. Search the knowledge base.

5. Navigate the knowledge base.


Page 6: Build a Searchable Knowledge Base

Wikipedia

• A collaborative encyclopedia with more than 30M articles in 287 languages.


• A good source for a knowledge base. However, Wikipedia's data is not well structured.


http://www.theguardian.com/technology/blog/2009/aug/13/wikipedia-edits

Page 7: Build a Searchable Knowledge Base

DBpedia

• http://wiki.dbpedia.org/About

• Structured data from Wikipedia.

• A good data source for a knowledge base.


Page 8: Build a Searchable Knowledge Base


Page 9: Build a Searchable Knowledge Base

Knowledge Entity


Identifier

Abstract

Relations

Page 10: Build a Searchable Knowledge Base

What can Python do for us

• Data Wrangling
  • Process the raw text data
  • Aggregate the data from different sources
  • Output data in JSON format

• Connecting the data flow between systems
  • Automation script for starting services and feeding data
  • REST API implementing the search strategy


Page 11: Build a Searchable Knowledge Base

Example code

git clone git@github.com:jimmylai/knowledge.git

https://github.com/jimmylai/knowledge

• Required Python packages:
  1. fabric
  2. pysolr
  3. django


Page 12: Build a Searchable Knowledge Base

Data Preparation

1. Download data from DBpedia

    http://downloads.dbpedia.org/current/en/

2. Select the specific types of knowledge entity to keep

    bzcat instance_types_en.nt.bz2 | get_id_list.py

3. Parse and aggregate data entity from files.


data file                          script            data field
short_abstracts_en.nt.bz2          get_abstract.py   abstract
raw_infobox_properties_en.nt.bz2   get_relation.py   relations
geo_coordinates_en.nt.bz2          get_geo.py        latlon
redirects_en.nt.bz2                get_redirect.py   redirects
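The parsing step can be sketched in Python as follows. This is a minimal sketch, not the code from the repository: it assumes every line of a DBpedia dump is an N-Triples statement of the form `<subject> <predicate> object .`, and the names `parse_triple` and `read_abstracts` are hypothetical.

```python
import bz2
import re

# One N-Triples statement: <subject> <predicate> object .
# The object is either another <URI> or a quoted literal with an optional
# language tag or datatype suffix (assumption: no blank nodes in the dumps).
TRIPLE_RE = re.compile(
    r'^<(?P<s>[^>]*)>\s+<(?P<p>[^>]*)>\s+'
    r'(?:<(?P<uri>[^>]*)>|"(?P<lit>(?:[^"\\]|\\.)*)"(?:@[\w-]+|\^\^<[^>]*>)?)'
    r'\s*\.\s*$'
)


def parse_triple(line):
    """Return (subject, predicate, object), or None for comments and bad lines.

    Escape sequences inside literals are returned as-is, not decoded.
    """
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    obj = match.group("uri")
    if obj is None:
        obj = match.group("lit")
    return match.group("s"), match.group("p"), obj


def read_abstracts(path, wanted_ids):
    """Map entity URI -> abstract for the entities listed in wanted_ids."""
    abstracts = {}
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            triple = parse_triple(line)
            if triple and triple[0] in wanted_ids:
                abstracts[triple[0]] = triple[2]
    return abstracts
```

The same `parse_triple` helper could serve the other three files; only the way the object is stored per entity (a string, a list of redirects, a dict of relations) would differ.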

Page 13: Build a Searchable Knowledge Base

Aggregated Data Format

"http://dbpedia.org/resource/Lake_Yosemite": {
    "latlon": "37.376389,-120.428889",
    "redirects": [
        "Lake_yosemite"
    ],
    "abstract": "Lake Yosemite is an artificial freshwater lake located approximately five miles (8 km) east of Merced, California in the rolling Sierra Foothills. UC Merced is situated approximately half a mile (0.8 km) south of Lake Yosemite. The university is bounded by the lake on one side and two canals (Fairfield Canal and Le Grand Canal) run through the campus. In 2007, a myth featured in the Mythbusters' James Bond Special 1 episode was filmed and tested at Lake Yosemite.",
    "relations": {
        "type": "http://dbpedia.org/resource/Reservoir",
        "location": "http://dbpedia.org/resource/California"
    }
}
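A record of this shape could be assembled from the per-field outputs roughly as shown here. This is a sketch under assumptions: the function `aggregate` and the layout of its arguments (one dict per field, keyed by entity URI) are hypothetical and may differ from the repository's scripts.

```python
import json


def aggregate(abstracts, latlons, redirects, relations):
    """Merge per-field dicts (each keyed by entity URI) into one record per entity.

    Only entities that have an abstract are kept; the other fields are optional.
    """
    entities = {}
    for uri, abstract in abstracts.items():
        entity = {"abstract": abstract}
        if uri in latlons:
            entity["latlon"] = latlons[uri]
        if uri in redirects:
            entity["redirects"] = sorted(redirects[uri])
        if uri in relations:
            entity["relations"] = dict(relations[uri])
        entities[uri] = entity
    return entities


if __name__ == "__main__":
    uri = "http://dbpedia.org/resource/Lake_Yosemite"
    merged = aggregate(
        abstracts={uri: "Lake Yosemite is an artificial freshwater lake."},
        latlons={uri: "37.376389,-120.428889"},
        redirects={uri: ["Lake_yosemite"]},
        relations={uri: {"type": "http://dbpedia.org/resource/Reservoir"}},
    )
    print(json.dumps(merged, indent=4))
```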


Page 14: Build a Searchable Knowledge Base

Search by Solr

• Solr is a full-text, real-time search engine based on Apache Lucene.

• Provides REST-like API.

• pysolr makes Solr easy to use from Python.

• Download the latest version 4.8.0 from http://www.apache.org/dyn/closer.cgi/lucene/solr/4.8.0 and extract to solr/solr-4.8.0 dir

• Start the Solr server and then check the web UI

    fab start_solr

    http://localhost:8983/solr/


Page 15: Build a Searchable Knowledge Base

Search - String Match

• To be able to search by entity name

    python feed_data.py string_match

• config: solr/conf/string_match/schema.xml

    <field name="name" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="abstract" type="string" indexed="false" stored="true" multiValued="false"/>

• Feed the entities to Solr. Each entity has name and abstract fields.
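Preparing the documents for feeding could look like the following. This is a sketch, not the repository's feed_data.py: the helpers `entity_name` and `to_solr_docs` are hypothetical, the display name is assumed to be the last part of the DBpedia URI with underscores replaced by spaces, and the `id` field is assumed to be the schema's unique key.

```python
def entity_name(uri):
    """Derive a display name from a DBpedia resource URI (assumed convention)."""
    return uri.rsplit("/", 1)[-1].replace("_", " ")


def to_solr_docs(entities):
    """Turn aggregated entities into flat documents with name and abstract."""
    docs = []
    for uri, entity in entities.items():
        docs.append({
            "id": uri,  # assumption: the schema declares "id" as uniqueKey
            "name": entity_name(uri),
            "abstract": entity.get("abstract", ""),
        })
    return docs
```

With pysolr, a list like this could then be sent to a core with `pysolr.Solr(core_url).add(docs)`, where `core_url` points at the string_match core.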


Page 16: Build a Searchable Knowledge Base

Search - String Match


http://localhost:8983/solr/string_match/select?q=name%3A%22San+Francisco%22&wt=json&indent=true

Search by entity name.
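The query URL above can be built with the Python standard library. A minimal sketch: `select_url` and `name_query` are hypothetical helper names, and the base URL is the local server used in these slides.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # local server from the slides


def select_url(core, query, **params):
    """Build a Solr select URL for the given core and query string."""
    all_params = {"q": query, "wt": "json", "indent": "true"}
    all_params.update(params)
    return "{}/{}/select?{}".format(SOLR_BASE, core, urlencode(all_params))


def name_query(name):
    """Exact match on the name field; quotes keep multi-word names together."""
    return 'name:"{}"'.format(name.replace('"', '\\"'))


if __name__ == "__main__":
    print(select_url("string_match", name_query("San Francisco")))
```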

Page 17: Build a Searchable Knowledge Base

Search - Synonym

• To be able to search by synonym of entity name

    python feed_data.py synonym_string_match

• config: solr/conf/synonym_string_match/schema.xml

    <field name="name" type="name_text" indexed="true" stored="true" multiValued="false"/>

    <fieldType name="name_text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      …

• Restart the Solr server so that the synonym file is reloaded.
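One plausible way to produce synonyms.txt is from the redirects of each entity. The slides do not say how the file is built, so this is an assumption, and `synonym_lines` is a hypothetical helper. It writes one comma-separated line of equivalent terms per entity, the form Solr's synonym filter expands when expand="true".

```python
def synonym_lines(entities):
    """Yield one synonyms.txt line per entity that has usable redirects.

    Names containing a comma are skipped to keep the sketch simple, because
    the comma is the separator in the synonyms file.
    """
    for uri, entity in sorted(entities.items()):
        name = uri.rsplit("/", 1)[-1].replace("_", " ")
        if "," in name:
            continue
        aliases = []
        for redirect in entity.get("redirects", []):
            alias = redirect.replace("_", " ")
            # ignoreCase="true" already covers aliases that differ only by case.
            if "," not in alias and alias.lower() != name.lower():
                aliases.append(alias)
        if aliases:
            yield ", ".join([name] + aliases)
```

The lines could be written out with `"\n".join(synonym_lines(entities))` before restarting Solr.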

Page 18: Build a Searchable Knowledge Base

Synonym handling at index time


Page 19: Build a Searchable Knowledge Base

Synonym handling at query time


Page 20: Build a Searchable Knowledge Base

Search - Synonym


Search by synonym.

Page 21: Build a Searchable Knowledge Base

Search - Full Text Search

• To be able to search across both the entity name and the abstract

    python feed_data.py full_text_search

• config: solr/conf/full_text_search/schema.xml

    <copyField source="name" dest="text"/>
    <copyField source="abstract" dest="text"/>

• Feed the entities to Solr. The name and abstract fields of each entity are copied to the text field, so we can run a full-text search without specifying which field to search.


Page 22: Build a Searchable Knowledge Base

Search - Full Text Search


Page 23: Build a Searchable Knowledge Base

Search - Geo Search

• To be able to search by distance given a location

    python feed_data.py geo_search

• config: solr/conf/geo_search/schema.xml

    <field name="location" type="location" indexed="true" stored="true" required="false" multiValued="false" />

• Feed the entities to Solr. Each entity contains a location field in "lat,lon" format, for example "51.670100,-3.230100".
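A distance-filtered query against this field can be sketched as follows. The parameters `{!geofilt}`, `sfield`, `pt`, `d` (kilometres) and the `geodist()` sort function are Solr's spatial search parameters; the helper names and the assumption that the core is called geo_search (after the config directory) are mine.

```python
from urllib.parse import urlencode


def geo_filter_params(lat, lon, distance_km, field="location"):
    """Solr parameters that keep only documents within distance_km of a point."""
    return {
        "q": "*:*",
        "fq": "{!geofilt}",               # filter by distance from pt
        "sfield": field,                  # the spatial field from schema.xml
        "pt": "{:.6f},{:.6f}".format(lat, lon),
        "d": str(distance_km),
        "sort": "geodist() asc",          # nearest entities first
        "wt": "json",
    }


def geo_search_url(lat, lon, distance_km, core="geo_search"):
    """Build the select URL; the core name is an assumption."""
    params = geo_filter_params(lat, lon, distance_km)
    return "http://localhost:8983/solr/{}/select?{}".format(core, urlencode(params))
```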


Page 24: Build a Searchable Knowledge Base


Given condition on distance

Page 25: Build a Searchable Knowledge Base

Search - Put All Together

• Search Strategy

1. Input a query

2. Search by synonym match

3. Search by full text

4. If a location is given, filter the results by geo search

• Implement the search strategy as an API
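The strategy can be sketched as a plain Python function. This is not the Django view from the repository: the two search backends are passed in as callables so the sketch runs without a Solr server, a pure-Python haversine distance stands in for Solr's geo filter, and treating full text as a fallback when the synonym match finds nothing is my reading of the steps. The names `search`, `haversine_km` and `max_km` are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def search(query, synonym_search, full_text_search, location=None, max_km=50.0):
    """Synonym match first, then full text, then an optional geo filter.

    synonym_search and full_text_search are callables that return lists of
    documents (dicts); a document may carry a "latlon" string such as
    "37.376389,-120.428889". location is an optional (lat, lon) tuple.
    """
    results = synonym_search(query)
    if not results:
        # Assumption: full text is a fallback when the name match finds nothing.
        results = full_text_search(query)
    if location is None:
        return results
    lat, lon = location
    nearby = []
    for doc in results:
        if "latlon" not in doc:
            continue  # entities without coordinates cannot pass a geo filter
        doc_lat, doc_lon = (float(part) for part in doc["latlon"].split(","))
        if haversine_km(lat, lon, doc_lat, doc_lon) <= max_km:
            nearby.append(doc)
    return nearby
```

Passing the backends in as arguments keeps the strategy testable on its own; in a real API they would wrap queries to the synonym_string_match and full_text_search cores.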


Page 26: Build a Searchable Knowledge Base

Implement the search strategy in a Django view


Page 27: Build a Searchable Knowledge Base


Page 28: Build a Searchable Knowledge Base

Review

• A Knowledge Base with synonym, full-text and geo search API.

• The knowledge entities are connected by relations.


Page 29: Build a Searchable Knowledge Base

More Applications

• Question answering system:

1. Query analysis: identify the intention (e.g. looking for a specific type of entity)

2. Search in the knowledge base

3. Return the knowledge entity


Page 30: Build a Searchable Knowledge Base

Modern search engines don't just return web page URLs. They give users direct answers.


Page 31: Build a Searchable Knowledge Base

More Data Sources and Knowledge Entities

• Open Data


• Open APIs


Page 32: Build a Searchable Knowledge Base

My Life in Yahoo!

• Build online services for billions of users.

• Big data mining on cloud infrastructures.

• Open and Innovative working environment.

• International teamwork and English communication.

• Business trips to Silicon Valley.

• Send me your resume if you need a referral. r97922028 [at] ntu.edu.tw
