

Page 1: Primary Research Team & Capabilities

11 October 2013

Dept. of Parallel and Distributed Computing

Research and Development Areas:
– Large-scale HPCN, Grid and MapReduce applications
– Intelligent and Knowledge oriented Technologies

Experience from IST:
– 3 projects in FP5: ANFAS, CrossGrid, Pellucid
– 6 projects in FP6: EGEE II, K-Wf Grid, DEGREE (coordinator), EGEE, int.eu.grid, MEDIGRID
– 4 projects in FP7: Commius, Admire, Secricom, EGEE III

Several National Projects (SPVV, VEGA, APVT)

IKT Group Focus:
– Information Processing (Large Scale)
– Graph Processing
– Information Extraction and Retrieval
– Semantic Web
– Knowledge oriented Technologies
– Parallel and Distributed Information Processing

Solutions:
– SGDB: Simple Graph Database
– gSemSearch: Graph based Semantic Search
– Ontea: Pattern-based Semantic Annotation
– ACoMA: KM tool in Email
– EMBET: Recommendation System
– Experts on MapReduce and IR (Nutch, Solr, Lucene)

Director & leader of PDC: Dr. Ladislav Hluchý

URL: http://ikt.ui.sav.sk

Page 2: Towards Entity Search

• Current approaches
– Confirmed human knowledge
– Google Knowledge Graph
– Facebook Graph Search

• Data sets available
– Wikipedia
– DBpedia (111 languages)
– Freebase
– Linked Data cloud

• Our approach
– Quite unique mix of skills: IR, Semantic Web, Graphs and Networks
– Networks, text, metadata
– Graph algorithms
– Information Retrieval techniques
– Anchor texts: aliases, properties, types

Page 3: Entity Search Applications

https://www.linkedin.com/today/post/article/20130805134105-50510-search-what-s-cooking-in-the-lab

http://www.siliconrepublic.com/strategy/item/31182-global-enterprise-search-ma

Page 4: Entity Search Applications

• Online Advertising
– Query Categorization
– Keyword Extension

• Business Intelligence
– Enterprise Search
– Knowledge Management
– Text analytics

• Multilingual short text categorization
– Based on Wikipedia language versions, DBpedia, Freebase
– Query Categorization
– Social media (Twitter) categorization, analysis

• Security Domain
– Information Leakage prevention
– Categorization

Page 5: Large scale Text and Graph data processing

Core Technology

• Web crawling
– Nutch + plugins

• Full text indexing and search
– Lucene, Solr

• Information Extraction
– Ontea, GATE

• All above large scale
– Hadoop, S4

• Graph processing and Querying
– Simple Graph Database (SGDB)
– gSemSearch
– Neo4j
– Blueprints

Underlined are the technologies developed by IISAS

Page 6: Relation to Business Intelligence

• Old BI approaches
– Data integration from RDBMS
– Data warehouses
– OLAP
– …

• New BI approaches
– Data structures other than RDBMS: Networks, Semantics
  • Networks/Graphs in Telecom, Social Networks, Transactions, Linked Data …
  • NoSQL: key-value (Tokyo Cabinet), column stores (HBase), graph databases, RDF(S)
– In-memory computing
– Commodity PC solutions for large data:
  • MapReduce style: Hadoop; Pregel style: Giraph, Hama
– Big unstructured data processing (on Hadoop):
  • Sentiment analysis, topic detection, named entity detection
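The MapReduce style mentioned on this slide can be illustrated with a minimal single-process sketch. It is not Hadoop code. The function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and only mimic the three stages a framework such as Hadoop would distribute across commodity machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (token, 1) pair for every token in one document."""
    for token in document.lower().split():
        yield token, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["graph search", "graph data"]))
```

In a real cluster the map and reduce functions stay this small, while the framework takes over the shuffle, data placement and fault tolerance.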

Page 7: Ontea: Information Extraction Tool

• Regex patterns
• Gazetteers
• Results
– Key-value pairs
– Structured into trees, graphs
• Transformers, Configuration
• Automatic loading of extractors
• Visual Annotation Tool
• Integration with external tools
– GATE, Stemmers, Hadoop …
• Multilingual tests
– English, Slovak, Spanish, Italian

http://ontea.sf.net

Figure panels: Text with annotations; Tree of annotations; Network/Graph of annotations
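The combination of regex patterns and gazetteers producing key-value pairs can be sketched as follows. This is not Ontea's actual code or configuration format. The patterns, the gazetteer entries and the annotate function are hypothetical and chosen only to show the idea.

```python
import re

# Hypothetical patterns: annotation key -> regex with one capture group.
PATTERNS = {
    "email": re.compile(r"\b([\w.+-]+@[\w-]+(?:\.[\w-]+)+)"),
    "person": re.compile(r"\b(?:Dr|Mr|Ms)\.\s+([A-Z]\w+(?:\s+[A-Z]\w+)*)"),
}

# Hypothetical gazetteer: known surface form -> annotation key.
GAZETTEER = {"Bratislava": "city", "Hadoop": "technology"}


def annotate(text):
    """Return (key, value, start_offset) annotations sorted by position."""
    results = []
    # Pattern-based extraction: the captured group becomes the value.
    for key, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            results.append((key, match.group(1), match.start(1)))
    # Gazetteer lookup: whole-word match of each known term.
    for term, key in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            results.append((key, term, match.start()))
    return sorted(results, key=lambda item: item[2])


if __name__ == "__main__":
    for key, value, offset in annotate("Dr. Ladislav Hluchy runs Hadoop jobs in Bratislava."):
        print(offset, key, value)
```

The flat key-value pairs are the simplest result form; the tree and graph structures named on the slide would be built on top of such pairs.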

Page 8: Named Entity Recognition (NER)

• Combination of existing NER tools
– ANNIE (GATE), Apache OpenNLP
– Illinois NER, Illinois Wikifier
– LingPipe, Open Calais
– Stanford NER, WikiMiner
– Miscinator

• Machine Learning
– Decision tree models

• Second place at MSM 2013, missing first place by 1%; 17 teams participated worldwide
http://ikt.ui.sav.sk/index.php?n=Main.IEChallenge2013
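Combining the output of several NER tools can be sketched with simple majority voting. This is a deliberately simpler stand-in and not the decision tree models named on the slide. The tool names, the label set and the combine_labels function are hypothetical.

```python
from collections import Counter


def combine_labels(predictions, min_votes=2):
    """Combine per-token labels from several tools by majority vote.

    predictions: dict mapping tool name -> list of labels, one per token.
    A token keeps the most frequent label only if at least min_votes tools
    agree on it; otherwise it falls back to "O" (no entity).
    """
    lengths = {len(labels) for labels in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all tools must label the same tokens")
    combined = []
    for token_labels in zip(*predictions.values()):
        label, votes = Counter(token_labels).most_common(1)[0]
        combined.append(label if votes >= min_votes else "O")
    return combined


if __name__ == "__main__":
    print(combine_labels({
        "tool_a": ["PER", "O", "LOC"],
        "tool_b": ["PER", "O", "ORG"],
        "tool_c": ["PER", "LOC", "LOC"],
    }))
```

A learned combiner would replace the fixed vote threshold with a model trained on the tools' outputs, so it can learn which tool to trust for which entity type.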

Page 9: gSemSearch: Graph based Semantic Search

• Entity relation search in semantic networks/graphs
• Search, Navigation, Data Interaction
• Aiming at data integration of
– Structured data (relational data, LinkedData)
– Unstructured data (text, documents, communication)
• Applications:
– Email, Web, Text documents, LinkedData

http://ikt.ui.sav.sk/esns/

Page 10: SemSets: Semantic Search

• Answering list type questions: astronauts who walked on the Moon
• Wikipedia as text and network/graph
• Text: IR methods, Lucene based
• Graph/network: spreading activation and SemSets
• Winning solution in the Semantic Search Challenge 2011

1. Eugene_Cernan
2. Alan_Bean
3. David_Scott
4. John_Young_(astronaut)
5. Neil_Armstrong
6. Pete_Conrad
7. Harrison_Schmitt
8. Alan_Shepard
9. Charles_Duke
10. Buzz_Aldrin
11. James_Irwin
12. Edgar_Mitchell
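Spreading activation over a link graph can be sketched as below. This is a generic textbook variant and not the SemSets implementation. The decay factor, the step count, the toy graph and the spread_activation function are assumptions made for illustration.

```python
from collections import defaultdict


def spread_activation(graph, seeds, decay=0.5, steps=2):
    """Propagate activation from seed nodes along edges for a fixed number of steps.

    graph: dict mapping node -> list of neighbour nodes.
    seeds: dict mapping node -> initial activation.
    In each step a node passes decay * (its incoming activation), split
    evenly among its neighbours.
    """
    activation = defaultdict(float, seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        next_frontier = defaultdict(float)
        for node, value in frontier.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            share = decay * value / len(neighbours)
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        # Accumulate what arrived in this step, then continue from there.
        for node, value in next_frontier.items():
            activation[node] += value
        frontier = next_frontier
    return dict(activation)


if __name__ == "__main__":
    toy_graph = {
        "Apollo_11": ["Neil_Armstrong", "Buzz_Aldrin"],
        "Neil_Armstrong": ["Apollo_11"],
        "Buzz_Aldrin": ["Apollo_11"],
    }
    scores = spread_activation(toy_graph, {"Apollo_11": 1.0}, steps=1)
    for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(node, score)
```

Ranking nodes by accumulated activation gives candidate answers related to the seeds; in the setting on the slide the seeds would come from the text (IR) side of the search.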

Page 11: SGDB: Simple Graph Database

• Storage for graphs
• Optimized for graph traversing and spreading activation
• Faster than Neo4j for graph traversing operations
• Supports Blueprints API
• https://simplegdb.svn.sourceforge.net/svnroot/simplegdb/Sgdb3

• Graph Database Benchmarks
– Graph Traversal Benchmark for Graph Databases
– http://ups.savba.sk/~marek/gbench.html
– Blueprints API: possibility to test compliant graph databases

Source: http://geza.kzoo.edu/bionet/html/scalefree.html
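The kind of operation a graph traversal benchmark measures can be sketched with a depth-limited breadth-first traversal and a simple timer. This is not the benchmark linked above and it does not use the Blueprints API. The in-memory adjacency dict and the function names bfs_reachable and time_traversal are hypothetical.

```python
import time
from collections import deque


def bfs_reachable(adjacency, start, max_depth):
    """Return the set of nodes reachable from start within max_depth hops."""
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, depth + 1))
    return visited


def time_traversal(adjacency, start, max_depth, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        begin = time.perf_counter()
        bfs_reachable(adjacency, start, max_depth)
        best = min(best, time.perf_counter() - begin)
    return best


if __name__ == "__main__":
    toy = {1: [2, 3], 2: [4], 3: [4], 4: [5]}
    print(sorted(bfs_reachable(toy, 1, 2)))
    print(time_traversal(toy, 1, 2))
```

Running the same traversal against different storage backends, with the same start nodes and depth limits, is what makes the timings comparable.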

Page 12: Community Detection in Complex Networks

• Task: Identify densely connected subgraphs in complex networks
• Community collapsing problem
• SCCD
– Near-linear time complexity
– Avoids the community collapsing problem (to a certain extent)
• KDD paper
– Re-weighting approach
– Better results on real networks

Marek Ciglan, Kjetil Nørvåg: Fast detection of size-constrained communities in large networks, Proceedings of WISE'10, LNCS Volume 6488/2010

Marek Ciglan, Michal Laclavík and Kjetil Nørvåg: On Community Detection in Real-World Networks and the Importance of Degree Assortativity, 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2013

Page 13: Future Direction: Entity Search in Large Graph Data

• Motivation
– Graph/network data are everywhere: social networks, web, LinkedData, transactions, communication (email, phone).
– Text can also be converted to a graph.
– Interconnecting graph data and searching for relations is crucial.

• Approach
– Forming semantic trees and graphs from text, web, communication, databases and LinkedData
– User interaction with graph data in order to achieve integration and data cleansing
– Users will do it if their effort has an immediate impact on search results