Transcript
Page 1: OpenKE - Technology

Unstructured Text, Analytics, and Summarization

Interfaces with External Tools

●  Document Handling Text Extraction (Apache Tika / POI)
●  Diffbot Extraction
●  Analytics
    ○  Text Summarization
    ○  Topic Modeling
    ○  WordCloud
    ○  Publish Date
    ○  Voyant Experiment
    ○  RASOR/Olympics Indications and Warnings Demo

●  ElasticSearch and Kibana Analysis
    ○  Annotated Field Search
    ○  Visualizations (Tuning/Optimize Crawl, OpenKE Usage)

●  Future Analytic Framework

●  IBM Watson Content Analytics
●  LAS Instrumentation
●  PNNL Knowledge Graph
    ○  https://github.com/streaming-graphs/NOUS
●  SAS
●  Voyant

OpenKE - Technology Open Source Knowledge Enrichment Database

Laboratory for Analytic Sciences [email protected]

Knowledge Graphs

Domain Learning and Discovery

Focused Web Crawling

●  Tunable and Configurable Web Crawler
●  Page Data Model
    ○  Provenance Capture
    ○  Metadata Capture
    ○  Policy Support Data Header
●  Multiple Source Types (Web, Forums, Search APIs)
●  Web Crawling Configuration (Depth, Breadth, Relevancy, Site)
●  Leveraging Structured Data Within Pages
●  Policy (robots.txt, data header)
●  JavaScript and HTML challenges
●  Access and Auditing Concepts for Policy
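The robots.txt policy item above can be illustrated with a minimal sketch. It uses Python's standard `urllib.robotparser` rather than OpenKE's own Java stack, and the robots.txt content, the agent name `example-crawler`, and the URLs are all hypothetical. It shows one way a crawler might check whether a page may be fetched and what delay the site requests; it is not OpenKE's implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 20
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str, agent: str = "example-crawler") -> bool:
    """Return True when the parsed policy allows this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


policy = build_policy(ROBOTS_TXT)
allowed = may_fetch(policy, "https://example.com/news/story.html")
blocked = may_fetch(policy, "https://example.com/private/report.html")
delay_seconds = policy.crawl_delay("example-crawler")
```

Recording the outcome of each check alongside the page is one possible way to support the access and auditing concepts listed above.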

Current OpenKE Capabilities - Yellow Current External Capabilities - Green Future Capabilities - Blue

●  Dictionary and PESTLE Annotations
●  Regular Expressions / Relevancy Tuning
●  Domain Discovery Capability
    ○  Search APIs and “Session” Result Comparison
    ○  Indexing Session Corpus via Text Rank
    ○  Annotation Analysis
    ○  Topic Modeling (LDA)

●  Data Source Learning
    ○  Page Crawl Progression (Page History)
    ○  Dynamic Content Challenge
    ○  Source Data Freshness
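As a rough illustration of the dictionary and PESTLE annotation idea listed above, the sketch below tags text with categories from a small term dictionary using regular expressions. The dictionary contents, the `annotate` function, and the sample sentence are hypothetical; OpenKE's actual dictionaries and annotation pipeline are not described in the source.

```python
import re
from collections import defaultdict

# Hypothetical dictionary: PESTLE category -> terms (not OpenKE's actual dictionary).
PESTLE_TERMS = {
    "Political": ["election", "sanction"],
    "Economic": ["inflation", "tariff"],
    "Technological": ["drone", "quadcopter"],
}


def annotate(text: str, dictionary: dict) -> dict:
    """Return {category: sorted list of dictionary terms found in the text}."""
    found = defaultdict(set)
    for category, terms in dictionary.items():
        # One whole-word, case-insensitive pattern per category.
        alternatives = "|".join(re.escape(term) for term in terms)
        pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            found[category].add(match.group(1).lower())
    return {category: sorted(terms) for category, terms in found.items()}


sample = "New tariff rules may slow drone imports; Drone makers object."
annotations = annotate(sample, PESTLE_TERMS)
```

Counting matches per category, instead of collecting distinct terms, would be one simple starting point for relevancy tuning.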

{
  "extractArea": [
    { "selector": "#productTitle", "title": "Title" },
    { "selector": "#feature-bullets", "title": "" },
    { "selector": "#prodDetails", "title": "Details" },
    { "selector": "#productDescription", "title": "Description" },
    { "selector": "#detail-bullets", "title": "Details - Bullets" },
    { "selector": "#aplus-product-description_feature_div", "title": "Manufacturer Info" },
    { "selector": "#aplusProductDescription", "title": "Manufacturer Info" },
    { "selector": "#technical-specs_feature_div", "title": "Technical Specifications" }
  ],
  "allowSingleHopFromReferrer": true,
  "relevantRegExp": "drone|quadcopter",
  "limitToDomain": true,
  "webCrawler": { "politenessDelay": 20000, "maxDepthOfCrawling": 1 }
}
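To make the job configuration more concrete, here is a minimal sketch of how a crawler might interpret three of its fields. The field semantics are assumptions inferred from the names: `relevantRegExp` is matched against page text (case-insensitively, which is also an assumption), `limitToDomain` restricts links to the seed's host, and `maxDepthOfCrawling` caps link depth. The helpers `should_visit` and `is_relevant` and the example URLs are hypothetical and do not come from OpenKE.

```python
import json
import re
from urllib.parse import urlparse

# Subset of the job configuration, copied from the example.
CONFIG_JSON = """
{
  "relevantRegExp": "drone|quadcopter",
  "limitToDomain": true,
  "webCrawler": { "politenessDelay": 20000, "maxDepthOfCrawling": 1 }
}
"""


def should_visit(config: dict, seed_url: str, link_url: str, depth: int) -> bool:
    """Assumed rule: respect the depth cap and, if requested, stay on the seed's host."""
    if depth > config["webCrawler"]["maxDepthOfCrawling"]:
        return False
    if config["limitToDomain"]:
        return urlparse(seed_url).netloc == urlparse(link_url).netloc
    return True


def is_relevant(config: dict, page_text: str) -> bool:
    """Assumed rule: a page is relevant when the configured pattern occurs in its text."""
    return re.search(config["relevantRegExp"], page_text, re.IGNORECASE) is not None


config = json.loads(CONFIG_JSON)
same_site = should_visit(config, "https://example.com/a", "https://example.com/b", 1)
too_deep = should_visit(config, "https://example.com/a", "https://example.com/b", 2)
off_site = should_visit(config, "https://example.com/a", "https://other.example.org/b", 1)
```

The `webCrawler` keys resemble Crawler4J's crawl settings, where the politeness delay is given in milliseconds, so the value 20000 would plausibly mean a 20-second pause between requests.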

OpenKE Web Crawling Rio Olympic I&W Tuning

OpenKE Domain Discovery Index View

OpenKE Web Crawling Job Config Example

●  “Holistic”
    ○  Facts
    ○  Events
    ○  Causal
    ○  Connections
    ○  Meta-data

●  Intelligence/Analytic Tasks
    ○  Discovery
    ○  Behavioral Modeling
    ○  Network Discovery

●  Analytic Support
    ○  Data/Evidence Gathering
    ○  Prediction
    ○  Model Generation: AI Planning
    ○  Ontology Development
    ○  Knowledge Base

OpenKE Technical Framework

●  Platform:
    ○  Java
    ○  Hortonworks Data Platform

●  Storage:
    ○  Accumulo
    ○  ElasticSearch
    ○  HDFS
    ○  OrientDB
    ○  PostgreSQL

●  Open Source Libraries:
    ○  Crawler4J
    ○  jsoup
    ○  Apache Tika
    ○  Apache POI
    ○  Tabula
    ○  Stanford CoreNLP
    ○  Python NLTK
    ○  Python Gensim
    ○  University of Washington: OpenIE
    ○  d3.js

●  Other:
    ○  Docker
    ○  Kibana
    ○  Apache Spark
    ○  Apache Zeppelin
    ○  Tor2Web
