Mark Davis, Distinguished Engineer, Dell Software Group

IEEE Cloud Computing Initiative

IEEE Intercloud Interoperability Testbed

Mark.Davis@software.dell.com

Industries

Vertical domains

Concepts and Algorithms

Technologies

Figure: growth of data volume over time

Labeled data volumes: 100 TB, 10 EB, 200 EB, 1 ZB (1 EB = 10^18 B, 1 ZB = 10^21 B)

Time axis: 1750, 1900, 1985, 2000

Eras marked: Industrial Revolution #1, #2, #3, #4

Gordon, R. J. Is US Economic Growth Over? Faltering Innovation Confronts the Six Headwinds. CEPR Policy Insight No. 63.

Krugman, P. Is Growth Over? New York Times, 12 December 2012.

"I visualize a time when we will be to robots what dogs are to humans, and I'm rooting for the machines."

Claude Shannon, IEEE Medal of Honor 1966

Faster reporting

Interactive visualization

$$$

Data warehousing

Advanced analytics

Just-in-time decision-making

The Perl Scripting option

4TB of data on disk

4 channels of I/O at 100 MB/s

2.7 hours
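
The scan-time figure on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and reads the per-channel rate as 100 megabytes per second; the function name and parameters are illustrative, not from the talk. Under those assumptions a single full pass over the data takes a little under 2.8 hours, in line with the slide.

```python
def full_scan_hours(data_bytes: float, channels: int, bytes_per_sec_per_channel: float) -> float:
    """Hours needed to read data_bytes once across parallel I/O channels."""
    aggregate_rate = channels * bytes_per_sec_per_channel  # bytes per second
    return data_bytes / aggregate_rate / 3600.0


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

hours = full_scan_hours(4 * TB, channels=4, bytes_per_sec_per_channel=100 * MB)
print(f"{hours:.2f} hours")  # prints "2.78 hours"
```

If the rate were instead 100 megabits per second per channel, the same pass would take eight times longer, which is why the unit matters here.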

Server farms are too expensive

Oracle is too expensive

We just need key-value stores

VCs say no CAPEX

Be lean, young entrepreneurs, be lean

Analyze log files from web servers

Group users based on behavior patterns

Build machine learning models of behavior

Recommend news and information

Social Media

Collect millions of data points from sensors

Analyze failure modes

Proactively improve products

Optimize systems

Industrial Control

Defense Intelligence

Anomaly Detection

Scientific Analysis and Visualization

Collective Intelligence

Social Network Analysis

Collective Intelligence

Anomaly Detection

Distribution Network Optimization

Scientific Analysis and Visualization

Spam Filtering

Defense Intelligence

Product Design

Internet of Things

Social Network Analysis

Search

Linguistic

Understand language

Use linguistic concepts to create computational systems

Statistical

Data driven

Language is structured information

Hybrid

Understand language to some extent

Use statistics and machine learning to help with recall

Part-of-Speech Tagging

Tokenization

Lemmatization

Finite State Transducer Finite State Transducer

Finite State Transducer

Machine-Learning

Random Indexing:

Assign random, sparse vectors to words (or entities)

Add together the vectors for a context

Related contexts cluster in high-dimensional space

Assign new metadata from discovered clusters to documents
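
The first three Random Indexing steps can be sketched in plain Python. This is a minimal illustration, not the speaker's implementation: the vector length `DIM`, the number of nonzero entries `NONZERO`, the use of +1/-1 entries, and the toy `contexts` are all assumptions chosen to keep the example small. The final step (assigning metadata from clusters) is not shown.

```python
import math
import random
from collections import defaultdict

DIM = 1000    # vector length (assumed, for illustration)
NONZERO = 10  # number of +1/-1 entries per index vector (assumed)


def make_index_vector(rng):
    """Return a sparse random vector as {position: +1 or -1}."""
    positions = rng.sample(range(DIM), NONZERO)
    return {p: rng.choice((-1, 1)) for p in positions}


def train(contexts, seed=0):
    """For each word, add together the index vectors of its co-occurring words."""
    rng = random.Random(seed)
    index = {}
    context_vecs = defaultdict(lambda: [0] * DIM)
    for words in contexts:
        for w in words:
            if w not in index:
                index[w] = make_index_vector(rng)
        for w in words:
            for other in words:
                if other != w:
                    for pos, sign in index[other].items():
                        context_vecs[w][pos] += sign
    return context_vecs


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Toy contexts (hypothetical data): "cat" and "kitten" share neighbours.
contexts = [
    ["cat", "purrs", "softly"],
    ["kitten", "purrs", "softly"],
    ["cat", "chases", "mice"],
    ["kitten", "chases", "mice"],
    ["server", "logs", "requests"],
    ["server", "drops", "requests"],
]
vecs = train(contexts)
sim_related = cosine(vecs["cat"], vecs["kitten"])
sim_unrelated = cosine(vecs["cat"], vecs["server"])
print(sim_related, sim_unrelated)
```

Words that occur in the same contexts accumulate the same index vectors and end up close together, while words from unrelated contexts stay nearly orthogonal because their sparse index vectors rarely overlap.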

Why?

Model for human/animal sparse distributed memory

Success on the TOEFL test (64.5-67%)

Distributional hypothesis: words with similar co-occurrence patterns have similar meanings

Johnson-Lindenstrauss Lemma: projecting a matrix through a random matrix R approximately preserves the relative distances between points if R has sufficiently high dimensionality

Fixed context vectors (say, 4000 bits) reduce complexity

Versatile: words-words contexts, words-document contexts, entity-entity contexts, x-y contexts
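
The Johnson-Lindenstrauss property can be observed numerically. The sketch below is an illustration under assumed settings (a Gaussian random matrix, 1000 input dimensions, 400 output dimensions, three random points); none of these values come from the talk. It compares pairwise distances before and after projection.

```python
import math
import random


def random_projection(points, k, seed=0):
    """Project d-dimensional points to k dimensions with a Gaussian random matrix.

    Matrix entries are drawn from N(0, 1/k), so squared lengths are
    preserved in expectation.
    """
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(k)
    projected = [[0.0] * k for _ in points]
    for row in range(k):
        r = [rng.gauss(0.0, 1.0) * scale for _ in range(d)]  # one row of R
        for i, p in enumerate(points):
            projected[i][row] = sum(a * b for a, b in zip(r, p))
    return projected


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(42)
d, k = 1000, 400  # assumed sizes, kept small so the example runs quickly
points = [[rng.random() for _ in range(d)] for _ in range(3)]
low = random_projection(points, k)

# Ratio of projected distance to original distance for every pair of points.
ratios = []
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        ratios.append(distance(low[i], low[j]) / distance(points[i], points[j]))
print([round(x, 3) for x in ratios])
```

The ratios should all sit close to 1, and they tighten further as `k` grows, which is the reason a fixed, moderately large context-vector size is enough in practice.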

MapReduce RI requires reduce phase that merges random spaces

Or… Precompute sparse word vectors

Or… Serve the map phase from a common term signature service

Or… Same hash function across instances and compute a sparse vector (1% occupancy)
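
The last option, using the same hash function on every instance, might look like the sketch below. It is one possible reading of the slide, not the speaker's code: `term_signature`, `map_document`, and `reduce_vectors` are hypothetical names, SHA-256 is an arbitrary choice of stable hash, and the +1/-1 entries are an assumption. Because the signature depends only on the term, mappers on different machines produce identical vectors without sharing state, and the reduce step becomes a plain element-wise sum.

```python
import hashlib
import random

DIM = 4000        # vector length, following the slide's "say, 4000"
OCCUPANCY = 0.01  # 1% of positions are nonzero, as on the slide


def term_signature(term: str) -> dict:
    """Deterministic sparse +1/-1 vector for a term, as {position: sign}."""
    # hashlib is used instead of built-in hash(), which is salted per process.
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    nonzero = round(DIM * OCCUPANCY)
    positions = rng.sample(range(DIM), nonzero)
    return {p: rng.choice((-1, 1)) for p in positions}


def map_document(words):
    """Map step: for each word, sum the signatures of the other words."""
    out = {}
    for w in words:
        acc = out.setdefault(w, {})
        for other in words:
            if other != w:
                for pos, sign in term_signature(other).items():
                    acc[pos] = acc.get(pos, 0) + sign
    return out


def reduce_vectors(partials):
    """Reduce step: element-wise sum of sparse partial vectors for one word."""
    total = {}
    for vec in partials:
        for pos, val in vec.items():
            total[pos] = total.get(pos, 0) + val
    return total


# Two documents processed independently, then merged for the word "big".
a = map_document(["big", "data", "search"])
b = map_document(["big", "data", "analytics"])
merged = reduce_vectors([a["big"], b["big"]])
print(len(term_signature("data")), len(merged))
```

One caveat under these assumptions: the sequence produced by `random.Random` for a given seed can differ between Python versions, so all instances would need to run the same interpreter version, or derive positions directly from the hash bytes.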

Clustering of terms, entities, or documents

Autosuggestion

Categorization

Abstractly: generalized similarity engine

Query Language

Metadata Extraction

Indexing

Facet Browsing

Facet Charting

Resource Integration

Autosuggest

Spellcheck

Big Data search and analytics has many challenges:

Volume of data

Variety of data

Velocity of data

Extracting structure from unstructured information:

▪ Machine Learning

▪ Human Intelligence

▪ Knowledge Engineering

Enabling the 21st Century