Page 1: Speaker Overview - Omniscien Technologies · Dun & Bradstreet Keynote: Confounding Characteristics and Resolution of Complex Business Identity in a Multilingual Context Warwick Matthews

Speaker Overview

Dr. Anthony Scriffignano
Senior Vice President & Chief Data Scientist, Dun & Bradstreet

Dr. Anthony Scriffignano, Senior Vice President and Chief Data Scientist for Dun & Bradstreet, is an internationally recognized thought leader in the data science space. He leads a team of data scientists focused on advancing Dun & Bradstreet's core capabilities and IP globally. With extensive background in advanced algorithms and linguistics, he holds multiple patents and presents globally on data and technology trends, multilingual challenges in business identity, and artificial intelligence.

Warwick Matthews
Senior Director of Identity Data Engineering, Dun & Bradstreet

Warwick is Senior Director of Identity Data Engineering at Dun & Bradstreet. Based in Melbourne, Australia, his work focuses largely on creating complex cross-border multilingual data flows.

AI, MT and Language Processing Symposium

Keynote: Confounding Characteristics and Resolution of Complex Business Identity in a Multilingual Context

Dr. Anthony Scriffignano, Senior Vice President & Chief Data Scientist, Dun & Bradstreet
Warwick Matthews, Senior Director of Identity Data Engineering, Dun & Bradstreet

The presentation has three major themes:
• More Better Faster – Technology and Decision making
• Risk and Response – Disruptive Evolution, Malfeasance and how we will Respond
• The Future is Here – Quantum Computing, Machine Intelligence, New Mindsets, Recommendations for the future

Confounding Characteristics and Resolution of Complex Business Identity in a Multilingual Context

Anthony J. Scriffignano, Ph.D.
SVP / Chief Data Scientist

Warwick Matthews
Senior Director, Identity Data Engineering

AI, MT AND LANGUAGE PROCESSING SYMPOSIUM
28 MARCH 2018

TODAY

• OUR CURIOUS WORLD: More Technology, Faster Decisions
• COMPUTATIONAL LINGUISTIC CHALLENGES: Romanization of Business Identity Data
• THE RISKS AND OUR RESPONSE: Disruptive Evolution and How We Will Respond
• THE FUTURE IS HERE: Trends, Developments, Case Studies

OUR CURIOUS WORLD: MORE TECHNOLOGY, FASTER DECISIONS

We live in an age of promise. Advanced linguistic methods are making things possible that were science fiction only a few short years ago.

Changes in our environment have continuously influenced business decision-making. What is changing is the speed and degree of globalization.

1. Hypergeometric Data Growth: Hypergeometric digital data growth can make it more difficult to determine what is valuable vs. noise.
2. Unstructured Data: It is estimated 80-90 percent of all business information exists as unstructured data.
3. Globalization Challenges: The globalization of business relationships can overwhelm many businesses with multilingual data.
4. Virtual Businesses: A business’ geographic location, structure, and physical customer interaction are becoming irrelevant.

Note: Unstructured data includes data which lacks an exposed ontology and which appears to belie attempts to understand any implied categorization. The term is often used where the content is not actually unstructured, but rather only poorly understood at the time of ingestion or inspection.

“The importance of language should not be underestimated. Language contains nuance, changes constantly, and informs our thinking on levels that we often rely upon in subtle and powerful ways.”
A. Scriffignano

In business today, we will hone our skills for using dynamic, unstructured data, or we will begin to drown in it. There is no guarantee that things get better.

Situational awareness…
• More than 85% of data creation is unstructured
• Language and use of language are constantly evolving
• Commonly available tools and solutions only address a small part of this space

[Figure label: Unstructured Data]

Myths / Inconvenient Truths:
• Myth: More Data is better. Truth: Data vs. noise
• Myth: Lots of data in 1 place is sufficient to learn. Truth: Data at rest vs. data in motion
• Myth: AI can find answers. Truth: AI methods have preconditions
• Myth: Machine learning will find hidden truth. Truth: Regression vs. unprecedented change
• Myth: Natural language processing removes all language barriers. Truth: Language is constantly changing
• Myth: Machine Translation is good enough. Truth: Many unmet challenges remain in linguistics

COMPUTATIONAL LINGUISTIC CHALLENGES: ROMANIZATION OF BUSINESS IDENTITY DATA

Today’s problem is not a lack of data...

• Dun & Bradstreet is continually acquiring large amounts of non-Latin data, particularly in Asia, which needs to be Romanized in order to enter our Global Data Supply Chain.
• Translation is traditionally largely manual, time-consuming and very expensive when you have millions of records to process.
• When it comes to Romanizing pure Identity Data we have a special problem.
• Name and Address data has no context, and this is particularly challenging when we are talking about new business entities that have no existing “footprint” to work from.
• And we need to solve this problem millions of times per day, automatically.

A D&B Challenge: Romanization of Business Identity Data

Business Names are a unique mix of Translated and Transliterated (phonetic) data. And this data has no context (there is nothing around a name to guide the system).

Addresses are “easy” to translate for big geos like Cities. But address detail is often vague, idiosyncratic and defies systematic classification (especially in China!).
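
To make the "mix of translated and transliterated data" concrete, here is a minimal sketch in Python. It is not D&B's method: the lexicon, the function name `romanize_name`, and the sample business name are all invented for illustration. It segments a Chinese business name by greedy longest match and records whether each piece was translated (meaning carried over) or transliterated (sound carried over).

```python
# Hypothetical mini-lexicon (an assumption for this sketch, not real D&B data):
# source token -> (romanized form, method used)
LEXICON = {
    "北京": ("Beijing", "transliterated"),     # place name, rendered phonetically
    "东方": ("Dongfang", "transliterated"),    # trade name, kept phonetic
    "科技": ("Technology", "translated"),      # descriptive term, meaning kept
    "有限公司": ("Co., Ltd.", "translated"),   # legal form, meaning kept
}


def romanize_name(name, lexicon=LEXICON):
    """Segment `name` by greedy longest match against the lexicon.

    Returns a list of (source, romanized, method) tuples. Characters the
    lexicon does not cover are passed through with method "unknown".
    """
    max_len = max(len(key) for key in lexicon)
    parts = []
    i = 0
    while i < len(name):
        # Try the longest possible chunk first, then shrink.
        for size in range(min(max_len, len(name) - i), 0, -1):
            chunk = name[i:i + size]
            if chunk in lexicon:
                romanized, method = lexicon[chunk]
                parts.append((chunk, romanized, method))
                i += size
                break
        else:
            # No lexicon entry starts here: keep the character as-is.
            parts.append((name[i], name[i], "unknown"))
            i += 1
    return parts


parts = romanize_name("北京东方科技有限公司")  # invented example name
print(" ".join(romanized for _, romanized, _ in parts))
for source, romanized, method in parts:
    print(source, "->", romanized, f"({method})")
```

Because the name arrives with no surrounding context, nothing in the input says which tokens should be translated and which transliterated; in this sketch that knowledge lives entirely in the lexicon, which is why uncovered tokens fall through as "unknown".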

Why don’t we just “A.I.” our way out of the problem?

AI has limitations...

• AI is a “grey box” at best – difficult to get actionable qualitative feedback to aid in automated decision-making
• It does not actually understand what it is doing, so it cannot tell when its output is nonsense.

[Images: inspirobot.com, Microsoft Tay, Todai Robot]

“..none of the modern AIs, including Watson, Siri and Todai Robot, is able to read.. ...it doesn't understand any meaning.”
Dr Noriko Arai, Tokyo University
https://www.ted.com/talks/noriko_arai_can_a_robot_pass_a_university_entrance_exam

So we must leverage multiple simultaneous approaches...

• Lexicon/Stats-based routines
• UI – Human Adjudication
• Decisioning system
• Machine Learning System
• AI – SHEN/X
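
One way a decisioning system might sit between the automated approaches and human adjudication is sketched below. The slide does not describe the actual rules, so everything here is an assumption made for illustration: the `Candidate` type, the `decide` function, the agreement rule, and the 0.9 threshold.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str          # proposed romanization
    source: str        # approach that produced it, e.g. "lexicon", "ml", "ai"
    confidence: float  # 0.0 to 1.0, as reported by that approach


def decide(candidates, accept_threshold=0.9):
    """Accept a candidate automatically or escalate to a human.

    Assumed rules (hypothetical, not taken from the presentation):
    1. If exactly one text is proposed by two or more approaches, accept it.
    2. Otherwise accept the most confident candidate only if it reaches
       `accept_threshold`.
    3. Everything else is routed to human adjudication.
    """
    if not candidates:
        return ("human_adjudication", None)

    # Group the approaches by the text they proposed.
    sources_by_text = {}
    for cand in candidates:
        sources_by_text.setdefault(cand.text, set()).add(cand.source)

    agreed = [text for text, sources in sources_by_text.items() if len(sources) >= 2]
    if len(agreed) == 1:
        return ("auto_accept", agreed[0])

    best = max(candidates, key=lambda cand: cand.confidence)
    if best.confidence >= accept_threshold:
        return ("auto_accept", best.text)
    return ("human_adjudication", best.text)


proposals = [
    Candidate("Dongfang Technology", "lexicon", 0.6),
    Candidate("Dongfang Technology", "ml", 0.7),
    Candidate("Eastern Tech", "ai", 0.5),
]
print(decide(proposals))
```

The design point of the sketch is that independent approaches check each other: agreement between two weakly confident routines is treated as stronger evidence than one routine's self-reported score, and the human queue only receives the cases the machines cannot settle.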

THE RISKS AND OUR RESPONSE: DISRUPTIVE EVOLUTION AND HOW WE WILL RESPOND

There are three use cases which represent “True North” for our innovation

Organic growth/decay: Discovering new businesses or changes in business status
• New business that would otherwise go undetected
• Additional input for Intelligence Engine to transform a Single Source Record (SSR)
• Full-file maintenance (e.g. Out of Business)

Fraud / Malfeasance: Ingesting information that can be used to detect bad behavior
• Common social “footprint” shared by more than one persona (e.g. identity theft)
• Patterns of observation that suggest clusters of bad behavior (e.g. fraud rings)
• Discovery of new types of malfeasance (e.g. Phishing)

Personal Identity: Discovering data elements that can help resolve people in business context
• New Social Media handles or sources of social data (e.g. opinion blogs)
• Data which can be aggregated to pre-existing clusters of identity (e.g. photos) for additional resolution (e.g. hyperclusters)
• Data about groups of individuals associated with a business context (e.g. user groups, discussion boards)

Areas of focus
• Entity Extraction
• Understanding Person and Perspective
• Sentiment Attribution/Clustering

• Multiple patents including identity resolution, people in the context of business, geospatial inference, flexible alternative indicia
• Existing capabilities include extraction of entities, tokenization, part of speech tagging, usage vectors, language detection
• The current state of art is highly dependent on training and has challenges with precision and recall, reproducible results

Changing regulatory environment globally

Example: “Apple destroys competition…”
• Transitive verb, requires actor.
• Multiple interpretations.
• Becomes Proper Noun due to inference about verb.

Example: “Яблоко пережило несколько стадий развития…” (Russian: “Yabloko [literally ‘apple’] has gone through several stages of development…”)
• The political party?
• The fruit?
• The company?

FLOCCINAUCINIHILIPILIFICATION

Confounding characteristics

• Sarcasm: ABC corporation is a wonderful company, if you don’t do business with them.
• Neologism: Be sure to like us on FaceBook and use #shallow when you Tweet.
• Grammar variations: FBI is Hunting Terrorists With Explosives.
• Punctuation: “Hi mom!” vs. “Hi, mom?”
• Intentional mis-spelling: RU There?

[Diagram labels: CONFOUNDING CHARACTERISTICS; USE CASES; Context / Behavior; Sentiment Attribution; Entity Extraction; DERIVING EMPIRICAL MEASURES THAT INFORM USE CASES; Passive metric]

Positioning Confounding Characteristics in the Curation Process

Sarcasm
Words or predicates juxtaposed in such a way as to convey hidden meaning that is opposed to that which comes from cursory interpretation
• Example: BP is an excellent company to do business with, if you like destroying nature.

Neologism
Words or phrases which are newly constructed and taken collectively to have some shared meaning.
• Example: Hashtags in Twitter

Grammar variations
Word usage which is intentionally or unintentionally incorrect, leading to ambiguous or non-dispositive interpretation
• Example: FBI is Hunting Terrorists With Explosives

Punctuation
Usage of punctuation in a non-standard or inconsistent way or lack of punctuation, leading to ambiguous or contradictory interpretation
• Example: “Eats shoots and leaves” vs. “Eats, shoots, and leaves”

Intentional Mis-Spelling
Invented, incorrect, or adopted spelling that results in inconsistent, incorrect, or non-dispositive interpretation
• Example: RU There?

Mixing of Languages or Scripts
Including foreign words/phrases or characters, especially in non-standard ways.
• Examples: He has a certain je ne sais quoi or “Please have some ∏”

[Diagram: curation process stages: Recursive Discovery; Entity Extraction; Vetting/Adjudication; Synthesis]
Who is speaking? About whom? How do they feel? In what context?
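
A few of the confounding characteristics defined above leave surface traces that crude rules can flag, which the following Python sketch illustrates. It is only a toy under stated assumptions: the `TEXTSPEAK` word list and the function `confounding_flags` are hypothetical, and sarcasm, grammar ambiguity, and punctuation ambiguity are deliberately not attempted because they do not reduce to pattern matching.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny list of "textspeak" spellings (an assumption).
TEXTSPEAK = {"ru", "u", "gr8", "thx", "pls"}


def confounding_flags(text):
    """Return the confounding characteristics that simple rules can spot."""
    flags = set()

    # Neologism: treat hashtags as newly coined tokens.
    if re.search(r"#\w+", text):
        flags.add("neologism")

    # Intentional mis-spelling: look up whole words in the textspeak list.
    words = re.findall(r"[a-z0-9]+", text.lower())
    if any(word in TEXTSPEAK for word in words):
        flags.add("intentional_misspelling")

    # Mixed scripts: ASCII letters alongside non-ASCII letters or symbols.
    # Unicode general categories starting with "L" are letters and "S" are
    # symbols; punctuation such as curly quotes is ignored on purpose.
    has_ascii_letter = any(ch.isascii() and ch.isalpha() for ch in text)
    has_foreign = any(
        not ch.isascii() and unicodedata.category(ch)[0] in ("L", "S")
        for ch in text
    )
    if has_ascii_letter and has_foreign:
        flags.add("mixed_scripts")

    return flags


for sample in [
    "Be sure to like us on FaceBook and use #shallow when you Tweet.",
    "RU There?",
    "Please have some ∏",
    "He has a certain je ne sais quoi",
]:
    print(sample, "->", sorted(confounding_flags(sample)))
```

The last sample shows the limit of the approach: "je ne sais quoi" mixes languages but uses only ASCII letters, so a character-level rule cannot see it and a language-detection step would be needed.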

THE FUTURE IS HERE: TRENDS, DEVELOPMENTS, CASE STUDIES

[Figure: roadmap items arranged along Breadth and Depth axes]

• Semantic vector space models
• Using source metadata for insights
• Entity Extraction
• Sentiment Analysis
• Detecting & Measuring scores of Confounding Language Characteristics
• Translating scores into degree of text ‘confoundedness’
• Analyze relative usefulness of new sources
• Understand dependencies across sources
• Create dedicated & scalable infrastructure for unstructured data
• Detecting additional confounding factors (e.g. foreign language)
• Analyzing impact on specific use cases
• Improving robustness of existing detection algorithms
• Semantic disambiguation
• Understanding Person & Perspective
• Assessing capabilities in language synthesis for identified use cases
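
The slide does not say how per-characteristic scores would be translated into a degree of ‘confoundedness’. As one possible reading, the sketch below takes a weighted mean of scores in the range 0 to 1 and maps it to a coarse label. The function names, the weights, and the band boundaries are all assumptions chosen for illustration.

```python
def confoundedness(scores, weights=None):
    """Weighted mean of per-characteristic scores, each clamped to [0, 1].

    `scores` maps a characteristic name to its score; `weights` optionally
    maps names to relative importance (default weight is 1.0).
    """
    if not scores:
        return 0.0
    weights = weights or {}
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(
        min(max(value, 0.0), 1.0) * weights.get(name, 1.0)
        for name, value in scores.items()
    )
    return weighted_sum / total_weight


def confoundedness_label(degree):
    """Map a degree in [0, 1] to a coarse band (boundaries are arbitrary)."""
    if degree < 0.2:
        return "low"
    if degree < 0.5:
        return "medium"
    return "high"


# Hypothetical scores for one piece of text.
scores = {"sarcasm": 0.8, "neologism": 0.2}
degree = confoundedness(scores, weights={"sarcasm": 3.0})
print(round(degree, 2), confoundedness_label(degree))
```

A single number like this could then be used to decide whether a text is safe to feed to entity extraction or sentiment analysis automatically, or whether it should be held back for closer inspection.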

Reality check…

Watch this space…
• Inter- and intra-language correlation: deciding when things mean the same thing
• Inter- and intra-language transformation: transforming inference among languages
• Changing behavior to attract/obviate grapheme analysis: reacting to changing language
• Emerging “metalanguage” (e.g. “textspeak”): reacting to language about language
• A language of “things”: reacting to new languages used by automation
• Using language to hide language: reacting to attempts to obscure via language
• Unicode is not universal…: understanding the limitations of automation
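
As a small illustration of the "Unicode is not universal" point, the Python sketch below uses the standard library's `unicodedata.normalize` with the NFKC form. The helper name `normalize_name` and the sample names are invented for this example. Normalization reconciles some visually identical variants of a business name but not others.

```python
import unicodedata


def normalize_name(name):
    """NFKC-normalize and case-fold a name for comparison purposes."""
    return unicodedata.normalize("NFKC", name).casefold()


# Fullwidth Latin letters, as often found in Japanese source data.
fullwidth = "ＡＢＣ株式会社"
halfwidth = "ABC株式会社"
print(fullwidth == halfwidth)                                  # False
print(normalize_name(fullwidth) == normalize_name(halfwidth))  # True

# A limitation: look-alike letters from different scripts stay distinct.
latin = "Apple"
lookalike = "\u0410pple"  # begins with CYRILLIC CAPITAL LETTER A
print(normalize_name(latin) == normalize_name(lookalike))      # False
```

The first pair matches after normalization because fullwidth letters have compatibility mappings to their ASCII counterparts. The second pair does not, since the Cyrillic letter is a distinct character with no such mapping, so automated matching would need an additional confusable-character step to treat the two names as suspiciously similar.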

The journey of discovery involves asking new questions
• What is the community at large saying about this business?
• How are opinions changing over time? Are they authentic?
• How can I detect inconsistent behavior? What does it mean?
• How do I understand and measure customer sentiment?
• How can I see “birth” and “death” of a business more quickly?
• Can I trust the social data on my partners?
• What about modes: Is there a measurable difference between leaders and their organizations’ opinions?

Thank You!

謝謝 · Dankjewel · merci · ありがとう · धन्यवाद

Warwick Matthews

[email protected]

Anthony Scriffignano

[email protected]
