
Post on 03-Jan-2016


Page 1: Markus Schranz schranz@infosys.tuwien.ac.at

Markus Schranz

schranz@infosys.tuwien.ac.at

Pushing the Quality Level in Networked News Business: semantic-based content retrieval and composition in international news publishing

Page 2: Markus Schranz schranz@infosys.tuwien.ac.at

Agenda

• Problem and Project Description

– Goals and Objectives

• Approaches and Results

– Architectural Design & Communication

– Multinational and Multilingual Services

– Semantic Content Relations

• Future Steps and Exploitation

Page 3: Markus Schranz schranz@infosys.tuwien.ac.at

Environmental Situation

• The Internet is gaining importance in the news distribution area

• A large amount of distributed business information is available

• European business today is highly segmented and widely unrecognised beyond national borders

• Business news mostly bears national relevance but holds the potential to spread cooperation opportunities and business chances towards an economically and socially integrated Europe

• Business is global news need to be

Support for the old and new economy within the entire Europe is required. An appropriate solution is beneficial for business in the EU, with special focus on the support and the integration of new member states

problem description

Page 4: Markus Schranz schranz@infosys.tuwien.ac.at

Existing approaches

• National solutions available

– Business News Distribution Service in German speaking area

– Increasing interest from both subscribers and press distributors within the existing services for multinational solutions

• Limitations

– Single language limitation

– Not attractive for European companies to join

problem description

Page 5: Markus Schranz schranz@infosys.tuwien.ac.at

project description

• NEDINE has been EC-funded (Apr 2004 to Apr 2006). The objective of the project is to establish a distributed news network, aimed at European journalists and opinion leaders.

• NEDINE provides participants with a network for news exchange and distribution. It supports mutual awareness of relevant topics and information content within all European countries.

• NEDINE focuses on the availability and affordability for all partners to transport national information to the addressed target group, regardless of the origin, nationality and financial capability of the information provider.

Objectives

Page 6: Markus Schranz schranz@infosys.tuwien.ac.at

[Diagram labels: Small company, good product; Austrian reader, Czech reader, Slovakian reader; Austrian NA, Czech NA, Slovakian NA]

The Challenge

project description

Page 7: Markus Schranz schranz@infosys.tuwien.ac.at

[Diagram labels: Small company, good product; Austrian reader, Czech reader, Slovakian reader; Austrian NA, Czech NA, Slovakian NA]

News agency offers to its customers:

- Single access point for international press releases

- Distribution

- Payment

- Editing / Translation

- Price advantage compared to collection of single press releases

News agency benefits from the nedine network:

- Common business model

- Additional customers

- More revenues

- New contacts

- International presence

News agency offers to its readers:

- Multilingual news

- International news

- From various sources

- (Semantic) relationships independent from source

- Relevance ranking for search

The Solution

project description

Page 8: Markus Schranz schranz@infosys.tuwien.ac.at

Architecture Reasoning

First Approach – Centralized Architecture

Pros:

• Single maintenance point

• Clear infrastructure

• One traffic channel (between news agency and NEDINE)

• No additional infrastructure required for partners

Cons:

• Single point of failure (whole network down)

• Huge amount of network traffic

• Storage of complete articles

• Which organization maintains the central server?

approaches and results

Page 9: Markus Schranz schranz@infosys.tuwien.ac.at

centralized configuration

[Diagram labels: NEDINE Central Server; ČIA, SITA, PTE; Web Service Interface]

approaches and results

Page 10: Markus Schranz schranz@infosys.tuwien.ac.at

Architecture Reasoning

Alternative Approach – Hybrid P2P Architecture

Why Peer-to-Peer?

• Better scalability

• No single point of failure

• No downtime if central services are down

• Less network traffic

• Network remains transparent for the peers (they only see Nedine)

approaches and results

Page 11: Markus Schranz schranz@infosys.tuwien.ac.at

Final Approach – Hybrid P2P Architecture

Properties of this Architecture:

• Democratic system

• Identical software components are installed at each partner

• Nedine becomes a logically centralized platform

• Nedine is technically distributed from the point of view of all participating peers

• Semantic relations and necessary steps for news distribution are done in a local context
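The properties above can be illustrated with a minimal in-process sketch. It is not the NEDINE implementation: the class names `Peer` and `Network`, the substring matching and the sample articles are all assumptions made for illustration. Every partner runs the same `Peer` component, matching happens in the local context, and callers only see one logically central entry point.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """One peer; every partner runs the same component (hypothetical class)."""
    agency: str
    articles: dict[str, str] = field(default_factory=dict)  # doc id -> text

    def publish(self, doc_id: str, text: str) -> None:
        # Articles stay in the local archive of the owning agency.
        self.articles[doc_id] = text

    def search_local(self, keyword: str) -> list[tuple[str, str]]:
        # Matching runs locally; only (agency, doc id) pairs leave the peer.
        needle = keyword.lower()
        return [(self.agency, doc_id)
                for doc_id, text in sorted(self.articles.items())
                if needle in text.lower()]


@dataclass
class Network:
    """Logically central facade: callers see one platform, not the peers."""
    peers: list[Peer] = field(default_factory=list)

    def search(self, keyword: str) -> list[tuple[str, str]]:
        hits = []
        for peer in self.peers:  # fan the query out to every peer
            hits.extend(peer.search_local(keyword))
        return hits


cia, pte = Peer("ČIA"), Peer("PTE")
cia.publish("cz-1", "Czech brewery opens a plant in Austria")
pte.publish("at-1", "Austrian software firm seeks partners")
pte.publish("at-2", "New brewery cooperation announced")
network = Network([cia, pte])
results = network.search("brewery")
```

In a real deployment the fan-out would go over the network rather than a Python loop; the sketch only shows that full articles never have to be stored centrally.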

approaches and results

Page 12: Markus Schranz schranz@infosys.tuwien.ac.at

P2P configuration

[Diagram labels: Virtually Central Services; ČIA, SITA, PTE; Web Service Interface; three NEDINE Peers]

approaches and results

Page 13: Markus Schranz schranz@infosys.tuwien.ac.at

Communication: Peer Agency

Web Services as the communication protocol

• Standard interfaces for default peers (SOAP, NewsML data transfer, queries, network data)

• Customized interfaces for each partner, if necessary (database access based on document ID)

• Location and functionality of the NEDINE peer is defined in the corresponding WSDL file

• Functionality is only visible to the local peer, which increases network security
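As a rough idea of what a document-ID based request over SOAP could look like, the sketch below builds and re-reads a SOAP 1.1 envelope with Python's standard library. Only the SOAP envelope namespace is standard; the service namespace `urn:example:nedine-peer`, the operation name `GetDocument`, the element `documentId` and the sample ID are hypothetical and are not taken from the NEDINE WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
PEER_NS = "urn:example:nedine-peer"  # hypothetical service namespace


def build_get_document_request(document_id: str) -> bytes:
    """Serialize a SOAP 1.1 request asking a peer for one document by ID."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # "GetDocument" and "documentId" are illustrative names only.
    request = ET.SubElement(body, f"{{{PEER_NS}}}GetDocument")
    ET.SubElement(request, f"{{{PEER_NS}}}documentId").text = document_id
    return ET.tostring(envelope, encoding="utf-8")


def read_document_id(message: bytes) -> str:
    """Parse a request and return the document ID it carries."""
    root = ET.fromstring(message)
    node = next(root.iter(f"{{{PEER_NS}}}documentId"))
    return node.text


message = build_get_document_request("pte-20050101-001")
```

A production peer would normally generate such messages from the WSDL with a SOAP toolkit instead of assembling the XML by hand.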

approaches and results

Page 14: Markus Schranz schranz@infosys.tuwien.ac.at

Inter - Peer - Communication

Implemented also by XML Web Services

• Inter-peer communication is invisible to the agencies

• High flexibility, easy to upgrade/change; doesn't influence the rest of the network

• Network traffic is encrypted via PKI (public key infrastructure, based on private/public key pairs)

approaches and results

Page 15: Markus Schranz schranz@infosys.tuwien.ac.at

Multinational and Multilingual Services

– Multinational Service Integration

  • Standardized news exchange formats: NewsML

  • Local service to peer communication: SOAP

    – local service providers hold business critical information

    – installation of a local peer with well-known (open) source increases trust of the participating organizations and underlines the local character of the relevant business data

  • Peer-to-peer communication: SOAP

approaches and results

Page 16: Markus Schranz schranz@infosys.tuwien.ac.at

Multilingual News Publishing and Distribution

– Automatic translation?

– Multilingual content presentation?

– Multilingual information distribution & retrieval

– Semantic relations between the (multilingual) business news contents

approaches and results

Page 17: Markus Schranz schranz@infosys.tuwien.ac.at
Page 18: Markus Schranz schranz@infosys.tuwien.ac.at

Semantic News Enrichment

Pushing the Quality Level by Semantics

– International news describe local business and lack relevant interrelations

– “Linking” between sensible business news has been manual work and thus costly

– Semantic relationships increase the business value of news items, but how can they be created with reasonable effort?

approaches and results

Page 19: Markus Schranz schranz@infosys.tuwien.ac.at

The Vector Space Engine

– Vectors are assigned to every news article representing keyword occurrences (weights)

– Vectors are technically small portions of data, feasible to integrate in peer component

– Semantic relationships increase business value of news items

• Automatically recognize similarities by creating a vector space on relevant keywords

approaches and results

Page 20: Markus Schranz schranz@infosys.tuwien.ac.at

• What is a keyword?

– all words (except stopwords)

– relevant words: from frequencies, with weights (vector space model), from the domain

• What does a keyword look like?

– A word: bodies

– A stem: bodi

– A lemma: body

– A phrase: public bodies
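The four keyword forms can be made concrete with a small sketch. Everything here is a toy stand-in: the stopword list, the two suffix rules in `toy_stem` and the one-entry `LEMMAS` dictionary are assumptions for illustration; a real system would use proper stemmers and lemmatizers.

```python
STOPWORDS = {"the", "of", "and", "in", "a", "to", "are", "by"}  # tiny illustrative list
LEMMAS = {"bodies": "body"}  # stand-in for a real lemmatizer's dictionary


def toy_stem(word: str) -> str:
    """Very rough suffix stripping; a real system would use a proper stemmer."""
    if word.endswith("ies"):
        return word[:-3] + "i"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def keyword_forms(text: str) -> dict[str, list[str]]:
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    words = [t for t in tokens if t not in STOPWORDS]
    # Phrases: adjacent token pairs in which neither token is a stopword.
    phrases = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])
               if a not in STOPWORDS and b not in STOPWORDS]
    return {
        "words": words,
        "stems": [toy_stem(w) for w in words],
        "lemmas": [LEMMAS.get(w, w) for w in words],
        "phrases": phrases,
    }


forms = keyword_forms("Funding of the public bodies")
```

For the sample sentence, the last word yields the stem "bodi" and the lemma "body", and the only phrase found is "public bodies", matching the examples on the slide.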

approaches and results

Page 21: Markus Schranz schranz@infosys.tuwien.ac.at

[Diagram: a Query goes through Query Processing and a Document goes through Document Processing; each processing step applies stemming and/or PN detection and/or N-gram detection …; the resulting query and document vectors meet in the Matching step]

Q = (wq1,…,wqn)

D = (wd1,…,wdn)
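The point of processing query and document in the same way is that both end up as vectors over the same n terms. The sketch below assumes a deliberately simple processing step (lowercasing, punctuation stripping, stopword removal, raw counts); the function names and the stopword list are illustrative, and stemming, PN detection and N-gram detection are left out.

```python
from collections import Counter

STOPWORDS = {"the", "of", "and", "in", "a", "to", "for", "with"}  # illustrative only


def process(text: str) -> Counter:
    """Shared processing step: lowercase, strip punctuation, drop stopwords, count."""
    tokens = (t.strip(".,;:!?").lower() for t in text.split())
    return Counter(t for t in tokens if t and t not in STOPWORDS)


def to_vectors(query: str, document: str):
    q_counts, d_counts = process(query), process(document)
    # One shared, ordered vocabulary gives both vectors the same n dimensions.
    vocabulary = sorted(set(q_counts) | set(d_counts))
    q_vec = [q_counts[term] for term in vocabulary]
    d_vec = [d_counts[term] for term in vocabulary]
    return vocabulary, q_vec, d_vec


vocabulary, Q, D = to_vectors(
    "brewery cooperation",
    "The brewery announced a cooperation with a Czech brewery.",
)
```

Here the components are raw counts; in the weighted model they would be replaced by term weights.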

approaches and results

Page 22: Markus Schranz schranz@infosys.tuwien.ac.at

• Vector Space Model combined with statistical and linguistic processing.

• Statistical metrics included are:

– $tf_{ij}$ = term frequency for word $i$ in document $j$

– $IDF_i$ = inverse document frequency for word $i$ in the whole document collection

$$IDF_i = 1 + \log_2 \frac{N}{df_i}$$

– $w_{ij} = tf_{ij} \cdot IDF_i$

$N$ = total number of documents

$df_i$ = document frequency for term $i$
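A minimal sketch of these metrics, assuming the IDF formula on the slide reads IDF_i = 1 + log2(N / df_i) and that tf_ij is the raw count of term i in document j. The function name and the four sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_weights(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: w_ij} mapping per document, with w_ij = tf_ij * IDF_i."""
    n_docs = len(documents)
    # df_i: number of documents that contain term i at least once.
    df = Counter(term for doc in documents for term in set(doc))
    idf = {term: 1 + math.log2(n_docs / count) for term, count in df.items()}
    weights = []
    for doc in documents:
        tf = Counter(doc)  # tf_ij: raw count of term i in document j
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights


docs = [
    ["brewery", "austria", "brewery"],
    ["software", "austria"],
    ["brewery", "export"],
    ["austria", "tourism"],
]
weights = tf_idf_weights(docs)
```

With N = 4, "brewery" occurs in two documents, so its IDF is 1 + log2(2) = 2 and its weight in the first document (tf = 2) is 4. "software" occurs in one document, so its IDF is 1 + log2(4) = 3.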

approaches and results

Page 23: Markus Schranz schranz@infosys.tuwien.ac.at

Vector Space Model

• Documents are indexed by vectors

• Documents are retrieved by similarity

– Query and documents are compared using the cosine formula:

$$\mathrm{Sim}(Q,D) = \frac{\sum_{i=1}^{n} wq_i \cdot wd_i}{\sqrt{\sum_{i=1}^{n} wq_i^{2}} \cdot \sqrt{\sum_{i=1}^{n} wd_i^{2}}}$$

– Local archives must provide term frequency data (internal and document)
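The cosine comparison of a query vector Q = (wq1,…,wqn) with document vectors D = (wd1,…,wdn) can be sketched as follows. The helper `rank`, the zero-vector convention and the sample vectors are assumptions for illustration, not part of the presented system.

```python
import math


def cosine_similarity(q: list[float], d: list[float]) -> float:
    """Sim(Q, D) = sum(wq_i * wd_i) / (sqrt(sum wq_i^2) * sqrt(sum wd_i^2))."""
    if len(q) != len(d):
        raise ValueError("query and document vectors must have the same length")
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(wq * wq for wq in q))
    norm_d = math.sqrt(sum(wd * wd for wd in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_q * norm_d)


def rank(query: list[float], documents: dict[str, list[float]]) -> list[str]:
    """Order document IDs by decreasing similarity to the query."""
    return sorted(documents,
                  key=lambda doc_id: cosine_similarity(query, documents[doc_id]),
                  reverse=True)


query = [1.0, 1.0, 0.0]
archive = {
    "doc-a": [2.0, 2.0, 0.0],  # same direction as the query
    "doc-b": [0.0, 0.0, 3.0],  # no shared terms
    "doc-c": [1.0, 0.0, 1.0],  # partial overlap
}
ranking = rank(query, archive)
```

Because the cosine depends only on direction, "doc-a" ranks first even though its weights are twice those of the query.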

approaches and results

Page 24: Markus Schranz schranz@infosys.tuwien.ac.at

The used model

[Diagram labels: NEWS; Preprocessing of texts; Linguistic Processing; Taggers and Stemmers; Proper Names; Heuristics; Syntactic patterns; Semantic resources (EWN); Metadata information; Statistical process; Document Vectors]

approaches and results

Page 25: Markus Schranz schranz@infosys.tuwien.ac.at

Use case: distributing news in the Czech Republic and in Austria

[Diagram labels: ČIA (CZ, DE); ČIA (CZ, DE, EN); SITA (SK, EN); PTE (DE, EN); three NEDINE Peers; two Subscribers. Numbered steps: 1. Distribution & Enrichment; 2. Enrichment (DE); 3.; 4.; 5. CZ, DE; 6.; 7. DE]

approaches and results

Page 26: Markus Schranz schranz@infosys.tuwien.ac.at
Page 27: Markus Schranz schranz@infosys.tuwien.ac.at
Page 28: Markus Schranz schranz@infosys.tuwien.ac.at
Page 29: Markus Schranz schranz@infosys.tuwien.ac.at

Future Exploitation

Recent developments and open issues

• Nedine has been extended with translation services (additional service on P2P architecture)

• Secure communication infrastructure has been implemented

• Performance and scalability tests

• Market & business orientation

Nedine Association was funded at the end of 2005

Page 30: Markus Schranz schranz@infosys.tuwien.ac.at

Good News from Europe

Have a look at NEDINE; we are open to recommendations, news providers and partners from all over Europe.

Website http://www.nedine.org/

E-Mail [email protected]

Nedine Contact Person: Dr. Markus Schranz

Tel. ++43-1-81140-444, [email protected]