indexing and searching cross media content in a social network

Post on 30-Oct-2014


Indexing and Searching Cross Media Content in a Social Network

Pierfrancesco Bellini, Daniele Cenni, Paolo Nesi

University of Florence, Department of Systems and Informatics

Distributed Systems and Internet Technology Laboratory

ECLAP Conference, May 7-9, 2012

ECLAP Social Network

ECLAP is a Digital Library on Performing Arts connected with Europeana

ECLAP is a Social Network (blogs, forums, comments, tagging, voting, …)

Goals/Requirements

• Develop an indexing/searching solution for the ECLAP Social Network allowing:
  • Indexing multilingual cross-media content metadata and data (e.g. documents)
  • Indexing portal blogs, forums, events, group pages, comments, etc.
  • Efficient multilingual search (keyword search and advanced search) supporting:
    • misspelled words (e.g. shespeare)
    • partial word search
  • Sorting and filtering search results
  • Re-indexing the whole data without blocking the system
  • Logging and monitoring user activity
  • …
• Evaluate the indexing/searching service

ECLAP Data Model

[Figure: class diagram of the ECLAP data model. Classes: Object, Video, Audio, Document, Image, Group/Channel, Collection, Playlist, AVObject, Annotation, Forum, WebPage, Comment, Content, TaxonomyTerm, Blog. Associations carry the multiplicities 0..n, 1..n, 1..2 and 1. Metadata sets: Performing Arts, Dublin Core, Technical.]

Indexing

Indexing & Search system
• Based on Apache Solr
Multilingual aspects
• Translate the metadata or translate the query?
• We use metadata translation
Indexing schema
• Dublin Core + DCTerms (multi language)
• Performing Arts
• Technical (provider, content type, GPS, IPR, duration, quality, …)
• Groups associations (multi language)
• Taxonomy associations (multi language)
• Comments & multi language tags
• Full text of the textual digital resources

Indexing

Columns: Media Type, DC (ML), Tech, Perf. Arts, Full Text, Taxonomy/Group (ML), Comment/Tags (ML), Votes

Audio/Video/Image: Y Y Y Y Y Y
Document (pdf, doc, …): Y Y Y Y Y Y Y
CrossMedia (html, MPEG21, …): Y Y Y Y Y Y Y
Aggregations (playlist, collection, …): Y Y Y Y Y Y
Info text (blog, web pages, forum, events, …): (Y) Y Y

Indexing

Multilingual fields
• title_en, title_it, title_de, title_fr, title_ca, …

Catch-all fields

Catch-all field   Component fields                  Boost weight
text              pdf_*, doc_*, ppt_*, htm_*, …     1.0
body              body_*                            0.5
title             title_*                           3.1
description       description_*                     2.0
contributor       contributor_*                     0.8
subject           subject_*                         1.5
taxonomy          taxonomy_*                        0.8
PerformingArts    PerformingArtsMetadata.#          1.0
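The boost weights of the catch-all fields could be applied at query time. The sketch below is only an illustration: it assumes the catch-all fields are queried through Solr's eDisMax parser and its `qf` parameter, which the slides do not state, and the helper names `build_qf` and `build_query_string` are hypothetical.

```python
from urllib.parse import urlencode

# Boost weights copied from the catch-all fields table.
CATCH_ALL_BOOSTS = {
    "text": 1.0,
    "body": 0.5,
    "title": 3.1,
    "description": 2.0,
    "contributor": 0.8,
    "subject": 1.5,
    "taxonomy": 0.8,
    "PerformingArts": 1.0,
}


def build_qf(boosts):
    """Render a field -> boost mapping as an eDisMax `qf` value."""
    return " ".join(f"{field}^{weight}" for field, weight in boosts.items())


def build_query_string(keywords, boosts=CATCH_ALL_BOOSTS):
    """Build URL-encoded parameters for a keyword search (eDisMax is assumed)."""
    params = {"q": keywords, "defType": "edismax", "qf": build_qf(boosts)}
    return urlencode(params)


if __name__ == "__main__":
    print(build_qf(CATCH_ALL_BOOSTS))
    print(build_query_string("commedia dell'arte"))
```

With these weights a keyword found in a title counts roughly six times as much as the same keyword found in a body field.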

Indexing

Re-indexing
• In case of a new indexing schema or index corruption, the search system should not be blocked
• Re-indexing is done on a separate indexing machine while the production system keeps using the current index
• During re-indexing, newly uploaded or modified content is marked so that it is re-indexed when the new index is put into production
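The re-indexing procedure can be sketched as a small in-memory model. This is not the ECLAP implementation: the class `SearchService`, its methods and the `reindex` stand-in are hypothetical, plain dictionaries stand in for Solr indexes, and deletions are left out.

```python
class SearchService:
    """Toy model of a production index that stays online during a rebuild."""

    def __init__(self, documents):
        self.live_index = dict(documents)  # index currently serving queries
        self.rebuilding = False
        self.pending = set()               # ids changed while rebuilding

    def upsert(self, doc_id, doc):
        """Apply an upload or modification to the live index."""
        self.live_index[doc_id] = doc
        if self.rebuilding:
            # Mark the content so it is re-indexed after the swap.
            self.pending.add(doc_id)

    def start_rebuild(self):
        """Return a snapshot for the separate indexing machine to work on."""
        self.rebuilding = True
        self.pending.clear()
        return dict(self.live_index)

    def finish_rebuild(self, new_index, reindex):
        """Re-index the marked content into the new index, then swap it in."""
        for doc_id in self.pending:
            new_index[doc_id] = reindex(self.live_index[doc_id])
        self.live_index = new_index
        self.rebuilding = False
        self.pending.clear()


def reindex(doc):
    """Stand-in for indexing a document under the new schema."""
    return {**doc, "schema_version": 2}


def demo():
    service = SearchService({1: {"title": "Hamlet"}})
    snapshot = service.start_rebuild()
    service.upsert(2, {"title": "Otello"})  # arrives while rebuilding
    rebuilt = {doc_id: reindex(doc) for doc_id, doc in snapshot.items()}
    service.finish_rebuild(rebuilt, reindex)
    return service.live_index


if __name__ == "__main__":
    print(demo())
```

The document uploaded during the rebuild stays searchable in the old index and ends up in the new one as well, which is the property the slide asks for.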

Searching

Full text search
• Uses the catch-all fields to search for keywords in the most important fields in all languages (title, description, text, body, subject, …)
Fuzzy search
• Allows matching mistyped words
Deep search
• Allows searching for partial words
Relevance & boosting of terms
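To illustrate what fuzzy and partial-word matching mean, the sketch below matches a query term against a small vocabulary using edit distance and substring tests. It shows the concepts only and is not how Solr implements them; the vocabulary, the function names and the threshold of 2 edits are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_matches(term, vocabulary, max_edits=2):
    """Words within `max_edits` edits of the (possibly mistyped) term."""
    term = term.lower()
    return [w for w in vocabulary if edit_distance(term, w.lower()) <= max_edits]


def partial_matches(fragment, vocabulary):
    """Words containing the fragment anywhere (partial word search)."""
    fragment = fragment.lower()
    return [w for w in vocabulary if fragment in w.lower()]


VOCABULARY = ["shakespeare", "theatre", "dance", "performance"]

if __name__ == "__main__":
    print(fuzzy_matches("shespeare", VOCABULARY))
    print(partial_matches("perform", VOCABULARY))
```

The misspelling "shespeare" from the requirements slide is two insertions away from "shakespeare", so it is recovered with a threshold of 2.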

Searching Faceted search

Searching Advanced search

Search Facility Assessment

Analysis performed over 3 months
• 11294 visits (6032 unique visits)
• 62768 page views (avg 5.76 pages per visit)
• 7.29 minutes of permanence on the portal
• 30502 content accesses (view, play and download)

Search Facility Assessment

Users                     # Full Text Query   # Faceted Query   # Last Posted List   # Featured List   # Popular List
simple registered         323                 24                4                    22                17
partners                  1094                21                27                   19                9
anonymous                 2634                147               234                  302               213
Total                     4051                192               265                  343               239
Clicks after query/list   1564                200               318                  2799              231

Search Facility Assessment

Click order distribution

[Figure: click order distribution, first page]

Conclusions

• The solution allows indexing multilingual metadata and texts
• Searching & filtering results
• The search facility assessment shows that search is a used feature

Context & Assessment

Context
• Social Network
  • User and content items
• Content distribution portal
  • Video on demand portal
• Archive, digital library, Performing Arts
  • http://www.eclap.eu
Assessment
• User behavior
  • Log user actions on the Web portal
• User happiness
  • Measure the level of user satisfaction about the exposed services

Logging User Profile

User Profile
• Registered or anonymous, uid (user id)
• Timestamp YY-mm-dd hh:mm:ss
• IP address, proxy type, etc.
• Platform (OS, browser)
• GeoIP data (country, region, city)
• Friends, connections
  • Betweenness, Eccentricity
• Joined groups
• User preferred contents

Understanding User behavior

Online survey
• A simple module, on the right side of the portal
• Presenting 3-4 questions per topic (depending on the current portal section visited)
Stat Drupal Modules
• Custom implemented modules
• Log User Activity
• Keep track of and depict the main figures about portal activity
• Can be filtered by date, user, type of content, group, type of activity (content enrichment, social promotion, networking, etc.)
Google Analytics

Understanding User behavior

Top Metrics
• Avg # Visits/User
• Avg # Queries/User
• Avg # Clicks/User
• Avg Visit duration
• Avg Query length
• Query refinement rate
• Next Page Click Rate
• Back Page Click Rate
• Frequency of searching (once/day, week, etc.)
• Success of searching (assessment...)
• …
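A few of these metrics can be derived directly from an activity log. The sketch below assumes a hypothetical log format of `(user, action, payload)` tuples and measures query length in words; neither choice comes from the slides.

```python
from collections import defaultdict


def top_metrics(events):
    """Compute per-user averages from (user, action, payload) log events."""
    queries = defaultdict(int)
    clicks = defaultdict(int)
    query_lengths = []
    users = set()
    for user, action, payload in events:
        users.add(user)
        if action == "query":
            queries[user] += 1
            query_lengths.append(len(payload.split()))  # length in words
        elif action == "click":
            clicks[user] += 1
    n_users = len(users) or 1  # avoid dividing by zero on an empty log
    return {
        "avg_queries_per_user": sum(queries.values()) / n_users,
        "avg_clicks_per_user": sum(clicks.values()) / n_users,
        "avg_query_length": (sum(query_lengths) / len(query_lengths)
                             if query_lengths else 0.0),
    }


# Made-up sample log, for illustration only.
SAMPLE_LOG = [
    ("u1", "query", "hamlet"),
    ("u1", "click", "obj-1"),
    ("u1", "query", "hamlet video"),
    ("u2", "query", "dario fo"),
    ("u2", "click", "obj-2"),
    ("u2", "click", "obj-3"),
]

if __name__ == "__main__":
    print(top_metrics(SAMPLE_LOG))
```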

Logging User Behavior

Logging user activities on the portal
• Downloads/Views
• Queries
• Anonymous/Registered portal accesses (login/logout)
• Adding/Updating/Deleting digital contents
• Menu clicks
• Content Upload
• Content Management
• Social Promotion & Networking

Logging User Behavior

Content Accesses (Download/View)
• Axmedis Content
  • Pdf, Document, Video, Playlist, Slide, Flash, Image, Excel, Archive, Audio, Tool, Collection
• Drupal Content
  • Page, Blog, Event, Forum, Group, Comment
• Distribution of Content Access per
  • Access Type, Portal, Platform, Section, Locale, Country, Region, City, Axoid, Nid, Content Type, Partner, User, Timestamp

Logging User Behavior

Queries (Simple, Faceted, Advanced)
• Distribution of Queries per
  • User, Content type, Device, IP, User Agent, Query Type, Country, Region, City, Locale, Filter (faceted)
• Query Cloud
• Keyword Cloud
IPR Wizard
• Definition and usage of IPR Models
Metadata Editor
• Access and usage
• Add, Edit metadata
Video Annotations
• Personal content
• Other users' content

Logging User Behavior

Social Promotion & Networking
• Analysis of
  • Eccentricity
  • Betweenness
  • Connections
• Creation, Access of Public/Private Web Pages
• Activity on Forums, Blogs, Groups or between users
  • New Contents
  • Comments to Objects/Web Pages
  • Invited People
  • Featured Objects
  • Recommendations, suggested content
  • Export/Import of links to/from other SN
  • Private Messages
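Of the network measures listed above, eccentricity is the simplest to compute: it is the greatest shortest-path distance from a user to any other reachable user. The sketch below uses breadth-first search on a made-up friendship graph; the graph, the names and the function are illustrative assumptions, and betweenness is not covered.

```python
from collections import deque


def eccentricity(graph, start):
    """Greatest shortest-path distance from `start` to any reachable user."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return max(distance.values())


# Hypothetical undirected friendship graph: anna - bruno - carla - dario.
FRIENDS = {
    "anna": {"bruno"},
    "bruno": {"anna", "carla"},
    "carla": {"bruno", "dario"},
    "dario": {"carla"},
}

if __name__ == "__main__":
    for user in FRIENDS:
        print(user, eccentricity(FRIENDS, user))
```

Users in the middle of the chain have a lower eccentricity than users at its ends, which is why the measure is used as an indicator of how central a user is.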

Logging User Behavior

Menu Clicks
• Distribution of clicks per User, IP, Locale, Timestamp, etc.
• LAST POSTED, FEATURED, CALENDAR, ADVANCED SEARCH, UPLOAD AND INGEST, POPULAR, MY CONTENT, MY GROUPS, MY COLLEAGUES, GET AFFILIATED, TERMS OF USE, PRIVACY POLICY, TOP RATED, COURSES, LESS POPULAR, UPLOAD NEW CONTENT, etc.
Ranking/Voting
• # of ranked items
• Distribution per User, IP, Locale, Timestamp, etc.
QR Code
• Access from Mobile Devices
Workflow
• Distribution of Workflow Type
Content Upload
• Distribution of uploads per User, Partner, Timestamp

Content Access

Affiliation                        # View/Play   # Download
DSI                                46            0
Not partners/Affiliated            1292          14
Partners/Affiliated (except DSI)   6712          119
Public Users                       21418         947

Affiliation                        # View/Play   # Download
DSI                                3             0
Not partners/Affiliated            100           4
Partners/Affiliated (except DSI)   218           11
Public Users                       2225          869

September 1st – November 30th 2011

Menu Clicks

Menu                        # Clicks
ABOUT->ECLAP DESCRIPTION    671
EVENTS->PAST AND FUTURE     536
SEARCH->GROUPS              524
ABOUT->ECLAP NEWS BLOG      463
CONTENT->LAST POSTED        265
CONTENT->FEATURED           343
HOWTO->UPLOAD AND INGEST    330
SEARCH->ADVANCED SEARCH     314
EVENTS->CALENDAR            298
ABOUT->ECLAP PARTNERS       269
ABOUT->MAIN CONTACT         249
CONTENT->POPULAR            239

September 1st – November 30th 2011

Search

Affiliation                        # Simple Queries   # Faceted Queries
DSI                                13                 0
Not partners/Affiliated            323                24
Partners/Affiliated (except DSI)   1094               21
Public Users                       2634               147

Affiliation                        # Advanced Queries
DSI                                0
Not partners/Affiliated            18
Partners/Affiliated (except DSI)   4

September 1st – November 30th 2011

Drupal Stat Metrics Content Access per nid

September 1st – November 30th 2011

Drupal Stat Metrics Views by Query

September 1st – November 30th 2011

Drupal Stat Metrics Content Access per Platform

September 1st – November 30th 2011

Understanding User behavior Drupal Stats (collapsible menus on the right)

Google Analytics vs Drupal Stats

Google Analytics
Pros
• Traffic source data
• Bounce rate
• Recency (since when)
• Loyalty (how often)
• Session times
Cons
• IP approach: each IP is considered a unique visitor
• Can't deal with specific actions on the portal (e.g. downloads, queries)

Drupal Stats
Pros
• Identity approach
• Actions: Download, User Access, Queries
• Content type filtering
Cons
• Can't deal with traffic source data and bounce rate
• Session time is a raw approximation

Sorting Results

Sorting by
• Upload Time (date the document was first uploaded)
• Update Time (date the document was last updated)
• Score (document relevance to the search query)
Combined with faceting and paging
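Sorting, faceting and paging map onto standard Solr request parameters (`sort`, `start`, `rows`, `facet`, `facet.field`). The sketch below builds such a parameter list; the index field names `upload_time`, `update_time` and `content_type` are hypothetical, since the slides do not give the schema names, and the descending sort order is an assumption.

```python
from urllib.parse import urlencode

# Hypothetical index field names for the three sort criteria.
SORT_FIELDS = {
    "upload": "upload_time",
    "update": "update_time",
    "score": "score",
}


def search_params(keywords, sort_by="score", page=0, page_size=10,
                  facet_fields=()):
    """Build Solr request parameters combining sorting, paging and faceting."""
    if sort_by not in SORT_FIELDS:
        raise ValueError(f"unknown sort criterion: {sort_by}")
    params = [
        ("q", keywords),
        ("sort", f"{SORT_FIELDS[sort_by]} desc"),
        ("start", page * page_size),  # offset of the first result
        ("rows", page_size),          # results per page
    ]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    return params


if __name__ == "__main__":
    params = search_params("dance", sort_by="upload", page=2,
                           facet_fields=["content_type"])
    print(urlencode(params))
```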

Suggestions

• REALTIME: while the user types a query, the system suggests similar searches
• Example: typing "ecl…" suggests eclap, eclap-de-2-1-1-user, eclap-de-2-2-1-usergroup, …
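One simple way to produce such suggestions is a prefix lookup over a sorted list of known search strings. The sketch below is an assumption about how this could work, not a description of the ECLAP suggester; the list of past queries is invented apart from the three examples shown on the slide.

```python
from bisect import bisect_left


def suggest(prefix, sorted_queries, limit=5):
    """Return up to `limit` entries of `sorted_queries` starting with `prefix`."""
    start = bisect_left(sorted_queries, prefix)  # first entry >= prefix
    matches = []
    for candidate in sorted_queries[start:]:
        if not candidate.startswith(prefix) or len(matches) == limit:
            break
        matches.append(candidate)
    return matches


PAST_QUERIES = sorted([
    "dance",
    "eclap",
    "eclap-de-2-1-1-user",
    "eclap-de-2-2-1-usergroup",
    "europeana",
])

if __name__ == "__main__":
    print(suggest("ecl", PAST_QUERIES))
```

Because the list is sorted, all entries sharing a prefix are adjacent, so each keystroke costs one binary search plus a short scan.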

ECLAP Survey

Indexing/Searching Reqs

• Enriching search experience
  • Results Sorting
  • Suggestions
• Large # of contents (~ 10^4-10^6)
  • External Indexing Service
• Hidden/Private contents management
• Monitoring Exceptions
  • Email notifications
• Search Engine Friendly (Google, Bing, Yahoo, etc.)
  • content site crawling
  • HTML dumping

External Indexing Service 1/3

• Set up an external service to avoid server overloading when building the index
• Taxonomization
• Indexing (with exceptions monitoring)
• Index Synchronization
  • Old index replacement with the new one
  • Index updating
  • Old contents cleaning (optional)

External Indexing Service 2/3

Taxonomization
• Has a cost: pre-computing
• Digital content
• Execution Rule (JS)
• Indexed with object records

[Figure: taxonomy tree. Performing Arts has the children Cinema and Music; Cinema has the children Documentary and Historical; Music has the children Classical and Pop. Example object taxonomy: Performing Arts, Cinema, Music, Documentary, Classical.]

Taxonomy          Parent
Performing Arts   -
Cinema            Performing Arts
Music             Performing Arts
Documentary       Cinema
Historical        Cinema
Classical         Music
Pop               Music
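The pre-computation can be read as expanding each object's taxonomy terms with all their ancestors, so that the full set is stored in the object's index record. The sketch below follows that reading using the Taxonomy/Parent pairs from the slide; the function name `taxonomize` is hypothetical, and the actual service runs as a JavaScript rule rather than Python.

```python
# Taxonomy -> Parent pairs from the slide (None marks the root).
PARENT = {
    "Performing Arts": None,
    "Cinema": "Performing Arts",
    "Music": "Performing Arts",
    "Documentary": "Cinema",
    "Historical": "Cinema",
    "Classical": "Music",
    "Pop": "Music",
}


def taxonomize(terms, parent=PARENT):
    """Expand the given terms with every ancestor up to the root."""
    expanded = set()
    for term in terms:
        # Walk upwards until the root or an already visited term is reached.
        while term is not None and term not in expanded:
            expanded.add(term)
            term = parent[term]
    return expanded


if __name__ == "__main__":
    print(sorted(taxonomize({"Documentary", "Classical"})))
```

An object tagged only with Documentary and Classical then also matches searches restricted to Cinema, Music or Performing Arts, without any tree traversal at query time.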

External Indexing Service 3/3

• Indexing with exceptions monitoring
  • Real-time notifying system
  • Event time and type (add, update)
  • Full stacktrace info
  • Customizable recipients
  • Object Indexing Recovery
    • Resource Parse Error
    • Metadata Indexing
• Index synchronization
  • During external indexing, contents may be updated/added/deleted on the original index
  • Need to update these contents on the index (state flag)

Indexed   External Indexed
1         1
0         1
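The exception monitoring and recovery steps might look like the sketch below. It reads "Object Indexing Recovery" as falling back to metadata-only indexing when the resource cannot be parsed, which is an interpretation of the slide; the record layout and the `parse_resource` and `notify` callbacks are hypothetical.

```python
import traceback
from datetime import datetime, timezone


def index_object(obj, event_type, parse_resource, notify):
    """Build an index record; on a parse error notify and keep metadata only."""
    doc = dict(obj["metadata"])
    try:
        doc["fulltext"] = parse_resource(obj["resource"])
    except Exception:
        # Report event time, event type and the full stack trace.
        notify({
            "time": datetime.now(timezone.utc).isoformat(),
            "type": event_type,  # "add" or "update"
            "object_id": obj["id"],
            "stacktrace": traceback.format_exc(),
        })
    return doc


def failing_parser(resource):
    """Stand-in parser that always fails, to exercise the recovery path."""
    raise ValueError(f"cannot parse {resource}")


if __name__ == "__main__":
    sent = []
    record = index_object(
        {"id": 42, "metadata": {"title": "Hamlet"}, "resource": "hamlet.pdf"},
        "add", failing_parser, sent.append)
    print(record)
    print(sent[0]["type"], sent[0]["object_id"])
```

Passing `notify` as a callback keeps the recipients customizable: the same function can send e-mail, write a log or collect events in a list as in the example.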

Search Engine Friendly

HTML dump service
• Java external service
• Periodically invoked by an AXCP rule
• Full metadata exporting
• Thumbnail
• Resource link
• Multilanguage
• Paginated results
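A paginated HTML dump of this kind can be sketched in a few lines. The real service is an external Java service; the Python below only illustrates the idea, and the record keys (`title`, `lang`, `link`, `thumbnail`) and the markup are assumptions.

```python
from html import escape


def dump_pages(records, page_size=2):
    """Render records as a list of static HTML pages, `page_size` per page."""
    pages = []
    for start in range(0, len(records), page_size):
        items = []
        for rec in records[start:start + page_size]:
            items.append(
                '<li lang="{lang}"><a href="{link}">'
                '<img src="{thumb}" alt=""> {title}</a></li>'.format(
                    lang=escape(rec["lang"]),
                    link=escape(rec["link"]),
                    thumb=escape(rec["thumbnail"]),
                    title=escape(rec["title"]),
                )
            )
        pages.append("<html><body><ul>\n" + "\n".join(items)
                     + "\n</ul></body></html>")
    return pages


# Made-up records, for illustration only.
RECORDS = [
    {"title": "Music & Dance", "lang": "en",
     "link": "/obj/1", "thumbnail": "/thumb/1.jpg"},
    {"title": "Teatro", "lang": "it",
     "link": "/obj/2", "thumbnail": "/thumb/2.jpg"},
    {"title": "Tanz", "lang": "de",
     "link": "/obj/3", "thumbnail": "/thumb/3.jpg"},
]

if __name__ == "__main__":
    for number, page in enumerate(dump_pages(RECORDS), start=1):
        print(f"--- page {number} ---")
        print(page)
```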

Conclusions

• Drupal integrated solution for user behavior tracking and analysis
  • Logging
  • Stat Data Graph
  • Online Survey
• External Indexing Service
  • Avoids server overloading
  • HA of query service
  • Error recovering
  • Detailed event notifying system
  • Index Optimization
• Dumping tool for portal contents (SEO)
  • Full metadata HTML exporting
  • Scheduled Service

Future Work

• Keep collecting data
• Deeper data analysis
  • User sessions
  • 1st, 2nd, ..., nth click average user behavior
• Depict a modular view of the system usage
  • Popularity/usability for each feature & functionality
• Social Network Analysis (SNA)
  • Huge population
  • User relationships, connections, friendships


Q & A

APPENDIX

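Architecture (former)

[Figure: architecture diagram. Components: Drupal on Apache HTTP with a Searching Module and an Indexing Module; Apache Tomcat hosting Apache Solr and the Indexing Service (XML/HTTP, JSP, SolrCell); AXCP Grid Node with Rule Scheduler and SolrJ Client, running the Index Rebuilder Rule (JS) and the Indexing Rule (JS).]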

Drupal

What is it?
• Open source content management platform
• Developed by Dries Buytaert in 2001
• Written in PHP
• Users: The Economist, Examiner.com, The White House, data.gov.uk
• Runs on a web server (e.g. Apache, IIS) and a database (e.g. MySQL, PostgreSQL)

Apache Lucene

What is it?
• High-performance, full-featured text search engine library (indexing and searching documents)
• Developed by Doug Cutting (2000, SourceForge), joined the Apache Software Foundation in 2001
• Written entirely in Java
• Users: Wikipedia, Technorati, Nabble, TheServerSide, Akamai, SourceForge

Apache Lucene

Features
• Ranked searching (best results returned first)
• Powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
• Fielded searching (e.g., title, author, contents)
• Date-range searching
• Sorting by any field
• Multiple-index searching with merged results
• Allows simultaneous update and searching

Apache Lucene

Features
• Documents added via IndexWriter
• Document = a collection of fields
• No config files, dynamic field typing
• Flexible text analysis: tokenizers, filters
• Search for documents via IndexSearcher

Hits = search(Query,Filter,Sort,topN)

Scoring: tf * idf * lengthNorm
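The scoring line can be made concrete with the per-term factors of Lucene's classic TF-IDF similarity as used in the 3.x line: tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), lengthNorm = 1 / sqrt(numTerms). The sketch below multiplies them as on the slide; it is a simplification that leaves out query normalization, the coordination factor and boosts, and the example numbers are invented.

```python
import math


def tf(freq):
    """Term frequency factor: grows with the square root of occurrences."""
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Field length normalization: shorter fields weigh more."""
    return 1.0 / math.sqrt(num_terms)


def term_score(freq, num_docs, doc_freq, num_terms):
    """Simplified single-term score: tf * idf * lengthNorm."""
    return tf(freq) * idf(num_docs, doc_freq) * length_norm(num_terms)


if __name__ == "__main__":
    # Same term statistics, short field versus long field.
    print(term_score(freq=4, num_docs=100, doc_freq=9, num_terms=16))
    print(term_score(freq=4, num_docs=100, doc_freq=9, num_terms=400))
```

The length normalization explains why a match in a short title tends to outrank the same match in a long body text even before any field boost is applied.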

Apache Solr

What is it?
• A full text search server based on Lucene (Lucene sub-project)
• Developed by Yonik Seeley at CNET Networks (2004), donated to the Apache Software Foundation (2006)
• Written in Java, deployable as a WAR
• Users: CNET Reviews, CNET Channel, shopper.com, news.com, nines.org, krugle.com, oodle.com, booklooker.de

Apache Solr

Features
• Advanced Full-Text Search Capabilities
• Optimized for High Volume Web Traffic
• Standards Based Open Interfaces (XML, JSON, HTTP)
• Web Administration Interface
• Server statistics exposed over JMX for monitoring
• Scalability, efficient Replication to other Solr Search Servers
• Flexible and Adaptable with XML configuration
• Extensible Plugin Architecture
