Indexing and Searching Cross Media Content in a Social Network
Pierfrancesco Bellini, Daniele Cenni, Paolo Nesi
University of Florence
Department of Systems and Informatics
Distributed Systems and Internet Technology Laboratory
ECLAP Conference, May 7-9, 2012
ECLAP Social Network
ECLAP is a Digital Library on Performing Arts connected with Europeana
ECLAP is a Social Network (blogs, forums, comments, tagging, voting, …)
Goals/Requirements
  Develop an Indexing/Searching solution for the ECLAP Social Network allowing:
    Indexing multilingual cross media content metadata and data (e.g. documents)
    Indexing portal blogs, forums, events, group pages, comments, etc.
    Efficient multilingual search (keyword search and advanced search) supporting:
      misspelled words (e.g. shespeare)
      partial word search
    Sorting and filtering search results
    Re-indexing the whole data without blocking the system
    Logging and monitoring user activity
    …
  Evaluate the Indexing/Searching service
ECLAP Data Model
[Class diagram. Entities: Object, Video, Audio, Document, Image, AVObject, Annotation, Group/Channel, Collection, Playlist, Forum, WebPage, Blog, Comment, Content, TaxonomyTerm, Metadata (PerformingArts, Dublin Core, Technical). Associations carry multiplicities 0..n, 1..n, 1..2 and 1.]
Indexing
  Indexing & Search system
    Based on Apache Solr
  Multilingual aspects
    Translate the metadata or translate the query?
    We use metadata translation
  Indexing schema
    Dublin Core + DCTerms (multi language)
    Performing Arts
    Technical (provider, content type, GPS, IPR, duration, quality, …)
    Groups associations (multi language)
    Taxonomy associations (multi language)
    Comments & multi language tags
    Full text of the textual digital resources
Indexing
  Indexed information per media type (ML = multilingual)
  Columns: DC (ML), Tech, Perf. Arts, Full Text, Taxonomy/Group (ML), Comment/Tags (ML), Votes
    Audio/Video/Image: Y Y Y Y Y Y
    Document (pdf, doc, …): Y Y Y Y Y Y Y
    CrossMedia (html, MPEG21, …): Y Y Y Y Y Y Y
    Aggregations (playlist, collection, …): Y Y Y Y Y Y
    Info text (blog, web pages, forum, events, …): (Y) Y Y
Indexing
  Multilingual fields
    title_en, title_it, title_de, title_fr, title_ca, …
  Catch-all fields

    Catch-all field   Component fields                 Boost weight
    text              pdf_*, doc_*, ppt_*, htm_*, …    1.0
    body              body_*                           0.5
    title             title_*                          3.1
    description       description_*                    2.0
    contributor       contributor_*                    0.8
    subject           subject_*                        1.5
    taxonomy          taxonomy_*                       0.8
    PerformingArts    PerformingArtsMetadata.#         1.0
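
The boost weights in the catch-all table can be turned into query-time field boosts. The sketch below is only an illustration: it assumes the boosts are applied through Solr's `qf` parameter with the `edismax` query parser, which the slides do not state, and the helper names `build_qf` and `build_params` are hypothetical.

```python
# Boost weights copied from the catch-all field table.
BOOSTS = {
    "text": 1.0,
    "body": 0.5,
    "title": 3.1,
    "description": 2.0,
    "contributor": 0.8,
    "subject": 1.5,
    "taxonomy": 0.8,
    "PerformingArts": 1.0,
}


def build_qf(boosts):
    """Return a 'field^boost' list, highest boost first (ties by field name)."""
    ordered = sorted(boosts.items(), key=lambda kv: (-kv[1], kv[0]))
    return " ".join(f"{field}^{weight}" for field, weight in ordered)


def build_params(user_query, boosts=BOOSTS):
    """Assemble request parameters (assumption: edismax parser is used)."""
    return {"defType": "edismax", "q": user_query, "qf": build_qf(boosts)}


if __name__ == "__main__":
    print(build_params("dario fo"))
```

With these weights a match in a title counts roughly six times as much as the same match in a body field.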
Indexing
  Re-indexing
    When the indexing schema changes or the index is corrupted, the search system should not be blocked
    Re-indexing is done on a separate indexing machine while the production system keeps using the current index
    During re-indexing, newly uploaded or modified content is marked so that it is re-indexed once the new index is put into production
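
The non-blocking re-indexing procedure can be illustrated with a small in-memory simulation. This is a sketch under simplifying assumptions, not the ECLAP implementation: plain dictionaries stand in for the content repository and the Solr indexes, and the class and method names are hypothetical.

```python
class SearchService:
    """Toy model: queries keep using `index` while a new one is built."""

    def __init__(self, documents):
        self.store = dict(documents)   # content repository: id -> text
        self.index = dict(documents)   # production index
        self.rebuilding = False
        self.dirty = set()             # ids changed while rebuilding

    def upsert(self, doc_id, text):
        self.store[doc_id] = text
        self.index[doc_id] = text      # production index stays usable
        if self.rebuilding:
            self.dirty.add(doc_id)     # mark for re-indexing after the swap

    def start_rebuild(self):
        """Return a snapshot to be indexed on the separate machine."""
        self.rebuilding = True
        self.dirty.clear()
        return dict(self.store)

    def finish_rebuild(self, new_index):
        """Replay marked content on the new index, then put it in production."""
        for doc_id in self.dirty:
            new_index[doc_id] = self.store[doc_id]
        self.index = new_index
        self.dirty.clear()
        self.rebuilding = False


if __name__ == "__main__":
    svc = SearchService({"a": "hamlet"})
    snapshot = svc.start_rebuild()
    svc.upsert("b", "macbeth")         # arrives during the rebuild
    svc.finish_rebuild(dict(snapshot))
    print(svc.index)
```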
Searching
  Full text search
    Uses the catch-all fields to search for keywords in the most important fields in all languages (title, description, text, body, subject, …)
  Fuzzy search
    Allows matching mistyped words
  Deep search
    Allows searching for partial words
  Relevance & boosting of terms
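
Fuzzy and partial-word matching can be illustrated outside the search engine. In Lucene query syntax these would be written as `shespeare~` (fuzzy) and `shake*` (prefix wildcard); the slides do not say which mechanism ECLAP uses, so the sketch below simply assumes an edit-distance threshold of 2 for fuzzy search and substring matching for deep search. All function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(query, term, max_edits=2):
    """True when the mistyped query is within max_edits of the term."""
    return edit_distance(query.lower(), term.lower()) <= max_edits


def partial_match(query, term):
    """True when the query occurs anywhere inside the term."""
    return query.lower() in term.lower()


if __name__ == "__main__":
    print(edit_distance("shespeare", "shakespeare"))
    print(fuzzy_match("shespeare", "Shakespeare"))
    print(partial_match("speare", "Shakespeare"))
```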
Searching Faceted search
Searching Advanced search
Search Facility Assessment
  Analysis performed over 3 months
    11294 visits (6032 unique visits)
    62768 page views (avg 5.76 pages per visit)
    7.29 minutes of time spent on the portal
    30502 content accesses (view, play and download)
Search Facility Assessment

  Users                     # Full Text Query   # Faceted Query   # Last Posted List   # Featured List   # Popular List
  simple registered         323                 24                4                    22                17
  partners                  1094                21                27                   19                9
  anonymous                 2634                147               234                  302               213
  Total                     4051                192               265                  343               239
  Clicks after query/list   1564                200               318                  2799              231
Search Facility Assessment Click order distribution
First page
Conclusions
  The solution allows indexing multilingual metadata and texts
  Searching & filtering results
  The search facility assessment shows that search is an actively used feature
Context & Assessment
  Context
    Social Network
      User and content items
    Content distribution portal
      Video on demand portal
    Archive, digital library, Performing Arts
      http://www.eclap.eu
  Assessment
    User behavior
      Log user actions on the Web portal
    User happiness
      Measure the level of user satisfaction about the exposed services
Logging User Profile
  User Profile
    Registered or anonymous, uid (user id)
    Timestamp YY-mm-dd hh:mm:ss
    IP address, Proxy type etc.
    Platform (OS, Browser)
    GeoIP data (Country, Region, City)
    Friends, connections
      Betweenness, Eccentricity
    Joined groups
    User preferred contents
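
The logged profile fields map naturally onto a record type. The following is a minimal sketch; the class name, field names and types are assumptions made for illustration and do not come from the ECLAP code base.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class UserLogRecord:
    uid: Optional[int]            # None for anonymous visitors
    timestamp: datetime
    ip_address: str
    platform: str                 # OS and browser
    country: str = ""
    region: str = ""
    city: str = ""
    friends: List[int] = field(default_factory=list)
    groups: List[str] = field(default_factory=list)

    def is_registered(self):
        return self.uid is not None

    def formatted_timestamp(self):
        """Render the timestamp in the YY-mm-dd hh:mm:ss layout."""
        return self.timestamp.strftime("%y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    record = UserLogRecord(
        uid=None,
        timestamp=datetime(2011, 9, 1, 10, 30, 0),
        ip_address="192.0.2.1",   # documentation-only address
        platform="Linux/Firefox",
    )
    print(record.is_registered(), record.formatted_timestamp())
```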
Understanding User behavior
  Online survey
    A simple module, on the right side of the portal
    Presenting 3-4 questions per topic (depending on the portal section currently visited)
  Stat Drupal Modules
    Custom implemented modules
    Log User Activity
    Keep track and depict main figures about portal activity
    Can be filtered by date, user, type of content, group, type of activity (content enrichment, social promotion, networking etc.)
  Google Analytics
Understanding User behavior
  Top Metrics
    Avg # Visits/User
    Avg # Queries/User
    Avg # Clicks/User
    Avg Visit duration
    Avg Query length
    Query refinement rate
    Next Page Click Rate
    Back Page Click Rate
    Frequency of searching (once/day, week etc.)
    Success of searching (assessment...)
    …
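
A few of the metrics listed above can be derived from a flat event log. The sketch below assumes a hypothetical log format of `(user, kind, payload)` tuples and measures query length in words; both choices are illustrative and are not taken from the slides.

```python
from collections import defaultdict


def top_metrics(events):
    """events: iterable of (user, kind, payload), kind is 'query' or 'click'."""
    queries = defaultdict(list)
    clicks = defaultdict(int)
    users = set()
    for user, kind, payload in events:
        users.add(user)
        if kind == "query":
            queries[user].append(payload)
        elif kind == "click":
            clicks[user] += 1
    all_queries = [q for qs in queries.values() for q in qs]
    n_users = max(len(users), 1)
    n_queries = max(len(all_queries), 1)
    return {
        "avg_queries_per_user": len(all_queries) / n_users,
        "avg_clicks_per_user": sum(clicks.values()) / n_users,
        "avg_query_length": sum(len(q.split()) for q in all_queries) / n_queries,
    }


if __name__ == "__main__":
    log = [
        ("u1", "query", "dario fo"),
        ("u1", "click", "doc1"),
        ("u2", "query", "hamlet"),
        ("u2", "query", "hamlet florence theatre"),
        ("u2", "click", "doc2"),
        ("u2", "click", "doc3"),
    ]
    print(top_metrics(log))
```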
Logging User Behavior
  Logging user activities on the portal
    Downloads/Views
    Queries
    Anonymous/Registered portal accesses (login/logout)
    Adding/Updating/Deleting digital contents
    Menu clicks
    Content Upload
    Content Management
    Social Promotion & Networking
Logging User Behavior
  Content Accesses (Download/View)
    Axmedis Content
      Pdf, Document, Video, Playlist, Slide, Flash, Image, Excel, Archive, Audio, Tool, Collection
    Drupal Content
      Page, Blog, Event, Forum, Group, Comment
    Distribution of Content Access per
      Access Type, Portal, Platform, Section, Locale, Country, Region, City, Axoid, Nid, Content Type, Partner, User, Timestamp
Logging User Behavior
  Queries (Simple, Faceted, Advanced)
    Distribution of Queries per
      User, Content type, Device, IP, User Agent, Query Type, Country, Region, City, Locale, Filter (faceted)
    Query Cloud
    Keyword Cloud
  IPR Wizard
    Definition and usage of IPR Models
  Metadata Editor
    Access and usage
    Add, Edit metadata
  Video Annotations
    Personal content
    Other users content
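
The query distributions and the keyword cloud mentioned above boil down to frequency counts over the query log. A minimal sketch follows, assuming each logged query is a dictionary with hypothetical keys such as `text`, `type` and `country`.

```python
from collections import Counter


def distribution(queries, attribute):
    """Count logged queries per value of one attribute (e.g. 'type')."""
    return Counter(q[attribute] for q in queries)


def keyword_cloud(queries):
    """Count how often each keyword appears across all query texts."""
    words = Counter()
    for q in queries:
        words.update(q["text"].lower().split())
    return words


if __name__ == "__main__":
    log = [
        {"text": "Dario Fo", "type": "simple", "country": "IT"},
        {"text": "dario fo theatre", "type": "faceted", "country": "IT"},
        {"text": "dance", "type": "simple", "country": "NL"},
    ]
    print(distribution(log, "type"))
    print(keyword_cloud(log).most_common(2))
```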
Logging User Behavior
  Social Promotion & Networking
    Analysis of
      Eccentricity
      Betweenness
      Connections
    Creation, Access of Public/Private Web Pages
    Activity on Forums, Blogs, Groups or between users
      New Contents
      Comments to Objects/Web Pages
      Invited People
      Featured Objects
      Recommendations, suggested content
      Export/Import of links to/from other SN
      Private Messages
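
Eccentricity, one of the measures listed above, is the largest shortest-path distance from a user to any other reachable user in the friendship graph. The sketch below computes it with a breadth-first search; the adjacency-dictionary representation and the sample graph are assumptions for illustration.

```python
from collections import deque


def distances_from(graph, start):
    """Breadth-first search returning hop counts from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def eccentricity(graph, node):
    """Largest distance from node to any user reachable from it."""
    return max(distances_from(graph, node).values())


if __name__ == "__main__":
    # Hypothetical friendship chain: a - b - c - d
    friends = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    print({user: eccentricity(friends, user) for user in friends})
```

Users with low eccentricity sit near the centre of the network, while high values indicate peripheral users.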
Logging User Behavior
  Menu Clicks
    Distribution of clicks per
      User, IP, Locale, Timestamp etc.
    LAST POSTED, FEATURED, CALENDAR, ADVANCED SEARCH, UPLOAD AND INGEST, POPULAR, MY CONTENT, MY GROUPS, MY COLLEAGUES, GET AFFILIATED, TERMS OF USE, PRIVACY POLICY, TOP RATED, COURSES, LESS POPULAR, UPLOAD NEW CONTENT, etc.
  Ranking/Voting
    # of ranked items
    Distribution per
      User, IP, Locale, Timestamp etc.
  QR Code
    Access from Mobile Devices
  Workflow
    Distribution of Workflow Type
  Content Upload
    Distribution of uploads per
      User, Partner, Timestamp
Content Access

  Affiliation                        # View/Play   # Download
  DSI                                46            0
  Not partners/Affiliated            1292          14
  Partners/Affiliated (except DSI)   6712          119
  Public Users                       21418         947

  Affiliation                        # View/Play   # Download
  DSI                                3             0
  Not partners/Affiliated            100           4
  Partners/Affiliated (except DSI)   218           11
  Public Users                       2225          869

September 1st – November 30th 2011
Menu Clicks

  Menu                         # Clicks
  ABOUT->ECLAP DESCRIPTION     671
  EVENTS->PAST AND FUTURE      536
  SEARCH->GROUPS               524
  ABOUT->ECLAP NEWS BLOG       463
  CONTENT->LAST POSTED         265
  CONTENT->FEATURED            343
  HOWTO->UPLOAD AND INGEST     330
  SEARCH->ADVANCED SEARCH      314
  EVENTS->CALENDAR             298
  ABOUT->ECLAP PARTNERS        269
  ABOUT->MAIN CONTACT          249
  CONTENT->POPULAR             239

September 1st – November 30th 2011
Search

  Affiliation                        # Simple Queries   # Faceted Queries
  DSI                                13                 0
  Not partners/Affiliated            323                24
  Partners/Affiliated (except DSI)   1094               21
  Public Users                       2634               147

  Affiliation                        # Advanced Queries
  DSI                                0
  Not partners/Affiliated            18
  Partners/Affiliated (except DSI)   4

September 1st – November 30th 2011
Drupal Stat Metrics Content Access per nid
September 1st – November 30th 2011
Drupal Stat Metrics Views by Query
September 1st – November 30th 2011
Drupal Stat Metrics Content Access per Platform
September 1st – November 30th 2011
Understanding User behavior Drupal Stats (collapsible menus on the right)
Google Analytics vs Drupal Stats
  Google Analytics
    Pros
      Traffic source data
      Bounce rate
      Recency (since when)
      Loyalty (how often)
      Session times
    Cons
      IP approach, each IP is considered a unique visitor
      Can’t deal with specific actions on the portal (e.g. downloads, queries)
  Drupal Stats
    Pros
      Identity approach
      Actions: Download, User Access, Queries
      Content type filtering
    Cons
      Can’t deal with traffic source data and bounce rate
      Session time is only a rough approximation
Sorting Results
  Sorting by
    Upload Time (date the document was first uploaded)
    Update Time (date the document was last updated)
    Score (document relevance to the search query)
  Combined with faceting and paging
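
How sorting combines with faceting and paging can be shown on an in-memory result list. This is a sketch only: in the portal these steps would be delegated to the search engine, and the field names (`score`, `upload_time`, `update_time`, `type`) are hypothetical.

```python
def sort_and_page(results, sort_by="score", page=0, page_size=10, facet=None):
    """Filter by an optional (field, value) facet, sort descending, slice a page."""
    if facet is not None:
        field, value = facet
        results = [r for r in results if r.get(field) == value]
    ordered = sorted(results, key=lambda r: r[sort_by], reverse=True)
    start = page * page_size
    return ordered[start:start + page_size]


if __name__ == "__main__":
    hits = [
        {"id": 1, "type": "video", "score": 0.4, "upload_time": "2011-09-03", "update_time": "2011-10-01"},
        {"id": 2, "type": "document", "score": 0.9, "upload_time": "2011-09-01", "update_time": "2011-11-20"},
        {"id": 3, "type": "video", "score": 0.7, "upload_time": "2011-10-15", "update_time": "2011-10-15"},
    ]
    # Newest uploads first, videos only, first page of two results.
    print(sort_and_page(hits, "upload_time", page=0, page_size=2, facet=("type", "video")))
```

ISO-formatted date strings are used so that lexicographic order matches chronological order.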
Suggestions
  Real time: while the user types a query, similar searches are suggested
    ecl…
      eclap
      eclap-de-2-1-1-user
      eclap-de-2-2-1-usergroup
      …
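
Type-ahead suggestions of this kind can be approximated with a prefix lookup over a sorted list of known searches. The sketch below uses binary search; it is an assumption about one possible approach, not a description of how the portal produces its suggestions, and the entries other than the three `eclap…` examples are invented sample data.

```python
from bisect import bisect_left


def suggest(sorted_queries, prefix, limit=5):
    """Return up to `limit` entries of a sorted list that start with prefix."""
    start = bisect_left(sorted_queries, prefix)
    matches = []
    for candidate in sorted_queries[start:]:
        if not candidate.startswith(prefix) or len(matches) == limit:
            break
        matches.append(candidate)
    return matches


if __name__ == "__main__":
    known = sorted([
        "eclap",
        "eclap-de-2-1-1-user",
        "eclap-de-2-2-1-usergroup",
        "dario fo",      # invented sample entry
        "europeana",     # invented sample entry
    ])
    print(suggest(known, "ecl"))
```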
ECLAP Survey
Indexing/Searching Reqs
  Enriching search experience
    Results Sorting
    Suggestions
  Large # of contents (~10^4 to 10^6)
    External Indexing Service
  Hidden/Private contents management
  Monitoring Exceptions
    Email notifications
  Search Engine Friendly (Google, Bing, Yahoo etc.)
    content site crawling
    HTML dumping
External Indexing Service 1/3
  Set up an external service to avoid server overloading when building the index
    Taxonomization
    Indexing (with exceptions monitoring)
    Index Synchronization
    Old Index replacement with new one
    Index updating
    Old contents cleaning (optional)
External Indexing Service 2/3
  Taxonomization
    Has a cost, hence pre-computing
    Digital content
    Execution Rule (JS)
    Indexed with object records

  Taxonomy tree
    Performing Arts
      Cinema
        Documentary
        Historical
      Music
        Classical
        Pop

  Object taxonomy (example): Performing Arts, Cinema, Music, Documentary, Classical

  Taxonomy          Parent
  Performing Arts   -
  Cinema            Performing Arts
  Music             Performing Arts
  Documentary       Cinema
  Historical        Cinema
  Classical         Music
  Pop               Music
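
Pre-computing the taxonomy of an object amounts to expanding its terms with all their ancestors, using the parent table. A minimal sketch of that expansion follows; the function name is hypothetical and the actual rule runs as JavaScript on AXCP, which is not reproduced here.

```python
# Parent table copied from the slide; the root has no parent.
PARENT = {
    "Performing Arts": None,
    "Cinema": "Performing Arts",
    "Music": "Performing Arts",
    "Documentary": "Cinema",
    "Historical": "Cinema",
    "Classical": "Music",
    "Pop": "Music",
}


def with_ancestors(terms, parent=PARENT):
    """Return the given terms plus every ancestor up to the root."""
    expanded = set()
    for term in terms:
        while term is not None and term not in expanded:
            expanded.add(term)
            term = parent[term]
    return expanded


if __name__ == "__main__":
    print(sorted(with_ancestors(["Documentary", "Classical"])))
```

Storing the expanded set with the object record lets a search for "Cinema" also return objects tagged only as "Documentary".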
External Indexing Service 3/3
  Indexing with exceptions monitoring
    Real-time notifying system
    Event time and type (add, update)
    Full stack trace info
    Customizable recipients
    Object Indexing Recovery
      Resource Parse Error
      Metadata Indexing
  Index synchronization
    During external indexing, contents may be updated/added/deleted on the original index
    These contents need to be updated on the new index (state flag)

    Indexed   External Indexed
    1         1
    0         1
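
The exception monitoring described above can be sketched as an indexing loop that keeps going after a failure and reports each one with its time, event type and stack trace. The function and parameter names are hypothetical, and `notify` is a stand-in for whatever delivery channel (for example email) is actually used.

```python
import traceback
from datetime import datetime, timezone


def index_with_monitoring(objects, index_fn, notify, recipients):
    """Index each (id, event_type, payload); report failures, never abort."""
    indexed = []
    for obj_id, event_type, payload in objects:
        try:
            index_fn(obj_id, payload)
            indexed.append(obj_id)
        except Exception:
            notify({
                "time": datetime.now(timezone.utc).isoformat(),
                "type": event_type,                 # add or update
                "object": obj_id,
                "stacktrace": traceback.format_exc(),
                "recipients": list(recipients),
            })
    return indexed


if __name__ == "__main__":
    def flaky_index(obj_id, payload):
        if payload is None:
            raise ValueError("resource parse error")

    sent = []
    done = index_with_monitoring(
        [("a", "add", "text"), ("b", "update", None)],
        flaky_index,
        sent.append,
        ["admin@example.org"],   # placeholder recipient
    )
    print(done, len(sent))
```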
Search Engine Friendly
  HTML dump service
    Java external service
    Periodically invoked by an AXCP rule
    Full metadata exporting
    Thumbnail
    Resource link
    Multilanguage
    Paginated results
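
A paginated HTML dump of content metadata, with thumbnail and resource link, could look roughly like the sketch below. The real service is an external Java program; this Python version only illustrates the idea, and the record layout (`link`, `thumbnail`, `metadata`) and the example.org URLs are invented for the example.

```python
from html import escape


def dump_pages(records, page_size=2):
    """Render records as static HTML pages, page_size records per page."""
    pages = []
    for start in range(0, len(records), page_size):
        items = []
        for rec in records[start:start + page_size]:
            rows = "".join(
                f"<dt>{escape(key)}</dt><dd>{escape(value)}</dd>"
                for key, value in rec["metadata"].items()
            )
            link = escape(rec["link"], quote=True)
            thumb = escape(rec["thumbnail"], quote=True)
            items.append(
                f'<li><a href="{link}"><img src="{thumb}" alt=""></a><dl>{rows}</dl></li>'
            )
        pages.append("<html><body><ul>" + "".join(items) + "</ul></body></html>")
    return pages


if __name__ == "__main__":
    sample = [
        {"link": "http://example.org/1", "thumbnail": "http://example.org/1.png",
         "metadata": {"title_en": "Romeo & Juliet", "title_it": "Romeo e Giulietta"}},
        {"link": "http://example.org/2", "thumbnail": "http://example.org/2.png",
         "metadata": {"title_en": "Hamlet"}},
        {"link": "http://example.org/3", "thumbnail": "http://example.org/3.png",
         "metadata": {"title_en": "Macbeth"}},
    ]
    print(len(dump_pages(sample)))
```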
Conclusions
  Drupal integrated solution for user behavior tracking and analysis
    Logging
    Stat Data Graph
    Online Survey
  External Indexing Service
    Avoids server overloading
    HA of query service
    Error recovering
    Detailed event notifying system
    Index Optimization
  Dumping tool for portal contents (SEO)
    Full metadata HTML exporting
    Scheduled Service
Future Work
  Keep collecting Data
  Deeper Data Analysis
    User Sessions
      1st, 2nd..., nth click average user behavior
    Depict a modular view of the system usage
      Popularity/Usability for each feature & functionality
  Social Network Analysis (SNA)
    Huge Population
    User relationships, connections, friendships
Q & A
APPENDIX
Architecture (former)
[Architecture diagram. Components: Drupal on Apache HTTP (Searching Module, Indexing Module); Apache Tomcat (Apache Solr, Indexing Service; XML/HTTP, JSP, SolrCell); AXCP Grid Node (Rule Scheduler, SolrJ Client, Index Rebuilder Rule JS, Indexing Rule JS)]
Drupal
  What is it?
    Open source content management platform
    Developed by Dries Buytaert in 2001
    Written in PHP
    Users: The Economist, Examiner.com, The White House, data.gov.uk
    Runs on a Web server (e.g. Apache, IIS) and a database (e.g. MySQL, PostgreSQL)
Apache Lucene
  What is it?
    High-performance, full-featured text search engine library (indexing and searching documents)
    Developed by Doug Cutting (2000) on SourceForge, joined the Apache Software Foundation in 2001
    Written entirely in Java
    Users: Wikipedia, Technorati, Nabble, TheServerSide, Akamai, SourceForge
Apache Lucene
  Features
    Ranked searching (best results returned first)
    Powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
    Fielded searching (e.g., title, author, contents)
    Date-range searching
    Sorting by any field
    Multiple-index searching with merged results
    Allows simultaneous update and searching
Apache Lucene
  Features
    Documents added via IndexWriter
    Document = a collection of fields
    No config files, dynamic field typing
    Flexible text analysis: tokenizers, filters
    Search for documents via IndexSearcher
      Hits = search(Query, Filter, Sort, topN)
    Scoring: tf * idf * lengthNorm
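
The scoring product tf * idf * lengthNorm can be worked through on a toy corpus. The component formulas below (square root of term frequency, `1 + ln(N / (df + 1))`, inverse square root of document length) follow the shape of Lucene's classic TF-IDF similarity, but which exact variants apply is an assumption here, and real Lucene scoring includes further factors that this sketch leaves out.

```python
import math


def tf(freq):
    """Term frequency factor: grows with the square root of occurrences."""
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Shorter fields get a higher weight."""
    return 1.0 / math.sqrt(num_terms)


def score(term, doc_tokens, corpus):
    """tf * idf * lengthNorm for one term against one tokenized document."""
    freq = doc_tokens.count(term)
    if freq == 0:
        return 0.0
    doc_freq = sum(1 for doc in corpus if term in doc)
    return tf(freq) * idf(len(corpus), doc_freq) * length_norm(len(doc_tokens))


if __name__ == "__main__":
    corpus = [
        ["hamlet", "prince", "hamlet"],
        ["macbeth", "king"],
        ["hamlet", "ghost", "night", "castle"],
    ]
    for doc in corpus:
        print(doc, round(score("hamlet", doc, corpus), 4))
```

The first document ranks highest because the term occurs twice in a short field.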
Apache Solr
  What is it?
    A full text search server based on Lucene (Lucene sub-project)
    Developed by Yonik Seeley at CNET Networks (2004), donated to the Apache Software Foundation (2006)
    Written in Java, deployable as a WAR
    Users: CNET Reviews, CNET Channel, shopper.com, news.com, nines.org, krugle.com, oodle.com, booklooker.de
Apache Solr Features
  Advanced Full-Text Search Capabilities
  Optimized for High Volume Web Traffic
  Standards Based Open Interfaces (XML, JSON, HTTP)
  Web Administration Interface
  Server statistics exposed over JMX for monitoring
  Scalability, efficient Replication to other Solr Search Servers
  Flexible and Adaptable with XML configuration
  Extensible Plugin Architecture