web clustering engines
Post on 13-Jan-2017
WEB CLUSTERING ENGINES
ARUN TR1412130413S7CS,CEA
Search Engine?
• Search engines are an invaluable tool for retrieving information from the Web. In response to a user query, they return a list of results ranked in order of relevance to the query.
• E.g.: Google, Yahoo, Credo, Grokker, etc.
Arun TR14,S7CS
Flat Ranked vs. Clustered
• Google (flat-ranked search engine)
• Yippy (web clustering engine)
Why Web Clustering Engines?
• Conventional engines are not very effective for ambiguous queries.
• The results a conventional search engine returns for such a query are mixed together in a single list, so irrelevant items appear among the relevant ones.
This is where clustering of search results comes into the picture.
Web Clustering Engines?
• Search engine
• Clustering is the act of grouping similar objects into sets.
• The distance between objects in the same cluster (intra-cluster variation) should be minimal.
• The distance between objects in different clusters (inter-cluster variation) should be maximal.
• These systems group the results returned by a search engine into a hierarchy of labeled clusters (also called categories).
Web clustering engines:
1. Northern Light (predefined set of clusters)
2. Vivísimo (cluster labels dynamically generated)
3. Clusty
4. Grokker
5. KartOO
6. Lingo3G
7. CREDO, etc.
Main advantages of the cluster hierarchy
• It provides shortcuts to the items that relate to the same meaning.
• It allows better topic understanding.
• It favors systematic exploration of search results.
Issues in Implementation of Clusters
• Short input data description.
• Meaningful labels.
• Selection of similarity measure.
• Grouping of objects into clusters.
• Computational efficiency.
• Unknown number of clusters.
Architecture & Techniques
1. Search Results Acquisition
• Provides input for the rest of the system.
• Based on the query, the acquisition component must deliver 50 to 500 results, each of which should contain a title, a contextual snippet, and the URL.
• The source of search results can be any public search engine, such as Google or Yahoo.
• Results are fetched from these search engines through their APIs.
2. Preprocessing of Search Results
• The primary aim is to convert the search results into ‘features’.
Steps:
i. Language identification
ii. Tokenization
iii. Stemming
iv. Feature selection
ii. Tokenization: The text of each search result is split into a sequence of basic independent units called tokens, each of which represents a word, a number, or a symbol.
This is more complex for languages that do not separate words with white space (such as Chinese) or whose text switches direction (such as Arabic).
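The tokenization step can be sketched as follows. This is a minimal illustration that assumes whitespace-delimited text only; the function name `tokenize` and the sample sentence are made up for the example, and the harder cases mentioned above (Chinese, Arabic) are not handled.

```python
import re

def tokenize(text):
    """Split text into lowercase word, number, and symbol tokens.

    Assumption: words are separated by white space, so this does not
    cover languages such as Chinese.
    """
    # \w+ matches a run of letters/digits; [^\w\s] matches one symbol.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = tokenize("Yippy returns 50 results, grouped by topic!")
print(tokens)
```

Each element of `tokens` is one word, number, or symbol, which is the form the later stemming and feature-selection steps work on.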
iii. Stemming: Remove the inflectional prefixes and suffixes of each word to reduce the different grammatical forms of the word to a common base form called a ‘stem’.
E.g.: connected, connecting, interconnection → ‘connect’
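A toy affix stripper that reproduces the ‘connect’ example is sketched below. The prefix and suffix lists are hypothetical and chosen only for this example; real stemmers use far more careful rules (the Porter stemmer, for instance, strips suffixes only).

```python
# Hypothetical affix lists, just large enough for the slide's example.
PREFIXES = ("inter",)
SUFFIXES = ("ing", "ion", "ed", "s")

def naive_stem(word, min_len=3):
    """Strip at most one known prefix and one known suffix from a word."""
    word = word.lower()
    for prefix in PREFIXES:
        # Keep at least min_len characters so short words are not destroyed.
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word

for w in ("connected", "connecting", "interconnection"):
    print(w, "->", naive_stem(w))
```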
iv. Feature selection:
• Extract features for each search result present in the input.
• Features are atomic entities by which we can describe an object and represent its most important characteristics to an algorithm.
• Features vary from single words to tuples of words.
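One simple way to obtain features that range from single words to tuples of words is to take all word n-grams up to a small length. The sketch below assumes that reading; `extract_features` and `max_n` are illustrative names, not taken from any particular engine.

```python
def extract_features(tokens, max_n=2):
    """Return every run of 1..max_n consecutive tokens as a tuple."""
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append(tuple(tokens[i:i + n]))
    return features

print(extract_features(["web", "clustering", "engine"]))
```

With `max_n=2` the result contains the three single words followed by the two adjacent word pairs.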
How can we represent a feature/text?
• Vector Space Model (VSM)
• A document d is represented in the VSM as a vector $[w_{t_0}, w_{t_1}, \dots, w_{t_n}]$, where $t_0, t_1, \dots, t_n$ is a set of words/features and $w_{t_i}$ is the weight/importance of feature $t_i$.
E.g.: d → “Polly had a dog and the dog had Polly”
[Figure: VSM representation]
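A possible VSM vector for the example sentence is computed below. The weighting scheme is an assumption: raw term frequency is used here because the weights from the original slide cannot be recovered, and the alphabetical ordering of the vocabulary is likewise just a convention for the example.

```python
from collections import Counter

def vsm_vector(text, vocabulary):
    """Represent text as term-frequency weights over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

doc = "Polly had a dog and the dog had Polly"
vocabulary = sorted(set(doc.lower().split()))  # t0 .. tn, alphabetical
vector = vsm_vector(doc, vocabulary)

print(vocabulary)  # ['a', 'and', 'dog', 'had', 'polly', 'the']
print(vector)      # [1, 1, 2, 2, 2, 1]
```

Other weighting schemes, such as tf-idf, fit the same vector layout; only the values of the weights change.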
3. Cluster Construction & Labelling
• The set of search results, along with their features, is the input to the clustering algorithm, which builds and labels the clusters.
• Two types of algorithms:
→ Data-centric clustering algorithms
→ Description-aware algorithms (STC-related)
• Each cluster created should be aptly labeled. A label should be:
i. Unique
ii. Unambiguous
iii. Comprehensive
iv. Sensible with respect to the content
Data Centric Clustering Algorithm
• Similar to Agglomerative Hierarchical Clustering (AHC) with an average-link merge criterion.
• It starts with an initial clustering of a collection of documents into a set of k clusters (scatter).
• At query time, the user selects the clusters of interest (gather), and the system re-clusters those documents.
• The process repeats until a small cluster of relevant documents is found.
Function of a Scatter/Gather system
• Bottom-up approach: initially, each document is in its own cluster.
• Build a distance matrix for every pair of clusters. Merge the two closest clusters and rebuild the distance matrix with the merged pair replaced by a single cluster.
• Continue this process until the desired number of clusters, k, is reached.
• The complexity of this algorithm is at least O(n²), where n is the number of documents, since the distance matrix alone has n² entries.
• Another data-centric algorithm is k-means clustering.
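The bottom-up merging procedure can be sketched as below, using the average-link criterion mentioned earlier. This is an illustrative sketch, not the code of any engine: documents are assumed to be already converted to numeric vectors, Euclidean distance is an arbitrary choice, and for brevity the distances are recomputed on every pass instead of maintaining a distance matrix.

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def average_link(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def ahc(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # one cluster per document
    while len(clusters) > k:
        # Pick the pair of clusters with the smallest average-link distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_link(clusters[p[0]], clusters[p[1]], points),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(ahc(points, 2))  # clusters are lists of document indices
```

Caching the pairwise distances in a matrix, as the slide describes, avoids the repeated recomputation done here.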
Difficulties in Data centric algorithms
• None of these algorithms is incremental: they cannot take each document as it arrives from the web, “clean” it, and add it to the existing model.
• They lack meaningful cluster labels.
4.Visualization of Clustered Results
• One prominent approach is based on hierarchical folders.
• Clusty, CREDO, Lingo3G: hierarchical folder visualization approach
• Grokker: nesting and zooming approach
• KartOO: graph-based interface
Credo - hierarchical folder visualization approach
Grokker – Nesting and Zooming
Improve Efficiency of Clustering
• Client-side processing: During periods of high query rate, response times can increase significantly, so some of the processing can be moved to the client's resources.
• Incremental processing: As each document arrives from the web, we “clean” it and add it to the available model.
• Pretokenized documents: Clustering engines can reuse the tokens already produced by conventional search engines.
Conclusion
Web clustering engines organize search results by topic, thus offering a complementary view to the flat-ranked list returned by conventional search engines. A number of advances are still needed: better cluster labels, a more coherent cluster structure, performance evaluation studies, and advanced visualization techniques. Only then can web clustering engines fully deliver on the promise of being the PageRank of the future. Because there is no efficient method for evaluating the performance of clustering engines, they have not yet attracted much attention.
THANK YOU
QUESTIONS?