Search Engine and Social Network

Vaibhav Daga (13114067), Computer Science Department, IIT Roorkee
1. How a Search Engine Works

"Search engine" is the popular term for an information retrieval (IR) system. While researchers and developers take a broader view of IR systems, consumers think of them in terms of what they want the systems to do: search the Web, an intranet, or a database. What consumers would really prefer is a finding engine rather than a search engine.
The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as is the number of new users inexperienced in the art of web research. People typically navigate the web using its link graph, often starting from high-quality, human-maintained indices such as Yahoo! or from search engines. Human-maintained lists cover popular topics effectively, but they are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching alone usually return too many low-quality matches.

Crawling

Before a search engine can tell where a file or document is, it must be found. To find information on the hundreds of millions of Web pages that exist, a search engine employs special software robots, called spiders, to build lists of the words found on Web sites. When a spider is building its lists, the process is called Web crawling.
Crawling is the acquisition of data about a website. At a bare minimum, this involves scanning the site and compiling a complete list of everything on it: the page title, images, the keywords it contains, and any other pages it links to. Modern crawlers may also cache a copy of the whole page and look for additional information such as the page layout, where the advertising units are, and where the links sit on the page.
An automated bot, the spider, visits each page very quickly. Even in its earliest days, Google reported reading a few hundred pages a second. The crawler then adds all the new links it finds to a list of places to crawl next, in addition to recrawling sites it has already seen to check whether anything has changed. It is a never-ending process.
Any site that is linked to from another site already indexed, or any site that has manually asked to be indexed, will eventually be crawled; some sites are crawled more frequently than others, and some to a greater depth. If a site is huge and its content is hidden many clicks away from the homepage, the crawler bots may actually give up. There are ways to ask search engines NOT to index a site, though this is rarely used to block an entire website.

Architecture of a web crawler:
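The crawl loop described above can be sketched as a breadth-first traversal of the link graph. Everything here is illustrative: `fetch` and `extract_links` are placeholders for a real HTTP client and HTML parser, and real crawlers add politeness delays, robots.txt checks, and recrawl scheduling.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Breadth-first crawl: fetch a page, record it, queue its links.

    `fetch(url)` returns the page body (or None on failure) and
    `extract_links(html)` returns the links it contains -- both are
    stand-ins for a real HTTP client and HTML parser.
    """
    frontier = deque(seed_urls)   # list of places to crawl next
    seen = set(seed_urls)         # never queue the same URL twice
    pages = {}                    # url -> raw page data (the crawl output)

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:          # fetch failed; move on
            continue
        pages[url] = html
        for link in extract_links(html):
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```

The `seen` set is what keeps the process from looping forever on cyclic links, and the frontier queue is exactly the "list of places to crawl next" described above.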
Indexing

Indexing is the process of taking all of the data gathered from a crawl and placing it in a big database. All of this data is stored in vast data centres, on drives totalling thousands of petabytes. There are two key components involved in making the gathered data accessible to users:

● The information stored with the data
● The method by which the information is indexed
In the simplest case, a search engine could just store the word and the URL where it was
found. In reality, this would make for an engine of limited use, since there would be no way of telling whether the word was used in an important or a trivial way on the page, whether the word was used once or many times or whether the page contained links to other pages containing the word. In other words, there would be no way of building the ranking list that tries to present the most useful pages at the top of the list of search results.
To make for more useful results, most search engines store more than just the word and URL. An engine might store the number of times that the word appears on a page. The engine might assign a weight to each entry, with increasing values assigned to words as they appear near the top of the document, in subheadings, in links, in the meta tags or in the title of the page. Each commercial search engine has a different formula for assigning weight to the words in its index. This is one of the reasons that a search for the same word on different search engines will produce different lists, with the pages presented in different orders.
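A toy version of such a weighted inverted index might look like the sketch below. The double weight for title words is an arbitrary stand-in for the per-engine weighting formulas described above, and the page model (a title plus a body) is a simplification.

```python
def build_index(pages):
    """Toy weighted inverted index: word -> {url: weight}.

    The weight is plain term frequency, with words in the title
    counted double -- an arbitrary stand-in for the per-engine
    weighting formulas described above.
    """
    index = {}
    for url, (title, body) in pages.items():
        weighted_words = [(w, 2) for w in title.lower().split()]
        weighted_words += [(w, 1) for w in body.lower().split()]
        for word, bonus in weighted_words:
            index.setdefault(word, {})
            index[word][url] = index[word].get(url, 0) + bonus
    return index
```

Because two engines would choose different bonuses (title, subheadings, links, meta tags), the same pages end up ranked differently, which is exactly why identical queries return different result orders on different engines.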
Important data structures for storing data

● BigFiles: BigFiles are virtual files spanning multiple file systems, addressable by 64-bit integers. The BigFiles package also handles allocation and deallocation of file descriptors.
● Repository: The repository contains the full HTML of every web page. Each page is compressed using zlib.
● Document Index: The document index keeps information about each document. Each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics.
● Hit List: A hit list corresponds to the list of occurrences of a particular word in a particular document, including position, font, and capitalization information.
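To keep hit lists small, each hit can be bit-packed into a fixed-width record. The sketch below loosely follows the plain-hit layout described in the original Google paper (1 capitalization bit, 3 font-size bits, 12 position bits); the exact widths here are illustrative rather than definitive.

```python
def pack_hit(position, font_size, capitalized):
    """Pack one hit into 16 bits: 1 bit capitalization, 3 bits
    relative font size, 12 bits word position (clamped at 4095).
    Field widths are illustrative."""
    position = min(position, 4095)   # positions past 4095 collapse
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_hit(hit):
    """Recover (position, font_size, capitalized) from a packed hit."""
    return hit & 0xFFF, (hit >> 12) & 0x7, hit >> 15
```

Two bytes per occurrence is what makes it feasible to store every position of every word on every page.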
Retrieval of the query

The last step: you type in a search query, and the search engine attempts to display the most relevant documents it finds that match it. This is the most complicated step, but also the most relevant to you or me, as web developers and users. It is also the area in which search engines differentiate themselves. Some work with keywords, some allow you to ask a question, and some include advanced features such as keyword proximity or filtering by age of content.
The ranking algorithm checks your search query against billions of pages to determine how relevant each one is. This operation is so complex that companies closely guard their ranking algorithms as trade secrets: first, for competitive advantage, since as long as they deliver the best search results they can stay on top of the market; and second, to prevent people from gaming the system and giving one site an unfair advantage over another.
Searching through an index involves a user building a query and submitting it through the search engine. The query can be quite simple, a single word at minimum. Building a more complex query requires the use of Boolean operators that allow you to refine and extend the terms of the search.
The Boolean operators most often seen are:

● AND: All the terms joined by "AND" must appear in the pages or documents. Some search engines substitute the operator "+" for the word AND.
● OR: At least one of the terms joined by "OR" must appear in the pages or documents.
● NOT: The term or terms following "NOT" must not appear in the pages or documents. Some search engines substitute the operator "-" for the word NOT.
● FOLLOWED BY: One of the terms must be directly followed by the other.
● NEAR: One of the terms must be within a specified number of words of the other.
● Quotation marks: The words between the quotation marks are treated as a phrase, and that phrase must be found within the document or file.
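Evaluating the simpler of these operators against an inverted index can be sketched as a left-to-right pass over the query. This is a deliberate simplification: real engines parse full boolean expressions with precedence and grouping, and implement NEAR and FOLLOWED BY using positional data from the hit lists.

```python
def search(index, query):
    """Evaluate AND/OR/NOT between successive terms, left to right.

    `index` maps each word to the set of URLs containing it.
    A minimal sketch, not a full boolean-query parser.
    """
    results = None
    op = "and"
    for token in query.lower().split():
        if token in ("and", "or", "not"):
            op = token                  # remember the pending operator
            continue
        docs = index.get(token, set())
        if results is None:
            results = set(docs)         # first term seeds the result
        elif op == "and":
            results &= docs
        elif op == "or":
            results |= docs
        else:                           # "not"
            results -= docs
    return results or set()
```

For example, `search(index, "cat AND dog")` intersects the posting sets of the two terms, which is exactly the semantics described above.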
2. Working of a Social Network

Real-time presence notification

The most resource-intensive operation performed in a chat system is not sending messages. It is keeping each online user aware of the online-idle-offline states of their friends, so that conversations can begin.
The naive implementation of sending a notification to all friends whenever a user comes online or goes offline has a worst-case cost of O(average friend-list size * peak users * churn rate) messages/second, where churn rate is the frequency with which users come online and go offline, in events/second. This is wildly inefficient to the point of being untenable, given that the average number of friends per user is measured in the hundreds, and the number of concurrent users during peak site usage is on the order of several million.
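Plugging illustrative numbers into that formula shows why the naive approach breaks down. The figures below are assumptions chosen for the arithmetic, not Facebook's published numbers, and churn is treated as per-user events per second so that the product comes out in messages/second.

```python
# Illustrative numbers only -- assumptions for the arithmetic,
# not Facebook's published figures.
avg_friend_list = 300     # average friends per user
peak_users = 5_000_000    # concurrent users at peak
churn_rate = 0.001        # online/offline events per user per second

# Worst case from the formula above: every state change is pushed
# to every friend of every concurrent user.
msgs_per_sec = avg_friend_list * peak_users * churn_rate
print(msgs_per_sec)       # 1.5 million presence messages per second
```

Even with each user changing state only once every ~17 minutes, the system would have to fan out over a million presence messages every second, before a single chat message is sent.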
Surfacing connected users' idleness greatly enhances the chat user experience but further compounds the problem of keeping presence information up to date. Each Facebook Chat user now needs to be notified whenever one of their friends (a) takes an action such as sending a chat message or loading a Facebook page, or (b) transitions between idleness states.

Real-time messaging
Another challenge is ensuring the timely delivery of the messages themselves. The method we chose to get text from one user to another involves loading an iframe on each Facebook page, and having that iframe's JavaScript make an HTTP GET request over a persistent connection that doesn't return until the server has data for the client. The request gets re-established if it's interrupted or times out.
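The client side of this long-polling scheme amounts to a loop that re-issues the request as soon as it returns. A minimal sketch, with `get` standing in for the iframe's persistent HTTP request (the real client is JavaScript in the browser; the loop bound exists only so the sketch terminates):

```python
import time

def long_poll(get, handle, rounds=10, timeout=55):
    """Client-side long-poll loop.

    `get(timeout)` stands in for the iframe's persistent HTTP GET: it
    blocks until the server has data (returning a list of messages)
    or the timeout elapses (returning None). `rounds` bounds the loop
    so the sketch terminates; the real loop runs for the page's
    lifetime.
    """
    for _ in range(rounds):
        try:
            messages = get(timeout)
        except ConnectionError:
            time.sleep(1)             # brief back-off, then reconnect
            continue
        for message in messages or []:
            handle(message)
        # fall through: immediately re-issue the request
```

The key property is that the connection is almost always open and idle, so a message can be pushed to the client the moment it arrives at the server rather than on the next poll interval.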
Having a large number of long-running concurrent requests makes the Apache part of the standard LAMP stack a dubious implementation choice. Even without accounting for the sizeable overhead of spawning an OS process that, on average, twiddles its thumbs for a minute before reporting that no one has sent the user a message, the waiting time could be spent servicing 60-some requests for regular Facebook pages.
Distribution, Isolation, and Failover

Fault tolerance is a desirable characteristic of any big system: if an error happens, the system should try its best to recover without human intervention before giving up and informing the user. The results of inevitable programming bugs, hardware failures, and the like should be hidden from the user as much as possible and isolated from the rest of the system.
The way this is typically accomplished in a web application is by separating the model and the view: data is persisted in a database (perhaps with a separate in-memory cache), with each short-lived request retrieving only the parts relevant to that request. Because the data is persisted, a failed read request can be retried. Cache misses and database failures can be detected by the non-database layers and either reported to the user or worked around using replication.
While this architecture works pretty well in general, it isn't as successful in a chat application due to the high volume of long-lived requests, the non-relational nature of the data involved, and the statefulness of each request.
For Facebook Chat, we rolled our own subsystem for logging chat messages (in C++) as well as an epoll-driven web server (in Erlang) that holds online users' conversations in memory and serves the long-polled HTTP requests. Both subsystems are clustered and partitioned for reliability and efficient failover.
Scaling the Messages Application

Facebook Messages seamlessly integrates many communication channels: email, SMS, Facebook Chat, and the existing Facebook Inbox. Combining all this functionality and offering a powerful user experience involved building an entirely new infrastructure stack from the ground up.
Integrating and supporting all of the above communication channels, while keeping the product simple and the user experience powerful, requires a number of services to run together and interact. The system needs to:
■ Scale, as we need to support millions of users with existing message history.
■ Operate in real time.
■ Be highly available.
Each application server comprises:

■ API: The entry point for all get and set operations, which every client calls. An application server is the sole entry point for any given user into the system. Any data written to or read from the system needs to go through this API.
■ Distributed logic: To understand the distributed logic we first need to understand what a cell is. The entire system is divided into cells, and each cell contains only a subset of users. A cell looks like this:
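One simple way to route a user to their cell is to hash the user id and take it modulo the number of cells. This is a hypothetical assignment scheme for illustration, not the system's actual mechanism; the point is only that every request for a given user lands in the same cell, whose application servers hold that user's state.

```python
import hashlib

def cell_for_user(user_id, num_cells):
    """Route a user to a cell by hashing the user id.

    Hypothetical assignment scheme for illustration: hashing gives a
    deterministic, roughly uniform spread of users across cells, so a
    user's reads and writes always go through the same cell.
    """
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_cells
```

Determinism is the essential property here: since the application server in a cell is the sole entry point for its users, the router must send every request for a user to the same cell every time.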
References
● https://www.facebook.com/notes/facebook-engineering/facebook-chat/14218138919
● https://www.facebook.com/notes/facebook-engineering/scaling-the-messages-application-back-end/10150148835363920
● http://infolab.stanford.edu/~backrub/google.html
● http://computer.howstuffworks.com/internet/basics/search-engine.htm
● http://www.makeuseof.com/tag/how-do-search-engines-work-makeuseof-explains/