
NEXT GENERATION SEARCH

July 2010

This ICSTI Insight looks at the developments affecting the launch of the next generation of search engines. It has relevance for ICSTI members insofar as the more advanced search engines become, the easier it is for end users to find what they are looking for, and the greater the opportunities for national scientific and technical organisations to provide additional, and in some cases novel, backup and one-stop service support for required information items.

At the heart of the information economy is an increase in productivity coming from the use of digital information systems generally. It is particularly in the use of the ‘machine’ to sift through the world’s output of digital publications and separate what is relevant from the chaff that dramatic changes have been seen over the past few decades, and more is expected in the future. In the pre-digital world the selection of what was relevant involved hard work and was at best hit and miss; at worst it resulted in wastage and duplication of research effort. The ‘machine’ supporting such increases in productivity is sophisticated software – search engines and, latterly, advanced search engines. This report looks at the emergence of the next generation of such search engines.

But what are ‘next generation search’ engines? It would be useful to define what is meant by the term.

Definition of Search Engines

A search engine is basically an information retrieval system designed to help find information stored on a computer system or systems. The search results are usually presented in a list and are commonly called ‘hits’. As such, search engines help reduce the time required to find information and also reduce the amount of information which must be consulted. They enable end users to target, or focus on, the few key relevant items. They help tackle the problem of ‘information overload’ which affects many areas of published information by bringing together, quickly, all relevant information in one succinct output or listing.

To provide such a set of matching items, a search engine will typically collect metadata from a universe of items through a process of indexing. The index summarises the main points about an item and requires a smaller amount of computer storage. Some search engines only store the indexed information and not the full content of each item, and instead provide a method of navigating to the item from the search engine result page. Alternatively, the search engine may, and increasingly does, store a copy of each item as a full-text item or a digital object.


Whereas some text search engines require users to enter two or three words in the search box separated by a space, other search engines may enable users to specify entire documents, pictures, sounds, and various forms of natural language. This is how search engines generally operate now. There has been a historical evolution which shows several distinct phases of development, and with each phase there has been a different set of players who have dominated the space.

Early Emergence of Search Services

Search engines are not new. They date back three or four decades and arose in parallel with the development of the large bibliographic databases then being created by several leading secondary database producers. These databases were of abstracts, often manually compiled and expensive to create and use. But they were a marked improvement over the manual scanning of printed literature which went before. As the number of such specialist secondary abstract services increased, so there was a need for services which aggregated these bibliographic services and provided a single point of access. Proprietary search and retrieval software emerged to interrogate these different databases and to find relationships which could only be uncovered in digital form.

Research-focused organisations with a strong need to optimise their investment in information gathering developed search engines which did just this – they collected a number of databases within the same search and retrieval process. Lockheed and the Systems Development Corporation in the USA were two such organisations. They both developed systems which could be used by other organisations to do similar analyses, and this heralded the emergence of Dialog and SDC as leading pioneers in the early days of bibliographic database searching. Roger Summit (Dialog) and Carlos Cuadra (SDC) became the leaders in this new revolution in finding relevant information easily and quickly.

Still other organisations developed unique software solutions to cater for the specific needs of a particular user base. The National Library of Medicine, in biomedicine, and Chemical Abstracts, in chemistry, were two other dominant players, but each discipline with a strong bibliographic output adopted its own solutions. It was left to Dialog and SDC to pull them together and enable searching of multiple databases using a standard search interface. At the peak of their usage – in the late 1980s – they each included several hundred distinct and separate bibliographic files covering many research disciplines.

The Early Internet

Rumbling away in the background, however, was a more powerful development. The Internet, and the World Wide Web pioneered by Sir Tim Berners-Lee, was on the move. A wide range of services was being introduced which no longer relied on a database, or group of databases, being mounted locally on large and dedicated computer facilities. The Internet was becoming ubiquitous in so many ways, impacting on the specialist scholarly communications industry in the process. New players emerged, taking over the central role which Dialog and SDC had pioneered.

One of the first "full text" crawler-based search engines on the internet was WebCrawler, which was launched in 1994. Unlike its predecessors, it let users search for any word in any webpage, which has since become the standard for all major search engines. It was also the first one to be widely known by the public. Also in 1994, Lycos (which was developed at Carnegie Mellon University) was launched and became a significant commercial service in this space.

Other innovative organisations rapidly joined the bandwagon. These included names which even today, a couple of decades later, are almost lost in the mists of time. AOL and AltaVista were leaders in creating a new information revolution. Others included Infoseek, Magellan, Excite, Inktomi, Ask Jeeves, Northern Light. In the space of three short years - 1994-1996 – these and others made their appearance. Each vied for the attention of the emerging and growing numbers of online users. Each invited users to subscribe or commit to a set of services which included accessing an even broader range of information formats than just the abstracts which had been the foundation stone of the earlier bibliographic database searching.

A complete list can be found on Wikipedia, a summary of which is given below:

Timeline giving launch of leading web search engines

Year    Search engine(s) launched
1993    W3Catalog, Aliweb, JumpStation
1994    WebCrawler, Infoseek, Lycos
1995    AltaVista, Open Text, Magellan, Excite, SAPO
1996    Dogpile, Inktomi, HotBot, Ask Jeeves
1997    Northern Light, Yandex
1998    Google
1999    AlltheWeb, GenieKnows, Naver, Teoma, Vivisimo
2000    Baidu, Exalead
2003    Info.com
2004    Yahoo! Search, A9.com, Sogou
2005    MSN Search, Ask.com, GoodSearch, SearchMe
2006    wikiseek, Quaero, Ask.com, Live Search, ChaCha, Guruji.com
2007    wikiseek, Sproose, Wikia Search, Blackle.com
2008    Powerset, Picollator, Viewzi, Cuil, Boogami, LeapFish, Forestle, VADLO, Sperse! Search, Duck Duck Go
2009    Bing, Yebol, Mugurdy, Goby
2010    Timmp

How web-focused search engines work

The first phase of search engine development relied on a Boolean approach in returning search results. These matched the request exactly, without regard to order, and made use of Boolean operators such as AND, OR and NOT to further specify the search query. Boolean operators support literal searches that allow the user to refine and extend the terms of the search. The evolving web-based services improved on this by being more systematic in their ordering of the results found.
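To make the Boolean behaviour concrete, the following is a minimal sketch in Python, using invented example documents, of how AND, OR and NOT translate into operations on the sets of documents containing each term; it is an illustration, not a description of any particular engine.

    # Minimal Boolean retrieval over a toy collection of invented documents.
    docs = {
        1: "solar energy storage systems",
        2: "wind energy forecasting",
        3: "battery storage chemistry",
    }

    # Map each term to the set of document ids containing it (a postings list).
    postings = {}
    for doc_id, text in docs.items():
        for term in text.split():
            postings.setdefault(term, set()).add(doc_id)

    print(postings["energy"] & postings["storage"])   # energy AND storage -> {1}
    print(postings["energy"] | postings["storage"])   # energy OR storage  -> {1, 2, 3}
    print(postings["storage"] - postings["solar"])    # storage NOT solar  -> {3}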

In using the web services, the list of items that meets the particular search criteria specified by the end user is typically sorted, or ranked. Ranking items by relevance (from highest to lowest) reduces the time required to find the desired information. Probabilistic search engines rank items based on measures of similarity (between each item and the query) and sometimes on popularity or authority, or they use relevance feedback.
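As a hedged illustration of similarity-based ranking, the sketch below scores invented documents against a query using simple TF-IDF weights and cosine similarity, returning the most similar first; production engines combine many more signals (popularity, authority, relevance feedback) than this.

    # Rank toy documents against a query by TF-IDF cosine similarity.
    import math
    from collections import Counter

    docs = {
        "d1": "solar energy storage",
        "d2": "energy policy and energy markets",
        "d3": "deep sea exploration",
    }

    def tfidf_vector(text, doc_freq, n_docs):
        counts = Counter(text.split())
        return {t: counts[t] * math.log(n_docs / doc_freq[t]) for t in counts}

    def cosine(v1, v2):
        dot = sum(v1[t] * v2.get(t, 0.0) for t in v1)
        norm = math.sqrt(sum(x * x for x in v1.values())) * \
               math.sqrt(sum(x * x for x in v2.values()))
        return dot / norm if norm else 0.0

    doc_freq = Counter(t for text in docs.values() for t in set(text.split()))
    query = "energy storage"
    for term in query.split():              # guard against unseen query terms
        doc_freq.setdefault(term, 1)

    n = len(docs)
    query_vec = tfidf_vector(query, doc_freq, n)
    doc_vecs = {d: tfidf_vector(text, doc_freq, n) for d, text in docs.items()}
    ranking = sorted(docs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    print(ranking)                          # most similar documents first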

Other types of search engines do not store an index. Crawler, or spider type search engines (such as real-time search engines), may collect and assess items at the time of the search query, dynamically considering additional items based on the contents of a starting item (known as a seed, or seed URL in the case of an Internet crawler). Meta search engines store neither an index nor a cache and instead simply reuse the index or results of one or more other search engines or data sets to provide an aggregated, final set of results.

A search engine operates in the following manner:

1. Web crawling. Web search engines work by storing information about many web pages, which they retrieve from the html itself. These pages are retrieved by a Web crawler (also known as a spider or a robot, see later) — an automated Web browser which follows every link on the site. The contents of each page are then analysed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries. The purpose of an index is to allow information to be found as quickly as possible. Some search engines store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others store every word of every page they find. This satisfies the principle of least astonishment since the user normally expects the search terms to be on the returned pages. Increased search relevance makes these cached pages very useful.


2. Indexing. When a user enters a query into a search engine (typically by using key words), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. The index is built from the information stored with the data and the method by which the information is indexed. The engine looks for the words or phrases exactly as entered. Some search engines provide an advanced feature called proximity search which allows users to define the distance between keywords. There is also concept-based searching, where the search involves statistical analysis of pages containing the words or phrases searched for. Also, natural language queries allow the user to type a question in the same form one would ask a human.

3. Search. The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve. There are two main types of search engine that have evolved: one is a system of predefined and hierarchically ordered keywords that humans have programmed extensively; the other is a system that generates an "inverted index" by analysing the texts it locates. This second form relies much more heavily on the computer itself to do the bulk of the work. (A minimal code sketch of this crawl, index and query cycle follows this list.)
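The sketch below, using only the Python standard library, ties these steps together: a tiny breadth-first crawler, an inverted index that records word positions, and a proximity query. The seed URL, page limit and proximity window are assumptions made for illustration; it is not a description of any production engine.

    # Minimal crawl-index-search cycle (standard library only, illustrative values).
    import re
    import urllib.request
    from collections import defaultdict
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class PageParser(HTMLParser):
        """Collects visible text and outgoing links from one HTML page."""
        def __init__(self):
            super().__init__()
            self.text_parts, self.links = [], []
        def handle_data(self, data):
            self.text_parts.append(data)
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_urls, max_pages=10):
        """Breadth-first crawl from the seed URLs, returning {url: page text}."""
        frontier, seen, pages = list(seed_urls), set(), {}
        while frontier and len(pages) < max_pages:
            url = frontier.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="ignore")
            except (OSError, ValueError):
                continue                      # skip unreachable or malformed URLs
            parser = PageParser()
            parser.feed(html)
            pages[url] = " ".join(parser.text_parts)
            frontier.extend(urljoin(url, link) for link in parser.links)
        return pages

    def build_index(pages):
        """Inverted index: word -> {url: [positions]}; positions enable proximity search."""
        index = defaultdict(lambda: defaultdict(list))
        for url, text in pages.items():
            for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
                index[word][url].append(pos)
        return index

    def proximity_search(index, word_a, word_b, window=5):
        """Return URLs where the two words occur within `window` positions of each other."""
        hits = []
        for url in set(index.get(word_a, {})) & set(index.get(word_b, {})):
            pos_a, pos_b = index[word_a][url], index[word_b][url]
            if any(abs(pa - pb) <= window for pa in pos_a for pb in pos_b):
                hits.append(url)
        return hits

    if __name__ == "__main__":
        pages = crawl(["https://example.org/"])      # assumed seed URL
        index = build_index(pages)
        print(proximity_search(index, "example", "domain"))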

Most Web search engines are commercial ventures supported by advertising revenue and, as a result, some employ the practice of allowing advertisers to pay money to have their listings ranked higher in search results. Those search engines which do not accept payment for their search engine results make money by running search related ads alongside the regular search engine results. The search engines generate revenues every time someone clicks on one of these ads.

However, there were other third party services which helped end users find the information they needed. One of these has been ‘portals’.

Portals

Portals took the idea of ‘dominating the user’s eyeballs’ – capturing a user’s interest and then providing the person with all manner of information services to hold their attention. These could vary from intense scientific resources to news items, from professional data to the esoteric. The aim was to so control the route into the world’s digital information collection that fees could be set which the user would willingly pay to remain above the information problem, or advertisers could be enticed to pay to reach the target audience.

In the late 1990s and the early years of this decade the concept of the portal captured the imagination of entrepreneurs and venture capitalists alike. There was considerable funding support given to creating community-defined portals. The problem was that one of the business models – charging end users for information which in many cases was freely available on the Internet – came up against the openness which was also a feature of the Internet. Why pay for admittedly slick information services when one could, with the help of some web-focused services, get the same information for free? The day of the standalone portal using a subscription-based business model was very short.

The Arrival of Google

During the past decade the dominant player in the findability area has become Google. Though as an organisation it has often been coy about its operational features, it had 2009 revenues of $23,651 million, 16 locations within the USA employing 3,500 staff, and a further 3,000 staff in the rest of the world. It also achieved an enviable 32.5% operating margin. And all this within a decade of operations. Mindblowing!

Historically, Google is a creature of the Internet. Larry Page first met Sergey Brin at Stanford University in the summer of 1995 when the idea of Google was not even a glimmer in their respective eyes. The young entrepreneurs created the Google company, which in 1999 had a handful of employees and a rented office suite in a private house.

They had the idea of developing the mathematical algorithm of selection and ranking which became PageRank, the heart of the current Google search process. Page and Brin, perhaps arrogantly, cocked a snook at the rest of the industry and remained focused on their core activity at that stage – the less interesting search process. Everyone else was trying to develop locked-in communities and portals.
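As an illustration of the published PageRank idea (not Google's production system, whose details are proprietary), here is a minimal power-iteration sketch over an invented four-page link graph; the damping factor of 0.85 is the value commonly quoted from the original paper.

    # Minimal power-iteration sketch of PageRank over an invented link graph.
    def pagerank(links, damping=0.85, iterations=50):
        """links: {page: [pages it links to]}; returns {page: score}."""
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, outlinks in links.items():
                if not outlinks:                 # dangling page: spread its rank evenly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / len(pages)
                else:
                    for target in outlinks:
                        new_rank[target] += damping * rank[page] / len(outlinks)
            rank = new_rank
        return rank

    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
    print(pagerank(toy_graph))                   # "c" ends up with the highest score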

Within a decade their organisation and these entrepreneurs had changed the map of the information industry. They transformed the individual user’s ‘Database of Intentions’ into a multibillion dollar operation – not bad for two engineers who were rather academic and geeky in approach. The Google system made its debut in 1998 and the market immediately swung away from services such as AltaVista, Yahoo and the many others listed earlier in favour of this new ranking methodology. Viral marketing ensured that it took off.

Google has since then transformed the scholarly information landscape through launching a number of new and related ventures in recent years. Some of the key ventures launched include Google Print, Gmail, Google Scholar and Google’s library digitisation programme. But these are just the tip of an innovative iceberg.


There is little doubt that Google now benefits from the strength of its brand recognition. This came out strongly during some tests undertaken by Vividence in the US. The answers that a number of different search engines came up with were compared, and it was concluded that the differences were not significant. But according to Vividence, Google shone through in terms of the more subjective customer satisfaction rating, which goes to show that “search is a potentially fickle brand game, resting on perceptions and preferences rather than performance” (Outsell).

As a result Google has become the first source of scholarly information for academics and researchers. Its millions of dedicated followers generate tremendous traffic. On the back of this traffic Google created a business model which provided a massive injection of revenues and led to a large number of its then 1-3,000 employees becoming millionaires overnight when the company undertook its controversial IPO stock auction in August 2004. Portals had by then become relatively marginalised.

The Business Model

What has led to this dominance by one organisation in the web search space, particularly as the search service which Google offers is free? In 2009 Google achieved revenues of $23.6 billion. Of this, $22.9 billion came from advertising. The trick has been to relate an advertisement to the interests of an end user as expressed by their choice of search topics. AdWords and AdSense gave Google the ability to make this link, and the world’s advertisers have responded in their droves.

According to Hitbox, Google's worldwide popularity peaked at 82.7% of the world’s online searching activity in December 2008. July 2009 rankings showed Google (78.4%) losing traffic to Baidu in China (8.87%) and to Bing (3.17%) from Microsoft. The market shares of the other traditional players, Yahoo! Search (7.16%) and AOL (0.6%), were also declining.

Change in Culture

But as many commentators have reflected, it has not been an easy ride for Google. Conflicts in culture emerged as the original motto for the company – ‘Don’t be evil’ – came up against the hard world of commerce. This came to the fore over the advertising issue – Page and Brin, with a professional CEO then acting as part of the triumvirate, seemed almost apologetic about initially taking advertising, about issuing an IPO which ignored Wall Street practices, and about dealings with China, which demanded censorship of certain sites. The less contentious aim of ‘organising the world’s information and making it accessible’ has become their widely used corporate mission, and the company may have made its peace with its demons.

However, Google is not out of the woods yet. The Patriot Act, whereby the federal government increased its potential for tapping into not only telephone conversations but also e-mails and web usage data, highlighted the sensitive nature of the data held within the banks of parallel-running computers at Google and at other major search engines. Some users now keep their photos, blogs, videos, calendars, e-mail, news feeds, maps, contacts, social networks, documents, spreadsheets, presentations, and credit-card information - in short, much of their lives - on Google's computers. And Google has plans to add medical records, location-aware services and much else. It may even buy radio spectrum in America so that it can offer all these services over wireless-internet connections. A digital footprint follows every user of the service and can be mapped and used for a variety of purposes. The clickstream has become “the exhaust of our lives” and is scattered across a wide range of services. Should this data stream, these personal digital footprints, be made available to the government? Is trust being broken by doing so - the trust users need in order to feel that their searches are not being monitored by a Big Brother? This is a weighty issue and one which is still running its course.

Also, Google is making enemies in its own and adjacent industries. Google evokes ambivalent feelings, particularly among librarians (who see their role in serving their local patrons being undermined) and publishers (who see their income from the sale of their published products being compromised). A number of highly visible court cases are being pursued against Google on both sides of the Atlantic.

Speaking for many, John Battelle, the author of a book on Google (“The Search”) and an early admirer, recently wrote on his blog that “I've found myself more and more wary” of Google “out of some primal, lizard-brain fear of giving too much control of my data to one source.”

There have been many exposés of the company. For example, Peter Morville challenges in his book “Ambient Findability” the suggestion that Google offers a good service – “if you really want to know about a medical complaint you don’t rely on Google but rather on NIH’s PubMed database”. Ambient findability is less about the computer than about the complex interactions between humans and information. All our information needs will not necessarily be met automatically. Information anxiety will intensify, and we will spend more time rather than less searching for what we need. Search engines are not necessarily up to the task of meeting future needs. They tend to be out of date and inaccurate. However, they are trying to rectify some of these emerging weaknesses by improving their technology. For example, they are undertaking SEO – Search Engine Optimisation – ensuring that the software throws up the top ten results that are most relevant to the end user. Whilst the search engines pride themselves on speed, this excludes the subsequent activity the end user has to go through in bypassing splash pages and other interferences in reaching the data. But here, according to Microsoft’s Bill Gates, what is important is not the search but getting the answers.

Google is constantly improving its services, not only through its much-vaunted in-house ‘experimentation’ (whereby all its technicians spend one day a week on non-operational activities and the pursuit of new ideas), but also through its acquisitions. It has bought, or has announced it intends to buy, 16 companies in the past year. It has huge cash reserves to call on – as of March 31st 2010 they stood at $26.5 billion. Not bad for an organisation which did not formally exist a dozen years ago.

For the future, Google must maintain or improve the efficiency with which it puts ads next to searches. Currently it has far higher “click-through rates” than any of its competitors because it has made these ads more relevant and useful, so that web users click on them more often. But even lucrative “pay-per-click” has limits, so Google is moving into other areas. It has bought DoubleClick, a company that specialises in the other big online-advertising market, so-called “branded” display or banner ads (for which each view, rather than each click, is charged). Google also now brokers ads on traditional radio stations, television channels and in newspapers.

Its latest acquisition (July 2010) is Metaweb, a company that maintains an open database of things in the world. According to a Google press release:

“With efforts like rich snippets and the search answers feature, we’re just beginning to apply our understanding of the web to make search better. Type [barack obama birthday] in the search box and see the answer right at the top of the page. Or search for [events in San Jose] and see a list of specific events and dates. We can offer this kind of experience because we understand facts about real people and real events out in the world. But what about [colleges on the west coast with tuition under $30,000] or [actors over 40 who have won at least one oscar]? These are hard questions, and we’ve acquired Metaweb because we believe working together we’ll be able to provide better answers”.

Included in the acquisition is Freebase, Metaweb’s free and open database of over 12 million items, including movies, books, TV shows, celebrities, locations, companies, etc. Google and Metaweb plan to contribute to and further develop Freebase – “it will be a tremendous resource to make the web richer for everyone. And to the extent the web becomes a better place, this is good for webmasters and good for users”.
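The kind of question quoted above becomes straightforward once facts are held as structured records rather than free text. The sketch below uses invented placeholder records and a hypothetical filter function - it is not Freebase data or its API - to show the attribute-based querying that structured data makes possible.

    # Attribute filtering over invented structured records (placeholder data only).
    facts = [
        {"name": "College A", "type": "college", "region": "west coast", "tuition": 28500},
        {"name": "College B", "type": "college", "region": "east coast", "tuition": 24000},
        {"name": "College C", "type": "college", "region": "west coast", "tuition": 41000},
    ]

    def colleges_under(records, region, max_tuition):
        """Answer by matching attribute values rather than by keyword matching."""
        return [r["name"] for r in records
                if r["type"] == "college"
                and r["region"] == region
                and r["tuition"] < max_tuition]

    print(colleges_under(facts, "west coast", 30000))   # -> ['College A']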

The breadth of data in the online and offline worlds which is coming under Google’s octopus-like reach is widening by the week. It includes music, books, travel information, maps, newspapers, TV – some of it not necessarily being embraced with open arms by the information providers. Nevertheless, the appealing idea of having search services which more accurately reflect the specific interests, needs and culture of the information user – more in-depth ‘vertical’ search services – is not totally outdated. The generic but comprehensive services offered by Google, Yahoo, Baidu, etc., cannot reach the parts which the specialist needs. As such specialisms become even more specialised, Google and the present generation of generic search engines cannot be so precise and filtered as to offer such a targeted vertical search service in a truly effective and useful way. The day of the portal may be returning.


There are few examples of such services out there at present; the question is whether existing services can pull back from the entrenched position in which the large generic search engines now find themselves. The problem is that this requires hard work and investment – and few publishers have shown any inclination to create specific, unique platforms either individually or in unison. They have the additional problem of being locked into a silo approach to information, without the comprehensive coverage which search engines have. It may be a different type of community which would need to provide such next generation search services (see below).

Politics

Google’s dominance raises questions about its power. In France there is a huge debate going on about how to restrict Google’s activities, as it threatens to diminish France’s cultural heritage by absorbing it into the vast Google machine.

The publishers are concerned about losing their ownership over material they have published.

In China there are issues about the openness and freedom of access which Google provides to the information it includes – some of which does not sit well with the Chinese authorities, who seek to censor parts of Google to protect their own political power base. As a result Google closed down two of its Chinese online services whilst the issue was being resolved. The decision came nearly 10 days after Beijing renewed Google's licence to continue operations in the country as an Internet Content Provider (ICP) in June 2010. The Chinese authorities had earlier asked the firm to censor some of its content for users in the country. Google's plan includes the shutting down of a self-developed website ranking page and a lifestyle site in China. According to a statement issued by the company, the decision was taken due to 'lower-than-expected demand'.

Research firm Analysys International, China, released a report that says Google's share of the Chinese search engine market declined in the second quarter while it was involved in a public battle with Beijing over censorship. The search engine firm saw its market share fall to 24.2% in the three months to June, from 30.9% in the first quarter, Analysys International states in the report. Baidu witnessed an increase in its market share to 70% in the second quarter from 64% in the first.

Politics versus openness remains an issue for generic search engines operating in those countries where democracy remains a delicate flower.

Cloud Technology

Meanwhile, the machinery that represents the fixed costs is Google's (and its fellow search engine travellers’) key asset, and this opens up new opportunities for search engines. Google has built, in effect, the world's largest supercomputer. It consists of vast clusters of servers, spread out in enormous datacentres around the world. The details are Google's best-guarded secret. But the result is to provide a “cloud” of computing power that is flexible enough “automatically to move load around between datacentres”. If, for example, there is unexpected demand for Gmail, Google's e-mail service, the system instantly allocates more processors and storage to it, without the need for human intervention.

Cloud Computing is Internet-based computing, whereby shared resources, software, and information are provided to computers and other devices on demand, much like the electricity grid.

Cloud computing is another paradigm shift following the shift from mainframe to client–server in the early 1980s. Details are abstracted from the users, who no longer have need for expertise in, or control over, the technology infrastructure "in the cloud" that supports them. Cloud computing describes a new supplement, consumption, and delivery model for IT services based on the Internet, and it typically involves over-the-Internet provision of dynamically scalable and often virtualised resources. It is a byproduct and consequence of the ease-of-access to remote computing sites provided by the Internet. (Wikipedia).

The term "cloud" is used as a metaphor for the Internet, based on the cloud drawing used in the past to represent the telephone network, and later to depict the Internet in computer network diagrams. Typical cloud computing providers deliver common business applications online that are accessed from another Web service or software like a Web browser, while the software and data are stored on many disparate servers.

More recently, the DuraSpace organisation has unveiled the open source code for the new DuraCloud platform, a hosted service and open technology that makes it easy for organisations and end users to use cloud services. DuraCloud is an open source platform that is built upon commercial cloud infrastructure. The platform itself deploys into a cloud server environment and is integrated with multiple cloud storage providers, including Amazon AWS and Rackspace.

Commercial offerings are generally expected to meet the quality of service (QoS) requirements of customers, and typically include service level agreements (SLAs). The major cloud service providers include Microsoft, Salesforce, Skytap, HP, IBM, Amazon and Google. As far as scientific and technical information is concerned, ‘cloud technology’ offers scope for massive amounts of data sharing and data analysis unrestricted by local IT facilities. It heralds a new spur to the data-intensive, global information economy of the future – a topic which is beyond the scope of the present ICSTI Insight. But it does give an indication of the direction in which some of the established search engines are moving: offering large, comprehensive (data-crunching) services rather than specifically targeting support for specialist users.

Microsoft


Microsoft is obviously not standing idly by whilst all this is going on.

In terms of offering a search engine Microsoft first launched MSN Search in the Autumn of 1998 using search results from Inktomi. In early 1999 the site began to display listings from Looksmart blended with results from Inktomi (except for a short time in 1999 when results from AltaVista were used instead). In 2004, Microsoft began a transition to its own search technology, powered by its own web crawler (called msnbot).

Microsoft announced its MSN Search Service in early February 2005. It offers powerful web, news and image searching, and adds in links to freely accessed articles from its own Encarta encyclopaedia. The MSN Service encourages users to develop advanced search skills using the Search Builder facility. This allows the user to be highly selective about the domains to be searched, and to use sliding bars to weight results according to popularity or immediacy. Microsoft further claims it will outperform search rivals such as Google and Yahoo! as its search bots will crawl the entire web every 48 hours for updates, rather than the industry standard of two weeks.

Microsoft's rebranded search engine, Bing, was launched on June 1, 2009. On July 29, 2009, Yahoo! and Microsoft finalised an agreement whereby Yahoo! Search would be powered by Microsoft Bing technology.

According to Bill Gates, “Searching the Internet today is a challenge, and it is estimated that nearly half of customers’ complex questions go unanswered”. Providing the answer, rather than just offering the search process, seems to be the central Microsoft theme.

Robots

An emerging issue which the information society faces is the dominance of robots in collecting information. As we have seen, Google is a key source for identifying items of relevance – with 84.96% of search results coming through Google Global – and the other leading Internet search providers are Yahoo Global (6.24%), Bing (3.39%), Baidu in China (0.49%), Ask Global (0.76%) and AOL Global (0.49%). These are the dominant robots which search out information on the Internet.

Their importance cannot be overstated. Studies currently underway on the traffic being generated from file servers in Europe suggest that if an online database does not accommodate Google in allowing access to its files, the database might just as well not exist. For example, at one leading repository in Europe – HAL, a national science repository based in France, and the leading web repository worldwide according to the metrics provided by a Madrid-based service which rates web repositories – over 90% of accesses come from robots. If a repository does not allow easy access to its publications, if it confuses the robots by changing its indexing files or presenting a dynamic file structure, or if it prevents automated crawling, it stands to remain isolated, serving just those users who are ‘people’, and its traffic diminishes dramatically in comparison. In fact, if a database is not robotically identifiable it is of little worth.

Robots provide a useful function in that they allow end users to find material in any number of sources. But if good quality material is hidden from robotic crawling then the value of the robot is called into question. There has been a claim that using Google, for instance, provides only ‘good enough’ material – a suggestion that something is missing, that there is more information hidden behind the barriers of many information services. These barriers are often unwittingly put in place – they are frequently a legacy of past data creation practices rather than a rejection of machine access to files.

What it does mean is that it is difficult for individual database services to know who their customers are, and whether they are providing them with the information they need, if 90% of usage comes from robots. The robots take their information and lose it within their own services, apply ranking systems to it, and so on. The imprimatur of the data gets lost within the service provided by Google, Yahoo, Bing, etc. It is no wonder that publishers are sensitive to the concern that Google is dominating the information scene at their expense.

Not all databases are accessible to these robots. In fact various studies have suggested that the databases ‘hidden’ from public access are significantly greater in number than those which are accessible. The so-called ‘Deep Web’ holds as much as 90% of the world’s digital collection (see later). Most of it is hidden behind authentication systems; other parts require special tools to access the databases. The robots are stopped at the front page of these databases. Accommodating robots requires the database owner to change its procedures for public access. However, stripping out authentication barriers often runs counter to the business model which provides the revenue stream necessary to create and sustain the database. The owner may also need to take steps to accommodate technical access by the robots. Not all wish to do this.
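On the technical side, whether a robot may fetch a given page is commonly governed by the site's robots.txt crawl policy. The sketch below, with a hypothetical repository URL and bot name, shows how a well-behaved crawler checks that policy using Python's standard library; pages it is not allowed to fetch effectively stay in the deep web.

    # Checking a site's robots.txt before crawling (hypothetical URLs and bot name).
    from urllib import robotparser
    from urllib.error import URLError

    rp = robotparser.RobotFileParser()
    rp.set_url("https://repository.example.org/robots.txt")   # placeholder repository
    try:
        rp.read()                            # fetch and parse the crawl policy
    except URLError:
        print("robots.txt unreachable; a cautious crawler would skip this site")
    else:
        record = "https://repository.example.org/records/1234"
        if rp.can_fetch("ExampleBot", record):
            print("Open to crawlers: the record can surface in search engines.")
        else:
            print("Closed to crawlers: the record stays in the deep web.")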

Access to the institutional repositories in Europe is a specific example. In principle access to all IRs should be free and easy. After all, they are meant to employ the open access standards codified by the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Theory and practice do not always coincide, and of the six institutional repositories taking part in the PEER European Commission project only two allow Google easy access to their contents. The rest keep their free and open material hidden behind their own barriers. Their external use is therefore minimal.
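For illustration, the sketch below harvests Dublin Core records from a repository's OAI-PMH endpoint. The base URL is a placeholder, while the verb (ListRecords), the metadata prefix (oai_dc) and the namespace are taken from the published OAI-PMH and Dublin Core specifications.

    # Harvesting Dublin Core titles over OAI-PMH (placeholder endpoint URL).
    import urllib.request
    import xml.etree.ElementTree as ET

    BASE_URL = "https://repository.example.org/oai"          # hypothetical repository
    query = "?verb=ListRecords&metadataPrefix=oai_dc"

    with urllib.request.urlopen(BASE_URL + query, timeout=30) as resp:
        tree = ET.fromstring(resp.read())

    # Dublin Core elements sit inside each record's metadata block.
    ns = {"dc": "http://purl.org/dc/elements/1.1/"}
    for title in tree.findall(".//dc:title", ns):
        print(title.text)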

Crawling technology, therefore, has its limitations – some administrative, some technical.

Federated Search


There is an alternative which moves away from the central dominance of an integrated information service towards a decentralised service, one which allows each database provider to determine how its information resources can be searched and how a robot may interface with its data. It makes no demands in terms of accommodating robots; it allows each database provider to stick with its own traditional ways of creating and curating its data. This is the principle behind federated searching. It creates a platform for an alternative approach to reaching into the richness of the world’s digital archives.

It is an approach which has been pioneered by one of ICSTI’s members – the US Department of Energy’s Office of Scientific and Technical Information (OSTI) under the leadership of Walter Warnick – with technical support provided by Deep Web Technologies Inc.

Warnick has written a recent article for Interlending and Document Supply (see: Warnick, W., ‘Federated search as a transformational technology enabling knowledge discovery: the role of WorldWideScience.org’, Interlending and Document Supply, 38/2 (2010), 62-92, Emerald Group Publishing). In this article he describes the extent of the Deep Web alluded to above, and how a new search methodology can address some of the problems of access to deep web material.

The federated search paradigm was created and is evolving in response to the vast number of online databases and other web resources that now populate what is known as the deep web, or invisible Web. In traditional search engines such as Google, only sources that have been indexed by the search engine’s crawler technology can be searched, retrieved and accessed. The large volume of documents that constitute the deep Web are not open to traditional Internet search engines because of limitations in crawler technology. Federated searching resolves this issue and makes these deep web documents searchable. Additionally, federated search provides a single search interface to numerous underlying deep web data sources. This reduces the burden on the search patron by not requiring knowledge of each individual search interface or even knowledge of the existence of the individual data sources being searched.

In 2000, a study undertaken at the University of California, Berkeley estimated that the deep Web consists of about 91,000 terabytes across 550 billion individual documents. By contrast, the surface Web (which is easily reached by search engines) is only about 167 terabytes; the Library of Congress, in 1997, was estimated to hold 3,000 terabytes.

In developing the federated search approach, the US Department of Energy initially focused on public-funded databases, many of which were in the deep web. As a result, in a study which compared the output of two comparable online searches – one on Google, the other on the DoE federated system – the overlap was minimal. To quote Walter Warnick from his recent article, “In fact, a recent analysis indicated that WorldWideScience.org [the federated service created by DoE] results, when compared to Google and Google Scholar results, were unique approximately 96.5 per cent of the time”.


This suggests that there is a complementarity between generic search engines and those new federated search engines which mine the deep web for material.

Federated search is often referred to as a portal or a federated search engine. In effect, a search term (or terms) is input into the search screen, which then sets off to interrogate a suite of accessible databases, querying each according to its local requirements and protocols. The results are then fed back to the central service, which creates the relevance ranking.

As described by Peter Jacso in 2004, federated searching consists of (1) transforming a query and broadcasting it to a group of disparate databases or other web resources, with the appropriate syntax, (2) merging the results collected from the databases, (3) presenting them in a succinct and unified format with minimal duplication, and (4) providing a means, performed either automatically or by the portal user, to sort the merged result set.
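A minimal sketch of those four steps follows, with two invented ‘connector’ functions standing in for real database interfaces: the query is broadcast to each source in parallel, and the results are merged, de-duplicated on title, and sorted.

    # Minimal federated search: broadcast, merge, de-duplicate, sort (toy connectors).
    from concurrent.futures import ThreadPoolExecutor

    def search_source_a(query):
        # A real connector would translate the query into this source's own
        # syntax and call its search interface; here we return canned results.
        return [{"title": "Deep web survey", "year": 2009, "source": "A"},
                {"title": "Federated search primer", "year": 2010, "source": "A"}]

    def search_source_b(query):
        return [{"title": "Federated search primer", "year": 2010, "source": "B"},
                {"title": "Crawling the surface web", "year": 2008, "source": "B"}]

    def federated_search(query, connectors):
        # (1) broadcast the query to every source in parallel
        with ThreadPoolExecutor() as pool:
            result_lists = list(pool.map(lambda search: search(query), connectors))
        # (2) merge the per-source result lists
        merged = [hit for hits in result_lists for hit in hits]
        # (3) de-duplicate on a simple title key
        unique = {hit["title"].lower(): hit for hit in merged}.values()
        # (4) sort the merged set, here by year, newest first
        return sorted(unique, key=lambda hit: hit["year"], reverse=True)

    for hit in federated_search("federated search", [search_source_a, search_source_b]):
        print(hit)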

Federated search portals, either commercial or open access, generally search public access bibliographic databases, public access Web-based library catalogues (OPACs), Web-based search engines like Google and/or open-access, and government-operated or corporate data collections. These individual information sources send back to the portal's interface a list of results from the search query. The user can review this hit list. Some portals will merely screen scrape the actual database results and not directly allow a user to enter the information source's application. More sophisticated ones will de-dupe the results list by merging and removing duplicates. There are additional features available in many federated portals, but the basic idea is the same: to improve the accuracy and relevance of individual searches as well as reduce the amount of time required to search for resources.

The process of federated search therefore gives some key advantages when compared with existing crawler-based search engines. Federated search need not place any requirements or burdens on owners of the individual information sources, other than handling increased traffic. Federated searches are inherently as current as the individual information sources, as they are searched in real time.

One of the leaders in the technical development of the federated search process is Deep Web Technologies Inc. Founded in 2002 by Abe Lederman, President and Chief Technology Officer, Deep Web Technologies sprang from Abe's three year relationship with the U.S. Department of Energy's Office of Scientific and Technical Information. Abe's 17-year background in the field of knowledge management, coupled with a network of connections from Verity and Los Alamos National Laboratory, and his pioneering work on the Explorit™ platform laid the groundwork for the company's work. From this affiliation, the first federated search for the federal government was born with the creation of Distributed Explorit™.


However, there are a number of other ways in which the Next Generation search software may appear – and these could revolve around Web 2 and Web 3 (or the Semantic Web) developments.

Web 2 and Social Collaboration

Already Web 2 is making its presence felt. For example, by March 2010 Facebook had surpassed Google as the biggest draw on the Internet. Facebook now accounts for 17% of the time people in Britain and the US spend online, according to Nielsen. User numbers worldwide on Facebook exceed 400 million, with profits (from a largely advertising-based formula) thought to be $1 billion a year. Suddenly there is a new competitor to Google for the eyeballs of the Internet user but, more importantly, it indicates a new way of ‘communicating’ which has implications for how future generations of researchers will interface with online software systems.

According to Mark Zuckerberg, the 26-year-old Wunderkind who helped create Facebook, it is no longer Google which is at the front of the new wave of finding information. Facebook has the critical mass to make discovery a viable alternative to conventional search engines. All the data people are sharing allows social networks to gather this information and use it in interesting and new ways. To make matters worse for Google, the information which Facebook generates sits within a walled garden which Google cannot penetrate (part of the deep web). Also, whilst Google’s traffic over the past year (2009) grew at 9%, the traffic on Facebook grew at 185%. An even more telling figure is that the time spent online by Facebook’s members was 46.1 billion minutes. This compares with 11 billion minutes for Google.

However, it is too early to forecast the ‘death of the search engine’. In Google’s case it has Gmail and YouTube as part of its armoury which give it broad appeal. But it is indicative of a new way of searching for information when users share their experiences and knowledge with others through a social media platform at the expense of the time they spend online with the major search engines.

Another aspect of the Web 2 development is how it is affecting science communication specifically. A striking feature is the growth of what is being called Citizen Science. This capitalises on the growth of a datacentric research world as well as on social collaboration in the Web 2 environment. It brings together a loose community of people with like interests, interests which have been honed to a high level of specialism partly through extensive experience and partly through a high level of earlier educational attainment. But sharing their joint interests through a common platform is the real contribution which the web is making to citizen science. Amateur hobbyists are emerging particularly in support of gathering data in areas such as the environment, global warming and pollution. The Internet has created new opportunities for such people to participate as both users and creators of scholarly information. The Internet is a leading component of the ‘democratisation of information’ process, as Al Gore has suggested.

This is particularly evident in the use being made of some of the world’s leading data centres. Data is now easily shareable. For example, the Sloan Digital Sky Server (http://cas.sdss.org/dr5/en/) contains some three terabytes of free public data provided by thirteen institutions with 500 attributes for each of the 300 million ‘objects’. In effect it is a prototype virtual e-Science laboratory. In astronomy, some 930,000 distinct users access the SkyServer.

This stands in contrast to the 10,000 officially recognised ‘professional astronomers’ worldwide. The amateurs exceed professionals by almost 100 to 1, and this could be but the tip of the iceberg. Over the past six years there have been 350 million web hits on the SDSS.

On the GalaxyZoo.org web site there are some 27 million visual galaxy classifications, many provided by the general public. 100,000 people participate in open access blogs.

These are examples of data-driven ‘citizen science’ made possible by access (in this instance) to data which is free. This demonstrates how open access to data can extend the reach of Science into new traditionally ‘disenfranchised’ areas, the amateur scientists and the general public.

There is a power that comes from being discovered by such social media services. Given that, according to the “Pew Online Activities and Pursuits” survey in the USA (March 2007), some 29% of American men and 27% of American women are online, there is again a huge potential market evolving. Whether this use is fuelling the adoption of such services, or such services are fuelling use, is perhaps incidental. But it does raise the question of whether the traditional generic search engines are well equipped to run with the social media developments in providing access to required information. The Facebook story might suggest otherwise.

The Semantic Web

Semantic Web is a term coined by World Wide Web Consortium (W3C) director Sir Tim Berners-Lee. It describes methods and technologies to allow machines to understand the meaning - or "semantics" - of information on the World Wide Web. While the term "Semantic Web" is not formally defined it is mainly used to describe the model and technologies proposed by the W3C. These technologies include the Resource Description Framework (RDF), a variety of data interchange formats (e.g. RDF/XML, N3, Turtle, N-Triples), and notations such as RDF Schema (RDFS) and the Web Ontology Language (OWL), all of which are intended to provide a formal description of concepts, terms, and relationships within a given knowledge domain. At its core, the semantic web comprises a philosophy, a set of design principles, collaborative working groups, and a variety of enabling technologies.

The original Scientific American article on the Semantic Web appeared in 2001. It described the evolution from a Web that consisted largely of documents for humans to read, to one that includes data and information for computers to manipulate. The Semantic Web is a Web of actionable information - information derived from data through implementing semantic practices. According to the original vision, the availability of machine-readable metadata would enable automated agents and other software to access the Web more intelligently. The agents would be able to perform tasks automatically and locate related information on behalf of the user. Many of the technologies proposed by the W3C already exist and are used in various projects. The Semantic Web as a global vision, however, has remained largely unrealised and its critics have questioned the feasibility of the ultimate approach as originally perceived. This is still a challenge for the future. It may take some ten years before the full effects of the semantic web are felt. It remains a distant dream of some of the leading communication scientists. Nevertheless, it is one which is worth considering as the platform for the Next Generation search software in the mid to long term, even though its impact currently remains marginal.
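To make the ‘data for computers to manipulate’ idea concrete, here is a minimal sketch of the triple model underlying RDF, with invented facts held as subject-predicate-object tuples. A real Semantic Web application would use RDF serialisations and a query language such as SPARQL rather than plain Python structures.

    # Invented subject-predicate-object triples and a tiny lookup over them.
    triples = {
        ("ex:article42", "ex:hasAuthor",  "ex:alice"),
        ("ex:article42", "ex:mentions",   "ex:benzene"),
        ("ex:benzene",   "ex:hasFormula", "C6H6"),
        ("ex:alice",     "ex:worksAt",    "ex:ExampleLab"),
    }

    def objects(subject, predicate):
        """Return every object linked to `subject` by `predicate`."""
        return [o for s, p, o in triples if s == subject and p == predicate]

    # A software agent can chain such lookups: find what an article mentions,
    # then fetch a property of the thing it mentions.
    for compound in objects("ex:article42", "ex:mentions"):
        print(compound, "has formula", objects(compound, "ex:hasFormula"))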

Shopbots and auction bots abound on the Web, but these are essentially handcrafted for particular tasks; they have little ability to interact with heterogeneous data and information types. Because large-scale, agent-based mediation services have not yet been delivered, some pundits argue that the Semantic Web has failed to materialise. However, it could be argued that agents can only flourish when standards are well established, and that the Web standards for expressing shared meaning have progressed steadily over the past five years. Furthermore, the use of ontologies in the e-science community is only just emerging, presaging ultimate success for the Semantic Web - just as the use of HTTP within the CERN particle physics community led to the revolutionary success of the original Web. Where semantic web technologies have found a greater degree of practical adoption, it has tended to be among core specialised communities and organisations working on intra-company projects. The practical constraints on adoption have appeared less challenging where domain and scope are more limited than those of the general public and the world wide web. The scientific/technical research community lies somewhere in between, and as such there have been a few scattered examples of its application in STI.

Examples of Semantic Web applications in STI

One specific example of a semantic web approach in operation is the Royal Society of Chemistry and its service entitled Prospect. Project Prospect runs across the RSC journals to enhance the online research articles. Its aim is to make the science within RSC journal articles machine-readable through semantic enrichment - including the integration of metadata into text. By identifying the compounds and subject terms it will be easier for users to find the articles that are most relevant to them, as well as providing downloadable information about compounds.

RSC editors annotate compounds, concepts and data within the articles and link these to additional electronic resources such as biological databases. This transforms the free text within an article, adding new ways of identifying, retrieving and presenting the information within RSC publications.

No other publisher is doing this for the chemical sciences, and the RSC is pioneering the use of these enhancements. Using ontologies and unique compound identifiers within the research articles makes it possible for search engines or a desktop computer to identify articles of interest without having to read each article and judge its relevance. This type of markup is a first step towards the "semantic web". It is also a move towards the more personalised, focused approach to delivering specialist information which the more generic search engines are not taking on board.

Phase 1 of Prospect, launched at the beginning of February 2007, comprised the identification of compound and subject information in selected RSC articles, displayed with the following functionality:

• Chemical compounds are highlighted in text and link to a compound page containing the InChI identifier, SMILES string, CML (Chemical Markup Language) link, related RSC articles, and a link to a 2D graphic;
• selected IUPAC Gold Book terms are highlighted in text, linking to the online version of the Gold Book;
• ontology menus link to definitions from the Gene Ontology, Sequence Ontology and Cell Ontology (all Open Biomedical Ontologies) and to related RSC articles;
• existing RSS feeds are enhanced with ontology terms in XML, primary compounds, InChIs and graphics.

From April 2008 the following additional functionality was added:

• structure and substructure searching of compounds within RSC’s enhanced articles;
• addition of ChEBI ontology terms;
• links to PubChem and the SureChem patents database;
• addition of the InChIKey compound identifier.

Text mining is used to attach structural information (InChI, SMILES and CML) to chemical names, especially chemical names which have never been seen before, and extensions handle terms defined in the Gold Book and ontology entries.
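
The sketch below illustrates, in a deliberately simplified form, the general idea behind this kind of enrichment: chemical names recognised in running text are linked to structural identifiers. It is a dictionary lookup only, not the OSCAR text-mining pipeline used by the RSC, and the article text is invented; the two InChI strings are the standard identifiers for benzene and ethanol.

```python
# A simplified, dictionary-based sketch of semantic enrichment: attaching
# identifiers (here InChI strings) to chemical names found in running text.
# This illustrates the general idea only, not the RSC's OSCAR pipeline;
# the lookup table and article text are invented.
import re

COMPOUND_IDS = {
    "benzene": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
    "ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
}

def annotate(text: str) -> list[dict]:
    """Return one annotation per recognised compound name in the text."""
    annotations = []
    for name, inchi in COMPOUND_IDS.items():
        for match in re.finditer(rf"\b{re.escape(name)}\b", text, re.IGNORECASE):
            annotations.append(
                {"term": match.group(0), "offset": match.start(), "inchi": inchi}
            )
    return annotations

article_text = "The reaction of benzene with ethanol was studied at 298 K."
for a in annotate(article_text):
    print(f"{a['term']!r} at offset {a['offset']} -> {a['inchi']}")
```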

The main recent developments came from work with the Unilever Centre for Molecular Informatics and the Computer Laboratory, both at the University of Cambridge, as part of the SciBorg project, though RSC has been supported by Peter Murray-Rust's research group at the Unilever Centre for several years. The Open Source Chemical Analysis Routines (OSCAR) used for text mining were developed by Peter Murray-Rust’s group. The Gene Ontology (GO) curators at the European Bioinformatics Institute have also helped with the application of ontology terms.

RSC intentionally launched this as an 'unfinished' product with some rough edges, to show the potential of these developments. The functionality should work well in recent web browsers, but RSC will not be attempting to ensure full compatibility for all legacy browsers.

Another example comes from the author sector rather than a publisher. Again, Professor Peter Murray-Rust from Cambridge University has been a longstanding advocate of using new informatics and unconventional techniques to disseminate chemistry information online. This is partly a response to the current wastage in the reporting of chemical research. He claims that 85% of the crystallographic data produced at Cambridge University’s chemical labs are thrown away, and for spectral data this rises to 99%.

In part the current practices of publishing are at fault. The reliance on PDF for publication is, in his opinion, nothing short of a disaster. It makes life difficult if one’s intention is to use and reuse information in a multiplicity of ways. XML and its related standards should become the basis for publication, involving more cost but allowing greater interoperability. According to Murray-Rust, the process of communication should not only involve humans but also machines. Underlying all this is the need for a dynamic ontology.

At Cambridge University Murray-Rust has been involved in a project which applies this semantic approach to doctoral theses. It is an open system which makes use of OSCAR (see above) as the editing system - software written by undergraduates and supported by the Royal Society of Chemistry. The ‘machine’ reads the thesis, tabulates it and adds spectral and other data where appropriate; if there are any mistakes the robot finds them. The document is composed in XML. The whole process is dynamic, not a series of static pictures as with PDF documents.
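
A tiny sketch of why this matters is given below: once a thesis fragment is marked up in XML, its data can be extracted programmatically, which is not practical with a static PDF page. The element names and values are invented for illustration; they are not CML or the markup actually used in the Cambridge project.

```python
# A minimal illustration of machine-readable markup: structured data in an
# XML fragment can be pulled out and reused by a program. The element names
# and values are invented; they are not CML or the project's real schema.
import xml.etree.ElementTree as ET

thesis_fragment = """
<experiment>
  <compound name="aspirin"/>
  <measurement type="melting-point" value="135" units="C"/>
  <measurement type="yield" value="78" units="percent"/>
</experiment>
"""

root = ET.fromstring(thesis_fragment)
compound = root.find("compound").get("name")
for m in root.findall("measurement"):
    print(f"{compound}: {m.get('type')} = {m.get('value')} {m.get('units')}")
```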

Knowlets

There are also specialised advanced search procedures being developed to cope with other unique disciplinary needs. One of these is in a biomedical project being run from Washington DC and Rotterdam in the Netherlands.

Dr Barend Mons leads a research team called KNEWCO, which combines the knowledge of biomedical research requirements with a proposal for a system which would unite aspects of the current journal publishing system with web 2.0, wikis and semantic web developments. His view is that there is an unstoppable move away from a text-based publishing system to one which deals with small nuggets of facts within a social networking system which would provide the quality control and commentary and thereby propel biomedical research forward.

Though articles would still be required in future, reading through a long article on, say, malaria only to find one small useful factoid near the end is a waste of time and resources. Such nuggets would become Knowlets, and would be included within an OmegaWiki-like database, to be commented on by wiki authors with authority. These Knowlets would have unique identifiers.

After the recognition of individual concepts in texts, the Knewco Knowlet™ technology makes reference to these concepts in an associative matrix. This matrix contains the 'associative distance' between each pair of concepts. Using Knewco's meta-analysis algorithms, a multidimensional 'concept cloud', or Knowlet, is created for the indexed paper. The semantic representation contains information that is not based on the document alone, but also on the entire set of common, established and potential knowledge about the subject. In the case of the biomedical life sciences, Knowlets™ comprise the established knowledge in the Medline space and therefore include an extra element of 'interpretation' over thesaurus-based and disambiguated concept lists.
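
Knewco's algorithms are proprietary, but the sketch below conveys the general idea of an 'associative distance' derived from an indexed corpus: concepts that co-occur in many documents are treated as closely associated. The three-document corpus and the Jaccard-style scoring are invented for illustration and do not represent the Knowlet technology itself.

```python
# A crude sketch of concept-to-concept association based on document
# co-occurrence. The corpus and scoring are invented; Knewco's Knowlet
# technology is proprietary and far more sophisticated.
from itertools import combinations
from collections import Counter

documents = [
    {"malaria", "plasmodium", "artemisinin"},
    {"malaria", "plasmodium", "mosquito"},
    {"malaria", "artemisinin", "drug resistance"},
]

concept_counts = Counter()
pair_counts = Counter()
for doc in documents:
    concept_counts.update(doc)
    pair_counts.update(frozenset(p) for p in combinations(sorted(doc), 2))

def association(a: str, b: str) -> float:
    """Jaccard-style association: co-occurrences over total occurrences."""
    co = pair_counts[frozenset((a, b))]
    return co / (concept_counts[a] + concept_counts[b] - co)

print(association("malaria", "plasmodium"))    # frequently co-occur -> high
print(association("mosquito", "artemisinin"))  # never co-occur -> 0.0
```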

These Knowlets, or ‘barcodes of knowledge’, address some of the complexities of biomedical research – complexities arising from the data involved, incompatible data formats, multidisciplinarity and multilingual papers, ambiguity in terminology, and the inability of the current system to share knowledge effectively. Knowlets are currently identified by and for life scientists, and wikis are used to comment on them within the community. There are wiki proteins, wiki authors (which include their unique IDs and publication records), wiki medical/clinical, and wiki phenotypes – each being exposed to the ‘million minds’ approach. Essentially the aim is to eliminate the barriers which stop people getting immediate access to research results. Only respected life scientists will be part of this wiki community. For more information see www.wikiprofessional.info.

This innovation makes it possible for users to immediately gain knowledge in an in-text format, for publishers to drive users to relevant areas of their own sites as well as make incremental revenue from new ad inventory, and for advertisers to more effectively reach audiences interested in concepts related to their products and services. Knewco’s commercial solution – KnowNow! – represents the latest technological advances in contextual advertising and knowledge discovery.

These are early examples of where the semantic web approach is beginning to bite in the traditional scientific publishing arena.

Implications of Semantic Web

As illustrated above, parts of the semantic web are being applied selectively in scholarly communication, and over the years the progress being achieved in the development of reliable standards, protocols and procedures will inevitably have an impact on the scholarly communication system. The key point is that, unlike the generic search engines typified by Google and Yahoo, these semantic web-based developments are community-inspired and community-supported. They vary according to the specific needs, structure and information challenges facing the community in question – issues which do not necessarily transpose from one discipline to the next.

This, then, may be how Next Generation Search systems emerge – community by community, rather than through an expansion of the generic search engines.

But when and how remain open questions. Web 3.0, which will harness the promise of the semantic web, is still a glimmer in a few enthusiasts’ eyes.

Nevertheless, to apply these concepts and procedures to the scholarly information world and electronic publishing in general, the University of Southampton and the Massachusetts Institute of Technology (MIT) have announced the launch of a long-term research collaboration that aims to produce the fundamental scientific advances necessary to guide the future design and use of the World Wide Web. The collaboration includes Sir Tim Berners-Lee. The Web Science Research Initiative (WSRI) will generate a research agenda for understanding the scientific, technical and social challenges underlying the growth of the web. WSRI research projects will weigh such questions as: How do we access information and assess its reliability? By what means may we ensure its use complies with social and legal rules? How will we preserve the web over time? These issues are fundamental in bringing forward the new generation of search systems.

Impact of search engines on publishers

Hidden within the pages of John Battelle’s book ‘The Search’ (Battelle, J., ‘The Search – How Google and its rivals rewrote the rules of business and transformed our culture’, Nicholas Brealey Publishing, Boston, 2005) is a concept that also potentially challenges scholarly publishers in particular. Battelle assesses the future for news services, given the way newspapers can be by-passed by the decentralised information collection and customised dissemination enabled through the web. As the web site http://www.epic.2015.com suggests, there could be an ultimate confrontation between the newspaper industry and Googlzon (a merger of Amazon and Google) later this decade, one which the search engine will win in the courts. Elements of this war are already in place. By the same token, the inference is that if scholarly publications cease to be destination sites (as news has) and become, thanks to the search engines, a commodity, how can traditional publishers continue to exist when there is no longer a branded journal per se to purchase?

As publishers continue to protect their journal subscription streams, so the argument goes, the information is no longer picked up by search engines for the future generation of digital scholars who are wedded almost exclusively to their preferred resource discovery system. The published research articles are no longer identified, are not part of the conversation within scholarly peer groups, and new channels emerge. Battelle’s recommendation for the news industry is to open up the sites, allow deep linking, and seek new value-added services. By implication, this could be the route for publishers if Battelle’s vision is brought to fruition.

The ‘long tail’ of publications – some no longer in print – can be made live again. The ‘long tail’ is particularly pertinent in the scholarly publication sector, where a real business can be made from serving the needs of esoteric, infrequently used publications, as services such as Amazon and eBay have demonstrated. The extensive ‘knowledge worker’ market which has been described in earlier issues of ICSTI Insights is a reflection of the long tail in the scholarly arena.

Taking an even more recent suggestion from Chris Anderson in his book ‘Free’, the future for publishers of every ilk will no longer lie in selling content, which would in future be included as freely accessible items within search engines, but rather in providing value-added or premium services which build on rather than exploit content.

Other models of delivering information using business models pioneered in mass-market sectors may also emerge, such as an iTunes-type micro-payment for content. DeepDyve is a new venture-capital-funded service which offers access to millions of scholarly documents at a pay-per-view charge of 99 US cents each. There are constraints on what one can do with the document as viewed on screen, which protect, to some extent, the sale of the full document, but it demonstrates that we are entering a new world of search and retrieval to which even the mighty search engines of today may need to adapt.

Ultimately the decision will lie with the market. How will the users and buyers of scholarly information want to access information in future years? A number of studies undertaken in recent years have looked at this.

How users find information

Outsell is one consultancy/research organisation which has investigated where Internet users go for their information. An interesting, and perhaps challenging, observation is that according to Outsell the Internet/web has declined as the main source of information, from 79% to 57% of their user sample. Increasingly this ‘user sample’ goes to their local intranet (5% rising to 19%). The library has remained stable at about 3-4%, whereas work colleagues have risen from 5% to 10%. ‘Others’ have been 8-10% (of which vendors account for about 4%).

Users seem to be increasingly enthusiastic about local intranets and local sources – they are getting more sophisticated, and it is increasingly a case of ‘if it is not accessible through the intranet it doesn’t exist’. This finding contradicts claims by Google and other major search engines that they dominate the information acquisition function.

The proportion of an academic’s work time spent seeking information has risen from 44% to 55%, but it is felt that this is not an efficient use of such time. The overall search failure rate is put at 31%. There are more and more information sources, hence decision-making becomes more and more difficult. If researchers cannot find information they tend to ask a colleague (64%) as the main second source. E-mails are the principal form of alerts being received (77%); blogs represent 45%, whereas RSS feeds account for only 20%. Blogs and podcasts have more use among the under-30s. The main message from the Outsell report is that it is essential to look at what the young are doing as drivers for new format adoption.

Hugh Look, Senior Consultant at Rightscom, has also reported on the ‘Researchers and Discovery Services Survey’, a report Rightscom undertook for the Research Information Network (see http://www.rin.ac.uk/researchers-discovery-services). The aim was to assess the use and perception of resource discovery services by academic researchers in the UK. The results were also intended to help determine priorities in the development of future services.

The study was based on a telephone survey of 450 research-related personnel in UK universities, 395 of whom were researchers (at PhD level and above) and 55 librarians and information officers, across all disciplines. The term ’user’ was broadly defined. In-depth interviews with postdoctoral researchers complemented the main interviews and were used to assess differences between those who had grown up as researchers in the Internet environment and those who had not. ’Resource discovery services’ (or advanced search engines) also included bibliographic A&I services, general Internet search services, dedicated guided portals (Intute in the UK), institutional library catalogues and portals, and libraries and librarians themselves. The main results were:

• General search engines are the most used, among which Google is used more than any other. This is in contradiction to the Outsell findings in the USA.
• The most heavily used resource discovery sources include general search engines, internal library portals and catalogues, specialist search engines and subject-specific gateways. The pattern of researchers’ named discovery resources is expressed by a long tail; a very few resources - Google, Web of Science/Web of Knowledge and ScienceDirect - were named by a large number of researchers.
• Most researchers rely on a range of resource discovery tools and select an appropriate tool for a specific inquiry. Researchers in the social sciences appear to use a wider range of resource discovery services than those in other disciplines.
• Satisfaction with discovery services is high, predominantly among researchers and scientists. In arts and humanities there were more concerns about gaps in service coverage.
• Within the library community, the internal library portal was the most used service.
• The issue of access (e.g. accessing a document once located) generated greater frustration among researchers and librarians than that of discovery. Another frustration concerned the lack of clear delineation between means and ends (between discovery services and what is being discovered).
• Among the range of resources found through the use of discovery services, journal articles are the most important. Virtually all researchers (99.5%) rely on the journal article as a key resource. Over 90% also use chapters in multiple-author books, organisation websites and individual expertise. The next most cited resource - monographs - is mentioned by only 32% of researchers.
• Peers and networks of colleagues are shown to be extremely important for virtually every type of inquiry. Research colleagues feature as important providers of information about resources, tools and new services, and this is particularly the case for postdoctoral researchers. Some researchers use email listservs; however, online social networking services have been less popular. Colleagues are relied upon for locating individuals, initiating research, discussing research funding and locating data sets.
• The majority of researchers work by refining down from large sets of results. Surprisingly, researchers were more concerned about missing important data than they were about the amount of time spent locating information. Concerns were also expressed about being overwhelmed by email, and in every discipline researchers bemoaned the number of irrelevant results delivered by general search engines.
• With respect to emerging tools, blogs were shown to be little used. A majority of researchers (62%) obtain regular information updates and alerts from services pushing information to their desktops, and email is the preferred tool for this (not RSS feeds). A smaller number use alerts on funding sources from research councils or specialist services. Sources for keeping up to date include journals themselves, email alerts, conferences and conference proceedings, among a wide range of ’other’ sources that were not discussed in detail.
• The focus of library activity has shifted and library support is now being delivered more often through the services provided than through personal contact. Librarians’ and researchers’ views diverged on a number of key issues including quality of discovery services, availability of resources, and gaps and problems. Researchers do their own searches in the vast majority of cases. Librarians overrated the importance of datasets to researchers and they used general search engines far less frequently than researchers. Librarians perceived researchers as conservative in their use of tools and were concerned that they were not reaching all researchers with formal training. Researchers did not perceive this to be a problem.
• Specific gaps in provision included access to foreign language materials, lack of distinction between actual sources and discovery services, difficulties in locating specific chapters in multiple-authored works due to lack of general indexes, and too-short backfiles of journals.
• A plea for ’one stop shops’ was made across the board. Researchers have come to agree that ’the more digital, the better’. Most expressed concern about not having access to a sufficient number of digital resources. Problems cited included institutions not subscribing to the full text of the e-journal, and overly short electronic backfiles.
• The data showed fewer differences between experience cohorts than one might expect. Frequent and regular use by experience or age did not play a significant role, although the younger group stands out clearly in the use of blogs. Differences between disciplines are somewhat more marked. Researchers in the life sciences make more use of their colleagues than in other disciplines. In the physical and life sciences, researchers tend to use general search engines more than average. The library portal is used more frequently by arts and humanities lecturers.
• Google is not being used for mission-critical applications. Rather it is relied on, often in combination with other tools, to locate organisations and individuals, to find references, or to research a new area.
• A wide range of resources are used, including bibliographic databases, Google, internal portals and Web of Science/Web of Knowledge. The category ’other’ (46%) includes a variety of discipline-specific resources.

Hugh Look also indicated that the boundaries between resources themselves and discovery services are increasingly permeable, a trend that is likely to continue as new forms of content aggregation are developed.

There are many other studies of user behaviour which can indicate some of the ways resource discovery could change in the future. A prominent study was undertaken by the Center for Studies in Higher Education (CSHE) at the University of California, Berkeley. In this study, faculty needs for ‘in-progress’ scholarly communication were investigated through 160 interviews across 45 US research institutions in seven academic fields: archaeology, astrophysics, biology, economics, history, music and political science. The interviews covered the full range of ages and status within universities, from graduate students and professors to high-level faculty administrators, and as such give an impression of the pressures which exist at various levels within a research centre regarding communication.

The main conclusion from the CSHE report is that peer-reviewed publications remain the ‘coin of the realm’. Despite the problems of refereeing, and the challenges from new and alternative media, journal articles remain important, particularly in astrophysics, biology, economics, and increasingly parts of political science. Research articles are significant in securing grants, particularly in astrophysics and biology. It is in winning grants, securing tenure and achieving promotion that the current established system of peer-reviewed scholarly communication has its solid roots. This is significant because it perpetuates the raison d’être for STEM publishers and gives confidence that their functions are future-proof.

There was nonetheless criticism in the CSHE study of some aspects of editorial and peer review, notably the long lag times and editorial quality issues. Speed of publication was also cited as an essential aspect of STEM communications, particularly in astrophysics, biology, economics and political science. Time is one of the most important limiting factors for all parties involved in the production and consumption of scholarship.

Some of the other findings from this five-year study include the suggestion that blogs may contribute to a scholar’s visibility but are largely neutral or negative from an institutional perspective. Credit is given to scholars if they produce datasets, cell lines, edited volumes, critical editions, exhibitions, dictionary/encyclopaedia entries, software, etc., but these are not the sole basis on which their scholarship is judged. One of the main difficulties is that evaluating such new genres still demands more time and inclination than most reviewers can give.

So the influential CSHE study suggests that there is a strong conservative element which needs to be factored into any new generation of search systems. Tearing up the existing hymn sheet might not work, and could lead to disjointed developments, spurred on by an over-enthusiastic application of IT to a largely cautious and conservative information community.

The Future Search Systems

Despite the apparent conservatism within the STEM communities there is clearly a need to optimise the delivery of content by making the discovery process easy, building in relevance and ensuring continued engagement. Future search engines will be challenged by a range of currently inaccessible information resources – information which is partly stored on one’s own PC, but which also forms part of the immense mountain of grey literature and the deep web which OSTI has highlighted. This surpasses what is currently available on the web by a factor of several thousand. As such, future search engines will need to parse all that data “not with the blunt instrument of a Page-like algorithm, but with subtle and sophisticated calculations based on your own clickstream” (Battelle).
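
The toy sketch below illustrates the kind of clickstream-based personalisation the quotation points towards: results whose topics overlap with topics the user has previously clicked on are promoted. The scoring, topics and results are invented for illustration and do not describe how any real search engine personalises its ranking.

```python
# A toy sketch of clickstream-based re-ranking: results whose topics overlap
# with topics the user has clicked on before are promoted. All data and the
# scoring rule are invented for illustration.
from collections import Counter

clickstream_topics = Counter(["crystallography", "spectroscopy",
                              "crystallography", "x-ray diffraction"])

results = [
    {"title": "Intro to NMR spectroscopy", "topics": {"spectroscopy"}},
    {"title": "Football results",          "topics": {"sport"}},
    {"title": "X-ray crystallography 101", "topics": {"crystallography",
                                                      "x-ray diffraction"}},
]

def personal_score(result: dict) -> int:
    """Sum of the user's past clicks on each topic attached to the result."""
    return sum(clickstream_topics[t] for t in result["topics"])

for r in sorted(results, key=personal_score, reverse=True):
    print(personal_score(r), r["title"])
```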

A more personalised and customised set of alerting and delivery services might need to be the basis for the New Generation of Search systems. Also, as the cultural diversity of the different research disciplines emerges, specialised search engines will be created which delve much deeper into the grey literature of a subject than the large search engines will ever reach. Google, Yahoo and MSN will remain as the umbrella services, pulling together search across specific sectors and seeking and achieving mass appeal, but supplemented by new community-led specialised services in specific STEM disciplines.

In effect, long-term progress will come from improving the relevance and engagement of landing pages and by intelligently changing content to suit either the source of a reader or their behaviour. But there is a long way to go to achieve this.

As indicated earlier, some of the main challenges facing the new generation of search engines include:

• How to embrace the need for both a generic overview of a topic (which search engines do well) and the ability for specialist target groups in STEM to drill down to items which are relevant only to a few users.

• The main general-purpose web search engines do not effectively tackle the large and diffuse ‘invisible’ or deep web. Not only is there information which is not crawled because the robots cannot reach it, but there is also formally published material which is not picked up in the top 10-20 hits, and therefore lies ignored in the lower rankings of search results.

• To ensure that the business model sits well with the openness of the Internet and social collaboration.

• Adopting a technology which is relevant and saleable and gives a better ‘user experience’.

• How to offer greater customisation and personalisation in the presentation of information to the user, preferably in anticipation of demand.

• Focusing on providing ‘an answer’ rather than just a list of hits of variable relevance.

It does seem that there is room for the generic search engines, such as Google, to continue their attempt to ‘organise the world’s information and make it accessible’ as a corporate mission, but there is equally a growing space for communities to develop their own unique ways of accessing their own unique information sources. The two can develop in parallel.

Equally, it has been shown by OSTI that a decentralised, federated technical approach can live alongside the centralised systems developed by the main current search engines. The nature of the information and the needs of the targeted user base will determine which approach makes greatest sense. Again, the new generation of search systems can take a variety of forms. This seems to be the main message: there will be no single approach which the new generation of search systems will take. It will be a hybrid system, with the dictates of the users being the main feature driving the future systems forward.
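
The sketch below shows the federated pattern in its simplest form: the same query is broadcast to several independent sources and the results are merged for the user. The two sources are stand-in functions invented for illustration; real federated services such as WorldWideScience.org query live national databases and then de-duplicate and rank the merged results.

```python
# A minimal sketch of federated search: broadcast one query to several
# independent sources in parallel and merge the results. The sources here
# are invented stand-ins, not real database connectors.
from concurrent.futures import ThreadPoolExecutor

def search_source_a(query: str) -> list[str]:
    return [f"Source A report on {query}"]

def search_source_b(query: str) -> list[str]:
    return [f"Source B thesis on {query}", f"Source B dataset on {query}"]

def federated_search(query: str) -> list[str]:
    sources = [search_source_a, search_source_b]
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        result_lists = pool.map(lambda s: s(query), sources)
    # A real service would de-duplicate and rank the merged hits here.
    return [hit for hits in result_lists for hit in hits]

print(federated_search("carbon capture"))
```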

References:

Anderson C., ‘Free – The Future of a Radical Price’, RH Business Books, 2009.

Battelle J., ‘The Search – How Google and Its Rivals rewrote the Rules of Business and Transformed Our Culture’, Nicholas Brealey, Boston/London, 2005

Morville P., ‘Ambient Findability’, O’Reilly, USA, 2005

Warnick, W., ‘Federated search as a transformational technology enabling knowledge discovery: the role of WorldWideScience.org’, Interlending and Document Supply, 38/2 (2010) 62-92, Emerald Group Publishing, 2010.

Wikipedia for technical definitions

Prepared for ICSTI by SCR Publishing Ltd, Oxford, UK

Copyright 2010 ICSTI. All rights reserved. No part of this product or service may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written consent.
