aggregating services - cs.helsinki.fi · facultyofscience departmentofcomputerscience...

Aggregating services Christoffer Björkskog Helsinki February 13, 2008 Master’s Thesis Chapter UNIVERSITY OF HELSINKI Department of Computer Science


Page 1: Aggregating services - cs.helsinki.fi · FacultyofScience DepartmentofComputerScience ChristofferBjörkskog Aggregatingservices Master’sThesisChapter February13,2008 12pages+0appendixpages

Aggregating services

Christoffer Björkskog

Helsinki February 13, 2008

Master’s Thesis Chapter

UNIVERSITY OF HELSINKI

Department of Computer Science


Faculty of Science Department of Computer Science

Christoffer Björkskog

Aggregating services

Master’s Thesis Chapter February 13, 2008 12 pages + 0 appendix pages

mobile media, context-awareness, web 2.0, aggregating services

Abstract to be written...



Contents

1 Introduction

2 Syndication

2.1 RSS and Atom

2.2 Aggregating content

3 Screen Scraping

4 Mashups

4.1 The architecture of a mashup site

4.2 The process of building a mashup

4.3 User built mashups

5 Conclusion

References


1 Introduction

An aggregate is a whole formed by several elements [dic]. Combining sources from different web pages into what are called aggregated documents is a common phenomenon on the internet today [ES08]: information from different sources is assembled into a whole. Millions of blogs (online diaries presented in reverse chronological order) and a large number of online newspapers and websites are syndicated; they publish their content in a machine-readable format as feeds that users and applications can access. The most common way of aggregating syndicated content is through search engines and RSS (Really Simple Syndication) readers. Mashups are a newer way of aggregating content. A mashup takes elements and information from different sources, retrieved either through APIs (Application Programming Interfaces) or by other means such as screen scraping or feeds, and combines them to create a new, unusual or interesting use of the data.

2 Syndication

Syndication means supplying feeds whose content others can subscribe to [GMR07]. An example is a blog that syndicates its content via RSS feeds. Other people can subscribe to the feed using an RSS reader, and when the author writes a new blog entry the subscribers can be notified of the change. Syndication increases traffic to the web page and builds brand awareness for it. It can also improve the web page’s ranking in search engines [Ham05]. Syndication helps maintain relationships between sites in a community and improves the relationship between a site and its users. Additional technology may enhance syndicated services, for instance an application that observes a syndicated feed and notifies a subscriber via instant messaging. Syndication adds richness to internet services and encourages reuse of data while pushing the continuous development of semantic technology forward. Providing syndication feeds also saves bandwidth, because it reduces the need for screen scraping.


Figure 1: Certain web browsers have a built-in RSS reader

2.1 RSS and Atom

RSS is a collection of XML (Extensible Markup Language) file formats that summarize the contents of a web page [BGS06]. The concept emerged in 1997 and the first version of RSS appeared in 1999 [GMR07]. Several versions of RSS exist due to forks in the development of the standard. The versions in wide use, each slightly different from the others, are RSS 1.0, RSS 2.0 and Atom. Each has its own specific uses as well as its own advantages and disadvantages for feed publishers. RSS is used especially by sites where new content is added regularly, for instance news sites and weblogs. The number of sites that support RSS is increasing rapidly [BGS06].

RSS and Atom are widely established within the weblogging community [Ham05]. Millions of weblogs (personal online diaries) are written worldwide, and most of them produce a syndication feed in RSS or Atom.
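To make the idea of a feed concrete, the following is a minimal sketch of how a site might generate an RSS 2.0 document for its entries. The element names (rss, channel, item, title, link, description, pubDate) come from RSS 2.0; the function names, the shape of the input objects and the example.org addresses are assumptions made only for this illustration.

```javascript
// Escape the five characters that are special in XML text and attributes.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build an RSS 2.0 document from a channel description and its entries.
// The shape of `channel` and `entries` is an assumption of this sketch.
function buildRssFeed(channel, entries) {
  const items = entries.map((entry) => [
    "    <item>",
    `      <title>${escapeXml(entry.title)}</title>`,
    `      <link>${escapeXml(entry.link)}</link>`,
    `      <pubDate>${entry.published.toUTCString()}</pubDate>`,
    "    </item>",
  ].join("\n"));

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<rss version="2.0">',
    "  <channel>",
    `    <title>${escapeXml(channel.title)}</title>`,
    `    <link>${escapeXml(channel.link)}</link>`,
    `    <description>${escapeXml(channel.description)}</description>`,
    ...items,
    "  </channel>",
    "</rss>",
  ].join("\n");
}

// Hypothetical blog with a single entry.
const exampleFeed = buildRssFeed(
  {
    title: "Example blog",
    link: "http://blog.example.org/",
    description: "A hypothetical weblog",
  },
  [
    {
      title: "Mashups & feeds",
      link: "http://blog.example.org/1",
      published: new Date(Date.UTC(2008, 1, 13)),
    },
  ]
);
console.log(exampleFeed);
```

A real publisher would normally rely on its blogging software or a feed library rather than string concatenation; the sketch only shows what a subscriber's reader receives.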


2.2 Aggregating content

Syndicated content can be aggregated, that is, brought together in one place, using for instance RSS readers. Desktop, web-based and mobile RSS readers exist, and each gives an overview of syndicated content. The earliest way of aggregating and reading RSS feeds was through web-based readers. Web-based RSS readers provide a convenient way of staying up to date with web pages of interest without having to sit at one's own personal computer. A popular web-based aggregator is Bloglines [Blo08]. Google has released an RSS reading service called Google Reader that has gained popularity.
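The core of such a reader can be sketched in a few lines, assuming the feeds have already been fetched and parsed into plain objects. The field names (title, entries, link, published) and the example.org addresses are hypothetical; this is not how any particular reader is implemented.

```javascript
// Merge the entries of several already-parsed feeds into one list,
// newest first, dropping entries whose link has been seen before.
function aggregateFeeds(feeds) {
  const seenLinks = new Set();
  const merged = [];
  for (const feed of feeds) {
    for (const entry of feed.entries) {
      if (seenLinks.has(entry.link)) continue; // skip duplicates and reposts
      seenLinks.add(entry.link);
      merged.push({ source: feed.title, ...entry });
    }
  }
  merged.sort((a, b) => b.published.getTime() - a.published.getTime());
  return merged;
}

// Hypothetical parsed feeds.
const parsedFeeds = [
  {
    title: "Feed A",
    entries: [
      { title: "A1", link: "http://a.example.org/1", published: new Date(Date.UTC(2008, 1, 10)) },
      { title: "A2", link: "http://a.example.org/2", published: new Date(Date.UTC(2008, 1, 12)) },
    ],
  },
  {
    title: "Feed B",
    entries: [
      { title: "B1", link: "http://b.example.org/1", published: new Date(Date.UTC(2008, 1, 11)) },
      { title: "A2 (repost)", link: "http://a.example.org/2", published: new Date(Date.UTC(2008, 1, 13)) },
    ],
  },
];

const overview = aggregateFeeds(parsedFeeds);
console.log(overview.map((entry) => `${entry.source}: ${entry.title}`));
```

Deduplicating by link is one possible design choice; a reader could equally use an entry identifier if the feed format supplies one.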

3 Screen Scraping

If content providers do not provide an API or a feed for accessing their data, mashup developers may need to resort to screen scraping to retrieve certain information [Mer06]. Content that was originally intended for human consumption is parsed and analyzed in order to find semantic data structures that represent it. This process is called scraping. The information and data structures acquired can then be used to create a mashup. Screen scraping is used for data acquisition by a handful of mashups, and may be needed especially when retrieving information from the public sector. An example of a mashup project that uses screen scraping is XMLTV, a set of tools that aggregates TV program listings from all over the world.

Screen scraping has at least two drawbacks. First, unlike an API, which is an interface defining how to access data, screen scraping involves no programmatic contract between content provider and content consumer. Scrapers have to analyze the content model of the site to be scraped and design their tools to comply with it. If the content provider changes the way it represents the content, the scrapers' tools are likely to stop working. Websites periodically update their look and feel in order to remain fresh and stylish, and this causes maintenance headaches for scrapers. Second, no screen-scraping toolkit software (sometimes called scrAPIs) exists that is sophisticated and reusable enough. Such APIs and toolkits are missing mainly because each scraping tool has extremely application-specific needs. Designers are forced to reverse-engineer content, parse and aggregate raw data, and create data models of the content.
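The missing contract can be illustrated with a deliberately naive scraper. The markup below is invented for this sketch (it is not taken from XMLTV or from any real listings site), and a regular expression stands in for the more careful parsing a real tool would use.

```javascript
// Hypothetical listing page as the scraper's author first saw it.
const listingHtml = `
<table id="schedule">
  <tr class="programme"><td class="time">18:00</td><td class="title">News</td></tr>
  <tr class="programme"><td class="time">18:30</td><td class="title">Weather &amp; Sports</td></tr>
</table>`;

// The same content after a hypothetical redesign of the site.
const redesignedHtml = `
<ul id="schedule">
  <li class="show"><span>18:00</span> News</li>
  <li class="show"><span>18:30</span> Weather &amp; Sports</li>
</ul>`;

// Decode the few character entities this sketch expects to meet.
function decodeEntities(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&");
}

// Extract {time, title} pairs; the pattern is tied to the exact markup above.
function scrapeProgrammes(html) {
  const rowPattern =
    /<tr class="programme">\s*<td class="time">([^<]*)<\/td>\s*<td class="title">([^<]*)<\/td>\s*<\/tr>/g;
  const programmes = [];
  for (const match of html.matchAll(rowPattern)) {
    programmes.push({
      time: match[1].trim(),
      title: decodeEntities(match[2].trim()),
    });
  }
  return programmes;
}

const programmes = scrapeProgrammes(listingHtml);
const afterRedesign = scrapeProgrammes(redesignedHtml);
console.log(programmes, afterRedesign);
```

The second call finds nothing even though the page still shows the same programmes, which is the maintenance problem described above.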


4 Mashups

Web users today have been transformed from content consumers into content providers [RT08]. Today it is possible to create professional-looking websites and blogs without knowing HTML, which has not always been the case. Many services exist where users can contribute content. The newest kinds of web tools combine, or enable users to combine, content from different web sources into unique services. These are called mashups. A mashup is a web application that combines data from multiple web sources to provide a unique service. Mashups have become popular because of their emphasis on interactive user experience and the union of data provided by different services [Mer06]. There are many different kinds of handcrafted mashup solutions [AKTV07] and a few different classes of popular mashups being developed [Mer06]. One example of a mashup is AdSense, where ads related to the content of a page are embedded into the web page [AKTV07]. See Fig. 4.

Mashups aggregate and combine third-party data and take advantage of data sources that usually come from another site or service [Mer06]. The name comes from popular music, where the word mashup is used for the combination of the vocal and music tracks from two songs that usually belong to different genres.

The most popular types of mashups are mapping mashups, video and photo mashups, search and shopping mashups, and news mashups.

Mapping mashups

One reason why mapping mashups, where data are presented graphically using maps, became popular was the release of the Google Maps API, which lets mashup builders present all kinds of data on maps. After that, Yahoo, Microsoft and AOL released their own map APIs as well. People collect and create vast amounts of data, and much of this data has some location identifier associated with it, which means that it can be displayed on a map. Many mapping mashups exist; some are collected at http://googlemapsmania.blogspot.com/. One example is the ChicagoCrime.org website, which displays crime activity in Chicago, fetched from the Chicago police’s database, on a Google Map. Users can interact with the mashup site, for instance having it display a map of south Chicago with the details of all recent burglaries shown graphically as pushpins on the map. See Fig. 3. Even though the concept and the representation are simple, combining crime and map data is visually powerful. The Google Maps API reference can be found at http://code.google.com/apis/maps/documentation/reference.html and describes how a developer can use Google Maps as part of a mashup. The following example code outputs a map, zoomed to level 15, centered on latitude 60.205796 and longitude 24.962525 (the Kumpula campus of the University of Helsinki), and puts a marker there; see Fig. 2.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
    <title>Google Maps JavaScript API Example</title>
    <script src="http://maps.google.com/maps?file=api&amp;v=2&amp;key=ABQIAAAAly60PJPTKn52cmHklOawphSDpq-8M7SD0oVCTvZ7J4UBlv1dIxQTX5hCNBSieJ1510Q3vngcsZJ52A"
            type="text/javascript"></script>
    <script type="text/javascript">
    //<![CDATA[
    function load() {
      if (GBrowserIsCompatible()) {
        var map = new GMap2(document.getElementById("map"));
        // 60.205796, 24.962525 is the latitude and longitude of Kumpula
        var location = new GLatLng(60.205796, 24.962525);
        var kumpula = new GMarker(location);
        map.setCenter(location, 15);
        map.addOverlay(kumpula);
      }
    }
    //]]>
    </script>
  </head>
  <body onload="load()" onunload="GUnload()">
    <div id="map" style="width: 500px; height: 300px"></div>
  </body>
</html>


Figure 2: The example code outputs this map

This example is a modification of the simple example found at http://code.google.com/apis/maps/documentation/examples/index.html
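Before anything can be drawn, a mapping mashup such as the crime map described earlier has to decide which records become pushpins. The following sketch shows one way to do that selection. The record fields, the record contents and the bounding-box coordinates are all invented for the illustration and do not come from ChicagoCrime.org.

```javascript
// Hypothetical records; every field name here is an assumption of the sketch.
const records = [
  { type: "burglary", description: "Record A", lat: 41.75, lng: -87.62 },
  { type: "theft", description: "Record B", lat: 41.76, lng: -87.6 },
  { type: "burglary", description: "Record C", lat: 41.95, lng: -87.65 },
  { type: "burglary", description: "Record D (no location)" },
];

// Keep records of one type that carry a location inside the given bounds,
// and reduce them to what a map marker needs.
function selectMarkers(allRecords, type, bounds) {
  return allRecords
    .filter((r) => r.type === type)
    .filter((r) => typeof r.lat === "number" && typeof r.lng === "number")
    .filter(
      (r) =>
        r.lat >= bounds.south &&
        r.lat <= bounds.north &&
        r.lng >= bounds.west &&
        r.lng <= bounds.east
    )
    .map((r) => ({ lat: r.lat, lng: r.lng, label: r.description }));
}

// Made-up bounding box standing in for "the south of the city".
const southernArea = { south: 41.64, north: 41.8, west: -87.8, east: -87.52 };
const markers = selectMarkers(records, "burglary", southernArea);
console.log(markers);
```

In a page like the example above, each selected marker could then be turned into a GLatLng and a GMarker and added to the map with addOverlay.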

Video and photo mashups

Some photo hosting services, such as Flickr, provide APIs for accessing shared photos [Mer06]. This has led to the emergence of many interesting mashups. The content providers have metadata associated with the images they host. These metadata can describe, for example, who took the picture, what the picture shows, and when and where it was taken. Mashup developers can then use the photos along with other information and match the metadata of the photos with other content. One example is combining the lyrics of a song with images related to the words of the song, or rendering the text of a web page in photos by matching the metadata of the pictures with the words.
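The matching step can be sketched as a lookup from words to tagged photos. The sketch assumes the photo metadata has already been retrieved; it does not call the Flickr API, and the photo objects, tags and addresses are hypothetical.

```javascript
// Pair each word of a text with a photo whose tags contain that word.
// `photos` is assumed to be metadata already fetched from a photo service.
function illustrateText(text, photos) {
  const photoByTag = new Map();
  for (const photo of photos) {
    for (const tag of photo.tags) {
      const key = tag.toLowerCase();
      if (!photoByTag.has(key)) photoByTag.set(key, photo); // first photo wins
    }
  }
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  return words.map((word) => ({
    word,
    photo: photoByTag.has(word) ? photoByTag.get(word).url : null,
  }));
}

// Hypothetical photo metadata.
const taggedPhotos = [
  { url: "http://photos.example.org/1.jpg", tags: ["Sun", "beach"] },
  { url: "http://photos.example.org/2.jpg", tags: ["sea", "sun"] },
];

const illustrated = illustrateText("Sun over the sea", taggedPhotos);
console.log(illustrated);
```

Words without a matching tag are kept with no photo, so a page built from the result can fall back to plain text for them.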

Search and shopping mashups

Search and shopping mashups existed before Web APIs were common and before the term mashup was coined for combining web content. Services such as MySimon, BizRate, PriceGrabber and Google’s Froogle aggregate price data so that prices can be compared. They used a combination of screen scraping and business-to-business technologies to aggregate the data. Consumer marketplaces such as eBay and Amazon provide APIs that make it easier to create mashups and interesting web services. Amazon, for instance, enables mashup creators to earn revenue from their mashups. A mashup developer might use Amazon's API to let users create wish lists of things they would like to buy or receive as gifts. If someone then buys an item from Amazon via the site, the mashup developer can get a certain percentage of the purchase as revenue from Amazon.

Figure 3: ChicagoCrime.org presents crime data on a map
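The price comparison at the heart of a shopping mashup can be sketched as follows, assuming the offers have already been collected from each shop by scraping or through an API. The shop names, products and prices are invented for the illustration.

```javascript
// Given offers gathered from several shops, keep the cheapest offer per product.
function cheapestOffers(sources) {
  const best = new Map();
  for (const source of sources) {
    for (const offer of source.offers) {
      const current = best.get(offer.product);
      if (!current || offer.price < current.price) {
        best.set(offer.product, {
          product: offer.product,
          price: offer.price,
          shop: source.shop,
        });
      }
    }
  }
  return [...best.values()];
}

// Hypothetical offers from two hypothetical shops.
const collectedOffers = [
  {
    shop: "Shop A",
    offers: [
      { product: "camera", price: 199 },
      { product: "tripod", price: 35 },
    ],
  },
  {
    shop: "Shop B",
    offers: [
      { product: "camera", price: 189 },
      { product: "tripod", price: 40 },
    ],
  },
];

const comparison = cheapestOffers(collectedOffers);
console.log(comparison);
```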

News mashups

Certain news sites, for instance the BBC, the New York Times and Reuters, have been syndicating their content by topic since 2002, using either RSS or Atom as the means of syndication. Mashups that aggregate syndicated feeds can create a personalized newspaper from the user's preferred feeds according to the person's interests. One example of a news feed mashup is Diggdot.us, which mashes up feeds from Digg.com, Slashdot.org and del.icio.us.

4.1 The architecture of a mashup site

API/Service providers

The content providers may be unaware that the content they provide is used to create web mashups, but they may also provide their content for this purpose, among others, by offering APIs for accessing it. The content may be provided through web protocols, for instance REST, Web Services or RSS/Atom. Many potential data providers do not yet provide APIs for accessing their data, and mashup developers acquire information from these web sites through screen scraping. The web page ChicagoCrime.org has Google and the Chicago Police Department as its API and service providers.

Figure 4: Ads related to the text in a page can be embedded into a webpage via the Google AdSense service.

The mashup site

The mashup site is where the service is hosted and where the mashup logic resides. It is not necessarily where the mashup is executed. The page can be produced through dynamic content generation on the server, through client-side scripting such as JavaScript, or, as is usually the case, through a combination of these. Mashup applications often use data provided to them by their own user base, so that at least one set of data is local. Complex queries may require processing that is not suitable for the client side, for instance a query that asks the mashup to show the "average purchase price for real estate bought by actors who have co-starred in movies with Kevin Bacon".

The client’s browser

The application is rendered graphically in the client’s web browser and some processing takes place there. For instance, the Google Maps API is intended to be accessed from the browser using JavaScript.


Figure 5: The process of producing mashups [MMD06]

4.2 The process of building a mashup

For complex mashups certain problems need to be solved [RT08]. See Fig. 5. The data sources for mashups may vary in format, structure and location [MMD06]. The source of the data could be a web page, a feed or even a PDF on a local file system. The data needs to be extracted into a manageable form [RT08]. If the data is extracted dynamically, the rules for extracting it need to be worked out. Data might span different locations and sources, making it more difficult to manage. The data needs to be organized and classified [MMD06] to make it manageable and to find relationships between existing data sources and new ones [RT08]. The data may need to be transformed and cleansed in order to fix misspellings or to convert it into the desired format. If, for instance, a retrieved dataset gives the artist name Norah Jones in the format "Jones, Norah" while the existing dataset uses the form "Norah Jones", a transformation is needed so that the new dataset matches the existing one. Similarly, an address may need to be transformed into a geographical position. The transformation process may use other services for these transformations as well. In [RT08] a data integration step is presented as part of the process of creating mashups; it involves combining two or more data sets. To build, for instance, a mashup that lists all the movies featuring this year’s Oscar winners, one would need to retrieve a list of this year’s Oscar winners and merge it with a movie database, using a database join operation on the winners' names. Finally, data visualization takes the final data and presents it to the user in a certain way.
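The cleaning and integration steps can be sketched together: names are first brought into one format and the two data sets are then joined on the normalized name. The function names and the sample people and films are hypothetical, and an in-memory Map stands in for the database join mentioned above.

```javascript
// Turn "Jones, Norah" into "Norah Jones"; leave other names unchanged.
function normalizeName(name) {
  const parts = name.split(",").map((part) => part.trim());
  return parts.length === 2 ? `${parts[1]} ${parts[0]}` : name.trim();
}

// Join a list of winners with a list of movies on the normalized name.
function joinByName(winners, movies) {
  const moviesByActor = new Map();
  for (const movie of movies) {
    const key = normalizeName(movie.actor).toLowerCase();
    if (!moviesByActor.has(key)) moviesByActor.set(key, []);
    moviesByActor.get(key).push(movie.title);
  }
  return winners.map((winner) => {
    const name = normalizeName(winner);
    return { name, movies: moviesByActor.get(name.toLowerCase()) || [] };
  });
}

// Hypothetical data sets that use different name formats.
const winnerList = ["Doe, Jane", "Richard Roe"];
const movieList = [
  { title: "Film 1", actor: "Jane Doe" },
  { title: "Film 2", actor: "Doe, Jane" },
  { title: "Film 3", actor: "Someone Else" },
];

const joined = joinByName(winnerList, movieList);
console.log(JSON.stringify(joined));
```

Matching on names alone is fragile (two people can share a name), so a real integration step would prefer a stable identifier where one exists.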


Figure 6: Data is gathered using XPath analysis, in which items similar to a given example are found [RT08]

4.3 User built mashups

There are different approaches that allow users to create mashups [RT08]. One is based on widgets, through which the user manages the different services and elements in the mashup. The other is based on extracting data through DOM (Document Object Model) analysis.

In "Building Mashups by Example" [RT08], a framework for building mashups is presented with which users without programming skills can build customized mashups by providing examples that the framework interprets in order to gather similar data. The DOM of the page is analyzed, and items similar to the suggested one are extracted from the document using XPath analysis. See Fig. 6. The data is cleaned by having the user modify an example data item into the desired form; the system interprets the change and derives a transformation that changes all similar data into the desired format.
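The idea of deriving a transformation from a single edited example can be illustrated with a toy version: a fixed list of candidate transformations is tried, and the first one that turns the original value into the user's edited value is applied to the remaining data. This is only a sketch of the principle and is not the algorithm used in [RT08]; the candidate list and the sample names are assumptions.

```javascript
// A small, fixed set of candidate transformations (an assumption of this sketch).
const candidateTransformations = [
  { name: "trim whitespace", apply: (s) => s.trim() },
  { name: "upper case", apply: (s) => s.toUpperCase() },
  {
    name: "swap 'Last, First' to 'First Last'",
    apply: (s) => {
      const m = s.match(/^\s*([^,]+),\s*(.+?)\s*$/);
      return m ? `${m[2]} ${m[1]}` : s;
    },
  },
];

// Return the first candidate that maps the example input to the edited output,
// or null when none of the candidates explains the edit.
function inferTransformation(before, after) {
  return candidateTransformations.find((c) => c.apply(before) === after) || null;
}

// The user edits one value by hand; the rule is then applied to the rest.
const inferredRule = inferTransformation("Jones, Norah", "Norah Jones");
const cleanedNames = ["Doe, Jane", "Roe, Richard"].map((s) => inferredRule.apply(s));
console.log(inferredRule.name, cleanedNames);
```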

Another means of creating mashups is widgets. Yahoo Pipes is a mashup building tool that lets users build mashups via widgets. Each widget produces an output that can either be fed into another widget as input or directed to the output of the mashup. Even though no programming is required, it may still be difficult for an average user to create a mashup, since using the features requires an underlying knowledge of certain programming terms and structures.
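The chaining of widgets corresponds to function composition, which the following sketch makes explicit. It does not use or reproduce the Yahoo Pipes interface; the widget names and the sample items are hypothetical.

```javascript
// Each "widget" is a function that takes a list of items and returns a list.
const filterByKeyword = (keyword) => (items) =>
  items.filter((item) =>
    item.title.toLowerCase().includes(keyword.toLowerCase())
  );

const sortByTitle = () => (items) =>
  [...items].sort((a, b) => a.title.localeCompare(b.title));

const truncate = (count) => (items) => items.slice(0, count);

// Wire widgets together: the output of one becomes the input of the next.
function pipe(...widgets) {
  return (items) => widgets.reduce((current, widget) => widget(current), items);
}

// Hypothetical feed items flowing through a three-widget pipe.
const feedItems = [
  { title: "Mashup news" },
  { title: "Atom feeds" },
  { title: "RSS feeds" },
  { title: "Feeds and maps" },
];

const examplePipe = pipe(filterByKeyword("feeds"), sortByTitle(), truncate(2));
const pipeOutput = examplePipe(feedItems);
console.log(pipeOutput.map((item) => item.title));
```

Even in this small form the user has to understand filtering, ordering and the direction of data flow, which is the underlying knowledge the paragraph above refers to.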

To create a mashup several problems need to be addressed. These problems include gathering data from multiple web sources, cleaning the data and combining it.


Figure 7: Users can build custom mashups using widgets in Yahoo Pipes [RT08]

5 Conclusion

References

AKTV07 Ankolekar, A., Krötzsch, M., Tran, T. and Vrandecic, D., The two cultures: mashing up web 2.0 and the semantic web. WWW ’07: Proceedings of the 16th international conference on World Wide Web, New York, NY, USA, 2007, ACM, pages 825–834.

BGS06 Blekas, A., Garofalakis, J. and Stefanis, V., Use of RSS feeds for content adaptation in mobile web browsing. W4A: Proceedings of the 2006 international cross-disciplinary workshop on Web accessibility (W4A), New York, NY, USA, 2006, ACM, pages 79–85.

Blo08 Bloglines, About Bloglines. IAC Search & Media, San Francisco Bay Area, 2008. URL: http://www.bloglines.com/about.

dic New Oxford American Dictionary.

ES08 Schrier, E., Dontcheva, M., Jacobs, C., Wade, G. and Salesin, D., Adaptive layout for dynamically aggregated documents. IUI ’08: Proceedings of the 13th international conference on Intelligent user interfaces, New York, NY, USA, 2008, ACM, pages 99–108.


GMR07 Glotzbach, R. J., Mohler, J. L. and Radwan, J. E., RSS as a course information delivery method. SIGGRAPH ’07: ACM SIGGRAPH 2007 educators program, New York, NY, USA, 2007, ACM, page 16.

Ham05 Hammersley, B., Developing Feeds with RSS and Atom. O’Reilly & Associates, April 2005. URL: http://www.oreilly.com/catalog/deveoprssatom/.

Mer06 Merrill, D., Mashups: The new breed of web app, 2006. URL: http://www.ibm.com/developerworks/xml/library/x-mashups.html.

MMD06 Murthy, S., Maier, D. and Delcambre, L., Mash-o-matic. DocEng ’06: Proceedings of the 2006 ACM symposium on Document engineering, New York, NY, USA, 2006, ACM, pages 205–214.

RT08 Tuchinda, R., Szekely, P. and Knoblock, C. A., Building mashups by example. IUI ’08: Proceedings of the 13th international conference on Intelligent user interfaces, New York, NY, USA, 2008, ACM, pages 139–148.