Retrieval and Extraction from Big Data Sources


Post on 29-Aug-2019


Page 1: Retrieval and Extraction from Big Data Sources ESTP... · 1. Information Sources, Retrieval and Extraction 2. Connecting to big data sources via API 3. Retrieving information from

Retrieval and Extraction from Big Data Sources

Outline

1. Information Sources, Retrieval and Extraction
2. Connecting to big data sources via API
3. Retrieving information from websites (web scraping)

1. Information Sources, Retrieval and Extraction

Information sources - Source types

The number of possible sources is practically unlimited, but they can be grouped into:

- Search engines
- RSS channels
- Open data
- Social media

Information sources - Search engines

There are different kinds of search engines:

- General
  - Google, Yahoo!, Bing...
- Thematic
  - Carrot2: http://search.carrot2.org
- Patents
  - National agencies (e.g. http://consultas2.oepm.es/InvenesWeb)
  - Google Patents (https://patents.google.com)
- Legal
  - The Public Library of Law (http://www.plol.org)
  - National agencies (e.g. http://www.poderjudicial.es/search/indexAN.jsp)
- ...

Information sources - RSS channels

RSS (Rich Site Summary or Really Simple Syndication) channels publish the latest news of a site, with full or summarized text and metadata such as the publication date or the author's name.
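As a rough illustration of what an RSS channel delivers, the sketch below reads a small feed and collects the text and metadata of each item. The feed content is invented and inlined so that the example runs offline, and the use of the xml2 package is an assumption (the slides do not name a tool for RSS).

```r
library(xml2)

# Hypothetical RSS 2.0 feed, inlined so the sketch runs without network access.
feed_xml <- '<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example news</title>
    <item>
      <title>First headline</title>
      <description>Summary of the first item.</description>
      <pubDate>Mon, 01 Feb 2016 09:00:00 GMT</pubDate>
      <author>alice@example.org (Alice)</author>
    </item>
    <item>
      <title>Second headline</title>
      <description>Summary of the second item.</description>
      <pubDate>Tue, 02 Feb 2016 09:00:00 GMT</pubDate>
      <author>bob@example.org (Bob)</author>
    </item>
  </channel>
</rss>'

feed  <- read_xml(feed_xml)
items <- xml_find_all(feed, "//item")

# One row per news item: the text plus its metadata.
news <- data.frame(
  title     = xml_text(xml_find_first(items, "title")),
  summary   = xml_text(xml_find_first(items, "description")),
  published = xml_text(xml_find_first(items, "pubDate")),
  author    = xml_text(xml_find_first(items, "author")),
  stringsAsFactors = FALSE
)
print(news)
```

For a real channel, the inlined string would be replaced by the feed URL passed to read_xml.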

Information sources - Open data

- National agencies (e.g. http://www.ine.es)
- Eurostat (http://epp.eurostat.ec.europa.eu/portal/page/portal/statistics/search_database)
- U.S. Census Bureau (http://www.census.gov/population/international/data/idb/informationGateway.php)
- U.S. Census Bureau - List of national statistics agencies (http://www.census.gov/population/international/links/stat_int.html)
- World Bank (http://data.worldbank.org)
- United Nations (http://data.un.org)
- CEPAL statistics (http://websie.eclac.cl/infest/ajax/cepalstat.asp)
- Asian-Pacific statistics (http://www.unescap.org/stat/data/swweb_syb2011/DataExplorer.aspx)
- WIPO (OMPI) Patents (http://www.wipo.int/patentscope/search/en/search.jsf)
- WIPO (OMPI) Brands (http://www.wipo.int/madrid/en/romarin)

Information sources - Social media

2. Connecting to big data sources via API

Connecting via API - HTTP protocol

● Application layer protocol
● Syntax and semantics for web communication
● HTTP/1.0 & HTTP/1.1
● Connectionless protocol
● Based on:
  ○ request <-> response
● Plain text messages
● Stateless: no state is kept between requests
● There are no sessions

[Figure: protocol stack. The services WEB, EMAIL, FTP and NEWS use the application protocols HTTP, POP3/SMTP, FTP and NEWS, which run over TCP/IP on top of the physical network. Client and server exchange request and response messages.]

Connecting via API - HTTP messages

HTTP MESSAGE

● Initial line
● Headers
● CRLF (blank line)
● Body

REQUEST

GET /tramitation.jsp HTTP/1.1
Host: www.uv.es
CRLF
Body: empty

RESPONSE

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 45
CRLF
Body: HTML + Img + … (or empty)
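The message layout above (initial line, headers, blank line, body) can be reproduced with plain strings. The following base R sketch builds the request from the slide, builds a response of the same shape, and splits a message back into its parts. The function name parse_http_message and the response body are made up for illustration; the Content-Length is computed from that body rather than copied from the slide.

```r
crlf <- "\r\n"

# Request from the slide: initial line, one header, blank line, empty body.
request <- paste0(
  "GET /tramitation.jsp HTTP/1.1", crlf,
  "Host: www.uv.es", crlf,
  crlf
)

# Response with the same layout; the body is an invented example.
body <- "<html><body><p>Hello, world!</p></body></html>"
response <- paste0(
  "HTTP/1.1 200 OK", crlf,
  "Content-Type: text/html", crlf,
  "Content-Length: ", nchar(body, type = "bytes"), crlf,
  crlf,
  body
)

# Split a raw HTTP message into initial line, headers and body.
parse_http_message <- function(msg) {
  # The first blank line (CRLF CRLF) separates the head from the body.
  parts <- strsplit(msg, "\r\n\r\n", fixed = TRUE)[[1]]
  head_lines <- strsplit(parts[1], "\r\n", fixed = TRUE)[[1]]
  header_lines <- head_lines[-1]
  # Header fields are "Name: value" pairs.
  headers <- sub("^[^:]+: *", "", header_lines)
  names(headers) <- sub(":.*$", "", header_lines)
  list(
    initial_line = head_lines[1],
    headers = headers,
    body = if (length(parts) > 1) paste(parts[-1], collapse = "\r\n\r\n") else ""
  )
}

parsed <- parse_http_message(response)
print(parsed$initial_line)
print(parsed$headers)
```

In practice a client library handles this framing; the sketch only makes the four parts of a message visible.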

Connecting via API - HTTP request methods

The two most commonly used HTTP request methods:

- GET: Requests data from a specified resource
- POST: Submits data to be processed to a specified resource

https://www.w3schools.com/tags/ref_httpmethods.asp

Connecting via API - HTTP response codes

1xx: Informative Messages

- 100 Continue
- 101 Switching Protocols

2xx: Success

- 200 OK
- 201 Created
- 202 Accepted
- 204 No Content
- 206 Partial Content

3xx: Redirection

- 300 Multiple Choices
- 301 Moved Permanently
- 302 Found
- 304 Not Modified

4xx: Client Error

- 400 Bad Request
- 401 Unauthorized
- 403 Forbidden
- 404 Not Found
- 405 Method Not Allowed
- 408 Request Timeout

5xx: Server Error

- 500 Internal Server Error
- 501 Not Implemented
- 502 Bad Gateway
- 503 Service Unavailable
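Since the first digit of a response code determines its class, a client can decide how to react without knowing every individual code. A minimal base R sketch of that grouping follows; the function name status_class is hypothetical.

```r
# Map HTTP status codes to their class using the hundreds digit.
status_class <- function(code) {
  classes <- c(
    "1" = "Informational",
    "2" = "Success",
    "3" = "Redirection",
    "4" = "Client Error",
    "5" = "Server Error"
  )
  key <- as.character(code %/% 100)
  # Codes outside 1xx-5xx have no defined class.
  unname(ifelse(key %in% names(classes), classes[key], "Unknown"))
}

print(status_class(c(200, 302, 404, 503)))
```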

Connecting via API - Example: Eurostat API

http://ec.europa.eu/eurostat/web/json-and-unicode-web-services

http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/nama_gdp_c?precision=1&geo=EU28&unit=EUR_HAB&time=2010&time=2011&indic_na=B1GM

URL       http://ec.europa.eu/eurostat/wdds
Service   rest/data
Version   v2.1
Format    json
Lang      en
Dataset   nama_gdp_c
Filters   precision=1
          geo=EU28        <- European Union (28 countries)
          unit=EUR_HAB    <- Euros per inhabitant
          time=2010
          time=2011       <- Years 2010 and 2011
          indic_na=B1GM   <- National Account Indicator: gross domestic product at market prices
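The decomposition above can be turned around: given the parts, the request URL is just their concatenation. The base R sketch below assembles it; the function name build_eurostat_url and its argument names are hypothetical, while the default values are the ones listed in the breakdown. It only builds the string and does not send the request, since the service may have changed since the slides were written.

```r
# Assemble a request URL from base URL, service, version, format, language,
# dataset and filters.
build_eurostat_url <- function(dataset, filters,
                               base = "http://ec.europa.eu/eurostat/wdds",
                               service = "rest/data", version = "v2.1",
                               format = "json", lang = "en") {
  # A filter may hold several values (e.g. two years);
  # each value becomes its own key=value pair.
  pairs <- unlist(lapply(names(filters), function(key) {
    paste0(key, "=", filters[[key]])
  }))
  paste0(
    paste(base, service, version, format, lang, dataset, sep = "/"),
    "?", paste(pairs, collapse = "&")
  )
}

url <- build_eurostat_url(
  "nama_gdp_c",
  list(precision = 1, geo = "EU28", unit = "EUR_HAB",
       time = c(2010, 2011), indic_na = "B1GM")
)
print(url)
```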


{"version":"2.0","label":"GDP and main components - Current prices","href":"http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/nama_gdp_c?precision=1&geo=EU28&unit=EUR_HAB&time=2010&time=2011&indic_na=B1GM","source":"Eurostat","updated":"2016-02-10","extension":{"datasetId":"nama_gdp_c","lang":"EN","description":null,"subTitle":null,"status":{"label":null}},"class":"dataset","value":{"0":24400,"1":25100},"dimension":{"unit":{"label":"unit","category":{"index":{"EUR_HAB":0},"label":{"EUR_HAB":"Euro per inhabitant"}}},"indic_na":{"label":"indic_na","category":{"index":{"B1GM":0},"label":{"B1GM":"Gross domestic product at market prices"}}},"geo":{"label":"geo","category":{"index":{"EU28":0},"label":{"EU28":"European Union (28 countries)"}}},"time":{"label":"time","category":{"index":{"2010":0,"2011":1},"label":{"2010":"2010","2011":"2011"}}}},"id":["unit","indic_na","geo","time"],"size":[1,1,1,2]}

Response in JSON format
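To show how such a response might be consumed, the sketch below parses a trimmed copy of it (only the fields that are used) and pairs each year with its value. The jsonlite package is an assumption, as the slides do not name a JSON parser. The lookup relies on the fact that only the time dimension has more than one category here, so the position in "value" equals the time index.

```r
library(jsonlite)

# Trimmed copy of the response above, keeping only the fields used below.
json_txt <- '{"label":"GDP and main components - Current prices",
  "value":{"0":24400,"1":25100},
  "dimension":{"time":{"label":"time",
    "category":{"index":{"2010":0,"2011":1}}}},
  "id":["unit","indic_na","geo","time"],
  "size":[1,1,1,2]}'

resp <- fromJSON(json_txt)

# "value" maps a position to an observation;
# the time index maps a year to a position.
time_index <- unlist(resp$dimension$time$category$index)
values <- unlist(resp$value)

gdp <- data.frame(
  year = names(time_index),
  eur_per_inhabitant = unname(values[as.character(time_index)]),
  stringsAsFactors = FALSE
)
print(gdp)
```

With more than one multi-valued dimension, the position would have to be computed from all indices and the "size" vector, which this sketch does not attempt.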

Connecting via API - JSON format

https://www.hurl.it/

JSON: JavaScript Object Notation.

- A syntax for storing and exchanging data.
- Text written in JavaScript object notation.
- Many web APIs nowadays exchange their messages in JSON.
- Lightweight: low payload overhead.
- Easily converted into JavaScript objects.
- Hard for humans to read at a glance ---> needs a formatter.

Connecting via API - JSON formatter

http://jsonlint.com

Paste the JSON code in the text area and click Validate JSON.

The result is more human readable. It is a hierarchical format:

[ ]               array
{ }               object
"key": "value"    attributes in key-value format
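The same validate-and-format step can be done locally instead of in a web page. A small sketch, assuming the jsonlite package (not mentioned in the slides) and a shortened, made-up JSON string:

```r
library(jsonlite)

# Compact JSON as it typically arrives from a web service.
compact <- '{"source":"Eurostat","value":{"0":24400,"1":25100},"id":["unit","indic_na","geo","time"]}'

# validate() reports whether the text is well-formed JSON.
is_valid <- validate(compact)
print(is_valid)

# prettify() adds line breaks and indentation, exposing the hierarchy of
# objects { }, arrays [ ] and "key": value pairs.
pretty <- prettify(compact, indent = 2)
cat(pretty)
```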

3. Retrieving information from websites (web scraping)

Web scraping - Definition

Web scraping, web harvesting or web data extraction is data scraping used for extracting data from websites.

Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol (HTTP).

It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

https://en.wikipedia.org/wiki/Web_scraping

Web scraping - rvest

rvest is an R package that makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like Beautiful Soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces.

Some system dependencies have to be installed first (Debian/Ubuntu package names):

- libcurl4-openssl-dev
- libssl-dev
- libxml2-dev

Web scraping - Example: LEGO film

http://www.imdb.com/title/tt1490017/

Let's extract the following information about the LEGO film:

- User rating
- Cast

Web scraping - HTML + CSS

HTML: HyperText Markup Language

- A language to create webpages.
- Tag-based syntax.
- E.g. <body></body><table></table>...
- Should describe the content, not its appearance.

CSS: Cascading Style Sheets

- A language to describe how HTML elements should be displayed on screen, paper or in other media.
- Key-value-based syntax associated with HTML elements.
- E.g. body {background-color: black; color: white; font-family: verdana; }
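The way CSS associates rules with HTML elements is what scrapers reuse to locate content. The sketch below runs a few selectors against a tiny inlined page with rvest; the markup, class names and id are invented for the example.

```r
library(rvest)

# Small hypothetical page, inlined so the example runs offline.
page <- read_html('<html><body>
  <h1 class="title">Example page</h1>
  <p id="intro">First paragraph.</p>
  <p class="note">Second paragraph.</p>
</body></html>')

# Tag selector: every <p> element.
paragraphs <- page %>% html_nodes("p") %>% html_text()

# Class selector: elements with class="note".
notes <- page %>% html_nodes(".note") %>% html_text()

# Id selector: the element with id="intro".
intro <- page %>% html_nodes("#intro") %>% html_text()

print(paragraphs)
print(notes)
print(intro)
```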

Web scraping - CSS selector

http://selectorgadget.com

There are several tools to find CSS selectors. Let's install SelectorGadget for Chrome.

Click on "Add to Chrome" and accept.

A new icon will appear in the Chrome toolbar.

Let's click on it.

Web scraping - Scraping rating

Move the mouse over the rating and click.

Copy the CSS value.

Use that selector in the R code that extracts the rating.
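The code shown on the original slide is not part of this transcript. A possible sketch with rvest follows, run against an inlined stand-in for the movie page so that it works offline. The markup, the selector "strong span", the rating value and the helper name scrape_rating are all assumptions; the selector copied from SelectorGadget on the live page may differ.

```r
library(rvest)

# Extract the text of the first node matching a CSS selector
# and convert it to a number.
scrape_rating <- function(page, selector) {
  page %>%
    html_node(selector) %>%
    html_text() %>%
    as.numeric()
}

# Offline stand-in for the movie page (hypothetical markup and value).
sample_page <- read_html(
  '<div class="ratingValue"><strong><span>7.8</span></strong></div>'
)
rating <- scrape_rating(sample_page, "strong span")
print(rating)

# Against the live site the calls would look like this (not run here):
# lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
# scrape_rating(lego_movie, "strong span")
```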

Web scraping - Scraping cast

To scrape the cast, the CSS selection has to be made twice:

- Select the table
- Select the column

Then, write the R code that applies both selections.
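The two-step selection can be sketched as two chained html_nodes calls: the first narrows the page to the table, the second picks the column inside it. The inlined markup, the id "cast_table" and the class "actor" are hypothetical stand-ins for whatever SelectorGadget reports on the real page.

```r
library(rvest)

# Offline stand-in for the cast section of the page (hypothetical markup).
sample_page <- read_html('<html><body>
  <table id="cast_table">
    <tr><td class="actor"> Actor A </td><td class="character">Role A</td></tr>
    <tr><td class="actor"> Actor B </td><td class="character">Role B</td></tr>
  </table>
  <table id="other_table">
    <tr><td class="actor">Not part of the cast</td></tr>
  </table>
</body></html>')

cast <- sample_page %>%
  html_nodes("#cast_table") %>%   # first selection: the table
  html_nodes(".actor") %>%        # second selection: the column within it
  html_text(trim = TRUE)          # cell text without surrounding whitespace

print(cast)
```

Selecting the table first keeps cells with the same class in other tables out of the result.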

Web scraping - Example: gold prices

https://www.measuringworth.com/datasets/gold/result.php

To obtain the table of results, the user has to choose some parameters:

- A series of markets
- Initial year
- Ending year

Hence, this is a dynamic webpage where the user sends parameters in a request to the server. There are two possibilities:

- GET
- POST

Web scraping - Firebug Tool

http://getfirebug.com/releases/lite/chrome/

Firebug allows you to inspect the HTML source code.

Click the Firebug icon on the browser toolbar.

An inspection window opens below the webpage.

Click the Inspect button and inspect the page.

Web scraping - Inspecting the form

When inspecting the form, we observe that the method is POST.

Web scraping - Looking for parameters

The needed parameters are the following:

- london

- goldsilver

- newyork

- us

- British

- year_source

- year_result
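The same inspection can also be done programmatically. The sketch below reads the method and the input names from a form with rvest. The form markup is a simplified, hypothetical reconstruction that only reuses the parameter names listed above; the input types and the action attribute are guesses, and the real page may be structured differently.

```r
library(rvest)

# Hypothetical, simplified copy of the form (input types are assumptions).
form_page <- read_html('<form method="post" action="result.php">
  <input type="checkbox" name="london">
  <input type="checkbox" name="goldsilver">
  <input type="checkbox" name="newyork">
  <input type="checkbox" name="us">
  <input type="checkbox" name="British">
  <input type="text" name="year_source">
  <input type="text" name="year_result">
</form>')

form <- html_node(form_page, "form")

# The method attribute tells whether the form is sent with GET or POST.
form_method <- toupper(html_attr(form, "method"))

# The name attribute of each input is the parameter name sent to the server.
form_params <- form %>% html_nodes("input") %>% html_attr("name")

print(form_method)
print(form_params)
```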

Web scraping - Required libraries

In order to make a POST request in R, the following libraries are needed:

- rvest
- httr

Web scraping - POST request

Instead of making a direct request to the URL, which by default is made as GET, we need to encode the request within a POST object with the following parameters:

- The URL
- The query (parameters) as a list of key=value pairs

Then, the body of the response is extracted with the content function.
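The original code is not included in this transcript, so the following is only one way the request might be written with httr. The parameter names come from the inspected form, but the values ("on", the years) are placeholders, and sending the parameters as a form-encoded body is an assumption about what the server expects. The helper name fetch_prices_page is hypothetical. The network call is wrapped in a function and left uncalled so the sketch can be read and loaded offline.

```r
library(httr)
library(rvest)

url <- "https://www.measuringworth.com/datasets/gold/result.php"

# Parameter names from the form; the values are placeholders.
query <- list(
  london = "on", goldsilver = "on", newyork = "on",
  us = "on", British = "on",
  year_source = "1900", year_result = "2000"
)

# Send the parameters in a POST request and return the parsed HTML page.
fetch_prices_page <- function(url, query) {
  response <- POST(url, body = query, encode = "form")
  stop_for_status(response)   # stop with an error on 4xx/5xx codes
  html_text <- content(response, as = "text", encoding = "UTF-8")
  read_html(html_text)
}

# html_prices <- fetch_prices_page(url, query)   # requires network access
```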

Web scraping - Results formatting

To obtain the dataset, a chain of functions (%>%) is used:

- html_prices contains the webpage HTML
- html_nodes("table") selects the list of elements matching the table CSS selector
- .[[2]] selects the second element in the list
- html_table converts it to an R table

The result is stored in prices, the first row is removed and the column names are assigned.
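The chain described above can be sketched as follows. To keep it runnable offline, html_prices is replaced by a small inlined page with two tables; the numbers and the column names "year" and "price" are invented.

```r
library(rvest)

# Offline stand-in for html_prices: the second table holds the data.
html_prices <- read_html('<html><body>
  <table><tr><td>layout table</td></tr></table>
  <table>
    <tr><td>Year</td><td>Price</td></tr>
    <tr><td>1900</td><td>1.5</td></tr>
    <tr><td>1901</td><td>2.5</td></tr>
  </table>
</body></html>')

prices <- html_prices %>%
  html_nodes("table") %>%   # every <table> element on the page
  .[[2]] %>%                # keep the second one
  html_table()              # convert it to a data frame

# The first row holds the header text, so drop it and name the columns.
prices <- prices[-1, ]
colnames(prices) <- c("year", "price")

# Columns arrive as text because of the header row; convert them.
prices$year  <- as.integer(prices$year)
prices$price <- as.numeric(prices$price)
print(prices)
```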

Web scraping - The whole R code
