screen scraping

CLOUD COMPUTING. APPLICATION DEVELOPMENT AND WEB MINING

Miguel Fernández Fernández

University extension program, Universidad de Oviedo


Screen scraping

Why screen scraping

The Web is fundamentally built for humans (HTML)

<table width="100%" cellspacing="1" cellpadding="1" border="0" align="center">
  <tbody>
    <tr>
      <td valign="middle" align="center" colspan="5"> </td>
    </tr>
    <tr>
      <td align="center" class="cabe"> Hora Salida </td>
      <td align="center" class="cabe"> Hora Llegada </td>
      <td align="center" class="cabe"> Línea </td>
      <td align="center" class="cabe"> Tiempo de Viaje </td>
      <td align="center" class="cabe"> </td>
    </tr>
    <tr>...
      <td align="center" class="color1">06.39</td>
      <td align="center" class="color2">07.15</td>
      <td class="color3">C1 </td>
      <td align="center" class="color1">0.36</td>
      <td align="center" class="rojo3"> </td>
    </tr>
  </tbody>
</table>

But it is not designed to be processed by machines (XML, JSON, CSV...)

<horario>
  <viaje>
    <salida format="hh:mm">06:39</salida>
    <llegada format="hh:mm">07:15</llegada>
    <duracion format="minutes">36</duracion>
    <linea>C1</linea>
  </viaje>
</horario>
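As a minimal sketch of that gap, the snippet below turns a simplified copy of the timetable row into the machine-friendly XML. It assumes well-formed markup (so the standard REXML library can parse it) and the column order shown in the table; `SAMPLE_HTML`, `row_to_viaje` and `timetable_to_xml` are hypothetical names used only for illustration.

```ruby
require 'rexml/document'

# Simplified, well-formed copy of one timetable row. Assumption: real pages
# are rarely this clean and usually need a tolerant HTML parser.
SAMPLE_HTML = <<~HTML_ROW
  <table>
    <tr>
      <td class="color1">06.39</td>
      <td class="color2">07.15</td>
      <td class="color3">C1 </td>
      <td class="color1">0.36</td>
    </tr>
  </table>
HTML_ROW

# Hypothetical helper: maps the four cells of a row to a <viaje> element.
def row_to_viaje(row)
  salida, llegada, linea, duracion = row.get_elements('td').map { |td| td.text.strip }
  horas, minutos = duracion.split('.').map(&:to_i)

  viaje = REXML::Element.new('viaje')
  viaje.add_element('salida', 'format' => 'hh:mm').text = salida.tr('.', ':')
  viaje.add_element('llegada', 'format' => 'hh:mm').text = llegada.tr('.', ':')
  viaje.add_element('duracion', 'format' => 'minutes').text = (horas * 60 + minutos).to_s
  viaje.add_element('linea').text = linea
  viaje
end

# Hypothetical helper: wraps every row of the table in a <horario> element.
def timetable_to_xml(html)
  horario = REXML::Element.new('horario')
  REXML::Document.new(html).get_elements('//tr').each do |row|
    horario.add_element(row_to_viaje(row))
  end
  horario
end

puts timetable_to_xml(SAMPLE_HTML)
```

The conversion depends on cell positions, which is exactly why this kind of code breaks whenever the page layout changes.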

Why screen scraping

We do not always have an API available

We need to simulate human behavior:

Interpret HTML
Perform interactions (navigate)
Be a ninja: avoid causing a DoS
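One common way to avoid overloading the target server is to leave a minimum pause between requests. The class below is a sketch of that idea, not something taken from the slides: the name `Throttle`, the 0.1 s interval and the placeholder paths are all assumptions.

```ruby
# Hypothetical helper: enforces a minimum delay between consecutive requests
# so a scraper never hammers the target server.
class Throttle
  def initialize(min_interval,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @clock = clock       # injectable so the class can be tested without waiting
    @sleeper = sleeper
    @last_run = nil
  end

  # Waits if the previous call was too recent, then runs the block.
  def call
    unless @last_run.nil?
      wait = @min_interval - (@clock.call - @last_run)
      @sleeper.call(wait) if wait > 0
    end
    @last_run = @clock.call
    yield
  end
end

throttle = Throttle.new(0.1) # assumed value: at most one request every 0.1 s
%w[/page1 /page2 /page3].each do |path|
  # Replace the block body with a real HTTP request.
  throttle.call { puts "fetching #{path}" }
end
```

A suitable interval depends on the site; a longer pause is the safer default.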

Tool selection

Which language are we going to work with?

Java, .NET, Ruby, Python

URL fetching
  Java: java.net.URL
  .NET: System.Net.HttpWebRequest
  Ruby: net/http, open-uri, rest-open-uri (*)
  Python: urllib, urllib2

DOM parsing / traversing
  Java: javax.swing.text.html, TagSoup (*), NekoHTML (*)
  .NET: HTMLAgilityPack (*)
  Ruby: HTree (*) / REXML, Hpricot (*), RubyfulSoup (*)
  Python: BeautifulSoup (*)

Regexp
  Java: java.util.regex
  .NET: System.Text.RegularExpressions
  Ruby: Regexp
  Python: re

(*) Third-party libraries. They are not part of the language's standard API.
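For small, very regular fragments, the Regexp option alone can be enough. The sketch below uses Ruby's built-in `Regexp` and `String#scan`; the pattern and the helper name `extract_times` are assumptions made for illustration, and the approach tends to be more fragile than DOM parsing when the markup changes.

```ruby
# Assumed pattern: a table cell whose only content is a time written as hh.mm
TIME_CELL = %r{<td[^>]*>\s*(\d{1,2})\.(\d{2})\s*</td>}

# Hypothetical helper: returns every time found in the HTML as "hh:mm".
def extract_times(html)
  html.scan(TIME_CELL).map do |hours, minutes|
    format('%02d:%02d', hours.to_i, minutes.to_i)
  end
end

html = '<td align="center" class="color1">06.39</td> ' \
       '<td align="center" class="color2">07.15</td> ' \
       '<td class="color3">C1 </td>'
puts extract_times(html).inspect
```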



Dynamic languages make coding easier

Duck typing + Reflection = Syntactic Sugar

Java

import javax.swing.text.html.*;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import java.net.URL;
import java.io.InputStreamReader;
import java.io.Reader;

public class HTMLParser {
    public static void main(String[] argv) throws Exception {
        URL url = new URL("http://java.sun.com");
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        Reader htmlReader = new InputStreamReader(url.openConnection().getInputStream());
        kit.read(htmlReader, doc, 0);

        // Walk every element and print the src attribute of each <img>
        ElementIterator it = new ElementIterator(doc);
        Element elem;
        while ((elem = it.next()) != null) {
            if (elem.getName().equals("img")) {
                String s = (String) elem.getAttributes().getAttribute(HTML.Attribute.SRC);
                if (s != null)
                    System.out.println(s);
            }
        }
        System.exit(0);
    }
}

Ruby

require 'rubygems'
require 'open-uri'
require 'htree'
require 'rexml/document'

open("http://java.sun.com", :proxy => "http://localhost:8080") do |page|
  page_content = page.read()
  doc = HTree(page_content).to_rexml
  # Print the src attribute of every <img>
  doc.root.each_element('//img') { |elem| puts elem.attribute('src').value }
end

Tool selection: Ruby

rest-open-uri
HTree + REXML
RubyfulSoup
WWW::Mechanize
Hpricot

rest-open-uri

Lets us make requests to URLs and extract their content
Extends open-uri to support more HTTP verbs


HTree + REXML

HTree builds a tree of objects from HTML code
HTree#to_rexml converts that tree into a REXML tree
REXML can be navigated with XPath 1.0

require 'rubygems'
require 'open-uri'
require 'htree'
require 'rexml/document'

open("http://www.google.es/search?q=ruby", :proxy => "http://localhost:8080") do |page|
  page_content = page.read()
  doc = HTree(page_content).to_rexml
  # The class value must be quoted: in XPath an unquoted l would name a child element
  doc.root.each_element("//a[@class='l']") { |elem| puts elem.attribute('href').value }
end

Runtime: 7.06s.
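The XPath step can be tried offline with REXML alone, which ships with Ruby. This is a sketch under stated assumptions: `SAMPLE_PAGE` is an invented, already well-formed stand-in for the downloaded results page, and `links_with_class` is a hypothetical helper, so no HTree conversion is needed here.

```ruby
require 'rexml/document'

# Invented stand-in for a downloaded results page (assumed to be well-formed).
SAMPLE_PAGE = <<~XHTML
  <html>
    <body>
      <a class="l" href="http://www.ruby-lang.org/">Ruby</a>
      <a class="nav" href="/preferences">Settings</a>
      <a class="l" href="http://rubyonrails.org/">Rails</a>
    </body>
  </html>
XHTML

# Hypothetical helper: returns the href of every link with the given class.
def links_with_class(markup, css_class)
  doc = REXML::Document.new(markup)
  hrefs = []
  doc.root.each_element("//a[@class='#{css_class}']") do |elem|
    hrefs << elem.attribute('href').value
  end
  hrefs
end

puts links_with_class(SAMPLE_PAGE, 'l')
```

Only the two links whose class is `l` are selected; the navigation link is skipped.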


Hpricot

http://hpricot.com/

Scanner implemented in C (very fast)
Generates a DOM with its own navigation system (CSS selectors and XPath*), like jQuery
Functionality equivalent to HTree + REXML

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open('http://www.google.com/search?q=ruby', :proxy => 'http://localhost:8080'))
# Select the result links (class "l") and print their targets
links = doc/"//a[@class=l]"
links.each { |link| puts link.attributes['href'] }

Runtime: 3.71s



Rubyful Soup

Offers the same functionality as HTree + REXML

require 'rubygems'
require 'rubyful_soup'
require 'open-uri'

open("http://www.google.com/search?q=ruby", :proxy => "http://localhost:8080") do |page|
  page_content = page.read()
  soup = BeautifulSoup.new(page_content)
  # Find every <a> whose class is "l" and print its target
  result = soup.find_all('a', :attrs => {'class' => 'l'})
  result.each { |tag| puts tag['href'] }
end

Runtime: 4.71s


Lower performance than Hpricot
CSS selectors are not supported


WWW::Mechanize

Lets us perform interactions:
  Fill in and submit forms
  Follow links
Manages to reach documents in the Deep Web

require 'rubygems'
require 'mechanize'

agent = Mechanize.new
agent.set_proxy("localhost", 8080)
page = agent.get('http://www.google.com')

# Find the search form and fill in its query field
search_form = page.forms.select { |f| f.name == "f" }.first
search_form.fields.select { |f| f.name == 'q' }.first.value = "ruby"
search_results = agent.submit(search_form)
# Print the result links (class "l")
search_results.links.each { |link| puts link.href if link.attributes["class"] == "l" }

Runtime: 5.23s
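Filling in and submitting a form roughly amounts to sending an HTTP request that carries the field values. The sketch below builds such a request with the standard net/http library and never sends it; the URL, the field and the helper name `build_form_post` are assumptions for illustration. Mechanize additionally handles cookies, redirects and the form's declared method.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) the POST request that
# submitting a form with the given fields would produce.
def build_form_post(action_url, fields)
  uri = URI.parse(action_url)
  request = Net::HTTP::Post.new(uri.request_uri)
  request.set_form_data(fields) # URL-encodes the fields into the request body
  request
end

request = build_form_post('http://example.com/search', 'q' => 'ruby')
puts request.method
puts request.path
puts request.body
puts request['Content-Type']
```

To actually send it, the request would be passed to `Net::HTTP.start(host, port) { |http| http.request(request) }`.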

Hands on

No public API
>8000 new users every day
2 h average session
Data, data, data!

What's new on Tuenti

Step 1: Access our profile

agent = Mechanize.new
agent.set_proxy("localhost", 8080)
# claim to be Firefox by changing the User-Agent header
agent.user_agent_alias = 'Mac FireFox'
login_page = agent.get('http://m.tuenti.com/?m=login')

# grab the login form
login_form = login_page.forms.first
# and fill in the user and password fields
login_form.fields.select { |f| f.name == "email" }.first.value = "[email protected]"
login_form.fields.select { |f| f.name == "input_password" }.first.value = "xxxxx"

# is this really the home page?
pagina_de_inicio = agent.submit(login_form)

The site redirects via JavaScript, which Mechanize does not execute
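A JavaScript redirect gives an HTTP client nothing to follow, because the new location only exists inside a script. One possible workaround is to look for the target in the page source. This sketch assumes the script assigns a literal URL to `window.location` or `location.href`; the slides do not show the actual Tuenti markup, and both the pattern and the helper name are hypothetical.

```ruby
# Assumed pattern: window.location = "..." or location.href = '...'
JS_REDIRECT = /(?:window\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']/

# Hypothetical helper: returns the redirect target found in the page, or nil.
def javascript_redirect_target(html)
  match = JS_REDIRECT.match(html)
  match && match[1]
end

page = '<html><script>window.location = "http://example.com/home";</script></html>'
puts javascript_redirect_target(page)
```

Redirects computed at run time cannot be recovered this way, which is one reason to look for a simpler version of the site.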

Second attempt: mobile version

require 'rubygems'
require 'mechanize'

agent = Mechanize.new
agent.set_proxy("localhost", 8080)
# claim to be Firefox by changing the User-Agent header
agent.user_agent_alias = 'Mac FireFox'
login_page = agent.get('http://m.tuenti.com/?m=login')

# grab the login form
login_form = login_page.forms.first
# and fill in the user and password fields
login_form.fields.select { |f| f.name == "tuentiemail" }.first.value = "[email protected]"
login_form.fields.select { |f| f.name == "password" }.first.value = "xxxxxx"

pagina_de_inicio = agent.submit(login_form)

Eureka!


class TuentiAPI
  def initialize(login, password)
    @login = login
    @password = password
  end

  def inicio()
    # kept in an instance variable so other methods can reuse the session
    @agent = Mechanize.new
    @agent.set_proxy("localhost", 8080)
    # claim to be Firefox by changing the User-Agent header
    @agent.user_agent_alias = 'Mac FireFox'
    login_page = @agent.get('http://m.tuenti.com/?m=login')

    # grab the login form
    login_form = login_page.forms.first
    # and fill in the user and password fields
    login_form.fields.select { |f| f.name == "tuentiemail" }.first.value = @login
    login_form.fields.select { |f| f.name == "password" }.first.value = @password

    # the page returned after logging in is the home page
    @agent.submit(login_form)
  end
end

pagina_de_inicio = TuentiAPI.new("[email protected]", "xxxxxx").inicio()

Step 2: Get the photos

class TuentiAPI
  ...
  def fotos_nuevas()
    tree = Hpricot(inicio().content)
    # images inside links whose alt text is "Foto"
    fotos = tree / "//a//img[@alt=Foto]"
    fotos.map! { |foto| foto.attributes["src"] }
    # drop duplicate URLs (needs require 'set')
    Set.new(fotos).to_a
  end

  private

  def inicio()
    ...
  end
end

Step 3: Set the status

class TuentiAPI
  ...
  def actualizar_estado(msg)
    # the first form on the home page is the status update form
    form_actualizacion = inicio.forms.first
    form_actualizacion.fields.select { |f| f.name == "status" }.first.value = msg
    # assumes inicio stored the Mechanize agent in @agent
    @agent.submit(form_actualizacion)
  end
end

Ninja Moves

Tor: browsing anonymously

https://www.torproject.org/vidalia/

A network of chained proxies
N requests leave from M servers
Provides anonymity at the IP level
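From a scraper's point of view, using such a network comes down to sending requests through a local proxy. The sketch below only configures a client and never connects. It assumes a local HTTP proxy that forwards to Tor (for example Privoxy or Polipo) listening on localhost:8118; that port, the constants and the helper name are assumptions, not something stated in the slides.

```ruby
require 'net/http'

# Assumption: an HTTP proxy that forwards to Tor is listening here.
# Net::HTTP only speaks to HTTP proxies, so it cannot use Tor's SOCKS port directly.
PROXY_HOST = 'localhost'
PROXY_PORT = 8118

# Hypothetical helper: returns an unconnected client that will use the proxy.
def anonymous_client(host, port = 80)
  Net::HTTP.new(host, port, PROXY_HOST, PROXY_PORT)
end

client = anonymous_client('example.com')
puts client.proxy?
puts client.proxy_address
puts client.proxy_port
```

With Mechanize the equivalent step would be the `agent.set_proxy` call already used in the examples, pointed at the same local proxy.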



Thank you
