Big Data analytics and models


Upload: bbva-innovation-center

Post on 29-Jan-2015


Category:

Technology



DESCRIPTION

Big Data analytics and models by Esteban Moro

TRANSCRIPT

Page 1: Big Data analytics and models

INNOVA CHALLENGE   BigDataSpain 7/11

Esteban Moro, Alejandro Llorente

www.iic.uam.es

Workshop BBVA – Open Innovation

Analytics & Models

Page 2: Big Data analytics and models


Page 3: Big Data analytics and models

https://www.centrodeinnovacionbbva.com/en/innovachallenge

Page 4: Big Data analytics and models

Maps

Activity

Infrastructures/Places

Analysis

Models

App

Content

Visualization

Analytics and Models

Challenge participant “roadmap”

Data, Mining, Development

Page 5: Big Data analytics and models

•  Introduction to geo-tagged data
•  Access to (open) geo-tagged data
•  Example: development of a geolocalized recommender app

Summary

Page 6: Big Data analytics and models

Introduction to geo-tagged data

Page 7: Big Data analytics and models

Introduction to geo-tagged data

Information: person, event, infrastructure.

Geography: GPS coordinates, zone, city

Page 8: Big Data analytics and models

Sources of geospatial Big Data: social media, sensors, satellite images, maps, activity (transport)

Geospatial Big Data

Page 9: Big Data analytics and models

With geo-tagged data we can:
•  Measure zone/area occupation and activity
•  Identify flows of people/money between different areas
•  …

With those data we can build applications in:
•  Geo-social analysis
•  Geomarketing
•  Optimal allocation of resources
•  Fraud detection
•  Event detection
•  …

Geo-tagged Big Data applications

Page 10: Big Data analytics and models

Use of pervasive sensors (mobile phones, social media) to model the movement and communication of people in urban areas.

Geo-social analysis

Page 11: Big Data analytics and models

Geolocation study in Madrid

Location: Puerta del Sol
Total number of check-ins: 2651 (30.5 per day)
Number of unique users in the area: 1231

[Figure: check-in counts by hour of day, by day of the week (Monday to Sunday), and by date (April to June 2011), split by venue type: arts_entertainment, food, nightlife, shops]

Top 10 places by number of check-ins (n_checkins):
 1. fnac: 316
 2. starbucks coffee: 269
 3. mercado de san miguel: 251
 4. el corte inglés: 136
 5. mercado de san antón: 113
 6. yelmo cines ideal 3d: 87
 7. vips: 84
 8. mcdonald's: 78
 9. café de oriente: 77
10. sala joy eslava: 71

Top 10 users by number of check-ins (n_checkins): 121, 73, 40, 39, 35, 33, 33, 32, 32, 30

Characterization of urban neighborhoods according to their social/commercial use

Geo-social analysis

Page 12: Big Data analytics and models

Use merchant location and/or IP address in online transactions to detect fraud.

Fraud detection

Page 13: Big Data analytics and models

Bars

Shops

Geomarketing

Manage  sales  risk  

Page 14: Big Data analytics and models

Bars

Shops

Identify the best placement for a new shop/branch

Optimize cash holdings in bank branches, minimizing the associated costs.

Optimal resource allocation

Page 15: Big Data analytics and models

Detect unexpected behavior using social/mobile/urban sensors

Event detection

Page 16: Big Data analytics and models

Access to (open) geographical data

Page 17: Big Data analytics and models

Map

Infrastructure/places

Activity

Geographical data

Page 18: Big Data analytics and models

Types of data

•  Maps
•  Economic/Demographic data
•  Activity: Twitter, BBVA API

Page 19: Big Data analytics and models

Maps :: Google Maps

Google Maps offers a number of different services/APIs, each with its own restrictions and protocols. They allow you to define maps, routes, markers, etc.

Example: get a static map (without authentication).

Base URL: http://maps.google.com/maps/api/staticmap
Parameters:

•  center: 40.4153,-3.6875
•  size: 640x640
•  maptype: mobile
•  format: png32
•  sensor: true
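The static map request above can be sketched as a URL built from the listed parameters. This is a minimal illustration in Python that only assembles the URL; the function name `static_map_url` is hypothetical, and current versions of the service may require an API key that the slide does not mention.

```python
from urllib.parse import urlencode

# Base URL as given on the slide.
BASE_URL = "http://maps.google.com/maps/api/staticmap"

def static_map_url(center, size="640x640", maptype="mobile",
                   fmt="png32", sensor="true"):
    """Build the static map URL from the slide's parameters (hypothetical helper)."""
    lat, lon = center
    params = {
        "center": f"{lat},{lon}",
        "size": size,
        "maptype": maptype,
        "format": fmt,
        "sensor": sensor,
    }
    # urlencode percent-encodes the comma in "center" as %2C.
    return f"{BASE_URL}?{urlencode(params)}"

url = static_map_url((40.4153, -3.6875))
print(url)
```

Fetching the image is then a plain HTTP GET on `url`.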

Page 20: Big Data analytics and models

Maps :: OpenStreetMap

Open and collaborative project to create and distribute free maps. Different APIs provide information about routes, points, maps, etc. There are a number of mapping projects (applications) built on top of OSM with very different purposes.

Example: get the route between two locations with MapQuest.
Base URL: http://open.mapquestapi.com/guidance/v1/
Parameters:

•  Key: authentication key
•  From: latitude and longitude of the origin, in JSON
•  To: latitude and longitude of the destination, in JSON
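As a rough sketch, the route request could be assembled as follows in Python. The slide only names the three parameters; the `route` path segment, the lowercase parameter names, and the exact JSON shape of the coordinates are assumptions made for illustration and should be checked against the MapQuest documentation.

```python
import json
from urllib.parse import urlencode, parse_qs, urlsplit

# Base URL as given on the slide.
BASE_URL = "http://open.mapquestapi.com/guidance/v1/"

def route_url(key, origin, destination, path="route"):
    """Build a route request URL; `path` and the JSON layout are assumptions."""
    def as_json(point):
        lat, lon = point
        return json.dumps({"latLng": {"lat": lat, "lng": lon}})

    params = {"key": key, "from": as_json(origin), "to": as_json(destination)}
    return f"{BASE_URL}{path}?{urlencode(params)}"

# Hypothetical key; origin and destination are two points in Madrid.
url = route_url("MY_KEY", (40.417, -3.703), (40.4153, -3.6875))
decoded = parse_qs(urlsplit(url).query)
print(json.loads(decoded["from"][0]))
```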

Page 21: Big Data analytics and models

Geospatial vector data format for geographical information

•  Regions, points, and paths are defined as points, lines, and polygons
•  Each of them usually has attributes that describe it: region codes, names, population, etc.

http://www.naturalearthdata.com/downloads/

pyshp: http://code.google.com/p/pyshp/
maptools: http://cran.r-project.org/web/packages/maptools

Maps :: shapefiles
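To illustrate the data model (polygons plus descriptive attributes), here is a small pure-Python sketch that assigns a coordinate to the region containing it with a ray-casting test. It does not use pyshp or maptools; the region records, attribute names, and square polygons are made up for the example.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the polygon ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical regions: each has attributes and one polygon ring.
regions = [
    {"attributes": {"code": "A", "name": "Region A", "population": 1000},
     "polygon": [(0, 0), (2, 0), (2, 2), (0, 2)]},
    {"attributes": {"code": "B", "name": "Region B", "population": 500},
     "polygon": [(2, 0), (4, 0), (4, 2), (2, 2)]},
]

def region_of(lon, lat, regions):
    """Return the code of the first region containing the point, or None."""
    for region in regions:
        if point_in_polygon(lon, lat, region["polygon"]):
            return region["attributes"]["code"]
    return None

print(region_of(1, 1, regions), region_of(3, 1, regions))
```

With real shapefiles, the polygons and attributes would come from a reader library instead of the hard-coded list.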

Page 22: Big Data analytics and models

Editing and visualization of shapefiles: http://www.qgis.org

Maps :: shapefiles

Page 23: Big Data analytics and models

CartoCiudad (Ministerio de Fomento): shapefiles for each province at municipality and postal code levels. They also include data about the urban background.

http://www.cartociudad.es/portal/

Maps :: Spain cartography

Page 24: Big Data analytics and models

Nomecalles (CAM): shapefiles, POIs (museums, theaters, health services), subway (stations), etc.

http://www.madrid.org/nomecalles/DescargaBDTCorte.icm

Resolution level: municipalities, districts, postal codes, etc.

Maps :: Madrid cartography

Page 25: Big Data analytics and models


Plan territorial metropolitano de Barcelona – Generalitat de Catalunya  Link  

Maps  ::  Barcelona  province  cartography  

Page 26: Big Data analytics and models


Open data gencat Catalonia Cartography  Link  

Maps  ::  Barcelona  City  cartography  

Page 27: Big Data analytics and models

Plan territorial metropolitano de Barcelona – Generalitat de Catalunya  Link

This website also has data about mobility, economic development, population, etc. at the district level.
There is nothing at this level of detail for Madrid.
Solution: use other data sources to estimate them (see below).

Maps :: Barcelona city cartography

Page 28: Big Data analytics and models

Demographic/Economic data :: Spain

Demographic data: Instituto Nacional de Estadística (INE). Census by province / municipality / census section.  Link

Economic data: Servicio Público de Empleo Estatal (SEPE). Unemployment by municipality.

   Link  

Page 29: Big Data analytics and models

Demographic/Economic data :: Madrid

Madrid City
Madrid City Council database: http://www-2.munimadrid.es/CSE6/jsps/menuBancoDatos.jsp
Population by district, neighborhood, etc.

Madrid Region
Comunidad de Madrid database: http://www.madrid.org/desvan/Inicio.icm?enlace=almudena
Population by municipality.
Economic data by municipality.

Page 30: Big Data analytics and models

Demographic/Economic data :: Barcelona

Barcelona city
Departament d’Estadística: http://www.bcn.cat/estadistica/castella/
Population by district.
Unemployment by district.

Catalonia region
Idescat (Institut d’Estadística de Catalunya): http://www.idescat.cat/es/
Population by municipality.
Economic data by municipality.

Page 31: Big Data analytics and models


Google  API  Console  

Other data sources :: Google Points of Interest

Page 32: Big Data analytics and models


Google  API  Console  

Other data sources :: Google Points of Interest

Page 33: Big Data analytics and models


Google  API  Console  

Other data sources :: Google Points of Interest

Page 34: Big Data analytics and models

Points of interest around Puerta del Sol (Madrid)

Service 1: Places Search
Parameters:
•  location: 40.417, -3.703
•  radius: 1000

Service 2: Places Details
Parameters:
•  reference: place code

Other data sources :: Google Points of Interest
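The two services above could be called with URLs like the ones built below. This Python sketch only constructs the request URLs; the endpoint paths (`nearbysearch/json`, `details/json`) and the `key` parameter are assumptions based on the public Google Places web service and may differ from the version shown in the workshop, and `MY_API_KEY` and `PLACE_CODE` are placeholders.

```python
from urllib.parse import urlencode

# Assumed base URL of the Google Places web service.
PLACES_BASE = "https://maps.googleapis.com/maps/api/place"

def places_search_url(lat, lon, radius_m, api_key):
    """Service 1: search for places around a location (radius in meters)."""
    params = {"location": f"{lat},{lon}", "radius": radius_m, "key": api_key}
    return f"{PLACES_BASE}/nearbysearch/json?{urlencode(params)}"

def place_details_url(reference, api_key):
    """Service 2: details for one place, identified by its reference code."""
    params = {"reference": reference, "key": api_key}
    return f"{PLACES_BASE}/details/json?{urlencode(params)}"

# Points of interest within 1000 m of Puerta del Sol.
search_url = places_search_url(40.417, -3.703, 1000, "MY_API_KEY")
details_url = place_details_url("PLACE_CODE", "MY_API_KEY")
print(search_url)
print(details_url)
```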

Page 35: Big Data analytics and models

GFS: Global Forecast System

OPeNDAP protocol.

Python implementation: pydap
Query format:

SERVER = http://nomads.ncep.noaa.gov:9090/dods/gfs_hd/
DATE = YYYYMMDD

HOUR = HH
VAR = weather metric (tmp2m, ugrd10m, pressfc, …)

LAT = latitude index interval, e.g. 259:263 (0.5º steps from the South Pole)
LON = longitude index interval, e.g. 710:714 (0.5º steps from Greenwich)

QUERY = {SERVER}gfs_hd{DATE}/gfs_hd_{HOUR}z.dods?{VAR}[0:0][{LAT}][{LON}]

dataset = open_dods(QUERY)

Other data sources :: Weather forecast
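The index intervals in the query follow from the grid description: with 0.5º steps, the latitude index counts from the South Pole (-90º) and the longitude index counts eastward from Greenwich (0º to 360º). The Python sketch below builds the query string under those assumptions; the helper names are hypothetical, the date is an arbitrary example, and the server URL is the one from the slide, which may no longer be available. It does not call pydap.

```python
SERVER = "http://nomads.ncep.noaa.gov:9090/dods/gfs_hd/"
STEP = 0.5  # grid resolution in degrees, as stated on the slide

def lat_index(lat):
    """Grid row for a latitude, counting 0.5-degree steps from the South Pole."""
    return round((lat + 90.0) / STEP)

def lon_index(lon):
    """Grid column for a longitude, counting 0.5-degree steps east of Greenwich."""
    return round((lon % 360.0) / STEP)

def build_query(date, hour, var, lat_range, lon_range):
    """Assemble the query string in the format given on the slide."""
    lat = f"{lat_range[0]}:{lat_range[1]}"
    lon = f"{lon_range[0]}:{lon_range[1]}"
    return f"{SERVER}gfs_hd{date}/gfs_hd_{hour}z.dods?{var}[0:0][{lat}][{lon}]"

# Madrid (about 40.4 N, 3.7 W) falls inside the slide's index intervals.
print(lat_index(40.4), lon_index(-3.7))
query = build_query("20131107", "00", "tmp2m", (259, 263), (710, 714))
print(query)
```

The resulting string is what would be passed to `open_dods` in the slide's example.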

Page 36: Big Data analytics and models

Developers webpage http://dev.twitter.com

Activity :: data from Twitter API

Page 37: Big Data analytics and models

Developers webpage http://dev.twitter.com

Activity :: data from Twitter API

Page 38: Big Data analytics and models

Developers webpage http://dev.twitter.com

Activity :: data from Twitter API

Page 39: Big Data analytics and models

Developers webpage http://dev.twitter.com

Consumer Key

Consumer Secret

Access token

Access token secret

Activity :: data from Twitter API

Page 40: Big Data analytics and models

Consumer Key

Consumer Secret

Access token

Access token secret

OAuth Authentication

REST API: several queries with parameters. The number of requests is limited.

Stream API: only one query (with parameters). Requests are not time-limited.

Activity :: data from Twitter API

Page 41: Big Data analytics and models

Stream API

Example: geolocalized tweets in the Madrid region

API service: POST statuses/filter

Parameters:
•  locations: -4.59, 39.90, -3.04, 41.17

Activity :: data from Twitter API
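The `locations` value is a bounding box given as longitude, latitude of the south-west corner followed by longitude, latitude of the north-east corner. The Python sketch below formats that parameter and checks whether a coordinate falls inside the box; it does not open a streaming connection, and the helper names are hypothetical.

```python
# Bounding box from the slide: (west lon, south lat, east lon, north lat).
MADRID_BBOX = (-4.59, 39.90, -3.04, 41.17)

def locations_param(bbox):
    """Format the bounding box as the comma-separated `locations` value."""
    return ",".join(str(v) for v in bbox)

def in_bbox(lon, lat, bbox):
    """True if the point lies inside the bounding box (edges included)."""
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north

print(locations_param(MADRID_BBOX))
print(in_bbox(-3.703, 40.417, MADRID_BBOX))  # Puerta del Sol
```

A local check such as `in_bbox` can be useful for filtering received tweets by their exact coordinates.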

Page 42: Big Data analytics and models

As we said before, there are no data for Madrid on administrative zones below the municipality level. But we can estimate some of them with Twitter.

•  Example: population by postal code

1.  Round geographical coordinates to the 3rd decimal place (square cells roughly 100 meters on a side).

2.  For each user, find the most visited postal code and define it as his/her residence. Count the number of residents per postal code.

3.  Visualize.

Stream API

Activity :: data from Twitter API
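Steps 1 and 2 above can be sketched in a few lines of Python. The tweet tuples and the `cell_to_postal_code` lookup are made-up illustrations; in practice the lookup would be derived from postal-code polygons such as the shapefiles mentioned earlier.

```python
from collections import Counter, defaultdict

def cell_of(lat, lon):
    """Step 1: round coordinates to 3 decimals to get a grid cell."""
    return (round(lat, 3), round(lon, 3))

def estimate_residents(tweets, cell_to_postal_code):
    """Step 2: assign each user to the most visited postal code, then count."""
    visits = defaultdict(Counter)
    for user, lat, lon in tweets:
        postal_code = cell_to_postal_code.get(cell_of(lat, lon))
        if postal_code is not None:
            visits[user][postal_code] += 1
    residents = Counter()
    for user, counts in visits.items():
        home, _ = counts.most_common(1)[0]
        residents[home] += 1
    return residents

# Hypothetical data: (user, latitude, longitude) of geolocalized tweets.
tweets = [
    ("u1", 40.4169, -3.7034),
    ("u1", 40.41692, -3.70338),
    ("u1", 40.4200, -3.6900),
    ("u2", 40.4201, -3.6899),
]
cell_to_postal_code = {(40.417, -3.703): "28013", (40.42, -3.69): "28014"}

residents = estimate_residents(tweets, cell_to_postal_code)
print(dict(residents))
```

Taking the most visited postal code as the residence is the heuristic stated on the slide; restricting the count to night-time tweets would be one possible refinement, not something the slides describe.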

Page 43: Big Data analytics and models

Stream API

Activity :: data from Twitter API

Page 44: Big Data analytics and models

Stream API

Activity :: data from Twitter API

Page 45: Big Data analytics and models

https://www.centrodeinnovacionbbva.com/signup

Activity :: data from BBVA API

Page 46: Big Data analytics and models

https://developer.bbva.com/panel

Activity :: data from BBVA API

Page 47: Big Data analytics and models

https://developer.bbva.com/panel

Activity :: data from BBVA API

Page 48: Big Data analytics and models

https://developer.bbva.com/panel

Activity :: data from BBVA API

Page 49: Big Data analytics and models

Getting the authentication data:

Example:
APP_ID = "iic_formacion_innovachallenge"
APP_KEY = "0f1d750a5baea6c7022452d0d2ece01fc5901ad7"
str_to_encode = "iic_formacion_innovachallenge:0f1d750a5baea6c7022452d0d2ece01fc5901ad7"
auth = strToBase64(str_to_encode)
Request = HttpRequest(SERVICE, PARAMETERS, header = {'Authorization' : auth})

1.  With the APP_ID and APP_KEY, generate the authorization code by concatenating both strings with a colon (":") and encoding the result in base64.

2.  This authorization code is added to the HTTP request header.

Activity :: data from BBVA API
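The two steps above can be written with the Python standard library as follows. The credentials here are placeholders, and the header is built exactly as on the slide (the bare base64 string as the `Authorization` value); whether the service expects any prefix in front of it is not stated in the slides.

```python
import base64

def authorization_code(app_id, app_key):
    """Step 1: join ID and key with a colon and encode the result in base64."""
    raw = f"{app_id}:{app_key}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")

# Placeholder credentials; use the APP_ID and APP_KEY of your own application.
auth = authorization_code("my_app_id", "my_app_key")

# Step 2: the code goes into the HTTP request header.
headers = {"Authorization": auth}
print(headers)
```

The `headers` dictionary can then be passed to whichever HTTP client is used to call the API services.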

Page 50: Big Data analytics and models

Parameters

Workshop 30th October

Activity :: CUSTOMER_ZIPCODES example

Page 51: Big Data analytics and models

Extracting data

Activity :: CUSTOMER_ZIPCODES example

Workshop 30th October

Page 52: Big Data analytics and models

Building the adjacency list

Activity :: CUSTOMER_ZIPCODES example

Workshop 30th October
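The code shown on this slide is not part of the transcript, so the following Python sketch is only one way an adjacency list could be built from CUSTOMER_ZIPCODES results. The record layout (merchant postal code, customer home postal code, number of cards) and the sample values are assumptions for illustration, not the actual response format of the API.

```python
from collections import defaultdict

# Hypothetical flattened records:
# (merchant postal code, customer home postal code, number of cards)
records = [
    ("28013", "28045", 120),
    ("28013", "28009", 80),
    ("28009", "28045", 15),
]

def build_adjacency(records, min_cards=1):
    """Weighted adjacency list: home postal code -> {merchant postal code: cards}."""
    adjacency = defaultdict(dict)
    for merchant_zip, customer_zip, num_cards in records:
        if num_cards >= min_cards:
            links = adjacency[customer_zip]
            links[merchant_zip] = links.get(merchant_zip, 0) + num_cards
    return dict(adjacency)

adjacency = build_adjacency(records)
print(adjacency)
```

Each entry is a directed, weighted edge from where customers live to where they spend, which is the structure a graph library would need for plotting.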

Page 53: Big Data analytics and models

Building and plotting the graph

Activity :: CUSTOMER_ZIPCODES example

Workshop 30th October

Page 54: Big Data analytics and models

Economic flows from Puerta del Sol

API service: customer_zipcodes

Parameters:
•  date_min: 201304
•  date_max: 201304
•  zipcode: 28013
•  by: cards
•  group_by: month

Activity :: CUSTOMER_ZIPCODES example

Page 55: Big Data analytics and models

Example: development of a geolocalized recommender app.

Page 56: Big Data analytics and models

Objective: recommend to users which areas to visit according to their profile, residence, preferences, etc., using information about what similar users do.

Data used:

1.  API Innova Challenge – CARDS_CUBE.
2.  API Innova Challenge – CUSTOMER_ZIPCODES.

Recommender systems :: Introduction

Page 57: Big Data analytics and models

Use Twitter data to:

1.  Find out what people are talking about in each city area.

2.  Analyze each user's language on Twitter.

3.  Compare the user's language with each area's language and recommend the most similar areas to the user.

Recommender systems :: user language
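One simple way to compare a user's language with an area's language is cosine similarity between word-count vectors. The slides do not say which similarity measure was used, so this Python sketch is an assumption, and the example texts are invented.

```python
import math
from collections import Counter

def word_counts(texts):
    """Bag-of-words counts over a list of texts (lowercased, split on spaces)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend_areas(user_texts, area_texts, top=3):
    """Rank areas by similarity between the user's words and the area's words."""
    user = word_counts(user_texts)
    scores = {area: cosine(user, word_counts(texts))
              for area, texts in area_texts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Invented tweets grouped by postal code, and one user's tweets.
area_texts = {
    "28013": ["tapas bar concert", "bar theatre"],
    "28009": ["park running lake", "park picnic"],
}
ranking = recommend_areas(["running in the park"], area_texts)
print(ranking)
```

A real system would likely remove stop words and weight terms (for example with TF-IDF) before comparing.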

Page 58: Big Data analytics and models

Postal code 28013: Madrid city center

Recommender systems :: user language

Page 59: Big Data analytics and models

Postal code 28009: Retiro

Recommender systems :: user language

Page 60: Big Data analytics and models

Use the CARDS_CUBE service from the BBVA API

Recommender systems :: user demographic profile

Page 61: Big Data analytics and models

•  Use CARDS_CUBE service data

•  For each merchant category Z (bars, fashion, health, etc.), build a matrix in which each entry is the number of different credit cards for a given profile X (gender, age) that were used for shopping in postal code Y at a merchant of category Z.

Where do people like me go shopping?
Which restaurants are visited by people similar to me?

Recommender systems :: user demographic profile
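The per-category matrices described above can be sketched with nested dictionaries in Python. The row layout and the sample numbers are assumptions for illustration and do not reflect the actual CARDS_CUBE response format.

```python
from collections import defaultdict

# Hypothetical flattened rows:
# (category, gender, age band, merchant postal code, number of distinct cards)
rows = [
    ("bars", "M", "36-45", "28013", 50),
    ("bars", "M", "36-45", "28009", 20),
    ("bars", "F", "26-35", "28013", 70),
    ("fashion", "M", "36-45", "28001", 40),
]

def build_profile_matrices(rows):
    """matrices[category][(gender, age)][postal code] = number of cards."""
    matrices = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for category, gender, age, postal_code, cards in rows:
        matrices[category][(gender, age)][postal_code] += cards
    return matrices

def top_postal_codes(matrices, category, profile, n=3):
    """Postal codes most visited by a profile for a category, best first."""
    row = matrices[category][profile]
    return sorted(row, key=row.get, reverse=True)[:n]

matrices = build_profile_matrices(rows)
print(top_postal_codes(matrices, "bars", ("M", "36-45")))
```

Reading one row of a matrix answers "where do people like me go?" for that category.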

Page 62: Big Data analytics and models

Example: male, age 36-45

Fashion | Bars and restaurants

Recommender systems :: user demographic profile

Page 63: Big Data analytics and models

Use the CUSTOMER_ZIPCODES service in the BBVA API

Recommender systems :: user geographic profile

Page 64: Big Data analytics and models

•  Use data from the CUSTOMER_ZIPCODES service

•  For each merchant category Z (bars, fashion, health, etc.), we build a matrix in which each entry is the number of different credit cards from postal code X that are used for shopping in postal code Y at merchants of category Z.

Where do people in my district go shopping?
Which restaurants are visited by people living in my district?

Recommender systems :: user geographic profile
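As a sketch of the origin-destination matrix described above, the Python code below builds it for one category and normalizes a row into shares, so destinations can be compared regardless of how many cards an origin has. The row layout and numbers are invented; normalizing to shares is a choice made here, not something stated in the slides.

```python
from collections import defaultdict

# Hypothetical rows:
# (category, home postal code, merchant postal code, number of distinct cards)
od_rows = [
    ("bars", "28045", "28013", 30),
    ("bars", "28045", "28045", 10),
    ("bars", "28009", "28013", 25),
    ("fashion", "28045", "28001", 12),
]

def build_od_matrix(rows, category):
    """matrix[home postal code][merchant postal code] = cards, for one category."""
    matrix = defaultdict(lambda: defaultdict(int))
    for cat, origin, destination, cards in rows:
        if cat == category:
            matrix[origin][destination] += cards
    return matrix

def destination_shares(matrix, origin):
    """Fraction of an origin's cards that go to each destination."""
    row = matrix[origin]
    total = sum(row.values())
    return {dest: cards / total for dest, cards in row.items()} if total else {}

bars = build_od_matrix(od_rows, "bars")
print(destination_shares(bars, "28045"))
```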

Page 65: Big Data analytics and models

Fashion | Bars and restaurants

Example: postal code 28045

Recommender systems :: user geographic profile

Page 66: Big Data analytics and models

Geographical and demographic recommendation system

Recommender systems :: combination
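The slides do not state how the geographic and demographic profiles are combined. One simple possibility, shown here purely as an assumption, is to multiply the per-postal-code shares from each profile and renormalize, so that an area scores highly only if it is popular with both "people like me" and "people from my district". The input shares below are invented.

```python
def combine_shares(demographic, geographic):
    """Multiply the two share vectors on common postal codes and renormalize."""
    common = demographic.keys() & geographic.keys()
    combined = {pc: demographic[pc] * geographic[pc] for pc in common}
    total = sum(combined.values())
    return {pc: v / total for pc, v in combined.items()} if total else {}

def rank(scores):
    """Postal codes ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

# Invented shares for one merchant category.
demographic = {"28013": 0.5, "28009": 0.5}    # e.g. male, age 36-45
geographic = {"28013": 0.75, "28009": 0.25}   # e.g. living in 28045

combined = combine_shares(demographic, geographic)
print(rank(combined), combined)
```

Other combinations, such as a weighted average of the two shares, would be equally easy to try.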

Page 67: Big Data analytics and models

Fashion | Bars and restaurants

Example: male, age 36-45, living in postal code 28045.

Recommender systems :: combination

Page 68: Big Data analytics and models

From the data to the app

Page 69: Big Data analytics and models

From data to the app

1.  The idea.

2.  What data do I need to carry out this idea? Which services of the Challenge API do I need? Can I improve it with other information sources?

3.  Analysis: distilling the idea and assessing its viability. Extracting the hidden value of analytics and models.

4.  How can the user take advantage of this idea?

5.  Iterate steps 2, 3 and 4 until the idea and the benefit to the user become clear.

6.  Turn the value of the analysis into an application.

Page 70: Big Data analytics and models

Esteban Moro, Alejandro Llorente

www.iic.uam.es

@llorentealex   @estebanmoro