searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million...

29
Searching images from the past [email protected]

Upload: others

Post on 20-May-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Searching images from the past

[email protected]

Page 2: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Arquivo.pt

https://arquivo.pt

From Portugal

Publicly available web archive

Research infrastructure

Source code on Github github.com/arquivo

Free

Page 3: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

URL Search

Page 4: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Text Search

Page 5: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Image Searchnew

Page 6: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Image Search - viewer

Page 7: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

6 000 000 000 files

~15% images

900 million searchable images

We need more servers!!!

Image Search - estimate

Page 8: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

17 million searchable images

Unique images within a collection

From 1996 to 2017

Size greater than 50px width and 50px height

Each image has to link to an archived web

page that contains the image.

Image Search - reality

Page 9: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Image Search Workflow

Page 10: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Workflow

1. Create image indexes from ARC/WARC files

2. Image classification

3. Solr indexing

Page 11: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

1. Create image indexes - infrastructure

Hadoop 3 cluster

MongoDB sharded cluster

Page 12: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

1.1 Extract all ARC/WARC image records

store Image Record

--image thumbnail

--image attributes

1.2 Extract all ARC/WARC html records

• Extract <img> tags in each html record

• If image exists in the database

store Image Index

--page URL

--page timestamp

--page title

1. Create image indexes - steps

Page 13: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

2. Image classification - infrastructure

2 Tesla P4 GPUs

Page 14: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

2. Image classification - step

Automatically classify images from step 1

Safe for work:• Value from 0.000 to 1.000

• Greater than 0.500 (considered safe for work)

• Less than 0.500 (may have explicit content)

Add more classifiers in the future

Page 15: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

3. SOLR indexing - infrastructure

2 Servers with Apache Solr

~ 2-3GB of Ram

~ 400 GB of disk space (indexes)

Configure Solr Cloud in a near future

Page 16: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

3. SOLR indexing - step

Index JSON image records

Page 17: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

3. SOLR index – document

Page 18: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Image Search - architecture

Arquivo

Web App

Image Search

API

Solr

Servers

arquivo.pt/api

Page 19: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Image Search API

Page 20: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Image Search API – q parameter

arquivo.pt/imagesearch?q=soccer

Search for images related with word soccer

Page 21: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Image Search API – response header

Page 22: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Imagesearch API – response item

Page 23: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Search for images related with words Euro 2004, in PNG image format.

http://arquivo.pt/imagesearch?q=Euro 2004&type=png

Image Search API – type parameter

Page 24: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Image Search API – size parameter

Search for images, with size smallrelated with words Euro 2004,

size=md (medium image size)size=lg (large image)

http://arquivo.pt/imagesearch?q=Euro 2004&size=sm

Page 25: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Future Work

Page 26: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Future work – scene classification

http://places2.csail.mit.edu/

Page 27: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Future work – mobile version

Page 28: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Future work – mobile version

Page 29: Searching images from the pastnetpreserve.org/ga2019/wp-content/uploads/2019/07/... · 17 million searchable images Unique images within a collection From 1996 to 2017 Size greater

Thank you

Fernando Melo < [email protected] >

Try our APIs and send us feedbackarquivo.pt/api