Searching images from the past
netpreserve.org/ga2019/wp-content/uploads/2019/07/...
TRANSCRIPT
Arquivo.pt
https://arquivo.pt
From Portugal
Publicly available web archive
Research infrastructure
Source code on GitHub: github.com/arquivo
Free
URL Search
Text Search
Image Search (new)
Image Search - viewer
Image Search - estimate
6 000 000 000 files
~15% images
900 million searchable images
We need more servers!!!
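The 900 million figure follows from the two numbers on the slide. A quick check of the arithmetic, assuming "~15%" is taken as exactly 15%:

```python
# Back-of-the-envelope estimate from the slide figures.
total_files = 6_000_000_000   # archived files
image_share_percent = 15      # "~15% images", treated as exactly 15 here

estimated_images = total_files * image_share_percent // 100
print(f"{estimated_images:,}")  # 900,000,000
```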
Image Search - reality
17 million searchable images
Unique images within a collection
From 1996 to 2017
Size greater than 50px width and 50px height
Each image has to link to an archived web page that contains the image.
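The criteria on this slide can be read as a filter over candidate images. The sketch below is only an illustration of that reading: the `ImageRecord` fields, the use of a content digest to decide uniqueness, and the function name are all assumptions, not the Arquivo.pt implementation.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SIDE_PX = 50                     # slide: greater than 50px width and height
FIRST_YEAR, LAST_YEAR = 1996, 2017   # slide: from 1996 to 2017


@dataclass(frozen=True)
class ImageRecord:
    digest: str              # hypothetical content hash used to detect duplicates
    width: int
    height: int
    year: int
    page_url: Optional[str]  # archived page that contains the image, if known


def select_searchable(records):
    """Keep records that meet all criteria, one record per digest."""
    seen = set()
    kept = []
    for rec in records:
        if rec.width <= MIN_SIDE_PX or rec.height <= MIN_SIDE_PX:
            continue  # too small
        if not FIRST_YEAR <= rec.year <= LAST_YEAR:
            continue  # outside the covered period
        if not rec.page_url:
            continue  # no archived page links to this image
        if rec.digest in seen:
            continue  # duplicate within the collection
        seen.add(rec.digest)
        kept.append(rec)
    return kept
```

Filtering on size and page link before deduplication means the first qualifying copy of an image is the one that survives.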
Image Search Workflow
Workflow
1. Create image indexes from ARC/WARC files
2. Image classification
3. Solr indexing
1. Create image indexes - infrastructure
Hadoop 3 cluster
MongoDB sharded cluster
1. Create image indexes - steps
1.1 Extract all ARC/WARC image records
• Store Image Record
--image thumbnail
--image attributes
1.2 Extract all ARC/WARC HTML records
• Extract <img> tags in each HTML record
• If the image exists in the database, store Image Index
--page URL
--page timestamp
--page title
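Step 1.2 can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: plain dictionaries stand in for the MongoDB collections, images are matched by their resolved URL, and the names `PageParser` and `index_page` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collect <img> src attributes and the page title from one HTML record."""

    def __init__(self):
        super().__init__()
        self.sources = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def index_page(page_url, timestamp, html, image_records, image_index):
    """Store an Image Index entry for each <img> already known as an Image Record."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    for src in parser.sources:
        image_url = urljoin(page_url, src)  # resolve relative src values
        if image_url in image_records:      # "if image exists in the database"
            image_index.setdefault(image_url, []).append({
                "page_url": page_url,
                "page_timestamp": timestamp,
                "page_title": parser.title.strip(),
            })
```

Running step 1.1 first matters here: an `<img>` reference is only indexed when its image record was already stored.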
2. Image classification - infrastructure
2 Tesla P4 GPUs
2. Image classification - step
Automatically classify images from step 1
Safe for work:
• Value from 0.000 to 1.000
• Greater than 0.500 (considered safe for work)
• Less than 0.500 (may have explicit content)
Add more classifiers in the future
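The threshold rule above is simple enough to state as code. In this sketch the function name is hypothetical, and treating a score of exactly 0.500 as not safe is a conservative assumption, since the slide does not say which side that value falls on.

```python
SAFE_THRESHOLD = 0.5  # from the slide: greater than 0.500 is considered safe for work


def is_safe_for_work(score: float) -> bool:
    """Return True when the classifier score marks the image as safe for work."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    # Assumption: exactly 0.5 is treated as possibly explicit.
    return score > SAFE_THRESHOLD
```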
3. Solr indexing - infrastructure
2 Servers with Apache Solr
~2-3 GB of RAM
~400 GB of disk space (indexes)
Configure SolrCloud in the near future
3. Solr indexing - step
Index JSON image records
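As a rough illustration of this step, the sketch below builds (but does not send) a request for Solr's JSON update handler, which accepts a JSON array of documents. The host, the core name `images`, and every field name in the sample record are assumptions; the actual document layout is the one shown on the slides.

```python
import json
from urllib.request import Request


def build_solr_update(solr_base_url, core, records):
    """Build a POST request that adds JSON records to a Solr core and commits."""
    url = f"{solr_base_url.rstrip('/')}/{core}/update?commit=true"
    body = json.dumps(records).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical image record; field names are placeholders.
sample_records = [{
    "id": "example-image-1",
    "imgSrc": "http://example.pt/img/logo.png",
    "pageURL": "http://example.pt/index.html",
    "pageTstamp": "20040612000000",
    "pageTitle": "Home",
    "safe": 0.98,
}]

request = build_solr_update("http://localhost:8983/solr", "images", sample_records)
# Sending it would be: urllib.request.urlopen(request)
```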
3. Solr index – document
Image Search - architecture
Arquivo Web App
Image Search API
Solr Servers
arquivo.pt/api
Image Search API
Image Search API – q parameter
arquivo.pt/imagesearch?q=soccer
Search for images related to the word soccer
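A query URL like the one above can be built programmatically. This sketch only constructs the URL; the `https` scheme and the helper name are assumptions, and the shape of the response is not modelled here.

```python
from urllib.parse import urlencode

IMAGE_SEARCH_ENDPOINT = "https://arquivo.pt/imagesearch"  # scheme assumed


def image_search_url(query):
    """Return the image search URL for a free-text query."""
    return f"{IMAGE_SEARCH_ENDPOINT}?{urlencode({'q': query})}"


print(image_search_url("soccer"))  # https://arquivo.pt/imagesearch?q=soccer
```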
Image Search API – response header
Image Search API – response item
Image Search API – type parameter
Search for images related to the words Euro 2004, in PNG image format.
http://arquivo.pt/imagesearch?q=Euro 2004&type=png
Image Search API – size parameter
Search for small images related to the words Euro 2004.
http://arquivo.pt/imagesearch?q=Euro 2004&size=sm
Other values: size=md (medium image size), size=lg (large image)
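The `type` and `size` parameters combine with `q` in the same query string. The sketch below is a hypothetical helper that validates `size` against the three values named on the slide and URL-encodes the query, so the space in "Euro 2004" becomes `+`. It does not check `type` values, since the slides only show `png`.

```python
from urllib.parse import urlencode

IMAGE_SEARCH_ENDPOINT = "http://arquivo.pt/imagesearch"
ALLOWED_SIZES = {"sm", "md", "lg"}  # small, medium, large (from the slide)


def image_search_url(query, image_type=None, size=None):
    """Build an image search URL with optional type and size filters."""
    if size is not None and size not in ALLOWED_SIZES:
        raise ValueError(f"size must be one of {sorted(ALLOWED_SIZES)}")
    params = {"q": query}
    if image_type:
        params["type"] = image_type
    if size:
        params["size"] = size
    return f"{IMAGE_SEARCH_ENDPOINT}?{urlencode(params)}"


print(image_search_url("Euro 2004", image_type="png"))
print(image_search_url("Euro 2004", size="sm"))
```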
Future Work
Future work – mobile version