oncrawl elasticsearch meetup france #12
![Page 1: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/1.jpg)
Elasticsearch + Oncrawl =
<3
A SaaS SEO Monitoring solution by
Presentation by Tanguy Moal (@tuxnco)
Meetup Elasticsearch Paris #12
2015/01/22
![Page 2: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/2.jpg)
22/01/15 Oncrawl · Elasticsearch Meetup France #12 2
[tuxnco@hal]:/opt$ whoami
- age: 0x20
- kids: 0x02
- hobbies:
  - tech founder & cto at cogniteev
  - search, natural language processing, datamining
  - misc.
- history:
  - r&d engineer @ exalead
  - r&d engineer @ jobijoba
![Page 3: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/3.jpg)
Presentation plan
Introduction to Oncrawl
Oncrawl technical overview
hadoop-elasticsearch within Oncrawl
Oncrawl API
Scaling Oncrawl infrastructure with Saltstack.
Conclusion / Questions
![Page 4: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/4.jpg)
Introduction
![Page 5: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/5.jpg)
Oncrawl: SEO Monitoring
- SEO Game has changed:
- Websites are getting bigger, harder to maintain
- Several indicators to monitor
- SaaS to the rescue (Moz, Ranks, Majestic SEO, Botify, Deepcrawl, …)
![Page 6: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/6.jpg)
Oncrawl: SEO Monitoring
- Analysis performed through crawl reports
- SEO monitoring follows 5 axes:
  - Performance
  - HTML quality
  - Inlinks
  - Outlinks
  - Content
- Interactive Analysis (URL explorer)
- Planned: crawl-over-crawl trend spotting
![Page 7: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/7.jpg)
Oncrawl: Pricing
![Page 8: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/8.jpg)
Oncrawl: technical overview
![Page 9: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/9.jpg)
Oncrawl: application architecture
![Page 10: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/10.jpg)
Boom.
Boom2.
![Page 11: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/11.jpg)
Application scenario
- User has a plan and configured projects
- Plan grants privileges
- Used to allow project creation and triggering of crawls
- Each project may have associated crawls
- Each crawl contains a report
What data are involved in a crawl report?
![Page 12: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/12.jpg)
Links
- Important piece in serious SEO campaigns
- Key fields:
  - origin, origin_domain, origin_depth
  - target, target_domain, target_depth
  - context:
    - position in origin page
    - anchor text
    - wraps significant tags (hn, img, …)
- Use cases:
  - list outlinks (resp. inlinks) of a given page
  - distinguish links used to go up (resp. down) the site’s tree
  - anchor text analysis, …
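The link use cases can be illustrated with a minimal in-memory sketch. It reuses the field names from the slide; the helper names (`make_link`, `direction`, `outlinks`, `inlinks`) are hypothetical, and it assumes that depth counts the distance from the home page, so a link whose target is shallower than its origin goes "up" the tree. This is an illustration, not Oncrawl's implementation.

```python
from urllib.parse import urlparse


def make_link(origin, origin_depth, target, target_depth, anchor_text=""):
    """Build a link record carrying the key fields listed on the slide."""
    return {
        "origin": origin,
        "origin_domain": urlparse(origin).netloc,
        "origin_depth": origin_depth,
        "target": target,
        "target_domain": urlparse(target).netloc,
        "target_depth": target_depth,
        "context": {"anchor_text": anchor_text},
    }


def direction(link):
    """Classify a link as going 'up', 'down' or staying at the 'same' depth.

    Assumption: a smaller depth means closer to the home page.
    """
    if link["target_depth"] < link["origin_depth"]:
        return "up"
    if link["target_depth"] > link["origin_depth"]:
        return "down"
    return "same"


def outlinks(links, url):
    """Links leaving the given page."""
    return [link for link in links if link["origin"] == url]


def inlinks(links, url):
    """Links pointing to the given page."""
    return [link for link in links if link["target"] == url]
```

In a search index the same lookups would presumably become exact-match filters on `origin` or `target`; the list comprehensions above only stand in for those queries.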
![Page 13: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/13.jpg)
Page model
- Key fields
  - url
  - domain
  - hash
  - fetch
    - date, size, time
    - HTTP headers
    - HTTP status code | ignored (robots.txt|settings)
  - parse
    - title, hn, metas
    - canonical
  - seo
    - depth, popularity, total inlinks
    - outlinks breakdown (internal vs external, follow vs nofollow)
    - word count, text to code ratio, duplicated fields, simhash
- Use cases
  - stats on size/fetch time/status code, by depth or for pages matching any combination of criteria
  - find pages with highest similarity to a given one
  - find pages with duplicated properties (title, hn, …)
- The central piece of the puzzle. Wraps all metadata relating to a given URL
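One way to read the "highest similarity" use case is to compare the `simhash` field of pages: two simhash fingerprints of near-duplicate content differ in only a few bits, so the Hamming distance between them can serve as a dissimilarity score. The sketch below assumes fingerprints are stored as integers; the slides do not say how Oncrawl actually ranks similar pages, so `most_similar` is a hypothetical helper.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def most_similar(pages, reference_url, limit=3):
    """Rank the other pages by increasing Hamming distance to the reference.

    `pages` is a list of dicts with at least `url` and `simhash` keys.
    Returns (url, distance) pairs, closest first.
    """
    reference = next(p for p in pages if p["url"] == reference_url)
    scored = [
        (p["url"], hamming_distance(p["simhash"], reference["simhash"]))
        for p in pages
        if p["url"] != reference_url
    ]
    return sorted(scored, key=lambda pair: pair[1])[:limit]
```

A linear scan like this is only practical for small crawls; it is meant to show the comparison itself, not how it would be run at scale.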
![Page 14: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/14.jpg)
Hadoop & Elasticsearch.
![Page 15: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/15.jpg)
Elasticsearch for Hadoop
- references
  - overview: http://www.elasticsearch.org/overview/hadoop/
  - online documentation: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/index.html
- github
  - repo: https://github.com/elasticsearch/elasticsearch-hadoop
  - author: https://github.com/costin
- features
  - compatibility
  - simplicity
  - low footprint
  - flexible
![Page 16: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/16.jpg)
Oncrawl: hadoop-elasticsearch
- Apache Nutch (v1.x) uses HDFS (v2.x supports several storage backends through Apache Gora -- including elasticsearch -- but…)
- Stacked different custom hadoop jobs to compute Oncrawl’s custom attributes (duplicates, …)
- What about Apache Nutch’s ESIndexer?
- hadoop-elasticsearch does the job pretty well
- Relies on job’s configuration:
  - es.resource(.read|.write)? : « index/type » (supports “late” type routing from fields in collected output, e.g. « my_index/{some_field} »)
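To make the "late" type routing concrete, here is a small Python sketch of how a resource pattern such as `my_index/{some_field}` could be resolved against each collected document. It only mimics the idea for illustration; `resolve_resource` is a hypothetical helper and not the connector's own code.

```python
import re

# Matches {field} placeholders inside a resource pattern.
_PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def resolve_resource(pattern, document):
    """Replace each {field} placeholder with the document's value for it.

    Raises KeyError when the document lacks a field the pattern needs.
    """
    def substitute(match):
        field = match.group(1)
        if field not in document:
            raise KeyError(
                "document has no field %r required by %r" % (field, pattern)
            )
        return str(document[field])

    return _PLACEHOLDER.sub(substitute, pattern)
```

With this reading, a single job output can feed several types: a document with `some_field` set to `page` would go to `my_index/page`, and one with `link` to `my_index/link`.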
![Page 17: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/17.jpg)
Oncrawl: hadoop-elasticsearch
• Reading from elasticsearch
  – job.setInputFormat(EsInputFormat.class);
• Writing to elasticsearch
  – job.setOutputFormat(EsOutputFormat.class);
  – Map<Object, Object> value = new LinkedHashMap<Object, Object>();
  – collector.collect(key, WritableUtils.toWritable(value));
| Read \ Write | HDFS | Elasticsearch |
|---|---|---|
| HDFS | builtin | yes |
| Elasticsearch | yes | yes |
![Page 18: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/18.jpg)
Elasticsearch & Python
![Page 19: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/19.jpg)
Oncrawl API
• Python / Flask:
  – Lightweight
  – Easy to deploy / mirror
  – Clean syntax
• elasticsearch python client:
  – simple API
  – allows for fine tuning of the client (HTTP connection parameters, …)
• API’s mission: populate the graphs of the application’s reports
![Page 20: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/20.jpg)
Oncrawl API
- Each graph on the app has a dedicated API endpoint
- Binds graph semantics to an elasticsearch query. Returns JSON data ready for rendering (d3.js, …)
- Example: Summary of page load times
- 4 buckets:
  - perfect (under 500ms)
  - medium (between 500ms and 1000ms)
  - slow (between 1000ms and 2000ms)
  - too slow (beyond 2000ms)
- Expected output by plotting library:
![Page 21: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/21.jpg)
Oncrawl API
- Queries are easy to compose using Python
- Write & test them in Marvel
- Integrate them in the Flask API
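As a sketch of what composing such a query in Python might look like, the page load time summary (perfect / medium / slow / too slow) maps naturally onto an Elasticsearch range aggregation, where `from` is inclusive and `to` is exclusive. The field name `fetch.time`, the aggregation name `load_time` and the output shape are assumptions made for illustration; the sketch also assumes the Elasticsearch version in use accepts a custom `key` on each range.

```python
# Bucket boundaries in milliseconds, as listed for the load time summary.
LOAD_TIME_RANGES = [
    {"key": "perfect", "to": 500},
    {"key": "medium", "from": 500, "to": 1000},
    {"key": "slow", "from": 1000, "to": 2000},
    {"key": "too slow", "from": 2000},
]


def load_time_query(field="fetch.time"):
    """Build a search body that returns only the load time buckets."""
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggs": {
            "load_time": {
                "range": {"field": field, "ranges": LOAD_TIME_RANGES}
            }
        },
    }


def to_chart_data(response):
    """Turn a search response into label/value rows for a plotting library."""
    buckets = response["aggregations"]["load_time"]["buckets"]
    return [{"label": b["key"], "value": b["doc_count"]} for b in buckets]
```

An endpoint would then send `load_time_query()` as the search body against the crawl's index through the elasticsearch python client and return `to_chart_data(response)` as JSON.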
![Page 22: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/22.jpg)
Elastic: Scale it
May I have the salt, please?
![Page 23: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/23.jpg)
Oncrawl scalability constraints
- 1 index per crawl
- size of indices? S-M-L-XL
- sharding policy:
  - S: 1 shard
  - M: 3 shards
  - L: 5 shards
  - XL: 10 shards
- Hadoop cluster management
  - Provisioned for a given number of concurrent crawl cycles
  - HDFS grows with total clients
- Elasticsearch cluster management
  - Build: same provision as hadoop cluster
  - Storage / service:
    - provisioned for 3 months of subscription
  - Old indices:
    - close & snapshot
    - reopen on demand
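The sharding policy can be expressed as a small lookup used when a crawl's index is created. In this sketch the shard counts come from the slide, while the `crawl-<id>` naming scheme and the helper names are hypothetical; how a crawl gets assigned to S, M, L or XL is not stated in the slides, so the size class is simply taken as input.

```python
# Shard count per index size class, as given in the sharding policy.
SHARDS_BY_SIZE = {"S": 1, "M": 3, "L": 5, "XL": 10}


def index_name(crawl_id):
    """One index per crawl. The 'crawl-' prefix is a hypothetical convention."""
    return "crawl-%s" % crawl_id


def index_settings(size_class):
    """Build the index creation settings for a given size class."""
    if size_class not in SHARDS_BY_SIZE:
        raise ValueError("unknown size class: %r" % (size_class,))
    return {
        "settings": {
            "index": {"number_of_shards": SHARDS_BY_SIZE[size_class]}
        }
    }
```

Since the number of primary shards is fixed when an index is created, choosing the size class before indexing starts matters more than it would for settings that can be changed later.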
![Page 24: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/24.jpg)
Saltstack
• Cluster with members having roles: master vs minions
• Each minion can be fully administered through the master
• Minions ask the master for enrollment
• The administrator on the master can either accept or decline minions
• Once a minion is accepted, it can be fully operated remotely
![Page 25: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/25.jpg)
Saltstack
• A set of « recipes » defines what states are made of, and how to get there
• Recipes can use « jinja » templating so variable parts of configuration files can be rendered at deployment time
• Minions can have their role defined by several means:
  – grains defined on the minion
  – deployment specific rules, defined in « the pillar »
• Within Oncrawl, saltstack is used:
  – To maintain index templates (config/templates/*json)
  – To maintain elasticsearch clusters, nodes and shard allocation (config/settings.yml)
  – To deploy the elasticsearch cluster, the hadoop cluster, staging and prod servers
• Deploy anything, anywhere (Droplets @ Digital Ocean, VMs @ Vultr, Instances @ AWS, dedicated servers @ OVH)
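The idea of rendering the variable parts of a configuration file at deployment time can be sketched in plain Python. Salt itself uses Jinja for this; to stay dependency-free the sketch swaps in the standard library's `string.Template`, so it shows the principle rather than Salt's syntax. The role names (`es-master`, `es-data`), the `roles` grain and the `cluster_name` pillar key are assumptions made for illustration.

```python
from string import Template

# Stand-in for a templated elasticsearch.yml; Salt would use Jinja here.
ELASTICSEARCH_YML = Template(
    "cluster.name: $cluster_name\n"
    "node.name: $node_name\n"
    "node.master: $is_master\n"
    "node.data: $is_data\n"
)


def render_config(minion_id, grains, pillar):
    """Render a node's configuration from its grains and the pillar.

    `grains` describes the minion (here: its roles), `pillar` carries
    deployment specific values (here: the cluster name).
    """
    roles = grains.get("roles", [])
    return ELASTICSEARCH_YML.substitute(
        cluster_name=pillar["cluster_name"],
        node_name=minion_id,
        is_master=str("es-master" in roles).lower(),
        is_data=str("es-data" in roles).lower(),
    )
```

The same template then yields a different file for each minion, which is what lets one set of recipes drive machines hosted at several providers.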
![Page 26: Oncrawl elasticsearch meetup france #12](https://reader036.vdocuments.mx/reader036/viewer/2022062419/55a77d0b1a28abce668b47bf/html5/thumbnails/26.jpg)
Thank you!
Follow us:
@tuxnco (me)
@cogniteev (company)
@oncrawl (product)
Part of the gang
Any questions?