mapping french open data actors on the web with common crawl

Mapping french Open Data actors on the web with Common Crawl [email protected] @glebourg

Upload: data-publica

Post on 18-Dec-2014


Page 1: Mapping french open data actors on the web with common crawl

Mapping french Open Data actors on the web with Common Crawl [email protected] @glebourg

Page 2: Mapping french open data actors on the web with common crawl

Mining the Web at Data Publica

Different needs, different techniques
● Scraping
● Focused crawling
● Prospective crawling

Page 3: Mapping french open data actors on the web with common crawl

Mining the Web at Data Publica

Scraping
● Identified resources
● Configured extractors
● Structured content
● Not scalable

Page 4: Mapping french open data actors on the web with common crawl

Mining the Web at Data Publica

Focused crawling
● Identified entities
● Fuzzy extraction
● Structured content using text-mining
● Scalable
● Useful to get meta information on known entities

Page 5: Mapping french open data actors on the web with common crawl

Mining the Web at Data Publica

Prospective crawling
● No starting point
● Fuzzy extraction
● Structured content using text-mining
● Very hard to scale
● Heavy resources needed: CPU, RAM, HDD

Using a third party makes your life easier!

Page 6: Mapping french open data actors on the web with common crawl

From a crawl to a map

Goal: build a map of the French open data actors on the web

● As a graph
● Showing websites

Page 7: Mapping french open data actors on the web with common crawl

From a crawl to a map

Using Common Crawl
● Large web crawl archives, fully accessible
● Good coverage of the French web
● Easy access via AWS / MapReduce jobs

Page 8: Mapping french open data actors on the web with common crawl

From a crawl to a map

Working on the French web
● The .fr TLD is not a relevant criterion for detecting French websites
● Detecting page language
● Giving websites a "frenchness" score
  ○ Sw = number of French pages / total number of pages
  ○ Cutoff chosen manually by testing on French websites
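
A minimal Python sketch of the "frenchness" score described above, assuming each page has already been labelled with a language code by some upstream detector (the slides do not say which one). The function names, the sample hosts and the 0.5 cutoff are illustrative assumptions, not values from the talk.

```python
from collections import defaultdict


def frenchness_scores(pages):
    """Return {website: share of its pages detected as French}.

    `pages` is an iterable of (website, language_code) pairs. The language
    codes are assumed to come from an upstream language detector.
    """
    french = defaultdict(int)
    total = defaultdict(int)
    for website, language in pages:
        total[website] += 1
        if language == "fr":
            french[website] += 1
    return {site: french[site] / total[site] for site in total}


def french_websites(pages, cutoff=0.5):
    """Keep websites whose score reaches the cutoff.

    cutoff=0.5 is a placeholder; the talk only says the cutoff was chosen
    manually by testing on French websites.
    """
    scores = frenchness_scores(pages)
    return {site for site, score in scores.items() if score >= cutoff}


# Made-up sample data: (website, detected language of one page)
pages = [
    ("example.fr", "fr"), ("example.fr", "fr"), ("example.fr", "en"),
    ("example.com", "en"), ("example.com", "fr"), ("example.com", "en"),
    ("example.org", "fr"),
]
scores = frenchness_scores(pages)
selected = french_websites(pages)
```

With this sample, a .org host made only of French pages is kept while a mostly English host is dropped, which is the point of scoring by language rather than by TLD.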

Page 9: Mapping french open data actors on the web with common crawl

From a crawl to a map

Working on Open Data websites
● Building an Open Data "vocabulary"
● Detecting whether a page talks about Open Data
● Giving websites an "opendataness" score
  ○ Sw = number of Open Data pages / total number of pages
  ○ Cutoff chosen manually by testing on Open Data websites
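
The vocabulary-based scoring above could look roughly like the following Python sketch. The vocabulary terms, the `min_hits` threshold and the sample pages are assumptions made for illustration; the slides do not list the terms that were actually used.

```python
import re

# Hypothetical vocabulary; the talk does not list the terms actually used.
OPEN_DATA_VOCABULARY = {
    "open data",
    "opendata",
    "données ouvertes",
    "données publiques",
}


def is_open_data_page(text, vocabulary=OPEN_DATA_VOCABULARY, min_hits=1):
    """Return True when at least `min_hits` vocabulary terms occur in the text."""
    # Lowercase and collapse whitespace so multi-word terms match reliably.
    normalized = re.sub(r"\s+", " ", text.lower())
    hits = sum(1 for term in vocabulary if term in normalized)
    return hits >= min_hits


def opendataness(page_texts):
    """Sw = number of Open Data pages / total number of pages for one website."""
    if not page_texts:
        return 0.0
    matching = sum(1 for text in page_texts if is_open_data_page(text))
    return matching / len(page_texts)


# Made-up page texts for a single website
site_pages = [
    "Portail Open Data de la ville",
    "Nos données   ouvertes sont disponibles ici",
    "Mentions légales",
    "Contact",
]
score = opendataness(site_pages)
```

Simple substring matching is only one possible detector; a real pipeline might weight terms or require several hits per page.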

Page 10: Mapping french open data actors on the web with common crawl

From a crawl to a map

Building graph
● Inside our subset
  ○ Inlinks
  ○ Outlinks
● Generating two files
  ○ nodes.csv (list of websites with an id)
  ○ edges.csv (directed links between websites)

[Figure: node A with two inlinks and one outlink]
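
As a rough illustration of the two files mentioned on this slide, the Python sketch below restricts a list of website-to-website links to the selected subset and writes a node table and an edge table. The column names ("Id", "Label", "Source", "Target") follow the headers Gephi's spreadsheet import commonly expects, but treat them, the sample hosts, and the choice to drop self-links and duplicates as assumptions.

```python
import csv
import io


def build_graph_rows(links, subset):
    """Keep only links whose source and target are both in `subset`.

    Returns ({website: id}, sorted list of unique (source_id, target_id)).
    Dropping self-links and duplicate links is an assumption of this sketch.
    """
    node_ids = {site: i for i, site in enumerate(sorted(subset))}
    edges = sorted({
        (node_ids[src], node_ids[dst])
        for src, dst in links
        if src in node_ids and dst in node_ids and src != dst
    })
    return node_ids, edges


def write_csv(handle, header, rows):
    """Write a header line followed by the rows to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(rows)


# Made-up sample data
subset = {"a.example", "b.example", "c.example"}
links = [
    ("a.example", "b.example"),
    ("a.example", "b.example"),        # duplicate link, collapsed
    ("b.example", "c.example"),
    ("c.example", "outside.example"),  # leaves the subset, dropped
]
node_ids, edges = build_graph_rows(links, subset)

# In-memory buffers stand in for nodes.csv and edges.csv.
nodes_csv, edges_csv = io.StringIO(), io.StringIO()
write_csv(nodes_csv, ["Id", "Label"], [(i, site) for site, i in node_ids.items()])
write_csv(edges_csv, ["Source", "Target"], edges)
```

To produce real files, the buffers can be replaced with `open("nodes.csv", "w", newline="")` and `open("edges.csv", "w", newline="")`.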

Page 11: Mapping french open data actors on the web with common crawl

From a crawl to a map

Building graph
● Links tell a lot about websites
  ○ Authorities
  ○ Hubs
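
One common way to quantify authorities and hubs is Kleinberg's HITS algorithm. The slides do not say whether HITS was used here (a later slide simply sizes nodes by inlink count), so the Python sketch below is only an illustration of the idea, with made-up site names and an arbitrary iteration count.

```python
from math import sqrt


def hits(edges, iterations=50):
    """Compute HITS hub and authority scores for a directed edge list.

    `edges` is a list of (source, target) pairs. Scores are L2-normalized
    after every update so they stay bounded.
    """
    nodes = {node for edge in edges for node in edge}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the sites linking in.
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {node: v / norm for node, v in new_auth.items()}
        # Hub update: sum the authority scores of the sites linked to.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {node: v / norm for node, v in new_hub.items()}
    return hub, auth


# Made-up example: "a" links out the most, "c" is linked to the most.
edges = [("a", "c"), ("b", "c"), ("a", "d")]
hub, auth = hits(edges)
top_hub = max(hub, key=hub.get)
top_authority = max(auth, key=auth.get)
```

In this toy graph the site with the most outlinks to well-cited sites comes out as the top hub, and the most cited site as the top authority.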

Page 12: Mapping french open data actors on the web with common crawl

From a crawl to a map

Visualizing graph using Gephi
● Load graph
● Spatialize graph
  ○ links between websites create "attraction", which pulls linked websites close to each other
  ○ the more inlinks, the bigger the node (= authority)
  ○ categorizing websites for better understanding (one color per category)
    ■ Companies, Non-profits/blogs, Government agencies
  ○ communities can now appear!
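
The sizing and coloring rules above are applied inside Gephi in the talk; the Python sketch below only mirrors the logic outside of Gephi, so that it is explicit. The category keys, the hex colors and the `base_size` parameter are hypothetical.

```python
from collections import Counter

# Hypothetical palette: one color per category named on the slide.
CATEGORY_COLORS = {
    "company": "#1f77b4",
    "non_profit": "#2ca02c",
    "government": "#d62728",
}


def node_styles(edges, categories, base_size=1.0):
    """Return {website: {"size": ..., "color": ...}}.

    Size grows with the number of inlinks (the "authority" reading used on
    the slide); color depends only on the website's category.
    """
    indegree = Counter(dst for _, dst in edges)
    return {
        site: {
            "size": base_size + indegree[site],
            "color": CATEGORY_COLORS[category],
        }
        for site, category in categories.items()
    }


# Made-up sample data
edges = [("blog.example", "gov.example"), ("corp.example", "gov.example")]
categories = {
    "blog.example": "non_profit",
    "corp.example": "company",
    "gov.example": "government",
}
styles = node_styles(edges, categories)
```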

Page 13: Mapping french open data actors on the web with common crawl

From a crawl to a map

Page 14: Mapping french open data actors on the web with common crawl

From a crawl to a map

Visualizing graph on the web
● Sigma.js
● Uses Gephi files
● Gives better interactivity

Page 15: Mapping french open data actors on the web with common crawl

Analyze

● The final graph is a good way to understand interactions between actors
  ○ Open Data was clearly initiated by a non-profit movement
  ○ Companies are beginning to work on the subject
  ○ The French state has only had sporadic initiatives so far
● This graph will be generated again in the near future, to see changes in this ecosystem

Page 16: Mapping french open data actors on the web with common crawl

Results

● Large scale crawl made easy
  ○ Easy to focus on mining the results instead of finding/storing the data
● Nice workflow from raw data to an understandable visualisation
● The final graph is a good way to understand interactions between actors

Page 17: Mapping french open data actors on the web with common crawl

Feedback

● Common Crawl
  ○ Common Crawl doesn't have an exhaustive crawl of the French web for now
  ○ Data is not as fresh as it could be
  ○ It is missing an index to access at least domains, and maybe pages, in O(1)
● Methodology
  ○ Opendataness scoring can exclude some websites that are relevant but not focused enough on open data
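
To make the "index in O(1)" remark concrete, here is a small Python sketch of what such a lookup structure could be: a hash map from host name to the locations of its records, built in one pass over the crawl. The record layout (URL, archive file name, byte offset) and the sample values are hypothetical and do not describe Common Crawl's actual file format.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_domain_index(records):
    """Map each host name to the list of (archive_file, offset) of its pages.

    `records` is an iterable of (url, archive_file, offset) tuples; this
    layout is an assumption made for the example.
    """
    index = defaultdict(list)
    for url, archive_file, offset in records:
        host = urlsplit(url).hostname
        if host:
            index[host].append((archive_file, offset))
    return dict(index)


# Made-up sample records
records = [
    ("http://www.example.fr/a", "segment-0001", 0),
    ("http://www.example.fr/b", "segment-0001", 2048),
    ("http://example.org/", "segment-0002", 512),
]
index = build_domain_index(records)

# Average-case O(1) lookup by host, instead of scanning the whole crawl.
locations = index.get("www.example.fr", [])
```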

Page 18: Mapping french open data actors on the web with common crawl

Resources

● http://webatlas.fr/tempshare/OpenDataActeursTypes.pdf
  ○ poster by Franck Ghitalla
● http://french-opendata.data-publica.com/index.html
  ○ dynamic visualisation of the results, by Data Publica
● http://fr.slideshare.net/willounet/a-sneak-peek-into-the-web-presentation
  ○ A sneak peek into the web, by GL
● http://french-opendata.data-publica.com/
  ○ Project host page

Page 19: Mapping french open data actors on the web with common crawl

Mapping french Open Data actors on the web with Common Crawl [email protected] @glebourg