Mapping French Open Data actors on the web with Common Crawl

Post on 18-Dec-2014

Mapping French Open Data actors on the web with Common Crawl
guillaume.lebourgeois@data-publica.com
@glebourg

Mining the Web at Data Publica

Different needs, different techniques
● Scraping
● Focused crawling
● Prospective crawling

Scraping
● Identified resources
● Configured extractors
● Structured content
● Not scalable

Focused crawling
● Identified entities
● Fuzzy extraction
● Structured content using text-mining
● Scalable
● Useful to get meta information on known entities

Prospective crawling
● No starting point
● Fuzzy extraction
● Structured content using text-mining
● Very hard to scale
● Heavy resources needed: CPU, RAM, HDD

Using a third party makes your life easier!

From a crawl to a map

Goal: build a map of the French Open Data actors on the web
● As a graph
● Showing websites

Using Common Crawl
● Large web crawl archives, fully accessible
● Good coverage of the French web
● Easy access via AWS / MapReduce jobs
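The MapReduce access pattern can be illustrated locally in plain Python. This is only a sketch under assumptions: the input (a list of page URLs) and the names `map_page` and `reduce_counts` are hypothetical and do not reproduce the actual job run on the Common Crawl archives.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_page(url):
    """Map step: emit (host, 1) for one crawled page URL."""
    host = urlparse(url).netloc.lower()
    return (host, 1)


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per host."""
    totals = defaultdict(int)
    for host, count in pairs:
        totals[host] += count
    return dict(totals)


# Hypothetical sample of crawled URLs, standing in for real crawl records.
sample_urls = [
    "http://example.fr/a",
    "http://example.fr/b",
    "http://example.org/",
]
pages_per_host = reduce_counts(map_page(u) for u in sample_urls)
```

The per-host page totals produced this way are the kind of denominator a per-website score needs.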

Working on the French web
● The .fr TLD is not a relevant criterion for detection
● Detecting page language
● Giving websites a "frenchness" score
  ○ Sw = number of French pages / total number of pages
  ○ Cutoff chosen manually by testing on French websites
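The "frenchness" score can be sketched as follows. This is a minimal illustration, assuming each page has already been labelled with a language code by some upstream detector; the function names, the sample pages and the cutoff value of 0.5 are hypothetical, since the slides do not give the cutoff that was actually chosen.

```python
from collections import Counter


def frenchness_scores(pages):
    """Return Sw = French pages / total pages for each website.

    `pages` is an iterable of (website, language_code) pairs; the language
    code is assumed to come from an upstream language detector.
    """
    total = Counter()
    french = Counter()
    for website, lang in pages:
        total[website] += 1
        if lang == "fr":
            french[website] += 1
    return {site: french[site] / total[site] for site in total}


def french_websites(scores, cutoff):
    """Keep websites whose score reaches the manually chosen cutoff."""
    return sorted(site for site, score in scores.items() if score >= cutoff)


# Hypothetical pages and cutoff, for illustration only.
pages = [
    ("data.example.fr", "fr"),
    ("data.example.fr", "fr"),
    ("data.example.fr", "en"),
    ("blog.example.com", "en"),
    ("blog.example.com", "fr"),
    ("blog.example.com", "en"),
    ("blog.example.com", "en"),
]
scores = frenchness_scores(pages)
selected = french_websites(scores, cutoff=0.5)
```

Because the score depends only on page language, a French-language site on a .com or .org domain is kept just like one on .fr.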

Working on Open Data websites
● Building an Open Data "vocabulary"
● Detecting whether a page speaks about Open Data
● Giving websites an "opendataness" score
  ○ Sw = number of Open Data pages / total number of pages
  ○ Cutoff chosen manually by testing on Open Data websites
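A vocabulary-based version of the "opendataness" score might look like the sketch below. The term list, the `min_hits` threshold and the function names are assumptions made for illustration; the slides do not publish the real vocabulary or the rule used to decide that a page speaks about Open Data.

```python
import re

# Hypothetical vocabulary; the real term list is not given in the slides.
OPEN_DATA_VOCABULARY = {"open data", "opendata", "data.gouv", "licence ouverte"}


def speaks_about_open_data(text, vocabulary=OPEN_DATA_VOCABULARY, min_hits=1):
    """Return True if at least `min_hits` vocabulary terms occur in the text."""
    normalized = re.sub(r"\s+", " ", text.lower())
    hits = sum(1 for term in vocabulary if term in normalized)
    return hits >= min_hits


def opendataness(page_texts):
    """Return Sw = Open Data pages / total pages for one website."""
    if not page_texts:
        return 0.0
    matching = sum(1 for text in page_texts if speaks_about_open_data(text))
    return matching / len(page_texts)


# Hypothetical page texts for a single website.
site_pages = [
    "Portail Open Data de la ville",
    "Plan du site",
    "Jeux de donnees sous   licence ouverte",
    "Contact",
]
score = opendataness(site_pages)
```

Raising `min_hits` makes the page test stricter, which trades recall for precision in the same way as raising the website cutoff.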

Building graph
● Inside our subset
  ○ Inlinks
  ○ Outlinks
● Generating two files
  ○ nodes.csv (list of websites with an id)
  ○ edges.csv (directed links between websites)

[Diagram: node A with two inlinks and one outlink]
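Generating the two files can be sketched like this. It is an illustration under assumptions: links are given as (source URL, target URL) pairs, websites are identified by host name, and the column names (Id/Label, Source/Target) follow the convention commonly used for Gephi spreadsheet import rather than a layout stated in the slides. The example writes to in-memory buffers; real code would open nodes.csv and edges.csv instead.

```python
import csv
import io
from urllib.parse import urlparse


def build_graph(links, subset):
    """Keep only links whose two endpoints are websites of the subset.

    Returns (nodes, edges): nodes maps website -> integer id, edges is a
    sorted list of directed (source_id, target_id) pairs without duplicates.
    """
    nodes = {site: i for i, site in enumerate(sorted(subset))}
    edges = set()
    for source_url, target_url in links:
        source = urlparse(source_url).netloc.lower()
        target = urlparse(target_url).netloc.lower()
        # Skip links leaving the subset and links internal to one website.
        if source in nodes and target in nodes and source != target:
            edges.add((nodes[source], nodes[target]))
    return nodes, sorted(edges)


def write_csv(rows, header, out):
    """Write a header line followed by the rows to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Hypothetical subset and links, for illustration only.
subset = {"a.example", "b.example", "c.example"}
links = [
    ("http://a.example/x", "http://b.example/"),
    ("http://b.example/", "http://a.example/y"),
    ("http://a.example/", "http://outside.example/"),
    ("http://c.example/", "http://a.example/"),
    ("http://a.example/x", "http://a.example/y"),
]
nodes, edges = build_graph(links, subset)

nodes_out = io.StringIO()  # stands in for open("nodes.csv", "w", newline="")
edges_out = io.StringIO()  # stands in for open("edges.csv", "w", newline="")
write_csv([(i, site) for site, i in nodes.items()], ["Id", "Label"], nodes_out)
write_csv(edges, ["Source", "Target"], edges_out)
```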

Building graph
● Links tell a lot about websites
  ○ Authorities
  ○ Hubs
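A simple way to spot candidate authorities and hubs is to count links per node. The sketch below uses the inlink count as an authority proxy, which matches the deck's own reading of inlinks as authority; using the outlink count as a hub proxy is an assumption made here. This is plain degree counting, not an iterative algorithm such as HITS, and the edge list is hypothetical.

```python
from collections import Counter


def degree_counts(edges):
    """Count inlinks and outlinks per node from directed (source, target) edges."""
    inlinks = Counter()
    outlinks = Counter()
    for source, target in edges:
        outlinks[source] += 1
        inlinks[target] += 1
    return inlinks, outlinks


# Hypothetical directed edges between websites.
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("d", "a"), ("d", "b")]
inlinks, outlinks = degree_counts(edges)
top_authority = inlinks.most_common(1)[0][0]  # node receiving the most links
top_hub = outlinks.most_common(1)[0][0]       # node emitting the most links
```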

Visualizing graph using Gephi
● Load graph
● Spatialize graph
  ○ Links between websites create "attraction", so linked websites appear near each other
  ○ The more inlinks, the bigger the node (= authority)
  ○ Categorizing websites for better understanding (one color per category)
    ■ Companies, non-profits/blogs, government agencies
  ○ Communities can now appear!

Visualizing graph on the web
● Sigma.js
● Uses Gephi files
● Gives better interactivity

Analyze

● The final graph is a good way to understand interactions between actors
  ○ Open Data was clearly initiated by a non-profit movement
  ○ Companies are beginning to work on the subject
  ○ The French state has only had some sporadic initiatives so far
● This graph will be generated again in the near future, to see changes in this ecosystem

Results

● Large-scale crawl made easy
  ○ Easy to focus on mining the results instead of finding/storing the data
● Nice workflow from raw data to an understandable visualisation
● The final graph is a good way to understand interactions between actors

Feedback

● Common Crawl
  ○ Common Crawl doesn't have an exhaustive crawl of the French web for now
  ○ Data is not as fresh as it could be
  ○ It is missing an index to access at least domains, and maybe pages, in O(1)
● Methodology
  ○ Opendataness scoring can exclude some websites that are relevant but not focused enough on Open Data
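To make the index request concrete, here is a sketch of the kind of domain lookup it describes: a hash map from domain to the locations of its pages, giving average constant-time access. The record layout (URL, archive file, byte offset) and the file names are hypothetical and only stand in for real crawl metadata.

```python
from collections import defaultdict
from urllib.parse import urlparse


def build_domain_index(records):
    """Map each domain to the archive locations of its pages.

    `records` is an iterable of (url, archive_file, offset) tuples.
    """
    index = defaultdict(list)
    for url, archive_file, offset in records:
        index[urlparse(url).netloc.lower()].append((archive_file, offset))
    return dict(index)


# Hypothetical records, for illustration only.
records = [
    ("http://a.example/1", "segment-0.arc.gz", 0),
    ("http://b.example/", "segment-0.arc.gz", 5120),
    ("http://a.example/2", "segment-1.arc.gz", 1024),
]
index = build_domain_index(records)
locations = index.get("a.example", [])  # average O(1) dictionary lookup
```

With such an index, a job could read only the archive ranges of the domains it cares about instead of scanning the whole crawl.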

Resources

● http://webatlas.fr/tempshare/OpenDataActeursTypes.pdf
  ○ Poster by Franck Ghitalla
● http://french-opendata.data-publica.com/index.html
  ○ Dynamic visualisation of the results, by Data Publica
● http://fr.slideshare.net/willounet/a-sneak-peek-into-the-web-presentation
  ○ A sneak peek into the web, by GL
● http://french-opendata.data-publica.com/
  ○ Project host page
