BNSI updates WP2 Face to Face meeting Gdańsk, October 5-6, 2017 2 "P. Volov" Str., 1038 Sofia, Bulgaria, tel. +359 2 9857 729


BNSI updates

WP2 Face to Face meeting

Gdańsk, October 5-6, 2017

2 "P. Volov" Str., 1038 Sofia, Bulgaria, tel. +359 2 9857 729


Initial conditions

• The task

    • Finding websites of enterprises with 10 and more employees

• Software environment

    • Windows OS

    • MySQL, MSSQL, …

    • PHP, VB, Java, …

• Legal issues

    • No legal constraints at national level for web scraping


Initial data from Business Register

CSV file with 26836 businesses

• 20649 e-mails

• 2006 urls

• Addresses

• Phone numbers

• NACE codes

• …


MySQL Database


Table ikturl Structure (1)


Table ikturl Structure (2)


Scripts


Common features of the search scripts

• Run in browser

• <meta http-equiv="Refresh" content="30">

• Timestamps


Script geturl.php

• Checks if the initial 2006 urls are real websites

• Constructs domain names from the 20649 e-mails

• Checks if the constructed domains are real websites

• Saves the results in the database

• Result - 7038 possible urls
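The step of constructing domain names from e-mails can be sketched as follows. This is a minimal illustration in Python, not the code of geturl.php: the function name `candidate_url`, the list of public mail providers and the `http://www.` prefix are all assumptions made for the example, and the check that the resulting address is a real website (which needs network access) is left out.

```python
from typing import Optional

# Hypothetical list of public mail providers; it is not taken from the slides.
FREE_MAIL_DOMAINS = {"gmail.com", "yahoo.com", "abv.bg", "mail.bg"}


def candidate_url(email: str) -> Optional[str]:
    """Derive a candidate website URL from an e-mail address, or return None."""
    local, sep, domain = email.strip().lower().rpartition("@")
    if not sep or not local or "." not in domain:
        return None  # malformed address, nothing to derive
    if domain in FREE_MAIL_DOMAINS:
        return None  # a shared mail provider says nothing about the firm's own site
    # Assumption: the website, if any, lives at www.<mail domain>.
    return f"http://www.{domain}"


if __name__ == "__main__":
    for address in ["office@example-firm.bg", "ivan@gmail.com", "broken"]:
        print(address, "->", candidate_url(address))
```

Each candidate would still have to be requested and stored only if it answers, which is what the slide describes as checking that the constructed domains are real websites.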


Script jabse_search.php

• Uses the automated search interface of http://www.jabse.com (jabse_interface.php)

• Gets up to 10 search results for each enterprise name in Bulgarian and up to 10 for the same name transliterated into Latin script

• Excludes complex urls from the search results

• Saves the results in the database

• Result - 15638 results in Bulgarian and 16201 results in Latin
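The slides do not define what counts as a complex url. One possible reading, sketched below in Python, treats a url as simple when it points at a site root and has no path, query string or fragment. The helper names `is_simple_url` and `keep_simple` and that definition are assumptions for illustration only.

```python
from urllib.parse import urlsplit


def is_simple_url(url: str) -> bool:
    """True when the url is just scheme + host, optionally with a trailing slash."""
    parts = urlsplit(url)
    has_path = parts.path not in ("", "/")
    return bool(parts.netloc) and not (has_path or parts.query or parts.fragment)


def keep_simple(urls, limit=10):
    """Keep at most `limit` simple urls, preserving the search engine's order."""
    return [u for u in urls if is_simple_url(u)][:limit]


if __name__ == "__main__":
    results = [
        "http://example.bg",
        "http://example.bg/catalog/item?id=3",
        "https://another-example.bg/",
    ]
    print(keep_simple(results))
```

Under this reading, a link deep inside a catalogue or directory site is dropped, while a bare home page address is kept as a candidate.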


Script google_search.php

• Uses the Google search interface

• Gets up to 10 search results for each enterprise name in Bulgarian

• Saves the results in the database

• Result - 26829 sets of up to 10 search results


Google search interface

• 200 searches per day free

• 5 EUR per 1000 searches, up to a maximum of 10000 per day

• 300 EUR credit for free searches on credit card registration
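To see what these limits mean for a register of 26836 enterprises, the arithmetic can be sketched in Python. The quota and price figures come from the slide. The rest is assumed for illustration: one query per enterprise, a daily ceiling that includes the free allowance, and pro rata billing instead of whole blocks of 1000.

```python
import math

ENTERPRISES = 26836        # size of the business register extract
FREE_PER_DAY = 200         # free searches per day
PRICE_EUR_PER_1000 = 5     # price of paid searches
MAX_PER_DAY = 10000        # daily ceiling (assumed to include the free quota)


def days_free_only(queries: int) -> int:
    """Days needed when only the free daily quota is used."""
    return math.ceil(queries / FREE_PER_DAY)


def plan_paid(queries: int):
    """Days and cost in EUR when running at the daily ceiling."""
    days = math.ceil(queries / MAX_PER_DAY)
    paid_queries = max(0, queries - days * FREE_PER_DAY)
    cost = round(paid_queries / 1000 * PRICE_EUR_PER_1000, 2)
    return days, cost


if __name__ == "__main__":
    print("free quota only:", days_free_only(ENTERPRISES), "days")
    days, cost = plan_paid(ENTERPRISES)
    print("paid:", days, "days, about", cost, "EUR")
```

Under these assumptions the free quota alone would take 135 days, while the paid route takes 3 days at roughly 131 EUR, which is below the 300 EUR registration credit.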


Script list.php

Database crawling interface that displays each enterprise with its characteristics and the url search results, and lets the user choose the correct url for each enterprise

list.php.htm


Manual work done

• 26836 records were checked in 45 working days

• About 600 records per working day

• 9809 urls were found

• 36.6 % of enterprises have websites
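The figures on this slide can be cross-checked with a few lines of Python. All inputs are taken from the slide; only the variable names are introduced here.

```python
records_checked = 26836
working_days = 45
urls_found = 9809

# Average manual throughput per working day.
per_day = records_checked / working_days

# Share of enterprises for which a website was found, in percent.
share_with_website = 100 * urls_found / records_checked

print(f"{per_day:.0f} records per working day")
print(f"{share_with_website:.1f} % of enterprises have websites")
```

The throughput comes out just under the rounded 600 records per day, and the share matches the 36.6 % reported.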


The scripts

• Built for Bulgarian conditions

• Not designed for a different database table structure

• Hard-coded labels in Bulgarian


Use Case 1: URLs Inventory

SBR enterprises with 10 and more employees

• 26836 enterprises

• 20649 e-mails, 2006 urls

• Addresses, Phones, NACE codes…

Search in JABSE, Google – BNSI software

Search in Bing – ISTAT software URLSearcher

9809 urls were found

36.6 % of enterprises with 10 and more employees have websites


• BNSI software


Use Case 1: URLs Inventory

URLSearcher, RootJuice, URLScorer and URLMatchTableGenerator programs from ISTAT and the Apache Solr storage platform

• Total number of enterprises: 26836

• The URLMatchTableGenerator predicts the right URLs for 67 % of the enterprises

• A better list of yellow pages and internet catalogues is needed

• The Apache Solr version should be the one suggested by ISTAT


• ISTAT software

URLScorer

URLMatchTableGenerator


Use Case 2: E-commerce

PHP script with 3 logics (4 positive and 1 negative lists of keywords)

• e-commerce: 1139

• e-commerce v2: 1048

• e-commerce v3: 662

Manually checked: 856 e-commerce

10% sample: 27 e-commerce

856 + 27 × 10 = 1126 e-commerce

11.5% of enterprises (10+ employees) with websites do e-commerce

4.2% of enterprises with 10 and more employees do e-commerce
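The estimate above scales the 27 hits from the 10% sample up by a factor of 10 and adds them to the 856 manually confirmed cases. A short Python sketch reproduces the arithmetic and the two percentages. The function name `extrapolate` is introduced here for illustration, and the denominators 9809 (enterprises with websites) and 26836 (all enterprises) are taken from the URLs inventory slides.

```python
def extrapolate(confirmed: int, sample_hits: int, sample_fraction: float) -> int:
    """Confirmed cases plus sample hits scaled up to the full set the sample was drawn from."""
    return confirmed + round(sample_hits / sample_fraction)


ENTERPRISES_WITH_WEBSITE = 9809
ALL_ENTERPRISES = 26836

total = extrapolate(confirmed=856, sample_hits=27, sample_fraction=0.10)
share_of_websites = 100 * total / ENTERPRISES_WITH_WEBSITE
share_of_all = 100 * total / ALL_ENTERPRISES

print(total, "e-commerce enterprises (estimate)")
print(f"{share_of_websites:.1f} % of enterprises with websites")
print(f"{share_of_all:.1f} % of all enterprises with 10 and more employees")
```

Because the 270 extrapolated cases rest on a small sample, the total is an estimate and not a count.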


Use Case 3: Social Media Presence

PHP script (crawling only the first page of each enterprise website)

• facebook: 2356

• twitter: 922

• linkedin: 560

• google: 871

• youtube: 527

• pinterest: 139

• instagram: 127

24.9% of enterprises (10+ employees) with websites have at least one social media profile

9.1% of enterprises with 10 and more employees use at least one of the listed social media
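Detecting social media presence from a single page comes down to looking for links to the platforms in its HTML. The sketch below shows one way to do that in Python with the standard library. It is not the BNSI script: the host names in `PLATFORM_HOSTS` are assumptions (in particular, "google" is read here as Google+), and fetching the first page of a site is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumed host name for each platform listed on the slide.
PLATFORM_HOSTS = {
    "facebook": "facebook.com",
    "twitter": "twitter.com",
    "linkedin": "linkedin.com",
    "google": "plus.google.com",  # assumption: the slide's "google" means Google+
    "youtube": "youtube.com",
    "pinterest": "pinterest.com",
    "instagram": "instagram.com",
}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def social_platforms(html: str) -> set:
    """Return the platforms that the page links to."""
    collector = LinkCollector()
    collector.feed(html)
    found = set()
    for href in collector.hrefs:
        host = urlsplit(href).netloc.lower()
        for platform, domain in PLATFORM_HOSTS.items():
            if host == domain or host.endswith("." + domain):
                found.add(platform)
    return found


if __name__ == "__main__":
    page = (
        '<a href="https://www.facebook.com/firm">Facebook</a>'
        '<a href="/contact">Contact</a>'
        '<a href="https://youtube.com/user/firm">YouTube</a>'
    )
    print(sorted(social_platforms(page)))
```

Matching on the link host instead of searching the page text for the platform name avoids counting pages that only mention a platform without linking to a profile.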


Verification of the results with ICT survey data – 2016 (E-commerce)


Verification of the results with ICT survey data – 2016 (Social media presence)


Plans for SGA-II


• Enterprise URLs Inventory - improvements from SGA-I

• E-commerce on enterprises’ websites - improvements from SGA-I and testing of the ISTAT software

• Social Media Presence on enterprises’ webpages - improvements from SGA-I

• Job advertisements on enterprises’ websites - new


THANK YOU FOR YOUR ATTENTION!
