Harvesting crowdsourcing biodiversity data from Facebook groups


Upload: dongpo-deng

Post on 12-May-2015


Page 1: Harvesting crowdsourcing biodiversity data from Facebook groups

0. Crowdsourcing - participants provide unstructured data voluntarily

Facebook interest groups

Reptile-Road-Mortality Enjoy-Moths

Main Database

1. Crawling data from Facebook via its API

[Figure: What a typical discussion thread looks like. A thread consists of a post (picture and post message) followed by comment messages.]
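As a rough illustration of step 1, the sketch below turns one page of crawled feed data into thread records (post picture, post message, comment messages). It makes no network call. The JSON shape and the field names `data`, `id`, `message`, `full_picture`, and `comments` are assumptions modeled on a Graph API feed response, and `Thread` and `threads_from_feed` are hypothetical names, not the poster's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Thread:
    """One discussion thread: a post plus its comment messages."""
    post_id: str
    picture_url: Optional[str]
    message: str
    comments: List[str] = field(default_factory=list)


def threads_from_feed(payload: dict) -> List[Thread]:
    """Flatten one page of feed JSON (assumed shape) into Thread records."""
    threads = []
    for post in payload.get("data", []):
        # Comments are assumed to be nested as {"comments": {"data": [...]}}.
        comment_items = post.get("comments", {}).get("data", [])
        threads.append(
            Thread(
                post_id=post["id"],
                picture_url=post.get("full_picture"),
                message=post.get("message", ""),
                comments=[c.get("message", "") for c in comment_items],
            )
        )
    return threads


# Hypothetical sample page, for illustration only.
sample_page = {
    "data": [
        {
            "id": "post-1",
            "message": "Found on the road this morning",
            "full_picture": "http://example.org/photo.jpg",
            "comments": {"data": [{"message": "細紋南蛇"}, {"message": "Where was this?"}]},
        }
    ]
}
sample_threads = threads_from_feed(sample_page)
```

Keeping the post and its comments together in one record matters for the later steps, which look at the whole thread rather than at a single message.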

2. Using natural language processing techniques, with the Taiwan Geographic Name and Taiwan Catalogue of Life (TaiCOL) databases as knowledge bases, to extract species vernacular names and place names from a thread

[Flowchart: For each vernacular name in TaiCOL (example: 細紋南蛇), test whether the name occurs in the message. If it does not, test whether one of its affixes (Prefix2 細紋, Prefix3 細紋南, Postfix2 南蛇, Postfix1 蛇) occurs in the message, and whether the name occurs in the thread. Outcomes: full-matched name; matched abbreviation, for which a confidence score of the name is calculated; or the name doesn't exist in the message.]

3. Introducing the content management system Drupal for easier data management (including error correction) and display

Algorithms used to recognize abbreviations of vernacular names and place names
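The abbreviation matching can be sketched as follows. This is one possible reading of the flowchart, not the authors' code: a full name found in the message is a full match; otherwise a prefix or postfix of the name found in the message counts as an abbreviation only if the full name occurs elsewhere in the thread. The poster does not give the confidence formula, so the ratio of affix length to name length used here is purely a placeholder assumption, as are the names `Match`, `affixes`, and `match_name` and the affix length limits.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Match:
    name: str          # full vernacular name from the catalogue
    matched: str       # substring actually found in the message
    kind: str          # "full" or "abbreviation"
    confidence: float  # placeholder score (assumption, see lead-in)


def affixes(name: str) -> List[str]:
    """Proper prefixes (length >= 2) and postfixes (length >= 1), longest first."""
    n = len(name)
    prefixes = [name[:k] for k in range(2, n)]
    postfixes = [name[-k:] for k in range(1, n)]
    # sorted() is stable, so prefixes come before postfixes of equal length.
    return sorted(prefixes + postfixes, key=len, reverse=True)


def match_name(name: str, message: str, thread_text: str) -> Optional[Match]:
    """Return how `name` is referred to in `message`, or None if it is not."""
    if name in message:
        return Match(name, name, "full", 1.0)
    # An abbreviation is accepted only when the thread supplies the full name.
    if name not in thread_text:
        return None
    for affix in affixes(name):
        if affix in message:
            return Match(name, affix, "abbreviation", len(affix) / len(name))
    return None


result = match_name("細紋南蛇", "路上看到一隻南蛇", "這是細紋南蛇")
```

Trying longer affixes first means the most specific abbreviation is reported when several affixes of the same name occur in one message. A single-character postfix such as 蛇 is very ambiguous on its own, which is why the check against the rest of the thread is needed.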

The emergence of Web 2.0 lets people contribute their biodiversity observations on the Web. Such crowdsourced biodiversity data are becoming more valuable to scientific studies because they can cover broader spatial and temporal scales. However, data provided as plain text hinder retrieval and analysis. In this study, we propose a framework that automatically structures loosely formatted text, so that volunteers can keep contributing data in the ways familiar to them while interested citizens, biodiversity researchers, and managers benefit from semantically structured information. We take two Facebook biodiversity interest groups, Reptile-Road-Mortality and Enjoy-Moths, as examples.

Harvesting crowdsourcing biodiversity data from Facebook groups

Jason Guan-Shuo Mai (1), Cheng-Hsin Hsu (1), Dong-Po Deng (2), De-En Lin (3), Hsu-Hong Lin (3), Kwang-Tsao Shao (1)

4. Publishing linked open data via a D2R server for open access and usage

Our dataset is linked to other datasets on the linked open data cloud, such as DBpedia, GeoNames and LODE (Linked Open Data of Ecology), so it can benefit from the large amount of meta-information they provide.
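To make the idea of linking concrete, the sketch below writes a few N-Triples statements of the kind such a dataset might expose. It is not D2R Server itself, which generates RDF from a relational database through a mapping. Every `http://example.org/` URI, the DBpedia and GeoNames target URIs, and the helper names `literal` and `triple` are hypothetical placeholders. The `owl:sameAs` and Darwin Core `vernacularName` predicates are existing vocabulary terms, but their use here is an assumption.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DWC_VERNACULAR_NAME = "http://rs.tdwg.org/dwc/terms/vernacularName"


def literal(text: str) -> str:
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def triple(subject: str, predicate: str, obj: str, is_literal: bool = False) -> str:
    """Format one N-Triples statement; `obj` is a URI unless is_literal is set."""
    rendered = literal(obj) if is_literal else f"<{obj}>"
    return f"<{subject}> <{predicate}> {rendered} ."


# Hypothetical resources, for illustration only.
taxon = "http://example.org/taxon/1"
place = "http://example.org/place/1"

statements = [
    triple(taxon, DWC_VERNACULAR_NAME, "細紋南蛇", is_literal=True),
    triple(taxon, OWL_SAME_AS, "http://dbpedia.org/resource/Example_species"),
    triple(place, OWL_SAME_AS, "http://sws.geonames.org/0000000/"),
]
print("\n".join(statements))
```

Once a local taxon or place is declared the same as an external resource, a client that follows the link can pull in whatever the external dataset records about it.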

5. Developing browser plug-ins to give users digested feedback from the structured data

6. Improving source data quality without changing the ways users are familiar with

Our algorithm picks the most relevant species name appearing in a thread based on social networking characteristics.
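The poster does not say which social networking characteristics are used or how they are combined. Purely as an illustration, the sketch below assumes that each mention of a candidate name counts once, plus once more for every like on the comment that contains it, and picks the name with the highest total. The function name `pick_species`, the input format, and the weighting are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def pick_species(mentions: Iterable[Tuple[str, int]]) -> Optional[str]:
    """Pick the candidate name with the highest like-weighted vote.

    `mentions` holds (species_name, likes_on_that_comment) pairs.
    The weighting (1 + likes) is an assumed stand-in, not the poster's rule.
    """
    scores: Dict[str, float] = defaultdict(float)
    for name, likes in mentions:
        scores[name] += 1.0 + likes
    if not scores:
        return None
    # On a tie, the name that was mentioned first in the thread wins.
    return max(scores, key=scores.get)


best = pick_species([("南蛇", 0), ("細紋南蛇", 3), ("南蛇", 1)])
```

In this sample, 南蛇 is mentioned twice but collects a total of 3, while the single mention of 細紋南蛇 with three likes scores 4 and is selected.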

One click on a message recognizes species vernacular names and related information

Semantic annotation tool disambiguates toponymic homonyms

(1) Taiwan Biodiversity Information Facility (TaiBIF), Biodiversity Research Center, Academia Sinica, Taipei, Taiwan

(2) Institute of Information Science, Academia Sinica, Taipei, Taiwan

(3) Taiwan Endemic Species Research Institute, Council of Agriculture, Nantou, Taiwan