mapping social, political, and scientific landscape using webometrics city univ of hong kong (24...
TRANSCRIPT
Mapping social, political, and scientific landscape using webometrics method
Assoc. Prof. Han Woo PARK
Department of Media & Communication
YeungNam University
214-1 Dae-dong, Gyeongsan-si, Gyeongsangbuk-do 712-749
Republic of Korea
[email protected]
http://www.hanpark.net
http://english-webometrics.yu.ac.kr
http://asia-triplehelix.org
Thanks to my colleagues and students at the WWI.
Virtual Knowledge Studio (VKS)
• Invited speech, Department of Media & Communication, City University of Hong Kong, 29 March 2010
• (Topic: Mapping social, political, and scientific landscape using webometric method)
Outline of presentation
1. Development of webometrics tools to automate the social Internet research process (e.g., data collection and analysis from search engines, SNS and microblogging sites)
2. Experimentation with new types of data visualization across periods and platforms (e.g., dynamic mappings using HNA)
Webometrics in terms of e-research
A minor but growing approach to the study of Internet-mediated communication
A new methodological perspective based on the use of new digital tools available online for conducting humanities and social science Internet research
Research tradition of Webometrics
• 1) development of online tools to automate the Internet research process, such as data collection and analysis
• 2) experimentation with new types of data visualization, such as social network and hyperlink analysis and multimedia and dynamic mappings
http://participatorysociety.org/wiki/index.php?title=Online_Research
Web Scrapers, Crawlers, Tools in WCU
Overview
• Collecting data from search engines: Naver (API, Non-API), Google.com
• Digging Social Networking Services: Cyworld Minihompies, Facebook, Plurk
• Microblogging sites: Twitter, TwtKr.com
• Korean Internet Network Miner: A Korean version of Dr. A. Gruzd’s ICTA
• Web archiving of Korean MPs: http://www.web-archive.kr/
• The tools are at various stages of development
• They return data from the web in a form suitable for import into Excel, SPSS, LexiURL, etc.
• Returned data contain all available values; only some may be relevant to the current query, but keeping all of the data means you can revisit it later if another project requires more variables
• All programs have time-rests (pauses between requests), though these vary depending on the service being accessed
The purpose of this paper is to introduce the API-based webometrics tool created for the Korean search engine Naver
This non-commercial software is designed to collect large amounts of data automatically and can easily distinguish between different types of information on the web, which was impossible before.
(Image Source: Newsweek, 5 Nov 2007)
WCU WEBOMETRICS INSTITUTE: INVESTIGATING INTERNET-BASED POLITICS WITH E-RESEARCH TOOLS
Webonaver (Webometrics Tool for Naver)
Rationale for Naver
• “Republic of Naver” (Kim & Sohn, 2007)
• “Korea’s Naver is now the world’s 5th search service provider, behind Google, Yahoo, Baidu and Microsoft.” (The AP, 9 Oct 2007)
• “Google left behind as Koreans Naver-gate the internet” (Financial Times, 2 Jan 2008)
• “In South Korea, people who want to look something up on the internet don’t ‘Google it’. Instead they ‘ask Naver’.” (Economist, 30 Feb 2009)
• Lee, Y.-O., & Park, H. W. (2008). "The Importance of Search Engines in Digital News Consumption: A Comparative Study Between South Korea and the UK". Refereed paper presented at the workshop “Gatekeepers in a Digital Asian-European Media Landscape: The rising structural power of Internet search engines” (2008).
Components of Naver
(Screen capture of the Naver front page; labelled components:)
• Log-in
• Article titles (changing automatically)
• Links to the press
• Today's issues
• Quick menu
• Browser window
• Naver search options
Interface
The interface is fairly self-explanatory:
- Tick or untick the box to collect either only the hit count or also the title, URL, and description of each result
- Select which of the search options you want to include
- Click on the '...' button to select the text file that contains the queries you wish to run
- Click 'Run Queries'
http://english-webometrics.yu.ac.kr/WebometricsTools/WeboNaver/WeboNaver.html
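The workflow described for the interface (load a text file of queries, then collect either hit counts only or full result records) can be sketched roughly as below. This is an illustrative sketch, not WeboNaver's actual code: `load_queries`, `run_queries`, and `fake_search` are hypothetical names, and `fake_search` returns canned data instead of calling Naver so that the example runs offline.

```python
import io

def load_queries(handle):
    """Read one query per line from a text file, skipping blank lines."""
    return [line.strip() for line in handle if line.strip()]

def fake_search(query):
    """Hypothetical stand-in for a search-engine call; returns canned results."""
    results = [
        {
            "title": f"Result {i} for {query}",
            "url": f"http://example.com/{i}",
            "description": "placeholder description",
        }
        for i in range(3)
    ]
    return {"hits": len(results), "results": results}

def run_queries(queries, search, hits_only=True):
    """Return one row per query (hit counts) or one row per result (full records)."""
    rows = []
    for query in queries:
        response = search(query)
        if hits_only:
            rows.append((query, response["hits"]))
        else:
            for item in response["results"]:
                rows.append((query, item["title"], item["url"], item["description"]))
    return rows

# An in-memory "file" stands in for the query text file chosen with the '...' button.
query_file = io.StringIO("신종플루\n\n신종인플루엔자\n")
queries = load_queries(query_file)
hit_rows = run_queries(queries, fake_search, hits_only=True)
full_rows = run_queries(queries, fake_search, hits_only=False)
print(hit_rows)
```

Rows in this shape can be written out as tab- or comma-separated text for import into Excel or SPSS.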
Web presence of the term H1N1
• The web presence of the term H1N1 was examined using Webonaver to test the usability and reliability of the tool.
Queries: 신종플루 (A virus subtype H1N1) 신종인플루엔자 (Influenza A virus subtype H1N1) 신종인플루엔자 (Influenza A virus subtype H1N1)
• WeboNaver returns the same results for a query whether or not it contains a space character.
• However, it does not treat similar words as the same term, so users should decide which specific data they want to extract before using the tool.
Monitoring a Socio-political Blogosphere in South Korea: Comparing Blogosphere Metrics with Voter Turnout
• Data
– Blog postings related to 29 candidates in the 2009 Korean National Assembly by-election
• Data gathering
– Korean-language blog search engine by Naver.com
– Real-time blog monitoring program by WWI
– Search queries: the candidate's name + “candidate”
– Search date: after Oct. 8, 2009
– Data collection period: Oct. 16 – Oct. 27, 2009 (12 days)
– Cycle: twice a day (00:00 and 12:00)
Trend Analysis
• Jangan district in Suwon City, Gyeonggi Province (chart; candidates: Park, CS; Lee, CY; Ahn, DS; Yoon, JY)
Blogs vs. Votes
• Jangan district in Suwon City, Gyeonggi Province (charts of N. of Votes and N. of Blogs; candidates: Park, CS; Lee, CY; Ahn, DS; Yoon, JY)
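Constituency | Candidate | Blog | Blog % | Blog rank | Vote | Vote % | Vote rank
Jangan, Suwon, Gyeonggi | Park, CS (박찬숙) | 213.4 | 35.6 | 2 | 33,106 | 42.7 | 2
Jangan, Suwon, Gyeonggi | Lee, CY (이찬열) | 216.6 | 36.1 | 1 | 38,187 | 49.2 | 1
Jangan, Suwon, Gyeonggi | Ahn, DS (안동섭) | 158.4 | 26.4 | 3 | 5,570 | 7.2 | 3
Jangan, Suwon, Gyeonggi | Yoon, JY (윤준영) | 11.8 | 2.0 | 4 | 716 | 0.9 | 4
Sangrok-B, Ansan, Gyeonggi | Song, JS (송진섭) | 147.8 | 17.0 | 3 | 11,420 | 33.2 | 2
Sangrok-B, Ansan, Gyeonggi | Kim, YH (김영환) | 280.1 | 32.3 | 1 | 14,176 | 41.2 | 1
Sangrok-B, Ansan, Gyeonggi | Jang, KW (장경우) | 64.0 | 7.4 | 4 | 1,145 | 3.3 | 4
Sangrok-B, Ansan, Gyeonggi | Kim, SK (김석균) | 25.7 | 3.0 | 6 | 896 | 2.6 | 6
Sangrok-B, Ansan, Gyeonggi | Yoon, MW (윤문원) | 22.8 | 2.6 | 7 | 439 | 1.3 | 7
Sangrok-B, Ansan, Gyeonggi | Lee, YH (이영호) | 59.5 | 6.9 | 5 | 987 | 2.9 | 5
Sangrok-B, Ansan, Gyeonggi | Lim, JI (임종인) | 268.6 | 30.9 | 2 | 5,363 | 15.6 | 3
Gangreung, Gangwon | Kwon, SD (권성동) | 85.6 | 32.9 | 1 | 29,010 | 50.9 | 1
Gangreung, Gangwon | Hong, JK (홍재경) | 68.0 | 26.1 | 3 | 2,100 | 3.7 | 4
Gangreung, Gangwon | Song, YC (송영철) | 72.1 | 27.7 | 2 | 19,867 | 34.8 | 2
Gangreung, Gangwon | Shim, KS (심기섭) | 34.9 | 13.4 | 4 | 6,054 | 10.6 | 3
North Chungcheong (4 districts) | Kyoung, DS (경대수) | 140.2 | 25.2 | 2 | 19,427 | 28.4 | 2
North Chungcheong (4 districts) | Chung, BG (정범구) | 167.1 | 30.0 | 1 | 29,120 | 42.5 | 1
North Chungcheong (4 districts) | Chung, WH (정원헌) | 65.2 | 11.7 | 5 | 3,071 | 4.5 | 4
North Chungcheong (4 districts) | Park, KS (박기수) | 68.8 | 12.4 | 4 | 2,125 | 3.1 | 5
North Chungcheong (4 districts) | Lee, TH (이태희) | 33.2 | 6.0 | 6 | 504 | 0.7 | 6
North Chungcheong (4 districts) | Kim, KH (김경회) | 81.7 | 14.7 | 3 | 14,218 | 20.8 | 3
Yangsan, South Gyungsang | Park, HT (박희태) | 258.2 | 30.4 | 1 | 16,597 | 37.9 | 1
Yangsan, South Gyungsang | Song, IB (송인배) | 214.2 | 25.2 | 2 | 15,577 | 35.6 | 2
Yangsan, South Gyungsang | Park, SH (박승흡) | 134.0 | 15.8 | 3 | 1,550 | 3.5 | 5
Yangsan, South Gyungsang | Kim, SG (김상걸) | 33.4 | 3.9 | 6 | 900 | 2.1 | 6
Yangsan, South Gyungsang | Kim, YS (김양수) | 88.7 | 10.5 | 4 | 5,875 | 13.4 | 3
Yangsan, South Gyungsang | Kim, YK (김용구) | 26.6 | 3.1 | 8 | 234 | 0.5 | 8
Yangsan, South Gyungsang | Kim, JM (김진명) | 29.3 | 3.5 | 7 | 325 | 0.7 | 7
Yangsan, South Gyungsang | Yoo, JM (유재명) | 64.3 | 7.6 | 5 | 2,710 | 6.2 | 4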
Results
• Correlation Analysis (N. of Blogs & N. of Votes)
– Pearson r = .586, p < .01 (N=29)
– Spearman rho = .797, p < .01 (N=29)
• Simple Regression Analysis
– N. of Votes = 1,055.56 + 79.99 (N. of Blogs)
– R² = .344 (F = 14.128, p < .01)
– β = .586 (t = 3.759, p < .01)
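The statistics on this slide can be recomputed from the 29 (blog, vote) pairs in the candidate table. The sketch below uses only the Python standard library; the data list is transcribed from the table, and the helper names (`pearson`, `average_ranks`, `spearman`, `simple_regression`) are illustrative rather than the authors' own procedure. If the table was transcribed faithfully, the output should land close to the reported coefficients, though small rounding differences are possible.

```python
import math

# (blog figure, votes) for the 29 candidates, transcribed from the table.
DATA = [
    (213.4, 33106), (216.6, 38187), (158.4, 5570), (11.8, 716),
    (147.8, 11420), (280.1, 14176), (64.0, 1145), (25.7, 896),
    (22.8, 439), (59.5, 987), (268.6, 5363),
    (85.6, 29010), (68.0, 2100), (72.1, 19867), (34.9, 6054),
    (140.2, 19427), (167.1, 29120), (65.2, 3071), (68.8, 2125),
    (33.2, 504), (81.7, 14218),
    (258.2, 16597), (214.2, 15577), (134.0, 1550), (33.4, 900),
    (88.7, 5875), (26.6, 234), (29.3, 325), (64.3, 2710),
]

def mean(values):
    return sum(values) / len(values)

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

def simple_regression(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope

blogs = [b for b, _ in DATA]
votes = [v for _, v in DATA]
r = pearson(blogs, votes)
rho = spearman(blogs, votes)
intercept, slope = simple_regression(blogs, votes)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
print(f"Votes = {intercept:.2f} + {slope:.2f} * Blogs, R^2 = {r ** 2:.3f}")
```

With a single predictor, R² is the square of Pearson r and the standardized coefficient β equals r, which is why the slide reports β = .586.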
Summary
• Overall, the number of blog postings about each candidate tended to increase over time.
• In every district, the candidate with the most blog postings won the election.
• Both correlation analyses (Pearson and Spearman) indicate a significant positive relationship between blog postings and votes.
• In the simple regression analysis, the number of blog postings about a candidate is a significant predictor of the number of votes.
Cyworld
• Collects profile information from the public messages posted to an initial seed user
• Takes approximately 10 seconds per user request
• Stores user details so that repeat requests for the same user are not needed
• Because some Cyworld pages carry very high numbers of comments, collecting the data can take several days
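Storing user details so that the same commenter is never requested twice matters when each request takes about 10 seconds. A minimal sketch of that idea follows; `ProfileCache` and `fake_fetch` are hypothetical names, and `fake_fetch` returns dummy data in place of a real Cyworld request.

```python
class ProfileCache:
    """Fetch each user's profile at most once and reuse the stored copy."""

    def __init__(self, fetch):
        self._fetch = fetch      # function that performs the slow request
        self._store = {}         # user id -> stored profile
        self.fetch_count = 0     # number of slow requests actually made

    def get(self, user_id):
        if user_id not in self._store:
            self._store[user_id] = self._fetch(user_id)
            self.fetch_count += 1
        return self._store[user_id]

def fake_fetch(user_id):
    """Hypothetical stand-in for a profile request; returns dummy attributes."""
    return {"id": user_id, "gender": "unknown"}

cache = ProfileCache(fake_fetch)
commenters = ["userA", "userB", "userA", "userC", "userB"]
profiles = [cache.get(user) for user in commenters]
print(cache.fetch_count, "requests for", len(commenters), "comments")
```

Here five comments trigger only three requests, because two commenters appear twice.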
Cyworld Extractor - Overview
Java-based software tool that, given the URL of a politician on Cyworld, extracts the comments left by citizens along with related profile attributes.
The extracted data, which can amount to thousands of records, is saved in a format suitable for import into statistical software.
Geun-Hye Park’s mini-hompy (screen capture; labelled fields: status of the mini-hompy (① how active, ② how famous, ③ how friendly), gender, name, visitor count)
Why do Kyeong-Tae Jo and Kyeong-Won Na have so many comments?
• After the South Korean government concluded negotiations on American beef imports in April 2008, the government and public opinion clashed repeatedly during May and June 2008.
• As the graph indicates, the largest numbers of comments on all Assembly members’ mini-hompies were recorded in May and June 2008.
• Among them, the mini-hompies of Kyeong-Tae Jo and Kyeong-Won Na recorded the most comments.
South Koreans fearing 'mad cow disease' fight US beef imports in May and June 2008
IP address
Cyworld-IP screen capture
Seong-Min Yoo’s mini-hompy
Cyworld Extractor – Data
One example of possible uses for the collected data is to determine the region of posters commenting from Korea
Cyworld Extractor - Data
It is also possible to determine the country of origin of users commenting from outside Korea
Case 2. Cyworld Mini-hompies of Korean Legislators
Cyworld Mini-hompies of Korean legislators: Co-inlink network map using Yahoo.com
However, buddy data is not publicly available!!
The network structure based on co-link data shows a clear butterfly pattern. There is one hub (ghism) that belongs to Park Geun-Hye (Park GH, www.cyworld.com/ghism), the daughter of ex-president Park Jeong-Hee and one of the two major GNP candidates (along with president-elect Lee MB) in the 2007 presidential race.
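Two mini-hompies share a co-inlink when the same third page links to both of them. The sketch below counts co-inlinks from a small made-up set of linking pages; only the node name `ghism` comes from the slide, while the other names and the data are invented for illustration. The slide's actual counts came from Yahoo.com, which this sketch does not query.

```python
from collections import Counter
from itertools import combinations

def co_inlink_counts(outlinks):
    """Count, for every pair of targets, how many source pages link to both."""
    counts = Counter()
    for targets in outlinks.values():
        for a, b in combinations(sorted(targets), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical linking pages and the mini-hompies each one links to.
outlinks = {
    "page1": {"ghism", "memberA", "memberB"},
    "page2": {"ghism", "memberA"},
    "page3": {"ghism", "memberC"},
    "page4": {"memberB"},
}

counts = co_inlink_counts(outlinks)

# Weighted degree: total co-inlinks each node shares with all others.
degree = Counter()
for (a, b), n in counts.items():
    degree[a] += n
    degree[b] += n

hub = degree.most_common(1)[0][0]
print(dict(counts))
print("hub:", hub)
```

The pair counts form the weighted edge list of the co-inlink network, and the node with the highest weighted degree is the hub.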
Facebook
• Searches for groups with links to petition sites
• Stores group membership numbers
• Queries the petition site and stores the number of signatures
• Takes approximately 10 seconds per request
• No interface
Plurk
• Gathers the friends and fans lists of an initial seed user
• Returns two text files: one containing friends and one containing fans
• No interface at present; all commands must be entered through a command prompt
• Takes approximately 5 seconds per request
Screen capture of Plurk
Research examples on Plurk
Google
• Collects a maximum of 1,000 top search listings
• Writes the listing URLs out to a text file
• Interface allows setting certain parameters, such as file type, language, and country
• More can be added to the current list of options
• Takes approximately 3 seconds per page of results (1 page = 100 results)
Twitter
• Collects the follower/following lists and tweets of a chosen user
• Is subject to a rate limit of 150 hits imposed by Twitter
• When the rate limit is reached, the program pauses and shows an indeterminate progress dialog until the rate limit renews
• Users can log in with their Twitter credentials, which can optionally be stored for a future session
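The pause-until-renewal behaviour can be sketched as a small rate limiter. The limit of 150 comes from the slide; the one-hour window (3600 seconds), the class name `RateLimiter`, and the `FakeClock` helper are assumptions made for illustration. The fake clock lets the example run instantly instead of actually sleeping.

```python
import time

class RateLimiter:
    """Allow `limit` calls per window; pause until the window renews when exhausted."""

    def __init__(self, limit, window_seconds, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.window = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._window_start = clock()
        self.used = 0

    def acquire(self):
        now = self._clock()
        if now - self._window_start >= self.window:
            # The window has already renewed on its own.
            self._window_start = now
            self.used = 0
        if self.used >= self.limit:
            # Quota exhausted: wait for the remainder of the window.
            self._sleep(self._window_start + self.window - now)
            self._window_start = self._clock()
            self.used = 0
        self.used += 1

class FakeClock:
    """Test double that advances time only when 'sleeping'."""

    def __init__(self):
        self.now = 0.0
        self.slept = []

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.slept.append(seconds)
        self.now += seconds

fake = FakeClock()
# 150 is from the slide; the 3600-second window is an assumption.
limiter = RateLimiter(limit=150, window_seconds=3600, clock=fake.time, sleep=fake.sleep)
for _ in range(151):
    limiter.acquire()   # a real extractor would issue one request here
print(fake.slept)
```

In a real tool the default `time.monotonic` and `time.sleep` would be used, and the progress dialog would be shown while the sleep is in effect.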
Twitter Extractor - Overview
Sharing a similar interface and extraction mechanism with the Cyworld extractor, this application requires the URL of a user on Twitter. It is then possible to collect all tweets and determine the attributes of the user’s follower / following network
Twitter Extractor - Data
A simple use for this data would be to visualize a user’s network and ascertain which users are reciprocal in their friendships
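Finding reciprocal ties reduces to intersecting a user's follower and following lists. A minimal sketch with made-up account names:

```python
def reciprocal_ties(followers, following):
    """Accounts that both follow the user and are followed back."""
    return set(followers) & set(following)

# Hypothetical lists as they might come out of the extractor.
followers = {"alice", "bob", "carol"}
following = {"bob", "carol", "dave"}

mutual = reciprocal_ties(followers, following)
only_followers = followers - mutual     # follow the user, not followed back
only_following = following - mutual     # followed by the user, do not follow back
print(sorted(mutual), sorted(only_followers), sorted(only_following))
```

The three sets map directly onto edge types in a network drawing: mutual ties as bidirectional edges, the other two as one-way edges.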
- A case study of Twitter use by members of the 18th National Assembly
* Types of tweets
* Audiences of tweets
* Topics of tweets
Twtkr.com Scraper
Korean Internet Network Miner: A Korean version of ICTA
After the blog data was retrieved, it was processed to build two types of networks.
• First, a chain network was extracted. In the chain network, one commentator is connected to another if the first commentator directly replied to the second commentator by clicking on the "reply-to" button.
• However, after manually examining a number of comments on several blogs, we found that some comments are not "reply-to" comments but still address or reference a previous poster.
To capture these missing connections, we decided to rely on another network discovery method called the name network.
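The chain network definition can be sketched as follows: each comment that was posted with the "reply-to" button adds a directed tie from its author to the author of the comment it replies to. The record layout (`id`, `author`, `reply_to`) and the sample comments are assumptions for illustration, not the tool's actual data format.

```python
from collections import Counter

def chain_network(comments):
    """Build weighted directed ties (replier, replied-to) from reply-to links."""
    author_of = {c["id"]: c["author"] for c in comments}
    edges = Counter()
    for comment in comments:
        parent = comment.get("reply_to")
        if parent is not None and parent in author_of:
            edges[(comment["author"], author_of[parent])] += 1
    return edges

# Hypothetical comment thread; reply_to holds the id of the parent comment.
comments = [
    {"id": 1, "author": "kim", "reply_to": None},
    {"id": 2, "author": "lee", "reply_to": 1},
    {"id": 3, "author": "park", "reply_to": 1},
    {"id": 4, "author": "kim", "reply_to": 2},
    {"id": 5, "author": "lee", "reply_to": 4},
]

edges = chain_network(comments)
for (source, target), weight in sorted(edges.items()):
    print(source, "->", target, weight)
```

A comment that mentions a previous poster by name without using the button leaves no trace in this structure, which is the gap the name network is meant to fill.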
Section 1. Development of the Korean Internet Network Miner
This observation is in line with a previous empirical study of online learning communities by Gruzd (2009a), which found that the chain network misses on average 40% of possible connections.
<Name Network>
Another good example of the challenges associated with name/nickname disambiguation in comments is the word "2mb", which has at least three different meanings.
First, this word can be used as a nickname for one of the blog commentators. Second, it could refer to the capacity of a computer memory (2 megabytes). Finally, it could be the alias of the current Korean president, Lee Myung-Bak.
To address these challenges and develop recommendations for the next generation of the name network discovery algorithm, we conducted a semi-automated analysis of all names/nicknames discovered from a sample dataset using our initial algorithm.
The evaluation procedure involved clicking on each word found by the name network algorithm and exploring the context in which each instance of the word was used (see Figure 3). The purpose of this semi-automated analysis was to discover which name/nickname candidates were identified incorrectly and why.
<Figure 3> A list of messages containing "2MB"
This semi-automated analysis revealed a set of additional syntactic and semantic clues that can be used to improve the accuracy of the name network discovery algorithm.
Section 2. Evaluation of the Name Network Discovery Algorithm
The second set includes clues suggesting that a word is NOT likely to be used as a nickname:
● a word candidate is a phrase, for example when the nickname input (the "FROM" field) is used more like a subject line (possible indicators include white spaces and length);
● a word candidate consists of a single character (e.g., "a" or "ㄱ");
● a word candidate consists of netspeak, including emoticons (e.g., "=_="), slang and abbreviations (e.g., using "2MB" to refer to the current Korean president), and onomatopoeia (e.g., "ㅉㅉ" = tsk tsk, "ㅋㅋ" = heehee, "하하" = haha, "음" = hmm);
● a word candidate appears more than once in the comment;
● a word candidate consists of random characters (e.g., "ㅁㄴㅇㄹ" or "asdf");
● a word candidate is a short, conversational word or phrase (e.g., "나" = me, "아이고" = oh no, "그래서" = so/therefore);
● a word candidate is a common word or idea in the given context/topic (e.g., "대한민국" = Republic of Korea, "쥐체사상" = a newly created word used to refer to political fanatics).
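Several of these clues are simple enough to express as rule checks. The sketch below is only an illustration of how such rules might be combined, not the name network algorithm itself: the function name, the word lists (seeded with the examples from the slide), and the 12-character length threshold are all assumptions.

```python
# Word lists seeded with the slide's examples; a real list would be much longer.
NETSPEAK = {"=_=", "2MB", "ㅉㅉ", "ㅋㅋ", "하하", "음"}
RANDOM_KEYS = {"ㅁㄴㅇㄹ", "ASDF"}
CONVERSATIONAL = {"나", "아이고", "그래서"}
COMMON_IN_TOPIC = {"대한민국", "쥐체사상"}
MAX_NICKNAME_LENGTH = 12  # assumed threshold for "too long to be a nickname"

def unlikely_nickname(candidate, comment_text):
    """Return the reason a candidate is unlikely to be a nickname, or None."""
    word = candidate.strip()
    if len(word) <= 1:
        return "single character"
    if " " in word or len(word) > MAX_NICKNAME_LENGTH:
        return "phrase"
    key = word.upper()  # makes "2mb" match "2MB"; Korean text is unaffected
    if key in NETSPEAK:
        return "netspeak"
    if key in RANDOM_KEYS:
        return "random characters"
    if word in CONVERSATIONAL:
        return "conversational word"
    if word in COMMON_IN_TOPIC:
        return "common word in topic"
    if comment_text.count(word) > 1:
        return "repeated in comment"
    return None

examples = [
    ("ㄱ", "ㄱ"),
    ("2mb", "2mb again"),
    ("hello world", "hello world"),
    ("asdf", "asdf"),
    ("minsu", "minsu said hi, minsu!"),
    ("minsu", "I agree with minsu"),
]
for candidate, text in examples:
    print(repr(candidate), "->", unlikely_nickname(candidate, text))
```

Returning the reason rather than a plain yes/no makes it easier to review, as in the semi-automated evaluation, why a candidate was rejected.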
• Web archiving of Korean MPs: http://www.web-archive.kr/
Experimentation with new types of data visualization across periods and platforms (e.g., dynamic mappings using HNA)
Data Collection for Web 1.0
• Official homepages of South Korean Assembly members
• Manual collection: Observation
• Inter-linkage: Who links to whom matrix
• Explicit links excluding links in board
• 2-Year tracking of same Assembly members: 2000-2001
Sociology of Hyperlink Networks of Web 1.0, Web 2.0, and Twitter
Web 1.0 (network maps for 2000 and 2001)
‣ 59 isolated in 2000
‣ more centralised in 2001
‣ network of 2001 a ‘star’ network
➭ might have been affected by political events (presidential election in 2001)
• Data collection for Web 2.0
• Personal blogs of South Korean Assembly members
• Manual collection: Observation
• Blogroll links: Excluding links in postings
• Inter-linkage: Who links to whom matrix
• 2-Year tracking of same Assembly members: 2005-2006
• Phone interview about usage behaviours
Web 2.0 (network maps for 2005 and 2006)
‣ hubs disappearing
‣ easy use of blogs
‣ clear boundaries between different parties
‣ strong presence of GNP Assembly members
➭ party policy on using blogs
‣ more connections between different parties
‣ the ruling party pays less attention to alternative media
Web Type | Year | Sum of links (Mean) | Density | Centralisation (In) | Centralisation (Out) | Gini Coefficient
Web 1.0 (N=245) | 2000 | 373 (1.52) | 0.006 | 1.84 | 69.33 | 0.984
Web 1.0 (N=245) | 2001 | 515 (2.10) | 0.009 | 1.19 | 99.55 | 0.996
Web 2.0 (N=99) | 2005 | 652 (6.59) | 0.067 | 22.07 | 41.66 | 0.759
Web 2.0 (N=99) | 2006 | 589 (5.95) | 0.061 | 20.67 | 35.10 | 0.763
Twitter (N=22) | 2009 | 111 (5.05) | 0.240 | 24.72 | 39.68 | 0.408
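The mean and density columns follow directly from the number of nodes N and the sum of links L. Assuming the standard directed-network definitions (mean = L / N, density = L / (N(N - 1))), which the slide does not state explicitly, the reported values can be reproduced as follows. The centralisation and Gini figures need the full link matrix and are not recomputed here.

```python
def mean_links(total_links, n_nodes):
    """Average number of links per node."""
    return total_links / n_nodes

def density(total_links, n_nodes):
    """Share of all possible directed ties (no self-links) that are present."""
    return total_links / (n_nodes * (n_nodes - 1))

# (web type, year, N, sum of links) taken from the table.
NETWORKS = [
    ("Web 1.0", 2000, 245, 373),
    ("Web 1.0", 2001, 245, 515),
    ("Web 2.0", 2005, 99, 652),
    ("Web 2.0", 2006, 99, 589),
    ("Twitter", 2009, 22, 111),
]

for name, year, n, links in NETWORKS:
    print(f"{name} {year}: mean = {mean_links(links, n):.2f}, "
          f"density = {density(links, n):.3f}")
```

The Twitter network has the fewest links but by far the highest density, because its 22 members have only 462 possible directed ties between them.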
‣ Network analysis
- Web 1.0 (homepage): loose, few important hubs, and becoming a star network
- Web 2.0 (blog): denser, clear boundaries between opposition groups
- Twitter: denser than blog networks
- driven by technological development ➭ more interactive/participatory
‣ Findings on online activities (Web 2.0 & Twitter) reflect offline situations
- Party policies affected the use of the Web for political purposes
- Progressive/minor groups are more willing to explore alternative media
Incoming International Hyperlink in 2009 (drawn using ManyEyes.com)
Incoming International Hyperlink in 2009 (drawn using Google Earth)
Thank you for listening!
Acknowledgments. The WCU Webometrics Institute acknowledges that this research is supported by the WCU project investigating Internet-based politics using e-research tools, funded by the South Korean Government.