#iCanHazRobot?: Improved robot detection for IR usage statistics
Leabharlann UCD
An Coláiste Ollscoile, Baile Átha Cliath, Belfield, Baile Átha Cliath 4, Eire
UCD Library
University College Dublin, Belfield, Dublin 4, Ireland
Joseph Greene
Research Repository Librarian
University College Dublin
[email protected]
researchrepository.ucd.ie
#iCanHazRobot?
Improved robot detection for IR usage statistics
Open Repositories 2016
Dublin, 14 June
Overview and take-home points
• Usage stats are important
  – (go to the Usage Stats panel on Thursday, 16/Jun/2016: 11:00am - 12:30pm)
• Robot filtration is a problem, especially in repositories
• Robot detection has an exponential effect on usage stats’ accuracy in repositories
• 2-3 ways to improve DSpace and EPrints’ usage stats by 20% or more will be demonstrated
Experimental study
• Simple random sample of 2 years of UCD repository’s download data
  – n=341, N=3.3 million; 96.20% certainty
• Manually checked to determine if robot or human
• Applied DSpace, EPrints robot detection algorithms to the dataset
  – This is an EXPERIMENT, simulating algorithms on a DSpace repository’s usage data and Apache logs
  – The data is real, live data, and the algorithms were very easy to simulate
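To get a feel for how precise a sample of this size can be, the sketch below computes the usual normal-approximation margin of error for a proportion estimated from a simple random sample. The slides do not say how the 96.20% figure was derived, so the 95% confidence level (z = 1.96) and the function name `margin_of_error` are assumptions made purely for illustration.

```python
import math

def margin_of_error(p, n, N, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion.

    p: estimated proportion, n: sample size, N: population size,
    z: critical value (1.96 for a 95% level, an assumption here).
    """
    fpc = math.sqrt((N - n) / (N - 1))  # finite population correction
    return z * math.sqrt(p * (1 - p) / n) * fpc

n, N = 341, 3_300_000  # sample and population sizes quoted on the slide

worst_case = margin_of_error(0.5, n, N)   # widest possible interval
at_85 = margin_of_error(0.85, n, N)       # interval around an 85% estimate
print(f"worst case: +/-{worst_case:.3f}, at p=0.85: +/-{at_85:.3f}")
```

With a population this large the finite population correction is negligible, so the sample size alone drives the margin: roughly 5 percentage points in the worst case and just under 4 around an 85% estimate.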
First finding
85% of unfiltered repository downloads come from robots
• A 2013 IRUS-UK white paper covering 20 IRs found the same figure: 85% of downloads came from robots
Catching more robots improves stats (but how much depends on the number of robots)
[Figure: accuracy of download stats (inverse precision, 0 to 1; "Get better stats") plotted against recall of robots (0 to 1; "Catch more robots"), one curve per traffic mix:]
• Typical website, 15% robot traffic
• OA journal, 40% robot
• Internet Archive, 91% robot
• OA repositories, 85% robot
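The relationship between recall and stats accuracy can be sketched numerically. The slides do not give the formula behind the curves, so the function below is an assumption: it treats accuracy as the human share of whatever downloads remain after filtering, and it assumes no human downloads are wrongly discarded. The name `stats_accuracy` is hypothetical.

```python
def stats_accuracy(robot_share, recall):
    """Share of reported downloads that are human.

    robot_share: fraction of raw traffic made by robots (must be below 1).
    recall: fraction of robot downloads that the filter removes.
    Assumes perfect precision, i.e. no human download is discarded.
    """
    humans = 1.0 - robot_share
    robots_left = robot_share * (1.0 - recall)
    return humans / (humans + robots_left)

# Robot shares taken from the slide; recall values chosen for illustration.
mixes = [
    ("Typical website", 0.15),
    ("OA journal", 0.40),
    ("OA repositories", 0.85),
    ("Internet Archive", 0.91),
]
for label, share in mixes:
    row = [round(stats_accuracy(share, r), 2) for r in (0.0, 0.5, 0.9, 1.0)]
    print(f"{label:18s} {row}")
```

Under these assumptions, a repository with 85% robot traffic that catches 90% of robots still reports stats that are only about 64% human, which is why the last few points of recall matter so much at high robot shares.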
Robot detection techniques used
(Columns: DSpace, EPrints, Minho DSpace Statistics Add-on)
• Rate of requests ✓[3]
• User agent string ✓ ✓ ✓
• robots.txt access ✓
• Volume of requests ✓[2] ✓[3]
• List of known robot IP addresses ✓ ✓
• Reverse DNS name lookup ✓[1]
• Trap file ✓
• User agents per IP address
• Width of traversal in the URL space ✓[3]
[1] Only implemented nominally or experimentally
[2] Via the repeat download or ‘double-click’ filter
[3] Data available as a configurable report for manual decision making
Measurements used in robot detection
• All measurements are a number between 0 and 1
• Recall: proportion of robots detected
  – I can haz robot?
• Precision: true positives in robot detection
  – Proportion of discounted downloads that are actually made by robots (sometimes humans are counted as robots)
• Accuracy of download stats measured as inverse precision:
  – Proportion of stats that are actually made by humans
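The three measurements above can be computed from a manually checked sample as follows. This is a minimal sketch: the counts are hypothetical (100 downloads, 85 of them robots) and the function name `detection_metrics` is an assumption, with "positive" meaning "flagged as a robot".

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute the slide's three measurements from confusion-matrix counts.

    tp: robots flagged as robots      fp: humans flagged as robots
    tn: humans kept in the stats      fn: robots kept in the stats
    """
    return {
        "recall": tp / (tp + fn),             # proportion of robots detected
        "precision": tp / (tp + fp),          # discounted downloads that are robots
        "inverse_precision": tn / (tn + fn),  # reported stats that are human
    }

# Hypothetical counts for illustration only, not the study's data.
metrics = detection_metrics(tp=70, fp=3, tn=12, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this made-up case the filter looks respectable on recall and precision, yet fewer than half of the downloads it reports come from humans, which is the gap the inverse precision measure is meant to expose.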
How they perform, out-of-the-box
[Figure: "Robot detection in OA IR systems". Bar chart, scale 0 to 1. Categories: DSpace; EPrints; Minho; Minho with monthly manual checking; No robot detection. Series: Recall; Precision; Negative precision (accuracy of download stats).]
Room for improvement?
1. Ability to manually check for outliers
• At UCD, once a month, we check:
  – Daily downloads for the last 2-4 months
  – Top 10 most downloaded items
  – Top 20 downloading IP addresses for the last 2-4 months
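[Figure: "Robots caught (Recall)". Bar chart, scale 0 to 1. Categories: DSpace; EPrints; Minho. Series: Out-of-the-box; With manual checking (outlier exclusion).]
[Figure: "Accuracy of reported download stats (Inverse precision)". Bar chart, scale 0 to 1. Categories: DSpace; EPrints; Minho; Without robot detection. Series: Out-of-the-box; With manual checking (outlier exclusion).]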
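The monthly check described on this slide amounts to three simple tallies over the download records. The sketch below assumes each record is an `(ip, item_id, date)` tuple; that layout, the function names, and the sample records are all hypothetical, and deciding which entries are outliers remains a manual judgement.

```python
from collections import Counter

def top_downloaders(records, k=20):
    """Return the k IP addresses with the most downloads."""
    return Counter(ip for ip, _item, _date in records).most_common(k)

def top_items(records, k=10):
    """Return the k most downloaded items."""
    return Counter(item for _ip, item, _date in records).most_common(k)

def daily_totals(records):
    """Return (date, downloads) pairs in date order, for spotting spikes."""
    return sorted(Counter(date for _ip, _item, date in records).items())

# Hypothetical records using documentation-reserved IP addresses.
records = [
    ("203.0.113.7", "item-1", "2016-05-01"),
    ("203.0.113.7", "item-1", "2016-05-01"),
    ("203.0.113.7", "item-2", "2016-05-02"),
    ("198.51.100.4", "item-1", "2016-05-02"),
]
print(top_downloaders(records, k=1))
print(top_items(records, k=1))
print(daily_totals(records))
```

An IP or item that dominates these lists is a candidate for a closer look in the logs before its downloads are excluded.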
2. Recalibrate the EPrints repeat-download (double-click) filter
[Figure: "Effect of double-click filter on EPrints’ robot detection and stats". Bar chart, scale 0 to 1. Series: Without double-click filter; With double-click filter (out-of-the-box); With recalibrated double-click filter*.]
$\frac{T_p + T_n}{n}$
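A repeat-download filter of this kind can be sketched in a few lines. This is not EPrints' code: the event layout `(timestamp_seconds, ip, item_id)`, the function name, and the window values are assumptions for illustration, and "recalibrating" here simply means choosing a different `window_seconds`.

```python
def filter_repeat_downloads(events, window_seconds):
    """Keep a download only if the same IP has not requested the same item
    within the preceding window.

    events: iterable of (timestamp_seconds, ip, item_id) tuples.
    The window restarts on every repeat, including discarded ones, so a
    client that re-requests an item steadily is counted only once.
    """
    last_seen = {}
    kept = []
    for ts, ip, item in sorted(events):
        key = (ip, item)
        previous = last_seen.get(key)
        if previous is None or ts - previous > window_seconds:
            kept.append((ts, ip, item))
        last_seen[key] = ts
    return kept

# Hypothetical events: client "a" hits item "x" four times, client "b" once.
events = [(0, "a", "x"), (10, "a", "x"), (20, "a", "x"), (5000, "a", "x"), (15, "b", "x")]
print(len(filter_repeat_downloads(events, window_seconds=30)))
print(len(filter_repeat_downloads(events, window_seconds=86400)))
```

A short window removes only genuine double-clicks, while a long one also absorbs slow, repetitive crawling from a single address. That trade-off is what a recalibration has to balance.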
3. Port Minho’s robot detection code (a log parser) onto DSpace or EPrints
• 1 Java class
• Input is Apache Combined Log Format
• Output is a database update (robot = true field)
  – Similar to EPrints' $is_robot variable in Robots.pm
  – Could be modified to update the DSpace 'isBot' field in the SOLR usage events document
• Requires 2 database tables to store learned agents and IPs
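[Figure: "Robots caught (Recall)". Bar chart, scale 0 to 1. Categories: DSpace; EPrints; Minho. Series: Out-of-the-box; With Minho log parser.]
[Figure: "Accuracy of reported download stats (Inverse precision)". Bar chart, scale 0 to 1. Categories: DSpace; EPrints; Minho; Without robot detection. Series: Out-of-the-box; With Minho log parser.]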
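To make the shape of such a log parser concrete, here is a minimal sketch in Python. It is not Minho's Java class and does not reproduce its rules. The two sets stand in for the two database tables of learned agents and IPs, and the learning rules (a user-agent hint list and robots.txt access) are assumptions chosen for illustration.

```python
import re

# Apache Combined Log Format: host ident user [time] "request" status size "referer" "agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

AGENT_HINTS = ("bot", "crawl", "spider", "slurp")  # hypothetical starter list

def flag_robots(lines):
    """Return (rows, learned_agents, learned_ips); rows are (ip, request, is_robot)."""
    parsed = [m.groupdict() for m in map(COMBINED.match, lines) if m]
    learned_agents, learned_ips = set(), set()
    # Pass 1: learn suspicious agents and IPs from obvious signals.
    for hit in parsed:
        if any(hint in hit["agent"].lower() for hint in AGENT_HINTS):
            learned_agents.add(hit["agent"])
        if hit["request"].split(" ")[1:2] == ["/robots.txt"]:
            learned_ips.add(hit["ip"])
    # Pass 2: flag every hit made by a learned agent or from a learned IP.
    rows = [
        (h["ip"], h["request"], h["agent"] in learned_agents or h["ip"] in learned_ips)
        for h in parsed
    ]
    return rows, learned_agents, learned_ips

# Hypothetical log lines using documentation-reserved IP addresses.
log = [
    '203.0.113.7 - - [14/Jun/2016:10:00:00 +0100] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [14/Jun/2016:10:00:05 +0100] "GET /bitstream/1/paper.pdf HTTP/1.1" 200 5000 "-" "Mozilla/5.0"',
    '198.51.100.4 - - [14/Jun/2016:10:01:00 +0100] "GET /bitstream/1/paper.pdf HTTP/1.1" 200 5000 "-" "ExampleBot/1.0"',
    '192.0.2.9 - - [14/Jun/2016:10:02:00 +0100] "GET /bitstream/1/paper.pdf HTTP/1.1" 200 5000 "http://example.org/" "Mozilla/5.0"',
]
rows, agents, ips = flag_robots(log)
for ip, request, is_robot in rows:
    print(ip, request, is_robot)
```

In a port to DSpace or EPrints, the final step would write the `is_robot` flag back to the platform's own usage store rather than print it. That write is left out here because it depends on the target system.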
4. Combine two or more techniques
[Figure: "Robots caught (Recall)". Bar chart, scale 0 to 1. Categories: DSpace; EPrints; Minho. Series: Out-of-the-box; With manual checking (outlier exclusion); With recalibrated double-click filter*; With Minho log parser; With Minho and outliers; Minho, outliers, and recalibrated double-click*.]
4. Combine two or more techniques
[Figure: "Accuracy of reported download stats (Inverse precision)". Bar chart, scale 0 to 1. Categories: DSpace; EPrints; Minho; Without robot detection. Series: Out-of-the-box; With manual checking (outlier exclusion); With recalibrated double-click filter*; With Minho log parser; With Minho and outliers; Minho, outliers, and recalibrated double-click*.]
Thank you!