Leabharlann UCD
An Coláiste Ollscoile, Baile Átha Cliath, Belfield, Baile Átha Cliath 4, Éire
UCD Library
University College Dublin, Belfield, Dublin 4, Ireland
Joseph Greene
Research Repository Librarian
University College Dublin
[email protected]
researchrepository.ucd.ie
#iCanHazRobot?
Improved robot detection for IR usage statistics
Open Repositories 2016
Dublin, 14 June
Overview and take-home points
• Usage stats are important
– (go to the Usage Stats panel on Thursday, 16/Jun/2016: 11:00am - 12:30pm)
• Robot filtration is a problem, especially in repositories
• Robot detection has an exponential effect on usage stats’ accuracy in repositories
• 2-3 ways to improve DSpace and EPrints’ usage stats by 20% or more will be demonstrated
Experimental study
• Simple random sample of 2 years of UCD repository’s download data
– n=341, N=3.3 million; 96.20% certainty
• Manually checked to determine if robot or human
• Applied DSpace, EPrints robot detection algorithms to the dataset
– This is an EXPERIMENT, simulating algorithms on a DSpace repository’s usage data and Apache logs
– The data is real, live data, and the algorithms were very easy to simulate
First finding
85% of unfiltered repository downloads come from robots
• This is confirmed by a 2013 IRUS-UK white paper on 20 IRs, which also found 85% of downloads to be robots
Catching more robots improves stats
(But how much depends on the number of robots)
[Figure: accuracy of download stats (inverse precision), 0 to 1, plotted against recall (robots), 0 to 1. Axis annotations: "Catch more robots" (x), "Get better stats" (y). One curve per traffic mix:]
• Typical website, 15% robot traffic
• OA journal, 40% robot
• OA repositories, 85% robot
• Internet Archive, 91% robot
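The shape of these curves can be sketched with a little arithmetic. The block below is a minimal illustration, not the calculation used for the slide: it assumes that no human download is ever mistaken for a robot, so the only error left in the stats is the robots that were missed. The function name `stats_accuracy` is hypothetical; the robot shares are the four from the figure.

```python
def stats_accuracy(robot_share: float, recall: float) -> float:
    """Share of the reported downloads that were made by humans.

    Assumption (not from the slides): humans are never flagged as robots,
    so reported downloads = all humans + the robots that were missed.
    """
    if not 0.0 <= robot_share <= 1.0 or not 0.0 <= recall <= 1.0:
        raise ValueError("robot_share and recall must be between 0 and 1")
    humans = 1.0 - robot_share
    missed_robots = robot_share * (1.0 - recall)
    reported = humans + missed_robots
    if reported == 0:
        # All traffic was robots and all of it was caught: nothing is reported.
        return 1.0
    return humans / reported


# Robot shares taken from the figure's four traffic mixes.
TRAFFIC_MIXES = {
    "Typical website": 0.15,
    "OA journal": 0.40,
    "OA repositories": 0.85,
    "Internet Archive": 0.91,
}

if __name__ == "__main__":
    for name, share in TRAFFIC_MIXES.items():
        row = ", ".join(
            f"recall {r:.1f} -> {stats_accuracy(share, r):.2f}"
            for r in (0.0, 0.5, 0.9, 1.0)
        )
        print(f"{name}: {row}")
```

Under this assumption, a detector that catches 90% of robots leaves a typical website's stats about 98% accurate, but an 85%-robot repository's stats only about 64% accurate, which is why the last few robots matter so much in repositories.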
Robot detection techniques used
Systems compared: DSpace | EPrints | Minho DSpace Statistics Add-on
Rate of requests: ✓³
User agent string: ✓ ✓ ✓
robots.txt access: ✓
Volume of requests: ✓² ✓³
List of known robot IP addresses: ✓ ✓
Reverse DNS name lookup: ✓¹
Trap file: ✓
User agents per IP address: (none)
Width of traversal in the URL space: ✓³
¹ Only implemented nominally or experimentally
² Via the repeat download or ‘double-click’ filter
³ Data available as a configurable report for manual decision making
Measurements used in robot detection
• All measurements are numbers between 0 and 1
• Recall: proportion of robots detected
– I can haz robot?
• Precision: true positives in robot detection
– Proportion of discounted downloads that are actually made by robots (sometimes humans are counted as robots)
• Accuracy of download stats measured as inverse precision:
– Proportion of stats that are actually made by humans
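The three measurements can be written out from a confusion matrix in which "robot" is the positive class. This is a minimal sketch; the class and function names are hypothetical, and the counts in the usage example are made up for illustration, not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int  # robot downloads flagged as robot (discounted correctly)
    fp: int  # human downloads flagged as robot (discounted wrongly)
    fn: int  # robot downloads not flagged (left in the stats)
    tn: int  # human downloads not flagged (left in the stats)


def _ratio(numerator: int, denominator: int) -> float:
    # Undefined ratios (empty denominator) are returned as NaN.
    return numerator / denominator if denominator else float("nan")


def recall(c: Counts) -> float:
    """Proportion of robots detected."""
    return _ratio(c.tp, c.tp + c.fn)


def precision(c: Counts) -> float:
    """Proportion of discounted downloads actually made by robots."""
    return _ratio(c.tp, c.tp + c.fp)


def inverse_precision(c: Counts) -> float:
    """Proportion of the remaining (reported) downloads made by humans."""
    return _ratio(c.tn, c.tn + c.fn)


if __name__ == "__main__":
    example = Counts(tp=80, fp=5, fn=20, tn=45)  # made-up counts
    print(f"recall            = {recall(example):.3f}")
    print(f"precision         = {precision(example):.3f}")
    print(f"inverse precision = {inverse_precision(example):.3f}")
```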
How they perform, out-of-the-box
[Figure: Robot detection in OA IR systems. Bar chart, scale 0 to 1, with bars for Recall, Precision, and Negative precision (accuracy of download stats). Groups: DSpace, EPrints, Minho, Minho with monthly manual checking, No robot detection.]
Room for improvement?
1. Ability to manually check for outliers
• At UCD, once a month, we check:
– Daily downloads for the last 2-4 months
– Top 10 most downloaded items
– Top 20 downloading IP addresses for the last 2-4 months
[Figure: Robots caught (Recall), scale 0 to 1, for DSpace, EPrints, Minho. Series: Out-of-the-box; With manual checking (outlier exclusion).]
[Figure: Accuracy of reported download stats (Inverse precision), scale 0 to 1, for DSpace, EPrints, Minho, Without robot detection. Series: Out-of-the-box; With manual checking (outlier exclusion).]
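The monthly check described above amounts to three tallies over recent download events. The sketch below assumes each download is available as a `(day, item_id, ip)` tuple; that record shape and the function name `monthly_report` are hypothetical, and deciding which rows are outliers is still left to a person.

```python
from collections import Counter
from datetime import date


def monthly_report(downloads, top_items=10, top_ips=20):
    """Tally downloads per day, per item, and per IP address.

    `downloads` is an iterable of (day, item_id, ip) tuples (assumed shape).
    """
    downloads = list(downloads)
    per_day = Counter(day for day, _, _ in downloads)
    per_item = Counter(item for _, item, _ in downloads)
    per_ip = Counter(ip for _, _, ip in downloads)
    return {
        "daily": sorted(per_day.items()),               # chronological
        "top_items": per_item.most_common(top_items),   # most downloaded first
        "top_ips": per_ip.most_common(top_ips),         # busiest IPs first
    }


if __name__ == "__main__":
    sample = [
        (date(2016, 5, 1), "item-a", "192.0.2.10"),
        (date(2016, 5, 1), "item-a", "192.0.2.10"),
        (date(2016, 5, 2), "item-b", "198.51.100.4"),
    ]
    for section, rows in monthly_report(sample).items():
        print(section, rows)
```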
2. Recalibrate the EPrints repeat-download (double-click) filter
[Figure: Effect of double-click filter on EPrints’ robot detection and stats. Bar chart, scale 0 to 1. Series: Without double-click filter; With double-click filter (out-of-the-box); With recalibrated double-click filter*.]
$\frac{T_p + T_n}{n}$
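A repeat-download filter of this kind can be sketched as follows. This is not the EPrints code: the event shape `(timestamp, ip, item_id)`, the function name, and the sliding-window behaviour (every request, counted or not, restarts the clock) are assumptions made for illustration. Recalibrating then means choosing a different `window`; no particular value is implied here.

```python
from datetime import datetime, timedelta


def filter_repeat_downloads(events, window):
    """Keep a download only if the same IP has not requested the same item
    within `window` before it.

    `events` is an iterable of (timestamp, ip, item_id) tuples (assumed shape).
    """
    last_seen = {}  # (ip, item_id) -> timestamp of the most recent request
    kept = []
    for ts, ip, item in sorted(events):
        key = (ip, item)
        previous = last_seen.get(key)
        if previous is None or ts - previous > window:
            kept.append((ts, ip, item))
        # Sliding window: every request refreshes the timestamp.
        last_seen[key] = ts
    return kept


if __name__ == "__main__":
    t0 = datetime(2016, 6, 14, 11, 0, 0)
    events = [
        (t0, "192.0.2.10", "item-a"),
        (t0 + timedelta(seconds=10), "192.0.2.10", "item-a"),
        (t0 + timedelta(minutes=10), "192.0.2.10", "item-a"),
        (t0 + timedelta(hours=2), "192.0.2.10", "item-a"),
        (t0 + timedelta(seconds=5), "198.51.100.4", "item-a"),
    ]
    for window in (timedelta(seconds=30), timedelta(hours=1)):
        print(window, len(filter_repeat_downloads(events, window)))
```

A wider window discounts more of a high-volume downloader's requests, which is how a double-click filter doubles as a volume-based robot filter.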
3. Port Minho’s robot detection code (a log parser) onto DSpace or EPrints
• 1 Java class
• Input is Apache Combined Log Format
• Output is a database update (robot = true field)
– Similar to EPrints' $is_robot variable in Robots.pm
– Could be modified to update the DSpace 'isBot' field in the Solr usage events document
• Requires 2 database tables to store learned agents and IPs
[Figure: Robots caught (Recall), scale 0 to 1, for DSpace, EPrints, Minho. Series: Out-of-the-box; With Minho log parser.]
[Figure: Accuracy of reported download stats (Inverse precision), scale 0 to 1, for DSpace, EPrints, Minho, Without robot detection. Series: Out-of-the-box; With Minho log parser.]
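To make the log-parser idea concrete, here is a rough sketch of the general approach: read Apache Combined Log Format lines, learn robot IPs and agents, then set a robot flag on each record. It is not Minho's code or its actual rules. The learning rules (an IP that requests `/robots.txt`, an agent string containing "bot", "crawler" or "spider"), the regular expression, and all names are assumptions, and two in-memory sets stand in for the two database tables.

```python
import re

# Apache Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

AGENT_HINTS = ("bot", "crawler", "spider")  # illustrative substrings only


def flag_robots(lines):
    """Parse log lines and return one dict per request with a 'robot' flag."""
    robot_ips = set()     # stands in for the learned-IPs table
    robot_agents = set()  # stands in for the learned-agents table
    records = []
    # First pass: parse each line and learn robot IPs and agents.
    for line in lines:
        match = COMBINED.match(line)
        if match is None:
            continue  # skip lines that are not in Combined Log Format
        record = match.groupdict()
        records.append(record)
        parts = record["request"].split()
        path = parts[1] if len(parts) > 1 else ""
        if path == "/robots.txt":
            robot_ips.add(record["ip"])
        if any(hint in record["agent"].lower() for hint in AGENT_HINTS):
            robot_agents.add(record["agent"])
    # Second pass: flag every request from a learned IP or agent.
    for record in records:
        record["robot"] = (
            record["ip"] in robot_ips or record["agent"] in robot_agents
        )
    return records
```

In a real port, the second pass would become the database update mentioned above, for example writing to the 'isBot' field rather than to a dict.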
4. Combine two or more techniques
[Figure: Robots caught (Recall), scale 0 to 1, for DSpace, EPrints, Minho. Series:]
• Out-of-the-box
• With manual checking (outlier exclusion)
• With recalibrated double-click filter*
• With Minho log parser
• With Minho and outliers
• Minho, outliers, and recalibrated double-click*
4. Combine two or more techniques
[Figure: Accuracy of reported download stats (Inverse precision), scale 0 to 1, for DSpace, EPrints, Minho, Without robot detection. Series:]
• Out-of-the-box
• With manual checking (outlier exclusion)
• With recalibrated double-click filter*
• With Minho log parser
• With Minho and outliers
• Minho, outliers, and recalibrated double-click*
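One simple way to think about combining techniques is as a union: a download is discounted if any technique flags it. The sketch below assumes that reading, which the slides do not spell out; the function name and the download IDs in the example are made up.

```python
def combined_recall(true_robots, *flagged_sets):
    """Recall of the union of several detectors' flagged downloads.

    `true_robots` is the set of download IDs manually judged to be robots;
    each entry of `flagged_sets` is the set of IDs one technique flagged.
    """
    if not true_robots:
        raise ValueError("true_robots must not be empty")
    flagged = set().union(*flagged_sets)
    return len(flagged & true_robots) / len(true_robots)


if __name__ == "__main__":
    robots = {1, 2, 3, 4}        # made-up manual labels
    by_user_agent = {1, 2}       # made-up detector output
    by_outlier_check = {2, 3, 9}  # 9 is a human flagged by mistake
    print(combined_recall(robots, by_user_agent))
    print(combined_recall(robots, by_user_agent, by_outlier_check))
```

A union can only raise recall, but it also accumulates each technique's false positives (download 9 in the example), so precision needs checking alongside it.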
Thank you!