Boilerpipe Integration & Improvement
Allan Huang @ esobi Inc.
Known Issues
Blank article content
Garbled article content
Garbled special characters
Missing main article body
Content unrelated to the article
Integration
Required parameters: either the URL, or the full HTML text together with the href for the <base> tag
Optional parameters:
Extractor: the Boilerpipe algorithm to use
Output Mode: HTML Extraction, HTML Highlighting, Plain Text, or JSON
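When the caller passes the full HTML plus a base href, the <base> tag has to be placed in the document before extraction so that relative image paths resolve. A minimal sketch of that step, assuming a plain string-level insert; the class name BaseTagInserter and its methods are hypothetical and are not part of Boilerpipe or of the service described here:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not a Boilerpipe class. */
final class BaseTagInserter {

    // Opening <head> tag, with or without attributes (does not match <header>).
    private static final Pattern HEAD_OPEN =
            Pattern.compile("<head(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    // An existing <base ...> tag, so that a second one is not added.
    private static final Pattern BASE_TAG =
            Pattern.compile("<base\\s[^>]*>", Pattern.CASE_INSENSITIVE);

    private BaseTagInserter() {}

    static String insertBase(String html, String baseHref) {
        if (html == null || baseHref == null || baseHref.isEmpty()) {
            throw new IllegalArgumentException("html and baseHref are required");
        }
        if (BASE_TAG.matcher(html).find()) {
            return html; // the page already declares its own base URL
        }
        String baseTag = "<base href=\"" + escapeAttribute(baseHref) + "\">";
        Matcher head = HEAD_OPEN.matcher(html);
        if (head.find()) {
            // Insert right after <head> so the base applies to every later relative URL.
            return html.substring(0, head.end()) + baseTag + html.substring(head.end());
        }
        // No <head> found: prepend the tag and rely on a lenient HTML parser.
        return baseTag + html;
    }

    private static String escapeAttribute(String value) {
        return value.replace("&", "&amp;").replace("\"", "&quot;");
    }
}
```

A regular expression is enough here because only the first <head> tag matters; a real integration might prefer to do the insert on the parsed DOM instead.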
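Improvement
Strengthened detection and handling of HTTP and HTML encodings
Added support for HTTP response decompression algorithms
Inserted a <base> tag so that images with relative paths display correctly
Upgraded to the latest version of Boilerpipe and the related nekohtml library
Test Results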
150 news articles in total: 66 in Traditional Chinese, 80 in English, and 2 in Simplified Chinese
Current success rate: 94%
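The HTTP response decompression item under Improvement can be sketched with the JDK alone. The slides do not say which algorithms were supported, so gzip and deflate are assumed here as the common Content-Encoding values; the class name ResponseDecoder is hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper: decodes a response body according to its Content-Encoding header. */
final class ResponseDecoder {

    private ResponseDecoder() {}

    static byte[] decodeBody(byte[] body, String contentEncoding) throws IOException {
        String encoding = contentEncoding == null
                ? ""
                : contentEncoding.trim().toLowerCase(Locale.ROOT);
        switch (encoding) {
            case "gzip":
            case "x-gzip":
                return readAll(new GZIPInputStream(new ByteArrayInputStream(body)));
            case "deflate":
                // Assumes the zlib-wrapped form; servers that send raw deflate
                // would need an Inflater created with nowrap = true.
                return readAll(new InflaterInputStream(new ByteArrayInputStream(body)));
            case "":
            case "identity":
                return body; // not compressed
            default:
                throw new IOException("Unsupported Content-Encoding: " + contentEncoding);
        }
    }

    // Reads the stream to the end and closes it.
    private static byte[] readAll(InputStream in) throws IOException {
        try (InputStream source = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = source.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

Decompressing before any charset detection matters: sampling compressed bytes for a charset declaration would find nothing.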
Failure Cases
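Only the HTML title is extracted, not the article body: 2 articles, from China Times (中時電子報) and Facebook timeline photos
Main article body is missing: 2 articles, from the UrCosme beauty forum and Youth Daily News (青年日報)
JavaScript code or HTML escape characters are extracted: 2 articles, from Sing Pao Daily News of Hong Kong (香港成報) and The Wall Street Journal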
Solved Cases
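Article text was frequently garbled: 2 articles, from the China Times (中時電子報) focus news
Cause: the full HTML document could not be downloaded
Solution: avoid a Java pushback stream; instead, download the full HTML in one pass and then sample the HTML string, which makes the encoding of the full HTML easier to determine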
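The "download everything, then sample" approach can be sketched as follows. This is an illustration under stated assumptions, not the code used in the project: the class name HtmlCharsetSniffer, the 4096-byte sample size, and the choice to look only at a <meta> charset declaration are all hypothetical, and the sketch only covers ASCII-compatible encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: picks a charset from a sample of the fully downloaded HTML bytes. */
final class HtmlCharsetSniffer {

    private static final int SAMPLE_SIZE = 4096; // assumed sample length

    // Finds charset=... inside a <meta> tag, in either the HTML4 or the HTML5 form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?\\s*([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    private HtmlCharsetSniffer() {}

    /** fallback is used when no usable declaration is found, e.g. the HTTP header charset. */
    static Charset detect(byte[] fullHtml, Charset fallback) {
        int length = Math.min(fullHtml.length, SAMPLE_SIZE);
        // ISO-8859-1 maps each byte to one char, so ASCII markup stays readable
        // no matter which ASCII-compatible encoding the page really uses.
        String sample = new String(fullHtml, 0, length, StandardCharsets.ISO_8859_1);
        Matcher meta = META_CHARSET.matcher(sample);
        if (meta.find()) {
            String name = meta.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalCharsetNameException e) {
                // Malformed declaration: fall through to the fallback.
            }
        }
        return fallback;
    }

    /** Decodes the whole document only after the charset has been chosen. */
    static String decode(byte[] fullHtml, Charset fallback) {
        return new String(fullHtml, detect(fullHtml, fallback));
    }
}
```

Because the bytes are already complete in memory, the sample can be re-read as often as needed, which is the property a partially consumed stream does not give.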
Solved Cases
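Garbled CJK special characters, for example in these headlines:
宏碁 R7 筆電 「星際爭霸戰」款限量出擊
朱镕基退休前后“判若两人” 非常注重晚节
Cause: the character set Java applies for the declared charset name lacks these special characters
Solution: decode with a superset charset
Traditional Chinese: Big5-HKSCS instead of Big5
Simplified Chinese: GB18030 instead of GB2312
Japanese: Windows-31J instead of Shift_JIS
Korean: no case found yet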
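The substitution table above translates directly into a lookup. A minimal sketch, assuming the declared charset name has already been detected; the class name CjkCharsetUpgrade is hypothetical, while the charset names are the ones listed in the solution:

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: swaps a declared CJK charset for its superset before decoding. */
final class CjkCharsetUpgrade {

    // Declared name (lower case) mapped to the superset named in the slides.
    private static final Map<String, String> SUPERSETS = new HashMap<>();
    static {
        SUPERSETS.put("big5", "Big5-HKSCS");      // Traditional Chinese
        SUPERSETS.put("gb2312", "GB18030");       // Simplified Chinese
        SUPERSETS.put("shift_jis", "windows-31j"); // Japanese
    }

    private CjkCharsetUpgrade() {}

    /** Returns the superset name, or the declared name unchanged if there is no mapping. */
    static String upgradeName(String declared) {
        String trimmed = declared.trim();
        return SUPERSETS.getOrDefault(trimmed.toLowerCase(Locale.ROOT), trimmed);
    }

    /** Resolves to a Charset, keeping the declared one if the superset is not installed. */
    static Charset resolve(String declared) {
        String upgraded = upgradeName(declared);
        // Extended charsets can be missing from a trimmed-down Java runtime.
        String chosen = Charset.isSupported(upgraded) ? upgraded : declared.trim();
        return Charset.forName(chosen);
    }
}
```

For example, `new String(bytes, CjkCharsetUpgrade.resolve("Big5"))` would decode with Big5-HKSCS where the runtime provides it.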
Algorithm Comparison
Criteria: structure retainment, inner content cleaning, implementation, source parameter, language dependency, additional features and remarks

Boilerpipe
Structure retainment: plain text only
Inner content cleaning: uses a classifier to determine whether or not the atomic text block holds useful content
Implementation: open source Java library
Source parameter: you can fetch documents by yourself or use built-in utilities to fetch them for you
Language dependency: should be language independent since the text block classifier observes language independent text
Additional features and remarks: implements many extractors with different classification rules trained on different datasets

Alchemy API
Structure retainment: text only (has an option to include relevant hyperlinks)
Inner content cleaning: n/a
Implementation: commercial web API
Source parameter: include the whole document in the POST request or provide a URL
Language dependency: observation: returns an error for non-English content, e.g. the document contains “unsupported text…”
Additional features and remarks: extra API call to extract the title

Diffbot
Structure retainment: plain text or HTML
Inner content cleaning: an option to remove inline ads
Implementation: web API (private beta)
Source parameter: does fetching for you via provided URL
Language dependency: n/a
Additional features and remarks: extracts relevant media, title, tags, XPath descriptor for wrappers, comments and comment count, article summary

Readability
Structure retainment: retains original structure
Inner content cleaning: uses hardcoded heuristics to extract content divided by ads
Implementation: open source JavaScript bookmarklet
Source parameter: via browser
Language dependency: language independent, but it relies on language dependent regular expressions to match id and class labels

Goose
Structure retainment: plain text
Inner content cleaning: n/a
Implementation: open source Java library
Source parameter: URL only (my fork enables you to fetch the document by yourself)
Language dependency: language independent, but it relies on language dependent regular expressions to match id and class labels
Additional features and remarks: uses hardcoded heuristics to search for related images and embedded media

Extractiv
Structure retainment: depends on the chosen output format, e.g. XML format breaks the content
Inner content cleaning: n/a
Implementation: commercial web API
Source parameter: include the whole document in the POST request or provide a URL
Language dependency: n/a
Additional features and remarks: capable of enriching the extracted text with semantic entities and relationships

Repustate API
Structure retainment: plain text
Inner content cleaning: n/a
Implementation: commercial web API
Source parameter: URL only
Language dependency: n/a

Webstemmer
Structure retainment: plain text
Inner content cleaning: n/a
Implementation: open source Python library
Source parameter: first runs a crawler to obtain seed pages, then it learns layout patterns that are later put to work to extract
Language dependency: language independent
Additional features and remarks: the only piece of software on this list that requires a cluster of similar documents obtained by crawling

NCleaner (paper)
Structure retainment: plain text
Inner content cleaning: uses character level n-grams to detect content text blocks
Implementation: open source Perl library
Source parameter: arbitrary HTML document
Language dependency: depends on the training language
Additional features and remarks: reliant on the lynx browser for converting HTML to structured plain text
Reference
Evaluating Text Extraction Algorithms
List of resources: Article text extraction from HTML documents
Feature-wise Comparison of HTML Article Text Extractors
Overview: Extracting article text from HTML documents
Readability for Java - Snacktory
Conclusion
Next steps:
The article text Boilerpipe extracts does not include image information
A cache mechanism for the full HTML or the article text that corresponds to a URL
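The cache is only named as a next step, so its design is open. One possible shape, purely as an assumption, is a small in-memory LRU map keyed by URL; the class name ExtractionCache and the eviction policy are hypothetical choices, not something the slides specify:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory LRU cache from URL to full HTML or extracted article text. */
final class ExtractionCache {

    private final Map<String, String> entries;

    ExtractionCache(final int maxEntries) {
        // accessOrder = true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries; // evict once the limit is exceeded
            }
        };
    }

    /** Returns the cached content, or null on a miss. */
    synchronized String get(String url) {
        return entries.get(url);
    }

    synchronized void put(String url, String content) {
        entries.put(url, content);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

News pages change after publication, so a real cache would probably also need an expiry time per entry; that is left out of this sketch.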
Q&A