Boilerpipe Integration & Improvement

Allan Huang @ esobi Inc.

Known Issues

Empty article content

Garbled article content

Garbled special characters

Missing main article body

Content unrelated to the article

Integration

Required parameters: a URL, or the full HTML text

Optional parameters:

href for the <base> tag

Extractor: the Boilerpipe algorithm to use

Output Mode: HTML Extraction, HTML Highlighting, Plain Text, or JSON
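One way the optional <base> href could be applied is to insert a <base> tag right after the opening <head> tag when the page does not already declare one, so that relative image paths resolve against the original site. The sketch below is an assumption about how this might be done, not the author's implementation: the class name BaseTagInserter and its method are hypothetical, and it uses plain regular expressions rather than an HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: adds a <base> tag so relative URLs resolve against baseHref. */
final class BaseTagInserter {
    // Matches an existing <base ...> tag (but not <basefont>), case-insensitively.
    private static final Pattern BASE_TAG =
            Pattern.compile("<base\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Matches the opening <head> tag (but not <header>), with or without attributes.
    private static final Pattern HEAD_OPEN =
            Pattern.compile("<head\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    private BaseTagInserter() {}

    static String insertBase(String html, String baseHref) {
        if (BASE_TAG.matcher(html).find()) {
            return html; // the page already declares its own base URL
        }
        // Escape characters that would break the attribute value.
        String safeHref = baseHref.replace("&", "&amp;").replace("\"", "&quot;");
        String baseTag = "<base href=\"" + safeHref + "\">";
        Matcher head = HEAD_OPEN.matcher(html);
        if (head.find()) {
            // Insert immediately after the opening <head> tag.
            return html.substring(0, head.end()) + baseTag + html.substring(head.end());
        }
        // No <head> found: prepend the tag and leave recovery to the HTML parser.
        return baseTag + html;
    }
}
```

A caller would pass the fetched HTML and the page URL (or a user-supplied href) before handing the result to the extractor or returning highlighted HTML.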

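Improvement

Strengthened detection and handling of HTTP and HTML encodings

Added support for HTTP response decompression algorithms

Inserted a <base> tag so that images with relative paths display correctly

Upgraded to the latest versions of Boilerpipe and the related nekohtml library

Test Results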

150 news articles in total: 66 in Traditional Chinese, 80 in English, and 2 in Simplified Chinese

The current success rate is 94%

Failure Cases

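Only the HTML title was extracted, not the article body: 2 articles (China Times online (中時電子報); a Facebook timeline photo)

Main article body missing: 2 articles (UrCosme beauty forum; Youth Daily News (青年日報))

JavaScript code or HTML escape characters extracted: 2 articles (Sing Pao Daily News, Hong Kong (香港成報); The Wall Street Journal)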

Solved Cases

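Article body frequently extracted as garbled text: 2 articles, both from the headline news section of China Times online (中時電子報)

Cause: the full HTML text could not be downloaded

Solution: avoid Java's PushbackInputStream; instead, download the full HTML text in one pass and then sample the HTML string, which makes the encoding of the full HTML easier to determine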
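The approach described above can be sketched as follows: read the whole response into a byte array first, decode only a leading sample as ISO-8859-1 (which maps every byte to a character) to look for a charset declaration, and then decode the full byte array once. This is a minimal sketch under assumptions: the class name HtmlCharsetSniffer, the 4096-byte sample size, and the UTF-8 fallback are hypothetical choices, not details from the slides.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: download once, sample the head of the HTML, then decode. */
final class HtmlCharsetSniffer {
    private static final int SAMPLE_SIZE = 4096; // assumed sample length
    // Finds charset=... in <meta charset="..."> or in a Content-Type meta tag.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    private HtmlCharsetSniffer() {}

    /** Reads the stream to the end so the complete HTML is held in memory. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Looks for a charset declaration in the first SAMPLE_SIZE bytes. */
    static Charset detect(byte[] html) {
        int len = Math.min(html.length, SAMPLE_SIZE);
        String sample = new String(html, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(sample);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: fall through to the default.
            }
        }
        return StandardCharsets.UTF_8; // assumed fallback
    }

    /** Downloads everything first, then decodes with the detected charset. */
    static String decode(InputStream in) throws IOException {
        byte[] bytes = readFully(in);
        return new String(bytes, detect(bytes));
    }
}
```

Because the bytes are fully buffered before any decoding, the sample can be re-read without pushing bytes back into the stream, which is the point of dropping the pushback approach. A charset from the HTTP Content-Type header could be checked before the sampled HTML; that step is omitted here.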

Solved Cases

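Garbled CJK special characters

Example headlines: 宏碁 R7 筆電「星際爭霸戰」款限量出擊; 朱镕基退休前后“判若两人” 非常注重晚节

Cause: the character set that Java uses for the declared charset name lacks these special characters

Solution: decode with a broader character set in place of the declared one

Traditional Chinese: Big5-HKSCS instead of Big5

Simplified Chinese: GB18030 instead of GB2312

Japanese: Windows-31J instead of Shift_JIS

Korean: no cases found yet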
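The substitutions above can be expressed as a small lookup applied to the detected charset name before decoding. This is a sketch only: the class name CjkCharsetUpgrade is hypothetical, and it assumes the running JVM ships the extended charsets (Big5-HKSCS, GB18030, Windows-31J), which is typical of a full JDK but not guaranteed, so it falls back to the declared charset when the replacement is unavailable.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: swaps a declared CJK charset for a broader one. */
final class CjkCharsetUpgrade {
    // Declared name (upper-cased) -> broader charset to decode with.
    private static final Map<String, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put("BIG5", "Big5-HKSCS");       // Traditional Chinese
        REPLACEMENTS.put("GB2312", "GB18030");        // Simplified Chinese
        REPLACEMENTS.put("SHIFT_JIS", "Windows-31J"); // Japanese
    }

    private CjkCharsetUpgrade() {}

    /** Returns the replacement name, or the trimmed declared name if none applies. */
    static String upgradeName(String declared) {
        String trimmed = declared.trim();
        return REPLACEMENTS.getOrDefault(trimmed.toUpperCase(Locale.ROOT), trimmed);
    }

    /** Resolves the replacement charset, falling back to the declared one. */
    static Charset upgrade(String declared) {
        try {
            return Charset.forName(upgradeName(declared));
        } catch (IllegalArgumentException e) {
            // Replacement not available on this JVM: use what the page declared.
            return Charset.forName(declared.trim());
        }
    }
}
```

The lookup only handles the exact names listed in the slides; aliases such as "x-sjis" or "EUC-CN" would need their own entries.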

Algorithm Comparison

| Name | Structure retainment | Inner content cleaning | Implementation | Source parameter | Language dependency | Additional features and remarks |
| --- | --- | --- | --- | --- | --- | --- |
| Boilerpipe | plain text only | uses a classifier to determine whether or not the atomic text block holds useful content | open source Java library | you can fetch documents by yourself or use built-in utilities to fetch them for you | should be language independent since the text block classifier observes language independent text… | implements many extractors with different classification rules trained on different datasets |
| Alchemy API | text only (has an option to include relevant hyperlinks) | n/a | commercial web API | include the whole document in the POST request or provide a URL | observation: returns an error for non-English content, e.g. the document contains “unsupported text… | extra API call to extract the title |
| Diffbot | plain text or HTML | an option to remove inline ads | web API (private beta) | does fetching for you via provided URL | n/a | extracts: relevant media, title, tags, XPath descriptor for wrappers, comments and comment count, article summary |
| Readability | retains original structure | uses hardcoded heuristics to extract content divided by ads | open source JavaScript bookmarklet | via browser | language independent, but it relies on language dependent regular expressions to match id and class labels | |
| Goose | plain text | n/a | open source Java library | URL only (my fork enables you to fetch the document by yourself) | language independent, but it relies on language dependent regular expressions to match id and class labels | uses hardcoded heuristics to search for related images and embedded media |
| Extractiv | depends on the chosen output format (e.g. XML format breaks the content) | n/a | commercial web API | include the whole document in the POST request or provide a URL | n/a | capable of enriching the extracted text with semantic entities and relationships |
| Repustate API | plain text | n/a | commercial web API | URL only | n/a | |
| Webstemmer | plain text | n/a | open source Python library | first runs a crawler to obtain seed pages, then it learns layout patterns that are later put to work to extract… | language independent | the only piece of software on this list that requires a cluster of similar documents obtained by crawling |
| NCleaner (paper) | plain text | uses character level n-grams to detect content text blocks | open source Perl library | arbitrary HTML document | depends on the training language | reliant on the Lynx browser for converting HTML to structured plain text |

Reference

Evaluating Text Extraction Algorithms

List of resources: Article text extraction from HTML documents

Feature-wise Comparison of HTML Article Text Extractors

Overview: Extracting article text from HTML documents

Readability for Java - Snacktory

Conclusion

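Next steps:

The article body extracted by Boilerpipe does not yet include image information

A cache mechanism for the full HTML text or the extracted article body corresponding to a URL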
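For the caching idea, one simple option is an in-memory LRU cache keyed by URL, built on java.util.LinkedHashMap in access order. This is only a sketch of one possible design, not something described in the slides: the class name ExtractionCache and the capacity parameter are hypothetical, the coarse synchronized methods are a placeholder for a real concurrency strategy, and expiry of stale pages is not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory LRU cache: URL -> full HTML or extracted article body. */
final class ExtractionCache {
    private final Map<String, String> entries;

    ExtractionCache(final int capacity) {
        // accessOrder = true makes iteration order least-recently-used first.
        this.entries = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Evict the least recently used entry once capacity is exceeded.
                return size() > capacity;
            }
        };
    }

    /** Returns the cached content for the URL, or null if absent. */
    synchronized String get(String url) {
        return entries.get(url);
    }

    /** Stores content for the URL, possibly evicting the oldest entry. */
    synchronized void put(String url, String content) {
        entries.put(url, content);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

Two instances could be kept side by side, one for downloaded HTML and one for extracted bodies, so that a change of extractor or output mode does not force a second download.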

Q&A
