pgrest: node.js in the database
DESCRIPTION
OSDC.tw 2013 talk on plv8x, g0v, 3du, and https://moedict.tw/TRANSCRIPT
PgREST Node.js in the Database
Audrey Tang
只講程式
不講故事
PgREST is…
• JSON document store
• Running inside PostgreSQL
• Working with existing relational data
• Capable of loading Node.js modules
• Compatible with MongoLab’s REST API
• = LiveScript + PLV8 + plv8x + OneJS
JSON
{ "title": "萌",
"bopomofo": "ㄇㄥˊ",
"pinyin": "méng",
"definitions": [
{ "type": "名", "def": "草木初生的芽。" },
{ "type": "名", "def": "事物發生的開端或徵兆。" },
{ "type": "名", "def": "人民。" } ] }
PostgreSQLCREATE TABLE moe ( "entry" JSON );
INSERT INTO moe VALUES ($$
{ "title":"萌", "bopomofo": "ㄇㄥˊ", "pinyin": "méng",
"definitions": [ { "type": "名", "def": "草木初生的芽。" },
{ "type": "名", "def": "事物發生的開端或徵兆。" },
{ "type": "名", "def": "人民。" } ] } $$);
INSERT INTO moe VALUES ('這不是 ㄓㄟ ㄙㄣˇ'); -- type error
PLV8CREATE EXTENSION plv8;
CREATE FUNCTION get_json_key(obj JSON, key TEXT) returns JSON AS $$
return JSON.stringify( obj[key] );
$$ LANGUAGE plv8;
SELECT get_json_key(entry, 'bopomofo') FROM moe;
-- "ㄇㄥˊ"
plv8x: Operators
SELECT entry |> 'this.bopomofo' FROM moe;
-- "ㄇㄥˊ"
SELECT entry ~> '@bopomofo' FROM moe;
-- "ㄇㄥˊ"
SELECT '@bopomofo' <~ entry FROM moe;
-- "ㄇㄥˊ"
SELECT ~> 'new Date';
-- "2013-04-17T12:31:57.523Z"
plv8x: Command Linenpm i -g plv8x
export PLV8XCONN=dbname
plv8x -r script.ls # .js works too
plv8x -E 'plv8.execute("SELECT entry FROM moe").0.entry.definitions'
# [ { type: '名', def: '草木初生的芽。' },
# { type: '名', def: '事物發生的開端或徵兆。' },
# { type: '名', def: '人民。' } ]
plv8x: Modulesnpm i -g uax11
plv8x -i uax11
plv8x -E 'require "uax11" .toFullwidth "méng"'
# meng
SELECT entry ~> 'require "uax11" .toFullwidth @pinyin' FROM moe;
-- "meng"
plv8x: Functionsplv8x -f 'text fullwidth(text)=uax11:toFullwidth'
plv8x -f 'text PINYIN(json)=:&0.pinyin.toUpperCase!'
SELECT fullwidth('Ingy döt Net');
-- Ingy dot Net
SELECT fullwidth( PINYIN(entry) ) FROM moe;
-- MENG
Summary
• V8 : JavaScript engine
• PLV8 : Stored procedures in JavaScript
• plv8x : Package manager for PLV8
• Turns NPM modules into SQL functions
• JSON expressions with ~> and <~
• Code reuse for browser + server + database !
Cutting out the Middleware
• Serve JSON API from SQL
• Shared models & validation code
• Put Business Logic into DB
• Perfect fit for Medium Data™
@clkao++
3du.tw
The Revised MoE Dictionary (1994)
The Good
• 160,000+ entries
• Official, high quality sources
• Rich etymology and historical usage
• Full text search with regular expressions
• Still frequently updated!
The Bad
• Results are not bookmarkable
• Requires N clicks to get to a definition
• Rare characters become low-res bitmaps
• Difficult to use on mobile devices
• ”Optimized for IE 5.0 and Netscape 4.7+”!?
⧸/教育部國語推行委員會〈有關授權〉
The Sad
本會非常歡迎各位來連結「國語辭典」,但是
本會目前只開放以超連結 (hyperlink) 的方式與
國語辭典 首頁 連結,至於其他方式本會並未對
外開放授權。若還有疑問或建議,歡迎來信。
“
.…and the Very Crazy
• 不需登入的網頁,會自動把你登出!
⧸/葉平〈還文於民〉
Yeh’s Ping, 2013.1.26.
所以我要 響應 零時政府 g0v.tw 的活動,來做
3du.tw,把字、詞、成語、定義、例句等等正
體中文資料,用開放的文字 API 釋放出來,加
上索引和搜尋的功能,讓任何想加值的個人或
公司都可以使用。
“
Hackpad for 3du.tw
零時黑客
集體砍站事件
g0v hackath1n, 2013.1.27.
• Scrape 2741 idioms as HTML (@TonyQ, @MnO2)
• Scrape 3000 characters as raw HTML (@au)
• Design JSON schema from samples (@pingooo)
• Design SQL schema from samples (@albb0920)
• Parse HTML into JSON & SQLite (@kcwu)
• …and for those 24x24 bitmaps…
←🀝 Big-5
→🀎 UTF-8
Crowd-OCR for 1000+ glyphs
Finished in 24 hours!Thanks to: Favonia, Jun-Yuan Yan, Yao Wei, Yaoting Huang, Poka,
Caasi Huang, Daniel Liang, Grey Lee, Irvin Chen, Gugod, Schee…
粗略的共識
會動的程式
Applications
• XUL Desktop App (@racklin)
• OS X Dictionary (@yllan)
• Windows 8 App (@wenpei)
• iOS Client (@tomjpsun, @jamessa, @pct)
• iOS Offline App (@zonble)
Integrations
• Rails API server (@albb0920)
• AngularJS Client+Server (@viirya)
• Chrome Extension (@tonytonyjan)
• Sublime Text plugin (@zonble)
• WinRT Component (@eriksk)
Fair Use
為非營利之教育目的,依著作權法第 50 條,
「以中央或地方機關或公法人之名義公開發表
之著作,在合理範圍內,得重製、公開播送或
公開傳輸。」此處轉換格式、重新編排的編輯
著作權(如果有的話)由 @kcwu 以 CC0 釋出。
“
CC0: Public Domain
除前述資料檔之外,本目錄下的所有其他檔
案,由作者 唐鳳 在法律許可的範圍內,拋棄該
著作依著作權法所享有之權利,包括所有相關
與鄰接的法律權利,並宣告將該著作貢獻至公
眾領域。
“
moedict.tw
5 Stars of Open Data
1. ⊙☉ Open License
2. Structured Data
3. Non-Proprietary Format
4. ✧ Each Item has an URI
5. ✩ Linking between Items
URI Endpoints
• https://moedict.tw/#文字
• 3 APIs (for non-Unicode characters):
• /raw/文字.json {[8ff0]}
• /uni/文字.json ⿰亻壯
• /pua/文字.json U+F8FF0
Web Fonts for Private-Use Area
• Initially based on Hán Nôm font (@YaoWei)
• Subset everything outside Big5 range
• Hand-drawn PUA chars like ⿰亻壯
• Later on, switched to Hanazono 花園明朝 font
• 75,619 + 8,236 glyphs
• From 花園大学国際禅学研究所
科技始終
來自於佛性
Live Demo
Reaching the Fifth Star
1. ⊙☉ Open License
2. Structured Data
3. Non-Proprietary Format
4. ✧ Each Item has an URI
5. ✩ Linking between Items
Chinese Segmentation
• Therearenowhitespacesbetweenwords
• Lots of heuristic algorithms
• Naive solution: Longest-token match
• Requires a large dictionary
• …wait, we just got one here
In-browser Implementation{"4":"一(丁不識|不小心|不扭眾|不拗眾|世之雄|世英名|丘一壑|丘之
貉|串驪珠|之為甚|之謂甚|乾二淨|了心願|了百了|了百當|事無成|五
一十|人之交|介不取|仍舊貫|代宗匠|代宗臣|代巨擘|代楷模|代風流|
代鼎臣|以當十|以貫之|來一往|來二去|依舊式|個勁兒|個子兒|個樣
兒|倡三歎|倡百和|偏之見|傅眾咻|償宿願|元大武|元復始|兵一卒|刀
一割|刀兩斷|刀兩段|分一毫|切從簡|切現成|切眾生|刻千金|力承當|
勇之夫|勞久逸|勞永逸|匡天下|去不返|反常態|口價兒|口兩匙|口咬
定|口咬死|古腦兒|名半職|吐為快|吹一唱|呼再諾|呼百應|呼百諾|命
嗚呼|哄而上|哄而散|哄而起|哄而集|唱一和|唱三歎|唱百和|喫一添|
Worked well, but…
• Freezes IE8, crashes IE7
• Broken on Android 2.x, too
• So let’s pre-segment on server
• Needs a tool to move JS into DB
• …wait, we just got one here
/a/文字.json
{"h":[{"b":"ㄨㄣˊ ㄗˋ","d":[{"f":"`人類~`用來~`表示~`觀念~、
`記錄~`語言~`的~`符號~。","s":"`筆墨~,`翰墨~"},{"f":"`文書
~。","q":["`五代史~`平話~.`梁~`史~.`卷~`上~:「`您~`去~`攻
破~`宋~`州~,`為我~`奪取~`張~`節使~`歸~`娘~。`才~`得~,`便~`
發文~`字~`來~`報~`我~。」","`警世通言~.`卷~`十~`三~.`三~`現
身~`包龍圖~`斷~`冤~:「`有~`甚事~`煩惱~?`想~`是~`縣~`裡~`有
~`甚~`文字~`不了~。」"]}],"p":"wén zì"}],"t":"`文~`字~"}
Live Demo, part II
Materialized View: 160k .json files (@obra++)
Let’s PhoneGap it!
• Freezes XCode, crashes Eclipse
• Solution: Pack into 1024 .txt files
• Take the first character, mod 1024
• Related words share the same bucket
• Great success!
Google Play & App Store
User-Driven Development
• Wildcard and part-of-word searching (@esor)
• Two-column layout for tablets (@hlb)
• Toggle between Pinyin and Bopomofo (@matic)
• Volume key on Android resizes fonts (@ivan)
• Top Request: Taiwanese Bân-lâm-gi
Personal Motivation
• My main caretakers were my grandparents
• Grandma from Lo̍k-káng, Taiwan
• Grandpa from Sì-chuān, China
• Raised biligually as a pre-schooler
• But only Mandarin had a writing system
• Editing her memoir brought back memories
Taiwan Bân-lâm-gi Common Dictionary (MoE, 2011)
Good Parts
• Unified Romanization system (TL)
• Standardized Ideographic characters (RHC)
• Full text search with Mandarin, TL & RHC
• MP3 pronounciations of all entries
• Licensed under CC-BY-ND 3.0
Not-so-good Parts
• Entries are in non-bookmarkable <iframe>s
• No equivalent Mandarin field for entries
• Still uses bitmaps for Ext-B+ fonts
• Easy to scrape but hard to parse
• …as discovered by @happyman_eric
g0v hackath2n, 2013.3.23.
Crowd-OCR for 154 glyphs, 2013.3.25.
Finished over lunch!Thanks to: @happyman, @Irvin, @hit1205, @MissleTW, @YuerLee,
@YuanChao, @clkao, @MGDesigner, @gontera…
Database received, 2013.3.27.
• 詞目總檔.xls 詞目總檔.屬性對照.xls
• 釋義.xls 釋義.詞性對照.xls
• 又音.xls 又音.屬性對照.xls
• 近義詞對應.xls 反義詞對應.xls
• 詞彙方言差.xls 語音方言差.xls
• 例句.xls
.…What about that extra request?
您好:
資料匯入目前大致無誤。不過,twblg 網頁上
的「華語檢索」,可以用「一乾二淨」找到閩
語典的「離離」條目,這個對照表似乎沒有在
Excel 檔中看到?
“
Well…
語言之間的對譯,不能盡然以詞彙對應,對不
夠深入了解的使用者來說,會讓他誤以為A語
言的X詞等於B語言的Y詞(並且這種呈現,會
被民眾認知為「教育部的辭典說的」)。
“
However…
因此華語對應這個欄位,我們是藏在系統中。
如果是民間的辭典編輯,會比較沒有這個負
擔,因此我這裡確實不能給,非常希望你們能
有辦法解決。
“
.…it’s all good.
好的,感謝您的提醒和協助。
目前從網頁以 Big5 範圍取出的華語條目,
共有 26274 筆對映。
在應用上,這部份我們會註明不屬於教育部
CC-BY-ND 的授權範圍。
“
Data Cleanup, 2013.3.30.
• Convert all .xsl to .csv with LibreOffice 4
• 3 stars: Non-Proprietary Format
• Replace PUA characters with mapped Unicode
• Add x-造字.csv and x-華語對照表.csv
• Time to put PgREST to work!
PgREST: MongoLab API Server
• GET /collections/table_or_view
• q=&c=true&f=&fo=true&s=&sk=&l=
curl $LY/collections/bills?q={"proposal.0":"吳育昇"}
curl $MOE/collections/entries?q={"部首":"一"}&c=1
• PUT /collections/table_or_view
PgREST: Import/Exportpgrest dbname
export MOE=http://127.0.0.1:3000
curl -i -X PUT -H "Content-Type: text/csv" \
--data-binary @uni/詞目總檔.csv $MOE/collections/entries
curl $MOE/collections/entries
# [{"主編號","1","屬性":"1","詞目":"一","音讀":"tsi̍t",
# "文白俗替":"替","部首":"一","部首序":"001-00-01","方言差":""}]
PgREST: 3du.tw JSON in 48 lines
https://github.com/g0v/moedict-data-twblg/blob/master/gen.ls
“
Live Demo, part III
Lessons Learned
• Open Data is a beginning, not an end
• Keep conversations with all participants
• Turn detractors into collaborators
• Keep a kind heart
• Assume the best intentions
宅心仁厚
仁者無敵
阿宅無敵
— Aaron Swartz, «Open Government»
When is Transparency Useful?
眾人為了共同目標聚在一起,才能做出改變,
科技人很難獨力完成。
衡量成功的標準,可以是有多少人的生命因你
獲得改善,而不只是有多少人看你架的網站。
“
開站一時
開源一輩子
Thank you!
⧸/北島〈回答〉
Thank you!
新的轉機和閃閃星斗,
正在綴滿沒有遮攔的天空。
那是五千年的象形文字,
那是未來人們凝視的眼睛。
“