pgrest: node.js in the database

72
PgREST Node.js in the Database Audrey Tang

Upload: -

Post on 07-May-2015

21.385 views

Category:

Technology


4 download

DESCRIPTION

OSDC.tw 2013 talk on plv8x, g0v, 3du, and https://moedict.tw/

TRANSCRIPT

Page 1: PgREST: Node.js in the Database

PgREST Node.js in the Database

Audrey Tang

Page 2: PgREST: Node.js in the Database

只講程式

不講故事

Page 3: PgREST: Node.js in the Database

PgREST is…

• JSON document store

• Running inside PostgreSQL

• Working with existing relational data

• Capable of loading Node.js modules

• Compatible with MongoLab’s REST API

• = LiveScript + PLV8 + plv8x + OneJS

Page 4: PgREST: Node.js in the Database

JSON

{ "title": "萌",

"bopomofo": "ㄇㄥˊ",

"pinyin": "méng",

"definitions": [

{ "type": "名", "def": "草木初生的芽。" },

{ "type": "名", "def": "事物發生的開端或徵兆。" },

{ "type": "名", "def": "人民。" } ] }

Page 5: PgREST: Node.js in the Database

PostgreSQLCREATE TABLE moe ( "entry" JSON );

INSERT INTO moe VALUES ($$

{ "title":"萌", "bopomofo": "ㄇㄥˊ", "pinyin": "méng",

"definitions": [ { "type": "名", "def": "草木初生的芽。" },

{ "type": "名", "def": "事物發生的開端或徵兆。" },

{ "type": "名", "def": "人民。" } ] } $$);

INSERT INTO moe VALUES ('這不是 ㄓㄟ ㄙㄣˇ'); -- type error

Page 6: PgREST: Node.js in the Database

PLV8CREATE EXTENSION plv8;

CREATE FUNCTION get_json_key(obj JSON, key TEXT) returns JSON AS $$

return JSON.stringify( obj[key] );

$$ LANGUAGE plv8;

SELECT get_json_key(entry, 'bopomofo') FROM moe;

-- "ㄇㄥˊ"

Page 7: PgREST: Node.js in the Database

plv8x: Operators

SELECT entry |> 'this.bopomofo' FROM moe;

-- "ㄇㄥˊ"

SELECT entry ~> '@bopomofo' FROM moe;

-- "ㄇㄥˊ"

SELECT '@bopomofo' <~ entry FROM moe;

-- "ㄇㄥˊ"

SELECT ~> 'new Date';

-- "2013-04-17T12:31:57.523Z"

Page 8: PgREST: Node.js in the Database

plv8x: Command Linenpm i -g plv8x

export PLV8XCONN=dbname

plv8x -r script.ls # .js works too

plv8x -E 'plv8.execute("SELECT entry FROM moe").0.entry.definitions'

# [ { type: '名', def: '草木初生的芽。' },

# { type: '名', def: '事物發生的開端或徵兆。' },

# { type: '名', def: '人民。' } ]

Page 9: PgREST: Node.js in the Database

plv8x: Modulesnpm i -g uax11

plv8x -i uax11

plv8x -E 'require "uax11" .toFullwidth "méng"'

# meng

SELECT entry ~> 'require "uax11" .toFullwidth @pinyin' FROM moe;

-- "meng"

Page 10: PgREST: Node.js in the Database

plv8x: Functionsplv8x -f 'text fullwidth(text)=uax11:toFullwidth'

plv8x -f 'text PINYIN(json)=:&0.pinyin.toUpperCase!'

SELECT fullwidth('Ingy döt Net');

-- Ingy dot Net

SELECT fullwidth( PINYIN(entry) ) FROM moe;

-- MENG

Page 11: PgREST: Node.js in the Database

Summary

• V8 : JavaScript engine

• PLV8 : Stored procedures in JavaScript

• plv8x : Package manager for PLV8

• Turns NPM modules into SQL functions

• JSON expressions with ~> and <~

• Code reuse for browser + server + database !

Page 12: PgREST: Node.js in the Database

Cutting out the Middleware

• Serve JSON API from SQL

• Shared models & validation code

• Put Business Logic into DB

• Perfect fit for Medium Data™

Page 13: PgREST: Node.js in the Database

@clkao++

Page 14: PgREST: Node.js in the Database

3du.tw

Page 15: PgREST: Node.js in the Database

The Revised MoE Dictionary (1994)

Page 16: PgREST: Node.js in the Database

The Good

• 160,000+ entries

• Official, high quality sources

• Rich etymology and historical usage

• Full text search with regular expressions

• Still frequently updated!

Page 17: PgREST: Node.js in the Database

The Bad

• Results are not bookmarkable

• Requires N clicks to get to a definition

• Rare characters become low-res bitmaps

• Difficult to use on mobile devices

• ”Optimized for IE 5.0 and Netscape 4.7+”!?

Page 18: PgREST: Node.js in the Database

⧸/教育部國語推行委員會〈有關授權〉

The Sad

本會非常歡迎各位來連結「國語辭典」,但是

本會目前只開放以超連結 (hyperlink) 的方式與

國語辭典 首頁 連結,至於其他方式本會並未對

外開放授權。若還有疑問或建議,歡迎來信。

Page 19: PgREST: Node.js in the Database

.…and the Very Crazy

• 不需登入的網頁,會自動把你登出!

Page 20: PgREST: Node.js in the Database

⧸/葉平〈還文於民〉

Yeh’s Ping, 2013.1.26.

所以我要 響應 零時政府 g0v.tw 的活動,來做

3du.tw,把字、詞、成語、定義、例句等等正

體中文資料,用開放的文字 API 釋放出來,加

上索引和搜尋的功能,讓任何想加值的個人或

公司都可以使用。

Page 21: PgREST: Node.js in the Database

Hackpad for 3du.tw

Page 22: PgREST: Node.js in the Database

零時黑客

集體砍站事件

Page 23: PgREST: Node.js in the Database

g0v hackath1n, 2013.1.27.

• Scrape 2741 idioms as HTML (@TonyQ, @MnO2)

• Scrape 3000 characters as raw HTML (@au)

• Design JSON schema from samples (@pingooo)

• Design SQL schema from samples (@albb0920)

• Parse HTML into JSON & SQLite (@kcwu)

• …and for those 24x24 bitmaps…

Page 24: PgREST: Node.js in the Database

←🀝 Big-5

→🀎 UTF-8

Page 25: PgREST: Node.js in the Database

Crowd-OCR for 1000+ glyphs

Page 26: PgREST: Node.js in the Database

Finished in 24 hours!Thanks to: Favonia, Jun-Yuan Yan, Yao Wei, Yaoting Huang, Poka,

Caasi Huang, Daniel Liang, Grey Lee, Irvin Chen, Gugod, Schee…

Page 27: PgREST: Node.js in the Database

粗略的共識

會動的程式

Page 28: PgREST: Node.js in the Database

Applications

• XUL Desktop App (@racklin)

• OS X Dictionary (@yllan)

• Windows 8 App (@wenpei)

• iOS Client (@tomjpsun, @jamessa, @pct)

• iOS Offline App (@zonble)

Page 29: PgREST: Node.js in the Database

Integrations

• Rails API server (@albb0920)

• AngularJS Client+Server (@viirya)

• Chrome Extension (@tonytonyjan)

• Sublime Text plugin (@zonble)

• WinRT Component (@eriksk)

Page 30: PgREST: Node.js in the Database

Fair Use

為非營利之教育目的,依著作權法第 50 條,

「以中央或地方機關或公法人之名義公開發表

之著作,在合理範圍內,得重製、公開播送或

公開傳輸。」此處轉換格式、重新編排的編輯

著作權(如果有的話)由 @kcwu 以 CC0 釋出。

Page 31: PgREST: Node.js in the Database

CC0: Public Domain

除前述資料檔之外,本目錄下的所有其他檔

案,由作者 唐鳳 在法律許可的範圍內,拋棄該

著作依著作權法所享有之權利,包括所有相關

與鄰接的法律權利,並宣告將該著作貢獻至公

眾領域。

Page 32: PgREST: Node.js in the Database

moedict.tw

Page 33: PgREST: Node.js in the Database

5 Stars of Open Data

1. ⊙☉ Open License

2. Structured Data

3. Non-Proprietary Format

4. ✧ Each Item has an URI

5. ✩ Linking between Items

Page 34: PgREST: Node.js in the Database

URI Endpoints

• https://moedict.tw/#文字

• 3 APIs (for non-Unicode characters):

• /raw/文字.json {[8ff0]}

• /uni/文字.json ⿰亻壯

• /pua/文字.json U+F8FF0

Page 35: PgREST: Node.js in the Database

Web Fonts for Private-Use Area

• Initially based on Hán Nôm font (@YaoWei)

• Subset everything outside Big5 range

• Hand-drawn PUA chars like ⿰亻壯

• Later on, switched to Hanazono 花園明朝 font

• 75,619 + 8,236 glyphs

• From 花園大学国際禅学研究所

Page 36: PgREST: Node.js in the Database

科技始終

來自於佛性

Page 37: PgREST: Node.js in the Database

Live Demo

Page 38: PgREST: Node.js in the Database

Reaching the Fifth Star

1. ⊙☉ Open License

2. Structured Data

3. Non-Proprietary Format

4. ✧ Each Item has an URI

5. ✩ Linking between Items

Page 39: PgREST: Node.js in the Database

Chinese Segmentation

• Therearenowhitespacesbetweenwords

• Lots of heuristic algorithms

• Naive solution: Longest-token match

• Requires a large dictionary

• …wait, we just got one here

Page 40: PgREST: Node.js in the Database

In-browser Implementation{"4":"一(丁不識|不小心|不扭眾|不拗眾|世之雄|世英名|丘一壑|丘之

貉|串驪珠|之為甚|之謂甚|乾二淨|了心願|了百了|了百當|事無成|五

一十|人之交|介不取|仍舊貫|代宗匠|代宗臣|代巨擘|代楷模|代風流|

代鼎臣|以當十|以貫之|來一往|來二去|依舊式|個勁兒|個子兒|個樣

兒|倡三歎|倡百和|偏之見|傅眾咻|償宿願|元大武|元復始|兵一卒|刀

一割|刀兩斷|刀兩段|分一毫|切從簡|切現成|切眾生|刻千金|力承當|

勇之夫|勞久逸|勞永逸|匡天下|去不返|反常態|口價兒|口兩匙|口咬

定|口咬死|古腦兒|名半職|吐為快|吹一唱|呼再諾|呼百應|呼百諾|命

嗚呼|哄而上|哄而散|哄而起|哄而集|唱一和|唱三歎|唱百和|喫一添|

Page 41: PgREST: Node.js in the Database

Worked well, but…

• Freezes IE8, crashes IE7

• Broken on Android 2.x, too

• So let’s pre-segment on server

• Needs a tool to move JS into DB

• …wait, we just got one here

Page 42: PgREST: Node.js in the Database

/a/文字.json

{"h":[{"b":"ㄨㄣˊ ㄗˋ","d":[{"f":"`人類~`用來~`表示~`觀念~、

`記錄~`語言~`的~`符號~。","s":"`筆墨~,`翰墨~"},{"f":"`文書

~。","q":["`五代史~`平話~.`梁~`史~.`卷~`上~:「`您~`去~`攻

破~`宋~`州~,`為我~`奪取~`張~`節使~`歸~`娘~。`才~`得~,`便~`

發文~`字~`來~`報~`我~。」","`警世通言~.`卷~`十~`三~.`三~`現

身~`包龍圖~`斷~`冤~:「`有~`甚事~`煩惱~?`想~`是~`縣~`裡~`有

~`甚~`文字~`不了~。」"]}],"p":"wén zì"}],"t":"`文~`字~"}

Page 43: PgREST: Node.js in the Database

Live Demo, part II

Page 44: PgREST: Node.js in the Database

Materialized View: 160k .json files (@obra++)

Page 45: PgREST: Node.js in the Database

Let’s PhoneGap it!

• Freezes XCode, crashes Eclipse

• Solution: Pack into 1024 .txt files

• Take the first character, mod 1024

• Related words share the same bucket

• Great success!

Page 46: PgREST: Node.js in the Database

Google Play & App Store

Page 47: PgREST: Node.js in the Database

User-Driven Development

• Wildcard and part-of-word searching (@esor)

• Two-column layout for tablets (@hlb)

• Toggle between Pinyin and Bopomofo (@matic)

• Volume key on Android resizes fonts (@ivan)

• Top Request: Taiwanese Bân-lâm-gi

Page 48: PgREST: Node.js in the Database
Page 49: PgREST: Node.js in the Database

Personal Motivation

• My main caretakers were my grandparents

• Grandma from Lo̍k-káng, Taiwan

• Grandpa from Sì-chuān, China

• Raised biligually as a pre-schooler

• But only Mandarin had a writing system

• Editing her memoir brought back memories

Page 50: PgREST: Node.js in the Database

Taiwan Bân-lâm-gi Common Dictionary (MoE, 2011)

Page 51: PgREST: Node.js in the Database

Good Parts

• Unified Romanization system (TL)

• Standardized Ideographic characters (RHC)

• Full text search with Mandarin, TL & RHC

• MP3 pronounciations of all entries

• Licensed under CC-BY-ND 3.0

Page 52: PgREST: Node.js in the Database

Not-so-good Parts

• Entries are in non-bookmarkable <iframe>s

• No equivalent Mandarin field for entries

• Still uses bitmaps for Ext-B+ fonts

• Easy to scrape but hard to parse

• …as discovered by @happyman_eric

Page 53: PgREST: Node.js in the Database

g0v hackath2n, 2013.3.23.

Page 54: PgREST: Node.js in the Database

Crowd-OCR for 154 glyphs, 2013.3.25.

Page 55: PgREST: Node.js in the Database

Finished over lunch!Thanks to: @happyman, @Irvin, @hit1205, @MissleTW, @YuerLee,

@YuanChao, @clkao, @MGDesigner, @gontera…

Page 56: PgREST: Node.js in the Database

Database received, 2013.3.27.

• 詞目總檔.xls 詞目總檔.屬性對照.xls

• 釋義.xls 釋義.詞性對照.xls

• 又音.xls 又音.屬性對照.xls

• 近義詞對應.xls 反義詞對應.xls

• 詞彙方言差.xls 語音方言差.xls

• 例句.xls

Page 57: PgREST: Node.js in the Database

.…What about that extra request?

您好:

資料匯入目前大致無誤。不過,twblg 網頁上

的「華語檢索」,可以用「一乾二淨」找到閩

語典的「離離」條目,這個對照表似乎沒有在

Excel 檔中看到?

Page 58: PgREST: Node.js in the Database

Well…

語言之間的對譯,不能盡然以詞彙對應,對不

夠深入了解的使用者來說,會讓他誤以為A語

言的X詞等於B語言的Y詞(並且這種呈現,會

被民眾認知為「教育部的辭典說的」)。

Page 59: PgREST: Node.js in the Database

However…

因此華語對應這個欄位,我們是藏在系統中。

如果是民間的辭典編輯,會比較沒有這個負

擔,因此我這裡確實不能給,非常希望你們能

有辦法解決。

Page 60: PgREST: Node.js in the Database

.…it’s all good.

好的,感謝您的提醒和協助。

目前從網頁以 Big5 範圍取出的華語條目,

共有 26274 筆對映。

在應用上,這部份我們會註明不屬於教育部

CC-BY-ND 的授權範圍。

Page 61: PgREST: Node.js in the Database

Data Cleanup, 2013.3.30.

• Convert all .xsl to .csv with LibreOffice 4

• 3 stars: Non-Proprietary Format

• Replace PUA characters with mapped Unicode

• Add x-造字.csv and x-華語對照表.csv

• Time to put PgREST to work!

Page 62: PgREST: Node.js in the Database

PgREST: MongoLab API Server

• GET /collections/table_or_view

• q=&c=true&f=&fo=true&s=&sk=&l=

curl $LY/collections/bills?q={"proposal.0":"吳育昇"}

curl $MOE/collections/entries?q={"部首":"一"}&c=1

• PUT /collections/table_or_view

Page 63: PgREST: Node.js in the Database

PgREST: Import/Exportpgrest dbname

export MOE=http://127.0.0.1:3000

curl -i -X PUT -H "Content-Type: text/csv" \

--data-binary @uni/詞目總檔.csv $MOE/collections/entries

curl $MOE/collections/entries

# [{"主編號","1","屬性":"1","詞目":"一","音讀":"tsi̍t",

# "文白俗替":"替","部首":"一","部首序":"001-00-01","方言差":""}]

Page 64: PgREST: Node.js in the Database

PgREST: 3du.tw JSON in 48 lines

https://github.com/g0v/moedict-data-twblg/blob/master/gen.ls

Page 65: PgREST: Node.js in the Database

Live Demo, part III

Page 66: PgREST: Node.js in the Database

Lessons Learned

• Open Data is a beginning, not an end

• Keep conversations with all participants

• Turn detractors into collaborators

• Keep a kind heart

• Assume the best intentions

Page 67: PgREST: Node.js in the Database

宅心仁厚

仁者無敵

Page 68: PgREST: Node.js in the Database

阿宅無敵

Page 69: PgREST: Node.js in the Database

— Aaron Swartz, «Open Government»

When is Transparency Useful?

眾人為了共同目標聚在一起,才能做出改變,

科技人很難獨力完成。

衡量成功的標準,可以是有多少人的生命因你

獲得改善,而不只是有多少人看你架的網站。

Page 70: PgREST: Node.js in the Database

開站一時

開源一輩子

Page 71: PgREST: Node.js in the Database

Thank you!

Page 72: PgREST: Node.js in the Database

⧸/北島〈回答〉

Thank you!

新的轉機和閃閃星斗,

正在綴滿沒有遮攔的天空。

那是五千年的象形文字,

那是未來人們凝視的眼睛。