Identity Management: Data Intake as a Service
Open Apereo 2015
Higher Education ... Open Source in a New Age
IDM: Data Intake as a Service
Brian Koehmstedt, Lead Identity Management Developer
UC Berkeley, [email protected]
Introduction
UC Berkeley is replacing its identity registry and
provisioning infrastructure with the new Berkeley
Person Registry (BPR).
Intake of identity data from Systems of Record
(SORs) is a component of the BPR.
We call it the SOR Gateway Service (SGS).
SOR Gateway Service –
Objectives
Data Intake as a Service: Data from anywhere,
anytime. Use a “schemaless” format so source data
is not restricted or constrained.
Enable “MicroSORs”: Clients send us data through
provided service endpoints. Quick to bring in new
SORs. Scale up to many SORs, large or small.
A distinct, loosely coupled component of the BPR.
The opposite approach to a monolithic IDM product.
SOR Gateway Service –
The Big Picture
[Diagram: Systems of Record -> SOR Gateway Service -> Registry -> Outbound Systems]
SOR Gateway Service – Detailed View
[Diagram: SOR sources (databases, LDAP, feed files, web services, message queues) -> SOR Batch Intake and SOR Real-time Intake -> Intake Queue -> JSON Conversion -> Storage Queue -> Registry (SORObject)]
Why JSON?
“Schemaless”: Not reliant on the structure of the
source data.
Removes barriers to entry and reduces the time it
takes to set up a new SOR.
A SOR can send us any data elements without
having to set up element mappings.
Takes advantage of the new PostgreSQL JSONB data
type (more on this later).
Converting to JSON
As the service takes in identity objects, it sends the
raw data to a message queue (the “Intake Queue”).
A queue listener pulls the raw data off this queue,
converts it to JSON, and puts it on a second queue
(the “Storage Queue”). That queue’s listener pulls
the JSON off the queue and stores it.
Why this queue approach? If processing fails on any
given identity in any given processing stage, it won’t
hold up bringing in other messages from sources.
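The two-queue flow can be sketched roughly as follows. This is a minimal in-process illustration, not the SGS implementation: Python's `queue.Queue` stands in for a real message broker, the `key=value;key=value` input format and the `convert_to_json` helper are hypothetical, and the `failed` list is just one assumed way of setting aside a message that cannot be processed.

```python
import json
import queue

intake_queue = queue.Queue()    # raw objects from SORs ("Intake Queue")
storage_queue = queue.Queue()   # converted JSON ("Storage Queue")
stored = []                     # stand-in for the Registry's SORObject table
failed = []                     # messages set aside after a processing error


def convert_to_json(raw):
    """Hypothetical converter: parse 'key=value;key=value' text into JSON."""
    fields = dict(pair.split("=", 1) for pair in raw.split(";"))
    return json.dumps(fields, sort_keys=True)


def drain_intake():
    """Intake listener: convert each raw message and hand it to storage."""
    while not intake_queue.empty():
        raw = intake_queue.get()
        try:
            storage_queue.put(convert_to_json(raw))
        except ValueError:
            failed.append(raw)  # one bad message does not block the rest


def drain_storage():
    """Storage listener: persist each JSON document."""
    while not storage_queue.empty():
        stored.append(storage_queue.get())


for message in ["uid=1001;name=Ada", "garbage-without-equals", "uid=1002;name=Grace"]:
    intake_queue.put(message)
drain_intake()
drain_storage()
```

In this sketch the malformed second message ends up in `failed` while the other two are converted and stored, which is the isolation property the queue approach is meant to provide.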
Batch DB Intake
Easy if every table/view has a reliable “last update”
timestamp.
That is not the case for us in all circumstances.
The SGS is designed to handle both: query using “last
update time” where it exists, and use a different
technique where it does not.
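A “last update time” query might look something like the sketch below. It uses an in-memory SQLite database purely as a stand-in for a source SOR database. The `person` table, its `last_updated` column, and the `rows_changed_since` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a source SOR database
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, last_updated TEXT)"
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [
        (1, "Ada", "2015-05-30T08:00:00"),
        (2, "Grace", "2015-06-01T23:15:00"),
        (3, "Edsger", "2015-06-02T01:30:00"),
    ],
)


def rows_changed_since(connection, last_query_time):
    """Return rows whose last-update timestamp is newer than the previous query."""
    cursor = connection.execute(
        "SELECT id, name, last_updated FROM person "
        "WHERE last_updated > ? ORDER BY id",
        (last_query_time,),
    )
    return cursor.fetchall()


# Only rows touched after the previous batch run are fetched.
changed = rows_changed_since(conn, "2015-06-01T00:00:00")
```

The timestamps are stored as ISO 8601 text so that string comparison matches chronological order. A real source would more likely use a native timestamp type.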
Alternatives
DB vendors may offer solutions, such as Oracle
Change Data Capture, Oracle GoldenGate, or other
replication features that identify changed rows.
Great, if that’s an option for you.
In lieu of a vendor-specific option, we found
creating hash values of all table/view rows to be an
acceptably performant solution where timestamps
aren’t available.
Hashing Rows
We use Oracle’s ORA_HASH() function to create row
hash values in a nightly job. Other DBs likely have
native hash functions.
We store each hash in a “checksum” table.
We then compare the hash values from the latest
query with what’s in the “checksum” table.
We query for a row only if its hash value differs.
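The comparison step can be illustrated with a small sketch. It makes several assumptions: `zlib.crc32` is used only as a stand-in for Oracle's ORA_HASH(), plain dictionaries stand in for the source table and the “checksum” table, and the row contents and helper names are hypothetical.

```python
import zlib


def row_hash(row):
    """Stand-in for ORA_HASH(): a 32-bit hash of the row's column values."""
    joined = "|".join("" if value is None else str(value) for value in row)
    return zlib.crc32(joined.encode("utf-8"))


def find_changed_keys(current_hashes, checksum_table):
    """Return keys whose hash is new or differs from the stored checksum."""
    return sorted(
        key for key, value in current_hashes.items()
        if checksum_table.get(key) != value
    )


# Checksums recorded by the previous nightly run (hypothetical rows).
previous_rows = {"1001": ("Ada", "Staff"), "1002": ("Grace", "Faculty")}
checksum_table = {key: row_hash(row) for key, row in previous_rows.items()}

# Tonight's source rows: 1002 changed and 1003 is new.
source_rows = {
    "1001": ("Ada", "Staff"),
    "1002": ("Grace", "Emeritus"),
    "1003": ("Edsger", "Student"),
}
current_hashes = {key: row_hash(row) for key, row in source_rows.items()}

# Only these keys need a full row query and JSON conversion.
changed_keys = find_changed_keys(current_hashes, checksum_table)
checksum_table.update(current_hashes)  # record the new checksums for the next run
```

This sketch detects new and modified rows only. Detecting rows deleted at the source would need a separate check for keys present in the checksum table but missing from the latest query.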
Hashing vs “whole dump”
Why not just retrieve all those rows every night and
skip the hash step?
That’s possible, but…
We’re converting to JSON, which can be expensive if
we try to do that for every row. Processing only the
changed rows, via ORA_HASH() comparisons, is a lot
faster.
Hashing vs “whole dump”
Continued
“Could you dump the table and load to a local DB
without converting to JSON and use a trigger to
check for changed data and set a local timestamp?”
Yes, that would work if you want to maintain those
tables and triggers locally.
Real-time Intake
Multiple possibilities for source systems sending
real-time data to the SGS:
Web service endpoints
Message queues (Ex: JMS)
The design of the SGS doesn’t limit the transport
possibilities as long as the data is convertible to
JSON.
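As a rough illustration of “convertible to JSON”, the sketch below normalizes two differently shaped inputs into the same JSON document. The payload shapes (a delimited feed line and a flat XML message), the field names, and both helper functions are assumptions, not formats the SGS is known to accept.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def csv_line_to_json(header, line):
    """Convert one delimited feed-file line to JSON using the file's header."""
    values = next(csv.reader(io.StringIO(line)))
    return json.dumps(dict(zip(header, values)), sort_keys=True)


def xml_to_json(xml_text):
    """Convert a flat XML message to JSON: one key per child element."""
    root = ET.fromstring(xml_text)
    return json.dumps({child.tag: child.text for child in root}, sort_keys=True)


# Two transports, one resulting JSON shape.
from_feed = csv_line_to_json(["uid", "name"], "1001,Ada")
from_queue = xml_to_json("<person><uid>1001</uid><name>Ada</name></person>")
```

Because every source ends up as JSON, adding a transport in this sketch only means adding another converter function.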
The Registry Database
In addition to the raw JSON data stored from the
SGS, other BPR components process and provision
to our “Person Model” tables in our Registry.
For the SGS, we’re only involved with storing the raw
JSON.
We’ve chosen PostgreSQL for our Registry.
Why PostgreSQL?
It’s open source.
It’s fast and stable.
It’s been around forever and isn’t going anywhere.
(Side bit of trivia: PostgreSQL was created at UC Berkeley in 1986 by Professor Michael
Stonebraker and his graduate students. http://www.postgresql.org/about/history/)
Recent feature development has been impressive.
The JSONB feature was introduced in PostgreSQL 9.4.
Our database team supports it.
What’s JSONB? Why use it?
At its core, JSONB is the ability to store JSON text in
a JSONB column where the JSON fields can be indexed as
if they were table columns.
That makes the JSON itself queryable with the
JSONB features.
Very powerful for storing any kind of source data
from any kind of SOR, while still letting us query it
when we need to.
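A hedged sketch of what indexing and querying a JSONB column could look like is shown below. The SQL uses PostgreSQL's GIN index type and the `@>` containment operator. The index name, the `lastName` field, the lowercase `sorobject` table name, and the `%s` placeholder style (as used by the psycopg driver) are assumptions. The code only builds the statements and does not connect to a database.

```python
import json

# Hypothetical DDL: a GIN index over the whole JSONB column, which PostgreSQL
# can use to speed up containment (@>) queries.
CREATE_INDEX_SQL = (
    "CREATE INDEX sorobject_objjson_idx ON sorobject USING gin (objjson)"
)


def containment_query(field, value):
    """Build a parameterized query for rows whose JSON contains {field: value}."""
    sql = (
        "SELECT sorid, sorobjkey, objjson FROM sorobject "
        "WHERE objjson @> %s::jsonb"
    )
    return sql, (json.dumps({field: value}),)


# 'lastName' is an illustrative field; each SOR decides its own element names.
sql, params = containment_query("lastName", "Lovelace")
```

Passing the JSON fragment as a bound parameter rather than formatting it into the SQL string keeps SOR-supplied values out of the statement text.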
Conclusion
Will the Berkeley Person Registry be open sourced?
No timeline, but we’re encouraged by progress on
this.
Appendix slides attached showing SGS database
tables and some example queries.
Questions?
By: Brian Koehmstedt
Content License: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Appendix:
SORObject
 Column                 | Type                     | Modifiers
------------------------+--------------------------+-----------
 id                     | bigint                   | not null
 sorid                  | smallint                 | not null
 sorobjkey              | character varying(255)   | not null
 uid                    | character varying(64)    |
 sorquerytime           | timestamp with time zone | not null
 jsonversion            | integer                  | not null
 objjson                | jsonb                    | not null
 objjsonlastupdated     | timestamp with time zone | not null
 messageid              | character varying(128)   |
 messagetimestamp       | timestamp with time zone |
 messageredeliverycount | integer                  |
 hash                   | bigint                   |
 hashversion            | integer                  |
 timecreated            | timestamp with time zone | not null
 timeupdated            | timestamp with time zone | not null
Copyright © 2015, The Regents of the University of California
Code License: BSD Two-Clause
Appendix:
SORObjectChecksum
 Column      | Type                     | Modifiers
-------------+--------------------------+-----------
 sorid       | smallint                 | not null
 sorobjkey   | character varying(255)   | not null
 hash        | bigint                   | not null
 hashversion | integer                  | not null
 timemarker  | timestamp with time zone | not null
Appendix:
SORObjectChecksumQuery
 Column               | Type                     | Modifiers
----------------------+--------------------------+-----------
 sorid                | smallint                 | not null
 querytime            | timestamp with time zone | not null
 objectquantity       | integer                  | not null
 querydurationseconds | integer                  |