
Open Apereo 2015
Higher Education ... Open Source in a New Age

IDM: Data Intake as a Service
Brian Koehmstedt, Lead Identity Management Developer
UC Berkeley, [email protected]

Introduction

UC Berkeley is replacing its identity registry and provisioning infrastructure with the new Berkeley Person Registry (BPR).

Intake of identity data from Systems of Record (SORs) is a component of the BPR. We call it the SOR Gateway Service (SGS).

SOR Gateway Service - Objectives

Data Intake as a Service: data from anywhere, anytime. Use a “schemaless” format so source data is not restricted or constrained.

Enable “MicroSORs”: clients send us data through provided service endpoints. Quick to bring in new SORs; scales up to many SORs, large or small.

A distinct, loosely coupled component of the BPR. The opposite approach to a monolithic IDM product.

SOR Gateway Service – The Big Picture

Systems of Record -> SOR Gateway Service -> Registry -> Outbound Systems

SOR Gateway Service – Detailed View

SOR Batch Intake and SOR Real-time Intake (sources: databases, LDAP, feed files, web services, message queues) -> Intake Queue -> JSON Conversion -> Storage Queue -> Registry (SORObject)

Why JSON?

“Schemaless”: not reliant on the data structure of the source data. Removes barriers to entry and reduces the time it takes to set up a new SOR.

A SOR can send us any data elements without having to set up element mappings (illustrated below).

Takes advantage of the new PostgreSQL JSONB data type (more on this later).
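As a hypothetical illustration of the “schemaless” idea, here are two very different payloads that two SORs could submit without any element mappings being configured up front (all field names are invented for this example, not a published SGS contract):

public class SchemalessExamples {
    // An HR system might send employment-oriented fields...
    static final String HR_PAYLOAD =
        "{\"employeeId\": \"12345\", \"firstName\": \"Ada\", \"department\": \"EECS\"}";
    // ...while a guest-account MicroSOR sends entirely different ones.
    static final String GUEST_PAYLOAD =
        "{\"sponsor\": \"jdoe\", \"expires\": \"2015-12-31\", \"note\": \"conference guest\"}";
}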

Converting to JSON

As the service takes in identity objects, it sends the data to a message queue (the “Intake Queue”). A queue listener pulls the raw data off this queue, converts it to JSON, and puts it on another queue (the “Storage Queue”), where a second listener pulls the JSON off and stores it.

Why this queue approach? If processing fails for any given identity at any given processing stage, it won’t hold up bringing in other messages from the sources. A minimal listener sketch follows.
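A sketch of the Intake Queue side of that hand-off, assuming a JMS broker; the class name and the toJson() stand-in are hypothetical, not the actual SGS code:

import javax.jms.*;

public class IntakeQueueListener implements MessageListener {

    private final Session session;
    private final MessageProducer storageQueueProducer;

    public IntakeQueueListener(Session session, Destination storageQueue) throws JMSException {
        this.session = session;
        this.storageQueueProducer = session.createProducer(storageQueue);
    }

    @Override
    public void onMessage(Message message) {
        try {
            // Pull the raw SOR payload off the Intake Queue and convert it to JSON.
            String rawPayload = ((TextMessage) message).getText();
            String json = toJson(rawPayload);

            // Hand the JSON to the Storage Queue; a separate listener persists it.
            storageQueueProducer.send(session.createTextMessage(json));
        } catch (Exception e) {
            // A failure affects only this one identity object; other messages keep
            // flowing, which is the point of the two-queue design. Real code would
            // log and/or dead-letter the failed message rather than swallow it.
        }
    }

    // Hypothetical stand-in: the real conversion depends on the source data format.
    private String toJson(String raw) {
        return "{\"raw\": \"" + raw.replace("\"", "\\\"") + "\"}";
    }
}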

Batch DB Intake

Easy if every table/view has a reliable “last update” timestamp. That is not the case for us in all circumstances.

The SGS is designed to handle both: query using the “last update time” where possible (as sketched below), and fall back to a different technique when that is not possible.
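A sketch of the timestamp-based case, assuming a hypothetical SOR view named SOR_PERSON_VW with LAST_UPDATED and PERSON_KEY columns (all three names invented for illustration):

import java.sql.*;

public class TimestampBatchIntake {

    // Fetch only the rows changed since the last successful query; each
    // matching row would then go onto the Intake Queue for JSON conversion.
    public static void fetchChangedRows(Connection sorDb, Timestamp lastQueryTime) throws SQLException {
        String sql = "SELECT * FROM SOR_PERSON_VW WHERE LAST_UPDATED > ?";
        try (PreparedStatement ps = sorDb.prepareStatement(sql)) {
            ps.setTimestamp(1, lastQueryTime);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Placeholder for "enqueue this row".
                    System.out.println(rs.getString("PERSON_KEY"));
                }
            }
        }
    }
}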

Alternatives

DB vendors may offer solutions, such as Oracle Change Data Capture, Oracle GoldenGate, or other replication features that identify changed rows. Great, if that’s an option for you.

In lieu of a vendor-specific option, we found creating hash values of all table/view rows to be an acceptably performant solution where timestamps aren’t available.

Hashing Rows

We use Oracle’s ORA_HASH() function to create row hash values in a nightly job. Other DBs likely have native hash functions.

We store these hashes in a “checksum” table. We then compare the hash values from the latest query against what’s in the “checksum” table, and only query for a row if its hash value differs. A sketch follows below.
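A sketch of the nightly hash pass, using the SORObjectChecksum table shown in the appendix; the SOR view name and the column list fed to ORA_HASH() are hypothetical and would cover whichever source columns matter:

import java.sql.*;
import java.util.*;

public class RowHashScanner {

    // Hash every row in the source view on the Oracle side. ORA_HASH() returns
    // a numeric hash of the concatenated column expression.
    public static Map<String, Long> hashAllRows(Connection oracleDb) throws SQLException {
        String sql = "SELECT PERSON_KEY, "
                   + "ORA_HASH(FIRST_NAME || '|' || LAST_NAME || '|' || EMAIL) AS ROW_HASH "
                   + "FROM SOR_PERSON_VW";
        Map<String, Long> hashes = new HashMap<>();
        try (Statement st = oracleDb.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                hashes.put(rs.getString("PERSON_KEY"), rs.getLong("ROW_HASH"));
            }
        }
        return hashes;
    }

    // Compare the fresh hashes against the checksum table in the Registry;
    // only new or changed keys need a full re-query of the source row.
    public static List<String> changedKeys(Connection registryDb, int sorId,
                                           Map<String, Long> freshHashes) throws SQLException {
        List<String> changed = new ArrayList<>();
        String sql = "SELECT hash FROM SORObjectChecksum WHERE sorid = ? AND sorobjkey = ?";
        try (PreparedStatement ps = registryDb.prepareStatement(sql)) {
            for (Map.Entry<String, Long> entry : freshHashes.entrySet()) {
                ps.setInt(1, sorId);
                ps.setString(2, entry.getKey());
                try (ResultSet rs = ps.executeQuery()) {
                    if (!rs.next() || rs.getLong("hash") != entry.getValue()) {
                        changed.add(entry.getKey());
                    }
                }
            }
        }
        return changed;
    }
}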

Hashing vs. “whole dump”

Why not just retrieve all those rows every night and skip the hash step? That’s possible, but we’re converting to JSON, which would be expensive if we tried to do it for every row. Processing only the changed rows, via ORA_HASH() comparisons, is a lot faster.

Hashing vs. “whole dump”, continued

“Could you dump the table, load it into a local DB without converting to JSON, and use a trigger to check for changed data and set a local timestamp?”

Yes, that would work, if you want to maintain those tables and triggers locally.

Real-time Intake

Multiple possibilities for source systems sending real-time data to the SGS:

- Web service endpoints
- Message queues (e.g., JMS)

The design of the SGS doesn’t limit the transport possibilities as long as the data is convertible to JSON. A hypothetical endpoint sketch follows.
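A minimal sketch of what a real-time web service endpoint could look like, assuming JAX-RS; the path and class names are invented, not the published SGS API:

import javax.ws.rs.*;
import javax.ws.rs.core.*;

@Path("/sor/{sorName}/objects")
public class RealTimeIntakeEndpoint {

    // Accept one identity object from a SOR. Any payload convertible to JSON
    // works; here we assume the client already sends JSON.
    @PUT
    @Path("/{sorObjKey}")
    @Consumes(MediaType.APPLICATION_JSON)
    public Response submit(@PathParam("sorName") String sorName,
                           @PathParam("sorObjKey") String sorObjKey,
                           String jsonBody) {
        // In the SGS design, the payload would be placed on the Intake Queue
        // here, then flow through JSON conversion and storage asynchronously.
        return Response.accepted().build();
    }
}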

The Registry Database

In addition to the raw JSON data stored from the SGS, other BPR components process and provision to our “Person Model” tables in our Registry. For the SGS, we’re only involved with storing the raw JSON.

We’ve chosen PostgreSQL for our Registry.

Why PostgreSQL?

- It’s open source.
- It’s fast and stable.
- It’s been around forever and isn’t going anywhere. (Side bit of trivia: PostgreSQL was created at UC Berkeley in 1986 by Professor Michael Stonebraker and his graduate students. http://www.postgresql.org/about/history/)
- Recent feature development has been impressive: the JSONB feature was introduced in PostgreSQL 9.4.
- Our database team supports it.

What’s JSONB? Why use it?

At its core, it is the ability to store JSON text in a JSONB column where the JSON fields can be indexed as if they were table columns. That makes the JSON itself queryable with the JSONB features.

Very powerful for storing any kind of source data from any kind of SOR, while still being able to query it when we need to. An example query follows.
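A sketch of querying the stored JSON with PostgreSQL’s JSONB operators, run against the SORObject table shown in the appendix; the “firstName” field inside the JSON is a hypothetical example:

import java.sql.*;

public class JsonbQueryExample {

    // The ->> operator extracts a JSONB field as text. A GIN index on the
    // objjson column (CREATE INDEX ON SORObject USING GIN (objjson)) makes
    // containment-style JSONB queries efficient.
    public static void findByJsonField(Connection registryDb, String firstName) throws SQLException {
        String sql = "SELECT sorobjkey, objjson FROM SORObject WHERE objjson ->> 'firstName' = ?";
        try (PreparedStatement ps = registryDb.prepareStatement(sql)) {
            ps.setString(1, firstName);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("sorobjkey") + ": " + rs.getString("objjson"));
                }
            }
        }
    }
}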

Conclusion

Will the Berkeley Person Registry be open sourced? No timeline, but we’re encouraged by progress on this.

Appendix slides attached showing SGS database tables and some example queries.

Questions?

By: Brian Koehmstedt

Content License: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Appendix: SORObject

Column                 | Type                     | Modifiers
-----------------------+--------------------------+-----------
id                     | bigint                   | not null
sorid                  | smallint                 | not null
sorobjkey              | character varying(255)   | not null
uid                    | character varying(64)    |
sorquerytime           | timestamp with time zone | not null
jsonversion            | integer                  | not null
objjson                | jsonb                    | not null
objjsonlastupdated     | timestamp with time zone | not null
messageid              | character varying(128)   |
messagetimestamp       | timestamp with time zone |
messageredeliverycount | integer                  |
hash                   | bigint                   |
hashversion            | integer                  |
timecreated            | timestamp with time zone | not null
timeupdated            | timestamp with time zone | not null


Appendix: SORObjectChecksum

Column      | Type                     | Modifiers
------------+--------------------------+-----------
sorid       | smallint                 | not null
sorobjkey   | character varying(255)   | not null
hash        | bigint                   | not null
hashversion | integer                  | not null
timemarker  | timestamp with time zone | not null


Appendix: SORObjectChecksumQuery

Column               | Type                     | Modifiers
---------------------+--------------------------+-----------
sorid                | smallint                 | not null
querytime            | timestamp with time zone | not null
objectquantity       | integer                  | not null
querydurationseconds | integer                  |


Appendix: SOR

Column  | Type                  | Modifiers
--------+-----------------------+-----------
sorid   | smallint              | not null
sorname | character varying(64) | not null

Copyright © 2015, The Regents of the University of California

Code License: BSD Two-Clause