Spreadsheets are graphs too: using Neo4J as a backend to store spreadsheet information

Post on 09-May-2015

DESCRIPTION

This presentation explains how I use Neo4J as a database for a tool that calculates spreadsheet metrics.

TRANSCRIPT

Spreadsheets are graphs too! Felienne Hermans (@felienne)

In this slide deck I explain how I used Neo4J to store information on spreadsheets.

Ehm...spreadsheets? They are so tably?

Are you sure they are fit for a graph database?

Spreadsheets are mislabeled

People often think of spreadsheets as data, but...

Spreadsheets are code

I have made it my life’s work to spread the happy word

“Spreadsheets are code!”

If you don’t immediately believe me, I have three reasons*

* If you do believe me, skip the next 10 slides ;)

1) Used for similar problems

This tool (for stock price computation) could have been built in any language: C, JavaScript, COBOL, or Excel.

The problems Excel is used for are often (not always) similar to problems solved in different languages.

2) Formulas are Turing complete

I go to great lengths to make my point. To such great lengths that I built a Turing machine in Excel, using formulas only.

Here you see it in action. Every row is a consecutive step of the tape.

This makes it, in addition to a proof that formulas are Turing complete, also a nice visualization of a Turing machine.
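To make the "one row per step" idea concrete outside Excel, here is a minimal Python sketch of a Turing machine whose run is recorded as a list of rows, one per step. The machine (a unary incrementer), its state names and the `run` helper are made up for illustration; this is not the machine from the slides.

```python
from typing import Dict, List, Tuple

# Transition table: (state, symbol) -> (new symbol, head move, new state).
# Hypothetical example machine: append one '1' to a unary number.
RULES: Dict[Tuple[str, str], Tuple[str, int, str]] = {
    ("scan", "1"): ("1", +1, "scan"),  # skip over the existing 1s
    ("scan", "_"): ("1", 0, "halt"),   # write a 1 on the first blank, then stop
}


def run(tape: str, state: str = "scan", max_steps: int = 100) -> List[Tuple[str, int, str]]:
    """Return one (tape, head position, state) row per step, like rows in a sheet."""
    cells = list(tape)
    head = 0
    rows = [("".join(cells), head, state)]
    for _ in range(max_steps):
        if state == "halt":
            break
        if head == len(cells):
            cells.append("_")  # extend the tape with blanks on demand
        symbol, move, state = RULES[(state, cells[head])]
        cells[head] = symbol
        head += move
        rows.append(("".join(cells), head, state))
    return rows


rows = run("11_")
for row in rows:
    print(row)
```

Each tuple plays the role of one spreadsheet row: it is computed only from the row before it, which is the same dependency structure a formula-only implementation relies on.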

3) They suffer from the same problems

In summary: the activities, the complexity and the problems are all the same.

So if spreadsheets are code, can we apply software engineering methods?

In my dissertation, I defined smells for spreadsheet formulas.

Turns out, Fowler’s code smells are easily transferable to spreadsheets.

Pop quiz: what smell is this?

It is the ‘feature envy’ smell

See how easily this applies to spreadsheets

To analyze smells, we save spreadsheet info to a database

This is the data model that I am storing to the database.

The basics are pretty simple.

But cells can refer to each other, either directly [=A7+A9] or through a range [=SUM(A1:A5)]

In the case of a range, the range itself points to the cells it contains.

=SUM(A1:A5)

A1..A5
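The diagram itself is not part of this transcript. As a rough sketch of the idea, the model could look like this in Python; the class and field names (`Cell`, `Range`, `refers_to`, `cells`) are my own assumptions, not the names used in the actual tool.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Cell:
    address: str       # e.g. "A7"
    formula: str = ""  # e.g. "=A7+A9"; empty for plain values
    # A cell can refer to other cells directly, or to a range.
    refers_to: List[Union["Cell", "Range"]] = field(default_factory=list)


@dataclass
class Range:
    address: str       # e.g. "A1:A5"
    # The range itself points to the cells it contains.
    cells: List[Cell] = field(default_factory=list)


# =SUM(A1:A5) refers to one range, which in turn points to five cells.
a1_a5 = Range("A1:A5", cells=[Cell(f"A{i}") for i in range(1, 6)])
total = Cell("B1", formula="=SUM(A1:A5)", refers_to=[a1_a5])

# =A7+A9 refers to two cells directly.
direct = Cell("B2", formula="=A7+A9", refers_to=[Cell("A7"), Cell("A9")])
```

The point to notice is that there are two different paths from a formula cell to the cells it depends on: one hop for direct references, two hops when a range sits in between.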

You know the saying that if all you have is a hammer, everything is a nail to you.

This is what happened to me. I did not think about what type of database to use.

I just started banging with the good ol’ SQL hammer I had been using forever.

Number of worksheets in a spreadsheet

Which started out just fine!

Number of cells in a spreadsheet

Still pretty okay
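The original queries were shown as images and are not in this transcript. A minimal sketch of what such simple counts might look like, using Python's built-in `sqlite3` and a made-up schema (the table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE spreadsheet (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE worksheet   (id INTEGER PRIMARY KEY, spreadsheet_id INTEGER, name TEXT);
CREATE TABLE cell        (id INTEGER PRIMARY KEY, worksheet_id INTEGER, address TEXT);
INSERT INTO spreadsheet VALUES (1, 'budget.xlsx');
INSERT INTO worksheet VALUES (1, 1, 'Input'), (2, 1, 'Totals');
INSERT INTO cell VALUES (1, 1, 'A1'), (2, 1, 'A2'), (3, 2, 'B1');
""")

# Number of worksheets in a spreadsheet: a single table is enough.
worksheets = conn.execute(
    "SELECT COUNT(*) FROM worksheet WHERE spreadsheet_id = ?", (1,)
).fetchone()[0]

# Number of cells in a spreadsheet: one join, still quite readable.
cells = conn.execute(
    """SELECT COUNT(*) FROM cell c
       JOIN worksheet w ON c.worksheet_id = w.id
       WHERE w.spreadsheet_id = ?""",
    (1,),
).fetchone()[0]

print(worksheets, cells)
```

As long as the question follows the containment hierarchy (spreadsheet, worksheet, cell), each extra level costs one join and the queries stay manageable.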

Number of connected cells for a cell

But in order to calculate the ‘feature envy’ smell, we need the total number of connected cells, so both direct and through a range.

Let’s start with direct.

Now look at the range part.

Things start to get iffy when we combine these two query parts.

Not only is the query quite big, also this happens.
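The combined query from the slide is not in this transcript. As a hedged illustration of why it grows, here is a sketch with a made-up relational schema (`cell_ref`, `range_ref` and `range_member` are assumed table names) in which the direct part and the range part have to be written as two separate subqueries and glued together:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cell         (id INTEGER PRIMARY KEY, address TEXT);
CREATE TABLE cell_ref     (from_cell INTEGER, to_cell INTEGER);   -- =A7+A9 style
CREATE TABLE range_ref    (from_cell INTEGER, range_id INTEGER);  -- =SUM(A1:A5) style
CREATE TABLE range_member (range_id INTEGER, cell_id INTEGER);    -- range -> its cells
INSERT INTO cell VALUES (1,'A1'),(2,'A2'),(3,'A3'),(4,'A7'),(5,'B1');
INSERT INTO cell_ref VALUES (5, 4), (5, 1);            -- B1 uses A7 and A1 directly
INSERT INTO range_ref VALUES (5, 10);                  -- B1 also uses range 10
INSERT INTO range_member VALUES (10,1),(10,2),(10,3);  -- range 10 is A1:A3
""")

QUERY = """
SELECT COUNT(*) FROM (
    -- part 1: cells referenced directly
    SELECT to_cell AS cell_id FROM cell_ref WHERE from_cell = :cell
    UNION
    -- part 2: cells referenced through a range
    SELECT m.cell_id
    FROM range_ref r
    JOIN range_member m ON m.range_id = r.range_id
    WHERE r.from_cell = :cell
)
"""
connected = conn.execute(QUERY, {"cell": 5}).fetchone()[0]
print(connected)
```

`UNION` (rather than `UNION ALL`) keeps A1 from being counted twice when it is reached both directly and through the range. Every additional kind of connection would add another branch like this.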

If your tools reach their limits, this has to tell you something.

So I started thinking.

Maybe this is not a nail…

Maybe I need a different tool

It was at this time that I attended a talk about Neo4J.

And the strange thing is, I had seen a few talks about Neo before. But this time it ‘clicked’, because I was suffering from the problem that Neo could solve.

So I ended up with this model. Still spreadsheets, worksheets, cells and links.

But the ‘prec’ relation can now refer to either cells or ranges.

Turning this into this.

I wouldn’t say this is the power of Neo at work. It is the power of the right tool for the job.

There are scenarios, for sure, where the situation is the other way around.

But for my goal, Neo was a great fit.

Also, to be honest with you, I did not immediately write such super succinct Cypher queries. My first attempt was something like this:

This is basically a one-to-one translation from SQL to Neo, still with the two different ways of connecting. It took me a while to understand the power of traversal queries.
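The queries themselves are not in this transcript, so here is only a sketch of the underlying idea in plain Python rather than Cypher. Once cells and ranges are nodes in one graph, "connected cells" becomes a single walk over outgoing edges instead of two separately written query parts. The node names and the `EDGES` structure are invented for the example.

```python
from typing import Dict, List, Set

# Hypothetical graph: node name -> nodes it points to.
# Cells point to their precedents; a range points to the cells it contains.
EDGES: Dict[str, List[str]] = {
    "B1": ["A7", "A1:A3"],        # =A7+SUM(A1:A3)
    "A1:A3": ["A1", "A2", "A3"],  # range -> contained cells
}


def is_range(node: str) -> bool:
    """In this toy model, range nodes are recognised by the colon in their name."""
    return ":" in node


def connected_cells(start: str) -> Set[str]:
    """Cells reachable from `start`, either directly or through one range hop."""
    found: Set[str] = set()
    for target in EDGES.get(start, []):
        if is_range(target):
            found.update(EDGES.get(target, []))  # step through the range
        else:
            found.add(target)
    return found


print(sorted(connected_cells("B1")))
```

In a graph query language the same walk can usually be expressed as one path pattern of variable length, which is what makes the traversal style so much shorter than the translated-from-SQL style.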

Here’s another example:

Number of cells in a spreadsheet

First Cypher attempt

Still very SQLy

Second (okay, probably more like fifth) attempt. No more WHERE, just directly matching a graph pattern.

The power of Cypher :)

That’s all folks.

Spreadsheets are code

Don’t just hit things with the one hammer you know

Neo is cool for graph-like structures

It makes queries easier

But it takes some getting used to for SQL-minded brains

Liked this talk? Visit my site for more
