an overview of neo4j internals

42
An overview of Neo4j Internals Tobias Lindaaker Hacker @ Neo Technology [email protected] twitter: @thobe, #neo4j (@neo4j) web: neo4j.org neotechnology.com my web: thobe.org Monday, May 21, 2012

Upload: tobias-lindaaker

Post on 15-Jan-2015

21.610 views

Category:

Technology


16 download

DESCRIPTION

 

TRANSCRIPT

Page 1: An overview of Neo4j Internals

An overview of

Neo4j Internals

Tobias LindaakerHacker @ Neo Technology

[email protected]: @thobe, #neo4j (@neo4j)web: neo4j.org neotechnology.commy web: thobe.org

Monday, May 21, 2012

Page 2: An overview of Neo4j Internals

Traversals Core API Cypher

Node/RelationshipObject cache Thread local diffs

FS Cache

Record files Transaction log

Outline

2

Disk(s)

HA

This is a rough structure of how the pieces of Neo4j fit together.

This talk will not cover how disks/fs works, we just assume it does.Nor will it cover the “Core API”, you are assumed to know it.

Monday, May 21, 2012

Page 3: An overview of Neo4j Internals

Traversals Core API Cypher

Node/RelationshipObject cache Thread local diffs

FS Cache

Record files Transaction log

Outline

3

Disk(s)

HA

Let’s start at the bottom: the on disk storage file layout.

Monday, May 21, 2012

Page 4: An overview of Neo4j Internals

Simple sample graph. It all boils down to linked lists of fixed size records on disk.Properties are stored as a linked list of property records, each holding key+value.Each node/relationship references its first property record.The Nodes also reference the first node in its relationship chain.Each Relationship references its start and end node.It also references the prev/next relationship record for the start/end node respectively

Your graph on disk

4

KNOWS

KNOWS

KNOWS

KNOWS

KNOWS

Name: AlistairAge: 34

Name: JimAge: 37Stuff: good

Name: TobiasAge: 27Nationality: Swedish

Name: IanAge: 42

Monday, May 21, 2012

Page 5: An overview of Neo4j Internals

Simple sample graph. It all boils down to linked lists of fixed size records on disk.Properties are stored as a linked list of property records, each holding key+value.Each node/relationship references its first property record.The Nodes also reference the first node in its relationship chain.Each Relationship references its start and end node.It also references the prev/next relationship record for the start/end node respectively

Your graph on disk

4

KNOWS

KNOWS

KNOWS

KNOWS

KNOWS

NameIan

Age42

NameAlistair

Age34

NameJim

Age37

Stuffgood

NameTobias

Age27

Nationality

Swedish

Monday, May 21, 2012

Page 6: An overview of Neo4j Internals

Simple sample graph. It all boils down to linked lists of fixed size records on disk.Properties are stored as a linked list of property records, each holding key+value.Each node/relationship references its first property record.The Nodes also reference the first node in its relationship chain.Each Relationship references its start and end node.It also references the prev/next relationship record for the start/end node respectively

Your graph on disk

4

KNOWS

KNOWS

KNOWS

KNOWS

KNOWS

NameIan

Age42

NameAlistair

Age34

NameJim

Age37

Stuffgood

NameTobias

Age27

Nationality

Swedish

Monday, May 21, 2012

Page 7: An overview of Neo4j Internals

Simple sample graph. It all boils down to linked lists of fixed size records on disk.Properties are stored as a linked list of property records, each holding key+value.Each node/relationship references its first property record.The Nodes also reference the first node in its relationship chain.Each Relationship references its start and end node.It also references the prev/next relationship record for the start/end node respectively

Your graph on disk

4

SP EPSN ENKNOWS

EPSNKNOWS

EPSNKNOWS

SN ENKNOWS

SP EP

KNOWS

SP

EN

EP

EN

SP

SN

SP

ENNameIan

Age42

NameAlistair

Age34

NameJim

Age37

Stuffgood

NameTobias

Age27

Nationality

Swedish

Monday, May 21, 2012

Page 8: An overview of Neo4j Internals

Simple sample graph. It all boils down to linked lists of fixed size records on disk.Properties are stored as a linked list of property records, each holding key+value.Each node/relationship references its first property record.The Nodes also reference the first node in its relationship chain.Each Relationship references its start and end node.It also references the prev/next relationship record for the start/end node respectively

Your graph on disk

4

SP EPSN ENKNOWS

EPSNKNOWS

EPSNKNOWS

SN ENKNOWS

SP EP

KNOWS

SP

EN

EP

EN

SP

SN

SP

ENNameIan

Age42

NameAlistair

Age34

NameJim

Age37

Stuffgood

NameTobias

Age27

Nationality

Swedish

Monday, May 21, 2012

Page 9: An overview of Neo4j Internals

Simple sample graph. It all boils down to linked lists of fixed size records on disk.Properties are stored as a linked list of property records, each holding key+value.Each node/relationship references its first property record.The Nodes also reference the first node in its relationship chain.Each Relationship references its start and end node.It also references the prev/next relationship record for the start/end node respectively

Your graph on disk

4

SP EPSN ENKNOWS

EPSNKNOWS

EPSNKNOWS

SN ENKNOWS

SP EP

KNOWS

SP

EN

EP

EN

SP

SN

SP

ENNameIan

Age42

NameAlistair

Age34

NameJim

Age37

Stuffgood

NameTobias

Age27

Nationality

Swedish

Monday, May 21, 2012

Page 10: An overview of Neo4j Internals

Simple sample graph. It all boils down to linked lists of fixed size records on disk.Properties are stored as a linked list of property records, each holding key+value.Each node/relationship references its first property record.The Nodes also reference the first node in its relationship chain.Each Relationship references its start and end node.It also references the prev/next relationship record for the start/end node respectively

Your graph on disk

4

SP EPSN ENKNOWS

EPSNKNOWS

EPSNKNOWS

SN ENKNOWS

SP EP

KNOWS

SP

EN

EP

EN

SP

SN

SP

ENNameIan

Age42

NameAlistair

Age34

NameJim

Age37

Stuffgood

NameTobias

Age27

Nationality

Swedish

Monday, May 21, 2012

Page 11: An overview of Neo4j Internals

Store files๏Node store

๏Relationship store

•Relationship type store

๏Property store

•Property key store

•(long) String store

•(long) Array store5

Short string and array values are inlined in the property store, long values are stored in separate store files.

Monday, May 21, 2012

Page 12: An overview of Neo4j Internals

Neo4j Storage Record LayoutNode (9 bytes)inUse

1

nextRelId

5

nextPropId

9

Relationship (33 bytes)inUse

1

firstNode

5

secondNode

9

relationshipType

13

firstPrevRelId

17

firstNextRelId

21

secondPrevRelId

25

secondNextRelId

29

nextPropId

33

Relationship Type (5 bytes)inUse

1

typeBlockId

5

Property (33 bytes)inUse

1

type

3

keyIndexId

5

propBlock

29

nextPropId

33

Property Index (9 bytes)inUse

1

propCount

5

keyBlockId

9

Dynamic Store (125 bytes)inUse

1

next

5

data

NeoStore (5 bytes)inUse

1

datum

5

Monday, May 21, 2012

Page 13: An overview of Neo4j Internals

Traversals Core API Cypher

Node/RelationshipObject cache Thread local diffs

FS Cache

Record files Transaction log

Outline

7

Disk(s)

HA

Next: The two levels of cache in Neo4j.The low level FS Cache for the record files.And the high level Object cache storing a structure more optimized for traversal.

Monday, May 21, 2012

Page 14: An overview of Neo4j Internals

The caches๏Filesystem cache:

•Caches regions of the store files(divides each store file into equally sized regions)

•The cache holds a fixed number of regions for each file

•Regions are evicted based on a LFU-like policy(hit count vs. miss count, i.e. hit in non-cached region)

•Default implementation of regions uses OS mmap

๏Node/Relationship cache

•Cache a version more optimized for traversals

8

Monday, May 21, 2012

Page 15: An overview of Neo4j Internals

9

What we put in cache

Node

Relationship

ID

ID start end type

type 1

type 2

...

Key 1

Val 1

Key 2Val 2

... Key n

Val n

Key 1

Val 1

Key 2

Val 2

... Key n

Val n

in:outin:

out

R 1 R 2 ... R nR 1 R 2 ... R nR 1 R 2 R 3 ...R 1 ... R n

Relationship ID refs

(grouped by type)

R n

The structure of the elements in the high level object cache.

On disk most of the information is contained in the relationship records, with the nodes just referencing their first relationship. In the cache this is turned around: the nodes hold references to all its relationships. The relationships are simple, only holding its properties.

The relationships for each node is grouped by RelationshipType to allow fast traversal of a specific type.

All references (dotted arrows) are by ID, and traversals do indirect lookup through the cache.

Monday, May 21, 2012

Page 16: An overview of Neo4j Internals

Traversals Core API Cypher

Node/RelationshipObject cache Thread local diffs

FS Cache

Record files Transaction log

Outline

10

Disk(s)

HA

So how do traversals work...

Monday, May 21, 2012

Page 17: An overview of Neo4j Internals

Traversals - how do they work?๏RelationshipExpanders: given (a path to) a node, returns

Relationships to continue traversing from that node

๏Evaluators: given (a path to) a node, returns whether to:

•Continue traversing on that branch (i.e. expand) or not

• Include (the path to) the node in the result set or not

๏Then a projection to Path, Node, or Relationship applied to each Path in the result set.

... but also:

๏Uniqueness level: policy for when it is ok to revisit a nodethat has already been visited

๏ Implemented on top of the Core API

11

The surface layer, the you interact with.

Monday, May 21, 2012

Page 18: An overview of Neo4j Internals

More on Traversals๏Fetch node data from cache - non-blocking access

•If not in cache, retrieve from storage, into cache

‣If region is in FS cache: blocking but short duration access

‣If region is outside FS cache: blocking slower access

๏Get relationships from cached node

•If not fetched, retrieve from storage, by following chains

๏Expand relationship(s) to end up on next node(s)

•The relationship knows the node, no need to fetch it yet

๏Evaluate

•possibly emitting a Path into the result set

๏Repeat 12

This is what happens under the hood.

Monday, May 21, 2012

Page 19: An overview of Neo4j Internals

Traversals Core API Cypher

Node/RelationshipObject cache Thread local diffs

FS Cache

Record files Transaction log

Outline

13

Disk(s)

HA

How is Cypher different? and how dowes it work?

Monday, May 21, 2012

Page 20: An overview of Neo4j Internals

Cypher - Just convenient traversal descriptions?๏Builds on the same infrastructure as Traversals - Expanders

• but not on the full Traversal system

๏Uses graph pattern matching for traversing the graph

•Recursive matching with backtrackingSTART x=... MATCH x-->y, x-->z, y-->z, z-->a-->b, z-->b

14

Red: pattern graphBlue: actual graphGreen: start nodePurple: matches

Monday, May 21, 2012

Page 21: An overview of Neo4j Internals

Cypher - Just convenient traversal descriptions?๏Builds on the same infrastructure as Traversals - Expanders

• but not on the full Traversal system

๏Uses graph pattern matching for traversing the graph

•Recursive matching with backtrackingSTART x=... MATCH x-->y, x-->z, y-->z, z-->a-->b, z-->b

14

Red: pattern graphBlue: actual graphGreen: start nodePurple: matches

Monday, May 21, 2012

Page 22: An overview of Neo4j Internals

Cypher - Just convenient traversal descriptions?๏Builds on the same infrastructure as Traversals - Expanders

• but not on the full Traversal system

๏Uses graph pattern matching for traversing the graph

•Recursive matching with backtrackingSTART x=... MATCH x-->y, x-->z, y-->z, z-->a-->b, z-->b

14

Red: pattern graphBlue: actual graphGreen: start nodePurple: matches

Monday, May 21, 2012

Page 23: An overview of Neo4j Internals

Cypher - Just convenient traversal descriptions?๏Builds on the same infrastructure as Traversals - Expanders

• but not on the full Traversal system

๏Uses graph pattern matching for traversing the graph

•Recursive matching with backtrackingSTART x=... MATCH x-->y, x-->z, y-->z, z-->a-->b, z-->b

14

Red: pattern graphBlue: actual graphGreen: start nodePurple: matches

Monday, May 21, 2012

Page 24: An overview of Neo4j Internals

Cypher - Just convenient traversal descriptions?๏Builds on the same infrastructure as Traversals - Expanders

• but not on the full Traversal system

๏Uses graph pattern matching for traversing the graph

•Recursive matching with backtrackingSTART x=... MATCH x-->y, x-->z, y-->z, z-->a-->b, z-->b

14

Red: pattern graphBlue: actual graphGreen: start nodePurple: matches

Monday, May 21, 2012

Page 25: An overview of Neo4j Internals

Cypher - Just convenient traversal descriptions?๏Builds on the same infrastructure as Traversals - Expanders

• but not on the full Traversal system

๏Uses graph pattern matching for traversing the graph

•Recursive matching with backtrackingSTART x=... MATCH x-->y, x-->z, y-->z, z-->a-->b, z-->b

14

Red: pattern graphBlue: actual graphGreen: start nodePurple: matches

Monday, May 21, 2012

Page 26: An overview of Neo4j Internals

Cypher - Just convenient traversal descriptions?๏Builds on the same infrastructure as Traversals - Expanders

• but not on the full Traversal system

๏Uses graph pattern matching for traversing the graph

•Recursive matching with backtrackingSTART x=... MATCH x-->y, x-->z, y-->z, z-->a-->b, z-->b

14

Red: pattern graphBlue: actual graphGreen: start nodePurple: matches

Monday, May 21, 2012

Page 27: An overview of Neo4j Internals

Cypher - Just convenient traversal descriptions?๏Builds on the same infrastructure as Traversals - Expanders

• but not on the full Traversal system

๏Uses graph pattern matching for traversing the graph

•Recursive matching with backtrackingSTART x=... MATCH x-->y, x-->z, y-->z, z-->a-->b, z-->b

14

Red: pattern graphBlue: actual graphGreen: start nodePurple: matches

Monday, May 21, 2012

Page 28: An overview of Neo4j Internals

Cypher - Just convenient traversal descriptions?๏Builds on the same infrastructure as Traversals - Expanders

• but not on the full Traversal system

๏Uses graph pattern matching for traversing the graph

•Recursive matching with backtrackingSTART x=... MATCH x-->y, x-->z, y-->z, z-->a-->b, z-->b

14

Red: pattern graphBlue: actual graphGreen: start nodePurple: matches

Monday, May 21, 2012

Page 29: An overview of Neo4j Internals

What about gremlin?๏gremlin is a third party language, built by Marko Rodriguez of

Tinkerpop (a group of people who like to hack on graphs)

๏Originally based on the idea of using xpath to describe traversals:./HAS_CART/CONTAINS_ITEM/PURCHASED/PURCHASEDbut bastardized to distinguish between nodes and relationships:./outE[label=HAS_CART]/inV /outE[label=CONTAINS_ITEM]/inV /inE[label=PURCHASED]/outV /outE[label=PURCHASED]/inV

๏xpath is not complete enough to express full algorithms, it needs a host language, gremlin originally defined its own.This changed Groovy as a more complete host language and abandoned xpath in favor of method chaining[ replace ‘/’ with ‘.’ ]

15

Traversals are close to xpath, which is why xpath like descriptions of traversals seemed like a good idea.

Monday, May 21, 2012

Page 30: An overview of Neo4j Internals

Gremlin compared to Cypher๏start me=node:people(name={myname})

match me-[:HAS_CART]->cart-[:CONTAINS_ITEM]->itemitem<-[:PURCHASED]-user-[:PURCHASED]->recommendationreturn recommendation

๏Cypher is declarative, describes what data to get - its shape

๏Gremlin is imperative, prescribes how to get the data

๏Cypher has more opportunities for optimization by the engine

๏Gremlin can implement pagerank, Cypher can’t (yet?)

16

Monday, May 21, 2012

Page 31: An overview of Neo4j Internals

Traversals Core API Cypher

Node/RelationshipObject cache Thread local diffs

FS Cache

Record files Transaction log

Outline

17

Disk(s)

HA

Transactions involve two parts:The (thread local) changes being done by an active transaction,and the transaction replay log for recovery.

Monday, May 21, 2012

Page 32: An overview of Neo4j Internals

Transaction Isolation๏Mutating operations are not written when performed

๏They are stored in a thread confined transaction state object

๏This prevents other threads from seeing uncommitted changes from the transactions of other threads

๏When Transaction.finish() is invoked the transaction is either committed or rolled back

๏Rollback is simple: discard the transaction state object

18

Monday, May 21, 2012

Page 33: An overview of Neo4j Internals

Transactions & Durability๏Commit is:

•Changes made in the transaction are collected as commands

•Commands are sorted to get predictable update order

‣This prevents concurrent readers from seeing inconsistent data when the changes are applied to the store

•Write changes (in sorted order) to the transaction log

•Mark the transaction as committed in the log

•Apply the changes (in sorted order) to the store files

19

Monday, May 21, 2012

Page 34: An overview of Neo4j Internals

Recovery๏Transaction commands dictate state, they don’t modify state

• i.e. SET property "count" to 5

• rather than ADD 1 to property "count"

๏Thus: Applying the same command twice yields the same state

๏Recovery simply replays all transactions since the last safe point

๏ If tx A mutates node1.name, then tx B also mutates node1.name that doesn’t matter, because the database is not recovered until all transactions have been replayed

20

Monday, May 21, 2012

Page 35: An overview of Neo4j Internals

Traversals Core API Cypher

Node/RelationshipObject cache Thread local diffs

FS Cache

Record files Transaction log

Outline

21

Disk(s)

HA

High Availability in Neo4j builds on top of the transaction replay

Monday, May 21, 2012

Page 36: An overview of Neo4j Internals

Traversals Core API Cypher

Node/RelationshipObject cache Thread local diffs

FS Cache

Record files Transaction log

Outline

22

Disk(s)

HA

Shared

Local

The transaction logs are shared between all instances in an High Availability setup, all other parts operate on the local data just like in the standalone case.

Monday, May 21, 2012

Page 37: An overview of Neo4j Internals

• HA - the parts to it:๏Based on streaming transactions between servers

๏All transactions are committed through the master

•Then (eventually) applied to the slaves

• Eventuality defined by the update intervalor when synchronization is mandated by interaction

๏When writing to a slave:

• Locks coordinated through the master

•Transaction data buffered on the slaveapplied first on the master to get a txidthen applied with the same txid on the slave

23

Monday, May 21, 2012

Page 38: An overview of Neo4j Internals

Creating new Nodes and Relationships๏New Nodes/Relationships don’t need locks, so they don’t need a

transaction synced with master until the transaction commit

๏They do need an ID that is unique and equal among all instances

๏Each instance allocates IDs in blocks from the master, then assigns them to new Nodes/Relationships locally

•This batch allocation can be seen in (Enterprise) WebAdmin as Node/Relationship counts jumping in steps of 1000

24

Monday, May 21, 2012

Page 39: An overview of Neo4j Internals

HA synchronization points๏Transactions are uniquely identified by monotonically increasing ID

๏All Requests from slave send the current latest txid on that slave

๏Responses from master send back a stream of transactions that have occurred since then, along with the actual result

๏Transaction is replayed just like when committed / recovered

๏Nodes/Relationships touched during this application are evicted from cache to avoid cache staleness

๏Transaction commands are only sorted when created, before stored/transmitted, thus consistency is preserved during all application phases

25

Monday, May 21, 2012

Page 40: An overview of Neo4j Internals

Locking semantics of HA๏To be granted a lock the slave must have the latest version of the

Node/Relationship it is trying to lock

•This ensures consistency

•The implementation of “Latest version of the Node/Relationship” is “Latest version of the entire graph”

•The slave must thus sync transactions from the master

26

Monday, May 21, 2012

Page 41: An overview of Neo4j Internals

Master election๏Each instance communicates/coordinates:

• its latest transaction id (including the master id for that tx)

• the id for that instance

• (logical) clock value for when the txid was written๏Election chooses:

1. The instance with highest txid2.IF multiple: The instance that was master for that tx3.IF unavailable: The instance with the lowest clock value4.IF multiple: The instance with the lowest id

๏Election happens when the current master cannot be reached

•Any instance can choose to re-elect

• Each instance runs the election protocol individually

•Notify others when election chooses new master๏When elected, the new master broadcasts to all instances,

forcing them to bind to the new master 27

Monday, May 21, 2012

Page 42: An overview of Neo4j Internals

Tobias LindaakerHacker @ Neo Technology

[email protected]: @thobe, #neo4j (@neo4j)web: neo4j.org neotechnology.commy web: thobe.org

Thank you for listening!

Monday, May 21, 2012