Pannonian Coder: graph


Wednesday, October 2, 2013

Neo4j model for (SQL) dummies

In general, one characteristic of the mind is that it has a hard time grasping new concepts if they are presented without comparison to familiar ones. I experienced that when trying to explain the Neo4j data model to people who are encountering it for the first time. Mostly they are confused by the lack of schema: when visualized, those scattered graph nodes, connected into some kind of spider web, bring confusion to minds long accustomed to the nicely ordered, rectangular SQL world.

So what seemed to work better in this case is to describe it using the all-too-familiar RDBMS/SQL model and its elements: tables, columns, records, foreign keys... In other words, let's try to describe the Neo4j graph model as it would look if built on top of the SQL model.

Actually, this is quite easy to do. We need just two tables; let's call them "NODES" and "RELATIONSHIPS". They reflect the two main elements of the Neo4j model: graph nodes and the relationships between them.

"NODES" table

This is where entities would be stored, and it contains two columns: "ID" and "PROPERTIES".

ID	PROPERTIES
334	{"name": "John Doe", "age": 31, "salary": 80000}
335	{"name": "ACME Inc.", "address": "Broadway 345, New York City, NY"}
336	{"manufacturer": "Toyota", "model": "Corolla", "year": 2005}
337	{"name": "Annie Doe", "age": 30, "salary": 82000}

The PROPERTIES column stores a map-like data structure containing arbitrary properties with their values. Just for the purpose of presentation, I picked JSON serialization here. Due to this schema-less design, there are no constraints on which properties the PROPERTIES column contains, which is actually the only practical way, since all entity types (department, company, employee, vehicle...) are stored in this single table.

"RELATIONSHIPS" table

This table would contain "ID", "NAME", "SOURCE_NODE_ID", "TARGET_NODE_ID" and "PROPERTIES" columns, and its purpose is to store associations between nodes. We can say that the records stored here represent a schema-less version of SQL foreign keys.

ID	NAME	SOURCE_NODE_ID	TARGET_NODE_ID	PROPERTIES
191	MARRIED_TO	334	337	{"wedding_date": "20070213"}
192	OWNS	337	336
193	WORKS_FOR	337	335	{"job-position": "IT manager"}

A relationship's NAME marks its "type", and we can add new association "types" to the system dynamically, just by storing new relationship records with previously non-existing names, whereas in an SQL database we need to define the available foreign keys upfront.

Every relationship in Neo4j has a direction (though it can be traversed either way, and a query may ignore the direction), so we have "SOURCE_NODE_ID" and "TARGET_NODE_ID" foreign keys pointing to the respective NODES records. Direction is mainly valuable for its semantic purpose.

Similar to the NODES table, here we also have a PROPERTIES column to store additional information about the association; in the SQL world we would need to introduce a "link" table to store this kind of data.
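To make the two-table picture concrete, here is a minimal in-memory sketch in Java. The names NodeRow, RelationshipRow, GraphTables and targetsOf are made up for this illustration; they only mirror the hypothetical tables described above and say nothing about how Neo4j actually stores or traverses data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// One row of the hypothetical NODES table: an id plus a free-form property map.
record NodeRow(long id, Map<String, Object> properties) {}

// One row of the hypothetical RELATIONSHIPS table.
record RelationshipRow(long id, String name, long sourceNodeId, long targetNodeId,
                       Map<String, Object> properties) {}

// In-memory stand-in for the two tables (an assumption for illustration only).
class GraphTables {
    final List<NodeRow> nodes = new ArrayList<>();
    final List<RelationshipRow> relationships = new ArrayList<>();

    // Follows outgoing relationships with the given name, much like
    // joining RELATIONSHIPS to NODES on TARGET_NODE_ID.
    List<NodeRow> targetsOf(long sourceNodeId, String relationshipName) {
        List<NodeRow> result = new ArrayList<>();
        for (RelationshipRow rel : relationships) {
            if (rel.sourceNodeId() == sourceNodeId && rel.name().equals(relationshipName)) {
                for (NodeRow node : nodes) {
                    if (node.id() == rel.targetNodeId()) {
                        result.add(node);
                    }
                }
            }
        }
        return result;
    }

    // Loads the sample rows shown in the two tables above.
    static GraphTables sample() {
        GraphTables g = new GraphTables();
        g.nodes.add(new NodeRow(334, Map.of("name", "John Doe", "age", 31, "salary", 80000)));
        g.nodes.add(new NodeRow(335, Map.of("name", "ACME Inc.",
                "address", "Broadway 345, New York City, NY")));
        g.nodes.add(new NodeRow(336, Map.of("manufacturer", "Toyota", "model", "Corolla", "year", 2005)));
        g.nodes.add(new NodeRow(337, Map.of("name", "Annie Doe", "age", 30, "salary", 82000)));
        g.relationships.add(new RelationshipRow(191, "MARRIED_TO", 334, 337,
                Map.of("wedding_date", "20070213")));
        g.relationships.add(new RelationshipRow(192, "OWNS", 337, 336, Map.of()));
        g.relationships.add(new RelationshipRow(193, "WORKS_FOR", 337, 335,
                Map.of("job-position", "IT manager")));
        return g;
    }
}
```

With the sample data, GraphTables.sample().targetsOf(337, "OWNS") returns the single Toyota row. Introducing a brand new relationship "type" is just one more RelationshipRow with a new name, with no schema change involved.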

Recap

Having no schema brings a well-known trade-off to the table. On one hand, the structure of such a system is less obvious, and special care has to be taken not to corrupt the data. On the other hand, the resulting flexibility can be exploited for domains that are rich and rapidly changing. And of course, since the database imposes no constraints here, the application is now solely responsible for the correctness of the stored data.

Monday, September 16, 2013

Referencing non-indexed Neo4j entities in service layer

The "idiomatic" way to index entities in Neo4j is to do so for only a few types, usually the ones that are used most often or that for some reason are most practical to access directly. Of course, top-level entities (such as Company or User in some business domains) simply have to be indexed, since they cannot be fetched via some other entity.

So, let's say we have two types of entities, Company and Department, in a one-to-many relationship. Company would have to be indexed, but Department would not, because it can be reached by traversal starting from the parent Company. This fetching via traversal is actually one of the best selling points of Neo4j, because the speed of that operation generally doesn't depend on the size of the whole dataset, unlike SQL databases, which have to JOIN different tables through their indexes, so that performance ultimately depends on table size.

Anyway, all seems good, but it can have some impact on your service layer.

Previously, with your Department entities indexed, a service layer operation needed only one argument to reference the entity:

 public interface DepartmentManager {  
  void activateDepartment(UUID departmentUuid);  
 ...  
 }

Now we must introduce another argument that identifies the parent Company, so that we can traverse the graph to the Department in question:

 public interface DepartmentManager {  
  void activateDepartment(UUID companyUuid, UUID departmentUuid);  
 ...  
 }
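What an implementation of the two-argument version might look like can be sketched in memory, without any Neo4j API. Everything here (CompanyNode, DepartmentNode, TraversingDepartmentManager, the register method and the active flag) is a hypothetical stand-in: a map plays the role of the Company index, and a plain list plays the role of the Company-to-Department relationships.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.UUID;

// Hypothetical stand-in for a Department node; it is NOT indexed anywhere.
class DepartmentNode {
    final UUID uuid;
    boolean active;

    DepartmentNode(UUID uuid) {
        this.uuid = uuid;
    }
}

// Hypothetical stand-in for a Company node.
class CompanyNode {
    final UUID uuid;
    // Plays the role of the outgoing Company -> Department relationships.
    final List<DepartmentNode> departments = new ArrayList<>();

    CompanyNode(UUID uuid) {
        this.uuid = uuid;
    }
}

class TraversingDepartmentManager {
    // Plays the role of the index: only companies can be looked up directly.
    private final Map<UUID, CompanyNode> companyIndex = new HashMap<>();

    void register(CompanyNode company) {
        companyIndex.put(company.uuid, company);
    }

    void activateDepartment(UUID companyUuid, UUID departmentUuid) {
        // Step 1: indexed lookup of the top-level entity.
        CompanyNode company = companyIndex.get(companyUuid);
        if (company == null) {
            throw new NoSuchElementException("Unknown company: " + companyUuid);
        }
        // Step 2: "traverse" from the company to the department in question.
        for (DepartmentNode department : company.departments) {
            if (department.uuid.equals(departmentUuid)) {
                department.active = true;
                return;
            }
        }
        throw new NoSuchElementException("Company has no department: " + departmentUuid);
    }
}
```

The cost of the second step depends on how many departments hang off that one company, not on how many departments exist overall, which is the property the traversal argument above relies on.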

Of course, one can argue that we could index Department entities too, to simplify accessing them. But the same reasoning can lead us to index almost every type of entity that we want to operate on at the service layer, and we surely want to avoid that for the reasons described at the beginning of this post.

Friday, September 6, 2013

Neo4j and beauty of role-based entity referencing

Polyglot persistence is all the rage now, and one of the more exotic types of databases around is the graph DB, so we decided to give it a shot for a part of a larger system. We picked Neo4j. Even if we were not a Java shop, we would probably have stumbled on it anyway, since it definitely looks like the most popular graph database right now.

After working with it for some time I noticed a thing that I really like: the object-graph mismatch is much smaller than the object-relational one. Although there are numerous places where the object-relational mismatch shows its face, one of the things that bothered me the most is that I always had to take good care of what type/role I would be referencing some object with.

In Java land, even with as poor a meta-model as it has (compared to some more exotic languages out there), we can reference an entity from another one in many ways: using a class, a subclass or an interface. And we all know that one of the great principles of good OO design is to reference objects by their role, which can be expressed with any of the mentioned language constructs. An interface, if sufficient, is usually the preferred way to express an object's role.

Here's an example ...

Let's say we have a User class that has a reference to its owner entity, described by the UserOwner interface. This UserOwner interface is the role that the owner entity plays in that case.

 public class User {  
   private String name;  
   private UserOwner owner;  
 ....  
 }

And let's say this UserOwner role can be played by multiple different entities: company and department. Let's even say that users themselves can be the owners of other users. If we were to express this in Java, we would implement the UserOwner interface in several classes:

 public class User implements UserOwner {  
 ...  
 }  
 public class Company implements UserOwner {  
 ...  
 }  
 public class Department implements UserOwner {  
 ...  
 }

So how would we map this case to the SQL world? We would have a USERS table, but also COMPANIES and DEPARTMENTS tables.

And to express the reference from a user to its owner, we would need a foreign key that points from the USERS table to .... to.... to what? We don't have a concept in the SQL world that would "mark" the USERS, DEPARTMENTS and COMPANIES tables as belonging to some "USER_OWNERS" type, so that we could define the foreign key against that target. The problem is that the SQL meta-model is still much poorer than the OO meta-model: it doesn't have a concept of supertables (for hierarchies of tables), or some other concept that would mark records as being of multiple types.

In Neo4j it is straightforward. Unlike an SQL database, it is schema-less, so we don't burden ourselves with types; we just have a special relationship type (named, let's say, "BELONGS_TO") that corresponds to the association between a User and its UserOwner.

Different users reference their owners via the BELONGS_TO relationship, regardless of whether that owner is a company, a department or another user. Now you can write simple Cypher queries such as this one, which fetches the owner of some entity without any fuss:

 START user=node(<someUserId>) MATCH user-[:BELONGS_TO]->owner RETURN owner;

In the application layer, we would cast the result of that query to a UserOwner object and do with it whatever that role allows us (via the methods on that interface).
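The cast-to-role step can be sketched in plain Java as follows. This is an in-memory illustration under stated assumptions, not the Neo4j API: BelongsToGraph, link, ownerOf and the ownerLabel method are all hypothetical names, and an untyped map stands in for the result of the query above.

```java
import java.util.HashMap;
import java.util.Map;

// The role from the example; ownerLabel() is a hypothetical method added for illustration.
interface UserOwner {
    String ownerLabel();
}

record Company(String name) implements UserOwner {
    public String ownerLabel() {
        return "company " + name;
    }
}

record Department(String name) implements UserOwner {
    public String ownerLabel() {
        return "department " + name;
    }
}

record User(String name) implements UserOwner {
    public String ownerLabel() {
        return "user " + name;
    }
}

// In-memory stand-in for BELONGS_TO relationships (an assumption, not Neo4j itself).
class BelongsToGraph {
    // Values are deliberately untyped, like nodes coming back from a query.
    private final Map<User, Object> belongsTo = new HashMap<>();

    void link(User user, Object owner) {
        belongsTo.put(user, owner);
    }

    // Mirrors "MATCH user-[:BELONGS_TO]->owner RETURN owner" followed by a cast to the role.
    UserOwner ownerOf(User user) {
        Object raw = belongsTo.get(user);
        // A missing owner stays null; a non-owner value would fail with ClassCastException.
        return (UserOwner) raw;
    }
}
```

The caller never needs to know whether the owner is a Company, a Department or another User; it only uses what the UserOwner role exposes. Since nothing in the store enforces that the target of BELONGS_TO really plays that role, the cast is where a bad record would surface.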

Sweet!