Graph Database Introduction - Relationships Traversal with Amazon Neptune and Gremlin
I held a study meeting to introduce Graph Database on AWS, “Graph Database Introduction - Relationships Traversal with Amazon Neptune and Gremlin”.
I would like to share the content of the presentation, in the hopes that it will provide an opportunity for you to know graph databases and Amazon Neptune. You can pull an example code used in this post from my GitHub repository.
Prerequisites
Participants are expected to:
- Be interested in graph databases and Amazon Neptune.
- Have basic knowledge of relational databases and AWS.
Goals in this post:
- Provide an overview of graph databases and Amazon Neptune.
- Demonstrate traversals using Apache TinkerPop Gremlin.
Graph DB Overview
Concept and Technical Terms
Graph databases are optimized to store and query relationships in data. While relational databases (RDB) and NoSQL databases can also handle storing and querying relationships, graph databases offer advantages in terms of performance and scalability for certain use cases. Graph databases are based on the mathematical principles of graph theory, while relational databases are based on set theory.
In graph databases, data is represented as vertices or nodes, and relationships between them are represented as edges. Vertices and edges can have labels. There are two types of edges: directed and undirected. A graph that allows vertices and edges to have key-value pairs is called the property graph.
Representative Graphs
Social Network Graphs
Graph databases are often used to represent relationships between people, such as those found on social networking sites like Facebook, LinkedIn, and others.
Knowledge Graphs
By understanding data in context, search engines can distinguish between different concepts that share the same name, such as the fruit “apple” and the technology company “Apple Inc.” The term of knowledge graph has gained popularity since the publication of the “Introducing the Knowledge Graph: things, not strings”.
Identity Graphs
Graph databases can also be used to represent relationships among different IDs, such as linking user activities across multiple sites or devices to a single user. This enables user behavior analysis. One well-known use case for this is Customer 360.
Fraud Graphs
Organizing fraud rings to collectively engage in illegal activities, while including a few legal transactions, can make it difficult to detect fraudulent activity. However, graph databases can help to easily visualize fraudulent activity by identifying patterns shared among fraud rings, such as common attributes like credit card numbers, phone numbers, social security numbers, and more.
Panama Papers
There are many use cases for graph databases these days, but one of the most famous is the Panama Papers. In 2016, a massive trove of internal documents was leaked from a Panamanian law firm. The International Consortium of Investigative Journalists (ICIJ) used a graph database to support their analysis of the data.
Comparison with RDB and NoSQL
Using different databases by use cases is very important. Basically, you should not rely on single type database which may not be ideal for specific purposes.
Item | Graph DB | RDB | NoSQL |
---|---|---|---|
Data structure | Graph | Table | Various (Document, Key-Value, etc.) |
Schema definition | loose | strict | loose |
Purpose | Storing relationships between entities | Storing entities | Various (Primary data storage, caching, etc.) |
Querying related data | Traversal | Join/Sub Query | Various |
Performance | Very high | Average (relative to the others) | Very high |
Languages | Gremlin SPARQL openCypher Cypher (Neo4j) GQL | SQL | - |
RDB performance can be improved using columnar databases like Amazon Redshift.
Amazon Neptune currently supports openCypher in production environments. Please note that Cypher and GQL are not supported.
Starting with engine release 1.1.1.0, openCypher is available for production use in Neptune.
Tree Structure in RDB
You can express a tree in RDB with the following models. In some cases, you may not need graph databases. Please note that the naive tree is not optimal for many cases, described in the “SQL Antipatterns”.
Model | Description | Pros | Cons |
---|---|---|---|
Naive Tree | t1.id = t2.parent_id | Simple implementation (Adjacency List) | |
Path Enumeration | path LIKE ‘1/2%’ | Easy to extract nodes other than adjacent ones | |
Nested Sets | Left > 1 AND right < 6 | Easy to extract nodes other than adjacent ones | |
Closure Table | Table for a tree |
Amazon Neptune Overview
Amazon Neptune is fully managed graph db service which can query billions relationships in milliseconds. It offers high availability, durability and low latency like Aurora. For pricing information, please refer to the official page.
Cluster features are the following.
Feature | Description |
---|---|
Primary instance | 1 |
Read-replica instance | Max 15 |
Replication lag | Less than 100ms in most cases |
Failover | Promotion of Read-replica |
Storage features are the following.
Feature | Description |
---|---|
Volumes | Single logical volume over a cluster |
Redundancy | 6 copies kept over 3 AZs |
Size | Automatic scaling to 64TiB |
Query Environment
You can query graphs using various SDKs such as Gremlin-Java, Gremlin-Python, Gremlin-JavaScript. REST APIs are also available.
AWS provides Neptune Workbench, fully managed environment. It comes with a pre-installed Graph Notebook which is a powerful tool for running Gremlin queries and creating graph visualizations. You can also launch your graph notebook in your local environment.
Transactions
Dirty Read
When Tx1 updates the row, Tx2 reads the row and Tx1 fails or rollbacks, Tx2 reads data which has not been reflected.
Non-repeatable Read
When Tx1 reads the row, Tx2 updates or deletes the row and commits, and Tx1 reads the row, Tx1 reads committed data different from the previous one.
Phantom Read
When Tx1 reads the records, Tx2 inserts or deletes some rows and commits, and Tx1 reads the rows, Tx1 reads the rows different from the previous ones.
Isolation Level
Level | Dirty | Non-repeatable | Phantom |
---|---|---|---|
READ UNCOMMITTED | Possible | Possible | Possible |
READ COMMITTED | Not possible | Possible | Possible |
REPEATABLE READ | Not possible | Not possible | Possible |
SERIALIZABLE | Not possible | Not possible | Not possible |
Transactions in Neptune
Neptune handles manages read-only and mutation queries using distinct transaction isolation levels.
For read-only queries, Neptune uses snapshot isolation in MultiVersion Concurrency Control (MVCC). This means that dirty reads, non-repeatable reads, and phantom reads are prevented by using a snapshot at the start of the transaction.
For mutation queries, Neptune uses READ COMMITTED isolation to prevent dirty reads. Additionally, Neptune locks a record set when reading data to prevent non-repeatable reads and phantom reads.
Traversal with Gremlin
Gremlin is a graph traversal language provided by Apache TinkerPop, an open source framework for graph db.
Gremlin traverses by Start Steps followed by Traversal Steps. The traversal steps are composed of two steps, General Steps and Terminal Steps.
Sample Data
This post uses the following data as sample. Vertices contain an age property and Edges contain a weight property that expresses closeness between vertices.
If the data were stored in an RDB, querying the data, especially regarding relationships, could become challenging. SQL queries tend to become complex, and representing relationships in a row-column table may not be intuitive.
Please feed the sample data above with the following command.
%%gremlin
is a Jupyter Notebook magic command provided by Neptune Workbench.
%%gremlin
// Drop existing data
g.V().drop()
%%gremlin
// Add Vertices
g.addV('person').property(id, 'A').property('age', 30)
.addV('person').property(id, 'B').property('age', 25)
.addV('person').property(id, 'C').property('age', 35)
.addV('person').property(id, 'D').property('age', 20)
.addV('person').property(id, 'E').property('age', 18)
.addV('person').property(id, 'F').property('age', 25)
.addV('person').property(id, 'G').property('age', 10)
.addV('person').property(id, 'H').property('age', 15)
%%gremlin
// Add Edges
g.V('A').addE('know').to(g.V('B')).property('weight', 1.0)
.V('A').addE('know').to(g.V('C')).property('weight', 0.9)
.V('A').addE('know').to(g.V('H')).property('weight', 0.5)
.V('B').addE('know').to(g.V('D')).property('weight', 0.8)
.V('B').addE('know').to(g.V('E')).property('weight', 0.4)
.V('C').addE('know').to(g.V('F')).property('weight', 0.5)
.V('C').addE('know').to(g.V('G')).property('weight', 0.6)
.V('D').addE('know').to(g.V('E')).property('weight', 0.7)
.V('H').addE('know').to(g.V('E')).property('weight', 1.0)
.V('H').addE('know').to(g.V('G')).property('weight', 1.0)
Traversal Example 1
Let’s traverse all persons with the following command.
%%gremlin
// Extract Vertices
g.V()
.project(‘id’, ‘properties’) // Projection
.by(id()).by(valueMap()) // valueMap returns properties of vertices.
Row | Data |
---|---|
1 | {'id': 'A', 'properties': {'age': [30]}} |
2 | {'id': 'B', 'properties': {'age': [25]}} |
3 | {'id': 'C', 'properties': {'age': [35]}} |
4 | {'id': 'D', 'properties': {'age': [20]}} |
5 | {'id': 'E', 'properties': {'age': [18]}} |
6 | {'id': 'F', 'properties': {'age': [25]}} |
7 | {'id': 'G', 'properties': {'age': [10]}} |
8 | {'id': 'H', 'properties': {'age': [15]}} |
Traversal Example 2
Let’s traverse persons older than 25 years and connected from A up to 2nd.
%%gremlin
// Extract persons (entities) which are older than 25 years old and connected from A up to 2nd.
g.V('A’)
.repeat(outE().inV()).times(2).emit() // Repeat traversal of adjacent vertices twice
.has(‘age’, gte(25)) // Greater than or equal 25 years old
.project('id', 'age’)
.by(id()).by(values('age'))
Row | Data |
---|---|
1 | {'id': 'B', 'age': 25} |
2 | {'id': 'C', 'age': 35} |
3 | {'id': 'F', 'age': 25} |
Traversal Example 3
Let’s traverse persons which have a multiplied weight value greater than 0.5.
%%gremlin
// Start traversal at A which extracts vertices that have a multiplied weight value greater than 0.5
g.withSack(1.0f).V(‘A’) // Sack can be used to store temporary data
// Multiply a weight value of an out-directed edge by a sack value, and traverse all in-directed vertices
.repeat(outE().sack(mult).by(‘weight’).inV().simplePath()).emit()
.where(sack().is(gt(0.5))) // A sack value greater than 0.5
.dedup() // deduplication
.project('id', 'weight’)
.by(id).by(sack())
Row | Data |
---|---|
1 | {'id': 'B', 'weight': 1.0} |
2 | {'id': 'C', 'weight': 0.9} |
3 | {'id': 'D', 'weight': 0.8} |
4 | {'id': 'G', 'weight': 0.54} |
5 | {'id': 'E', 'weight': 0.5599…} |
Visualizing Graphs
Neptune Workbench can visualize graphs.
Let’s visualize the graph with the following command.
%%gremlin -d T.id -de weight
// -d means that a property to be displayed in vertices
// -de means that a property to be displayed in edges
// Execute the traversal example 3
g.withSack(1.0f).V(‘A’) // Sack can be used to store temporary data
// Multiply a weight value of an out-directed edge by a sack value, and traverse all in-directed vertices
.repeat(outE().sack(mult).by(‘weight’).inV().simplePath()).emit()
.where(sack().is(gt(0.5))) // A sack value greater than 0.5
.dedup() // deduplication
.path() // Extract path data
.by(elementMap())
Row | Data |
---|---|
1 | path[{<T.id: 1>: 'A', <T.label: 4>: 'person', 'age': 30}, {<T.id: 1>: '8ebe47fa-901b-c6d3-a11f-0a9bf0ce8aa2', <T.label: 4>: 'know', <Direction.IN: 2>: {<T.id: 1>: 'B', <T.label: 4>: 'person'}, <Direction.OUT: 3>: {<T.id: 1>: 'A', <T.label: 4>: 'person'}, 'weight': 1.0}, {<T.id: 1>: 'B', <T.label: 4>: 'person', 'age': 25}] |
2 | path[{<T.id: 1>: 'A', <T.label: 4>: 'person', 'age': 30}, {<T.id: 1>: '7abe47fa-901c-c394-4bed-6dce7defa3f9', <T.label: 4>: 'know', <Direction.IN: 2>: {<T.id: 1>: 'C', <T.label: 4>: 'person'}, <Direction.OUT: 3>: {<T.id: 1>: 'A', <T.label: 4>: 'person'}, 'weight': 0.9}, {<T.id: 1>: 'C', <T.label: 4>: 'person', 'age': 35}] |
Users can view the graph in the Graph
tab, enabling drag, zoom-in, and zoom-out functions for quick graph comprehension.
Conclusion
AWS users can build graph databases quickly using Amazon Neptune which is high availability, cost-efficient.
I hope you will find this post useful.