Understanding Graph Databases: Traversals with Amazon Neptune and Gremlin

Takahiro Iwasa

Nov 27, 2021

9 min read

Graph Database Neptune

It has been more than 3 years since this post was published.

Introduction

I recently hosted a study meeting titled “Graph Database Introduction - Relationships Traversal with Amazon Neptune and Gremlin” (event link). In this blog post, I aim to share the content of that presentation to help you understand graph databases and explore the capabilities of Amazon Neptune.

You can find example code used in this post in my GitHub repository.

Prerequisites

Target Readers

This post is intended for readers who:

Are interested in graph databases and Amazon Neptune.
Have a basic understanding of relational databases and AWS.

Goals

Provide an overview of graph databases and Amazon Neptune.
Demonstrate graph traversal using Apache TinkerPop Gremlin.

What is a Graph Database

Key Concepts

Graph databases are optimized for storing and querying relationships between entities. Unlike relational or NoSQL databases, graph databases excel in scenarios where relationships are complex and performance is critical.

In graph databases:

Vertices (nodes) represent entities.
Edges define relationships between vertices.
Both vertices and edges can have properties (key-value pairs).
- A graph with labeled vertices and edges is called a property graph.

Graph Representation

Applications of Graph Databases

Graph databases model relationships between people, as seen in platforms like Facebook and LinkedIn.

Knowledge Graphs

These graphs add contextual meaning to data, distinguishing between different concepts, such as “Apple” the fruit versus “Apple Inc.” the company. The term “Knowledge Graph” became widely adopted following the publication of Introducing the Knowledge Graph: things, not strings.

Identity Graphs

Identity graphs link multiple user identifiers across platforms, enabling comprehensive user behavior analysis. A well-known application of this is Customer 360, which provides a unified view of the customer across all touchpoints.

Fraud Detection

Fraud rings often mix legitimate transactions with illegal activities, making detection challenging. Graph databases excel in uncovering these patterns by visualizing relationships and identifying shared attributes like credit card numbers, phone numbers, or social security numbers, enabling effective fraud detection.

Real-World Example: Panama Papers

The Panama Papers investigation used a graph database to analyze leaked documents efficiently. Learn more in this Forbes article.

Comparing Databases: Graph DB, RDB, and NoSQL

Feature	Graph DB	RDB	NoSQL
Data Structure	Graph	Table	Various
Schema	Loose	Strict	Loose
Purpose	Relationships	Entity storage	Various
Querying	Traversal	Joins/Subqueries	Various
Performance	Very High	Average ¹	Very High
Query Languages	Gremlin SPARQL openCypher ² Cypher GQL	SQL	-

The performance of relational databases (RDBs) can be enhanced by using columnar databases, such as Amazon Redshift.
Amazon Neptune now supports openCypher for production environments, starting with engine release 1.1.1.0. However, it’s important to note that Cypher and GQL are currently not supported.

Introduction to Amazon Neptune

Amazon Neptune is a fully managed graph database service capable of querying billions of relationships in milliseconds. It offers high availability, durability, and low latency.

Key Features

Cluster Scalability: Up to 15 read replicas.
Automatic Scaling: Storage expands automatically up to 64 TiB.
Data Redundancy: Maintains 6 copies across 3 availability zones.

Transactions in Neptune

Neptune manages read-only and mutation queries using distinct transaction isolation levels. Learn more in the documentation.

Read-Only Queries

For read-only queries, Neptune uses snapshot isolation within MultiVersion Concurrency Control (MVCC). This ensures that dirty reads, non-repeatable reads, and phantom reads are prevented by using a snapshot taken at the start of the transaction.

Mutation Queries

For mutation queries, Neptune uses READ COMMITTED isolation to prevent dirty reads. Additionally, Neptune locks the record set being read, ensuring that non-repeatable reads and phantom reads are avoided.

Querying with Gremlin

Amazon Neptune supports querying with Gremlin, an open-source graph traversal language. Developers can also use Graph Notebook for querying and visualization.

Traversal Examples

Sample Graph Data

This example uses a graph where vertices have an age property and edges define a weight property.

Use the following command to load the sample data provided above. The %%gremlin is a Jupyter Notebook magic command specifically designed for use with Neptune Workbench.

%%gremlin
// Drop existing data
g.V().drop()

%%gremlin
// Add Vertices
g.addV('person').property(id, 'A').property('age', 30)
 .addV('person').property(id, 'B').property('age', 25)
 .addV('person').property(id, 'C').property('age', 35)
 .addV('person').property(id, 'D').property('age', 20)
 .addV('person').property(id, 'E').property('age', 18)
 .addV('person').property(id, 'F').property('age', 25)
 .addV('person').property(id, 'G').property('age', 10)
 .addV('person').property(id, 'H').property('age', 15)

%%gremlin
// Add Edges
g.V('A').addE('know').to(g.V('B')).property('weight', 1.0)
 .V('A').addE('know').to(g.V('C')).property('weight', 0.9)
 .V('A').addE('know').to(g.V('H')).property('weight', 0.5)
 .V('B').addE('know').to(g.V('D')).property('weight', 0.8)
 .V('B').addE('know').to(g.V('E')).property('weight', 0.4)
 .V('C').addE('know').to(g.V('F')).property('weight', 0.5)
 .V('C').addE('know').to(g.V('G')).property('weight', 0.6)
 .V('D').addE('know').to(g.V('E')).property('weight', 0.7)
 .V('H').addE('know').to(g.V('E')).property('weight', 1.0)
 .V('H').addE('know').to(g.V('G')).property('weight', 1.0)

Example 1: Retrieving All Vertices

%%gremlin
// Extract Vertices
g.V()
 .project('id', 'properties') // Projection
 .by(id()).by(valueMap()) // valueMap returns properties of vertices.

Result:

Row	Data
1	`{'id': 'A', 'properties': {'age': [30]}}`
2	`{'id': 'B', 'properties': {'age': [25]}}`
3	`{'id': 'C', 'properties': {'age': [35]}}`
4	`{'id': 'D', 'properties': {'age': [20]}}`
5	`{'id': 'E', 'properties': {'age': [18]}}`
6	`{'id': 'F', 'properties': {'age': [25]}}`
7	`{'id': 'G', 'properties': {'age': [10]}}`
8	`{'id': 'H', 'properties': {'age': [15]}}`

Example 2: Traversing Connections

Retrieve all persons older than 25 and connected to ‘A’ within two levels:

%%gremlin
// Extract persons (entities) which are older than 25 years old and connected from A up to 2nd.
g.V('A')
 .repeat(outE().inV()).times(2).emit() // Repeat traversal of adjacent vertices twice
 .has('age', gte(25)) // Greater than or equal 25 years old
 .project('id', 'age')
 .by(id()).by(values('age'))

Result:

Row	Data
1	`{'id': 'B', 'age': 25}`
2	`{'id': 'C', 'age': 35}`
3	`{'id': 'F', 'age': 25}`

Example 3: Filtering by Weight

Find persons where the multiplied weight exceeds 0.5:

%%gremlin
// Start traversal at A which extracts vertices that have a multiplied weight value greater than 0.5
g.withSack(1.0f).V('A') // Sack can be used to store temporary data
 // Multiply a weight value of an out-directed edge by a sack value, and traverse all in-directed vertices
 .repeat(outE().sack(mult).by('weight').inV().simplePath()).emit()
 .where(sack().is(gt(0.5))) // A sack value greater than 0.5
 .dedup() // deduplication
 .project('id', 'weight')
 .by(id).by(sack())

Result:

Row	Data
1	`{'id': 'B', 'weight': 1.0}`
2	`{'id': 'C', 'weight': 0.9}`
3	`{'id': 'D', 'weight': 0.8}`
4	`{'id': 'G', 'weight': 0.54}`
5	`{'id': 'E', 'weight': 0.5599…}`

Visualizing Graphs

Neptune Workbench offers tools to visualize graphs interactively. Refer to the official documentation for more details: Neptune Visualization.

To visualize the graph, execute the following command:

%%gremlin -d T.id -de weight
// -d specifies the vertex property to display
// -de specifies the edge property to display

// Execute traversal from example 3
g.withSack(1.0f).V('A') // Sack is used to store temporary data
 .repeat(outE().sack(mult).by('weight').inV().simplePath()).emit() // Traverse with edge weight
 .where(sack().is(gt(0.5))) // Filter paths where the sack value > 0.5
 .dedup() // Remove duplicate paths
 .path() // Extract path data
 .by(elementMap()) // Display properties of vertices and edges

Example Output:

Row	Data
1	`path[{<T.id: 1>: 'A', <T.label: 4>: 'person', 'age': 30}, {<T.id: 1>: '8ebe47fa-901b-c6d3-a11f-0a9bf0ce8aa2', <T.label: 4>: 'know', <Direction.IN: 2>: {<T.id: 1>: 'B', <T.label: 4>: 'person'}, <Direction.OUT: 3>: {<T.id: 1>: 'A', <T.label: 4>: 'person'}, 'weight': 1.0}, {<T.id: 1>: 'B', <T.label: 4>: 'person', 'age': 25}]`
2	`path[{<T.id: 1>: 'A', <T.label: 4>: 'person', 'age': 30}, {<T.id: 1>: '7abe47fa-901c-c394-4bed-6dce7defa3f9', <T.label: 4>: 'know', <Direction.IN: 2>: {<T.id: 1>: 'C', <T.label: 4>: 'person'}, <Direction.OUT: 3>: {<T.id: 1>: 'A', <T.label: 4>: 'person'}, 'weight': 0.9}, {<T.id: 1>: 'C', <T.label: 4>: 'person', 'age': 35}]`

Once executed, navigate to the Graph tab in Neptune Workbench to visualize the result. The interface supports drag, zoom-in, and zoom-out operations, allowing for intuitive graph exploration.

Conclusion

Graph databases provide significant advantages in handling relationships, and Amazon Neptune makes deploying such databases seamless on AWS. With Gremlin, you can efficiently traverse and analyze complex data relationships.

If you’re looking to explore graph databases, Amazon Neptune is an excellent starting point.

Happy Coding! 🚀

Appendix 1: Representing Tree Structures in Relational Databases (RDB)

Tree structures can be modeled in relational databases (RDB) using various approaches. In some scenarios, graph databases may not be necessary. However, it is important to note that the Naive Tree model is often suboptimal for many cases, as discussed in “SQL Antipatterns”.

Model	Description	Pros	Cons
Naive Tree	`t1.id = t2.parent_id`	Simple implementation (Adjacency List)	Difficult to extract non-adjacent nodes Complex SQL Low performance
Path Enumeration	`path LIKE '1/2%'`	Simplifies extracting non-adjacent nodes	Complex INSERT/UPDATE/DELETE operations Limited by column max length
Nested Sets	`Left > 1 AND Right < 6`	Efficient for querying non-adjacent nodes	Complex INSERT/UPDATE operations Limited by column max length Non-intuitive structure
Closure Table	Separate table for the tree	Efficient for querying all nodes Handles INSERT/UPDATE/DELETE easily	Data can grow significantly in size INSERT/UPDATE/DELETE can have low performance

Each approach has its advantages and limitations, making the choice dependent on the specific requirements of your application.

Appendix 2: Transactions

Dirty Read

When Tx1 updates the row, Tx2 reads the row, and Tx1 fails or rollbacks, Tx2 reads data which has not been reflected.

Non-repeatable Read

When Tx1 reads the row, Tx2 updates or deletes the row and commits, and Tx1 reads the row again, Tx1 reads committed data different from the previous one.

Phantom Read

When Tx1 reads the records, Tx2 inserts or deletes some rows and commits, and Tx1 reads the rows again, Tx1 reads the rows different from the previous ones.

Isolation Levels

Level	Dirty Read	Non-repeatable Read	Phantom Read
READ UNCOMMITTED	Possible	Possible	Possible
READ COMMITTED	Not possible	Possible	Possible
REPEATABLE READ	Not possible	Not possible	Possible
SERIALIZABLE	Not possible	Not possible	Not possible

Appendix 3: Querying with Relational Databases (RDB)

When storing data in an RDB, querying relationships can become increasingly complex. SQL queries often grow in complexity, and representing interconnected data within a row-column structure is unintuitive.

Introduction

Prerequisites

Target Readers

Goals

What is a Graph Database

Key Concepts

Applications of Graph Databases

Social Networks

Knowledge Graphs

Identity Graphs

Fraud Detection

Real-World Example: Panama Papers

Comparing Databases: Graph DB, RDB, and NoSQL

Introduction to Amazon Neptune

Key Features

Transactions in Neptune

Read-Only Queries

Mutation Queries

Querying with Gremlin

Traversal Examples

Sample Graph Data

Example 1: Retrieving All Vertices

Example 2: Traversing Connections

Example 3: Filtering by Weight

Visualizing Graphs

Conclusion

Appendix 1: Representing Tree Structures in Relational Databases (RDB)

Appendix 2: Transactions

Dirty Read

Non-repeatable Read

Phantom Read

Isolation Levels

Appendix 3: Querying with Relational Databases (RDB)