A Practical Guide to Querying Amazon Neptune Using Gremlin

This note describes how to use Amazon Neptune.
What is a Graph Database
Graph databases are optimized for storing and querying relationships between entities. Unlike relational or NoSQL databases, graph databases excel in scenarios where relationships are complex and performance is critical.
In graph databases:
- Vertices represent entities.
- Edges define relationships between vertices.
- Both vertices and edges can have properties (key-value pairs), which is called a property graph.
Transactions in Neptune
Neptune manages read-only and mutation queries using distinct transaction isolation levels. Refer to the official documentation.
For read-only queries, Neptune uses snapshot isolation within MultiVersion Concurrency Control (MVCC). This ensures that dirty reads, non-repeatable reads, and phantom reads are prevented by using a snapshot taken at the start of the transaction.
For mutation queries, Neptune uses READ COMMITTED isolation to prevent dirty reads. Additionally, Neptune locks the record set being read, ensuring that non-repeatable reads and phantom reads are avoided.
Traversal Examples
Preparing Graph Data
This example uses a graph where vertices have an age property and edges define a weight property.
Use the following command to load the sample data provided above. The %%gremlin
is a Jupyter Notebook magic command specifically designed for use with Neptune Workbench.
%%gremlin// Drop existing datag.V().drop()
%%gremlin// Add Verticesg.addV('person').property(id, 'A').property('age', 30) .addV('person').property(id, 'B').property('age', 25) .addV('person').property(id, 'C').property('age', 35) .addV('person').property(id, 'D').property('age', 20) .addV('person').property(id, 'E').property('age', 18) .addV('person').property(id, 'F').property('age', 25) .addV('person').property(id, 'G').property('age', 10) .addV('person').property(id, 'H').property('age', 15)
%%gremlin// Add Edgesg.V('A').addE('know').to(g.V('B')).property('weight', 1.0) .V('A').addE('know').to(g.V('C')).property('weight', 0.9) .V('A').addE('know').to(g.V('H')).property('weight', 0.5) .V('B').addE('know').to(g.V('D')).property('weight', 0.8) .V('B').addE('know').to(g.V('E')).property('weight', 0.4) .V('C').addE('know').to(g.V('F')).property('weight', 0.5) .V('C').addE('know').to(g.V('G')).property('weight', 0.6) .V('D').addE('know').to(g.V('E')).property('weight', 0.7) .V('H').addE('know').to(g.V('E')).property('weight', 1.0) .V('H').addE('know').to(g.V('G')).property('weight', 1.0)
Example 1: Retrieving All Vertices
%%gremlin// Extract Verticesg.V() .project('id', 'properties') // Projection .by(id()).by(valueMap()) // valueMap returns properties of vertices.
Result:
Row | Data |
---|---|
1 | {'id': 'A', 'properties': {'age': [30]}} |
2 | {'id': 'B', 'properties': {'age': [25]}} |
3 | {'id': 'C', 'properties': {'age': [35]}} |
4 | {'id': 'D', 'properties': {'age': [20]}} |
5 | {'id': 'E', 'properties': {'age': [18]}} |
6 | {'id': 'F', 'properties': {'age': [25]}} |
7 | {'id': 'G', 'properties': {'age': [10]}} |
8 | {'id': 'H', 'properties': {'age': [15]}} |
Example 2: Traversing Connections
Retrieve all persons older than 25 and connected to 'A'
within two levels:
%%gremlin// Extract persons (entities) which are older than 25 years old and connected from A up to 2nd.g.V('A') .repeat(outE().inV()).times(2).emit() // Repeat traversal of adjacent vertices twice .has('age', gte(25)) // Greater than or equal 25 years old .project('id', 'age') .by(id()).by(values('age'))
Result:
Row | Data |
---|---|
1 | {'id': 'B', 'age': 25} |
2 | {'id': 'C', 'age': 35} |
3 | {'id': 'F', 'age': 25} |
Example 3: Filtering by Weight
Find persons where the multiplied weight exceeds 0.5:
%%gremlin// Start traversal at A which extracts vertices that have a multiplied weight value greater than 0.5g.withSack(1.0f).V('A') // Sack can be used to store temporary data // Multiply a weight value of an out-directed edge by a sack value, and traverse all in-directed vertices .repeat(outE().sack(mult).by('weight').inV().simplePath()).emit() .where(sack().is(gt(0.5))) // A sack value greater than 0.5 .dedup() // deduplication .project('id', 'weight') .by(id).by(sack())
Result:
Row | Data |
---|---|
1 | {'id': 'B', 'weight': 1.0} |
2 | {'id': 'C', 'weight': 0.9} |
3 | {'id': 'D', 'weight': 0.8} |
4 | {'id': 'G', 'weight': 0.54} |
5 | {'id': 'E', 'weight': 0.5599…} |
Visualizing Graphs
Neptune Workbench offers tools to visualize graphs interactively. Refer to the official documentation for more details.
To visualize the graph, execute the following command:
%%gremlin -d T.id -de weight// -d specifies the vertex property to display// -de specifies the edge property to display
// Execute traversal from example 3g.withSack(1.0f).V('A') // Sack is used to store temporary data .repeat(outE().sack(mult).by('weight').inV().simplePath()).emit() // Traverse with edge weight .where(sack().is(gt(0.5))) // Filter paths where the sack value > 0.5 .dedup() // Remove duplicate paths .path() // Extract path data .by(elementMap()) // Display properties of vertices and edges
Example Output:
Row | Data |
---|---|
1 | path[{<T.id: 1>: 'A', <T.label: 4>: 'person', 'age': 30}, {<T.id: 1>: '8ebe47fa-901b-c6d3-a11f-0a9bf0ce8aa2', <T.label: 4>: 'know', <Direction.IN: 2>: {<T.id: 1>: 'B', <T.label: 4>: 'person'}, <Direction.OUT: 3>: {<T.id: 1>: 'A', <T.label: 4>: 'person'}, 'weight': 1.0}, {<T.id: 1>: 'B', <T.label: 4>: 'person', 'age': 25}] |
2 | path[{<T.id: 1>: 'A', <T.label: 4>: 'person', 'age': 30}, {<T.id: 1>: '7abe47fa-901c-c394-4bed-6dce7defa3f9', <T.label: 4>: 'know', <Direction.IN: 2>: {<T.id: 1>: 'C', <T.label: 4>: 'person'}, <Direction.OUT: 3>: {<T.id: 1>: 'A', <T.label: 4>: 'person'}, 'weight': 0.9}, {<T.id: 1>: 'C', <T.label: 4>: 'person', 'age': 35}] |
Once executed, navigate to the Graph
tab in Neptune Workbench to visualize the result. The interface supports drag, zoom-in, and zoom-out operations, allowing for intuitive graph exploration.
Appendix 1: Tree Structures in Relational Databases (RDB)
Tree structures can be modeled in relational databases (RDB) using various approaches. In some scenarios, graph databases may not be necessary.
However, it is important to note that the Naive Tree model is often suboptimal for many cases, as discussed in “SQL Antipatterns”.
- Naive Tree
- Expression:
t1.id = t2.parent_id
- Pros
- Simple implementation (Adjacency List)
- Cons
- Difficult to extract non-adjacent nodes
- Complex SQL
- Low performance
- Expression:
- Path Enumeration
- Expression:
path LIKE '1/2%'
- Pros
- Simplifies extracting non-adjacent nodes
- Cons
- Complex INSERT/UPDATE/DELETE operations
- Limited by column max length
- Expression:
- Nested Sets
- Expression:
Left > 1 AND Right < 6
- Pros
- Efficient for querying non-adjacent nodes
- Cons
- Complex INSERT/UPDATE operations
- Limited by column max length
- Non-intuitive structure
- Expression:
- Closure Table
- Expression: Separate table for the tree
- Pros
- Efficient for querying all nodes
- Handles INSERT/UPDATE/DELETE easily
- Cons
- Data can grow significantly in size
- INSERT/UPDATE/DELETE can have low performance
When storing data in an RDB, querying relationships can become increasingly complex. SQL queries often grow in complexity, and representing interconnected data within a row-column structure is unintuitive.
Appendix 2: Transactions
Dirty Read
When Tx1 updates the row, Tx2 reads the row, and Tx1 fails or rollbacks, Tx2 reads data which has not been reflected.
Non-repeatable Read
When Tx1 reads the row, Tx2 updates or deletes the row and commits, and Tx1 reads the row again, Tx1 reads committed data different from the previous one.
Phantom Read
When Tx1 reads the records, Tx2 inserts or deletes some rows and commits, and Tx1 reads the rows again, Tx1 reads the rows different from the previous ones.
Isolation Levels
Level | Dirty Read | Non-repeatable Read | Phantom Read |
---|---|---|---|
READ UNCOMMITTED | Possible | Possible | Possible |
READ COMMITTED | Not possible | Possible | Possible |
REPEATABLE READ | Not possible | Not possible | Possible |
SERIALIZABLE | Not possible | Not possible | Not possible |