Graph Database Introduction - Relationships Traversal with Amazon Neptune and Gremlin

Graph Database Introduction - Relationships Traversal with Amazon Neptune and Gremlin

Takahiro Iwasa
Takahiro Iwasa
10 min read
Graph Database Neptune

I held a study meeting to introduce Graph Database on AWS, “Graph Database Introduction - Relationships Traversal with Amazon Neptune and Gremlin”.

I would like to share the content of the presentation, in the hopes that it will provide an opportunity for you to know graph databases and Amazon Neptune. You can pull an example code used in this post from my GitHub repository.

Prerequisites

Participants are expected to:

  • Be interested in graph databases and Amazon Neptune.
  • Have basic knowledge of relational databases and AWS.

Goals in this post:

  • Provide an overview of graph databases and Amazon Neptune.
  • Demonstrate traversals using Apache TinkerPop Gremlin.

Graph DB Overview

Concept and Technical Terms

Graph databases are optimized to store and query relationships in data. While relational databases (RDB) and NoSQL databases can also handle storing and querying relationships, graph databases offer advantages in terms of performance and scalability for certain use cases. Graph databases are based on the mathematical principles of graph theory, while relational databases are based on set theory.

In graph databases, data is represented as vertices or nodes, and relationships between them are represented as edges. Vertices and edges can have labels. There are two types of edges: directed and undirected. A graph that allows vertices and edges to have key-value pairs is called the property graph.

Representative Graphs

Social Network Graphs

Graph databases are often used to represent relationships between people, such as those found on social networking sites like Facebook, LinkedIn, and others.

Knowledge Graphs

By understanding data in context, search engines can distinguish between different concepts that share the same name, such as the fruit “apple” and the technology company “Apple Inc.” The term of knowledge graph has gained popularity since the publication of the “Introducing the Knowledge Graph: things, not strings”.

Identity Graphs

Graph databases can also be used to represent relationships among different IDs, such as linking user activities across multiple sites or devices to a single user. This enables user behavior analysis. One well-known use case for this is Customer 360.

Fraud Graphs

Organizing fraud rings to collectively engage in illegal activities, while including a few legal transactions, can make it difficult to detect fraudulent activity. However, graph databases can help to easily visualize fraudulent activity by identifying patterns shared among fraud rings, such as common attributes like credit card numbers, phone numbers, social security numbers, and more.

Panama Papers

There are many use cases for graph databases these days, but one of the most famous is the Panama Papers. In 2016, a massive trove of internal documents was leaked from a Panamanian law firm. The International Consortium of Investigative Journalists (ICIJ) used a graph database to support their analysis of the data.

Comparison with RDB and NoSQL

Using different databases by use cases is very important. Basically, you should not rely on single type database which may not be ideal for specific purposes.

ItemGraph DBRDBNoSQL
Data structureGraphTableVarious
(Document, Key-Value, etc.)
Schema definitionloosestrictloose
PurposeStoring relationships between entitiesStoring entitiesVarious
(Primary data storage, caching, etc.)
Querying related dataTraversalJoin/Sub QueryVarious
PerformanceVery highAverage
(relative to the others)
Very high
LanguagesGremlin
SPARQL
openCypher
Cypher (Neo4j)
GQL
SQL-

RDB performance can be improved using columnar databases like Amazon Redshift.

Amazon Neptune currently supports openCypher in production environments. Please note that Cypher and GQL are not supported.

Starting with engine release 1.1.1.0, openCypher is available for production use in Neptune.

Tree Structure in RDB

You can express a tree in RDB with the following models. In some cases, you may not need graph databases. Please note that the naive tree is not optimal for many cases, described in the “SQL Antipatterns”.

ModelDescriptionProsCons
Naive Treet1.id = t2.parent_idSimple implementation (Adjacency List)
  • Difficult to extract nodes other than adjacent ones
  • Complex SQL
  • Low performance
  • Path Enumerationpath LIKE ‘1/2%’Easy to extract nodes other than adjacent ones
  • Complexity of INSERT/UPDATE/DELETE
  • Constraint of column max length
  • Nested SetsLeft > 1 AND right < 6Easy to extract nodes other than adjacent ones
  • Complexity of INSERT/UPDATE
  • Constraint of column max length
  • Not intuitive data structure
  • Closure TableTable for a tree
  • Easy to extract nodes other than adjacent ones
  • Easy to INSERT/UPDATE/DELETE
  • Possibility of enormous data
  • Low performance of INSERT/UPDATE/DELETE
  • Amazon Neptune Overview

    Amazon Neptune is fully managed graph db service which can query billions relationships in milliseconds. It offers high availability, durability and low latency like Aurora. For pricing information, please refer to the official page.

    Cluster features are the following.

    FeatureDescription
    Primary instance1
    Read-replica instanceMax 15
    Replication lagLess than 100ms in most cases
    FailoverPromotion of Read-replica

    Storage features are the following.

    FeatureDescription
    VolumesSingle logical volume over a cluster
    Redundancy6 copies kept over 3 AZs
    SizeAutomatic scaling to 64TiB

    Query Environment

    You can query graphs using various SDKs such as Gremlin-Java, Gremlin-Python, Gremlin-JavaScript. REST APIs are also available.

    AWS provides Neptune Workbench, fully managed environment. It comes with a pre-installed Graph Notebook which is a powerful tool for running Gremlin queries and creating graph visualizations. You can also launch your graph notebook in your local environment.

    Transactions

    Dirty Read

    When Tx1 updates the row, Tx2 reads the row and Tx1 fails or rollbacks, Tx2 reads data which has not been reflected.

    Non-repeatable Read

    When Tx1 reads the row, Tx2 updates or deletes the row and commits, and Tx1 reads the row, Tx1 reads committed data different from the previous one.

    Phantom Read

    When Tx1 reads the records, Tx2 inserts or deletes some rows and commits, and Tx1 reads the rows, Tx1 reads the rows different from the previous ones.

    Isolation Level

    LevelDirtyNon-repeatablePhantom
    READ UNCOMMITTEDPossiblePossiblePossible
    READ COMMITTEDNot possiblePossiblePossible
    REPEATABLE READNot possibleNot possiblePossible
    SERIALIZABLENot possibleNot possibleNot possible

    Transactions in Neptune

    Neptune handles manages read-only and mutation queries using distinct transaction isolation levels.

    For read-only queries, Neptune uses snapshot isolation in MultiVersion Concurrency Control (MVCC). This means that dirty reads, non-repeatable reads, and phantom reads are prevented by using a snapshot at the start of the transaction.

    For mutation queries, Neptune uses READ COMMITTED isolation to prevent dirty reads. Additionally, Neptune locks a record set when reading data to prevent non-repeatable reads and phantom reads.

    Traversal with Gremlin

    Gremlin is a graph traversal language provided by Apache TinkerPop, an open source framework for graph db.

    Gremlin traverses by Start Steps followed by Traversal Steps. The traversal steps are composed of two steps, General Steps and Terminal Steps.

    Sample Data

    This post uses the following data as sample. Vertices contain an age property and Edges contain a weight property that expresses closeness between vertices.

    If the data were stored in an RDB, querying the data, especially regarding relationships, could become challenging. SQL queries tend to become complex, and representing relationships in a row-column table may not be intuitive.

    Please feed the sample data above with the following command. %%gremlin is a Jupyter Notebook magic command provided by Neptune Workbench.

    %%gremlin
    // Drop existing data
    g.V().drop()
    
    %%gremlin
    // Add Vertices
    g.addV('person').property(id, 'A').property('age', 30)
     .addV('person').property(id, 'B').property('age', 25)
     .addV('person').property(id, 'C').property('age', 35)
     .addV('person').property(id, 'D').property('age', 20)
     .addV('person').property(id, 'E').property('age', 18)
     .addV('person').property(id, 'F').property('age', 25)
     .addV('person').property(id, 'G').property('age', 10)
     .addV('person').property(id, 'H').property('age', 15)
    
    %%gremlin
    // Add Edges
    g.V('A').addE('know').to(g.V('B')).property('weight', 1.0)
     .V('A').addE('know').to(g.V('C')).property('weight', 0.9)
     .V('A').addE('know').to(g.V('H')).property('weight', 0.5)
     .V('B').addE('know').to(g.V('D')).property('weight', 0.8)
     .V('B').addE('know').to(g.V('E')).property('weight', 0.4)
     .V('C').addE('know').to(g.V('F')).property('weight', 0.5)
     .V('C').addE('know').to(g.V('G')).property('weight', 0.6)
     .V('D').addE('know').to(g.V('E')).property('weight', 0.7)
     .V('H').addE('know').to(g.V('E')).property('weight', 1.0)
     .V('H').addE('know').to(g.V('G')).property('weight', 1.0)
    

    Traversal Example 1

    Let’s traverse all persons with the following command.

    %%gremlin
    // Extract Vertices
    g.V()
     .project(‘id’, ‘properties’) // Projection
     .by(id()).by(valueMap()) // valueMap returns properties of vertices.
    
    RowData
    1{'id': 'A', 'properties': {'age': [30]}}
    2{'id': 'B', 'properties': {'age': [25]}}
    3{'id': 'C', 'properties': {'age': [35]}}
    4{'id': 'D', 'properties': {'age': [20]}}
    5{'id': 'E', 'properties': {'age': [18]}}
    6{'id': 'F', 'properties': {'age': [25]}}
    7{'id': 'G', 'properties': {'age': [10]}}
    8{'id': 'H', 'properties': {'age': [15]}}

    Traversal Example 2

    Let’s traverse persons older than 25 years and connected from A up to 2nd.

    %%gremlin
    // Extract persons (entities) which are older than 25 years old and connected from A up to 2nd.
    g.V('A’)
     .repeat(outE().inV()).times(2).emit() // Repeat traversal of adjacent vertices twice
     .has(‘age’, gte(25)) // Greater than or equal 25 years old
     .project('id', 'age’)
     .by(id()).by(values('age'))
    
    RowData
    1{'id': 'B', 'age': 25}
    2{'id': 'C', 'age': 35}
    3{'id': 'F', 'age': 25}

    Traversal Example 3

    Let’s traverse persons which have a multiplied weight value greater than 0.5.

    %%gremlin
    // Start traversal at A which extracts vertices that have a multiplied weight value greater than 0.5
    g.withSack(1.0f).V(‘A’) // Sack can be used to store temporary data
     // Multiply a weight value of an out-directed edge by a sack value, and traverse all in-directed vertices
     .repeat(outE().sack(mult).by(‘weight’).inV().simplePath()).emit()
     .where(sack().is(gt(0.5))) // A sack value greater than 0.5
     .dedup() // deduplication
     .project('id', 'weight’)
     .by(id).by(sack())
    
    RowData
    1{'id': 'B', 'weight': 1.0}
    2{'id': 'C', 'weight': 0.9}
    3{'id': 'D', 'weight': 0.8}
    4{'id': 'G', 'weight': 0.54}
    5{'id': 'E', 'weight': 0.5599…}

    Visualizing Graphs

    Neptune Workbench can visualize graphs.

    Let’s visualize the graph with the following command.

    %%gremlin -d T.id -de weight
    // -d means that a property to be displayed in vertices
    // -de means that a property to be displayed in edges
    
    // Execute the traversal example 3
    g.withSack(1.0f).V(‘A’) // Sack can be used to store temporary data
     // Multiply a weight value of an out-directed edge by a sack value, and traverse all in-directed vertices
     .repeat(outE().sack(mult).by(‘weight’).inV().simplePath()).emit()
     .where(sack().is(gt(0.5))) // A sack value greater than 0.5
     .dedup() // deduplication
     .path() // Extract path data
     .by(elementMap())
    
    RowData
    1path[{<T.id: 1>: 'A', <T.label: 4>: 'person', 'age': 30}, {<T.id: 1>: '8ebe47fa-901b-c6d3-a11f-0a9bf0ce8aa2', <T.label: 4>: 'know', <Direction.IN: 2>: {<T.id: 1>: 'B', <T.label: 4>: 'person'}, <Direction.OUT: 3>: {<T.id: 1>: 'A', <T.label: 4>: 'person'}, 'weight': 1.0}, {<T.id: 1>: 'B', <T.label: 4>: 'person', 'age': 25}]
    2path[{<T.id: 1>: 'A', <T.label: 4>: 'person', 'age': 30}, {<T.id: 1>: '7abe47fa-901c-c394-4bed-6dce7defa3f9', <T.label: 4>: 'know', <Direction.IN: 2>: {<T.id: 1>: 'C', <T.label: 4>: 'person'}, <Direction.OUT: 3>: {<T.id: 1>: 'A', <T.label: 4>: 'person'}, 'weight': 0.9}, {<T.id: 1>: 'C', <T.label: 4>: 'person', 'age': 35}]

    Users can view the graph in the Graph tab, enabling drag, zoom-in, and zoom-out functions for quick graph comprehension.

    Conclusion

    AWS users can build graph databases quickly using Amazon Neptune which is high availability, cost-efficient.

    I hope you will find this post useful.

    Takahiro Iwasa

    Takahiro Iwasa

    Software Developer at KAKEHASHI Inc.
    Involved in the requirements definition, design, and development of cloud-native applications using AWS. Now, building a new prescription data collection platform at KAKEHASHI Inc. Japan AWS Top Engineers 2020-2023.