Graph databases are attracting a lot of interest and gaining wider adoption across industries. The most popular graph database today is Neo4j, ranked #1 by DB-Engines.
TigerGraph, which claims to be the world's fastest and most scalable graph platform, recently released a free developer edition and then published benchmark results against Amazon Neptune. This benchmark adds to an existing report, also available on their website, that compares the performance of TigerGraph with Neo4j and TitanDB. According to these reports, TigerGraph outperforms the other graph databases by a large margin across all of the benchmark tests. Moreover, TigerGraph demonstrates more efficient storage usage, reducing the original data size rather than expanding it, as the other databases do.
Needless to say, these outstanding benchmark results got me interested in TigerGraph, so I decided to run benchmark tests myself comparing Neo4j and TigerGraph. For a summary of the results, skip to the end; for a detailed description of the benchmark, read the entire blog post.
This blog post covers the following topics:
Both Neo4j and TigerGraph tests were performed on EC2 instances with the following characteristics:
| EC2 Instance | vCPU | Memory | Disk Size | IOPS | Volume Type | OS |
|---|---|---|---|---|---|---|
| r4.4xlarge | 16 | 122 GiB | 300 GiB | 900/3000 | General Purpose SSD (GP2) | Ubuntu 14.04 |
To get the best Neo4j performance, I tuned its memory configuration properties by increasing the heap and page cache sizes.
I used the Friendster social network dataset provided by Stanford SNAP. The Friendster network contains users connected to each other by friendship edges. The dataset is 31 GB in size and is formatted as a tab-separated edge list.
Friendster statistics:
This test measures the time it takes to load the dataset into each database. The loading methods used in this benchmark are as follows.
| | Neo4j | TigerGraph |
|---|---|---|
| Load time | 2146.385 s | 3026.33 s |
| Node file preparation time | 1912.664 s | - |
| Total | 4059.049 s | 3026.33 s |
TigerGraph's loading time is not substantially better than Neo4j's: its raw load step is actually slower (3026.33 s versus 2146.385 s), and it only comes out ahead once Neo4j's node file preparation is counted (3026.33 s versus 4059.049 s in total). However, TigerGraph does have one advantage over Neo4j: it doesn't require preprocessing the data before loading (e.g. node file preparation). Moreover, Neo4j doesn't automatically index data during load, so additional indexing time is required before the data is ready to use. Neo4j indexing time is not included in the total loading time here.
Here I compare the storage size of the loaded data against the original dataset size. Neo4j's storage size was measured after the index on node ids was created.
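To make the node file preparation step more concrete, here is a minimal Python sketch of what such preprocessing could look like. It assumes a tab-separated edge list like the Friendster file and writes every distinct node id exactly once. The function name `write_node_file` and the `id` header are hypothetical, and this is not the script used in the benchmark.

```python
import csv
import io


def write_node_file(edge_lines, out, header="id"):
    """Write each distinct node id from a tab-separated edge list to `out`.

    `header` is an assumed column name; adjust it to whatever the
    import tool you use expects.
    """
    seen = set()  # holds all ids in memory; fine for a sketch
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([header])
    for line in edge_lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank and comment lines
            continue
        src, dst = line.split("\t")[:2]
        for node_id in (src, dst):
            if node_id not in seen:
                seen.add(node_id)
                writer.writerow([node_id])
    return len(seen)


# Tiny in-memory example instead of the 31 GB file.
edges = ["1\t2\n", "2\t3\n", "1\t3\n"]
buf = io.StringIO()
node_count = write_node_file(edges, buf)
print(node_count, "distinct nodes written")
```

A single pass like this has to read the whole edge list before any loading can start, which is why the preparation time is a real cost on a dataset of this size.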
| Dataset | Original | Neo4j | TigerGraph |
|---|---|---|---|
| Friendster | 31 GB | 62 GB | 29 GB |
This shows that TigerGraph compresses data during ingestion: it stores the graph in 29 GB, slightly less than the 31 GB original, whereas Neo4j expands it to 62 GB, twice the original size.
This part of the benchmark captures query execution times for k-step neighborhood queries. Given a start node, a k-step neighborhood query counts the nodes reachable within k steps, including the start node itself.
The query performance test covers 1-step, 3-step, and 6-step neighborhood queries, capturing the average time over 10 runs, where each run uses a randomly selected start node.
All query performance tests use the same file of 10 randomly selected start vertices, and the average time is measured over 10 query runs for each test. The query timeout was set to 180 seconds for 1-step neighborhood queries and to 9000 seconds (2.5 hours) for 3-step and 6-step queries; if a query did not complete within its timeout, it was terminated and the computation proceeded to the next query.
Here is the k-step neighborhood query written in Cypher (Neo4j), where the node type is User and the edge type is Friendship:
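The storage comparison comes down to simple arithmetic on the table above. A small Python sketch, using only the sizes reported there (the variable and function names are mine):

```python
# Sizes in GB, taken from the storage table.
original_gb = 31
stored_gb = {"Neo4j": 62, "TigerGraph": 29}


def size_ratio(stored, original):
    """Stored size as a multiple of the original dataset size."""
    return stored / original


ratios = {name: size_ratio(gb, original_gb) for name, gb in stored_gb.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the original size")
```

A ratio above 1 means the database expanded the data, and a ratio below 1 means it shrank it.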
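For readers who prefer code to definitions, the k-step neighborhood count can be sketched in plain Python as a breadth-first search that stops expanding at depth k. This is only an illustration on a toy in-memory graph; the adjacency dictionary and the function name are assumptions, and neither database necessarily evaluates the query this way.

```python
from collections import deque


def k_step_neighborhood_size(adjacency, start, k):
    """Count nodes reachable from `start` in at most k steps, including `start`."""
    visited = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:  # do not expand beyond k steps
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return len(visited)


# Toy undirected friendship graph: a path 1-2-3-4-5 plus the edge 1-6.
toy_graph = {
    1: [2, 6],
    2: [1, 3],
    3: [2, 4],
    4: [3, 5],
    5: [4],
    6: [1],
}
print(k_step_neighborhood_size(toy_graph, start=1, k=2))
```

Each node is counted once no matter how many paths lead to it, which matches the distinct count used in the queries.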
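The measurement procedure can be sketched as a small Python harness. It assumes a hypothetical `run_query(start_node, timeout_s)` callable that raises `TimeoutError` when the database terminates the query; the real benchmark scripts are not shown in this post, so treat the names and structure here as illustrative only.

```python
import time


def benchmark(run_query, start_nodes, timeout_s):
    """Return (average seconds over completed runs, number of timed-out runs).

    `run_query` is a hypothetical callable that executes one k-step query
    and raises TimeoutError if it exceeds `timeout_s`.
    """
    durations = []
    timeouts = 0
    for start in start_nodes:
        t0 = time.perf_counter()
        try:
            run_query(start, timeout_s)
        except TimeoutError:
            timeouts += 1  # skip this run and proceed to the next query
            continue
        durations.append(time.perf_counter() - t0)
    average = sum(durations) / len(durations) if durations else None
    return average, timeouts


# Stand-in for a real database call: node 3 always "times out".
def fake_query(start, timeout_s):
    if start == 3:
        raise TimeoutError
    return start


avg, timed_out = benchmark(fake_query, start_nodes=[1, 2, 3, 4], timeout_s=180)
print(avg, timed_out)
```

Returning the timeout count alongside the average matters because an average over only a few completed runs is not comparable to one over all 10.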
MATCH (n1:User)-[:Friendship*0..{k}]-(n2:User)
WHERE n1.id={start_node}
RETURN count(distinct n2);
And the GSQL (TigerGraph) equivalent of the query:
CREATE QUERY kstep(VERTEX<User> start_node, INT k) FOR GRAPH friendster {
  INT i = 0;
  Result = {start_node};
  Start = {start_node};
  WHILE (i < k) DO
    Start = SELECT v
            FROM Start:u - (Friendship:e)->:v;
    Result = Result UNION Start;
    i = i + 1;
  END;
  PRINT Result.size();
}
The neighborhood sizes computed by Neo4j and TigerGraph for each start vertex were compared to make sure the results are consistent and that both systems discover neighborhoods of the same size.
| | Neo4j | TigerGraph |
|---|---|---|
| 1-step | 39.07 ms | 4.827 ms |
| 3-step | 347.391 s | 0.377 s |
| 6-step | N/A (9/10 timeout) | 153.749 s |
TigerGraph greatly outperforms Neo4j on k-step neighborhood queries, finishing all of the queries within the set timeout. Neo4j also completes the 1-step and 3-step queries within the timeout, although far behind TigerGraph. For the 6-step neighborhood, 9 of the 10 Neo4j queries timed out, i.e. could not complete within 9000 seconds. Since only one query completed, its execution time is not reported as an average because it would not be representative. That query, with start node 5,832,221, completed in 23.221 seconds; the TigerGraph query with the same start node completed in just 0.517 seconds. Moreover, start node 5,832,221 has the smallest 6-step neighborhood of all the start nodes, which explains why it was the only query that completed in under 9000 seconds. Clearly, TigerGraph is faster than Neo4j on the 6-step neighborhood query as well.
This benchmark clearly shows the advantage of TigerGraph over Neo4j. TigerGraph provides better performance in terms of total loading time and graph traversal queries while using far less storage space. TigerGraph's native query language, GSQL, is powerful enough to express most (if not all) graph traversal and graph analytics queries.
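To put the gap into numbers, the speedups can be computed directly from the timings reported above. A short Python sketch (the dictionary layout and names are mine; the 6-step entry uses the single Neo4j run that completed, for start node 5,832,221, so it is not an average):

```python
# (Neo4j time, TigerGraph time), both in the same unit per row.
timings = {
    "1-step (ms)": (39.07, 4.827),
    "3-step (s)": (347.391, 0.377),
    "6-step, start node 5,832,221 (s)": (23.221, 0.517),
}


def speedup(neo4j_time, tigergraph_time):
    """How many times faster TigerGraph was than Neo4j."""
    return neo4j_time / tigergraph_time


speedups = {name: speedup(*pair) for name, pair in timings.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

That works out to roughly 8x for 1-step, about 921x for 3-step, and about 45x for the one comparable 6-step query.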
Summary of benchmark results:
I strongly recommend checking out the TigerGraph developer edition and seeing for yourself.