Assessing Graph Solutions for Apache Spark


Songting Chen, Victor Lee (TigerGraph)
#UnifiedDataAnalytics #SparkAISummit

Graph is HOW WE THINK

We Use Graph Every Day

The Evolution of Graph Analysis
• Early days
– PageRank, etc.; focus on graph algorithms
– Pregel programming API
• Nowadays
– Query language, more declarative without losing expressive power
– AI + graph data: graph features, training, predictions
– More real time (updates, queries)
– Scale, scale, scale
– Gartner: Graph DB market grows 100% YOY through 2022

Typical Workload / Use Cases
• Batch / offline processing
– Web Search/PageRank, etc
• Real time graph queries / updates
– Graph feature extraction for AI training and prediction, e.g., spam phone call detection
– Data center monitoring (server, router, apps, rack)
– Entire big data industry moves towards real time
• Scalability: large data volume, high QPS

This Talk
• Spark: General scalable big data / ML platform
– GraphX: Spark-based Graph Platform
• TigerGraph: Scalable Native Graph Platform
• How they differ, pros and cons for graph applications
• How they work together to provide end-to-end solutions

Comparing GraphX and TigerGraph

Areas of Focus
• Graph Data Storage
• Query Expressiveness
• Supported Workload
• Scalability and Performance

Graph Data Storage
• TigerGraph
– ETL preload / optimized storage
• GraphX
– Data is stored elsewhere and loaded on the fly
• Pros and cons
– TigerGraph: load data once (initial cost, good for repeated analysis)
– GraphX: load data many times (minimal initial cost, good for initial exploratory analysis)

Query Expressiveness
GraphX: API-based, for building graph algorithms
PageRank(...)
…
while (iteration < numIter) {
  rankGraph.cache()
  // Each edge sends srcRank * edgeWeight to its destination; messages are summed.
  val rankUpdates = rankGraph.aggregateMessages[Double](
    ctx => ctx.sendToDst(ctx.srcAttr * ctx.attr),
    _ + _,
    TripletFields.Src)
  prevRankGraph = rankGraph
  // Vertices that received no message contribute 0.0 to the new rank.
  rankGraph = rankGraph.outerJoinVertices(rankUpdates) {
    (id, oldRank, msgSumOpt) =>
      resetProb + (1.0 - resetProb) * msgSumOpt.getOrElse(0.0)
  }.cache()
  rankGraph.edges.foreachPartition(x => {}) // materializes the new graph
  // Release the previous iteration's cached data.
  prevRankGraph.vertices.unpersist()
  prevRankGraph.edges.unpersist()
  iteration += 1
}
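[Figure: message passing in PageRank. A vertex with rank 1 sends msg = 1/4 = 0.25 along each edge; messages arriving at a vertex are combined with +.]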
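
The send / merge / join pattern in the loop above can be traced without a Spark cluster. The following is a minimal plain-Python sketch, not the GraphX implementation. It assumes (these details are not on the slide) that every vertex starts with rank 1.0 and that each edge weight is 1/outdegree(src); the names pagerank, edges, num_iter and reset_prob are hypothetical.

```python
from collections import defaultdict


def pagerank(edges, num_iter=20, reset_prob=0.15):
    """Iterate rank = reset_prob + (1 - reset_prob) * sum(incoming messages)."""
    vertices = {v for edge in edges for v in edge}
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1
    # Assumption: the edge weight 1/outdegree(src) plays the role of ctx.attr.
    weight = {(src, dst): 1.0 / out_degree[src] for src, dst in edges}
    # Assumption: every vertex starts with rank 1.0.
    rank = {v: 1.0 for v in vertices}
    for _ in range(num_iter):
        # Send phase: each edge carries rank(src) * weight to its destination,
        # and messages for the same destination are merged with +.
        msg_sum = defaultdict(float)
        for src, dst in edges:
            msg_sum[dst] += rank[src] * weight[(src, dst)]
        # Join phase: a vertex with no incoming message uses 0.0,
        # mirroring msgSumOpt.getOrElse(0.0).
        rank = {v: reset_prob + (1.0 - reset_prob) * msg_sum.get(v, 0.0)
                for v in vertices}
    return rank


# A hub with four outgoing edges sends 1/4 = 0.25 along each edge.
star = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]
star_ranks = pagerank(star, num_iter=1)
```

After one iteration each leaf holds 0.15 + 0.85 * 0.25 = 0.3625, while the hub, which receives nothing, drops to the reset value 0.15.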

TigerGraph’s GSQL: Declarative Graph Algorithm Design
SumAccum<FLOAT> @received_score = 0;
SumAccum<FLOAT> @score = 1;
people = {People.*};
WHILE TRUE LIMIT maxIter DO
  people = SELECT src
           FROM people:src -(follow)-> People:tgt
           ACCUM tgt.@received_score += src.@score / src.outdegree()
           POST-ACCUM src.@score = resetProb + (1 - resetProb) * src.@received_score,
                      src.@received_score = 0;
END;
[Figure: each src vertex adds src.@score / src.outdegree() to tgt.@received_score on every tgt it follows.]

TigerGraph’s GSQL – cont.
Simultaneously compute many metrics in a declarative way for complex algorithms
SumAccum<FLOAT> @received_score = 0;
SumAccum<FLOAT> @score = 1;
MaxAccum<FLOAT> @received_max_neighbor_score = 0;
MaxAccum<FLOAT> @max_neighbor_score = 1;
people = {People.*};
WHILE TRUE LIMIT maxIter DO
  Start = SELECT src
          FROM people:src -(follow:e)-> People:tgt
          ACCUM tgt.@received_score += src.@score / src.outdegree(),
                tgt.@received_max_neighbor_score += src.@score
          POST-ACCUM src.@score = resetProb + (1 - resetProb) * src.@received_score,
                     src.@received_score = 0,
                     src.@max_neighbor_score = src.@received_max_neighbor_score,
                     src.@received_max_neighbor_score = 0;
END;
[Figure: in one pass over the follow edges, each src vertex adds src.@score / src.outdegree() to tgt.@received_score and offers src.@score to tgt.@received_max_neighbor_score.]
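
The point of the slide is that one sweep over the edges can feed several accumulators at once. A minimal plain-Python sketch of that idea follows; it is an illustration only, not how TigerGraph executes the query. The function name one_pass and its arguments are hypothetical, and the two dictionaries merely imitate a sum accumulator and a max accumulator that both start at 0.

```python
from collections import defaultdict


def one_pass(edges, score):
    """One ACCUM-style sweep that updates a sum and a max per target vertex."""
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1
    received_score = defaultdict(float)  # imitates a sum accumulator, starts at 0
    received_max = defaultdict(float)    # imitates a max accumulator, starts at 0
    for src, tgt in edges:
        # Both metrics are updated while visiting the same edge.
        received_score[tgt] += score[src] / out_degree[src]
        received_max[tgt] = max(received_max[tgt], score[src])
    return dict(received_score), dict(received_max)


follow_edges = [("a", "c"), ("b", "c"), ("a", "b")]
scores = {"a": 1.0, "b": 3.0, "c": 2.0}
sums, maxes = one_pass(follow_edges, scores)
```

Here vertex c receives 1.0/2 from a and 3.0/1 from b, so its sum is 3.5 and its largest neighbor score is 3.0, both obtained from a single traversal of the edge list.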

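GraphFrame: Declarative Pattern Query
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}

// Find chains of 4 vertices connected by 3 edges.
val chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")
// (a) Define how the running count changes for each edge in the motif.
def sumFriends(cnt: Column, relationship: Column): Column = {
  when(relationship === "friend", cnt + 1).otherwise(cnt)
}
// (b) Fold over the 3 edges to count the "friend" relationships.
val condition = Seq("ab", "bc", "cd").
  foldLeft(lit(0))((cnt, e) => sumFriends(cnt, col(e)("relationship")))
// (c) Apply filter to DataFrame.
val chainWith2Friends2 = chain4.where(condition >= 2)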
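
To make the semantics of this motif query concrete, here is a minimal plain-Python sketch that enumerates the same a->b->c->d chains and keeps those with at least two "friend" edges. It is an illustration under assumptions, not the GraphFrames implementation: edges are hypothetical (src, dst, relationship) tuples, the function name chains_with_friends is made up, and, like the motif above, the sketch does not require the four vertices to be distinct.

```python
def chains_with_friends(edges, min_friends=2):
    """Find paths a->b->c->d whose three edges include enough 'friend' edges."""
    out_edges = {}
    for src, dst, rel in edges:
        out_edges.setdefault(src, []).append((dst, rel))
    matches = []
    for a, b, ab in edges:
        for c, bc in out_edges.get(b, []):
            for d, cd in out_edges.get(c, []):
                # Fold over the three edges, carrying the running friend count.
                cnt = 0
                for rel in (ab, bc, cd):
                    cnt = cnt + 1 if rel == "friend" else cnt
                if cnt >= min_friends:
                    matches.append((a, b, c, d))
    return matches


social_edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "d", "friend"),
    ("d", "e", "follow"),
]
found = chains_with_friends(social_edges)
```

Only the chain a, b, c, d qualifies: it has two friend edges, whereas b, c, d, e has one.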

TigerGraph’s GSQL: declarative pattern matching + algorithm
A simple recommendation algorithm
SumAccum<INT> @common_buys;
OrAccum @already_bought;
SumAccum<FLOAT> @product_rank;
other_people = SELECT g
               FROM seed_people:s -(buy)-> product:t <-(buy)- people:g
               ACCUM g.@common_buys += 1,
                     t.@already_bought += true;
recommended_products = SELECT t
                       FROM other_people:s -(buy:e)-> product:t
                       WHERE t.@already_bought == false
                       ACCUM t.@product_rank += log(1 + s.@common_buys)
                       ORDER BY t.@product_rank DESC
                       LIMIT 20;
[Figure: people vertices annotated with @common_buys; product vertices annotated with their rank score.]
Real time updates / queries could significantly improve the effectiveness of the recommendation algorithm.
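
The two-hop logic of this recommendation query can be checked on a toy data set. The sketch below is a plain-Python illustration, not TigerGraph code; the function name recommend, the (person, product) input format and the tie-break by product name are assumptions made for the example.

```python
import math
from collections import defaultdict


def recommend(buys, seed_people, limit=20):
    """Rank products bought by people who share purchases with the seeds."""
    buyers = defaultdict(set)  # product -> people who bought it
    basket = defaultdict(set)  # person  -> products they bought
    for person, product in buys:
        buyers[product].add(person)
        basket[person].add(product)
    # Hop 1: count seed -> product <- person paths and flag the seeds' products.
    common_buys = defaultdict(int)
    already_bought = set()
    for seed in set(seed_people):
        for product in basket[seed]:
            already_bought.add(product)
            for person in buyers[product]:
                common_buys[person] += 1
    # Hop 2: score products the seeds have not bought yet.
    product_rank = defaultdict(float)
    for person, shared in common_buys.items():
        for product in basket[person]:
            if product not in already_bought:
                product_rank[product] += math.log(1 + shared)
    # Highest score first; ties broken by name (an assumption for determinism).
    ranked = sorted(product_rank.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


purchases = [
    ("ann", "book"), ("ann", "pen"),
    ("bob", "book"), ("bob", "pen"), ("bob", "lamp"),
    ("cat", "book"), ("cat", "mug"),
]
suggestions = recommend(purchases, ["ann"])
```

Bob shares two purchases with Ann and Cat shares one, so Bob's lamp scores log(3) and ranks ahead of Cat's mug at log(2).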

Query Expressiveness - Summary
• GraphX (API for designing graph algorithms) + GraphFrame (declarative pattern queries)
• GSQL (SQL-like procedural query language, declarative for both graph algorithms and pattern matching)
• Both provide powerful graph analytics capabilities

Query Workload
GraphX (OLAP) vs. TigerGraph (HTAP)
Workload                                  GraphX   TigerGraph
Big Analytics Query                       ✓        ✓
High QPS, Sub-second Query Workload                ✓
Real Time Transactional Updates                    ✓

Scalability
• Spark/GraphX is well known for its scalability and MPP capabilities.
• TigerGraph is also designed from the ground up with MPP and scalability in mind.

TigerGraph: Analytics Query Scalability
Twitter dataset (41M vertices, 1.4B edges)
AWS: 16 r5.2xlarge servers (8 cores, 64GB memory each)
[Chart: latency (s) vs. number of servers]

TigerGraph: Point Query Scalability
[Chart: QPS vs. number of servers]
Point query: 3-step graph traversals from a seed vertex
Application: real time ML prediction based on graph features
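
As a rough picture of what a 3-step traversal from a seed vertex involves, the following plain-Python sketch expands a frontier three times and returns every vertex reached. The exact benchmark query is not given on the slide, so this is only an assumed shape; the name k_hop_neighbors and its parameters are hypothetical.

```python
from collections import defaultdict


def k_hop_neighbors(edges, seed, steps=3):
    """Return the vertices reachable from seed in at most `steps` hops."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
    visited = {seed}
    frontier = {seed}
    for _ in range(steps):
        next_frontier = set()
        for vertex in frontier:
            # .get avoids creating empty entries for vertices without out-edges.
            for neighbor in adjacency.get(vertex, ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return visited - {seed}


chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
reachable = k_hop_neighbors(chain, seed=0)
```

On the chain 0 -> 1 -> 2 -> 3 -> 4, three steps from vertex 0 reach vertices 1, 2 and 3 but not 4. A graph feature for ML prediction could then be derived from this set, for example its size.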

Performance Comparison
GraphX: EdgePartition2D; AWS: 16 r5.2xlarge servers (8 cores, 64GB memory each)
[Chart: latency (s)]

Performance Comparison Cont.
GraphX: EdgePartition2D; AWS: 16 r5.2xlarge servers (8 cores, 64GB memory each)
[Chart: latency (s)]

Summary / Recommendations
• GraphX: quick-to-result exploratory analysis without having to preload the graph data
• TigerGraph: high-performance graph analytics, real-time transactional updates, high-QPS sub-second query workloads

How Spark and TigerGraph Work Together

Reference Architecture: Spark + TigerGraph for AI

Connect Spark-TigerGraph through JDBC
• Supports bidirectional (read and write) data flow to and from TigerGraph
• Read: convert graph query results to a DataFrame
• Write: load DataFrames/files into vertices/edges in TigerGraph
• Open Source
– https://github.com/tigergraph/ecosystem/tree/master/etl/tg_jdbc_driver

Benefits of Spark + TigerGraph
• Take full advantage of the value from graph data in real time
• Combine them with all other data for deep insights and AI
• Scalable in every step
• Already have actual use cases running in this architecture
