Users have several options for running graph algorithms with Apache Spark. To support a graph data architecture on top of its linear-oriented DataFrames, the Spark platform offers GraphFrames. However, because GraphFrames are immutable and are not a native graph store, they may not offer the features or performance that some use cases need. Another option is to connect Spark to a real-time, scalable, distributed native graph database such as TigerGraph.
In this session, we compare three options — GraphX, Cypher for Apache Spark, and TigerGraph — for different types of workload requirements and data sizes, to help users select the right solution for their needs. We also look at the data transfer and loading time for TigerGraph.
2. Songting Chen, Victor Lee (TigerGraph)
Assessing Graph Solutions for Apache Spark
#UnifiedDataAnalytics #SparkAISummit
3. Graph is HOW WE THINK
4. We Use Graph Every Day
5. The Evolution of Graph Analysis
• Early days
– PageRank, etc.; focus on graph algorithms
– Pregel programming API
• Nowadays
– Query language, more declarative without losing expressive power
– AI + graph data: graph features, training, predictions
– More real time (updates, queries)
– Scale, scale, scale
– Gartner: Graph DB market grows 100% YOY through 2022
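The Pregel-style, vertex-centric model mentioned under "Early days" can be sketched in plain Python. This is a minimal illustration, not the GraphX or Pregel API: the function name `pregel_pagerank`, the toy edge list, and the damping factor of 0.85 are all assumptions made for the example. In each superstep every vertex sends its rank, split evenly, along its out-edges, and then recomputes its rank from the messages it received.

```python
def pregel_pagerank(edges, num_supersteps=20, damping=0.85):
    """Vertex-centric PageRank sketch (hypothetical helper, not a library API).

    edges: list of (source, target) pairs describing a directed graph.
    Vertices without out-edges simply send nothing in this sketch, so their
    rank mass is not redistributed.
    """
    vertices = sorted({v for edge in edges for v in edge})
    out_neighbors = {v: [] for v in vertices}
    for src, dst in edges:
        out_neighbors[src].append(dst)

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(num_supersteps):
        # Message phase: each vertex sends rank / out_degree to its targets.
        inbox = {v: 0.0 for v in vertices}
        for v in vertices:
            targets = out_neighbors[v]
            if targets:
                share = rank[v] / len(targets)
                for t in targets:
                    inbox[t] += share
        # Update phase: each vertex combines the messages it received.
        rank = {v: (1 - damping) / n + damping * inbox[v] for v in vertices}
    return rank


# Toy graph (illustrative data only): every vertex has at least one out-edge.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = pregel_pagerank(toy_edges)
```

A declarative query language hides this superstep loop from the user, which is the shift the "Nowadays" bullets describe.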
6. Typical Workload / Use Cases
• Batch / offline processing
– Web Search/PageRank, etc
• Real time graph queries / updates
– Graph feature extraction for AI training and prediction, e.g., spam phone call detection
– Data center monitoring (server, router, apps, rack)
– Entire big data industry moves towards real time
• Scalability: large data volume, high QPS
7. This Talk
• Spark: General scalable big data / ML platform
– GraphX: Spark-based Graph Platform
• TigerGraph: Scalable Native Graph Platform
• How they differ, pros and cons for graph applications
• How they work together to provide end-to-end solutions
9. Areas of Focus
• Graph Data Storage
• Query Expressiveness
• Supported Workload
• Scalability and Performance
10. Graph Data Storage
• TigerGraph
– ETL preload / optimized storage
• GraphX
– Data is stored elsewhere and loaded on the fly
• Pros and cons
– Load data once (initial cost, good for repeated analysis)
– Load data many times (minimal initial cost, good for initial exploratory analysis)
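The trade-off between loading once and loading on every run can be made concrete with a small cost model. The sketch below is an assumption-laden illustration: the function names and all timings are hypothetical and are not measurements from the talk. It only shows how the number of repeated analyses decides which storage approach is cheaper overall.

```python
import math


def total_cost_preload(load_once_s, query_s, runs):
    """Total seconds when the graph is loaded once, then queried `runs` times."""
    return load_once_s + runs * query_s


def total_cost_on_the_fly(load_each_s, query_s, runs):
    """Total seconds when the graph is reloaded before each of `runs` queries."""
    return runs * (load_each_s + query_s)


def break_even_runs(load_once_s, load_each_s, preload_query_s, fly_query_s):
    """Smallest run count at which preloading is no more expensive.

    Returns None when loading on the fly is never more expensive per run.
    """
    per_run_saving = (load_each_s + fly_query_s) - preload_query_s
    if per_run_saving <= 0:
        return None
    return math.ceil(load_once_s / per_run_saving)


# Hypothetical timings in seconds, chosen only to illustrate the arithmetic.
runs_needed = break_even_runs(
    load_once_s=600, load_each_s=120, preload_query_s=30, fly_query_s=30
)
```

With these made-up numbers, preloading pays off from the fifth run onward; below that, loading on the fly is cheaper, which matches the exploratory-analysis case.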
15. TigerGraph’s GSQL: declarative pattern matching + algorithm
A simple recommendation algorithm
SumAccum<INT> @common_buys;
OrAccum @already_bought;
SumAccum<FLOAT> @product_rank;
// People who bought at least one product that a seed person bought
other_people = SELECT g
  FROM seed_people:s -(buy)-> product:t <-(buy)- people:g
  ACCUM g.@common_buys += 1,
        t.@already_bought += true;
// Rank the products those people bought that the seed people have not bought
recommended_products = SELECT t
  FROM other_people:s -(buy:e)-> product:t
  WHERE t.@already_bought == false
  ACCUM t.@product_rank += log(1 + s.@common_buys)
  ORDER BY t.@product_rank DESC
  LIMIT 20;
Real time updates / queries could significantly improve the effectiveness of the recommendation algorithm.
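To make the logic of the recommendation query easier to follow, here is a plain Python sketch of the same two-hop idea on an in-memory edge list. It is an illustration under assumptions, not TigerGraph code: the function `recommend`, the variable names, and the toy purchase data are invented for the example. As in the query, a person's weight is the number of (seed, product) matches that lead to them, and each product the seeds have not bought accumulates `log(1 + weight)` from every such person who bought it.

```python
import math
from collections import defaultdict


def recommend(purchases, seed_people, limit=20):
    """Rank products for the seed people (hypothetical helper).

    purchases: iterable of (person, product) pairs, one per "buy" edge.
    Returns a list of (product, rank) sorted by rank, highest first.
    """
    buys = defaultdict(set)       # person -> products they bought
    bought_by = defaultdict(set)  # product -> people who bought it
    for person, product in purchases:
        buys[person].add(product)
        bought_by[product].add(person)

    # Hop 1 and 2: seed -> product <- other person.
    common_buys = defaultdict(int)
    already_bought = set()
    for seed in set(seed_people):
        for product in buys.get(seed, ()):
            already_bought.add(product)
            for other in bought_by[product]:
                common_buys[other] += 1

    # Hop 3: other person -> product not yet bought by the seeds.
    product_rank = defaultdict(float)
    for other, shared in common_buys.items():
        for product in buys[other]:
            if product not in already_bought:
                product_rank[product] += math.log(1 + shared)

    # Sort by rank descending; break ties by product name for determinism.
    ranked = sorted(product_rank.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


# Toy purchase data, for illustration only.
toy_purchases = [
    ("alice", "book"), ("alice", "lamp"),
    ("bob", "book"), ("bob", "lamp"), ("bob", "desk"),
    ("carol", "book"), ("carol", "chair"),
]
recommendations = recommend(toy_purchases, seed_people=["alice"])
```

Here `bob` shares two purchases with `alice` and `carol` shares one, so `desk` ranks above `chair`.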
16. Query Expressiveness - Summary
• GraphX (API for designing graph algorithms) + GraphFrames (declarative pattern queries)
• GSQL (SQL-like procedural query language, declarative for both graph algorithms and pattern matching)
• Both provide powerful graph analytics capabilities
17. Query Workload
Workload                               GraphX (OLAP)   TigerGraph (HTAP)
Big analytics queries                  ✓               ✓
High-QPS, sub-second query workload                    ✓
Real-time transactional updates                        ✓
18. Scalability
• Spark/GraphX is well known for its scalability and MPP capabilities.
• TigerGraph is also designed from the ground up with MPP and scalability in mind.
20. TigerGraph: Point Query Scalability
[Figure: QPS versus number of servers]
Point query: 3-step graph traversals from a seed vertex
Application: real time ML prediction based on graph features
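A point query of this kind, a bounded traversal from one seed vertex, can be sketched in plain Python as a breadth-first expansion. This is a minimal illustration and not TigerGraph's implementation: the function `k_hop_neighborhood`, the choice to treat edges as undirected, and the toy chain graph are assumptions. The size of the returned set is one example of a graph feature that could feed an ML model.

```python
from collections import defaultdict


def k_hop_neighborhood(edges, seed, k=3):
    """Return all vertices reachable from `seed` in at most `k` steps.

    edges: iterable of (u, v) pairs, treated as undirected in this sketch.
    The seed itself is excluded from the result.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    visited = {seed}
    frontier = {seed}
    for _ in range(k):
        # Expand one step: unvisited neighbors of the current frontier.
        next_frontier = set()
        for u in frontier:
            next_frontier |= adjacency.get(u, set()) - visited
        visited |= next_frontier
        frontier = next_frontier
    return visited - {seed}


# Toy chain 1-2-3-4-5 (illustrative data only).
chain = [(1, 2), (2, 3), (3, 4), (4, 5)]
neighborhood = k_hop_neighborhood(chain, seed=1, k=3)
feature_value = len(neighborhood)  # e.g. "number of vertices within 3 steps"
```

Because each query touches only the neighborhood of its seed, many such queries can run independently, which is what makes QPS the natural metric for this workload.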
23. Summary / Recommendations
• GraphX: Quick-to-result exploratory analysis without having to preload the graph data
• TigerGraph: High-performance graph analytics, real-time transactional updates, high-QPS sub-second query workloads
26. Connect Spark-TigerGraph through JDBC
• Supports bidirectional (read and write) data flow to and from TigerGraph
• Read: Convert graph query results to a DataFrame
• Write: Load DataFrames or files into vertices and edges in TigerGraph
• Open Source
– https://github.com/tigergraph/ecosystem/tree/master/etl/tg_jdbc_driver
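The read path can be sketched with Spark's generic JDBC data source. In the sketch below, `spark.read.format("jdbc").options(...).load()` is standard PySpark; everything TigerGraph-specific is an assumption to be checked against the driver's README: the driver class name, the `jdbc:tg:http://` URL scheme, the port, and the option keys `username`, `graph`, and `dbtable`. The helper names are hypothetical, and the driver JAR is assumed to be on the Spark classpath.

```python
def tigergraph_jdbc_options(host, graph, dbtable, username, password, port=14240):
    """Build reader options for the TigerGraph JDBC driver.

    ASSUMPTION: the driver class, URL scheme, default port, and option keys
    below are not confirmed by the slides; verify them in the driver README.
    """
    return {
        "driver": "com.tigergraph.jdbc.Driver",  # assumed class name
        "url": f"jdbc:tg:http://{host}:{port}",  # assumed URL scheme
        "username": username,
        "password": password,
        "graph": graph,       # assumed key: target graph name
        "dbtable": dbtable,   # assumed key: what to read, e.g. a vertex type
    }


def read_from_tigergraph(spark, options):
    """Load a result set as a DataFrame through Spark's JDBC data source.

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical connection details, for illustration only.
example_options = tigergraph_jdbc_options(
    host="127.0.0.1",
    graph="my_graph",
    dbtable="vertex Person",
    username="tigergraph",
    password="change-me",
)
```

With a running SparkSession, `read_from_tigergraph(spark, example_options)` would return a DataFrame that can be joined with other Spark data; the write direction uses `DataFrame.write` with the same data source.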
27. Benefits of Spark + TigerGraph
• Take full advantage of the value from graph data in real time
• Combine them with all other data for deep insights and AI
• Scalable in every step
• Actual use cases are already running on this architecture