Exploring optimizations for dynamic PageRank
algorithm based on GPU
Subhajit Sahu
Advisor: Kishore Kothapalli
Center for Security, Theory, and Algorithmic Research (CSTAR)
International Institute of Information Technology, Hyderabad (IIITH)
Gachibowli, Hyderabad, India - 500 032
subhajit.sahu@research.iiit.ac.in
1. Introduction
The Königsberg bridge problem, which was posed and answered in the negative by Euler in
1736, represents the beginning of graph theory [1]. A graph is a generic data structure, and is a
superset of lists and trees. Binary search on a sorted list can be interpreted as a balanced
binary tree search. Database tables can be thought of as indexed lists, and table joins
represent relations between columns; these can be modeled as graphs instead. Assignment
of registers to variables (by a compiler), and assignment of available channels to radio
transmitters, are also graph problems. Finding the shortest path between two points, and
sorting web pages in order of importance, are graph problems as well. Neural networks are
graphs too. Interactions between messenger molecules in the body, and interactions between
people on social media, are likewise modeled as graphs.
Figure 1.1: Number of websites online from 1992 to 2019 [2].
The web has a bowtie structure on many levels, as shown in figure 1.2. There is usually
one giant strongly connected component, with several pages pointing into this component,
several pages pointed to by the component, and a number of disconnected pages. This
structure is seen as a fractal on many different levels [3].
Figure 1.2: Web’s bow tie structure on different aggregation levels [4].
Static graphs are those which do not change with time. Static graph algorithms (developed
since the 1940s) are techniques used to solve problems on such graphs. To solve larger
and larger problems, a number of optimizations (both algorithmic and hardware/software
techniques) have been developed to take advantage of vector processors (like Cray),
multicores, and GPUs. A lot of research had to be done in order to find ways to enhance
concurrency, especially given the lack of single-core performance improvements. The
techniques include a number of concurrency models, locking techniques, transactions, etc.
Graphs whose relations vary with time are called temporal graphs. As you might guess,
many problems involve temporal graphs. A temporal graph can be thought of as a series
of static graphs at different points in time. To solve graph problems on temporal graphs,
one would normally take the graph at a certain point in time and run the necessary static
graph algorithm on it. This works, but as the size of the temporal graph grows, this
repeated computation becomes increasingly slow. It is possible to take advantage of
previous results in order to compute the result for the next time point. Such algorithms
are called dynamic graph algorithms. This is an ongoing area of research, which includes
new algorithms and hardware/software optimization techniques for distributed systems,
multicores (shared memory), GPUs, and even FPGAs. Optimization of algorithms can focus
on space complexity (memory usage), time complexity (query time), preprocessing time,
and even accuracy of results.
While dynamic algorithms only focus on optimizing the algorithm’s computation time,
dynamic graph data structures focus on improving graph update time, and memory usage.
Dense graphs are usually represented by an adjacency matrix (bit matrix). Sparse graphs
can be represented with variations of adjacency lists (like CSR), and edge lists. Sparse
graphs can also be thought of as sparse matrices, and edges of a vertex can be considered
a bitset. In fact, a number of graph algorithms can be modeled as linear algebra operations
(see nvGraph, cuGraph frameworks). A number of dynamic graph data structures have also
been developed to improve update speed (like PMA), or enable concurrent updates and
computation (like Aspen’s compressed functional trees) [5]. These data formats are
illustrated in figure 1.3.
Figure 1.3: Illustration of fundamental graph representations (Adjacency Matrix, Adjacency List, Edge
List, CSR) [6].
Streaming / dynamic / time-evolving graph data structures maintain only the latest graph
information. Historical graphs, on the other hand, keep track of all previous states of the
graph. Changes to a graph can be thought of as edge insertions and deletions, which
are usually done in batches. Except for functional techniques, updating a graph usually
involves modifying a shared structure using some kind of fine-grained synchronization. It
might also be possible to store additional information along with vertices/edges, though this
is usually not the focus of research (graph databases do focus on it). In the recent decade or
so, a number of graph streaming frameworks have been developed, each with a certain focus
area, and targeting a certain platform (distributed system / multiprocessor / GPU / FPGA /
ASIC). Such frameworks focus on designing an improved dynamic graph data structure, and
define a fundamental model of computation. For GPUs, the following frameworks exist:
cuSTINGER, aimGraph, faimGraph, Hornet, EvoGraph, and GPMA [5].
2. PageRank algorithm
The PageRank algorithm is a technique used to sort web pages (or vertices of a graph) by
importance. It is popularly known as the algorithm published by the founders of Google. Other
link analysis algorithms include HITS [7], TrustRank, and Hummingbird. Such algorithms
are also used for word sense disambiguation in lexical semantics, urban planning [8],
ranking streets by traffic [9], identifying communities [10], measuring their impact on the
web, maximizing influence [11], providing recommendations [12], analysing neural/protein
networks, determining species essential for the health of the environment, and even quantifying
the scientific impact of researchers [13].
In order to understand the PageRank algorithm, consider the random (web) surfer model.
Each web page is modeled as a vertex, and each hyperlink as an edge. The surfer initially
visits a web page at random, then follows one of the links on the page, leading to another
web page. After following some links, the surfer eventually decides to visit another web
page (at random). The probability of the random surfer being on a certain page is what
the PageRank algorithm returns. This probability (or importance) of a web page depends
upon the importance of web pages pointing to it (a Markov chain). This definition of
PageRank is recursive, and takes the form of an eigenvalue problem. Solving for PageRank
thus requires multiple iterations of computation, which is known as the power-iteration
method. Each iteration is essentially a (sparse) matrix-vector multiplication. A damping
factor (of 0.85) is used to counter the effect of spider-traps (like self-loops), which can
otherwise suck up all importance. Dead-ends (web pages with no out-links) are countered
by effectively linking them to all vertices of the graph (making the Markov matrix column
stochastic), as they would otherwise leak out importance [14]. See figure 2.1 for an example.
Figure 2.1: Example of web pages with hyperlinks and respective PageRanks [15].
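In matrix form, this recursive definition is a standard eigenvector problem (a textbook formulation consistent with the description above, not taken verbatim from this report): with M the column-stochastic link matrix (after dead-ends are linked to all vertices), p the damping factor, and N the number of pages,

r = G\,r, \qquad G = p\,M + \frac{1-p}{N}\,\mathbf{1}\mathbf{1}^{T}

The power-iteration method repeatedly applies r ← Gr until the change in r falls below a tolerance.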
Note that, as originally conceived, the PageRank model does not factor a web browser's
back button into a surfer's hyperlinking possibilities. Surfers in one class, if teleporting, may
be much more likely to jump to pages about sports, while surfers in another class may be
much more likely to jump to pages pertaining to news and current events. Such differing
teleportation tendencies can be captured in different personalization vectors. However,
this makes the once query-independent, user-independent PageRank values user-dependent and
more calculation-laden. Nevertheless, this little personalization vector has had more
significant side effects: along with a non-uniform/weighted version of PageRank [16], it can
help control the spamming done by so-called link farms [3].
PageRank algorithms almost always take the following parameters: damping, tolerance,
and max. iterations. Here, tolerance bounds the error between the previous and the current
iteration. Though this error is usually the L1-norm, the L2 and L∞-norms are also used
sometimes. Both damping and tolerance control the rate of convergence of the algorithm,
and the choice of tolerance function also affects it. However, adjusting damping can
give completely different PageRank values. Since it is the ordering of vertices that is
important, and not the exact values, it is usually a good idea to choose a larger tolerance value.
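The following is a minimal sketch, in the style of the experiments of section 6, of the power-iteration (pull) loop with these three parameters; the in-neighbour lists, out-degrees, and the L1-norm convergence check are illustrative assumptions, not this report's actual interfaces.

#include <cmath>
#include <vector>

// PageRank by power-iteration (pull): r[v] = c0 + p * sum(r[u]/d[u]) over
// in-neighbours u, where c0 is the common teleport contribution (section 6.3).
std::vector<double> pageRank(const std::vector<std::vector<int>>& in,
                             const std::vector<int>& outDegree,
                             double p, double tolerance, int maxIterations) {
  size_t N = in.size();
  std::vector<double> r(N, 1.0 / N), s(N);
  for (int i = 0; i < maxIterations; ++i) {
    double c0 = (1 - p) / N;                      // teleport due to damping
    for (size_t u = 0; u < N; ++u)
      if (outDegree[u] == 0) c0 += p * r[u] / N;  // teleport from dead-ends
    double error = 0;                             // L1-norm between iterations
    for (size_t v = 0; v < N; ++v) {
      double sum = 0;
      for (int u : in[v]) sum += r[u] / outDegree[u];
      s[v] = c0 + p * sum;
      error += std::fabs(s[v] - r[v]);
    }
    r.swap(s);
    if (error < tolerance) break;                 // converged
  }
  return r;
}

Calling pageRank(in, outDegree, 0.85, 1e-6, 500) matches the common defaults mentioned in this report.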
3. Optimizing PageRank
Techniques to optimize the PageRank algorithm usually fall into two categories: one tries
to reduce the work per iteration, and the other tries to reduce the number of iterations.
These goals are often at odds with one another. The adaptive PageRank technique
"locks" vertices which have converged, and saves iteration time by skipping their
computation [3]. Identical nodes, which have the same in-links, can be removed to reduce
duplicate computations and thus reduce iteration time. Road networks often have chains
which can be short-circuited before PageRank computation to improve performance, since
the final ranks of chain nodes can be easily calculated. This reduces both the iteration
time and the number of iterations. If a graph has no dangling nodes, the PageRank of
each strongly connected component can be computed in topological order. This helps
reduce the iteration time and the number of iterations, and also enables concurrency in
PageRank computation. The combination of all of the above methods is the STICD
algorithm (see figure 3.1) [17]. A somewhat similar aggregation algorithm is BlockRank,
which computes the PageRank of hosts and the local PageRanks of pages within hosts
independently, and aggregates them with weights into the final rank vector. The global
PageRank solution can also be found in a computationally efficient manner by computing
the sub-PageRank of each connected component, then pasting the sub-PageRanks together
to form the global PageRank, using the method of Avrachenkov et al. These methods
exploit the inherent reducibility of the graph. Bianchini et al. suggest using the Jacobi
method to compute the PageRank vector [3]. Monte Carlo based PageRank methods
consider several random walks on the input graph to obtain approximate PageRanks.
Optimizations of these exist for distributed PageRank computation (especially for
undirected graphs) [18], along with a map-reduce algorithm for personalized PageRank [19],
and a reordering strategy (to reduce space and compute complexity on GPU) for local
PageRank [20].
Figure 3.1: STIC-D: Algorithmic optimizations for PageRank [17].
Iteration time can be reduced further by noting that the traditional algorithm is not
compute-bound, and generates fine-granularity random accesses (it exhibits irregular
parallelism). This causes poor memory bandwidth and compute utilization, the extent of
which is quite dependent upon the graph structure [21], [22]. Four strategies for
neighbour iteration were attempted, to help reason about the expected impact of a graph's
structure on the performance of each strategy [21]. CPUs/GPUs are generally designed
and optimized to load memory in blocks (cache lines in CPUs, coalesced memory reads in
GPUs), and not for fine-grained accesses. Being able to adjust this behaviour depending
upon the application (PageRank) can lead to performance improvements. Techniques used
include prefetching to SRAM, a high-performance shuffle network [23], an indirect memory
prefetcher (of the form A[B[i]]), partial cache-line accessing mechanisms [24], adjusting data
layout [22] (for sequential DRAM access [25]), and branch avoidance mechanisms (with
partitioning) [22]. Large graphs can be partitioned or decomposed into subgraphs
to reduce cross-partition data access, which helps both in distributed and in shared
memory systems (by reducing random accesses). Techniques like chunk partitioning [26],
cache/propagation blocking [27], partition-centric processing with the gather-apply-scatter
model [22], edge-centric scatter-gather with non-overlapping vertex-sets [28], exploiting
node-score sparsity [29], and even personalized-PageRank-based partitioning [30] have been
used. Graph/subgraph compression can also help reduce memory bottlenecks [26] [31], and
enable processing of larger graphs in memory. A number of techniques can be used to
compress adjacency lists, such as delta encoding of edge/neighbour ids [32], and referring to
sets of edges in other edge lists [33] [34] (though reference vertices are hard to find) [3].
Since the rank vector (possibly even including certain additional page-importance estimates)
must reside entirely in main memory, a few compression techniques have been attempted
for it as well. These include lossy encoding schemes based on scalar quantization, seeking to
minimize the distortion of search-result rankings [35] [3], and custom half-precision
floating-point formats [36].
As new software/hardware platforms appear on the horizon, researchers have been eager
to test the performance of PageRank on them. This is because each platform offers
its own unique architecture and engineering choices, and also because PageRank often
serves as a good benchmark of a platform's capability to handle various other graph
algorithms. Attempts have been made on distributed frameworks like Hadoop [37], and even
RDBMSs [38]. A number of implementations have been done on standard multicores [38],
the Cell BE [39] [28], AMD GPUs [40], NVIDIA/CUDA GPUs [41] [28] [42], GPU clusters [26],
FPGAs [43] [23] [31], CPU-FPGA hybrids [44] [45] [29], and even on SpMV ASICs [46].
PageRank is a live algorithm, which means that an ongoing computation can be
paused during a graph update, and simply resumed afterwards (instead of being restarted).
The first updating paper, by Chien et al. (2002), identifies a small portion of the web graph
"near" the link changes and models the rest of the web as a single node in a new, much
smaller graph; it computes a PageRank for this small graph and transfers these results to the
much bigger, original graph [3].
4. Graph streaming frameworks / databases
STINGER [47] uses an extended form of CSR, with edge lists represented as linked lists of
contiguous blocks. Each edge has 2 timestamps, and fine-grained locking is used per edge.
cuSTINGER extends STINGER for CUDA GPUs and uses contiguous edge lists (CSR)
instead. faimGraph [48] is a GPU framework with fully dynamic vertex and edge updates. It
has an in-GPU memory manager, and uses a paged linked list for edges similar to
STINGER. Hornet [49] also implements its own memory manager, and uses B+ trees to
maintain blocks efficiently and keep track of empty space. LLAMA uses a variant of CSR
with large multi-versioned arrays. It stores all snapshots of a graph, and persists old
snapshots to disk. GraphIn uses CSR along with edge lists, and updates the CSR once the
edge lists are large enough. GraphOne [50] is similar, and uses page-aligned memory for
high-degree vertices. GraphTau is based on Apache Spark and uses read-only partitioned
collections of data sets. It uses a sliding-window model for graph snapshots. Aspen [51]
uses a C-tree (tree of trees) based on purely functional compressed search trees to store
graph structures. Elements are stored in chunks and compressed using difference encoding.
It allows any number of readers and a single writer, and the framework guarantees strict
serializability. Tegra stores the full history of the graph and relies on recomputing graph
algorithms on affected subgraphs. It also uses a cost model to guess when full
recomputation might be better, and uses an adaptive radix tree as its core data structure for
efficient updates and range scans [5].
Unlike graph streaming frameworks, graph databases focus on rich attached data, complex
queries, transactional support with ACID properties, data replication, and sharding. A few
graph databases have started to support global analytics as well. However, most graph
databases do not offer dedicated support for incremental changes. Little research exists on
accelerating streaming graph processing using low-cost atomics, hardware transactions,
FPGAs, or high-performance networking hardware. On average, the highest rate of ingestion
is achieved by shared-memory single-node designs [5]. An overview of the graph
frameworks is shown in figure 4.1.
Figure 4.1: Overview of the domains and concepts in the practice and theory of streaming and
dynamic graph processing and algorithms [6].
5. NVIDIA Tesla V100 GPU Architecture
NVIDIA Tesla was a line of products targeted at stream processing / general-purpose
graphics processing units (GPGPUs). In May 2020, NVIDIA retired the Tesla brand because
of potential confusion with the brand of cars. Its new GPUs are branded NVIDIA Data
Center GPUs as in the Ampere A100 GPU [52].
The NVIDIA Tesla GV100 (Volta) is a 21.1 billion transistor chip fabricated on TSMC's
12 nm FinFET process, with a die size of 815 mm². Here is a short summary of its features:
● 84 SMs, each with 64 independent FP and INT cores.
● Shared memory size configurable up to 96 KB per SM.
● 8 512-bit memory controllers (4096-bit total).
● Up to 6 bidirectional NVLink links, 25 GB/s per direction (for IBM POWER9 CPUs).
● 4 dies per HBM stack, with 4 stacks: 16 GB at 900 GB/s HBM2 (Samsung).
● Native/sideband SECDED (1-correct, 2-detect) ECC (for HBM, registers, L1, L2).
Each SM has 4 processing blocks (each handling 1 warp of 32 threads). The L1 data cache is
combined with shared memory into 128 KB per SM (explicit caching is not as necessary
anymore). Volta also supports write-caching (not just load-caching, as in previous
architectures). NVLink supports coherency, allowing data read from GPU memory to be
stored in the CPU cache. Address Translation Services (ATS) allow the GPU to access CPU
page tables directly (i.e., malloc pointers). The new copy engine doesn't need pinned
memory. Volta's per-thread program counter and call stack allow interleaved execution of
warp threads (see figure 5.1), enabling fine-grained synchronization between threads within
a warp (using __syncwarp()). Cooperative groups enable synchronization between warps,
grid-wide, across multiple GPUs, cross-warp, and sub-warp [53].
Figure 5.1: Programs use Explicit Synchronization to Reconverge Threads in a Warp [53].
6. Experiments
6.1 Adjusting CSR format for graph
Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
commonly used for efficient graph computations. However, given N vertices, M edges, and a
32-bit / 4-byte vertex-id, it occupies 4(N + M) bytes of space. Note, however, that a 32-bit
unsigned integer is limited to just 4 billion ids, and thus massive graphs would need to use
a 64-bit / 8-byte vertex-id. This further raises the occupied space to 8(N + M) bytes. Since
large memories are difficult to make and tend to be slower than smaller ones [?], it makes
sense to try to reduce this space requirement. Hybrid CSR is a graph representation that
combines the ideas behind the adjacency list and the adjacency matrix [bfs-seema], with its
edge-lists being similar to roaring bitmaps [lemire]. Unlike CSR, which stores a list of indices
of destination vertices for each vertex, hybrid CSR uses smaller indices, each combined with
a dense bitset. This allows it to represent dense regions of a graph in a compact form.
An experiment was conducted to assess the space needed for graph representation with
various possible hybrid CSR formats, by adjusting the size of the dense bitset (block), and
hence the index-bits. Both 32-bit and 64-bit hybrid CSR are studied, and compared with
32-bit regular CSR. A 32-bit regular CSR is represented using a uint32_t data type, and
uses all 32 bits for the vertex-index (index-bits). It can support graphs with a maximum of
2^32 vertices (or simply a 32-bit vertex-id). A 32-bit hybrid CSR is also represented using a
uint32_t data type, where the lower b bits are used to store the dense bitset (block), and the
upper i = 32-b bits to store the index-bits. It supports an effective vertex-id of i+log2(b) =
32-b+log2(b) bits. For this experiment, the block size b is adjusted from 4 to 16 bits.
Similarly, a 64-bit hybrid CSR is represented using a uint64_t data type, where the lower b
bits are used to store the dense bitset (block) and the upper i = 64-b bits to store the
index-bits. Hence, the effective vertex-id supported is of i+log2(b) = 64-b+log2(b) bits. For
this experiment, the block size b is adjusted from 4 to 32 bits. For a given vertex-id v, the
index-bits are defined as v >> b, the block-bits are defined as 1 << (v & ones(b)), and thus
the hybrid CSR entry is (index-bits << b) | block-bits. Finding an edge-id in an edge-list
involves scanning all entries with matching index-bits, and once matched, checking if the
appropriate block-bit is set (for both hybrid CSRs). Since lowering the number of index-bits
reduces the maximum possible order of graph representable by the format, the effective bits
usable for vertex-id for each hybrid CSR variation are listed for reference. For this
experiment, edge-ids for each graph are first loaded into a 32-bit array-of-arrays structure,
and then converted to the desired CSR formats.
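A minimal sketch of these 32-bit hybrid CSR bit manipulations follows, directly transcribing the definitions above; the helper names (ones, hybridEntry, hasEdge) are illustrative, not the experiment's actual code.

#include <cstdint>

// Mask of b set bits (valid for b < 32), e.g. ones(4) == 0xF.
inline uint32_t ones(int b) { return (1u << b) - 1; }

// Hybrid CSR entry for vertex-id v: upper bits hold the index, lower b bits
// hold the dense bitset (block) with the bit for v set.
inline uint32_t hybridEntry(uint32_t v, int b) {
  uint32_t index = v >> b;               // index-bits
  uint32_t block = 1u << (v & ones(b));  // block-bit for v
  return (index << b) | block;
}

// Edge lookup: scan entries for matching index-bits, then test the block-bit.
inline bool hasEdge(const uint32_t* edges, int n, uint32_t v, int b) {
  for (int i = 0; i < n; ++i)
    if ((edges[i] >> b) == (v >> b))
      return edges[i] & (1u << (v & ones(b)));
  return false;
}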
All graphs used are stored in the MatrixMarket (.mtx) file format, and obtained from the
SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using
GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740 Rack
server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4 Synchronous
Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux release
7.9.2009 (Core). Statistics of each test case are printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and the necessary tables are set up with the help of the FILTER function to create the
charts.
It is observed that, for a given n-bit hybrid CSR, using the highest possible block size
(taking into account effective index-bits) results in the smallest space usage. The 32-bit
hybrid CSR with a 16-bit block achieves a maximum space usage (bytes) reduction of
~5x, but is unable to represent all the graphs under test (it has a 20-bit effective vertex-id).
With an 8-bit block, the space usage is reduced by ~3x-3.5x for coPapersCiteseer,
coPapersDBLP, and indochina-2004. The 64-bit hybrid CSR with a 32-bit block achieves a
maximum space usage reduction of ~3.5x, but generally does not perform well. However,
for massive graphs which cannot be represented with a 32-bit vertex-id, it is likely to
provide a significant reduction in space usage. This can be gauged by comparing the number
of destination-indices needed for each CSR variant, where it achieves a maximum
destination-indices reduction of ~7x. This reduction is likely to be higher for graphs
partitioned by hosts / heuristics / clustering algorithms, which is usually necessary for
massive graphs deployed in a distributed setting. This could be assessed in a future study.
Table 6.1.1: List of variations of CSR attempted, followed by list of programs including results &
figures.

Block size   | regular 32-bit | hybrid 32-bit          | hybrid 64-bit
single bit   | 32-bit index   |                        |
4-bit block  |                | 28-bit index (30 eff.) | 60-bit index (62 eff.)
8-bit block  |                | 24-bit index (27 eff.) | 56-bit index (59 eff.)
16-bit block |                | 16-bit index (20 eff.) | 48-bit index (52 eff.)
32-bit block |                |                        | 32-bit index (37 eff.)

1. Comparing space usage of regular vs hybrid CSR (various sizes).
Figure 6.1.1: Space usage (bytes) reduction ratio of each format. For graphs that can not be
represented with the given format, it is set to 0.
Figure 6.1.2: Destination-indices (total number of edge values) reduction ratio of each format. For
graphs that can not be represented with the given format, it is set to 0.
6.2 Adjusting Bitset for graph
Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
commonly used for efficient graph computations. Unfortunately, using CSR for dynamic
graphs is impractical, since the addition/deletion of a single edge can require on average
(N+M)/2 memory accesses in order to update the source-offsets and destination-indices. A
common approach is therefore to store the edge-lists/destination-indices as an array of
arrays, where each edge-list is an array belonging to a vertex. While this is good enough for
small graphs, it quickly becomes a bottleneck for large graphs. What causes this bottleneck
depends on whether the edge-lists are sorted or unsorted. If they are sorted, checking for
an edge requires about log(E) memory accesses, but adding an edge on average requires
E/2 accesses, where E is the number of edges of the given vertex. Note that both addition
and deletion of edges in a dynamic graph require checking for an existing edge before adding
or deleting it. If edge-lists are unsorted, checking for an edge requires around E/2 memory
accesses, but adding an edge requires only 1 memory access.
An experiment was conducted in an attempt to find a suitable data structure for
representing a bitset, which can be used to represent the edge-lists of a graph. The data
structures under test include single-buffer ones like the unsorted bitset and the sorted bitset;
single-buffer partitioned (by integers) like the partially-sorted bitset; and multi-buffer ones
like the small-vector optimization bitset (unsorted), and the 16-bit subrange bitset (todo). An
unsorted bitset consists of a vector (in C++) that stores all the edge-ids in the order they
arrive. Edge lookup consists of a simple linear search. Edge addition is a simple push-back
(after lookup). Edge deletion is a vector-delete, which requires all edge-ids after it to be
moved back (after lookup). A sorted bitset maintains edge-ids in ascending order. Edge
lookup consists of a binary search. Edge addition is a vector-insert, which requires all
edge-ids after it to be shifted one step ahead. Edge deletion is a vector-delete, just as in the
unsorted bitset. A partially-sorted bitset tries to amortize the cost of sorting edge-ids by
keeping recently added edges unsorted at the end (up to a limit) and maintaining the old
edges as sorted. Edge lookup consists of a binary search in the sorted partition, and then a
linear search in the unsorted partition, or the other way around. Edge addition is usually a
simple push-back and an update of the partition size. However, if the unsorted partition
grows beyond a certain limit, it is merged with the sorted partition in one of the following
ways: sort both partitions as a whole, merge the partitions using in-place merge, merge the
partitions using extra space for the sorted partition, or merge the partitions using extra space
for the unsorted partition (this requires a merge from the back end). Edge deletion checks
whether the edge can be brought into the unsorted partition (within the limit). If so, it simply
swaps it out with the last unsorted edge-id (and updates the partition size). However, if it
cannot be brought into the unsorted partition, a vector-delete is performed (again, updating
the partition size). A small-vector optimization bitset (unsorted) makes use of an additional
fixed-size buffer (this size is adjusted to different values) to store edge-ids until this buffer
overflows, at which point all edge-ids are moved to a dynamic (heap-allocated) vector. Edge
lookups, additions, and deletions are similar to those of an unsorted bitset, except that the
count of edge-ids in the fixed-size buffer and the selection of buffer or dynamic vector needs
to be handled with each operation. A sketch of the partially-sorted variant is given below.
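A minimal sketch of the partially-sorted bitset follows, using the in-place merge variant described above; the unsorted-partition limit and the names used are illustrative assumptions (deletion is omitted for brevity).

#include <algorithm>
#include <cstdint>
#include <vector>

class PartiallySortedBitset {
  std::vector<uint32_t> ids;           // [0, sorted) is sorted; rest unsorted
  size_t sorted = 0;                   // size of the sorted partition
  static constexpr size_t LIMIT = 128; // max unsorted entries before merging
public:
  bool lookup(uint32_t v) const {      // binary search, then linear search
    if (std::binary_search(ids.begin(), ids.begin() + sorted, v)) return true;
    return std::find(ids.begin() + sorted, ids.end(), v) != ids.end();
  }
  void add(uint32_t v) {               // push-back; merge once over the limit
    if (lookup(v)) return;             // additions first check for an existing edge
    ids.push_back(v);
    if (ids.size() - sorted > LIMIT) {
      std::sort(ids.begin() + sorted, ids.end());
      std::inplace_merge(ids.begin(), ids.begin() + sorted, ids.end());
      sorted = ids.size();
    }
  }
};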
All variants of the data structures were tested with real-world temporal graphs. These are
stored in a plain text file in “u, v, t” format, where u is the source vertex, v is the destination
vertex, and t is the UNIX epoch time in seconds. All of them are obtained from the Stanford
Large Network Dataset Collection. The experiment is implemented in C++, and compiled
using GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740
Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4
Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux
release 7.9.2009 (Core). The execution time of each test case is measured using
std::chrono::high_resolution_clock. This is done 5 times for each test case, and the timings
are averaged. Statistics of each test case are printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and the necessary tables are set up with the help of the FILTER function to create the
charts. Similar charts are combined into a single GIF (to help with interpretation of results).
From the results, it appears that the transpose of graphs based on a sorted bitset is clearly
faster than with an unsorted bitset. However, with reading graph edges there is no clear
winner (sometimes sorted is faster, especially for large graphs, and sometimes unsorted).
Possibly, when new edges have many duplicates there are fewer inserts, and hence the
sorted version is faster (since the sorted bitset has a slow insert time). The transpose of a
graph based on a fully-sorted bitset is also clearly faster than with the partially-sorted bitset.
This is possibly because partially-sorted-bitset based graphs cause more cache misses due
to random accesses (while reversing edges). However, with reading graph edges there is
again no clear winner (sometimes partially-sorted is faster, especially for large graphs, and
sometimes fully-sorted). For the small-vector optimization bitset, on average, a buffer size
of 4 seems to give a small improvement. Any further increase in buffer size slows down
performance. This is possibly because of the unnecessarily large contiguous memory
allocation needed by the buffer, and the low cache-hit percentage due to widely separated
edge data (caused by the static buffer). In fact, it even crashes when 26 instances of graphs
with varying buffer sizes cannot all be held in memory. Hence, small-vector optimization is
not so useful, at least when used for graphs.
Table 6.2.1: List of data structures for bitset attempted, followed by list of programs inc. results &
figures.

single-buffer | single-buffer partitioned | multi-buffer
unsorted      | partially-sorted          | small-vector (optimization)
sorted        |                           | subrange-16bit

1. Testing the effectiveness of sorted vs unsorted list of integers for BitSet.
2. Comparing various unsorted sizes for partially sorted BitSet.
3. Performance of fully sorted vs partially sorted BitSet (inplace-s128).
4. Comparing various buffer sizes for BitSet with small vector optimization.
5. Comparing various switch points for 16-bit subrange based BitSet.
6.3 Adjusting data types for rank vector
When PageRank is computed in a distributed setting for massive graphs, it is necessary
to communicate the ranks of a subgraph computed on one machine to other machines
over a network. Depending upon the algorithm, this message passing either needs to be
done every iteration [?], or after the subgraph has converged [sticd]. Minimizing this data
transfer can help improve the performance of the PageRank algorithm. One approach is to
compress the data using existing compression algorithms. Depending upon the achievable
compression ratio, and the time required to compress and decompress the data, this might
be a viable approach. Another approach is to use smaller, lower-precision data types
that can be directly used in the computation (or converted to a floating-point number on the
fly), without requiring any separate compression or decompression step.
An experiment was conducted to assess the suitability of BFloat16 in place of
Float32 as a storage type (converted to Float32 on the fly, during computation). BFloat16 is
a 2-byte lower-precision data type specially developed for use in machine learning, and is
available in recent GPUs. It is, quite simply, the upper 16 bits of the IEEE 754
single-precision floating point format (Float32). Conversion to and from BFloat16 is done
using bit-shift operators and reinterpret_cast. To make BFloat16 trivially replaceable with
Float32 in the PageRank algorithm, it is implemented as a class with appropriate
constructors (default, copy) and operator overloads (typecast, assignment). The experiment
is performed on a Xeon CPU, with a single thread, using the standard power-iteration
(pull) formulation of PageRank. The rank of a vertex in an iteration is calculated as c0 +
pΣrn/dn, where c0 is the common teleport contribution, p is the damping factor (0.85), rn is the
previous rank of a vertex with an incoming edge, dn is the out-degree of that incoming-edge
vertex, and N is the total number of vertices in the graph. The common teleport contribution
c0, calculated as (1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex
in the graph due to the damping factor, (1-p)/N, and the teleport from dangling vertices (with
no outgoing edges) in the graph, pΣrn/N. This is because a random surfer jumps to a random
page upon visiting a page with no links, in order to avoid the rank-sink effect. The ranks
obtained with the BFloat16 approach are compared with the standard Float32 data type
approach using the L1 norm (sum of absolute errors).
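A minimal sketch of such a BFloat16 storage type follows; the report mentions bit-shifts with reinterpret_cast, while std::memcpy is used here as the portable equivalent, and truncation (rather than rounding) of the lower mantissa bits is an assumption.

#include <cstdint>
#include <cstring>

class BFloat16 {
  uint16_t bits = 0;  // sign (1) + exponent (8) + mantissa (7)
public:
  BFloat16() = default;
  BFloat16(float x) {                    // keep only the upper 16 bits
    uint32_t u; std::memcpy(&u, &x, 4);  // bit-copy avoids aliasing issues
    bits = uint16_t(u >> 16);
  }
  operator float() const {               // widen on the fly for computation
    uint32_t u = uint32_t(bits) << 16;
    float x; std::memcpy(&x, &u, 4);
    return x;
  }
};

Since BFloat16 converts implicitly to and from float, a rank vector declared as std::vector<BFloat16> can be dropped into a power-iteration loop unchanged.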
All graphs used in this experiment are stored in the MatrixMarket (.mtx) file format, and
obtained from the SuiteSparse Matrix Collection. The experiment is implemented in C++,
and compiled using GCC 9 with optimization level 3 (-O3). The system used is a Dell
PowerEdge R740 Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB
DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running
CentOS Linux release 7.9.2009 (Core). The execution time of each test case is measured
using std::chrono::high_resolution_clock. This is done 5 times for each test case, and the
timings are averaged. Statistics of each test case are printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and the necessary tables are set up with the help of the FILTER function to create the
charts.
It is observed that the error associated with using BFloat16 as the storage type is too high,
making it unsuitable for use with the PageRank algorithm. Future work may explore the
usability of BFloat16 only during the message passing steps (after a full iteration, or after
convergence of a subgraph), or attempt other custom data types suitable for PageRank
(possibly non-byte-aligned).
Table 6.3.1: List of rank data types attempted, followed by programs inc. results & figures.

Data types: custom fp16, bfloat16, float, double
1. Performance of vector element sum using float vs bfloat16 as the storage type.
2. Comparison of PageRank using float vs bfloat16 as the storage type (pull, CSR).
3. Performance of PageRank using 32-bit floats vs 64-bit floats (pull, CSR).
6.4 Adjusting PageRank parameters
Adjusting the damping factor of the PageRank algorithm can have a significant effect on
its convergence rate (as mentioned in the literature), both in terms of time and iterations.
For this experiment, the damping factor d (which is usually 0.85) is varied from 0.50 to 1.00
in steps of 0.05, in order to compare the performance variation with each damping factor.
The calculated error is the L1-norm with respect to the default PageRank (d=0.85). The
PageRank algorithm used here is the standard power-iteration (pull) based PageRank. The
rank of a vertex in an iteration is calculated as c0 + pΣrn/dn, where c0 is the common teleport
contribution, p is the damping factor, rn is the previous rank of a vertex with an incoming
edge, dn is the out-degree of that incoming-edge vertex, and N is the total number of vertices
in the graph. The common teleport contribution c0, calculated as (1-p)/N + pΣrn/N, includes
the contribution due to a teleport from any vertex in the graph due to the damping factor,
(1-p)/N, and the teleport from dangling vertices (with no outgoing edges) in the graph,
pΣrn/N. This is because a random surfer jumps to a random page upon visiting a page with
no links, in order to avoid the rank-sink effect.
All graphs used in this experiment are stored in the MatrixMarket (.mtx) file format, and
obtained from the SuiteSparse Matrix Collection. The experiment is implemented in C++,
and compiled using GCC 9 with optimization level 3 (-O3). The system used is a Dell
PowerEdge R740 Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB
DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running
CentOS Linux release 7.9.2009 (Core). The execution time of each test case is measured
using std::chrono::high_performance_timer. This is done 5 times for each test case, and
timings are averaged. Statistics of each test case is printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and necessary tables are set up with the help of the FILTER function to create the
charts.
As expected, increasing the damping factor beyond 0.85 significantly increases the
convergence time, and lowering it below 0.85 decreases the convergence time. Note that a
higher damping factor implies that the random surfer follows links with higher probability
(and jumps to a random page with lower probability). Also note that 500 is the maximum
number of iterations allowed here.
CHARTS HERE
Observing that adjusting the damping factor has a significant effect, another experiment was
performed: adjusting the damping factor (alpha) in steps. Start with a small alpha, and
change it once PageRank has converged, until the final desired value of alpha is reached.
For example, start initially with alpha = 0.5, let PageRank converge quickly, and then switch
to alpha = 0.85 and run PageRank until it converges. Using a single step like this seems like
it might help reduce iterations. Unfortunately, it doesn't. Trying multiple steps tends to
produce an even higher iteration count.
CHARTS HERE
Similar to the damping factor, adjusting the tolerance value has a significant effect as well.
Apart from that, it is observed that different people make use of different error functions for
measuring tolerance. Although the L1 norm is commonly used for convergence checks, it
appears nvGraph uses the L2 norm instead. Another person on Stack Overflow seems to
suggest the use of a per-vertex tolerance comparison, which is essentially the L∞ norm. This
experiment compares the performance of the L1, L2, and L∞ norms for various tolerance
values. Each approach was attempted on a number of graphs, varying the tolerance from
10^-0 to 10^-10 for each tolerance function. Results show that the L∞ norm is the fastest
convergence check for all graphs. For road networks, which have a large number of
vertices, using the L∞ norm is orders of magnitude faster. For smaller values of tolerance
the ranks even converge in just 1 iteration. This is possibly because the per-vertex update of
ranks is smaller than 10^-6. Also note that the L2 norm is initially faster than the L1 norm,
but quickly slows down relative to it for most graphs. However, it is always faster for road
networks.
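For reference, the three convergence checks compared here can be sketched as follows (a straightforward transcription of the norm definitions; the function names are illustrative):

#include <algorithm>
#include <cmath>
#include <vector>

// Error between the rank vectors of two successive iterations.
double errorL1(const std::vector<double>& x, const std::vector<double>& y) {
  double e = 0;  // sum of absolute differences
  for (size_t i = 0; i < x.size(); ++i) e += std::fabs(x[i] - y[i]);
  return e;
}
double errorL2(const std::vector<double>& x, const std::vector<double>& y) {
  double e = 0;  // square root of the sum of squared differences
  for (size_t i = 0; i < x.size(); ++i) e += (x[i] - y[i]) * (x[i] - y[i]);
  return std::sqrt(e);
}
double errorLinf(const std::vector<double>& x, const std::vector<double>& y) {
  double e = 0;  // largest per-vertex change
  for (size_t i = 0; i < x.size(); ++i) e = std::max(e, std::fabs(x[i] - y[i]));
  return e;
}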
CHARTS HERE
Table 6.4.1: List of parameter adjustments attempted, followed by programs inc. results & figures.

Damping factor | adjust  | dynamic-adjust
Tolerance      | L1 norm | L2 norm | L∞ norm
1. Comparing the effect of using different values of damping factor, with PageRank (pull, CSR).
2. Experimenting PageRank improvement by adjusting damping factor (α) between iterations.
3. Comparing the effect of using different functions for convergence check, with PageRank (...).
4. Comparing the effect of using different values of tolerance, with PageRank (pull, CSR).
6.5 Adjusting ranks for dynamic graphs
When a graph is updated, there are a number of strategies to set up the initial rank
vector for obtaining the PageRanks of the updated graph, using the ranks from the old
graph. One approach is to zero-fill the ranks of the new vertices. Another approach is to use
1/N for the new vertices. Yet another approach is to scale the ranks of the existing vertices
and use 1/N for the new vertices.
An experiment is conducted with each technique on different temporal graphs, updating
each graph with multiple batch sizes. For each batch size, static PageRank as well as the 3
dynamic rank adjustment methods are tested. All rank adjustment strategies are performed
using a common adjustment function that adds a value to the old ranks, multiplies them by a
value, and sets a value for the new ranks (sketched below). The PageRank algorithm used is
the standard power-iteration (pull) based one, which optionally accepts initial ranks. The rank
of a vertex in an iteration is calculated as c0 + pΣrn/dn, where c0 is the common teleport
contribution, p is the damping factor (0.85), rn is the previous rank of a vertex with an
incoming edge, dn is the out-degree of that incoming-edge vertex, and N is the total number
of vertices in the graph. The common teleport contribution c0, calculated as (1-p)/N + pΣrn/N,
includes the contribution due to a teleport from any vertex in the graph due to the damping
factor, (1-p)/N, and the teleport from dangling vertices (with no outgoing edges) in the graph,
pΣrn/N. This is because a random surfer jumps to a random page upon visiting a page with
no links, in order to avoid the rank-sink effect.
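A minimal sketch of this common adjustment function, and how the three strategies map onto it, follows; the scaling factor oldN/N for the third strategy is an assumption (it keeps the ranks summing to 1 when new vertices receive 1/N), and all names are illustrative.

#include <vector>

// Old vertices [0, oldN) get (r + add) * mul; new vertices [oldN, N) get fillNew.
void adjustRanks(std::vector<double>& r, size_t oldN, size_t N,
                 double add, double mul, double fillNew) {
  r.resize(N);
  for (size_t v = 0; v < oldN; ++v) r[v] = (r[v] + add) * mul;
  for (size_t v = oldN; v < N; ++v) r[v] = fillNew;
}

// Usage for the three strategies:
//   zero-fill new:       adjustRanks(r, oldN, N, 0, 1, 0);
//   1/N for new:         adjustRanks(r, oldN, N, 0, 1, 1.0/N);
//   scale old, 1/N new:  adjustRanks(r, oldN, N, 0, double(oldN)/N, 1.0/N);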
All graphs (temporal) used in this experiment are stored in a plain text file in “u, v, t” format,
where u is the source vertex, v is the destination vertex, and t is the UNIX epoch time in
seconds. All of them are obtained from the Stanford Large Network Dataset Collection. If
initial ranks are not provided, they are set to 1/N. The error check is done using the L1 norm
with respect to static PageRank (without initial ranks). The experiment is implemented in C++, and compiled
using GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740
Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4
Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux
release 7.9.2009 (Core). The execution time of each test case is measured using
std::chrono::high_resolution_clock. This is done 5 times for each test case, and the timings
are averaged. Statistics of each test case are printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and the necessary tables are set up with the help of the FILTER function to create the
charts.
Each rank adjustment method (for dynamic PageRank) can take a different number of
iterations to converge. The 3rd approach, which scales the old ranks and uses 1/N for new
vertices, seems to perform best. It is also seen that as the batch size increases, the
convergence iterations (and time) of dynamic PageRank increase. In some cases it even
becomes slower than static PageRank.
Table 6.5.1: List of rank adjustment strategies attempted, followed by programs inc. results & figures.

update new      | zero fill | 1/N fill
update old, new | scale     | 1/N fill
1. Comparing strategies to update ranks for dynamic PageRank (pull, CSR).
6.6 Adjusting OpenMP PageRank
For massive graphs that fit in RAM but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate PageRank computation. If the NUMA architecture of the system is properly taken
into account, with good vertex partitioning, the speedup can be significant. As a step in
this direction, experiments are conducted to implement PageRank in OpenMP using two
different approaches: uniform and hybrid. The uniform approach runs all primitives required
for PageRank in OpenMP mode (with multiple threads). The hybrid approach, on the other
hand, runs certain primitives in sequential mode (i.e., sumAt, multiply).
Before starting an OpenMP implementation, a good sequential PageRank implementation
needs to be set up. There are two ways (algorithmically) to think of the PageRank
calculation. One approach (push) is to find PageRank by pushing contributions to
out-vertices. The push method is somewhat easier to implement, and is described in this
lecture. With this approach, in each iteration, for each vertex, the ranks of vertices connected
to its outgoing edges are accumulated with p×rn, where p is the damping factor (0.85), and rn
is the rank of the (source) vertex in the previous iteration. If a vertex has no outgoing edges,
it is considered to have outgoing edges to all vertices in the graph (including itself). This is
because a random surfer jumps to a random page upon visiting a page with no links, in order
to avoid the rank-sink effect. However, this approach requires multiple writes per source
vertex, due to the accumulation (+=) operation. The other approach (pull) is to pull
contributions from in-vertices. Here, the rank of a vertex in an iteration is calculated as c0 +
pΣrn/dn, where c0 is the common teleport contribution, p is the damping factor (0.85), rn is the
previous rank of a vertex with an incoming edge, dn is the out-degree of that incoming-edge
vertex, and N is the total number of vertices in the graph. The common teleport contribution
c0, calculated as (1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex
in the graph due to the damping factor, (1-p)/N, and the teleport from dangling vertices (with
no outgoing edges) in the graph, pΣrn/N (to avoid the rank-sink effect). This approach
requires 2 additional calculations per vertex, i.e., the non-teleport contribution of each vertex,
and the total teleport contribution (to all vertices). However, it requires only 1 write per
destination vertex. For this experiment, both of these approaches are assessed on a number
of different graphs. A sketch contrasting the two updates follows.
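This is a minimal sketch of the two per-iteration updates (teleport and dangling-vertex handling are omitted for brevity, and the adjacency-list form shown is an assumption, not the DiGraph class used in the experiment):

#include <vector>

const double p = 0.85;  // damping factor

// Push: scatter each vertex's contribution to its out-neighbours.
// The += below means multiple writes per source vertex, requiring atomics
// or locks when parallelized.
void pushIteration(std::vector<double>& next, const std::vector<double>& rank,
                   const std::vector<std::vector<int>>& out) {
  for (size_t u = 0; u < out.size(); ++u) {
    if (out[u].empty()) continue;  // dangling: handled via teleport (omitted)
    for (int v : out[u])
      next[v] += p * rank[u] / out[u].size();
  }
}

// Pull: gather contributions from in-neighbours; contrib[u] = rank[u]/d[u]
// is precomputed once per iteration, and c0 is the common teleport term.
// Only one write per destination vertex.
void pullIteration(std::vector<double>& next, const std::vector<double>& contrib,
                   const std::vector<std::vector<int>>& in, double c0) {
  for (size_t v = 0; v < in.size(); ++v) {
    double sum = 0;
    for (int u : in[v]) sum += contrib[u];
    next[v] = c0 + p * sum;
  }
}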
All graphs used are stored in the MatrixMarket (.mtx) file format, and obtained from the
SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using
GCC 9 with OpenMP flag (-fopenmp), optimization level 3 (-O3). The system used is a Dell
PowerEdge R740 Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB
DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running
CentOS Linux release 7.9.2009 (Core). The execution time of each test case is measured
using std::chrono::high_resolution_clock. This is done 5 times for each test case, and the
timings are averaged. Statistics of each test case are printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and the necessary tables are set up with the help of the FILTER function to create the
charts.
While it might seem that the pull method would be a clear winner, the results indicate that,
although pull is always faster than the push approach, the difference between the two
depends on the nature of the graph. The next step is to compare the performance of finding
PageRank using the C++ DiGraph class directly (using arrays of edge-lists) vs its CSR
(Compressed Sparse Row) representation (contiguous). Using a CSR representation has
the potential for performance improvement, since information on vertices and edges is
stored contiguously.
Table 6.6.1: Adjusting Sequential approach
Push Pull Class CSR
1. Performance of contribution-push based vs contribution-pull based PageRank.
2. Performance of C++ DiGraph class based vs CSR based PageRank (pull).
Both uniform and hybrid OpenMP techniques were attempted on different types of graphs.
All OpenMP based functions are defined with a parallel for clause and static scheduling with
a chunk size of 4096. Where necessary, a reduction clause is used. The number of threads
for this experiment (set using OMP_NUM_THREADS) was varied from 2 to 48. Results show
that the hybrid approach performs worse in most cases, and is only slightly better than the
uniform approach in a few cases. This could possibly be because of proper chip/core
scheduling handled by OpenMP when it is used for all the primitives. A sketch of this
primitive style is shown below.
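The primitive style described above can be sketched as follows (compiled with -fopenmp; the function names are illustrative, not the experiment's actual primitives):

#include <vector>

// Parallel for with static scheduling, chunk size 4096.
void multiplyValues(std::vector<double>& a, const std::vector<double>& x,
                    const std::vector<double>& y) {
  #pragma omp parallel for schedule(static, 4096)
  for (long i = 0; i < (long) a.size(); ++i)
    a[i] = x[i] * y[i];
}

// Reduction clause used where a single value is accumulated.
double sumValues(const std::vector<double>& x) {
  double s = 0;
  #pragma omp parallel for schedule(static, 4096) reduction(+ : s)
  for (long i = 0; i < (long) x.size(); ++i)
    s += x[i];
  return s;
}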
Table 6.6.2: Adjusting OpenMP approach
Map Reduce Uniform Hybrid
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Performance of sequential execution based vs OpenMP based vector element sum.
3. Performance of uniform-OpenMP based vs hybrid-OpenMP based PageRank (pull, CSR).
In the final experiment, the performance of OpenMP based PageRank is contrasted with
the sequential approach and with nvGraph PageRank. OpenMP based PageRank does
provide a clear benefit for most graphs with respect to sequential PageRank. This speedup is
definitely not directly proportional to the number of threads, as one would normally expect
(Amdahl's law). However, nvGraph is clearly far faster than the OpenMP version. This is
as expected, because nvGraph makes use of the GPU for performance.
Table 6.6.3: Comparing the sequential approach

           | OpenMP | nvGraph
Sequential | vs     | vs
OpenMP     |        | vs
1. Performance of sequential execution based vs OpenMP based PageRank (pull, CSR).
2. Performance of sequential execution based vs nvGraph based PageRank (pull, CSR).
3. Performance of OpenMP based vs nvGraph based PageRank (pull, CSR).
6.7 Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD)
Techniques to optimize the PageRank algorithm usually fall into two categories: one tries to
reduce the work per iteration, and the other tries to reduce the number of iterations.
These goals are often at odds with one another. Skipping computation on vertices which
have already converged has the potential to save iteration time. Skipping in-identical
vertices, i.e., those with the same in-links, helps reduce duplicate computations and thus
could help reduce iteration time. Road networks often have chains which can be
short-circuited before PageRank computation to improve performance, since the final ranks
of chain nodes can be easily calculated. This could reduce both the iteration time and the
number of iterations. If a graph has no dangling nodes, the PageRank of each strongly
connected component can be computed in topological order. This could help reduce the
iteration time and the number of iterations, and also enable multi-iteration concurrency in
PageRank computation. The combination of all of the above methods is the STICD
algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected
can be skipped altogether.
However, the STICD algorithm requires the graph to be free of dangling nodes. Although this
can easily be dealt with by adding self-loops to those nodes, such a modification of the
graph may be undesirable in some cases. Another way to deal with the issue is to perform
the PageRank computation on the entire graph at once, instead of doing it in topological
ordering. With this approach, dangling nodes are dealt with by calculating a teleport
contribution that is shared among all nodes in the graph (as in standard pull-based
PageRank). We can still take advantage of the locality benefits of splitting the
graph by components, skipping in-identicals to reduce iteration time, skipping chains to
reduce iteration time and the number of iterations, and skipping converged nodes (as
mentioned).
Before starting any algorithmic optimization, a good monolithic PageRank implementation
needs to be set up. There are two ways (algorithmically) to think of the PageRank
calculation. One approach (push) is to find PageRank by pushing contributions to
out-vertices. The push method is somewhat easier to implement, and is described in this
lecture. With this approach, in each iteration, for each vertex, the ranks of vertices connected
to its outgoing edges are accumulated with p×rn, where p is the damping factor (0.85), and rn
is the rank of the (source) vertex in the previous iteration. If a vertex has no outgoing edges,
it is considered to have outgoing edges to all vertices in the graph (including itself). This is
because a random surfer jumps to a random page upon visiting a page with no links, in order
to avoid the rank-sink effect. However, this approach requires multiple writes per source
vertex, due to the accumulation (+=) operation. The other approach (pull) is to pull
contributions from in-vertices. Here, the rank of a vertex in an iteration is calculated as c0 +
pΣrn/dn, where c0 is the common teleport contribution, p is the damping factor (0.85), rn is the
previous rank of a vertex with an incoming edge, dn is the out-degree of that incoming-edge
vertex, and N is the total number of vertices in the graph. The common teleport contribution
c0, calculated as (1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex
in the graph due to the damping factor, (1-p)/N, and the teleport from dangling vertices (with
no outgoing edges) in the graph, pΣrn/N (to avoid the rank-sink effect). This approach
requires 2 additional calculations per vertex, i.e., the non-teleport contribution of each vertex,
and the total teleport contribution (to all vertices). However, it requires only 1 write per
destination vertex. For this experiment, both of these approaches are assessed on a number
of different graphs.
All graphs used are stored in the MatrixMarket (.mtx) file format, and obtained from the
SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using
GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740 Rack
server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4 Synchronous
Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux release
7.9.2009 (Core). The execution time of each test case is measured using
std::chrono::high_resolution_clock. This is done 5 times for each test case, and the timings
are averaged. Statistics of each test case are printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and the necessary tables are set up with the help of the FILTER function to create the
charts.
While it might seem that the pull method would be a clear winner, the results indicate that,
although pull is always faster than the push approach, the difference between the two
depends on the nature of the graph. The next step is to compare the performance of finding
PageRank using the C++ DiGraph class directly (using arrays of edge-lists) vs its CSR
(Compressed Sparse Row) representation (contiguous). Using a CSR representation has
the potential for performance improvement, since information on vertices and edges is
stored contiguously.
Table 6.7.1: Adjusting Monolithic (Sequential) approach
Push Pull Class CSR
1. Performance of contribution-push based vs contribution-pull based PageRank.
2. Performance of C++ DiGraph class based vs CSR based PageRank (pull).
Next an experiment is conducted to assess the performance benefit of each algorithmic
optimization separately. For splitting graph by components optimization, the following
approaches are compared: PageRank without optimization, PageRank with vertices split by
components, and finally PageRank with components sorted in topological order.
Components of the graph are obtained using Kosaraju’s algorithm. Topological ordering is
done by representing the graph as a block-graph, where each component is represented as
a vertex, and cross-edges between components are represented as edges. This block-graph
is then topologically sorted, and this vertex-order in block-graph is used to reorder the
components in topological order. Vertices, and their respective edges are accordingly simply
reordered before computing PageRank (no graph partitioning is done). Each approach was
attempted on a number of graphs. On a few graphs, splitting vertices by components
provides a speedup, but sorting components in topological order provides no additional
speedup. For road networks, like germany_osm which only have one component, the
speedup is possibly because of the vertex reordering caused by dfs() which is required for
25
splitting by components. For skipping in-identicals optimization, comparison is done with
unoptimized PageRank. In-identical vertices (vertices with identical in-neighbour sets) are
found by hashing each vertex's in-vertex list and comparing matches. Except for the first
in-identical vertex of an in-identicals-group,
remaining vertices are skipped during PageRank computation. After each iteration ends,
rank of the first in-identical vertex is copied to the remaining vertices of the in-identicals
group. The vertices to be skipped are marked with negative source-offset in CSR. On
indochina-2004 graph, skipping in-identicals provides a speedup of ~1.8, but on average
provides no speedup for other graphs. This is likely because indochina-2004 has a large
number of in-identicals and in-identical groups, although it has neither the highest
in-identicals % nor the highest avg. in-identical group size. For
skipping chains optimization, comparison is done with unoptimized PageRank. It is
important to note that a chain here means a set of unidirectional links connecting one vertex
to the next, without any additional edges. Bi-directional links are not considered as chains.
Chain vertices are obtained by traversing 2-degree vertices in both directions and marking
visited ones. Except the first chain vertex of a chains-group, remaining vertices are skipped
during PageRank computation. After each iteration ends, ranks of the remaining vertices in
each chains-group is updated using the geometric-progression (GP) formula
c0×(1-p^n)/(1-p) + p^n×r, where c0 is the
common teleport contribution, p is the damping factor, n is the distance from the first chain
vertex, and r is the rank of the first chain vertex in previous iteration. The vertices to be
skipped are marked with negative source-offset in CSR. On average, skipping chain vertices
provides no speedup. This is likely because most graphs don't have enough chains to
provide an advantage. Road networks do have chains, but they are bi-directional, and thus
not considered here. For skipping converged vertices optimization, the following
approaches are compared: PageRank without optimization, PageRank skipping converged
vertices with re-check (in 2-16 turns), and PageRank skipping converged vertices after
several turns (in 2-64 turns). The skip-with-re-check (skip-check) approach skips the current
iteration for a vertex if its ranks in the last two iterations match and the current turn (iteration)
is not a “check” turn. The check turn is adjusted between 2-16 turns. Skip after turns
(skip-after) skips all future iterations of a vertex after its rank does not change for “after”
turns. The after turns are adjusted between 2-64 turns. On average, neither skip-check, nor
skip-after gives better speed than the default (unoptimized) approach. This could be due to
the unnecessary iterations added by skip-check (mistakenly skipped), and increased
memory accesses performed by skip-after (tracking converged count).
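As a sketch, the chain-rank update described above reduces to a small closed-form helper. The function name is an assumption; the symbols follow the GP formula given earlier:

    // Rank of a chain vertex at distance n from the chain head (GP formula):
    // c0: common teleport contribution, p: damping factor, r: previous-iteration
    // rank of the first chain vertex. Illustrative sketch.
    #include <cmath>

    double chainRank(double c0, double p, int n, double r) {
      double pn = std::pow(p, n);
      return c0 * (1 - pn) / (1 - p) + pn * r;
    }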
Table 6.7.2: Adjusting Monolithic optimizations (from STICD)
Split components | Skip in-identicals | Skip chains | Skip converged
1. Performance benefit of PageRank with vertices split by components (pull, CSR).
2. Performance benefit of skipping in-identical vertices for PageRank (pull, CSR).
3. Performance benefit of skipping chain vertices for PageRank (pull, CSR).
4. Performance benefit of skipping converged vertices for PageRank (pull, CSR).
For this experiment Monolithic PageRank (static and dynamic) is contrasted with
nvGraph PageRank (static and dynamic). For dynamic PageRank (monolithic / nvGraph),
initial ranks are set to the ranks obtained from static PageRank of the graph in the previous
instant (or batch). Temporal graphs are stored in a plain text file in “u, v, t” format, where u
is the source vertex, v is the destination vertex, and t is the UNIX epoch time in seconds. All
of them are obtained from the Stanford Large Network Dataset Collection. They are loaded
in multiple batch sizes (1, 5, 10, 50, ...). New edges are incrementally added to the graph
batch-by-batch until the entire graph is complete. Fixed graphs are stored in the
MatrixMarket (.mtx) file format, and are obtained from the SuiteSparse Matrix Collection.
They are loaded in multiple batch sizes (1, 5, 10, 50, ...), as with temporal graphs. For each
batch size B, the same number of random edges are added to the graph, with the probability
of an edge being attached to a vertex directly proportional to its out-degree. As
expected, results show dynamic PageRank to be clearly faster than static PageRank for
most cases (for both temporal and fixed graphs).
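The batch loop behind these measurements can be sketched as follows; Graph, addEdges, and pagerankStatic are assumed stand-ins for the actual graph type, update routine, and solver, not the report's API. The only difference between the static and dynamic runs is whether the previous ranks seed the next computation:

    // Dynamic PageRank over edge batches (illustrative sketch).
    #include <algorithm>
    #include <vector>

    struct Graph;                                                  // assumed
    struct Edge { int u, v; };                                     // "u, v" pair
    void addEdges(Graph& g, const Edge* first, const Edge* last);  // assumed
    std::vector<double> pagerankStatic(const Graph& g,
                                       const std::vector<double>& init);  // assumed

    void runBatches(Graph& g, const std::vector<Edge>& edges, size_t batch) {
      std::vector<double> ranks;  // empty => solver starts from uniform 1/N
      for (size_t i = 0; i < edges.size(); i += batch) {
        size_t end = std::min(i + batch, edges.size());
        addEdges(g, edges.data() + i, edges.data() + end);
        ranks = pagerankStatic(g, ranks);  // dynamic: warm-start from old ranks
      }
    }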
Table 6.7.3: Comparing dynamic approach with static
                  | nvGraph dynamic | Monolithic dynamic
nvGraph static    | vs: temporal    |
Monolithic static |                 | vs: fixed, temporal
1. Performance of nvGraph based static vs dynamic PageRank (temporal).
2. Performance of static vs dynamic PageRank (temporal).
3. Performance of static vs dynamic levelwise PageRank (fixed).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
The purpose of this experiment is to settle on a good CUDA implementation of static
PageRank. PageRank uses map-reduce primitives in each iteration step (like multiply and
sum). Two floating-point vectors x and y, with no. of elements 1E+6 to 1E+9 were multiplied
using CUDA. Each no. of elements was attempted with various CUDA launch configs,
running each config 5 times to get a good time measure. Multiplication here represents any
memory-aligned independent operation. Using a large grid_limit and a block_size of 256
could be a decent choice (for both float and double).
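The map step can be sketched as a grid-stride CUDA kernel, so that any grid_limit × block_size launch covers all N elements (kernel and helper names are illustrative):

    // Element-wise multiply (the "map" step): grid-stride loop over N elements.
    __global__ void multiplyKernel(float* a, const float* x, const float* y, int N) {
      int stride = gridDim.x * blockDim.x;
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += stride)
        a[i] = x[i] * y[i];  // any memory-aligned independent operation
    }

    // Launch with the grid size capped at grid_limit, e.g.:
    //   int blocks = min(grid_limit, (N + 255) / 256);
    //   multiplyKernel<<<blocks, 256>>>(a, x, y, N);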
A floating-point vector x, with no. of elements 1E+6 to 1E+9 was summed up using CUDA
(Σx). Each no. of elements was attempted with various CUDA launch configs, running each
config 5 times to get a good time measure. Sum here represents any reduction operation
that processes several values to a single value. This sum can be performed with two
possible approaches: memcpy, or in-place. With the memcpy approach, partial results are
transferred to CPU, where the final sum is calculated. If the result can be used within the
GPU itself, it might be faster to calculate complete sum in-place instead of transferring to
CPU. This is done using either 2 (if grid_limit is 1024) or 3 kernel calls (otherwise). A
block_size of 128 (decent choice for sum) is used for the 2nd kernel, if there are 3 kernels.
With the memcpy approach, using float values, a grid_limit of 1024 and a block_size of 128 is a
decent choice. For double values, a grid_limit of 1024 and a block_size of 256 is a decent
choice. With the in-place approach, a number of possible optimizations, including multiple
reads per loop iteration, loop-unrolled reduce, and atomic adds, provided no benefit. A simple
one-read-per-loop-iteration, standard reduce loop (minimizing warp divergence) is both
shorter and works best. For float, a grid_limit of 1024 and a block_size of 128 is a decent
choice. For double, a grid_limit of 1024 and a block_size of 256 is a decent choice.
Comparing both approaches shows similar performance.
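The reduce step used here can be sketched as follows: one read per loop iteration into shared memory, then a standard tree reduce that keeps active threads contiguous to minimize warp divergence (kernel name illustrative):

    // Per-block partial sum (the "reduce" step). out[] gets one value per block;
    // with the memcpy approach out[] is summed on the CPU, with the in-place
    // approach the kernel is launched again on out[] until one value remains.
    // Launch with blockDim.x * sizeof(float) bytes of dynamic shared memory.
    __global__ void sumKernel(float* out, const float* x, int N) {
      extern __shared__ float cache[];
      int t = threadIdx.x, stride = gridDim.x * blockDim.x;
      float s = 0;
      for (int i = blockIdx.x * blockDim.x + t; i < N; i += stride)
        s += x[i];                      // one read per loop iteration
      cache[t] = s;
      __syncthreads();
      for (int k = blockDim.x / 2; k > 0; k /= 2) {  // standard reduce loop
        if (t < k) cache[t] += cache[t + k];
        __syncthreads();
      }
      if (t == 0) out[blockIdx.x] = cache[0];
    }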
This experiment was for finding a suitable launch config for CUDA thread-per-vertex
PageRank. For the launch config, the block-size (threads) was adjusted from 32-1024, and
the grid-limit (max grid-size) was adjusted from 1024-32768. Each config was run 5 times
per graph to get a good time measure.
On average, the launch config doesn't seem to have a significant impact on performance.
However 8192x128 appears to be a good config. Here 8192 is the grid-limit, and 128 is the
block-size. Comparing with the graph properties, it seems it would be better to use 8192x512
for graphs with high avg. density, and 8192x32 for graphs with high avg. degree. Perhaps
sorting the vertices by degree can have a good effect (due to less warp divergence). Note
that this applies to Tesla V100 PCIe 16GB, and would be different for other GPUs. In order
to measure error, nvGraph PageRank is taken as a reference.
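A thread-per-vertex pull kernel can be sketched like this, with each thread handling one vertex's in-edge list in CSR (voff/vto/outdeg and the kernel name are illustrative):

    // Thread-per-vertex PageRank (pull, CSR): one thread accumulates one vertex.
    __global__ void pagerankThreadKernel(float* a, const float* r, const int* voff,
                                         const int* vto, const int* outdeg,
                                         float c0, float p, int N) {
      int stride = gridDim.x * blockDim.x;
      for (int v = blockIdx.x * blockDim.x + threadIdx.x; v < N; v += stride) {
        float s = 0;
        for (int e = voff[v]; e < voff[v + 1]; ++e)
          s += r[vto[e]] / outdeg[vto[e]];
        a[v] = c0 + p * s;
      }
    }

    // e.g. pagerankThreadKernel<<<8192, 128>>>(...) for the 8192x128 config above.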
For this experiment, sorting of vertices and/or edges was either NO, ASC, or DESC. This
gives a total of 3 * 3 = 9 cases. Each case is run on multiple graphs, running each 5 times
per graph for good time measure.
Results show that sorting is slower in most cases, possibly because the sorted
arrangement concentrates too many requests on certain memory regions. In order to
measure error, nvGraph PageRank is taken as a reference.
This experiment was for finding a suitable launch config for CUDA block-per-vertex
PageRank. For the launch config, the block-size (threads) was adjusted from 32-1024, and
the grid-limit (max grid-size) was adjusted from 1024-32768. Each config was run 5 times
per graph to get a good time measure.
MAXx64 appears to be a good config for most graphs. Here MAX is the grid-limit, and 64 is
the block-size. This launch config is for the entire graph, and could be slightly different for a
subset of graphs. Also note that this applies to Tesla V100 PCIe 16GB, and could be
different for other GPUs. In order to measure error, nvGraph PageRank is taken as a
reference.
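The block-per-vertex variant can be sketched as all threads of a block cooperating on one vertex's in-edge list, followed by a shared-memory reduce (names illustrative, as before):

    // Block-per-vertex PageRank (pull, CSR): one block accumulates one vertex.
    // Launch with blockDim.x * sizeof(float) bytes of dynamic shared memory.
    __global__ void pagerankBlockKernel(float* a, const float* r, const int* voff,
                                        const int* vto, const int* outdeg,
                                        float c0, float p, int N) {
      extern __shared__ float cache[];
      int t = threadIdx.x;
      for (int v = blockIdx.x; v < N; v += gridDim.x) {
        float s = 0;
        for (int e = voff[v] + t; e < voff[v + 1]; e += blockDim.x)
          s += r[vto[e]] / outdeg[vto[e]];
        cache[t] = s;
        __syncthreads();
        for (int k = blockDim.x / 2; k > 0; k /= 2) {
          if (t < k) cache[t] += cache[t + k];
          __syncthreads();
        }
        if (t == 0) a[v] = c0 + p * cache[0];
        __syncthreads();  // ensure cache can be reused for the next vertex
      }
    }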
For this experiment, sorting of vertices and/or edges was either NO, ASC, or DESC. This
gives a total of 3 * 3 = 9 cases. Each case is run on multiple graphs, running each 5 times
per graph for good time measure.
Results show that sorting in most cases is not faster. In fact, in a number of cases, sorting
actually slows down performance, possibly because sorted arrangements concentrate too
many requests on certain memory regions. In order to measure error,
nvGraph PageRank is taken as a reference.
This experiment was for finding a suitable launch config for CUDA switched-per-vertex
PageRank for thread approach. For the launch config, the block-size (threads) was
adjusted from 32-1024, and the grid-limit (max grid-size) was adjusted from 1024-32768.
Each config was run 5 times per graph to get a good time measure.
MAXx512 appears to be a good config for most graphs. Here MAX is the grid-limit, and 512
is the block-size. Note that this applies to Tesla V100 PCIe 16GB, and would be different for
other GPUs. In order to measure error, nvGraph PageRank is taken as a reference.
This experiment was for finding a suitable launch config for CUDA switched-per-vertex
PageRank for block approach. For the launch config, the block-size (threads) was
adjusted from 32-1024, and the grid-limit (max grid-size) was adjusted from 1024-32768.
Each config was run 5 times per graph to get a good time measure.
MAXx256 appears to be a good config for most graphs. Here MAX is the grid-limit, and 256
is the block-size. Note that this applies to Tesla V100 PCIe 16GB, and would be different for
other GPUs. In order to measure error, nvGraph PageRank is taken as a reference.
For this experiment, sorting of vertices and/or edges was either NO, ASC, or DESC. This
gives a total of 3 * 3 = 9 cases. NO here means that vertices are partitioned by in-degree
(edges remain unchanged). Each case is run on multiple graphs, running each 5 times per
graph for good time measure.
Results show that sorting in most cases is not faster. It's better to simply partition vertices by
degree. In order to measure error, nvGraph PageRank is taken as a reference.
For this experiment, the objective is to find a good switch point for CUDA
switched-per-vertex PageRank. To assess this, switch_degree was varied from 2 - 1024,
and switch_limit was varied from 1 - 1024. switch_degree defines the in-degree at which
PageRank kernel switches from thread-per-vertex approach to block-per-vertex. switch_limit
defines the minimum block size for thread-per-vertex / block-per-vertex approach (if a block
size is too small, it is merged with the other approach block). Each case is run on multiple
graphs, running each 5 times per graph for good time measure.
It seems switch_degree of 64 and switch_limit of 32 would be a good choice.
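The switch point itself can be sketched as a partition of the vertex ids by in-degree: ids below switch_degree go to the thread-per-vertex kernel, the rest to the block-per-vertex kernel. The function name is an assumption, and the switch_limit merging of too-small partitions is omitted:

    // Partition vertex ids at switch_degree (illustrative sketch).
    #include <algorithm>
    #include <vector>

    int partitionBySwitchDegree(std::vector<int>& ids, const std::vector<int>& indeg,
                                int switchDegree) {
      auto mid = std::partition(ids.begin(), ids.end(),
          [&](int v) { return indeg[v] < switchDegree; });
      // ids[0..k) -> thread-per-vertex kernel; ids[k..) -> block-per-vertex kernel
      return (int)(mid - ids.begin());
    }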
Table 6.7.4: Adjusting Monolithic CUDA approach
Map         | launch        |                  |
Reduce      | memcpy launch | in-place launch  | vs
Thread /V   | launch        | sort/p. vertices | sort edges
Block /V    | launch        | sort/p. vertices | sort edges
Switched /V | thread launch | block launch     | switch-point
1. Comparing various launch configs for CUDA based vector multiply.
2. Comparing various launch configs for CUDA based vector element sum (memcpy).
3. Comparing various launch configs for CUDA based vector element sum (in-place).
4. Performance of memcpy vs in-place based CUDA based vector element sum.
5. Comparing various launch configs for CUDA thread-per-vertex based PageRank (pull, CSR).
6. Sorting vertices and/or edges by in-degree for CUDA thread-per-vertex based PageRank.
7. Comparing various launch configs for CUDA block-per-vertex based PageRank (pull, CSR).
8. Sorting vertices and/or edges by in-degree for CUDA block-per-vertex based PageRank.
9. Launch configs for CUDA switched-per-vertex based PageRank focusing on thread approach.
10. Launch configs for CUDA switched-per-vertex based PageRank focusing on block approach.
11. Sorting vertices and/or edges by in-degree for CUDA switched-per-vertex based PageRank.
12. Comparing various switch points for CUDA switched-per-vertex based PageRank (pull, ...).
Note: sort/p. vertices ⇒ sorting vertices by ascending or descending order of in-degree, or simply
partitioning (by in-degree). sort edges ⇒ sorting edges by ascending or descending order of id.
This experiment was for checking the benefit of splitting vertices of the graph by
components. This was done by comparing performance between: PageRank without
optimization, PageRank with vertices split by components, PageRank with components
sorted in topological order. Each approach was attempted on a number of graphs, running
each approach 5 times to get a good time measure.
On a few graphs, splitting vertices by components provides a speedup, but sorting
components in topological order provides no additional speedup. For road networks like
germany_osm, which have only one component, the speedup is possibly because of the
vertex reordering caused by the dfs() that is required for splitting by components. However, on
average there is no speedup.
This experiment was for checking the benefit of skipping rank calculation of in-identical
vertices. This optimization, and the control approach, was attempted on a number of
graphs, running each approach 5 times to get a good time measure.
On indochina-2004 graph, skipping in-identicals provides a speedup of ~1.3, but on average
provides no speedup for other graphs. This could be because indochina-2004 has a large
number of in-identicals and in-identical groups, although it has neither the highest
in-identicals % nor the highest avg. in-identical group size, so the exact cause is unclear.
This experiment was for checking the benefit of skipping rank calculation of chain
vertices. This optimization, and the control approach, was attempted on a number of
graphs, running each approach 5 times to get a good time measure.
On average, skipping chain vertices provides no speedup. A chain here means a set of
unidirectional links connecting one vertex to the next, without any additional edges.
Bi-directional links are not considered as chains. Note that most graphs don't have enough
chains to provide an advantage. Road networks do have chains, but they are bi-directional,
and thus not considered here.
This experiment was for checking the benefit of skipping converged vertices. This was
done by comparing performance between: PageRank without optimization, PageRank
skipping converged vertices with re-check (in 2-16 turns), PageRank skipping converged
vertices after several turns (in 2-64 turns). Each approach was attempted on a number of
graphs, running each approach 5 times to get a good time measure. Skip with re-check
(skip-check) is done every 2-16 turns. Skip after turns (skip-after) is done after 2-64 turns.
On average, neither skip-check, nor skip-after gives better speed than the default
(unoptimized) approach. This could be due to the unnecessary iterations added by
skip-check (mistakenly skipped), and increased memory accesses performed by skip-after
(tracking converged count).
Table 6.7.5: Adjusting Monolithic CUDA optimizations (from STICD)
Split components | Skip in-identicals | Skip chains | Skip converged
1. Performance benefit of CUDA based PageRank with vertices split by components.
2. Performance benefit of skipping in-identical vertices for CUDA based PageRank (pull, CSR).
3. Performance benefit of skipping chain vertices for CUDA based PageRank (pull, CSR).
4. Performance benefit of skipping converged vertices for CUDA based PageRank (pull, CSR).
This experiment is for comparing the performance between: Monolithic dynamic
PageRank, Monolithic static PageRank, nvGraph dynamic PageRank, and nvGraph static
PageRank. This is done with both fixed, and temporal graphs. For temporal graphs, updating
of each graph is done in multiple batch sizes (1, 5, 10, 50, ...). New edges are incrementally
added to the graph batch-by-batch until the entire graph is complete. For fixed graphs, each
batch size was run with 5 different updates to the graph, and each specific update was run 5
times for each approach to get a good time measure.
On average, Monolithic dynamic PageRank is faster than the static approach.
Table 6.7.6: Comparing dynamic CUDA approach with static
                  | nvGraph dynamic     | Monolithic dynamic
nvGraph static    | vs: fixed, temporal | vs: fixed, temporal
Monolithic static | vs: fixed, temporal | vs: fixed, temporal
1. Performance of static vs dynamic CUDA based PageRank (fixed).
2. Performance of static vs dynamic CUDA based PageRank (temporal).
3. Performance of CUDA based static vs dynamic levelwise PageRank (fixed).
4. Performance of static vs dynamic CUDA based levelwise PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
This experiment is for comparing the performance between: Monolithic dynamic PageRank,
Monolithic static PageRank, nvGraph dynamic PageRank, and nvGraph static PageRank.
Here, unaffected vertices are skipped from PageRank computation. This is done with
fixed graphs. Each batch size was run with 5 different updates to the graph, and each
specific update was run 5 times for each approach to get a good time measure.
On average, Monolithic dynamic PageRank is faster than the static approach.
Table 6.7.7: Comparing dynamic optimized CUDA approach with static
                  | nvGraph dynamic | Monolithic dynamic
nvGraph static    | vs: fixed       | vs: fixed
Monolithic static | vs: fixed       | vs: fixed
1. Performance of CUDA based optimized dynamic monolithic vs levelwise PageRank (fixed).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
6.8 Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD)
Techniques to optimize the PageRank algorithm usually fall in two categories. One is to try
reducing the work per iteration, and the other is to try reducing the number of iterations.
These goals are often at odds with one another. Skipping computation on vertices which
have already converged has the potential to save iteration time. Skipping in-identical
vertices, with the same in-links, helps reduce duplicate computations and thus could help
reduce iteration time. Road networks often have chains which can be short-circuited before
PageRank computation to improve performance. Final ranks of chain nodes can be easily
calculated. This could reduce both the iteration time, and the number of iterations. If a graph
has no dangling nodes, PageRank of each strongly connected component can be
computed in topological order. This could help reduce the iteration time, no. of iterations,
and also enable multi-iteration concurrency in PageRank computation. The combination of
all of the above methods is the STICD algorithm [17]. For dynamic graphs, unchanged
components whose ranks are unaffected can be skipped altogether.
This experiment was for comparing performance between levelwise PageRank with various
min. compute size, ranging from 1 - 1E+7. Here, min. compute size is the minimum number
of nodes in each PageRank compute using the standard (monolithic) algorithm. Each min.
compute size was attempted on different types of graphs, running each size 5 times per
graph to get a good time measure. Levelwise PageRank is the STIC-D algorithm, without
ICD optimizations (using single-thread). Although there is no clear winner, it appears a min.
compute size of 10 would be a good choice. Note that the levelwise approach does not
make use of SIMD instructions which are available on all modern hardware.
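The min. compute size can be sketched as grouping consecutive topologically-ordered components until each group reaches the threshold, with each group then solved monolithically (names illustrative):

    // Group topologically-ordered components to at least minCompute vertices each.
    #include <vector>

    std::vector<std::vector<int>> groupComponents(
        const std::vector<std::vector<int>>& comps, size_t minCompute) {
      std::vector<std::vector<int>> groups;
      std::vector<int> cur;
      for (const auto& c : comps) {
        cur.insert(cur.end(), c.begin(), c.end());
        if (cur.size() >= minCompute) { groups.push_back(std::move(cur)); cur.clear(); }
      }
      if (!cur.empty()) groups.push_back(std::move(cur));  // leftover tail group
      return groups;
    }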
This experiment was for comparing performance between: monolithic PageRank, monolithic
PageRank skipping teleport, levelwise PageRank, levelwise PageRank skipping teleport.
Each approach was attempted on different types of graphs, running each approach 5 times
per graph to get a good time measure. Levelwise PageRank is the STIC-D algorithm, without
ICD optimizations (using single-thread).
Except for soc-LiveJournal1 and coPapersCiteseer, skipping teleport calculations is slightly
faster in all cases (the two exceptions could be measurement noise). The improvement is
most prominent for road networks and certain web graphs.
Table 6.8.3: Adjusting Levelwise (STICD) approach
Min. component size | Min. compute size | Skip teleport calculation
1. Comparing various min. component sizes for topologically-ordered components (levelwise...).
2. Comparing various min. compute sizes for topologically-ordered components (levelwise...).
3. Checking performance benefit of levelwise PageRank when teleport calculation is skipped.
Note: min. component size merges small components even before generating the block-graph /
topological ordering, while min. compute size does it just before PageRank computation.
This experiment was for comparing performance between: PageRank with standard
algorithm (monolithic), PageRank in topologically-ordered components fashion (levelwise).
Both approaches were attempted on different types of graphs, running each approach 5
times per graph to get a good time measure. Levelwise PageRank is the STIC-D algorithm,
without ICD optimizations (using single-thread).
On average, levelwise PageRank is faster than the monolithic approach. Note that neither
approach makes use of SIMD instructions which are available on all modern hardware.
Table 6.8.4: Comparing Levelwise (STICD) approach
                  | Monolithic | nvGraph
Levelwise (STICD) | vs         |
1. Performance of monolithic vs topologically-ordered components (levelwise) PageRank.
This experiment was for comparing performance between: static levelwise PageRank,
dynamic levelwise PageRank (process all components), dynamic levelwise PageRank
skipping unchanged components. Each approach was attempted on a number of graphs
(fixed and temporal), running each with multiple batch sizes (1, 5, 10, 50, ...). Levelwise
PageRank is the STIC-D algorithm, without ICD optimizations (using single-thread).
On average, skipping unchanged components is barely faster than not skipping.
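Skipping unchanged components can be sketched as a reachability pass over the block-graph: a component is affected if it contains a changed vertex or lies downstream of one (names illustrative):

    // Mark affected components in the block-graph (illustrative sketch).
    #include <queue>
    #include <vector>

    std::vector<bool> affectedComponents(const std::vector<std::vector<int>>& blockAdj,
                                         const std::vector<int>& changed) {
      std::vector<bool> aff(blockAdj.size(), false);
      std::queue<int> q;
      for (int c : changed) { aff[c] = true; q.push(c); }
      while (!q.empty()) {
        int c = q.front(); q.pop();
        for (int d : blockAdj[c])       // ranks propagate along cross-edges
          if (!aff[d]) { aff[d] = true; q.push(d); }
      }
      return aff;                       // unaffected components can be skipped
    }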
Table 6.8.5: Adjusting Levelwise (STICD) dynamic approach
Skip unaffected components | For fixed graphs | For temporal graphs
1. Checking for correctness of levelwise PageRank when unchanged components are skipped.
2. Perf. benefit of levelwise PageRank when unchanged components are skipped (fixed).
3. Perf. benefit of levelwise PageRank when unchanged components are skipped (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
This experiment was for comparing performance between: static PageRank using standard
algorithm (monolithic), static PageRank using levelwise algorithm, dynamic PageRank using
levelwise algorithm. Each approach was attempted on a number of graphs, running each
with multiple batch sizes (1, 5, 10, 50, ...). Each PageRank computation was run 5 times for
both approaches to get a good time measure. Levelwise PageRank is the STIC-D algorithm,
without ICD optimizations (using single-thread).
Clearly, dynamic levelwise PageRank is faster than the static approach for many batch
sizes.
Table 6.8.6: Comparing dynamic approach with static
                  | nvGraph dynamic | Monolithic dynamic  | Levelwise dynamic
nvGraph static    | vs: temporal    |                     |
Monolithic static |                 | vs: fixed, temporal | vs: fixed, temporal
Levelwise static  |                 | vs: fixed           | vs: fixed, temporal
1. Performance of nvGraph based static vs dynamic PageRank (temporal).
2. Performance of static vs dynamic PageRank (temporal).
3. Performance of static vs dynamic levelwise PageRank (fixed).
4. Performance of levelwise based static vs dynamic PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
This experiment was for comparing performance between levelwise CUDA PageRank with
various min. compute size, ranging from 1E+3 - 1E+7. Here, min. compute size is the
minimum number of nodes in each PageRank compute using the standard algorithm (monolithic
CUDA). Each min. compute size was attempted on different types of graphs, running each
size 5 times per graph to get a good time measure. Levelwise PageRank is the STIC-D
algorithm, without ICD optimizations (using single-thread).
Although there is no clear winner, it appears a min. compute size of 5E+6 would be a good
choice.
Table 6.8.7: Adjusting Levelwise (STICD) CUDA approach
Min. component size | Min. compute size | Skip teleport calculation
1. Min. component sizes for topologically-ordered components (levelwise, CUDA) PageRank.
2. Min. compute sizes for topologically-ordered components (levelwise CUDA) PageRank.
Note: min. component size merges small components even before generating the block-graph /
topological ordering, while min. compute size does it just before PageRank computation.
This experiment was for comparing performance between: CUDA based PageRank with
standard algorithm (monolithic), CUDA based PageRank in topologically-ordered
components fashion (levelwise). Both approaches were attempted on different types of
graphs, running each approach 5 times per graph to get a good time measure. Levelwise
PageRank is the STIC-D algorithm, without ICD optimizations (using single-thread).
On average, levelwise PageRank performs about the same as the monolithic approach.
Table 6.8.8: Comparing Levelwise (STICD) CUDA approach
                | nvGraph | Monolithic CUDA
Monolithic      | vs      | vs
Monolithic CUDA | vs      |
Levelwise CUDA  | vs      | vs
1. Performance of sequential execution based vs CUDA based PageRank (pull, CSR).
2. Performance of nvGraph vs CUDA based PageRank (pull, CSR).
3. Performance of Monolithic CUDA vs Levelwise CUDA PageRank (pull, CSR, ...).
This experiment was for comparing the performance between: static PageRank of updated
graph, dynamic PageRank of updated graph. Both techniques were attempted on different
temporal graphs, updating each graph with multiple batch sizes (1, 5, 10, 50, ...). New edges
are incrementally added to the graph batch-by-batch until the entire graph is complete.
Dynamic PageRank is clearly faster than the static approach for many batch sizes.
Table 6.8.9: Comparing dynamic CUDA approach with static
                  | nvGraph dynamic     | Monolithic dynamic  | Levelwise dynamic
nvGraph static    | vs: fixed, temporal | vs: fixed, temporal | vs: fixed, temporal
Monolithic static | vs: fixed, temporal | vs: fixed, temporal | vs: fixed, temporal
Levelwise static  | vs: fixed, temporal | vs: fixed, temporal | vs: fixed, temporal
1. Performance of static vs dynamic CUDA based PageRank (fixed).
2. Performance of static vs dynamic CUDA based PageRank (temporal).
3. Performance of CUDA based static vs dynamic levelwise PageRank (fixed).
4. Performance of static vs dynamic CUDA based levelwise PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
This experiment was for comparing performance between: static PageRank of updated
graph using nvGraph, dynamic PageRank of updated graph using nvGraph, static monolithic
CUDA based PageRank of updated graph, dynamic monolithic CUDA based PageRank of
updated graph, static levelwise CUDA based PageRank of updated graph, dynamic
levelwise CUDA based PageRank of updated graph. Each approach was attempted on a
number of graphs, running each with multiple batch sizes (1, 5, 10, 50, ...). Each batch size
was run with 5 different updates to the graph, and each specific update was run 5 times for
each approach to get a good time measure. Levelwise PageRank is the STIC-D algorithm,
without ICD optimizations.
Indeed, dynamic levelwise PageRank is faster than the static approach for many batch
sizes. In order to measure error, nvGraph PageRank is taken as a reference.
Table 6.8.10: Comparing dynamic optimized CUDA approach with static
                  | nvGraph dynamic | Monolithic dynamic | Levelwise dynamic
nvGraph static    | vs: fixed       | vs: fixed          | vs: fixed
Monolithic static | vs: fixed       | vs: fixed          | vs: fixed
Levelwise static  | vs: fixed       | vs: fixed          | vs: fixed
1. Performance of CUDA based optimized dynamic monolithic vs levelwise PageRank (fixed).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
7. Packages
1. CLI for SNAP dataset, which is a collection of more than 50 large networks.
This is for quickly fetching SNAP datasets right from the CLI. Currently there is only one
command, clone, which accepts filters for specifying exactly which datasets you need and
where to download them. If a dataset already exists, it is skipped. A summary is shown at
the end. You can install this with npm install -g snap-data.sh.
2. CLI for nvGraph, which is a GPU-based graph analytics library written by NVIDIA,
using CUDA.
This is for running nvGraph functions right from the CLI with graphs in MatrixMarket format
(.mtx) directly. It just needs an x86_64 Linux machine with NVIDIA GPU drivers installed.
Execution time, along with the results, can be saved in a JSON/YAML file. The executable code
is written in C++. You can install this with npm install -g nvgraph.sh.
8. Further action
List dynamic graph algorithms
List dynamic graph data structures
List graph processing frameworks
List graph applications
Package graph processing frameworks
9. Bibliography
[1] E. W. Weisstein, “Königsberg Bridge Problem.,” MathWorld--A Wolfram Web
Resource. https://mathworld.wolfram.com/KoenigsbergBridgeProblem.html (accessed
Jul. 23, 2021).
[2] M. A. F. Richter, “Infographic: How Many Websites Are There?,” Statista Infographics,
Oct. 2019.
[3] A. Langville and C. Meyer, “Deeper Inside PageRank,” Internet Math., vol. 1, no. 3, pp.
335–380, Jan. 2004, doi: 10.1080/15427951.2004.10129091.
[4] R. Meusel, “The graph structure in the web – analyzed on different aggregation levels,”
JWS, vol. 1, no. 1, pp. 33–47, Aug. 2015, doi: 10.1561/106.00000003.
[5] M. Besta, M. Fischer, V. Kalavri, M. Kapralov, and T. Hoefler, “Practice of Streaming
and Dynamic Graphs: Concepts, Models, Systems, and Parallelism,” CoRR, vol.
abs/1912.12740, 2019.
[6] M. Besta, M. Fischer, V. Kalavri, M. Kapralov, and T. Hoefler, “Practice of Streaming
Processing of Dynamic Graphs: Concepts, Models, and Systems,” 2021.
[7] Ong Kok Chien, Poo Kuan Hoong, and Chiung Ching Ho, “A comparative study of
HITS vs PageRank algorithms for Twitter users analysis,” in 2014 International
Conference on Computational Science and Technology (ICCST), Aug. 2014, pp. 1–6,
doi: 10.1109/ICCST.2014.7045007.
[8] Q. Zhang and T. Yuan, “Analysis of China’s Urban Network Structure from the
Perspective of ‘Streaming,’” in 2018 26th International Conference on Geoinformatics,
Jun. 2018, pp. 1–7, doi: 10.1109/GEOINFORMATICS.2018.8557078.
[9] Y.-Y. Kim, H.-A. Kim, C.-H. Shin, K.-H. Lee, C.-H. Choi, and W.-S. Cho, “Analysis on
the transportation point in cheongju city using pagerank algorithm,” in Proceedings of
the 2015 International Conference on Big Data Applications and Services - BigDAS
’15, New York, New York, USA, Oct. 2015, pp. 165–169, doi:
10.1145/2837060.2837087.
[10] I. M. Kloumann, J. Ugander, and J. Kleinberg, “Block models and personalized
PageRank.,” Proc Natl Acad Sci USA, vol. 114, no. 1, pp. 33–38, Jan. 2017, doi:
10.1073/pnas.1611275114.
[11] B. Zhang, Y. Wang, Q. Jin, and J. Ma, “A Pagerank-Inspired Heuristic Scheme for
Influence Maximization in Social Networks,” International Journal of Web Services
Research, vol. 12, no. 4, pp. 48–62, Oct. 2015, doi: 10.4018/IJWSR.2015100104.
[12] S. Chaudhari, A. Azaria, and T. Mitchell, “An entity graph based Recommender
System,” AIC, vol. 30, no. 2, pp. 141–149, May 2017, doi: 10.3233/AIC-170728.
[13] Contributors to Wikimedia projects, “PageRank,” Wikipedia, Jul. 2021.
https://en.wikipedia.org/wiki/PageRank (accessed Mar. 01, 2021).
[14] J. Leskovec, “PageRank Algorithm, Mining massive Datasets (CS246), Stanford
University,” YouTube, 2019.
[15] J. F. Jardine, “PageRanks-Example,” Nov. 2007.
[16] H. Dubey, N. Khare, K. K. Appu Kuttan, and S. Bhatia, “Improved parallel pagerank
algorithm for spam filtering,” Indian J. Sci. Technol., vol. 9, no. 38, Oct. 2016, doi:
10.17485/ijst/2016/v9i38/90410.
[17] P. Garg and K. Kothapalli, “STIC-D: Algorithmic techniques for efficient parallel
pagerank computation on real-world graphs,” in Proceedings of the 17th International
Conference on Distributed Computing and Networking - ICDCN ’16, New York, New
York, USA, Jan. 2016, pp. 1–10, doi: 10.1145/2833312.2833322.
[18] D. Frey, Distributed Computing and Networking, 2nd ed. Berlin: Springer Nature, 2013,
p. 366.
[19] B. Bahmani, K. Chakrabarti, and D. Xin, “Fast personalized PageRank on
MapReduce,” in Proceedings of the 2011 international conference on Management of
data - SIGMOD ’11, New York, New York, USA, Jun. 2011, p. 973, doi:
10.1145/1989323.1989425.
[20] S. Lai, B. Shao, Y. Xu, and X. Lin, “Parallel computations of local PageRank problem
based on Graphics Processing Unit,” Concurrency Computat.: Pract. Exper., vol. 29,
no. 24, p. e4245, Aug. 2017, doi: 10.1002/cpe.4245.
[21] S. Hunold, Euro-Par 2015: Parallel Processing Workshops: Euro-Par 2015
International Workshops, Vienna, Austria, August 24-25, 2015, Revised Selected
Papers (Lecture Notes in Computer Science Book 9523), 1st ed. 2015. Cham:
Springer, 2015, p. 882.
[22] K. Lakhotia, R. Kannan, and V. Prasanna, “Accelerating PageRank using
Partition-Centric Processing,” in 2018 USENIX Annual Technical Conference (USENIX
ATC ’18), Boston, MA, Jul. 2018, pp. 427–440.
[23] X. Wang, L. Huang, Y. Zhu, Y. Zhou, H. Peng, and H. Xiong, “Addressing memory wall
problem of graph computation in reconfigurable system,” in 2015 IEEE 17th
International Conference on High Performance Computing and Communications, 2015
A fuzzy clustering algorithm for high dimensional streaming data
 

More from Subhajit Sahu

DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTESDyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTESSubhajit Sahu
 
Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)Subhajit Sahu
 
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTESA Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTESSubhajit Sahu
 
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTESScalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTESSubhajit Sahu
 
Application Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTESApplication Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTESSubhajit Sahu
 
Community Detection on the GPU : NOTES
Community Detection on the GPU : NOTESCommunity Detection on the GPU : NOTES
Community Detection on the GPU : NOTESSubhajit Sahu
 
Survey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTESSurvey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTESSubhajit Sahu
 
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTERDynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTERSubhajit Sahu
 
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Subhajit Sahu
 
Fast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTESFast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTESSubhajit Sahu
 
Can you fix farming by going back 8000 years : NOTES
Can you fix farming by going back 8000 years : NOTESCan you fix farming by going back 8000 years : NOTES
Can you fix farming by going back 8000 years : NOTESSubhajit Sahu
 
HITS algorithm : NOTES
HITS algorithm : NOTESHITS algorithm : NOTES
HITS algorithm : NOTESSubhajit Sahu
 
Basic Computer Architecture and the Case for GPUs : NOTES
Basic Computer Architecture and the Case for GPUs : NOTESBasic Computer Architecture and the Case for GPUs : NOTES
Basic Computer Architecture and the Case for GPUs : NOTESSubhajit Sahu
 
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDESDynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDESSubhajit Sahu
 
Are Satellites Covered in Gold Foil : NOTES
Are Satellites Covered in Gold Foil : NOTESAre Satellites Covered in Gold Foil : NOTES
Are Satellites Covered in Gold Foil : NOTESSubhajit Sahu
 
Taxation for Traders < Markets and Taxation : NOTES
Taxation for Traders < Markets and Taxation : NOTESTaxation for Traders < Markets and Taxation : NOTES
Taxation for Traders < Markets and Taxation : NOTESSubhajit Sahu
 
A Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESA Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESSubhajit Sahu
 
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...Subhajit Sahu
 
Income Tax Calender 2021 (ITD) : NOTES
Income Tax Calender 2021 (ITD) : NOTESIncome Tax Calender 2021 (ITD) : NOTES
Income Tax Calender 2021 (ITD) : NOTESSubhajit Sahu
 
Youngistaan Foundation: Annual Report 2020-21 : NOTES
Youngistaan Foundation: Annual Report 2020-21 : NOTESYoungistaan Foundation: Annual Report 2020-21 : NOTES
Youngistaan Foundation: Annual Report 2020-21 : NOTESSubhajit Sahu
 

More from Subhajit Sahu (20)

DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTESDyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
 
Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)
 
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTESA Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
 
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTESScalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
 
Application Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTESApplication Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTES
 
Community Detection on the GPU : NOTES
Community Detection on the GPU : NOTESCommunity Detection on the GPU : NOTES
Community Detection on the GPU : NOTES
 
Survey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTESSurvey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTES
 
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTERDynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
 
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
 
Fast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTESFast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTES
 
Can you fix farming by going back 8000 years : NOTES
Can you fix farming by going back 8000 years : NOTESCan you fix farming by going back 8000 years : NOTES
Can you fix farming by going back 8000 years : NOTES
 
HITS algorithm : NOTES
HITS algorithm : NOTESHITS algorithm : NOTES
HITS algorithm : NOTES
 
Basic Computer Architecture and the Case for GPUs : NOTES
Basic Computer Architecture and the Case for GPUs : NOTESBasic Computer Architecture and the Case for GPUs : NOTES
Basic Computer Architecture and the Case for GPUs : NOTES
 
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDESDynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
 
Are Satellites Covered in Gold Foil : NOTES
Are Satellites Covered in Gold Foil : NOTESAre Satellites Covered in Gold Foil : NOTES
Are Satellites Covered in Gold Foil : NOTES
 
Taxation for Traders < Markets and Taxation : NOTES
Taxation for Traders < Markets and Taxation : NOTESTaxation for Traders < Markets and Taxation : NOTES
Taxation for Traders < Markets and Taxation : NOTES
 
A Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESA Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTES
 
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
 
Income Tax Calender 2021 (ITD) : NOTES
Income Tax Calender 2021 (ITD) : NOTESIncome Tax Calender 2021 (ITD) : NOTES
Income Tax Calender 2021 (ITD) : NOTES
 
Youngistaan Foundation: Annual Report 2020-21 : NOTES
Youngistaan Foundation: Annual Report 2020-21 : NOTESYoungistaan Foundation: Annual Report 2020-21 : NOTES
Youngistaan Foundation: Annual Report 2020-21 : NOTES
 

Recently uploaded

SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 

Recently uploaded (20)

SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 

Exploring optimizations for dynamic PageRank algorithm based on GPU : V4

  • 1. Exploring optimizations for dynamic PageRank algorithm based on GPU Subhajit Sahu Advisor: Kishore Kothapalli Center for Security, Theory, and Algorithmic Research (CSTAR) International Institute of Information Technology, Hyderabad (IIITH) Gachibowli, Hyderabad, India - 500 032 subhajit.sahu@research.iiit.ac.in 1. Introduction The Königsberg bridge problem, which was posed and answered in the negative by Euler in 1736 represents the beginning of graph theory [1]. Graph is a generic data structure and is a superset of lists, and trees. Binary search on sorted lists can be interpreted as a balanced binary tree search. Database tables can be thought of as indexed lists, and table joins represent relations between columns. This can be modeled as graphs instead. Assignment of registers to variables (by compiler), and assignment of available channels to a radio transmitter and also graph problems. Finding shortest path between two points, and sorting web pages in order of importance are also graphs problems. Neural networks are graphs too. Interaction between messenger molecules in the body, and interaction between people on social media, also modeled as graphs. Figure 1.1: Number of websites online from 1992 to 2019 [2]. The web has a bowtie structure on many levels, as shown in figure 1.2. There is usually one giant strongly connected component, with several pages pointing into this component, several pages pointed to by the component, and a number of disconnected pages. This structure is seen as a fractal on many different levels [3]. 1
  • 2. Figure 1.2: Web’s bow tie structure on different aggregation levels [4]. Static graphs are those which do not change with time. Static graph algorithms are techniques used to solve such a graph problem (developed since the 1940s). To solve larger and larger problems, a number of optimizations (both algorithmic and hardware/software techniques) have been developed to take advantage of vector-processors (like Cray), multicores, and GPUs. A lot of research had to be done in order to find ways to enhance concurrency. The techniques include a number of concurrency models, locking techniques, transactions, etc. This is especially due to a lack of single-core performance improvements. Graphs where relations vary with time, are called temporal graphs. As you might guess, many problems use temporal graphs. These temporal graphs can be thought of as a series of static graphs at different points in time. In order to solve graph problems with these temporal graphs, people would normally take the graph at a certain point in time, and run the necessary static graph algorithm on it. This worked out fine, and as the size of the temporal graph grows, this repeated computation becomes increasingly slower. It is possible to take advantage of previous results, in order to compute the result for the next time point. Such algorithms are called dynamic graph algorithms. This is an ongoing area of research, which includes new algorithms, hardware/software optimization techniques for distributed systems, multicores (shared memory), GPUS, and even FPGAs. Optimization of algorithms can focus on space complexity (memory usage), time complexity (query time), preprocessing time, and even accuracy of result. While dynamic algorithms only focus on optimizing the algorithm’s computation time, dynamic graph data structures focus on improving graph update time, and memory usage. 2
Dense graphs are usually represented by an adjacency matrix (bit matrix). Sparse graphs can be represented with variations of adjacency lists (like CSR) and edge lists. Sparse graphs can also be thought of as sparse matrices, and the edges of a vertex can be considered a bitset. In fact, a number of graph algorithms can be modeled as linear algebra operations (see the nvGraph and cuGraph frameworks). A number of dynamic graph data structures have also been developed to improve update speed (like PMA), or to enable concurrent updates and computation (like Aspen’s compressed functional trees) [5]. These data formats are illustrated in figure 1.3.

Figure 1.3: Illustration of fundamental graph representations (Adjacency Matrix, Adjacency List, Edge List, CSR) [6].

Streaming / dynamic / time-evolving graph data structures maintain only the latest graph information. Historical graphs, on the other hand, keep track of all previous states of the graph. Changes to the graph can be thought of as edge insertions and deletions, which are usually done in batches. Except for functional techniques, updating a graph usually involves modifying a shared structure using some kind of fine-grained synchronization. It may also be possible to store additional information along with vertices/edges, though this is usually not the focus of research (graph databases do focus on it). In the last decade or so, a number of graph streaming frameworks have been developed, each with a certain focus area and targeting a certain platform (distributed system / multiprocessor / GPU / FPGA / ASIC). Such frameworks focus on designing an improved dynamic graph data structure, and define a fundamental model of computation. For GPUs, the following frameworks exist: cuSTINGER, aimGraph, faimGraph, Hornet, EvoGraph, and GPMA [5].
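As a concrete reference for the CSR layout used throughout this report, here is a minimal sketch in C++; the structure and names (GraphCSR, forEachNeighbour) are illustrative assumptions, not code from any of the frameworks above.

```cpp
#include <vector>

// Minimal sketch of CSR: an offsets array of size N+1 and a concatenated
// destination-indices array of size M, so 32-bit vertex-ids cost roughly
// 4(N + M) bytes overall.
struct GraphCSR {
  std::vector<int> offsets;  // offsets[v]..offsets[v+1] bound v's edge list
  std::vector<int> edges;    // destination vertex ids, grouped by source
};

// Visit the out-neighbours of vertex v.
template <class F>
void forEachNeighbour(const GraphCSR& g, int v, F visit) {
  for (int i = g.offsets[v]; i < g.offsets[v + 1]; ++i)
    visit(g.edges[i]);
}
```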
2. PageRank algorithm

The PageRank algorithm is a technique used to sort web pages (or vertices of a graph) by importance. It is popularly known as the algorithm published by the founders of Google. Other link analysis algorithms include HITS [7], TrustRank, and HummingBird. Such algorithms are also used for word sense disambiguation in lexical semantics, urban planning [8], ranking streets by traffic [9], identifying communities [10] and measuring their impact on the web, maximizing influence [11], providing recommendations [12], analysing neural/protein networks, determining species essential for the health of the environment, and even quantifying the scientific impact of researchers [13].

In order to understand the PageRank algorithm, consider the random (web) surfer model. Each web page is modeled as a vertex, and each hyperlink as an edge. The surfer (such as you) initially visits a web page at random, then follows one of the links on the page, leading to another web page. After following some links, the surfer eventually decides to visit another web page (at random). The probability of the random surfer being on a certain page is what the PageRank algorithm returns. This probability (or importance) of a web page depends upon the importance of the web pages pointing to it (a Markov chain). This definition of PageRank is recursive, and takes the form of an eigenvalue problem. Solving for PageRank thus requires multiple iterations of computation, known as the power-iteration method, where each iteration is essentially a (sparse) matrix-vector multiplication. A damping factor (of 0.85) is used to counter the effect of spider-traps (like self-loops), which can otherwise suck up all importance. Dead-ends (web pages with no out-links) are countered by effectively linking them to all vertices of the graph (making the Markov matrix column-stochastic), as they would otherwise leak out importance [14]. See figure 2.1 for an example.

Figure 2.1: Example of web pages with hyperlinks and respective PageRanks [15].
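A minimal sketch of the power-iteration method described above, over an in-edge CSR (the pull formulation used in the experiments later in this report); the function name pagerankSketch, the argument names, and the optional initial rank vector are illustrative assumptions, not the author's implementation.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Pull-based power iteration: r[v] = c0 + p * sum(r[u]/deg(u)) over the
// in-neighbours u of v, where c0 folds in the teleport and dangling-vertex
// contributions. Iterates until the L1 change drops below tol.
std::vector<double> pagerankSketch(
    const std::vector<int>& offsets,  // in-edge CSR offsets, size N+1
    const std::vector<int>& inEdges,  // concatenated in-neighbour ids
    const std::vector<int>& outDeg,   // out-degree of each vertex
    double p = 0.85, double tol = 1e-6, int maxIter = 500,
    std::vector<double> r = {}) {
  int N = (int)offsets.size() - 1;
  if (r.empty()) r.assign(N, 1.0 / N);  // uniform initial ranks
  std::vector<double> rn(N);
  for (int it = 0; it < maxIter; ++it) {
    double dangling = 0;  // rank held by vertices with no out-links
    for (int v = 0; v < N; ++v)
      if (outDeg[v] == 0) dangling += r[v];
    double c0 = (1 - p) / N + p * dangling / N;  // common teleport term
    double err = 0;                              // L1 norm of the change
    for (int v = 0; v < N; ++v) {
      double sum = 0;
      for (int i = offsets[v]; i < offsets[v + 1]; ++i)
        sum += r[inEdges[i]] / outDeg[inEdges[i]];
      rn[v] = c0 + p * sum;
      err += std::fabs(rn[v] - r[v]);
    }
    std::swap(r, rn);
    if (err < tol) break;
  }
  return r;
}
```

The optional initial rank vector lets a later run resume from previous ranks, which is the basis of the stepped-damping sketch in section 6.4.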
Note that, as originally conceived, the PageRank model does not factor a web browser’s back button into a surfer’s hyperlinking possibilities. Surfers in one class, if teleporting, may be much more likely to jump to pages about sports, while surfers in another class may be much more likely to jump to pages pertaining to news and current events. Such differing teleportation tendencies can be captured in different personalization vectors. However, this makes the once query-independent, user-independent PageRankings user-dependent and more calculation-laden. Nevertheless, this little personalization vector has had significant side effects: along with a non-uniform/weighted version of PageRank [16], it can help control the spamming done by so-called link farms [3].

PageRank algorithms almost always take the following parameters: damping, tolerance, and maximum iterations. Here, tolerance defines the acceptable error between the previous and the current iteration. Though this is usually the L1-norm, the L2 and L∞-norms are also sometimes used. Both damping and tolerance control the rate of convergence of the algorithm, and the choice of tolerance function also affects the rate of convergence. However, adjusting damping can give completely different PageRank values. Since the ordering of vertices is what matters, and not the exact values, it can usually be a good idea to choose a larger tolerance value.
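As a sketch of the tolerance functions mentioned above, the three error norms between consecutive rank vectors a and b can be computed as follows; the function names are illustrative, and equal-length vectors are assumed.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// L1: sum of absolute differences (the usual choice).
double errorL1(const std::vector<double>& a, const std::vector<double>& b) {
  double s = 0;
  for (size_t i = 0; i < a.size(); ++i) s += std::fabs(a[i] - b[i]);
  return s;
}

// L2: Euclidean distance between the two rank vectors.
double errorL2(const std::vector<double>& a, const std::vector<double>& b) {
  double s = 0;
  for (size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
  return std::sqrt(s);
}

// L-infinity: the largest single-vertex change.
double errorLi(const std::vector<double>& a, const std::vector<double>& b) {
  double m = 0;
  for (size_t i = 0; i < a.size(); ++i) m = std::max(m, std::fabs(a[i] - b[i]));
  return m;
}
```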
3. Optimizing PageRank

Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. The adaptive PageRank technique “locks” vertices which have converged, and saves iteration time by skipping their computation [3]. Identical nodes, which have the same in-links, can be removed to avoid duplicate computations and thus reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This reduces both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This helps reduce the iteration time and the number of iterations, and also enables concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm (see figure 3.1) [17].

A somewhat similar aggregation algorithm is BlockRank, which computes the PageRank of hosts and the local PageRank of pages within hosts independently, and aggregates them with weights for the final rank vector. The global PageRank solution can also be found in a computationally efficient manner by computing the sub-PageRank of each connected component, then pasting the sub-PageRanks together to form the global PageRank, using the method of Avrachenkov et al. These methods exploit the inherent reducibility in the graph. Bianchini et al. suggest using the Jacobi method to compute the PageRank vector [3]. Monte Carlo based PageRank methods consider several random walks on the input graph to obtain approximate PageRanks. Optimizations of this approach exist for distributed PageRank computation (especially for undirected graphs) [18], along with a map-reduce algorithm for personalized PageRank [19], and a reordering strategy (to reduce space and compute complexity on the GPU) for local PageRank [20].

Figure 3.1: STIC-D: Algorithmic optimizations for PageRank [17].

Iteration time can be reduced further by noting that the traditional algorithm is not compute-bound, and generates fine-granularity random accesses (it exhibits irregular parallelism). This causes poor memory bandwidth and compute utilization, and the extent of this is quite dependent upon the graph structure [21], [22]. Four strategies for neighbour iteration have been attempted, to help reason about the expected impact of a graph’s structure on the performance of each strategy [21]. CPUs/GPUs are generally optimized to load memory in blocks (cache lines in CPUs, coalesced memory reads in GPUs), and not for fine-grained accesses. Being able to adjust this behaviour depending upon the application (PageRank) can lead to performance improvements. Techniques like prefetching to SRAM with a high-performance shuffle network [23], indirect memory prefetchers (of the form A[B[i]]), partial cache-line accessing mechanisms [24], adjusting data layout [22] (for sequential DRAM access [25]), and branch avoidance mechanisms (with partitioning) [22] have been used.

Large graphs can be partitioned or decomposed into subgraphs to help reduce cross-partition data access, which helps both in distributed and in shared-memory systems (by reducing random accesses). Techniques like chunk partitioning [26], cache/propagation blocking [27], partition-centric processing with the gather-apply-scatter model [22], edge-centric scatter-gather with non-overlapping vertex sets [28], exploiting node-score sparsity [29], and even personalized-PageRank-based partitioning [30] have been used. Graph/subgraph compression can also help reduce memory bottlenecks [26] [31], and enable processing of larger graphs in memory. A number of techniques can be used to compress adjacency lists, such as delta encoding of edge/neighbour ids [32], and referring to sets of edges in edge lists [33] [34] (though reference vertices are hard to find) [3]. Since the rank vector (possibly even including certain additional page-importance estimates) must reside entirely in main memory, a few compression techniques have been attempted for it as well. These include lossy encoding schemes based on scalar quantization, seeking to minimize the distortion of search-result rankings [35] [3], and custom half-precision floating-point formats [36].

As new software/hardware platforms appear on the horizon, researchers have been eager to test the performance of PageRank on them. This is because each platform offers its own unique architecture and engineering choices, and also because PageRank often serves as a good benchmark for the capability of the platform to handle various other graph algorithms. Attempts have been made on distributed frameworks like Hadoop [37], and even on RDBMS [38]. A number of implementations have been done on standard multicores [38], Cell BE [39] [28], AMD GPUs [40], NVIDIA/CUDA GPUs [41] [28] [42], GPU clusters [26], FPGAs [43] [23] [31], CPU-FPGA hybrids [44] [45] [29], and even on SpMV ASICs [46].
The PageRank algorithm is a live algorithm, which means that an ongoing computation can be paused during a graph update and simply resumed afterwards (instead of being restarted). The first updating paper, by Chien et al. (2002), identifies a small portion of the web graph “near” the link changes and models the rest of the web as a single node in a new, much smaller graph; a PageRank is computed for this small graph, and the results are transferred to the much bigger, original graph [3].

4. Graph streaming frameworks / databases

STINGER [47] uses an extended form of CSR, with edge lists represented as linked lists of contiguous blocks. Each edge has 2 timestamps, and fine-grained locking is used per edge. cuSTINGER extends STINGER for CUDA GPUs, and uses contiguous edge lists instead (CSR). faimGraph [48] is a GPU framework with fully dynamic vertex and edge updates. It has an in-GPU memory manager, and uses a paged linked list for edges, similar to STINGER. Hornet [49] also implements its own memory manager, and uses B+ trees to maintain blocks efficiently and keep track of empty space. LLAMA uses a variant of CSR with large multi-versioned arrays. It stores all snapshots of a graph, and persists old snapshots to disk. GraphIn uses CSR along with edge lists, and updates the CSR once the edge lists grow large enough. GraphOne [50] is similar, and uses page-aligned memory for high-degree vertices. GraphTau is based on Apache Spark and uses read-only partitioned collections of data sets. It uses a sliding-window model for graph snapshots. Aspen [51] uses a C-tree (tree of trees) based on purely functional compressed search trees to store graph structures. Elements are stored in chunks and compressed using difference encoding. It allows any number of readers and a single writer, and the framework guarantees strict serializability. Tegra stores the full history of the graph and relies on recomputing graph algorithms on affected subgraphs. It also uses a cost model to guess when full recomputation might be better, and uses an adaptive radix tree as its core data structure for efficient updates and range scans [5].

Unlike graph streaming frameworks, graph databases focus on rich attached data, complex queries, transactional support with ACID properties, data replication, and sharding. A few graph databases have started to support global analytics as well. However, most graph databases do not offer dedicated support for incremental changes. Little research exists into accelerating streaming graph processing using low-cost atomics, hardware transactions, FPGAs, or high-performance networking hardware. On average, the highest rate of ingestion is achieved by shared-memory single-node designs [5]. An overview of the graph frameworks is shown in figure 4.1.
Figure 4.1: Overview of the domains and concepts in the practice and theory of streaming and dynamic graph processing and algorithms [6].

5. NVIDIA Tesla V100 GPU Architecture

NVIDIA Tesla was a line of products targeted at stream processing / general-purpose graphics processing units (GPGPUs). In May 2020, NVIDIA retired the Tesla brand because of potential confusion with the brand of cars; its newer GPUs are branded NVIDIA Data Center GPUs, as in the Ampere A100 GPU [52]. The NVIDIA Tesla GV100 (Volta) is a 21.1 billion transistor chip fabricated on the TSMC 12 nm FinFET process, with a die size of 815 mm². Here is a short summary of its features:
● 84 SMs, each with 64 independent FP and INT cores.
● Shared memory size configurable up to 96 KB per SM.
● 4 512-bit memory controllers (4096-bit total).
● Up to 6 bidirectional NVLink links, at 25 GB/s per direction (for IBM POWER9 CPUs).
● 4 dies per HBM stack, with 4 stacks: 16 GB with 900 GB/s HBM2 (Samsung).
● Native/sideband SECDED (1-correct, 2-detect) ECC (for HBM, register files, L1, L2).

Each SM has 4 processing blocks (each handling 1 warp of 32 threads). The L1 data cache is combined with shared memory, at 128 KB per SM (explicit caching is not as necessary anymore). Volta also supports write-caching (not just load-caching, as in previous architectures). NVLink supports coherency, allowing data read from GPU memory to be stored in the CPU cache. Address Translation Service (ATS) allows the GPU to access CPU page tables directly (so a malloc'd pointer can be used). The new copy engine does not need pinned memory. Volta's per-thread program counter and call stack allow interleaved execution of warp threads (see figure 5.1), enabling fine-grained synchronization between threads within a warp (use __syncwarp()). Cooperative groups enable synchronization between warps, grid-wide, across multiple GPUs, cross-warp, and sub-warp [53].

Figure 5.1: Programs use explicit synchronization to reconverge threads in a warp [53].
6. Experiments

6.1 Adjusting CSR format for graph

Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is commonly used for efficient graph computations. However, given N vertices, M edges, and a 32-bit / 4-byte vertex-id, it occupies a space of 4(N + M) bytes. Note however that a 32-bit unsigned integer is limited to just 4 billion ids, and thus massive graphs would need to use a 64-bit / 8-byte vertex-id, which further raises the occupied space to 8(N + M) bytes. For example, a graph with N = 10 million vertices and M = 200 million edges occupies about 840 MB with 32-bit ids, and about 1.68 GB with 64-bit ids. Since large memories are difficult to make and tend to be slower than smaller ones [?], it makes sense to try to reduce this space requirement.

Hybrid CSR is a graph representation that combines the ideas behind the adjacency list and the adjacency matrix [bfs-seema], with its edge lists being similar to roaring bitmaps [lemire]. Unlike CSR, which stores a list of indices of destination vertices for each vertex, hybrid CSR uses smaller indices, each combined with a dense bitset. This allows it to represent dense regions of a graph in a compact form.

An experiment was conducted to assess the size needed for graph representation with various possible hybrid CSR formats, by adjusting the size of the dense bitset (block), and hence the index-bits. Both 32-bit and 64-bit hybrid CSR are studied, and compared with 32-bit regular CSR. A 32-bit regular CSR is represented using a uint32_t data type, and uses all 32 bits for the vertex index (index-bits). It can support graphs with a maximum of 2^32 vertices (or simply, a 32-bit vertex-id). A 32-bit hybrid CSR is also represented using a uint32_t data type, where the lower b bits store the dense bitset (block), and the upper i = 32-b bits store the index-bits. It supports an effective vertex-id of i+log2(b) = 32-b+log2(b) bits. For this experiment, the block size b is adjusted from 4 to 16 bits. Similarly, a 64-bit hybrid CSR is represented using a uint64_t data type, where the lower b bits store the dense bitset (block) and the upper i = 64-b bits store the index-bits. Hence, the effective vertex-id supported is i+log2(b) = 64-b+log2(b) bits. For this experiment, the block size b is adjusted from 4 to 32 bits.

For a given vertex-id v, the index-bits are defined as v >> b, the block-bits are defined as 1 << (v & ones(b)), and thus the hybrid CSR entry is (index-bits << b) | block-bits. Finding an edge-id in an edge list involves scanning all entries with matching index-bits, and once matched, checking if the appropriate block-bit is set (for both hybrid CSRs). Since lowering the number of index-bits reduces the maximum possible order of graph representable by the format, the effective bits usable for the vertex-id of each hybrid CSR variation is listed for reference.
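To make the bit layout concrete, here is a minimal sketch of a 32-bit hybrid CSR edge list under the scheme just described; HybridEdgeList, addEdge, and hasEdge are illustrative names, not the code used in the experiment.

```cpp
#include <cstdint>
#include <vector>

// Lowest b bits of an id select a bit within the dense block;
// the remaining upper bits are the index.
constexpr uint32_t ones(int b) { return (1u << b) - 1u; }

struct HybridEdgeList {
  int b;                          // block size in bits (4..16 here)
  std::vector<uint32_t> entries;  // upper 32-b bits: index; lower b bits: block

  void addEdge(uint32_t v) {
    uint32_t index = v >> b, bit = 1u << (v & ones(b));
    for (uint32_t& e : entries)
      if ((e >> b) == index) { e |= bit; return; }  // dense hit: set block-bit
    entries.push_back((index << b) | bit);          // start a new block entry
  }

  bool hasEdge(uint32_t v) const {
    uint32_t index = v >> b, bit = 1u << (v & ones(b));
    for (uint32_t e : entries)
      if ((e >> b) == index) return (e & bit) != 0;
    return false;
  }
};
```

Note how neighbours that fall within the same b-bit block share a single entry, which is the source of the space reduction measured below.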
For this experiment, the edge-ids of each graph are first loaded into a 32-bit array-of-arrays structure, and then converted to the desired CSR formats. All graphs used are stored in the MatrixMarket (.mtx) file format, and obtained from the SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740 rack server with two Intel Xeon Silver 4116 CPUs @ 2.10 GHz, 128 GB DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (8x16 GB) DRAM, running CentOS Linux release 7.9.2009 (Core). Statistics of each test case are printed to standard output (stdout), and redirected to a log file, which is then processed with a script to generate a CSV file, with each row representing the details of a single test case. This CSV file is imported into Google Sheets, and the necessary tables are set up with the help of the FILTER function to create the charts.

It is observed that for a given n-bit hybrid CSR, using the highest possible block size (taking into account effective index-bits) results in the smallest space usage. The 32-bit hybrid CSR with a 16-bit block achieves a maximum space usage (bytes) reduction of ~5x, but is unable to represent all the graphs under test (it has a 20-bit effective vertex-id). With an 8-bit block, the space usage is reduced by ~3x-3.5x for coPapersCiteseer, coPapersDBLP, and indochina-2004. The 64-bit hybrid CSR with a 32-bit block achieves a maximum space usage reduction of ~3.5x, but generally does not perform well. However, for massive graphs which cannot be represented with a 32-bit vertex-id, it is likely to provide a significant reduction in space usage. This can be gauged by comparing the number of destination-indices needed for each CSR variant, where it achieves a maximum destination-indices reduction of ~7x. This reduction is likely to be higher for graphs partitioned by hosts / heuristics / clustering algorithms, which is usually necessary for massive graphs deployed in a distributed setting. This could be assessed in a future study.

Table 6.1.1: List of variations of CSR attempted, followed by the list of programs including results & figures.

  block size   | regular 32-bit | hybrid 32-bit          | hybrid 64-bit
  single bit   | 32-bit index   | -                      | -
  4-bit block  | -              | 28-bit index (30 eff.) | 60-bit index (62 eff.)
  8-bit block  | -              | 24-bit index (27 eff.) | 56-bit index (59 eff.)
  16-bit block | -              | 16-bit index (20 eff.) | 48-bit index (52 eff.)
  32-bit block | -              | -                      | 32-bit index (37 eff.)

1. Comparing space usage of regular vs hybrid CSR (various sizes).
Figure 6.1.1: Space usage (bytes) reduction ratio of each format. For graphs that cannot be represented with the given format, it is set to 0.

Figure 6.1.2: Destination-indices (total number of edge values) reduction ratio of each format. For graphs that cannot be represented with the given format, it is set to 0.
6.2 Adjusting Bitset for graph

Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is commonly used for efficient graph computations. Unfortunately, using CSR for dynamic graphs is impractical, since the addition/deletion of a single edge can require on average (N+M)/2 memory accesses in order to update the source-offsets and destination-indices. A common approach is therefore to store the edge-lists/destination-indices as an array of arrays, where each edge list is an array belonging to a vertex. While this is good enough for small graphs, it quickly becomes a bottleneck for large graphs. What causes this bottleneck depends on whether the edge lists are sorted or unsorted. If they are sorted, checking for an edge requires about log(E) memory accesses, but adding an edge requires on average E/2 accesses, where E is the number of edges of the given vertex. Note that both addition and deletion of edges in a dynamic graph require checking for an existing edge before adding or deleting it. If the edge lists are unsorted, checking for an edge requires around E/2 memory accesses, but adding an edge requires only 1 memory access.

An experiment was conducted in an attempt to find a suitable data structure for representing a bitset, which can be used to represent the edge lists of a graph. The data structures under test include single-buffer ones like the unsorted bitset and the sorted bitset; a single-buffer partitioned (by integers) one, the partially-sorted bitset; and multi-buffer ones like the small-vector optimization bitset (unsorted), and the 16-bit subrange bitset (todo). A sketch of the sorted variant is given after this discussion.

An unsorted bitset consists of a vector (in C++) that stores all the edge ids in the order they arrive. Edge lookup consists of a simple linear search. Edge addition is a simple push-back (after lookup). Edge deletion is a vector-delete, which requires all edge-ids after it to be moved back (after lookup). A sorted bitset maintains edge ids in ascending order. Edge lookup consists of a binary search. Edge addition is a vector-insert, which requires all edge-ids after it to be shifted one step ahead. Edge deletion is a vector-delete, just like in the unsorted bitset.

A partially-sorted bitset tries to amortize the cost of sorting edge-ids by keeping the recently added edges unsorted at the end (up to a limit), and maintains the old edges as sorted. Edge lookup consists of a binary search in the sorted partition, followed by a linear search in the unsorted partition, or the other way around. Edge addition is usually a simple push-back and an update of the partition size. However, if the unsorted partition grows beyond a certain limit, it is merged with the sorted partition in one of the following ways: sort both partitions as a whole; merge the partitions using an in-place merge; merge the partitions using extra space for the sorted partition; or merge the partitions using extra space for the unsorted partition (this requires a merge from the back end). Edge deletion checks to see if the edge can be brought into the unsorted partition (within the limit). If so, it simply swaps it out with the last unsorted edge id (and updates the partition size). However, if it cannot be brought into the unsorted partition, a vector-delete is performed (again, updating the partition size).

A small-vector optimization bitset (unsorted) makes use of an additional fixed-size buffer (this size is adjusted to different values) to store edge-ids until this buffer overflows, when all edge-ids are moved to a dynamic (heap-allocated) vector.
Edge lookups, additions, and deletions are similar to those of an unsorted bitset, except that the count of edge-ids in the fixed-size buffer must be tracked, and the choice between the buffer and the dynamic vector must be made with each operation.
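Here is a minimal sketch of the sorted bitset variant, showing the ~log(E) lookup and the shifting insert/delete costs discussed above; the names are illustrative, not the original code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sorted bitset: edge ids kept in ascending order.
struct SortedBitset {
  std::vector<uint32_t> ids;

  bool has(uint32_t v) const {  // binary search: ~log(E) accesses
    return std::binary_search(ids.begin(), ids.end(), v);
  }
  void add(uint32_t v) {        // vector-insert: shifts later ids ahead
    auto it = std::lower_bound(ids.begin(), ids.end(), v);
    if (it == ids.end() || *it != v) ids.insert(it, v);
  }
  void remove(uint32_t v) {     // vector-delete: shifts later ids back
    auto it = std::lower_bound(ids.begin(), ids.end(), v);
    if (it != ids.end() && *it == v) ids.erase(it);
  }
};
```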
All variants of the data structures were tested with real-world temporal graphs. These are stored in a plain text file in “u, v, t” format, where u is the source vertex, v is the destination vertex, and t is the UNIX epoch time in seconds. All of them are obtained from the Stanford Large Network Dataset Collection. The experiment is implemented in C++, and compiled using GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740 rack server with two Intel Xeon Silver 4116 CPUs @ 2.10 GHz, 128 GB DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (8x16 GB) DRAM, running CentOS Linux release 7.9.2009 (Core). The execution time of each test case is measured using std::chrono::high_resolution_clock; this is done 5 times for each test case, and the timings are averaged (see the sketch after the list below). Statistics of each test case are printed to standard output (stdout), and redirected to a log file, which is then processed with a script to generate a CSV file, with each row representing the details of a single test case. This CSV file is imported into Google Sheets, and the necessary tables are set up with the help of the FILTER function to create the charts. Similar charts are combined into a single GIF (to help with interpretation of the results).

From the results, it appears that the transpose of a graph based on the sorted bitset is clearly faster than with the unsorted bitset. However, with reading graph edges there is no clear winner (sometimes sorted is faster, especially for large graphs, and sometimes unsorted). It may be that when new edges have many duplicates, fewer inserts are needed, and hence the sorted version is faster (since the sorted bitset has slow inserts). The transpose of a graph based on the fully-sorted bitset is clearly faster than with the partially-sorted bitset. This is possibly because partially-sorted-bitset-based graphs cause more cache misses due to random accesses (while reversing edges). However, with reading graph edges there is again no clear winner (sometimes partially-sorted is faster, especially for large graphs, and sometimes fully-sorted). For the small-vector optimization bitset, on average, a buffer size of 4 seems to give a small improvement. Any further increase in buffer size slows down performance. This is possibly because of the unnecessarily large contiguous memory allocation needed by the buffer, and a low cache-hit percentage due to widely separated edge data (caused by the static buffer). In fact, the program even crashes when the 26 graph instances with varying buffer sizes cannot all be held in memory. Hence, small-vector optimization is not so useful, at least when used for graphs.

Table 6.2.1: List of data structures for the bitset attempted, followed by the list of programs including results & figures.

  single-buffer | single-buffer partitioned | multi-buffer
  unsorted      | partially-sorted (vs)     | small-vector (optimization)
  sorted        |                           | subrange-16bit

1. Testing the effectiveness of sorted vs unsorted list of integers for BitSet.
2. Comparing various unsorted sizes for partially sorted BitSet.
3. Performance of fully sorted vs partially sorted BitSet (inplace-s128).
4. Comparing various buffer sizes for BitSet with small vector optimization.
5. Comparing various switch points for 16-bit subrange based BitSet.
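The timing harness referred to above can be sketched as follows; averageMs and runTestCase are illustrative names for the measurement wrapper and the operation under test, not the original code.

```cpp
#include <chrono>

// Run a test case `repeat` times and return the average elapsed milliseconds.
template <class F>
double averageMs(F runTestCase, int repeat = 5) {
  using clock = std::chrono::high_resolution_clock;
  double total = 0;
  for (int i = 0; i < repeat; ++i) {
    auto start = clock::now();
    runTestCase();  // the operation being measured, e.g. a graph transpose
    auto stop = clock::now();
    total += std::chrono::duration<double, std::milli>(stop - start).count();
  }
  return total / repeat;
}
```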
6.3 Adjusting data types for rank vector

When PageRank is computed in a distributed setting for massive graphs, it is necessary to communicate the ranks of a subgraph computed on one machine to the other machines over a network. Depending upon the algorithm, this message passing either needs to be done every iteration [?], or after the subgraph has converged [sticd]. Minimizing this data transfer can help improve the performance of the PageRank algorithm. One approach is to compress the data using existing compression algorithms; depending upon the achievable compression ratio, and the time required to compress and decompress the data, this might be viable. Another approach is to use smaller, lower-precision data types that can be directly used in the computation (or converted to a floating-point number on the fly), without requiring any separate compression or decompression step.

An experiment was conducted to assess the ability of BFloat16 to be used in place of Float32 as a storage type (converted to Float32 on the fly, during computation). BFloat16 is a 2-byte lower-precision data type specially developed for use in machine learning, and is available in recent GPUs. It is, quite simply, the upper 16 bits of the IEEE 754 single-precision floating-point format (Float32). Conversion to and from BFloat16 is done using bit-shift operators and reinterpret_cast. To make BFloat16 trivially replaceable with Float32 in the PageRank algorithm, it is implemented as a class with appropriate constructors (default, copy) and operator overloads (typecast, assignment).
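A minimal sketch of such a BFloat16 class follows; it matches the description above, though the details are assumptions rather than the author's exact code. The text mentions reinterpret_cast; this sketch uses memcpy, which performs the same bit moves without aliasing issues.

```cpp
#include <cstdint>
#include <cstring>

// BFloat16 storage type: keeps the upper 16 bits of an IEEE 754 Float32.
class BFloat16 {
  uint16_t bits = 0;  // sign (1) | exponent (8) | mantissa (7)
public:
  BFloat16() = default;
  BFloat16(float f) {  // truncate: drop the lower 16 mantissa bits
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    bits = uint16_t(u >> 16);
  }
  BFloat16& operator=(float f) { return *this = BFloat16(f); }
  operator float() const {  // widen: the lower 16 bits become zero
    uint32_t u = uint32_t(bits) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
  }
};
```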
This CSV file is imported into Google Sheets, and necessary tables are set up with the help of the FILTER function to create the charts.
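A minimal sketch of such a BFloat16 class follows (an assumption of how it might look, not the exact code used; memcpy is shown as the strictly portable equivalent of the reinterpret_cast mentioned above): truncation keeps the upper 16 bits of the Float32 bit pattern, and widening shifts them back.

#include <cstdint>
#include <cstring>

// Minimal BFloat16: stores the upper 16 bits of an IEEE 754 Float32.
class BFloat16 {
  uint16_t bits = 0;
public:
  BFloat16() = default;
  BFloat16(float x) {                     // Float32 -> BFloat16 (truncate)
    uint32_t u; std::memcpy(&u, &x, 4);
    bits = uint16_t(u >> 16);             // keep upper 16 bits
  }
  operator float() const {                // BFloat16 -> Float32 (widen)
    uint32_t u = uint32_t(bits) << 16;
    float x; std::memcpy(&x, &u, 4);
    return x;
  }
};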
• 16. It is observed that the error associated with using BFloat16 as the storage type is too high, making it unsuitable for the PageRank algorithm. Future work may explore using BFloat16 only during the message passing steps (after a full iteration, or after convergence of a subgraph), or attempt other custom data types suitable for PageRank (possibly non-byte-aligned).
Table 6.3.1: List of rank data types attempted, followed by programs incl. results & figures.
Custom: fp16, bfloat16
float, double
1. Performance of vector element sum using float vs bfloat16 as the storage type.
2. Comparison of PageRank using float vs bfloat16 as the storage type (pull, CSR).
3. Performance of PageRank using 32-bit floats vs 64-bit floats (pull, CSR).
• 17. 6.4 Adjusting PageRank parameters
Adjusting the damping factor of the PageRank algorithm can have a significant effect on its convergence rate (as mentioned in the literature), both in terms of time and iterations. For this experiment, the damping factor d (which is usually 0.85) is varied from 0.50 to 1.00 in steps of 0.05, in order to compare the performance with each damping factor. The calculated error is the L1 norm with respect to default PageRank (d=0.85). The PageRank algorithm used here is the standard power-iteration (pull) based PageRank. The rank of a vertex in an iteration is calculated as c0 + pΣrn/dn, where c0 is the common teleport contribution, p is the damping factor, rn is the previous rank of a vertex with an incoming edge, dn is the out-degree of that incoming-edge vertex, and N is the total number of vertices in the graph. The common teleport contribution c0, calculated as (1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex in the graph due to the damping factor, (1-p)/N, and the teleport from dangling vertices (with no outgoing edges) in the graph, pΣrn/N. This is because a random surfer jumps to a random page upon visiting a page with no links, in order to avoid the rank-sink effect.
All graphs used in this experiment are stored in the MatrixMarket (.mtx) file format, and obtained from the SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740 Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux release 7.9.2009 (Core). The execution time of each test case is measured using std::chrono::high_resolution_clock. This is done 5 times for each test case, and timings are averaged. Statistics of each test case are printed to standard output (stdout), and redirected to a log file, which is then processed with a script to generate a CSV file, with each row representing the details of a single test case. This CSV file is imported into Google Sheets, and necessary tables are set up with the help of the FILTER function to create the charts.
As expected, increasing the damping factor beyond 0.85 significantly increases convergence time, and lowering it below 0.85 decreases convergence time. Note that a higher damping factor implies that a random surfer follows links with higher probability (and jumps to a random page with lower probability). Also note that 500 is the maximum number of iterations allowed here.
CHARTS HERE
Observing that adjusting the damping factor has a significant effect, another experiment was performed: adjusting the damping factor (alpha) in steps. Start with a small alpha, change it once PageRank has converged, and repeat until the final desired value of alpha is reached. For example, start initially with alpha = 0.5, let PageRank converge quickly, and then switch to alpha = 0.85 and run PageRank until it converges. Using a single step like this seems as though it might help reduce iterations; unfortunately, it does not. Using multiple steps tends to yield an even higher iteration count.
CHARTS HERE
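For reference, a minimal single-threaded sketch of the pull-based rank update used in these experiments (the CSR arrays and names here are illustrative assumptions): c0 collects the teleport and dangling-vertex contributions, and each vertex then pulls p·r/d from its in-neighbours.

#include <vector>
using std::vector;

// One power-iteration step (pull): a[v] = c0 + p * sum(r[u]/outdeg[u]) over in-edges u->v.
// offs/srcs form the CSR of the transpose: offs[v]..offs[v+1] index the in-neighbours of v.
void pagerankStep(vector<double>& a, const vector<double>& r,
                  const vector<int>& offs, const vector<int>& srcs,
                  const vector<int>& outdeg, double p) {
  int N = int(r.size());
  double c0 = (1 - p) / N;                      // teleport due to damping
  for (int u = 0; u < N; ++u)
    if (outdeg[u] == 0) c0 += p * r[u] / N;     // dangling vertices teleport everywhere
  for (int v = 0; v < N; ++v) {
    double s = 0;
    for (int i = offs[v]; i < offs[v+1]; ++i) { // pull from in-neighbours
      int u = srcs[i];
      s += r[u] / outdeg[u];
    }
    a[v] = c0 + p * s;                          // one write per destination vertex
  }
}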
• 18. Similar to the damping factor, adjusting the tolerance value has a significant effect as well. Apart from that, it is observed that different implementations use different error functions for the convergence check. Although the L1 norm is commonly used, it appears nvGraph uses the L2 norm instead, and a Stack Overflow answer suggests a per-vertex tolerance comparison, which is essentially the L∞ norm. This experiment compares the performance of the L1, L2, and L∞ norms for various tolerance values. Each approach was attempted on a number of graphs, varying the tolerance from 10^-0 to 10^-10 for each error function. Results show that the L∞ norm is the fastest convergence check on all graphs. For road networks, which have a large number of vertices, using the L∞ norm is orders of magnitude faster. Even for fairly small tolerance values, the ranks converge in just one iteration; this is possibly because each per-vertex rank update is already smaller than 10^-6. Also note that the L2 norm is initially faster than the L1 norm, but quickly slows down relative to the L1 norm for most graphs; however, it is always faster for road networks.
CHARTS HERE
Table 6.4.1: List of parameter adjustments attempted, followed by programs incl. results & figures.
Damping factor: adjust, dynamic-adjust
Tolerance: L1 norm, L2 norm, L∞ norm
1. Comparing the effect of using different values of damping factor, with PageRank (pull, CSR).
2. Experimenting PageRank improvement by adjusting damping factor (α) between iterations.
3. Comparing the effect of using different functions for convergence check, with PageRank (...).
4. Comparing the effect of using different values of tolerance, with PageRank (pull, CSR).
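Minimal sketches of the three error functions compared above (function names are illustrative): iteration stops once the chosen error between successive rank vectors drops below the tolerance.

#include <algorithm>
#include <cmath>
#include <vector>
using std::vector;

// L1 norm: sum of absolute differences.
double errorL1(const vector<double>& x, const vector<double>& y) {
  double e = 0;
  for (size_t i = 0; i < x.size(); ++i) e += std::fabs(x[i] - y[i]);
  return e;
}

// L2 norm: square root of the sum of squared differences.
double errorL2(const vector<double>& x, const vector<double>& y) {
  double e = 0;
  for (size_t i = 0; i < x.size(); ++i) e += (x[i] - y[i]) * (x[i] - y[i]);
  return std::sqrt(e);
}

// L-infinity norm: maximum per-vertex difference (the per-vertex tolerance check).
double errorLi(const vector<double>& x, const vector<double>& y) {
  double e = 0;
  for (size_t i = 0; i < x.size(); ++i) e = std::max(e, std::fabs(x[i] - y[i]));
  return e;
}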
• 19. 6.5 Adjusting ranks for dynamic graphs
When a graph is updated, there are a number of strategies to set up the initial rank vector for computing the PageRanks of the updated graph, using the ranks from the old graph. One approach is to zero-fill the ranks of new vertices. Another is to use 1/N for the new vertices. Yet another is to scale the ranks of existing vertices and use 1/N for the new vertices. An experiment is conducted with each technique on different temporal graphs, updating each graph with multiple batch sizes. For each batch size, static PageRank as well as the three dynamic rank-adjustment methods are tested. All rank-adjustment strategies are performed using a common adjustment function that adds a value to the old ranks, multiplies them by a value, and sets a value for the new ranks (a sketch of this function follows the table below).
The PageRank algorithm used is the standard power-iteration (pull) based PageRank, which optionally accepts initial ranks. The rank of a vertex in an iteration is calculated as c0 + pΣrn/dn, where c0 is the common teleport contribution, p is the damping factor (0.85), rn is the previous rank of a vertex with an incoming edge, dn is the out-degree of that incoming-edge vertex, and N is the total number of vertices in the graph. The common teleport contribution c0, calculated as (1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex in the graph due to the damping factor, (1-p)/N, and the teleport from dangling vertices (with no outgoing edges) in the graph, pΣrn/N. This is because a random surfer jumps to a random page upon visiting a page with no links, in order to avoid the rank-sink effect.
All (temporal) graphs used in this experiment are stored in a plain text file in “u, v, t” format, where u is the source vertex, v is the destination vertex, and t is the UNIX epoch time in seconds. All of them are obtained from the Stanford Large Network Dataset Collection. If initial ranks are not provided, they are set to 1/N. The error check is done using the L1 norm against static PageRank (without initial ranks). The experiment is implemented in C++, and compiled using GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740 Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux release 7.9.2009 (Core). The execution time of each test case is measured using std::chrono::high_resolution_clock. This is done 5 times for each test case, and timings are averaged. Statistics of each test case are printed to standard output (stdout), and redirected to a log file, which is then processed with a script to generate a CSV file, with each row representing the details of a single test case. This CSV file is imported into Google Sheets, and necessary tables are set up with the help of the FILTER function to create the charts.
Each rank-adjustment method (for dynamic PageRank) can take a different number of iterations to converge. The third approach, which scales existing ranks and uses 1/N for new vertices, seems to perform best. It is also seen that as the batch size increases, the convergence iterations (time) of dynamic PageRank increase; in some cases dynamic PageRank even becomes slower than static PageRank.
Table 6.5.1: List of rank adjustment strategies attempted, followed by programs incl. results & figures.
• 20. update new: zero fill, 1/N fill
update old & new: scale + 1/N fill
1. Comparing strategies to update ranks for dynamic PageRank (pull, CSR).
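As referenced above, a minimal sketch of the common adjustment function (the signature is an assumption): old ranks get a value added and are then scaled, and new vertices get a fixed value. In the third strategy, scaling old ranks by oldN/newN together with 1/newN for each new vertex keeps the ranks summing to 1.

#include <vector>
using std::vector;

// Common adjustment: a[v] = (r[v] + radd) * rmul for old vertices,
// a[v] = rset for new vertices. a must already have newN elements.
void adjustRanks(vector<double>& a, const vector<double>& r,
                 int oldN, int newN, double radd, double rmul, double rset) {
  for (int v = 0; v < oldN; ++v) a[v] = (r[v] + radd) * rmul;
  for (int v = oldN; v < newN; ++v) a[v] = rset;
}

// Strategy 1 (zero-fill new):     adjustRanks(a, r, oldN, newN, 0, 1, 0);
// Strategy 2 (1/N for new):       adjustRanks(a, r, oldN, newN, 0, 1, 1.0/newN);
// Strategy 3 (scale, 1/N for new):
//   adjustRanks(a, r, oldN, newN, 0, double(oldN)/newN, 1.0/newN);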
• 21. 6.6 Adjusting OpenMP PageRank
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches: uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). The hybrid approach, on the other hand, runs certain primitives in sequential mode (namely sumAt and multiply).
Before starting an OpenMP implementation, a good sequential PageRank implementation needs to be set up. There are two ways (algorithmically) to think of the PageRank calculation. One approach (push) is to find PageRank by pushing contributions to out-vertices. The push method is somewhat easier to implement, and is described in this lecture. With this approach, in each iteration, for each vertex, the ranks of the vertices on its outgoing edges are accumulated with p×rn, where p is the damping factor (0.85) and rn is the rank of the (source) vertex in the previous iteration. If a vertex has no outgoing edges, it is considered to have outgoing edges to all vertices in the graph (including itself); this is because a random surfer jumps to a random page upon visiting a page with no links, in order to avoid the rank-sink effect. However, this approach requires multiple writes per source vertex, due to the accumulation (+=) operation.
The other approach (pull) is to pull contributions from in-vertices. Here, the rank of a vertex in an iteration is calculated as c0 + pΣrn/dn, where c0 is the common teleport contribution, p is the damping factor (0.85), rn is the previous rank of a vertex with an incoming edge, dn is the out-degree of that incoming-edge vertex, and N is the total number of vertices in the graph. The common teleport contribution c0, calculated as (1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex in the graph due to the damping factor, (1-p)/N, and the teleport from dangling vertices (with no outgoing edges) in the graph, pΣrn/N (to avoid the rank-sink effect). This approach requires two additional calculations per vertex, i.e., the non-teleport contribution of each vertex, and the total teleport contribution (to all vertices). However, it requires only one write per destination vertex. For this experiment both of these approaches are assessed on a number of different graphs.
All graphs used are stored in the MatrixMarket (.mtx) file format, and obtained from the SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using GCC 9 with the OpenMP flag (-fopenmp) and optimization level 3 (-O3). The system used is a Dell PowerEdge R740 Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux release 7.9.2009 (Core). The execution time of each test case is measured using std::chrono::high_resolution_clock. This is done 5 times for each test case, and timings are averaged. Statistics of each test case are printed to standard output (stdout), and redirected to a log file, which is then processed with a script to generate a CSV file, with each row representing the details of a single test case.
This CSV file is imported into Google Sheets, and necessary tables are set up with the help of the FILTER function to create the charts.
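Before the results, a minimal OpenMP sketch of the pull-based update described above (names are illustrative; the 4096 chunk size matches the scheduling used later): a reduction computes the dangling/teleport term, and a parallel for computes per-vertex ranks with one write per destination.

#include <omp.h>
#include <vector>
using std::vector;

// One pull-based iteration, parallelized with OpenMP.
// offs/srcs are the CSR of the transpose (in-edges); outdeg holds out-degrees.
void pagerankStepOmp(vector<double>& a, const vector<double>& r,
                     const vector<int>& offs, const vector<int>& srcs,
                     const vector<int>& outdeg, double p) {
  int N = int(r.size());
  double c0 = (1 - p) / N;                       // teleport due to damping
  #pragma omp parallel for schedule(static, 4096) reduction(+:c0)
  for (int u = 0; u < N; ++u)
    if (outdeg[u] == 0) c0 += p * r[u] / N;      // dangling-vertex teleport
  #pragma omp parallel for schedule(static, 4096)
  for (int v = 0; v < N; ++v) {                  // one write per destination
    double s = 0;
    for (int i = offs[v]; i < offs[v+1]; ++i)
      s += r[srcs[i]] / outdeg[srcs[i]];
    a[v] = c0 + p * s;
  }
}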
• 22. While it might seem that the pull method would be a clear winner, the results indicate that although pull is always faster than the push approach, the difference between the two depends on the nature of the graph. The next step is to compare the performance of finding PageRank using the C++ DiGraph class directly (using arrays of edge-lists) vs its CSR (Compressed Sparse Row) representation (contiguous). Using a CSR representation has the potential for performance improvement due to information on vertices and edges being stored contiguously.
Table 6.6.1: Adjusting the sequential approach
Push, Pull; Class, CSR
1. Performance of contribution-push based vs contribution-pull based PageRank.
2. Performance of C++ DiGraph class based vs CSR based PageRank (pull).
Both uniform and hybrid OpenMP techniques were attempted on different types of graphs. All OpenMP-based functions are defined with a parallel for clause and static scheduling with a chunk size of 4096. When necessary, a reduction clause is used. The number of threads for this experiment (set via OMP_NUM_THREADS) was varied from 2 to 48. Results show that the hybrid approach performs worse in most cases, and is only slightly better than the uniform approach in a few cases. This could possibly be because OpenMP already handles chip/core scheduling properly when it is used for all the primitives.
Table 6.6.2: Adjusting the OpenMP approach
Map, Reduce; Uniform, Hybrid
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Performance of sequential execution based vs OpenMP based vector element sum.
3. Performance of uniform-OpenMP based vs hybrid-OpenMP based PageRank (pull, CSR).
In the final experiment, the performance of OpenMP-based PageRank is contrasted with the sequential approach and with nvGraph PageRank. OpenMP-based PageRank does provide a clear benefit over sequential PageRank for most graphs. This speedup is not directly proportional to the number of threads, as one might naively expect (Amdahl's law). However, nvGraph is clearly much faster than the OpenMP version; this is as expected, because nvGraph makes use of the GPU.
• 23. Table 6.6.3: Comparing the sequential approach
Sequential vs OpenMP; Sequential vs nvGraph; OpenMP vs nvGraph
1. Performance of sequential execution based vs OpenMP based PageRank (pull, CSR).
2. Performance of sequential execution based vs nvGraph based PageRank (pull, CSR).
3. Performance of OpenMP based vs nvGraph based PageRank (pull, CSR).
• 24. 6.7 Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD)
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices (vertices with the same in-links) avoids duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain vertices can be calculated directly. This could reduce both the iteration time and the number of iterations. If a graph has no dangling vertices, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and it also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
However, the STICD algorithm requires the graph to be free of dangling vertices. Although this can easily be dealt with by adding self-loops to those vertices, such a modification of the graph may be undesirable in some cases. Another way to deal with the issue is to perform the PageRank computation on the entire graph at once, instead of in topological order. With this approach, dangling vertices are handled by a teleport-contribution calculation that is shared among all vertices in the graph (as in standard pull-based PageRank). We can still take advantage of the locality benefits of splitting the graph by components, skipping in-identicals to reduce iteration time, skipping chains to reduce iteration time and the number of iterations, and skipping converged vertices (as mentioned).
Before starting any algorithmic optimization, a good monolithic PageRank implementation needs to be set up. There are two ways (algorithmically) to think of the PageRank calculation. One approach (push) is to find PageRank by pushing contributions to out-vertices. The push method is somewhat easier to implement, and is described in this lecture. With this approach, in each iteration, for each vertex, the ranks of the vertices on its outgoing edges are accumulated with p×rn, where p is the damping factor (0.85) and rn is the rank of the (source) vertex in the previous iteration. If a vertex has no outgoing edges, it is considered to have outgoing edges to all vertices in the graph (including itself); this is because a random surfer jumps to a random page upon visiting a page with no links, in order to avoid the rank-sink effect. However, this approach requires multiple writes per source vertex, due to the accumulation (+=) operation.
The other approach (pull) is to pull contributions from in-vertices. Here, the rank of a vertex in an iteration is calculated as c0 + pΣrn/dn, where c0 is the common teleport contribution, p is the damping factor (0.85), rn is the previous rank of a vertex with an incoming edge, dn is the out-degree of that incoming-edge vertex, and N is the total number of vertices in the graph.
The common teleport contribution c0, calculated as (1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex in the graph due to the damping factor, (1-p)/N, and the teleport from dangling vertices (with no outgoing edges) in the graph, pΣrn/N (to avoid the rank-sink effect). This approach requires two additional calculations per vertex, i.e., the non-teleport contribution of each vertex, and the total teleport
• 25. contribution (to all vertices). However, it requires only one write per destination vertex. For this experiment both of these approaches are assessed on a number of different graphs.
All graphs used are stored in the MatrixMarket (.mtx) file format, and obtained from the SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740 Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux release 7.9.2009 (Core). The execution time of each test case is measured using std::chrono::high_resolution_clock. This is done 5 times for each test case, and timings are averaged. Statistics of each test case are printed to standard output (stdout), and redirected to a log file, which is then processed with a script to generate a CSV file, with each row representing the details of a single test case. This CSV file is imported into Google Sheets, and necessary tables are set up with the help of the FILTER function to create the charts.
While it might seem that the pull method would be a clear winner, the results indicate that although pull is always faster than the push approach, the difference between the two depends on the nature of the graph. The next step is to compare the performance of finding PageRank using the C++ DiGraph class directly (using arrays of edge-lists) vs its CSR (Compressed Sparse Row) representation (contiguous). Using a CSR representation has the potential for performance improvement due to information on vertices and edges being stored contiguously.
Table 6.7.1: Adjusting the Monolithic (sequential) approach
Push, Pull; Class, CSR
1. Performance of contribution-push based vs contribution-pull based PageRank.
2. Performance of C++ DiGraph class based vs CSR based PageRank (pull).
Next, an experiment is conducted to assess the performance benefit of each algorithmic optimization separately. For the splitting-graph-by-components optimization, the following approaches are compared: PageRank without optimization, PageRank with vertices split by components, and finally PageRank with components sorted in topological order. Components of the graph are obtained using Kosaraju’s algorithm. Topological ordering is done by representing the graph as a block-graph, where each component is represented as a vertex, and cross-edges between components are represented as edges. This block-graph is then topologically sorted, and the resulting vertex order in the block-graph is used to reorder the components in topological order. Vertices, and their respective edges, are accordingly simply reordered before computing PageRank (no graph partitioning is done). Each approach was attempted on a number of graphs. On a few graphs, splitting vertices by components provides a speedup, but sorting components in topological order provides no additional speedup. For road networks, like germany_osm, which only have one component, the speedup is possibly because of the vertex reordering caused by dfs(), which is required for
• 26. splitting by components.
For the skipping-in-identicals optimization, comparison is done with unoptimized PageRank. In-identical vertices are obtained by hashing each vertex’s in-vertex list and scanning for matching edges. Except for the first vertex of each in-identicals group, the remaining vertices are skipped during PageRank computation. After each iteration ends, the rank of the first in-identical vertex is copied to the remaining vertices of the group. The vertices to be skipped are marked with a negative source-offset in the CSR. On the indochina-2004 graph, skipping in-identicals provides a speedup of ~1.8x, but on average it provides no speedup for other graphs. This is likely because indochina-2004 has a large number of in-identicals and in-identical groups, although it has neither the highest in-identicals percentage nor the highest average in-identical group size.
For the skipping-chains optimization, comparison is done with unoptimized PageRank. It is important to note that a chain here means a set of unidirectional links connecting one vertex to the next, without any additional edges; bidirectional links are not considered chains. Chain vertices are obtained by traversing 2-degree vertices in both directions and marking visited ones. Except for the first vertex of each chain group, the remaining vertices are skipped during PageRank computation. After each iteration ends, the ranks of the remaining vertices in each chain group are updated using the (geometric progression) formula c0×(1-p^n)/(1-p) + p^n×r, where c0 is the common teleport contribution, p is the damping factor, n is the distance from the first chain vertex, and r is the rank of the first chain vertex in the previous iteration (a sketch of this closed-form update follows the table below). The vertices to be skipped are marked with a negative source-offset in the CSR. On average, skipping chain vertices provides no speedup. This is likely because most graphs don't have enough chains to provide an advantage. Road networks do have chains, but they are bidirectional, and thus not considered here.
For the skipping-converged-vertices optimization, the following approaches are compared: PageRank without optimization, PageRank skipping converged vertices with re-check (every 2-16 turns), and PageRank skipping converged vertices after several turns (2-64 turns). The skip-with-re-check (skip-check) approach skips the current iteration for a vertex if its rank for the last two iterations matches and the current turn (iteration) is not a “check” turn; the check turn is adjusted between 2-16 turns. The skip-after-turns (skip-after) approach skips all future iterations of a vertex after its rank does not change for “after” turns; the after turns are adjusted between 2-64 turns. On average, neither skip-check nor skip-after gives better speed than the default (unoptimized) approach. This could be due to the unnecessary iterations added by skip-check (mistakenly skipped vertices), and the increased memory accesses performed by skip-after (tracking the converged count).
Table 6.7.2: Adjusting Monolithic optimizations (from STICD)
Split components; Skip in-identicals; Skip chains; Skip converged
1. Performance benefit of PageRank with vertices split by components (pull, CSR).
2. Performance benefit of skipping in-identical vertices for PageRank (pull, CSR).
3. Performance benefit of skipping chain vertices for PageRank (pull, CSR).
4. Performance benefit of skipping converged vertices for PageRank (pull, CSR).
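As referenced above, a minimal sketch of the closed-form chain update (assuming the formula as stated): unrolling rank_k = c0 + p·rank_(k-1) down the chain yields a geometric series.

#include <cmath>

// Rank of the vertex n links down a chain, given the previous-iteration rank r
// of the first chain vertex: repeatedly applying rank_k = c0 + p*rank_(k-1)
// gives c0*(1 + p + ... + p^(n-1)) + p^n*r, a geometric series.
double chainRank(double c0, double p, int n, double r) {
  double pn = std::pow(p, n);
  return c0 * (1 - pn) / (1 - p) + pn * r;
}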
For this experiment, Monolithic PageRank (static and dynamic) is contrasted with nvGraph PageRank (static and dynamic). For dynamic PageRank (monolithic / nvGraph),
• 27. initial ranks are set to the ranks obtained from static PageRank of the graph at the previous instant (or batch). Temporal graphs are stored in a plain text file in “u, v, t” format, where u is the source vertex, v is the destination vertex, and t is the UNIX epoch time in seconds. All of them are obtained from the Stanford Large Network Dataset Collection. They are loaded in multiple batch sizes (1, 5, 10, 50, ...). New edges are incrementally added to the graph batch-by-batch until the entire graph is complete. Fixed graphs are stored in the MatrixMarket (.mtx) file format, and are obtained from the SuiteSparse Matrix Collection. They are loaded in multiple batch sizes (1, 5, 10, 50, ...), as with temporal graphs. For each batch size B, the same number of random edges is added to the graph, with the probability of an edge being attached to a vertex directly proportional to its out-degree. As expected, results show dynamic PageRank to be clearly faster than static PageRank in most cases (for both temporal and fixed graphs).
Table 6.7.3: Comparing the dynamic approach with static
nvGraph static vs nvGraph dynamic: temporal
Monolithic static vs Monolithic dynamic: fixed, temporal
1. Performance of nvGraph based static vs dynamic PageRank (temporal).
2. Performance of static vs dynamic PageRank (temporal).
3. Performance of static vs dynamic levelwise PageRank (fixed).
Note: fixed ⇒ static graphs with batches of random edge updates; temporal ⇒ batches of edge updates from temporal graphs.
The purpose of the next experiment is to settle on a good CUDA implementation of static PageRank. PageRank uses map-reduce primitives in each iteration step (like multiply and sum). Two floating-point vectors x and y, with the number of elements ranging from 1E+6 to 1E+9, were multiplied using CUDA. Each element count was attempted with various CUDA launch configs, running each config 5 times to get a good time measure. Multiplication here represents any memory-aligned independent operation. Using a large grid_limit and a block_size of 256 could be a decent choice (for both float and double). A floating-point vector x, with the number of elements ranging from 1E+6 to 1E+9, was summed using CUDA (Σx). Each element count was attempted with various CUDA launch configs, running each config 5 times to get a good time measure. Sum here represents any reduction operation that processes several values into a single value. This sum can be performed with two possible approaches: memcpy or in-place. With the memcpy approach, partial results are transferred to the CPU, where the final sum is calculated. If the result can be used within the GPU itself, it might be faster to calculate the complete sum in place instead of transferring it to the CPU. This is done using either 2 kernel calls (if grid_limit is 1024) or 3 kernel calls (otherwise). A block_size of 128 (a decent choice for sum) is used for the 2nd kernel, if there are 3 kernels.
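Minimal CUDA sketches of the two primitives above (kernel names and configs are illustrative assumptions): a grid-stride multiply, and the one-read-per-iteration sum with a standard shared-memory tree reduce; with the memcpy approach the per-block partials are copied back and summed on the CPU.

// Grid-stride element-wise multiply: a[i] = x[i] * y[i].
__global__ void multiplyKernel(float *a, const float *x, const float *y, int N) {
  for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < N; i += gridDim.x*blockDim.x)
    a[i] = x[i] * y[i];
}

// Per-block partial sum: one read per loop iteration, then a standard tree
// reduce in shared memory (block size assumed to be a power of two).
__global__ void sumKernel(float *out, const float *x, int N) {
  extern __shared__ float cache[];
  float s = 0;
  for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < N; i += gridDim.x*blockDim.x)
    s += x[i];
  cache[threadIdx.x] = s;
  __syncthreads();
  for (unsigned n = blockDim.x/2; n > 0; n /= 2) {  // tree reduce, low divergence
    if (threadIdx.x < n) cache[threadIdx.x] += cache[threadIdx.x + n];
    __syncthreads();
  }
  if (threadIdx.x == 0) out[blockIdx.x] = cache[0];
}

// Example launch (memcpy approach): per-block partials are summed on the CPU.
// int block = 128, grid = min(1024, (N + block - 1) / block);
// sumKernel<<<grid, block, block * sizeof(float)>>>(partialDev, xDev, N);
// cudaMemcpy(partial, partialDev, grid * sizeof(float), cudaMemcpyDeviceToHost);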
• 28. With the memcpy approach, using float values, a grid_limit of 1024 and a block_size of 128 is a decent choice; for double values, a grid_limit of 1024 and a block_size of 256 is a decent choice. With the in-place approach, a number of possible optimizations, including multiple reads per loop iteration, a loop-unrolled reduce, and atomic adds, provided no benefit. A simple one-read-per-loop-iteration, standard reduce loop (minimizing warp divergence) is both shorter and works best. For float, a grid_limit of 1024 and a block_size of 128 is a decent choice; for double, a grid_limit of 1024 and a block_size of 256. Comparing both approaches shows similar performance.
The next experiment was for finding a suitable launch config for CUDA thread-per-vertex PageRank. For the launch config, the block-size (threads) was adjusted from 32 to 1024, and the grid-limit (max grid-size) from 1024 to 32768. Each config was run 5 times per graph to get a good time measure. On average, the launch config doesn't seem to have a large impact on performance; however, 8192x128 appears to be a good config, where 8192 is the grid-limit and 128 is the block-size. Comparing with the graph properties, it seems it would be better to use 8192x512 for graphs with high average density, and 8192x32 for graphs with high average degree. Maybe sorting the vertices by degree can have a good effect (due to less warp divergence). Note that this applies to a Tesla V100 PCIe 16GB, and would be different for other GPUs. In order to measure error, nvGraph PageRank is taken as a reference.
For the next experiment, sorting of vertices and/or edges was either NO, ASC, or DESC, giving a total of 3 × 3 = 9 cases. Each case is run on multiple graphs, 5 times per graph for a good time measure. Results show that sorting is slower in most cases, possibly because a sorted arrangement tends to flood certain memory chunks with too many requests. In order to measure error, nvGraph PageRank is taken as a reference.
The next experiment was for finding a suitable launch config for CUDA block-per-vertex PageRank. For the launch config, the block-size (threads) was adjusted from 32 to 1024, and the grid-limit (max grid-size) from 1024 to 32768. Each config was run 5 times per graph to get a good time measure. MAXx64 appears to be a good config for most graphs, where MAX is the grid-limit and 64 is the block-size. This launch config is for the entire graph, and could be slightly different for a subset of graphs. Also note that this applies to a Tesla V100 PCIe 16GB, and could be different for other GPUs. In order to measure error, nvGraph PageRank is taken as a reference.
For the next experiment, sorting of vertices and/or edges was either NO, ASC, or DESC, giving a total of 3 × 3 = 9 cases. Each case is run on multiple graphs, 5 times per graph for a good time measure. Results show that sorting is not faster in most cases; in a number of cases, sorting actually slows down performance. Maybe (just maybe) this is because sorted arrangements
• 29. tend to flood certain memory chunks with too many requests. In order to measure error, nvGraph PageRank is taken as a reference.
The next experiment was for finding a suitable launch config for CUDA switched-per-vertex PageRank for the thread approach. For the launch config, the block-size (threads) was adjusted from 32 to 1024, and the grid-limit (max grid-size) from 1024 to 32768. Each config was run 5 times per graph to get a good time measure. MAXx512 appears to be a good config for most graphs, where MAX is the grid-limit and 512 is the block-size. Note that this applies to a Tesla V100 PCIe 16GB, and would be different for other GPUs. In order to measure error, nvGraph PageRank is taken as a reference.
Likewise, for the block approach, the block-size (threads) was adjusted from 32 to 1024, and the grid-limit (max grid-size) from 1024 to 32768, running each config 5 times per graph to get a good time measure. MAXx256 appears to be a good config for most graphs, where MAX is the grid-limit and 256 is the block-size. Note that this applies to a Tesla V100 PCIe 16GB, and would be different for other GPUs. In order to measure error, nvGraph PageRank is taken as a reference.
For the next experiment, sorting of vertices and/or edges was either NO, ASC, or DESC, giving a total of 3 × 3 = 9 cases. NO here means that vertices are partitioned by in-degree (edges remain unchanged). Each case is run on multiple graphs, 5 times per graph for a good time measure. Results show that sorting is not faster in most cases; it is better to simply partition vertices by degree. In order to measure error, nvGraph PageRank is taken as a reference.
For the final experiment here, the objective is to find a good switch point for CUDA switched-per-vertex PageRank. To assess this, switch_degree was varied from 2 to 1024, and switch_limit from 1 to 1024. switch_degree defines the in-degree at which the PageRank kernel switches from the thread-per-vertex approach to block-per-vertex; switch_limit defines the minimum size of a thread-per-vertex / block-per-vertex range (if a range is too small, it is merged with the range of the other approach). Each case is run on multiple graphs, 5 times per graph for a good time measure. It seems a switch_degree of 64 and a switch_limit of 32 would be a good choice (a sketch of this partitioning follows).
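As referenced above, a minimal host-side sketch of the switch point (names are assumptions): vertex ids are partitioned by in-degree, with low-degree vertices going to the thread-per-vertex kernel and the rest to the block-per-vertex kernel.

#include <algorithm>
#include <vector>
using std::vector;

// Partition vertex ids so that vertices with in-degree < switchDegree come
// first (thread-per-vertex) and the rest follow (block-per-vertex). Returns
// the switch point. A range smaller than switchLimit would be merged into the
// other side before launching the kernels (not shown).
size_t partitionBySwitchPoint(vector<int>& ks, const vector<int>& indeg, int switchDegree) {
  auto it = std::partition(ks.begin(), ks.end(),
                           [&](int u) { return indeg[u] < switchDegree; });
  return size_t(it - ks.begin());
}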
• 30. Table 6.7.4: Adjusting the Monolithic CUDA approach
Map: launch
Reduce: memcpy launch, in-place launch (vs)
Thread/V: launch, sort/partition vertices, sort edges
Block/V: launch, sort/partition vertices, sort edges
Switched/V: thread launch, block launch, switch-point
1. Comparing various launch configs for CUDA based vector multiply.
2. Comparing various launch configs for CUDA based vector element sum (memcpy).
3. Comparing various launch configs for CUDA based vector element sum (in-place).
4. Performance of memcpy vs in-place based CUDA based vector element sum.
5. Comparing various launch configs for CUDA thread-per-vertex based PageRank (pull, CSR).
6. Sorting vertices and/or edges by in-degree for CUDA thread-per-vertex based PageRank.
7. Comparing various launch configs for CUDA block-per-vertex based PageRank (pull, CSR).
8. Sorting vertices and/or edges by in-degree for CUDA block-per-vertex based PageRank.
9. Launch configs for CUDA switched-per-vertex based PageRank focusing on thread approach.
10. Launch configs for CUDA switched-per-vertex based PageRank focusing on block approach.
11. Sorting vertices and/or edges by in-degree for CUDA switched-per-vertex based PageRank.
12. Comparing various switch points for CUDA switched-per-vertex based PageRank (pull, ...).
Note: sort/partition vertices ⇒ sorting vertices by ascending or descending order of in-degree, or simply partitioning (by in-degree); sort edges ⇒ sorting edges by ascending or descending order of id.
The next experiment was for checking the benefit of splitting the vertices of the graph by components. This was done by comparing performance between: PageRank without optimization, PageRank with vertices split by components, and PageRank with components sorted in topological order. Each approach was attempted on a number of graphs, running each approach 5 times to get a good time measure. On a few graphs, splitting vertices by components provides a speedup, but sorting components in topological order provides no additional speedup. For road networks, like germany_osm, which only have one component, the speedup is possibly because of the vertex reordering caused by dfs(), which is required for splitting by components. However, on average there is no speedup.
The next experiment was for checking the benefit of skipping the rank calculation of in-identical vertices. This optimization, and the control approach, was attempted on a number of graphs, running each approach 5 times to get a good time measure. On the indochina-2004 graph, skipping in-identicals provides a speedup of ~1.3x, but on average it provides no speedup for other graphs. This could be due to the fact that indochina-2004 has a large number of in-identicals and in-identical groups. However, it
• 31. doesn't have the highest in-identicals percentage or the highest average in-identical group size, so I am not sure.
The next experiment was for checking the benefit of skipping the rank calculation of chain vertices. This optimization, and the control approach, was attempted on a number of graphs, running each approach 5 times to get a good time measure. On average, skipping chain vertices provides no speedup. A chain here means a set of unidirectional links connecting one vertex to the next, without any additional edges; bidirectional links are not considered chains. Note that most graphs don't have enough chains to provide an advantage. Road networks do have chains, but they are bidirectional, and thus not considered here.
The next experiment was for checking the benefit of skipping converged vertices. This was done by comparing performance between: PageRank without optimization, PageRank skipping converged vertices with re-check (every 2-16 turns), and PageRank skipping converged vertices after several turns (2-64 turns). Each approach was attempted on a number of graphs, running each approach 5 times to get a good time measure. Skip with re-check (skip-check) is done every 2-16 turns; skip after turns (skip-after) is done after 2-64 turns. On average, neither skip-check nor skip-after gives better speed than the default (unoptimized) approach. This could be due to the unnecessary iterations added by skip-check (mistakenly skipped vertices), and the increased memory accesses performed by skip-after (tracking the converged count).
Table 6.7.5: Adjusting Monolithic CUDA optimizations (from STICD)
Split components; Skip in-identicals; Skip chains; Skip converged
1. Performance benefit of CUDA based PageRank with vertices split by components.
2. Performance benefit of skipping in-identical vertices for CUDA based PageRank (pull, CSR).
3. Performance benefit of skipping chain vertices for CUDA based PageRank (pull, CSR).
4. Performance benefit of skipping converged vertices for CUDA based PageRank (pull, CSR).
The next experiment compares the performance of: Monolithic dynamic PageRank, Monolithic static PageRank, nvGraph dynamic PageRank, and nvGraph static PageRank. This is done with both fixed and temporal graphs. For temporal graphs, each graph is updated in multiple batch sizes (1, 5, 10, 50, ...); new edges are incrementally added to the graph batch-by-batch until the entire graph is complete. For fixed graphs, each batch size was run with 5 different updates to the graph, and each specific update was run 5 times for each approach to get a good time measure. On average, Monolithic dynamic PageRank is faster than the static approach.
• 32. Table 6.7.6: Comparing the dynamic CUDA approach with static
nvGraph static: vs nvGraph dynamic (fixed, temporal); vs Monolithic dynamic (fixed, temporal)
Monolithic static: vs nvGraph dynamic (fixed, temporal); vs Monolithic dynamic (fixed, temporal)
1. Performance of static vs dynamic CUDA based PageRank (fixed).
2. Performance of static vs dynamic CUDA based PageRank (temporal).
3. Performance of CUDA based static vs dynamic levelwise PageRank (fixed).
4. Performance of static vs dynamic CUDA based levelwise PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates; temporal ⇒ batches of edge updates from temporal graphs.
The next experiment compares the performance of: Monolithic dynamic PageRank, Monolithic static PageRank, nvGraph dynamic PageRank, and nvGraph static PageRank. Here, unaffected vertices are skipped from the PageRank computation. This is done with fixed graphs. Each batch size was run with 5 different updates to the graph, and each specific update was run 5 times for each approach to get a good time measure. On average, Monolithic dynamic PageRank is faster than the static approach.
Table 6.7.7: Comparing the dynamic optimized CUDA approach with static
nvGraph static: vs nvGraph dynamic (fixed); vs Monolithic dynamic (fixed)
Monolithic static: vs nvGraph dynamic (fixed); vs Monolithic dynamic (fixed)
1. Performance of CUDA based optimized dynamic monolithic vs levelwise PageRank (fixed).
Note: fixed ⇒ static graphs with batches of random edge updates; temporal ⇒ batches of edge updates from temporal graphs.
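One way to realize the skipping of unaffected vertices mentioned above is a forward reachability pass from the endpoints of updated edges. The sketch below is an assumption of how this could look (names are illustrative), not necessarily how the experiment implements it.

#include <utility>
#include <vector>
using std::vector;

// Mark vertices whose ranks may change after a batch of edge insertions: every
// vertex reachable via out-edges from an endpoint of a changed edge. adj is
// the out-adjacency list; batch holds the inserted (u, v) edges.
vector<bool> markAffected(const vector<vector<int>>& adj,
                          const vector<std::pair<int,int>>& batch) {
  vector<bool> affected(adj.size(), false);
  vector<int> stack;
  for (auto [u, v] : batch) { stack.push_back(u); stack.push_back(v); }
  while (!stack.empty()) {                       // iterative DFS
    int u = stack.back(); stack.pop_back();
    if (affected[u]) continue;
    affected[u] = true;
    for (int w : adj[u]) if (!affected[w]) stack.push_back(w);
  }
  return affected;
}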
• 33. 6.8 Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD)
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices (vertices with the same in-links) avoids duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain vertices can be calculated directly. This could reduce both the iteration time and the number of iterations. If a graph has no dangling vertices, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and it also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Before starting any algorithmic optimization, a good monolithic PageRank implementation needs to be set up. There are two ways (algorithmically) to think of the PageRank calculation. One approach (push) is to find PageRank by pushing contributions to out-vertices. The push method is somewhat easier to implement, and is described in this lecture. With this approach, in each iteration, for each vertex, the ranks of the vertices on its outgoing edges are accumulated with p×rn, where p is the damping factor (0.85) and rn is the rank of the (source) vertex in the previous iteration. If a vertex has no outgoing edges, it is considered to have outgoing edges to all vertices in the graph (including itself); this is because a random surfer jumps to a random page upon visiting a page with no links, in order to avoid the rank-sink effect. However, this approach requires multiple writes per source vertex, due to the accumulation (+=) operation.
The other approach (pull) is to pull contributions from in-vertices. Here, the rank of a vertex in an iteration is calculated as c0 + pΣrn/dn, where c0 is the common teleport contribution, p is the damping factor (0.85), rn is the previous rank of a vertex with an incoming edge, dn is the out-degree of that incoming-edge vertex, and N is the total number of vertices in the graph. The common teleport contribution c0, calculated as (1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex in the graph due to the damping factor, (1-p)/N, and the teleport from dangling vertices (with no outgoing edges) in the graph, pΣrn/N (to avoid the rank-sink effect). This approach requires two additional calculations per vertex, i.e., the non-teleport contribution of each vertex, and the total teleport contribution (to all vertices). However, it requires only one write per destination vertex. For this experiment both of these approaches are assessed on a number of different graphs.
All graphs used are stored in the MatrixMarket (.mtx) file format, and obtained from the SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using GCC 9 with optimization level 3 (-O3).
The system used is a Dell PowerEdge R740 Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux release 7.9.2009 (Core). The execution time of each test case is measured using std::chrono::high_resolution_clock. This is done 5 times for each test case, and timings are averaged. Statistics of each test case are printed to standard output (stdout), and redirected to a log file, which is then processed with a script to generate a CSV file, with
• 34. each row representing the details of a single test case. This CSV file is imported into Google Sheets, and necessary tables are set up with the help of the FILTER function to create the charts.
While it might seem that the pull method would be a clear winner, the results indicate that although pull is always faster than the push approach, the difference between the two depends on the nature of the graph. The next step is to compare the performance of finding PageRank using the C++ DiGraph class directly (using arrays of edge-lists) vs its CSR (Compressed Sparse Row) representation (contiguous). Using a CSR representation has the potential for performance improvement due to information on vertices and edges being stored contiguously.
Table 6.8.1: Adjusting the Monolithic (sequential) approach
Push, Pull; Class, CSR
1. Performance of contribution-push based vs contribution-pull based PageRank.
2. Performance of C++ DiGraph class based vs CSR based PageRank (pull).
Next, an experiment is conducted to assess the performance benefit of each algorithmic optimization separately. For the splitting-graph-by-components optimization, the following approaches are compared: PageRank without optimization, PageRank with vertices split by components, and finally PageRank with components sorted in topological order. Components of the graph are obtained using Kosaraju’s algorithm. Topological ordering is done by representing the graph as a block-graph, where each component is represented as a vertex, and cross-edges between components are represented as edges. This block-graph is then topologically sorted, and the resulting vertex order in the block-graph is used to reorder the components in topological order. Vertices, and their respective edges, are accordingly simply reordered before computing PageRank (no graph partitioning is done). Each approach was attempted on a number of graphs. On a few graphs, splitting vertices by components provides a speedup, but sorting components in topological order provides no additional speedup. For road networks, like germany_osm, which only have one component, the speedup is possibly because of the vertex reordering caused by dfs(), which is required for splitting by components.
For the skipping-in-identicals optimization, comparison is done with unoptimized PageRank. In-identical vertices are obtained by hashing each vertex’s in-vertex list and scanning for matching edges. Except for the first vertex of each in-identicals group, the remaining vertices are skipped during PageRank computation. After each iteration ends, the rank of the first in-identical vertex is copied to the remaining vertices of the group. The vertices to be skipped are marked with a negative source-offset in the CSR. On the indochina-2004 graph, skipping in-identicals provides a speedup of ~1.8x, but on average it provides no speedup for other graphs. This is likely because indochina-2004 has a large number of in-identicals and in-identical groups, although it has neither the highest in-identicals percentage nor the highest average in-identical group size.
For the skipping-chains optimization, comparison is done with unoptimized PageRank. It is important to note that a chain here means a set of unidirectional links connecting one vertex
• 35. to the next, without any additional edges; bidirectional links are not considered chains. Chain vertices are obtained by traversing 2-degree vertices in both directions and marking visited ones. Except for the first vertex of each chain group, the remaining vertices are skipped during PageRank computation. After each iteration ends, the ranks of the remaining vertices in each chain group are updated using the (geometric progression) formula c0×(1-p^n)/(1-p) + p^n×r, where c0 is the common teleport contribution, p is the damping factor, n is the distance from the first chain vertex, and r is the rank of the first chain vertex in the previous iteration. The vertices to be skipped are marked with a negative source-offset in the CSR. On average, skipping chain vertices provides no speedup. This is likely because most graphs don't have enough chains to provide an advantage. Road networks do have chains, but they are bidirectional, and thus not considered here.
For the skipping-converged-vertices optimization, the following approaches are compared: PageRank without optimization, PageRank skipping converged vertices with re-check (every 2-16 turns), and PageRank skipping converged vertices after several turns (2-64 turns). The skip-with-re-check (skip-check) approach skips the current iteration for a vertex if its rank for the last two iterations matches and the current turn (iteration) is not a “check” turn; the check turn is adjusted between 2-16 turns. The skip-after-turns (skip-after) approach skips all future iterations of a vertex after its rank does not change for “after” turns; the after turns are adjusted between 2-64 turns. On average, neither skip-check nor skip-after gives better speed than the default (unoptimized) approach. This could be due to the unnecessary iterations added by skip-check (mistakenly skipped vertices), and the increased memory accesses performed by skip-after (tracking the converged count).
Table 6.8.2: Adjusting Monolithic optimizations (from STICD)
Split components; Skip in-identicals; Skip chains; Skip converged
1. Performance benefit of PageRank with vertices split by components (pull, CSR).
2. Performance benefit of skipping in-identical vertices for PageRank (pull, CSR).
3. Performance benefit of skipping chain vertices for PageRank (pull, CSR).
4. Performance benefit of skipping converged vertices for PageRank (pull, CSR).
The next experiment compares the performance of levelwise PageRank with various minimum compute sizes, ranging from 1 to 1E+7. Here, the minimum compute size is the minimum number of vertices in each PageRank compute call that uses the standard (monolithic) algorithm. Each minimum compute size was attempted on different types of graphs, running each size 5 times per graph to get a good time measure. Levelwise PageRank is the STIC-D algorithm, without ICD optimizations (using a single thread). Although there is no clear winner, it appears a minimum compute size of 10 would be a good choice. Note that the levelwise approach does not make use of SIMD instructions, which are available on all modern hardware.
The next experiment compares the performance of: monolithic PageRank, monolithic PageRank skipping teleport calculations, levelwise PageRank, and levelwise PageRank skipping teleport calculations. Each approach was attempted on different types of graphs, running each approach 5 times per graph to get a good time measure. Levelwise PageRank is the STIC-D algorithm, without ICD optimizations (using a single thread).
• 36. Except for soc-LiveJournal1 and coPapersCiteseer, skipping teleport calculations is slightly faster in all cases (the two exceptions could be measurement noise). The improvement is most prominent for road networks and certain web graphs.
Table 6.8.3: Adjusting the Levelwise (STICD) approach
Min. component size; Min. compute size; Skip teleport calculation
1. Comparing various min. component sizes for topologically-ordered components (levelwise...).
2. Comparing various min. compute sizes for topologically-ordered components (levelwise...).
3. Checking performance benefit of levelwise PageRank when teleport calculation is skipped.
Note: min. component size merges small components even before generating the block-graph / topological ordering, whereas min. compute size does it just before PageRank computation.
The next experiment compares the performance of: PageRank with the standard algorithm (monolithic), and PageRank in topologically-ordered-components fashion (levelwise). Both approaches were attempted on different types of graphs, running each approach 5 times per graph to get a good time measure. Levelwise PageRank is the STIC-D algorithm, without ICD optimizations (using a single thread). On average, levelwise PageRank is faster than the monolithic approach. Note that neither approach makes use of SIMD instructions, which are available on all modern hardware.
Table 6.8.4: Comparing the Levelwise (STICD) approach
Levelwise (STICD) vs Monolithic
1. Performance of monolithic vs topologically-ordered components (levelwise) PageRank.
The next experiment compares the performance of: static levelwise PageRank, dynamic levelwise PageRank (processing all components), and dynamic levelwise PageRank skipping unchanged components. Each approach was attempted on a number of graphs (fixed and temporal), running each with multiple batch sizes (1, 5, 10, 50, ...). Levelwise PageRank is the STIC-D algorithm, without ICD optimizations (using a single thread). On average, skipping unchanged components is barely faster than not skipping.
• 37. Table 6.8.5: Adjusting the Levelwise (STICD) dynamic approach
Skip unaffected components: for fixed graphs, for temporal graphs
1. Checking for correctness of levelwise PageRank when unchanged components are skipped.
2. Perf. benefit of levelwise PageRank when unchanged components are skipped (fixed).
3. Perf. benefit of levelwise PageRank when unchanged components are skipped (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates; temporal ⇒ batches of edge updates from temporal graphs.
The next experiment compares the performance of: static PageRank using the standard algorithm (monolithic), static PageRank using the levelwise algorithm, and dynamic PageRank using the levelwise algorithm. Each approach was attempted on a number of graphs, running each with multiple batch sizes (1, 5, 10, 50, ...). Each PageRank computation was run 5 times for each approach to get a good time measure. Levelwise PageRank is the STIC-D algorithm, without ICD optimizations (using a single thread). Clearly, dynamic levelwise PageRank is faster than the static approach for many batch sizes.
Table 6.8.6: Comparing the dynamic approach with static
nvGraph static: vs nvGraph dynamic (temporal)
Monolithic static: vs Monolithic dynamic (fixed, temporal); vs Levelwise dynamic (fixed, temporal)
Levelwise static: vs Monolithic dynamic (fixed); vs Levelwise dynamic (fixed, temporal)
1. Performance of nvGraph based static vs dynamic PageRank (temporal).
2. Performance of static vs dynamic PageRank (temporal).
3. Performance of static vs dynamic levelwise PageRank (fixed).
4. Performance of levelwise based static vs dynamic PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates; temporal ⇒ batches of edge updates from temporal graphs.
The next experiment compares the performance of levelwise CUDA PageRank with various minimum compute sizes, ranging from 1E+3 to 1E+7. Here, the minimum compute size is the minimum number of vertices in each PageRank compute call that uses the standard algorithm (monolithic CUDA). Each minimum compute size was attempted on different types of graphs, running each
• 38. size 5 times per graph to get a good time measure. Levelwise PageRank is the STIC-D algorithm, without ICD optimizations (using a single thread). Although there is no clear winner, it appears a minimum compute size of 5E+6 would be a good choice.
Table 6.8.7: Adjusting the Levelwise (STICD) CUDA approach
Min. component size; Min. compute size; Skip teleport calculation
1. Min. component sizes for topologically-ordered components (levelwise, CUDA) PageRank.
2. Min. compute sizes for topologically-ordered components (levelwise CUDA) PageRank.
Note: min. component size merges small components even before generating the block-graph / topological ordering, whereas min. compute size does it just before PageRank computation.
The next experiment compares the performance of: CUDA based PageRank with the standard algorithm (monolithic), and CUDA based PageRank in topologically-ordered-components fashion (levelwise). Both approaches were attempted on different types of graphs, running each approach 5 times per graph to get a good time measure. Levelwise PageRank is the STIC-D algorithm, without ICD optimizations (using a single thread). On average, levelwise PageRank performs on par with the monolithic approach.
Table 6.8.8: Comparing the Levelwise (STICD) CUDA approach
Monolithic vs nvGraph; Monolithic vs Monolithic CUDA; Monolithic CUDA vs nvGraph; Levelwise CUDA vs nvGraph; Levelwise CUDA vs Monolithic CUDA
1. Performance of sequential execution based vs CUDA based PageRank (pull, CSR).
2. Performance of nvGraph vs CUDA based PageRank (pull, CSR).
3. Performance of Monolithic CUDA vs Levelwise CUDA PageRank (pull, CSR, ...).
The next experiment compares the performance of: static PageRank of the updated graph, and dynamic PageRank of the updated graph. Both techniques were attempted on different temporal graphs, updating each graph with multiple batch sizes (1, 5, 10, 50, ...). New edges are incrementally added to the graph batch-by-batch until the entire graph is complete. Dynamic PageRank is clearly faster than the static approach for many batch sizes.
Table 6.8.9: Comparing dynamic CUDA approach with static

                     nvGraph dynamic       Monolithic dynamic    Levelwise dynamic
  nvGraph static     vs: fixed, temporal   vs: fixed, temporal   vs: fixed, temporal
  Monolithic static  vs: fixed, temporal   vs: fixed, temporal   vs: fixed, temporal
  Levelwise static   vs: fixed, temporal   vs: fixed, temporal   vs: fixed, temporal

1. Performance of static vs dynamic CUDA-based PageRank (fixed).
2. Performance of static vs dynamic CUDA-based PageRank (temporal).
3. Performance of CUDA-based static vs dynamic levelwise PageRank (fixed).
4. Performance of static vs dynamic CUDA-based levelwise PageRank (temporal).

Note: fixed ⇒ static graphs with batches of random edge updates; temporal ⇒ batches of edge updates taken from temporal graphs.

This experiment compares the performance of: static PageRank of the updated graph using nvGraph, dynamic PageRank of the updated graph using nvGraph, static monolithic CUDA-based PageRank of the updated graph, dynamic monolithic CUDA-based PageRank of the updated graph, static levelwise CUDA-based PageRank of the updated graph, and dynamic levelwise CUDA-based PageRank of the updated graph. Each approach was attempted on a number of graphs, running each with multiple batch sizes (1, 5, 10, 50, ...). Each batch size was run with 5 different updates to the graph, and each specific update was run 5 times per approach to get a good time measure. Levelwise PageRank is the STIC-D algorithm, without ICD optimizations. Indeed, dynamic levelwise PageRank is faster than the static approach for many batch sizes. In order to measure error, nvGraph PageRank is taken as the reference.

Table 6.8.10: Comparing dynamic optimized CUDA approach with static

                     nvGraph dynamic   Monolithic dynamic   Levelwise dynamic
  nvGraph static     vs: fixed         vs: fixed            vs: fixed
  Monolithic static  vs: fixed         vs: fixed            vs: fixed
  Levelwise static   vs: fixed         vs: fixed            vs: fixed

1. Performance of CUDA-based optimized dynamic monolithic vs levelwise PageRank (fixed).

Note: fixed ⇒ static graphs with batches of random edge updates; temporal ⇒ batches of edge updates taken from temporal graphs.
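With nvGraph ranks as the reference, error can be summarized as the mean per-vertex absolute difference between rank vectors. The sketch below shows one such measure as an assumption consistent with the text; the exact metric used for the tables may differ, and all names here are illustrative.

```cpp
// Minimal sketch: mean per-vertex absolute (L1) rank difference against a
// reference rank vector, here taken to be the nvGraph result.
#include <cmath>
#include <cstdio>
#include <vector>
using namespace std;

// Average |rank - reference| over all vertices; both vectors are assumed to
// have the same length and sum to 1.
double l1Error(const vector<double>& ranks, const vector<double>& reference) {
  double sum = 0;
  for (size_t i = 0; i < ranks.size(); ++i)
    sum += fabs(ranks[i] - reference[i]);
  return sum / ranks.size();
}

int main() {
  vector<double> nvgraph {0.25, 0.25, 0.50};  // reference ranks
  vector<double> dynamic {0.24, 0.26, 0.50};  // ranks from a dynamic approach
  printf("L1 error vs nvGraph: %.4f\n", l1Error(dynamic, nvgraph));
}
```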
7. Packages

1. CLI for the SNAP dataset collection, which includes more than 50 large networks. This is for quickly fetching the SNAP datasets you need right from the CLI. Currently there is only one command, clone, which accepts filters specifying exactly which datasets you need and where to download them. If a dataset already exists, it is skipped; a summary is shown at the end. You can install this with npm install -g snap-data.sh.

2. CLI for nvGraph, a GPU-based graph analytics library written by NVIDIA using CUDA. This is for running nvGraph functions right from the CLI, with graphs in MatrixMarket format (.mtx) directly. It only needs an x86_64 Linux machine with NVIDIA GPU drivers installed. Execution time, along with the results, can be saved in a JSON/YAML file. The executable code is written in C++. You can install this with npm install -g nvgraph.sh.

8. Further action

- List dynamic graph algorithms
- List dynamic graph data structures
- List graph processing frameworks
- List graph applications
- Package graph processing frameworks

9. Bibliography

[1] E. W. Weisstein, “Königsberg Bridge Problem,” MathWorld--A Wolfram Web Resource. https://mathworld.wolfram.com/KoenigsbergBridgeProblem.html (accessed Jul. 23, 2021).
[2] M. A. F. Richter, “Infographic: How Many Websites Are There?,” Statista Infographics, Oct. 2019.
[3] A. Langville and C. Meyer, “Deeper Inside PageRank,” Internet Math., vol. 1, no. 3, pp. 335–380, Jan. 2004, doi: 10.1080/15427951.2004.10129091.
[4] R. Meusel, “The graph structure in the web – analyzed on different aggregation levels,” JWS, vol. 1, no. 1, pp. 33–47, Aug. 2015, doi: 10.1561/106.00000003.
[5] M. Besta, M. Fischer, V. Kalavri, M. Kapralov, and T. Hoefler, “Practice of Streaming and Dynamic Graphs: Concepts, Models, Systems, and Parallelism,” CoRR, vol. abs/1912.12740, 2019.
[6] M. Besta, M. Fischer, V. Kalavri, M. Kapralov, and T. Hoefler, “Practice of Streaming Processing of Dynamic Graphs: Concepts, Models, and Systems,” 2021.
[7] Ong Kok Chien, Poo Kuan Hoong, and Chiung Ching Ho, “A comparative study of HITS vs PageRank algorithms for Twitter users analysis,” in 2014 International Conference on Computational Science and Technology (ICCST), Aug. 2014, pp. 1–6,
doi: 10.1109/ICCST.2014.7045007.
[8] Q. Zhang and T. Yuan, “Analysis of China’s Urban Network Structure from the Perspective of ‘Streaming,’” in 2018 26th International Conference on Geoinformatics, Jun. 2018, pp. 1–7, doi: 10.1109/GEOINFORMATICS.2018.8557078.
[9] Y.-Y. Kim, H.-A. Kim, C.-H. Shin, K.-H. Lee, C.-H. Choi, and W.-S. Cho, “Analysis on the transportation point in cheongju city using pagerank algorithm,” in Proceedings of the 2015 International Conference on Big Data Applications and Services - BigDAS ’15, New York, New York, USA, Oct. 2015, pp. 165–169, doi: 10.1145/2837060.2837087.
[10] I. M. Kloumann, J. Ugander, and J. Kleinberg, “Block models and personalized PageRank,” Proc. Natl. Acad. Sci. USA, vol. 114, no. 1, pp. 33–38, Jan. 2017, doi: 10.1073/pnas.1611275114.
[11] B. Zhang, Y. Wang, Q. Jin, and J. Ma, “A Pagerank-Inspired Heuristic Scheme for Influence Maximization in Social Networks,” International Journal of Web Services Research, vol. 12, no. 4, pp. 48–62, Oct. 2015, doi: 10.4018/IJWSR.2015100104.
[12] S. Chaudhari, A. Azaria, and T. Mitchell, “An entity graph based Recommender System,” AIC, vol. 30, no. 2, pp. 141–149, May 2017, doi: 10.3233/AIC-170728.
[13] Contributors to Wikimedia projects, “PageRank,” Wikipedia, Jul. 2021. https://en.wikipedia.org/wiki/PageRank (accessed Mar. 01, 2021).
[14] J. Leskovec, “PageRank Algorithm, Mining Massive Datasets (CS246), Stanford University,” YouTube, 2019.
[15] J. F. Jardine, “PageRanks-Example,” Nov. 2007.
[16] H. Dubey, N. Khare, K. K. Appu Kuttan, and S. Bhatia, “Improved parallel pagerank algorithm for spam filtering,” Indian J. Sci. Technol., vol. 9, no. 38, Oct. 2016, doi: 10.17485/ijst/2016/v9i38/90410.
[17] P. Garg and K. Kothapalli, “STIC-D: Algorithmic techniques for efficient parallel pagerank computation on real-world graphs,” in Proceedings of the 17th International Conference on Distributed Computing and Networking - ICDCN ’16, New York, New York, USA, Jan. 2016, pp. 1–10, doi: 10.1145/2833312.2833322.
[18] D. Frey, Distributed Computing and Networking, 2nd ed. Berlin: Springer Nature, 2013, p. 366.
[19] B. Bahmani, K. Chakrabarti, and D. Xin, “Fast personalized PageRank on MapReduce,” in Proceedings of the 2011 International Conference on Management of Data - SIGMOD ’11, New York, New York, USA, Jun. 2011, p. 973, doi: 10.1145/1989323.1989425.
[20] S. Lai, B. Shao, Y. Xu, and X. Lin, “Parallel computations of local PageRank problem based on Graphics Processing Unit,” Concurrency Computat.: Pract. Exper., vol. 29, no. 24, p. e4245, Aug. 2017, doi: 10.1002/cpe.4245.
[21] S. Hunold, Euro-Par 2015: Parallel Processing Workshops: Euro-Par 2015 International Workshops, Vienna, Austria, August 24-25, 2015, Revised Selected Papers (Lecture Notes in Computer Science Book 9523), 1st ed. Cham: Springer, 2015, p. 882.
[22] K. Lakhotia, R. Kannan, and V. Prasanna, “Accelerating PageRank using Partition-Centric Processing,” in 2018 USENIX Annual Technical Conference (USENIX ATC ’18), Boston, MA, Jul. 2018, pp. 427–440.
[23] X. Wang, L. Huang, Y. Zhu, Y. Zhou, H. Peng, and H. Xiong, “Addressing memory wall problem of graph computation in reconfigurable system,” in 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015