This is my report on Rank adjustment strategies for Dynamic PageRank (v1), written while doing research work under Prof. Dip Sankar Banerjee and Prof. Kishore Kothapalli.
Rank adjustment strategies for Dynamic PageRank

Subhajit Sahu¹, Kishore Kothapalli¹, Dip Sankar Banerjee²
¹International Institute of Information Technology, Hyderabad
²Indian Institute of Technology, Jodhpur
Abstract — To avoid calculating ranks of vertices in a dynamic graph from scratch
for every snapshot, the ones computed in the previous snapshot of the graph can be
used, with adjustment. Four different rank adjustment strategies for dynamic
PageRank are studied here. These include zero-fill, 1/N-fill, scaled zero-fill, and
scaled 1/N-fill. Results indicate that the scaled 1/N-fill strategy requires the least
number of iterations, on average. As long as the graph has no affected dead ends
(including dead ends in the previous snapshot), unaffected vertices can be skipped
with this adjustment strategy.
Index terms — PageRank algorithm, Dynamic graph, Rank adjustment, Initial ranks.
1. Introduction
Most graphs are dynamic, with new edges and vertices being added, or
removed, all the time. Time-evolving ranks of vertices in such a graph can be
obtained by performing a fresh (static) PageRank computation at every time
step. Another approach is to feed the rank of each vertex in the previous time
step as initial ranks to the PageRank algorithm [1]. This approach is called
incremental/dynamic PageRank.
In order to speed up convergence of the algorithm, these initial ranks can be
adjusted before performing PageRank computation. As long as updates to
the graph are small with respect to the size of the graph, dynamic PageRank
mostly wins over static PageRank.
2. Method
There are a number of strategies to set up the initial rank vector for the
PageRank algorithm on the updated graph, using the ranks from the old
graph. If a graph update does not end up adding any new vertices, then it is
simply a matter of running PageRank upon the updated graph with the
previous ranks, as the initial ranks. If however, some new vertices have been
added, and/or some old vertices have been removed, one of the following
strategies may be used to adjust the initial ranks: zero-fill, 1/N-fill, scaled
zero-fill, or scaled 1/N-fill. The zero-fill strategy is the simplest, and consists
of simply filling the ranks of new vertices with 0 (i.e., r_new = 0 for new
vertices). The 1/N-fill strategy is similar: ranks of new vertices are filled with
1/N_new (i.e., r_new = 1/N_new for new vertices). The scaled zero-fill strategy
extends zero-fill, and additionally scales the ranks of old vertices by a factor of
N_old/N_new (i.e., r_new = r_old × N_old/N_new for old vertices, and r_new = 0 for new
vertices). Finally, the scaled 1/N-fill strategy is a combination of scaling old
vertices, and 1/N-fill (i.e., r_new = r_old × N_old/N_new for old vertices, and r_new =
1/N_new for new vertices). Here, N_old is the total number of vertices, and r_old is
the rank of a given vertex, in the old graph; N_new is the total number of
vertices, and r_new is the rank of a given vertex, in the updated graph. A
simplified view of these strategies is shown in figure 1.1.
Figure 1.1: Zero-fill, 1/N-fill, scaled zero-fill, and scaled 1/N-fill strategies.
For static PageRank computation of a graph, the initial ranks of all vertices
are set to 1/N, where N is the total number of vertices in the graph.
Therefore, the sum of all initial ranks equals 1, as it should for a probability
(rank) vector. It is important to note however that this is not an essential
condition, rather, it likely helps with faster convergence. This can be shown
with the following example. Consider a graph with only one vertex A that
links to itself (a self loop). The rank of vertex A after the nth iteration would
be r_A = β + βα + βα² + … + βα^(n−1) + α^n × r_A0. Here, α is the damping factor
(usually 0.85), β = 1 − α, and r_A0 is the initial rank of vertex A. For a large
value of n, α^n × r_A0 tends to zero, and the rank of vertex A can be
approximated with the geometric progression (GP) formula as r_A = β/(1 − α) =
1. Thus, the rank of vertex A is always 1, independent of its initial rank. This
extends to other graphs, where the sum of ranks of vertices always equals 1,
regardless of the initial ranks (with sufficient iterations). This property is
essential for the correctness of the 1/N-fill and scaled zero-fill strategies,
whose initial ranks do not sum to 1.
With scaled rank adjustment strategies, unlike unscaled ones, PageRank
computation on unaffected vertices can be skipped, as long as the graph has
no affected dead ends (including dead ends in the old graph). Scaling is
necessary, because even though the importance of unaffected vertices does
not change, the final rank vector is a probability vector and must sum to 1.
Here, affected vertices are those which are either changed vertices, or are
reachable from changed vertices. Changed vertices are those which have had an
edge added or removed between them and another (changed) vertex.
The necessity for the absence of affected new/old dead ends (vertices
without out-links), as mentioned above, can be seen with the following
example. Consider a graph with 2 vertices, A and B, without any edges.
Here, the ranks of A and B would be rA = 0.5 and rB = 0.5. For the updated
graph, a self-loop is added to vertex B. Thus, the affected vertex B is no
longer a dead end. If the algorithm ignores the fact that vertex B was a dead
end in the old graph, it would simply scale the rank of vertex A (which has
no effect since no new vertices were added), and end up with its rank as rA =
0.5, implying that the rank of vertex B would be r_B = 0.5. This however is
incorrect, as the newly added self loop on vertex B would actually increase its
rank (and decrease vertex A’s rank, due to a reduced common teleport
contribution c0). In fact, the true ranks of the vertices would be r_A = β/(1 + β)
≈ 0.13, and r_B ≈ 0.87 (when damping factor α = 0.85 and β = 1 − α). Thus, the
dead ends for both the updated and the old graph must be considered for
the affected check. Note however that this constraint only applies to the
PageRank algorithm with teleport-based dead end handling strategy.
3. Experimental setup
An experiment is conducted with each rank adjustment strategy on various
temporal graphs, updating each graph with multiple batch sizes (10³, 10⁴, …),
until the entire graph is processed. For each batch size, static PageRank is
computed, along with incremental PageRank based on each of the four rank
adjustment strategies, without skipping unaffected vertices. This is done in
order to get an estimate of the convergence rate of each rank adjustment
strategy, independent of the number of skipped vertices (which can differ
based on the dead end handling strategy used).
Each rank adjustment strategy is performed using a common adjustment
function that adds a value to the old ranks, multiplies them by a value, and
sets a value for the new ranks. After the ranks are adjusted, they are set as initial ranks for
PageRank computation, which is then run on all vertices (no vertices are
skipped). The PageRank algorithm used is the standard power-iteration (pull)
based algorithm, which optionally accepts initial ranks [2]. The rank of a vertex in
an iteration is calculated as c0 + αΣ(r_n/d_n), where c0 is the common teleport
contribution, α is the damping factor (0.85), r_n is the previous rank of a vertex
with an incoming edge, d_n is the out-degree of that incoming-edge vertex,
and N is the total number of vertices in the graph. The common teleport
contribution c0, calculated as (1−α)/N + αΣ(r_d/N), includes the contribution due
to a teleport from any vertex in the graph because of the damping factor, (1−α)/N,
and the contribution of teleports from dangling vertices (those with no outgoing
edges), αΣ(r_d/N). This is because a random surfer jumps to a random page upon
visiting a page with no links, in order to avoid the rank-sink effect [1].
All seven temporal graphs used in this experiment are stored in plain text
files in “u, v, t” format, where u is the source vertex, v is the destination vertex,
and t is the UNIX epoch time in seconds. These include: CollegeMsg,
email-Eu-core-temporal, sx-mathoverflow, sx-askubuntu, sx-superuser,
wiki-talk-temporal, and sx-stackoverflow. All of them are obtained from the
Stanford Large Network Dataset Collection [3]. If initial ranks are not
provided, they are set to 1/N. Error is measured as the L1 norm of the
difference from static PageRank ranks (computed without initial ranks).
The experiment is implemented in C++, and
compiled using GCC 9 with optimization level 3 (-O3). The system used is a
Dell PowerEdge R740 Rack server with two Intel Xeon Silver 4116 CPUs @
2.10GHz, 128GB DIMM DDR4 Synchronous Registered (Buffered) 2666
MHz (8x16GB) DRAM, and running CentOS Linux release 7.9.2009 (Core).
The number of iterations taken by each test case is measured, with a
maximum of 500 iterations allowed. Statistics of each test case are printed to standard output
(stdout), and redirected to a log file, which is then processed with a script to
generate a CSV file, with each row representing the details of a single test
case. This CSV file is imported into Google Sheets, and necessary tables are
set up with the help of the FILTER function to create the charts.
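The L1-norm error check mentioned above reduces to a sum of absolute differences between the two rank vectors (a minimal sketch; the helper name is hypothetical):

```cpp
#include <vector>
#include <cmath>

// L1 norm of the difference between two rank vectors of equal length.
double l1Error(const std::vector<double>& x, const std::vector<double>& y) {
  double e = 0.0;
  for (size_t i = 0; i < x.size(); ++i) e += std::fabs(x[i] - y[i]);
  return e;
}
```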
4. Results
It is observed that 1/N-fill and scaled zero-fill strategies tend to require more
iterations for convergence for all graphs, with the 1/N-fill strategy usually
performing the worst. For small temporal graphs, such as CollegeMsg and
email-Eu-core-temporal, the two strategies are almost always slower than
static PageRank, as shown in figure 4.1. This is possibly because the initial
ranks with both strategies do not sum to 1. For larger graphs this is
usually not the case for smaller batch sizes, as shown in figure 4.2. However,
for large batch sizes, static PageRank is able to beat all of the rank
adjustment strategies, as shown in figure 4.3. This is expected, since beyond
a certain batch size, computing PageRank from scratch is going to be faster
than dynamic PageRank [1].
Figure 4.1: Iterations taken for static PageRank, along with incremental PageRank
computation with each of the rank adjustment strategies: zero-fill, 1/N-fill, scaled
zero-fill, and scaled 1/N-fill. This is done on the email-Eu-core-temporal graph, with
a batch size of 10⁴.
Figure 4.2: Iterations taken for static PageRank, along with incremental PageRank
computation with each of the rank adjustment strategies: zero-fill, 1/N-fill, scaled
zero-fill, and scaled 1/N-fill. This is done on the wiki-talk-temporal graph, with a
batch size of 10⁴.
Figure 4.3: Geometric mean of iterations taken on the wiki-talk-temporal graph, for
static PageRank, along with incremental PageRank computation with each of the
rank adjustment strategies: zero-fill, 1/N-fill, scaled zero-fill, and scaled 1/N-fill. This
is done with batch sizes ranging from 10³ to 10⁶. With each batch size, edges are
added to the graph in steps, until the entire graph is processed. The figure for
arithmetic mean of iterations is similar.
On average, on all graphs, the scaled 1/N-fill strategy seems to perform the
best, as shown in figures 4.4 and 4.5. Based on GM-RATIO comparison [4],
the relative iterations between zero-fill, 1/N-fill, scaled zero-fill, and scaled
1/N-fill is 1.00 : 1.07 : 1.10 : 0.93 for all batch sizes. Hence, 1/N-fill is 3%
faster (1.03x) than scaled zero-fill, zero-fill is 7% faster (1.07x) than 1/N-fill,
and scaled 1/N-fill is 7% faster (1.08x) than zero-fill. The comparison of
relative iterations for specific batch sizes is shown in figure 4.5. Here,
GM-RATIO is obtained by taking the geometric mean (GM) of iterations taken
at different stages of the graph, on each graph, with each batch size. Then,
GM is taken for each batch size, across all graphs. Finally, GM is taken for all
batch sizes, and a ratio is obtained relative to the zero-fill strategy. Based on
AM-RATIO comparison [4], the relative iterations between zero-fill, 1/N-fill,
scaled zero-fill, and scaled 1/N-fill is 1.00 : 1.03 : 1.07 : 0.92 (all batch
sizes). Hence, 1/N-fill is 4% faster (1.04x) than scaled zero-fill, zero-fill is 3%
faster (1.03x) than 1/N-fill, and scaled 1/N-fill is 8% faster (1.09x) than
zero-fill. AM-RATIO is obtained in a process similar to that of GM-RATIO,
except that arithmetic mean (AM) is used instead of GM.
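The GM-RATIO aggregation described above reduces to nested geometric means followed by normalization against the zero-fill strategy; a minimal sketch (helper names hypothetical):

```cpp
#include <vector>
#include <cmath>

// Geometric mean of positive values, computed in log space for stability.
double geometricMean(const std::vector<double>& xs) {
  double s = 0.0;
  for (double x : xs) s += std::log(x);
  return std::exp(s / xs.size());
}

// Relative iterations of each strategy with respect to the first (zero-fill).
std::vector<double> relativeTo(const std::vector<double>& gms) {
  std::vector<double> r;
  for (double g : gms) r.push_back(g / gms[0]);
  return r;
}
```

AM-RATIO follows the same structure, with the arithmetic mean in place of geometricMean.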
Figure 4.4: Geometric mean of iterations taken on all the seven temporal graphs, for
static PageRank, along with incremental PageRank computation with each of the
rank adjustment strategies: zero-fill, 1/N-fill, scaled zero-fill, and scaled 1/N-fill. This
is done with batch sizes ranging from 10³ to 10⁷. Since batch size is limited by the
total number of temporal edges of a graph, for large batch sizes only large graphs
are considered. The figure for arithmetic mean of iterations is similar.
Figure 4.5: Relative GM iterations taken on all the seven temporal graphs, for static
PageRank, along with incremental PageRank computation with each of the rank
adjustment strategies: zero-fill, 1/N-fill, scaled zero-fill, and scaled 1/N-fill. This is
done with batch sizes ranging from 10³ to 10⁷. The figure for relative AM iterations is
similar.
5. Conclusion
Among the four studied rank adjustment strategies for dynamic PageRank,
scaled 1/N-fill appears to be the best. Also note that with the scaled 1/N-fill
strategy (also scaled zero-fill, but it is slower), it is possible to skip
PageRank computation on unaffected vertices, as long as the graph has no
affected dead ends (including dead ends in the old graph). The scaled
1/N-fill rank adjustment strategy, which is commonly used for dynamic
PageRank [5], is thus the way to go. A link to the source code, along with data
sheets and charts, for these rank adjustment strategies on dynamic (temporal)
graphs is included in the references [6].
References
[1] A. Langville and C. Meyer, “Deeper Inside PageRank,” Internet Math.,
vol. 1, no. 3, pp. 335–380, Jan. 2004, doi: 10.1080/15427951.2004.10129091.
[2] J. J. Whang, A. Lenharth, I. S. Dhillon, and K. Pingali, “Scalable
Data-Driven PageRank: Algorithms, System Issues, and Lessons
Learned,” in Euro-Par 2015: Parallel Processing, vol. 9233, J. L. Träff, S.
Hunold, and F. Versaci, Eds. Berlin, Heidelberg: Springer Berlin
Heidelberg, 2015, pp. 438–450.
[3] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford Large Network
Dataset Collection,” Jun. 2014.
[4] S. Sahu, K. Kothapalli, and D. S. Banerjee, “Adjusting PageRank
parameters and Comparing results,” 2021.
[5] P. Desikan, N. Pathak, J. Srivastava, and V. Kumar, “Incremental page
rank computation on evolving graphs,” in Special interest tracks and
posters of the 14th international conference on World Wide Web -
WWW ’05, New York, New York, USA, May 2005, p. 1094, doi:
10.1145/1062745.1062885.
[6] S. Sahu, “puzzlef/pagerank-dynamic-adjust-ranks: Comparing strategies
to update ranks for dynamic PageRank (pull, CSR).”
https://github.com/puzzlef/pagerank-dynamic-adjust-ranks (accessed
Aug. 31, 2021).