4. Background
About RfA and its process:
Nomination
Notice of RfA
Expressing Opinions
Discussion, decision, and closing procedures
5. Research Question
Question: We wanted to analyze the directed graph of votes
between Wikipedia administrators and regular users in a
Wikipedia voting dataset.
Are the procedures in place fair or not?
8. Degree Distribution
• The long-tail distribution is very evident
• Nodes with 0 to 100 degrees account for about 85% of all the nodes in the dataset
10. Log-Log Plot
• The quantity being measured can be viewed as a type of popularity
• Rich-get-richer phenomenon
11. Average Betweenness and Degree
• Degree centrality and node betweenness appear to be linearly related
• Nodes with a higher degree have higher betweenness scores
12. Average Clustering and Degree
• Local clustering appears to decrease exponentially as degree centrality increases, resembling the power-law phenomenon
• At moderate levels of degree centrality, clustering levels are still high
13. Average Constraint and Degree
• Average constraint (embeddedness) and degree centrality have a negative linear relationship.
• The majority of users have a relatively low level of constraint.
14. Average Neighbor Degree and Degree Plot
• Low-degree users show a wide spread, and their neighbors have a higher average degree.
• As degree increases, a user's neighbors have comparatively lower degree.
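The quantity plotted on this slide can be sketched as follows. This is a minimal illustration that treats votes as undirected ties; the function name and the toy edge list are hypothetical and are not taken from the dataset.

```python
def average_neighbor_degree(edges):
    """Map each node to the mean degree of its neighbors (ties treated as undirected)."""
    nbrs = {}
    for u, v in edges:
        if u != v:  # ignore self-votes
            nbrs.setdefault(u, set()).add(v)
            nbrs.setdefault(v, set()).add(u)
    return {
        node: sum(len(nbrs[m]) for m in ns) / len(ns)
        for node, ns in nbrs.items()
    }

# Hypothetical toy network: three low-degree voters around one hub.
star = [("x", "hub"), ("y", "hub"), ("z", "hub")]
result = average_neighbor_degree(star)
print(result["x"], result["hub"])
```

In the toy star, the low-degree voters have a high-degree neighbor while the hub's neighbors all have low degree, which is the same direction as the pattern described in the bullets.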
15. Application Techniques - Partitioning
Challenge: how do we partition the graph? We have a very dense
network with a lot of edges.
Nodes: 7,066
Edges: 103,663
16. Graph Networks - Partitioning
We raised the minimum degree step by step to see how the
network's structure changed
Degree range: 2 to 1,167.
Nodes: 4,797 (67.42%)
Edges: 101,394 (97.97%)
18. Graph Networks - Partitioning
Degree range: 160 to 1,167.
Nodes: 262 (3.68%)
Edges: 9,959 (9.60%)
19. Graph Networks - Partitioning
Degree range: 260 to 1,167.
Nodes: 92 (1.29%)
Edges: 2,098 (2.02%)
20. Core Component
• Majority of these nodes have very high betweenness scores.
• Majority of these nodes have high eigenvector centrality.
• They belong to the strongly connected component id:1016.
21. Key Takeaways
- The RfA voting network for adding new administrators is
neither weakly nor strongly connected as a whole
- Network structure is directed toward a dense, central
core with a lot of nodes around the periphery
- Rich-get-richer/Preferential attachment model
characteristics are exhibited
- Although every vote counts the same, an Administrator’s
vote has the potential to bring many more votes along
with it
- Graph partitioning allows us to view the core clearly
22. So is it Fair?
- Ultimately, we determined that the Wikipedia RfA process
is fair but highly flawed, with underlying nuances
- Although a new user’s vote and an administrator’s vote
technically carry the same weight, administrators
leverage the power of their personal network
- As a result, current administrators retain control over
the network as a whole and decide who gets to become an
administrator
Describe RfA
Dataset from start of Wikipedia in 2001 to 2008
Data: 7,066 nodes; 103,663 edges
Only includes nominations for other users (not self nominations)
Users fill out an application (questionnaire)
Existing admins vote on the application and provide commentary on why the candidate was or was not accepted
Promotion is selective: the success rate is just 44%
- Candidates display the RfA-notice tag on their user pages. The RfA remains open for seven days, during which contributors ask questions and make comments as they wish
- Bureaucrats review and close the RfA; a nomination with at least 75% support will most likely pass
- Strong edit history, user interaction, high quality articles, trustworthiness
Behavior of the network: is it fair?
Business question: Can this behavior be used to predict behavior in other vote-driven processes? Political elections, board seats, etc.
Basic stats
Graph partitioning - paring down extraneous edges for visualization
Reciprocity: likelihood of vertices in a directed network being mutually linked
0: purely unidirectional
1: purely bidirectional - everything points back
Reciprocity here is low, which indicates that there are not a lot of mutual linkages. This suggests the process may be fair, because you don’t see a lot of people voting for each other. It may also be due to the small number of people up for nomination.
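As a rough illustration of the reciprocity measure, here is a minimal sketch on a hypothetical toy vote list. The function name and the data are assumptions for illustration only; the value for the real dataset is not reproduced here.

```python
def reciprocity(edges):
    """Fraction of directed edges whose reverse edge is also present."""
    # Drop self-loops and duplicate votes before counting.
    edge_set = {(u, v) for u, v in edges if u != v}
    if not edge_set:
        return 0.0
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)

# Hypothetical toy votes as (voter, candidate) pairs.
votes = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "c")]
print(reciprocity(votes))
```

A value of 0 means no vote is ever returned, and a value of 1 means every vote is returned.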
Average Path: defined as the average number of steps along the shortest paths for all possible pairs of network nodes. It is a measure of the efficiency of information or mass transport on a network.
This makes sense given the dense core of the network. However, note the large number of fringe nodes.
Diameter: find the shortest path between each pair of vertices. The greatest length of any of these paths is the diameter of the graph.
We expect to see a much higher number than the average path length, since fringe nodes are highly isolated from other fringe nodes due to the structure of the overall network. For the two nodes most distant from one another, the shortest path length is 10 steps.
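Both measures can be sketched with a breadth-first search from every node. This is a minimal illustration that assumes only reachable ordered pairs are counted, which is one common convention for directed graphs that are not strongly connected; the function names and the toy chain are hypothetical.

```python
from collections import deque

def shortest_path_lengths(adj, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def path_stats(edges):
    """Return (average shortest path length, diameter) over reachable ordered pairs."""
    adj, nodes = {}, set()
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        nodes.update((u, v))
    lengths = [
        d
        for s in nodes
        for t, d in shortest_path_lengths(adj, s).items()
        if t != s
    ]
    if not lengths:
        return 0.0, 0
    return sum(lengths) / len(lengths), max(lengths)

# Hypothetical toy chain of votes: a -> b -> c -> d
avg, diameter = path_stats([("a", "b"), ("b", "c"), ("c", "d")])
print(avg, diameter)
```

Even in this tiny chain the diameter is larger than the average path length, because the diameter is driven entirely by the two most distant nodes.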
Weakly Connected - A directed graph is called weakly connected if replacing all of its directed edges with undirected edges produces a connected (undirected) graph.
Strongly connected - A directed graph is strongly connected if there is a directed path from every node to every other node.
Given the number of data points in our network, it is difficult to isolate clusters. However, it makes sense that the graph is not entirely connected given the way the voting process works. There are different communities based on language, country of origin, etc. There is also a drastic difference between the center and the periphery of the network. The network is, however, composed of strongly connected components. This too makes sense, given that there are smaller communities of editors/readers/administrators who would all be connected through the “networking” aspect of the nomination process.
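The two notions of connectivity can be checked with a short sketch. The code below assumes a plain node list and edge list, and uses Kosaraju's two-pass algorithm for the strongly connected components; this is one standard choice and not necessarily the tool used for the analysis. All names and the toy data are hypothetical.

```python
def weakly_connected(nodes, edges):
    """True if the graph is connected once edge directions are ignored."""
    undirected = {n: set() for n in nodes}
    for u, v in edges:
        undirected[u].add(v)
        undirected[v].add(u)
    if not undirected:
        return True
    start = next(iter(undirected))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nbr in undirected[node]:
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return len(seen) == len(undirected)

def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm with iterative depth-first searches."""
    fwd = {n: [] for n in nodes}
    rev = {n: [] for n in nodes}
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
    # Pass 1: record nodes in order of DFS completion on the forward graph.
    seen, order = set(), []
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            node, successors = stack[-1]
            for nbr in successors:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append((nbr, iter(fwd[nbr])))
                    break
            else:
                order.append(node)
                stack.pop()
    # Pass 2: explore the reversed graph in decreasing completion order.
    assigned, components = set(), []
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = set(), [root]
        while stack:
            node = stack.pop()
            component.add(node)
            for nbr in rev[node]:
                if nbr not in assigned:
                    assigned.add(nbr)
                    stack.append(nbr)
        components.append(component)
    return components

# Hypothetical toy network: a voting cycle a -> b -> c -> a, plus d voting for a.
toy_nodes = ["a", "b", "c", "d"]
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
print(weakly_connected(toy_nodes, toy_edges))
print(strongly_connected_components(toy_nodes, toy_edges))
```

The toy graph is weakly connected but not strongly connected: the cycle forms one strongly connected component and the fringe voter forms its own.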
Global clustering coefficient - the number of closed triplets (3 x the number of triangles) divided by the total number of triplets (both open and closed); it ranges from 0 to 1
This is important to us because it measures the degree of triadic closure that we see in the network, which is relatively low overall
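The definition above can be sketched directly by counting triplets around each node. This minimal version assumes votes are treated as undirected ties; the function name and the toy data are hypothetical.

```python
from itertools import combinations

def global_clustering(edges):
    """Closed triplets divided by all connected triplets (ties treated as undirected)."""
    nbrs = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-votes
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    closed = total = 0
    for center, ns in nbrs.items():
        # Every pair of neighbors forms a triplet centered on this node.
        for a, b in combinations(ns, 2):
            total += 1
            if b in nbrs[a]:
                closed += 1  # each triangle is counted once per corner
    return closed / total if total else 0.0

# Hypothetical toy network: a triangle a-b-c with one fringe node d attached to c.
print(global_clustering([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]))
```

Adding fringe nodes to a closed triangle lowers the coefficient, because each fringe tie creates open triplets without creating new triangles.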
Long tail distribution - degree.
Highly skewed - Uneven distribution.
Very few high degree nodes,
Nodes with 0 to 100 degrees - account for over 85% of the nodes in the entire data set
In this distribution, the degrees are the number of votes cast. This tells us that there are a large number of individuals who are voting a small number of times. This makes sense given the network structure.
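A share like the 85% figure can be computed as sketched below. The sketch assumes degree means votes cast plus votes received, counted from an edge list; nodes that never appear in any vote would need to be added separately. The function names and the toy data are hypothetical.

```python
from collections import Counter

def degree_counts(edges):
    """Total degree per node: votes cast plus votes received."""
    degree = Counter()
    for voter, candidate in edges:
        degree[voter] += 1
        degree[candidate] += 1
    return degree

def share_at_or_below(degree, threshold):
    """Fraction of nodes whose degree is at most the threshold."""
    return sum(1 for d in degree.values() if d <= threshold) / len(degree)

# Hypothetical toy votes: five voters support candidate h, and v1 also votes for v2.
toy = [("v1", "h"), ("v2", "h"), ("v3", "h"), ("v4", "h"), ("v5", "h"), ("v1", "v2")]
degree = degree_counts(toy)
print(share_at_or_below(degree, 2))
```

On the real edge list, calling `share_at_or_below(degree, 100)` would give the share of nodes in the left-hand part of the distribution.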
Admin votes count as much as a user voting for another user.
Admin - power, Wikipedia editor for longer
Right - few high degrees, the core component
Left - most of the nodes are in the area less than 100.
Note: there are sparsely distributed nodes
High preferential attachment, but not totally pure. There is a high concentration of near-zero degrees, but it is not exactly a textbook form of preferential attachment
Demonstrates that seeking the approval of high-degree users is a formula for success. Wikipedia tells you who these people are.
In the previous slide, we couldn’t see any degree distribution above 250. However, here we clearly see the long tail stretch out to in excess of 800. This is the component we find the most interesting. These degrees represent votes cast or votes received. How do we reach in excess of 800 votes? Perhaps administrators’ networks cast votes for nominees based on who the administrator is voting for. Their credibility extends to the nominee.
Preferential attachment: when pure preferential attachment exists, we see an extremely high concentration of observations clustered around 0. In this case we do not see this, but instead a hybrid attachment model.
The network demonstrates the rich-get-richer phenomenon of the power law (preferential attachment). This could be due to the influence of the administrator’s followers.
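One quick way to read a log-log plot is to fit a straight line to log(count) against log(degree). The sketch below does this with ordinary least squares. It is an illustration only: a roughly straight line is suggestive of a power law but does not prove one, and the function name and the synthetic degrees are hypothetical.

```python
import math
from collections import Counter

def loglog_slope(degrees):
    """Least-squares slope of log(node count) against log(degree)."""
    hist = Counter(d for d in degrees if d > 0)  # log is undefined at degree 0
    if len(hist) < 2:
        raise ValueError("need at least two distinct positive degrees")
    xs = [math.log(k) for k in hist]
    ys = [math.log(c) for c in hist.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic degrees built so that the number of nodes with degree k is 64 / k**2.
synthetic = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
print(loglog_slope(synthetic))
```

For the synthetic data the slope is -2 by construction. On real data the fit is noisy in the tail, where only a handful of very high-degree nodes exist.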
Another important characteristic of scale-free networks is the clustering coefficient distribution, which decreases as the node degree increases.
Nodes at the center of the network lie on the shortest paths between a greater percentage of the entire network than any other nodes within the network, contributing to their high degree centrality.
The relationship between degree centrality and average node betweenness is close to linear: nodes with a higher degree (more connections) are expected to have higher betweenness scores.
The degree centrality and local clustering plot: local clustering measures how connected a node's neighbors are to one another,
as a fraction of the possible interconnections among the neighbors of V.
The low degree-high clustering coefficient observations on the left-hand side of the graph represent nodes at the edge of the network. As degree increases, clustering coefficient decreases, similar to what we saw in the scale free network/power law plot.
The plot looks roughly like a negative exponential. Even at relatively high degree centrality, average clustering still holds up.
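The quantity on the vertical axis of this plot can be sketched as the mean local clustering coefficient of all nodes sharing a given degree. The sketch assumes undirected ties; the function names and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def local_clustering(nbrs, node):
    """Fraction of possible links among a node's neighbors that actually exist."""
    ns = nbrs[node]
    k = len(ns)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(ns, 2) if b in nbrs[a])
    return links / (k * (k - 1) / 2)

def clustering_by_degree(edges):
    """Map each degree value to the mean local clustering of nodes with that degree."""
    nbrs = defaultdict(set)
    for u, v in edges:
        if u != v:
            nbrs[u].add(v)
            nbrs[v].add(u)
    nbrs = dict(nbrs)
    buckets = defaultdict(list)
    for node, ns in nbrs.items():
        buckets[len(ns)].append(local_clustering(nbrs, node))
    return {k: sum(vals) / len(vals) for k, vals in sorted(buckets.items())}

# Hypothetical toy network: a triangle a-b-c with one fringe node d attached to c.
print(clustering_by_degree([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]))
```

In the toy network the degree-3 node already has lower clustering than the degree-2 nodes, since its extra neighbor is not tied to the others.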
This is different from what we saw in the SAP networks.
The trend is very linear: as average constraint (embeddedness) increases, degree centrality decreases. Users with lower degree centrality are highly constrained in terms of access to information. They are reliant on high-betweenness nodes.
Summary measure that taps the extent to which ego's connections are to others who are connected to one another. If ego's potential trading partners all have one another as potential trading partners, ego is highly constrained. If ego's partners do not have other alternatives in the neighborhood, they cannot constrain ego's behavior.
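The constraint measure described above can be sketched in a simplified form. The version below assumes unweighted, undirected ties, so that ego invests an equal share of its ties in each neighbor; the measure used for the slides may have been computed on weighted or directed ties. The function name and the toy networks are hypothetical.

```python
def burt_constraint(nbrs, ego):
    """Burt's constraint for ego on an unweighted, undirected network.

    nbrs maps each node to the set of its neighbors.
    """
    def share(i, j):
        # Proportion of i's ties that are invested in j.
        return 1.0 / len(nbrs[i]) if j in nbrs[i] else 0.0

    total = 0.0
    for j in nbrs[ego]:
        # Indirect investment in j through ego's other contacts q.
        indirect = sum(share(ego, q) * share(q, j) for q in nbrs[ego] if q != j)
        total += (share(ego, j) + indirect) ** 2
    return total

# Hypothetical toy networks.
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(burt_constraint(star, "hub"), burt_constraint(star, "x"))
print(burt_constraint(triangle, "a"))
```

The hub of the star has low constraint because its contacts are not connected to one another, while the fringe nodes and the members of the closed triangle are highly constrained.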
First we had to remove loops (users who voted for themselves), as these introduced noise into the data.
Patterns - weakly connected fringe cases, dense strongly connected interior.
Peripheral nodes showed the pattern of 1 node accounting for 1 vote a majority of the time whereas nodes in the strongly connected component frequently voted for one another hundreds of times per case.
How did we begin to make sense of the overall structure of the network?
We filtered out the nodes with a degree of just one and saw less of a halo pattern between the first and second iteration. All of the nodes pictured in the above visualization have a degree between 2 and 1,167.
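The filtering step can be sketched as a single pass: drop self-loops, compute each node's degree on the cleaned graph, then keep only nodes at or above a minimum degree along with the edges between them. This is an assumption about the procedure; a visualization tool may recompute degrees after each removal. The function name and the toy data are hypothetical.

```python
from collections import Counter

def filter_by_min_degree(edges, min_degree):
    """Return (kept nodes, kept edges) after removing self-loops and low-degree nodes."""
    clean = [(u, v) for u, v in edges if u != v]  # remove self-votes
    degree = Counter()
    for u, v in clean:
        degree[u] += 1
        degree[v] += 1
    keep = {node for node, d in degree.items() if d >= min_degree}
    kept_edges = [(u, v) for u, v in clean if u in keep and v in keep]
    return keep, kept_edges

# Hypothetical toy votes, including one self-vote and one degree-1 fringe voter.
toy = [("a", "a"), ("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
nodes, kept = filter_by_min_degree(toy, min_degree=2)
print(sorted(nodes), kept)
```

Raising `min_degree` on the real edge list (for example to 2, 160, and 260) would produce successively smaller subgraphs like the ones shown on the partitioning slides.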
The most strongly connected core of the network comprises less than 2% of the entire network. These are presumably the members who have the strongest possibility of becoming an administrator.
Note that strong triadic closure is demonstrated through a small number of people voting a great deal
High betweenness - Vertices with high betweenness may have considerable influence within a network by virtue of their control over information passing between others.
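Betweenness can be sketched with Brandes' algorithm for unweighted directed graphs. The version below returns raw, unnormalized scores and is meant as an illustration rather than the exact routine used for the slides. The function name and the toy chain are hypothetical.

```python
from collections import deque

def betweenness(adj):
    """Unnormalized betweenness centrality via Brandes' algorithm.

    adj maps each node to an iterable of the nodes it points to.
    """
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    score = dict.fromkeys(nodes, 0.0)
    for source in nodes:
        # Breadth-first search, counting shortest paths (sigma) from the source.
        sigma = dict.fromkeys(nodes, 0)
        sigma[source] = 1
        dist = {source: 0}
        preds = {n: [] for n in nodes}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    return score

# Hypothetical toy chain of votes: a -> b -> c -> d
print(betweenness({"a": ["b"], "b": ["c"], "c": ["d"]}))
```

In the toy chain only the two middle nodes lie on shortest paths between others, so they are the only ones with a nonzero score.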
Eigenvector centrality: The assumption is that each node's centrality is proportional to the sum of the centrality values of the nodes that it is connected to. Eigenvector centrality is another means of gauging influence. A node with a smaller number of high-quality connections can score higher on eigenvector centrality than a node with a larger number of low-quality connections.
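The idea can be sketched with power iteration. The version below assumes undirected ties and adds each node's own score at every step, a common trick that keeps the iteration from oscillating on star-like graphs without changing the resulting ranking. The function name, parameters, and toy data are hypothetical.

```python
import math

def eigenvector_centrality(nbrs, max_iter=200, tol=1e-12):
    """Power iteration for eigenvector centrality on an undirected network.

    nbrs maps each node to the set of its neighbors.
    """
    scores = {n: 1.0 for n in nbrs}
    for _ in range(max_iter):
        # Each node's new score is its own score plus its neighbors' scores.
        updated = {n: scores[n] + sum(scores[m] for m in nbrs[n]) for n in nbrs}
        norm = math.sqrt(sum(v * v for v in updated.values()))
        updated = {n: v / norm for n, v in updated.items()}
        if sum(abs(updated[n] - scores[n]) for n in nbrs) < tol:
            return updated
        scores = updated
    return scores

# Hypothetical toy network: one hub with three fringe neighbors.
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
centrality = eigenvector_centrality(star)
print(centrality["hub"] > centrality["x"])
```

Because a node inherits score from its neighbors, being tied to a well-connected node raises a node's own centrality more than being tied to a fringe node.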
OLD
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
LSCC: nodes in the largest strongly connected component.
Adjust it to the giant component stats.
Maximum Degree Node: 1167
Size of LSCC: 1,300 (the size of the maximal subset of nodes such that there is a directed path from each node to each other node).
Strongly connected component id:1016. Similar to how we interpret preferential attachment: the probability that a new node links to an earlier administrator is proportional to that administrator's current number of edges.
80% of all the nodes have a betweenness centrality of 0.
Essentially are two parts to the graph, the outer editor/voters, and presumably the power players/existing admins, or other popular users up for adminship
Information we would have liked to have: who are the admins, and who are their network cohorts? Had we had access to this information, we would have been able to home in on the impact of each administrator's entire network.