This document describes how to import and analyze GitHub event data stored in JSON files using the Neo4j graph database. It provides examples of Cypher queries that can be used to analyze relationships between users, repositories, pull requests, forks and other GitHub elements from the events data. These include queries to find the most active user, most forked repository, number of comments on pull requests before being merged, and users who have worked on the same repositories through merged pull requests.
4. #github-data-archive
• GitHub events
• Archive available at http://www.githubarchive.org/
• One JSON events file per hour
• Approx. 10k events per hour
• Note: a file as a whole is not valid JSON, but each row in it is valid JSON
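Because each row is its own JSON document, an hourly file has to be parsed row by row rather than with a single JSON load. A minimal Python sketch, assuming an already decompressed text file; the function name `iter_events` and the two sample rows are hypothetical:

```python
import io
import json


def iter_events(lines):
    """Yield one parsed event per non-empty line of a line-delimited JSON stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield json.loads(line)


# Hypothetical two-row sample standing in for an hourly archive file.
sample = io.StringIO(
    '{"type": "ForkEvent", "actor": "alice"}\n'
    '{"type": "PullRequestEvent", "actor": "bob"}\n'
)
events = list(iter_events(sample))
print([e["type"] for e in events])
```

Calling `json.loads` on the whole sample at once would fail, since two top-level documents in one string are not valid JSON.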
6. #gh4j
GitHub events importer for Neo4j
Parse the file, build a customized Cypher statement for each event, then load the statements into Neo4j
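The parse and build steps could look roughly like the Python sketch below. This is not gh4j's actual code: the function `build_statement`, the event field names (`type`, `actor`, `created_at`) and the sample row are assumptions for illustration. The `$name` parameter syntax is that of recent Neo4j versions (older versions use `{name}`), and the load step is left as a comment because it depends on the Neo4j client in use.

```python
import json


def build_statement(event):
    """Return (cypher, params) for one event dict.

    Assumes hypothetical fields: 'type', 'actor', 'created_at' (epoch seconds).
    """
    label = event["type"]
    # Labels cannot be passed as query parameters, so only accept simple names.
    if not label.isalnum():
        raise ValueError("unexpected event type: %r" % label)
    cypher = (
        "MERGE (u:User {name: $name}) "
        "CREATE (ev:%s {time: $time}) "
        "MERGE (u)-[:DO]->(ev)" % label
    )
    return cypher, {"name": event["actor"], "time": int(event["created_at"])}


# Hypothetical single-row input; a real file holds one such row per event.
rows = ['{"type": "IssueCommentEvent", "actor": "johanneswilm", "created_at": 1401606384}']
statements = [build_statement(json.loads(row)) for row in rows]
# Load step: each (cypher, params) pair would be sent to Neo4j here.
print(statements[0][0])
```

Keeping the values in parameters rather than in the statement text avoids quoting problems with user and repository names.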
7. #PullRequestEvent
Payload and information from the past
• You get information about the PR
• You can also build information about the repo, e.g. who owns it
• On which branch
• Depending on the PR action (open/close/merge), you can determine, for a close or merge, who first opened the PR and from which fork it comes
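As an illustration of branching on the PR action, here is a small Python sketch. The payload shape (`action`, `merged`) is an assumption and the archive's real field names may differ; `PR_OPEN` and `PR_MERGE` are the relationship types used by the queries in this deck, while `PR_CLOSE` and the function name `pr_relationship` are hypothetical.

```python
def pr_relationship(payload):
    """Map a PullRequestEvent payload to a relationship type.

    Assumed (hypothetical) payload shape:
    {'action': 'opened' | 'closed', 'merged': bool}.
    """
    action = payload.get("action")
    if action == "opened":
        return "PR_OPEN"
    if action == "closed":
        # A closed PR counts as merged only when the merged flag is set.
        return "PR_MERGE" if payload.get("merged") else "PR_CLOSE"
    return None  # other actions are ignored in this sketch


print(pr_relationship({"action": "closed", "merged": True}))
```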
10. #IssueCommentEvent
You can check whether the issue is related to a PR and build the complete PR schema
MERGE (u:User {name:'johanneswilm'})
CREATE (ev:IssueCommentEvent {time:toInt(1401606384) })
MERGE (u)-[:DO]->(ev)
MERGE (comment:IssueComment {id:toInt(44769338)})
MERGE (ev)-[:ISSUE_COMMENT]->(comment)
MERGE (issue:Issue {id:toInt(34722578)})
MERGE (repo:Repository {id:toInt(14487686)})
MERGE (comment)-[:COMMENT_ON]->(issue)
MERGE (issue)-[:ISSUE_ON]->(repo)
SET repo.name = 'diffDOM'
MERGE (owner:User {name:'fiduswriter'})
MERGE (repo)-[:OWNED_BY]->(owner)
12. Which user generated the most events?
MATCH (u:User)-[r:DO]->()
RETURN u.name, count(r) as events
ORDER BY events DESC
LIMIT 1
13. Which repo has been touched the most?
MATCH (repo:Repository)<-[r]-()
RETURN repo.name, count(r) as touches
ORDER BY touches DESC
LIMIT 1
14. Which repo has been forked the most?
MATCH (repo:Repository)<-[:FORK_OF]-(fork:Fork)<-[:FORK]-
(event:ForkEvent)
RETURN repo.name, count(event) as forks
ORDER BY forks DESC
LIMIT 1
15. Which repo has the most merged PRs?
MATCH (repo:Repository)<-[:PR_ON_REPO]-
(pr:PullRequest)<-[merge:PR_MERGE]-()
RETURN repo.name, count(merge) as merges
ORDER BY merges DESC
LIMIT 1
16. How many forks result in an opened PR?
MATCH p=(u:User)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)
-[:FORK_OF]->(repo:Repository)<-[:PR_ON_REPO]-(pr:PullRequest)
-[:PR_OPEN]-(pre:PullRequestEvent)<-[:DO]-(u2:User)<-[:OWNED_BY]-
(f2:Fork)<-[:BRANCH_OF]-(br:Branch)<-[:FROM_BRANCH]-(pr2:PullRequest)
WHERE u = u2 AND fork = f2 AND pr = pr2
RETURN count(p)
18. Average number of comments on a PR before it is merged?
MATCH p=(ice:IssueCommentEvent)-[:ISSUE_COMMENT]->(comment:IssueComment)
-[:COMMENT_ON]->(issue:Issue)-[:BOUND_TO_PR]->(pr:PullRequest)
<-[:PR_MERGE]-(pre:PullRequestEvent)
WHERE ice.time <= pre.time
WITH pr, count(comment) as comments
RETURN avg(comments)
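To make the two aggregation steps of this query explicit, the same computation can be sketched in Python on toy data. The PR names and timestamps below are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: merge time per PR, and (pr, comment_time) pairs.
merge_time = {"pr1": 100, "pr2": 200, "pr3": 300}
comments = [("pr1", 10), ("pr1", 50), ("pr1", 150), ("pr2", 199)]

# Step 1 (WITH pr, count(comment)): count comments made at or before the merge.
per_pr = Counter(pr for pr, t in comments if t <= merge_time[pr])

# Step 2 (RETURN avg(comments)): average over the PRs that matched at least once.
average = sum(per_pr.values()) / len(per_pr)
print(average)
```

As with the MATCH in the query, a PR that has no comment before its merge (`pr3` here) never produces a row, so it does not pull the average down.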
19. Top contributor?
Which user has the most merged PRs on repositories they do not own?
MATCH (u:User)-[r:DO]->(fe:PullRequestEvent)-[:PR_OPEN]->(pr:PullRequest {state:'merged'})
-[:PR_ON_REPO]-(repo:Repository)-[:OWNED_BY]->(u2:User)
WHERE NOT u = u2
RETURN u.name, count(r) as prs
ORDER BY prs DESC
LIMIT 1
20. Relate users who have merged PRs on the same repositories; this could serve as a follow-recommendation engine.
MATCH p=(u:User)-[:DO]-(e:PullRequestEvent)-->(pr:PullRequest {state:'merged'})-
[:PR_ON_REPO]->(r:Repository)<-[:PR_ON_REPO]-(pr2:PullRequest
{state:'merged'})--(e2:PullRequestEvent)<-[:DO]-(u2:User)
WHERE NOT u = u2
WITH nodes(p) as coll
WITH head(coll) as st, last(coll) as end
MERGE (st)-[r:HAVE_WORKED_ON_SAME_REPO]-(end)
ON MATCH SET r.w = (r.w) + 1
ON CREATE SET r.w = 1
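The idea behind the weight `r.w` can be illustrated with a simplified Python sketch on toy data. The users and repositories are hypothetical, and the sketch counts each pair of PRs once, which may differ from the weight the Cypher query produces, since the query matches every path between the two users.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: (user, repo) for each merged PR.
merged_prs = [
    ("alice", "repoA"),
    ("bob", "repoA"),
    ("bob", "repoA"),
    ("carol", "repoB"),
    ("alice", "repoB"),
]

weights = Counter()
for (u1, r1), (u2, r2) in combinations(merged_prs, 2):
    if r1 == r2 and u1 != u2:
        # Sort the pair so the link is undirected, like the MERGE above.
        weights[tuple(sorted((u1, u2)))] += 1

print(weights.most_common())
```

For a follow recommendation, the pairs with the highest weight would be suggested first.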
22. • More queries in the gist file: https://gist.github.com/ikwattro/071d36f135131e8e4442
• Not valid with the GitHub live API (different payload)
• Zipped DB file: http://bit.ly/1BaMCy9
24. Average time between a repo being forked and that fork resulting in an opened PR?
MATCH p=(u:User)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)-[:FORK_OF]
->(repo:Repository)<-[:PR_ON_REPO]-(pr:PullRequest)-[:PR_OPEN]-
(pre:PullRequestEvent)
<-[:DO]-(u2:User)<-[:OWNED_BY]-(f2:Fork)<-[:BRANCH_OF]-(br:Branch)<-
[:FROM_BRANCH]-(pr2:PullRequest)
WHERE u = u2 AND fork = f2 AND pr = pr2
RETURN count(p), avg(pre.time - fe.time) as offsetTime