ICWSM 2011 Tutorial
Sébastien Heymann and Julian Bilcke
Gephi is interactive visualization and exploration software for all kinds of networks and relational data: online social networks, emails, communication and financial networks, but also semantic networks, inter-organizational networks and more. Designed to make data navigation and manipulation easy, it aims to cover the complete chain from data importing to aesthetics refinements and interaction. Users interact with the visualization and manipulate structures, shapes and colors to reveal hidden properties. The goal is to help data analysts form hypotheses and intuitively discover patterns or errors in large data collections.
In this tutorial we provide a hands-on demonstration of the essential functionalities of Gephi, based on a real case scenario: the exploration of student networks from the "Facebook100" dataset (Social Structure of Facebook Networks, Amanda L. Traud et al., 2011). Participants will be guided step by step through the complete chain of representation, manipulation, layout, analysis and aesthetics refinements. Particular focus will be put on filters and metrics for the creation of their first visualizations. They will then be encouraged to compare the hypotheses suggested by their own exploration with the results actually published in the academic paper. Finally, they will walk away with the practical knowledge enabling them to use Gephi for their own projects. The tutorial is intended for professionals, researchers and graduates who wish to learn how a playful approach to network exploration can speed up their studies.
Sébastien Heymann is a Ph.D. candidate in Computer Science at Université Pierre et Marie Curie, France. His research in the ComplexNetworks team focuses on the dynamics of real-world networks. He has led the Gephi project since 2008 and is the administrator of the Gephi Consortium.
Julian Bilcke is a Software Engineer at ISC-PIF (Complex Systems Institute of Paris, France). He is a founder of the Gephi project and has been a developer on it since 2008.
2. Exploratory Network Analysis with Gephi
This tutorial is an introduction to Gephi, the open source graph network
visualization and manipulation software.
Gephi aims to fulfill the complete chain from data importing to aesthetics
refinements and interaction.
Users interact with the visualization and manipulate structures, shapes
and colors to reveal hidden properties.
The goal is to help data analysts to make hypotheses, intuitively discover
patterns or errors in large data collections.
At the end, the participants will walk away with the practical knowledge
enabling them to use Gephi for their own projects.
3. Exploratory Network Analysis with Gephi
It starts with a brief introduction on the network exploration process and
a hands-on demonstration of the essential functionalities of Gephi.
Participants are guided step by step through the complete chain of
representation, manipulation, layout, analysis and aesthetics refinements.
Next, teams work on real datasets.
They finally present their preliminary results. The tutorial concludes with
a general question and answer session.
4. Requirements
Bring your own laptop with Java and Gephi installed.
Gephi should be updated (menu Help > Check for Updates).
Bring a mouse with a wheel.
Bring a dataset of your own if you want, and verify that it loads well in Gephi.[1]
[1] http://gephi.org/users/supported-graph-formats/
5. Workshop Schedule - Part I
Exploratory Network Analysis
• Exploratory Data Analysis
• Exploratory Network Analysis
• Looking for Orderness in Data
• Examples
• Guideline
Introduction to Gephi
• Approach and Community
• Networked Data
• Quick Start Demo
* 30 min break *
6. Workshop Schedule - Part II
Hands-On!
• Team Work on a Dataset
• Presentation of Preliminary Results
Q&A
7. Exploratory Data Analysis
Approach: started with
Confirmatory: results
Exploratory: intuition
Serendipity: surprise
“The greatest value of a picture is when it forces us
to notice what we never expected to see” John Tukey (1962)
8. Exploratory Data Analysis
Non-linear processing chain of Ben Fry
in Computational Information Design (2004)
9. Dummy Example
Observation:
visual saliences on specific
file sizes
External knowledge:
these sizes correspond to
films
New hypothesis on data:
films are highly exchanged,
so the study might dig in
this direction
P2P file size distribution (Latapy et al., 2008)
10. Exploratory Network Analysis
1 see the network
1st graph viz tool: Pajek (1996)
Vladimir Batagelj, Andrej Mrvar
2 interact in real time
group, filter, compute metrics...
Gephi prototype (2008)
3 build a visual language
size by rank, color by partition,
label, curved edges, thickness...
11. Looking for a “Simple Small Truth”?
Drew Conway, What Data Visualization Should Do:
1. Make complex things simple
2. Extract small information from large data
3. Present truth, do not deceive
http://www.dataists.com/2010/10/what-data-visualization-should-do-simple-small-truth/
12. Looking for Orderness in Data
Vary 3 cursors simultaneously to extract meaningful patterns:
• at different levels: MICRO level to MACRO level
• on multiple dimensions: 1 dimension to N dimensions
• at different time scales: T+0 to T+N
13. “Zoom” cursor on Quantitative Data
MICRO level MACRO level
Global
- connectivity
- density
- centralization
Local
- communities
- bridges between communities
- local centers vs periphery
Individual
- centrality
- distances
- neighborhood
- location
- local authority vs hub
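As a rough illustration of how the global and individual levels differ, the sketch below computes one global metric (density), one individual metric (degree centrality) and Freeman's degree centralization. It uses the five-node example graph from the "Diversity of Network Encoding" slide, treated as undirected for simplicity; the function names are illustrative assumptions and this is not how Gephi itself computes its statistics.

```python
from collections import defaultdict

# Example edges from the "Diversity of Network Encoding" slide,
# treated as undirected here (an assumption made for simplicity).
EDGES = [("a", "b"), ("a", "d"), ("b", "c"), ("e", "a"), ("c", "e")]


def neighbors(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def density(adj):
    """Global level: share of possible undirected edges that exist."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) / 2
    return 2 * m / (n * (n - 1))


def degree_centrality(adj):
    """Individual level: degree divided by the maximum possible degree."""
    n = len(adj)
    return {node: len(nb) / (n - 1) for node, nb in adj.items()}


def degree_centralization(adj):
    """Global level: Freeman's degree centralization (1.0 for a star)."""
    n = len(adj)
    degrees = [len(nb) for nb in adj.values()]
    top = max(degrees)
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))


ADJ = neighbors(EDGES)
print(density(ADJ))
print(degree_centrality(ADJ))
print(degree_centralization(ADJ))
```

Node a has the highest degree centrality, while the single density value describes the graph as a whole.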
14. “Crossing” cursor on Qualitative Data
1 dimension N dimensions
Social
- who with whom
- communities
- brokerage
- influence and power
- homophily
Semantic
- topics
- thematic clusters
Geographic
- spatial phenomena
15. “Timeline” cursor on Temporal Data
T+0 T+N
Evolution of social ties
Evolution of communities
Evolution of topics
16. Mapping an Innovation Center
Collaborations on projects at Images et Réseaux
Themes and content
Actors
Territory
Franck Ghitalla & Ecole de Design de Nantes
18. Network Map: a Series of Choices
• corpus
• data
• graphical operations
• algorithms
• thresholds
• communication goals
19. Guideline
# nodes
1 - 100
• lists + edges as a bonus, focus on qualitative data
How do attributes explain the structure?
100 - 1,000
• easy to read, “obvious” patterns
• focus on entities (in context)
• metrics are tools to describe the graph (centrality, bridging...)
• links help to build and interpret categories of entities
• challenge: mix attribute crossing and connectivity
How does the structure explain attributes?
1,000 - 50,000
• hard to read, problem of “hidden signals”: track patterns with various layouts and filtering
• focus on structures
• metrics are tools to build the graph (cosine similarity...)
• categories help to understand the structure
• challenge: pattern recognition
> 50,000
• requires high computational power
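To make "metrics are tools to build the graph (cosine similarity...)" concrete, here is a minimal sketch that links two entities whenever the cosine similarity of their attribute vectors reaches a threshold. The profiles, the threshold value and the function names are hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_edges(profiles, threshold):
    """Return weighted edges (a, b, similarity) for pairs above the threshold."""
    names = sorted(profiles)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            weight = cosine(profiles[a], profiles[b])
            if weight >= threshold:
                edges.append((a, b, weight))
    return edges


# Hypothetical term counts per document (not taken from any real dataset).
PROFILES = {
    "doc1": {"graph": 2, "layout": 1},
    "doc2": {"graph": 1, "layout": 1},
    "doc3": {"music": 3},
}

SIM_EDGES = similarity_edges(PROFILES, threshold=0.5)
print(SIM_EDGES)
```

The threshold is one of the choices that shape the resulting map: lowering it adds weaker links and makes the graph denser.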
21. Gephi in a Nutshell
« Like Photoshop™ for graphs. »
Helps data analysts to reveal patterns and trends,
highlight outliers and tell stories with their data.
• Network visualization platform
• Open source, supported by a community
• Built for performance and usability
• Extensible by plug-ins
• Windows, MacOS X, Linux
22. Gephi Community
Nonprofit organization
Communities Contributors
Mathieu Bastian, Mathieu Jacomy,
Eduardo Ramos Ibañez, Sébastien
Heymann, Guillaume Ceccarelli,
André Panisson, Antonio Patriarca,
Cezary Bartosiak, Martin Škurla,
Patrick McSweeney, Yi Du, Hélder
Suzuki, Daniel Bernardes, Ernesto
Aneiro, Keheliya Gallaba, Luiz
Ribeiro, Urban Škudnik, Vojtech
Bardiovsky, Yudi Xue
23. Community Mission
Provide a “sustainable” software
Maintain the technical ecosystem
Build a business ecosystem
Face cutting-edge technological challenges with
a long-term vision
Distribute the software in Open Source
24. Community Values
Open innovation: ideas and features come from
the entire community.
Decisions are taken with transparency.
We consider this technology as a public good,
and will keep it in open source.
26. Diversity of Network Encoding
Textual
V = { a, b, c, d, e }
E = { (a,b), (a,d), (b,c), (e,a), (c,e) }
Tabular
  a b c d e
a - 1 - 1 -
b - - 1 - -
c - - - - 1
d - - - - -
e 1 - - - -
XML
<graph>
  <nodes>
    <node id="a" />
    <node id="b" />
    <node id="c" />
    <node id="d" />
    <node id="e" />
  </nodes>
  <edges>
    <edge source="a" target="b" />
    <edge source="a" target="d" />
    <edge source="b" target="c" />
    <edge source="e" target="a" />
    <edge source="c" target="e" />
  </edges>
</graph>
Graphical
and many others...
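The encodings above all describe the same graph. As a small sketch of that equivalence, the code below starts from the textual form (V and E) and derives the tabular form (an adjacency matrix) and the simplified XML form shown on this slide. The XML mirrors the slide only and is not a complete GEXF or GraphML document; the function names are illustrative.

```python
import xml.etree.ElementTree as ET

# Textual encoding from the slide.
NODES = ["a", "b", "c", "d", "e"]
EDGES = [("a", "b"), ("a", "d"), ("b", "c"), ("e", "a"), ("c", "e")]


def to_matrix(nodes, edges):
    """Tabular encoding: matrix[i][j] is 1 if there is an edge i -> j."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


def to_xml(nodes, edges):
    """Simplified XML encoding, mirroring the structure shown on the slide."""
    graph = ET.Element("graph")
    nodes_el = ET.SubElement(graph, "nodes")
    for name in nodes:
        ET.SubElement(nodes_el, "node", id=name)
    edges_el = ET.SubElement(graph, "edges")
    for source, target in edges:
        ET.SubElement(edges_el, "edge", source=source, target=target)
    return ET.tostring(graph, encoding="unicode")


MATRIX = to_matrix(NODES, EDGES)
XML_TEXT = to_xml(NODES, EDGES)

for name, row in zip(NODES, MATRIX):
    print(name, " ".join("1" if cell else "-" for cell in row))
print(XML_TEXT)
```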
27. Software I/O
Input
• databases: MySQL, PostgreSQL, SQL Server, Neo4j
• file: CSV, Pajek NET, Guess GDF, GEXF, GraphML, Graphviz DOT, UCInet DL, Netdraw VNA, Tulip TLP, Excel Spreadsheet
• graph streaming
• user input
Output
• file: CSV, Pajek NET, Guess GDF, GEXF, GraphML, Excel Spreadsheet, SVG, PDF, PNG
28. Choosing a File Format
Table of features supported by Gephi
Features (columns): Edge List/Matrix Structure, XML Structure, Edge Weight, Attributes, Visualization Attributes, Attribute Default Value, Hierarchical Graphs, Dynamics
Formats (rows):
CSV
DL Ucinet
DOT Graphviz
GDF
GEXF
GML
GraphML
NET Pajek
TLP Tulip
VNA Netdraw
Spreadsheet*
* spreadsheets can be loaded in the Data Laboratory
29. Do you need...
Chart: file formats placed from many features (top) to few features (bottom), by file type (XML, Tabular, Text)
Many features
GEXF
Spreadsheet
GraphML
Guess GDF
GML
UCINet DL
Netdraw VNA
Graphviz DOT
Pajek NET
CSV
Tulip TLP
Few features
31. Team work
1 Create a team of 2-3 people.
2 Choose a dataset.
3 Explore it for 1 hour.
4 Two teams present their preliminary findings.
32. Dataset #1: GitHub Software Repository
“GitHub is an application used by nearly a million people to store
over two million code repositories, making GitHub the largest code
host in the world.”
Started in 2008, it provides the features of an online social network
and a software repository to lower the barriers to collaboration and
make it easier to contribute code.
https://github.com
33. Dataset #1: GitHub Software Repository
Data extracted by Franck Cuny* at Linkfluence SAS
1st release in March 2010 -> this poster
2nd release in June 2011 -> your data
_____________Network of user profiles__________
Nodes: people with at least one repository who
are followed by at least two other people
Edges: A follows B
_____________Network of repositories__________
Nodes: repositories
Edges: A shares a developer with B
Very few research publications on this OSN!
* franck.cuny@linkfluence.net
34. Dataset #1: GitHub Software Repository
Data extracted by a crawl using the GitHub API
Seed: 10 well-known contributors in the Perl community
Networks by country: Japan, France, United States
Networks by language: Perl, PHP, Python, Ruby
Node attributes:
• user country
• number of followers
• main programming language
Edges:
• directed
• weight = number of projects A has forked from B
35. Dataset #1: GitHub Software Repository
Your mission (should you decide to accept it):
find research hypotheses based on your exploration
Example question: are the Perl communities based on geography?
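One possible first check for the example question is to measure how much of the edge weight stays inside a single country. The sketch below is only an illustration: the user names, countries and fork counts are invented, and `same_attribute_share` is a hypothetical helper, not part of the dataset or of Gephi.

```python
def same_attribute_share(edges, attribute):
    """Fraction of total edge weight linking nodes with the same attribute value."""
    total = 0
    same = 0
    for source, target, weight in edges:
        total += weight
        if attribute[source] == attribute[target]:
            same += weight
    return same / total if total else 0.0


# Invented sample data, shaped like the node and edge attributes described above.
COUNTRY = {
    "alice": "France",
    "bob": "France",
    "chie": "Japan",
    "dan": "United States",
}
# (A, B, weight) where weight = number of projects A has forked from B.
FORKS = [
    ("alice", "bob", 3),
    ("bob", "alice", 1),
    ("chie", "alice", 1),
    ("dan", "chie", 1),
]

SHARE = same_attribute_share(FORKS, COUNTRY)
print(round(SHARE, 3))
```

A high share alone does not confirm the hypothesis: it should be compared with what the country sizes alone would produce, for instance by shuffling the country labels and recomputing the share.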
36. Dataset #2: The Irish Blogosphere
“Identifying Representative Textual Sources in Blog Networks”. K. Wade, D.
Greene, C. Lee, D. Archambault, P. Cunningham (2011) http://mlg.ucd.ie/blogs
_______________Blogroll Network______________
Nodes: blogs with more than two blogroll links
Edges: blogroll link (in-link)
_______________Post-link Network_____________
Nodes: blogs with more than two blogroll links
Edges: hyperlink inside post from a blog to another
(post-link)
37. Dataset #2: The Irish Blogosphere
Data extracted by a crawl at distance 2 from the seed for the in-links
and Google Blog Search for the post-links.
Seed: 21 popular blogs, winners of the “2010 Irish Blog Awards”
Node attributes:
• post count = total number of posts by blog
• category = from the Irish blog index at www.irishblogdirectory.com,
where available
• infomap_comm = community to which a node belongs (infomap algo)
• gce_comms = overlapping communities (GCE algo)
• moses_comms = overlapping communities (MOSES algo)
Edges:
• directed
• weight = number of hyperlinks in the Post-link network
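A crawl "at distance 2 from the seed" can be read as a breadth-first traversal that stops expanding once a blog is two links away from any seed. The sketch below shows that idea under stated assumptions: `get_links` is a hypothetical callback returning the blogs linked from a given blog, the link table is invented, and the actual crawler used for the dataset is not described in these slides.

```python
from collections import deque


def crawl(seeds, get_links, max_distance=2):
    """Breadth-first crawl; returns each reached blog with its distance to the seeds."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        blog = queue.popleft()
        if distance[blog] == max_distance:
            continue  # do not follow links beyond the distance limit
        for linked in get_links(blog):
            if linked not in distance:
                distance[linked] = distance[blog] + 1
                queue.append(linked)
    return distance


# Invented link table standing in for real blogroll pages.
LINKS = {
    "seed": ["b1", "b2"],
    "b1": ["b3"],
    "b2": ["seed"],
    "b3": ["b4"],
}

FOUND = crawl(["seed"], lambda blog: LINKS.get(blog, []))
print(FOUND)
```

In this toy table, b4 is three links away from the seed, so it is never reached.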
38. Dataset #2: The Irish Blogosphere
Your mission:
explore and try to confirm the official results
39. Hands-On!
Start:
• Load a graph
• Apply a layout
• Color the nodes by a qualitative variable in Partition Panel
• Size the nodes by a quantitative variable in Ranking Panel
• Start to explore...compute metrics, filter the network
End:
• Export maps to PDF in Preview Tab
• Save
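Sizing nodes by a quantitative variable amounts to mapping each value onto a size range. The sketch below uses a plain linear interpolation between a minimum and a maximum size; the function name, the default sizes and the sample values are assumptions for illustration, and this is not a description of how Gephi's Ranking Panel is implemented.

```python
def rank_sizes(values, min_size=10.0, max_size=50.0):
    """Map each node's value linearly onto the range [min_size, max_size]."""
    low = min(values.values())
    high = max(values.values())
    span = high - low
    sizes = {}
    for node, value in values.items():
        # If all values are equal, every node gets the minimum size.
        position = (value - low) / span if span else 0.0
        sizes[node] = min_size + position * (max_size - min_size)
    return sizes


# Hypothetical degrees used as the quantitative variable.
DEGREES = {"a": 3, "b": 2, "d": 1}
SIZES = rank_sizes(DEGREES)
print(SIZES)
```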
42. Thank You!
Caspar David Friedrich -
Wanderer Above the Sea of Fog
43. Credits
[slide 11] images from Drew Conway
http://www.dataists.com/2010/10/what-data-visualization-should-do-simple-small-truth/
[slide 22 top left] Benoît Vidal at MFG Labs
[slide 22 bottom center] Franck Ghitalla at UTC
[slide 22 right] Studies in MA Digital Fashion at LCF by Peter Jeun Ho Tsang
http://jeunhotsang.com/blog/2010/12/07/prototype/
[slide 27] sketches from Ben Fry, Computational Information Design
Special Thanks to Franck Ghitalla and Mathieu Jacomy
for their insightful discussions.