folksonomy, social tagging, tag clouds, automatic folksonomy construction, word clouds, wordle, context-preserving word cloud visualisation, CPWCV, seam carving, inflate and push, star forest, cycle cover, quantitative metrics, realized adjacencies, distortion, area utilization, compactness, aspect ratio, running time, semantics in language technology
Lecture: Semantic Word Clouds
1. Semantic Analysis in Language Technology
http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm
Semantic Word Clouds
Marina Santini
santinim@stp.lingfil.uu.se
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Spring 2016
3. Semantic Web & Ontologies
• The
goal
of
the
Seman(c
Web
is
to
allow
web
informa(on
and
services
to
be
more
effec(vely
exploited
by
humans
and
automated
tools.
• Essen(ally,
the
focus
of
the
seman(c
web
is
to
share
data
instead
of
documents.
• This
data
must
be
”meaningful”
both
for
human
and
for
machines
(ie
automated
tools
and
web
applica(ons)
• Q:
How
are
we
going
to
represent
meaning
and
knowledge
on
the
web?
• A:
…
via
annota&on.
• Knowledge
is
represented
in
the
form
of
rich
conceptual
schemas/formalisms
called
ontologies.
• Therefore,
ontologies
are
the
backbone
of
the
Seman(c
Web.
• Ontologies
give
formally
defined
meanings
to
the
terms
used
in
annota&ons,
transforming
them
into
seman&c
annota&ons.
3
4. Ontologies are…
• … concepts that are hierarchically organized
Tree of Porphyry (III AD); Wordnet (XXI AD). (See Lect 5, e.g. similarity measures.)
5. Reasoning: RDF/OWL vs Databases (and other data structures)
OWL axioms behave like inference rules rather than database constraints.

Class: Phoenix
    SubClassOf: isPetOf only Wizard

Individual: Fawkes
    Types: Phoenix
    Facts: isPetOf Dumbledore

• Fawkes is said to be a Phoenix and to be the pet of Dumbledore, and it is also stated that only a Wizard can have a pet Phoenix.
• In OWL, this leads to the implication that Dumbledore is a Wizard. That is, if we were to query the ontology for instances of Wizard, then Dumbledore would be part of the answer.
• In a database setting the schema could include a similar statement about the Phoenix class, but in this case it would be interpreted as a constraint on the data: adding the fact that Fawkes isPetOf Dumbledore without Dumbledore already being known to be a Wizard would lead to an invalid database state, and such an update would therefore be rejected by a database management system as a constraint violation.
6. So, what is an ontology for us?
"An ontology is a FORMAL, EXPLICIT specification of a SHARED conceptualization"
Studer, Benjamins, Fensel. Knowledge Engineering: Principles and Methods. Data and Knowledge Engineering, 25 (1998), 161-197.

"An ontology is an explicit specification of a conceptualization"
Gruber, T. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, Vol. 5, 1993, 199-220.

• Abstract model and simplified view of some phenomenon in the world that we want to represent
• Machine-readable
• Concepts, properties, relations, functions, constraints, axioms are explicitly defined
• Consensual knowledge
7. How to build an ontology
Generally (and roughly) speaking, when designing an ontology, four main components are used:
1. Classes
2. Relations
3. Axioms
4. Instances
8. Practical Activity: emotions
Your remarks:
• Emotions are ambiguous: e.g. happiness can also be ill-directed
• The polarity of some emotions cannot be assessed…
• etc.
Classes, Relations, Axioms, Instances, etc.
9. Occupational psychology (Wikipedia)
• Industrial and organizational psychology (also known as I–O psychology, occupational psychology, work psychology, WO psychology, IWO psychology and business psychology) is the scientific study of human behavior in the workplace and applies psychological theories and principles to organizations and individuals in their workplace.
• I–O psychologists are trained in the scientist–practitioner model. I–O psychologists contribute to an organization's success by improving the performance, motivation, job satisfaction, occupational safety and health, as well as the overall health and well-being of its employees. An I–O psychologist conducts research on employee behaviors and attitudes, and how these can be improved through hiring practices, training programs, feedback, and management systems.
10. In summary… Why build an ontology?
• To share a common understanding of the structure of information among people or machines
• To make domain assumptions explicit
• Often based on a controlled vocabulary
• To analyze domain knowledge
• To enable reuse of domain knowledge
11. Ontologies and Tags
• Ontologies and tagging systems are two different ways to organize the knowledge present on the Web.
• The first has a formal foundation that derives from description logic and artificial intelligence. Domain experts decide the terms.
• The other is simpler and integrates heterogeneous contents; it is based on the collaboration of users in the Web 2.0: user-generated annotation.
12. Folksonomies
• Tagging facilities within Web 2.0 applications have shown how it might be possible for user communities to collaboratively annotate web content, and create simple forms of ontology via the development of loosely-hierarchically organised sets of tags, often called folksonomies…
13. Folksonomy = Social Tagging
• Folksonomies (also known as social tagging) are user-defined metadata collections.
• Users do not deliberately create folksonomies and there is rarely a prescribed purpose, but a folksonomy evolves when many users create or store content at particular sites and identify what they think the content is about.
• "Tag clouds" pinpoint the frequency of certain tags.
14. • A common way to organize tags is in tag clouds…
15. Automatic folksonomy construction
• The collective knowledge expressed through user-generated tags has great potential.
• However, we need tools to efficiently aggregate data from large numbers of users with highly idiosyncratic vocabularies and invented words or expressions.
• Many approaches to automatic folksonomy construction combine tags using statistical methods...
• Ample space for improvement…
16. Ontology, taxonomy, folksonomy, etc.
• Many different definitions…
• A good summary and interpretation is here: http://www.ideaeng.com/taxonomies-ontologies-0602
17. Today…
• We will talk more generally about word clouds…
18. Further Reading
Semantic Similarity from Natural Language and Ontology Analysis, by Sébastien Harispe, Sylvie Ranwez, Stefan Janaqi, and Jacky Montmain. Synthesis Lectures on Human Language Technologies, May 2015, Vol. 8, No. 1.
• The two state-of-the-art approaches for estimating and quantifying semantic similarities/relatedness of semantic entities are presented in detail: the first one relies on corpora analysis and is based on Natural Language Processing techniques and semantic models, while the second is based on more or less formal, computer-readable and workable forms of knowledge such as semantic networks, thesauri or ontologies.
20. Acknowledgements
This presentation is based on the following paper:
• Barth et al. (2014). Experimental Comparison of Semantic Word Clouds. In Experimental Algorithms, Volume 8504 of the series Lecture Notes in Computer Science, pp. 247-258.
  – Link: https://www.cs.arizona.edu/~kobourov/wordle2.pdf
Some slides have been borrowed from Sergey Pupyrev.
21. Today
• Experiments on semantics-preserving word clouds, in which semantically related words are close to each other.
22. Outline
• What is a Word Cloud?
• 3 early algorithms
• 3 new algorithms
• Metrics & Quantitative Evaluation
23. Word Clouds
• Word clouds have become a standard tool for abstracting, visualizing and comparing texts…
• We could apply the same or similar techniques to the huge amounts of tags produced by users interacting in social networks.
24. Comparison & Conceptualization Tool
• Word Clouds as a tool for "conceptualizing" documents. Cf. ontologies.
• Ex: 2008, comparison of speeches: Obama vs McCain.
Cf. Lect 10: Extractive summarization & abstractive summarization.
25. Word Clouds and Tag Clouds…
• … are often used to represent importance among terms (e.g. band popularity) or serve as a navigation tool (e.g. Google search results).
26. The Problem…
• How to compute semantics-preserving word clouds in which semantically related words are close to each other?
27. Wordle (http://www.wordle.net)
• Practical tools, like Wordle, make word cloud visualization easy. They offer an appealing way to SUMMARIZE text…
• Shortcoming: they do not capture the relationships between words in any way, since word placement is independent of context.
28. Many word clouds are arranged randomly (look also at the scattered colours).
29. Patterns and Vicinity/Adjacency
Humans are spontaneously pattern-seekers: if they see two words close to each other in a word cloud, they spontaneously think they are related…
30. In Linguistics and NLP…
• This natural tendency to link spatial vicinity to semantic relatedness is exploited as evidence that words are semantically related or semantically similar…
Remember? "You shall know a word by the company it keeps" (Firth, J. R. 1957:11)
31. So, it makes sense to place such related words close to each other (look also at the color distribution).
32. Semantic word clouds have higher user satisfaction compared to other layouts…
33. All recent word cloud visualization tools aim to incorporate semantics in the layout…
34. … but none of them provides any guarantee about the quality of the layout in terms of semantics.
35. Early algorithms: Force-Directed Graph
• Most of the existing algorithms are based on force-directed graph layout.
• Force-directed graph drawing algorithms are a class of algorithms for drawing graphs in an aesthetically pleasing way:
  – Attractive forces between pairs reduce empty space
  – Repulsive forces ensure that words do not overlap
  – A final force preserves semantic relations between words.

Some of the most flexible algorithms for calculating layouts of simple undirected graphs belong to a class known as force-directed algorithms. Such algorithms calculate the layout of a graph using only information contained within the structure of the graph itself, rather than relying on domain-specific knowledge. Graphs drawn with these algorithms tend to be aesthetically pleasing, exhibit symmetries, and tend to produce crossing-free layouts for planar graphs.
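The force scheme described above can be sketched in a few lines. This is a toy illustration, not the layout engine used in the paper: words are treated as points, similarities are assumed to lie in [0, 1], and the constants `k_attr`/`k_rep` are arbitrary choices.

```python
import math

def force_directed_step(pos, sim, k_attr=0.01, k_rep=0.5):
    """One iteration of a toy force-directed layout.

    pos: dict word -> [x, y]; sim: dict (w1, w2) -> similarity in [0, 1].
    Attraction pulls semantically similar words together, while a
    repulsive term keeps every pair from collapsing onto one spot.
    """
    words = list(pos)
    force = {w: [0.0, 0.0] for w in words}
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            d = math.hypot(dx, dy) or 1e-9
            s = sim.get((a, b), sim.get((b, a), 0.0))
            f = k_attr * s * d - k_rep / d  # attraction minus repulsion
            fx, fy = f * dx / d, f * dy / d
            force[a][0] += fx; force[a][1] += fy
            force[b][0] -= fx; force[b][1] -= fy
    for w in words:  # move every word along its accumulated force
        pos[w][0] += force[w][0]
        pos[w][1] += force[w][1]
    return pos
```

Iterating this step from an initial placement drives similar word pairs toward a finite equilibrium distance while dissimilar pairs drift apart.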
36. Newer Algorithms: rectangle representation of graphs
• Vertex-weighted and edge-weighted graph:
  – The vertices of the graph are the words
    • Their weight corresponds to some measure of importance (e.g. word frequencies)
  – The edges capture the semantic relatedness of pairs of words (e.g. co-occurrence)
    • Their weight corresponds to the strength of the relation
  – Each vertex can be drawn as a box (rectangle) with dimensions determined by its weight
  – A realized adjacency is the sum of the edge weights for all pairs of touching boxes.
  – The goal is to maximize the realized adjacencies.
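The realized-adjacency objective above is easy to compute for a given layout. A minimal sketch, assuming axis-aligned boxes given as `(x, y, w, h)` tuples (these data structures are my choice, not the paper's):

```python
def touching(b1, b2, eps=1e-9):
    """True if two axis-aligned boxes (x, y, w, h) share a boundary segment."""
    x1, y1, w1, h1 = b1
    x2, y2, w2, h2 = b2
    # vertical contact: one box's right side meets the other's left side
    share_x = abs(x1 + w1 - x2) < eps or abs(x2 + w2 - x1) < eps
    # horizontal contact: one box's top meets the other's bottom
    share_y = abs(y1 + h1 - y2) < eps or abs(y2 + h2 - y1) < eps
    overlap_y = min(y1 + h1, y2 + h2) - max(y1, y2) > eps
    overlap_x = min(x1 + w1, x2 + w2) - max(x1, x2) > eps
    return (share_x and overlap_y) or (share_y and overlap_x)

def realized_adjacencies(boxes, weights):
    """Sum of edge weights over all pairs of touching boxes.

    boxes: dict word -> (x, y, w, h); weights: dict (w1, w2) -> edge weight.
    """
    total = 0.0
    words = list(boxes)
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if touching(boxes[a], boxes[b]):
                total += weights.get((a, b), weights.get((b, a), 0.0))
    return total
```

A layout that maximizes this sum places strongly related words in contact with each other.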
37. Purpose of the experiments that are shown here:
• Semantics preservation in terms of closeness/vicinity/adjacency
38. Example
• A contact of two boxes is a common boundary.
• The contact of two boxes is interpreted as semantic relatedness.
• The contact of two boxes can be calculated, so the adjacency can be computed and evaluated.
41. Lect 6: Repetition

Co-occurrence counts with the context words large, data and computer:

              large   data   computer
  apricot       1       0       0
  digital       0       1       2
  information   1       6       1

Which pair of words is more similar?

  cos(v, w) = (v · w) / (|v| |w|) = Σᵢ vᵢwᵢ / (√(Σᵢ vᵢ²) · √(Σᵢ wᵢ²))

  cosine(apricot, information) = (1+0+0) / (√(1+0+0) · √(1+36+1)) = 1/√38 = .16
  cosine(digital, information) = (0+6+2) / (√(0+1+4) · √(1+36+1)) = 8/(√5 · √38) = .58
  cosine(apricot, digital)     = (0+0+0) / (√(1+0+0) · √(0+1+4)) = 0
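The worked example above can be checked directly in code; the vectors are the rows of the co-occurrence table:

```python
import math

def cosine(v, w):
    """Cosine similarity between two count vectors."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm = math.sqrt(sum(vi * vi for vi in v)) * math.sqrt(sum(wi * wi for wi in w))
    return dot / norm if norm else 0.0

# context-word counts over (large, data, computer), as in the table above
apricot     = [1, 0, 0]
digital     = [0, 1, 2]
information = [1, 6, 1]
```

Evaluating these reproduces the slide's numbers: digital and information are the most similar pair (.58), apricot and information are weakly similar (.16), and apricot and digital share no context at all (0).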
43. Input - Output
• The input for all algorithms is:
  – a collection of n rectangles, each with a fixed width and height proportional to the rank of the word
  – a similarity/dissimilarity matrix
• The output is a set of non-overlapping positions for the rectangles.
44. Early Algorithms
1. Wordle (Random)
2. Context-Preserving Word Cloud Visualization (CPWCV)
3. Seam Carving
45. Wordle → Random
• The Wordle algorithm places one word at a time in a greedy fashion, i.e. aiming to use space as efficiently as possible.
• First the words are sorted by weight/rank in decreasing order.
• Then, for each word in this order, a position is picked at random.
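The greedy placement loop above can be sketched as follows. This is only an approximation of Wordle's behaviour: the real tool moves a word along a spiral from its random start until a free spot is found, whereas this sketch simply resamples random positions, and the canvas size and retry count are invented parameters.

```python
import random

def wordle_layout(words, width=100.0, height=100.0, tries=1000, seed=0):
    """Greedy random placement in the spirit of Wordle.

    words: list of (word, w, h) already sorted by decreasing weight/rank.
    Each word is given a random position; if it overlaps an already
    placed rectangle, another random position is tried.
    """
    rng = random.Random(seed)
    placed = {}  # word -> (x, y, w, h)

    def overlaps(x, y, w, h):
        return any(x < px + pw and px < x + w and y < py + ph and py < y + h
                   for px, py, pw, ph in placed.values())

    for word, w, h in words:
        for _ in range(tries):
            x = rng.uniform(0, width - w)
            y = rng.uniform(0, height - h)
            if not overlaps(x, y, w, h):
                placed[word] = (x, y, w, h)
                break
    return placed
```

Because position is random, word placement is independent of context, which is exactly the shortcoming noted for Wordle on the earlier slide.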
52. Context-Preserving Word Cloud Visualization (CPWCV)
• First, a dissimilarity matrix is computed and Multidimensional Scaling (MDS) is performed.
• Second, an effort is made to create a compact layout.
Multidimensional Scaling (MDS) aims at detecting meaningful underlying dimensions in the data.
64. 3 New Algorithms
1. Inflate and Push
2. Star Forest
3. Cycle Cover
65. Inflate-and-Push
• A simple heuristic method for word layout, which aims to preserve semantic relations between pairs of words.
• Based on:
1. Heuristics: scaling down all word rectangles by some constant;
2. Computing MDS (multidimensional scaling) on the dissimilarity matrix;
3. Iteratively increasing the size of the rectangles by 5% (i.e. "inflating" the words);
4. When words overlap, applying a force-directed algorithm to "push" words away.
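The inflate/push loop (steps 3-4) can be sketched as below. Assumptions: the rectangles have already been scaled down and positioned (e.g. by MDS), and the paper's force-directed push is replaced here by a much simpler pairwise separation along the axis of least overlap.

```python
def inflate_and_push(boxes, steps=20, grow=1.05):
    """Toy inflate-and-push loop over boxes: dict word -> [x, y, w, h]."""
    for _ in range(steps):
        for b in boxes.values():          # "inflate": enlarge 5% around the centre
            cx, cy = b[0] + b[2] / 2, b[1] + b[3] / 2
            b[2] *= grow
            b[3] *= grow
            b[0], b[1] = cx - b[2] / 2, cy - b[3] / 2
        words = list(boxes)
        for i, a in enumerate(words):     # "push": separate overlapping pairs
            for c in words[i + 1:]:
                A, B = boxes[a], boxes[c]
                ox = min(A[0] + A[2], B[0] + B[2]) - max(A[0], B[0])
                oy = min(A[1] + A[3], B[1] + B[3]) - max(A[1], B[1])
                if ox > 0 and oy > 0:     # overlap: push along the easier axis
                    if ox < oy:
                        s = ox / 2 if A[0] < B[0] else -ox / 2
                        A[0] -= s
                        B[0] += s
                    else:
                        s = oy / 2 if A[1] < B[1] else -oy / 2
                        A[1] -= s
                        B[1] += s
    return boxes
```

Because words start close to their MDS positions and are only nudged apart when they collide, the relative arrangement (and hence the semantics) of the initial embedding is largely preserved.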
71. Star Forest
• A star is a tree in which one central vertex is connected to all the other vertices.
• A star forest is a forest whose connected components are all stars.
72. Repetition: trees and graphs
• A tree is a special form of graph, i.e. a minimally connected graph, having only one path between any two vertices.
• In a graph there can be more than one path, i.e. a graph can have uni-directional or bi-directional paths (edges) between nodes.
73. Three steps
1. Extracting the star forest: partition a graph into disjoint stars
2. Realising a star: build a word cloud for every star
3. Pack all the stars together
74. Star Forest: star = tree
1. Extract stars greedily from a dissimilarity matrix → disjoint stars = star forest
2. Compute the optimal stars, i.e. the best set of words to be adjacent
3. Apply an attractive force to get a compact layout
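Step 1 above, greedy star extraction, can be sketched as follows. This is a simplified stand-in for the paper's procedure: the centre-selection rule and the `threshold` cut-off for attaching leaves are my own assumptions.

```python
def extract_star_forest(words, sim, threshold=0.3):
    """Greedily partition words into disjoint stars.

    Repeatedly chooses as star centre the unassigned word with the
    highest total similarity to the remaining words, then attaches
    every unassigned word whose similarity to that centre exceeds
    `threshold`. Returns a list of (centre, [leaves]) stars.
    """
    s = lambda a, b: sim.get((a, b), sim.get((b, a), 0.0))
    unassigned = set(words)
    stars = []
    while unassigned:
        centre = max(unassigned,
                     key=lambda c: sum(s(c, u) for u in unassigned if u != c))
        unassigned.remove(centre)
        leaves = [u for u in unassigned if s(centre, u) > threshold]
        unassigned -= set(leaves)
        stars.append((centre, leaves))
    return stars
```

Each star can then be laid out on its own (leaves touching the centre box) and the stars packed together with an attractive force, per steps 2-3.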
75. Cycle Cover
• This algorithm is based on a similarity matrix.
• First, a similarity path is created.
• Then, the optimal level of compactness is computed.
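To give a feel for the first step, here is a much simpler stand-in for building a "similarity path": a greedy nearest-neighbour chain over the similarity matrix. The actual algorithm solves a cycle cover rather than chaining greedily, so treat this only as an illustration of the idea.

```python
def similarity_path(words, sim, start=None):
    """Greedy nearest-neighbour ordering of words by similarity.

    Starting from `start` (or the first word), repeatedly appends the
    unvisited word most similar to the current end of the path.
    """
    s = lambda a, b: sim.get((a, b), sim.get((b, a), 0.0))
    path = [start or words[0]]
    remaining = [w for w in words if w != path[0]]
    while remaining:
        nxt = max(remaining, key=lambda w: s(path[-1], w))
        path.append(nxt)
        remaining.remove(nxt)
    return path
```

Placing the words along such a path keeps consecutive, highly similar words adjacent; the compactness of the final drawing is then tuned separately.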
76. Quantitative Metrics
1. Realized Adjacencies – how close are similar words to each other?
2. Distortion – how distant are dissimilar words?
3. Uniform Area Utilization – uniformity of the distribution (overpopulated vs sparse areas in the word cloud)
4. Compactness – how well utilized is the drawing area?
5. Aspect Ratio – width and height of the bounding box
6. Running Time – execution time
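Two of these metrics, compactness and aspect ratio, depend only on the final rectangle positions and are straightforward to compute. A minimal sketch, assuming non-overlapping boxes given as `(x, y, w, h)` tuples (the exact normalizations in the paper may differ):

```python
def bounding_box(boxes):
    """Bounding box (x, y, w, h) of a set of placed rectangles."""
    xs = [b[0] for b in boxes]
    ys = [b[1] for b in boxes]
    xe = [b[0] + b[2] for b in boxes]
    ye = [b[1] + b[3] for b in boxes]
    return min(xs), min(ys), max(xe) - min(xs), max(ye) - min(ys)

def compactness(boxes):
    """Fraction of the bounding box covered by word rectangles
    (assumes the rectangles do not overlap)."""
    _, _, W, H = bounding_box(boxes)
    used = sum(b[2] * b[3] for b in boxes)
    return used / (W * H)

def aspect_ratio(boxes):
    """Width over height of the drawing's bounding box."""
    _, _, W, H = bounding_box(boxes)
    return W / H
```

Higher compactness means less wasted space in the drawing area; the aspect ratio is usually judged against a target such as the golden ratio or the screen's proportions.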
77. 2 datasets
(1) WIKI, a set of 112 plain-text articles extracted from the English Wikipedia, each consisting of at least 200 distinct words
(2) PAPERS, a set of 56 research papers published in conferences on experimental algorithms (SEA and ALENEX) in 2011-2012.