Talk presented at SAC 2011. From the paper abstract: "Tag-based systems have become very common for online classification thanks to their intrinsic advantages such as self-organization and rapid evolution. However, they are still affected by some issues that limit their utility, mainly due to the inherent ambiguity in the semantics of tags. Synonyms, homonyms, and polysemous words, while not harmful for the casual user, strongly affect the quality of search results and the performances of tag-based recommendation systems. In this paper we rely on the concept of tag relatedness in order to study small groups of similar tags and detect relationships between them. This approach is grounded on a model that builds upon an edge-colored multigraph of users, tags, and resources. To put our thoughts in practice, we present a modular and extensible framework of analysis for discovering synonyms, homonyms and hierarchical relationships amongst sets of tags. Some initial results of its application to the delicious database are presented, showing that such an approach could be useful to solve some of the well known problems of folksonomies."
An integrated approach to discover tag semantics
1. An Integrated Approach to Discover Tag Semantics
SAC 2011, Web Technologies Track, March 24th 2011
Antonina Dattolo, University of Udine, Department of Mathematics and Computer Science, antonina.dattolo@uniud.it
Davide Eynard, USI - University of Lugano, ITC - Institute for Communication Technologies, davide.eynard@usi.ch
Luca Mazzola, USI - University of Lugano, ITC - Institute for Communication Technologies, luca.mazzola@usi.ch
2. Talk outline
Properties of tags
Folksonomies as edge-colored multigraphs
Framework design and implementation
Tests and evaluations
Conclusions
24/03/2011 An integrated approach to discover tag semantics 2/27
3. Tag properties
Tags:
are democratic and bottom-up (vs hierarchical)
are inclusive and current
follow desire lines
are easy to use
4. Drawbacks of tags
Lexical ambiguities:
Synonyms
game and juego, or web2.0 and web_2
Homonyms
check as in chess and in “to check” (polysemous)
sf as scifi or san_francisco
Basic level variations
dog and poodle
Ambiguities due to different purposes:
blog used to tag blog software (e.g. WordPress), a blog service, a blog post, something to blog later, ...
5. Advantages of disambiguation
Synonym detection:
increases recall
allows for better recommendation systems
Homonym detection:
makes it possible to find different contexts of use
increases precision
Basic level variations detection:
identifies a hierarchy
increases recall (e.g. by automatically searching for subclasses)
provides a means to browse search results
6. Approaches to tag disambiguation
Roughly two main families of approaches
Theoretical ones, aiming at describing the system as a whole
More practical, ad hoc ones (often addressing one or a few issues at a time)
Our approach
Main assumption: lexical ambiguities are not independent of each other
Solution based on
a theoretical framework
a modular, extensible analysis tool
7. Folksonomies as edge-colored multigraphs
Def.1: An edge-colored multigraph is a triple
ECMG = (MG, C, c)
where:
MG = (V,E,f) is a multigraph
C is a set of colors
c : E→C is an assignment of colors to multigraph edges
Def.2: A personomy related to user u is an undirected edge-colored graph of color Cu:
Pu = (T, R, E, Cu)
8. Folksonomies as edge-colored multigraphs
Def.3: Given a set of users U and the family of personomies Pu (u ∈ U), a folksonomy is defined as the union of all personomies,
F = ⋃(u ∈ U) Pu
that is, an edge-colored multigraph where:
vertices are tags + resources
edges are tag assignments made on
resources by each user
every color is a different user
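The definitions on these two slides can be sketched in Python. This is a minimal illustration, assuming tag assignments arrive as (user, tag, resource) triples; the function names `build_folksonomy` and `personomy` and the sample data are hypothetical and do not come from the talk.

```python
from collections import defaultdict

def build_folksonomy(assignments):
    """Build an edge-colored multigraph from (user, tag, resource) triples.

    Returns a dict mapping each (tag, resource) vertex pair to the set of
    colors (users) of the parallel edges that link the two vertices.
    """
    edges = defaultdict(set)
    for user, tag, resource in assignments:
        edges[(tag, resource)].add(user)  # one edge per color (user)
    return dict(edges)

def personomy(folksonomy, user):
    """Extract the single-color graph of one user as a set of edges."""
    return {pair for pair, colors in folksonomy.items() if user in colors}

# Hypothetical sample data, for illustration only
assignments = [
    ("alice", "sf", "r1"),
    ("bob", "sf", "r1"),
    ("bob", "scifi", "r1"),
    ("carol", "san_francisco", "r2"),
]
folksonomy = build_folksonomy(assignments)
print(folksonomy[("sf", "r1")])
print(personomy(folksonomy, "bob"))
```

Storing the colors as a set per vertex pair keeps parallel edges distinguishable by user, which is the property the later simplification steps rely on.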
9. First simplification step
As we are only interested in relationships between
tags, we need to perform two simplification steps on
the edge-colored multigraph
Step 1: colored edges are collapsed and substituted
by weighted edges
potentially, every color (user) might be assigned a different weight wu
the weight w of the collapsed edge is the sum of all the wu linking the same two vertices
when wu = 1 for each user, w is the number of times a tag has been used on a resource
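Step 1 can be sketched as follows. The function name `collapse_colors`, the `user_weights` parameter, and the sample triples are assumptions made for illustration; the talk does not describe its implementation.

```python
from collections import defaultdict

def collapse_colors(assignments, user_weights=None):
    """Collapse parallel colored edges into one weighted (tag, resource) edge.

    assignments: iterable of (user, tag, resource) triples.
    user_weights: optional dict mapping user -> wu; missing users get wu = 1.
    """
    user_weights = user_weights or {}
    weights = defaultdict(float)
    # set() guarantees at most one edge per user between two vertices
    for user, tag, resource in set(assignments):
        weights[(tag, resource)] += user_weights.get(user, 1.0)
    return dict(weights)

# Hypothetical sample data, for illustration only
assignments = [
    ("alice", "sf", "r1"),
    ("bob", "sf", "r1"),
    ("bob", "scifi", "r1"),
]
weights = collapse_colors(assignments)
downweighted = collapse_colors(assignments, {"bob": 0.5})
print(weights[("sf", "r1")], downweighted[("sf", "r1")])
```

With the default weights, the value on each edge is simply the number of users who applied that tag to that resource; passing a per-user weight shows how a specific user's contribution could be reduced.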
10. Second simplification step
Step 2:
a link is created between ta and tb if they
share a resource
resource nodes are dropped
Edge weights can be calculated in different ways:
number of triples (ti, r, tj) where (ti, r), (r, tj) ∈ E => co-occurrence
normalized co-occurrence (e.g. using the Jaccard index)
distributional measures
custom metrics (e.g. sum of products of connecting edges' weights)
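The first two weighting options can be sketched in a few lines. This is an illustrative sketch that assumes unit edge weights and represents the simplified graph as a list of (tag, resource) edges; the helper names and the sample edges are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def tag_resources(edges):
    """Map each tag to the set of resources it is attached to."""
    resources = defaultdict(set)
    for tag, resource in edges:
        resources[tag].add(resource)
    return resources

def co_occurrence(edges):
    """For each tag pair, count the resources the two tags share."""
    resources = tag_resources(edges)
    counts = {}
    for ta, tb in combinations(sorted(resources), 2):
        shared = len(resources[ta] & resources[tb])
        if shared:  # a link exists only if a resource is shared
            counts[(ta, tb)] = shared
    return counts

def jaccard(edges, ta, tb):
    """Co-occurrence normalized by the size of the union of resources."""
    resources = tag_resources(edges)
    union = resources[ta] | resources[tb]
    return len(resources[ta] & resources[tb]) / len(union) if union else 0.0

# Hypothetical sample data, for illustration only
edges = [
    ("sf", "r1"), ("scifi", "r1"),
    ("sf", "r2"), ("san_francisco", "r2"),
    ("scifi", "r3"),
]
print(co_occurrence(edges))
print(jaccard(edges, "sf", "scifi"))
```

Resource nodes disappear from the result: only tag pairs and their weights remain, which matches the second simplification step.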
12. System architecture
Basic assumption:
ambiguous tags should be related (either by co-occurrence or by presence in the same context)
Three main components:
tag analysis tool
disambiguation tool
front-end
13. Synonyms detection / 1
Natural text …
Two words are considered synonyms if they can be replaced
by each other without affecting the meaning of a sentence
… vs. Tag-based systems
It is possible to swap two tags within a “sentence” (i.e. a tagging action) without affecting its meaning when we have:
variations of a word (e.g. blog, blogs, blogging)
translations into other languages (e.g. game, juego, spiel)
terms joined by non-alphabetic characters (e.g. web2, web_2)
No “one size fits all” solution
14. Synonyms detection / 2
A modular solution for synonyms detection:
different heuristics, each one returning the likelihood of tags to be
synonyms
results are weighted to obtain an overall likelihood
Suggested heuristics:
an edit distance such as Levenshtein's (normalized to account for short strings)
synonym search in WordNet (good precision, low recall)
online translation bases (top-down, such as dictionaries, or bottom-up, such as collaboratively grown vocabularies like Wikipedia)
stemming with NLP algorithms
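The modular scheme above, with several heuristics whose scores are weighted into one likelihood, might look like the following sketch. The weights, the normalization by the longer string, and the tiny `KNOWN_TRANSLATIONS` table (a toy stand-in for a dictionary or Wikipedia lookup) are all assumptions chosen for illustration, not the plugin described in the talk.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def edit_similarity(a, b):
    """Edit distance normalized by the longer string, mapped to [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

# Toy stand-in for a translation lookup (hypothetical data)
KNOWN_TRANSLATIONS = {frozenset(("game", "juego")), frozenset(("game", "spiel"))}

def translation_match(a, b):
    """Return 1.0 if the pair is a known translation, else 0.0."""
    return 1.0 if frozenset((a, b)) in KNOWN_TRANSLATIONS else 0.0

def synonym_likelihood(a, b, heuristics):
    """Weighted average of heuristic scores.

    heuristics: list of (weight, function) pairs, each function
    returning a score in [0, 1] for the tag pair.
    """
    total = sum(weight for weight, _ in heuristics)
    return sum(weight * h(a, b) for weight, h in heuristics) / total

heuristics = [(1.0, edit_similarity), (3.0, translation_match)]
print(edit_similarity("web2", "web_2"))
print(synonym_likelihood("game", "juego", heuristics))
```

Dividing the distance by the longer string's length is one simple way to keep short tags from looking more similar than they are; other normalizations would fit the same interface.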
15. Homonyms detection
Check if the tag t has been used in different contexts
cluster tags related to t in groups
the most frequent tags in these groups are used to name
and disambiguate the contexts
Clustering algorithm:
an overlapping one, also used in social network analysis*
a cluster is a subgraph G identified by the maximization of a fitness property:
f = s_in / (s_in + s_out)^α
s = strength of internal (in) or external (out) links
α = tweaking parameter
* A. Lancichinetti et al.: “Detecting the overlapping and hierarchical community structure of complex networks”
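The fitness of a candidate cluster can be computed as in the sketch below. It assumes the fitness takes the form s_in / (s_in + s_out)^α used in the cited paper, with internal links counted from both endpoints; the graph representation, the function name `fitness`, and the sample weights are hypothetical.

```python
def fitness(graph, cluster, alpha=1.0):
    """Fitness of a node set: s_in / (s_in + s_out) ** alpha.

    graph: dict node -> dict neighbor -> weight (undirected, symmetric).
    cluster: set of nodes forming the candidate subgraph.
    """
    s_in = s_out = 0.0
    for node in cluster:
        for neighbor, weight in graph[node].items():
            if neighbor in cluster:
                s_in += weight   # internal links are counted from both ends
            else:
                s_out += weight  # links leaving the cluster
    total = s_in + s_out
    return s_in / total ** alpha if total else 0.0

# Hypothetical neighborhood of an ambiguous tag, for illustration only
graph = {
    "scifi": {"fantasy": 3, "books": 2, "california": 1},
    "fantasy": {"scifi": 3, "books": 1},
    "books": {"scifi": 2, "fantasy": 1},
    "california": {"scifi": 1, "bayarea": 4},
    "bayarea": {"california": 4},
}
reading = {"scifi", "fantasy", "books"}
print(fitness(graph, reading))
print(fitness(graph, reading | {"california"}))
```

In this toy graph, adding california to the reading-related cluster lowers the fitness, so a greedy maximization would leave it out. Larger values of α penalize the total strength more heavily and tend to yield smaller clusters.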
16. Hierarchy detection
Hierarchy is a specific case of basic level variation
A possible approach: Hearst patterns on the Web,
such as:
C1 (and|or) other C2 (e.g. “poodles and other dogs”)
C1 such as I (e.g. “cities such as San Francisco”)
(note: Ci are concepts, I is a concept instance)
Search for the patterns, and use the number of results
as an indicator for their strength
Pros: the Web is as up-to-date as folksonomies
Cons: O(n²) complexity, not really scalable
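A rough sketch of the pattern-based scoring follows. The `hit_count` callable stands in for a web search that returns a number of results; it is hypothetical, as are the naive pluralization by appending "s" and the fake counts used in the example. No real search API is assumed.

```python
from itertools import permutations

def hearst_queries(narrower, broader):
    """Phrase queries suggesting that `narrower` falls under `broader`.

    Plurals are formed naively by appending "s" (an assumption that
    fails for irregular nouns).
    """
    return [
        f'"{narrower}s and other {broader}s"',
        f'"{narrower}s or other {broader}s"',
        f'"{broader}s such as {narrower}"',
    ]

def hierarchy_scores(tags, hit_count):
    """Score every ordered tag pair by summing the hits of its queries.

    hit_count: callable taking a query string and returning an int.
    The loop visits n * (n - 1) ordered pairs, hence the quadratic cost.
    """
    scores = {}
    for narrower, broader in permutations(tags, 2):
        queries = hearst_queries(narrower, broader)
        scores[(narrower, broader)] = sum(hit_count(q) for q in queries)
    return scores

# Fake result counts, for illustration only
fake_hits = {'"poodles and other dogs"': 120}
scores = hierarchy_scores(["poodle", "dog", "cat"], lambda q: fake_hits.get(q, 0))
print(scores[("poodle", "dog")], scores[("dog", "poodle")])
```

Because each ordered pair needs its own set of queries, even three tags produce six pairs, which illustrates why the approach does not scale to a full tag vocabulary.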
17. Prototype development
Dataset
Data from more than 30K users of
http://www.delicious.com
Ignored the system:unfiled tag
For the calculation of Tag Context Similarity,
we only took into account the top 10K tags
Prototype
Tag analysis tool, calculating CO, NCO, and TCS (takes time, runs as a
batch job and saves matrices in the DB)
Disambiguation with homonyms plugin, implementing the overlapping
clustering algorithm, and Wikipedia synonym discovery
Front-end is currently a command-line application
18. Experimental results / 1
System tested against three different sets of tags:
Top 20 tags in delicious
A group of tags known to be ambiguous (apple, cambridge, sf,
stream, turkey, tube)
A set of subjective tags, chosen among the most popular ones in delicious (cool, fun, funny, interesting, toread)
For each tag:
we calculated the top n (with n = 50) related tags with the three metrics
(CO, NCO, TCS)
we performed synonym and homonym analyses
19. Experimental results / 2
Tag Context Similarity already tends to provide
synonyms as top-related tags
e.g. tags related to toread: read, read_later, to_read, etc.
Analyzing a less popular synonym (@readit):
9 out of the top 10 (and 17 out of the top 50) related tags are synonyms
reason: as less popular tags are less spread across contexts, they tend
to have a higher similarity with other less popular synonyms
Wikipedia results:
analyzing the 31 tags in our three sets, we got 215 new words;
of those 215, only 83 are valid tags in our delicious dataset;
of those 83, only 20 belong to the 10K most-used tags;
only 2 belong to the set of the top-related tags of their English synonym.
20. Experimental results / 3
Homonyms detection:
we tested the algorithm with different values of α
meaningful results in a relatively short time (but we are working only on the top related tags...)
limit: the graphs of top related tags differ in connectivity, so no single value of α is good for all of them (α = 1.4 for sf, α = 1.74 for stream).
21. Conclusions
Model
Flexible enough to support other kinds of metrics
Multigraph can be simplified in other ways
User-related weights still have to be taken into account
Tool
Still in the prototype phase, but it has already provided useful results and allowed us to compare
metrics: different metrics provide very different results, which might be more or less useful depending on the user's needs
tag behaviors: different depending on their popularity and the use that people make of them
22. Conclusions
Ongoing work
Clustering evaluation metrics to find best α
Applications (e.g. for tag grouping and visualization*)
User- and resource-specific projections**
Future work
Development of other plugins and front-end
Play with user-related weights to focus on specific
communities / filter spam
* Mazzola, Eynard, Mazza: “GVIS: a framework for graphical mashups of heterogeneous sources to support data interpretation”.
** Dattolo, Ferrara, Tasso: “On social semantic relations for recommending tags and resources using folksonomies”.
23. Thank you!
Thanks for your attention!
Questions?
24. toread top 20 related tags
25. @readit top 20 related tags
26. sf top 20 related tags
27. stream top 20 related tags