The document describes the DISIT Lab and its Twitter Vigilance platform. Twitter Vigilance is a multi-user platform that collects and analyzes Twitter data through customized channels and searches. It performs natural language processing and sentiment analysis, and computes various metrics on the Twitter data. The platform works around limitations of Twitter's Search API and uses Hadoop for efficient analysis, supporting real-time and predictive analytics. It has been used for early warning and prediction in domains such as disasters, TV audiences and large events.
Twitter Vigilance: a Multi-User platform for Cross-Domain Twitter Data Analytics, NLP and Sentiment Analysis
1. DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it
http://www.disit.org
DISIT lab, IEEE SCI 2017, Fremont CA USA
Daniele Cenni, Paolo Nesi, Gianni Pantaleo, Imad Zaza
University of Florence, Department of Information Engineering,
DISIT Lab, http://www.disit.org ,
http://www.sii-mobility.org , http://www.km4city.org
paolo.nesi@unifi.it
2. Exploiting Social Media Data
• Mainly natural language (multiple languages), with specific slang
– E.g., Twitter with its # hashtags, @ citations, etc.
– Few posts are geolocated
• Main Domain Analysis
– Social and market analysis
– Predictive models
– Early warning, anomaly detection
• Derived metrics may be of many kinds and have to be validated before use
3. Prediction/Assessment
• Football game results as related to the volume of Tweets
• Number of votes in political elections, via sentiment analysis (SA)
• Size and inception of contagious diseases
• Marketability of consumer goods
• Public health: seasonal flu
• Box-office revenues for movies
• Places to be visited, most visited places
• Number of people in locations like airports
• Audience of TV programmes, political TV shows
• Weather forecast information
• Appreciation of services
4. Twitter Vigilance
• http://www.disit.org/tv
• http://www.disit.org/rttv
• Citizens as sensors to
– Assess sentiment on services, events, …
– Response of consumers wrt…
– Early detection of critical conditions
– Information channel
– Opinion leaders
– Communities
– Formation
– Predicting volume of visitors for tuning the services
5. Requirements
• Collecting Tweets
– On the basis of several criteria, i.e. searches
• Multiple users may have multiple searches and multiple purposes (views on those searches), so the number of searches has to be minimized
– With a highly reliable model exploiting the Twitter Search and/or Stream API
• Performing NLP and Sentiment Analysis
– Real time or daily
– Multiple languages
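The search-minimization requirement can be illustrated with a minimal sketch. It assumes each request is a (user, query) pair and that two queries count as the same search when they match after trimming whitespace and case folding. The names `normalize` and `plan_searches` are hypothetical and are not part of Twitter Vigilance; the normalized string is used only as a grouping key.

```python
from collections import defaultdict


def normalize(query: str) -> str:
    """Grouping key: collapse whitespace and case (an assumed equivalence rule)."""
    return " ".join(query.split()).casefold()


def plan_searches(requests):
    """Group (user, query) requests so each distinct query is sent to the API once."""
    plan = defaultdict(set)
    for user, query in requests:
        plan[normalize(query)].add(user)
    return dict(plan)


# Hypothetical requests coming from two users.
requests = [
    ("alice", "#Caldo"),
    ("bob", "#caldo "),
    ("bob", "@CivilProtection"),
]
plan = plan_searches(requests)
print(len(requests), "requests ->", len(plan), "API searches")
```

Each key of `plan` is searched once, and the collected tweets are then shared among the users listed for that key.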
6. State of the art
| Service | Twitter metrics (e.g. # of tweets, retweets over time) | Sentiment analysis | NLP analysis | API availability | User network analysis | Data analysis based on geolocation | Real-time analytics | Full faceted search | Metrics for assessing recall efficiency | Minimization of searches to Twitter |
|---|---|---|---|---|---|---|---|---|---|---|
| SAS | N | Y | N | Y | N | N | Y | N | N | na |
| Keyhole | Aggregate | N | N | N | Aggregate | Y | N | N | N | na |
| Tweetreach | Aggregate | N | N | N | Aggregate | Y | N | N | N | na |
| Brandwatch | N | N | N | N | Y | Y | Y | N | N | na |
| Followerwonk | N | N | N | N | Y | Y | Y | N | N | na |
| Twitris | N | Y | N | N | N | Y | Y | N | N | na |
| OSoMe | Y | N | N | Y | Y | Y | Y | N | N | na |
| Twitter Vigilance | Y | Y | Y | Y | Y | N | Y | Y | Y | Y |
7. Twitter Vigilance Public Views
• TV: Twitter Vigilance main tool (http://disit.org/tv/), collecting and analyzing tweets daily;
• RTTV: Real-Time Twitter Vigilance (http://disit.org/rttv/), collecting and analyzing tweets in real time;
• TVSolr: Twitter Vigilance Advanced Search (http://tvsolr.disit.org/), indexing tweets and faceted search
8. Architecture
9. Twitter Vigilance Users can
• create and edit customized channels as a collection of searches on the API
– Per channel and per search
• The platform crawls tweets, computes metrics, and shows results on Twitter data: volume metrics about tweets, retweets and user statistics, and metrics based on NLP and Sentiment Analysis
• It provides public access to metric results computed on channels and searches
• It allows researchers to download the resulting metric values over time (through an API service) for further analysis
10. Several Channels
11. A Channel
Its searches
12. Twitter Syntax for Searches
• String substring: Caldo
• Hashtag: #Caldo
• Citations: @CivilProtection, @paolonesi
• From users: From:@paolonesi
• Etc.
• … combined with AND and OR
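A minimal sketch of how such search terms might be combined into one query string. It assumes the standard Twitter search conventions, in which space-separated terms are ANDed and the uppercase `OR` operator separates alternatives. The helper names `all_of` and `any_of` are hypothetical.

```python
def all_of(*terms: str) -> str:
    """AND: space-separated terms are treated as a conjunction (assumed convention)."""
    return " ".join(terms)


def any_of(*terms: str) -> str:
    """OR: alternatives joined by the uppercase OR operator, grouped in parentheses."""
    return "(" + " OR ".join(terms) + ")"


# Tweets containing the word or the hashtag, and citing @CivilProtection.
query = all_of(any_of("Caldo", "#Caldo"), "@CivilProtection")
print(query)  # (Caldo OR #Caldo) @CivilProtection
```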
13. Metrics’ Kinds
• Volume Metrics
– Number of tweets (TW), number of retweets (RTW)
• User Metrics
– Number of distinct users
– Number of followers, following
• NLP and SA metrics
– Counting words, adjectives, nouns, verbs, …
– Estimating SA, weighting with SentiWordNet (extended to Italian)
• High level metrics (composing all the other metrics)
– Addition of metrics
– Ratio among metrics, e.g.: num of TW/num of RTW, …
– Cumulated metrics over time, e.g.: number of TW in the last X days
• All: (i) per day, per hour, etc. (ii) per channel, per search
• Recently: we added the possibility of using metrics as firing conditions for alerts and bots on Twitter.
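The high-level metrics listed above can be sketched as follows. The daily counts are invented for illustration, and the function names, as well as the threshold form of the firing condition, are assumptions rather than the platform's actual definitions.

```python
from datetime import date, timedelta

# Hypothetical daily counts for one search: (day, tweets, retweets).
daily = [
    (date(2017, 8, 1), 120, 40),
    (date(2017, 8, 2), 300, 150),
    (date(2017, 8, 3), 90, 30),
]


def ratio_tw_rtw(tweets: int, retweets: int) -> float:
    """Ratio metric: number of TW divided by number of RTW."""
    return tweets / retweets if retweets else float("inf")


def cumulated_tw(rows, today: date, days: int) -> int:
    """Cumulated metric: number of TW in the last `days` days, `today` included."""
    start = today - timedelta(days=days - 1)
    return sum(tw for day, tw, _ in rows if start <= day <= today)


def should_alert(rows, today: date, days: int, threshold: int) -> bool:
    """Firing condition (assumed form): cumulated TW above a fixed threshold."""
    return cumulated_tw(rows, today, days) > threshold


today = date(2017, 8, 3)
print(ratio_tw_rtw(300, 150))              # 2.0
print(cumulated_tw(daily, today, 2))       # 390
print(should_alert(daily, today, 2, 350))  # True
```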
14. Problem addressed
Strong Limitations of the Search API of Twitter
• Minimizing the number of searches on the basis of the user requests:
– different users' queries request tweets already requested by others
• Recovering parent Tweets of orphan reTweets collected in the searching process
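Detecting orphan retweets can be sketched as a set difference. The example assumes each collected post is a dictionary with an `id` and, for retweets, a `retweeted_id` pointing to the parent tweet. These field names are hypothetical simplifications, not the Twitter API schema.

```python
def missing_parent_ids(posts):
    """Return IDs of original tweets referenced by retweets but absent from `posts`."""
    collected = {p["id"] for p in posts}
    parents = {p["retweeted_id"] for p in posts if p.get("retweeted_id") is not None}
    return sorted(parents - collected)


posts = [
    {"id": 1, "retweeted_id": None},  # original tweet
    {"id": 2, "retweeted_id": 1},     # retweet whose parent was collected
    {"id": 3, "retweeted_id": 99},    # orphan retweet: parent 99 is missing
]
print(missing_parent_ids(posts))  # [99]
```

The returned IDs are the parent tweets that would still have to be fetched to complete the collection.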
Analytics:
• High performance solution based on HDFS and Hadoop for NLP and SA, exploiting the MapReduce programming model
• Estimating the network of influencers
• Computing metrics and predictions in real time.
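The MapReduce programming model mentioned above can be illustrated with a single-process sketch that counts tokens in tweets. This is plain Python, not Hadoop code, and the whitespace tokenization is a simplifying assumption standing in for the real NLP step.

```python
from collections import defaultdict
from itertools import chain


def map_phase(tweet_text: str):
    """Map: emit a (token, 1) pair for every whitespace-separated token."""
    return [(token.lower(), 1) for token in tweet_text.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one token."""
    return key, sum(values)


tweets = ["Caldo oggi", "caldo anche domani"]
pairs = chain.from_iterable(map_phase(t) for t in tweets)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["caldo"])  # 2
```

In a cluster setting the map and reduce calls run in parallel on different nodes, while the grouping step is handled by the framework.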
15. Sentiment Analysis
17. Influence Network
18. Early Warning
Predictive models
Hot flows
Attendance at long-lasting events: EXPO2015
Attendance at recurrent events: TV, football
19. Efficiency in retrieval
| Posts volume (Tweets + Retweets) range | # Recovered original Tweets | # Missing original Tweets | % Original Tweets coverage (CoTWO) | # Twitter Search API requests | # Saturations on Twitter Search API requests | % Saturations on Twitter Search API requests (S%) | % Not-saturated Twitter Search API requests (1-S%) |
|---|---|---|---|---|---|---|---|
| < 10k | 18571 | 2033 | 89.05% | 124299 | 1 | 0.00% | 100.00% |
| [10k, 50k) | 130051 | 13716 | 89.45% | 399170 | 100 | 0.03% | 99.97% |
| [50k, 100k) | 96171 | 10278 | 89.31% | 123804 | 165 | 0.13% | 99.87% |
| [100k, 500k) | 997833 | 86755 | 91.31% | 849062 | 1589 | 0.19% | 99.81% |
| [500k, 1M) | 930646 | 61632 | 93.38% | 439956 | 1998 | 0.45% | 99.55% |
| [1M, 5M) | 6454463 | 439628 | 93.19% | 2787485 | 31585 | 1.13% | 98.87% |
| > 5M | 14714124 | 899035 | 93.89% | 4509184 | 64284 | 1.43% | 98.57% |
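The percentage columns of the table can be recomputed from the raw counts, as in the sketch below. S% is taken as saturations divided by requests. The coverage relation `1 - missing / recovered` is an inference that reproduces the tabulated CoTWO values; the slides do not state the formula, so it should be read as an assumption.

```python
# Rows copied from the table: (range, recovered, missing, requests, saturations).
rows = [
    ("< 10k", 18571, 2033, 124299, 1),
    ("[10k, 50k)", 130051, 13716, 399170, 100),
    ("[50k, 100k)", 96171, 10278, 123804, 165),
    ("[100k, 500k)", 997833, 86755, 849062, 1589),
    ("[500k, 1M)", 930646, 61632, 439956, 1998),
    ("[1M, 5M)", 6454463, 439628, 2787485, 31585),
    ("> 5M", 14714124, 899035, 4509184, 64284),
]


def saturation_pct(requests: int, saturations: int) -> float:
    """S%: share of Search API requests that hit saturation."""
    return 100.0 * saturations / requests


def coverage_pct(recovered: int, missing: int) -> float:
    """CoTWO as inferred from the table (assumption): 100 * (1 - missing / recovered)."""
    return 100.0 * (1.0 - missing / recovered)


for label, recovered, missing, requests, saturations in rows:
    s = saturation_pct(requests, saturations)
    c = coverage_pct(recovered, missing)
    print(f"{label:>13}  CoTWO={c:6.2f}%  S%={s:5.2f}%  1-S%={100 - s:6.2f}%")
```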
20. Original Tweets coverage and Twitter Search API
21. Dependence on RTW/TW ratio
22. Conclusions
• Twitter Vigilance has been operational for 2 years, with many institutional users: ARPAT, LAMMA, UNIFI, CNR, ...
• It achieves high efficiency in recovering Twitter data despite the complexity and limitations of the provided API.
• It has been used and validated with data coming from several scenarios and domains
• for early warning and prediction in the domains of:
– social communication, heat in Tuscany, rain measures, etc.
– Disaster alerts: water bomb (cloudburst)
– TV audience (X Factor, etc.), large events such as Expo 2015
• The new version provides direct metric estimation: metrics can be composed by users, and the resulting data can be downloaded