Modeling Spread of Disease from Social Interactions

Social Network Analysis
(15XWAH) Assignment
Presentation
Done By:
Prashanth Selvam A [16PW27]
Topic: Modeling Spread of Disease
from Social Interactions

+ Research in computational epidemiology – coarse-grained statistical analysis of populations.
+ Research paper – fine-grained modeling of the spread of infectious diseases in a large real-
world social network.
+ Study the roles that social ties and interactions between specific individuals play in the
progress of a contagion.
+ Twitter data - where we find that for every health related message there are more than 1,000
unrelated ones.
+ 2.5 million geo-tagged Twitter messages - social ties to infected, symptomatic people.
+ Co-location sharply increase one’s likelihood of contracting the illness in the near future.
Overview

Overview – Contd.
+ Work – first to model the interplay of social activity, human mobility, and the spread of
infectious disease in a large real-world population.
+ First quantifiable estimates of the characteristics of disease transmission on a large scale
without active user participation.
+ Aim: To model and predict the emergence of global epidemics from day-to-day
interpersonal interactions.
Visualization of a sample of friends in NYC. The
red links between users represent friendships,
and the colored pins show their current location
on a map. We see the highlighted person
complaining about her health, and hinting about
the specifics of her ailment.
This work investigates the impact of such a
person on the health of her friends, and of people
around her.

+ Online Social Media and NO Surveys & Clinical Data.
+ GPS-tagged messages to track user movement.
+ Identify other people who are likely to be in contact with infected person.
+ Note: Some of the people will tweet about how they feel, and we can observe at least a
fraction of the population that actually contracted the disease.
+ Research paper published in 2011 – Vision to make it a reality in the future. Now, Covid-19
pandemic is underway.
+ Traditionally, public health is monitored via surveys and by aggregating statistics obtained
from healthcare providers.
+ Methods are costly, slow, and may be biased. Affected people who do not seek treatment, or
do not respond to surveys are virtually invisible to the traditional methods.
Introduction

+ Recently, digital media - reduce the latency and improve the overall effectiveness of public
health monitoring.
+ Google Flu Trends - models the prevalence of flu via analysis of geo-located search queries.
+ Drawbacks with Google Flu Trends and Center for Disease Control and Prevention (CDC):
+ They produce only coarse, aggregate statistics, for example such as the expected number of people
afflicted by flu in Texas.
+ They often perform mere passive monitoring, and prediction is severely limited by the low resolution
of the aggregate approach.
+ Bottom-up approach - take into account the fine-grained interactions between individuals.
Introduction – Contd.

+ Machine learning techniques - detecting ill individuals based on the content of their Twitter
status updates.
+ Estimate the physical interactions between healthy and sick people.
+ Using Online activities, and model the impact of the interactions on public health.
+ Aim: Develop automated methods to identify disease vectors, trace the transmission
between concrete individuals, and to predict the spread of infectious diseases with fine
granularity.
Introduction – Contd.

+ Twitter - a popular micro-blogging service where people post message updates at most
140 characters long.
+ Note: Relationships between users on Twitter are not necessarily symmetric – Imbalance.
One can follow (subscribe to receive messages from) a user without being followed back.
+ When users do reciprocate following, we say they are friends on Twitter – affect offline
friendships.
+ As of March 2011, approximately 200 million accounts are registered on Twitter.
+ Twitter Search API - collected a sample of public tweets that originated from the NYC
metropolitan area.
Data

• Visualization of the social network consisting
of the geo-active users.
• Edges between nodes represent friendships
on Twitter.
• The coreness of nodes is color-coded using
the scale on the right.
• The degree of a node is represented by its
size shown on the left.
Summary statistics of the data collected from
NYC.

Phase 1 - Detecting Illness-Related Messages
+ Identify Twitter messages - author is infected with disease at the time of posting.
+ Health related tweets < Other tweets
+ Class imbalance - semi-supervised cascade-based approach using support vector
machine (SVM) classifier with a large area under the ROC curve.
+ SVM - linear binary classification to accurately distinguish between tweets indicating the
author is afflicted by an infectious ailment and all other tweets.
+ Labelled dataset is needed - Bootstrapping
+ Two different binary SVM classifiers - Cs and Co, using the SVMlight package.
+ Train Cs and Co using a dataset of 5,128 tweets, each labeled as either “sick” or “other”.
+ Apply model to 1.6 million tweets.
Methodology and Models

+ The cascade - 700 thousand “sick” messages and 3 million “other” tweets, used to train
the final SVM Cf.
+ Features - unigram, bigram, and trigram word tokens that appear in the training data.
+ Example: “I feel sick”
+ Before tokenization - lower case, strip punctuation and special characters, and remove
mentions of user names (the “@” tag).
+ Re-tweets are removed, health state hashtags # are retained (“#sick”).
+ Note: When learning the final SVM Cf, we only consider tokens that appear at least three
times in the training set.
Methodology and Models – Contd.

+ Feature with very high dimensionality containing many irrelevant features – SVM with a
linear kernel perform very well.
+ Class imbalance problem (number of tweets about an illness is much smaller than the
number of other messages) – ROCArea SVM learning method.
+ ROCArea method - implicitly transforming the classical SVM learning problem over
individual training examples into one over pairs of examples.
+ Efficient calculation of the area under the ROC curve from the predicted ranking of the
examples.

A diagram of our cascade learning of SVMs. The two symbols denote thresholding of the classification
score, where we select the bottom 10% of the scores predicted by Co and the top 10% of scores predicted
by Cs.

Phase 2 – Modeling the Spread of Disease
+ Human contact - spread of disease
+ Contact is often indirect, such as via a doorknob - General notion of colocation.
+ Two individuals are co-located if they visit the same 100 by 100 meter cell within a time
window of length T.
+ GPS sensor – geo-active individuals (calculate co-location with high accuracy).
+ Note: Most infectious illnesses produce influenza-like symptoms that stop within a few days,
and thus within these temporal bounds (friendship, colocation and health).
+ Twitter friendships - To quantify the effect of social ties on disease transmission.
+ Note: There are complex events and interactions that take place “behind the scenes”, which
are not directly recorded in online social media (Cons).

Phase 2 – Modeling the Spread of Disease
+ Social ties to infected people - increases your chances of becoming ill in the near future.
+ Note: Social ties themselves do not cause or even facilitate the spread of an infection.
+ Twitter friendships - proxies and indicators for a complex set of phenomena that may not
be directly accessible.
+ For example, friends often eat out together, meet in classes, share items, and travel
together (NOT during Covid-19 Pandemic).
+ These events are never explicitly mentioned online, they are crucial from the disease
transmission perspective.

+ Only public tweets - NO private tweets.
+ Users talking about their health.
+ Ability to identify the messages in the flood of other types of messages.
+ Note: 1 out of 30 residents of NYC appears in our dataset.
+ Focus on the geo-active individuals, the ratio is 1:3000.
+ Need a hybrid solution – This + Traditional methods + Google Flu Trends
Infected people who do not visit a doctor, or do not respond to surveys are virtually
invisible to the traditional methods.
Google Flu Trends can only observe individuals who search the web for certain types of
content when sick.
Limitations

Experiments and Results
Top twenty most significant negatively and positively
weighted features of the SVM model.
• Precision = 0.98
• Recall = 0.97
• Correlation (Our model, Google Flu Trends) = 0.73
• Google Flu Trends - Only Flu and related illnesses
• Our model – broad, many illnesses
• The work is a major step towards prediction of
actual threats and instances of disease outbreaks.

Example tweets that the SVM model Cf identified as “sick”.

+ Twitter data - valuable information channel for predicting likelihood of a pandemic.
+ Building effective system is a challenging process.
+ Language models
+ Variant of topic models - symptoms and possible treatments for ailments, such as
traumatic injuries and allergies, that people discuss on Twitter.
+ Research paper - detailed study of the interplay among human mobility, social structure,
and disease transmission.
+ Finally, The framework allows us to track—without active user participation—specific likely
events of contagion between individuals, and model the relationship between an epidemic
and self-reported symptoms of actual users of online social media.
+ Can be applied to depression, not just FLU.
+ Suicidal tendency detection.
Experiments and Results

+ This work is the first to take on modeling the spread of infectious diseases throughout a
real-world population with fine granularity.
+ Model the joint influence of co-location and social ties, and quantify the latent impact of
friendships.
+ An early identification of infected individuals is especially crucial in preventing and
containing devastating disease outbreaks.
+ Note: Applying machine learning techniques will significantly expand the breadth of data
available by allowing us to consider not only declared friendships and public check-ins,
but also their inferred— though more ambiguous—counterparts.
Conclusions and Future Work

+ The most effective way to fight an epidemic in urban areas - confine infected individuals to
their homes.
+ Strategy is effective only early during outbreak.
+ Vaccination ranks second in effectiveness.
+ Social Media – boon during epidemics (COVID-19 pandemic)
+ On par with Google Flu Trends – in terms of efficiency.
+ Future work - focus on larger geographical areas (including airplane travel), while
maintaining the same level of detail.
+ AI techniques - revealing hidden social ties and predicting user location.
Conclusions and Future Work – Contd.

+ Modeling Spread of Disease from Social Interactions: Adam Sadilek, Henry Kautz and
Vincent Silenzio.
+ Anderson, R., and May, R. 1979. Population biology of infectious diseases: Part I. Nature
280(5721):361.
+ Backstrom, L., and Leskovec, J. 2011. Supervised random walks: predicting and
recommending links in social networks. In WSDM 2011, 635–644. ACM.
+ Beir´o, M.; Alvarez-Hamelin, J.; and Busch, J. 2008. A low complexity visualization tool
that helps to perform complex systems analysis. New Journal of Physics 10:125003.
+ Bettencourt, L., and West, G. 2010. A unified theory of urban living. Nature
467(7318):912–913.
References

Modeling Spread of Disease from Social Interactions

Recommended

Recommended

More Related Content

What's hot

What's hot (7)

Similar to Modeling Spread of Disease from Social Interactions

Similar to Modeling Spread of Disease from Social Interactions (20)

Recently uploaded

Recently uploaded (20)

Modeling Spread of Disease from Social Interactions