3. Google Flu Trends
J. Ginsberg, et al.,
Detecting influenza epidemics
using search engine query data,
Nature, February 2009
Link: www.google.com/flutrends
4. Application in Movie Industry
Movie "The Avengers": production cost of US$200 million
How can we learn the audience's interests and reactions?
How can we set the best marketing strategy?
7. Architecture for Big Data Analytics
[Figure: analytics architecture on a High-Performance Computing Platform (Cloud, Stream, In-Memory, …)]
Stages shown: Data Access, Data Preparation, Data Modeling, Deploy, Presentation/Applications
Input data: structured and unstructured
Components: Data Preparation Components; Mining & Learning (Data Mining, Text Mining, Machine Learning, Statistical Learning); Rules Retrieve Components; Prediction Components; Applications Module
Outputs: predictive models/rules, interesting patterns (clusters, associations, …), reports
8. Tackling Some Key Challenges
Data Preprocessing Phase
Data quality problem: Noise, Incompleteness, Sparsity
Veracity issue: Is bigger always better?
Data Understanding Phase
Key Features Discovery: Finding the needle in a haystack
Learning and Modeling Phase
Timeliness vs. Precision: Issues for data sampling
Need for more sophisticated methodologies
Post-processing Phase
15. Big Data in Netflix
62M+ subscribers in over 50 countries
4M ratings/day
3M searches/day
30M+ plays/day
Streaming hours
2B hours in Q1/2012
10B hours in Q1/2015
16. Netflix Prize
Grand Prize: $1M USD for a 10% improvement in
prediction accuracy
Progress Prize: $50,000 USD every year
Started Oct. 2, 2006
Ends Oct. 2, 2011,
or when some team reaches the 10% goal
(Ref: Netflix 2012)
17. Recommendation Problem:
Collaborative Filtering-based Methods
User-based Collaborative Filtering
        itm1  itm2  itm3  itm4  itm5
Andre     ?     1     1     4     5
Ben       1     2     0     2     0
Juice     3     1     2     4     5
David     1     1     0     1     0
Item-based Collaborative Filtering
        itm1  itm2  itm3  itm4  itm5
Andre     ?     1     0     4     5
Ben       1     2     0     2     0
Juice     3     1     2     4     5
David     1     1     0     1     0
Unifying User-based and Item-based Collaborative Filtering
        itm1  itm2  itm4  itm3  itm5
Andre     ?     1     4     0     5
Ben       1     2     2     0     0
Juice     3     1     2     4     5
David     1     1     1     0     0
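As a rough illustration of user-based collaborative filtering on the rating table, the sketch below predicts Andre's missing itm1 rating as a similarity-weighted average of the other users' itm1 ratings. Several choices are assumptions made only for illustration and are not stated on the slide: cosine similarity over the remaining items, treating 0 as an ordinary rating value, using the item-based table's row for Andre, and the helper names `cosine` and `predict_user_based`.

```python
from math import sqrt

# Ratings for itm1..itm5 as listed on the slide; None marks Andre's unknown rating.
ratings = {
    "Andre": [None, 1, 0, 4, 5],
    "Ben":   [1, 2, 0, 2, 0],
    "Juice": [3, 1, 2, 4, 5],
    "David": [1, 1, 0, 1, 0],
}

def cosine(u, v, skip):
    """Cosine similarity of two rating rows, ignoring the item at index `skip`."""
    pairs = [(a, b) for i, (a, b) in enumerate(zip(u, v)) if i != skip]
    dot = sum(a * b for a, b in pairs)
    norm_u = sqrt(sum(a * a for a, _ in pairs))
    norm_v = sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict_user_based(ratings, target, item):
    """Return (predicted rating, similarities) for `target` on `item`."""
    sims = {
        user: cosine(ratings[target], row, item)
        for user, row in ratings.items()
        if user != target
    }
    weight = sum(abs(s) for s in sims.values())
    score = sum(s * ratings[user][item] for user, s in sims.items())
    return (score / weight if weight else None), sims

prediction, similarities = predict_user_based(ratings, "Andre", 0)
print(similarities)  # similarity of Andre to each other user
print(prediction)    # similarity-weighted estimate of Andre's itm1 rating
```

Item-based filtering follows the same pattern with the roles of rows and columns swapped: similarities are computed between item columns instead of user rows.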
18. Netflix Analytics Work
Dataset consists of 100M+ training entries
Each training entry is a quadruplet
<user, movie, date, grade>, each an integer
The qualifying dataset consists of 2.8M entries
<user, movie, date> without grades
Error measure: RMSE (root mean square error)
19. RMSE Scores
0.8563 (10%) Grand Prize
0.8643 (9.15%) Leader
0.8667 (8.9%) Current progress
0.8712 (8.43%) Progress Prize Winner 2007
0.9514 (0%) Netflix Cinematch
1.0540 (-10.78%) Movie Average
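The percentages in this list are relative RMSE improvements over the Netflix Cinematch baseline (0.9514). The short sketch below recomputes them and shows how RMSE itself is calculated. The function names and the three sample ratings are made up for illustration.

```python
from math import sqrt

def rmse(predicted, actual):
    """Root mean square error between predicted and actual ratings."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def improvement_pct(score, baseline=0.9514):
    """Relative RMSE improvement (in percent) over the Cinematch baseline."""
    return (baseline - score) / baseline * 100

# RMSE scores taken from the slide.
scores = {
    "Grand Prize": 0.8563,
    "Leader": 0.8643,
    "Progress Prize Winner 2007": 0.8712,
    "Movie Average": 1.0540,
}
for name, score in scores.items():
    print(f"{name}: {improvement_pct(score):.2f}%")

# Toy example with made-up ratings: errors of -0.5, 0 and 1.
print(rmse([3.5, 4.0, 2.0], [4, 4, 1]))
```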
21. Challenges
Data Sparsity Problem
Highly sparse data & cold-start problem:
traditional approaches like CF are not feasible
→ Need specialized methods
Netflix Prize winner: BellKor's Pragmatic Chaos
Gap between complex models and deployment
Winner’s solution: complex composition of
hundreds/thousands of learned models
→ Hard to deploy in real applications
Similar scenarios exist in many big data
applications, and effective solutions are desired!
23. Is bigger always better?
- Veracity issue
24. Google Flu Trends
J. Ginsberg, et al.,
Detecting influenza epidemics
using search engine query data,
Nature, February 2009
Link: www.google.com/flutrends
25. Google Flu Trends -- Idea
• Certain Web search terms are good indicators of flu activity.
• Google Flu Trends uses aggregated search data on flu indicators.
• It estimates current flu activity around the world in real time.
• For example: Google Flu Trends detects increased flu activity two weeks before the CDC.
*CDC: Centers for Disease Control and Prevention
26. Google Flu Trends -- Model
Data:
Look at all search queries in Google from 2003 to 2008
Several hundred billion individual searches
in the United States
Keep track of only the 50 million most
common queries
Keep a weekly count for each query
Also keep counts of each query by geographic region
(requires use of geo-location from IP addresses: >95% accurate)
So counts for 50 million queries x 170 weeks x 9 regions
Target variable to be predicted:
For each week, for each region
I(t) = percentage of physician visits that are ILI (influenza-like illness, as compiled by CDC)
Input variable (query selection: constructing the ILI-related query fraction):
Q(t) = sum of top n highest correlated queries
/ total number of queries that week
“Model learning” (logistic regression):
log( I(t) / [1 – I(t)] ) = log( Q(t) / [1 – Q(t)] ) + noise
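To make the "model learning" step concrete, here is a minimal sketch that fits a straight line on the logit scale by ordinary least squares. It assumes an intercept `b0` and slope `b1`, which the equation on the slide does not show. The query fractions and the "true" coefficients are synthetic values invented for this example, not Google or CDC data.

```python
from math import exp, log

def logit(p):
    """Log-odds transform used on both sides of the model."""
    return log(p / (1 - p))

def inv_logit(x):
    """Inverse of logit: maps a real number back to a fraction in (0, 1)."""
    return 1 / (1 + exp(-x))

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

# Synthetic weekly ILI-related query fractions Q(t) (made up).
q = [0.0010, 0.0015, 0.0022, 0.0030, 0.0041]
# Hypothetical coefficients used to generate noise-free ILI fractions I(t).
true_b0, true_b1 = 1.5, 0.9
ili = [inv_logit(true_b0 + true_b1 * logit(v)) for v in q]

b0, b1 = fit_line([logit(v) for v in q], [logit(v) for v in ili])
print(b0, b1)

# Estimate I(t) for a new week from its query fraction.
print(inv_logit(b0 + b1 * logit(0.0035)))
```

With real data the noise term would keep the fit from being exact, and the choice of n (how many top-correlated queries enter Q(t)) would be tuned on held-out weeks.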
27. The Parable of Google Flu: Traps in Big
Data Analysis (Science, Mar. 2014)
30. A large-scale research initiative aimed at
Innovations around smartphone-based research
Collect smartphone data in everyday life conditions
Community-based evaluation of related mobile data analysis
methodologies
Data source: Lausanne Data Collection Campaign
31. User Profile/Behavior Modeling and Prediction
Personal information
Media files
Device information
Process
Calendar
Applications
Social information
Accelerometer
System Information
Location information
Call log
Contacts
Bluetooth
GSM
WLAN
Sequence of place visits
32. MDC 2012 Tracks
Main Goals
User Profile/Behavior Modeling and Prediction
Dedicated Track
Demographic attribute prediction
Predict gender, age group, marital status, job type, etc.
of a user
Semantic place prediction
Predict the semantic meaning of a user’s visited places
Next place prediction
Predict the next destination of a user
36. Demographic Attribute Prediction
Lots of features could be extracted from the data
10,000+ features used by the winning team!
High accuracy achieved: 96%
[Figure: feature groups: location features, media features, sensor features, …]
37. Very high dimensional complexity
- Feasibility problem in real applications!
Is there some key/dominating feature?
[Figure: feature groups: location features, media features, sensor features, …]
38. Demographic Attribute Prediction (cont.)
Accelerometer is actually a key/dominating feature!
It alone supports an accuracy of around 95%
Underlying reasoning?
42. One Solution: Data Sampling
- Bias on Data Samples
Twitter provides two main outlets for researchers to
access tweets in real time:
Streaming API (~1% of all public tweets, free)
Firehose (100% of all public tweets, costly)
Streaming API data is often used by researchers to
validate hypotheses.
How well does the sampled Streaming API data measure
the true activity on Twitter?
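One simple way to probe the question above is to compare the most frequent hashtags in a sample against those in the full stream. The sketch below does this on a synthetic stream with a 1% uniform random sample. Uniform sampling, the top-k overlap measure, and the generated hashtag counts are all assumptions made for illustration and are not taken from the cited study. A real Streaming API sample need not be uniform, which is exactly where bias can enter.

```python
import random
from collections import Counter

def top_k(tags, k):
    """Set of the k most frequent hashtags."""
    return {tag for tag, _ in Counter(tags).most_common(k)}

def top_k_overlap(full, sample, k=5):
    """Jaccard overlap between the top-k hashtags of the full data and the sample."""
    a, b = top_k(full, k), top_k(sample, k)
    return len(a & b) / len(a | b)

# Synthetic "Firehose": hashtag i appears 20000 // (i + 1) times (made-up counts).
stream = [f"#tag{i}" for i in range(50) for _ in range(20000 // (i + 1))]

# Synthetic "Streaming API": a 1% uniform random sample with a fixed seed.
rng = random.Random(0)
sample = rng.sample(stream, len(stream) // 100)

overlap = top_k_overlap(stream, sample)
print(len(stream), len(sample), overlap)
```

An overlap well below 1.0 on real data would suggest that conclusions drawn from the sample may not hold for the full stream.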
43. Bias on Data Samples (cont.)
Source: [Huan Liu et al. AAAI ICWSM 2013]
45. National Health Insurance Research Database (NHIRD)
in Taiwan
National Health Insurance (NHI)
Established on March 1, 1995
Serves 99.2% of the Taiwanese population (20M+)
Covers 92.62% of medical institutions
Longitudinal Health Insurance Database (LHID)
sampled from NHIRD
Includes the health records of 951,044 people
1997 – now
Strongly representative of Taiwan
Every living region
Long time span
15+ years
Reference: National Health Insurance, http://www.nhi.gov.tw
46. Linking with More Heterogeneous Datasets
[Figure: the CR, NHI, COD and BR datasets linked with environmental monitoring data, lab data & patient-reported outcomes, and sensor-based biomarker monitoring data through cloud computing, feeding a Smart Health Risk Alert]
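49. Disease Factor Analysis
Linked data is biased!
[Figure: daily number of patient visits for disease X, linked by monitoring station and date with atmospheric environment data and air pollution data from monitoring stations]
Using the LHID2000 one-million-person sample file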
52. Goal
• How to do POI recommendation by utilizing a user’s
social network log (e.g., check-ins)?
[Figure: numbered POIs (1 to 10) with markers S and p]
53. Urban Point-of-Interest Recommendation by
Mining User Check-in Behaviors
Josh Jia-Ching Ying, Eric Hsueh-Chan Lu, Wen-Ning Kuo
and Vincent S. Tseng
2012 ACM SIGKDD Int’l Workshop on Urban Computing
(UrbComp 2012)
54. Proposed Method – UPOI-Mine
[Figure: UPOI-Mine framework]
Feature Extraction: LBSN Dataset; Social Factor, Individual Preference, POI Popularity
Relevance Learning: User-POI Graph Construction; User-POI Relevance Matrix
POI Recommendation: User Request → Top-k Nearest POI selection → Top-k Nearest POIs → POI Ranking → POI Recommending List
55. Social Factor (SF)
$SF(user_i, POI_j) = \sum_{k \in F} [Relation_{i,k} \times Interest_{k,j}]$
$Relation_{i,k} = w \times CheckSim_{i,k} + (1 - w) \times DisSim_{i,k}$, where $w$ is a weight
$Interest_{k,j} = \dfrac{checkin_{k,j}}{\sum_{s=1}^{|S|} checkin_{k,s}}$
F: friends of user i
S: the set of POIs
$checkin_{k,*}$ = check-ins of user k at POI *
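A small sketch of how the Social Factor could be computed from these definitions: each friend's relation to the user weights that friend's share of check-ins at the POI. Combining Relation and Interest by multiplication and summing over friends is an assumption about the slide's formula. The dictionaries, similarity values, and user/POI names are hypothetical.

```python
def interest(checkins, k, j):
    """Share of user k's check-ins that happened at POI j."""
    total = sum(checkins[k].values())
    return checkins[k].get(j, 0) / total if total else 0.0

def relation(check_sim, dis_sim, w):
    """Weighted mix of check-in similarity and distance similarity."""
    return w * check_sim + (1 - w) * dis_sim

def social_factor(i, j, friends, checkins, check_sim, dis_sim, w=0.5):
    """Sum over friends k of Relation(i, k) * Interest(k, j)."""
    return sum(
        relation(check_sim[(i, k)], dis_sim[(i, k)], w) * interest(checkins, k, j)
        for k in friends[i]
    )

# Hypothetical toy data.
friends = {"u1": ["u2", "u3"]}
checkins = {"u2": {"cafe": 3, "park": 1}, "u3": {"cafe": 1, "museum": 1}}
check_sim = {("u1", "u2"): 0.8, ("u1", "u3"): 0.4}
dis_sim = {("u1", "u2"): 0.6, ("u1", "u3"): 0.2}

sf = social_factor("u1", "cafe", friends, checkins, check_sim, dis_sim, w=0.5)
print(sf)  # 0.7 * 0.75 + 0.3 * 0.5
```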
57. POI Popularity (PP)
Relative Popularity of POI
Normalized based on category
$RP_j = \dfrac{checkins_j}{\sum_{k \in CS_j} checkins_k}$,
where $CS_j$ is the set of POIs that are in the same category as $POI_j$.
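The category normalization can be sketched in a few lines: a POI's check-in count is divided by the total check-ins of all POIs in its category. Using a sum over the category as the normalizer is an assumption about the slide's formula, and the POI names and counts are invented.

```python
def relative_popularity(poi, checkins, category):
    """Check-ins at `poi` divided by all check-ins in the same category."""
    same_category = [p for p in checkins if category[p] == category[poi]]
    total = sum(checkins[p] for p in same_category)
    return checkins[poi] / total if total else 0.0

# Hypothetical check-in counts and categories.
checkins = {"cafe_a": 30, "cafe_b": 10, "park_a": 5}
category = {"cafe_a": "cafe", "cafe_b": "cafe", "park_a": "park"}

print(relative_popularity("cafe_a", checkins, category))  # 30 / (30 + 10)
print(relative_popularity("park_a", checkins, category))  # only POI in its category
```

Normalizing within a category keeps a modestly visited park from being drowned out by cafes, which attract far more check-ins overall.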
58. Relevance Estimation
Target: To estimate the relevance of each user-POI pair,
we use these features to learn a Regression-Tree
Model.
User ID  POI ID  SF     PP    IP     Relevance
1        A       0.2    0.1   0.001  3
1        B       0.05   0.2   0.1    5
1        C       0.004  0.1   0.9    1
…        …       …      …     …      …
N        D       0.5    0.15  0.06   2
→ Regression-Tree Model
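To show what learning a regression tree on such a table involves, here is a toy CART-style tree written from scratch. It picks, at each node, the feature and threshold that minimize the squared error of the two resulting groups. This is a generic sketch and not necessarily the learner the authors used. The function names are hypothetical, and only the four rows visible on the slide are used as training data.

```python
def sse(ys):
    """Sum of squared errors around the mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def build_tree(rows, ys, depth=0, max_depth=3):
    """Return a leaf value (float) or a node (feature, threshold, left, right)."""
    if depth == max_depth or len(set(ys)) <= 1:
        return sum(ys) / len(ys)
    best = None
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between adjacent values
            left = [y for r, y in zip(rows, ys) if r[f] <= t]
            right = [y for r, y in zip(rows, ys) if r[f] > t]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, f, t)
    if best is None:  # no feature can split these rows
        return sum(ys) / len(ys)
    _, f, t = best
    li = [i for i, r in enumerate(rows) if r[f] <= t]
    ri = [i for i, r in enumerate(rows) if r[f] > t]
    return (
        f,
        t,
        build_tree([rows[i] for i in li], [ys[i] for i in li], depth + 1, max_depth),
        build_tree([rows[i] for i in ri], [ys[i] for i in ri], depth + 1, max_depth),
    )

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's value."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# (SF, PP, IP) feature rows and relevance scores from the slide's table.
rows = [(0.2, 0.1, 0.001), (0.05, 0.2, 0.1), (0.004, 0.1, 0.9), (0.5, 0.15, 0.06)]
relevance = [3, 5, 1, 2]

tree = build_tree(rows, relevance)
print(predict(tree, (0.05, 0.2, 0.1)))
```

With only four training rows the tree simply memorizes them. On a real LBSN dataset one would limit the depth or leaf size and evaluate on held-out user-POI pairs.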
59. Experimental Evaluation
Real dataset crawled from Gowalla
in the New York City area
1,964,919 POIs
18,159 people
5,341,191 check-ins
392,246 friendship links
61. Better way for modeling?
- UPOI-Walk
In ACM Transactions on Intelligent Systems and Technology, 2014
62. Motivation
The existing models could not deal with such
heterogeneous features well
The existing models try to combine all features into
one measure for building a single model → Bias!
[Figure: UPOI-Walk framework]
Feature Extraction: LBSN Dataset; Social Factor, Individual Preference, POI Popularity
Relevance Learning: User-POI Graph Construction; User-POI Graphs; HITS-based Random Walk; User-POI Relevance Matrix
POI Recommendation: User Request → Top-k Nearest POI selection → Top-k Nearest POIs → POI Ranking → POI Recommending List
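64. HITS-based Random Walk
X. Cao, et al., “Mining significant semantic
locations from GPS data,” Proceedings of the
VLDB Endowment, v.3 n.1-2, September 2010
[Figure: example graph with edge weights 0.1, 0.2, 0.3, 0.4]
Given an m × n hits value matrix M, the HITS-based random walk alternates two updates:
$x_{POI}^{(k+1)}$ is computed from $M_{col}^{T}$ and $x_{user}^{(k)}$
$x_{user}^{(k+1)}$ is computed from $M_{row}$ and $x_{POI}^{(k+1)}$
Each update also contains a $(1 - \cdot)$ weighted term (indexed 1 and 2, respectively) whose symbols are not legible in this transcript.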
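The alternating structure of the update can be illustrated with a plain HITS-style iteration: POI scores are recomputed from user scores through the transposed matrix, then user scores from POI scores, until the values stop changing. This sketch deliberately leaves out the additional (1 − ·) mixing term of the slide's equations, uses the raw matrix instead of its column- or row-specific forms, and rescales scores to sum to 1. All of these are simplifying assumptions, and the 3 × 3 matrix is made up.

```python
def normalize(v):
    """Scale a non-negative vector so that its entries sum to 1."""
    total = sum(v)
    return [x / total for x in v] if total else v

def hits_scores(M, tol=1e-9, max_iter=1000):
    """Alternate POI and user score updates on an m x n hits value matrix M."""
    m, n = len(M), len(M[0])
    x_user = [1.0 / m] * m
    x_poi = [1.0 / n] * n
    for _ in range(max_iter):
        # POI scores from user scores (multiplication by M transposed).
        new_poi = normalize(
            [sum(M[i][j] * x_user[i] for i in range(m)) for j in range(n)]
        )
        # User scores from the freshly updated POI scores.
        new_user = normalize(
            [sum(M[i][j] * new_poi[j] for j in range(n)) for i in range(m)]
        )
        delta = max(abs(a - b) for a, b in zip(new_user + new_poi, x_user + x_poi))
        x_user, x_poi = new_user, new_poi
        if delta < tol:
            break
    return x_user, x_poi

# Hypothetical check-in counts: rows are users, columns are POIs.
M = [[3, 1, 0],
     [1, 2, 0],
     [0, 1, 4]]

x_user, x_poi = hits_scores(M)
print(x_user)
print(x_poi)
```

The resulting POI scores can serve as relevance values for ranking, with higher-scoring POIs being those visited by higher-scoring users.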
65. Dynamic HITS-Based Random Walk
Network Set = {M, N, X, Y, Z, …}
Randomly select hits value matrices from the Network Set, one per update step
[Equations: the same alternating POI/user updates as the HITS-based random walk, using in turn $M_{col}^{T}$ (POI update), $N_{row}$ (user update), $X_{col}^{T}$ (POI update), $Y_{row}$ (user update), $Z_{col}^{T}$ (POI update), …]
Repeat until converged
71. Early Prediction of Diseases
Huizinga, T. W. J., & van der Helm-van Mil, A. H. M. (2007). Prediction and prevention of rheumatoid
arthritis. Revista Colombiana de Reumatología, 14(2), 106-114.
[Figure: timeline with labels Very Early Detection (~X years), Early RA (18 months), Early RA (12 months), RA Diagnosis]
72. Analytics Framework
[Figure: analytics framework]
Off-line: raw data → target data → preprocessed data → data mining techniques → discovered rules / classifier
On-line: health records of a potential patient → Morbidity Risk Prediction System → predicted risk → doctor / hospital
79. Concluding Remarks:
Grand Opportunities
“Data is King”: Age of data monetization
Data vs. Ideas vs. Technologies
From Data to Idea
From Idea to Data
Utilization of the right technologies
Visioning
Those who own valuable data can be king
Those who own no data but have innovative ideas can also be king
Innovative Ideas + Right Tech on Valued Data =>
Smart King