Innovation Opportunities and Key Challenges of Big and Open Data (巨量與開放資料之創新機會與關鍵挑戰)
Vincent S. Tseng (曾新穆)
Department of Computer Science
National Chiao Tung University
Taiwan

Starting with Some
Innovative Applications

Google Flu Trends
J. Ginsberg, et al.,
Detecting influenza epidemics
using search engine query data,
Nature, February 2009
Link: www.google.com/flutrends

Application in Movie Industry
• The film "The Avengers" (復仇者聯盟): production cost of US$200 million
• How can the studio learn how audiences are reacting?
• How should it set the best marketing strategy?

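Application in Movie Industry (cont.)
• Big Data Analytics was used to monitor and analyze social media reactions to the movie trailer:
  - 1.1 billion tweets/min
  - 5.7 million blog posts/min
  - 3.5 million messages/min
  - Extract key information, analyze topics, and judge users' intentions → summarize how users view and rate the trailer
• The studio adjusted its marketing strategy based on the analysis results
• Box office of "The Avengers":
  - After its release in May 2012, the US domestic opening-week box office reached US$200 million (equal to the production cost), the highest opening week in US film history
  - Total 2012 box office reached US$1.5 billion, ranking third in world film history behind "Avatar" and "Titanic"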

Architecture for Big Data Analytics
[Figure: layered architecture on a High-Performance Computing Platform (Cloud, Stream, In-Memory, …). Stages: Data Access, Data Preparation, Data Modeling, Presentation/Applications. Components: Data Preparation Components; Mining & Learning Components (Data Mining, Text Mining, Machine Learning, Statistical Learning); Rules Retrieve Components; Prediction Components; Applications Module. Input: structured and unstructured data. Outputs: predictive models, rules, clusters, associations, interesting patterns, reports.]

Tackling Some Key Challenges
• Data Preprocessing Phase
  - Data quality problem: noise, incompleteness, sparsity
  - Veracity issue: Is bigger always better?
• Data Understanding Phase
  - Key feature discovery: finding the needle in a haystack
• Learning and Modeling Phase
  - Timeliness vs. precision: issues for data sampling
  - Need for more sophisticated methodologies
• Post-processing Phase

Some Key Challenges

Personalized Recommender Systems

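Personalized Recommender Systems (cont.)
• Recommender systems and filtering systems
• Big Data Analytics is used to analyze customer preferences
• Offering less popular titles balances and satisfies customer demand; less popular titles account for 70% of rentals
• When a little-known movie recommended to you turns out to be very good, the feeling is incomparable
• Three quarters of recommended titles are rated higher than the newest releases; this is the real value of a recommender system
• The world's largest movie-rating database, far beyond the service value competitors can offer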

Big Data in Netflix
• 62M+ subscribers in over 50 countries
• 4M ratings/day
• 3M searches/day
• 30M+ plays/day
• Streaming hours
  - 2B hours in Q1/2012
  - 10B hours in Q1/2015

Netflix Prize
• Grand Prize: $1M USD for a 10% improvement in prediction accuracy
• Progress Prize: $50,000 USD every year
• Since Oct. 2, 2006
• Ends Oct. 2, 2011
  - Or when some team reaches the 10% goal
(Ref: Netflix 2012)

Recommendation Problem:
- Collaborative Filtering-based Methods

User-based Collaborative Filtering
        itm1  itm2  itm3  itm4  itm5
Andre   ?     1     1     4     5
Ben     1     2     0     2     0
Juice   3     1     2     4     5
David   1     1     0     1     0

Item-based Collaborative Filtering
        itm1  itm2  itm3  itm4  itm5
Andre   ?     1     0     4     5
Ben     1     2     0     2     0
Juice   3     1     2     4     5
David   1     1     0     1     0

Unifying User-based and Item-based Collaborative Filtering
        itm1  itm2  itm4  itm3  itm5
Andre   ?     1     4     0     5
Ben     1     2     2     0     0
Juice   3     1     2     4     5
David   1     1     1     0     0
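
As a rough illustration of the user-based variant, the sketch below predicts Andre's missing rating for itm1 from the user-based table. The choice of cosine similarity and a similarity-weighted average is an assumption of this sketch (the slide does not name a similarity measure), and a 0 is treated as an actual low rating rather than as "unrated".

```python
import math

# Ratings from the user-based table; None marks the rating to predict.
# Assumption: 0 is a real (low) rating, not a missing value.
ratings = {
    "Andre": [None, 1, 1, 4, 5],
    "Ben":   [1, 2, 0, 2, 0],
    "Juice": [3, 1, 2, 4, 5],
    "David": [1, 1, 0, 1, 0],
}

def cosine_similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    dot = sum(a * b for a, b in pairs)
    norm_u = math.sqrt(sum(a * a for a, _ in pairs))
    norm_v = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict_user_based(ratings, target, item_index):
    """Similarity-weighted average of the other users' ratings for one item."""
    weighted_sum = 0.0
    weight_total = 0.0
    for user, row in ratings.items():
        if user == target or row[item_index] is None:
            continue
        sim = cosine_similarity(ratings[target], row)
        weighted_sum += sim * row[item_index]
        weight_total += abs(sim)
    return weighted_sum / weight_total if weight_total else None

prediction = predict_user_based(ratings, "Andre", 0)
print(round(prediction, 2))
```

Because Juice's ratings are closest to Andre's, the prediction is pulled toward Juice's rating of 3 and away from the ratings of 1 given by Ben and David.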

Netflix Analytics Work
• Dataset consists of 100M+ training entries
  - Each training entry is a quadruplet <user, movie, date, grade>, each field an integer
• The qualifying dataset consists of 2.8M entries
  - <user, movie, date> without grade
• Error measure: RMSE (root mean square error)

RMSE Scores
• 0.8563 (10%) Grand Prize
• 0.8643 (9.15%) Leader
• 0.8667 (8.9%) Current progress
• 0.8712 (8.43%) Progress Prize Winner 2007
• 0.9514 (0%) Netflix Cinematch
• 1.0540 (-10.78%) Movie Average
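
The percentages next to each score can be read as relative RMSE improvement over Netflix Cinematch (0.9514). The following minimal sketch shows both calculations; the function names are illustrative and the sample ratings in the usage line are made up.

```python
import math

def rmse(predicted, actual):
    """Root mean square error between predicted and actual grades."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

CINEMATCH_RMSE = 0.9514  # baseline score listed on the slide

def improvement_pct(score, baseline=CINEMATCH_RMSE):
    """Relative improvement over the baseline, in percent."""
    return (baseline - score) / baseline * 100.0

# Made-up grades, only to show the call.
print(rmse([3, 4], [3, 2]))

# Scores from the slide.
for score in (0.8563, 0.8643, 0.8667, 0.8712, 1.0540):
    print(score, round(improvement_pct(score), 2))
```

A lower RMSE is better, which is why the simple movie-average predictor shows a negative improvement.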

2009 Grand Prize
Winner:
BellKor's Pragmatic Chaos

Challenges
• Data sparsity problem
  - Highly sparse data and the cold-start problem: traditional approaches like CF are not feasible → need specialized methods
  - Netflix Prize winner: Pragmatic Chaos Theory
• Gap between complex models and deployment
  - Winner's solution: a complex composition of hundreds or thousands of learned models → hard to deploy in real applications
• Similar scenarios exist in many big data applications, and effective solutions are desired!


Is bigger always better?
-- Veracity issue

Google Flu Trends -- Idea
• Certain web search terms are good indicators of flu activity.
• Google Flu Trends uses aggregated search data on flu indicators.
• It estimates current flu activity around the world in real time.
• For example: Google Flu Trends detects increased flu activity two weeks before the CDC.
*CDC: Centers for Disease Control and Prevention

Google Flu Trends -- Model
• Data:
  - All search queries in Google from 2003 to 2008
  - Several hundred billion individual searches in the United States
  - Keep track of only the 50 million most common queries
  - Keep a weekly count for each query
  - Also keep counts of each query by geographic region (requires geo-location from IP addresses: >95% accurate)
  - So: counts for 50 million queries x 170 weeks x 9 regions
• Target variable to be predicted:
  - For each week, for each region: I(t) = percentage of physician visits that are ILI (as compiled by CDC)
• Input variable (query selection: constructing the ILI-related query fraction):
  - Q(t) = sum of top n highest correlated queries / total number of queries that week
• "Model learning" (logistic regression):
$$\log\frac{I(t)}{1 - I(t)} = \beta_0 + \beta_1 \log\frac{Q(t)}{1 - Q(t)} + \text{noise}$$
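
The model-learning step amounts to a straight-line fit between the logit of the ILI visit percentage and the logit of the ILI-related query fraction. A minimal sketch follows, assuming an intercept and slope as in Ginsberg et al. and ordinary least squares for the fit; the query fractions and coefficients are synthetic, made up only to show the mechanics.

```python
import math

def logit(p):
    """Log-odds of a proportion p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Inverse of logit: maps a real number back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Synthetic weekly query fractions Q(t) (hypothetical values).
query_fraction = [0.001, 0.002, 0.004, 0.008]
# Synthetic ILI percentages I(t) generated from known coefficients, no noise.
true_b0, true_b1 = 1.5, 0.9
ili_fraction = [inv_logit(true_b0 + true_b1 * logit(q)) for q in query_fraction]

b0, b1 = fit_line([logit(q) for q in query_fraction],
                  [logit(i) for i in ili_fraction])

# Estimate I(t) for a new week's query fraction.
estimate = inv_logit(b0 + b1 * logit(0.003))
print(round(b0, 3), round(b1, 3), round(estimate, 5))
```

With real data the fit would be done per region on noisy weekly counts, so the recovered coefficients would only approximate any underlying relationship.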

The Parable of Google Flu: Traps in Big Data Analysis (Science, Mar. 2014)

Deep Understanding of Key Features

• A large-scale research initiative aimed at
  - Innovations around smartphone-based research
  - Collecting smartphone data in everyday life conditions
  - Community-based evaluation of related mobile data analysis methodologies
• Data source: Lausanne Data Collection Campaign

User Profile/Behavior Modeling and Prediction
• Personal information
• Media files
• Device information
• Process
• Calendar
• Applications
• Social information
• Accelerometer
• System information
• Location information
• Call log
• Contacts
• Bluetooth
• GSM
• WLAN
• Sequence of place visits

MDC 2012 Tracks
• Main Goals
  - User Profile/Behavior Modeling and Prediction
• Dedicated Track
  - Demographic attribute prediction: predict gender, age group, marital status, job type, etc. of a user
  - Semantic place prediction: predict the semantic meaning of a user's visited places
  - Next place prediction: predict the next destination of a user

Demographic Attribute Prediction
• One of the items: prediction of gender

Demographic Attribute Prediction
• Lots of features can be extracted from the data
• 10,000+ features used by the winning team!
• High accuracy achieved: 96%
[Figure: feature groups: location features, media features, sensor features, …]

Very high dimensional complexity
- Feasibility problem in real applications!
Is there some key/dominating feature?
[Figure: feature groups: location features, media features, sensor features, …]

Demographic Attribute Prediction (cont.)
• Accelerometer is actually a key/dominating feature!
• Supports accuracy of around 95%
• Underlying reasoning?

Very different behavior between male and female users!


Timeliness in Big Data Analytics
(Source: IBM white paper)

One Solution: Data Sampling
- Bias on Data Samples
• Twitter provides two main outlets for researchers to access tweets in real time:
  - Streaming API (~1% of all public tweets, free)
  - Firehose (100% of all public tweets, costly)
• Streaming API data is often used by researchers to validate hypotheses.
• How well does the sampled Streaming API data measure the true activity on Twitter?

Bias on Data Samples (cont.)
Source: [Huan Liu et al. AAAI ICWSM2013]

National Health Insurance Research Database in Taiwan
• National Health Insurance (NHI)
  - Established on March 1, 1995
  - Serves 99.2% of the Taiwanese population (20M+)
  - Covers 92.62% of medical institutions
• Longitudinal Health Insurance Database (LHID)
  - Sampled from NHIRD
  - Includes the health records of 951,044 people
  - 1997 to now
• Strongly representative of Taiwan
  - Every living region
  - Long time interval: 15+ years
Reference: National Health Insurance, http://www.nhi.gov.tw

Linking with More Heterogeneous Datasets
[Figure: the NHI, CR, COD and BR datasets are linked through cloud computing with environmental monitoring data, lab data and patient-reported outcomes, and sensor-based biomarker monitoring data, feeding a Smart Health Risk Alert.]

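NHI Data Sampling Method
• Data content
  - 1 million people drawn at random from those "insured in 2010" in the 2010 enrollment data file
• Sampling population
  - The 2010 enrollment data file provided by the National Health Insurance Administration (中央健康保險署) was consolidated per person by national ID number plus date of birth plus sex, giving records for 27,378,403 people, which serve as the master file.
• Sampling procedure
  - A random number generator produced at least 1 million random values (1,074,263 were actually obtained); the serial numbers matching 1 million of these random values were taken to randomly select the required sample of insured persons.
  - The random values were generated with Oracle's DBMS_RANDOM package.
Source: National Health Insurance Research Database (全民健康保險研究資料庫), http://nhird.nhri.org.tw/date_cohort.htm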

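NHI Data Sampling Method (cont.)
• Validating the 1-million sample against the sampling population (the whole population)
  - The distributions of age, sex, and number of births per year, and the average insured amount, were compared between the 1-million sample and the sampling population to check for differences
  - They were also compared against figures published by the Ministry of the Interior
  - Chi-square analysis was used to assess how representative the 1-million sample is of the sampling population
  - All results were below the 5% significance level
Source: National Health Insurance Research Database (全民健康保險研究資料庫), http://nhird.nhri.org.tw/date_cohort.htm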
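
The chi-square check of representativeness can be sketched as a goodness-of-fit test of the sample's counts against the population's proportions. The counts and proportions below are hypothetical, not NHIRD figures; 3.841 is the standard 5% critical value for one degree of freedom (two categories).

```python
def chi_square_statistic(observed_counts, population_shares):
    """Goodness-of-fit statistic: sum of (observed - expected)^2 / expected."""
    total = sum(observed_counts)
    statistic = 0.0
    for observed, share in zip(observed_counts, population_shares):
        expected = total * share
        statistic += (observed - expected) ** 2 / expected
    return statistic

# 5% critical value of the chi-square distribution with 1 degree of freedom.
CRITICAL_5PCT_DF1 = 3.841

# Hypothetical sex split in a 1,000,000-person sample (male, female).
sample_counts = [499_200, 500_800]
# Hypothetical population proportions for the same two categories.
population_shares = [0.5, 0.5]

stat = chi_square_statistic(sample_counts, population_shares)
# Below the critical value: no significant difference from the population.
is_representative = stat < CRITICAL_5PCT_DF1
print(round(stat, 2), is_representative)
```

The same function applies to age bands or birth-year groups, but the critical value then has to match the larger number of categories.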

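Disease Factor Analysis
Linked data is biased!
[Figure: daily visit counts for disease X are joined, by monitoring station and date, with atmospheric environment data and air pollution data from monitoring stations; uses the LHID2000 one-million-person sample file.]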


Mining User Preference
- for POI Recommendation

Goal
• How can POI recommendation be done by utilizing a user's social network log (e.g., check-ins)?

Urban Point-of-Interest Recommendation by Mining User Check-in Behaviors
Josh Jia-Ching Ying, Eric Hsueh-Chan Lu, Wen-Ning Kuo and Vincent S. Tseng
2012 ACM SIGKDD Int'l Workshop on Urban Computing (UrbComp 2012)

Proposed Method – UPOI-Mine
[Figure: UPOI-Mine framework. Relevance Learning: LBSN Dataset → Feature Extraction (Social Factor, Individual Preference, POI Popularity) → User-POI Graph Construction → User-POI Relevance Matrix. POI Recommendation: User Request → Top-k Nearest POI selection → Top-k Nearest POI → POI Ranking → POI Recommending List.]

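Social Factor (SF)

$$\mathrm{Relation}_{i,k} = (1 - w) \times \mathrm{CheckSim}_{i,k} + w \times \mathrm{DisSim}_{i,k}$$

$$\mathrm{SF}(\mathrm{user}_i, \mathrm{POI}_j) = \sum_{k=1}^{|F|} \left[ \mathrm{Relation}_{i,k} \times \mathrm{Interest}_{k,j} \right]$$

$$\mathrm{Interest}_{k,j} = \frac{\mathrm{checkin}_{k,j}}{\sum_{s=1}^{|S|} \mathrm{checkin}_{k,s}}$$

w: weight
F: friends of user i
S: the set of POIs
checkin_{k,*} = check-ins of user k at POI *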
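
The Social Factor can be read as: for each friend k of user i, multiply the strength of the relation between i and k by the share of k's check-ins that fall on the POI, then sum over friends. The sketch below follows that reading; the data layout, the default weight of 0.5, and which similarity term carries w are assumptions of this example, and the similarity values are made up.

```python
def relation(check_sim, dis_sim, w=0.5):
    """Weighted mix of check-in similarity and distance similarity.
    Assumption: w weights DisSim and (1 - w) weights CheckSim."""
    return (1.0 - w) * check_sim + w * dis_sim

def interest(friend_checkins, poi):
    """Share of a friend's check-ins that were made at the given POI."""
    total = sum(friend_checkins.values())
    return friend_checkins.get(poi, 0) / total if total else 0.0

def social_factor(friends, poi, w=0.5):
    """Sum over friends of Relation(i, k) * Interest(k, poi)."""
    return sum(
        relation(f["check_sim"], f["dis_sim"], w) * interest(f["checkins"], poi)
        for f in friends
    )

# Hypothetical friends of one user, with made-up similarities and check-in counts.
friends_of_user = [
    {"check_sim": 0.8, "dis_sim": 0.4, "checkins": {"A": 3, "B": 1}},
    {"check_sim": 0.2, "dis_sim": 0.6, "checkins": {"A": 0, "B": 4}},
]

sf_a = social_factor(friends_of_user, "A")
sf_b = social_factor(friends_of_user, "B")
print(round(sf_a, 3), round(sf_b, 3))
```

A POI scores high when the user's closest friends spend a large share of their check-ins there, regardless of how many check-ins they make in total.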

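Individual Preference (IP)
• Individual Preference (IP)
  - HPref_{i,h}: preference of user i for highlight h
  - CPref_{i,c}: preference of user i for category c

$$\mathrm{IP}(\mathrm{user}_i, \mathrm{POI}_j) = \alpha \sum_{c \in C} \mathrm{CPref}_{i,c} \times I(\mathrm{ctg}(\mathrm{POI}_j) = c) + (1 - \alpha) \sum_{h \in H} \mathrm{HPref}_{i,h} \times \frac{\mathrm{HCount}_{h,j}}{\sum_{g \in H} \mathrm{HCount}_{g,j}}$$

where I(·) is an indicator function defined as

$$I(\mathrm{ctg}(\mathrm{POI}_j) = c) = \begin{cases} 1 & \text{if } \mathrm{ctg}(\mathrm{POI}_j) = c \\ 0 & \text{otherwise} \end{cases}$$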

POI Popularity (PP)
• POI Popularity
  - Relative Popularity of POI
  - Normalized based on category

$$\mathrm{RP}_j = \frac{\mathrm{checkins}_j}{\sum_{k \in CS_j} \mathrm{checkins}_k}$$

where CS_j is the set of POIs which are in the same category as POI_j.
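
Relative popularity, as described here, divides a POI's check-ins by the total check-ins of all POIs in the same category. A minimal sketch under that reading, with hypothetical POI names and counts:

```python
from collections import defaultdict

def relative_popularity(pois):
    """Map each POI to its check-ins divided by its category's total check-ins."""
    category_totals = defaultdict(int)
    for info in pois.values():
        category_totals[info["category"]] += info["checkins"]
    result = {}
    for name, info in pois.items():
        total = category_totals[info["category"]]
        result[name] = info["checkins"] / total if total else 0.0
    return result

# Hypothetical POIs and check-in counts.
pois = {
    "cafe_a": {"category": "cafe", "checkins": 30},
    "cafe_b": {"category": "cafe", "checkins": 10},
    "park_a": {"category": "park", "checkins": 5},
}

rp = relative_popularity(pois)
print(rp)
```

Normalizing within a category keeps a modestly visited park from being drowned out by heavily visited cafes.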

Relevance Estimation
Target: to estimate the relevance of each user-POI pair, we use these features to learn a Regression-Tree Model.

User ID   POI ID   SF      PP     IP      Relevance
1         A        0.2     0.1    0.001   3
1         B        0.05    0.2    0.1     5
1         C        0.004   0.1    0.9     1
…         …        …       …      …       …
N         D        0.5     0.15   0.06    2

→ Regression-Tree Model

Experimental Evaluation
• Real dataset crawled from Gowalla
  - New York City area
  - 1,964,919 POIs
  - 18,159 people
  - 5,341,191 check-ins
  - 392,246 friendship links

Comparisons with Other Recommenders

Better way for modeling?
- UPOI-Walk
In ACM Transactions on Intelligent Systems and Technology, 2014

Motivation
• Existing models cannot deal well with such heterogeneous features
• Existing models try to combine all features into one measure for building a single model → Bias!
[Figure: UPOI-Walk framework. Relevance Learning: LBSN Dataset → Feature Extraction (Social Factor, Individual Preference, POI Popularity) → User-POI Graph Construction → User-POI Graphs → HITS-based Random Walk → User-POI Relevance Matrix. POI Recommendation: User Request → Top-k Nearest POI selection → Top-k Nearest POI → POI Ranking → POI Recommending List.]

HITS-based Random Walk
X. Cao, et al., "Mining significant semantic locations from GPS data," Proceedings of the VLDB Endowment, v.3 n.1-2, September 2010
[Figure: Random Walk example graph with edge weights 0.3, 0.2, 0.4, 0.1]
Given an m × n hits value matrix M:

$$x_{POI}^{k+1} = \left( \alpha\, M_{col}^{T} + (1 - \alpha)\, U_1 \right) x_{user}^{k}$$

$$x_{user}^{k+1} = \left( \alpha\, M_{row} + (1 - \alpha)\, U_2 \right) x_{POI}^{k+1}$$

(α: mixing weight; U_1, U_2: the terms weighted by 1 − α in the POI and user updates)
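
The alternating update can be sketched as a power iteration in which POI scores are refreshed from user scores and user scores from POI scores, each mixed with a uniform random-jump term. The details here are assumptions of this sketch rather than the published algorithm: rows of M are taken to be users and columns POIs, the jump term is uniform, the mixing weight defaults to 0.85, and scores are renormalized after every step instead of pre-normalizing M.

```python
def normalize(vector):
    """Scale a non-negative vector so its entries sum to 1."""
    total = sum(vector)
    return [value / total for value in vector]

def hits_random_walk(M, alpha=0.85, iterations=100):
    """Alternate POI and user score updates on an m x n hits value matrix.
    Assumptions: rows are users, columns are POIs, uniform random jump."""
    m, n = len(M), len(M[0])
    user = [1.0 / m] * m
    poi = [1.0 / n] * n
    for _ in range(iterations):
        # POIs collect score from the users linked to them.
        poi = normalize([
            alpha * sum(M[u][p] * user[u] for u in range(m)) + (1 - alpha) / n
            for p in range(n)
        ])
        # Users collect score from the POIs they are linked to.
        user = normalize([
            alpha * sum(M[u][p] * poi[p] for p in range(n)) + (1 - alpha) / m
            for u in range(m)
        ])
    return user, poi

# Hypothetical hits values: 3 users x 2 POIs.
M = [
    [3, 0],
    [2, 1],
    [4, 0],
]
user_scores, poi_scores = hits_random_walk(M)
print([round(s, 3) for s in user_scores], [round(s, 3) for s in poi_scores])
```

The mutual reinforcement is the HITS idea: a POI matters if important users are linked to it, and a user matters if linked to important POIs.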

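Dynamic HITS-Based Random Walk
Network Set = {M, N, X, Y, Z, …}
Randomly select hits value matrices from the Network Set:

$$v_{POI}^{k+1} = \left( \alpha\, M_{col}^{T} + (1 - \alpha)\, U_1 \right) v_{user}^{k}$$

$$v_{user}^{k+1} = \left( \alpha\, N_{row} + (1 - \alpha)\, U_2 \right) v_{POI}^{k+1}$$

$$v_{POI}^{k+2} = \left( \alpha\, X_{col}^{T} + (1 - \alpha)\, U_1 \right) v_{user}^{k+1}$$

$$v_{user}^{k+2} = \left( \alpha\, Y_{row} + (1 - \alpha)\, U_2 \right) v_{POI}^{k+2}$$

$$v_{POI}^{k+3} = \left( \alpha\, Z_{col}^{T} + (1 - \alpha)\, U_1 \right) v_{user}^{k+2}$$

… until converged
(α: mixing weight; U_1, U_2: the terms weighted by 1 − α in the POI and user updates)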

Comparison with Existing Recommenders - NDCG
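
NDCG (normalized discounted cumulative gain) scores a ranked list by discounting each item's relevance by its rank and dividing by the score of the ideal ordering. The slide does not say which gain variant was used; this sketch assumes the plain form with gain equal to the relevance and a log2 discount, and the relevance lists are made up.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: rel / log2(position + 1), positions from 1."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Made-up relevance grades of recommended POIs, in the order they were ranked.
perfect_ranking = [3, 2, 1]
reversed_ranking = [1, 2, 3]

print(ndcg(perfect_ranking), round(ndcg(reversed_ranking), 3))
```

A value of 1.0 means the recommender placed items in the best possible order; lower values mean relevant POIs were pushed down the list.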

Beautiful algorithms still matter a lot
for Big Data Analytics!


Medical Cloud Project (醫療雲計畫)

National Health Insurance Data Value-Added Project (全民健保資料加值計劃)

Early Prediction of Diseases
Huizinga, T. W. J., & van der Helm-van Mil, A. H. M. (2007). Prediction and prevention of rheumatoid arthritis. Revista Colombiana de Reumatología, 14(2), 106-114.
[Figure: timeline with labels: Very Early Detection (~X years), Early RA (12 month), Early RA (18 month), RA Diagnosis]

Analytics Framework
[Figure: Off-line: raw data, preprocessed data, target data, data mining techniques, leading to discovered rules and a classifier. On-line: a potential patient's health records enter the Morbidity Risk Prediction System, which returns the predicted risk to the doctor/hospital.]

Rules Produced
Too many rules!
Post-processing is essential!

Post-Processing – Rules Filtering
Rules:
Lift > 1: 11,004 rules
Lift = 1: 357 rules
Lift < 1: 7,543 rules
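
Lift compares how often the two sides of a rule occur together with how often they would co-occur if independent: lift > 1 suggests a positive association, lift = 1 independence, and lift < 1 a negative association. A minimal sketch with hypothetical patient records (the diagnosis codes are placeholders, not NHIRD data):

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(1 for record in records if items <= record) / len(records)

def lift(records, antecedent, consequent):
    """support(A and B) / (support(A) * support(B)); 0.0 if undefined."""
    joint = support(records, set(antecedent) | set(consequent))
    independent = support(records, antecedent) * support(records, consequent)
    return joint / independent if independent else 0.0

# Hypothetical records: each set holds the diagnoses seen for one patient.
records = [
    {"dx_a", "dx_b"},
    {"dx_a", "dx_b"},
    {"dx_a"},
    {"dx_c"},
]

candidate_rules = [({"dx_a"}, {"dx_b"}), ({"dx_a"}, {"dx_c"})]
# Keep only positively associated rules, as in the "Lift > 1" group.
kept = [(a, b) for a, b in candidate_rules if lift(records, a, b) > 1.0]
print(kept)
```

Filtering on lift removes rules that merely reflect how common a diagnosis is, which is one way to cut an oversized rule set before manual review.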

Postprocessing: Literature Search (PubMed)
[Figure: bar chart of PubMed literature-search counts per disease term appearing in the discovered rules (e.g., systemic lupus erythematosus, diabetes, osteoporosis, asthma, anxiety, peptic ulcer, cataract, allergic rhinitis, conjunctivitis, pterygium, manic disorder), with counts ranging from 0 to 4,557.]

A More Complete Framework
(in PLOS ONE 2015)

After Postprocessing: Interesting Rules

How to summarize/validate/interpret
the discovered results is an important
last mile for Big Data Analytics!

Concluding Remarks:
Grand Opportunities
• "Data is King": the age of data monetization
• Data vs. Ideas vs. Technologies
  - From data to idea
  - From idea to data
  - Utilization of the right technologies
• Visioning
  - Those who own valuable data can be king
  - Those who own no data but have innovative ideas can also be king
  - Innovative ideas + right tech on valued data => Smart King

Grand Challenges, Big Opportunities!
