1
Predictive Analytics on a Big Data Scale!
Afshin Goodarzi
afshin@1010data.com
April, 2014

2
About 1010data
•  Founded in 2000
•  Based in NYC
•  Big Data analytics platform in the cloud
•  Library of pre-built analytical applications
•  Speed, power and flexibility second to none

3
We Host/Analyze 14+ Trillion Rows of Data
All Quotes and Trades since 2003 on NYSE are done on 1010data
All mortgages ever issued are analyzed on 1010data
Nearly all real-estate transactions are completed on 1010data
Big Data - Granular Data - Time series Data
All data for ~35,000 Retail outlets across the US are analyzed on 1010data

4
A Typical BI Technology Stack
[Diagram labels: Administrators; Data Sources; ETL; Inter-Enterprise Users; EDW; Data Cubes/Marts; Reporting / Visualization; Analysis / Modeling]

5
The Stack Has Fallen!

6
The Analytics Continuum & A Single Version of the Truth

7
Intuitive Access to Unlimited Amounts of Data
[Diagram labels: Partner Data; 3rd Party Data; 1010data Cloud; Corporate Data; 425,369,127,325 Rows!]

8
The code: Chart 1
<layout background_="white" border_="1" height_="525" name="candlestick_layout" relpos_="0,50" width_="650">
    <widget base_="nyse.trades.hist.all" class_="graphics" invmode_="hide" name="candlestick" relpos_="25,25" update_="manual" width_="600">
        <sel value="between(date;'{@startdate}';'{@enddate}')"/>
        <sel value="(symbol='{@symbol}')"/>
        <tabu label="Candle Stick" breaks="date">
            <break col="date" sort="up"/>
            <tcol source="prc" fun="wavg" name="vwap" weight="vol" label="VWAP"/>
            <tcol source="prc" fun="hi" name="high" label="High"/>
            <tcol source="prc" fun="lo" name="low" label="Low"/>
            <tcol source="prc" fun="first" name="open" label="Open"/>
            <tcol source="prc" fun="last" name="close" label="Close"/>
        </tabu>
        <graphspec>
            <chart type="candlestick" title="Candlestick Chart for {@symbol}">
                <axes xlabel="Date" ylabel="Trading Price"/>
            </chart>
        </graphspec>
    </widget>
    <widget class_="button" name="candlestick_refresh" relpos_="475,475" submit_="candlestick" text_="Refresh" type_="submit"/>
    <widget class_="field" label_="Choose Symbol:" name="symbol_input" relpos_="125,475" value_="@symbol"/>
</layout>
Slide callouts: Query, Chart Spec
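
For readers unfamiliar with the macro language, the tabulation in the chart code above appears to group trades by date and compute a volume-weighted average price plus high, low, first and last price. The Python sketch below mirrors that reading on a tiny in-memory list. It is not 1010data code: the function name `daily_candles`, the tuple layout and the sample trades are assumptions made for illustration, and the sketch assumes trades arrive in time order within each date.

```python
def daily_candles(trades):
    """Group (date, price, volume) trades by date and summarize each day.

    Assumption of this sketch: trades are already in time order within
    each date, so the first and last price stand in for open and close.
    """
    by_date = {}
    for date, prc, vol in trades:
        by_date.setdefault(date, []).append((prc, vol))

    rows = []
    for date in sorted(by_date):  # ascending dates, like sort="up"
        ticks = by_date[date]
        prices = [p for p, _ in ticks]
        total_vol = sum(v for _, v in ticks)
        # Volume-weighted average price; undefined when no volume traded.
        vwap = sum(p * v for p, v in ticks) / total_vol if total_vol else None
        rows.append({
            "date": date,
            "vwap": vwap,
            "high": max(prices),
            "low": min(prices),
            "open": prices[0],
            "close": prices[-1],
        })
    return rows


# Hypothetical sample data, not taken from the slides.
sample_trades = [
    ("2014-04-01", 10.0, 100),
    ("2014-04-01", 12.0, 300),
    ("2014-04-01", 11.0, 100),
    ("2014-04-02", 11.5, 200),
]
candles = daily_candles(sample_trades)
```

Each resulting row carries the five values a candlestick chart needs for one date.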
  
9
Predictive Analytics on a Big Data Scale!

Big Data mandated Analytics and predictive modeling - an example:
The larger data sets have mandated more rigorous sampling strategies as traditional systems have not kept up with the computational needs of predictive analytic solutions on Big Data.

•  Can we use all but a small holdout set in predictive modeling?
•  What are the challenges?
•  What is an approach that works?
•  Are the results any good?
•  Is this solution only applicable to one industry?
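
The first question above, using all but a small holdout set, can be sketched as a deterministic split that never needs to sample the data in memory. The version below hashes a record id into a bucket and holds out roughly 1% of records. The names `in_holdout` and `split`, the `id` field and the 1% fraction are all assumptions for illustration and do not come from the talk.

```python
import hashlib


def in_holdout(record_id, holdout_fraction=0.01):
    """Deterministically decide whether a record belongs to the holdout set."""
    digest = hashlib.sha256(str(record_id).encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < holdout_fraction


def split(records, key, holdout_fraction=0.01):
    """Return (train, holdout); train keeps everything not held out."""
    train, holdout = [], []
    for rec in records:
        target = holdout if in_holdout(key(rec), holdout_fraction) else train
        target.append(rec)
    return train, holdout


# Hypothetical records; in practice these would be rows keyed by a stable id.
records = [{"id": i, "x": i % 7} for i in range(100_000)]
train, holdout = split(records, key=lambda r: r["id"], holdout_fraction=0.01)
```

Because the assignment depends only on the id, the same record lands in the same set on every run, which keeps model estimation and model testing comparable across iterations.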
  	
  
10
Common Predictive Modeling Approach
•  CPU intensive & error prone steps:
»  Data selection
»  IV to DV relationship
»  Transformations
»  Sampling and validation
»  Model estimation
»  Model testing
»  Repeat
http://onlinepubs.trb.org/onlinepubs/nchrp/cd-22/v2chapter5.html

11
“One Segment” => “A Segment of One”
“Any customer can have a car painted any color that he wants so long as it is black.”
re: the Model-T in 1909 (from My Life and Work, Henry Ford, 1922, Chap. 4, p.71)

12
Harry Truman displays a copy of the Chicago Daily Tribune newspaper that erroneously reported the election of Thomas Dewey in 1948. Truman’s narrow victory embarrassed pollsters, members of his own party, and the press who had predicted a Dewey landslide.

13
Build A 30 Day Shopping List For Each Loyal Shopper at a Retail Chain

Shopper    SKU     Probability of purchase in the next 30 days
A. Smith   12345   90%
A. Smith   23567   85%
A. Smith   ….
A. Smith   87996   30%

[Diagram labels: POS; Loyalty; Econ; House prices; Mortgage Rates; BLS - Unemployment; Inventory]
With Permission from A&P

14
If The Shopper Bought “It” Before Will They Buy “It” Again?
•  Classical modeling: variables as either positively or negatively correlated with target
•  Shoppers don’t behave the same!
•  The demographic attributes have distributions for each variable!

15
Subscribers are “A Segment Of One”!

16
All sources of Prepay as analyzed in 1989
[Figure labels: Interest Rates; House prices; Unemployment; Loan Age; Cost of option; Regional economy; other labels: D, R, M, I]
http://www.freeusandworldmaps.com/html/US_Counties/US_Counties.html
http://www.tradingeconomics.com/united-states/unemployment-rate
http://www.wfa.gov/
http://www.richmondfed.org/banking/markets_trends_and_statistics/trends/pdf/delinquency_and_foreclosure_rates.pdf

17
Quality Measures: Lift => AUC
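
The slide names two quality measures without defining them. As a reminder of what each one computes, here is a minimal sketch using the standard definitions: AUC through the rank-sum identity (ties share an average rank), and lift as the response rate among the top-scored fraction divided by the overall rate. The function names `auc` and `lift_at` and the sample labels and scores are illustrative assumptions, not material from the talk.

```python
def auc(labels, scores):
    """Area under the ROC curve via average ranks of the positive class."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def lift_at(labels, scores, fraction=0.1):
    """Response rate in the top `fraction` of scores over the base rate."""
    n_top = max(1, int(round(len(scores) * fraction)))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Hypothetical labels (1 = purchased) and model scores.
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
model_auc = auc(labels, scores)
top_fifth_lift = lift_at(labels, scores, fraction=0.2)
```

Lift depends on the cutoff chosen, while AUC summarizes ranking quality across all cutoffs in a single number.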
  
18
Fine vs. Coarse: Cash flows

19
InQuery analytics – User Defined Group Functions
•  User defined
−  KNN
−  Naïve Bayes
−  ARCH/AR
−  PCA
−  Kernel
−  Decision Tree
−  Logistics trees
−  FFT
−  Etc……..

20
Questions?


Rethinking classical approaches to analysis and predictive modeling

  • 1. Predictive Analytics on a Big Data Scale! Afshin Goodarzi, afshin@1010data.com, April 2014
  • 2. About 1010data
      • Founded in 2000
      • Based in NYC
      • Big Data analytics platform in the cloud
      • Library of pre-built analytical applications
      • Speed, power and flexibility second to none
  • 3. We Host/Analyze 14+ Trillion Rows of Data
      All Quotes and Trades since 2003 on NYSE are done on 1010data
      All mortgages ever issued are analyzed on 1010data
      Nearly all real-estate transactions are completed on 1010data
      Big Data - Granular Data - Time series Data
      All data for ~35,000 Retail outlets across the US are analyzed on 1010data
  • 4. A Typical BI Technology Stack (diagram labels: Administrators; Data Sources; ETL; Inter-Enterprise Users; EDW; Data Cubes/Marts; Reporting/Visualization; Analysis/Modeling)
  • 5. The Stack Has Fallen!
  • 6. The Analytics Continuum & A Single Version of the Truth
  • 7. Intuitive Access to Unlimited Amounts of Data (diagram labels: Partner Data; 3rd Party Data; 1010data Cloud; Corporate Data; 425,369,127,325 Rows!)
  • 8. The code: Chart 1 (slide annotations: Query, Chart Spec)
      <layout background_="white" border_="1" height_="525" name="candlestick_layout" relpos_="0,50" width_="650">
        <widget base_="nyse.trades.hist.all" class_="graphics" invmode_="hide" name="candlestick" relpos_="25,25" update_="manual" width_="600">
          <sel value="between(date;'{@startdate}';'{@enddate}')"/>
          <sel value="(symbol='{@symbol}')"/>
          <tabu label="Candle Stick" breaks="date">
            <break col="date" sort="up"/>
            <tcol source="prc" fun="wavg" name="vwap" weight="vol" label="VWAP"/>
            <tcol source="prc" fun="hi" name="high" label="High"/>
            <tcol source="prc" fun="lo" name="low" label="Low"/>
            <tcol source="prc" fun="first" name="open" label="Open"/>
            <tcol source="prc" fun="last" name="close" label="Close"/>
          </tabu>
          <graphspec>
            <chart type="candlestick" title="Candlestick Chart for {@symbol}">
              <axes xlabel="Date" ylabel="Trading Price"/>
            </chart>
          </graphspec>
        </widget>
        <widget class_="button" name="candlestick_refresh" relpos_="475,475" submit_="candlestick" text_="Refresh" type_="submit"/>
        <widget class_="field" label_="Choose Symbol:" name="symbol_input" relpos_="125,475" value_="@symbol"/>
      </layout>
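The tabulation in the Chart 1 query groups trades by date and computes a volume-weighted average price plus high, low, open and close. A minimal Python sketch of the same aggregation follows. It is an illustration only, not 1010data's implementation: the function name `daily_candles` and the `(date, price, volume)` tuple layout are assumptions, and trades are assumed to arrive in time order with positive volume.

```python
def daily_candles(trades):
    """Aggregate trades into one candle per date.

    trades: iterable of (date, price, volume) tuples, assumed to be in
    time order within each date and to have positive volume.
    """
    groups = {}
    for date, price, volume in trades:
        groups.setdefault(date, []).append((price, volume))

    candles = []
    for date in sorted(groups):  # mirrors <break col="date" sort="up"/>
        rows = groups[date]
        total_volume = sum(v for _, v in rows)
        candles.append({
            "date": date,
            "vwap": sum(p * v for p, v in rows) / total_volume,  # fun="wavg"
            "high": max(p for p, _ in rows),                      # fun="hi"
            "low": min(p for p, _ in rows),                       # fun="lo"
            "open": rows[0][0],                                   # fun="first"
            "close": rows[-1][0],                                 # fun="last"
        })
    return candles
```

Each dictionary in the result corresponds to one candlestick on the chart.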
  • 9. Predictive Analytics on a Big Data Scale!
      Big Data mandated Analytics and predictive modeling - an example:
      The larger data sets have mandated more rigorous sampling strategies as traditional systems have not kept up with the computational needs of predictive analytic solutions on Big Data.
      • Can we use all but a small holdout set in predictive modeling?
      • What are the challenges?
      • What is an approach that works?
      • Are the results any good?
      • Is this solution only applicable to one industry?
  • 10. Common Predictive Modeling Approach
      • CPU intensive & error prone steps:
          » Data selection
          » IV to DV relationship
          » Transformations
          » Sampling and validation
          » Model estimation
          » Model testing
          » Repeat
      http://onlinepubs.trb.org/onlinepubs/nchrp/cd-22/v2chapter5.html
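One way to keep "all but a small holdout set" for modeling without shuffling or materializing a sample is to assign each record to the holdout by hashing its identifier. The sketch below is a generic technique under stated assumptions, not the approach described in the talk: the names `in_holdout` and `split`, the 5% fraction and the salt string are all hypothetical.

```python
import hashlib


def in_holdout(record_id, fraction=0.05, salt="holdout-v1"):
    """Return True if the record falls in the holdout set.

    The assignment depends only on the id and the salt, so it is
    reproducible across runs and needs no shared state.
    """
    digest = hashlib.sha256(f"{salt}:{record_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


def split(records, key, fraction=0.05):
    """Split records into (train, holdout) using the hash of key(record)."""
    train, holdout = [], []
    for record in records:
        (holdout if in_holdout(key(record), fraction) else train).append(record)
    return train, holdout
```

Because the split is a pure function of the record id, the same rows land in the holdout every time the model is rebuilt.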
  • 11. “One Segment” => “A Segment of One”
      “Any customer can have a car painted any color that he wants so long as it is black.”
      re: the Model-T in 1909 (from My Life and Work, Henry Ford, 1922, Chap. 4, p.71)
  • 12. Harry Truman displays a copy of the Chicago Daily Tribune newspaper that erroneously reported the election of Thomas Dewey in 1948. Truman’s narrow victory embarrassed pollsters, members of his own party, and the press who had predicted a Dewey landslide.
  • 13. Build A 30 Day Shopping List For Each Loyal Shopper at a Retail Chain
      Shopper  | SKU   | Probability of purchase in the next 30 days
      A. Smith | 12345 | 90%
      A. Smith | 23567 | 85%
      A. Smith | ….    |
      A. Smith | 87996 | 30%
      Data sources: POS; Loyalty; Econ House prices; Mortgage Rates; BLS - Unemployment; Inventory
      With Permission from A&P
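The output table above has one row per shopper and SKU with a purchase probability. As a rough illustration of that output shape only, the sketch below estimates the probability as the share of past 30-day windows in which the shopper bought the SKU. This frequency baseline is an assumption made for illustration, not the model used in the talk, and `shopping_list`, `n_windows` and `top_n` are hypothetical names.

```python
from collections import defaultdict


def shopping_list(purchases, n_windows, top_n=3):
    """Rank SKUs per shopper by historical purchase frequency.

    purchases: iterable of (shopper, sku, window_index), where
    window_index identifies a past 30-day window in 0..n_windows-1.
    Returns {shopper: [(sku, probability), ...]} sorted by probability.
    """
    windows_bought = defaultdict(set)
    for shopper, sku, window in purchases:
        windows_bought[(shopper, sku)].add(window)

    per_shopper = defaultdict(list)
    for (shopper, sku), windows in windows_bought.items():
        per_shopper[shopper].append((sku, len(windows) / n_windows))

    return {
        shopper: sorted(items, key=lambda item: (-item[1], item[0]))[:top_n]
        for shopper, items in per_shopper.items()
    }
```

Repeat purchases within one window count once, so the estimate stays between 0 and 1.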
  • 14. If The Shopper Bought “It” Before Will They Buy “It” Again?
      • Classical modeling: variables as either positively or negatively correlated with target
      • Shoppers don’t behave the same!
      • The demographics attributes have distributions for each variable!
  • 15. Subscribers are “A Segment Of One”!
  • 16. All sources of Prepay as analyzed in 1989
      Diagram labels: D, R, M, I
      Factors: Interest Rates; House prices; Unemployment; Loan Age; Cost of option; Regional economy
      http://www.freeusandworldmaps.com/html/US_Counties/US_Counties.html
      http://www.tradingeconomics.com/united-states/unemployment-rate
      http://www.wfa.gov/
      http://www.richmondfed.org/banking/markets_trends_and_statistics/trends/pdf/delinquency_and_foreclosure_rates.pdf
  • 17. Quality Measures: Lift => AUC
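Lift and AUC are both computed from model scores and observed outcomes. The sketch below shows the standard definitions in plain Python. The function names `auc` and `lift_at` are hypothetical, labels are assumed to be 0 or 1 with at least one of each, and the pairwise AUC is quadratic in the data size, so it is meant for illustration rather than for large data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random positive is scored above a
    random negative, counting ties as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def lift_at(labels, scores, fraction):
    """Response rate in the top-scored fraction divided by the overall rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(ranked[:k]) / k
    base_rate = sum(ranked) / len(ranked)
    return top_rate / base_rate
```

Lift describes one cut-off at a time, while AUC summarizes ranking quality over all cut-offs in a single number.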
  • 18. Fine vs. Coarse: Cash flows
  • 19. InQuery analytics: User Defined Group Functions
      • User defined
          - KNN
          - Naïve Bayes
          - ARCH/AR
          - PCA
          - Kernel
          - Decision Tree
          - Logistics trees
          - FFT
          - Etc……..
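The idea of a user-defined group function, a custom routine applied separately to each group of rows, can be sketched generically in Python. This is not 1010data's API: `group_apply` and `knn_predict` are hypothetical helpers, and the k-nearest-neighbour classifier is just one of the listed techniques chosen for the example.

```python
from collections import Counter, defaultdict


def group_apply(rows, key, func):
    """Group rows by key(row) and apply func to each group's row list."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return {group: func(members) for group, members in groups.items()}


def knn_predict(train, query, k=3):
    """Majority label among the k training points nearest to query.

    train: list of (feature_vector, label); distance is squared Euclidean.
    """
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

For rows shaped as `(segment, features, label)`, a per-segment prediction for a query point could then be written as `group_apply(rows, key=lambda r: r[0], func=lambda m: knn_predict([(f, l) for _, f, l in m], query))`.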