Public Data and
Data Mining
Competitions –
what are the
Lessons?
© KDnuggets 2013
Gregory Piatetsky-Shapiro
KDnuggets
My Data
• PhD (‘84) in applying Machine Learning to databases
• Researcher at GTE Labs – started the first project on
Knowledge Discovery in Databases in 1989
• Organized first 3 Knowledge Discovery and Data Mining
(KDD) workshops (1989-93), cofounded Knowledge
Discovery and Data Mining (KDD) conferences (1995)
• Chief Scientist at 2 analytics startups 1998-2001
• Co-founder SIGKDD (1998), Chair, 2005-2009
• Analytics/Data Mining Consultant, 2001-
• Editor, KDnuggets, 1994-, full time 2001-
Patterns – Key Part of Intelligence
• Evolution: Animals better able
to find, use patterns – more
likely to survive
• People have an ability and
desire to find patterns
• People's “pattern intuition” does
not scale
• Science is what helps separate
valid from invalid patterns
(astrology, fake cures, …)
Horoscope for August: The
Mars-Jupiter tandem in
Cancer seems to indicate a
febrile activity related to the
accommodation, houses,
premises, real estate
investments. You'll build,
redecorate, move out, change
your furniture, refurbish, set
up your yard or garden …
Outline
• What do we call it?
• Data competitions – short history
• Government and Public Data
• Big Data Hype and Reality
What do we call it?
• Statistics
• Data mining
• Knowledge Discovery in
Data (KDD)
• Business Analytics
• Predictive Analytics
• Data Science
• Big Data
• … ?
Same Core Idea:
Finding Useful
Patterns in Data
Different
Emphasis
20th Century
Statistics dominates
[Figure: Google Ngrams for “statistics”, smoothing=1]
Note: Google Ngrams are case-sensitive. Lower case is used here as more
representative.
“Data Mining” surges in 1996,
peaks in 2004-5
[Figure: Google Ngrams for “analytics” and “data mining”, smoothing=1. Annotations:]
• KDD-95, 1st Conference on Knowledge Discovery and Data Mining, Montreal
• Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, Eds:
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy
Analytics surges in 2006,
after Google Analytics introduced
[Figure: Google Trends for “analytics”, Jan 2005 – July 2013. Annotations:]
• Google Analytics introduced, Dec 2005
• Slow-down in analytics in 2012?
• “analytics - google” is 50% of “analytics” searches
In 2013: Big Data > Data Mining >
Business Analytics > Predictive Analytics
> Data Science
[Figure: Google Trends search for “Big Data” and “Data mining”, Jan 2008 - July 2013. Annotation: Big Data slowdown?]
History
• 1900 - Statistics
• 1960s - Data Mining = bad activity, data “dredging”
• 1990 - “Data Mining” is good, surges in 1996
• 2003 - “Data Mining” peaks, image tarnished
(Total Information Awareness, invasion of privacy)
• 2006 - Google Analytics appears
• 2007 - Business/Data/Predictive Analytics
• 2012 - Big Data surge
• 2013 - Data Science
• 2015 - ??
Data Competitions –
Short History
1st Data Mining Competition:
KDD-CUP 1997
– Organized by Ismail Parsa (then at Epsilon)
– Task: given data on past responders to fund-raising,
predict most likely responders for new campaign
– Data:
• Population of 750K prospects, 300+ variables
• 10K (1.4%) responded to a broad campaign mailing
• Competition file was a stratified sample of 10K responded,
26K non-resp. (28.7% response rate)
– Big effort on leaker detection (false predictors)
KDD Cup was almost cancelled - several times
Charles Elkan found leakers in training data
Evaluating Targeted List:
Cumulative Pct Hits (Gains)
[Figure: cumulative gains chart. X-axis: Pct list; Y-axis: Cumulative % Hits; curves: Model, Random]
5% of random list have 5% of targets,
but 5% of model ranked list have 21% of targets
Cum Pct Hits (5%,model)=21%.
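The Cumulative Pct Hits measure above can be sketched in a few lines of Python. This is a minimal illustration, not the KDD-CUP evaluation code: the function name `cumulative_gains` and the 20-prospect toy data are assumptions made up for the example.

```python
def cumulative_gains(scores, labels, pct):
    """Percent of all responders found in the top `pct` percent of the score-ranked list."""
    # Rank prospects from highest to lowest model score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    cutoff = int(len(ranked) * pct / 100)
    hits = sum(label for _, label in ranked[:cutoff])
    return 100.0 * hits / sum(labels)

# Hypothetical toy data: 20 prospects, 4 responders (label 1).
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50,
          0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.01]
labels = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

gains_at_25 = cumulative_gains(scores, labels, 25)  # top 5 prospects hold 3 of 4 responders
print(gains_at_25)  # 75.0, versus 25.0 expected from a randomly ordered list
```

A randomly ordered list is expected to capture `pct` percent of responders in its top `pct` percent, which is the "Random" diagonal in the chart.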
KDD-CUP Participant Statistics
– 45 companies/institutions participated
• 23 research prototypes
• 22 commercial tools
– 16 contestants turned in their results
• 9 research prototypes
• 7 commercial tools
– Evaluation: Best Gains (CPH) at 40% and 10%
– Joint winners:
• Charles Elkan (UCSD) with BNB, Boosted Naive Bayesian Classifier
• Urban Science Applications, Inc. with commercial Gain, Direct
Marketing Selection System
• 3rd place: MineSet (SGI, Ronny Kohavi)
KDD-CUP Results Discussion
– Top finishers very close
– Naïve Bayes algorithm was used by 2 of the top 3
contestants (BNB and 3rd place MineSet)
– Naïve Bayes tools did little data preprocessing, used
small number of variables
– Urban Science implemented a tremendous amount
of automated data preprocessing and exploratory
data analysis and developed more than 50 models in
an automated fashion to get to their results
KDD Cup 1997: Top 3 results
Top 3 finishers
are very close
KDD Cup 1997 – worst results
Note that the worst
result (C6) was actually
worse than random.
Competitor names were
kept anonymous,
apart from top 3 winners
KDD Cup Lessons
• Data Preparation is key, especially eliminating
“leakers” (false predictors)
• Avoid overfitting the test data
• Simple models work well for predicting human
behavior
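One possible way to screen for “leakers” is to check whether any single variable predicts the target almost perfectly on its own. The sketch below is only an illustration of that idea; the slides do not say how the KDD-CUP leakers were actually found. The field names, the toy data, and the 0.95 threshold are all hypothetical.

```python
def auc(values, labels):
    """Chance that a random responder has a higher value than a random non-responder (ties count 0.5)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def suspected_leakers(features, labels, threshold=0.95):
    """Names of features whose single-variable AUC is suspiciously close to 0 or 1."""
    flagged = []
    for name, values in features.items():
        score = auc(values, labels)
        if max(score, 1.0 - score) >= threshold:
            flagged.append(name)
    return flagged

# Hypothetical data: 3 responders, 5 non-responders.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
features = {
    "age": [34, 51, 45, 29, 62, 41, 38, 55],
    # Assumed to be filled in only after a donation arrives, so it leaks the answer.
    "thank_you_letter_sent": [1, 1, 1, 0, 0, 0, 0, 0],
}

print(suspected_leakers(features, labels))  # ['thank_you_letter_sent']
```

A flagged variable is not proof of a leak, only a prompt to check when and how that field gets populated.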
Big Competition Successes
• Ansari X-Prize 2004:
SpaceShipOne went to
space twice in 2 weeks
• DARPA Grand
Challenge, 2005: 132 mi
Off-road robotic car
navigation
Netflix Prize
• Started in 2006, with 100M
ratings, 500K users, 18K
movies, $1M prize
• Goal: reduce RMSE of “star”
ratings by 10% (was 0.95 for Netflix's
own system, Cinematch)
• Public training data, public & secret
test sets
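For reference, RMSE (root mean squared error) and the 10% target can be computed as in the sketch below. The baseline of 0.95 is the rounded figure from the slide; the three sample ratings are made up for illustration.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual star ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Rounded Cinematch baseline quoted on the slide.
CINEMATCH_RMSE = 0.95
# A 10% reduction means reaching 90% of the baseline error.
target_rmse = CINEMATCH_RMSE * (1 - 0.10)
print(round(target_rmse, 3))  # 0.855

# Hypothetical predictions for three ratings.
example = rmse([3.5, 4.0, 2.0], [4, 4, 1])
print(round(example, 4))  # 0.6455
```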
Netflix Prize Milestones
• In just one week, WXYZ consulting team
beat Netflix system with RMSE 0.9430
• Progress in 2007-8 was very slow:
• In 2007 KDnuggets Poll
32% thought prize will
never be won
• Took 3 years to reach
10% improvement
Netflix Prize Winners
• Winning team used a complex
ensemble of many algorithms
• Two teams had exactly the same RMSE
of 0.8567, but winner submitted 20
minutes earlier!
Netflix Prize lessons, 1
• Competitions work
• Limits to predicting human behavior –
inherent randomness, noisy data
• Privacy concerns
– Researchers found a few people with matching
IMDB and Netflix ratings – potential privacy
breach
– 4 Netflix users sued
– Netflix Prize Sequel – cancelled
Netflix Prize lessons, 2
• Winning algorithm was too complex, too
tailored to specific data set, never used 
– Netflix blog, Apr 2012
• A basic SVD algorithm, proposed by Simon
Funk (KDnuggets Interview w. Simon Funk)
got ~6% improvement
• SVD++ version by Yehuda Koren & winning
team reached ~ 8% improvement, was used
by Netflix
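The “basic SVD” idea is a matrix factorization trained by stochastic gradient descent: each user and each movie gets a short vector of latent factors, and a rating is predicted from their dot product. The sketch below is a simplified variant in that spirit, not Simon Funk's original code and not SVD++. The function names, hyperparameters, and the 4-user, 4-movie toy ratings are assumptions for illustration.

```python
import math
import random

def train_mf(ratings, n_users, n_items, n_factors=2, lr=0.02, reg=0.02, epochs=300, seed=0):
    """Fit rating ~ mean + dot(P[user], Q[item]) by stochastic gradient descent."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    mean = sum(r for _, _, r in ratings) / len(ratings)
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mean + sum(p * q for p, q in zip(P[u], Q[i])))
            for f in range(n_factors):
                p, q = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * q - reg * p)
                Q[i][f] += lr * (err * p - reg * q)
    return mean, P, Q

def predict(model, u, i):
    mean, P, Q = model
    return mean + sum(p * q for p, q in zip(P[u], Q[i]))

def rmse(model, ratings):
    total = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(total / len(ratings))

# Hypothetical (user, movie, stars) triples: users 0-1 and users 2-3 have opposite tastes.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 2),
    (2, 0, 1), (2, 2, 5), (2, 3, 4),
    (3, 1, 2), (3, 2, 4), (3, 3, 5),
]

model = train_mf(ratings, n_users=4, n_items=4)
# Baseline that always predicts the global mean rating (all factors zero).
mean_only = (model[0], [[0.0, 0.0]] * 4, [[0.0, 0.0]] * 4)

baseline_rmse = rmse(mean_only, ratings)
trained_rmse = rmse(model, ratings)
print(baseline_rmse, trained_rmse)
```

On this toy data the factor model should fit the training ratings more closely than the global mean does. A real evaluation would measure RMSE on held-out ratings rather than on the training set.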
Netflix Prize lessons, 3
• Wrong question was asked ! (Minimizing RMSE of
predicted vs actual ratings)
• RMSE gives a big penalty for errors > 2 stars, so an
algorithm that fails big a few times will score worse than
an algorithm that is often off by 1.
• Errors are not equal, but RMSE treats 2 vs 3 stars
same as 4 vs 5 or 1 vs 2.
• Also, Netflix Instant became more popular
• Better question would be “what do you like to
watch” (anything on Instant likely to rank > 3)
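A small worked example shows how RMSE punishes a few large misses more than many small ones. Both error profiles below are hypothetical, and the comparison with mean absolute error is added here for contrast and is not from the slides.

```python
import math

def rmse(errors):
    """Root mean squared error from a list of per-rating errors (in stars)."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(errors):
    """Mean absolute error from a list of per-rating errors (in stars)."""
    return sum(abs(e) for e in errors) / len(errors)

# Hypothetical algorithm A: exact on 90 ratings, off by 3 stars on 10.
errors_a = [0] * 90 + [3] * 10
# Hypothetical algorithm B: exact on only 20 ratings, off by 1 star on 80.
errors_b = [0] * 20 + [1] * 80

rmse_a, rmse_b = rmse(errors_a), rmse(errors_b)
print(round(rmse_a, 3), round(rmse_b, 3))  # 0.949 0.894
print(mae(errors_a), mae(errors_b))        # 0.3 0.8
```

Under RMSE, B looks better even though A is exactly right nine times out of ten. Under mean absolute error the ranking reverses.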
Focus
on the right question ?
and the right GOAL
Kaggle Competition Platform
• Launched by Anthony Goldbloom in 2010
• Quickly became the top platform for
competitions
– Few people know of TunedIT competition
platform launched in 2009
• Kaggle in Class – free for Universities
• Achieved 100,000 members in July 2013
Kaggle Successes
• Allstate competition: Winning model was 270%
more accurate than baseline
• Identified sound of the endangered North
Atlantic right whale in audio recordings
• GE FlightQuest
• Heritage Health Prize - $3M
competition, 2011-13
• But … Competitions - very time consuming
Kaggle Business Model
• Initial business model - % of prize
• Kaggle Job Boards (currently free)
• Kaggle Connect: Offers consulting with top
0.5% of Kagglers (at $300/hr ? see post), or
$30-100K/month (IW , Mar 2013)
• Private competitions (Masters) open to top
Kagglers
– Heritage Health Prize 2
Winning on Kaggle
• Kaggle Chief Scientist: Specialist knowledge –
useless & unhelpful (Slate, Dec 2012)
• Big-data approaches
• Use good tools: R, Random forests
• Curiosity, Creativeness, Persistence, Team, Luck?
(also Quora answer)
• Many (most?) winners – not professional data
scientists (physicists, math profs, actuary)
(RW, Apr 2012)
“your Ivy League diploma and IBM
resume don't matter so much
as my Kaggle score”
Almost true
Data:
Public, Government, Portals, Marketplaces
Public Data
www.KDnuggets.com/datasets/
• Government, Federal, State, City, Local and public data sites and portals
• Data APIs, Hubs, Marketplaces, Platforms, Portals, and Search Engines.
• Data Markets: DataMarket
• Data Platforms: Enigma, InfoChimps (acq. by CSC), Knoema, Exversion, …
• Data Search Engines: Quandl, qunb, Zanran
• Location: Factual
• People and places: Freebase
Public and Government Data
• Datamob.org: tracks government data in
developer-friendly format
data about U.S. state legislative
activities, including bill
summaries, votes, sponsorships, legislators
and committees.
US Project Open Data
• In May 2013, White House announced Project
Open Data
• “information is a valuable national asset whose
value is multiplied when it is made easily
accessible to the public”.
• “The Executive Order requires that, going
forward, data generated by the government be
made available in open, machine-readable
formats, while appropriately safeguarding
privacy, confidentiality, and security.”
Using Public Data
• Google – biggest success ?
• Data Science for Social Good (Chicago) (Fast
Company, Aug 2013)
– predict when bikeshare stations run out of bikes
– forecast local crime
– warn local hospitals about impending heart
attacks
Big Data
• 2nd Industrial Revolution
• Do old activities better
• Create new activities/businesses
Doing Old Things Better
Application areas
– Direct marketing/Customer modeling
– Churn prediction
– Recommendations
– Fraud detection
– Security/Intelligence
– …
• Improvement will be real, but limited because of
human randomness
• Competition will level companies
Big Data Enables New Things !
– Google – first big success of big data
– Social networks (Facebook, Twitter, LinkedIn, …)
success depends on network size, i.e. big data
– Location analytics
– Health-care
• Personalized medicine
– Semantics and AI ?
• Imagine IBM Watson, Google Now, Siri in 2023 ?
Big Data Bubble?
Gartner Hype Cycle
Big Data
Gartner Hype Cycle for Big Data, 2012
• Data Scientist: 2-5 yrs
• Social Network Analysis: 5-10 yrs
• Social Analytics: 2-5 yrs
• Predictive Analytics: <2 yrs
• MapReduce & Alternatives: Disillusionment
Questions?
KDnuggets: Analytics, Big Data, Data Mining
• News, Jobs, Software, Courses, Data, Meetings,
Publications, Webcasts, …
www.KDnuggets.com/news
• Subscribe to KDnuggets News email at
www.KDnuggets.com/subscribe.html
• Twitter: @kdnuggets
• Email to editor1@kdnuggets.com
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 

Public Data and Data Mining Competitions - What are Lessons?

  • 1. Public Data and Data Mining Competitions – what are the Lessons? 1© KDnuggets 2013 Gregory Piatetsky-Shapiro KDnuggets
  • 2. My Data • PhD (‘84) in applying Machine Learning to databases • Researcher at GTE Labs – started the first project on Knowledge Discovery in Databases in 1989 • Organized first 3 Knowledge Discovery and Data Mining (KDD) workshops (1989-93), cofounded Knowledge Discovery and Data Mining (KDD) conferences (1995) • Chief Scientist at 2 analytics startups 1998-2001 • Co-founder SIGKDD (1998), Chair, 2005-2009 • Analytics/Data Mining Consultant, 2001- • Editor, KDnuggets, 1994-, full time 2001- © KDnuggets 2013 2
  • 3. Patterns – Key Part of Intelligence • Evolution: Animals better able to find, use patterns – more likely to survive • People have an ability and desire to find patterns • People's “pattern intuition” does not scale • Science is what helps separate valid from invalid patterns (astrology, fake cures, …) © KDnuggets 2013 3 Horoscope for August: The Mars-Jupiter tandem in Cancer seems to indicate a febrile activity related to the accommodation, houses, premises, real estate investments. You'll build, redecorate, move out, change your furniture, refurbish, set up your yard or garden …
  • 4. Outline • What do we call it? • Data competitions – short history • Government and Public Data • Big Data Hype and Reality © KDnuggets 2013 4
  • 5. What do we call it? • Statistics • Data mining • Knowledge Discovery in Data (KDD) • Business Analytics • Predictive Analytics • Data Science • Big Data • … ? © KDnuggets 2013 5 Same Core Idea: Finding Useful Patterns in Data Different Emphasis
  • 6. 20th Century Statistics dominates © KDnuggets 2013 6 statistics Note: Google Ngrams are case-sensitive. Here used lower case as more representative Google Ngrams, smoothing=1
  • 7. “Data Mining” surges in 1996, peaks in 2004-5 © KDnuggets 2013 7 Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, Eds: U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy analytics data mining KDD-95, 1st Conference on Knowledge Discovery and Data Mining, Montreal Google Ngrams, smoothing=1
  • 8. Analytics surges in 2006, after Google Analytics introduced (c) KDnuggets 2013 Slow-down in analytics in 2012? Google Analytics introduced, Dec 2005 Google Trends, Jan 2005 – July 2013 “analytics - google” is 50% of “analytics” searches analytics
  • 9. In 2013: Big Data > Data Mining > Business Analytics > Predictive Analytics > Data Science 9© KDnuggets 2013 Big Data Google Trends search, Jan 2008 - July 2013 Data mining Big Data slowdown?
  • 10. History • 1900 - Statistics • 1960s Data Mining = bad activity, data “dredging” • 1990 - “Data Mining” is good, surges in 1996 • 2003 - “Data Mining” peaks, image tarnished (Total Information Awareness, invasion of privacy) • 2006 - Google Analytics appears • 2007 - Business/Data/Predictive Analytics • 2012 - Big Data surge • 2013 - Data Science • 2015 - ?? 10© KDnuggets 2012
  • 11. Data Competitions – Short History (c) KDnuggets 2013 11
  • 12. 1st Data Mining Competition: KDD-CUP 1997 – Organized by Ismail Parsa (then at Epsilon) – Task: given data on past responders to fund-raising, predict most likely responders for new campaign – Data: • Population of 750K prospects, 300+ variables • 10K (1.4%) responded to a broad campaign mailing • Competition file was a stratified sample of 10K responded, 26K non-resp. (28.7% response rate) – Big effort on leaker detection (false predictors) KDD Cup was almost cancelled - several times Charles Elkan found leakers in training data
  • 13. Evaluating Targeted List: Cumulative Pct Hits (Gains) [Chart: Cumulative % Hits (y-axis) vs. Pct list (x-axis), Model vs. Random] 5% of random list have 5% of targets, but 5% of model ranked list have 21% of targets. Cum Pct Hits (5%, model) = 21%.
  • 14. KDD-CUP Participant Statistics – 45 companies/institutions participated • 23 research prototypes • 22 commercial tools – 16 contestants turned in their results • 9 research prototypes • 7 commercial tools – Evaluation: Best Gains (CPH) at 40% and 10% – Joint winners: • Charles Elkan (UCSD) with BNB, Boosted Naive Bayesian Classifier • Urban Science Applications, Inc. with commercial Gain, Direct Marketing Selection System • 3rd place: MineSet (SGI, Ronny Kohavi)
  • 15. KDD-CUP Results Discussion – Top finishers very close – Naïve Bayes algorithm was used by 2 of the top 3 contestants (BNB and 3rd place MineSet) – Naïve Bayes tools did little data preprocessing, used small number of variables – Urban Science implemented a tremendous amount of automated data preprocessing and exploratory data analysis and developed more than 50 models in an automated fashion to get to their results
  • 16. 16 KDD Cup 1997: Top 3 results Top 3 finishers are very close
  • 17. 17 KDD Cup 1997 – worst results Note that the worst result (C6) was actually worse than random. Competitor names were kept anonymous, apart from top 3 winners
  • 18. KDD Cup Lessons • Data Preparation is key, especially eliminating “leakers” (false predictors) • Avoid overfitting the test data • Simple models work well for predicting human behavior © KDnuggets 2013 18
  • 19. Big Competition Successes • Ansari X-Prize 2004: Spaceship One went to space twice in 2 weeks • DARPA Grand Challenge, 2005: 150 mi Off-road robotic car navigation © KDnuggets 2013 19
  • 20. Netflix Prize • Started in 2006, with 100M ratings, 500K users, 18K movies, $1M prize • Goal: reduce RMSE error in “star” rating by 10% (was 0.95 for Netflix own system Cinematch) • Public training data, public & secret test sets © KDnuggets 2013 20 Predicted Actual
  • 21. Netflix Prize Milestones • In just one week, WXYZ consulting team beat Netflix system with RMSE 0.9430 • Progress in 2007-8 was very slow: • In 2007 KDnuggets Poll 32% thought prize will never be won • Took 3 years to reach 10% improvement © KDnuggets 2013 21
  • 22. Netflix Prize Winners • Winning team used a complex ensemble of many algorithms • Two teams had exactly the same RMSE of 0.8567, but winner submitted 20 minutes earlier ! © KDnuggets 2013 22
  • 23. Netflix Prize lessons, 1 • Competitions work • Limits to predicting human behavior – inherent randomness, noisy data • Privacy concerns – Researchers found a few people with matching IMDB and Netflix ratings – potential privacy breach – 4 Netflix users sued – Netflix Prize Sequel – cancelled © KDnuggets 2013 23
  • 24. Netflix Prize lessons, 2 • Winning algorithm was too complex, too tailored to specific data set, never used – Netflix blog, Apr 2012 • A basic SVD algorithm, proposed by Simon Funk (KDnuggets Interview w. Simon Funk) got ~6% improvement • SVD++ version by Yehuda Koren & winning team reached ~8% improvement, was used by Netflix © KDnuggets 2013 24
  • 25. Netflix Prize lessons, 3 • Wrong question was asked ! (Minimizing RMSE of predicted vs actual ratings) • RMSE gives big penalty for errors > 2 stars, so an algo. that fails big a few times will be worse than an algo. that is often worse by 1. • Errors are not equal, but RMSE treats 2 vs 3 stars same as 4 vs 5 or 1 vs 2. • Also, Netflix Instant became more popular • Better question would be “what do you like to watch” (anything on Instant likely to rank > 3) © KDnuggets 2013 25
  • 26. Focus on the right question ? and the right GOAL © KDnuggets 2013 26
  • 27. Kaggle Competition Platform • Launched by Anthony Goldbloom in 2010 • Quickly became the top platform for competitions – Few people know of TunedIT competition platform launched in 2009 • Kaggle in Class – free for Universities • Achieved 100,000 members in July 2013 © KDnuggets 2012 27
  • 28. Kaggle Successes • Allstate competition: Winner model was 270% more accurate than baseline • Identified sound of the endangered North American Right whale in audio recordings • GE FlightQuest • Heritage Health Prize - $3M competition, 2011-13 • But … Competitions - very time consuming © KDnuggets 2013 28
  • 29. Kaggle Business Model • Initial business model - % of prize • Kaggle Job Boards (currently free) • Kaggle Connect: Offers consulting with top 0.5% of Kagglers (at $300/hr ? see post), or $30-100K/month (IW, Mar 2013) • Private competitions (Masters) open to top Kagglers – Heritage Health Prize 2 © KDnuggets 2013 29
  • 30. Winning on Kaggle • Kaggle Chief Scientist: Specialist knowledge – useless & unhelpful (Slate, Dec 2012) • Big-data approaches • Use good tools: R, Random forests • Curiosity, Creativeness, Persistence, Team, Luck? (also Quora answer) • Many (most?) winners – not professional data scientists (physicists, math profs, actuary) (RW, Apr 2012) © KDnuggets 2013 30
  • 31. ”your Ivy League diploma and IBM resume don't matter so much as my Kaggle score” Almost true 31
  • 32. Data: Public, Government, Portals, Marketplaces © KDnuggets 2013 32
  • 33. Public Data www.KDnuggets.com/datasets/ • Government, Federal, State, City, Local and public data sites and portals • Data APIs, Hubs, Marketplaces, Platforms, Portals, and Search Engines. • Data Markets: DataMarket • Data Platforms: Enigma, InfoChimps (acq. by CSC), Knoema, Exversion, … • Data Search Engines: Quandl, qunb, Zanran • Location: Factual • People and places: Freebase © KDnuggets 2013 33
  • 34. Public and Government Data • Datamob.org: tracks government data in developer-friendly format © KDnuggets 2013 34 data about U.S. state legislative activities, including bill summaries, votes, sponsorships, legislators and committees.
  • 35. US Project Open Data • In May 2013, White House announced Project Open Data • “information is a valuable national asset whose value is multiplied when it is made easily accessible to the public”. • “The Executive Order requires that, going forward, data generated by the government be made available in open, machine-readable formats, while appropriately safeguarding privacy, confidentiality, and security.” © KDnuggets 2013 35
  • 36. Using Public Data • Google – biggest success ? • Data Science for Social Good (Chicago) (Fast Company, Aug 2013) – predict when bikeshare stations run out of bikes – forecast local crime – warn local hospitals about impending heart attacks © KDnuggets 2013 36
  • 37. Big Data • 2nd Industrial Revolution • Do old activities better • Create new activities/businesses 37(c) KDnuggets 2013
  • 38. Doing Old Things Better Application areas – Direct marketing/Customer modeling – Churn prediction – Recommendations – Fraud detection – Security/Intelligence – … • Improvement will be real, but limited because of human randomness • Competition will level companies 38(c) KDnuggets 2013
  • 39. Big Data Enables New Things ! – Google – first big success of big data – Social networks (Facebook, Twitter, LinkedIn, …) success depends on network size, i.e. big data – Location analytics – Health-care • Personalized medicine – Semantics and AI ? • Imagine IBM Watson, Google Now, Siri in 2023 ? 39(c) KDnuggets 2013
  • 40. Copyright © 2003 KDnuggets
  • 41. Big Data Bubble? © 2013 KDnuggets 41 Gartner Hype Cycle Big Data
  • 42. Gartner Hype Cycle for Big Data, 2012 © KDnuggets 2013 42 Data Scientist, 2-5 yrs Social Network Analysis, 5-10 Social Analytics, 2-5 Predictive Analytics, <2 MapReduce & Alternative - Disillusionment
  • 43. Questions? KDnuggets: Analytics, Big Data, Data Mining • News, Jobs, Software, Courses, Data, Meetings, Publications, Webcasts, … www.KDnuggets.com/news • Subscribe to KDnuggets News email at www.KDnuggets.com/subscribe.html • Twitter: @kdnuggets • Email to editor1@kdnuggets.com 43© KDnuggets 2013
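
Slide 13 evaluates a targeted list by cumulative percent hits (gains): rank prospects by model score, take the top X% of the list, and ask what share of all responders falls in that slice. The following is a minimal sketch of that calculation, not the KDD-CUP 1997 evaluation code. The function name `cumulative_gains` and the toy scores and labels are assumptions made up for illustration.

```python
def cumulative_gains(scores, labels, pct):
    """Percent of all responders found in the top `pct` percent of the
    list when prospects are ranked by descending model score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    cutoff = int(len(ranked) * pct / 100)          # size of the top slice
    hits = sum(label for _, label in ranked[:cutoff])
    total = sum(labels)                            # all responders in the list
    return 100.0 * hits / total


# Toy data (hypothetical): 20 prospects, 4 responders (label 1).
scores = list(range(20, 0, -1))                    # higher score = ranked earlier
labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

top_quarter = cumulative_gains(scores, labels, 25)
whole_list = cumulative_gains(scores, labels, 100)
```

On this toy list the top 25% holds 3 of the 4 responders, so the gain at 25% is 75%, where a randomly ordered list would be expected to capture about 25%. That is the same comparison the slide makes at 5% of the list (21% for the model versus 5% for random).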

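Slides 20 and 25 hinge on RMSE: the prize target was a 10% reduction from roughly 0.95, and the slide argues that RMSE punishes a few large misses more than many small ones. Below is a small sketch of both points using only the rounded figures from the slides. The rating vectors are invented for illustration and are not Netflix data.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Prize target implied by the slide's rounded baseline of 0.95.
baseline_rmse = 0.95
target_rmse = baseline_rmse * (1 - 0.10)

# Hypothetical case: 16 movies, each actually rated 4 stars.
actual = [4] * 16
often_off_by_one = [5] * 8 + [4] * 8     # wrong by 1 star on half the movies
one_big_miss = [1] + [4] * 15            # wrong by 3 stars on a single movie

rmse_small_errors = rmse(often_off_by_one, actual)
rmse_big_miss = rmse(one_big_miss, actual)
```

In this made-up case the predictor that is wrong on 8 of 16 movies scores an RMSE of about 0.71, while the predictor that is wrong only once, but by 3 stars, scores 0.75. This illustrates the slide's point that squaring the errors lets a handful of big failures outweigh frequent one-star misses.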
Editor's Notes

  1. The future is bright for Big Data, but we need to use caution when evaluating claims