Klout, across its iterations, is a prime example of applying large-scale NLP data science to topical assignment. Klout makes this available through its website, http://klout.com, and through its developer API, http://developers.klout.com
5. WHAT IS KLOUT, REALLY?
• Klout is an API client application of the social web.
• Federated identity across platforms
• Macro and micro understanding of profile, conversation, and content.
Content and people linked by Topics.
6. UNIFYING PRINCIPLE: TOPICS
• TBs of social interactions a day
• NLP applied to posts
• Aggregated to profiles:
– effects are Klout Score, topical strengths
– the what becomes Topics
– the why becomes TopicSets
• Links crawled, NLP summarization
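The aggregation step above — per-post topic tags rolled up into profile-level topical strengths — can be sketched as follows. This is a minimal illustration, not Klout's pipeline: the dictionary, topic names, and normalization are invented for the example.

```python
from collections import Counter

# Hypothetical topic dictionary; Klout's real system maps terms into a
# managed ontology of ~10,000 topics.
TOPIC_DICTIONARY = {
    "nsa": "Privacy",
    "eff": "Privacy",
    "api": "APIs",
    "rest": "APIs",
}

def tag_topics(post_text):
    """NLP applied to a post: here, a simple dictionary lookup per token."""
    tokens = post_text.lower().split()
    return Counter(TOPIC_DICTIONARY[t] for t in tokens if t in TOPIC_DICTIONARY)

def aggregate_profile(posts):
    """Aggregate per-post topic counts into profile-level topical strengths."""
    strengths = Counter()
    for post in posts:
        strengths.update(tag_topics(post))
    total = sum(strengths.values()) or 1
    # Normalize counts so strengths are comparable across profiles.
    return {topic: n / total for topic, n in strengths.items()}

posts = ["The NSA and EFF are in the news", "A REST API for everything"]
print(aggregate_profile(posts))  # {'Privacy': 0.5, 'APIs': 0.5}
```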
7. TOPIC SETS + USERS + SCORING
• Allow for time-series slicing
• Aggregate counting
• Slicing of a set to create an ordered list
Topic-oriented view
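The three operations above — time-series slicing, aggregate counting, and ordering the slice — might look like this in miniature. The event schema and scores are assumptions for illustration only.

```python
from datetime import date

# Invented event records: (user, topic, score contribution, day).
events = [
    {"user": "alice", "topic": "APIs", "score": 3.0, "day": date(2015, 5, 1)},
    {"user": "bob",   "topic": "APIs", "score": 5.0, "day": date(2015, 5, 2)},
    {"user": "alice", "topic": "APIs", "score": 4.0, "day": date(2015, 5, 3)},
]

def topic_leaderboard(events, topic, start, end):
    """Slice a topic set by time window, aggregate, and order the result."""
    totals = {}
    for e in events:
        if e["topic"] == topic and start <= e["day"] <= end:  # time-series slice
            totals[e["user"]] = totals.get(e["user"], 0.0) + e["score"]  # count
    # Ordered list: highest aggregate score first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

print(topic_leaderboard(events, "APIs", date(2015, 5, 1), date(2015, 5, 3)))
# [('alice', 7.0), ('bob', 5.0)]
```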
9. KLOUT DEALS WITH RIDICULOUS AMOUNTS OF DATA
• Topic assignment at scale:
○ ~650M new pieces of data daily
○ hundreds of millions of profiles
○ ~10,000 topics in a 3-level hierarchy
○ daily updates
• Multiple social networks and various data sources:
○ Twitter, Facebook, LinkedIn, Google+, Wikipedia
○ user activity, profiles, connections
• Topics normalized to an evolving, managed ontology
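Normalizing raw topic strings into a managed, 3-level ontology could be sketched like this. The ontology entries, aliases, and slugging rules here are invented stand-ins for whatever the real managed ontology contains.

```python
# Hypothetical 3-level hierarchy: slug -> (level 1, level 2, level 3).
ONTOLOGY = {
    "machine-learning": ("Technology", "Data Science", "Machine Learning"),
    "apis": ("Technology", "Software", "APIs"),
}

# Alias table capturing the "evolving" part: new surface forms get mapped
# onto existing canonical slugs over time.
ALIASES = {"ml": "machine-learning", "machinelearning": "machine-learning",
           "api": "apis"}

def normalize(raw):
    """Map a raw topic string onto a canonical ontology node, or None."""
    slug = raw.strip().lower().replace(" ", "")
    slug = ALIASES.get(slug, slug)
    return ONTOLOGY.get(slug)  # None if the topic isn't in the ontology yet

print(normalize("ML"))  # ('Technology', 'Data Science', 'Machine Learning')
```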
10. WEIGHTING, NORMALIZATION, CALIBRATION
Signals are weighted and normalized to mirror real-world influence:
– Machine-learned weighting based on regression analysis of survey data
An advanced algorithm based on 1,500 signal combinations of relationships and ratios:
– Where: On which network is the action taking place?
– What: What action was taken?
– Who: Who acted on your content?
– How much: How many actions and unique actors?
– When: When was the action performed?
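One way to picture how the where/what/who/how-much/when dimensions combine into a single weighted signal is the sketch below. Every weight, the half-life, and the functional form are invented for illustration; the slide only says the real weights are machine-learned from survey regression.

```python
import math

# Illustrative parameters only — the real system learns its weights.
NETWORK_WEIGHT = {"twitter": 1.0, "facebook": 0.9}   # Where
ACTION_WEIGHT = {"retweet": 2.0, "like": 1.0}        # What
HALF_LIFE_DAYS = 30.0                                 # When: decay horizon

def signal_score(network, action, actor_score, count, age_days):
    """Combine the five signal dimensions into one weighted value."""
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)        # When: older counts less
    return (NETWORK_WEIGHT[network] * ACTION_WEIGHT[action]
            * actor_score          # Who: influence of the actor matters
            * math.log1p(count)    # How much: diminishing returns on volume
            * decay)
```

Note the log on the volume term: one influential actor retweeting should outweigh many repeated low-value actions, which is the kind of ratio a calibrated model can express.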
11. TOPIC SETS FOR CONTEXT
• User’s Influence: with various scores
• User’s Interests: with various scores
• User’s Self-selection: based on registered, self-declared interests
• Audience Influence: rollup of User’s Influence within a user’s downlevel and uplevel networks
• Audience Interests: rollup of User’s Interests within a user’s downlevel (and uplevel) networks
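The "rollup" topic sets above can be sketched as a simple aggregation over a user's network. The per-user influence scores and follower graph here are invented; "downlevel" is taken to mean followers, per the slide.

```python
# Invented per-user topical influence scores.
user_influence = {
    "alice": {"APIs": 0.8, "Privacy": 0.2},
    "bob":   {"APIs": 0.5},
    "carol": {"Privacy": 0.9},
}
# dave's downlevel network (his followers), also invented.
followers = {"dave": ["alice", "bob", "carol"]}

def audience_influence(user):
    """Roll up followers' topical influence into an Audience Influence set."""
    rollup = {}
    for follower in followers.get(user, []):
        for topic, score in user_influence.get(follower, {}).items():
            rollup[topic] = rollup.get(topic, 0.0) + score
    return rollup

print(audience_influence("dave"))  # APIs ~1.3, Privacy ~1.1
```

The same shape works for Audience Interests: swap the per-user influence table for a per-user interests table and roll up identically.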
12. CHALLENGES IN BIG DATA
● Message size: Overall data size may be huge, but message size per user may be small.
● Text sparsity: Many users may be passive consumers of content.
● Noise: colloquial language, slang, grammatical errors, abbreviations.
● Context: Need to expand context to get more information.
● False positives are embarrassing when user-facing.
13. CHALLENGES TO SCALE
NLP* - StanfordNLP english.conll.4class.distsim.crf.ser.gz
● Speed Matters (650M messages a day):
○ Stanford Named Entity Extraction - 10.959 ms (82.0 CPU days)
○ Dictionary - 0.056ms (0.42 CPU days)
● Corpus
○ Stanford Named Entity Extraction:
■ {‘the rule of law’=1.0}
○ Dictionary based:
■ {‘the rule of law’=1.0, ‘nsa’=1.0, ‘eff’=1.0}
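The ~200x speed gap makes sense when you see what dictionary extraction actually does: a hash lookup per n-gram instead of CRF sequence inference. A minimal sketch, using the three entities from the slide (the n-gram cap and matching rules are assumptions):

```python
# Entity list from the slide's example; a real dictionary would hold many more.
ENTITIES = {"the rule of law", "nsa", "eff"}
MAX_NGRAM = 4  # assumed cap on phrase length

def dictionary_extract(text):
    """Find known entities by probing every n-gram against a hash set."""
    tokens = text.lower().split()
    found = {}
    for n in range(1, MAX_NGRAM + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in ENTITIES:
                found[phrase] = 1.0
    return found

print(dictionary_extract("The NSA and the EFF defend the rule of law"))
# {'nsa': 1.0, 'eff': 1.0, 'the rule of law': 1.0}
```

This also shows the recall difference from the slide: the dictionary surfaces ‘nsa’ and ‘eff’, which the 4-class NER model misses, at a tiny fraction of the CPU cost.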
15. MACHINE LEARNING AT KLOUT
We our leverage past machine learning and NLP
classification assets to:
• Train new models for adding additional data sources
• Retraining Topics classification
• Predict “actionability” of support
• Predict virality of content [macro and micro]
• Predict the “personhood” of a social media account
• Content-targeting based on downlevel predictions
20. PARAMETERIZATION
• Topics Scoring uses different models in each topic
set
• Overall Topic Scoring is based on hundreds of
features, weights, decays, spanning short and
long term
• Parameterize scoring for different contexts
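Parameterizing one scoring function per context might look like the sketch below: the same code path, with per-topic-set parameters swapped in. The contexts, half-lives, and thresholds are invented for illustration.

```python
# Hypothetical per-context parameters: influence demands sustained,
# long-horizon signal; interests react faster with a lower bar.
CONTEXT_PARAMS = {
    "influence": {"half_life_days": 90.0, "min_events": 5},
    "interests": {"half_life_days": 30.0, "min_events": 1},
}

def score(events, context):
    """Score (value, age_days) events under a context's parameters."""
    p = CONTEXT_PARAMS[context]
    if len(events) < p["min_events"]:
        return 0.0  # not enough evidence in this context
    # Exponential decay: short half-life for fast-moving contexts,
    # long half-life for stable ones.
    return sum(v * 0.5 ** (age / p["half_life_days"]) for v, age in events)
```

The same single event scores 1.0 as an interest but 0.0 as influence, which is the point: one model shape, different behavior per topic set.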
22. EXAMPLES
• Treated like a product, your data requires you to think through the implementations others would make.
• Maybe even make them your own.
23. POLICY
• Data is great.
• Representation of data is hard.
• Raw data rarely, if ever, needs to be displayed.
• Balance innovation on data assets with brand, utility, and allowed use cases.
25. Bye!
May 2015 – APIdays
Tyler Singletary - @harmophone
Director of Platform
tyler@klout.com
Editor's Notes
Klout is best known for the Klout Score. For better or worse.
We have more.
Now we know a bit more about me. We don’t really know what it all means here. Expertise tells us a bit more.
Things towards the bottom start to look like interests.
I’m mostly known for talking about Politics of APIs issues. I won’t be doing that here.
Not going to re-cover Louis’ Predictive APIs talk – I’m not a machine learning expert.