We recently implemented a new framework for the real-time collection, aggregation and visualization of web- and mobile-generated clickstream traffic (daily volumes of 1M+ events). This walkthrough covers the motivations, thought process and key decisions made, along with an in-depth look at how to build out a data-collection, analytics and visualization framework using MongoDB. Besides MongoDB, the technologies covered are Java, Spring, Django and PyMongo.
Implementing and Visualizing Clickstream data with MongoDB
1. Implementing and Visualizing Clickstream Data with MongoDB
Jan 22, 2013 - New York MongoDB User Group
Cameron Sim - LearnVest.com
2. Agenda
• About LearnVest
• HL Application Architecture
• Data Capture
• Event Packaging
• MongoDB Data Warehousing
• Loading & Visualization
• Finishing up
3. LearnVest Inc.
www.learnvest.com
Mission Statement
• Aiming to make financial planning as accessible as having a gym membership
Company
• Founded in 2008 by Alexa Von Tobel, CEO
• 50+ people and growing rapidly
• Based in NYC
Key Products
• Account Aggregation and Management (Bank, Credit, Loan, Investment, Mortgage)
• Original and Syndicated Newsletter Content
• Financial Planning (tiered product offering)
Platforms
• Web, iPhone
Stack
• Operational: Wordpress, Backbone.js, Node.js; Java, Spring 3, Redis, Memcached
• Analytics: MongoDB 2.2.0 (3-node replica-set); Java 6, Spring 3
6. High Level Architecture
[Architecture diagram: Production (Delivery Services) and Analytics (Services, Loaders, Dashboards); connection labels: HTTPS, pyMongo]
7. Capture Everything
Collection
• User-driven events over web and mobile
• System-level exceptions
• Everything else
Temporary Data
• 'Ok' with approximate data
• Operational databases are the system of record
Aggregate events as they come in
• Remove the overhead of basic metrics (counts, sums) on core events
• Group by user unique id and increment counts per event, over time-dimensions (day, week-ending, month, year)
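The time dimensions listed on this slide can be sketched as a small helper that turns one event timestamp into one bucket label per dimension. This is only an illustration: the function name `dimension_keys`, the label formats, and the choice of Sunday as the week-ending day are assumptions, not details taken from the talk.

```python
from datetime import datetime, timedelta


def dimension_keys(ts: datetime) -> dict:
    """Return one bucket label per time dimension for an event timestamp."""
    day = ts.date()
    # Assumption: a week "ends" on Sunday (weekday() == 6).
    week_ending = day + timedelta(days=6 - day.weekday())
    return {
        "day": day.strftime("%Y-%m-%d"),
        "week": week_ending.strftime("%Y-%m-%d"),  # labelled by its last day
        "month": day.strftime("%Y-%m"),
        "year": day.strftime("%Y"),
    }


if __name__ == "__main__":
    print(dimension_keys(datetime(2013, 1, 22, 19, 30)))
```

Every event that falls in the same day, week, month or year maps to the same label, which is what lets a counter be incremented in place rather than recomputed from raw events.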
8. Data Capture
iOS (Objective-C)
- (void)sendAnalyticEventType:(NSString *)eventType
                       object:(NSString *)object
                         name:(NSString *)name
                         page:(NSString *)page
                       source:(NSString *)source
{
    // params is the request body; eventData carries the optional event details
    NSMutableDictionary *params = [NSMutableDictionary dictionary];
    NSMutableDictionary *eventData = [NSMutableDictionary dictionary];
    if (eventType != nil) [params setObject:eventType forKey:@"eventType"];
    if (object != nil) [eventData setObject:object forKey:@"object"];
    if (name != nil) [eventData setObject:name forKey:@"name"];
    if (page != nil) [eventData setObject:page forKey:@"page"];
    if (source != nil) [eventData setObject:source forKey:@"source"];
    if (eventData != nil) [params setObject:eventData forKey:@"eventData"];
    [[LVNetworkEngine sharedManager] analytics_send:params];
}
9. Data Capture
WEB (JavaScript)
function internalTrackPageView() {
    var cookie = {
        userContext: jQuery.cookie('UserContextCookie')
    };
    var trackEvent = {
        eventType: 'pageView',
        eventData: {
            page: window.location.pathname + window.location.search
        }
    };
    // AJAX
    jQuery.ajax({
        url: '/api/track',
        type: 'POST',
        dataType: 'json',
        data: JSON.stringify(trackEvent),
        // Set Request Headers
        beforeSend: function (xhr, settings) {
            xhr.setRequestHeader('Accept', 'application/json');
            xhr.setRequestHeader('User-Context', cookie.userContext);
            if (settings.type === 'PUT' || settings.type === 'POST') {
                xhr.setRequestHeader('Content-Type', 'application/json');
            }
        }
    });
}
10. Bus Event Packaging
• Spring 3 RESTful service layer; controller methods define the eventCode via the @Tracking annotation
• Custom Interceptor class extends HandlerInterceptorAdapter and implements …Handle() (for each event) to invoke calls via Spring @Async to an EventPublisher
• EventPublisher publishes to a common event bus queue with multiple subscribers, one of which packages the eventPayload Map<String, Object> object and forwards it to the Analytics REST service
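The publish/subscribe flow described on this slide can be sketched in a few lines. This is a simplified, synchronous illustration rather than the Spring implementation from the talk: the `EventBus` class and its method names are hypothetical, and the real system dispatches asynchronously.

```python
import logging
from typing import Any, Callable, Dict, List

log = logging.getLogger("tracking")

Subscriber = Callable[[str, Dict[str, Any]], None]


class EventBus:
    """Hypothetical in-process bus: every subscriber sees every event."""

    def __init__(self) -> None:
        self._subscribers: List[Subscriber] = []

    def subscribe(self, subscriber: Subscriber) -> None:
        self._subscribers.append(subscriber)

    def publish(self, event_code: str, payload: Dict[str, Any]) -> None:
        for subscriber in self._subscribers:
            try:
                # Hand each subscriber its own shallow copy of the payload.
                subscriber(event_code, dict(payload))
            except Exception:
                # A failing subscriber must not break the others
                # or the request that triggered the event.
                log.exception("Error tracking event '%s'", event_code)


if __name__ == "__main__":
    bus = EventBus()
    bus.subscribe(lambda code, payload: print("analytics got", code, payload))
    bus.publish("user.login", {"eventType": "user.login"})
```

Keeping analytics as just one subscriber among several means tracking can be added or removed without touching the controllers that raise the events.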
11. Bus Event Packaging
Spring RestController Methods
Interface
@RequestMapping(value = "/user/login", method = RequestMethod.POST,
        headers = "Accept=application/json")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event,
        HttpServletRequest request);
Concrete/Impl Class
@Override
@Tracking("user.login")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event,
        HttpServletRequest request) {
    // Implementation
    return event;
}
12. Bus Event Packaging
Custom Interceptor class extends HandlerInterceptorAdapter
protected void handleTracking(String trackingCode, Map<String, Object> modelMap,
        HttpServletRequest request) {
    Map<String, Object> responseModel = new HashMap<String, Object>();
    // remove non-serializables & copy over data from modelMap
    try {
        this.eventPublisher.publish(trackingCode, responseModel, request);
    } catch (Exception e) {
        // tracking failures are logged and never propagate to the caller
        log.error("Error tracking event '" + trackingCode + "' : "
                + ExceptionUtils.getStackTrace(e));
    }
}
13. Bus Event Packaging
Custom Interceptor class extends HandlerInterceptorAdapter
public void publish(String eventCode, Map<String, Object> eventData,
        HttpServletRequest request) {
    Map<String, Object> payload = new HashMap<String, Object>();
    String eventId = UUID.randomUUID().toString();
    Map<String, String> requestMap = HttpRequestUtils.getRequestHeaders(request);
    // Normalize message
    payload.put("eventType", eventData.get("eventType"));
    payload.put("eventData", eventData.get("eventData"));
    payload.put("version", eventData.get("version"));
    payload.put("eventId", eventId);
    payload.put("eventTime", new Date());
    payload.put("request", requestMap);
    // ...
}
// Send to the Analytics Service for MongoDB persistence
public void sendPost(EventPayload payload) {
    HttpEntity<Object> request = new HttpEntity<Object>(payload.getEventPayload(), headers);
    Map m = restTemplate.postForObject(endpoint, request, java.util.Map.class);
}
16. MongoDB Data Warehousing
MongoDB Information
• 2.2.0
• 3-node replica-set
• Large (primary), 2x Medium (secondary) AWS Amazon-Linux machines with single 500GB EBS volumes mounted to /opt/data
MongoDB Config File
dbpath = /opt/data/mongodb/data
rest = true
replSet = voyager
Volumes
• … events daily on web, ~600K on mobile
• …GB per day at start, slowed to ~1GB per day
• Currently at 78GB (collecting since August 2012)
Future Scaling Strategy
• Set up a 2nd replica-set
• Shard replica-sets to n at 60% / 250GB per EBS volume
• Shard key probably based on a sequential mix of email_address & an additional string
17. MongoDB Data Warehousing
WEB/MOBILE
Persist all events, bucketed by source, event code and time:
• WEB/MOBILE
• user.login
• time (day, week-ending, month, year)
Insert into collection e_web / e_mobile
Upsert into:
• web_user_login_day
• web_user_login_week
• web_user_login_month
• web_user_login_year
Predictable model for scaling and measuring business growth
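The naming scheme above is regular enough to derive mechanically. The following sketch shows one way to do so; the helper names `raw_collection` and `rollup_collections` are made up for illustration, and only the resulting collection names come from the slide.

```python
# Time dimensions, in the order used by the roll-up collections on the slide.
DIMENSIONS = ("day", "week", "month", "year")


def raw_collection(source: str) -> str:
    """Collection that receives every raw event for a source, e.g. e_web."""
    return f"e_{source}"


def rollup_collections(source: str, event_code: str) -> list:
    """Names of the per-dimension counter collections for one event code."""
    base = f"{source}_{event_code.replace('.', '_')}"
    return [f"{base}_{dimension}" for dimension in DIMENSIONS]


if __name__ == "__main__":
    print(raw_collection("web"))
    print(rollup_collections("web", "user.login"))
```

Because the names depend only on source, event code and dimension, a new event type needs no schema work: its four counter collections are created by the first upsert.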
18. MongoDB Data Warehousing
// $inc modifier: add 1 to "count" (arguments true, false = upsert, not multi)
DBObject newDocument = new BasicDBObject().append("$inc",
        new BasicDBObject().append("count", 1));
// update day dimension
collection_day.update(new BasicDBObject().append("user-context", userContext)
        .append("eventType", eventType)
        .append("date", sdf_day.format(d)), newDocument, true, false);
// update week dimension
collection_week.update(new BasicDBObject().append("user-context", userContext)
        .append("eventType", eventType)
        .append("date", sdf_day.format(w)), newDocument, true, false);
// update month dimension
collection_month.update(new BasicDBObject().append("user-context", userContext)
        .append("eventType", eventType)
        .append("date", sdf_month.format(d)), newDocument, true, false);
// update year dimension
collection_year.update(new BasicDBObject().append("user-context", userContext)
        .append("eventType", eventType)
        .append("date", sdf_year.format(d)), newDocument, true, false);
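To make the effect of an upsert combined with `$inc` concrete, here is a small in-memory emulation. It is not MongoDB code: `upsert_inc` is a hypothetical helper, and a plain Python list stands in for a collection. It only mimics the behaviour relied on above, where the first event creates the counter document and later events increment it.

```python
def upsert_inc(collection: list, query: dict, field: str, amount: int = 1) -> dict:
    """Emulate update(query, {"$inc": {field: amount}}, upsert=True, multi=False)."""
    for doc in collection:
        # A document matches when every key in the query has an equal value.
        if all(doc.get(key) == value for key, value in query.items()):
            doc[field] = doc.get(field, 0) + amount
            return doc
    # No match: insert a new document built from the query plus the counter.
    doc = dict(query)
    doc[field] = amount
    collection.append(doc)
    return doc


if __name__ == "__main__":
    day_counts = []
    key = {"user-context": "u1", "eventType": "user.login", "date": "2013-01-22"}
    upsert_inc(day_counts, key, "count")
    upsert_inc(day_counts, key, "count")
    print(day_counts)
```

The same call serves all four dimensions; only the target collection and the `date` label change.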
20. MongoDB Data Warehousing
Sample event document (excerpt)
{ …, "accept-charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
  "cookie" : "size=…; …de=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; …IONID=56EB165266A2C4AFF946F139669D746F; …oken=73bdcdddf151dc56b8020855b2cb10c8",
  "content-length" : "255", "accept-encoding" : "gzip,deflate,sdch" },
  "eventType" : "flick",
  "eventData" : { "object" : "…on", "name" : "split transaction button", "page" : "#inbox/79876/", "section" : "…saction_river_details" } }
21. MongoDB Data Warehousing
Indexing Strategy
• Indexes on core collections (e_web and e_mobile) come in under 3GB on the 7.5GB Large instance and 3.75GB on Medium instances
• Store datetime in two fields and compound index on date with other fields like eventType and user unique id (user-context)
• Heavy insertion rates, much lower read rates, so the fewer indexes the better
22. MongoDB Data Warehousing
Indexing Strategy
> db.e_web.getIndexes()
[
  { "v" : 1, "key" : { "request.user-context" : 1, "created_date" : 1 },
    "ns" : "moneycenter.e_web", "name" : "request.user-context_1_created_date_1" },
  { "v" : 1, "key" : { "eventData.name" : 1, "created_date" : 1 },
    "ns" : "moneycenter.e_web", "name" : "eventData.name_1_created_date_1" }
]
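One way to read the "two date fields" idea is sketched below: keep the full timestamp for ordering and a day-truncated copy for equality matches and compound indexes. The field name `created_datetime`, the helper `add_date_fields` and the constant `USER_BY_DAY_INDEX` are assumptions for illustration; only `created_date` and `request.user-context` appear in the index listing.

```python
from datetime import datetime


def add_date_fields(event: dict, ts: datetime) -> dict:
    """Return a copy of the event carrying both a full and a day-level timestamp."""
    doc = dict(event)
    doc["created_datetime"] = ts  # hypothetical name for the full timestamp
    # Truncate to midnight so all events of one day share the same value.
    doc["created_date"] = datetime(ts.year, ts.month, ts.day)
    return doc


# Compound index written as (key, direction) pairs,
# the form PyMongo accepts when creating an index.
USER_BY_DAY_INDEX = [("request.user-context", 1), ("created_date", 1)]


if __name__ == "__main__":
    print(add_date_fields({"eventType": "pageView"}, datetime(2013, 1, 22, 19, 30, 5)))
```

With a day-level field, "all events for this user today" becomes an exact match on both index keys rather than a range scan over timestamps.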
23. Loading & Visualization
Objective
• Show historic and intraday stats on core use cases (logins, conversions)
• Show user funnel rates on conversion pages
• Show general usability - how do users really use the Web and iOS platforms?
Non-Functionals
• Intraday doesn't need to be "real-time"; polling is good enough for now
• Overnight batch job for historic data must scale horizontally
General Implementation Strategy
• Do all heavy lifting & object manipulation in the service; the UI should just display a graph or table
• Modularize the service to be able to regenerate any graphs/tables without a full load
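The modularization point can be illustrated with a small registry in which each graph has its own builder, so one graph can be rebuilt without running the full load. Everything here is hypothetical (the `graph` decorator, `BUILDERS`, `regenerate` and the two sample builders); the talk does not show how the batch service is organized.

```python
from typing import Callable, Dict, Iterable, Optional

# Registry mapping a graph name to the function that builds its data.
BUILDERS: Dict[str, Callable[[], dict]] = {}


def graph(name: str):
    """Decorator that registers a builder function under a graph name."""
    def register(fn: Callable[[], dict]) -> Callable[[], dict]:
        BUILDERS[name] = fn
        return fn
    return register


@graph("logins")
def build_logins() -> dict:
    return {"title": "Logins", "series": []}  # placeholder data


@graph("conversions")
def build_conversions() -> dict:
    return {"title": "Conversions", "series": []}  # placeholder data


def regenerate(names: Optional[Iterable[str]] = None) -> dict:
    """Rebuild only the named graphs, or all of them when no names are given."""
    selected = BUILDERS.keys() if names is None else names
    return {name: BUILDERS[name]() for name in selected}


if __name__ == "__main__":
    print(regenerate(["logins"]))
```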
24. Loading & Visualization
Java Batch Service
Java Mongo library to query key collections and return user counts and sum of events
DBCursor webUserLogins = c.find(
        new BasicDBObject("date", sdf.format(new Date())));

private HashMap<String, Object> getSumAndCount(DBCursor cursor) {
    HashMap<String, Object> m = new HashMap<String, Object>();
    int sum = 0;    // total events across all users
    int count = 0;  // number of user documents
    DBObject obj;
    while (cursor.hasNext()) {
        obj = (DBObject) cursor.next();
        count++;
        sum = sum + (Integer) obj.get("count");
    }
    m.put("sum", sum);
    m.put("count", count);
    // average events per user, guarding against an empty cursor
    m.put("average", String.format("%.2f", count == 0 ? 0f : (float) sum / count));
    return m;
}
25. Loading & Visualization
Java Batch Service
Use the Aggregation Framework where required on core collections (e_web) and external…
// create aggregation objects
DBObject project = new BasicDBObject("$project",
        new BasicDBObject("day_value", fields));
DBObject day_value = new BasicDBObject("day_value", "$day_value");
DBObject groupFields = new BasicDBObject("_id", day_value);
// create the fields to group by, in this case "number"
groupFields.put("number", new BasicDBObject("$sum", 1));
// create the group
DBObject group = new BasicDBObject("$group", groupFields);
// execute
AggregationOutput output = mycollection.aggregate(project, group);
for (DBObject obj : output.results()) {
    // ...
}
26. Loading & Visualization
Java Batch Service
MongoDB Command Line example on aggregation over a time period, e.g. month
> db.e_web.aggregate([
    { $match : { created_date : { $gt : ISODate("2012-10-25T00:00:00") } } },
    { $project : { day_value : { day : { $dayOfMonth : "$created_date" },
                                 month : { $month : "$created_date" } } } },
    { $group : { _id : { day_value : "$day_value" }, number : { $sum : 1 } } },
    { $sort : { "_id.day_value" : -1 } }
  ])
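For readers unfamiliar with aggregation pipelines, the stages above correspond roughly to the following plain-Python steps over a list of event dicts. This is an in-memory analogy under stated assumptions, not a replacement for the server-side pipeline; `events_per_day` is a hypothetical name, and each event is assumed to carry a `created_date` datetime.

```python
from collections import Counter
from datetime import datetime


def events_per_day(events, since: datetime) -> list:
    """Count events per (month, day) after `since`, newest bucket first."""
    counts = Counter()
    for event in events:
        ts = event["created_date"]
        if ts > since:                        # $match
            counts[(ts.month, ts.day)] += 1   # $project + $group with $sum: 1
    # $sort descending on the grouped key
    return sorted(counts.items(), reverse=True)


if __name__ == "__main__":
    sample = [
        {"created_date": datetime(2012, 10, 24, 9, 0)},
        {"created_date": datetime(2012, 10, 26, 9, 0)},
        {"created_date": datetime(2012, 10, 26, 17, 0)},
        {"created_date": datetime(2012, 11, 1, 8, 0)},
    ]
    print(events_per_day(sample, datetime(2012, 10, 25)))
```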
28. Loading & Visualization
…day numbers
try:
    conn = pymongo.Connection('localhost', 27017)
    db = conn['lvanalytics']
    cursor = db.accountmetrics.find(
        {"date": {"$gte": dt_from, "$lte": dt_to}}).sort("date")
    return buildMetricsDict(cursor)
except Exception as e:
    logger.error(e.message)
Return the graph object (as a list or a dict of lists) to the view that called the method
pagedata = {}
pagedata['accountsGraph'] = mongodb_home.getHomeChart()
return render_to_response('home.html', {'pagedata': pagedata},
    context_instance=RequestContext(request))
> db.homeGraphs.find()
{ "_id" : ObjectId("50f57b5c1d4e714b581674e2"), "accounts_natural" : 54, …
29. Loading & Visualization
Django and HighCharts
Populate the series (JavaScript with Django templating)
seriesOptions[0] = {
    id: 'naturalAccounts',
    name: "Natural Accounts",
    data: [
        {% for a in pagedata.metrics.accounts_natural %}
            {% if not forloop.first %}, {% endif %}
            [Date.UTC({{a.0}}), {{a.1}}]
        {% endfor %}
    ],
    tooltip: { valueDecimals: 2 }
};
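The template above expects each item to supply the arguments for `Date.UTC` in position 0 and the value in position 1. A possible server-side helper for preparing such pairs is sketched below. The name `highcharts_points` and the exact string format are assumptions; the one firm detail is that JavaScript's `Date.UTC` takes a zero-based month.

```python
from datetime import date


def highcharts_points(rows) -> list:
    """Turn (date, value) rows into (Date.UTC argument string, value) pairs."""
    points = []
    for day, value in rows:
        # Date.UTC(year, month, day) counts months from 0, so January is 0.
        points.append((f"{day.year}, {day.month - 1}, {day.day}", value))
    return points


if __name__ == "__main__":
    print(highcharts_points([(date(2013, 1, 15), 54), (date(2013, 1, 16), 61)]))
```

Doing this conversion in the loader keeps the template free of date arithmetic, in line with the stated goal that the UI should only display what it is given.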
32. Lessons Learned
• Date Time managed as two fields, Datetime and Date
• Aggregating and upserting documents as events are received works for us
• Real-time Map-Reduce in pyMongo - too slow, don’t do this.
• Django-nonrel - unstable; use plain Django and configure MongoDB as a
datastore only
• Memcached on Django is good enough (at the moment) - use django-celery
with rabbitmq to pre-cache all data after data loading
• HighCharts is buggy - considering D3 & other libraries
• Don’t need to retrieve data directly from MongoDB to Django, perhaps
provide all data via a service layer (at the expense of ever-additional
features in pyMongo)
33. Next Steps
• A/B testing framework, experiments and variances
• Unauthenticated / Authenticated user tracking
• Provide data async over service layer
• Segmentation with graphical libraries like D3 Cross-Filter (http://square.github.com/crossfilter/)
• Saving Query Criteria, expanding out BI tools for internal users
• MongoDB Connector, Hadoop and Hive (maybe Tableau and other tools)
• Storm / Kafka for real-time analytics processing
• Shard the Replica-Set, looking into Gizzard as the middleware
34. Hiring
• Hrishi Dixit, Chief Technology Officer - hrishi@learnvest.com
• Kevin Connelly, Director of Engineering - kevin@learnvest.com
• Will Larche, Lead iOS Developer - will@learnvest.com
• Cameron Sim, Director of Analytics Tech - cameron@learnvest.com
• Jeremy Brennan, Director of UI/UX Technology - jeremy@learnvest.com
• your name here, New Awesome Developer - you@learnvest.com