2. 2
Introduction
• Introduction to Metail – who we are, why we use Snowplow
• How the Lambda Architecture has influenced our Data Architecture
• Where Cascalog fits in at Metail and why it works well with Snowplow
• Example of where we’ve used Cascalog and how it works
• Looker forward to the future
5. 5
• Sign up with just a few clicks
• See how the clothes look on you
• Build layered outfits
• Get a size recommendation
http://trymetail.com/collections/metail
6. 6
Product portfolio: Data services
1. Customer shape & size data can now aid brands' buying & selling decisions
2. Body shape & outfitting data -> crowd-sourced outfit recommendations
UNDERSTANDING SHAPE PROFILE OF CUSTOMERS: Do we need to create new collections to cater for clusters of different shapes?
HOW SHAPE VARIES BY SIZE: Do we need to change the fit profile by size to accommodate different shapes?
7. 7
KPI Analysis – Can we prove it actually works?
Metric: Definition
Return on Investment: [(VPV uplift * All Visits) - Investment] / Investment
Net Sales Revenue: Value of retained items in bin
Value per Visitor: Net Sales Revenue / Visitors
Visits (sessions): Set of activities with <= 30 minutes between consecutive events
User Conversion: Orders / Visitors
Adoption Rate: Number of users who use Metail / Number of users shown Metail
Average Order Value: Median value of all orders tracked in the time period
Return Rate: Number of items returned / Number of items purchased
Average Retained Order Value: Median value of all orders tracked in the time period after removing returned items
AB set-up: 50/50 split test
Managed by: Metail through their AB test platform
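To make the arithmetic concrete, here is a minimal sketch of those ratio metrics as plain Clojure functions (the language we use with Cascalog later on). The function and argument names are ours for illustration, and VPV uplift is taken to be the active group's value per visitor minus the control group's.

;; Illustrative only: the KPI ratios above as plain Clojure functions.
(defn value-per-visitor [net-sales-revenue visitors]
  (/ net-sales-revenue visitors))

(defn user-conversion [orders visitors]
  (/ orders visitors))

(defn adoption-rate [users-using-metail users-shown-metail]
  (/ users-using-metail users-shown-metail))

(defn return-on-investment
  "[(VPV uplift * All Visits) - Investment] / Investment"
  [vpv-uplift all-visits investment]
  (/ (- (* vpv-uplift all-visits) investment) investment))

;; e.g. (return-on-investment 0.05 100000 2000) => 1.5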
8. 8
KPI Analysis – Can we prove Metail's impact?
Data Collection
We need to know visitor counts, order values, which test group the user was in, whether they actually used Metail or not, time on site, what garments they wore, etc.
13. 13
Cascalog to produce Batch Views
Turn the Snowplow event stream into a normalised schema:
Snowplow Events -> Body Shape, Orders, Items Ordered, Returns, Browsers (visitors), Sessions, Garment Details, AB Events
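As a hedged illustration of what producing one of these views can look like, here is a minimal Cascalog sketch that carves an AB Events view out of the enriched event stream. The generator's field layout and the "ab_test" category are assumptions for the example, not Metail's actual schema.

(ns metail.batch-views.ab-events
  (:require [cascalog.api :refer :all]))

;; Keep only structured events whose category marks them as AB-test events.
;; The generator is assumed to emit these six fields, in this order.
(defn ab-test-event? [se-category]
  (= se-category "ab_test"))

(defn ab-events-view [snowplow-events]
  (<- [?domain-userid ?se-action ?se-label ?collector-tstamp]
      (snowplow-events ?domain-userid ?event ?se-category ?se-action ?se-label ?collector-tstamp)
      (= ?event "struct")
      (ab-test-event? ?se-category)))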
14. 14
Cascalog:
Snowplow ETL Runner Output -> Batch Views
Cascalog is designed to process Big Data on top of Hadoop. It is a replacement for tools like Pig, Hive, and Cascading, and operates at a significantly higher level of abstraction than those tools [1].
Write Clojure code to create our data processing jobs
• The code you write has to be MapReduce-aware, but the low-level implementation details are taken care of
• What we’re really doing is adding another ETL Step to the Snowplow flow
[1] http://cascalog.org/
Cascalog is written in Clojure (JCascalog offers a Java syntax; Scalding is a comparable option in Scala)
It’s easy to run on Amazon EMR – fits in with the Snowplow flow nicely
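To show the level of abstraction, here is a minimal, hypothetical Cascalog 2.x query that counts events per visitor; the events generator and its two-field layout are placeholders rather than part of our real pipeline.

(ns metail.example
  (:require [cascalog.api :refer :all]
            [cascalog.logic.ops :as ops]))

;; Count events per visitor. `events` can be any Cascalog generator
;; (for example an hfs-seqfile tap) emitting [domain-userid event-name].
(defn events-per-visitor [events]
  (<- [?domain-userid ?event-count]
      (events ?domain-userid _)
      (ops/count ?event-count)))

;; Execute the query and print the results:
;; (?- (stdout) (events-per-visitor my-events-tap))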
15. 15
Cascalog – Worth the effort?
Couldn't you achieve the same output working with the events table alone?
…kind of
But there are two key benefits:
1. Breaking the data into a manageable schema means you can directly access the data you care about
2. Complex logic and aggregation are easier to achieve
Real example:
• KPI Data Aggregation
16. 16
Cascalog – KPI Data Aggregation
Value per Visitor: Net Sales Revenue / Visitors
User Conversion: Orders / Visitors
Adoption Rate: Number of users who use Metail / Number of users shown Metail
How do we calculate KPIs from our Snowplow data?
In both the Active and Control groups, we need:
• Visitor Count
• Engaged Visitor Count
• Order Count
• Order Value
17. 17
Cascalog – KPI Data Aggregation
Visitor Count
• Snowplow tracks visitors – our code just has to look up the visitors who are in the test we're measuring
Engaged Count
• We fire a structured event to Snowplow each time an 'engagement' event occurs. For each visitor in the test, our code has to find whether or not they engaged with Metail
Orders
• We encode all of the relevant order information on the page in JSON and fire an unstructured event with the details
Order Count
• Our code needs to find all of the order events in the time period
Order Value
• Our code needs to read each order's value and sum them (see the sketch below)
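A minimal sketch of that order aggregation in Cascalog. The JSON field name (:value) and the two-field layout of the order-events generator are assumptions for illustration.

(ns metail.kpi.orders
  (:require [cascalog.api :refer :all]
            [cascalog.logic.ops :as ops]
            [cheshire.core :as json]))

;; Pull the order value out of the unstructured event's JSON payload.
(defn order-value [ue-json]
  (:value (json/parse-string ue-json true)))

;; Order count and total order value per visitor.
;; `order-events` is assumed to emit [domain-userid unstruct-event-json].
(defn order-totals [order-events]
  (<- [?domain-userid ?order-count ?order-value-sum]
      (order-events ?domain-userid ?ue-json)
      (order-value ?ue-json :> ?value)
      (ops/count ?order-count)
      (ops/sum ?value :> ?order-value-sum)))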
18. 18
Cascalog – KPI Data Aggregation
We can do better!
What we really want is a user-level summary of the data
domain_id engaged order_value order_id ab_group
0014822757d9a81f null 175.89 89281949 out
0015ca5144f0fae7 null null null out
0015dd8901887010 null 310.22 25394849 out
0015e633aa2c158d null null null in
00204e1bcc87b734 null null null out
0042472794f2b57a null 191.98 89392136 in
004389f95e620dd0 null null null out
0044867c3d7b1cf5 null null null out
00456d1e9300296e null null null out
0045dc05b4262ed2 null null null in
0045f74358a842c1 TRUE null null in
00462b685f4188ad null null null out
0048fccbe230dc57 null null null out
0049a5d24498051d TRUE 101.96 27529849 in
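That summary is essentially an outer join of the AB-group, engagement, and order views keyed on the visitor id. Here is a hedged sketch of how it might be written in Cascalog, where the !! (nullable) variables ask for an outer join so visitors without a matching row come through as null; the three input views and their field layouts are assumptions.

(ns metail.kpi.user-summary
  (:require [cascalog.api :refer :all]))

;; Assumed inputs:
;;   ab-group-view emits [domain-userid ab-group]       (one row per visitor in the test)
;;   engaged-view  emits [domain-userid engaged]        (only visitors who engaged)
;;   orders-view   emits [domain-userid order-id value] (only visitors who ordered)
(defn user-summary [ab-group-view engaged-view orders-view]
  (<- [?domain-userid !!engaged !!order-value !!order-id ?ab-group]
      (ab-group-view ?domain-userid ?ab-group)
      (engaged-view ?domain-userid !!engaged)
      (orders-view ?domain-userid !!order-id !!order-value)))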
21. 21
What do we do with the Batch Views?
Take the output and crunch it in R (or Incanter)
A lot of the subsequent analysis we run on our batch views requires statistical packages, so we run our advanced analysis in R.
Thankfully, having the batch views ready has led to far fewer of these:
22. 22
A Looker Ahead
Not everyone can write Cascalog and R.
Looker will open our batch views and Snowplow events to our Business Analysts.
Fashion technology start-up company
Focused on delivering best UX for browsing and buying clothes online
How? – by recognising every body is unique and should be celebrated!
When looking at clothes online, why are we restricted to only seeing how they look on models or mannequins?
Why not on our own bodies?
That is the question we are solving through 2 core technologies:
Body visualisation – a quick and easy way to create your body model online, your MeModel
Garment fit – a low-cost and quick method for digitising clothes
The results? Well you can see for yourself from this slide, which shows a collection of MeModels we have created, wearing different clothes
I'm not going to spend too much time on this slide, but I wanted to give an overview of the kind of data services we provide for our retailers, put together from the data we collect
GA just doesn't give us the level of detail we require. It has its uses, and provides great overviews and visualisations, but drilling into the detail of what a user actually did gets a bit clunky. Funnel analysis never quite cut it for us, especially when it comes to measuring KPIs and billing, where it's really important that the numbers are accurate and correct
Key points to note: we are adding two trackers here, one that sits on the retailer's site and one that sits on our widget.
Because we have the tracker on the retailer's pages, we get a lot more data than a startup of our size might expect
We track everything, send a _lot_ of structured events (fell out of GA), and also use unstructured events where we’ve needed to pass more data
We actually started our Snowplow collection before we really knew what to do with it. No harm getting the tracker on early
MEAP for a mere three years – hopefully Unified Log Processing comes more quickly…
Computing arbitrary functions on arbitrary data
Batch layer – Stores the master dataset and computes arbitrary views
Serving layer - Indexes the batch views and loads them up so they can be efficiently queried to get particular values out of a view. The serving layer is a specialized distributed database that loads in batch views, makes them queryable, and continuously swaps in new versions of a batch view as they're computed by the batch layer.
Speed layer - Takes new data and updates the views based on what it knows, discarding data once it's no longer needed
Robust and fault tolerant
Scalable
General
Extensible
Allows ad hoc queries
Minimal maintenance
Debuggable
Entities we care about
Batch computations are written like single-threaded programs, yet automatically parallelize across a cluster of machines. This implicit parallelization makes batch layer computations scale to datasets of any size. It's easy to write robust, highly scalable computations on the batch layer.
Scale
Remember our KPI slide – I’ve picked out a couple of these and I’m going to talk about how we use Snowplow to capture this data
All of these things would be fairly easy to pull out of the processed Snowplow data – even if it's large. Redshift is good at running these kinds of queries. Combining the numbers returned is not difficult
The problem comes if you present this back to the retailer or your users – there are always follow-up questions, and it's difficult to drill down on this kind of summary data
What kind of items do the users who engaged try on vs what they purchased? Can you tell me which users?
On what days were there the most orders? Can you provide the order_ids so we could check the values at our end?
This is better because we now have the snowplow domain_id. It’s a summary view showing us, for any specific user in the test, which group they were in, did they click on the Metail button, did they make an order and if so how much?
Tying everything back to the user is a great advantage, because any subsequent analysis is much easier to carry out. We join back to the Snowplow events on domain_id.
For users who engaged: what did they try on?
This data has just been run in a batch, so it's ready and waiting for us to start analysis on – it doesn't need to be recomputed all over again
It's also easy to calculate the KPIs I mentioned, and because we have everything at a per-user level, we can perform statistical bootstrapping to look at the distributions and work out error bars on the results (a rough sketch of that follows)
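We do this work in R, but to keep one language in these notes here is a rough percentile-bootstrap sketch in plain Clojure; the 95% interval and the function names are illustrative, not our production code.

;; Illustrative percentile bootstrap in plain Clojure.
(defn bootstrap-ci
  "Resample `values` with replacement `n-resamples` times, apply `statistic`
  to each resample, and return the [2.5th, 97.5th] percentiles of the results."
  [statistic values n-resamples]
  (let [v (vec values)
        resample (fn [] (repeatedly (count v) #(rand-nth v)))
        stats (vec (sort (repeatedly n-resamples #(statistic (resample)))))
        pct (fn [p] (nth stats (int (* p (dec (count stats))))))]
    [(pct 0.025) (pct 0.975)]))

;; e.g. a 95% interval on mean order value per visitor in the active group:
;; (bootstrap-ci #(/ (reduce + %) (count %)) active-order-values 10000)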
I know many of you will never have seen Clojure before and I don’t intend to spend time going through every line, but I wanted to show you that what we’re doing is conceptually very simple
A few lines of code and we've cleared out a huge amount of data we don't need:
Chuck invalid IP addresses
Anything that's not a Struct or an Unstruct event
And we've started to transform it: page URLs become retailers (a sketch of this kind of step is below)
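A hedged sketch of that kind of cleaning step in Cascalog; the field layout, the IP check, and the URL-to-retailer mapping here are simplified assumptions, not the code on the slide.

(ns metail.etl.clean
  (:require [cascalog.api :refer :all]))

;; Crude, illustrative helpers – the real validation and URL -> retailer
;; mapping will be more involved.
(defn valid-ip? [ip]
  (boolean (and ip (re-matches #"\d{1,3}(\.\d{1,3}){3}" ip))))

(defn tracked-event? [event]
  (contains? #{"struct" "unstruct"} event))

(defn page-url->retailer [page-url]
  ;; take the registered domain as the retailer name
  (second (re-find #"^https?://(?:www\.)?([^/:]+)" (str page-url))))

;; Assumed field layout: [domain-userid user-ipaddress event page-url tstamp]
(defn cleaned-events [snowplow-events]
  (<- [?domain-userid ?event ?retailer ?tstamp]
      (snowplow-events ?domain-userid ?ip ?event ?page-url ?tstamp)
      (valid-ip? ?ip)
      (tracked-event? ?event)
      (page-url->retailer ?page-url :> ?retailer)))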
Cascalog takes care of all of the nitty gritty – and running it on Amazon EMR means we can scale it up as we'd like, because we're leveraging MapReduce.
MapReduce – it doesn't matter how big your Snowplow logs are, you can split the data arbitrarily and run Cascalog over it. Every row can be processed independently.