1. Product recommendation and
the Dutch movie world
Let’s link on LinkedIn
https://www.linkedin.com/in/longhowlam
Longhow Lam
Freelance data scientist: Just contact me if you need me :-)
2. RESTAURANT ANALYTICS
(RECSYS: ASSOCIATION RULES MINING)
SELECT APP USERS
(RECSYS: WORD EMBEDDINGS)
DUTCH FILM WORLD
(GRAPH ANALYSIS)
INTRODUCTION
3. INTRODUCTION
You need to learn your whole life!
Data science environments and job titles evolve; they come and go
I once was an “applied statistician”....
Applied statistician
Data miner
Data scientist
ML engineer
AI specialist ??
Tool ??
5. RESTAURANT ANALYTICS
Business pain
I have eaten Chinese, OK nice! But where to eat the next time?
Approach
Look at restaurant reviews and look where the other reviewers went
6. ASSOCIATION
RULES MINING
ALSO CALLED MARKET BASKET ANALYSIS
Identify frequent item sets (rules) in transactional data:
✔ IF items A and B THEN item C {A, B} → {C}
✔ IF items X THEN item Y and Z {X} → {Y, Z}
When is a rule frequent? If the ‘support’ > a threshold
$\text{Support}\{X \to Y\} = \dfrac{\#\,\text{trxs.}\,\{X \to Y\}}{\text{Total } \#\,\text{trxs.}}$
Support
Chips –> Beer 0.823%
Chips –> Milk 0.002%
7. Lift & Confidence
Other statistics used to assess the usefulness of a rule:
$\text{Lift}\{X \to Y\} = \dfrac{\text{Support}\{X \to Y\}}{\text{Support}(X) \cdot \text{Support}(Y)}$
$\text{Conf}\{X \to Y\} = \dfrac{\text{Support}\{X \to Y\}}{\text{Support}(X)}$
Example:
a lift of 8.3 for {Chips} → {Beer} means:
if I know someone has already bought chips, then
they are 8.3 times as likely to also buy beer as an arbitrary customer
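The three statistics can be checked by hand on a handful of baskets. The sketch below is a minimal illustration in plain Python; the baskets are invented toy data, and the helper names `support`, `confidence` and `lift` are assumptions of this example, not part of the original slides.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Conf{X -> Y} = Support{X -> Y} / Support(X)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """Lift{X -> Y} = Support{X -> Y} / (Support(X) * Support(Y))."""
    joint = support(transactions, set(lhs) | set(rhs))
    return joint / (support(transactions, lhs) * support(transactions, rhs))


# Toy baskets, invented for illustration only.
baskets = [
    ["Chips", "Beer"],
    ["Chips", "Beer", "Milk"],
    ["Milk"],
    ["Beer"],
    ["Milk", "Bread"],
]

rule_support = support(baskets, ["Chips", "Beer"])
rule_confidence = confidence(baskets, ["Chips"], ["Beer"])
rule_lift = lift(baskets, ["Chips"], ["Beer"])
```

A lift above 1 says the two sides of the rule occur together more often than independence would predict; a lift of exactly 1 means knowing X tells you nothing about Y.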
9. ASSOCIATION RULES MINING
Transaction data is needed
Transaction ID items
0001 [A, B, C]
0002 [A, X, Z, L]
0003 [X, A]
0004 [K, Q, L]
…. ….
N [A, K, M]
customer ID item
0001 A
0001 B
0001 C
0002 A
0002 X
0002 Z
0002 L
…
N A
N K
N M
For classical rules mining the order of the items is not relevant
Often a time window is chosen
• For example, only transactions of last year (of a customer)
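The two layouts above hold the same information, and going from the long (one row per customer and item) layout to one basket per customer is a simple group-by. A minimal sketch in plain Python, assuming the rows are available as `(customer_id, item)` tuples; the function name `to_baskets` is hypothetical.

```python
from collections import defaultdict

# Long format: one (customer ID, item) row per purchase.
rows = [
    ("0001", "A"), ("0001", "B"), ("0001", "C"),
    ("0002", "A"), ("0002", "X"), ("0002", "Z"), ("0002", "L"),
]


def to_baskets(rows):
    """Group items per customer; a set is used because item order is irrelevant."""
    baskets = defaultdict(set)
    for customer_id, item in rows:
        baskets[customer_id].add(item)
    return dict(baskets)


baskets = to_baskets(rows)
```

A time window would be applied before this step, by filtering the rows on their transaction date.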
10. Two major algorithms
See https://athena.ecs.csus.edu/~mei/associationcw/Apriori.html
Apriori, one of the classic algorithms in data mining
Choose a threshold for support.
First scan the single items and keep those with support > threshold.
Then construct two-item sets with support > threshold.
Then construct three-item sets with support (of every subset) > threshold.
And so on, until no larger item sets pass the threshold.
Finally, construct rules from the item sets with support > threshold.
Item set Support
Butter 0.3
Milk 0.3
Cheese 0.2
Apple 0.15
Pear 0.15
Water 0.001
Item set Support
Butter, Milk 0.25
Milk, Cheese 0.22
Cheese, Apple 0.21
Apple, Pear 0.1
Pear, Butter 0.09
Item set Support
Butter, Milk, Cheese 0.2
Milk, Cheese, Pear 0.22
Cheese, Apple, Water 0.21
Apple, Pear, Milk 0.03
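The level-wise search described on this slide can be sketched in a few lines of plain Python. This is a minimal, unoptimized illustration rather than a production implementation; the function name `apriori`, the toy transactions and the use of `>=` for the threshold comparison are assumptions of this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {item set: support} for all item sets with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    singles = {frozenset([item]) for b in baskets for item in b}
    current = {s: support(s) for s in singles if support(s) >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        previous = set(current)
        # Join step: combine frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in previous for sub in combinations(c, k - 1))
        }
        # Count step: one more scan over the transactions per level.
        current = {c: support(c) for c in candidates if support(c) >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Toy transactions, for illustration only.
transactions = [
    ["Banana", "Jelly", "Pork"],
    ["Banana", "Pork"],
    ["Banana", "Milk", "Pork"],
    ["Eggs", "Banana"],
    ["Eggs", "Milk"],
]
frequent = apriori(transactions, min_support=0.4)
```

The prune step is what keeps the candidate list manageable: an item set can only be frequent if all of its subsets are frequent.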
11. Two major algorithms
See https://athena.ecs.csus.edu/~mei/associationcw/Apriori.html
Apriori, one of the classic algorithms in data mining
Major drawbacks
❌ Generation of item sets can be expensive
(in both space and time)
❌ Support counting can be expensive
12. Two major algorithms
See http://athena.ecs.csus.edu/~mei/associationcw/FpGrowth.html
FP Growth, more efficient and scalable
Sort single items first on descending support,
Drop items with support < threshold
Make a sorted F-list of remaining items
Original transactions of customers
Trx ID Items bought
1 [ Banana, Jelly, Pork ]
2 [ Banana, Pork ]
3 [ Banana, Milk, Pork ]
4 [ Eggs, Banana ]
5 [ Eggs, Milk ]
Item Support
Banana 4 (80%)
Pork 3 (60%)
Milk 2 (40%)
Eggs 2 (40%)
Jelly 1 (20%)
Jelly is dropped
Sorted frequent item list (F-list)
[ B, P, M, E ]
13. Two major algorithms
See http://athena.ecs.csus.edu/~mei/associationcw/FpGrowth.html
FP Growth, more efficient and scalable
Sort items in the transactions based on the previously created F-list
Scan through your transactions to form a Frequent Pattern Tree (FP-Tree)
Create the rules by looking at subtrees of the FP-Tree
[Figure: FP-Tree after the first transaction, after the second transaction, and after all transactions]
Sorted transactions of customers
Trx ID Sorted items
1 [ Banana, Pork ]
2 [ Banana, Pork ]
3 [ Banana, Pork, Milk ]
4 [ Banana, Eggs]
5 [ Milk, Eggs ]
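Building the tree from the sorted transactions amounts to inserting each transaction as a path from the root and incrementing a counter on every node it passes. The sketch below covers only this construction step (no header table, no mining of subtrees); the class `FPNode` and the function `build_fp_tree` are hypothetical names for this illustration.

```python
class FPNode:
    """One node of the FP-Tree: an item, a counter and child nodes."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, f_list):
    rank = {item: position for position, item in enumerate(f_list)}
    root = FPNode(None)
    for t in transactions:
        # Keep only frequent items and order them as in the F-list.
        items = sorted((i for i in set(t) if i in rank), key=rank.get)
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1  # shared prefixes share nodes, so counts add up
            node = child
    return root


transactions = [
    ["Banana", "Jelly", "Pork"],
    ["Banana", "Pork"],
    ["Banana", "Milk", "Pork"],
    ["Eggs", "Banana"],
    ["Eggs", "Milk"],
]
f_list = ["Banana", "Pork", "Milk", "Eggs"]
root = build_fp_tree(transactions, f_list)
banana = root.children["Banana"]
```

Because frequent items come first in every transaction, many transactions share the same prefix, which is why the tree is usually much smaller than the raw transaction list.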
14. IENS RESTAURANT ASSOCIATION RULES MINING / MARKET BASKET ANALYSIS
In Python, use the mlxtend package:
from mlxtend.frequent_patterns import fpgrowth
# df: one-hot encoded DataFrame, one row per transaction and one column per item
fpgrowth(df, min_support=0.0020)
15. IENS RESTAURANT LENGTH TWO RULES A → B
Interactive network
Very generic rules
Lift is not really high
17. IENS RESTAURANT LENGTH THREE RULES A, B → C
Interactive picture
Much more specific, higher lift
19. IENS RESTAURANT VIRTUAL ITEMS: MAKE IT EVEN MORE PERSONAL
Transaction data with customers and items
customer ITEM
1 A
1 X
2 A
2 B
2 C
3 E
3 T
4 S
possible rules
{ A, B } → { C }
{ X } → { Z }
Add customer features as virtual items
possible rules
{ Male, (18, 25], A, B } → { C }
{ Female, (40,45], X } → { Z }
customer ITEM
1 A
1 X
1 Male
1 (18, 25]
2 A
2 B
2 C
2 Male
2 (45, 65]
3 E
3 T
3 Male
4 (30, 35]
4 S
4 Male
4 (30, 35]
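Adding virtual items is just a matter of appending each customer's feature values to that customer's basket before mining. A minimal sketch in plain Python, assuming the age has already been binned into a bracket label; the function name `add_virtual_items` and the two dictionaries are hypothetical.

```python
def add_virtual_items(baskets, features):
    """Return new baskets with each customer's feature values added as items."""
    return {
        customer: set(items) | set(features.get(customer, ()))
        for customer, items in baskets.items()
    }


# Items per customer, and customer features expressed as labels.
baskets = {1: {"A", "X"}, 2: {"A", "B", "C"}}
features = {1: ("Male", "(18, 25]"), 2: ("Male", "(45, 65]")}

enriched = add_virtual_items(baskets, features)
```

The mining algorithm does not distinguish real items from virtual ones, so rules such as { Male, (18, 25], A, B } → { C } fall out of the same procedure.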
20. A FEW FACTS… IENS DATA (TRADITIONAL BI)
Most frequently occurring restaurant name (39 times)
Among Dutch restaurants (6 times)
% Sustainable kitchens
Organic (67%)
French (58%)
Fish (44%)
Vegetarian (39%)
…
…
…
Chinese (3%)
700 reviews on a “normal” Saturday
Valentine’s Day 2015 had 1200 reviews (1.7 times as many)
23 times
12 times
22. SELECT CERTAIN APP USERS
BUSINESS ISSUE
Which of my app users should I select that are ‘interested’ in SLIPPERS?
APPROACH
Use word-embeddings to map each user-id and article number to a (high-dimensional) vector
AVAILABLE DATA
App session and event data
-----------------------------------
|user_id |time |product_viewed |
-----------------------------------
| A | 2 | AX1234 |
| A | 3 | AW3456 |
| A | 4 | XY1234 |
| B | 1 | PO2345 |
| B | 2 | ZX3214 |
| C | 3 | KL1234 |
| .. | .. | ... |
| .. | .. | ... |
-----------------------------------
23. Word2vec Methodology
DATA PREP on SPARK because of the size:
Filter out “non-interesting” events
Aggregate the data on user level
Put articles viewed on the app in a list
Put the user id in ‘the middle’
So each user has their own ‘document’ (text), with the article numbers and the user ID as its tokens (words).
Now the data is small enough to handle in ‘normal’ Python.
------------------------------------------------------------
| id | text |
------------------------------------------------------------
| A | [ EE5499, FX8912, A, FW4567, AB3499 ] |
| B | [ HP9823, B] |
| C | [ AB9812, PO1299, UK6712, AW9912, SE8932, C.....] |
| D | [ OK3423, SZ8676, D, LK9712] |
------------------------------------------------------------
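The aggregation step can be illustrated on a few rows of event data. The original runs on Spark because of the data size; the sketch below uses plain Python instead, and the rule for where "the middle" is (integer division of the list length) is an assumption of this example, as is the function name `build_documents`.

```python
from collections import defaultdict

# (user_id, time, product_viewed) rows, as in the app event table.
events = [
    ("A", 2, "AX1234"), ("A", 3, "AW3456"), ("A", 4, "XY1234"),
    ("B", 1, "PO2345"), ("B", 2, "ZX3214"),
    ("C", 3, "KL1234"),
]


def build_documents(events):
    """One token list per user: viewed articles in time order, user ID in the middle."""
    per_user = defaultdict(list)
    for user_id, _time, product in sorted(events, key=lambda e: (e[0], e[1])):
        per_user[user_id].append(product)

    documents = {}
    for user_id, products in per_user.items():
        middle = len(products) // 2  # assumed definition of 'the middle'
        documents[user_id] = products[:middle] + [user_id] + products[middle:]
    return documents


documents = build_documents(events)
```

Placing the user ID in the middle puts it within the context window of as many of that user's articles as possible.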
24. Word2vec Methodology
WORD EMBEDDINGS
Predict the target word w(t) from the surrounding words w(t-1), w(t-2), … and w(t+1), w(t+2), …
This is the so-called Continuous Bag of Words (CBOW) model.
We are not interested in the prediction itself;
the weights we get per word in the vocabulary are what we want.
Normal text / document:
[ w(t-4) w(t-3) w(t-2) w(t-1) w(t) w(t+1) w(t+2) w(t+3) w(t+4) w(t+5) ]
[ Steffy from Germany is laughing very loud and is happy ]
[Figure: the embedding vector of one word, e.g. (0.123, 0.672, 0.123, …, 0.452, 0.512)]
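To make the CBOW setup concrete, the sketch below lists the (context, target) training pairs for the example sentence. It only shows how the pairs are formed, not the neural network that is trained on them; the function name `cbow_pairs` and the window size of 2 are assumptions of this example.

```python
def cbow_pairs(tokens, window=2):
    """For every position t, pair the surrounding words with the target word w(t)."""
    pairs = []
    for t, target in enumerate(tokens):
        left = tokens[max(0, t - window):t]      # w(t-window) .. w(t-1)
        right = tokens[t + 1:t + 1 + window]     # w(t+1) .. w(t+window)
        pairs.append((left + right, target))
    return pairs


sentence = "Steffy from Germany is laughing very loud and is happy".split()
pairs = cbow_pairs(sentence, window=2)
```

For the target word "laughing" the context is ["Germany", "is", "very", "loud"]; at the edges of the sentence the context is simply shorter.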
25. Word2vec Methodology
User app sessions
[ AB54321, CY3461, AW97541, USER_ID, PX91234, KL70123 ]
Now the ‘texts’ or ‘documents’ are just collections of article IDs and user IDs
[Figure: example embedding vectors, e.g. (0.123, 0.672, 0.123, …, 0.452, 0.512), (0.253, 0.727, 0.513, …, 0.952, 0.318) and (0.253, 0.527, 0.714, …, 0.612, 0.219)]
26. Word2vec Methodology
Predict the target word from its surrounding words with the so-called
Continuous Bag of Words (CBOW) model
27. PRODUCTS & APP USERS
Every product and app user_id is now a high-dimensional embedding.
We can use UMAP to project the embeddings onto 2D or 3D space for visualization.
So every dot in the scatterplot is either a product or a user.
28. PRODUCTS & APP USERS
Articles are also high-dimensional embeddings in the same space,
so we can calculate distances (or similarities).
[Figure: Adidas bag articles]
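Because articles and users live in the same vector space, finding interested users comes down to a nearest-neighbour lookup around an article vector. The sketch below uses cosine similarity in plain Python on invented two-dimensional vectors; real embeddings would have many more dimensions, and every name except the article number DY2562 is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, embeddings, user_ids, top_n=3):
    """Return the top_n closest vectors to `query`, split into articles and users."""
    scored = sorted(
        ((cosine(embeddings[query], vec), key)
         for key, vec in embeddings.items() if key != query),
        reverse=True,
    )[:top_n]
    articles = [key for _, key in scored if key not in user_ids]
    users = [key for _, key in scored if key in user_ids]
    return articles, users


# Invented toy embeddings, for illustration only.
embeddings = {
    "DY2562": [1.0, 0.0],
    "AB1111": [0.9, 0.1],
    "ZZ9999": [0.0, 1.0],
    "user_A": [0.8, 0.3],
    "user_B": [-1.0, 0.2],
}
user_ids = {"user_A", "user_B"}

articles, users = nearest("DY2562", embeddings, user_ids, top_n=2)
```

The users returned here are the candidates for a marketing selection; the articles returned are candidates for "similar product" recommendations.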
31. Streamlit app
Streamlit: an easy Python package for creating simple interactive dashboards
For the marketeer:
Enter an article number: the Adidas slipper!
The closest vectors are displayed
Those vectors are split into:
Articles
Users
32. Streamlit app
Another example:
Enter an article number, say DY2562
The closest vectors are displayed
Those vectors are split:
Articles
Users
33. The Dutch movie world
in a graph
You know nothing about Dutch Actors
and Actresses. But you want to know:
“Who is playing with whom in a movie?”
34. GRAPH BASICS
A FEW BASIC TERMS
Node or Vertex: a point in the network
✔ can have different attributes
✔ e.g., different colors or sizes of nodes
Edge or Link: a relation between two nodes
✔ can be directional and have attributes
✔ e.g., arrowed, colored and sized
[Figure: example graph with nodes A to G]
35. GRAPH BASICS
A FEW TERMS
Node Centrality: how central is a node?
* Degree (number of connections)
* Betweenness (number of shortest paths through a node)
* Eigenvector centrality (Google’s PageRank is a variant of this)
Community detection: are there nodes that belong together?
[Figure: example graph with nodes 1 to 12, annotated with node degrees]
Nodes 6 and 3 have the same degree,
but node 6 has a higher betweenness than node 3
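The difference between degree and betweenness shows up even on a very small graph. The sketch below uses networkx on a five-node path (0 - 1 - 2 - 3 - 4), which is a toy graph chosen for this example and not the graph from the slide.

```python
import networkx as nx

# Toy graph: a simple path 0 - 1 - 2 - 3 - 4.
G = nx.path_graph(5)

degree = dict(G.degree())
# normalized=False returns raw counts of shortest paths through each node.
betweenness = nx.betweenness_centrality(G, normalized=False)
```

Nodes 1 and 2 both have degree 2, but node 2 sits in the middle of the path and therefore lies on more shortest paths than node 1.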
36. WWW.IMDB.COM INTERNET MOVIE DATABASE
Download movie data:
✔ Dutch movies in the last 25 years
✔ Per movie we know the cast and crew
✔ A node is a person
✔ Node X links with node Y if X and Y were in the same movie
In R, use the igraph library; in Python, use the networkx package
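The linking rule (X and Y are connected if they worked on the same movie) translates into generating all pairs of people per movie. A minimal sketch in plain Python; the movie titles and person names are placeholders, and the function name `co_occurrence_edges` is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Cast and crew per movie; placeholder names for illustration only.
movies = {
    "Movie 1": ["Person A", "Person B", "Person C"],
    "Movie 2": ["Person A", "Person B"],
    "Movie 3": ["Person C", "Person D"],
}


def co_occurrence_edges(movies):
    """Count, for each pair of people, the number of movies they share."""
    edges = Counter()
    for people in movies.values():
        # sorted() makes (x, y) and (y, x) the same undirected edge.
        for x, y in combinations(sorted(set(people)), 2):
            edges[(x, y)] += 1
    return edges


edges = co_occurrence_edges(movies)
```

The count per pair can serve as an edge attribute, for example to draw thicker lines between people who worked together more often.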
37. DUTCH MOVIE WORLD IN A NETWORK GRAPH
Interactive graph
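Data frame edges
| Node_1          | Node_2           | Attr_1 | Attr_2 |
| Chantal Jantzen | Stef Tijding     | 12     | A      |
| Hans de Wolf    | Jeroen Krabee    | 3      | A      |
| Johan Nijenhuis | Frans van Gestel | 5      | B      |
| …               | …                | …      | …      |
Data frame nodes
| node_id         | Attr_1  | Attr_2 |
| Chantal Jantzen | Actress | 45     |
| Hans de Wolf    | Writer  | 65     |
| Rutger Hauer    | Actor   | 73     |
| …               | …       | …      |
## create graph (visNetwork expects an `id` column in nodes and `from`/`to` columns in edges)
library(visNetwork)
visNetwork(nodes, edges)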
39. COMMUNITIES
There are 1257 persons
They are divided into 191 communities
Take community 6:
54 persons in a word cloud (centrality-based)
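Community detection can be tried out on a toy graph with networkx. The slides do not say which algorithm was used; the sketch below uses greedy modularity maximization purely as an example, on an invented graph of two tight groups joined by a single edge.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy graph: two triangles connected by one bridge (C - D).
G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("B", "C"), ("A", "C"),  # first tight group
    ("D", "E"), ("E", "F"), ("D", "F"),  # second tight group
    ("C", "D"),                          # single bridge between the groups
])

communities = [set(c) for c in greedy_modularity_communities(G)]

# Centrality within one community, e.g. to size the names in a word cloud.
first = G.subgraph(communities[0])
centrality = nx.degree_centrality(first)
```

On this graph the two triangles are recovered as separate communities; within a community, a centrality score per person gives a natural weight for a word cloud.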
40. Thanks for your time! Questions?
Need me as a freelancer? Let’s have a cup of coffee
https://www.linkedin.com/in/longhowlam
https://longhowlam.wordpress.com/
@longhowlam