This document summarizes work on an improved player ranking system for the strategy video game Populous: The Beginning. Various machine learning models and techniques were tested to rank players more accurately and to predict the outcomes of future games from gameplay data. The best performing single change was a weighted Elo rating system that scaled the K update using features like "shamans killed" for the winning team, improving prediction accuracy over baseline Elo by around one percentage point (0.740 to 0.750 on the full dataset). Experiments were also conducted to measure how useful different types of games are for improving rankings.
1. Team
Keilin Bickar
Ayush K Singh
Monisha Singh
Populous Player Rankings
Machine Learning Course Project
CS 6140 Spring ‘16
Instructor
Lu Wang
2. Problem Description
Populous: The Beginning is a strategy, god-style video game where teams of players create settlements and battle to destroy the opposition.
People play 1 vs 1 and 2 vs 2 games via a matchmaking service.
Fair games are fun games, so accurate rankings are needed to make fair matches.
3. Problem Description - Goal
● Create league table and rankings based on game results
● Predict the results of future games based on rankings
● Improve accuracy of game predictions
4. Problem Description - Inputs
Data from “Populous: Reincarnated” game database
- games.csv
- One row per game played
- General information concerning game such as map and number of players
- game_details.csv
- One row per player per game played (two to four per game)
- Stats for how well a player did such as enemy buildings destroyed and fights lost
- users.csv
- One row per user, not much besides name
- game_pops.csv
- Populations of each player taken every 15 minutes of each game
6. Problem Description - Outputs
● Ranking of players ordered by skill level
● Point value assigned to each player to be used for predictions
● Ranking system that can iteratively rank new inputs
Rank | Name    | Points
#1   | Alice   | 64
#2   | Bob     | 53
#3   | Charlie | 29
#4   | Dan     | 15
7. Related Work
Traditional Rating System
● Players on winning team gain 1 point
● Ratings of the losing team remain unaffected
● This method is currently used in Populous
● Only requires winners to report games
Simple Ranking
● Players on winning team gain 1 point
● Players on losing team lose 1 point
● Intuitive system, widely used
8. Related Work
Elo Rating System
● Rating process takes into account prior ratings of players
● Subtracts X points from loser and gives X points to winner
● Very widely used for 1 vs 1 matchups such as Chess
● Update calculations are very fast
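The per-game Elo update behind these bullets can be sketched as follows (a minimal 1 vs 1 version with the common K=32 default; the project's K values and 2 vs 2 extension differ):

```python
# Minimal Elo update for a decisive 1 vs 1 game; K=32 is a common default.
def expected_score(rating_a, rating_b):
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(winner, loser, k=32):
    """Move K * (1 - expected winner score) points from loser to winner."""
    delta = k * (1 - expected_score(winner, loser))
    return winner + delta, loser - delta

elo_update(1500, 1500)  # evenly matched: winner gains 16, loser drops 16
```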
Glicko2 Rating System
● Rating process takes into account prior ratings of players and their experience
● Stores 𝜇 (rating), 𝜎 (volatility), and 𝜙 (rating deviation) for each player
● More recently created and used in 1 vs 1 matchups
9. Related Work
TrueSkill Rating System:
● Rating process uses Bayesian inference to compare two team distributions
● Every time a player plays a game, the system accordingly changes the perceived skill of the player
and acquires more confidence about this perception
● Stores 𝜇 (rating) and 𝜎 (uncertainty) for each player
● The extent of actual updates depends on how "surprising" the outcome is to the system
● Designed to support teams of variable size along with ties; update cost grows polynomially with the number of players
● Assumes team performance is the sum of performance of the players
● Developed by Microsoft Research and used in Xbox matchmaking system
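A simplified, self-contained sketch of a TrueSkill-style 1 vs 1 update (ignoring draws and the full factor-graph machinery; constants follow the common mu=25, sigma=25/3, beta=25/6 defaults, not Microsoft's implementation):

```python
import math

def _pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def _cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def trueskill_1v1(winner, loser, beta=25 / 6):
    """One TrueSkill-style update for a decisive 1 vs 1 game.

    winner/loser are (mu, sigma) pairs. The winner's mu rises, both sigmas
    shrink, and the size of the change depends on how surprising the
    outcome was given the prior ratings.
    """
    (mu_w, sig_w), (mu_l, sig_l) = winner, loser
    c = math.sqrt(2 * beta ** 2 + sig_w ** 2 + sig_l ** 2)
    t = (mu_w - mu_l) / c
    v = _pdf(t) / _cdf(t)      # additive correction to the means
    w = v * (v + t)            # multiplicative shrinkage of the variances
    def upd(mu, sig, sign):
        return (mu + sign * sig ** 2 / c * v,
                sig * math.sqrt(max(1 - sig ** 2 / c ** 2 * w, 1e-9)))
    return upd(mu_w, sig_w, +1), upd(mu_l, sig_l, -1)
```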
10. Our Work
Address shortcomings and find optimal model
● TrueSkill suffers from “rich get richer” problem with unbalanced teams (auto rating boost)
● TrueSkill handles ties in a naive fashion ignoring the complexity of the system
● We experimented with different models:
○ numerous values of K for Elo
○ same and separate feature factors in Weighted Elo
○ swapped the constant used to standardize the logistic function in Glicko
○ selected TrueSkill and further experimented with Weighted TrueSkill
● Find the best model, use it to rate players and predict result of future games
11. Methodology - Preprocessing
● Data contained around 300,000 games
● Removed games with irregularities, e.g. players crashing
● Removed games with incomplete data, e.g. 4-player games with data from only 3 players
● Some games had spectators but were still valid; these were converted from 4-player games with 2 spectators to 2-player 1 vs 1 games
12. Methodology - Preprocessing
● Mixture of 1v1, 2v2, 1v3, etc - stripped everything but 1v1 and 2v2
○ Unbalanced games hard to rate in complex ranking systems
○ Skills for 3 vs 1 game don’t translate to 1 vs 1 or 2 vs 2 games
● 136k remaining games:
○ 50k - 1 vs 1 games
○ 86k - 2 vs 2 games
● 3 datasets stored separately for faster loading to run experiments
○ Disk IO was the main contributor to load times, so smaller sets were better
● After preprocessing, prediction accuracy increased from 69% to 76%
13. Experiments: Models and Features
● Ranking System
○ Traditional Ranking
○ Simple Ranking
○ Elo rating - Modified to support 2v2
○ Glicko2 Rating - Modified to support 2v2
○ TrueSkill Rating
● Features
○ Feature selection based on Info Gain, Gain Ratio, and Correlation Feature Selection
○ Feature weights based on the Perceptron learning algorithm, SMO, Multilayer Perceptron with backpropagation, and Logistic Regression
14. Experiments: Evaluation metrics
Ranking
● Traditional, Simple, and Elo systems use their native Points
● Glicko and TrueSkill use:
○ Points = 𝜇(rating) - 3𝜎(uncertainty)
● Players sorted by Points highest to lowest
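The conservative Points formula can be illustrated as follows (player names and values are hypothetical):

```python
# Conservative skill estimate used to sort Glicko2/TrueSkill players: the
# displayed Points is the rating minus three rating deviations, so new,
# high-uncertainty players start near the bottom of the league table.
def points(mu, sigma):
    return mu - 3 * sigma

# A stable rating can outrank a higher but more uncertain one.
players = {"Alice": (30.0, 1.0), "Bob": (33.0, 6.0)}
table = sorted(players, key=lambda name: points(*players[name]), reverse=True)
# Alice (30 - 3 = 27) ranks above Bob (33 - 18 = 15)
```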
15. Experiments: Evaluation metrics
Future Game Predictions
● Winning team predicted by selecting team with higher sum of Points
● Accuracy of Predictions
○ Accuracy = Correct Predictions / Total number of instances
○ Ratings are computed iteratively, so every game serves as test data before being trained on
○ Order matters, so cross-validation cannot be used
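The evaluation loop above can be sketched as (with `points` and `update` standing in for any of the rating systems; the names and signatures are illustrative):

```python
# Iterative evaluation: predict each game from the ratings learned so far,
# then train on it, so every game serves as test data exactly once and the
# chronological order of games is preserved.
def iterative_accuracy(games, points, update):
    """games: list of (team_a, team_b, winner); points: name -> Points;
    update: per-game rating update function (hypothetical signature)."""
    correct = 0
    for team_a, team_b, winner in games:
        pred = team_a if (sum(points[p] for p in team_a)
                          >= sum(points[p] for p in team_b)) else team_b
        correct += (pred == winner)
        update(points, team_a, team_b, winner)  # learn from the game afterwards
    return correct / len(games)
```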
16. Experiments: Baselines
Prediction Accuracies
League      | Full (136,144) | 1 vs 1 (50,134) | 2 vs 2 (86,010)
Traditional | 0.678164296627 | 0.683827342722  | 0.66064411115
Simple      | 0.64075537666  | 0.651813140783  | 0.643448436228
Elo         | 0.739577212363 | 0.718574221087  | 0.737774677363
Glicko      | 0.72718592079  | 0.714664698608  | 0.714463434484
TrueSkill   | 0.756955870255 | 0.742968843499  | 0.758051389373
17. Experiment - Weighted Elo
● Elo is close in score to TrueSkill, but runs much faster
● Uses “K” value to decide how many points to move between teams
● K was weighted based on features in game details
● Features and factors were selected one at a time by increasing/decreasing each factor until accuracy reached a maximum
● Weighting was capped to prevent small/large values from exploding ranks
18. Experiment - Weighted Elo
● Tested raw feature vs. ratio of winning team/losing team
○ Ratio was better
● Tested inverting ratio for winning/losing and losing/winning
○ Mixed results
● Tested adding factors to K vs. multiplying K by factors
○ Multiplying worked better
● Tested assigning different weight for winners and losers
○ Improved accuracy!
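One way the multiplicative, capped K weighting described above might look (the feature ratio, the 0.1 factor, and the 0.5-2.0 cap are illustrative assumptions, not the project's tuned constants):

```python
# Weighted-Elo K sketch: K is multiplied by a factor derived from the
# winners' / losers' ratio of a game-detail feature (e.g. shamans_killed),
# and the resulting weight is capped so extreme games cannot explode ranks.
def weighted_k(base_k, winner_stat, loser_stat, factor=0.1, lo=0.5, hi=2.0):
    ratio = winner_stat / max(loser_stat, 1)   # guard against division by zero
    weight = min(max(1 + factor * (ratio - 1), lo), hi)
    return base_k * weight
```

The weighted K then replaces the fixed K in the normal Elo point transfer, with separate factors possible for the winning and losing teams.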
19. Experiment - Weighted Elo
Notable improvements in score
Best feature was “shamans_killed” of winning team
League       | Full (136,144) | 1 vs 1 (50,134) | 2 vs 2 (86,010)
Baseline Elo | 0.739577212363 | 0.718574221087  | 0.737774677363
Weighted Elo | 0.750146903279 | 0.725136633821  | 0.749098941983
20. Experiment - Weighted TrueSkill
● TrueSkill starts out more accurate than Elo
● Has a built-in weight for an update, ranging from 0.0 to 1.0
● Using same feature/factor as Elo resulted in negligible improvements
● Running the same process to find new features/factors also resulted in negligible improvements
21. Experiment - Weighted TrueSkill
Tested skewing the results to give more weight to the player doing the most work in 2 vs 2 games. Accuracy of the top four features after weighting:
Feature             | Score
Unweighted          | 0.758051389373
followers_killed    | 0.758783862342
fights_won          | 0.758714103011
shamans_killed      | 0.758109522149
buildings_destroyed | 0.757923497268
Results overall were positive, but small.
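One simple way to apply a 0.0-1.0 update weight is to blend the pre- and post-game ratings; this is a generic sketch, not the TrueSkill library's internal partial-play mechanism:

```python
# Blend a fully updated rating back toward the previous one; weight=1.0
# applies the full update, weight=0.0 keeps the old rating unchanged.
def partial_update(old_mu, new_mu, weight):
    return old_mu + weight * (new_mu - old_mu)

partial_update(25.0, 29.0, 0.5)  # halfway between old and new: 27.0
```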
22. Experiment - Value of Games
Each game has a hidden property that is hard to calculate: how helpful the game is for ranking players.
● Tested 1 vs 1 games using Elo (for speed)
● Removed one game from the set at a time and compared accuracy to the baseline
● Resulting change very small, but enough to see positive/negative
● Values normalized and stored as boolean
● Can run algorithms to classify games based on value
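The leave-one-out value computation can be sketched as (with `run_elo` a hypothetical helper that runs the Elo system over a game list and returns prediction accuracy):

```python
# A game's value is whether removing it from the run hurts accuracy:
# True means the game was helpful for ranking (dropping it lowered accuracy).
def game_values(games, run_elo):
    baseline = run_elo(games)
    return [run_elo(games[:i] + games[i + 1:]) < baseline
            for i in range(len(games))]
```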
24. Experiment - Value of Games
Tested weighting 1 vs 1 games in TrueSkill using the values found; this shows some positive results:

1v1 Base                 | 0.742968843499
Weighting on fights_lost | 0.743287988192

Tested weighting all games in TrueSkill using the values found; results against the full dataset were slightly negative:

All Base                              | 0.756955870255
Weighting on fights_lost              | 0.756698789517
Weighting on fights_lost only for 1v1 | 0.756882418616
25. Results
● Experimenting with different parameters yielded only small quantitative accuracy gains
● Overall we were able to predict outcome 8% better than the traditional system
● Found some features to be more important in the gameplay than others
● The model takes all priors into account, so a player's first game is also part of their rating
Thank You!