Playing with Statistics

Using statistical models to playtest games.
by Jack Bennett on 4 December 2012


Transcript of Playing with Statistics

Playing with Statistics
Playtesting Games Using Statistical Models

What sort of games?

Board games such as The Settlers of Catan, Ticket to Ride, or Monopoly.
Card games such as Hearts, Bridge, or Tichu.
Role-playing games such as Dungeons & Dragons.
Not video games.

Why games?

Entertainment

Websites such as BoardGameGeek and ConsimWorld host databases of games, reviews, and forums with hundreds of thousands of active users. Everyone likes to play games! Game stores across the world hold weekly gaming nights and prize-supported tournaments, and offer a place to try, buy, and play games. Conventions are held across the globe covering all aspects of gaming from design to retail.

There are over 80 gaming conventions held each year in the US alone. The biggest is Gen Con Indy, held every year in Indianapolis, with more than 40,000 gamers attending over 4 days.

Simulation

"Gamification" is a key buzzword in many industries today.

By using game mechanics, experts can create simulations of environments which can be used as learning tools. One research center of the National Defense University is the Center for Applied Strategic Learning (CASL).
Their mission: To enhance the strategic decision making and critical thinking capabilities of military and civilian leaders from the United States and other countries through strategic level experiential opportunities that address the complexity of the evolving and interlinked international, national and local security environments.

CASL holds regular lectures and round table discussions on the design and use of simulations (some on boards such as the one above). They even have a demo game you can play on their website where the player takes the role of the National Security Adviser.

Simulation games can also be used to model everyday situations or environments to allow people to see how their choices and actions affect those environments. For example, a simulation of the day-to-day workings of a proposed office building can allow those involved to see how choices such as the floor plan, layouts, and positioning will affect flow through the building, energy usage, waste management, and other modelable outcomes.

Why playtest?

In order for a lesson to be learned from a simulation, the outcomes for that simulation must be accurately modeled.
In order for players to enjoy playing a game, the game must be balanced such that winning is based on the criteria the game sets forth (such as knowing more trivia or understanding pot odds and probability).
In both cases, games must not have exploitable strategies, or "broken" mechanics.

How are they playtested?

The designer creates prototypes and plays "mock" games to find glaring holes.
Games are played hundreds of times by the designer and a small group of playtesters.
Then, games are passed to other players to do "blind" playtesting where the designer is not involved.
Feedback is collected, often as summaries written by playtesters, and incorporated.
Game published!

Scientific analysis of collected data is not often done, especially when games are tested and released by small or independent game companies only a handful of people large. Expensive games, released by the largest of publishers, designed by the most respected of designers, are still often offered to the public with balance issues.

Incorporating the responses of the public gives a much larger sample size, and often problems with the game are corrected with online errata or in the second edition printing of the game.

If balance issues could be explored more thoroughly before the release of the game, these issues might be more avoidable.

Playtesting using statistical analysis

Decide on playtesting questions that need to be answered.
Collect data on plays of the game, including decisions or events related to the questions, as well as the outcomes.
Build a statistical model to test what the effects of certain game actions are on the final score or probability of winning.

Monopoly is licensed in 111 countries and printed in 43 languages.

Solitaire game.
The player takes on the role of Inspector Moss in a mansion full of clues and suspects.
The player has 45 "minutes" to spend searching the mansion, collecting assistance items, and finding clues.
Through clever placement of the clues on the board, the player can eliminate suspects. When 9 of the 10 suspects are eliminated, the player must arrest the final suspect and escort him or her to the police car.
Successfully arresting the culprit results in a win, while running out of time, or accidentally ruling out all 10 suspects, results in a loss.

Playtest Questions

There are 10 possible assistance items that a player may choose to collect during the game. What sort of effect does collecting these items have on the probability of a player winning?
The "Key" and the "Lucky Break" items are thought to be very powerful. Does collecting these items give too large a benefit?
The "Bomb" and the "Informant" items are thought to be too weak. Does collecting these items give any benefit towards winning?

Two-player game.
Each player takes on the role of a candidate for the office of the president of the United States.
Players take turns rolling dice and deciding whether or not to push their luck to earn as many Campaign Points as possible.
Players compete in a debate through a bidding mechanic to earn extra Campaign Points.
Points are applied to states. The candidate with the most points in a state at the end of the game earns that state's electoral votes. The player with 270 or more votes wins.

Playtest Questions

One player is randomly assigned to take the first turn. Does that player gain an unfair advantage?
One player will get a benefit from winning the debate. Does that benefit affect how many electoral votes a player earns?
Some states are worth many electoral votes. The game is a competition over winning these states. Is that reflected in the game?
Some states aren't worth very many electoral votes. Do these states have an effect on the game at all?

Solitaire game.
The player heads a team of cowboys attempting to drive their herd of cattle towards the rail station.
There are two forks in the road, with different benefits and hurdles down each fork. The player decides which to take.
The player is competing against other cattle teams to beat them to the rail station in order to secure the best prices.
Butch Montgomery, leader of a gang of cattle rustlers, is also on the player's tail and attempting to steal cattle.

Playtest Questions

If a competing team beats the player to the station, a slight penalty is incurred. Does that penalty have an effect on the final score?
Although it can be difficult to accomplish, the player might kill Butch Montgomery during the game and earn a small bounty. Does this have an effect on the final score?
There are two forks in the road. For each of them, is taking one path different from taking the other in terms of the player's final score?

Solitaire game.
The player is attempting, through dice rolls and choices, to activate the six "constructs," components of the Utopia Engine before time runs out.
Each activated "construct" gives a benefit to the player, and the order they attempt to find and activate them is up to the player.
If the player wishes to spend the time, it is possible to find legendary artifacts that can aid in winning the game.
If the player activates all six constructs, forms them to make the Utopia Engine, and turns the Engine on before time runs out, the player wins. If time does run out, or the player dies from injuries sustained during exploring or construction, the player loses.

Playtest Questions

Does collecting a legendary artifact have an effect on the probability of winning the game?
Is there a certain order in which the player should activate the "constructs" in order to increase the probability of winning the game?

Data Collection

In order to answer the playtesting questions, a list of the variables thought to be related was created for each game.

An online survey was generated for each of the four games, and offered to the public online and at a local game store. Free copies of the games were given out to people willing to fill out the survey (although all four games are already free online).

In general, this was difficult data to collect as it required much more than just filling out a survey. The surveys were active for two months. In that time, each game received about 30 plays on average. The analysis for all four games would benefit greatly from more data.

Exploratory Plots

The Model

Instantly there are problems. There are signs of non-constant variance as well as non-normality. However, the biggest problem is that an overall test of this model against the null model with no effects has a p-value of .869. There is nothing going on here.

Logically it makes sense for items to have a squared term in the model, and this histogram shows what could be a possible effect.

However, including that in the model still gives us an overall test p-value of .53 against the null model.

We are not modeling the variables that influence winning a game of Inspector Moss.

Linear Discriminant Analysis

Since our data is all in one of two groups (wins and losses) we could try a linear discriminant analysis to help classify data.

We create a linear combination of features to serve as a linear classifier. If we test it on the data, we get a misclassification rate of 42.31% (or 34.62% if including a squared Items term).

However, creating the classifier and testing it on the same data introduces bias. If we test it on our data using leave-one-out cross-validation, we get a misclassification rate of 65.38% (or 53.85% if including a squared Items term). Our model is actually predicting wins worse than a coin toss.

Back to the playtest questions

There are 10 possible assistance items that a player may choose to collect during the game. What sort of effect does collecting these items have on the probability of a player winning?

With the data that we have, it looks like not much. With more data it might be that an effect for Items or Items squared begins to emerge, but with what we have we cannot say there is any effect.
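
The gap between the two misclassification rates quoted above comes from how the classifier is tested. The following is a minimal sketch, not the analysis used in the talk: it fits a two-class linear discriminant on a single feature and reports both the resubstitution error (tested on the training data) and the leave-one-out error. The data, the single "items collected" feature, and all function names are hypothetical, and the sketch assumes each class keeps at least one play and some spread after a play is left out.

```python
import math

def fit_lda_1d(xs, ys):
    """Fit two-class LDA on one feature: class means, priors, pooled variance."""
    stats, pooled_ss = {}, 0.0
    for label in (0, 1):
        vals = [x for x, y in zip(xs, ys) if y == label]
        mean = sum(vals) / len(vals)
        pooled_ss += sum((v - mean) ** 2 for v in vals)
        stats[label] = (mean, len(vals) / len(xs))
    return stats, pooled_ss / (len(xs) - 2)

def predict_lda_1d(model, x):
    """Return the class (0 or 1) with the larger linear discriminant score."""
    stats, var = model
    def score(label):
        mean, prior = stats[label]
        return x * mean / var - mean ** 2 / (2 * var) + math.log(prior)
    return max((0, 1), key=score)

def resubstitution_error(xs, ys):
    """Error rate when the classifier is tested on its own training data."""
    model = fit_lda_1d(xs, ys)
    return sum(predict_lda_1d(model, x) != y for x, y in zip(xs, ys)) / len(xs)

def loocv_error(xs, ys):
    """Error rate when each play is predicted by a model fit without it."""
    wrong = 0
    for i in range(len(xs)):
        model = fit_lda_1d(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        wrong += predict_lda_1d(model, xs[i]) != ys[i]
    return wrong / len(xs)

# Hypothetical plays: items collected and win (1) or loss (0). Not the survey data.
items = [2, 3, 3, 4, 5, 5, 6, 7, 4, 6]
won = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0]
print("resubstitution:", resubstitution_error(items, won))
print("leave-one-out: ", loocv_error(items, won))
```

The leave-one-out figure is the fairer estimate because no play is ever predicted by a model that has already seen it.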

This isn't necessarily all bad. There is much that goes into a winning game of Inspector Moss that is difficult to measure. Players make decisions regarding their dice, tile placement, and clue strategy. The design intention is that items be one small part of many things that influence a win. And more data is needed to detect smaller effects.

The "Key" and the "Lucky Break" items are thought to be very powerful. Does collecting these items give too large a benefit?
The "Bomb" and the "Informant" items are thought to be too weak. Does collecting these items give any benefit towards winning?

Including information about these four items did nothing at all in helping us to predict a win. No individual item is powerful enough to make or break the game.

Recommendations to the designer

Collect more data and try this again. Given the data presented here, there is no statistically significant effect of collecting items.
Although item collection is only one small part of a winning strategy, given this data there's no reason to do it at all. Making them slightly easier to collect would mean more of them would be gained and their effect could be more pronounced without overpowering the rest of the strategy.
Don't worry about the power of individual items. No one item seems to have a measurable positive or negative effect on winning.

Exploratory Plots: First, Debate

The Model

Diagnostics here look good. Constant variance and normality don't look to be an issue. Multicollinearity, measured by variance inflation factors, is not a problem. There are no outliers or leverage points. A test of the overall model gives a p-value of <0.001.
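
An overall test like the one above compares the fitted model with an intercept-only model, and the same F statistic compares any pair of nested linear models. The sketch below is an illustration under stated assumptions rather than the talk's own code: it fits ordinary least squares by solving the normal equations in plain Python, the function names and data are hypothetical, and it returns only the F statistic and its degrees of freedom. Turning that into a p-value needs the F distribution from a statistics package.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def rss(columns, y):
    """Residual sum of squares of an OLS fit with an intercept plus `columns`."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

def nested_f(full_cols, reduced_cols, y):
    """F statistic and degrees of freedom for the terms dropped from the full model."""
    n = len(y)
    p_full = len(full_cols) + 1                 # predictors plus intercept
    q = len(full_cols) - len(reduced_cols)      # number of terms dropped
    rss_full = rss(full_cols, y)
    rss_reduced = rss(reduced_cols, y)
    f_stat = ((rss_reduced - rss_full) / q) / (rss_full / (n - p_full))
    return f_stat, q, n - p_full

# Hypothetical data: one predictor, overall test against the intercept-only model.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
print(nested_f([x], [], y))
```

A large F statistic relative to its degrees of freedom means the dropped terms explain real variation. A small one means the simpler model fits about as well.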

Both First and Debate are non-significant. Dropping those and testing that model against this one gives a p-value of 0.386. First and Debate are not needed in the model.

Back to the playtest questions

One player is randomly assigned to take the first turn. Does that player gain an unfair advantage?

No. From looking at the graphs as well as the confidence interval and significance of the coefficient for First, we can see that there is not a significant advantage to going first.

This is a great find, and follows the design intentions. We do not want one player to have a significant advantage or disadvantage simply by virtue of having been randomly chosen to take the first turn.

One player will get a benefit from winning the debate. Does that benefit affect how many electoral votes a player earns?

No, unfortunately. Although the estimate predicts that winning the debate will provide 58 more electoral votes, the confidence interval around the debate coefficient is (-30.02, 146.10), which includes zero.

Some states are worth many electoral votes. The game is a competition over winning these states. Is that reflected in the game?
Some states aren't worth very many electoral votes. Do these states have an effect on the game at all?

States worth more electoral votes have a huge effect on winning the game, which is true for actual elections as well. It looks like winning 3 of the bigger states will get a player well on their way to winning the game. This fits within the design intentions of the game.

One worry was that the worth of smaller states would not be reflected in the final score. Although their effect is smaller, it is present. This also follows the design intentions of the game.

Recommendations to the designer

Don't change anything in the game around the player who goes first. It shouldn't have an effect, and it doesn't.
Change the rewards for winning the debate. The increase of 58 electoral votes is about right, but the variability around that is too large. At the moment, winning the debate merely grants more Campaign Points to a player, but those don't guarantee an increase in electoral votes. Possibly allowing the player to win tied states, or to choose one state to automatically win, may give around the same increase in score but tighten up its effect.
Big and small states are having the desired effect; there is no need to worry about them.

Exploratory Plots

The Model

In the model, the 6 x variables are the 6 constructs.

Residuals look fairly normal. Multicollinearity, measured by variance inflation factors, is not a problem. There are no outliers or leverage points.

However, we do have some fit issues. A test of the overall model gives a p-value of <0.001.

Linear Discriminant Analysis

Since our data is all in one of two groups (wins and losses) we could try a linear discriminant analysis to help classify data.

We create a linear combination of features to serve as a linear classifier. If we test it on the data, we get a misclassification rate of 16.28%.

However, creating the classifier and testing it on the same data introduces bias. If we test it on our data using leave-one-out cross-validation, we get a misclassification rate of 27.91%. So we do have some predictive power.

Back to the playtest questions

Does collecting a legendary artifact have an effect on the probability of winning the game?

Unfortunately, no. Although the model is not perfect, between the very non-significant coefficient and the exploratory graphs, collecting an artifact does not seem to have an overall effect on winning.

Is there a certain order in which the player should activate the "constructs" in order to increase the probability of winning the game?

The way the game works, all six constructs are "significant." In order to win, the player must collect all six, so it's impossible to win while missing one.

Due to the overall model significance, and the predictive power of our LDA classifier, the importance of collecting these six constructs is obvious.

However, individually, constructs are either non-significant or close to it. And with the fit issues of the model, it would be hard to declare those close constructs as significant.

Given this, there don't seem to be any constructs which, when collected, greatly increase the probability of winning when compared to other activations.

Recommendations to the designer

Make artifacts easier to collect. As it is, artifacts have very useful abilities, but are very difficult to collect. Statistically, gaining an artifact is not having an effect on the probability of winning, and from my knowledge of the game, I would say that this is at least in part due to the fact that so many resources are expended to even attempt to get an artifact that the benefits are not outweighing what is lost during their acquisition.
Don't worry about there being a construct that is too powerful once activated, or that players have one "correct" order to activate constructs. It's possible a couple of them might have a bigger effect than the others, but the differences are small.

Exploratory Plots

Data Transformations

Scores in Lassos & Longhorns are very skewed. A Box-Cox transformation recommends a log transform of the score.
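
For reference, the Box-Cox family maps a positive value y to (y^lambda - 1) / lambda, and the log transform is its limit as lambda goes to zero, which is why a recommended lambda near zero is read as "take the log". The short sketch below only illustrates that family. The function name and the scores are hypothetical, and it does not reproduce the estimation of lambda that a statistics package performs.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y for power parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(y)
    # expm1 keeps precision when lam is close to zero
    return math.expm1(lam * math.log(y)) / lam

# Hypothetical right-skewed scores, not the Lassos & Longhorns data.
scores = [3, 8, 20, 55, 150]
for lam in (1.0, 0.5, 0.0):
    print(lam, [round(box_cox(s, lam), 3) for s in scores])
```

Smaller values of lambda pull in the long right tail more strongly, with the log transform as the limiting case.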

The plots of the two forks look like a broken stick model would fit them better. Spline variables are added to the model.

The Model

Testing the first model against the model without Fork1 and Fork1p gives a p-value of .957. They are dropped. The model has an overall F-test p-value of <0.001, but it still has issues. Variance and normality are still a problem.

Back to the playtest questions

If a competing team beats the player to the station, a slight penalty is incurred. Does that penalty have an effect on the final score?

No. The model has fit issues, but in no model does the competition have an effect on the score.

Although it can be difficult to accomplish, the player might kill Butch Montgomery during the game and earn a small bounty. Does this have an effect on the final score?

This one is tougher to judge. The graphs show a fairly visible distinction between scores based on whether the bounty on Butch was earned.

The tests of the coefficient are all non-significant, but only just so. We really need more data on Butch's effect to say one way or the other.

There are two forks in the road. For both of them, is taking one path different from taking another in terms of the player's final score?

No. The change in log(Score) from taking the first fork East versus West is only .172. The change in log(Score) from taking the second fork East versus West is .757. Whether it's statistically significant or not, a change in Score of 1 or 2 is meaningless in game terms.
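
Because the model is fitted to log(Score), each fork coefficient is a difference on the log scale, and exponentiating it gives the ratio of scores it corresponds to. The sketch below back-transforms the two quoted coefficients. It assumes the natural logarithm was used, which the transcript does not state, and the function name is hypothetical.

```python
import math

# Coefficients quoted in the talk: change in log(Score), East versus West.
fork_effects = {"Fork 1": 0.172, "Fork 2": 0.757}

def multiplicative_factor(delta_log):
    """Ratio of scores implied by a difference on the natural-log scale."""
    return math.exp(delta_log)

for name, delta in fork_effects.items():
    factor = multiplicative_factor(delta)
    print(f"{name}: score multiplied by about {factor:.2f}")
```

How much such a ratio matters in play depends on typical scores, which the transcript does not give.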

This is a good thing to find. If one fork is significantly better than another then players will always choose that fork instead of the fork with the terrain features they are more willing to encounter.

Recommendations to the designer

Increase the penalty for being beaten by other competing teams.
Butch is a tough call. At this point, I would leave everything the same. Killing Butch for the bounty has a few positive effects, and I believe that, with more data, a bigger effect would emerge.
Don't change the forks. Right now there is very little change in the final score based on which paths the player takes. This is good.

Statistically Playtesting Games

Overall, I believe that there is potential in using statistics in playtesting games. In each of these simple examples I was able either to find a game mechanic that wasn't quite working the way the designer wanted, or to verify that something was working as intended.

At larger game companies, with more resources and playtesters at their disposal, much more play data could be collected. This would give better sample sizes and more accurate results.

Plot caption: For all four items, winners and losers collected them equally.
Plot labels: Butch, Fork 2, Fork 1, Competition, First, Debate, Chance, Community Chest.

Besides the data collection issue, another problem is that not everything that influences a game can be measured. The probability of making a poker hand can be easily calculated, but how do you measure the ability to read opponents or succeed at a convincing bluff? A statistical method will not be applicable to all games, or at least to all aspects of games.

Also, to sustain the integrity of the analysis, games must not be changed in any way during one study. Therefore, this method will work better at the end of the typical playtesting routine. It's better for finding smaller patterns over large numbers of plays.