Saturday, 23 February 2013

How to Determine if a Player is Above Average

So far I have concentrated on Markov Chains and Monte Carlo simulation. Markov Chains use only average teams made up of average players. I used Monte Carlo simulation to evaluate team performance based on the data from particular players.

One of the problems with using data from particular players in fastpitch softball is that the sample size of plate appearances is quite small even when I combine the statistics for 2011 and 2012.

The table below shows the data for 3 players in 2011 and 2012: the plate appearances, the times the player reached base safely, the on-base average, and the lower and upper values of the 95% confidence interval around the on-base average.

Player    PA      OB      Lower    OBA      Upper
1           107      46       0.288     0.430     0.478
2            85       31       0.210     0.365     0.417
3            59       18       0.128     0.305     0.365
 
The league on-base average was 0.373. Player 3's 95% confidence interval has an upper bound (0.365) that is below the league average, so we can say with confidence that this player is significantly below average.

None of the players has a lower bound that is higher than the league average. So according to this analysis, I cannot say that any of the players is significantly above average.
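The comparison above can be written as a small check. This is only a sketch: the bounds are copied from the table, and the function name `classify` is my own label for illustration.

```python
# Lower and upper bounds of the 95% confidence intervals, copied from the table.
LEAGUE_OBA = 0.373
INTERVALS = {1: (0.288, 0.478), 2: (0.210, 0.417), 3: (0.128, 0.365)}

def classify(lower, upper, league=LEAGUE_OBA):
    """Compare a player's confidence interval with the league average."""
    if upper < league:
        return "significantly below average"
    if lower > league:
        return "significantly above average"
    return "not significantly different from average"

for player, (lower, upper) in INTERVALS.items():
    print(player, classify(lower, upper))
```

Only player 3 falls entirely below the league average; the intervals for players 1 and 2 both contain 0.373.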

Another tool that could be helpful in my statistical analysis is Bayesian Updating. This method starts with an initial estimate of the probability that a player is above average and updates that probability as more information is collected.

I will use the statistics for 2011 for the players considered in my earlier post. I will begin with an initial estimate of the probability of being above average of 0.5 and update the estimate based on the 2011 statistics.

Player    PA     OB     OBA     Probability of being above average (2011)
1           49      23     0.469                          0.88
2           46      16     0.348                          0.29
3           38      13     0.342                          0.29
 
So it appears that player 1 is likely to be above average, while players 2 and 3 are likely to be below average.
 
Then I will use the 2011 probability of being above average and update the probability with the 2012 data.
 
Player    2011 Prob    PA    OB    OBA    Probability of being above average (2012)
1               0.88          58    23    0.397                         0.94
2               0.29          39    15    0.385                         0.38
3               0.29          21      5    0.238                         0.05
 
Based on the data from 2011 and 2012, by applying Bayesian Updating, I can safely say that player 1 is above average in terms of on-base average. Players 2 and 3, who had the same probability of being above average at the end of 2011, look quite different after 2012. Player 2 has become more likely to be above average (0.38, up from 0.29), while player 3 is very likely to be below average (0.05).
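To illustrate the mechanics of the update, here is a minimal sketch that treats each player as one of two types. The on-base probabilities for an "above average" and a "below average" player (`p_above` and `p_below`) are hypothetical values I chose for illustration, so the output will not match the tables above exactly.

```python
def update(prior, pa, ob, p_above=0.43, p_below=0.32):
    """Posterior probability that a player is above average.

    Assumption: an above-average player reaches base with probability
    p_above and a below-average player with probability p_below. Both
    values are hypothetical and are not taken from the post.
    """
    outs = pa - ob
    like_above = p_above ** ob * (1 - p_above) ** outs
    like_below = p_below ** ob * (1 - p_below) ** outs
    numerator = prior * like_above
    return numerator / (numerator + (1 - prior) * like_below)

# (PA, OB) for 2011 and then 2012, copied from the tables above.
SEASONS = {
    1: [(49, 23), (58, 23)],
    2: [(46, 16), (39, 15)],
    3: [(38, 13), (21, 5)],
}

for player, seasons in SEASONS.items():
    prob = 0.5  # initial estimate, as in the post
    for pa, ob in seasons:
        prob = update(prob, pa, ob)  # last season's posterior is the new prior
    print(player, round(prob, 2))
```

Updating season by season gives the same answer as a single update with the combined 2011 and 2012 totals, which is why the 2011 probability can serve as the starting point for 2012.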

How Much Does the Batting Order Matter?

In an earlier post, I found the best 9 players out of 14 to have in the lineup. Then in another post, I found the best batting order from this list of 9 players. One could ask, how much does the batting order matter?

I ran 500 trials of 500 seasons of 20 games (i.e. 500*500*20 = 5,000,000 games) for the best batting order and found that the average runs per game was 6.6.

Then I ran 100 random lineups of the best 9 players for 500 trials of 500 seasons of 20 games. I found that the average runs per game for these 100 random lineups was 6.3.

The best of these 100 random lineups had an average runs per game of 6.5 and the worst of these 100 random lineups had an average runs per game of 6.1.

So determining the best batting order has an impact of 0.3 runs per game over just finding the best 9 players and using a random batting order.
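The random-lineup comparison can be sketched as follows. The function `toy_evaluate` is a hypothetical stand-in for my Monte Carlo simulation (it simply rewards player 2 batting early), so the numbers it produces are illustrative only.

```python
import random
import statistics

BEST_NINE = [1, 2, 3, 6, 7, 10, 11, 12, 14]

def summarize_random_orders(players, evaluate, n_orders=100, seed=0):
    """Score n_orders random batting orders; return (mean, best, worst)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_orders):
        order = rng.sample(players, len(players))  # one random permutation
        scores.append(evaluate(order))
    return statistics.mean(scores), max(scores), min(scores)

def toy_evaluate(order):
    """Hypothetical scorer, NOT the real simulator."""
    return 6.6 - 0.05 * order.index(2)

mean, best, worst = summarize_random_orders(BEST_NINE, toy_evaluate)
print(round(mean, 2), round(best, 2), round(worst, 2))
```

In practice `evaluate` would run the full simulation for each order, which is where almost all of the computing time goes.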

Thursday, 21 February 2013

The Impact of Substitutions on Average Runs Per Game

In a previous post, I found the 9 players out of 14 that should be in the lineup. I would now like to answer the question: what is the impact of substituting a player not included in the best 9 into the lineup?

Using my Monte Carlo simulation, the best 9 players are expected to score 6.5 runs per game.

The table below shows the average runs per game scored when one of the players not in the lineup (player 4, 5, 8, 9, or 13) is substituted for one of the best 9 players.

When player 4 is inserted in the lineup for player 1, the average runs per game falls to 5.4.

In the columns of the table, I show the average runs per game when each player is substituted into the lineup for the 9 possible substitutions. One can see that when player 4 is inserted in the lineup, the expected runs per game varies between 6.1 and 5.3 with an average of 5.8.


Substitute Player              4        5       8        9       13       Ave  Lost Runs
                                                                                                   Per Game
Player Substituted For
         1                            5.4     5.7    5.4     5.0     5.4      5.4        1.1
         2                            5.4     5.6    5.3     5.0     5.4      5.3        1.2
         3                            5.3     5.5    5.2     4.9     5.3      5.2        1.3
         6                            5.9     6.1    5.9     5.5     5.9      5.9        0.6
         7                            6.1     6.4    6.1     5.7     6.1      6.1        0.4
       10                            5.9     6.1    5.9     5.5     5.9      5.9        0.6
       11                            6.1     6.3    6.0     5.7     6.0      6.0        0.5
       12                            6.1     6.3    6.1     5.8     6.1      6.1        0.4
       14                            5.9     6.2    6.0     5.6     6.0      5.9        0.6

Average                          5.8     6.0    5.8     5.4     5.8

Lost Runs Per Game   0.7    0.5    0.7     1.1     0.7

At the bottom of the table, I estimate the loss in average runs per game when each player is inserted into the lineup.
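The averages and lost runs in the table can be recomputed from its cells. This sketch uses the rounded values shown in the table and the 6.5 runs per game baseline for the best 9 players; the variable and function names are my own.

```python
BASELINE = 6.5  # runs per game with the best 9 players
SUBSTITUTES = [4, 5, 8, 9, 13]

# Key: player taken out. Values: runs per game for each substitute,
# in the same order as SUBSTITUTES (copied from the table).
RUNS = {
    1: [5.4, 5.7, 5.4, 5.0, 5.4],
    2: [5.4, 5.6, 5.3, 5.0, 5.4],
    3: [5.3, 5.5, 5.2, 4.9, 5.3],
    6: [5.9, 6.1, 5.9, 5.5, 5.9],
    7: [6.1, 6.4, 6.1, 5.7, 6.1],
    10: [5.9, 6.1, 5.9, 5.5, 5.9],
    11: [6.1, 6.3, 6.0, 5.7, 6.0],
    12: [6.1, 6.3, 6.1, 5.8, 6.1],
    14: [5.9, 6.2, 6.0, 5.6, 6.0],
}

def summary(values):
    """Return (average runs per game, lost runs per game), rounded to 0.1."""
    average = sum(values) / len(values)
    return round(average, 1), round(BASELINE - average, 1)

# Rows: effect of taking each starter out of the lineup.
for starter, values in RUNS.items():
    print("out", starter, summary(values))

# Columns: effect of putting each substitute into the lineup.
for substitute, column in zip(SUBSTITUTES, zip(*RUNS.values())):
    print("in", substitute, summary(column))
```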

Player 5 has the least negative impact (a loss of 0.5 runs per game) when inserted into the lineup.

Player 9 has the greatest negative impact (a loss of 1.1 runs per game) when inserted into the lineup.   If he is a defensive specialist, he should be replaced in the batting order by the designated hitter.

The other substitutes (players 4, 8 and 13) have the same negative impact on runs per game (a loss of 0.7), which is not much different from the substitute with the least negative impact, player 5.

Thus, these players (4, 5, 8 and 13) could be substituted into the lineup without much negative impact.

The rows in this table show the impact of taking one of the 9 best players out of the lineup. If player 1 is taken out of the lineup, the average number of runs per game drops to 5.4 from 6.5 (a loss of 1.1 runs per game).

The results in the rows would suggest that players 1, 2 and 3 should not be removed from the lineup unless absolutely necessary. The other players (players 6, 7, 10, 11, 12 and 14) could be replaced with a substitute with relatively less impact on the average runs per game.

Wednesday, 20 February 2013

Finding the Best Batting Order

In my last post, I explained how I used Monte Carlo simulation to find the 9 players that should be in the lineup. In this post, I will show how I found the best batting order using these 9 players.

There are

9*8*7*6*5*4*3*2*1 = 362,880

different batting orders that can be made up from 9 players. So finding the particular batting order that is the best would appear to be difficult but not impossible.

I wrote a program to find these 362,880 batting orders and analyze them using my Monte Carlo simulation.

Here are the 9 players that I suggested should be in the lineup in the last post.

Player    
1        
2        
3        
6        
7        
10     
11      
12    
14    
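Enumerating every batting order for these 9 players is straightforward with a standard library. The sketch below is one way to do it and is not necessarily how my program works; `batting_orders` is a name chosen for illustration.

```python
import itertools
import math

BEST_NINE = [1, 2, 3, 6, 7, 10, 11, 12, 14]

def batting_orders(players):
    """Return an iterator over every batting order, one tuple at a time."""
    return itertools.permutations(players)

print(math.factorial(len(BEST_NINE)))   # 362880
print(next(batting_orders(BEST_NINE)))  # (1, 2, 3, 6, 7, 10, 11, 12, 14)
```

Because the orders are produced one at a time, each can be simulated and discarded without holding all 362,880 in memory.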

For all 362,880 batting orders, I ran 10 trials of 100 seasons of 20 games to find all of the sets that had an average of 6.5 runs or more per game. I found 51 potential batting orders to examine in more detail.

Then I conducted a runoff, based on the average number of runs per game, for these 51 batting orders using the program that I discussed in the last post. Again, I ran a simulation with 500 trials of 500 seasons of 20 games.

I found the winner of the runoff was the following batting order.

Player  
2         
1       
3       
10     
11    
12 
7   
6   
14 

On-base average is valued for the leadoff hitter. Then slugging percentage is valued for the 2, 3 and 4 hitters. Players with low slugging percentages are relegated to the bottom of the batting order.

Thursday, 7 February 2013

Finding the Best Batting Lineup


I explained in my last post how Monte Carlo simulation could be used to compare one batting order to another. This can be useful if a manager has two batting orders that he wishes to evaluate. However, what if the manager wanted to know what the best batting order would be for his team?

This is a difficult problem. With 14 players on the roster, there are

14*13*12*11*10*9*8*7*6 = 726,485,760

different batting orders that can be made up. So finding the particular batting order that is the best would appear to be very difficult.

On the other hand, if we ignore the order in which the players bat, there are only 

(14*13*12*11*10) / (5*4*3*2*1) = 2,002

ways to choose 9 players from a 14-player roster. This appears to be a manageable problem.

I wrote a program to find these 2,002 lineups and analyze them using my Monte Carlo simulation.
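Both counts, and the 2,002 sets themselves, can be checked with a few lines. This is a sketch using the Python standard library (`math.perm` and `math.comb` need Python 3.8 or later); the roster numbering 1 to 14 follows the player numbers used in these posts.

```python
import itertools
import math

ROSTER = range(1, 15)  # players numbered 1 to 14

print(math.perm(14, 9))  # ordered batting orders: 726485760
print(math.comb(14, 9))  # unordered sets of 9 players: 2002

lineups = list(itertools.combinations(ROSTER, 9))
print(len(lineups))      # 2002
print(lineups[0])        # (1, 2, 3, 4, 5, 6, 7, 8, 9)
```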

I used the offensive data from 2012 from 14 players.
 
For the 2,002 sets of 9 players, I ran 100 seasons of 20 games to find all of the sets that had an average of 6.0 runs or more per game. I found 29 sets.

Then I conducted a runoff of these 29 sets using the program I discussed in the last post to compare two lineups. This time, I ran 500 trials of 500 seasons of 20 games. I compared set 1 to set 2 to determine which had the higher average runs per game. If set 1 had a higher average number of runs than set 2, I repeated the process with set 1 and set 3. Otherwise, I repeated the process with set 2 and set 3, and so on through the remaining sets. I did not simply run 500*500*20 = 5,000,000 individual games and calculate the average runs per game for each set. By doing a runoff and running trials, I was able to determine the number of trials in which the difference between the two sets was statistically significant.
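The screening step and the runoff can be sketched as below. Here `estimate` and `average_runs` stand for calls to the simulation (a quick run and a long run respectively); the toy values in the usage example are hypothetical, and the count of statistically significant trials is left out to keep the sketch short.

```python
def screen(candidates, estimate, threshold):
    """Keep the candidates whose quick estimate reaches the threshold."""
    return [c for c in candidates if estimate(c) >= threshold]

def runoff(candidates, average_runs):
    """Sequential runoff: the current leader meets each remaining candidate."""
    leader = candidates[0]
    leader_avg = average_runs(leader)
    for challenger in candidates[1:]:
        challenger_avg = average_runs(challenger)
        if challenger_avg > leader_avg:
            leader, leader_avg = challenger, challenger_avg
    return leader

# Hypothetical runs per game for four sets of players, for illustration only.
TOY_RUNS = {"set 1": 6.1, "set 2": 6.4, "set 3": 5.8, "set 4": 6.2}

finalists = screen(list(TOY_RUNS), TOY_RUNS.get, threshold=6.0)
print(finalists)                        # ['set 1', 'set 2', 'set 4']
print(runoff(finalists, TOY_RUNS.get))  # set 2
```

The cheap screening pass is what makes the expensive runoff affordable: only the sets that survive it are simulated at length.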

I found the winner of the runoff was the following set of 9 players.

Player    
1           
2           
3           
6           
7           
10        
11         
12         
14         

With this set of 9 players in the lineup, the next problem is to determine the best batting order. I will discuss how I did this in my next post.

Monday, 4 February 2013

Monte Carlo Simulation Used to Compare Two Batting Orders

So far I have used Markov Chains based on statistics for a team of average players. Although I have been able to gain some insight, especially concerning the value of bunting, stealing or taking an extra base on a hit, I have not considered the capabilities of individual players on a particular team.

To consider the individual characteristics of the players on a particular team, I built a Monte Carlo simulation of seven offensive innings. The first study I can do with this simulation is to evaluate the average runs scored by a team in a game based on a particular batting order.

I can specify a batting order and run the simulation for any number of seven inning games. For each game, I record the simulated number of runs in the game. I can find the average number of runs in a game for the batting order and a 95% confidence interval around the average. That is, 19 out of 20 times, the simulated average number of runs will be between the lower bound and upper bound of the confidence interval.
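To give a feel for what such a simulation involves, here is a heavily simplified sketch. It is not my actual program: the batter probabilities are hypothetical, every runner is assumed to advance exactly as many bases as the batter, and the confidence interval uses a normal approximation on per-game runs.

```python
import math
import random
import statistics

# Hypothetical chances per plate appearance of gaining 1, 2, 3 or 4 bases;
# anything else is an out. These are NOT the 2012 data used in the post.
AVERAGE_BATTER = (0.28, 0.05, 0.01, 0.03)

def draw_bases(batter, rng):
    """Return 0 for an out, otherwise the number of bases gained (1 to 4)."""
    u = rng.random()
    cumulative = 0.0
    for bases_gained, p in enumerate(batter, start=1):
        cumulative += p
        if u < cumulative:
            return bases_gained
    return 0

def simulate_game(order, rng, innings=7):
    """Return the runs scored by one batting order in one game."""
    runs = 0
    at_bat = 0  # the batting order carries over from inning to inning
    for _ in range(innings):
        outs = 0
        bases = [False, False, False]  # first, second, third
        while outs < 3:
            batter = order[at_bat % len(order)]
            at_bat += 1
            gained = draw_bases(batter, rng)
            if gained == 0:
                outs += 1
                continue
            for step in range(gained):
                runs += bases[2]  # a runner on third scores
                bases = [step == 0, bases[0], bases[1]]  # batter enters at first
    return runs

def mean_with_ci(order, games, seed=0):
    """Return (lower bound, average, upper bound) of runs per game."""
    rng = random.Random(seed)
    runs = [simulate_game(order, rng) for _ in range(games)]
    mean = statistics.mean(runs)
    half_width = 1.96 * statistics.stdev(runs) / math.sqrt(games)
    return mean - half_width, mean, mean + half_width

order = [AVERAGE_BATTER] * 9
lower, mean, upper = mean_with_ci(order, games=200)  # 10 seasons of 20 games
print(f"{lower:.3f} {mean:.3f} {upper:.3f}")
```

A realistic version would give each batter his own probabilities and model outs on the bases, steals and extra bases taken on hits.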

If I run the simulation for two batting orders, I can determine if the difference in the average number of runs between the two batting orders is statistically significant.

I used data from the 2012 season.
 
Consider two batting orders which I will call batting order 1 and 2.

By running batting order 1 through 10 seasons of 20 games, I obtained the following results for runs per game.

Lower Bound       Average                Upper Bound
6.188                   6.919                    7.651

In batting order 2, only the last batter in the order was changed. 
 
Batting order 2 produced the following results for 10 seasons of 20 games.
Lower Bound        Average                Upper Bound
5.920                    6.500                    7.079

Batting order 1 looks somewhat better than batting order 2 because it has a higher average number of runs per game.

However, the two confidence intervals overlap. Therefore, based on 10 seasons of 20 games, I cannot say that the difference in the average number of runs per game is statistically significant. That is, the average for batting order 1 may be larger than that for batting order 2 by random chance. To say that batting order 1 is better than batting order 2 with statistical significance, the lower bound of the confidence interval for batting order 1 would have to be larger than the upper bound for batting order 2. I can get this result by taking a larger sample of seasons (that is, simulating more seasons).

Below is a table that shows the lower bound of batting order 1 and the upper bound of batting order 2 for varying numbers of 20-game seasons.

Seasons    Batting Order 1    Batting Order 2    Difference Statistically Significant
                  Lower Bound        Upper Bound
20              6.289                    6.454                                    No
30              6.296                    6.407                                    No
40              6.234                    6.237                                    No
50              6.288                    6.206                                    Yes

So it takes at least 50 seasons before the difference turns out to be statistically significant.
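The rule behind the last column of the table is a one-line comparison. The sketch below applies it to the bounds copied from the table; the function name is my own.

```python
# (seasons, lower bound of batting order 1, upper bound of batting order 2)
ROWS = [
    (20, 6.289, 6.454),
    (30, 6.296, 6.407),
    (40, 6.234, 6.237),
    (50, 6.288, 6.206),
]

def significantly_better(lower_1, upper_2):
    """Order 1 is significantly better only if the intervals do not overlap."""
    return lower_1 > upper_2

for seasons, lower_1, upper_2 in ROWS:
    print(seasons, "Yes" if significantly_better(lower_1, upper_2) else "No")
```

At 40 seasons the two bounds are only 0.003 apart, which shows how close the comparison is before the 50-season run separates them.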

The major result is that Monte Carlo simulation can be used to evaluate, based on statistical significance, whether one batting order is better than another.