Football Statistics Project
Introduction

I have chosen to base my project on football statistics because they are both readily available and interesting enough for deep analysis. As a starting point I decided to look at the generally accepted theory of 'Home Advantage'.

Home advantage, or the tendency for the home team to do better than they would away, could have several causes. It could be partly psychological: the home team almost always has the majority of the crowd behind them, cheering them on. It could also be to do with the condition of the pitch: Premiership teams sometimes find it hard to play on the muddy, waterlogged pitches of some lower-division teams. Another factor is the attitude of referees and officials. If they are intimidated by the home crowd they may give more decisions in favour of the home team, meaning teams may also have a worse disciplinary record when playing away.

Hypotheses:
1. Teams have a worse disciplinary record away than at home
2. Better attended teams have a greater home advantage
3. More successful teams have a better disciplinary record

Collecting Data

I found that football statistics were easy to find on the internet. I obtained mine from two main sites:
http://soccerstats.football365.com
http://www.bettingzone.co.uk

There is a very small risk that some of the data I collected could be incorrect. However, I found alternative sites for the Premiership statistics (such as www.4thegame.com) which gave the same results. I also think that a betting site has a strong incentive to give accurate statistics because they are such an important part of gambling.

Using Software

I chose to input my data into Microsoft Excel because it makes it much quicker and easier to manipulate the data.

Hypothesis 1: Teams have a worse disciplinary record away than at home

Discipline 'points' system

On the internet I was able to find out the numbers of red and yellow cards for each team at home and away.
However, in order to give an overall impression of how good or bad the team's discipline was I needed to turn these two pieces of data into one measurement. I decided to use the points system (as on www.4thegame.com). Under this system a yellow card counts for one point whereas a red card is more severe and counts for three. To make this easier to calculate I used formulae in Excel:

[IMAGE]

Because some divisions have more teams than others, some teams played more games than others. This means their players had slightly more opportunities to get booked or sent off, so their points totals might be higher. To correct for this I divided the points scores by the number of games each team had to play to give a 'Disciplinary Points Per Game' score. This can then be compared to any other team in any division.

To give a measure of how much better or worse the team's disciplinary record is away than at home I decided to divide the away points per game score by the home score. I subtracted one from this and expressed it as a percentage. This gives a positive percentage if the team has a worse disciplinary record away and a negative one if it is worse at home.

Pilot Study

In order to find out how well my data would support my hypothesis about teams having a worse disciplinary record away than at home I made a bar chart using Excel to show the difference between disciplinary points per game away and at home.

[IMAGE]

As you can see most teams have a considerably worse disciplinary record away than at home, as shown by the taller red bars. For this bar chart I simply ranked the teams in the Premiership and the First Division from the top of the Premiership (1) to the bottom of Division 1 (44). The names of these teams can be found in the appendix at the back.

Stratified random sampling

In order to better represent football at other levels of the game I also collected data for lower divisions (Division 2 and Division 3).
However, this gave me far too much data (a total of 92 teams) to perform statistical tests such as the Wilcoxon Signed Rank Test by hand. In order to cut down on this I decided to use random sampling to lower the number of teams involved. However, if I just randomly selected teams from all of the divisions put together I might over-represent some divisions, affecting the results. To make this fairer I decided to use stratified random sampling, with the different divisions as the strata. This way I was sure to get proportionate numbers of teams from each division. I chose to take 25% of the teams in each division, to give me 23 sets of data, a much more manageable figure.

I chose the teams by writing the numbers of the teams in each division (e.g. 1-24) on small pieces of paper. I folded these up, shuffled them and picked them at random until I had the right number. Once I had chosen the teams I put them in a new spreadsheet. I produced another bar chart similar to the one I had produced for the preliminary test. This illustrates how well my randomly sampled data supports my hypothesis.

[IMAGE]

As you can see the pattern I noticed in the pilot study is continued with the data from the other divisions. The teams' away disciplinary record is in almost all cases worse than at home. As further evidence of this I found the mean disciplinary points per game at home and away. At home this was about 1.71 compared to about 2.28 away (to 3 significant figures). This shows a 33% difference between the two. I will now test whether or not this difference is statistically significant. I chose to compare the means of the two sets because this gives more weight to big differences between two scores than small differences.

Wilcoxon signed-rank test

Although graphs and charts can illustrate trends in data they cannot show whether a trend could have arisen by chance. In order to test my hypothesis I will have to use a statistical test. Because my data is non-parametric (i.e.
I have no reason to believe it will follow a normal distribution) and I am comparing pairs of data from two categories I will use the Wilcoxon signed-rank test.

Method:
1. First I found the difference between the home and away disciplinary points per game for each team by subtracting one from the other using Excel.
2. Because some of the differences were negative I used the abs() function in Excel to find the absolute values of the differences.
3. I sorted the data by the absolute differences between the home and away disciplinary points per game. Ignoring the teams where the difference was zero, I ranked them in order from the lowest to the highest. Where several differences were tied I gave each of them the mean of the ranks they would have occupied.
4. I then looked to see where the differences had originally been negative and I added the negative sign in front of the rank for those differences. This gave me the signed rank.
5. Finally I summed the positive and the negative signed ranks separately and took the sum with the greater absolute value (in this case the negative ranks), which is the 'W' value. The number of teams where the difference is not equal to zero gives the 'N' value.
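The steps above can be sketched in Python as follows. This is only an illustration of the ranking procedure, not the Excel workings I actually used: the function name `signed_rank_sums`, the `places` rounding parameter (used so that differences which are equal on paper are treated as ties) and the five pairs of example values are all made up for the sketch.

```python
def signed_rank_sums(home, away, places=6):
    """Return (sum of positive ranks, sum of negative ranks, N) for paired data."""
    # Step 1: differences (home minus away); rounding is an assumption so that
    # tiny floating-point differences do not break ties.
    diffs = [round(h - a, places) for h, a in zip(home, away)]
    # Step 3 (part): ignore pairs where the difference is zero.
    diffs = [d for d in diffs if d != 0]

    # Steps 2-3: order by absolute difference, smallest first.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next absolute difference is tied.
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        mean_rank = (start + end + 2) / 2  # mean of the 1-based ranks in the run
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1

    # Steps 4-5: split the ranks by the sign of the original difference.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, len(diffs)


# Hypothetical points-per-game values for five teams (not taken from my data).
example_home = [1.0, 2.0, 1.5, 2.0, 1.0]
example_away = [1.5, 2.0, 2.5, 1.5, 3.0]
print(signed_rank_sums(example_home, example_away))  # (1.5, 8.5, 4)
```

In this made-up example the negative ranks give the larger sum, so under the method above W would be 8.5 with N = 4.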
| Team | A: Home PPG | B: Away PPG | Original (XA - XB) | Absolute (XA - XB) | Rank of absolute (XA - XB) | Signed Rank |
|---|---|---|---|---|---|---|
| Manchester United | 1.842105263 | 2.421052632 | -0.57895 | 0.578947 | 7 | -7 |
| Tottenham Hotspur | 1.947368421 | 1.947368421 | 0 | 0 | | |
| Birmingham City | 1.947368421 | 2.842105263 | -0.89474 | 0.894737 | 13 | -13 |
| Aston Villa | 2.263157895 | 2.105263158 | 0.157895 | 0.157895 | 2 | 2 |
| Bolton Wanderers | 2.105263158 | 2.368421053 | -0.26316 | 0.263158 | 4 | -4 |
| Portsmouth | 1.434782609 | 1.913043478 | -0.47826 | 0.478261 | 6 | -6 |
| Wolverhampton | 1.52173913 | 2.173913043 | -0.65217 | 0.652174 | 8.5 | -8.5 |
| Norwich | 1.47826087 | 1.47826087 | 0 | 0 | | |
| Wimbledon | 1.130434783 | 1.913043478 | -0.78261 | 0.782609 | 11 | -11 |
| Rotherham United | 1.869565217 | 2.869565217 | -1 | 1 | 14 | -14 |
| Grimsby | 2.304347826 | 1.608695652 | 0.695652 | 0.695652 | 10 | 10 |
| Crewe Alexandra | 0.913043478 | 1 | -0.08696 | 0.086957 | 1 | -1 |
| Cheltenham Town | 1.608695652 | 1.434782609 | 0.173913 | 0.173913 | 3 | 3 |
| Huddersfield Town | 1.130434783 | 2.52173913 | -1.3913 | 1.391304 | 19 | -19 |
| Northampton Town | 1.826086957 | 1.826086957 | 0 | 0 | | |
| Bristol City | 1.434782609 | 2.782608696 | -1.34783 | 1.347826 | 18 | -18 |
| QPR | 1.695652174 | 2.782608696 | -1.08696 | 1.086957 | 16.5 | -16.5 |
| Rushden & Diamonds | 1.608695652 | 2.652173913 | -1.04348 | 1.043478 | 15 | -15 |
| Lincoln City | 2 | 2.652173913 | -0.65217 | 0.652174 | 8.5 | -8.5 |
| Bury | 1.043478261 | 2.608695652 | -1.56522 | 1.565217 | 20 | -20 |
| Darlington | 2.217391304 | 2.565217391 | -0.34783 | 0.347826 | 5 | -5 |
| Leyton Orient | 1.826086957 | 2.695652174 | -0.86957 | 0.869565 | 12 | -12 |
| Shrewsbury Town | 2.173913043 | 3.260869565 | -1.08696 | 1.086957 | 16.5 | -16.5 |

W (absolute sum of the negative ranks): 195
N: 20

I found that the value of W was 195, and that N, the number of teams where the difference was not equal to zero, was 20. Looking these up in a table of critical values (OCR AS/A Level MEI Structured Mathematics Examination Formulae and Tables, October 2000) I found that the result was significant at the 5% level. This means that if there were no real difference between disciplinary records at home and away, a result this extreme would be expected less than 5% of the time. Therefore my hypothesis is well supported by the data.
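As a recap of the measure used throughout this hypothesis, the disciplinary points per game score and the away-versus-home percentage can be sketched in Python. The function names are my own for illustration, and the card counts in the first example are hypothetical rather than taken from any real team; the second example uses the two mean values quoted earlier (1.71 at home and 2.28 away).

```python
YELLOW_POINTS = 1  # a yellow card counts for one point
RED_POINTS = 3     # a red card counts for three points


def points_per_game(yellows, reds, games):
    """Disciplinary points per game for one team (home or away)."""
    return (yellows * YELLOW_POINTS + reds * RED_POINTS) / games


def away_vs_home_percent(home_ppg, away_ppg):
    """Positive if the record is worse away, negative if it is worse at home."""
    return (away_ppg / home_ppg - 1) * 100


# Hypothetical team: 30 yellow cards and 2 red cards in 19 home games.
print(round(points_per_game(30, 2, 19), 2))       # 1.89

# The mean home and away scores quoted in the text.
print(round(away_vs_home_percent(1.71, 2.28)))    # 33
```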
Hypothesis 2: Better attended teams have a greater home advantage

I proposed this hypothesis because a better attended team would have more of the crowd behind them when playing at home, giving them a psychological advantage over their opponents. As with the disciplinary points system, I used Excel to find the points per game score for each team both at home and away. This time I divided the home points per game score by the away score and subtracted one from this, expressing it as a percentage.

A problem arises because some teams have much bigger stadiums than others. For example, 20,000 might be considered good attendance for a First Division club, but very poor for a Premiership team. Because of this I divided the average home attendance by the total capacity of each football ground to give the average attendance percentage. I plotted this against the home advantage percentage in a scatter graph.

Pilot Study

The scatter graph is a useful way of looking for correlation between two variables. As with the first hypothesis I used the data for the Premiership and the First Division as a pilot test.

[IMAGE]

As you can see there is no strong correlation between these two variables. There may be a slight trend for the higher home advantage percentages to be towards the higher percentages of stadium capacity. I decided to continue investigating this hypothesis because there might be clearer correlation in the data from the other divisions.

Spearman's Rank

In order to tell whether or not there is correlation between home advantage and attendance I will need to use a statistical test. Because this data is also non-parametric I will use Spearman's Rank Correlation Coefficient.

Method:
1. The first step was to rank the teams by both % Home Advantage and Average % Capacity. As with the Wilcoxon test I found the mean of tied ranks.
2. I found the difference between these two ranks by subtracting one from the other using Excel.
3. I then squared the differences between the two ranks.
4. I used the formula below to find rs, the Spearman's Rank Correlation Coefficient. My workings are illustrated in the table overleaf.

$$r_s = 1 - \frac{6\sum d^2}{n^3 - n}$$

d = the difference in the rank of the values of each matched pair
n = the number of pairs

| Team | % PPG Home Advantage | Average % Capacity | % PPG Home Advantage Rank | Average % Capacity Rank | \|d\| | d² |
|---|---|---|---|---|---|---|
| Manchester United | 52% | 99.16% | 24 | 1 | 23 | 529 |
| Tottenham Hotspur | 63% | 99.06% | 19 | 2 | 17 | 289 |
| Portsmouth | 23% | 98.72% | 35 | 3 | 32 | 1024 |
| West Ham United | 10% | 96.59% | 41 | 4 | 37 | 1369 |
| Birmingham City | 53% | 96.07% | 22 | 5 | 17 | 289 |
| Everton | 81% | 95.82% | 10 | 6 | 4 | 16 |
| Brighton | 50% | 95.56% | 26 | 7 | 19 | 361 |
| West Bromwich Albion | 17% | 95.46% | 37.5 | 8 | 29.5 | 870.25 |
| Liverpool | 21% | 95.33% | 36 | 9 | 27 | 729 |
| Norwich | 100% | 94.81% | 4.5 | 10 | 5.5 | 30.25 |
| Wolverhampton | -5% | 90.25% | 44 | 11 | 33 | 1089 |
| Bolton Wanderers | 93% | 89.73% | 8 | 12 | 4 | 16 |
| Hull City | 68% | 84.72% | 16 | 13 | 3 | 9 |
| Blackburn Rovers | 31% | 83.61% | 29 | 14 | 15 | 225 |
| Aston Villa | 250% | 81.87% | 1 | 15 | 14 | 196 |
| Nottingham Forest | 96% | 79.85% | 6.5 | 16 | 9.5 | 90.25 |
| Derby | 60% | 75.81% | 20 | 17 | 3 | 9 |
| QPR | 24% | 68.97% | 34 | 18 | 16 | 256 |
| Hartlepool United | 66% | 68.38% | 17 | 19 | 2 | 4 |
| Northampton Town | 79% | 68.09% | 11 | 20 | 9 | 81 |
| Crewe Alexandra | -21% | 67.04% | 46 | 21 | 25 | 625 |
| Rotherham United | 27% | 65.33% | 31 | 22 | 9 | 81 |
| Rushden & Diamonds | 56% | 65.26% | 21 | 23 | 2 | 4 |
| Preston North End | 90% | 64.70% | 9 | 24 | 15 | 225 |
| Watford | 73% | 64.45% | 13 | 25 | 12 | 144 |
| Cheltenham Town | 29% | 62.85% | 30 | 26 | 4 | 16 |
| Grimsby | 17% | 58.65% | 37.5 | 27 | 10.5 | 110.25 |
| AFC Bournemouth | 96% | 58.37% | 6.5 | 28 | 21.5 | 462.25 |
| Bristol City | 52% | 55.36% | 24 | 29 | 5 | 25 |
| York City | 75% | 46.48% | 12 | 30 | 18 | 324 |
| Boston United | 105% | 46.20% | 3 | 31 | 28 | 784 |
| Chesterfield | 185% | 45.85% | 2 | 32 | 30 | 900 |
| Shrewsbury Town | 5% | 45.70% | 42 | 33 | 9 | 81 |
| Colchester United | 15% | 44.83% | 39.5 | 34 | 5.5 | 30.25 |
| Millwall | 44% | 42.24% | 27 | 35 | 8 | 64 |
| Barnsley | 26% | 42.09% | 32.5 | 36 | 3.5 | 12.25 |
| Scunthorpe United | 32% | 40.20% | 28 | 37 | 9 | 81 |
| Darlington | 70% | 39.17% | 15 | 38 | 23 | 529 |
| Huddersfield Town | 100% | 38.80% | 4.5 | 39 | 34.5 | 1190.25 |
| Lincoln City | 26% | 35.94% | 32.5 | 40 | 7.5 | 56.25 |
| Leyton Orient | 65% | 33.84% | 18 | 41 | 23 | 529 |
| Peterborough United | 15% | 32.33% | 39.5 | 42 | 2.5 | 6.25 |
| Wigan Athletic | 4% | 29.15% | 43 | 43 | 0 | 0 |
| Bury | -16% | 27.65% | 45 | 44 | 1 | 1 |
| Port Vale | 52% | 19.84% | 24 | 45 | 21 | 441 |
| Wimbledon | 71% | 10.55% | 14 | 46 | 32 | 1024 |

Σd² = 15227.5
n = 46
n³ = 97336
rs = 1 - (6Σd²) / (n³ - n) = 0.0609

I found that rs = 0.0609, and that the critical value for rs at 10% was 0.2456 (OCR AS/A Level MEI Structured Mathematics Examination Formulae and Tables, October 2000). Because rs is below the critical value, the correlation is not significant at the 10% level: a correlation this weak could easily have arisen by chance even if attendance and home advantage were unrelated. This is no great surprise to me, as the pilot test showed little or no correlation.

Unfortunately my hypothesis is not supported by the data. Perhaps the fact that away supporters are not included might have made a difference: if a team is well-supported away from home it might offset the disadvantage I predicted. I could not find any data on away supporters so I am unable to investigate this possibility.

Hypothesis 3: More successful teams have a better disciplinary record

Pilot Study

My idea for a third hypothesis was that a team struggling at the bottom of the table facing relegation would lose confidence and become desperate, causing the players to commit more fouls. On the other hand, a team near the top of the table would be confident and more relaxed, and would not feel the need for desperate challenges.

As a pilot test I decided to plot a scatter graph to look for a relationship between the position of a team within its division and its disciplinary points per game. As with the other tests I used only the data for the Premiership and the First Division.

[IMAGE]

This graph doesn't show an obvious trend, but there is a slight tendency for the disciplinary points to rise further down the table, especially in the First Division.
The second team in Division 1 (Leicester, shown circled) is clearly an outlier, and perhaps if I continued the study on the other divisions a clearer pattern would emerge.

In order to test this hypothesis further I decided to take all of the data from the Football League and randomly select 3 teams from the top 25% and 3 teams from the bottom 25% of each division. This means the data is collected using stratified random sampling. However, as the Premiership has only 20 teams instead of 24 it is slightly over-represented compared to Divisions 1-3. Most importantly I am not using the data from the middle 50% of the divisions, so any possible patterns there will be lost. However, there are two good reasons to sacrifice this data. Firstly, any differences between successful and unsuccessful teams would be most apparent at the top and bottom of each division. Secondly, I need a more manageable sample size on which I can perform statistical tests.

I produced two histograms to show any difference between top and bottom teams.

[IMAGE][IMAGE]

As you can see, slightly more teams in the lower quarters of the divisions have higher disciplinary points per game, while slightly more teams in the upper quarters of the divisions have lower disciplinary points per game. The easiest way to tell this is that the histogram for the bottom 25% is shifted slightly to the right compared to the one for the top 25%.

I calculated the median for each set of data to give an idea of the central tendency for each distribution. I used the median because I am comparing the 'average team' in the top 25% with the 'average team' in the bottom 25%. The median for the upper quarters is 2.12 and for the lower quarters, 2.41 (answers to 2 decimal places), meaning there is a 14% difference between the two. This suggests that the disciplinary points per game for the lower teams are generally higher than those of the upper teams.
In order to tell whether or not there is a significant difference between the lower and upper quarters of the divisions I would have to perform a statistical test. In this case I will use the Mann-Whitney U-Test.

Mann-Whitney U-Test

This is a non-parametric statistical test to show whether or not two groups of samples are from different populations. In this case it will show whether or not there is a statistically significant difference between teams in the top and bottom 25% of each division, comparing their average disciplinary points per game.

Method:
1. First I ranked the data from both groups in increasing order of size (see column B in the table overleaf).
2. Next, for each team in group b, I counted how many teams in group a had a smaller disciplinary points per game total. Teams with equal disciplinary points per game scored ½. I did the same for group a. See column C in the table.
3. I found the total of the column C values for both group a and group b. I called these two totals Ua and Ub.
4. I chose the smaller value of U and I looked up the critical values of U at the 5% significance level.
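The counting in steps 2 and 3 can be sketched in Python as follows. This is an illustration under my own naming (`mann_whitney_u` is not a library function) and the two small groups of values are invented for the example, not taken from my data.

```python
def mann_whitney_u(group_a, group_b):
    """Return (Ua, Ub) using the counting method described in the steps above."""

    def count_lower(own, other):
        # For each value in `own`, count the values in `other` that are lower;
        # a value in `other` that is exactly equal scores one half.
        total = 0.0
        for x in own:
            for y in other:
                if y < x:
                    total += 1
                elif y == x:
                    total += 0.5
        return total

    u_a = count_lower(group_a, group_b)  # column C total for group a
    u_b = count_lower(group_b, group_a)  # column C total for group b
    return u_a, u_b


# Hypothetical points-per-game values for two small groups.
u_a, u_b = mann_whitney_u([1.0, 2.0, 3.0], [2.0, 4.0])
print(u_a, u_b)        # 1.5 4.5
print(min(u_a, u_b))   # 1.5, the value compared with the critical value
```

A useful check is that Ua and Ub always add up to the number of teams in group a multiplied by the number in group b (here 3 x 2 = 6).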
| A: Team | B: Average disciplinary points per game | C: Number of teams in other group with a lower points per game score | D: Top (a, blue) or bottom (b, red) group |
|---|---|---|---|
| Hartlepool United | 1.369565217 | 0 | a |
| Bristol Rovers | 1.652173913 | 1 | b |
| Portsmouth | 1.673913043 | 1 | a |
| Cardiff City | 1.739130435 | 1 | a |
| Sheffield United | 1.826086957 | 1 | a |
| Rochdale | 1.826086957 | 4 | b |
| Huddersfield Town | 1.826086957 | 4 | b |
| AFC Bournemouth | 1.847826087 | 3 | a |
| Wolverhampton | 1.847826087 | 3 | a |
| Brighton | 1.913043478 | 6 | b |
| Sheffield Wednesday | 1.956521739 | 6 | b |
| Chelsea | 1.973684211 | 5 | a |
| Stoke | 1.97826087 | 7 | b |
| Liverpool | 2.131578947 | 6 | a |
| Aston Villa | 2.184210526 | 8 | b |
| Scunthorpe United | 2.195652174 | 7 | a |
| Bolton Wanderers | 2.236842105 | 9 | b |
| QPR | 2.239130435 | 8.5 | a |
| Barnsley | 2.239130435 | 9.5 | b |
| Swansea City | 2.304347826 | 10 | b |
| West Ham United | 2.315789474 | 10 | b |
| Arsenal | 2.342105263 | 11 | a |
| Oldham Athletic | 2.391304348 | 11 | a |
| Mansfield Town | 2.456521739 | 12 | b |

Sum of column C values for group a (Ua): 57.5
Sum of column C values for group b (Ub): 86.5

Results

I found that the lower value of U was Ua (57.5). The critical value for U at the 5% significance level was 37 (Advanced Biology Study Guide by C J Clegg & D G MacKean, 1996). This meant that Ua was larger than the critical value of U at the 5% significance level. Therefore the difference between teams in the top and bottom 25% of each division, comparing the average disciplinary points per game, is not significant. A difference of this size would be expected more than 5% of the time by chance alone, even if there were no real difference between the two groups.

Again this result is hardly surprising considering the lack of strong correlation in the pilot test. There could be several reasons why this hypothesis failed. Perhaps certain teams do well whilst still playing dirty; maybe this is even a valid tactic for success! It might also be the case that the disciplinary points scores for some teams are disproportionately increased by certain players who are frequently booked or sent off, Patrick Vieira of Arsenal for example.
I am unable to find data on individual players so I cannot investigate this further.

Evaluation

I am quite pleased with the way my investigation went. Although hypotheses 2 and 3 were not statistically supported by my data, they raised other interesting questions which could be investigated.

Of course there are certain limitations to my study. The data I used came from complete, published tables, and its authenticity is not in doubt. However, there is nothing to say that the 2002-2003 season was a typical one, and my results might have been different for a different year.

Another important point to consider is that the data for different teams is not independent. For example, because Manchester United was top of the Premiership, no other team could possibly be top as well. In fact, even the points totals of the teams are interdependent: a team can only be judged in comparison to the other teams it plays. It is possible that every team played worse in the 2002-2003 season than in previous or subsequent seasons, but it is impossible to tell if this is true as the points totals for each team are relative to those of the other teams. Therefore there can be no stand-alone measure of how good a team is.

It is also important to remember that football is a sport played at many levels, in hundreds of countries and by many age and social groups. The English Football League is only a tiny part of this, and if I conducted my study on different aspects of the game I might obtain very different results.
Appendix: Team numbers

Premiership

| Team | Rank (within division) | Rank (overall) |
|---|---|---|
| Manchester United | 1 | 1 |
| Arsenal | 2 | 2 |
| Newcastle United | 3 | 3 |
| Chelsea | 4 | 4 |
| Liverpool | 5 | 5 |
| Blackburn Rovers | 6 | 6 |
| Everton | 7 | 7 |
| Manchester City | 8 | 8 |
| Southampton | 9 | 9 |
| Tottenham Hotspur | 10 | 10 |
| Middlesbrough | 11 | 11 |
| Charlton Athletic | 12 | 12 |
| Birmingham City | 13 | 13 |
| Fulham | 14 | 14 |
| Leeds United | 15 | 15 |
| Aston Villa | 16 | 16 |
| Bolton Wanderers | 17 | 17 |
| West Ham United | 18 | 18 |
| West Bromwich Albion | 19 | 19 |
| Sunderland | 20 | 20 |

Division 1

| Team | Rank (within division) | Rank (overall) |
|---|---|---|
| Portsmouth | 1 | 21 |
| Leicester | 2 | 22 |
| Sheffield United | 3 | 23 |
| Reading | 4 | 24 |
| Wolverhampton | 5 | 25 |
| Nottingham Forest | 6 | 26 |
| Ipswich | 7 | 27 |
| Norwich | 8 | 28 |
| Millwall | 9 | 29 |
| Wimbledon | 10 | 30 |
| Gillingham | 11 | 31 |
| Preston North End | 12 | 32 |
| Watford | 13 | 33 |
| Crystal Palace | 14 | 34 |
| Rotherham United | 15 | 35 |
| Burnley | 16 | 36 |
| Walsall | 17 | 37 |
| Derby | 18 | 38 |
| Bradford | 19 | 39 |
| Coventry | 20 | 40 |
| Stoke | 21 | 41 |
| Sheffield Wednesday | 22 | 42 |
| Brighton | 23 | 43 |
| Grimsby | 24 | 44 |

Division 2

| Team | Rank (within division) | Rank (overall) |
|---|---|---|
| Wigan Athletic | 1 | 50 |
| Crewe Alexandra | 2 | 45 |
| Bristol City | 3 | 61 |
| QPR | 4 | 64 |
| Oldham Athletic | 5 | 65 |
| Cardiff City | 6 | 49 |
| Tranmere Rovers | 7 | 58 |
| Plymouth Argyle | 8 | 46 |
| Luton Town | 9 | 67 |
| Swindon Town | 10 | 48 |
| Peterborough United | 11 | 56 |
| Colchester United | 12 | 51 |
| Blackpool | 13 | 60 |
| Stockport County | 14 | 52 |
| Notts County | 15 | 63 |
| Brentford | 16 | 57 |
| Port Vale | 17 | 53 |
| Wycombe Wanderers | 18 | 59 |
| Barnsley | 19 | 62 |
| Chesterfield | 20 | 66 |
| Cheltenham Town | 21 | 47 |
| Huddersfield Town | 22 | 54 |
| Mansfield Town | 23 | 68 |
| Northampton Town | 24 | 55 |

Division 3

| Team | Rank (within division) | Rank (overall) |
|---|---|---|
| Rushden & Diamonds | 1 | 69 |
| Hartlepool United | 2 | 70 |
| Wrexham | 3 | 71 |
| AFC Bournemouth | 4 | 72 |
| Scunthorpe United | 5 | 73 |
| Lincoln City | 6 | 74 |
| Bury | 7 | 75 |
| Oxford United | 8 | 76 |
| Torquay United | 9 | 77 |
| York City | 10 | 78 |
| Kidderminster Harriers | 11 | 79 |
| Cambridge United | 12 | 80 |
| Hull City | 13 | 81 |
| Darlington | 14 | 82 |
| Boston United | 15 | 83 |
| Macclesfield Town | 16 | 84 |
| Southend United | 17 | 85 |
| Leyton Orient | 18 | 86 |
| Rochdale | 19 | 87 |
| Bristol Rovers | 20 | 88 |
| Swansea City | 21 | 89 |
| Carlisle United | 22 | 90 |
| Exeter City | 23 | 91 |
| Shrewsbury Town | 24 | 92 |

How to Cite this Page
MLA Citation:
"Football Statistics Project." 123HelpMe.com. 31 Jan 2015 <http://www.123HelpMe.com/view.asp?id=148009>. 
