Date: Sat, 1 Nov 1997 06:47:56 -0600
From: "Mark Hopkins"
Subject: Rating Systems Comparison

In the following, the ratings were taken as of the week following October 26, and the games counted are those played up to that date.

(0) Who Has the Best Rating System?

The focus of this article is to find the rating system best suited to provide the basis for comparing teams on past performance, in a way analogous to how we do it today in other sports and sports leagues -- the Winning Percentage (Pct.). The reasons this goal might be desirable are to:

   (a) provide the basis for a single College Football Standings,
   (b) rank teams to determine champions (even of conferences), or
   (c) rank teams to seed playoffs.

For situations like college football, where schedules are irregular or unbalanced, the usual method of using Pct. is not fair, because it does not account for the strength of the opposition you've faced. A rating method should provide some kind of modification or extension of Pct. to irregular schedules, and ideally should even coincide with Pct. when the schedule is round-robin. Even if not, then at the very least teams should be ranked in the same order as that given by their Pct. when the rating method is applied to round-robin schedules.

I believe, in fact, that the lack of anything analogous to Pct. is the deficit that the press polls, in particular, are trying to make up for. One can see this in their marked tendency to use Pct. where possible, such as putting all the major undefeated and untied teams at or near the top. Another example is last year, when Brigham Young was rated close to the top because of their 14-1 record, while very few automated rating systems had them anywhere near the top.

So what the search for a rating system has apparently focused on, all these years, was really nothing more than a search for a method that modifies and extends the use of Pct. to unbalanced schedules.
What this ideal rules out are those systems which are primarily concerned with trying to predict future outcomes. Such a system will tend to weight more recently played games more heavily, since a team carries with it a momentum going from one game to the next. A distinction is therefore drawn between those systems which predict the future ("Predictors") and those which 'predict the past' ("Retrodictors").

Another ideal, especially important if the system is to be suitable for mass consumption and mass publication, is that it be unencumbered by details such as parameters, extra conditions and the like, which would only serve to render it less accessible and therefore, in the eyes of those using it, less trustworthy! Ideally, the system should be so accessible that even if you can't use it to compute the ratings of teams by hand, you should still be able to use it to VERIFY (even by hand) that the ratings published are indeed the correct ones.

The problem with having parameters in a system intended to provide the basis of a Standings (especially when targeted for mass consumption) is that the ranking of teams can depend on the actual values assigned to the parameters. So then, who gets to decide what values will be used? How are they to be determined, other than by using the very historical or empirical data that we're ideally keeping the system free of? Doing this confuses the goals that go into the design of Predictors, where you do a lot of parameter-twiddling to optimize performance, with those of Retrodictors, whose design goal has as little to do with statistical modelling as does the use of Pct.

The following rating systems and polls are looked at. These are listed at the Ratings Comparison Web Page, which is maintained by Massey.

   AP   Associated Press      FAC  FACT System            PAC  Packard
   ARG  ARGH Power Ratings    FLY  Flyman's Performance   PIG  Pigskin Index
   BAS  Bassett Model         GIN  Gindin Rankings        SAG  Jeff Sagarin
   BIH  Bihl System           GUP  Bapi Gupta's Ratings   SAU  Sauceda-Elo
   BIL  Billingsley           HOW  Howell                 SCH  Schmidt Computer
   CNN  CNN/SI Fans' Top 25   HUB  Hub Index              SD   SportsDoc Ratings
   CON  CONWAY'S SPORTS       IMS  Imes System            SE   Sports Extra
   CSL  CSL_Ratings           KAM  Kambour's Bayes Stat   SEL  Jeff Self's Rankings
   D    System D              LGT  Lightner's Ratings     SPL  Sportsline 112
   DEN  Dendy                 MAS  Massey LS              SPN  Sporting News
   DES  DeSimone              MIC  MicroBrothers          SWT  SW!-TECH
   DEV  Harry DeVold          MJS  MJS Standings          USA  USA Today / ESPN Poll
   E    System E              MOR  Sonny Moore's Ratings  WAJ  WAJL10
   ER   E-Rating (Elecs)      OCO  O'Connor Rankings      WIL  Wilson Performance
   ESP  ESPN Fan Poll         ONO  Dr. Ono's Rankings     YEA  Yearsley Ratings

For the record, Systems D and E are defined respectively as the least squares fit to the following rules:

   D: Winner - Loser = Point Margin
   E: Winner - Loser = 0.5

MJS is equivalent to System E, but is applied to a smaller set of games (basically, just Division I-A, where E is applied to all games). Wilson's system (WIL) is closely related to System E, with an extra clause added to avoid situations where a winner's rating can drop, because the loser was rated more than 0.5 under the winner. Also, the numbers in Wilson's system are scaled differently, though this has no impact on how the teams are ranked, nor even on how closely two teams are ranked in comparison to two other teams.

(1) Retrodiction

A good rating system, if its intent is to provide a comparison which best summarizes the results of the games played to date, should rate teams in such a way that the higher rated team won as many games as possible. This criterion is almost common sense, but it can be misleading if applied too rigorously.
For example, consider the following situation:

   Teams: A, B, C, D, E, F
   Game results:
      A defeated B and C;      B defeated C, D, E and F
      C defeated D, E and F;   D defeated A, E and F
      E defeated A and F;      F defeated A

This is a round-robin, since every team played every other team once. The final standings would look like this:

   Team   W   L
   B      4   1
   C      3   2
   D      3   2
   A      2   3
   E      2   3
   F      1   4

The teams are ranked #1 B, #2 C and D (tied), #4 A and E (tied), #6 F. An ideal rating system should rank the teams in the same order, possibly except for breaking the ties between C and D or A and E. But the ranking which minimizes the number of games lost by the higher ranked team is:

   #1 B, #2 C, #3 D, #4 E, #5 F, #6 A

with the 2 upsets being the games won by A, thus going 13-2 in favor of the higher ranked team. If teams were ranked by won-loss percentage, the higher ranked team would have gone only 10-3 (with 2 games between equally ranked teams).

Be that as it may, the criterion still provides a clear, objective and system-independent way to compare systems, thus providing a kind of anchoring on which to hang the labels "good" and "bad". Without that, if all you had to go on was who looked good RELATIVE to everyone else, you could end up having a situation where everyone looks equally good, because they are all equally bad! Relative comparison between systems, by itself, doesn't provide solid enough ground.

Using the Retrodiction criterion, a direct comparison can be made between the Top 25's, Top 50's and rating systems which rate all 112 Division I-A teams, if only those games are counted which involve at least one team which was rated by everyone. There were 97 games of this type played, involving at least one of the following teams:

   Nebraska, Florida, Florida State, Tennessee, Michigan,
   Washington, Auburn, Ohio State, North Carolina, Kansas State,
   Penn State, UCLA, Georgia, LSU and West Virginia

The won-loss record of Higher Rated vs. Lower Rated team for each rating system is listed below:

     W    L   Pct.   Systems
   ---  ---   ----   -------
    94    3   .969   AP CNN DES FAC SPL SPN USA WIL
    93    4   .959   ARG BIH BIL GUP DEV ESP HOW LGT ONO PIG SAG SD SEL WAJ
    92    5   .948   CON CSL DEN HUB IMS KAM MAS MOR OCO SAU SWT YEA
    91    6   .938   BAS D E FLY GIN MJS PAC SCH SE
    90    7   .928   ER MIC

For the larger set of games involving two Division I-A teams, the list is broken down according to the type of system. There were 405 of these games played in all. Both WIL and LGT have 2 games in which both teams were rated equally, so these are not counted in their respective won-loss records.

     W    L   Pct.   Top 25's
   ---  ---   ----   --------
   146   10   .936   AP CNN
   146   11   .930   USA
   149   13   .920   WAJ
   141   13   .916   ESP
   141   18   .887   CON

     W    L   Pct.   Top 50
   ---  ---   ----   ------
   249   35   .877   DEV

     W    L   Pct.   Top 112's
   ---  ---   ----   ---------
   364   39   .903   WIL
   363   42   .896   GUP
   361   44   .891   BIH FAC
   360   45   .889   SE
   357   48   .881   CSL E MJS OCO SWT
   355   50   .877   FLY SPN
   354   51   .874   DEN HOW HUB SAU
   353   52   .872   IMS ONO SAG
   352   53   .869   ARG
   351   54   .867   PAC
   349   54   .866   LGT
   350   55   .864   BIL SCH SEL
   349   56   .862   SPL
   346   59   .854   MAS
   345   60   .852   PIG YEA
   344   61   .849   KAM MIC SD
   343   62   .847   DES
   341   64   .842   GIN
   337   68   .832   MOR
   335   70   .827   D
   334   71   .825   ER
   333   72   .822   BAS

(2) Correlation to Press Polls (AP and CNN)

For backwards compatibility, a rating system should ideally come close to what's already established. This requirement is strengthened by the fact that, to the extent the press polls rate teams, they're outperforming everyone else on the Retrodiction criterion. But they are incomplete, and therefore of little or no use in the larger task of providing a replacement method for rating all teams on a common scale. They are not systematically defined, so cannot be reproduced by anyone at any time -- even more so since they are the product of the opinions of experts whom you can't just simply kidnap and bring into your living room to generate ratings for you at will. (Well, you could, but that brings up a whole new set of problems...)
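Returning briefly to the Retrodiction criterion of section (1): the 13-2 and 10-3 figures quoted for the six-team round-robin example can be checked mechanically. The following is only a minimal sketch; the function name retrodiction_record and the representation of a ranking as a mapping from team to rank number are assumptions made for this illustration.

```python
def retrodiction_record(ranking, games):
    """Return (won, lost, tied) for the higher-ranked team.

    ranking: dict mapping team -> rank number (1 is best; equal numbers mean a tie).
    games:   list of (winner, loser) pairs.
    """
    won = lost = tied = 0
    for winner, loser in games:
        if ranking[winner] < ranking[loser]:
            won += 1      # the higher-ranked team won
        elif ranking[winner] > ranking[loser]:
            lost += 1     # upset: the lower-ranked team won
        else:
            tied += 1     # equally ranked teams, counted separately
    return won, lost, tied


# The six-team round-robin from section (1), as (winner, loser) pairs.
games = [
    ("A", "B"), ("A", "C"),
    ("B", "C"), ("B", "D"), ("B", "E"), ("B", "F"),
    ("C", "D"), ("C", "E"), ("C", "F"),
    ("D", "A"), ("D", "E"), ("D", "F"),
    ("E", "A"), ("E", "F"),
    ("F", "A"),
]

# Ranking by Pct., with ties sharing a rank number.
by_pct = {"B": 1, "C": 2, "D": 2, "A": 4, "E": 4, "F": 6}
# The ranking quoted in section (1) as minimizing upsets.
fewest_upsets = {"B": 1, "C": 2, "D": 3, "E": 4, "F": 5, "A": 6}

print(retrodiction_record(by_pct, games))         # (10, 3, 2)
print(retrodiction_record(fewest_upsets, games))  # (13, 2, 0)
```

The same function could be applied to any of the published rankings, given a list of game results restricted to the teams each system rates.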
The ideal is a system which is so close to the press polls that it could even lay claim to the title of BEING the seamless extension of the Press Poll, except for the elimination of the human element, its total automation, systematic definition and extension from a woefully inadequate Top 25 to a Top 683. The Top 25 (*) and Top 50 (#) polls are explicitly indicated as such in the lists below:

Correlation to AP:

   *AP  1.000   *USA  .994   *CNN  .994    DES  .955    SPN  .943
    SPL  .939   #DEV  .919    SD   .891    ARG  .886   *CON  .874
    ONO  .860    HOW  .854    SEL  .846    BIL  .822    DEN  .802
   *ESP  .799    FAC  .799    ER   .783    MIC  .781    PAC  .779
    SAG  .778    PIG  .775    KAM  .770    YEA  .770    HUB  .766
    IMS  .763    SWT  .756    CSL  .754    SE   .753    WIL  .739
    E    .738    BIH  .737    GUP  .730    BAS  .727    MJS  .719
    D    .711    SAU  .710    MAS  .693    SCH  .682    FLY  .675
   *WAJ  .665    GIN  .657    OCO  .654    MOR  .633    LGT  .595

Correlation to CNN:

   *CNN 1.000   *AP   .994   *USA  .989    DES  .959    SPL  .937
    SPN  .922   #DEV  .919    SD   .902    ARG  .868   *CON  .864
    ONO  .859    HOW  .836    SEL  .830   *ESP  .820    BIL  .806
    DEN  .794    MIC  .792    FAC  .786    ER   .781    PAC  .763
    PIG  .760    SAG  .760    SWT  .759    HUB  .754    YEA  .754
    IMS  .751    KAM  .751    CSL  .741    SE   .740    WIL  .731
    BIH  .722    GUP  .721    BAS  .720    E    .720    MJS  .703
    D    .697    SAU  .695    MAS  .672   *WAJ  .668    SCH  .663
    FLY  .659    OCO  .639    GIN  .622    MOR  .620    LGT  .588

Correlation to USA:

   *USA 1.000   *AP   .994   *CNN  .989    SPL  .941    SPN  .941
    DES  .935   #DEV  .909    ARG  .889   *CON  .886    SD   .878
    ONO  .855   *ESP  .843    HOW  .839    SEL  .836    BIL  .828
    DEN  .809    MIC  .809    SAG  .800    PAC  .788    ER   .782
    PIG  .780    FAC  .775    KAM  .771    YEA  .757    SAU  .747
    HUB  .737    BAS  .731    IMS  .723    SE   .718    D    .716
    GUP  .715    CSL  .709    SCH  .708    E    .707    SWT  .707
    WIL  .688    BIH  .685    MOR  .683    MJS  .681    MAS  .677
    GIN  .637    OCO  .627    FLY  .620   *WAJ  .578    LGT  .505

(3) Correlation to Consensus

To arrive at a consensus, every two teams are compared to each other to see which one was voted for by the plurality of rating systems as being rated better.
The Consensus Rank is defined by:

   Consensus Rank = 1 + #teams outvoted by (ties count as 1/2)

A more refined analysis should first separate out the rating systems into separate groups. The larger group, which is tugging the hardest at the consensus, appears to be made up of rating systems which use time-dependent data, mainly for the purpose of making predictions on future game outcomes. The kinds of systems we're interested in are those which are only concerned about summarizing the past, particularly giving equal weight to all games played in the season and no weight to any games played prior to that. But that's something which will be left to future analysis.

First, restricting attention just to the 15 teams rated by everyone and adjusting everyone's ranks to fit the range 1-15, the following comparison can be made between the rating systems:

Correlations to Consensus Ranks, Top 15:

   SAG .964   ARG .957   DEV .954   IMS .954   GUP .950
   HOW .939   HUB .939   DEN .932   MIC .925   SCH .921
   SWT .918   OCO .914   CSL .911   KAM .907   LGT .907
   BIH .900   FLY .900   MAS .893   MOR .893   ONO .893
   BAS .886   CON .886   SAU .871   SEL .861   FAC .850
   ER  .846   YEA .839   PIG .836   SE  .829   E   .821
   PAC .814   BIL .807   MJS .789   WAJ .786   SD  .786
   D   .786   USA .743   AP  .739   CNN .696   DES .689
   ESP .668   SPN .686   SPL .664   WIL .623   GIN .543

Consensus Ranks

    1.0  Nebraska        7-0
    2.0  Florida         6-1
    3.0  Florida St.     7-0
    4.0  Tennessee       5-1
    5.0  Michigan        7-0
    6.0  Washington      6-1
    7.0  Auburn          7-1
    8.0  Ohio St.        7-1
    9.0  North Carolina  7-0
   10.0  Kansas St.      6-1
   11.0  Penn St.        6-0
   12.0  UCLA            6-2
   13.0  Georgia         6-1
   14.0  LSU             5-2
   15.0  West Virginia   6-1

Second, on the entire set of Division I-A teams, this is how the rating systems correlate to the Consensus Rank, which is listed below. The Top 50's and Top 25's systems are listed separately.

Correlations to Consensus Ranks, Division I-A:

   Top 25's
      CON .957   AP  .887   USA .878   CNN .877   ESP .687   WAJ .546

   Top 50
      DEV .896

   Top 112's
      ARG .994   SAG .989   SAU .985   YEA .983   HOW .981
      SEL .981   SCH .979   DEN .977   PAC .976   SPN .976
      SWT .976   FAC .975   GUP .975   ONO .975   MIC .974
      IMS .972   SE  .972   BIH .971   CSL .971   HUB .971
      SPL .970   OCO .968   DES .964   MJS .963   D   .962
      KAM .961   E   .957   BIL .956   GIN .956   PIG .953
      SD  .953   FLY .952   MAS .951   ER  .949   MOR .936
      BAS .935   WIL .935   LGT .927

Consensus Ranks

     1.0  Nebraska         7-0
     2.0  Florida          6-1
     3.0  Florida St.      7-0
     4.0  Tennessee        5-1
     5.0  Michigan         7-0
     6.0  Washington       6-1
     7.0  Auburn           7-1
     8.0  Ohio St.         7-1
     9.0  North Carolina   7-0
    10.0  Kansas St.       6-1
    11.0  Penn St.         6-0
    12.0  Washington St.   7-0
    13.0  UCLA             6-2
    14.0  Georgia          6-1
    15.0  LSU              5-2
    16.0  Arizona St.      5-2
    17.0  Iowa             4-2
    18.0  Purdue           6-1
    19.0  Michigan St.     5-2
    20.0  Toledo           7-0
    21.5  Oklahoma St.     6-1
    23.0  West Virginia    6-1
    23.0  Colorado         4-3
    24.0  Colorado St.     6-2
    25.5  South Carolina   5-3
    26.0  Alabama          4-3
    26.5  Ohio             6-1
    28.5  Texas A&M        4-2
    29.0  Mississippi      4-3
    30.0  Southern Miss    5-2
    30.5  Brigham Young    5-2
    30.5  Syracuse         5-3
    33.0  Rice             5-2
    35.5  Georgia Tech     4-2
    35.5  Mississippi St.  5-2
    36.0  Stanford         4-3
    37.0  USC              4-3
    37.0  Virginia Tech    5-2
    38.0  Miami OH         6-2
    40.5  Air Force        6-2
    41.0  Virginia         3-3
    42.5  Kentucky         4-4
    42.5  Wisconsin        7-2
    43.5  Oregon           4-4
    46.0  Missouri         5-3
    46.0  Marshall         5-2
    46.0  Arizona          3-5
    49.0  Notre Dame       3-5
    49.0  Clemson          3-3
    50.0  Cincinnati       6-2
    50.0  Wyoming          4-4
    52.0  New Mexico       5-2
    53.0  Texas Tech       4-3
    54.0  Tulane           4-3
    56.0  Wake Forest      4-4
    56.0  N.C. State       3-4
    57.0  Louisiana Tech   6-2
    58.5  Central Florida  2-5
    59.0  Oregon St.       3-4
    59.5  Northwestern     3-6
    61.0  Arkansas         3-4
    61.5  SMU              3-4
    64.0  Miami FL         3-4
    64.0  W. Michigan      5-3
    65.0  Texas            3-4
    65.0  Utah St.         3-4
    66.5  California       2-5
    68.0  Nevada           3-5
    69.5  Utah             4-4
    71.0  Kansas           4-4
    71.5  Fresno St.       3-4
    72.0  Navy             2-3
    72.0  San Diego St.    3-5
    73.0  Pittsburgh       4-3
    75.5  Oklahoma         3-5
    76.0  Houston          3-4
    77.5  Vanderbilt       3-5
    78.0  Bowling Green    3-6
    78.5  East Carolina    2-5
    79.5  Minnesota        2-6
    81.0  Maryland         2-6
    82.0  E. Michigan      3-5
    83.0  Duke             2-6
    84.5  Memphis          2-5
    86.0  UNLV             2-5
    86.0  Indiana          1-7
    86.5  Boston College   2-6
    87.0  San Jose St.     2-5
    89.0  Hawaii           2-5
    90.0  Temple           2-6
    91.0  Baylor           1-6
    92.0  Idaho            1-4
    94.0  NE Louisiana     1-6
    94.0  Alabama-Birm.    1-4
    95.5  Kent             2-5
    95.5  Army             1-4
    96.0  Ball St.         2-6
    98.0  Iowa St.         1-6
    99.0  Boise St.        2-4
   100.0  Akron            1-7
   101.0  Texas-El Paso    2-5
   102.0  Louisville       1-7
   103.0  Tulsa            1-6
   104.0  Illinois         0-7
   105.0  North Texas      1-6
   106.0  C. Michigan      2-7
   107.0  Texas Christian  0-7
   108.0  New Mexico St.   1-5
   109.5  Rutgers          0-8
   109.5  SW Louisiana     1-6
   111.0  N. Illinois      0-8
   112.0  Arkansas St.     0-6


Date: Fri, 14 Nov 1997 12:02:14 -0600
From: "Mark Hopkins"
Subject: High correlations in large cluster of rating syste

This letter was originally sent to Rothman, who has raised the issue of the persistent high correlation of the ARGH rating system to the Consensus.

---------------------------------

I don't think you realise the extent to which the high correlations are present. This is a list of just the cliques of rating systems that all have correlations of 98.0% or more with each other, along with some of their supersets.

As an aside, I suspect that the SportsExtra rating system is one and the same as System E. I already know that MJS = E. Also, Herman Matthews's system is close to the FACT system.

And notice the high correlation of Sagarin and Sauceda-Elo. The latter's algorithm is described fully and is closely related to the rating formula used in Chess. So even though Sagarin's method is not publicised as far as I know, I can with confidence reveal, probably for the first time in public, that Sagarin is using a variant of the Elo formula.

GROUP 1: Mostly systems with the Marshall Effect (BIH, FAC) or that rate wins only (BIH, MJS, SE, E). MAT is unknown. WIL is a peripheral member of this group.
   96.7 [BIH E SE MJS FAC MAT]
   ------------------------------
   97.9 [FAC E MJS MAT SE]
   98.1 [FAC E MJS MAT]
   98.1 [FAC E MJS SE]
   98.3 [BIH E SE MJS]
   98.2 [MAT MJS FAC]
   98.2 [MAT MJS E]
   98.2 [FAC SE MJS]
   98.4 [BIH SE MJS]
   99.3 [SE E MJS]
   98.3 [MAT E]
   98.4 [MJS FAC]
   98.9 [MAT FAC]
   99.5 [SE MJS]
   99.7 [MJS E]

   Connections to other systems:
   98.1 [FAC WIL]
   98.4 [BIH CSL]

GROUP 2: Characteristic unknown. Intersects group 1.

   97.5 [FAC PAC SAG SAU MAT YEA]
   ------------------------------
   98.6 [YEA SAG SAU PAC]
   98.6 [MAT SAG SAU FAC]
   98.0 [FAC PAC SAG SAU]
   98.7 [YEA PAC SAG]
   98.7 [YEA SAU SAG]
   98.8 [SAG MAT SAU]
   98.8 [SAG MAT FAC]
   98.8 [SAG YEA]
   98.9 [MAT FAC]
   98.9 [SAU MAT]
   99.0 [SAG FAC]
   99.6 [SAG SAU]

GROUP 3:

   97.6 [ARG SEL ONO MAR]
   ----------------------
   97.8 [ARG SEL MAR]
   98.5 [SEL MAR]
   98.4 [ARG MAR]
   98.4 [SEL ONO MAR]
   99.0 [SEL ONO]

GROUP 4: Might include group 3 as a subset at 95.9%. I didn't check.

   95.9 [ONO HOW GUP SD]
   ----------------------
   96.9 [ONO HOW SD]
   97.4 [ONO HOW GUP]
   98.0 [HOW GUP]
   98.0 [ONO HOW]
   98.5 [ONO SD]

GROUP 5:

   97.2 [PIG BIL MOR]
   ------------------
   98.0 [PIG BIL]
   98.1 [PIG MOR]

GROUP 6: Systems which use point margins with a Bayes' type rule

   98.4 [KAM MAS]

GROUP 7:

   98.3 [FLY LGT]

Key:

   AP : Associated Press
   ARG: ARGH Power Ratings
   BAS: Bassett Model
   BIH: Bihl System
   BIL: Billingsley
   BUC: the buck system
   CI : College Insider
   CNN: CNN/SI Fans' Top 25
   CON: CONWAY'S SPORTS
   CSL: CSL_Ratings
   DES: DeSimone
   DEV: Harry DeVold
   E  : System E
   ER : E-Rating (Elecs)
   ESP: ESPN Fan Poll
   FAC: FACT System
   FLY: Flyman's Performance
   GIN: Gindin Rankings
   GUP: Bapi Gupta's Ratings
   HNM: Greg Heineman
   HOW: Howell
   HUB: Hub Index
   IMS: Imes System
   KAM: Kambour's Bayes Stat
   LGT: Lightner's Ratings
   MAR: Marsee System
   MAS: Massey LS
   MAT: Herman Matthews
   MIC: MicroBrothers
   MJS: MJS Standings
   MOR: Sonny Moore's Ratings
   OCO: O'Connor Rankings
   ONO: Dr. Ono's Rankings
   PAC: Packard
   PIG: Pigskin Index
   SAG: Jeff Sagarin
   SAU: Sauceda-Elo
   SD : SportsDoc Ratings
   SE : SportsExtra
   SEL: Jeff Self's Rankings
   SPL: Sportsline 112
   SPN: Sporting News
   SWT: SW!-TECH
   USA: USA Today / ESPN Poll
   WAJ: WAJL10
   WIL: Wilson Performance
   YEA: Yearsley Ratings


Date: Sat, 9 Jan 1999 15:43:10 -0600 (CST)
From: Mark William Hopkins

I don't believe statistics has much to say about what kind of rating is most appropriate. The question "What does Good mean?" is not a question of statistical assessment. It is completely empirical, as far as sports ratings go. The empirical question is: "What does Good mean in the mind of a human being looking at the sports results of a given league?"

The only way to answer this is to find out what people actually say is good and what is not, and then try to uncover the underlying laws governing their choices.

The underlying laws seem to be that whoever wins the most games against the best opponents is Good, notwithstanding point margins. The evidence for this is that in sports leagues where round-robin schedules are used, teams are rated based solely on who won the most games, possibly except for tie-breakers.

If anything, the question of what Good means is an issue of psychology or artificial intelligence. Not statistics.
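
As a small check on the round-robin point, here is a sketch of System E (the least squares fit to Winner - Loser = 0.5, as defined in the first article) applied to that article's six-team round-robin example. The function name system_e_ratings, the exact-fraction elimination, and the convention that ratings sum to zero are choices made for this illustration only; the rule itself fixes nothing but rating differences, and the schedule is assumed to be connected.

```python
from fractions import Fraction


def system_e_ratings(games, target=Fraction(1, 2)):
    """Least-squares ratings for the rule Winner - Loser = target.

    games: list of (winner, loser) pairs over a connected schedule.
    Ratings are pinned to sum to zero (an illustrative convention).
    """
    teams = sorted({team for game in games for team in game})
    idx = {team: i for i, team in enumerate(teams)}
    n = len(teams)

    # Normal equations of the least-squares problem, one per team:
    #   (games played) * r_i - (sum of opponents' ratings) = target * (W_i - L_i)
    a = [[Fraction(0)] * n for _ in range(n)]
    b = [Fraction(0)] * n
    for winner, loser in games:
        i, j = idx[winner], idx[loser]
        a[i][i] += 1
        a[j][j] += 1
        a[i][j] -= 1
        a[j][i] -= 1
        b[i] += target
        b[j] -= target

    # The equations are redundant (they sum to 0 = 0), so replace the
    # last one with the constraint that all ratings sum to zero.
    a[n - 1] = [Fraction(1)] * n
    b[n - 1] = Fraction(0)

    # Gauss-Jordan elimination in exact arithmetic.
    for col in range(n):
        pivot = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col and a[r][col] != 0:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
                b[r] -= factor * b[col]

    return {team: b[idx[team]] / a[idx[team]][idx[team]] for team in teams}


# The six-team round-robin example, as (winner, loser) pairs.
games = [
    ("A", "B"), ("A", "C"),
    ("B", "C"), ("B", "D"), ("B", "E"), ("B", "F"),
    ("C", "D"), ("C", "E"), ("C", "F"),
    ("D", "A"), ("D", "E"), ("D", "F"),
    ("E", "A"), ("E", "F"),
    ("F", "A"),
]

ratings = system_e_ratings(games)
for team in sorted(ratings, key=ratings.get, reverse=True):
    print(team, ratings[team])
```

On this schedule each rating comes out as (W - L)/12, so B is first at 1/4, C and D tie at 1/12, A and E tie at -1/12, and F is last at -1/4: the same order, ties included, as the order by Pct.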