Wednesday, May 31, 2006
Tribute to Shoeless Joe
Today, as a review, we gave tribute to two players featured in the movie Field of Dreams -- Shoeless Joe Jackson and Moonlight Graham.
- We reviewed Jackson's baseball career. He was a great hitter, but his career was cut short by his alleged involvement in the Black Sox scandal. Did Joe really help his team lose games in the 1919 World Series? His exact involvement in the scandal might never be known.
- Since Ty Cobb and Joe Jackson were hitters during the same period, it seemed worthwhile to compare the OBPs of the two players. Although Jackson's OBPs were high, Cobb tended to have a higher OBP -- about 26 points higher.
- We looked at the relationship between Jackson's season doubles totals and his triples totals. There is a positive relationship, and you can make a reasonable prediction about the number of triples he would hit given the number of doubles.
- We also looked at the major league baseball career of Moonlight Graham. He only played a single game and never came to bat.
Statistics and Baseball
Statistics and Baseball CLASS IS GOING WELL. TRYING TO STUDY FOR THE TEST. NOT SURE WHAT THE TEST IS GOING TO LOOK LIKE.
Tuesday, May 30, 2006
Hope for the Royals
Today we looked at the number of wins for American League teams for the 2004 and 2005 seasons.
dotplot, stemplot, histogram, bar chart, 5-number summary, mean, standard deviation, mode, boxplot, rule of thumb for outliers, comparing batches, z-score, scatterplot, correlation, relationship, least-squares line, predicted value, residual, sum of squared residuals, regression effect, 68-95-99.7 rule
- We saw that there was a positive relationship in the scatterplot. Teams that won a lot of games in 2004 tended to win many games in 2005. Likewise, teams that lost a lot in 2004 tended to lose a lot in 2005.
- We can measure the strength of the relationship by means of a correlation r. We talked about how to compute r based on the standardized scores. For these data, r was close to .8 which indicates a strong positive relationship between a team's 2004 win total and its 2005 win total.
- Once we have r and values of the mean and standard deviation for both variables, we can compute the equation of the least-squares line.
- We used this least-squares line to make predictions. Surprisingly, we saw that a bad team that wins 60 games in 2004 is predicted to win 65 games in 2005.
- This observation motivates a discussion of the regression effect. We graphed a team's improvement (Wins_2005 - Wins_2004) against Wins_2004 and saw a negative relationship. This means that a great team in 2004 will tend to get worse in 2005, and likewise a bad team in 2004 will tend to get better in 2005.
- So there is hope for the Kansas City Royals, who are currently the worst team in baseball. Things are bad this year, but I predict they will win more games next season.
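Here is a rough sketch of the calculations in this post: the correlation computed from standardized scores, the least-squares line built from r and the means and standard deviations, and the regression effect. The summary values used (mean 81 wins and standard deviation 12 for both seasons, r = 0.8) are assumptions for illustration, not the actual class data, and the function names are made up for this sketch.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of products of standardized scores, divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    z_products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(z_products) / (len(xs) - 1)

def line_from_summaries(r, mean_x, sd_x, mean_y, sd_y):
    """Least-squares line from r and the two means and standard deviations."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Assumed summary values (illustration only, not the real 2004-2005 AL data).
slope, intercept = line_from_summaries(r=0.8, mean_x=81, sd_x=12, mean_y=81, sd_y=12)

for wins_2004 in (60, 81, 100):
    predicted_2005 = intercept + slope * wins_2004
    improvement = predicted_2005 - wins_2004  # positive for bad teams, negative for great ones
    print(wins_2004, round(predicted_2005, 1), round(improvement, 1))
```

With these assumed values, a 60-win team is predicted to win about 64 games (an improvement), while a 100-win team is predicted to win about 96 (a decline). That pattern is the regression effect.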
Thursday, May 25, 2006
Italian Ballplayers
Today, we used spaghetti to illustrate fitting lines to (x, y) plots. Since spaghetti is Italian, we fit a line to (HR, SLG) data for Mike Piazza (great player of Italian descent) for the years that he played with the Mets.
- We want a fitted line to be close to the points in some sense.
- We define a residual to be the difference between the observed y value and the y value one would predict from the line.
- We measure the goodness of a fit by the sum of squared residuals.
- The best line, the least-squares line, is the one that makes the sum of squared residuals as small as possible.
- Fathom has a neat way of showing graphically the squared residuals. By playing with a movable line, we tried to make the sum of squared residuals small.
- In this trial-and-error approach, we get a pretty good line, but the least-squares line is the best one from the viewpoint of smallest sum of squared residuals.
- We can compare different predictors by looking at the "sum of squares". Generally, we see that OPS is the best predictor, followed by SLG and OBP, and then by AVG.
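The ideas in this post can be sketched in a few lines of code: compute the residuals for a candidate line, add up their squares, and compare a line fit by eye with the least-squares line. The (HR, SLG) values below are made up for illustration and are not Piazza's actual numbers; the "by eye" line is also just a guess.

```python
def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of (observed y - predicted y)^2 for the line y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Slope and intercept of the line with the smallest sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical seasons (illustration only).
hr = [10, 20, 30, 40]
slg = [0.42, 0.47, 0.55, 0.58]

eye_slope, eye_intercept = 0.005, 0.38          # a line fit by trial and error
ls_slope, ls_intercept = least_squares_line(hr, slg)

ssr_eye = sum_squared_residuals(hr, slg, eye_slope, eye_intercept)
ssr_ls = sum_squared_residuals(hr, slg, ls_slope, ls_intercept)
print(ssr_eye, ssr_ls)  # the least-squares value can never be the larger one
```

The same comparison works for different predictors: fit a least-squares line for each one and see which leaves the smallest sum of squared residuals.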
Wednesday, May 24, 2006
American and National Leagues: What's the Difference?
Today I passed out some batting data for the 30 Major League teams for the current season. We were focusing on differences between the two leagues. Here are some observations we made:
- The big difference between the two leagues is the AL's use of the designated hitter who bats for the pitcher.
- The use of the DH does affect one baseball strategy -- the use of the sacrifice bunt to move a runner from 1st base to 2nd base.
- We saw that there was a significant difference in the number of sacrifice bunts between the two leagues. By use of 5-number summaries and parallel boxplots, we saw that NL teams tended to have ??? more sacrifice bunts than AL teams. (Sorry -- I forget what the number was.)
- Some teams like the Blue Jays and the Athletics rarely sacrifice. These two teams believe in the usefulness of sabermetrics, so I wonder if the small number of sacrifice bunts means that these two teams understand that the sacrifice bunt is really an overrated strategy.
- We begin by constructing a scatterplot of two quantitative variables and making a statement about the direction and strength of the association. For the AL teams, we saw there was a strong positive relationship between batting average (AVG) and runs scored per game.
- Once we see a relationship, it is helpful to summarize the relationship by fitting a line. We fit a line by eye to the (AVG, RunsPerGame) data. Using this fitted line, we can predict how many runs per game a team will score given its batting average.
Tuesday, May 23, 2006
The Second Best Pitcher from BGSU Is ...
In today's class, we looked at the great pitchers who have attended BGSU. I think the best pitcher was Orel Hershiser (remember his great streak of pitching 59 consecutive scoreless innings?), but the choice of 2nd best pitcher is less clear. We looked at Grant Jackson from Fostoria, Ohio and Roger McDowell. Here's a summary of what we found out:
- One measure of the quality of a pitcher is the rate at which he strikes out batters. We computed SOrate = SO/BFP x 100 for all seasons for both pitchers.
- We compared Jackson's and McDowell's strikeout rates by use of five-number summaries and parallel boxplots.
- Jackson was the better strikeout pitcher -- he tended to strike out 3% more batters than McDowell.
- Actually, looking closer at Jackson's rates, it seems that his strikeout ability was best during the early part of his career.
- Boston has unusually expensive tickets -- I think it is due to the high payroll and the small ballpark.
- There is a relationship between cost and payroll. If you break the teams into the rich and poor teams by payroll, then generally you will pay more when you attend a home game of a rich club.
- It is interesting that the only item that doesn't seem to vary in cost between rich and poor clubs is a baseball cap.
Monday, May 22, 2006
The Babe, Roger, and Barry

This weekend, Barry Bonds hit his 714th home run, tying Babe Ruth for second place on the career home run list (Hank Aaron leads with 755). Also, Albert Pujols has hit 22 home runs in 44 games and is on pace to hit 81 home runs this season. We again looked at home run hitting. Here are the class highlights:
- Last week, we saw that some statistics like OBP and OPS tend to be bell-shaped. In this case, we can predict the proportion of values in particular intervals if we know the mean and standard deviation. This is called the 68-95-99.7 rule.
- We look at three big home run seasons, 1921 (when Ruth hit 59 home runs), 1961 (when Maris hit 61) and 2001 (when Bonds hit 73). The above graph shows the home run rates of all regular (at least 300 AB) players for those three seasons. By use of standardized scores, we can see that Ruth's 59 home runs was the best accomplishment relative to his peers.
- We watched a segment of the movie 61* that documents Roger Maris' accomplishment of hitting 61 home runs. The wives of Roger and Babe Ruth watched the accomplishment together in the stands -- it doesn't appear that Ruth's wife was very happy about the record being broken.
- This week we start talking about how one compares two or more batches of data. To start, we looked at the batches of home run rates for the 1921, 1961, and 2001 seasons. We can compare the batches graphically by means of parallel boxplots. Also I introduced a rule of thumb to check for outliers. Although this isn't that surprising, we determined that Ruth, Maris and Bonds were outliers with respect to home run rate for those seasons.
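As a small sketch of two tools from this post, the code below computes a standardized score and flags outliers. The post does not say which rule of thumb was used; the sketch assumes the common one that flags any value more than 1.5 times the quartile spread below the lower quartile or above the upper quartile. The home run rates are made up, and `statistics.quantiles` uses one of several quartile conventions, so the quartiles may differ slightly from ones found by hand.

```python
from statistics import mean, stdev, quantiles

def z_score(value, batch):
    """Standardized score: how many standard deviations a value is from the batch mean."""
    return (value - mean(batch)) / stdev(batch)

def outliers(batch, step=1.5):
    """Values beyond step * (QU - QL) from the quartiles (assumed rule of thumb)."""
    ql, _, qu = quantiles(batch, n=4)
    spread = qu - ql
    lo_fence = ql - step * spread
    hi_fence = qu + step * spread
    return [x for x in batch if x < lo_fence or x > hi_fence]

# Hypothetical home run rates (HR per 100 AB) for one season's regulars.
rates = [0.5, 1.0, 1.2, 1.5, 1.8, 2.0, 2.2, 2.5, 3.0, 10.9]

print(outliers(rates))        # the one extreme slugger is flagged
print(z_score(10.9, rates))   # far above the mean in standard deviation units
```

Computing the z-score of each season's leader against that season's own batch is what lets us compare sluggers from different eras on the same scale.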
Thursday, May 18, 2006
Deviations and Baseball Shapes
In the first part of class, we introduced the idea of a deviation.
- When we look at the ages of the Yankee players, we can summarize them by a mean.
- A deviation is the distance of an age from the mean. Randy Johnson's deviation is 7 which means he is seven years older than the mean age.
- To measure spread, we can use a typical deviation size. The standard deviation, or s for short, has a complicated formula, but it essentially tells us what a typical deviation is.
- We have already talked about baseball player shapes -- for example, Babe Ruth had an interesting physique. But here we are talking about shapes of baseball data.
- Most baseball statistics are counts of things like runs, hits, walks, home runs, etc. Distributions of counts tend to be right-skewed. Here's a graph of home runs of all 2005 regulars.

- Other baseball statistics are "derived". That means they are computed by a formula. Examples of derived statistics are batting averages, slugging percentages, OPS, etc. These batches (for groups of regular players) tend to be symmetric. For example, here are the OBPs for all 2005 regulars.

- We'll talk later about a special distribution shape called the normal curve. Normal curves are easy to summarize knowing the mean and standard deviation.
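To make the idea of a "typical deviation" concrete, here is a short sketch that computes the deviations from the mean and then the standard deviation s. It assumes the usual sample formula (divide the sum of squared deviations by n - 1, then take the square root). The ages are made up and are not the actual Yankee roster.

```python
from math import sqrt

def deviations(values):
    """Distance of each value from the mean (positive means above the mean)."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def standard_deviation(values):
    """Sample standard deviation s (assumed n - 1 divisor)."""
    devs = deviations(values)
    return sqrt(sum(d * d for d in devs) / (len(values) - 1))

# Hypothetical player ages (illustration only).
ages = [24, 26, 28, 30, 32, 34, 36]

print(deviations(ages))          # the oldest player is 6 years above the mean of 30
print(standard_deviation(ages))  # a typical deviation size, in years
```

The deviations always add up to zero, which is why the formula squares them before averaging.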
Wednesday, May 17, 2006
Nationality, Age, and Home Run Hitting
Today we looked at the current Cleveland Indians roster to learn about the nationalities and ages of MLB players.
We learned
- We learned about the nationalities of Indians players from a frequency table and a bar chart. Approximately 20% of the Indians are international players from Latin American countries.
- Using a stemplot and 5-number summary, we looked at the ages of the Indians. The Indians' average age is about 30, with ages ranging from the early 20's to the late 30's.
We learned
- the median number of home runs hit per team per game in 2005 was exactly 1. (Since two teams play in each game, that means you would expect to see two home runs in a game that season.)
- the distribution of home runs per game was somewhat right-skewed, and Texas' value may be an outlier
Tuesday, May 16, 2006
Tribute to the Bambino

Today we talked about great home run hitters. We started with Babe Ruth, arguably the greatest hitter in baseball history. Here are some of the class highlights:
- We watched part of a Ken Burns documentary on the life of Babe Ruth. He had a very interesting life, chewing tobacco and drinking as a young kid, living in a "boarding school" until age 19, and then enjoying a colorful life as a big leaguer. Despite his loud and obnoxious behavior, Ruth was a very popular figure and changed the way baseball was played.
- We explored the season slugging percentages for Ruth. A five-number summary (LO, QL, M, QU, HI) was introduced that provides simple measures of center and spread.
- It is interesting to graph Ruth's slugging percentages against year. There is an interesting outlier in the year 1925 -- the low value can be attributed to the "bellyache heard around the world."
- In the computer lab, we were introduced to Fathom. We explored the season slugging percentages for Hank Aaron and Barry Bonds. Both hitters had interesting career trajectories. Aaron was notable for his consistency. Bonds' trajectory, shown here, exhibits an unusual increase in SLG during the last part of his career. (This unusual career trajectory probably has an explanation that has come to light during the 2006 season.)
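Here is a minimal sketch of the five-number summary (LO, QL, M, QU, HI). It assumes the quartiles are found as the medians of the lower and upper halves of the sorted data, leaving out the middle value when the count is odd; other software may use a slightly different convention. The slugging percentages are made up and are not Ruth's actual seasons.

```python
from statistics import median

def five_number_summary(batch):
    """Return (LO, QL, M, QU, HI); needs at least two values."""
    data = sorted(batch)
    half = len(data) // 2
    lower = data[:half]    # values below the middle
    upper = data[-half:]   # values above the middle
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical season slugging percentages (illustration only).
slg = [0.543, 0.657, 0.690, 0.700, 0.732, 0.772, 0.847]

lo, ql, m, qu, hi = five_number_summary(slg)
print(lo, ql, m, qu, hi)
print("spread of the middle half:", round(qu - ql, 3))
```

The median M gives a measure of center, and the distance between the quartiles gives a simple measure of spread.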
Monday, May 15, 2006
Getting started
Today we got started looking at baseball from a statistical point of view. Here are some things we covered.
- There is a difference between statistics and Statistics. The small s statistics are the data we collect; the big s Statistics is the science of learning from data.
- Looking at Bernie Williams' baseball card, we saw examples of categorical variables such as his fielding position and quantitative variables such as his batting average in 2000.
- Statistics always begins with a question, such as "Was Bernie Williams a big home run hitter?"
- In exploring data, we start by making a graph. Three useful graphs we discussed were a dotplot, a stemplot, and a time series plot.
- A graph gives you a picture of a data distribution. We describe a distribution by talking about its shape, a center value, spread, and any unusual characteristics such as outliers.
- We saw from the dotplot that Bernie averaged about 18 home runs a season. Looking at the time series plot, we see that Bernie peaked in home run hitting in the middle of his career and has decreased in recent years.
- Looking at Barry Bonds' walks, we see that Bonds tends to walk a lot, especially in recent years. Pitchers are genuinely afraid of Bonds' home run talent.
Tuesday, May 02, 2006
Introduction to Statistics (Baseball edition)

Welcome to MATH 115 Introduction to Statistics.
We'll be using baseball to learn about data analysis, probability, and learning from data. This should be a fun class. We'll learn about baseball players and teams from the present and past, and at the same time learn how a statistician collects, organizes, summarizes, and learns from data.