Charts And Statistical Analysis
Do large car engines (cubic feet of displacement –cid) consume more gasoline
(miles per gallon –mpg) than small engines? Does the crime rate (crimes per
100,000 persons) increase with population density (inhabitants per square mile)?
Does increase exercise decrease the rate of heart attacks?
If the problem you are trying to analyze is similar to the simple examples we
just gave, then you might consider using the statistical technique we are about
to describe: simple linear regression.
In the examples mentioned, we are trying to determine if the two attributes
(variables) are related: engine size and miles per gallon, crime rate and
population density, exercise and heart attacks. In order to use simple linear
regression several requirements must be met. Let’s start by mentioning two:
- The variables used in the analysis must be expressed in a meaningful
numerical scale (miles per gallon, degrees Fahrenheit, etc.).
- We expect some time precedence of one variable over the other: thus we
anticipate that one attribute has an effect on the other, such as the size of
a house has an effect on the selling price of that property, not vice versa.
Thus, we call one factor the independent variable (square feet) and the other
the dependent variable (price).
So, is there a relationship between the square footage of a house and its
selling price? The answer seems obvious in the case of real estate, but other
relationships are not so simple to establish. As an example, let’s analyze data
collected on a sample of single-family homes in the San Jose, CA area.
Before determining whether or not it is appropriate to use linear regression
to assess the relationship between square footage and selling price, we must ask
ourselves: are the data linear? One way to find out is to use the scatter plot
capabilities of Excel (XY Scatter Chart). Plot the independent variable on the
x-axis (horizontal) and the dependent variable on the y-axis (vertical).
We can immediately see that the data look linear. If you are experienced in
statistics, you will also be able to see that the data have been manipulated to
illustrate a point. Actual real estate data are not nearly so linear!
Once the data have been graphed, you can go to the Chart menu, click on Add
Trendline… and you’ll see the dialog box below.
The first tab (“Type”), above, allows you to choose the type of regression
you’d like to use. If you feel the data set you are trying to model are
logarithmic or polynomial, you may choose that type. Our visual inspection of
the scatter plot has led us to conclude that a linear association seems to exist
between the two variables we are analyzing, so we select “Linear”.
The second tab is Options (see below). On this tab we’ll select “Display
equation on chart” and “Display R-squared value on chart”.
The equation on the chart below describes a linear
relationship. It has the form y=mx+b. The predicted value of the dependant variable (price) is in terms
of a
given value of the independent variable (square footage). The coefficient m, the slope, is
the average increase in price for every one unit increase in square footage. In our case, the price
of the house increases by $165.25 for every square foot. The coefficient b is
called the y-intercept, or the value of when x equals zero. Here, an empty lot
(0 square feet of housing) still costs $360,633!
The trend line describes the values (the predicted prices based on the
equation) given the x values (square footage) in the data set. Because the
observed values (the ones collected from the sample of homes) gather closely
near the trend line, the model is a good fit for the data.
The final value obtained by using Excel’s “Add Trendline” function is R
2,
also known as the coefficient of determination: it measures how much of the
variance in prices (the dependent variable) is accounted for by square footage
(the independent variable). Variance is a term used by statisticians to express
how much variation (spread) there is in the data. So R
2 tells us how well square footage
“explains” the differences in prices among single-family homes in the San Jose
area. Specifically, it tells us that over 96% of the variance in price can be
accounted for by square footage. In other words, all you need to know to figure
out the price of a single-family home in that area is the square footage.
Of course, this is a simplified example used to illustrate a point.
Unfortunately, or fortunately, depending on your point of view, reality, as you
imagine, is far more complex. It is rare that only one independent variable
would “explain” all of the variation in some dependent variable. Surely, there are many other
factors that account for price differences among homes…but that’s for another
discussion…