Help With Excel Statistics: Forecasting
The assignment is to use linear regression for forecasting (predicting) the future value of a measure. Linear regression is a statistical technique that is often used in business, although it is a flawed technique, which we will discuss in class. Before you attempt to do the assignment, read this note. I will attempt to (1) explain what information the regression statistic provides and (2) demonstrate how to do the assignment.
Linear regression will attempt to fit a line equation to data, so it's a "linear model". The technique assumes that your data will fit nicely on a line, but will also inform you of how well your data do or do not fit the linear model. When your historical data do fit the linear model, then we can assume that future values may also. If the linear model fits the data, the line equation can be used to determine future (unknown) values.
The linear equation is y = mx + b. (If you need to re-familiarize yourself with the line equation, go to Math for Morons Like Us website for clarification.)
In this equation there are 2 variables y and x, and there are 2 constants m and b. A variable is something we can observe and count, but when we measure it we discover it has many values - the measure varies. A constant, well, doesn't vary. It's a constant value that is either known or derived from the relationship of the observed variables.
To illustrate: In this assignment we are trying to predict the future value of some economic measure, such as GNP. We can measure past values of GNP. And, we will associate GNP with a year to see how over time GNP changes. The y variable is the unknown value, what we are trying to predict. In our example we will try to predict GNP for some future year, so GNP = y. In statistics the y is the outcome or dependent variable (the value of y depends on other values).
The x represents something else we can measure that seems like its associated with y. In statistics the x is the independent variable. Often it is something that is is easier to measure than y, so we use x to find y. I may not know what the GNP in the year 2020 will be, but I do know that the year will be 2020. If, GNP and years (time) are associated, then I ought to be able to predict GNP by letting GNP= y and year = x. Now, we have the function y(GNP) = x(year)
Of course, I know that GNP is not actually equal to a year. We will however assume that GNP and years move in a straight line, that is, each year GNP goes up at a constant rate. This rate is the slope, m in the line equation. Since I know past years' GNP, I can calculate what this constant slope has been, and assume it will continue in the future. And, since I need to associate a precise GNP with a definite year, I need to know where the line crosses the y-axis. The b constant informs us of the y-intercept.( If we did not use the b in the equation, the math would simply assume that the line crossed at (0,0) and has the effect of getting rid of the $billions used in measuring GNP and instead of years, we'd have 0,1,2,3, etc. This plays havoc on predicting GNP for a specific year, so the b is needed. (To see how this works go to Lesley Robinson's website at the Department of Mathematics, University College of the Cariboo.) All of this, then, gets us to the the line equation to predict GNP: y = mx + b
Here are the data for our example (I found them at the Bureau of Economic Analysis website):
| Year = y's | 1929 | 1930 | 1931 | 1932 | 1933 | 1934 | 1935 |
| GNP($Bil.) = x's | 104.5 | 92.0 | 77.1 | 59.2 | 56.7 | 66.3 | 73.7 |
Our assignment is to fit this data to a straight line equation: y = mx + b, in which we are trying to predict y for the 7th period for data that has a slope, m, and a constant or y-intercept of b. We will use 6 periods of date, from 1929 to 1935, to predict the 7th period (1935). In statistics this looks like this:
| 1929 | 104.5 | ||||
| 1930 | 92.0 | ||||
| 1931 | 77.1 | ||||
| y1935 = | 1932 | · m + | 59.2 | · b | |
| 1933 | 56.7 | ||||
| 1934 | 66.3 | ||||
| (x's) |
(y's) |
We begin by entering the data for these 6 periods of time in an Excel spreadsheet - if needed open Excel in another window while viewing this web page's instructions. Have your own data ready.
My demonstration of the assignment uses Excel 2000, but you should be able to apply similar principles to most spreadsheet applications. Two methods will be demonstrated. The first method uses Excel to compute each unknown variable. The second method uses the full capability of Excel 2000 to produce more statistics with less work. I recommend that you read the first method to see how the regression statistic works, but use the second method for your paper.
First Method: Solving for Each Unknown
STEP 1. Type in your data as follows: We will compute an econometric measure (GNP, price of gold, T-Bill rate) for year 7. In my example year 7 = 1935, so we are trying to predict yGNP(1935), based on y's (past period econometric measures) for x's, time periods (years, months, or other equal period time). Let's put the y's under column A and the x's under column B. (I use phony data.)
| A | B | C | |
| 1 | 104.5 | 1929 | |
| 2 | 92.0 | 1930 | |
| 3 | 77.1 | 1931 | |
| 4 | 59.2 | 1932 | |
| 5 | 56.7 | 1933 | |
| 6 | 66.3 | 1934 |
STEP 2. Position your cursor in a blank cell. (Maybe C1). This is where Excel will print your statistics.
STEP 3. STATISTICAL MENU: From the top of Excel's menu select the function wizard represented by fx. A past function menu pops down. Click on "Statistical" on the left menu and on the right menu select "SLOPE". We'll first calculate m, the slope.
STEP 4. CALCULATE SLOPE: A menu pop comes down and asks you to enter known x's and known y's. (You may have to move the menu if it is obstructing your voew of the spreadsheet. Just use the mourse to grab and move it out of the way.) Use the pointer to highlight column A, for know y's; and, then click on the white space for known x's and highlight the numbers in column B. You should see the range of cells for y's and x's in the menu. Click OK. The menu disappears and in C1 the slope coefficient appears. For this data the SLOPE is -8.99429. The slope will indicate the trend. Here the negative slope tells us that the forecast GNP for 1935 will be lower than the GNP for 1934.
| A | B | C | |
| 1 | 104.5 | 1929 | -8.99429 |
| 2 | 92.0 | 1930 | |
| 3 | 77.1 | 1931 | |
| 4 | 59.2 | 1932 | |
| 5 | 56.7 | 1933 | |
| 6 | 66.3 | 1934 |
STEP 5. CALCULATE INTERCEPT: I click on the cell C2, to tell Excel where to put the computed b. In the statistical menu I select INTERCEPT, and repeat the steps above. For the GNP data the CONSTANT is 17448.43.
| A | B | C | |
| 1 | 104.5 | 1929 | -8.99429 |
| 2 | 92.0 | 1930 | 17448.43 |
| 3 | 77.1 | 1931 | |
| 4 | 59.2 | 1932 | |
| 5 | 56.7 | 1933 | |
| 6 | 66.3 | 1934 |
STEP 6. MAKE PREDICTION: Using the GNP data, I have everything I need to forecast using a straight line equation.
Since I am try to find out what y is when x= 1935, the formula is y1935= (-8.99429 X 1935) + (17448.43) = 44.48667. (I could verify this by using the FORECAST formula in the wizard menu. Simply use 1935 as the x. )
STEP 7. TEST HOW GOOD THIS "FIT" IS: The R2 will evaluate the fit of data to a straight line. Select RSQ from the statistical menu to calculate 0.778596. Statistically this is a pretty good fit. About 78% of the value of GNP is explained by knowing the year. ...But, whoa! ..I already know that in 1935 the GNP was actually 73.7 ($billion). So, a straight line model is not a good tool for forecasting this data. Why?
Second Method: Using Excel's Data Analysis
At top menu click "Tools". Then, click "Data Analysis". Then, click "Regression". Insert range for y and for x, and for "Output Range:" enter a cell location below your data. Check "Line Fit Plots". Click "OK".
Using the GNP data, Excel produces the following results (I highlighted key statistics):
| SUMMARY OUTPUT | ||||||||
| Regression Statistics | ||||||||
| MultipleR | 0.882380941 | |||||||
| RSquare | 0.778596125 | |||||||
| Adjusted R Square | 0.723245156 | |||||||
| Standard Error | 10.03210798 | |||||||
| Observations | 6 | |||||||
| ANOVA | ||||||||
| df | SS | MS | F | Significance F | ||||
| Regression | 1 | 1415.700571 | 1415.7006 | 14.06653 | 0.019938 | |||
| Residual | 4 | 402.5727619 | 100.64319 | |||||
| Total | 5 | 1818.273333 | ||||||
| Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% | |
| Intercept | 17448.42952 | 4631.994722 | 3.7669364 | 0.019656 | 4587.924 | 30308.94 | 4587.923814 | 30308.9352 |
| XVariable1 | 8.994285714 | 2.398132493 | -3.750537 | 0.019938 | -15.6526 | -2.33599 | -15.6525827 | -2.3359887 |
| RESIDUAL OUTPUT | ||||||||
| Observation | Predicted Y | Residuals | ||||||
| 1 | 98.45238095 | 6.047619048 | ||||||
| 2 | 89.45809524 | 2.541904762 | ||||||
| 3 | 80.46380952 | -3.363809524 | ||||||
| 4 | 71.46952381 | -12.26952381 | ||||||
| 5 | 62.4752381 | -5.775238096 | ||||||
| 6 | 53.48095238 | 12.81904762 | ||||||
By
checking "Line Fit Plots", I also have from Excel this graph of my
data.
To obtain the predicted GNP for year 1935, I have to do the math to solve:
y = mx + b
See above for help or, use the forecast statistic in the wizard.
LAST STEP. COMPARE ACTUAL WITH PREDICTED. In the real data the 1935 was 73.7, although the straight line model predicted would probably be 901234. I can see that the straight line predicted 44.5 -- WAY OFF! Judgment, "guessing", might be a better tool. (or another kind of statistical tool) would work better! What would you have had to guess in 1934 to predict an upward turn in GNP? The depression did not end until 1938 with World War II.