Help With Excel Statistics: Forecasting

The assignment is to use linear regression for forecasting (predicting) the future value of a measure. Linear regression is a statistical technique that is often used in business, although it is a flawed technique, which we will discuss in class. Before you attempt to do the assignment, read this note. I will attempt to (1) explain what information the regression statistic provides and (2) demonstrate how to do the assignment.

Linear regression will attempt to fit a line equation to data, so it's a "linear model". The technique assumes that your data will fit nicely on a line, but will also inform you of how well your data do or do not fit the linear model. When your historical data do fit the linear model, then we can assume that future values may also. If the linear model fits the data, the line equation can be used to determine future (unknown) values.

The linear equation is  y = mx + b. (If you need to re-familiarize yourself with the line equation, go to Math for Morons Like Us website for clarification.)

In this equation there are 2 variables y and x, and there are 2 constants m and b. A variable is something we can observe and count, but when we measure it we discover it has many values - the measure varies. A constant, well, doesn't vary. It's a constant value that is either known or derived from the relationship of the observed variables. 

To illustrate: In this assignment we are trying to predict the future value of some economic measure, such as GNP. We can measure past values of GNP. And, we will associate GNP with a year to see how over time GNP changes. The  y variable is the unknown value, what we are trying to predict. In our example we will try to predict GNP for some future year, so GNP = y.  In statistics the  y is the outcome or dependent variable (the value of  y depends on other values).

The x represents something else we can measure that seems like its associated with y.  In statistics the x is the independent variable. Often it is something that is is easier to measure than y, so we use x to find y.  I may not know what the GNP in the year 2020 will be, but I do know that the year will be 2020. If, GNP and years (time) are associated, then I ought to be able to predict GNP by letting GNP= y and year = x. Now, we have the function y(GNP) = x(year)

Of course, I know that GNP is not actually equal to a year. We will however assume that GNP and years move in a straight line, that is, each year GNP goes up at a constant rate. This rate is the slope, m in the line equation. Since I know past years' GNP, I can calculate what this constant slope has been, and assume it will continue in the future. And, since I need to associate a precise GNP with a definite year, I need to know where the line crosses the y-axis. The b constant informs us of the y-intercept.( If we did not use the b in the equation, the math would simply assume that the line crossed at (0,0) and has the effect of getting rid of the $billions used in measuring GNP and instead of years, we'd have 0,1,2,3, etc. This plays havoc on predicting GNP for a specific year, so the b is needed. (To see how this works go to Lesley Robinson's website at the Department of Mathematics, University College of the Cariboo.) All of this, then, gets us to the the line equation to predict GNP:  y = mx + b

Here are the data for our example (I found them at the Bureau of Economic Analysis website):

Year = x's 1929 1930 1931 1932 1933 1934 1935
GNP($Bil.) = y's  104.5 92.0  77.1 59.2 56.7 66.3 73.7

Our assignment is to fit this data to a straight line equation: y = mx + b, in which we are trying to predict y (GNP) for the 7th period (1935) for data that has a slope, m, and a constant or y-intercept of b. We will use 6 periods of date, from 1929 to 1934, to predict the 7th period (1935). In statistics this looks like this:

1929 104.5
1930 92.0
1931 77.1
y1935 = 1932  · m  + 59.2  · b
1933 56.7
1934 66.3
(x's)

(y's)

We begin by entering the data for these 6 periods of time in an Excel spreadsheet - if needed open Excel in another window while viewing this web page's instructions. Have your own data ready.

My demonstration of the assignment uses Excel 2000, but you should be able to apply similar principles to most spreadsheet applications. Two methods will be demonstrated. The first method uses Excel to compute each unknown variable. The second method uses the full capability of Excel 2000 to produce more statistics with less work. I recommend that you read the first method to see how the regression statistic works, but use the second method for your paper.

First Method: Solving for Each Unknown 

STEP 1. Type in your data as follows: We will compute an econometric measure (GNP, price of gold, T-Bill rate) for year 7. In my example year 7 = 1935, so we are trying to predict yGNP(1935),  based on y's (past period econometric measures) for x's, time periods (years, months, or other equal period time). Let's put the y's under column A and the x's under column B. (I use phony data.)

  A B C
1 104.5 1929  |
2 92.0 1930  
3 77.1 1931  
4 59.2 1932  
5 56.7 1933  
6 66.3 1934  

STEP 2. Position your cursor in a blank cell. (Maybe C1). This is where Excel will print your statistics.

STEP 3. STATISTICAL MENU: From the top of Excel's menu select the function wizard represented by fx. A past function menu pops down. Click on "Statistical" on the left menu and on the right menu select "SLOPE". We'll first calculate m, the slope.

STEP 4. CALCULATE SLOPE: A menu pop comes down and asks you to enter known x's and known y's. (You may have to move the menu if it is obstructing your voew of the spreadsheet. Just use the mouse to grab and move it out of the way.) Use the pointer to highlight column A, for know y's; and, then click on the white space for known x's and highlight the numbers in column B. You should see the range of cells for y's and x's in the menu. Click OK. The menu disappears and in C1 the slope coefficient appears.  For this data the SLOPE is -8.99429. The slope will indicate the trend. Here the negative slope tells us that the forecast GNP for 1935 will be lower than the GNP for 1934.

  A B C
1 104.5 1929  -8.99429
2 92.0 1930  
3 77.1 1931  
4 59.2 1932  
5 56.7 1933  
6 66.3 1934  

STEP 5. CALCULATE INTERCEPT: I click on the cell C2, to tell Excel where to put the computed b. In the statistical menu I select INTERCEPT, and repeat the steps above. For the GNP data the CONSTANT is  17448.43.

  A B C
1 104.5 1929  -8.99429
2 92.0 1930 17448.43
3 77.1 1931  
4 59.2 1932  
5 56.7 1933  
6 66.3 1934  

STEP 6. MAKE PREDICTION: Using the GNP data, I have everything I need to forecast using a straight line equation.

Since I am try to find out what y is when x= 1935, the formula is y1935= (-8.99429 X 1935) + (17448.43) = 44.48667. (I could verify this by using the FORECAST formula in the wizard menu. Simply use 1935 as the x. )

STEP 7. TEST HOW GOOD THIS "FIT" IS: The R2 will evaluate the fit of data to a straight line. Using the data from 1929 to 1934 only select RSQ from the statistical menu to calculate 0.778596125. Statistically this looks like a fair fit. About 78% of the value of GNP is explained by knowing the year. ...But, whoa! ..I already know that in 1935 the GNP was actually  73.7 ($billion). So, a straight line model is not a good tool for forecasting this data. Why?

Second Method: Using Excel's Data Analysis 

At top menu click "Tools". Then, click "Data Analysis". Then, click "Regression". Insert range for y and for x, and for "Output Range:" enter a cell location below your data.  Check "Line Fit Plots". Click "OK".

(If you do not see "Data Analysis" under the "Tools" menu, click on "Add-ins" in the "Tools" menu and a"Data Analysis" to include this menu item.)

Using the GNP data, Excel produces the following results (I highlighted key statistics):

SUMMARY OUTPUT              
                 
Regression Statistics              
MultipleR 0.882380941              
RSquare          0.778596125              
Adjusted R Square 0.723245156              
Standard Error 10.03210798              
Observations 6              
                 
ANOVA                
  df SS MS F Significance F    
Regression 1 1415.700571 1415.7006 14.06653 0.019938      
Residual 4 402.5727619 100.64319          
Total 5 1818.273333            
                 
  Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 17448.42952 4631.994722 3.7669364 0.019656 4587.924 30308.94 4587.923814 30308.9352
XVariable1 8.994285714 2.398132493 -3.750537 0.019938 -15.6526 -2.33599 -15.6525827 -2.3359887
                 
                 
                 
RESIDUAL OUTPUT              
                 
Observation Predicted Y Residuals            
1 98.45238095 6.047619048            
2 89.45809524 2.541904762            
3 80.46380952 -3.363809524            
4 71.46952381 -12.26952381            
5 62.4752381 -5.775238096            
6 53.48095238 12.81904762            

 

By checking "Line Fit Plots", I also have from Excel this graph of my data.

To obtain the predicted GNP for year 1935, I have to do the math to solve:

y = mx + b

See above for help or, use the forecast statistic in the wizard.

LAST STEP. COMPARE ACTUAL WITH PREDICTED. In the real data the 1935 was 73.7, I know that the straight line predicted 44.5 -- WAY OFF! Judgment, "guessing", might be a better tool. (or another kind of statistical tool) would work better! What would you have had to guess in 1934 to predict an upward turn in GNP? The depression did not end until 1938 with World War II.

To understand why Time is in business not a very good predictor of future measures see my Teaching Note. The simple regression model at the MBA level of study is modified as a Time Series model to try to fix the problems you see in this exercise. Our problem is that we are trying to fit a straight line to data that do not always move linearly (business cycles get in the way).