In statistics, regression is a technique that can be used to analyze the relationship between predictor variables and a response variable. This tutorial walks through an example of a regression analysis and provides an in-depth explanation of how to read and interpret the output of a regression table.
Suppose we have the following dataset that shows the total number of hours studied, total prep exams taken, and final exam score received for 12 different students. The first section shows several different numbers that measure the fit of the regression model. Here is how to interpret each of the numbers in this section.
Multiple R. This is the correlation coefficient. It measures the strength of the linear relationship between the predictor variables and the response variable. A multiple R of 1 indicates a perfect linear relationship while a multiple R of 0 indicates no linear relationship whatsoever. Multiple R is the square root of R-squared (see below). In this example, the multiple R is 0.
R-squared. This is often written as r² and is also known as the coefficient of determination.
It is the proportion of the variance in the response variable that can be explained by the predictor variable. The value for R-squared can range from 0 to 1.
A value of 0 indicates that the response variable cannot be explained by the predictor variable at all. A value of 1 indicates that the response variable can be perfectly explained without error by the predictor variable.
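The relationship between R-squared and multiple R described above can be checked numerically. The following is a minimal sketch, assuming a single predictor and a small made-up dataset (the `hours` and `score` values are hypothetical, not the tutorial's data); the helper names `fit_line` and `r_squared` are illustrative only.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(y, fitted):
    """Proportion of the variance in y explained by the fitted values."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1 - ss_residual / ss_total

# Hypothetical data, for illustration only.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

intercept, slope = fit_line(hours, score)
fitted = [intercept + slope * h for h in hours]
r2 = r_squared(score, fitted)
multiple_r = math.sqrt(r2)  # multiple R is the square root of R-squared
print(round(r2, 3), round(multiple_r, 3))  # 0.6 0.775
```

With one predictor, multiple R equals the absolute value of the ordinary correlation between the predictor and the response.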
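In this example, the R-squared is 0.
Adjusted R-squared. This is a modified version of R-squared that is adjusted for the number of predictors in the model. It is always lower than the R-squared.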
How to Read and Interpret a Regression Table
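The adjusted R-squared can be useful for comparing the fit of different regression models to one another. In this example, the Adjusted R-squared is 0.
Standard error of the regression. This is the average distance that the observed values fall from the regression line. In this example, the observed values fall an average of 7.
Observations. This is simply the number of observations in our dataset.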
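The adjusted R-squared and the standard error of the regression can both be computed from quantities in the output. This is a minimal sketch using the standard textbook formulas; the input values (an R-squared of 0.70 and a residual sum of squares of 450) are hypothetical and are not taken from the tutorial's table.

```python
import math

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjust R-squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

def standard_error_of_regression(ss_residual, n_obs, n_predictors):
    """Typical distance of the observed values from the fitted values."""
    return math.sqrt(ss_residual / (n_obs - n_predictors - 1))

# Hypothetical inputs: 12 observations and 2 predictors.
adj = adjusted_r_squared(0.70, n_obs=12, n_predictors=2)
se = standard_error_of_regression(450.0, n_obs=12, n_predictors=2)
print(round(adj, 4), round(se, 4))
```

Because the penalty grows with the number of predictors, adding a predictor that explains little can lower the adjusted R-squared even though the plain R-squared never decreases.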
In this example, the total number of observations is 12. The next section shows the degrees of freedom, the sum of squares, mean squares, F statistic, and overall significance of the regression model.
Regression df. This number is equal to the number of regression coefficients minus 1.
Total df. This number is equal to the number of observations minus 1.
Residual df. This number is equal to the total df minus the regression df.
F statistic. In essence, it tests if the regression model as a whole is useful.
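The degrees of freedom, mean squares, and F statistic are linked by simple arithmetic. The sketch below assumes the usual definitions (each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares); the sums of squares are hypothetical values, and `anova_summary` is an illustrative name.

```python
def anova_summary(ss_regression, ss_residual, n_obs, n_coefficients):
    """n_coefficients counts the intercept plus one slope per predictor."""
    df_regression = n_coefficients - 1
    df_total = n_obs - 1
    df_residual = df_total - df_regression
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return {
        "df_regression": df_regression,
        "df_residual": df_residual,
        "df_total": df_total,
        "ms_regression": ms_regression,
        "ms_residual": ms_residual,
        "f_statistic": ms_regression / ms_residual,
    }

# Hypothetical sums of squares; 12 observations, intercept plus 2 slopes.
table = anova_summary(ss_regression=600.0, ss_residual=450.0,
                      n_obs=12, n_coefficients=3)
for name, value in table.items():
    print(name, value)
```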
Generally, if none of the predictor variables in the model are statistically significant, the overall F statistic is also not statistically significant. In this example, the F statistic is … The last value in the table is the p-value associated with the F statistic. To see if the overall regression model is significant, you can compare the p-value to a significance level chosen in advance. If the p-value is less than the significance level, there is sufficient evidence to conclude that the regression model fits the data better than the model with no predictor variables.
This finding is good because it means that the predictor variables in the model actually improve the fit of the model. In this example, the p-value is 0. This indicates that the regression model as a whole is statistically significant. The last section shows the coefficient estimates, the standard error of the estimates, the t-stat, p-values, and confidence intervals for each term in the regression model.
The coefficients give us the numbers necessary to write the estimated regression equation. Each individual coefficient is interpreted as the average increase in the response variable for each one unit increase in a given predictor variable, assuming that all other predictor variables are held constant. For example, for each additional hour studied, the average expected increase in final exam score is 1.
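For a model with these two predictors, the estimated regression equation has the general form sketched below. The symbols b_0, b_1, and b_2 are placeholders for the intercept and the two coefficient estimates reported in the table; no numeric values are assumed here.

```latex
\[
\widehat{\text{exam score}} = b_0 + b_1\,(\text{hours studied}) + b_2\,(\text{prep exams taken})
\]
```

Holding the number of prep exams fixed, increasing hours studied by one unit changes the predicted score by b_1, which is the interpretation given above.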
The intercept is interpreted as the expected average final exam score for a student who studies for zero hours and takes zero prep exams.
Because a linear regression model is not always appropriate for the data, you should assess the appropriateness of the model by defining residuals and examining residual plots.
A residual is the difference between the observed value of the dependent variable and the predicted value. Each data point has one residual. Both the sum and the mean of the residuals are equal to zero. A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a nonlinear model is more appropriate.
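The claim that the residuals sum to zero can be verified directly for a least squares line that includes an intercept. This is a minimal sketch with hypothetical data; `fit_line` is an illustrative helper, not part of any library.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

intercept, slope = fit_line(x, y)
# residual = observed value minus predicted value, one per data point
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
residual_sum = sum(residuals)
residual_mean = residual_sum / len(residuals)
print([round(e, 2) for e in residuals])
print(abs(residual_sum) < 1e-9, abs(residual_mean) < 1e-9)
```

Plotting `residuals` against `x` gives the residual plot described above; the sum is zero only up to floating-point rounding, which is why the check uses a small tolerance.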
And the chart below displays the residual (e) and independent variable (X) as a residual plot. The residual plot shows a fairly random pattern: the first residual is positive, the next two are negative, the fourth is positive, and the last residual is negative. This random pattern indicates that a linear model provides a decent fit to the data.
Below, the residual plots show three typical patterns.
The first plot shows a random pattern, indicating a good fit for a linear model. The other plot patterns are non-random (U-shaped and inverted U), suggesting a better fit for a nonlinear model. In the next lesson, we will work on a problem where the residual plot shows a non-random pattern. And we will show how to "transform" the data to use a linear model with nonlinear data. In the context of regression analysis, which of the following statements are true?
I. When the sum of the residuals is greater than zero, the data set is nonlinear.
II. A random pattern of residuals supports a linear model.
III. A random pattern of residuals supports a nonlinear model.
The correct answer is B. A random pattern of residuals supports a linear model; a non-random pattern supports a nonlinear model. The sum of the residuals is always zero, whether the data set is linear or nonlinear.
[Figure: three residual plots. Panels: random pattern; non-random, U-shaped; non-random, inverted U.]
You have finally defended your proposal, found your participants, and collected your data. You have your rows of shiny, newly collected data all set up in SPSS, and you know you need to run a regression. If you have read our blog on data cleaning and management in SPSS, you are ready to get started! But you cannot just run off and interpret the results of the regression willy-nilly.
First, you need to check the assumptions of normality, linearity, homoscedasticity, and absence of multicollinearity.
We will start with normality. In order to make valid inferences from your regression, the residuals of the regression should follow a normal distribution. The residuals are simply the error terms, or the differences between the observed value of the dependent variable and the predicted value.
If we examine a normal Predicted Probability (P-P) plot, we can determine if the residuals are normally distributed. If they are, they will conform to the diagonal normality line indicated in the plot. We will show what this looks like a little bit later.
Homoscedasticity refers to whether these residuals are equally distributed, or whether they tend to bunch together at some values, and at other values, spread far apart. In the context of t-tests and ANOVAs, you may hear this same concept referred to as equality of variances or homogeneity of variances.
Your data is homoscedastic if it looks somewhat like a shotgun blast of randomly distributed data. The opposite of homoscedasticity is heteroscedasticity, where you might find a cone or fan shape in your data. You check this assumption by plotting the predicted values and residuals on a scatterplot, which we will show you how to do at the end of this blog.
Linearity means that the predictor variables in the regression have a straight-line relationship with the outcome variable. If your residuals are normally distributed and homoscedastic, you do not have to worry about linearity. Multicollinearity refers to when your predictor variables are highly correlated with each other.
This is an issue, as your regression model will not be able to accurately associate variance in your outcome variable with the correct predictor variable, leading to muddled results and incorrect inferences.
Keep in mind that this assumption is only relevant for a multiple linear regression, which has multiple predictor variables. If you are performing a simple linear regression (one predictor), you can skip this assumption. You can check multicollinearity two ways: correlation coefficients and variance inflation factor (VIF) values. To check it using correlation coefficients, simply throw all your predictor variables into a correlation matrix and look for coefficients with large magnitudes.
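Both checks can be illustrated outside SPSS for the simplest case. The sketch below assumes exactly two predictors, where the VIF reduces to 1 / (1 - r²) with r the correlation between them; with more predictors the r² term would come from regressing each predictor on all the others. The data and the names `pearson_r` and `vif_two_predictors` are hypothetical.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def vif_two_predictors(x1, x2):
    """VIF for either predictor in a two-predictor model (special case)."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

# Hypothetical predictor values, for illustration only.
hours = [1, 2, 3, 4, 5]
prep_exams = [2, 1, 4, 3, 5]

r = pearson_r(hours, prep_exams)
vif = vif_two_predictors(hours, prep_exams)
print(round(r, 3), round(vif, 3))  # 0.8 2.778
```

The two checks carry the same information here: a stronger correlation between the predictors always produces a larger VIF.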
If your predictors are multicollinear, they will be strongly correlated. However, an easier way to check is using VIF values, which we will show how to generate below. You want these values to be low. Set up your regression as if you were going to run it by putting your outcome (dependent) variable and predictor (independent) variables in the appropriate boxes.
Click the Statistics button at the top right of your linear regression window. Estimates and model fit should automatically be checked. Now, click on collinearity diagnostics and hit continue.
The next box to click on would be Plots. Also make sure that normal probability plot is checked, and then hit continue. Now you are ready to hit OK! You will get your normal regression output, but you will see a few new tables and columns, as well as two new figures.
First, you will want to scroll all the way down to the normal P-P plot. You will see a diagonal line and a bunch of little circles.
Ideally, your plot will look like the two leftmost figures below. If your data is not normal, the little circles will not follow the normality line, such as in the figure to the right.
This page shows an example regression analysis with footnotes explaining the output. These data (hsb2) were collected on high school students and are scores on various tests, including science, math, reading and social studies (socst).
The variable female is a dichotomous variable coded 1 if the student was female and 0 if male. In the syntax below, the get file command is used to load the data into SPSS. In quotes, you need to specify where the data file is located on your computer. Remember that you need to use the. In the regression command, the statistics subcommand must come before the dependent subcommand. You can shorten dependent to dep.
You list the independent variables after the equals sign on the method subcommand. The statistics subcommand is not needed to run the regression, but on it we can specify options that we would like to have included in the output. Here, we have specified ci, which is short for confidence intervals. These are very useful for interpreting the output, as we will see. There are four tables given in the output.
SPSS has provided some superscripts (a, b, etc.). Please note that SPSS sometimes includes footnotes as part of the output.
We have left those intact and have started ours with the next letter of the alphabet. Model: SPSS allows you to specify multiple models in a single regression command. This tells you the number of the model being reported. Variables Entered: SPSS allows you to enter variables into a regression in blocks, and it allows stepwise regression. Hence, you need to know which variables were entered into the current regression.
The adjusted residuals are the raw residuals (or the difference between the observed counts and expected counts) divided by an estimate of the standard error.
Use adjusted residuals to account for the variation due to the sample size. Minitab estimates the standard deviation of the observed counts using the formula found in Adjusted residuals. You can compare the adjusted residuals in the output table to see which categories have the largest difference between the expected counts and the actual counts relative to sample size. For example, you can see which machine or shift has the largest difference between the expected number of defectives and the actual number of defectives.
The likelihood-ratio chi-square statistic (G²) is based on the ratio of the observed to the expected frequencies. Minitab displays each cell's contribution to the chi-square statistic, which quantifies how much of the total chi-square statistic is attributable to each cell's divergence.
Minitab calculates each cell's contribution to the chi-square statistic as the square of the difference between the observed and expected values for a cell, divided by the expected value for that cell.
The chi-square statistic is the sum of these values for all cells. Use the individual cell contributions to quantify how much of the total chi-square statistic is attributable to each cell's divergence. The degrees of freedom (DF) is the number of independent pieces of information on a statistic. The degrees of freedom for a cross tabulation is (the number of rows - 1) multiplied by (the number of columns - 1).
Minitab uses the degrees of freedom to determine the p-value associated with the test statistic. The observed counts are the actual number of observations in a sample that belong to a category. The expected counts value is the projected frequency that would be expected in a cell, if the variables are independent.
Minitab calculates the expected counts as the product of the row and column totals, divided by the sample size. The p-value is a probability that measures the evidence against the null hypothesis. Lower probabilities provide stronger evidence against the null hypothesis. Use the p-value to determine whether to reject or fail to reject the null hypothesis, which states that the variables are independent.
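The expected counts, the per-cell contributions, the Pearson chi-square statistic, and the degrees of freedom described above can be reproduced by hand. This is a minimal sketch on a hypothetical 2 x 3 table of counts; it follows the formulas as stated in the text and is not Minitab output.

```python
# Hypothetical observed counts: 2 rows by 3 columns.
observed = [
    [10, 20, 30],
    [20, 20, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# expected count = row total * column total / sample size
expected = [[r * c / n for c in col_totals] for r in row_totals]

# contribution = (observed - expected)^2 / expected, one value per cell
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

chi_square = sum(sum(row) for row in contributions)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected[0])           # [15.0, 20.0, 25.0]
print(round(chi_square, 4))  # 5.3333
print(df)                    # 2
```

Cells with an observed count equal to the expected count contribute nothing, so the middle column in this example adds zero to the statistic.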
Minitab does not display the p-value when any expected count is less than 1 because the results can be invalid. The standardized residuals are the raw residuals (or the difference between the observed counts and expected counts) divided by the square root of the expected counts. You can compare the standardized residuals in the output table to see which categories have the largest difference between the expected counts and the actual counts relative to size, and seem to be dependent.
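Standardized and adjusted residuals can be computed from the same ingredients as the expected counts. The sketch below uses a hypothetical 2 x 3 table. The standardized residual follows the definition in the text; the adjusted residual uses a commonly cited standard-error estimate, sqrt(E * (1 - row total / N) * (1 - column total / N)), which is an assumption here because the text does not reproduce Minitab's formula.

```python
import math

# Hypothetical observed counts: 2 rows by 3 columns.
observed = [
    [10, 20, 30],
    [20, 20, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# standardized residual = (observed - expected) / sqrt(expected)
standardized = [
    [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

# adjusted residual: raw residual divided by an estimated standard error
# (assumed formula, see the note above)
adjusted = [
    [
        (observed[i][j] - expected[i][j])
        / math.sqrt(expected[i][j]
                    * (1 - row_totals[i] / n)
                    * (1 - col_totals[j] / n))
        for j in range(len(col_totals))
    ]
    for i in range(len(row_totals))
]

for row in adjusted:
    print([round(v, 3) for v in row])
```

Under this formula the adjusted residual is always at least as large in magnitude as the standardized residual for the same cell, because its denominator is smaller.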
For example, you can assess the standardized residuals in the output table to see the association between machine and shift for producing defects. Use the table percentages to understand how the counts are distributed between the categories.
In these results, the cell count is the first number in each cell. Then the row percentages, column percentages, and total percentages are in order as the next numbers in the cell. You can select one or more of these percentages to display.
Adjusted residuals.
In these results, the cell count is the first number in each cell, the expected count is the second number in each cell, and the adjusted residual is the third number in each cell. The positive adjusted residuals indicate that there were more defective handles than expected, adjusted for sample size.
The negative adjusted residuals indicate that there were fewer defective handles than expected, adjusted for sample size. Chi-square statistic. Minitab performs a Pearson chi-square test and a likelihood-ratio chi-square test. Each chi-square test can be used to determine whether or not the variables are associated (dependent).
In our last lesson, we learned how to first examine the distribution of variables before doing simple and multiple linear regressions with SPSS.
Without verifying that your data has been entered correctly and checking for plausible values, your coefficients may be misleading. In a similar vein, failing to check for assumptions of linear regression can bias your estimated coefficients and standard errors.
This lesson will discuss how to check whether your data meet the assumptions of linear regression. Recall that the regression equation for simple linear regression is y = b0 + b1*x + e, where b0 is the intercept, b1 is the slope, and e is the residual. In the plot, the observations are represented by the circular dots, and the best fit or predicted regression line is represented by the diagonal solid line.
The residual is the vertical distance or deviation from the observation to the predicted regression line. Predicted values are points that fall on the predicted line for a given point on the x-axis. In this particular case we are plotting api00 with enroll. Since each school is one observation, we will have one residual (deviation from the predicted line) per school.
Assumptions in linear regression are based mostly on predicted values and residuals. In particular, we will consider the following assumptions. Additionally, there are issues that can arise during the analysis that, while strictly speaking not assumptions of regression, are nonetheless of great concern to regression analysts.
Many graphical methods and numerical tests have been developed over the years for regression diagnostics and SPSS makes many of these methods easy to access and use. In this lesson, we will explore these methods and show how to verify regression assumptions and detect potential problems using SPSS. Standardized variables (either the predicted values or the residuals) have a mean of zero and standard deviation of one. If they fall above 2 or below -2, they can be considered unusual.
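The rule of thumb above can be sketched as a simple z-score calculation. The residual values below are hypothetical, and the sample standard deviation is used as the scale; SPSS's own standardized residuals may use a slightly different denominator, so treat this as an illustration of the idea rather than a reproduction of SPSS output.

```python
from statistics import mean, stdev

def standardize(values):
    """Rescale values to have mean zero and standard deviation one."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical raw residuals, for illustration only.
residuals = [-1, 0, 1, -1, 0, 1, -2, -1, -2, 5]

z = standardize(residuals)
# indices of observations whose standardized residual is beyond +/- 2
unusual = [i for i, zi in enumerate(z) if abs(zi) > 2]
print([round(zi, 2) for zi in z])
print(unusual)  # [9]
```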
When we do linear regression, we assume that the relationship between the response variable and the predictors is linear. If this assumption is violated, the linear regression will try to fit a straight line to data that do not follow a straight line. The bivariate plot of the predicted value against residuals can help us infer whether the relationships of the predictors to the outcome are linear. We will ignore the regression tables for now since our primary concern is the scatterplot of the standardized residuals with the standardized predicted values.
To do that, double-click the scatterplot itself in the Output window and go to Elements > Fit Line at Total. Your scatterplot of the standardized predicted value with the standardized residual will now have a Loess curve fitted through it.
Note that this does not change our regression analysis; it only updates our scatterplot. From the Loess curve, it appears that the relationship of standardized predicted to residuals is roughly linear around zero.
We can conclude that the relationship between the response variable and predictors is linear since the residuals seem to be randomly scattered around zero. Another assumption of ordinary least squares regression is that the variance of the residuals is homogeneous across levels of the predicted values, also known as homoscedasticity. If the model is well-fitted, there should be no pattern to the residuals plotted against the fitted values. If the variance of the residuals is non-constant then the residual variance is said to be heteroscedastic.
Just as for the assessment of linearity, a commonly used graphical method is to use the residual versus fitted plot (see above). However, what we see is that the residuals are model dependent. You can see from this new residual plot that the trend is centered around zero and that the variance around zero is scattered uniformly and randomly. We conclude that the linearity assumption is satisfied and the homoscedasticity assumption is satisfied if we run the fully specified predictive model.
We will talk more about Model Specification in Section 2. In linear regression, a common misconception is that the outcome has to be normally distributed, but the assumption is actually that the residuals are normally distributed. It is important to meet this assumption for the p-values for the t-tests to be valid. Note that the normality of residuals assessment is model dependent meaning that this can change if we add more predictors.
The plot is shown below.
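The coordinates behind a normal probability (P-P) plot of residuals can be sketched in a few lines. The example assumes hypothetical residuals and the plotting position (rank - 0.5) / n for the observed cumulative probability; statistical packages may use a different plotting-position formula, so the points will not match SPSS exactly.

```python
from statistics import NormalDist, mean, stdev

def pp_plot_points(residuals):
    """Return (observed, expected) cumulative probability pairs."""
    mu, sigma = mean(residuals), stdev(residuals)
    z_sorted = sorted((r - mu) / sigma for r in residuals)
    n = len(z_sorted)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [((i + 0.5) / n, std_normal.cdf(z))
            for i, z in enumerate(z_sorted)]

# Hypothetical residuals, for illustration only.
residuals = [-1.2, -0.7, -0.4, -0.1, 0.0, 0.2, 0.3, 0.5, 0.6, 0.8]

points = pp_plot_points(residuals)
# largest vertical gap between the points and the diagonal line
max_gap = max(abs(observed_p - expected_p)
              for observed_p, expected_p in points)
for observed_p, expected_p in points:
    print(round(observed_p, 2), round(expected_p, 3))
```

If the residuals are close to normal, each pair lies near the diagonal, so `max_gap` stays small; large gaps correspond to circles drifting away from the normality line.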
Note that we are testing the normality of the residuals and not predictors. More commonly seen is the Q-Q plot, which compares the observed quantile with the theoretical quantile of a normal distribution.
A previous article explained how to interpret the results obtained in the correlation test.
Case analysis was demonstrated, which included a dependent variable (crime rate) and independent variables (education, implementation of penalties, confidence in the police, and the promotion of illegal activities).
The aim of that case was to check how the independent variables impact the dependent variable. The test found the presence of correlation, with the most significant independent variables being education and promotion of illegal activities. Now, the next step is to perform a regression test. However, this article does not explain how to perform the regression test, since it is already present here. This article explains how to interpret the results of a linear regression test on SPSS.
Regression is a statistical technique to formulate the model and analyze the relationship between the dependent and independent variables. It aims to check the degree of relationship between two or more variables.
This is done with the help of hypothesis testing. Suppose the hypothesis needs to be tested for determining the impact of the availability of education on the crime rate. Then a null hypothesis (no impact) and an alternative hypothesis (an impact) would be framed for the analysis.
The first table in SPSS for regression results is shown below. It specifies the variables entered or removed from the model based on the method used for variable selection. There is no need to mention or interpret this table anywhere in the analysis. It is generally unimportant since we already know the variables. The second table is the model summary. It provides detail about the characteristics of the model.
In the present case, promotion of illegal activities, crime rate and education were the main variables considered. The model summary table is shown below. Therefore, the model summary table is satisfactory to proceed with the next step.
However, if the values were unsatisfactory, then there is a need for adjusting the data until the desired results are obtained. The ANOVA table is the third table in a regression test in SPSS. It determines whether the model is significant enough to determine the outcome. It is shown below. Because the p-value of the ANOVA table is below the tolerable significance level, there is a possibility of rejecting the null hypothesis in further analysis.
The table below shows the strength of the relationship. This analysis helps in performing the hypothesis testing for a study. Only one value is important in interpretation: Sig. The value should be below the tolerable level of significance for the study. Based on the significance value, the null hypothesis is rejected or not rejected. If Sig. is below that level, the null hypothesis is rejected. If a null hypothesis is rejected, it means there is an impact.