Stepwise Regression
Stepwise regression is a method of regression modeling in which the choice of predictive variables is carried out by an automatic procedure.
Learning Objective

Evaluate and criticize stepwise regression approaches that automatically choose predictive variables.
Key Points
 Forward selection involves starting with no variables in the model, testing the addition of each variable using a chosen model comparison criterion, adding the variable (if any) that improves the model the most, and repeating this process until none improves the model.
 Backward elimination involves starting with all candidate variables, testing the deletion of each variable using a chosen model comparison criterion, deleting the variable (if any) that improves the model the most by being deleted, and repeating this process until no further improvement is possible.
 Bidirectional elimination is a combination of forward selection and backward elimination, testing at each step for variables to be included or excluded.
 One of the main issues with stepwise regression is that it searches a large space of possible models. Hence it is prone to overfitting the data.
Terms

Bayesian information criterion
a criterion for model selection among a finite set of models that is based, in part, on the likelihood function

Akaike information criterion
a measure of the relative quality of a statistical model, for a given set of data, that deals with the tradeoff between the complexity of the model and the goodness of fit of the model

Bonferroni point
how significant the best spurious variable should be based on chance alone
Full Text
Stepwise regression is a method of regression modeling in which the choice of predictive variables is carried out by an automatic procedure. Usually, this takes the form of a sequence of
Stepwise Regression
This is an example of stepwise regression from engineering, where necessity and sufficiency are usually determined by
Main Approaches
 Forward selection involves starting with no variables in the model, testing the addition of each variable using a chosen model comparison criterion, adding the variable (if any) that improves the model the most, and repeating this process until none improves the model.
 Backward elimination involves starting with all candidate variables, testing the deletion of each variable using a chosen model comparison criterion, deleting the variable (if any) that improves the model the most by being deleted, and repeating this process until no further improvement is possible.
 Bidirectional elimination, a combination of the above, tests at each step for variables to be included or excluded.
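The forward selection loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes ordinary least squares with an intercept, uses the Akaike information criterion (in its Gaussian form, up to an additive constant) as the model comparison criterion, and the function names `aic` and `forward_select` are hypothetical.

```python
import numpy as np

def aic(y, design):
    """AIC of an OLS fit with Gaussian errors, up to an additive constant."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]

def forward_select(X, y):
    """Add one column of X at a time while doing so lowers the AIC."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    selected = []                      # column indices, in the order they were added
    best = aic(y, intercept)           # start with no variables in the model
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        # Score the addition of each remaining variable.
        scores = {
            j: aic(y, np.hstack([intercept, X[:, selected + [j]]]))
            for j in candidates
        }
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:     # no candidate improves the model: stop
            break
        selected.append(j_best)
        best = scores[j_best]
    return selected
```

Backward elimination follows the same pattern in reverse: start from all columns and, at each step, drop the column whose removal lowers the criterion the most.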
Another approach is to use an algorithm that provides an automatic procedure for statistical model selection in cases where there is a large number of potential explanatory variables and no underlying theory on which to base the model selection. This is a variation on forward selection, in which a new variable is added at each stage in the process, and a test is made to check if some variables can be deleted without appreciably increasing the residual sum of squares (RSS).
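One way to picture this add-then-check variation is sketched below. The thresholds `min_gain` and `max_loss` (relative changes in RSS) and the function names are assumptions made for illustration; the text does not specify how "appreciably" should be measured, and a real procedure would more likely use a formal test.

```python
import numpy as np

def rss(y, X, cols):
    """Residual sum of squares for OLS on an intercept plus columns `cols` of X."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ beta) ** 2))

def add_then_prune(X, y, min_gain=0.05, max_loss=0.02):
    """Forward selection with a deletion check after each addition."""
    selected = []
    current = rss(y, X, selected)
    while len(selected) < X.shape[1]:
        rest = [j for j in range(X.shape[1]) if j not in selected]
        # Add the variable that reduces the RSS the most.
        j_add = min(rest, key=lambda j: rss(y, X, selected + [j]))
        new = rss(y, X, selected + [j_add])
        if current - new < min_gain * current:
            break                      # the best addition barely helps: stop
        selected, current = selected + [j_add], new
        # Check whether one earlier variable can now be deleted cheaply.
        older = [j for j in selected if j != j_add]
        if older:
            without = {
                j: rss(y, X, [k for k in selected if k != j]) for j in older
            }
            j_drop = min(without, key=without.get)
            if without[j_drop] - current <= max_loss * current:
                selected.remove(j_drop)
                current = without[j_drop]
    return selected
```

Because `min_gain` exceeds `max_loss` and at most one variable is deleted per round, the RSS strictly decreases from round to round, so the loop cannot cycle between the same sets of variables.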
Selection Criterion
One of the main issues with stepwise regression is that it searches a large space of possible models. Hence it is prone to overfitting the data. In other words, stepwise regression will often fit much better in-sample than it does on new out-of-sample data. This problem can be mitigated if the criterion for adding (or deleting) a variable is stiff enough. The key line in the sand is at what can be thought of as the Bonferroni point: namely how significant the best spurious variable should be based on chance alone. Unfortunately, this means that many variables which actually carry signal will not be included.
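To get a feel for how stiff such a criterion becomes, the sketch below computes a Bonferroni-style threshold. It rests on assumptions that go beyond the text: that the Bonferroni point can be approximated by dividing an overall significance level `alpha` by the number of candidate variables, and that test statistics are roughly standard normal under the null hypothesis. The function name is hypothetical.

```python
from statistics import NormalDist

def bonferroni_threshold(num_candidates, alpha=0.05):
    """Return (per-test significance level, two-sided critical |z| value).

    Assumes each candidate variable is tested once and that its test
    statistic is approximately standard normal when it carries no signal.
    """
    per_test_alpha = alpha / num_candidates
    critical_z = NormalDist().inv_cdf(1 - per_test_alpha / 2)
    return per_test_alpha, critical_z

# The bar rises as the pool of candidate variables grows.
for p in (1, 10, 100, 1000):
    level, z = bonferroni_threshold(p)
    print(f"{p:5d} candidates: per-test alpha = {level:.6f}, |z| > {z:.2f}")
```

With a single candidate the threshold is the familiar two-sided cutoff near 1.96. With many candidates it is noticeably higher, which is why variables that carry weak but real signal can fail to clear it.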
Model Accuracy
A way to test for errors in models created by stepwise regression is to not rely on the model's
Criticism
Stepwise regression procedures are used in data mining, but are controversial. Several points of criticism have been made:
 The tests themselves are biased, since they are based on the same data.
 When estimating the degrees of freedom, the number of independent variables in the selected best fit may be smaller than the total number of candidate variables considered, causing the fit to appear better than it is when adjusting the $r^2$ value for the number of degrees of freedom. It is important to consider how many degrees of freedom have been used in the entire model, not just count the number of independent variables in the resulting fit.
 Models that are created may be too small compared with the real models in the data.
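The degrees-of-freedom point can be illustrated with the usual adjusted $r^2$ formula, $1 - (1 - r^2)(n - 1)/(n - k - 1)$, where $n$ is the number of observations and $k$ the number of predictors charged to the model. The numbers below are made up for illustration, and charging the model for every candidate variable is a crude, conservative stand-in for the degrees of freedom actually used by the search, not an exact count.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted r^2 for a linear model with an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical example: 50 observations, 20 candidate variables,
# 3 of which survive the stepwise search, with an in-sample r^2 of 0.30.
r2, n_obs = 0.30, 50
naive = adjusted_r2(r2, n_obs, n_predictors=3)      # counts only the final variables
cautious = adjusted_r2(r2, n_obs, n_predictors=20)  # counts every candidate examined
print(f"adjusted r^2 counting final variables only: {naive:.3f}")
print(f"adjusted r^2 counting all candidates:       {cautious:.3f}")
```

The gap between the two values shows how much of the apparent fit can depend on how the degrees of freedom are counted.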