Linear Regression
Summary
Linear regression models are a type of supervised learning algorithm that predict a continuous target variable by fitting a linear function to the input features. They are commonly used for predicting numeric outcomes from one or more input variables. Advantages of linear regression include simplicity, interpretability, and fast computation time. However, disadvantages include the assumption of a linear relationship between variables, which may not always hold, and sensitivity to both outliers and collinear features, which can affect the model's performance and reported accuracy.
Note: If you are not sure that your data meet these assumptions, you may want to first do some exploratory visualization, for example using scatter plots to examine pairwise relationships between variables. Alternatively, HeavyML makes it simple to first try, or compare against, a regression model with fewer assumptions, such as Random Forests.
Example
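The following is a minimal sketch of creating a linear regression model, assuming a hypothetical flights table with an arrdelay target column and depdelay and distance predictor columns; here the target (arrdelay) is projected first, followed by the predictors, and the general EVAL_FRACTION option holds out a portion of the data for evaluation:

CREATE MODEL flight_delay_lr OF TYPE LINEAR_REG AS
SELECT arrdelay, depdelay, distance
FROM flights
WITH (EVAL_FRACTION = 0.2);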
Linear Regression Options
Aside from the general options listed above, the linear regression model type accepts no additional options.
Model Evaluation
As with all other regression model types, the model's R2 score can be obtained via the EVALUATE MODEL command. If the model was created with a specified EVAL_FRACTION, the R2 score on that test holdout set can be obtained via the following:
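For example, using the hypothetical flight_delay_lr model sketched above, which was created with EVAL_FRACTION = 0.2:

EVALUATE MODEL flight_delay_lr;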
If no EVAL_FRACTION was specified in the CREATE MODEL command, or if EVAL_FRACTION was specified but you wish to evaluate the model on a different dataset, you can specify the evaluation query explicitly as follows:
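For example, again using the hypothetical flight_delay_lr model, this time scored against a hypothetical flights_2024 table with the same columns (target projected first):

EVALUATE MODEL flight_delay_lr ON
SELECT arrdelay, depdelay, distance
FROM flights_2024;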
The relatively low R2 scores obtained for the linear regression model are not atypical for complex multivariate relationships. As noted above, in such cases it is likely worth trying a random forest or Gradient-Boosted Tree (GBT) regression model, as the accuracy of these models can be dramatically higher in many cases (in the example above, a simple random forest model achieved an R2 score above 0.87).
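For comparison, a sketch of creating and evaluating such a model on the same hypothetical flights data, assuming the RANDOM_FOREST_REG model type accepts the same general options:

CREATE MODEL flight_delay_rf OF TYPE RANDOM_FOREST_REG AS
SELECT arrdelay, depdelay, distance
FROM flights
WITH (EVAL_FRACTION = 0.2);

EVALUATE MODEL flight_delay_rf;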
Model Prediction/Inference
Once a linear regression model is created, it can, like all other regression model types, be used for prediction via the row-wise ML_PREDICT operator, which takes the model name (in quotes) and a list of independent predictor variables semantically matching the ordered list of variables the model was trained on.
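For example, a sketch of a prediction query against the hypothetical flight_delay_lr model and flights table used above, with the predictor columns listed in the same order as in the training query:

SELECT
  ML_PREDICT('flight_delay_lr', depdelay, distance) AS predicted_arrdelay,
  arrdelay
FROM flights
LIMIT 10;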
Related Methods
A list of predictors for a trained linear regression model, along with their associated coefficients, can be obtained by executing the linear_reg_coefs table function, as shown in the following example:
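For example, assuming the hypothetical flight_delay_lr model from above, with the model name passed as a single quoted argument:

SELECT * FROM TABLE(linear_reg_coefs('flight_delay_lr'));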