Linear Regression

Summary

Linear regression is a supervised learning algorithm that predicts a continuous target variable as a linear function of the input features (a straight line in the single-feature case). It is commonly used to predict numeric outcomes from one or more input variables. Its advantages are simplicity, interpretability, and fast computation. Its disadvantages are the assumption of a linear relationship between the features and the target, which may not hold, and sensitivity to outliers and collinear features, both of which can degrade the model's performance and reported accuracy.

Note: If you are not sure that your data meets these requirements, consider first exploring it with scatter plots to visualize pairwise relationships, and applying transformations where needed. Alternatively, HeavyML makes it simple to first try, or compare against, a regression model with fewer assumptions, such as Random Forests.

Example

CREATE OR REPLACE MODEL florida_parcels_sale_prc_lr OF TYPE LINEAR_REG AS
SELECT
  SALEPRC1,
  PARUSEDESC,
  CNTYNAME,
  ACRES,
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  florida_parcels_2020 WITH (CAT_TOP_K=20, EVAL_FRACTION=0.2);

Linear Regression Options

With the exception of the general options listed above, the linear regression model type accepts no options.

Model Evaluation

As with all other regression model types, the model's R2 score can be obtained via the EVALUATE MODEL command. If the model was created with a specified EVAL_FRACTION, the R2 score is computed on that held-out test set:

EVALUATE MODEL florida_parcels_sale_prc_lr;

r2
0.08852338436213192

If no EVAL_FRACTION was specified in the CREATE MODEL command, or if EVAL_FRACTION was specified but you wish to evaluate the model on a different dataset, you can specify the evaluation query explicitly as follows:

EVALUATE MODEL florida_parcels_sale_prc_lr ON
SELECT
  SALEPRC1,
  PARUSEDESC,
  CNTYNAME,
  ACRES,
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  florida_parcels_2020
WHERE
  PARUSEDESC = 'CONDOMINIUMS';

r2
0.03106805209354901
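
For reference, R2 is conventionally defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the target from its mean. The following is a minimal sketch that recomputes this standard quantity by hand for the condominium subset, using the ML_PREDICT operator described under Model Prediction/Inference below. It assumes that SALEPRC1 can be cast to DOUBLE and that rows without a prediction should be skipped. EVALUATE MODEL may treat such rows differently, so the result is not guaranteed to match the value above exactly.

```sql
-- Sketch: recompute R2 = 1 - SS_res / SS_tot by hand.
-- SS_tot uses the identity SUM((y - mean)^2) = SUM(y^2) - SUM(y)^2 / n,
-- which avoids a separate pass to compute the mean.
SELECT
  1.0 - SUM((actual - predicted) * (actual - predicted))
        / (SUM(actual * actual) - SUM(actual) * SUM(actual) / COUNT(*)) AS r2_by_hand
FROM
  (
    SELECT
      CAST(SALEPRC1 AS DOUBLE) AS actual,
      ML_PREDICT(
        'florida_parcels_sale_prc_lr',
        PARUSEDESC,
        CNTYNAME,
        ACRES,
        TOTLVGAREA,
        EFFYRBLT,
        SALEYR1
      ) AS predicted
    FROM
      florida_parcels_2020
    WHERE
      PARUSEDESC = 'CONDOMINIUMS'
  ) AS scored
WHERE
  -- Assumption: skip rows with a missing target or prediction.
  actual IS NOT NULL
  AND predicted IS NOT NULL;
```

Because R2 compares the model against a baseline that always predicts the mean, it can be negative on data where the model does worse than that baseline.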

The relatively low R2 scores obtained for the linear regression model are not unusual for complex multivariate relationships. As noted above, in such cases it is likely worth trying a random forest or gradient-boosted tree (GBT) regression model, as these models can be dramatically more accurate (on the example above, a simple random forest model achieved an R2 score above 0.87).
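
A minimal sketch of such a comparison follows. It assumes the random forest model type is named RANDOM_FOREST_REG and that it accepts the same general options used in the linear regression example. The model name florida_parcels_sale_prc_rf is hypothetical. Check the Random Forest Regression page for the exact type name and available options. No output is shown here because the score depends on the options chosen.

```sql
-- Assumption: RANDOM_FOREST_REG is the random forest model type name.
-- florida_parcels_sale_prc_rf is a hypothetical model name.
CREATE OR REPLACE MODEL florida_parcels_sale_prc_rf OF TYPE RANDOM_FOREST_REG AS
SELECT
  SALEPRC1,
  PARUSEDESC,
  CNTYNAME,
  ACRES,
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  florida_parcels_2020 WITH (CAT_TOP_K=20, EVAL_FRACTION=0.2);

-- Score the new model on its 20% holdout set, as was done for the linear model.
EVALUATE MODEL florida_parcels_sale_prc_rf;
```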

Model Prediction/Inference

Once a linear regression model is created, it can, like all other regression model types, be used for prediction via the row-wise ML_PREDICT operator. ML_PREDICT takes the model name (in quotes) followed by a list of independent predictor variables that semantically match, in order, the variables the model was trained on. The target column itself is not passed.

SELECT
  SALEPRC1 AS actual_sales_price,
  ML_PREDICT(
    'florida_parcels_sale_prc_lr',
    PARUSEDESC,
    CNTYNAME,
    ACRES,
    TOTLVGAREA,
    EFFYRBLT,
    SALEYR1
  ) AS predicted_sales_price
FROM
  florida_parcels_2020
WHERE
  SALEPRC1 BETWEEN 100000 AND 500000
LIMIT
  10;

actual_sales_price|predicted_sales_price
211000|-30912.60198199749
152400|6559.390672445297
164000|35608.10665637255
153900|56984.85121244192
143500|52565.25603222847
144000|64931.58916777372
140000|79256.96579062939
160000|90230.21915191412
162000|56753.09885531664
107000|80915.15436685085
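
The predictions above differ substantially from the actual prices, which is consistent with the low R2 scores. One way to inspect this directly is to compute the per-row residual in the same query. The sketch below uses only the ML_PREDICT call shown above. The column alias residual is chosen here for illustration.

```sql
SELECT
  SALEPRC1 AS actual_sales_price,
  ML_PREDICT(
    'florida_parcels_sale_prc_lr',
    PARUSEDESC,
    CNTYNAME,
    ACRES,
    TOTLVGAREA,
    EFFYRBLT,
    SALEYR1
  ) AS predicted_sales_price,
  -- Residual: actual minus predicted. Positive means the model under-predicted.
  SALEPRC1 - ML_PREDICT(
    'florida_parcels_sale_prc_lr',
    PARUSEDESC,
    CNTYNAME,
    ACRES,
    TOTLVGAREA,
    EFFYRBLT,
    SALEYR1
  ) AS residual
FROM
  florida_parcels_2020
WHERE
  SALEPRC1 BETWEEN 100000 AND 500000
LIMIT
  10;
```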

Related Methods

A list of predictors for a trained linear regression model, along with their associated coefficients, can be obtained by executing the linear_reg_coefs table function, as shown in the following example:

SELECT * FROM TABLE(linear_reg_coefs(model_name=>'florida_parcels_sale_prc_lr'));

coef_idx|feature|sub_coef_idx|sub_feature|coef
0|intercept|1|NULL|313541950.3062068
1|PARUSEDESC|1|SINGLE FAMILY|-812725.0483721431
1|PARUSEDESC|2|CONDOMINIUMS|145061.7512208006
1|PARUSEDESC|3|VACANT RESIDENTIAL|-696124.6904133513
1|PARUSEDESC|4|MOBILE HOMES|-793766.3806761563
1|PARUSEDESC|5|MULTI-FAMILY - FEWER THAN 10 UNITS|-680141.6707517173
1|PARUSEDESC|6|RESIDENTIAL COMMON ELEMENTS / AREAS|-200166.7979917362
2|CNTYNAME|1|MIAMI-DADE|-14247.32188426969
2|CNTYNAME|2|BROWARD|11489.81580546272
2|CNTYNAME|3|PALM BEACH|4753.91666378783
2|CNTYNAME|4|LEE|-187796.8228951865
2|CNTYNAME|5|HILLSBOROUGH|1906136.710600209
2|CNTYNAME|6|ORANGE|14400.40299565677
2|CNTYNAME|7|PINELLAS|340509.3627915879
2|CNTYNAME|8|DUVAL|-6297.904384635765
2|CNTYNAME|9|POLK|51934.57494268686
2|CNTYNAME|10|BREVARD|-121257.6655694735
2|CNTYNAME|11|VOLUSIA|-14876.60197626388
2|CNTYNAME|12|SARASOTA|-80945.57973416231
2|CNTYNAME|13|COLLIER|-130752.9965216073
2|CNTYNAME|14|PASCO|27699.23056136876
2|CNTYNAME|15|MARION|-60143.98878492894
2|CNTYNAME|16|CHARLOTTE|-71174.58469642041
2|CNTYNAME|17|MANATEE|38203.76868561743
2|CNTYNAME|18|LAKE|45848.48937542497
2|CNTYNAME|19|SEMINOLE|64744.92416669091
2|CNTYNAME|20|OSCEOLA|228911.8974889433
3|ACRES|1|NULL|-7317.375525209893
4|TOTLVGAREA|1|NULL|101.4888021198855
5|EFFYRBLT|1|NULL|3995.56463628857
6|SALEYR1|1|NULL|-158867.0479846989
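
These coefficients can be read as the terms of the usual linear regression equation. As a sketch, assume categorical features are one-hot encoded, so that at most one sub_feature coefficient per categorical feature applies to a given row, and none applies if the row's category is not among the listed values. Under that assumption a prediction takes the form below. The symbols are notation introduced here for illustration: c_1 and c_2 are the row's PARUSEDESC and CNTYNAME values, and each x is the row's value for the named numeric column.

```latex
\hat{y} = \beta_{\text{intercept}}
        + \beta_{\text{PARUSEDESC}}[c_1]
        + \beta_{\text{CNTYNAME}}[c_2]
        + \beta_{\text{ACRES}} \, x_{\text{ACRES}}
        + \beta_{\text{TOTLVGAREA}} \, x_{\text{TOTLVGAREA}}
        + \beta_{\text{EFFYRBLT}} \, x_{\text{EFFYRBLT}}
        + \beta_{\text{SALEYR1}} \, x_{\text{SALEYR1}}
```

For example, with the coefficients listed above, each additional unit of TOTLVGAREA adds about 101.49 to the predicted sale price when the other features are held fixed.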
