Comment on page
Linear Regression
Summary
Linear regression models are a type of supervised learning algorithm that predict a continuous target variable by fitting a straight line to the input features. They are commonly used for predicting numeric outcomes based on one or more input variables. Advantages of linear regression include simplicity, interpretability, and fast computation time. However, disadvantages include the assumption of a linear relationship between variables, which may not always hold, and sensitivity to both outliers and collinear features, which can affect the model's performance and reported accuracy.
Note: If you are not sure that your data meet these requirements, you may want to first do some exploratory visualizations and transformations using scatter plots to visualize pairwise relationships. Altneratively, HeavyML makes it simple with to try first or compare a regression model with fewer assumptions, such as Random Forests.
Example
CREATE OR REPLACE MODEL florida_parcels_sale_prc_lr OF TYPE LINEAR_REG AS
SELECT
SALEPRC1,
PARUSEDESC,
CNTYNAME,
ACRES,
TOTLVGAREA,
EFFYRBLT,
SALEYR1
FROM
florida_parcels_2020 with (CAT_TOP_K=20, EVAL_FRACTION=0.2);
Linear Regression Options
With the exception of the general options listed above, the linear regression model type accepts no options.
Model Evaluation
Like all other regression model types, the model r2 can be obtained via the
EVALUATE MODEL
command. If the model was created with a specified EVAL_FRACTION
, the model r2 score can be obtained on that test holdout set via the following:EVALUATE MODEL florida_parcels_sale_prc_lr;
r2
0.08852338436213192
If no
EVAL_FRACTION
was specified in the CREATE MODEL
command, or if EVAL_FRACTION
was specified but you wish to evaluate the model on a different dataset, you can specify the evaluation query explicitly as follows:EVALUATE MODEL florida_parcels_sale_prc_lr ON
SELECT
SALEPRC1,
PARUSEDESC,
CNTYNAME,
ACRES,
TOTLVGAREA,
EFFYRBLT,
SALEYR1
FROM
florida_parcels_2020
WHERE
PARUSEDESC = 'CONDOMINIUMS';
r2
0.03106805209354901
The relatively low R2 scores obtained for the linear regression model are not atypical for complex multi-variate relations. As noted above, in such cases, it will likely be worth trying a random forest or Gradient-Boosted Tree (GBT) regression model, as the accuracy of these models can be dramatically higher in many cases (in the example above, a simple random forest model achieved an R2 score above 0.87).
Model Prediction/Inference
Once a linear regression model is created, it can, like all other regression model types, be used for prediction via the row-wise
ML_PREDICT
operator, which takes the model name (in quotes) and a list of independent predictor variables semantically matching the ordered list of variables the model was trained on.SELECT
SALEPRC1 as actual_sales_price,
ML_PREDICT(
'florida_parcels_sale_prc_lr',
PARUSEDESC,
CNTYNAME,
ACRES,
TOTLVGAREA,
EFFYRBLT,
SALEYR1
) AS predicted_sales_price
FROM
florida_parcels_2020
WHERE
SALEPRC1 BETWEEN 100000 AND 500000
limit
10;
actual_sales_price|predicted_sales_price
211000|-30912.60198199749
152400|6559.390672445297
164000|35608.10665637255
153900|56984.85121244192
143500|52565.25603222847
144000|64931.58916777372
140000|79256.96579062939
160000|90230.21915191412
162000|56753.09885531664
107000|80915.15436685085
Related Methods
A list of predictors for a trained linear regression model, along with their associated coefficients, can be obtained by executing the
linear_reg_coefs
table function, as shown in the following example;SELECT * FROM TABLE(linear_reg_coefs(model_name=>'florida_parcels_sale_prc_lr'));
coef_idx|feature|sub_coef_idx|sub_feature|coef
0|intercept|1|NULL|313541950.3062068
1|PARUSEDESC|1|SINGLE FAMILY|-812725.0483721431
1|PARUSEDESC|2|CONDOMINIUMS|145061.7512208006
1|PARUSEDESC|3|VACANT RESIDENTIAL|-696124.6904133513
1|PARUSEDESC|4|MOBILE HOMES|-793766.3806761563
1|PARUSEDESC|5|MULTI-FAMILY - FEWER THAN 10 UNITS|-680141.6707517173
1|PARUSEDESC|6|RESIDENTIAL COMMON ELEMENTS / AREAS|-200166.7979917362
2|CNTYNAME|1|MIAMI-DADE|-14247.32188426969
2|CNTYNAME|2|BROWARD|11489.81580546272
2|CNTYNAME|3|PALM BEACH|4753.91666378783
2|CNTYNAME|4|LEE|-187796.8228951865
2|CNTYNAME|5|HILLSBOROUGH|1906136.710600209
2|CNTYNAME|6|ORANGE|14400.40299565677
2|CNTYNAME|7|PINELLAS|340509.3627915879
2|CNTYNAME|8|DUVAL|-6297.904384635765
2|CNTYNAME|9|POLK|51934.57494268686
2|CNTYNAME|10|BREVARD|-121257.6655694735
2|CNTYNAME|11|VOLUSIA|-14876.60197626388
2|CNTYNAME|12|SARASOTA|-80945.57973416231
2|CNTYNAME|13|COLLIER|-130752.9965216073
2|CNTYNAME|14|PASCO|27699.23056136876
2|CNTYNAME|15|MARION|-60143.98878492894
2|CNTYNAME|16|CHARLOTTE|-71174.58469642041
2|CNTYNAME|17|MANATEE|38203.76868561743
2|CNTYNAME|18|LAKE|45848.48937542497
2|CNTYNAME|19|SEMINOLE|64744.92416669091
2|CNTYNAME|20|OSCEOLA|228911.8974889433
3|ACRES|1|NULL|-7317.375525209893
4|TOTLVGAREA|1|NULL|101.4888021198855
5|EFFYRBLT|1|NULL|3995.56463628857
6|SALEYR1|1|NULL|-158867.0479846989
Last modified 2mo ago