# Regression Algorithms

Overview of HeavyML regression algorithms

HEAVY.AI supports four regression algorithms: linear regression, random forest regression, gradient boosted tree (GBT) regression, and decision tree regression. Models can be created and trained either via a CREATE MODEL statement or via dedicated table functions. Inference using the models can be performed via the row-wise ML_PREDICT operator, or via dedicated table functions.

Creating a regression model is accomplished via the `CREATE MODEL` statement:

```sql
{ CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL }
<model_name>
OF TYPE { LINEAR_REG | RANDOM_FOREST_REG | DECISION_TREE_REG | GBT_REG }
AS <query_statement>
WITH (<option_list>);
```

We will step through the various parts and options of the statement above, giving concrete examples along the way.
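To make the grammar concrete before stepping through the options, here is a minimal example. The `sales` table and its columns are invented for illustration; the statement trains a linear regression predicting `revenue` from one categorical and one continuous predictor:

```sql
-- Hypothetical table and column names, for illustration only.
CREATE MODEL revenue_lr OF TYPE LINEAR_REG AS
SELECT
  revenue,   -- first column: the variable to predict
  region,    -- categorical predictor (TEXT ENCODED DICT)
  ad_spend   -- continuous predictor
FROM sales
WITH (EVAL_FRACTION=0.2);
```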

Like CREATE TABLE, CREATE MODEL allows for:

1. `CREATE MODEL` (without qualifiers): Creates a named model. If a model with the same name already exists, the statement will throw an error to that effect.
2. `CREATE MODEL IF NOT EXISTS`: Creates a model only if no model with the same name already exists; unlike #1, it will not throw an error on a naming collision.
3. `CREATE OR REPLACE MODEL`: If a model with the same name already exists, overwrites it. This supports rapid iteration on models, but there is no undo, so you may wish to combine this with purposeful model-name versioning.
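The behavioral differences between the three variants can be sketched as follows (model and table names here are hypothetical):

```sql
-- Errors if revenue_lr already exists:
CREATE MODEL revenue_lr OF TYPE LINEAR_REG AS
SELECT revenue, ad_spend FROM sales;

-- Succeeds silently (without retraining) if revenue_lr already exists:
CREATE MODEL IF NOT EXISTS revenue_lr OF TYPE LINEAR_REG AS
SELECT revenue, ad_spend FROM sales;

-- Overwrites any existing model of the same name; a versioned name
-- guards against losing a model you still need:
CREATE OR REPLACE MODEL revenue_lr_v2 OF TYPE LINEAR_REG AS
SELECT revenue, ad_spend FROM sales;
```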

Other things to take note of when using `CREATE MODEL` include:

- <model_name> must be a valid SQL identifier (e.g. `my_model_123`), just like a SQL table or column name.
- The model type must be one of the four following values: `LINEAR_REG`, `RANDOM_FOREST_REG`, `DECISION_TREE_REG`, or `GBT_REG`.
- <query_statement> must be a valid SQL query. Unlike general SQL, column order is significant here.
  - The first column should represent the variable you wish to predict (aka the dependent variable). For currently supported regressions, this can be any continuous column type, including integer or floating-point columns.
  - One or more predictor (aka independent) variables can follow in arbitrary order, with the following limits:
    - Categorical predictor variables must currently be of type TEXT ENCODED DICT (i.e. unencoded text is not supported).
    - Continuous predictor variables should be of type integer or float.
  - All projected expressions that are not simple column references (e.g. `revenue * 2.0`, as opposed to `revenue`) must be aliased, e.g. `revenue * 2.0 AS revenue2x`.
- When training a model, HeavyML automatically ignores any rows that contain NULL values in any of the input predictor columns.
  - In some cases it can be advantageous to impute values for NULLs in advance of regression. This is case-specific and thus not automated, but in the simplest cases, mean or mode values can be substituted where appropriate using standard SQL UPDATE commands.
  - Because several of these techniques are sensitive to outliers, it is best practice to review data extremes before modeling, and especially to ensure that any sentinel values for "no data" are removed beforehand. For example, the presence of -999 values instead of NULLs will invalidate linear regression results.

- Other column types can be CAST or EXTRACTed to valid types within the <query_statement>:
  - Date and timestamp values can be extracted to either continuous or categorical values if needed:
    - `extract(month from timestamp_col)`
    - `extract(epoch from timestamp_col)`
  - Geo columns are not directly supported, but operators that derive the types above from a geo column, or from columnar longitude and latitude, work as expected:
    - `ST_AREA(polygon_column)`
    - `GeoToH3(longitude, latitude, scale)`
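The data-preparation and type-derivation guidance above can be sketched end to end. All table and column names below are hypothetical; the UPDATE statements illustrate the simple sentinel-cleanup and mean-imputation approach mentioned earlier, and the training query derives valid predictor types inline (expressions that are not simple column references are aliased, as required):

```sql
-- Clear sentinel "no data" values, then impute the column mean for NULLs.
-- (The mean is hard-coded for illustration; in practice run the SELECT first.)
UPDATE taxi_trips SET trip_distance = NULL WHERE trip_distance = -999;
SELECT AVG(trip_distance) FROM taxi_trips;
UPDATE taxi_trips SET trip_distance = 2.7 WHERE trip_distance IS NULL;

-- Train, deriving predictors from a timestamp and from lon/lat columns.
CREATE MODEL taxi_fare_gbt OF TYPE GBT_REG AS
SELECT
  fare_amount,                                      -- variable to predict
  extract(month from pickup_ts) AS pickup_month,    -- continuous, from timestamp
  GeoToH3(pickup_lon, pickup_lat, 8) AS pickup_h3,  -- spatial bin from lon/lat
  trip_distance
FROM taxi_trips
WITH (EVAL_FRACTION=0.1);
```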

**Standard Options**

Many of the options that can be specified in the WITH options list are model-dependent and are detailed in the relevant sections; however, the following options apply to all model types:

1. `EVAL_FRACTION` (optionally aliased as `DATA_SPLIT_EVAL_FRACTION`): Specifies the proportion of the dataset to be withheld from the training data, and allows the `EVALUATE MODEL` command to be run on the evaluation set without an explicit query (see the `EVALUATE MODEL` section below for more details). Note that `EVAL_FRACTION` must be >= 0.0 and < 1.0. The default value is 0.0 (no evaluation hold-out set).
2. `TRAIN_FRACTION` (optionally aliased as `DATA_SPLIT_TRAIN_FRACTION`): Specifies the proportion of the dataset to be used for training. The most common use case is training a model over a large amount of data, where a `TRAIN_FRACTION` of less than 1 will speed up training at some cost in model accuracy. Note that `TRAIN_FRACTION` must be >= 0.0 and <= 1.0.
3. `CAT_TOP_K`: For models with categorical predictors, specifies the number of top-k attributes from each categorical column that will be one-hot encoded, based on each attribute's frequency of occurrence in the training dataset. The default value for `CAT_TOP_K` is 10, so only the 10 most frequent categorical values are considered in modeling unless this is adjusted.
   1. For example, if `CAT_TOP_K` is set to 3, and a categorical predictor column `us_state` has 20 rows for 'CA', 15 rows for 'TX', 12 rows for 'NY', 10 rows for 'WA', and 8 rows for 'FL', then one-hot encoded columns will be generated for 'CA', 'TX', and 'NY'.
   2. This option works in combination with `CAT_MIN_FRACTION`, described immediately below. For a categorical attribute to be one-hot encoded, the attribute must also have a column frequency greater than or equal to `CAT_MIN_FRACTION`.
4. `CAT_MIN_FRACTION`: For models with categorical predictors, specifies the minimum frequency with which an attribute must be represented in a categorical column to be one-hot encoded as a predictor.
   1. The frequency is computed as the number of rows in the column with the attribute value divided by the total number of rows.
   2. This option works in conjunction with `CAT_TOP_K`, such that for a categorical attribute to be one-hot encoded, it must both be among the top-k attributes for its column and have a frequency of occurrence >= `CAT_MIN_FRACTION`.
   3. The default value for `CAT_MIN_FRACTION` is 0.01, meaning only categorical attributes that make up at least 1% of their input column will be one-hot encoded (assuming they are also among the top `CAT_TOP_K` attributes of that column by frequency).
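The standard options above can be combined in a single WITH clause. The following sketch reuses the `flights_2008` table from later examples; the option values are illustrative, not recommendations:

```sql
-- Train on 50% of rows, hold out 10% for evaluation, and one-hot encode up to
-- the 25 most frequent values per categorical column, provided each value
-- makes up at least 0.5% of its column.
CREATE MODEL flights_arrdelay_gbt OF TYPE GBT_REG AS
SELECT arrdelay, carrier_name, origin, dest, depdelay, distance
FROM flights_2008
WITH (TRAIN_FRACTION=0.5, EVAL_FRACTION=0.1, CAT_TOP_K=25, CAT_MIN_FRACTION=0.005);
```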

**SHOW MODELS**

Currently registered model names can be accessed via the `SHOW MODELS` command. If the configuration flag `--restrict-ml-model-metadata-to-superusers` is set to false (the default), any user can execute this command; otherwise it is restricted to superusers only.

```
heavysql> SHOW MODELS;
model_name
CENSUS_INCOME_RF
FLIGHTS_ARRDELAY_LR
FLORIDA_PARCELS_SALE_PRC_RF
```

**SHOW MODEL DETAILS**

Metadata for currently registered models can be accessed via the `SHOW MODEL DETAILS` command.

If `SHOW MODEL DETAILS` is run without a list of model names, metadata for all registered models will be returned. If it is executed with a list of one or more model names, only the metadata for those models will be returned. If the configuration flag `--restrict-ml-model-metadata-to-superusers` is set to false (the default), any user can execute this command; otherwise it is restricted to superusers only.

```
heavysql> SHOW MODEL DETAILS FLIGHTS_ARRDELAY_LR, FLORIDA_PARCELS_SALE_PRC_RF;
model_name|model_type|predicted|predictors|training_query|num_logical_features|num_physical_features|num_categorical_features|num_numeric_features|eval_fraction
FLIGHTS_ARRDELAY_LR|Linear Regression|arrdelay|carrier_name, origin, dest, depdelay, distance|SELECT arrdelay, carrier_name, origin, dest, depdelay, distance FROM flights_2008|5|60|3|2|0.1
FLORIDA_PARCELS_SALE_PRC_RF|Random Forest Regression|SALEPRC1|PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1|SELECT SALEPRC1, PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1 FROM florida_parcels_2020|6|30|2|4|0.1
```

**SHOW MODEL FEATURE DETAILS**

Metadata for model features, including regression coefficients for linear regression models and feature importance scores for random forest regression models, can be displayed by executing `SHOW MODEL FEATURE DETAILS <model_name>;`

```
heavysql> SHOW MODEL FEATURE DETAILS florida_parcels_sale_prc_lr;
feature_id|feature|sub_feature_id|sub_feature|coefficient
0|intercept|1||-54508008.99283858
1|PARUSEDESC|1|SINGLE FAMILY|-175974.316572117
2|CNTYNAME|1|BROWARD|69091.59847069145
2|CNTYNAME|2|PALM BEACH|112875.1340614095
2|CNTYNAME|3|MIAMI-DADE|169139.0685186634
2|CNTYNAME|4|HILLSBOROUGH|-31283.78090350307
2|CNTYNAME|5|LEE|-182184.3516143442
2|CNTYNAME|6|ORANGE|-27330.41730424529
2|CNTYNAME|7|PINELLAS|88344.15650346855
2|CNTYNAME|8|DUVAL|-48140.05756674933
2|CNTYNAME|9|PASCO|-71659.86040629358
2|CNTYNAME|10|POLK|-151774.8087324249
3|ACRES|1||-2109.863439939015
4|TOTLVGAREA|1||222.8330490509279
5|EFFYRBLT|1||877.1510557192003
6|SALEYR1|1||26136.629812613
```

If you have superuser access, you can view relevant model metadata by querying the `information_schema.ml_models` table. The table schema is as follows:

```
User admin connected to database information_schema
heavysql> SHOW CREATE TABLE ml_models;
Result
CREATE TABLE ml_models (
  model_name TEXT ENCODING DICT(32),
  model_type TEXT ENCODING DICT(32),
  predicted TEXT ENCODING DICT(32),
  predictors TEXT[] ENCODING DICT(32),
  training_query TEXT ENCODING DICT(32),
  num_logical_features BIGINT,
  num_physical_features BIGINT,
  num_categorical_features BIGINT,
  num_numeric_features BIGINT,
  eval_fraction DOUBLE);
```

The `ml_models` system table can be queried as follows:

```
heavysql> SELECT * FROM information_schema.ml_models WHERE model_name ILIKE '%flights%';
model_name|model_type|predicted|predictors|training_query|num_logical_features|num_physical_features|num_categorical_features|num_numeric_features|eval_fraction
FLIGHTS_ARRDELAY_LR|Linear Regression|arrdelay|{carrier_name, origin, dest, depdelay, distance}|SELECT arrdelay, carrier_name, origin, dest, depdelay, distance FROM flights_2008|5|60|3|2|0.1
```

**EVALUATE MODEL**

Evaluating a trained model can be done with the `EVALUATE MODEL` statement, which returns the model's R-squared (R2) score. R-squared is a goodness-of-fit measure for regression models; it indicates the percentage of the variance in the dependent variable that the independent variables explain collectively. In the future, additional metrics such as Mean Squared Error (MSE) and Mean Absolute Error (MAE) may also be returned. Note: additional metrics can be computed today in SQL from the values returned by ML_PREDICT (described below) if desired.

```sql
EVALUATE MODEL <model_name> [ ON <query_statement> ];
```

If `CREATE MODEL` was executed with a specified `EVAL_FRACTION`, then `EVALUATE MODEL` can be run without a query (i.e. simply `EVALUATE MODEL <model_name>`), which will evaluate the model on the proportion of the training dataset held out for evaluation (i.e. the "test set").

If `EVAL_FRACTION` was not specified in the `CREATE MODEL` statement, or if you wish to test the model's performance on a different evaluation set, a specific evaluation query can be provided. The expected column order is `<predicted_col>, <categorical predictor cols>, <continuous predictor cols>`, and the arguments specified should semantically match the variables the model was trained on.

For example, the following workflow trains a random forest regression model to predict sale prices of Florida real estate parcels for single-family homes with a defined 20% test set, evaluates the model's performance on that held-out test dataset, and then evaluates it separately on a dataset of Georgia real estate parcels of single-family homes.

```sql
CREATE OR REPLACE MODEL florida_parcels_sale_prc OF TYPE RANDOM_FOREST_REG AS
SELECT
  SALEPRC1,
  PARUSEDESC,
  ACRES,
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  florida_parcels_2020
WITH (CAT_TOP_K=20, EVAL_FRACTION=0.2, NUM_TREES=100);

/* Evaluate model on the 20% test dataset defined in
   the CREATE MODEL statement */
EVALUATE MODEL florida_parcels_sale_prc;
```

```
r2
0.8789433247
```

```sql
/* Evaluate model on Georgia real estate parcels */
EVALUATE MODEL florida_parcels_sale_prc ON
SELECT
  SALEPRC1,
  PARUSEDESC,
  ACRES,
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  georgia_parcels_2020;
```

```
r2
0.75928328322
```

In this example, about 88% of the variance in the Florida dataset can be explained by the model (based on the 20% hold-out set). The same model, applied entirely outside of its training domain, explains about 76% of the sale price variance in Georgia.

**ML_PREDICT**

Model inference, or using the model to predict values, can be performed with the `ML_PREDICT` operator, which can be used anywhere a normal SQL operator is expected (i.e. it is executed row-wise and is not a table function). The syntax is as follows:

```sql
ML_PREDICT('<model_name>', <predictor_1>, <predictor_2>, ... <predictor_n>)
```

Note the following:

1. The `model_name` is always the first argument to `ML_PREDICT`, and should be enclosed in **single** quotes.
2. The number of predictors must match the number of logical predictors the model was trained on (i.e. the number of column expression inputs, not accounting for the splitting of categorical predictors into multiple one-hot encoded columns).
3. The predictors should be specified in the same order as provided at model creation (with categorical predictors always coming first); otherwise you will receive erroneous results.
4. The variable to be predicted, which was the first column input to `CREATE MODEL`, should never be provided to `ML_PREDICT`.

**Example**

```sql
-- First a model must be created/trained.
-- Train a model to predict arrival delay given the airline carrier, origin airport,
-- destination airport, departure delay, and travel distance for a flight.
heavysql> CREATE MODEL flights_arrdelay_lr OF TYPE LINEAR_REG AS SELECT
arrdelay, carrier_name, origin, dest, depdelay, distance
FROM flights_2008 WITH (CAT_TOP_K=20, EVAL_FRACTION=0.1);

-- Now use ML_PREDICT to compute the predicted arrival delay for each flight,
-- as well as subtract the predicted arrival delay from the actual delay
-- to compute prediction error.
heavysql> SELECT carrier_name, origin, dest, depdelay, arrdelay,
ML_PREDICT('flights_arrdelay_lr', carrier_name, origin, dest, depdelay, distance)
AS predicted_arrdelay, arrdelay - ML_PREDICT('flights_arrdelay_lr', carrier_name,
origin, dest, depdelay, distance) AS arrdelay_error FROM flights_2008 LIMIT 10;
```

```
carrier_name|origin|dest|depdelay|arrdelay|predicted_arrdelay|arrdelay_error
Southwest Airlines|RSW|MCO|-1|-6|-5.391950839306912|-0.6080491606930876
Southwest Airlines|RSW|MCO|5|2|0.7271153236961924|1.272884676303808
Southwest Airlines|RSW|MDW|-1|-10|-6.893060546399388|-3.106939453600612
Southwest Airlines|RSW|MDW|4|13|-1.793838743896801|14.7938387438968
Southwest Airlines|RSW|MDW|7|9|1.265694337604751|7.734305662395249
Southwest Airlines|RSW|PHL|2|-24|-3.669279998682756|-20.33072000131725
Southwest Airlines|SAN|ABQ|-3|-14|-8.239421467536765|-5.760578532463235
Southwest Airlines|SAN|ABQ|38|30|33.57419731298445|-3.574197312984452
Southwest Airlines|SAN|ABQ|0|-14|-5.179888386035214|-8.820111613964787
Southwest Airlines|SAN|AUS|22|4|16.47760221850464|-12.47760221850464
```
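As noted in the EVALUATE MODEL discussion, metrics beyond R2 can be computed in plain SQL from ML_PREDICT output. The following is a sketch (not an officially documented recipe) of computing MSE and MAE for the model above:

```sql
-- Mean Squared Error and Mean Absolute Error from row-wise predictions.
SELECT
  AVG(POWER(arrdelay - predicted_arrdelay, 2)) AS mse,
  AVG(ABS(arrdelay - predicted_arrdelay))      AS mae
FROM (
  SELECT arrdelay,
         ML_PREDICT('flights_arrdelay_lr', carrier_name, origin, dest,
                    depdelay, distance) AS predicted_arrdelay
  FROM flights_2008
) AS preds;
```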
