1 of 5

Regression Algorithms

Overview of HeavyML regression algorithms

Overview

HEAVY.AI supports four algorithms: linear regression, random forest regression, gradient boosted tree (GBT) regression, and decision tree regression. Creation and training of the models can be accomplished both via a CREATE MODEL statement as well as via invocation of dedicated table functions. Inference using the models can be accomplished via a row-wise ML_PREDICT operator, or via dedicated table functions.

Creating a Model

Creating a regression model is accomplished via the CREATE MODEL statement.

{ CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL }
<model_name>
OF TYPE { LINEAR_REG | RANDOM_FOREST_REG | DECISION_TREE_REG | GBT_REG }
AS <query_statement>
WITH (<option_list>);

We will step through the various parts and options of the statement above, giving concrete examples along the way.

Like CREATE TABLE, CREATE MODEL allows for:

CREATE MODEL (without qualifiers): Creates a named model. If a model already exists by the same name, the statement will throw an error message to that effect.
CREATE MODEL IF NOT EXISTS will only create a model if no model already exists with the same name, except unlike #1, it will not throw an error if there is a naming collision.
CREATE OR REPLACE MODEL: If a model with the same name already exists, it will overwrite it. This supports rapid iteration on models, but there is no 'undo' so you may wish to combine this with puposeful model name versioning.

Other things to take note of when using CREATE MODEL include:

<model_name> must be a valid SQL identifier, i.e. my_model_123, just like a SQL table or column name.
The model type must be one of the four following values: LINEAR_REG, RANDOM_FOREST_REG, DECISION_TREE_REG, or GBT_REG.
<query_statement> must be a valid SQL query. Unlike general SQL, column order is significant here.
- The first column should represent the variable you wish to predict (aka the independent variable). For currently supported regressions, this can be any continuous column type, including integers or floating columns.
- One or more predictor (aka dependent) valiables can follow in arbitrary order with the following limits
  - Categorical predictor variables must currently be of type TEXT ENCODED DICT (i.e. unencoded text is not supported).
  - Continuous predictor variables should be of type integer or float.
  - In addition, note that all projected expressions (i.e. revenue * 2.0 ) that are not column references (i.e. revenue) must be aliased, i.e. revenue * 2.0 AS revenue2x
When training a model, HeavyML automatically ignores any rows that contain NULL values in any of the input predictor columns.
- In some cases it could be of modeling advantage to impute values for NULLs in advance of regression. This is case-specific and thus not automated, but most simply mean and mode values can be substituted where appropriate using standard SQL UPDATE commands.
- Because several of these techniques are sensitive to outliers, it is best practice to review data extremes before modeling, and especially to ensure that any sentinel values for "no data" are removed beforehand. For example, the presence of -999 values instead of nulls will invalidate linear regression results.
Other column types can be CAST or EXTRACTED to valid types within the <query_statement>
- Datetimes and timestamp values can be extracted to either continuous or categorical values if needed:
  - extract(month from timestamp_col)
  - extract(epoch from timestamp_col)
- Geo columns are not directly supported, but operators returning types above from a geo column or from columnar longitude and latitude work as expected
  - ST_AREA(polygon_column)
  - GeoToH3(longitude, latitude, scale)

Standard Options

Many of the options that can be specified in the WITH options list are model-dependent and will be detailed in the relevant section, however the following options apply to all model types

EVAL_FRACTION: (optionally aliased as DATA_SPLIT_EVAL_FRACTION): Specifies the proportion of the dataset to be withheld from the training data, and allows the EVALUATE MODEL command to be run on the evaluation set without explicit specification (see the EVALUATE MODEL section below for more details). Note that EVAL_FRACTION must be >= 0.0 and < 1.0. Default value is 0.0 (no evaluation hold-out set).
TRAIN_FRACTION: (optionally aliased as DATA_SPLIT_TRAIN_FRACTION): Specifies the proportion of the dataset to be used for training. The most common use case for specifying TRAIN_FRACTION is when training a model over a large amount of data, such that specifying a TRAIN_FRACTION of less than 1 will speed up training, at some cost in model accuracy. Note that TRAIN_FRACTION must be >= 0.0 and <= 1.0.
CAT_TOP_K: For models with categorical predictors, this option specifies the top-k number of attributes that will be one-hot encoded from each, based on the attribute's frequency of occurrence in the training dataset. Note that the default value for CAT_TOP_K is 10, so only the 10 most-frequent categorical values are considered in modeling unless this is adjusted.
1. For example, if CAT_TOP_K is set to 3, and a categorical predictor column of us_state has 20 rows for 'CA', 15 rows for 'TX', 12 rows for 'NY', 10 rows for 'WA', and 8 rows for 'FL", then one-hot encoded columns will be generated for 'CA', 'TX', and 'NY'.
2. This option works in combination with the CAT_MIN_FRACTION described immediately below. For a categorical attribute to be one-hot encoded, the attribute must also have a column frequency greater or equal to CAT_MIN_FRACTION.
CAT_MIN_FRACTION: For models with categorical predictors, this option specifies the minimum frequency an attribute must be represented in a categorical column to be one-hot encoded as a predictor.
1. It is computed based on the number of rows in the column with the attribute value divided by the total number of rows.
2. This option works in conjunction with CAT_TOP_K, such that for a categorical attribute to be one-hot encoded, it must be both among the top-k attributes for a column, as well as have a frequency of occurrence >= CAT_MIN_FRACTION.
3. The default value for CAT_MIN_FRACTION is 0.01, meaning that only categorical attributes that make up at least 1% of their input column will be one-hot encoded (assuming they are also among the top CAT_TOP_K attributes of that column in terms of frequency).

Viewing model metadata

SHOW MODELS

Currently registered model names can be accessed via the SHOW MODELS command. If the configuration flag --restrict-ml-model-metadata-to-superusers is set to false (the default), any user can execute this command, otherwise it is restricted to superusers only.

heavysql> SHOW MODELS;
model_name
CENSUS_INCOME_RF
FLIGHTS_ARRDELAY_LR
FLORIDA_PARCELS_SALE_PRC_RF

SHOW MODEL DETAILS

Metadata for currently registered models can be accessed via the SHOW MODEL DETAILS command.

If SHOW MODEL DETAILS is run without a list of table names, metadata for all registered models will be returned.

If SHOW MODEL DETAILS is executed with a list of one or more model names, then just the metadata for those model names will be returned.

If the configuration flag --restrict-ml-model-metadata-to-superusers is set to false (the default), any user can execute this command, otherwise it is restricted to superusers only.

heavysql> SHOW MODEL DETAILS FLIGHTS_ARRDELAY_LR, FLORIDA_PARCELS_SALE_PRC_RF;
model_name|model_type|predicted|predictors|training_query|num_logical_features|num_physical_features|num_categorical_features|num_numeric_features|eval_fraction
FLIGHTS_ARRDELAY_LR|Linear Regression|arrdelay|carrier_name, origin, dest, depdelay, distance|SELECT arrdelay, carrier_name, origin, dest, depdelay, distance FROM flights_2008|5|60|3|2|0.1
FLORIDA_PARCELS_SALE_PRC_RF|Random Forest Regression|SALEPRC1|PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1|SELECT SALEPRC1, PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1 FROM florida_parcels_2020|6|30|2|4|0.1

SHOW MODEL FEATURE DETAILS

Metadata for model features, including regression coefficients for linear regression models and feature importance scores for random forest regression models, can be displayed by executing SHOW MODEL FEATURE DETAILS <model_name>;

heavysql> SHOW MODEL FEATURE DETAILS florida_parcels_sale_prc_lr;
feature_id|feature|sub_feature_id|sub_feature|coefficient
0|intercept|1||-54508008.99283858
1|PARUSEDESC|1|SINGLE FAMILY|-175974.316572117
2|CNTYNAME|1|BROWARD|69091.59847069145
2|CNTYNAME|2|PALM BEACH|112875.1340614095
2|CNTYNAME|3|MIAMI-DADE|169139.0685186634
2|CNTYNAME|4|HILLSBOROUGH|-31283.78090350307
2|CNTYNAME|5|LEE|-182184.3516143442
2|CNTYNAME|6|ORANGE|-27330.41730424529
2|CNTYNAME|7|PINELLAS|88344.15650346855
2|CNTYNAME|8|DUVAL|-48140.05756674933
2|CNTYNAME|9|PASCO|-71659.86040629358
2|CNTYNAME|10|POLK|-151774.8087324249
3|ACRES|1||-2109.863439939015
4|TOTLVGAREA|1||222.8330490509279
5|EFFYRBLT|1||877.1510557192003
6|SALEYR1|1||26136.629812613

ML_MODELS System Table

If you have superuser access, you can view relevant model metadata by querying the information_schema.ml_models table. The table schema is as follows:

User admin connected to database information_schema
heavysql> SHOW CREATE TABLE ml_models;
Result
CREATE TABLE ml_models (
  model_name TEXT ENCODING DICT(32),
  model_type TEXT ENCODING DICT(32),
  predicted TEXT ENCODING DICT(32),
  predictors TEXT[] ENCODING DICT(32),
  training_query TEXT ENCODING DICT(32),
  num_logical_features BIGINT,
  num_physical_features BIGINT,
  num_categorical_features BIGINT,
  num_numeric_features BIGINT,
  eval_fraction DOUBLE);

The ml_models system table can be queried as follows:

heavysql> SELECT * FROM information_schema.ml_models WHERE model_name ILIKE '%flights%';
model_name|model_type|predicted|predictors|training_query|num_logical_features|num_physical_features|num_categorical_features|num_numeric_features|eval_fraction
FLIGHTS_ARRDELAY_LR|Linear Regression|arrdelay|{carrier_name, origin, dest, depdelay, distance}|SELECT arrdelay, carrier_name, origin, dest, depdelay, distance FROM flights_2008|5|60|3|2|0.1

Evaluating a Model

Evaluating a trained model can be done with the EVALUATE MODEL statement, which returns the model's R-squared (or R2) score. R-squared is a goodness-of-fit measure for regression models. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively.

In the future, other additional metrics, such as Mean Squared Error (MSE), and Mean Absolute Error (MAE) may also be returned. Note: Additional metrics can be computed today in SQL from the values returned by ML_PREDICT described below if desired.

EVALUATE MODEL <model_name> | ON <query_stmt> ;

If CREATE MODEL was executed with a specified EVAL_FRACTION, then EVALUATE MODEL can be run without a specified query (i.e. simply EVALUATE MODEL <model_name>), which will run the model on the specified proportion of the training dataset held out for evaluation (i.e. the "test set").

If EVAL_FRACTION was not specified in the CREATE MODEL statement, or if you wish to test the model's performance on a different evaluation set, a specific evaluation query can be provided. It is expected that the order of the columns should be <predicted_col>, <categorical predictor cols>, <continuous predictor cols> , and that the arguments specified should semantically match the variables the model was trained on.

For example, the following workflow shows training a random forest regression model to predict prices of Florida real estate parcels for single-family homes with a defined 20% test set, and then evaluating the model performance on that held-out test dataset, and then separately on a dataset of Georgia real estate parcels of single-family homes.

CREATE OR REPLACE MODEL florida_parcels_sale_prc OF TYPE RANDOM_FOREST_REG AS
SELECT
  SALEPRC1,
  PARUSEDESC,
  ACRES,
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  florida_parcels_2020
WITH (CAT_TOP_K=20, EVAL_FRACTION=0.2, NUM_TREES=100);

/* Evaluate model on 20% test dataset defined in 
   the CREATE MODEL statement */
EVALUATE MODEL florida_parcels_sale_prc;
r2
0.8789433247

/* Evaluate model on Georgia real estate parcels  */
EVALUATE MODEL florida_parcels_sale_prc ON
SELECT
  SALEPRC1,
  PARUSEDESC,
  ACRES,
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  georgia_parcels_2020
r2
0.75928328322

In this example, 88% of the variance in the Florida dataset can be explained by the model (based on 20% hold-out values). The same model applied entirely outside of its training domain explains 75% of the sale price variance in Georgia.

Model Inference/Prediction

Model inference, or using the model to predict values, can be performed by using the ML_PREDICT operator, which can be run anywhere a normal SQL operator is expected (i.e. it is executed row-wise and is not a table function). Syntax is as follows:

ML_PREDICT('<model_name>', <predictor_1>, <predictor_2> ... <predictor_n>)

Note the following:

The model_name is always the first argument to ML_PREDICT, and should be enclosed in single quotes.
The number of predictors must match the number of logical predictors (i.e. number of column expression inputs, not accounting for the splitting of categorical predictors into multiple one-hot encoded columns) that the model was trained on.
The predictors should be specified in the same order as that provided at model creation (with categorical predictors always coming first), otherwise you will receive erroneous results.
The variable to be predicted, which was the first column input to CREATE MODE, should never be provided to ML_PREDICT.

Example

// First a model must be created/trained
// Train a model to predict arrival delay given the airline carrier, origin airport,
// destination airport, departure delay and travel distance for a flight
heavysql> CREATE MODEL flights_arrdelay_lr OF TYPE LINEAR_REG AS SELECT 
arrdelay, carrier_name, origin, dest, depdelay, distance 
FrOM flights_2008 WITH (CAT_TOP_K=20, EVAL_FRACTION=0.1);

// Now use ML_PREDICT to compute the predicted arrival delay for each flight,
// as well as subtract the predicted arrival delay from the actual delay
// to compute prediction error.
heavysql> SELECT carrier_name, origin, dest, depdelay, arrdelay, 
ML_PREDICT('flights_arrdelay_lr', carrier_name, origin, dest, depdelay, distance) 
AS predicted_arrdelay, arrdelay - ML_PREDICT('flights_arrdelay_lr', carrier_name,
 origin, dest, depdelay, distance) AS arrdelay_error FROM flights_2008 limit 10;
carrier_name|origin|dest|depdelay|arrdelay|predicted_arrdelay|arrdelay_error
Southwest Airlines|RSW|MCO|-1|-6|-5.391950839306912|-0.6080491606930876
Southwest Airlines|RSW|MCO|5|2|0.7271153236961924|1.272884676303808
Southwest Airlines|RSW|MDW|-1|-10|-6.893060546399388|-3.106939453600612
Southwest Airlines|RSW|MDW|4|13|-1.793838743896801|14.7938387438968
Southwest Airlines|RSW|MDW|7|9|1.265694337604751|7.734305662395249
Southwest Airlines|RSW|PHL|2|-24|-3.669279998682756|-20.33072000131725
Southwest Airlines|SAN|ABQ|-3|-14|-8.239421467536765|-5.760578532463235
Southwest Airlines|SAN|ABQ|38|30|33.57419731298445|-3.574197312984452
Southwest Airlines|SAN|ABQ|0|-14|-5.179888386035214|-8.820111613964787
Southwest Airlines|SAN|AUS|22|4|16.47760221850464|-12.47760221850464

Linear Regression

Summary

Linear regression models are a type of supervised learning algorithm that predict a continuous target variable by fitting a straight line to the input features. They are commonly used for predicting numeric outcomes based on one or more input variables. Advantages of linear regression include simplicity, interpretability, and fast computation time. However, disadvantages include the assumption of a linear relationship between variables, which may not always hold, and sensitivity to both outliers and collinear features, which can affect the model's performance and reported accuracy.

Note: If you are not sure that your data meet these requirements, you may want to first do some exploratory visualizations and transformations using scatter plots to visualize pairwise relationships. Altneratively, HeavyML makes it simple with to try first or compare a regression model with fewer assumptions, such as Random Forests.

Example

CREATE OR REPLACE MODEL florida_parcels_sale_prc_lr OF TYPE LINEAR_REG AS
SELECT
  SALEPRC1,
  PARUSEDESC,
  CNTYNAME,
  ACRES,
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  florida_parcels_2020 with (CAT_TOP_K=20, EVAL_FRACTION=0.2);

Linear Regression Options

With the exception of the general options listed above, the linear regression model type accepts no options.

Model Evaluation

Like all other regression model types, the model r2 can be obtained via the EVALUATE MODEL command. If the model was created with a specified EVAL_FRACTION, the model r2 score can be obtained on that test holdout set via the following:

EVALUATE MODEL florida_parcels_sale_prc_lr;

r2
0.08852338436213192

If no EVAL_FRACTION was specified in the CREATE MODEL command, or if EVAL_FRACTION was specified but you wish to evaluate the model on a different dataset, you can specify the evaluation query explicitly as follows:

EVALUATE MODEL florida_parcels_sale_prc_lr ON
SELECT
  SALEPRC1,
  PARUSEDESC,
  CNTYNAME,
  ACRES,
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  florida_parcels_2020
WHERE
  PARUSEDESC = 'CONDOMINIUMS';

r2
0.03106805209354901

The relatively low R2 scores obtained for the linear regression model are not atypical for complex multi-variate relations. As noted above, in such cases, it will likely be worth trying a random forest or Gradient-Boosted Tree (GBT) regression model, as the accuracy of these models can be dramatically higher in many cases (in the example above, a simple random forest model achieved an R2 score above 0.87).

Model Prediction/Inference

Once a linear regression model is created, it can, like all other regression model types, be used for prediction via the row-wise ML_PREDICT operator, which takes the model name (in quotes) and a list of independent predictor variables semantically matching the ordered list of variables the model was trained on.

SELECT
  SALEPRC1 as actual_sales_price,
  ML_PREDICT(
    'florida_parcels_sale_prc_lr',
    PARUSEDESC,
    CNTYNAME,
    ACRES,
    TOTLVGAREA,
    EFFYRBLT,
    SALEYR1
  ) AS predicted_sales_price
FROM
  florida_parcels_2020
WHERE
  SALEPRC1 BETWEEN 100000 AND 500000
limit
  10;
  
actual_sales_price|predicted_sales_price
211000|-30912.60198199749
152400|6559.390672445297
164000|35608.10665637255
153900|56984.85121244192
143500|52565.25603222847
144000|64931.58916777372
140000|79256.96579062939
160000|90230.21915191412
162000|56753.09885531664
107000|80915.15436685085

Related Methods

A list of predictors for a trained linear regression model, along with their associated coefficients, can be obtained by executing the linear_reg_coefs table function, as shown in the following example;

SELECT * FROM TABLE(linear_reg_coefs(model_name=>'florida_parcels_sale_prc_lr'));

coef_idx|feature|sub_coef_idx|sub_feature|coef
0|intercept|1|NULL|313541950.3062068
1|PARUSEDESC|1|SINGLE FAMILY|-812725.0483721431
1|PARUSEDESC|2|CONDOMINIUMS|145061.7512208006
1|PARUSEDESC|3|VACANT RESIDENTIAL|-696124.6904133513
1|PARUSEDESC|4|MOBILE HOMES|-793766.3806761563
1|PARUSEDESC|5|MULTI-FAMILY - FEWER THAN 10 UNITS|-680141.6707517173
1|PARUSEDESC|6|RESIDENTIAL COMMON ELEMENTS / AREAS|-200166.7979917362
2|CNTYNAME|1|MIAMI-DADE|-14247.32188426969
2|CNTYNAME|2|BROWARD|11489.81580546272
2|CNTYNAME|3|PALM BEACH|4753.91666378783
2|CNTYNAME|4|LEE|-187796.8228951865
2|CNTYNAME|5|HILLSBOROUGH|1906136.710600209
2|CNTYNAME|6|ORANGE|14400.40299565677
2|CNTYNAME|7|PINELLAS|340509.3627915879
2|CNTYNAME|8|DUVAL|-6297.904384635765
2|CNTYNAME|9|POLK|51934.57494268686
2|CNTYNAME|10|BREVARD|-121257.6655694735
2|CNTYNAME|11|VOLUSIA|-14876.60197626388
2|CNTYNAME|12|SARASOTA|-80945.57973416231
2|CNTYNAME|13|COLLIER|-130752.9965216073
2|CNTYNAME|14|PASCO|27699.23056136876
2|CNTYNAME|15|MARION|-60143.98878492894
2|CNTYNAME|16|CHARLOTTE|-71174.58469642041
2|CNTYNAME|17|MANATEE|38203.76868561743
2|CNTYNAME|18|LAKE|45848.48937542497
2|CNTYNAME|19|SEMINOLE|64744.92416669091
2|CNTYNAME|20|OSCEOLA|228911.8974889433
3|ACRES|1|NULL|-7317.375525209893
4|TOTLVGAREA|1|NULL|101.4888021198855
5|EFFYRBLT|1|NULL|3995.56463628857
6|SALEYR1|1|NULL|-158867.0479846989

Random Forest Regression

Robust Predictive Modeling of Non-Linear Phenomena

Overview

Random forests regression is a machine learning technique used for regression tasks. In a random forest model, a large number of decision trees are constructed using randomly selected subsets of the training data and features. The individual trees are then combined to form a consensus prediction, which tends to be more accurate than any individual tree. This approach also helps to reduce overfitting, a common problem in machine learning where a model is too closely tailored to the training data and performs poorly on new data.

Compared to linear regression, which is a simple and interpretable method for modeling linear relationships between variables, random forests are more flexible and can model nonlinear relationships between variables. Additionally, random forests can handle a large number of features and can identify important features for prediction. Overall, random forests are a powerful and accessible machine learning tool, even for users without prior background in machine learning.

General Syntax

The model syntax follows general HeavyML conventions. You need to specify a SQL-legal model name, specify the type RANDOM_FOREST_REG, and provide a SELECT statement indicating the columns to use. The column projection statement must have the predicted variable first, then categorical columns and then continuous ones. Optionally, you can add a WITH statement to adjust categorical column processing using the same CAT_TOP_K and EVAL_FRACTION parameters discussed above.

For example, to predict Florida real estate price given parcels data:

In this example,

SALEPRC1 is the sales price to be predicted.
PARUSEDESC is a categorical column describing Florida land use types such as single family versus condominium. CNTYNAME is a categorical column of county names. These two columns are specified first among the predictors because they are categorical.
Four continuous value columns are provided: ACRES, TOTLVGAREA, EFFYRBUILT and SALEYR1. These indicate the parcel size in acres, its total living area, effective year built (reset for major remodeling) and sale year.
There are 70 counties in Florida, and we want this to be a statewide model. County is always a potentially-significant variable in price predictions, so we set the CAT_TOP_K to 70, well above its default of 10. If we want to increase initial model creation speed, we could also keep the default initially while experimenting with variables used, however this would come at the risk of fitting the model only on the most-populated counties, whose price prediction properties presumably vary.

Model-Specific Parameters

In addition to the general SQL modeling parameters demonstrated above, random forest regressions allow control of several other optional parameters. These are also provided at model creation as a comma separated list of parameter value options in the WITH statement. We recommend starting with the defaults, and then adjusting incrementally. For example, you could try increasing the number of trees and see if a higher number improves model accuracy without a large decrease in performance.

Selected Detailed Parameter Explanations

num_trees

The num_trees parameter in random forest regression specifies the number of decision trees to include in the ensemble. Each decision tree in the forest is trained on a random subset of the training data, using a random subset of the features for each split. The predictions from all the trees in the forest are then combined to produce the final output.

The num_trees parameter controls the trade-off between bias and variance in the random forest model. When the number of trees is small, the model is likely to have high bias, as it may not capture the complexity of the data well. On the other hand, when the number of trees is very large, the model may have high variance, as it may start overfitting the training data and fail to generalize to new, unseen data.

The optimal value for num_trees depends on the specific problem and the size of the dataset. In general, increasing the number of trees in the forest will improve the accuracy of the model, up to a certain point where further increases in num_trees lead to only marginal improvements in performance. In practice, the optimal value of num_trees is often found through a process of trial and error, by evaluating the model's performance on a validation set or through cross-validation.

A rule of thumb for choosing a default value for num_trees is to start with a small number, such as 10 (current default), and then gradually increase it until the performance on the validation set stops improving or starts deteriorating.

It is important to note that increasing the value of num_trees will also increase the computational cost of training the random forest model, as each additional tree requires additional time and resources to train. Therefore, the choice of num_trees should also take into account the computational resources available for training the model.

obs_per_tree_fraction

The obs_per_tree_fraction hyperparameter in random forest regression modeling determines the proportion of observations (or samples) that are randomly sampled and used to train each decision tree in the forest. Specifically, for each decision tree, a random sample of obs_per_tree_fraction x total number of observations is drawn from the training data with replacement, and the tree is trained on this sample. The remaining observations are not used for training this tree, but can be used for testing or for training other trees.

The obs_per_tree_fraction hyperparameter can be used to control the tradeoff between bias and variance in the random forest model. When obs_per_tree_fraction is set to a small value, each tree is trained on a smaller and more diverse sample of the training data, leading to lower bias but higher variance in the predictions. Conversely, when obs_per_tree_fraction is set to a large value, each tree is trained on a larger and more homogeneous sample of the training data, leading to higher bias but lower variance in the predictions.

The default value of obs_per_tree_fraction depends on the size and complexity of the dataset. In general, a good default value is between 0.5 and 0.7, which means that each tree is trained on a random sample of about 50-70% of the observations in the training data. However, this value may need to be adjusted based on the size of the dataset and the number of features. A larger dataset with many features may require a smaller value of obs_per_tree_fraction to avoid overfitting, while a smaller dataset with few features may benefit from a larger value of obs_per_tree_fraction to reduce variance.

To determine a good default value for a particular dataset, it is recommended to perform a grid search or random search over a range of values for obs_per_tree_fraction and evaluate the performance of the model on a validation set or through cross-validation. The optimal value of obs_per_tree_fraction will depend on the specific requirements of the problem and the tradeoff between bias and variance that yields the best performance.

max_tree_depth

The max_tree_depth parameter in random forest regression is used to limit the depth of each decision tree in the forest. The maximum tree depth is the number of levels of splits that a tree is allowed to have. A smaller value of max_tree_depth will result in shorter, simpler trees with less overfitting, while a larger value of max_tree_depth may allow the trees to capture more complex patterns in the data but may lead to overfitting.

Setting max_tree_depth to a very high value can result in overfitting, where the model fits the training data too closely and fails to generalize well to new, unseen data. On the other hand, setting max_tree_depth to a very low value can result in underfitting, where the model is not complex enough to capture the true underlying patterns in the data.

The optimal value of max_tree_depth depends on the specific problem and the size and complexity of the dataset. A good default value for max_tree_depth is often found through a process of trial and error, by evaluating the model's performance on a validation set or through cross-validation.

In practice, a common approach is to set a maximum value for max_tree_depth, such as 10 or 20, and then use regularization techniques such as early stopping to prevent overfitting. Early stopping involves monitoring the performance of the model on a validation set and stopping the training process when the performance on the validation set starts to deteriorate, rather than continuing to train the model until it fits the training data perfectly.

It is important to note that the choice of max_tree_depth should also take into account the number of features in the dataset, as a larger number of features can lead to more complex decision trees. A larger number of features may require a smaller value of max_tree_depth to avoid overfitting, while a smaller number of features may benefit from a larger value of max_tree_depth to capture more complex patterns in the data.

features_per_node

The features_per_node parameter controls the number of features available for each node split in the decision trees. This parameter limits the number of candidate features that can be considered for each split, which can help to reduce the computational cost of training the model and also reduce the risk of overfitting.

By default, the features_per_node parameter is set to the square root of the total number of features in the dataset. This value is a commonly used heuristic in random forest modeling that has been shown to work well in practice. However, the optimal value of features_per_node can vary depending on the specific problem and the characteristics of the dataset.

A larger value of features_per_node can lead to more accurate models, as it allows the decision trees to consider a larger number of features and capture more complex patterns in the data. However, a larger value of features_per_node can also increase the computational cost of training the model and increase the risk of overfitting.

Conversely, a smaller value of features_per_node can reduce the computational cost of training the model and reduce the risk of overfitting, but may result in less accurate models that are not able to capture all of the important patterns in the data.

In practice, the optimal value of features_per_node is often determined through a process of trial and error, by evaluating the performance of the model on a validation set or through cross-validation. The optimal value will depend on the specific characteristics of the dataset, such as the number of features, the size of the dataset, and the complexity of the underlying relationships between the features and the target variable.

min_weight_fraction_in_leaf_node

The min_weight_fraction_in_leaf_node parameter in random forest regression models specifies the minimum fraction of the sum of instance weights required in a leaf node. In other words, it controls the minimum amount of data that should be present in a leaf node during the tree-building process. If the sum of instance weights in a leaf node is below this fraction, the node is not split anymore and is converted to a leaf node.

The parameter is useful in scenarios where the data set is imbalanced or when instances have different weights. For example, if there are very few instances of a certain class in the training data set, the model may have trouble identifying this class without the help of this parameter. By setting min_weight_fraction_in_leaf_node to a value greater than zero, the model ensures that a minimum amount of data is present in the leaf node, thus reducing the risk of overfitting.

The appropriate value of this parameter depends on the data set and the specific problem. In general, larger values of min_weight_fraction_in_leaf_node result in simpler trees with fewer nodes and higher bias, but lower variance. On the other hand, smaller values of min_weight_fraction_in_leaf_node lead to more complex trees with higher variance, but lower bias. It is recommended to experiment with different values of this parameter to identify the one that works best for a specific problem.

min_impurity_decrease_in_split_node

The min_impurity_decrease_in_split_node parameter in random forest regression models specifies the minimum impurity decrease required to split an internal node. In other words, it controls the minimum improvement in impurity that must be achieved by splitting a node before it is considered. The impurity decrease is a measure of how much the split reduces randomness or uncertainty in the target variable.

The appropriate value of min_impurity_decrease_in_split_node depends on the data set and the specific problem. Larger values of this parameter lead to fewer and deeper trees with higher bias but lower variance, while smaller values lead to more shallow trees with lower bias but higher variance. In general, the value of min_impurity_decrease_in_split_node should be set to a value that results in a model with good predictive performance on the validation set.

Here are a couple of examples of typical values of min_impurity_decrease_in_split_node for different data sets:

If the data set has a large number of features and many of them are unimportant, a relatively high value of min_impurity_decrease_in_split_node can be used to filter out noise and focus on the most informative features. A value of 0.01 or higher might be appropriate in such a case.
If the data set has a small number of features or if all features are potentially important, a smaller value of min_impurity_decrease_in_split_node might be appropriate. For example, a value of 0.0001 or lower might be suitable for such a case.
If the data set is noisy or contains a lot of outliers, a higher value of min_impurity_decrease_in_split_node might be appropriate to avoid overfitting. A value of 0.1 or higher might be suitable in such a case.
If the data set is well-behaved and contains no outliers, a lower value of min_impurity_decrease_in_split_node might be appropriate to allow the model to capture more subtle patterns in the data. A value of 0.0001 or lower might be suitable in such a case.

It is important to note that these are just general guidelines, and the appropriate value of min_impurity_decrease_in_split_node should be determined through experimentation and cross-validation on the specific data set and problem at hand.

Evaluation of Feature Importance

It is often useful to understand which specific features are most important in a prediction. While future enhancements may make the syntax for this even easier, a UDTF version of feature importance evaluation is available today.

The table function is called random_forest_reg_var_importance

It takes a single parameter, which is the random forests model name. Because it is a table function, it must be wrapped within a TABLE() expression in SQL when used within a FROM clause:

This returns a table with rows for each feature used in the model. Categorical features are broken into multiple rows, with one row per sub-feature. The importance score units are based on the leaf-level tree construction metric set by var_importance_metric_str. While the MDA_Scaled metric takes a bit longer to run, you may find its values to be more interpretable.

You can also CTAS the results of the function above:

If you create and update a new percent_importance column in the resulting table, you can normalize and sort the feature importance scores by the maximum importance. For example when we do this with the example parcel data above, we find the most important model features are total living area, parcel areas and effective year built.

Making Predictions

To make a prediction using your model, use the ML_PREDICT function. This can be used on-the-fly, or as part of an UPDATE or CTAS if you want to persist your predictions. The general form is:

So in our example above, you'd use:

Note that we did not use the name of the variable to be predicted - that comes from the model itself.

Example Application

The Florida real estate price prediction model above returns a reasonably-high r2 with simply the default random forest regression parameters. About 87% of the data variance can be explained with just the 6 variables above from the public parcels data available from the Florida Department of Revenue.

The model can be improved by adding auxiliary datasets, such as distance to the coastline, school district quality, or density of high-quality local amenities like parks and restaurants.

Other Applications

Another use case for random forest regression is in customer segmentation, for example predicting customer lifetime value (CLV) for a business. CLV is a measure of the total amount of money a customer is expected to spend on a business's products or services over their entire lifetime as a customer. By segmenting customers into groups with similar CLV predictions, a business can tailor marketing and customer retention strategies to maximize revenue and profitability.

In this use case, the random forest regression model would be trained on historical customer data, such as purchase history, demographic information, and website activity. The model would then be used to predict the CLV for new and existing customers. The model could also identify the most important factors that contribute to CLV, such as customer age, purchase frequency, and product categories.

The random forest model can handle large datasets and complex relationships between variables, making it well-suited for customer segmentation tasks. Additionally, the model can automatically select the most relevant features for prediction, which is especially useful when dealing with high-dimensional data. By using a random forest model for customer segmentation, a business can gain valuable insights into customer behavior and preferences, and make data-driven decisions to improve customer satisfaction and increase revenue.

Decision Tree Regression

A simple, interpretable model

In general, we recommend starting with either Linear or Random Forests Regressions depending on data complexity. A single decision tree regression option is available for narrower use cases. One advantage of decision tree regression over random forest regression is its interpretability. A decision tree is an easy-to-understand model that can be easily visualized and interpreted. The tree structure allows one to see the sequence of decisions and criteria used to make predictions. This interpretability can be especially valuable in applications where the decision-making process needs to be transparent and understandable, such as in healthcare or finance.

Another advantage of decision tree regression is that it is computationally efficient compared to random forest regression. Because a decision tree is a single tree, it can be built and trained more quickly than an ensemble of trees in a random forest model. This efficiency can be important when dealing with real-time applications or large data sets where computational resources are limited.

Additionally, decision tree regression can be more appropriate for certain types of data sets or problems. For example, if the target variable has a strong linear relationship with a small subset of the features, a decision tree model might be able to capture this relationship more accurately than a random forest model. In such a case, a random forest model might overfit the data and produce less accurate predictions.

It is important to note, however, that decision tree regression models can also suffer from overfitting and may not perform as well as random forest models in many scenarios.

General Syntax

The model syntax follows general HeavyML conventions. You need to specify a SQL-legal model name, specify the type DECISION_TREE_REG, and provide a SELECT statement indicating the columns to use. The column projection statement must have the predicted variable first, then categorical columns and then continuous ones. Optionally, you can add a WITH statement to adjust categorical column processing using the same CAT_TOP_K and EVAL_FRACTION parameters discussed above.

Example

This model trains in approximately 240ms. This is significantly faster than a full random forest regression on the same hardware which took several seconds. It produces a single tree model with r2 of 85% which is both useful and simple. While you would not expect this model to be as robust as random forests for application outside of its initial training domain, it might be acceptable or even preferred in terms of speed and explicability within its training domain.

Making Predictions

To make a prediction using your decision tree model, use the ML_PREDICT function. This can be used on-the-fly, or as part of an UPDATE or CTAS if you want to persist your predictions. The general form is:

So in our example above, you'd use:

Note that we did not use the name of the variable to be predicted - that comes from the model itself.

Gradient Boosting Tree Regression

Gradient boosting is a machine learning technique that combines weak learners, here decision trees, to create a strong predictor by iteratively minimizing the loss function. The main difference between random forests and gradient boosting lies in how the decision trees are created and aggregated. Unlike random forests, the decision trees in gradient boosting are built additively; in other words, each decision tree is built one after another.

Gradient boosting models have several advantages over random forest regression models:

Gradient boosting models can often achieve higher accuracy than random forests: Gradient boosting models are designed to minimize errors and can learn complex relationships between the target variable and predictors, which can lead to higher accuracy compared to random forest models.
Gradient boosting can handle missing data: Gradient boosting models can handle missing data by imputing missing values using the best split values during the tree-building process. Random forest models require imputation of missing data prior to training.
Gradient boosting is less prone to overfitting: Gradient boosting models are designed to reduce overfitting through techniques like early stopping and regularization. Random forest models are prone to overfitting, especially with noisy data or high-dimensional data.

However, gradient boosting models also have some disadvantages compared to random forest models, such as being more computationally expensive and having more hyperparameters to tune. The choice between gradient boosting and random forest models depends on the specific problem and data set, and should be determined through experimentation and cross-validation.

General SQL Syntax

The model syntax follows HeavyML conventions. You need to specify a SQL-legal model name, specify the type GBT_REG, and provide a SELECT statement indicating the columns to use. The statement must have the predicted variable first, then categorical columns and then continuous ones. Optionally, you can add a WITH statement to adjust categorical column processing using the same CAT_TOP_K and EVAL_FRACTION parameters discussed above.

Example

For our example dataset, the r2 obtained was about 5% less than random forests regression, while model building was faster. Your mileage may vary. In general, gradient boosted regression models may perform better than random forests regression in domains where the data has a high degree of complexity, nonlinearity, and noise, such as in image and speech recognition, natural language processing, and financial forecasting. Additionally, gradient boosted models may be more effective when there is a large number of features, and when the target variable is highly imbalanced.

Making Predictions

To make a prediction using your gradient boosting tree model, use the ML_PREDICT function. This can be used on-the-fly, or as part of an UPDATE or CTAS if you want to persist your predictions. The general form is:

So in our example above, you'd use:

Note that we did not use the name of the variable to be predicted - that comes from the model itself.

Random Forest Regression

Robust Predictive Modeling of Non-Linear Phenomena

Overview

General Syntax

For example, to predict Florida real estate price given parcels data:

In this example,

SALEPRC1 is the sales price to be predicted.
PARUSEDESC is a categorical column describing Florida land use types such as single family versus condominium. CNTYNAME is a categorical column of county names. These two columns are specified first among the predictors because they are categorical.
Four continuous value columns are provided: ACRES, TOTLVGAREA, EFFYRBUILT and SALEYR1. These indicate the parcel size in acres, its total living area, effective year built (reset for major remodeling) and sale year.
There are 70 counties in Florida, and we want this to be a statewide model. County is always a potentially-significant variable in price predictions, so we set the CAT_TOP_K to 70, well above its default of 10. If we want to increase initial model creation speed, we could also keep the default initially while experimenting with variables used, however this would come at the risk of fitting the model only on the most-populated counties, whose price prediction properties presumably vary.

Model-Specific Parameters

Parameter

Description

Default Value

Selected Detailed Parameter Explanations

num_trees

obs_per_tree_fraction

max_tree_depth

features_per_node

min_weight_fraction_in_leaf_node

min_impurity_decrease_in_split_node

Here are a couple of examples of typical values of min_impurity_decrease_in_split_node for different data sets:

If the data set has a large number of features and many of them are unimportant, a relatively high value of min_impurity_decrease_in_split_node can be used to filter out noise and focus on the most informative features. A value of 0.01 or higher might be appropriate in such a case.
If the data set has a small number of features or if all features are potentially important, a smaller value of min_impurity_decrease_in_split_node might be appropriate. For example, a value of 0.0001 or lower might be suitable for such a case.
If the data set is noisy or contains a lot of outliers, a higher value of min_impurity_decrease_in_split_node might be appropriate to avoid overfitting. A value of 0.1 or higher might be suitable in such a case.
If the data set is well-behaved and contains no outliers, a lower value of min_impurity_decrease_in_split_node might be appropriate to allow the model to capture more subtle patterns in the data. A value of 0.0001 or lower might be suitable in such a case.

Evaluation of Feature Importance

The table function is called random_forest_reg_var_importance

It takes a single parameter, which is the random forests model name. Because it is a table function, it must be wrapped within a TABLE() expression in SQL when used within a FROM clause:

SELECT * FROM TABLE(random_forest_reg_var_importance('fl_parcels_rf'));

You can also CTAS the results of the function above:

CREATE TABLE fl_parcels_rf_importance AS 
SELECT * FROM TABLE(random_forest_reg_var_importance('fl_parcels_rf'));

Making Predictions

To make a prediction using your model, use the ML_PREDICT function. This can be used on-the-fly, or as part of an UPDATE or CTAS if you want to persist your predictions. The general form is:

ML_PREDICT('<model_name>', <predictor_1>, <predictor_2> ... <predictor_n>)

So in our example above, you'd use:

ML_PREDICT('fl_parcels_rf', PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1)

Note that we did not use the name of the variable to be predicted - that comes from the model itself.

Example Application

The model can be improved by adding auxiliary datasets, such as distance to the coastline, school district quality, or density of high-quality local amenities like parks and restaurants.

Other Applications

Regression Algorithms

Overview of HeavyML regression algorithms

Overview

Creating a Model

Creating a regression model is accomplished via the CREATE MODEL statement.

{ CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL }
<model_name>
OF TYPE { LINEAR_REG | RANDOM_FOREST_REG | DECISION_TREE_REG | GBT_REG }
AS <query_statement>
WITH (<option_list>);

We will step through the various parts and options of the statement above, giving concrete examples along the way.

Like CREATE TABLE, CREATE MODEL allows for:

CREATE MODEL (without qualifiers): Creates a named model. If a model already exists by the same name, the statement will throw an error message to that effect.
CREATE MODEL IF NOT EXISTS will only create a model if no model already exists with the same name, except unlike #1, it will not throw an error if there is a naming collision.
CREATE OR REPLACE MODEL: If a model with the same name already exists, it will overwrite it. This supports rapid iteration on models, but there is no 'undo' so you may wish to combine this with puposeful model name versioning.

Other things to take note of when using CREATE MODEL include:

<model_name> must be a valid SQL identifier, i.e. my_model_123, just like a SQL table or column name.
The model type must be one of the four following values: LINEAR_REG, RANDOM_FOREST_REG, DECISION_TREE_REG, or GBT_REG.
<query_statement> must be a valid SQL query. Unlike general SQL, column order is significant here.
- The first column should represent the variable you wish to predict (aka the independent variable). For currently supported regressions, this can be any continuous column type, including integers or floating columns.
- One or more predictor (aka dependent) valiables can follow in arbitrary order with the following limits
  - Categorical predictor variables must currently be of type TEXT ENCODED DICT (i.e. unencoded text is not supported).
  - Continuous predictor variables should be of type integer or float.
  - In addition, note that all projected expressions (i.e. revenue * 2.0 ) that are not column references (i.e. revenue) must be aliased, i.e. revenue * 2.0 AS revenue2x
When training a model, HeavyML automatically ignores any rows that contain NULL values in any of the input predictor columns.
- In some cases it could be of modeling advantage to impute values for NULLs in advance of regression. This is case-specific and thus not automated, but most simply mean and mode values can be substituted where appropriate using standard SQL UPDATE commands.
- Because several of these techniques are sensitive to outliers, it is best practice to review data extremes before modeling, and especially to ensure that any sentinel values for "no data" are removed beforehand. For example, the presence of -999 values instead of nulls will invalidate linear regression results.
Other column types can be CAST or EXTRACTED to valid types within the <query_statement>
- Datetimes and timestamp values can be extracted to either continuous or categorical values if needed:
  - extract(month from timestamp_col)
  - extract(epoch from timestamp_col)
- Geo columns are not directly supported, but operators returning types above from a geo column or from columnar longitude and latitude work as expected
  - ST_AREA(polygon_column)
  - GeoToH3(longitude, latitude, scale)

Standard Options

Many of the options that can be specified in the WITH options list are model-dependent and will be detailed in the relevant section, however the following options apply to all model types

EVAL_FRACTION: (optionally aliased as DATA_SPLIT_EVAL_FRACTION): Specifies the proportion of the dataset to be withheld from the training data, and allows the EVALUATE MODEL command to be run on the evaluation set without explicit specification (see the EVALUATE MODEL section below for more details). Note that EVAL_FRACTION must be >= 0.0 and < 1.0. Default value is 0.0 (no evaluation hold-out set).
TRAIN_FRACTION: (optionally aliased as DATA_SPLIT_TRAIN_FRACTION): Specifies the proportion of the dataset to be used for training. The most common use case for specifying TRAIN_FRACTION is when training a model over a large amount of data, such that specifying a TRAIN_FRACTION of less than 1 will speed up training, at some cost in model accuracy. Note that TRAIN_FRACTION must be >= 0.0 and <= 1.0.
CAT_TOP_K: For models with categorical predictors, this option specifies the top-k number of attributes that will be one-hot encoded from each, based on the attribute's frequency of occurrence in the training dataset. Note that the default value for CAT_TOP_K is 10, so only the 10 most-frequent categorical values are considered in modeling unless this is adjusted.
1. For example, if CAT_TOP_K is set to 3, and a categorical predictor column of us_state has 20 rows for 'CA', 15 rows for 'TX', 12 rows for 'NY', 10 rows for 'WA', and 8 rows for 'FL", then one-hot encoded columns will be generated for 'CA', 'TX', and 'NY'.
2. This option works in combination with the CAT_MIN_FRACTION described immediately below. For a categorical attribute to be one-hot encoded, the attribute must also have a column frequency greater or equal to CAT_MIN_FRACTION.
CAT_MIN_FRACTION: For models with categorical predictors, this option specifies the minimum frequency an attribute must be represented in a categorical column to be one-hot encoded as a predictor.
1. It is computed based on the number of rows in the column with the attribute value divided by the total number of rows.
2. This option works in conjunction with CAT_TOP_K, such that for a categorical attribute to be one-hot encoded, it must be both among the top-k attributes for a column, as well as have a frequency of occurrence >= CAT_MIN_FRACTION.
3. The default value for CAT_MIN_FRACTION is 0.01, meaning that only categorical attributes that make up at least 1% of their input column will be one-hot encoded (assuming they are also among the top CAT_TOP_K attributes of that column in terms of frequency).

Viewing model metadata

SHOW MODELS

heavysql> SHOW MODELS;
model_name
CENSUS_INCOME_RF
FLIGHTS_ARRDELAY_LR
FLORIDA_PARCELS_SALE_PRC_RF

SHOW MODEL DETAILS

Metadata for currently registered models can be accessed via the SHOW MODEL DETAILS command.

If SHOW MODEL DETAILS is run without a list of table names, metadata for all registered models will be returned.

If SHOW MODEL DETAILS is executed with a list of one or more model names, then just the metadata for those model names will be returned.

If the configuration flag --restrict-ml-model-metadata-to-superusers is set to false (the default), any user can execute this command, otherwise it is restricted to superusers only.

heavysql> SHOW MODEL DETAILS FLIGHTS_ARRDELAY_LR, FLORIDA_PARCELS_SALE_PRC_RF;
model_name|model_type|predicted|predictors|training_query|num_logical_features|num_physical_features|num_categorical_features|num_numeric_features|eval_fraction
FLIGHTS_ARRDELAY_LR|Linear Regression|arrdelay|carrier_name, origin, dest, depdelay, distance|SELECT arrdelay, carrier_name, origin, dest, depdelay, distance FROM flights_2008|5|60|3|2|0.1
FLORIDA_PARCELS_SALE_PRC_RF|Random Forest Regression|SALEPRC1|PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1|SELECT SALEPRC1, PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1 FROM florida_parcels_2020|6|30|2|4|0.1

SHOW MODEL FEATURE DETAILS

heavysql> SHOW MODEL FEATURE DETAILS florida_parcels_sale_prc_lr;
feature_id|feature|sub_feature_id|sub_feature|coefficient
0|intercept|1||-54508008.99283858
1|PARUSEDESC|1|SINGLE FAMILY|-175974.316572117
2|CNTYNAME|1|BROWARD|69091.59847069145
2|CNTYNAME|2|PALM BEACH|112875.1340614095
2|CNTYNAME|3|MIAMI-DADE|169139.0685186634
2|CNTYNAME|4|HILLSBOROUGH|-31283.78090350307
2|CNTYNAME|5|LEE|-182184.3516143442
2|CNTYNAME|6|ORANGE|-27330.41730424529
2|CNTYNAME|7|PINELLAS|88344.15650346855
2|CNTYNAME|8|DUVAL|-48140.05756674933
2|CNTYNAME|9|PASCO|-71659.86040629358
2|CNTYNAME|10|POLK|-151774.8087324249
3|ACRES|1||-2109.863439939015
4|TOTLVGAREA|1||222.8330490509279
5|EFFYRBLT|1||877.1510557192003
6|SALEYR1|1||26136.629812613

ML_MODELS System Table

If you have superuser access, you can view relevant model metadata by querying the information_schema.ml_models table. The table schema is as follows:

User admin connected to database information_schema
heavysql> SHOW CREATE TABLE ml_models;
Result
CREATE TABLE ml_models (
  model_name TEXT ENCODING DICT(32),
  model_type TEXT ENCODING DICT(32),
  predicted TEXT ENCODING DICT(32),
  predictors TEXT[] ENCODING DICT(32),
  training_query TEXT ENCODING DICT(32),
  num_logical_features BIGINT,
  num_physical_features BIGINT,
  num_categorical_features BIGINT,
  num_numeric_features BIGINT,
  eval_fraction DOUBLE);

The ml_models system table can be queried as follows:

heavysql> SELECT * FROM information_schema.ml_models WHERE model_name ILIKE '%flights%';
model_name|model_type|predicted|predictors|training_query|num_logical_features|num_physical_features|num_categorical_features|num_numeric_features|eval_fraction
FLIGHTS_ARRDELAY_LR|Linear Regression|arrdelay|{carrier_name, origin, dest, depdelay, distance}|SELECT arrdelay, carrier_name, origin, dest, depdelay, distance FROM flights_2008|5|60|3|2|0.1

Evaluating a Model

EVALUATE MODEL <model_name> | ON <query_stmt> ;

CREATE OR REPLACE MODEL florida_parcels_sale_prc OF TYPE RANDOM_FOREST_REG AS
SELECT
  SALEPRC1,
  PARUSEDESC,
  ACRES,
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  florida_parcels_2020
WITH (CAT_TOP_K=20, EVAL_FRACTION=0.2, NUM_TREES=100);

/* Evaluate model on 20% test dataset defined in 
   the CREATE MODEL statement */
EVALUATE MODEL florida_parcels_sale_prc;
r2
0.8789433247

/* Evaluate model on Georgia real estate parcels  */
EVALUATE MODEL florida_parcels_sale_prc ON
SELECT
  SALEPRC1,
  PARUSEDESC,
  ACRES,
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  georgia_parcels_2020
r2
0.75928328322

Model Inference/Prediction

ML_PREDICT('<model_name>', <predictor_1>, <predictor_2> ... <predictor_n>)

Note the following:

The model_name is always the first argument to ML_PREDICT, and should be enclosed in single quotes.
The number of predictors must match the number of logical predictors (i.e. number of column expression inputs, not accounting for the splitting of categorical predictors into multiple one-hot encoded columns) that the model was trained on.
The predictors should be specified in the same order as that provided at model creation (with categorical predictors always coming first), otherwise you will receive erroneous results.
The variable to be predicted, which was the first column input to CREATE MODE, should never be provided to ML_PREDICT.

Example

// First a model must be created/trained
// Train a model to predict arrival delay given the airline carrier, origin airport,
// destination airport, departure delay and travel distance for a flight
heavysql> CREATE MODEL flights_arrdelay_lr OF TYPE LINEAR_REG AS SELECT 
arrdelay, carrier_name, origin, dest, depdelay, distance 
FrOM flights_2008 WITH (CAT_TOP_K=20, EVAL_FRACTION=0.1);

// Now use ML_PREDICT to compute the predicted arrival delay for each flight,
// as well as subtract the predicted arrival delay from the actual delay
// to compute prediction error.
heavysql> SELECT carrier_name, origin, dest, depdelay, arrdelay, 
ML_PREDICT('flights_arrdelay_lr', carrier_name, origin, dest, depdelay, distance) 
AS predicted_arrdelay, arrdelay - ML_PREDICT('flights_arrdelay_lr', carrier_name,
 origin, dest, depdelay, distance) AS arrdelay_error FROM flights_2008 limit 10;
carrier_name|origin|dest|depdelay|arrdelay|predicted_arrdelay|arrdelay_error
Southwest Airlines|RSW|MCO|-1|-6|-5.391950839306912|-0.6080491606930876
Southwest Airlines|RSW|MCO|5|2|0.7271153236961924|1.272884676303808
Southwest Airlines|RSW|MDW|-1|-10|-6.893060546399388|-3.106939453600612
Southwest Airlines|RSW|MDW|4|13|-1.793838743896801|14.7938387438968
Southwest Airlines|RSW|MDW|7|9|1.265694337604751|7.734305662395249
Southwest Airlines|RSW|PHL|2|-24|-3.669279998682756|-20.33072000131725
Southwest Airlines|SAN|ABQ|-3|-14|-8.239421467536765|-5.760578532463235
Southwest Airlines|SAN|ABQ|38|30|33.57419731298445|-3.574197312984452
Southwest Airlines|SAN|ABQ|0|-14|-5.179888386035214|-8.820111613964787
Southwest Airlines|SAN|AUS|22|4|16.47760221850464|-12.47760221850464