
# Random Forest Regression

Robust Predictive Modeling of Non-Linear Phenomena

Random forest regression is a machine learning technique for regression tasks. In a random forest model, a large number of decision trees are constructed using randomly selected subsets of the training data and features. The predictions of the individual trees are then averaged to form a consensus prediction, which tends to be more accurate than that of any individual tree. This approach also helps to reduce overfitting, a common problem in machine learning where a model is too closely tailored to the training data and performs poorly on new data.

Compared to linear regression, which is a simple and interpretable method for modeling linear relationships between variables, random forests are more flexible and can model nonlinear relationships between variables. Additionally, random forests can handle a large number of features and can identify important features for prediction. Overall, random forests are a powerful and accessible machine learning tool, even for users without prior background in machine learning.

The model syntax follows general HeavyML conventions. You need to specify a SQL-legal model name, specify the type RANDOM_FOREST_REG, and provide a SELECT statement indicating the columns to use. The column projection must list the predicted variable first, then the categorical columns, and then the continuous ones. Optionally, you can add a WITH clause to adjust model creation using the same CAT_TOP_K and EVAL_FRACTION parameters discussed above.

For example, to predict Florida real estate price given parcels data:

```sql
CREATE OR REPLACE MODEL fl_parcels_rf
OF TYPE RANDOM_FOREST_REG AS
SELECT
  SALEPRC1,
  PARUSEDESC,
  CNTYNAME,
  ACRES,
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  fl_res_parcels_2018
WITH
  (CAT_TOP_K=70, EVAL_FRACTION=0.2);
```

In this example:

1. SALEPRC1 is the sale price to be predicted.
2. PARUSEDESC is a categorical column describing Florida land use types, such as single family versus condominium. CNTYNAME is a categorical column of county names. These two columns are listed first among the predictors because they are categorical.
3. Four continuous columns are provided: ACRES, TOTLVGAREA, EFFYRBLT and SALEYR1. These indicate the parcel size in acres, its total living area, its effective year built (reset for major remodeling) and its sale year.
4. Florida has 67 counties, and we want this to be a statewide model. County is almost always a potentially significant variable in price predictions, so we set CAT_TOP_K to 70, comfortably above the county count and well above its default of 10. To speed up initial model creation, we could keep the default while experimenting with which variables to use; however, this risks fitting the model only on the most-populated counties, whose price behavior presumably differs from that of the rest of the state.

In addition to the general SQL modeling parameters demonstrated above, random forest regression supports several other optional parameters. These are also provided at model creation, as a comma-separated list of parameter values in the WITH clause. We recommend starting with the defaults and then adjusting incrementally. For example, you could try increasing the number of trees to see whether a higher number improves model accuracy without a large increase in training time.

Parameter | Description | Default Value |
---|---|---|
`num_trees` | The number of decision trees to include in the ensemble. | 10 |
`obs_per_tree_fraction` | The proportion of observations (or samples) that are randomly sampled and used to train each decision tree in the forest. Valid range 0 to 1. | 1 |
`max_tree_depth` | Maximum tree depth. A value of zero means unlimited depth. Can be any non-negative number. | 0 |
`features_per_node` | The number of features available for each node split in the decision trees. | Square root of the total number of features in the dataset |
`impurity_threshold` | Valid range >= 0. | 0 |
`bootstrap` | Boolean. | True |
`min_obs_per_leaf_node` | Minimum number of observations in a leaf node. Can be any positive number. | 5 |
`min_obs_per_split_node` | The minimum number of observations required for a node to be split during the construction of a decision tree. Valid range is any positive number. | 2 |
`min_weight_fraction_in_leaf_node` | The minimum fraction of the sum of instance weights required in a leaf node. | 0.0 |
`min_impurity_decrease_in_split_node` | The minimum impurity decrease required to split an internal node. | 0.0 |
`max_leaf_nodes` | Valid range >= 0. | 0 |
`use_histogram` | Use histograms to speed computations. Boolean. | True |
`var_importance_metric` | Options: mean decrease in impurity ('MDI'), mean decrease in accuracy ('MDA'), or 'MDA_scaled', the raw MDA value scaled by its standard deviation. | MDI |
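
As a hedged illustration of how these optional parameters might sit alongside the general ones, the sketch below extends the Florida parcels example. It assumes that the parameter names in the table are accepted in the WITH clause in the same form as CAT_TOP_K. The model name `fl_parcels_rf_tuned` is hypothetical, and the values are arbitrary starting points rather than recommendations.

```sql
-- Sketch only: same training query as the example above, plus three
-- optional random forest parameters taken from the table.
CREATE OR REPLACE MODEL fl_parcels_rf_tuned
OF TYPE RANDOM_FOREST_REG AS
SELECT
  SALEPRC1,      -- predicted variable first
  PARUSEDESC,    -- categorical predictors next
  CNTYNAME,
  ACRES,         -- continuous predictors last
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  fl_res_parcels_2018
WITH
  (CAT_TOP_K=70, EVAL_FRACTION=0.2,
   NUM_TREES=50, MAX_TREE_DEPTH=20, MIN_OBS_PER_LEAF_NODE=10);
```

Changing one parameter at a time makes it easier to attribute any change in accuracy or training time to a specific setting.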

The `num_trees` parameter specifies the number of decision trees to include in the ensemble. Each decision tree in the forest is trained on a random subset of the training data, using a random subset of the features for each split. The predictions from all the trees in the forest are then combined to produce the final output.

`num_trees` mainly controls how much of the variance of the individual trees is averaged away. When the number of trees is small, predictions are noisy, because they depend heavily on the particular random samples drawn for each tree. Adding trees averages out this noise. Unlike making individual trees deeper, adding trees does not by itself cause the forest to overfit; the gains simply level off.

The optimal value for `num_trees` depends on the specific problem and the size of the dataset. In general, increasing the number of trees improves the accuracy of the model up to a point, beyond which further increases yield only marginal improvements. In practice, the value is usually found by trial and error, evaluating the model's performance on a validation set or through cross-validation.

A rule of thumb is to start with a small number, such as 10 (the current default), and then gradually increase it until performance on the validation set stops improving. Increasing `num_trees` also increases the computational cost of training, as each additional tree requires additional time and resources. The choice of `num_trees` should therefore also take into account the computational resources available for training the model.
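
The trial-and-error process described above could be sketched as follows: train a second model with more trees, then compare the mean absolute error of both models in one query. This assumes the default-parameter model is available under the name `fl_parcels_rf` used in the prediction examples, and that NUM_TREES is accepted in the WITH clause in the same form as CAT_TOP_K. The name `fl_parcels_rf_50` is hypothetical.

```sql
-- Second model: identical to the original except for the number of trees.
CREATE OR REPLACE MODEL fl_parcels_rf_50
OF TYPE RANDOM_FOREST_REG AS
SELECT
  SALEPRC1, PARUSEDESC, CNTYNAME,
  ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1
FROM
  fl_res_parcels_2018
WITH
  (CAT_TOP_K=70, EVAL_FRACTION=0.2, NUM_TREES=50);

-- Mean absolute error of each model over the same rows.
SELECT
  AVG(ABS(SALEPRC1 - ML_PREDICT('fl_parcels_rf',
      PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1))) AS mae_10_trees,
  AVG(ABS(SALEPRC1 - ML_PREDICT('fl_parcels_rf_50',
      PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1))) AS mae_50_trees
FROM
  fl_res_parcels_2018;
```

Because this query scores rows that were also used for training, it flatters both models. Running it on rows the models have not seen gives a fairer comparison.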

`obs_per_tree_fraction` determines the proportion of observations (or samples) that are randomly sampled and used to train each decision tree in the forest. Specifically, for each decision tree, a random sample of `obs_per_tree_fraction` × (total number of observations) rows is drawn from the training data with replacement, and the tree is trained on this sample. The remaining observations are not used for training this tree, but can be used for testing or for training other trees.

The parameter trades off the strength of the individual trees against the diversity of the ensemble. With a small value, each tree sees less data, so individual trees are less accurate, but the trees differ more from one another, which makes averaging more effective and training faster. With a large value, each tree is trained on more data and is individually stronger, but the trees are more alike, so averaging removes less of their variance.

The default value of `obs_per_tree_fraction` is 1. Values between 0.5 and 0.7, which train each tree on a random sample of about 50-70% of the observations, are a common alternative starting point. The best value depends on the size of the dataset and the number of features: a large dataset can usually tolerate a smaller value, which also reduces training time, while a small dataset may need a larger value so that each tree sees enough data.

To choose a value for a particular dataset, perform a grid search or random search over a range of values for `obs_per_tree_fraction` and evaluate the performance of the model on a validation set or through cross-validation. The optimal value will depend on the specific requirements of the problem and on the tradeoff between bias and variance that yields the best performance.
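
To make the sample-size arithmetic above concrete, the sketch below first computes roughly how many rows each tree would be trained on at a fraction of 0.6, and then trains a model with that setting. It assumes OBS_PER_TREE_FRACTION is accepted in the WITH clause in the same form as CAT_TOP_K. The model name is hypothetical, and the row estimate ignores any rows withheld through EVAL_FRACTION.

```sql
-- Approximate rows sampled per tree: fraction x total observations.
SELECT
  COUNT(*) AS total_rows,
  0.6 * COUNT(*) AS approx_rows_per_tree
FROM
  fl_res_parcels_2018;

-- Train each tree on about 60% of the observations.
CREATE OR REPLACE MODEL fl_parcels_rf_obs60
OF TYPE RANDOM_FOREST_REG AS
SELECT
  SALEPRC1, PARUSEDESC, CNTYNAME,
  ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1
FROM
  fl_res_parcels_2018
WITH
  (CAT_TOP_K=70, EVAL_FRACTION=0.2, OBS_PER_TREE_FRACTION=0.6);
```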

`max_tree_depth` limits the depth of each decision tree in the forest, that is, the number of levels of splits a tree is allowed to have. The default of 0 means unlimited depth. A smaller value results in shorter, simpler trees that are less prone to overfitting, while a larger value allows the trees to capture more complex patterns in the data but may lead to overfitting.

Setting `max_tree_depth` to a very high value can result in overfitting, where the model fits the training data too closely and fails to generalize well to new, unseen data. On the other hand, setting it to a very low value can result in underfitting, where the model is not complex enough to capture the true underlying patterns in the data.

The optimal value of `max_tree_depth` depends on the specific problem and on the size and complexity of the dataset. A good value is usually found by trial and error, evaluating the model's performance on a validation set or through cross-validation.

In practice, a common approach is to cap `max_tree_depth` at a value such as 10 or 20 and compare validation performance against the unlimited default. Tree growth can also be limited indirectly through the other stopping parameters in the table above, such as `min_obs_per_leaf_node` and `max_leaf_nodes`.

The choice of `max_tree_depth` should also take into account the number of features in the dataset, since more features give each tree more ways to fit noise. A dataset with many features may call for a smaller value to avoid overfitting, while a dataset with few features may benefit from a larger value to capture more complex patterns in the data.
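
As a minimal sketch of capping tree depth, the statement below trains the parcels model with trees limited to 10 levels. It assumes MAX_TREE_DEPTH is accepted in the WITH clause in the same form as CAT_TOP_K; the model name and the value 10 are illustrative only.

```sql
-- Depth-limited variant of the parcels model (0, the default, means unlimited).
CREATE OR REPLACE MODEL fl_parcels_rf_depth10
OF TYPE RANDOM_FOREST_REG AS
SELECT
  SALEPRC1, PARUSEDESC, CNTYNAME,
  ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1
FROM
  fl_res_parcels_2018
WITH
  (CAT_TOP_K=70, EVAL_FRACTION=0.2, MAX_TREE_DEPTH=10);
```

Comparing this model against the unlimited-depth default on held-out rows shows whether the cap helps or merely underfits.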

`features_per_node` controls the number of features available for each node split in the decision trees. Limiting the number of candidate features considered at each split reduces the computational cost of training the model and can also reduce the risk of overfitting.

By default, `features_per_node` is set to the square root of the total number of features in the dataset. This is a commonly used heuristic in random forest modeling that works well in practice. However, the optimal value can vary depending on the specific problem and the characteristics of the dataset.

A larger value of `features_per_node` can lead to more accurate individual trees, as it allows each split to consider more features and capture more complex patterns in the data. However, it also increases the computational cost of training and makes the trees more similar to one another, which increases the risk of overfitting.

Conversely, a smaller value reduces the computational cost of training and the risk of overfitting, but may result in less accurate models that cannot capture all of the important patterns in the data.

In practice, the optimal value of `features_per_node` is usually determined by trial and error, evaluating the performance of the model on a validation set or through cross-validation. It depends on the characteristics of the dataset, such as the number of features, the size of the dataset, and the complexity of the relationships between the features and the target variable.
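
To get a rough sense of the default, one option is to count the features of an existing model with the `random_forest_reg_var_importance` table function described later on this page, which returns one row per feature or categorical sub-feature. The sketch below assumes that this row count matches the feature count used for the square-root default, and that FEATURES_PER_NODE is accepted in the WITH clause in the same form as CAT_TOP_K. The model name and the value 12 are hypothetical.

```sql
-- Estimate the feature count and its square root from an existing model.
SELECT
  COUNT(*) AS n_features,
  SQRT(COUNT(*)) AS approx_default_features_per_node
FROM
  TABLE(random_forest_reg_var_importance('fl_parcels_rf'));

-- Try a value somewhat above the estimated default.
CREATE OR REPLACE MODEL fl_parcels_rf_fpn12
OF TYPE RANDOM_FOREST_REG AS
SELECT
  SALEPRC1, PARUSEDESC, CNTYNAME,
  ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1
FROM
  fl_res_parcels_2018
WITH
  (CAT_TOP_K=70, EVAL_FRACTION=0.2, FEATURES_PER_NODE=12);
```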

`min_weight_fraction_in_leaf_node` specifies the minimum fraction of the sum of instance weights required in a leaf node. In other words, it controls the minimum amount of data that must be present in a leaf during tree building: a split is not made if it would leave a leaf with less than this fraction of the total weight, and the node becomes a leaf instead.

The parameter is useful when the **data set is imbalanced** or when instances have different weights. For example, if only a few observations fall in some range of the target variable, unconstrained trees can create leaves that fit those few observations exactly. Setting `min_weight_fraction_in_leaf_node` to a value greater than zero ensures that a minimum amount of data is present in each leaf, reducing the risk of overfitting.

The appropriate value of this parameter depends on the data set and the specific problem. In general, larger values of `min_weight_fraction_in_leaf_node` result in simpler trees with fewer nodes and higher bias but lower variance. Smaller values lead to more complex trees with lower bias but higher variance. It is recommended to experiment with different values of this parameter to identify the one that works best for a specific problem.
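
As a hedged illustration, if every row carries equal weight (an assumption; the model syntax shown on this page has no weight column), the fraction translates directly into a minimum row count per leaf. The sketch below computes that count for a fraction of 0.001 and then trains a model with it. The parameter spelling in the WITH clause is assumed to follow the CAT_TOP_K pattern, and the model name is hypothetical.

```sql
-- With equal weights, 0.001 of the total weight is 0.1% of the rows.
SELECT
  COUNT(*) AS total_rows,
  0.001 * COUNT(*) AS approx_min_rows_per_leaf
FROM
  fl_res_parcels_2018;

-- Require every leaf to hold at least 0.1% of the total instance weight.
CREATE OR REPLACE MODEL fl_parcels_rf_minweight
OF TYPE RANDOM_FOREST_REG AS
SELECT
  SALEPRC1, PARUSEDESC, CNTYNAME,
  ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1
FROM
  fl_res_parcels_2018
WITH
  (CAT_TOP_K=70, EVAL_FRACTION=0.2, MIN_WEIGHT_FRACTION_IN_LEAF_NODE=0.001);
```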

`min_impurity_decrease_in_split_node` specifies the minimum impurity decrease required to split an internal node. In other words, a node is split only if the best available split improves impurity by at least this amount. The impurity decrease measures how much the split reduces the variability of the target variable within the resulting nodes.

The appropriate value of `min_impurity_decrease_in_split_node` depends on the data set and the specific problem. Larger values lead to shallower trees with fewer splits, and therefore higher bias but lower variance, while smaller values lead to deeper trees with lower bias but higher variance. In general, choose the value that gives good predictive performance on the validation set.

Here are a few examples of typical values of `min_impurity_decrease_in_split_node` for different data sets:

1. If the data set has a large number of features and many of them are unimportant, a relatively high value can be used to filter out noise and focus on the most informative features. A value of 0.01 or higher might be appropriate in such a case.
2. If the data set has a small number of features, or if all features are potentially important, a smaller value might be appropriate. For example, a value of 0.0001 or lower might be suitable.
3. If the data set is noisy or contains a lot of outliers, a higher value might be appropriate to avoid overfitting. A value of 0.1 or higher might be suitable in such a case.
4. If the data set is well-behaved and contains no outliers, a lower value might be appropriate to allow the model to capture more subtle patterns in the data. A value of 0.0001 or lower might be suitable in such a case.
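
One of the guidelines above could be applied as in the following sketch, which uses the 0.01 value from the first case. It assumes the parameter is accepted in the WITH clause in the same form as CAT_TOP_K; the model name is hypothetical.

```sql
-- Only split a node when the split reduces impurity by at least 0.01.
CREATE OR REPLACE MODEL fl_parcels_rf_minimp
OF TYPE RANDOM_FOREST_REG AS
SELECT
  SALEPRC1, PARUSEDESC, CNTYNAME,
  ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1
FROM
  fl_res_parcels_2018
WITH
  (CAT_TOP_K=70, EVAL_FRACTION=0.2, MIN_IMPURITY_DECREASE_IN_SPLIT_NODE=0.01);
```

For regression, impurity is typically tied to the scale of the target variable, so a threshold that matters for a target near 1 may have no visible effect on a target measured in dollars such as SALEPRC1.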

These are only general guidelines; the appropriate value of `min_impurity_decrease_in_split_node` should be determined through experimentation and cross-validation on the specific data set and problem at hand.

It is often useful to understand which specific features are most important in a prediction. While future enhancements may make the syntax for this even easier, a UDTF (user-defined table function) version of feature importance evaluation is available today.

The table function is called **random_forest_reg_var_importance**. It takes a single parameter, which is the random forest model name. Because it is a table function, it must be wrapped within a TABLE() expression in SQL when used within a FROM clause:

```sql
SELECT * FROM TABLE(random_forest_reg_var_importance('fl_parcels_rf'));
```

This returns a table with rows for each feature used in the model. Categorical features are broken into multiple rows, with one row per sub-feature. The importance score units are based on the leaf-level tree construction metric set by `var_importance_metric` (see the parameter table above). While the MDA_scaled metric takes a bit longer to run, you may find its values to be more interpretable.

You can also CTAS the results of the function above:

```sql
CREATE TABLE fl_parcels_rf_importance AS
SELECT * FROM TABLE(random_forest_reg_var_importance('fl_parcels_rf'));
```
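
Once the scores are in a table, they can be expressed relative to the largest one. The query below is a sketch only: the column name `importance_score` is an assumption about the function's output, so check the actual column names with a SELECT * first.

```sql
-- ASSUMPTION: the score column is named importance_score.
-- Express each score as a percentage of the maximum score.
SELECT
  t.*,
  100.0 * t.importance_score / m.max_score AS percent_importance
FROM
  fl_parcels_rf_importance t,
  (SELECT MAX(importance_score) AS max_score
   FROM fl_parcels_rf_importance) m
ORDER BY
  percent_importance DESC;
```

Computing the percentage in a query avoids altering the table, at the cost of recomputing it on each run.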

If you create and update a new **percent_importance** column in the resulting table, you can normalize the feature importance scores by the maximum importance and sort by the result. For example, when we do this with the example parcel data above, we find that the most important model features are total living area, parcel area and effective year built.

To make a prediction using your model, use the ML_PREDICT function. This can be used on the fly, or as part of an UPDATE or CTAS if you want to persist your predictions. The general form is:

```sql
ML_PREDICT('<model_name>', <predictor_1>, <predictor_2> ... <predictor_n>)
```

So in our example above, you'd use:

```sql
ML_PREDICT('fl_parcels_rf', PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1)
```

Note that we did not use the name of the variable to be predicted - that comes from the model itself.
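
The two usage patterns mentioned above, on the fly and persisted with CTAS, might look like the following sketch. The output table name `fl_res_parcels_2018_predicted` and the alias `predicted_price` are hypothetical.

```sql
-- On the fly: compare actual and predicted sale prices for a few parcels.
SELECT
  SALEPRC1,
  ML_PREDICT('fl_parcels_rf',
             PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1) AS predicted_price
FROM
  fl_res_parcels_2018
LIMIT 10;

-- Persisted: store the predictions next to the inputs in a new table.
CREATE TABLE fl_res_parcels_2018_predicted AS
SELECT
  PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1, SALEPRC1,
  ML_PREDICT('fl_parcels_rf',
             PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1) AS predicted_price
FROM
  fl_res_parcels_2018;
```

The predictors are passed in the same order as in the model's SELECT statement, with the categorical columns first.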

The Florida real estate price prediction model above returns a reasonably high R² with only the default random forest regression parameters. About 87% of the variance in sale price can be explained with just the six predictor columns above, taken from the public parcels data available from the Florida Department of Revenue.
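
As a rough way to check a figure like this yourself, R² can be computed with standard aggregates as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual prices from their mean. The query below is a sketch under that definition, using the identity SS_tot = Σy² - n·ȳ². It is not necessarily how the figure above was obtained, and the aliases are hypothetical.

```sql
-- R^2 = 1 - SS_res / SS_tot over rows with both an actual and a predicted price.
SELECT
  1.0 - SUM(POWER(actual - predicted, 2))
        / (SUM(POWER(actual, 2)) - COUNT(*) * POWER(AVG(actual), 2)) AS r2
FROM (
  SELECT
    CAST(SALEPRC1 AS DOUBLE) AS actual,
    ML_PREDICT('fl_parcels_rf',
               PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1) AS predicted
  FROM
    fl_res_parcels_2018
  WHERE
    SALEPRC1 IS NOT NULL
) scored
WHERE
  predicted IS NOT NULL;
```

Scoring rows that were also used for training gives an optimistic value, so a result computed on held-out rows is the more meaningful one.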

The model can be improved by adding auxiliary datasets, such as distance to the coastline, school district quality, or density of high-quality local amenities like parks and restaurants.

Another use case for random forest regression is in customer segmentation, for example predicting customer lifetime value (CLV) for a business. CLV is a measure of the total amount of money a customer is expected to spend on a business's products or services over their entire lifetime as a customer. By segmenting customers into groups with similar CLV predictions, a business can tailor marketing and customer retention strategies to maximize revenue and profitability.

In this use case, the random forest regression model would be trained on historical customer data, such as purchase history, demographic information, and website activity. The model would then be used to predict the CLV for new and existing customers. The model could also identify the most important factors that contribute to CLV, such as customer age, purchase frequency, and product categories.

The random forest model can handle large datasets and complex relationships between variables, making it well-suited for customer segmentation tasks. Additionally, the model can automatically select the most relevant features for prediction, which is especially useful when dealing with high-dimensional data. By using a random forest model for customer segmentation, a business can gain valuable insights into customer behavior and preferences, and make data-driven decisions to improve customer satisfaction and increase revenue.
