Random Forest Regression
Robust Predictive Modeling of Non-Linear Phenomena
Overview
Random forest regression is an ensemble machine learning technique for regression tasks. In a random forest model, a large number of decision trees are constructed using randomly selected subsets of the training data and features. The individual trees are then combined to form a consensus prediction, which tends to be more accurate than any individual tree. This approach also helps to reduce overfitting, a common problem in machine learning where a model is too closely tailored to the training data and performs poorly on new data.
Compared to linear regression, which is a simple and interpretable method for modeling linear relationships between variables, random forests are more flexible and can model nonlinear relationships between variables. Additionally, random forests can handle a large number of features and can identify important features for prediction. Overall, random forests are a powerful and accessible machine learning tool, even for users without prior background in machine learning.
General Syntax
The model syntax follows general HeavyML conventions. You need to specify a SQL-legal model name, the type RANDOM_FOREST_REG, and a SELECT statement indicating the columns to use. The column projection must list the predicted variable first, then categorical columns, and then continuous ones. Optionally, you can add a WITH clause to adjust categorical column processing using the same CAT_TOP_K and EVAL_FRACTION parameters discussed above.
For example, to predict Florida real estate price given parcels data:
CREATE OR REPLACE MODEL fl_parcels_rf
OF TYPE RANDOM_FOREST_REG AS
SELECT
SALEPRC1,
PARUSEDESC,
CNTYNAME,
ACRES,
TOTLVGAREA,
EFFYRBLT,
SALEYR1
FROM
fl_res_parcels_2018
WITH
(CAT_TOP_K=70, EVAL_FRACTION=0.2)
In this example,
SALEPRC1 is the sales price to be predicted.
PARUSEDESC is a categorical column describing Florida land use types such as single family versus condominium. CNTYNAME is a categorical column of county names. These two columns are specified first among the predictors because they are categorical.
Four continuous-value columns are provided: ACRES, TOTLVGAREA, EFFYRBLT, and SALEYR1. These indicate the parcel size in acres, its total living area, effective year built (reset for major remodeling), and sale year.
There are 67 counties in Florida, and we want this to be a statewide model. County is always a potentially significant variable in price predictions, so we set CAT_TOP_K to 70, comfortably above the county count and well above its default of 10. If we wanted to increase initial model creation speed, we could keep the default while experimenting with the variables used; however, this would come at the risk of fitting the model only on the most-populated counties, whose price prediction properties presumably vary.
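Before setting CAT_TOP_K, it can be useful to check the actual cardinality of each categorical column. For example, a quick distinct count against the parcels table shows whether the default of 10 would truncate the category list:

SELECT
COUNT(DISTINCT PARUSEDESC) AS n_land_uses,
COUNT(DISTINCT CNTYNAME) AS n_counties
FROM
fl_res_parcels_2018;

If a categorical column has more distinct values than CAT_TOP_K, only its most frequent categories are encoded individually, so choosing CAT_TOP_K at or above the true cardinality keeps all categories in the model.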
Model-Specific Parameters
In addition to the general SQL modeling parameters demonstrated above, random forest regressions allow control of several other optional parameters. These are also provided at model creation, as a comma-separated list of parameter=value pairs in the WITH clause. We recommend starting with the defaults and then adjusting incrementally. For example, you could increase the number of trees and see whether it improves model accuracy without a large decrease in performance.
num_trees
The number of decision trees to include in the ensemble. Default: 10.
obs_per_tree_fraction
The proportion of observations (samples) randomly sampled to train each decision tree in the forest. Valid range: 0 to 1. Default: 1.
max_tree_depth
Maximum tree depth; 0 means unlimited depth. Valid range: any non-negative integer. Default: 0.
features_per_node
The number of features considered at each node split in the decision trees. Default: the square root of the total number of features in the dataset.
impurity_threshold
The impurity value below which a node is treated as a leaf and not split further. Valid range: >= 0. Default: 0.
bootstrap
Whether to draw the observations for each tree with replacement (bootstrap sampling). Boolean. Default: true.
min_obs_per_leaf_node
Minimum number of observations required in a leaf node. Valid range: any positive integer. Default: 5.
min_obs_per_split_node
The minimum number of observations required for a node to be split during the construction of a decision tree. Valid range: any positive integer. Default: 2.
min_weight_fraction_in_leaf_node
The minimum fraction of the sum of instance weights required in a leaf node. Default: 0.0.
min_impurity_decrease_in_split_node
The minimum impurity decrease required to split an internal node. Default: 0.0.
max_leaf_nodes
The maximum number of leaf nodes per tree; 0 means unlimited. Valid range: >= 0. Default: 0.
use_histogram
Use histograms to speed computations. Boolean. Default: true.
var_importance_metric
The variable importance metric: mean decrease in impurity ('MDI'), mean decrease in accuracy ('MDA'), or 'MDA_scaled' (the raw MDA value scaled by its standard deviation). Default: MDI.
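Following the recommendation above to adjust incrementally, a natural first experiment is to raise num_trees from its default of 10 and compare accuracy. The sketch below assumes the model-specific parameters are passed in the WITH clause alongside the general ones, as in the earlier example; NUM_TREES=50 is an arbitrary illustrative value, not a recommendation:

CREATE OR REPLACE MODEL fl_parcels_rf
OF TYPE RANDOM_FOREST_REG AS
SELECT
SALEPRC1,
PARUSEDESC,
CNTYNAME,
ACRES,
TOTLVGAREA,
EFFYRBLT,
SALEYR1
FROM
fl_res_parcels_2018
WITH
(CAT_TOP_K=70, EVAL_FRACTION=0.2, NUM_TREES=50)

Because CREATE OR REPLACE overwrites the prior model, you can iterate on parameter values and re-evaluate accuracy without renaming anything.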
Selected Detailed Parameter Explanations
Evaluation of Feature Importance
It is often useful to understand which specific features are most important in a prediction. While future enhancements may make the syntax for this even easier, a UDTF version of feature importance evaluation is available today.
The table function is called random_forest_reg_var_importance. It takes a single parameter: the name of the random forest model. Because it is a table function, it must be wrapped within a TABLE() expression in SQL when used within a FROM clause:
SELECT * FROM TABLE(random_forest_reg_var_importance('fl_parcels_rf'));
This returns a table with one row per feature used in the model. Categorical features are broken into multiple rows, with one row per sub-feature. The importance score units are based on the leaf-level tree construction metric set by var_importance_metric. While the MDA_scaled metric takes a bit longer to run, you may find its values to be more interpretable.
You can also CTAS the results of the function above:
CREATE TABLE fl_parcels_rf_importance AS
SELECT * FROM TABLE(random_forest_reg_var_importance('fl_parcels_rf'));
If you create and populate a new percent_importance column in the resulting table, you can normalize the feature importance scores by the maximum importance and sort on that column. For example, when we do this with the example parcel data above, we find the most important model features are total living area, parcel size in acres, and effective year built.
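As a sketch of that normalization, assuming the importance score column returned by the table function is named importance (run a quick SELECT first to confirm the actual column names in your version):

ALTER TABLE fl_parcels_rf_importance ADD COLUMN percent_importance DOUBLE;
UPDATE fl_parcels_rf_importance SET percent_importance = 100.0 * importance / (SELECT MAX(importance) FROM fl_parcels_rf_importance);
SELECT * FROM fl_parcels_rf_importance ORDER BY percent_importance DESC;

This scales the scores so the most important feature reads as 100, making relative comparisons across features easier.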
Making Predictions
To make a prediction using your model, use the ML_PREDICT function. This can be used on-the-fly, or as part of an UPDATE or CTAS if you want to persist your predictions. The general form is:
ML_PREDICT('<model_name>', <predictor_1>, <predictor_2> ... <predictor_n>)
So in our example above, you'd use:
ML_PREDICT('fl_parcels_rf', PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1)
Note that we did not use the name of the variable to be predicted - that comes from the model itself.
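For example, to persist predictions for every parcel via CTAS (the output table name and predicted_price alias here are illustrative):

CREATE TABLE fl_parcels_predictions AS
SELECT
*,
ML_PREDICT('fl_parcels_rf', PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1) AS predicted_price
FROM
fl_res_parcels_2018;

Keeping the original columns alongside the prediction makes it easy to compare predicted_price against the observed SALEPRC1.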
Example Application
The Florida real estate price prediction model above returns a reasonably high r2 using only the default random forest regression parameters: about 87% of the variance in sale price can be explained with just the six variables above, drawn from the public parcels data available from the Florida Department of Revenue.
The model can be improved by adding auxiliary datasets, such as distance to the coastline, school district quality, or density of high-quality local amenities like parks and restaurants.
Other Applications
Another use case for random forest regression is in customer segmentation, for example predicting customer lifetime value (CLV) for a business. CLV is a measure of the total amount of money a customer is expected to spend on a business's products or services over their entire lifetime as a customer. By segmenting customers into groups with similar CLV predictions, a business can tailor marketing and customer retention strategies to maximize revenue and profitability.
In this use case, the random forest regression model would be trained on historical customer data, such as purchase history, demographic information, and website activity. The model would then be used to predict the CLV for new and existing customers. The model could also identify the most important factors that contribute to CLV, such as customer age, purchase frequency, and product categories.
The random forest model can handle large datasets and complex relationships between variables, making it well-suited for customer segmentation tasks. Additionally, the model can automatically select the most relevant features for prediction, which is especially useful when dealing with high-dimensional data. By using a random forest model for customer segmentation, a business can gain valuable insights into customer behavior and preferences, and make data-driven decisions to improve customer satisfaction and increase revenue.