Gradient boosting is a machine learning technique that combines weak learners, here decision trees, into a strong predictor by iteratively minimizing a loss function. The main difference between random forests and gradient boosting lies in how the decision trees are created and aggregated. A random forest trains its trees independently of one another and averages their predictions. Gradient boosting builds its trees additively: each tree is trained one after another to correct the errors of the trees built before it, and the predictions of all trees are summed.
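To make the additive idea concrete, here is a minimal sketch in plain Python. It is not how HeavyML implements GBT_REG; it assumes squared-error loss, a single numeric predictor, and one-split trees ("stumps") as the weak learners. The function names (`fit_stump`, `fit_gbt`, `predict_gbt`) and the `learning_rate` parameter are illustrative choices, not part of any HeavyML API.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Find the one-threshold split on xs that minimizes squared error on residuals.
    Assumes xs contains at least two distinct values."""
    best = None
    # Skip the largest value: splitting there would leave the right side empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = mean(left), mean(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return t, lv, rv

def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def fit_gbt(xs, ys, n_trees=50, learning_rate=0.1):
    """Start from the mean of ys, then add stumps one at a time.
    Each stump is fit to the current residuals, which for squared-error
    loss are the negative gradient of the loss."""
    base = mean(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate.
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict_gbt(model, x):
    base, lr, stumps = model
    return base + lr * sum(predict_stump(s, x) for s in stumps)

def mse(model, xs, ys):
    return mean((y - predict_gbt(model, x)) ** 2 for x, y in zip(xs, ys))

# Toy data: a nonlinear target that a single stump cannot fit.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]
for n in (1, 10, 100):
    print(n, "trees -> training MSE", round(mse(fit_gbt(xs, ys, n_trees=n), xs, ys), 4))
```

On this toy data the training error shrinks as trees are added, because every new stump is fit to what the ensemble still gets wrong. A random forest, by contrast, would fit each tree to the original target on a resampled dataset and average the results.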
Gradient boosting models have several advantages over random forest regression models:
Gradient boosting models can often achieve higher accuracy than random forests: Because each new tree is fit to the errors left by the previous trees, gradient boosting models can learn complex relationships between the target variable and the predictors, which can lead to higher accuracy than random forest models.
Gradient boosting can handle missing data: Many gradient boosting implementations handle missing values during the tree-building process by learning, at each split, which branch rows with a missing value should follow. Many random forest implementations instead require missing data to be imputed before training.
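Gradient boosting offers explicit controls on overfitting: Techniques like early stopping, a small learning rate, and regularization limit how closely the model fits the training data. These controls do need tuning, since a boosted model that keeps adding trees can overfit noisy or high-dimensional data, whereas a random forest relies mainly on averaging many independent trees to limit overfitting.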
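The missing-data point in the list above can be illustrated with a small sketch. It shows one common approach, in which each split learns a default direction for missing values by trying both branches and keeping the better one. Whether HeavyML's GBT_REG works this way is not stated here, so treat this as an assumption. The helper names (`sse`, `best_split_with_missing`) are hypothetical, and `None` stands in for a missing value.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split_with_missing(xs, ys):
    """Return (threshold, missing_goes_left) minimizing squared error.
    Rows whose predictor is None are sent, as a group, to whichever
    side of the split gives the lower error."""
    present = [(x, y) for x, y in zip(xs, ys) if x is not None]
    missing_ys = [y for x, y in zip(xs, ys) if x is None]
    best = None
    for t in sorted({x for x, _ in present})[:-1]:
        left = [y for x, y in present if x <= t]
        right = [y for x, y in present if x > t]
        for missing_left in (True, False):
            l = left + missing_ys if missing_left else left
            r = right if missing_left else right + missing_ys
            score = sse(l) + sse(r)
            if best is None or score < best[0]:
                best = (score, t, missing_left)
    return best[1], best[2]

# Rows with a missing predictor have targets close to the high-x group,
# so the split should route missing values to the right branch.
xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [10.0, 11.0, 50.0, 52.0, 49.0, 51.0]
threshold, missing_left = best_split_with_missing(xs, ys)
print(threshold, missing_left)
```

No imputation step is needed with this approach: the rows with missing values stay in the training set and the tree itself decides where they belong.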
However, gradient boosting models also have some disadvantages compared to random forest models, such as being more computationally expensive and having more hyperparameters to tune. The choice between gradient boosting and random forest models depends on the specific problem and data set, and should be determined through experimentation and cross-validation.
The model syntax follows HeavyML conventions. You need to provide a SQL-legal model name, specify the type GBT_REG, and supply a SELECT statement indicating the columns to use. The SELECT statement must list the predicted variable first, then the categorical columns, and then the continuous ones. Optionally, you can add a WITH clause to adjust categorical column processing using the same CAT_TOP_K and EVAL_FRACTION parameters discussed above.
For our example dataset, the r2 obtained was about 5% lower than with random forest regression, while model building was faster. Your mileage may vary. In general, gradient boosted regression models may perform better than random forest regression in domains where the data has a high degree of complexity, nonlinearity, and noise, such as in image and speech recognition, natural language processing, and financial forecasting. Additionally, gradient boosted models may be more effective when there is a large number of features, and when the target variable is highly imbalanced.
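For readers who want to compare models on the same footing, the r2 figure mentioned above is the coefficient of determination. The sketch below computes it from scratch. The function name `r2_score` and the sample values are illustrative only; they are not taken from the example dataset or from HeavyML.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]
perfect = list(actual)     # a model that predicts every value exactly
mean_only = [6.0] * 4      # a model that always predicts the mean of the actuals

print(r2_score(actual, perfect))    # 1.0
print(r2_score(actual, mean_only))  # 0.0
```

An r2 of 1 means the predictions match the observed values exactly, and an r2 of 0 means the model does no better than always predicting the mean. When comparing two models, compute r2 on the same held-out rows for both.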
To make a prediction using your gradient boosted tree model, use the ML_PREDICT function. You can call it on the fly, or as part of an UPDATE or CREATE TABLE AS SELECT (CTAS) statement if you want to persist your predictions. The general form is:
So in our example above, you'd use:
Note that we did not use the name of the variable to be predicted; that comes from the model itself.