Decision Tree Regression

A simple, interpretable model

In general, we recommend starting with either Linear Regression or Random Forest Regression, depending on data complexity. A single decision tree regression option is available for narrower use cases. One advantage of decision tree regression over random forest regression is its interpretability. A decision tree is an easy-to-understand model that can be visualized and interpreted directly: the tree structure lets you see the sequence of decisions and criteria used to make predictions. This interpretability can be especially valuable in applications where the decision-making process needs to be transparent and understandable, such as in healthcare or finance.

Another advantage of decision tree regression is that it is computationally efficient compared to random forest regression. Because the model is a single tree, it can be built and trained more quickly than the ensemble of trees in a random forest model. This efficiency can matter in real-time applications, or with large data sets where computational resources are limited.

Additionally, decision tree regression can be more appropriate for certain types of data sets or problems. For example, if the target variable has a strong, simple relationship with a small subset of the features, a single decision tree may capture that relationship accurately while remaining far simpler. In such a case, a random forest model might add unnecessary complexity without producing more accurate predictions.

It is important to note, however, that decision tree regression models can also suffer from overfitting and may not perform as well as random forest models in many scenarios.

General Syntax

The model syntax follows general HeavyML conventions. You specify a SQL-legal model name, declare the type DECISION_TREE_REG, and provide a SELECT statement indicating the columns to use. The column projection must list the predicted (target) variable first, followed by categorical columns, and then continuous ones. Optionally, you can add a WITH clause to adjust categorical column processing using the same CAT_TOP_K and EVAL_FRACTION parameters discussed above.
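Putting these pieces together, the statement takes the following general shape (angle-bracketed names are placeholders to fill in; the WITH clause is optional):

CREATE OR REPLACE MODEL <model_name>
OF TYPE DECISION_TREE_REG AS
SELECT
  <target_column>,          -- predicted variable first
  <categorical_column_1>,   -- then categorical predictors
  <continuous_column_1>     -- then continuous predictors
FROM
  <table_name>
WITH
(CAT_TOP_K=<k>, EVAL_FRACTION=<fraction>)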

Example

CREATE OR REPLACE MODEL fl_parcel_price_dt 
OF TYPE DECISION_TREE_REG AS 
SELECT
  SALEPRC1,     -- target variable first
  PARUSEDESC,   -- categorical predictors
  CNTYNAME,
  ACRES,        -- continuous predictors
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM 
  fl_res_parcels_2018
WITH 
(CAT_TOP_K=70, EVAL_FRACTION=0.2)

This model trains in approximately 240 ms, significantly faster than a full random forest regression on the same hardware, which took several seconds. It produces a single-tree model with an r2 of 0.85, which is both useful and simple. While you would not expect this model to be as robust as a random forest outside its initial training domain, it might be acceptable, or even preferred for its speed and explainability, within that domain.

Making Predictions

To make a prediction using your decision tree model, use the ML_PREDICT function. It can be used on the fly, or as part of an UPDATE or CREATE TABLE AS SELECT (CTAS) statement if you want to persist your predictions. The general form is:

ML_PREDICT('<model_name>', <predictor_1>, <predictor_2>, ..., <predictor_n>)

So in our example above, you'd use:

ML_PREDICT('fl_parcel_price_dt', PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1)

Note that we did not pass the name of the variable to be predicted - that comes from the model itself. Only the predictor columns are supplied, in the same order as in the training SELECT.
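For example, a quick on-the-fly comparison of actual and predicted prices might look like the following sketch (the column aliases are illustrative):

SELECT
  SALEPRC1 AS actual_price,
  ML_PREDICT('fl_parcel_price_dt', PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1) AS predicted_price
FROM
  fl_res_parcels_2018

To persist the predictions instead, the same expression can be wrapped in a CTAS; the output table name here is hypothetical:

CREATE TABLE fl_parcel_price_predictions AS
SELECT
  *,
  ML_PREDICT('fl_parcel_price_dt', PARUSEDESC, CNTYNAME, ACRES, TOTLVGAREA, EFFYRBLT, SALEYR1) AS predicted_price
FROM
  fl_res_parcels_2018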