Linear Regression

Summary

Linear regression models are a type of supervised learning algorithm that predict a continuous target variable by fitting a linear function of the input features (a straight line when there is a single feature). They are commonly used for predicting numeric outcomes based on one or more input variables. Advantages of linear regression include simplicity, interpretability, and fast computation time. Disadvantages include the assumption of a linear relationship between the features and the target, which may not always hold, and sensitivity to both outliers and collinear features, which can affect the model's performance and reported accuracy.

Note: If you are not sure that your data meet these requirements, you may want to first do some exploratory visualization and transformation, using scatter plots to inspect pairwise relationships. Alternatively, HeavyML makes it simple to first try, or compare against, a regression model with fewer assumptions, such as Random Forests.
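
To make the outlier sensitivity mentioned in the summary concrete, the following is a minimal Python sketch of ordinary least squares for a single feature. It is independent of HeavyML; the function name `fit_line` and the data are hypothetical and chosen only for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
clean_slope, clean_intercept = fit_line(xs, ys)

# The same data with the last target value replaced by an outlier.
ys_outlier = ys[:-1] + [41.0]
outlier_slope, outlier_intercept = fit_line(xs, ys_outlier)

print(f"clean fit:   slope={clean_slope}, intercept={clean_intercept}")
print(f"with outlier: slope={outlier_slope}, intercept={outlier_intercept}")
```

In this toy case a single extreme value moves the fitted slope from 2 to 8, which is why screening for outliers before training is worthwhile.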

Example

CREATE OR REPLACE MODEL florida_parcels_sale_prc_lr OF TYPE LINEAR_REG AS
SELECT
  SALEPRC1,
  PARUSEDESC,
  CNTYNAME,
  ACRES,
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  florida_parcels_2020 with (CAT_TOP_K=20, EVAL_FRACTION=0.2);
  

Linear Regression Options

With the exception of the general options listed above, the linear regression model type accepts no options.

Model Evaluation

As with all other regression model types, the R2 score of a linear regression model can be obtained via the EVALUATE MODEL command. If the model was created with a specified EVAL_FRACTION, the R2 score on that held-out test set can be obtained as follows:

EVALUATE MODEL florida_parcels_sale_prc_lr;

r2
0.08852338436213192
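
For readers unfamiliar with the metric, the sketch below computes the standard coefficient of determination, R2 = 1 - SS_res / SS_tot, in plain Python. It is assumed, not confirmed by this page, that EVALUATE MODEL reports this same quantity; the function name `r2_score` and the sample values are hypothetical.

```python
def r2_score(actual, predicted):
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical holdout targets.
actual = [100.0, 200.0, 300.0, 400.0]

perfect = r2_score(actual, actual)                        # predictions match exactly
mean_only = r2_score(actual, [250.0] * 4)                 # always predict the mean
worse = r2_score(actual, [400.0, 300.0, 200.0, 100.0])    # reversed predictions

print(perfect, mean_only, worse)
```

A score of 1 means perfect predictions, 0 means the model does no better than always predicting the mean of the target, and negative values mean it does worse than that baseline.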

If no EVAL_FRACTION was specified in the CREATE MODEL command, or if EVAL_FRACTION was specified but you wish to evaluate the model on a different dataset, you can specify the evaluation query explicitly as follows:

EVALUATE MODEL florida_parcels_sale_prc_lr ON
SELECT
  SALEPRC1,
  PARUSEDESC,
  CNTYNAME,
  ACRES,
  TOTLVGAREA,
  EFFYRBLT,
  SALEYR1
FROM
  florida_parcels_2020
WHERE
  PARUSEDESC = 'CONDOMINIUMS';

r2
0.03106805209354901

The relatively low R2 scores obtained for the linear regression model are not atypical for complex multivariate relationships. As noted above, in such cases it is usually worth trying a random forest or Gradient-Boosted Tree (GBT) regression model, as these models can be dramatically more accurate (for the example above, a simple random forest model achieved an R2 score above 0.87).

Model Prediction/Inference

Once a linear regression model is created, it can, like all other regression model types, be used for prediction via the row-wise ML_PREDICT operator, which takes the model name (in quotes) and a list of independent predictor variables semantically matching the ordered list of variables the model was trained on.

SELECT
  SALEPRC1 as actual_sales_price,
  ML_PREDICT(
    'florida_parcels_sale_prc_lr',
    PARUSEDESC,
    CNTYNAME,
    ACRES,
    TOTLVGAREA,
    EFFYRBLT,
    SALEYR1
  ) AS predicted_sales_price
FROM
  florida_parcels_2020
WHERE
  SALEPRC1 BETWEEN 100000 AND 500000
limit
  10;
  
actual_sales_price|predicted_sales_price
211000|-30912.60198199749
152400|6559.390672445297
164000|35608.10665637255
153900|56984.85121244192
143500|52565.25603222847
144000|64931.58916777372
140000|79256.96579062939
160000|90230.21915191412
162000|56753.09885531664
107000|80915.15436685085
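
As a quick sanity check on the ten rows above, the following Python sketch computes the residuals (actual minus predicted) and their mean absolute error. The values are copied from the output above; the variable names are illustrative only.

```python
# (actual_sales_price, predicted_sales_price) pairs copied from the output above.
rows = [
    (211000, -30912.60198199749),
    (152400, 6559.390672445297),
    (164000, 35608.10665637255),
    (153900, 56984.85121244192),
    (143500, 52565.25603222847),
    (144000, 64931.58916777372),
    (140000, 79256.96579062939),
    (160000, 90230.21915191412),
    (162000, 56753.09885531664),
    (107000, 80915.15436685085),
]

residuals = [actual - predicted for actual, predicted in rows]
mean_abs_error = sum(abs(r) for r in residuals) / len(residuals)
underpredicted = sum(1 for r in residuals if r > 0)

print(f"mean absolute error: {mean_abs_error:,.2f}")
print(f"rows where the model underpredicts: {underpredicted} of {len(rows)}")
```

For this sample the model underpredicts in every row, with a mean absolute error of roughly 104,000, which is consistent with the low R2 scores reported earlier. Because the query filters on SALEPRC1, these ten rows are not a representative sample of the whole table.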

Related Methods

A list of predictors for a trained linear regression model, along with their associated coefficients, can be obtained by executing the linear_reg_coefs table function, as shown in the following example:

SELECT * FROM TABLE(linear_reg_coefs(model_name=>'florida_parcels_sale_prc_lr'));

coef_idx|feature|sub_coef_idx|sub_feature|coef
0|intercept|1|NULL|313541950.3062068
1|PARUSEDESC|1|SINGLE FAMILY|-812725.0483721431
1|PARUSEDESC|2|CONDOMINIUMS|145061.7512208006
1|PARUSEDESC|3|VACANT RESIDENTIAL|-696124.6904133513
1|PARUSEDESC|4|MOBILE HOMES|-793766.3806761563
1|PARUSEDESC|5|MULTI-FAMILY - FEWER THAN 10 UNITS|-680141.6707517173
1|PARUSEDESC|6|RESIDENTIAL COMMON ELEMENTS / AREAS|-200166.7979917362
2|CNTYNAME|1|MIAMI-DADE|-14247.32188426969
2|CNTYNAME|2|BROWARD|11489.81580546272
2|CNTYNAME|3|PALM BEACH|4753.91666378783
2|CNTYNAME|4|LEE|-187796.8228951865
2|CNTYNAME|5|HILLSBOROUGH|1906136.710600209
2|CNTYNAME|6|ORANGE|14400.40299565677
2|CNTYNAME|7|PINELLAS|340509.3627915879
2|CNTYNAME|8|DUVAL|-6297.904384635765
2|CNTYNAME|9|POLK|51934.57494268686
2|CNTYNAME|10|BREVARD|-121257.6655694735
2|CNTYNAME|11|VOLUSIA|-14876.60197626388
2|CNTYNAME|12|SARASOTA|-80945.57973416231
2|CNTYNAME|13|COLLIER|-130752.9965216073
2|CNTYNAME|14|PASCO|27699.23056136876
2|CNTYNAME|15|MARION|-60143.98878492894
2|CNTYNAME|16|CHARLOTTE|-71174.58469642041
2|CNTYNAME|17|MANATEE|38203.76868561743
2|CNTYNAME|18|LAKE|45848.48937542497
2|CNTYNAME|19|SEMINOLE|64744.92416669091
2|CNTYNAME|20|OSCEOLA|228911.8974889433
3|ACRES|1|NULL|-7317.375525209893
4|TOTLVGAREA|1|NULL|101.4888021198855
5|EFFYRBLT|1|NULL|3995.56463628857
6|SALEYR1|1|NULL|-158867.0479846989
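
The sketch below shows one plausible way the coefficients above could combine into a prediction. It rests on assumptions this page does not confirm: categorical predictors are one-hot encoded over the listed sub-features, categories outside that list contribute nothing, and numeric predictors are used unscaled. The `predict` helper and the example parcel are hypothetical, and only a subset of the categorical coefficients is copied from the table.

```python
# Coefficients copied from the linear_reg_coefs output above (categoricals are a subset).
INTERCEPT = 313541950.3062068
NUMERIC_COEFS = {
    "ACRES": -7317.375525209893,
    "TOTLVGAREA": 101.4888021198855,
    "EFFYRBLT": 3995.56463628857,
    "SALEYR1": -158867.0479846989,
}
CATEGORICAL_COEFS = {
    "PARUSEDESC": {
        "SINGLE FAMILY": -812725.0483721431,
        "CONDOMINIUMS": 145061.7512208006,
    },
    "CNTYNAME": {
        "MIAMI-DADE": -14247.32188426969,
        "ORANGE": 14400.40299565677,
    },
}

def predict(row):
    """Intercept + numeric terms + one-hot categorical terms (assumed encoding)."""
    total = INTERCEPT
    for feature, coef in NUMERIC_COEFS.items():
        total += coef * row[feature]
    for feature, levels in CATEGORICAL_COEFS.items():
        # Assumption: a category without its own coefficient contributes 0.
        total += levels.get(row[feature], 0.0)
    return total

# Hypothetical parcel, not taken from the florida_parcels_2020 table.
example_row = {
    "PARUSEDESC": "SINGLE FAMILY",
    "CNTYNAME": "ORANGE",
    "ACRES": 0.25,
    "TOTLVGAREA": 1800,
    "EFFYRBLT": 1995,
    "SALEYR1": 2019,
}
print(f"sketched prediction: {predict(example_row):,.2f}")
```

Under these assumptions each numeric coefficient reads as the change in predicted sale price per unit of that feature, for example about 101 per additional unit of TOTLVGAREA. To get the model's actual predictions, use ML_PREDICT rather than this sketch.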