HeavyML Overview

Summary

HeavyML is a framework (currently in beta) that allows users to leverage HEAVY.AI's lightning-fast SQL to power machine learning workflows.

Why use HeavyML?

The line between data analytics, data science, and machine learning has become increasingly blurred. Use cases such as predictive analytics, anomaly detection and explanation, classification, and AI-assisted data cleansing are increasingly mainstream analytics workloads.

However, for many users trained in traditional analytics approaches such as SQL or visualization, it can be difficult to leverage machine learning techniques to drive more advanced use cases. Furthermore, even with the fastest data platforms like HEAVY.AI, extracting massive amounts of data from a database into a Python notebook to run CPU-based machine learning approaches can be prohibitively slow, or, worse still, can run out of memory or hit other operational snags.

HeavyML takes a new approach to these issues by allowing users to leverage intuitive, native SQL and visualization capabilities in the HEAVY.AI platform to perform machine learning and predictive analytics operations directly in-database. This provides several advantages to end users:

  1. Users can tap into their existing SQL knowledge to orchestrate formerly complex machine learning workflows.

  2. Pre-ML ELT (Extract-Load-Transform) and data cleansing are a breeze, as data can be filtered, grouped, and manipulated directly in the same SQL query that launches the ML training or inference operations (see the sketch after this list).

  3. Significant performance gains are achieved by keeping the relevant data in-database, avoiding the overheads of transferring and marshaling the data to other processes and formats for ML processing. In addition, HeavyML takes full advantage of the massive CPU and GPU parallelism the HEAVY.AI platform is known for, leading to orders-of-magnitude speedups for some operations.
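
As a minimal sketch of points 2 and 3, the same SELECT that filters and cleanses the data can feed directly into model training via the CREATE MODEL command described in the Capabilities Overview below. The table, column, and model names are purely illustrative, and the LINEAR_REG model-type keyword and the target-column-first convention are assumptions; confirm the exact syntax against the HeavyML reference.

    -- Illustrative only: cleanse and filter the training data in the same
    -- statement that trains the model; no data ever leaves the database.
    CREATE MODEL tip_model OF TYPE LINEAR_REG AS
    SELECT
      tip_amount,       -- target (assumed to be the first projected column)
      trip_distance,    -- continuous numeric predictor
      passenger_count,  -- continuous numeric predictor
      payment_type      -- categorical text predictor (one-hot encoded automatically)
    FROM taxi_trips
    WHERE fare_amount > 0                    -- in-database cleansing
      AND trip_distance BETWEEN 0.1 AND 100; -- and filtering, in plain SQL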

Capabilities Overview

Note that HeavyML, as a beta capability, is being iterated on rapidly; as such, features will continue to be added at a fast pace for the foreseeable future.

  • Clustering Support

    • Two clustering models are currently supported: KMeans and DBSCAN. Clustering is performed by calling the associated table functions, kmeans and dbscan (see the clustering example at the end of this section).

  • Regression Support

    • Four regression models are currently supported: linear regression, random forest regression, gradient boosting tree (GBT) regression, and decision tree regression.

    • Both categorical text and continuous numeric input features (predictors) are supported. Categorical features are automatically one-hot encoded.

    • Models can be created via a new CREATE MODEL command, and inference can be performed row-wise with a new ML_PREDICT method (see the regression example at the end of this section).

    • Inference using ML_PREDICT will run on GPU if available, while model creation/training currently executes multi-threaded on CPU (GPU model training may be supported in the future).

    • Convenience methods are defined to extract the coefficients of linear regression models, the variable importance scores of random forest models, and the R2 score of all regression models.
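
The clustering example referenced above could look like the sketch below. The cursor-over-a-query form is the general way HEAVY.AI table functions are invoked, but the argument order and meanings shown here (cluster count, iteration count) are assumptions; consult the kmeans and dbscan reference pages for their actual signatures.

    -- Illustrative KMeans call: cluster customer sites on two numeric features.
    -- The trailing arguments are assumed to be the cluster count and the
    -- maximum number of iterations.
    SELECT *
    FROM TABLE(
      kmeans(
        CURSOR(SELECT site_id, longitude, latitude FROM customer_sites),
        5,   -- assumed: number of clusters
        10   -- assumed: maximum iterations
      )
    );

The dbscan table function follows the same cursor-based pattern, presumably taking density parameters (an epsilon radius and a minimum neighbor count) in place of a fixed cluster count.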
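
The regression example referenced above, again as a hedged sketch: a model is trained with CREATE MODEL and then applied row-wise with ML_PREDICT. The RANDOM_FOREST_REG model-type keyword, the quoted model name passed as the first argument to ML_PREDICT, and the assumption that predictors are passed in training order are illustrative choices here; the HeavyML reference gives the exact syntax.

    -- Illustrative only: train a random forest regression model in-database.
    CREATE MODEL fare_model OF TYPE RANDOM_FOREST_REG AS
    SELECT fare_amount, trip_distance, passenger_count, payment_type
    FROM taxi_trips
    WHERE fare_amount > 0;

    -- Row-wise inference with ML_PREDICT; this runs on GPU when one is available.
    SELECT
      trip_id,
      ML_PREDICT('fare_model', trip_distance, passenger_count, payment_type)
        AS predicted_fare
    FROM taxi_trips;

Using RANDOM_FOREST_REG here rather than LINEAR_REG simply illustrates a second supported model type; the training and inference workflow is the same for all four regression models.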