HEAVY.AI Docs
v6.4.3
v6.4.3
  • Welcome to HEAVY.AI Documentation
  • Overview
    • Overview
    • Release Notes
  • Installation and Configuration
    • System Requirements
      • Hardware Reference
      • Software Requirements
    • Installation
      • Free Version
      • Installing on Rocky Linux / RHEL
        • HEAVY.AI Installation on RHEL
        • Install NVIDIA Drivers and Vulkan on Rocky Linux and RHEL
      • Installing on Ubuntu
        • HEAVY.AI Installation on Ubuntu
        • Install NVIDIA Drivers and Vulkan on Ubuntu
      • Installing on Docker
        • HEAVY.AI Installation using Docker on Ubuntu
      • Getting Started on AWS
      • Getting Started on GCP
      • Getting Started on Azure
      • Upgrading
        • Upgrading HEAVY.AI
        • Upgrading from Omnisci to HEAVY.AI 6.0
        • CUDA Compatibility Drivers
      • Uninstalling
      • Ports
    • Services and Utilities
      • Using Services
      • Using Utilities
    • Configuration Parameters
      • Overview
      • Configuration Parameters for HeavyDB
      • Configuration Parameters for HEAVY.AI Web Server
    • Security
      • Roles and Privileges
      • Connecting Using SAML
      • Implementing a Secure Binary Interface
      • Encrypted Credentials in Custom Applications
      • LDAP Integration
    • Distributed Configuration
  • Loading and Exporting Data
    • Supported Data Sources
      • Kafka
      • Using Heavy Immerse Data Manager
      • Importing Geospatial Data
    • Command Line
      • Loading Data with SQL
      • Exporting Data
  • SQL
    • Data Definition (DDL)
      • Datatypes
      • Users and Databases
      • Tables
      • System Tables
      • Views
      • Policies
    • Data Manipulation (DML)
      • SQL Capabilities
        • ALTER SESSION SET
        • ALTER SYSTEM CLEAR
        • DELETE
        • EXPLAIN
        • INSERT
        • KILL QUERY
        • LIKELY/UNLIKELY
        • SELECT
        • SHOW
        • UPDATE
        • Arrays
        • Logical Operators and Conditional and Subquery Expressions
        • Table Expression and Join Support
        • Type Casts
      • Geospatial Capabilities
      • Functions and Operators
      • System Table Functions
        • generate_random_strings
        • generate_series
        • tf_compute_dwell_times
        • tf_feature_self_similarity
        • tf_feature_similarity
        • tf_geo_rasterize
        • tf_geo_rasterize_slope
        • tf_graph_shortest_path
        • tf_graph_shortest_paths_distances
        • tf_load_point_cloud
        • tf_mandelbrot*
        • tf_point_cloud_metadata
        • tf_raster_contour_lines; tf_raster_contour_polygons
        • tf_raster_graph_shortest_slope_weighted_path
        • tf_rf_prop_max_signal (Directional Antennas)
        • ts_rf_prop_max_signal (Isotropic Antennas)
        • tf_rf_prop
      • Window Functions
      • Reserved Words
      • SQL Extensions
  • Heavy Immerse
    • Introduction to Heavy Immerse
    • Admin Portal
    • Control Panel
    • Working with Dashboards
      • Dashboard List
      • Creating a Dashboard
      • Configuring a Dashboard
      • Duplicating and Sharing Dashboards
    • Measures and Dimensions
    • Using Parameters
    • Using Filters
    • Chart Animation
    • Multilayer Charts
    • SQL Editor
    • Customization
    • Chart Types
      • Overview
      • Bar
      • Bubble
      • Choropleth
      • Combo
      • Cross-Section
      • Contour
      • Gauge
      • Geo Heatmap
      • Heatmap
      • Histogram
      • Line
      • Linemap
      • New Combo
      • Number
      • Pie
      • Pointmap
      • Scatter Plot
      • Skew-T
      • Stacked Bar
      • Table
      • Text Widget
      • Wind Barb
  • HeavyRF
    • Introduction to HeavyRF
    • Getting Started
    • HeavyRF Table Functions
  • HeavyConnect
    • HeavyConnect Release Overview
    • Getting Started
    • Best Practices
    • Examples
    • Command Reference
    • Parquet Data Wrapper Reference
    • ODBC Data Wrapper Reference
  • Python / Data Science
    • Data Science Foundation
    • JupyterLab Installation and Configuration
    • Using HEAVY.AI with JupyterLab
    • Python User-Defined Functions (UDFs) with the Remote Backend Compiler (RBC)
      • Installation
      • Registering and Using a Function
      • User-Defined Table Functions
      • RBC UDF/UDTF Example Notebooks
      • General UDF/UDTF Tutorial Notebooks
      • RBC API Reference
    • Ibis
    • Interactive Data Exploration with Altair
    • Additional Examples
      • Forecasting with HEAVY.AI and Prophet
  • APIs and Interfaces
    • Overview
    • heavysql
    • Thrift
    • JDBC
    • ODBC
    • Vega
      • Vega Tutorials
        • Vega at a Glance
        • Getting Started with Vega
        • Getting More from Your Data
        • Creating More Advanced Charts
        • Using Polys Marks Type
        • Vega Accumulator
        • Using Transform Aggregation
        • Improving Rendering with SQL Extensions
      • Vega Reference Overview
        • data Property
        • projections Property
        • scales Property
        • marks Property
      • Migration
        • Migrating Vega Code to Dynamic Poly Rendering
      • Try Vega
    • RJDBC
    • SQuirreL SQL
    • heavyai-connector
  • Tutorials and Demos
    • Loading Data
    • Using Heavy Immerse
    • Hello World
    • Creating a Kafka Streaming Application
    • Getting Started with Open Source
    • Try Vega
  • Troubleshooting and Special Topics
    • FAQs
    • Troubleshooting
    • Vulkan Renderer
    • Optimizing
    • Known Issues and Limitations
    • Logs and Monitoring
    • Archived Release Notes
      • Release 5.x
      • Release 4.x
      • Release 3.x
Powered by GitBook
On this page
  • Cursors
  • Table Function Manager
  • Column Lists
  • Supported Functions
  • Python Grammar
  • NumPy and Others
  • Numba
  • External
Export as PDF
  1. Python / Data Science
  2. Python User-Defined Functions (UDFs) with the Remote Backend Compiler (RBC)

User-Defined Table Functions

RBC supports user-defined table functions. UDTFs can access multiple rows of a table column concurrently.

The signature of a UDTF is different from a UDF. The signature contains the input columns and the output columns specified by their respective types. There can be any number of input and output columns; the only constraint is that the input columns must be declared before (i.e., to the left) the output columns in the function signature. The signature, then, is declared using UDTF(); the number of arguments inside represent the total number of input and output columns.

'UDTF(Column[List]<Type>, ..., OutputColumn<Type>, ...)'

By default, the output columns are named out0, out1, .... It's possible to use aliases to reference input and output columns in further SQL constructions. For example:

'UDTF(Column<double> input_alias, OutputColumn<double> output_alias)'

The maximum number of rows on table columns that a UDTF can handle corresponds to the maximum value of an int32 (2**32/2).

In the following example, the UDTF fahrenheit2celcius is defined on a table with one column as input. The final line return 5 means that a table with only 5 rows is returned.

@heavy('UDTF(Column<float>, OutputColumn<float>)')
def fahrenheit2celsius(inp, out):
    size = len(inp)
    for i in range(size):
        out[i] = (inp[i] - 32) * 5 / 9
    return 5

Although the function returns 5 rows, it does not mean that only 5 rows are going to be processed by the function. If size<5, the output is padded with the value 0. And if size>5, the function will still iterate on all the rows while just returning the first 5 elements.

If the number of rows in the output table from a UDTF needs to be adapted at runtime, the function set_output_row_size from the module rbc.externals.heavydb is required. The function must be called before any assignment on output columns.

from rbc.externals.heavydb import set_output_row_size


@heavy('UDTF(Column<float>, OutputColumn<float>)')
def fahrenheit2celsius(inp, output):
    size = ...
    set_output_row_size(size)
    ...
    return size

While the return value from a UDTF controls the number of rows in the output table, there are no restrictions on the assumed number of rows in the corresponding input table. The whole column—again, up to int32rows—will be loaded whenever the function executes. As with any SQL function, limits on number of rows in tables associated with a UDTF can be set using SQL keywords like LIMIT or WHERE.

Cursors

In SQL, cursors are used to declare temporary memory for storing database tables. In particular, UDTFs use cursors as inputs. Here is a SQL request using a UDTF:

select * from table(fahrenheit2celsius(cursor(select col from table)))

You can also define the signature with the cursor made explicit in the previous UDTF as follows:

@heavy('UDTF(Cursor(Column<int32>), OutputColumn<int32>)')
def fahrenheit2celsius(inp, output):
    ...

For convenience, when a single cursor is used, you do not need to specify cursors in the definition of the UDTF. When multiple cursors are needed in a SQL query, including the literal Cursor in the UDTF definition as shown above is required.

Table Function Manager

Using the argument TableFunctionManager in the signature of a UDTF enables parallel execution of table functions. Without this argument, table functions are executed on a single thread; more importantly, the execution is not thread-safe. To enable threaded execution, the function signature the extra argument for the TableFunctionManager and the function set_output_row_size must be called on the manager to ensure thread safety.

from rbc.externals.heavydb import set_output_row_size


@heavy('UDTF(TableFunctionManager, Column<int32>, OutputColumn<int32>)')
def fahrenheit2celsius(mgr, inp, output):
    ... 
    mgr.set_output_row_size(size)
    return size

Column Lists

Instead of declaring a parameter per column, it is possible to group columns into a list using ColumnList. In the following example, the mean over each column is returned. It's possible to have multiple ColumnList parameters. Two helper attributes are available to get the number of rows and column, respectively ColumnList.nrows and ColumnList.ncols.

@heavy('UDTF(ColumnList<double>, OutputColumn<double>)')
def fahrenheit2celsius(inp, out):
    ncols = inp.ncols
    nrows = inp.nrows

    set_output_row_size(ncols)

    for i in range(ncols):
        col = inp[i]
        out[i] = 0.
        for j in range(nrows):
            out[i] += col[j]
        out[i] /= nrows

        out[i] = (out[i] - 32) * 5 / 9

        return ncols

Supported Functions

Python Grammar

When using functions, a common pitfall is to have type mismatch errors. Casting rules are less forgiving than in Python and types have to be carefully handled.

NumPy and Others

The list of supported functions is always growing. Most functions are overwritten versions of functions from NumPy or the builtin math module. These functions are defined in rbc.stdlib, so, to get the full list of supported functions, inspect that module:

from rbc import stdlib
print(stdlib.__all__)
print(stdlib.array_api.__all__)

Numba

from numba import njit

@njit
def fahrenheit2celsius_numba(f):
    return (f - 32) * 5 / 9

@heavy('double(double)')
def fahrenheit2celsius(f):
    return fahrenheit2celsius_numba(f)

External

The module rbc.external describes functions known to the server. Those functions on the server can be used when constructing new UDFs or UDTFs by using the function rbc.external.external. In the following example, log2 is a function which is known on by the database server. To use log2 with a UDF or a UDTF defined using RBC, it needs to be typed using a C-like syntax similar to the one used when decorating functions for RBC.

"output_type function(input_types)"
from rbc.external import external

log2 = external("double log2(double)")

@heavy("double(double)")
def log2_heavy(x):
    return log2(x)
PreviousRegistering and Using a FunctionNextRBC API Reference

Last updated 2 years ago

By default, a UDTF that has a variable number of rows in the output table is not thread-safe. To work around this constraint, use a .

The package RBC makes use of the Python compiler internally. As a result, RBC inherits some limitations in syntax and features from Numba. Specifically, the mode of Numba is used which means that certain Python objects or class constructions have no—or, at best have limited—support. This includes—and is not limited to—list comprehensions, slicing or complex indexing (e.g., [:], [-1], [1:6], [::2]).

Because RBC internally makes use of , RBC also supports the usage of Numba functions within RBC functions. For example, the function fahrenheit2celsius_numba embedded within fahrenheit2celsius has been decorated with numba.njit when fahrenheit2celsius is defined.

TableFunctionManager
Numba
nopython
Numba