Python User-Defined Functions (UDFs) with the Remote Backend Compiler (RBC)

Installation

Install RBC and get started

You can install RBC using conda (recommended) or mamba, which is a faster implementation of conda. For more information, see the Mamba documentation.

conda install -c conda-forge rbc

# or, to install rbc into a new environment, run
conda create -n rbc -c conda-forge rbc
conda activate rbc

# check that rbc installed successfully
python -c 'import rbc; print(rbc.__version__)'

You can also use pip for package management:

pip install rbc-project
# check that rbc installed successfully
python -c 'import rbc; print(rbc.__version__)'

Quick Start

The following assumes that you have an instance of HEAVY.AI running. UDFs and UDTFs are enabled with the flags --enable-runtime-udfs and --enable-table-functions. For more information on installing HEAVY.AI, see Installation.

To summarize:

conda create -n heavy-ai-env heavydb -c conda-forge
conda activate heavy-ai-env
mkdir -p data
initheavy data -f
heavydb --enable-runtime-udfs --enable-table-functions

To inspect the test database, which is provided by default, connect to it from another terminal using:

heavysql --passwd HyperInteractive

The following example shows a simple UDF that converts a numerical temperature from Fahrenheit to Celsius. The code defines the function, registers it, and runs it on the server.

from rbc.heavydb import RemoteHeavyDB

heavy = RemoteHeavyDB(
    user='admin',
    password='HyperInteractive',
    host='localhost',
    port=6274,
    dbname='heavyai',
)

@heavy('double(double)')
def fahrenheit2celsius(f):
    return (f - 32) * 5 / 9

print(fahrenheit2celsius(32))
# 'fahrenheit2celsius(CAST(32 AS DOUBLE))'

# other functions?
...

# after defining all functions, they should be registered
# to the database
heavy.register()

The RemoteHeavyDB instance connects to the HeavyDB server, and the resulting object, heavy, is used to register functions. Next, you define an ordinary Python function, fahrenheit2celsius, and decorate it with heavy, passing the function signature 'double(double)'. With this signature, the decorated function expects a single double-precision floating-point argument and returns a double-precision floating-point value. The syntax is similar to function declarations in C/C++.

After you have defined all the functions you want to be available on the HeavyDB server, register them all at once with heavy.register().

fahrenheit2celsius can now be used in SQL on the HeavyDB server. You can use tools like heavyai or ibis (via the ibis-heavyai backend) to help construct queries from Python. The following example shows a function call from SQL.

SELECT fahrenheit2celsius(col) FROM my_table

The function is then applied element-wise on the column col of the table my_table.
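Conceptually, the SQL call behaves like mapping the Python function over the values of the column. The following plain-Python sketch mimics that behavior without RBC or a running server; the sample values stand in for the contents of col and are made up for illustration.

```python
def fahrenheit2celsius(f):
    return (f - 32) * 5 / 9

# Hypothetical contents of the column `col` in `my_table`
col = [32.0, 212.0, -40.0]

# One output value is produced per input row
converted = [fahrenheit2celsius(value) for value in col]
print(converted)
```

On the server, the same per-row logic runs as compiled code rather than as interpreted Python.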

Registering and Using a Function

Making a function available to HeavyDB (registering it) is based on decorating a Python function. Consider the following simple function, which takes a single argument and returns a single value.

def fahrenheit2celsius(f):
    return (f - 32) * 5 / 9

Registering this function takes three steps:

1. Declare the function’s signature.
2. Attach the signature to the function.
3. Register the function with the database.

Declaring the Signature

Annotate the function with type information to tell RBC how to translate it into an intermediate representation, using the following syntax:

'returnType(inputType1, inputType2, ...)'

The function can only return a single element.

Available types are similar to C types:

[Array,Column[List]][int[8,16,32,64],float[32,64],double,bool]
bytes
void
TextEncoding[None,Dict]
Column<[List]TextEncodingDict>
Cursor

In the types listed, items in brackets indicate options to choose from. For example, [List,Array]Int[8,16] is expanded to mean ListInt8, ArrayInt8, ListInt16, and ArrayInt16. The literals float and int can be abbreviated by f and i, respectively.
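The bracket shorthand can be expanded mechanically. The sketch below is only an illustration of the notation, not part of RBC: expand_options is a hypothetical helper, and it handles flat bracket groups only, so nested groups such as Column[List] inside another group are out of scope.

```python
import itertools
import re

def expand_options(pattern):
    """Expand flat bracket groups such as '[List,Array]Int[8,16]'.

    Nested groups are deliberately not handled in this sketch.
    """
    # Split into literal text and bracketed groups, keeping the groups.
    parts = re.split(r'(\[[^\[\]]*\])', pattern)
    choices = []
    for part in parts:
        if part.startswith('[') and part.endswith(']'):
            # A group contributes one alternative per comma-separated item.
            choices.append(part[1:-1].split(','))
        else:
            # Literal text contributes itself unchanged.
            choices.append([part])
    return [''.join(combo) for combo in itertools.product(*choices)]

names = expand_options('[List,Array]Int[8,16]')
print(names)
```

The result contains the same four names listed above, although in a different order.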

Returning to the function, if you want both the input argument and the output values as doubles, you could write:

from rbc.heavydb import RemoteHeavyDB
heavy = RemoteHeavyDB()
signature = heavy('double(double)')

Templating

What happens when the input is an integer? RBC does not cast input values to the expected types automatically. If you expect multiple input types, RBC supports templating (as in C++, or generics in Rust or Go). Templating allows you to define a type using a variable, like T in this example:

signature = heavy('T(T)', T=['int32', 'double'])

In this example, T can be replaced by int32 or double. This can also be written without using a variable.

signature = heavy('int32(int32)', 'double(double)')

You can also use several template variables. RBC then generates the Cartesian product of the options, so the following example yields four signatures.

signature = heavy('T(Z)', T=['double', 'float'], Z=['int8', 'int32'])
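To see which concrete signatures a template stands for, the expansion can be modeled in plain Python. This is a sketch of the idea only; expand_template is a hypothetical helper and says nothing about how RBC implements templating internally.

```python
import itertools
import re

def expand_template(signature, **options):
    """Return one concrete signature per combination of template values."""
    names = list(options)
    if not names:
        return [signature]
    # Match template names only as whole words, in a single pass.
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, names)) + r')\b')
    expanded = []
    for combo in itertools.product(*(options[name] for name in names)):
        mapping = dict(zip(names, combo))
        expanded.append(pattern.sub(lambda m: mapping[m.group(1)], signature))
    return expanded

single = expand_template('T(T)', T=['int32', 'double'])
pair = expand_template('T(Z)', T=['double', 'float'], Z=['int8', 'int32'])
print(single)
print(pair)
```

The first call produces the two signatures written out explicitly earlier, and the second produces all four combinations of T and Z.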

Attaching the Signature

Once you have the signature, you can attach it to the function. As a best practice, use the signature as a decorator.

@heavy('double(double)')
def fahrenheit2celsius(f):
    return (f - 32) * 5 / 9

Once decorated, the function can no longer be called as an ordinary Python function. It is now “marked” to be registered on the server and used there.

Overloading

RBC supports overloading function definitions. This permits several function implementations using a common identifier, with the execution path determined by specific inputs.

@heavy('double(double, double)')
def fahrenheit2celsius(f, offset):
    return offset + (f - 32) * 5 / 9

@heavy('double(double)')
def fahrenheit2celsius(f):
    return (f - 32) * 5 / 9

Arrays

Both inputs and outputs can be marked as 1D arrays or lists of any type. To indicate an array in the function signature, append brackets ([]) to the type literal.

from rbc.stdlib import array_api
@heavy('double(double[])')
def fahrenheit2celsius(f_array):
    return (array_api.mean(f_array) - 32) * 5 / 9

You can also define an array using the 'Array<double>' syntax.

Some functions with array support are provided. In this example, the imported function rbc.stdlib.array_api.mean computes the mean over an array of inputs f_array. Output arrays are also supported.

rbc.stdlib.array_api.mean is a special function bundled with RBC. In this case, numpy.mean has been overridden for the convenience of users familiar with NumPy’s API. See here for more details and information about supported functions.

To create an array within a function, the class Array must be used to define an empty array. It can then be indexed to be filled. Slicing or complex indexing is not currently supported. If the array is returned, it’s important that the type specified during the array creation matches the return type specified in the function signature.

from numba import types
from rbc.stdlib import Array
@heavy('int64[](int64)')
def create_and_fill_array(size):
    arr = Array(size, types.int64)
    for i in range(size):
        arr[i] = i
    return arr

Standard Python constructors like list, dict, or numpy.array cannot be used to construct arrays supported by RBC. See here for a complete list of array creation functions.

Selecting a Device

You can explicitly select the devices on which a function is allowed to execute by using the keyword argument devices in the decorator when registering the function. The devices argument is a list that can contain 'cpu' and 'gpu'. It indicates which implementations should be available and used. Hence, if the server has no GPU, requesting 'gpu' does not work on that platform.

@heavy('double(double)', devices=['cpu'])
def fahrenheit2celsius(f):
    return (f - 32) * 5 / 9

A function can also be made available on both the CPU and GPU by using devices=['cpu', 'gpu'].

For 'gpu', only NVIDIA GPUs that can handle CUDA instructions are currently supported.

Registering the Function

Once you have defined the functions, with appropriate signatures in the decorator, you have to register them with HeavyDB. This is done automatically if the function is used in the same Python session. If multiple functions are defined in a file and need to be registered for use by another process or user, then you need to register them manually.

heavy.register()

It is less efficient to call RemoteHeavyDB.register() after every function declaration. Instead, use a single call after all functions are defined.

Similarly, you can clean the current session of all previously registered functions. The registration and unregistration of functions take into account only the functions defined in the current session associated with the object heavy.

heavy.unregister()

Using Registered Functions

To use the basic implementation of fahrenheit2celsius:

print(fahrenheit2celsius(32))
# 'fahrenheit2celsius(CAST(32 AS DOUBLE))'

To get the result of the function, you have to explicitly request execution on the server using the execute method:

fahrenheit2celsius(32).execute()
# 0.0

The execute method is a convenience feature; it should not be used in production code. For production code, use heavyai or ibis via the ibis-heavyai backend to compose SQL queries using an ORM-like syntax.
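Printing the call yields a SQL expression rather than a number because the decorated function builds a query fragment that is only evaluated on the server. The toy function below reproduces the string shown above to make that concrete; render_call is a hypothetical helper, and RBC's actual rendering logic may differ.

```python
def render_call(name, arg_types, args):
    """Build a SQL call expression, casting each argument to its SQL type."""
    casts = [
        f"CAST({value} AS {sql_type})"
        for value, sql_type in zip(args, arg_types)
    ]
    return f"{name}({', '.join(casts)})"

expr = render_call("fahrenheit2celsius", ["DOUBLE"], [32])
print(expr)

# The expression can then be embedded in a full query string.
query = f"SELECT {expr}"
print(query)
```

Nothing is computed until a query containing the expression is sent to the database.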

User-Defined Table Functions

RBC supports user-defined table functions (UDTFs). UDTFs can access multiple rows of a table column concurrently.

The signature of a UDTF is different from that of a UDF. The signature contains the input columns and the output columns specified by their respective types. There can be any number of input and output columns; the only constraint is that the input columns must be declared before (that is, to the left of) the output columns in the function signature. The signature is declared using UDTF(); the number of arguments inside represents the total number of input and output columns.

By default, the output columns are named out0, out1, .... It's possible to use aliases to reference input and output columns in further SQL constructions.

The maximum number of rows in a table column that a UDTF can handle is the maximum value of an int32 (2**31 - 1, or 2,147,483,647).

Consider a UDTF fahrenheit2celsius defined on a table with one input column. If its final line is return 5, a table with only 5 rows is returned.

Although the function returns 5 rows, this does not mean that only 5 rows are processed. If the input has fewer than 5 rows, the output is padded with the value 0. If it has more than 5 rows, the function still iterates over all of them and returns only the first 5 elements.
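The padding and truncation behavior can be modeled in plain Python. This is a sketch under stated assumptions, not RBC code: fahrenheit2celsius_udtf is a hypothetical stand-in, and the zero-initialized output buffer sized to hold all rows is an assumption made so the example runs without a server.

```python
def fahrenheit2celsius_udtf(inp, returned_rows=5):
    """Model of a UDTF that processes every input row but returns 5 rows."""
    size = len(inp)
    # Assumption: the output buffer starts zero-filled and is large enough
    # for either every input row or the returned rows.
    out = [0.0] * max(size, returned_rows)
    for i in range(size):
        out[i] = (inp[i] - 32) * 5 / 9
    # Only the first `returned_rows` rows reach the output table.
    return out[:returned_rows]

short = fahrenheit2celsius_udtf([32.0, 212.0])
long = fahrenheit2celsius_udtf([32.0, 212.0, 32.0, 212.0, 32.0, 212.0, 32.0])
print(short)
print(long)
```

With two input rows the result is padded with zeros up to five rows; with seven input rows all seven are converted but only the first five are kept.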

If the number of rows in the output table from a UDTF needs to be adapted at runtime, the function set_output_row_size from the module rbc.externals.heavydb is required. The function must be called before any assignment on output columns.

While the return value from a UDTF controls the number of rows in the output table, there are no restrictions on the assumed number of rows in the corresponding input table. The whole column (again, up to the int32 maximum number of rows) is loaded whenever the function executes. As with any SQL function, the number of rows passed to a UDTF can be limited using SQL keywords like LIMIT or WHERE.

Cursors

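In SQL, cursors are used to declare temporary memory for storing database tables. In particular, UDTFs use cursors as inputs. A SQL request using a UDTF wraps the input query in a cursor, for example SELECT * FROM TABLE(fahrenheit2celsius(CURSOR(SELECT col FROM my_table))).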

You can also make the cursor explicit in the signature of the previous UDTF, for example 'UDTF(Cursor(Column<int32>), OutputColumn<int32>)'.

For convenience, when a single cursor is used, you do not need to specify cursors in the definition of the UDTF. When multiple cursors are needed in a SQL query, including the literal Cursor in the UDTF definition as shown above is required.

Table Function Manager

Using the argument TableFunctionManager in the signature of a UDTF enables parallel execution of table functions. Without this argument, table functions are executed on a single thread; more importantly, the execution is not thread-safe. To enable threaded execution, the function signature must include the extra TableFunctionManager argument, and set_output_row_size must be called on the manager to ensure thread safety.

Column Lists

Instead of declaring a parameter per column, it is possible to group columns into a list using ColumnList. For example, a UDTF can take a ColumnList<double> and return the mean over each column. It's possible to have multiple ColumnList parameters. Two helper attributes, ColumnList.nrows and ColumnList.ncols, give the number of rows and columns, respectively.
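The per-column mean can be checked in plain Python before writing the UDTF. In this sketch a list of lists stands in for a ColumnList, and column_means_in_celsius is a hypothetical helper; it assumes every column has the same, non-zero number of rows.

```python
def column_means_in_celsius(columns):
    """Return the mean of each column, converted from Fahrenheit to Celsius."""
    ncols = len(columns)
    nrows = len(columns[0]) if ncols else 0
    out = [0.0] * ncols
    for i in range(ncols):
        total = 0.0
        for j in range(nrows):
            total += columns[i][j]
        # Convert the column mean, one output row per input column.
        out[i] = (total / nrows - 32) * 5 / 9
    return out

result = column_means_in_celsius([[32.0, 212.0], [50.0, 50.0]])
print(result)
```

The output has one row per input column, which is why the number of columns is the natural output row size for such a UDTF.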

Supported Functions

Python Grammar

When using functions, a common pitfall is to have type mismatch errors. Casting rules are less forgiving than in Python and types have to be carefully handled.

NumPy and Others

The list of supported functions is always growing. Most are overridden versions of functions from NumPy or the built-in math module. These functions are defined in rbc.stdlib; to get the full list, inspect that module, for example by printing rbc.stdlib.__all__ and rbc.stdlib.array_api.__all__.

Numba

External

The module rbc.external describes functions known to the server. These server-side functions can be used when constructing new UDFs or UDTFs through the function rbc.external.external. For example, log2 is a function known to the database server. To use log2 in a UDF or UDTF defined with RBC, it must be typed using a C-like syntax similar to the one used when decorating functions for RBC.

RBC API Reference

For information about the Remote Backend Compiler API, see the RBC API Reference.

User-Defined Table Functions

By default, a UDTF that has a variable number of rows in the output table is not thread-safe. To work around this constraint, use a TableFunctionManager.

Cursors

In SQL, cursors are used to declare temporary memory for storing database tables. In particular, UDTFs use cursors as inputs. Here is a SQL request using a UDTF:

SELECT * FROM TABLE(fahrenheit2celsius(CURSOR(SELECT col FROM my_table)))

You can also define the signature with the cursor made explicit in the previous UDTF as follows:

@heavy('UDTF(Cursor(Column<int32>), OutputColumn<int32>)')
def fahrenheit2celsius(inp, output):
    ...

Table Function Manager

from rbc.externals.heavydb import set_output_row_size

@heavy('UDTF(TableFunctionManager, Column<int32>, OutputColumn<int32>)')
def fahrenheit2celsius(mgr, inp, output):
    ...  # compute size, the number of output rows
    # size the output through the manager before writing to output
    mgr.set_output_row_size(size)
    return size

Column Lists

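from rbc.externals.heavydb import set_output_row_size

@heavy('UDTF(ColumnList<double>, OutputColumn<double>)')
def fahrenheit2celsius(inp, out):
    ncols = inp.ncols
    nrows = inp.nrows

    # one output row per input column
    set_output_row_size(ncols)

    for i in range(ncols):
        col = inp[i]
        # accumulate the mean of column i
        out[i] = 0.
        for j in range(nrows):
            out[i] += col[j]
        out[i] /= nrows
        # convert the mean from Fahrenheit to Celsius
        out[i] = (out[i] - 32) * 5 / 9

    # return only after every column has been processed
    return ncols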

Supported Functions

Python Grammar

RBC uses the Numba Python compiler internally. As a result, RBC inherits some limitations in syntax and features from Numba. Specifically, Numba's nopython mode is used, which means that certain Python objects or class constructions have no support or, at best, limited support. This includes, but is not limited to, list comprehensions and slicing or complex indexing (e.g., [:], [-1], [1:6], [::2]).

When using functions, a common pitfall is to have type mismatch errors. Casting rules are less forgiving than in Python and types have to be carefully handled.

NumPy and Others

from rbc import stdlib
print(stdlib.__all__)
print(stdlib.array_api.__all__)

Numba

Because RBC uses Numba internally, it also supports calling Numba functions within RBC functions. For example, the function fahrenheit2celsius_numba below is decorated with numba.njit and called from within fahrenheit2celsius.

from numba import njit

@njit
def fahrenheit2celsius_numba(f):
    return (f - 32) * 5 / 9

@heavy('double(double)')
def fahrenheit2celsius(f):
    return fahrenheit2celsius_numba(f)

External

The declaration passed to external uses the following C-like syntax:

"output_type function(input_types)"

from rbc.external import external

log2 = external("double log2(double)")

@heavy("double(double)")
def log2_heavy(x):
    return log2(x)
