User-Defined Table Functions

RBC supports user-defined table functions. UDTFs can access multiple rows of a table column concurrently.

The signature of a UDTF is different from a UDF. The signature contains the input columns and the output columns specified by their respective types. There can be any number of input and output columns; the only constraint is that the input columns must be declared before (i.e., to the left) the output columns in the function signature. The signature, then, is declared using UDTF(); the number of arguments inside represent the total number of input and output columns.

'UDTF(Column[List]<Type>, ..., OutputColumn<Type>, ...)'

By default, the output columns are named out0, out1, .... It's possible to use aliases to reference input and output columns in further SQL constructions. For example:

'UDTF(Column<double> input_alias, OutputColumn<double> output_alias)'

The maximum number of rows on table columns that a UDTF can handle corresponds to the maximum value of an int32 (2**32/2).

In the following example, the UDTF fahrenheit2celcius is defined on a table with one column as input. The final line return 5 means that a table with only 5 rows is returned.

@heavy('UDTF(Column<float>, OutputColumn<float>)')
def fahrenheit2celsius(inp, out):
    size = len(inp)
    for i in range(size):
        out[i] = (inp[i] - 32) * 5 / 9
    return 5

Although the function returns 5 rows, it does not mean that only 5 rows are going to be processed by the function. If size<5, the output is padded with the value 0. And if size>5, the function will still iterate on all the rows while just returning the first 5 elements.

If the number of rows in the output table from a UDTF needs to be adapted at runtime, the function set_output_row_size from the module rbc.externals.heavydb is required. The function must be called before any assignment on output columns.

from rbc.externals.heavydb import set_output_row_size


@heavy('UDTF(Column<float>, OutputColumn<float>)')
def fahrenheit2celsius(inp, output):
    size = ...
    set_output_row_size(size)
    ...
    return size

While the return value from a UDTF controls the number of rows in the output table, there are no restrictions on the assumed number of rows in the corresponding input table. The whole column—again, up to int32rows—will be loaded whenever the function executes. As with any SQL function, limits on number of rows in tables associated with a UDTF can be set using SQL keywords like LIMIT or WHERE.

By default, a UDTF that has a variable number of rows in the output table is not thread-safe. To work around this constraint, use a TableFunctionManager.

Cursors

In SQL, cursors are used to declare temporary memory for storing database tables. In particular, UDTFs use cursors as inputs. Here is a SQL request using a UDTF:

select * from table(fahrenheit2celsius(cursor(select col from table)))

You can also define the signature with the cursor made explicit in the previous UDTF as follows:

@heavy('UDTF(Cursor(Column<int32>), OutputColumn<int32>)')
def fahrenheit2celsius(inp, output):
    ...

For convenience, when a single cursor is used, you do not need to specify cursors in the definition of the UDTF. When multiple cursors are needed in a SQL query, including the literal Cursor in the UDTF definition as shown above is required.

Table Function Manager

Using the argument TableFunctionManager in the signature of a UDTF enables parallel execution of table functions. Without this argument, table functions are executed on a single thread; more importantly, the execution is not thread-safe. To enable threaded execution, the function signature the extra argument for the TableFunctionManager and the function set_output_row_size must be called on the manager to ensure thread safety.

from rbc.externals.heavydb import set_output_row_size


@heavy('UDTF(TableFunctionManager, Column<int32>, OutputColumn<int32>)')
def fahrenheit2celsius(mgr, inp, output):
    ... 
    mgr.set_output_row_size(size)
    return size

Column Lists

Instead of declaring a parameter per column, it is possible to group columns into a list using ColumnList. In the following example, the mean over each column is returned. It's possible to have multiple ColumnList parameters. Two helper attributes are available to get the number of rows and column, respectively ColumnList.nrows and ColumnList.ncols.

@heavy('UDTF(ColumnList<double>, OutputColumn<double>)')
def fahrenheit2celsius(inp, out):
    ncols = inp.ncols
    nrows = inp.nrows

    set_output_row_size(ncols)

    for i in range(ncols):
        col = inp[i]
        out[i] = 0.
        for j in range(nrows):
            out[i] += col[j]
        out[i] /= nrows

        out[i] = (out[i] - 32) * 5 / 9

        return ncols

Supported Functions

Python Grammar

The package RBC makes use of the Python Numba compiler internally. As a result, RBC inherits some limitations in syntax and features from Numba. Specifically, the nopython mode of Numba is used which means that certain Python objects or class constructions have no—or, at best have limited—support. This includes—and is not limited to—list comprehensions, slicing or complex indexing (e.g., [:], [-1], [1:6], [::2]).

When using functions, a common pitfall is to have type mismatch errors. Casting rules are less forgiving than in Python and types have to be carefully handled.

NumPy and Others

The list of supported functions is always growing. Most functions are overwritten versions of functions from NumPy or the builtin math module. These functions are defined in rbc.stdlib, so, to get the full list of supported functions, inspect that module:

from rbc import stdlib
print(stdlib.__all__)
print(stdlib.array_api.__all__)

Numba

Because RBC internally makes use of Numba, RBC also supports the usage of Numba functions within RBC functions. For example, the function fahrenheit2celsius_numba embedded within fahrenheit2celsius has been decorated with numba.njit when fahrenheit2celsius is defined.

from numba import njit

@njit
def fahrenheit2celsius_numba(f):
    return (f - 32) * 5 / 9

@heavy('double(double)')
def fahrenheit2celsius(f):
    return fahrenheit2celsius_numba(f)

External

The module rbc.external describes functions known to the server. Those functions on the server can be used when constructing new UDFs or UDTFs by using the function rbc.external.external. In the following example, log2 is a function which is known on by the database server. To use log2 with a UDF or a UDTF defined using RBC, it needs to be typed using a C-like syntax similar to the one used when decorating functions for RBC.

"output_type function(input_types)"

from rbc.external import external

log2 = external("double log2(double)")

@heavy("double(double)")
def log2_heavy(x):
    return log2(x)

PreviousRegistering and Using a Function NextRBC API Reference

Last updated 2 years ago