RBC supports user-defined table functions. UDTFs can access multiple rows of a table column concurrently.
The signature of a UDTF is different from a UDF. The signature contains the input columns and the output columns specified by their respective types. There can be any number of input and output columns; the only constraint is that the input columns must be declared before (i.e., to the left) the output columns in the function signature. The signature, then, is declared using UDTF()
; the number of arguments inside represent the total number of input and output columns.
By default, the output columns are named out0, out1, ...
. It's possible to use aliases to reference input and output columns in further SQL constructions. For example:
The maximum number of rows on table columns that a UDTF can handle corresponds to the maximum value of an int32 (
2**32/2).
In the following example, the UDTF fahrenheit2celcius
is defined on a table with one column as input. The final line return 5
means that a table with only 5 rows is returned.
Although the function returns 5 rows, it does not mean that only 5 rows are going to be processed by the function. If size<5
, the output is padded with the value 0. And if size>5
, the function will still iterate on all the rows while just returning the first 5 elements.
If the number of rows in the output table from a UDTF needs to be adapted at runtime, the function set_output_row_size
from the module rbc.externals.heavydb
is required. The function must be called before any assignment on output columns.
While the return value from a UDTF controls the number of rows in the output table, there are no restrictions on the assumed number of rows in the corresponding input table. The whole column—again, up to int32
rows—will be loaded whenever the function executes. As with any SQL function, limits on number of rows in tables associated with a UDTF can be set using SQL keywords like LIMIT
or WHERE
.
By default, a UDTF that has a variable number of rows in the output table is not thread-safe. To work around this constraint, use a TableFunctionManager
.
In SQL, cursors are used to declare temporary memory for storing database tables. In particular, UDTFs use cursors as inputs. Here is a SQL request using a UDTF:
You can also define the signature with the cursor made explicit in the previous UDTF as follows:
For convenience, when a single cursor is used, you do not need to specify cursors in the definition of the UDTF. When multiple cursors are needed in a SQL query, including the literal Cursor
in the UDTF definition as shown above is required.
Using the argument TableFunctionManager
in the signature of a UDTF enables parallel execution of table functions. Without this argument, table functions are executed on a single thread; more importantly, the execution is not thread-safe. To enable threaded execution, the function signature the extra argument for the TableFunctionManager
and the function set_output_row_size
must be called on the manager to ensure thread safety.
Instead of declaring a parameter per column, it is possible to group columns into a list using ColumnList
. In the following example, the mean over each column is returned. It's possible to have multiple ColumnList
parameters. Two helper attributes are available to get the number of rows and column, respectively ColumnList.nrows
and ColumnList.ncols
.
The package RBC makes use of the Python Numba compiler internally. As a result, RBC inherits some limitations in syntax and features from Numba. Specifically, the nopython mode of Numba is used which means that certain Python objects or class constructions have no—or, at best have limited—support. This includes—and is not limited to—list comprehensions, slicing or complex indexing (e.g., [
:]
, [-1]
, [1:6]
, [::2]
).
When using functions, a common pitfall is to have type mismatch errors. Casting rules are less forgiving than in Python and types have to be carefully handled.
The list of supported functions is always growing. Most functions are overwritten versions of functions from NumPy or the builtin math
module. These functions are defined in rbc.stdlib
, so, to get the full list of supported functions, inspect that module:
Because RBC internally makes use of Numba, RBC also supports the usage of Numba functions within RBC functions. For example, the function fahrenheit2celsius_numba
embedded within fahrenheit2celsius
has been decorated with numba.njit
when fahrenheit2celsius
is defined.
The module rbc.external
describes functions known to the server. Those functions on the server can be used when constructing new UDFs or UDTFs by using the function rbc.external.external
. In the following example, log2
is a function which is known on by the database server. To use log2
with a UDF or a UDTF defined using RBC, it needs to be typed using a C-like syntax similar to the one used when decorating functions for RBC.