Install RBC and get started
You can install RBC using conda (recommended) or mamba, which is a faster implementation of conda. For more information, see the Mamba documentation.
You can also use pip for package management:
The following assumes that you have an instance of HEAVY.AI running. UDFs and UDTFs are enabled with the flags --enable-runtime-udfs and --enable-table-functions. For more information on installing HEAVY.AI, see Installation.
To summarize:
To inspect the test database—provided by default—connect another terminal to the database using
The following example shows a simple UDF that converts a numerical temperature from Fahrenheit to Celsius. The code defines the function, registers it, and runs it on the server.
The instance of class RemoteHeavyDB connects to the HeavyDB instance, and the object it returns can be used to register functions. Then, you define a normal Python function fahrenheit2celsius. The function is decorated using the instance heavy of the class RemoteHeavyDB, and it is provided with the function signature 'double(double)'. With this modification, the decorated function expects a single argument that is a double-precision floating-point value and also returns a double-precision floating-point value. The syntax is similar to function annotations in C/C++.
After you have defined all the functions that you want to be available on the HeavyDB server, register them all at once with heavy.register().
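The workflow described above can be sketched as follows. This is a minimal illustration, assuming RBC is installed and a HeavyDB server is reachable. The connection values (user, password, host, port) are assumed defaults and may differ in your deployment, and `register_on_server` is a hypothetical helper name chosen for this example. The RBC import is deferred into that helper so that the conversion logic itself can be checked locally without a server.

```python
def fahrenheit2celsius(f):
    """Convert a temperature from Fahrenheit to Celsius (plain Python)."""
    return (f - 32) * 5 / 9


def register_on_server():
    """Register the conversion as a UDF; needs RBC and a running HeavyDB."""
    # Imported lazily so the rest of this file works without RBC installed.
    from rbc.heavydb import RemoteHeavyDB

    # Assumed connection settings; adjust them to your own instance.
    heavy = RemoteHeavyDB(user="admin", password="HyperInteractive",
                          host="127.0.0.1", port=6274)

    # 'double(double)': one double argument in, one double value out.
    udf = heavy("double(double)")(fahrenheit2celsius)

    # Register every function defined so far in a single call.
    heavy.register()
    return udf
```

Calling `fahrenheit2celsius(212.0)` locally exercises only the arithmetic; `register_on_server()` is the step that contacts the database.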
fahrenheit2celsius can now be used in SQL on the HeavyDB server. You can use tools like heavyai or ibis (via the ibis-heavyai backend) to help construct queries from Python. The following example shows a function call from SQL.
The function is then applied element-wise on the column col of the table my_table.
RBC supports user-defined table functions. UDTFs can access multiple rows of a table column concurrently.
The signature of a UDTF is different from that of a UDF. The signature contains the input columns and the output columns specified by their respective types. There can be any number of input and output columns; the only constraint is that the input columns must be declared before (i.e., to the left of) the output columns in the function signature. The signature is declared using UDTF(); the number of arguments inside represents the total number of input and output columns.
By default, the output columns are named out0, out1, and so on. It's possible to use aliases to reference input and output columns in further SQL constructions. For example:
The maximum number of rows on table columns that a UDTF can handle corresponds to the maximum value of an int32 (2**31 - 1).
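As a quick check of this limit, the largest value a signed 32-bit integer can hold is 2**31 - 1. The snippet below is a small illustration in plain Python that does not use RBC; `INT32_MAX` and `fits` are names chosen here for the example.

```python
import struct

# A signed 32-bit integer has 31 value bits, so its maximum is 2**31 - 1.
INT32_MAX = 2**31 - 1

# Packing the limit itself as a 4-byte signed integer succeeds.
packed = struct.pack("<i", INT32_MAX)

# One past the limit no longer fits and raises struct.error.
try:
    struct.pack("<i", INT32_MAX + 1)
    fits = True
except struct.error:
    fits = False
```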
In the following example, the UDTF fahrenheit2celcius is defined on a table with one column as input. The final line return 5 means that a table with only 5 rows is returned.
Although the function returns 5 rows, this does not mean that only 5 rows are processed by the function. If size<5, the output is padded with the value 0. If size>5, the function still iterates over all the rows but returns only the first 5 elements.
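The padding and truncation behavior described above can be modeled in plain Python. This is only a sketch of the semantics as stated in the text, not RBC code: `fixed_size_udtf_model` is a hypothetical name, and ordinary lists stand in for the input and output columns.

```python
def fixed_size_udtf_model(inp, out_size=5):
    """Model a UDTF that always reports `out_size` output rows."""
    out = [0.0] * out_size          # output buffer starts zero-filled
    for i in range(len(inp)):       # every input row is still visited
        if i < out_size:            # only the first rows fit in the output
            out[i] = (inp[i] - 32) * 5 / 9
    return out


short = fixed_size_udtf_model([32.0, 212.0])   # fewer than 5 input rows
long_ = fixed_size_udtf_model([32.0] * 8)      # more than 5 input rows
```

With two input rows, the remaining three output rows keep their zero padding; with eight input rows, only five values are kept.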
If the number of rows in the output table from a UDTF needs to be adapted at runtime, the function set_output_row_size from the module rbc.externals.heavydb is required. The function must be called before any assignment on output columns.
While the return value from a UDTF controls the number of rows in the output table, there are no restrictions on the assumed number of rows in the corresponding input table. The whole column (again, up to the int32 maximum number of rows) is loaded whenever the function executes. As with any SQL function, the number of rows in the tables passed to a UDTF can be limited using SQL keywords like LIMIT or WHERE.
In SQL, cursors are used to declare temporary memory for storing database tables. In particular, UDTFs use cursors as inputs. Here is a SQL request using a UDTF:
You can also define the signature with the cursor made explicit in the previous UDTF as follows:
For convenience, when a single cursor is used, you do not need to specify cursors in the definition of the UDTF. When multiple cursors are needed in a SQL query, including the literal Cursor in the UDTF definition as shown above is required.
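To make the role of the cursor on the SQL side more concrete, the sketch below builds a query string that passes one column to a UDTF through a cursor. The helper `udtf_query` and the names `my_udtf`, `my_table`, and `col` are hypothetical and used for illustration only. Plain string formatting like this is not safe for untrusted input.

```python
def udtf_query(udtf_name, table, column):
    """Build a SQL query that feeds one column to a UDTF through a cursor."""
    # The cursor wraps a sub-query whose result becomes the UDTF input.
    cursor = f"CURSOR(SELECT {column} FROM {table})"
    # The UDTF call is wrapped in TABLE(...) so it can be selected from.
    return f"SELECT * FROM TABLE({udtf_name}({cursor}))"


query = udtf_query("my_udtf", "my_table", "col")
```

The resulting string can then be submitted through whichever client you normally use to query the server.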
Using the argument TableFunctionManager in the signature of a UDTF enables parallel execution of table functions. Without this argument, table functions are executed on a single thread; more importantly, the execution is not thread-safe. To enable threaded execution, the function signature must include the extra argument for the TableFunctionManager, and the function set_output_row_size must be called on the manager to ensure thread safety.
Instead of declaring a parameter per column, it is possible to group columns into a list using ColumnList. In the following example, the mean over each column is returned. It's possible to have multiple ColumnList parameters. Two helper attributes, ColumnList.nrows and ColumnList.ncols, give the number of rows and columns, respectively.
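The per-column mean can be modeled in plain Python to show how the row and column counts drive the loops. This is a sketch under stated assumptions rather than RBC code: `column_means` is a hypothetical name, a list of equal-length lists stands in for a ColumnList, and the local variables `ncols` and `nrows` play the role of the two helper attributes.

```python
def column_means(columns):
    """Return the mean of each column in a list of equal-length lists."""
    ncols = len(columns)                      # stands in for ColumnList.ncols
    nrows = len(columns[0]) if ncols else 0   # stands in for ColumnList.nrows
    if nrows == 0:
        raise ValueError("at least one row is required to compute a mean")
    means = []
    for j in range(ncols):          # one output value per input column
        total = 0.0
        for i in range(nrows):      # accumulate over the rows of column j
            total += columns[j][i]
        means.append(total / nrows)
    return means


means = column_means([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
```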
When using functions, a common pitfall is to have type mismatch errors. Casting rules are less forgiving than in Python and types have to be carefully handled.
The list of supported functions is always growing. Most functions are overridden versions of functions from NumPy or the builtin math module. These functions are defined in rbc.stdlib, so, to get the full list of supported functions, inspect that module:
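One way to inspect a module from Python is sketched below. The helper `public_names` is a hypothetical name, and it returns an empty list when the module cannot be imported, so the snippet also runs where RBC is not installed. It is demonstrated on the builtin math module; with RBC available, the same call can be made with "rbc.stdlib".

```python
import importlib


def public_names(module_name):
    """Return the sorted public attribute names of a module, or [] if missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return []
    # Names starting with an underscore are treated as private and skipped.
    return sorted(name for name in dir(module) if not name.startswith("_"))


# Works for any importable module; with RBC installed, pass "rbc.stdlib".
math_names = public_names("math")
```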
The module rbc.external describes functions known to the server. Those functions on the server can be used when constructing new UDFs or UDTFs by using the function rbc.external.external. In the following example, log2 is a function that is known by the database server. To use log2 with a UDF or a UDTF defined using RBC, it needs to be typed using a C-like syntax similar to the one used when decorating functions for RBC.
By default, a UDTF that has a variable number of rows in the output table is not thread-safe. To work around this constraint, use a TableFunctionManager.
The package RBC makes use of the Numba Python compiler internally. As a result, RBC inherits some limitations in syntax and features from Numba. Specifically, the nopython mode of Numba is used, which means that certain Python objects or class constructions have no support or, at best, limited support. This includes, but is not limited to, list comprehensions, slicing, and complex indexing (e.g., [:], [-1], [1:6], [::2]).
Because RBC internally makes use of Numba, RBC also supports the usage of Numba functions within RBC functions. For example, the function fahrenheit2celsius_numba, which is called within fahrenheit2celsius, has already been decorated with numba.njit when fahrenheit2celsius is defined.
For information about the Remote Backend Compiler API, see the RBC API Reference.
Register a function and then use it
Making a function available to HeavyDB (registering it) is based on decorating a Python function. Consider the following simple function, which takes a single argument and returns a single value.
Register this function to HeavyDB using the following steps:
Declare the function’s signature.
Attach the signature to the function.
Register the function to the database.
Annotate the function with type information to tell RBC how to translate the function into an intermediate representation, using the following syntax:
The function can only return a single element.
Available types are similar to C types:
In the types listed, items in brackets indicate options to choose from. For example, [List,Array]Int[8,16] is expanded to mean ListInt8, ArrayInt8, ListInt16, and ArrayInt16. The literals float and int can be abbreviated by f and i, respectively.
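The bracket shorthand can be expanded mechanically, as the following sketch shows. The helper `expand_type_pattern` is hypothetical and is not part of RBC; it only illustrates how to read the notation used in the type list. It treats every bracketed group as a set of options, so it is not meant for actual signatures such as array types that use empty brackets.

```python
import itertools
import re


def expand_type_pattern(pattern):
    """Expand 'a[x,y]b[1,2]' style shorthand into every concrete name."""
    # Split into literal text and bracketed option groups (groups are kept).
    parts = re.split(r"(\[[^\]]*\])", pattern)
    choices = []
    for part in parts:
        if part.startswith("[") and part.endswith("]"):
            # A bracketed group contributes one of its comma-separated options.
            choices.append([opt.strip() for opt in part[1:-1].split(",")])
        else:
            # Literal text contributes itself unchanged.
            choices.append([part])
    return ["".join(combo) for combo in itertools.product(*choices)]


names = expand_type_pattern("[List,Array]Int[8,16]")
```

The order of the generated names follows `itertools.product`, so it can differ from the order in which the text lists them.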
Returning to the function, if you want both the input argument and the output values as doubles, you could write:
What happens when the input is an integer? RBC does not cast input values to the expected types automatically. If you expect multiple input types, RBC supports templating (similar to templates in C++ or generics in Rust or Go). Templating allows you to define a type using a variable, like T in this example:
In this example, T can be replaced by int32 or double. This can also be written without using a variable.
You can also have different template variables. The Cartesian product is observed.
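The effect of template variables, including the Cartesian product across several variables, can be illustrated with a small expansion routine. This is a sketch of the idea only: `expand_templates` is a hypothetical helper, not RBC's implementation, and the template variable `U` is introduced here purely for the example.

```python
import itertools
import re


def expand_templates(signature, **templates):
    """Substitute every combination of template values into `signature`."""
    names = list(templates)
    expanded = []
    # One concrete signature per element of the Cartesian product.
    for combo in itertools.product(*(templates[n] for n in names)):
        mapping = dict(zip(names, combo))
        # Replace whole identifiers only, in a single pass over the string.
        concrete = re.sub(r"\b\w+\b",
                          lambda m: mapping.get(m.group(0), m.group(0)),
                          signature)
        expanded.append(concrete)
    return expanded


one_var = expand_templates("T(T)", T=["int32", "double"])
two_vars = expand_templates("T(U)", T=["int32", "double"],
                            U=["float", "double"])
```

With one variable taking two values there are two concrete signatures; with two variables of two values each there are four.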
Once you have the signature, you can attach it to the function. As a best practice, use the signature as a decorator.
This prevents classical function calls of the decorated function. The function is now “marked” to be registered on the server and used there.
RBC supports overloading function definitions. This permits several function implementations using a common identifier, with the execution path determined by specific inputs.
Both inputs and outputs can be marked as 1D arrays or lists of any type. To indicate an array in the function signature, append brackets ([]) to the type literal.
You can also define an array using the 'Array<double>' syntax.
Some functions with array support are provided. In this example, the imported function rbc.stdlib.array_api.mean computes the mean over an array of inputs f_array. We can also have output arrays.
rbc.stdlib.array_api.mean is a special function bundled with RBC. In this case, numpy.mean has been overridden for convenience to users familiar with NumPy’s API. See here for more details and information about supported functions.
To create an array within a function, the class Array must be used to define an empty array. It can then be indexed to be filled. Slicing or complex indexing is not currently supported. If the array is returned, it’s important that the type specified during the array creation matches the return type specified in the function signature.
Standard Python constructors like list, dict or numpy.array cannot be used to construct arrays supported by RBC. See here for a complete list of array creation functions.
You can explicitly select the device on which a function is allowed to be executed by using the keyword argument device in the decorator when registering the function. The device argument is a list that can take 'cpu' and 'gpu'. The option indicates which implementation should be available and used. Hence, if there is no GPU on the server, using 'gpu' does not work on that platform.
A function can also be made available on both the CPU and GPU by using device=['cpu', 'gpu'].
For 'gpu', only NVIDIA GPUs that can handle CUDA instructions are currently supported.
Once you define the functions, with appropriate signatures in the decorator, you have to register them with HeavyDB. This is done automatically if the function is used in the same Python session. If multiple functions are defined in a file and need to be registered to be used by another process or user, then you need to register them manually.
It is less efficient to call RemoteHeavyDB.register() after every function declaration. Instead, use a single call after all functions are defined.
Similarly, you can clean the current session of all previously registered functions. The registration and unregistration of functions take into account only the functions defined in the current session associated with the object heavy.
To use the basic implementation of fahrenheit2celsius:
To get the result of the function, you have to explicitly request execution on the server using the execute method:
The execute method is a convenience feature; it should not be used in production code. For production code, use heavyai or ibis via the ibis-heavyai backend to compose SQL queries using an ORM-like syntax.