System Table Functions

HEAVY.AI provides access to a set of system-provided table functions, also known as table-valued functions (TVFs). System table functions, like user-defined table functions, support execution of queries on both CPU and GPU over one or more SQL result-set inputs. Table function support in HEAVY.AI can be split into two broad categories: system table functions and user-defined table functions (UDTFs). System table functions are built into the HEAVY.AI server, while UDTFs can be declared dynamically at run-time by specifying them in Numba, a subset of the Python language. For more information on UDTFs, see User-Defined Table Functions.

To improve performance, table functions can be declared to enable filter pushdown optimization, which allows the Calcite optimizer to "push down" filters on the output(s) of a table function to its input(s) when the inputs and outputs are declared to be semantically equivalent (for example, a longitude variable that is both input to and output from a table function). This can significantly increase performance in cases where only a small portion of one or more input tables is required to compute the filtered output of a table function.

Whether system- or user-provided, table functions can execute over one or more result sets specified by subqueries, and can also take any number of additional constant literal arguments specified in the function definition. SQL subquery inputs can consist of any SQL expression (including multiple subqueries, joins, and so on) allowed by HeavyDB, and the output can be filtered, grouped by, joined, and so on like a normal SQL subquery, including being input into additional table functions by wrapping it in a CURSOR argument. The number and types of input arguments, as well as the number and types of output arguments, are specified in the table function definition itself.
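Because a table function's output can itself be wrapped in a CURSOR, table functions can be chained. The following is a minimal sketch of that pattern, using hypothetical table functions tf_a and tf_b and a hypothetical table my_table:

SELECT
  *
FROM
  TABLE(
    tf_b(
      /* The output of the hypothetical tf_a is passed to the
      hypothetical tf_b by wrapping the inner query in a CURSOR */
      CURSOR(
        SELECT
          out1,
          out2
        FROM
          TABLE(tf_a(CURSOR(SELECT x, y FROM my_table)))
      )
    )
  );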

Table functions allow for the efficient execution of advanced algorithms that may be difficult or impossible to express in canonical SQL. By allowing execution of code directly over SQL result sets, leveraging the same hardware parallelism used for fast SQL execution and visualization rendering, HEAVY.AI provides orders-of-magnitude speed increases over the alternative of transporting large result sets to other systems for post-processing and then returning to HEAVY.AI for storage or downstream manipulation. You can easily invoke system-provided or user-defined algorithms directly inline with SQL and rendering calls, making prototyping and deployment of advanced analytics capabilities easier and more streamlined.

Concepts

CURSOR Subquery Inputs

Table functions can take as input arguments both constant literals (including the scalar results of subqueries) and the results of other SQL queries (consisting of one or more rows). The latter (SQL query inputs), per the SQL standard, must be wrapped in the keyword CURSOR. Depending on the table function, there can be 0, 1, or multiple CURSOR inputs. For example:

SELECT * FROM TABLE(my_table_function /* This is only an example! */ (
 CURSOR(SELECT arg1, arg2, arg3 FROM input_1 WHERE x > 10) /* First CURSOR 
 argument consisting of 3 columns */,
 CURSOR(SELECT arg1, AVG(arg2) FROM input_2 WHERE y < 40 GROUP BY arg1) 
 /* Second CURSOR argument consisting of 2 columns. This could be from the same
 table as the first CURSOR, or as is the case here, a completely different table
 (or even a joined table or logical value expression) */,
 'Fred' /* TEXT constant literal argument */,
 true /* BOOLEAN constant literal argument */,
 (SELECT COUNT(*) FROM another_table) /* scalar subquery results do not need
 to be wrapped in a CURSOR */,
 27.3 /* FLOAT constant literal argument */))
WHERE output1 BETWEEN 32.2 AND 81.8;

ColumnList Inputs

Certain table functions can take one or more columns of a specified type or types as inputs, denoted as ColumnList<TYPE1 | TYPE2 ... TYPEN>. Even if a function allows a ColumnList input of multiple types, the arguments must all be of one type; types cannot be mixed. For example, if a function allows ColumnList<INT | TEXT ENCODING DICT>, one or more columns of either INTEGER or TEXT ENCODING DICT can be used as inputs, but all must be either INT columns or TEXT ENCODING DICT columns.
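A minimal sketch of a ColumnList call, assuming a hypothetical function tf_example declared with a ColumnList<TEXT ENCODING DICT> features argument followed by a Column<FLOAT> metric argument; consecutive CURSOR columns of the allowed type bind to the ColumnList:

SELECT
  *
FROM
  TABLE(
    tf_example(
      CURSOR(
        SELECT
          feature_1, /* feature_1 through feature_3 bind to the */
          feature_2, /* ColumnList argument; all three are TEXT */
          feature_3, /* ENCODING DICT, so types are not mixed */
          metric     /* binds to the Column<FLOAT> argument */
        FROM
          my_table
      )
    )
  );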

Named Arguments

All HEAVY.AI system table functions allow you to specify arguments either in conventional comma-separated form, in the order specified by the table function signature, or via key-value pairs in which input argument names are mapped to argument values using the => token. For example, the following two calls are equivalent:

/* The following two table function calls, the first with unnamed
 signature-ordered arguments, and the second with named arguments,
 are equivalent */

select
  *
from
  table(
    tf_compute_dwell_times(
      /* Without the use of named arguments, input arguments must
      be ordered as specified by the table function signature */
      cursor(
        select
          user_id,
          movie_id,
          ts
        from
          netflix_audience_behavior
      ),
      3,
      600,
      10800
    )
  )
order by
  num_dwell_points desc
limit
  10;


select
  *
from
  table(
    tf_compute_dwell_times(
      /* Using named arguments, input arguments can be
      ordered in any order, as long as all arguments are named */
      min_dwell_seconds => 600,
      max_inactive_seconds => 10800,
      data => cursor(
        select
          user_id,
          movie_id,
          ts
        from
          netflix_audience_behavior
      ),
      min_dwell_points => 3
    )
  )
order by
  num_dwell_points desc
limit
  10;

Filter Push-Down

For performance reasons, particularly when table functions are used as actual tables in a client like Heavy Immerse, many system table functions in HEAVY.AI automatically "push down" filters on certain output columns in the query onto the inputs. For example, if a table function performs some computation over an x and y range such that x and y are in both the input and output of the table function, filter push-down would likely be enabled so that a query like the following automatically pushes the filter on the x and y outputs down to the x and y inputs. This can increase query performance significantly.

SELECT
  *
FROM
  TABLE(
    my_spatial_table_function(
      CURSOR(
        SELECT
          x,
          y
        FROM
          spatial_data_table
          /* Presuming filter push down is enabled for 
          my_spatial_table_function, the filter applied to 
          x and y will be applied here to the table function
          input CURSOR */
      )
    )
  )
WHERE
  x BETWEEN 38.2
  AND 39.1
  AND y BETWEEN -121.4
  AND -120.1;

To determine whether filter push-down is used, you can check the Boolean value of the filter_table_transpose column in the output of the following query:

SHOW TABLE FUNCTIONS DETAILS <table_function_name>;

Currently, push-down behavior cannot be changed for system table functions.

Querying Registered Table Functions

You can query which table functions are available using SHOW TABLE FUNCTIONS:

SHOW TABLE FUNCTIONS;

Table UDF

tf_feature_similarity
tf_feature_self_similarity
tf_geo_rasterize_slope
...

Query Metadata for a Specific Table Function

Information about the expected input and output argument names and types, as well as other information such as whether the function can run on CPU, GPU, or both, and whether filter push-down is enabled, can be queried via SHOW TABLE FUNCTIONS DETAILS:

SHOW TABLE FUNCTIONS DETAILS <table_function_name>;

name|signature|input_names|input_types|output_names|output_types|CPU|GPU|Runtime|filter_table_transpose
generate_series|(i64 series_start, i64 series_stop, i64 series_step) -> Column<i64>|[series_start, series_stop, series_step]|[i64, i64, i64]|[generate_series]|[Column<i64>]|true|false|false|false
generate_series|(i64 series_start, i64 series_stop) -> Column<i64>|[series_start, series_stop]|[i64, i64]|[generate_series]|[Column<i64>]|true|false|false|false
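Given the signatures above, generate_series can be invoked with either two or three integer arguments, wrapped in the TABLE command; for example:

SELECT * FROM TABLE(generate_series(1, 10, 2)); /* returns 1, 3, 5, 7, 9 */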

System Table Functions

The following system table functions are available in HEAVY.AI. The table provides a summary of each function.

Function
Purpose

generate_random_strings
Generates random string data.

generate_series
Generates a series of integer values.

generate_series (timestamp variant)
Generates a series of timestamp values from start_timestamp to end_timestamp (see the example following this table).

tf_compute_dwell_times
Given a query input with entity keys and timestamps, and parameters specifying the minimum session time, the minimum number of session records, and the maximum inactive seconds, outputs all unique sessions found in the data along with the duration of each session.

tf_feature_self_similarity
Given a query input of entity keys/IDs, a set of feature columns, and a metric column, scores each pair of entities based on their similarity. The score is computed as the cosine similarity of the feature column(s) between each entity pair, which can optionally be TF/IDF weighted.

tf_feature_similarity
Given a query input of entity keys, feature columns, and a metric column, and a second query input specifying a search vector of feature columns and a metric, computes the similarity of each entity in the first input to the search vector. The score is computed as the cosine similarity of the feature column(s) for each entity with the feature column(s) of the search vector, which can optionally be TF/IDF weighted.

tf_geo_rasterize
Aggregates point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type) across all points in each bin as the output value for the bin. Allowed aggregate types are AVG, COUNT, SUM, MIN, and MAX.

tf_geo_rasterize_slope
Similar to tf_geo_rasterize, but also computes the slope and aspect per output bin. Aggregates point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type) across all points in each bin as the output value for the bin.

tf_graph_shortest_path
Given a distance-weighted directed graph, consisting of a query CURSOR input with the starting and ending node for each edge and a distance, and a specified origin and destination node, computes the shortest distance-weighted path through the graph between origin_node and destination_node.

tf_graph_shortest_paths_distances
Given a distance-weighted directed graph, consisting of a query CURSOR input with the starting and ending node for each edge and a distance, and a specified origin node, computes the shortest distance-weighted path distance between origin_node and every other node in the graph.

tf_load_point_cloud
Loads one or more las or laz point cloud/LiDAR files from a local file or directory source, optionally transforming the output SRID to out_srs. If not specified, output points are automatically transformed to EPSG:4326 lon/lat pairs.

tf_point_cloud_metadata
Returns metadata for one or more las or laz point cloud/LiDAR files from a local file or directory source, optionally constraining the bounding box for metadata retrieved to the lon/lat bounding box specified by the x_min, x_max, y_min, and y_max arguments.

tf_raster_contour_lines / tf_raster_contour_polygons
Processes a raster input to derive contour lines or regions, output as LINESTRING or POLYGON for rendering or further processing.

tf_rf_prop_max_signal
Taking a set of point elevations and a set of signal source locations as input, executes line-of-sight 2.5D RF signal propagation from the provided sources over a binned 2.5D elevation grid derived from the provided point locations, calculating the maximum signal in dBm at each grid cell using the formula for free-space power loss. Also used for generating top-k signals, where 'k' (the strongest_k_sources_per_terrain_bin parameter) represents the maximum number of antennas to consider at each geographic location.
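For example, the timestamp variant of generate_series can be used to build a regular time grid. A minimal sketch, assuming the variant takes start and stop timestamps and an INTERVAL step:

SELECT
  *
FROM
  TABLE(
    generate_series(
      TIMESTAMP(0) '2021-01-01 00:00:00', /* start_timestamp */
      TIMESTAMP(0) '2021-01-02 00:00:00', /* end_timestamp */
      INTERVAL '6' HOUR /* assumed INTERVAL step argument */
    )
  );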

For information about the HeavyRF radio frequency propagation simulation and HeavyRF table functions, see HeavyRF.

The TABLE command is required to wrap a table function clause; for example: SELECT * FROM TABLE(generate_series(1, 10));

The CURSOR command is required to wrap any subquery inputs.
