Use INSERT for both single- and multi-row ad hoc inserts. (When inserting many rows, use the more efficient COPY command.)
You can also insert into a table using INSERT INTO ... SELECT, as shown in the following examples:
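A hedged sketch (table and column names are illustrative):

INSERT INTO orders_archive SELECT * FROM orders WHERE order_date < '2020-01-01';
INSERT INTO order_totals (id, total) SELECT id, SUM(amount) FROM orders GROUP BY id;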
You can insert array literals into array columns. The inserts in the following example each have three array values, and demonstrate how you can:

- Create a table with variable-length and fixed-length array columns.
- Insert NULL arrays into these columns.
- Specify and insert array literals using {...} or ARRAY[...] syntax.
- Insert empty variable-length arrays using {} and ARRAY[] syntax.
- Insert array values that contain NULL elements.
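A minimal sketch covering each case (table and column names are illustrative):

CREATE TABLE ar (ai INTEGER[], af FLOAT[2], at TEXT[] ENCODING DICT(32));
INSERT INTO ar VALUES ({1, 2, 3}, {4.0, 5.0}, {'a', 'b'});            -- {...} syntax
INSERT INTO ar VALUES (ARRAY[1, 2, 3], ARRAY[4.0, 5.0], ARRAY['a']);  -- ARRAY[...] syntax
INSERT INTO ar VALUES (NULL, NULL, NULL);                             -- NULL arrays
INSERT INTO ar VALUES ({}, {4.0, 5.0}, {});                           -- empty variable-length arrays
INSERT INTO ar VALUES ({1, NULL, 3}, {4.0, NULL}, {'a', NULL});       -- NULL elements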
If you create a table with a column that has a default value, or alter a table to add a column with a default value, using the INSERT command creates a record that includes the default value if it is omitted from the INSERT. For example, assume a table created as follows:
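A plausible definition consistent with the description (sketch):

CREATE TABLE tbl (
  id INTEGER,
  name TEXT DEFAULT 'John Doe',
  age INTEGER);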
If you omit the name column from an INSERT or INSERT FROM SELECT statement, the missing value for column name is set to 'John Doe'.

INSERT INTO tbl (id, age) VALUES (1, 36); creates the record 1|'John Doe'|36.

INSERT INTO tbl (id, age) SELECT id, age FROM old_tbl; also sets all the name values to 'John Doe'.
Deletes rows that satisfy the WHERE clause from the specified table. If the WHERE clause is absent, all rows in the table are deleted, resulting in a valid but empty table.

In Release 6.4 and higher, you can run DELETE queries across tables in different databases on the same HEAVY.AI cluster without having to first connect to those databases.

To execute queries against another database, you must have ACCESS privilege on that database, as well as DELETE privilege.

Delete rows from a table in the my_other_db database:
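A hedged sketch (table and filter are illustrative):

DELETE FROM my_other_db.sales WHERE region = 'EU';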
Shows generated Intermediate Representation (IR) code, identifying whether it is executed on GPU or CPU. This is primarily used internally by HEAVY.AI to monitor behavior.

For example, when you use the EXPLAIN command on a basic statement, the utility returns 90 lines of IR code that is not meant to be human readable. However, at the top of the listing, a heading indicates whether it is IR for the CPU or IR for the GPU, which can be useful to know in some situations.
Returns a relational algebra tree describing the high-level plan to execute the statement.
The table below lists the relational algebra classes used to describe the execution plan for a SQL statement.
For example, a SELECT statement is described as a table scan and projection. If you add a sort order, the table projection is folded under a LogicalSort procedure.
When the SQL statement is simple, the EXPLAIN CALCITE version is actually less “human readable.” EXPLAIN CALCITE is more useful when you work with more complex SQL statements, like the one that follows. This query performs a scan on the BOOK table before scanning the BOOK_ORDER table.
Revising the original SQL command results in a more natural selection order and a more performant query.
Augments the EXPLAIN CALCITE command by adding details about referenced columns in the query plan.
For example, for the following EXPLAIN CALCITE command execution:
EXPLAIN CALCITE DETAILED adds more column details as seen below:
Change a parameter value for the current session.
Parameter name | Values |
---|---|
Switch to another database without needing to log in again. Your session silently switches to the requested database.

The database exists, but the user does not have access to it:

The database does not exist:

Force the session to run the subsequent SQL commands in CPU mode:

Switch the session back to run in GPU mode:
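Sketches of the corresponding commands (the database name is illustrative):

ALTER SESSION SET CURRENT_DATABASE='my_other_db';
ALTER SESSION SET EXECUTOR_DEVICE='CPU';
ALTER SESSION SET EXECUTOR_DEVICE='GPU';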
HEAVY.AI supports arrays in dictionary-encoded text and number fields (TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, and DOUBLE). Data stored in arrays is not normalized. For example, {green,yellow} is not the same as {yellow,green}. As with many SQL-based services, HEAVY.AI array indexes are 1-based.
HEAVY.AI supports NULL variable-length arrays for all integer and floating-point data types, including dictionary-encoded string arrays. For example, you can insert NULL into BIGINT[], DOUBLE[], or TEXT[] columns. HEAVY.AI supports NULL fixed-length arrays for all integer and floating-point data types, but not for dictionary-encoded string arrays. For example, you can insert NULL into BIGINT[2] or DOUBLE[3] columns, but not into TEXT[2] columns.
The following examples show query results based on the table test_array, created with the following statement:
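A plausible definition (sketch; the actual layout of test_array is not shown in this excerpt):

CREATE TABLE test_array (id INTEGER, ai INTEGER[], at TEXT[] ENCODING DICT(32));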
The following queries use arrays in an INTEGER field:
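Sketches of such queries (array indexes are 1-based; the ANY comparison is an assumption):

SELECT id FROM test_array WHERE ai[1] = 1;
SELECT id FROM test_array WHERE 3 = ANY ai;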
Usage Notes
SQL normally assumes that terms in the WHERE clause that cannot be used by indices are usually true. If this assumption is incorrect, it could lead to a suboptimal query plan. Use the LIKELY(X) and UNLIKELY(X) SQL functions to provide hints to the query planner about clause terms that are probably not true, which helps the query planner to select the best possible plan.

Use LIKELY/UNLIKELY to optimize evaluation of OR/AND logical expressions. LIKELY/UNLIKELY causes the left side of an expression to be evaluated first. This allows the right side of the query to be skipped when possible. For example, in the clause UNLIKELY(A) AND B, if A evaluates to FALSE, B does not need to be evaluated.
Consider the following:
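A sketch matching the description that follows (table and column names are illustrative):

SELECT COUNT(*) FROM test WHERE UNLIKELY(x IN (7, 8, 9, 10)) AND y > 42;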
If x is one of the values 7, 8, 9, or 10, the filter y > 42 is applied. If x is not one of those values, the filter y > 42 is not applied.
Geospatial and array column projections are not supported in the COALESCE function and CASE expressions.
You can use a subquery anywhere an expression can be used, subject to any runtime constraints of that expression. For example, a subquery in a CASE statement must return exactly one row, but a subquery can return multiple values to an IN expression.
You can use a subquery anywhere a table is allowed (for example, FROM subquery), using aliases to name any reference to the table and columns returned by the subquery.
The SELECT command returns a set of records from one or more tables.
Sort order defaults to ascending (ASC).
Sorts null values after non-null values by default in an ascending sort, before non-null values in a descending sort. For any query, you can use NULLS FIRST to sort null values to the top of the results or NULLS LAST to sort null values to the bottom of the results.
Allows you to use a positional reference to choose the sort column. For example, the command SELECT colA, colB FROM table1 ORDER BY 2 sorts the results on colB because it is in position 2.
HEAVY.AI provides various query hints for controlling the behavior of the query execution engine.
SELECT hints must appear first, immediately after the SELECT statement; otherwise, the query fails.
By default, a hint is applied to the query step in which it is defined. If you have multiple SELECT clauses and define a query hint in one of those clauses, the hint is applied only to that specific query step; the rest of the query steps are unaffected. For example, applying the /*+ cpu_mode */ hint affects only the SELECT clause in which it exists.

You can define a hint to apply to all query steps by prepending g_ to the query hint. For example, if you define /*+ g_cpu_mode */, CPU execution is applied to all query steps.
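Sketches of both forms (query body illustrative):

SELECT /*+ cpu_mode */ COUNT(*) FROM flights;
SELECT /*+ g_cpu_mode */ COUNT(*) FROM flights;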
HEAVY.AI supports the following query hints.
The marker hint type represents a Boolean flag.
The key-value pair type is a hint name and its value.
In Release 6.4 and higher, you can run SELECT queries across tables in different databases on the same HEAVY.AI cluster without having to first connect to those databases. This enables more efficient storage and memory utilization by eliminating the need for table duplication across databases, and simplifies access to shared data and tables.
To execute queries against another database, you must have ACCESS privilege on that database, as well as SELECT privilege.
Execute a join query involving a table in the current database and another table in the my_other_db database:
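A hedged sketch (table and column names are illustrative):

SELECT c.name, o.total
FROM customers c
JOIN my_other_db.orders o ON c.id = o.customer_id;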
If a join column name or alias is not unique, it must be prefixed by its table name.
You can use BIGINT, INTEGER, SMALLINT, TINYINT, DATE, TIME, TIMESTAMP, or TEXT ENCODING DICT data types. TEXT ENCODING DICT is the most efficient because corresponding dictionary IDs are sequential and span a smaller range than, for example, the 65,535 values supported in a SMALLINT field. Depending on the number of values in your field, you can use TEXT ENCODING DICT(32) (up to approximately 2,150,000,000 distinct values), TEXT ENCODING DICT(16) (up to 64,000 distinct values), or TEXT ENCODING DICT(8) (up to 255 distinct values).
When possible, joins involving a geospatial operator (such as ST_Contains) build a binned spatial hash table (overlaps hash join), falling back to a Cartesian loop join if a spatial hash join cannot be constructed.

The enable-overlaps-hashjoin flag controls whether the system attempts to use the overlaps spatial join strategy (true by default). If enable-overlaps-hashjoin is set to false, or if the system cannot build an overlaps hash join table for a geospatial join operator, the system attempts to fall back to a loop join. Loop joins can be performant in situations where one or both join tables have a small number of rows. When both tables grow large, loop join performance decreases.
Two flags control whether the system allows loop joins for a query (geospatial or not): allow-loop-joins and trivial-loop-join-threshold. By default, allow-loop-joins is set to false and trivial-loop-join-threshold to 1,000 (rows). If allow-loop-joins is set to true, the system allows any query with a loop join, regardless of table cardinalities (measured in number of rows). If left to the implicit default of false or set explicitly to false, the system allows loop join queries as long as the inner table (right-side table) has fewer rows than the threshold specified by trivial-loop-join-threshold.
For optimal performance, the system should utilize overlaps hash joins whenever possible. Use the following guidelines to maximize the use of the overlaps hash join framework and minimize fallback to loop joins when conducting geospatial joins:
- The inner (right-side) table should always be the more complicated primitive. For example, for ST_Contains(polygon, point), the point table should be the outer (left) table and the polygon table should be the inner (right) table.
- Currently, ST_CONTAINS and ST_INTERSECTS joins between point and polygon/multipolygon tables, and ST_DISTANCE < {distance} between two point tables, are supported for accelerated overlaps hash join queries.
- For pointwise-distance joins, only the pattern WHERE ST_DISTANCE(table_a.point_col, table_b.point_col) < distance_in_degrees supports overlaps hash joins. Patterns like the following fall back to loop joins:
  - WHERE ST_DWITHIN(table_a.point_col, table_b.point_col, distance_in_degrees)
  - WHERE ST_DISTANCE(ST_TRANSFORM(table_a.point_col, 900913), ST_TRANSFORM(table_b.point_col, 900913)) < 100
You can create joins in a distributed environment in two ways:
Replicate small dimension tables that are used in the join.
Create a shard key on the column used in the join (note that there is a limit of one shard key per table). If the column involved in the join is a TEXT ENCODED field, you must create a SHARED DICTIONARY that references the FACT table key you are using to make the join.
The join order for one small table and one large table matters. If you swap the sales and customer tables on the join, it throws an exception stating that table "sales" must be replicated.
Interrupt a queued query. Specify the query by using its session ID.
To see the queries in the queue, use the SHOW QUERIES command:
To interrupt the last query in the list (ID 946-ooNP):
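The corresponding command, using the session ID from the example:

KILL QUERY '946-ooNP';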
Showing the queries again indicates that 946-ooNP has been deleted:
KILL QUERY is only available if the runtime query interrupt parameter (enable-runtime-query-interrupt) is set.
Interrupting a query in ‘PENDING_QUEUE’ status is supported in both distributed and single-server mode.
To enable query interrupt for tables imported from data files in local storage, set enable_non_kernel_time_query_interrupt to TRUE. (It is enabled by default.)
Changes the values of the specified columns based on the assign argument (identifier=expression) in all rows that satisfy the condition in the WHERE clause.
Currently, HEAVY.AI does not support updating a geo column type (POINT, MULTIPOINT, LINESTRING, MULTILINESTRING, POLYGON, or MULTIPOLYGON) in a table.
You can update a table via subquery, which allows you to update based on calculations performed on another table.
Examples
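A hedged sketch of an update via correlated subquery (names illustrative):

UPDATE employees
SET dept_avg_salary = (SELECT AVG(s.salary) FROM salaries s WHERE s.dept_id = employees.dept_id);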
In Release 6.4 and higher, you can run UPDATE queries across tables in different databases on the same HEAVY.AI cluster without having to first connect to those databases.
To execute queries against another database, you must have ACCESS privilege on that database, as well as UPDATE privilege.
Update a row in a table in the my_other_db database:
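A hedged sketch (table, column, and filter are illustrative):

UPDATE my_other_db.inventory SET quantity = 0 WHERE sku = 'A-100';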
Method | Description |
---|---|
LogicalAggregate | Operator that eliminates duplicates and computes totals. |
LogicalCalc | Expression that computes project expressions and also filters. |
LogicalChi | Operator that converts a stream to a relation. |
LogicalCorrelate | Operator that performs nested-loop joins. |
LogicalDelta | Operator that converts a relation to a stream. |
LogicalExchange | Expression that imposes a particular distribution on its input without otherwise changing its content. |
LogicalFilter | Expression that iterates over its input and returns elements for which a condition evaluates to true. |
LogicalIntersect | Expression that returns the intersection of the rows of its inputs. |
LogicalJoin | Expression that combines two relational expressions according to some condition. |
LogicalMatch | Expression that represents a MATCH_RECOGNIZE node. |
LogicalMinus | Expression that returns the rows of its first input minus any matching rows from its other inputs. Corresponds to the SQL EXCEPT operator. |
LogicalProject | Expression that computes a set of 'select expressions' from its input relational expression. |
LogicalSort | Expression that imposes a particular sort order on its input without otherwise changing its content. |
LogicalTableFunctionScan | Expression that calls a table-valued function. |
LogicalTableModify | Expression that modifies a table. Similar to TableScan, but represents a request to modify a table instead of read from it. |
LogicalTableScan | Reads all the rows from a RelOptTable. |
LogicalUnion | Expression that returns the union of the rows of its inputs, optionally eliminating duplicates. |
LogicalValues | Expression for which the value is a sequence of zero or more literal row values. |
LogicalWindow | Expression representing a set of window aggregates. See Window Functions. |
EXECUTOR_DEVICE

CPU - Set the session to CPU execution mode:

ALTER SESSION SET EXECUTOR_DEVICE='CPU';

GPU - Set the session to GPU execution mode:

ALTER SESSION SET EXECUTOR_DEVICE='GPU';

NOTE: These parameter values have the same effect as the \cpu and \gpu commands in heavysql, but can be used with any tool capable of running SQL commands.

CURRENT_DATABASE

Can be set to any string value. If the value is a valid database name and the current user has access to it, the session switches to the new database. If the user does not have access or the database does not exist, an error is returned and the session falls back to the starting database.
Operator | Description |
---|---|
AND | Logical AND |
NOT | Negates value |
OR | Logical OR |
Expression | Description |
---|---|
CASE WHEN ... THEN ... ELSE ... END | Case operator |
COALESCE(val1, val2, ...) | Returns the first non-null value in the list |
Expression | Description |
---|---|
expr IN (list) | Evaluates whether expr equals any value of the IN list. |
expr NOT IN (list) | Evaluates whether expr does not equal any value of the IN list. |
Hint | Details |
---|---|
allow_loop_join | Enable loop joins. |
cpu_mode | Force CPU execution mode. |
columnar_output | Enable columnar output for the input query. |
disable_loop_join | Disable loop joins. |
dynamic_watchdog | Enable dynamic watchdog. |
dynamic_watchdog_off | Disable dynamic watchdog. |
force_baseline_hash_join | Use the baseline hash join scheme by skipping the perfect hash join scheme, which is used by default. |
force_one_to_many_hash_join | Deploy a one-to-many hash join by skipping one-to-one hash join, which is used by default. |
keep_result | Add result set of the input query to the result set cache. |
keep_table_function_result | Add result set of the table function query to the result set cache. |
overlaps_allow_gpu_build | Use GPU (if available) to build an overlaps join hash table. (CPU is used by default.) |
overlaps_no_cache | Skip adding an overlaps join hash table to the hash table cache. |
rowwise_output | Enable row-wise output for the input query. |
watchdog | Enable watchdog. |
watchdog_off | Disable watchdog. |
Hint | Details | Example |
---|---|---|
aggregate_tree_fanout | Defines the fan-out of the tree used to compute window aggregation over a frame. Depending on the frame size, the tree fan-out affects the performance of aggregation and of tree construction for each window function with a frame clause. | |
loop_join_inner_table_max_num_rows | Set the maximum number of rows available for a loop join. | Set the maximum number of rows to 100: |
max_join_hash_table_size | Set the maximum size of the hash table. | Set the maximum size of the join hash table to 100: |
overlaps_bucket_threshold | Set the overlaps bucket threshold. | Set the overlaps threshold to 10: |
overlaps_max_size | Set the maximum overlaps size. | Set the maximum overlap to 10: |
overlaps_keys_per_bin | Set the number of overlaps keys per bin. | |
query_time_limit | Set the maximum time for the query to run. | |
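A sketch of the key-value form (the parenthesized value syntax is an assumption):

SELECT /*+ query_time_limit(10000) */ COUNT(*) FROM flights;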
Expression | Example | Description |
---|---|---|
CAST(expr AS type) | CAST(f AS INTEGER) | Converts an expression to another data type. For conversions to a TEXT type, use TRY_CAST. |
TRY_CAST(expr AS type) | TRY_CAST(t AS FLOAT) | Converts text to a non-text type, returning null if the conversion could not be successfully performed. |
ENCODE_TEXT(expr) | ENCODE_TEXT(t) | Converts a none-encoded text type to a dictionary-encoded text type. |
FROM/TO:
| - | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | n/a |
| Yes | - | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | n/a |
| Yes | Yes | - | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No |
| Yes | Yes | Yes | - | Yes | Yes | Yes | Yes | No | No | No | No |
| Yes | Yes | Yes | Yes | - | Yes | No | Yes | No | No | No | No |
| Yes | Yes | Yes | Yes | Yes | - | No | Yes | No | No | No | n/a |
| Yes | Yes | Yes | Yes | Yes | Yes | - | Yes | No | No | No | n/a |
| Yes (Use TRY_CAST) | Yes (Use TRY_CAST) | Yes (Use TRY_CAST) | Yes (Use TRY_CAST) | Yes (Use TRY_CAST) | Yes (Use TRY_CAST) | Yes (Use TRY_CAST) | - | Yes (Use TRY_CAST) | Yes (Use TRY_CAST) | Yes (Use TRY_CAST) | Yes (Use TRY_CAST) |
| No | No | Yes | No | No | No | No | Yes | - | n/a | n/a | n/a |
| No | No | No | No | No | No | No | Yes | n/a | - | No | Yes |
| No | No | No | No | No | No | No | Yes | n/a | No | - | n/a |
| No | No | No | No | No | No | No | Yes | n/a | Yes | No | - |
Expression | Description |
---|---|
LIKELY(expr) | Provides a hint to the query planner that argument expr is likely to be true. |
UNLIKELY(expr) | Provides a hint to the query planner that argument expr is unlikely to be true. |
arr[i] | Returns the value(s) at the given array index (array indexes are 1-based). |
UNNEST(arr) | Extracts the values in the array to a set of rows. |
CARDINALITY(arr) | Returns the number of elements in an array. |
Given a query input with entity keys (for example, user IP addresses) and timestamps (for example, page visit timestamps), and parameters specifying the minimum session time, the minimum number of session records, and the max inactive seconds, outputs all unique sessions found in the data with the duration of the session (dwell time).
Example
Use SHOW commands to get information about databases, tables, and user sessions.
Shows the CREATE SERVER statement that could have been used to create the server.
Shows the CREATE TABLE statement that could have been used to create the table.
Retrieve the databases accessible for the current user, showing the database name and owner.
Show registered compile-time UDFs and extension functions in the system and their arguments.
Displays a list of all row-level security (RLS) policies that exist for a user or role; admin rights are required. If EFFECTIVE is used, the list also includes any policies that exist for all roles that apply to the requested user or role.
Returns a list of queued queries in the system; information includes session ID, status, query string, account login name, client address, database name, and device type (CPU or GPU).
Admin users can see and interrupt all queries; non-admin users can see and interrupt only their own queries.
NOTE: SHOW QUERIES is only available if the runtime query interrupt parameter (enable-runtime-query-interrupt) is set.
To interrupt a query in the queue, see KILL QUERY.
If included with a name, lists the roles granted directly to a user or role. SHOW EFFECTIVE ROLES with a name lists the roles directly granted to a user or role, and also lists the roles indirectly inherited through the directly granted roles.
If the user name or role name is omitted, then a regular user sees their own roles, and a superuser sees a list of all roles existing in the system.
Show user-defined runtime functions and table functions.
Show data connectors.
Displays storage-related information for a table, such as the table ID/name, number of data/metadata files used by the table, total size of data/metadata files, and table epoch values.
You can see table details for all tables that you have access to in the current database, or for only those tables you specify.
Show details for all tables you have access to:
Show details for table omnisci_states:
The number of columns returned includes system columns. As a result, the number of columns in column_count can be up to two greater than the number of columns created by the user.
Displays the list of available system (built-in) table functions.
For more information, see System Table Functions.
Show detailed output information for the specified table function. Output details vary depending on the table function specified.
View SHOW output for the generate_series table function:
Retrieve the servers accessible for the current user.
Retrieve the tables accessible for the current user.
Lists name, ID, and default database for all or specified users for the current database. If the command is issued by a superuser, login permission status is also shown. Only superusers see users who do not have permission to log in.
SHOW [ALL] USER DETAILS lists name, ID, superuser status, default database, and login permission status for all users across the HeavyDB instance. This variant of the command is available only to superusers. Regular users who run the SHOW ALL USER DETAILS command receive an error message.
Show all user details for all users:
Show all user details for specified users ue, ud, ua, and uf:
If a specified user is not found, the superuser sees an error message:
Show user details for specified users ue, ud, and uf:
Show user details for all users:
Running SHOW ALL USER DETAILS results in an error message:
Show user details for all users:
If a specified user is not found, the user sees an error message:
Show user details for user ua:
Retrieve all persisted user sessions, showing the session ID, user login name, client address, and database name. Admin or superuser privileges required.
HEAVY.AI provides access to a set of system-provided table functions, also known as table-valued functions (TVFs). System table functions, like user-defined table functions, support execution of queries on both CPU and GPU over one or more SQL result-set inputs. Table function support in HEAVY.AI can be split into two broad categories: system table functions and user-defined table functions (UDTFs). System table functions are built in to the HEAVY.AI server, while UDTFs can be declared dynamically at run time by writing them with Numba, which compiles a subset of the Python language. For more information on UDTFs, see User-Defined Table Functions.
To improve performance, table functions can be declared to enable filter pushdown optimization, which allows the Calcite optimizer to "push down" filters on the output(s) of a table function to its input(s) when the inputs and outputs are declared to be semantically equivalent (for example, a longitude variable that is input to and output from a table function). This can significantly increase performance in cases where only a small portion of one or more input tables is required to compute the filtered output of a table function.
Whether system- or user-provided, table functions can execute over one or more result sets specified by subqueries, and can also take any number of additional constant literal arguments specified in the function definition. SQL subquery inputs can consist of any SQL expression (including multiple subqueries, joins, and so on) allowed by HeavyDB, and the output can be filtered, grouped by, joined, and so on like a normal SQL subquery, including being input into additional table functions by wrapping it in a CURSOR argument. The number and types of input arguments, as well as the number and types of output arguments, are specified in the table function definition itself.
Table functions allow for the efficient execution of advanced algorithms that may be difficult or impossible to express in canonical SQL. By allowing execution of code directly over SQL result sets, leveraging the same hardware parallelism used for fast SQL execution and visualization rendering, HEAVY.AI provides orders-of-magnitude speed increases over the alternative of transporting large result sets to other systems for post-processing and then returning to HEAVY.AI for storage or downstream manipulation. You can easily invoke system-provided or user-defined algorithms directly inline with SQL and rendering calls, making prototyping and deployment of advanced analytics capabilities easier and more streamlined.
Table functions can take as input arguments both constant literals (including scalar results of subqueries) as well as results of other SQL queries (consisting of one or more rows). The latter (SQL query inputs), per the SQL standard, must be wrapped in the keyword CURSOR. Depending on the table function, there can be 0, 1, or multiple CURSOR inputs. For example:
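Sketches; tf_example is an illustrative, hypothetical function:

SELECT * FROM TABLE(generate_series(1, 10));                          -- zero CURSOR inputs
SELECT * FROM TABLE(tf_example(CURSOR(SELECT x, y FROM points), 10)); -- one CURSOR input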
Certain table functions can take one or more columns of a specified type or types as inputs, denoted as ColumnList<TYPE1 | TYPE2 ... TYPEN>. Even if a function allows a ColumnList input of multiple types, the arguments must all be of one type; types cannot be mixed. For example, if a function allows ColumnList<INT | TEXT ENCODING DICT>, one or more columns of either INTEGER or TEXT ENCODING DICT can be used as inputs, but all must be either INT columns or TEXT ENCODING DICT columns.
All HEAVY.AI system table functions allow you to specify arguments either in conventional comma-separated form in the order specified by the table function signature, or via a key-value map where input argument names are mapped to argument values using the => token. For example, the following two calls are equivalent:
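A sketch using generate_series (argument names match the generate_series reference later in this section):

SELECT * FROM TABLE(generate_series(1, 10, 2));
SELECT * FROM TABLE(generate_series(series_start => 1, series_end => 10, series_step => 2));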
For performance reasons, particularly when table functions are used as actual tables in a client like Heavy Immerse, many system table functions in HEAVY.AI automatically "push down" filters on certain output columns in the query onto the inputs. For example, if a table function does some computation over an x and y range such that x and y are in both the input and output for the table function, filter push-down would likely be enabled, so that a query like the following automatically pushes down the filter on the x and y outputs to the x and y inputs. This potentially increases query performance significantly.
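A sketch (tf_example is an illustrative, hypothetical function):

SELECT * FROM TABLE(tf_example(CURSOR(SELECT x, y, val FROM points)))
WHERE x BETWEEN -120.0 AND -100.0 AND y BETWEEN 30.0 AND 40.0;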
To determine whether filter push-down is used, you can check the Boolean value of the filter_table_transpose column from the query:
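The filter_table_transpose column appears in the table function details output; a sketch (function name illustrative):

SHOW TABLE FUNCTIONS DETAILS tf_geo_rasterize;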
Currently for system table functions, you cannot change push-down behavior.
You can query which table functions are available using SHOW TABLE FUNCTIONS:

Information about the expected input and output argument names and types, as well as other information such as whether the function can run on CPU, GPU, or both, and whether filter push-down is enabled, can be queried via SHOW TABLE FUNCTIONS DETAILS <table_function_name>;
The following system table functions are available in HEAVY.AI. The table provides a summary and links to more information about each function.
For information about the HeavyRF radio frequency propagation simulation and HeavyRF table functions, see HeavyRF.
The TABLE command is required to wrap a table function clause; for example:

SELECT * FROM TABLE(generate_series(1, 10));

The CURSOR command is required to wrap any subquery inputs.
Given a query input of entity keys, feature columns, and a metric column, and a second query input specifying a search vector of feature columns and a metric, computes the similarity of each entity in the first input to the search vector. The score is computed as the cosine similarity of the feature column(s) for each entity with the feature column(s) for the search vector, which can optionally be TF/IDF weighted.
Aggregates point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type) across all points in each bin as the output value for the bin. Allowed aggregate types are AVG, COUNT, SUM, MIN, and MAX. If neighborhood_fill_radius is set greater than 0, a blur pass/kernel is computed on top of the results according to the optionally specified fill_agg_type, with allowed types of GAUSS_AVG, BOX_AVG, COUNT, SUM, MIN, and MAX (if not specified, defaults to GAUSS_AVG, a Gaussian-average kernel). If fill_only_nulls is set to true, only null bins from the first aggregate step have final output values computed from the blur pass; otherwise, all values are affected by the blur pass.

Note that the arguments to bound the spatial output grid (x_min, x_max, y_min, y_max) are optional; however, either all or none of these arguments must be supplied. If the arguments are not supplied, the spatial output grid is bounded by the x/y range of the input query, and if SQL filters are applied on the output of the tf_geo_rasterize table function, these filters also constrain the output range.
Example
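A hedged sketch of an invocation; the raster cursor argument name and bin_dim_meters are assumptions not documented in this excerpt:

SELECT * FROM TABLE(tf_geo_rasterize(
  raster => CURSOR(SELECT lon, lat, z FROM point_table),
  agg_type => 'MAX',
  bin_dim_meters => 100.0,
  neighborhood_fill_radius => 0,
  fill_only_nulls => false));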
HEAVY.AI supports a subset of object types and functions for storing and writing queries for geospatial definitions.
For information about geospatial datatype sizes, see Storage and Compression in Datatypes.
For more information on WKT primitives, see Wikipedia: Well-known Text: Geometric objects.
HEAVY.AI supports SRID 4326 (WGS 84), 900913 (Google Web Mercator), and 32601-32660 and 32701-32760 (Universal Transverse Mercator (UTM) zones). When using geospatial fields, you set the SRID to determine which reference system to use. HEAVY.AI does not assign a default SRID.
If you do not set the SRID of the geo field in the table, you can set it in a SQL query using ST_SETSRID(column_name, SRID). For example, ST_SETSRID(a.pt, 4326).
When representing longitude and latitude, the first coordinate is assumed to be longitude in HEAVY.AI geospatial primitives.
You create geospatial objects as geometries (planar spatial data types), which are supported by the planar geometry engine at run time. When you call ST_DISTANCE on two geometry objects, the engine returns the shortest straight-line planar distance, in degrees, between those points. For example, the following query returns the shortest distance between the point(s) in p1 and the polygon(s) in poly1:
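A sketch (the table name geo1 is illustrative):

SELECT ST_DISTANCE(p1, poly1) FROM geo1;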
For information about importing data, see Importing Geospatial Data.
Geospatial functions that expect geospatial object arguments accept geospatial columns, geospatial objects returned by other functions, or string literals containing WKT representations of geospatial objects. Supplying a WKT string is equivalent to calling a geometry constructor. For example, these two queries are identical:
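A sketch of the equivalence (table and column names are illustrative):

SELECT COUNT(*) FROM t WHERE ST_CONTAINS('POLYGON((0 0,4 0,4 4,0 4,0 0))', pt);
SELECT COUNT(*) FROM t WHERE ST_CONTAINS(ST_GeomFromText('POLYGON((0 0,4 0,4 4,0 4,0 0))'), pt);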
You can create geospatial literals with a specific SRID. For example:
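A sketch:

SELECT ST_GeomFromText('POINT(-71.064544 42.28787)', 4326);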
HEAVY.AI provides support for geography objects and geodesic distance calculations, with some limitations.
HeavyDB supports import from any coordinate system supported by the Geospatial Data Abstraction Library (GDAL). On import, HeavyDB will convert to and store in WGS84 encoding, and rendering is accurate in Immerse.
However, no built-in way to reference the original coordinates currently exists in Immerse, and coordinates exported from Immerse will be WGS84 coordinates. You can work around this limitation by adding to the dataset a column or columns in non-geo format that could be included for display in Immerse (for example, in a popup) or on export.
Currently, HEAVY.AI supports spheroidal distance calculation between:
Two points using either SRID 4326 or 900913.
A point and a polygon/multipolygon using SRID 900913.
Using SRID 900913 results in variance compared to SRID 4326 as polygons approach the North and South Poles.
The following query returns the points and polygons within 1,000 meters of each other:
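A hedged sketch, assuming point and polygon tables with geometries comparable in SRID 900913 (meters):

SELECT a.id, b.id
FROM points a, polys b
WHERE ST_DISTANCE(ST_TRANSFORM(a.pt, 900913), b.poly_900913) < 1000;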
See the tables in Geospatial Functions below for examples.
HEAVY.AI supports the functions listed below.

ST_Buffer

Special processing is automatically applied to WGS84 input geometries (SRID=4326) to limit buffer distortion:

- The implementation first determines the best planar SRID to which to project the 4326 input geometry.
- Preferred SRIDs are UTM and Lambert (LAEA) North/South zones, with Mercator used as a fallback.
- The buffer distance is interpreted as distance in meters (the units of all planar SRIDs being considered).
- The input geometry is transformed to the best planar SRID and handed to GEOS, along with the buffer distance.
- The buffer geometry built by GEOS is then transformed back to SRID=4326 and returned.

Example: Build 10-meter buffer geometries (SRID=4326) with limited distortion: SELECT ST_Buffer(poly4326, 10.0) FROM tbl;

ST_Centroid

Computes the geometric center of a geometry as a POINT.
You can use SQL code similar to the examples in this topic as global filters in Immerse.
CREATE TABLE AS SELECT is not currently supported for geo data types in distributed mode.

GROUP BY is not supported for geo types (POINT, MULTIPOINT, LINESTRING, MULTILINESTRING, POLYGON, or MULTIPOLYGON).
You can use \d table_name to determine if the SRID is set for the geo field:

If no SRID is returned, you can set the SRID using ST_SETSRID(column_name, SRID). For example, ST_SETSRID(myPoint, 4326).
Uber H3 is an open-source geospatial system created by Uber Technologies. H3 provides a hierarchical grid system that divides the Earth's surface into hexagons of varying sizes, allowing for easy location-based indexing, search, and analysis.
Hexagons can be created at a single scale, for instance to fill an arbitrary polygon at one resolution. They can also be used to generate a much smaller number of hexagons at multiple scales. In general, operating on H3 hexagons is much faster than on raw arbitrary geometries, at a cost of some precision. Because each hexagon is exactly the same size, this is particularly advantageous for GPU-accelerated workflows.
A principal advantage of the system is that, for a given scale, hexagons are approximately equal in area. This stands in contrast to other subdivision schemes based on longitudes and latitudes or web Mercator map projections.
A second advantage is that with hexagons, neighbors in all directions are equidistant. This is not true for rectangular subdivisions like pixels, whose 8 neighbors are at different distances.
The exact amount of precision lost can be tightly bounded, with the smallest supported hexagons being about 1 m². That is more accurate than most currently available data sources, short of survey data.
There are some disadvantages to be aware of. The first is that the world cannot actually be divided up completely cleanly into hexagons. It turns out that a few pentagons are needed, and this introduces discontinuities. However, the system has cleverly placed those pentagons far away from any land masses, so this is only a practical concern for specific maritime operations.
The second issue is that hexagons at adjacent scales do not nest exactly:
This does not much affect practical operations at any single scale. But if you look carefully at a multiscale plot, such as the California example referenced above, you will discover tiny discontinuities in the form of gaps or overlaps. These do not amount to a large percentage of the total area, but they nonetheless mean this method is not appropriate when exact counts are required.
Encodes columnar point geometry into a globally-unique h3 cell ID for the specified h3_scale. Scales run from 0 to 15 inclusive, where 0 represents the coarsest resolution and 15 the finest. For details on h3 scales, please see the base library documentation.
This can be applied to literal values:
Or to columnar geographic points:
Note that if you have geographic point data rather than columnar latitude and longitude, you can use the ST_X and ST_Y functions. Also, if you wish to encode the centroids of polygons, such as for building footprints, you can combine this with the ST_CENTROID function.
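Sketches of both usages; the encoding function name geoToH3(lon, lat, scale) is an assumption, as it is not shown in this excerpt:

SELECT geoToH3(-122.4194, 37.7749, 10);                 -- literal values
SELECT geoToH3(lon, lat, 10) FROM sites;                -- columnar lon/lat
SELECT geoToH3(ST_X(pt), ST_Y(pt), 10) FROM sites_geo;  -- from a POINT column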
To retrieve geometric coordinates from an H3 code, two functions are available.
h3ToLat and h3ToLon extract the latitude and longitude respectively, for example:
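A sketch (the column name h3 is illustrative):

SELECT h3ToLat(h3), h3ToLon(h3) FROM sites_h3;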
Given an H3 code, the function h3ToParent is available to find cells above that cell at any hierarchical level. This means that once codes are computed at high resolution, they can be compared to codes at other scales.
Uber's h3 Python library provides a wider range of functions than those available above (although at significantly slower performance). The library defaults to generating H3 codes as hexadecimal strings, but can be configured to support BIGINT codes. See Uber's documentation for details.
H3 codes can be used in regular joins, including joins in Immerse. They can also be used as aggregators, such as in Immerse custom dimensions. For points that are exactly aligned, such as imports from raster data bands of the same source, aggregating on H3 codes is faster than the exact geographic overlaps function ST_EQUALS.
Functions and Operators (DML)
Parenthesization
Multiplication and division
Addition and subtraction
Usage Notes
The following wildcard characters are supported by LIKE and ILIKE:

% matches any number of characters, including zero characters.

_ matches exactly one character.
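A sketch (names illustrative):

SELECT name FROM users WHERE name LIKE 'Jo%';   -- matches Jo, Joe, John, ...
SELECT name FROM users WHERE name ILIKE 'j_n%'; -- case-insensitive; matches Jan, jon, jane, ...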
Supported date_part types:
Supported interval types:
For two-digit years, years 69-99 are assumed to be the previous century (for example, 1969), and 0-68 are assumed to be the current century (for example, 2016).
For four-digit years, negative years (BC) are not supported.
Hours are expressed in 24-hour format.
When time components are separated by colons, you can write them as one or two digits.
Months are case insensitive. You can spell them out or abbreviate to three characters.
For timestamps, decimal seconds are ignored. Time zone offsets are written as +/-HHMM.
For timestamps, a numeric string is converted to +/- seconds since January 1, 1970. Supported timestamps range from -30610224000 (January 1, 1000) through 29379456000 (December 31, 2900).
On output, dates are formatted as YYYY-MM-DD. Times are formatted as HH:MM:SS.
Linux EPOCH values range from -30610224000 (1/1/1000) through 185542587100800 (1/1/5885487). Complete range in years: +/-5,883,517 around epoch.
Both double-precision (standard) and single-precision floating point statistical functions are provided. Single-precision functions run faster on GPUs but might cause overflow errors.
COUNT(DISTINCT x), especially when used in conjunction with GROUP BY, can require a very large amount of memory to keep track of all distinct values in large tables with large cardinalities. To avoid this large overhead, use APPROX_COUNT_DISTINCT.

APPROX_COUNT_DISTINCT(x, e) gives an approximate count of the value x, based on an expected error rate defined in e. The error rate is an integer value from 1 to 100. The lower the value of e, the higher the precision, and the higher the memory cost. Select a value for e based on the level of precision required. On large tables with large cardinalities, consider using APPROX_COUNT_DISTINCT when possible to preserve memory. When data cardinalities permit, HEAVY.AI uses the precise implementation of COUNT(DISTINCT x) for APPROX_COUNT_DISTINCT. Set the default error rate using the -hll-precision-bits configuration parameter.
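A sketch (table and column names are illustrative):

SELECT APPROX_COUNT_DISTINCT(user_id, 10) FROM events;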
The accuracy of APPROX_MEDIAN(x) depends upon the distribution of the data. For example:

For 100,000,000 integers (1, 2, 3, ... 100M) in random order, APPROX_MEDIAN can provide a highly accurate answer, to 5+ significant digits.

For 100,000,001 integers, where 50,000,000 have the value 0 and 50,000,001 have the value 1, APPROX_MEDIAN returns a value close to 0.5, even though the median is 1.
Currently, HEAVY.AI does not support grouping by non-dictionary-encoded strings. However, with the SAMPLE aggregate function, you can select non-dictionary-encoded strings that are presumed to be unique in a group. For example:
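A sketch consistent with the example described below (table and column names are illustrative):

SELECT user_name, SAMPLE(user_description) FROM tweets GROUP BY user_name;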
If the aggregated column (user_description in the example above) is not unique within a group, SAMPLE selects a value that might be nondeterministic because of the parallel nature of HEAVY.AI query execution.
You can create your own C++ functions and use them in your SQL queries.
User-defined functions (UDFs) require clang++ version 9. You can verify the installed version using the command clang++ --version.

UDFs currently allow any authenticated user to register and execute a runtime function. By default, runtime UDFs are globally disabled but can be enabled with the runtime flag enable-runtime-udf.
Create your function and save it in a .cpp file; for example, /var/lib/omnisci/udf_myFunction.cpp.
Add the UDF configuration flag to omnisci.conf. For example:
Use your function in a SQL query. For example:
This function, udf_diff.cpp, returns the difference of two values from a table.
Include the standard integer library, which supports the following datatypes:
bool
int8_t (cstdint), char
int16_t (cstdint), short
int32_t (cstdint), int
int64_t (cstdint), size_t
float
double
void
The next four lines are boilerplate code that allows OmniSci to determine whether the server is running with GPUs. OmniSci chooses whether it should compile the function inline to achieve the best possible performance.
The next line is the actual user-defined function, which returns the difference between INTEGER values x and y.
To run the udf_diff function, add this line to your /var/lib/omnisci/omnisci.conf file (in this example, the .cpp file is stored at /var/lib/omnisci/udf_diff.cpp):
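A sketch of the configuration line (the flag name udf is an assumption):

udf = "/var/lib/omnisci/udf_diff.cpp"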
Restart the OmniSci server.
Use your command from an OmniSci SQL client to query, for example, a table named myTable that contains the INTEGER columns myInt1 and myInt2.
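A sketch:

SELECT udf_diff(myInt1, myInt2) FROM myTable;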
OmniSci returns the difference as an INTEGER value.
Given a distance-weighted directed graph, consisting of a query CURSOR input consisting of the starting and ending node for each edge and a distance, and a specified origin and destination node, tf_graph_shortest_path computes the shortest distance-weighted path through the graph between origin_node and destination_node, returning a row for each node along the computed shortest path, with the traversal-ordered index of that node and the cumulative distance from origin_node to that node. If either origin_node or destination_node does not exist, an error is returned.
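A hedged sketch of an invocation; the edge_list argument name is borrowed from the tf_graph_shortest_paths_distances description, and the cursor column names are illustrative:

SELECT * FROM TABLE(tf_graph_shortest_path(
  edge_list => CURSOR(SELECT node1, node2, distance FROM edges),
  origin_node => 1,
  destination_node => 42));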
Input Arguments
Parameter | Description | Data Types |
---|---|---|
Output Columns
Name | Description | Data Types |
---|---|---|
Example A
Example B
Similar to tf_geo_rasterize, but also computes the slope and aspect per output bin.

Aggregates point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type) across all points in each bin as the output value for the bin. A Gaussian average is then taken over the neighboring bins, with the number of bins specified by neighborhood_fill_radius, optionally filling in only null-valued bins if fill_only_nulls is set to true. The slope and aspect are then computed for every bin, based on the z values of that bin and its neighboring bins. The slope can be returned in degrees or as a fraction between 0 and 1, depending on the Boolean argument to compute_slope_in_degrees.
Note that the bounds of the spatial output grid are bounded by the x/y range of the input query, and if SQL filters are applied on the output of the tf_geo_rasterize_slope table function, these filters also constrain the output range.
Parameter | Description | Data Types |
---|---|---|
Example
Given a distance-weighted directed graph, consisting of a query CURSOR input consisting of the starting and ending node for each edge and a distance, and a specified origin node, tf_graph_shortest_paths_distances computes the shortest distance-weighted path distance between the origin_node and every other node in the graph. It returns a row for each node in the graph, with output columns consisting of the input origin_node, the given destination_node, the distance for the shortest path between the two nodes, and the number of edges or graph "hops" between the two nodes. If origin_node does not exist in the node1 column of the edge_list CURSOR, an error is returned.
Input Arguments
Parameter | Description | Data Types |
---|---|---|
Output Columns
Name | Description | Data Types |
---|---|---|
Example A
Example B
Aggregates point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type) across all points in each bin as the output value for the bin. A Gaussian average is then taken over the neighboring bins, with the number of bins specified by neighborhood_fill_radius, optionally filling in only null-valued bins if fill_only_nulls is set to true.

The graph shortest path is then computed between an origin point on the grid specified by origin_x and origin_y and a destination point on the grid specified by destination_x and destination_y, where the shortest path is weighted by the nth exponent of the computed slope between a bin and its neighbors, with the nth exponent specified by slope_weighted_exponent. A maximum allowed traversable slope can be specified by slope_pct_max, such that no traversal is considered or allowed between bins with absolute computed slopes greater than the percentage specified by slope_pct_max.
Input Arguments
Output Columns
Loads one or more las or laz point cloud/LiDAR files from a local file or directory source, optionally transforming the output SRID to out_srs (if not specified, output points are automatically transformed to EPSG:4326 lon/lat pairs).

If use_cache is set to true, an internal point cloud-specific cache is used to hold the results per input file; if the file is queried again, this significantly speeds up the query, allowing for interactive querying of a point cloud source. If the results of tf_load_point_cloud will only be consumed once (for example, as part of a CREATE TABLE statement), it is highly recommended that use_cache be set to false or left unspecified (it defaults to false) to avoid the performance and memory overhead incurred by use of the cache.
The bounds of the data retrieved can optionally be specified with the x_min, x_max, y_min, y_max arguments. These arguments can be useful when you want to retrieve a small geographic area from a large point-cloud file set, because files containing data outside the bounds of the specified bounding box are quickly skipped by tf_load_point_cloud, requiring only a quick read of the spatial metadata for the file.
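A hedged sketch (the path argument name and file path are assumptions):

SELECT * FROM TABLE(tf_load_point_cloud(
  path => '/data/lidar/tile_042.laz',
  use_cache => false));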
Input Arguments
Parameter | Description | Data Types |
---|
Output Columns
Example A
Example B
Returns metadata for one or more las or laz point cloud/LiDAR files from a local file or directory source, optionally constraining the bounding box for metadata retrieved to the lon/lat bounding box specified by the x_min, x_max, y_min, y_max arguments.

Note: The specified path must be contained in the global allowed-import-paths configuration; otherwise, an error is returned.
Input Arguments
Parameter | Description | Data Types |
---|
Output Columns
Name | Description | Data Types |
---|
Example
Window functions allow you to work with a subset of rows related to the currently selected row. For a given dimension, you can find the most associated dimension by some other measure (for example, number of records or sum of revenue).
Window functions must always contain an OVER clause. The OVER clause splits up the rows of the query for processing by the window function.
The PARTITION BY list divides the rows into groups that share the same values of the PARTITION BY expression(s). For each row, the window function is computed using all rows in the same partition as the current row.
Rows that have the same value in the ORDER BY clause are considered peers. The ranking functions give the same answer for any two peer rows.
HeavyDB supports the aggregate functions AVG, MIN, MAX, SUM, and COUNT in window functions.
Updates on window functions are supported, assuming the target table is single-fragment. Updates on multi-fragment target tables are not currently supported.
This query shows the top airline carrier for each state, based on the number of departures.
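A hedged sketch of such a query (table and column names are illustrative):

SELECT origin_state, carrier_name, departures
FROM (
  SELECT origin_state, carrier_name, COUNT(*) AS departures,
         ROW_NUMBER() OVER (PARTITION BY origin_state ORDER BY COUNT(*) DESC) AS rn
  FROM flights
  GROUP BY origin_state, carrier_name) t
WHERE rn = 1;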
A window function can include a frame clause that specifies a set of neighboring rows of the current row belonging to the same partition. This allows us to compute a window aggregate function over the window frame, instead of computing it against the entire partition. Note that a window frame for the current row is computed based on either 1) the number of rows before or after the current row (called rows mode) or 2) the specified ordering column value in the frame clause (called range mode).
For example:
From the starting row of the partition to the current row: Using the SUM aggregate function, you can compute the running sum of the partition.

You can construct a frame based on the position of the rows (rows mode). For example, a frame from 3 rows before to 2 rows after the current row:

You can compute the aggregate function of the frame having up to six rows (including the current row).

You can organize a frame based on the value of the ordering column (range mode): Assuming C is the current ordering column value, you can compute the aggregate value of the window frame containing rows having ordering column values between (C - 3) and (C + 2).
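Sketches of the corresponding frame clauses (table and column names are illustrative):

-- Running sum from the partition start to the current row:
SELECT SUM(x) OVER (PARTITION BY p ORDER BY o ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM t;
-- Rows mode: 3 rows before through 2 rows after (up to six rows):
SELECT SUM(x) OVER (PARTITION BY p ORDER BY o ROWS BETWEEN 3 PRECEDING AND 2 FOLLOWING) FROM t;
-- Range mode: ordering values between C - 3 and C + 2:
SELECT SUM(x) OVER (PARTITION BY p ORDER BY o RANGE BETWEEN 3 PRECEDING AND 2 FOLLOWING) FROM t;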
Window functions that ignore the frame are evaluated on the entire partition.
Note that we can define the window frame clause using rows mode with an ordering column.
You can use the following aggregate functions with the window frame clause.
<frame_mode> <frame_bound>

<frame_mode> can be one of the following:

rows

range
1 | 2 | 3 | 4 | 5.5 | 7.5 | 8 | 9 | 10 → the value of each tuple’s ORDER BY expression.

When the current row has the value 5.5:

ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING: 3 rows before and 3 rows after → {2, 3, 4, 5.5, 7.5, 8, 9}

RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING: 5.5 - 3 <= x <= 5.5 + 3 → {3, 4, 5.5, 7.5, 8}
<frame_bound> can be one of the following:

frame_start, or

frame_between: BETWEEN frame_start AND frame_end

frame_start and frame_end can be one of the following:

UNBOUNDED PRECEDING: The start row of the partition that the current row belongs to.

UNBOUNDED FOLLOWING: The end row of the partition that the current row belongs to.
CURRENT ROW
For rows mode: the current row.
For range mode: the peers of the current row. A peer is a row having the same value as the ordering column expression of the current row. Note that all null values are peers of each other.
expr PRECEDING
For rows mode: expr rows before the current row.
For range mode: rows with the current row’s ordering expression value minus expr.
For DATE, TIME, and TIMESTAMP: Use the INTERVAL keyword with a specific time unit, depending on a data type:
TIMESTAMP type: NANOSECOND, MICROSECOND, MILLISECOND, SECOND, MINUTE, HOUR, DAY, MONTH, and YEAR
TIME type: SECOND, MINUTE, and HOUR
DATE type: DAY, MONTH, and YEAR
For example:
RANGE BETWEEN INTERVAL 1 DAY PRECEDING and INTERVAL 3 DAY FOLLOWING
Currently, only literal expressions, such as 1 PRECEDING and 100 PRECEDING, are supported as expr.
expr FOLLOWING
For rows mode: expr rows after the current row.
For range mode: rows with the current row’s ordering expression value plus expr.
For DATE, TIME, and TIMESTAMP: Use the INTERVAL keyword with a specific time unit, depending on a data type:
TIMESTAMP type: NANOSECOND, MICROSECOND, MILLISECOND, SECOND, MINUTE, HOUR, DAY, MONTH, and YEAR
TIME type: SECOND, MINUTE, and HOUR
DATE type: DAY, MONTH, and YEAR
For example:
RANGE BETWEEN INTERVAL 1 DAY PRECEDING and INTERVAL 3 DAY FOLLOWING
Currently, only literal expressions, such as 1 FOLLOWING and 100 FOLLOWING, are supported as expr.
UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING have the same meaning in both rows and range mode.
When the query has no window frame bound, the window aggregate function is computed differently depending on the existence of the ORDER BY clause:
Has ORDER BY clause: The window function is computed with the default frame bound, which is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
No ORDER BY clause: The window function is computed over the entire partition.
You can refer to the same window clause in multiple window aggregate functions by defining it with a unique name in the query definition.
For example, you can define the named window clauses W1 and W2 as follows:
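A sketch (query body illustrative):

SELECT SUM(revenue) OVER w1, AVG(revenue) OVER w2
FROM sales
WINDOW w1 AS (PARTITION BY region ORDER BY ts),
       w2 AS (PARTITION BY region ORDER BY ts ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING);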
The named window clause w1 refers to a window function clause without a window frame clause, and w2 refers to a named window frame clause.
To use window framing, you may need an ORDER BY clause in the window definition. Depending on the framing mode used, the constraint varies:

Rows mode: No restriction on the existence of an ordering column; multiple ordering columns can be included.

Range mode: Exactly one ordering column is required (multi-column ordering is not supported).
Currently, all window functions, including aggregates over window frames, are computed in CPU mode.
For window frame bound expressions, only non-negative integer literals are supported.
GROUPING mode and EXCLUDING are not currently supported.
Computes the Mandelbrot set over the complex domain [x_min, x_max), [y_min, y_max), discretizing the xy-space into an output of dimensions x_pixels × y_pixels. The output for each cell is the number of iterations needed to escape to infinity, up to and including the specified max_iterations.
Parameter | Data Type |
---|
Example
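A hedged sketch of an invocation; the function name tf_mandelbrot is inferred from the parameters described above:

SELECT * FROM TABLE(tf_mandelbrot(
  x_pixels => 1024, y_pixels => 1024,
  x_min => -2.5, x_max => 1.0,
  y_min => -1.0, y_max => 1.0,
  max_iterations => 256));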
Process a raster input to derive contour lines or regions and output them as LINESTRING or POLYGON for rendering or further processing. Each has two variants:

- One that re-rasterizes the input points
- One that accepts raw raster points directly

Use the rasterizing variants if the raster table rows are not already sorted in row-major order (for example, if they represent an arbitrary 2D point cloud), or if filtering or binning is required to reduce the input data to a manageable count (to speed up the contour processing) or to smooth the input data before contour processing. If the input rows do not already form a rectilinear region, the output region is their 2D bounding box. Many of the parameters of the rasterizing variant are directly equivalent to those of tf_geo_rasterize; see that function for details.

The direct variants require that the input rows represent a rectilinear region of pixels in nonsparse row-major order. The dimensions must also be provided, and (raster_width * raster_height) must match the input row count. The contour processing is then performed directly on the raster values with no preprocessing.
The line variants generate LINESTRING geometries that represent the contour lines of the raster space at the given interval, with an optional offset. For example, a raster space representing a height field with a range of 0.0 to 1000.0 and a contour interval of 100.0 will likely result in 10 or 11 lines, each with a corresponding contour_values value (0.0, 100.0, 200.0, and so on). If contour_offset is set to 50.0, the lines are instead generated at 50.0, 150.0, 250.0, and so on. The lines can be open or closed, and can form rings or terminate at the edges of the raster space.
The polygon variants generate POLYGON geometries that represent the regions between contour lines (for example, from 0.0 to 100.0, and from 100.0 to 200.0). If the raster space has multiple regions within a given value range, a POLYGON row is output for each of those regions. The corresponding contour_values value for each is the lower bound of the range for that region.
Input parameters (integer series):

| Parameter | Description | Data Type |
|---|---|---|
| <series_start> | Starting integer value, inclusive. | BIGINT |
| <series_end> | Ending integer value, inclusive. | BIGINT |
| <series_step> (optional, defaults to 1) | Increment to increase or decrease the values that follow. | BIGINT |

Output:

| Name | Description | Data Type |
|---|---|---|
| generate_series | The integer series specified by the input arguments. | Column<BIGINT> |

Input parameters (timestamp series):

| Parameter | Description | Data Type |
|---|---|---|
| series_start | Starting timestamp value, inclusive. | TIMESTAMP(9) (timestamp literals with other precisions are auto-casted to TIMESTAMP(9)) |
| series_end | Ending timestamp value, inclusive. | TIMESTAMP(9) (timestamp literals with other precisions are auto-casted to TIMESTAMP(9)) |
| series_step | Time/date interval signifying the step between each element in the returned series. | INTERVAL |

Output:

| Name | Description | Data Type |
|---|---|---|
| generate_series | The timestamp series specified by the input arguments. | Column<TIMESTAMP(9)> |
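For example, a minimal sketch that generates the odd integers from 1 through 9, using the TABLE operator through which table functions are invoked:
SELECT * FROM TABLE(generate_series(1, 9, 2));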
| Type | Size | Example |
|---|---|---|
| LINESTRING | Variable | A sequence of 2 or more points and the lines that connect them. For example: LINESTRING(0 0,1 1,1 2) |
| MULTIPOLYGON | Variable | A set of one or more polygons. For example: MULTIPOLYGON(((0 0,4 0,4 4,0 4,0 0),(1 1,2 1,2 2,1 2,1 1)), ((-1 -1,-1 -2,-2 -2,-2 -1,-1 -1))) |
| POINT | Variable | A point described by two coordinates. When the coordinates are longitude and latitude, HEAVY.AI stores longitude first, and then latitude. For example: POINT(0 0) |
| POLYGON | Variable | A set of one or more rings (closed line strings), with the first representing the shape (external ring) and the rest representing holes in that shape (internal rings). For example: POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,1 1)) |
| MULTIPOINT | Variable | A set of one or more points. For example: MULTIPOINT((0 0), (1 1), (2 2)) |
| MULTILINESTRING | Variable | A set of one or more associated lines, each of two or more points. For example: MULTILINESTRING((0 0, 1 0, 2 0), (0 1, 1 1, 2 1)) |
| Function | Description |
|---|---|
| ST_Centroid | Computes the geometric center of a geometry as a POINT. |
| ST_GeomFromText(WKT) | Returns a specified geometry value from a Well-Known Text representation. |
| ST_GeomFromText(WKT, SRID) | Returns a specified geometry value from a Well-Known Text representation and an SRID. |
| ST_GeogFromText(WKT) | Returns a specified geography value from a Well-Known Text representation. |
| ST_GeogFromText(WKT, SRID) | Returns a specified geography value from a Well-Known Text representation and an SRID. |
| ST_Point(double lon, double lat) | Returns a point constructed on the fly from the provided coordinate values. Constant coordinates result in construction of a POINT literal. Example: ST_Contains(poly4326, ST_SetSRID(ST_Point(lon, lat), 4326)) |
| ST_AsText(geom) \| ST_AsWKT(geom) | Converts a geometry input to a Well-Known Text (WKT) string. |
| ST_AsBinary(geom) \| ST_AsWKB(geom) | Converts a geometry input to a Well-Known Binary (WKB) string. |
ST_Buffer
Returns a geometry covering all points within a specified distance from the input geometry. Performed by the GEOS module. The output is currently limited to the MULTIPOLYGON type.
Calculations are in the units of the input geometry’s SRID. Buffer distance is expressed in the same units. Example:
SELECT ST_Buffer('LINESTRING(0 0, 10 0, 10 10)', 1.0);
Special processing is automatically applied to WGS84 input geometries (SRID=4326) to limit buffer distortion:
The implementation first determines the best planar SRID to which to project the 4326 input geometry.
Preferred SRIDs are UTM and Lambert (LAEA) North/South zones, with Mercator used as a fallback.
Buffer distance is interpreted as distance in meters (units of all planar SRIDs being considered).
The input geometry is transformed to the best planar SRID and handed to GEOS, along with buffer distance.
The buffer geometry built by GEOS is then transformed back to SRID=4326 and returned.
Example: Build 10-meter buffer geometries (SRID=4326) with limited distortion:
SELECT ST_Buffer(poly4326, 10.0) FROM tbl;
ST_TRANSFORM
Returns a geometry with its coordinates transformed to a different spatial reference. Currently, WGS84 to Web Mercator transform is supported. For example:
ST_DISTANCE(
ST_TRANSFORM(ST_GeomFromText('POINT(-71.064544 42.28787)', 4326), 900913),
ST_GeomFromText('POINT(-13189665.9329505 3960189.38265416)', 900913)
)
ST_TRANSFORM is not currently supported in projections; it can be used only to transform geo inputs to other functions, such as ST_DISTANCE.
ST_SETSRID
Set the SRID to a specific integer value. For example:
ST_TRANSFORM(
ST_SETSRID(ST_GeomFromText('POINT(-71.064544 42.28787)'), 4326), 900913 )
| Function | Description |
|---|---|
| ST_X | Returns the X value from a POINT column. |
| ST_Y | Returns the Y value from a POINT column. |
| ST_XMIN | Returns the X minimum of a geometry. |
| ST_XMAX | Returns the X maximum of a geometry. |
| ST_YMIN | Returns the Y minimum of a geometry. |
| ST_YMAX | Returns the Y maximum of a geometry. |
| ST_STARTPOINT | Returns the first point of a LINESTRING as a POINT. |
| ST_ENDPOINT | Returns the last point of a LINESTRING as a POINT. |
| ST_POINTN | Returns the Nth point of a LINESTRING as a POINT. |
| ST_NPOINTS | Returns the number of points in a geometry. |
| ST_NRINGS | Returns the number of rings in a POLYGON or a MULTIPOLYGON. |
| ST_SRID | Returns the spatial reference identifier for the underlying object. |
| ST_NUMGEOMETRIES | Returns the number of component geometries in a MULTIPOINT, MULTILINESTRING, or MULTIPOLYGON. Returns 1 for a non-MULTI geometry. |
ST_INTERSECTION
Returns a geometry representing an intersection of two geometries; that is, the section that is shared between the two input geometries. Performed by the GEOS module.
The output is currently limited to MULTIPOLYGON type, because HEAVY.AI does not support mixed geometry types within a geometry column, and ST_INTERSECTION
can potentially return points, lines, and polygons from a single intersection operation.
Lower-dimension intersecting features such as points and line strings are returned as very small buffers around those features. If needed, true points can be recovered by applying the ST_CENTROID method to point intersection results. In addition, ST_PERIMETER/2 of resulting line intersection polygons can be used to approximate line length.
Empty/NULL geometry outputs are not currently supported.
Examples:
SELECT ST_Intersection('POLYGON((0 0,3 0,3 3,0 3))', 'POLYGON((1 1,4 1,4 4,1 4))');
SELECT ST_Area(ST_Intersection(poly, 'POLYGON((1 1,3 1,3 3,1 3,1 1))')) FROM tbl;
ST_DIFFERENCE
Returns a geometry representing the portion of the first input geometry that does not intersect with the second input geometry. Performed by the GEOS module. Input order is important; the return geometry is always a section of the first input geometry.
The output is currently limited to MULTIPOLYGON type, for the same reasons described in ST_INTERSECTION
. Similar post-processing methods can be applied if needed.
Empty/NULL geometry outputs are not currently supported.
Examples:
SELECT ST_Difference('POLYGON((0 0,3 0,3 3,0 3))', 'POLYGON((1 1,4 1,4 4,1 4))');
SELECT ST_Area(ST_Difference(poly, 'POLYGON((1 1,3 1,3 3,1 3,1 1))')) FROM tbl;
ST_UNION
Returns a geometry representing the union (or combination) of the two input geometries. Performed by the GEOS module.
The output is currently limited to MULTIPOLYGON type for the same reasons described in ST_INTERSECTION
. Similar post-processing methods can be applied if needed.
Empty/NULL geometry outputs are not currently supported.
Examples:
SELECT ST_UNION('POLYGON((0 0,3 0,3 3,0 3))', 'POLYGON((1 1,4 1,4 4,1 4))');
SELECT ST_AREA(ST_UNION(poly, 'POLYGON((1 1,3 1,3 3,1 3,1 1))')) FROM tbl;
ST_DISTANCE
Returns shortest planar distance between geometries. For example:
ST_DISTANCE(poly1, ST_GeomFromText('POINT(0 0)'))
Returns shortest geodesic distance between two points, in meters, if given two point geographies. Point geographies can be specified through casts from point geometries or as literals. For example:
ST_DISTANCE(
CastToGeography(p2),
ST_GeogFromText('POINT(2.5559 49.0083)', 4326)
)
SELECT a.name,
ST_DISTANCE(
CAST(a.pt AS GEOGRAPHY),
CAST(b.pt AS GEOGRAPHY)
) AS dist_meters
FROM starting_point a, destination_points b;
You can also calculate the distance between a POLYGON and a POINT. If both fields use SRID 4326, then the calculated distance is in 4326 units (degrees). If both fields use SRID 4326, and both are transformed into 900913, then the results are in 900913 units (meters).
The following SQL code returns the names of polygons where the distance between the point and polygon is less than 1,000 meters.
SELECT a.poly_name FROM poly a, point b WHERE ST_DISTANCE(
ST_TRANSFORM(b.location,900913),
ST_TRANSFORM(a.heavyai_geo,900913)
) < 1000;
ST_EQUALS
Returns TRUE if the first input geometry and the second input geometry are spatially equal; that is, they occupy the same space. Different orderings of points can be accepted as equal if they represent the same geometry structure.
POINTs comparison is performed natively. All other geometry comparisons are performed by GEOS.
If the input geometries are both uncompressed or both compressed, all comparisons to identify equality are precise. For mixed combinations, the comparisons are performed with a compression-specific tolerance that allows recognition of equality despite the subtle precision losses that compression may introduce. Note: Geo columns and literals with SRID=4326 are compressed by default.
Examples:
SELECT COUNT(*) FROM tbl WHERE ST_EQUALS('POINT(2 2)', pt);
SELECT ST_EQUALS('POLYGON ((0 0,1 0,0 1))', 'POLYGON ((0 0,0 0.5,0 1,1 0,0 0))');
ST_MAXDISTANCE
Returns the longest planar distance between two geometries; in effect, the diameter of a circle that encloses both geometries. Only certain combinations of input geometry types are currently supported.
ST_CONTAINS
Returns true if the first geometry object contains the second object. You can use ST_CONTAINS to:
Return the count of polys that contain the point (here as WKT):
SELECT count(*) FROM geo1 WHERE ST_CONTAINS(poly1, 'POINT(0 0)');
Return names from a polys table that contain points in a points table:
SELECT a.name FROM polys a, points b WHERE ST_CONTAINS(a.heavyai_geo, b.location);
Return names from a polys table that contain points in a points table, using a single point in WKT instead of a field in another table:
SELECT name FROM poly WHERE ST_CONTAINS(
heavyai_geo, ST_GeomFromText('POINT(-98.4886935 29.4260508)', 4326)
);
ST_INTERSECTS
Returns true if two geometries intersect spatially, false if they do not share space. For example:
SELECT ST_INTERSECTS(
'POLYGON((0 0, 2 0, 2 2, 0 2, 0 0))',
'POINT(1 1)'
) FROM tbl;
ST_AREA
Returns the area of planar areas covered by POLYGON and MULTIPOLYGON geometries. For example:
SELECT ST_AREA(
'POLYGON((1 0, 0 1, -1 0, 0 -1, 1 0),(0.1 0, 0 0.1, -0.1 0, 0 -0.1, 0.1 0))'
) FROM tbl;
ST_AREA
does not support calculation of geographic areas, but rather uses planar coordinates. Geographies must first be projected in order to use ST_AREA
. You can do this ahead of time before import or at runtime, ideally using an equal area projection (for example, a national equal-area Lambert projection). The area is calculated in the projection's units. For example, you might use Web Mercator runtime projection to get the area of a polygon in square meters:
ST_AREA(
ST_TRANSFORM(
ST_GeomFromText(
'POLYGON((-76.6168198439371 39.9703199555959,
-80.5189990254673 40.6493554919257,
-82.5189990254673 42.6493554919257,
-76.6168198439371 39.9703199555959)
)', 4326
),
900913)
)
Web Mercator is not an equal area projection, however. Unless compensated by a scaling factor, Web Mercator areas can vary considerably by latitude.
ST_PERIMETER
Returns the cartesian perimeter of POLYGON and MULTIPOLYGON geometries. For example:
SELECT ST_PERIMETER('POLYGON(
(1 0, 0 1, -1 0, 0 -1, 1 0),
(0.1 0, 0 0.1, -0.1 0, 0 -0.1, 0.1 0)
)'
)
from tbl;
It also returns the geodesic perimeter of POLYGON and MULTIPOLYGON geographies. For example:
SELECT ST_PERIMETER(
ST_GeogFromText(
'POLYGON(
(-76.6168198439371 39.9703199555959,
-80.5189990254673 40.6493554919257,
-82.5189990254673 42.6493554919257,
-76.6168198439371 39.9703199555959)
)',
4326)
)
from tbl;
ST_LENGTH
Returns the cartesian length of LINESTRING geometries. For example:
SELECT ST_LENGTH('LINESTRING(1 0, 0 1, -1 0, 0 -1, 1 0)') FROM tbl;
It also returns the geodesic length of LINESTRING geographies. For example:
SELECT ST_LENGTH(
ST_GeogFromText('LINESTRING(
-76.6168198439371 39.9703199555959,
-80.5189990254673 40.6493554919257,
-82.5189990254673 42.6493554919257)',
4326)
) FROM tbl;
ST_WITHIN
Returns true if geometry A is completely within geometry B. For example, the following SELECT statement returns true:
SELECT ST_WITHIN(
'POLYGON ((1 1, 1 2, 2 2, 2 1))',
'POLYGON ((0 0, 0 3, 3 3, 3 0))'
) FROM tbl;
ST_DWITHIN
Returns true if the geometries are within the specified distance of one another. Distance is specified in units defined by the spatial reference system of the geometries. For example:
SELECT ST_DWITHIN(
'POINT(1 1)',
'LINESTRING (1 2,10 10,3 3)', 2.0
) FROM tbl;
ST_DWITHIN supports geodesic distances between geographies, currently limited to geographic points. For example, you can check whether Los Angeles and Paris, specified as WGS84 geographic point literals, are within 10,000 km of one another.
SELECT ST_DWITHIN(
ST_GeogFromText(
'POINT(-118.4079 33.9434)', 4326),
ST_GeogFromText('POINT(2.5559 49.0083)',
4326 ),
10000000.0) FROM tbl;
ST_DFULLYWITHIN
Returns true if the geometries are fully within the specified distance of one another. Distance is specified in units defined by the spatial reference system of the geometries. For example:
SELECT ST_DFULLYWITHIN(
'POINT(1 1)',
'LINESTRING (1 2,10 10,3 3)',
10.0) FROM tbl;
This function supports:
ST_DFULLYWITHIN(POINT, LINESTRING, distance)
ST_DFULLYWITHIN(LINESTRING, POINT, distance)
ST_DISJOINT
Returns true if the geometries are spatially disjoint (that is, the geometries do not overlap or touch). For example:
SELECT ST_DISJOINT(
'POINT(1 1)',
'LINESTRING (0 0,3 3)'
) FROM tbl;
Input parameters:

| Parameter | Description | Data Type |
|---|---|---|
| <num_strings> | The number of strings to randomly generate. | BIGINT |
| <string_length> | Length of the generated strings. | BIGINT |

Output:

| Name | Description | Data Type |
|---|---|---|
| id | Integer ID of the output row, starting at 0 and increasing monotonically. | Column<BIGINT> |
| rand_str | Random string. | Column<TEXT ENCODING DICT> |
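A hedged invocation sketch, assuming the function is named generate_random_strings and follows the named-argument TABLE syntax used by other table functions:
SELECT * FROM TABLE(generate_random_strings(num_strings => 5, string_length => 8));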
Input parameters:

| Parameter | Description | Data Type |
|---|---|---|
| entity_id | Column containing keys/IDs used to identify the entities for which dwell/session times are to be computed. Examples include IP addresses of clients visiting a website, login IDs of database users, MMSIs of ships, and call signs of airplanes. | Column<TEXT ENCODING DICT \| BIGINT> |
| site_id | Column containing keys/IDs of dwell "sites" or locations that entities visit. Examples include website pages, database session IDs, ports, airport names, or binned h3 hex IDs for geographic location. | Column<TEXT ENCODING DICT \| BIGINT> |
| ts | Column denoting the time at which an event occurred. | Column<TIMESTAMP(0\|3\|6\|9)> |
| min_dwell_seconds | Constant integer value specifying the minimum number of seconds required between the first and last timestamp-ordered record for an entity_id at a site_id to constitute a valid session and to compute and return an entity's dwell time at a site. For example, if this variable is set to 3600 (one hour), but only 1800 seconds elapse between an entity's first and last ordered timestamp records at a site, these records are not considered a valid session, and a dwell time for that session is not calculated. | BIGINT (other integer types are automatically casted to BIGINT) |
| min_dwell_points | A constant integer value specifying the minimum number of successive observations (in ts timestamp order) required to constitute a valid session and to compute and return an entity's dwell time at a site. For example, if this variable is set to 3, but only two consecutive records exist for a user at a site before they move to a new site, no dwell time is calculated for the user. | BIGINT (other integer types are automatically casted to BIGINT) |
| max_inactive_seconds | A constant integer value specifying the maximum time in seconds between two successive observations for an entity at a given site before the current session/dwell time is considered finished and a new session/dwell time is started. For example, if this variable is set to 86400 seconds (one day), and the time gap between two successive records for an entity ID at a given site ID is 86500 seconds, the session is considered ended at the first timestamp-ordered record, and a new session is started at the timestamp of the second record. | BIGINT (other integer types are automatically casted to BIGINT) |

Output:

| Name | Description | Data Type |
|---|---|---|
| entity_id | The ID of the entity for the output dwell time, identical to the corresponding entity_id column in the input. | Column<TEXT ENCODING DICT> \| Column<BIGINT> (same as the entity_id input column type) |
| site_id | The site ID for the output dwell time, identical to the corresponding site_id column in the input. | Column<TEXT ENCODING DICT> \| Column<BIGINT> (same as the site_id input column type) |
| prev_site_id | The site ID for the session preceding the current session, which might be a different site_id, the same site_id (if successive records for an entity at the same site were split into multiple sessions because the max_inactive_seconds threshold was exceeded), or null if the last site_id visited was null. | Column<TEXT ENCODING DICT> \| Column<BIGINT> (same as the site_id input column type) |
| next_site_id | The site ID for the session after the current session, which might be a different site_id, the same site_id (if successive records for an entity at the same site were split into multiple sessions because the max_inactive_seconds threshold was exceeded), or null if the next site_id visited was null. | Column<TEXT ENCODING DICT> \| Column<BIGINT> (same as the site_id input column type) |
| session_id | An auto-incrementing session ID specific/relative to the current entity_id, starting from 1 (first session) up to the total number of valid sessions for an entity_id, such that each valid session dwell time increments the session_id for an entity by 1. | Column<INT> |
| start_seq_id | The index of the timestamp-ordered (ts-ordered) record for a given entity denoting the start of the current output row's session. | Column<INT> |
| dwell_time_sec | The duration in seconds of the session. | Column<INT> |
| num_dwell_points | The number of records/observations constituting the current output row's session. | Column<INT> |
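A hedged invocation sketch; the function name tf_compute_dwell_times, the cursor argument name data, and the table/column names are assumptions for illustration:
SELECT * FROM TABLE(
  tf_compute_dwell_times(
    data => CURSOR(SELECT user_id, page_id, event_ts FROM web_events),
    min_dwell_seconds => 60,
    min_dwell_points => 3,
    max_inactive_seconds => 1800
  )
);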
| Property | Value |
|---|---|
| name | generate_series |
| signature | (i64 series_start, i64 series_stop, i64 series_step) -> Column; (i64 series_start, i64 series_stop) -> Column |
| input_names | series_start, series_stop, series_step; series_start, series_stop |
| input_types | i64 |
| output_names | generate_series |
| output_types | Column i64 |
| CPU | true |
| GPU | true |
| runtime | false |
| filter_table_transpose | false |
Generates random string data.
Generates a series of integer values.
Generates a series of timestamp values from start_timestamp
to end_timestamp
.
Given a query input with entity keys and timestamps, and parameters specifying the minimum session time, the minimum number of session records, and the max inactive seconds, outputs all unique sessions found in the data with the duration of the session.
Given a query input of entity keys/IDs, a set of feature columns, and a metric column, scores each pair of entities based on their similarity. The score is computed as the cosine similarity of the feature column(s) between each entity pair, which can optionally be TF/IDF weighted.
Given a query input of entity keys, feature columns, and a metric column, and a second query input specifying a search vector of feature columns and metric, computes the similarity of each entity in the first input to the search vector based on their similarity. The score is computed as the cosine similarity of the feature column(s) for each entity with the feature column(s) for the search vector, which can optionally be TF/IDF weighted.
Aggregates point data into x/y bins of a given size in meters to form a dense spatial grid, computing an aggregate over the z values of all points in each bin as the output value for the bin. The aggregate performed to compute the value for each bin is specified by agg_type, with allowed aggregate types of AVG, COUNT, SUM, MIN, and MAX.
Similar to tf_geo_rasterize
, but also computes the slope and aspect per output bin. Aggregates point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type
) across all points in each bin as the output value for the bin.
Given a distance-weighted directed graph, consisting of a query CURSOR input containing the starting and ending node for each edge and a distance, and a specified origin and destination node, computes the shortest distance-weighted path through the graph between origin_node and destination_node.
Given a distance-weighted directed graph, consisting of a query CURSOR input containing the starting and ending node for each edge and a distance, and a specified origin node, computes the shortest distance-weighted path distance between the origin_node and every other node in the graph.
Loads one or more las or laz point cloud/LiDAR files from a local file or directory source, optionally transforming the output SRID to out_srs. If out_srs is not specified, output points are automatically transformed to EPSG:4326 lon/lat pairs.
Computes the Mandelbrot set over the complex domain [x_min
, x_max
), [y_min
, y_max
), discretizing the xy-space into an output of dimensions x_pixels
X y_pixels
.
Returns metadata for one or more las
or laz
point cloud/LiDAR files from a local file or directory source, optionally constraining the bounding box for metadata retrieved to the lon/lat bounding box specified by the x_min
, x_max
, y_min
, y_max
arguments.
Process a raster input to derive contour lines or regions and output as LINESTRING or POLYGON for rendering or further processing.
Aggregate point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type
) across all points in each bin as the output value for the bin.
Used for generating top-k signals where 'k' represents the maximum number of antennas to consider at each geographic location. The full relevant parameter name is strongest_k_sources_per_terrain_bin.
Taking a set of point elevations and a set of signal source locations as input, tf_rf_prop_max_signal
executes line-of-sight 2.5D RF signal propagation from the provided sources over a binned 2.5D elevation grid derived from the provided point locations, calculating the max signal in dBm at each grid cell, using the formula for free-space power loss.
| Operator | Description |
|---|---|
| +numeric | Returns numeric |
| -numeric | Returns the negative value of numeric |
| numeric1 + numeric2 | Sum of numeric1 and numeric2 |
| numeric1 - numeric2 | Difference of numeric1 and numeric2 |
| numeric1 * numeric2 | Product of numeric1 and numeric2 |
| numeric1 / numeric2 | Quotient (numeric1 divided by numeric2) |
| Operator | Description |
|---|---|
| = | Equals |
| <> | Not equals |
| > | Greater than |
| >= | Greater than or equal to |
| < | Less than |
| <= | Less than or equal to |
| BETWEEN x AND y | Is a value within a range |
| NOT BETWEEN x AND y | Is a value not within a range |
| IS NULL | Is a value that is null |
| IS NOT NULL | Is a value that is not null |
| NULLIF(x, y) | Compares expressions x and y. If they differ, returns x; if they are the same, returns null. For example, if a dataset uses 'NA' for null values, you can return null using SELECT NULLIF(field_name, 'NA'). |
| IS TRUE | True if a value resolves to TRUE. |
| IS NOT TRUE | True if a value resolves to FALSE. |
| Function | Description |
|---|---|
| ABS(x) | Returns the absolute value of x |
| CEIL(x) | Returns the smallest integer not less than the argument |
| DEGREES(x) | Converts radians to degrees |
| EXP(x) | Returns the value of e to the power of x |
| FLOOR(x) | Returns the largest integer not greater than the argument |
| LN(x) | Returns the natural logarithm of x |
| LOG(x) | Returns the natural logarithm of x |
| LOG10(x) | Returns the base-10 logarithm of the specified float expression x |
| MOD(x, y) | Returns the remainder of int x divided by int y |
| PI() | Returns the value of pi |
| POWER(x, y) | Returns the value of x raised to the power of y |
| RADIANS(x) | Converts degrees to radians |
| ROUND(x) | Rounds x to the nearest integer value, but does not change the data type. For example, the double value 4.1 rounds to the double value 4. |
| ROUND_TO_DIGIT(x, y) | Rounds x to y decimal places |
| SIGN(x) | Returns the sign of x as -1, 0, or 1 for x negative, zero, or positive |
| SQRT(x) | Returns the square root of x |
| TRUNCATE(x, y) | Truncates x to y decimal places |
| WIDTH_BUCKET(target, lower-boundary, upper-boundary, bucket-count) | Defines equal-width intervals (buckets) in a range between the lower boundary and the upper boundary, and returns the bucket number to which the target expression is assigned. target: a constant, column variable, or general expression for which a bucket number is returned. lower-boundary: lower boundary for the range of values to be partitioned equally. upper-boundary: upper boundary for the range of values to be partitioned equally. bucket-count: number of equal-width buckets in the range defined by the lower and upper boundaries. Expressions can be constants, column variables, or general expressions. |

Example: Create 10 age buckets of equal width, with lower bound 0 and upper bound 100 ([0,10], [10,20] ... [90,100]), and classify the age of a customer accordingly:
SELECT WIDTH_BUCKET(age, 0, 100, 10) FROM customer;
For example, a customer of age 34 is assigned to bucket 3 ([30,40]), and the function returns the value 3.
| Function | Description |
|---|---|
| ACOS(x) | Returns the arc cosine of x |
| ASIN(x) | Returns the arc sine of x |
| ATAN(x) | Returns the arc tangent of x |
| ATAN2(y, x) | Returns the arc tangent of (x, y) in the range (-π, π]. Equal to ATAN(y/x) for x > 0. |
| COS(x) | Returns the cosine of x |
| COT(x) | Returns the cotangent of x |
| SIN(x) | Returns the sine of x |
| TAN(x) | Returns the tangent of x |
| Function | Description |
|---|---|
| DISTANCE_IN_METERS(fromLon, fromLat, toLon, toLat) | Calculates distance in meters between two WGS84 positions. |
| CONV_4326_900913_X(x) | Converts WGS84 longitude to a WGS84 Web Mercator x coordinate. |
| CONV_4326_900913_Y(y) | Converts WGS84 latitude to a WGS84 Web Mercator y coordinate. |
BASE64_DECODE(str)
Decodes a BASE64-encoded string.
BASE64_ENCODE(str)
Encodes a string to a BASE64-encoded string.
CHAR_LENGTH(str)
Returns the number of characters in a string. Only works with unencoded fields (ENCODING set to none).
str1 || str2 [ || str3 ... ]
Returns the string that results from concatenating the strings specified. Numeric, date, timestamp, and time types are implicitly casted to strings as necessary, so explicit casts of non-string types to string types are not required for inputs to the concatenation operator.
Note that concatenating a variable string with a string literal (for example, county_name || ' County') is significantly more performant than concatenating two or more variable strings (for example, county_name || ', ' || state_name). For multi-variable string concatenation, it is therefore recommended to use an UPDATE statement to materialize the concatenated output, rather than computing it inline, when such operations are expected to be routinely repeated.
ENCODE_TEXT(none_encoded_str)
Converts a none-encoded string to a transient dictionary-encoded string to allow for operations like group-by on top. When the watchdog is enabled, the number of strings that can be casted using this operator is capped by the value set with the watchdog-none-encoded-string-translation-limit flag (1,000,000 by default).
HASH(str)
Deterministically hashes a string input to a BIGINT output using a pseudo-random function. Can be useful for bucketing string values or deterministically coloring by string values for a high-cardinality TEXT column. Note that currently HASH only accepts TEXT inputs, but in the future it may also accept other data types. NULL values always hash to NULL outputs.
INITCAP(str)
Returns the string with initial caps after any of the defined delimiter characters, with the remainder of the characters lowercased. Valid delimiter characters are: ! ? @ " ^ # $ & ~ _ , . : ; + - * % / | \ [ ] ( ) { } < >
JAROWINKLER_SIMILARITY(str1, str2)
Computes the Jaro-Winkler similarity score between two input strings. The output is an integer between 0 and 100, with 0 representing completely dissimilar strings and 100 representing exactly matching strings.
JSON_VALUE(json_str, path)
Returns the string of the field given by path in json_str. Paths start with the $ character, with sub-fields split by . and array members indexed by [], with array indices starting at 0. For example, JSON_VALUE('{"name": "Brenda", "scores": [89, 98, 94]}', '$.scores[1]') would yield a TEXT return field of '98'.
Note that currently LAX parsing mode (any unmatched path returns null rather than an error) is the default, and STRICT parsing mode is not supported.
KEY_FOR_STRING(str)
Returns the dictionary key of a dictionary-encoded string column.
LCASE(str)
Returns the string in all lower case. Only the ASCII character set is currently supported. Same as LOWER.
LEFT(str, num)
Returns the left-most number (num) of characters in the string (str).
LENGTH(str)
Returns the length of a string in bytes. Only works with unencoded fields (ENCODING set to none).
LEVENSHTEIN_DISTANCE(str1, str2)
Computes the edit distance, or the number of single-character insertions, deletions, or substitutions that must be made to make the first string equal the second. Returns an integer greater than or equal to 0, with 0 meaning the strings are equal; the higher the return value, the more dissimilar the two strings.
LOWER(str)
Returns the string in all lower case. Only the ASCII character set is currently supported. Same as LCASE.
LPAD(str, len, [lpad_str])
Left-pads the string with the string defined in lpad_str to a total length of len. If the optional lpad_str is not specified, the space character is used to pad.
If the length of str is greater than len, then characters from the end of str are truncated to the length of len.
Characters are added from lpad_str successively until the target length len is met. If lpad_str concatenated with str is not long enough to equal the target len, lpad_str is repeated, partially if necessary, until the target length is met.
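For example, the following sketch returns 'xyxyabc': lpad_str is repeated, partially if necessary, until the target length of 7 is met:
SELECT LPAD('abc', 7, 'xy');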
LTRIM(str, chars)
Removes any leading characters specified in chars from the string. Alias for TRIM.
OVERLAY(str PLACING replacement_str FROM start [FOR len])
Replaces in str the number of characters defined in len with characters defined in replacement_str at the location start.
Regardless of the length of replacement_str, len characters are removed from str, unless start + replacement_str is greater than the length of str, in which case all characters from start to the end of str are replaced.
If start is negative, it specifies the number of characters from the end of str.
POSITION(search_str IN str [FROM start_position])
Returns the position of the first character of search_str if it is found in str, optionally starting the search at start_position.
If search_str is not found, 0 is returned. If search_str or str is null, null is returned.
REGEXP_COUNT(str, pattern [, position [, flags]])
Returns the number of times that the provided pattern occurs in the search string str.
position specifies the starting position in str at which the search for pattern starts (all matches before position are ignored). If position is negative, the search starts that many characters from the end of the string str.
Use the following optional flags to control the matching behavior:
c - Case-sensitive matching.
i - Case-insensitive matching.
REGEXP_REPLACE(str, pattern [, new_str, position, occurrence [, flags]])
Replaces one or all matches of a substring in string str that matches pattern, which is a regular expression in POSIX regex syntax.
new_str (optional) is the string that replaces the string matching the pattern. If new_str is empty or not supplied, all found matches are removed.
The occurrence integer argument (optional) specifies the single match occurrence of the pattern to replace, starting from the beginning of str; 0 (replace all) is the default. Use a negative occurrence argument to signify the nth-to-last occurrence to be replaced.
Use a positive position argument to indicate the number of characters from the beginning of str, and a negative position argument to indicate the number of characters from the end of str.
Back-references/capture groups can be used to capture and replace specific sub-expressions.
Use the following optional flags to control the matching behavior:
c - Case-sensitive matching.
i - Case-insensitive matching.
If no flag is specified, REGEXP_REPLACE defaults to case-sensitive search.
REGEXP_SUBSTR(str, pattern [, position, occurrence, flags, group_num])
Searches string str for pattern, which is a regular expression in POSIX syntax, and returns the matching substring.
Use position to set the character position at which to begin searching, and occurrence to specify the occurrence of the pattern to match. Use a positive position argument to indicate the number of characters from the beginning of str, and a negative position argument to indicate the number of characters from the end of str.
The occurrence integer argument (optional) specifies the single match occurrence of the pattern to return, with 0 mapped to the first (1) occurrence. Use a negative occurrence argument to signify that the nth-to-last occurrence is returned.
Use optional flags to control the matching behavior:
c - Case-sensitive matching.
e - Extract submatches.
i - Case-insensitive matching.
The c and i flags cannot be used together; e can be used with either. If neither c nor i is specified, or if pattern is not provided, REGEXP_SUBSTR defaults to case-sensitive search.
If the e flag is used, REGEXP_SUBSTR returns the capture group group_num of pattern matched in str. If the e flag is used but no capture groups are provided in pattern, REGEXP_SUBSTR returns the entire matching pattern, regardless of group_num. If the e flag is used but no group_num is provided, a value of 1 for group_num is assumed, so the first capture group is returned.
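For example, a sketch consistent with the flag semantics above; with the e flag and group_num = 2, the second capture group (the domain portion) is returned (the input string is illustrative):
SELECT REGEXP_SUBSTR('Contact: jane@heavy.ai', '([A-Za-z]+)@([A-Za-z.]+)', 1, 1, 'ce', 2);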
REPEAT(str, num)
Repeats the string the number of times defined in num.
REPLACE(str, from_str, new_str)
Replaces all occurrences of substring from_str within a string with a new substring new_str.
REVERSE(str)
Reverses the string.
RIGHT(str, num)
Returns the right-most number (num) of characters in the string (str).
RPAD(str, len, [rpad_str])
Right-pads the string with the string defined in rpad_str to a total length of len. If the optional rpad_str is not specified, the space character is used to pad.
If the length of str is greater than len, then characters from the beginning of str are truncated to the length of len.
Characters are added from rpad_str successively until the target length len is met. If rpad_str concatenated with str is not long enough to equal the target len, rpad_str is repeated, partially if necessary, until the target length is met.
RTRIM(str)
Removes any trailing spaces from the string.
SPLIT_PART(str, delim, field_num)
Splits the string based on a delimiter delim and returns the field identified by field_num. Fields are numbered from left to right.
STRTOK_TO_ARRAY(str, [delim])
Tokenizes the string str using the optional delimiter(s) delim and returns an array of tokens. An empty array is returned if no tokens are produced in tokenization. NULL is returned if either parameter is NULL.
SUBSTR(str, start, [len])
Alias for SUBSTRING.
SUBSTRING(str FROM start [FOR len])
Returns a substring of str starting at index start for len characters.
The start position is 1-based (that is, the first character of str is at index 1, not 0); however, start 0 aliases to start 1. If start is negative, it is considered to be |start| characters from the end of the string.
If len is not specified, the substring from start to the end of str is returned. If start + len is greater than the length of str, the characters in str from start to the end of the string are returned.
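For example, both of the following return 'EAVY', illustrating the 1-based start index:
SELECT SUBSTRING('HEAVY.AI' FROM 2 FOR 4);
SELECT SUBSTR('HEAVY.AI', 2, 4);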
TRIM([BOTH | LEADING | TRAILING] [trim_str FROM str])
Removes characters defined in trim_str from the beginning, end, or both ends of str. If trim_str is not specified, the space character is the default. If the trim location is not specified, defined characters are trimmed from both the beginning and end of str.
TRY_CAST(str AS type)
Attempts to cast/convert a string type to any valid numeric, timestamp, date, or time type. If the conversion cannot be performed, null is returned. Note that TRY_CAST is not valid for non-string input types.
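For example, the first of the following sketches returns 123 as an integer, while the second returns null because the conversion cannot be performed:
SELECT TRY_CAST('123' AS INTEGER);
SELECT TRY_CAST('12x' AS INTEGER);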
UCASE(str)
Returns the string in uppercase format. Only the ASCII character set is currently supported. Same as UPPER.
UPPER(str)
Returns the string in uppercase format. Only the ASCII character set is currently supported. Same as UCASE.
URL_DECODE(str)
Decodes a URL-encoded string. This is the inverse of the URL_ENCODE function.
URL_ENCODE(str)
URL-encodes a string. Alphanumeric characters and the 4 characters _-.~ are untranslated. The space character is translated to +. All other characters are translated into a 3-character sequence %XX, where XX is the 2-digit hexadecimal ASCII value of the character.
| Name | Example | Description |
|---|---|---|
| str LIKE pattern | 'ab' LIKE 'ab' | Returns true if the string matches the pattern (case-sensitive). |
| str NOT LIKE pattern | 'ab' NOT LIKE 'cd' | Returns true if the string does not match the pattern. |
| str ILIKE pattern | 'AB' ILIKE 'ab' | Returns true if the string matches the pattern (case-insensitive). Supported only when the right side is a string literal; for example, colors.name ILIKE 'b%'. |
| str REGEXP POSIX pattern | '^[a-z]+r$' | Lowercase string ending with r. |
| REGEXP_LIKE(str, POSIX pattern) | '^[hc]at' | cat or hat. |
CURRENT_DATE
CURRENT_DATE()
Returns the current date in the GMT time zone.
Example:
SELECT CURRENT_DATE();
CURRENT_TIME
CURRENT_TIME()
Returns the current time of day in the GMT time zone.
Example:
SELECT CURRENT_TIME();
CURRENT_TIMESTAMP
CURRENT_TIMESTAMP()
Return the current timestamp in the GMT time zone. Same as NOW()
.
Example:
SELECT CURRENT_TIMESTAMP();
DATEADD('date_part', interval, date | timestamp)
Returns a date after a specified time/date interval has been added.
Example:
SELECT DATEADD('MINUTE', 6000, dep_timestamp) Arrival_Estimate FROM flights_2008_10k LIMIT 10;
DATEDIFF('date_part', date, date)
Returns the difference between two dates, calculated to the lowest level of the date_part you specify. For example, if you set the date_part as DAY, only the year, month, and day are used to calculate the result. Other fields, such as hour and minute, are ignored.
Example:
SELECT DATEDIFF('YEAR', plane_issue_date, now()) Years_In_Service FROM flights_2008_10k LIMIT 10;
DATEPART('interval', date | timestamp)
Returns a specified part of a given date or timestamp as an integer value. Note that 'interval' must be enclosed in single quotes.
Example:
SELECT DATEPART('YEAR', plane_issue_date) Year_Issued FROM flights_2008_10k LIMIT 10;
DATE_TRUNC(date_part, timestamp)
Truncates the timestamp to the specified date_part. DATE_TRUNC(week,...)
starts on Monday (ISO), which is different than EXTRACT(dow,...)
, which starts on Sunday.
Example:
SELECT DATE_TRUNC(MINUTE, arr_timestamp) Arrival FROM flights_2008_10k LIMIT 10;
EXTRACT(date_part FROM timestamp)
Returns the specified date_part from timestamp.
Example:
SELECT EXTRACT(HOUR FROM arr_timestamp) Arrival_Hour FROM flights_2008_10k LIMIT 10;
INTERVAL 'count' date_part
Adds or subtracts count date_part units from a timestamp. Note that 'count' is enclosed in single quotes.
Example:
SELECT arr_timestamp + INTERVAL '10' YEAR FROM flights_2008_10k LIMIT 10;
NOW()
Return the current timestamp in the GMT time zone. Same as CURRENT_TIMESTAMP().
Example:
NOW();
TIMESTAMPADD(date_part, count, timestamp | date)
Adds an interval of count date_part to timestamp or date and returns signed date_part units in the provided timestamp or date form.
Example:
SELECT TIMESTAMPADD(DAY, 14, arr_timestamp) Fortnight FROM flights_2008_10k LIMIT 10;
TIMESTAMPDIFF(date_part, timestamp1, timestamp2)
Subtracts timestamp1 from timestamp2 and returns the result in signed date_part units.
Example:
SELECT TIMESTAMPDIFF(MINUTE, arr_timestamp, dep_timestamp) Flight_Time FROM flights_2008_10k LIMIT 10;
| Datatype | Formats | Examples |
|---|---|---|
| DATE | YYYY-MM-DD | 2013-10-31 |
| DATE | MM/DD/YYYY | 10/31/2013 |
| DATE | DD-MON-YY | 31-Oct-13 |
| DATE | DD/Mon/YYYY | 31/Oct/2013 |
| DATE | EPOCH | 1383262225 |
| TIME | HH:MM | 23:49 |
| TIME | HHMMSS | 234901 |
| TIME | HH:MM:SS | 23:49:01 |
| TIMESTAMP | DATE TIME | 31-Oct-13 23:49:01 |
| TIMESTAMP | DATETTIME | 31-Oct-13T23:49:01 |
| TIMESTAMP | DATE:TIME | 11/31/2013:234901 |
| TIMESTAMP | DATE TIME ZONE | 31-Oct-13 11:30:25 -0800 |
| TIMESTAMP | DATE HH.MM.SS PM | 31-Oct-13 11.30.25pm |
| TIMESTAMP | DATE HH:MM:SS PM | 31-Oct-13 11:30:25pm |
| TIMESTAMP | EPOCH | 1383262225 |
| Double-precision FP Function | Single-precision FP Function | Description |
|---|---|---|
| AVG(x) | | Returns the average value of x |
| COUNT() | | Returns the count of the number of rows returned |
| COUNT(DISTINCT x) | | Returns the count of distinct values of x |
| APPROX_COUNT_DISTINCT(x, e) | | Returns the approximate count of distinct values of x with defined expected error rate e, where e is an integer from 1 to 100. If no value is set for e, the approximate count is calculated using the system-wide hll-precision-bits configuration parameter. |
| APPROX_MEDIAN(x) | | Returns the approximate median of x. Two server configuration parameters affect memory usage: approx_quantile_centroids and approx_quantile_buffer. Accuracy of APPROX_MEDIAN depends on the distribution of data; see Usage Notes. |
| APPROX_PERCENTILE(x, y) | | Returns the approximate quantile of x, where y is a value between 0 and 1. For example, y=0 returns MIN(x), y=1 returns MAX(x), and y=0.5 returns APPROX_MEDIAN(x). |
| MAX(x) | | Returns the maximum value of x |
| MIN(x) | | Returns the minimum value of x |
| SINGLE_VALUE | | Returns the input value if there is only one distinct value in the input; otherwise, the query fails. |
| SUM(x) | | Returns the sum of the values of x |
| SAMPLE(x) | | Returns one sample value from aggregated column x; for example, population grouped by city, along with one value from the state column for each group. Note: This was previously LAST_SAMPLE, which is now deprecated. |
| CORRELATION(x, y) | CORRELATION_FLOAT(x, y) | Alias of CORR. Returns the coefficient of correlation of a set of number pairs. |
| CORR(x, y) | CORR_FLOAT(x, y) | Returns the coefficient of correlation of a set of number pairs. |
| COUNT_IF(conditional_expr) | | Returns the number of rows satisfying the given conditional_expr. |
| COVAR_POP(x, y) | COVAR_POP_FLOAT(x, y) | Returns the population covariance of a set of number pairs. |
| COVAR_SAMP(x, y) | COVAR_SAMP_FLOAT(x, y) | Returns the sample covariance of a set of number pairs. |
| STDDEV(x) | STDDEV_FLOAT(x) | Alias of STDDEV_SAMP. Returns the sample standard deviation of the value. |
| STDDEV_POP(x) | STDDEV_POP_FLOAT(x) | Returns the population standard deviation of the value. |
| STDDEV_SAMP(x) | STDDEV_SAMP_FLOAT(x) | Returns the sample standard deviation of the value. |
| SUM_IF(conditional_expr) | | Returns the sum of all expression values satisfying the given conditional_expr. |
| VARIANCE(x) | VARIANCE_FLOAT(x) | Alias of VAR_SAMP. Returns the sample variance of the value. |
| VAR_POP(x) | VAR_POP_FLOAT(x) | Returns the population variance of the value. |
| VAR_SAMP(x) | VAR_SAMP_FLOAT(x) | Returns the sample variance of the value. |
SAMPLE_RATIO(x)
Returns a Boolean value, with the probability of True being returned for a row equal to the input argument, a numeric value between 0.0 and 1.0. Negative input values return False, input values greater than 1.0 return True, and null input values return False.
The result of the function is deterministic per row; that is, all calls of the operator for a given row return the same result. The sample ratio is probabilistic, but is generally within a thousandth of a percentile of the actual range when the underlying dataset is millions of records or larger.
The following example filters approximately 50% of the rows from t
and returns a count that is approximately half the number of rows in t
:
SELECT COUNT(*) FROM t WHERE SAMPLE_RATIO(0.5)
Input parameters:

| Parameter | Description | Data Type |
|---|---|---|
| primary_key | Column containing keys/entity IDs that can be used to uniquely identify the entities for which the function will compute the similarity to the search vector specified by the comparison_features cursor. Examples include countries, census block groups, user IDs of website visitors, and aircraft call signs. | Column<TEXT ENCODING DICT \| INT \| BIGINT> |
| pivot_features | One or more columns constituting a compound feature. For example, two columns of visit hour and census block group would compare entities specified by primary_key based on whether they visited the same census block group in the same hour. If a single census block group feature column is used, the primary_key entities are compared only by the census block groups visited, regardless of time overlap. | Column<TEXT ENCODING DICT \| INT \| BIGINT> |
| metric | Column denoting the values used as input for the cosine similarity metric computation. In many cases, this is simply COUNT(*) such that feature overlaps are weighted by the number of co-occurrences. | Column<INT \| BIGINT \| FLOAT \| DOUBLE> |
| comparison_pivot_features | One or more columns constituting a compound feature for the search vector. This should match the pivot_features in number of sub-features, types, and semantics. | Column<TEXT ENCODING DICT \| INT \| BIGINT> |
| comparison_metric | Column denoting the values used as input for the cosine similarity metric computation from the search vector. In many cases, this is simply COUNT(*) such that feature overlaps are weighted by the number of co-occurrences. | Column<TEXT ENCODING DICT \| INT \| BIGINT> |
| use_tf_idf | Boolean constant denoting whether TF-IDF weighting should be used in the cosine similarity score computation. | BOOLEAN |

Output:

| Name | Description | Data Type |
|---|---|---|
| class | ID of the primary_key being compared against the search vector. | Column<TEXT ENCODING DICT \| INT \| BIGINT> (same as the primary_key input column type) |
| similarity_score | Computed cosine similarity score between each primary_key pair, with values falling between 0 (completely dissimilar) and 1 (completely similar). | Column<FLOAT> |
Input parameters:

| Parameter | Description | Data Type |
|---|---|---|
| node1 | Origin node column in the directed edge list CURSOR. | Column<INT \| BIGINT \| TEXT ENCODING DICT> |
| node2 | Destination node column in the directed edge list CURSOR. | Column<INT \| BIGINT \| TEXT ENCODING DICT> (must be the same type as node1) |
| distance | Distance between the origin and destination node in the directed edge list CURSOR. | Column<INT \| BIGINT \| FLOAT \| DOUBLE> |
| origin_node | The origin node at which graph traversal starts. If the value is not present in edge_list.node1, an empty result set is returned. | BIGINT \| TEXT ENCODING DICT |
| destination_node | The destination node at which graph traversal finishes. If the value is not present in edge_list.node1, an empty result set is returned. | BIGINT \| TEXT ENCODING DICT |

Output:

| Name | Description | Data Type |
|---|---|---|
| path_step | The index of this node along the path traversal from origin_node to destination_node, with the first node (the origin_node) indexed as 1. | Column<INT> |
| node | The current node along the path traversal from origin_node to destination_node. The first node (as denoted by path_step = 1) is always the input origin_node, and the final node (as denoted by MAX(path_step)) is always the input destination_node. | Column<INT \| BIGINT \| TEXT ENCODING DICT> (same type as the node1 and node2 input columns) |
| cume_distance | The cumulative distance of all input distance values from the origin_node to the current node. | Column<INT \| BIGINT \| FLOAT \| DOUBLE> (same type as the distance input column) |
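A hedged invocation sketch; the function name tf_graph_shortest_path and the edges table are assumptions, while the edge_list cursor name and column names follow the parameter table above:
SELECT * FROM TABLE(
  tf_graph_shortest_path(
    edge_list => CURSOR(SELECT node1, node2, distance FROM edges),
    origin_node => 'JFK',
    destination_node => 'LAX'
  )
)
ORDER BY path_step;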
Input parameters:

| Parameter | Description | Data Type |
|---|---|---|
| x | X-coordinate column or expression. | Column<FLOAT \| DOUBLE> |
| y | Y-coordinate column or expression. | Column<FLOAT \| DOUBLE> |
| z | Z-coordinate column or expression, aggregated per bin according to agg_type. | Column<FLOAT \| DOUBLE> |
| agg_type | The aggregate to be performed to compute the output z-column. Should be one of 'AVG', 'COUNT', 'SUM', 'MIN', or 'MAX'. | TEXT ENCODING NONE |
| fill_agg_type (optional) | The aggregate to be performed when computing the blur pass on the output bins. Should be one of 'AVG', 'COUNT', 'SUM', 'MIN', 'MAX', 'GAUSS_AVG', or 'BOX_AVG'. Note that 'AVG' is synonymous with 'GAUSS_AVG' in this context; the default fill_agg_type, if not specified, is 'GAUSS_AVG'. | TEXT ENCODING NONE |
| bin_dim_meters | The width and height of each x/y bin in meters. If geographic_coords is not set to true, the input x/y units are assumed to already be in meters. | DOUBLE |
| geographic_coords | If true, specifies that the input x/y coordinates are in lon/lat degrees. The function then computes a mapping of degrees to meters based on the center coordinate between x_min/x_max and y_min/y_max. | BOOLEAN |
| neighborhood_fill_radius | The radius in bins over which to compute the box blur/filter, such that each output bin is the average value of all bins within neighborhood_fill_radius bins. | DOUBLE |
| fill_only_nulls | Specifies that the box blur should be used only to provide output values for null output bins (that is, bins that contained no data points or only data points with null z-values). | BOOLEAN |
| x_min (optional) | Min x-coordinate value (in input units) for the spatial output grid. | DOUBLE |
| x_max (optional) | Max x-coordinate value (in input units) for the spatial output grid. | DOUBLE |
| y_min (optional) | Min y-coordinate value (in input units) for the spatial output grid. | DOUBLE |
| y_max (optional) | Max y-coordinate value (in input units) for the spatial output grid. | DOUBLE |

Output:

| Name | Description | Data Type |
|---|---|---|
| x | The x-coordinates for the centroids of the output spatial bins. | Column<FLOAT \| DOUBLE> (same as input x-coordinate column/expression) |
| y | The y-coordinates for the centroids of the output spatial bins. | Column<FLOAT \| DOUBLE> (same as input y-coordinate column/expression) |
| z | The aggregated z-value (per agg_type) of all input data assigned to a given spatial bin. | Column<FLOAT \| DOUBLE> (same as input z-coordinate column/expression) |
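A hedged invocation sketch for tf_geo_rasterize; the raster cursor argument name and the points table are assumptions, while the remaining argument names follow the parameter table above:
SELECT * FROM TABLE(
  tf_geo_rasterize(
    raster => CURSOR(SELECT lon, lat, elevation FROM points),
    agg_type => 'MAX',
    bin_dim_meters => 100.0,
    geographic_coords => TRUE,
    neighborhood_fill_radius => 2,
    fill_only_nulls => FALSE
  )
);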
Input parameters:

| Parameter | Description | Data Type |
|---|---|---|
| x | Input x-coordinate column or expression. | Column<FLOAT \| DOUBLE> |
| y | Input y-coordinate column or expression. | Column<FLOAT \| DOUBLE> |
| z | Input z-coordinate column or expression, aggregated per bin according to agg_type. | Column<FLOAT \| DOUBLE> |
| agg_type | The aggregate to be performed to compute the output z-column. Should be one of 'AVG', 'COUNT', 'SUM', 'MIN', or 'MAX'. | TEXT ENCODING NONE |
| bin_dim_meters | The width and height of each x/y bin in meters. If geographic_coords is not set to true, the input x/y units are assumed to already be in meters. | DOUBLE |
| geographic_coords | If true, specifies that the input x/y coordinates are in lon/lat degrees. The function then computes a mapping of degrees to meters based on the center coordinate between x_min/x_max and y_min/y_max. | BOOLEAN |
| neighborhood_fill_radius | The radius in bins over which to compute the box blur/filter, such that each output bin is the average value of all bins within neighborhood_fill_radius bins. | BIGINT |
| fill_only_nulls | Specifies that the box blur should be used only to provide output values for null output bins (that is, bins that contained no data points or only data points with null z-values). | BOOLEAN |
| compute_slope_in_degrees | If true, the slope is computed in degrees (with 0 degrees perfectly flat and 90 degrees perfectly vertical). If false, the slope is computed as a fraction from 0 (flat) to 1 (vertical). In a future release, the default output is planned to move to percentage slope. | BOOLEAN |

Output:

| Name | Description | Data Type |
|---|---|---|
| x | The x-coordinates for the centroids of the output spatial bins. | Column<FLOAT \| DOUBLE> (same as input x column/expression) |
| y | The y-coordinates for the centroids of the output spatial bins. | Column<FLOAT \| DOUBLE> (same as input y column/expression) |
| z | The aggregated z-value (per agg_type) of all input data assigned to a given spatial bin. | Column<FLOAT \| DOUBLE> (same as input z column/expression) |
| slope | The average slope of an output grid cell (in degrees or as a fraction between 0 and 1, depending on the argument to compute_slope_in_degrees). | Column<FLOAT \| DOUBLE> (same as input z column/expression) |
| aspect | The direction from 0 to 360 degrees pointing towards the maximum downhill gradient, with 0 degrees being due north and moving clockwise: N (0°) -> NE (45°) -> E (90°) -> SE (135°) -> S (180°) -> SW (225°) -> W (270°) -> NW (315°). | Column<FLOAT \| DOUBLE> (same as input z column/expression) |
Input parameters:

| Parameter | Description | Data Type |
|---|---|---|
| node1 | Origin node column in the directed edge list CURSOR. | Column<INT \| BIGINT \| TEXT ENCODING DICT> |
| node2 | Destination node column in the directed edge list CURSOR. | Column<INT \| BIGINT \| TEXT ENCODING DICT> (must be the same type as node1) |
| distance | Distance between the origin and destination node in the directed edge list CURSOR. | Column<INT \| BIGINT \| FLOAT \| DOUBLE> |
| origin_node | The origin node at which graph traversal starts. If the value is not present in edge_list.node1, an empty result set is returned. | BIGINT \| TEXT ENCODING DICT |

Output:

| Name | Description | Data Type |
|---|---|---|
| origin_node | Starting node in the graph traversal. Always equal to the input origin_node. | Column<INT \| BIGINT \| TEXT ENCODING DICT> (same type as the node1 and node2 input columns) |
| destination_node | Final node in the graph traversal. Equal to one of the values of the node2 input column. | Column<INT \| BIGINT \| TEXT ENCODING DICT> (same type as the node1 and node2 input columns) |
| distance | Cumulative distance between the origin and destination node for the shortest path graph traversal. | Column<INT \| BIGINT \| FLOAT \| DOUBLE> (same type as the distance input column) |
| num_edges_traversed | Number of edges (or "hops") traversed in the graph to arrive at destination_node from origin_node for the shortest path graph traversal between these two nodes. | Column<INT> |
| Parameter | Description | Data Type |
|---|---|---|
| x | Input x-coordinate column or expression of the data to be rasterized. | Column<FLOAT \| DOUBLE> |
| y | Input y-coordinate column or expression of the data to be rasterized. | Column<FLOAT \| DOUBLE> (must be the same type as x) |
| z | Input z-coordinate column or expression of the data to be rasterized. | Column<FLOAT \| DOUBLE> |
| agg_type | The aggregate to be performed to compute the output z-column. Should be one of 'AVG', 'COUNT', 'SUM', 'MIN', or 'MAX'. | TEXT ENCODING NONE |
| bin_dim_meters | The width and height of each x/y bin in meters. If geographic_coords is not set to true, the input x/y units are assumed to already be in meters. | DOUBLE |
| geographic_coords | If true, specifies that the input x/y coordinates are in lon/lat degrees. The function will then compute a mapping of degrees to meters based on the center coordinate between x_min/x_max and y_min/y_max. | BOOLEAN |
| neighborhood_fill_radius | The radius in bins to compute the gaussian blur/filter over, such that each output bin will be the average value of all bins within neighborhood_fill_radius bins. | BIGINT |
| fill_only_nulls | Specifies that the gaussian blur should only be used to provide output values for null output bins (that is, bins that contained no data points or had only data points with null z-values). | BOOLEAN |
| | The x-coordinate for the starting point for the graph traversal, in input (not bin) units. | DOUBLE |
| | The y-coordinate for the starting point for the graph traversal, in input (not bin) units. | DOUBLE |
| | The x-coordinate for the destination point for the graph traversal, in input (not bin) units. | DOUBLE |
| | The y-coordinate for the destination point for the graph traversal, in input (not bin) units. | DOUBLE |
| | The exponent by which the slope weight between neighboring raster cells is weighted. | DOUBLE |
| | The max absolute value of slopes (measured in percentages) between neighboring raster cells that will be considered for traversal. A neighboring grid cell with an absolute slope greater than this amount is not considered in the shortest slope-weighted path graph traversal. | DOUBLE |
Function | Arguments and Return |
---|---|
| Converts a distance in meters in a longitudinal direction from a latitude/longitude coordinate to a pixel size, using Mercator projection. Returns: Floating-point value in pixel units; can be used for the width of a symbol or a point in Vega. |
| Converts a distance in meters in a latitudinal direction from a latitude/longitude coordinate to a pixel size, using Mercator projection. Returns: Floating-point value in pixel units; can be used for the height of a symbol or a point in Vega. |
| Converts a distance in meters in a longitudinal direction from a latitude/longitude POINT to a pixel size. Supports only Mercator-projected points. Returns: Floating-point value in pixel units; can be used for the width of a symbol or a point in Vega. |
| Converts a distance in meters in a latitudinal direction from an EPSG:4326 POINT to a pixel size. Currently supports only Mercator-projected points. Returns: Floating-point value in pixel units; can be used for the height of a symbol or a point in Vega. |
| Returns true if a latitude/longitude coordinate is within a Mercator-projected view defined by the supplied view bounds. Returns: True if the point is within the view; false otherwise. |
| Returns true if a latitude/longitude coordinate, offset by a distance in meters, is within a Mercator-projected view defined by the supplied view bounds. Returns: True if the offset point is within the view; false otherwise. |
| Returns true if a latitude/longitude POINT defined in EPSG:4326 is within a Mercator-projected view defined by the supplied view bounds. Returns: True if the point is within the view; false otherwise. |
| Returns true if a latitude/longitude POINT defined in EPSG:4326, offset by a distance in meters, is within a Mercator-projected view defined by the supplied view bounds. Returns: True if the offset point is within the view; false otherwise. |
| Output geometries. | Column<LINESTRING | POLYGON> |
| Raster values associated with each contour geometry. | Column<FLOAT | DOUBLE> (will be the same type as value) |
| The path of the file or directory containing the las/laz file or files. Can contain globs. Path must be in | TEXT ENCODING NONE |
| EPSG code of the output SRID. If not specified, output points are automatically converted to lon/lat (EPSG 4326). | TEXT ENCODING NONE |
| If true, use internal point cloud cache. Useful for inline querying of the output of | BOOLEAN |
| Min x-coordinate value (in degrees) for the output data. | DOUBLE |
| Max x-coordinate value (in degrees) for the output data. | DOUBLE |
| Min y-coordinate value (in degrees) for the output data. | DOUBLE |
| Max y-coordinate value (in degrees) for the output data. | DOUBLE |
| The path of the file or directory containing the las/laz file or files. Can contain globs. Path must be in | TEXT ENCODING NONE |
| Min x-coordinate value for point cloud files to retrieve metadata from. | DOUBLE |
| Max x-coordinate value for point cloud files to retrieve metadata from. | DOUBLE |
| Min y-coordinate value for point cloud files to retrieve metadata from. | DOUBLE |
| Max y-coordinate value for point cloud files to retrieve metadata from. | DOUBLE |
| Full path for the las or laz file | Column<TEXT ENCODING DICT> |
| Filename for the las or laz file | Column<TEXT ENCODING DICT> |
| File source id per file metadata | Column<SMALLINT> |
| LAS version major number | Column<SMALLINT> |
| LAS version minor number | Column<SMALLINT> |
| Data creation year | Column<SMALLINT> |
| Whether data is compressed, i.e. LAZ format | Column<BOOLEAN> |
| Number of points in this file | Column<BIGINT> |
| Number of data dimensions for this file | Column<SMALLINT> |
| Not currently used | Column<SMALLINT> |
| Whether data has a time value | Column<BOOLEAN> |
| Whether data contains an RGB color value | Column<BOOLEAN> |
| Whether data contains wave info | Column<BOOLEAN> |
| Whether data contains an infrared value | Column<BOOLEAN> |
| Whether data adheres to the 14-attribute standard | Column<BOOLEAN> |
| UTM zone of data | Column<INT> |
| Minimum x-coordinate in source projection | Column<DOUBLE> |
| Maximum x-coordinate in source projection | Column<DOUBLE> |
| Minimum y-coordinate in source projection | Column<DOUBLE> |
| Maximum y-coordinate in source projection | Column<DOUBLE> |
| Minimum z-coordinate in source projection | Column<DOUBLE> |
| Maximum z-coordinate in source projection | Column<DOUBLE> |
| Minimum x-coordinate in lon/lat degrees | Column<DOUBLE> |
| Maximum x-coordinate in lon/lat degrees | Column<DOUBLE> |
| Minimum y-coordinate in lon/lat degrees | Column<DOUBLE> |
| Maximum y-coordinate in lon/lat degrees | Column<DOUBLE> |
| Minimum z-coordinate in meters above sea level (AMSL) | Column<DOUBLE> |
| Maximum z-coordinate in meters above sea level (AMSL) | Column<DOUBLE> |
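A usage sketch for the metadata function, assuming it is exposed as tf_point_cloud_metadata; the path, bounding box, and output column names shown are all assumptions, and the path must be readable by the server:

```sql
-- Sketch (assumed names): per-file metadata for LAZ tiles intersecting
-- a lon/lat window. '/lidar/*.laz' and the bounds are hypothetical.
SELECT file_path, num_points, is_compressed,
       x_min_4326, x_max_4326, y_min_4326, y_max_4326
FROM TABLE(
  tf_point_cloud_metadata(
    path => '/lidar/*.laz',
    x_min => -122.6, x_max => -122.3,
    y_min => 37.6, y_max => 37.9
  )
)
ORDER BY num_points DESC;
```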
| Point x-coordinate | Column<DOUBLE> |
| Point y-coordinate | Column<DOUBLE> |
| Point z-coordinate | Column<DOUBLE> |
| Point intensity | Column<INT> |
| The ordered number of the return for a given LiDAR pulse. The first returns (lowest return numbers) are generally associated with the highest-elevation points for a LiDAR pulse; for example, the forest canopy generally has a lower return number than the ground beneath it. | Column<TINYINT> |
| The total number of returns for a LiDAR pulse. Multiple returns occur when there are multiple objects between the LiDAR source and the lowest ground or water elevation for a location. | Column<TINYINT> |
| Column<TINYINT> |
| Column<TINYINT> |
| Column<SMALLINT> |
| Column<TINYINT> |
Function | Description |
---|---|
| Replaces each null value with the nearest non-null value of the value column, found by searching backward. At least one ordering column must be defined in the window clause. NULLS FIRST ordering of the input value is added automatically to any user-defined ordering. |
| For each partition, a zero-initialized counter is incremented every time the result of expr changes as the expression is evaluated over the partition. Requires an ORDER BY clause for the window. |
| Aggregate function that can be used as a window function for both a nonframed window partition and a window frame. Returns the number of rows satisfying the given condition. |
| Cumulative distribution value of the current row: (number of rows preceding or peers of the current row) / (total rows). Window framing is ignored. |
| Rank of the current row without gaps. This function counts peer groups. Window framing is ignored. |
| Returns the value from the first row of the window frame (the rows from the start of the partition to the last peer of the current row). |
| Replaces each null value with the nearest non-null value of the value column, found by searching forward. NULLS FIRST ordering of the input value is added automatically to any user-defined ordering. |
| Returns the value from the last row of the window frame. |
| Returns the value of the given expression evaluated at a specified row of the window frame. |
| Subdivides the partition into the requested number of buckets. If the total number of rows is divisible by num_buckets, each bucket has an equal number of rows; otherwise, the function returns groups of two sizes with a difference of 1. Window framing is ignored. |
| Relative rank of the current row: (rank - 1) / (total rows - 1). Window framing is ignored. |
| Rank of the current row with gaps. Equal to the ROW_NUMBER of the row's first peer. |
| Number of the current row within the partition, counting from 1. Window framing is ignored. |
| Aggregate function that can be used as a window function for both a nonframed window partition and a window frame. Returns the sum of all expression values satisfying the given condition. |
Frame aggregation | |
Frame navigation | |
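To make the fill functions concrete, here is a hedged sketch using FORWARD_FILL on a sensor time series; the sensor_readings table and its columns are illustrative:

```sql
-- Sketch (assumed names): replace each null reading with the nearest
-- preceding non-null reading per sensor, ordered by timestamp.
SELECT
  sensor_id,
  ts,
  FORWARD_FILL(reading) OVER (PARTITION BY sensor_id ORDER BY ts) AS filled_reading
FROM sensor_readings;
```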
| Longitude value of raster point (degrees, SRID 4326). | Column<FLOAT | DOUBLE> |
| Latitude value of raster point (degrees, SRID 4326). | Column<FLOAT | DOUBLE> (must be the same as <lon>) |
| Raster band value from which to derive contours. | Column<FLOAT | DOUBLE> |
| Optionally flip resulting geometries in latitude (default FALSE). This parameter may be removed in a future release. | BOOLEAN |
| Desired contour interval. The function will generate a line at each interval, or a polygon region that covers that interval. | FLOAT/DOUBLE (must be same type as value) |
| Optional offset for resulting intervals. | FLOAT/DOUBLE (must be same type as value) |
| Pixel width (stride) of the raster data. | INTEGER |
| Pixel height of the raster data. | INTEGER |
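Putting the contour parameters together, here is a sketch of a contour-lines call; the tf_raster_contour_lines name and the named-argument form are assumptions to verify against your release:

```sql
-- Sketch (assumed names): 10-unit elevation contour lines from a
-- raster point table. dem_points, lon, lat, and elev are hypothetical.
SELECT *
FROM TABLE(
  tf_raster_contour_lines(
    raster => CURSOR(SELECT lon, lat, elev FROM dem_points),
    contour_interval => 10.0,
    contour_offset => 0.0,
    flip_latitude => FALSE
  )
);
```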
Given a query input of entity keys/IDs (for example, airplane tail numbers), a set of feature columns (for example, airports visited), and a metric column (for example, the number of times each airport was visited), scores each pair of entities based on their similarity. The score is computed as the cosine similarity of the feature column(s) between each entity pair, optionally weighted by TF-IDF.
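As background, the cosine similarity of two entities' feature vectors $a$ and $b$ (each built from the metric values over the shared feature space) is

$$ \text{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \, \sqrt{\sum_i b_i^2}}, $$

which is 1 when the vectors point in the same direction and 0 when the entities share no features. With TF-IDF weighting enabled, each feature value is scaled down the more entities share that feature, so rare features contribute more to the score.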
Parameter | Description | Data Type |
---|---|---|
Example
From the ASPRS LAS specification: "The scan direction flag denotes the direction at which the scanner mirror was traveling at the time of the output pulse. A bit value of 1 is a positive scan direction, and a bit value of 0 is a negative scan direction."
From the ASPRS LAS specification: "The edge of flight line data bit has a value of 1 only when the point is at the end of a scan. It is the last point on a given scan line before it changes direction."
From the ASPRS LAS specification: "The classification field is a number to signify a given classification during filter processing. The ASPRS standard has a public list of classifications which shall be used when mixing vendor specific user software."
From the ASPRS LAS specification: "The angle at which the laser point was output from the laser system, including the roll of the aircraft... The scan angle is an angle based on 0 degrees being NADIR, and –90 degrees to the left side of the aircraft in the direction of flight."
Returns the value at the row that is offset rows before the current row within the partition. LAG_IN_FRAME is the window-frame-aware version.
Returns the value at the row that is offset rows after the current row within the partition. LEAD_IN_FRAME is the window-frame-aware version.
These are window-frame-aware versions of the FIRST_VALUE, LAST_VALUE, LAG, LEAD, and NTH_VALUE functions.
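A hedged sketch of a frame-aware lookback, assuming LAG_IN_FRAME takes the same (value, offset) arguments as LAG and honors the frame clause; the sensor_readings table and its columns are illustrative:

```sql
-- Sketch (assumed names): compare each reading to the one 3 rows back,
-- constrained to a 5-row moving frame.
SELECT
  sensor_id,
  ts,
  reading,
  LAG_IN_FRAME(reading, 3) OVER (
    PARTITION BY sensor_id
    ORDER BY ts
    ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
  ) AS reading_3_back
FROM sensor_readings;
```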
Name | Description | Data Types |
---|---|---|
primary_key | Column containing the keys/entity IDs used to uniquely identify the entities for which the function computes co-similarity. Examples include countries, census block groups, user IDs of website visitors, and aircraft callsigns. | Column<TEXT ENCODING DICT | INT | BIGINT> |
pivot_features | One or more columns constituting a compound feature. For example, two columns of visit hour and census block group would compare entities specified by primary_key based on whether they visited the same census block group in the same hour. If a single census block group feature column is used, the primary_key entities are compared only by the census block groups visited, regardless of time overlap. | Column<TEXT ENCODING DICT | INT | BIGINT> |
metric | Column denoting the values used as input for the cosine similarity metric computation. In many cases, this is COUNT(*), such that feature overlaps are weighted by the number of co-occurrences. | Column<INT | BIGINT | FLOAT | DOUBLE> |
use_tf_idf | Boolean constant denoting whether TF-IDF weighting is used in the cosine similarity score computation. | BOOLEAN |
class1 | ID of the first primary_key in the pair-wise comparison. | Column<TEXT ENCODING DICT | INT | BIGINT> (same type as the primary_key input column) |
class2 | ID of the second primary_key in the pair-wise comparison. Because the computed similarity score for a pair of primary keys is order-invariant, results are output only for orderings such that class1 <= class2. For primary keys of type TEXT ENCODING DICT, the order is based on the internal integer IDs of each string value, not on lexicographic ordering. | Column<TEXT ENCODING DICT | INT | BIGINT> (same type as the primary_key input column) |
similarity_score | Computed cosine similarity score between each primary_key pair, with values between 0 (completely dissimilar) and 1 (completely similar). | Column<FLOAT> |
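A usage sketch, assuming the function is invoked as tf_feature_similarity and that the cursor argument is named primary_features (both assumptions); the flights table and its columns are illustrative:

```sql
-- Sketch (assumed names): score aircraft (by tail number) on how
-- similarly they use airports, weighting feature overlap by visit
-- counts with TF-IDF.
SELECT class1, class2, similarity_score
FROM TABLE(
  tf_feature_similarity(
    primary_features => CURSOR(SELECT tail_num, airport, COUNT(*)
                               FROM flights
                               GROUP BY tail_num, airport),
    use_tf_idf => TRUE
  )
)
ORDER BY similarity_score DESC
LIMIT 20;
```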