HeavyDB system tables provide a way to access information about database objects, database object permissions, and system resource (storage, CPU, and GPU memory) utilization. These system tables can be found in the information_schema database that is available by default on server startup. You can query system tables in the same way as regular tables, and you can use the SHOW CREATE TABLE command to view the table schemas.
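For example, assuming a connection to the information_schema database, the tables documented below can be queried directly (the selected columns are taken from the users table described next):
SELECT user_name, is_super_user, can_login FROM users;
SHOW CREATE TABLE users;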
The users system table provides information about all database users and contains the following columns:

| Column Name | Column Type | Description |
| --- | --- | --- |
| user_id | INTEGER | ID of database user. |
| user_name | TEXT | Username of database user. |
| is_super_user | BOOLEAN | Indicates whether or not the database user is a super user. |
| default_db_id | INTEGER | ID of user's default database on login. |
| default_db_name | TEXT | Name of user's default database on login. |
| can_login | BOOLEAN | Indicates whether or not the database user account is activated and can log in. |
The databases system table provides information about all created databases on the server and contains the following columns:

| Column Name | Column Type | Description |
| --- | --- | --- |
| database_id | INTEGER | ID of database. |
| database_name | TEXT | Name of database. |
| owner_id | INTEGER | User ID of database owner. |
| owner_user_name | TEXT | Username of database owner. |
The permissions system table provides information about all user/role permissions for all database objects and contains the following columns:

| Column Name | Column Type | Description |
| --- | --- | --- |
| role_name | TEXT | Username or role name associated with permission. |
| is_user_role | BOOLEAN | Indicates whether the role_name column identifies a user or a role. |
| database_id | INTEGER | ID of database that contains the database object on which permission was granted. |
| database_name | TEXT | Name of database that contains the database object on which permission was granted. |
| object_name | TEXT | Name of database object on which permission was granted. |
| object_id | INTEGER | ID of database object on which permission was granted. |
| object_owner_id | INTEGER | User ID of the owner of the database object on which permission was granted. |
| object_owner_user_name | TEXT | Username of the owner of the database object on which permission was granted. |
| object_permission_type | TEXT | Type of database object on which permission was granted. |
| object_permissions | TEXT[] | List of permissions that were granted on database object. |
The roles system table lists all created database roles and contains the following columns:

| Column Name | Column Type | Description |
| --- | --- | --- |
| role_name | TEXT | Role name. |
The tables system table provides information about all database tables and contains the following columns:

| Column Name | Column Type | Description |
| --- | --- | --- |
| database_id | INTEGER | ID of database that contains the table. |
| database_name | TEXT | Name of database that contains the table. |
| table_id | INTEGER | Table ID. |
| table_name | TEXT | Table name. |
| owner_id | INTEGER | User ID of table owner. |
| owner_user_name | TEXT | Username of table owner. |
| column_count | INTEGER | Number of table columns. Note that internal system columns are included in this count. |
| table_type | TEXT | Type of table. Possible values are DEFAULT, VIEW, TEMPORARY, and FOREIGN. |
| view_sql | TEXT | For views, SQL statement used in the view. |
| max_fragment_size | INTEGER | Number of rows per fragment used by the table. |
| max_chunk_size | BIGINT | Maximum size (in bytes) of table chunks. |
| fragment_page_size | INTEGER | Size (in bytes) of table data pages. |
| max_rows | BIGINT | Maximum number of rows allowed by table. |
| max_rollback_epochs | INTEGER | Maximum number of epochs a table can be rolled back to. |
| shard_count | INTEGER | Number of shards that exist for the table. |
| ddl_statement | TEXT | CREATE TABLE DDL statement for table. |
The dashboards system table provides information about created dashboards (Enterprise Edition only) and contains the following columns:

| Column Name | Column Type | Description |
| --- | --- | --- |
| database_id | INTEGER | ID of database that contains the dashboard. |
| database_name | TEXT | Name of database that contains the dashboard. |
| dashboard_id | INTEGER | Dashboard ID. |
| dashboard_name | TEXT | Dashboard name. |
| owner_id | INTEGER | User ID of dashboard owner. |
| owner_user_name | TEXT | Username of dashboard owner. |
| last_updated_at | TIMESTAMP | Timestamp of last dashboard update. |
| data_sources | TEXT[] | List of data sources/tables used by dashboard. |
The role_assignments system table provides information about database roles that have been assigned to users and contains the following columns:

| Column Name | Column Type | Description |
| --- | --- | --- |
| role_name | TEXT | Name of assigned role. |
| user_name | TEXT | Username of user that was assigned the role. |
The memory_summary system table provides high-level information about utilized memory across CPU and GPU devices and contains the following columns:

| Column Name | Column Type | Description |
| --- | --- | --- |
| node | TEXT | Node from which memory information is fetched. |
| device_id | INTEGER | Device ID. |
| device_type | TEXT | Type of device. Possible values are CPU and GPU. |
| max_page_count | BIGINT | Maximum number of memory pages that can be allocated on the device. |
| page_size | BIGINT | Size (in bytes) of a memory page on the device. |
| allocated_page_count | BIGINT | Number of allocated memory pages on the device. |
| used_page_count | BIGINT | Number of used allocated memory pages on the device. |
| free_page_count | BIGINT | Number of free allocated memory pages on the device. |
The memory_details system table provides detailed information about allocated memory segments across CPU and GPU devices and contains the following columns:

| Column Name | Column Type | Description |
| --- | --- | --- |
| node | TEXT | Node from which memory information is fetched. |
| database_id | INTEGER | ID of database that contains the table that memory was allocated for. |
| database_name | TEXT | Name of database that contains the table that memory was allocated for. |
| table_id | INTEGER | ID of table that memory was allocated for. |
| table_name | TEXT | Name of table that memory was allocated for. |
| column_id | INTEGER | ID of column that memory was allocated for. |
| column_name | TEXT | Name of column that memory was allocated for. |
| chunk_key | INTEGER[] | ID of cached table chunk. |
| device_id | INTEGER | Device ID. |
| device_type | TEXT | Type of device. Possible values are CPU and GPU. |
| memory_status | TEXT | Memory segment use status. Possible values are FREE and USED. |
| page_count | BIGINT | Number of pages in the segment. |
| page_size | BIGINT | Size (in bytes) of a memory page on the device. |
| slab_id | INTEGER | ID of slab containing memory segment. |
| start_page | BIGINT | Page number of the first memory page in the segment. |
| last_touched_epoch | BIGINT | Epoch at which the segment was last accessed. |
The storage_details system table provides detailed information about utilized storage per table and contains the following columns:

| Column Name | Column Type | Description |
| --- | --- | --- |
| node | TEXT | Node from which storage information is fetched. |
| database_id | INTEGER | ID of database that contains the table. |
| database_name | TEXT | Name of database that contains the table. |
| table_id | INTEGER | Table ID. |
| table_name | TEXT | Table name. |
| epoch | INTEGER | Current table epoch. |
| epoch_floor | INTEGER | Minimum epoch table can be rolled back to. |
| fragment_count | INTEGER | Number of table fragments. |
| shard_id | INTEGER | Table shard ID. This value is only set for sharded tables. |
| data_file_count | INTEGER | Number of data files created for table. |
| metadata_file_count | INTEGER | Number of metadata files created for table. |
| total_data_file_size | BIGINT | Total size (in bytes) of data files. |
| total_data_page_count | BIGINT | Total number of pages across all data files. |
| total_free_data_page_count | BIGINT | Total number of free pages across all data files. |
| total_metadata_file_size | BIGINT | Total size (in bytes) of metadata files. |
| total_metadata_page_count | BIGINT | Total number of pages across all metadata files. |
| total_free_metadata_page_count | BIGINT | Total number of free pages across all metadata files. |
| total_dictionary_data_file_size | BIGINT | Total size (in bytes) of string dictionary files. |
Log-based system tables are considered beta functionality in Release 6.1.0 and are disabled by default.
The request_logs system table provides information about HeavyDB Thrift API requests and contains the following columns:

| Column Name | Column Type | Description |
| --- | --- | --- |
| log_timestamp | TIMESTAMP | Timestamp of log entry. |
| severity | TEXT | Severity level of log entry. Possible values are F (fatal), E (error), W (warning), and I (info). |
| process_id | INTEGER | Process ID of the HeavyDB instance that generated the log entry. |
| query_id | INTEGER | ID associated with a SQL query. A value of 0 indicates that either the log entry is unrelated to a SQL query or no query ID has been set for the log entry. |
| thread_id | INTEGER | ID of thread that generated the log entry. |
| file_location | TEXT | Source file name and line number where the log entry was generated. |
| api_name | TEXT | Name of Thrift API that the request was sent to. |
| request_duration_ms | BIGINT | Thrift API request duration in milliseconds. |
| database_name | TEXT | Request session database name. |
| user_name | TEXT | Request session username. |
| public_session_id | TEXT | Request session ID. |
| query_string | TEXT | Query string for SQL query requests. |
| client | TEXT | Protocol and IP address of client making the request. |
| dashboard_id | INTEGER | Dashboard ID for SQL query requests coming from Immerse dashboards. |
| dashboard_name | TEXT | Dashboard name for SQL query requests coming from Immerse dashboards. |
| chart_id | INTEGER | Chart ID for SQL query requests coming from Immerse dashboards. |
| execution_time_ms | BIGINT | Execution time in milliseconds for SQL query requests. |
| total_time_ms | BIGINT | Total execution time (execution_time_ms + serialization time) in milliseconds for SQL query requests. |
The server_logs system table provides HeavyDB server logs in tabular form and contains the following columns:

| Column Name | Column Type | Description |
| --- | --- | --- |
| node | TEXT | Node containing logs. |
| log_timestamp | TIMESTAMP | Timestamp of log entry. |
| severity | TEXT | Severity level of log entry. Possible values are F (fatal), E (error), W (warning), and I (info). |
| process_id | INTEGER | Process ID of the HeavyDB instance that generated the log entry. |
| query_id | INTEGER | ID associated with a SQL query. A value of 0 indicates that either the log entry is unrelated to a SQL query or no query ID has been set for the log entry. |
| thread_id | INTEGER | ID of thread that generated the log entry. |
| file_location | TEXT | Source file name and line number where the log entry was generated. |
| message | TEXT | Log message. |
The web_server_logs system table provides HEAVY.AI Web Server logs in tabular form and contains the following columns (Enterprise Edition only):

| Column Name | Column Type | Description |
| --- | --- | --- |
| log_timestamp | TIMESTAMP | Timestamp of log entry. |
| severity | TEXT | Severity level of log entry. Possible values are fatal, error, warning, and info. |
| message | TEXT | Log message. |
The web_server_access_logs system table provides information about requests made to the HEAVY.AI Web Server and contains the following columns:

| Column Name | Column Type | Description |
| --- | --- | --- |
| ip_address | TEXT | IP address of client making the web server request. |
| log_timestamp | TIMESTAMP | Timestamp of log entry. |
| http_method | TEXT | HTTP request method. |
| endpoint | TEXT | Web server request endpoint. |
| http_status | SMALLINT | HTTP response status code. |
| response_size | BIGINT | Response payload size in bytes. |
The logs system tables must be refreshed manually to view new log entries. You can run the REFRESH FOREIGN TABLES SQL command (for example, REFRESH FOREIGN TABLES server_logs, request_logs;), or click the Refresh Data Now button on the table's Data Manager page in Heavy Immerse.
The Request Logs and Monitoring system dashboard is built on the log-based system tables and provides visualization of request counts, performance, and errors over time, along with the server logs.
Preconfigured system dashboards are built on various system tables. Specifically, two dashboards named System Resources and User Roles and Permissions are available by default. The Request Logs and Monitoring system dashboard is considered beta functionality and disabled by default. These dashboards can be found in the information_schema database, along with the system tables that they use.
Access to system dashboards is controlled using Heavy Immerse privileges; only users with Admin privileges or users/roles with access to the information_schema database can access the system dashboards.
Cross-linking must be enabled to allow cross-filtering across charts that use different system tables. Enable cross-linking by adding "ui/enable_crosslink_panel": true to the feature_flags section of the servers.json file.
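For example, a minimal servers.json sketch; only the feature_flags entry is taken from this guide, and the surrounding fields are illustrative placeholders for an existing configuration:
[
  {
    "database": "heavyai",
    "feature_flags": {
      "ui/enable_crosslink_panel": true
    }
  }
]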
DDL - Users and Databases
HEAVY.AI has a default superuser named admin with the default password HyperInteractive.
When you create or alter a user, you can grant superuser privileges by setting the is_super property.
You can also specify a default database when you create or alter a user by using the default_db property. During login, if a database is not specified, the server uses the default database assigned to that user. If no default database is assigned to the user and no database is specified during login, the heavyai database is used.
When an administrator, superuser, or owner drops or renames a database, all current active sessions for users logged in to that database are invalidated. The users must log in again.
Similarly, when an administrator or superuser drops or renames a user, all active sessions for that user are immediately invalidated.
If a password includes characters that are nonalphanumeric, it must be enclosed in single quotes when logging in to heavysql. For example:
$HEAVYAI_PATH/bin/heavysql heavyai -u admin -p '77Heavy!9Ai'
For more information about users, roles, and privileges, see DDL - Roles and Privileges.
The following are naming convention requirements for HEAVY.AI objects, described in regex notation:
A NAME is [A-Za-z_][A-Za-z0-9\$_]*
A DASHEDNAME is [A-Za-z_][A-Za-z0-9\$_\-]*
An EMAIL is ([^[:space:]\"]+|\".+\")@[A-Za-z0-9][A-Za-z0-9\-\.]*\.[A-Za-z]+
User objects can use NAME, DASHEDNAME, or EMAIL format.
Role objects must use either NAME or DASHEDNAME format.
Database and column objects must use NAME format.
HEAVY.AI accepts (almost) any string enclosed in optional double quotation marks as the user name.
| Property | Value |
| --- | --- |
| password | User's password. |
| is_super | Set to true if user is a superuser. Default is false. |
| default_db | User's default database on login. |
| can_login | Set to true (default/implicit) to activate a user. When false, the user still retains all defined privileges and configuration settings, but cannot log in to HEAVY.AI. Deactivated users who try to log in receive the error message "Unauthorized Access: User is deactivated." |
Examples:
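Sketches of CREATE USER statements using the properties above; the user names and passwords are illustrative:
CREATE USER jason (password = 'HyperInteractive', is_super = 'true', default_db = 'heavyai');
CREATE USER temp_analyst (password = 'ChangeMe1!', can_login = 'false');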
Example:
HEAVY.AI accepts (almost) any string enclosed in optional double quotation marks as the old or new user name.
| Property | Value |
| --- | --- |
| password | User's password. |
| is_super | Set to true if user is a superuser. Default is false. |
| default_db | User's default database on login. |
| can_login | Set to true (default/implicit) to activate a user. When false, the user still retains all defined privileges and configuration settings, but cannot log in to HEAVY.AI. Deactivated users who try to log in receive the error message "Unauthorized Access: User is deactivated." |
Example:
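A sketch of an ALTER USER statement; the user name and new property values are illustrative:
ALTER USER jason (password = 'NewPassword42', is_super = 'false', default_db = 'heavyai');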
Database names cannot include quotes, spaces, or special characters.
In Release 6.3.0 and later, database names are case insensitive. Duplicate database names will cause a failure when attempting to start HeavyDB 6.3.0 or higher. Check database names and revise as necessary to avoid duplicate names.
| Property | Value |
| --- | --- |
| owner | User name of the database owner. |
Example:
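A sketch of a CREATE DATABASE statement; the database name test_db and owner jason are illustrative:
CREATE DATABASE test_db (owner = 'jason');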
Example:
To alter a database, you must be the owner of the database or a HeavyDB superuser.
Example:
Enables superusers to change the owner of a database.
Change the owner of my_database to user Joe:
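A sketch of the statement, using the database and user from the example:
ALTER DATABASE my_database OWNER TO Joe;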
Only superusers can run the ALTER DATABASE OWNER TO command.
Changes ownership of database objects (tables, views, dashboards, and so on) from a user or set of users to a different user. When the ALL keyword is specified, the ownership change applies to database objects across all databases. Otherwise, the ownership change applies only to database objects in the current database.
Example: Reassign database objects owned by jason and mike in the current database to joe.
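A sketch of the statement, assuming the REASSIGN OWNED form described above:
REASSIGN OWNED BY jason, mike TO joe;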
Example: Reassign database objects owned by jason and mike across all databases to joe.
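The same sketch with the ALL keyword applies the change across all databases:
REASSIGN ALL OWNED BY jason, mike TO joe;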
Database object ownership changes only for objects within the database; ownership of the database itself is not affected. You must be a superuser to run this command.
See Example: Data Security in DDL - Roles and Privileges for a database security example.
DDL - Tables
These functions are used to create and modify data tables in HEAVY.AI.
Table names must use the NAME format, described in regex notation as: [A-Za-z_][A-Za-z0-9\$_]*
Table and column names can include quotes, spaces, and the underscore character. Other special characters are permitted if the name of the table or column is enclosed in double quotes (" ").
Spaces and special characters other than underscore (_) cannot be used in Heavy Immerse.
Column and table names enclosed in double quotes cannot be used in Heavy Immerse.
Create a table named <table> specifying <columns> and table properties.
Create a table named tweets and specify the columns, including type, in the table.
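A sketch of such a statement; the column names and types shown here are illustrative rather than a fixed schema:
CREATE TABLE tweets (
  tweet_id BIGINT NOT NULL,
  tweet_time TIMESTAMP,
  lat FLOAT,
  lon FLOAT,
  sender_name TEXT ENCODING DICT,
  tweet_text TEXT ENCODING NONE
);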
Create a table named delta and assign a default value San Francisco to column city.
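A sketch of the statement; the id column is included only to make the example complete:
CREATE TABLE delta (id INTEGER, city TEXT DEFAULT 'San Francisco');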
Default values currently have the following limitations:
Only literals can be used for column DEFAULT values; expressions are not supported.
You cannot define a DEFAULT value for a shard key. For example, the following does not parse: CREATE TABLE tbl (id INTEGER NOT NULL DEFAULT 0, name TEXT, shard key (id)) with (shard_count = 2);
For arrays, use the following syntax: ARRAY[A, B, C, ..., N]. The syntax {A, B, C, ..., N} is not supported.
Some literals, like NUMERIC and GEO types, are not checked at parse time. As a result, you can define and create a table with a malformed literal as a default value, but when you try to insert a row that uses the default value, an error is thrown.
Sharding partitions a database table across multiple servers so each server has a part of the table with the same columns but with different rows. Partitioning is based on a sharding key defined when you create the table.
Without sharding, the dimension tables involved in a join are replicated and sent to each GPU, which is not feasible for dimension tables with many rows. Specifying a shard key makes it possible for the query to execute efficiently on large dimension tables.
Currently, specifying a shard key is useful only for joins:
If two tables specify a shard key with the same type and the same number of shards, a join on that key only sends a part of the dimension table column data to each GPU.
For multi-node installs, the dimension table does not need to be replicated and the join executes locally on each leaf.
A shard key must specify a single column to shard on. There is no support for sharding by a combination of keys.
One shard key can be specified for a table.
Data are partitioned according to the shard key and the number of shards (shard_count).
A value in the column specified as a shard key is always sent to the same partition.
The number of shards should be equal to the number of GPUs in the cluster.
Sharding is allowed on the following column types:
DATE
INT
TEXT ENCODING DICT
TIME
TIMESTAMP
Tables must share the dictionary for the column to be involved in sharded joins. If the dictionary is not specified as shared, the join does not take advantage of sharding. Dictionaries are reference-counted and only dropped when the last reference drops.
Set shard_count to the number of GPUs you eventually want to distribute the data table across.
Referenced tables must also be shard_count-aligned.
Sharding should be minimized because it can introduce load skew across resources, compared to when sharding is not used.
Examples
Basic sharding:
Sharding with shared dictionary:
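Sketches of both cases, with illustrative table and column names:
-- Basic sharding:
CREATE TABLE customers (
  account_id TEXT,
  name TEXT,
  SHARD KEY (account_id)
) WITH (SHARD_COUNT = 4);

-- Sharding with shared dictionary:
CREATE TABLE transactions (
  account_id TEXT,
  quantity INTEGER,
  SHARD KEY (account_id),
  SHARED DICTIONARY (account_id) REFERENCES customers(account_id)
) WITH (SHARD_COUNT = 4);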
Using the TEMPORARY argument creates a table that persists only while the server is live. Temporary tables are useful for storing intermediate result sets that you access more than once.
Adding or dropping a column from a temporary table is not supported.
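For example, a minimal sketch (column names are illustrative):
CREATE TEMPORARY TABLE scratch_results (id INTEGER, score DOUBLE);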
Create a table with the specified columns, copying any data that meet SELECT statement criteria.
Create the table newTable. Populate the table with all information from the table oldTable, effectively creating a duplicate of the original table.
Create a table named trousers. Populate it with data from the columns name, waist, and inseam from the table wardrobe.
Create a table named cosmos. Populate it with data from the columns star and planet from the table universe, where planet has the class M.
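Sketches of the three statements just described; the class column used in the final filter is an assumption:
CREATE TABLE newTable AS SELECT * FROM oldTable;
CREATE TABLE trousers AS SELECT name, waist, inseam FROM wardrobe;
CREATE TABLE cosmos AS SELECT star, planet FROM universe WHERE class = 'M';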
Rename the table tweets to retweets.
Rename the column source to device in the table retweets.
Add the column pt_dropoff to table tweets with a default value point(0,0).
Add multiple columns a, b, and c to table table_one with a default value of 15 for column b.
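Sketches of the ALTER TABLE statements just described; the column types and the geo default literal are assumptions:
ALTER TABLE tweets RENAME TO retweets;
ALTER TABLE retweets RENAME COLUMN source TO device;
ALTER TABLE tweets ADD COLUMN pt_dropoff POINT DEFAULT 'POINT(0 0)';
ALTER TABLE table_one ADD (a INTEGER, b INTEGER DEFAULT 15, c TEXT);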
Default values currently have the following limitations:
Only literals can be used for column DEFAULT values; expressions are not supported.
For arrays, use the following syntax: ARRAY[A, B, C, ..., N]. The syntax {A, B, C, ..., N} is not supported.
Some literals, like NUMERIC and GEO types, are not checked at parse time. As a result, you can define and create a table with a malformed literal as a default value, but when you try to insert a row with a default value, it throws an error.
Add the column lang to the table tweets using a TEXT ENCODING DICTIONARY.
Add the columns lang and encode to the table tweets using a TEXT ENCODING DICTIONARY for each.
Drop the column pt_dropoff from table tweets.
Limit on-disk data growth by setting the number of allowed epoch rollbacks to 50:
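Sketches of the statements just described; the SET MAX_ROLLBACK_EPOCHS form used for the last example is an assumption:
ALTER TABLE tweets ADD COLUMN lang TEXT ENCODING DICT;
ALTER TABLE tweets ADD (lang TEXT ENCODING DICT, encode TEXT ENCODING DICT);
ALTER TABLE tweets DROP COLUMN pt_dropoff;
ALTER TABLE tweets SET MAX_ROLLBACK_EPOCHS = 50;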
You cannot add a dictionary-encoded string column with a shared dictionary when using ALTER TABLE ADD COLUMN.
Currently, HEAVY.AI does not support adding a geo column type (POINT, LINESTRING, POLYGON, or MULTIPOLYGON) to a table.
HEAVY.AI supports ALTER TABLE RENAME TABLE and ALTER TABLE RENAME COLUMN for temporary tables. HEAVY.AI does not support ALTER TABLE ADD COLUMN to modify a temporary table.
Change a text column “id” to an integer column:
Change text columns “id” and “location” to big integer and point columns respectively:
Currently, only text column types (dictionary encoded and none encoded text columns) can be altered.
Archives data and dictionary files of the table <table> to the file <filepath>.
Valid values for <compression_program> include:
gzip (default)
pigz
lz4
none
If you do not choose a compression option, the system uses gzip if it is available. If gzip is not installed, the file is not compressed.
The file path must be enclosed in single quotes.
Dumping a table locks writes to that table. Concurrent reads are supported, but you cannot import to a table that is being dumped.
The DUMP command is not supported on distributed configurations.
You must have at least GRANT CREATE ON DATABASE privilege level to use the DUMP command.
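A sketch that archives the tweets table with gzip compression; the file path is illustrative:
DUMP TABLE tweets TO '/opt/archive/tweetsBackup.gz' WITH (COMPRESSION = 'gzip');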
Rename a table or multiple tables at once.
Rename a single table:
Swap table names:
Swap table names multiple times:
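Sketches of the rename forms just listed, with illustrative table names:
RENAME TABLE table_a TO table_b;
RENAME TABLE table_a TO table_b, table_b TO table_a;
RENAME TABLE table_a TO table_b, table_b TO table_c, table_c TO table_a;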
Restores data and dictionary files of table <table> from the file at <filepath>. If you specified a compression program when you used the DUMP TABLE command, you must specify the same compression method during RESTORE.
Restoring a table decompresses and then reimports the table. You must have enough disk space for both the new table and the archived table, as well as enough scratch space to decompress the archive and reimport it.
The file path must be enclosed in single quotes.
You can also restore a table from archives stored in S3-compatible endpoints:
Restoring a table locks writes to that table. Concurrent reads are supported, but you cannot import to a table that is being restored.
The RESTORE command is not supported on distributed configurations.
You must have at least GRANT CREATE ON DATABASE privilege level to use the RESTORE command.
Do not attempt to use RESTORE TABLE with a table dump created using a release of HEAVY.AI that is higher than the release running on the server where you will restore the table.
Restore table tweets from /opt/archive/tweetsBackup.gz:
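A sketch of the statement, assuming the archive was created with the default gzip compression:
RESTORE TABLE tweets FROM '/opt/archive/tweetsBackup.gz' WITH (COMPRESSION = 'gzip');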
Restore table tweets from a public S3 file or using server privileges (with the allow-s3-server-privileges server flag enabled):
Restore table tweets from a private S3 file using AWS access keys:
Restore table tweets from a private S3 file using temporary AWS access keys/session token:
Restore table tweets from an S3-compatible endpoint:
Use the TRUNCATE TABLE statement to remove all rows from a table without deleting the table structure.
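For example, removing all rows from the tweets table used in earlier examples:
TRUNCATE TABLE tweets;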
When you DROP or TRUNCATE, the command returns almost immediately. The directories to be purged are marked with the suffix _DELETE_ME_. The files are automatically removed asynchronously.
In practical terms, this means that you will not see a reduction in disk usage until the automatic task runs, which might not start for up to five minutes.
You might also see directory names appended with _DELETE_ME_. You can ignore these, with the expectation that they will be deleted automatically over time.
Use this statement to remove rows from storage that have been marked as deleted via DELETE statements.
When run without the vacuum option, the column-level metadata is recomputed for each column in the specified table. HeavyDB makes heavy use of metadata to optimize query plans, so optimizing table metadata can increase query performance after metadata-widening operations such as updates or deletes. If the configuration parameter enable-auto-metadata-update is not set, HeavyDB does not narrow metadata during an update or delete; metadata is only widened to cover a new range.
When run with the vacuum option, it removes any rows marked "deleted" from the data stored on disk. Vacuum is a checkpointing operation, so new copies of any vacuum records are deleted. Using OPTIMIZE with the VACUUM option compacts pages and deletes unused data files that have not been repopulated.
Beginning with Release 5.6.0, OPTIMIZE should be used infrequently, because UPDATE, DELETE, and IMPORT queries manage space more effectively.
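A sketch of both forms, using the tweets table from earlier examples:
OPTIMIZE TABLE tweets;
OPTIMIZE TABLE tweets WITH (VACUUM = 'true');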
Performs checks for negative and inconsistent epochs across table shards for single-node configurations.
If VALIDATE detects epoch-related issues, it returns a report similar to the following:
If no issues are detected, it reports as follows:
Perform checks and report discovered issues on a running HEAVY.AI cluster. Compare metadata between the aggregator and leaves to verify that the logical components between the processes are identical.
VALIDATE CLUSTER also detects and reports issues related to table epochs. It reports when epochs are negative or when table epochs across leaf nodes or shards are inconsistent.
If VALIDATE CLUSTER detects issues, it returns a report similar to the following:
If no issues are detected, it reports as follows:
You can include the WITH(REPAIR_TYPE) argument. (REPAIR_TYPE='NONE') is the same as running the command with no argument. (REPAIR_TYPE='REMOVE') removes any leaf objects that have issues. For example:
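-- Remove leaf objects that have issues:
VALIDATE CLUSTER WITH (REPAIR_TYPE = 'REMOVE');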
This example output from the VALIDATE CLUSTER command on a distributed setup shows epoch-related issues:
You can use policies to provide row-level security (RLS) in HEAVY.AI.
Create an RLS policy for a user or role (<name>); admin rights are required. All queries on the table for the user or role are automatically filtered to include only rows where the column contains any one of the values from the VALUES clause.
RLS filtering works similarly to how a WHERE column = value clause appended to every query or subquery on the table would work. If policies on multiple columns in the same table are defined for a user or role, then a row is visible to that user or role if any one or more of the policies matches that row.
Drop an RLS policy for a user or role (<name>); admin rights are required. All values specified for the column by the policy are dropped. Effective values from another policy on an inherited role are not dropped.
Displays a list of all RLS policies that exist for a user or role. If EFFECTIVE is used, the list also includes any policies that exist for all roles that apply to the requested user or role.
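Sketches of the three policy commands described above, assuming this general syntax; the table, column, values, and role names are illustrative:
CREATE POLICY ON COLUMN orders.region TO sales_role VALUES ('west', 'central');
DROP POLICY ON COLUMN orders.region FROM sales_role;
SHOW EFFECTIVE POLICIES sales_role;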
* In HEAVY.AI release 4.4.0 and higher, you can use existing 8-byte DATE columns, but you can create only 4-byte DATE columns (default) and 2-byte DATE columns (see DATE ENCODING DAYS(16)).
For more information, see .
For geospatial datatypes, see .
Deletes the table structure, all data from the table, and any dictionary content unless it is a shared dictionary. (See the note regarding shared dictionaries.)
s3_region is required. All features discussed in , such as custom S3 endpoints and server privileges, are supported.
This releases table on-disk and memory storage and removes dictionary content unless it is a shared dictionary. (See the note regarding shared dictionaries.)
Removing rows is more efficient than using DROP TABLE. Dropping followed by recreating the table invalidates dependent objects of the table, requiring you to regrant object privileges. Truncating has none of these effects.
| Encoding | Description |
| --- | --- |
| DICT | Dictionary encoding on string columns (default for TEXT columns). Limit of 2 billion unique string values. |
| FIXED (bits) | Fixed-length encoding of integer or timestamp columns. See Datatypes and Fixed Encoding. |
| NONE | No encoding. Valid only on TEXT columns. No dictionary is created. Aggregate operations are not possible on this column type. |
| Property | Description |
| --- | --- |
| fragment_size | Number of rows per fragment that is a unit of the table for query processing. Default: 32 million rows, which is not expected to be changed. |
| max_rollback_epochs | Limit the number of epochs a table can be rolled back to. Limiting the number of epochs helps to limit the amount of on-disk data and prevent unmanaged data growth. Limiting the number of rollback epochs can also increase system startup speed, especially for systems on which data is added in small batches or singleton inserts. Default: 3. The following example creates the table test_table and sets the maximum epoch rollback number to 50: CREATE TABLE test_table(a int) WITH (MAX_ROLLBACK_EPOCHS = 50); |
| max_rows | Used primarily for streaming datasets to limit the number of rows in a table, to avoid running out of memory or impeding performance. When the max_rows limit is reached, the oldest fragment is removed. When populating a table from a file, make sure that your row count is below the max_rows setting. If you attempt to load more rows at one time than the max_rows setting defines, the records up to the max_rows limit are removed, leaving only the additional rows. Default: 2^62. In a distributed system, the maximum number of rows is calculated as max_rows * leaf_count. In a sharded distributed system, the maximum number of rows is calculated as max_rows * shard_count. |
| page_size | Number of I/O page bytes. Default: 1MB, which does not need to be changed. |
| partitions | Partition strategy option: SHARDED (partition table using sharding) or REPLICATED (partition table using replication). |
| shard_count | Number of shards to create, typically equal to the number of GPUs across which the data table is distributed. |
| sort_column | Name of the column on which to sort during bulk import. |
| Property | Description |
| --- | --- |
| fragment_size | Number of rows per fragment that is a unit of the table for query processing. Default: 32 million rows, which is not expected to be changed. |
| max_chunk_size | Size of chunk that is a unit of the table for query processing. Default: 1073741824 bytes (1 GB), which is not expected to be changed. |
| max_rows | Used primarily for streaming datasets to limit the number of rows in a table. When the max_rows limit is reached, the oldest fragment is removed. When populating a table from a file, make sure that your row count is below the max_rows setting. If you attempt to load more rows at one time than the max_rows setting defines, the records up to the max_rows limit are removed, leaving only the additional rows. Default: 2^62. |
| page_size | Number of I/O page bytes. Default: 1MB, which does not need to be changed. |
| partitions | Partition strategy option: SHARDED (partition table using sharding) or REPLICATED (partition table using replication). |
| use_shared_dictionaries | Controls whether the created table creates its own dictionaries for text columns or instead shares the dictionaries of its source table. Uses shared dictionaries by default (true), which increases the speed of table creation. Setting to false shrinks the dictionaries if the SELECT for the created table has a narrow filter; for example: CREATE TABLE new_table AS SELECT * FROM old_table WITH (USE_SHARED_DICTIONARIES='false'); |
| vacuum | Formats the table to more efficiently handle DELETE requests. The only parameter available is delayed. Rather than immediately remove deleted rows, vacuum marks items to be deleted, and they are removed at an optimal time. |
| Datatype | Size (bytes) | Notes |
| --- | --- | --- |
| BIGINT | 8 | Minimum value: -9,223,372,036,854,775,807; maximum value: 9,223,372,036,854,775,807. |
| BOOLEAN | 1 | TRUE: 'true', '1', 't'. FALSE: 'false', '0', 'f'. Text values are not case-sensitive. |
| DATE * | 4 | Same as DATE ENCODING DAYS(32). |
| DATE ENCODING DAYS(32) | 4 | Range in years: +/-5,883,517 around epoch. Maximum date January 1, 5885487 (approximately). Minimum value: -2,147,483,648; maximum value: 2,147,483,647. Supported formats when using COPY FROM: mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, dd/mmm/yyyy. |
| DATE ENCODING DAYS(16) | 2 | Range in days: -32,768 to 32,767. Range in years: +/-90 around epoch, April 14, 1880 - September 9, 2059. Minimum value: -2,831,155,200; maximum value: 2,831,068,800. Supported formats when using COPY FROM: mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, dd/mmm/yyyy. |
| DATE ENCODING FIXED(32) | 4 | In DDL statements defaults to DATE ENCODING DAYS(16). Deprecated. |
| DATE ENCODING FIXED(16) | 2 | In DDL statements defaults to DATE ENCODING DAYS(16). Deprecated. |
| DECIMAL | 2, 4, or 8 | Takes precision and scale parameters: DECIMAL(precision,scale). Size depends on precision: up to 4: 2 bytes; 5 to 9: 4 bytes; 10 to 18 (maximum): 8 bytes. Scale must be less than precision. |
| DOUBLE | 8 | Variable precision. Minimum value: -1.79 x e^308; maximum value: 1.79 x e^308. |
| FLOAT | 4 | Variable precision. Minimum value: -3.4 x e^38; maximum value: 3.4 x e^38. |
| INTEGER | 4 | Minimum value: -2,147,483,647; maximum value: 2,147,483,647. |
| SMALLINT | 2 | Minimum value: -32,767; maximum value: 32,767. |
| TEXT ENCODING DICT | 4 | Max cardinality 2 billion distinct string values. |
| TEXT ENCODING NONE | Variable | Size of the string + 6 bytes. |
| TIME | 8 | Minimum value: 00:00:00; maximum value: 23:59:59. |
| TIMESTAMP | 8 | Linux timestamp from -30610224000 (1/1/1000 00:00:00.000) through 29379542399 (12/31/2900 23:59:59.999). Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS or YYYY-MM-DDTHH:MM:SS (the T is dropped when the field is populated). |
| TINYINT | 1 | Minimum value: -127; maximum value: 127. |
DDL - Views
A view is a virtual table based on the result set of a SQL statement. It derives its fields from a SELECT statement. You can do anything with a HEAVY.AI view query that you can do in a non-view HEAVY.AI query.
View object names must use the NAME format, described in regex notation as: [A-Za-z_][A-Za-z0-9\$_]*
Creates a view based on a SQL statement.
You can describe the view as you would a table.
You can query the view as you would a table.
Removes a view created by the CREATE VIEW statement. The view definition is removed from the database schema, but no actual data in the underlying base tables is modified.
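For example, sketches of creating, querying, and dropping a view; the table and filter are illustrative:
CREATE VIEW sf_tweets AS SELECT * FROM tweets WHERE city = 'San Francisco';
SELECT COUNT(*) FROM sf_tweets;
DROP VIEW sf_tweets;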
Datatypes and Fixed Encoding
This topic describes standard datatypes and space-saving variations for values stored in HEAVY.AI.
Each HEAVY.AI datatype uses space in memory and on disk. For certain datatypes, you can use fixed encoding for a more compact representation of these values. You can set a default value for a column by using the DEFAULT constraint; for more information, see CREATE TABLE.
Datatypes, variations, and sizes are described in the following table.
| Datatype | Size (bytes) | Notes |
| --- | --- | --- |
| BIGINT | 8 | Minimum value: -9,223,372,036,854,775,807; maximum value: 9,223,372,036,854,775,807. |
| BIGINT ENCODING FIXED(8) | 1 | Minimum value: -127; maximum value: 127. |
| BIGINT ENCODING FIXED(16) | 2 | Same as SMALLINT. |
| BIGINT ENCODING FIXED(32) | 4 | Same as INTEGER. |
| BOOLEAN | 1 | TRUE: 'true', '1', 't'. FALSE: 'false', '0', 'f'. Text values are not case-sensitive. |
| DATE [1] | 4 | Same as DATE ENCODING DAYS(32). |
| DATE ENCODING DAYS(16) | 2 | Range in days: -32,768 to 32,767. Range in years: +/-90 around epoch, April 14, 1880 - September 9, 2059. Minimum value: -2,831,155,200; maximum value: 2,831,068,800. Supported formats when using COPY FROM: mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, dd/mmm/yyyy. |
| DATE ENCODING DAYS(32) | 4 | Range in years: +/-5,883,517 around epoch. Maximum date January 1, 5885487 (approximately). Minimum value: -2,147,483,648; maximum value: 2,147,483,647. Supported formats when using COPY FROM: mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, dd/mmm/yyyy. |
| DATE ENCODING FIXED(16) | 2 | In DDL statements defaults to DATE ENCODING DAYS(16). Deprecated. |
| DATE ENCODING FIXED(32) | 4 | In DDL statements defaults to DATE ENCODING DAYS(16). Deprecated. |
| DECIMAL | 2, 4, or 8 | Takes precision and scale parameters: DECIMAL(precision,scale). Size depends on precision: up to 4: 2 bytes; 5 to 9: 4 bytes; 10 to 18 (maximum): 8 bytes. Scale must be less than precision. |
| DOUBLE | 8 | Variable precision. Minimum value: -1.79e308; maximum value: 1.79e308. |
| EPOCH | 8 | Seconds ranging from -30610224000 (1/1/1000 00:00:00) through 185542587100800 (1/1/5885487 23:59:59). |
| FLOAT | 4 | Variable precision. Minimum value: -3.4e38; maximum value: 3.4e38. |
| INTEGER | 4 | Minimum value: -2,147,483,647; maximum value: 2,147,483,647. |
| INTEGER ENCODING FIXED(8) | 1 | Minimum value: -127; maximum value: 127. |
| INTEGER ENCODING FIXED(16) | 2 | Same as SMALLINT. |
| LINESTRING | Variable [2] | Geospatial datatype. A sequence of 2 or more points and the lines that connect them. For example: LINESTRING(0 0,1 1,1 2) |
| MULTILINESTRING | Variable [2] | Geospatial datatype. A set of associated lines. For example: MULTILINESTRING((0 0, 1 0, 2 0), (0 1, 1 1, 2 1)) |
| MULTIPOINT | Variable [2] | Geospatial datatype. A set of points. For example: MULTIPOINT((0 0), (1 0), (2 0)) |
| MULTIPOLYGON | Variable [2] | Geospatial datatype. A set of one or more polygons. For example: MULTIPOLYGON(((0 0,4 0,4 4,0 4,0 0),(1 1,2 1,2 2,1 2,1 1)), ((-1 -1,-1 -2,-2 -2,-2 -1,-1 -1))) |
| POINT | Variable [2] | Geospatial datatype. A point described by two coordinates. When the coordinates are longitude and latitude, HEAVY.AI stores longitude first, and then latitude. For example: POINT(0 0) |
| POLYGON | Variable [2] | Geospatial datatype. A set of one or more rings (closed line strings), with the first representing the shape (external ring) and the rest representing holes in that shape (internal rings). For example: POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,1 1)) |
| SMALLINT | 2 | Minimum value: -32,767; maximum value: 32,767. |
| SMALLINT ENCODING FIXED(8) | 1 | Minimum value: -127; maximum value: 127. |
| TEXT ENCODING DICT | 4 | Max cardinality 2 billion distinct string values. Maximum string length is 32,767. |
| TEXT ENCODING DICT(8) | 1 | Max cardinality 255 distinct string values. |
| TEXT ENCODING DICT(16) | 2 | Max cardinality 64 K distinct string values. |
| TEXT ENCODING NONE | Variable | Size of the string + 6 bytes. Maximum string length is 32,767. |
| TIME | 8 | Minimum value: 00:00:00; maximum value: 23:59:59. |
| TIME ENCODING FIXED(32) | 4 | Minimum value: 00:00:00; maximum value: 23:59:59. |
| TIMESTAMP(0) | 8 | Linux timestamp from -30610224000 (1/1/1000 00:00:00) through 29379542399 (12/31/2900 23:59:59). Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS or YYYY-MM-DDTHH:MM:SS (the T is dropped when the field is populated). |
| TIMESTAMP(3) (milliseconds) | 8 | Linux timestamp from -30610224000000 (1/1/1000 00:00:00.000) through 29379542399999 (12/31/2900 23:59:59.999). Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS.fff or YYYY-MM-DDTHH:MM:SS.fff (the T is dropped when the field is populated). |
| TIMESTAMP(6) (microseconds) | 8 | Linux timestamp from -30610224000000000 (1/1/1000 00:00:00.000000) through 29379542399999999 (12/31/2900 23:59:59.999999). Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS.ffffff or YYYY-MM-DDTHH:MM:SS.ffffff (the T is dropped when the field is populated). |
| TIMESTAMP(9) (nanoseconds) | 8 | Linux timestamp from -9223372036854775807 (09/21/1677 00:12:43.145224193) through 9223372036854775807 (11/04/2262 23:47:16.854775807). Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS.fffffffff or YYYY-MM-DDTHH:MM:SS.fffffffff (the T is dropped when the field is populated). |
| TIMESTAMP ENCODING FIXED(32) | 4 | Range: 1901-12-13 20:45:53 - 2038-01-19 03:14:07. Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS or YYYY-MM-DDTHH:MM:SS (the T is dropped when the field is populated). |
| TINYINT | 1 | Minimum value: -127; maximum value: 127. |
[1] - In HEAVY.AI release 4.4.0 and higher, you can use existing 8-byte DATE columns, but you can create only 4-byte DATE columns (default) and 2-byte DATE columns (see DATE ENCODING DAYS(16)).
[2] - See Storage and Compression below for information about geospatial datatype sizes.
HEAVY.AI does not support geometry arrays.
Timestamp values are always stored in 8 bytes, so the greater the sub-second precision, the smaller the range of dates that can be represented.
HEAVY.AI supports the LINESTRING, MULTILINESTRING, POLYGON, MULTIPOLYGON, POINT, and MULTIPOINT geospatial datatypes.
In the following example (a sketch of such a table appears after this list):
p0, p1, ls0, and poly0 are simple (planar) geometries.
p4 is a point geometry with Web Mercator longitude/latitude coordinates.
p2, p3, mp, ls1, ls2, mls1, mls2, poly1, and mpoly0 are geometries using WGS84 SRID=4326 longitude/latitude coordinates.
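A hedged reconstruction of the kind of table these columns describe; the table name and exact column order are assumptions, and the ENCODING NONE clauses mark the columns described below as uncompressed:
CREATE TABLE geo_example (
  p0 POINT,
  p1 POINT,
  ls0 LINESTRING,
  poly0 POLYGON,
  p4 GEOMETRY(POINT, 900913),
  p2 GEOMETRY(POINT, 4326),
  p3 GEOMETRY(POINT, 4326) ENCODING NONE,
  mp GEOMETRY(MULTIPOINT, 4326),
  ls1 GEOMETRY(LINESTRING, 4326),
  ls2 GEOMETRY(LINESTRING, 4326) ENCODING NONE,
  mls1 GEOMETRY(MULTILINESTRING, 4326),
  mls2 GEOMETRY(MULTILINESTRING, 4326) ENCODING NONE,
  poly1 GEOMETRY(POLYGON, 4326),
  mpoly0 GEOMETRY(MULTIPOLYGON, 4326)
);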
Geometry storage requirements are largely dependent on coordinate data. Coordinates are normally stored as 8-byte doubles, two coordinates per point, for all points that form a geometry. Each POINT geometry in the p1 column, for example, requires 16 bytes.
WGS84 (SRID 4326) coordinates are compressed to 32 bits by default. This sacrifices some precision but reduces storage requirements by half.
For example, columns p2, mp, ls1, mls1, poly1, and mpoly0 in the table defined above are compressed. Each geometry in the p2 column requires 8 bytes, compared to 16 bytes for p0.
You can explicitly disable compression. WGS84 columns p3, ls2, and mls2 are not compressed and continue using doubles. Simple (planar) columns p0, p1, ls0, and poly0 and non-4326 column p4 are not compressed.
For more information about geospatial datatypes and functions, see Geospatial Capabilities.
Define datatype arrays by appending square brackets, as shown in the arrayexamples DDL sample.
You can also define fixed-length arrays. For example:
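A sketch of array column definitions; the table and column names are illustrative:
CREATE TABLE arrayexamples (
  tags TEXT[],
  measurements DOUBLE[],
  coords INTEGER[3]  -- fixed-length array of 3 integers
);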
Fixed-length arrays require less storage space than variable-length arrays.
To use fixed-length fields, the range of the data must fit into the constraints as described. Understanding your schema and the scope of potential values in each field helps you to apply fixed encoding types and save significant storage space.
These encodings are most effective on low-cardinality TEXT fields, where you can achieve large savings of storage space and improved processing speed, and on TIMESTAMP fields where the timestamps range between 1901-12-13 20:45:53 and 2038-01-19 03:14:07. If a TEXT ENCODING field does not match the defined cardinality, HEAVY.AI substitutes a NULL value and logs the change.
For DATE types, you can use the terms FIXED and DAYS interchangeably. Both are synonymous for the DATE type in HEAVY.AI.
Some of the INTEGER options overlap. For example, INTEGER ENCODING FIXED(8) and SMALLINT ENCODING FIXED(8) are essentially identical.
You can improve performance of string operations and optimize storage using shared dictionaries. You can share dictionaries within a table or between different tables in the same database. The table with which you want to share dictionaries must exist when you create the table that references the TEXT ENCODING DICT field, and the column that you are referencing in that table must also exist. The following small DDL shows the basic structure:
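A sketch of that basic structure; the table and column names are illustrative:
CREATE TABLE t1 (a TEXT, b TEXT, SHARED DICTIONARY (b) REFERENCES t1(a));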
In the table definition, make sure that referenced columns appear before the referencing columns.
For example, this DDL is a portion of the schema for the flights database. Because airports are both origin and destination locations, it makes sense to reuse the same dictionaries for name, city, state, and country values.
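A sketch of such a fragment, showing only a few of the airport columns; the column names are illustrative:
CREATE TABLE flights (
  origin_city TEXT,
  dest_city TEXT,
  origin_state TEXT,
  dest_state TEXT,
  SHARED DICTIONARY (dest_city) REFERENCES flights(origin_city),
  SHARED DICTIONARY (dest_state) REFERENCES flights(origin_state)
);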
To share a dictionary in a different existing table, replace the table name in the REFERENCES instruction. For example, if you have an existing table called us_geography, you can share the dictionary by following the pattern in the DDL fragment below.
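A sketch of that pattern; the us_geography column name city is an assumption:
CREATE TABLE flights_2 (
  origin_city TEXT,
  dest_city TEXT,
  SHARED DICTIONARY (origin_city) REFERENCES us_geography(city),
  SHARED DICTIONARY (dest_city) REFERENCES us_geography(city)
);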
The referencing column cannot specify the encoding of the dictionary, because it uses the encoding from the referenced column.
Note: Importing TEXT ENCODING NONE fields using the Immerse Data Importer has limitations for Immerse. When you use string instead of string [dict. encode] for a column when importing, you cannot use that column in Immerse dashboards.